Amazon Redshift is a fast, scalable, secure, and fully managed data warehouse that enables you to easily and cost-effectively analyze all of your data using standard SQL. Amazon Redshift Data Sharing allows customers to securely share live, transactionally consistent data in one Amazon Redshift cluster with another Amazon Redshift cluster across accounts and Regions without needing to copy or move data from one cluster to another.
Amazon Redshift Data Sharing initially launched in March 2021, support for cross-account data sharing was added in August 2021, and cross-Region support became generally available in February 2022. This provides full flexibility and agility to share data across Redshift clusters in the same AWS account, different accounts, or different Regions.
Amazon Redshift Data Sharing lets you fundamentally redefine Amazon Redshift deployment architectures into a hub-and-spoke, data mesh model to better meet performance SLAs, provide workload isolation, perform cross-group analytics, easily onboard new use cases, and, most importantly, do all of this without the complexity of data movement and data copies. Some of the most common questions asked during data sharing deployment are, “How big should my consumer clusters and producer clusters be?” and “How do I get the best price performance for workload isolation?” Because workload characteristics like data size, ingestion rate, query pattern, and maintenance activities can impact data sharing performance, you should implement a continuous strategy of sizing both consumer and producer clusters to maximize performance and minimize cost. In this post, we provide a step-by-step approach to help you determine your producer and consumer cluster sizes for the best price performance based on your specific workload.
Generic consumer sizing guidance
The following steps show the generic strategy to size your producer and consumer clusters. You can use it as a starting point and modify it to fit your specific use case.
Size your producer cluster
You should always make sure that you properly size your producer cluster to get the performance that you need to meet your SLA. You can leverage the sizing calculator from the Amazon Redshift console to get a recommendation for the producer cluster based on the size of your data and query characteristics. Look for Help me choose on the console in AWS Regions that support RA3 node types to use this sizing calculator. Note that this is just an initial recommendation to get started; you should test running your full workload on the initially sized cluster and elastic resize the cluster up and down accordingly to get the best price performance.
Size and setup initial consumer cluster
You should always size your consumer cluster based on your compute needs. One way to get started is to follow the same generic sizing approach described for the producer cluster above.
Set up Amazon Redshift data sharing
Once you have both the producer and consumer clusters set up, set up data sharing from the producer to the consumer. Refer to this post for guidance on how to set up data sharing.
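As an illustration, the following minimal sketch runs the standard data sharing DDL through the Amazon Redshift Data API with boto3. The cluster identifiers, database and user names, datashare name, and namespace GUIDs are placeholders that you would replace with your own values.

```python
import boto3

# The Redshift Data API runs SQL without managing JDBC/ODBC connections.
client = boto3.client("redshift-data")

def run_sql(cluster_id, database, db_user, sql):
    """Submit a SQL statement to a cluster via the Redshift Data API."""
    return client.execute_statement(
        ClusterIdentifier=cluster_id,
        Database=database,
        DbUser=db_user,
        Sql=sql,
    )

# On the producer: create a datashare, add objects to it, and grant
# usage to the consumer namespace (placeholder GUID).
producer_statements = [
    "CREATE DATASHARE sales_share;",
    "ALTER DATASHARE sales_share ADD SCHEMA public;",
    "ALTER DATASHARE sales_share ADD ALL TABLES IN SCHEMA public;",
    "GRANT USAGE ON DATASHARE sales_share TO NAMESPACE '<consumer-namespace-guid>';",
]
for sql in producer_statements:
    run_sql("producer-cluster", "dev", "awsuser", sql)

# On the consumer: create a local database that points at the datashare
# (placeholder producer namespace GUID).
run_sql(
    "consumer-cluster", "dev", "awsuser",
    "CREATE DATABASE sales_db FROM DATASHARE sales_share "
    "OF NAMESPACE '<producer-namespace-guid>';",
)
```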
Test consumer-only workload on initial consumer cluster
Test the consumer-only workload on the new initial consumer cluster. You can do this by pointing consumer applications, for example ETL tools, BI applications, and SQL clients, to the new consumer cluster and rerunning the workload to evaluate the performance against your requirements.
Test consumer-only workload on different consumer cluster configurations
If the initial size consumer cluster meets or exceeds your workload performance requirements, then you can either continue to use this cluster configuration or you can test on smaller configurations to see if you can further reduce the cost and still get the performance that you need.
On the other hand, if the initial size consumer cluster fails to meet your workload performance requirements, then you can further test larger configurations to get the configuration that meets your SLA.
As a rule of thumb, size up the consumer cluster incrementally, doubling (2x) the previous configuration each time, until it meets your workload requirements.
Once you plan out what configuration you want to test, use elastic resize to resize the initial cluster to the target cluster configuration. After elastic resize is completed, perform the same workload test and evaluate the performance against your SLA. Select the configuration that meets your price performance target.
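For example, here is a minimal boto3 sketch of an elastic resize to a test configuration; the cluster identifier and node count are placeholders, and Classic=False requests an elastic rather than classic resize.

```python
import boto3

redshift = boto3.client("redshift")

# Elastic resize the consumer cluster to the next test configuration
# (placeholder identifier and node count).
redshift.resize_cluster(
    ClusterIdentifier="consumer-cluster",
    NumberOfNodes=4,   # target size for this test iteration
    Classic=False,     # False = elastic resize
)

# Wait until the cluster is available again before rerunning the workload test.
redshift.get_waiter("cluster_available").wait(
    ClusterIdentifier="consumer-cluster"
)
```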
Test producer-only workload on different producer cluster configurations
Once you move your consumer workload to the consumer cluster with the optimum price performance, there might be an opportunity to reduce the compute resource on the producer to save on costs.
To achieve this, rerun the producer-only workload on 1/2x of the original producer size and evaluate the workload performance. Resize the cluster up or down based on the result, and then select the minimum producer configuration that meets your workload performance requirements.
Re-evaluate after a full workload run over time
Amazon Redshift continues to evolve, with regular performance and scalability improvement releases, so data sharing performance will keep improving. Furthermore, numerous variables might impact the performance of data sharing queries. The following are just some examples:
- Ingestion rate and amount of data change
- Query pattern and characteristics
- Workload changes
- Concurrency
- Maintenance activities, for example vacuum, analyze, and automatic table optimization (ATO)
This is why you should re-evaluate the producer and consumer cluster sizing using the strategy above from time to time, especially after a full workload deployment, to confirm that your cluster configuration still delivers the best price performance.
Automated sizing solutions
If your environment involves a more complex architecture, for example with multiple tools or applications (BI, ingestion or streaming, ETL, data science), then it might not be feasible to use the manual method from the generic guidance above. Instead, you can leverage the solutions in this section to automatically replay the workload from your production cluster on test consumer and producer clusters and evaluate the performance.
In this section, we leverage the Simple Replay utility as the automated solution to guide you through the process of finding the right producer and consumer cluster sizes for the best price performance.
Simple Replay is a tool for conducting a what-if analysis and evaluating how your workload performs in different scenarios. For example, you can use the tool to benchmark your actual workload on a new instance type like RA3, evaluate a new feature, or assess different cluster configurations. It also includes enhanced support for replaying data ingestion and export pipelines with COPY and UNLOAD statements. To get started and replay your workloads, download the tool from the Amazon Redshift GitHub repository.
Here we walk through the steps to extract your workload logs from the source production cluster and replay them in an isolated environment. This lets you perform a direct comparison between Amazon Redshift clusters seamlessly and select the cluster configuration that best meets your price performance target.
The following diagram shows the solution architecture.
Solution walkthrough
Follow these steps to go through the solution to size your consumer and producer clusters.
Size your production cluster
You should always make sure to properly size your existing production cluster to get the performance that you need to meet your workload requirements. You can leverage the sizing calculator from the Amazon Redshift console to get a recommendation for the production cluster based on the size of your data and query characteristics. Look for Help me choose on the console in AWS Regions that support RA3 node types to use this sizing calculator. Note that this is just an initial recommendation to get started. You should test running your full workload on the initially sized cluster and elastic resize the cluster up and down accordingly to get the best price performance.
Identify the workload to be isolated
You might have different workloads running on your original cluster, but the first step is to identify the workload that is most critical to the business and that you want to isolate. This is because we want to make sure that the new architecture can meet your workload requirements. This post is a good reference on a data sharing workload isolation use case that can help you decide which workload to isolate.
Set up Simple Replay
Once you know your critical workload, enable audit logging in the production cluster where that workload runs to capture query activities and store them in Amazon Simple Storage Service (Amazon S3). Note that it may take up to 3 hours for the audit logs to be delivered to Amazon S3. Once the audit logs are available, set up Simple Replay and then extract the critical workload from the audit logs. Note that start_time and end_time can be used as parameters to filter out the critical workload if it runs in a certain time period, for example 9 AM to 11 AM. Otherwise, the tool extracts all of the logged activities.
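As a sketch, you can turn on audit logging with boto3 as follows; the cluster identifier, bucket, prefix, and parameter group name are placeholders, and the S3 bucket must have a policy that allows Amazon Redshift to write to it. Simple Replay also relies on user activity logging, which is enabled through the enable_user_activity_logging parameter in the cluster's parameter group.

```python
import boto3

redshift = boto3.client("redshift")

# Deliver connection and user logs to S3 (placeholder bucket and prefix;
# the bucket policy must allow Amazon Redshift to write to it).
redshift.enable_logging(
    ClusterIdentifier="production-cluster",
    BucketName="my-audit-log-bucket",
    S3KeyPrefix="redshift-audit-logs/",
)

# User activity logging is controlled by the cluster parameter group;
# changing it may require a cluster reboot to take effect.
redshift.modify_cluster_parameter_group(
    ParameterGroupName="my-custom-parameter-group",
    Parameters=[{
        "ParameterName": "enable_user_activity_logging",
        "ParameterValue": "true",
    }],
)
```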
Baseline workload
Create a baseline cluster with the same configuration as the producer cluster by restoring from the production snapshot. The purpose of starting with the same configuration is to baseline the performance in an isolated environment.
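A minimal boto3 sketch of that restore, with placeholder cluster and snapshot identifiers:

```python
import boto3

redshift = boto3.client("redshift")

# Restore the baseline cluster from the production snapshot so it starts
# with the same data and configuration (placeholder identifiers).
redshift.restore_from_cluster_snapshot(
    ClusterIdentifier="baseline-cluster",
    SnapshotIdentifier="production-cluster-snapshot",
)

# Block until the restore has completed before replaying the workload.
redshift.get_waiter("cluster_restored").wait(
    ClusterIdentifier="baseline-cluster"
)
```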
Once the baseline cluster is available, replay the extracted workload in the baseline cluster. The output from this replay will be the baseline used to compare against subsequent replays on different consumer configurations.
Set up initial producer and consumer test clusters
Create a producer cluster with the same configuration as the production cluster by restoring from the production snapshot. Create a consumer cluster with the recommended initial consumer size from the previous guidance. Then, set up data sharing between the producer and consumer.
Replay workload on initial producer and consumer
Replay the producer-only workload on the initial size producer cluster. This can be achieved using the “Exclude” filter parameter to exclude consumer queries, for example by the user that runs consumer queries.
Replay the consumer-only workload on the initial size consumer cluster. This can be achieved using the “Include” filter parameter to include only consumer queries, for example by the user that runs consumer queries.
Evaluate the performance of these replays against the baseline and workload performance requirements.
Replay consumer workload on different configurations
If the initial size consumer cluster meets or exceeds your workload performance requirements, then you can either use this cluster configuration or follow these steps to test smaller configurations to see if you can further reduce costs and still get the performance that you need.
Compare initial consumer performance results against your workload requirements:
- If the result exceeds your workload performance requirements, then you can reduce the size of the consumer cluster incrementally, starting with 1/2x, retry the replay and evaluate the performance, then resize up or down based on the result until it meets your workload requirements. The purpose is to find the sweet spot where you’re comfortable with the performance and pay the lowest price possible.
- If the result fails to meet your workload performance requirements, then you can increase the size of the cluster incrementally, starting with 2x the original size, retry the replay and evaluate the performance until it meets your workload performance requirements. A sketch of this resize-replay-evaluate loop follows this list.
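The following hypothetical sketch makes the loop concrete. replay_workload and meets_sla are placeholders for your Simple Replay invocation and SLA check; they are not part of any AWS API.

```python
import boto3

redshift = boto3.client("redshift")

def resize_and_wait(cluster_id, num_nodes):
    """Elastic resize to num_nodes and wait for the cluster to be ready."""
    redshift.resize_cluster(
        ClusterIdentifier=cluster_id, NumberOfNodes=num_nodes, Classic=False
    )
    redshift.get_waiter("cluster_available").wait(ClusterIdentifier=cluster_id)

def find_smallest_passing_size(cluster_id, initial_nodes, replay_workload, meets_sla):
    """Halve the cluster size while the replayed workload still meets the
    SLA, then settle on the smallest size that passed. replay_workload and
    meets_sla are placeholders for your own replay and evaluation logic."""
    best = nodes = initial_nodes
    while nodes >= 2:
        candidate = nodes // 2
        resize_and_wait(cluster_id, candidate)
        if meets_sla(replay_workload(cluster_id)):
            best = nodes = candidate  # still passing; keep shrinking
        else:
            break  # SLA missed; the previous size is the sweet spot
    resize_and_wait(cluster_id, best)
    return best
```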
Replay producer workload on different configurations
Once you split your workloads out to consumer clusters, the load on the producer cluster should be reduced, and you should evaluate your producer cluster’s workload performance to look for opportunities to downsize and save on costs.
The steps are similar to the consumer replay. Elastic resize the producer cluster incrementally, starting with 1/2x the original size, replay the producer-only workload and evaluate the performance, and then further resize up or down until it meets your workload performance requirements. The purpose is to find the sweet spot where you’re comfortable with the workload performance requirements and pay the lowest price possible. Once you have the desired producer cluster configuration, replay the consumer workloads on the consumer cluster again to make sure that the performance wasn’t impacted by the producer cluster configuration changes. Finally, replay both producer and consumer workloads concurrently to make sure that the performance holds in a full workload scenario.
Re-evaluate after a full workload run over time
Similar to the generic guidance, you should re-evaluate the producer and consumer cluster sizing using the previous strategy from time to time, especially after a full workload deployment, to confirm that your cluster configuration still delivers the best price performance.
Clean up
Running these sizing tests in your AWS account has cost implications because it provisions new Amazon Redshift clusters, which may be billed at on-demand rates if you don’t have Reserved Instances. When you complete your evaluations, we recommend deleting the test clusters to save on costs. We also recommend pausing your clusters when they’re not in use.
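For example, a short boto3 cleanup sketch with placeholder identifiers; SkipFinalClusterSnapshot=True deletes the cluster without taking a final snapshot, which is appropriate for throwaway test clusters only.

```python
import boto3

redshift = boto3.client("redshift")

# Delete a test cluster once the evaluation is complete (placeholder ID).
# Skipping the final snapshot is appropriate for throwaway test clusters.
redshift.delete_cluster(
    ClusterIdentifier="consumer-test-cluster",
    SkipFinalClusterSnapshot=True,
)

# Alternatively, pause a cluster you plan to reuse later to stop
# compute billing while keeping the cluster definition and data.
redshift.pause_cluster(ClusterIdentifier="producer-test-cluster")
```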
Applying Amazon Redshift and data sharing best practices
Properly sizing both your producer and consumer clusters gives you a good start toward getting the best price performance from your Amazon Redshift deployment. However, sizing isn’t the only factor that can maximize your performance; understanding and following best practices is equally important.
General Amazon Redshift performance tuning best practices are applicable to data sharing deployment. Make sure that your deployment follows these best practices.
There are numerous data sharing specific best practices that you should follow to make sure that you maximize performance. Refer to this post for more details.
Summary
There is no one-size-fits-all recommendation for producer and consumer cluster sizes; they vary by workload and your performance SLA. The purpose of this post is to provide guidance on how you can evaluate your specific data sharing workload performance to determine both consumer and producer cluster sizes for the best price performance. Consider testing your workloads on the producer and consumer with Simple Replay before adopting the architecture in production.
About the Authors
BP Yau is a Sr Product Manager at AWS. He is passionate about helping customers architect big data solutions to process data at scale. Before AWS, he helped Amazon.com Supply Chain Optimization Technologies migrate its Oracle data warehouse to Amazon Redshift and build its next generation big data analytics platform using AWS technologies.
Sidhanth Muralidhar is a Principal Technical Account Manager at AWS. He works with large enterprise customers who run their workloads on AWS. He is passionate about working with customers and helping them architect workloads for costs, reliability, performance, and operational excellence at scale in their cloud journey. He has a keen interest in data analytics as well.