Redshift Expert Services
Below we list some common challenges that our customers experience. We can solve all of these issues for you.
Our queries, reports, dashboards etc. are very slow
There are dozens of possible reasons why your reports or dashboards are slow. Here are some of the more common issues:
- Your data is not modeled for speed, e.g. you have selected the wrong distribution or sort keys or you have not denormalized your data sufficiently.
- Your queries have not been written for high performance or your cluster is too small.
- You haven't set up workload management correctly, e.g. you are not separating your ETL batch from your interactive queries.
- You don't use SQL window functions. In our view, you are not doing data warehousing if you are not using window functions. Our blog post introduces you to window functions on Redshift.
- Redshift is a distributed relational database, also known as an MPP database. Don't treat it as if it were Oracle or SQL Server.
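To illustrate the window function point above, here is a minimal sketch (the table and column names are hypothetical):

```sql
-- Hypothetical table: sales(customer_id, sale_date, amount)
-- Running total of each customer's sales, ordered by date
SELECT
    customer_id,
    sale_date,
    amount,
    SUM(amount) OVER (
        PARTITION BY customer_id
        ORDER BY sale_date
        ROWS UNBOUNDED PRECEDING
    ) AS running_total
FROM sales;
```

A single pass over the table replaces what would otherwise require a correlated subquery or a self-join.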
Designing for performance is our core area of expertise. Besides, performance tuning is great fun as it requires a complete understanding of how a technology works.
As part of our popular Redshift audit service we will look at the following items:
- Slow SQL statements
- Review of the data model and approach to data modeling
- Selection of distribution and sort keys
- Redshift ETL
- Tools for data orchestration
- Handling data quality and managing master data
- Naming standards
- Team structure: roles, skills, and responsibilities
Our load process takes too long
Again, there could be a lot of reasons why your load process is slow.
Do you perform full reloads or have a large number of incremental updates with vacuum operations?
Do you use the bulk loaders, e.g. the copy command?
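As a sketch, a typical bulk load with the COPY command looks like this (the bucket path, IAM role ARN, and table name are placeholders):

```sql
-- Load gzipped CSV files from S3 in parallel into a staging table
COPY staging.orders
FROM 's3://my-bucket/orders/'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
FORMAT AS CSV
GZIP;
```

COPY loads files in parallel across the cluster's slices, which is far faster than row-by-row INSERT statements.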
For certain scenarios and workloads you can leave your data in S3 and use Redshift Spectrum. This is also known as query offload. We help you identify those workloads and avoid the pitfalls. You can read more about query offload to Spectrum in our blog post Query Offload with Redshift Spectrum. Use cases and limitations.
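A minimal Spectrum sketch, assuming your external table definitions live in the AWS Glue Data Catalog (the schema, database, role, and table names are placeholders):

```sql
-- Register an external schema backed by the Glue Data Catalog
CREATE EXTERNAL SCHEMA spectrum
FROM DATA CATALOG
DATABASE 'my_glue_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

-- Query the data where it lives in S3, with no load step required
SELECT event_date, COUNT(*) AS events
FROM spectrum.clickstream
GROUP BY event_date;
```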
How can I increase concurrency and the degree of parallelism?
This is actually the wrong question. When looking at the overall performance of a system, we need to look at query throughput rather than concurrency. We increase throughput by decreasing query latency, which can be improved by writing better queries, creating better data models, adding nodes to the cluster, and so on.
Having said that, you should also make use of Redshift's workload management feature to assign cluster resources according to latency requirements, e.g. batch workloads are assigned a lower amount of cluster resources than interactive queries.
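For example, assuming you have configured a WLM queue for a query group named 'etl' (the group name here is an assumption), a batch session can route itself to that queue:

```sql
-- Route this session's statements to the WLM queue for the 'etl' query group
SET query_group TO 'etl';

-- ... run the batch ETL statements here ...

-- Return to the default queue
RESET query_group;
```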
How can we efficiently build ETL processes and orchestrate data flows on Redshift? How should we schedule our data pipelines?
First of all, you are probably better off building your ETL pipelines outside of Redshift, e.g. on AWS EMR or AWS Glue. Redshift has been designed specifically for running analytics workloads and interactive queries, and it has been optimized for those use cases, e.g. through columnar representation of data.
We use Apache Airflow to build, deploy, schedule, and support your ETL data pipelines. We have developed a couple of reusable Airflow operators for AWS Redshift that encapsulate common data integration logic.
Traditional approaches to data orchestration hard-code the dependencies into the data pipeline. With Airflow we follow a dependency-driven approach.
We have written a blog post that gives you a detailed outline of our approach to data orchestration on Redshift.
Redshift does not support constraints. How can we enforce data integrity?
Redshift is not a traditional OLTP database built for transaction processing. It is a columnar, compressed database technology for analytics workloads. In data warehousing, it is standard practice to build data integrity checks into the data pipeline instead.
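As a hedged sketch of such checks (the table and column names are hypothetical), a pipeline step can assert uniqueness and referential integrity before promoting data:

```sql
-- Check 1: find business keys that are duplicated in staging
SELECT order_id, COUNT(*) AS duplicates
FROM staging.orders
GROUP BY order_id
HAVING COUNT(*) > 1;

-- Check 2: find orders that reference a customer missing from the dimension
SELECT o.order_id
FROM staging.orders AS o
LEFT JOIN dim.customer AS c ON o.customer_id = c.customer_id
WHERE c.customer_id IS NULL;
```

If either query returns rows, the pipeline aborts and raises an alert instead of loading inconsistent data.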
We need to migrate our on-premises data warehouse to Redshift.
We have all the scripts and experience to migrate from any of your on-premises databases or OLAP cubes to Redshift.
What is the best way to model data in Redshift?
Whenever you model data you need to be aware that Redshift is a distributed database that follows the MPP model of spreading data across multiple nodes. How a table's rows are distributed depends on its distribution style: with EVEN distribution, rows are spread round-robin, so a table with 100 rows on a 10-node cluster places roughly ten rows on each node; with KEY distribution, rows are assigned to nodes by hashing the distribution key, which lets you co-locate rows that are frequently joined.
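As a sketch of how this influences table design (the table and column names are hypothetical):

```sql
-- Distribute on the join key to co-locate matching rows on the same node,
-- and sort by date so range-restricted scans can skip disk blocks
CREATE TABLE fact_sales (
    sale_id     BIGINT,
    customer_id BIGINT,
    sale_date   DATE,
    amount      DECIMAL(12, 2)
)
DISTSTYLE KEY
DISTKEY (customer_id)
SORTKEY (sale_date);
```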
Do you have any other challenges?
We have just listed some common problems here. If you have a particular challenge, please reach out to us and we will provide you with a solution.