Coherent’s automated solution saves time and effort
Some of you may have read my previous blog post comparing IBM’s Netezza with AWS’s Redshift performance. Now I want to share some details of pushing test data into Redshift and what we’ve come up with at Coherent Solutions. You could conduct this research yourself, piecing it together from various sources, but we have already done it and compiled the steps here – in a single place.
Apart from getting through the whole process of loading data into AWS Redshift, we wanted to automate the solution to make it reusable for our clients. This can be done in multiple ways, with different technologies and platforms. We chose an MS SQL/SSIS tool set, as we have extensive experience with it.
It came as no surprise that we’d be faced with multiple challenges in creating an automated data load orchestration. Besides learning how AWS Redshift and S3 work (yes, S3 is a necessary part of it), what data types are supported, and then applying best practices for building high-performing columnar data warehouses, we had to:
- Understand how data is converted to meet Redshift constraints (a small illustration follows this list)
- Find a way to perform effective data partitioning during a data load
- Achieve acceptable data consumption and load performance
- And overcome many other obstacles
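To give a sense of that first point, here is a minimal, purely illustrative sketch of how SQL Server types might be mapped to Redshift-friendly equivalents. The mapping table and the convert_type helper below are simplified assumptions for illustration only; the conversion logic in our SSIS packages is considerably more involved.

```python
# Purely illustrative: a simplified mapping from common SQL Server types
# to Redshift-friendly equivalents. Redshift has no TINYINT, TEXT, or
# DATETIME2, and VARCHAR lengths are measured in bytes, not characters.
SQLSERVER_TO_REDSHIFT = {
    "BIT":       "BOOLEAN",
    "TINYINT":   "SMALLINT",        # smallest integer type in Redshift
    "DATETIME":  "TIMESTAMP",
    "DATETIME2": "TIMESTAMP",
    "MONEY":     "DECIMAL(19,4)",
    "NVARCHAR":  "VARCHAR",         # lengths may need to grow for multi-byte data
    "TEXT":      "VARCHAR(65535)",  # Redshift VARCHAR is capped at 65535 bytes
}

def convert_type(sql_server_type, length=None):
    """Return a Redshift column type for a SQL Server type (simplified)."""
    base = SQLSERVER_TO_REDSHIFT.get(sql_server_type.upper(), sql_server_type.upper())
    if base == "VARCHAR" and length:
        # Allow room for multi-byte UTF-8 characters, but stay within the
        # Redshift maximum of 65535 bytes.
        return "VARCHAR({})".format(min(length * 4, 65535))
    return base
```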
But this blog post isn’t intended to discuss all the challenges in great detail. The illustration below shows how we configured the solution, and I’m also providing a few key recommendations from AWS on how to establish a data load.
ETL data load architecture onto AWS Redshift
- Splitting data into multiple files. AWS strongly recommends dividing data into multiple files to take advantage of parallel processing; see Splitting your data into multiple files.
- Compressing data files with GZIP. For larger datasets, each file should be individually compressed with GZIP to minimize network traffic; see Compressing your data files with GZIP.
- Uploading the compressed data files to Amazon Simple Storage Service (S3). The first sketch after this list illustrates the splitting, compression, and upload steps.
- Loading data from S3 to Redshift. The COPY command is used to transfer data from S3 to Redshift, as it leverages Amazon’s massively parallel processing (MPP) architecture to read and load data in parallel from files in an Amazon S3 bucket; see Loading data from Amazon S3. The second sketch below shows a minimal COPY invocation.
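As a rough illustration of the splitting, compression, and upload steps, the sketch below slices a CSV extract into several gzipped parts and pushes them to S3 with boto3. The bucket, key prefix, and part count are placeholders, and our actual solution performs these steps from within SSIS packages; this is just a minimal standalone equivalent.

```python
import gzip
import boto3

def split_gzip_and_upload(source_path, bucket, key_prefix, num_parts=8):
    """Split a CSV extract into num_parts gzipped slices and upload them to S3.

    One file per Redshift slice (or a multiple of the slice count) lets the
    COPY command read the parts in parallel.
    """
    s3 = boto3.client("s3")

    # For illustration the whole extract is read into memory; a production
    # loader would stream the file instead.
    with open(source_path, "rb") as f:
        f.readline()          # skip the header row; COPY will load raw data rows
        rows = f.readlines()

    chunk_size = len(rows) // num_parts + 1
    for i in range(num_parts):
        chunk = rows[i * chunk_size:(i + 1) * chunk_size]
        if not chunk:
            break
        key = "{}.part{:03d}.csv.gz".format(key_prefix, i)
        s3.put_object(Bucket=bucket, Key=key, Body=gzip.compress(b"".join(chunk)))
        print("uploaded s3://{}/{}".format(bucket, key))
```

Calling split_gzip_and_upload("sales.csv", "my-bucket", "loads/sales") would produce keys such as loads/sales.part000.csv.gz, which the COPY command can then pick up by prefix.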
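And here is a minimal sketch of the COPY step itself, issued from Python with psycopg2 (Redshift speaks the PostgreSQL wire protocol, so any compatible SQL client works; in our solution the command is issued from an SSIS package). The target table, S3 prefix, IAM role, and cluster endpoint below are placeholders.

```python
import psycopg2

COPY_SQL = """
    COPY staging.sales                       -- placeholder target table
    FROM 's3://my-bucket/loads/sales.part'   -- common prefix of the gzipped parts
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
    FORMAT AS CSV
    GZIP
    REGION 'us-east-1';
"""

conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",  # placeholder endpoint
    port=5439, dbname="dw", user="loader", password="...",
)
cur = conn.cursor()
cur.execute(COPY_SQL)   # Redshift loads all files matching the prefix in parallel
conn.commit()
cur.close()
conn.close()
```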
The bottom line is that we’ve created a set of SSIS packages that fully automates data loads into AWS Redshift, with little or no modification needed to accommodate each customer’s needs. It supports the features that every professional data integration task should, such as logging work statistics and handling ETL process errors. As a result of this new tool set, we can concentrate more of our time and effort on our customers’ business issues rather than reinventing the wheel with each new data load.
If you would like to learn more about how we can help you upload data to AWS Redshift, please contact us.