At Coherent Solutions, we are often asked to compare different technologies. Clients want to know which option best fits their business needs and technical constraints. As with most things, there are trade-offs.
Recently we were asked about the differences between Amazon’s cloud data warehouse platform, Redshift, and IBM’s data warehousing appliance, Netezza. These two platforms are from different worlds. Redshift is a cloud-based columnar data store, while Netezza “pretends” to be columnar (by supporting compression) and is backed by some serious hardware. What could be more different? Yet they both exist to store vast amounts of data. A full comparison of these two solutions would be very complicated and would depend on the specific needs of a given customer. But we have been working with and analyzing both products, and along the way we have built up some technical comparisons between them. I’d like to share some of our findings in this post.
Research findings
For the sake of clarity, I’ll skip some implementation details and begin with the preconditions for our experiment. We strove to have comparably sized data on both products, but ended up with a slightly smaller dataset on the Redshift platform. That difference was not large enough to invalidate the comparison, however.
Another important note: to limit the impact of transferring data over the intranet/internet (Redshift lives in the cloud, while we have Netezza in our own data center), we added extra grouping and aggregate operations to the queries to minimize the size of the result set being returned. Obviously, extra operations affect the final results, but this exercise allowed us to evaluate a fundamental difference between the two technologies while trying some fairly realistic data manipulation scenarios. Our results are certainly not definitive and your mileage may vary, but hopefully they give you a baseline and an approach to jumpstart your own analysis.
Spoiler alert: we fully expected Netezza to come out faster, and it did. But the difference in cost is great enough to make Redshift a very competitively priced option for small and medium-sized businesses that could not consider the infrastructure expense of Netezza.
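To illustrate why the extra grouping helps, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for either warehouse. The table and column names (fact_sales, dim_region) are hypothetical and are not the schema we actually tested; the point is only that the aggregated query returns far fewer rows to the client than the plain join.

```python
import sqlite3

# Hypothetical star-schema fragment; names and data are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dim_region (region_id INTEGER PRIMARY KEY, region_name TEXT)")
conn.execute(
    "CREATE TABLE fact_sales (sale_id INTEGER PRIMARY KEY, region_id INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO dim_region VALUES (?, ?)",
    [(1, "North"), (2, "South"), (3, "West")],
)
conn.executemany(
    "INSERT INTO fact_sales (region_id, amount) VALUES (?, ?)",
    [(i % 3 + 1, float(i)) for i in range(3000)],
)

# Plain join: every fact row has to travel back to the client.
raw_rows = conn.execute(
    "SELECT f.sale_id, r.region_name, f.amount "
    "FROM fact_sales f JOIN dim_region r ON r.region_id = f.region_id"
).fetchall()

# Same join with GROUP BY: only one summary row per region is returned.
agg_rows = conn.execute(
    "SELECT r.region_name, COUNT(*), SUM(f.amount) "
    "FROM fact_sales f JOIN dim_region r ON r.region_id = f.region_id "
    "GROUP BY r.region_name ORDER BY r.region_name"
).fetchall()

print(f"rows returned without aggregation: {len(raw_rows)}")
print(f"rows returned with aggregation:    {len(agg_rows)}")
```

The engine still scans and joins the same data in both cases, so the aggregated form keeps the work on the server while shrinking what crosses the network.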
Test methodology and data
The test dataset is a star-like OLAP cube, with a fact table and 7 dimension tables. The performance difference was calculated as (Redshift – Netezza) / Netezza * 100, where each name stands for that platform's measured time for a query. Since we expected Netezza to be faster, a positive value shows by what percentage Netezza is faster than Redshift.
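The formula can be sketched as a small helper. The function name and the sample timings below are made up for illustration; they are not measurements from our tests.

```python
def percent_difference(redshift_seconds: float, netezza_seconds: float) -> float:
    """Return (Redshift - Netezza) / Netezza * 100 for one query.

    A positive result means Redshift took longer than Netezza.
    """
    if netezza_seconds <= 0:
        raise ValueError("netezza_seconds must be positive")
    return (redshift_seconds - netezza_seconds) / netezza_seconds * 100.0


# Hypothetical timings: Redshift needs 20 s, Netezza needs 10 s.
print(percent_difference(20.0, 10.0))  # 100.0, i.e. Redshift took twice as long
```

Note that a result of 100% means the query took twice as long on Redshift, not that the two platforms were equal; a result of 0% would mean identical timings.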
Redshift capacity details:
Node type: dw1.xlarge, 4.4 EC2 Compute Units (2 virtual cores) per node, 15 GiB of memory per node, 2TB HDD storage per node, moderate I/O performance, 64-bit platform.
Netezza details:
Netezza TwinFin: 12 S-Blades, CPU Cores and 32 TB
Average value: Netezza outperforms Redshift by 161.4%
Tests of grouping and aggregating a single table
Average value: Netezza outperforms Redshift by 609.48%
Tests of JOINs across a few tables with grouping and aggregation of the result set
Average value: Netezza outperforms Redshift by 113.5%
Tests of subqueries + ordering + filtering
Average value: Netezza outperforms Redshift by 236.5%
Conclusion
It came as no surprise to anyone that Netezza came out on top in terms of raw performance for this test configuration. What did surprise us is that, on average, it was about four times faster than Redshift. While this may seem like a lot, it is nowhere near twenty or thirty times faster, which can be the cost difference between the two. Netezza requires a large hardware investment, which varies with the buyer’s needs (but readily tops $1M), as well as trained support staff. Amazon’s Redshift, on the other hand, is a turnkey solution, with Amazon handling all administration and maintenance and charging a recurring fee in the range of $1000 per terabyte per year.
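As a rough check on the "about four times" figure, the sketch below averages the four category results reported above and sets the resulting speed ratio against the twenty-to-thirty-times cost multiple. Taking an unweighted mean across categories is an assumption made here for illustration, and the variable names are hypothetical.

```python
# Category averages reported in this post (percent by which Netezza led).
category_gaps_pct = [161.4, 609.48, 113.5, 236.5]

# Assumption: a simple unweighted mean across the four categories.
mean_gap_pct = sum(category_gaps_pct) / len(category_gaps_pct)

# Under (Redshift - Netezza) / Netezza * 100, a gap of G% means
# Redshift took (1 + G / 100) times as long as Netezza.
runtime_ratio = 1 + mean_gap_pct / 100

print(f"mean gap: {mean_gap_pct:.1f}%  ->  Redshift takes {runtime_ratio:.1f}x as long")

# Cost multiples taken from the "twenty or thirty times" remark in the text.
for cost_ratio in (20, 30):
    premium = cost_ratio / runtime_ratio
    print(f"at {cost_ratio}x the cost, Netezza costs {premium:.1f}x more per unit of speed")
```

Under these assumptions the mean gap comes to roughly 280%, so Redshift takes about 3.8 times as long, which is where "about four times faster" comes from.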
Of course, each platform’s strengths make it a preferred solution under certain circumstances.
Netezza’s out-of-the-box performance makes it more suitable when your workloads need to be near real-time and are relatively stable. But if you choose to invest in this fixed hardware to handle peak loads, you also want to be sure that you don’t overpay for unused, off-peak capacity. This is where Redshift really shines. You can start small, then scale up as needed, and scale down as well, in a matter of days, if not hours, depending on the amount of data movement.
Last but not least, consider the type of processing and data manipulation performed in the workloads. You can see from our quick performance analysis that the more complex the query, the smaller Netezza’s performance advantage.
At Coherent Solutions we think that Redshift is a very competitive solution for many customer scenarios, especially when ‘set-up and go’ is a priority. In fact, we have helped a number of customers set up their data warehouses in Amazon’s cloud quite successfully.