One of our clients, a major provider of managed healthcare solutions requested Coherent Solutions BI Practice to provide a position paper that compares two approaches to data storage:
- A data warehouse based on a dimensional data model implemented on a relational database server.
- A Massively Parallel Processing (MPP) database solution such as GreenplumDB or Microsoft Parallel Data Warehouse (PDW).
While the results were shared with that specific customer in detailed form, we thought they would also be interesting to share in abbreviated form for the general public in our blog.
Leveraging Coherent’s deep experience designing and building BI solutions using Microsoft technologies, we decided to use SQL Server 2014 and Microsoft Parallel Data Warehouse (PDW) as representative products for #1 and #2 respectively.
The decision between two approaches is not trivial, as they are not mutually exclusive. Thus, to make the right decision for an organization, we attacked it from two different angles – technical architecture (i.e. technology stack, Warehouse/ETL design to support frequent updates, hardware requirements and costs), and data architecture (target data model focusing on effective data access by various consumers). The two sections below cover the high level approach we used for the data architecture and the technology stack part of technical architecture. Although we did not have detailed cost and required skillset information for PDW, our exploration efforts (which included high-level cost numbers) turned out to be just enough to reach a conclusion.
During initial discussions, we learned a few of the data issues that the client faced.
The immediate issue was their current data access strategy had failed to perform on larger data sets such as when 40+ transactional tables needed to be joined together, resulting in complex and possibly inefficient queries.
The short-to-medium term issue was the need for the organization’s customers to directly access their data with the current OLTP schema. Resolving the issue head on would prove to not only be too complex for the end consumer, but also would require very extensive training to ensure the customers do not end up in situation where one query makes database unresponsive for others.
In addition, although third party reporting tools (e.g. Logi or Tableau) can work against OLTP databases, performance benefits with larger data sets would not be fully realized.
The long-term issue the client had was they expected significant growth in data volumes, which needed to be addressed with their overall data architecture.
It was obvious that regardless of the technical architecture decisions, the data would need to be converted into a dimensional structure for reporting purposes. Regardless of the technology, making this conversion would allow for 1) a better understanding of the data by the end consumer, thus more intuitive querying, 2) a straight forward consumption by the third party reporting tool of choice (although here, the data model might need to be properly “snow flaked”, depending on the tool selection), and 3) an increase in performance due to significantly simplifying and reducing the number of joins.
As part of this section, we also considered an approach as it comes to “Big Data”. “Big Data” is characterized as loosely structured (e.g. attributes might not be finalized yet, or are expected to change often) and/or high velocity (e.g. machine-generated data). In the case of the client’s current and future needs, some transactions could fall under a “Big Data” definition (such as claims and bank transactions), although this data will most likely not be directly exposed to customers without significant restructuring.
However, a Hadoop environment would be beneficial for internal data exploration and analytics activities, which then might lead to a more structured approach to adding the data to the reporting database.
SQL Server PDW
SQL Server PDW requires a server machine with 144 cores and 2,304 GB of RAM and 203 TB of storage. It has a 5-year cost of ownership of approximately $3M (not including migration efforts). Its sweet spot is series of ad-hoc queries against large volumes of data (tens of terabytes), where the appliance can provide very quick response, and one doesn’t have to lose time waiting for the query results. The appliance is built for much bigger tasks, like advanced analytics against Big Data volumes to allow a business to run more queries on the data within a day and be more productive in its exploration and analysis activities. The performance increase is promised to be 10x-50x in comparison to the current solution. However, based on our experience with other appliances, due to the opportunistic nature of resource management (e.g. throwing as many resources as possible for each query request), appliances are not optimized for tasks of running a very large number of basic queries, thus performance benefits might not be fully realized.
SQL Server 2014
SQL Server 2014 offers a very interesting alternative to the PDW offering. Its new feature, in-memory OLTP, allows one to increase one’s queries’ performance 10x-30x times if they are running against moderate amounts of data (< 120GB).
From the hardware standpoint, one would need to increase the amount of RAM available for the SQL Server 2014 instance (max for single server is 256 GB). From a development\DBA standpoint, table data would need to be migrated from disk to memory. Memory tables solution appeared to be a sweet spot for the current short-terms goals defined above. Licensing price per core is around $14,000, which makes the barrier to entry much smaller than that for PDW.
SQL Server 2014 appeared to be a great starting point for the DBMS as it provides a significant performance increase from SQL Server 2012 almost without a learning curve (aside from modifying selected entities from disk-based to memory-based storage).
For a robust, scalable, and cost-effective solution, we proposed the client to take a holistic approach focusing on structuring the data and processes before exploring expensive appliance options.
- Define high-level data architecture
- Develop reporting database / model
- Select reporting tool
At that point, the data model could be tuned to meet the specific nuances of the data and reporting tools. If advanced analytics for high-velocity, unstructured data is needed, appliances such as Microsoft PWD could be considered (in the case one starts with SQL Server 2014, most of the structures can be transferred to PDW).