It is no secret that Big Data is one of the hottest buzzwords in information technology. It is becoming commonplace for companies to look to Big Data technologies to expand or replace traditional data warehouses. One reads a lot about the fundamental differences between traditional data warehouses and Big Data, and a data warehouse QA engineer might find it overwhelming to make the switch. It feels like a huge leap. With challenges like new paradigms, tools and technologies, on top of ever-increasing pressure to continue delivering business value, where do you start? As a technical QA manager, I found myself in a similar situation. A quick Google search for "Big Data Landscape" led to the picture below. Unfortunately, it wasn’t much help.
Source: http://mattturck.com/2016/02/01/big-data-landscape/
Once the initial shock settled, I consulted with my QA team and Big Data development team. We decided to identify and tackle the areas where Big Data testing differs from traditional approaches, building on the solid foundation in Data Warehouse testing that our team has accumulated over the years. Below I describe the major challenges we identified through this process and how we turned them into strengths to take our BI QA practice to the next level.
The introduction of Big Data will inevitably bring unstructured and semi-structured data sources into the data flow. Testing these new types of data is a major challenge. However, after looking into it more deeply, we realized that:
- “Structured data still makes 75% of data under management for more than two thirds of organizations, with nearly one-third of organizations not yet actively managing unstructured data at all” (according to a Dell survey).
- Semi-structured data (data whose schema can be defined, but whose data types are not always available) is no different from structured data from a QA standpoint, with one caveat: QA engineers will have to get used to the “schema-on-read” paradigm, in which the schema is applied when the data is read rather than when it is written.
- Unstructured data has to be cleaned and explored, and its value has to be determined, before it can be considered a feasible data source. In practice, that means transforming unstructured data into a semi-structured format.
Therefore, most Big Data QA work will follow the same patterns, processes, and procedures that QA teams have historically used when working with traditional data sources. As long as there is structure to the source data, there will be structure to the target data, and transformation rules will have to be defined and followed by the ETL development and QA teams.
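To make the “schema-on-read” idea a bit more concrete, here is a minimal sketch in Python of what it can look like from a tester's point of view. It is not taken from our actual test suite: the field names, the `SCHEMA` mapping and the `read_with_schema` helper are all made up for illustration. The raw records are stored as-is, and the schema (including type casting) is only applied at the moment they are read, so malformed records surface as read-time findings rather than load-time failures.

```python
import json

# Hypothetical schema, applied only when the data is read:
# field name -> function that casts the raw value to the expected type.
SCHEMA = {"order_id": int, "amount": float, "country": str}


def read_with_schema(lines, schema):
    """Parse JSON lines and apply the schema while reading.

    Returns (rows, errors): rows that conform after casting, and
    (line_number, reason) pairs for records that do not.
    """
    rows, errors = [], []
    for n, line in enumerate(lines, start=1):
        try:
            raw = json.loads(line)
        except json.JSONDecodeError as exc:
            errors.append((n, f"invalid JSON: {exc.msg}"))
            continue
        try:
            # Fields not named in the schema are simply ignored.
            rows.append({name: cast(raw[name]) for name, cast in schema.items()})
        except KeyError as exc:
            errors.append((n, f"missing field: {exc.args[0]}"))
        except (TypeError, ValueError):
            errors.append((n, "type mismatch"))
    return rows, errors


# Made-up sample records standing in for a semi-structured source file.
sample = [
    '{"order_id": "1", "amount": "19.99", "country": "US"}',
    '{"order_id": 2, "amount": 5, "country": "DE", "extra": true}',
    '{"order_id": "x3", "amount": "7.50", "country": "FR"}',
    '{"order_id": 4, "country": "GB"}',
]

rows, errors = read_with_schema(sample, SCHEMA)
for line_number, reason in errors:
    print(f"line {line_number}: {reason}")
```

The rejected records are exactly the kind of finding a QA engineer would report against the transformation rules, just as with a traditional source.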
Our QA team has mostly worked on Windows workstations, so Linux (the preferred OS platform for tools in the Big Data ecosystem) introduced a learning curve. Nevertheless, learning the basics of the OS and its shell commands not only took less time than expected (with the help of numerous online resources and our Big Data dev team), but also felt very empowering. Within a few weeks I realized that my brain was looking for similar command-line operations in Windows. Needless to say, I now have Cygwin installed on my Windows PC.
Having worked with IDE tools such as SSMS, Visual Studio, DB Forge, Pentaho DI and others, we looked for similar tools in the Hadoop world. The first standard tool we found in the toolset was HUE (Hadoop User Experience). HUE is a web-based tool that helps you get started interacting with Hadoop and learn the names of the animals in the Big Data zoo.
As we started using the tool, we quickly came across numerous shortcomings, such as incomplete compatibility with some of the Big Data zoo’s tools and a number of bugs, like spontaneous “losses” of submitted Hive queries or the inability to import all properties of an Oozie workflow from its XML definition. Realizing that the visual toolset for the Hadoop ecosystem is still raw, we had to fall back on the old and proven command line (many thanks to our development team, once again, for getting us started). Open-source tools like the community edition of IntelliJ IDEA helped us a lot in reorganizing our thought process to accommodate the paradigm shift, as well as in learning crucial concepts and patterns of the Java world (yes, Big Data is the world of Java and its descendants like Scala or Groovy).
Automated or manual testing of traditional data warehouses implies broad usage of SQL, which our team has mastered. Unfortunately, in the Big Data world SQL is just “one of”, but not the “all-in-one” means of accessing data. There is an ongoing effort to reduce the learning curve in Big Data with the introduction of tools like Drill and Hive.
However, there are still plenty of tasks that require expertise beyond SQL. Adding Bash and Python scripting languages to our toolset, along with getting trained in the most common execution engines like Spark and MapReduce, allowed us to get enough power to access/analyze any data properly and efficiently, regardless of its size. In turn, this helped us to fill in areas unreachable for SQL, like unstructured and semi-structured data analysis. As a result, we reached a point where our toolset is broad enough to test all of the typical big-data features, as well as to validate some of the results against data stored in traditional data sources.
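As an illustration of validating Big Data results against a traditional source, here is a small Python sketch of one common reconciliation technique: comparing row counts plus an order-independent fingerprint of the rows. It is only a sketch under assumptions: the `fingerprint` and `reconcile` helpers are hypothetical, the two in-memory lists stand in for result sets you would fetch from, say, Hive and a relational warehouse, and the value normalization (everything converted with `str`) is deliberately simplistic.

```python
import hashlib

MODULUS = 1 << 256  # keeps the running sum the same size as one SHA-256 digest


def fingerprint(rows):
    """Return (row_count, digest) for an iterable of row tuples.

    Each row is hashed on its own and the hashes are summed, so the
    result does not depend on row order, while duplicates still count.
    """
    count, total = 0, 0
    for row in rows:
        # Assumption: str() is a good enough normalization for this data;
        # real engines may format decimals, dates and NULLs differently.
        text = "\x1f".join("" if value is None else str(value) for value in row)
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        total = (total + int.from_bytes(digest, "big")) % MODULUS
        count += 1
    return count, total


def reconcile(source_rows, target_rows):
    """Compare two result sets by row count and content fingerprint."""
    source_count, source_digest = fingerprint(source_rows)
    target_count, target_digest = fingerprint(target_rows)
    return {
        "source_count": source_count,
        "target_count": target_count,
        "match": (source_count, source_digest) == (target_count, target_digest),
    }


# Made-up result sets: same rows, different order.
warehouse_rows = [(1, "US", 100), (2, "DE", 250), (3, "FR", None)]
hadoop_rows = [(3, "FR", None), (1, "US", 100), (2, "DE", 250)]
print(reconcile(warehouse_rows, hadoop_rows))
```

Because the fingerprint is computed in a single pass and does not require sorting, the same idea can be applied on each side independently and only the small `(count, digest)` pair needs to be compared.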
In conclusion, I would like to reiterate that, although the Big Data world differs quite a bit from traditional data warehousing, the majority of QA efforts boil down to validating data as it moves from one place to another, which should be second nature to a traditional DW QA team. Big Data’s new environment and tools introduce challenges, but these challenges can be quickly overcome with the help of online resources and assistance from your Big Data development team.