We’ve been asked what data profiling contributes to a data warehouse, and once in a while we even hear that an organization – pressed for time, or trying to save money – wants to eliminate data profiling or tack it on at the end of a data warehouse build-out. This is an appeal to never, EVER, treat data profiling as a time- or cost-saving cut. In fact, I would argue that data profiling is a key part of any DW/BI project’s foundation.
Base your work on knowledge, not faith
Skipping a comprehensive, technical analysis of the source data means relying on other people’s assumptions. Those people are usually non-technical and deal with only a small chunk of the available data in their daily work; they don’t think about the source system as a whole because it isn’t their responsibility. In other words, you have to choose between two things: faith and knowledge. Data profiling turns guesses (faith) into confidence (knowledge).
What happens when you don’t do data profiling
Let me provide a couple of examples which can help you understand why data profiling is crucial for DW/BI projects.
On one DW/BI project we were asked to add a new data source carrying inventory information at the Stock Keeping Unit (SKU – a 13-character code) level of the product hierarchy. At first glance the task was trivial because the data looked consistent and straightforward: the source assigned 13-character titles to the SKUs. Once we performed our regular data profiling, however, we saw some inconsistencies. See the allocation of SKU values in the box below.
As you can see, only 70% of the records were in line with business requirements and expectations. Analysis of the 30% that did not match the pattern revealed that these records actually defined the Kit level of the product hierarchy, and additional, fairly complex transformations were required to shift this data to the SKU level. Had we not profiled the data, we would have treated it all as SKU-level, applying uniform processing rules, join conditions, and so on. That would have led either to a data mess in the warehouse or to data leaks during the ETL process.
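The kind of pattern check that caught this can be sketched in a few lines. This is a minimal illustration, not our actual tooling: the sample values and the assumed SKU shape (13 uppercase alphanumerics) are hypothetical, since the article does not show the real codes.

```python
import re
from collections import Counter

def profile_sku_values(values):
    """Tally value lengths and collect values that break the expected SKU shape."""
    lengths = Counter(len(v) for v in values)
    pattern = re.compile(r"^[A-Z0-9]{13}$")  # assumed shape: 13 uppercase alphanumerics
    mismatches = [v for v in values if not pattern.match(v)]
    return lengths, mismatches

# Hypothetical sample: three SKU-like codes and one kit-level code
sample = ["ABC1234567890", "XYZ0000000001", "QQQ9999999999", "KIT-001"]
lengths, mismatches = profile_sku_values(sample)
print(dict(lengths))   # length distribution, e.g. {13: 3, 7: 1}
print(mismatches)      # values needing investigation, e.g. ['KIT-001']
```

A length distribution plus a pattern-mismatch list is often enough to surface a second, hidden entity (here, the Kit level) hiding inside a column that everyone believed was uniform.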
In our practice we have found that sophisticated data profiling can increase business users’ understanding of the source system. One of the most frequently missed problems is deletions. For instance, users might know the answers to the following questions:
- What is the grain of the table?
- What does each attribute mean?
- How often is each attribute being updated?
- How many records are being inserted daily?
However, they rarely realize when something has disappeared from the table. There is a wide range of business cases in which deletes take place, such as when inactive accounts are purged or when a cancelled order line is erased. Careful handling of deletions while the DW/BI system is being built is crucial for trustworthy information later on. And if users don’t get reliable information, they will either stop using the system or report a bug that will take far more time and effort to fix late in the process than it would have taken to handle correctly at the beginning.
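One common way to surface silent deletes is to compare the key sets of two consecutive snapshots of a source table. The sketch below assumes hypothetical order keys and daily full extracts; real systems may instead use change-data-capture or audit logs.

```python
def detect_deletes(previous_keys, current_keys):
    """Keys present in the previous snapshot but absent from the current one
    are candidate deletes that the ETL process must handle explicitly."""
    return set(previous_keys) - set(current_keys)

# Hypothetical keys from two consecutive daily snapshots of the source table
yesterday = {"ORD-1", "ORD-2", "ORD-3"}
today = {"ORD-1", "ORD-3", "ORD-4"}

deleted = detect_deletes(yesterday, today)
print(deleted)  # {'ORD-2'}
```

Running such a comparison during profiling tells you whether deletes happen at all, and roughly how often, before you commit to an ETL design that assumes rows only ever arrive or change.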
In other words, ignoring data profiling can have a truly disruptive impact on the whole DW/BI project. An upfront investment of a few hours of one analyst’s time seems minimal compared to the hundreds of hours that everyone from analysts to DBAs to promotion managers could spend later fixing poorly categorized data.
Checks and outcomes
Here are some of the primary data checks and outcomes we look to achieve at Coherent Solutions:
- Ensure that users’ expectations and requirements are in line with source system actuals.
- Learn/verify the granularity of the different processes running in the source system.
- Understand source system behavior, e.g.:
  - Bulk or transactional loads?
  - Which operations are common for which entities (inserts/updates/deletes)?
- Identify data anomalies and create rules to isolate them.
- Learn defaults of the source system.
- Trace previously unknown semantic conformance between different source systems.
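Several of the checks above (anomalies, defaults, granularity hints) fall out of a basic per-column profile. The following is a minimal sketch of such a profile; the column name, sample values, and the specific statistics chosen are illustrative assumptions, not a fixed methodology.

```python
from collections import Counter

def profile_column(values):
    """Basic per-column profile: null rate, distinct count, top values, length range."""
    total = len(values)
    nulls = sum(1 for v in values if v is None or v == "")
    non_null = [v for v in values if v not in (None, "")]
    top = Counter(non_null).most_common(3)  # frequent values often reveal defaults
    lengths = [len(str(v)) for v in non_null]
    return {
        "null_rate": nulls / total if total else 0.0,
        "distinct": len(set(non_null)),
        "top_values": top,
        "min_len": min(lengths) if lengths else None,
        "max_len": max(lengths) if lengths else None,
    }

# Hypothetical column sample with blanks standing in for missing values
stats = profile_column(["A", "B", "A", "", None, "A"])
print(stats)
```

A dominant top value frequently turns out to be a source-system default, and an unexpectedly low distinct count is a quick hint that the column’s grain is coarser than users believe.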