articles, October 30, 2023

Build your data warehouse on a solid foundation with data profiling

Build your data warehouse

We’ve been asked what data profiling contributes to a data warehouse, and once in a while we even hear that an organization – pressed for time or trying to save money – wants to eliminate data profiling or tack it on at the end of a data warehouse build-out. This is an appeal to never, EVER, consider this as a time or cost-saving opportunity. In fact, I would argue that data profiling is a key part of any DW/BI project’s foundation.

Base your work on knowledge, not faith

Not executing a comprehensive analysis of the source data from a technical viewpoint means relying on other people’s thoughts. These are usually non-technical people who only deal with a small chunk of the available data in their daily activities. They don’t think about the source system as a whole, because it’s not their responsibility. In other words, you have to choose between two things: faith and knowledge. Data profiling turns guesses (faith) into confidence (knowledge).

What happens when you don’t do data profiling

Let me provide a couple of examples which can help you understand why data profiling is crucial for DW/BI projects.

Situation 1

On a DW/BI project we were asked to add a new data source with inventory information on a Stock Keeping Unit (SKU – 13 character codes) level of product hierarchy. At first glance the task was trivial because it looked consistent and straightforward. The source had attributed 13-character long titles to the SKUs. Once we performed our regular data profiling, however, we saw some inconsistencies. See the allocation of SKU values in the box below.

Only 70% of records were in line with business requirements and expectations. Analysis on the 30% that did not match the pattern revealed that these records defined a Kit level of product hierarchy, and additional complex transformations were required for shifting this data to the SKU level of information. Had we not profiled the data, we would have treated it all as SKU level, applying uniform processing rules, joining conditions, etc. This would have led to a data mess in the warehouse or data leaks during the ETL process.

Situation 2

In our practice we have found that sophisticated data profiling can increase business users’ understanding of the source system. One of the most frequently missed problems is deletions. For instance, users might know answers to the following questions:

  • What is the grain of the table?
  • What does each attribute mean?
  • How often is each attribute being updated?
  • How many records are being inserted daily?

However, they rarely realize that something has disappeared from the table. There are a wide range of business cases when deletes take place such as when inactive accounts are purged or when a cancelled order line is erased. Careful handling of the deletion process as the DW/BI system is being built can be crucial for trustworthy information later on. And if users don’t get reliable information, they will either stop using the system or report a bug that will require much more time and effort to fix later in the process than it would take to set it up correctly at the beginning.

In other words, ignoring data profiling can disrupt the whole DW/BI project. An upfront investment of a few hours for one analytical person’s time seems minimal when compared to the hundreds of possible hours that everyone from analysts to DBAs to promotion managers will spend later fixing poorly categorized data.

Checks and outcomes

Here are some of the primary data checks and outcomes we look to achieve with our data analytics development services at Coherent Solutions:

  1. Ensure that users’ expectations and requirements are in line with source system actuals.

  2. Learn/verify the granularity of different processes ongoing in the source system.

  3. Understand source system behavior, eg:

  • Bulk or transactional loads?

  • What operations are common for what entities (inserts/updates/deletes)?

  1. Identify data anomalies and create rules to isolate them.

  2. Learn defaults of the source system.

  3. Trace unknown semantic conformity between different source systems.

Share article

Get a Free Consultation with Our Experts!

Simply fill out our contact form below, and we will reach out to you within 1 business day to schedule a free 1-hour consultation covering platform selection, budgeting, and project timelines.

This field is required. Maximum 100 characters.
This field is required. Maximum 100 characters.
This field is required.
Only digits are applicable.
Maximum 2000 characters.
* Required field

An error occurred sending your message.
Try again or contact us via webinforequest@coherentsolutions.com.

Message sent!

Here's what happens next:

  1. Our sales rep will contact you within 1 day to discuss your case in more detail.
  2. Next, we will arrange a free 1-hour consultation with our experts on platform selection, budgeting, and timelines.
  3. After that, we’ll need 1-2 weeks to prepare a proposal, covering solutions, team requirements, cost & time estimates.
  4. Once approved, we will launch your project within 1-2 weeks.