It is a best practice to start data profiling at the beginning of a data warehouse build-out. I addressed this in my first blog, providing a couple examples of how data and reporting can go very wrong if you don’t profile your data carefully from the start.
In this post I would like to demonstrate why custom data profile scripts are superior to any standard tools made for this purpose.
The problem with standardized tools
While starting the data profiling exercise at each iteration of a business intelligence/data warehouse project, the analyst always has a short list of key questions:
- What is the grain of each source entity?
- What are the data types of the columns required for this particular iteration? (Explicit , as defined in DDL or in the source system specification; and implicit, based on what data is actually flowing in.)
- What is the growth rate of each source entity?
- What operations occur for each source entity? EG: inserts/updates/deletes? How frequently for each?
As you can see, these basic questions can be split into two groups:
- Static properties (1 and 2)
- Behavioral characteristics (3 and 4)
Ready-made profiling tools only allow you to get answers for the first set of questions on static properties. Data is being analyzed “as-is” without built-in capabilities for comparing the same entity at different points in time. This means that to get a dynamic view of observed data you will need to create a custom solution.
What standardized tools do offer – cool graphics
It is worth saying that standardized tools can provide answers in the form of sophisticated graphs and diagrams. (See examples below.)
Caption: Microsoft’s Data Profiling Task provided in its SQL Server Integration Services toolbox
Caption: Ataccama DQ Analyzer
At first glance, such nice visualizations seem to be an advantage of standardized tools as they provide easily understandable interpretations of your profiling results. To create the same level of graphics in a custom solution would require debugging and polishing the user interface, which can takes a good deal of time.
But the real problem we have found with prefabricated metrics and their graphics is a lack of flexibility. For instance, to compare distribution of values between two tables you need to keep clicking your mouse to switch between different attributes/tables/projects, trying to grasp all the differences. This is not a reliable way to build a foundation for further DW/BI implementation. You will be
extracting data profiling results in raw form, and then applying some automated analytics to answer all your questions. So . . . welcome back custom solution.
Briefly, the two main drawbacks of generic data profiling tools are:
- The set of available metrics is very restricted.
- The options for showing your data profiling results is also limited.
For example, if you have queries on your data that the vendor hasn’t anticipated, you will be forced to write custom scripts. Additionally, you will have to teach non-technical people how to access and analyze data profiling results if you need to make decisions regarding unexpected data anomalies. At Coherent, we have concluded that it just is not reasonable to use standardized tools for most data profiling exercises.
Keep your eyes on the prize: good decisions
Ultimately, the goal of data profiling is to make good decisions based on data actuals. Thus, data profiling should be provided in a format that can be easily understood and accessed by any team member (non-technical or technical), as well as converted into other formats for other purposes.
We have found that Einstein’s version of Occam’s razor: “Everything should be made as simple as possible, but not simpler,” is the best rule for data profiling design. Our own rules include:
- Use only custom data profiling scripts because:
- This provides vital flexibility.
- You get all the metrics that you need.
- You get only those metrics that you need.
- You get metrics in the most suitable format for further processing.
- It will be easy to improve your metrics when necessary.
- Use common applications like Excel for presenting data profiling results to enhance understanding from all stakeholders, set proper expectations and reduce overall project risks.