I have read a lot about data quality in a data warehouse and find it one of the thorniest issues out there. It’s all about building quality equity with the business owners. There are a lot of utilities that do data cleansing and data checks on the source side, but there is always something that can cause a data quality issue, or more appropriately an information quality issue, in a data warehouse. As they say, the devil is always in the details.
Scenario: Reporting a Data Issue
Let’s take a quick scenario: for a typical warehouse load, data is extracted from the source; massaged, scrubbed, and cleansed; transformed; and then loaded into the warehouse. The transformation could be simple, or it could be severely complex, like running a model on the data to arrive at the load-ready data. At a high level, all processes run fine: the number of records you started with from each source is accounted for, the measures you started with add up fine on the fact tables, and the summaries show up fine on the materialized views. Everything goes smoothly for a couple of months, until a business analyst, drilling down into a metric five or six levels deep in a dimension, reports a data issue.
If this situation has not happened to you, it is either a matter of time or you have been extremely lucky.
So by this time, you are already processing two to three months out, the YTD metrics have been accumulating on the summaries, and the numbers you have been presenting for the last three months have been garbage, at least around the issue. To correct it, you will need to roll back the summaries, drop fact partitions if that is what you use, and take the warehouse back three months so the processes can be rerun month by month. In most organizations, a host of global parameters (e.g., the load date) is set each time a run occurs, so there is additional process-management overhead to get the data back on track.
Proactive Model for Data Quality
I would like to describe an in-house solution that made our lives much easier and kept us a step ahead of the power users in identifying data issues. Essentially, we designed a proactive model rather than reacting to an issue found by the business long after the fact.
The idea behind this utility is to detect the percentage change of one or more measures by a dimension, or a set of dimensions, in the data warehouse. The change is measured against statistics gathered from a healthy run. By healthy run, I mean a run whose numbers have been blessed by the power users and validated and tested to be good in all respects.
Let me explain this data warehouse quality implementation in four steps:
1. Create audit measures
We created a set of audit measures, each a combination of one or more dimensions, and named them after the dimensions selected. Imagine all your dimensions spread horizontally on an Excel sheet; you pick which combination of these dimensions will make up an audit measure.
For a medium-size warehouse with 3 facts and 15 dimensions, we created about 45 metrics and, without using any fancy methods, used Excel and simple string concatenation to create the SQL that would run against the fact tables to produce the results. These results would then be compared against a good run and the variance placed into buckets in increments of 10%.
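As a rough sketch of that spreadsheet trick in code, the following generates one GROUP BY query per dimension combination. The fact table, measure, and dimension names here are hypothetical, chosen only for illustration:

```python
from itertools import combinations

# Hypothetical fact table, measure, and dimension columns; adjust to your schema.
FACT_TABLE = "sales_fact"
MEASURE = "sales_amount"
DIMENSIONS = ["product_key", "region_key", "channel_key"]

def audit_measure_sql(dims):
    """Build the GROUP BY query for one audit measure (a combination of dimensions)."""
    dim_list = ", ".join(dims)
    return (
        f"SELECT {dim_list}, SUM({MEASURE}) AS total_{MEASURE} "
        f"FROM {FACT_TABLE} GROUP BY {dim_list}"
    )

# One audit measure per combination of one or two dimensions,
# named after the dimensions it covers.
queries = {
    "_x_".join(dims): audit_measure_sql(dims)
    for r in (1, 2)
    for dims in combinations(DIMENSIONS, r)
}
```

With three dimensions this yields six audit measures; a real warehouse would pick the combinations that matter rather than enumerating all of them.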
2. Baseline and create variance buckets
We determined the facts we wanted to evaluate with respect to the audit measures and we created variance buckets starting from 0 to 100% in increments of 10%. We also had categories for ‘No Change’ and ‘New’, apart from the 10 buckets of variance, to capture new instances of the dimensions.
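The bucketing rule above can be sketched as a small function. Clamping changes above 100% into the top bucket is my assumption, since the article does not say how such cases were handled:

```python
def variance_bucket(baseline, current):
    """Map a (baseline, current) measure pair to a variance bucket label."""
    if baseline is None:
        # Dimension value not present in the healthy run.
        return "New"
    if current == baseline:
        return "No Change"
    pct = abs(current - baseline) / abs(baseline) * 100
    # Assumption: anything at or above 100% change falls into the top bucket.
    lower = min(int(pct // 10) * 10, 90)
    return f"{lower}-{lower + 10}%"
```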
3. Dynamically create a script and run it on the database
We created an SQL script to be run on the latest data load by putting together a spreadsheet that required little or no manual intervention. We then picked up the SQL and ran it against the database to produce current load statistics. It is worth mentioning that we never had performance issues running these scripts.
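To make the mechanics concrete, here is a minimal, self-contained demo of running one such audit query, using an in-memory SQLite database; the table, column, and key values are illustrative only, not from the original system:

```python
import sqlite3

# Tiny stand-in fact table; a real load would have millions of rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_fact (region_key TEXT, sales_amount REAL)")
conn.executemany(
    "INSERT INTO sales_fact VALUES (?, ?)",
    [("EAST", 100.0), ("EAST", 50.0), ("WEST", 75.0)],
)

# One generated audit query: total of the measure by one dimension.
sql = ("SELECT region_key, SUM(sales_amount) AS total "
       "FROM sales_fact GROUP BY region_key")

# Current load statistics, keyed by dimension value, ready to be
# compared against the statistics saved from the healthy run.
stats = dict(conn.execute(sql).fetchall())
```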
4. Plot and interpret the results
We then plotted the change in the audit measures against the variance-percentage buckets. For a good run, most of the metrics landed in the 0-10% variance range. Metrics undergoing 80, 90, or 100% change were the ones that demanded first attention and further data investigation. Looking at the audit measures with these suspicious variances and tracing them back to the matrix that defined each audit measure helped us narrow down the problem dimension very quickly, after which we took the necessary corrective actions.
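A minimal sketch of that interpretation step, assuming bucket labels have already been assigned for one audit measure (the sample labels below are made up):

```python
from collections import Counter

# Hypothetical bucket assignments for one audit measure, one label per
# dimension value, as produced when comparing the current load to baseline.
buckets = ["0-10%", "0-10%", "No Change", "90-100%", "0-10%", "80-90%"]

# The histogram is what we would plot: count of dimension values per bucket.
histogram = Counter(buckets)

# High-variance buckets demand first attention and further investigation.
suspicious = {b: n for b, n in histogram.items()
              if b in ("80-90%", "90-100%")}
```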
Try This At Home (well, at work)
I feel this methodology can be fairly easily customized to suit the audit needs of a data warehouse of any size. It would help in new implementations as well as in maintaining ongoing data warehouse runs.
By Balakrishna Dixit (Principal DW Consultant)