DWLOGIC (Data Warehouse Logic, LLC) specializes in Architecture of Data Warehouse Systems. Site Map
Contact Us
 

Data Profiling

One of the common problems in any data integration and data warehouse project is to understand the source data. The data warehouse consists of data from different internal and external objects. These objects are sometimes poorly documented or their ERD is not available. Furthermore, the source objects are often not from the same single native source.
In many cases analyzing the sources and understanding the relationship between the objects is the Achilles‛ heel of the process.
Since analyzing the source is a complicated and expensive task, it is often postponed to " . . . when there will be more time and budget . . . " which usually results in this critical step being only partially completed or skipped altogether. The consequence is that the data integration is performed without sufficient knowledge of the data.
In addition, since the data profiling process is often skipped, no reverse ERD is created to support the integration process.
Therefore, it is most likely that poor (or no) data source profiling is the main reason for failures of data warehouse projects. The lack of visual models creates a barrier between the business customer and the technical team and between the target ODS and its sources.

Since the data source profiling and visual reversed ERD model is a fundamental step of the data warehouse, DWLOGIC developed a profiler-ERD-Engine mechanism. It is designed to almost automatically support the entire process ranging from gathering the data sources via data profiling, to breaking down the "pseudo" ERD model to subject areas.


Below are the profiler-ERD-Engine steps (on a macro level):

Step 1: Loading multiple data sources into the Profiler repository (ERD-Engine).
The purpose of this step is to unify all the source objects from the different data sources into a single schema in order to be able to analyze the spread sources as if they were one single source.

 
 

Step 2: Analyzing the individual source objects. Calculating statistics and identifying unique keys and "near" unique keys.
The purpose of this step is to collect metadata information about the individual source objects.

 
 

Step 3: Calculating potential FKs or partial FKs between the input objects.
This step would be finalized manually to minimize false positive FKs results. The clean version would be followed by reversing the objects and the RIs to a "pseudo" ERD model.
The purpose of this step is to create a "pseudo" ERD to simulate a source OLTP ERD that acts as a single source.

 
 

Step 4: Suggesting options to break the reversed "pseudo" ERD into subject areas for better visualizing of the source data.
The purpose of this step is to break the "pseudo" ERD into subject areas in order to simulate different business areas and allow an easier reading and understanding of hundreds of objects.