Don McMunn's FREEWARE Data Profiling Tool Kit

FREE Data Profiling Kit - Includes: Data profiling table DDL, Data profiling script generator, and instructions for DIY data profiling, free download, - IPC Designs, LLC
Don McMunn's FREEWARE Data Profiling Kit
Version 1.14
Released: 18-Sep-2007

Why? What? Download Data Profiling DDL Comments Home My Resume
Free DDTK
Free ETL MD
View Don McMunn's profile on LinkedIn
Please visit these handy sites:

Kimball Group - Microsoft Data Warehouse Toolkit Tools
COMMENTS:

"Excellent work!"

"Thanks for sharing these queries with the license."

"This site was very helpful."

"Starting a Data Quality program and need to profile data I've downloaded a trial of Datatris which looks like has very similar structures to this. It seems to me that I do not have to spend $800 for a licences when I can really build on your toolkit :)"

"Thanks for the freeware. It really was a big help. You saved me many hours and days of manual calculations."

"Thanks for the tools."

Purpose:


Every data warehouse, ETL architecture can benefit from an easy to produce data profiling of source data to determine data quality issues already present before the ETL processing begins or even as a periodic data quality assessment of systems of record.

What's in the box?
The Data Profiling Tool Kit contains the basic data structures to capture table and column-specific summary data profiling information. Included are: DDL to generate a data profiling metadata control table and the data profiling table/column summary repository that will contain basic column-specific metrics such as: table name, table comment, total number of rows in the table, column: name, comment, defined data type, defined data length, defined data precision, allows nulls?, number of null values, number of distinct/unique values found, column density, min/max/avg actual data value length, min/max date value, min/max nmeric values, min/max alphanumeric value, count of numeric values, count of date values, count of alphanumeric values, count of non-printable values, and that's the BASIC data profiling. More will be forthcoming, so stay tuned sports fans!
 
Errata:

None to date

LEGAL STUFF:


DISCLAIMER:
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

LICENSE:
Creative Commons License
Data Profiling Tool Kit by Don McMunn is licensed under a Creative Commons Attribution 3.0 License.
Based on a work at www.ipcdesigns.com.


The purpose of the data profiling tool kit is to provide a quick-start platform for assessing data quality of a given environment. From this information, further data quality assessments may be estimated, planned, designed, developed and executed in an iterative fashion to deliver metadata useful in preparing for a data warehousing ETL effort, simply provide a periodic quick assessment of data quality, or serve as a guide for more extensive D-I-Y data profiling efforts.


The process of creating a rapidly deployed D-I-Y data profiling application is made more feasible and hopefully more likely to be performed with the FREEWARE Data Profiling Toolkit.


Custom Search

INSTRUCTIONAL GUIDE:
NOTE: You must have sufficient database permissions to create tables, indexes, select, update, delete and insert data to perform all these steps. You may also need to be able to uncompress a Zip file to access the source data files.
  1. Download the SQL Server DDL or the equivalent DDL for Oracle to create the data profiling repository tables which will serve as the destination for the data profiling queries generated by the data profiling generator queries.
    Note:If your database is not SQL Server or Oracle, you will need to make the necessary adjustments to bring the DDL into syntactical compliance with the requirements of your RDBMS. If you create a new or modify a downloaded script for a new different database, please share it with me and I will make it available here for everyone to share, netizen. :o)
  2. create the tables to contain the output from the queries
    Note: the DML statements are for SQL Server and will need to be updated to the syntax of your particular non-SQL Server RDBMS.
  3. Run the queries (see the lower half of the script) to generate the data profiling query sets
  4. Run the resulting data profiling queries generated above, and save the results in the column summary table (PROFILE_COL_SUM).
  5. You now have a simple data profiling solution which can give you some basic data quality measures and prepare you for additional investigations
  6. The Oracle script contains both a column summary and detail analysis;sorry SQL Server, someday soon...
Useful SQL scripts will be added to this site by me (and hopefully by the community, too!) to provide additional data profiling functionality. So, make sure to check back from time to time.


Downloads


SQL Server scripts to produce and house a simple data profiling column summary
20Kb
Oracle 9i scripts to produce and house a simple data profiling column summary
19Kb
Prototype design for a relational database data profiling GUI using a tree structureOnline
Please drop us a comment. Your feedback helps. We do not spam


SQL Table Definition for PROFILE_META and PROFILE_COL_SUM:


For SQL Server: 
CREATE TABLE dbo.PROFILE_META 
(
  PROFILE_META_KEY int IDENTITY(1,1)
, PROFILE_DATE datetime NOT NULL CONSTRAINT DF_ETL_META_COL_SUM_CDTE DEFAULT (GETDATE())
, PROFILE_BY varchar(128) NOT NULL CONSTRAINT DF_ETL_META_COL_SUM_CBY DEFAULT (USER)
, PROFILE_VERSION decimal(10, 5) NOT NULL
, PROFILE_REASON varchar(255) NOT NULL
, PROFILE_DESCRIPTION varchar(2048) NOT NULL
);

ALTER TABLE dbo.PROFILE_META 
  ADD CONSTRAINT PKProfile_Meta 
  PRIMARY KEY CLUSTERED (PROFILE_META_KEY);

CREATE TABLE dbo.PROFILE_COL_SUM 
(
  PROFILE_COL_SUM_KEY INT IDENTITY(1,1)
, TNAME varchar (255) NOT NULL
, CNAME varchar (255) NOT NULL
, DATA_TYPE varchar (50) NOT NULL
, DATA_LENGTH int NULL
, DATA_PRECISION int NULL
, DATA_SCALE int NULL
, NULLABLE varchar (3) NULL
, NUM_TABLE_ROWS decimal(18, 0) NULL
, NUM_NULLS decimal(18, 0) NULL
, NUM_DISTINCT_VALUES decimal(18, 0) NULL
, DENSITY decimal(18, 5) NULL
, MIN_DATA_LENGTH int NULL
, MAX_DATA_LENGTH int NULL
, AVG_DATA_LENGTH int NULL
, MIN_DATE_VALUE datetime NULL
, MAX_DATE_VALUE datetime NULL
, MIN_NUMERIC_VALUE decimal(18, 5) NULL
, MAX_NUMERIC_VALUE decimal(18, 5) NULL
, MIN_ALPHANUM_VALUE varchar (2048) NULL
, MAX_ALPHANUM_VALUE varchar (2048) NULL
, NUMERIC_COUNT decimal(18) NULL
, DATE_COUNT decimal(18) NULL 
, ALPHANUM_COUNT decimal(18) NULL 
, NOPRINT_COUNT decimal(18) NULL 
, TABLE_COMMENT varchar (512) NULL
, COLUMN_COMMENT varchar (512) NULL
, PROFILE_META_KEY int NOT NULL
) ON [PRIMARY]
GO

ALTER TABLE dbo.PROFILE_COL_SUM 
  ADD CONSTRAINT PKProfile_Col_Sum 
  PRIMARY KEY CLUSTERED (PROFILE_COL_SUM_KEY)
GO




For Oracle:

CREATE TABLE DATA_PROFILE_COL_SUM
(
  TNAME            VARCHAR2(50 BYTE),
  CNAME            VARCHAR2(50 BYTE),
  DATA_TYPE        VARCHAR2(50 BYTE),
  DATA_LENGTH      INTEGER,
  DATA_PRECISION   INTEGER,
  DATA_SCALE       INTEGER,
  NULLABLE         CHAR(1 BYTE),
  NUM_DISTINCT     NUMBER(20),
  DENSITY          NUMBER(20,8),
  NUM_NULLS        NUMBER(20),
  COL_LENGTH       INTEGER,
  AVG_COL_LENGTH   NUMBER(20,2),
  CHAR_LENGTH      INTEGER,
  TABLE_COMMENTS   VARCHAR2(2048 BYTE),  
  COLUMN_COMMENTS  VARCHAR2(2048 BYTE),  
  RCOUNT           NUMBER(20),
  MIN_DATE         DATE,
  MAX_DATE         DATE,
  MIN_NUMBER       NUMBER(30,10),
  MAX_NUMBER       NUMBER(30,10),
  MIN_INTEGER      INTEGER,
  MAX_INTEGER      INTEGER,
  MIN_VARCHAR2     VARCHAR2(2048 BYTE),
  MAX_VARCHAR2     VARCHAR2(2048 BYTE),
  MIN_LEN          INTEGER,
  MAX_LEN          INTEGER,
  LAST_UPDATED	   DATE DEFAULT SYSDATE
);

create table DATA_PROFILE_COL_DTL
(
  tname                     varchar2(255) NOT NULL
, cname                     varchar2(255) NOT NULL
, freq_rank                 integer NOT NULL
, date_column_value         date
, integer_column_value      integer
, number_column_value       number
, varchar2_column_value     varchar2(2048)
, occurs                    integer
, total_rows                decimal(18,0)
, occurs_percent            decimal(18,5)
, capture_date              date default sysdate
);



Column Definitions


(Alphabetical by column name)
TableColumnDescription
PROFILE_COL_SUMALPHANUM_COUNTNumber of rows containing ONLY an alphanumeric value that is not either a date or a numeric only value
PROFILE_COL_SUMAVG_DATA_LENGTHAverage length of data values found in this column; mainly useful for string data types
PROFILE_COL_SUMCNAMEName of column in table being profiled
PROFILE_COL_SUMCOLUMN_COMMENTAny comment associated with the current column;
PROFILE_COL_SUMDATA_LENGTHMaximum length of the column in bytes
PROFILE_COL_SUMDATA_PRECISIONNumeric magnitude / vector length; for numeric columns only
PROFILE_COL_SUMDATA_SCALENumber of decimal places for numeric columns;
PROFILE_COL_SUMDATA_TYPEName of the data type of the column
PROFILE_COL_SUMDATE_COUNTNumber of rows containing ONLY a valid date value in this column
PROFILE_COL_SUMDATE_KEYCaptures metadata about each table/column in a data profiling pass; copyright c. 2007 IPC Designs, LLC; Author: Don McMunn;
PROFILE_COL_SUMDENSITYA ratio of non-null values to total rows in the table
PROFILE_COL_SUMMAX_ALPHANUM_VALUEmaximum alphanumeric value found in a string column; based on default page
PROFILE_COL_SUMMAX_DATA_LENGTHLongest length of data value found in this column; mainly useful for string data types
PROFILE_COL_SUMMAX_DATE_VALUEmost recent data found in this date column;
PROFILE_COL_SUMMAX_NUMERIC_VALUElargest numeric value found in a numeric column
PROFILE_COL_SUMMIN_ALPHANUM_VALUEminimum alphanumeric value found in a string column; based on default page
PROFILE_COL_SUMMIN_DATA_LENGTHShortest length of data value found in this column; mainly useful for string data types
PROFILE_COL_SUMMIN_DATE_VALUEoldest date value found in this column
PROFILE_COL_SUMMIN_NUMERIC_VALUEsmallest numeric value found in a numeric column
PROFILE_COL_SUMNOPRINT_COUNTNumber of columns containing non-printable ASCII values; useful for identifying UNICODE or potential data quality issues
PROFILE_COL_SUMNULLABLEYes if the column allows NULL values; No otherwise
PROFILE_COL_SUMNUM_DISTINCT_VALUESTotal number of unique values found in the column at the time of the data profiling pass
PROFILE_COL_SUMNUM_NULLSTotal number of NULL values found in the column at the time of the data profiling pass
PROFILE_COL_SUMNUM_TABLE_ROWSTotal number of rows contained in the table at the time of the data profiling pass
PROFILE_COL_SUMNUMERIC_COUNTNumber of rows containing ONLY numeric values in this column regardless of data type
PROFILE_COL_SUMPROFILE_COL_SUM_KEYInteger identity; surrogate primary key; uniquely identifies each row in the column data profile summary table
PROFILE_COL_SUMPROFILE_META_KEYA foreign key referencing an single row in the PROFILE_META table;
PROFILE_COL_SUMTABLE_COMMENTAny comment associated with the current table;
PROFILE_COL_SUMTNAMEName of table containing column being profiled
PROFILE_METAPROFILE_BYThe database user who started the data profiling pass;
PROFILE_METAPROFILE_DATEDate and time when the data profiling pass was begun; defaults to current server date and time;
PROFILE_METAPROFILE_DESCRIPTIONA description of what this data profiling pass is to address with its results;
PROFILE_METAPROFILE_META_KEYInteger identity; surrogate primary key; uniquely identifies each row in the data profile metadata table
PROFILE_METAPROFILE_REASONThe reason for performing this data profiling pass
PROFILE_METAPROFILE_VERSIONA numeric value to be used in tagging this data profile pass;

Send a comment:


Subject:
Name:
Message:
 


   Don McMunn is a professional, business-focused consultant specializing in data warehousing and
business intelligence design, development and delivery for over a decade with the
Fortune 500 and SMB segments. He may be reached at: dmcmunn at ipcdesigns dot com
 
You are the   viewer


Search Engine Optimization and SEO Tools

Changing LINKS