Don McMunn's FREEWARE Data Profiling Kit Version 1.14 Released: 18-Sep-2007
| Why? | What? | Download | Data Profiling DDL | Comments | Home | My Resume |
Free DDTK |
Free ETL MD |
|
|
||||||||
|
Please visit these handy sites: Kimball Group - Microsoft Data Warehouse Toolkit Tools | ||||||||
|
COMMENTS: "Excellent work!" "Thanks for sharing these queries with the license." "This site was very helpful." "Starting a Data Quality program and need to profile data I've downloaded a trial of Datatris which looks like has very similar structures to this. It seems to me that I do not have to spend $800 for a licences when I can really build on your toolkit :)" "Thanks for the freeware. It really was a big help. You saved me many hours and days of manual calculations." "Thanks for the tools." |
||||||||
|
Every data warehouse, ETL architecture can benefit from an easy to produce data profiling of source data to determine data quality issues already present before the ETL processing begins or even as a periodic data quality assessment of systems of record. What's in the box? The Data Profiling Tool Kit contains the basic data structures to capture table and column-specific summary data profiling information. Included are: DDL to generate a data profiling metadata control table and the data profiling table/column summary repository that will contain basic column-specific metrics such as: table name, table comment, total number of rows in the table, column: name, comment, defined data type, defined data length, defined data precision, allows nulls?, number of null values, number of distinct/unique values found, column density, min/max/avg actual data value length, min/max date value, min/max nmeric values, min/max alphanumeric value, count of numeric values, count of date values, count of alphanumeric values, count of non-printable values, and that's the BASIC data profiling. More will be forthcoming, so stay tuned sports fans! |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Errata: None to date | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
LEGAL STUFF:DISCLAIMER: THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. LICENSE:
Data Profiling Tool Kit by Don McMunn is licensed under a Creative Commons Attribution 3.0 License. Based on a work at www.ipcdesigns.com. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
The purpose of the data profiling tool kit is to provide a quick-start platform for assessing data quality of a given environment. From this information, further data quality assessments may be estimated, planned, designed, developed and executed in an iterative fashion to deliver metadata useful in preparing for a data warehousing ETL effort, simply provide a periodic quick assessment of data quality, or serve as a guide for more extensive D-I-Y data profiling efforts. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
The process of creating a rapidly deployed D-I-Y data profiling application is made more feasible and hopefully more likely to be performed with the FREEWARE Data Profiling Toolkit. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Custom Search
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
INSTRUCTIONAL GUIDE: NOTE: You must have sufficient database permissions to create tables, indexes, select, update, delete and insert data to perform all these steps. You may also need to be able to uncompress a Zip file to access the source data files.
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Useful SQL scripts will be added to this site by me (and hopefully by the community, too!) to provide additional data profiling functionality. So, make sure to check back from time to time. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Downloads | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
SQL Table Definition for PROFILE_META and PROFILE_COL_SUM: | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
For SQL Server: CREATE TABLE dbo.PROFILE_META ( PROFILE_META_KEY int IDENTITY(1,1) , PROFILE_DATE datetime NOT NULL CONSTRAINT DF_ETL_META_COL_SUM_CDTE DEFAULT (GETDATE()) , PROFILE_BY varchar(128) NOT NULL CONSTRAINT DF_ETL_META_COL_SUM_CBY DEFAULT (USER) , PROFILE_VERSION decimal(10, 5) NOT NULL , PROFILE_REASON varchar(255) NOT NULL , PROFILE_DESCRIPTION varchar(2048) NOT NULL ); ALTER TABLE dbo.PROFILE_META ADD CONSTRAINT PKProfile_Meta PRIMARY KEY CLUSTERED (PROFILE_META_KEY); CREATE TABLE dbo.PROFILE_COL_SUM ( PROFILE_COL_SUM_KEY INT IDENTITY(1,1) , TNAME varchar (255) NOT NULL , CNAME varchar (255) NOT NULL , DATA_TYPE varchar (50) NOT NULL , DATA_LENGTH int NULL , DATA_PRECISION int NULL , DATA_SCALE int NULL , NULLABLE varchar (3) NULL , NUM_TABLE_ROWS decimal(18, 0) NULL , NUM_NULLS decimal(18, 0) NULL , NUM_DISTINCT_VALUES decimal(18, 0) NULL , DENSITY decimal(18, 5) NULL , MIN_DATA_LENGTH int NULL , MAX_DATA_LENGTH int NULL , AVG_DATA_LENGTH int NULL , MIN_DATE_VALUE datetime NULL , MAX_DATE_VALUE datetime NULL , MIN_NUMERIC_VALUE decimal(18, 5) NULL , MAX_NUMERIC_VALUE decimal(18, 5) NULL , MIN_ALPHANUM_VALUE varchar (2048) NULL , MAX_ALPHANUM_VALUE varchar (2048) NULL , NUMERIC_COUNT decimal(18) NULL , DATE_COUNT decimal(18) NULL , ALPHANUM_COUNT decimal(18) NULL , NOPRINT_COUNT decimal(18) NULL , TABLE_COMMENT varchar (512) NULL , COLUMN_COMMENT varchar (512) NULL , PROFILE_META_KEY int NOT NULL ) ON [PRIMARY] GO ALTER TABLE dbo.PROFILE_COL_SUM ADD CONSTRAINT PKProfile_Col_Sum PRIMARY KEY CLUSTERED (PROFILE_COL_SUM_KEY) GO For Oracle: CREATE TABLE DATA_PROFILE_COL_SUM ( TNAME VARCHAR2(50 BYTE), CNAME VARCHAR2(50 BYTE), DATA_TYPE VARCHAR2(50 BYTE), DATA_LENGTH INTEGER, DATA_PRECISION INTEGER, DATA_SCALE INTEGER, NULLABLE CHAR(1 BYTE), NUM_DISTINCT NUMBER(20), DENSITY NUMBER(20,8), NUM_NULLS NUMBER(20), COL_LENGTH INTEGER, AVG_COL_LENGTH NUMBER(20,2), CHAR_LENGTH INTEGER, TABLE_COMMENTS VARCHAR2(2048 BYTE), COLUMN_COMMENTS VARCHAR2(2048 BYTE), RCOUNT NUMBER(20), MIN_DATE DATE, MAX_DATE DATE, MIN_NUMBER NUMBER(30,10), MAX_NUMBER NUMBER(30,10), MIN_INTEGER INTEGER, MAX_INTEGER INTEGER, MIN_VARCHAR2 VARCHAR2(2048 BYTE), MAX_VARCHAR2 VARCHAR2(2048 BYTE), MIN_LEN INTEGER, MAX_LEN INTEGER, LAST_UPDATED DATE DEFAULT SYSDATE ); create table DATA_PROFILE_COL_DTL ( tname varchar2(255) NOT NULL , cname varchar2(255) NOT NULL , freq_rank integer NOT NULL , date_column_value date , integer_column_value integer , number_column_value number , varchar2_column_value varchar2(2048) , occurs integer , total_rows decimal(18,0) , occurs_percent decimal(18,5) , capture_date date default sysdate ); | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Column Definitions(Alphabetical by column name) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Send a comment: | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Don McMunn is a professional, business-focused consultant specializing in data warehousing and business intelligence design, development and delivery for over a decade with the Fortune 500 and SMB segments. He may be reached at: dmcmunn at ipcdesigns dot com | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| You are the | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||