Jim Harris
               Blogger‐in‐Chief
             www.ocdqblog.com



     E‐mail
     jim.harris@ocdqblog.com

     Twitter
     twitter.com/ocdqblog

     LinkedIn
     linkedin.com/in/jimharris




Adventures in Data Profiling     Copyright © 2010, Jim Harris. All rights reserved.   2
Let the Adventures Begin . . .
 This will be a vendor‐neutral presentation:

     Focusing on general methodology of data profiling 
     and common functionality of data profiling tools 

     Discussing how a data profiling tool helps automate 
     some of the work needed for preliminary data analysis

     Reviewing fictional data and results produced by a 
     fictional data profiling tool to illustrate basic concepts 


Understanding Your Data
     Understanding your data is essential to using it 
     effectively and improving its quality

     You need a reality check for the perceptions and 
     assumptions you have about the quality of your data 

     You need to prepare meaningful questions to ask your 
     business analysts and subject matter experts

     There is simply no substitute for data analysis 

Profiling Your Data
 Data profiling includes many types of analysis such as:

     Verify data matches the metadata that describes it

     Identify representations for the absence of data

     Identify potential default and invalid values

     Check data formats for inconsistencies

     Assess domain, structural, and relational integrity
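As a rough illustration of the first check in the list above (verifying that data matches its metadata), here is a minimal sketch. The function name, the field, and the metadata rules (a maximum length and a digits-only constraint) are hypothetical, chosen only to show the idea; a real profiling tool would read such rules from a data dictionary or schema.

```python
def check_against_metadata(values, max_length, numeric_only):
    """Return the values that violate the declared (hypothetical) metadata."""
    violations = []
    for v in values:
        if v is None:
            continue  # NULLs are counted separately, not treated as violations here
        text = str(v)
        too_long = len(text) > max_length
        not_numeric = numeric_only and not text.isdigit()
        if too_long or not_numeric:
            violations.append(v)
    return violations

# Fictional data: metadata says "5 characters, digits only".
account_numbers = ["00123", "0045A", "123456", None, "00999"]
bad = check_against_metadata(account_numbers, max_length=5, numeric_only=True)
print(bad)
```

Each violation is a question to bring to a subject matter expert rather than a confirmed error.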

Getting Your Data Freq On
     Data profiling tools can help you by automating some 
     of the grunt work needed to begin your data analysis

     One of their basic features is the ability to generate 
     statistical summaries and frequency distributions for 
     the unique values and formats found within your fields

  Therefore, I like to refer to using a data profiling tool as:
              “Getting Your Data Freq On”

Let Me Count The Ways


NULL – record count of NULL values

Missing – record count of Missing values (i.e., non‐NULL absence of data)

Actual – record count of Actual values (i.e., non‐NULL and non‐Missing)

Completeness – percentage calculated as Actual divided by total record count

Cardinality – count of the number of distinct actual values

Uniqueness – percentage calculated as Cardinality divided by total record count

Distinctness – percentage calculated as Cardinality divided by Actual
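The counts and percentages defined on this slide can be sketched in a few lines. This is an illustration under stated assumptions, not the output of any particular tool: the function name `profile_field` and the set of tokens treated as Missing are hypothetical, and what counts as Missing will differ from one data source to another.

```python
# Assumption: these strings represent a non-NULL absence of data ("Missing").
MISSING_TOKENS = {"", "N/A", "NA", "UNKNOWN", "NONE"}

def is_missing(value):
    return str(value).strip().upper() in MISSING_TOKENS

def profile_field(values):
    """Compute the slide's counts and percentages for one field."""
    total = len(values)
    null_count = sum(1 for v in values if v is None)
    missing_count = sum(1 for v in values if v is not None and is_missing(v))
    actual = [v for v in values if v is not None and not is_missing(v)]
    cardinality = len(set(actual))

    def pct(numerator, denominator):
        # Guard against empty fields so the sketch never divides by zero.
        return 100.0 * numerator / denominator if denominator else 0.0

    return {
        "total": total,
        "null": null_count,
        "missing": missing_count,
        "actual": len(actual),
        "cardinality": cardinality,
        "completeness": pct(len(actual), total),
        "uniqueness": pct(cardinality, total),
        "distinctness": pct(cardinality, len(actual)),
    }

# Fictional field with one NULL, two Missing, and five Actual values.
sample = ["A", "B", "B", None, "", "N/A", "C", "A"]
stats = profile_field(sample)
print(stats)
```

Note that Uniqueness and Distinctness share a numerator and differ only in the denominator, so Distinctness is never lower than Uniqueness.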
You Uniquely Complete Me
                                   Completeness and 
                                   Uniqueness are useful in 
                                   evaluating potential key 
                                   fields and especially a 
                                   single primary key, 
                                   which should be both: 
                                         100% Complete
                                         100% Unique
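A minimal sketch of that test for a candidate key field follows. The function name and sample fields are hypothetical, and for brevity only NULLs and blank strings are treated as absent data.

```python
def is_key_candidate(values):
    """True when the field is 100% Complete and 100% Unique."""
    total = len(values)
    # Assumption: only None and blank strings count as absent data here.
    actual = [v for v in values if v is not None and str(v).strip() != ""]
    complete = total > 0 and len(actual) == total
    unique = total > 0 and len(set(actual)) == total
    return complete and unique

# Fictional fields.
customer_ids = [101, 102, 103, 104]
postal_codes = ["02134", "02134", None, "10001"]

print(is_key_candidate(customer_ids))
print(is_key_candidate(postal_codes))
```

Passing this test on a sample does not prove the field is a key; it only means the data seen so far does not rule it out.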




It’s a Distinct Possibility



                                 Distinctness can be useful 
                                 in evaluating the potential
                                 for duplicate records
                                 < 100% Distinct means some 
                                 distinct actual values occur on 
                                 more than one record
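To see which values pull Distinctness below 100%, one option is to count occurrences and keep only the repeats. The sketch below uses Python's `collections.Counter`; the function name and the tax ID field are hypothetical.

```python
from collections import Counter

def repeated_values(actual_values):
    """Return each actual value that occurs on more than one record, with its count."""
    counts = Counter(actual_values)
    return {value: n for value, n in counts.items() if n > 1}

# Fictional field, already reduced to Actual values.
tax_ids = ["111", "222", "222", "333", "111", "444"]
repeats = repeated_values(tax_ids)
distinctness = 100.0 * len(set(tax_ids)) / len(tax_ids)

print(repeats)
print(distinctness < 100.0)
```

A repeated value only signals a potential duplicate record; whether the records really describe the same entity still needs review.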
Gimme the lo down, Drill‐down




Freq’ing Distribution of Values



                                        Frequency distribution of 
                                        values is useful for fields 
                                        with a low cardinality
                                        Extremely low cardinality 
                                        might be an indication of 
                                        default values
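For a low-cardinality field, a frequency distribution is just a count per distinct value with its share of the records. A minimal sketch, with a hypothetical function name and fictional gender codes:

```python
from collections import Counter

def value_frequencies(values):
    """Return (value, count, percent) rows, most frequent first."""
    counts = Counter(values)
    total = len(values)
    return [(v, n, 100.0 * n / total) for v, n in counts.most_common()]

# Fictional low-cardinality field.
gender_codes = ["M", "F", "F", "U", "F", "M", "U", "U", "U", "U"]
rows = value_frequencies(gender_codes)
for value, count, percent in rows:
    print(value, count, percent)
```

Here "U" covers half the records, which is the kind of result that raises the question of whether it is a default value.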

Reviewing the Top N List




Reviewing the Top N most 
frequently occurring values
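When cardinality is too high to review every value, the same count can be cut off at the N most frequent. A short sketch using `Counter.most_common`; the function name and the birth date field are hypothetical.

```python
from collections import Counter

def top_n(values, n=3):
    """Return the n most frequently occurring values with their counts."""
    return Counter(values).most_common(n)

# Fictional field where one date occurs suspiciously often.
birth_dates = [
    "1900-01-01", "1975-06-15", "1900-01-01", "1982-11-02",
    "1900-01-01", "1975-06-15", "1990-03-30",
]
top = top_n(birth_dates, n=2)
print(top)
```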
Freq’ing Distribution of Formats




Frequency distribution of formats is useful for fields having 
both a high cardinality and free‐form values
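One common way to get a format distribution is to reduce each value to a mask and count the masks. The sketch below assumes a simple convention (letters become "a", digits become "n", everything else is kept); real tools use their own mask alphabets, and the names here are hypothetical.

```python
from collections import Counter

def format_mask(value):
    """Map letters to 'a' and digits to 'n'; keep all other characters."""
    out = []
    for ch in value:
        if ch.isdigit():
            out.append("n")
        elif ch.isalpha():
            out.append("a")
        else:
            out.append(ch)
    return "".join(out)

# Fictional free-form phone field.
phones = [
    "555-123-4567", "(555) 123-4567", "555.123.4567",
    "555-987-6543", "unknown",
]
format_counts = Counter(format_mask(p) for p in phones)
for mask, count in format_counts.most_common():
    print(mask, count)
```

Five distinct values collapse to four formats here, and the odd one out ("aaaaaaa") is a representation for the absence of data.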
Unlocking the Combination




Combination of values 
and formats can help with 
unlocking the mystery of 
more complex fields
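Combining the two views can be as simple as grouping the actual values under the format each one reduces to. This is a sketch under the same assumed mask convention (letters to "a", digits to "n"), with hypothetical names and fictional postal codes.

```python
import re
from collections import defaultdict

def format_mask(value):
    # Letters are replaced first so the 'n' placeholders are not re-masked.
    masked = re.sub(r"[A-Za-z]", "a", value)
    return re.sub(r"[0-9]", "n", masked)

def values_by_format(values):
    """Group the original values under the format mask they share."""
    groups = defaultdict(list)
    for v in values:
        groups[format_mask(v)].append(v)
    return dict(groups)

# Fictional postal code field mixing several national formats.
postal = ["02134", "02134-1234", "K1A 0B1", "10001", "SW1A 1AA"]
groups = values_by_format(postal)
for mask, members in groups.items():
    print(mask, members)
```

Seeing the values behind each format helps decide whether an unusual format is invalid or simply a legitimate variation.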

. . . the Adventures Conclude
What can analysis of the data alone tell you about it?

    Understand your data better by first looking at it from a 
    starting point of blissful ignorance and curiosity

    A tool can help automate some of the grunt work, but 
    the actual data analysis cannot be automated

    Your analytical goal is not to try to find answers, but to 
    discover the right questions

