Data Profiling
using CA ERwin Modeling to assure data and metadata
abstract

• This session explores the use of data profiling to increase the
  accuracy of critical data assets and their associated data
  models/metadata. It includes examples of how clients have
  leveraged data profiling in combination with data modeling for
  master data management, data warehousing, data governance, and
  other data-intensive initiatives.




  PAGE 2
biography

• Antonio C. Amorin
  President, Data Innovations, Inc.
  – Eighteen years of data modeling experience and fourteen years of
    experience using CA ERwin® Data Modeler
  – Ten years of data profiling experience and two years of experience using CA
    ERwin® Data Profiler
  – Data Innovations, Inc. – CA Partner since 2004
  – Presented at CA World’08, CBI’s Life Sciences Forum on “Customer Data
    Quality and Integrity”, ERwin User Groups, in webcasts, and at client sites
  – Graduated from Illinois State University with a BA in Computer Science and
    a minor in Economics



  PAGE 3
agenda

•   Data Profiling
•   Data and Metadata Quality
•   Data Governance and Data Warehousing
•   Real-life Examples
•   Summary




    PAGE 4
data profiling




PAGE 5
data profiling

• What is data profiling?
  – The analysis of data content to infer metadata
  – A component of data modeling
• What are the basic components of the CA ERwin® Data Profiler?
  –   Column analysis
  –   PF (primary-foreign) key analysis
  –   Data object analysis
  –   Overlap analysis




  PAGE 6
data profiling

• Column analysis
    – Inferred metadata provides
      intimate knowledge of the data
      content at the column level
           •   Cardinality
           •   Range
           •   Mode
           •   Sparse
           •   Null count
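
The column-level indicators listed above can be illustrated with a minimal Python sketch. It is not the CA ERwin Data Profiler's implementation: the function name `profile_column`, the `sparse_threshold` cutoff, and the sample data are all assumptions made for illustration, and `None` stands in for NULL.

```python
from collections import Counter

def profile_column(values, sparse_threshold=0.5):
    """Return basic column-level statistics for a list of raw values.

    None stands in for NULL. sparse_threshold is an assumed cutoff for
    flagging a column as sparse, not a documented product setting.
    """
    non_null = [v for v in values if v is not None]
    null_count = len(values) - len(non_null)
    counts = Counter(non_null)
    return {
        # Number of distinct non-null values
        "cardinality": len(counts),
        # Smallest and largest non-null values
        "range": (min(non_null), max(non_null)) if non_null else None,
        # Most frequent non-null value
        "mode": counts.most_common(1)[0][0] if counts else None,
        "null_count": null_count,
        # Flag the column when the share of nulls exceeds the threshold
        "sparse": bool(values) and null_count / len(values) > sparse_threshold,
    }

# Hypothetical sample column
state_codes = ["IL", "IL", "WI", None, "IN", "IL", None]
profile = profile_column(state_codes)
print(profile)
```

A low cardinality on a column expected to be unique, or a high null count on a column modeled as mandatory, is the kind of signal that would prompt a closer look.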




  PAGE 7
data profiling

• Column analysis (continued)
            • Value frequencies
            • Pattern frequencies
            • Length frequencies
• Identify critical data elements
     – Allows the user to focus analysis on
       specific attributes
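
As a rough sketch of the three frequency views, the following Python builds value, pattern, and length frequency tables for one column. The pattern notation (A for a letter, 9 for a digit, other characters kept as-is) is an assumption chosen for illustration; the product's own notation may differ.

```python
from collections import Counter

def to_pattern(value):
    """Map each character to a class: A for letters, 9 for digits, others kept."""
    return "".join(
        "A" if ch.isalpha() else "9" if ch.isdigit() else ch
        for ch in value
    )

def frequencies(values):
    """Return value, pattern, and length frequency tables for non-null strings."""
    non_null = [v for v in values if v is not None]
    return {
        "value": Counter(non_null),
        "pattern": Counter(to_pattern(v) for v in non_null),
        "length": Counter(len(v) for v in non_null),
    }

# Hypothetical phone number column with inconsistent formatting
phones = ["309-555-0100", "309-555-0101", "3095550102", "n/a", None, "309-555-0100"]
freq = frequencies(phones)
for pattern, count in freq["pattern"].most_common():
    print(pattern, count)
```

Rare patterns and unexpected lengths tend to point directly at formatting problems or placeholder values such as "n/a".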




   PAGE 8
data profiling

• PF key analysis
    – Cross-table analysis of primary-foreign
      key relationships
           • Column matches
           • Classification
               – Parent-child
               – Reference
               – None




  PAGE 9
data profiling

• PF key analysis (continued)
    – Cross-table analysis of primary-foreign
      key relationships
            • Expressions
                 – table.column=table.column
            • Row hit rate
            • Value hit rate
            • Selectivity
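
The slides do not define row hit rate, value hit rate, or selectivity, so the sketch below works under assumed definitions: row hit rate is the share of non-null child rows whose value exists in the parent column, value hit rate is the share of distinct child values found in the parent column, and selectivity is the ratio of distinct values to non-null rows in the parent column. The function and column names are hypothetical.

```python
def pf_key_metrics(parent_values, child_values):
    """Compute assumed hit-rate and selectivity metrics for parent.col = child.col."""
    parent_non_null = [v for v in parent_values if v is not None]
    child_non_null = [v for v in child_values if v is not None]
    parent_set = set(parent_non_null)
    child_set = set(child_non_null)
    # Child rows whose value is present in the parent column
    row_hits = sum(1 for v in child_non_null if v in parent_set)
    return {
        "row_hit_rate": row_hits / len(child_non_null) if child_non_null else 0.0,
        "value_hit_rate": len(child_set & parent_set) / len(child_set) if child_set else 0.0,
        # 1.0 means every non-null parent value is unique (a key candidate)
        "selectivity": len(parent_set) / len(parent_non_null) if parent_non_null else 0.0,
    }

# Hypothetical expression: customer.customer_id = orders.customer_id
customer_ids = [1, 2, 3, 4]
order_customer_ids = [1, 1, 2, 9, None, 3]
metrics = pf_key_metrics(customer_ids, order_customer_ids)
print(metrics)
```

Under these definitions, a row hit rate below 1.0 indicates orphaned child rows, here the order that references customer 9.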




  PAGE 10
data profiling

• Data objects
    – Similar to subject areas
    – Groups together objects that
      contain the same data content
    – Based on the parent-child
      relationships
    – Creates an object view of related
      tables or files
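
One way to picture this grouping is as connected sets of tables linked by parent-child relationships. The Python sketch below makes that assumption explicitly; it is not the product's algorithm, and the table names and relationship pairs are hypothetical.

```python
from collections import defaultdict

def group_data_objects(tables, relationships):
    """Group tables into connected sets using (parent, child) pairs."""
    neighbors = defaultdict(set)
    for parent, child in relationships:
        neighbors[parent].add(child)
        neighbors[child].add(parent)

    seen = set()
    groups = []
    for table in tables:
        if table in seen:
            continue
        # Depth-first walk over everything reachable from this table
        stack, group = [table], set()
        while stack:
            current = stack.pop()
            if current in group:
                continue
            group.add(current)
            stack.extend(neighbors[current] - group)
        seen |= group
        groups.append(sorted(group))
    return groups

tables = ["customer", "order", "order_line", "product", "audit_log"]
relationships = [
    ("customer", "order"),
    ("order", "order_line"),
    ("product", "order_line"),
]
data_objects = group_data_objects(tables, relationships)
print(data_objects)
```

A table with no discovered relationships, such as `audit_log` here, ends up in a group of its own.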




  PAGE 11
data profiling

• Overlap analysis
    – Cross-system analysis that identifies
      data content overlap
    – Data Set Summary
            • Provides graphical overview
                 – Legend identifies color coded data
                   sources
                 – Allows modeler to visualize data
                   content overlap between data
                   sources




  PAGE 12
data profiling

• Overlap analysis (continued)
    – Data set overlaps
            • Table compares each data
              source to the other data
              sources
            • Allows comparison of two
              data sources at a time
            • Identifies the number of
              tables and columns that
              overlap between each data
              source




  PAGE 13
data profiling

• Overlap analysis (continued)
    – Column Summary
            • Identifies each column in
              the primary data source
            • Identifies value overlap
              between data sources
            • Allows modeler to use
              critical data elements to
              focus analysis
            • Allows modeler to drill into
              analysis to identify data
              content overlap
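
Value overlap between a column in the primary data source and a column in another source can be sketched as a comparison of distinct values. The function name, the percentage fields, and the sample CRM/ERP columns below are assumptions for illustration only.

```python
def value_overlap(primary_values, other_values):
    """Compare distinct non-null values of two columns from different sources."""
    primary = {v for v in primary_values if v is not None}
    other = {v for v in other_values if v is not None}
    shared = primary & other
    return {
        "shared_values": len(shared),
        # Share of the primary source's distinct values also found in the other source
        "primary_overlap_pct": 100 * len(shared) / len(primary) if primary else 0.0,
        # Share of the other source's distinct values also found in the primary source
        "other_overlap_pct": 100 * len(shared) / len(other) if other else 0.0,
    }

# Hypothetical country code columns from two systems
crm_country = ["US", "CA", "MX", "US", None]
erp_country = ["US", "CA", "DE", "FR"]
overlap = value_overlap(crm_country, erp_country)
print(overlap)
```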




  PAGE 14
data profiling

• Overlap analysis (continued)
    – Matches data preview
            • Allows the modeler to view hits
              or misses
            • Identifies specific data content
              that overlaps or does not
              overlap between each data
              source
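
A hits-and-misses preview can be approximated by splitting the primary source's distinct values into those found in the other source and those not found. This is a simplified sketch with hypothetical names and data, not the tool's preview logic.

```python
def hits_and_misses(primary_values, other_values):
    """Split distinct primary values into those found (hits) or not found (misses)."""
    other = {v for v in other_values if v is not None}
    distinct_primary = sorted({v for v in primary_values if v is not None})
    hits = [v for v in distinct_primary if v in other]
    misses = [v for v in distinct_primary if v not in other]
    return hits, misses

hits, misses = hits_and_misses(
    ["US", "CA", "MX", "US", None],   # primary data source
    ["US", "CA", "DE", "FR"],         # comparison data source
)
print("hits:", hits)
print("misses:", misses)
```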




  PAGE 15
data and metadata quality




PAGE 16
data and metadata quality

• Data
  – Business data - information utilized to operate the business
• Metadata
  – Information generated during the development of IT solutions
  – Defines both the business and technical understanding of the data
  – Utilized to store, process, and report on business data




  PAGE 17
data and metadata quality

• Data quality
  – Accuracy of the business data
  – High/low quality
  – Mission critical
• Metadata quality
  – Properly represents data content
  – Reflects valid parent-child relationships




  PAGE 18
data and metadata quality

• Leveraging data profiling
  – Use the cardinality, range, mode, and sparse indicators to identify attributes
    requiring detailed analysis
  – Identify data quality issues and validate data types using the value and
    pattern frequencies
  – Leverage the null count and length frequencies to validate column metadata
  – Validate parent-child relationships using the primary-foreign key analysis
  – Leverage the overlap analysis with reference tables containing valid values
    for data quality assessments
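
To illustrate how null counts and lengths can validate column metadata, the sketch below compares a column's declared nullability and length with what the data actually contains. The function name, parameters, and sample values are hypothetical.

```python
def validate_column_metadata(values, nullable, max_length):
    """Compare declared column metadata with the observed data content."""
    issues = []
    null_count = sum(1 for v in values if v is None)
    longest = max((len(v) for v in values if v is not None), default=0)
    if not nullable and null_count:
        issues.append(f"declared NOT NULL but {null_count} null value(s) found")
    if longest > max_length:
        issues.append(f"declared length {max_length} but longest value is {longest}")
    return issues

# Hypothetical column modeled as VARCHAR(20) NOT NULL
last_names = ["Smith", None, "A" * 24, "Lee"]
issues = validate_column_metadata(last_names, nullable=False, max_length=20)
for issue in issues:
    print(issue)
```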




  PAGE 19
data governance and data warehousing




PAGE 20
data governance and data warehousing

Leveraging data profiling for data governance
• Business Data
  – Standards
  – Master data management
  – Data quality assessments
• Metadata
  – Standards
  – Model validation




  PAGE 21
data governance and data warehousing

Leveraging data profiling for data governance (continued)
• Standards
  – Business data - valid values, data patterns, and standardized values for static
    data content
  – Metadata - validate that model metadata properly represents data content,
    and validate parent-child relationships
  – Automate the analysis with profiling
  – Develop profiling reports for each standard
  – Define and implement a review process
  – Integrate standards and review process into SDLC
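
A business data standard expressed as a valid-value list or a data pattern can be checked automatically. The Python sketch below is one possible shape for such a check; the function name, the use of regular expressions for patterns, and the sample standards are assumptions.

```python
import re

def check_standard(values, valid_values=None, pattern=None):
    """Return non-null values that violate a valid-value list or a regex pattern."""
    violations = []
    for v in values:
        if v is None:
            continue
        if valid_values is not None and v not in valid_values:
            violations.append(v)
        elif pattern is not None and not re.fullmatch(pattern, v):
            violations.append(v)
    return violations

# Hypothetical standards: a gender code list and a US ZIP code pattern
gender_violations = check_standard(["M", "F", "U", "m", None], valid_values={"M", "F", "U"})
zip_violations = check_standard(["61790", "6179", "61790-1234"], pattern=r"\d{5}(-\d{4})?")
print(gender_violations, zip_violations)
```

The violation lists could feed a per-standard profiling report that is examined as part of the review process.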




  PAGE 22
data governance and data warehousing

Leveraging data profiling for data governance (continued)
• Master data management (MDM)
  –   Locating reference data
  –   Data mapping
  –   Harmonizing reference data
  –   Establishing validation and syndication rules
  –   Identifying hub metadata
  –   Data quality assessments
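
Harmonizing reference data often comes down to mapping the variants found by profiling onto one standard value. The sketch below assumes a simple lookup table keyed on trimmed, upper-cased values; the mapping, function name, and data are hypothetical.

```python
def harmonize(values, mapping):
    """Map variant spellings to a standard value and collect anything unmapped."""
    harmonized, unmapped = [], set()
    for v in values:
        # Normalize before lookup so case and stray spaces do not matter
        key = v.strip().upper() if v is not None else None
        if key in mapping:
            harmonized.append(mapping[key])
        else:
            harmonized.append(v)
            if v is not None:
                unmapped.add(v)
    return harmonized, sorted(unmapped)

# Hypothetical mapping built from value frequencies across source systems
country_map = {"USA": "US", "U.S.": "US", "UNITED STATES": "US", "US": "US"}
harmonized, unmapped = harmonize(["usa", "U.S.", "US ", "Canada", None], country_map)
print(harmonized, unmapped)
```

Values left unmapped are candidates for new mapping entries or for data quality follow-up.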




  PAGE 23
data governance and data warehousing

Leveraging data profiling for data governance (continued)
• Data quality assessments
  –   Comprehensive review at the column level
  –   Validation of primary keys
  –   Validation of parent-child relationships
  –   Point-to-point content validation between systems
  –   Standardized analysis methodology
  –   Standardized problem notation
  –   Standardized reporting
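
Primary key validation, one of the assessment steps above, can be sketched as a check for null and duplicate key values. Rows are modeled here as dictionaries; the function name and sample rows are assumptions for illustration.

```python
from collections import Counter

def validate_primary_key(rows, key_columns):
    """Report null and duplicate key values for a candidate primary key."""
    keys = [tuple(row.get(col) for col in key_columns) for row in rows]
    # A key is null if any of its parts is missing
    null_keys = [k for k in keys if any(part is None for part in k)]
    duplicate_keys = [k for k, n in Counter(keys).items() if n > 1]
    return {
        "null_keys": null_keys,
        "duplicate_keys": duplicate_keys,
        "is_valid": not null_keys and not duplicate_keys,
    }

# Hypothetical rows with a single-column key
rows = [{"id": 1}, {"id": 2}, {"id": 2}, {"id": None}]
pk_report = validate_primary_key(rows, ["id"])
print(pk_report)
```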




  PAGE 24
data governance and data warehousing

Leveraging data profiling for data warehousing
• Data warehouse development
  – Leverage data models and data profiling results to locate and map business
    data to the data warehouse
  – Eliminate the code-load-explode development methodology for ETL
       • Profile each data source to validate data content
       • Identify accurate requirements for transformations to consolidate data content
         and correct data quality issues
  – Use profiling results to determine model metadata for target staging
    databases and the data warehouse
  – Profile the data warehouse regularly to ensure high-quality data content
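
As a rough illustration of deriving staging metadata from profiling results, the sketch below suggests a type, length, and nullability from raw string values. The type names and the inference rules are simplifying assumptions; for example, Python's `int` and `float` accept surrounding whitespace and `float` accepts "nan", so a production rule set would need to be stricter.

```python
def infer_column_metadata(values):
    """Suggest a type, maximum length, and nullability from raw string values."""
    non_null = [v for v in values if v is not None]

    def all_match(cast):
        # True when every non-null value converts cleanly with the given cast
        try:
            for v in non_null:
                cast(v)
            return True
        except ValueError:
            return False

    if non_null and all_match(int):
        inferred = "INTEGER"
    elif non_null and all_match(float):
        inferred = "DECIMAL"
    else:
        inferred = "VARCHAR"
    return {
        "type": inferred,
        "max_length": max((len(v) for v in non_null), default=0),
        "nullable": len(non_null) < len(values),
    }

# Hypothetical source columns extracted as text
quantity_meta = infer_column_metadata(["100", "250", None])
price_meta = infer_column_metadata(["9.99", "12.50"])
code_meta = infer_column_metadata(["A1", "B22"])
print(quantity_meta, price_meta, code_meta)
```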



  PAGE 25
real-life examples




 PAGE 26
real-life examples

Public computer hardware and software manufacturer
• Introduced data profiling into ongoing data warehousing
  project
  – Profiled first data source
       • Found questionable data content in financial data within ten minutes of
         profiling data
       • Realized that six months were wasted mapping from the data source to
         the target data warehouse
       • All new data sources were profiled going forward to ensure validity




  PAGE 27
real-life examples

Large public food manufacturer
• Introduced data profiling into sales data warehouse project
  – Leveraged data profiling results to create accurate ETL specifications,
    reducing the overall development time
  – Developers utilized data profiling to validate ETL unit testing
  – Used cross-system analysis to integrate data content from disparate data
    sources into standardized values in data warehouse
  – Profiled data warehouse regularly to identify data content issues




  PAGE 28
real-life examples

Public healthcare insurance provider
• Introduced data profiling into ongoing master data management
  project
  – Performed data content mapping utilizing profiling results
  – Analyzed IMS extracts and flat files to determine where reference data lived
    within legacy mainframe data sources
  – Leveraged profiling results to create ETL specifications
  – Harmonized reference data using profiling results
  – Validated reference data loaded into MDM hub




  PAGE 29
real-life examples

Medium-sized accounting service organization
• Created data store for reporting purposes
  – Profiled disparate data sources to identify model metadata for new data
    store
  – Leveraged profiling results to identify data quality issues for each data
    source
  – Created ETL specifications to consolidate data content from the disparate
    data sources using the profiling results
  – Validated data content in the loaded data store




  PAGE 30
summary

•   Data profiling
    –   Increases accuracy of data content and metadata
    –   Reduces project overruns
    –   Increases value of deliverables to the business
    –   Is valuable for master data management, data warehousing, data
        governance, and other data-intensive initiatives




    PAGE 31
questions and answers
thank you
