Your AI and ML Projects Are Failing
Key Steps to Get Them Back on Track
Harald Smith, Director Product Marketing
Housekeeping
Webcast Audio
• Today’s webcast audio is streamed through your computer speakers.
• If you need technical assistance with the web interface or audio,
please reach out to us using the chat window.
Questions Welcome
• Submit your questions at any time during the presentation
using the chat window.
• We will answer them during our Q&A session following the
presentation.
Recording and slides
• This webcast is being recorded. You will receive an
email following the webcast with a link to download
both the recording and the slides.
Speaker
Harald Smith
• Director of Product Marketing, Syncsort
• 20+ years in Information Management with a
focus on data quality, integration, and governance
• Co-author of Patterns of Information Management
• Author of two Redbooks on Information Governance
and Data Integration
• Blog author: “Data Democratized”
AI/ML needs
Data Quality
The importance of data quality
in the enterprise:
Only 35% of senior executives have a high level of trust in the accuracy of their Big Data Analytics
KPMG 2016 Global CEO Outlook
92% of executives are concerned about the negative impact of data and analytics on corporate reputation
KPMG 2017 Global CEO Outlook
80% of AI/ML projects are stalling due to poor data quality
Dimensional Research, 2019
“Societal trust in business is arguably at an all-time low and, in a world increasingly driven by data and technology, reputations and brands are ever harder to protect.”
• Decision making
• Customer centricity
• Compliance
• Machine learning & AI
EY “Trust in Data and Why it Matters”, 2017
“The magic of machine learning is that you build a statistical model based on the most valid dataset for the domain of interest. If the data is junk, then you’ll be building a junk model that will not be able to do its job.”
James Kobielus
SiliconANGLE Wikibon
Lead Analyst for Data Science, Deep Learning, App Development
2018
Key steps to improve Data Quality for AI/ML
Four foundational data steps to get or keep your AI and ML projects grounded and underway:
1. Frame the business problem
2. Identify the “right” data to collect and work with
3. Establish baselines of data quality through data profiling and business rules
4. Assess and communicate the fitness for purpose of the data for training and evaluating the subsequent models and algorithms
1. Frame the business problem
Common Machine Learning applications
Customer/Marketing
• Targeted marketing
• Recommendation engine
• Next best action
• Customer churn prevention
• Sentiment analysis
Risk Management
• Anti-money laundering
• Fraud detection (electricity pilferage, fraudulent transactions)
• Cybersecurity
• Know your customer
Supply Chain Management
• Reduction of freight costs/Optimal routing
• Damage identification/Mechanical repair
Universal DQ
best practices
Understand the End Goal
• How does the business intend to use
the data (i.e. what’s the use case)?
• Empower users (“Who”) to gain new
clarity into the core problem (“Why”)
• What will the data be used for?
• What defines the Fitness for
your Purpose?
Establish Scope
• Ask the “right questions” about
the use case and the data (not just
“what” and “how”)
• What data is relevant to the effort?
• Big Data or other, you need to set
boundaries for the work
Understand Context
• How does the business define
the data?
• What are the important
characteristics and context
of the data?
• What are the Critical Data
Elements?
• What qualities will you need
to address, or leave alone?
• “High-quality data” definition
will vary by business problem
“If you don’t know what you want
to get out of the data, how can you
know what data you need – and what
insight you’re looking for?”
Wolf Ruzicka, Chairman of the Board at EastBanc
Technologies, Blog post: June 1, 2017,
“Grow A Data Tree Out Of The “Big Data” Swamp”
“Never lead with a data set;
lead with a question.”
Anthony Scriffignano, Chief Data Scientist,
Dun & Bradstreet, Forbes Insights, May 31, 2017,
“The Data Differentiator”
2. Identify the “right” data
What’s the
“Right” Data?
Is relevant and specific for
the business problem
Is free from bias and
assumptions
Supports hypothesis testing
Ask questions about the data you expect
to need
Understand the Provenance of the data
• Who produced it, when did they
produce it, and why?
• Has it been transformed or
changed from original (lineage)?
Understand whether the data is
Comprehensive
• What is the scope of the data?
• What data is missing?
• Are approaches available to
identify/capture what is missing?
Understand the “universe” of Relevant
data
• Consider sources within and
outside the organization
Understand whether the data is Timely
• How can you be certain the
data is truly current?
Understand additional challenges
obtaining data, both for evaluation and
operational use
Comprehensive: a “Customer” example
Comprehensiveness depends on the business context/question
• Customer Engagement/Loyalty
  • Known customers, both active & inactive
• New Customer Campaigns
  • “Active” consumers, both known and unknown
• Fraud Detection
  • Any known or unknown person impersonating a customer or prospect
Ask/understand what the “Unknown and/or Unavailable” represents
• Why does this segment exist?
• If relevant, can the characteristics be inferred through other data?
• Is there inherent bias in leaving this group out?
Unknown & Active
• Prospect
• Data in CRM? Website visits? Store visits? Prospect lists?
Known & Active
• Customer
• Data in MDM/DW
• What about Call Center? CRM? Website visits? Store visits? Loyalty Program?
Unknown and/or unavailable
• Not a customer
• No data? Or is data available through other means?
Known & Inactive
• Former Customer
• Data in MDM/DW?
• What about Call Center? CRM? Website visits? Store visits? Loyalty Program?
Relevant: a “Customer” example
Relevance for additional data depends on the business context/questions
• Customer Engagement/Loyalty
  • Website, Call Center, Social Media, Location, Store Data, Demographics
• New Customer Campaigns
  • Location, Demographics, Website, Social Media, Prospect Lists
• Optimal Shipping/Delivery
  • Location, Weather, Store Data
• Fraud Detection
  • IP Address, Device ID, Purchase Location, etc.
Additional content from both internal and external sources may be relevant if within a useful time period
• Change of Address, Suppression lists, etc.
[Diagram: data surrounding the “Customer”: Location; Demographics; Social Media; Website, Call Center, Store, etc.; Other (Weather, Prospect Lists, etc.); Order Transactions; Call Transcriptions; Product/Service Reviews; Abandoned Carts; Census Data; Credit History]
Five further challenges to enable Machine Learning
1. Lack of data, or scattered and difficult-to-access datasets
• Little or no accessible data; or necessary data trapped in mainframes, operational systems, or streams.
• Data typically stored in incompatible formats.
• Other data must be acquired, appended, or transformed for use.
2. Data standardization, cleansing, and enrichment at scale
• Data needs to be tagged, classified, standardized, and normalized.
• Data quality standardization, cleansing, enrichment, and preparation need to be applied consistently and reproduced at scale.
3. Entity resolution and customer identification
• Distinguishing single entity matches across massive datasets requires sophisticated multi-pass, multi-field matching algorithms.
• Continuous cross-comparison and resolution need to occur as new data arrives.
4. Need for near real-time current data
• Tracking and detection need to happen very rapidly.
• Current transactions are constantly added to combined datasets and presented to models as close to real-time as possible.
5. Tracking lineage from the source
• Data changes made to help train models have to be exactly duplicated in production.
• Capture of complete lineage, from source to end point, is needed.
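The multi-pass, multi-field matching mentioned under entity resolution can be illustrated with a small sketch. This is not the method of any particular product: the field names (`email`, `name`, `postal_code`), the two passes, and the `normalize` helper are assumptions chosen for illustration, and real matching would typically add fuzzy comparison and scoring.

```python
from collections import defaultdict

def normalize(text):
    """Lowercase and keep only letters and digits so formatting differences do not block a match."""
    return "".join(ch for ch in text.lower() if ch.isalnum())

# Hypothetical passes: each tuple names the fields that together form one match key.
MATCH_PASSES = [
    ("email",),
    ("name", "postal_code"),
]

def resolve_entities(records):
    """Group record indexes that match on any pass, merging groups with union-find."""
    parent = list(range(len(records)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving keeps lookups short
            i = parent[i]
        return i

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[max(root_a, root_b)] = min(root_a, root_b)

    for fields in MATCH_PASSES:
        first_seen = {}
        for index, record in enumerate(records):
            values = [normalize(record.get(field, "")) for field in fields]
            if not all(values):
                continue  # a record missing any field of this pass cannot match on it
            key = tuple(values)
            if key in first_seen:
                union(first_seen[key], index)
            else:
                first_seen[key] = index

    clusters = defaultdict(list)
    for index in range(len(records)):
        clusters[find(index)].append(index)
    return sorted(clusters.values())

records = [
    {"name": "Ann Lee", "email": "ann@example.com", "postal_code": "01803"},
    {"name": "ANN LEE", "email": "", "postal_code": "01803"},
    {"name": "A. Lee", "email": "Ann@Example.com", "postal_code": ""},
    {"name": "Bob Ray", "email": "bob@example.com", "postal_code": "10001"},
]
clusters = resolve_entities(records)
print(clusters)
```

Here the first three records end up in one cluster even though no single pass links all of them, which is the reason for running several passes over different field combinations.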
3. Establish baselines
of Data Quality
Data Quality challenges with Machine Learning
Incorrect, incomplete, mis-formatted, and sparse “dirty data”
• Mistakes and errors are rarely the patterns you are looking for.
• Sparse data generates other issues or may be ignored as “noise”.
• Correcting and standardizing data boosts the signal, but can increase bias.
Missing context
• Insufficient information about customer and location data can make many
ML algorithms unusable.
• Enriching data increases context, but choice of source can skew/bias result.
Duplicates and multiple copies
• Many sources can yield multiple records about the same person, company,
product or other entity, skewing the signal and outcomes.
• Removing duplicates enhances the overall depth and accuracy about a
single entity, but must watch for over- or undermatching of data.
Spurious correlations
• Inclusion of already correlated data (e.g. city and postal code) may result in
overfitting of ML algorithms or ‘false’ discoveries.
Correcting data problems vastly increases a data set’s usefulness for machine learning.
Caution! Data analysts may not be aware of specific data quality issues that must be addressed to support machine learning.
Traditional data quality processes are an effective method to identify defects.
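One way to spot already correlated fields such as city and postal code before they reach a model is to measure how often one field fully determines the other. The sketch below is a simple illustration under assumed field names; the function `determination_ratio` is hypothetical, and a ratio near 1.0 only suggests redundancy rather than proving it.

```python
from collections import defaultdict

def determination_ratio(rows, source_field, target_field):
    """Share of rows whose source value maps to exactly one target value."""
    targets = defaultdict(set)
    for row in rows:
        targets[row[source_field]].add(row[target_field])
    determined = sum(1 for row in rows if len(targets[row[source_field]]) == 1)
    return determined / len(rows) if rows else 0.0

rows = [
    {"postal_code": "01803", "city": "Burlington"},
    {"postal_code": "01803", "city": "Burlington"},
    {"postal_code": "10001", "city": "New York"},
    {"postal_code": "60601", "city": "Chicago"},
]
ratio = determination_ratio(rows, "postal_code", "city")
print(ratio)
```

When the ratio is close to 1.0, the second field adds little independent information, so it may be worth reviewing whether both fields should be used as model inputs.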
Data Quality best practices
Understand Context
• What Critical Data Elements and other attributes are relevant?
• What qualities need to be addressed, or left alone?
• When, and where, do we need to transform or enrich the data content?
• How are we connecting, relating, or combining data?
Develop, Test, and Deploy Corrective Measures
• Consistent application of standardization, transformation, enrichment, and entity resolution
• Common templates, rules, metrics, and processes that can be leveraged
• Validation and measurement after corrective measures applied
• Deploy into batch, real-time, or embedded services
Apply Data Governance
• Implement metrics and measures for ongoing assessment and evaluation
• Establish baselines for ongoing comparison/evaluation
• Continue to iterate throughout data preparation and model testing
Tools for
DQ analysis
Data Profiling
The set of analytical techniques
that evaluate actual data content
(vs. metadata) to provide a
complete view of each data
element in a data source.
Provides summarized inferences,
and details of value and pattern
frequencies to quickly gain data
insights.
Business Rules
The data quality or validation rules
that help ensure that data is “fit for
use” in its intended operational
and decision-making contexts.
Assess the dimensions of data
quality: accuracy, completeness,
consistency, relevance, timeliness,
& validity of data.
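As a rough illustration of the value and pattern frequencies that data profiling produces, the sketch below profiles a single column. The pattern notation (letters become `A`, digits become `9`) and the function names are assumptions for this example, not the output format of any specific profiling tool.

```python
from collections import Counter

def value_pattern(value):
    """Map letters to 'A' and digits to '9'; keep every other character as-is."""
    return "".join(
        "A" if ch.isalpha() else "9" if ch.isdigit() else ch
        for ch in value
    )

def profile_column(values):
    """Summarize one column: counts, distinct values, and value and pattern frequencies."""
    populated = [v for v in values if v is not None and v.strip() != ""]
    return {
        "row_count": len(values),
        "populated_count": len(populated),
        "distinct_count": len(set(populated)),
        "value_frequencies": Counter(populated),
        "pattern_frequencies": Counter(value_pattern(v) for v in populated),
    }

# Illustrative postal code column with a short value, a blank, and a missing value.
postal_codes = ["01803", "10001", "1803", "01803", "", "K1A 0B1", None]
profile = profile_column(postal_codes)
print(profile["pattern_frequencies"].most_common())
```

Rare patterns in the output (here the four-digit value) are natural candidates for a closer look or for a business rule.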
Common Data Quality measurements
What measures can we take advantage of?
1. Completeness – Are the relevant fields populated?
2. Integrity – Does the data maintain an internal structural integrity
or a relational integrity across sources?
3. Uniqueness – Are keys or records unique?
4. Validity – Does the data have the correct values?
• Code and reference values
• Valid ranges
• Valid value combinations
5. Consistency – Is the data at consistent levels of
aggregation or does it have consistent valid
values over time?
6. Timeliness – Did the data arrive in
a time period that makes it
useful or usable?
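Several of these measurements can be expressed as simple ratios over a set of records. The sketch below covers completeness, uniqueness, validity, and timeliness only; the record layout, the reference values in `VALID_STATUS_CODES`, the age range, and the 30-day window are all hypothetical choices made for illustration.

```python
from collections import Counter
from datetime import date

VALID_STATUS_CODES = {"active", "inactive"}  # hypothetical reference values
MAX_AGE_DAYS = 30                            # hypothetical timeliness window

def measure_quality(records, as_of):
    """Return the share of records (0.0 to 1.0) that pass each check."""
    total = len(records)
    if total == 0:
        return {}
    id_counts = Counter(r["customer_id"] for r in records)
    complete = sum(1 for r in records if r["email"])
    unique = sum(1 for r in records if id_counts[r["customer_id"]] == 1)
    valid = sum(
        1 for r in records
        if r["status"] in VALID_STATUS_CODES and 0 <= r["age"] <= 120
    )
    timely = sum(1 for r in records if (as_of - r["updated"]).days <= MAX_AGE_DAYS)
    return {
        "completeness": complete / total,
        "uniqueness": unique / total,
        "validity": valid / total,
        "timeliness": timely / total,
    }

records = [
    {"customer_id": 1, "email": "ann@example.com", "status": "active", "age": 34, "updated": date(2019, 6, 1)},
    {"customer_id": 2, "email": "", "status": "active", "age": 41, "updated": date(2019, 5, 20)},
    {"customer_id": 2, "email": "bob@example.com", "status": "closed", "age": 29, "updated": date(2019, 1, 15)},
    {"customer_id": 3, "email": "cy@example.com", "status": "inactive", "age": 230, "updated": date(2019, 6, 10)},
]
metrics = measure_quality(records, as_of=date(2019, 6, 15))
print(metrics)
```

Which fields count as relevant, and which thresholds are acceptable, depends on the business problem framed in step 1.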
New Data Quality problems
New data, new data quality challenges
• 3rd Party and external data with unknown provenance, timeliness, or relevance
• Bias in the data – whether in collection, extraction, or other processing
• Data without standardized structure or formatting
• Continuously streaming data
• Disjointed data (e.g. gaps in comprehensiveness or receipt)
• Consistency and verification of data sources (e.g. was the origination verified?)
• Changes and transformation applied to data (i.e. does it really represent the original input?)
“34 percent of bankers in our survey report that their organization has been the target of adversarial AI at least once, and 78 percent believe automated systems create new risks, such as fake data, external data manipulation, and inherent bias.”
Accenture Banking Technology Vision 2018
4. Assess & communicate
fitness for purpose
Assess Fitness for Purpose
Work within the defined Business Frame!
• Reconfirm the business purpose and context
• Review the data attributes deemed critical and the criteria that require validation
Test and validate data for identified DQ measurements
• Apply data profiling and established business rules
• Establish baselines!
• Evaluate and determine necessary actions/remediate issues
• Take action on incorrect data and defaults
• Create flags for subsequent use in marking or remediating data
Annotate what you’ve found
• Identify each attribute/criterion and annotate all issues
• Utilize flags, tags, and other indicators to help others distinguish the type and severity of issues
Establish, document, and present Fitness for Purpose
Iterate for all data in use, as well as model validation
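Flagging records and comparing current measurements with a baseline might look like the sketch below. The flag names, the `1900-01-01` placeholder date, the metric names, and the tolerance are assumptions for illustration; the point is that issues are marked for later use rather than silently corrected.

```python
def flag_record(record):
    """Return issue flags for one record instead of silently fixing or dropping it."""
    flags = []
    if not record.get("email"):
        flags.append("MISSING_EMAIL")
    if record.get("signup_date") == "1900-01-01":
        flags.append("DEFAULT_DATE")  # assumed placeholder default, not a real value
    return flags

def compare_to_baseline(baseline, current, tolerance=0.02):
    """List the metrics that dropped more than `tolerance` below the baseline."""
    return sorted(
        name for name, base_value in baseline.items()
        if base_value - current.get(name, 0.0) > tolerance
    )

# Hypothetical metric values on a 0.0 to 1.0 scale.
baseline = {"completeness": 0.97, "validity": 0.97}
current = {"completeness": 0.96, "validity": 0.90}
regressions = compare_to_baseline(baseline, current)
print(regressions)
print(flag_record({"email": "", "signup_date": "1900-01-01"}))
```

Keeping the flags alongside the data lets model builders decide whether to exclude, impute, or keep the affected records for a given use case.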
Communicate!
Culture of Data Literacy
“Democratization of Data” requires cultural support
• Empowered to ask questions about the data
• Trained to understand the business context and use of data
• Trained to understand how to approach and evaluate data quality
• Traditional data, new data, machine learning requirements, …
• Empowered to prove/reject hypotheses
Program of Data Governance
• Provide the processes and practices necessary for success
• Measure, monitor, and improve
• Continuous iteration and development
• Communicate what you’ve discovered! (and where others can find it!)
Center of Excellence/Knowledge Base
• Where do you go to find answers?
• Who can help show you how?
Summary
Keep AI/ML projects focused
It is challenging to keep the
business frame/value in mind!
• Data comes from multiple
disparate systems & sources
• The business context may not
be obvious based on data alone
• There is a higher demand and
expectation for seeing data
quality in context.
• You need to assess and measure
the data content to establish
both baselines and common
understanding
4 Key Steps
1. Remember the end goal – ask
questions, use best practices, and
establish scope & context
2. Consider what data is needed
• Focus your attention based
on the type of data and the
use case
• Consider how you can ensure
data is comprehensive,
relevant, and useful
3. Test rules to validate data quality,
establish baselines, communicate
findings, and build trust!
4. Assess and communicate fitness
for purpose
Gaining insight and
measurement of
data quality is more
critical than ever!
Further Resources
• Data Profiling: The First Step to Big Data Quality
• Emerging Data Quality Trends for Governing and Analyzing Big Data
• Introducing Trillium DQ for Big Data: Powerful Profiling and Data
Quality for the Data Lake
harald.smith@syncsort.com
Questions
Your AI and ML Projects Are Failing – Key Steps to Get Them Back on Track

More Related Content

PDF
Applying Data Quality Best Practices at Big Data Scale
PDF
Foundational Strategies for Trust in Big Data Part 2: Understanding Your Data
PPTX
How to use your data science team: Becoming a data-driven organization
PDF
Foundational Strategies for Trust in Big Data Part 1: Getting Data to the Pla...
PDF
3 Strategies to drive more data driven outcomes in financial services
PPTX
How to Implement a Spend Analytics Program Using Machine Learning
PDF
Analytics - Trends and Prospects
PPTX
5 Data Quality Issues
Applying Data Quality Best Practices at Big Data Scale
Foundational Strategies for Trust in Big Data Part 2: Understanding Your Data
How to use your data science team: Becoming a data-driven organization
Foundational Strategies for Trust in Big Data Part 1: Getting Data to the Pla...
3 Strategies to drive more data driven outcomes in financial services
How to Implement a Spend Analytics Program Using Machine Learning
Analytics - Trends and Prospects
5 Data Quality Issues

What's hot (20)

PDF
Data Management Meets Human Management - Why Words Matter
PDF
Data-Ed: Monetizing Data Management
PPT
Competing on analytics
PDF
How to Consume Your Data for AI
PDF
Helping HR to Cross the Big Data Chasm
PDF
Applications of AI in Supply Chain Management: Hype versus Reality
PDF
Industry Focus Camp SCB17 "How to build a data driven organization"
PPT
2011 digital trends webinar presentation
PDF
Data Modeling Techniques
PDF
Death of the Dashboard
PPTX
MLOps - Getting Machine Learning Into Production
PDF
Metadata Matters: Business Critical Metadata
PDF
How Enterprises are Using NoSQL for Mission-Critical Applications
PDF
Data-Ed Webinar: Data Quality Success Stories
PDF
Big Challenges in Data Modeling: Modeling Metadata
PDF
Webinar: Decoding the Mystery - How to Know if You Need a Data Catalog, a Dat...
PDF
How to Create a Data Analytics Roadmap
 
PDF
DataEd Online: Building the Case for the Top Data Job
PDF
The ABC of Data Governance: driving Information Excellence
PPTX
Information Asset Management in Financial Institutions: How Much Is It Really...
Data Management Meets Human Management - Why Words Matter
Data-Ed: Monetizing Data Management
Competing on analytics
How to Consume Your Data for AI
Helping HR to Cross the Big Data Chasm
Applications of AI in Supply Chain Management: Hype versus Reality
Industry Focus Camp SCB17 "How to build a data driven organization"
2011 digital trends webinar presentation
Data Modeling Techniques
Death of the Dashboard
MLOps - Getting Machine Learning Into Production
Metadata Matters: Business Critical Metadata
How Enterprises are Using NoSQL for Mission-Critical Applications
Data-Ed Webinar: Data Quality Success Stories
Big Challenges in Data Modeling: Modeling Metadata
Webinar: Decoding the Mystery - How to Know if You Need a Data Catalog, a Dat...
How to Create a Data Analytics Roadmap
 
DataEd Online: Building the Case for the Top Data Job
The ABC of Data Governance: driving Information Excellence
Information Asset Management in Financial Institutions: How Much Is It Really...
Ad

Similar to Your AI and ML Projects Are Failing – Key Steps to Get Them Back on Track (20)

PDF
Data Profiling: The First Step to Big Data Quality
PDF
Engineering Machine Learning Data Pipelines Series: Big Data Quality - Cleans...
PPTX
Do You Trust Your Machine Learning Outcomes?
PPTX
Marketsoft and marketing cube data quality to cc-v3
PPTX
Deliveinrg explainable AI
PDF
Top Tips to Get Your Data AI-Ready‎ ‎ ‎‎ ‎
PPTX
Transform Your Downstream Cloud Analytics with Data Quality 
PDF
Data science guide
PDF
Learn How to Make Machine Learning Work
PPTX
HIPAA De-Identification: Ensuring Privacy and Compliance in Healthcare Data
PDF
Become a citizen data scientist
PDF
AI-Led-Cognitive-Data-Quality.pdf
PDF
Executive Briefing: Why managing machines is harder than you think
PDF
'The Well-Oiled Data Machine' from Experian Data Quality
PDF
Re-orienting your business around data
PDF
Data Quality Success Stories
PDF
Gabor Koncz – AI in email marketing: email conversion optimization in eCommerce
PPTX
KDD 2019 IADSS Workshop - Skills to Master Machine Learning and Data Science ...
PDF
Data Science and Culture
PPTX
Why Data Science Projects Fail
Data Profiling: The First Step to Big Data Quality
Engineering Machine Learning Data Pipelines Series: Big Data Quality - Cleans...
Do You Trust Your Machine Learning Outcomes?
Marketsoft and marketing cube data quality to cc-v3
Deliveinrg explainable AI
Top Tips to Get Your Data AI-Ready‎ ‎ ‎‎ ‎
Transform Your Downstream Cloud Analytics with Data Quality 
Data science guide
Learn How to Make Machine Learning Work
HIPAA De-Identification: Ensuring Privacy and Compliance in Healthcare Data
Become a citizen data scientist
AI-Led-Cognitive-Data-Quality.pdf
Executive Briefing: Why managing machines is harder than you think
'The Well-Oiled Data Machine' from Experian Data Quality
Re-orienting your business around data
Data Quality Success Stories
Gabor Koncz – AI in email marketing: email conversion optimization in eCommerce
KDD 2019 IADSS Workshop - Skills to Master Machine Learning and Data Science ...
Data Science and Culture
Why Data Science Projects Fail
Ad

More from Precisely (20)

PDF
The Future of Automation: AI, APIs, and Cloud Modernization.pdf
PDF
Unlock new opportunities with location data.pdf
PDF
Reimagining Insurance: Connected Data for Confident Decisions.pdf
PDF
Introducing Syncsort™ Storage Management.pdf
PDF
Enable Enterprise-Ready Security on IBM i Systems.pdf
PDF
A Day in the Life of Location Data - Turning Where into How.pdf
PDF
Get More from Fiori Automation - What’s New, What Works, and What’s Next.pdf
PDF
Solving the CIO’s Dilemma: Speed, Scale, and Smarter SAP Modernization.pdf
PDF
Solving the Data Disconnect: Why Success Hinges on Pre-Linked Data.pdf
PDF
Cooking Up Clean Addresses - 3 Ways to Whip Messy Data into Shape.pdf
PDF
Building Confidence in AI & Analytics with High-Integrity Location Data.pdf
PDF
SAP Modernization Strategies for a Successful S/4HANA Journey.pdf
PDF
Precisely Demo Showcase: Powering ServiceNow Discovery with Precisely Ironstr...
PDF
The 2025 Guide on What's Next for Automation.pdf
PDF
Outdated Tech, Invisible Expenses – How Data Silos Undermine Operational Effi...
PDF
Modernización de SAP: Maximizando el Valor de su Migración a SAP S/4HANA.pdf
PDF
Outdated Tech, Invisible Expenses – The Hidden Cost of Disconnected Data Syst...
PDF
Migration vers SAP S/4HANA: Un levier stratégique pour votre transformation d...
PDF
Outdated Tech, Invisible Expenses: The Hidden Cost of Poor Data Integration o...
PDF
The Changing Compliance Landscape in 2025.pdf
The Future of Automation: AI, APIs, and Cloud Modernization.pdf
Unlock new opportunities with location data.pdf
Reimagining Insurance: Connected Data for Confident Decisions.pdf
Introducing Syncsort™ Storage Management.pdf
Enable Enterprise-Ready Security on IBM i Systems.pdf
A Day in the Life of Location Data - Turning Where into How.pdf
Get More from Fiori Automation - What’s New, What Works, and What’s Next.pdf
Solving the CIO’s Dilemma: Speed, Scale, and Smarter SAP Modernization.pdf
Solving the Data Disconnect: Why Success Hinges on Pre-Linked Data.pdf
Cooking Up Clean Addresses - 3 Ways to Whip Messy Data into Shape.pdf
Building Confidence in AI & Analytics with High-Integrity Location Data.pdf
SAP Modernization Strategies for a Successful S/4HANA Journey.pdf
Precisely Demo Showcase: Powering ServiceNow Discovery with Precisely Ironstr...
The 2025 Guide on What's Next for Automation.pdf
Outdated Tech, Invisible Expenses – How Data Silos Undermine Operational Effi...
Modernización de SAP: Maximizando el Valor de su Migración a SAP S/4HANA.pdf
Outdated Tech, Invisible Expenses – The Hidden Cost of Disconnected Data Syst...
Migration vers SAP S/4HANA: Un levier stratégique pour votre transformation d...
Outdated Tech, Invisible Expenses: The Hidden Cost of Poor Data Integration o...
The Changing Compliance Landscape in 2025.pdf

Recently uploaded (20)

PDF
Network Security Unit 5.pdf for BCA BBA.
PDF
Architecting across the Boundaries of two Complex Domains - Healthcare & Tech...
PPTX
Understanding_Digital_Forensics_Presentation.pptx
PPTX
Cloud computing and distributed systems.
PDF
The Rise and Fall of 3GPP – Time for a Sabbatical?
PDF
Reach Out and Touch Someone: Haptics and Empathic Computing
PPTX
Big Data Technologies - Introduction.pptx
PDF
KodekX | Application Modernization Development
PDF
TokAI - TikTok AI Agent : The First AI Application That Analyzes 10,000+ Vira...
PDF
Modernizing your data center with Dell and AMD
PPTX
20250228 LYD VKU AI Blended-Learning.pptx
PDF
Encapsulation theory and applications.pdf
PDF
Build a system with the filesystem maintained by OSTree @ COSCUP 2025
PPTX
PA Analog/Digital System: The Backbone of Modern Surveillance and Communication
PPTX
A Presentation on Artificial Intelligence
PDF
NewMind AI Weekly Chronicles - August'25 Week I
PPTX
VMware vSphere Foundation How to Sell Presentation-Ver1.4-2-14-2024.pptx
PDF
Encapsulation_ Review paper, used for researhc scholars
PDF
Shreyas Phanse Resume: Experienced Backend Engineer | Java • Spring Boot • Ka...
PDF
Machine learning based COVID-19 study performance prediction
Network Security Unit 5.pdf for BCA BBA.
Architecting across the Boundaries of two Complex Domains - Healthcare & Tech...
Understanding_Digital_Forensics_Presentation.pptx
Cloud computing and distributed systems.
The Rise and Fall of 3GPP – Time for a Sabbatical?
Reach Out and Touch Someone: Haptics and Empathic Computing
Big Data Technologies - Introduction.pptx
KodekX | Application Modernization Development
TokAI - TikTok AI Agent : The First AI Application That Analyzes 10,000+ Vira...
Modernizing your data center with Dell and AMD
20250228 LYD VKU AI Blended-Learning.pptx
Encapsulation theory and applications.pdf
Build a system with the filesystem maintained by OSTree @ COSCUP 2025
PA Analog/Digital System: The Backbone of Modern Surveillance and Communication
A Presentation on Artificial Intelligence
NewMind AI Weekly Chronicles - August'25 Week I
VMware vSphere Foundation How to Sell Presentation-Ver1.4-2-14-2024.pptx
Encapsulation_ Review paper, used for researhc scholars
Shreyas Phanse Resume: Experienced Backend Engineer | Java • Spring Boot • Ka...
Machine learning based COVID-19 study performance prediction

Your AI and ML Projects Are Failing – Key Steps to Get Them Back on Track

  • 1. Your AI and ML Projects Are Failing Key Steps to Get Them Back on Track Harald Smith, Director Product Marketing
  • 2. Housekeeping Webcast Audio • Today’s webcast audio is streamed through your computer speakers. • If you need technical assistance with the web interface or audio, please reach out to us using the chat window. Questions Welcome • Submit your questions at any time during the presentation using the chat window. • We will answer them during our Q&A session following the presentation. Recording and slides • This webcast is being recorded. You will receive an email following the webcast with a link to download both the recording and the slides. 2
  • 3. Speaker Harald Smith • Director of Product Marketing, Syncsort • 20+ years in Information Management with a focus on data quality, integration, and governance • Co-author of Patterns of Information Management • Author of two Redbooks on Information Governance and Data Integration • Blog author: “Data Democratized” 3
  • 4. AI/ML needs Data Quality The importance of data quality in the enterprise: 35%of senior executives have a high level of trust in the accuracy of their Big Data Analytics KPMG 2016 Global CEO Outlook 92%of executives are concerned about the negative impact of data and analytics on corporate reputation KPMG 2017 Global CEO Outlook 80%of AI/ML projects are stalling due to poor data quality Dimensional Research, 2019 “Societal trust in business is arguably at an all-time low and, in a world increasingly driven by data and technology, reputations and brands are ever harder to protect.” • Decision making • Customer centricity • Compliance • Machine learning & AI 4 EY “Trust in Data and Why it Matters”, 2017 Only
  • 5. “ ” The magic of machine learning is that you build a statistical model based on the most valid dataset for the domain of interest. If the data is junk, then you’ll be building a junk model that will not be able to do its job. James Kobeilus SiliconANGLE Wikibon Lead Analyst for Data Science, Deep Learning, App Development 2018
  • 6. 1 Key steps to improve Data Quality for AI/ML Identify the “right” data to collect and work with Establish baselines of data quality through data profiling and business rules Assess and communicate the fitness for purpose of the data for training and evaluating the subsequent models and algorithms 6 Four foundational data steps to get or keep your AI and ML projects grounded and underway: Frame the business problem 2 3 4
  • 7. 1. Frame the business problem
  • 8. Common Machine Learning applications Customer/Marketing • Targeted marketing • Recommendation engine • Next best action • Customer churn prevention • Sentiment analysis Risk Management • Anti-money laundering • Fraud detection (electricity pilferage, fraudulent transactions) • Cybersecurity • Know your customer Supply Chain Management • Reduction of freight costs/Optimal routing • Damage identification/Mechanical repair 8
  • 9. Universal DQ best practices Understand the End Goal • How does the business intend to use the data (i.e. what’s the use case)? • Empower users (“Who”) to gain new clarity into the core problem (“Why”) • What will the data be used for? • What defines the Fitness for your Purpose? Establish Scope • Ask the “right questions” about the use case and the data (not just “what” and “how”) • What data is relevant to the effort? • Big Data or other, you need to set boundaries for the work Understand Context • How does the business define the data? • What are the important characteristics and context of the data? • What are the Critical Data Elements? • What qualities will you need to address, or leave alone? • “High-quality data” definition will vary by business problem “If you don’t know what you want to get out of the data, how can you know what data you need – and what insight you’re looking for?” Wolf Ruzicka, Chairman of the Board at EastBanc Technologies, Blog post: June 1, 2017, “Grow A Data Tree Out Of The “Big Data” Swamp” “Never lead with a data set; lead with a question.” Anthony Scriffignano, Chief Data Scientist, Dun & Bradstreet, Forbes Insights, May 31, 2017, “The Data Differentiator” 9
  • 10. 2. Identify the “right” data
  • 11. What’s the “Right” Data? Is relevant and specific for the business problem Is free from bias and assumptions Supports hypothesis testing Ask questions about the data you expect you need Understand the Provenance of the data • Who produced it, when did they produce it, and why? • Has it been transformed or changed from original (lineage)? Understand whether the data is Comprehensive • What is the scope of the data? • What data is missing? • Are approaches available to identify/capture what is missing? Understand the “universe” of Relevant data • Consider sources within and outside the organization Understand whether the data is Timely • How can you be certain the data is truly current? Understand additional challenges obtaining data, both for evaluation and operational use 11
  • 12. Comprehensiveness depends on the business context/question • Customer Engagement/Loyalty  • Known customers, both active & inactive • New Customer Campaigns  • “Active” consumers, both known and unknown • Fraud Detection  • Any known or unknown person impersonating a customer or prospect Ask/understand what the “Unknown and/or Unavailable”  represents • Why does this segment exist? • If relevant, can the characteristics be inferred through other data? • Is there inherent bias in leaving this group out? Comprehensive: a “Customer” example Unknown & Active • Prospect • Data in CRM? Website visits? Store visits? Prospect lists? Known & Active • Customer • Data in MDM/DW • What about Call Center? CRM? Website visits? Store visits? Loyalty Program? Unknown and/or unavailable • Not a customer • No data? Or is data available through other means? Known & Inactive • Former Customer • Data in MDM/DW? • What about Call Center? CRM? Website visits? Store visits? Loyalty Program?     12
  • 13. Relevance for additional data depends on the business context/questions • Customer Engagement/Loyalty • Website, Call Center, Social Media, Location, Store Data, Demographics • New Customer Campaigns • Location, Demographics, Website, Social Media, Prospect Lists • Optimal Shipping/Delivery • Location, Weather, Store Data • Fraud Detection • IP Address, Device ID, Purchase Location, etc. Additional content from both internal and external sources may be relevant if within a useful time period • Change of Address, Suppression lists, etc. Relevant: a “Customer” example “Customer” Location Demographics Social Media Website, Call Center, Store, etc. Other: Weather, Prospect Lists, etc. Order Transactions Call Transcriptions Product/ Service Reviews Abandoned Carts Census Data Credit History 13
  • 14. 1. Lack of data, or scattered and difficult to access datasets • Little or no accessible data; or necessary data trapped in mainframes, operational systems, or streams. • Data typically stored in incompatible formats. • Other data must be acquired, appended, or transformed for use. 2. Data standardization, cleansing, and enrichment at scale • Data needs to be tagged, classified, standardized, and normalized. • Data quality standardization, cleansing, enrichment, and preparation needs to be applied consistently and reproduced at scale. 3. Entity resolution and customer identification • Distinguishing single entity matches across massive datasets requires sophisticated multi-pass, multi-field matching algorithms. • Continuous cross-comparison and resolution needs to occur as new data arrives. 4. Need for near real-time current data • Tracking and detection needs to happen very rapidly. • Current transactions constantly added to combined datasets and presented to models as close to real-time as possible. 5. Tracking lineage from the source • Data changes made to help train models have to be exactly duplicated in production. • Capture of complete lineage, from source to end point is needed. Five further challenges to enable Machine Learning 14
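The multi-pass, multi-field matching named in challenge 3 can be sketched as follows. This is a minimal illustration, not the approach of any particular product: the field names (`email`, `name`, `postal_code`), the two passes, and the exact-match-after-normalization rule are all assumptions standing in for the more sophisticated matching the slide refers to.

```python
def normalize(text):
    """Lowercase and collapse whitespace so trivially different values compare equal."""
    return " ".join(text.lower().split())

# Hypothetical match passes: each pass compares records on a different field combination.
PASSES = [("email",), ("name", "postal_code")]

def resolve(records):
    """Group record indexes that match on any pass, using a small union-find."""
    parent = list(range(len(records)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for fields in PASSES:
        seen = {}
        for idx, rec in enumerate(records):
            key = tuple(normalize(rec.get(f) or "") for f in fields)
            if not all(key):
                continue  # skip records missing any field used by this pass
            if key in seen:
                parent[find(idx)] = find(seen[key])  # link to the earlier match
            else:
                seen[key] = idx

    clusters = {}
    for idx in range(len(records)):
        clusters.setdefault(find(idx), []).append(idx)
    return sorted(clusters.values())

records = [
    {"name": "Jane Doe", "email": "jane@example.com", "postal_code": "10001"},
    {"name": "J. Doe", "email": "JANE@example.com", "postal_code": "10001"},
    {"name": "jane doe", "email": "", "postal_code": "10001"},
    {"name": "John Roe", "email": "john@example.com", "postal_code": "60601"},
]
clusters = resolve(records)
```

Here the first pass links records 0 and 1 by email, and the second pass links record 2 to record 0 by name and postal code, so the three end up in one cluster even though no single pass connects them all. Real matching would typically add fuzzy comparison and safeguards against over-matching.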
  • 16. Data Quality challenges with Machine Learning Incorrect, incomplete, mis-formatted, and sparse “dirty data” • Mistakes and errors are rarely the patterns you are looking for. • Sparse data generates other issues or may be ignored as “noise”. • Correcting and standardizing data boosts the signal, but can increase bias. Missing context • Insufficient information about customer and location data can make many ML algorithms unusable. • Enriching data increases context, but the choice of source can skew/bias the result. Duplicates and multiple copies • Many sources can yield multiple records about the same person, company, product, or other entity, skewing the signal and outcomes. • Removing duplicates enhances the overall depth and accuracy of information about a single entity, but you must watch for over- or under-matching of data. Spurious correlations • Inclusion of already correlated data (e.g. city and postal code) may result in overfitting of ML algorithms or ‘false’ discoveries. Correcting data problems vastly increases a data set’s usefulness for machine learning. But data analysts may not be aware of specific data quality issues that must be addressed to support machine learning. Traditional data quality processes are an effective method to identify defects. 16
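One way to spot already correlated fields such as city and postal code before they reach a model is to measure how strongly one field determines the other. The sketch below is an assumption-laden illustration: the function name `dependency_strength` and the sample rows are hypothetical, and the score is a simple heuristic rather than a formal statistical test.

```python
from collections import defaultdict

def dependency_strength(rows, determinant, dependent):
    """Share of rows whose `dependent` value is the most common one
    for their `determinant` value (1.0 means fully determined)."""
    counts = defaultdict(lambda: defaultdict(int))
    for row in rows:
        counts[row[determinant]][row[dependent]] += 1
    total = sum(sum(v.values()) for v in counts.values())
    if total == 0:
        return 0.0
    explained = sum(max(v.values()) for v in counts.values())
    return explained / total

# Hypothetical sample data
rows = [
    {"postal_code": "10001", "city": "New York"},
    {"postal_code": "10001", "city": "New York"},
    {"postal_code": "60601", "city": "Chicago"},
    {"postal_code": "60601", "city": "Chicago"},
    {"postal_code": "94105", "city": "San Francisco"},
]
strength = dependency_strength(rows, "postal_code", "city")
```

A score at or near 1.0 suggests the dependent field adds little independent information. The score is also trivially high when the determinant is nearly unique per row (an ID column, for example), so it should be read alongside the field's distinct-value count.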
  • 17. Understand Context • What Critical Data Elements and other attributes are relevant? • What qualities need to be addressed, or left alone? • When, and where, do we need to transform or enrich the data content? • How are we connecting, relating, or combining data? Develop, Test, and Deploy Corrective Measures • Consistent application of standardization, transformation, enrichment, and entity resolution • Common templates, rules, metrics, and processes that can be leveraged • Validation and measurement after corrective measures applied • Deploy into batch, real-time, or embedded services Apply Data Governance • Implement metrics and measures for ongoing assessment and evaluation • Establish baselines for ongoing comparison/evaluation • Continue to iterate throughout data preparation and model testing Data Quality best practices 17
  • 18. Tools for DQ analysis Data Profiling The set of analytical techniques that evaluate actual data content (vs. metadata) to provide a complete view of each data element in a data source. Provides summarized inferences, and details of value and pattern frequencies to quickly gain data insights. Business Rules The data quality or validation rules that help ensure that data is “fit for use” in its intended operational and decision-making contexts. Assess the dimensions of data quality: accuracy, completeness, consistency, relevance, timeliness, & validity of data. 18
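Value and pattern frequencies, as described under Data Profiling, can be illustrated with a small sketch. The function names and the pattern notation (`9` for a digit, `A` for a letter) are assumptions chosen for illustration; profiling tools differ in what they report.

```python
from collections import Counter

def pattern_of(value):
    """Reduce a value to its shape: digits become 9, letters become A, other characters are kept."""
    out = []
    for ch in value:
        if ch.isdigit():
            out.append("9")
        elif ch.isalpha():
            out.append("A")
        else:
            out.append(ch)
    return "".join(out)

def profile_column(values):
    """Summarize one column from its actual content rather than its metadata."""
    non_null = [v for v in values if v is not None and v.strip() != ""]
    return {
        "count": len(values),
        "null_count": len(values) - len(non_null),
        "distinct_count": len(set(non_null)),
        "value_frequencies": Counter(non_null),
        "pattern_frequencies": Counter(pattern_of(v) for v in non_null),
    }

# Hypothetical phone number column
phones = ["555-0100", "555-0101", "5550102", "", None, "555-0100"]
profile = profile_column(phones)
```

In this sample the pattern frequencies show three values shaped `999-9999` and one shaped `9999999`, which surfaces a formatting inconsistency without anyone having to write a rule for it first.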
  • 19. Common Data Quality measurements What measures can we take advantage of? 1. Completeness – Are the relevant fields populated? 2. Integrity – Does the data maintain an internal structural integrity or a relational integrity across sources? 3. Uniqueness – Are keys or records unique? 4. Validity – Does the data have the correct values? • Code and reference values • Valid ranges • Valid value combinations 5. Consistency – Is the data at consistent levels of aggregation or does it have consistent valid values over time? 6. Timeliness – Did the data arrive in a time period that makes it useful or usable? 19
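Three of these measurements (completeness, uniqueness, and validity) reduce to simple ratios. The sketch below shows one possible way to compute them; the record layout, the reference list of state codes, and the age range are hypothetical, and empty values are excluded from the uniqueness and validity denominators by assumption.

```python
def is_populated(value):
    return value is not None and value != ""

def completeness(records, field):
    """Share of records in which the field is populated."""
    if not records:
        return 0.0
    return sum(1 for r in records if is_populated(r.get(field))) / len(records)

def uniqueness(records, key):
    """Distinct populated key values divided by populated key values."""
    keys = [r.get(key) for r in records if is_populated(r.get(key))]
    return len(set(keys)) / len(keys) if keys else 0.0

def validity(records, field, is_valid):
    """Share of populated values that pass the supplied rule."""
    values = [r.get(field) for r in records if is_populated(r.get(field))]
    return sum(1 for v in values if is_valid(v)) / len(values) if values else 0.0

# Hypothetical records and rules
records = [
    {"id": 1, "state": "NY", "age": 34},
    {"id": 2, "state": "", "age": 210},
    {"id": 2, "state": "ZZ", "age": 51},
    {"id": 4, "state": "CA", "age": None},
]
VALID_STATES = {"NY", "CA", "IL"}  # stand-in for a reference table

state_completeness = completeness(records, "state")
id_uniqueness = uniqueness(records, "id")
state_validity = validity(records, "state", lambda s: s in VALID_STATES)
age_validity = validity(records, "age", lambda a: 0 <= a <= 120)
```

The validity function takes the rule as a parameter so the same measurement covers both reference-value checks and range checks.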
  • 20. New data, new data quality challenges • 3rd Party and external data with unknown provenance, timeliness, or relevance • Bias in the data – whether in collection, extraction, or other processing • Data without standardized structure or formatting • Continuously streaming data • Disjointed data (e.g. gaps in comprehensiveness or receipt) • Consistency and verification of data sources (e.g. was the origination verified?) • Changes and transformation applied to data (i.e. does it really represent the original input?) New Data Quality problems “34 percent of bankers in our survey report that their organization has been the target of adversarial AI at least once, and 78 percent believe automated systems create new risks, such as fake data, external data manipulation, and inherent bias.” Accenture Banking Technology Vision 2018 20
  • 21. 4. Assess & communicate fitness for purpose
  • 22. Work within the defined Business Frame! • Reconfirm the business purpose and context • Review the data attributes deemed critical and the criteria that required validation Test and validate data for identified DQ measurements • Apply data profiling and established business rules • Establish baselines! • Evaluate and determine necessary actions/remediate issues • Take action on incorrect data and defaults • Create flags for subsequent use in marking or remediating data Annotate what you’ve found • Identify each attribute/criteria and annotate all issues • Utilize flags, tags, and other indicators to help others distinguish the type and severity of issues Establish, document, and present Fitness for Purpose Iterate for all data in use, as well as model validation Assess Fitness for Purpose 22
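Establishing a baseline and creating flags for later use can be sketched as a comparison between a stored set of metrics and the current run. The metric names, the tolerance, and the two severity levels below are illustrative assumptions, not thresholds taken from the presentation.

```python
def compare_to_baseline(baseline, current, tolerance=0.02):
    """Return a flag for every metric that dropped below its baseline by more than the tolerance."""
    flags = []
    for metric, expected in baseline.items():
        observed = current.get(metric)
        if observed is None:
            flags.append({"metric": metric, "severity": "high",
                          "note": "metric missing from current run"})
            continue
        drop = expected - observed
        if drop > 2 * tolerance:
            severity = "high"
        elif drop > tolerance:
            severity = "medium"
        else:
            continue  # within tolerance: no flag
        flags.append({"metric": metric, "severity": severity,
                      "note": f"dropped from {expected:.2f} to {observed:.2f}"})
    return flags

# Hypothetical baseline and current measurements
baseline = {"email_completeness": 0.98, "customer_id_uniqueness": 1.00, "state_validity": 0.95}
current = {"email_completeness": 0.97, "customer_id_uniqueness": 0.93, "state_validity": 0.92}
flags = compare_to_baseline(baseline, current)
```

The resulting flags carry a severity and a note, so they can be attached to the data or to a fitness-for-purpose report to help others distinguish the type and severity of each issue.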
  • 23. Culture of Data Literacy “Democratization of Data” requires cultural support • Empowered to ask questions about the data • Trained to understand the business context and use of data • Trained to understand how to approach and evaluate data quality • Traditional data, new data, machine learning requirements, … • Empowered to prove/reject hypotheses Program of Data Governance • Provide the processes and practices necessary for success • Measure, monitor, and improve • Continuous iteration and development • Communicate what you’ve discovered! (and where others can find it!) Center of Excellence/Knowledge Base • Where do you go to find answers? • Who can help show you how? Communicate! 23
  • 24. Summary Keep AI/ML projects focused It is challenging to keep the business frame/value in mind! • Data comes from multiple disparate systems & sources • The business context may not be obvious based on data alone • There is a higher demand and expectation for seeing data quality in context. • You need to assess and measure the data content to establish both baselines and common understanding 4 Key Steps 1. Remember the end goal – ask questions, use best practices, and establish scope & context 2. Consider what data is needed • Focus your attention based on the type of data and the use case • Consider how you can ensure data is comprehensive, relevant, and useful 3. Test rules to validate data quality, establish baselines, communicate findings, and build trust! 4. Assess and communicate fitness for purpose Gaining insight and measurement of data quality is more critical than ever! 24
  • 25. Further Resources • Data Profiling: The First Step to Big Data Quality • Emerging Data Quality Trends for Governing and Analyzing Big Data • Introducing Trillium DQ for Big Data: Powerful Profiling and Data Quality for the Data Lake harald.smith@syncsort.com