Engineering Machine Learning Data Pipelines
Tracking Data Lineage
Paige Roberts
Integrate Product Marketing Manager
Common Machine Learning Applications
• Anti-money laundering
• Fraud detection
• Cybersecurity
• Targeted marketing
• Recommendation engine
• Next best action
• Customer churn prevention
• Know your customer
Data Engineer to the Rescue
Data Scientist
• Expert in statistical analysis, machine learning
techniques, and finding answers to business questions
buried in datasets.
• Does NOT want to spend 50–90% of their time
tinkering with data to get it into good shape to
train models, but frequently does, especially if
there’s no data engineer on their team.
• Once a machine learning model is trained, tested,
and proven to accomplish the goal, turns it over
to a data engineer to productionize. Not skilled at
taking the model from a test sandbox into
production, especially not at large scale.
Data Engineer
• Expert in data structures, data manipulation, and
constructing production data pipelines.
• WANTS to spend all of their time working with data,
but usually has more on their plate than they can
keep up with. Anything that will speed up their work
is helpful.
• In most successful companies, is involved from the
beginning. First gathers, cleans, and standardizes
data, helps the data scientist with feature engineering,
and provides top-notch data, ready to train models.
• After the model is tested, builds robust, high-scale data
pipelines to feed the models the data they need, in
the correct format, in production to provide ongoing
business value.
Five Big Challenges of Engineering ML Data Pipelines
1. Scattered and Difficult to Access Datasets
Much of the necessary data is trapped in mainframes or streams in from POS systems, web clicks, etc., all in
incompatible formats, making it difficult to gather and prepare the data for model training.
2. Data Cleansing at Scale
Data quality cleansing and preparation routines have to be reproduced at scale. Most data quality tools
are not designed to work on that scale of data.
3. Entity Resolution
Distinguishing matches across massive datasets that indicate a single specific entity (person, company,
product, etc.) requires sophisticated multi-field matching algorithms and a lot of compute power.
Essentially everything has to be compared to everything else.
4. Tracking Lineage from the Source
Data changes made to help train models have to be exactly duplicated in production, both so that models
make accurate predictions on new data and to provide required audit trails. Complete lineage must be
captured, from source to end point.
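The requirement in challenge 4 can be sketched in a few lines: define each preparation step once, apply the same ordered steps to both training and production data, and record what ran. This is a minimal sketch under assumptions, not any product's implementation; the step functions, the `run_pipeline` helper, the source names, and the record layout are all hypothetical.

```python
import hashlib
import json

# Hypothetical preparation steps; each takes and returns a list of dict records.
def strip_whitespace(rows):
    return [{k: v.strip() if isinstance(v, str) else v for k, v in r.items()}
            for r in rows]

def uppercase_country(rows):
    return [{**r, "country": r["country"].upper()} for r in rows]

# One ordered definition shared by training and production.
STEPS = [strip_whitespace, uppercase_country]

def fingerprint(rows):
    """Stable short hash of a dataset, used to tie lineage entries to data."""
    payload = json.dumps(rows, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]

def run_pipeline(rows, source):
    """Apply STEPS in order and return (result, lineage entries)."""
    lineage = []
    for step in STEPS:
        before = fingerprint(rows)
        rows = step(rows)
        lineage.append({"source": source, "step": step.__name__,
                        "input": before, "output": fingerprint(rows)})
    return rows, lineage

train_rows, train_lineage = run_pipeline(
    [{"name": " Ann ", "country": "us"}], "training_extract")
prod_rows, prod_lineage = run_pipeline(
    [{"name": "Bob  ", "country": "de"}], "production_stream")

# The same steps ran in the same order in both environments.
same_steps = ([e["step"] for e in train_lineage]
              == [e["step"] for e in prod_lineage])
```

Because each entry's output fingerprint matches the next entry's input fingerprint, the entries chain into a trail that can be checked after the fact.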
End-to-End Data Lineage
[Diagram: Data Sources, Data Lake, and Analytics, Visualization, Machine Learning, with these callouts]
• Access data from streaming and batch sources outside cluster.
• Onboard data, modify on-the-fly to match Hadoop storage model, or store unchanged for archive and compliance.
• Transform, join, cleanse, enhance data in cluster with MapReduce or Spark.
• Pass source-to-cluster data lineage info to Navigator or Atlas (Data Lineage REST API).
• Navigator or Atlas gathers any other changes made to data on cluster (data changes made by MapReduce, Spark, HiveQL).
• Auditors get end-to-end data lineage.
• Analytics, visualizations, and machine learning algorithms get ALL necessary data (Complete Data).
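The end-to-end view described here amounts to joining two sets of lineage records: those captured on the way into the cluster and those gathered on the cluster. The following is a minimal sketch, assuming lineage is stored as simple (input, operation, output) edges. The dataset names and the `build_index` and `trace_upstream` helpers are hypothetical, and no Navigator or Atlas API is called.

```python
from collections import defaultdict

# Hypothetical lineage edges: (input dataset, operation, output dataset).
source_to_cluster = [
    ("mainframe:customers", "onboard", "lake:customers_raw"),
    ("stream:web_clicks", "onboard", "lake:clicks_raw"),
]
on_cluster = [
    ("lake:customers_raw", "cleanse", "lake:customers_clean"),
    ("lake:customers_clean", "join", "lake:training_set"),
    ("lake:clicks_raw", "join", "lake:training_set"),
]

def build_index(edges):
    """Map each output dataset to the (input, operation) pairs that produced it."""
    index = defaultdict(list)
    for src, op, dst in edges:
        index[dst].append((src, op))
    return index

def trace_upstream(dataset, index):
    """Return every lineage edge reachable upstream of `dataset`."""
    seen, stack, path = set(), [dataset], []
    while stack:
        current = stack.pop()
        for src, op in index.get(current, []):
            edge = (src, op, current)
            if edge not in seen:
                seen.add(edge)
                path.append(edge)
                stack.append(src)
    return path

# Combining both edge sets is what makes the trail end-to-end.
index = build_index(source_to_cluster + on_cluster)
audit_trail = trace_upstream("lake:training_set", index)

# Datasets that nothing in the index produced are the original sources.
origins = sorted({src for src, _, _ in audit_trail if src not in index})
```

If only the on-cluster edges were indexed, the trace would stop at the raw lake tables and never reach the mainframe or stream, which is the gap that source-to-cluster lineage closes.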
Syncsort Published Lineage in Cloudera Navigator
Engineering Machine Learning Data Pipelines Series: Tracking Data Lineage from the Source
