Engineering Machine Learning Data Pipelines Series: Tracking Data Lineage from the Source

Engineering Machine Learning Data Pipelines
Tracking Data Lineage
Paige Roberts
Integrate Product Marketing Manager

Common Machine Learning Applications
• Anti-money laundering
• Fraud detection
• Cybersecurity
• Targeted marketing
• Recommendation engine
• Next best action
• Customer churn prevention
• Know your customer

Data Engineer to the Rescue
Data Scientist
• Expert in statistical analysis, machine learning techniques, and finding answers to business questions buried in datasets.
• Does NOT want to spend 50 to 90% of their time tinkering with data to get it into good shape for training models, but frequently does, especially if there’s no data engineer on the team.
• Once a machine learning model is trained, tested, and proven to accomplish the goal, turns it over to a data engineer to productionize. Not skilled at taking the model from a test sandbox into production, especially at large scale.
Data Engineer
• Expert in data structures, data manipulation, and constructing production data pipelines.
• WANTS to spend all of their time working with data, but usually has more on their plate than they can keep up with. Anything that speeds up their work is helpful.
• In most successful companies, is involved from the beginning: first gathers, cleans, and standardizes data, then helps the data scientist with feature engineering, providing top-notch data that is ready to train models.
• After the model is tested, builds robust, high-scale data pipelines that feed the models the data they need, in the correct format, in production to provide ongoing business value.

Five Big Challenges of Engineering ML Data Pipelines
1. Scattered and Difficult to Access Datasets
Much of the necessary data is trapped in mainframes or streams in from point-of-sale (POS) systems, web clicks, etc., all in
incompatible formats, making it difficult to gather and prepare the data for model training.
2. Data Cleansing at Scale
Data quality cleansing and preparation routines have to be reproduced at scale. Most data quality tools
are not designed to work on that scale of data.
3. Entity Resolution
Distinguishing matches across massive datasets that indicate a single specific entity (person, company,
product, etc.) requires sophisticated multi-field matching algorithms and a lot of compute power.
Essentially everything has to be compared to everything else.
4. Tracking Lineage from the Source
Data changes made to help train models have to be exactly duplicated in production, both so that models
make accurate predictions on new data and to satisfy required audit trails. This requires capturing complete
lineage, from source to end point.
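
The "everything has to be compared to everything else" point under Entity Resolution can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record fields (name, email, city), the field weights, and the 0.85 threshold are hypothetical, and the standard library's difflib.SequenceMatcher stands in for the more sophisticated multi-field matching algorithms the slide refers to.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical field weights (they sum to 1.0); a real system would tune these.
FIELD_WEIGHTS = {"name": 0.5, "email": 0.3, "city": 0.2}


def field_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1], ignoring case and surrounding whitespace."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def record_similarity(rec_a: dict, rec_b: dict) -> float:
    """Weighted average of the per-field similarities."""
    return sum(
        weight * field_similarity(rec_a[field], rec_b[field])
        for field, weight in FIELD_WEIGHTS.items()
    )


def find_matches(records: list[dict], threshold: float = 0.85) -> list[tuple[int, int]]:
    """Compare every record with every other record: n * (n - 1) / 2 comparisons."""
    matches = []
    for (i, rec_a), (j, rec_b) in combinations(enumerate(records), 2):
        if record_similarity(rec_a, rec_b) >= threshold:
            matches.append((i, j))
    return matches


# Illustrative records only; not data from the presentation.
records = [
    {"name": "Jane Smith", "email": "jane.smith@example.com", "city": "Boston"},
    {"name": "jane smith ", "email": "Jane.Smith@example.com", "city": "boston"},
    {"name": "Robert Jones", "email": "rjones@example.org", "city": "Denver"},
]
matched_pairs = find_matches(records)
print(matched_pairs)
```

The number of comparisons in the all-pairs loop grows quadratically with the number of records, which is why this step needs a lot of compute power on massive datasets.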

End-to-End Data Lineage
Diagram: Data Sources, Data, Data Lake
• Access data from streaming and batch sources outside the cluster.
• Onboard data, modify it on the fly to match the Hadoop storage model, or store it unchanged for archive and compliance.
• Transform, join, cleanse, and enhance data in the cluster with MapReduce or Spark.


• Pass source-to-cluster data lineage info to Navigator or Atlas through the REST API.

• Navigator or Atlas gathers any other changes made to data on the cluster: data changes made by MapReduce, Spark, HiveQL.

• Auditors get end-to-end data lineage.
• Analytics, visualizations, and machine learning algorithms get ALL necessary data.
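
The requirement that data changes made for training be exactly duplicated in production can be illustrated with a small Python sketch. It is not how Navigator, Atlas, or any Syncsort product records lineage; the transform names, the LineageLog class, the replay function, and the source name crm_export.csv are all hypothetical. The sketch only shows the idea of logging each change in order and replaying the same log on new data.

```python
import json
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical registry of named, deterministic transformations.
TRANSFORMS: dict[str, Callable[[dict], dict]] = {
    "strip_name": lambda row: {**row, "name": row["name"].strip()},
    "upper_country": lambda row: {**row, "country": row["country"].upper()},
}


@dataclass
class LineageLog:
    """Records the source and every transformation applied, in order."""
    source: str
    steps: list[str] = field(default_factory=list)

    def apply(self, step: str, rows: list[dict]) -> list[dict]:
        """Apply a registered transform to all rows and record it."""
        self.steps.append(step)
        return [TRANSFORMS[step](row) for row in rows]

    def to_json(self) -> str:
        """Serialize the lineage so it can be stored or audited."""
        return json.dumps({"source": self.source, "steps": self.steps})


def replay(lineage_json: str, rows: list[dict]) -> list[dict]:
    """Re-apply the recorded steps, in the recorded order, to new data."""
    lineage = json.loads(lineage_json)
    for step in lineage["steps"]:
        rows = [TRANSFORMS[step](row) for row in rows]
    return rows


# Training time: every change to the data is logged as it is made.
log = LineageLog(source="crm_export.csv")  # hypothetical source name
train = [{"name": " Ada ", "country": "us"}]
train = log.apply("strip_name", train)
train = log.apply("upper_country", train)
saved = log.to_json()

# Production time: the same changes are replayed on new data.
prod = replay(saved, [{"name": " Bob", "country": "ca"}])
print(saved)
print(prod)
```

Because the production path reads only the saved log, the same record serves both purposes named in the challenge: reproducing the training-time changes and providing an audit trail.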

Syncsort Published Lineage in Cloudera Navigator
