www.scling.com
Mortal Analytics - Covid-19 & the problem of data quality
Lars Albertsson (@lalleal)
Scling
Why this presentation?
● Non-goal: Argue for or against a particular strategy
○ We are already too polarised
● Goals:
○ What can go wrong with data quality?
○ What can we learn?
○ Data engineering as a solution
Imperial College: We saved the world!
https://www.bbc.com/news/health-52968523
Imperial College model predictions for Sweden
https://www.medrxiv.org/content/10.1101/2020.04.11.20062133v1.full.pdf
Model and reality
https://swprs.org/a-swiss-doctor-on-covid-19/
Imperial College model code
● Screenshots are only part of functions...
● A couple of regression tests - no tests validating correct functionality
● My impression: no chance of producing a high-confidence result
https://github.com/mrc-ide/covid-sim
Imperial College: bugs are not a problem
https://lockdownsceptics.org/code-review-of-fergusons-model/
Example Imperial College bug handling
https://github.com/mrc-ide/covid-sim/issues/330
Imperial College response
Bad predictions are harmful
● Each action has a health cost
○ Economic misery → social misery → health misery
○ Mental health
○ Drug / alcohol use
○ Domestic violence
● During the Ebola epidemic, 10x more people died from fear of hospitals than from Ebola
https://medium.com/@robert.munro/the-tech-communitys-response-to-ebola-44d2c8dbb5be
Ways to degrade data & analytics quality
● Deviating definitions
● Selection
● Deviating context
● Presentation
● Interpretation
● Data collection
● Data processing
● Lack of quality assessment
● Lack of quality improvement
Add senior software engineers with production experience.
Data engineering
Define death
Observed Covid-19 death definitions:
● Infection confirmed, last 30 days
● Infection confirmed, any time
● Infection assumed
● Assumed cause
● Hospitalised
● Other disease complicated by Covid-19
● Excess mortality
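The definitions above can yield different totals from the same underlying records. A minimal sketch of that effect follows. The record fields (`days_since_positive_test`, `covid_on_death_certificate`) and the sample records are hypothetical, invented for illustration, and cover only three of the listed definitions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DeathRecord:
    # Hypothetical fields, not taken from any real registry.
    days_since_positive_test: Optional[int]  # None means no confirmed infection
    covid_on_death_certificate: bool         # cause assumed by a physician


def confirmed_last_30_days(r: DeathRecord) -> bool:
    return r.days_since_positive_test is not None and r.days_since_positive_test <= 30


def confirmed_any_time(r: DeathRecord) -> bool:
    return r.days_since_positive_test is not None


def assumed_cause(r: DeathRecord) -> bool:
    return r.covid_on_death_certificate


def count_by_definition(records):
    """Count the same records under each definition."""
    definitions = {
        "confirmed_last_30_days": confirmed_last_30_days,
        "confirmed_any_time": confirmed_any_time,
        "assumed_cause": assumed_cause,
    }
    return {name: sum(1 for r in records if rule(r)) for name, rule in definitions.items()}


# Made-up records.
records = [
    DeathRecord(5, True),
    DeathRecord(45, False),
    DeathRecord(None, True),
    DeathRecord(20, False),
]
counts = count_by_definition(records)
print(counts)
```

With these four made-up records, the three definitions already disagree on the total, so numbers from sources using different definitions are not directly comparable.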
Sweden on the rise?
https://youtu.be/4uTj96ZowCU
https://www.bbc.com/news/world-europe-53175459
https://sverigesradio.se/artikel/7503606
"New Covid-19 cases per day"
No, context is missing
Tests executed
Test positive rate
New cases
https://youtu.be/4uTj96ZowCU
https://twitter.com/JacobGudiol/status/1283308826842759168
https://twitter.com/JacobGudiol/status/1283308817787293696
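New cases only make sense alongside the number of tests executed. A minimal sketch, with made-up weekly numbers rather than Swedish data, of how the case count can rise while the test positive rate falls:

```python
def positive_rate(new_cases: int, tests_executed: int) -> float:
    """Share of executed tests that came back positive."""
    if tests_executed <= 0:
        raise ValueError("tests_executed must be positive")
    return new_cases / tests_executed


# Made-up numbers for two weeks; testing doubles between them.
weeks = [
    {"week": 20, "tests": 30_000, "cases": 3_600},
    {"week": 24, "tests": 60_000, "cases": 6_000},
]

rates = [positive_rate(w["cases"], w["tests"]) for w in weeks]
cases_rising = weeks[1]["cases"] > weeks[0]["cases"]
rate_falling = rates[1] < rates[0]

for w, rate in zip(weeks, rates):
    print(f"week {w['week']}: {w['cases']} cases, positive rate {rate:.0%}")
```

In this sketch the headline number "new cases per day" goes up, yet the positive rate drops, which suggests the rise may be driven by more testing rather than more infection.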
Death numbers, different views
https://twitter.com/HaraldofW/status/1270080232104624128
https://www.folkhalsomyndigheten.se/globalassets/statistik-uppfoljning/smittsamma-sjukdomar/veckorapporter-covid-19/2020/covid-19-veckorapport-vecka-25-final.pdf
Data will confess to anything
● Absolute numbers mislead
○ Days since case x → time shift by country size
● Relative numbers mislead
○ Diluted in large countries
○ Small regions stand out
https://swprs.org/a-swiss-doctor-on-covid-19/
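The choice between absolute and relative numbers can reverse a ranking. A minimal sketch with two invented regions (`Largeland` and `Smallville` are hypothetical, as are their figures):

```python
def per_100k(count: int, population: int) -> float:
    """Normalise a count to a rate per 100,000 inhabitants."""
    return count * 100_000 / population


# Invented regions and numbers, for illustration only.
regions = {
    "Largeland": {"population": 50_000_000, "deaths": 5_000},
    "Smallville": {"population": 500_000, "deaths": 100},
}

worst_by_absolute = max(regions, key=lambda name: regions[name]["deaths"])
worst_by_relative = max(
    regions,
    key=lambda name: per_100k(regions[name]["deaths"], regions[name]["population"]),
)

print(worst_by_absolute, worst_by_relative)
```

The large region looks worst in absolute terms and the small one looks worst per capita, so either presentation can be picked to support a preferred story.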
Granularity matters
● Outbreaks in regions
● Country aggregation - information loss
○ But debate assumes homogeneous countries
● Peak of Swedish outbreak
○ Major outbreak in Stockholm + surroundings
○ Rest of Sweden on par with Nordics
● Nothing is "obvious"
https://www.folkhalsomyndigheten.se/globalassets/statistik-uppfoljning/smittsamma-sjukdomar/veckorapporter-covid-19/2020/covid-19-veckorapport-vecka-25-final.pdf
Swedish policy "obviously" terrible. Compare numbers with neighbours!
Data collection
"The last week is not complete, so it is difficult to determine if the trend continues."
https://youtu.be/4uTj96ZowCU
https://www.folkhalsomyndigheten.se/globalassets/statistik-uppfoljning/smittsamma-sjukdomar/veckorapporter-covid-19/2020/covid-19-veckorapport-vecka-27-final.pdf
What conclusion from this graph?
COVID-19 fatalities / day in Sweden
https://www.folkhalsomyndigheten.se/
Comparing apples, oranges, bananas, ...
COVID-19 fatalities / day in Sweden
Fatalities collected during 2 days
Fatalities collected during 4 days
Fatalities collected during 10 days
Naive data collection
● Gather the events that we have
● Put them in a database
● "Let us look at the latest data"
● You never want the latest data! You want comparable data.
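One way to read "comparable data" is to give every date the same collection window. A minimal sketch under that assumption: each event is a hypothetical `(death_date, report_date)` pair, and the dates and counts are made up.

```python
from collections import Counter
from datetime import date, timedelta


def counts_latest(events, as_of):
    """Naive view: count every death reported up to as_of, per death date."""
    return Counter(death for death, reported in events if reported <= as_of)


def counts_comparable(events, as_of, window_days):
    """Count only reports that arrived within window_days of the death,
    and only for death dates whose collection window has closed by as_of."""
    window = timedelta(days=window_days)
    return Counter(
        death
        for death, reported in events
        if reported <= as_of
        and reported - death <= window
        and death + window <= as_of
    )


# Made-up events: (death_date, report_date).
d1, d9 = date(2020, 4, 1), date(2020, 4, 9)
events = [
    (d1, date(2020, 4, 2)), (d1, date(2020, 4, 3)),
    (d1, date(2020, 4, 8)), (d1, date(2020, 4, 10)),
    (d9, date(2020, 4, 10)), (d9, date(2020, 4, 11)),
]
as_of = date(2020, 4, 11)

latest = counts_latest(events, as_of)
comparable = counts_comparable(events, as_of, window_days=2)
print(dict(latest))
print(dict(comparable))
```

The latest-data view shows April 9 at half the level of April 1, which looks like a decline. With an equal two-day window for both dates, the counts are the same. The older date simply had more time to collect late reports.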
Wrong conclusion, every day
● Fatalities data as of April 6, April 15, April 19
Graph by Statistisk Opinion, @StatistiskO
Wrong conclusion, every day
● Downward trend every day!
https://www.bloomberg.com/amp/news/articles/2020-07-17/georgia-massaged-virus-data-to-reopen-then-voided-mask-orders
Normalise data collection to compare
Graph by Adam Altmejd, @adamaltmejd
Forecast for analytics with fresh data
Graph by Adam Altmejd, @adamaltmejd
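A forecast for fresh dates can be built from how complete earlier dates were at the same reporting lag. The sketch below uses a simple completion-fraction estimate. This is an assumed technique for illustration, not necessarily the method behind the graphs, and the history numbers are made up.

```python
def completion_fraction(history):
    """history: (reported_within_lag, final_total) pairs for past dates
    that are now fully reported. Returns the share typically reported by that lag."""
    reported = sum(r for r, _ in history)
    final = sum(f for _, f in history)
    if final == 0:
        raise ValueError("history must contain at least one reported fatality")
    return reported / final


def forecast_final(observed_so_far, fraction):
    """Scale a partial count up to an expected final count."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    return observed_so_far / fraction


# Made-up history: counts known 2 days after each date vs. the eventual totals.
history_lag_2 = [(20, 80), (30, 100), (25, 120)]
fraction = completion_fraction(history_lag_2)

# A fresh date with 15 fatalities reported after 2 days.
estimate = forecast_final(15, fraction)
print(fraction, estimate)
```

A production version would also need an uncertainty interval, since the reporting delay varies between weekdays and over time.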
Why aren't authorities doing that?
● Cost of processing data
● Manual handcraft, not industrial process
https://github.com/FohmAnalys/SEIR-model-Stockholm
We are not done processing the data yet. Since we do calculations quickly, some mistakes might happen.
Manual, mechanised, industrialised
Manual:
● Muscle-powered
● Few tools
● Human touch for every step
Mechanised:
● Direct human control
● Machine tools
● Low investment, direct return
Industrialised:
● Scaled processes
● Machine tools
● Challenges: scale, logistics, legal, organisation, faults, ...
Muscle powered analytics & machine learning
● Use hand tools to
○ Collect data
○ Aggregate for analytics
or
○ Train a model
● Typical tools:
○ Excel
○ Matlab
○ Interactive SQL
○ Interactive BI tools
○ Jupyter
○ R
○ One-off Python scripts
"Dataset" - a data artifact of direct or indirect value
Mechanised analytics & machine learning
● Use machine tools to semi-automatically
○ Collect data
○ Aggregate for analytics
or
○ Train a model
● Typical tools: Muscle tools +
○ Databases
○ Data warehouses + ETL
○ Hadoop, Spark, Flink
○ Java, Scala, Python, SQL
○ Kafka
○ Similar cloud services
Datasets, produced monthly / hourly / daily / ...
From craft to process
Multiple time windows
Assess ingress data quality
Repair broken data from complementary source
Intermediate datasets, reusable between pipelines
Forecast based on history, multiple parameter settings
Assess outcome data quality
Assess forecast success, adapt parameters
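The steps "assess ingress data quality" and "repair broken data from complementary source" can be sketched as below. The row layout (`date`, `deaths`), the validity rule, and both data sources are assumptions made for illustration.

```python
def is_valid(row):
    """Hypothetical validity rule: a date and a non-negative integer count."""
    deaths = row.get("deaths")
    return row.get("date") is not None and isinstance(deaths, int) and deaths >= 0


def quality_score(rows):
    """Share of rows that pass the validity rule."""
    return sum(is_valid(r) for r in rows) / len(rows) if rows else 1.0


def repair(rows, complementary):
    """Replace broken rows with the complementary source's row for the same date.
    Rows that cannot be repaired are returned separately, never silently dropped."""
    repaired, unrepaired = [], []
    for row in rows:
        substitute = complementary.get(row.get("date"))
        if is_valid(row):
            repaired.append(row)
        elif substitute is not None and is_valid(substitute):
            repaired.append(substitute)
        else:
            unrepaired.append(row)
    return repaired, unrepaired


# Made-up primary and complementary sources.
primary = [
    {"date": "2020-04-01", "deaths": 10},
    {"date": "2020-04-02", "deaths": None},
    {"date": "2020-04-03", "deaths": -1},
]
complementary = {"2020-04-02": {"date": "2020-04-02", "deaths": 12}}

ingress_quality = quality_score(primary)
repaired, unrepaired = repair(primary, complementary)
outcome_coverage = len(repaired) / len(primary)
print(ingress_quality, outcome_coverage, unrepaired)
```

Measuring quality both before and after the repair step gives two numbers that can be tracked over time, which is what makes the process assessable rather than a one-off craft.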
Towards sustainable production ML
Multiple models, parameters, features
Assess ingress data quality
Repair broken data from complementary source
Choose model and parameters based on performance and input data
Benchmark models
Try multiple models, measure, A/B test
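"Benchmark models" and "choose model based on performance" can be sketched as a backtest over history. The two toy forecasters and the series below are assumptions for illustration. A real pipeline would benchmark actual candidate models.

```python
def last_value(history):
    """Toy forecaster: tomorrow equals today."""
    return history[-1]


def moving_average(history, window=3):
    """Toy forecaster: mean of the last `window` observations."""
    recent = history[-window:]
    return sum(recent) / len(recent)


def mean_absolute_error(predicted, actual):
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


def benchmark(models, series, warmup=3):
    """Backtest: forecast each point from the points before it, then score."""
    errors = {}
    for name, model in models.items():
        predictions = [model(series[:i]) for i in range(warmup, len(series))]
        errors[name] = mean_absolute_error(predictions, series[warmup:])
    return errors


# Made-up daily series.
series = [10, 12, 11, 13, 12, 14, 13]
models = {"last_value": last_value, "moving_average": moving_average}

errors = benchmark(models, series)
best = min(errors, key=errors.get)
print(errors, best)
```

Rerunning the benchmark as new data arrives lets the pipeline switch models when performance changes, instead of relying on a one-time manual choice.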
Industrialised analytics / machine learning
● Build resilient, automated processes that
○ Collect & process
○ Assess & improve quality
○ Create multiple artifacts, measure, adapt
● Typical tools: Mechanised tools +
○ Data lake
○ Workflow orchestration (Luigi, Airflow)
○ Quality assessment, monitoring
○ Testing, CI/CD
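The core idea behind workflow orchestration is that each dataset declares its dependencies and is built only if it is missing. The sketch below illustrates that idea in plain Python. It is not the Luigi or Airflow API, and the `Task` class, `run` function, and in-memory `store` are hypothetical stand-ins.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Task:
    name: str                      # name of the dataset this task produces
    build: Callable                # function from dependency outputs to this dataset
    deps: List["Task"] = field(default_factory=list)


def run(task, store, log):
    """Build a dataset if it is missing, after building its dependencies.
    Rerunning is safe: existing datasets are reused, not rebuilt."""
    if task.name in store:
        return store[task.name]
    inputs = [run(dep, store, log) for dep in task.deps]
    store[task.name] = task.build(*inputs)
    log.append(task.name)
    return store[task.name]


# Made-up three-step pipeline: collect, clean, aggregate.
raw = Task("raw", lambda: [3, None, 5])
clean = Task("clean", lambda rows: [r for r in rows if r is not None], [raw])
total = Task("total", lambda rows: sum(rows), [clean])

store, log = {}, []
result = run(total, store, log)
run(total, store, log)  # second run builds nothing new
print(result, log)
```

In a real data lake the store would be files or tables keyed by dataset name and date, so a failed run can be resumed and intermediate datasets can be reused between pipelines.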
Costs down - ROI from data
Manual:
● Hand-built
● Analyst team, < 10 datasets / day
Mechanised:
● Semi-automated
● "The data team", 10-100 datasets / day
Industrialised:
● Resilient data factory
● Every dev team, 100-1000s datasets / day per team
Spotify ~2014, 20K datasets/day
Becoming data industrialised
● Knowledge limited to leading tech companies + startups
● Change in processes & culture
○ Cf. agile, DevOps
○ Journey of many years
● Challenge is not technical
○ Can't buy a system or tool
○ Consultants can't help
Scling - data-value-as-a-service
Data value through collaboration
[Diagram labels: Customer; Data factory; Data platform & lake; data; domain expertise]
Value from data!
www.scling.com/reading-list
www.scling.com/presentations
www.scling.com/courses
