www.scling.com
DataOps in practice - Swedish style
Lars Albertsson (@lalleal)
Scling
Who’s talking?
...
Google - video conference, engineering productivity
...
Spotify - data engineering
...
Independent data engineering consultant
Banks, media, startups, heavy industry, telco
Founder @ Scling - data-value-as-a-service
Contents
Journey to DataOps
Experiences that shaped my data engineering
IMHO principles of successful DataOps
Toolbox
● Spotify information is old history
● Previously published
● Today is very different
Spotify data 2007-2013
● Hadoop installed 2007
● Use cases: reporting, insights, recommendations
● Cultural aspects:
○ Autonomous teams
○ Eliminate waste
○ Learn and adapt
Traditional systems
Mutation
Early Hadoop:
● Weak indexing
● No transactions
● Weak security
● Batch transformations
Data lake
Transformation
Cold store
Mutation
Immutable, shareable
DataOps workflows:
● Immutable, shared data
● Resilient to failure
● Quick error recovery
● Low-risk experiments
Data factories
What conclusion from this graph?
COVID-19 fatalities / day in Sweden
Wrong conclusion, every day
● Downward trend every day!
Normalise data collection to compare
Graph by Adam Altmejd, @adamaltmejd
Forecast for analytics with fresh data
Graph by Adam Altmejd, @adamaltmejd
From craft to process
Multiple time windows
Assess ingress data quality
Repair broken data from complementary source
Intermediate datasets, reusable between pipelines
Forecast based on history, multiple parameter settings
Assess outcome data quality
Assess forecast success, adapt parameters
Towards sustainable production ML
Multiple models, parameters, features
Assess ingress data quality
Repair broken data from complementary source
Choose model and parameters based on performance and input data
Benchmark models
Try multiple models, measure, A/B test
Risky operations
How do I test the pipeline?
You temporarily change the output path and run manually.
Don’t do that.
What if I forget to change the path?
2013
● Teams: Analytics computation (AC), data collection (DC), recommend, reporting (1)
● Folklore development cycle & operations
● Unsatisfied needs in other teams
On-prem Hadoop production
Redundant "edge nodes" with Luigi workers, scheduled with cron. Compute + data in Hadoop.
10 * * * * luigi --module mymodule MyDaily
23 * * * * luigi --module other OtherDaily
Diagram labels: luigid, Worker, Worker (10 * ..., 23 * ...), Master, Executor, HDFS metadata, Data, Control (+data), Submit job
Ghost in the cluster
● Jobs were deployed with Debian packages + Puppet on pet machines.
○ Multiple pets for redundancy. Race to run job.
● "This monitor daemon is at 100%. Since 6 months. I'll kill it."
● "Data is wrong. But we fixed this bug 6 months ago?!?"
Start of a DataOps journey
Stateful → Stateless
Pets → Cattle
Folklore → Golden path
Test in prod → Local test, CI/CD
New pipeline: weeks to learn → < 1 day
Bug fix: days to mend → < 1 hour
On-prem pipeline deployment pipeline
Source repo: Luigi DSL, jars, config
Standard deployment artifact: my-pipe-7.tar.gz
Standard artifact store
> pip install my-pipe-7.tar.gz
All that a pipeline needs, installed atomically
Diagram labels: Luigi daemon, Worker (×8)
Redundant cron schedule, higher frequency
10 * * * * luigi --module mymodule MyDaily
Principle: Functional pipelines
● Raw source of truth + data refinement factory
● Immutable datasets & artifacts
● Deterministic, idempotent, reproducible deployment & processing
● Key success factor: workflow orchestration
○ Oozie, Rambo, Builder, Builder2, Luigi
○ Key properties:
1. Pure Python
2. Simplicity
3. All the features it lacks
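A minimal sketch of what "deterministic, idempotent" processing over immutable datasets can look like, using only the Python standard library. It does not use Luigi's API. The names `run_task` and `clean_orders`, the record fields, and the dated path layout are hypothetical, chosen for illustration.

```python
import json
import os
import tempfile
from pathlib import Path


def clean_orders(records):
    """Pure transformation: the same input always gives the same output."""
    return sorted(
        (r for r in records if r.get("user_id") is not None),
        key=lambda r: r["order_id"],
    )


def run_task(input_path: Path, output_path: Path) -> bool:
    """Produce output_path from input_path unless it already exists.

    Returns True if work was done, False if the dataset was already there.
    """
    if output_path.exists():
        # Idempotent: an existing dataset is never overwritten.
        return False
    records = json.loads(input_path.read_text())
    result = clean_orders(records)
    output_path.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temporary file, then rename, so readers never see a
    # half-written dataset.
    fd, tmp_name = tempfile.mkstemp(dir=output_path.parent)
    with os.fdopen(fd, "w") as tmp:
        json.dump(result, tmp)
    os.replace(tmp_name, output_path)
    return True


def demo():
    with tempfile.TemporaryDirectory() as root:
        raw = Path(root) / "raw" / "orders" / "2013-05-01.json"
        raw.parent.mkdir(parents=True)
        raw.write_text(json.dumps([
            {"order_id": 2, "user_id": "u1"},
            {"order_id": 1, "user_id": None},
            {"order_id": 3, "user_id": "u2"},
        ]))
        out = Path(root) / "clean" / "orders" / "2013-05-01.json"
        first = run_task(raw, out)
        second = run_task(raw, out)  # redundant run, e.g. a second cron host
        return first, second, json.loads(out.read_text())
```

Because a second invocation is a no-op, a redundant cron schedule on several hosts is harmless in this sketch: whichever run comes later finds the dataset already in place.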
Big data - a collaboration paradigm
Stream storage
Data lake
Data democratised
Enabling teams
● Technically
○ Data available
○ Reusable QA
● Operationally
○ Continuous deployment
○ Hands off operations
○ Monitoring, debugging
● Bottom-up innovation
"The actual work that went into Discover Weekly was very little, because we're reusing things we already had."
https://youtu.be/A259Yo8hBRs
https://youtu.be/ZcmJxli8WS8
Principle: Small scope components
● Do one thing well. Less is more.
● Complex systems from replaceable bricks
○ Cloud/OSS over enterprise vendors
○ Simplicity over features
Diagram labels: Solvable challenge, ~2000 lines of code, Perpetual complexity
Cloud native deployment
Source repo: Luigi DSL, jars, config
Docker image: my-pipe:7
Docker registry
Diagram labels: Luigi daemon, Worker (×8), S3 / GCS, Dataproc / EMR
Redundant cron schedule, higher frequency
Abbreviated Kubernetes CronJob:
kind: CronJob
spec:
  schedule: "10 * * * *"
  command: "luigi --module mymodule MyDaily"
Data platform gravitation
● Hadoop all the things.
● Data is there. Simple test, simple deploy, simple ops.
● Autonomous teams - no mandate. Natural gravity.
Data integration timescales
Online
● SOA / microservices
● Synchronous RPC
● 1-100 ms
Nearline
● Stream storage
● Asynchronous event processing
● 10 ms - 1 hour
Offline
● File storage
● Asynchronous batch processing
● 1 minute -
Diagram labels: Job, Stream
Operational manoeuvres - online
Upgrade
● Careful rollout
● Risk of user impact
● Proactive QA
Service failure
● User impact
● Data loss
● Cascading outage
Bug
● User impact
● Data corruption
● Cascading corruption
Operational manoeuvres - offline
Upgrade
● Instant rollout
● No user impact
● Reactive QA
Service failure
● Pipeline delay
● No data loss
● No downstream impact
Bug
● Temporary data corruption
● Downstream impact
Life of an error, batch pipelines
● Faulty job, emits bad data
1. Revert serving datasets to old
2. Fix bug
3. Remove faulty datasets
4. Backfill is automatic (Luigi)
Done!
● Low cost of error
○ Reactive QA
○ Production environment sufficient
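The recovery steps can be sketched in plain Python. This is an illustration of the idea that a scheduler rebuilds whatever output is missing, not Luigi's implementation. `run_missing`, `build_report`, and the file layout are hypothetical, the bug fix is modelled by a `fixed` flag, and step 1 (reverting serving datasets) is left out.

```python
import tempfile
from datetime import date, timedelta
from pathlib import Path


def dataset_path(root: Path, day: date) -> Path:
    return root / "report" / f"{day.isoformat()}.txt"


def build_report(day: date, fixed: bool) -> str:
    # Stand-in for the real job; the buggy version emits a wrong value.
    return f"{day.isoformat()} ok" if fixed else f"{day.isoformat()} BAD"


def run_missing(root: Path, days, fixed: bool):
    """Build every dataset that does not exist yet; return the days built."""
    built = []
    for day in days:
        path = dataset_path(root, day)
        if not path.exists():
            path.parent.mkdir(parents=True, exist_ok=True)
            path.write_text(build_report(day, fixed))
            built.append(day)
    return built


def demo():
    days = [date(2013, 5, 1) + timedelta(days=i) for i in range(3)]
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        run_missing(root, days[:1], fixed=True)   # healthy run before the bug
        run_missing(root, days, fixed=False)      # faulty job emits bad data
        for day in days[1:]:                      # 3. remove faulty datasets
            dataset_path(root, day).unlink()
        rebuilt = run_missing(root, days, fixed=True)  # 4. backfill
        contents = [dataset_path(root, d).read_text() for d in days]
    return [d.isoformat() for d in rebuilt], contents
```

Only the removed days are rebuilt; the healthy dataset from before the bug is left untouched, which is what keeps the cost of an error low.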
Production critical upgrade
● Dual datasets during transition
● Run downstream parallel pipelines
○ Cheap
○ Low risk
○ Easy rollback
● Testable end-to-end
No dev & staging environment needed!
∆?
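With dual datasets in place, the "∆?" question becomes a comparison between the current and the candidate output. A small sketch under assumed names: `diff_datasets`, the `user` key, and the sample rows are hypothetical, and real datasets would be read from the lake rather than held in lists.

```python
def diff_datasets(old_rows, new_rows, key):
    """Compare two versions of a dataset, matching rows on `key`."""
    old = {row[key]: row for row in old_rows}
    new = {row[key]: row for row in new_rows}
    return {
        "only_in_old": sorted(old.keys() - new.keys()),
        "only_in_new": sorted(new.keys() - old.keys()),
        "changed": sorted(
            k for k in old.keys() & new.keys() if old[k] != new[k]
        ),
    }


def demo():
    current = [
        {"user": "a", "plays": 10},
        {"user": "b", "plays": 3},
        {"user": "c", "plays": 7},
    ]
    candidate = [
        {"user": "a", "plays": 10},
        {"user": "b", "plays": 4},
        {"user": "d", "plays": 1},
    ]
    return diff_datasets(current, candidate, key="user")
```

If the difference is unexpected, the candidate pipeline is simply dropped and the current datasets keep serving, which is the easy rollback the slide refers to.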
Operational manoeuvres - nearline
Upgrade
● Swift rollout
● Parallel pipelines
● User impact, QA?
Service failure
● Pipeline delay
● No data loss
● Downstream impact?
Bug
● Data corruption
● Downstream impact
Diagram labels: Job, Stream
Life of an error, streaming
Reprocessing in Kafka Streams
● Works for a single job, not pipeline. :-(
Diagram labels: Job, Stream
Data processing tradeoff
Axis labels: Data speed, Innovation speed
Online, Nearline, Offline
Diagram labels: Job, Stream
Separating online & offline
● Daily user DB dump. Cassandra can handle the load.
○ Load spike became 25 h long…
● New recommendation model! Cassandra can replicate to all regions.
○ Who saturated the Atlantic link?
● Batch jobs saturate one resource.
○ Bad neighbours.
Batch offline vs online
Diagram labels: Orders, Replication / Backup, Raw, Fraud model, Fraud service, Orders
Online: standard procedures. Offline: lightweight procedures. Careful handover between them.
Lightweight procedures:
● QA driven by internal efficiency
● Continuous deployment
● New pipeline < 1 day
● Upgrade < 1 hour
● Bug recovery < 1 hour
Data quality dimensions
● Timeliness
○ E.g. the customer engagement report was produced at the expected time
● Correctness
○ The numbers in the reports were calculated correctly
● Completeness
○ The report includes information on all customers, using all information from the whole time period
● Consistency
○ The customer summaries are all based on the same time period
Testing single batch job
Standard ScalaTest harness
1. Generate input: f() → file://test_input/
2. Run job in local mode
3. Verify output: p() on file://test_output/
Runs well in CI / from IDE
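The three steps of a single-job test, sketched in Python rather than ScalaTest so it stays stdlib-only. The job `count_orders_per_user`, the CSV layout, and the helper names are hypothetical; reading f() as the input generator and p() as the output check is an assumption based on their position on the slide.

```python
import csv
import tempfile
from collections import Counter
from pathlib import Path


def count_orders_per_user(input_dir: Path, output_dir: Path) -> None:
    """The job under test: read order files, write one count per user."""
    counts = Counter()
    for path in sorted(input_dir.glob("*.csv")):
        with path.open(newline="") as f:
            for row in csv.DictReader(f):
                counts[row["user_id"]] += 1
    output_dir.mkdir(parents=True, exist_ok=True)
    with (output_dir / "part-00000.csv").open("w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["user_id", "orders"])
        for user_id in sorted(counts):
            writer.writerow([user_id, counts[user_id]])


def generate_input(input_dir: Path) -> None:
    """Step 1: write a small, hand-made input dataset."""
    input_dir.mkdir(parents=True)
    with (input_dir / "orders.csv").open("w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["order_id", "user_id"])
        writer.writerows([[1, "u1"], [2, "u2"], [3, "u1"]])


def read_output(output_dir: Path):
    """Step 3: load the output so the test can assert on it."""
    with (output_dir / "part-00000.csv").open(newline="") as f:
        return {row["user_id"]: int(row["orders"]) for row in csv.DictReader(f)}


def run_test():
    with tempfile.TemporaryDirectory() as tmp:
        input_dir = Path(tmp) / "test_input"
        output_dir = Path(tmp) / "test_output"
        generate_input(input_dir)                     # 1. generate input
        count_orders_per_user(input_dir, output_dir)  # 2. run in local mode
        return read_output(output_dir)                # 3. verify output
```

The job only sees directories, so the same code can point at local test files here and at the data lake in production.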
Testing batch pipelines - two options
Standard ScalaTest harness
1. Generate input: f() → file://test_input/
2. Run custom multi-job
3. Verify output: p() on file://test_output/
A: Test job with sequence of jobs
B: Customised workflow manager setup
Monitoring timeliness, examples
● Datamon - Spotify internal
● Twitter Ambrose (dead?)
● Airflow
Measuring correctness: counters
● Processing tool (Spark/Hadoop) counters
○ Odd code path => bump counter
○ System metrics
Diagram labels: Hadoop / Spark counters, DB, Standard graphing tools, Standard alerting service
Measuring correctness: counters
● User-defined
● Technical from framework
○ Execution time
○ Memory consumption
○ Data volumes
○ ...
case class Order(item: ItemId, userId: UserId)
case class User(id: UserId, country: String)

val orders = read(orderPath)
val users = read(userPath)
// User-defined counter: orders whose user is missing from the user dataset
val orderNoUserCounter = longAccumulator("order-no-user")

val joined: C[(Order, Option[User])] = orders
  .groupBy(_.userId)
  .leftJoin(users.groupBy(_.id))
  .values

val orderWithUser: C[(Order, User)] = joined
  .flatMap {
    case (order, Some(user)) => Some((order, user))
    case (order, None) =>
      // Odd code path => bump counter, drop the record
      orderNoUserCounter.add(1)
      None
  }
SQL: Nope
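The same counter pattern as a runnable stdlib-only Python sketch, for readers who want to try it without a cluster. A `collections.Counter` stands in for the framework's accumulator; `join_orders_with_users` and the dict-shaped records are hypothetical.

```python
from collections import Counter


def join_orders_with_users(orders, users, counters):
    """Left-join orders to users; count orders whose user is missing."""
    users_by_id = {user["id"]: user for user in users}
    joined = []
    for order in orders:
        user = users_by_id.get(order["user_id"])
        if user is None:
            # Odd code path: bump a counter rather than failing the job
            # or dropping the record without a trace.
            counters["order-no-user"] += 1
            continue
        joined.append((order, user))
    return joined


def demo():
    counters = Counter()
    users = [{"id": "u1", "country": "SE"}]
    orders = [
        {"item": "i1", "user_id": "u1"},
        {"item": "i2", "user_id": "u9"},
    ]
    joined = join_orders_with_users(orders, users, counters)
    return len(joined), counters["order-no-user"]
```

Exported to a metrics database, a counter like `order-no-user` can be graphed and alerted on with standard tools.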
Data quality - high code vs low code
● 2013: Python MapReduce outdated
● Hive/SQL?
○ Not expressive enough
○ Data quality challenging
● Technical platform + multi-skilled teams!
○ Strong development processes
Low code / no code platform? Technical platform?
Measuring consistency: pipelines
● Processing tool (Spark/Hadoop) counters
○ Odd code path => bump counter
○ System metrics
● Dedicated quality assessment pipelines
DB
Quality assessment job
Quality metadataset (tiny)
Standard graphing tools
Standard alerting service
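A dedicated quality assessment job can be as simple as a function that reads one dataset and emits one small record. The sketch below is an assumption-laden illustration: `assess_quality`, the field names, and the two metrics are hypothetical, picked to touch the completeness and consistency dimensions.

```python
def assess_quality(rows, required_fields, expected_day):
    """Summarise one dataset into a small quality record."""
    total = len(rows)
    complete = sum(
        1 for row in rows
        if all(row.get(field) not in (None, "") for field in required_fields)
    )
    days = {row.get("day") for row in rows}
    return {
        "day": expected_day,
        "rows": total,
        # Completeness: share of rows with all required fields filled in.
        "complete_ratio": complete / total if total else 0.0,
        # Consistency: every row should belong to the same time period.
        "single_period": days == {expected_day},
    }


def demo():
    rows = [
        {"day": "2013-05-01", "user": "a", "country": "SE"},
        {"day": "2013-05-01", "user": "b", "country": ""},
        {"day": "2013-04-30", "user": "c", "country": "NO"},
        {"day": "2013-05-01", "user": "d", "country": "DK"},
    ]
    return assess_quality(rows, ["user", "country"], "2013-05-01")
```

One such record per dataset and day forms the tiny quality metadataset, small enough to load into a database for graphing and alerting.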
Machine learning operations, simplified
● Multiple trained models
○ Select at run time
● Measure user behaviour
○ E.g. session length, engagement, funnel
● Ready to revert to
○ old models
○ simpler models
Diagram labels: Measure interactions, Rendezvous, DB, Standard alerting service, Stream, Job
"The required surrounding infrastructure is vast and complex." - Google
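Selecting a model at run time, with a simpler model to revert to, can be sketched as below. The function `choose_model`, the model names, the scores, and the threshold are all hypothetical; in practice the scores would come from measured user behaviour.

```python
def choose_model(candidates, metrics, baseline, min_score):
    """Pick the best-scoring candidate, or fall back to the baseline.

    candidates: names of trained models available at run time
    metrics: measured score per model name (higher is better)
    """
    measured = [name for name in candidates if name in metrics]
    if measured:
        best = max(measured, key=lambda name: metrics[name])
        if metrics[best] >= min_score:
            return best
    # No candidate has an acceptable measured score: revert to the
    # older or simpler model.
    return baseline


def demo():
    healthy = choose_model(
        ["deep-v2", "deep-v1"],
        {"deep-v2": 0.61, "deep-v1": 0.58},
        baseline="popularity",
        min_score=0.5,
    )
    degraded = choose_model(
        ["deep-v2", "deep-v1"],
        {"deep-v2": 0.31, "deep-v1": 0.42},
        baseline="popularity",
        min_score=0.5,
    )
    return healthy, degraded
```

Keeping the fallback decision in code means reverting is a routine outcome of measurement, not an emergency procedure.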
Not all things went well
● Autonomy → excessive heterogeneity
○ 25 ways to store a timestamp?
● Pipeline end-to-end tests
○ Culturally challenging
○ → difficult to change & retire pipelines
● Trial and error to learn
Data engineering in Scandinavia
● Stockholm region ranks 2nd in unicorns / capita
○ Media, games, fintech
● Critical mass of world class data engineering
○ Limited to a few companies
Mission: Spread data & AI superpowers
● There are companies to help
● Data & AI capabilities require culture & process change
○ Slow, very slow
Scandinavian minimalist design
● Lean, simple technology - focus on flow and business value
● Bonnier News data platform, 4-5 persons:
○ Zero to happy customer in 3 weeks.
○ Dozens of ROI pipelines in 8 months.
● Scling retail client, 1-3 persons, after 1 year:
○ 40 sources, 70 pipelines, 200 egress points
○ 3,400 datasets / day
● Typical enterprise numbers
○ Big data project: 6-24 months
○ Analytics department: 100-1000 datasets / day
○ Spotify: 100,000+ datasets / day
○ Google: 1.6B datasets / day (2016)
Scling - data-value-as-a-service
Data value through collaboration
Diagram labels: Customer, Data factory, Data platform & lake, data, domain expertise
Value from data!
www.scling.com/reading-list
www.scling.com/presentations
www.scling.com/courses
