Data ops in practice - Swedish style

www.scling.com
DataOps in practice -
Swedish style
Lars Albertsson (@lalleal)
Scling
1

www.scling.com
Who’s talking?
...
Google - video conference, engineering productivity
...
Spotify - data engineering
...
Independent data engineering consultant
Banks, media, startups, heavy industry, telco
Founder @ Scling - data-value-as-a-service
2

www.scling.com
Contents
Journey to DataOps
Experiences that shaped my data engineering
IMHO principles of successful DataOps
Toolbox
3
● Spotify information is old history
● Previously published
● Today is very different

www.scling.com
Spotify data 2007-2013
● Hadoop installed 2007
● Use cases: reporting, insights, recommendations
● Cultural aspects:
○ Autonomous teams
○ Eliminate waste
○ Learn and adapt
4

www.scling.com
Traditional systems
5
Mutation
Early Hadoop:
● Weak indexing
● No transactions
● Weak security
● Batch transformations

www.scling.com
Data lake
Transformation
Cold
store
6
Mutation
Immutable,
shareable
Early Hadoop:
● Weak indexing
● No transactions
● Weak security
● Batch transformations
DataOps workflows:
● Immutable, shared data
● Resilient to failure
● Quick error recovery
● Low-risk experiments
Data factories

www.scling.com
What conclusion from this graph?
COVID-19 fatalities / day in Sweden
7

www.scling.com
Wrong conclusion, every day
● Downward trend every day!
8

www.scling.com
Normalise data collection to compare
9Graph by Adam Altmejd, @adamaltmejd

www.scling.com
Normalise data collection to compare

www.scling.com
Forecast for analytics with fresh data

www.scling.com
From craft to process
12

www.scling.com
13
Multiple time windows

www.scling.com
14
Assess ingress data quality

www.scling.com
15
Assess outcome data quality

www.scling.com
16
Repair broken data
Intermediate datasets, reusable between pipelines

www.scling.com
17
Repair broken data from
complementary source

www.scling.com
18
Forecast based on history

www.scling.com
19
Forecast based on history,
multiple parameter settings

www.scling.com
20
Forecast based on history,
multiple parameter settings
Assess forecast success,
adapt parameters

www.scling.com
Towards sustainable production ML
21
Multiple models,
parameters, features
Choose model and parameters based
on performance and input data
Benchmark models
Try multiple models,
measure, A/B test

www.scling.com
Risky operations
22
How to I test the pipeline?
You temporarily change the
output path and run manually.
Don’t do that.
What if I forget to change path?

www.scling.com
2013
23
● Teams: Analytics computation (AC), data collection (DC), recommend, reporting (1)
● Folklore development cycle & operations
● Unsatisfied needs in other teams

www.scling.com
luigid
Redundant "edge nodes" with Luigi workers, scheduled with cron. Compute + data in Hadoop.
On-prem Hadoop production
Worker
10 * * * * luigi --module mymodule MyDaily
23 * * * * luigi --module other OtherDaily
Master
Executor
Worker
HDFS metadata
Data
Control
(+data)
Submit job
10 * ...
23 * ...

www.scling.com
Ghost in the cluster
● Jobs were deployed with Debian packages + Puppet on pet machines.
○ Multiple pets for redundancy. Race to run job.
● "This monitor daemon is at 100%. Since 6 months. I'll kill it."
● "Data is wrong. But we fixed this bug 6 months ago?!?"
25

www.scling.com
Start of a DataOps journey
26
Stateful Stateless
Pets Cattle
Folklore
Golden pathTest in prod
Local test
CI/CD
Weeks to learn
New pipeline
< 1 day
Days to mend
Bug fix
< 1 hour

www.scling.com
On-prem pipeline deployment pipeline
27
source
repo Luigi DSL, jars, config
my-pipe-7.tar.gz
Luigi
daemon
> pip install my-pipe-7.tar.gz
Worker
Worker
Worker
Worker
Worker
Worker
Worker
Worker
Redundant cron schedule,
higher frequency
All that a pipeline needs, installed atomically
10 * * * * luigi --module mymodule MyDaily
Standard deployment artifact Standard artifact store

www.scling.com
Principle: Functional pipelines
28
● Raw source of truth + data refinement factory
● Immutable datasets & artifacts
● Deterministic, idempotent, reproducible deployment & processing
● Key success factor: workflow orchestration
○ Oozie, Rambo, Builder, Builder2, Luigi
○ Key properties:
1. Pure Python
2. Simplicity
3. All the features it lacks

www.scling.com
Big data - a collaboration paradigm
29
Stream storage
Data lake
Data
democratised

www.scling.com
● Technically
○ Data available
○ Reusable QA
● Operationally
○ Continuous deployment
○ Hands off operations
○ Monitoring, debugging
● Bottom-up innovation
Enabling teams
30
"The actual work that went into
Discover Weekly was very little,
because we're reusing things we
already had."
https://youtu.be/A259Yo8hBRs
https://youtu.be/ZcmJxli8WS8

www.scling.com
Principle: Small scope components
31
● Do one thing well. Less is more.
● Complex systems from replaceable bricks
○ Cloud/OSS over enterprise vendors
○ Simplicity over features
Solvable
challenge
~2000 lines of code
Perpetual
complexity

www.scling.com
Cloud native deployment
32
source
repo Luigi DSL, jars, config
my-pipe:7
Luigi
daemon
Worker
Worker
Worker
Worker
Worker
Worker
Worker
Worker
Redundant cron schedule,
higher frequency
kind: CronJob
spec:
schedule: "10 * * * *"
command: "luigi --module mymodule MyDaily"
Docker image Docker registry
S3 / GCS
Dataproc /
EMR

www.scling.com
Data platform gravitation
● Hadoop all the things.
● Data is there. Simple test, simple deploy, simple ops.
● Autonomous teams - no mandate. Natural gravity.
33

www.scling.com
3434
Nearline
● Stream storage
● Asynchronous event
processing
● 10 ms - 1 hour
Data integration timescales
34
Job
Stream
Offline
● File storage
● Asynchronous batch
processing
● 1 minute -
Online
● SOA / microservices
● Synchronous RPC
● 1-100 ms
Stream
Job
Stream

www.scling.com
3535
Upgrade
● Careful rollout
● Risk of user impact
● Proactive QA
Operational manoeuvres - online
35

www.scling.com
3636
Upgrade
● Careful rollout
● Proactive QA
36
Service failure
● User impact
● Data loss
● Cascading outage

www.scling.com
3737
Upgrade
● Careful rollout
● Proactive QA
37
Service failure
● User impact
● Data loss
● Cascading outage
Bug
● User impact
● Data corruption
● Cascading corruption

www.scling.com
38
Operational manoeuvres - offline
38
Upgrade
● Instant rollout
● No user impact
● Reactive QA
Service failure
● Pipeline delay
● No data loss
● No downstream impact
Bug
● Temporary data
corruption
● Downstream impact

www.scling.com
Life of an error, batch pipelines
39
● Faulty job, emits bad data
1. Revert serving datasets to old
2. Fix bug
3. Remove faulty datasets
4. Backfill is automatic (Luigi)
Done!
● Low cost of error
○ Reactive QA
○ Production environment sufficient

www.scling.com
40
Production critical upgrade
● Dual datasets during transition
● Run downstream parallel pipelines
○ Cheap
○ Low risk
○ Easy rollback
● Testable end-to-end
No dev & staging environment needed!
∆?

www.scling.com
41
Operational manoeuvres - nearline
41
Upgrade
● Swift rollout
● Parallel pipelines
● User impact, QA?
Service failure
● Pipeline delay
● No data loss
● Downstream impact?
Bug
● Data corruption
● Downstream impact
Job
Stream
Stream
Job
Stream
Job
Stream
Stream
Job
Stream
Job
Stream
Stream
Job
Stream

www.scling.com
42
Life of an error, streaming
42
● Works for a single job, not pipeline. :-(
Job
StreamStream Stream
Stream Stream Stream
Job
Job
Stream Stream Stream
Job
Job Job
Reprocessing in Kafka Streams

www.scling.com
Data speed Innovation speed
43
Nearline
Data processing tradeoff
43
Job
Stream
OfflineOnline
Stream
Job
Stream

www.scling.com
44
Separating online & offline
● Daily user DB dump. Cassandra can handle the load.
○ Load spike became 25 h long…
● New recommendation model! Cassandra can replicate to all regions.
○ Who saturated the Atlantic link?
● Batch jobs saturate one resource.
○ Bad neighbours.

www.scling.com
Batch offline vs online
45
Raw
Fraud
serviceFraud
model
Orders Orders
Replication /
Backup
Standard procedures Standard proceduresLightweight procedures
● QA driven by internal efficiency
● Continuous deployment
● New pipeline < 1 day
● Upgrade < 1 hour
● Bug recovery < 1 hour
Careful handover Careful handover

www.scling.com
Data quality dimensions
● Timeliness
○ E.g. the customer engagement report was produced at the expected time
● Correctness
○ The numbers in the reports were calculated correctly
● Completeness
○ The report includes information on all customers, using all information from the whole time period
● Consistency
○ The customer summaries are all based on the same time period
46

www.scling.com
Testing single batch job
47
Job
Standard Scalatest harness
file://test_input/ file://test_output/
1. Generate input 2. Run in local mode 3. Verify output
f() p()
Runs well in
CI / from IDE

www.scling.com
Testing batch pipelines - two options
48
Standard Scalatest harness
file://test_input/ file://test_output/
1. Generate input 2. Run custom multi-job
Test job with sequence of jobs
3. Verify output
f() p()
A:
Customised workflow manager setup
p()f()
B:

www.scling.com
Monitoring timeliness, examples
● Datamon - Spotify internal
● Twitter Ambrose (dead?)
● Airflow
49

www.scling.com
50
Measuring correctness: counters
● Processing tool (Spark/Hadoop) counters
○ Odd code path => bump counter
○ System metrics
Hadoop / Spark counters DB
Standard graphing tools
Standard
alerting
service

www.scling.com
Measuring correctness: counters
● User-defined
● Technical from framework
○ Execution time
○ Memory consumption
○ Data volumes
○ ...
51
case class Order(item: ItemId, userId: UserId)
case class User(id: UserId, country: String)
val orders = read(orderPath)
val users = read(userPath)
val orderNoUserCounter = longAccumulator("order-no-user")
val joined: C[(Order, Option[User])] = orders
.groupBy(_.userId)
.leftJoin(users.groupBy(_.id))
.values
val orderWithUser: C[(Order, User)] = joined
.flatMap( orderUser match
case (order, Some(user)) => Some((order, user))
case (order, None) => {
orderNoUserCounter.add(1)
None
})
SQL: Nope

www.scling.com
Data quality - high code vs low code
● 2013: Python MapReduce outdated
● Hive/SQL?
○ Not expressive enough
○ Data quality challenging
● Technical platform + multi-skilled teams!
○ Strong development processes
52
Low code / no code platform? Technical platform?

www.scling.com
53
Measuring consistency: pipelines
● Processing tool (Spark/Hadoop) counters
○ Odd code path => bump counter
○ System metrics
● Dedicated quality assessment pipelines
DB
Quality assessment job
Quality metadataset (tiny)
Standard graphing tools
Standard
alerting
service

www.scling.com
54
Machine learning operations, simplified
● Multiple trained models
○ Select at run time
● Measure user behaviour
○ E.g. session length, engagement, funnel
● Ready to revert to
○ old models
○ simpler models
Measure interactionsRendez-
vous
DB
Standard
alerting
service
Stream Job
"The required surrounding
infrastructure is vast and
complex."
- Google

www.scling.com
55
Not all things went well
● Autonomy → excessive heterogeneity
○ 25 ways to store a timestamp?
● Pipeline end-to-end tests
○ Culturally challenging
○ → difficult to change & retire pipelines
● Trial and error to learn

www.scling.com
Data engineering in Scandinavia
● Stockholm region ranks 2nd in unicorns / capita
○ Media, games, fintech
● Critical mass of world class data engineering
○ Limited to a few companies
56

www.scling.com
Mission: Spread data & AI superpowers
● There are companies to help
● Data & AI capabilities require culture & process change
○ Slow, very slow
57

www.scling.com
Scandinavian minimalist design
● Lean, simple technology - focus on flow and business value
● Bonnier News data platform, 4-5 persons:
○ Zero to happy customer in 3 weeks.
○ Dozens of ROI pipelines in 8 months.
● Scling retail client, 1-3 persons, after 1 year:
○ 40 sources, 70 pipelines, 200 egress points
○ 3,400 datasets / day
● Typical enterprise numbers
○ Big data project: 6-24 months
○ Analytics department: 100-1000 datasets / day
○ Spotify: 100,000+ datasets / day
○ Google: 1.6B datasets / day (2016)
58

www.scling.com
Scling - data-value-as-a-service
59
Data value through collaboration
Customer
Data factory
Data platform & lake
data
domain
expertise
Value from data!
www.scling.com/reading-list
www.scling.com/presentations
www.scling.com/courses

Data ops in practice - Swedish style

More Related Content

What's hot (20)

Similar to Data ops in practice - Swedish style (20)

More from Lars Albertsson (13)

Recently uploaded (20)

Data ops in practice - Swedish style