Data collection in AWS
Lars Marius Garshol, lars.marius.garshol@schibsted.com
http://twitter.com/larsga
2018–09–04, AWS Meetup
Schibsted?
Collecting data?
What?
Schibsted
3
30 countries
200 million users/month
20 billion pageviews/month
Three parts
4
Data Platform team
5
Jordi Roura
Ole-Magnus Røysted Aker
Sangram Bal
Oleksandr Ivanov
Mårten Rånge
Håvard Wall
Bjørn Rustad
Rolv Seehus
Håkon Åmdal
Fredrik Vraalsen
Øyvind Løkling
Per Wessel Nore
Lars Marius Garshol
Ning Zhou
Data Platform
6
[Diagram: the Data Platform's three parts: Batch, Streaming, Pulse]
Data volume
7
Original architecture
8
[Architecture diagram: Collector feeding two Kinesis streams, consumed by Storage (→ S3, processed by The Batch Job) and by Piper]
The Batch Job
• Implemented in Apache Spark
• Luigi for scheduling/orchestration
• Runs in a shared Mesos cluster
• this was set up because letting users create individual
clusters became far too expensive
• this cluster is the main cost driver for Schibsted’s AWS bills
• Difficult environment to work with
• hard to debug and develop in
9
Problems with batch
• Configuration file mapped to Spark tasks
• very complex set of Spark tasks
• requires lots of communication between Spark nodes
• runs slowly
• Very resource-intensive
• had difficulty keeping up with incoming traffic
• very sensitive to “cluster weather”
• brittle
10
Piper & Storage
• Ordinary Java and Scala applications
• read from the data source, perform all processing on one
node, then write to the destination
• no communication necessary between nodes
• normal EC2 nodes with the application baked into the AMI
• therefore scales trivially with Auto Scaling Groups
• Instrumented with Datadog, logs loaded into
SumoLogic
11
Piper’s problem
12
[Diagram: Piper delivering to several sinks, among them Storage S3 and SQS; four sinks marked OK, one marked Slow]
Kafka vs Kinesis
• Kinesis has very strict API limits
• total read limit = 2x write limit
• effectively limited to 2 readers
• Kinesis API is very limited
• basically only supports reading records in order
• Kafka improves on both
• can support many readers simultaneously
• advanced Scala DSL with window functions, etc.
13
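The "2 readers" claim follows directly from Kinesis's published per-shard quotas; a quick sanity check (the quota numbers are AWS's public per-shard figures, not from the deck):

```python
# Kinesis per-shard quotas: producers may write 1 MB/s, consumers may
# read 2 MB/s in total. At full write rate that leaves room for only
# two consumers each reading the whole stream.
WRITE_LIMIT_MBPS = 1.0   # per-shard write quota
READ_LIMIT_MBPS = 2.0    # per-shard total read quota

max_full_rate_readers = READ_LIMIT_MBPS / WRITE_LIMIT_MBPS  # 2.0
```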
New architecture
14
[Architecture diagram: the original Collector / Kinesis / Storage / S3 / The Batch Job pipeline, now extended with Kafka; the consumer side of Kafka is still marked "?"]
Handling slow consumers
15
[Diagram: Kafka with a Firehose topic; Yggdrasil applies transforms and filtering to produce one topic per consumer; Duratro reads those topics and delivers to sinks, four marked OK, one marked Slow, without the Slow one holding back the others]
Pulse challenges
• Pulse is a tracking solution with no user interface
• you want dashboards to analyze user traffic? sorry
• problem is: not enough resources to develop that
• Using Amplitude to solve that
• created a Duratro sink for Amplitude
• simple HTTP POST of JSON events to Amplitude API
• users can now create Amplitude projects, feed their Pulse
events there, and finally have dashboards
16
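The Amplitude sink described above boils down to an HTTP POST of JSON events. A minimal sketch against Amplitude's public HTTP API V2; the Pulse field mapping shown is illustrative, and the real Duratro sink is Scala, not Python:

```python
import json
import urllib.request

AMPLITUDE_URL = "https://api.amplitude.com/2/httpapi"  # Amplitude HTTP API V2

def build_payload(api_key, pulse_events):
    # Map a few Pulse-style fields onto Amplitude's event schema;
    # the Pulse-side field names here are illustrative.
    events = [
        {
            "insert_id": e.get("@id"),
            "event_type": e.get("@type"),
            "device_id": e.get("device", {}).get("environmentId"),
        }
        for e in pulse_events
    ]
    return {"api_key": api_key, "events": events}

def post_events(api_key, pulse_events):
    data = json.dumps(build_payload(api_key, pulse_events)).encode("utf-8")
    req = urllib.request.Request(
        AMPLITUDE_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status
```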
Transforms
• Because of GDPR we need to anonymize most incoming data formats
• Some data has quality issues that cannot be fixed at the source; transforms are needed to fix them
• In many cases data needs to be transformed from one format to another
• Pulse to Amplitude
• ClickMeter to Pulse
• Convert data to match database structures
• …
17
Who configures?
• Schibsted has >100 business units
• for Data Platform to do detailed configuration for all of
these isn’t going to scale
• for sites to do it themselves saves lots of time
• Transforms require domain knowledge
• each site has its own specialities in Pulse tracking
• to transform these correctly requires knowing all this
18
Batch config: 1 sink
{
  "driver": "anyoffilter",
  "name": "image-classification",
  "rules": [
    { "name": "ImageClassification", "key": "provider.component", "value": "ImageClassification" },
    { "name": "ImageSimilarity", "key": "provider.component", "value": "ImageSimilarity" }
  ],
  "onmatch": [
    {
      "driver": "cache",
      "name": "image-classification",
      "level": "memory+disk"
    },
    {
      "driver": "demuxer",
      "name": "image-classification",
      "rules": "${pulseSdrnFilterUri}",
      "parallel": true,
      "onmatch": {
        "driver": "textfilewriter",
        "uri": "${imageSiteUri}",
        "numFiles": {
          "eventsPerFile": 500000,
          "max": ${numExecutors}
        }
      }
    }
  ],
19
Early config was 1838 lines
Yggdrasil: Scala DSL
override def buildTopology(builder: StreamsBuilder): Unit = {
  import com.schibsted.spt.data.yggdrasil.serde.YggdrasilImplicitSerdes.{json, strings}
  // mads events routing
  val madsProEvents = madsEvents.tryFilter(MadsProEventsPredicate, deadLetterQueue("Default"))
  val madsPreEvents = madsEvents.tryFilter(MadsPreEventsPredicate, deadLetterQueue("Default"))
  val madsDevEvents = madsEvents.tryFilter(new EventSampler(0.01), deadLetterQueue("Default"))
  madsProEvents ~> contentTopic("Personalisation-Rocket-Pro")
  madsPreEvents ~> contentTopic("Personalisation-Rocket-Pre")
  madsDevEvents ~> contentTopic("Personalisation-Rocket-Dev")
  madsProEvents ~> providerIdDemuxer(
    "^urn:schibsted:madstorage-(rkt|web)-tayara-prod:mp-ads-delivery".r -> contentTopic("Image-
    "^urn:schibsted:madstorage-(rkt|web)-corotos-prod:mp-ads-delivery".r -> contentTopic("Image-
  )
20
Duratro: config
pipes {
  ATEDev {
    sourceTopic = "Public-DataPro-Yggdrasil-ATE-Dev-AteBehaviorEvent-1"
    sink {
      type = "kinesis"
      stream = "AUTO-ate-online-events-loader-AteOnlineEventDataStream-3WEL7DDN2KQG"
      region = "eu-west-1"
      role = "arn:aws:iam::972724508451:role/AUTO-ate-online-events-lo-AteOnlineEventsDataWrite-GTMDBZSEZJF0"
      session = "kinesis-ate-dev"
    }
  }
21
Not in great shape
• Transforms were written in Scala code
• not easy to read even for Scala developers
• most site devs are not Scala developers
• Config changes require deploys
• in streaming, matching changes must be made to both
Yggdrasil and Duratro
• Three different configuration syntaxes
• definition of same type of event different in batch &
streaming
22
What if?
• We had an expression language for JSON, kind of
like jq
• could write routing filters using that
• We had a transformation language for JSON
• write as JSON template, using expression language to
compute values to insert
• A custom routing language for both batch and
streaming, based on this language
• designed for easy expressivity & deploy
23
JSLT
• Custom language for JSON transforms & queries
• First iteration
• JSON syntax with ${ … } wrappers for jq expressions
• very simple additions: let, for and if expressions
• tried out, worked well, but not ideal
• Second iteration
• own language from the ground up
• far better performance
• easier to write and use
24
JSLT expressions
25
.foo                       Get "foo" key from input object
.foo.bar                   As above + .bar inside that
.foo == 231                Comparison
.foo and .bar < 12         Boolean operator
test(.foo, "^[a-z0-9]+$")  Functions (& regexps)
JSLT transforms
{
  "insert_id" : ."@id",
  "event_type" : ."@type" + " " + .object."@type",
  "device_id" : .device.environmentId,
  "time": amp:parse_timestamp(.published),
  "device_manufacturer": .device.manufacturer,
  "device_model": .device.model,
  "language": .device.acceptLanguage,
  "os_name": .device.osType,
  "os_version": .device.osVersion,
  …
}
26
More features
• [for (.array) number(.) * 1.1]
• convert each element in an array
• * : .
• object matcher, keeps rest of object unchanged
• {for (.object) "prefix" + .key : .value}
• dynamic rewrite of object
• def func(p1, p2)
• define custom functions
27
Benefits of JSLT
• Easier to read and write than code
• Doesn’t require user to know Scala
• Can be embedded in configuration
• Flexible enough to support 99-100% of filters/
transforms
• Performance quite good
• 5-10x faster than the first iteration, which was based on jackson-jq
28
Routing language
Firehose:
  description: All incoming events.
  transform: transforms/base-cleanup.jslt
PulseBase:
  description: Cleaned-up Pulse events with all the information in them.
  baseType: Firehose
  filter: import "filters/pulse.jslt" as pulse pulse(.)
  transform: transforms/pulse-cleanup.jslt
  postFilter: import "filters/pulse-valid.jslt" as valid valid(.)
30
Pulse definitions
PulseIdentified:
  description: Pulse events with personally identifying information included.
  baseType: PulseBase
  filter: .actor."spt:userId"
  transform: transforms/pulse-identified.jslt
PulseAnonymized:
  description: Pulse events with personally identifying information excluded.
  baseType: PulseBase
  transform: transforms/pulse-anonymized.jslt
31
pulse-identified.jslt
let isFiltered = (contains(get-client(.), $filteredProviders))
{
  "@id" : if ( ."@id" ) sha256-hex($salt + ."@id"),
  "actor" : {
    // remove one user identifier, but spt:userId also contains user ID
    "@id" : if ( .actor."@id" ) null,
    "spt:remoteAddress" : if (not($isFiltered)) .actor."spt:remoteAddress",
    "spt:remoteAddressV6" : if (not($isFiltered)) .actor."spt:remoteAddressV6",
    * : .
  },
  "device" : {
    "environmentId" : if ( .device.environmentId ) null,
    * : .
  },
  "location" : if (not($isFiltered)) .location,
  * : .
}
32
Sinks
VG-ArticleViews-1:
  eventType: PulseLoginPreserved
  filter: get-client(.) == "vg" and ."@type" == "View" and contains(.object."@type", ["Article", "SalesPoster"])
  transform: transforms/vg-article-views.jslt
  kinesis:
    arn: arn:aws:kinesis:eu-west-1:070941167498:stream/vg_article_views
    role: arn:aws:iam::070941167498:role/data-platform-kinesis-write
VG-FrontExperimentsEngagement-1:
  eventType: PulseAnonymized
  filter: get-client(.) == "vg" and ."@type" == "Engagement" and contains(.object."@type", ["Article", "SalesPoster"]) and (contains("df-86-", .origin.terms) or contains("df-86-", .object."spt:custom".terms))
  transform: transforms/vg-article-views.jslt
  kinesis:
    arn: arn:aws:kinesis:eu-west-1:070941167498:stream/vg_front_experiments_engagement
    role: arn:aws:iam::070941167498:role/data-platform-kinesis-write
33
routing-lib
• A Scala library that can load the YAML files
• main dependencies: Jackson and JSLT
• One main API method:
• RoutingConfig.route(JsonNode): Seq[(JsonNode, Sink)]
• Used by
• The Batch Job 2.0
• Yggdrasil
• Duratro
34
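A toy sketch of what RoutingConfig.route does conceptually: resolve each sink's event type (following baseType chains, applying their filters and transforms), then apply the sink's own filter and transform. The real library is Scala on Jackson and JSLT; here filters and transforms are plain Python callables and all names are illustrative:

```python
def apply_type(event, etype, event_types):
    # Resolve the baseType chain first, then this type's filter and transform.
    base = etype.get("baseType")
    if base:
        event = apply_type(event, event_types[base], event_types)
        if event is None:
            return None
    if etype.get("filter") and not etype["filter"](event):
        return None
    return etype.get("transform", lambda x: x)(event)

def route(event, event_types, sinks):
    # Returns (transformed_event, sink_name) pairs, like
    # RoutingConfig.route(JsonNode): Seq[(JsonNode, Sink)].
    results = []
    for name, sink in sinks.items():
        e = apply_type(event, event_types[sink["eventType"]], event_types)
        if e is None:
            continue
        if sink.get("filter") and not sink["filter"](e):
            continue
        transform = sink.get("transform") or (lambda x: x)
        results.append((transform(e), name))
    return results
```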
The Batch Job 2.0
• Three simple steps
• read JSON input from S3 (Spark solves this)
• push JSON data through routing-lib
• write JSON back to S3 (Spark solves this)
• There’s a little more to it than that, but that’s the heart of
it
• much better performance (much less data shuffling)
• better performance means it handles “cluster weather” more
robustly
• easier to catch up if we fall behind
35
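Stripped of Spark, the 2.0 job is a pure map over events. A sketch of that shape, with routing-lib's role played by a `route` callable passed in:

```python
import json
from collections import defaultdict

def run_batch(lines, route):
    """lines: JSON strings read from S3; route(event) -> [(event, sink), ...]."""
    per_sink = defaultdict(list)
    for line in lines:
        event = json.loads(line)
        for out_event, sink in route(event):
            per_sink[sink].append(json.dumps(out_event))
    # in the real job, each per-sink bucket is written back to S3 by Spark
    return per_sink
```

Because each event is routed independently, there is no shuffle: this is why the 2.0 job performs so much better than the old task graph.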
Static routing
• Routing was configuration, packaged as a jar
• Every change required
• make routing PR, merge
• wait for Travis build to upload to Artifactory
• upgrade the batch job, deploy
• upgrade Yggdrasil, deploy
• upgrade Duratro, deploy
36
Hot deploy
37
[Diagram: routing repo → Travis build → routing config publisher → S3 + SQS queue → The Batch Job, Yggdrasil, Duratro]
Self-serve
• Finding the right repo and learning a YAML syntax
is non-trivial
• What if users could instead use a user interface?
• select an event type
• pick a transform
• add a filter, if necessary
• then configure a sink
• press the button, and wham!
39
YAML format
• Was designed for this right from the start
• Having event-types.yaml separate
• enables reuse across batch and streaming
• but also in selfserve
• Making a flat format based on references
• avoids deep, nested tree structures in syntax
• means config can be merged from many sources
40
Hot deploy
42
[Diagram: routing repo → Travis build → routing config publisher → S3 + SQS queue → The Batch Job, Yggdrasil, Duratro; now extended with Pulse Monitor, DynamoDB and a Lambda]
Status
• Routing tree (207 sinks)
• streaming: 400 nodes (140 sinks)
• batch: 127 nodes (51 sinks)
• self-serve: ??? nodes (16 sinks)
• JSLT
• 51 transforms, 2366 lines
• runs ~10 billion transforms/day
• 28 contributors outside team
43
1 month of hot deploy
44
GDPR
Schibsted’s setup
• The individual sites are legally data controllers
• that means, they own the data and the responsibility
• Central Schibsted components are data processors
• that means, they do only what the controllers tell them to
• upside: responsibility rests with the controllers
• Has lots of consequences for how things work
46
Main issues
• Anonymization: handled with transforms
• Retention: handled with S3 lifecycle policies
• Takeout
• only necessary where we are the primary storage
• Deletion: a bit of a problem
• but we have 30 days to comply
47
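Retention via lifecycle policies needs no code at all; a sketch of an S3 lifecycle configuration (the prefix and retention period here are made up for illustration, not taken from the deck):

```json
{
  "Rules": [
    {
      "ID": "pulse-retention",
      "Status": "Enabled",
      "Filter": { "Prefix": "pulse/" },
      "Expiration": { "Days": 365 }
    }
  ]
}
```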
The big picture
48
[Diagram: Privacy Broker and Data Platform]
D-day
49
Next 3 weeks
50
Deletion
51
Impact
52
Data takeout
• Privacy broker posts message on SQS queue
• we take it down to S3, to get Luigi integration
• Luigi starts Spark job
• reads through stored data, looking for that user
• all events from that user are written to S3
• post SQS message back with reference to event and file
• Source data stored in Parquet
• use manually generated index files to avoid processing data
that has no events from this user
53
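The index trick in the last bullet can be sketched in a few lines: per-file indexes map each Parquet file to the user IDs it contains, so the takeout job only opens files that can match. The index shape and names here are assumptions, not the real format:

```python
def files_to_scan(indexes, user_id):
    """indexes: {parquet_file_path: set of user IDs present in that file}.

    Returns only the files worth reading for this takeout request;
    everything else is skipped without being opened.
    """
    return [path for path, users in sorted(indexes.items()) if user_id in users]
```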
Data deletion
• Data is stored as Parquet files in S3
• but Parquet doesn’t have a “delete” function
• you have to rewrite the dataset
• This is slow and costly
• but can be batched: delete many users at once
• batching so many users that the index is useless
• What if someone is reading the dataset when you
are rewriting it?
54
Solution
• bucket/prefix/year=x/month=y/…/gen=0
• data stored under here initially
• bucket/prefix/year=x/month=y/…/gen=1
• data stored here after first rewrite
• once _SUCCESS flag is there consumers must switch
• after a day or so, gen=0 can be deleted
• Janitor Monkey deletes orphan generations & handles
retention
• because deletion rewrites data: messes up last modified
55
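The gen=N convention can be resolved mechanically on the reader side: list the partition, keep only generations that have a _SUCCESS marker, and pick the highest. A sketch, with the listing passed in as plain key strings (in reality it would come from an S3 list call):

```python
import re

def current_generation(keys):
    # keys: all object keys under one partition prefix; a generation
    # counts only once its _SUCCESS flag has been written.
    gens = set()
    for key in keys:
        m = re.search(r"/gen=(\d+)/", key)
        if m and key.endswith("/_SUCCESS"):
            gens.add(int(m.group(1)))
    return max(gens) if gens else None
```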
Data Access Layer
• Building logic to handle gen=x is a pain for users
• the Data Access Layer wraps Spark to do it for them
• Can in the future be expanded to
• filter out rows from users that opt out of certain processing
• do access control on a column level
• …
56
Access control
57
[Diagram: AWS Databox account holding the Sites' data; consumers are Components (Mesos cluster) and Analysts (Jupyter-aaS, Spark, SQLaaS), all gated by AWS IAM policies]
Stricter access to data
• Because the sites are data controllers, they must
decide who should have access to what
• access is controlled by IAM policies
• but users can’t write those, and that’s not safe, anyway
• The system essentially requires communication
• data consumers must request data
• data owners (sites) must approve/reject requests
58
The granule of access
• We have many datasets
• Pulse (anonymized, identified, …)
• Payments (payment data)
• Content (ad and article content)
• …
• Access is per (dataset, site) combination
• you can have access to VG Pulse, but not Aftenposten
Pulse
59
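Per-(dataset, site) access maps naturally onto one IAM statement per combination; a sketch of what a single grant might look like (the bucket name and prefix layout are invented for illustration):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::databox-datasets/pulse-anonymized/vg/*"
    }
  ]
}
```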
Dataset registry
60
Email notification
61
Review screen
62
Challenges
• Users can potentially get access to many (dataset,
site) combinations
• each one needs to go into their IAM policies
• IAM has very strict API limits
• user inline policies: total max 2048 bytes
• managed policy size: max 6144
• max managed policies per account: 1500
• max attached managed policies: 10
• max group memberships: 10
63
Permission packing
• First pack as much as possible into an inline policy
• Then fill up personal managed policies & attach
• Then create more policies and attach to groups,
then attach those
• We believe we can attach 10,000 datasets this way
64
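The packing order above can be sketched as a greedy fill: inline policy first, then personal managed policies, then group-attached ones. Capacities are counted in statements here for simplicity; the real limits are byte sizes and attachment counts:

```python
def pack(statements, inline_cap, managed_cap, max_user_policies, max_groups):
    # Greedily place policy statements: inline first, then per-user
    # managed policies, then managed policies attached via groups.
    placements = {"inline": [], "user_managed": [], "groups": []}
    rest = list(statements)
    placements["inline"], rest = rest[:inline_cap], rest[inline_cap:]
    for _ in range(max_user_policies):
        if not rest:
            break
        placements["user_managed"].append(rest[:managed_cap])
        rest = rest[managed_cap:]
    for _ in range(max_groups):
        if not rest:
            break
        placements["groups"].append(rest[:managed_cap])
        rest = rest[managed_cap:]
    return placements, rest  # non-empty rest means the user hit the ceiling
```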
Sync
12:18:56 INFO c.s.s.d.s.model.IAMPolicyGenerator - User lars.marius.garshol@schibsted.com exists, must be cleaned
12:18:57 INFO c.s.s.d.s.model.IAMPolicyGenerator - Deleting inline selfserve-policy from lars.marius.garshol@schibsted.com
12:18:57 INFO c.s.s.d.s.model.IAMPolicyGenerator - Detaching arn:aws:iam::360928389411:policy/selfserve-lars.marius.garshol@schibsted.com-2 from lars.marius.garshol@schibsted.com
12:18:58 INFO c.s.s.d.s.model.IAMPolicyGenerator - Detaching arn:aws:iam::360928389411:policy/selfserve-lars.marius.garshol@schibsted.com-1 from lars.marius.garshol@schibsted.com
12:18:59 DEBUG c.s.s.d.s.model.IAMPolicyGenerator - Putting inline policy 'selfserve-policy' on lars.marius.garshol@schibsted.com
12:19:00 INFO c.s.s.d.s.model.IAMPolicyGenerator - Policy selfserve-lars.marius.garshol@schibsted.com-1 exists, deleting
12:19:00 DEBUG c.s.s.d.s.model.IAMPolicyGenerator - Creating and attaching selfserve-lars.marius.garshol@schibsted.com-1 to lars.marius.garshol@schibsted.com, 13 statements left
12:19:01 INFO c.s.s.d.s.model.IAMPolicyGenerator - Policy selfserve-lars.marius.garshol@schibsted.com-2 exists, deleting
12:19:01 DEBUG c.s.s.d.s.model.IAMPolicyGenerator - Creating and attaching selfserve-lars.marius.garshol@schibsted.com-2 to lars.marius.garshol@schibsted.com, 0 statements left
12:19:02 DEBUG c.s.s.d.s.model.IAMPolicyGenerator - Putting inline policy 'selfserve-policy' on role lars.marius.garshol@schibsted.com
65
Winding up
Slow maturation
• We started out with almost nothing in 2015
• now finally becoming something closer to what we need to be
• New challenges ahead
• management wants to scale up usage of common solutions
dramatically
• legal basis management is coming
• selfserve needs more functionality
• Data Quality Tooling needs an overhaul
• data discovery service likewise
• …
67
https://slideshare.net/larsga
Questions?

More Related Content

PDF
JSLT: JSON querying and transformation
PDF
How to Make Norikra Perfect
KEY
The Why and How of Scala at Twitter
PDF
Kotlin @ Coupang Backed - JetBrains Day seoul 2018
PDF
Data Analytics Service Company and Its Ruby Usage
PDF
Automatically generating-json-from-java-objects-java-objects268
PPTX
Elasticsearch, Logstash, Kibana. Cool search, analytics, data mining and more...
KEY
The Return of the Living Datalog
JSLT: JSON querying and transformation
How to Make Norikra Perfect
The Why and How of Scala at Twitter
Kotlin @ Coupang Backed - JetBrains Day seoul 2018
Data Analytics Service Company and Its Ruby Usage
Automatically generating-json-from-java-objects-java-objects268
Elasticsearch, Logstash, Kibana. Cool search, analytics, data mining and more...
The Return of the Living Datalog

What's hot (20)

ODP
Query DSL In Elasticsearch
PPTX
ElasticSearch for .NET Developers
PDF
Logstash-Elasticsearch-Kibana
PDF
"ClojureScript journey: from little script, to CLI program, to AWS Lambda fun...
PDF
Requery overview
KEY
ClojureScript Anatomy
PPTX
ElasticSearch AJUG 2013
PPT
{{more}} Kibana4
PDF
SQL for Elasticsearch
PPTX
Dive into spark2
PDF
High-Performance Hibernate Devoxx France 2016
PDF
(Big) Data Serialization with Avro and Protobuf
PDF
Logging logs with Logstash - Devops MK 10-02-2016
KEY
MongoSF - mongodb @ foursquare
PDF
Elastic Search
ZIP
Above the clouds: introducing Akka
KEY
Building Scalable, Distributed Job Queues with Redis and Redis::Client
PDF
Mobile Analytics mit Elasticsearch und Kibana
PPTX
DZone Java 8 Block Buster: Query Databases Using Streams
PDF
Boosting Machine Learning with Redis Modules and Spark
Query DSL In Elasticsearch
ElasticSearch for .NET Developers
Logstash-Elasticsearch-Kibana
"ClojureScript journey: from little script, to CLI program, to AWS Lambda fun...
Requery overview
ClojureScript Anatomy
ElasticSearch AJUG 2013
{{more}} Kibana4
SQL for Elasticsearch
Dive into spark2
High-Performance Hibernate Devoxx France 2016
(Big) Data Serialization with Avro and Protobuf
Logging logs with Logstash - Devops MK 10-02-2016
MongoSF - mongodb @ foursquare
Elastic Search
Above the clouds: introducing Akka
Building Scalable, Distributed Job Queues with Redis and Redis::Client
Mobile Analytics mit Elasticsearch und Kibana
DZone Java 8 Block Buster: Query Databases Using Streams
Boosting Machine Learning with Redis Modules and Spark
Ad

Similar to Data collection in AWS at Schibsted (20)

PPTX
Elastic Stack Introduction
PDF
NetflixOSS Open House Lightning talks
PDF
Data pipelines from zero to solid
PDF
Eko10 - Security Monitoring for Big Infrastructures without a Million Dollar ...
PDF
Data Infrastructure for a World of Music
PDF
Security Monitoring for big Infrastructures without a Million Dollar budget
PPTX
Architectures, Frameworks and Infrastructure
PDF
the tooling of a modern and agile oracle dba
PDF
Building a Sustainable Data Platform on AWS
PPTX
AWS big-data-demystified #1.1 | Big Data Architecture Lessons Learned | English
PDF
Stream-Native Processing with Pulsar Functions
PDF
Collecting 600M events/day
PPTX
AWS Big Data Demystified #1.2 | Big Data architecture lessons learned
PDF
Cloud Camp Chicago Dec 2012 - All presentations
PDF
Cloud Camp Chicago Dec 2012 Slides
PDF
Music city data Hail Hydrate! from stream to lake
PPTX
From java to scala at crowd mix
PDF
Flink Forward San Francisco 2019: Building Financial Identity Platform using ...
PPT
Building Scalable Big Data Infrastructure Using Open Source Software Presenta...
PDF
DevOps in the Amazon Cloud – Learn from the pioneersNetflix suro
Elastic Stack Introduction
NetflixOSS Open House Lightning talks
Data pipelines from zero to solid
Eko10 - Security Monitoring for Big Infrastructures without a Million Dollar ...
Data Infrastructure for a World of Music
Security Monitoring for big Infrastructures without a Million Dollar budget
Architectures, Frameworks and Infrastructure
the tooling of a modern and agile oracle dba
Building a Sustainable Data Platform on AWS
AWS big-data-demystified #1.1 | Big Data Architecture Lessons Learned | English
Stream-Native Processing with Pulsar Functions
Collecting 600M events/day
AWS Big Data Demystified #1.2 | Big Data architecture lessons learned
Cloud Camp Chicago Dec 2012 - All presentations
Cloud Camp Chicago Dec 2012 Slides
Music city data Hail Hydrate! from stream to lake
From java to scala at crowd mix
Flink Forward San Francisco 2019: Building Financial Identity Platform using ...
Building Scalable Big Data Infrastructure Using Open Source Software Presenta...
DevOps in the Amazon Cloud – Learn from the pioneersNetflix suro
Ad

More from Lars Marius Garshol (20)

PPTX
Kveik - what is it?
PDF
Nature-inspired algorithms
PDF
History of writing
PDF
NoSQL and Einstein's theory of relativity
PPTX
Norwegian farmhouse ale
PPTX
Archive integration with RDF
PPTX
The Euro crisis in 10 minutes
PPTX
Using the search engine as recommendation engine
PPTX
Linked Open Data for the Cultural Sector
PPTX
NoSQL databases, the CAP theorem, and the theory of relativity
PPTX
Bitcoin - digital gold
PPTX
Introduction to Big Data/Machine Learning
PPTX
Hops - the green gold
PPTX
Big data 101
PPTX
Linked Open Data
PPTX
Hafslund SESAM - Semantic integration in practice
PPTX
Approximate string comparators
PPTX
Experiments in genetic programming
PPTX
Semantisk integrasjon
PPTX
Linking data without common identifiers
Kveik - what is it?
Nature-inspired algorithms
History of writing
NoSQL and Einstein's theory of relativity
Norwegian farmhouse ale
Archive integration with RDF
The Euro crisis in 10 minutes
Using the search engine as recommendation engine
Linked Open Data for the Cultural Sector
NoSQL databases, the CAP theorem, and the theory of relativity
Bitcoin - digital gold
Introduction to Big Data/Machine Learning
Hops - the green gold
Big data 101
Linked Open Data
Hafslund SESAM - Semantic integration in practice
Approximate string comparators
Experiments in genetic programming
Semantisk integrasjon
Linking data without common identifiers

Recently uploaded (20)

PDF
Galatica Smart Energy Infrastructure Startup Pitch Deck
PDF
Business Analytics and business intelligence.pdf
PDF
Capcut Pro Crack For PC Latest Version {Fully Unlocked 2025}
PPT
Predictive modeling basics in data cleaning process
PPTX
MODULE 8 - DISASTER risk PREPAREDNESS.pptx
PDF
Mega Projects Data Mega Projects Data
PPTX
Introduction to Knowledge Engineering Part 1
PDF
22.Patil - Early prediction of Alzheimer’s disease using convolutional neural...
PDF
Introduction to the R Programming Language
PPTX
oil_refinery_comprehensive_20250804084928 (1).pptx
PPTX
Data_Analytics_and_PowerBI_Presentation.pptx
PPTX
IBA_Chapter_11_Slides_Final_Accessible.pptx
PPTX
IB Computer Science - Internal Assessment.pptx
PDF
annual-report-2024-2025 original latest.
PPTX
modul_python (1).pptx for professional and student
PPT
Reliability_Chapter_ presentation 1221.5784
PPTX
Microsoft-Fabric-Unifying-Analytics-for-the-Modern-Enterprise Solution.pptx
PPTX
AI Strategy room jwfjksfksfjsjsjsjsjfsjfsj
PPTX
Introduction-to-Cloud-ComputingFinal.pptx
Galatica Smart Energy Infrastructure Startup Pitch Deck
Business Analytics and business intelligence.pdf
Capcut Pro Crack For PC Latest Version {Fully Unlocked 2025}
Predictive modeling basics in data cleaning process
MODULE 8 - DISASTER risk PREPAREDNESS.pptx
Mega Projects Data Mega Projects Data
Introduction to Knowledge Engineering Part 1
22.Patil - Early prediction of Alzheimer’s disease using convolutional neural...
Introduction to the R Programming Language
oil_refinery_comprehensive_20250804084928 (1).pptx
Data_Analytics_and_PowerBI_Presentation.pptx
IBA_Chapter_11_Slides_Final_Accessible.pptx
IB Computer Science - Internal Assessment.pptx
annual-report-2024-2025 original latest.
modul_python (1).pptx for professional and student
Reliability_Chapter_ presentation 1221.5784
Microsoft-Fabric-Unifying-Analytics-for-the-Modern-Enterprise Solution.pptx
AI Strategy room jwfjksfksfjsjsjsjsjfsjfsj
Introduction-to-Cloud-ComputingFinal.pptx

Data collection in AWS at Schibsted

  • 1. Data collection in AWS Lars Marius Garshol, lars.marius.garshol@schibsted.com http://twitter.com/larsga 2018–09–04, AWS Meetup
  • 3. Schibsted 3 30 countries 200 million users/month 20 billion pageviews/month
  • 5. Data Platform team 5 Jordi Roura Ole-Magnus Røysted Aker Sangram Bal Oleksandr Ivanov Mårten Rånge Håvard Wall Bjørn Rustad Rolv Seehus Håkon Åmdal Fredrik Vraalsen Øyvind Løkling Per Wessel Nore Lars Marius Garshol Ning Zhou
  • 9. The Batch Job • Implemented in Apache Spark • Luigi for scheduling/orchestration • Runs in a shared Mesos cluster • this was set up because letting users create individual clusters became far too expensive • this cluster is the main cost driver for Schibsted’s AWS bills • Difficult environment to work with • hard to debug and develop in 9
  • 10. Problems with batch • Configuration file mapped to Spark tasks • very complex set of Spark tasks • requires lots of communication between Spark nodes • runs slowly • Very resource-intensive • had difficulty keeping up with incoming traffic • very sensitive to “cluster weather” • brittle 10
  • 11. Piper & Storage • Ordinary Java and Scala applications • read from the data source, perform all processing on one node, then write to the destination • no communication necessary between nodes • normal EC2 nodes with the application baked into the AMI • therefore scales trivially with Auto Scaling Groups • Instrumented with Datadog, logs loaded into SumoLogic 11
  • 13. Kafka vs Kinesis • Kinesis has very strict API limits • total read limit = 2x write limit • effectively limited to 2 readers • Kinesis API is very limited • basically only supports reading records in order • Kafka improves on both • can support many readers simultaneously • advanced Scala DSL with window functions etc etc 13
  • 15. Handling slow consumers 15 Kafka Yggdrasil Firehose One topic per consumer Duratro One topic per consumer OK OK OK OK Slow Transforms Filtering
  • 16. Pulse challenges • Pulse is a tracking solution with no user interface • you want dashboards to analyze user traffic? sorry • problem is: not enough resources to develop that • Using Amplitude to solve that • created a Duratro sink for Amplitude • simple HTTP POST of JSON events to Amplitude API • users can now create Amplitude projects, feed their Pulse events there, and finally have dashboards 16
  • 17. Transforms • Because GDPR we need to anonymize most incoming data formats • Some data has data quality issues that cannot be fixed at source, requires transforms to solve • In many cases data needs to be transformed from one format to another • Pulse to Amplitude • ClickMeter to Pulse • Convert data to match database structures • … 17
  • 18. Who configures? • Schibsted has >100 business units • for Data Platform to do detailed configuration for all of these isn’t going to scale • for sites to do it themselves saves lots of time • Transforms require domain knowledge • each site has its own specialities in Pulse tracking • to transform these correctly requires knowing all this 18
  • 19. Batch config: 1 sink { "driver": "anyoffilter", "name": "image-classification", "rules": [ { "name": "ImageClassification", "key": "provider.component", "value": "ImageClassification" }, { "name": "ImageSimilarity", "key": "provider.component", "value": "ImageSimilarity" } ], "onmatch": [ { "driver": "cache", "name": "image-classification", "level": "memory+disk" }, { "driver": "demuxer", "name": "image-classification", "rules": "${pulseSdrnFilterUri}", "parallel": true, "onmatch": { "driver": "textfilewriter", "uri": "${imageSiteUri}", "numFiles": { "eventsPerFile": 500000, "max": ${numExecutors} } } } ], 19 Early config was 1838 lines
  • 20. Yggdrasil: Scala DSL override def buildTopology(builder: StreamsBuilder): Unit = { import com.schibsted.spt.data.yggdrasil.serde.YggdrasilImplicitSerdes.{json, strings} // mads events routing val madsProEvents = madsEvents.tryFilter(MadsProEventsPredicate, deadLetterQueue("Default")) val madsPreEvents = madsEvents.tryFilter(MadsPreEventsPredicate, deadLetterQueue("Default")) val madsDevEvents = madsEvents.tryFilter(new EventSampler(0.01), deadLetterQueue("Default")) madsProEvents ~> contentTopic("Personalisation-Rocket-Pro") madsPreEvents ~> contentTopic("Personalisation-Rocket-Pre") madsDevEvents ~> contentTopic("Personalisation-Rocket-Dev") madsProEvents ~> providerIdDemuxer( "^urn:schibsted:madstorage-(rkt|web)-tayara-prod:mp-ads-delivery".r -> contentTopic("Image- "^urn:schibsted:madstorage-(rkt|web)-corotos-prod:mp-ads-delivery".r -> contentTopic(“Image- ) 20
  • 21. Duratro: config pipes { ATEDev { sourceTopic = "Public-DataPro-Yggdrasil-ATE-Dev-AteBehaviorEvent-1" sink { type = "kinesis" stream = "AUTO-ate-online-events-loader-AteOnlineEventDataStream-3WEL7DDN2KQG" region = "eu-west-1" role = "arn:aws:iam::972724508451:role/AUTO-ate-online-events-lo-AteOnlineEventsDataWrite-GTMDBZSEZJF0" session = "kinesis-ate-dev" } } 21
  • 22. Not in great shape • Transforms were written in Scala code • not easy to read even for Scala developers • most site devs are not Scala developers • Config changes require deploys • in streaming, matching changes must be made to both Yggdrasil and Duratro • Three different configuration syntaxes • definition of same type of event different in batch & streaming 22
  • 23. What if? • We had an expression language for JSON, kind of like jq • could write routing filters using that • We had a tranformation language for JSON • write as JSON template, using expression language to compute values to insert • A custom routing language for both batch and streaming, based on this language • designed for easy expressivity & deploy 23
  • 24. JSLT • Custom language for JSON transforms & queries • First iteration • JSON syntax with ${ … } wrappers for jq expressions • very simple additions: let, for and if expressions • tried out, worked well, but not ideal • Second iteration • own language from the ground up • far better performance • easier to write and use 24
  • 25. JSLT expressions 25 .foo Get “foo” key from input object .foo.bar As above + .bar inside that .foo == 231 Comparison .foo and .bar < 12 Boolean operator test(.foo, “^[a-z0-9]+$”) Functions (& regexps)
  • 26. JSLT transforms { “insert_id” : .”@id”, “event_type” : ."@type" + " " + .object.”@type”, "device_id" : .device.environmentId, "time": amp:parse_timestamp(.published), "device_manufacturer": .device.manufacturer, "device_model": .device.model, "language": .device.acceptLanguage, "os_name": .device.osType, "os_version": .device.osVersion, … } 26
  • 27. More features • [for (.array) number(.) * 1.1] • convert each element in an array • * : . • object matcher, keeps rest of object unchanged • {for (.object) “prefix” + .key : .value} • dynamic rewrite of object • def func(p1, p2) • define custom functions 27
  • 28. Benefits of JSLT • Easier to read and write than code • Doesn’t require user to know Scala • Can be embedded in configuration • Flexible enough to support 99-100% of filters/ transforms • Performance quite good • 5-10x original language based on jackson-jq 28
• 30. Routing language
Firehose:
  description: All incoming events.
  transform: transforms/base-cleanup.jslt
PulseBase:
  description: Cleaned-up Pulse events with all the information in them.
  baseType: Firehose
  filter: |
    import "filters/pulse.jslt" as pulse
    pulse(.)
  transform: transforms/pulse-cleanup.jslt
  postFilter: |
    import "filters/pulse-valid.jslt" as valid
    valid(.)
30
• 31. Pulse definitions
PulseIdentified:
  description: Pulse events with personally identifying information included.
  baseType: PulseBase
  filter: .actor."spt:userId"
  transform: transforms/pulse-identified.jslt
PulseAnonymized:
  description: Pulse events with personally identifying information excluded.
  baseType: PulseBase
  transform: transforms/pulse-anonymized.jslt
31
• 32. pulse-identified.jslt
let isFiltered = (contains(get-client(.), $filteredProviders))
{
  "@id" : if ( ."@id" ) sha256-hex($salt + ."@id"),
  "actor" : {
    // remove one user identifier, but spt:userId also contains user ID
    "@id" : if ( .actor."@id" ) null,
    "spt:remoteAddress" : if (not($isFiltered)) .actor."spt:remoteAddress",
    "spt:remoteAddressV6" : if (not($isFiltered)) .actor."spt:remoteAddressV6",
    * : .
  },
  "device" : {
    "environmentId" : if ( .device.environmentId ) null,
    * : .
  },
  "location" : if (not($isFiltered)) .location,
  * : .
}
32
• 33. Sinks
VG-ArticleViews-1:
  eventType: PulseLoginPreserved
  filter: |
    get-client(.) == "vg" and ."@type" == "View" and
    contains(.object."@type", ["Article", "SalesPoster"])
  transform: transforms/vg-article-views.jslt
  kinesis:
    arn: arn:aws:kinesis:eu-west-1:070941167498:stream/vg_article_views
    role: arn:aws:iam::070941167498:role/data-platform-kinesis-write
VG-FrontExperimentsEngagement-1:
  eventType: PulseAnonymized
  filter: |
    get-client(.) == "vg" and ."@type" == "Engagement" and
    contains(.object."@type", ["Article", "SalesPoster"]) and
    (contains("df-86-", .origin.terms) or contains("df-86-", .object."spt:custom".terms))
  transform: transforms/vg-article-views.jslt
  kinesis:
    arn: arn:aws:kinesis:eu-west-1:070941167498:stream/vg_front_experiments_engagement
    role: arn:aws:iam::070941167498:role/data-platform-kinesis-write
33
  • 34. routing-lib • A Scala library that can load the YAML files • main dependencies: Jackson and JSLT • One main API method: • RoutingConfig.route(JsonNode): Seq[(JsonNode, Sink)] • Used by • The Batch Job 2.0 • Yggdrasil • Duratro 34
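The shape of that one API method can be sketched in a few lines. This is a hypothetical Python analogue of `RoutingConfig.route` (the real library is Scala and evaluates JSLT filters and transforms); the `Node`/`Sink` names are invented for illustration.

```python
# Sketch: walk a routing tree, applying each node's filter and
# transform, and collect (transformed event, sink) pairs.

from dataclasses import dataclass, field
from typing import Callable, List, Optional

@dataclass
class Sink:
    name: str

@dataclass
class Node:
    filter: Callable[[dict], bool] = lambda e: True
    transform: Callable[[dict], dict] = lambda e: e
    sink: Optional[Sink] = None
    children: List["Node"] = field(default_factory=list)

def route(node: Node, event: dict):
    """Return all (event, sink) pairs this event should be written to."""
    if not node.filter(event):
        return []  # filtered out: nothing below this node sees the event
    transformed = node.transform(event)
    results = [(transformed, node.sink)] if node.sink else []
    for child in node.children:
        results.extend(route(child, transformed))
    return results
```

An event that fails a node's filter is dropped for that whole subtree, which is what makes baseType chains like Firehose → PulseBase → PulseIdentified cheap to evaluate.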
  • 35. The Batch Job 2.0 • Three simple steps • read JSON input from S3 (Spark solves this) • push JSON data through routing-lib • write JSON back to S3 (Spark solves this) • There’s a little more to it than that, but that’s the heart of it • much better performance (much less data shuffling) • better performance means it handles “cluster weather” more robustly • easier to catch up if we fall behind 35
  • 36. Static routing • Routing was configuration, packaged as a jar • Every change required • make routing PR, merge • wait for Travis build to upload to Artifactory • upgrade the batch job, deploy • upgrade Yggdrasil, deploy • upgrade Duratro, deploy 36
  • 39. Self-serve • Finding the right repo and learning a YAML syntax is non-trivial • What if users could instead use a user interface? • select an event type • pick a transform • add a filter, if necessary • then configure a sink • press the button, and wham! 39
  • 40. YAML format • Was designed for this right from the start • Having event-types.yaml separate • enables reuse across batch and streaming • but also in selfserve • Making a flat format based on references • avoids deep, nested tree structures in syntax • means config can be merged from many sources 40
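Why the flat, reference-based format merges well can be shown with a small sketch: each source (batch repo, streaming repo, self-serve) contributes named nodes, and merging is just combining maps while rejecting duplicate names. This is an illustrative assumption about the merge semantics, not the actual code.

```python
# Sketch: merge routing config fragments from several sources into one
# flat name -> node map, refusing conflicting redefinitions.

def merge_configs(*configs):
    merged = {}
    for config in configs:
        for name, node in config.items():
            if name in merged:
                raise ValueError(f"duplicate node: {name}")
            merged[name] = node
    return merged
```

With a deeply nested syntax, merging fragments from independent sources would instead require reconciling overlapping subtrees.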
  • 43. Status • Routing tree (207 sinks) • streaming: 400 nodes (140 sinks) • batch: 127 nodes (51 sinks) • self-serve: ??? nodes (16 sinks) • JSLT • 51 transforms, 2366 lines • runs ~10 billion transforms/day • 28 contributors outside team 43
  • 44. 1 month of hot deploy 44
  • 45. GDPR
• 46. Schibsted’s setup • The individual sites are legally data controllers • that means they own the data and the responsibility • Central Schibsted components are data processors • that means they do only what the controllers tell them to • upside: responsibility rests with the controllers • Has lots of consequences for how things work 46
  • 47. Main issues • Anonymization: handled with transforms • Retention: handled with S3 lifecycle policies • Takeout • only necessary where we are the primary storage • Deletion: a bit of a problem • but we have 30 days to comply 47
  • 53. Data takeout • Privacy broker posts message on SQS queue • we take it down to S3, to get Luigi integration • Luigi starts Spark job • reads through stored data, looking for that user • all events from that user are written to S3 • post SQS message back with reference to event and file • Source data stored in Parquet • use manually generated index files to avoid processing data that has no events from this user 53
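The index-file trick is the key cost saver here. A minimal sketch of the idea, assuming the index simply maps each Parquet file to the set of user ids it contains (the real index format is not shown in the slides):

```python
# Sketch: prune the takeout scan using an index from file path to the
# user ids that file contains, so files without the user are skipped.

def files_to_scan(index, user_id):
    """index: {file_path: set of user ids}; return only files worth reading."""
    return [path for path, users in index.items() if user_id in users]
```

For a single user, most partitions contain no events from them, so the Spark job touches only a small fraction of the dataset.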
  • 54. Data deletion • Data is stored as Parquet files in S3 • but Parquet doesn’t have a “delete” function • you have to rewrite the dataset • This is slow and costly • but can be batched: delete many users at once • batching so many users that the index is useless • What if someone is reading the dataset when you are rewriting it? 54
• 55. Solution • bucket/prefix/year=x/month=y/…/gen=0 • data stored under here initially • bucket/prefix/year=x/month=y/…/gen=1 • data stored here after first rewrite • once the _SUCCESS flag is there, consumers must switch • after a day or so, gen=0 can be deleted • Janitor Monkey deletes orphan generations & handles retention • because deletion rewrites the data, it messes up last-modified timestamps 55
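A reader can resolve which generation to use with a simple rule: among the gen=N prefixes of a partition, take the highest N that has a _SUCCESS flag. A sketch under that assumption (key layout simplified to strings):

```python
# Sketch: pick the current generation of one partition from its S3 keys.
# Half-written rewrites have no _SUCCESS flag yet and are never chosen.

def current_generation(keys):
    """keys: list of S3 keys under one partition prefix, e.g.
    'prefix/year=2018/gen=1/_SUCCESS'. Returns the newest complete gen."""
    complete = set()
    for key in keys:
        if key.endswith("/_SUCCESS"):
            gen_part = key.split("/")[-2]        # e.g. "gen=1"
            complete.add(int(gen_part.split("=")[1]))
    return max(complete) if complete else None
```

This is why a concurrent reader is safe during a rewrite: it keeps reading gen=0 until gen=1's _SUCCESS appears, and gen=0 survives long enough for in-flight jobs to finish.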
  • 56. Data Access Layer • Building logic to handle gen=x is a pain for users • the Data Access Layer wraps Spark to do it for them • Can in the future be expanded to • filter out rows from users that opt out of certain processing • do access control on a column level • … 56
• 57. Access control (architecture diagram; labels: AWS Databox account, Sites, Components, Mesos cluster, Analysts, AWS IAM policies, Jupyter-aaS, Spark, SQLaaS) 57
• 58. Stricter access to data • Because the sites are data controllers, they must decide who should have access to what • access is controlled by IAM policies • but users can’t write those, and that’s not safe, anyway • The system essentially requires communication • data consumers must request data • data owners (sites) must approve/reject requests 58
  • 59. The granule of access • We have many datasets • Pulse (anonymized, identified, …) • Payments (payment data) • Content (ad and article content) • … • Access is per (dataset, site) combination • you can have access to VG Pulse, but not Aftenposten Pulse 59
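Conceptually an access check is just membership in a set of (dataset, site) pairs; the names below are hypothetical, and the real system compiles such grants into IAM policies rather than checking them at read time.

```python
# Sketch: the access granule as (dataset, site) grants.

def has_access(grants, dataset, site):
    """grants: set of (dataset, site) tuples a user holds."""
    return (dataset, site) in grants
```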
  • 63. Challenges • Users can potentially get access to many (dataset, site) combinations • each one needs to go into their IAM policies • IAM has very strict API limits • user inline policies: total max 2048 bytes • managed policy size: max 6144 • max managed policies per account: 1500 • max attached managed policies: 10 • max group memberships: 10 63
  • 64. Permission packing • First pack as much as possible into an inline policy • Then fill up personal managed policies & attach • Then create more policies and attach to groups, then attach those • We believe we can attach 10,000 datasets this way 64
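The packing strategy above can be sketched as a greedy algorithm against the IAM limits from the previous slide: fill the 2048-byte inline policy first, then spill remaining statements into 6144-byte managed policies. This is a simplification (the policy envelope's own JSON overhead and the attachment limits are ignored), not the actual IAMPolicyGenerator code.

```python
import json

INLINE_LIMIT = 2048   # max bytes for user inline policies (total)
MANAGED_LIMIT = 6144  # max bytes per managed policy

def pack(statements):
    """Greedily split policy statements into (inline, [managed, ...])."""
    inline, managed, current = [], [], []

    def size(stmts):
        # approximate policy size by the serialized statement list
        return len(json.dumps(stmts))

    for stmt in statements:
        if size(inline + [stmt]) <= INLINE_LIMIT:
            inline.append(stmt)
        elif size(current + [stmt]) <= MANAGED_LIMIT:
            current.append(stmt)
        else:
            if current:
                managed.append(current)
            current = [stmt]
    if current:
        managed.append(current)
    return inline, managed
```

Attaching the first ten managed policies directly and the rest via group memberships is what stretches the scheme to the ~10,000-dataset estimate.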
• 65. Sync
12:18:56 INFO c.s.s.d.s.model.IAMPolicyGenerator - User lars.marius.garshol@schibsted.com exists, must be cleaned
12:18:57 INFO c.s.s.d.s.model.IAMPolicyGenerator - Deleting inline selfserve-policy from lars.marius.garshol@schibsted.com
12:18:57 INFO c.s.s.d.s.model.IAMPolicyGenerator - Detaching arn:aws:iam::360928389411:policy/selfserve-lars.marius.garshol@schibsted.com-2 from lars.marius.garshol@schibsted.com
12:18:58 INFO c.s.s.d.s.model.IAMPolicyGenerator - Detaching arn:aws:iam::360928389411:policy/selfserve-lars.marius.garshol@schibsted.com-1 from lars.marius.garshol@schibsted.com
12:18:59 DEBUG c.s.s.d.s.model.IAMPolicyGenerator - Putting inline policy 'selfserve-policy' on lars.marius.garshol@schibsted.com
12:19:00 INFO c.s.s.d.s.model.IAMPolicyGenerator - Policy selfserve-lars.marius.garshol@schibsted.com-1 exists, deleting
12:19:00 DEBUG c.s.s.d.s.model.IAMPolicyGenerator - Creating and attaching selfserve-lars.marius.garshol@schibsted.com-1 to lars.marius.garshol@schibsted.com, 13 statements left
12:19:01 INFO c.s.s.d.s.model.IAMPolicyGenerator - Policy selfserve-lars.marius.garshol@schibsted.com-2 exists, deleting
12:19:01 DEBUG c.s.s.d.s.model.IAMPolicyGenerator - Creating and attaching selfserve-lars.marius.garshol@schibsted.com-2 to lars.marius.garshol@schibsted.com, 0 statements left
12:19:02 DEBUG c.s.s.d.s.model.IAMPolicyGenerator - Putting inline policy 'selfserve-policy' on role lars.marius.garshol@schibsted.com
65
  • 67. Slow maturation • We started out with almost nothing in 2015 • now finally becoming something closer to what we need to be • New challenges ahead • management wants to scale up usage of common solutions dramatically • legal basis management is coming • selfserve needs more functionality • Data Quality Tooling needs an overhaul • data discovery service likewise • … 67