Data collection in AWS
Lars Marius Garshol, lars.marius.garshol@schibsted.com
http://twitter.com/larsga
2018–09–04, AWS Meetup
Schibsted?
Collecting data?
What?
Schibsted
3
30 countries
200 million users/month
20 billion pageviews/month
Three parts
4
Data Platform team
5
Jordi Roura
Ole-Magnus Røysted Aker
Sangram Bal
Oleksandr Ivanov
Mårten Rånge
Håvard Wall
Bjørn Rustad
Rolv Seehus
Håkon Åmdal
Fredrik Vraalsen
Øyvind Løkling
Per Wessel Nore
Lars Marius Garshol
Ning Zhou
Data Platform
6
[Diagram: the Data Platform's three parts: Batch, Streaming, Pulse]
Data volume
7
Original architecture
8
[Architecture diagram: Collector feeding two Kinesis streams, consumed by Storage (→ S3, processed by The Batch Job) and by Piper]
The Batch Job
• Implemented in Apache Spark
• Luigi for scheduling/orchestration
• Runs in a shared Mesos cluster
• this was set up because letting users create individual
clusters became far too expensive
• this cluster is the main cost driver for Schibsted’s AWS bills
• Difficult environment to work with
• hard to debug and develop in
9
Problems with batch
• Configuration file mapped to Spark tasks
• very complex set of Spark tasks
• requires lots of communication between Spark nodes
• runs slowly
• Very resource-intensive
• had difficulty keeping up with incoming traffic
• very sensitive to “cluster weather”
• brittle
10
Piper & Storage
• Ordinary Java and Scala applications
• read from the data source, perform all processing on one
node, then write to the destination
• no communication necessary between nodes
• normal EC2 nodes with the application baked into the AMI
• therefore scales trivially with Auto Scaling Groups
• Instrumented with Datadog, logs loaded into
SumoLogic
11
Piper’s problem
12
[Diagram: Piper delivering to several sinks, among them Storage S3 and SQS; four sinks marked OK, one marked Slow]
Kafka vs Kinesis
• Kinesis has very strict API limits
• total read limit = 2x write limit
• effectively limited to 2 readers
• Kinesis API is very limited
• basically only supports reading records in order
• Kafka improves on both
• can support many readers simultaneously
• advanced Scala DSL with window functions, etc.
13
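The "2 readers" claim follows directly from Kinesis's published per-shard quotas; a quick sanity check (the quota numbers are AWS's public per-shard figures, not from the deck):

```python
# Kinesis per-shard quotas: producers may write 1 MB/s, consumers may
# read 2 MB/s in total. At full write rate that leaves room for only
# two consumers each reading the whole stream.
WRITE_LIMIT_MBPS = 1.0   # per-shard write quota
READ_LIMIT_MBPS = 2.0    # per-shard total read quota

max_full_rate_readers = READ_LIMIT_MBPS / WRITE_LIMIT_MBPS  # 2.0
```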
New architecture
14
[Architecture diagram: the original Collector / Kinesis / Storage / S3 / The Batch Job pipeline, now extended with Kafka; the consumer side of Kafka is still marked "?"]
Handling slow consumers
15
[Diagram: Kafka with a Firehose topic; Yggdrasil applies transforms and filtering to produce one topic per consumer; Duratro reads those topics and delivers to sinks, four marked OK, one marked Slow, without the Slow one holding back the others]
Pulse challenges
• Pulse is a tracking solution with no user interface
• you want dashboards to analyze user traffic? sorry
• problem is: not enough resources to develop that
• Using Amplitude to solve that
• created a Duratro sink for Amplitude
• simple HTTP POST of JSON events to Amplitude API
• users can now create Amplitude projects, feed their Pulse
events there, and finally have dashboards
16
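The Amplitude sink described above boils down to an HTTP POST of JSON events. A minimal sketch against Amplitude's public HTTP API V2; the Pulse field mapping shown is illustrative, and the real Duratro sink is Scala, not Python:

```python
import json
import urllib.request

AMPLITUDE_URL = "https://api.amplitude.com/2/httpapi"  # Amplitude HTTP API V2

def build_payload(api_key, pulse_events):
    # Map a few Pulse-style fields onto Amplitude's event schema;
    # the Pulse-side field names here are illustrative.
    events = [
        {
            "insert_id": e.get("@id"),
            "event_type": e.get("@type"),
            "device_id": e.get("device", {}).get("environmentId"),
        }
        for e in pulse_events
    ]
    return {"api_key": api_key, "events": events}

def post_events(api_key, pulse_events):
    data = json.dumps(build_payload(api_key, pulse_events)).encode("utf-8")
    req = urllib.request.Request(
        AMPLITUDE_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status
```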
Transforms
• Because of GDPR we need to anonymize most incoming data formats
• Some data has quality issues that cannot be fixed at the source; transforms are needed to fix them
• In many cases data needs to be transformed from one format to another
• Pulse to Amplitude
• ClickMeter to Pulse
• Convert data to match database structures
• …
17
Who configures?
• Schibsted has >100 business units
• for Data Platform to do detailed configuration for all of
these isn’t going to scale
• for sites to do it themselves saves lots of time
• Transforms require domain knowledge
• each site has its own specialities in Pulse tracking
• to transform these correctly requires knowing all this
18
Batch config: 1 sink
{
  "driver": "anyoffilter",
  "name": "image-classification",
  "rules": [
    { "name": "ImageClassification", "key": "provider.component", "value": "ImageClassification" },
    { "name": "ImageSimilarity", "key": "provider.component", "value": "ImageSimilarity" }
  ],
  "onmatch": [
    {
      "driver": "cache",
      "name": "image-classification",
      "level": "memory+disk"
    },
    {
      "driver": "demuxer",
      "name": "image-classification",
      "rules": "${pulseSdrnFilterUri}",
      "parallel": true,
      "onmatch": {
        "driver": "textfilewriter",
        "uri": "${imageSiteUri}",
        "numFiles": {
          "eventsPerFile": 500000,
          "max": ${numExecutors}
        }
      }
    }
  ],
19
Early config was 1838 lines
Yggdrasil: Scala DSL
override def buildTopology(builder: StreamsBuilder): Unit = {
  import com.schibsted.spt.data.yggdrasil.serde.YggdrasilImplicitSerdes.{json, strings}
  // mads events routing
  val madsProEvents = madsEvents.tryFilter(MadsProEventsPredicate, deadLetterQueue("Default"))
  val madsPreEvents = madsEvents.tryFilter(MadsPreEventsPredicate, deadLetterQueue("Default"))
  val madsDevEvents = madsEvents.tryFilter(new EventSampler(0.01), deadLetterQueue("Default"))
  madsProEvents ~> contentTopic("Personalisation-Rocket-Pro")
  madsPreEvents ~> contentTopic("Personalisation-Rocket-Pre")
  madsDevEvents ~> contentTopic("Personalisation-Rocket-Dev")
  madsProEvents ~> providerIdDemuxer(
    "^urn:schibsted:madstorage-(rkt|web)-tayara-prod:mp-ads-delivery".r -> contentTopic("Image-
    "^urn:schibsted:madstorage-(rkt|web)-corotos-prod:mp-ads-delivery".r -> contentTopic("Image-
  )
20
Duratro: config
pipes {
  ATEDev {
    sourceTopic = "Public-DataPro-Yggdrasil-ATE-Dev-AteBehaviorEvent-1"
    sink {
      type = "kinesis"
      stream = "AUTO-ate-online-events-loader-AteOnlineEventDataStream-3WEL7DDN2KQG"
      region = "eu-west-1"
      role = "arn:aws:iam::972724508451:role/AUTO-ate-online-events-lo-AteOnlineEventsDataWrite-GTMDBZSEZJF0"
      session = "kinesis-ate-dev"
    }
  }
21
Not in great shape
• Transforms were written in Scala code
• not easy to read even for Scala developers
• most site devs are not Scala developers
• Config changes require deploys
• in streaming, matching changes must be made to both
Yggdrasil and Duratro
• Three different configuration syntaxes
• definition of same type of event different in batch &
streaming
22
What if?
• We had an expression language for JSON, kind of
like jq
• could write routing filters using that
• We had a transformation language for JSON
• write as JSON template, using expression language to
compute values to insert
• A custom routing language for both batch and
streaming, based on this language
• designed for easy expressivity & deploy
23
JSLT
• Custom language for JSON transforms & queries
• First iteration
• JSON syntax with ${ … } wrappers for jq expressions
• very simple additions: let, for and if expressions
• tried out, worked well, but not ideal
• Second iteration
• own language from the ground up
• far better performance
• easier to write and use
24
JSLT expressions
25
.foo                       Get "foo" key from input object
.foo.bar                   As above + .bar inside that
.foo == 231                Comparison
.foo and .bar < 12         Boolean operator
test(.foo, "^[a-z0-9]+$")  Functions (& regexps)
JSLT transforms
{
  "insert_id" : ."@id",
  "event_type" : ."@type" + " " + .object."@type",
  "device_id" : .device.environmentId,
  "time": amp:parse_timestamp(.published),
  "device_manufacturer": .device.manufacturer,
  "device_model": .device.model,
  "language": .device.acceptLanguage,
  "os_name": .device.osType,
  "os_version": .device.osVersion,
  …
}
26
More features
• [for (.array) number(.) * 1.1]
• convert each element in an array
• * : .
• object matcher, keeps rest of object unchanged
• {for (.object) "prefix" + .key : .value}
• dynamic rewrite of object
• def func(p1, p2)
• define custom functions
27
Benefits of JSLT
• Easier to read and write than code
• Doesn’t require user to know Scala
• Can be embedded in configuration
• Flexible enough to support 99-100% of filters/
transforms
• Performance quite good
• 5-10x faster than the first iteration, which was based on jackson-jq
28
Routing language
Firehose:
  description: All incoming events.
  transform: transforms/base-cleanup.jslt
PulseBase:
  description: Cleaned-up Pulse events with all the information in them.
  baseType: Firehose
  filter: import "filters/pulse.jslt" as pulse pulse(.)
  transform: transforms/pulse-cleanup.jslt
  postFilter: import "filters/pulse-valid.jslt" as valid valid(.)
30
Pulse definitions
PulseIdentified:
  description: Pulse events with personally identifying information included.
  baseType: PulseBase
  filter: .actor."spt:userId"
  transform: transforms/pulse-identified.jslt
PulseAnonymized:
  description: Pulse events with personally identifying information excluded.
  baseType: PulseBase
  transform: transforms/pulse-anonymized.jslt
31
pulse-identified.jslt
let isFiltered = (contains(get-client(.), $filteredProviders))
{
  "@id" : if ( ."@id" ) sha256-hex($salt + ."@id"),
  "actor" : {
    // remove one user identifier, but spt:userId also contains user ID
    "@id" : if ( .actor."@id" ) null,
    "spt:remoteAddress" : if (not($isFiltered)) .actor."spt:remoteAddress",
    "spt:remoteAddressV6" : if (not($isFiltered)) .actor."spt:remoteAddressV6",
    * : .
  },
  "device" : {
    "environmentId" : if ( .device.environmentId ) null,
    * : .
  },
  "location" : if (not($isFiltered)) .location,
  * : .
}
32
Sinks
VG-ArticleViews-1:
  eventType: PulseLoginPreserved
  filter: get-client(.) == "vg" and ."@type" == "View" and contains(.object."@type", ["Article", "SalesPoster"])
  transform: transforms/vg-article-views.jslt
  kinesis:
    arn: arn:aws:kinesis:eu-west-1:070941167498:stream/vg_article_views
    role: arn:aws:iam::070941167498:role/data-platform-kinesis-write
VG-FrontExperimentsEngagement-1:
  eventType: PulseAnonymized
  filter: get-client(.) == "vg" and ."@type" == "Engagement" and contains(.object."@type", ["Article", "SalesPoster"]) and (contains("df-86-", .origin.terms) or contains("df-86-", .object."spt:custom".terms))
  transform: transforms/vg-article-views.jslt
  kinesis:
    arn: arn:aws:kinesis:eu-west-1:070941167498:stream/vg_front_experiments_engagement
    role: arn:aws:iam::070941167498:role/data-platform-kinesis-write
33
routing-lib
• A Scala library that can load the YAML files
• main dependencies: Jackson and JSLT
• One main API method:
• RoutingConfig.route(JsonNode): Seq[(JsonNode, Sink)]
• Used by
• The Batch Job 2.0
• Yggdrasil
• Duratro
34
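A toy sketch of what RoutingConfig.route does conceptually: resolve each sink's event type (following baseType chains, applying their filters and transforms), then apply the sink's own filter and transform. The real library is Scala on Jackson and JSLT; here filters and transforms are plain Python callables and all names are illustrative:

```python
def apply_type(event, etype, event_types):
    # Resolve the baseType chain first, then this type's filter and transform.
    base = etype.get("baseType")
    if base:
        event = apply_type(event, event_types[base], event_types)
        if event is None:
            return None
    if etype.get("filter") and not etype["filter"](event):
        return None
    return etype.get("transform", lambda x: x)(event)

def route(event, event_types, sinks):
    # Returns (transformed_event, sink_name) pairs, like
    # RoutingConfig.route(JsonNode): Seq[(JsonNode, Sink)].
    results = []
    for name, sink in sinks.items():
        e = apply_type(event, event_types[sink["eventType"]], event_types)
        if e is None:
            continue
        if sink.get("filter") and not sink["filter"](e):
            continue
        transform = sink.get("transform") or (lambda x: x)
        results.append((transform(e), name))
    return results
```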
The Batch Job 2.0
• Three simple steps
• read JSON input from S3 (Spark solves this)
• push JSON data through routing-lib
• write JSON back to S3 (Spark solves this)
• There’s a little more to it than that, but that’s the heart of
it
• much better performance (much less data shuffling)
• better performance means it handles “cluster weather” more
robustly
• easier to catch up if we fall behind
35
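Stripped of Spark, the 2.0 job is a pure map over events. A sketch of that shape, with routing-lib's role played by a `route` callable passed in:

```python
import json
from collections import defaultdict

def run_batch(lines, route):
    """lines: JSON strings read from S3; route(event) -> [(event, sink), ...]."""
    per_sink = defaultdict(list)
    for line in lines:
        event = json.loads(line)
        for out_event, sink in route(event):
            per_sink[sink].append(json.dumps(out_event))
    # in the real job, each per-sink bucket is written back to S3 by Spark
    return per_sink
```

Because each event is routed independently, there is no shuffle: this is why the 2.0 job performs so much better than the old task graph.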
Static routing
• Routing was configuration, packaged as a jar
• Every change required
• make routing PR, merge
• wait for Travis build to upload to Artifactory
• upgrade the batch job, deploy
• upgrade Yggdrasil, deploy
• upgrade Duratro, deploy
36
Hot deploy
37
[Diagram: routing repo → Travis build → routing config publisher → S3 + SQS queue → The Batch Job, Yggdrasil, Duratro]
Self-serve
• Finding the right repo and learning a YAML syntax
is non-trivial
• What if users could instead use a user interface?
• select an event type
• pick a transform
• add a filter, if necessary
• then configure a sink
• press the button, and wham!
39
YAML format
• Was designed for this right from the start
• Having event-types.yaml separate
• enables reuse across batch and streaming
• but also in selfserve
• Making a flat format based on references
• avoids deep, nested tree structures in syntax
• means config can be merged from many sources
40
Hot deploy
42
[Diagram: routing repo → Travis build → routing config publisher → S3 + SQS queue → The Batch Job, Yggdrasil, Duratro; now extended with Pulse Monitor, DynamoDB and a Lambda]
Status
• Routing tree (207 sinks)
• streaming: 400 nodes (140 sinks)
• batch: 127 nodes (51 sinks)
• self-serve: ??? nodes (16 sinks)
• JSLT
• 51 transforms, 2366 lines
• runs ~10 billion transforms/day
• 28 contributors outside team
43
1 month of hot deploy
44
GDPR
Schibsted’s setup
• The individual sites are legally data controllers
• that means, they own the data and the responsibility
• Central Schibsted components are data processors
• that means, they do only what the controllers tell them to
• upside: responsibility rests with the controllers
• Has lots of consequences for how things work
46
Main issues
• Anonymization: handled with transforms
• Retention: handled with S3 lifecycle policies
• Takeout
• only necessary where we are the primary storage
• Deletion: a bit of a problem
• but we have 30 days to comply
47
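Retention via lifecycle policies needs no code at all; a sketch of an S3 lifecycle configuration (the prefix and retention period here are made up for illustration, not taken from the deck):

```json
{
  "Rules": [
    {
      "ID": "pulse-retention",
      "Status": "Enabled",
      "Filter": { "Prefix": "pulse/" },
      "Expiration": { "Days": 365 }
    }
  ]
}
```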
The big picture
48
[Diagram: Privacy Broker and Data Platform]
D-day
49
Next 3 weeks
50
Deletion
51
Impact
52
Data takeout
• Privacy broker posts message on SQS queue
• we take it down to S3, to get Luigi integration
• Luigi starts Spark job
• reads through stored data, looking for that user
• all events from that user are written to S3
• post SQS message back with reference to event and file
• Source data stored in Parquet
• use manually generated index files to avoid processing data
that has no events from this user
53
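The index trick in the last bullet can be sketched in a few lines: per-file indexes map each Parquet file to the user IDs it contains, so the takeout job only opens files that can match. The index shape and names here are assumptions, not the real format:

```python
def files_to_scan(indexes, user_id):
    """indexes: {parquet_file_path: set of user IDs present in that file}.

    Returns only the files worth reading for this takeout request;
    everything else is skipped without being opened.
    """
    return [path for path, users in sorted(indexes.items()) if user_id in users]
```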
Data deletion
• Data is stored as Parquet files in S3
• but Parquet doesn’t have a “delete” function
• you have to rewrite the dataset
• This is slow and costly
• but can be batched: delete many users at once
• batching so many users that the index is useless
• What if someone is reading the dataset when you
are rewriting it?
54
Solution
• bucket/prefix/year=x/month=y/…/gen=0
• data stored under here initially
• bucket/prefix/year=x/month=y/…/gen=1
• data stored here after first rewrite
• once _SUCCESS flag is there consumers must switch
• after a day or so, gen=0 can be deleted
• Janitor Monkey deletes orphan generations & handles
retention
• because deletion rewrites data: messes up last modified
55
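The gen=N convention can be resolved mechanically on the reader side: list the partition, keep only generations that have a _SUCCESS marker, and pick the highest. A sketch, with the listing passed in as plain key strings (in reality it would come from an S3 list call):

```python
import re

def current_generation(keys):
    # keys: all object keys under one partition prefix; a generation
    # counts only once its _SUCCESS flag has been written.
    gens = set()
    for key in keys:
        m = re.search(r"/gen=(\d+)/", key)
        if m and key.endswith("/_SUCCESS"):
            gens.add(int(m.group(1)))
    return max(gens) if gens else None
```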
Data Access Layer
• Building logic to handle gen=x is a pain for users
• the Data Access Layer wraps Spark to do it for them
• Can in the future be expanded to
• filter out rows from users that opt out of certain processing
• do access control on a column level
• …
56
Access control
57
[Diagram: AWS Databox account holding the Sites' data; consumers are Components (Mesos cluster) and Analysts (Jupyter-aaS, Spark, SQLaaS), all gated by AWS IAM policies]
Stricter access to data
• Because the sites are data controllers, they must
decide who should have access to what
• access is controlled by IAM policies
• but users can’t write those, and that’s not safe, anyway
• The system essentially requires communication
• data consumers must request data
• data owners (sites) must approve/reject requests
58
The granule of access
• We have many datasets
• Pulse (anonymized, identified, …)
• Payments (payment data)
• Content (ad and article content)
• …
• Access is per (dataset, site) combination
• you can have access to VG Pulse, but not Aftenposten
Pulse
59
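Per-(dataset, site) access maps naturally onto one IAM statement per combination; a sketch of what a single grant might look like (the bucket name and prefix layout are invented for illustration):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::databox-datasets/pulse-anonymized/vg/*"
    }
  ]
}
```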
Dataset registry
60
Email notification
61
Review screen
62
Challenges
• Users can potentially get access to many (dataset,
site) combinations
• each one needs to go into their IAM policies
• IAM has very strict API limits
• user inline policies: total max 2048 bytes
• managed policy size: max 6144
• max managed policies per account: 1500
• max attached managed policies: 10
• max group memberships: 10
63
Permission packing
• First pack as much as possible into an inline policy
• Then fill up personal managed policies & attach
• Then create more policies and attach to groups,
then attach those
• We believe we can attach 10,000 datasets this way
64
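The packing order above can be sketched as a greedy fill: inline policy first, then personal managed policies, then group-attached ones. Capacities are counted in statements here for simplicity; the real limits are byte sizes and attachment counts:

```python
def pack(statements, inline_cap, managed_cap, max_user_policies, max_groups):
    # Greedily place policy statements: inline first, then per-user
    # managed policies, then managed policies attached via groups.
    placements = {"inline": [], "user_managed": [], "groups": []}
    rest = list(statements)
    placements["inline"], rest = rest[:inline_cap], rest[inline_cap:]
    for _ in range(max_user_policies):
        if not rest:
            break
        placements["user_managed"].append(rest[:managed_cap])
        rest = rest[managed_cap:]
    for _ in range(max_groups):
        if not rest:
            break
        placements["groups"].append(rest[:managed_cap])
        rest = rest[managed_cap:]
    return placements, rest  # non-empty rest means the user hit the ceiling
```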
Sync
12:18:56 INFO c.s.s.d.s.model.IAMPolicyGenerator - User lars.marius.garshol@schibsted.com exists, must be cleaned
12:18:57 INFO c.s.s.d.s.model.IAMPolicyGenerator - Deleting inline selfserve-policy from lars.marius.garshol@schibsted.com
12:18:57 INFO c.s.s.d.s.model.IAMPolicyGenerator - Detaching arn:aws:iam::360928389411:policy/selfserve-lars.marius.garshol@schibsted.com-2 from lars.marius.garshol@schibsted.com
12:18:58 INFO c.s.s.d.s.model.IAMPolicyGenerator - Detaching arn:aws:iam::360928389411:policy/selfserve-lars.marius.garshol@schibsted.com-1 from lars.marius.garshol@schibsted.com
12:18:59 DEBUG c.s.s.d.s.model.IAMPolicyGenerator - Putting inline policy 'selfserve-policy' on lars.marius.garshol@schibsted.com
12:19:00 INFO c.s.s.d.s.model.IAMPolicyGenerator - Policy selfserve-lars.marius.garshol@schibsted.com-1 exists, deleting
12:19:00 DEBUG c.s.s.d.s.model.IAMPolicyGenerator - Creating and attaching selfserve-lars.marius.garshol@schibsted.com-1 to lars.marius.garshol@schibsted.com, 13 statements left
12:19:01 INFO c.s.s.d.s.model.IAMPolicyGenerator - Policy selfserve-lars.marius.garshol@schibsted.com-2 exists, deleting
12:19:01 DEBUG c.s.s.d.s.model.IAMPolicyGenerator - Creating and attaching selfserve-lars.marius.garshol@schibsted.com-2 to lars.marius.garshol@schibsted.com, 0 statements left
12:19:02 DEBUG c.s.s.d.s.model.IAMPolicyGenerator - Putting inline policy 'selfserve-policy' on role lars.marius.garshol@schibsted.com
65
Winding up
Slow maturation
• We started out with almost nothing in 2015
• now finally becoming something closer to what we need to be
• New challenges ahead
• management wants to scale up usage of common solutions
dramatically
• legal basis management is coming
• selfserve needs more functionality
• Data Quality Tooling needs an overhaul
• data discovery service likewise
• …
67
https://slideshare.net/larsga
Questions?

More Related Content

PDF
JSLT: JSON querying and transformation
PDF
How to Make Norikra Perfect
KEY
The Why and How of Scala at Twitter
PDF
Kotlin @ Coupang Backed - JetBrains Day seoul 2018
PDF
Data Analytics Service Company and Its Ruby Usage
PDF
Automatically generating-json-from-java-objects-java-objects268
PPTX
Elasticsearch, Logstash, Kibana. Cool search, analytics, data mining and more...
KEY
The Return of the Living Datalog
JSLT: JSON querying and transformation
How to Make Norikra Perfect
The Why and How of Scala at Twitter
Kotlin @ Coupang Backed - JetBrains Day seoul 2018
Data Analytics Service Company and Its Ruby Usage
Automatically generating-json-from-java-objects-java-objects268
Elasticsearch, Logstash, Kibana. Cool search, analytics, data mining and more...
The Return of the Living Datalog

What's hot (20)

ODP
Query DSL In Elasticsearch
PPTX
ElasticSearch for .NET Developers
PDF
Logstash-Elasticsearch-Kibana
PDF
"ClojureScript journey: from little script, to CLI program, to AWS Lambda fun...
PDF
Requery overview
KEY
ClojureScript Anatomy
PPTX
ElasticSearch AJUG 2013
PPT
{{more}} Kibana4
PDF
SQL for Elasticsearch
PPTX
Dive into spark2
PDF
High-Performance Hibernate Devoxx France 2016
PDF
(Big) Data Serialization with Avro and Protobuf
PDF
Logging logs with Logstash - Devops MK 10-02-2016
KEY
MongoSF - mongodb @ foursquare
PDF
Elastic Search
ZIP
Above the clouds: introducing Akka
KEY
Building Scalable, Distributed Job Queues with Redis and Redis::Client
PDF
Mobile Analytics mit Elasticsearch und Kibana
PPTX
DZone Java 8 Block Buster: Query Databases Using Streams
PDF
Boosting Machine Learning with Redis Modules and Spark
Query DSL In Elasticsearch
ElasticSearch for .NET Developers
Logstash-Elasticsearch-Kibana
"ClojureScript journey: from little script, to CLI program, to AWS Lambda fun...
Requery overview
ClojureScript Anatomy
ElasticSearch AJUG 2013
{{more}} Kibana4
SQL for Elasticsearch
Dive into spark2
High-Performance Hibernate Devoxx France 2016
(Big) Data Serialization with Avro and Protobuf
Logging logs with Logstash - Devops MK 10-02-2016
MongoSF - mongodb @ foursquare
Elastic Search
Above the clouds: introducing Akka
Building Scalable, Distributed Job Queues with Redis and Redis::Client
Mobile Analytics mit Elasticsearch und Kibana
DZone Java 8 Block Buster: Query Databases Using Streams
Boosting Machine Learning with Redis Modules and Spark
Ad

Similar to Data collection in AWS at Schibsted (20)

PPTX
Elastic Stack Introduction
PDF
NetflixOSS Open House Lightning talks
PDF
Data pipelines from zero to solid
PDF
Eko10 - Security Monitoring for Big Infrastructures without a Million Dollar ...
PDF
Data Infrastructure for a World of Music
PDF
Security Monitoring for big Infrastructures without a Million Dollar budget
PPTX
Architectures, Frameworks and Infrastructure
PDF
the tooling of a modern and agile oracle dba
PDF
Building a Sustainable Data Platform on AWS
PPTX
AWS big-data-demystified #1.1 | Big Data Architecture Lessons Learned | English
PDF
Stream-Native Processing with Pulsar Functions
PDF
Collecting 600M events/day
PPTX
AWS Big Data Demystified #1.2 | Big Data architecture lessons learned
PDF
Cloud Camp Chicago Dec 2012 - All presentations
PDF
Cloud Camp Chicago Dec 2012 Slides
PDF
Music city data Hail Hydrate! from stream to lake
PPTX
From java to scala at crowd mix
PDF
Flink Forward San Francisco 2019: Building Financial Identity Platform using ...
PPT
Building Scalable Big Data Infrastructure Using Open Source Software Presenta...
PDF
DevOps in the Amazon Cloud – Learn from the pioneersNetflix suro
Elastic Stack Introduction
NetflixOSS Open House Lightning talks
Data pipelines from zero to solid
Eko10 - Security Monitoring for Big Infrastructures without a Million Dollar ...
Data Infrastructure for a World of Music
Security Monitoring for big Infrastructures without a Million Dollar budget
Architectures, Frameworks and Infrastructure
the tooling of a modern and agile oracle dba
Building a Sustainable Data Platform on AWS
AWS big-data-demystified #1.1 | Big Data Architecture Lessons Learned | English
Stream-Native Processing with Pulsar Functions
Collecting 600M events/day
AWS Big Data Demystified #1.2 | Big Data architecture lessons learned
Cloud Camp Chicago Dec 2012 - All presentations
Cloud Camp Chicago Dec 2012 Slides
Music city data Hail Hydrate! from stream to lake
From java to scala at crowd mix
Flink Forward San Francisco 2019: Building Financial Identity Platform using ...
Building Scalable Big Data Infrastructure Using Open Source Software Presenta...
DevOps in the Amazon Cloud – Learn from the pioneersNetflix suro
Ad

More from Lars Marius Garshol (20)

PPTX
Kveik - what is it?
PDF
Nature-inspired algorithms
PDF
History of writing
PDF
NoSQL and Einstein's theory of relativity
PPTX
Norwegian farmhouse ale
PPTX
Archive integration with RDF
PPTX
The Euro crisis in 10 minutes
PPTX
Using the search engine as recommendation engine
PPTX
Linked Open Data for the Cultural Sector
PPTX
NoSQL databases, the CAP theorem, and the theory of relativity
PPTX
Bitcoin - digital gold
PPTX
Introduction to Big Data/Machine Learning
PPTX
Hops - the green gold
PPTX
Big data 101
PPTX
Linked Open Data
PPTX
Hafslund SESAM - Semantic integration in practice
PPTX
Approximate string comparators
PPTX
Experiments in genetic programming
PPTX
Semantisk integrasjon
PPTX
Linking data without common identifiers
Kveik - what is it?
Nature-inspired algorithms
History of writing
NoSQL and Einstein's theory of relativity
Norwegian farmhouse ale
Archive integration with RDF
The Euro crisis in 10 minutes
Using the search engine as recommendation engine
Linked Open Data for the Cultural Sector
NoSQL databases, the CAP theorem, and the theory of relativity
Bitcoin - digital gold
Introduction to Big Data/Machine Learning
Hops - the green gold
Big data 101
Linked Open Data
Hafslund SESAM - Semantic integration in practice
Approximate string comparators
Experiments in genetic programming
Semantisk integrasjon
Linking data without common identifiers

Recently uploaded (20)

PDF
Galatica Smart Energy Infrastructure Startup Pitch Deck
PDF
Business Analytics and business intelligence.pdf
PDF
Capcut Pro Crack For PC Latest Version {Fully Unlocked 2025}
PPT
Predictive modeling basics in data cleaning process
PPTX
MODULE 8 - DISASTER risk PREPAREDNESS.pptx
PDF
Mega Projects Data Mega Projects Data
PPTX
Introduction to Knowledge Engineering Part 1
PDF
22.Patil - Early prediction of Alzheimer’s disease using convolutional neural...
PDF
Introduction to the R Programming Language
PPTX
oil_refinery_comprehensive_20250804084928 (1).pptx
PPTX
Data_Analytics_and_PowerBI_Presentation.pptx
PPTX
IBA_Chapter_11_Slides_Final_Accessible.pptx
PPTX
IB Computer Science - Internal Assessment.pptx
PDF
annual-report-2024-2025 original latest.
PPTX
modul_python (1).pptx for professional and student
PPT
Reliability_Chapter_ presentation 1221.5784
PPTX
Microsoft-Fabric-Unifying-Analytics-for-the-Modern-Enterprise Solution.pptx
PPTX
AI Strategy room jwfjksfksfjsjsjsjsjfsjfsj
PPTX
Introduction-to-Cloud-ComputingFinal.pptx
Galatica Smart Energy Infrastructure Startup Pitch Deck
Business Analytics and business intelligence.pdf
Capcut Pro Crack For PC Latest Version {Fully Unlocked 2025}
Predictive modeling basics in data cleaning process
MODULE 8 - DISASTER risk PREPAREDNESS.pptx
Mega Projects Data Mega Projects Data
Introduction to Knowledge Engineering Part 1
22.Patil - Early prediction of Alzheimer’s disease using convolutional neural...
Introduction to the R Programming Language
oil_refinery_comprehensive_20250804084928 (1).pptx
Data_Analytics_and_PowerBI_Presentation.pptx
IBA_Chapter_11_Slides_Final_Accessible.pptx
IB Computer Science - Internal Assessment.pptx
annual-report-2024-2025 original latest.
modul_python (1).pptx for professional and student
Reliability_Chapter_ presentation 1221.5784
Microsoft-Fabric-Unifying-Analytics-for-the-Modern-Enterprise Solution.pptx
AI Strategy room jwfjksfksfjsjsjsjsjfsjfsj
Introduction-to-Cloud-ComputingFinal.pptx

Data collection in AWS at Schibsted

  • 1. Data collection in AWS Lars Marius Garshol, lars.marius.garshol@schibsted.com http://twitter.com/larsga 2018–09–04, AWS Meetup
  • 3. Schibsted 3 30 countries 200 million users/month 20 billion pageviews/month
  • 5. Data Platform team 5 Jordi Roura Ole-Magnus Røysted Aker Sangram Bal Oleksandr Ivanov Mårten Rånge Håvard Wall Bjørn Rustad Rolv Seehus Håkon Åmdal Fredrik Vraalsen Øyvind Løkling Per Wessel Nore Lars Marius Garshol Ning Zhou
  • 9. The Batch Job • Implemented in Apache Spark • Luigi for scheduling/orchestration • Runs in a shared Mesos cluster • this was set up because letting users create individual clusters became far too expensive • this cluster is the main cost driver for Schibsted’s AWS bills • Difficult environment to work with • hard to debug and develop in 9
  • 10. Problems with batch • Configuration file mapped to Spark tasks • very complex set of Spark tasks • requires lots of communication between Spark nodes • runs slowly • Very resource-intensive • had difficulty keeping up with incoming traffic • very sensitive to “cluster weather” • brittle 10
  • 11. Piper & Storage • Ordinary Java and Scala applications • read from the data source, perform all processing on one node, then write to the destination • no communication necessary between nodes • normal EC2 nodes with the application baked into the AMI • therefore scales trivially with Auto Scaling Groups • Instrumented with Datadog, logs loaded into SumoLogic 11
  • 13. Kafka vs Kinesis • Kinesis has very strict API limits • total read limit = 2x write limit • effectively limited to 2 readers • Kinesis API is very limited • basically only supports reading records in order • Kafka improves on both • can support many readers simultaneously • advanced Scala DSL with window functions etc etc 13
  • 15. Handling slow consumers 15 Kafka Yggdrasil Firehose One topic per consumer Duratro One topic per consumer OK OK OK OK Slow Transforms Filtering
  • 16. Pulse challenges • Pulse is a tracking solution with no user interface • you want dashboards to analyze user traffic? sorry • problem is: not enough resources to develop that • Using Amplitude to solve that • created a Duratro sink for Amplitude • simple HTTP POST of JSON events to Amplitude API • users can now create Amplitude projects, feed their Pulse events there, and finally have dashboards 16
  • 17. Transforms • Because GDPR we need to anonymize most incoming data formats • Some data has data quality issues that cannot be fixed at source, requires transforms to solve • In many cases data needs to be transformed from one format to another • Pulse to Amplitude • ClickMeter to Pulse • Convert data to match database structures • … 17
  • 18. Who configures? • Schibsted has >100 business units • for Data Platform to do detailed configuration for all of these isn’t going to scale • for sites to do it themselves saves lots of time • Transforms require domain knowledge • each site has its own specialities in Pulse tracking • to transform these correctly requires knowing all this 18
  • 19. Batch config: 1 sink { "driver": "anyoffilter", "name": "image-classification", "rules": [ { "name": "ImageClassification", "key": "provider.component", "value": "ImageClassification" }, { "name": "ImageSimilarity", "key": "provider.component", "value": "ImageSimilarity" } ], "onmatch": [ { "driver": "cache", "name": "image-classification", "level": "memory+disk" }, { "driver": "demuxer", "name": "image-classification", "rules": "${pulseSdrnFilterUri}", "parallel": true, "onmatch": { "driver": "textfilewriter", "uri": "${imageSiteUri}", "numFiles": { "eventsPerFile": 500000, "max": ${numExecutors} } } } ], 19 Early config was 1838 lines
  • 20. Yggdrasil: Scala DSL override def buildTopology(builder: StreamsBuilder): Unit = { import com.schibsted.spt.data.yggdrasil.serde.YggdrasilImplicitSerdes.{json, strings} // mads events routing val madsProEvents = madsEvents.tryFilter(MadsProEventsPredicate, deadLetterQueue("Default")) val madsPreEvents = madsEvents.tryFilter(MadsPreEventsPredicate, deadLetterQueue("Default")) val madsDevEvents = madsEvents.tryFilter(new EventSampler(0.01), deadLetterQueue("Default")) madsProEvents ~> contentTopic("Personalisation-Rocket-Pro") madsPreEvents ~> contentTopic("Personalisation-Rocket-Pre") madsDevEvents ~> contentTopic("Personalisation-Rocket-Dev") madsProEvents ~> providerIdDemuxer( "^urn:schibsted:madstorage-(rkt|web)-tayara-prod:mp-ads-delivery".r -> contentTopic("Image- "^urn:schibsted:madstorage-(rkt|web)-corotos-prod:mp-ads-delivery".r -> contentTopic(“Image- ) 20
  • 21. Duratro: config pipes { ATEDev { sourceTopic = "Public-DataPro-Yggdrasil-ATE-Dev-AteBehaviorEvent-1" sink { type = "kinesis" stream = "AUTO-ate-online-events-loader-AteOnlineEventDataStream-3WEL7DDN2KQG" region = "eu-west-1" role = "arn:aws:iam::972724508451:role/AUTO-ate-online-events-lo-AteOnlineEventsDataWrite-GTMDBZSEZJF0" session = "kinesis-ate-dev" } } 21
  • 22. Not in great shape • Transforms were written in Scala code • not easy to read even for Scala developers • most site devs are not Scala developers • Config changes require deploys • in streaming, matching changes must be made to both Yggdrasil and Duratro • Three different configuration syntaxes • definition of same type of event different in batch & streaming 22
  • 23. What if? • We had an expression language for JSON, kind of like jq • could write routing filters using that • We had a tranformation language for JSON • write as JSON template, using expression language to compute values to insert • A custom routing language for both batch and streaming, based on this language • designed for easy expressivity & deploy 23
  • 24. JSLT • Custom language for JSON transforms & queries • First iteration • JSON syntax with ${ … } wrappers for jq expressions • very simple additions: let, for and if expressions • tried out, worked well, but not ideal • Second iteration • own language from the ground up • far better performance • easier to write and use 24
  • 25. JSLT expressions 25 .foo Get “foo” key from input object .foo.bar As above + .bar inside that .foo == 231 Comparison .foo and .bar < 12 Boolean operator test(.foo, “^[a-z0-9]+$”) Functions (& regexps)
  • 26. JSLT transforms { “insert_id” : .”@id”, “event_type” : ."@type" + " " + .object.”@type”, "device_id" : .device.environmentId, "time": amp:parse_timestamp(.published), "device_manufacturer": .device.manufacturer, "device_model": .device.model, "language": .device.acceptLanguage, "os_name": .device.osType, "os_version": .device.osVersion, … } 26
  • 27. More features • [for (.array) number(.) * 1.1] • convert each element in an array • * : . • object matcher, keeps rest of object unchanged • {for (.object) “prefix” + .key : .value} • dynamic rewrite of object • def func(p1, p2) • define custom functions 27
  • 28. Benefits of JSLT • Easier to read and write than code • Doesn’t require user to know Scala • Can be embedded in configuration • Flexible enough to support 99-100% of filters/ transforms • Performance quite good • 5-10x original language based on jackson-jq 28
• 30. Routing language
Firehose:
  description: All incoming events.
  transform: transforms/base-cleanup.jslt
PulseBase:
  description: Cleaned-up Pulse events with all the information in them.
  baseType: Firehose
  filter: |
    import "filters/pulse.jslt" as pulse
    pulse(.)
  transform: transforms/pulse-cleanup.jslt
  postFilter: |
    import "filters/pulse-valid.jslt" as valid
    valid(.)
30
• 31. Pulse definitions
PulseIdentified:
  description: Pulse events with personally identifying information included.
  baseType: PulseBase
  filter: .actor."spt:userId"
  transform: transforms/pulse-identified.jslt
PulseAnonymized:
  description: Pulse events with personally identifying information excluded.
  baseType: PulseBase
  transform: transforms/pulse-anonymized.jslt
31
• 32. pulse-identified.jslt
let isFiltered = (contains(get-client(.), $filteredProviders))
{
  "@id" : if ( ."@id" ) sha256-hex($salt + ."@id"),
  "actor" : {
    // remove one user identifier, but spt:userId also contains user ID
    "@id" : if ( .actor."@id" ) null,
    "spt:remoteAddress" : if (not($isFiltered)) .actor."spt:remoteAddress",
    "spt:remoteAddressV6" : if (not($isFiltered)) .actor."spt:remoteAddressV6",
    * : .
  },
  "device" : {
    "environmentId" : if ( .device.environmentId ) null,
    * : .
  },
  "location" : if (not($isFiltered)) .location,
  * : .
}
32
• 33. Sinks
VG-ArticleViews-1:
  eventType: PulseLoginPreserved
  filter: |
    get-client(.) == "vg" and ."@type" == "View" and
    contains(.object."@type", ["Article", "SalesPoster"])
  transform: transforms/vg-article-views.jslt
  kinesis:
    arn: arn:aws:kinesis:eu-west-1:070941167498:stream/vg_article_views
    role: arn:aws:iam::070941167498:role/data-platform-kinesis-write
VG-FrontExperimentsEngagement-1:
  eventType: PulseAnonymized
  filter: |
    get-client(.) == "vg" and ."@type" == "Engagement" and
    contains(.object."@type", ["Article", "SalesPoster"]) and
    (contains("df-86-", .origin.terms) or contains("df-86-", .object."spt:custom".terms))
  transform: transforms/vg-article-views.jslt
  kinesis:
    arn: arn:aws:kinesis:eu-west-1:070941167498:stream/vg_front_experiments_engagement
    role: arn:aws:iam::070941167498:role/data-platform-kinesis-write
33
  • 34. routing-lib • A Scala library that can load the YAML files • main dependencies: Jackson and JSLT • One main API method: • RoutingConfig.route(JsonNode): Seq[(JsonNode, Sink)] • Used by • The Batch Job 2.0 • Yggdrasil • Duratro 34
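The shape of that one API method can be sketched in a few lines. This is a hypothetical Python analogue of `RoutingConfig.route` (the real library is Scala and evaluates JSLT filters and transforms); the `Node`/`Sink` names are invented for illustration.

```python
# Sketch: walk a routing tree, applying each node's filter and
# transform, and collect (transformed event, sink) pairs.

from dataclasses import dataclass, field
from typing import Callable, List, Optional

@dataclass
class Sink:
    name: str

@dataclass
class Node:
    filter: Callable[[dict], bool] = lambda e: True
    transform: Callable[[dict], dict] = lambda e: e
    sink: Optional[Sink] = None
    children: List["Node"] = field(default_factory=list)

def route(node: Node, event: dict):
    """Return all (event, sink) pairs this event should be written to."""
    if not node.filter(event):
        return []  # filtered out: nothing below this node sees the event
    transformed = node.transform(event)
    results = [(transformed, node.sink)] if node.sink else []
    for child in node.children:
        results.extend(route(child, transformed))
    return results
```

An event that fails a node's filter is dropped for that whole subtree, which is what makes baseType chains like Firehose → PulseBase → PulseIdentified cheap to evaluate.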
  • 35. The Batch Job 2.0 • Three simple steps • read JSON input from S3 (Spark solves this) • push JSON data through routing-lib • write JSON back to S3 (Spark solves this) • There’s a little more to it than that, but that’s the heart of it • much better performance (much less data shuffling) • better performance means it handles “cluster weather” more robustly • easier to catch up if we fall behind 35
  • 36. Static routing • Routing was configuration, packaged as a jar • Every change required • make routing PR, merge • wait for Travis build to upload to Artifactory • upgrade the batch job, deploy • upgrade Yggdrasil, deploy • upgrade Duratro, deploy 36
  • 39. Self-serve • Finding the right repo and learning a YAML syntax is non-trivial • What if users could instead use a user interface? • select an event type • pick a transform • add a filter, if necessary • then configure a sink • press the button, and wham! 39
  • 40. YAML format • Was designed for this right from the start • Having event-types.yaml separate • enables reuse across batch and streaming • but also in selfserve • Making a flat format based on references • avoids deep, nested tree structures in syntax • means config can be merged from many sources 40
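Why the flat, reference-based format merges well can be shown with a small sketch: each source (batch repo, streaming repo, self-serve) contributes named nodes, and merging is just combining maps while rejecting duplicate names. This is an illustrative assumption about the merge semantics, not the actual code.

```python
# Sketch: merge routing config fragments from several sources into one
# flat name -> node map, refusing conflicting redefinitions.

def merge_configs(*configs):
    merged = {}
    for config in configs:
        for name, node in config.items():
            if name in merged:
                raise ValueError(f"duplicate node: {name}")
            merged[name] = node
    return merged
```

With a deeply nested syntax, merging fragments from independent sources would instead require reconciling overlapping subtrees.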
  • 43. Status • Routing tree (207 sinks) • streaming: 400 nodes (140 sinks) • batch: 127 nodes (51 sinks) • self-serve: ??? nodes (16 sinks) • JSLT • 51 transforms, 2366 lines • runs ~10 billion transforms/day • 28 contributors outside team 43
  • 44. 1 month of hot deploy 44
  • 45. GDPR
• 46. Schibsted’s setup • The individual sites are legally data controllers • that means they own the data and the responsibility • Central Schibsted components are data processors • that means they do only what the controllers tell them to • upside: responsibility rests with the controllers • Has lots of consequences for how things work 46
  • 47. Main issues • Anonymization: handled with transforms • Retention: handled with S3 lifecycle policies • Takeout • only necessary where we are the primary storage • Deletion: a bit of a problem • but we have 30 days to comply 47
  • 53. Data takeout • Privacy broker posts message on SQS queue • we take it down to S3, to get Luigi integration • Luigi starts Spark job • reads through stored data, looking for that user • all events from that user are written to S3 • post SQS message back with reference to event and file • Source data stored in Parquet • use manually generated index files to avoid processing data that has no events from this user 53
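The index-file trick is the key cost saver here. A minimal sketch of the idea, assuming the index simply maps each Parquet file to the set of user ids it contains (the real index format is not shown in the slides):

```python
# Sketch: prune the takeout scan using an index from file path to the
# user ids that file contains, so files without the user are skipped.

def files_to_scan(index, user_id):
    """index: {file_path: set of user ids}; return only files worth reading."""
    return [path for path, users in index.items() if user_id in users]
```

For a single user, most partitions contain no events from them, so the Spark job touches only a small fraction of the dataset.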
  • 54. Data deletion • Data is stored as Parquet files in S3 • but Parquet doesn’t have a “delete” function • you have to rewrite the dataset • This is slow and costly • but can be batched: delete many users at once • batching so many users that the index is useless • What if someone is reading the dataset when you are rewriting it? 54
• 55. Solution • bucket/prefix/year=x/month=y/…/gen=0 • data stored under here initially • bucket/prefix/year=x/month=y/…/gen=1 • data stored here after first rewrite • once the _SUCCESS flag is there, consumers must switch • after a day or so, gen=0 can be deleted • Janitor Monkey deletes orphan generations & handles retention • because deletion rewrites the data, it messes up last-modified timestamps 55
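A reader can resolve which generation to use with a simple rule: among the gen=N prefixes of a partition, take the highest N that has a _SUCCESS flag. A sketch under that assumption (key layout simplified to strings):

```python
# Sketch: pick the current generation of one partition from its S3 keys.
# Half-written rewrites have no _SUCCESS flag yet and are never chosen.

def current_generation(keys):
    """keys: list of S3 keys under one partition prefix, e.g.
    'prefix/year=2018/gen=1/_SUCCESS'. Returns the newest complete gen."""
    complete = set()
    for key in keys:
        if key.endswith("/_SUCCESS"):
            gen_part = key.split("/")[-2]        # e.g. "gen=1"
            complete.add(int(gen_part.split("=")[1]))
    return max(complete) if complete else None
```

This is why a concurrent reader is safe during a rewrite: it keeps reading gen=0 until gen=1's _SUCCESS appears, and gen=0 survives long enough for in-flight jobs to finish.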
  • 56. Data Access Layer • Building logic to handle gen=x is a pain for users • the Data Access Layer wraps Spark to do it for them • Can in the future be expanded to • filter out rows from users that opt out of certain processing • do access control on a column level • … 56
• 57. Access control (architecture diagram; labels: AWS Databox account, Sites, Components, Mesos cluster, Analysts, AWS IAM policies, Jupyter-aaS, Spark, SQLaaS) 57
• 58. Stricter access to data • Because the sites are data controllers, they must decide who should have access to what • access is controlled by IAM policies • but users can’t write those, and that’s not safe, anyway • The system essentially requires communication • data consumers must request data • data owners (sites) must approve/reject requests 58
  • 59. The granule of access • We have many datasets • Pulse (anonymized, identified, …) • Payments (payment data) • Content (ad and article content) • … • Access is per (dataset, site) combination • you can have access to VG Pulse, but not Aftenposten Pulse 59
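Conceptually an access check is just membership in a set of (dataset, site) pairs; the names below are hypothetical, and the real system compiles such grants into IAM policies rather than checking them at read time.

```python
# Sketch: the access granule as (dataset, site) grants.

def has_access(grants, dataset, site):
    """grants: set of (dataset, site) tuples a user holds."""
    return (dataset, site) in grants
```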
  • 63. Challenges • Users can potentially get access to many (dataset, site) combinations • each one needs to go into their IAM policies • IAM has very strict API limits • user inline policies: total max 2048 bytes • managed policy size: max 6144 • max managed policies per account: 1500 • max attached managed policies: 10 • max group memberships: 10 63
  • 64. Permission packing • First pack as much as possible into an inline policy • Then fill up personal managed policies & attach • Then create more policies and attach to groups, then attach those • We believe we can attach 10,000 datasets this way 64
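The packing strategy above can be sketched as a greedy algorithm against the IAM limits from the previous slide: fill the 2048-byte inline policy first, then spill remaining statements into 6144-byte managed policies. This is a simplification (the policy envelope's own JSON overhead and the attachment limits are ignored), not the actual IAMPolicyGenerator code.

```python
import json

INLINE_LIMIT = 2048   # max bytes for user inline policies (total)
MANAGED_LIMIT = 6144  # max bytes per managed policy

def pack(statements):
    """Greedily split policy statements into (inline, [managed, ...])."""
    inline, managed, current = [], [], []

    def size(stmts):
        # approximate policy size by the serialized statement list
        return len(json.dumps(stmts))

    for stmt in statements:
        if size(inline + [stmt]) <= INLINE_LIMIT:
            inline.append(stmt)
        elif size(current + [stmt]) <= MANAGED_LIMIT:
            current.append(stmt)
        else:
            if current:
                managed.append(current)
            current = [stmt]
    if current:
        managed.append(current)
    return inline, managed
```

Attaching the first ten managed policies directly and the rest via group memberships is what stretches the scheme to the ~10,000-dataset estimate.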
• 65. Sync
12:18:56 INFO c.s.s.d.s.model.IAMPolicyGenerator - User lars.marius.garshol@schibsted.com exists, must be cleaned
12:18:57 INFO c.s.s.d.s.model.IAMPolicyGenerator - Deleting inline selfserve-policy from lars.marius.garshol@schibsted.com
12:18:57 INFO c.s.s.d.s.model.IAMPolicyGenerator - Detaching arn:aws:iam::360928389411:policy/selfserve-lars.marius.garshol@schibsted.com-2 from lars.marius.garshol@schibsted.com
12:18:58 INFO c.s.s.d.s.model.IAMPolicyGenerator - Detaching arn:aws:iam::360928389411:policy/selfserve-lars.marius.garshol@schibsted.com-1 from lars.marius.garshol@schibsted.com
12:18:59 DEBUG c.s.s.d.s.model.IAMPolicyGenerator - Putting inline policy 'selfserve-policy' on lars.marius.garshol@schibsted.com
12:19:00 INFO c.s.s.d.s.model.IAMPolicyGenerator - Policy selfserve-lars.marius.garshol@schibsted.com-1 exists, deleting
12:19:00 DEBUG c.s.s.d.s.model.IAMPolicyGenerator - Creating and attaching selfserve-lars.marius.garshol@schibsted.com-1 to lars.marius.garshol@schibsted.com, 13 statements left
12:19:01 INFO c.s.s.d.s.model.IAMPolicyGenerator - Policy selfserve-lars.marius.garshol@schibsted.com-2 exists, deleting
12:19:01 DEBUG c.s.s.d.s.model.IAMPolicyGenerator - Creating and attaching selfserve-lars.marius.garshol@schibsted.com-2 to lars.marius.garshol@schibsted.com, 0 statements left
12:19:02 DEBUG c.s.s.d.s.model.IAMPolicyGenerator - Putting inline policy 'selfserve-policy' on role lars.marius.garshol@schibsted.com
65
  • 67. Slow maturation • We started out with almost nothing in 2015 • now finally becoming something closer to what we need to be • New challenges ahead • management wants to scale up usage of common solutions dramatically • legal basis management is coming • selfserve needs more functionality • Data Quality Tooling needs an overhaul • data discovery service likewise • … 67