Conquering the Lambda architecture in LinkedIn metrics platform with Apache Calcite and Apache Samza

Conquering the Lambda architecture in LinkedIn metrics
platform with Apache Calcite and Apache Samza
Khai Tran
Staff Software Engineer

Agenda
● Overview of LinkedIn metrics platform
● Moving from offline to nearline
● Under the hood of the nearline architecture
● Nearline production usecase
● Conclusion

Overview of LinkedIn metrics platform

Metrics @ LinkedIn
● Metrics = Measurements over tracking data
● Crucial for decision making:
○ Experimentation - test everything
○ Reporting - monitor and alert
○ In production, site-facing applications

We provide:
● A trusted repository of metrics
● A self-serve platform for
sustainable lifecycle of metrics
In
production
Experimentation
Reporting
Primary Data
Unified
Metrics
Platform
LinkedIn unified metrics platform (UMP)

Growth of UMP Metrics
2016 20172015
6800
4680
1100
Current: 8000+ metrics

# code
LOAD …
# data
# transformation
# code
STORE …
# config
Metrics:
- A = SUM(A’)
- B = Unique(id)
Downstream:
- XLNT
- Raptor
UMP
User Code
Platform
Generated
Code
To
App
To
App
DefineDeclare
Onboard
Data
Metadata
Onboarding process
User

Moving from offline to nearline

Offline computation flows
Hourly job latency: 3-6 hours -> want realtime/nearline
......
Metric union
User code
User code
Cubing, Rollup
Dimension
augmentation
HDFS tables
Dali views
Pinot,
Presto
Azkaban execution
Espresso,
Oracle,
MySQL

...
What we want for nearline flows
Metric unionUser code
User code
Samza job
Dimension
augmentation
Pinot

Latency is not the only requirement
Easy to onboard ● Minimum effort to convert existing offline into nearline
● Easy to write user code for new nearline flows
Easy to maintain ● Just one version of user code - single source of truth
● Run as a service
Latency ● ~5 - 30 mins

Samza jobs
Putting things together
Pinot
Batch
jobs
UMP realtime platform
UMP offline platform
HDFS
Raptor
Lambda architecture with a single codebase
code configMetrics
definition

Current support
User code in Pig ● LOAD, STORE
● FILTER, SAMPLE, SPLIT, UNION
● Simple FOREACH
● JOIN - all semantics
● GROUP/COGROUP, DISTINCT
● Record/Array FLATTEN
● Java UDFs, Python UDFs
● Pig Nested FOREACH and sort/limit (in Windows)
● Hive
Not yet

Under the hood of the nearline architecture

Pig to Samza through SQL processing
Open source framework for building dynamic
data management systems. Including:
➢ SQL Parser
➢ Relational algebra APIs
➢ Query planning engine
We built UMP nearline with:
➢ Pig’s Grunt parser
➢ Calcite relational algebra
➢ Calcite query planning engine

Architecture
...
Metric union
User code
User code
Dimension
augmentation
Calcite relational
algebra as an IR
convert generate
Samza code
optimize
Samza
physical plan
Samza
configuration
Pig to Calcite Calcite to Samza

Pig to Calcite
# code
LOAD …
LOAD ...
COGROUP ...
STORE …
GruntParser
CO-
GROUP
LOAD LOAD
PigRelConverter
FULL
OUTERJ
OIN
AGGRE
GATE
AGGRE
GATE
TABLE
SCAN
TABLE
SCAN
PRO-
JECT
User scripts Pig Logical Plan Calcite relational algebra

Example
INNER
JOIN
FILTER FILTER
PROJECT PROJECT
PROJECT
TABLE
SCAN
TABLE
SCAN
Calcite logical plan

Planning/Optimization
➢ Calcite logical plans:
○ Relational algebra: What to do
➢ Samza physical plans:
○ Samza physical node: How to do it
➢ Calcite Samza planner:
○ Calcite logical plan -> optimized Samza physical plan

Example
Stream-
Stream Self
Join
Samza
Project
Samza
Project
Samza
Filter
Samza
Filter
Samza
Project
Input
Stream
INNER
JOIN
FILTER FILTER
PROJECT PROJECT
PROJECT
TABLE
SCAN
TABLE
SCAN
Calcite Samza
planner
Calcite logical plan Samza physical plan

Code-gen
From Samza physical plans:
➢ Generate Samza code for constructing the stream graph using Samza Fluent APIs .
Mapping:
➢ Samza physical nodes -> corresponding stream APIs:
○ Samza project -> stream.map()
○ Samza filter -> stream.filter()
○ ...
➢ Relational expressions -> lamba functions:
○ Filter expressions -> filter() functions
○ Project expressions -> map() functions
○ ...

Example
Schema and UDF declarations
Operator mapping
Filter functions
Map functions
Produce to Kafka
...

Config-gen
Stream
Stream
Join
Samza
Project
Samza
Project
Samza
Filter
Samza
Filter
Samza
Project
Input
Stream
# dataset.conf
app-src
app-def

Nearline production use case - Storylines

Conquering the Lambda architecture in LinkedIn metrics platform with Apache Calcite and Apache Samza

More Related Content

What's hot (20)

Similar to Conquering the Lambda architecture in LinkedIn metrics platform with Apache Calcite and Apache Samza (20)

Recently uploaded (20)

Conquering the Lambda architecture in LinkedIn metrics platform with Apache Calcite and Apache Samza