Conquering the Lambda architecture in LinkedIn metrics platform with Apache Calcite and Apache Samza
Khai Tran
Staff Software Engineer
Agenda
● Overview of LinkedIn metrics platform
● Moving from offline to nearline
● Under the hood of the nearline architecture
● Nearline production use case
● Conclusion
Overview of LinkedIn metrics platform
Metrics @ LinkedIn
● Metrics = Measurements over tracking data
● Crucial for decision making:
○ Experimentation - test everything
○ Reporting - monitor and alert
○ In production, site-facing applications
LinkedIn unified metrics platform (UMP)
We provide:
● A trusted repository of metrics
● A self-serve platform for sustainable lifecycle of metrics
[Diagram: Primary Data, the Unified Metrics Platform, and three uses of its metrics: Experimentation, Reporting, In production]
Growth of UMP Metrics
● 2015: 1100
● 2016: 4680
● 2017: 6800
● Current: 8000+ metrics
Onboarding process
● User steps: Define, Declare, Onboard
● User code:
# code
LOAD …
# data
# transformation
# code
STORE …
● Config:
# config
Metrics:
- A = SUM(A’)
- B = Unique(id)
Downstream:
- XLNT
- Raptor
[Diagram: the User onboards user code and config to UMP; UMP shows User Code and Platform Generated Code, with Data and Metadata each going "To App"]
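To make the declarative config concrete, here is a minimal sketch of how declarations such as A = SUM(A’) and B = Unique(id) could be evaluated over the rows produced by user code. The slides do not show how UMP implements this: the field names (a_prime, id), the METRICS dictionary, and evaluate_metrics are hypothetical, and Unique(id) is assumed to mean a count of distinct ids.

```python
from typing import Callable, Dict, Iterable, List

Row = Dict[str, object]

# Hypothetical stand-in for the "Metrics:" section of the config.
# Each metric is an aggregation over the rows emitted by user code.
METRICS: Dict[str, Callable[[List[Row]], object]] = {
    # A = SUM(A'): sum of the a_prime column (assumed field name).
    "A": lambda rows: sum(r["a_prime"] for r in rows),
    # B = Unique(id): assumed to mean the number of distinct ids.
    "B": lambda rows: len({r["id"] for r in rows}),
}


def evaluate_metrics(rows: Iterable[Row]) -> Dict[str, object]:
    """Apply every declared metric to the same set of rows."""
    materialized = list(rows)
    return {name: agg(materialized) for name, agg in METRICS.items()}


rows = [
    {"id": "u1", "a_prime": 3},
    {"id": "u2", "a_prime": 5},
    {"id": "u1", "a_prime": 2},
]
result = evaluate_metrics(rows)
```

The point of the split is that user code only produces rows, while the metric semantics live in config that the platform can interpret for any downstream.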
Moving from offline to nearline
Offline computation flows
Hourly job latency: 3-6 hours -> want realtime/nearline
[Diagram: offline flow under Azkaban execution. Sources: Espresso, Oracle, MySQL, ...; HDFS tables; Dali views. Stages: user code (multiple), metric union, dimension augmentation, cubing and rollup. Serving: Pinot, Presto.]
What we want for nearline flows
[Diagram: nearline flow as a Samza job. Stages: user code (multiple), metric union, dimension augmentation. Serving: Pinot.]
Latency is not the only requirement
Easy to onboard
● Minimum effort to convert existing offline flows into nearline
● Easy to write user code for new nearline flows
Easy to maintain
● Just one version of user code - single source of truth
● Run as a service
Latency
● ~5 - 30 mins
Putting things together
Lambda architecture with a single codebase
[Diagram: one metrics definition (code, config) drives both the UMP realtime platform (Samza jobs) and the UMP offline platform (batch jobs). Other components shown: Pinot, HDFS, Raptor.]
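The idea of a Lambda architecture with a single codebase can be sketched as one piece of user logic shared by a batch runner and an incremental (nearline) runner. This is only an illustration of the principle, not UMP code: user_logic, run_batch, NearlineJob, and the event fields are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Tuple

Event = Dict[str, object]


def user_logic(event: Event) -> Optional[Tuple[str, int]]:
    """Single source of truth: a filter plus a projection, shared by both paths."""
    if event["type"] != "click":
        return None
    return (str(event["member"]), 1)


def run_batch(events: Iterable[Event]) -> Dict[str, int]:
    """Offline path: process a complete dataset in one pass."""
    totals: Dict[str, int] = defaultdict(int)
    for event in events:
        out = user_logic(event)
        if out is not None:
            totals[out[0]] += out[1]
    return dict(totals)


class NearlineJob:
    """Nearline path: process events one at a time as they arrive."""

    def __init__(self) -> None:
        self.totals: Dict[str, int] = defaultdict(int)

    def process(self, event: Event) -> None:
        out = user_logic(event)
        if out is not None:
            self.totals[out[0]] += out[1]


events = [
    {"member": "m1", "type": "click"},
    {"member": "m2", "type": "view"},
    {"member": "m1", "type": "click"},
]
batch_result = run_batch(events)

job = NearlineJob()
for e in events:
    job.process(e)
nearline_result = dict(job.totals)
```

Because both runners call the same user_logic, the two paths cannot drift apart, which is the maintenance benefit the "single source of truth" requirement asks for.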
Current support
User code in Pig
● LOAD, STORE
● FILTER, SAMPLE, SPLIT, UNION
● Simple FOREACH
● JOIN - all semantics
● GROUP/COGROUP, DISTINCT
● Record/Array FLATTEN
● Java UDFs, Python UDFs
● Pig Nested FOREACH and sort/limit (in Windows)
● Hive
Not yet
Under the hood of the nearline architecture
Pig to Samza through SQL processing
Apache Calcite: an open source framework for building dynamic data management systems, including:
➢ SQL Parser
➢ Relational algebra APIs
➢ Query planning engine
We built UMP nearline with:
➢ Pig’s Grunt parser
➢ Calcite relational algebra
➢ Calcite query planning engine
Architecture
[Diagram: two phases. Pig to Calcite: user code (multiple), metric union, and dimension augmentation are converted to Calcite relational algebra, used as an IR. Calcite to Samza: the IR is optimized into a Samza physical plan, from which Samza code and Samza configuration are generated.]
Pig to Calcite
● User scripts: LOAD …, LOAD ..., COGROUP ..., STORE …
● GruntParser produces the Pig logical plan: two LOAD nodes and a COGROUP node
● PigRelConverter produces Calcite relational algebra: two TABLE SCAN nodes, two AGGREGATE nodes, a FULL OUTER JOIN node, and a PROJECT node
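The slide shows a Pig COGROUP being expressed with AGGREGATE, FULL OUTER JOIN, and PROJECT nodes. A small in-memory sketch of why that rewrite works: aggregate each input into one bag per key, full-outer-join the two results on the key, then project missing sides to empty bags. The node order and the function names below are assumptions for illustration; this is not the PigRelConverter implementation.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Tuple

Row = Tuple[str, str]  # (key, value)
Bags = Dict[str, List[str]]


def aggregate(rows: Iterable[Row]) -> Bags:
    """AGGREGATE: collect the values of each key into a bag (list)."""
    bags: Bags = defaultdict(list)
    for key, value in rows:
        bags[key].append(value)
    return dict(bags)


def full_outer_join(
    left: Bags, right: Bags
) -> List[Tuple[str, Optional[List[str]], Optional[List[str]]]]:
    """FULL OUTER JOIN on the key: keep keys that appear on either side."""
    keys = sorted(set(left) | set(right))
    return [(k, left.get(k), right.get(k)) for k in keys]


def project(joined) -> List[Tuple[str, List[str], List[str]]]:
    """PROJECT: turn a missing side (None) into an empty bag, as COGROUP does."""
    return [(k, l or [], r or []) for k, l, r in joined]


def cogroup(a: Iterable[Row], b: Iterable[Row]):
    return project(full_outer_join(aggregate(a), aggregate(b)))


views = [("u1", "page_a"), ("u2", "page_b"), ("u1", "page_c")]
clicks = [("u2", "ad_1"), ("u3", "ad_2")]
cogrouped = cogroup(views, clicks)
```

Expressing COGROUP this way means the planner only has to understand standard relational operators, at the cost of an extra projection step.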
Example
[Diagram: Calcite logical plan with one INNER JOIN, two FILTER nodes, three PROJECT nodes, and two TABLE SCAN nodes]
Planning/Optimization
➢ Calcite logical plans:
○ Relational algebra: What to do
➢ Samza physical plans:
○ Samza physical node: How to do it
➢ Calcite Samza planner:
○ Calcite logical plan -> optimized Samza physical plan
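The split between "what to do" and "how to do it" can be sketched as a rule-based tree rewrite: each logical operator is mapped to a physical operator, and a join whose inputs read the same source becomes a self join. This is a toy planner, not the Calcite planner API; the Node class, the RULES table, and the physical operator names are hypothetical, and unlike the slide it does not merge the two scans into one input stream.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Node:
    op: str
    inputs: Tuple["Node", ...] = ()
    arg: str = ""  # e.g. the table or stream name for a scan


# Hypothetical one-to-one rules: logical operator -> physical operator.
RULES = {
    "FILTER": "SamzaFilter",
    "PROJECT": "SamzaProject",
    "TABLE SCAN": "InputStream",
}


def leaf_source(node: Node) -> str:
    """Follow the first input down to the scan and return its source name."""
    while node.inputs:
        node = node.inputs[0]
    return node.arg


def to_physical(node: Node) -> Node:
    """Rewrite a logical plan bottom-up into a physical plan."""
    children = tuple(to_physical(c) for c in node.inputs)
    if node.op == "INNER JOIN":
        sources = {leaf_source(c) for c in node.inputs}
        op = "StreamStreamSelfJoin" if len(sources) == 1 else "StreamStreamJoin"
        return Node(op, children, node.arg)
    return Node(RULES[node.op], children, node.arg)


def ops(node: Node) -> List[str]:
    """Pre-order list of operator names, for inspection."""
    return [node.op] + [o for c in node.inputs for o in ops(c)]


scan = Node("TABLE SCAN", arg="events")
branch = Node("FILTER", (Node("PROJECT", (scan,)),))
logical = Node("PROJECT", (Node("INNER JOIN", (branch, branch)),))
physical = to_physical(logical)
physical_ops = ops(physical)
```

A real planner would also choose between alternative physical plans by cost; the sketch only shows the structural conversion step.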
Example
[Diagram: the Calcite Samza planner converts a Calcite logical plan (one INNER JOIN, two FILTER, three PROJECT, two TABLE SCAN) into a Samza physical plan (one Stream-Stream Self Join, three Samza Project, two Samza Filter, one Input Stream)]
Code-gen
From Samza physical plans:
➢ Generate Samza code for constructing the stream graph using Samza Fluent APIs.
Mapping:
➢ Samza physical nodes -> corresponding stream APIs:
○ Samza project -> stream.map()
○ Samza filter -> stream.filter()
○ ...
➢ Relational expressions -> lambda functions:
○ Filter expressions -> filter() functions
○ Project expressions -> map() functions
○ ...
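The mapping above can be sketched as a tiny code generator: each physical node selects a stream method, and each relational expression becomes the body of a lambda. The Stream class here is an in-memory stand-in and not the Samza API; the plan format, OPERATOR_TO_API, and generate are hypothetical, and the generated text is Python rather than the Java that a Samza job would use.

```python
from typing import Callable, Iterable, List, Tuple


class Stream:
    """Tiny in-memory stand-in for a message stream (not the Samza API)."""

    def __init__(self, items: Iterable[dict]) -> None:
        self.items = list(items)

    def map(self, fn: Callable[[dict], dict]) -> "Stream":
        return Stream(fn(x) for x in self.items)

    def filter(self, fn: Callable[[dict], bool]) -> "Stream":
        return Stream(x for x in self.items if fn(x))


# Physical node -> stream method, mirroring project -> map(), filter -> filter().
OPERATOR_TO_API = {"SamzaProject": "map", "SamzaFilter": "filter"}


def generate(plan: List[Tuple[str, str]]) -> str:
    """Emit source text for a linear plan.

    Each entry is (physical node, relational expression over record r),
    ordered from the input stream towards the output.
    """
    code = "stream"
    for node, expr in plan:
        code += f".{OPERATOR_TO_API[node]}(lambda r: {expr})"
    return code


plan = [
    ("SamzaFilter", "r['country'] == 'us'"),
    ("SamzaProject", "{'member': r['member'], 'clicks': r['clicks'] * 2}"),
]
source = generate(plan)

stream = Stream([
    {"member": "m1", "country": "us", "clicks": 2},
    {"member": "m2", "country": "de", "clicks": 5},
])
# Run the generated code against the stand-in stream.
result = eval(source, {"stream": stream}).items
```

Generating source text instead of interpreting the plan at runtime keeps the deployed job a plain stream program that can be inspected and debugged like hand-written code.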
Example
[Code screenshot, annotated sections:]
● Schema and UDF declarations
● Operator mapping
● Filter functions
● Map functions
● Produce to Kafka
● ...
Config-gen
[Diagram: Samza physical plan (one Stream-Stream Join, three Samza Project, two Samza Filter, one Input Stream) shown with the labels # dataset.conf, app-src, and app-def]
Nearline production use case - Storylines
Top stories picked up by editors
Feedback to editor - powered by UMP realtime
Conclusion
From improved Lambda architecture...
Lambda architecture with a single codebase
[Diagram: one metrics definition (code, config) drives both the UMP realtime platform (Samza jobs) and the UMP offline platform (batch jobs). Other components shown: Pinot, HDFS, Raptor.]
… to our bigger picture
AORA (Author Once, Run Anywhere) architecture
[Diagram: Calcite relational algebra at the center, connected to Pig Latin, HiveQL, SparkSQL/RDD, and Presto SQL, together with Portable UDFs]