Self-Serve Reporting Platform on Hadoop
Shirshanka Das
Strata Singapore 2015
Strata SG 2015: LinkedIn Self Serve Reporting Platform on Hadoop
Ingest Process Serve Visualize
Reporting Pipelines
Reporting at LinkedIn: Evolution
[Diagram. Labels: Sources (Oracle, Espresso, Kafka, External); INFA + Scripts on Oracle, Teradata; Custom Jobs on Hadoop; Voldemort, Pinot, MySQL; MSTR, Tableau, Internal Tools, Custom]
Infra Scale
Number of Hadoop clusters: 12
Total number of machines: ~7k
Largest Cluster: ~3k machines
Data volume generated per day: XX Terabytes
Total accumulated data: XX Petabytes
People Scale
Reporting Platform Team: ~10
Core Warehouse Team: 1x
Data Scientists: 10x
Business Analysts: 10x
Product Managers: 10x
Sales and Marketing: 100x
Challenges
Disjointed efforts, unreliable systems
Unpredictable SLA across all systems
Fragmented data pipelines with inconsistent data
Houston, we have a problem
Step 1: Central transport pipeline
Still have a problem
Step 2: Central Ingestion Framework
Stream + Batch
REST
SFTP
JDBC
Diverse Sources
Open source @ github.com/linkedin/gobblin
In production @ LinkedIn, Intel, Swisscom, NerdWallet
@LinkedIn
~20 distinct source types
Hundreds of TB per day
Hundreds of datasets
Data Quality
Unified Metrics Platform
Requirements
- Single Source of Truth
- Easy Onboarding
- Operability
Workflow
[Diagram. Labels: Metric Owner; Metric Definition; Sandbox; Code Repository; Build; System Jobs; Core Metrics Job; Central Team, Relevant Stakeholders]
1. iterate
2. create
3. review
4. check in
Metric Definition
[Diagram. Labels: Name, Description, Tags, Owners; Dataset; Dimensions; Time; Script; Metrics; Entity Ids; Tier; Formulas; Entity Dimensions; Input Datasets; Temporality]
An example: video play analysis
name: "video"
description: "Metrics for video tracking"
label: "video"
tags: [flagship, feed]
owners: [jdoe, jsmith]
enabled: true
retention: 90d
timestamp: timestamp
frequency: daily
script: video_play.pig
output_window: 1d
dimensions: [
  {
    name: platform
    doc: "phone, tablet or desktop"
  }
  {
    name: action_type
    doc: "click play or auto-play"
  }
]
input_datasets: [
  {
    name: actionsRaw
    path: Tracking.ActionEvent
    range: 1d
  }
]
An example contd…
metrics: [
  {
    name: unique_viewers
    doc: "Count of unique viewers"
    formula: "unique(member_id)"
    tier: 2
    good_direction: "up"
  }
  {
    name: play_actions
    doc: "Sum of play actions"
    tier: 2
    formula: "sum(play_actions)"
    good_direction: "up"
  }
]
entity_ids: [
  {
    name: member_id
    category: member
  }
  {
    name: video_id
    category: video
  }
]
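To make the two formulas concrete, here is a minimal Python sketch of what `unique(member_id)` and `sum(play_actions)` could mean once rows are grouped by the `platform` and `action_type` dimensions. This is not UMP code: the sample rows, the `compute_metrics` helper, and the in-memory grouping are assumptions for illustration, and the real computation runs as a Pig script on Hadoop.

```python
from collections import defaultdict

# Hypothetical rows, shaped like what a script such as video_play.pig might emit.
rows = [
    {"platform": "phone", "action_type": "click", "member_id": 1, "video_id": 10, "play_actions": 2},
    {"platform": "phone", "action_type": "click", "member_id": 1, "video_id": 11, "play_actions": 1},
    {"platform": "phone", "action_type": "click", "member_id": 2, "video_id": 10, "play_actions": 1},
    {"platform": "desktop", "action_type": "auto-play", "member_id": 3, "video_id": 10, "play_actions": 4},
]

DIMENSIONS = ("platform", "action_type")


def compute_metrics(rows, dimensions=DIMENSIONS):
    """Group rows by their dimension values and evaluate both formulas per group."""
    viewers = defaultdict(set)   # backs unique(member_id)
    plays = defaultdict(int)     # backs sum(play_actions)
    for row in rows:
        key = tuple(row[d] for d in dimensions)
        viewers[key].add(row["member_id"])
        plays[key] += row["play_actions"]
    return {
        key: {"unique_viewers": len(viewers[key]), "play_actions": plays[key]}
        for key in viewers
    }


result = compute_metrics(rows)
for key, values in sorted(result.items()):
    print(key, values)
```

Member 1 appears twice in the phone/click group but counts once toward `unique_viewers`, while both of that member's rows contribute to `play_actions`.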
UMP Data Flow
[Diagram. Labels: Primary Data (tracking, databases, external); Metrics Script; Data Prep; UMP Raw Data; agg, cube, dimension, verify; UMP Aggregated Data; HDFS + Pinot; UMP Monitor; Dashboards, Relevance, Experiment analysis, Ad-hoc, …]
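The "agg" and "cube" steps named in the data flow can be pictured with a small Python sketch: sum a measure for every combination of kept and rolled-up dimensions. The `cube` function, the `ALL` marker, and the sample rows are hypothetical. The slides do not describe how UMP implements this step.

```python
from collections import defaultdict
from itertools import combinations

ALL = "*"  # hypothetical marker meaning "this dimension is rolled up"


def cube(rows, dimensions, measure):
    """Sum `measure` for every subset of `dimensions` (2**n groupings)."""
    subsets = [
        set(combo)
        for size in range(len(dimensions) + 1)
        for combo in combinations(dimensions, size)
    ]
    totals = defaultdict(int)
    for row in rows:
        for kept in subsets:
            # Keep the real value for retained dimensions, ALL for the rest.
            key = tuple(row[d] if d in kept else ALL for d in dimensions)
            totals[key] += row[measure]
    return dict(totals)


# Hypothetical pre-aggregated rows.
rows = [
    {"platform": "phone", "action_type": "click", "play_actions": 3},
    {"platform": "phone", "action_type": "auto-play", "play_actions": 2},
    {"platform": "desktop", "action_type": "click", "play_actions": 5},
]

totals = cube(rows, ("platform", "action_type"), "play_actions")
for key, value in sorted(totals.items()):
    print(key, value)
```

Summing works here because `sum` is additive across groups. A distinct count such as `unique(member_id)` cannot be rolled up by adding per-group counts, so it has to be recomputed (or approximated) for each grouping.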
UMP by the numbers
First version in production since early 2014
Significant redesign in 2015
Total amount of data being scanned per day: Hundreds of TBs
Total number of metrics being computed: 2k+
Total number of scripts: ~400
Number of authors for these metrics: ~200
Maximum number of dimensions per dataset: ~30
Number of people responsible for upkeep of pipeline: 2
Learnings so far
Ease of onboarding
- Hard when you have > 1000 users with different skill sets
- Need great UX to complement developer-friendly alternatives
Single source of truth
- Not just a technology challenge
- Organization needs to rally around it
Operability
- Multi-tenant Hadoop pipeline with SLAs and QoS: hard
- Cost to Serve: managing metrics lifecycle is important
The Next Big Things
Bridging streaming and batch
Code-free metrics
Sessions, Funnels, Cohorts
Open source
Pinot
Capabilities
- SQL-like interface (minus joins)
- Sub-second query latency
- Data load from Hadoop and Kafka
Pinot Data Flow
[Diagram. Labels: Kafka, Samza, Hadoop, Process, Pinot; latency annotations: minutes, hour +]
Pinot@LinkedIn
- Site-facing Apps
- Reporting dashboards
- Monitoring
In production since 2012
Open source @ github.com/linkedin/pinot
Raptor
Standardize Visualization
Leverage
- Standalone app, with support for embedding
- Can use existing analytics backend: Pinot
Strategic
- Reduces dependency on 3rd party BI tools
- Closer integration with LinkedIn’s ecosystem of experimentation, anomaly detection solutions
Requirements
- Support apps ecosystem
- Core Visualization Capabilities
- Metadata Integration
Raptor 1.0
First version built by 3 engineers in a quarter
Features
- Integration with UMP, Pinot
- Time series, bar charts, …
- Create, Publish, Clone, Discover Dashboards
Numbers
- Number of dashboards: ~100
- Weekly unique users: ~400
The Future for Raptor
Social Collaboration features
Intelligence
- Anomaly detection
- Dashboards You May Like
Embedding into data products
Open Source
A Few Good Hammers
Unified Metrics Platform
Pinot
Raptor
What we’re excited about
Unified Metrics Platform
Pinot
Raptor
Metadata Bus
Metadata driven e2e Optimizations
Dynamic prioritization of data ingest
Surface source data quality issues in dashboard
Surface backfill status on dashboard
Cascading deprecation of dashboards, computation and data sources through lineage
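As a rough illustration of cascading through lineage, the sketch below walks a small lineage graph to find everything downstream of a data source. The graph contents, the node names other than `Tracking.ActionEvent` and `video_play.pig`, and the `downstream` helper are all hypothetical. The slides do not describe how the metadata bus stores or traverses lineage.

```python
from collections import deque

# Hypothetical lineage: each key feeds the items in its list.
LINEAGE = {
    "Tracking.ActionEvent": ["video_play.pig"],
    "video_play.pig": ["video_metrics"],
    "video_metrics": ["video_dashboard", "feed_dashboard"],
    "search_metrics": ["feed_dashboard"],
}


def downstream(node, lineage):
    """Breadth-first walk returning every consumer reachable from `node`."""
    visited = {node}
    order = []
    queue = deque([node])
    while queue:
        current = queue.popleft()
        for child in lineage.get(current, []):
            if child not in visited:
                visited.add(child)
                order.append(child)
                queue.append(child)
    return order


affected = downstream("Tracking.ActionEvent", LINEAGE)
print(affected)
```

The result is a list of candidates rather than a final verdict: in this toy graph `feed_dashboard` also reads `search_metrics`, so a real policy would need to decide whether a partially affected dashboard is deprecated or only flagged.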
Shirshanka Das
@shirshanka
Catch me offline to chat about…
What we’re doing for
- Views on Hadoop
- Data Quality
- Metadata
