Discovery & Consumption of Analytics Data @Twitter
@sirkamran32
ADP Team
November 17, 2016
Sriram Krishnan (@krishnansriram), Arash Aghevli (@aaghevli), Joseph Boyd (@sluicing)
Dave Marwick (@dmarwick), Kamran Munshi (@sirkamran32), Alex Maranka (@sometext42)
Agenda
1. Analytics Data @Twitter
2. Analytics Data Platform (ADP) Stack Overview
3. Deeper Dive (EagleEye Demo)
4. Deeper Dive (DAL Design & Concepts)
5. Observations & Lessons Learned
6. Future Work
Analytics Data @Twitter
Mostly resides in Hadoop
• We manage and run several LARGE Hadoop clusters
• Some of Twitter’s Hadoop clusters are the largest of their kind
• > 10K nodes storing 100s of PB on HDFS
Includes important business data
• e.g. logs, user data, recommendations data, publicly reported metrics, A/B testing data, ads targeting, and much more
Lots of it is processed and analyzed in BATCH fashion using…
• Scalding (for ETL, data science, and general analytics)
• Presto (for more interactive querying)
• Vertica, Tableau, Zeppelin, MySQL (for analysis)
All in all, there are more than 100K daily jobs processing 10s of PB per day.
Analytics Data Software Goals
1. Data Discovery
• Important datasets
• Owners
• Semantics & relevant metadata
2. Data Auditing
• Creators and consumers
• Dependencies
• Alerts
3. Data Abstraction
• Logical datasets to abstract away physical data location and format
4. Data Consumption
• How to consume data via various clients (Scalding, Presto, etc.)?
How our tools fit into the data platform
ADP Stack
• Data Explorer (EagleEye): Discovery & Consumption
• Data Abstraction Layer (DAL): Data Abstraction
• Alerting Service: Alerting
• State Management Service: Scheduling
Data Explorer (aka EagleEye)
EagleEye
An internal web UI to make it easy to search, discover, track, and manage batch applications AND analytics datasets.
EagleEye Cont’d
View dataset schemas, notes, health, physical locations, & owners
EagleEye Cont’d
Set up alerts and view lineage & dependencies
Data Abstraction Layer (aka DAL)
DAL helps you…
Discover and consume data easily
• DAL stores and exposes metadata, schemas, and consumption information for datasets (in HDFS, Vertica, and MySQL)
Understand your dependencies
• DAL tracks which applications produced or consumed datasets, allowing you to visualize your upstream and downstream dependencies in EagleEye (and set up alerts)
Future-proof your data and avoid lengthy migration work
• DAL abstracts away physical data locations/formats from consumers by informing them about datasets at runtime, allowing jobs to read data that has changed locations without a redeploy
Enable error rate resiliency without redeploys
• DAL centrally manages error thresholds, which avoids hardcoded values that may require a rebuild/redeploy of jobs
Easily explore your data
• The DAL CLI allows users to view data without writing jobs/scripts
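The runtime-resolution and central-threshold points above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: `DatasetInfo`, `Catalog`, and `Job.plan` are hypothetical names, and the in-memory map merely stands in for a metadata service. It is not DAL's actual API.

```scala
// Hypothetical sketch: all names are illustrative assumptions, not DAL's real API.
final case class DatasetInfo(location: String, format: String, maxErrorRate: Double)

// Stand-in for a metadata service: maps a logical dataset name to its current info.
final class Catalog(initial: Map[String, DatasetInfo]) {
  private var entries = initial

  def lookup(name: String): Option[DatasetInfo] = entries.get(name)

  // Simulates an operator moving data or tuning a threshold centrally.
  def set(name: String, info: DatasetInfo): Unit = {
    entries = entries.updated(name, info)
  }
}

object Job {
  // The job knows only the logical name; location, format, and threshold
  // are looked up on every run rather than hardcoded in the job.
  def plan(catalog: Catalog, name: String, badRecords: Long, totalRecords: Long): Either[String, String] =
    catalog.lookup(name) match {
      case None => Left(s"unknown dataset: $name")
      case Some(info) =>
        val errorRate = if (totalRecords == 0L) 0.0 else badRecords.toDouble / totalRecords
        if (errorRate > info.maxErrorRate) Left("error rate above threshold")
        else Right(s"read ${info.format} from ${info.location}")
    }
}
```

Because the job holds only a logical name, changing the catalog entry changes what the next run reads, with no rebuild of the job itself.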
DAL Models Data & Applications
Data
Logical Dataset - a symbolic name (a tuple of role, simple name, and environment) with a schema and segment type associated with it
Physical Dataset - describes the serialization of a Logical Dataset to a specific physical location (e.g. hadoop-cluster, vertica-cluster, etc.)
Segments - where the data actually resides. Segments model all the information necessary to deserialize them, such as their physical URL, schema, and the format used upon serialization. They also maintain state to indicate whether the segment is ready for consumption, rolled back, deleted, etc.
• Partitioned Segment - a subset of a dataset for a given time interval
• Snapshot Segment - an entire physical dataset up to a point in time
Batch Applications
DAL can be used by all batch applications, including ad-hoc jobs not managed by Statebird
• Statebird apps have a PhysicalRunContext (the batch application, batch increment, and Statebird run ID)
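One way to picture the data model described above is as a handful of Scala types. This is a sketch under assumptions: the type and field names (`LogicalDataset`, `SegmentKind`, `SegmentQueries`, and so on) are hypothetical and chosen only to mirror the terms on the slide, not taken from DAL's schema.

```scala
// Hypothetical names throughout: an illustrative model, not DAL's actual schema.
import java.time.Instant

// A logical dataset is identified by (role, simple name, environment).
final case class LogicalDataset(role: String, simpleName: String, environment: String)

// A physical dataset binds a logical dataset to one physical location.
final case class PhysicalDataset(logical: LogicalDataset, location: String)

// Lifecycle states a segment can be in.
sealed trait SegmentState
object SegmentState {
  case object Ready extends SegmentState
  case object RolledBack extends SegmentState
  case object Deleted extends SegmentState
}

// Which slice of the dataset a segment holds.
sealed trait SegmentKind
object SegmentKind {
  // A subset of the dataset for a given time interval.
  final case class Partitioned(start: Instant, end: Instant) extends SegmentKind
  // The entire physical dataset up to a point in time.
  final case class Snapshot(asOf: Instant) extends SegmentKind
}

// A segment carries what a reader needs to deserialize it, plus its state.
final case class Segment(
  dataset: PhysicalDataset,
  url: String,
  schema: String,
  format: String,
  kind: SegmentKind,
  state: SegmentState
)

object SegmentQueries {
  // Only segments marked Ready are handed to consumers.
  def consumable(segments: Seq[Segment]): Seq[Segment] =
    segments.filter(_.state == SegmentState.Ready)
}
```

Keeping state on the segment rather than on the dataset means a rolled-back or deleted hour can be hidden from readers without touching the rest of the dataset.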
DAL Architecture
Thrift service on top of a MySQL DB
• Written in Scala using Slick as the ORM
• Deployed on Mesos (like most Twitter services)
DAL clients (e.g. Scalding jobs) register the production and consumption of data segments via DAL APIs (PUSH MODEL):
• DAL.read(TweetEventsScalaDataset)
• DAL.writeDAL(DalWriteAuditTestDataset, D.Hourly, D.Suffix(args("output")), D.Parquet)
Scalding jobs also communicate with Statebird to:
• Poll for dataset updates and find out when the job is supposed to run
• Find out if dependent jobs have run
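The push model above, where clients report what they read and write, is what makes lineage queries possible. The sketch below shows the idea with an in-memory log. `AuditEvent`, `AuditLog`, and their methods are hypothetical names used for illustration; the real service is a Thrift API backed by MySQL, which this does not attempt to reproduce.

```scala
// Hypothetical sketch of a push-model audit log; not DAL's actual Thrift API.
sealed trait Access
object Access {
  case object Read extends Access
  case object Write extends Access
}

final case class AuditEvent(app: String, dataset: String, access: Access)

final class AuditLog {
  private var events = Vector.empty[AuditEvent]

  // Clients push an event each time they produce or consume a dataset.
  def register(event: AuditEvent): Unit = {
    events = events :+ event
  }

  // Applications that wrote the dataset (its producers).
  def producers(dataset: String): Set[String] =
    events.collect { case AuditEvent(app, ds, Access.Write) if ds == dataset => app }.toSet

  // Applications that read the dataset (its consumers).
  def consumers(dataset: String): Set[String] =
    events.collect { case AuditEvent(app, ds, Access.Read) if ds == dataset => app }.toSet

  // Datasets an application depends on directly, i.e. everything it reads.
  def upstreamDatasets(app: String): Set[String] =
    events.collect { case AuditEvent(a, ds, Access.Read) if a == app => ds }.toSet
}
```

A usage example: after an `ingest` job registers a write to `tweet_events` and a `report` job registers a read of it, `producers("tweet_events")` returns `Set("ingest")` and `consumers("tweet_events")` returns `Set("report")`.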
DAL Auditing Flow Listener
Registers reads & writes by jobs that don't use the DAL API directly.
• A custom flow listener is built into all Scalding jobs at Twitter by default
• This is limited but useful for those wanting to view data via EagleEye
• NOTE: This audit information doesn't include full details of data format and physical location, and is used only to surface lineage/provenance in EagleEye
Why not just Hive Metastore?
Value-added features such as…
• A single service for both HDFS and non-HDFS data
• Auditing and lineage for datasets
• Example usage
• A place to add additional future features
Single DAL service vs. multiple metastores…
• DAL is a single service with one back-end, avoiding the need for multiple metastores in each DC
Observations & Lessons
• Simplifying data consumption across ALL data formats and storage systems is critical as the size of your organization and datasets grows
• A metadata service is something that every data platform needs and should be built early on
• DAL integration is now the first order of business for every new processing tool @Twitter
Future Work
• Integrate retention and replication information into DAL
• Helps to further make EagleEye/DAL a ‘one-stop-shop’ for managing datasets
• Expose a Hive-compatible API
• Currently focused on BATCH applications. Investigate STREAMING applications as well
• “Janitor monkey” to poll DAL and clean up datasets that are not being used
• Open source?
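The “Janitor monkey” idea could, for instance, start from last-read timestamps. The sketch below is speculative: the slides do not say how unused datasets would be detected, so the `Usage` record, the `Janitor.unused` function, and the idle-time rule are all assumptions made for illustration.

```scala
import java.time.{Duration, Instant}

// Hypothetical record: when a dataset was last read, if ever.
final case class Usage(dataset: String, lastRead: Option[Instant])

object Janitor {
  // Flags datasets that were never read, or not read within maxIdle before now.
  // It only reports candidates; deciding what to delete is left to the caller.
  def unused(usages: Seq[Usage], now: Instant, maxIdle: Duration): Seq[String] = {
    val cutoff = now.minus(maxIdle)
    usages.collect {
      case Usage(name, None) => name
      case Usage(name, Some(t)) if t.isBefore(cutoff) => name
    }
  }
}
```

Returning candidates instead of deleting directly keeps the risky step, actual cleanup, separate from detection.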
