Discovery & Consumption of Analytics Data @Twitter

@sirkamran32
ADP Team
November 17, 2016

Sriram Krishnan (@krishnansriram), Arash Aghevli (@aaghevli), Joseph Boyd (@sluicing)
Dave Marwick (@dmarwick), Kamran Munshi (@sirkamran32), Alex Maranka (@sometext42)

Agenda
1. Analytics Data @Twitter
2. Analytics Data Platform (ADP) Stack Overview
3. Deeper Dive (EagleEye Demo)
4. Deeper Dive (DAL Design & Concepts)
5. Observations & Lessons Learned
6. Future Work

Analytics Data @Twitter
Mostly resides in Hadoop
• We manage and run several LARGE Hadoop clusters
• Some of Twitter’s Hadoop clusters are the largest of their kind
• More than 10K nodes storing hundreds of PB on HDFS
Includes important business data
• e.g. logs, user data, recommendations data, publicly reported metrics, A/B testing data, ads targeting, and much more

Much of it is processed and analyzed in BATCH fashion using…
• Scalding (for ETL, data science, and general analytics)
• Presto (for more interactive querying)
• Vertica, Tableau, Zeppelin, MySQL (for analysis)

All in all, there are more than 100K daily jobs processing tens of PB per day.

Analytics Data Software Goals
1. Data Discovery
• Important datasets
• Owners
• Semantics & relevant metadata
2. Data Auditing
• Creators and consumers
• Dependencies
• Alerts
3. Data Abstraction
• Logical datasets to abstract away physical data location and format
4. Data Consumption
• How to consume data via various clients (Scalding, Presto, etc.)?

ADP Stack
How our tools fit into the data platform
1. Data Explorer (EagleEye): Discovery & Consumption
2. Data Abstraction Layer (DAL): Data Abstraction
3. Alerting Service: Alerting
4. State Management Service: Scheduling

EagleEye
An internal web UI that makes it easy to search, discover, track, and manage batch applications AND analytics datasets.

EagleEye Cont’d
View dataset schemas, notes, health, physical locations, & owners

EagleEye Cont’d
Set up alerts and view lineage & dependencies

Data Abstraction Layer
(aka DAL)

DAL helps you…
Discover and consume data easily
• DAL stores and exposes metadata, schemas, and consumption information for datasets (in HDFS, Vertica, and MySQL)

Understand your dependencies
• DAL tracks which applications produced or consumed datasets, allowing you to visualize your upstream and downstream dependencies in EagleEye (and set up alerts)

Future proof your data and avoid lengthy migration work
• DAL abstracts away physical data locations/formats from consumers by informing them about datasets at runtime, allowing jobs to read data that has changed locations without a redeploy.
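
The runtime lookup described above can be sketched in a few lines of Python. This is an illustration only, not DAL's actual API: the `Catalog` class, its methods, the dataset name, and the example URLs and formats are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PhysicalLocation:
    """Where and how a dataset is currently stored (hypothetical fields)."""
    url: str
    data_format: str


class Catalog:
    """Toy stand-in for a metadata service mapping logical names to locations."""

    def __init__(self):
        self._locations = {}

    def register(self, logical_name, location):
        # Re-registering a name models a migration to a new location or format.
        self._locations[logical_name] = location

    def resolve(self, logical_name):
        # Consumers call this at run time instead of hardcoding a path.
        return self._locations[logical_name]


def run_job(catalog, logical_name):
    """A consumer that only knows the logical name of its input."""
    loc = catalog.resolve(logical_name)
    return f"reading {loc.data_format} from {loc.url}"


catalog = Catalog()
catalog.register(
    "tweet_events",
    PhysicalLocation("hdfs://cluster-a/logs/tweet_events", "thrift-lzo"),
)
first_run = run_job(catalog, "tweet_events")

# The data moves; the job code is unchanged and picks up the new location.
catalog.register(
    "tweet_events",
    PhysicalLocation("hdfs://cluster-b/logs/tweet_events", "parquet"),
)
second_run = run_job(catalog, "tweet_events")
```

Because `run_job` holds only the logical name, the migration is a metadata update rather than a code change.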

Enable error rate resiliency without redeploys
• DAL centrally manages error thresholds, which avoids hardcoded values that may require a rebuild/redeploy of jobs
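
A rough sketch of what a centrally managed threshold buys you, assuming a reader that tolerates bad records up to a configured error rate. `ThresholdStore`, `read_records`, and the `clicks` dataset are hypothetical names, not part of DAL.

```python
class ThresholdStore:
    """Toy stand-in for centrally managed error thresholds (hypothetical)."""

    def __init__(self, default=0.0):
        self._default = default
        self._thresholds = {}

    def set_threshold(self, dataset, max_error_rate):
        self._thresholds[dataset] = max_error_rate

    def get_threshold(self, dataset):
        return self._thresholds.get(dataset, self._default)


def read_records(records, dataset, store):
    """Parse records, failing only if the error rate exceeds the threshold."""
    good, bad = [], 0
    for raw in records:
        try:
            good.append(int(raw))
        except ValueError:
            bad += 1
    error_rate = bad / len(records) if records else 0.0
    # The threshold is looked up on every run, never compiled into the job.
    if error_rate > store.get_threshold(dataset):
        raise RuntimeError(f"{dataset}: error rate {error_rate:.2f} too high")
    return good


store = ThresholdStore()
raw = ["1", "2", "oops", "4"]  # one of four records is corrupt

try:
    read_records(raw, "clicks", store)
    strict_ok = True
except RuntimeError:
    strict_ok = False

# The threshold is relaxed centrally; the job itself is not rebuilt.
store.set_threshold("clicks", 0.5)
relaxed = read_records(raw, "clicks", store)
```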

Easily explore your data
• The DAL CLI allows users to easily view data without writing jobs/scripts

DAL Models Data & Applications
Data
Logical Dataset - a symbolic name (a tuple of role, simple name, and environment) with a schema and segment type associated with it

Physical Dataset - describes the serialization of a Logical Dataset to a specific physical location (e.g. hadoop-cluster, vertica-cluster, etc.)

Segments - where the data actually resides. Segments model all the information necessary to deserialize them, such as their physical URL, schema, and the format used upon serialization. They also maintain state to indicate whether the segment is ready for consumption, rolled back, deleted, etc.
• Partitioned Segment - a subset of a dataset for a given time interval
• Snapshot Segment - an entire physical dataset up to a point in time

Batch Applications
DAL can be used by all batch applications, including ad-hoc jobs not managed by Statebird

• Statebird apps have a PhysicalRunContext (the batch application, batch increment, and Statebird run ID)
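
One way to picture the data model above is as a few plain records. The sketch below is a guess at the shape, not DAL's schema: the field names, the three `SegmentState` values, and the `ready_partitions` helper are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum
from typing import Optional


class SegmentState(Enum):
    READY = "ready"
    ROLLED_BACK = "rolled-back"
    DELETED = "deleted"


@dataclass(frozen=True)
class LogicalDataset:
    role: str
    simple_name: str
    environment: str
    schema: str
    segment_type: str  # "partitioned" or "snapshot"


@dataclass(frozen=True)
class PhysicalDataset:
    logical: LogicalDataset
    location: str  # e.g. the name of a Hadoop or Vertica cluster


@dataclass(frozen=True)
class Segment:
    dataset: PhysicalDataset
    url: str
    data_format: str
    state: SegmentState
    end: datetime
    start: Optional[datetime] = None  # None marks a snapshot up to `end`


def ready_partitions(segments, start, end):
    """URLs of READY partitioned segments lying inside [start, end)."""
    return [
        s.url
        for s in segments
        if s.start is not None
        and s.state is SegmentState.READY
        and start <= s.start
        and s.end <= end
    ]


logical = LogicalDataset("logs", "tweet_events", "prod", "TweetEvent", "partitioned")
physical = PhysicalDataset(logical, "hadoop-cluster")
hours = [datetime(2016, 11, 17, h) for h in range(4)]
segments = [
    Segment(
        physical,
        f"hdfs://hadoop-cluster/tweet_events/{h:02d}",
        "parquet",
        SegmentState.ROLLED_BACK if h == 1 else SegmentState.READY,
        start=hours[h],
        end=hours[h + 1],
    )
    for h in range(3)
]
# The rolled-back hour 01 is skipped; hours 00 and 02 are consumable.
urls = ready_partitions(segments, hours[0], hours[3])
```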

DAL Architecture
A Thrift service on top of a MySQL database.
• Written in Scala using Slick as the ORM
• Deployed on Mesos (like most Twitter services)

DAL Clients (e.g. Scalding jobs) register the production and consumption of data segments via DAL APIs (PUSH MODEL):
• DAL.read(TweetEventsScalaDataset)
• DAL.writeDAL(DalWriteAuditTestDataset, D.Hourly, D.Suffix(args("output")), D.Parquet)

Scalding jobs also communicate with Statebird to:
• Poll for dataset updates and find out when the job is supposed to run
• Find out if dependent jobs have run
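
The push model can be illustrated with a toy client that reports every read and write as it happens. This is a sketch under stated assumptions, not the DAL client: `AuditLog`, `Client`, and the application and dataset names are hypothetical, and the real service is a Thrift API rather than an in-memory object.

```python
from collections import defaultdict


class AuditLog:
    """Toy in-memory stand-in for the service that clients push events to."""

    def __init__(self):
        self.producers = defaultdict(set)  # dataset -> apps that wrote it
        self.consumers = defaultdict(set)  # dataset -> apps that read it

    def register_read(self, app, dataset):
        self.consumers[dataset].add(app)

    def register_write(self, app, dataset):
        self.producers[dataset].add(app)


class Client:
    """Wraps reads and writes so that each one is reported to the audit log."""

    def __init__(self, app, audit_log):
        self.app = app
        self.audit_log = audit_log

    def read(self, dataset):
        self.audit_log.register_read(self.app, dataset)
        return f"<records of {dataset}>"  # placeholder for the real data

    def write(self, dataset, records):
        self.audit_log.register_write(self.app, dataset)


log = AuditLog()

etl = Client("hourly_etl", log)
etl.write("tweet_events_clean", etl.read("tweet_events"))

report = Client("daily_report", log)
report.read("tweet_events_clean")
```

After these calls the log can answer who produced and who consumed each dataset, without anyone scanning the file system.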

DAL Auditing Flow Listener
Registers reads & writes by jobs that don't use the DAL API directly.
• A custom flow listener is built into all Scalding jobs at Twitter by default
• This is limited but useful for those who want to view data via EagleEye
• NOTE: This audit information doesn't include full details of data format and physical location, and is used only to surface lineage/provenance in EagleEye
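
Read/write audit records alone are enough to derive lineage. The sketch below assumes each record is a tuple of (application, datasets read, datasets written); that shape and all names in it are hypothetical, chosen only to show the graph walk.

```python
from collections import deque

# Hypothetical audit records: (application, datasets read, datasets written).
AUDIT_RECORDS = [
    ("hourly_etl", {"raw_logs"}, {"clean_logs"}),
    ("sessionizer", {"clean_logs"}, {"sessions"}),
    ("daily_report", {"sessions", "users"}, {"report"}),
]


def downstream(dataset, records):
    """All datasets derived, directly or transitively, from `dataset`."""
    seen = set()
    queue = deque([dataset])
    while queue:
        current = queue.popleft()
        for _app, reads, writes in records:
            if current in reads:
                for out in writes - seen:
                    seen.add(out)
                    queue.append(out)
    return seen


def upstream(dataset, records):
    """All datasets that `dataset` depends on, found by reversing each edge."""
    flipped = [(app, writes, reads) for app, reads, writes in records]
    return downstream(dataset, flipped)


derived = downstream("raw_logs", AUDIT_RECORDS)
sources = upstream("report", AUDIT_RECORDS)
```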

Why not just Hive Metastore?
Value-added features such as…
• A single service for both HDFS and non-HDFS data
• Auditing and lineage for datasets
• Example usage
• A place to add additional future features

Single DAL service VS. multiple metastores…
• DAL is a single service with one back-end, avoiding the need for multiple metastores in each DC

Observations & Lessons
• Simplifying data consumption across ALL data formats and storage systems is critical as your organization and datasets grow
• A metadata service is something that every data platform needs and should be built early on
• DAL-integration is now the first order of business for every new processing tool @Twitter

Future Work
• Integrate retention and replication information into DAL
• Helps make EagleEye/DAL a ‘one-stop shop’ for managing datasets
• Expose a Hive-compatible API
• Currently focused on BATCH applications. Investigate STREAMING applications as well
• “Janitor monkey” to poll DAL and clean up datasets that are not being used
• Open source?
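
The “janitor monkey” idea could be prototyped roughly as follows. The deck only says it would poll DAL for unused datasets; the input shape (dataset name mapped to last read time), the 90-day idle window, and the function name are assumptions for illustration.

```python
from datetime import datetime, timedelta


def unused_datasets(last_consumed, now, max_idle):
    """Names of datasets with no recorded read within `max_idle` of `now`.

    `last_consumed` maps dataset name -> time of the most recent read,
    or None if the dataset has never been read.
    """
    return sorted(
        name
        for name, ts in last_consumed.items()
        if ts is None or now - ts > max_idle
    )


now = datetime(2016, 11, 17)
last_consumed = {
    "tweet_events": now - timedelta(days=1),
    "old_experiment": now - timedelta(days=200),
    "never_read": None,
}
# Hypothetical policy: flag anything idle for more than 90 days.
candidates = unused_datasets(last_consumed, now, timedelta(days=90))
```

Flagging candidates rather than deleting them outright leaves room for an owner to review the list first.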
