@twitter 
Open Source software for large 
scale data processing @Twitter 
Open Source India 2014 
Lohit VijayaRenu @lohitvijayarenu
About this talk 
Introduction to open source projects 
contributed to and used for large-scale data 
processing @Twitter 
About Twitter 
Twitter helps you create and share ideas and information 
instantly, without barriers. 
Twitter usage 
● 284 million monthly active users 
● 500 million Tweets are sent per day 
● 80% of Twitter active users are on mobile 
Company facts 
● 3,600 employees in offices around the world 
● 50% of employees are engineers 
● Incorporated April 19, 2007
Open Source @Twitter 
https://opensource.twitter.com 
Why Open Source 
Twitter is built on open source software, from the back end to the front end. Twitter 
engineers use, contribute to, and release a lot of open source software. Twitter's 
Open Source Program Office supports a variety of open source organizations. We 
are grateful to the open source community for its contributions and want to 
maintain our healthy, reciprocal relationship. 
We contribute to existing projects and open-source many of our own: 
https://engineering.twitter.com/opensource/projects 
For questions, tweet @twitteross 
Open Source Practices 
● Open Source new projects 
(Mesos, Flight, Scalding, hRaven…) 
● Contribute and use open source projects 
(Hadoop, HBase, ZooKeeper, PIG, Cascading...) 
● Few internal forks of open source projects 
(Scribe, Protobuf...) 
Large Scale Data Processing 
Store hundreds of PetaBytes of data across 
thousands of commodity machines and process 
it using scalable programming models to 
support different use cases across the organization 
Scale @Twitter 
● Terabytes of daily data (not just tweets…) 
● Tens of thousands of daily user jobs 
● Hundreds of PetaBytes of data stored 
● Tens of PetaBytes of data processed daily 
● Tens of thousands of commodity machines 
● Multiple clusters based on use cases 
High level view of Data Processing 
Platform and Tools 
● CentOS based Linux machines (custom 
kernel with patches) 
● OpenJDK JVM (custom build with upstream 
patches) 
● Puppet to deploy and manage configuration 
● Many more tools to manage thousands of 
servers 
Data Collection 
● Scribe 
o Server for aggregating log data streamed in real 
time from a large number of servers. 
o Twitter front-end servers log events, which are 
scribed to the Hadoop Distributed FileSystem 
o Collects Terabytes of daily data, which are either 
batched or streamed to real-time systems 
● Apache Thrift 
o Framework for cross language services 
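The aggregation idea behind a Scribe-style collector can be sketched roughly as follows. This is a minimal in-memory illustration, not Scribe's implementation or API: the class `CategoryAggregator`, its `batch_size` parameter, and the `sink` callback are all hypothetical names chosen for the example. It assumes each log message arrives tagged with a category and is handed downstream in per-category batches.

```python
from collections import defaultdict


class CategoryAggregator:
    """Buffers log messages per category and flushes them in batches.

    Hypothetical sketch; not Scribe's actual API.
    """

    def __init__(self, batch_size, sink):
        self.batch_size = batch_size      # messages per category before a flush
        self.sink = sink                  # callable(category, list_of_messages)
        self.buffers = defaultdict(list)  # category -> pending messages

    def log(self, category, message):
        """Accept one message; flush that category once its buffer is full."""
        buffer = self.buffers[category]
        buffer.append(message)
        if len(buffer) >= self.batch_size:
            self.flush(category)

    def flush(self, category):
        """Hand the pending batch for one category to the sink."""
        batch = self.buffers.pop(category, [])
        if batch:
            self.sink(category, batch)

    def flush_all(self):
        """Flush every category, for example on shutdown."""
        for category in list(self.buffers):
            self.flush(category)


# Usage: an in-memory list stands in for the downstream store.
delivered = []
aggregator = CategoryAggregator(batch_size=2, sink=lambda c, b: delivered.append((c, b)))
aggregator.log("client_events", "login user=1")
aggregator.log("ads", "impression id=7")
aggregator.log("client_events", "login user=2")  # fills the client_events buffer
aggregator.flush_all()                           # delivers the remaining ads message
```

In a real deployment the sink would write to a file system or forward to another aggregator; the list here only makes the batching visible.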
Real time processing 
● Apache Storm 
o Real time stream processing framework 
o Sub second latency job processing 
o Integrated with other projects such as Apache 
Kafka 
o Results stored to various systems built on MySQL, 
Redis, Storehaus ... 
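As a rough illustration of the stream-processing model, the sketch below chains a source stage and two processing stages, loosely mirroring Storm's spouts and bolts. It uses plain Python generators rather than Storm itself, so the function names (`sentence_spout`, `split_bolt`, `count_bolt`) are assumptions made for this example, and nothing here is distributed or fault tolerant.

```python
from collections import Counter


def sentence_spout(sentences):
    """Source stage: emits one tuple (here, a sentence) at a time."""
    for sentence in sentences:
        yield sentence


def split_bolt(stream):
    """Transform stage: splits each sentence into lowercase words."""
    for sentence in stream:
        for word in sentence.lower().split():
            yield word


def count_bolt(stream):
    """Stateful stage: emits the running count after every word."""
    counts = Counter()
    for word in stream:
        counts[word] += 1
        yield word, counts[word]


events = ["Storm processes streams", "streams of tweets"]
# Each update is available as soon as its word has been seen,
# without waiting for the input to end.
updates = list(count_bolt(split_bolt(sentence_spout(events))))
```

Because every stage consumes and emits one item at a time, results are updated per event, which is the property that distinguishes this model from batch processing.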
Batch processing 
● Hadoop Distributed FileSystem (HDFS) 
o Hundreds of PetaBytes stored 
o Highly available, fault-tolerant and self-healing 
● Hadoop YARN 
o Scalable Compute framework 
o Supporting MapReduce, VW, Spark, Tez… 
● Managed across multiple clusters 
Higher level programming abstractions 
● MapReduce using Apache PIG 
o PIG scripts translated to MapReduce code 
● Cascading 
o Java API for programmers to write MapReduce jobs 
● Scalding 
o Scala API for Cascading 
● SummingBird 
o Streaming MapReduce API which can work either 
on Storm or MapReduce 
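One way to see why the same job logic can run in either batch or streaming mode is that the reduce step is an associative merge. The sketch below is plain Python, not the Summingbird, Scalding, or Cascading API; `map_words` and `merge` are hypothetical names, and word count is used only because it is the customary example.

```python
from collections import Counter
from functools import reduce


def map_words(line):
    """Map step: turn one input line into a partial count."""
    return Counter(line.lower().split())


def merge(left, right):
    """Reduce step: combine two partial counts.

    Addition is associative, so partial results can be merged
    in any grouping and still give the same total.
    """
    return left + right


lines = ["open source at scale", "open source helps", "scale matters"]

# Batch mode: process the whole data set in one pass.
batch_result = reduce(merge, map(map_words, lines), Counter())

# Streaming mode: fold each line into a running total as it arrives.
running = Counter()
for line in lines:
    running = merge(running, map_words(line))
```

Both modes end with identical counts, which is the property a hybrid batch and streaming API can rely on.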
Tools & Client Resource Management 
● Mesos 
o Job and workflow management handled by a pool of 
machines running Apache Mesos 
o Scheduled Cron and notification system built on top 
of Apache Mesos 
● Parquet 
o Columnar format which supports nested data 
● LibCrunch 
o Mapping framework from objects to nodes 
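To make "columnar format which supports nested data" concrete, here is a deliberately simplified sketch that pivots nested records into one value list per leaf field. It is not how Parquet encodes data (Parquet's handling of repeated and optional fields is omitted entirely); the helpers `flatten` and `to_columns` and the dotted column names are assumptions made for illustration.

```python
def flatten(record, prefix=""):
    """Flatten nested dicts into {"a.b": value} pairs using dotted paths."""
    flat = {}
    for key, value in record.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{path}."))
        else:
            flat[path] = value
    return flat


def to_columns(records):
    """Pivot row-oriented records into one list of values per column."""
    rows = [flatten(r) for r in records]
    names = sorted({name for row in rows for name in row})
    # Missing fields become None so every column has one entry per row.
    return {name: [row.get(name) for row in rows] for name in names}


records = [
    {"id": 1, "user": {"name": "a", "geo": {"country": "IN"}}},
    {"id": 2, "user": {"name": "b"}},
]
columns = to_columns(records)
```

A query that only needs `user.name` can then read that single list and skip the rest, which is the main benefit of a columnar layout for analytical jobs.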
Service Coordination 
● Apache ZooKeeper 
o Consensus-based project (built on the Paxos-like Zab protocol) providing 
co-ordination, leader election and callback services for multiple projects 
within Twitter 
o Highly available and tolerant of network 
partitions in data centers 
o Numerous recipes built on top of ZooKeeper 
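The leader-election recipe commonly built on ZooKeeper has each participant create an ephemeral sequential node, with the lowest sequence number acting as leader. The sketch below simulates that rule in memory only. It does not talk to ZooKeeper or use any client library; `ElectionSim` and its methods are hypothetical names, and watches, sessions, and failure detection are left out.

```python
import itertools


class ElectionSim:
    """In-memory stand-in for a ZooKeeper election path (hypothetical)."""

    def __init__(self):
        self._sequence = itertools.count()  # mimics sequential node numbering
        self._members = {}                  # sequence number -> participant name

    def join(self, name):
        """Register a participant, like creating an ephemeral sequential node."""
        number = next(self._sequence)
        self._members[number] = name
        return number

    def leave(self, number):
        """Remove a participant, like an ephemeral node vanishing with its session."""
        self._members.pop(number, None)

    def leader(self):
        """The participant holding the lowest sequence number leads."""
        if not self._members:
            return None
        return self._members[min(self._members)]


election = ElectionSim()
a = election.join("worker-a")
b = election.join("worker-b")
first = election.leader()    # worker-a joined first
election.leave(a)            # worker-a's session ends
second = election.leader()   # leadership passes to worker-b
```

In the real recipe each participant watches the node just ahead of its own, so a leader failure wakes only one successor instead of every participant.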
Metrics 
● hRaven 
o Collects runtime metrics for all jobs run on Hadoop 
clusters and stores them indefinitely 
o Provides an API to query metrics based on various 
dimensions 
o Highly scalable to support billions of task stats for 
historical analysis of jobs and cluster usage over time 
● Ambrose 
o Platform for visualizing and real time monitoring of 
MapReduce workflows 
Conclusion 
● Large scale data processing is complex 
● Twitter's back-end infrastructure is built using 
many open source projects 
● Open Source helps solve many hard 
problems for Twitter 
● https://opensource.twitter.com 
Thank You & Questions 
● Q&A 
● @lohitvijayarenu 