Open Source India 2014

@twitter
Open Source software for large
scale data processing @Twitter
Open Source India 2014
Lohit VijayaRenu @lohitvijayarenu

About this talk
Introduction to the open source projects that
Twitter contributes to and uses for large-scale
data processing

About Twitter
Twitter helps you create and share ideas and information
instantly, without barriers.
Twitter usage
● 284 million monthly active users
● 500 million Tweets are sent per day
● 80% of Twitter active users are on mobile
Company facts
● 3,600 employees in offices around the world
● 50% of employees are engineers
● Incorporated April 19, 2007

Open Source @Twitter
https://opensource.twitter.com

Why Open Source
Twitter is built on open source software, from the back end to the front end. Twitter
engineers use, contribute to and release a lot of open source software. The Twitter
Open Source Program Office supports a variety of open source organizations. We
are grateful to the open source community for its contributions and want to
maintain our healthy, reciprocal relationship.
We contribute to many existing projects and open source many of our own:
https://engineering.twitter.com/opensource/projects
For questions, tweet @twitteross

Open Source Practices
● Open Source new projects
(Mesos, Flight, Scalding, hRaven…)
● Contribute and use open source projects
(Hadoop, HBase, ZooKeeper, PIG, Cascading...)
● Maintain a few internal forks of open source projects
(Scribe, Protobuf...)

Large Scale Data Processing
Store hundreds of petabytes of data across
thousands of commodity machines and process
it using scalable programming models to
support different use cases across the organization

Scale @Twitter
● Terabytes of daily data (not just tweets…)
● Tens of thousands of daily user jobs
● Hundreds of PetaBytes of data stored
● Tens of PetaBytes of data processed daily
● Tens of thousands of commodity machines
● Multiple clusters based on use cases

High level view of Data Processing

Platform and Tools
● CentOS based Linux machines (custom
kernel with patches)
● OpenJDK JVM (custom build with upstream
patches)
● Puppet to deploy and manage configuration
● Many more tools to manage thousands of
servers

Data Collection
● Scribe
o Server for aggregating log data streamed in real
time from a large number of servers.
o Twitter front-end servers log events, which are
scribed to the Hadoop Distributed File System (HDFS)
o Collects terabytes of data daily, which is either
batched or streamed to real-time systems
● Apache Thrift
o Framework for cross language services
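The aggregation step described for Scribe can be sketched in a few lines of Python. This is a minimal in-memory illustration, not Scribe's real interface: the class name CategoryAggregator, the batch_size parameter and the sink callback are hypothetical names chosen for the example, and the list called written stands in for HDFS.

```python
from collections import defaultdict
from typing import Callable, Dict, List


class CategoryAggregator:
    """Toy log aggregator: buffers messages per category, flushes in batches.

    Hypothetical illustration only; this is not Scribe's real interface.
    """

    def __init__(self, batch_size: int, sink: Callable[[str, List[str]], None]):
        self.batch_size = batch_size
        self.sink = sink  # called with (category, messages) on each flush
        self.buffers: Dict[str, List[str]] = defaultdict(list)

    def log(self, category: str, message: str) -> None:
        """Buffer one message; flush the category once the batch is full."""
        buf = self.buffers[category]
        buf.append(message)
        if len(buf) >= self.batch_size:
            self.flush(category)

    def flush(self, category: str) -> None:
        """Hand the buffered messages of one category to the sink."""
        messages = self.buffers.pop(category, [])
        if messages:
            self.sink(category, messages)

    def flush_all(self) -> None:
        """Flush every category that still holds buffered messages."""
        for category in list(self.buffers):
            self.flush(category)


# Usage: collect flushed batches in a list standing in for HDFS.
written = []
agg = CategoryAggregator(batch_size=2, sink=lambda c, m: written.append((c, m)))
agg.log("client_events", "e1")
agg.log("ads", "a1")
agg.log("client_events", "e2")  # second client_events message fills the batch
agg.flush_all()                 # flushes the remaining "ads" buffer
```

Batching per category keeps writes to the downstream store large and sequential; a streaming consumer would instead be handed each message as it arrives.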

Real time processing
● Apache Storm
o Real time stream processing framework
o Sub second latency job processing
o Integrated with other projects such as Apache
Kafka
o Results stored to various systems built on MySQL,
Redis, Storehaus ...
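The stream processing model above (a source emitting events, processing steps transforming them, results written to a store) can be sketched with plain Python generators. This is a single-process illustration under stated assumptions and does not use Storm's API: the function names are hypothetical, and the dictionary called store stands in for a key-value result store.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, Tuple


def tweet_source(tweets: Iterable[str]) -> Iterator[str]:
    """Plays the role of a spout: emits one tweet at a time."""
    for tweet in tweets:
        yield tweet


def extract_hashtags(stream: Iterable[str]) -> Iterator[str]:
    """Plays the role of a bolt: emits each hashtag found in a tweet."""
    for tweet in stream:
        for word in tweet.split():
            if word.startswith("#") and len(word) > 1:
                yield word.lower()


def running_counts(stream: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Plays the role of a second bolt: emits the updated count per hashtag."""
    counts: Counter = Counter()
    for tag in stream:
        counts[tag] += 1
        yield tag, counts[tag]


sample = ["hello #OSI2014", "#hadoop and #storm", "more #Hadoop"]
store: Dict[str, int] = {}  # stand-in for a key-value result store
for tag, count in running_counts(extract_hashtags(tweet_source(sample))):
    store[tag] = count  # each update is visible as soon as the event is processed
```

Because every stage is a generator, each event flows through the whole pipeline before the next one is read, which is the property that keeps per-event latency low in a real streaming system.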

Batch processing
● Hadoop Distributed FileSystem (HDFS)
o Hundreds of PetaBytes stored
o Highly available, fault-tolerant and self-healing
● Hadoop YARN
o Scalable Compute framework
o Supporting MapReduce, VW, Spark, Tez…
● Managed across multiple clusters

Higher-level programming abstractions
● MapReduce using Apache PIG
o PIG scripts translated to MapReduce code
● Cascading
o Java API for programmers to write MapReduce jobs
● Scalding
o Scala API for Cascading
● SummingBird
o Streaming MapReduce API which can work either
on Storm or MapReduce
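The idea behind a streaming MapReduce API, where one job definition runs in either batch or streaming mode, can be sketched as a map step plus an associative merge. This is a minimal Python illustration, not the Summingbird, Scalding or Cascading API; the function names map_line, merge, run_batch and run_streaming are hypothetical.

```python
from collections import Counter
from functools import reduce
from typing import Iterable, List


def map_line(line: str) -> Counter:
    """Map step: turn one line into partial word counts."""
    return Counter(line.lower().split())


def merge(a: Counter, b: Counter) -> Counter:
    """Reduce step: associative merge of two partial counts."""
    return a + b


def run_batch(lines: Iterable[str]) -> Counter:
    """Batch mode: map every line, then reduce all partial results."""
    return reduce(merge, map(map_line, lines), Counter())


def run_streaming(lines: Iterable[str]) -> List[Counter]:
    """Streaming mode: fold each line into running state as it arrives."""
    state = Counter()
    snapshots = []
    for line in lines:
        state = merge(state, map_line(line))
        snapshots.append(state.copy())
    return snapshots


lines = ["open source at scale", "open data", "source of data"]
batch_result = run_batch(lines)
stream_result = run_streaming(lines)[-1]  # final state after the last line
```

Because merge is associative, the order in which partial results are combined does not matter, so the batch run and the final streaming state agree.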

Tools & Client Resource Management
● Mesos
o Job and workflow management is handled by a pool of
machines running Apache Mesos
o Scheduled Cron and notification system built on top
of Apache Mesos
● Parquet
o Columnar format which supports nested data
● LibCrunch
o Mapping framework from objects to nodes
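One general way to map objects to nodes is rendezvous (highest-random-weight) hashing, sketched below in Python. The slide does not say which algorithm LibCrunch uses, so this is a swapped-in technique shown only to illustrate the problem; the function names score and place and the node names are hypothetical.

```python
import hashlib
from typing import List, Sequence


def score(obj: str, node: str) -> int:
    """Deterministic pseudo-random weight for an (object, node) pair."""
    digest = hashlib.sha256(f"{obj}|{node}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def place(obj: str, nodes: Sequence[str], replicas: int = 1) -> List[str]:
    """Return the `replicas` highest-scoring nodes for this object."""
    ranked = sorted(nodes, key=lambda n: score(obj, n), reverse=True)
    return ranked[:replicas]


nodes = ["node-a", "node-b", "node-c", "node-d"]
before = {f"block-{i}": place(f"block-{i}", nodes)[0] for i in range(100)}

# Remove one node: only objects that lived on it should move.
remaining = [n for n in nodes if n != "node-c"]
after = {k: place(k, remaining)[0] for k in before}
moved = [k for k in before if before[k] != after[k]]
```

The useful property is that removing a node relocates only the objects that were placed on it, since the relative ranking of the surviving nodes is unchanged for every object.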

Service Coordination
● Apache ZooKeeper
o Consensus-based project (built on Zab, a Paxos-like
atomic broadcast protocol) providing coordination, leader
election and callback services for multiple projects
within Twitter
o Highly available and tolerant of network
partitions in data centers
o Numerous recipes built on top of ZooKeeper
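The leader election recipe can be sketched in Python with an in-memory stand-in for ZooKeeper. The common recipe has each candidate create an ephemeral sequential znode, and the candidate holding the lowest sequence number leads. The class below only simulates that behavior in one process and makes no ZooKeeper client calls; ElectionGroup and its methods are hypothetical names.

```python
from typing import Dict, Optional


class ElectionGroup:
    """In-memory stand-in for a znode with sequential ephemeral children."""

    def __init__(self) -> None:
        self._next_seq = 0
        self._members: Dict[int, str] = {}  # sequence number -> member name

    def join(self, name: str) -> int:
        """Register a candidate and return its sequence number."""
        seq = self._next_seq
        self._next_seq += 1
        self._members[seq] = name
        return seq

    def leave(self, seq: int) -> None:
        """Simulate session loss: the ephemeral entry disappears."""
        self._members.pop(seq, None)

    def leader(self) -> Optional[str]:
        """The member holding the lowest sequence number leads."""
        if not self._members:
            return None
        return self._members[min(self._members)]


group = ElectionGroup()
a = group.join("worker-a")
b = group.join("worker-b")
c = group.join("worker-c")
first = group.leader()    # worker-a joined first
group.leave(a)            # the leader's session ends
second = group.leader()   # worker-b takes over
```

In a real deployment each candidate would watch the entry just ahead of its own, so that only one client is notified when a leader disappears.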

Metrics
● hRaven
o Collects runtime metrics for all jobs run on Hadoop
clusters and stores them indefinitely
o Provides an API to query metrics along various
dimensions
o Highly scalable, supporting stats for billions of tasks for
historical analysis of jobs and cluster usage over time
● Ambrose
o Platform for visualizing and real time monitoring of
MapReduce workflows
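Querying job metrics along a dimension amounts to a group-and-sum over job records, sketched below in Python. This is an illustration under stated assumptions and not hRaven's API or schema: the JobRecord fields, the total_by helper and the sample values are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class JobRecord:
    """Hypothetical per-job metrics record (not hRaven's schema)."""
    cluster: str
    user: str
    app: str
    map_tasks: int
    runtime_secs: int


def total_by(records: Iterable[JobRecord], dimension: str, metric: str) -> Dict[str, int]:
    """Sum `metric` over records, grouped by the value of `dimension`."""
    totals: Dict[str, int] = defaultdict(int)
    for rec in records:
        totals[getattr(rec, dimension)] += getattr(rec, metric)
    return dict(totals)


# Made-up sample data for illustration only.
records = [
    JobRecord("cluster1", "alice", "wordcount", 100, 60),
    JobRecord("cluster1", "bob", "sessions", 400, 300),
    JobRecord("cluster2", "alice", "wordcount", 120, 75),
]
runtime_by_user = total_by(records, "user", "runtime_secs")
tasks_by_cluster = total_by(records, "cluster", "map_tasks")
```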

Conclusion
● Large scale data processing is complex
● Twitter's back-end infrastructure is built using
many open source projects
● Open Source helps solve many hard
problems for Twitter
● https://opensource.twitter.com

Thank You & Questions
● Q&A
● @lohitvijayarenu
