Hadoop, Pig, HBase at Twitter
Dmitriy Ryaboy, Twitter Analytics, @squarecog

Who is this guy, anyway?
  • LBNL: genome alignment & analysis
  • Ask.com: click log data warehousing
  • CMU: MS in “Very Large Information Systems”
  • Cloudera: graduate student intern
  • Twitter: Hadoop, Pig, Big Data, ...
  • Pig committer

In This Talk
  • Focus on Hadoop parts of data pipeline
  • Data movement
  • HBase
  • Pig
  • A few tips

Not In This Talk
  • Cassandra
  • FlockDB
  • Gizzard
  • Memcached
  • Rest of Twitter’s NoSQL Bingo card

Daily workload
  • 1000s of front-end machines
  • 3 billion API requests
  • 7 TB of ingested data
  • 20,000 Hadoop jobs
  • 55 million tweets
  • Tweets are only 0.5% of data

Twitter data pipeline (simplified)
  • Front ends update DB cluster. Scheduled DB exports to HDFS
  • Front ends, middleware, backend services write logs
  • Scribe pipes logs straight into HDFS
  • Various other data source exports into HDFS
  • Daemons populate work queues as new data shows up
  • Daemons (and cron) pull work off queues, schedule MR and Pig jobs
  • Pig wrapper pushes results into MySQL for reports and dashboards

Logs
  • Apache HTTP, W3C, JSON and Protocol Buffers
  • Each category goes into its own directory on HDFS
  • Everything is LZO compressed.
  • You need to index LZO files to make them splittable.
  • We use a patched version of the Hadoop LZO libraries
  • See http://github.com/kevinweil/hadoop-lzo
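Why an index makes a compressed file splittable can be sketched roughly as follows. This is a simplified illustration, not how the hadoop-lzo library is implemented: it assumes the index is just a list of byte offsets at which compressed blocks start, and the function name `aligned_splits` and the sample offsets are made up for the example. The point is only that a reader can start decompressing at a block boundary, so splits must begin and end on those offsets.

```python
def aligned_splits(block_offsets, file_size, target_split_size):
    """Return (start, end) byte ranges whose boundaries fall on block starts.

    block_offsets: sorted byte offsets where compressed blocks begin
                   (hypothetical stand-in for an LZO index).
    """
    splits = []
    start = block_offsets[0]
    for offset in block_offsets[1:]:
        # Close the current split at the first block boundary that makes it
        # at least as large as the target size.
        if offset - start >= target_split_size:
            splits.append((start, offset))
            start = offset
    # The final split runs to the end of the file.
    splits.append((start, file_size))
    return splits


index = [0, 300, 650, 900, 1400, 1800]
splits = aligned_splits(index, file_size=2000, target_split_size=600)
print(splits)  # [(0, 650), (650, 1400), (1400, 2000)]
```

Without the index, the only safe starting point is byte 0, so the whole file would have to go to a single mapper.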
Tables
  • Users, tweets, geotags, trends, registered devices, etc.
  • Automatic generation of protocol buffer definitions from SQL tables
  • Automatic generation of Hadoop Writables, Input / Output formats, Pig loaders from protocol buffers
  • See Elephant-Bird: http://github.com/kevinweil/elephant-bird

ETL
  • "Crane", config driven, protocol buffer powered.
  • Sources/Sinks: HDFS, HBase, MySQL tables, web services
  • Protobuf-based transformations: chain sets of <input proto, output proto, transformation class>
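The idea of chaining <input proto, output proto, transformation class> triples can be sketched like this. Crane's actual code and config format are not shown in the talk, so everything here is an assumption for illustration: dataclasses stand in for protocol buffer messages, and `RawUser`, `CleanUser`, `UserSummary`, `validate_chain` and `run_chain` are hypothetical names.

```python
from dataclasses import dataclass


# Dataclasses stand in for generated protocol buffer classes (assumption).
@dataclass
class RawUser:
    id: int
    name: str


@dataclass
class CleanUser:
    id: int
    name: str


@dataclass
class UserSummary:
    id: int
    name_length: int


def clean(user):
    return CleanUser(user.id, user.name.strip().lower())


def summarize(user):
    return UserSummary(user.id, len(user.name))


# Each step is an (input type, output type, transformation) triple.
CHAIN = [
    (RawUser, CleanUser, clean),
    (CleanUser, UserSummary, summarize),
]


def validate_chain(chain):
    """Check that each step's output type is the next step's input type."""
    for (_, out_type, _), (in_type, _, _) in zip(chain, chain[1:]):
        if out_type is not in_type:
            raise TypeError(f"{out_type.__name__} does not feed {in_type.__name__}")


def run_chain(chain, record):
    """Validate the chain, then push one record through every step."""
    validate_chain(chain)
    for in_type, out_type, transform in chain:
        if not isinstance(record, in_type):
            raise TypeError(f"expected {in_type.__name__}")
        record = transform(record)
        if not isinstance(record, out_type):
            raise TypeError(f"expected {out_type.__name__} from step")
    return record


result = run_chain(CHAIN, RawUser(7, "  Alice "))
print(result)  # UserSummary(id=7, name_length=5)
```

Declaring the types alongside each transformation lets a mis-wired chain fail at validation time instead of partway through a job.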
HBase

Mutability
  • Logs are immutable; HDFS is great.
  • Tables have mutable data.
      • Ignore updates? Bad data.
      • Pull updates, resolve at read time? Pain, time.
      • Pull updates, resolve in batches? Pain, time.
      • Let someone else do the resolving? Helloooo, HBase!
  • Bonus: various NoSQL bonuses, "not just scans". Lookups, indexes.
  • Warning: we just started with HBase. This is all preliminary. Haven't tried indexes yet.
  • That being said, several services rely on HBase already.

Aren't you guys Cassandra poster boys?
  • YES, but
  • Rough analogy: Cassandra is OLTP and HBase is OLAP
  • Cassandra used when we need low-latency, single-key reads and writes
  • HBase scans much more powerful
  • HBase co-locates data on the Hadoop cluster.

HBase schema for MySQL exports, v1.
  • Want to query by created_at range, by updated_at range, and / or by user_id.
  • Key: [created_at, id]
  • CF: "columns"
      • Configs specify which columns to pull out and store explicitly.
      • Useful for indexing, cheap (HBase-side) filtering
  • CF: "protobuf"
      • A single column, contains serialized protocol buffer.

HBase schema v1, cont.
  • Pro: easy to query by created_at range
  • Con: hard to pull out specific users (requires a full scan)
  • Con: hot spot at the last region for writes
  • Idea: put created_at into 'columns' CF, make user_id key
  • BUT ids mostly sequential; still a hot spot at the end of the table
  • Transitioning to non-sequential ids; but their high bits are creation timestamp! Same problem.

HBase schema, v2.
  • Key: inverted Id. Bottom bits are random. Ahh, finally, distribution.
  • Date range queries: new CF, 'time'
      • Keep all versions of this CF
      • When specific time range needed, use index on the time column
      • Keeping time in separate CF allows us to keep track of every time the record got updated, without storing all versions of the record
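A small sketch of why an inverted id spreads writes. The talk does not say exactly how the id is inverted, so this assumes a plain bit reversal of a 64-bit id; `invert_id` and `row_key` are hypothetical helper names. Because HBase sorts rows by key bytes, the leading byte of the key largely decides which region a write lands in.

```python
def invert_id(record_id, width=64):
    """Reverse the bit order of a non-negative id that fits in `width` bits."""
    if not 0 <= record_id < (1 << width):
        raise ValueError("id does not fit in the given width")
    bits = format(record_id, f"0{width}b")
    return int(bits[::-1], 2)


def row_key(record_id):
    """Big-endian 8-byte row key built from the inverted id (assumed layout)."""
    return invert_id(record_id).to_bytes(8, "big")


sequential_ids = range(8)

# Plain big-endian keys: every id shares the same leading byte, so all
# writes would sort next to each other (the hot spot from schema v1).
plain_prefixes = {i.to_bytes(8, "big")[0] for i in sequential_ids}

# Inverted keys: the fast-changing low bits now lead the key, so
# consecutive ids scatter across the key space.
inverted_prefixes = {row_key(i)[0] for i in sequential_ids}

print(len(plain_prefixes), len(inverted_prefixes))  # 1 8
```

The trade-off is the one the slide points at: once keys no longer sort by time, date range queries need help from somewhere else, here the separate 'time' column family.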
Pig

Why Pig?
  • Much faster to write than vanilla MR
  • Step-by-step iterative expression of data flows intuitive to programmers
  • SQL support coming for those who prefer SQL (PIG-824)
  • Trivial to write UDFs
  • Easy to write Loaders (even better with 0.7!)
  • For example, we can write Protobuf and HBase loaders...
  • Both in Elephant-Bird

HBase Loader enhancements
  • Data expected to be binary, not String representations
  • Push down key range filters
  • Specify row caching (memory / speed tradeoff)
  • Optionally load the key
  • Optionally limit rows per region
  • Report progress
  • Haven't observed significant overhead vs. HBase scanning

HBase Loader TODOs
  • Expose better control of filters
  • Expose timestamp controls
  • Expose index hints (IHBase)
  • Automated filter and projection push-down (once on 0.7)
  • HBase Storage

Elephant Bird
  • Auto-generate Hadoop Input/Output formats, Writables, Pig loaders for Protocol Buffers
  • Starting to work on same for Thrift
  • HBase Loader
  • Assorted UDFs
  • http://www.github.com/kevinweil/elephant-bird

Assorted Tips

Bad records kill jobs
  • Big data is messy.
  • Catch exceptions, increment counter, return null
  • Deal with potential nulls
  • Far preferable to a single bad record bringing down the whole job
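The catch, count, return-null pattern looks roughly like this. It is a minimal sketch in plain Python rather than a Pig UDF: JSON parsing stands in for whatever parsing the real job does, a `collections.Counter` stands in for Hadoop counters, and `parse_record` is a hypothetical name.

```python
import json
from collections import Counter


def parse_record(line, counters):
    """Parse one input line; on failure, count it and return None."""
    try:
        record = json.loads(line)
    except ValueError:
        # A bad record is counted and skipped instead of failing the job.
        counters["unparsable"] += 1
        return None
    counters["parsed"] += 1
    return record


counters = Counter()
lines = ['{"user": 1}', "not json at all", '{"user": 2}']

# Downstream code has to deal with the potential nulls.
records = [r for r in (parse_record(line, counters) for line in lines) if r is not None]

print(records)         # [{'user': 1}, {'user': 2}]
print(dict(counters))  # {'parsed': 2, 'unparsable': 1}
```

Counting the skipped records matters as much as skipping them: a jump in the unparsable counter is the signal that an upstream format changed.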
Runaway UDFs kill jobs
  • Regex over a few billion tweets, most return in milliseconds.
  • 8 cause the regex to take more than 5 minutes, task gets reaped.
  • You clever twitterers, you.
  • MonitoredUDF wrapper kicks off a monitoring thread, kills a UDF and returns a default value if it doesn't return something in time.
  • Plan to contribute to Pig, add to ElephantBird. May build into Pig internals.
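A rough sketch of the timeout-with-default idea, written in Python rather than as the actual MonitoredUDF wrapper, whose code is not shown in the talk. `monitored` is a hypothetical name. One important difference from the slide: Python cannot forcibly kill a thread, so this version abandons the slow call in a daemon thread and moves on instead of killing it.

```python
import threading
import time


def monitored(func, timeout_s, default):
    """Wrap func so that a call taking longer than timeout_s returns default."""

    def wrapper(*args, **kwargs):
        box = {}

        def target():
            try:
                box["value"] = func(*args, **kwargs)
            except Exception as exc:  # hand the error back to the caller
                box["error"] = exc

        worker = threading.Thread(target=target, daemon=True)
        worker.start()
        worker.join(timeout_s)
        if worker.is_alive():
            # Timed out: give up on this record and return the default.
            # The worker thread is abandoned, not killed (Python limitation).
            return default
        if "error" in box:
            raise box["error"]
        return box["value"]

    return wrapper


def slow_udf(text):
    time.sleep(0.3)  # stands in for a pathological regex
    return len(text)


def fast_udf(text):
    return len(text)


safe_slow = monitored(slow_udf, timeout_s=0.05, default=-1)
safe_fast = monitored(fast_udf, timeout_s=0.05, default=-1)

print(safe_fast("hello"))  # 5
print(safe_slow("hello"))  # -1
```

Pairing the default value with a "timed-out UDFs" counter keeps the handful of pathological inputs visible rather than silently dropped.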
Use Counters
  • Use counters. Count everything.
  • UDF invocations, parsed records, unparsable records, timed-out UDFs...
  • Hook into cleanup phases and store counters to disk, next to data, for future analysis
  • Don't have it for Pig yet, but 0.8 adds metadata to job confs to make this possible.
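Storing counters next to the data might look something like the following. This is a local-filesystem sketch, not a Hadoop cleanup hook; the function name `write_output_with_counters` and the file names `part-00000.json` and `_counters.json` are assumptions chosen for the example.

```python
import json
import os
import tempfile
from collections import Counter


def write_output_with_counters(output_dir, records, counters):
    """Write records to a data file and the job's counters to a sibling file."""
    os.makedirs(output_dir, exist_ok=True)
    data_path = os.path.join(output_dir, "part-00000.json")
    counters_path = os.path.join(output_dir, "_counters.json")
    with open(data_path, "w") as data_file:
        for record in records:
            data_file.write(json.dumps(record) + "\n")
    # Counters live in the same directory, so later analysis can find them
    # wherever the data itself ends up.
    with open(counters_path, "w") as counters_file:
        json.dump(dict(counters), counters_file, sort_keys=True)
    return data_path, counters_path


job_counters = Counter(parsed=2, unparsable=1, udf_timeouts=0)
out_dir = os.path.join(tempfile.mkdtemp(), "job_output")
data_path, counters_path = write_output_with_counters(
    out_dir, [{"user": 1}, {"user": 2}], job_counters
)

with open(counters_path) as f:
    stored = json.load(f)
print(stored)  # {'parsed': 2, 'udf_timeouts': 0, 'unparsable': 1}
```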
Lazy deserialization FTW
  • At first: converted Protocol Buffers into Pig tuples at read time.
  • Moved to a Tuple wrapper that deserializes fields upon request.
  • Huge performance boost for wide tables with only a few used columns
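The shape of a lazily deserializing tuple wrapper, as a minimal sketch. It does not use the protocol buffer wire format or Pig's Tuple interface: each field is assumed to be an independently JSON-encoded byte string, and `LazyTuple` with its `decode_count` attribute is a hypothetical class for illustration.

```python
import json


class LazyTuple:
    """Holds encoded fields and decodes each one only when it is requested."""

    def __init__(self, raw_fields):
        self._raw = raw_fields   # list of encoded byte strings
        self._cache = {}         # index -> decoded value
        self.decode_count = 0    # how many fields were actually decoded

    def __len__(self):
        return len(self._raw)

    def get(self, index):
        if index not in self._cache:
            self._cache[index] = json.loads(self._raw[index])
            self.decode_count += 1
        return self._cache[index]


# A "wide" row with four fields, of which the script uses only one.
values = (42, "squarecog", [1, 2, 3], {"lang": "en"})
raw = [json.dumps(v).encode("utf-8") for v in values]

row = LazyTuple(raw)
name = row.get(1)
name_again = row.get(1)  # served from the cache, no second decode

print(name, row.decode_count)  # squarecog 1
```

With eager conversion all four fields would be decoded for every row; here the cost scales with the columns a script touches, which is where the win on wide tables comes from.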
Also see
  • http://www.slideshare.net/kevinweil/protocol-buffers-and-hadoop-at-twitter
  • http://www.slideshare.net/kevinweil/nosql-at-twitter-nosql-eu-2010
  • http://www.slideshare.net/al3x/building-distributed-systems-in-scala
  • http://www.slideshare.net/ryansking/scaling-twitter-with-cassandra
  • http://www.slideshare.net/nkallen/q-con-3770885

Questions?
  • Follow me at twitter.com/squarecog

Photo Credits
  • Bingo: http://www.flickr.com/photos/hownowdesign/2393662713/
  • Sandhill Crane: http://www.flickr.com/photos/dianeham/123491289/
  • Oakland Cranes: http://www.flickr.com/photos/clankennedy/2654213672/

Yahoo! Hadoop User Group - May Meetup - HBase and Pig: The Hadoop ecosystem at Twitter, Dmitriy Ryaboy, Twitter
