The Meta of Hadoop


        Joydeep Sen Sarma
Ex-Facebook DI Lead, Founder Qubole
Intro
• File/Database Systems developer (ex-NetApp/Oracle)
• Yahoo (2005-07), Facebook (2007-11)

• @Facebook:
   – SysAdmin: operated massive Hadoop/Hive installs
   – Architect: conceived/wrote Apache Hive; made HBase@FB happen
   – Herded cats: first manager of Data Infra team
   – IT engineer/DBA: built ETL tools, warehouse/reporting for
     FB Virtual Currency

• Founder Qubole Inc. (2011-)
Why Hadoop Succeeded

• Complete Solution and Extensible
   – useful to Engineers, Data Scientists, Analysts
   – Performance isn’t everything
   – Agile: businesses move much faster than before

• Market Dynamics
   – Captive Super-Reference Customer – Yahoo
   – Had the early market to itself for a long time

• Separation of Compute and Storage
   – Parallel Computing != Database
Why Hadoop Succeeded

• Data Consolidation!
   – Just store everything in HDFS
   – MR/Hive/Pig can chew anything

• Lights Out Architecture
   – Low System Operational Cost
   – Low Data Management Cost
      • Don’t need Data Priests
Meta Takeaways
Adaptive Lights-Out Software

• Successful efforts:
   – Automatic map-join/skew join implementations
   – Automatic local mode, resource cache


• Failed:
   – Statistics: ALTER TABLE / ANALYZE TABLE
   – Pre-Bucketing tables


• Takeaway: Learning Frameworks for Systems Software
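
The automatic map-join idea can be illustrated with a small sketch: the system looks at the observed input sizes and picks the join strategy itself, with no hint from the user. This is not Hive's implementation. The names `adaptive_join`, `hash_join`, `sort_merge_join` and the threshold `MAP_JOIN_MAX_ROWS` are hypothetical, and rows are assumed to be plain Python dicts.

```python
from collections import defaultdict

# Hypothetical threshold: the largest "small" input we are willing to hold in memory.
MAP_JOIN_MAX_ROWS = 1000


def hash_join(small, large, key):
    """Map-side join: build a hash table on the small input, stream the large one."""
    table = defaultdict(list)
    for row in small:
        table[row[key]].append(row)
    for row in large:
        for match in table.get(row[key], []):
            yield {**match, **row}


def sort_merge_join(left, right, key):
    """Fallback join: sort both inputs on the key, then merge them."""
    left = sorted(left, key=lambda r: r[key])
    right = sorted(right, key=lambda r: r[key])
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][key], right[j][key]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of right rows sharing this key, then pair it
            # with every left row that has the same key.
            j_end = j
            while j_end < len(right) and right[j_end][key] == lk:
                j_end += 1
            while i < len(left) and left[i][key] == lk:
                for r in right[j:j_end]:
                    yield {**left[i], **r}
                i += 1
            j = j_end


def adaptive_join(left, right, key, max_rows=MAP_JOIN_MAX_ROWS):
    """Choose the strategy from observed input sizes; return (strategy, rows)."""
    small, large = (left, right) if len(left) <= len(right) else (right, left)
    if len(small) <= max_rows:
        return "map-join", list(hash_join(small, large, key))
    return "sort-merge", list(sort_merge_join(left, right, key))
```

The point of the sketch is the last function: the decision is made from data the system already has, which is what keeps the software "lights out".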
Adaptive Lights-Out Software

• Caching + Prefetching is Adaptive
   – Replication is not
   – Can bridge gap between Compute and Storage


• Page Cache over Disk >> In-memory
   – Degrades gracefully


• Provide APIs – not packages
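
A minimal sketch of why caching adapts while replication does not: a read-through LRU cache keeps whatever the workload is touching right now, and a miss only costs a slower read from the backing store. The class name `ReadThroughCache` and the `fetch` callback are hypothetical stand-ins for a cache sitting between compute and remote storage.

```python
from collections import OrderedDict


class ReadThroughCache:
    """LRU cache in front of a slower store; contents follow the access pattern."""

    def __init__(self, fetch, capacity):
        self._fetch = fetch          # hypothetical callback that reads the backing store
        self._capacity = capacity
        self._entries = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self._entries:
            # Mark the entry as most recently used.
            self._entries.move_to_end(key)
            self.hits += 1
            return self._entries[key]
        # Miss: degrade gracefully by reading through to the store.
        self.misses += 1
        value = self._fetch(key)
        self._entries[key] = value
        if len(self._entries) > self._capacity:
            # Evict the least recently used entry.
            self._entries.popitem(last=False)
        return value
```

When the working set outgrows `capacity`, the hit rate drops but every read still succeeds, which is the graceful degradation the slide contrasts with purely in-memory designs.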
Murphy’s Law
• No Trusted Components

• Defend everything
  – Rate-Limit access to every resource
  – Log and Monitor everything


• Clear and Overwhelming Force
  – Oversize it!


• Think QoS from Day-1
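
One common way to rate-limit access to a resource is a token bucket. The sketch below is an illustration under stated assumptions, not something taken from the talk: `TokenBucket`, `rate`, `burst`, and the injectable `clock` are all hypothetical names.

```python
import time


class TokenBucket:
    """Allow `rate` operations per second on average, with bursts up to `burst`."""

    def __init__(self, rate, burst, clock=time.monotonic):
        self._rate = rate
        self._burst = burst
        self._clock = clock          # injectable so the limiter can be tested
        self._tokens = float(burst)
        self._last = clock()

    def try_acquire(self, cost=1.0):
        """Return True and spend tokens if the caller may proceed, else False."""
        now = self._clock()
        # Refill in proportion to elapsed time, capped at the burst size.
        self._tokens = min(self._burst, self._tokens + (now - self._last) * self._rate)
        self._last = now
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False
```

Wrapping every shared resource in a limiter like this, and logging each rejected call, is one way to act on "no trusted components": a misbehaving client is throttled rather than allowed to take the service down.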
Open Source

• Small is Beautiful
  – Build small, easy-to-use, easy-to-understand components
  – Redis!


• Iterative Small Changes
  – Operators HATE large releases
  – Hive (2 weeks) vs. Hadoop (2 years?)
Opportunities
Interesting Problems - I

• Collaborative Analysis
  – Most analysis is repeated
  – Tracking and Searching historical analysis


• Consistency Aware Querying
  – OLAP: Snapshots instead of live tables
  – OLTP: Lookup stale caches instead of master
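
Consistency-aware querying can be sketched as a read API in which the caller states how much staleness it tolerates, and the system decides whether a cached copy is good enough or the master must be consulted. The class `StalenessAwareReader`, the `read_master` callback, and the `max_staleness` parameter are hypothetical.

```python
import time


class StalenessAwareReader:
    """Serve reads from a cache when the caller's staleness bound allows it."""

    def __init__(self, read_master, clock=time.monotonic):
        self._read_master = read_master   # hypothetical callback to the master store
        self._clock = clock
        self._cache = {}                  # key -> (value, time fetched)

    def read(self, key, max_staleness):
        """Return (value, source), where source is "cache" or "master"."""
        now = self._clock()
        entry = self._cache.get(key)
        if entry is not None and now - entry[1] <= max_staleness:
            return entry[0], "cache"
        # The cached copy is missing or too old for this caller: go to the master.
        value = self._read_master(key)
        self._cache[key] = (value, now)
        return value, "master"
```

The same idea applies on the OLAP side: a query that declares it can tolerate hour-old data can run against a snapshot instead of the live table.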
Interesting Problems - II

• SQL is Rope
  – Better than procedural – but still Rope
  – Higher Level templates: moving averages


• Data = Mutating + Immutable
  – Immutable data is easy to manage
  – Cheap: One copy per data center (Facebook Haystack)
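
A higher-level template for moving averages might look like the sketch below: the analyst names the table, column, and window, and the template emits the window-function SQL. The function `moving_average_sql` and the table and column names in the usage note are hypothetical. Identifiers are interpolated without quoting or validation, so this is illustrative only.

```python
def moving_average_sql(table, value_col, order_col, window, partition_col=None):
    """Render a trailing moving-average query using a standard SQL window function."""
    if window < 1:
        raise ValueError("window must be at least 1")
    columns = [order_col, value_col]
    partition = ""
    if partition_col:
        columns.insert(0, partition_col)
        partition = f"PARTITION BY {partition_col} "
    # A window of N rows is the current row plus the N - 1 rows before it.
    return (
        f"SELECT {', '.join(columns)}, "
        f"AVG({value_col}) OVER ({partition}ORDER BY {order_col} "
        f"ROWS BETWEEN {window - 1} PRECEDING AND CURRENT ROW) "
        f"AS {value_col}_ma{window} "
        f"FROM {table}"
    )
```

For example, `moving_average_sql("daily_revenue", "revenue", "ds", 7)` renders a 7-row trailing average ordered by `ds`, so the analyst never has to write the frame clause by hand.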
Think Services, not Software

• Software is getting less interesting
  – Even Distributed Systems Software


• Run/Operate long-running, hot services
  – Innovate inside this boundary
Q&A

The Meta of Hadoop - COMAD 2012

