The Meta of Hadoop - COMAD 2012

The Meta of Hadoop

Joydeep Sen Sarma
Ex-Facebook DI Lead, Founder Qubole

Intro
• File/Database Systems developer (ex-NetApp/Oracle)
• Yahoo (2005-07), Facebook (2007-11)

• @Facebook:
– SysAdmin: operated massive Hadoop/Hive installs
– Architect: conceived/wrote Apache Hive; made HBase@FB happen
– Herded cats: first manager of Data Infra team
– IT engineer/DBA: built ETL tools, warehouse/reporting for FB Virtual Currency

• Founder Qubole Inc. (2011-)

Why Hadoop Succeeded

• Complete Solution and Extensible
– useful to Engineers, Data Scientists, Analysts
– Performance isn’t everything
– Agile: businesses move much faster than before

• Market Dynamics
– Captive Super-Reference Customer – Yahoo
– Had the early market to itself for a long time

• Separation of Compute and Storage
– Parallel Computing != Database

Why Hadoop Succeeded
• Data Consolidation!
– Just store everything in HDFS
– MR/Hive/Pig can chew anything

• Lights Out Architecture
– Low System Operational Cost
– Low Data Management Cost
• Don’t need Data Priests

Adaptive Lights-Out Software

• Successful efforts:
– Automatic map-join/skew join implementations
– Automatic local mode, resource cache

• Failed:
– Statistics: alter table / analyze table
– Pre-Bucketing tables

• Learning Frameworks for Systems Software
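
The "automatic map-join" idea on this slide can be sketched as follows. This is a minimal illustration, not Hive's implementation: the row-count threshold, the function names, and the sample tables are all assumptions made for the example. The point is that the system picks the join strategy from observed sizes, so the user never has to supply a hint.

```python
from collections import defaultdict

# Hypothetical cutoff: a side with at most this many rows is treated as
# small enough to hold in memory on every mapper.
SMALL_TABLE_MAX_ROWS = 1000


def choose_join_strategy(left_size, right_size, threshold=SMALL_TABLE_MAX_ROWS):
    """Pick a strategy from table sizes alone, with no user hint."""
    if min(left_size, right_size) <= threshold:
        return "map-join"
    return "shuffle-join"


def map_join(small, large, key):
    """Map-side join: hash the small side once, then stream the large side."""
    index = defaultdict(list)
    for row in small:
        index[row[key]].append(row)

    joined = []
    for row in large:
        # .get avoids creating empty entries for keys missing on the small side
        for match in index.get(row[key], []):
            joined.append({**match, **row})
    return joined


users = [{"uid": 1, "name": "a"}, {"uid": 2, "name": "b"}]
clicks = [{"uid": 1, "url": "/x"}, {"uid": 1, "url": "/y"}, {"uid": 3, "url": "/z"}]

if choose_join_strategy(len(users), len(clicks)) == "map-join":
    result = map_join(users, clicks, "uid")
```

A real engine would decide from file sizes or collected statistics and fall back to a shuffle join when the small side turns out not to fit in memory.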

Adaptive Lights-Out Software

• Caching + Prefetching is Adaptive
– Replication is not
– Can bridge gap between Compute and Storage

• Page Cache over Disk >> In-memory
– Degrades gracefully

• Provide APIs – not packages

Murphy’s Law
• No Trusted Components

• Defend everything
– Rate-Limit access to every resource
– Log and Monitor everything

• Clear and Overwhelming Force
– Oversize it!

• Think QoS from Day 1
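
One common way to "rate-limit access to every resource" is a token bucket. The sketch below is an illustration only: the class name, parameters, and injectable clock are assumptions, not something taken from the talk or from any Hadoop component.

```python
import time


class TokenBucket:
    """Allow bursts of up to `capacity` requests, refilled at `rate` tokens/second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock          # injectable so the limiter can be tested
        self.tokens = capacity      # start with a full bucket
        self.last = clock()

    def allow(self, cost=1):
        """Return True and spend tokens if the request fits, else False."""
        now = self.clock()
        # Refill for the elapsed time, never beyond capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


# Usage with a fake clock so the behaviour is deterministic.
fake_now = [0.0]
bucket = TokenBucket(rate=1, capacity=2, clock=lambda: fake_now[0])
decisions = [bucket.allow(), bucket.allow(), bucket.allow()]
fake_now[0] = 1.0                   # one second later: one token refilled
decisions.append(bucket.allow())
```

Rejected requests are also a natural thing to log, which ties the limiter to the "log and monitor everything" point.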

Open Source

• Small is Beautiful
– Build small, easy-to-use, easy-to-understand components
– Redis!

• Iterative Small Changes
– Operators HATE large releases
– Hive (2 weeks) vs. Hadoop (2 years?)

Interesting Problems - I

• Collaborative Analysis
– Most analysis is repeated
– Tracking and Searching historical analysis

• Consistency Aware Querying
– OLAP: Snapshots instead of live tables
– OLTP: Look up stale caches instead of the master

Interesting Problems - II

• SQL is Rope
– Better than procedural – but still Rope
– Higher Level templates: moving averages

• Data = Mutating + Immutable
– Immutable data is easy to manage
– Cheap: One copy per data center (Facebook Haystack)
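
A "higher level template" for moving averages could look like the sketch below: the analyst names a table, two columns, and a window, and the template emits the SQL. The function name and the example table and column names are hypothetical. The generated query uses the standard SQL window-function syntax, and whether a given engine supports it is an assumption.

```python
def moving_average_sql(table, time_col, value_col, window):
    """Render a trailing moving-average query over `window` rows.

    Identifiers are interpolated as-is, so this is only suitable for
    trusted, internally generated names.
    """
    if window < 1:
        raise ValueError("window must be >= 1")
    return (
        f"SELECT {time_col}, "
        f"AVG({value_col}) OVER (ORDER BY {time_col} "
        f"ROWS BETWEEN {window - 1} PRECEDING AND CURRENT ROW) AS moving_avg "
        f"FROM {table}"
    )


# Hypothetical table and columns, for illustration only.
sql = moving_average_sql("daily_revenue", "ds", "revenue", 7)
```

The user states the intent (a 7-row moving average) and never writes the window frame by hand, which is the part that is easy to get wrong.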

Think Services, not Software

• Software is getting less interesting
– Even Distributed Systems Software

• Run/Operate long-running, hot services
– Innovate inside this boundary
