Cloud-Optimized Big-Data as a Service 
Joydeep Sen Sarma 
Co-Founder Qubole, Apache-Hive
About Me 
• @Facebook (2007-2011): 
– First Hadoop Engineer 
– Founder - Apache Hive project, PMC Member 
– Contributor to Apache Hadoop/HBase 
• Founder Qubole (2012-) 
– Hadoop-as-a-Service 
– 30+ customers: Pinterest, Quora, Mediamath, Tubemogul … 
– Design/Code/Ops/Support/…
Big Data Cloud 
• Elasticity: 
– Workloads are Bursty 
– Allows easy rolling upgrades and testing 
• Lower TCO: 
– Cloud Storage is Inexpensive (2-3c/GB/month – globally replicated) 
– Zero cost to try new projects 
– Upgrade to new hardware easily (no cluster migrations!)
Big Data Cloud 
• Global: 
– Easily set up where employees/customers/entities are located 
• Collaboration: 
– Zero-Copy sharing of data with Partners and across Departments 
– Easy access to great public data sets 
• As-a-Service delivery model vastly lowers Operational Cost
Cloud-Optimized Big Data? 
• Optimized for lower TCO 
• Optimized for Speed 
• Optimized for Operations/Support
Cloud-Optimized Big Data 
Optimized for lower TCO
Automated LifeCycle Mgmt 
select t.county, count(1) from (select 
transform(a.zip) using 'geo.py' as county 
from SMALL_TABLE a) t group by 
t.county; 
hadoop jar myapp.jar -Dmapred.min.split.size=32000000 
-partitioner org.apache… 
insert overwrite table dest 
select a.id, a.zip, count(distinct b.uid) 
from ads a join LARGE_TABLE b on (a.id=b.ad_id) group 
by a.id, a.zip; 
AdCo Hadoop
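The first query on this slide pipes each zip code through a script named geo.py, whose contents the slides do not show. A minimal sketch of what such a Hive TRANSFORM script might look like follows, assuming Hive passes rows as tab-separated lines on stdin and reads result rows from stdout. The lookup table and the function name `transform` are hypothetical, chosen for illustration only.

```python
#!/usr/bin/env python3
"""Hypothetical geo.py: maps each zip code on stdin to a county on stdout."""
import sys

# Illustrative lookup only; a real script would load a full zip-to-county table.
ZIP_TO_COUNTY = {
    "94301": "Santa Clara",
    "94103": "San Francisco",
    "10001": "New York",
}


def transform(lines):
    """Yield one county per input row (columns tab-separated, one row per line)."""
    for line in lines:
        zip_code = line.rstrip("\n").split("\t")[0]
        yield ZIP_TO_COUNTY.get(zip_code, "UNKNOWN")


if __name__ == "__main__":
    for county in transform(sys.stdin):
        print(county)
```

In Hive the script usually has to be shipped to the cluster first (for example with `add file geo.py;`) before the query can reference it.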
Auto-Scaling 
insert overwrite table dest 
select … from ads join campaigns 
on … group by …; 
[Diagram labels: StarCluster, AWS, Master (Job Tracker), Slaves; Map Tasks, Reduce Tasks; Demand, Supply, Progress]
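The Demand and Supply labels in this diagram suggest a sizing rule that compares outstanding tasks against available task slots. The sketch below is one simple way to express that idea and is not Qubole's actual algorithm; the function name, the slots-per-node parameter, and the min/max bounds are all assumptions.

```python
import math


def desired_nodes(pending_tasks, running_tasks, slots_per_node, min_nodes, max_nodes):
    """Return a cluster size whose task slots (supply) cover current task demand."""
    demand = pending_tasks + running_tasks          # tasks that need a slot
    needed = math.ceil(demand / slots_per_node)     # nodes required to host them
    return max(min_nodes, min(max_nodes, needed))   # clamp to configured bounds


if __name__ == "__main__":
    # 90 queued + 10 running tasks, 8 slots per node, cluster bounded to 2..20 nodes.
    print(desired_nodes(90, 10, slots_per_node=8, min_nodes=2, max_nodes=20))
```

A real controller would also consider job progress, so that nodes are not added for work that is about to finish.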
Spot Instances 
On average, 50-60% cheaper 
• Fall back to regular instances when Spot is unavailable 
• Replace regular instances with Spot when available
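The two rules above can be read as a single placement policy that is re-evaluated whenever Spot availability changes. The sketch below assumes a hypothetical `plan_instances` helper and a flat Spot discount; real Spot prices fluctuate, so the cost function is only a rough illustration.

```python
def plan_instances(desired, spot_available):
    """Prefer Spot; cover any shortfall with regular (on-demand) instances."""
    spot = min(desired, spot_available)
    return {"spot": spot, "on_demand": desired - spot}


def hourly_cost(plan, on_demand_price, spot_discount):
    """Blend cost assuming Spot is `spot_discount` (e.g. 0.5) cheaper than on-demand."""
    spot_price = on_demand_price * (1 - spot_discount)
    return plan["spot"] * spot_price + plan["on_demand"] * on_demand_price


if __name__ == "__main__":
    plan = plan_instances(desired=10, spot_available=6)
    print(plan, hourly_cost(plan, on_demand_price=1.0, spot_discount=0.5))
```

Calling `plan_instances` again when more Spot capacity appears shifts nodes from regular to Spot, and calling it when Spot disappears shifts them back.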
Using Fast but ‘Thin’ nodes 
• C3 instances: 50% better performance at 20% lower cost 
• Little local storage 
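The two C3 figures combine into a single cost-per-unit-of-work number. The sketch below assumes "50% better performance" means 1.5x throughput and "20% lower cost" refers to the hourly price; the function name is hypothetical.

```python
def relative_cost_per_unit_work(performance_gain, cost_reduction):
    """Cost per unit of work relative to the baseline instance (baseline = 1.0)."""
    relative_price = 1 - cost_reduction          # e.g. 0.8 of the baseline price
    relative_throughput = 1 + performance_gain   # e.g. 1.5x the baseline throughput
    return relative_price / relative_throughput


if __name__ == "__main__":
    ratio = relative_cost_per_unit_work(performance_gain=0.5, cost_reduction=0.2)
    print(f"{ratio:.3f} of baseline cost per unit of work")
```

Under those assumptions the same job costs roughly 0.8 / 1.5 of the baseline, or about 47% less.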
Using Fast but ‘Thin’ nodes 
Modify Hadoop to use Network drives for overflow 
[Diagram labels: Map-Reduce, HDFS; Disk I/O to Local SSD; Overflow to Network Drives]
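The overflow idea can be sketched as a volume-selection rule: write to the local SSD while it has room, and spill to a network drive otherwise. This is only an illustration of the policy, not the actual Hadoop modification; `pick_volume` and the mount paths are hypothetical.

```python
def pick_volume(required_bytes, volumes):
    """Return the first volume with enough free space.

    `volumes` is a list of (path, free_bytes) in preference order,
    with the local SSD listed before any network drive.
    """
    for path, free_bytes in volumes:
        if free_bytes >= required_bytes:
            return path
    raise OSError("no volume has enough free space")


if __name__ == "__main__":
    volumes = [("/mnt/ssd", 100), ("/mnt/network", 10_000)]
    print(pick_volume(50, volumes))   # fits on the local SSD
    print(pick_volume(500, volumes))  # overflows to the network drive
```

In practice the free-space figures could come from `shutil.disk_usage(path).free` rather than being passed in.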
Cloud-Optimized Big Data 
Optimized for Speed
Faster, Faster .. 
• Optimize I/O to AWS S3 
– Faster Split Computation (8x) 
– Prefetching S3 files (30%) 
– Zero-Copy writes to S3 
• JVM Reuse (1.2-2x speedup) 
• Columnar File Caches on local disks (1.2-2x speedup)
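The slides do not say how S3 prefetching is implemented. As a generic illustration of the technique, the sketch below keeps a few reads in flight ahead of the consumer, so that download latency overlaps with processing. The `fetch` callable stands in for an S3 GET and, like `prefetching_reader` and `depth`, is an assumption.

```python
from collections import deque
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def prefetching_reader(keys, fetch, depth=2):
    """Yield fetch(key) for each key in order, keeping up to `depth` fetches in flight."""
    keys = iter(keys)
    pending = deque()
    with ThreadPoolExecutor(max_workers=depth) as pool:
        # Prime the pipeline with the first `depth` requests.
        for key in islice(keys, depth):
            pending.append(pool.submit(fetch, key))
        while pending:
            data = pending.popleft().result()
            # Start the next fetch before handing data to the consumer.
            for key in islice(keys, 1):
                pending.append(pool.submit(fetch, key))
            yield data


if __name__ == "__main__":
    fake_fetch = lambda key: f"contents of {key}"  # stand-in for an S3 read
    for blob in prefetching_reader(["part-0", "part-1", "part-2"], fake_fetch):
        print(blob)
```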
Faster, Faster .. 
• 5x Faster than nearest competitor (Hive against S3)
• Presto-as-a-Service – 3-22x faster SQL against S3 
– (as tested by customer) 
Cloud-Optimized Big Data 
Optimized for Operations/Support
Rolling Upgrades 
• @Facebook – we spent months upgrading a large cluster 
• @Qubole: Start new cluster, Reassign label
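The "Reassign label" step can be pictured as a level of indirection: jobs name a label, and the label points at whichever cluster is current. The sketch below illustrates that idea with a hypothetical `ClusterLabels` registry; the cluster identifiers are invented and this is not Qubole's API.

```python
class ClusterLabels:
    """Maps stable labels (e.g. "prod") to the cluster currently serving them."""

    def __init__(self):
        self._by_label = {}

    def assign(self, label, cluster_id):
        """Point `label` at `cluster_id`; return the previously assigned cluster, if any."""
        previous = self._by_label.get(label)
        self._by_label[label] = cluster_id
        return previous

    def resolve(self, label):
        """Return the cluster that jobs submitted to `label` should run on."""
        return self._by_label[label]


if __name__ == "__main__":
    labels = ClusterLabels()
    labels.assign("prod", "cluster-v1")
    # Upgrade: start a new cluster on the new version, then move the label.
    retired = labels.assign("prod", "cluster-v2")
    print(labels.resolve("prod"), "replaces", retired)
```

New jobs submitted to the label land on the upgraded cluster, and the old cluster can be shut down once its running jobs drain.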
Support 
• Chat 
• Email
Visually browse Historical Jobs
Questions? 
joydeep@qubole.com 
@jsensarma 
www.qubole.com

