Cloud Optimized Big Data

Cloud-Optimized Big-Data as a Service
Joydeep Sen Sarma
Co-Founder Qubole, Apache-Hive

About Me
• @Facebook (2007-2011):
– First Hadoop Engineer
– Founder - Apache Hive project, PMC Member
– Contributor to Apache Hadoop/HBase
• Founder Qubole (2012-)
– Hadoop-as-a-Service
– 30+ customers: Pinterest, Quora, MediaMath, TubeMogul …
– Design/Code/Ops/Support/…

Big Data Cloud
• Elasticity:
– Workloads are Bursty
– Allows easy rolling upgrades and testing
• Lower TCO:
– Cloud Storage is Inexpensive (2-3c/GB/month – globally replicated)
– Zero cost to try new projects
– Upgrade to new hardware easily (no cluster migrations!)
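
As a rough worked example of the storage price quoted above, the sketch below computes a monthly bill at 2-3 cents per GB per month. The 100 TB dataset size and the function name are illustrative assumptions, not figures from the talk.

```python
def monthly_storage_cost_usd(size_gb: float, cents_per_gb_month: float) -> float:
    """Monthly cost in US dollars for size_gb at the given per-GB rate."""
    return size_gb * cents_per_gb_month / 100.0


# Hypothetical dataset: 100 TB, using 1 TB = 1000 GB for round numbers.
size_gb = 100 * 1000
low = monthly_storage_cost_usd(size_gb, 2.0)   # at 2 cents/GB/month
high = monthly_storage_cost_usd(size_gb, 3.0)  # at 3 cents/GB/month
print(f"${low:,.0f} to ${high:,.0f} per month")
```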

Big Data Cloud
• Global:
– Easily set up where employees/customers/entities are located
• Collaboration:
– Zero-Copy sharing of data with Partners and across Departments
– Easy access to great public data sets
• As-a-Service delivery model vastly lowers Operational Cost

Cloud-Optimized Big Data?
• Optimized for lower TCO
• Optimized for Speed
• Optimized for Operations/Support

Cloud-Optimized Big Data
Optimized for lower TCO

Automated LifeCycle Mgmt
select t.county, count(1) from (select
transform(a.zip) using 'geo.py' as county
from SMALL_TABLE a) t group by
t.county;
hadoop jar -Dmapred.min.split.size=32000000
myapp.jar -partitioner .org.apache…
insert overwrite table dest
select a.id, a.zip, count(distinct b.uid)
from ads a join LARGE_TABLE b on (a.id=b.ad_id) group
by a.id, a.zip;
AdCo Hadoop

Auto-Scaling
insert overwrite table dest
select … from ads join campaigns
on … group by …;
[Diagram labels: StarCluster; Master (Job Tracker); Slaves; Map Tasks; Reduce Tasks; Progress; Demand; Supply; AWS]
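
The demand/supply idea on this slide can be sketched as a sizing function: demand is the number of pending map and reduce tasks, supply is the number of slave nodes. The talk does not describe the actual scaling policy, so the function name, the slot counts, and the min/max bounds below are all assumptions made for illustration.

```python
import math


def desired_slaves(pending_map_tasks: int, pending_reduce_tasks: int,
                   map_slots_per_node: int, reduce_slots_per_node: int,
                   min_nodes: int, max_nodes: int) -> int:
    """Return a slave count that would cover the current task demand.

    Hypothetical policy: size the cluster for whichever task type needs
    more nodes, then clamp to the configured bounds.
    """
    need_for_maps = math.ceil(pending_map_tasks / map_slots_per_node)
    need_for_reduces = math.ceil(pending_reduce_tasks / reduce_slots_per_node)
    wanted = max(need_for_maps, need_for_reduces)
    return max(min_nodes, min(max_nodes, wanted))


# A burst of work grows the cluster; an idle cluster shrinks to the minimum.
print(desired_slaves(100, 20, 8, 4, min_nodes=2, max_nodes=40))
print(desired_slaves(0, 0, 8, 4, min_nodes=2, max_nodes=40))
```

A real controller would also have to account for job progress and for how long new instances take to start, which this sketch ignores.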

Spot Instances
On average 50-60% cheaper
• Fall back to regular instances when Spot is unavailable
• Replace regular instances with Spot when available
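
A minimal sketch of the fallback rule above: fill as much of a request as possible with Spot capacity and cover the rest with regular instances. The function names, the price, and the 50% discount are assumptions chosen for illustration; no real cloud API is called.

```python
def choose_instances(needed: int, spot_available: int) -> dict:
    """Split a request for `needed` nodes between Spot and regular instances."""
    spot = max(0, min(needed, spot_available))
    return {"spot": spot, "regular": needed - spot}


def blended_hourly_cost(plan: dict, regular_price: float,
                        spot_discount: float) -> float:
    """Hourly cost of a plan when Spot is `spot_discount` cheaper than regular."""
    spot_price = regular_price * (1.0 - spot_discount)
    return plan["spot"] * spot_price + plan["regular"] * regular_price


# Hypothetical request: 10 nodes, only 6 obtainable as Spot right now.
plan = choose_instances(needed=10, spot_available=6)
print(plan)
print(blended_hourly_cost(plan, regular_price=1.0, spot_discount=0.5))
```

Re-running `choose_instances` as Spot capacity returns gives the second bullet: regular instances are swapped back out for Spot.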

Using Fast but ‘Thin’ nodes
• C3 instances: 50% better performance at 20% lower cost
• Little local storage 
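
To see what the C3 figures imply together, the sketch below combines them into a cost per unit of work. It assumes "50% better performance" means 1.5x throughput and "20% lower cost" means 0.8x the hourly price; the function name is illustrative.

```python
def relative_cost_per_unit_work(perf_gain: float, cost_reduction: float) -> float:
    """Cost per unit of work relative to the baseline instance type (1.0)."""
    return (1.0 - cost_reduction) / (1.0 + perf_gain)


ratio = relative_cost_per_unit_work(perf_gain=0.50, cost_reduction=0.20)
print(f"{ratio:.2f}")  # 0.8 / 1.5
```

Under those assumptions the same job costs a little over half as much, provided the smaller local storage does not become the bottleneck.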

Using Fast but ‘Thin’ nodes
Modify Hadoop to use Network drives for overflow
[Diagram labels: Map-Reduce, HDFS, Disk I/O, Local SSD, Overflow, Network Drives]
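
The overflow idea can be sketched as a directory choice: write to the local SSD while it has room, otherwise spill to a network drive. This is not the actual Hadoop modification, which the talk does not detail; the function name, the paths, and the reserve threshold are hypothetical.

```python
def pick_spill_dir(bytes_needed: int, local_free_bytes: int,
                   local_dir: str, network_dir: str,
                   reserve_bytes: int = 0) -> str:
    """Prefer the local SSD; overflow to the network drive when it is full.

    `reserve_bytes` keeps some headroom on the local disk (an assumption).
    In practice local_free_bytes could come from shutil.disk_usage(path).free.
    """
    if local_free_bytes - reserve_bytes >= bytes_needed:
        return local_dir
    return network_dir


GB = 1024 ** 3
# Hypothetical mount points for a node with a small local SSD.
print(pick_spill_dir(2 * GB, 10 * GB, "/mnt/ssd/tmp", "/mnt/network/tmp"))
print(pick_spill_dir(20 * GB, 10 * GB, "/mnt/ssd/tmp", "/mnt/network/tmp"))
```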

Optimized for Speed

Faster, Faster ..
• Optimize I/O to AWS S3
– Faster Split Computation (8x)
– Prefetching S3 files (30%)
– Zero-Copy writes to S3
• JVM Reuse (1.2-2x speedup)
• Columnar File Caches on local disks (1.2-2x speedup)
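
For readers unfamiliar with the term, split computation turns a list of input files into byte ranges that map tasks can process in parallel. The talk does not say how it was made 8x faster against S3; the sketch below only illustrates the computation itself, working from a single listing of (key, size) pairs. The function name and the sample listing are assumptions.

```python
def compute_splits(files, split_size: int):
    """Return (key, offset, length) byte ranges covering every file.

    `files` is an iterable of (key, size_bytes) pairs, for example the
    result of one bulk object listing.
    """
    splits = []
    for key, size in files:
        offset = 0
        while offset < size:
            length = min(split_size, size - offset)
            splits.append((key, offset, length))
            offset += length
    return splits


# Hypothetical listing with tiny sizes so the result is easy to read.
listing = [("logs/part-0", 70), ("logs/part-1", 30)]
for split in compute_splits(listing, split_size=32):
    print(split)
```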

Faster, Faster ..
• 5x Faster than nearest competitor (Hive against S3)

Faster, Faster ..
• Presto-as-a-Service – 3-22x faster SQL against S3
– (as tested by customer)

Optimized for Operations/Support

Rolling Upgrades
• @Facebook – we spent months upgrading a large cluster
• @Qubole: Start new cluster, Reassign label
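
The "start new cluster, reassign label" approach can be sketched as a level of indirection: jobs target a label, and the label is repointed once the upgraded cluster is up. The class and the cluster names below are hypothetical and are not Qubole's API.

```python
class LabelRouter:
    """Maps a stable label (what jobs refer to) to a concrete cluster."""

    def __init__(self):
        self._label_to_cluster = {}

    def assign(self, label: str, cluster_id: str):
        """Point `label` at `cluster_id`; return the previous cluster, if any."""
        previous = self._label_to_cluster.get(label)
        self._label_to_cluster[label] = cluster_id
        return previous

    def resolve(self, label: str) -> str:
        """Return the cluster that new jobs for `label` should run on."""
        return self._label_to_cluster[label]


router = LabelRouter()
router.assign("default", "cluster-old")        # existing cluster
old = router.assign("default", "cluster-new")  # upgraded cluster takes over
print(router.resolve("default"), "replaces", old)
```

New jobs go to the upgraded cluster immediately, while the old one can be left to drain and then shut down.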

Visually browse Historical Jobs

Questions?
joydeep@qubole.com
@jsensarma
www.qubole.com
