1 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Hadoop, Hive, Spark
and Object Stores
Steve Loughran
stevel@hortonworks.com
@steveloughran
November 2016
Steve Loughran,
Hadoop committer, PMC member,
ASF Member
Chris Nauroth,
Apache Hadoop committer & PMC; ASF member
Rajesh Balamohan
Tez Committer, PMC Member
3 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Make Apache Hadoop
at home in the cloud
Step 1: Hadoop runs great on Azure
Step 2: Beat EMR on EC2
4 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
(diagram) Elastic ETL: inbound data lands in an external object store, a transient cluster with HDFS does the ETL, and the results are written back out as ORC datasets.
5 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
(diagram) Notebooks on demand: a notebook library driving Spark in the cloud, working against external ORC and Parquet datasets.
6 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Streaming
7 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
A Filesystem: Directories, Files → Data
  /work/pending/part-00
  /work/pending/part-01
  /work/complete
rename("/work/pending/part-01", "/work/complete")
8 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Object Store: hash(name) -> blob
hash("/work/pending/part-01") -> ["s02", "s03", "s04"]
copy("/work/pending/part-01", "/work/complete/part01")
delete("/work/pending/part-01")
hash("/work/pending/part-00") -> ["s01", "s02", "s04"]
(blobs are scattered across servers s01 to s04; "rename" becomes a copy plus a delete)
9 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
REST APIs
HEAD /work/complete/part-01
PUT /work/complete/part01
x-amz-copy-source: /work/pending/part-01
DELETE /work/pending/part-01
PUT /work/pending/part-01
... DATA ...
GET /work/pending/part-01
Content-Length: 1-8192
GET /?prefix=/work&delimiter=/
10 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Often Eventually Consistent
DELETE /work/pending/part-00 -> 200
HEAD /work/pending/part-00 -> 200
GET /work/pending/part-00 -> 200
(a deleted object can still appear to exist for a while afterwards)
11 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
org.apache.hadoop.fs.FileSystem
hdfs s3a wasb adl swift gs
Same API
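A minimal sketch, not from the deck (the bucket and namenode names here are made up): the scheme in the URI picks the connector, and the calls on the resulting FileSystem are identical.

import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val conf = new Configuration()
val hdfs = FileSystem.get(new URI("hdfs://namenode:8020/"), conf)  // HDFS
val s3a  = FileSystem.get(new URI("s3a://example-bucket/"), conf)  // S3 through s3a
for (fs <- Seq(hdfs, s3a)) {
  val root = fs.getFileStatus(new Path("/"))   // same call against either store
  println(s"${fs.getUri} isDirectory=${root.isDirectory}")
}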
12 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Just a different URL to read
val csvData = spark.read.options(Map(
"header" -> "true",
"inferSchema" -> "true",
"mode" -> "FAILFAST"))
.csv("s3a://landsat-pds/scene_list.gz")
13 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Writing looks the same …
val p = "s3a://hwdev-stevel-demo/landsat"
csvData.write.parquet(p)
val o = "s3a://hwdev-stevel-demo/landsatOrc"
csvData.write.orc(o)
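Reading the output back is just as uniform; a hedged follow-on sketch reusing the p and o paths above:

val parquetDF = spark.read.parquet(p)   // same s3a:// path the Parquet data was written to
val orcDF = spark.read.orc(o)           // same s3a:// path the ORC data was written to
parquetDF.printSchema()
println(s"parquet rows: ${parquetDF.count()}, orc rows: ${orcDF.count()}")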
14 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Hive
CREATE EXTERNAL TABLE `scene`(
`entityid` string,
`acquisitiondate` timestamp,
`cloudcover` double,
`processinglevel` string,
`path` int,
`row_id` int,
`min_lat` double,
`min_long` double,
`max_lat` double,
`max_lon` double,
`download_url` string) ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION 's3a://hwdev-rajesh-new2/scene_list'
TBLPROPERTIES ('skip.header.line.count'='1');
(needed to copy file to R/W object store first)
15 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
> select entityID from scene where cloudCover < 0 limit 10;
+------------------------+--+
| entityid |
+------------------------+--+
| LT81402112015001LGN00 |
| LT81152012015002LGN00 |
| LT81152022015002LGN00 |
| LT81152032015002LGN00 |
| LT81152042015002LGN00 |
| LT81152052015002LGN00 |
| LT81152062015002LGN00 |
| LT81152072015002LGN00 |
| LT81162012015009LGN00 |
| LT81162052015009LGN00 |
+------------------------+--+
16 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Spark Streaming on Azure Storage
val streamc = new StreamingContext(sparkConf, Seconds(10))
val azure = "wasb://demo@example.blob.core.windows.net/in"
val lines = streamc.textFileStream(azure)
val matches = lines.map(line => {
println(line)
line
})
matches.print()
streamc.start()
streamc.awaitTermination()
17 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Where did those object store clients come from?  (timeline: 2006 to 2017?)
s3://    "inode on S3"
s3://    Amazon EMR S3
s3n://   "Native S3"
s3a://   replaces s3n
swift:// OpenStack
wasb://  Azure WASB
oss://   Aliyun
gs://    Google Cloud
adl://   Azure Data Lake
Phase I: Stabilize.  Phase II: Speed & Scale.  Phase III: Speed & Consistency.
18 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Problem: S3 work is too slow
1. Analyze benchmarks and bug-reports
2. Fix Read path
3. Fix Write path
4. Improve query partitioning
5. The Commitment Problem
(flamegraph) LLAP (single node) on AWS, TPC-DS queries at 200 GB scale; the labelled S3 hotspots are getFileStatus(), read() and readFully(pos)
20 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
The Performance Killers
getFileStatus(Path) (+ isDirectory(), exists())
HEAD path // file?
HEAD path + "/" // empty directory?
LIST path // path with children?
read(long pos, byte[] b, int idx, int len)
readFully(long pos, byte[] b, int idx, int len)
21 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Positioned reads: close + GET, close + GET
public int read(long pos, byte[] b, int idx, int len)
    throws IOException {
long oldPos = getPos();
int nread = -1;
try {
seek(pos);
nread = read(b, idx, len);
} catch (EOFException e) {
} finally {
seek(oldPos);
}
return nread;
}
seek() is the killer, especially the seek() back
22 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
HADOOP-12444 Support lazy seek in S3AInputStream
public synchronized void seek(long targetPos)
throws IOException {
nextReadPos = targetPos;
}
+ configurable readahead range before forcing a stream close() and re-open
<property>
<name>fs.s3a.readahead.range</name>
<value>256K</value>
</property>
But: ORC reads were still underperforming
23 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
HADOOP-13203: fs.s3a.experimental.input.fadvise
// Before
GetObjectRequest req = new GetObjectRequest(bucket, key)
.withRange(pos, contentLength - 1);
// after
finish = calculateRequestLimit(inputPolicy, pos,
length, contentLength, readahead);
GetObjectRequest req = new GetObjectRequest(bucket, key)
.withRange(pos, finish);
bad for full file reads
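Not the actual S3A code, just a toy sketch of the idea behind calculateRequestLimit: under the random-IO policy the GET range is capped near the bytes actually requested (plus readahead), while the default sequential policy still reads through to the end of the object.

// illustrative only: bound the range of a GET according to the input policy
def requestLimit(policy: String, pos: Long, wanted: Long,
                 contentLength: Long, readahead: Long): Long = policy match {
  case "random" => math.min(contentLength, pos + math.max(wanted, readahead))  // short, ranged GETs
  case _        => contentLength                                               // sequential: read to EOF
}
// a 64 KB ORC footer read near the end of a 1 GB object stays a small request under "random",
// but would ask for everything up to EOF under the sequential policy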
24 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Every HTTP request is precious
⬢ HADOOP-13162: Reduce number of getFileStatus calls in mkdirs()
⬢ HADOOP-13164: Optimize deleteUnnecessaryFakeDirectories()
⬢ HADOOP-13406: Consider reusing filestatus in delete() and mkdirs()
⬢ HADOOP-13145: DistCp to skip getFileStatus when not preserving metadata
⬢ HADOOP-13208: listFiles(recursive=true) to do a bulk listObjects
see HADOOP-11694
27 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
benchmarks !=
your queries
your data
…but we think we've made a good start
28 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Hive-TestBench Benchmark shows average 2.5x speedup
⬢ TPC-DS @ 200 GB Scale in S3 (https://github.com/hortonworks/hive-testbench)
⬢ m4.4xlarge, 5 nodes
⬢ “HDP 2.3 + S3 in cloud” vs “HDP 2.4 + enhancements + S3 in cloud”
⬢ Queries like 15, 17, 25, 73, 75 etc. did not run in HDP 2.3 (AWS timeouts)
29 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
And EMR? average 2.8x, in our TPC-DS benchmarks
*Queries 40, 50, 60, 67, 72, 75, 76, 79 etc. do not complete in EMR.
30 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
What about Spark?
object store work applies
needs tuning
SPARK-7481 patch handles JARs
31 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Spark 1.6/2.0 Classpath running with Hadoop 2.7
hadoop-aws-2.7.x.jar
hadoop-azure-2.7.x.jar
aws-java-sdk-1.7.4.jar
joda-time-2.9.3.jar
azure-storage-2.2.0.jar
32 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
spark-defaults.conf
spark.sql.parquet.filterPushdown true
spark.sql.parquet.mergeSchema false
spark.hadoop.parquet.enable.summary-metadata false
spark.sql.orc.filterPushdown true
spark.sql.orc.splits.include.file.footer true
spark.sql.orc.cache.stripe.details.size 10000
spark.sql.hive.metastorePartitionPruning true
spark.hadoop.fs.s3a.readahead.range 157810688
spark.hadoop.fs.s3a.experimental.input.fadvise random
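The same tuning can be applied per job rather than cluster-wide; a hedged sketch for Spark 2.x (values copied from the settings above, spark.hadoop.* being the usual prefix for passing options down to the Hadoop configuration):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("s3a-tuned")
  .config("spark.sql.orc.filterPushdown", "true")
  .config("spark.sql.hive.metastorePartitionPruning", "true")
  .config("spark.hadoop.fs.s3a.readahead.range", "157810688")
  .config("spark.hadoop.fs.s3a.experimental.input.fadvise", "random")
  .getOrCreate()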
33 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
The Commitment Problem
⬢ rename() used for atomic commitment transaction
⬢ Time to copy() + delete() proportional to data * files
⬢ S3: 6+ MB/s
⬢ Azure: a lot faster —usually
spark.speculation false
spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 2
spark.hadoop.mapreduce.fileoutputcommitter.cleanup.skipped true
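To put that copy rate in perspective: at roughly 6 MB/s, renaming 10 GB of task output means about half an hour of server-side copying before the deletes even start.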
34 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
What about Direct Output Committers?
35 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
s3guard:
fast, consistent S3 metadata
36 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
DynamoDB becomes the consistent metadata store
DELETE part-00 -> 200
HEAD part-00 -> 200
HEAD part-00 -> 404
PUT part-00 -> 200
37 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
How do I get hold of these
features?
• Read improvements in HDP 2.5
• Read + Write in Hortonworks Data Cloud
• Read + Write in Apache Hadoop 2.8 (soon!)
• s3Guard: No timetable
38 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
You can make your own code work better here too!
😢Reduce getFileStatus(), exists(), isDir(), isFile() calls
😢Avoid globStatus()
😢Reduce listStatus() & listFiles() calls
😭Really avoid rename()
😀Prefer forward seek
😀Prefer listFiles(path, recursive=true) (see the sketch below)
😀list/delete/rename in separate threads
😀test against object stores
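A small sketch of the listing tip, using nothing beyond the Hadoop FileSystem API: one recursive listFiles() call, which s3a can serve as a bulk object listing (HADOOP-13208), instead of a hand-rolled treewalk of per-directory calls.

import org.apache.hadoop.fs.{FileSystem, LocatedFileStatus, Path, RemoteIterator}

// sum the size of every file under a prefix with a single recursive listing
def totalBytes(fs: FileSystem, dir: Path): Long = {
  val files: RemoteIterator[LocatedFileStatus] = fs.listFiles(dir, true)  // recursive=true
  var sum = 0L
  while (files.hasNext) {
    sum += files.next().getLen
  }
  sum
}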
39 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Questions?
40 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Backup Slides
41 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Write Pipeline
⬢ PUT blocks as parts of a multipart upload as soon as the block size is reached
⬢ Parallel uploads during data creation
⬢ Buffer to disk (default), heap or byte buffers
⬢ Great for distcp
fs.s3a.fast.upload=true
fs.s3a.multipart.size=16M
fs.s3a.fast.upload.active.blocks=8
// tip:
fs.s3a.block.size=${fs.s3a.multipart.size}
42 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Parallel rename (Work in Progress)
⬢ Goal: faster commit by rename
⬢ Parallel threads to perform the COPY operation
⬢ listFiles(path, true).sort().parallelize(copy)
⬢ Time drops from sum(data)/copy-bandwidth to nearer size(largest-file)/copy-bandwidth
⬢ Thread pool size will limit parallelism
⬢ Best speedup with a few large files rather than many small ones
⬢ wasb expected to stay faster & has leases for atomic commits

Editor's Notes

  • #3: Now people may be saying "hang on, these aren't Spark developers". Well, I do have some integration patches for Spark, but a lot of the integration problems are actually lower down: filesystem connectors, ORC performance, the Hive metastore. Rajesh has been doing lots of scale runs and profiling, initially for Hive/Tez, now looking at Spark, including some of the Parquet problems. Chris has done work on HDFS, Azure WASB and most recently S3A. Me? Co-author of the Swift connector, author of the Hadoop FS spec and general mentor of the S3A work, even when not actively working on it. Been full time on S3A, using Spark as the integration test suite, since March.
  • #4: Simple goal. Make ASF Hadoop at home in cloud infra. It's always been a bit of a mixed bag, and there's a lot around agility we need to address: things fail differently. Step 1: Azure. That's the work with Microsoft on wasb://; you can use Azure as a drop-in replacement for HDFS in Azure. Step 2: EMR. More specifically, have the ASF Hadoop codebase get higher numbers than EMR.
  • #5: This is one of the simplest deployments in cloud: scheduled/dynamic ETL. Incoming data sources saving to an object store; spark cluster brought up for ETL. Either direct cleanup/filter or multistep operations, but either way: an ETL pipeline. HDFS on the VMs for transient storage, the object store used as the destination for data —now in a more efficient format such as ORC or Parquet
  • #6: Notebooks on demand; a notebook talks to Spark in the cloud, which then does the work against external and internal data. Your notebook itself can be saved to the object store, for persistence and sharing.
  • #7: Example: streaming on Azure
  • #12: Everything uses the Hadoop APIs to talk to HDFS, Hadoop-compatible filesystems and object stores: the Hadoop FS API. There are actually two: the one with a clean split between client side and "driver side", and the older one which is a direct connect. Most use the latter, and in terms of opportunities for object store integration tweaking it is the one where we can innovate most easily. That is: there's nothing in the way. Under the FS API go filesystems and object stores. HDFS is a "real" filesystem; WASB/Azure is close enough. What is "real"? Best test: can it support HBase.
  • #14: you used to have to disable summary data in the spark context Hadoop options, but https://issues.apache.org/jira/browse/SPARK-15719 fixed that for you
  • #18: This is the history
  • #20: Here's a flamegraph of LLAP (single node) with AWS+HDC for a set of TPC-DS queries at 200 GB scale; we should stick this up online. Only about 2% of time (optimised code) is doing S3 IO. Something at the start is partitioning data.
  • #22: Why so much seeking? It's the default implementation of read
  • #24: A big killer turned out to be the fact that if we had to break and re-open the connection, on a large file this would be done by closing the TCP connection and opening a new one. The fix: ask for data in smaller blocks; the max of (requested-length, min-request-len). Result: significantly lower cost back-in-file seeking and in very-long-distance forward seeks, at the expense of an increased cost in end-to-reads of a file (gzip, csv). It's an experimental option for this reason; I think I'd like to make it an API call that libs like parquet & orc can explicitly request on their IO: it should apply to all blobstores
  • #25: If you look at what we've done, much of it (credit to Rajesh & Chris) is just minimising HTTP requests. Each one can take hundreds of millis, sometimes even seconds due to load balancer issues (tip: reduce DNS TTL on your clients to <30s). A lot of the work internal to S3A was culling those getFileStatus() calls by (a) caching results, (b) not bothering to see if they are needed. Example: cheaper to issue a DELETE listing all parent paths than actually looking to see if they exist, wait for the response, and then delete them. The one at the end, HADOOP-13208, replaces a slow recursive tree walk (many status, many list) with a flat listing of all objects in a tree. This works only for the listFiles(path, true) call —to benefit you need to use that API call, not do your own treewalk.
  • #28: Don't run off saying "hey, 2x speedup". I'm confident we got HDP faster, EMR is still something we'd need to look at more. Data layout is still a major problem here; I think we are still understanding the implications of sharding and throttling. What we do know is that deep/shallow trees are pathological for recursive treewalks, and they end up storing data in the same s3 nodes, so throttling adjacent requests.
  • #29: Disclaimer: benchmarks, etc. Data was 200GB TPC-DS stored in S3, workload against same cluster.
  • #30: And the result. Yes, currently we are faster in these benchmarks. Does that match to the outside world? If you use ORC & HIve, you will gain from the work we've done. There are still things which are pathologically bad, especially deep directory trees with few files
  • #35: This invariably ends up reaching us on JIRA, to the extent I've got a document somewhere explaining the problem in detail. It was taken away because it can corrupt your data, without you noticing. This is generally considered harmful.
  • #36: see: Windows Azure Storage: A Highly Available Cloud Storage Service with Strong Consistency for details, essentially it has the semantics HBase needs, that being our real compatibility test.