1 © Hortonworks Inc. 2011 – 2017 All Rights Reserved
Apache Spark and Object Stores
—What you need to know
Steve Loughran
stevel@hortonworks.com
@steveloughran
February 2017
3 © Hortonworks Inc. 2011 – 2017 All Rights Reserved
ORC, Parquet
datasets
inbound
Elastic ETL
HDFS
external
4 © Hortonworks Inc. 2011 – 2017 All Rights Reserved
datasets
external
Notebooks
library
5 © Hortonworks Inc. 2011 – 2017 All Rights Reserved
Streaming
6 © Hortonworks Inc. 2011 – 2017 All Rights Reserved
A Filesystem: Directories, Files → Data
[diagram: directory tree / → work → {pending, complete}; /work/pending holds part-00 and part-01, each stored as data blocks]
rename("/work/pending/part-01", "/work/complete")
7 © Hortonworks Inc. 2011 – 2017 All Rights Reserved
Object Store: hash(name)->blob
[diagram: blobs sharded across servers s01–s04]
hash("/work/pending/part-01") -> ["s02", "s03", "s04"]
copy("/work/pending/part-01", "/work/complete/part01")
delete("/work/pending/part-01")
hash("/work/pending/part-00") -> ["s01", "s02", "s04"]
8 © Hortonworks Inc. 2011 – 2017 All Rights Reserved
REST APIs
[diagram: blobs sharded across servers s01–s04]
HEAD /work/complete/part-01
PUT /work/complete/part01
x-amz-copy-source: /work/pending/part-01
DELETE /work/pending/part-01
PUT /work/pending/part-01
... DATA ...
GET /work/pending/part-01
Content-Length: 1-8192
GET /?prefix=/work&delimiter=/
9 © Hortonworks Inc. 2011 – 2017 All Rights Reserved
Often: Eventually Consistent
[diagram: blobs sharded across servers s01–s04]
DELETE /work/pending/part-00 -> 200
GET /work/pending/part-00 -> 200
GET /work/pending/part-00 -> 200
10 © Hortonworks Inc. 2011 – 2017 All Rights Reserved
org.apache.hadoop.fs.FileSystem
hdfs s3a wasb adl swift gs
13 © Hortonworks Inc. 2011 – 2017 All Rights Reserved
Four Challenges
1. Classpath
2. Credentials
3. Code
4. Commitment
14 © Hortonworks Inc. 2011 – 2017 All Rights Reserved
Use S3A to work with S3
(EMR: use Amazon's s3://)
15 © Hortonworks Inc. 2011 – 2017 All Rights Reserved
Classpath: fix “No FileSystem for scheme: s3a”
hadoop-aws-2.7.x.jar
aws-java-sdk-1.7.4.jar
joda-time-2.9.3.jar
(jackson-*-2.6.5.jar)
See SPARK-7481
Get Spark with
Hadoop 2.7+ JARs
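A minimal sketch of wiring these in at submit time (JAR paths, versions and the job class are illustrative; versions must match the Hadoop build Spark was compiled against):

spark-submit \
  --jars hadoop-aws-2.7.3.jar,aws-java-sdk-1.7.4.jar,joda-time-2.9.3.jar \
  --class com.example.MyJob \
  myjob.jar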
16 © Hortonworks Inc. 2011 – 2017 All Rights Reserved
Credentials
core-site.xml or spark-defaults.conf
spark.hadoop.fs.s3a.access.key MY_ACCESS_KEY
spark.hadoop.fs.s3a.secret.key MY_SECRET_KEY
spark-submit propagates Environment Variables
export AWS_ACCESS_KEY=MY_ACCESS_KEY
export AWS_SECRET_KEY=MY_SECRET_KEY
NEVER: share, check in to SCM, paste in bug reports…
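The keys can also be set programmatically on the live context, which keeps them out of files that get checked in (they can still leak via logs and UIs); a sketch with placeholder values:

spark.sparkContext.hadoopConfiguration.set("fs.s3a.access.key", "MY_ACCESS_KEY")
spark.sparkContext.hadoopConfiguration.set("fs.s3a.secret.key", "MY_SECRET_KEY")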
17 © Hortonworks Inc. 2011 – 2017 All Rights Reserved
Authentication Failure: 403
com.amazonaws.services.s3.model.AmazonS3Exception:
The request signature we calculated does not match
the signature you provided.
Check your key and signing method.
1. Check joda-time.jar & JVM version
2. Credentials wrong
3. Credentials not propagating
4. Local system clock (more likely on VMs)
19 © Hortonworks Inc. 2011 – 2017 All Rights Reserved
Code: just use the URL of the object store
val csvData = spark.read.options(Map(
"header" -> "true",
"inferSchema" -> "true",
"mode" -> "FAILFAST"))
.csv("s3a://landsat-pds/scene_list.gz")
...read time O(distance)
20 © Hortonworks Inc. 2011 – 2017 All Rights Reserved
DataFrames
val landsat = "s3a://stevel-demo/landsat"
csvData.write.parquet(landsat)
val landsatOrc = "s3a://stevel-demo/landsatOrc"
csvData.write.orc(landsatOrc)
val df = spark.read.parquet(landsat)
val orcDf = spark.read.orc(landsatOrc)
…list inconsistency
...commit time O(data)
21 © Hortonworks Inc. 2011 – 2017 All Rights Reserved
Finding dirty data with Spark SQL
val sqlDF = spark.sql(
"SELECT id, acquisitionDate, cloudCover"
+ s" FROM parquet.`${landsat}`")
val negativeClouds = sqlDF.filter("cloudCover < 0")
negativeClouds.show()
* filter columns and data early
* whether/when to cache()?
* copy popular data to HDFS
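A sketch of the "filter early, then keep a local copy" pattern, reusing the landsat path from above (the hdfs:// path is a placeholder):

val pruned = spark.read.parquet(landsat)
  .select("id", "acquisitionDate", "cloudCover") // only the columns needed
  .filter("cloudCover >= 0")                     // drop dirty rows before they spread
pruned.write.parquet("hdfs://namenode:8020/cache/landsat")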
22 © Hortonworks Inc. 2011 – 2017 All Rights Reserved
spark-defaults.conf
spark.sql.parquet.filterPushdown true
spark.sql.parquet.mergeSchema false
spark.hadoop.parquet.enable.summary-metadata false
spark.sql.orc.filterPushdown true
spark.sql.orc.splits.include.file.footer true
spark.sql.orc.cache.stripe.details.size 10000
spark.sql.hive.metastorePartitionPruning true
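The same options can be set in code when the session is built; a minimal sketch (the app name is arbitrary):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("object-store-io")
  .config("spark.sql.parquet.filterPushdown", "true")
  .config("spark.sql.hive.metastorePartitionPruning", "true")
  .getOrCreate()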
23 © Hortonworks Inc. 2011 – 2017 All Rights Reserved
Recent S3A Performance (Hadoop 2.8, HDP 2.5, CDH 6 (?))
// forward seek by skipping stream
spark.hadoop.fs.s3a.readahead.range 256K
// faster backward seek for ORC and Parquet input
spark.hadoop.fs.s3a.experimental.input.fadvise random
// PUT blocks in separate threads
spark.hadoop.fs.s3a.fast.output.enabled true
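These are Hadoop options, so they can also be applied in code; a sketch (note that filesystem instances are cached, so set them before the first s3a:// access):

val hc = spark.sparkContext.hadoopConfiguration
hc.set("fs.s3a.readahead.range", "256K")
hc.set("fs.s3a.experimental.input.fadvise", "random")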
24 © Hortonworks Inc. 2011 – 2017 All Rights Reserved
The Commitment Problem
⬢ rename() used for atomic commitment transaction
⬢ time to copy() + delete() proportional to data * files
⬢ S3: 6+ MB/s
⬢ Azure: a lot faster —usually
spark.speculation false
spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 2
spark.hadoop.mapreduce.fileoutputcommitter.cleanup.skipped true
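For scale: at 6 MB/s, committing 10 GB of output by rename means copying roughly 10,240 MB — about 28 minutes of copy time before the deletes even begin.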
25 © Hortonworks Inc. 2011 – 2017 All Rights Reserved
The "Direct Committer"?
27 © Hortonworks Inc. 2011 – 2017 All Rights Reserved
Azure Storage: wasb://
A full substitute for HDFS
28 © Hortonworks Inc. 2011 – 2017 All Rights Reserved
Classpath: fix “No FileSystem for scheme: wasb”
wasb:// : Consistent, with very fast rename (hence: commits)
hadoop-azure-2.7.x.jar
azure-storage-2.2.0.jar
+ (jackson-core; http-components, hadoop-common)
29 © Hortonworks Inc. 2011 – 2017 All Rights Reserved
Credentials: core-site.xml / spark-defaults.conf
<property>
<name>fs.azure.account.key.example.blob.core.windows.net</name>
<value>0c0d44ac83ad7f94b0997b36e6e9a25b49a1394c</value>
</property>
spark.hadoop.fs.azure.account.key.example.blob.core.windows.net
0c0d44ac83ad7f94b0997b36e6e9a25b49a1394c
wasb://demo@example.blob.core.windows.net
30 © Hortonworks Inc. 2011 – 2017 All Rights Reserved
Example: Azure Storage and Streaming
val streaming = new StreamingContext(sparkConf, Seconds(10))
val azure = "wasb://demo@example.blob.core.windows.net/in"
val lines = streaming.textFileStream(azure)
val matches = lines.map(line => {
println(line)
line
})
matches.print()
streaming.start()
* PUT into the streaming directory
* keep the dir clean
* size window for slow scans
* checkpoints slow —reduce frequency
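On the checkpoint cost, one mitigation sketch (the path is a placeholder): keep checkpointing on, but stretch the interval on the DStream, set before streaming.start():

streaming.checkpoint("wasb://demo@example.blob.core.windows.net/checkpoints")
lines.checkpoint(Seconds(60)) // checkpoint far less often than the 10s batch interval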
31 © Hortonworks Inc. 2011 – 2017 All Rights Reserved
s3guard:
fast, consistent S3 metadata
HADOOP-13445
Hortonworks + Cloudera + Western Digital
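The feature was still under development at the time, so treat this as an illustrative sketch of switching it on, not final property names:

spark.hadoop.fs.s3a.metadatastore.impl org.apache.hadoop.fs.s3a.s3guard.DynamoDBMetadataStore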
32 © Hortonworks Inc. 2011 – 2017 All Rights Reserved
DynamoDB as consistent metadata store
[diagram: S3 blobs across servers s01–s04, with a DynamoDB table holding the metadata]
DELETE part-00 -> 200
HEAD part-00 -> 200
HEAD part-00 -> 404
PUT part-00 -> 200
33 © Hortonworks Inc. 2011 – 2017 All Rights Reserved
34 © Hortonworks Inc. 2011 – 2017 All Rights Reserved
Summary
⬢ Object Stores look just like any other Filesystem URL
⬢ …but do need classpath and configuration
⬢ Issues: performance, commitment
⬢ Tune to reduce I/O
⬢ Keep those credentials secret!
Finally: keep an eye out for s3guard!
35 © Hortonworks Inc. 2011 – 2017 All Rights Reserved
Questions?
36 © Hortonworks Inc. 2011 – 2017 All Rights Reserved
Backup Slides
37 © Hortonworks Inc. 2011 – 2017 All Rights Reserved
Not Covered
⬢ Partitioning/directory layout
⬢ Infrastructure Throttling
⬢ Optimal path names
⬢ Error handling
⬢ Metrics
38 © Hortonworks Inc. 2011 – 2017 All Rights Reserved
Dependencies in Hadoop 2.8
hadoop-aws-2.8.x.jar
aws-java-sdk-core-1.10.6.jar
aws-java-sdk-kms-1.10.6.jar
aws-java-sdk-s3-1.10.6.jar
joda-time-2.9.3.jar
(jackson-*-2.6.5.jar)
hadoop-azure-2.8.x.jar
azure-storage-4.2.0.jar
39 © Hortonworks Inc. 2011 – 2017 All Rights Reserved
S3 Server-Side Encryption
⬢ Encryption of data at rest in S3
⬢ Supports the SSE-S3 option: each object encrypted with a unique key
using the AES-256 cipher
⬢ Now covered in S3A automated test suites
⬢ Support for additional options under development (SSE-KMS and SSE-C)
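Enabling the SSE-S3 option is a one-line setting; a minimal sketch using the S3A property:

spark.hadoop.fs.s3a.server-side-encryption-algorithm AES256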
40 © Hortonworks Inc. 2011 – 2017 All Rights Reserved
Advanced authentication
<property>
<name>fs.s3a.aws.credentials.provider</name>
<value>
org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider,
org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider,
com.amazonaws.auth.EnvironmentVariableCredentialsProvider,
com.amazonaws.auth.InstanceProfileCredentialsProvider,
org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider
</value>
</property>
+ encrypted credentials in JCEKS files on HDFS
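A sketch of creating such a file with the Hadoop credential CLI and pointing the configuration at it (host and paths are placeholders; the command prompts for each secret):

hadoop credential create fs.s3a.access.key -provider jceks://hdfs@namenode/user/alice/s3.jceks
hadoop credential create fs.s3a.secret.key -provider jceks://hdfs@namenode/user/alice/s3.jceks

spark.hadoop.hadoop.security.credential.provider.path jceks://hdfs@namenode/user/alice/s3.jceks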
41 © Hortonworks Inc. 2011 – 2017 All Rights Reserved
Editor's Notes

  • #3: Now people may be saying "hang on, these aren't Spark developers". Well, I do have some integration patches for Spark, but a lot of the integration problems are actually lower down: filesystem connectors, ORC performance, the Hive metastore. Rajesh has been doing lots of scale runs and profiling, initially for Hive/Tez, now looking at Spark, including some of the Parquet problems. Chris has done work on HDFS, Azure WASB and most recently S3A. Me? Co-author of the Swift connector, author of the Hadoop FS spec, and general mentor of the S3A work, even when not actively working on it. Been full time on S3A, using Spark as the integration test suite, since March.
  • #4: This is one of the simplest deployments in cloud: scheduled/dynamic ETL. Incoming data sources save to an object store; a Spark cluster is brought up for ETL. Either direct cleanup/filter or multistep operations, but either way: an ETL pipeline. HDFS on the VMs is used for transient storage, the object store as the destination for data —now in a more efficient format such as ORC or Parquet.
  • #5: Notebooks on demand; a notebook talks to Spark in the cloud, which then does the work against external and internal data. Your notebook itself can be saved to the object store, for persistence and sharing.
  • #6: Example: streaming on Azure
  • #11: Everything uses the Hadoop APIs to talk to HDFS, Hadoop-compatible filesystems and object stores: the Hadoop FS API. There are actually two: the one with a clean split between client side and "driver side", and the older one which is a direct connect. Most use the latter, and in terms of opportunities for object store integration tweaking, it is actually the one where we can innovate most easily. That is: there's nothing in the way. Under the FS API go filesystems and object stores. HDFS is a "real" filesystem; WASB/Azure is close enough. What is "real"? Best test: can it support HBase?
  • #12: This is the history
  • #21: You used to have to disable summary data in the Spark context Hadoop options, but https://issues.apache.org/jira/browse/SPARK-15719 fixed that for you.
  • #22: It looks the same; you just need to be as aggressive about minimising IO as you can: push down predicates, only select the columns you want, filter, and if you read a lot, write to HDFS then re-use. cache()? I don't know. If you do, filter as much as you can first (columns, predicates, ranges) so that Parquet/ORC can read as little as it needs to, and RAM use is least.
  • #24: Without going into the details, here are things you will want for Hadoop 2.8. They are in HDP 2.5, possibly in the next CDH release. The first two boost input by reducing the cost of seeking, which is expensive as it breaks then re-opens the HTTPS connection. Readahead means that hundreds of KB can be skipped before that reconnect (yes, it can take that long). The experimental fadvise random feature speeds up backward reads at the expense of pure-forward file reads. It is significantly faster for reading in optimized binary formats like ORC and Parquet. The last one is a successor to fast upload in Hadoop 2.7. That buffers on heap and needs careful tuning; its memory needs conflict with RDD caching. The new version defaults to buffering as files on local disk, so it won't run out of memory. It offers the potential of significantly more effective use of bandwidth; the resulting partitioned files may also offer higher read perf. (No data there, just hearsay.)
  • #26: This invariably ends up reaching us on JIRA, to the extent that I've got a document somewhere explaining the problem in detail. It was taken away because it can corrupt your data, without you noticing. This is generally considered harmful.
  • #27: If your distributor didn't stick the JARs in, you can add the hadoop-aws and hadoop-azure dependencies in the interpreter config. Credentials: keep them out of notebooks. Zeppelin can list its settings too; always dangerous (mind you, so do HDFS and YARN, so an XInclude is handy there). When running in EC2, S3 credentials are now automatically picked up. And, if Zeppelin is launched with the AWS env vars set, its invocation of spark-submit should pass them down.
  • #28: See "Windows Azure Storage: A Highly Available Cloud Storage Service with Strong Consistency" for details; essentially it has the semantics HBase needs, that being our real compatibility test.
  • #29–30: Azure storage is unique in that there's a published paper (+ video) on its internals. Well worth looking at to understand what's going on. In contrast, if you want to know S3 internals, well, you can ply the original author with gin and he still won't reveal anything. ADL adds okhttp for HTTP/2 performance; yet another JSON parser for unknown reasons.
  • #41: Hadoop 2.8 adds a lot of control here (credit: Netflix, + later us & Cloudera). You can define a list of credential providers to use; the default is simple, env, instance, but you can add temporary and anonymous, choose which are unsupported, etc. Passwords/secrets can be encrypted in Hadoop credential files stored locally or in HDFS. IAM auth is what EC2 VMs need.
  • #42: And this is the big one, as it spans the lot: Hadoop's own code (so far: distcp), Spark, Hive, Flink, related tooling. If we can't speed up the object stores, we can tune the apps.