1 © Hortonworks Inc. 2011 – 2017 All Rights Reserved
Apache Spark and Object Stores
—What you need to know
Steve Loughran
stevel@hortonworks.com
@steveloughran
February 2017
3 © Hortonworks Inc. 2011 – 2017 All Rights Reserved
ORC, Parquet
datasets
inbound
Elastic ETL
HDFS
external
4 © Hortonworks Inc. 2011 – 2017 All Rights Reserved
datasets
external
Notebooks
library
5 © Hortonworks Inc. 2011 – 2017 All Rights Reserved
Streaming
6 © Hortonworks Inc. 2011 – 2017 All Rights Reserved
A Filesystem: Directories, Files → Data
[diagram: directory tree / → work → {pending, complete}; /work/pending holds part-00 and part-01, each stored as data blocks]
rename("/work/pending/part-01", "/work/complete")
7 © Hortonworks Inc. 2011 – 2017 All Rights Reserved
Object Store: hash(name)->blob
[diagram: blobs sharded across servers s01–s04]
hash("/work/pending/part-01") -> ["s02", "s03", "s04"]
copy("/work/pending/part-01", "/work/complete/part01")
delete("/work/pending/part-01")
hash("/work/pending/part-00") -> ["s01", "s02", "s04"]
8 © Hortonworks Inc. 2011 – 2017 All Rights Reserved
REST APIs
[diagram: blobs sharded across servers s01–s04]
HEAD /work/complete/part-01
PUT /work/complete/part01
x-amz-copy-source: /work/pending/part-01
DELETE /work/pending/part-01
PUT /work/pending/part-01
... DATA ...
GET /work/pending/part-01
Content-Length: 1-8192
GET /?prefix=/work&delimiter=/
9 © Hortonworks Inc. 2011 – 2017 All Rights Reserved
Often: Eventually Consistent
[diagram: blobs sharded across servers s01–s04]
DELETE /work/pending/part-00 -> 200
GET /work/pending/part-00 -> 200
GET /work/pending/part-00 -> 200
10 © Hortonworks Inc. 2011 – 2017 All Rights Reserved
org.apache.hadoop.fs.FileSystem
hdfs s3a wasb adl swift gs
13 © Hortonworks Inc. 2011 – 2017 All Rights Reserved
Four Challenges
1. Classpath
2. Credentials
3. Code
4. Commitment
14 © Hortonworks Inc. 2011 – 2017 All Rights Reserved
Use S3A to work with S3
(EMR: use Amazon's s3://)
15 © Hortonworks Inc. 2011 – 2017 All Rights Reserved
Classpath: fix “No FileSystem for scheme: s3a”
hadoop-aws-2.7.x.jar
aws-java-sdk-1.7.4.jar
joda-time-2.9.3.jar
(jackson-*-2.6.5.jar)
See SPARK-7481
Get Spark with
Hadoop 2.7+ JARs
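A minimal sketch of wiring these in at submit time (JAR paths, versions and the job class are illustrative; versions must match the Hadoop build Spark was compiled against):

spark-submit \
  --jars hadoop-aws-2.7.3.jar,aws-java-sdk-1.7.4.jar,joda-time-2.9.3.jar \
  --class com.example.MyJob \
  myjob.jar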
16 © Hortonworks Inc. 2011 – 2017 All Rights Reserved
Credentials
core-site.xml or spark-defaults.conf
spark.hadoop.fs.s3a.access.key MY_ACCESS_KEY
spark.hadoop.fs.s3a.secret.key MY_SECRET_KEY
spark-submit propagates Environment Variables
export AWS_ACCESS_KEY=MY_ACCESS_KEY
export AWS_SECRET_KEY=MY_SECRET_KEY
NEVER: share, check in to SCM, paste in bug reports…
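The keys can also be set programmatically on the live context, which keeps them out of files that get checked in (they can still leak via logs and UIs); a sketch with placeholder values:

spark.sparkContext.hadoopConfiguration.set("fs.s3a.access.key", "MY_ACCESS_KEY")
spark.sparkContext.hadoopConfiguration.set("fs.s3a.secret.key", "MY_SECRET_KEY")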
17 © Hortonworks Inc. 2011 – 2017 All Rights Reserved
Authentication Failure: 403
com.amazonaws.services.s3.model.AmazonS3Exception:
The request signature we calculated does not match
the signature you provided.
Check your key and signing method.
1. Check joda-time.jar & JVM version
2. Credentials wrong
3. Credentials not propagating
4. Local system clock (more likely on VMs)
19 © Hortonworks Inc. 2011 – 2017 All Rights Reserved
Code: just use the URL of the object store
val csvData = spark.read.options(Map(
"header" -> "true",
"inferSchema" -> "true",
"mode" -> "FAILFAST"))
.csv("s3a://landsat-pds/scene_list.gz")
...read time O(distance)
20 © Hortonworks Inc. 2011 – 2017 All Rights Reserved
DataFrames
val landsat = "s3a://stevel-demo/landsat"
csvData.write.parquet(landsat)
val landsatOrc = "s3a://stevel-demo/landsatOrc"
csvData.write.orc(landsatOrc)
val df = spark.read.parquet(landsat)
val orcDf = spark.read.orc(landsatOrc)
…list inconsistency
...commit time O(data)
21 © Hortonworks Inc. 2011 – 2017 All Rights Reserved
Finding dirty data with Spark SQL
val sqlDF = spark.sql(
"SELECT id, acquisitionDate, cloudCover"
+ s" FROM parquet.`${landsat}`")
val negativeClouds = sqlDF.filter("cloudCover < 0")
negativeClouds.show()
* filter columns and data early
* whether/when to cache()?
* copy popular data to HDFS
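A sketch of the "filter early, then keep a local copy" pattern, reusing the landsat path from above (the hdfs:// path is a placeholder):

val pruned = spark.read.parquet(landsat)
  .select("id", "acquisitionDate", "cloudCover") // only the columns needed
  .filter("cloudCover >= 0")                     // drop dirty rows before they spread
pruned.write.parquet("hdfs://namenode:8020/cache/landsat")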
22 © Hortonworks Inc. 2011 – 2017 All Rights Reserved
spark-defaults.conf
spark.sql.parquet.filterPushdown true
spark.sql.parquet.mergeSchema false
spark.hadoop.parquet.enable.summary-metadata false
spark.sql.orc.filterPushdown true
spark.sql.orc.splits.include.file.footer true
spark.sql.orc.cache.stripe.details.size 10000
spark.sql.hive.metastorePartitionPruning true
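The same options can be set in code when the session is built; a minimal sketch (the app name is arbitrary):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("object-store-io")
  .config("spark.sql.parquet.filterPushdown", "true")
  .config("spark.sql.hive.metastorePartitionPruning", "true")
  .getOrCreate()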
23 © Hortonworks Inc. 2011 – 2017 All Rights Reserved
Recent S3A Performance (Hadoop 2.8, HDP 2.5, CDH 6 (?))
// forward seek by skipping stream
spark.hadoop.fs.s3a.readahead.range 256K
// faster backward seek for ORC and Parquet input
spark.hadoop.fs.s3a.experimental.input.fadvise random
// PUT blocks in separate threads
spark.hadoop.fs.s3a.fast.output.enabled true
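These are Hadoop options, so they can also be applied in code; a sketch (note that filesystem instances are cached, so set them before the first s3a:// access):

val hc = spark.sparkContext.hadoopConfiguration
hc.set("fs.s3a.readahead.range", "256K")
hc.set("fs.s3a.experimental.input.fadvise", "random")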
24 © Hortonworks Inc. 2011 – 2017 All Rights Reserved
The Commitment Problem
⬢ rename() used for atomic commitment transaction
⬢ time to copy() + delete() proportional to data * files
⬢ S3: 6+ MB/s
⬢ Azure: a lot faster —usually
spark.speculation false
spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 2
spark.hadoop.mapreduce.fileoutputcommitter.cleanup.skipped true
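For scale: at 6 MB/s, committing 10 GB of output by rename means copying roughly 10,240 MB — about 28 minutes of copy time before the deletes even begin.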
25 © Hortonworks Inc. 2011 – 2017 All Rights Reserved
The "Direct Committer"?
27 © Hortonworks Inc. 2011 – 2017 All Rights Reserved
Azure Storage: wasb://
A full substitute for HDFS
28 © Hortonworks Inc. 2011 – 2017 All Rights Reserved
Classpath: fix “No FileSystem for scheme: wasb”
wasb:// : Consistent, with very fast rename (hence: commits)
hadoop-azure-2.7.x.jar
azure-storage-2.2.0.jar
+ (jackson-core; http-components, hadoop-common)
29 © Hortonworks Inc. 2011 – 2017 All Rights Reserved
Credentials: core-site.xml / spark-defaults.conf
<property>
<name>fs.azure.account.key.example.blob.core.windows.net</name>
<value>0c0d44ac83ad7f94b0997b36e6e9a25b49a1394c</value>
</property>
spark.hadoop.fs.azure.account.key.example.blob.core.windows.net
0c0d44ac83ad7f94b0997b36e6e9a25b49a1394c
wasb://demo@example.blob.core.windows.net
30 © Hortonworks Inc. 2011 – 2017 All Rights Reserved
Example: Azure Storage and Streaming
val streaming = new StreamingContext(sparkConf, Seconds(10))
val azure = "wasb://demo@example.blob.core.windows.net/in"
val lines = streaming.textFileStream(azure)
val matches = lines.map(line => {
println(line)
line
})
matches.print()
streaming.start()
* PUT into the streaming directory
* keep the dir clean
* size window for slow scans
* checkpoints slow —reduce frequency
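On the checkpoint cost, one mitigation sketch (the path is a placeholder): keep checkpointing on, but stretch the interval on the DStream, set before streaming.start():

streaming.checkpoint("wasb://demo@example.blob.core.windows.net/checkpoints")
lines.checkpoint(Seconds(60)) // checkpoint far less often than the 10s batch interval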
31 © Hortonworks Inc. 2011 – 2017 All Rights Reserved
s3guard:
fast, consistent S3 metadata
HADOOP-13445
Hortonworks + Cloudera + Western Digital
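The feature was still under development at the time, so treat this as an illustrative sketch of switching it on, not final property names:

spark.hadoop.fs.s3a.metadatastore.impl org.apache.hadoop.fs.s3a.s3guard.DynamoDBMetadataStore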
32 © Hortonworks Inc. 2011 – 2017 All Rights Reserved
DynamoDB as consistent metadata store
[diagram: S3 blobs across servers s01–s04, with a DynamoDB table holding the metadata]
DELETE part-00 -> 200
HEAD part-00 -> 200
HEAD part-00 -> 404
PUT part-00 -> 200
33 © Hortonworks Inc. 2011 – 2017 All Rights Reserved
34 © Hortonworks Inc. 2011 – 2017 All Rights Reserved
Summary
⬢ Object Stores look just like any other Filesystem URL
⬢ …but do need classpath and configuration
⬢ Issues: performance, commitment
⬢ Tune to reduce I/O
⬢ Keep those credentials secret!
Finally: keep an eye out for s3guard!
35 © Hortonworks Inc. 2011 – 2017 All Rights Reserved
Questions?
36 © Hortonworks Inc. 2011 – 2017 All Rights Reserved
Backup Slides
37 © Hortonworks Inc. 2011 – 2017 All Rights Reserved
Not Covered
⬢ Partitioning/directory layout
⬢ Infrastructure Throttling
⬢ Optimal path names
⬢ Error handling
⬢ Metrics
38 © Hortonworks Inc. 2011 – 2017 All Rights Reserved
Dependencies in Hadoop 2.8
hadoop-aws-2.8.x.jar
aws-java-sdk-core-1.10.6.jar
aws-java-sdk-kms-1.10.6.jar
aws-java-sdk-s3-1.10.6.jar
joda-time-2.9.3.jar
(jackson-*-2.6.5.jar)
hadoop-azure-2.8.x.jar
azure-storage-4.2.0.jar
39 © Hortonworks Inc. 2011 – 2017 All Rights Reserved
S3 Server-Side Encryption
⬢ Encryption of data at rest in S3
⬢ Supports the SSE-S3 option: each object encrypted with a unique key
using the AES-256 cipher
⬢ Now covered in S3A automated test suites
⬢ Support for additional options under development (SSE-KMS and SSE-C)
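Enabling the SSE-S3 option is a one-line setting; a minimal sketch using the S3A property:

spark.hadoop.fs.s3a.server-side-encryption-algorithm AES256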
40 © Hortonworks Inc. 2011 – 2017 All Rights Reserved
Advanced authentication
<property>
<name>fs.s3a.aws.credentials.provider</name>
<value>
org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider,
org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider,
com.amazonaws.auth.EnvironmentVariableCredentialsProvider,
com.amazonaws.auth.InstanceProfileCredentialsProvider,
org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider
</value>
</property>
+ encrypted credentials in JCEKS files on HDFS
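A sketch of creating such a file with the Hadoop credential CLI and pointing the configuration at it (host and paths are placeholders; the command prompts for each secret):

hadoop credential create fs.s3a.access.key -provider jceks://hdfs@namenode/user/alice/s3.jceks
hadoop credential create fs.s3a.secret.key -provider jceks://hdfs@namenode/user/alice/s3.jceks

spark.hadoop.hadoop.security.credential.provider.path jceks://hdfs@namenode/user/alice/s3.jceks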
41 © Hortonworks Inc. 2011 – 2017 All Rights Reserved
Editor's Notes

  • #3: Now people may be saying "hang on, these aren't Spark developers". Well, I do have some integration patches for Spark, but a lot of the integration problems are actually lower down: filesystem connectors, ORC performance, the Hive metastore. Rajesh has been doing lots of scale runs and profiling, initially for Hive/Tez, now looking at Spark, including some of the Parquet problems. Chris has done work on HDFS, Azure WASB and most recently S3A. Me? Co-author of the Swift connector, author of the Hadoop FS spec, and general mentor of the S3A work, even when not actively working on it. Been full time on S3A, using Spark as the integration test suite, since March.
  • #4: This is one of the simplest deployments in cloud: scheduled/dynamic ETL. Incoming data sources save to an object store; a Spark cluster is brought up for ETL. Either direct cleanup/filter or multistep operations, but either way: an ETL pipeline. HDFS on the VMs is used for transient storage, the object store as the destination for data —now in a more efficient format such as ORC or Parquet.
  • #5: Notebooks on demand; a notebook talks to Spark in the cloud, which then does the work against external and internal data. Your notebook itself can be saved to the object store, for persistence and sharing.
  • #6: Example: streaming on Azure
  • #11: Everything uses the Hadoop APIs to talk to HDFS, Hadoop-compatible filesystems and object stores: the Hadoop FS API. There are actually two: the one with a clean split between client side and "driver side", and the older one which is a direct connect. Most use the latter, and in terms of opportunities for object store integration tweaking, it is actually the one where we can innovate most easily. That is: there's nothing in the way. Under the FS API go filesystems and object stores. HDFS is a "real" filesystem; WASB/Azure is close enough. What is "real"? Best test: can it support HBase?
  • #12: This is the history
  • #21: You used to have to disable summary data in the Spark context Hadoop options, but https://issues.apache.org/jira/browse/SPARK-15719 fixed that for you.
  • #22: It looks the same; you just need to be as aggressive about minimising IO as you can: push down predicates, only select the columns you want, filter, and if you read a lot, write to HDFS then re-use. cache()? I don't know. If you do, filter as much as you can first (columns, predicates, ranges) so that Parquet/ORC can read as little as it needs to, and RAM use is least.
  • #24: Without going into the details, here are things you will want for Hadoop 2.8. They are in HDP 2.5, possibly in the next CDH release. The first two boost input by reducing the cost of seeking, which is expensive as it breaks then re-opens the HTTPS connection. Readahead means that hundreds of KB can be skipped before that reconnect (yes, it can take that long). The experimental fadvise random feature speeds up backward reads at the expense of pure-forward file reads. It is significantly faster for reading in optimized binary formats like ORC and Parquet. The last one is a successor to fast upload in Hadoop 2.7. That buffers on heap and needs careful tuning; its memory needs conflict with RDD caching. The new version defaults to buffering as files on local disk, so it won't run out of memory. It offers the potential of significantly more effective use of bandwidth; the resulting partitioned files may also offer higher read perf. (No data there, just hearsay.)
  • #26: This invariably ends up reaching us on JIRA, to the extent that I've got a document somewhere explaining the problem in detail. It was taken away because it can corrupt your data, without you noticing. This is generally considered harmful.
  • #27: If your distributor didn't stick the JARs in, you can add the hadoop-aws and hadoop-azure dependencies in the interpreter config. Credentials: keep them out of notebooks. Zeppelin can list its settings too; always dangerous (mind you, so do HDFS and YARN, so an XInclude is handy there). When running in EC2, S3 credentials are now automatically picked up. And, if Zeppelin is launched with the AWS env vars set, its invocation of spark-submit should pass them down.
  • #28: See "Windows Azure Storage: A Highly Available Cloud Storage Service with Strong Consistency" for details; essentially it has the semantics HBase needs, that being our real compatibility test.
  • #29–30: Azure storage is unique in that there's a published paper (+ video) on its internals. Well worth looking at to understand what's going on. In contrast, if you want to know S3 internals, well, you can ply the original author with gin and he still won't reveal anything. ADL adds okhttp for HTTP/2 performance; yet another JSON parser for unknown reasons.
  • #41: Hadoop 2.8 adds a lot of control here (credit: Netflix, + later us & Cloudera). You can define a list of credential providers to use; the default is simple, env, instance, but you can add temporary and anonymous, choose which are unsupported, etc. Passwords/secrets can be encrypted in Hadoop credential files stored locally or in HDFS. IAM auth is what EC2 VMs need.
  • #42: And this is the big one, as it spans the lot: Hadoop's own code (so far: distcp), Spark, Hive, Flink, related tooling. If we can't speed up the object stores, we can tune the apps.