Putting some Spark into HDF5

DM_PPT_NP_v01
SESIP_0715_GH2
Putting some Spark into HDF5
Gerd Heber & Joe Lee
The HDF Group
Champaign Illinois USA
This work was supported by NASA/GSFC under
Raytheon Co. contract number NNG10HP02C

The Return of

Outline
• “The Big Schism”
• A Shiny New Engine
• Getting off the Ground
• Future Work
July 14 – 17, 2015

“The Big Schism”
• An HDF5 file is a Smart Data Container
• “This is what happens, Larry, when you copy
an HDF5 file into HDFS!” (Walter Sobchak)
Natural Habitat: Traditional File System
Block Store: Hadoop “File System” (HDFS)

Now What?
• Ask questions:
– Who wants HDF5 files in Hadoop? (volatile)
• Who wants to program MapReduce? (nobody)
– How big are your HDF5 files? (long-tailed distribution)
• No size (solution) fits all...
• Do experiments:
– Reverse-engineer the format (students, weirdos)
– In-core processing (fiddly)
– Convert to Avro (some success)
• Sit tight and wait for something better!

Spark Concepts
Formally, an RDD is a read-only, partitioned collection of records.
RDDs can only be created through deterministic operations on
either (1) a dataset in stable storage or (2) other existing RDDs.
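The two ways of creating an RDD can be sketched in a few lines of PySpark. This is a minimal illustration, not code from the talk: the function name build_rdds, the listing file "hdf5_files.txt" and the application name are hypothetical, and pyspark is imported only inside main() so the rest of the block loads without it.

```python
import os


def build_rdds(sc, listing_path):
    """Create one RDD from stable storage and derive a second RDD from it.

    `sc` is expected to be a SparkContext; `listing_path` is a hypothetical
    text file that holds one HDF5 file path per line.
    """
    # (1) An RDD created from a dataset in stable storage.
    paths = sc.textFile(listing_path)
    # (2) An RDD created by a deterministic operation on an existing RDD.
    names = paths.map(os.path.basename)
    return paths, names


def main():
    # Assumption: pyspark is installed and a local master is acceptable.
    from pyspark import SparkContext

    sc = SparkContext(master="local[*]", appName="rdd-concepts")
    try:
        _paths, names = build_rdds(sc, "hdf5_files.txt")
        print(names.take(5))
    finally:
        sc.stop()
```

Neither RDD is modified after creation; map() returns a new RDD and leaves the original untouched.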

What’s Great about Spark
• Refreshingly abstract
• Supports Python
• Typically runs in RAM
• Has batteries included

Experimental Setup
• GSSTF_NCEP.3 collection 7/1/1987 to 12/31/2008
• 7,850 HDF-EOS5 files, 16 MB per file,
~120 GB total
• 4 variables on daily 1440x720 grid
– Sea level pressure (hPa)
– 2m air temperature (C)
– Sea surface skin temperature (C)
– Sea surface saturation humidity (g/kg)
• Lenovo ThinkPad X230T
– Intel Core i5-3320M (2 cores, 4 threads), 8GB of RAM,
Samsung SSD 840 Pro
– Windows 8.1 (64-bit), Apache Spark 1.3.0
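As a rough sanity check on these figures, the sketch below recomputes the number of calendar days in the date range and the raw size of four variables on the grid. The 4-byte (32-bit float) element size and the absence of compression are assumptions; the slide does not state either.

```python
from datetime import date

# Figures taken from the slide.
FIRST_DAY = date(1987, 7, 1)
LAST_DAY = date(2008, 12, 31)
GRID = (1440, 720)
NUM_VARIABLES = 4
NUM_FILES = 7850

# Assumption: values are stored as uncompressed 32-bit floats.
BYTES_PER_VALUE = 4

# Count both the first and the last day of the range.
days_in_range = (LAST_DAY - FIRST_DAY).days + 1
bytes_per_file = GRID[0] * GRID[1] * NUM_VARIABLES * BYTES_PER_VALUE
total_bytes = bytes_per_file * NUM_FILES

print("calendar days in range:", days_in_range)
print("raw MiB per file: %.1f" % (bytes_per_file / 2**20))
print("raw GiB in total: %.1f" % (total_bytes / 2**30))
```

Under those assumptions the raw payload comes to a little under 16 MiB per file, in line with the file size quoted above, and the number of calendar days is slightly larger than the number of files.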

Getting off the Ground
Where do they dwell?

General Strategy
1. Create our first RDD – “list of file names/paths/...”
a. Traverse base directory, compile list of HDF5 files
b. Partition the list via SparkContext.parallelize()
2. Use the RDD’s flatMap method to calculate
something interesting, e.g., summary statistics
RDD
Calculating Tair_2m mean and median for 3.5 years took about 10 seconds on my notebook.
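The general strategy can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the helper names (find_hdf5_files, read_values, summarize, run), the file extensions and the dataset argument are hypothetical, h5py and pyspark are imported lazily inside the functions that need them, and fill values are not masked.

```python
import os
import statistics


def find_hdf5_files(base_dir, extensions=(".h5", ".he5", ".hdf5")):
    """Step 1a: traverse base_dir and compile a sorted list of HDF5 files."""
    found = []
    for root, _dirs, files in os.walk(base_dir):
        for name in files:
            if name.lower().endswith(extensions):
                found.append(os.path.join(root, name))
    return sorted(found)


def read_values(path, dataset):
    """Read one dataset as a flat list of numbers (assumes h5py is installed).

    Fill values are not masked here; a real analysis would have to do that.
    """
    import h5py

    with h5py.File(path, "r") as f:
        return f[dataset][...].ravel().tolist()


def summarize(path, dataset, reader=read_values):
    """Step 2: flatMap function that emits zero or one summary record per file."""
    try:
        values = reader(path, dataset)
    except (OSError, KeyError):
        return []  # unreadable file or missing dataset: emit nothing
    if not values:
        return []
    return [(path, len(values), statistics.mean(values), statistics.median(values))]


def run(base_dir, dataset, num_partitions=4):
    """Steps 1b and 2 on a SparkContext (assumes pyspark is installed)."""
    from pyspark import SparkContext

    sc = SparkContext(appName="hdf5-summary")
    try:
        # Step 1b: partition the list of file names.
        paths = sc.parallelize(find_hdf5_files(base_dir), num_partitions)
        return paths.flatMap(lambda p: summarize(p, dataset)).collect()
    finally:
        sc.stop()
```

Returning a list from summarize is what makes flatMap a good fit: a file that cannot be read contributes no records instead of failing the whole job.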

Variations
• Instead of traversing directories, you can
provide a CSV file of [HDF5 file names, path
names, hyperslab selections, etc.] to partition
• A fast SSD array goes a long way
• If you have a distributed file system (e.g.,
GPFS, Lustre, Ceph), you should be able to
feed large numbers of Spark workers (running
on a cluster)
• If you don’t have a parallel file system and use
most of the data in a file, you can stage (copy)
the files first on the cluster nodes
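The CSV variation could look like the sketch below. The column layout (file, dataset, start, count, with colon-separated integers per dimension) and the helper names parse_tasks and read_selection are assumptions made for illustration; the slides do not prescribe a format. h5py is imported lazily and is only needed to read the data.

```python
import csv
import io


def parse_tasks(stream):
    """Turn CSV rows into (file, dataset, selection) tuples.

    Hypothetical columns: file, dataset, start, count. `start` and `count`
    hold one integer per dimension, separated by colons, and together
    describe a hyperslab as a tuple of slices.
    """
    tasks = []
    for row in csv.DictReader(stream):
        start = [int(s) for s in row["start"].split(":")]
        count = [int(c) for c in row["count"].split(":")]
        if len(start) != len(count):
            raise ValueError("start and count differ in rank: %r" % (row,))
        selection = tuple(slice(s, s + c) for s, c in zip(start, count))
        tasks.append((row["file"], row["dataset"], selection))
    return tasks


def read_selection(task):
    """Read only the selected hyperslab of one task (assumes h5py is installed)."""
    import h5py

    path, dataset, selection = task
    with h5py.File(path, "r") as f:
        return f[dataset][selection]


# Hypothetical example listing; file and dataset names are made up.
EXAMPLE_CSV = """file,dataset,start,count
day1.he5,/grid/Tair_2m,0:0,360:720
day2.he5,/grid/Tair_2m,360:720,360:720
"""

if __name__ == "__main__":
    for task in parse_tasks(io.StringIO(EXAMPLE_CSV)):
        print(task)
```

The resulting list can be handed to SparkContext.parallelize() in place of a plain list of file names, so that each worker reads only the hyperslab it needs.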

Conclusion
• Forget MapReduce, stop worrying about HDFS
• With Spark, exploiting data parallelism has never
been more accessible (easier and cheaper)
• Current HDF5 to Spark on-ramps can be
effective under the right circumstances, but are
kludgy
• Work with us to build the right things right!

References
[BigHDF] https://www.hdfgroup.org/pubs/papers/Big_HDF_FAQs.pdf
[Blog] https://hdfgroup.org/wp/2015/04/putting-some-spark-into-hdf-eos/
[Report] Zaharia et al., Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing, UC Berkeley, 2011.
[Spark] https://spark.apache.org/
[YouTube] Mark Madsen: Big Data, Bad Analogies, 2014.

