DM_PPT_NP_v01
SESIP_0715_GH2
Putting some Spark into HDF5
Gerd Heber & Joe Lee
The HDF Group
Champaign Illinois USA
This work was supported by NASA/GSFC under
Raytheon Co. contract number NNG10HP02C
The Return of
Outline
• “The Big Schism”
• A Shiny New Engine
• Getting off the Ground
• Future Work
July 14 – 17, 2015
“The Big Schism”
• An HDF5 file is a Smart Data Container
• “This is what happens, Larry, when you copy
an HDF5 file into HDFS!” (Walter Sobchak)
Natural Habitat: Traditional File System
Block Store: Hadoop “File System” (HDFS)
Now What?
• Ask questions:
– Who wants HDF5 files in Hadoop? (volatile)
• Who wants to program MapReduce? (nobody)
– How big are your HDF5 files? (long-tailed distribution)
• No size (solution) fits all...
• Do experiments:
– Reverse-engineer the format (students, weirdos)
– In-core processing (fiddly)
– Convert to Avro (some success)
• Sit tight and wait for something better!
Spark Concepts
Formally, an RDD is a read-only, partitioned collection of records.
RDDs can only be created through deterministic operations on
either (1) a dataset in stable storage or (2) other existing RDDs.
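To make the definition concrete, here is a toy, pure-Python model of the idea. It is not Spark's implementation; `ToyRDD` and its method names are made up for illustration. It only shows the three properties named in the definition: the collection is read-only, it is split into partitions, and a new collection comes either from an input dataset or from a deterministic operation on an existing one.

```python
from typing import Callable, Iterable, Tuple


class ToyRDD:
    """Hypothetical stand-in for an RDD: immutable, partitioned records."""

    def __init__(self, partitions: Iterable[Iterable]):
        # Tuples keep both the partition list and each partition read-only.
        self._partitions: Tuple[tuple, ...] = tuple(tuple(p) for p in partitions)

    @classmethod
    def from_records(cls, records: Iterable, num_partitions: int) -> "ToyRDD":
        """Case (1): build from a dataset, dealt round-robin into partitions."""
        records = list(records)
        return cls(records[i::num_partitions] for i in range(num_partitions))

    def map(self, fn: Callable) -> "ToyRDD":
        """Case (2): derive a new ToyRDD; the original is left untouched."""
        return ToyRDD(tuple(fn(x) for x in part) for part in self._partitions)

    def flat_map(self, fn: Callable) -> "ToyRDD":
        """Like map, but fn returns an iterable that is flattened per partition."""
        return ToyRDD(
            tuple(y for x in part for y in fn(x)) for part in self._partitions
        )

    @property
    def num_partitions(self) -> int:
        return len(self._partitions)

    def collect(self) -> list:
        """Gather all records, partition by partition."""
        return [x for part in self._partitions for x in part]


base = ToyRDD.from_records(range(6), num_partitions=2)
squares = base.map(lambda x: x * x)      # new collection, base is unchanged
pairs = base.flat_map(lambda x: (x, x))  # each record expands to two records
```

In real Spark the partitions live on different workers and transformations are evaluated lazily; the toy model leaves both aspects out.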
What’s Great about Spark
• Refreshingly abstract
• Supports Python
• Typically runs in RAM
• Has batteries included
Experimental Setup
• GSSTF_NCEP.3 collection, 7/1/1987 to 12/31/2008
• 7,850 HDF-EOS5 files, 16 MB per file, ~120 GB total
• 4 variables on a daily 1440x720 grid
– Sea level pressure (hPa)
– 2m air temperature (°C)
– Sea surface skin temperature (°C)
– Sea surface saturation humidity (g/kg)
• Lenovo ThinkPad X230T
– Intel Core i5-3320M (2 cores, 4 threads), 8 GB of RAM, Samsung SSD 840 Pro
– Windows 8.1 (64-bit), Apache Spark 1.3.0
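As a rough plausibility check on these figures, the arithmetic below estimates the per-file and total sizes. The 4-byte (32-bit float) value size and the absence of compression are assumptions made for this sketch; the slides do not state either.

```python
# Back-of-the-envelope check of the dataset sizes quoted on the slide.
GRID_DIM_A, GRID_DIM_B = 1440, 720  # daily grid from the slide
NUM_VARIABLES = 4                   # variables per file
NUM_FILES = 7850                    # files in the collection
BYTES_PER_VALUE = 4                 # ASSUMPTION: 32-bit floats, uncompressed

cells_per_grid = GRID_DIM_A * GRID_DIM_B
bytes_per_file = cells_per_grid * NUM_VARIABLES * BYTES_PER_VALUE
total_bytes = bytes_per_file * NUM_FILES

print(f"cells per grid : {cells_per_grid:,}")
print(f"per file       : {bytes_per_file / 2**20:.1f} MiB")
print(f"collection     : {total_bytes / 2**30:.1f} GiB")
```

Under those assumptions the estimate comes to roughly 15.8 MiB per file and roughly 121 GiB for the collection, which is in line with the 16 MB and ~120 GB quoted above.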
Getting off the Ground
Where do they dwell?
General Strategy
1. Create our first RDD – “list of file names/paths/...”
a. Traverse base directory, compile list of HDF5 files
b. Partition the list via SparkContext.parallelize()
2. Use the RDD’s flatMap method to calculate
something interesting, e.g., summary statistics
Calculating Tair_2m mean and median for 3.5 years took about 10 seconds on my notebook.
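The strategy on this slide could be sketched in PySpark roughly as follows. This is an illustration under assumptions, not the authors' code: the function names (`list_hdf5_files`, `tair_stats`, `run_with_spark`), the file extensions, and the dataset path are all hypothetical, and fill values are not masked. The Spark part needs `pyspark`, `h5py`, and `numpy`, which are imported lazily so that the directory traversal can be tried on its own.

```python
import os


def list_hdf5_files(base_dir, extensions=(".h5", ".he5", ".hdf5")):
    """Step 1a: walk base_dir and collect paths that look like HDF5 files.

    ASSUMPTION: files are recognized by extension only.
    """
    paths = []
    for root, _dirs, files in os.walk(base_dir):
        for name in files:
            if name.lower().endswith(extensions):
                paths.append(os.path.join(root, name))
    return sorted(paths)


def tair_stats(path, dataset="/HDFEOS/GRIDS/NCEP/Data Fields/Tair_2m"):
    """Step 2 worker: return [(file name, mean, median)] for one file.

    ASSUMPTION: the dataset path is a placeholder; check the real layout
    of your files. Fill values are not handled in this sketch.
    """
    import h5py          # imported lazily; only needed on the workers
    import numpy as np

    with h5py.File(path, "r") as f:
        data = f[dataset][...]  # read the whole 2-D grid into memory
    return [(os.path.basename(path), float(np.mean(data)), float(np.median(data)))]


def run_with_spark(base_dir, num_slices=4):
    """Steps 1b and 2: partition the file list, then flatMap over it."""
    from pyspark import SparkContext  # imported lazily; needs a Spark install

    sc = SparkContext(appName="hdf5-summary")
    try:
        file_rdd = sc.parallelize(list_hdf5_files(base_dir), num_slices)
        return file_rdd.flatMap(tair_stats).collect()
    finally:
        sc.stop()
```

Returning a list from `tair_stats` is what makes `flatMap` a natural fit: a worker can emit zero records for a file it cannot use, or several records if it summarizes more than one variable.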
Variations
• Instead of traversing directories, you can provide a CSV file of [HDF5 file names, path names, hyperslab selections, etc.] to partition
• A fast SSD array goes a long way
• If you have a distributed file system (e.g., GPFS, Lustre, Ceph), you should be able to feed large numbers of Spark workers (running on a cluster)
• If you don’t have a parallel file system and use most of the data in a file, you can stage (copy) the files first on the cluster nodes
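The CSV variation in the first bullet might look like the sketch below. The column layout (`file`, `dataset`, and a start/count pair for each of two dimensions) and the names `WorkItem` and `read_work_items` are assumptions chosen for illustration; the slides do not prescribe a format. The sample rows are made up.

```python
import csv
import io
from typing import List, NamedTuple, Tuple


class WorkItem(NamedTuple):
    """One unit of work: which file, which dataset, which hyperslab."""
    path: str
    dataset: str
    selection: Tuple[slice, slice]


def read_work_items(csv_file) -> List[WorkItem]:
    """Parse rows of a (hypothetical) work-list CSV into WorkItem records."""
    items = []
    for row in csv.DictReader(csv_file):
        s0, n0 = int(row["start0"]), int(row["count0"])
        s1, n1 = int(row["start1"]), int(row["count1"])
        # A start/count hyperslab maps onto a pair of Python slices.
        selection = (slice(s0, s0 + n0), slice(s1, s1 + n1))
        items.append(WorkItem(row["file"], row["dataset"], selection))
    return items


# Made-up example: one file split into two hyperslabs along dimension 0.
EXAMPLE_CSV = """file,dataset,start0,count0,start1,count1
day1.he5,/Tair_2m,0,360,0,1440
day1.he5,/Tair_2m,360,360,0,1440
"""

items = read_work_items(io.StringIO(EXAMPLE_CSV))
```

A list like `items` could then be handed to `SparkContext.parallelize()` in place of the plain file list, with each worker reading only its selection, for example via `h5py.File(item.path, "r")[item.dataset][item.selection]`.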
Conclusion
• Forget MapReduce, stop worrying about HDFS
• With Spark, exploiting data parallelism has never been more accessible (easier and cheaper)
• Current HDF5-to-Spark on-ramps can be effective under the right circumstances, but are kludgy
• Work with us to build the right things right!
References
[BigHDF] https://www.hdfgroup.org/pubs/papers/Big_HDF_FAQs.pdf
[Blog] https://hdfgroup.org/wp/2015/04/putting-some-spark-into-hdf-eos/
[Report] Zaharia et al., Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing, UC Berkeley, 2011.
[Spark] https://spark.apache.org/
[YouTube] Mark Madsen: Big Data, Bad Analogies, 2014.
