This presentation discusses loading HDF5 files into Apache Spark for analysis. It describes the differences between traditional file systems and Hadoop's HDFS, and how Spark provides a more accessible way to exploit data parallelism without writing MapReduce jobs. The presentation outlines experiments loading HDF5 climate data files into Spark to calculate statistics, and suggests variations such as providing an explicit file list instead of traversing directories. The conclusion is that Spark can effectively analyze HDF5 files under the right circumstances, but current methods are imperfect, and future work with The HDF Group could build better integrations.
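A minimal sketch of what such an experiment might look like in PySpark, assuming h5py and NumPy are installed on every worker. The file location, dataset name (`/temperature`), and choice of statistics are illustrative assumptions, not details taken from the presentation. It also shows the suggested variation of handing Spark an explicit file list rather than walking a directory tree:

```python
import glob

import h5py
import numpy as np
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hdf5-stats").getOrCreate()
sc = spark.sparkContext

# Explicit file list (hypothetical location), instead of directory traversal.
paths = glob.glob("/data/climate/*.h5")

DATASET = "/temperature"  # hypothetical dataset name inside each file


def file_stats(path):
    """Open one HDF5 file on a worker and return (path, min, max, mean)."""
    with h5py.File(path, "r") as f:
        data = f[DATASET][...]  # read the whole dataset as a NumPy array
    return (path, float(np.min(data)), float(np.max(data)), float(np.mean(data)))


# One partition per file keeps each HDF5 read local to a single task.
stats = sc.parallelize(paths, max(len(paths), 1)).map(file_stats).collect()
for path, lo, hi, mean in stats:
    print(f"{path}: min={lo:.2f} max={hi:.2f} mean={mean:.2f}")
```

Because HDF5 is a binary format that HDFS cannot split, each file here is read whole by a single task; this mirrors the presentation's point that the approach works but is imperfect without deeper integration.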