This document summarizes Daniel Galvez's presentation on creating The People's Speech Dataset using Apache Spark and TPUs. The key points are:
1) The dataset aims to provide 86,000 hours of speech data with forced alignments between audio and transcripts; the goal is a dataset that is challenging, free to use, and commercially licensed.
2) The conceptual workload is to take hour-long audio files, split them into 15-second segments, and use a pretrained speech recognition model to discover when each word in the transcript was said.
3) Creating the dataset ran into limitations of Spark's accelerator-aware scheduling, memory issues with PySpark UDFs, TPU crashes, and the need to reorder data by
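The segmentation step in point 2 can be sketched in plain Python. This is a minimal, illustrative sketch, not code from the presentation: the function name and signature are assumptions, and a real pipeline would apply this per audio file inside a Spark job before running the alignment model.

```python
# Hypothetical sketch: split a recording's timeline into 15-second
# windows, as described in the conceptual workload above.

def segment_boundaries(total_seconds: float, window: float = 15.0):
    """Yield (start, end) times, in seconds, covering the recording."""
    start = 0.0
    while start < total_seconds:
        end = min(start + window, total_seconds)  # last window may be shorter
        yield (start, end)
        start = end

# An hour-long file splits into 240 fifteen-second segments.
segments = list(segment_boundaries(3600.0))
```

Each (start, end) pair would then be used to slice the audio and pass the segment, with its transcript portion, to the pretrained recognizer for forced alignment.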