Creating an 86,000 Hour Speech Dataset with Apache Spark and TPUs
Daniel Galvez
Engineer
MLCommons
Feedback
Your feedback is important to us.
Don’t forget to rate and review the sessions.
Agenda
▪ What is MLCommons?
▪ What is The People’s Speech Dataset?
▪ The Workload to Create the Dataset
▪ Limitations of Accelerator-aware Scheduling
▪ PySpark UDF Gotchas
▪ TPU Gotchas
▪ Efficient joins on data reordered by bucketing by sequence length
What is MLCommons?
• Deep Learning Benchmarking Organization
• Originally known as MLPerf
• See “MLCommons: Better ML for Everyone” by David Kanter, Executive Director, on Thursday at 4:25 PM
• Expanding into:
  • (1) Machine Learning Best Practices
  • (2) Dataset Development
Motivation for The People’s Speech Dataset
• For widespread adoption, datasets need:
  • To be challenging
  • To be free as in beer
  • To have a commercial use license
• Historically, the majority of tech companies’ machine learning papers rely on internal datasets rather than public ones.
Provided by Vijay Janapa Reddi
https://www.sigarch.org/data-engineering-for-everyone/
The Conceptual Workload
• Given audio and transcripts, we must discover when each word in the transcript was said.
• Known as “forced alignment” or “segmentation”.
• We must split hour-long audio files into segments of ~15 seconds of audio.
  • Time segments >1 minute typically use too much memory at training time.
• Uses a pre-trained speech recognition model.
The Conceptual Workload (2)

SELECT FORCE_ALIGN(
    ASR_NEURAL_NET(DECODE_MP3(A.FILE)),
    NORMALIZE_TEXT(T.FILE)
)
FROM AUDIO A INNER JOIN TRANSCRIPT T ON IDENTIFIER

• On CPUs, this runs at ~0.5x real time. For 86,000 hours, that is ~20 CPU-years.
• ASR_NEURAL_NET takes 99% of the runtime in the pipeline.
• This is the fundamental motivation for this talk’s topics.
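The 20 CPU-year figure follows from simple arithmetic, which can be checked directly:

```python
# 86,000 hours of audio processed at ~0.5x real time:
# every hour of audio costs about 2 CPU-hours.
hours_of_audio = 86_000
realtime_factor = 0.5
cpu_hours = hours_of_audio / realtime_factor   # 172,000 CPU-hours
cpu_years = cpu_hours / (24 * 365)
print(round(cpu_years, 1))                     # ~19.6, i.e. roughly 20 CPU-years
```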
Accelerator-Aware Scheduling Limitations
• Cloud TPU, being a network service, precludes support in accelerator-aware scheduling.
• Accelerator-aware scheduling typically assigns one accelerator to each executor/task.
• But the CPU-dependent parts of the workload usually require many more executors than you have accelerators.
• Therefore, we use multiple jobs, writing to disk in between.
• Conclusion:
  • Good for data-parallel training on existing Spark clusters.
  • Good for integration with NVIDIA RAPIDS.
  • Bad for heterogeneous inference workloads with UDFs.
PySpark Arrow UDF Gotchas
[Diagram: Reality — data is serialized in the JVM executor and deserialized in worker.py (and back again); Ideal — the JVM executor and worker.py would use shared memory.]
▪ Implication: memory usage is doubled.
▪ JVM GC does not return physical memory back to the OS.
  ▪ Adding swap space prevents OOMs.
  ▪ Don’t set spark.executor.memory to fill the entire physical memory.
    ▪ The JVM will hog all physical memory, causing the PySpark UDF to use swap.
▪ Minimize allocations in your Python UDF.
▪ Since Java cannot handle byte arrays larger than 2 GB and some MP3 files are almost 2 GB in size, we must set spark.sql.execution.arrow.maxRecordsPerBatch=1.
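The memory-related settings above might be applied at submission time like this; the memory values and the script name `align_pipeline.py` are illustrative, not recommendations:

```shell
# Leave headroom for the Python worker: don't let the JVM claim all physical
# RAM (e.g. 20g of a 32 GB machine), give worker.py an explicit budget, and
# keep one record per Arrow batch because some MP3 blobs approach Java's
# 2 GB byte-array limit.
spark-submit \
  --conf spark.executor.memory=20g \
  --conf spark.executor.pyspark.memory=8g \
  --conf spark.sql.execution.arrow.maxRecordsPerBatch=1 \
  align_pipeline.py
```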
TPU Gotchas
• Used a TPUv3-8 Pod.
• Used Google’s lingvo codebase, but had to make several modifications in a custom fork.
  • Link at end of slides.
• Used a 4-layer, 1024-hidden-unit LSTM network trained with CTC for inference.
• Requires usage of Google Cloud Storage as your file system.
• Cloud TPUs are prone to crash, with a mean time between failures measured in hours.
  • Need to write your own “restartability” logic.
  • Not a TPU-specific problem: all “spot instances” require software redundancy.
• TPU code can’t use the tf.string data type. Must use integer primary keys for the “keyed prediction” machine learning design pattern.
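The “restartability” logic can be as simple as a checkpoint-aware retry loop. This sketch (with hypothetical names; persisted shard output plays the role of the checkpoint) re-drives inference after a crash, skipping work that already finished:

```python
def run_with_restarts(shards, run_shard, max_restarts=100):
    """Re-drive inference after accelerator failures, skipping shards
    whose output already exists (the checkpoint)."""
    done = set()
    restarts = 0
    while len(done) < len(shards):
        try:
            for shard in shards:
                if shard in done:
                    continue
                run_shard(shard)
                done.add(shard)       # persisted output acts as the checkpoint
        except RuntimeError:          # stand-in for a TPU crash / preemption
            restarts += 1
            if restarts > max_restarts:
                raise
    return done, restarts
```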
TPU Gotchas (2)
• We used the “keyed prediction” design pattern to join acoustic model output against the original transcript.
• Records are sorted by key on input to the acoustic model.
• They are no longer sorted on output.
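Keyed prediction means each output carries its input’s integer key, so output order doesn’t matter. The rejoin can be sketched in plain Python (Spark would express this as an equi-join on the key; field names are illustrative):

```python
def join_by_key(predictions, transcripts):
    # Predictions arrive in arbitrary order; the integer key recovers
    # the pairing with the original transcript.
    text_by_id = {t["id"]: t["text"] for t in transcripts}
    return [(p["id"], p["logits"], text_by_id[p["id"]]) for p in predictions]
```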
Bucketing by sequence length
Necessary to utilize modern accelerators fully
tf.data.experimental.bucket_by_sequence_length
[Diagram: records A1, A2, B1, B2, B3, C1, C2, D1 are reordered so that each batch contains sequences of similar length.]
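tf.data.experimental.bucket_by_sequence_length groups sequences of similar length into batches so padding is minimized. Its core behavior can be sketched in plain Python (the bucket boundaries and batch size here are illustrative):

```python
from bisect import bisect_right
from collections import defaultdict

def bucket_by_length(sequences, boundaries, batch_size):
    """Yield batches whose members fall in the same length bucket,
    mimicking tf.data.experimental.bucket_by_sequence_length."""
    buckets = defaultdict(list)
    for seq in sequences:
        b = bisect_right(boundaries, len(seq))  # bucket index by length
        buckets[b].append(seq)
        if len(buckets[b]) == batch_size:       # bucket full: emit a batch
            yield buckets.pop(b)
    for leftover in buckets.values():           # flush partial buckets
        yield leftover
```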
Bucketing by sequence length (2)
• A TPUv3-8 works best with a batch size of 128 * 8 = 1024.
• Sort-merge joins are expensive afterward.
  • We must join speech recognizer output against the ground-truth transcript.
  • Speech recognizer output is not small! A probability distribution over 40 tokens every 30 ms. For 86,000 hours, that’s 1.5 TiB uncompressed.
• Two solutions:
  • Map-side join: join whatever you need before using the accelerator.
    • Con: Reduces input bandwidth to the accelerator.
  • Sharding, aka partitionBy(): only need to sort each shard.
    • Con: If shards are too small, can reduce efficiency.
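The sharding option can be sketched in plain Python: hash-partition records by key (what Spark’s partitionBy() does), then sort within each shard only, instead of sorting the whole 1.5 TiB dataset:

```python
def shard_and_sort(records, num_shards, key):
    # Hash-partition (as partitionBy() would), then sort each shard locally.
    shards = [[] for _ in range(num_shards)]
    for r in records:
        shards[hash(key(r)) % num_shards].append(r)
    return [sorted(s, key=key) for s in shards]
```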
Conclusions
• Code is publicly available under Apache 2.0:
  • https://github.com/mlcommons/peoples-speech/tree/main/galvasr2/align/spark
• The ideal for sequence-based deep learning inference is for accelerators to act as an asynchronous queue, receiving input data until a batch is large enough to run efficiently.
• Would someone like to create a custom Spark Streaming sink?
• Contact: dt.galvez@gmail.com
