Creating an 86,000 Hour Speech Dataset with Apache Spark and TPUs
Daniel Galvez
Engineer
MLCommons
Feedback
Your feedback is important to us.
Don’t forget to rate and review the sessions.
Agenda
▪ What is MLCommons?
▪ What is The People’s Speech Dataset?
▪ The Workload to Create the Dataset
▪ Limitations of Accelerator-aware Scheduling
▪ PySpark UDF Gotchas
▪ TPU Gotchas
▪ Efficient joins on data reordered by bucketing by sequence length
What is MLCommons?
• Deep Learning Benchmarking Organization
• Originally known as MLPerf
• See “MLCommons: Better ML for Everyone” by David Kanter, Executive Director, on Thursday at 4:25 PM
• Expanding into:
  • (1) Machine Learning Best Practices
  • (2) Dataset Development
Motivation for The People’s Speech Dataset
• For widespread adoption, datasets need:
  • To be challenging
  • To be free as in beer
  • To have a commercial use license
• Historically, the majority of tech companies’ machine learning papers rely on internal datasets rather than public ones.
Provided by Vijay Janapa Reddi
https://www.sigarch.org/data-engineering-for-everyone/
The Conceptual Workload
• Given audio and transcripts, we must discover when each word in the transcript was said.
• Known as “forced alignment” or “segmentation”.
• We must split hour-long audio files into segments of ~15 seconds of audio.
  • Time segments >1 minute typically use too much memory at training time.
• Uses a pre-trained speech recognition model.
The Conceptual Workload (2)

SELECT FORCE_ALIGN(
    ASR_NEURAL_NET(DECODE_MP3(A.FILE)),
    NORMALIZE_TEXT(T.FILE)
)
FROM AUDIO A INNER JOIN TRANSCRIPT T ON IDENTIFIER

• On CPUs, this runs at ~0.5x real time. For 86,000 hours, that is ~20 CPU-years.
• ASR_NEURAL_NET takes 99% of the runtime in the pipeline.
• This is the fundamental motivation for this talk’s topics.
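The 20 CPU-year figure follows from simple arithmetic, which can be checked directly:

```python
# 86,000 hours of audio processed at ~0.5x real time:
# every hour of audio costs about 2 CPU-hours.
hours_of_audio = 86_000
realtime_factor = 0.5
cpu_hours = hours_of_audio / realtime_factor   # 172,000 CPU-hours
cpu_years = cpu_hours / (24 * 365)
print(round(cpu_years, 1))                     # ~19.6, i.e. roughly 20 CPU-years
```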
Accelerator-Aware Scheduling Limitations
• Cloud TPU, being a network service, precludes support in accelerator-aware scheduling.
• Accelerator-aware scheduling typically assigns one accelerator to each executor/task.
• But the CPU-dependent parts of the workload usually require many more executors than you have accelerators.
• Therefore, we use multiple jobs, writing to disk in between.
• Conclusion:
  • Good for data-parallel training on existing Spark clusters.
  • Good for integration with NVIDIA RAPIDS.
  • Bad for heterogeneous inference workloads with UDFs.
PySpark Arrow UDF Gotchas
[Diagram: Reality — data is serialized in the JVM executor and deserialized in worker.py (and back again); Ideal — the JVM executor and worker.py would use shared memory.]
▪ Implication: memory usage is doubled.
▪ JVM GC does not return physical memory back to the OS.
  ▪ Adding swap space prevents OOMs.
  ▪ Don’t set spark.executor.memory to fill the entire physical memory.
    ▪ The JVM will hog all physical memory, causing the PySpark UDF to use swap.
▪ Minimize allocations in your Python UDF.
▪ Since Java cannot handle byte arrays larger than 2 GB and some MP3 files are almost 2 GB in size, we must set spark.sql.execution.arrow.maxRecordsPerBatch=1.
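The memory-related settings above might be applied at submission time like this; the memory values and the script name `align_pipeline.py` are illustrative, not recommendations:

```shell
# Leave headroom for the Python worker: don't let the JVM claim all physical
# RAM (e.g. 20g of a 32 GB machine), give worker.py an explicit budget, and
# keep one record per Arrow batch because some MP3 blobs approach Java's
# 2 GB byte-array limit.
spark-submit \
  --conf spark.executor.memory=20g \
  --conf spark.executor.pyspark.memory=8g \
  --conf spark.sql.execution.arrow.maxRecordsPerBatch=1 \
  align_pipeline.py
```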
TPU Gotchas
• Used a TPUv3-8 Pod.
• Used Google’s lingvo codebase, but had to make several modifications in a custom fork.
  • Link at end of slides.
• Used a 4-layer, 1024-hidden-unit LSTM network trained with CTC for inference.
• Requires usage of Google Cloud Storage as your file system.
• Cloud TPUs are prone to crash, with a mean time between failures measured in hours.
  • Need to write your own “restartability” logic.
  • Not a TPU-specific problem: all “spot instances” require software redundancy.
• TPU code can’t use the tf.string data type. Must use integer primary keys for the “keyed prediction” machine learning design pattern.
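The “restartability” logic can be as simple as a checkpoint-aware retry loop. This sketch (with hypothetical names; persisted shard output plays the role of the checkpoint) re-drives inference after a crash, skipping work that already finished:

```python
def run_with_restarts(shards, run_shard, max_restarts=100):
    """Re-drive inference after accelerator failures, skipping shards
    whose output already exists (the checkpoint)."""
    done = set()
    restarts = 0
    while len(done) < len(shards):
        try:
            for shard in shards:
                if shard in done:
                    continue
                run_shard(shard)
                done.add(shard)       # persisted output acts as the checkpoint
        except RuntimeError:          # stand-in for a TPU crash / preemption
            restarts += 1
            if restarts > max_restarts:
                raise
    return done, restarts
```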
TPU Gotchas (2)
• We used the “keyed prediction” design pattern to join acoustic model output against the original transcript.
• Records are sorted by key on input to the acoustic model.
• They are no longer sorted on output.
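Keyed prediction means each output carries its input’s integer key, so output order doesn’t matter. The rejoin can be sketched in plain Python (Spark would express this as an equi-join on the key; field names are illustrative):

```python
def join_by_key(predictions, transcripts):
    # Predictions arrive in arbitrary order; the integer key recovers
    # the pairing with the original transcript.
    text_by_id = {t["id"]: t["text"] for t in transcripts}
    return [(p["id"], p["logits"], text_by_id[p["id"]]) for p in predictions]
```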
Bucketing by sequence length
Necessary to utilize modern accelerators fully
tf.data.experimental.bucket_by_sequence_length
[Diagram: records A1, A2, B1, B2, B3, C1, C2, D1 are reordered so that each batch contains sequences of similar length.]
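tf.data.experimental.bucket_by_sequence_length groups sequences of similar length into batches so padding is minimized. Its core behavior can be sketched in plain Python (the bucket boundaries and batch size here are illustrative):

```python
from bisect import bisect_right
from collections import defaultdict

def bucket_by_length(sequences, boundaries, batch_size):
    """Yield batches whose members fall in the same length bucket,
    mimicking tf.data.experimental.bucket_by_sequence_length."""
    buckets = defaultdict(list)
    for seq in sequences:
        b = bisect_right(boundaries, len(seq))  # bucket index by length
        buckets[b].append(seq)
        if len(buckets[b]) == batch_size:       # bucket full: emit a batch
            yield buckets.pop(b)
    for leftover in buckets.values():           # flush partial buckets
        yield leftover
```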
Bucketing by sequence length (2)
• A TPUv3-8 works best with a batch size of 128 * 8 = 1024.
• Sort-merge joins are expensive afterward.
  • We must join speech recognizer output against the ground-truth transcript.
  • Speech recognizer output is not small! A probability distribution over 40 tokens every 30 ms. For 86,000 hours, that’s 1.5 TiB uncompressed.
• Two solutions:
  • Map-side join: join whatever you need before using the accelerator.
    • Con: Reduces input bandwidth to the accelerator.
  • Sharding, aka partitionBy(): only need to sort each shard.
    • Con: If shards are too small, can reduce efficiency.
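The sharding option can be sketched in plain Python: hash-partition records by key (what Spark’s partitionBy() does), then sort within each shard only, instead of sorting the whole 1.5 TiB dataset:

```python
def shard_and_sort(records, num_shards, key):
    # Hash-partition (as partitionBy() would), then sort each shard locally.
    shards = [[] for _ in range(num_shards)]
    for r in records:
        shards[hash(key(r)) % num_shards].append(r)
    return [sorted(s, key=key) for s in shards]
```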
Conclusions
• Code is publicly available under Apache 2.0:
  • https://github.com/mlcommons/peoples-speech/tree/main/galvasr2/align/spark
• The ideal for sequence-based deep learning inference is for accelerators to act as an asynchronous queue, receiving input data until a batch is large enough to run efficiently.
• Would someone like to create a custom Spark Streaming sink?
• Contact: dt.galvez@gmail.com
