Disentangled Representation Learning in
Speech and Vocalization
Yusuf BRIMA
Supervisors:
Prof. Dr.-Ing. Gunther HEIDEMANN
Prof. Dr.rer.nat. Simone PIKA
Institute of Cognitive Science,
Osnabrück University
June 27, 2025
Presentation Outline
2
● Introduction and Motivation
● Research Goals and Questions
● Methodology
● Key Results
● Limitations
● Future Directions
● Conclusion
● Q&A
Biological Inspiration
3
● Humans can naturally isolate factors of variation in complex, high-dimensional data; in the audio domain, examples include speaker identity, gender, emotion, and speech content.
● We generalize across contexts (e.g., noisy
environments, varied accents, speaking styles).
What Is Disentanglement?
4
● Refers to learning distributed representations where distinct latent factors correspond to the “true” independent factors of variation in the data [1].
● Goal: learn a representation where you
can manipulate one factor without
affecting the others — just like humans
do.
[1] Bengio Y, Courville A, Vincent P. Representation learning: A review and new perspectives. IEEE transactions on pattern analysis and machine intelligence. 2013 Mar 7;35(8):1798-828.
Three Core Perspectives
5
● Transformational [2] – Separate invariant vs. variant aspects of a signal.
● Factorizational [2] – Ensure independent encoding of generative factors.
● Functional [2] – Make learned factors useful for tasks (e.g., transfer, robustness, interpretability).
[2] Williams J. Learning disentangled speech representations (Doctoral dissertation, University of Edinburgh).
Benefits of Disentanglement
6
● Improves predictive performance [2, 3]
● Reduces sample complexity [4]
● Offers interpretability [5]
● Improves fairness [6]
● Overcomes shortcut learning [7]
[2] F. Locatello, M. Tschannen, S. Bauer, G. Rätsch, B. Schölkopf, and O. Bachem, “Disentangling factors of variations using few labels,” in International Conference on Learning Representations, 2020.
[3] F. Locatello, S. Bauer, M. Lucic, G. Raetsch, S. Gelly, B. Schölkopf, and O. Bachem, “Challenging common assumptions in the unsupervised learning of disentangled representations,” in International Conference on Machine Learning, 2019.
[4] K. Ridgeway and M. C. Mozer, “Learning deep disentangled embeddings with the f-statistic loss,” in Advances in Neural Information Processing Systems, 2018.
[5] Y. Bengio, A. Courville, and P. Vincent, “Representation learning: A review and new perspectives,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, pp. 1798–1828, Aug. 2013.
[6] F. Locatello, G. Abbati, T. Rainforth, S. Bauer, B. Schölkopf, and O. Bachem, “On the fairness of disentangled representations,” in Advances in Neural Information Processing Systems, 2019.
[7] R. Geirhos, J.-H. Jacobsen, C. Michaelis, R. Zemel, W. Brendel, M. Bethge, and F. A. Wichmann, “Shortcut learning in deep neural networks,” arXiv:2004.07780, 2020.
Research Gaps
7
● Deep learning paradigms such as VAEs, GANs, and JEAs have been extensively used to learn rich audio representations [8, 9].
● However, there has been limited empirical investigation into their ability to
disentangle key factors—such as speaker identity, speaking style, and
content—using disentanglement-oriented datasets.
● Existing approaches often lack a systematic quantitative evaluation of how well
these paradigms satisfy crucial disentanglement criteria: modularity,
compactness/completeness, and explicitness/informativeness.
[8] Liu, S., Mallol-Ragolta, A., Parada-Cabaleiro, E., Qian, K., Jing, X., Kathan, A., Hu, B. and Schuller, B.W., 2022. Audio self-supervised learning: A survey. Patterns, 3(12).
[9] Mohamed, A., Lee, H.Y., Borgholt, L., Havtorn, J.D., Edin, J., Igel, C., Kirchhoff, K., Li, S.W., Livescu, K., Maaløe, L. and Sainath, T.N., 2022. Self-supervised speech
representation learning: A review. IEEE Journal of Selected Topics in Signal Processing, 16(6), pp.1179-1210.
Research Questions
8
● How can DL techniques be leveraged to effectively disentangle the
underlying explanatory factors of variation in audio?
○ What learning approaches are most effective for (re)structuring
representations of acoustic data to disentangle key factors?
○ What are the most effective methodologies for evaluating the
quality, generalization, and transferability of disentangled
representations?
Research Objectives
9
● Compare the effectiveness of varied learning paradigms in achieving disentangled
representations
● Develop and apply robust empirical methods for evaluating the quality of
disentangled representations, using latent space analysis techniques:
○ Representation Similarity Analysis (RSA)
○ Low-dimensional visualization
○ Clustering
○ And objective supervised disentanglement metrics [10].
● Analyze the structure of learned representations, focusing on their generalization
and transferability across diverse datasets and tasks using linear probing.
[10] Carbonneau, M.A., Zaidi, J., Boilard, J. and Gagnon, G., 2022. Measuring disentanglement: A review of metrics. IEEE transactions on
neural networks and learning systems, 35(7), pp.8747-8761.
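Concretely, the linear-probing protocol trains only a linear classifier on frozen embeddings; a minimal sketch (names and shapes are placeholders, not the thesis code):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Z: frozen embeddings from a pretrained encoder; y: a factor label
# such as speaker identity (both placeholders here).
Z = np.random.randn(1000, 128)
y = np.random.randint(0, 10, size=1000)

Z_tr, Z_te, y_tr, y_te = train_test_split(Z, y, test_size=0.2, random_state=0)

# Only the probe is trained; the encoder stays frozen.
probe = LogisticRegression(max_iter=1000).fit(Z_tr, y_tr)
print("probe accuracy:", accuracy_score(y_te, probe.predict(Z_te)))
```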
Thesis Contributions
10
● Development of methodologies for disentangling audio
representations
○ JEAs
○ VAE-based Latent Variable Models
● Introduction of three novel datasets
○ DeepChimp, SynTone, and SynSpeech (three versions)
● Empirical evaluation of disentangled representations
● Exploration of transferability and generalization
○ Using Latent Space Analysis and Linear Probing
Cognitive Science Link with Thesis
11
● Robust intelligence requires compositional
understanding
● Humans excel at recombining learned
factors in novel ways
● Disentangled representation learning
formalizes this cognitive principle.
"yellow car" combines the concept of "yellow"
with "car" and that you can apply "yellow" to
other objects or "car" to other colors. This
forms the basis of generalization.
General Disentanglement Framework
(Diagram: a generative process maps the ambient space of factors to the input space; an encoder maps the input space to the latent space.)
12
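In symbols, a minimal sketch of this pipeline (the slide's own equations are images, so the notation here is an assumption): generative factors v live in the ambient space, a generator g renders them into the input space, and an encoder f maps inputs to latent codes.

```latex
\mathbf{v} \sim p(\mathbf{v}), \qquad
\mathbf{x} = g(\mathbf{v}) \in \mathcal{X}, \qquad
\mathbf{z} = f(\mathbf{x}) \in \mathcal{Z}
```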
Disentanglement Criteria
● Modularity : factors should be independent (orthogonal). Variation in one factor has no causal
effect on other factors in the code space.
13
(Figure: example factors, speaker identity and accent.)
Disentanglement Criteria: Modularity
14
(Slide equations: the causal factorization, the causal conditional factorization, and the condition under which the result is a disentangled factorization.)
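A reconstruction of the standard forms these labels refer to (the slide's equations are images; this follows the usual formulation in the causal-disentanglement literature):

```latex
% Causal (conditional) factorization over factors v_1, ..., v_K:
p(v_1, \dots, v_K) = \prod_{k=1}^{K} p\!\left(v_k \mid \mathrm{pa}(v_k)\right)
% For disentanglement, no factor is a causal parent of another
% (pa(v_k) = \emptyset), so the result is the disentangled factorization:
p(v_1, \dots, v_K) = \prod_{k=1}^{K} p(v_k)
```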
Disentanglement Criteria
● Compactness : ideally a one-dimensional representation for each factor
15
Disentanglement Criteria
● Explicitness : semantically useful factors
16
(Figure: factor taxonomy. Paralinguistic: identity, gender, emotional tone, accent. Linguistic: content.)
Methodology
17
● Learning approaches utilized:
○ Joint-embedding architectures (JEAs)
■ Contrastive
■ Non-contrastive
○ Variational Auto-Encoding
● Datasets
● Evaluation Metrics
Source
Contrastive Representation Learning of Audio [R4]
18
Contrastive Representation Learning of Audio [R4]
19
(Diagram: (A) Supervised baseline, where 2048-D encoder features are projected to 128-D and classified with a softmax (predicted label "Lou") under cross-entropy. (B) Contrastive supervised, where Stage 1 trains shared-weight encoders with a contrastive loss on 128-D embeddings, and Stage 2 fine-tunes an 11-D softmax classifier with cross-entropy.)
Triplet-Based Contrastive JEA
20
(Diagram: anchor, positive, and negative inputs pass through a shared encoder.)
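A minimal PyTorch sketch of the triplet objective this architecture optimizes (margin and embedding sizes are illustrative assumptions):

```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Pull the anchor toward the positive and push it away from the
    negative by at least the margin."""
    d_pos = F.pairwise_distance(anchor, positive)  # same-class pair
    d_neg = F.pairwise_distance(anchor, negative)  # different-class pair
    return F.relu(d_pos - d_neg + margin).mean()

# Illustrative 128-D embeddings produced by the shared encoder.
a, p, n = (torch.randn(32, 128) for _ in range(3))
print(triplet_loss(a, p, n))
```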
Supervised Contrastive JEA
21
(Diagram: an anchor with multiple positives and negatives is mapped into a shared embedding space; the loss equation follows on the slide.)
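The slide's equation is an image; as a hedged stand-in, here is the standard supervised contrastive formulation (Khosla et al., 2020) that such architectures typically instantiate, sketched in PyTorch:

```python
import torch
import torch.nn.functional as F

def supcon_loss(z, labels, tau=0.1):
    """Supervised contrastive loss: every same-label sample is a positive
    for the anchor; all remaining samples act as negatives."""
    z = F.normalize(z, dim=1)
    sim = z @ z.t() / tau                         # pairwise similarities
    self_mask = torch.eye(len(z), dtype=torch.bool)
    sim.masked_fill_(self_mask, float("-inf"))    # drop self-pairs
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos = (labels[:, None] == labels[None, :]) & ~self_mask
    # average log-probability over each anchor's positives
    return -(log_prob * pos).sum(1).div(pos.sum(1).clamp(min=1)).mean()

z, labels = torch.randn(16, 128), torch.randint(0, 4, (16,))
print(supcon_loss(z, labels))
```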
Barlow Twins Non-contrastive JEA [R2]
22
(Diagram. Stage 1: distorted views of the original input pass through a shared encoder and projector; the empirical cross-correlation of the embeddings along the feature dimension is driven toward the identity target. Stage 2: a linear layer is trained on frozen weights to predict the classes.)
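A minimal PyTorch sketch of the Stage 1 Barlow Twins objective (λ and dimensions are illustrative; see [R2] for the thesis setup):

```python
import torch

def barlow_twins_loss(z1, z2, lam=5e-3):
    """Drive the empirical cross-correlation of two views' embeddings
    toward the identity: invariance on the diagonal, redundancy
    reduction off the diagonal."""
    n, _ = z1.shape
    z1 = (z1 - z1.mean(0)) / z1.std(0)   # standardize each feature
    z2 = (z2 - z2.mean(0)) / z2.std(0)
    c = (z1.t() @ z2) / n                # empirical correlation matrix (d x d)
    on_diag = (torch.diagonal(c) - 1).pow(2).sum()
    off_diag = (c - torch.diag(torch.diagonal(c))).pow(2).sum()
    return on_diag + lam * off_diag

z1, z2 = torch.randn(64, 128), torch.randn(64, 128)
print(barlow_twins_loss(z1, z2))
```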
Invariance and Redundancy
23
(Figure: representation manifold.)
SynTone Learning Architecture [R1]
24
(Diagram: input → encoder → decoder → reconstructed input.)
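The objective behind this encoder–decoder setup, sketched as a β-weighted ELBO in PyTorch (β = 1 recovers the vanilla VAE; shapes are illustrative assumptions):

```python
import torch
import torch.nn.functional as F

def beta_vae_loss(x, x_hat, mu, logvar, beta=4.0):
    """beta-weighted ELBO: a reconstruction term plus a KL penalty that
    tightens the information bottleneck as beta grows."""
    recon = F.mse_loss(x_hat, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + beta * kl

# Illustrative shapes: a batch of 1-second 16 kHz waveforms, 32-D latents.
x, x_hat = torch.randn(8, 16000), torch.randn(8, 16000)
mu, logvar = torch.randn(8, 32), torch.randn(8, 32)
print(beta_vae_loss(x, x_hat, mu, logvar))
```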
SynSpeech RAVE-inspired Learning Architecture [R3]
25
(Diagram. Stage 1, representation learning: the input undergoes multiband decomposition, passes through the encoder and decoder, and is recomposed; training minimizes a multiband spectral distance. Stage 2, adversarial fine-tuning: the encoder is frozen with a stop-gradient (sg) while the decoder is trained against a discriminator.)
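The Stage 1 reconstruction term compares signals in the spectral domain; a sketch of a multiscale spectral distance in that spirit (FFT sizes and weighting are assumptions, not the exact RAVE loss):

```python
import torch

def multiscale_spectral_distance(x, y, fft_sizes=(2048, 1024, 512)):
    """Magnitude-spectrogram distance averaged over several STFT
    resolutions, combining linear and log-magnitude terms."""
    loss = 0.0
    for n_fft in fft_sizes:
        win = torch.hann_window(n_fft)
        X = torch.stft(x, n_fft, hop_length=n_fft // 4, window=win,
                       return_complex=True).abs()
        Y = torch.stft(y, n_fft, hop_length=n_fft // 4, window=win,
                       return_complex=True).abs()
        loss += (X - Y).abs().mean() + (X.log1p() - Y.log1p()).abs().mean()
    return loss / len(fft_sizes)

x, y = torch.randn(2, 16000), torch.randn(2, 16000)
print(multiscale_spectral_distance(x, y))
```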
Datasets: SynTone
26
● Basic Overview
○ Total Samples: 32,000 unique audio samples
○ Sample Duration: 1 second each
○ Sampling Rate: 16kHz
● Generative Parameters (Factors):
○ Each audio sample is synthesized by systematically varying three
independent generative parameters:
■ Timbre : Sine, Triangle, Square, Sawtooth waveforms
■ Amplitude : 20 discrete levels, ranging from 0 to 1
■ Frequency : 400 discrete steps, ranging from 440 to 8000Hz
● The dataset is formally structured as a Cartesian product of the factor sets:
○ Each sample corresponds to a unique (timbre, amplitude, frequency) tuple, giving 4 × 20 × 400 = 32,000 samples.
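Because the factors fully determine each sample, generation reduces to a few lines; a sketch of how such samples could be rendered (the dataset's exact synthesis code is an assumption):

```python
import numpy as np
from scipy import signal

SR, DUR = 16000, 1.0
t = np.linspace(0, DUR, int(SR * DUR), endpoint=False)

def syntone_sample(timbre, amplitude, frequency):
    """Render one sample from its (timbre, amplitude, frequency) tuple."""
    phase = 2 * np.pi * frequency * t
    wave = {
        "sine": np.sin(phase),
        "triangle": signal.sawtooth(phase, width=0.5),
        "square": signal.square(phase),
        "sawtooth": signal.sawtooth(phase),
    }[timbre]
    return amplitude * wave

x = syntone_sample("square", amplitude=0.5, frequency=440.0)
```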
Sample Sine Waveform
27
Curation of SynSpeech Dataset
● Neural Speech Synthesis using
○ Speaking accent {American, British, Australian, Indian}
○ Speaker gender {Male, Female}
○ Speaker identity: 251 LibriSpeech-100 speakers [11], each with ~16 s of speech at 16 kHz
○ Speaking Styles {Default, Friendly, Sad, Whispering}
○ Linguistic content on diverse topics generated using an LLM
● We utilize OpenVoice [12] TTS, a flow-based non-autoregressive generative model
28
[11] Panayotov, V., Chen, G., Povey, D. and Khudanpur, S., 2015, April. Librispeech: an asr corpus based on public domain audio books. In 2015 IEEE
international conference on acoustics, speech and signal processing (ICASSP) (pp. 5206-5210). IEEE.
[12] Qin, Zengyi, et al. "OpenVoice: Versatile Instant Voice Cloning." arXiv preprint arXiv:2312.01479 (2023)
Curation of SynSpeech Dataset
29
(Diagram: spoken text, speaker ID, and speaking style are fed into the neural speech synthesizer, which outputs the generated utterance.)
SynSpeech Dataset Versions
30
Version | Speakers | Size (GB) | Speaking Styles | Content Items | Total Samples | DOI | Total Downloads
Small | 50 | 4.87 | 1 | 500 | 25,000 | Link | 63
Medium | 25 | 10.68 | 4 | 500 | 50,000 | Link | 41
Large | 249 | 21.86 | 4 | 500 | 109,560 | Link | ~35
https://synspeech.github.io/
Datasets: DeepChimp
31
● Audio recordings made with an external microphone (Sennheiser ME400) within a 30-meter radius of 11 adult male chimpanzees
● Collected over 16 months (non-consecutively) between 2018 and 2020 at the Rekambo community in Loango National Park, Gabon
● Approach: focal animal sampling with continuous recording
● Call type: pant hoots
● A total of 551 variable-length calls (~6 hours) at 44.1 kHz
External Real-world Datasets
32
Name | Samples | Classes | Duration (hrs) | Usage
VoxCeleb-1 [13] | 148,642 | 1,211 | 340.39 | Upstream
LibriSpeech-100 [11] | 14,385 | 128 | 100 | Upstream
LibriSpeech-360 [11] | 104,935 | 921 | 360 | Upstream
Speech Commands [14] | 7,985 | 2 | 2.18 | Downstream
ESD [15] | 7,000 | 2 | 5.52 | Downstream
WLUC [16] | 7,500 | 5 | 2.05 | Downstream
[13] Nagrani, A., Chung, J.S. and Zisserman, A., 2017. Voxceleb: a large-scale speaker identification dataset. arXiv preprint arXiv:1706.08612.
[14] Warden, P., 2018. Speech commands: A dataset for limited-vocabulary speech recognition. arXiv preprint arXiv:1804.03209.
[15] Zhou, K., Sisman, B., Liu, R. and Li, H., 2022. Emotional voice conversion: Theory, databases and esd. Speech Communication, 137, pp.1-18.
[16] A. R. (online speech bank), “World Leaders Address the U.S. Congress,” 2011.
Evaluation Metrics: An Overview
33
● To assess the effectiveness of learned audio representations in
terms of:
○ Disentanglement : How individual latent dimensions capture
distinct underlying factors of variation.
○ Generalization & Transferability : Assess adaptability to
downstream tasks and robustness across diverse datasets,
verifying practical utility.
Disentanglement Evaluation: Predictor-based
● SAP (Separated Attribute Predictability) : Measures how well individual latent
dimensions predict known factors. High = compact factor encoding.
● Explicitness Score : Captures how easily factor values can be linearly decoded
from the latent space.
Evaluate how usable and interpretable the latent codes are.
34
Disentanglement Evaluation: Intervention-based
● IRS (Interventional Robustness Score) : Assesses whether the representation of
a target factor stays consistent when other (nuisance) factors change.
Test how stable latent codes are under controlled factor variations.
35
Disentanglement Evaluation: Information-based
● MIG (Mutual Information Gap) : Measures how uniquely a factor maps to one
latent code dimension.
● JEMMIG : A holistic version of MIG that adds joint entropy to penalize shared
or entangled codes.
● Modularity : Evaluates whether each latent dimension captures only one factor
(and not others).
Quantify statistical alignment between latent codes and
data-generating factors.
36
Generalization and Transferability Metrics
37
Generalization and Transferability Metrics
38
For Representation Similarity Analysis (RSA), we used the similarity measure shown on the slide:
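As a hedged illustration of a standard RSA pipeline (correlation-distance RDMs compared with Spearman's ρ is a common choice, assumed here):

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

def rsa_score(Z_a, Z_b):
    """Correlate the representational dissimilarity matrices (RDMs) of
    two representations of the same stimuli."""
    rdm_a = pdist(Z_a, metric="correlation")  # condensed upper triangle
    rdm_b = pdist(Z_b, metric="correlation")
    rho, _ = spearmanr(rdm_a, rdm_b)
    return rho

Z_a, Z_b = np.random.randn(100, 128), np.random.randn(100, 64)
print(rsa_score(Z_a, Z_b))
```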
Results
39
● VAE-based Representation Structure in Factorizational Disentanglement
○ With SynTone
○ With SynSpeech
● Linear Probing as a Proxy for Factorizational Disentanglement in SynSpeech
● Non-contrastive JEA-based Factorizational Disentanglement
● Transferability and Generalization
○ Non-contrastive JEA Downstream Tasks
○ Contrastive JEA Downstream Task
● Impact of Different Learning Approaches
Factorizational Disentanglement with SynTone [R1]
40
[R1] Brima, Y., Krumnack, U., Pika, S. and Heidemann, G., 2024. Learning disentangled audio representations through controlled synthesis. In: The Second Tiny
Papers Track at ICLR 2024, Tiny Papers @ ICLR 2024, Vienna, Austria, 11 May 2024. OpenReview.net. Available at: https://openreview.net/forum?id=Fn9ORH8PLl
Factorizational Disentanglement with SynTone [R1]
41
[R1] Brima, Y., Krumnack, U., Pika, S. and Heidemann, G., 2024. Learning disentangled audio representations through controlled synthesis. In: The Second Tiny
Papers Track at ICLR 2024, Tiny Papers @ ICLR 2024, Vienna, Austria, 11 May 2024. OpenReview.net. Available at: https://openreview.net/forum?id=Fn9ORH8PLl
Compactness–Modularity Trade-off [R1]
42
[R1] Brima, Y., Krumnack, U., Pika, S. and Heidemann, G., 2024. Learning disentangled audio representations through controlled synthesis. In: The Second Tiny
Papers Track at ICLR 2024, Tiny Papers @ ICLR 2024, Vienna, Austria, 11 May 2024. OpenReview.net. Available at: https://openreview.net/forum?id=Fn9ORH8PLl
Bottleneck Type | Compactness (MIG, SAP) | Modularity | Explanation
Loose (Vanilla VAE) | ✅ High | ❌ Low | Factors are well encoded in a subset of the code space, but entangled.
Tight (β-VAE, etc.) | ❌ Low | ✅ High | Forces independence, but factor information is fragmented across a few code dimensions.
A relaxed information bottleneck has more channel capacity, but if left unconstrained it introduces redundancies across these channels. This is what we refer to as the compactness–modularity trade-off.
Factorizational Disentanglement with SynTone [R1]
43
[R1] Brima, Y., Krumnack, U., Pika, S. and Heidemann, G., 2024. Learning disentangled audio representations through controlled synthesis. In: The Second Tiny
Papers Track at ICLR 2024, Tiny Papers @ ICLR 2024, Vienna, Austria, 11 May 2024. OpenReview.net. Available at: https://openreview.net/forum?id=Fn9ORH8PLl
Factorizational Disentanglement with SynSpeech [R3]
44
[R3] Brima, Y., Krumnack, U., Pika, S. and Heidemann, G., 2023. Learning disentangled speech representations. New in ML Workshop, NeurIPS 2023, 31 October.
Available at: https://openreview.net/forum?id=3ox1TfKeRF
Linear probes predicting factors
from code dimensions.
Factorizational Disentanglement with SynSpeech [R3]
45
[R3] Brima, Y., Krumnack, U., Pika, S. and Heidemann, G., 2023. Learning disentangled speech representations. New in ML Workshop, NeurIPS 2023, 31 October.
Available at: https://openreview.net/forum?id=3ox1TfKeRF
Factorizational Disentanglement with SynSpeech [R3]
46
[R3] Brima, Y., Krumnack, U., Pika, S. and Heidemann, G., 2023. Learning disentangled speech representations. New in ML Workshop, NeurIPS 2023, 31 October.
Available at: https://openreview.net/forum?id=3ox1TfKeRF
Factorizational Disentanglement with SynSpeech [R3]
47
[R3] Brima, Y., Krumnack, U., Pika, S. and Heidemann, G., 2023. Learning disentangled speech representations. New in ML Workshop, NeurIPS 2023, 31 October.
Available at: https://openreview.net/forum?id=3ox1TfKeRF
Non-contrastive JEA Representation Structure [R2]
48
[R2] Brima, Y., Krumnack, U., Pika, S. and Heidemann, G., 2024. Understanding Self-Supervised Learning of Speech Representation via Invariance and Redundancy
Reduction. Information, 15(2), p.114.
Non-contrastive JEA Factorizational Disentanglement [R2]
49
[R2] Brima, Y., Krumnack, U., Pika, S. and Heidemann, G., 2024. Understanding Self-Supervised Learning of Speech Representation via Invariance and Redundancy
Reduction. Information, 15(2), p.114.
Non-contrastive JEA Downstream Tasks [R2]
50
[R2] Brima, Y., Krumnack, U., Pika, S. and Heidemann, G., 2024. Understanding Self-Supervised Learning of Speech Representation via Invariance and Redundancy
Reduction. Information, 15(2), p.114.
Contrastive JEA Representation Structure [R4]
51
[R4] Brima, Y., Southern, L., Krumnack, U., Heidemann, G. and Pika, S., Deep learning for recognizing individual chimpanzees from vocalizations. In submission
to Scientific Reports.
Contrastive JEA Representation Structure [R4]
52
[R4] Brima, Y., Southern, L., Krumnack, U., Heidemann, G. and Pika, S., Deep learning for recognizing individual chimpanzees from vocalizations. In submission
to Scientific Reports.
Contrastive JEA Downstream Tasks [R4]
53
[R4] Brima, Y., Southern, L., Krumnack, U., Heidemann, G. and Pika, S., Deep learning for recognizing individual chimpanzees from vocalizations. In submission
to Scientific Reports.
Contrastive JEA Downstream Explainability [R4]
54
[R4] Brima, Y., Southern, L., Krumnack, U., Heidemann, G. and Pika, S., Deep learning for recognizing individual chimpanzees from vocalizations. In submission
to Scientific Reports.
Limitations
55
● Complex‑Factor Entanglement
○ Prosody & content remain conflated in real‑world speech
● Dataset Constraints
○ DeepChimp size limits model robustness; synthetic sets lack environmental
noise variation
● Metric Gaps
○ Temporal/hierarchical dependencies under‑captured by current disentanglement
scores
● Compute & Architecture
○ Scalability of high‑capacity VAEs and BT models constrained by GPU memory
Future Directions
56
● Multimodal & Hierarchical Models
○ For example, the fusion of audio + video (lip motion) with attention architectures
for finer generative control
● Domain-specific Inductive Priors
○ For factors that are temporal or hierarchical, impose constraints that let models learn these structures
● Dataset Expansion & Realism
○ Curate larger and more varied collections of non-human primate vocalizations to improve robustness and generalization
● Advanced Metrics
○ Develop temporally aware, audio-specific disentanglement metrics
Conclusions & Takeaways
57
● Disentangled audio representations (generative + JEA) enhance
interpretability, robustness, transfer, etc.
● Controlled synthetic benchmarks (e.g., SynTone/SynSpeech) + real (e.g.,
DeepChimp) crucial for systematic evaluation of methods.
● No single paradigm “wins”: choose generative for fine‑grained factor
control, contrastive for representation transfer, non-contrastive JEAs for
balance.
● Opens pathways for fair, controllable speech & bioacoustic technologies
Thank you for your attention!
58
Triplet-Based Contrastive JEA
59
Supervised Contrastive JEA
60
Barlow Twins Non-contrastive JEA [R2]
61
SynSpeech Learning Architecture [R3]
62
(Diagram: input → encoder → decoder → vocoder → reconstructed input.)
SynTone Learning Architecture [R1]
63
Factor-VAE KL Decomposition
SynSpeech RAVE-inspired Learning Architecture [R3]
64
Phase 1: Representation Learning with Spectral Distance and VAE Loss
SynSpeech RAVE-inspired Learning Architecture [R3]
65
Phase 2: Adversarial Fine-Tuning for Enhanced Realism
To avoid mode collapse, a feature matching loss is further added
Supervised Disentanglement Metrics
66
Based on the factorizational disentanglement framework [17].
Notation: (symbols defined on the slide).
[17] Carbonneau, M.A., Zaidi, J., Boilard, J. and Gagnon, G., 2022. Measuring disentanglement: A review of metrics. IEEE transactions on
neural networks and learning systems, 35(7), pp.8747-8761.
Comparison: supervised disentanglement metrics compare learned latent codes against the ground-truth generative factors.
Categories: predictor-based, intervention-based, information-based.
Supervised Disentanglement Metrics
67
Ambient space Input space Latent space
Predictor-based Methods
68
Key Concept : Train a classifier or regressor to predict
factor values directly from latent codes.
Explicitness Score (Explicitness)
69
● Evaluates the ability to predict factor values from latent codes [18].
● Calculated by training a simple classifier (e.g., logistic regression) on the entire latent code.
● Performance measured by AUC-ROC, averaged across factors.
● Scores are normalized to [0, 1]; higher indicates greater interpretability.
[18] K. Ridgeway and M. C. Mozer, “Learning deep disentangled embeddings with the f-statistic loss,” Advances in neural information processing
systems, vol. 31, 2018.
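A minimal sketch of the metric's recipe (library choices and the omission of the [0, 1] rescaling are simplifying assumptions):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def explicitness(Z, factors):
    """Mean AUC-ROC of logistic-regression probes trained on the full
    latent code, one probe per factor."""
    scores = []
    for y in factors.T:                        # one column per factor
        clf = LogisticRegression(max_iter=1000).fit(Z, y)
        proba = clf.predict_proba(Z)
        auc = (roc_auc_score(y, proba, multi_class="ovr")
               if proba.shape[1] > 2 else roc_auc_score(y, proba[:, 1]))
        scores.append(auc)
    return float(np.mean(scores))              # rescaling to [0, 1] omitted

Z = np.random.randn(500, 32)
factors = np.random.randint(0, 3, size=(500, 2))
print(explicitness(Z, factors))
```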
Separated Attribute Predictability (SAP) (Compactness)
70
● Assesses how well each latent code dimension predicts ground-truth factors [19].
● Continuous factors: R² from a linear regressor.
● Categorical factors: balanced accuracy from a decision tree classifier.
● Formula (reconstructed from [19]): for each factor, take the gap between the two highest per-dimension scores and average over factors:
○ $\mathrm{SAP} = \frac{1}{K}\sum_{k=1}^{K}\bigl(S_{j_k^{(1)},\,k} - S_{j_k^{(2)},\,k}\bigr)$
● Higher score indicates better compactness (large differences
between top two scores).
[19] A. Kumar, P. Sattigeri, and A. Balakrishnan, “Variational inference of disentangled latent concepts from unlabeled observations,” arXiv preprint
arXiv:1711.00848, 2017.
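A sketch of SAP for categorical factors, following the recipe above (tree depth and shapes are illustrative):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import balanced_accuracy_score

def sap(Z, factors):
    """Per factor: score every single latent dimension with a one-feature
    classifier, then average the gap between the two best dimensions."""
    gaps = []
    for y in factors.T:
        scores = []
        for j in range(Z.shape[1]):
            clf = DecisionTreeClassifier(max_depth=3).fit(Z[:, [j]], y)
            scores.append(balanced_accuracy_score(y, clf.predict(Z[:, [j]])))
        top2 = np.sort(scores)[-2:]
        gaps.append(top2[1] - top2[0])           # best minus second best
    return float(np.mean(gaps))

Z = np.random.randn(500, 16)
factors = np.random.randint(0, 4, size=(500, 2))
print(sap(Z, factors))
```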
Intervention-based Methods
71
Key Concept: Assess disentanglement holistically or based on
modularity by examining stability under controlled variations.
Focus : Robustness of factor representations to perturbations
while considering dependencies.
Interventional Robustness Score (IRS) (Holistic)
72
● Measures the stability of target factors in the latent space [20].
● Calculated by comparing representation means when target factors are fixed vs. when nuisance factors vary (e.g., using a norm of the difference).
● Highlights how well the model isolates relevant factors.
[20] R. Suter, D. Miladinovic, B. Schölkopf, and S. Bauer, “Robustly disentangled causal mechanisms: Validating deep representations for interventional
robustness,” in International Conference on Machine Learning, pp. 6056–6065, PMLR, 2019.
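A deliberately simplified IRS-style check (the full definition in [20] intervenes on the generative factors; this sketch only measures latent drift within groups that share a target-factor value):

```python
import numpy as np

def irs_sketch(Z, target):
    """For each target-factor value, measure how far codes stray from the
    group mean while nuisance factors vary; 1.0 means perfectly robust."""
    scale = np.abs(Z - Z.mean(0)).max()
    deviations = []
    for v in np.unique(target):
        group = Z[target == v]                 # target fixed, nuisances vary
        drift = np.abs(group - group.mean(0)).max(axis=1)
        deviations.append(drift.max())
    return 1.0 - np.mean(deviations) / scale

Z = np.random.randn(500, 16)
target = np.random.randint(0, 4, size=500)
print(irs_sketch(Z, target))
```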
Information-based Metrics
73
Key Concept: Utilize entropy and mutual information (MI) to assess modularity,
compactness, or holistic disentanglement.
Entropy – measures the amount of uncertainty in a given factor/code.
MI – measures the dependency between a factor and a code.
Example: we need log₂ 6 ≈ 2.585 bits of information to describe the outcome of a fair six-sided die roll.
MI is 0 iff two random variables are statistically independent.
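The two quantities, computed concretely (the fair-die example from the slide):

```python
import numpy as np
from sklearn.metrics import mutual_info_score

# Entropy of a fair six-sided die: log2(6) ≈ 2.585 bits.
p = np.full(6, 1 / 6)
print(-(p * np.log2(p)).sum())

# MI between a factor and an independent code is ~0.
factor = np.random.randint(0, 4, size=10000)
code = np.random.randint(0, 4, size=10000)          # independent of factor
print(mutual_info_score(factor, code) / np.log(2))  # nats -> bits, ≈ 0
```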
Mutual Information Gap (MIG) (Compactness)
74
Measures the gap between the most and second-most associated latent dimensions for each factor [21].
[21] R. T. Chen, X. Li, R. B. Grosse, and D. K. Duvenaud, “Isolating sources of disentanglement in variational autoencoders,” in NeurIPS, pp. 2615–2625,
2018.
This metric captures how strongly each factor aligns with a specific latent dimension.
The average is reported as the overall score.
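A sketch of MIG with histogram-discretized codes (binning and normalization details are assumptions; see [21]):

```python
import numpy as np
from scipy.stats import entropy
from sklearn.metrics import mutual_info_score

def mig(Z, factors, bins=20):
    """Per factor: gap between the two most informative latent dimensions,
    normalized by the factor's entropy, then averaged over factors."""
    Zd = np.stack([np.digitize(z, np.histogram(z, bins)[1][:-1]) for z in Z.T])
    gaps = []
    for y in factors.T:
        mi = np.array([mutual_info_score(y, zj) for zj in Zd])  # nats
        top = np.sort(mi)[-2:]
        gaps.append((top[1] - top[0]) / entropy(np.bincount(y) / len(y)))
    return float(np.mean(gaps))

Z = np.random.randn(1000, 16)
factors = np.random.randint(0, 5, size=(1000, 2))
print(mig(Z, factors))
```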
Disentanglement, Completeness, Informativeness Mutual Information
Gap (DCIMIG): Holistic
75
Computes the MI gap between a factor and its most informative latent dimension, assessing gaps in MI across the different latent dimensions for that factor [22].
[22] R. T. Chen, X. Li, R. B. Grosse, and D. K. Duvenaud, “Isolating sources of disentanglement in variational autoencoders,” in NeurIPS, pp. 2615–2625,
2018.
This gives a much broader view of the factor–code relationship than MIG (the defining equation is shown on the slide).
Joint Entropy Minus Mutual Information Gap (JEMMIG) (Holistic)
76
Builds on MIG but incorporate Joint Entropy between factor and code to achieve
holistic evaluation23
.
[23] K. Do and T. Tran, “Theory and evaluation metrics for learning disentangled representations,” arXiv preprint arXiv:1908.09961, 2019.
Unlike MIG, lower JEMMIG scores indicate better disentanglement.
A normalization step (shown on the slide) maps the scores onto the closed interval [0, 1].
Modularity Score (Modularity)
77
This metric quantifies the relative MI of a primary factor compared to all other factors, rewarding isolated factor representations in the latent space [24].
[24] K. Ridgeway and M. C. Mozer, “Learning deep disentangled embeddings with the f-statistic loss,” Advances in neural information processing systems,
vol. 31, 2018.
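A sketch following the template in [24]: each code dimension's MI should concentrate on a single factor, and squared MI mass elsewhere is penalized (discretization of the codes is assumed to have been done upstream):

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def modularity(Z_discrete, factors):
    """1 - deviation from the ideal one-factor-per-dimension MI template,
    averaged over code dimensions."""
    scores = []
    for zj in Z_discrete.T:
        mi = np.array([mutual_info_score(f, zj) for f in factors.T])
        theta = mi.max()                       # MI with the best factor
        if theta == 0:
            continue                           # dimension carries no info
        delta = ((mi ** 2).sum() - theta ** 2) / (theta ** 2 * (len(mi) - 1))
        scores.append(1.0 - delta)
    return float(np.mean(scores))

Zd = np.random.randint(0, 10, size=(1000, 16))
factors = np.random.randint(0, 4, size=(1000, 3))
print(modularity(Zd, factors))
```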
OpenVoice
arXiv:2312.01479
Non-contrastive JEA Downstream Tasks [R2]
79
[R2] Brima, Y., Krumnack, U., Pika, S. and Heidemann, G., 2024. Understanding Self-Supervised Learning of Speech Representation via Invariance and Redundancy
Reduction. Information, 15(2), p.114.
Contrastive JEA Out-of-domain Test [R4]
80
[R4] Brima, Y., Southern, L., Krumnack, U., Heidemann, G. and Pika, S., Deep learning for recognizing individual chimpanzees from vocalizations. In submission
to Scientific Reports.
Key Findings
81
● Disentanglement Achievability: Disentangled representations are feasible, especially in
functional settings, evidenced by strong downstream task performance (speaker
identification, speaking style recognition, keyword spotting, gender recognition).
● Context-Dependent Disentanglement:
○ Simpler contexts (e.g., SynTone) enable effective factor separation.
○ More complex settings (e.g., SynSpeech) involve manageable trade-offs between
compactness and modularity.
● Non-contrastive JEA Efficacy: Effectively balances transformative and factorizational
disentanglement, showing robust performance in downstream applications.
● Contrastive JEA Strength: Excels in transformative disentanglement, supported by
linear probing, RSA, clustering, and low-dimensional projections.
● Transferability: Self-supervised upstream representations significantly improve
downstream task performance, even with small target datasets (5-10%).
Research Contributions
82
● New Benchmark Datasets
○ SynTone & SynSpeech: fully controllable synthetic audio isolating timbre, pitch,
prosody
○ DeepChimp: real-world chimpanzee calls with annotated individual identities
● Improved Generative Models
○ Adapted β‑VAE and Factor‑VAE for audio: better trade‑off between
reconstruction fidelity and latent factor disentanglement
● Joint‑Embedding & Contrastive Learning
○ Applied Barlow Twins and Supervised Contrastive loss to learn speaker‑ and
factor‑invariant representations
● Robust Evaluation Suite
○ Combined disentanglement metrics (MIG, SAP) with interventional robustness
(IRS) and downstream probes (speaker ID, emotion)
Sample Waveforms
Sample Log Spectrograms
Sample Classifier Predictions