Disentangled Representation Learning in Speech and Vocalization
1. Disentangled Representation Learning in
Speech and Vocalization
Yusuf BRIMA
Supervisors:
Prof. Dr.-Ing. Gunther HEIDEMANN
Prof. Dr.rer.nat. Simone PIKA
Institute of Cognitive Science,
Osnabrück University
June 27, 2025
2. Presentation Outline
● Introduction and Motivation
● Research Goals and Questions
● Methodology
● Key Results
● Limitations
● Future Directions
● Conclusion
● Q&A
3. Biological Inspiration
● Humans can naturally isolate factors of variation in often complex, high-dimensional data; in the audio domain, examples include speaker identity, gender, emotion, and speech content.
● We generalize across contexts (e.g., noisy
environments, varied accents, speaking styles).
4. What Is Disentanglement?
● Refers to learning distributed representations in which distinct latent factors correspond to the “true” independent factors of variation in the data [1].
● Goal: learn a representation where you
can manipulate one factor without
affecting the others — just like humans
do.
[1] Bengio Y, Courville A, Vincent P. Representation learning: A review and new perspectives. IEEE transactions on pattern analysis and machine intelligence. 2013 Mar 7;35(8):1798-828.
5. Three Core Perspectives
● Transformational [2] – Separate invariant vs. variant aspects of a signal.
● Factorizational [2] – Ensure independent encoding of generative factors.
● Functional [2] – Make learned factors useful for tasks (e.g., transfer, robustness, interpretability).
[2] Williams J. Learning disentangled speech representations (Doctoral dissertation, University of Edinburgh).
6. Benefits of Disentanglement
● Improves predictive performance [2,3]
● Reduces sample complexity [4]
● Offers interpretability [5]
● Improves fairness [6]
● Overcomes shortcut learning [7]
[2] F. Locatello, M. Tschannen, S. Bauer, G. Rätsch, B. Schölkopf, and O. Bachem, “Disentangling factors of variations using few labels,” in International Conference on Learning Representations, 2020.
[3] F. Locatello, S. Bauer, M. Lucic, G. Raetsch, S. Gelly, B. Schölkopf, and O. Bachem, “Challenging common assumptions in the unsupervised learning of disentangled representations,” in International Conference on Machine Learning, 2019.
[4] K. Ridgeway and M. C. Mozer, “Learning deep disentangled embeddings with the f-statistic loss,” in Advances in Neural Information Processing Systems, 2018.
[5] Y. Bengio, A. Courville, and P. Vincent, “Representation learning: A review and new perspectives,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, pp. 1798–1828, Aug. 2013.
[6] F. Locatello, G. Abbati, T. Rainforth, S. Bauer, B. Schölkopf, and O. Bachem, “On the fairness of disentangled representations,” in Advances in Neural Information Processing Systems, 2019.
[7] R. Geirhos, J.-H. Jacobsen, C. Michaelis, R. Zemel, W. Brendel, M. Bethge, and F. A. Wichmann, “Shortcut learning in deep neural networks,” arXiv:2004.07780, 2020.
7. Research Gaps
● Deep learning paradigms such as VAEs, GANs, and JEAs have been extensively used to learn rich audio representations [8,9].
● However, there has been limited empirical investigation into their ability to
disentangle key factors—such as speaker identity, speaking style, and
content—using disentanglement-oriented datasets.
● Existing approaches often lack a systematic quantitative evaluation of how well
these paradigms satisfy crucial disentanglement criteria: modularity,
compactness/completeness, and explicitness/informativeness.
[8] Liu, S., Mallol-Ragolta, A., Parada-Cabaleiro, E., Qian, K., Jing, X., Kathan, A., Hu, B. and Schuller, B.W., 2022. Audio self-supervised learning: A survey. Patterns, 3(12).
[9] Mohamed, A., Lee, H.Y., Borgholt, L., Havtorn, J.D., Edin, J., Igel, C., Kirchhoff, K., Li, S.W., Livescu, K., Maaløe, L. and Sainath, T.N., 2022. Self-supervised speech
representation learning: A review. IEEE Journal of Selected Topics in Signal Processing, 16(6), pp.1179-1210.
8. Research Questions
● How can DL techniques be leveraged to effectively disentangle the
underlying explanatory factors of variation in audio?
○ What learning approaches are most effective for (re)structuring
representations of acoustic data to disentangle key factors?
○ What are the most effective methodologies for evaluating the
quality, generalization, and transferability of disentangled
representations?
9. Research Objectives
● Compare the effectiveness of varied learning paradigms in achieving disentangled
representations
● Develop and apply robust empirical methods for evaluating the quality of
disentangled representations, using latent space analysis techniques:
○ Representation Similarity Analysis (RSA)
○ Low-dimensional visualization
○ Clustering
○ And objective supervised disentanglement metrics [10].
● Analyze the structure of learned representations, focusing on their generalization
and transferability across diverse datasets and tasks using linear probing.
[10] Carbonneau, M.A., Zaidi, J., Boilard, J. and Gagnon, G., 2022. Measuring disentanglement: A review of metrics. IEEE transactions on
neural networks and learning systems, 35(7), pp.8747-8761.
10. Thesis Contributions
● Development of methodologies for disentangling audio
representations
○ JEAs
○ VAE-based Latent Variable Models
● Introduction of three novel datasets
○ DeepChimp, SynTone, and Synspeech (three versions)
● Empirical evaluation of disentangled representations
● Exploration of transferability and generalization
○ Using Latent Space Analysis and Linear Probing
11. Cognitive Science Link with Thesis
● Robust intelligence requires compositional
understanding
● Humans excel at recombining learned
factors in novel ways
● Disentangled representation learning
formalizes this cognitive principle.
“Yellow car” combines the concept of “yellow” with “car”, and you can apply “yellow” to other objects or “car” to other colors. This forms the basis of generalization.
13. Disentanglement Criteria
● Modularity: factors should be independent (orthogonal). Variation in one factor has no causal effect on other factors in the code space.
(Diagram: speaker identity and accent as two independent factors.)
26. Datasets: SynTone
● Basic Overview
○ Total Samples: 32,000 unique audio samples
○ Sample Duration: 1 second each
○ Sampling Rate: 16 kHz
● Generative Parameters (Factors):
○ Each audio sample is synthesized by systematically varying three
independent generative parameters:
■ Timbre: Sine, Triangle, Square, Sawtooth waveforms
■ Amplitude: 20 discrete levels, ranging from 0 to 1
■ Frequency: 400 discrete steps, ranging from 440 to 8,000 Hz
● The dataset is formally structured as a Cartesian product of these factor sets:
○ Each sample corresponds to a unique tuple (timbre, amplitude, frequency), giving 4 × 20 × 400 = 32,000 combinations.
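For concreteness, the following is a minimal sketch of how one SynTone-style sample could be synthesized from its three generative factors. The factor grid matches the slide, but the waveform code itself is an illustrative assumption, not the dataset's actual generation script.

```python
import numpy as np
from scipy import signal

SR = 16_000        # sampling rate (Hz), as stated on the slide
DURATION = 1.0     # seconds per sample

def synthesize(timbre: str, amplitude: float, frequency: float) -> np.ndarray:
    """Generate one 1-second SynTone-style waveform from its three factors."""
    t = np.linspace(0.0, DURATION, int(SR * DURATION), endpoint=False)
    phase = 2 * np.pi * frequency * t
    if timbre == "sine":
        wave = np.sin(phase)
    elif timbre == "square":
        wave = signal.square(phase)
    elif timbre == "sawtooth":
        wave = signal.sawtooth(phase)
    elif timbre == "triangle":
        wave = signal.sawtooth(phase, width=0.5)  # symmetric sawtooth = triangle
    else:
        raise ValueError(f"unknown timbre: {timbre}")
    return amplitude * wave

# Factor grid: 4 timbres x 20 amplitudes x 400 frequencies = 32,000 samples
timbres = ["sine", "triangle", "square", "sawtooth"]
amplitudes = np.linspace(0.0, 1.0, 20)
frequencies = np.linspace(440.0, 8000.0, 400)
example = synthesize("square", amplitudes[10], frequencies[0])
```

Iterating the nested grid of (timbre, amplitude, frequency) tuples yields the full 32,000-sample Cartesian product.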
28. Curation of SynSpeech Dataset
● Neural Speech Synthesis using
○ Speaking accent {American, British, Australian, Indian}
○ Speaker gender {Male, Female}
○ Speaker identity: 251 speakers from LibriSpeech-100 [11], each with ~16 s of speech at 16 kHz
○ Speaking Styles {Default, Friendly, Sad, Whispering}
○ Linguistic content on diverse topics generated using an LLM
● We utilize OpenVoice [12], a flow-based, non-autoregressive generative TTS model
[11] Panayotov, V., Chen, G., Povey, D. and Khudanpur, S., 2015, April. Librispeech: an asr corpus based on public domain audio books. In 2015 IEEE
international conference on acoustics, speech and signal processing (ICASSP) (pp. 5206-5210). IEEE.
[12] Qin, Zengyi, et al. "OpenVoice: Versatile Instant Voice Cloning." arXiv preprint arXiv:2312.01479 (2023)
29. Curation of SynSpeech Dataset
(Diagram: spoken text, speaker ID, and speaking style are fed to the neural speech synthesizer, which outputs the generated utterance.)
30. SynSpeech Dataset Versions
Version   Speakers   Size (GB)   Speaking Styles   Content Items   Total Samples   DOI    Total Downloads
Small     50         4.87        1                 500             25,000          Link   63
Medium    25         10.68       4                 500             50,000          Link   41
Large     249        21.86       4                 500             109,560         Link   ~35
https://synspeech.github.io/
31. Datasets: DeepChimp
● Audio recordings made with an external microphone (Sennheiser ME400) within a 30-meter radius of 11 adult male chimpanzees
● Collected over 16 months (non-consecutive) between 2018 and 2020 in the Rekambo community, Loango National Park, Gabon
● Approach: focal animal sampling with continuous recording
● Call type: pant hoots
● A total of 551 variable-length calls (~6 hours) recorded at 44.1 kHz
32. External Real-world Datasets
Name                   Samples   Classes   Duration (hrs)   Usage
VoxCeleb-1 [13]        148,642   1,211     340.39           Upstream
LibriSpeech-100 [11]   14,385    128       100              Upstream
LibriSpeech-360 [11]   104,935   921       360              Upstream
Speech Commands [14]   7,985     2         2.18             Downstream
ESD [15]               7,000     2         5.52             Downstream
WLUC [16]              7,500     5         2.05             Downstream
[13] Nagrani, A., Chung, J.S. and Zisserman, A., 2017. Voxceleb: a large-scale speaker identification dataset. arXiv preprint arXiv:1706.08612.
[14] Warden, P., 2018. Speech commands: A dataset for limited-vocabulary speech recognition. arXiv preprint arXiv:1804.03209.
[15] Zhou, K., Sisman, B., Liu, R. and Li, H., 2022. Emotional voice conversion: Theory, databases and esd. Speech Communication, 137, pp.1-18.
[16] A. R. (online speech bank), “World Leaders Address the U.S. Congress,” 2011.
33. Evaluation Metrics: An Overview
● To assess the effectiveness of learned audio representations in terms of:
○ Disentanglement: how individual latent dimensions capture distinct underlying factors of variation.
○ Generalization & transferability: adaptability to downstream tasks and robustness across diverse datasets, verifying practical utility.
34. Disentanglement Evaluation: Predictor-based
● SAP (Separated Attribute Predictability): measures how well individual latent dimensions predict known factors. High = compact factor encoding.
● Explicitness Score: captures how easily factor values can be linearly decoded from the latent space.
Evaluate how usable and interpretable the latent codes are.
35. Disentanglement Evaluation: Intervention-based
● IRS (Interventional Robustness Score): assesses whether the representation of a target factor stays consistent when other (nuisance) factors change.
Test how stable latent codes are under controlled factor variations.
36. Disentanglement Evaluation: Information-based
● MIG (Mutual Information Gap): measures how uniquely a factor maps to one latent code dimension.
● JEMMIG: a holistic version of MIG that adds joint entropy to penalize shared or entangled codes.
● Modularity: evaluates whether each latent dimension captures only one factor (and not others).
Quantify statistical alignment between latent codes and data-generating factors.
39. Results
● VAE-based Representation Structure in Factorizational Disentanglement
○ With SynTone
○ With SynSpeech
● Linear Probing as a Proxy for Factorizational Disentanglement in SynSpeech
● Non-contrastive JEA-based Factorizational Disentanglement
● Transferability and Generalization
○ Non-contrastive JEA Downstream Tasks
○ Contrastive JEA Downstream Task
● Impact of Different Learning Approaches
40. Factorizational Disentanglement with SynToneR1
[R1] Brima, Y., Krumnack, U., Pika, S. and Heidemann, G., 2024. Learning disentangled audio representations through controlled synthesis. In: The Second Tiny
Papers Track at ICLR 2024, Tiny Papers @ ICLR 2024, Vienna, Austria, 11 May 2024. OpenReview.net. Available at: https://openreview.net/forum?id=Fn9ORH8PLl
41. Factorizational Disentanglement with SynToneR1
[R1] Brima, Y., Krumnack, U., Pika, S. and Heidemann, G., 2024. Learning disentangled audio representations through controlled synthesis. In: The Second Tiny
Papers Track at ICLR 2024, Tiny Papers @ ICLR 2024, Vienna, Austria, 11 May 2024. OpenReview.net. Available at: https://openreview.net/forum?id=Fn9ORH8PLl
A relaxed information bottleneck has more channel capacity, but if left unconstrained it introduces redundancies across these channels. This is what we refer to as the compactness-modularity trade-off.
42. Compactness–Modularity Trade-offR1
[R1] Brima, Y., Krumnack, U., Pika, S. and Heidemann, G., 2024. Learning disentangled audio representations through controlled synthesis. In: The Second Tiny
Papers Track at ICLR 2024, Tiny Papers @ ICLR 2024, Vienna, Austria, 11 May 2024. OpenReview.net. Available at: https://openreview.net/forum?id=Fn9ORH8PLl
Bottleneck Type       Compactness (MIG, SAP)   Modularity   Explanation
Loose (Vanilla VAE)   ✅ High                   ❌ Low        Factors are well encoded in a subset of the code space, but entangled.
Tight (β-VAE, etc.)   ❌ Low                    ✅ High       Forces independence, but factor information is fragmented across a few code dimensions.
A relaxed information bottleneck has more channel capacity, but if left unconstrained it introduces redundancies across these channels. This is what we refer to as the compactness-modularity trade-off.
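A minimal PyTorch-style sketch of the β-VAE objective that controls how tight this information bottleneck is; the encoder/decoder and the exact β schedule used in the thesis are not shown, so treat this as an illustrative assumption.

```python
import torch
import torch.nn.functional as F

def beta_vae_loss(x, x_hat, mu, logvar, beta=4.0):
    """ELBO with a weighted KL term: beta = 1 recovers the vanilla VAE ("loose"
    bottleneck); beta > 1 tightens the bottleneck and encourages independent,
    factorized codes at the cost of reconstruction fidelity."""
    recon = F.mse_loss(x_hat, x, reduction="sum")          # reconstruction term
    # KL(q(z|x) || N(0, I)) in closed form for a diagonal Gaussian posterior
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + beta * kl
```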
43. Factorizational Disentanglement with SynToneR1
[R1] Brima, Y., Krumnack, U., Pika, S. and Heidemann, G., 2024. Learning disentangled audio representations through controlled synthesis. In: The Second Tiny
Papers Track at ICLR 2024, Tiny Papers @ ICLR 2024, Vienna, Austria, 11 May 2024. OpenReview.net. Available at: https://openreview.net/forum?id=Fn9ORH8PLl
44. Factorizational Disentanglement with SynSpeechR3
[R3] Brima, Y., Krumnack, U., Pika, S. and Heidemann, G., 2023. Learning disentangled speech representations. New in ML Workshop, NeurIPS 2023, 31 October.
Available at: https://openreview.net/forum?id=3ox1TfKeRF
Linear probes predicting factors
from code dimensions.
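A minimal sketch of such per-dimension linear probing, assuming latent codes Z (shape N × D) and integer factor labels y have already been extracted; the probe type (logistic regression) follows the slide's description, while the split and hyperparameters are assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def probe_factor_per_dimension(Z: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Fit one linear probe per latent dimension and return its test accuracy.

    A compactly encoded factor shows high accuracy for one (or a few) dimensions;
    an entangled factor needs many dimensions to be decodable.
    """
    accs = []
    for d in range(Z.shape[1]):
        z_d = Z[:, d:d + 1]                              # a single code dimension
        z_tr, z_te, y_tr, y_te = train_test_split(z_d, y, test_size=0.2, random_state=0)
        clf = LogisticRegression(max_iter=1000).fit(z_tr, y_tr)
        accs.append(clf.score(z_te, y_te))
    return np.array(accs)
```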
45. Factorizational Disentanglement with SynSpeechR3
[R3] Brima, Y., Krumnack, U., Pika, S. and Heidemann, G., 2023. Learning disentangled speech representations. New in ML Workshop, NeurIPS 2023, 31 October.
Available at: https://openreview.net/forum?id=3ox1TfKeRF
46. Factorizational Disentanglement with SynSpeechR3
[R3] Brima, Y., Krumnack, U., Pika, S. and Heidemann, G., 2023. Learning disentangled speech representations. New in ML Workshop, NeurIPS 2023, 31 October.
Available at: https://openreview.net/forum?id=3ox1TfKeRF
47. Factorizational Disentanglement with SynSpeechR3
[R3] Brima, Y., Krumnack, U., Pika, S. and Heidemann, G., 2023. Learning disentangled speech representations. New in ML Workshop, NeurIPS 2023, 31 October.
Available at: https://openreview.net/forum?id=3ox1TfKeRF
48. Non-contrastive JEA Representation StructureR2
[R2] Brima, Y., Krumnack, U., Pika, S. and Heidemann, G., 2024. Understanding Self-Supervised Learning of Speech Representation via Invariance and Redundancy
Reduction. Information, 15(2), p.114.
49. Non-contrastive JEA Factorizational DisentanglementR2
[R2] Brima, Y., Krumnack, U., Pika, S. and Heidemann, G., 2024. Understanding Self-Supervised Learning of Speech Representation via Invariance and Redundancy
Reduction. Information, 15(2), p.114.
50. Non-contrastive JEA Downstream TasksR2
[R2] Brima, Y., Krumnack, U., Pika, S. and Heidemann, G., 2024. Understanding Self-Supervised Learning of Speech Representation via Invariance and Redundancy
Reduction. Information, 15(2), p.114.
51. Contrastive JEA Representation StructureR4
[R4] Brima, Y., Southern, L., Krumnack, U., Heidemann, G. and Pika, S., Deep learning for recognizing individual chimpanzees from vocalizations. In submission
to Scientific Reports.
52. Contrastive JEA Representation StructureR4
[R4] Brima, Y., Southern, L., Krumnack, U., Heidemann, G. and Pika, S., Deep learning for recognizing individual chimpanzees from vocalizations. In submission
to Scientific Reports.
53. Contrastive JEA Downstream TasksR4
[R4] Brima, Y., Southern, L., Krumnack, U., Heidemann, G. and Pika, S., Deep learning for recognizing individual chimpanzees from vocalizations. In submission
to Scientific Reports.
54. Contrastive JEA Downstream ExplainabilityR4
[R4] Brima, Y., Southern, L., Krumnack, U., Heidemann, G. and Pika, S., Deep learning for recognizing individual chimpanzees from vocalizations. In submission
to Scientific Reports.
55. Limitations
● Complex‑Factor Entanglement
○ Prosody & content remain conflated in real‑world speech
● Dataset Constraints
○ DeepChimp size limits model robustness; synthetic sets lack environmental
noise variation
● Metric Gaps
○ Temporal/hierarchical dependencies under‑captured by current disentanglement
scores
● Compute & Architecture
○ Scalability of high‑capacity VAEs and BT models constrained by GPU memory
56. Future Directions
● Multimodal & Hierarchical Models
○ For example, the fusion of audio + video (lip motion) with attention architectures
for finer generative control
● Domain-specific Inductive Priors
○ For factors that are temporal or hierarchical, impose constraints that help the model learn these structures
● Dataset Expansion & Realism
○ Curate an especially large and varied collection of non-human primate vocalizations to improve robustness and generalization.
● Advanced Metrics
○ Develop temporally‑aware/audio specific disentanglement metrics.
57. Conclusions & Takeaways
● Disentangled audio representations (generative + JEA) enhance
interpretability, robustness, transfer, etc.
● Controlled synthetic benchmarks (e.g., SynTone/SynSpeech) + real (e.g.,
DeepChimp) crucial for systematic evaluation of methods.
● No single paradigm “wins”: choose generative for fine‑grained factor
control, contrastive for representation transfer, non-contrastive JEAs for
balance.
● Opens pathways for fair, controllable speech & bioacoustic technologies
65. SynSpeech RAVE-inspired Learning Architecture [R3]
Phase 2: Adversarial Fine-Tuning for Enhanced Realism
To avoid mode collapse, a feature matching loss is further added (see the sketch below).
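As a rough illustration of the feature matching term (not the exact loss used in the thesis), the sketch below averages the L1 distance between the discriminator's intermediate activations on real and generated audio; the list-of-features interface is an assumption.

```python
import torch

def feature_matching_loss(real_feats, fake_feats):
    """L1 distance between the discriminator's intermediate activations on real
    and generated audio, averaged over layers. Matching feature statistics gives
    the generator a smoother training signal than the raw adversarial loss and
    helps avoid mode collapse.

    real_feats / fake_feats: lists of tensors, one per discriminator layer.
    """
    loss = 0.0
    for r, f in zip(real_feats, fake_feats):
        loss = loss + torch.mean(torch.abs(r.detach() - f))
    return loss / len(real_feats)
```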
66. Supervised Disentanglement Metrics
Based on the factorizational disentanglement framework and notation of [17].
[17] Carbonneau, M.A., Zaidi, J., Boilard, J. and Gagnon, G., 2022. Measuring disentanglement: A review of metrics. IEEE transactions on
neural networks and learning systems, 35(7), pp.8747-8761.
Comparison: supervised disentanglement metrics compare the learned latent codes against the ground-truth generative factors.
Categories: predictor-based, intervention-based, information-based.
69. Explicitness Score (Explicitness):
● Evaluates the ability to predict factor values from the latent codes [18].
● Calculated by training a simple classifier (e.g., logistic regression) on the entire latent code.
● Performance measured by AUC-ROC, averaged across factors.
● Score normalized to [0, 1]; higher indicates greater interpretability.
[18] K. Ridgeway and M. C. Mozer, “Learning deep disentangled embeddings with the f-statistic loss,” Advances in neural information processing
systems, vol. 31, 2018.
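A hedged sketch of how the explicitness score described above could be computed with scikit-learn; the train/test split, classifier settings, and chance-level rescaling are assumptions that may differ from the thesis implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def explicitness(Z: np.ndarray, factors: np.ndarray) -> float:
    """Mean AUC-ROC of logistic-regression probes trained on the full latent
    code, one probe per ground-truth factor (columns of `factors`)."""
    aucs = []
    for k in range(factors.shape[1]):
        y = factors[:, k]
        z_tr, z_te, y_tr, y_te = train_test_split(Z, y, test_size=0.2, random_state=0)
        clf = LogisticRegression(max_iter=1000).fit(z_tr, y_tr)
        proba = clf.predict_proba(z_te)
        if proba.shape[1] == 2:                       # binary factor
            aucs.append(roc_auc_score(y_te, proba[:, 1]))
        else:                                         # multi-class factor, one-vs-rest
            aucs.append(roc_auc_score(y_te, proba, multi_class="ovr", labels=clf.classes_))
    # Rescale from chance level (0.5) so the score lies in [0, 1]
    return float((np.mean(aucs) - 0.5) / 0.5)
```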
70. Attribute Predictability Score (SAP) (Compactness)
● Assesses how well each latent code dimension predicts ground-truth factors [19].
● Continuous factors: R² from a linear regressor.
● Categorical factors: balanced accuracy from a decision-tree classifier.
● Formula: SAP = (1/K) Σ_k [S_k(1) − S_k(2)], the mean gap between the highest and second-highest per-dimension predictability scores for each factor k.
● Higher score indicates better compactness (large differences between the top two scores).
[19] A. Kumar, P. Sattigeri, and A. Balakrishnan, “Variational inference of disentangled latent concepts from unlabeled observations,” arXiv preprint
arXiv:1711.00848, 2017.
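A hedged sketch of SAP for categorical factors, following the slide's recipe (per-dimension decision trees scored with balanced accuracy, then the top-two gap per factor); the tree depth and split ratio are illustrative assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split

def sap_score(Z: np.ndarray, factors: np.ndarray) -> float:
    """SAP for categorical factors: score each (dimension, factor) pair with the
    balanced accuracy of a shallow decision tree, then average the gap between
    the best and second-best dimension per factor."""
    n_dims, n_factors = Z.shape[1], factors.shape[1]
    S = np.zeros((n_dims, n_factors))
    for d in range(n_dims):
        for k in range(n_factors):
            z_tr, z_te, y_tr, y_te = train_test_split(
                Z[:, d:d + 1], factors[:, k], test_size=0.2, random_state=0)
            tree = DecisionTreeClassifier(max_depth=3).fit(z_tr, y_tr)
            S[d, k] = balanced_accuracy_score(y_te, tree.predict(z_te))
    top2 = np.sort(S, axis=0)[-2:, :]        # two best dimensions per factor
    return float(np.mean(top2[1] - top2[0]))
```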
71. Intervention-based Methods
Key Concept: assess disentanglement holistically, or with respect to modularity, by examining stability under controlled variations.
Focus: robustness of factor representations to perturbations while considering dependencies.
72. Interventional Robustness Score (IRS) (Holistic)
● Measures the stability of target factors in the latent space [20].
● Calculated by comparing representation means when target factors are fixed vs. when nuisance factors vary (e.g., using a vector norm).
● Highlights how well the model isolates relevant factors.
[20] R. Suter, D. Miladinovic, B. Schölkopf, and S. Bauer, “Robustly disentangled causal mechanisms: Validating deep representations for interventional
robustness,” in International Conference on Machine Learning, pp. 6056–6065, PMLR, 2019.
73. Information-based Metrics
Key Concept: utilize entropy and mutual information (MI) to assess modularity, compactness, or holistic disentanglement.
Entropy measures the amount of uncertainty in a given factor/code; for example, we need log₂ 6 ≈ 2.585 bits of information to describe the outcome of a fair six-sided die roll.
MI measures the dependency between a factor and a code; MI is 0 if and only if the two random variables are statistically independent.
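Both quantities can be checked numerically; the snippet below reproduces the 2.585-bit die-roll example and the near-zero MI of independent variables (sample sizes and seeds are arbitrary).

```python
import numpy as np
from sklearn.metrics import mutual_info_score

# Entropy of a fair six-sided die: -sum(p * log2(p)) = log2(6) ≈ 2.585 bits
p = np.full(6, 1 / 6)
entropy_bits = -np.sum(p * np.log2(p))        # ≈ 2.585

# MI between two independent discrete variables is (approximately) zero
rng = np.random.default_rng(0)
a = rng.integers(0, 6, size=100_000)
b = rng.integers(0, 6, size=100_000)
mi_nats = mutual_info_score(a, b)             # ≈ 0 (small positive sampling bias; in nats)
```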
74. Mutual Information Gap (MIG) (Compactness)
Measures the gap between the most and second-most associated latent dimensions for each factor [21].
[21] R. T. Chen, X. Li, R. B. Grosse, and D. K. Duvenaud, “Isolating sources of disentanglement in variational autoencoders,” in NeurIPS, pp. 2615–2625,
2018.
This metric captures how strongly each factor aligns with a specific latent dimension.
The average is reported as the overall score.
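A hedged sketch of MIG for discrete factors; the histogram discretization of the latent codes and the bin count are assumptions, and mutual information is computed in nats via scikit-learn.

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def mig(Z: np.ndarray, factors: np.ndarray, n_bins: int = 20) -> float:
    """MIG with histogram-discretized latent codes: per factor, the gap between
    the largest and second-largest mutual information across code dimensions,
    normalized by the factor's entropy, then averaged over factors."""
    scores = []
    for k in range(factors.shape[1]):
        v = factors[:, k]                               # discrete factor labels
        h_v = mutual_info_score(v, v)                   # entropy H(v_k) in nats
        mi = []
        for d in range(Z.shape[1]):
            edges = np.histogram_bin_edges(Z[:, d], bins=n_bins)
            mi.append(mutual_info_score(v, np.digitize(Z[:, d], edges)))
        mi = np.sort(mi)[::-1]
        scores.append((mi[0] - mi[1]) / h_v)
    return float(np.mean(scores))
```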
75. Disentanglement, Completeness, Informativeness Mutual Information
Gap (DCIMIG): Holistic
Computes the MI gap between a factor and its most informative latent dimension by assessing gaps in MI across the different latent dimensions for that factor [22].
[22] R. T. Chen, X. Li, R. B. Grosse, and D. K. Duvenaud, “Isolating sources of disentanglement in variational autoencoders,” in NeurIPS, pp. 2615–2625,
2018.
This gives a much broader view of the factor-code relationship than MIG, which considers only the gap between the two most informative latent dimensions per factor.
76. Joint Entropy Minus Mutual Information Gap (JEMMIG) (Holistic)
Builds on MIG but incorporates the joint entropy between factor and code to achieve a holistic evaluation [23].
[23] K. Do and T. Tran, “Theory and evaluation metrics for learning disentangled representations,” arXiv preprint arXiv:1908.09961, 2019.
Unlike MIG, lower JEMMIG scores indicate better disentanglement; a normalized variant maps the scores onto the closed interval [0, 1].
77. Modularity Score (Modularity)
This metric quantifies the relative MI of a primary factor compared to all other factors, rewarding isolated factor representations in the latent space [24].
[24] K. Ridgeway and M. C. Mozer, “Learning deep disentangled embeddings with the f-statistic loss,” Advances in neural information processing systems,
vol. 31, 2018.
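A hedged sketch of the modularity score, assuming a mutual-information matrix between code dimensions and factors has already been computed (e.g., as in the MIG sketch above); it follows the template-deviation formulation of Ridgeway and Mozer [24].

```python
import numpy as np

def modularity_score(mi: np.ndarray) -> float:
    """Modularity from a precomputed MI matrix `mi` of shape (n_dims, n_factors).

    Each code dimension is compared against an ideal template in which all of its
    mutual information is concentrated on a single factor; the normalized squared
    deviation from that template is subtracted from 1 and averaged over dimensions.
    """
    n_dims, n_factors = mi.shape
    scores = []
    for i in range(n_dims):
        theta = mi[i].max()
        if theta == 0:                      # dimension carries no factor information
            scores.append(0.0)
            continue
        template = np.zeros(n_factors)
        template[mi[i].argmax()] = theta
        deviation = np.sum((mi[i] - template) ** 2) / (theta ** 2 * (n_factors - 1))
        scores.append(1.0 - deviation)
    return float(np.mean(scores))
```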
79. Non-contrastive JEA Downstream TasksR2
[R2] Brima, Y., Krumnack, U., Pika, S. and Heidemann, G., 2024. Understanding Self-Supervised Learning of Speech Representation via Invariance and Redundancy
Reduction. Information, 15(2), p.114.
80. Contrastive JEA Out-of-domain TestR4
[R4] Brima, Y., Southern, L., Krumnack, U., Heidemann, G. and Pika, S., Deep learning for recognizing individual chimpanzees from vocalizations. In submission
to Scientific Reports.
81. Key Findings
● Disentanglement Achievability: Disentangled representations are feasible, especially in
functional settings, evidenced by strong downstream task performance (speaker
identification, speaking style recognition, keyword spotting, gender recognition).
● Context-Dependent Disentanglement:
○ Simpler contexts (e.g., SynTone) enable effective factor separation.
○ More complex settings (e.g., SynSpeech) involve manageable trade-offs between
compactness and modularity.
● Non-contrastive JEA Efficacy: Effectively balances transformative and factorizational
disentanglement, showing robust performance in downstream applications.
● Contrastive JEA Strength: Excels in transformative disentanglement, supported by
linear probing, RSA, clustering, and low-dimensional projections.
● Transferability: Self-supervised upstream representations significantly improve
downstream task performance, even with small target datasets (5-10%).
82. Research Contributions
● New Benchmark Datasets
○ SynTone & SynSpeech: fully controllable synthetic audio isolating timbre, pitch,
prosody
○ DeepChimp: real-world chimpanzee calls with annotated individual identities
● Improved Generative Models
○ Adapted β‑VAE and Factor‑VAE for audio: better trade‑off between
reconstruction fidelity and latent factor disentanglement
● Joint-Embedding & Contrastive Learning
○ Applied Barlow Twins and supervised contrastive losses to learn speaker- and factor-invariant representations (a loss sketch follows after this list)
● Robust Evaluation Suite
○ Combined disentanglement metrics (MIG, SAP) with interventional robustness
(IRS) and downstream probes (speaker ID, emotion)
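For reference, here is a minimal sketch of the Barlow Twins redundancy-reduction loss mentioned above; the normalization details and λ value are illustrative assumptions rather than the thesis's exact settings.

```python
import torch

def barlow_twins_loss(z_a: torch.Tensor, z_b: torch.Tensor, lam: float = 5e-3) -> torch.Tensor:
    """Redundancy-reduction objective on two views of the same audio batch.

    z_a, z_b: (batch, dim) embeddings of two augmentations. The on-diagonal term
    enforces invariance to the augmentation; the off-diagonal term decorrelates
    embedding dimensions, which is what pushes the JEA toward factorized codes.
    """
    n, d = z_a.shape
    z_a = (z_a - z_a.mean(0)) / (z_a.std(0) + 1e-6)     # per-dimension standardization
    z_b = (z_b - z_b.mean(0)) / (z_b.std(0) + 1e-6)
    c = (z_a.T @ z_b) / n                               # cross-correlation matrix (d x d)
    on_diag = ((torch.diagonal(c) - 1) ** 2).sum()
    off_diag = (c ** 2).sum() - (torch.diagonal(c) ** 2).sum()
    return on_diag + lam * off_diag
```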