Voice-to-Value: LLM-Powered Customer Interaction Analysis
Jan 2025
Agenda
1. Introduction
2. Speech-to-Text + multimodality
3. Our problem
4. Model & Architecture
5. Results
6. Conclusion
Speaker
Francesco Fontan
Data Scientist, Data Science Team
francesco.fontan@on.com
on.com
01 Introduction
What to expect…
● This project is a work in progress
● We're actively analyzing the situation, data, and goals
● Exciting future opportunities, including real-time conversation!
Context
There are so many business processes that can be easily automated at ON:
● Returns
● Warranty Claims
● Cancellations
● Refunds
● Order Tracking
● …
Today these processes are very manual, slow, and expensive. Example: a warranty claim for shoes bought <15 days ago; decline or accept the claim? Phone call analysis applies across these processes: returns, refunds, …
Our Problem
We are currently unable to capture data from phone customer interactions.
With no data and no insights, it is hard to improve CS:
● Hard to be more efficient
● Difficult to cut costs
● Bad user experience and damaged brand reputation
Main topics
● Customer analysis: customers' needs, customer sentiment, product feedback, …
● Remove manual and time-consuming processes
● Synergies with the textual chatbot
Some numbers…
In the US alone:
● ~40k inbound calls / month
● 500k+ minutes / month
● Utilization peaks during and after Black Friday / End of Season Sales
And we have HD (help desk) phone support in many other countries!
[Chart: breakdown of the current After-Call Work split]
02 A bit of background on audio and multimodality…
Speech → Text pipeline
1. Feature Extraction. Input: speech audio. Output: spectrogram, MFCC, …
2. Acoustic Model. Input: spectrogram, MFCC. Output: textual feature probabilities.
3. Decoder + Rescoring. Input: textual feature probabilities. Output: text.
Candidate hypotheses such as "My name is Fran Cisco", "My name Francesco", and "My name is Fan Tesco" are rescored to "My name is Francesco".
4. Post-Processing (punctuation | capitalization | normalization). Input: text. Output: normalized text.
e.g. "hi let's meet at two o'clock" => "Hi, let's meet at 2:00"
Feature Extraction and Representation
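As a concrete illustration of this stage, here is a minimal sketch, assuming librosa is installed and "call.wav" is a placeholder recording (both are assumptions, not our pipeline code):

```python
# Minimal feature-extraction sketch: audio in, spectrogram / MFCC out.
import librosa

# Load the audio; speech models typically expect 16 kHz mono.
audio, sr = librosa.load("call.wav", sr=16_000, mono=True)

# Log-mel spectrogram: the time-frequency representation most acoustic
# models (including Whisper) consume.
mel = librosa.feature.melspectrogram(y=audio, sr=sr, n_mels=80)
log_mel = librosa.power_to_db(mel)

# MFCCs: the compact alternative feature set mentioned on the slide.
mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=13)

print(log_mel.shape, mfcc.shape)  # (n_mels, frames), (n_mfcc, frames)
```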
Acoustic Model - Whisper
https://github.com/openai/whisper
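For illustration, transcribing a call with open-source Whisper takes a few lines (a minimal sketch; the model size and file path are placeholders):

```python
# Transcribe a recording with open-source Whisper.
import whisper

model = whisper.load_model("base")     # tiny / base / small / medium / large
result = model.transcribe("call.wav")  # language is auto-detected by default
print(result["text"])
```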
Whisper vs Google Universal Speech Model (USM)
Whisper:
• Open-source
• 99 languages
• Easy to use
Google USM:
• 1000+ languages
• Accurate
• Fast & efficient
• Not open
USM is a Conformer-based model (convolutional + Transformer): CNNs exploit local dependencies, while the Transformer layers capture global interactions. This architecture likely mirrors the one used in Gemini multimodal models…
Decoder Reasoning
Llama 3.1 Architecture (similar to Gemini?)
https://ritvik19.medium.com/papers-explained-187c-llama-3-1-multimodal-experiments-a1940dd45575
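To make the rescoring idea from the pipeline concrete, here is a toy sketch: score each ASR hypothesis with a small causal LM and keep the most fluent one. GPT-2 is a stand-in here, not the model the deck describes:

```python
# Toy LM rescoring of ASR n-best hypotheses: lower average negative
# log-likelihood means the sentence reads as more fluent.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

hypotheses = ["My name is Fran Cisco",
              "My name Francesco",
              "My name is Fan Tesco",
              "My name is Francesco"]

def lm_score(text: str) -> float:
    """Average negative log-likelihood under the LM; lower is better."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss
    return loss.item()

# Likely picks "My name is Francesco", the most fluent hypothesis.
print(min(hypotheses, key=lm_score))
```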
03 Our Approach
Gemini API?
We need a highly complex, custom-built pipeline with 27 microservices, manual feature engineering, and handcrafted post-processing scripts to handle every edge case.
Diarization and Audio Adaptation
Diarization (built-in) with Cloud Speech-to-Text:
Speaker A: Hello, I am Francesco, how can I help you?
Speaker B: Hello, …
Speaker A: ...
Adaptation (see the configuration sketch below):
● Word boosting: ON, Cloud Tech, Cyclon, …
● Sentence format boosting: "Order number is $ORDERNUM", with the custom class ORDERNUM (R12345678)
● Fine-tuned model: GCP Telephony model
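A hedged sketch of what built-in diarization plus word boosting looks like with the Cloud Speech-to-Text v1 client; the file path, phrase list, boost value, sample rate, and speaker counts are illustrative assumptions, not our production configuration:

```python
# Diarization + phrase boosting with Google Cloud Speech-to-Text (v1).
from google.cloud import speech

client = speech.SpeechClient()

config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=8000,          # telephone audio is commonly 8 kHz
    language_code="en-US",
    model="phone_call",              # telephony-tuned model
    # Built-in diarization: label each word with a speaker tag.
    diarization_config=speech.SpeakerDiarizationConfig(
        enable_speaker_diarization=True,
        min_speaker_count=2,
        max_speaker_count=2,
    ),
    # Word boosting for brand / product vocabulary.
    speech_contexts=[
        speech.SpeechContext(phrases=["ON", "Cloud Tech", "Cyclon"],
                             boost=15.0)
    ],
)

with open("call.wav", "rb") as f:
    audio = speech.RecognitionAudio(content=f.read())

response = client.recognize(config=config, audio=audio)
# With diarization on, the last result aggregates all words with tags.
for word in response.results[-1].alternatives[0].words:
    print(word.speaker_tag, word.word)
```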
Architecture
LangChain pipelines support both batch and stream processing.
● We record all traces
● Traces feed evaluation and dataset creation
Focus: LangChain App
A pipeline with Gemini 2.0 Flash Experimental seems to provide better quality (and it is faster). But data sent to Gemini 2.0 can be used for training… So, for now, we use GCP ASR models + Gemini 1.5 Pro. A minimal sketch of such a chain follows.
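This sketch assumes the langchain-google-vertexai package; the prompt wording and model name are illustrative, not our exact production chain:

```python
# Minimal LangChain chain over call transcripts; the same runnable
# supports both batch and streaming execution, as noted above.
from langchain_core.prompts import ChatPromptTemplate
from langchain_google_vertexai import ChatVertexAI

prompt = ChatPromptTemplate.from_template(
    "Summarize this support call and extract the ticket category:\n\n{transcript}"
)
chain = prompt | ChatVertexAI(model_name="gemini-1.5-pro")

transcripts = [
    "Speaker A: Hello, I am Francesco, how can I help you? Speaker B: ..."
]

# Batch over many calls, or stream tokens for a single call.
summaries = chain.batch([{"transcript": t} for t in transcripts])
for chunk in chain.stream({"transcript": transcripts[0]}):
    print(chunk.content, end="")
```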
04 Some Results
Evaluation & Results
Confusion matrix, ticket category (results with a simple few-shot classification approach):

Actual \ Predicted    Cancellation   Payment Problem   Return   Other
Cancellation               78              1              5       1
Payment Problem             2             68              1       1
Return                      4              3             55       7
Other                       2              2              3      45

Order number extraction (e.g. R12345678) → 97%

There are several components in our pipeline:
- Speech-to-Text
- Text transformations and operations
Due to current constraints on time and resources, we're focusing on evaluating downstream tasks rather than calculating WER.
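For reference, a sketch of what "simple few-shot classification" plus order-number extraction can look like; the prompt wording, example transcripts, and label set are illustrative assumptions, and the regex mirrors the R12345678 format shown above:

```python
# Few-shot classification prompt plus rule-based order-number extraction.
import re

FEW_SHOT_PROMPT = """Classify the support call into exactly one category:
Cancellation, Payment Problem, Return, Other.

Transcript: "I want to cancel order R00000001." -> Cancellation
Transcript: "My card was charged twice." -> Payment Problem
Transcript: "These shoes are too small, can I send them back?" -> Return

Transcript: "{transcript}" ->"""

def extract_order_number(text: str) -> str | None:
    """Order numbers look like 'R' followed by 8 digits (e.g. R12345678)."""
    match = re.search(r"\bR\d{8}\b", text)
    return match.group(0) if match else None

print(extract_order_number("My order number is R12345678, please help"))
```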
05 Next Steps
ReAct Agent
Agentic framework:
• Increase accuracy
• Use tools to retrieve info about the user: orders, products, FAQs, old conversations and notes, … (see the sketch below)
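A hedged sketch of the planned ReAct agent, using langgraph's prebuilt helper; the tool names and bodies are placeholder stubs, not existing internal APIs:

```python
# ReAct agent sketch: an LLM that reasons, picks a tool, observes, repeats.
from langchain_core.tools import tool
from langchain_google_vertexai import ChatVertexAI
from langgraph.prebuilt import create_react_agent

@tool
def lookup_order(order_number: str) -> str:
    """Return status and items for an order number like R12345678."""
    return "R12345678: 1x Cyclon, shipped 2025-01-10"  # stub

@tool
def search_faqs(query: str) -> str:
    """Search help-desk FAQs for relevant articles."""
    return "Returns are free within 30 days."  # stub

agent = create_react_agent(ChatVertexAI(model_name="gemini-1.5-pro"),
                           tools=[lookup_order, search_faqs])
result = agent.invoke({"messages": [("user", "Where is order R12345678?")]})
print(result["messages"][-1].content)
```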
Long-term
Create an agent that converses with the user to gather missing information (Gemini Live APIs?)
06 Conclusions
Conclusions
Thank you!
Architecture
Speech-to-Text Pipeline
Cross-modality
https://sebastianraschka.com/blog/2024/understanding-multimodal-llms.html
Multimodality
The Core Mechanism: Unified Embedding Space
At the heart of Gemini 2.0's multimodal capabilities lies the concept of a unified embedding space. Imagine this space as a high-dimensional map where information from different modalities is projected. Crucially, semantically similar information, regardless of its original format (text, image, or sound), will be positioned close to each other within this space.
● How it works: Specialized encoders process each modality. For example, a convolutional neural network (CNN) might process images, while a transformer network handles text. These encoders are trained to project the input data into this shared embedding space (toy sketch below).
● The benefit: By representing all modalities in a common space, the model can directly compare and relate information across them. For instance, the visual features of a cat in an image can be directly associated with the word "cat" in a textual description.
Cross-Modal Attention Mechanisms: Focusing on Relevant Connections
While a unified embedding space allows for comparison, cross-modal attention mechanisms enable the model to dynamically focus on the most relevant parts of different inputs when processing them together.
● How it works: These mechanisms allow the model to learn which parts of one modality are most important in relation to specific parts of another (toy sketch below). For example, when processing an image and a question about it, the attention mechanism lets the model focus on the specific objects or regions in the image that are relevant to the question.
● The benefit: This selective attention significantly improves the accuracy and efficiency of multimodal understanding. Instead of treating all information equally, the model can prioritize the most meaningful connections.
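A toy PyTorch sketch of cross-modal attention, with text tokens as queries attending over image-patch embeddings in a shared space; all dimensions are illustrative assumptions:

```python
# Cross-modal attention: text attends over image patches.
import torch
import torch.nn as nn

d_model = 512
text_tokens = torch.randn(1, 12, d_model)    # (batch, text_len, dim)
image_patches = torch.randn(1, 49, d_model)  # (batch, n_patches, dim)

cross_attn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
fused, weights = cross_attn(query=text_tokens,
                            key=image_patches,
                            value=image_patches)
# `weights` shows which image patches each text token focused on.
print(fused.shape, weights.shape)  # (1, 12, 512), (1, 12, 49)
```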
Multimodality
Distinguishing Gemini 2.0 from Previous Multimodal AI
Traditional multimodal AI often involved separate models trained for each modality, with complex methods to fuse their outputs. This approach had limitations:
● Limited interaction: Early fusion methods might simply concatenate the outputs of individual models, hindering deep interaction between modalities.
● Separate training: Training individual models separately could lead to inconsistencies and difficulty in aligning representations.
Gemini 2.0's end-to-end training on diverse multimodal data is a key differentiator. The entire model, including the encoders and cross-modal attention mechanisms, is trained simultaneously. This allows the model to learn the optimal way to represent and relate information across modalities from the ground up.
Conformer
Briefly, the Conformer is a model built by Google and presented in 2020. The basic idea is that Transformers are capable of capturing content-based global interactions, while CNNs instead exploit local features. Google therefore combines the convolution module and multi-head self-attention in one block.
Google USM: "Conformer comprises of two macaron-like feed-forward layers with half-step residual connections sandwiching the multi-headed self-attention and convolution modules. This is followed by a post layernorm."
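A simplified PyTorch sketch of that "feed-forward / self-attention / convolution / feed-forward" sandwich; layer sizes are illustrative and details of the real Conformer (relative positional encoding, gating) are omitted:

```python
# Simplified Conformer block: two half-step feed-forward layers
# sandwiching self-attention and a depthwise convolution module.
import torch
import torch.nn as nn

class ConformerBlock(nn.Module):
    def __init__(self, d_model=256, n_heads=4, kernel_size=31):
        super().__init__()
        self.ff1 = nn.Sequential(nn.LayerNorm(d_model),
                                 nn.Linear(d_model, 4 * d_model), nn.SiLU(),
                                 nn.Linear(4 * d_model, d_model))
        self.attn_norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.conv_norm = nn.LayerNorm(d_model)
        self.conv = nn.Conv1d(d_model, d_model, kernel_size,
                              padding=kernel_size // 2, groups=d_model)
        self.ff2 = nn.Sequential(nn.LayerNorm(d_model),
                                 nn.Linear(d_model, 4 * d_model), nn.SiLU(),
                                 nn.Linear(4 * d_model, d_model))
        self.post_norm = nn.LayerNorm(d_model)

    def forward(self, x):              # x: (batch, time, d_model)
        x = x + 0.5 * self.ff1(x)      # first half-step feed-forward
        a = self.attn_norm(x)
        x = x + self.attn(a, a, a)[0]  # multi-head self-attention
        c = self.conv(self.conv_norm(x).transpose(1, 2)).transpose(1, 2)
        x = x + c                      # depthwise convolution module
        x = x + 0.5 * self.ff2(x)      # second half-step feed-forward
        return self.post_norm(x)       # post-LayerNorm

x = torch.randn(2, 100, 256)           # (batch, frames, features)
print(ConformerBlock()(x).shape)       # torch.Size([2, 100, 256])
```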
Multimodality
Input encoding (modality-specific):
● Text
○ Tokenization: uses sub-word tokenization (like Byte Pair Encoding or SentencePiece) to break text into manageable units.
○ Transformer embeddings: each token is converted into a dense vector embedding.
● Images (see the patching sketch below)
○ Vision Transformer (ViT) or variant: most likely a ViT architecture, dividing images into patches ("tokens") and processing them with a transformer; could also incorporate convolutional neural networks (CNNs) for feature extraction before or alongside the ViT.
○ Hierarchical feature extraction: possible use of multiple layers or blocks that capture image details at different levels.
● Audio
○ Spectrogram analysis: transforms audio into spectrograms (visual representations of frequencies) or uses raw audio features such as MFCC.
○ CNN or specialized audio encoder: likely a CNN or a transformer-based audio encoder to process the spectrogram or raw features.
● Video
○ Frame-based processing: likely processes video as a sequence of frames.
○ Temporal attention: incorporates mechanisms (like temporal attention) to capture relationships between frames over time; other architectures such as 3D CNNs are also possible.
● Other modalities: likely specialized encoders for other data types, such as sensor data, code, and possibly 3D meshes.
Multimodal fusion and shared representation:
● Unified transformer architecture: the core of Gemini 2.0 is a large transformer that processes all encoded data.
● Attention-based fusion:
○ Cross-attention: the key mechanism; allows each modality's representation to attend to and gather information from other modalities. For example, text can attend to visual features in an image.
○ Self-attention: used within each modality to help the model understand relationships within the modality itself.
● Late fusion: modalities are processed separately at first and fused into a joint representation layer.
● Modality-specific adapters: possibly adapter layers that can be fine-tuned to improve performance on specific modality inputs.
Output and task-specific adaptation:
● Task-specific decoders: used when generating images, video, or text.
● Fine-tuning: the pre-trained model is then fine-tuned for specific tasks (e.g., image captioning, question answering, code generation).
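A toy sketch of the ViT-style patching step mentioned above: a 224x224 image becomes a sequence of flattened 16x16 patch "tokens". The sizes are the standard ViT defaults, used here as assumptions:

```python
# Cut an image into 16x16 patches and flatten each into a token vector.
import torch

image = torch.randn(1, 3, 224, 224)                  # (batch, C, H, W)
patches = image.unfold(2, 16, 16).unfold(3, 16, 16)  # 14x14 grid of tiles
patches = patches.reshape(1, 3, 14 * 14, 16 * 16)
patches = patches.permute(0, 2, 1, 3).reshape(1, 196, 3 * 16 * 16)
print(patches.shape)  # (1, 196, 768): 196 patch tokens of dimension 768
```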
Goal
Short-term goals (analysis + support):
- Customer analysis: gain deeper insights into our customers' needs and desires, and refine our product offerings
- Understand what is happening behind the scenes: streamline processes and alleviate the workload of our HDs
- Prepare better training materials: find all possible edge cases and cover all possible situations to prepare human HD operators
Long-term goals (take more action):
● Chatbot/voicebot interface
○ It can handle real-time conversations
○ Delegates to human operators if the user requests
● Act based on more available information, e.g.
○ when we expect to have an item back in stock
○ whether the customer is loyal
○ …