Voice-to-Value: LLM-Powered Customer Interaction Analysis
Jan 2025
Agenda
1. Introduction
2. Speech-to-Text + multimodality
3. Our problem
4. Model & Architecture
5. Results
6. Conclusion
Speaker
Francesco Fontan
Data Scientist, Data Science Team
francesco.fontan@on.com
on.com
01 Introduction
What to expect…
● This project is a work in progress
● We're actively analyzing the situation, data, and goals
● Exciting future opportunities, including real-time conversation!
Context
There are so many business processes that can be easily automated at ON:
● Returns
● Warranty Claims
● Cancellations
● Refunds
● Order Tracking
● …
Today these processes are very manual, slow, and expensive. Example: a warranty claim for shoes bought <15 days ago; decline or accept the claim? Phone call analysis applies across these processes: returns, refunds, …
Our Problem
We are currently unable to capture data from phone customer interactions.
With no data and no insights, it is hard to improve CS:
● Hard to be more efficient
● Difficult to cut costs
● Bad user experience and damaged brand reputation
Main topics
● Customer analysis: customers' needs, customer sentiment, product feedback, …
● Remove manual and time-consuming processes
● Synergies with the textual chatbot
Some numbers…
In the US alone:
● ~40k inbound calls / month
● 500k+ minutes / month
● Utilization peaks during and after Black Friday / End of Season Sales
And we have HD (help desk) phone support in many other countries!
[Chart: breakdown of the current After-Call Work split]
02 A bit of background on audio and multimodality…
Speech → Text pipeline
1. Feature Extraction. Input: speech audio. Output: spectrogram, MFCC, …
2. Acoustic Model. Input: spectrogram, MFCC. Output: textual feature probabilities.
3. Decoder + Rescoring. Input: textual feature probabilities. Output: text.
Candidate hypotheses such as "My name is Fran Cisco", "My name Francesco", and "My name is Fan Tesco" are rescored to "My name is Francesco".
4. Post-Processing (punctuation | capitalization | normalization). Input: text. Output: normalized text.
e.g. "hi let's meet at two o'clock" => "Hi, let's meet at 2:00"
Feature Extraction and Representation
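As a concrete illustration of this stage, here is a minimal sketch, assuming librosa is installed and "call.wav" is a placeholder recording (both are assumptions, not our pipeline code):

```python
# Minimal feature-extraction sketch: audio in, spectrogram / MFCC out.
import librosa

# Load the audio; speech models typically expect 16 kHz mono.
audio, sr = librosa.load("call.wav", sr=16_000, mono=True)

# Log-mel spectrogram: the time-frequency representation most acoustic
# models (including Whisper) consume.
mel = librosa.feature.melspectrogram(y=audio, sr=sr, n_mels=80)
log_mel = librosa.power_to_db(mel)

# MFCCs: the compact alternative feature set mentioned on the slide.
mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=13)

print(log_mel.shape, mfcc.shape)  # (n_mels, frames), (n_mfcc, frames)
```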
Acoustic Model - Whisper
https://github.com/openai/whisper
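For illustration, transcribing a call with open-source Whisper takes a few lines (a minimal sketch; the model size and file path are placeholders):

```python
# Transcribe a recording with open-source Whisper.
import whisper

model = whisper.load_model("base")     # tiny / base / small / medium / large
result = model.transcribe("call.wav")  # language is auto-detected by default
print(result["text"])
```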
Whisper vs Google Universal Speech Model (USM)
Whisper:
• Open-source
• 99 languages
• Easy to use
Google USM:
• 1000+ languages
• Accurate
• Fast & efficient
• Not open
USM is a Conformer-based model (convolutional + Transformer): CNNs exploit local dependencies, while the Transformer layers capture global interactions. This architecture likely mirrors the one used in Gemini multimodal models…
Decoder Reasoning
Llama 3.1 Architecture (similar to Gemini?)
https://ritvik19.medium.com/papers-explained-187c-llama-3-1-multimodal-experiments-a1940dd45575
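To make the rescoring idea from the pipeline concrete, here is a toy sketch: score each ASR hypothesis with a small causal LM and keep the most fluent one. GPT-2 is a stand-in here, not the model the deck describes:

```python
# Toy LM rescoring of ASR n-best hypotheses: lower average negative
# log-likelihood means the sentence reads as more fluent.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

hypotheses = ["My name is Fran Cisco",
              "My name Francesco",
              "My name is Fan Tesco",
              "My name is Francesco"]

def lm_score(text: str) -> float:
    """Average negative log-likelihood under the LM; lower is better."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss
    return loss.item()

# Likely picks "My name is Francesco", the most fluent hypothesis.
print(min(hypotheses, key=lm_score))
```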
03 Our Approach
Gemini API?
We need a highly complex, custom-built pipeline with 27 microservices, manual feature engineering, and handcrafted post-processing scripts to handle every edge case.
Diarization and Audio Adaptation
Diarization (built-in) with Cloud Speech-to-Text:
Speaker A: Hello, I am Francesco, how can I help you?
Speaker B: Hello, …
Speaker A: ...
Adaptation (see the configuration sketch below):
● Word boosting: ON, Cloud Tech, Cyclon, …
● Sentence format boosting: "Order number is $ORDERNUM", with the custom class ORDERNUM (R12345678)
● Fine-tuned model: GCP Telephony model
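A hedged sketch of what built-in diarization plus word boosting looks like with the Cloud Speech-to-Text v1 client; the file path, phrase list, boost value, sample rate, and speaker counts are illustrative assumptions, not our production configuration:

```python
# Diarization + phrase boosting with Google Cloud Speech-to-Text (v1).
from google.cloud import speech

client = speech.SpeechClient()

config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=8000,          # telephone audio is commonly 8 kHz
    language_code="en-US",
    model="phone_call",              # telephony-tuned model
    # Built-in diarization: label each word with a speaker tag.
    diarization_config=speech.SpeakerDiarizationConfig(
        enable_speaker_diarization=True,
        min_speaker_count=2,
        max_speaker_count=2,
    ),
    # Word boosting for brand / product vocabulary.
    speech_contexts=[
        speech.SpeechContext(phrases=["ON", "Cloud Tech", "Cyclon"],
                             boost=15.0)
    ],
)

with open("call.wav", "rb") as f:
    audio = speech.RecognitionAudio(content=f.read())

response = client.recognize(config=config, audio=audio)
# With diarization on, the last result aggregates all words with tags.
for word in response.results[-1].alternatives[0].words:
    print(word.speaker_tag, word.word)
```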
Architecture
LangChain pipelines support both batch and stream processing.
● We record all traces
● Traces feed evaluation and dataset creation
Focus: LangChain App
A pipeline with Gemini 2.0 Flash Experimental seems to provide better quality (and it is faster). But data sent to Gemini 2.0 can be used for training… So, for now, we use GCP ASR models + Gemini 1.5 Pro. A minimal sketch of such a chain follows.
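This sketch assumes the langchain-google-vertexai package; the prompt wording and model name are illustrative, not our exact production chain:

```python
# Minimal LangChain chain over call transcripts; the same runnable
# supports both batch and streaming execution, as noted above.
from langchain_core.prompts import ChatPromptTemplate
from langchain_google_vertexai import ChatVertexAI

prompt = ChatPromptTemplate.from_template(
    "Summarize this support call and extract the ticket category:\n\n{transcript}"
)
chain = prompt | ChatVertexAI(model_name="gemini-1.5-pro")

transcripts = [
    "Speaker A: Hello, I am Francesco, how can I help you? Speaker B: ..."
]

# Batch over many calls, or stream tokens for a single call.
summaries = chain.batch([{"transcript": t} for t in transcripts])
for chunk in chain.stream({"transcript": transcripts[0]}):
    print(chunk.content, end="")
```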
04 Some Results
Evaluation & Results
Confusion matrix, ticket category (results with a simple few-shot classification approach):

Actual \ Predicted    Cancellation   Payment Problem   Return   Other
Cancellation               78              1              5       1
Payment Problem             2             68              1       1
Return                      4              3             55       7
Other                       2              2              3      45

Order number extraction (e.g. R12345678) → 97%

There are several components in our pipeline:
- Speech-to-Text
- Text transformations and operations
Due to current constraints on time and resources, we're focusing on evaluating downstream tasks rather than calculating WER.
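For reference, a sketch of what "simple few-shot classification" plus order-number extraction can look like; the prompt wording, example transcripts, and label set are illustrative assumptions, and the regex mirrors the R12345678 format shown above:

```python
# Few-shot classification prompt plus rule-based order-number extraction.
import re

FEW_SHOT_PROMPT = """Classify the support call into exactly one category:
Cancellation, Payment Problem, Return, Other.

Transcript: "I want to cancel order R00000001." -> Cancellation
Transcript: "My card was charged twice." -> Payment Problem
Transcript: "These shoes are too small, can I send them back?" -> Return

Transcript: "{transcript}" ->"""

def extract_order_number(text: str) -> str | None:
    """Order numbers look like 'R' followed by 8 digits (e.g. R12345678)."""
    match = re.search(r"\bR\d{8}\b", text)
    return match.group(0) if match else None

print(extract_order_number("My order number is R12345678, please help"))
```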
05 Next Steps
ReAct Agent
Agentic framework:
• Increase accuracy
• Use tools to retrieve info about the user: orders, products, FAQs, old conversations and notes, … (see the sketch below)
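A hedged sketch of the planned ReAct agent, using langgraph's prebuilt helper; the tool names and bodies are placeholder stubs, not existing internal APIs:

```python
# ReAct agent sketch: an LLM that reasons, picks a tool, observes, repeats.
from langchain_core.tools import tool
from langchain_google_vertexai import ChatVertexAI
from langgraph.prebuilt import create_react_agent

@tool
def lookup_order(order_number: str) -> str:
    """Return status and items for an order number like R12345678."""
    return "R12345678: 1x Cyclon, shipped 2025-01-10"  # stub

@tool
def search_faqs(query: str) -> str:
    """Search help-desk FAQs for relevant articles."""
    return "Returns are free within 30 days."  # stub

agent = create_react_agent(ChatVertexAI(model_name="gemini-1.5-pro"),
                           tools=[lookup_order, search_faqs])
result = agent.invoke({"messages": [("user", "Where is order R12345678?")]})
print(result["messages"][-1].content)
```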
Long-term
Create an agent that converses with the user to gather missing information (Gemini Live APIs?)
06 Conclusions
Conclusions
Thank you!
Architecture
Speech-to-Text Pipeline
Cross-modality
https://sebastianraschka.com/blog/2024/understanding-multimodal-llms.html
Multimodality
The Core Mechanism: Unified Embedding Space
At the heart of Gemini 2.0's multimodal capabilities lies the concept of a unified embedding space. Imagine this space as a high-dimensional map where information from different modalities is projected. Crucially, semantically similar information, regardless of its original format (text, image, or sound), will be positioned close to each other within this space.
● How it works: Specialized encoders process each modality. For example, a convolutional neural network (CNN) might process images, while a transformer network handles text. These encoders are trained to project the input data into this shared embedding space (toy sketch below).
● The benefit: By representing all modalities in a common space, the model can directly compare and relate information across them. For instance, the visual features of a cat in an image can be directly associated with the word "cat" in a textual description.
Cross-Modal Attention Mechanisms: Focusing on Relevant Connections
While a unified embedding space allows for comparison, cross-modal attention mechanisms enable the model to dynamically focus on the most relevant parts of different inputs when processing them together.
● How it works: These mechanisms allow the model to learn which parts of one modality are most important in relation to specific parts of another (toy sketch below). For example, when processing an image and a question about it, the attention mechanism lets the model focus on the specific objects or regions in the image that are relevant to the question.
● The benefit: This selective attention significantly improves the accuracy and efficiency of multimodal understanding. Instead of treating all information equally, the model can prioritize the most meaningful connections.
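A toy PyTorch sketch of cross-modal attention, with text tokens as queries attending over image-patch embeddings in a shared space; all dimensions are illustrative assumptions:

```python
# Cross-modal attention: text attends over image patches.
import torch
import torch.nn as nn

d_model = 512
text_tokens = torch.randn(1, 12, d_model)    # (batch, text_len, dim)
image_patches = torch.randn(1, 49, d_model)  # (batch, n_patches, dim)

cross_attn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
fused, weights = cross_attn(query=text_tokens,
                            key=image_patches,
                            value=image_patches)
# `weights` shows which image patches each text token focused on.
print(fused.shape, weights.shape)  # (1, 12, 512), (1, 12, 49)
```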
Multimodality
Distinguishing Gemini 2.0 from Previous Multimodal AI
Traditional multimodal AI often involved separate models trained for each modality, with complex methods to fuse their outputs. This approach had limitations:
● Limited interaction: Early fusion methods might simply concatenate the outputs of individual models, hindering deep interaction between modalities.
● Separate training: Training individual models separately could lead to inconsistencies and difficulty in aligning representations.
Gemini 2.0's end-to-end training on diverse multimodal data is a key differentiator. The entire model, including the encoders and cross-modal attention mechanisms, is trained simultaneously. This allows the model to learn the optimal way to represent and relate information across modalities from the ground up.
Conformer
Briefly, the Conformer is a model built by Google and presented in 2020. The basic idea is that Transformers are capable of capturing content-based global interactions, while CNNs instead exploit local features. Google therefore combines the convolution module and multi-head self-attention in one block.
Google USM: "Conformer comprises of two macaron-like feed-forward layers with half-step residual connections sandwiching the multi-headed self-attention and convolution modules. This is followed by a post layernorm."
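A simplified PyTorch sketch of that "feed-forward / self-attention / convolution / feed-forward" sandwich; layer sizes are illustrative and details of the real Conformer (relative positional encoding, gating) are omitted:

```python
# Simplified Conformer block: two half-step feed-forward layers
# sandwiching self-attention and a depthwise convolution module.
import torch
import torch.nn as nn

class ConformerBlock(nn.Module):
    def __init__(self, d_model=256, n_heads=4, kernel_size=31):
        super().__init__()
        self.ff1 = nn.Sequential(nn.LayerNorm(d_model),
                                 nn.Linear(d_model, 4 * d_model), nn.SiLU(),
                                 nn.Linear(4 * d_model, d_model))
        self.attn_norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.conv_norm = nn.LayerNorm(d_model)
        self.conv = nn.Conv1d(d_model, d_model, kernel_size,
                              padding=kernel_size // 2, groups=d_model)
        self.ff2 = nn.Sequential(nn.LayerNorm(d_model),
                                 nn.Linear(d_model, 4 * d_model), nn.SiLU(),
                                 nn.Linear(4 * d_model, d_model))
        self.post_norm = nn.LayerNorm(d_model)

    def forward(self, x):              # x: (batch, time, d_model)
        x = x + 0.5 * self.ff1(x)      # first half-step feed-forward
        a = self.attn_norm(x)
        x = x + self.attn(a, a, a)[0]  # multi-head self-attention
        c = self.conv(self.conv_norm(x).transpose(1, 2)).transpose(1, 2)
        x = x + c                      # depthwise convolution module
        x = x + 0.5 * self.ff2(x)      # second half-step feed-forward
        return self.post_norm(x)       # post-LayerNorm

x = torch.randn(2, 100, 256)           # (batch, frames, features)
print(ConformerBlock()(x).shape)       # torch.Size([2, 100, 256])
```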
Multimodality
Input encoding (modality-specific):
● Text
○ Tokenization: uses sub-word tokenization (like Byte Pair Encoding or SentencePiece) to break text into manageable units.
○ Transformer embeddings: each token is converted into a dense vector embedding.
● Images (see the patching sketch below)
○ Vision Transformer (ViT) or variant: most likely a ViT architecture, dividing images into patches ("tokens") and processing them with a transformer; could also incorporate convolutional neural networks (CNNs) for feature extraction before or alongside the ViT.
○ Hierarchical feature extraction: possible use of multiple layers or blocks that capture image details at different levels.
● Audio
○ Spectrogram analysis: transforms audio into spectrograms (visual representations of frequencies) or uses raw audio features such as MFCC.
○ CNN or specialized audio encoder: likely a CNN or a transformer-based audio encoder to process the spectrogram or raw features.
● Video
○ Frame-based processing: likely processes video as a sequence of frames.
○ Temporal attention: incorporates mechanisms (like temporal attention) to capture relationships between frames over time; other architectures such as 3D CNNs are also possible.
● Other modalities: likely specialized encoders for other data types, such as sensor data, code, and possibly 3D meshes.
Multimodal fusion and shared representation:
● Unified transformer architecture: the core of Gemini 2.0 is a large transformer that processes all encoded data.
● Attention-based fusion:
○ Cross-attention: the key mechanism; allows each modality's representation to attend to and gather information from other modalities. For example, text can attend to visual features in an image.
○ Self-attention: used within each modality to help the model understand relationships within the modality itself.
● Late fusion: modalities are processed separately at first and fused into a joint representation layer.
● Modality-specific adapters: possibly adapter layers that can be fine-tuned to improve performance on specific modality inputs.
Output and task-specific adaptation:
● Task-specific decoders: used when generating images, video, or text.
● Fine-tuning: the pre-trained model is then fine-tuned for specific tasks (e.g., image captioning, question answering, code generation).
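A toy sketch of the ViT-style patching step mentioned above: a 224x224 image becomes a sequence of flattened 16x16 patch "tokens". The sizes are the standard ViT defaults, used here as assumptions:

```python
# Cut an image into 16x16 patches and flatten each into a token vector.
import torch

image = torch.randn(1, 3, 224, 224)                  # (batch, C, H, W)
patches = image.unfold(2, 16, 16).unfold(3, 16, 16)  # 14x14 grid of tiles
patches = patches.reshape(1, 3, 14 * 14, 16 * 16)
patches = patches.permute(0, 2, 1, 3).reshape(1, 196, 3 * 16 * 16)
print(patches.shape)  # (1, 196, 768): 196 patch tokens of dimension 768
```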
Goal
Short-term goals (analysis + support):
- Customer analysis: gain deeper insights into our customers' needs and desires, and refine our product offerings
- Understand what is happening behind the scenes: streamline processes and alleviate the workload of our HDs
- Prepare better training materials: find all possible edge cases and cover all possible situations to prepare human HD operators
Long-term goals (take more action):
● Chatbot/voicebot interface
○ It can handle real-time conversations
○ Delegates to human operators if the user requests
● Act based on more available information, e.g.
○ when we expect to have an item back in stock
○ whether the customer is loyal
○ …