Paul Lo
Data Analytics Manager @ Uber, Asia-Pacific Community Operation Central team
paullo0106@gmail.com | paul.lo@uber.com | http://paullo.myvnc.com/blog/
Transforming the Call Center with Text Mining
and Deep Learning for Better User Experience
PythonPH Sep. 2018 (https://www.meetup.com/pythonph/events/254444065/)
Project #1
Text mining tool to unlock user insights
Python lib: natural language processing,
topic modeling
Self-introduction
Who am I?
What does our analytics team do for
Asia-Pacific?
Project #2
Artificial Intelligence revolution in call
centers: deep learning-based bot
Python libs: machine learning libraries
such as tensorflow, keras, sklearn,
and numpy
Transforming the Call Center with Text Mining and Deep Learning for Better User Experience
Table of contents
Self-introduction
Skills: Full stack software engineer (Java/ Python) → Data Analyst (R/ Python, databases, machine learning)
Journey: Taipei → Shanghai → Manila
Self-introduction
Uber Shanghai → Uber Manila (APAC Community Operation Central Analytics team)
Scope of Community Operation in Uber APAC
Scope
10+ languages in ~20 locations
Central Team
In
Manila
India
Singapore (South East and North Asia)
Australia
APAC A&I
2017 Year-end
Analytics & Insights is the team responsible for building the analyses,
models, and tools to aid operational and strategic decision making for the
APAC Region. We are also dedicated to furthering Uber’s collective
analytical capability.
Self-introduction
Uber still has awesome teams (Analytics, S&P, PM, etc.) based in Manila!
Improving user experience is one of our core missions
Improve user experience
Drive down defect rate
Optimize operational efficiency
Manage the cost of business operation
Project #1:
Text mining and NLP for user experience
enhancement
Acknowledgement: Troy James Palanca, Lorenzo Ampil
Value proposition
Speed up the workflow on user experience enhancement
Defect rate and issue type
Leaderboard
Community
Operation
Product,
Engineering,
etc.
User
feedback
database
Root cause analysis
and recommended
feature or policy
changes
Review
customer
feedback in
tickets
User experience
enhancement
Making this process more efficient
Issue type dashboard as a high-level data source
Mockup
Dashboard
Problem
How can we quickly get the insights from users’ feedback?
Problem
Reviewing tickets
manually to diagnose
the root cause is
neither scalable
nor systematic
Ticket dataset
Driver > Trips > Fare … > … > Technical issue
[Diagram: 20 individual, ungrouped tickets]
Problem
How can we quickly get the insights from users’ feedback?
Solution
Use topic modeling
techniques to
efficiently group tickets
and assign them to
reasonably named
topics.
Ticket dataset
Driver > Trips > Fare … > … > Technical issue
App stuck/ crash
(35%)
Fare calculation
Dispute
(15%)
GPS issue
(55%)
Key features of our solution
Using a topic-modeling-based tool to learn our users' pain points
Ticket snippet with user profile: respective ticket
samples are displayed when clicking on a keyword
Word cloud view: user can switch to
this view to see most relevant (tf-idf
score) keywords in each topic
>>DEMO
Sample results
“Fare Disputes” in one of the cities where we operate are
mainly about payments, airport issues, and wrong
riders:
● Credit cards and other modes of payment
(18%)
● Overcharging (28.8%)
● Wrong profiles being billed (12.8%)
● Airport terminal issues (12.9%)
● Someone else taking the trip (12.5%)
Sample results
Lots of “rude”, “loud music”, “drunk”, and “slam door” keywords
were detected as the pain points of our NY driver partners
Sample results
More than 10% of driver cancellation
tickets in Singapore are related to car
seat rules for child safety: many
sample tickets show that drivers want
their cancellation fee reimbursed because
their riders brought children without prior
notice.
Tool architecture
Computing node
(any Uber servers)
Data collection
Data preparation
LDA model training
Web server
(AWS node)
Html and json
files from
training results
User Interface
(d3js)
Train the model for each country with top issues
monthly
Web 1.0 design with the focus on computing node
Workflow overview
Data input: ticket text as raw
data
Output: topic model clusters
Unlocking support insights from textual content
Sample ~50,000 tickets for
each training in each issue
category
Workflow overview
Data Preparation
(text processing)
Extract useful information and
transform corpus to a sparse
matrix
Data Modeling
(Latent Dirichlet
Allocation)
Main computation to perform
topic modeling
Data input: ticket text as raw
data
Output: topic model clusters
Unlocking support insights from textual content
Sample ~50,000 tickets for
each training in each issue
category
Text processing libraries: nltk, BeautifulSoup, re, TextBlob
LDA libraries: gensim.models.ldamodel.LdaModel and pyLDAvis
Workflow overview
Remove invalid words:
● Numbers
● Html tags
● Custom dictionary
Stemming and lemmatization
Tokenization
TFIDF (Term Frequency Inverse Document
Frequency)
Workflow overview
Remove invalid words:
● Numbers
re.sub(r'\d+', '', text)
● HTML tags
BeautifulSoup(document).get_text()
BeautifulSoup(document).find_all('b')
● Custom dictionary
Stemming and lemmatization
Tokenization
TFIDF (Term Frequency Inverse Document
Frequency)
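The removal steps listed above can be sketched end to end as follows. This is a minimal illustration, not the team's actual text_processing() code: it uses the standard library's html.parser as a stand-in for BeautifulSoup so that it runs without extra installs, and the CUSTOM_STOPWORDS set and the clean_ticket name are hypothetical.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment (tags are dropped)."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(document):
    # Stdlib stand-in for BeautifulSoup(document).get_text()
    parser = _TextExtractor()
    parser.feed(document)
    parser.close()
    return " ".join(parser.parts)


# Hypothetical custom dictionary of words that carry no signal in tickets
CUSTOM_STOPWORDS = {"uber", "hi", "thanks"}


def clean_ticket(document):
    text = strip_html(document)                  # remove HTML tags
    text = re.sub(r"\d+", " ", text)             # remove numbers
    tokens = re.findall(r"[a-z']+", text.lower())
    return [t for t in tokens if t not in CUSTOM_STOPWORDS]


print(clean_ticket("<p>Hi Uber, I was charged <b>25</b> dollars twice</p>"))
```

In practice the custom dictionary is the part that needs the most tuning per issue category.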
Workflow overview
Remove invalid words
Stemming and lemmatization: Reduce inflectional
forms and sometimes derivationally related forms of a
word to a common base form. For instance:
○ cancel, cancels, cancelled -> cancel
○ riders, rider -> rider
Tokenization
TFIDF (Term Frequency Inverse Document
Frequency)
Workflow overview
Remove invalid words
Stemming and lemmatization
Tokenization: Part-of-speech based word
detection
TFIDF (Term Frequency Inverse Document
Frequency)
Workflow overview
Remove invalid words
Stemming and lemmatization
Tokenization: Part-of-speech based word
detection
TFIDF (Term Frequency Inverse Document
Frequency): a common way to score each term
by its frequency in a document, weighted by
how rare the term is across all documents
Data Preparation (Natural Language Processing)
Using TFIDF to filter the most important keywords
Machine Learning
Model
Data Preparation (Natural Language Processing)
Using TFIDF to filter the most important keywords
Machine Learning
Model
Term frequency
Inverse Document
Frequency
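To make the two factors concrete, here is a small pure-Python sketch of the textbook formula tf(t, d) × log(N / df(t)). It is an assumption that this variant matches the one used in the tool; libraries such as gensim and scikit-learn apply their own smoothing and normalization. The sample tokens are made up.

```python
import math
from collections import Counter


def tfidf(docs):
    """docs is a list of token lists; returns one {term: score} dict per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


# Made-up tokenized tickets
docs = [
    ["driver", "cancel", "trip"],
    ["driver", "rude", "loud", "music"],
    ["driver", "gps", "wrong", "route"],
]
scores = tfidf(docs)
print(scores[0])
```

A term such as "driver" that appears in every ticket gets a score of zero, which is why TFIDF works as a keyword filter: only terms that distinguish one ticket from the rest keep a high score.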
Workflow overview
Data Preparation
(text processing)
Extract useful information and
transform corpus to a sparse
matrix
Data Modeling
(Latent Dirichlet
Allocation)
Main computation to perform
topic modeling
Data input: ticket text as raw
data
Output: topic model clusters
Data preparation for text processing can be very time-consuming
Sample ~50,000 tickets for
each training in each issue
category
Remove invalid words:
Stemming and lemmatization
Tokenization
TFIDF (Term Frequency Inverse Document
Frequency)
Speed up data processing
Pandas runs on a single thread by default
A pandas DataFrame with 50k+ rows
Data Preparation
text_processing() is a heavy function
that does many things:
● Tokenization
● Removal of numbers, html tags, and
other invalid words
● Stemming and lemmatization
● TFIDF
df['content'].apply(text_processing)
→ single thread by default
Speed up data processing
Pandas runs on a single thread by default
Worker 1
Worker 2
Worker N
keywords
Data processing speedup trick in Pandas
Pandas runs on a single thread by default
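The code shown on this slide is not part of the transcript, so the following is only a sketch of the general idea: split the rows into chunks, hand each chunk to a worker, and concatenate the results. The parallel_map helper and the toy text_processing body are hypothetical, and the sketch works on a plain list so that it needs nothing beyond the standard library.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def text_processing(text):
    # Toy stand-in for the heavy per-ticket function named on the slide
    return text.lower().split()


def _process_chunk(args):
    func, chunk = args
    return [func(item) for item in chunk]


def parallel_map(func, items, n_workers=4, executor_cls=ProcessPoolExecutor):
    """Apply func to every item using n_workers workers, preserving order.

    ProcessPoolExecutor suits CPU-bound work such as text processing;
    ThreadPoolExecutor can be passed instead for I/O-bound functions.
    """
    items = list(items)
    if not items:
        return []
    chunk_size = -(-len(items) // n_workers)  # ceiling division
    chunks = [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]
    with executor_cls(max_workers=n_workers) as pool:
        results = list(pool.map(_process_chunk, [(func, c) for c in chunks]))
    return [out for chunk_out in results for out in chunk_out]
```

With pandas, the result can be assigned back by position, for example df['keywords'] = parallel_map(text_processing, df['content']). When using processes, func must be defined at module top level so it can be pickled, and on Windows and macOS the call should sit under an `if __name__ == "__main__":` guard.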
Many handy text processing libraries
TextBlob and spaCy
Tokenization
Sentence correction: .correct()
Part of speech: .tags
Sentiment analysis: .sentiment.polarity
NLP libraries: TextBlob, spaCy
Workflow overview
Data Preparation
(text processing)
Extract useful information and
transform corpus to a sparse
matrix
Data Modeling
(Latent Dirichlet
Allocation)
Main computation to perform
topic modeling
Data input: ticket text as raw
data
Output: topic model clusters
Unlocking support insights from textual content - but how?
Sample ~50,000 tickets for
each training in each issue
category
LDA:
- Unsupervised learning
- Bag of words
- “topic distribution”
Usage:
lda = LdaModel(corpus=corpus,
id2word=dictionary,
num_topics=4,
random_state=some_number)
lda.show_topics()
Latent Dirichlet Allocation model
General concept of this model
Unsupervised learning method - does not
require any class labels; similar to clustering
‘Bag of words’ model - uses word counts in
messages without regard for word order
(Peter owes Alice money = Alice owes Peter
money)
Estimated iteratively - Starts with random
initialization then adjusts probabilities to
reduce perplexity / increase fit
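The bag-of-words point can be checked in a few lines. This is a toy illustration using collections.Counter, not the representation gensim uses internally (gensim stores lists of (token_id, count) pairs).

```python
from collections import Counter


def bag_of_words(message):
    # Word counts only; the position of each word is discarded
    return Counter(message.lower().split())


a = bag_of_words("Peter owes Alice money")
b = bag_of_words("Alice owes Peter money")
print(a == b)
```

Both messages produce the same counts, so a bag-of-words model cannot tell who owes whom. That is usually acceptable for finding topics, but it is a limitation to keep in mind when reading results.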
Doc 1, Doc 2, Doc 3, ... Doc n
Example topic: Fruits
Document-topic probabilities for one document:
30% health (topic 1)
60% fruits (topic 2)
10% disease (topic 3)
Latent Dirichlet Allocation model
Model implementation and visualization
Data Preparation
(text processing)
Extract useful information and
transform corpus to a sparse
matrix
Data Modeling
(Latent Dirichlet
Allocation)
Main computation to perform
topic modeling
Data input: ticket text as raw
data
Output: topic model clusters
Sample ~50,000 tickets for
each training in each issue
category
Usage:
from gensim.models import LdaModel
from pyLDAvis import save_html
from pyLDAvis.gensim import prepare  # named pyLDAvis.gensim_models in pyLDAvis 3.x
lda = LdaModel(corpus=corpus,
id2word=dictionary,
num_topics=4,
random_state=some_number)
lda.show_topics()
Future work and learnings
Data Preparation
(text processing)
Extract useful information and
transform corpus to a sparse
matrix
Data Modeling
(Latent Dirichlet
Allocation)
Main computation to perform
topic modeling
Customization is needed
● Not suited for
specific issue
category
● Build own
dictionary for the
removal of
irrelevant words
Data input: ticket text as raw
data
Output: topic model clusters
How to make the results more “actionable”?
● Number of topics for convergence
● Time and performance
tradeoff
● Other "Deep NLP" models?
Bad result
examples
Product owner: Huaixiu Zheng and Yichia Wang in Uber’s Applied Machine Learning team
Project #2:
Artificial Intelligence revolution in call centers
CSR’s sample workflow for user in a call center
How do our users submit an issue?
CSR’s sample workflow for user in a call center
Online support via in-app-help
User: Select Issue Category → Write Message → Contact Ticket
CSR: Confirm Issue Category → Lookup info. & Knowledge Base → Select Action → Write response using a Reply Template → Response
The issue for call center operation: scalability and cost
The growth comes at a price again….
Solution? Let’s start from a basic sample
“I want to change my rating for a rider”
API-less solution to the basic sample
We can ‘program’ the pre-defined logic for certain tickets with Selenium or Chrome Script
element mapping
element mapping
End-to-end solution
Web interaction
Read and Write (click and input text)
Knowledge base
● Keyword recognition
● Web element id dictionary
● (Natural Language Processing)
Policy engine
Program the flow aligning with policy/ SOP
Monitoring and logging
● Real-time gsheet API logging
● Monitoring and alert trigger
Ticket
answering
bot
The business impact of a simple bot-solving solution
3k+ weekly solves
A team of
18 CSR
28k USD
monthly
What’s the problem with this solution?
What’s the problem with this solution?
“Scalability”
The difference between Programming and Machine Learning
Outputs =
Agents’
responses
Inputs =
Contact
Ticket
Our machine learning solution design
Why go with “Semi-automated” assistance rather than a real robot?
Pros:
- Scalable solution to all (+ new) ticket-types
- Flexible and safer application as human
can still evaluate it and make the final call
Cons: Not fully automated to replace the agent
workforce completely.
Product designed by Hugh Williams, Huaixiu Zheng, Yi-Chia Wang in Applied Machine Learning team
Our machine learning solution design
‘Assistant to CSR’ - Provide suggestions for reply and actions
Issue category suggestion
Action suggestion
10M+ tickets
Correct response from
agents to these 10M+
tickets
Technical model training Product design
Typical Machine Learning process
Note: picture from Mark Peng’s “General Tips for participating Kaggle Competitions” on SlideShare
Typical Machine Learning process
Model selection
ML 101:
Start with a simple model first
Data source: https://eng.uber.com/cota-v2/
Deep Learning Architecture
Reference: Uber AML Lab: http://eng.uber.com/cota
Sample code with Keras for a simple CNN
Deep Learning Architecture
Reference: Uber AML Lab: http://eng.uber.com/cota
Essay: COTA: Improving the Speed and Accuracy of Customer Support through Ranking and
Deep Networks
Development environment for Deep learning model training
What does model training look like?
>> DEMO
Main codebase + data set
Feature engineering and feature importance
Trade off between capacity and interpretability
“Capacity” “Interpretability”
Feature engineering and feature importance
What are the important features? Very easy to learn that in a simpler model
Feature engineering and feature importance
What are the important features? Very easy to get explanation in simpler models
Feature engineering and feature importance
What are the important features? NN model is like our brain’s intuition … blackbox
Feature engineering and feature importance
What are the important features?
Sklearn: Recursive feature elimination
(sklearn.feature_selection.RFE)
Mockup
dataset
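A short sketch of recursive feature elimination with scikit-learn follows. The synthetic dataset from make_classification and the choice of LogisticRegression as the estimator are assumptions for illustration; the slide's mockup dataset is not in the transcript.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic data: 6 features, of which only 2 carry signal
X, y = make_classification(n_samples=300, n_features=6, n_informative=2,
                           n_redundant=0, random_state=0)

# RFE repeatedly fits the estimator and drops the weakest feature
selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=2)
selector.fit(X, y)

print(selector.support_)   # boolean mask of the features that were kept
print(selector.ranking_)   # 1 = kept; larger numbers were eliminated earlier
```

Because RFE refits the model after every elimination round, its cost grows with the training time of the estimator, which is why it is better suited to simple models than to deep networks.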
Feature engineering and feature importance
What are the important features?
Time on model training >>> prediction
Shuffle each feature to create noise…. on the testing set
Mockup
dataset
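The shuffling idea is often called permutation importance: train once, then shuffle one feature at a time on the test set and measure how much the error grows. Below is a minimal NumPy sketch under stated assumptions: the data is synthetic, and a least-squares linear model stands in for the neural network. Only predictions are repeated, never training, which matters when training takes far longer than prediction.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: feature 0 matters most, feature 1 a little, feature 2 not at all
X = rng.normal(size=(500, 3))
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=500)
X_train, X_test, y_train, y_test = X[:400], X[400:], y[:400], y[400:]

# Train once (least squares here; any fitted model with a predict step would do)
coef, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)


def mse(features, target):
    return float(np.mean((features @ coef - target) ** 2))


baseline = mse(X_test, y_test)
importance = []
for j in range(X_test.shape[1]):
    X_shuffled = X_test.copy()                            # keep the real test set intact
    X_shuffled[:, j] = rng.permutation(X_shuffled[:, j])  # break feature j only
    importance.append(mse(X_shuffled, y_test) - baseline)

print(importance)
```

The larger the increase in error after shuffling a feature, the more the model relies on that feature.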
Python tips: be cautious about the underlying “copy implementation”
np.random.shuffle
What’s the value of
my_list2?
A. [1, 2, 3, 4, 5]
B. [2, 5, 1, 4, 3]
Python tips: be cautious about the underlying “copy implementation”
np.random.permutation
Python tips: be cautious about the underlying “copy implementation”
np.random.permutation
from copy import deepcopy
my_list2 = deepcopy(my_list)
Python tips: be cautious about the underlying “copy implementation”
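The quiz code itself is not in the transcript, so the sketch below assumes that my_list2 was created with a plain assignment (my_list2 = my_list). Under that assumption both names point to the same list, and the in-place np.random.shuffle changes both.

```python
import numpy as np
from copy import deepcopy

my_list = [1, 2, 3, 4, 5]
my_list2 = my_list                # assumed: plain assignment, so no copy is made
np.random.shuffle(my_list2)       # shuffles in place and returns None
same_object = my_list2 is my_list
print(same_object, my_list == my_list2)

# Two ways to leave the original untouched
original = [1, 2, 3, 4, 5]
permuted = np.random.permutation(original)  # returns a new shuffled array
copied = deepcopy(original)                 # independent copy, safe to shuffle
np.random.shuffle(copied)
print(original)
```

This matters for the feature-shuffling approach: shuffling a column through an alias silently corrupts the original test set for every later measurement.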
Feature engineering and feature importance
What are the important features?
Shuffle each feature to create noise…. on the testing set
Mockup
example
Issue category suggestion
Action suggestion
Product design
Last stop: making business Impact
Ensure KPI measurement is well-planned in the beginning
User: Select Issue Category → Write Message → Contact Ticket
CSR: Confirm Issue Category → Lookup info. & Knowledge Base → Select Action → Write response using a Reply Template → Response
Last stop: making business Impact
Identify key business metrics, and cautiously conduct and monitor A/A and A/B testing
Source: https://eng.uber.com/cota-v2/
Look forward to collaborating! http://careers.uber.com
Paul Lo
Data Analytics Manager @ Uber
paul.lo@uber.com | paullo0106@gmail.com | http://paullo.myvnc.com/blog/
Q&A

[PythonPH] Transforming the call center with Text mining and Deep learning (Case study@Uber)

  • 1. Paul Lo Data Analytics Manager @ Uber, Asia-Pacific Community Operation Central team paullo0106@gmail.com | paul.lo@uber.com | http://paullo.myvnc.com/blog/ Transforming the Call Center with Text Mining and Deep Learning for Better User Experience PythonPH Sep. 2018 (https://www.meetup.com/pythonph/events/254444065/)
  • 2. Project #1 Text ming tool to unlock user insights Python lib: natural language processing, topic modeling Self-introduction Who am I? What does our analytics team do for Asia-Pacific? Project #2 Artificial Intelligence revolution in call centers: deep learning-based bot Python lib: machine learning related such as tensorflow, keras, sklearn, numpy, and etc. Transforming the Call Center with Text Mining and Deep Learning for Better User Experience Table of contents Transforming the Call Center with Text Mining and Deep Learning for Better User Experience
  • 3. Self-introduction Skills: Full stack software engineer (Java/ Python) → Data Analyst (R/ Python, databases, machine learning) Journey: Taipei → Shanghai → Manila
  • 4. Self-introduction Uber Shanghai → Uber Manila (APAC Community Operation Central Analytics team)
  • 5. Scope of Community Operation in Uber APAC Scope 10+ languages in ~20 locations Central Team In Manila India Singapore (South East and North Asia) Australia
  • 6. APAC A&I 2017 Year-end Analytics & Insights is the team responsible for building the analyses, models, and tools to aid operational and strategic decision making for the APAC Region. We are also dedicated to furthering Uber’s collective analytical capability.
  • 7. Self-introduction Uber still has awesome team (Analytics, S&P, PM, and etc) based in Manila!!
  • 8. Improving user experience is one of our core mission Improve user experience Drive down defect rate Optimize operational efficiency Manage the cost of business operation
  • 9. Project #1: Text mining and NLP for use experience enhancement Acknowledgement: Troy James Palanca, Lorenzo Ampil
  • 10. Value proposition Speed up the workflow on user experience enhancement Defect rate and issue type Leaderboard Community Operation Product, Engineering, and etc. User feedback database Root cause analysis and recommended feature or policy changes Review customer feedback in tickets User experience enhancement
  • 11. Value proposition Speed up the workflow on user experience enhancement Defect rate and issue type Leaderboard Community Operation Product, Engineering, and etc. User feedback database Root cause analysis and recommended feature or policy changes Review customer feedback in tickets User experience enhancement Making this process more efficient
  • 12. Issue type dashboard as a high-level data source Mockup Dashboard
  • 13. Problem How can we quickly get the insights from users’ feedback? Problem Reviewing tickets manually to diagnose the root cause is not scalable and unsystematic Ticket dataset Driver > Trips > Fare … > … > Technical issue ticket ticket ticket ticket ticket ticket ticket ticket ticket ticket ticket ticket ticket ticket ticket ticket ticket ticket ticket ticket
  • 14. Problem How can we quickly get the insights from users’ feedback? Solution Use topic modeling techniques to efficiently group tickets and assign them to reasonably named topics. Ticket dataset Driver > Trips > Fare … > … > Technical issue App stuck/ crash (35%) Fare calculation Dispute (15%) GPS issue (55%)
  • 15. Key features of our solution Using Topic modeling based tool to learn pain points from our users Ticket snippet with user profile: respective ticket samples are displayed when clicking on a keyword Word cloud view: user can switch to this view to see most relevant (tf-idf score) keywords in each topic >>DEMO
  • 16. Sample results “Fare Disputes” in one of the city we operate are mainly about payments, airport issues, and wrong riders: ● Credit cards and other modes of payment (18%) ● Overcharging (28.8%) ● Wrong profiles being billed (12.8%) ● Airport terminal issues (12.9%) ● Someone else taking the trip (12.5%)
  • 17. Sample results Lots of “rude”, “loud music”, “drunk”, and “slam door” keywords were detected as the pain points of our NY driver partners
  • 18. Sample results More than 10% of driver cancellation tickets in Singapore are related to car seat rules for child safety: many sample tickets show that drivers want to reimburse their cancellation fee due to their riders bringing children without prior notice.
  • 19. Tool architecture Computing node (any Uber servers) Data collection Data preparation LDA model training Web server (AWS node) Html and json files from training results User Interface (d3js) Train the model for each country with top issues monthly Web 1.0 design with the focus on computing node
  • 20. Workflow overview Data input: ticket text as raw data Output: topic model clusters Unlocking support insights from textual content Sample ~50,000 tickets for each training in each issue category
  • 21. Workflow overview Data Preparation (text processing) Extract useful information and transform corpus to a sparse matrix Data Modeling (Latent Dirichlet Allocation) Main computation to perform topic modeling Data input: ticket text as raw data Output: topic model clusters Unlocking support insights from textual content Sample ~50,000 tickets for each training in each issue category Text processing library: nltk, BeautifulSoup, re, TextBlob LDA library: gensim.ldamodel.LdaModel and pyLDAvis
  • 22. Workflow overview Data Preparation (text processing) Extract useful information and transform corpus to a sparse matrix Data Modeling (Latent Dirichlet Allocation) Main computation to perform topic modeling Data input: ticket text as raw data Output: topic model clusters Unlocking support insights from textual content Sample ~50,000 tickets for each training in each issue category Remove invalid words: ● Numbers ● Html tags ● Custom dictionary Stemming and lemmatization Tokenization TFIDF (Term Frequency Inverse Document Frequency)
  • 23. Workflow overview Data Preparation (text processing) Extract useful information and transform corpus to a sparse matrix Data Modeling (Latent Dirichlet Allocation) Main computation to perform topic modeling Data input: ticket text as raw data Output: topic model clusters Unlocking support insights from textual content Sample ~50,000 tickets for each training in each issue category Remove invalid words: ● Numbers re.sub(r'd+', '', text) ● Html tags BeautifulSoup(document).get_text() BeautifulSoup(document).find_all(‘b’) ● Custom dictionary Stemming and lemmatization Tokenization TFIDF (Term Frequency Inverse Document Frequency)
  • 24. Workflow overview Data Preparation (text processing) Extract useful information and transform corpus to a sparse matrix Data Modeling (Latent Dirichlet Allocation) Main computation to perform topic modeling Data input: ticket text as raw data Output: topic model clusters Unlocking support insights from textual content Sample ~50,000 tickets for each training in each issue category Remove invalid words Stemming and lemmatization: Reduce inflectional forms and sometimes derivationally related forms of a word to a common base form. For instance: ○ cancel, cancels, cancelled -> cancel ○ riders, rider -> rider Tokenization TFIDF (Term Frequency Inverse Document Frequency)
  • 25. Workflow overview Data Preparation (text processing) Extract useful information and transform corpus to a sparse matrix Data Modeling (Latent Dirichlet Allocation) Main computation to perform topic modeling Data input: ticket text as raw data Output: topic model clusters Unlocking support insights from textual content Sample ~50,000 tickets for each training in each issue category Remove invalid words Stemming and lemmatization Tokenization: Part-of-speech based word detection TFIDF (Term Frequency Inverse Document Frequency)
  • 26. Workflow overview Data Preparation (text processing) Extract useful information and transform corpus to a sparse matrix Data Modeling (Latent Dirichlet Allocation) Main computation to perform topic modeling Data input: ticket text as raw data Output: topic model clusters Unlocking support insights from textual content Sample ~50,000 tickets for each training in each issue category Remove invalid words Stemming and lemmatization Tokenization: Part-of-speech based word detection TFIDF (Term Frequency Inverse Document Frequency) Common practice to score each term with weighted frequency and relevance
  • 27. Data Preparation (Natural Language Processing) Using TFIDF to filter the most important keywords Machine Learning Model
  • 28. Data Preparation (Natural Language Processing) Using TFIDF to filter the most important keywords Machine Learning Model Term frequency Inverse Document Frequency
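To make the two factors concrete, here is a small worked sketch of one common TFIDF variant: term frequency is the term's share of a document's tokens, and inverse document frequency is the log of (number of documents / number of documents containing the term). Libraries such as gensim and scikit-learn apply their own smoothing and normalization, so their numbers will differ. The function name `tf_idf` and the sample tickets are made up for illustration.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: score} dict per tokenized document."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each term once per document

    scores = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        scores.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


# Hypothetical, already-tokenized tickets.
docs = [
    ["driver", "cancel", "trip"],
    ["driver", "late"],
    ["driver", "cancel", "refund"],
]
scores = tf_idf(docs)
print(scores[0])
```

A term that appears in every ticket ("driver" here) gets an IDF of log(1) = 0, so it is filtered out as uninformative, while a term unique to one ticket scores highest.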
  • 29. Workflow overview Data Preparation (text processing) Extract useful information and transform corpus to a sparse matrix Data Modeling (Latent Dirichlet Allocation) Main computation to perform topic modeling Data input: ticket text as raw data Output: topic model clusters Data preparation for text processing can be very time-consuming Sample ~50,000 tickets for each training in each issue category Remove invalid words: Stemming and lemmatization Tokenization TFIDF (Term Frequency Inverse Document Frequency)
  • 30. Speed up data processing Pandas runs on a single thread by default A pandas DataFrame with 50k+ rows Data Preparation text_processing() is a heavy function that does many things: ● Tokenization ● Removal of numbers, HTML tags, and other invalid words ● Stemming and lemmatization ● TFIDF df['content'].apply(text_processing) → single thread by default
  • 31. Speed up data processing Pandas runs on a single thread by default Worker 1 Worker 2 Worker N keywords
  • 32. Data processing speedup trick in Pandas Pandas runs on a single thread by default (code screenshot)
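The code for this trick is an image on the original slide and is not in the transcript. One way to get the "Worker 1 ... Worker N" layout is to split the column into chunks and run the usual `apply` on each chunk in a separate process; the sketch below assumes that approach. `text_processing` is a trivial stand-in for the heavy function, and `parallel_apply` and `_apply_chunk` are hypothetical names.

```python
from concurrent.futures import ProcessPoolExecutor

import pandas as pd


def text_processing(text):
    """Hypothetical stand-in for the heavy per-ticket function on the slide."""
    return text.lower().split()


def _apply_chunk(chunk):
    # Runs inside a worker: the ordinary single-threaded apply, on one chunk.
    return chunk.apply(text_processing)


def parallel_apply(series, n_workers=4, executor_cls=ProcessPoolExecutor):
    """Split `series` into up to n_workers chunks and process them concurrently.

    Any concurrent.futures executor class can be passed, but only separate
    processes sidestep the GIL for CPU-bound text processing.
    """
    if series.empty:
        return series.apply(text_processing)
    size = -(-len(series) // n_workers)  # ceiling division
    chunks = [series.iloc[i:i + size] for i in range(0, len(series), size)]
    with executor_cls(max_workers=n_workers) as pool:
        results = list(pool.map(_apply_chunk, chunks))
    # pool.map keeps chunk order, so the original index order is preserved.
    return pd.concat(results)
```

Usage would look like `df['keywords'] = parallel_apply(df['content'], n_workers=8)`, called from under an `if __name__ == "__main__":` guard when using processes. The speedup depends on the per-row cost outweighing the overhead of pickling chunks to and from the workers.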
  • 33. Many handy text processing libraries TextBlob and spaCy Tokenization Sentence correction .correct() Part of speech .tags Sentiment analysis .sentiment.polarity NLP Library (TextBlob) (spaCy)
  • 34. Workflow overview Data Preparation (text processing) Extract useful information and transform corpus to a sparse matrix Data Modeling (Latent Dirichlet Allocation) Main computation to perform topic modeling Data input: ticket text as raw data Output: topic model clusters Unlocking support insights from textual content - but how? Sample ~50,000 tickets for each training in each issue category LDA: - Unsupervised learning - Bag of words - “topic distribution” Usage: lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=4, random_state=some_number) lda.show_topics()
  • 35. Latent Dirichlet Allocation model General concept of this model Unsupervised learning method - does not require any class labels; similar to clustering ‘Bag of words’ model - uses word counts in messages without regard for their order (Peter owes Alice money = Alice owes Peter money) Estimated iteratively - Starts with random initialization then adjusts probabilities to reduce perplexity / increase fit Doc 1 Doc 2 Doc 3 ... Doc n (topic) Fruits document-topic probabilities: 30% health (topic 1), 60% fruits (topic 2), 10% disease (topic 3)
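The bag-of-words point can be checked in a few lines: once a message is reduced to word counts, the two sentences from the slide become indistinguishable. The `bag_of_words` helper is a hypothetical name and uses plain whitespace splitting rather than a real tokenizer.

```python
from collections import Counter


def bag_of_words(message):
    """Reduce a message to word counts, discarding word order."""
    return Counter(message.lower().split())


a = bag_of_words("Peter owes Alice money")
b = bag_of_words("Alice owes Peter money")

print(a)
print(a == b)  # the model cannot tell who owes whom
```

This is the information LDA works from, which is why it captures what a ticket is about but not the direction of a request.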
  • 36. Latent Dirichlet Allocation model Model implementation and visualization Data Preparation (text processing) Extract useful information and transform corpus to a sparse matrix Data Modeling (Latent Dirichlet Allocation) Main computation to perform topic modeling Data input: ticket text as raw data Output: topic model clusters Sample ~50,000 tickets for each training in each issue category Usage: lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=4, random_state=some_number) lda.show_topics() from pyLDAvis.gensim import prepare, save_html from gensim.models import LdaModel
  • 37. Future work and learnings Data Preparation (text processing) Extract useful information and transform corpus to a sparse matrix Data Modeling (Latent Dirichlet Allocation) Main computation to perform topic modeling Customization is needed ● Not suited for specific issue categories ● Build own dictionary for the removal of irrelevant words Data input: ticket text as raw data Output: topic model clusters How to make the results more “actionable”? ● Number of topics for convergence ● Time and performance tradeoff ● Other “Deep NLP” models? Bad result examples
  • 38. Table of contents Self-introduction Who am I? What does our analytics team do for Asia-Pacific? Project #1 Text mining tool to unlock user insights Python lib: natural language processing, topic modeling Project #2 Artificial Intelligence revolution in call centers: deep learning-based bot Python lib: machine learning related, such as tensorflow, keras, sklearn, numpy, etc. Transforming the Call Center with Text Mining and Deep Learning for Better User Experience
  • 39. Product owner: Huaixiu Zheng and Yichia Wang in Uber’s Applied Machine Learning team Project #2: Artificial Intelligence revolution in call centers
  • 40. CSR’s sample workflow for user in a call center How do our users submit an issue?
  • 41. CSR’s sample workflow for user in a call center Online support via in-app-help User CSR Contact Ticket Response Select Issue Category Write Message Confirm Issue Category Lookup info. & Knowledge Base Select Action Write response using a Reply Template
  • 42. The issue for call center operation: scalability and cost The growth comes at a price again….
  • 43. Solution? Let’s start from a basic sample “I want to change my rating for a rider”
  • 44. API-less solution to the basic sample We can ‘program’ the pre-defined logic for certain tickets with Selenium or Chrome Script element mapping element mapping
  • 45. End-to-end solution Web interaction Read and Write (click and input text) Knowledge base ● Keyword recognition ● Web element id dictionary ● (Natural Language Processing) Policy engine Program the flow aligning with policy/ SOP Monitoring and logging ● Real-time gsheet API logging ● Monitoring and alert trigger Ticket answering bot
  • 46. The business impact of a simple bot-solving solution 3k+ weekly solves A team of 18 CSR 28k USD monthly
  • 47. What’s the problem with this solution?
  • 48. What’s the problem with this solution? “Scalability”
  • 49. The difference between Programming and Machine Learning Outputs = Agents’ responses Inputs = Contact Ticket
  • 50. Our machine learning solution design Why go with “Semi-automated” assistance rather than a real robot? Pros: - Scalable solution to all (+ new) ticket-types - Flexible and safer application as a human can still evaluate it and make the final call Cons: Not fully automated to replace the agent workforce completely. Product designed by Hugh Williams, Huaixiu Zheng, Yi-Chia Wang in Applied Machine Learning team
  • 51. Our machine learning solution design ‘Assistant to CSR’ - Provide suggestions for reply and actions Issue category suggestion Action suggestion 10M+ tickets Correct response from agents to these 10M+ tickets Technical model training Product design
  • 52. Typical Machine Learning process Note: picture from Mark Peng’s “General Tips for participating Kaggle Competitions” on SlideShare
  • 53. Typical Machine Learning process Model selection ML 101: Start with a simple model first Data source: https://eng.uber.com/cota-v2/
  • 54. Deep Learning Architecture Reference: Uber AML Lab: http://eng.uber.com/cota Sample code with Keras for a simple CNN
  • 55. Deep Learning Architecture Reference: Uber AML Lab: http://eng.uber.com/cota Essay: COTA: Improving the Speed and Accuracy of Customer Support through Ranking and Deep Networks
  • 56. Development environment for Deep learning model training What does model training look like? >> DEMO Main codebase + data set
  • 57. Feature engineering and feature importance Trade off between capacity and interpretability “Capacity” “Interpretability”
  • 58. Feature engineering and feature importance What are the important features? Very easy to learn that in a simpler model
  • 59. Feature engineering and feature importance What are the important features? Very easy to get explanation in simpler models
  • 60. Feature engineering and feature importance What are the important features? NN model is like our brain’s intuition … blackbox
  • 61. Feature engineering and feature importance What are the important features? Sklearn: Recursive feature elimination (sklearn.feature_selection.RFE) Mockup dataset
  • 62. Feature engineering and feature importance What are the important features? Time on model training >>> prediction Shuffle each feature to create noise…. on the testing set Mockup dataset
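The shuffle idea works with any fitted model, because it needs only predictions on the test set and no retraining, which matters when training takes far longer than prediction. The sketch below uses a least-squares linear model on synthetic data as a stand-in for the real network; the data, the coefficients, and the use of mean squared error are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: feature 0 matters a lot, feature 1 a little, feature 2 not at all.
n = 400
X = rng.normal(size=(n, 3))
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=n)
X_train, X_test = X[:300], X[300:]
y_train, y_test = y[:300], y[300:]

# "Train" once (stand-in for an expensive model fit).
coef, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)


def mse(X_eval, y_eval):
    return float(np.mean((X_eval @ coef - y_eval) ** 2))


baseline = mse(X_test, y_test)

# Shuffle one feature column at a time on the test set and measure the damage.
importance = []
for j in range(X_test.shape[1]):
    X_shuffled = X_test.copy()  # copy, so the original test set stays intact
    X_shuffled[:, j] = rng.permutation(X_shuffled[:, j])
    importance.append(mse(X_shuffled, y_test) - baseline)

for j, score in enumerate(importance):
    print(f"feature {j}: error increase {score:.3f}")
```

The larger the error increase when a feature is scrambled, the more the model relies on it. Repeating the shuffle several times and averaging would reduce the noise in each estimate.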
  • 63. Python tips: be cautious about the underlying “copy implementation” np.random.shuffle What’s the value of my_list2? A. [1, 2, 3, 4, 5] B. [2, 5, 1, 4, 3]
  • 64. np.random.shuffle What’s the value of my_list2? A. [1, 2, 3, 4, 5] B. [2, 5, 1, 4, 3] Python tips: be cautious about the underlying “copy implementation”
  • 65. np.random.permutation Python tips: be cautious about the underlying “copy implementation”
  • 66. np.random.permutation from copy import deepcopy my_list2 = deepcopy(my_list) Python tips: be cautious about the underlying “copy implementation”
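The quiz code itself is an image on the original slides, so the following is a reconstruction under an assumption: that `my_list2` was created by plain assignment, which makes it a second name for the same list rather than a copy. The variable names `alias`, `snapshot`, and `shuffled_copy` are introduced here for illustration.

```python
from copy import deepcopy

import numpy as np

my_list = [1, 2, 3, 4, 5]
alias = my_list               # plain assignment: same object, no copy
snapshot = deepcopy(my_list)  # independent copy taken before shuffling

np.random.seed(0)
np.random.shuffle(my_list)    # shuffles in place and returns None

print(alias is my_list)  # True: the alias sees the shuffled order too
print(snapshot)          # still [1, 2, 3, 4, 5]

# np.random.permutation returns a new shuffled array and leaves its input alone.
shuffled_copy = np.random.permutation(snapshot)
print(snapshot)          # unchanged
```

For feature-importance shuffling this distinction matters: shuffling a column in place through an alias silently corrupts the test set for every later feature, while `permutation` or an explicit copy keeps the original intact.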
  • 67. Feature engineering and feature importance What are the important features? Shuffle each feature to create noise…. on the testing set Mockup example
  • 68. Issue category suggestion Action suggestion Product design Last stop: making business Impact Ensure KPI measurement is well-planned in the beginning User CSR Contact Ticket Response Select Issue Category Write Message Confirm Issue Category Lookup info. & Knowledge Base Select Action Write response using a Reply Template
  • 69. Last stop: making business Impact Identify key business metrics, and cautiously conduct and monitor A/A and A/B testing Source: https://eng.uber.com/cota-v2/
  • 70. Look forward to collaborating! http://careers.uber.com
  • 71. Paul Lo Data Analytics Manager @ Uber paul.lo@uber.com | paullo0106@gmail.com | http://paullo.myvnc.com/blog/ Q&A