Continuous Learning Systems:
Building ML systems that learn from
their mistakes
This work was done when the authors were at Freshworks
Anuj Gupta
Head of Machine Learning, Vahan
Saurabh Arora, Satyam Saxena, Navaneethan Santhanam
Agenda
1. Understanding the Problem Statement
● Background
● Metrics that matter
● Observations
2. Solution v1.0
3. Issues
4. Solution v2.0
a. Building the feedback loop
b. Global + local
5. Results
6. Conclusions and Way Forward
Background
● Customer support on social media is now a must for all B2C brands.
● Examples: @AppleSupport, @AmazonHelp, @BofA_Help.
● Twitter and Facebook have launched dedicated features for this.
● Most CRM suites support Customer Service@social.
Metrics that matter
● Owing to the public nature of the conversations, brands
care about 2 things:
a. Reply fast
b. Reply well
Both contribute to how a brand is perceived.
● To measure (a), the 2 key metrics are:
a. Average First Response Time (AFRT)
b. Average Response Time (ART)
● Many of our customers (the CS teams of brands) had pretty high AFRT/ART.
● Ask: reduce AFRT/ART.
● Traffic on a brand’s social channel is not just questions or requests. It’s a lot more than that!
Observations
[Figure: sample tweets on a support handle, marked ✅ (actionable) or ❌ (noise/spam)]
Observations
● The average number of replies sent per agent per day was relatively low (~12–15), yet
ART/FRT were pretty high.
● Of the total inbound traffic on support handles, only a fraction of tickets were being replied to,
typically ~5%–40%.
● Between 2 messages that were responded to, there were many
messages (~3–30) that were not.
Most of the time was going into finding
actionable conversations.
Solution v1.0
• Build a noise filter for CS@social.
• Model it as a (binary) classification problem: Actionable vs Noise/Spam.
• Acquire a good-quality dataset.
• Engineer features – there are some very good indicators.
• Train-test-tune to ~75% accuracy. Deploy.
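As an illustrative sketch of a v1.0-style setup (the deck does not name the features or the classifier; TF-IDF plus logistic regression here are assumptions, and the tweets are toy stand-ins):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-in data; 1 = actionable, 0 = noise/spam.
tweets = ["my order #1234 never arrived", "good morning everyone!",
          "the app crashes when I log in", "any new offers today?"]
labels = [1, 0, 1, 0]

noise_filter = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),   # assumed featurization
    LogisticRegression(max_iter=1000),     # assumed classifier
)
noise_filter.fit(tweets, labels)

# Route new traffic: 1 -> send to an agent, 0 -> filter out.
print(noise_filter.predict(["my refund hasn't come through yet"]))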
Issues
● Performance varied across brands.
● While the model worked very well for some brands, for others it did very badly.
● As time* went by, even the models that had performed well started doing badly.
*within a couple of weeks of deployment
Behind the Scenes
• Our data was changing: non-stationary distributions.
• A stationary process is time-independent – its averages remain more or less constant over time.
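Slightly more formally (a standard textbook definition, not from the deck): a process $\{X_t\}$ is weakly stationary if

```latex
\mathbb{E}[X_t] = \mu
\qquad \text{and} \qquad
\operatorname{Cov}(X_t,\, X_{t+h}) = \gamma(h) \quad \text{for all } t, h
```

i.e., the mean and the autocovariance do not depend on $t$. The tweet stream violated exactly this: the input distribution (and the label boundary) kept moving.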
Behind the Scenes (contd.)
• The world of CS@social is not just black (noise) and white (actionable).
• It also has a spectrum of grey in between:
a. “Hi”, “Hello”, “Good morning”
b. “Any new offers today?”
c. “The recent ad you launched is very good. Keep it up.”
d. Quizzes, engagement posts
• Some brands respond to such traffic; some do not.
• Noise and actionable were merely the 2 extremes of this spectrum.
• The definition of noise and actionable was not consistent across brands.
• The boundary (in the grey region) separating noise from actionable varies from brand to brand.
• A single common classifier for all brands is doomed to fail!
In a Nutshell
• Given the last few slides, the degradation in model performance shouldn’t come
as a surprise.
• One model to fit all brands is not going to work.
• Non-stationary distributions are not specific to Twitter data; they show up
in other domains as well:
o Monitoring & anomaly detection (one-class classification) in adversarial settings
o Recommendations (where user preferences are continuously changing; evolving labels)
o Stock market predictions (concept drift; evolving distributions)
Towards a Solution: Exploration
• Build a per-brand model to capture brand-specific learning.
• Learn from mistakes: in our system, by looking at which messages are
replied to and which are not, we know (with a small delay) whether the
classification made by the system was right or wrong.
• The model was not utilizing these signals to improve.
• If feedback is utilized well, the model can:
• Adapt over time to each brand’s definition of noise and actionable.
• Adapt to variations/changes in features.
Incorporate feedback
• Option 1: Frequently retrain the model on updated data and redeploy.
o Training, testing, fine-tuning – 45K models.
Compute heavy. Doesn’t scale at all.
o Loses all old learnings.
• Option 2: Keep learning from feedback – the model adapts to new incoming data.
What worked for us
● 2 models – Global + Local
● The Global model is common to all
brands
○ Batch trained on a large corpus
○ No short-term updates
● 1 Local model per brand
○ Fast learner
○ Short-term updates
Local
• Goal
o Improve with feedback.
• Desired properties
o Fast learner (light compute)
▪ Incorporates most feedback successfully
(after a model update, if the same data point is presented again, the model must correctly predict its class label)
o Avoids catastrophic forgetting
(after a model update, if the last N data points are presented, the model should still predict their class labels with high accuracy)
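These two properties translate into simple checks. A hedged sketch (assuming a scikit-learn-style model with `partial_fit`/`predict`; the function names and the 0.8 floor are illustrative, not from the deck):

```python
def incorporates_feedback(model, x, y_true):
    """After updating on (x, y_true), the model should get that same point right."""
    model.partial_fit([x], [y_true])
    return model.predict([x])[0] == y_true

def avoids_forgetting(model, recent_X, recent_y, min_acc=0.8):
    """After recent updates, accuracy on the last N points should stay high.
    min_acc is an illustrative threshold, not a value from the deck."""
    correct = sum(p == y for p, y in zip(model.predict(recent_X), recent_y))
    return correct / len(recent_y) >= min_acc
```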
Building the feedback loop
[Diagram: a Tweet enters the ML model, which outputs a prediction <Tweet, Yp>; the true label <Tweet, YT> is later observed; if YT ≠ Yp, the pair <Tweet, YT> is fed back to update the model]
Possible approaches to incorporating feedback
• Mini-batches: works fine if the velocity of feedback data is high (you don’t
have to wait long to accumulate a mini-batch of feedback). But many
applications don’t have high velocity.
• Instant feedback, tiny batches: very few data points per update – can skew
the model.
Building the feedback loop
• We model a feedback point <Tweet, YT> as a data point presented to the local model
in an online setting.
• Thus, a stream of feedback = an incoming data stream.
• We used online learning:
1. Data is modeled as a stream.
2. When presented with a data point (X), the model makes a prediction (YP).
3. The environment reveals the correct class label (YT).
4. If YP ≠ YT, update the model with <X, YT>.
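A minimal sketch of this protocol in Python, assuming scikit-learn's `PassiveAggressiveClassifier` as the local model (`loss="squared_hinge"` gives the PA-II variant named on the next slide; the toy stream below is illustrative):

```python
import numpy as np
from sklearn.linear_model import PassiveAggressiveClassifier

# loss="squared_hinge" corresponds to Crammer's PA-II.
model = PassiveAggressiveClassifier(loss="squared_hinge", C=1.0)

# Toy stand-in stream: 2-d points, label 1 iff x0 + x1 > 0.
rng = np.random.default_rng(0)
stream = [(x, int(x[0] + x[1] > 0)) for x in rng.normal(size=(200, 2))]

# Seed the model with the first point (partial_fit needs the class list once).
x0, y0 = stream[0]
model.partial_fit([x0], [y0], classes=[0, 1])

mistakes = 0
for x, y_true in stream[1:]:
    y_pred = model.predict([x])[0]        # model predicts YP for data point X
    if y_pred != y_true:                  # environment reveals YT; if YP != YT ...
        mistakes += 1
        model.partial_fit([x], [y_true])  # ... update the model with <X, YT>
print(f"online mistakes: {mistakes} / {len(stream) - 1}")
```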
Online Algorithms
For a comparison of online learners, see http://scikit-learn.org/stable/auto_examples/linear_model/plot_sgd_comparison.html
We chose Crammer’s PA-II (Passive-Aggressive II).
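For reference, the PA-II update from reference [1] (Crammer et al., 2006): on round $t$, with the hinge loss computed against the revealed label $y_t \in \{-1, +1\}$, the weights move just far enough toward the new point, with aggressiveness capped by the parameter $C$:

```latex
\ell_t = \max\bigl(0,\; 1 - y_t\,(\mathbf{w}_t \cdot \mathbf{x}_t)\bigr),
\qquad
\mathbf{w}_{t+1} = \mathbf{w}_t + \tau_t\, y_t\, \mathbf{x}_t,
\qquad
\tau_t = \frac{\ell_t}{\lVert \mathbf{x}_t \rVert^2 + \tfrac{1}{2C}}
```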
Results of Local:
• Dataset – 150K tweets, time-sequenced.
• Feedback incorporation improves accuracy:
o Trained a model (offline, batch mode) on the first 100K data points.
o On the test set (last 50K data points) it gave 75% accuracy (offline, batch mode).
o Then ran the model over the same test data (50K data points) in an online fashion:
the model made a total of 9,028 mistakes, and each mistake was instantly fed
back into the local model as feedback.
This gives an accuracy of ~82% across the test set (1 − 9028/50000 ≈ 0.82).
○ We gained ~7% accuracy by incorporating feedback.
Improving accuracy
[Plot: the local model’s running accuracy vs # of test points, improving as feedback is incorporated]
We also tested the local model by feeding it wrong feedback.
Combining global and local
• The scores from the global and local models are combined into a single score, and a
threshold is applied to arrive at the prediction (one possible scheme is sketched below the figure note).
• We got an accuracy of ~82%.
[Diagram: Global and Local scores merged into a combined score; plot of combined accuracy vs # of test points]
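A minimal sketch of one way to do the combination (the deck does not specify how the scores are blended; the convex mix and the 0.5 threshold below are assumptions):

```python
def combined_prediction(global_score: float, local_score: float,
                        alpha: float = 0.5, threshold: float = 0.5) -> int:
    """Blend the global and per-brand local scores, then threshold.
    alpha and threshold are illustrative; in practice they would be tuned."""
    score = alpha * global_score + (1 - alpha) * local_score
    return int(score >= threshold)  # 1 = actionable, 0 = noise

# Global model leans 'noise', but the brand-adapted local model disagrees:
print(combined_prediction(global_score=0.35, local_score=0.80))  # -> 1
```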
Pros:
• Improved running accuracy.
• Personalization: the notion of spam varies from brand to brand. Some
brands treat ‘Hi’, ‘Hello’ as spam while others treat them as actionable. By
learning from feedback, the model adapts to each brand’s notions.
• The local model is lightweight and fast, thus easy to bootstrap, deploy and scale.
Cons:
● The local model can overfit to feedback and thus become biased.
● Need to monitor for bias.
● Reset the local model when it becomes biased.
Future Work
• Instead of a single global model, have vertical-specific global models.
• Try other online algorithms.
• Handle drift.
• Don’t incorporate every feedback point – update only on the most important ones.
References
1. “Online Passive-Aggressive Algorithms” – Crammer et al., JMLR 2006
2. “The Learning Behind Gmail Priority Inbox” – Aberdeen et al., LCCC: NIPS Workshop 2010
3. “Learning with Drift Detection” – Gama et al., SBIA 2004
4. “Adaptive Regularization of Weight Vectors” – Crammer et al., NIPS 2009
5. LIBOL – A Library for Online Learning Algorithms. https://github.com/LIBOL/LIBOL
Thank You
Please feel free to reach out post this talk or on the interwebs.
@anujgupta82
https://www.linkedin.com/in/anujgupta-82/