Anna Chaney
Twitter: @anna_seg
Watson Applied Research
Rinse and Repeat:
The Spiral of Applied Machine Learning
2
DISCLAIMER
This presentation shows numerical results from machine learning
systems; none of the systems contained Watson technology
or used any data from an IBM customer.
The results come from internal data sets, processed with open source
libraries for sentiment analysis and classification. All results may be
scaled or shifted for illustrative purposes and are not meant to
be representative of actual system performance.
3
Analyze and Improve Performance of Machine Learning in Four Easy Steps
Step 0. Deploy your machine learning application
Step 1. Assess performance of app using human judgement
Step 2. Analyze and optimize operating thresholds
Step 3. Retrain machine learning with golden examples from humans
Step 4. Go to Step 0 with new changes
In machine learning, we need data, data, and more data. In your application you need to make sure
you record (see the logging sketch below):
• All human input into the app
• Each machine learning component's top 𝑥 responses
• The confidence scores of those 𝑥 responses
• If there are multiple ML components, the subsystem each response came from
• Timestamp and system revision (system revision should be traceable to the data used to train the
ML component)
4
Instrumentation is Key!! 🔑
Note: for purposes of this talk, we assume that the design and use
case of your application have been clearly articulated and agreed
upon with all of the stakeholders in the project.
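As a concrete illustration of the fields listed above, here is a minimal logging sketch in Python. The log_response helper, the field names, and the JSON-lines format are illustrative assumptions, not part of any Watson API.

```python
import json
import time

def log_response(log_file, user_input, top_responses, system_revision, subsystem=None):
    """Append one interaction record as a JSON line.

    top_responses: list of (response_text, confidence) pairs, ordered best-first.
    system_revision: identifier traceable to the data used to train the ML component.
    subsystem: which ML component produced the responses, if there are several.
    """
    record = {
        "timestamp": time.time(),
        "system_revision": system_revision,
        "subsystem": subsystem,
        "user_input": user_input,  # all human input into the app
        "responses": [
            {"rank": i + 1, "text": text, "confidence": conf}
            for i, (text, conf) in enumerate(top_responses)
        ],
    }
    log_file.write(json.dumps(record) + "\n")

# Example usage
with open("interaction_log.jsonl", "a") as f:
    log_response(
        f,
        "How do I reset my password?",
        [("Use the self-service portal.", 0.82), ("Contact the help desk.", 0.41)],
        system_revision="classifier-2017-03-14",
        subsystem="qa",
    )
```

One JSON object per interaction keeps the log easy to filter by timestamp or system revision when you later pull samples for human judgement.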
5
Analyze and Improve Performance of Machine Learning in Four Easy Steps
Step 0. Deploy your machine learning application
Step 1. Assess performance of app using human judgement
Step 2. Analyze and optimize operating thresholds
Step 3. Retrain machine learning with golden examples from humans
Step 4. Go to Step 0 with new changes
Experts* say:
• The operative question in evaluating a machine learning system is the extent to which it produces
the results for which it was designed.
• The most straightforward way to evaluate a machine learning system is to recruit human subjects
and ask them to assess system output along some predetermined criteria
The customer (or technical owner) decides the criteria that the system responses will be evaluated
against. All of the information to judge the system response needs to be available in the logs that are
created by the system instrumentation.
6
Measuring Performance Using Humans
*Source: The Handbook of Computational Linguistics and Natural Language Processing. Clark, Fox, and Lappin
Depends on the type of machine learning response
• Open field textual response ➡ ask a human to rate the response as one of the following categories,
given all of the original context information:
– Wrong - the answer and the question are completely unrelated
– Poor - the answer and the question might be related, but the answer does not satisfy the
question
– Decent - the answer relates to the question, but could be better
– Perfect - the answer directly relates to the question and is phrased clearly
• Classification response, e.g., measuring sentiment as {positive, negative, neutral} ➡ ask human
to classify response given all of the original context relevant to the classification
Note: even though services like Watson Conversation are classifiers under the hood, the response
received by the user is a textual response, not the label of the intent, and thus it should be evaluated
as a textual response
7
Design of Experiments – Evaluation Metrics
Overview
Give the 'what' and 'why' of a task in less than 200 characters. The overview should give a clear
high-level picture of what a worker will be doing and why what they are doing is valuable.
Steps
Describe the process by which humans will complete the task. This should be a discrete list of
steps to use to complete the task. Each step should begin with an action verb in bold.
Rules and Tips
Use green headers for positive/"Do This", yellow for warning/"Be Careful Of", red for bad/"Do Not"
Examples
Provide at least three examples of your job to contributors. This will help them perform better on
the job.
Thank you!
Humans really appreciate a customized thank you note!
8
Design of Experiments – Instructions for Human*
*Source: https://success.crowdflower.com/hc/en-us/articles/201855779-Instructions-Template
The quickest way to judge lots of data is to involve as many humans as possible. However, as you
judge the answers to many of these questions, you will find that many of the responses are open to
interpretation. Have a small group of people (2-5) who can definitively assess whether the response meets the
criteria your team has decided on for the project.
Have this team discuss and agree on the correct judgment for around 100 responses. This is your
Golden Standard. Write very clear reasons explaining why you have selected that answer. These
reasons will be shown to humans if they get a golden question incorrect. This is a great tool to explain
to your contributors how you’ve reached your answer. By explaining the answer, humans can learn
the rules and intricacies of a job in more depth than is provided in the instructions.
Golden questions should have an appropriate answer distribution that reflects your dataset. An even
answer distribution will train contributors on every possible answer instead of biasing them towards
one answer.
Before the human can contribute to evaluating the results of the job, they must pass a quiz of the
golden standard questions.
9
Design of Experiments – Create Golden
Standard
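As a rough sketch of drawing golden-question candidates with a controlled answer distribution, as described above, one could stratify by label. The field names, the even-split default, and the sample_golden_candidates helper are assumptions for illustration, not a CrowdFlower feature.

```python
import random
from collections import defaultdict

def sample_golden_candidates(labeled_examples, total=100, per_class=None, seed=0):
    """Draw golden-question candidates with a controlled answer distribution.

    labeled_examples: list of dicts like {"text": ..., "label": ...} that the
    small expert group has already judged definitively.
    per_class: optional dict mapping label -> count; defaults to an even split.
    """
    random.seed(seed)
    by_label = defaultdict(list)
    for example in labeled_examples:
        by_label[example["label"]].append(example)

    if per_class is None:
        share = total // len(by_label)
        per_class = {label: share for label in by_label}

    golden = []
    for label, count in per_class.items():
        pool = by_label[label]
        golden.extend(random.sample(pool, min(count, len(pool))))
    return golden
```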
You've got your golden questions; now you want to load all of the data that you want annotated for
human judgement.
For Q/A tasks, a general rule of thumb is 20 man-hours per 1000 questions.
1000 responses is the minimum number of responses I would want to judge in a given experiment,
and the maximum only depends on time, cost, and resources.
10
Run the Experiment
11
Analyze and Improve Performance of Machine Learning in Four Easy Steps
Step 0. Deploy your machine learning application
Step 1. Assess performance of app using human judgement
Step 2. Analyze and optimize operating thresholds
Step 3. Retrain machine learning with golden examples from humans
Step 4. Go to Step 0 with new changes
"He that knows not,
and knows not that he knows not
is a fool.
Shun him
He that knows not,
and knows that he knows not
is a pupil.
Teach him.
He that knows,
and knows not that he knows
is asleep
Wake him.
He that knows,
and knows that he knows
is a teacher.
Follow him."
12
Question and Answering: Optimize for
Human Computer Interaction
How much does it
hurt my reputation
to answer a
question
incorrectly?
Reputation = (|correct answers| − |incorrect answers|) / |asked questions|
[Chart: Help Desk Question and Answer. Results are notional and not meant to be representative of any Watson API]
[Chart: Help Desk Question and Answer. Results are notional and not meant to be representative of any Watson API]
Huge Gains in Perceived Performance
Assume 1000 questions. When the threshold is set at 0.7, you've attempted to help
534 people, giving 486 of them a correct answer.
Now assume you wanted to answer 81% of the questions with machine learning:
you've attempted to help 810 people, giving 648 people a correct answer.
Adding 162 happy customers saves me $XXXX in level > 0 support time using
the EXACT SAME SYSTEM.
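A minimal sketch of how this threshold analysis could be reproduced from judged log data, using the reputation formula above; the (confidence, is_correct) pairing and the sweep are assumptions for illustration, not an actual Watson workflow.

```python
def reputation(judged, threshold, total_asked):
    """Reputation = (|correct answers| - |incorrect answers|) / |asked questions|.

    judged: list of (confidence, is_correct) pairs, one per question where the
    system had a candidate answer; below the threshold it stays silent.
    """
    attempted = [(conf, ok) for conf, ok in judged if conf >= threshold]
    correct = sum(1 for _, ok in attempted if ok)
    incorrect = len(attempted) - correct
    return (correct - incorrect) / total_asked

# With the numbers above: (486 - 48) / 1000 = 0.438 at the 0.7 threshold,
# versus (648 - 162) / 1000 = 0.486 when 81% of the questions are attempted.
# Sweeping candidate thresholds picks the operating point that maximizes reputation:
# best_reputation, best_threshold = max(
#     (reputation(judged, t / 100, 1000), t / 100) for t in range(101)
# )
```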
• Assume a sentiment score s ∈ [−1, 1]
– sentiment is negative if 𝑠 < 0
– sentiment is neutral if 𝑠 = 0
– sentiment is positive if 𝑠 > 0
Ideally, the closer to 1 the sentiment is, the more confident the classification algorithm is that the text
is positive, and the closer to -1 the sentiment score is, the more confident the classification
algorithm is that the text is negative.
Let's measure this using the crowdsourcing platform, CrowdFlower!
16
Sentiment Analysis
• Detailed Instructions
• Human must pass a quiz to enter the job
• 1 out of every 10 judgements is a test
question; the human must maintain 80%
agreement with our test questions to
remain in the job
• First round, Twitter data: 3314 samples,
3238 had human agreement
• Second round, Twitter and news data:
6471 samples, 5642 had human
agreement
• The following analysis only considers
samples with human agreement
17
Collecting Data from the Crowd Using
CrowdFlower
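CrowdFlower enforced these quality rules natively; purely to illustrate the rule described above (not CrowdFlower's API), a sketch of filtering contributors by their agreement on hidden test questions might look like this. The judgement record format and the trusted_contributors helper are assumptions.

```python
def trusted_contributors(judgements, min_agreement=0.80, min_tests=5):
    """Keep contributors who maintain agreement on the hidden test questions.

    judgements: iterable of dicts like
      {"contributor": id, "is_test": bool, "agrees_with_gold": bool}
    """
    stats = {}  # contributor -> (test questions seen, test questions agreed)
    for judgement in judgements:
        if not judgement["is_test"]:
            continue
        seen, agreed = stats.get(judgement["contributor"], (0, 0))
        stats[judgement["contributor"]] = (seen + 1, agreed + int(judgement["agrees_with_gold"]))

    return {
        contributor
        for contributor, (seen, agreed) in stats.items()
        if seen >= min_tests and agreed / seen >= min_agreement
    }
```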
18
Experiment Results
TOTAL SAMPLES: 8880
Number of Crowd Negative: 788
Number of Crowd Neutral: 6283
Number of Crowd Positive: 1809
Number of ML Negative: 2111
Number of ML Neutral: 2156
Number of ML Positive: 4613
Partial (neg only) Agreement: 565, 71.70%
Partial (neu only) Agreement: 1757, 27.96%
Partial (pos only) Agreement: 1083, 59.87%
Total Judgements Agreement: 3405, 38.34%
*Results are notional and not meant to be representative of any Watson API
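My reading of the "partial agreement" figures is the share of each crowd label that the ML system also assigned (e.g., 565 of the 788 crowd-negative samples, i.e., 71.70%). A sketch of that calculation, assuming paired (crowd_label, ml_label) records:

```python
from collections import Counter

def agreement_report(pairs):
    """pairs: list of (crowd_label, ml_label), one per sample with crowd agreement."""
    crowd_counts = Counter(crowd for crowd, _ in pairs)
    matches = Counter(crowd for crowd, ml in pairs if crowd == ml)

    for label in sorted(crowd_counts):
        total, hits = crowd_counts[label], matches[label]
        print(f"Partial ({label} only) agreement: {hits}, {hits / total:.2%}")

    total_hits = sum(matches.values())
    print(f"Total judgements agreement: {total_hits}, {total_hits / len(pairs):.2%}")
```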
19
Redefining Neutral Using the Sentiment Score –
Effect on Positive and Negative Classification
performance
[Chart: sentiment axis from -1.0 to 1.0; scores from -x to x are labeled neutral, below negative, above positive. Shows the effect of varying x on positive and negative classification performance.]
*Results are notional and not meant to be representative of any Watson API
20
Redefining Neutral Using the Sentiment Score –
Effect on Neutral Classification performance
[Chart: sentiment axis from -1.0 to 1.0; scores from -x to x are labeled neutral, below negative, above positive. Shows the effect of varying x on neutral classification performance.]
*Results are notional and not meant to be representative of any Watson API
21
Redefining Neutral Using the Sentiment Score –
Effect on Total performance (all categories)
[Chart: sentiment axis from -1.0 to 1.0; scores from -x to x are labeled neutral, below negative, above positive. Shows the effect of varying x on total performance across all categories.]
*Results are notional and not meant to be representative of any Watson API
• We obviously don't want to maximize total correctness, because the class imbalance toward the
neutral data (approx. 70% of the data) would favor the naïve classifier "call everything neutral"
• Generated a heuristic cost function that values the correct classification of positive and negative
sentiment calls 4 times higher than the correct classification of neutral calls.
• The maximum of the cost function occurs at a sentiment score of 0.4, now implemented in our
client-side code.
• Total accuracy over all three categories goes from 38% to 57%
22
Adding in a Cost Function
*Results are notional and not meant to be representative of any Watson API
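A sketch of the sweep described on this slide, assuming a list of (sentiment_score, crowd_label) pairs; the weighting of 4 for positive/negative hits versus 1 for neutral hits follows the heuristic above, but the function names and data format are assumptions for illustration.

```python
def label_with_band(score, x):
    """Map a sentiment score in [-1, 1] to a label, treating [-x, x] as neutral."""
    if score > x:
        return "positive"
    if score < -x:
        return "negative"
    return "neutral"

def weighted_correct(samples, x, pos_neg_weight=4.0, neutral_weight=1.0):
    """samples: list of (sentiment_score, crowd_label) pairs."""
    total = 0.0
    for score, crowd_label in samples:
        if label_with_band(score, x) == crowd_label:
            total += neutral_weight if crowd_label == "neutral" else pos_neg_weight
    return total

# Sweep the neutral band width x; the slides report the maximum near x = 0.4.
# best_x = max((weighted_correct(samples, x / 100), x / 100) for x in range(101))[1]
```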
23
Final Results – weighting cost function
normalized to appear on the same axis
*Results are notional and not meant to be representative of any Watson API
24
Analyze and Improve Performance of Machine Learning in Four Easy Steps
Step 0. Deploy your machine learning application
Step 1. Assess performance of app using human judgement
Step 2. Analyze and optimize operating thresholds
Step 3. Retrain machine learning with golden examples from humans
Step 4. Go to Step 0 with new changes
25
Augmenting Classifier Systems with Golden Training
[Flow diagram: system logs → judgment of each intent answer. Good intent answer → add the question to that intent's training bin. Bad intent answer → does the correct intent exist? Yes → add the question to the correct intent's training bin; No → create a new intent.]
Note: when you train the classifier, the size of the training bin affects the probability of the intent being returned, so you may want to down-sample some training bins to avoid unwanted bias. Or, you may consider that bias a good thing, depending on the use case.
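A sketch of the routing logic in the diagram above, with the optional down-sampling mentioned in the note; the data structures and the route_judged_example helper are illustrative assumptions, not any particular classifier's API.

```python
import random

def route_judged_example(training_bins, question, correct_intent, max_bin_size=None):
    """Add a human-judged question to the intent training bins.

    training_bins: dict mapping intent name -> list of training questions.
    correct_intent: the returned intent if the answer was judged good, the
    intent a human identified if it was judged bad, or None if no intent fits.
    """
    if correct_intent is None:
        return  # no suitable intent was identified; nothing to add

    if correct_intent not in training_bins:
        training_bins[correct_intent] = []  # correct intent does not exist yet: create it

    training_bins[correct_intent].append(question)

    # Optional: down-sample oversized bins so bin size does not skew the
    # probability of the intent being returned.
    if max_bin_size is not None and len(training_bins[correct_intent]) > max_bin_size:
        training_bins[correct_intent] = random.sample(training_bins[correct_intent], max_bin_size)
```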
26
Real Data; Mocked Dashboard – Alpha denotes performance after incorporation of new data from log files
[Dashboard mock-up: User Experience 76%, Performance Prediction 80%]
Note that the performance threshold (indicated by *) is fixed at runtime
*Results are notional and not meant to be representative of any Watson API
We incorporated the human-labeled data into the model's existing training sets, and then ran tests
across all of the labeled data sets available to us.
Performance on our data set jumped from 57% to 70%.
It is also worth noting that performance on other data sets, completely distinct from our use
case, also improved. A true win-win!
27
Training Sentiment Model
*Results are notional and not meant to be representative of any Watson API
28
Analyze and Improve Performance of Machine Learning in Four Easy Steps
Step 0. Deploy your machine learning application
Step 1. Assess performance of app using human judgement
Step 2. Analyze and optimize operating thresholds
Step 3. Retrain machine learning with golden examples from humans
Step 4. Go to Step 0 with new changes
The output of a machine learning system can always
be improved. Better training data, algorithms more
suited to your use case, and system improvements
based on threshold setting can all be employed.
However, you will find that after each iteration, the
system will improve less and less…much like the
radius of a spiral as it makes rotations around the
origin.
Depending on your use case, you may decide to
stop iterating at some point, or you may need to
never stop iterating (especially true of systems
whose golden samples can change over time).
Rinse and Repeat:
The Spiral of Applied Machine Learning