Anna Chaney
Twitter: @anna_seg
Watson Applied Research
Rinse and Repeat:
The Spiral of Applied Machine Learning
2
DISCLAIMER
This presentation shows numerical results from machine learning
systems; none of the systems contained Watson technology
or used any data from an IBM customer.
The results come from internal data sets, processed with open source
libraries for sentiment analysis and classification. All results may be
scaled or shifted for illustrative purposes and are not meant to
be representative of actual system performance.
3
Analyze and Improve Performance of Machine Learning in Four Easy Steps
Step 0. Deploy your machine learning application
Step 1. Assess performance of app using human judgement
Step 2. Analyze and optimize operating thresholds
Step 3. Retrain machine learning with golden examples from humans
Step 4. Go to Step 0 with new changes
In machine learning, we need data, data, and more data. In your application you need to make sure
you record (see the logging sketch below):
• All human input into the app
• Each machine learning component's top 𝑥 responses
• The confidence scores of those 𝑥 responses
• If there are multiple ML components, the subsystem each response came from
• Timestamp and system revision (system revision should be traceable to the data used to train the
ML component)
4
Instrumentation is Key!! 🔑
Note: for purposes of this talk, we assume that the design and use
case of your application have been clearly articulated and agreed
upon with all of the stakeholders in the project.
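As a concrete illustration of the fields listed above, here is a minimal logging sketch in Python. The log_response helper, the field names, and the JSON-lines format are illustrative assumptions, not part of any Watson API.

```python
import json
import time

def log_response(log_file, user_input, top_responses, system_revision, subsystem=None):
    """Append one interaction record as a JSON line.

    top_responses: list of (response_text, confidence) pairs, ordered best-first.
    system_revision: identifier traceable to the data used to train the ML component.
    subsystem: which ML component produced the responses, if there are several.
    """
    record = {
        "timestamp": time.time(),
        "system_revision": system_revision,
        "subsystem": subsystem,
        "user_input": user_input,  # all human input into the app
        "responses": [
            {"rank": i + 1, "text": text, "confidence": conf}
            for i, (text, conf) in enumerate(top_responses)
        ],
    }
    log_file.write(json.dumps(record) + "\n")

# Example usage
with open("interaction_log.jsonl", "a") as f:
    log_response(
        f,
        "How do I reset my password?",
        [("Use the self-service portal.", 0.82), ("Contact the help desk.", 0.41)],
        system_revision="classifier-2017-03-14",
        subsystem="qa",
    )
```

One JSON object per interaction keeps the log easy to filter by timestamp or system revision when you later pull samples for human judgement.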
5
Analyze and Improve Performance of Machine Learning in Four Easy Steps
Step 0. Deploy your machine learning application
Step 1. Assess performance of app using human judgement
Step 2. Analyze and optimize operating thresholds
Step 3. Retrain machine learning with golden examples from humans
Step 4. Go to Step 0 with new changes
Experts* say:
• The operative question in evaluating a machine learning system is the extent to which it produces
the results for which it was designed.
• The most straightforward way to evaluate a machine learning system is to recruit human subjects
and ask them to assess system output along some predetermined criteria
The customer (or technical owner) decides the criteria that the system responses will be evaluated
against. All of the information to judge the system response needs to be available in the logs that are
created by the system instrumentation.
6
Measuring Performance Using Humans
*Source: The Handbook of Computational Linguistics and Natural Language Processing. Clark, Fox, and Lappin
Depends on the type of machine learning response
• Open field textual response ➡ ask a human to rate the response as one of the following categories,
given all of the original context information:
– Wrong - the answer and the question are completely unrelated
– Poor - the answer and the question might be related, but the answer does not satisfy the
question
– Decent - the answer relates to the question, but could be better
– Perfect - the answer directly relates to the question and is phrased clearly
• Classification response, e.g., measuring sentiment as {positive, negative, neutral} ➡ ask human
to classify response given all of the original context relevant to the classification
Note: even though services like Watson Conversation are classifiers under the hood, the response
received by the user is a textual response, not the label of the intent, and thus it should be evaluated
as a textual response
7
Design of Experiments – Evaluation Metrics
Overview
Give the 'what' and 'why' of a task in less than 200 characters. The overview should give a clear
high-level picture of what a worker will be doing and why what they are doing is valuable.
Steps
Describe the process by which humans will complete the task. This should be a discrete list of
steps to use to complete the task. Each step should begin with an action verb in bold.
Rules and Tips
Use green headers for positive/"Do This", yellow for warning/"Be Careful Of", red for bad/"Do Not"
Examples
Provide at least three examples of your job to contributors. This will help them perform better on
the job.
Thank you!
Humans really appreciate a customized thank you note!
8
Design of Experiments – Instructions for Human*
*Source: https://success.crowdflower.com/hc/en-us/articles/201855779-Instructions-Template
The quickest way to judge lots of data is to involve as many humans as possible. However, as you
judge the answers to many of these questions, you will find that many of the responses are open to
interpretation. Have a small group of people (2-5) who can definitively assess whether the response meets the
criteria your team has decided on for the project.
Have this team discuss and agree on the correct judgment for around 100 responses. This is your
Golden Standard. Write very clear reasons explaining why you have selected that answer. These
reasons will be shown to humans if they get a golden question incorrect. This is a great tool to explain
to your contributors how you’ve reached your answer. By explaining the answer, humans can learn
the rules and intricacies of a job in more depth than is provided in the instructions.
Golden questions should have an appropriate answer distribution that reflects your dataset. An even
answer distribution will train contributors on every possible answer instead of biasing them towards
one answer.
Before the human can contribute to evaluating the results of the job, they must pass a quiz of the
golden standard questions.
9
Design of Experiments – Create Golden
Standard
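As a rough sketch of drawing golden-question candidates with a controlled answer distribution, as described above, one could stratify by label. The field names, the even-split default, and the sample_golden_candidates helper are assumptions for illustration, not a CrowdFlower feature.

```python
import random
from collections import defaultdict

def sample_golden_candidates(labeled_examples, total=100, per_class=None, seed=0):
    """Draw golden-question candidates with a controlled answer distribution.

    labeled_examples: list of dicts like {"text": ..., "label": ...} that the
    small expert group has already judged definitively.
    per_class: optional dict mapping label -> count; defaults to an even split.
    """
    random.seed(seed)
    by_label = defaultdict(list)
    for example in labeled_examples:
        by_label[example["label"]].append(example)

    if per_class is None:
        share = total // len(by_label)
        per_class = {label: share for label in by_label}

    golden = []
    for label, count in per_class.items():
        pool = by_label[label]
        golden.extend(random.sample(pool, min(count, len(pool))))
    return golden
```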
You've got your golden questions; now you want to load all of the data that you want annotated for
human judgement.
For Q/A tasks, a general rule of thumb is 20 man-hours per 1000 questions.
1000 responses is the minimum number of responses I would want to judge in a given experiment,
and the maximum only depends on time, cost, and resources.
10
Run the Experiment
11
Analyze and Improve Performance of Machine Learning in Four Easy Steps
Step 0. Deploy your machine learning application
Step 1. Assess performance of app using human judgement
Step 2. Analyze and optimize operating thresholds
Step 3. Retrain machine learning with golden examples from humans
Step 4. Go to Step 0 with new changes
"He that knows not,
and knows not that he knows not
is a fool.
Shun him
He that knows not,
and knows that he knows not
is a pupil.
Teach him.
He that knows,
and knows not that he knows
is asleep
Wake him.
He that knows,
and knows that he knows
is a teacher.
Follow him."
12
Question and Answering: Optimize for
Human Computer Interaction
How much does it
hurt my reputation
to answer a
question
incorrectly?
Reputation = (|correct answers| − |incorrect answers|) / |asked questions|
[Chart: Help Desk Question and Answer. Results are notional and not meant to be representative of any Watson API]
[Chart: Help Desk Question and Answer. Results are notional and not meant to be representative of any Watson API]
Huge Gains in Perceived Performance
Assume 1000 questions. When the threshold is set at 0.7, you've attempted to help
534 people, giving 486 of them a correct answer.
Now assume you wanted to answer 81% of the questions with machine learning:
you've attempted to help 810 people, giving 648 people a correct answer.
Adding 162 happy customers saves me $XXXX in level > 0 support time using
the EXACT SAME SYSTEM.
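A minimal sketch of how this threshold analysis could be reproduced from judged log data, using the reputation formula above; the (confidence, is_correct) pairing and the sweep are assumptions for illustration, not an actual Watson workflow.

```python
def reputation(judged, threshold, total_asked):
    """Reputation = (|correct answers| - |incorrect answers|) / |asked questions|.

    judged: list of (confidence, is_correct) pairs, one per question where the
    system had a candidate answer; below the threshold it stays silent.
    """
    attempted = [(conf, ok) for conf, ok in judged if conf >= threshold]
    correct = sum(1 for _, ok in attempted if ok)
    incorrect = len(attempted) - correct
    return (correct - incorrect) / total_asked

# With the numbers above: (486 - 48) / 1000 = 0.438 at the 0.7 threshold,
# versus (648 - 162) / 1000 = 0.486 when 81% of the questions are attempted.
# Sweeping candidate thresholds picks the operating point that maximizes reputation:
# best_reputation, best_threshold = max(
#     (reputation(judged, t / 100, 1000), t / 100) for t in range(101)
# )
```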
• Assume a sentiment score s ∈ [−1, 1]
– sentiment is negative if 𝑠 < 0
– sentiment is neutral if 𝑠 = 0
– sentiment is positive if 𝑠 > 0
Ideally, the closer to 1 the sentiment is, the more confident the classification algorithm is that the text
is positive, and the closer to -1 the sentiment score is, the more confident the classification
algorithm is that the text is negative.
Let's measure this using the crowdsourcing platform, CrowdFlower!
16
Sentiment Analysis
• Detailed Instructions
• Human must pass a quiz to enter the job
• 1 out of every 10 judgements is a test
question; the human must maintain 80%
agreement with our test questions to
remain in the job
• First round, Twitter data: 3314 samples,
3238 had human agreement
• Second round, Twitter and news data:
6471 samples, 5642 had human
agreement
• The following analysis only considers
samples with human agreement
17
Collecting Data from the Crowd Using
CrowdFlower
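CrowdFlower enforced these quality rules natively; purely to illustrate the rule described above (not CrowdFlower's API), a sketch of filtering contributors by their agreement on hidden test questions might look like this. The judgement record format and the trusted_contributors helper are assumptions.

```python
def trusted_contributors(judgements, min_agreement=0.80, min_tests=5):
    """Keep contributors who maintain agreement on the hidden test questions.

    judgements: iterable of dicts like
      {"contributor": id, "is_test": bool, "agrees_with_gold": bool}
    """
    stats = {}  # contributor -> (test questions seen, test questions agreed)
    for judgement in judgements:
        if not judgement["is_test"]:
            continue
        seen, agreed = stats.get(judgement["contributor"], (0, 0))
        stats[judgement["contributor"]] = (seen + 1, agreed + int(judgement["agrees_with_gold"]))

    return {
        contributor
        for contributor, (seen, agreed) in stats.items()
        if seen >= min_tests and agreed / seen >= min_agreement
    }
```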
18
Experiment Results
TOTAL SAMPLES: 8880
Number of Crowd Negative: 788
Number of Crowd Neutral: 6283
Number of Crowd Positive: 1809
Number of ML Negative: 2111
Number of ML Neutral: 2156
Number of ML Positive: 4613
Partial (neg only) Agreement: 565, 71.70%
Partial (neu only) Agreement: 1757, 27.96%
Partial (pos only) Agreement: 1083, 59.87%
Total Judgements Agreement: 3405, 38.34%
*Results are notional and not meant to be representative of any Watson API
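My reading of the "partial agreement" figures is the share of each crowd label that the ML system also assigned (e.g., 565 of the 788 crowd-negative samples, i.e., 71.70%). A sketch of that calculation, assuming paired (crowd_label, ml_label) records:

```python
from collections import Counter

def agreement_report(pairs):
    """pairs: list of (crowd_label, ml_label), one per sample with crowd agreement."""
    crowd_counts = Counter(crowd for crowd, _ in pairs)
    matches = Counter(crowd for crowd, ml in pairs if crowd == ml)

    for label in sorted(crowd_counts):
        total, hits = crowd_counts[label], matches[label]
        print(f"Partial ({label} only) agreement: {hits}, {hits / total:.2%}")

    total_hits = sum(matches.values())
    print(f"Total judgements agreement: {total_hits}, {total_hits / len(pairs):.2%}")
```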
19
Redefining Neutral Using the Sentiment Score –
Effect on Positive and Negative Classification
performance
[Chart: sentiment axis from -1.0 to 1.0; scores from -x to x are labeled neutral, below negative, above positive. Shows the effect of varying x on positive and negative classification performance.]
*Results are notional and not meant to be representative of any Watson API
20
Redefining Neutral Using the Sentiment Score –
Effect on Neutral Classification performance
[Chart: sentiment axis from -1.0 to 1.0; scores from -x to x are labeled neutral, below negative, above positive. Shows the effect of varying x on neutral classification performance.]
*Results are notional and not meant to be representative of any Watson API
21
Redefining Neutral Using the Sentiment Score –
Effect on Total performance (all categories)
[Chart: sentiment axis from -1.0 to 1.0; scores from -x to x are labeled neutral, below negative, above positive. Shows the effect of varying x on total performance across all categories.]
*Results are notional and not meant to be representative of any Watson API
• We obviously don't want to maximize total correctness, because the class imbalance toward the
neutral data (approx. 70% of the data) would favor the naïve classifier "call everything neutral"
• Generated a heuristic cost function that values the correct classification of positive and negative
sentiment calls 4 times higher than the correct classification of neutral calls.
• The maximum of the cost function occurs at a sentiment score of 0.4, now implemented in our
client-side code.
• Total accuracy over all three categories goes from 38% to 57%
22
Adding in a Cost Function
*Results are notional and not meant to be representative of any Watson API
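A sketch of the sweep described on this slide, assuming a list of (sentiment_score, crowd_label) pairs; the weighting of 4 for positive/negative hits versus 1 for neutral hits follows the heuristic above, but the function names and data format are assumptions for illustration.

```python
def label_with_band(score, x):
    """Map a sentiment score in [-1, 1] to a label, treating [-x, x] as neutral."""
    if score > x:
        return "positive"
    if score < -x:
        return "negative"
    return "neutral"

def weighted_correct(samples, x, pos_neg_weight=4.0, neutral_weight=1.0):
    """samples: list of (sentiment_score, crowd_label) pairs."""
    total = 0.0
    for score, crowd_label in samples:
        if label_with_band(score, x) == crowd_label:
            total += neutral_weight if crowd_label == "neutral" else pos_neg_weight
    return total

# Sweep the neutral band width x; the slides report the maximum near x = 0.4.
# best_x = max((weighted_correct(samples, x / 100), x / 100) for x in range(101))[1]
```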
23
Final Results – weighting cost function
normalized to appear on the same axis
*Results are notional and not meant to be representative of any Watson API
24
Analyze and Improve Performance of Machine Learning in Four Easy Steps
Step 0. Deploy your machine learning application
Step 1. Assess performance of app using human judgement
Step 2. Analyze and optimize operating thresholds
Step 3. Retrain machine learning with golden examples from humans
Step 4. Go to Step 0 with new changes
25
Augmenting Classifier Systems with Golden Training
[Flow diagram: system logs → judgment of each intent answer. Good intent answer → add the question to that intent's training bin. Bad intent answer → does the correct intent exist? Yes → add the question to the correct intent's training bin; No → create a new intent.]
Note: when you train the classifier, the size of the training bin affects the probability of the intent being returned, so you may want to down-sample some training bins to avoid unwanted bias. Or, you may consider that bias a good thing, depending on the use case.
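A sketch of the routing logic in the diagram above, with the optional down-sampling mentioned in the note; the data structures and the route_judged_example helper are illustrative assumptions, not any particular classifier's API.

```python
import random

def route_judged_example(training_bins, question, correct_intent, max_bin_size=None):
    """Add a human-judged question to the intent training bins.

    training_bins: dict mapping intent name -> list of training questions.
    correct_intent: the returned intent if the answer was judged good, the
    intent a human identified if it was judged bad, or None if no intent fits.
    """
    if correct_intent is None:
        return  # no suitable intent was identified; nothing to add

    if correct_intent not in training_bins:
        training_bins[correct_intent] = []  # correct intent does not exist yet: create it

    training_bins[correct_intent].append(question)

    # Optional: down-sample oversized bins so bin size does not skew the
    # probability of the intent being returned.
    if max_bin_size is not None and len(training_bins[correct_intent]) > max_bin_size:
        training_bins[correct_intent] = random.sample(training_bins[correct_intent], max_bin_size)
```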
26
Real Data; Mocked Dashboard – Alpha denotes performance after incorporation of new data from log files
[Dashboard mock-up: User Experience 76%, Performance Prediction 80%]
Note that the performance threshold (indicated by *) is fixed at runtime
*Results are notional and not meant to be representative of any Watson API
We incorporated the human-labeled data into the model's existing training sets, and then ran tests
across all of the labeled data sets available to us.
Performance on our data set jumped from 57% to 70%.
It is also worth noting that performance on other data sets, completely distinct from our use
case, also improved. A true win-win!
27
Training Sentiment Model
*Results are notional and not meant to be representative of any Watson API
28
Analyze and Improve Performance of Machine Learning in Four Easy Steps
Step 0. Deploy your machine learning application
Step 1. Assess performance of app using human judgement
Step 2. Analyze and optimize operating thresholds
Step 3. Retrain machine learning with golden examples from humans
Step 4. Go to Step 0 with new changes
The output of a machine learning system can always
be improved. Better training data, algorithms more
suited to your use case, and system improvements
based on threshold setting can all be employed.
However, you will find that after each iteration, the
system will improve less and less…much like the
radius of a spiral as it makes rotations around the
origin.
Depending on your use case, you may decide to
stop iterating at some point, or you may need to
never stop iterating (especially true of systems
whose golden samples can change over time).
Rinse and Repeat:
The Spiral of Applied Machine Learning