Machine learning yearning

Machine
Learning
Yearning
Technical Strategy for AI Engineers,
In the Era of Deep Learning
Andrew Ng.
Slide preparation:
Mohammad Pourheidary
(m.pourheidary@yahoo.com)
Winter 2018

Introduction
1- Why Machine Learning Strategy
2- How to use this book to help your
team
3- Prerequisites and Notation
4- Scale drives machine learning
progress
Machine learning is the foundation of countless important applications:
web search
email anti-spam
speech recognition
product recommendations
and more
Example: Building a cat picture startup

IDEAS!
• Get more data: Collect more pictures of cats.
• Collect a more diverse training set.
• Train the algorithm longer, by running more
gradient descent iterations.
• Try a bigger neural network, with more
layers/hidden units/parameters.
• Try a smaller neural network.
• Try adding regularization (such as L2
regularization).
• Change the neural network architecture
(activation function, number of hidden units, etc.)

After finishing this book, you will have a deep understanding of how
to set technical direction for a machine learning project.
But your teammates might not understand why you’re
recommending a particular direction.
The chapters are short, so that you can print out just the 1-2 pages you need them to
know.
You will become the superhero of your team!

supervised learning:
learning a function that maps from x to y
using labeled training examples (x,y)
Supervised learning algorithms:
linear regression
logistic regression
neural networks
http://ml-class.org

Many of the ideas of deep learning (neural networks) have been
around for decades.
Why are these ideas taking off now?
• Data availability: People are now spending more time on digital
devices (laptops, mobile devices). Their digital activities generate
huge amounts of data that we can feed to our learning algorithms.
• Computational scale: We started just a few years ago to be able
to train neural networks that are big enough to take advantage of the
huge datasets we now have

▫ If you train larger and larger neural networks, you can obtain even better
performance.
▫ Thus, you obtain the best performance when you (i) train a very large neural
network and (ii) have a huge amount of data.
▫ Even as you accumulate more data, the performance of older learning
algorithms “plateaus.”
▫ The algorithm stops improving even as you give it more data.
▫ Older algorithms didn’t know what to do with all the data we now have.
▫ If you train a small neural network (NN) on the same supervised learning
task, you might get slightly better performance.
▫ By “small NN” we mean a neural network with only a small number of hidden
units/layers/parameters.

Setting up development and test sets

5- Your development and test sets
6- Your dev and test sets should come
from the same distribution
7- How large do the dev/test sets
need to be?
8- Establish a single-number
evaluation metric for your team to
optimize
9- Optimizing and satisficing metrics
10- Having a dev set and metric
speeds up iterations
11- When to change dev/test sets and
metrics
12- Takeaways: Setting up
Your team gets a large training set by downloading pictures of cats
(positive examples) and non-cats (negative examples) from different
websites.
They split the dataset 70%/30% into training and test sets.
Using this data, they build a cat detector that works well on the
training and test sets.
But when you deploy this classifier into the mobile app,
you find that the performance is really poor!
Users are uploading pictures taken with mobile phones:
lower resolution
blurrier
and poorly lit

We usually define:
• Training set
Which you run your learning algorithm on.
• Dev (development) set
Which you use to tune parameters, select features, and make other
decisions regarding the learning algorithm. Sometimes also called the
hold-out cross validation set.
• Test set
Which you use to evaluate the performance of the algorithm, but not
to make any decisions regarding what learning algorithm or parameters to
use.
Choose dev and test sets to reflect data you expect to get in the
future and want to do well on.
If you really don’t have any way of getting data that approximates
what you expect to get in the future, be aware of the risk of this
leading to a system that doesn’t generalize well.
[Figure: a 70% training set / 30% test set split]
The purpose of the dev and test sets is to direct you toward the
most important changes to make to the machine learning system.
The dev and test sets should therefore come from the same distribution.
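The split described above can be sketched as follows. This is a minimal illustration, assuming the examples fit in a Python list; the name `split_dataset` and the 15%/15% dev/test fractions are hypothetical choices, not taken from the slides. A random split like this only helps when the shuffled data already resembles the data you expect in the future; if it does not (for example, web images versus mobile uploads), the dev and test sets should be drawn from the target data instead.

```python
import random

def split_dataset(examples, dev_frac=0.15, test_frac=0.15, seed=0):
    """Shuffle examples and split them into (train, dev, test) lists.

    dev and test are cut from the same shuffled pool, so they share
    one distribution. Fractions are illustrative assumptions.
    """
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_dev = int(len(shuffled) * dev_frac)
    n_test = int(len(shuffled) * test_frac)
    dev = shuffled[:n_dev]
    test = shuffled[n_dev:n_dev + n_test]
    train = shuffled[n_dev + n_test:]
    return train, dev, test

train, dev, test = split_dataset(range(1000))
print(len(train), len(dev), len(test))
```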

Suppose your team develops a system that works well on the dev set but not the test set.
If the dev and test sets had come from the same distribution:
• There is a very clear diagnosis: you have overfit the dev set.
• The obvious cure is to get more dev set data.
If the dev and test sets come from different distributions, several things could have gone wrong:
1. You had overfit to the dev set.
2. The test set is harder than the dev set. Your algorithm might be doing as well as could be expected, and no further improvement is possible.
3. The test set is not necessarily harder, but just different from the dev set. A lot of your work to improve dev set performance might be wasted effort.

“Working on machine learning applications is hard enough!
Having mismatched dev and test sets
introduces additional uncertainty
about whether improving on the dev set distribution
also improves test set performance.
Having mismatched dev and test sets
makes it harder to figure out what is and isn’t working,
and thus makes it harder to prioritize what to work on.”

The dev set should be large enough to detect differences between
algorithms that you are trying out.
accuracy(classifier A)= 90.0%
accuracy(classifier B)= 90.1%
To detect this 0.1% difference:
• A 100-example dev set is too small.
• Dev sets of 1,000 to 10,000 examples are common.
To detect a 0.01% difference:
• You need a dev set much larger than 10,000 examples.
How about the size of the test set?
It should be large enough to give high confidence in the overall performance of your system.
• One popular heuristic is to use 30% of your data for your test set.
This works well when you have a modest number of examples (100 to 10,000 examples).
But in the era of big data (more than a billion examples)?!
There is no need to have excessively large dev/test sets
beyond what is needed to evaluate the performance of your algorithms.
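One rough way to see why dev set size matters is the binomial standard error of a measured accuracy, sqrt(p(1-p)/n). This is a standard statistical approximation, not something the slides state, and the function name `accuracy_std_error` is a hypothetical helper. The sketch shows how the sampling noise shrinks as the dev set grows.

```python
import math

def accuracy_std_error(accuracy, n_examples):
    """Binomial standard error of an accuracy measured on n_examples."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_examples)

# Noise in a ~90% accuracy estimate at several dev set sizes.
for n in (100, 1_000, 10_000, 100_000):
    se = accuracy_std_error(0.90, n)
    print(f"n={n:>7}: standard error ~ {100 * se:.2f}%")
```

Treat this only as a guide: when two classifiers are scored on the same dev examples the comparison is paired, so a difference can be detectable at smaller sizes than the unpaired standard error suggests.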

A single-number evaluation metric lets us quickly judge which algorithm is superior.
In contrast, having multiple-number evaluation metrics
makes it harder to compare algorithms.
Having a single-number evaluation metric speeds up your ability to
make a decision when you are selecting among a large number of
classifiers.
Classifier   Precision   Recall   F1 score
A            95%         90%      92.4%
B            98%         85%      91.0%
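The F1 scores in the table can be reproduced from precision and recall. The sketch below uses the standard definition of F1 as the harmonic mean of the two; the function name `f1_score` is just a local helper here, not a library call.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Precision/recall values for classifiers A and B from the table.
for name, p, r in [("A", 0.95, 0.90), ("B", 0.98, 0.85)]:
    print(f"Classifier {name}: F1 = {100 * f1_score(p, r):.1f}%")
```

With a single number per classifier, A (92.4%) ranks above B (91.0%) without having to weigh precision against recall by hand.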

Evaluation metrics:
• Accuracy
• Precision/Recall
• F-measure / F1-score
• Specificity/Sensitivity multiplication
• AUC
• Gini Coefficient

Suppose you care about:
accuracy
running time
binary file size of the model
Classifier   Accuracy   Running time   Binary size
A            90%        80ms           a
B            92%        95ms           b
C            95%        105ms          c
1) Derive a single metric by putting accuracy and running time into a single formula, such as:
Accuracy - 0.5*RunningTime
2) First, define what is an “acceptable” running time. Let’s say anything that runs in 100ms is acceptable.
Then, maximize accuracy, subject to your classifier meeting the running time criterion.
If you are trading off N different criteria,
you might consider setting N-1 of the criteria as “satisficing” metrics,
i.e., you simply require that they meet a certain value (threshold).
Then define the final one as the “optimizing” metric.
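The second approach (one satisficing metric, one optimizing metric) can be sketched as a filter followed by a maximum. The function `pick_classifier` and the tuple layout are hypothetical; the accuracy and running time figures are the ones given for classifiers A, B, and C.

```python
def pick_classifier(candidates, max_running_ms=100):
    """Return the most accurate candidate whose running time is acceptable.

    candidates: list of (name, accuracy, running_time_ms) tuples.
    Running time is the satisficing metric (must be <= max_running_ms);
    accuracy is the optimizing metric. Returns None if nothing qualifies.
    """
    acceptable = [c for c in candidates if c[2] <= max_running_ms]
    if not acceptable:
        return None
    return max(acceptable, key=lambda c: c[1])

candidates = [("A", 0.90, 80), ("B", 0.92, 95), ("C", 0.95, 105)]
best = pick_classifier(candidates)
print(best[0])  # B: C is more accurate but exceeds the 100 ms limit
```

With N criteria, the same pattern extends to N-1 threshold checks in the filter and a single `max` over the remaining metric.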

It is very difficult to know in advance
what approach will work best for a new problem.
When building a machine learning system:
1. Start off with some idea on how to build the system.
2. Implement the idea in code.
3. Carry out an experiment that tells you how well the idea worked.
(Usually the first few ideas don’t work!)
4. Go back to generate more ideas, and keep on iterating.
This is an iterative process.
The faster you can go around this loop, the faster you will make progress.
Measuring each idea on a dev set with a metric therefore lets you quickly decide:
• which ideas to keep refining,
• which ones to discard.

When starting out on a new project,
try to quickly choose dev/test sets,
since this gives the team a well-defined target to aim for.
Typically, come up with an initial dev/test set and an initial metric
in less than one week.
“It is better to come up with something imperfect and get going quickly,
rather than overthink this.”
But this one week timeline does not apply to mature applications.
For example, anti-spam is a mature deep learning application.
Teams working on mature systems may spend months acquiring even better dev/test sets.
If you later realize that your initial dev/test set or metric missed the mark,
change them quickly.
For example, if your dev set + metric ranks classifier A above classifier B,
but your team thinks that classifier B is actually superior for your product,
then this might be a sign that you need to change:
• your dev/test sets
• or your evaluation metric.

Main possible causes of the dev set/metric incorrectly rating classifier A higher:
1. The actual distribution you need to do well on is different from the dev/test sets.
Suppose your initial dev/test set had mainly pictures of adult cats, but users are uploading a lot more kitten images than expected.
Then the dev/test set distribution is not representative of the actual distribution you need to do well on.
In this case, update your dev/test sets to be more representative.
2. You have overfit to the dev set.
The process of repeatedly evaluating ideas on the dev set causes your algorithm to gradually “overfit” to the dev set.
If you find that your dev set performance is much better than your test set performance,
it is a sign that you have overfit to the dev set. In this case, get a fresh dev set.
You can also evaluate your system regularly (say, once per week or once per month) on the test set.
But do not use the test set to make any decisions regarding the algorithm. If you do so, you will start to overfit to the test set,
and can no longer count on it to give a completely unbiased estimate of your system’s performance.
3. The metric is measuring something other than what the project needs to optimize.
Suppose that for your cat application, your metric is classification accuracy.
This metric currently ranks classifier A as superior to classifier B.
But suppose you try out both algorithms, and find classifier A is allowing occasional pornographic images to slip through.
Even though classifier A is more accurate, the bad impression left by the occasional pornographic image means its performance is
unacceptable.
What do you do?
Here, the metric is failing to identify the fact that classifier B is in fact better than classifier A for your product.
So, you can no longer trust the metric to pick the best algorithm. It is time to change evaluation metrics.
For example, you can change the metric to heavily penalize letting through pornographic images.
Picking a new metric is strongly recommended, rather than proceeding without a trusted metric.

• Choose dev and test sets from a distribution that reflects what data you expect to get in the future and
want to do well on.
This may not be the same as your training data’s distribution.
• Choose dev and test sets from the same distribution if possible.
• Choose a single-number evaluation metric for your team to optimize.
If there are multiple goals that you care about, consider combining them into a single formula
(such as averaging multiple error metrics) or defining satisficing and optimizing metrics.
• Machine learning is a highly iterative process: You may try many dozens of ideas before finding one that
you’re satisfied with.
• Having dev/test sets and a single-number evaluation metric helps you quickly evaluate algorithms, and
therefore iterate faster.
• When starting out on a brand new application, try to establish dev/test sets and a metric quickly,
say in less than a week. It might be okay to take longer on mature applications.
• The old heuristic of a 70%/30% train/test split does not apply for problems where you have a lot of data;
the dev and test sets can be much less than 30% of the data.
• Your dev set should be large enough to detect meaningful changes in the accuracy of your algorithm,
but not necessarily much larger. Your test set should be big enough to give you a confident estimate of
the final performance of your system.
• If your dev set and metric are no longer pointing your team in the right direction, quickly change them:
(i) If you had overfit the dev set, get more dev set data.
(ii) If the actual distribution you care about is different from the dev/test set distribution, get new dev/test set data.
(iii) If your metric is no longer measuring what is most important to you, change the metric.

Basic Error Analysis
13- Build your first system quickly,
then iterate
14- Error analysis: Look at dev set
examples to evaluate ideas
15- Evaluating multiple ideas in
parallel during error analysis
16- Cleaning up mislabeled dev and
test set examples
17- If you have a large dev set, split it
into two subsets, only one of which
you look at
18- How big should the Eyeball and
Blackbox dev sets be?
19- Takeaways: Basic error analysis
You want to build a new email anti-spam system.
ideas:
• Collect a huge training set of spam email.
• Develop features for understanding the text content of the email.
• Develop features for understanding the email header features.
• and more.
Don’t start off trying to design and build the perfect system.
Instead, build and train a basic system quickly (in just a few days).
Even if the basic system is far from the “best” system you can build,
it is valuable to examine how the basic system functions:
you will quickly find clues that show you the most promising directions
in which to invest your time.

When you play with your cat app,
you notice several examples where it mistakes dogs for cats!
A team member proposes incorporating 3rd party software
that will make the system do better on dog images.
These changes will take a month.
Should you ask them to go ahead?
First, estimate how much it will actually improve the system’s accuracy.
Then you can more rationally decide if this is worth the month of
development time,
or if you’re better off using that time on other tasks.
What you can do:
1. Gather a sample of 100 dev set examples that your system
misclassified (i.e., errors).
2. Look at these examples manually, and count what fraction of them
are dog images.
The process of looking at misclassified examples
is called error analysis.

If you find that only 5% of the misclassified images are dogs,
then no matter how much you improve your algorithm’s
performance on dog images, you won’t get rid of more than 5% of
your errors.
5% is a “ceiling” for how much the proposed project could help.
Thus, if your overall system is currently 90% accurate (10% error),
this improvement is likely to result in at best 90.5% accuracy
(or 9.5% error, which is 5% less error than the original 10% error).
In contrast, if you find that 50% of the mistakes are dogs,
then you can be more confident that the proposed project will
have a big impact.
It could boost accuracy from 90% to 95%
(a 50% relative reduction in error, from 10% down to 5%).
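The "ceiling" arithmetic above can be written as a one-line calculation. This is a minimal sketch; `best_case_error` is a hypothetical helper name, and it assumes the optimistic case in which every error in the targeted category is eliminated.

```python
def best_case_error(current_error, fraction_of_errors_fixed):
    """Error rate if every error in the targeted category were eliminated."""
    return current_error * (1.0 - fraction_of_errors_fixed)

# 10% overall error; dogs make up 5% or 50% of the misclassified images.
for fraction in (0.05, 0.50):
    new_error = best_case_error(0.10, fraction)
    print(f"{fraction:.0%} of errors are dogs -> best case error {new_error:.1%}")
```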

team ideas for improving the cat detector:
• Fix the problem of recognizing dogs as cats.
• Fix the problem of recognizing great cats (lions, panthers, etc.)
• Improve the system’s performance on blurry images.
• …
Evaluate all of these ideas in parallel.
Look through ~100 misclassified dev set images.
Create a spreadsheet, also jot down comments that might help
Image        Dog   Great cat   Blurry   Comments
1            ✔                          Unusual pitbull color
2                              ✔
3                  ✔           ✔        Lion; picture taken at zoo on rainy day
4                  ✔                    Panther behind tree
…            …     …           …        …
% of total   8%    43%         61%
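The bookkeeping behind such a spreadsheet can be sketched in a few lines. The function `tally_error_categories`, the tag names, and the four sample rows are all hypothetical; in practice the tags come from a person looking at each misclassified image.

```python
from collections import Counter

def tally_error_categories(rows):
    """rows: one set of category tags per misclassified example.

    Returns {category: fraction of examples carrying that tag}. Fractions can
    sum to more than 1 because an example may belong to several categories.
    """
    counts = Counter(tag for tags in rows for tag in tags)
    return {tag: n / len(rows) for tag, n in counts.items()}

# Hypothetical tags for four misclassified images.
rows = [{"dog"}, {"blurry"}, {"great_cat", "blurry"}, {"great_cat"}]
fractions = tally_error_categories(rows)
for tag, frac in sorted(fractions.items(), key=lambda kv: -kv[1]):
    print(f"{tag}: {frac:.0%}")
```

Sorting by fraction puts the largest error categories first, which is the order in which they are usually worth investigating.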

You might notice that some examples in your dev set are mislabeled,
i.e., labeled incorrectly by a human labeler even before the algorithm encountered them.
If the fraction of mislabeled images is significant,
add a category to keep track of the fraction of examples mislabeled.
Image        Dog   Great cat   Blurry   Mislabeled   Comments
…            …     …           …        …            …
98                                      ✔            Labeler missed cat in background
99                 ✔
100                                     ✔            Drawing of a cat; not a real cat
% of total   8%    43%         61%      6%
Should you correct the labels in your dev set?

For example, suppose your classifier’s performance is:
• Overall accuracy on dev set: 90% (10% overall error)
• Errors due to mislabeled examples: 0.6% (6% of dev set errors)
• Errors due to other causes: 9.4% (94% of dev set errors)
There is no harm in manually fixing the mislabeled images in the dev set,
but it is not crucial to do so:
It might be fine not knowing whether your system has 10% or 9.4%
overall error.
For example, suppose your classifier’s performance is:
• Overall accuracy on dev set: 98.0% (2.0% overall error)
• Errors due to mislabeled examples: 0.6% (30% of dev set errors)
• Errors due to other causes: 1.4% (70% of dev set errors)
It is now worthwhile to improve the quality of the labels in the dev set.
Tackling the mislabeled examples will help you figure out if a classifier’s
error is closer to 1.4% or 2%,
a significant relative difference.
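The two scenarios differ only in how large the mislabeling error is relative to the overall error. A minimal sketch of that ratio, with `mislabel_share` as a hypothetical helper name:

```python
def mislabel_share(overall_error, mislabel_error):
    """Fraction of all dev set errors that are due to mislabeled examples."""
    return mislabel_error / overall_error

print(f"{mislabel_share(0.10, 0.006):.0%}")  # first scenario: 10% overall error
print(f"{mislabel_share(0.02, 0.006):.0%}")  # second scenario: 2% overall error
```

The same 0.6% of mislabeled examples goes from a minor share of the errors to nearly a third of them as the classifier improves, which is why fixing labels becomes worthwhile later in a project.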

“Whatever process you apply to fixing dev set labels,
remember to apply it to the test set labels too
so that your dev and test sets continue to be drawn from the same distribution.”

Suppose you have a large dev set of 5,000 examples
on which you have a 20% error rate (about 1,000 misclassified examples).
It takes a long time to manually examine 1,000 images,
so we might decide not to use all of them in the error analysis.
Split the dev set into two subsets:
• one which you look at manually (containing about 100 errors, 10% of the errors)
• one which you don’t look at, but use to tune parameters (containing about 900 errors, 90% of the errors).
Eyeball dev set:
Randomly select 10% of the dev set (500 examples).
The name reminds us that we are looking at it with our eyes.
We would expect our algorithm to misclassify about 100 of these examples.
Blackbox dev set:
It will have the remaining 4,500 examples.
Use it to evaluate classifiers automatically
by measuring their error rates, and also to tune hyperparameters.
However, you should avoid looking at it with your eyes.
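The Eyeball/Blackbox split can be sketched as a seeded random partition. The function `split_eyeball_blackbox` is a hypothetical helper; the 10% fraction and the 5,000-example dev set are the figures used in the example.

```python
import random

def split_eyeball_blackbox(dev_set, eyeball_frac=0.10, seed=0):
    """Randomly split a dev set into (eyeball, blackbox) lists.

    A fixed seed keeps the split stable across runs, so the examples
    you inspect by hand never leak into the Blackbox set later.
    """
    shuffled = list(dev_set)
    random.Random(seed).shuffle(shuffled)
    n_eyeball = round(len(shuffled) * eyeball_frac)
    return shuffled[:n_eyeball], shuffled[n_eyeball:]

eyeball, blackbox = split_eyeball_blackbox(range(5000))
print(len(eyeball), len(blackbox))  # 500 4500
```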

Your Eyeball dev set should be large enough
to give you a sense of your algorithm’s major error categories.
Guidelines:
• 10 mistakes: it’s better than nothing
• 20 mistakes: you start to get a sense of the major error sources
• 50 mistakes: you get a good sense of the major error sources
• 100 mistakes: you get a very good sense of the major sources of errors
Suppose your classifier has a 5% error rate.
To have ~100 misclassified examples in the Eyeball dev set,
the Eyeball dev set would have to have about 2,000 examples (0.05*2,000 = 100).
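The sizing rule above is just the target number of mistakes divided by the error rate. A minimal sketch, with `eyeball_size_needed` as a hypothetical helper name:

```python
import math

def eyeball_size_needed(error_rate, target_mistakes=100):
    """Dev set size expected to yield about target_mistakes misclassified examples."""
    return math.ceil(target_mistakes / error_rate)

print(eyeball_size_needed(0.05))  # 2000
```

The lower the error rate, the larger the Eyeball dev set has to be to surface the same number of mistakes.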
How about the Blackbox dev set?
A Blackbox dev set of 1,000-10,000 examples will often give you enough data to tune
hyperparameters and select among models.
If you have a small dev set, your entire dev set might have to be used as the Eyeball dev set.
Consider the Eyeball dev set more important
(assuming that you are working on a problem that humans can solve well and that examining
the examples helps you gain insight).
The downside of having only an Eyeball dev set is that the risk of overfitting the dev set is
greater.

• When you start a new project, especially if it is in an area in which you are not an expert,
it is hard to correctly guess the most promising directions.
• So don’t start off trying to design and build the perfect system.
Instead build and train a basic system as quickly as possible.
Then use error analysis to help you identify the most promising directions
and iteratively improve your algorithm from there.
• Carry out error analysis by manually examining ~100 dev set examples the algorithm
misclassifies
and counting the major categories of errors.
Use this information to prioritize what types of errors to work on fixing.
• Consider splitting the dev set into an Eyeball dev set, which you will manually examine,
and a Blackbox dev set, which you will not manually examine.
If performance on the Eyeball dev set is much better than the Blackbox dev set,
you have overfit the Eyeball dev set and should consider acquiring more data for it.
• The Eyeball dev set should be big enough so that your algorithm misclassifies enough
examples for you to analyze.
A Blackbox dev set of 1,000-10,000 examples is sufficient for many applications.
• If your dev set is not big enough to split this way, just use the entire dev set as an Eyeball
dev set
for manual error analysis, model selection, and hyperparameter tuning.

Bias and Variance
20- Bias and Variance: The two big
sources of error
21- Examples of Bias and Variance
22- Comparing to the optimal error
rate
23- Addressing Bias and Variance
24- Bias vs. Variance tradeoff
25- Techniques for reducing avoidable
bias
26- Error analysis on the training set
27- Techniques for reducing variance
Getting more data doesn’t always help as much as you might hope.
It could be a waste of time to work on getting more data.
So, how do you decide when to add data, and when not to bother?
Two major sources of error: bias and variance.
Understanding them will help you decide what to do.
Goal: a cat recognizer that has 5% error (95% accuracy).
Training set error: 15% (85% accuracy)
Dev set error: 16%
Then the first problem to solve is
to improve your algorithm’s performance on your training set.
Your dev/test set performance is usually worse than your training set performance.
So if you are getting 85% accuracy on the examples your algorithm has seen,
there’s no way you’re getting 95% accuracy on examples your algorithm hasn’t even seen.
In this case, adding training data won’t help.
Break the 16% error into two components:
• Bias, the algorithm’s error rate on the training set.(15%)
• Variance, how much worse the algorithm does on the dev (or test) set than the training set.(1%)
Some changes address bias (improve performance on the training set)
Some changes address variance (generalize better from the training set to the dev/test sets)
Total Error = Bias + Variance
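The informal decomposition above amounts to one subtraction. A minimal sketch, with `bias_variance` as a hypothetical helper name, applied to the 15% training error and 16% dev error of the example:

```python
def bias_variance(train_error, dev_error):
    """Informal decomposition used in these slides.

    bias     = training set error
    variance = dev set error minus training set error
    """
    return train_error, dev_error - train_error

bias, variance = bias_variance(train_error=0.15, dev_error=0.16)
print(f"bias={bias:.1%}, variance={variance:.1%}")
```

These are the working definitions used for prioritizing effort here, not the formal statistical definitions of bias and variance.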

What problem does it have?
• Training error = 1%
• Dev error = 11%
Bias = 1%
Variance = 11%-1% = 10%
high variance (failing to generalize)
overfitting
• Training error = 15%
• Dev error = 16%
Bias = 15%
Variance = 16%-15% = 1%
high bias
underfitting
• Training error = 15%
• Dev error = 30%
Bias = 15%
Variance = 30%-15% = 15%
high bias and high variance
overfitting/underfitting terminology is hard to apply
• Training error = 0.5%
• Dev error = 1%
Bias = 0.5%
Variance = 1%-0.5% = 0.5%
low bias and low variance
This classifier is doing well.

In cat recognition:
the “ideal” error rate (achievable by an “optimal” classifier) is nearly 0%.
A human looking at a picture would be able to recognize a cat almost all the time;
thus, we can hope for a machine that would do just as well.
Harder problems:
Suppose you are building a speech recognition system, and
14% of the audio clips have so much background noise or are so
unintelligible that even a human cannot recognize what was said.
In this case, even the most “optimal” system might have an error of around 14%.
• Training error = 15%
• Dev error = 30%
Avoidable bias = 15%-14% = 1%
Variance = 30%-15% = 15%
overfitting
Break the 30% error into three components:
• Optimal error rate (“unavoidable bias”), best possible speech system in the world.(14%)
• Avoidable bias, the difference between training and optimal error(1%)
• Variance, The difference between the dev and training error.(15%)
Bias = Optimal error rate (“unavoidable bias”) + Avoidable bias
*** The optimal error rate is also called the Bayes error rate, or Bayes rate.
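The three-way breakdown can be sketched as follows. The function `decompose_error` and its dictionary keys are hypothetical names; the numbers are the 14% optimal error, 15% training error, and 30% dev error from the speech recognition example.

```python
def decompose_error(optimal_error, train_error, dev_error):
    """Split dev error into unavoidable bias, avoidable bias, and variance."""
    return {
        "unavoidable_bias": optimal_error,
        "avoidable_bias": train_error - optimal_error,
        "variance": dev_error - train_error,
    }

parts = decompose_error(optimal_error=0.14, train_error=0.15, dev_error=0.30)
for name, value in parts.items():
    print(f"{name}: {value:.0%}")
```

By construction the three components add back up to the dev error, so the largest one shows where most of the remaining error sits.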

How do we know what the optimal error rate is?
For tasks that humans are good at,
(recognizing pictures or transcribing audio clips),
you can ask a human to provide labels
then measure the accuracy of the human labels relative to
your training set.
This would give an estimate of the optimal error rate.
If you are working on a problem that even humans have a
hard time solving (predicting what movie to recommend,
or what ad to show to a user)
it can be hard to estimate the optimal error rate.

Simplest formula for addressing bias and variance:
• If you have high avoidable bias,
increase the size of your model
(increase the size of your neural network by adding layers/neurons).
• If you have high variance,
add data to your training set.
*** Increasing the model size generally reduces bias,
but it might also increase variance and the risk of overfitting.
If you use a well-designed regularization method,
then you can usually safely increase the size of the model
without increasing overfitting.

Some changes reduce bias errors but at the cost of increasing variance, and vice versa.
This creates a “trade off” between bias and variance.
For example:
increasing the size of your model
(adding neurons/layers in a neural network, or adding input features)
generally reduces bias but could increase variance.
Alternatively,
adding regularization generally increases bias but reduces variance.
Also by adding training data, you can reduce variance.

Techniques for reducing high avoidable bias:
• Increase the model size (such as number of neurons/layers)
This lets you fit the training set better. If it also increases variance, use regularization.
• Reduce or eliminate regularization
This reduces avoidable bias, but increases variance.
• Modify input features based on insights from error analysis
Error analysis may inspire you to create additional features. These could increase variance; if so, use regularization.
• Modify model architecture to be more suitable for your problem
This technique can affect both bias and variance.
One method that is not helpful:
• Add more training data
This helps with variance problems, but it usually has no significant effect on bias.

Audio clip   Loud background noise   User spoke quickly   Far from microphone   Comments
1 ✔ Car noise
2 ✔ Restaurant noise
3 ✔ ✔ ✔ User shouting
4 ✔ Coffee shop
% of total 75% 25% 50%
Your algorithm must perform well on the training set before you can expect it to perform well
on the dev/test sets.
Try error analysis on the training data, similar to error analysis on the Eyeball dev set.
This can be useful if your algorithm has high bias (not fitting the training set well).
For example, in a speech recognition system
If your system is not doing well on the training set,
you might consider listening to a set of ~100 examples that the algorithm is doing poorly on
to understand the major categories of training set errors.
In this example, you might realize that your algorithm is having a particularly hard time
with training examples that have a lot of background noise.
Thus, you might focus on techniques that allow it to better fit training examples with
background noise.

Techniques for reducing high variance:
• Add more training data
This is the simplest and most reliable way to address variance.
• Add regularization (L2 regularization, L1 regularization, dropout)
This reduces variance but increases bias.
• Add early stopping (stop gradient descent early, based on dev set error)
This reduces variance but increases bias. It behaves a lot like regularization methods.
• Feature selection to decrease number/type of input features
This might increase bias. When your training set is small, feature selection can be very useful.
• Decrease the model size (such as number of neurons/layers)
This might increase bias, and adding regularization usually gives better performance.
The advantage of a smaller model is reduced computational cost and faster training.
• Modify input features based on insights from error analysis
Error analysis may inspire you to create additional features. These could increase variance; if so, use regularization.
• Modify model architecture to be more suitable for your problem
This technique can affect both bias and variance.

Learning curves
28- Diagnosing bias and variance:
Learning curves
29- Plotting training error
30- Interpreting learning curves: High
bias
31- Interpreting learning curves:
Other cases
32- Plotting learning curves
A learning curve plots dev set error against the number of training examples.
To plot it, you would run your algorithm using different training set sizes
(100, 200, 300, …, 1000 examples).
Then plot how dev set error varies with the training set size.
As the training set size increases, the dev set error should decrease.
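The procedure above (train on growing subsets, record the error at each size) can be sketched generically. Everything here is illustrative: `learning_curve`, `train_majority`, and `error_rate` are hypothetical names, the data is synthetic, and the "learner" is a trivial majority-label predictor standing in for a real algorithm. The sketch also records training error, which is useful when both curves are plotted together; plotting itself is left out.

```python
def learning_curve(train_fn, error_fn, train_set, dev_set, sizes):
    """Train on growing prefixes of train_set; record training and dev error.

    train_fn(examples) -> model and error_fn(model, examples) -> float are
    caller-supplied placeholders for a real learning algorithm.
    Returns a list of (size, training_error, dev_error) tuples.
    """
    curve = []
    for m in sizes:
        subset = train_set[:m]
        model = train_fn(subset)
        curve.append((m, error_fn(model, subset), error_fn(model, dev_set)))
    return curve

# Toy stand-in learner: always predict the majority label of its training data.
def train_majority(examples):
    labels = [y for _, y in examples]
    return max(set(labels), key=labels.count)

def error_rate(model, examples):
    return sum(1 for _, y in examples if y != model) / len(examples)

train_set = [(i, i % 3 == 0) for i in range(1000)]      # synthetic (x, y) pairs
dev_set = [(i, i % 3 == 0) for i in range(1000, 1300)]
for m, train_err, dev_err in learning_curve(
        train_majority, error_rate, train_set, dev_set, range(100, 1001, 300)):
    print(m, round(train_err, 3), round(dev_err, 3))
```

With a real model, `train_set` should be shuffled before taking prefixes so that each subset is a random sample.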

We often have some “desired error rate”
that we hope our learning algorithm will eventually achieve.
For example:
• If we hope for human-level performance, then the human error rate could be the “desired error
rate.”
• If our learning algorithm serves some product (such as delivering cat pictures), we might have an
intuition about what level of performance is needed to give users a great experience.
• If you have worked on an application for a long time, you might have intuition about
how much more progress you can make.
Add the desired level of performance to your learning curve.
You can then visually extrapolate the red “dev error” curve to guess how much closer
you could get to the desired level of performance by adding more data.
In the example, it looks plausible that doubling the training set size might allow
you to reach the desired performance.
But if the dev error curve has flattened out,
then you can immediately tell that adding more data won’t get you to your goal.
Looking at the learning curve might therefore help you avoid spending months
collecting twice as much training data, only to realize it does not help.

As the training set size grows:
1- Dev set (and test set) error should decrease.
2- But training set error usually increases.
Suppose your training set has only 2 examples:
One cat image and one non-cat image.
It is easy for the learning algorithms to “memorize” examples and get 0% training set error.
Now suppose your training set has 100 examples:
Perhaps even a few examples are mislabeled, or ambiguous—some images are very blurry.
Perhaps the learning algorithm can still “memorize” most of the training set,
but it is harder to obtain 100% accuracy.
Finally, suppose your training set has 10,000 examples.
In this case, it becomes even harder for the algorithm to perfectly fit all 10,000 examples.
Thus, your learning algorithm will do even worse on this training set.
Your algorithm usually does better on the training set than on the dev set;
thus the red dev error curve usually lies strictly above the blue training error curve.

Suppose your learning curve plots both training error and dev error, and both lie above the desired performance.
Now, you can be sure that adding more data will not be sufficient. Why?
• As we add more training data, training error can only get worse.
Thus, the blue training error curve can only stay the same or go higher,
and thus it can only get further away from the level of desired performance.
• The red dev error curve is usually higher than the blue training error.
Thus, there’s almost no way that adding more data would allow the red dev error curve
to drop down to the desired level of performance
when even the training error is higher than the desired level of performance.
Suppose that the desired performance is our estimate of the optimal error rate.
Then the figure above is the standard example of
what a learning curve with high avoidable bias looks like:
At the largest training set size (all the training data we have)
there is a large gap between the training error and the desired performance,
indicating large avoidable bias.
Furthermore, the gap between the training and dev curves is small,
indicating small variance.

1- Examining both dev error and training error curves on the same plot
allows us to more confidently extrapolate the curves.
2- Until now, we had been measuring training and dev set error only at the rightmost point of this plot,
which corresponds to using all the available training data.
Plotting the full learning curve gives us a more comprehensive picture
of the algorithm’s performance on different training set sizes.

Consider this learning curve:
The blue training error curve is relatively low,
and the red dev error curve is much higher.
Thus, the bias is small, but the variance is large.
Adding more training data will probably help close the gap
between dev error and training error.
Now, consider this:
The training error is large
and it is much higher than the desired level of performance.
The dev error is also much larger than the training error.
Thus, you have significant bias and significant variance.
You will have to find a way to reduce both bias and variance.

Suppose you have a very small training set (100 examples).
You train your algorithm using a randomly chosen subset of 10 examples,
then 20 examples, then 30, up to 100,
increasing the number of examples by intervals of ten.
After you plot your learning curve, you might find that the curve looks noisy
(meaning that the values are higher/lower than expected)
at the smaller training set sizes.
When training on just 10 randomly chosen examples,
you might be unlucky and have a particularly “bad” training set,
such as one with many ambiguous/mislabeled examples.
Or, you might get lucky and get a particularly “good” training set.
Having a small training set means that
the dev and training errors may randomly fluctuate.

If your machine learning application is heavily skewed toward one class
(such as a cat classification task
where negative examples far outnumber positive ones, for example 80% negative and 20% positive),
or if it has a huge number of classes
(such as recognizing 100 different animal species),
then the chance of selecting an “unrepresentative” or “bad” training set is also larger,
and thus it is difficult for the algorithm to learn something meaningful.
Solution:
• Instead of training just one model on 10 examples,
select several (say 3-10) different randomly chosen training sets of 10 examples
by sampling with replacement from your original set of 100.
Train a different model on each of these,
and compute the training and dev set error of each of the resulting models.
Compute and plot the average training error and average dev set error.
• If your training set is skewed towards one class,
or if it has many classes, choose a “balanced” subset instead of 10 training examples
at random out of the set of 100.
More generally,
you can make sure the fraction of examples from each class
is as close as possible to the overall fraction in the original training set.
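The averaging procedure above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed implementation: the toy 1-D "nearest class mean" classifier, the synthetic data generator `make_data`, and all function names are assumptions made for this example; only the resampling-and-averaging loop reflects the procedure described on the slide.

```python
import random
from statistics import mean

def fit_nearest_mean(examples):
    """Toy 1-D classifier (assumed for illustration): store the mean feature per class."""
    by_class = {}
    for x, y in examples:
        by_class.setdefault(y, []).append(x)
    return {label: mean(xs) for label, xs in by_class.items()}

def error_rate(model, examples):
    """Fraction of examples whose nearest class mean has the wrong label."""
    wrong = sum(
        1 for x, y in examples
        if min(model, key=lambda label: abs(x - model[label])) != y
    )
    return wrong / len(examples)

def averaged_learning_curve(train, dev, sizes, n_repeats=5, seed=0):
    """For each size m, train on n_repeats resampled subsets and average the errors."""
    rng = random.Random(seed)
    curve = []
    for m in sizes:
        train_errs, dev_errs = [], []
        for _ in range(n_repeats):
            subset = rng.choices(train, k=m)  # sampling with replacement
            model = fit_nearest_mean(subset)
            train_errs.append(error_rate(model, subset))
            dev_errs.append(error_rate(model, dev))
        curve.append((m, mean(train_errs), mean(dev_errs)))
    return curve

def make_data(n, rng):
    """Hypothetical synthetic data: two overlapping 1-D Gaussian classes."""
    data = []
    for _ in range(n):
        y = rng.randint(0, 1)
        data.append((rng.gauss(2.0 * y, 1.0), y))
    return data

rng = random.Random(1)
train, dev = make_data(100, rng), make_data(500, rng)
curve = averaged_learning_curve(train, dev, sizes=range(10, 101, 10))
for m, train_err, dev_err in curve:
    print(f"m={m:3d}  avg train error={train_err:.3f}  avg dev error={dev_err:.3f}")
```

Averaging over several resampled subsets smooths out the luck of any single draw, which matters most at the smallest training set sizes.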

Training models with small datasets is much faster
than training models with large datasets.
When you have a large dataset,
plotting a learning curve may be computationally expensive:
For example, you might have to train ten models with
1,000, 2,000, 3,000, 4,000, 5,000, 6,000, 7,000, 8,000, 9,000, and 10,000 examples.
Thus, instead of evenly spacing out the training set sizes on a linear scale,
you might train models with 1,000, 2,000, 4,000, 8,000, 10,000 examples.
This should still give you a clear sense of the trends in the learning curves.
Of course, this technique is relevant only if the computational cost
of training all the additional models is significant.
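As a small sketch of the non-linear spacing described above, the helper below generates training set sizes that double at each step and always end at the full dataset size. The function name `doubling_sizes` is an assumption for this example.

```python
def doubling_sizes(start, max_size):
    """Training set sizes that double each step, always ending at max_size."""
    sizes = []
    m = start
    while m < max_size:
        sizes.append(m)
        m *= 2
    sizes.append(max_size)
    return sizes

linear = list(range(1_000, 10_001, 1_000))
spaced = doubling_sizes(1_000, 10_000)
print(len(linear), "models on a linear scale:", linear)
print(len(spaced), "models with doubling:", spaced)
```

For the numbers in the text, this trains five models instead of ten.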

Comparing to human-level performance
33- Why we compare to human-level performance
34- How to define human-level performance
35- Surpassing human-level performance
Machine learning areas:
• Tasks humans are good at: image recognition, speech recognition, email spam classification
• Tasks humans aren’t good at: book recommendation, advertisement recommendation, stock market prediction

Tasks humans are good at
Learning algorithms have improved so much that
we are surpassing human-level performance on more and more of these tasks.
Building an ML system is easier in this area, for three reasons:
1. Ease of obtaining data from human labelers.
Since people recognize images well,
they can provide high accuracy labels for your learning algorithm.
2. Error analysis can draw on human intuition.
Suppose a speech recognition algorithm incorrectly transcribes an audio clip as
“This recipe calls for a pear of apples,” mistaking “pair” for “pear.”
You can draw on human intuition to understand what information
a person uses to get the correct transcription, and use this knowledge to modify the learning algorithm.
3. Use human-level performance to estimate the
“optimal error rate” or “desired error rate”.
Suppose your algorithm achieves 10% error, but a person achieves 2% error.
Then we know that the optimal error rate is 2% or lower, so the avoidable bias is at least 8%.
Thus, you should try bias-reducing techniques.
humans aren’t good at
Computers already surpass the performance of most people on these tasks.
With these applications, we run into the following problems:
• It is harder to obtain labels.
It’s hard for human labelers to annotate a database of users
with the “optimal” book recommendation.
• Human intuition is harder to count on.
Pretty much no one can predict the stock market.
So if our stock prediction algorithm does no better than random guessing,
it is hard to figure out how to improve it.
• It is hard to know what the “optimal error rate” or “desired error rate” is.
Suppose you have a book recommendation system that is doing quite well.
How do you know how much more it can improve
without a human baseline?

Suppose you are working on a medical imaging application that makes diagnoses from x-ray images.
• A typical person with some basic training achieves 15% error.
• A junior doctor achieves 10% error.
• An experienced doctor achieves 5% error.
• And a small team of doctors that debate each image achieves 2% error.
Which one is the “human-level performance”?
In this case, use 2% as the human-level performance proxy for the optimal error rate.
You can also set 2% as the desired performance level,
because all three reasons from the previous chapter for comparing to human-level performance apply:
• Ease of obtaining labeled data from human labelers.
• Error analysis can draw on human intuition.
• Use human-level performance to estimate the optimal error rate and to set an achievable “desired error rate.”

1- When it comes to obtaining labeled data,
you might not want to discuss every image with an entire team of doctors
since their time is expensive.
Perhaps you can have a single junior doctor label the vast majority of cases
and bring only the harder cases to more experienced doctors or to the team of doctors.
2- If your system is currently at 40% error,
then it doesn’t matter much whether you use a junior doctor (10% error)
or an experienced doctor (5% error) to label your data and provide intuitions.
But if your system is already at 10% error,
then defining the human-level reference as 2%
gives you better tools to keep improving your system.

You are working on speech recognition and have a dataset of audio clips.
Suppose your dataset has many noisy audio clips so that even humans have 10% error.
Suppose your system already achieves 8% error.
Can you use any of the three techniques described in the previous chapter to continue making rapid progress?
If you can identify a subset of data in which humans significantly surpass your system,
then you can still use those techniques to drive rapid progress.
For example, your system is much better than people at recognizing speech in noisy audio,
but humans are still better at transcribing very rapidly spoken speech.
For the subset of data with rapidly spoken speech:
1. You can still obtain transcripts from humans
that are higher quality than your algorithm’s output.
2. You can draw on human intuition to understand
why they correctly heard a rapidly spoken utterance when your system didn’t.
3. You can use human-level performance on rapidly spoken speech
as a desired performance target.

More generally,
as long as there are examples
where humans are right
and the algorithm is wrong,
then many of the techniques
described earlier will apply.
This is true even if,
averaged over the entire dev/test set,
your performance is already surpassing
human-level performance.
Progress is usually slower
when machines already surpass
human-level performance,
and faster
when machines are still trying
to catch up to humans.

Training and testing on different distributions
36- When you should train and test on different distributions
37- How to decide whether to use all your data
38- How to decide whether to include inconsistent data
39- Weighting data
40- Generalizing from the training set to the dev set
41- Identifying Bias, Variance, and Data Mismatch Errors
42- Addressing data mismatch
43- Artificial data synthesis
Suppose users have uploaded 10,000 images, which you have manually labeled.
You also have 200,000 images that you downloaded off the internet.
How should you define train/dev/test sets?
User images reflect the actual distribution of data you want to do well on,
so you might use them for your dev and test sets.
If you are training a data-hungry deep learning algorithm,
you might give it the additional 200,000 internet images for training.
Thus, your training and dev/test sets come from different distributions.
How does this affect your work?
Solution 1:
we could take all 210,000 images we have,
and randomly shuffle them into train/dev/test sets.
In this case, all the data comes from the same distribution.
But I recommend against this method,
because about 200,000/210,000 ≈ 95.2% of your dev/test data would come from internet images,
which does not reflect the actual distribution you want to do well on.
Remember our recommendation on choosing dev/test sets:
Choose dev and test sets to reflect
data you expect to get in the future and want to do well on.

Most of the academic literature on machine learning
assumes that the training set, dev set and test set
all come from the same distribution.
In the early days of machine learning, data was scarce.
We had one dataset drawn from some distribution.
So we would randomly split that data into train/dev/test sets,
and the assumption that all the data
was coming from the same source was usually satisfied.
But in the era of big data,
we now have access to huge training sets.
Even if the training set
comes from a different distribution than the dev/test set,
we still want to use it for learning
since it can provide a lot of information.
Solution 2:
Instead of putting all 10,000 user-uploaded images
into the dev/test sets,
put 5,000 into the dev/test sets,
and put the remaining 205,000 examples into the training set.
This way, your training set of 205,000 examples
contains some data that comes from your dev/test distribution
along with the 200,000 internet images.
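A minimal sketch of this second solution is shown below. The even 2,500/2,500 division between dev and test is an assumption (the slide only says 5,000 images go to the dev/test sets), and the placeholder tuples stand in for real labeled images.

```python
import random

def build_splits(user_images, internet_images, n_dev=2_500, n_test=2_500, seed=0):
    """Dev/test come only from user images; leftover user images join the training set."""
    rng = random.Random(seed)
    users = list(user_images)
    rng.shuffle(users)
    dev = users[:n_dev]
    test = users[n_dev:n_dev + n_test]
    train = users[n_dev + n_test:] + list(internet_images)
    rng.shuffle(train)
    return train, dev, test

# Placeholder records: (source, index) instead of real images and labels.
user = [("user", i) for i in range(10_000)]
internet = [("internet", i) for i in range(200_000)]
train, dev, test = build_splits(user, internet)
print(len(train), len(dev), len(test))
```

The dev and test sets contain only user images, so they still reflect the distribution you want to do well on.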

Suppose your cat detector’s training set
includes 10,000 user-uploaded images.
You also have 20,000 images downloaded from the internet.
Should you provide all 30,000 images to your learning algorithm as its training set,
or discard the internet images for fear of biasing your learning algorithm?
When using earlier generations of learning algorithms
there was a real risk that merging both types of data
would cause you to perform worse.
But in the modern era of powerful, flexible learning algorithms (such as large neural networks),
this risk has greatly diminished.
If you can afford to build a neural network
with a large enough number of hidden units/layers,
you can safely add the 20,000 images to your training set.
Adding the images is more likely to increase your performance.
This observation relies on the fact that there is some x → y mapping that works well
for both types of data.
In other words,
there exists some system that inputs either an internet image or a mobile app image
and reliably predicts the label,
even without knowing the source of the image.

Adding the additional 20,000 images has the following effects:
1. It gives your neural network more examples
of what cats do/do not look like.
This is helpful,
since internet images and user-uploaded mobile app images
do share some similarities.
Your neural network can apply some of the knowledge acquired
from internet images to mobile app images.
2. It forces the neural network to expend some of its capacity
to learn about properties that are specific to internet images.
If these properties differ greatly from mobile app images,
it will “use up” some of the representational capacity
of the neural network.
Thus there is less capacity for recognizing data
drawn from the distribution of mobile app images,
which is what you really care about.
Theoretically, this could hurt your algorithm’s performance.
To describe the second effect in different terms,
we can turn to the fictional character Sherlock Holmes,
who says that your brain is like an attic;
it only has a finite amount of space.
“for every addition of knowledge,
you forget something that you knew before.
It is of the highest importance, therefore,
not to have useless facts elbowing out the useful ones.”

Fortunately, if you have the computational capacity
needed to build a big enough neural network (a big enough attic)
then this is not a serious concern.
You have enough capacity to learn from both internet and from mobile app images,
without the two types of data competing for capacity.
Your algorithm’s “brain” is big enough
that you don’t have to worry about running out of attic space.
But if you do not have a big enough neural network
(or highly flexible learning algorithm),
then you should pay more attention to your training data
matching your dev/test set distribution.
If you think you have data that has no benefit,
you should just leave out that data for computational reasons.

Suppose you want to learn to predict housing prices in New York City.
Given the size of a house (input feature x),
you want to predict the price (target label y).
Housing prices in New York City are very high.
Suppose you have a second dataset of housing prices in Detroit, Michigan,
where housing prices are much lower.
Should you include this data in your training set?
Given the same size x,
the price of a house y
is very different depending on whether it is in New York City or in Detroit.
If you only care about predicting New York City housing prices,
putting the two datasets together will hurt your performance.
In this case, it would be better to leave out the inconsistent Detroit data.

How is this New York City vs. Detroit example
different from the mobile app vs. internet cat images example?
The cat image example is different because, given an input picture x,
one can reliably predict the label y indicating whether there is a cat,
even without knowing if the image is an internet image or a mobile app image.
I.e., there is a function f(x) that reliably
maps from the input x to the target output y,
even without knowing the origin of x.
Thus, the task of recognition from internet images is “consistent”
with the task of recognition from mobile app images.
This means there was little downside (other than computational cost)
to including all the data,
and some possible significant upside.
In contrast, New York City and Detroit, Michigan data are not consistent.
Given the same x (size of house),
the price is very different depending on where the house is.

Suppose you have 200,000 internet images and 5,000 mobile app images.
There is a 40:1 ratio between the sizes of these datasets.
In theory, so long as you build a huge neural network
there is no harm in trying to make the algorithm do well
on both internet images and mobile images.
But in practice, having 40x as many internet images means you may need to spend 40x
as much computational resources to model both,
compared to if you trained on only the 5,000 images.
If you don’t have huge computational resources,
you could give the internet images a much lower weight as a compromise.
For example, suppose your optimization objective is squared error.
Thus, our learning algorithm tries to optimize
(writing h_θ(x) for the model’s prediction on input x):
$$\min_\theta \sum_{(x,y) \in \text{MobileImg}} (h_\theta(x) - y)^2 + \sum_{(x,y) \in \text{InternetImg}} (h_\theta(x) - y)^2$$
The first sum above is over the 5,000 mobile images,
and the second sum is over the 200,000 internet images.
You can instead optimize with an additional parameter 𝛽:
$$\min_\theta \sum_{(x,y) \in \text{MobileImg}} (h_\theta(x) - y)^2 + \beta \sum_{(x,y) \in \text{InternetImg}} (h_\theta(x) - y)^2$$
If you set 𝛽=1/40, the algorithm would give equal weight
to the 5,000 mobile images and the 200,000 internet images.
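The weighted objective described above can be illustrated with a short sketch. The function name, the `predict` callable, and the tiny placeholder datasets (which only preserve the 40:1 size ratio) are assumptions for this example; a real system would minimize this quantity over the model parameters rather than just evaluate it.

```python
def weighted_squared_error(mobile, internet, predict, beta):
    """Squared error over mobile examples plus beta times squared error over internet examples."""
    mobile_loss = sum((predict(x) - y) ** 2 for x, y in mobile)
    internet_loss = sum((predict(x) - y) ** 2 for x, y in internet)
    return mobile_loss + beta * internet_loss

# Placeholder data with the same 40:1 ratio as in the text (5 vs. 200 examples).
mobile = [(i, 1.0) for i in range(5)]
internet = [(i, 1.0) for i in range(200)]

def always_zero(x):
    """Hypothetical model that is wrong by exactly 1 on every example."""
    return 0.0

unweighted = weighted_squared_error(mobile, internet, always_zero, beta=1.0)
reweighted = weighted_squared_error(mobile, internet, always_zero, beta=1 / 40)
print("beta = 1:   ", unweighted)
print("beta = 1/40:", reweighted)
```

With 𝛽=1 the internet images dominate the total; with 𝛽=1/40 each dataset contributes the same amount when the per-example error is the same.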

By weighting the additional Internet images less,
you don’t have to build as massive a neural network
to make sure the algorithm does well on both types of tasks.
This type of re-weighting is needed only when:
1- you suspect the additional data (Internet images)
has a very different distribution than the dev/test set,
or
2- the additional data is much larger
than the data that came from the same distribution
as the dev/test set (mobile images).

Suppose you are applying ML in a setting
where the training and the dev/test distributions are different.
Say the training set contains Internet images + Mobile images,
and the dev/test sets contain only Mobile images.
However, the algorithm is not working well:
It has a much higher dev/test set error than you would like.
Here are some possibilities of what might be wrong:
1. It does not do well on the training set.
This is the problem of high (avoidable) bias on the training set distribution.
2. It does well on the training set,
but does not generalize well to previously unseen data
from the same distribution as the training set.
This is high variance.
3. It generalizes well to new data from the same distribution as the training set,
but not to data drawn from the dev/test set distribution.
We call this problem data mismatch,
since it is because the training set data
is a poor match for the dev/test set data.

For example, suppose that humans achieve near perfect performance
on the cat recognition task.
Your algorithm achieves this:
• 1% error on the training set
• 1.5% error on data drawn from the same distribution
as the training set that the algorithm has not seen
• 10% error on the dev set
In this case, you clearly have a data mismatch problem.
In order to diagnose to what extent an algorithm
suffers from each of the problems 1-3,
it will be useful to have another dataset.
Specifically, rather than giving the algorithm
all the available training data,
you can split it into two subsets:
1- The actual training set which the algorithm will train on,
2- and a separate set, which we will call the “Training dev” set,
that we will not train on.
You now have four subsets of data:
• Training set.
This is the data that the algorithm will learn from
(e.g., Internet images + Mobile images).
• Training dev set:
This data is drawn from the same distribution as the training set
(e.g., Internet images + Mobile images).
• Dev set:
This is from the same distribution as the test set,
and it reflects the distribution of data that we ultimately care about.
(E.g., mobile images.)
• Test set:
This is drawn from the same distribution as the dev set.
(E.g., mobile images.)
Armed with these four separate datasets, you can now evaluate:
• Training error, by evaluating on the training set.
• The algorithm’s ability to generalize to new data
from the training set distribution,
by evaluating on the training dev set.
• The algorithm’s performance on the task you care about,
by evaluating on the dev and/or test sets.
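To make the role of the four datasets concrete, here is a minimal sketch that splits the dev error into the three gaps discussed above. The function name and dictionary keys are assumptions for this example; the numbers are the ones from the cat recognition example (near-perfect human performance taken as 0% optimal error, 1% training error, 1.5% training dev error, 10% dev error).

```python
def diagnose(optimal, train, train_dev, dev):
    """Split the dev error into three gaps (all arguments are error rates in percent)."""
    return {
        "avoidable_bias": train - optimal,    # training error vs. optimal error rate
        "variance": train_dev - train,        # unseen data from the training distribution
        "data_mismatch": dev - train_dev,     # shift from training to dev/test distribution
    }

gaps = diagnose(optimal=0.0, train=1.0, train_dev=1.5, dev=10.0)
largest = max(gaps, key=gaps.get)
print(gaps)
print("largest gap:", largest)
```

Here the gap between training dev error and dev error is by far the largest, which matches the conclusion that this algorithm mainly suffers from data mismatch.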

Suppose humans achieve perfect performance (≈0% error)
on the cat detection task, thus the optimal error rate is 0%.
What problem does the algorithm have in each of these cases?
• Training dev error = 5%, dev error = 5%: overfitting (high variance)
• Dev error = 12%: high avoidable bias (underfitting)
• Dev error = 20%: high avoidable bias and data mismatch
It might be easier to understand
how the different types of errors relate to each other by drawing them as entries in a table:

                                      Distribution A:             Distribution B:
                                      Internet + Mobile images    Mobile images
Human-level error                     ~0%                         Human-level error
Error on examples the algorithm       Training error: 10%
has trained on
Error on examples the algorithm       Training-dev error: 11%     Dev/test error: 20%
hasn't trained on

The gap between human-level error and training error is the avoidable bias,
the gap between training error and training-dev error is the variance,
and the gap between training-dev error and dev/test error is the data mismatch.

Suppose you have developed a speech recognition system
that has a data mismatch problem.
What can you do?
(i) Try to understand what properties of the data
differ between the training and the dev set distributions.
(ii) Try to find more training data that better matches the dev set examples
that your algorithm has trouble with.
For example, suppose you carry out an error analysis on 100 dev set examples.
You find that your system does poorly
because most of the audio clips in the dev set were recorded within a car,
whereas most of the training examples were recorded against a quiet background.
The engine and road noise dramatically worsen the performance of your system.
In this case, you might try to acquire more training data that was recorded in a car.
The purpose of the error analysis is
to understand the significant differences
between the training and the dev set,
which lead to the data mismatch.
Unfortunately, there are no guarantees in this process.
For example, if you don't have any way to get more training data
that better match the dev set data,
you might not have a clear path towards improving performance.

Your speech system needs more data that sounds as if it were recorded within a car.
Rather than collecting a lot of data while driving around,
there might be an easier way to get this data:
by artificially synthesizing it.
Suppose you obtain a large quantity of car/road noise audio clips.
You also have a large training set of people speaking in a quiet room.
If you take an audio clip of a person speaking and
“add” that to an audio clip of car/road noise,
you will obtain an audio clip that sounds as if that person was speaking in a noisy car.
Using this process, you can “synthesize” huge amounts of data.
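The “adding” step can be sketched as sample-wise addition of two waveforms. This is a simplified illustration under stated assumptions: both clips are lists of float samples in the range [-1, 1] at the same sample rate, the noise clip is at least as long as the speech clip, and the names `mix_with_noise` and `noise_gain` are hypothetical. The sine wave and Gaussian noise below are stand-ins for real recordings.

```python
import math
import random

def mix_with_noise(speech, noise, noise_gain=0.5, rng=None):
    """Overlay a randomly positioned segment of noise on a speech clip."""
    rng = rng or random.Random()
    if len(noise) < len(speech):
        raise ValueError("noise clip must be at least as long as the speech clip")
    # Pick a random window of the noise clip so different speech clips get different noise.
    start = rng.randrange(len(noise) - len(speech) + 1)
    mixed = [s + noise_gain * noise[start + i] for i, s in enumerate(speech)]
    # Clip to the valid [-1, 1] sample range.
    return [max(-1.0, min(1.0, v)) for v in mixed]

rng = random.Random(0)
speech = [0.8 * math.sin(2 * math.pi * 440 * t / 16_000) for t in range(16_000)]
noise = [rng.gauss(0.0, 0.2) for _ in range(48_000)]
noisy_speech = mix_with_noise(speech, noise, noise_gain=0.5, rng=rng)
print(len(noisy_speech), min(noisy_speech), max(noisy_speech))
```

Sampling a different noise window for each speech clip adds some variety, but it does not create noise that was never recorded in the first place.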
Let’s use the cat image detector as a second example.
You notice that dev set images have much more motion blur
because they tend to come from cell phone users who are moving their phone slightly
while taking the picture.
You can take non-blurry images from the training set of internet images,
and add simulated motion blur to them,
thus making them more similar to the dev set.

Keep in mind that
artificial data synthesis has its challenges:
it is sometimes easier to create synthetic data
that appears realistic
to a person
than it is to create data
that appears realistic
to a computer.

For example, suppose you have 1,000 hours of speech training data,
but only 1 hour of car noise.
If you repeatedly use the same 1 hour of car noise
with different portions from the 1,000 hours of speech,
you will end up with a synthetic dataset in which the same car noise is repeated over and over.
A person listening to this audio
probably would not be able to tell,
but it is possible that a learning algorithm would “overfit”
to the 1 hour of car noise.
Thus, it could generalize poorly to a new audio clip
where the car noise happens to sound different.
Alternatively,
suppose you have 1,000 unique hours of car noise,
but all of it was taken from just 10 different cars.
In this case, it is possible for an algorithm to “overfit”
to these 10 cars
and perform poorly if tested on audio from a different car.
As another example, suppose you are building a computer vision system
to recognize cars.
Suppose you partner with a video gaming company,
which has computer graphics models of several cars.
To train your algorithm, you use the models
to generate synthetic images of cars.
Even if the synthesized images look very realistic,
this approach will probably not work well.
The video game might have only ~20 car designs,
because it is very expensive to build a 3D model of a car;
if you were playing the game,
you probably wouldn’t notice that
you’re seeing the same cars over and over,
perhaps only painted differently.
I.e., this data looks very realistic to you.
But compared to the set of all cars out on roads
this set of 20 synthesized cars
captures only a minuscule fraction
of the world’s distribution of cars.
Thus your system will “overfit” to these 20 specific car designs,
and it will fail to generalize well to dev/test sets
that include other car designs.

1- Unfortunately, data synthesis problems can be hard to spot.
2- When working on data synthesis,
it can take considerable effort to produce data with details that are close enough
to the actual distribution.
If you are able to get the details right,
you can suddenly access a far larger training set than before.

Debugging inference algorithms
44- The Optimization Verification test
45- General form of Optimization Verification test
46- Reinforcement learning example
Suppose you are building a speech recognition system.
Your system works by inputting an audio clip A,
and computing some Score_A(S) for each possible output sentence S.
For example, you might try to estimate Score_A(S) = P(S|A),
the probability that the correct output transcription is the sentence S,
given that the input audio was A.
Given a way to compute Score_A(S),
you still have to find the English sentence S that maximizes it:
Output = arg max_S Score_A(S)
If the English language has 50,000 words,
then there are (50,000)^N possible sentences of length N,
far too many to enumerate exhaustively.
So you need to apply an approximate search algorithm
to try to find the value of S that optimizes (maximizes) Score_A(S).
One example search algorithm is “beam search,”
which keeps only the top K candidates during the search process.
Algorithms like this are not guaranteed to find the value of S
that maximizes Score_A(S).
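To see why an approximate search can miss the best-scoring sentence, here is a toy beam search sketch. It assumes the total score decomposes into per-token contributions given by a hypothetical `step_score(prefix, token)` function, which is a simplification and not necessarily how a real speech recognizer computes its score; the two-word vocabulary and the score values are made up for illustration.

```python
def beam_search(vocab, step_score, length, beam_width):
    """Approximate arg max over token sequences of the given length.

    Only the beam_width best partial sequences are kept after each step,
    so the true maximizer can be pruned early.
    """
    beams = [((), 0.0)]
    for _ in range(length):
        candidates = [
            (prefix + (tok,), score + step_score(prefix, tok))
            for prefix, score in beams
            for tok in vocab
        ]
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams[0]

def step_score(prefix, tok):
    """Toy scores: 'a' looks best at first, but only 'b' then 'b' earns the bonus."""
    if not prefix:
        return 1.0 if tok == "a" else 0.5
    return 1.0 if prefix[-1] == "b" and tok == "b" else 0.0

narrow = beam_search(["a", "b"], step_score, length=2, beam_width=1)
wide = beam_search(["a", "b"], step_score, length=2, beam_width=2)
print("beam width 1:", narrow)
print("beam width 2:", wide)
```

With a beam width of 1 the search commits to the locally best first token and never sees the higher-scoring sequence; widening the beam recovers it, at the cost of scoring more candidates.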

Suppose that an audio clip A records someone saying “I love machine learning.”
But your system outputs the incorrect “I love robots.”
There are two possibilities for what went wrong:
1. Search algorithm problem.
The approximate search algorithm failed to find the value of S
that maximizes Score_A(S).
2. Scoring function problem.
Our estimates for Score_A(S) = P(S|A) were inaccurate.
Score_A(S) failed to recognize that “I love machine learning”
is the correct transcription.
Depending on which of these was the cause of the failure,
you should prioritize your efforts very differently.
If #1, work on improving the search algorithm.
If #2, work on the learning algorithm that estimates Score_A(S).
How can you decide what to work on?
Let S_out be the output transcription (“I love robots”).
Let S* be the correct transcription (“I love machine learning”).
To determine whether #1 or #2 is the problem,
you can perform the Optimization Verification test:
First, compute Score_A(S*) and Score_A(S_out).
Then check which one is greater.
Case 1: Score_A(S*) > Score_A(S_out)
In this case, your learning algorithm
has correctly given S* a higher score than S_out,
but your approximate search algorithm chose S_out.
So the search algorithm is failing to choose
the value of S that maximizes Score_A(S).
The Optimization Verification test tells you that
you have a search algorithm problem.
For example, you can try increasing the beam width of beam search.
Case 2: Score_A(S*) ≤ Score_A(S_out)
In this case, the way you’re computing Score_A(.) is at fault:
It is failing to give a strictly higher score
to the correct output S* than to the incorrect S_out.
The Optimization Verification test tells you that
you have an objective (scoring) function problem.
You should improve how you learn or approximate Score_A(S)
for different sentences S.
To apply the Optimization Verification test in practice,
you should examine each of the errors in your dev set.
Suppose you find that 95% of the errors are due to the scoring function Score_A(.),
and only 5% are due to the optimization algorithm.
Then no matter how much you improve your optimization procedure,
you would realistically eliminate only ~5% of your errors.
Thus, you should instead focus on improving Score_A(.).
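The per-error bookkeeping described above can be sketched as a simple tally. The function name, the `score` callable, and the toy score table are assumptions for this example; in a real system `score` would call the learned scoring function on the audio clip and a candidate transcript.

```python
def optimization_verification(errors, score):
    """Attribute each dev set error to the search algorithm or to the scoring function.

    errors: iterable of (audio, correct_transcript, system_output) triples.
    score:  callable score(audio, sentence) standing in for the learned scoring function.
    """
    counts = {"search": 0, "scoring": 0}
    for audio, s_star, s_out in errors:
        if score(audio, s_star) > score(audio, s_out):
            counts["search"] += 1   # the correct output scored higher but was not found
        else:
            counts["scoring"] += 1  # the score failed to prefer the correct output
    return counts

# Hypothetical scores for two misrecognized clips.
toy_scores = {
    ("clip1", "I love machine learning"): 0.6,
    ("clip1", "I love robots"): 0.3,
    ("clip2", "call mom"): 0.2,
    ("clip2", "call tom"): 0.7,
}
errors = [
    ("clip1", "I love machine learning", "I love robots"),
    ("clip2", "call mom", "call tom"),
]
counts = optimization_verification(errors, lambda audio, s: toy_scores[(audio, s)])
print(counts)
```

The resulting fractions indicate where effort is likely to pay off: a large "scoring" share suggests working on the scoring function, and a large "search" share suggests working on the search algorithm.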

You can apply the Optimization Verification test when,
given some input x,
you know how to compute Score_x(y)
that indicates how good a response y is to an input x.
Furthermore, you are using an approximate algorithm
to try to find arg max_y Score_x(y).
In our previous speech recognition example,
x=A was an audio clip,
and y=S was the output transcript.
Suppose y* is the “correct” output
but the algorithm instead outputs y_out.
Then the key test is to measure whether Score_x(y*) > Score_x(y_out).
If this inequality holds, we blame the optimization algorithm for the mistake.
If not, we blame the computation of Score_x(y).

Suppose you are using machine learning to teach a helicopter
to fly complex maneuvers,
such as landing with the engine turned off.
This is called an “autorotation” maneuver.
Human pilots practice this as part of their training.
Your goal is to use a learning algorithm
to fly the helicopter through a trajectory T
that ends in a safe landing.
To apply reinforcement learning,
you have to develop a “reward function” R(.)
that gives a score measuring how good each trajectory T is.
For example, if T results in the helicopter crashing, then perhaps R(T) = -1,000,
a huge negative reward.
A trajectory T resulting in a safe landing might result in a positive R(T),
with the exact value depending on how smooth the landing was.
The reward function R(.)
is typically chosen by hand.
It has to trade off several factors:
how bumpy the landing was, whether the helicopter landed in the desired spot, and how rough the descent was for passengers.
It is not easy to design good reward functions.
Given R(.), the job of the reinforcement learning algorithm
is to control the helicopter so that it achieves max_T R(T).

Suppose you have picked some reward R(.)
and have run your learning algorithm.
However, its performance appears far worse than that of your human pilot.
How can you tell if the fault is
* with the reinforcement learning algorithm or
* with the reward function?
To apply the Optimization Verification test,
let T_human be the trajectory achieved by the human pilot,
and let T_out be the trajectory achieved by the algorithm.
According to the description above, T_human is a superior trajectory to T_out.
Thus, the key test is: does R(T_human) > R(T_out) hold?
Case 1:
If this inequality holds,
then the reward function R(.) is correctly rating T_human as superior to T_out,
but our reinforcement learning algorithm is finding the inferior T_out.
This suggests that working on improving the reinforcement learning algorithm is worthwhile.
Case 2:
The inequality does not hold: R(T_human) ≤ R(T_out).
This means R(.) assigns a worse score to T_human even though it is the superior trajectory.
You should work on improving R(.).
Many machine learning applications have this “pattern”
of optimizing an approximate scoring function Score_x(.)
using an approximate search algorithm.
In our example,
the scoring function was the reward function Score(T)=R(T),
and the optimization algorithm was the reinforcement learning algorithm
trying to execute a good trajectory T.
One difference between this and earlier examples
is that, rather than comparing to an “optimal” output,
you were instead comparing to human-level performance T_human.
We assumed T_human is pretty good,
even if not optimal.
In general,
so long as you have some y* (in this example T_human)
that is a superior output
to the performance of your current learning algorithm
(even if it is not the “optimal” output),
then the Optimization Verification test can indicate
whether it is more promising to improve
the optimization algorithm
or the scoring function.

End-to-end deep learning
47- The rise of end-to-end learning
48- More end-to-end learning examples
49- Pros and cons of end-to-end learning
50- Choosing pipeline components: Data availability
51- Choosing pipeline components: Task simplicity
52- Directly learning rich outputs
Suppose you want to build a system to examine online product reviews
and tell you whether the writer liked or disliked that product.
For example, you hope to recognize
“This is a great mop!” as highly positive
and “This mop is low quality--I regret buying it.” as highly negative.
The problem of recognizing positive vs. negative opinions
is called “sentiment classification.”
To build this system, you might build a “pipeline” of two components:
1. Parser:
A system that annotates the text with information
identifying the most important words.
For example, label all the adjectives and nouns.
This is a great_Adjective mop_Noun!
2. Sentiment classifier:
A learning algorithm that takes as input the annotated text
and predicts the overall sentiment.
By giving adjectives a higher weight,
your algorithm will be able to quickly focus
on the important words such as “great,”
and ignore less important words such as “this.”
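A toy version of this two-component pipeline might look as follows. Everything here is a hypothetical stand-in: the tiny tag and polarity dictionaries and the tag weights are invented for illustration, whereas a real parser and classifier would be learned or far larger.

```python
# Hypothetical toy lexicons, for illustration only.
POS_TAGS = {"great": "Adjective", "low": "Adjective", "mop": "Noun", "quality": "Noun"}
POLARITY = {"great": 1.0, "low": -1.0, "regret": -1.0}
TAG_WEIGHT = {"Adjective": 2.0, "Noun": 1.0}  # adjectives get a higher weight
DEFAULT_WEIGHT = 0.5


def parse(text):
    """Component 1 (parser): annotate each word with a part-of-speech tag."""
    words = [w.strip('!.,"').lower() for w in text.replace("--", " ").split()]
    return [(w, POS_TAGS.get(w, "Other")) for w in words if w]


def classify(annotated):
    """Component 2 (sentiment classifier): weighted sum of word polarities.
    Ties default to "negative" in this sketch."""
    score = sum(
        TAG_WEIGHT.get(tag, DEFAULT_WEIGHT) * POLARITY.get(word, 0.0)
        for word, tag in annotated
    )
    return "positive" if score > 0 else "negative"


print(classify(parse("This is a great mop!")))
print(classify(parse("This mop is low quality--I regret buying it.")))
```

The point of the sketch is the structure: the classifier never sees raw text, only the parser's annotated output.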

There has been a recent trend
toward replacing pipeline systems
with a single learning algorithm.
An end-to-end learning algorithm for this task
would simply take as input the original text,
and try to directly recognize the sentiment:
Neural networks are commonly used
in end-to-end learning systems.
The term “end-to-end” refers to the fact that
we are asking the learning algorithm to go directly
from the input to the desired output.
I.e., the learning algorithm directly connects
the “input end” of the system to the “output end.”
In problems where data is abundant,
end-to-end systems have been remarkably successful.
But they are not always a good choice.

48- More end-to-end learning examples
Suppose you want to build a speech recognition system.
You might build a system with three components:
1. Compute features:
Extract hand-designed features,
such as MFCC (Mel-frequency cepstral coefficients) features,
which try to capture the content of an utterance
while disregarding less relevant properties (such as the speaker’s pitch).
2. Phoneme recognizer:
Linguists believe there are basic units of sound called “phonemes.”
The initial “k” sound in “keep”
is the same phoneme as the “c” sound in “cake.”
This system tries to recognize the phonemes in the audio clip.
3. Final recognizer:
Take the sequence of recognized phonemes,
and try to string them together into an output transcript.
In contrast, an end-to-end system might input an audio clip,
and try to directly output the transcript:

So far, we have only described machine learning “pipelines”
that are completely linear:
the output is sequentially passed from one stage to the next.
Pipelines can be more complex:
here is a simple architecture for
an autonomous car that has three components:
1- One detects other cars using the camera images;
2- one detects pedestrians;
3- then a final component plans a path for our own car
that avoids the cars and pedestrians.
In contrast, an end-to-end approach
might try to take in the sensor inputs
and directly output the steering direction:
Even though end-to-end learning has seen many successes,
it is not always the best approach.

49- Pros and cons of end-to-end learning
Consider the same speech pipeline:
Many parts of this pipeline were “hand-engineered”:
Disadvantages:
• MFCCs are a set of hand-designed audio features.
Although they provide a reasonable summary of the audio input,
they also simplify the input signal by throwing some information away.
• Phonemes are an invention of linguists.
They are an imperfect representation of speech sounds.
To the extent that phonemes are a poor approximation of reality,
it will limit the speech system’s performance.
Advantages:
• The MFCC features are robust to some properties of speech
that do not affect the content, such as speaker pitch.
Thus, they help simplify the problem for the learning algorithm.
• To the extent that phonemes are a reasonable representation of speech,
they can also help the learning algorithm understand basic sound components
and therefore improve its performance.

Having more hand-engineered components generally allows a speech system
to learn with less data.
The hand-engineered knowledge captured by MFCCs and phonemes
“supplements” the knowledge our algorithm acquires from data.
When we don’t have much data, this knowledge is useful.
Now, consider the end-to-end system:
This system lacks the hand-engineered knowledge.
Thus, when the training set is small,
it might do worse than the hand-engineered pipeline.
But, when the training set is large,
then it is not hampered by the limitations of an MFCC or phoneme-based representation.
If the learning algorithm is a large-enough neural network
and if it is trained with enough training data,
that is, enough (audio, transcript) pairs,
it has the potential to do very well,
and perhaps even approach the optimal error rate.
When this type of data is not available,
approach end-to-end learning with great caution.

50- Choosing pipeline components: Data availability
When building a non-end-to-end pipeline system,
the design of the pipeline will greatly impact the overall system’s performance.
One important factor is whether you can easily collect data
to train each of the components.
For example, consider the autonomous driving architecture.
You can use machine learning to detect cars and pedestrians.
Further, it is not hard to obtain data for these:
There are numerous datasets with labeled cars and pedestrians.
You can also use crowdsourcing
(such as Amazon Mechanical Turk) to obtain even larger datasets.
Thus, it is relatively easy to obtain training data to build a car detector and a pedestrian detector.

In contrast, consider a pure end-to-end approach:
To train this system, we would need a large dataset of
(Image, Steering Direction) pairs.
It is very time-consuming and expensive
to have people drive cars around
and record their steering direction to collect such data.
You need a lot of specially-instrumented cars,
and a huge amount of driving to cover a wide range of possible scenarios.
This makes an end-to-end system difficult to train.
It is much easier to obtain a large dataset of labeled car or pedestrian images.
More generally:
if there is a lot of data available
for training “intermediate modules” of a pipeline
then you might consider using a pipeline with multiple stages.
This structure could be superior
because you could use all that available data
to train the intermediate modules.

51- Choosing pipeline components: Task simplicity
Other than data availability,
you should also consider a second factor when choosing pipeline components:
How simple are the tasks solved by the individual components?
Try to choose components that are individually easy to build or learn.
But what does “easy to learn” mean?
Consider these machine learning tasks, listed in order of increasing difficulty:
1. Classifying whether an image is overexposed (like the example above)
2. Classifying whether an image was taken indoors or outdoors
3. Classifying whether an image contains a cat
4. Classifying whether an image contains a cat with both black and white fur
5. Classifying whether an image contains a Siamese cat (a particular breed of cat)
Each of these is a binary image classification task:
You have to input an image, and output either 0 or 1.
But the tasks earlier in the list seem “easier” for a neural network to learn.
You will be able to learn the easier tasks with fewer training examples.

Machine learning does not yet have a good formal definition
of what makes a task easy or hard.
We sometimes say that a task is “easy” if it can be carried out with fewer computation steps
(corresponding to a shallow neural network),
and “hard” if it requires more computation steps
(requiring a deeper neural network).
But these are informal definitions.

Suppose you are building a Siamese cat detector.
You could use a pure end-to-end architecture
or a pipeline with two steps.
In the pipeline:
The first step (cat detector) detects all the cats in the image.
The second step then passes cropped images of each of the detected cats
(one at a time) to a cat breed classifier,
and finally outputs 1 if any of the cats detected is a Siamese cat.
Compared to training a purely end-to-end classifier using just labels 0/1,
each of the two components in the pipeline
seems much easier to learn and will require significantly less data.
In summary, when deciding what should be the components of a pipeline,
try to build a pipeline where each component is a relatively “simple” function
that can therefore be learned from only a modest amount of data.
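The way the two steps compose can be sketched as below. The detector and classifier here are hypothetical stand-ins (an "image" is just a list of object names) so that the example runs; in practice each would be a separately trained model.

```python
def siamese_pipeline(image, detect_cats, is_siamese) -> int:
    """Two-step pipeline: detect every cat, then classify each crop.
    Outputs 1 if any detected cat is classified as Siamese, else 0."""
    crops = detect_cats(image)                         # step 1: cat detector
    return int(any(is_siamese(c) for c in crops))      # step 2: breed classifier


# Hypothetical stand-ins: an "image" is a list of object names.
CAT_BREEDS = {"siamese", "tabby", "persian"}


def toy_detect_cats(image):
    return [obj for obj in image if obj in CAT_BREEDS]


def toy_is_siamese(crop):
    return crop == "siamese"


print(siamese_pipeline(["rock", "tabby", "siamese"], toy_detect_cats, toy_is_siamese))
print(siamese_pipeline(["rock", "tabby"], toy_detect_cats, toy_is_siamese))
```

Each stand-in solves a narrower problem than the end-to-end 0/1 task, which is the reason the text gives for expecting each component to need less data.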

52- Directly learning rich outputs
An image classification algorithm will input an image x,
and output an integer indicating the object category.
Can an algorithm instead output an entire sentence describing the image?
For example:
x = (image), y = “A yellow bus driving down a road
with green trees and green grass
in the background.”
Traditional applications of supervised learning learned a function h: X→Y,
where the output y was usually an integer or a real number.
For example:
Problem                    X                         Y
Spam classification        Email                     Spam/Not spam (0/1)
Image recognition          Image                     Integer label
Housing price prediction   Features of house         Price in dollars
Product recommendation     Product & user features   Chance of purchase
One of the most exciting developments in end-to-end deep learning
is that it lets us directly learn y that are much more complex than a number.
In the image-captioning example above,
you can have a neural network input an image (x) and directly output a caption (y).
This is an accelerating trend in deep learning.

Error analysis
by parts
53- Error analysis by parts
54- Attributing error to one part
55- General case of error attribution
56- Error analysis by parts and
comparison to human-level
performance
57- Spotting a flawed ML pipeline
Suppose your system is built using a complex machine learning pipeline,
and you would like to improve the system’s performance.
Which part of the pipeline should you work on improving?
By attributing errors to specific parts of the pipeline,
you can decide how to prioritize your work.
Let’s use our Siamese cat classifier example:
The first part, the cat detector, detects cats and crops them out of the image.
The second part, the cat breed classifier, decides if it is a Siamese cat.
By carrying out error analysis by parts,
you can try to attribute each mistake the algorithm makes
to one (or sometimes both) of the two parts of the pipeline

For example, the algorithm misclassifies this image
as not containing a Siamese cat (y=0)
even though the correct label is y=1.
Suppose the cat detector had detected a cat as shown in the 2nd image:
This means that the cat breed classifier is given the 3rd image:
The cat breed classifier then correctly classifies this image
as not containing a Siamese cat.
Thus, the cat breed classifier is blameless:
It was given an image of a pile of rocks and outputted a very reasonable label y=0.
Thus, you can clearly attribute this error to the cat detector.
If, on the other hand, the cat detector had outputted the 4th image:
then you would conclude that the cat detector had done its job,
and that it was the cat breed classifier that is at fault.
Our description of how you attribute error
to one part of the pipeline has been informal so far:
you look at the output of each of the parts
and see if you can decide which one made a mistake.
This informal method could be all you need.

54- Attributing error to one part
Suppose the cat detector outputted this bounding box:
The cat breed classifier is thus given this cropped image,
whereupon it incorrectly outputs y=0.
The cat detector did its job poorly.
So do we attribute this error to the cat detector,
or the cat breed classifier, or both?
It is ambiguous.
A more formal test resolves such cases:
1. Replace the cat detector output with a hand-labeled bounding box (a “perfect” input for the breed classifier).
2. Run the corresponding cropped image through the cat breed classifier.
If the cat breed classifier still misclassifies it,
attribute the error to the cat breed classifier.
Otherwise, attribute the error to the cat detector.
By carrying out this analysis on the misclassified dev set images,
you can now unambiguously attribute each error to one component.
This allows you to estimate the fraction of errors due to each component of the pipeline,
and therefore decide where to focus your attention.
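The two-step test and the resulting error fractions can be sketched as follows. The record layout (a dict with "hand_labeled_crop" and "label" keys) and the toy classifier are assumptions made for illustration.

```python
from collections import Counter


def attribute_errors(misclassified, breed_classifier):
    """For each misclassified dev-set example, rerun the breed classifier on a
    hand-labeled ("perfect") crop and blame whichever part is at fault.
    Returns the fraction of errors attributed to each component."""
    counts = Counter()
    for ex in misclassified:
        prediction = breed_classifier(ex["hand_labeled_crop"])
        if prediction != ex["label"]:
            counts["cat breed classifier"] += 1   # still wrong with perfect input
        else:
            counts["cat detector"] += 1           # a perfect crop fixes the error
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()} if total else {}


# Hypothetical data: crops are described by a string instead of pixels.
def toy_breed_classifier(crop):
    return int(crop == "clear siamese")


dev_errors = [
    {"hand_labeled_crop": "clear siamese", "label": 1},
    {"hand_labeled_crop": "clear siamese", "label": 1},
    {"hand_labeled_crop": "shadowed siamese", "label": 1},
]
print(attribute_errors(dev_errors, toy_breed_classifier))
```

In this toy run two of the three errors go away once the crop is perfect, so most of the blame falls on the cat detector.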

55- General case of error attribution
Suppose the pipeline has three steps A, B and C.
For each mistake the system makes on the dev set:
1. Try manually modifying A’s output to be a “perfect” output
and run the rest of the pipeline B, C on this output.
If the algorithm now gives a correct output,
then this shows that, if only A had given a better output,
the overall algorithm’s output would have been correct;
thus, you can attribute this error to component A.
Otherwise, go on to Step 2.
2. Try manually modifying B’s output to be the “perfect” output for B.
If the algorithm now gives a correct output,
then attribute the error to component B.
Otherwise, go on to Step 3.
3. Attribute the error to component C.
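The procedure above generalises to any number of ordered components. The sketch below assumes that perfect outputs are applied cumulatively (A stays perfect while B is being tested), and it takes a hypothetical callback that reports whether the final output is correct once the first k components have been given perfect outputs.

```python
from typing import Callable, Sequence


def attribute_error(
    components: Sequence[str],
    correct_with_perfect_prefix: Callable[[int], bool],
) -> str:
    """Blame the first component whose perfect output fixes the final result.

    correct_with_perfect_prefix(k) must return True if the overall output is
    correct when the outputs of the first k components are replaced by
    hand-made perfect outputs (an assumption of this sketch)."""
    for k, name in enumerate(components[:-1], start=1):
        if correct_with_perfect_prefix(k):
            return name
    # No earlier fix helped, so the last component is at fault.
    return components[-1]


# Hypothetical mistake that only disappears once both A and B are perfect.
print(attribute_error(["A", "B", "C"], lambda k: k >= 2))
```

Running this over every dev-set mistake and counting the returned names gives the share of errors due to each component.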

Let’s look at a more complex example:
Your self-driving car uses this pipeline.
You can map the three components to A, B, C as follows:
A: Detect cars / B: Detect pedestrians / C: Plan path for car
You would then:
1. Try manually modifying A’s output to be a “perfect” output
(e.g., manually go in and tell it where the other cars are).
Run the rest of the pipeline B, C as before,
but allow C (plan path) to use A’s now perfect output.
If the algorithm now plans a much better path for the car,
then this shows that, if only A had given a better output,
the overall algorithm’s output would have been better;
thus, you can attribute this error to component A.
Otherwise, go on to Step 2.
2. Try manually modifying B’s (detect pedestrians) output
to be the “perfect” output for B.
If the algorithm now plans a much better path,
then attribute the error to component B.
Otherwise, go on to Step 3.
3. Attribute the error to component C.

The components of an ML pipeline should be ordered
according to a Directed Acyclic Graph (DAG),
meaning that you should be able to compute them
in some fixed left-to-right order,
and later components should depend
only on earlier components’ outputs.
So long as the mapping of the components to the A->B->C
order follows the DAG ordering,
then the error analysis will be fine.

56- Error analysis by parts and comparison to human-level performance
Let’s return to the self-driving application.
To debug this pipeline, rather than rigorously following the procedure in the previous chapter,
you could more informally ask:
1. How far is the Detect cars component from human-level performance at detecting cars?
2. How far is the Detect pedestrians component from human-level performance?
3. How far is the overall system’s performance from human-level performance?
Here, human-level performance assumes the human has to plan a path for the car
given only the outputs from the previous two pipeline components.
If you find that one of the components is far from human-level performance,
you now have a good case to focus on improving the performance of that component.
Many error analysis processes work best when we are trying to automate
something humans can do and can thus benchmark against human-level performance.
Most of our preceding examples had this implicit assumption.
This is another advantage of working on problems that humans can solve:
you have more powerful error analysis tools,
and thus you can prioritize your team’s work more efficiently.
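The three questions amount to comparing each component's error with a human's error on the same task. A minimal sketch follows; all error rates are made-up numbers for illustration, not measurements.

```python
def gaps_to_human(component_error, human_error):
    """Gap between each component's error and human-level error on that task."""
    return {name: component_error[name] - human_error[name] for name in component_error}


# Hypothetical error rates, invented for this example.
component_error = {"Detect cars": 0.02, "Detect pedestrians": 0.10, "Plan path": 0.05}
human_error = {"Detect cars": 0.01, "Detect pedestrians": 0.02, "Plan path": 0.04}

gaps = gaps_to_human(component_error, human_error)
focus = max(gaps, key=gaps.get)  # component farthest from human level
print(focus)
```

With these invented numbers the pedestrian detector has the largest gap, so it would be the first candidate for improvement.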

Carrying out error analysis on a learning algorithm
is like using data science to analyze an ML system’s mistakes
in order to derive insights about what to do next.
Error analysis by parts tells us
which component’s performance is worth the greatest effort to improve.
There is no one “right” way to analyze a dataset,
and there are many possible useful insights one could draw.
Similarly,
there is no one “right” way to carry out error analysis,
and you should feel free to experiment with other ways of analyzing errors as well.

57- Spotting a flawed ML pipeline
What if each individual component of your ML pipeline
is performing at human-level performance
but the overall pipeline falls far short of human-level?
This usually means that the pipeline is flawed and needs to be redesigned.
In the previous chapter,
we posed the question of whether each of the three components’ performance is at human level.
Suppose the answer to all three questions is yes.
However, your overall self-driving car
is performing significantly below human-level performance.
The only possible conclusion is that the ML pipeline is flawed.
In this case, the Plan path component is doing as well as it can given its inputs,
but the inputs do not contain enough information.
You should ask yourself what other information
is needed to plan paths very well for a car to drive.
In other words, what other information does a skilled human driver need?
For example, suppose you realize that
a human driver also needs to know
the location of the lane markings.
This suggests that you should
redesign the pipeline to add a component that detects lane markings
and feeds its output to the Plan path component.

Conclusion
58- Building a superhero team - Get
your teammates to read this Page
Congratulations on finishing this book!
In Chapter 2, we talked about how this book can help you
become the superhero of your team.
The only thing better than being a superhero
is being part of a superhero team.
I hope you’ll give copies of this book
to your friends and teammates and help create other superheroes!

