3. What is machine learning?
• For many problems, it’s difficult to program the correct
behavior by hand:
recognizing people and objects
understanding human speech
• Machine learning approach: program an algorithm to
automatically learn from data, or from experience.
• Why might you want to use a learning algorithm?
Hard to code up a solution by hand (e.g. vision, speech)
System needs to adapt to a changing environment (e.g.
spam detection)
Want the system to perform better than the human
programmers
Privacy/fairness (e.g. ranking search results)
5. What is machine learning?
• It’s similar to statistics...
Both fields try to uncover patterns in data
Both fields draw heavily on calculus, probability, and
linear algebra, and share many of the same core
algorithms
• But it’s not statistics!
Stats is more concerned with helping scientists and
policymakers draw good conclusions; ML is more
concerned with building autonomous agents
Stats puts more emphasis on interpretability and
mathematical rigor; ML puts more emphasis on
predictive performance, scalability, and autonomy
6. Relation to AI
• Nowadays, “machine learning” is often mentioned alongside “artificial intelligence” (AI)
• AI does not always imply a learning-based system
Symbolic reasoning
Rule-based systems
Tree search
etc.
• Learning-based system → behavior is learned from data → more flexibility, good at solving pattern recognition problems.
7. Relation to human learning
• Human learning is:
Very data efficient
An entire multitasking system (vision, language, motor
control, etc.)
Takes at least a few years :)
• For serving specific purposes, machine learning doesn’t
have to look like human learning in the end.
• It may borrow ideas from biological systems, e.g., neural
networks.
• It may perform better or worse than humans.
8. What is machine learning?
• Types of machine learning
• Supervised learning: have labeled examples of the correct
behavior
• Reinforcement learning: learning system (agent) interacts
with the world and learns to maximize a scalar reward signal
• Unsupervised learning: no labeled examples – instead,
looking for “interesting” patterns in the data
10. What is machine learning?
| Criteria | Supervised Learning | Unsupervised Learning | Reinforcement Learning |
|---|---|---|---|
| Definition | Learns from labeled data to map inputs to known outputs | Explores patterns and associations in unlabeled data | Learns through interactions with an environment to maximize rewards |
| Type of Data | Labeled data | Unlabeled data | No predefined data; interacts with environment |
| Type of Problems | Regression and classification | Clustering and association | Exploitation or exploration |
| Supervision | Requires external supervision | No supervision | No supervision |
| Algorithms | Linear Regression, Logistic Regression, SVM, KNN | K-means clustering, Hierarchical clustering, DBSCAN, Principal Component Analysis | Q-learning, SARSA, Deep Q-Network |
| Aim | Calculate outcomes based on labeled data | Discover underlying patterns and group data | Learn a series of actions to achieve a goal |
| Applications | Risk evaluation, forecasting sales | Recommendation systems, anomaly detection | Self-driving cars, gaming, healthcare |
| Learning Process | Maps labeled inputs to known outputs | Finds patterns and trends in data | Trial and error method with rewards and penalties |
https://www.youtube.com/watch?v=x3KOCphRltk
11. History of machine learning
• 1957: Perceptron algorithm (implemented as a circuit!)
• 1959: Arthur Samuel wrote a learning-based checkers program that could defeat him
• 1969: Minsky and Papert’s book Perceptrons (limitations of linear models)
• 1980s: Some foundational ideas
Connectionist psychologists explored neural models of cognition
1984: Leslie Valiant formalized the problem of learning as PAC learning
1986: Backpropagation (re-)discovered by Geoffrey Hinton and colleagues
1988: Judea Pearl’s book Probabilistic Reasoning in Intelligent Systems introduced Bayesian networks
12. History of machine learning
• 1990s: the “AI Winter”, a time of pessimism and low funding
• But looking back, the ’90s were also sort of a golden age for ML research
Markov chain Monte Carlo
variational inference
kernels and support vector machines
boosting
convolutional networks
reinforcement learning
• 2000s: applied AI fields (vision, NLP, etc.) adopted ML
• 2010s: deep learning
2010–2012: neural nets smashed previous records in speech-to-text and object recognition
increasing adoption by the tech industry
2016: AlphaGo defeated the human Go champion
2018–now: generating photorealistic images and videos
2020: GPT-3 language model
• now: increasing attention to ethical and societal implications
13. Examples
• Computer vision: Object detection, semantic segmentation, pose
estimation, and almost every other task is done with ML.
18. ML Workflow
• ML workflow sketch:
1. Should I use ML on this problem? Is there a pattern to detect? Can I solve it analytically? Do I have data?
2. Gather and organize data: preprocessing, cleaning, visualizing.
3. Establish a baseline.
4. Choose a model, loss, regularization, ...
5. Optimization (could be simple, could be a PhD...).
6. Hyperparameter search.
7. Analyze performance and mistakes, and iterate back to step 4 (or 2).
19. Implementing machine learning systems
• You will often need to derive an algorithm (with pencil and paper), and then
translate the math into code.
• Array processing (NumPy)
vectorize computations (express them in terms of matrix/vector
operations) to exploit hardware efficiency
This also makes your code cleaner and more readable!
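To make the vectorization point concrete, here is a minimal sketch comparing an explicit loop with the equivalent matrix-vector product. The function names, the linear prediction being computed, and the random data are assumptions chosen for illustration; they do not come from the slides.

```python
import numpy as np

def predict_loop(X, w, b):
    """Compute y_i = w . x_i + b one element at a time (slow in pure Python)."""
    N, D = X.shape
    y = np.zeros(N)
    for i in range(N):
        for j in range(D):
            y[i] += w[j] * X[i, j]
        y[i] += b
    return y

def predict_vectorized(X, w, b):
    """The same computation written as a single matrix-vector product."""
    return X @ w + b

# Illustrative random data (assumed, not from the slides).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
w = rng.normal(size=5)
b = 0.5

y_loop = predict_loop(X, w, b)
y_vec = predict_vectorized(X, w, b)
```

Both functions return the same values up to floating-point rounding; the vectorized version is shorter and hands the inner loops to optimized array routines.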
20. Implementing machine learning systems
• In supervised learning we are given a training set consisting of inputs
and corresponding labels, e.g.
22. Input vectors
• Machine learning algorithms need to handle lots of types of data:
images, text, audio waveforms, credit card transactions, etc.
• Common strategy: represent the input as an input vector in $\mathbb{R}^d$
Representation = mapping to another space that’s easy to manipulate
Vectors are a great representation since we can do linear algebra!
23. Input vectors
• Can use raw pixels:
• Can do much better if you compute a vector of meaningful features.
24. Examples
• Mathematically, our training set consists of a collection of pairs of an input vector $\mathbf{x} \in \mathbb{R}^d$ and its corresponding target, or label, $t$
Regression: $t$ is a real number (e.g. stock price)
Classification: $t$ is an element of a discrete set $\{1, \ldots, C\}$
These days, $t$ is often a highly structured object (e.g. image)
• Denote the training set $\{(\mathbf{x}^{(1)}, t^{(1)}), \ldots, (\mathbf{x}^{(N)}, t^{(N)})\}$
Note: these superscripts have nothing to do with exponentiation!
30. Regression
• What do all these problems have in common?
Continuous outputs, we’ll call these $t$ (e.g., a rating: a real number between 0 and 10, number of followers, house price)
Predicting continuous outputs is called regression
• What do I need in order to predict these outputs?
Features (inputs), we’ll call these $x$ (or $\mathbf{x}$ if vectors)
Training examples, many $x^{(i)}$ for which $t^{(i)}$ is known (e.g., many movies for which we know the rating)
A model, a function that represents the relationship between $x$ and $t$
A loss, cost, or objective function, which tells us how well our model approximates the training examples
Optimization, a way of finding the parameters of our model that minimize the loss function
31. Linear regression
• Linear regression
continuous outputs
simple model (linear)
• Introduce key concepts:
loss functions
generalization
optimization
model complexity
regularization
32. Simple 1-D regression
• Circles are data points (i.e., training examples) that are given to us
• The data points are uniform in x, but may be displaced in y with some noise
• In green is the “true” curve that we don’t know
• Goal: We want to fit a curve to these points
33. Simple 1-D regression
• Key Questions:
How do we parametrize the model?
What loss (objective) function should we use to judge the fit?
How do we optimize fit to unseen test data (generalization)?
34. Noise
• Data is described as pairs $D = \{(x^{(1)}, t^{(1)}), \ldots, (x^{(N)}, t^{(N)})\}$
$x \in \mathbb{R}$ is the input feature (per capita crime rate)
$t \in \mathbb{R}$ is the target output (median house price)
$(i)$ simply indexes the training examples (we have $N$ in this case)
• Here $t$ is continuous, so this is a regression problem
• Model outputs $y$, an estimate of $t$: $y(x) = f(x, \mathbf{w})$
• What type of model did we choose?
• Divide the dataset into training and testing examples
Use the training examples to construct a hypothesis, or function approximator, that maps $x$ to predicted $y$
Evaluate the hypothesis on the test set
35. Least-Squares Regression
• Define a model
Linear: $y(x) = w_0 + w_1 x$
• Standard loss/cost/objective function measures the squared error between $y$ and the true value $t$
$\ell(\mathbf{w}) = \sum_{n=1}^{N} \left[ t^{(n)} - y(x^{(n)}) \right]^2$
Linear model: $\ell(\mathbf{w}) = \sum_{n=1}^{N} \left[ t^{(n)} - (w_0 + w_1 x^{(n)}) \right]^2$
• For a particular hypothesis ($y(x)$ defined by a choice of $\mathbf{w}$, drawn in red), what does the loss represent geometrically?
• The loss for the red hypothesis is the sum of the squared vertical errors (squared lengths of green vertical lines)
• How do we obtain weights $\mathbf{w} = (w_0, w_1)$? Find $\mathbf{w}$ that minimizes the loss $\ell(\mathbf{w})$
• For the linear model, what kind of a function is $\ell(\mathbf{w})$?
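The geometric reading of the loss (sum of squared vertical errors) can be checked numerically. The sketch below assumes a 1-D linear hypothesis with weights `w0` and `w1`; the function name and the four data points are made up for illustration.

```python
import numpy as np

def squared_error_loss(w0, w1, x, t):
    """Sum of squared vertical distances between targets t and the line w0 + w1 * x."""
    y = w0 + w1 * x           # predictions of the hypothesis
    residuals = t - y         # signed vertical errors
    return float(np.sum(residuals ** 2))

# Toy data lying exactly on the line t = 1 + 2x (assumed for illustration).
x = np.array([0.0, 1.0, 2.0, 3.0])
t = np.array([1.0, 3.0, 5.0, 7.0])

loss_perfect = squared_error_loss(1.0, 2.0, x, t)  # line through every point
loss_flat = squared_error_loss(4.0, 0.0, x, t)     # horizontal line at the mean of t
```

Here the line through every point has zero loss, while the flat line has vertical errors of -3, -1, 1 and 3, giving a loss of 9 + 1 + 1 + 9 = 20.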
42. Optimizing the Objective
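• One straightforward method: gradient descent
initialize $\mathbf{w}$ (e.g., randomly)
repeatedly update $\mathbf{w}$ based on the gradient: $\mathbf{w} \leftarrow \mathbf{w} - \lambda \frac{\partial \ell}{\partial \mathbf{w}}$
• $\lambda$ is the learning rate
• For a single training case, this gives the LMS update rule: $\mathbf{w} \leftarrow \mathbf{w} + 2\lambda \left( t^{(n)} - y(x^{(n)}) \right) \mathbf{x}^{(n)}$, where $\mathbf{x}^{(n)} = (1, x^{(n)})$
• Note: As error approaches zero, so does the update ($\mathbf{w}$ stops changing)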
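As a rough illustration of the single-case update, the sketch below applies one LMS step to the 1-D linear model $y = w_0 + w_1 x$. The function name `lms_update`, the learning rate, and the example values are assumptions for illustration only.

```python
import numpy as np

def lms_update(w, x_n, t_n, lr):
    """One LMS step for y = w[0] + w[1] * x on a single training case."""
    features = np.array([1.0, x_n])   # x0 = 1 carries the bias w0
    error = t_n - features @ w        # t - y(x)
    # Negative gradient of the squared error, scaled by the learning rate.
    return w + 2 * lr * error * features

# Starting from zero weights, the error on the case (x=2, t=5) is 5.
w_start = np.array([0.0, 0.0])
w_after = lms_update(w_start, x_n=2.0, t_n=5.0, lr=0.1)

# Weights that already predict t exactly: the error, and so the update, is zero.
w_exact = np.array([1.0, 2.0])
w_unchanged = lms_update(w_exact, x_n=2.0, t_n=5.0, lr=0.1)
```

The second call shows the note above in action: when the error is zero, the weights do not move.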
43. Optimizing Across Training Set
• Two ways to generalize this for all examples in training set:
1. Batch updates: sum or average updates across every example n, then change
the parameter values
2. Stochastic/online updates: update the parameters for each
training case in turn, according to its own gradients
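The two schemes can be sketched side by side. This is a minimal sketch under stated assumptions: the function names, the zero initialization, the learning rate, the epoch count and the noiseless toy data are all illustrative choices, and the batch version sums (rather than averages) the per-example updates.

```python
import numpy as np

def batch_epoch(w, X, t, lr):
    """Batch update: sum the per-example gradients, then change w once."""
    err = t - X @ w
    return w + 2 * lr * (X.T @ err)

def online_epoch(w, X, t, lr):
    """Online updates: change w after each training case in turn."""
    w = w.copy()
    for x_n, t_n in zip(X, t):
        err = t_n - x_n @ w
        w += 2 * lr * err * x_n
    return w

# Toy data on the line t = 1 + 2x; the column of ones plays the role of x0 = 1.
x = np.array([0.0, 1.0, 2.0, 3.0])
X = np.column_stack([np.ones_like(x), x])
t = 1.0 + 2.0 * x

w_batch = np.zeros(2)
w_online = np.zeros(2)
for _ in range(5000):
    w_batch = batch_epoch(w_batch, X, t, lr=0.01)
    w_online = online_epoch(w_online, X, t, lr=0.01)
```

On this noiseless example both variants approach the weights (1, 2). With noisy data and a fixed learning rate, online updates generally keep fluctuating around the minimum rather than settling exactly.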
44. Linear Regression with Multi-dimensional Inputs
• One method of extending the model is to consider other input dimensions
• In the Boston housing example, we can look at the number of rooms
45. Multi-dimensional Inputs
• Imagine now we want to predict the median house price from these multi-dimensional observations
• Each house is a data point $n$, with observations indexed by $j$: $\mathbf{x}^{(n)} = (x_1^{(n)}, \ldots, x_d^{(n)})$
• We can incorporate the bias $w_0$ into $\mathbf{w}$, by using $x_0 = 1$, then $y(\mathbf{x}) = w_0 + \sum_{j=1}^{d} w_j x_j = \mathbf{w}^T \mathbf{x}$
• We can use gradient descent to solve for each coefficient, or compute $\mathbf{w}$ analytically (how does the solution change?)
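One way to compute $\mathbf{w}$ analytically is to solve the least-squares problem directly, which corresponds to $\mathbf{w} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{t}$ when $\mathbf{X}^T \mathbf{X}$ is invertible. The sketch below uses `np.linalg.lstsq` instead of forming the inverse explicitly; the function name and the synthetic data are assumptions, not the Boston housing data.

```python
import numpy as np

def fit_least_squares(X, t):
    """Return w = (w0, w1, ..., wd) minimizing the sum of squared errors."""
    X1 = np.column_stack([np.ones(len(X)), X])   # prepend x0 = 1 for the bias
    w, *_ = np.linalg.lstsq(X1, t, rcond=None)
    return w

# Synthetic noiseless data with three input dimensions (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
true_w = np.array([4.0, 1.0, -2.0, 0.5])         # (w0, w1, w2, w3)
t = true_w[0] + X @ true_w[1:]

w = fit_least_squares(X, t)
```

Because the targets here are generated without noise, the recovered weights match `true_w` up to rounding.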
46. Fitting a Polynomial
• What if our linear model is not good? How can we create a more complicated model?
• We can create a more complicated model by defining input variables that are combinations of components of $\mathbf{x}$
• Example: an $M$-th order polynomial function of a one-dimensional feature $x$: $y(x, \mathbf{w}) = w_0 + \sum_{j=1}^{M} w_j x^j$
• where $x^j$ is the $j$-th power of $x$
• We can use gradient descent to solve for each coefficient, or compute $\mathbf{w}$ analytically
• How do we do that?
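One possible answer: treat the powers of $x$ as separate input features, so the polynomial model stays linear in $\mathbf{w}$ and ordinary least squares still applies. The sketch below is an illustration under that assumption; the helper names and the quadratic toy data are hypothetical.

```python
import numpy as np

def polynomial_features(x, M):
    """Columns x^0, x^1, ..., x^M for a 1-D input array x."""
    return np.vander(x, M + 1, increasing=True)

def fit_polynomial(x, t, M):
    """Least-squares fit of an M-th order polynomial; returns (w0, ..., wM)."""
    Phi = polynomial_features(x, M)
    w, *_ = np.linalg.lstsq(Phi, t, rcond=None)
    return w

# Toy data from a known quadratic (assumed for illustration).
x = np.linspace(-1.0, 1.0, 20)
t = 1.0 - 2.0 * x + 3.0 * x ** 2

w = fit_polynomial(x, t, M=2)
```

With noiseless quadratic data and M = 2, the fit recovers the coefficients (1, -2, 3) up to rounding.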
49. Generalization
• Generalization = model’s ability to predict the held-out data
• What is happening?
• Our model with M = 9 overfits the data (it also models the noise)
• Not a problem if we have lots of training examples
• Let’s look at the estimated weights for various M in the case of fewer examples
• The weights are becoming huge to compensate for the noise
• One way of dealing with this is to encourage the weights to be small (this way no input dimension will have too much influence on prediction). This is called regularization
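The effect can be explored with a small experiment. The sketch below fits polynomials of order 1, 3 and 9 to ten noisy points and records training error, held-out error and the largest weight magnitude. The sinusoidal “true” curve, the noise level, the sample sizes and all names are assumptions for illustration; the exact numbers depend on the random seed.

```python
import numpy as np

def fit_polynomial(x, t, M):
    """Least-squares fit of an M-th order polynomial; returns (w0, ..., wM)."""
    Phi = np.vander(x, M + 1, increasing=True)
    w, *_ = np.linalg.lstsq(Phi, t, rcond=None)
    return w

def mean_squared_error(w, x, t):
    Phi = np.vander(x, len(w), increasing=True)
    return float(np.mean((t - Phi @ w) ** 2))

rng = np.random.default_rng(0)

def noisy_targets(x):
    # Hypothetical "true" curve plus Gaussian noise.
    return np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.shape)

x_train = np.linspace(0.0, 1.0, 10)        # few training examples
x_test = rng.uniform(0.0, 1.0, size=200)   # held-out data
t_train, t_test = noisy_targets(x_train), noisy_targets(x_test)

results = {}
for M in (1, 3, 9):
    w = fit_polynomial(x_train, t_train, M)
    results[M] = {
        "train_mse": mean_squared_error(w, x_train, t_train),
        "test_mse": mean_squared_error(w, x_test, t_test),
        "max_abs_weight": float(np.max(np.abs(w))),
    }
    print(M, results[M])
```

Training error can only go down as M grows, since each model contains the previous one. In runs like this the M = 9 fit typically shows near-zero training error together with much larger weights and worse held-out error, which is the overfitting pattern described above.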
54. Regularized Least Squares
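• Increasing the input features this way can complicate the model considerably
• Goal: select the appropriate model complexity automatically
• Standard approach: regularization
$\tilde{\ell}(\mathbf{w}) = \sum_{n=1}^{N} \left[ t^{(n)} - y(\mathbf{x}^{(n)}) \right]^2 + \alpha \mathbf{w}^T \mathbf{w}$, with $y(\mathbf{x}) = \mathbf{w}^T \mathbf{x}$
• Intuition: Since we are minimizing the loss, the second term will encourage smaller values in $\mathbf{w}$
• The penalty on the squared weights is known as ridge regression in statistics
• Leads to a modified update rule for gradient descent: $\mathbf{w} \leftarrow \mathbf{w} + 2\lambda \left[ \sum_{n=1}^{N} \left( t^{(n)} - y(\mathbf{x}^{(n)}) \right) \mathbf{x}^{(n)} - \alpha \mathbf{w} \right]$
• Also has an analytical solution: $\mathbf{w} = (\mathbf{X}^T \mathbf{X} + \alpha \mathbf{I})^{-1} \mathbf{X}^T \mathbf{t}$ (verify!)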
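The “verify!” can be done numerically: at the ridge solution, the gradient of the regularized loss should vanish. The sketch below assumes the penalty $\alpha \mathbf{w}^T \mathbf{w}$ applies to every weight, including the bias column; the function name, the values of `alpha` and the synthetic data are illustrative assumptions.

```python
import numpy as np

def fit_ridge(X, t, alpha):
    """Solve (X^T X + alpha I) w = X^T t for w."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ t)

# Synthetic data; the first column of ones plays the role of x0 = 1.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(30), rng.normal(size=(30, 4))])
t = X @ np.array([1.0, 2.0, -1.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=30)

w_small = fit_ridge(X, t, alpha=0.1)
w_large = fit_ridge(X, t, alpha=100.0)

# Gradient of the regularized loss at w_small: -2 X^T (t - X w) + 2 alpha w.
grad = -2 * X.T @ (t - X @ w_small) + 2 * 0.1 * w_small
```

The gradient is zero up to rounding at the closed-form solution, and the larger `alpha` gives a weight vector with a smaller norm, matching the intuition that the penalty encourages small weights.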
56. 1-D regression illustrates key concepts
• Data fits: is a linear model best (model selection)?
Simple models may not capture all the important variations (signal) in the data: underfit
More complex models may overfit the training data (fit not only the signal but also the noise in the data), especially if there is not enough data to constrain the model
• One method of assessing fit: test generalization = model’s ability to predict the held-out data
• Optimization is essential: stochastic and batch iterative approaches; analytic when available
59. Classification
What do all these problems have in common?
Categorical outputs, called labels (eg, yes/no, dog/cat/person/other)
Assigning each input vector to one of a finite number of labels is called
classification
Binary classification: two possible labels (eg, yes/no, 0/1, cat/dog)
Multi-class classification: multiple possible labels
We will first look at binary problems, and discuss multi-class problems later
60. Today
• We are interested in mapping the input $\mathbf{x} \in \mathcal{X}$ to a label $t \in \mathcal{Y}$
• In regression typically $\mathcal{Y} = \mathbb{R}$
• Now $\mathcal{Y}$ is categorical
61. Classification as Regression
• Can we do this task using what we have learned in previous lectures?
• Simple hack: Ignore that the output is categorical!
• Suppose we have a binary problem, $t \in \{-1, 1\}$
• Assuming the standard model used for regression: $y(\mathbf{x}) = f(\mathbf{x}, \mathbf{w}) = \mathbf{w}^T \mathbf{x}$
• How can we obtain $\mathbf{w}$?
• Use least squares, $\mathbf{w} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{t}$. How is $\mathbf{X}$ computed? And $\mathbf{t}$?
• Which loss are we minimizing? Does it make sense?
• How do I compute a label for a new example? Let’s see an example
62. Classification as Regression
• One-dimensional example (input $x$ is 1-dim)
• The colors indicate labels (a blue plus denotes that $t^{(i)}$ is from class −1, a red circle that it is from class 1)
63. Decision Rules
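• Our classifier has the form $f(\mathbf{x}, \mathbf{w}) = w_0 + \mathbf{w}^T \mathbf{x}$
• A reasonable decision rule is: $y = 1$ if $f(\mathbf{x}, \mathbf{w}) \geq 0$, and $y = -1$ otherwise
• What does this function look like?
• How can I mathematically write this rule?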
64. Decision Rules
• How can I mathematically write this rule? $y(\mathbf{x}) = \mathrm{sign}(w_0 + \mathbf{w}^T \mathbf{x})$
• This specifies a linear classifier: it has a linear boundary (hyperplane) $w_0 + \mathbf{w}^T \mathbf{x} = 0$
• which separates the space into two “half-spaces”
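The whole pipeline (least-squares fit on ±1 targets, then thresholding at zero) can be sketched as follows. The function names and the six 1-D training points are assumptions for illustration; the decision rule assigns class 1 when the linear score is at least 0 and class −1 otherwise.

```python
import numpy as np

def fit_least_squares_classifier(X, t):
    """Fit w = (w0, w1, ..., wd) by least squares on targets in {-1, +1}."""
    X1 = np.column_stack([np.ones(len(X)), X])   # x0 = 1 for the bias
    w, *_ = np.linalg.lstsq(X1, t, rcond=None)
    return w

def predict_label(w, X):
    """Decision rule: class 1 if w0 + w^T x >= 0, otherwise class -1."""
    X1 = np.column_stack([np.ones(len(X)), X])
    return np.where(X1 @ w >= 0, 1, -1)

# Toy 1-D data: negative inputs belong to class -1, positive inputs to class 1.
X = np.array([[-3.0], [-2.0], [-1.0], [1.0], [2.0], [3.0]])
t = np.array([-1, -1, -1, 1, 1, 1])

w = fit_least_squares_classifier(X, t)
labels = predict_label(w, X)
new_labels = predict_label(w, np.array([[-0.5], [4.0]]))
```

On this symmetric example the fitted boundary sits at approximately x = 0, so all training points are labeled correctly. Least squares penalizes confident correct predictions as well, so on less balanced data the boundary can shift in unhelpful ways.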
103. Nearest Neighbors
A K Nearest Neighbor (KNN) classifier is a machine learning model that makes predictions based on the majority class of the K nearest data points in the feature space. The KNN algorithm assumes that similar things exist in close proximity, making it intuitive and easy to understand.
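A minimal sketch of majority-vote KNN with Euclidean distance is shown below. The function name, the default `k=3`, and the two small clusters labeled "blue" and "red" are assumptions for illustration; ties are broken by whichever label is encountered first among the nearest neighbors.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, t_train, x_query, k=3):
    """Predict the majority label among the k training points nearest to x_query."""
    dists = np.linalg.norm(X_train - x_query, axis=1)   # Euclidean distances
    nearest = np.argsort(dists)[:k]                     # indices of the k closest points
    votes = Counter(t_train[nearest].tolist())
    return votes.most_common(1)[0][0]

# Two well-separated toy clusters (illustrative only).
X_train = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                    [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
t_train = np.array(["blue", "blue", "blue", "red", "red", "red"])

label_near_origin = knn_predict(X_train, t_train, np.array([0.5, 0.5]))
label_far_corner = knn_predict(X_train, t_train, np.array([5.5, 5.5]))
```

A query near the origin is labeled "blue" and a query near (5.5, 5.5) is labeled "red", since all three nearest neighbors come from the matching cluster. There is no training step: the model is the stored data.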
106. Nearest Neighbors
Euclidean distance is probably the most intuitive metric and represents the shortest distance between two points. It’s calculated using the well-known Pythagorean theorem. Conceptually, it should be used whenever we are comparing observations with continuous features, like height, weight, or salaries. This distance measure is often the “default” distance used in algorithms like KNN.
108. Nearest Neighbors distance
Manhattan distance estimates the distance to get from one data point to another if a grid-like path is taken. Unlike Euclidean distance, the Manhattan distance is the sum of the absolute values of the differences between the coordinates of two points. This way, instead of estimating a straight line between two points, we “walk” through available paths. The Manhattan distance is useful when our observations are distributed along a grid, like in chess or city blocks (when the features of our observations are integers).
109. Nearest Neighbors distance
Minkowski distance is a versatile metric used in normed vector spaces, named after the German mathematician Hermann Minkowski. It generalizes several well-known distance measures, including the Euclidean and Manhattan distances, making it a fundamental concept in fields such as mathematics, computer science, and data analysis.
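The Minkowski distance of order $p \geq 1$ between points $a$ and $b$ is $\left( \sum_i |a_i - b_i|^p \right)^{1/p}$; setting $p = 1$ gives the Manhattan distance and $p = 2$ the Euclidean distance. A minimal sketch, with a function name and example points chosen for illustration:

```python
import numpy as np

def minkowski_distance(a, b, p):
    """Minkowski distance of order p (p >= 1) between two points."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.sum(np.abs(a - b) ** p) ** (1.0 / p))

a, b = [0.0, 0.0], [3.0, 4.0]
manhattan = minkowski_distance(a, b, p=1)   # |3| + |4| = 7
euclidean = minkowski_distance(a, b, p=2)   # sqrt(9 + 16) = 5
```

For these two points the grid-path distance is 7 and the straight-line distance is 5, the familiar 3-4-5 right triangle.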