Machine Learning Presentation - Vilnius Tech

Machine learning
Department of Mechatronics, Robotics and
Digital Manufacturing
Vygantas Ušinskis
vygantas.usinskis@vgtu.lt

Machine learning introduction

What is machine learning?
• For many problems, it’s difficult to program the correct
behavior by hand:
 recognizing people and objects
 understanding human speech
• Machine learning approach: program an algorithm to
automatically learn from data, or from experience.
• Why might you want to use a learning algorithm?
 Hard to code up a solution by hand (e.g. vision, speech)
 System needs to adapt to a changing environment (e.g.
spam detection)
 Want the system to perform better than the human
programmers
 Privacy/fairness (e.g. ranking search results)

• It’s similar to statistics...
 Both fields try to uncover patterns in data
 Both fields draw heavily on calculus, probability, and
linear algebra, and share many of the same core
algorithms
• But it’s not statistics!
 Stats is more concerned with helping scientists and
policymakers draw good conclusions; ML is more
concerned with building autonomous agents
 Stats puts more emphasis on interpretability and
mathematical rigor; ML puts more emphasis on
predictive performance, scalability, and autonomy

Relation to AI
• Nowadays, “machine learning” is often brought up with
“artificial intelligence” (AI)
• AI does not always imply a learning based system
 Symbolic reasoning
 Rule based system
 Tree search
 etc.
• Learning based system → learned based on the data →
more flexibility, good at solving pattern recognition problems.

Relation to human learning
• Human learning is:
 Very data efficient
 An entire multitasking system (vision, language, motor
control, etc.)
 Takes at least a few years :)
• For serving specific purposes, machine learning doesn’t
have to look like human learning in the end.
• It may borrow ideas from biological systems, e.g., neural
networks.
• It may perform better or worse than humans.

Types of machine learning
• Supervised learning: have labeled examples of the correct
behavior
• Reinforcement learning: learning system (agent) interacts
with the world and learns to maximize a scalar reward signal
• Unsupervised learning: no labeled examples – instead,
looking for “interesting” patterns in the data

| Criteria | Supervised Learning | Unsupervised Learning | Reinforcement Learning |
|---|---|---|---|
| Definition | Learns from labeled data to map inputs to known outputs | Explores patterns and associations in unlabeled data | Learns through interactions with an environment to maximize rewards |
| Type of Data | Labeled data | Unlabeled data | No predefined data; interacts with environment |
| Type of Problems | Regression and classification | Clustering and association | Exploitation or exploration |
| Supervision | Requires external supervision | No supervision | No supervision |
| Algorithms | Linear Regression, Logistic Regression, SVM, KNN | K-means clustering, Hierarchical clustering, DBSCAN, Principal Component Analysis | Q-learning, SARSA, Deep Q-Network |
| Aim | Calculate outcomes based on labeled data | Discover underlying patterns and group data | Learn a series of actions to achieve a goal |
| Applications | Risk evaluation, forecasting sales | Recommendation systems, anomaly detection | Self-driving cars, gaming, healthcare |
| Learning Process | Maps labeled inputs to known outputs | Finds patterns and trends in data | Trial and error method with rewards and penalties |
https://www.youtube.com/watch?v=x3KOCphRltk

History of machine learning
• 1957 — Perceptron algorithm (implemented as a circuit!)
• 1959 — Arthur Samuel wrote a learning-based checkers program
that could defeat him
• 1969 — Minsky and Papert’s book Perceptrons (limitations of
linear models)
• 1980s — Some foundational ideas
 Connectionist psychologists explored neural models of
cognition
 1984 — Leslie Valiant formalized the problem of learning as
PAC learning
 1986 — Backpropagation (re-)discovered by Geoffrey Hinton
and colleagues
 1988 — Judea Pearl’s book Probabilistic Reasoning in
Intelligent Systems introduced Bayesian networks

History of machine learning
• 1990s — the “AI Winter”, a time of pessimism and low funding
• But looking back, the ’90s were also sort of a golden age for ML research
 Markov chain Monte Carlo
 variational inference
 kernels and support vector machines
 Boosting
 convolutional networks
 reinforcement learning
• 2000s — applied AI fields (vision, NLP, etc.) adopted ML
• 2010s — deep learning
 2010–2012 — neural nets smashed previous records in speech-to-
text and object recognition
 increasing adoption by the tech industry
 2016 — AlphaGo defeated the human Go champion
 2018-now — generating photorealistic images and videos
 2020 — GPT-3 language model
• now — increasing attention to ethical and societal implications

Examples
• Computer vision: Object detection, semantic segmentation, pose
estimation, and almost every other task is done with ML.
• Speech: Speech to text, personal assistants, speaker identification...
• NLP: Machine translation, sentiment analysis, topic modeling, spam filtering.
• E-commerce & recommender systems: Amazon, Netflix, ...

ML Workflow
• ML workflow sketch:
1. Should I use ML on this problem?
 Is there a pattern to detect?
 Can I solve it analytically?
 Do I have data?
2. Gather and organize data.
 Preprocessing, cleaning, visualizing.
3. Establish a baseline.
4. Choose a model, loss, regularization, ...
5. Optimization (could be simple, could be a PhD...).
6. Hyperparameter search.
7. Analyze performance & mistakes, and iterate back to step 4 (or 2).

Implementing machine learning systems
• You will often need to derive an algorithm (with pencil and paper), and then
translate the math into code.
• Array processing (NumPy)
 vectorize computations (express them in terms of matrix/vector
operations) to exploit hardware efficiency
 This also makes your code cleaner and more readable!
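As a rough illustration of the vectorization point, the sketch below computes linear-model predictions first with explicit Python loops and then as a single matrix-vector product. The names `X`, `w`, `predict_loop` and `predict_vectorized` are placeholders chosen for this example, and the data is random.

```python
import numpy as np

def predict_loop(X, w):
    """Compute y[n] = sum_j X[n, j] * w[j] with explicit Python loops."""
    n_examples, n_features = X.shape
    y = np.zeros(n_examples)
    for n in range(n_examples):
        for j in range(n_features):
            y[n] += X[n, j] * w[j]
    return y

def predict_vectorized(X, w):
    """Same computation expressed as one matrix-vector product."""
    return X @ w

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))   # 1000 examples, 5 features (made-up data)
w = rng.normal(size=5)

# Both versions agree up to floating-point rounding.
assert np.allclose(predict_loop(X, w), predict_vectorized(X, w))
```

The vectorized version hands the inner loops to NumPy's compiled routines, which is typically where the speed-up comes from.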

Implementing machine learning systems
• In supervised learning we are given a training set consisting of inputs
and corresponding labels, e.g.

Input vectors
• What an image looks like to the computer:

Input vectors
• Machine learning algorithms need to handle lots of types of data:
images, text, audio waveforms, credit card transactions, etc.
• Common strategy: represent the input as an input vector in $\mathbb{R}^d$
 Representation = mapping to another space that’s easy to
manipulate
 Vectors are a great representation since we can do linear algebra!
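One simple way to turn an image into an input vector is to flatten its pixel grid. The sketch below assumes a tiny made-up 4x4 grayscale image; the variable names and the scaling to [0, 1] are choices made for this example only.

```python
import numpy as np

# A made-up 4x4 grayscale "image" with pixel intensities in 0..255.
image = np.arange(16, dtype=np.uint8).reshape(4, 4) * 16

# Flatten the 2-D pixel grid into a single input vector with 16 entries,
# scaling intensities to the range [0, 1].
x = image.astype(np.float64).ravel() / 255.0

print(x.shape)  # (16,)
```

Once the input is a vector, the usual linear-algebra operations (dot products, distances, matrix products) apply directly.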

Input vectors
• Can use raw pixels:
• Can do much better if you compute a vector of meaningful features.

Examples
• Mathematically, our training set consists of a collection of pairs of an input
vector $\mathbf{x} \in \mathbb{R}^d$ and its corresponding target, or label, t
 Regression: t is a real number (e.g. stock price)
 Classification: t is an element of a discrete set {1, . . . , C}
 These days, t is often a highly structured object (e.g. image)
• Denote the training set $\{(\mathbf{x}^{(1)}, t^{(1)}), \ldots, (\mathbf{x}^{(N)}, t^{(N)})\}$
 Note: these superscripts have nothing to do with exponentiation!

Examples
• What should I watch this Friday?
• Goal: Predict movie rating automatically!
• Goal: How many followers will I get?

Regression
• What do all these problems have in common?
 Continuous outputs, we’ll call these t (e.g., a rating: a real number between 0 and 10,
# of followers, house price)
 Predicting continuous outputs is called regression
• What do I need in order to predict these outputs?
 Features (inputs), we’ll call these x (or X if vectors)
 Training examples, many $x^{(i)}$ for which $t^{(i)}$ is known (e.g., many movies for
which we know the rating)
 A model, a function that represents the relationship between x and t
 A loss or a cost or an objective function, which tells us how well our model
approximates the training examples
 Optimization, a way of finding the parameters of our model that minimize
the loss function

Linear regression
• Linear regression
 continuous outputs
 simple model (linear)
• Introduce key concepts:
 loss functions
 generalization
 optimization
 model complexity
 regularization

Simple 1-D regression
• Circles are data points (i.e., training examples) that are given to us
• The data points are uniform in x, but may be displaced in y
with some noise
• In green is the “true” curve that we don’t know
• Goal: We want to fit a curve to these points

Simple 1-D regression
• Key Questions:
 How do we parametrize the model?
 What loss (objective) function should we use to judge the fit?
 How do we optimize fit to unseen test data (generalization)?

Noise
• Data is described as pairs $D = \{(x^{(1)}, t^{(1)}), \ldots, (x^{(N)}, t^{(N)})\}$
 $x \in \mathbb{R}$ is the input feature (per capita crime rate)
 $t \in \mathbb{R}$ is the target output (median house price)
 $(i)$ simply indexes the training examples (we have N in this case)
• Here t is continuous, so this is a regression problem
• Model outputs y, an estimate of t: $y(x) = w_0 + w_1 x$
• What type of model did we choose?
• Divide the dataset into training and testing examples
 Use the training examples to construct a hypothesis, or function approximator, that
maps x to predicted y
 Evaluate the hypothesis on the test set
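A minimal sketch of dividing a dataset into training and testing examples, assuming a simple random shuffle. The helper name `train_test_split`, the 25% test fraction and the synthetic data are assumptions made for this example, not something the slides prescribe.

```python
import numpy as np

def train_test_split(x, t, test_fraction=0.25, seed=0):
    """Shuffle the examples and hold out a fraction of them for testing."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(x))
    n_test = int(round(test_fraction * len(x)))
    test_idx, train_idx = order[:n_test], order[n_test:]
    return x[train_idx], t[train_idx], x[test_idx], t[test_idx]

# Made-up 1-D data: t is a noisy linear function of x.
rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 20)
t = 2.0 * x + 0.5 + 0.1 * rng.normal(size=x.shape)

x_train, t_train, x_test, t_test = train_test_split(x, t)
```

The hypothesis is then fit on `x_train, t_train` only, and `x_test, t_test` are used solely to evaluate it.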

Least-Squares Regression
• Define a model
 Linear: $y(x) = w_0 + w_1 x$
• Standard loss/cost/objective function measures the squared error between y and the true value t
 Linear model: $\ell(\mathbf{w}) = \sum_{n=1}^{N} \left[ t^{(n)} - (w_0 + w_1 x^{(n)}) \right]^2$
• For a particular hypothesis (y(x) defined by a choice of w, drawn in red), what does the loss
represent geometrically?
 The loss for the red hypothesis is the sum of the squared vertical errors (squared lengths of
green vertical lines)
• How do we obtain weights w = (w0, w1)? Find w that minimizes the loss $\ell(\mathbf{w})$
• For the linear model, what kind of a function is $\ell(\mathbf{w})$?

Optimizing the Objective
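• One straightforward method: gradient descent
 initialize w (e.g., randomly)
 repeatedly update w based on the gradient: $\mathbf{w} \leftarrow \mathbf{w} - \lambda \frac{\partial \ell}{\partial \mathbf{w}}$
• λ is the learning rate
• For a single training case, this gives the LMS update rule: $\mathbf{w} \leftarrow \mathbf{w} + 2\lambda \left( t^{(n)} - y(x^{(n)}) \right) \mathbf{x}^{(n)}$, where $\mathbf{x}^{(n)} = (1, x^{(n)})$
• Note: As error approaches zero, so does the update (w stops changing)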
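The gradient descent procedure can be sketched for the 1-D linear model y(x) = w0 + w1 x with a sum-of-squared-errors loss. The function names, the learning rate of 0.01, the step count and the noise-free toy data are assumptions made for this example. The weights are initialized to zero rather than randomly to keep the run reproducible.

```python
import numpy as np

def squared_error_loss(w, x, t):
    """Sum of squared errors for the model y(x) = w[0] + w[1] * x."""
    y = w[0] + w[1] * x
    return np.sum((t - y) ** 2)

def gradient_descent(x, t, learning_rate=0.01, n_steps=5000):
    """Minimize the squared-error loss with gradient descent over all examples."""
    w = np.zeros(2)                              # initial weights (zeros here)
    X = np.column_stack([np.ones_like(x), x])    # each row is (1, x^(n))
    for _ in range(n_steps):
        error = t - X @ w                        # t^(n) - y(x^(n)) for every n
        gradient = -2.0 * X.T @ error            # derivative of the loss w.r.t. w
        w = w - learning_rate * gradient         # step against the gradient
    return w

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
t = 1.0 + 2.0 * x        # noise-free targets generated with w0 = 1, w1 = 2
w = gradient_descent(x, t)
```

On this toy data the weights approach (1, 2). With a learning rate that is too large the updates diverge instead, so the value usually needs tuning.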

Optimizing Across Training Set
• Two ways to generalize this for all examples in training set:
1. Batch updates: sum or average updates across every example n, then change
the parameter values
2. Stochastic/online updates: update the parameters for each
training case in turn, according to its own gradients
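The two update schemes can be compared side by side. This is a sketch under assumed settings: the function names, the learning rates (0.005 for batch, 0.05 for stochastic), the number of passes and the noise-free data are all choices made for this example.

```python
import numpy as np

def batch_epoch(w, X, t, learning_rate):
    """One batch update: sum the per-example updates, then change w once."""
    error = t - X @ w
    return w + 2.0 * learning_rate * X.T @ error

def stochastic_epoch(w, X, t, learning_rate, rng):
    """One pass of online updates: change w after every training case."""
    for n in rng.permutation(len(t)):
        error_n = t[n] - X[n] @ w
        w = w + 2.0 * learning_rate * error_n * X[n]
    return w

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 50)
X = np.column_stack([np.ones_like(x), x])   # rows are (1, x^(n))
t = 1.0 + 2.0 * x                           # noise-free targets, w0 = 1, w1 = 2

w_batch = np.zeros(2)
w_sgd = np.zeros(2)
for _ in range(1000):
    w_batch = batch_epoch(w_batch, X, t, learning_rate=0.005)
    w_sgd = stochastic_epoch(w_sgd, X, t, learning_rate=0.05, rng=rng)
```

Both variants end up near (1, 2) here. The batch version makes one update per pass over the data, while the stochastic version makes 50 smaller ones.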

Linear Regression with Multi-dimensional Inputs
• One method of extending the model is to consider other input dimensions
• In the Boston housing example, we can look at the number of rooms

Multi-dimensional Inputs
• Imagine now we want to predict the median house price from these multi-dimensional
observations
• Each house is a data point n, with observations indexed by j: $\mathbf{x}^{(n)} = (x_1^{(n)}, \ldots, x_j^{(n)}, \ldots, x_d^{(n)})$
• We can incorporate the bias w0 into w, by using x0 = 1, then $y(\mathbf{x}) = w_0 + \sum_{j=1}^{d} w_j x_j = \mathbf{w}^T \mathbf{x}$
• We can use gradient descent to solve for each coefficient, or compute w
analytically (how does the solution change?)
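For the analytic route, one possible sketch appends a constant feature x0 = 1 and solves the least-squares problem with `np.linalg.lstsq`. The helper name, the three input dimensions and the synthetic weights are assumptions made for this example.

```python
import numpy as np

def fit_least_squares(X, t):
    """Return w = (w0, w1, ..., wd) minimizing the squared error of w^T x."""
    X_aug = np.column_stack([np.ones(len(X)), X])   # x0 = 1 absorbs the bias w0
    w, *_ = np.linalg.lstsq(X_aug, t, rcond=None)
    return w

# Made-up data: 3 input dimensions, targets generated from known weights.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w = np.array([0.5, 1.0, -2.0, 3.0])            # (w0, w1, w2, w3)
t = true_w[0] + X @ true_w[1:]

w = fit_least_squares(X, t)
```

Because the targets here are noise-free, `w` recovers `true_w` up to rounding. With noisy targets it would instead be the best fit in the squared-error sense.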

Fitting a Polynomial
• What if our linear model is not good? How can we create a more complicated model?
• We can create a more complicated model by defining input variables that are
combinations of components of x
• Example: an M-th order polynomial function of a one-dimensional feature x: $y(x, \mathbf{w}) = w_0 + \sum_{j=1}^{M} w_j x^j$
• where $x^j$ is the j-th power of x
• We can use gradient descent to solve for each coefficient, or compute w analytically
• How do we do that?
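One way to answer this is to treat the powers of x as new input features, after which the model is linear in w again and ordinary least squares applies. The sketch below assumes that approach; the helper names and the quadratic toy data are made up for this example.

```python
import numpy as np

def polynomial_features(x, M):
    """Map a 1-D array x to the matrix with columns x^0, x^1, ..., x^M."""
    return np.vander(x, M + 1, increasing=True)

def fit_polynomial(x, t, M):
    """Least-squares fit of an M-th order polynomial; returns (w0, ..., wM)."""
    Phi = polynomial_features(x, M)
    w, *_ = np.linalg.lstsq(Phi, t, rcond=None)
    return w

x = np.linspace(-1.0, 1.0, 20)
t = 1.0 - 2.0 * x + 3.0 * x**2      # noise-free quadratic targets
w = fit_polynomial(x, t, M=2)
```

The model is non-linear in x but still linear in the weights, which is why the same fitting machinery carries over unchanged.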

Generalization
• Generalization = model’s ability to predict the held out data
• What is happening?
• Our model with M = 9 overfits the data (it also models the noise)

Generalization
• Not a problem if we have lots of training examples

Generalization
• Let’s look at the estimated weights for various M in the case of fewer examples
• The weights are becoming huge to compensate for the noise
• One way of dealing with this is to encourage the weights to be small (this way no input
dimension will have too much influence on prediction). This is called regularization

Regularized Least Squares
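• Increasing the input features this way can complicate the model considerably
• Goal: select the appropriate model complexity automatically
• Standard approach: regularization: $\tilde{\ell}(\mathbf{w}) = \sum_{n=1}^{N} \left[ t^{(n)} - y(x^{(n)}) \right]^2 + \alpha \mathbf{w}^T \mathbf{w}$
• Intuition: Since we are minimizing the loss, the second term will encourage smaller values in w
• The penalty on the squared weights is known as ridge regression in statistics
• Leads to a modified update rule for gradient descent: $\mathbf{w} \leftarrow \mathbf{w} + 2\lambda \left[ \sum_{n=1}^{N} \left( t^{(n)} - y(x^{(n)}) \right) \mathbf{x}^{(n)} - \alpha \mathbf{w} \right]$
• Also has an analytical solution: $\mathbf{w} = (\mathbf{X}^T \mathbf{X} + \alpha \mathbf{I})^{-1} \mathbf{X}^T \mathbf{t}$ (verify!)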
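A small sketch of the analytical ridge solution, assuming the usual closed form w = (XᵀX + αI)⁻¹Xᵀt with the penalty applied to every weight, including the bias. The function name `fit_ridge`, the 9th-order polynomial features and the noisy sine data are assumptions made for this example.

```python
import numpy as np

def fit_ridge(X, t, alpha):
    """Closed-form ridge solution w = (X^T X + alpha I)^{-1} X^T t."""
    n_features = X.shape[1]
    A = X.T @ X + alpha * np.eye(n_features)
    return np.linalg.solve(A, X.T @ t)

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 10)
t = np.sin(2 * np.pi * x) + 0.2 * rng.normal(size=x.shape)   # noisy made-up targets
Phi = np.vander(x, 10, increasing=True)                      # 9th-order polynomial features

w_small_alpha = fit_ridge(Phi, t, alpha=1e-8)
w_large_alpha = fit_ridge(Phi, t, alpha=1e-2)

# A larger penalty shrinks the weight vector.
assert np.linalg.norm(w_large_alpha) < np.linalg.norm(w_small_alpha)
```

Solving the linear system with `np.linalg.solve` avoids forming the matrix inverse explicitly, which tends to be more stable numerically.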

Regularized Least Squares
• Better generalization
• Choose α carefully
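One common way to choose α is to try a few candidate values and keep the one with the lowest error on held-out validation data. The sketch below assumes that procedure; the candidate grid, the half-and-half split, the helper names and the synthetic data are all made up for this example.

```python
import numpy as np

def fit_ridge(Phi, t, alpha):
    """Closed-form ridge fit; alpha is the regularization strength."""
    A = Phi.T @ Phi + alpha * np.eye(Phi.shape[1])
    return np.linalg.solve(A, Phi.T @ t)

def mean_squared_error(Phi, t, w):
    """Average squared difference between targets and predictions."""
    return float(np.mean((t - Phi @ w) ** 2))

def select_alpha(Phi_train, t_train, Phi_val, t_val, alphas):
    """Return the alpha with the lowest validation error, plus all errors."""
    errors = [mean_squared_error(Phi_val, t_val, fit_ridge(Phi_train, t_train, a))
              for a in alphas]
    return alphas[int(np.argmin(errors))], errors

# Made-up noisy data, split into a training half and a validation half.
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=40)
t = np.sin(2 * np.pi * x) + 0.2 * rng.normal(size=x.shape)
Phi = np.vander(x, 10, increasing=True)          # 9th-order polynomial features
Phi_train, t_train = Phi[:20], t[:20]
Phi_val, t_val = Phi[20:], t[20:]

alphas = [1e-8, 1e-5, 1e-2, 1.0]
best_alpha, errors = select_alpha(Phi_train, t_train, Phi_val, t_val, alphas)
```

The validation examples play no part in fitting w, so their error gives a rough estimate of how each α would generalize.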

1-D regression illustrates key concepts
• Data fits: is a linear model best (model selection)?
 Simple models may not capture all the important variations (signal) in the data:
underfit
 More complex models may overfit the training data (fit not only the signal but also
the noise in the data), especially if not enough data to constrain model
• One method of assessing fit: test generalization = model’s ability to predict the held
out data
• Optimization is essential: stochastic and batch iterative approaches; analytic when
available

Examples of problems
What digit is this?
How can I predict this? What are my input features?

Classification
• What do all these problems have in common?
 Categorical outputs, called labels (e.g., yes/no, dog/cat/person/other)
• Assigning each input vector to one of a finite number of labels is called
classification
 Binary classification: two possible labels (e.g., yes/no, 0/1, cat/dog)
 Multi-class classification: multiple possible labels
• We will first look at binary problems, and discuss multi-class problems later

Today
• We are interested in mapping the input $\mathbf{x} \in \mathcal{X}$ to a label $t \in \mathcal{Y}$
• In regression typically $\mathcal{Y} = \mathbb{R}$
• Now $\mathcal{Y}$ is categorical

Classification as Regression
• Can we do this task using what we have learned in previous lectures?
• Simple hack: Ignore that the output is categorical!
• Suppose we have a binary problem, $t \in \{-1, 1\}$
• Assuming the standard model used for regression: $y(\mathbf{x}) = \mathbf{w}^T \mathbf{x}$
• How can we obtain w?
• Use least squares, $\mathbf{w} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{t}$. How is X computed? And t?
• Which loss are we minimizing? Does it make sense?
• How do I compute a label for a new example? Let’s see an example
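The "classification as regression" hack can be sketched as follows: fit least squares to targets in {−1, +1}, then threshold the real-valued output at zero to obtain a label. The function names and the six 1-D data points are assumptions made for this example.

```python
import numpy as np

def fit_least_squares_classifier(X, t):
    """Least-squares weights for targets t in {-1, +1}; a bias column is added."""
    X_aug = np.column_stack([np.ones(len(X)), X])
    w, *_ = np.linalg.lstsq(X_aug, t, rcond=None)
    return w

def predict_label(X, w):
    """Threshold the real-valued output at zero to obtain a label in {-1, +1}."""
    X_aug = np.column_stack([np.ones(len(X)), X])
    return np.where(X_aug @ w >= 0, 1, -1)

# Made-up 1-D data: class -1 on the left, class +1 on the right.
X = np.array([[-3.0], [-2.0], [-1.0], [1.0], [2.0], [3.0]])
t = np.array([-1.0, -1.0, -1.0, 1.0, 1.0, 1.0])

w = fit_least_squares_classifier(X, t)
labels = predict_label(X, w)
```

On this symmetric data the fit separates the two classes. The squared-error loss also penalizes points that lie far on the correct side of the boundary, which is one reason to question whether the loss makes sense for classification.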

Classification as Regression
• One dimensional example (input x is 1-dim)
• The colors indicate labels (a blue plus denotes a point from class −1, a red circle a point from class 1)

Decision Rules
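• Our classifier has the form $f(\mathbf{x}, \mathbf{w}) = w_0 + \mathbf{w}^T \mathbf{x}$
• A reasonable decision rule is: $y = 1$ if $f(\mathbf{x}, \mathbf{w}) \geq 0$, and $y = -1$ otherwise
• What does this function look like?
• How can I mathematically write this rule?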

Decision Rules
• How can I mathematically write this rule? $y(\mathbf{x}) = \mathrm{sign}(w_0 + \mathbf{w}^T \mathbf{x})$
• This specifies a linear classifier: it has a linear boundary (hyperplane) $w_0 + \mathbf{w}^T \mathbf{x} = 0$,
which separates the space into two “half-spaces”

Can we always separate the classes?

Shape of the Logistic Function

Decision Boundary for Logistic Regression

Logistic Regression vs Least Squares Regression

Nearest Neighbors
A K Nearest Neighbor classifier is a machine learning model that
makes predictions based on the majority class of the K nearest
data points in the feature space. The KNN algorithm assumes
that similar things exist in close proximity, making it intuitive and
easy to understand.
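A minimal K Nearest Neighbor classifier can be sketched in a few lines: measure the distance from the query to every training point, take the K closest, and return the majority label. The function name `knn_predict`, the use of Euclidean distance, K = 3 and the toy "lemon"/"orange" data are assumptions made for this example.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, t_train, x_query, k=3):
    """Predict the label of x_query by majority vote among its k nearest neighbors."""
    distances = np.linalg.norm(X_train - x_query, axis=1)   # Euclidean distances
    nearest = np.argsort(distances)[:k]                     # indices of the k closest points
    votes = Counter(t_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Made-up 2-D training set with two classes.
X_train = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.5],
                    [6.0, 6.0], [6.5, 7.0], [7.0, 6.5]])
t_train = np.array(["lemon", "lemon", "lemon", "orange", "orange", "orange"])

print(knn_predict(X_train, t_train, np.array([1.2, 1.4])))  # lemon
print(knn_predict(X_train, t_train, np.array([6.4, 6.6])))  # orange
```

There is no training step beyond storing the data. All the work happens at prediction time, so the cost of a query grows with the size of the training set.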

Nearest Neighbors
Euclidean distance is probably the most intuitive metric: it is the length of the straight line,
and so the shortest distance, between two points. It is calculated using the well-known
Pythagorean theorem. Conceptually, it should be used whenever we are comparing observations
with continuous features, like height, weight, or salaries. This distance measure is often the
“default” distance used in algorithms like KNN.

Nearest Neighbors distance
Manhattan distance estimates the distance from one data point to another when a grid-like
path is taken. Unlike Euclidean distance, it is the sum of the absolute differences of the
coordinates of the two points. This way, instead of measuring a straight line between two
points, we “walk” through the available paths. The Manhattan distance is useful when our
observations are distributed along a grid, like in chess or city blocks (when the features of
our observations are integers).

Nearest Neighbors distance
Minkowski distance is a versatile metric used in normed vector
spaces, named after the German mathematician Hermann Minkowski.
For an order $p \geq 1$ it is $d(\mathbf{a}, \mathbf{b}) = \left( \sum_i |a_i - b_i|^p \right)^{1/p}$,
which generalizes several well-known distance measures: $p = 1$ gives the
Manhattan distance and $p = 2$ gives the Euclidean distance.
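The three distance measures can be written with a single helper, since Manhattan and Euclidean distance are the Minkowski distance of order 1 and 2 respectively. The function names and the two example points are choices made for this sketch.

```python
import numpy as np

def minkowski_distance(a, b, p):
    """Minkowski distance of order p >= 1 between vectors a and b."""
    return np.sum(np.abs(a - b) ** p) ** (1.0 / p)

def manhattan_distance(a, b):
    """Sum of absolute coordinate differences (Minkowski with p = 1)."""
    return minkowski_distance(a, b, p=1)

def euclidean_distance(a, b):
    """Straight-line distance (Minkowski with p = 2)."""
    return minkowski_distance(a, b, p=2)

a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])

print(manhattan_distance(a, b))  # 7.0
print(euclidean_distance(a, b))  # 5.0
```

Going from (0, 0) to (3, 4) along grid lines takes 3 + 4 = 7 steps, while the straight line is the hypotenuse of a 3-4-5 right triangle.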

Classification: Oranges and Lemons

Nearest Neighbors: Decision Boundaries

k-Nearest Neighbors Remedies: Remove Redundancy

Example: Digit Classification

Fun Example: Where on Earth is this Photo From?

Decision Tree: Example with Discrete Inputs

Decision Tree: Classification and Regression

How do we Learn a Decision Tree?

Entropy of a Joint Distribution
