Machine Learning, LIX004M5
Overview and Introduction

Jörg Tiedemann
tiedeman@let.rug.nl
Informatiekunde
Rijksuniversiteit Groningen

Machine Learning, LIX004M5 – p.1/50

General Info

instructor: Jörg Tiedemann (j.tiedemann@rug.nl)
    Harmoniegebouw, room 1311-429
prerequisites: open to students in Computer Science, Artificial Intelligence and Information Science
    2nd year student or higher
    background: programming ability, elementary statistics
schedule: September 5 - October 21
  • lectures: Mondays 13-15
  • labs: Fridays 9-11 (4 times only!)

General Info (cont’d)

You need an account on ’hagen’ for the labs! (But you may also work from home or somewhere else.)
Go to A. Da Costa (Harmoniegebouw, room 336, building section 13.13, phone 363 5801; open daily 10:30-12:00 and 14:00-15:30, closed on Friday afternoons)!

General Info (cont’d)

Website: http://www.let.rug.nl/~tiedeman/ml06
Examination: lab assignments and exercises (50%), written exam (50%)
Exam: Friday, October 27, 9-12
Literature: Tom Mitchell, Machine Learning, New York: McGraw-Hill, 1997; additional on-line literature (links available from the course website)

General Info (cont’d)

Purpose of this course
  • Introduction (!) to machine learning techniques (How much do you know already?)
  • Discussion of several machine learning approaches
  • Examples and applications in various fields
  • Practical assignments
      • using Weka - a machine learning package implemented in Java
      • a little bit of programming/scripting
      • some theoretical questions

General Info (cont’d)

  • Examination:
      • obligatory lab assignments (50%)
      • written exam (50%)
  • A minimum of 6 for both parts is required!
  • Exam is open book

Preliminary Program

 1. Organization, Introduction (Ch. 1, Ch. 5)
 2. Inductive Learning (Ch. 2), Decision Trees (Ch. 3)
      • Lab 1 - Decision Trees
 3. Instance-Based Learning (Ch. 8)
      • Lab 2 - Instance-based learning
 4. Bayesian Learning (Ch. 6)
      • Lab 3 - Learner comparison/combination
 5. Sequential Data & Markov Models (M&S Ch. 9, Bilmes)
      • Lab 4 - Markov models
 6. Maximum Entropy Models, Combining Learners
 7. Genetic Algorithms (Ch. 9), Reinforcement Learning (Ch. 13)

General comments

  • Read the book! (and other literature if necessary)
  • Ask questions! (and I’ll try to answer)
  • Tell me if you think that something’s wrong
  • Keep the deadlines! (1 week late → half the points, later → no points)

What is Machine learning?

Machine Learning is
  • the study of algorithms that
      • improve their performance
      • at some task
      • with experience
... just like a human being ... (?)

What is all the hype about ML?

"Every time I fire a linguist the performance of the recognizer goes up"
(probably) said by Fred Jelinek (IBM speech group) in the 80s, quoted by, e.g., Jurafsky and Martin, Speech and Language Processing.

Why machine learning?

data mining: pattern recognition, knowledge discovery, use historical data to improve future decisions, prediction (classification, regression), data description (clustering, summarization, visualization)
complex applications: we cannot program by hand, (efficient) processing of complex signals
self-customizing programs: automatic adjustments according to usage, dynamic systems

Typical Data Mining Task

Given:
  • 9714 patient records, each describing a pregnancy and birth
  • Each patient record contains 215 features
Learn to predict:
  • Classes of future patients at high risk for Emergency Cesarean Section

Pattern Recognition

Object detection

Complex applications

Operating robots: ALVINN [Pomerleau] drives 70 mph on highways

Classification

Personal home page? Company website? Educational site?

Automatic customization

Machine learning is growing

many more applications:
  • speech recognition
  • robot control
  • spam filtering, data sorting
  • machine translation
  • financial data analysis and market predictions
  • handwriting recognition
  • data clustering and visualization
  • pattern recognition in genetics (e.g. DNA sequences)

Questions to ask

Learning = improve with experience at some task
  • What experience?
  • What exactly should be learned?
  • How shall it be represented?
  • What specific algorithm to learn it?
Goal: handle unseen data correctly according to the task (use your knowledge inferred from experience!)

What experience?

  • What do we know already about the task and possible solutions? (prior knowledge)
  • What kind of data do we have available? (training examples)
    What are the discriminative features? How are they connected with each other (dependencies)?
  • Is a “teacher” available (→ supervised learning) or not (→ unsupervised learning)?
    How expensive is labeling?
  • How much data do we need and how clean does it have to be?

What exactly should be learned?

Outcome of the target function
  • boolean (→ concept learning)
  • discrete values (→ classification)
  • real values (→ regression)
many machine learning tasks are classification tasks ...

How shall it be represented?

Model selection
  • symbolic representation (e.g. rules)
  • subsymbolic representation (neural networks, SVMs)
Do we want to restrict the space of possible solutions? (→ restriction bias ... we come back to this)

What algorithm to learn it?

Learning means approximating the real (unknown) target function according to our experience (e.g. observed training examples)
→ Learning = search for a “good” hypothesis/model
Do we want to prefer certain models? (→ preference bias ... later more)

Learning Models

Learning means approximating the real (unknown) target function according to our experience (e.g. observed training examples)
→ Learning = search for a “good” hypothesis/model
Which one is better?

What algorithm to learn it?

  • supervised learning (classified data available)
  • unsupervised learning (e.g. clustering)
  • inductive learning (from training data)
  • deductive learning (data + domain theory)
  • gradient descent, Bayesian learning, reinforcement learning ...

The roots of ML

Artificial intelligence: use prior knowledge and training data to guide learning as a search problem
Bayesian methods: probabilistic classifiers, probabilistic reasoning
Computational complexity theory: trade-off between model (learning) complexity and performance
Control theory: control optimisation processes
Information theory: entropy, information content, code optimisation and the minimum description length principle
Philosophy: Occam’s razor (simple is best)
Psychology and neurobiology: response improvement with practice, ideas that lead to artificial neural networks
Statistics: data description, estimation of probability distributions, evaluation, confidence

A walk-through example

from Duda et al: Pattern Classification
  • Task: automatically sort incoming fish on a conveyor belt into “sea bass” or “salmon”
  • Experience: sample images
We want a machine to learn this task. The machine needs some “experience”.

Procedure:
preprocessing: isolate the fish from one another and from the background of the images
feature selection: determine discriminative features to be extracted from the images (e.g. length, lightness, width, position of mouth, etc); feature selection = a kind of data reduction (focus on relevant information)
feature extraction: extract the selected features from the images and pass them to a classifier

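The three stages above can be sketched as a small pipeline. This is only an illustration: the “image” is a plain dictionary, the helper functions are invented stand-ins for real segmentation and measurement, and Python is used here rather than the Weka/Java of the labs — only the flow of data between the stages follows the slides.

```python
# Invented sketch of: preprocessing -> feature extraction -> classification.
# Real preprocessing would segment actual images; here an "image" is a dict.

def preprocess(image):
    # stand-in for isolating one fish from the image background
    return image["fish_region"]

def extract_features(region):
    # pass on only the selected, discriminative features
    return (region["lightness"], region["width"])

def classify(features, threshold=5.0):
    # toy decision rule on a single feature: dark fish -> salmon
    lightness, _ = features
    return "salmon" if lightness < threshold else "sea bass"

image = {"fish_region": {"lightness": 3.2, "width": 1.1}}
print(classify(extract_features(preprocess(image))))  # salmon
```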




A walk-through example

Select length for discrimination:
Lightness is a better feature:

A walk-through example

  • devise a decision rule or move the decision boundary to minimize some classification cost (→ decision theory)
  • a single feature might not be enough to minimize costs
→ feature vector, e.g.:

      X = (x1, x2) = (width, lightness)

Feature selection:
  • distinguishing (similar for objects in the same category and very different for different categories)
  • invariant (feature value doesn’t change when changing the context)
  • insensitive to noise
  • simple to extract

We still need to:
  • select an appropriate type of model for classification (e.g. function class to define separation boundaries)
  • select the model that generalizes the best (to be able to classify even unseen objects correctly)
  • consider computational complexity (trade-off between complexity and performance; scalability)

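Once the features are collected in a vector X = (width, lightness), even a very simple rule defines a boundary between the classes. A minimal sketch with invented measurement values — the nearest-class-mean rule used here is just one easy way to place a linear decision boundary, not the method of the slides:

```python
import math

# Toy training data: (width, lightness) pairs per class, values invented.
salmon   = [(2.0, 2.5), (2.2, 3.0), (1.8, 2.8)]
sea_bass = [(3.5, 6.0), (3.8, 5.5), (3.2, 6.5)]

def mean(points):
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(2))

means = {"salmon": mean(salmon), "sea bass": mean(sea_bass)}

def classify(x):
    # nearest-mean rule: the implied decision boundary is the line
    # halfway between the two class means
    return min(means, key=lambda c: math.dist(x, means[c]))

print(classify((2.1, 2.6)))  # salmon
print(classify((3.6, 6.2)))  # sea bass
```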
A walk-through example

Linear decision boundary
Overly complex decision boundary (What is the problem?)
a good trade-off between performance on the training set and model simplicity

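A small illustration of the problem behind the overly complex boundary, with invented data: a model complex enough to memorize the training set reaches zero training error, yet fails on unseen examples, while a simple threshold generalizes.

```python
# Invented (width, lightness) examples for the two fish classes.
train = [((1.0, 2.0), "salmon"), ((1.2, 2.1), "salmon"),
         ((3.0, 6.0), "sea bass"), ((3.1, 5.8), "sea bass")]
test  = [((1.1, 2.2), "salmon"), ((2.9, 6.1), "sea bass")]

# "overly complex" model: a pure lookup table over the training points
table = dict(train)
def memorize(x):
    return table.get(x, "salmon")   # arbitrary guess for unseen points

# simple model: one threshold on lightness (the second feature)
def simple(x):
    return "salmon" if x[1] < 4.0 else "sea bass"

def error_rate(model, data):
    return sum(model(x) != y for x, y in data) / len(data)

print(error_rate(memorize, train))  # 0.0  (perfect on training data)
print(error_rate(memorize, test))   # 0.5  (fails on unseen points)
print(error_rate(simple, test))     # 0.0
```

This is exactly why the next slides insist on never evaluating on training data: the memorizing model looks flawless there.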




The Design Cycle

Evaluation

We have ...
  • different feature sets
  • different models
  • different learning strategies
→ We need to evaluate!

Evaluation of classifiers based on
  • accuracy or error rate (percentage of classification errors)
  • risk (cost estimation for classification decisions)
Never ever evaluate on training data!
... Why not?

Evaluation

Distinguish:
sample error: error rate observed when classifying sample data (test data)
true error: probability of misclassifying a randomly selected object

How good is an estimate of the true error by means of sample errors?
  • confidence intervals
  • larger sample → greater confidence
How good is one model compared to another?
  • calculate sample errors
  • compute statistical significance (e.g. paired t-test)

Typical strategy in supervised learning:
split data into disjoint training data and test data
Problems:
  • we could be (too) lucky (sample error on test data is better than with other data splits)
  • test data set is too small to be confident
  • training data is rare and expensive (we don’t want to waste too much when separating test data)

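The first problem — a lucky (or unlucky) split — can be seen directly by repeating the split. A sketch with invented data and a fixed classifier: the same classifier, hence the same true error, but the sample error on a small test set moves around from split to split.

```python
import random

random.seed(0)

# 40 invented examples: feature in [0, 1), label noisy around a 0.5
# threshold (~15% of labels flipped, so no classifier can be perfect)
data = []
for _ in range(40):
    x = random.random()
    true = "pos" if x > 0.5 else "neg"
    noisy = true if random.random() > 0.15 else ("neg" if true == "pos" else "pos")
    data.append((x, noisy))

def classify(x):
    # a fixed classifier; we only measure its sample error
    return "pos" if x > 0.5 else "neg"

def sample_error(test_set):
    return sum(classify(x) != y for x, y in test_set) / len(test_set)

errors = []
for _ in range(5):
    random.shuffle(data)
    errors.append(sample_error(data[:10]))  # small test set -> noisy estimate

print(sorted(errors))  # five estimates of the same true error, spread out
```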
Cross validation
  •   split D into k similar sized sets (e.g. k=10)
  •   use k − 1 sets for training and 1 for evaluation
  •   use each set once for evaluation and calculate the average of the errors
→ improve error estimates (higher confidence)
→ all data is tested
→ better use of (limited) training data
Note: we still don’t evaluate on training data!
special case: leave one out cross validation - use each training example once for testing and the others for training
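The k-fold procedure above can be sketched in a few lines of Python (a minimal sketch, not from the course materials; `train_fn`, `error_fn`, and the toy majority-class learner are hypothetical stand-ins for a real learner such as those in Weka):

```python
def k_fold_cv(data, k, train_fn, error_fn):
    """Split data into k similar sized folds, use each fold once for
    evaluation and the remaining k - 1 folds for training, and return
    the average of the k sample errors."""
    folds = [data[i::k] for i in range(k)]  # round-robin split
    errors = []
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        model = train_fn(train)  # never trained on its own test fold
        errors.append(error_fn(model, test))
    return sum(errors) / k

# toy data: (feature, label) pairs; the "learner" just memorises the
# majority label of its training set
data = [(x, int(x > 0.5)) for x in [0.1, 0.2, 0.3, 0.4, 0.1, 0.2, 0.8, 0.9]]

def train_fn(train):
    labels = [y for _, y in train]
    return max(set(labels), key=labels.count)  # majority label

def error_fn(model, test):
    return sum(y != model for _, y in test) / len(test)

avg_error = k_fold_cv(data, 4, train_fn, error_fn)
```

Leave one out cross validation is simply the special case `k = len(data)`.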




Conclusion
  •   we seem to be overwhelmed by the number, complexity and magnitude of sub-problems
  •   many of them can be solved (to a certain degree at least)
  •   many fascinating problems still remain
Enjoy working with learning systems!

What’s next?
This week: Read ch. 1 & ch. 5 of Mitchell and look at the exercises
    No lab on Friday!
Next week: Inductive learning, Mitchell ch. 2 & Decision trees, ch. 3
    First lab about Decision Trees





