DATA MINING & PROBABILISTIC REASONING
Dr.-Ing. Gjergji Kasneci
gjergji.kasneci@hpi.uni-potsdam.de
HPI Potsdam, winter term 2013/14
Organization
 Timetable
 Lectures
 Tuesdays 13:30-15:00 in Room H-E.51
 Every second Thursday 11:00-12:30 in Room H-2.57
 Exercises
 Every second Thursday 11:00-12:30 in Room H-2.57
 Teaching assistant
 Maximilian Jenders (M.Sc.)
 Expertise: Recommendations, Web Mining, Opinion Mining
 Exam
 Condition for admission: Oral presentation of at least two solutions during the
tutorials
 Form of exam: oral exam at the end of the term
What is this lecture about?
 Data Mining
 Analyzing data
 Finding patterns/structure
 Detecting outliers
 Learning predictive models
 Discovering knowledge
 Probabilistic Reasoning
 Representing and quantifying
uncertainty in data
 Predicting likely outcomes of
random variables, i.e., the
occurrence of events
 Choosing the right model
Application areas
 Web mining (e.g., find documents for a given query or topic, group users
by interest, recommendations, spam detection, …)
 Medicine/Bioinformatics (e.g., analyze the effect of drugs, derive diagnoses
based on symptoms, analyze protein-protein interactions, discover
sequence similarities, detect mutations, …)
 Market analysis (e.g., market baskets, opinion mining, stock value
prediction, influence propagation, … )
 Physics (e.g., multivariate data analysis, modeling the motion of particles
as Brownian motion, event classification, noise detection, …)
 Video games (e.g., AI game characters, matching players in online gaming,
speech/shape recognition, …)
 …
A Big Data perspective
“[…] every two days we create as much information as we did from the dawn of
civilization up until 2003!” (Eric Schmidt)
[Figure: data sources such as sensors, HTML pages, clicks, links, emails, social
media, and databases feed into modern storage systems]
 Distributed databases
 Key-value stores
 Column stores
 Document databases
Large amounts of structured and unstructured data (often incomplete and ambiguous)
 Texts
 Lists, tables, graphs
 Images, audio, videos
→ Data analytics, Data Mining, Machine Learning, and Knowledge Discovery
Example: Part-of-speech tagging (1)
 Task: Find the correct grammatical tag for terms in natural language text
 Difficulties arise from ambiguous grammatical meanings
 Examples
word → possible tags
flies → verb / noun
heat → verb / noun
like → verb / prep
water → noun / verb
in → prep / adv
Example: Part-of-speech tagging (2)
From: http://smile-pos.appspot.com/
1. This/DT is/VBZ only/RB a/DT simple/JJ example/NN sentence/NN for/IN the/DT
sake/NN of/IN presentation/NN
2. They/PRP are/VBP hunting/VBG dogs/NNS
3. Fruit/NNP flies/VBZ like/IN a/DT banana/NN
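Such tags can be reproduced with an off-the-shelf tagger; the sketch below uses NLTK's default Penn Treebank tagger (an assumption: it is not the tagger behind the site above, so outputs may differ slightly).

```python
# Minimal sketch with NLTK's default Penn Treebank tagger; the tagged
# output may differ from the slide, since the underlying model differs.
import nltk
nltk.download("punkt", quiet=True)                        # tokenizer model
nltk.download("averaged_perceptron_tagger", quiet=True)   # tagger model

for sentence in ["They are hunting dogs", "Fruit flies like a banana"]:
    tokens = nltk.word_tokenize(sentence)
    print(nltk.pos_tag(tokens))
# e.g., [('Fruit', 'NNP'), ('flies', 'VBZ'), ('like', 'IN'), ('a', 'DT'), ('banana', 'NN')]
```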
Other important text analysis tasks
 Role labeling
 Entity recognition
 Entity disambiguation
 Relationship extraction
 Topic assignment (classification)
 Clustering
Example: Email classification
 Example classes
 Spam vs. non-spam
 Important vs. less important
 Work-related / social / family / ads /…
 Simple model
Assign email $\mathbf{m}$ to class
$C^* = \arg\max_C P(C \mid \mathbf{m}) = \arg\max_C P(C \mid x_1(\mathbf{m}), x_2(\mathbf{m}), \dots, x_k(\mathbf{m}))$
where $x_1(\mathbf{m}), \dots, x_k(\mathbf{m})$ are features of $\mathbf{m}$, e.g., the email domain, or indicators of whether certain words appear
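One way to instantiate this model (the slide leaves the form of $P(C \mid x_1(\mathbf{m}), \dots, x_k(\mathbf{m}))$ open, so the choice below is an assumption) is naive Bayes over binary word-presence features; the toy emails are made up.

```python
# Naive Bayes over binary word-presence features x_j(m); toy data is
# illustrative only. BernoulliNB picks argmax_C P(C) * prod_j P(x_j | C),
# which under the naive independence assumption equals argmax_C P(C | m).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB

emails = ["cheap pills buy now", "meeting agenda attached",
          "win money now", "lunch with the team"]
labels = ["spam", "work", "spam", "work"]

vec = CountVectorizer(binary=True)            # x_j(m): does word j appear?
clf = BernoulliNB().fit(vec.fit_transform(emails), labels)

print(clf.predict(vec.transform(["buy cheap pills now"])))  # -> ['spam']
```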
Example: Click prediction
Rank ads by $P(C = 1 \mid Q = q, A = a)$, the probability of a click $C$ on ad $a$ given that it is shown for query $q$
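A common way to estimate such a click probability, sketched here under assumptions since the slide does not fix a model, is logistic regression over features of the (query, ad) pair; the features and data below are hypothetical.

```python
# Hedged sketch: logistic regression estimating P(C = 1 | Q = q, A = a).
# Features and data are hypothetical stand-ins for real query-ad signals.
import numpy as np
from sklearn.linear_model import LogisticRegression

# rows: (query, ad) pairs; columns e.g. [text similarity, ad's past CTR, position]
X = np.array([[0.9, 0.12, 1.0],
              [0.2, 0.01, 0.5],
              [0.7, 0.08, 0.8],
              [0.1, 0.02, 0.3]])
y = np.array([1, 0, 1, 0])                    # 1 = the ad was clicked

model = LogisticRegression().fit(X, y)
p_click = model.predict_proba(X)[:, 1]        # estimates of P(C = 1 | q, a)
ranking = np.argsort(-p_click)                # rank ads by click probability
```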
Example: Image categorization
Source: http://image-net.org/
Example: Object recognition and vision support
From: Tafaj et al., ICANN’12, and http://www.cognitivesystems.org
Example: Shape and speech recognition
Source: http://www.computerweekly.com
Example: Clustering astrophysical objects
From: http://ssg.astro.washington.edu/research.shtml?research/galaxies
Example: Recommendation
[Figure: Amazon recommendations for users Alice and Bob with overlapping purchase
histories, an instance of collaborative filtering]
… see also the Netflix Challenge
Example: Movie recommendation
Matrix factorization:
$\begin{pmatrix} 1&0&1&0 \\ 0&2&2&2 \\ 0&0&0&1 \\ 1&2&3&2 \\ 1&0&1&1 \\ 0&2&2&3 \end{pmatrix} = \begin{pmatrix} 1&0&0 \\ 0&1&0 \\ 0&0&1 \\ 1&1&0 \\ 1&0&1 \\ 0&1&1 \end{pmatrix} \cdot \begin{pmatrix} 1&0&0 \\ 0&2&0 \\ 0&0&1 \end{pmatrix} \cdot \begin{pmatrix} 1&0&1&0 \\ 0&1&1&1 \\ 0&0&0&1 \end{pmatrix}$
Rows of the rating matrix on the left are Users 1–6, columns are movies M1–M4; the three factors map users to latent topics T1 (e.g., drama), T2 (e.g., crime), T3 (e.g., comedy), weight the topics, and map topics to movies.
M1: The Shawshank Redemption
M2: The Usual Suspects
M3: The Godfather
M4: The Big Lebowski
Example from: Machine Learning by P. Flach
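The factorization can be verified mechanically; the NumPy sketch below just reproduces the matrices above and checks the product.

```python
# Verify the factorization above: ratings = U * D * M, where U maps
# users to latent topics, D weights the topics, and M maps topics to movies.
import numpy as np

ratings = np.array([[1, 0, 1, 0], [0, 2, 2, 2], [0, 0, 0, 1],
                    [1, 2, 3, 2], [1, 0, 1, 1], [0, 2, 2, 3]])
U = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1],
              [1, 1, 0], [1, 0, 1], [0, 1, 1]])           # users x topics
D = np.diag([1, 2, 1])                                    # topic weights
M = np.array([[1, 0, 1, 0], [0, 1, 1, 1], [0, 0, 0, 1]])  # topics x movies

assert (U @ D @ M == ratings).all()   # decomposition reproduces every rating
```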
Example: Learning from crowds
Challenges:
1. Obtain as few labels as possible from the crowd
2. Identify experts and give their answers higher weight
3. Derive a (globally) optimal labelling
Example questions posed to the Social Web, with individual crowd answers:
 “Has President Obama won the Grammy Award?” → true, true, false, false, false
 “Was President Obama born in Chicago?” → false, false, false, true, true
[Figure: a classification system maps objects $o_1, \dots, o_n$ to labels $C(o_1), \dots, C(o_n)$; in the active learning scenario, it repeatedly picks single objects $o_i, o_j, \dots$ and requests their labels $C(o_i), C(o_j), \dots$ from the Social Web]
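Challenge 2 can be made concrete with a small sketch: aggregate the crowd answers above by weighted voting. The worker weights here are illustrative assumptions, not the lecture's method; in practice they might be estimated from each worker's accuracy on questions with known answers.

```python
# Weighted voting over crowd answers; weights are illustrative assumptions.
def weighted_vote(answers, weights):
    """answers/weights: dicts keyed by worker id; returns the label
    backed by the larger total weight."""
    score = sum(weights[u] * (1 if a else -1) for u, a in answers.items())
    return score > 0

# “Has President Obama won the Grammy Award?” -- answers from the slide
answers = {"w1": True, "w2": True, "w3": False, "w4": False, "w5": False}

uniform = {u: 1.0 for u in answers}                       # plain majority vote
experts = {"w1": 3.0, "w2": 3.0, "w3": 1.0, "w4": 1.0, "w5": 1.0}

print(weighted_vote(answers, uniform))   # False (the majority is wrong here)
print(weighted_vote(answers, experts))   # True (upweighted experts prevail)
```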
Example: Community detection in social networks
Source: S. Fortunato, Physics Reports 2010
Example: Knowledge discovery
[Figure: knowledge graph relating the drugs Boceprevir, Telaprevir, Darunavir, and
Lopinavir, the diseases Hepatitis C (HCV) and HIV, and the classes Drug, Protease
Inhibitor, Carbamate, Molecule, and Entity]
 Find common interference patterns among protease inhibitors
 Find interesting interaction subgraphs between two or more elements
Important terms (1)
 Predictive model / hypothesis: Formalization of relationships between
input and output variables with the goal of prediction
Examples
 $w_i = a + b \cdot h_i + \epsilon$, e.g., weight is linearly dependent on height
 $y \sim N(x, \sigma^2)$, i.e., $y$ is normally distributed with mean $x$ and variance $\sigma^2$
 $P(l_1, \dots, l_n, x_1, \dots, x_n) = P(l_1)\, P(x_1 \mid l_1) \prod_{i \geq 2} P(l_i \mid l_{i-1})\, P(x_i \mid l_i)$, e.g., a hidden Markov model over $n$ consecutive words $x_1, \dots, x_n$ and their grammatical labels $l_1, \dots, l_n$
 Parameterized statistical model: Set of parameters and corresponding
distributions that govern the data of interest
 Learning: Improvement on a task (measured by a target function) with
growing experience
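To make the third example concrete, here is a hedged sketch that evaluates the joint probability for a toy tagging problem; all probability tables are invented for illustration.

```python
# Evaluate P(l_1..l_n, x_1..x_n) = P(l_1) P(x_1|l_1) prod_{i>=2} P(l_i|l_{i-1}) P(x_i|l_i)
# for a toy two-word sequence; the probability tables are made up.
def hmm_joint(labels, words, p_l1, p_trans, p_emit):
    p = p_l1[labels[0]] * p_emit[labels[0]][words[0]]
    for i in range(1, len(words)):
        p *= p_trans[labels[i - 1]][labels[i]] * p_emit[labels[i]][words[i]]
    return p

p_l1 = {"NOUN": 0.6, "VERB": 0.4}
p_trans = {"NOUN": {"NOUN": 0.3, "VERB": 0.7}, "VERB": {"NOUN": 0.8, "VERB": 0.2}}
p_emit = {"NOUN": {"flies": 0.4, "heat": 0.6}, "VERB": {"flies": 0.5, "heat": 0.5}}

print(hmm_joint(["NOUN", "VERB"], ["heat", "flies"], p_l1, p_trans, p_emit))
# 0.6 * 0.6 * 0.7 * 0.5 = 0.126
```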
Important terms (2)
 Training: Sequence of observations from which experience can be gained
 Target function: Formal definition for the goal that has to be achieved
Possible goals
 Identify the “best next” item to label in active learning
 Maximize the joint probability of two or more observations (given some
parameters)
 Predict the “best next” move in a chess game
Often, only an approximation of the “ideal” target function is considered
Example of a target function
 Task: Predict the number of retweets $V(\mathbf{t}_i)$ for a tweet $\mathbf{t}_i$
$V(\mathbf{t}_i) \approx \hat{V}(\mathbf{t}_i) = w_0 + w_1 t_{i1} + w_2 t_{i2} + \dots + w_k t_{ik} = \mathbf{w}^T \mathbf{t}_i$
where $\mathbf{t}_i = (t_{i1}, t_{i2}, \dots, t_{ik})^T$ is the tweet's feature vector, with features such as the number of possible readers, the number of hashtags, and the number of URLs
 Choosing an approximation algorithm
 Learn a function $\hat{V}$ that predicts the retweet count $R_i$ based on $\mathbf{t}_i$ from training examples of the form $(\mathbf{t}_1 = (37, 0, \dots, 1)^T, R_1 = 0), \dots, (\mathbf{t}_n = (23879, 3, \dots, 0)^T, R_n = 214)$
 $\hat{V}$ should minimize the training error $\frac{1}{2} \sum_{i=1}^{n} \big( R_i - \hat{V}(\mathbf{t}_i) \big)^2$
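Minimizing this training error is an ordinary least-squares problem; below is a small NumPy sketch, where all feature values except those shown on the slide are made up.

```python
# Hedged sketch: fit w by ordinary least squares, which minimizes the
# training error (1/2) sum_i (R_i - V_hat(t_i))^2. Feature vectors are
# illustrative stand-ins: [possible readers, hashtags, URLs].
import numpy as np

T = np.array([[37.0, 0, 1],
              [512.0, 1, 0],        # made-up middle example
              [23879.0, 3, 0]])
R = np.array([0.0, 4.0, 214.0])     # observed retweet counts (4.0 made up)

T1 = np.hstack([np.ones((len(T), 1)), T])   # prepend a 1 for the bias w_0
w, *_ = np.linalg.lstsq(T1, R, rcond=None)  # minimizes ||T1 w - R||^2

train_err = 0.5 * np.sum((R - T1 @ w) ** 2)
print(w, train_err)   # toy-sized: with so few examples the fit is (near-)exact
```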
Inductive learning hypothesis and Occam’s razor
 Suppose a learning algorithm performs well on the training examples
 How do we know that it will perform well on other unobserved examples?
 Lacking any further information, we assume the following hypothesis
holds
Any algorithm approximating the target function well over a
sufficiently large set of training examples will also approximate it
well over unseen examples (Inductive Learning Hypothesis).
 But there may be many different algorithms that approximate the target
function similarly well … Which one should be chosen?
Other things being equal, prefer the simplest hypothesis (Occam’s
Razor)
Interesting questions related to learning algorithms
 How to (formally) represent training examples?
 How many examples are sufficient?
 What algorithms can be used for a given target function?
 How complex is a given learning algorithm?
 How can a learning algorithm quickly adapt to new observations?
Learning with labeled data
 Which algorithm works best for Confusion Set Disambiguation (Banko &
Brill ACL’01)?
 Problem: Choose the correct use
of a word, given a set of words
with which it is commonly confused
 Examples: {principle, principal},
{then, than}, {to, two, too}, and
{weather, whether}
 Often, what matters is data!
Inductive bias is fine, there’s no free lunch!
 Inductive bias of a learning algorithm: Set of assumptions that allow the
algorithm to predict well on unseen examples
Examples of inductive bias
 (Conditional) independence assumption
 Item belongs to same class as its neighbors
 Select features that are highly correlated with the class (but uncorrelated with
each other)
 Choose the model that worked best on test data according to some measure
 No Free Lunch Theorem (D. H. Wolpert & W. G. Macready 1997)
For any learning algorithm, any elevated performance over one class
of problems is offset by the performance over another class
Areas of learning theory
 Supervised Learning
 Classification problems
 Input: feature vector
 Output: one of a finite number of discrete categories
 Unsupervised Learning
 Clustering, dimensionality reduction, density estimation
 Input: feature vectors
 Output: similar groups of vectors, reduced vectors, or distribution of data from
the input space
 Regression
 Like classification but output is continuous
 Reinforcement Learning
 Find suitable actions to maximize reward
 Trade-off between exploration (trying out new actions) and exploitation
(choosing the action with maximal reward); see the sketch below
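As a concrete illustration of the exploration/exploitation trade-off, here is a hedged sketch of epsilon-greedy action selection on a made-up multi-armed bandit; the reward probabilities and epsilon value are assumptions for illustration.

```python
# Epsilon-greedy on a toy 3-armed bandit: with probability epsilon explore
# a random action, otherwise exploit the action with the best estimate.
import random

true_means = [0.2, 0.5, 0.8]           # unknown to the learner
estimates, counts = [0.0] * 3, [0] * 3
epsilon = 0.1                          # probability of exploring

random.seed(0)
for _ in range(1000):
    if random.random() < epsilon:
        a = random.randrange(3)                          # explore
    else:
        a = max(range(3), key=lambda i: estimates[i])    # exploit
    reward = 1 if random.random() < true_means[a] else 0
    counts[a] += 1
    estimates[a] += (reward - estimates[a]) / counts[a]  # running mean

print(estimates)   # estimates[2] should approach 0.8
```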
Topics of this lecture
 Basics from probability theory, statistics, information theory
 Evaluation measures
 Hierarchical classifiers
 Linear classifiers
 Artificial neural networks
 Regression
 Clustering and topic models
 Graphical models (directed vs. undirected models)
 Factor graphs and inference
 Reinforcement learning
Related literature
 Literature
 I. H. Witten, E. Frank, M. A. Hall: Data Mining - Practical Machine Learning
Tools and Techniques (Chapters 1 – 6)
 C. Bishop: Pattern Recognition and Machine Learning (Chapters 1 – 4, 8, 9)
 T. M. Mitchell: Machine Learning (Chapters 3 – 6, 8, 10)
 P. Flach: Machine Learning – The Art and Science of Algorithms that make
Sense of Data (Chapters 1 – 3, 5 – 11)
 D. J. C. MacKay: Information Theory, Inference and Learning
Algorithms (Chapters 1 – 6)
 Important conferences
 KDD, WSDM, ICDM, WWW, CIKM, ICML, ECML, ACL, EMNLP, NIPS, …
 Tools
 The Weka Toolkit (http://www.cs.waikato.ac.nz/ml/weka/)
 The R Project for Statistical Computing (http://www.r-project.org/)