Supervised and Semi-Supervised Learning for NLP
John Blitzer
Natural Language Computing Group (自然语言计算组)
http://research.microsoft.com/asia/group/nlc/
Why should I know about machine learning?
This is an NLP summer school. Why should I care about machine learning?
ACL 2008: 50 of 96 full papers mention learning or statistics in their titles.
4 of 4 outstanding papers propose new learning or statistical inference methods.
Example 1: Review Classification
Input: product review
    Running with Scissors: A Memoir
    Title: Horrible book, horrible.
    "This book was horrible. I read half of it, suffering from a headache the entire time, and eventually i lit it on fire. One less copy in the world...don't waste your money. I wish i had the time spent reading this book back so i could use it for better purposes. This book wasted my life"
Output: label (Positive / Negative)
From the MSRA Machine Learning Group (机器学习组): http://research.microsoft.com/research/china/DCCUE/ml.aspx
Example 2: Relevance Ranking
[Figure: an un-ranked list of documents is mapped to a ranked list.]
Example 3: Machine Translation
Input (English sentence): The national track & field championships concluded
Output (Chinese sentence): 全国田径冠军赛结束
Course Outline
1) Supervised learning [2.5 hrs]
2) Semi-supervised learning [3 hrs]
3) Learning bounds for domain adaptation [30 mins]
Supervised Learning Outline
1) Notation and definitions [5 mins]
2) Generative models [25 mins]
3) Discriminative models [55 mins]
4) Machine learning examples [15 mins]
Training and Testing Data
Training data: labeled pairs $(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)$.
Use the training data to learn a function $f: x \mapsto y$.
Use this function to label unlabeled testing data $x_{N+1}, x_{N+2}, \ldots$, whose labels are unknown.
Feature Representations of $x$
[Figure: each review is represented as a sparse vector of word counts, with features such as "waste", "horrible", and "read_half" active in the negative review, and "loved_it" and "excellent" active in the positive one.]
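To make the representation concrete, here is a minimal bag-of-words sketch in Python. The vocabulary, tokenizer, and example sentence are illustrative assumptions, not the slides' exact counts.

```python
from collections import Counter

def bag_of_words(text, vocab):
    """Map a document to a vector of word counts over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [counts[w] for w in vocab]

# Hypothetical vocabulary for the running sentiment example.
vocab = ["waste", "horrible", "read_half", "loved_it", "excellent"]
negative_review = "this book was horrible horrible what a waste of money"
print(bag_of_words(negative_review, vocab))  # -> [1, 2, 0, 0, 0]
```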
Generative Model
A generative model scores a document-label pair by its joint probability. Naïve Bayes factors this as $p(y, x) = p(y) \prod_j p(x_j \mid y)$.
Graphical Model Representation
Encodes a multivariate probability distribution.
Nodes indicate random variables.
Edges indicate conditional dependency.
Graphical Model Inference
[Figure: a label node $y$ connected to word nodes "waste", "read_half", and "horrible"; the joint probability multiplies $p(y = -1)$, $p(\text{horrible} \mid -1)$, $p(\text{read\_half} \mid -1)$, and so on.]
Inference at Test Time
Given an unlabeled instance, how can we find its label? Just choose the most probable label: $\hat{y} = \arg\max_y p(y) \prod_j p(x_j \mid y)$.
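A minimal sketch of this argmax inference with made-up parameters (the probabilities below are illustrative assumptions, not estimates from any corpus):

```python
import math

# Hypothetical Naive Bayes parameters for the sentiment example.
priors = {"+1": 0.5, "-1": 0.5}
cond = {
    "+1": {"horrible": 0.01, "waste": 0.01, "excellent": 0.20, "loved_it": 0.10},
    "-1": {"horrible": 0.15, "waste": 0.10, "excellent": 0.01, "loved_it": 0.01},
}

def nb_predict(words):
    """Return the label maximizing log p(y) + sum_j log p(x_j | y)."""
    def score(y):
        return math.log(priors[y]) + sum(math.log(cond[y].get(w, 1e-6)) for w in words)
    return max(priors, key=score)

print(nb_predict(["horrible", "waste"]))  # -> "-1"
```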
Estimating Parameters from Training Data
Back to the labeled training data: $(x_1, y_1), \ldots, (x_N, y_N)$.
Multiclass Classification
Example: query classification into Travel, Technology, News, Entertainment, ...
Input query: "自然语言处理" ("natural language processing")
Training and testing are the same as in the binary case.
Maximum Likelihood Estimation
Why set the parameters to (normalized) counts?
MLE: Label Marginals
Maximizing the likelihood of the training data sets the label marginals to the empirical label frequencies.
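For reference, the standard closed-form MLE solutions for this model (a reconstruction; the slide's derivation is not preserved):

$$\hat{p}(y) = \frac{\sum_{i=1}^{N} \mathbf{1}[y_i = y]}{N}, \qquad \hat{p}(x_j \mid y) = \frac{\mathrm{count}(x_j, y)}{\sum_{j'} \mathrm{count}(x_{j'}, y)}$$

Each parameter is a normalized count over the training data, which answers the question above: counting is not a heuristic, it is the likelihood maximizer.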
Problems with Naïve Bayes
Example: predicting broken traffic lights.
When the lights are broken, both lights are always red.
When the lights are working, one light is red and one is green.
Problems with Naïve Bayes (2)
Now suppose both lights are red. What will our model predict?
We got the wrong answer. Is there a better model?
The MLE generative model is not the best model!
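To see the failure concretely, here is a worked version under an assumed prior of $p(\text{broken}) = 0.1$ (the slides' exact numbers are not preserved). Working lights never show two reds, so "both red" implies broken. But Naïve Bayes treats the two lights as independent given the label, with $p(\text{red} \mid \text{working}) = 1/2$ for each light under MLE:

$$p(\text{working}) \, p(r_1 \mid \text{working}) \, p(r_2 \mid \text{working}) = 0.9 \times 0.5 \times 0.5 = 0.225$$
$$p(\text{broken}) \, p(r_1 \mid \text{broken}) \, p(r_2 \mid \text{broken}) = 0.1 \times 1 \times 1 = 0.1$$

Naïve Bayes predicts "working", even though a working pair can never produce two reds. The independence assumption, not the estimation procedure, is what fails.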
More on Generative Models
We can introduce more dependencies, but this can explode the parameter space.
Discriminative models minimize error directly (covered next).
Further reading:
K. Toutanova. Competitive Generative Models with Structure Learning for NLP Classification Tasks. EMNLP 2006.
A. Ng and M. Jordan. On Discriminative vs. Generative Classifiers: A Comparison of Logistic Regression and Naïve Bayes. NIPS 2002.
Discriminative Learning
We will focus on linear models, which predict with $\mathrm{sign}(w \cdot x)$, and on minimizing training error directly.
Upper Bounds on Binary Training Error
0-1 loss (error): NP-hard to minimize over all data points.
Exp loss, $\exp(-\text{score})$: minimized by AdaBoost.
Hinge loss: minimized by support vector machines.
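Written in terms of the margin $m = y(w \cdot x)$, the three losses are (standard forms, stated here for reference):

$$\ell_{0\text{-}1}(m) = \mathbf{1}[m \le 0], \qquad \ell_{\exp}(m) = e^{-m}, \qquad \ell_{\text{hinge}}(m) = \max(0,\, 1 - m)$$

Both the exp loss and the hinge loss upper-bound the 0-1 loss, so driving either down also drives down training error, and unlike the 0-1 loss both are convex and efficiently minimizable.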
Binary Classification: Weak Hypotheses
In NLP, a single feature can serve as a weak learner.
Sentiment example: the presence of "horrible" votes for the negative label; the presence of "excellent" votes for the positive label.
The AdaBoost algorithm
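The slide's pseudocode is not preserved; below is a minimal sketch of discrete AdaBoost with one-feature decision stumps, using the standard weight update and $\alpha_t = \frac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t}$ (our reconstruction, not the slide's exact notation):

```python
import math

def adaboost(X, y, features, T):
    """X: list of feature sets, y: labels in {-1, +1}, T: number of rounds."""
    N = len(X)
    D = [1.0 / N] * N                     # distribution over training examples
    ensemble = []                         # (alpha, feature, sign) triples
    for _ in range(T):
        # Pick the stump h(x) = sign if feature present, -sign otherwise,
        # with the lowest weighted training error under D.
        err, f, s = min(
            (sum(D[i] for i in range(N)
                 if (sg if ft in X[i] else -sg) != y[i]), ft, sg)
            for ft in features for sg in (+1, -1))
        alpha = 0.5 * math.log((1 - err) / max(err, 1e-10))
        ensemble.append((alpha, f, s))
        # Re-weight: mistakes get heavier, correct examples lighter.
        for i in range(N):
            h = s if f in X[i] else -s
            D[i] *= math.exp(-alpha * y[i] * h)
        Z = sum(D)                        # normalizer Z_t from the bound below
        D = [d / Z for d in D]
    return ensemble

def predict(ensemble, x):
    score = sum(a * (s if f in x else -s) for a, f, s in ensemble)
    return 1 if score >= 0 else -1
```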
A Small Example
[Figure: four labeled reviews with per-round boosting weights: "Excellent read" (+), "Terrible: the_plot was boring and opaque" (-), "Awful book. Couldn't follow the_plot." (-), "Excellent book. The_plot was riveting" (+).]
Bound on Training Error [Freund & Schapire 1995]
The ensemble's training error is bounded by the product of the per-round normalizers, $\prod_{t=1}^{T} Z_t$. We greedily minimize error by minimizing $Z_t$ at each round.
For proofs and a more complete discussion:
Robert Schapire and Yoram Singer. Improved Boosting Algorithms Using Confidence-rated Predictions. Machine Learning Journal, 1999.
Exponential Convergence of Error in $t$
We chose $\alpha_t$ to minimize $Z_t$. Was that the right choice? Plugging in our solution for $\alpha_t$, we have $Z_t = 2\sqrt{\epsilon_t(1 - \epsilon_t)}$. Writing the edge over random guessing as $\gamma_t = 1/2 - \epsilon_t$, this gives a training-error bound of $\prod_t \sqrt{1 - 4\gamma_t^2} \le \exp\!\big(-2\sum_t \gamma_t^2\big)$: the error decreases exponentially in the number of rounds.
AdaBoost Drawbacks
What happens when an example is mislabeled or an outlier?
The exp loss exponentially penalizes incorrect scores.
The hinge loss only linearly penalizes incorrect scores.
Support Vector Machines
[Figure: two 2-D point clouds of + and - examples, one linearly separable and one non-separable.]
Margin
[Figure: separable + and - points with several candidate separating hyperplanes.]
There are lots of separating hyperplanes. Which should we choose? Choose the hyperplane with the largest margin.
Max-Margin Optimization
$$\max_{\|w\| \le 1} \gamma \quad \text{s.t.} \quad y_i (w \cdot x_i) \ge \gamma \ \ \forall i$$
The score of the correct label must be greater than the margin $\gamma$.
Why do we fix the norm of $w$ to be at most 1? Scaling the weight vector doesn't change the optimal hyperplane.
Equivalent Optimization Problem
Minimize the norm of the weight vector, with a fixed margin for each example:
$$\min_w \frac{1}{2}\|w\|^2 \quad \text{s.t.} \quad y_i (w \cdot x_i) \ge 1 \ \ \forall i$$
Back to the Non-Separable Case
[Figure: overlapping + and - points; no hyperplane satisfies every margin constraint.]
We can't satisfy the margin constraints, but some hyperplanes are better than others.
Soft Margin Optimization
Add slack variables $\xi_i$ to the optimization: allow margin constraints to be violated, but minimize the violation as much as possible.
$$\min_{w, \xi} \frac{1}{2}\|w\|^2 + C \sum_i \xi_i \quad \text{s.t.} \quad y_i (w \cdot x_i) \ge 1 - \xi_i, \ \ \xi_i \ge 0$$
Optimization 1: Absorbing Constraints
At the optimum, $\xi_i = \max(0,\, 1 - y_i (w \cdot x_i))$, so the slack constraints can be absorbed into an unconstrained hinge-loss objective:
$$\min_w \frac{1}{2}\|w\|^2 + C \sum_i \max(0,\, 1 - y_i (w \cdot x_i))$$
Optimization 2: Subgradient Descent
The max creates a non-differentiable point, but there is a subgradient. For the hinge term on example $i$, a valid subgradient with respect to $w$ is $-y_i x_i$ when $y_i (w \cdot x_i) < 1$, and $0$ otherwise.
Stochastic Subgradient Descent
Subgradient descent is like gradient descent: it is also guaranteed to converge, but slowly.
Pegasos [Shalev-Shwartz and Singer 2007]: subgradient descent on a randomly selected subset of examples at each step. Convergence bound: the gap between the objective after $T$ iterations and the best objective value shrinks as $\tilde{O}(1/T)$, i.e., linearly in the number of iterations.
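A minimal sketch of a Pegasos-style update for the hinge-loss objective above, with batch size 1 and the paper's $\eta_t = 1/(\lambda t)$ step size (variable names and the sparse-vector representation are our own assumptions):

```python
import random

def pegasos(examples, lam=0.1, T=1000):
    """examples: list of (x, y) pairs, x a dict {feature: value}, y in {-1, +1}."""
    w = {}
    for t in range(1, T + 1):
        x, y = random.choice(examples)
        eta = 1.0 / (lam * t)
        score = sum(w.get(f, 0.0) * v for f, v in x.items())
        # Shrink w: the subgradient of the (lam/2)||w||^2 regularizer.
        for f in w:
            w[f] *= (1.0 - eta * lam)
        # Step along the hinge-loss subgradient if the margin is violated.
        if y * score < 1:
            for f, v in x.items():
                w[f] = w.get(f, 0.0) + eta * y * v
    return w
```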
SVMs for NLP
We've been looking at binary classification, but most NLP problems aren't binary.
We showed 2-dimensional examples with piece-wise linear decision boundaries, but NLP is typically very high-dimensional.
Joachims [2000] discusses linear models in high-dimensional spaces.
Kernels and Non-Linearity
Kernels let us efficiently map training data into a high-dimensional feature space, then learn a model which is linear in the new space but non-linear in our original space.
But for NLP, we already have a high-dimensional representation!
Optimization with non-linear kernels is often super-linear in the number of examples.
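For concreteness, one standard example of the kernel trick (not from the slides): the degree-2 polynomial kernel on $x \in \mathbb{R}^2$,

$$K(x, x') = (x \cdot x')^2 = \phi(x) \cdot \phi(x'), \qquad \phi(x) = (x_1^2,\ \sqrt{2}\, x_1 x_2,\ x_2^2)$$

Evaluating $K$ directly costs the same as a dot product in the original space, while explicitly constructing $\phi$ would square the number of features.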
More on SVMs
John Shawe-Taylor and Nello Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.
Dan Klein and Ben Taskar. Max-Margin Methods for NLP: Estimation, Structure, and Applications. ACL 2005 tutorial.
Ryan McDonald. Generalized Linear Classifiers in NLP. Tutorial at the Swedish Graduate School in Language Technology, 2007.
SVMs vs. AdaBoost
SVMs with slack are noise-tolerant; AdaBoost has no explicit regularization and must resort to early stopping.
AdaBoost easily extends to non-linear models; non-linear optimization for SVMs is super-linear in the number of examples.
This can matter for examples with hundreds or thousands of features.
More on Discriminative Methods
Logistic regression, also known as maximum entropy: a probabilistic discriminative model which directly models $p(y \mid x)$.
A good general machine learning book, on discriminative learning and more:
Chris Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
Learning to Rank
[Figure: a ranked list of results, positions (1) through (4).]
Features for Web Page Ranking
Good features for this model?
(1) How many words are shared between the query and the web page?
(2) What is the PageRank of the web page?
(3) Other ideas?
Optimization Problem
The loss is defined for a query and a pair of documents: the scores of documents of different ranks must be separated by a margin.
MSRA Web Search and Mining Group (互联网搜索与挖掘组): http://research.microsoft.com/asia/group/wsm/
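A standard pairwise formulation consistent with this description (a reconstruction, not the slide's exact formula): for a query $q$ and documents $d_i$ ranked above $d_j$,

$$\ell(q, d_i, d_j) = \max\big(0,\ 1 - (s(q, d_i) - s(q, d_j))\big)$$

where $s$ is the learned scoring function. Summing this hinge loss over mis-ordered pairs gives an objective that can be optimized exactly as in the classification SVM.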
Come work with us at Microsoft! http://www.msra.cn/recruitment/
