Supervised and Semi-Supervised Learning for NLP
John Blitzer
Natural Language Computing Group (自然语言计算组)
http://research.microsoft.com/asia/group/nlc/
Why should I know about machine learning?
This is an NLP summer school. Why should I care about machine learning?
ACL 2008: 50 of 96 full papers mention learning or statistics in their titles.
4 of 4 outstanding papers propose new learning or statistical inference methods.
Example 1: Review Classification
Input: product review
    Running with Scissors: A Memoir
    Title: Horrible book, horrible.
    "This book was horrible. I read half of it, suffering from a headache the entire time, and eventually i lit it on fire. One less copy in the world...don't waste your money. I wish i had the time spent reading this book back so i could use it for better purposes. This book wasted my life"
Output: label (Positive / Negative)
From the MSRA Machine Learning Group (机器学习组): http://research.microsoft.com/research/china/DCCUE/ml.aspx
Example 2: Relevance Ranking
[Figure: an un-ranked list of documents is mapped to a ranked list.]
Example 3: Machine Translation
Input (English sentence): The national track & field championships concluded
Output (Chinese sentence): 全国田径冠军赛结束
Course Outline
1) Supervised learning [2.5 hrs]
2) Semi-supervised learning [3 hrs]
3) Learning bounds for domain adaptation [30 mins]
Supervised Learning Outline
1) Notation and definitions [5 mins]
2) Generative models [25 mins]
3) Discriminative models [55 mins]
4) Machine learning examples [15 mins]
Training and Testing Data
Training data: labeled pairs $(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)$.
Use the training data to learn a function $f: x \mapsto y$.
Use this function to label unlabeled testing data $x_{N+1}, x_{N+2}, \ldots$, whose labels are unknown.
Feature Representations of $x$
[Figure: each review is represented as a sparse vector of word counts, with features such as "waste", "horrible", and "read_half" active in the negative review, and "loved_it" and "excellent" active in the positive one.]
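To make the representation concrete, here is a minimal bag-of-words sketch in Python. The vocabulary, tokenizer, and example sentence are illustrative assumptions, not the slides' exact counts.

```python
from collections import Counter

def bag_of_words(text, vocab):
    """Map a document to a vector of word counts over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [counts[w] for w in vocab]

# Hypothetical vocabulary for the running sentiment example.
vocab = ["waste", "horrible", "read_half", "loved_it", "excellent"]
negative_review = "this book was horrible horrible what a waste of money"
print(bag_of_words(negative_review, vocab))  # -> [1, 2, 0, 0, 0]
```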
Generative Model
A generative model scores a document-label pair by its joint probability. Naïve Bayes factors this as $p(y, x) = p(y) \prod_j p(x_j \mid y)$.
Graphical Model Representation
Encodes a multivariate probability distribution.
Nodes indicate random variables.
Edges indicate conditional dependency.
Graphical Model Inference
[Figure: a label node $y$ connected to word nodes "waste", "read_half", and "horrible"; the joint probability multiplies $p(y = -1)$, $p(\text{horrible} \mid -1)$, $p(\text{read\_half} \mid -1)$, and so on.]
Inference at Test Time
Given an unlabeled instance, how can we find its label? Just choose the most probable label: $\hat{y} = \arg\max_y p(y) \prod_j p(x_j \mid y)$.
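A minimal sketch of this argmax inference with made-up parameters (the probabilities below are illustrative assumptions, not estimates from any corpus):

```python
import math

# Hypothetical Naive Bayes parameters for the sentiment example.
priors = {"+1": 0.5, "-1": 0.5}
cond = {
    "+1": {"horrible": 0.01, "waste": 0.01, "excellent": 0.20, "loved_it": 0.10},
    "-1": {"horrible": 0.15, "waste": 0.10, "excellent": 0.01, "loved_it": 0.01},
}

def nb_predict(words):
    """Return the label maximizing log p(y) + sum_j log p(x_j | y)."""
    def score(y):
        return math.log(priors[y]) + sum(math.log(cond[y].get(w, 1e-6)) for w in words)
    return max(priors, key=score)

print(nb_predict(["horrible", "waste"]))  # -> "-1"
```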
Estimating Parameters from Training Data
Back to the labeled training data: $(x_1, y_1), \ldots, (x_N, y_N)$.
Multiclass Classification
Example: query classification into Travel, Technology, News, Entertainment, ...
Input query: "自然语言处理" ("natural language processing")
Training and testing are the same as in the binary case.
Maximum Likelihood Estimation
Why set the parameters to (normalized) counts?
MLE: Label Marginals
Maximizing the likelihood of the training data sets the label marginals to the empirical label frequencies.
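For reference, the standard closed-form MLE solutions for this model (a reconstruction; the slide's derivation is not preserved):

$$\hat{p}(y) = \frac{\sum_{i=1}^{N} \mathbf{1}[y_i = y]}{N}, \qquad \hat{p}(x_j \mid y) = \frac{\mathrm{count}(x_j, y)}{\sum_{j'} \mathrm{count}(x_{j'}, y)}$$

Each parameter is a normalized count over the training data, which answers the question above: counting is not a heuristic, it is the likelihood maximizer.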
Problems with Naïve Bayes
Example: predicting broken traffic lights.
When the lights are broken, both lights are always red.
When the lights are working, one light is red and one is green.
Problems with Naïve Bayes (2)
Now suppose both lights are red. What will our model predict?
We got the wrong answer. Is there a better model?
The MLE generative model is not the best model!
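To see the failure concretely, here is a worked version under an assumed prior of $p(\text{broken}) = 0.1$ (the slides' exact numbers are not preserved). Working lights never show two reds, so "both red" implies broken. But Naïve Bayes treats the two lights as independent given the label, with $p(\text{red} \mid \text{working}) = 1/2$ for each light under MLE:

$$p(\text{working}) \, p(r_1 \mid \text{working}) \, p(r_2 \mid \text{working}) = 0.9 \times 0.5 \times 0.5 = 0.225$$
$$p(\text{broken}) \, p(r_1 \mid \text{broken}) \, p(r_2 \mid \text{broken}) = 0.1 \times 1 \times 1 = 0.1$$

Naïve Bayes predicts "working", even though a working pair can never produce two reds. The independence assumption, not the estimation procedure, is what fails.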
More on Generative Models
We can introduce more dependencies, but this can explode the parameter space.
Discriminative models minimize error directly (covered next).
Further reading:
K. Toutanova. Competitive Generative Models with Structure Learning for NLP Classification Tasks. EMNLP 2006.
A. Ng and M. Jordan. On Discriminative vs. Generative Classifiers: A Comparison of Logistic Regression and Naïve Bayes. NIPS 2002.
Discriminative Learning
We will focus on linear models, which predict with $\mathrm{sign}(w \cdot x)$, and on minimizing training error directly.
Upper Bounds on Binary Training Error
0-1 loss (error): NP-hard to minimize over all data points.
Exp loss, $\exp(-\text{score})$: minimized by AdaBoost.
Hinge loss: minimized by support vector machines.
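Written in terms of the margin $m = y(w \cdot x)$, the three losses are (standard forms, stated here for reference):

$$\ell_{0\text{-}1}(m) = \mathbf{1}[m \le 0], \qquad \ell_{\exp}(m) = e^{-m}, \qquad \ell_{\text{hinge}}(m) = \max(0,\, 1 - m)$$

Both the exp loss and the hinge loss upper-bound the 0-1 loss, so driving either down also drives down training error, and unlike the 0-1 loss both are convex and efficiently minimizable.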
Binary Classification: Weak Hypotheses
In NLP, a single feature can serve as a weak learner.
Sentiment example: the presence of "horrible" votes for the negative label; the presence of "excellent" votes for the positive label.
The AdaBoost algorithm
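The slide's pseudocode is not preserved; below is a minimal sketch of discrete AdaBoost with one-feature decision stumps, using the standard weight update and $\alpha_t = \frac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t}$ (our reconstruction, not the slide's exact notation):

```python
import math

def adaboost(X, y, features, T):
    """X: list of feature sets, y: labels in {-1, +1}, T: number of rounds."""
    N = len(X)
    D = [1.0 / N] * N                     # distribution over training examples
    ensemble = []                         # (alpha, feature, sign) triples
    for _ in range(T):
        # Pick the stump h(x) = sign if feature present, -sign otherwise,
        # with the lowest weighted training error under D.
        err, f, s = min(
            (sum(D[i] for i in range(N)
                 if (sg if ft in X[i] else -sg) != y[i]), ft, sg)
            for ft in features for sg in (+1, -1))
        alpha = 0.5 * math.log((1 - err) / max(err, 1e-10))
        ensemble.append((alpha, f, s))
        # Re-weight: mistakes get heavier, correct examples lighter.
        for i in range(N):
            h = s if f in X[i] else -s
            D[i] *= math.exp(-alpha * y[i] * h)
        Z = sum(D)                        # normalizer Z_t from the bound below
        D = [d / Z for d in D]
    return ensemble

def predict(ensemble, x):
    score = sum(a * (s if f in x else -s) for a, f, s in ensemble)
    return 1 if score >= 0 else -1
```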
A Small Example
[Figure: four labeled reviews with per-round boosting weights: "Excellent read" (+), "Terrible: the_plot was boring and opaque" (-), "Awful book. Couldn't follow the_plot." (-), "Excellent book. The_plot was riveting" (+).]
Bound on Training Error [Freund & Schapire 1995]
The ensemble's training error is bounded by the product of the per-round normalizers, $\prod_{t=1}^{T} Z_t$. We greedily minimize error by minimizing $Z_t$ at each round.
For proofs and a more complete discussion:
Robert Schapire and Yoram Singer. Improved Boosting Algorithms Using Confidence-rated Predictions. Machine Learning Journal, 1999.
Exponential Convergence of Error in $t$
We chose $\alpha_t$ to minimize $Z_t$. Was that the right choice? Plugging in our solution for $\alpha_t$, we have $Z_t = 2\sqrt{\epsilon_t(1 - \epsilon_t)}$. Writing the edge over random guessing as $\gamma_t = 1/2 - \epsilon_t$, this gives a training-error bound of $\prod_t \sqrt{1 - 4\gamma_t^2} \le \exp\!\big(-2\sum_t \gamma_t^2\big)$: the error decreases exponentially in the number of rounds.
AdaBoost Drawbacks
What happens when an example is mislabeled or an outlier?
The exp loss exponentially penalizes incorrect scores.
The hinge loss only linearly penalizes incorrect scores.
Support Vector Machines
[Figure: two 2-D point clouds of + and - examples, one linearly separable and one non-separable.]
Margin
[Figure: separable + and - points with several candidate separating hyperplanes.]
There are lots of separating hyperplanes. Which should we choose? Choose the hyperplane with the largest margin.
Max-Margin Optimization
$$\max_{\|w\| \le 1} \gamma \quad \text{s.t.} \quad y_i (w \cdot x_i) \ge \gamma \ \ \forall i$$
The score of the correct label must be greater than the margin $\gamma$.
Why do we fix the norm of $w$ to be at most 1? Scaling the weight vector doesn't change the optimal hyperplane.
Equivalent Optimization Problem
Minimize the norm of the weight vector, with a fixed margin for each example:
$$\min_w \frac{1}{2}\|w\|^2 \quad \text{s.t.} \quad y_i (w \cdot x_i) \ge 1 \ \ \forall i$$
Back to the Non-Separable Case
[Figure: overlapping + and - points; no hyperplane satisfies every margin constraint.]
We can't satisfy the margin constraints, but some hyperplanes are better than others.
Soft Margin Optimization
Add slack variables $\xi_i$ to the optimization: allow margin constraints to be violated, but minimize the violation as much as possible.
$$\min_{w, \xi} \frac{1}{2}\|w\|^2 + C \sum_i \xi_i \quad \text{s.t.} \quad y_i (w \cdot x_i) \ge 1 - \xi_i, \ \ \xi_i \ge 0$$
Optimization 1: Absorbing Constraints
At the optimum, $\xi_i = \max(0,\, 1 - y_i (w \cdot x_i))$, so the slack constraints can be absorbed into an unconstrained hinge-loss objective:
$$\min_w \frac{1}{2}\|w\|^2 + C \sum_i \max(0,\, 1 - y_i (w \cdot x_i))$$
Optimization 2: Subgradient Descent
The max creates a non-differentiable point, but there is a subgradient. For the hinge term on example $i$, a valid subgradient with respect to $w$ is $-y_i x_i$ when $y_i (w \cdot x_i) < 1$, and $0$ otherwise.
Stochastic Subgradient Descent
Subgradient descent is like gradient descent: it is also guaranteed to converge, but slowly.
Pegasos [Shalev-Shwartz and Singer 2007]: subgradient descent on a randomly selected subset of examples at each step. Convergence bound: the gap between the objective after $T$ iterations and the best objective value shrinks as $\tilde{O}(1/T)$, i.e., linearly in the number of iterations.
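A minimal sketch of a Pegasos-style update for the hinge-loss objective above, with batch size 1 and the paper's $\eta_t = 1/(\lambda t)$ step size (variable names and the sparse-vector representation are our own assumptions):

```python
import random

def pegasos(examples, lam=0.1, T=1000):
    """examples: list of (x, y) pairs, x a dict {feature: value}, y in {-1, +1}."""
    w = {}
    for t in range(1, T + 1):
        x, y = random.choice(examples)
        eta = 1.0 / (lam * t)
        score = sum(w.get(f, 0.0) * v for f, v in x.items())
        # Shrink w: the subgradient of the (lam/2)||w||^2 regularizer.
        for f in w:
            w[f] *= (1.0 - eta * lam)
        # Step along the hinge-loss subgradient if the margin is violated.
        if y * score < 1:
            for f, v in x.items():
                w[f] = w.get(f, 0.0) + eta * y * v
    return w
```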
SVMs for NLP
We've been looking at binary classification, but most NLP problems aren't binary.
We showed 2-dimensional examples with piece-wise linear decision boundaries, but NLP is typically very high-dimensional.
Joachims [2000] discusses linear models in high-dimensional spaces.
Kernels and Non-Linearity
Kernels let us efficiently map training data into a high-dimensional feature space, then learn a model which is linear in the new space but non-linear in our original space.
But for NLP, we already have a high-dimensional representation!
Optimization with non-linear kernels is often super-linear in the number of examples.
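For concreteness, one standard example of the kernel trick (not from the slides): the degree-2 polynomial kernel on $x \in \mathbb{R}^2$,

$$K(x, x') = (x \cdot x')^2 = \phi(x) \cdot \phi(x'), \qquad \phi(x) = (x_1^2,\ \sqrt{2}\, x_1 x_2,\ x_2^2)$$

Evaluating $K$ directly costs the same as a dot product in the original space, while explicitly constructing $\phi$ would square the number of features.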
More on SVMs
John Shawe-Taylor and Nello Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.
Dan Klein and Ben Taskar. Max-Margin Methods for NLP: Estimation, Structure, and Applications. ACL 2005 tutorial.
Ryan McDonald. Generalized Linear Classifiers in NLP. Tutorial at the Swedish Graduate School in Language Technology, 2007.
SVMs vs. AdaBoost
SVMs with slack are noise-tolerant; AdaBoost has no explicit regularization and must resort to early stopping.
AdaBoost easily extends to non-linear models; non-linear optimization for SVMs is super-linear in the number of examples.
This can matter for examples with hundreds or thousands of features.
More on Discriminative Methods
Logistic regression, also known as maximum entropy: a probabilistic discriminative model which directly models $p(y \mid x)$.
A good general machine learning book, on discriminative learning and more:
Chris Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
Learning to Rank
[Figure: a ranked list of results, positions (1) through (4).]
Features for Web Page Ranking
Good features for this model?
(1) How many words are shared between the query and the web page?
(2) What is the PageRank of the web page?
(3) Other ideas?
Optimization Problem
The loss is defined for a query and a pair of documents: the scores of documents of different ranks must be separated by a margin.
MSRA Web Search and Mining Group (互联网搜索与挖掘组): http://research.microsoft.com/asia/group/wsm/
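A standard pairwise formulation consistent with this description (a reconstruction, not the slide's exact formula): for a query $q$ and documents $d_i$ ranked above $d_j$,

$$\ell(q, d_i, d_j) = \max\big(0,\ 1 - (s(q, d_i) - s(q, d_j))\big)$$

where $s$ is the learned scoring function. Summing this hinge loss over mis-ordered pairs gives an objective that can be optimized exactly as in the classification SVM.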
Come work with us at Microsoft! http://www.msra.cn/recruitment/
