Word Sense Disambiguation
CMSC 723: Computational Linguistics I, Session #11
Jimmy Lin, The iSchool, University of Maryland
Wednesday, November 11, 2009
Material drawn from slides by Saif Mohammad and Bonnie Dorr
Progression of the Course
- Words: finite-state morphology; part-of-speech tagging (TBL + HMM)
- Structure: CFGs + parsing (CKY, Earley); n-gram language models
- Meaning!
Today’s Agenda
- Word sense disambiguation
- Beyond lexical semantics
- Semantic attachments to syntax
- Shallow semantics: PropBank
Word Sense Disambiguation
Recap: Word Sense
From WordNet, the senses of “pipe”:
Noun
- {pipe, tobacco pipe} (a tube with a small bowl at one end; used for smoking tobacco)
- {pipe, pipage, piping} (a long tube made of metal or plastic that is used to carry water or oil or gas etc.)
- {pipe, tube} (a hollow cylindrical shape)
- {pipe} (a tubular wind instrument)
- {organ pipe, pipe, pipework} (the flues and stops on a pipe organ)
Verb
- {shriek, shrill, pipe up, pipe} (utter a shrill cry)
- {pipe} (transport by pipeline) “pipe oil, water, and gas into the desert”
- {pipe} (play on a pipe) “pipe a tune”
- {pipe} (trim with piping) “pipe the skirt”
Word Sense Disambiguation
Task: automatically select the correct sense of a word
- Lexical sample
- All-words
Theoretically useful for many applications:
- Semantic similarity (remember from last time?)
- Information retrieval
- Machine translation
- …
A solution in search of a problem? Why?
How big is the problem?
Most words in English have only one sense:
- 62% in Longman’s Dictionary of Contemporary English
- 79% in WordNet
But the others tend to have several senses:
- Average of 3.83 in LDOCE
- Average of 2.96 in WordNet
Ambiguous words are more frequently used:
- In the British National Corpus, 84% of instances have more than one sense
Some senses are more frequent than others
Ground Truth
- Which sense inventory do we use?
- Issues there?
- Application specificity?
Corpora
Lexical sample:
- line-hard-serve corpus (4k sense-tagged examples)
- interest corpus (2,369 sense-tagged examples)
- …
All-words:
- SemCor (234k words, subset of the Brown Corpus)
- Senseval-3 (2,081 tagged content words from 5k total words)
- …
Observations about the size?
Evaluation
Intrinsic:
- Measure accuracy of sense selection with respect to ground truth
Extrinsic:
- Integrate WSD as part of a bigger end-to-end system, e.g., machine translation or information retrieval
- Compare end-to-end performance with and without the WSD component
Baseline + Upper Bound
Baseline: most frequent sense
- Equivalent to “take first sense” in WordNet
- Does surprisingly well! (62% accuracy in this case!)
Upper bound:
- Fine-grained WordNet senses: 75-80% human agreement
- Coarser-grained inventories: 90% human agreement possible
- What does this mean?
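For concreteness, here is a minimal sketch of the most-frequent-sense baseline, assuming NLTK and its WordNet data are installed; this is illustrative code, not code from the original lecture. NLTK returns synsets roughly ordered by frequency, so taking the first synset approximates the “take first sense” baseline.

```python
# Minimal sketch of the most-frequent-sense baseline (assumes `pip install nltk`
# and `nltk.download('wordnet')` have been run).
from nltk.corpus import wordnet as wn

def most_frequent_sense(word, pos=None):
    """Return the first (roughly most frequent) WordNet synset for a word, or None."""
    synsets = wn.synsets(word, pos=pos)
    return synsets[0] if synsets else None

print(most_frequent_sense("pipe", pos=wn.NOUN))  # e.g., the tobacco-pipe synset
```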
WSD Approaches
Depending on use of manually created knowledge sources:
- Knowledge-lean
- Knowledge-rich
Depending on use of labeled data:
- Supervised
- Semi- or minimally supervised
- Unsupervised
Lesk’s Algorithm
Intuition: note word overlap between the context and dictionary entries
- Unsupervised, but knowledge-rich
Example context: “The bank can guarantee deposits will eventually cover future tuition costs because it invests in adjustable-rate mortgage securities.”
Dictionary: WordNet
Lesk’s Algorithm
Simplest implementation:
- Count overlapping content words between glosses and the context
Lots of variants:
- Include the examples in dictionary definitions
- Include hypernyms and hyponyms
- Give more weight to larger overlaps (e.g., bigrams)
- Give extra weight to infrequent words (e.g., idf weighting)
- …
Works reasonably well! (A small sketch of the simplest version follows below.)
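A minimal sketch of simplified Lesk, assuming NLTK’s WordNet interface; the tiny stopword list and string handling are illustrative simplifications, not the lecture’s implementation. Each candidate synset is scored by content-word overlap between its gloss (plus example sentences, one of the variants mentioned above) and the surrounding context.

```python
from nltk.corpus import wordnet as wn

STOPWORDS = {"the", "a", "an", "of", "in", "on", "to", "it", "and", "is",
             "will", "because", "that", "be", "can"}

def content_words(text):
    """Crude content-word extraction: lowercase, strip punctuation, drop stopwords."""
    return {w.lower().strip(".,;") for w in text.split()} - STOPWORDS

def simplified_lesk(word, context):
    """Pick the synset whose gloss (and examples) best overlaps the context."""
    context_words = content_words(context)
    best, best_overlap = None, -1
    for synset in wn.synsets(word):
        signature = content_words(synset.definition())
        for example in synset.examples():      # variant: include dictionary examples
            signature |= content_words(example)
        overlap = len(signature & context_words)
        if overlap > best_overlap:
            best, best_overlap = synset, overlap
    return best

sentence = ("The bank can guarantee deposits will eventually cover future "
            "tuition costs because it invests in adjustable-rate mortgage securities.")
print(simplified_lesk("bank", sentence))  # hopefully the financial-institution sense
```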
Supervised WSD: NLP meets ML
WSD as a supervised classification task
- Train a separate classifier for each word
Three components of a machine learning problem:
- Training data (corpora)
- Representations (features)
- Learning method (algorithm, model)
Supervised Classification
[Diagram: at training time, labeled training data passes through a representation function into a supervised machine learning algorithm, producing a classifier; at testing time, an unlabeled document passes through the same representation function and the classifier assigns it one of the labels.]
Three Laws of Machine Learning
1. Thou shalt not mingle training data with test data
2. Thou shalt not mingle training data with test data
3. Thou shalt not mingle training data with test data
But what do you do if you need more test data?
Features
Possible features:
- POS and surface form of the word itself
- Surrounding words and their POS tags
- Positional information of surrounding words and POS tags
- Same as above, but with n-grams
- Grammatical information
- …
Richness of the features?
- Richer features = the ML algorithm does less of the work
- More impoverished features = the ML algorithm does more of the work
(A small feature-extraction sketch follows below.)
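A minimal sketch of contextual feature extraction for supervised WSD. The specific features, the window size, and the assumption that input arrives as (word, POS) pairs are illustrative choices, not the exact feature set from the lecture.

```python
def extract_features(tagged_tokens, target_index, window=2):
    """Build a feature dict from the target word and its +/- window neighbors."""
    word, pos = tagged_tokens[target_index]
    features = {"target_word": word.lower(), "target_pos": pos}
    for offset in range(-window, window + 1):
        if offset == 0:
            continue
        i = target_index + offset
        if 0 <= i < len(tagged_tokens):
            w, p = tagged_tokens[i]
            # positional features: word and POS at each signed offset
            features[f"word_{offset:+d}"] = w.lower()
            features[f"pos_{offset:+d}"] = p
    return features

tokens = [("He", "PRP"), ("plays", "VBZ"), ("bass", "NN"), ("guitar", "NN")]
print(extract_features(tokens, target_index=2))
```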
Classifiers
Once we cast the WSD problem as supervised classification, many learning techniques are possible:
- Naïve Bayes (the thing to try first)
- Decision lists
- Decision trees
- MaxEnt
- Support vector machines
- Nearest neighbor methods
- …
Classifier Tradeoffs
Which classifier should I use? It depends:
- Number of features
- Types of features
- Number of possible values for a feature
- Noise
- …
General advice:
- Start with Naïve Bayes
- Use decision trees/lists if you want to understand what the classifier is doing
- SVMs often give state-of-the-art performance
- MaxEnt methods also work well
Naïve Bayes
Pick the sense that is most probable given the context
- Context represented by a feature vector
- By Bayes’ Theorem (see the decomposition below)
Problem: data sparsity!
We can ignore the denominator term… why?
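The equations on this slide are images in the original deck; in standard notation (my reconstruction, not the original figures), with f⃗ the context feature vector and s ranging over the senses of the target word w:

```latex
\hat{s} = \operatorname*{arg\,max}_{s \in \mathrm{senses}(w)} P(s \mid \vec{f})
        = \operatorname*{arg\,max}_{s} \frac{P(\vec{f} \mid s)\,P(s)}{P(\vec{f})}
        = \operatorname*{arg\,max}_{s} P(\vec{f} \mid s)\,P(s)
```

The denominator P(f⃗) is the same for every candidate sense, which is the usual reason it can be dropped.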
The “Naïve” Part
Feature vectors are too sparse to estimate directly
So… assume features are conditionally independent given the word sense
- This is naïve because?
Putting everything together (see below):
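The independence assumption and the resulting decision rule (again images in the original slides) take the usual form:

```latex
P(\vec{f} \mid s) = P(f_1, f_2, \ldots, f_n \mid s) \;\approx\; \prod_{i=1}^{n} P(f_i \mid s)
\qquad\Longrightarrow\qquad
\hat{s} = \operatorname*{arg\,max}_{s} \; P(s) \prod_{i=1}^{n} P(f_i \mid s)
```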
Naïve Bayes: Training
How do we estimate the probability distributions?
Maximum-likelihood estimates (MLE), shown below
What else do we need to do?
Well, how well does it work? (later…)
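The MLE formulas on the slide are images; the standard counts-based estimates are:

```latex
P(s) = \frac{\mathrm{count}(s, w)}{\mathrm{count}(w)}
\qquad\qquad
P(f_i \mid s) = \frac{\mathrm{count}(f_i, s)}{\mathrm{count}(s)}
```

The “what else” on the slide presumably refers to smoothing: unsmoothed MLE assigns zero probability to any feature never observed with a given sense.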
Decision List
Ordered list of tests (equivalent to a “case” statement)
Example decision list, discriminating between bass (fish) and bass (music): [shown as a table in the original slides]
Building Decision Lists
Simple algorithm:
- Compute how discriminative each feature is (a common scoring rule is shown below)
- Create an ordered list of tests from these values
Limitation?
How do you build n-way classifiers from binary classifiers?
- One vs. rest (sequential vs. parallel)
- Another learning problem
Well, how well does it work? (later…)
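The scoring formula on this slide is an image in the original deck; a standard choice for two-way decision lists (the Yarowsky-style score) ranks each feature by the absolute log-likelihood ratio between the two senses:

```latex
\mathrm{score}(f_i) = \left|\, \log \frac{P(s_1 \mid f_i)}{P(s_2 \mid f_i)} \,\right|
```

Tests are then sorted by this score, highest first, and the first matching test decides the sense.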
Decision Trees
Instead of a list, imagine a tree…
[Diagram, reconstructed: the root tests “fish within k words?” (yes → FISH); otherwise test “striped bass?” (yes → FISH); otherwise test “guitar within k words?” (yes → MUSIC); and so on.]
Using Decision Trees
Given an instance (= a list of feature values):
- Start at the root
- At each interior node, check the feature value
- Follow the corresponding branch based on the test
- When a leaf node is reached, return its category
Decision tree material drawn from slides by Ed Loper
Building Decision Trees
Basic idea: build the tree top-down, recursively partitioning the training data at each step
- At each node, try to split the training data on a feature (could be binary or otherwise)
What features should we split on?
- A small decision tree is desired
- Pick the feature that gives the most information about the category
Example: 20 questions
- I’m thinking of a number from 1 to 1,000
- You can ask any yes/no question
- What question would you ask?
Evaluating Splits via Entropy
Entropy of a set of events E (formula below), where P(c) is the probability that an event in E has category c
How much information does a feature give us about the category (sense)?
- H(E) = entropy of event set E
- H(E|f) = expected entropy of event set E once we know the value of feature f
- Information gain: G(E, f) = H(E) - H(E|f) = amount of new information provided by feature f
Split on the feature that maximizes information gain
Well, how well does it work? (later…)
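The entropy equation on this slide is an image; with E_{f=v} denoting the subset of events for which feature f takes value v, the quantities referred to above are:

```latex
H(E) = -\sum_{c} P(c) \log_2 P(c), \qquad
H(E \mid f) = \sum_{v} P(f{=}v)\, H\!\left(E_{f=v}\right), \qquad
G(E, f) = H(E) - H(E \mid f)
```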
WSD Accuracy
Generally:
- Naïve Bayes provides a reasonable baseline: ~70%
- Decision lists and decision trees slightly lower
- State of the art is slightly higher
However:
- Accuracy depends on the actual word, sense inventory, amount of training data, number of features, etc.
- Remember the caveat about baseline and upper bound
Minimally Supervised WSD
But annotations are expensive!
“Bootstrapping” or co-training (Yarowsky 1995):
- Start with a (small) seed, learn a decision list
- Use the decision list to label the rest of the corpus
- Retain “confident” labels, treat them as annotated data to learn a new decision list
- Repeat…
Heuristics (derived from observation):
- One sense per discourse
- One sense per collocation
One Sense per Discourse
A word tends to preserve its meaning across all its occurrences in a given discourse
Evaluation: 8 words with two-way ambiguity, e.g., plant, crane, etc.
- 98% of the two-word occurrences in the same discourse carry the same meaning
The grain of salt: accuracy depends on granularity
- Performance of “one sense per discourse” measured on SemCor is approximately 70%
Slide by Mihalcea and Pedersen
One Sense per Collocation
A word tends to preserve its meaning when used in the same collocation
- Strong for adjacent collocations
- Weaker as the distance between words increases
Evaluation:
- 97% precision on words with two-way ambiguity
Again, accuracy depends on granularity:
- 70% precision on SemCor words
Slide by Mihalcea and Pedersen
Yarowsky’s Method: Example
Disambiguating plant (industrial sense) vs. plant (living thing sense)
Think of seed features for each sense:
- Industrial sense: co-occurring with “manufacturing”
- Living thing sense: co-occurring with “life”
Use “one sense per collocation” to build an initial decision list classifier
Treat the results as annotated data, train a new decision list classifier, iterate…
[Four image-only slides follow in the original deck.]
Yarowsky’s Method: Stopping
Stop when:
- Error on the training data is less than a threshold
- No more training data is covered
Use the final decision list for WSD
(A sketch of the full bootstrapping loop follows below.)
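Putting the last few slides together, here is a minimal sketch of the bootstrapping loop. The helper names (train_decision_list, classify_with_confidence), the confidence threshold, and the iteration cap are illustrative assumptions, not the original implementation.

```python
def bootstrap(seed_labeled, unlabeled, train_decision_list,
              classify_with_confidence, threshold=0.95, max_iters=20):
    """Yarowsky-style bootstrapping: grow the labeled set with confident predictions."""
    labeled = list(seed_labeled)            # (instance, sense) pairs from the seeds
    remaining = list(unlabeled)
    for _ in range(max_iters):
        model = train_decision_list(labeled)
        newly_labeled, still_unlabeled = [], []
        for instance in remaining:
            sense, confidence = classify_with_confidence(model, instance)
            if confidence >= threshold:     # retain only "confident" labels
                newly_labeled.append((instance, sense))
            else:
                still_unlabeled.append(instance)
        if not newly_labeled:               # stop: no more training data is covered
            break
        labeled.extend(newly_labeled)
        remaining = still_unlabeled
    return train_decision_list(labeled)     # final decision list used for WSD
```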
Yarowsky’s Method: Discussion
Advantages:
- Accuracy is about as good as a supervised algorithm
- Bootstrapping: far less manual effort
Disadvantages:
- Seeds may be tricky to construct
- Works only for coarse-grained sense distinctions
- Snowballing error with co-training
Recent extension: now apply this to the web!
WSD with Parallel Text
But annotations are expensive!
- What’s the “proper” sense inventory?
- How fine- or coarse-grained?
- Application specific?
Observation: multiple senses translate to different words in other languages!
- A “bill” in English may be a “pico” (bird jaw) or a “cuenta” (invoice) in Spanish
Use the foreign language as the sense inventory!
Added bonus: annotations for free! (a byproduct of the word-alignment process in machine translation)
Beyond Lexical Semantics
Syntax-Semantics Pipeline
Example: FOPL (first-order predicate logic)
Semantic Attachments
Basic idea:
- Associate λ-expressions with lexical items
- At a branching node, apply the semantics of one child to the other (based on the syntactic rule)
Refresher in λ-calculus…
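The λ-calculus refresher and the worked attachments on the following slides are images in the original deck; the following is a generic illustration (my example, not the lecture’s) of applying one child’s semantics to the other at each branching node:

```latex
[\![\textit{loves}]\!] = \lambda x.\,\lambda y.\,\mathit{Loves}(y, x) \\[4pt]
\text{VP} \rightarrow \text{V NP}: \quad
  [\![\text{VP}]\!] = [\![\text{V}]\!]\big([\![\text{NP}]\!]\big)
  = \big(\lambda x.\,\lambda y.\,\mathit{Loves}(y, x)\big)(\mathit{Mary})
  = \lambda y.\,\mathit{Loves}(y, \mathit{Mary}) \\[4pt]
\text{S} \rightarrow \text{NP VP}: \quad
  [\![\text{S}]\!] = [\![\text{VP}]\!]\big([\![\text{NP}]\!]\big)
  = \big(\lambda y.\,\mathit{Loves}(y, \mathit{Mary})\big)(\mathit{John})
  = \mathit{Loves}(\mathit{John}, \mathit{Mary})
```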
Augmenting Syntactic Rules
Semantic Analysis: Example
Complexities
Oh, there are many…
Classic problem: quantifier scoping
- “Every restaurant has a menu”
Issues with this style of semantic analysis?
Semantics in NLP Today
Can be characterized as “shallow semantics”:
- Verbs denote events; represent them as “frames”
- Nouns (in general) participate in events
- Types of event participants = “slots” or “roles”
- Event participants themselves = “slot fillers”
- Depending on the linguistic theory, roles may have special names: agent, theme, etc.
Semantic analysis: semantic role labeling
- Automatically identify the event type (i.e., frame)
- Automatically identify event participants and the role that each plays (i.e., label the semantic role)
What works in NLP?
- POS-annotated corpora
- Tree-annotated corpora: Penn Treebank
- Role-annotated corpora: Proposition Bank (PropBank)
PropBank: Two Examples
agree.01
- Arg0: Agreer
- Arg1: Proposition
- Arg2: Other entity agreeing
- Example: [Arg0 John] agrees [Arg2 with Mary] [Arg1 on everything]
fall.01
- Arg1: Logical subject, patient, thing falling
- Arg2: Extent, amount fallen
- Arg3: Start point
- Arg4: End point
- Example: [Arg1 Sales] fell [Arg4 to $251.2 million] [Arg3 from $278.7 million]
How do we do it?
Short answer: supervised machine learning
One approach: classification of each tree constituent
- Features can be words, phrase type, linear position, tree position, etc.
- Apply standard machine learning algorithms
(A feature sketch appears below.)
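A minimal sketch of per-constituent feature extraction for semantic role labeling, in the spirit of the slide above. The feature names and the dict-based constituent representation are illustrative assumptions; real systems (e.g., Gildea and Jurafsky-style SRL) derive these features from a parse tree.

```python
def constituent_features(constituent, predicate):
    """Features for classifying one parse-tree constituent against one predicate."""
    return {
        "phrase_type": constituent["label"],            # e.g., "NP", "PP"
        "head_word": constituent["head"].lower(),       # lexical head of the phrase
        "position": "before" if constituent["start"] < predicate["index"] else "after",
        "path": constituent["path_to_predicate"],       # e.g., "NP^S_VP_VBD"
        "predicate": predicate["lemma"],                 # e.g., "fall"
        "voice": predicate["voice"],                     # "active" or "passive"
    }

# Each constituent's feature dict would then be fed to a standard classifier
# that predicts a role label (Arg0, Arg1, ..., or NONE).
constituent = {"label": "NP", "head": "Sales", "start": 0, "path_to_predicate": "NP^S_VP_VBD"}
predicate = {"lemma": "fall", "index": 1, "voice": "active"}
print(constituent_features(constituent, predicate))
```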
Recap of Today’s Topics
- Word sense disambiguation
- Beyond lexical semantics
- Semantic attachments to syntax
- Shallow semantics: PropBank
The Complete Picture
- Levels of representation: Phonology, Morphology, Syntax, Semantics, Reasoning
- Analysis: Speech Recognition, Morphological Analysis, Parsing, Semantic Analysis, Reasoning/Planning
- Generation: Speech Synthesis, Morphological Realization, Syntactic Realization, Utterance Planning
The Home Stretch
- Next week: MapReduce and large-data processing
- No classes Thanksgiving week!
- December: two guest lectures by Ken Church
