◯Atsushi Keyaki†, Jun Miyazaki†
†: Tokyo Institute of Technology,
Japan
Part-of-speech Tagging for Web Search Queries using a Large-scale Web Corpus
SAC2017 IAR
Objective
•  Accurate part-of-speech (POS) tagging for Web queries
o POS tags are beneficial for accurate IR
•  Different search strategies per POS tag [1]
•  Identifying unnecessary data with POS tags [2]
o Example
•  Query: “discovery channel”
•  Doc: “Victim’s discovery is broadcast by the channel”
[1] Crestani et al.: "Short Queries, Natural Language and Spoken Document Retrieval: Experiments at Glasgow University", TREC-6, 1998.
[2] Chowdhury and McCabe: "Improving Information Retrieval Systems using Part of Speech Tagging", Univ. of Maryland, 1993.
POS tag mismatch may cause false positives:
in the query, “discovery channel” is a proper noun (a TV program);
in the doc, “discovery” and “channel” are common nouns
Difficulty in query POS tagging
•  Characteristics of Web queries
o  Length is short (composed of a few words)
o  Capitalization is missing
o  Word order is fairly free
•  Solution of related work [3][4]
o  Utilizing the results of sentence-level morphological analysis
•  Sentences are based on natural language grammar
•  Results of sentence-level morphological analysis are accurate
•  The most frequently assigned POS tag is employed
Difficult to correctly identify POS tags with existing morphological analysis tools (developed for natural language)
Sentence: “We stayed at Rif Carlton.” (pronoun / verb / particle / proper noun)
Query: “rif carlton” (proper noun)
[3] Bendersky et al.: "Structural Annotation of Search Queries Using Pseudo Relevance Feedback", CIKM2010.
[4] K. Ganchev et al.: "Using Search-Logs to Improve Query Tagging", ACL2012.
Our approach
•  Related studies
o Use sentence-level morphological analysis of
•  Search results [3]
•  Snippets from search logs [4]
o Consider just the frequency of assigned POS tags
•  Our approach
o Takes global statistics from a large corpus into account
•  Easily available; considers the long tail
o Considers co-occurrence of query terms
April 5, 2017, SAC2017 IAR
[3] Bendersky et al.: "Structural Annotation of Search Queries Using Pseudo Relevance Feedback", CIKM2010.
[4] K. Ganchev et al.: "Using Search-Logs to Improve Query Tagging", ACL2012.
Related studies use only a small amount of highly relevant information
User feedback/search logs are not always available
Preliminary investigation
•  Morphological analysis of Web queries
o Queries
•  TREC Web track topics (200 queries from 2009-2012)
o  Oracle POS tags are annotated by three assessors
o  Referring to description (information need)
o Morphological analysis tool
•  Stanford Log-linear Part-Of-Speech Tagger [5]
o Model
•  Default model
•  Caseless model
o  Does not consider capitalization information during training
o  Tries to solve the “capitalization is missing” problem
[5] Toutanova et al.: "Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency Network", NAACL 2003.
High agreement (Kappa: 0.98)
Summary  of  error  analysis
•  Default model
o  Only half of query terms were assigned correct POS tags
o  Almost all of proper nouns were NOT identified
•  72% of proper nouns are mistakenly assigned as common nouns
•  Error: “obama”, “india”, “ritz carlton”, “discovery channel”
•  Caseless model
o  Around 75% of query terms were assigned correct POS
tags
o  Many proper nouns were identified
•  Common nouns are mistakenly identified as proper nouns
•  Errors caused by partial grammatical rules
o  “lower heart rate”: “lower” (a verb here) is mistakenly tagged as an adjective (adjectives come before common nouns)
o  “gs pay rate”: “pay” (a common noun here) is mistakenly tagged as a verb (verbs come after a subject)
Proposed POS tagging
•  Summary of the error analysis
o  Proper nouns/common nouns cannot be identified
•  Problem 1: Capitalization is missing
o  Grammatical rules are mistakenly applied
•  Problem 2: Word order is fairly free
•  Related studies
o  Only a small amount of highly relevant information is used
•  Problem 3: User feedback and search logs are not always available
•  Approach
o  Sol-P1: Sentence-level morphological analysis
o  Sol-P2: A POS tagging method not based on word order
o  Sol-P3: A large-scale Web corpus (easily available)
o  Building the term-POS database (TPDB)
•  Morphological analysis is applied offline
Processing flow
Offline: sentences from the large-scale Web corpus (S1: tA tB tC, S2: tA tC tD, S3: tC tE tA tF, S4: tB tD) are morphologically analyzed into term/POS sequences (S1: tA/P1 tB/P2 tC/P3, S2: tA/P1 tC/P4 tD/P5, S3: tC/P3 tE/P1 tA/P2 tF/P1, S4: tB/P2 tD/P3) and inserted into the TPDB.
Online: for the query “tA tC”, the TPDB entries containing the query terms (S1, S2, S3) are retrieved and passed to the scoring method.
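The offline/online flow above can be sketched with a minimal in-memory stand-in for the TPDB. The dict-based index and the names (`tagged`, `tpdb_index`, `retrieve`) are illustrative assumptions, not the paper's implementation:

```python
from collections import defaultdict

# Offline: term/POS-tagged sentences produced by morphological analysis
# (data taken from the processing-flow figure).
tagged = {
    "S1": [("tA", "P1"), ("tB", "P2"), ("tC", "P3")],
    "S2": [("tA", "P1"), ("tC", "P4"), ("tD", "P5")],
    "S3": [("tC", "P3"), ("tE", "P1"), ("tA", "P2"), ("tF", "P1")],
    "S4": [("tB", "P2"), ("tD", "P3")],
}

# TPDB stand-in: inverted index from term to ids of sentences containing it.
tpdb_index = defaultdict(set)
for sid, pairs in tagged.items():
    for term, _pos in pairs:
        tpdb_index[term].add(sid)

def retrieve(query_terms):
    """Online: fetch every TPDB entry that contains at least one query term."""
    hits = set()
    for t in query_terms:
        hits |= tpdb_index.get(t, set())
    return {sid: tagged[sid] for sid in sorted(hits)}

entries = retrieve({"tA", "tC"})  # S1, S2, S3 match; S4 contains neither term
```

As in the figure, the query “tA tC” retrieves S1, S2, and S3; S4 is never touched online because indexing happened offline.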
Scoring for POS tagging
•  Design principle
o  Frequently appearing POS tags in the corpus are assigned to queries
o  POS tags of a sentence are emphasized when the sentence contains
more kinds of query terms
•  Co-occurrence of query terms is a useful clue
•  Steps of scoring
o  Retrieving entries which contain query terms from the TPDB
o  Breaking the query down into pairs of query terms
•  Query: “tA tB tC” gives pairs {tA tB}, {tA tC}, {tB tC}
o  Counting entries per term-POS pair for each query term pair
•  e.g., pair {tA tB} (freq. = number of entries containing both term-POS pairs, e.g., tA/P1 and tB/P2):
tA/P1 tB/P2: freq. 5, normalized freq. 0.33 (5/15)
tA/P1 tB/P3: freq. 3, normalized freq. 0.20 (3/15)
tA/P2 tB/P4: freq. 7, normalized freq. 0.47 (7/15)
o  Scoring with the three proposed methods
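The pair-breakdown and normalization steps can be sketched as follows, reproducing the slide's numbers for the pair {tA tB}. The raw counts are copied from the slide; the dict layout is an illustrative assumption:

```python
from itertools import combinations

query = ["tA", "tB", "tC"]
# Break the query down into unordered pairs of query terms.
pairs = list(combinations(query, 2))  # {tA tB}, {tA tC}, {tB tC}

# Entry counts per term-POS pair for {tA tB}, as on the slide:
# 5 entries contain both tA/P1 and tB/P2, 3 contain tA/P1 and tB/P3, etc.
freq = {
    (("tA", "P1"), ("tB", "P2")): 5,
    (("tA", "P1"), ("tB", "P3")): 3,
    (("tA", "P2"), ("tB", "P4")): 7,
}

# Normalized frequency: each count divided by the pair's total (15 here).
total = sum(freq.values())
normalized = {k: round(v / total, 2) for k, v in freq.items()}
# 5/15 = 0.33, 3/15 = 0.20, 7/15 = 0.47, matching the slide's table
```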
Three proposed methods
•  MaxFreq
o  The most frequently appearing POS tag (highest freq.) is assigned
•  MostLikelihood
o  The POS tag with the highest normalized freq. is assigned
o  MaxFreq may be affected by frequently appearing terms
•  AllCombi
o  The POS tag with the highest sum of term-POS frequencies is assigned
o  MaxFreq and MostLikelihood focus only on the POS tag with the highest frequency/normalized frequency
o  More diversified context, including the long tail, can be considered
Worked example for query “tA tB tC” (freq., normalized freq.):
Pair tA:tB: tA/P1 tB/P2 (5, 0.33); tA/P1 tB/P3 (3, 0.20); tA/P2 tB/P4 (7, 0.47)
Pair tA:tC: tA/P1 tC/P2 (3, 0.43); tA/P3 tC/P3 (4, 0.57)
Pair tB:tC: tB/P1 tC/P2 (5, 0.5); tB/P2 tC/P2 (5, 0.5)
For tA: MaxFreq assigns tA/P2 (freq. 7), MostLikelihood assigns tA/P3 (normalized freq. 0.57), AllCombi assigns tA/P1 (sum 5 + 3 + 3 = 11)
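The three methods can be reimplemented from their descriptions and checked against the worked example above. This is a minimal sketch, not the authors' code; tie-breaking is unspecified on the slides, and here `max` simply keeps the first row it sees:

```python
from collections import defaultdict

# Per-pair statistics for query "tA tB tC", copied from the slide:
# each row is ((term1, pos1), (term2, pos2), freq, normalized_freq).
stats = [
    (("tA", "P1"), ("tB", "P2"), 5, 0.33),
    (("tA", "P1"), ("tB", "P3"), 3, 0.20),
    (("tA", "P2"), ("tB", "P4"), 7, 0.47),
    (("tA", "P1"), ("tC", "P2"), 3, 0.43),
    (("tA", "P3"), ("tC", "P3"), 4, 0.57),
    (("tB", "P1"), ("tC", "P2"), 5, 0.5),
    (("tB", "P2"), ("tC", "P2"), 5, 0.5),
]

def rows_for(term):
    """All (pos, freq, normalized_freq) observations of `term` across pairs."""
    out = []
    for a, b, f, n in stats:
        for t, p in (a, b):
            if t == term:
                out.append((p, f, n))
    return out

def max_freq(term):
    # MaxFreq: POS tag of the single highest-frequency row.
    return max(rows_for(term), key=lambda r: r[1])[0]

def most_likelihood(term):
    # MostLikelihood: POS tag of the highest normalized frequency.
    return max(rows_for(term), key=lambda r: r[2])[0]

def all_combi(term):
    # AllCombi: sum raw frequency per POS tag, take the largest sum.
    sums = defaultdict(int)
    for p, f, _n in rows_for(term):
        sums[p] += f
    return max(sums, key=sums.get)

# For tA: MaxFreq -> P2 (freq 7), MostLikelihood -> P3 (0.57),
# AllCombi -> P1 (5 + 3 + 3 = 11), matching the slides.
```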
Experiment
•  Datasets
o  TREC Web track topics
•  200 queries from 2009-2012
o  MS-251
•  Microsoft search log used in related studies [3][4]
•  Large-scale Web corpus
o  ClueWeb09 Category B
•  50 million Web documents
•  Evaluation methods
o  Proposed methods: MaxFreq, MostLikelihood, AllCombi
o  Existing methods: Stanford, Caseless, SingleFreq
[3] Bendersky et al.: "Structural Annotation of Search Queries Using Pseudo Relevance Feedback", CIKM2010.
[4] K. Ganchev et al.: "Using Search-Logs to Improve Query Tagging", ACL2012.
SingleFreq: the most frequently appearing POS tag is assigned
MS-251 evaluation details are skipped because the trend is the same
POS-tagged Web track topics
•  AllCombi: the highest for all query terms, common nouns, and proper nouns
o  Good at judging nouns
o  Considering more diversified context is useful
•  Global statistics from a large-scale Web corpus are useful
•  MaxFreq and MostLikelihood: the highest for common nouns, verbs, and adjectives
•  Every proposed method significantly outperformed Caseless (sign test)
Precision per POS tag (sign test vs. Caseless):
Method          All query terms  Common noun  Proper noun  Verb   Adjective  Sign test
MaxFreq         .814             .825         .833         .769   .647       p < 0.05
MostLikelihood  .814             .825         .833         .769   .647       p < 0.05
AllCombi        .821             .825         .860         .714   .629       p < 0.01
Caseless        .763             .789         .751         .733   .690
SingleFreq      .702             .775         .670         .533   .581
Stanford        .547             .550         1.0          .722   .451
Effect of the proposed method
•  AllCombi correctly identified many query terms
•  Some errors caused by partial grammatical rules still remain
•  Negative effects of the proposed method
o  “president” in the corpus is often identified as a proper noun
•  Need to normalize term weights
Example queries compared (Stanford vs. AllCombi): “obama”, “india”, “rif carlton”, “lower heart rate”, “gs pay rate”, “president united states”
Conclusion
•  POS tagging for Web queries
o  Results of sentence-level morphological analysis
o  Large-scale Web corpus
o  Proposed three scoring methods
•  Experiments
o  Considering more diversified context is useful
o  The best proposed method differs by POS tag
o  Outperformed existing tools and prior studies
•  Future work
o  Combination of proposed methods may improve accuracy
o  Database schema design for fast POS tagging
Default model
POS tags         Precision  Recall
Common noun      .550       .985
Proper noun      1.0        .010
Verb             .722       .867
Adjective        .451       .958
All query terms  .547       .547
•  Nearly half of query terms
were assigned correct POS tags
•  Almost all of proper nouns
were not identified
o  72% of proper nouns are
mistakenly assigned as common
nouns
o  Error: “obama”, “india”, “ritz
carlton”, “discovery channel”
•  Errors caused by partial grammatical rules
o  “lower heart rate”: “lower” (a verb here) is mistakenly tagged as an adjective (adjectives come before common nouns)
o  “gs pay rate”: “pay” (a common noun here) is mistakenly tagged as a verb (verbs come after a subject)
Caseless model
•  Precision and recall improved overall
•  Many proper nouns were identified
o  31% of proper nouns are mistakenly assigned as common nouns
o  Precision is decreased
•  Harm of partial grammatical rules still exists
o  “discovery channel store”: the common noun “store” is mistakenly tagged as a proper noun
POS tags         Precision  Recall
Common noun      .789       .769
Proper noun      .751       .640
Verb             .733       .733
Adjective        .690       .833
All query terms  .763       .763
MS-251
•  The trend among the proposed methods is the same
o The ratio of POS tags affected the ranking
•  AllCombi: good at judging nouns
•  MaxFreq, MostLikelihood: good at judging verbs and adjectives
o The proposed methods are better than [4]
Precision:
MaxFreq: .890
MostLikelihood: .895
AllCombi: .893
Best method in [4]: .858
[4] K. Ganchev et al.: "Using Search-Logs to Improve Query Tagging", ACL2012.

More Related Content

PPT
Information extraction for Free Text
PDF
Grosof haley-talk-semtech2013-ver6-10-13
PDF
Applications of Word Vectors in Text Retrieval and Classification
KEY
The Semantic Web meets the Code of Federal Regulations
PDF
NLP for Everyday People
PPT
QALL-ME: Ontology and Semantic Web
PPTX
Enriching the semantic web tutorial session 1
PPTX
A Knowledge-Light Approach to Luo Machine Translation and Part-of-Speech Tagging
Information extraction for Free Text
Grosof haley-talk-semtech2013-ver6-10-13
Applications of Word Vectors in Text Retrieval and Classification
The Semantic Web meets the Code of Federal Regulations
NLP for Everyday People
QALL-ME: Ontology and Semantic Web
Enriching the semantic web tutorial session 1
A Knowledge-Light Approach to Luo Machine Translation and Part-of-Speech Tagging

What's hot (17)

PPTX
Info 2402 irt-chapter_4
ODP
Information Extraction from the Web - Algorithms and Tools
PDF
Linguistic markup and transclusion processing in XML documents
PDF
Netflix Global Search - Lucene Revolution
PDF
Bio ontologies and semantic technologies
PDF
Deep Natural Language Processing for Search and Recommender Systems
PDF
Bio ontologies and semantic technologies
PPTX
Deep natural language processing in search systems
PPTX
2017 biological databases_part1_vupload
PDF
Neural Architectures for Named Entity Recognition
PPTX
KIT Graduiertenkolloquium 11.05.2016
PDF
PyGotham NY 2017: Natural Language Processing from Scratch
PDF
Phd tesis olga giraldo 10mayo
PDF
Ting-Hao (Kenneth) Huang - 2015 - ACBiMA: Advanced Chinese Bi-Character Word ...
PPTX
NAMED ENTITY RECOGNITION
PDF
master_thesis_greciano_v2
PPTX
PhD Comprehensive exam of Masud Rahman
Info 2402 irt-chapter_4
Information Extraction from the Web - Algorithms and Tools
Linguistic markup and transclusion processing in XML documents
Netflix Global Search - Lucene Revolution
Bio ontologies and semantic technologies
Deep Natural Language Processing for Search and Recommender Systems
Bio ontologies and semantic technologies
Deep natural language processing in search systems
2017 biological databases_part1_vupload
Neural Architectures for Named Entity Recognition
KIT Graduiertenkolloquium 11.05.2016
PyGotham NY 2017: Natural Language Processing from Scratch
Phd tesis olga giraldo 10mayo
Ting-Hao (Kenneth) Huang - 2015 - ACBiMA: Advanced Chinese Bi-Character Word ...
NAMED ENTITY RECOGNITION
master_thesis_greciano_v2
PhD Comprehensive exam of Masud Rahman
Ad

Similar to Part-of-speech Tagging for Web Search Queries Using a Large-scale Web Corpus (20)

PDF
Natural Language Processing using Java
PDF
Applications of Large Language Models in Materials Discovery and Design
PDF
Improved chemical text mining of patents using infinite dictionaries, transla...
PDF
NLP Data Cleansing Based on Linguistic Ontology Constraints
PPTX
Haystack 2018 - Algorithmic Extraction of Keywords Concepts and Vocabularies
ODP
SIGIR 2011
PDF
Introduction of semantic technology for SAS programmers
PPTX
Spoken Content Retrieval
PPTX
stemming and tokanization in corpus.pptx
PDF
Towards a Quality Assessment of Web Corpora for Language Technology Applications
DOCX
UNDERSTAND SHORTTEXTS BY HARVESTING & ANALYZING SEMANTIKNOWLEDGE
PDF
Natural Language Processing, Techniques, Current Trends and Applications in I...
PPTX
C:\Fakepath\Learning Through Conversation
PPTX
EKAW 2016 - TechMiner: Extracting Technologies from Academic Publications
PPT
RFS Search Lang Spec
PDF
Aspects of NLP Practice
PPTX
Reflected Intelligence - Lucene/Solr as a self-learning data system: Presente...
PPTX
Semantic Technologies and Programmatic Access to Semantic Data
PDF
The Nature of Information
Natural Language Processing using Java
Applications of Large Language Models in Materials Discovery and Design
Improved chemical text mining of patents using infinite dictionaries, transla...
NLP Data Cleansing Based on Linguistic Ontology Constraints
Haystack 2018 - Algorithmic Extraction of Keywords Concepts and Vocabularies
SIGIR 2011
Introduction of semantic technology for SAS programmers
Spoken Content Retrieval
stemming and tokanization in corpus.pptx
Towards a Quality Assessment of Web Corpora for Language Technology Applications
UNDERSTAND SHORTTEXTS BY HARVESTING & ANALYZING SEMANTIKNOWLEDGE
Natural Language Processing, Techniques, Current Trends and Applications in I...
C:\Fakepath\Learning Through Conversation
EKAW 2016 - TechMiner: Extracting Technologies from Academic Publications
RFS Search Lang Spec
Aspects of NLP Practice
Reflected Intelligence - Lucene/Solr as a self-learning data system: Presente...
Semantic Technologies and Programmatic Access to Semantic Data
The Nature of Information
Ad

Recently uploaded (20)

PPTX
nose tajweed for the arabic alphabets for the responsive
PPTX
Sustainable Forest Management ..SFM.pptx
PPTX
FINAL TEST 3C_OCTAVIA RAMADHANI SANTOSO-1.pptx
PPTX
BIOLOGY TISSUE PPT CLASS 9 PROJECT PUBLIC
PDF
Yusen Logistics Group Sustainability Report 2024.pdf
DOCX
ENGLISH PROJECT FOR BINOD BIHARI MAHTO KOYLANCHAL UNIVERSITY
PPTX
2025-08-10 Joseph 02 (shared slides).pptx
PPT
The Effect of Human Resource Management Practice on Organizational Performanc...
PPTX
fundraisepro pitch deck elegant and modern
PPTX
PHIL.-ASTRONOMY-AND-NAVIGATION of ..pptx
PPTX
3RD-Q 2022_EMPLOYEE RELATION - Copy.pptx
PPTX
Project and change Managment: short video sequences for IBA
PDF
Tunisia's Founding Father(s) Pitch-Deck 2022.pdf
PDF
PM Narendra Modi's speech from Red Fort on 79th Independence Day.pdf
PPTX
AcademyNaturalLanguageProcessing-EN-ILT-M02-Introduction.pptx
PPTX
Tablets And Capsule Preformulation Of Paracetamol
PPTX
Impressionism_PostImpressionism_Presentation.pptx
PPTX
ART-APP-REPORT-FINctrwxsg f fuy L-na.pptx
PDF
Presentation1 [Autosaved].pdf diagnosiss
PPTX
ANICK 6 BIRTHDAY....................................................
nose tajweed for the arabic alphabets for the responsive
Sustainable Forest Management ..SFM.pptx
FINAL TEST 3C_OCTAVIA RAMADHANI SANTOSO-1.pptx
BIOLOGY TISSUE PPT CLASS 9 PROJECT PUBLIC
Yusen Logistics Group Sustainability Report 2024.pdf
ENGLISH PROJECT FOR BINOD BIHARI MAHTO KOYLANCHAL UNIVERSITY
2025-08-10 Joseph 02 (shared slides).pptx
The Effect of Human Resource Management Practice on Organizational Performanc...
fundraisepro pitch deck elegant and modern
PHIL.-ASTRONOMY-AND-NAVIGATION of ..pptx
3RD-Q 2022_EMPLOYEE RELATION - Copy.pptx
Project and change Managment: short video sequences for IBA
Tunisia's Founding Father(s) Pitch-Deck 2022.pdf
PM Narendra Modi's speech from Red Fort on 79th Independence Day.pdf
AcademyNaturalLanguageProcessing-EN-ILT-M02-Introduction.pptx
Tablets And Capsule Preformulation Of Paracetamol
Impressionism_PostImpressionism_Presentation.pptx
ART-APP-REPORT-FINctrwxsg f fuy L-na.pptx
Presentation1 [Autosaved].pdf diagnosiss
ANICK 6 BIRTHDAY....................................................

Part-of-speech Tagging for Web Search Queries Using a Large-scale Web Corpus

  • 1. ◯Atsushi Keyaki†, Jun Miyazaki† †: Tokyo Institute of Technology, Japan Part-­‐‑of-­‐‑speech  Tagging  for   Web  Search  Queries  using  a   Large-­‐‑scale  Web  Corpus SAC2017  IAR
  • 2. Objective •  Accurate part-of-speech (POS) tagging to Web queries o POS tags are beneficial in accurate IR •  Different search strategy per POS tag [1] •  Identifying unnecessary data with POS tags [2] o Example •  Query: “discovery channel” •  Doc: “Victim’s discovery is broadcasted by the channel” 2 [1]  Crestani  et  al.:  “Short  Queries,  Natural  Language  and  Spoken  Document              Retrieval:  Experiments  at  Glasgow  University”,  TREC-­‐‑6,  1998. [2]  Chowdhury  and  Mccabe:  “Improving  Information  Retrieval  Systems  using            Part  of  Speech  Tagging”,  Univ.  of  Maryland,  1993. POS  tag  mismatch  may  cause  false  positive TV  program  (proper  nouns) common  noun common noun
  • 3. Difficulty  in  query  POS  tagging •  Characteristics of Web query o  Length is short (composed of a few words) o  Capitalization is missing o  Word order is fairly free •  Solution of related work [3][4] o  Utilizing the results of sentence-level morphological analysis •  Sentences are based on natural language grammar •  Results of sentence-level morphological analysis are accurate 3 Difficult  to  correctly  identify  POS  tags with  existing  morphological  analysis  tool [3]  Bendersky  et  al.:  "ʺStructural  Annotation  of  Search  Queries  Using  Pseudo              Relevance  Feedback"ʺ,  CIKM2010. [4]  K.  Ganchev  et  al.:  "ʺUsing  Search-­‐‑Logs  to  Improve  Query  Tagging"ʺ,  ACL2012. developed  for   natural  language Sentence:  “We        stayed        at              Rif  Carlton.” Query          :  “rif  carlton”
  • 4. Difficulty  in  query  POS  tagging •  Characteristics of Web query o  Length is short (composed of a few words) o  Capitalization is missing o  Word order is fairly free •  Solution of related work [3][4] o  Utilizing the results of sentence-level morphological analysis •  Sentences are based on natural language grammar •  Results of sentence-level morphological analysis are accurate 4 Difficult  to  correctly  identify  POS  tags with  existing  morphological  analysis  tool [3]  Bendersky  et  al.:  "ʺStructural  Annotation  of  Search  Queries  Using  Pseudo              Relevance  Feedback"ʺ,  CIKM2010. [4]  K.  Ganchev  et  al.:  "ʺUsing  Search-­‐‑Logs  to  Improve  Query  Tagging"ʺ,  ACL2012. developed  for   natural  language Sentence:  “We        stayed        at              Rif  Carlton.” pronoun  verb  particle    proper  noun Query          :  “rif  carlton”
  • 5. Difficulty  in  query  POS  tagging •  Characteristics of Web query o  Length is short (composed of a few words) o  Capitalization is missing o  Word order is fairly free •  Solution of related work [3][4] o  Utilizing the results of sentence-level morphological analysis •  Sentences are based on natural language grammar •  Results of sentence-level morphological analysis are accurate 5 Difficult  to  correctly  identify  POS  tags with  existing  morphological  analysis  tool [3]  Bendersky  et  al.:  "ʺStructural  Annotation  of  Search  Queries  Using  Pseudo              Relevance  Feedback"ʺ,  CIKM2010. [4]  K.  Ganchev  et  al.:  "ʺUsing  Search-­‐‑Logs  to  Improve  Query  Tagging"ʺ,  ACL2012. Sentence:  “We        stayed        at              Rif  Carlton.” pronoun  verb  particle    proper  noun proper  nounQuery          :  “rif  carlton” developed  for   natural  language
  • 6. Difficulty  in  query  POS  tagging •  Characteristics of Web query o  Length is short (composed of a few words) o  Capitalization is missing o  Word order is fairly free •  Solution of related work [3][4] o  Utilizing the results of sentence-level morphological analysis •  Sentences are based on natural language grammar •  Results of sentence-level morphological analysis are accurate 6 Difficult  to  correctly  identify  POS  tags with  existing  morphological  analysis  tool [3]  Bendersky  et  al.:  "ʺStructural  Annotation  of  Search  Queries  Using  Pseudo              Relevance  Feedback"ʺ,  CIKM2010. [4]  K.  Ganchev  et  al.:  "ʺUsing  Search-­‐‑Logs  to  Improve  Query  Tagging"ʺ,  ACL2012. Sentence:  “We        stayed        at              Rif  Carlton.” pronoun  verb  particle    proper  noun proper  nounQuery          :  “rif  carlton” developed  for   natural  language Frequently   assigned  POS  tag   is  employed
  • 7. Our  approach •  Related study o Using sentence-level morphological analysis of •  Search results [3] •  Snippet from search logs [4] o Considering just freq. of assigned POS tags •  Our approach o Taking account of global statistics from large corpus •  Easily available, considering long tail o Considering co-occurrence of query terms April 5, 2017SAC2017 IAR 7 [3]  Bendersky  et  al.:  "ʺStructural  Annotation  of  Search  Queries  Using  Pseudo              Relevance  Feedback"ʺ,  CIKM2010. [4]  K.  Ganchev  et  al.:  "ʺUsing  Search-­‐‑Logs  to  Improve  Query  Tagging"ʺ,  ACL2012. A  small  number  of  highly  relevant  information User  feedback/search  log  is  not  always  available
  • 8. Preliminary  investigation •  Morphological analysis to Web queries o Queries •  TREC Web track topics (200 queries from 2009-2012) o  Oracle POS tags are annotated by three assessors o  Referring to description (information need) o Morphological analysis tool •  Stanford Log-linear Part-Of-Speech Tagger [5] o Model •  Default model •  Caseless model o  Not consider capitalization information during training o  Try to solve “Capitalization is missing” problem April 5, 2017SAC2017 IAR 8 [5]  Toutanova  et  al.:  "ʺFeature-­‐‑Rich  Part-­‐‑of-­‐‑Speech  Tagging            with  a  Cyclic  Dependency  Network"ʺ,  NAACL  2003. High  agreement Kappa:  0.98
  • 9. Summary  of  error  analysis •  Default model o  Only half of query terms were assigned correct POS tags o  Almost all of proper nouns were NOT identified •  72% of proper nouns are mistakenly assigned as common nouns •  Error: “obama”, “india”, “ritz carlton”, “discovery channel” •  Caseless model o  Around 75% of query terms were assigned correct POS tags o  Many proper nouns were identified •  Common nouns are mistakenly identified as proper nouns •  Errors caused by a partial grammatical rule o  “lower heart rate” o  “gs pay rate” April 5, 2017SAC2017 IAR 9 verb adjective common  noun verb :  Adjectives  come  before  common  nouns :  Verbs  come  after  a  subject
  • 10. Proposed  POS  tagging •  Summary of the error analysis o  Proper nouns/common nouns cannot be identified •  Problem1: Capitalization is missing o  Grammatical rules are mistakenly applied •  Problem2: Word order is fairly free •  Related study o  A small num. of highly relevant information •  Problem3: User feedback and user log are not always available •  Approach o  Sol-P1: Sentence-level morphological analysis o  Sol-P2: Proposing a POS tagging not based on word order o  Sol-P3: Large-scale Web corpus (easily available) o  Building the term-POS database (TPDB) •  Morphological analysis are applied offline April 5, 2017SAC2017 IAR 10
  • 11. Processing  flow April 5, 2017SAC2017 IAR 11 Large-scale Web corpus S1 tA/P1 tB/P2 tC/P3tA tB tC tA tC tD tC tE tA tF tA/P1 tC/P4 tD/P5 tC/P3 tE/P1 tA/P2 tF/P1 tB tD tB/P2 tD/P3 Morphological analysis S2 S3 S4 S1 S2 S3 S4 TPDB tA/P1 tB/P2 tC/P3 tA/P1 tC/P4 tD/P5 tC/P3 tE/P1 tA/P2 tA/P1 S1 S2 S3 tA tC Query tA/P1 tC/P3 tA/P1 tC/P4 Scoring method Offline Online Insert
  • 12. Scoring  for  POS  tagging •  Design principle o  Frequently appearing POS tags in the corpus are assigned to queries o  POS tags of a sentence are emphasized when the sentence contains more kinds of query terms •  Co-occurrence of query terms is a useful clue •  Step of scoring o  Retrieving entries which contain query terms from TPDB o  Braking down into pairs of query terms •  Query: “tA tB tC” o  Counting entries per the term-POS pairs for each query term pair •  Query term pair: {tA tB} o  Scoring with three proposed methods April 5, 2017 12 {tA  tB}  {tA  tC}  {tB  tC} tA/P1 tB/P2 5 0.33 (5/15) tA/P1 tB/P3 3 0.20 (3/15) tA/P2 tB/P4 7 0.47 (7/15) freq. normalized freq. num.  of  entries   containing   tA/P1 and tB/P2
  • 13. Three  proposed  methods •  MaxFreq o  The most frequently appearing POS tag (highest freq.) is assigned •  MostLikelihood o  The highest normalized freq. is assigned o  MaxFreq may be affected by frequently appearing terms •  AllCombi o  POS tag of the highest sum of the term-POS frequency is assigned o  MaxFreq and MostLikelihood only focus on a POS tag with the highest frequency/normalized frequency o  More diversified context including long tail can be considered April 5, 2017SAC2017 IAR 13 Query: “tA tB tC” tA:tB tA/P1 tB/P2 5 0.33 tA/P1 tB/P3 3 0.20 tA/P2 tB/P4 7 0.47 tA:tC tA/P1 tC/P2 3 0.43 tA/P3 tC/P3 4 0.57 tB:tC tB/P1 tC/P2 5 0.5 tB/P2 tC/P2 5 0.5 freq. normalized freq.
• 14.–18. Three proposed methods (animation builds of slide 13) On the example above, MaxFreq assigns tA/P2 (highest freq., 7), MostLikelihood assigns tA/P3 (highest normalized freq., 0.57), and AllCombi assigns tA/P1 (highest summed freq., 5 + 3 + 3 = 11).
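A minimal sketch of the three methods, run on the example statistics from these slides. The data structure is an assumption: when tagging tA, only the pairs involving tA and tA's tag in each entry are needed.

```python
from collections import defaultdict

# Pair statistics for query "tA tB tC", copied from the slides' example.
# Each tuple: (POS of tA in the entry, freq., normalized freq. in the pair).
stats = {
    ("tA", "tB"): [("P1", 5, 0.33), ("P1", 3, 0.20), ("P2", 7, 0.47)],
    ("tA", "tC"): [("P1", 3, 0.43), ("P3", 4, 0.57)],
}

def max_freq(stats):
    """MaxFreq: the POS tag with the single highest raw frequency is assigned."""
    return max((e for es in stats.values() for e in es), key=lambda e: e[1])[0]

def most_likelihood(stats):
    """MostLikelihood: the POS tag with the highest normalized freq. is assigned."""
    return max((e for es in stats.values() for e in es), key=lambda e: e[2])[0]

def all_combi(stats):
    """AllCombi: sum raw frequencies per POS tag over all pairs; highest sum wins."""
    totals = defaultdict(int)
    for es in stats.values():
        for pos, freq, _ in es:
            totals[pos] += freq
    return max(totals, key=totals.get)

print(max_freq(stats))         # P2 (freq. 7)
print(most_likelihood(stats))  # P3 (normalized freq. 0.57)
print(all_combi(stats))        # P1 (5 + 3 + 3 = 11)
```

The three results match the tags highlighted in the slide builds, and show how AllCombi's summation lets the lower-frequency P1 entries outvote the single high-frequency P2 entry.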
• 19. Experiment
•  Datasets
o  TREC Web track topics
•  200 queries from 2009–2012
o  MS-251
•  Microsoft search log used in related studies [3][4]
•  Large-scale Web corpus
o  ClueWeb09 Category B
•  50 million Web documents
•  Evaluated methods
o  Proposed methods: MaxFreq, MostLikelihood, AllCombi
o  Existing methods: Stanford, Caseless, SingleFreq (the most frequently appearing POS tag is assigned)
o  Some results are skipped because the trend is the same
[3] Bendersky et al.: "Structural Annotation of Search Queries Using Pseudo Relevance Feedback", CIKM 2010.
[4] K. Ganchev et al.: "Using Search-Logs to Improve Query Tagging", ACL 2012.
• 20. POS-tagged Web track topics
•  AllCombi: the highest for all query terms, common noun, and proper noun
o  Good at judging nouns
o  Considering more diversified context is useful
•  Global statistics from a large-scale Web corpus are useful
•  MaxFreq and MostLikelihood: the highest for common noun, verb, and adjective
•  Every proposed method significantly outperformed Caseless (sign test)

Precision        All query terms  Common noun  Proper noun  Verb  Adjective  Sign test vs. Caseless
MaxFreq          .814             .825         .833         .769  .647       p < 0.05
MostLikelihood   .814             .825         .833         .769  .647       p < 0.05
AllCombi         .821             .825         .860         .714  .629       p < 0.01
Caseless         .763             .789         .751         .733  .690
SingleFreq       .702             .775         .670         .533  .581
Stanford         .547             .550         1.0          .722  .451
• 21. Effect of the proposed method
•  AllCombi correctly identified many query terms
•  Some errors caused by partial grammatical rules still remain
•  Negative effect of the proposed method
o  "president" in the corpus is often identified as a proper noun
•  Need to normalize term weights
[Table: per-query comparison of Stanford vs. AllCombi on the queries "obama", "india", "rif carlton", "lower heart rate", "gs pay rate", "president united states"; the correctness marks were lost in extraction.]
• 22. Conclusion
•  POS tagging for Web queries
o  Uses the results of sentence-level morphological analysis
o  Uses a large-scale Web corpus
o  Proposed three scoring methods
•  Experiments
o  Considering more diversified context is useful
o  The best proposed method differs by POS tag
o  Outperformed existing tools and prior studies
•  Future work
o  Combining the proposed methods may improve accuracy
o  Database schema design for fast POS tagging
• 23. Default model

POS tag          Precision  Recall
Common noun      .550       .985
Proper noun      1.0        .010
Verb             .722       .867
Adjective        .451       .958
All query terms  .547       .547

•  Nearly half of the query terms were assigned correct POS tags
•  Almost none of the proper nouns were identified
o  72% of proper nouns were mistakenly tagged as common nouns
o  Errors: "obama", "india", "ritz carlton", "discovery channel"
•  Errors caused by partial grammatical rules
o  "lower heart rate": "lower" (verb) mis-tagged as adjective — the rule "adjectives come before common nouns" was misapplied
o  "gs pay rate": "pay" (common noun) mis-tagged as verb — the rule "verbs come after a subject" was misapplied
• 24. Caseless model
•  Precision and recall improved overall
•  Many proper nouns were identified
o  31% of proper nouns are still mistakenly tagged as common nouns
o  Precision decreased
•  Harm of partial grammatical rules still exists
o  "discovery channel store": common noun assigned where a proper noun is correct

POS tag          Precision  Recall
Common noun      .789       .769
Proper noun      .751       .640
Verb             .733       .733
Adjective        .690       .833
All query terms  .763       .763
• 25. MS-251
•  The trend of the proposed methods is the same as for the Web track topics
o  The ratio of POS tags affects the ordering
•  AllCombi: good at judging nouns
•  MaxFreq, MostLikelihood: good at judging verbs and adjectives
o  The proposed methods are better than [4]

Precision
MaxFreq              .890
MostLikelihood       .895
AllCombi             .893
Best method in [4]   .858

[4] K. Ganchev et al.: "Using Search-Logs to Improve Query Tagging", ACL 2012.