Multilingual Named Entity Recognition
           using Wikipedia
    Laboratory for Knowledge Discovery in Databases
   Department of Computing and Information Sciences
                 Kansas State University
     http://www.kddresearch.org/tikiwiki/tiki-index.php




              Presenter: Svitlana O. Volkova
                 Instructor: William Hsu
AGENDA

I.     Project Overview
II.    Crawling Wikipedia
III.   Synonymy Discovery with Google Sets
IV.    Experiment Design
V.     Conclusions
PROJECT MILESTONES

1. CRAWLING WIKIPEDIA
   Input: Crawler Functionality
   Output: Set of Multilingual Gazetteers

2. RELATIONSHIP DISCOVERY WITH GOOGLESETS
   Input: Initial Gazetteer in One Language
   Output: Extended Gazetteer with Synonyms

3. MULTILINGUAL NER TASK
   Input: Extended Gazetteer with Synonyms + Content
   Output: Extracted Entities from the Content
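
The second milestone (turning an initial gazetteer into an extended one) can be sketched roughly as follows. This is only an illustration under stated assumptions: `extend_gazetteer`, `find_related`, and `fake_find_related` are hypothetical names, the canned lookup stands in for a set-expansion service, and none of it is taken from the project's code.

```python
def extend_gazetteer(gazetteer, find_related):
    """Return a new gazetteer with related terms added under each canonical name.

    gazetteer: dict mapping canonical name -> set of known surface forms.
    find_related: callable taking a list of seed terms and returning candidate
        terms; a hypothetical stand-in for a set-expansion service.
    """
    extended = {}
    for canonical, forms in gazetteer.items():
        seeds = sorted(forms | {canonical})
        # Drop empty or whitespace-only candidates before merging.
        candidates = {term.strip() for term in find_related(seeds) if term.strip()}
        extended[canonical] = set(forms) | candidates
    return extended


def fake_find_related(seeds):
    """Canned lookup used only to make the sketch runnable (assumption)."""
    canned = {"Foot-and-mouth disease": ["FMD", "hoof-and-mouth disease", " "]}
    return [term for seed in seeds for term in canned.get(seed, [])]


initial = {"Foot-and-mouth disease": {"foot and mouth disease"}, "Surra": set()}
extended = extend_gazetteer(initial, fake_find_related)
```

Keeping every synonym under its canonical name means the later extraction step can report a canonical disease name for any matched surface form.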
KEY IDEA - WIKIPEDIA
  • Apply Wikipedia knowledge representation for multilingual information extraction

English Wiki Concepts of Interest
  …, anthrax, bovine virus, …, camelpox, surra, …

http://wiki.digitalmethods.net/Dmi/WikipediaAnalysis

Russian Wiki Concepts of Interest
  …, Зоонозы (zoonoses), Классическая чума свиней (classical swine fever), Лептоспироз (leptospirosis), …
CRAWLING WIKIPEDIA

Multilingual NER (article + category + interwiki links)

Wiki Category Graph and Article Graph
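
One way the category graph, article graph, and interwiki links might be combined into per-language gazetteers is sketched below. The toy dictionaries and the names `collect_articles` and `build_gazetteers` are assumptions made for illustration; the actual crawler is not described on the slides.

```python
from collections import deque

# Toy stand-ins for crawled Wikipedia data (hypothetical, for illustration only).
SUBCATEGORIES = {
    "Animal diseases": ["Zoonoses"],
    "Zoonoses": [],
}
ARTICLES_IN_CATEGORY = {
    "Animal diseases": ["Camelpox", "Surra"],
    "Zoonoses": ["Anthrax", "Leptospirosis"],
}
# Interwiki links: English article title -> {language code: title in that language}.
INTERWIKI = {
    "Leptospirosis": {"ru": "Лептоспироз", "de": "Leptospirose"},
    "Anthrax": {"ru": "Сибирская язва"},
}


def collect_articles(root_category):
    """Breadth-first walk of the category graph, gathering member articles."""
    seen, queue, articles = {root_category}, deque([root_category]), []
    while queue:
        category = queue.popleft()
        articles.extend(ARTICLES_IN_CATEGORY.get(category, []))
        for child in SUBCATEGORIES.get(category, []):
            if child not in seen:  # category graphs can contain cycles
                seen.add(child)
                queue.append(child)
    return articles


def build_gazetteers(root_category):
    """Map each language code to the set of article titles found for it."""
    gazetteers = {"en": set()}
    for title in collect_articles(root_category):
        gazetteers["en"].add(title)
        for lang, translated in INTERWIKI.get(title, {}).items():
            gazetteers.setdefault(lang, set()).add(translated)
    return gazetteers


gazetteers = build_gazetteers("Animal diseases")
```

In this toy data the Russian and German sets come out smaller than the English one because not every article has an interwiki link.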
GAZETTEERS EXAMPLES IN DIFFERENT LANGUAGES
GAZETTEERS SIZE IN DIFFERENT LANGUAGES

[Chart: gazetteer size by language. Legend: English, Japanese, German, Russian. Values shown: 19, 37, 86, 20.]

Decision: the dictionaries are too small, so we need a way to extend them.
GAZETTEERS EXAMPLES:
GERMAN GOOGLE SETS OUTPUT
EXPERIMENT SETUP
  • Purpose: perform the named entity recognition task in a specific domain and report the accuracy of extraction using
      a) Wikipedia knowledge
      b) lists extended with synonyms from Google Sets

  • Hypothesis: the synonym extraction phase is essential for increasing the accuracy of the information extraction task
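
The slides do not say how accuracy would be computed. A common choice for extraction tasks is precision, recall, and F1 over exactly matching spans, and the sketch below assumes that choice. The function name and all span values are invented for illustration and are not experimental results.

```python
def precision_recall_f1(predicted, gold):
    """Score predicted entity spans against gold spans using exact matching."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Toy data (invented): spans are (start, end) character offsets in one document.
gold_spans = [(0, 21), (24, 26), (60, 72)]
wiki_only_spans = [(0, 21)]                     # a) Wikipedia gazetteer
extended_spans = [(0, 21), (24, 26), (40, 45)]  # b) gazetteer extended with synonyms

wiki_scores = precision_recall_f1(wiki_only_spans, gold_spans)
extended_scores = precision_recall_f1(extended_spans, gold_spans)
```

Scoring both configurations against the same annotated spans is what would allow the hypothesis to be checked, which is why annotated data in each language is needed.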
DISEASE EXTRACTOR MODULE INPUT AND OUTPUT
Input: text from file
Module: Disease Extractor
Output:
  • Index of the first character
  • Index of the last character
  • Length of the matched text
  • Matched text
  • Canonical disease name

DISEASE EXTRACTION TASK
  • Disease recognition can be treated as an NER / information extraction (IE) task
  • The main purpose is to retrieve tokens that match at least one term, including synonyms and abbreviations, from the list of animal disease names
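
A minimal sketch of gazetteer-based matching that returns the five output fields listed above. The gazetteer contents, the function names, and the choice of a single case-insensitive regular expression are assumptions for illustration; the module's real implementation is not shown in the slides.

```python
import re

# Hypothetical gazetteer: surface form (synonym or abbreviation) -> canonical name.
GAZETTEER = {
    "foot and mouth disease": "Foot-and-mouth disease",
    "fmd": "Foot-and-mouth disease",
    "rift valley fever": "Rift Valley fever",
    "leptospirose": "Leptospirosis",
    "лептоспироз": "Leptospirosis",
}


def compile_gazetteer(gazetteer):
    """Build one case-insensitive pattern, trying longer surface forms first."""
    forms = sorted(gazetteer, key=len, reverse=True)
    alternatives = "|".join(re.escape(form) for form in forms)
    # \b keeps short abbreviations such as "fmd" from matching inside longer words.
    return re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)


def extract_diseases(text, gazetteer=GAZETTEER):
    """Return one record per match with the five output fields."""
    pattern = compile_gazetteer(gazetteer)
    # Case-folded lookup so any casing of a surface form maps to its canonical name.
    lookup = {form.casefold(): name for form, name in gazetteer.items()}
    records = []
    for match in pattern.finditer(text):
        records.append({
            "start": match.start(),    # index of the first character
            "end": match.end() - 1,    # index of the last character (inclusive)
            "length": match.end() - match.start(),
            "matched_text": match.group(0),
            "canonical_name": lookup[match.group(0).casefold()],
        })
    return records


sample = "Foot and mouth disease (FMD) is highly contagious."
results = extract_diseases(sample)
```

Because Python regular expressions on `str` are Unicode-aware, the same matcher handles Cyrillic entries without extra configuration. Inflected forms of a disease name would not match unless they are listed as surface forms.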
CONTEXT EXAMPLES IN DIFFERENT LANGUAGES
DUTCH
  • Leptospirose komt voor in alle landen, behalve het Noordpoolgebied. De incidentie is hoog. Meer dan de helft van de gevallen voordoet in ernstige en vereiste reanimatie.
CZECH
  • Leptospiróza se vyskytuje ve všech zemích s výjimkou Arktidy. Incidence je vysoká. Více než polovina případů se vyskytuje v těžké a vyžaduje resuscitaci.
GERMAN
  • Leptospirose tritt in allen Ländern, mit Ausnahme der Arktis. Die Inzidenz ist hoch. Mehr als die Hälfte der Fälle tritt in schweren und Reanimation erforderlich.
ITALIAN
  • Leptospirosi si verifica in tutti i paesi, tranne l'Artico. L'incidenza è alta. Più della metà dei casi si verifica in rianimazione grave e richiesti.
UKRAINIAN
  • Лептоспіроз відбувається в усіх країнах, за винятком Арктики. Захворюваність висока. Більше половини випадків відбувається в суворих і необхідність реанімації.
RUSSIAN
  • Лептоспироз происходит во всех странах, за исключением Арктики. Заболеваемость высокая. Более половины случаев происходит в суровых и необходимости реанимации.
DISEASE EXTRACTOR MODULE DEMO
http://fingolfin.user.cis.ksu.edu:8080/diseaseextractor/
RESULTS FOR DISEASE EXTRACTOR MODULE

INPUT A: Foot and mouth disease is one of the most contagious diseases of cloven-hooved mammals…
OUTPUT A: [figure]

INPUT B: Rift Valley Fever | CDC Special Pathogens Branch Mission Statement Disease …
OUTPUT B: [figure]
CONCLUSIONS
  • Applying Wikipedia knowledge to the multilingual NER task

  • Phase 1: Crawling Wikipedia – completed
  • Phase 2: Google Sets expansion – completed
  • Phase 3: Multilingual disease extraction – in progress

  • Novelty: overcoming Wikipedia's limitations by applying the Google Sets expansion approach

  • Estimating accuracy requires annotated data in different languages
ACKNOWLEDGEMENTS
  • Dr. William Hsu for meaningful guidance
  • John Drouhard for building the extraction architecture
  • Landon Fowles for expanding the gazetteers using Google Sets
