Laboratory for Knowledge Discovery in Databases
Entity Extraction, Animal Disease-related Event Recognition and Classification from Web
Presenter: Svitlana Volkova
Adviser: William H. Hsu
Committee: Dr. Doina Caragea, Dr. Gurdip Singh
Supported by: K-State National Agricultural Biosecurity Center (NABC), US Department of Defense

Agenda
Thesis "Entity Extraction, Animal Disease-related Event Recognition and Classification from Web", July 30 2010
Background & Motivation
Related Work
  Animal Disease Monitoring Systems
  Document Classification
  Entity and Relation Extraction
  Event Recognition
Framework for Epidemiological Analytics
Disease-related Document Classification
Domain-specific Entity Extraction
  Ontology-based Entity Extraction
  Sequence Labeling using Syntactic Features
Disease-related Event Recognition
Summary & Future Work

Motivation
influence on travel and trade
cause economic crises and political instability
zoonotic diseases can cause loss of life

Animal Disease Monitoring Systems: Manually Supported Web Interfaces (1)
International:
World Animal Health Information Database (WAHID) Interface - http://www.oie.int/wahis/public.php?page=home
WHO Global Atlas of Infectious Diseases - http://diseasemaps.usgs.gov/index.htm
Emergency Prevention System (EMPRES) for Transboundary Animal and Plant Pests and Diseases - http://www.fao.org/EMPRES/default.html

Animal Disease Monitoring Systems: Manually Supported Web Interfaces (2)
USA:
Centers for Disease Control and Prevention (CDC) - http://www.cdc.gov
U.S. Department of Agriculture (USDA) - http://www.usda.gov/wps/portal/usdahome
U.S. Geological Survey (USGS) and USGS National Wildlife Health Center (NWHC) - http://www.nwhc.usgs.gov
Iowa State University Center for Food Security and Public Health (CFSPH) - http://www.cfsph.iastate.edu
BioPortal - http://biocomputingcorp.com/bpsystem.html
FMD BioPortal - https://fmdbioportal.ucdavis.edu
United Kingdom:
Department for Environment, Food and Rural Affairs (DEFRA) - http://www.defra.gov.uk

Animal Disease Monitoring Systems: Automated Web Services (1)
BioCaster - http://biocaster.nii.ac.jp/
follows 1500 RSS feeds hourly
classifies documents as topically relevant or not
taxonomy of 4300 named entities (50 disease names, 243 country names, 4025 province/city names, latitudes and longitudes)
identifies 40 diseases at up to 25-30 locations per day
multilingual information extraction for English, French, Spanish, Chinese, Thai, Vietnamese, Japanese
uses ontology pattern matching approaches to recognize disease-location-verb pairs
plots events on a Google Map
does not classify events into categories and does not report past outbreaks
no timeline visualization

Animal Disease Monitoring Systems: Automated Web Services (2)
Information retrieval system MedISys - http://medusa.jrc.it/medisys/homeedition/all/home.html
Pattern-based Understanding and Learning System (PULS) - http://sysdb.cs.helsinki.fi/puls/jrc/all
allows automated recognition of metadata and structured facts related to disease outbreaks
collects an average of 50000 news articles per day from about 1400 news portals and about 150 specialized public health sites
43 languages
current ontology contains 2400 disease names, 400 organisms, 1500 political entities and over 70000 location names including towns, cities, provinces
real-time news clustering and filtering by matching 3000 patterns
does not classify events and does not report past outbreaks
MedISys is part of the Europe Media Monitor (EMM) product family - http://emm.jrc.it/overview.html

Animal Disease Monitoring Systems: Automated Web Services (3)
HealthMap - http://healthmap.org/en
aggregates articles from Google News and the ProMED-Mail portal
2300 locations and 1100 disease names
identifies 20-30 outbreaks per day
multiple languages: English, Russian, Arabic, French, Portuguese, Spanish, Chinese
manually supported system
EpiSpider - http://www.epispider.org/
combines emerging infectious disease data from:
ProMED-Mail - www.promedmail.org
The Global Disaster Alert Coordinating System (GDACS) - www.gdacs.org
Central Intelligence Agency (CIA) Factbook - https://www.cia.gov/library/publications/the-world-factbook/
The United Nations Human Development Report sites - http://hdr.undp.org/en

Animal Disease-related Data Online
Structured Data
Official reports by different organizations:
state and federal laboratories, bioportals;
health care providers;
governmental agricultural or environmental agencies.
Unstructured Data
Web pages
News
E-mails (e.g., ProMED-Mail)
Blogs
Medical literature (e.g., books)
Scientific papers (e.g., PubMed)

Problem Statement
Suppose we have a document collection D with documents collected from different domains C: news, web pages, scientific papers, medical literature, e-mails.
We classify documents into two classes:
disease-related documents DR;
disease non-related documents DNR.
We extract a set of events E from every document di in DR for every domain cj.
For every event ek in E we extract a set of domain-specific and domain-independent entities: disease, species, location, date, event status.
We classify recognized events from E into:
two classes: suspected or confirmed;
three classes: susceptible, infected or recovered.

Research Tasks
Classification of disease-related documents collected from different domains
Domain-specific entity extraction: animal disease names, viruses, disease serotypes
Automated animal disease-related event recognition and classification from unstructured web data

Related Work: Text Categorization
Supervised learning: training and testing datasets; lots of labeled data;
Unsupervised learning: clustering; only unlabeled data;
Semi-supervised learning: small amount of labeled data and a lot of unlabeled data.
Feature representation:
“bag-of-words”: binary, term frequency, TF-IDF
word bigrams, “bag-of-concepts”

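The feature representations named on this slide can be illustrated with a small sketch. It is not the thesis code: the three toy documents, the whitespace tokenizer, and the unsmoothed idf = log(N / df) weighting are assumptions chosen only to make the binary, term frequency, TF-IDF, and bigram variants concrete.

```python
import math
from collections import Counter

# Toy corpus (hypothetical, for illustration only).
DOCS = [
    "fmd outbreak confirmed in surrey",
    "fmd fmd vaccine trial",
    "football results in surrey",
]


def tokenize(text):
    """Lowercase whitespace tokenizer (a simplifying assumption)."""
    return text.lower().split()


def bigrams(tokens):
    """Adjacent word pairs, used as features in a bigram representation."""
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]


def binary_features(tokens, vocab):
    """1 if the vocabulary term occurs in the document, else 0."""
    present = set(tokens)
    return [1 if term in present else 0 for term in vocab]


def tf_features(tokens, vocab):
    """Raw count of each vocabulary term in the document."""
    counts = Counter(tokens)
    return [counts[term] for term in vocab]


def tfidf_features(tokens, vocab, corpus):
    """Term count scaled by log(N / document frequency)."""
    counts = Counter(tokens)
    n_docs = len(corpus)
    weights = []
    for term in vocab:
        df = sum(1 for doc in corpus if term in doc)
        idf = math.log(n_docs / df) if df else 0.0
        weights.append(counts[term] * idf)
    return weights


TOKENS = [tokenize(d) for d in DOCS]
VOCAB = sorted({t for doc in TOKENS for t in doc})
```

A term such as "fmd" that occurs twice in one document has the same binary value as a term that occurs once, but a larger term frequency and TF-IDF weight.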
Classification Algorithms
Lazy: IB1, IBk, KStar
Meta: AdaBoost
Trees: J48, RandomForest
Rules: ZeroR
Bayes: Naive Bayes, Naive Bayes Multinomial
Functions: Logistic, Multilayer Perceptron, RBFNetwork

Related Work: Entity Extraction
Gazetteers and regular expressions
Limitations: dictionary look-up methods achieve high precision but low recall, limited by the size of the dictionary
Hidden Markov Models (HMM)
Conditional Random Fields (CRF)
NER systems: Stanford NER System, CMU Lemur Toolkit, OpenNLP
Ontology-based biomedical entity extraction:
BioCaster: 50 animal disease names + SVM
PULS: 2400 human diseases and 1100 animal diseases

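The gazetteer limitation mentioned above can be seen in a minimal dictionary look-up sketch. The three-entry gazetteer, the function name `gazetteer_tag`, and the greedy longest-match strategy are assumptions for illustration; they are not taken from any of the systems listed.

```python
def gazetteer_tag(tokens, gazetteer, max_len=4):
    """Greedy longest-match look-up of token spans in a set of known names.

    Returns (start, end, name) triples, with end exclusive.
    """
    lowered = [t.lower() for t in tokens]
    matches = []
    i = 0
    while i < len(lowered):
        # Try the longest candidate span first, then shrink it.
        for n in range(min(max_len, len(lowered) - i), 0, -1):
            candidate = " ".join(lowered[i:i + n])
            if candidate in gazetteer:
                matches.append((i, i + n, candidate))
                i += n
                break
        else:
            i += 1
    return matches


# Hypothetical gazetteer; note that "bluetongue" is deliberately missing.
GAZETTEER = {"foot-and-mouth disease", "fmd", "rift valley fever"}
SENTENCE = "New Rift Valley fever cases and bluetongue reported".split()
FOUND = gazetteer_tag(SENTENCE, GAZETTEER)
```

Every span returned is a known disease name, which is why precision is high, while "bluetongue" is missed because it is not in the dictionary, which is the recall limitation.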
Related Work: Relation Extraction

Related Work: Event Recognition
BioCaster: looks for disease-location pairs and calculates their frequency in the document and in the collection
PULS: extracts metadata and structured facts related to animal disease outbreaks using a pattern matching approach

Summary
Overview of the existing automated and manually supported systems for monitoring animal disease outbreaks.
Approaches for text categorization: supervised, unsupervised and semi-supervised learning; feature representations: “bag-of-words”, term frequency, binary features, word bigrams; classification algorithms: lazy learners, decision trees, Naïve Bayes.
Entity extraction approaches: gazetteers, regular expressions, Hidden Markov Models and Conditional Random Fields; existing NER systems; ontology-based biomedical entity extraction.
Relation extraction work for automated ontology construction.
Animal disease-related event recognition methods.

Framework for Epidemiological Analytics
Main Functional Components
Data Collection (Document Relevance Classification) -> Data Sharing -> Search -> Data Analysis (Entity Extraction and Event Recognition) -> Visualization

Existing Systems vs. Designed System (1)
Comparison columns: Existing Systems, Designed System

Existing Systems vs. Designed System (2)
Comparison columns: Existing Systems, Designed System

System Architecture
Diagram components: Crawled Documents; Search Interface; 1. Entity Extraction Component; 2. Temporal Extraction Component; 3. Spatial Extraction Component; 4. Event Recognition Component; Extracted Entities; Extracted Event Tuples; Data Storage; Web Server; TimeLine/Map View

Main System Components

1. Data Collection (1)
Periodically crawl the web using the Heritrix crawler - http://crawler.archive.org/
set of seeds (ProMED-Mail, DEFRA etc.)
set of terms (animal disease names from the ontology)
Text-to-tag ratio-based method for content extraction from web pages

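The idea behind a text-to-tag ratio can be sketched as follows: for each line of HTML, divide the number of non-tag characters by the number of tags, and keep lines where text dominates markup. This is a simplified sketch, not the method used in the thesis; the regular expression for tags, the fixed `threshold` value, and the sample page are assumptions.

```python
import re

TAG_RE = re.compile(r"<[^>]*>")


def text_to_tag_ratio(line):
    """Non-tag characters on a line divided by the number of tags (at least 1)."""
    tags = TAG_RE.findall(line)
    text = TAG_RE.sub("", line).strip()
    return len(text) / max(len(tags), 1)


def extract_content(html, threshold=10.0):
    """Keep the text of lines whose ratio reaches the (assumed) threshold."""
    kept = []
    for line in html.splitlines():
        if text_to_tag_ratio(line) >= threshold:
            kept.append(TAG_RE.sub("", line).strip())
    return kept


# Hypothetical page: a navigation bar, one content paragraph, a share widget.
SAMPLE_HTML = "\n".join([
    '<div><a href="/">Home</a> <a href="/news">News</a></div>',
    "<p>A new foot-and-mouth disease outbreak was confirmed in Egham, Surrey.</p>",
    "<div><span>Share</span></div>",
])
CONTENT = extract_content(SAMPLE_HTML)
```

Navigation and widget lines carry many tags and little text, so their ratio is low, while the article paragraph has a high ratio and survives the filter.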
1. Data Collection (2)
Diagram components: WWW, Email, Crawler, DB, Document Collection, Query, Literature

2. Data Sharing
Document relevance classification using the Naive Bayes classifier from Mallet - http://mallet.cs.umass.edu
Relevant
Non-relevant

3. Search
Lucene-based ranking (Lucene - http://lucene.apache.org)
Query-based keyword search
Search by animal disease name and/or location

4. Data Analysis
Event example: “On 12 September 2007, a new foot-and-mouth disease outbreak was confirmed in Egham, Surrey”

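As a rough illustration of what an extracted event tuple for this sentence might look like, the sketch below pulls out disease, location, date, and status with simple look-ups. The `Event` class, the two-entry disease list, the status keywords, and both regular expressions are hypothetical stand-ins for the entity, temporal, and spatial extraction components, not their actual implementation. Species is left out because the example sentence does not name one.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical resources standing in for the ontology and extraction components.
DISEASES = ("foot-and-mouth disease", "rift valley fever")
STATUS_WORDS = ("confirmed", "suspected")
DATE_RE = re.compile(
    r"\b\d{1,2} (?:January|February|March|April|May|June|July|August|"
    r"September|October|November|December) \d{4}\b"
)
# Capitalized place names after "in", optionally comma-separated.
LOCATION_RE = re.compile(r"\bin ([A-Z][a-z]+(?:, [A-Z][a-z]+)*)")


@dataclass
class Event:
    disease: Optional[str]
    location: Optional[str]
    date: Optional[str]
    status: Optional[str]


def extract_event(sentence):
    """Build an event tuple from one sentence using look-ups and patterns."""
    lowered = sentence.lower()
    disease = next((d for d in DISEASES if d in lowered), None)
    status = next((s for s in STATUS_WORDS if s in lowered), None)
    date = DATE_RE.search(sentence)
    location = LOCATION_RE.search(sentence)
    return Event(
        disease=disease,
        location=location.group(1) if location else None,
        date=date.group(0) if date else None,
        status=status,
    )


EXAMPLE = ("On 12 September 2007, a new foot-and-mouth disease outbreak "
           "was confirmed in Egham, Surrey")
EVENT = extract_event(EXAMPLE)
```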
5. Visualization
Map View: Google Maps API - http://code.google.com/apis/maps/
TimeLine View: SIMILE API - http://www.simile-widgets.org/timeline/

Summary
Compare our system to other systems: BioCaster, MedISys/PULS, HealthMap
System Architecture: Data Collection -> Data Sharing -> Search -> Data Analysis -> Visualization
Main System Components:
Entity Extraction Component
domain-specific entities (disease names, their synonyms, abbreviations, corresponding viruses and disease serotypes, species)
domain-independent entities (locations and dates)
Event Recognition Component

Disease-related Document Classification
Binary Classification using Supervised Learning
Feature Representations: “Bag-of-words”, TF, Bigrams
Classifiers: Naïve Bayes, MaxEntropy, J48

Supervised Learning Framework
Diagram components: Crawled Documents DTrain; New Documents DTest; Feature Representation R1 … Feature Representation Rn; Learned Model M1 … Learned Model Mk; Classifier
Classifier output:
Disease Related - DR (passed on to the next phases)
Disease Non-related - DNR (eliminated from the index)
Feature Representations:
R1 - “bag-of-words” binary, |R1| = 28908
R2 - “bag-of-words” term frequency, |R2| = 28908
R3 - “bag-of-words” bigrams, |R3| = 99108
R4 - noun and verb keywords represented as binary counts, |R4| = 2
R5 - noun and verb keywords normalized frequency, |R5| = 2

Experiment A
Disease-related Document Classification
~1500 crawled documents on:
Foot-and-mouth disease (FMD)
Rift Valley fever (RVF)
Focused crawl terms: [foot and mouth disease, FMD, rift valley fever, RVF]
After labeling (+ or -): 813 related and 752 non-related docs
Testing with 10-fold cross validation

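For readers unfamiliar with 10-fold cross validation, the sketch below shows how the 813 + 752 = 1565 labeled documents could be partitioned so that each document is used for testing exactly once. The function name `k_fold_indices`, the fixed random seed, and the plain (non-stratified) shuffle are assumptions; the actual experiment may have split the data differently.

```python
import random


def k_fold_indices(n_items, k=10, seed=0):
    """Yield (train, test) index lists for k-fold cross validation."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of nearly equal size.
    folds = [indices[i::k] for i in range(k)]
    for held_out in range(k):
        test = folds[held_out]
        train = [j for m, fold in enumerate(folds) if m != held_out for j in fold]
        yield train, test


N_DOCS = 813 + 752  # related + non-related documents after labeling
SPLITS = list(k_fold_indices(N_DOCS, k=10))
```

Each of the ten models is trained on nine folds and evaluated on the remaining one, and the reported score is the average over the ten test folds.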
Classification Results: Precision, Recall, F-Measure, Area Under Curve
Panels: Binary Features; Noun and Verb Frequency Features

Classification Results: Accuracy
Panels: Binary Features; All unigrams, all bigrams and all term frequency features

Summary (1)
“Bag-of-words” representation gives higher accuracy;
Generative approaches give the highest accuracy: Naïve Bayes together with comprehensive feature representation R3 using bigrams as features - 0.97;
MaxEnt classifier using unigram “bag-of-words” representation R2 - 0.96;
MaxEnt classifier using comprehensive binary counts as feature representation R1 - 0.94.
Normalized term frequency is much better than just binary features.

Summary (2)

Entity Extraction in the Domain of Veterinary Medicine (1)
Ontology-based Entity Extraction
Automated Ontology Construction

Domain Meta-data
Domain-independent knowledge:
Location hierarchy: names of countries, states, cities;
Time hierarchy: canonical dates.
Domain-specific knowledge:
Medical ontology: diseases, serotypes, and viruses.

Manually-constructed Initial Ontology
|OINIT| = 429, |OS| = 581, |OA| = 581, |OS+A| = 605
Sources:
1. Disease names and fact sheets from Iowa State University Center for Food Security and Public Health (CFSPH): http://www.cfsph.iastate.edu/diseaseinfo/animaldiseaseindex.htm
2. World Organisation for Animal Health (OIE) Animal Disease Data: http://www.oie.int/eng/maladies/en_alpha.htm
3. Department for Environment, Food and Rural Affairs, UK (DEFRA): http://www.defra.gov.uk/animalh/diseases/vetsurveillance/az_index.htm
4. Wikipedia: http://en.wikipedia.org/wiki/Animal_diseases
Relationship Types
Synonymic relationships - “E1 is a kind of E2”: E1 = “swine influenza” is a kind of E2 = “swine fever”
Hyponymic relationships - “E1 and E2 are diseases”: E1 = “anthrax”, E2 = “yellow fever” are diseases
Causal relationships - “E1 is caused by E2”: E1 = “Ovine epididymitis” is caused by E2 = “Brucella ovis”

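The relationship templates above lend themselves to simple lexical pattern matching. The sketch below is one hypothetical way to turn the “is a kind of” and “is caused by” templates into a matcher; it is not the thesis algorithm for semantic relation discovery, and it assumes each sentence contains nothing but the two entities and the connecting phrase. The relation labels follow the slide's own naming.

```python
# Connecting phrases taken from the slide's templates; labels follow the slide.
PATTERNS = {
    "synonymic": " is a kind of ",
    "causal": " is caused by ",
}


def find_relation(sentence):
    """Return (relation, E1, E2) if the sentence matches a template, else None."""
    text = sentence.strip().rstrip(".")
    for relation, phrase in PATTERNS.items():
        if phrase in text:
            e1, e2 = text.split(phrase, 1)
            return relation, e1.strip(), e2.strip()
    return None


CAUSAL = find_relation("Ovine epididymitis is caused by Brucella ovis.")
KIND_OF = find_relation("swine influenza is a kind of swine fever")
```

A newly matched pair could then be proposed as a candidate entry for ontology expansion, subject to whatever filtering the full algorithm applies.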
Algorithm for Semantic Relation Discovery and Automated Ontology Expansion

Experiment B
Ontology-based Entity Extraction
100 unlabeled documents for ontology expansion
100 manually labeled documents for entity extraction

Entity Extraction Results: ROC Curves

Entity Extraction Results: Learning Curves
|OG| = 754..1238, |OR| = 772..1287

Entity Extraction in the Domain of Veterinary Medicine (2)
Sequence Labeling using Syntactic Features with Sliding Window

Syntactic Feature Extraction
POS tag: numeric word-level feature
Capitalization: binary word-level feature
Capitalization inside: binary word-level feature for identifying abbreviations
Position in the sentence: numeric document-level feature
Position in the document: numeric document-level feature
Frequency: numeric document-level feature

Sequence Labeling Approach

An Example of Syntactic Feature Extraction
Sentence: “Severe disease in dairy cattle caused by Salmonella Newport”
POS = [NNP, IN, NNS, VBN, …] = [2, 0, 2, 5, …]
Xi = [POSi, CAPi, ICAPi, SPOSi, DPOSi, FREQi]
Window (w = 3): wi-3 = cattle, wi-2 = caused, wi-1 = by, wi = Salmonella, wi+1 = Newport, wi+2 and wi+3 = padding
Xi-3 = [2, 0, 0, 5, 5, 1]
Xi-2 = [5, 0, 0, 6, 6, 1]
Xi-1 = [0, 0, 0, 7, 7, 1]
Xi = [2, 1, 0, 8, 8, 1]
Xi+1 = [2, 1, 0, 9, 9, 1]
Xi+2 = [-1, -1, -1, -1, -1, -1]
Xi+3 = [-1, -1, -1, -1, -1, -1]
Fi = [Xi, Xi-1, Xi-2, Xi-3, Xi+1, Xi+2, Xi+3], w = 3
Class = {0, 1}
Fi = [2, 1, 0, 8, 8, 1, 0, 0, 0, 7, 7, 1, 5, 0, 0, 6, 6, 1, 2, 0, 0, 5, 5, 1, 2, 1, 0, 9, 9, 1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1], Class = [1]

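The window construction in this example can be reproduced with a short sketch. The POS codes are supplied as input rather than computed; the codes for nouns (2), IN (0), and VBN (5) come from the slide, while the code used for the adjective “Severe” is a made-up placeholder. Treating the sentence as the whole document, so that the document position equals the sentence position, is also an assumption.

```python
from collections import Counter

PAD = [-1] * 6  # feature vector for positions outside the sentence


def token_features(tokens, pos_codes):
    """Per-token vector [POS, CAP, ICAP, SPOS, DPOS, FREQ]."""
    counts = Counter(t.lower() for t in tokens)
    features = []
    for position, (token, pos) in enumerate(zip(tokens, pos_codes), start=1):
        cap = int(token[0].isupper())                    # capitalized first letter
        icap = int(any(c.isupper() for c in token[1:]))  # capital inside the word
        # Single-sentence document: document position equals sentence position.
        features.append([pos, cap, icap, position, position, counts[token.lower()]])
    return features


def window_vector(features, i, w=3):
    """Concatenate Xi, Xi-1..Xi-w, then Xi+1..Xi+w, padding past the edges."""
    offsets = [0] + [-k for k in range(1, w + 1)] + [k for k in range(1, w + 1)]
    vector = []
    for offset in offsets:
        j = i + offset
        vector.extend(features[j] if 0 <= j < len(features) else PAD)
    return vector


TOKENS = "Severe disease in dairy cattle caused by Salmonella Newport".split()
# Codes 2 (nouns), 0 (IN), 5 (VBN) follow the slide; 1 for "Severe" is a placeholder.
POS_CODES = [1, 2, 0, 2, 2, 5, 0, 2, 2]
X = token_features(TOKENS, POS_CODES)
F_I = window_vector(X, TOKENS.index("Salmonella"), w=3)
```

With w = 3 and six features per token, every example has 7 x 6 = 42 values regardless of where the token sits in the sentence, which is what allows a standard classifier to be trained on the windows.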
Experiment C
Sequence Labeling using Syntactic Features
100 manually labeled documents from Experiment B
Number of disease names is more than 5 per document
Keep capitalization
Remove stop words
202977 examples in the dataset
80% for training (approx. 160000 examples)
20% for testing (approx. 40000 examples)
Results are averaged over 3 runs
We do not report accuracy because the data set is unbalanced (approx. 8570 positive examples vs. approx. 194430 negative examples)

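A quick calculation shows why accuracy is uninformative at this class ratio. The sketch below uses the approximate counts from the slide and a hypothetical baseline that labels every example as negative; the helper name `precision_recall_f1` is illustrative.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 from confusion counts, with 0.0 for empty cases."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


POSITIVES = 8570    # approximate count from the slide
NEGATIVES = 194430  # approximate count from the slide

# Baseline that predicts "not a disease name" for every example.
BASELINE_ACCURACY = NEGATIVES / (POSITIVES + NEGATIVES)
BASELINE_PRF = precision_recall_f1(tp=0, fp=0, fn=POSITIVES)
```

The all-negative baseline is right on roughly 96% of examples while finding no disease names at all, so precision, recall, F-measure and AUC are the more meaningful measures here.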
Entity Extraction Results: F-measure (1)

Entity Extraction Results: F-measure (2)

Entity Extraction Results: Precision, Recall, AUC (1)

Entity Extraction Results: Precision, Recall, AUC (2)

Summary
BioCaster named entity recognition system:
200 news articles
F-score - 0.769 for all named entity classes
SVM and feature window -2/+1 including surface word, orthography, biomedical prefixes/suffixes, lemma, head noun etc.
DNA, RNA, cell type extraction:
SVM and orthographic features
F-score - 0.799 during the identification phase and 66.5 during the classification phase

Disease-related Event Recognition and Classification (1)
Sentence-based Event Recognition and Classification

Animal Disease-related Event Types

Master Thesis

  • 1. Laboratory for Knowledge Discovery in DatabasesEntity Extraction, Animal Disease-related Event Recognition and Classification from WebPresenter: Svitlana Volkova Adviser: William H. HsuCommittee: Dr. Doina Caragea, Dr. Gurdip Singh Supported by: K-State National Agricultural Biosecurity Center (NABC), US Department of Defense
  • 2. AgendaThesis "Entity Extraction, Animal Disease-related Event Recognition and Classification from Web", July 30 2010Background & MotivationRelated WorkAnimal Disease Monitoring Systems
  • 5. Event RecognitionFramework for Epidemiological AnalyticsDisease-related Document ClassificationDomain-specific Entity ExtractionOntology-based Entity Extraction
  • 6. Sequence Labeling using Syntactic FeaturesDisease-related Event Recognition Summary & Future Work
  • 7. Motivationinfluence on the travel and tradecause economic crises, political instabilitydiseases, zoonotic in type can cause loss of lifeThesis "Entity Extraction, Animal Disease-related Event Recognition and Classification from Web", July 30 2010
  • 8. Animal Disease Monitoring Systems: ManuallySupportedWeb Interfaces (1)International:World Animal Health Information Database (WAHID) Interface - http://www.oie.int/wahis/public.php?page=homeWHO Global Atlas of Infectious Diseases - http://diseasemaps.usgs.gov/index.htmEmergency Prevention System (EMPRES) for Transboundary Animal and Plant Pests and Diseases - http://www.fao.org/EMPRES/default.htmlThesis "Entity Extraction, Animal Disease-related Event Recognition and Classification from Web", July 30 2010
  • 9. Thesis "Entity Extraction, Animal Disease-related Event Recognition and Classification from Web", July 30 2010
  • 10. Animal Disease Monitoring Systems: ManuallySupportedWeb Interfaces(2)USACenters for Disease Control and Prevention (CDC) - http://www.cdc.govU.S. Department of Agriculture (USDA) - http://www.usda.gov/wps/portal/usdahomeU.S. Geological Survey (USGS) and U.S. Geological Survey (USGS) National Wildlife Health Center (NWHC) - http://www.nwhc.usgs.govIowa State University Center for Food Security and Public Health (CFSPH) - http://www.cfsph.iastate.eduBioPortal - http://biocomputingcorp.com/bpsystem.htmlFMD BioPortal - https://fmdbioportal.ucdavis.eduUnited KingdomDepartment for Environment Food and Rural Affairs (DEFRA) - http://www.defra.gov.ukThesis "Entity Extraction, Animal Disease-related Event Recognition and Classification from Web", July 30 2010
  • 11. Animal Disease Monitoring Systems: Automated Web Services (1)BioCaster - http://biocaster.nii.ac.jp/follows 1500 RSS feeds hourly
  • 12. classifies documents as topically relevant or not
  • 13. taxonomy of 4300 named entities (50 disease names, 243 country names, 4025 province/city names, latitudes and longitudes)
  • 14. identifies 40 diseases at up to 25-30 locations per day
  • 15. multilingual information extraction on to English, French, Spanish, Chinese, Thai, Vietnamese, Japanese
  • 16. uses ontology pattern matching approaches to recognize disease-location-verb pairs
  • 17. plots events on a Google Map
  • 18. does not classify events into categories and does not report past outbreaks
  • 19. no timeline visualizationThesis "Entity Extraction, Animal Disease-related Event Recognition and Classification from Web", July 30 2010
  • 20. BioCaster -http://biocaster.nii.ac.jp/Thesis "Entity Extraction, Animal Disease-related Event Recognition and Classification from Web", July 30 2010
  • 21. Animal Disease Monitoring Systems: Automated Web Services (2)Information retrieval system MedISys - http://medusa.jrc.it/medisys/homeedition/all/home.htmlPattern-based Understanding and Learning System (PULS) - http://sysdb.cs.helsinki.fi/puls/jrc/allallows automated recognizing of the metadata and structured facts related to the disease outbreaks
  • 22. collects an average 50000 news articles per day from about 1400 news portals and about 150 specialized Public Health sites
  • 24. current ontology contains 2400 disease names, 400 organisms, 1500 political entities and over 70000 location names including towns, cities, provinces
  • 25. real-time news clustering and filtering by matching 3000 patterns
  • 26. does not classify events and does not report past outbreaks.
  • 27. MedISys - http://medusa.jrc.it/medisys/homeedition/all/home.html (part of the Europe Media Monitor (EMM) product family, http://emm.jrc.it/overview.html)
  • 28. Pattern-based Understanding and Learning System (PULS)
  • 29. Animal Disease Monitoring Systems: Automated Web Services (3)
    HealthMap - http://healthmap.org/en
    aggregates articles from Google News and ProMED-Mail portal
  • 30. 2300 locations and 1100 disease names
  • 31. identifies between 20 and 30 outbreaks per day
  • 32. multiple languages: English, Russian, Arabic, French, Portuguese, Spanish, Chinese
  • 33. manually supported system
    EpiSpider - http://www.epispider.org/
    combines emerging infectious disease data from:
  • 35. The Global Disaster Alert Coordinating System (GDACS) - www.gdacs.org
  • 36. Central Intelligence Agency (CIA) Factbook - https://www.cia.gov/library/publications/the-world-factbook/
  • 37. The United Nations Human Development Report sites - http://hdr.undp.org/en
  • 38. HealthMap - http://healthmap.org/en
  • 39. ProMED-Mail - www.promedmail.org
  • 40. EpiSpider - http://www.epispider.org/
  • 41. Animal Disease-related Data Online
    Structured Data: official reports by different organizations (state and federal laboratories, bioportals; health care providers; governmental agricultural or environmental agencies).
    Unstructured Data: web pages, news, e-mails (e.g., ProMed-Mail), blogs, medical literature (e.g., books), scientific papers (e.g., PubMed).
  • 42. Problem Statement
    Suppose we have a document collection D with documents collected from different domains C: news, web pages, scientific papers, medical literature, e-mails.
    We classify documents into two classes: disease-related documents DR; disease non-related documents DNR.
    We extract a set of events E from every document d_i in DR for every domain c_j. For every event e_k in E we extract a set of domain-specific and domain-independent entities: disease, species, location, date, event status.
    We classify recognized events from E into: two classes (suspected or confirmed); three classes (susceptible, infected or recovered).
  • 43. Research Tasks
    Classification of the disease-related documents collected from different domains
    Domain-specific entity extraction: animal disease names, viruses, disease serotypes
    Automated animal disease-related event recognition and classification from unstructured web data
  • 44. Related Work: Text Categorization
    Supervised learning: training and testing datasets; lots of labeled data;
    Unsupervised learning: clustering; only unlabeled data;
    Semi-supervised learning: small amount of labeled data and a lot of unlabeled data.
    Feature representation: “bag-of-words”, binary, term frequency, TF-IDF, word bigrams, “bag-of-concepts”
  • 45. Classification Algorithms
    Lazy: IB1, IBk, KStar
    Meta: AdaBoost
    Trees: J48, RandomForest
    Rules: ZeroR
    Bayes: Naive Bayes, Naive Bayes Multinomial
    Functions: Logistic, Multilayer Perceptron, RBFNetwork
  • 46. Related Work: Entity Extraction
    Gazetteers and regular expressions
        Limitations: dictionary look-up methods achieve high precision, but low recall limited to the size of the dictionary
    Hidden Markov Models (HMM)
    Conditional Random Fields (CRF)
    NER Systems: Stanford NER System, CMU Lemur Toolkit, Open NLP
    Ontology-based Biomedical Entity Extraction
        BioCaster: 50 animal disease names + SVM
        PULS: 2400 human diseases and 1100 animal diseases
  • 47. Related Work: Relation Extraction
  • 48. Related Work: Event Recognition
    BioCaster: recognizes disease-location pairs and calculates their frequency in the document and in the collection
    PULS: extracts metadata and structured facts related to animal disease outbreaks using a pattern-matching approach
  • 49. Summary
    Overview of the existing automated and manually supported systems for monitoring animal disease outbreaks.
    Approaches for text categorization: supervised, unsupervised and semi-supervised learning; feature representations: “bag-of-words”, term frequency, binary features, word bigrams; classification algorithms: lazy learners, decision trees, Naïve Bayes.
    Entity extraction approaches: gazetteers, regular expressions, Hidden Markov Models and Conditional Random Fields; existing NER systems; ontology-based biomedical entity extraction.
    Relation extraction for automated ontology construction.
    Animal disease-related event recognition methods.
  • 50. Framework for Epidemiological Analytics
    Main Functional Components: Data Collection (Document Relevance Classification) -> Data Sharing -> Search -> Data Analysis (Entity Extraction and Event Recognition) -> Visualization
  • 51. Existing Systems vs. Designed System (1) [table columns: Existing Systems; Designed System]
  • 52. Existing Systems vs. Designed System (2) [table columns: Existing Systems; Designed System]
  • 53. System Architecture [diagram components: Crawled Documents; Search Interface; 1. Entity Extraction Component; 2. Temporal Extraction Component; 3. Spatial Extraction Component; 4. Event Recognition Component; Extracted Entities; Extracted Event Tuples; Data Storage; Web Server; TimeLine/Map View]
  • 54. Main System Components
  • 55. 1. Data Collection (1)
    Periodically crawl the web using Heritrix crawler - http://crawler.archive.org/
        set of seeds (ProMED-Mail, DEFRA etc.)
        set of terms (animal disease names from the ontology)
    Text-to-tag ratio-based method for content extraction from web pages
  • 56. 1. Data Collection (2) [diagram components: WWW; Email; Literature; Crawler; Query; DB; Document Collection]
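The text-to-tag ratio idea mentioned on this slide can be illustrated with a minimal sketch. It is a simplified assumption of how such a filter might look, not the crawler's actual code: each HTML line is scored by the number of non-tag characters divided by the number of tags, and lines above a fixed, hypothetical `threshold` are kept. The function names and the threshold value are illustrative only; a production method would typically smooth the ratios and choose the cut-off from the data.

```python
import re

# Matches any HTML tag such as <div>, </a> or <a href="/">.
TAG_RE = re.compile(r"<[^>]*>")

def text_to_tag_ratios(html):
    """Return one text-to-tag ratio per line of HTML."""
    ratios = []
    for line in html.splitlines():
        tag_count = len(TAG_RE.findall(line))
        text_length = len(TAG_RE.sub("", line).strip())
        # Lines without tags are divided by 1 to avoid division by zero.
        ratios.append(text_length / max(1, tag_count))
    return ratios

def extract_content(html, threshold=20.0):
    """Keep the text of lines whose ratio reaches the (hypothetical) threshold."""
    kept = []
    for line, ratio in zip(html.splitlines(), text_to_tag_ratios(html)):
        if ratio >= threshold:
            kept.append(TAG_RE.sub("", line).strip())
    return "\n".join(kept)

page = (
    '<div><a href="/">Home</a><a href="/x">News</a></div>\n'
    "<p>Agricultural authorities confirmed foot-and-mouth disease "
    "on a hog farm in Taoyuan.</p>"
)
print(extract_content(page))
```

Navigation lines carry many tags and little text, so they score low, while article paragraphs score high.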
  • 57. 2. Data Sharing
    Document relevance classification (relevant vs. non-relevant) using the Naive Bayes classifier from Mallet - http://mallet.cs.umass.edu
  • 58. 3. Search
    Lucene-based* ranking
    Query-based keyword search
    Search by animal disease name and/or location
    *Lucene - http://lucene.apache.org
  • 59. 4. Data Analysis
    Event example: “On 12 September 2007, a new foot-and-mouth disease outbreak was confirmed in Egham, Surrey”
  • 60. 5. Visualization
    Map View: GoogleMaps API - http://code.google.com/apis/maps/
    TimeLine View: SIMILE API - http://www.simile-widgets.org/timeline/
  • 62. Summary
    Comparison of our system to other systems: BioCaster, MedISys/PULS, HealthMap
    System Architecture: Data Collection -> Data Sharing -> Search -> Data Analysis -> Visualization
    Main System Components
        Entity Extraction Component: domain-specific entities (disease names, their synonyms, abbreviations, corresponding viruses and disease serotypes, species); domain-independent entities (locations and dates)
        Event Recognition Component
  • 63. Disease-related Document Classification
    Binary Classification using Supervised Learning
    Feature Representations: “Bag-of-words”, TF, Bigrams
    Classifiers: Naïve Bayes, MaxEntropy, J48
  • 64. Supervised Learning Framework
    [diagram components: Crawled Documents DTrain; New Documents DTest; Feature Representation R1 … Feature Representation Rn; Learned Model M1 … Learned Model Mk; Classifier]
    Classifier output: Disease Related - DR (passed on to the next phases); Disease Non-related – DNR (eliminated from the index)
    Feature Representations:
        R1 – “bag-of-words” binary, |R1|=28908
        R2 – “bag-of-words” term frequency, |R2|=28908
        R3 – “bag-of-words” bigrams, |R3|=99108
        R4 – noun and verb keywords represented as binary counts, |R4|=2
        R5 – noun and verb keywords normalized frequency, |R5|=2
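To make the difference between R1, R2 and R3 concrete, the following is a small sketch in plain Python. It assumes whitespace tokenization and raw counts for term frequency; the two toy documents and all helper names are hypothetical and are not taken from the thesis experiments.

```python
from collections import Counter

def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple assumption)."""
    return text.lower().split()

def bigrams(tokens):
    """Adjacent word pairs joined by a space."""
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]

def build_vocab(items_per_doc):
    """Sorted list of all distinct items seen in any document."""
    return sorted({item for items in items_per_doc for item in items})

def vectorize(items, vocab, binary=False):
    """Map a document's items onto the vocabulary as counts or 0/1 flags."""
    counts = Counter(items)
    return [int(counts[v] > 0) if binary else counts[v] for v in vocab]

docs = ["fmd outbreak confirmed", "fmd fmd suspected"]  # toy documents
tokens = [tokenize(d) for d in docs]

unigram_vocab = build_vocab(tokens)
r1 = [vectorize(t, unigram_vocab, binary=True) for t in tokens]  # binary
r2 = [vectorize(t, unigram_vocab) for t in tokens]               # term frequency

bigram_items = [bigrams(t) for t in tokens]
bigram_vocab = build_vocab(bigram_items)
r3 = [vectorize(b, bigram_vocab) for b in bigram_items]          # bigrams

print(unigram_vocab, r1, r2)
print(bigram_vocab, r3)
```

The bigram vocabulary grows much faster than the unigram one, which is consistent with |R3| being several times larger than |R1| and |R2| on the slide.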
  • 65. Experiment A: Disease-related Document Classification
    ~1500 crawled documents: Foot-and-mouth disease (FMD), Rift valley fever (RVF)
    Focused crawl terms: [foot and mouth disease, FMD, rift valley fever, RVF]
    After labeling (+ or -): 813 related and 752 non-related docs
    Testing with 10-fold cross validation
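The 10-fold cross validation mentioned here can be sketched as follows. This is an illustrative index splitter only, assuming the 1565 labeled documents (813 + 752) are addressed by position; it does no shuffling or stratification, which a real experiment would normally add, and the function name is hypothetical.

```python
def k_fold_indices(n_items, k=10):
    """Yield (train, test) index lists; each item is held out exactly once."""
    # Fold f holds indices f, f + k, f + 2k, ... so fold sizes differ by at most 1.
    folds = [list(range(start, n_items, k)) for start in range(k)]
    for held_out in range(k):
        test = folds[held_out]
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train, test

splits = list(k_fold_indices(813 + 752, k=10))
for train, test in splits[:2]:
    print(len(train), len(test))
```

Each model is trained on nine folds and evaluated on the remaining one, and the reported score is the average over the ten runs.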
  • 66. Classification Results: Precision, Recall, F-Measure, Area Under Curve [charts: Binary Features; Noun and Verb Frequency Features]
  • 67. Classification Results: Accuracy [charts: Binary Features; All unigrams, all bigrams and all term frequency features]
  • 68. Summary (1)
    “Bag-of-words” representation gives higher accuracy;
    Generative approaches give the highest accuracy: Naïve Bayes together with comprehensive feature representation R3 using bigrams as features – 0.97;
    MaxEnt classifier using unigram “bag-of-words” representation R2 – 0.96;
    MaxEnt classifier using comprehensive binary counts as feature representation R1 – 0.94.
    Normalized term frequency is much better than just binary features.
  • 69. Summary (2)
  • 70. Entity Extraction in the Domain of Veterinary Medicine (1)
    Ontology-based Entity Extraction
    Automated Ontology Construction
  • 71. Domain Meta-data
    Domain-independent knowledge
        Location hierarchy: names of countries, states, cities;
        Time hierarchy: canonical dates.
    Domain-specific knowledge
        Medical ontology: diseases, serotypes, and viruses.
  • 72. Manually-constructed Initial Ontology
    |OINIT|=429, |OS|=581, |OA|=581, |OS+A|=605
    1. Disease names and fact sheets from Iowa State University Center for Food Security and Public Health (CFSPH): http://www.cfsph.iastate.edu/diseaseinfo/animaldiseaseindex.htm
    2. World Organisation for Animal Health (OIE) Animal Disease Data: http://www.oie.int/eng/maladies/en_alpha.htm
    3. Department for Environment, Food and Rural Affairs, UK (DEFRA): http://www.defra.gov.uk/animalh/diseases/vetsurveillance/az_index.htm
    4. Wikipedia: http://en.wikipedia.org/wiki/Animal_diseases
    Relationship Types
        Synonymic relationships – “E1 is a kind of E2”: E1 = “swine influenza” is a kind of E2 = “swine fever”
        Hyponymic relationships – “E1 and E2 are diseases”: E1 = “anthrax”, E2 = “yellow fever” are diseases
        Causal relationships – “E1 is caused by E2”: E1 = “Ovine epididymitis” is caused by E2 = “Brucella ovis”
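As a rough illustration of how a surface pattern such as “E1 is caused by E2” could be matched, here is a sketch using a single regular expression. It is an assumption for illustration only: the thesis describes syntactic patterns combined with POS tagging, whereas this example uses plain string matching, and the pattern, relation label and function name are hypothetical.

```python
import re

# E1 and E2 are assumed to start with a capital letter and to contain only
# word characters, spaces and hyphens; E2 ends at punctuation or end of text.
CAUSAL_PATTERN = re.compile(
    r"(?P<e1>[A-Z][\w\s-]*?) is caused by (?P<e2>[A-Z][\w\s-]*?)(?:[.,;]|$)"
)

def extract_causal_relation(sentence):
    """Return (E1, 'caused_by', E2) if the sentence matches, else None."""
    match = CAUSAL_PATTERN.search(sentence)
    if match is None:
        return None
    return (match.group("e1"), "caused_by", match.group("e2"))

print(extract_causal_relation("Ovine epididymitis is caused by Brucella ovis."))
```

A matched pair could then be added to the ontology as a new disease name with its causative agent, subject to whatever validation the expansion step applies.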
  • 73. Algorithm for Semantic Relation Discovery and Automated Ontology Expansion
  • 74. Experiment B: Ontology-based Entity Extraction
    100 unlabeled documents for ontology expansion
  • 75. 100 manually labeled documents for entity extraction
  • 76. Entity Extraction Results: ROC Curves
  • 77. Entity Extraction Results: Learning Curves; |OG|=754..1238, |OR|=772..1287
  • 78. Entity Extraction in the Domain of Veterinary Medicine (2)
    Sequence Labeling using Syntactic Features with Sliding Window
  • 79. Syntactic Feature Extraction
    POS tag: numeric word-level feature
    Capitalization: binary word-level feature
    Capitalization inside: binary word-level feature for identifying abbreviations
    Position in the sentence: numeric document-level feature
    Position in the document: numeric document-level feature
    Frequency: numeric document-level feature
  • 80. Sequence Labeling Approach
  • 81. An Example of Syntactic Feature Extraction
    Sentence: “Severe disease in dairy cattle caused by Salmonella Newport”
    POS = [NNP, IN, NNS, VBN, …] = [2, 0, 2, 5, …]
    Xi = [POSi, CAPi, ICAPi, SPOSi, DPOSi, FREQi]
    Window (w = 3): wi-3 = cattle, wi-2 = caused, wi-1 = by, wi = Salmonella, wi+1 = Newport, wi+2 and wi+3 beyond the end of the sentence
    Xi-3 = [2, 0, 0, 5, 5, 1]
    Xi-2 = [5, 0, 0, 6, 6, 1]
    Xi-1 = [0, 0, 0, 7, 7, 1]
    Xi = [2, 1, 0, 8, 8, 1]
    Xi+1 = [2, 1, 0, 9, 9, 1]
    Xi+2 = [-1, -1, -1, -1, -1, -1]
    Xi+3 = [-1, -1, -1, -1, -1, -1]
    Fi = [Xi, Xi-1, Xi-2, Xi-3, Xi+1, Xi+2, Xi+3], w = 3, Class = {0, 1}
    Fi = [2, 1, 0, 8, 8, 1, 0, 0, 0, 7, 7, 1, 5, 0, 0, 6, 6, 1, 2, 0, 0, 5, 5, 1, 2, 1, 0, 9, 9, 1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1], Class = [1]
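The window construction on this slide can be reproduced with a short sketch. It assumes that per-token feature vectors Xi are already computed, that positions outside the sentence are padded with -1, and that the concatenation order is the one given for Fi (current token, then left context from nearest to farthest, then right context). The function name and the `PAD` constant are hypothetical.

```python
PAD = [-1] * 6  # placeholder for positions outside the token sequence

def window_features(token_features, i, w=3):
    """Concatenate X_i, X_{i-1}..X_{i-w}, X_{i+1}..X_{i+w} into one vector."""
    def at(j):
        return token_features[j] if 0 <= j < len(token_features) else PAD

    order = [i]
    order += [i - k for k in range(1, w + 1)]  # left context, nearest first
    order += [i + k for k in range(1, w + 1)]  # right context, nearest first

    features = []
    for j in order:
        features.extend(at(j))
    return features

# Feature vectors for "cattle caused by Salmonella Newport" as given on the slide.
X = [
    [2, 0, 0, 5, 5, 1],  # cattle
    [5, 0, 0, 6, 6, 1],  # caused
    [0, 0, 0, 7, 7, 1],  # by
    [2, 1, 0, 8, 8, 1],  # Salmonella
    [2, 1, 0, 9, 9, 1],  # Newport
]
F_i = window_features(X, i=3, w=3)
print(len(F_i), F_i)
```

With w = 3 and six features per token, every example has 7 × 6 = 42 values, which matches the length of Fi on the slide.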
  • 82. Experiment C: Sequence Labeling using Syntactic Features
    100 manually labeled documents from Experiment B
    Number of disease names is more than 5 per document
    Keep capitalization
    Remove stop words
    202977 examples in the dataset
        80% for training (approx. 160000 examples)
        20% for testing (approx. 40000 examples)
    Results are averaged over 3 runs
    We do not report accuracy because the data set is unbalanced (approx. 8570 positive examples vs. approx. 194430 negative examples)
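The reason accuracy is uninformative on such an unbalanced set can be checked with a small calculation. The sketch below uses the approximate class counts from this slide and a hypothetical baseline that labels every token as negative; the `metrics` helper is illustrative and not part of the thesis code.

```python
def metrics(tp, fp, fn, tn):
    """Return (accuracy, precision, recall, F1) from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return accuracy, precision, recall, f1

POSITIVES, NEGATIVES = 8570, 194430  # approximate counts from the slide

# Baseline that never predicts a disease name: no true or false positives.
baseline = metrics(tp=0, fp=0, fn=POSITIVES, tn=NEGATIVES)
print(baseline)
```

The baseline reaches roughly 0.958 accuracy while finding no disease names at all, which is why precision, recall and F-measure are the more meaningful measures here.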
  • 83. Entity Extraction Results: F-measure (1)
  • 84. Entity Extraction Results: F-measure (2)
  • 85. Entity Extraction Results: Precision, Recall, AUC (1)
  • 86. Entity Extraction Results: Precision, Recall, AUC (2)
  • 87. Summary
    BioCaster named entity recognition system
        200 news articles
        F-score – 0.769 for all named entity classes
        SVM and feature window -2/+1 including surface word, orthography, biomedical prefixes/suffixes, lemma, head noun etc.
    DNA, RNA, cell type extraction
        SVM and orthographic features
        F-score – 0.799 during the identification phase and 66.5 during the classification phase
  • 88. Disease-related Event Recognition and Classification (1)
    Sentence-based Event Recognition and Classification
  • 89. Animal Disease-related Event Types
  • 90. Event Recognition Methodology
    Step 1. Entity recognition from raw text.
    Step 2. Classification of the sentences from which entities were extracted as event-related or not; event-related sentences are further classified as confirmed or suspected.
    Step 3. Combination of entities within an event sentence into structured tuples and aggregation of tuples related to the same event into one comprehensive tuple.
  • 91. Step 1. Entity Recognition
    Locate and classify atomic elements into predefined categories:
        Disease names: “foot and mouth disease”, “rift valley fever”; viruses: “picornavirus”; serotypes: “Asia-1”;
        Species: “sheep”, “pigs”, “cattle” and “livestock”;
        Locations of events specified at different levels of geo-granularity: “United Kingdom”, “eastern provinces of Shandong and Jiangsu, China”;
        Dates in different formats: “last Tuesday”, “two months ago”.
  • 92. Entity Recognition Tools
    Animal Disease Extractor*: relies on a medical ontology, automatically enriched with synonyms and causative viruses.
    Species Extractor*: pattern matching on a stemmed dictionary of animal names from Wikipedia.
    Location Extractor: Stanford NER Tool** (uses conditional random fields); NGA GEOnet Names Database (GNS)*** for location disambiguation and retrieving latitude/longitude.
    Date/Time Extractor: set of regular expressions.
    *KDD KSU DSEx - http://fingolfin.user.cis.ksu.edu:8080/diseaseextractor/
    **Stanford NER - http://nlp.stanford.edu/ner/index.shtml
    ***GNS - http://earth-info.nga.mil/gns/html/
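A date extractor built from regular expressions might look like the sketch below. The three patterns are assumptions chosen to cover date formats that appear in the slides' examples (“9 Jun 2009”, “2/20/2001”, “last Tuesday”); they are not the extractor's actual rule set, and the names are hypothetical.

```python
import re

MONTHS = r"(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*"

DATE_PATTERNS = [
    # Day, month name, four-digit year: "9 Jun 2009", "12 September 2007"
    re.compile(rf"\b\d{{1,2}} {MONTHS} \d{{4}}\b"),
    # Numeric dates with slashes: "2/20/2001", "02/18/01"
    re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
    # Relative weekday expressions: "last Tuesday"
    re.compile(r"\blast (?:Mon|Tues|Wednes|Thurs|Fri|Satur|Sun)day\b", re.IGNORECASE),
]

def extract_dates(text):
    """Return all matched date strings in order of appearance."""
    found = []
    for pattern in DATE_PATTERNS:
        for match in pattern.finditer(text):
            found.append((match.start(), match.group(0)))
    return [value for _, value in sorted(found)]

print(extract_dates("On 9 Jun 2009, the farm's owner reported symptoms of FMD."))
print(extract_dates("Confirmed on 2/20/2001; first signs were seen last Tuesday."))
```

Relative expressions such as “last Tuesday” would still need to be resolved against the publication date of the document to obtain a canonical date.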
  • 93. Step 2. Event Sentence Classification
    Constraint: true events should include a disease name together with a status verb from Google Sets* and WordNet** (sentences not related to an event are eliminated).
        “Foot and mouth disease is[V] a highly pathogenic animal disease”.
    Confirmed status: verbs such as “happened” and verb phrases such as “strike out”
        “On 9 Jun 2009, the farm's owner reported[V] symptoms of FMD in more than 30 hogs”.
    Suspected status: verbs such as “catch” and verb phrases such as “be taken in”
        “RVF is suspected[V] in Saudi Arabia in September 2000”.
    *GoogleSets - http://labs.google.com/sets
    **WordNet - http://wordnet.princeton.edu/
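The constraint on this slide (disease name plus status verb) can be sketched as a simple rule. The verb lists and disease names below are tiny hypothetical samples taken from the slide examples, not the Google Sets or WordNet lists used in the thesis, and giving suspected verbs priority over confirmed ones is an assumption of this sketch.

```python
import re

# Hypothetical sample lists; the real lists are much larger.
DISEASE_NAMES = ["foot-and-mouth disease", "foot and mouth disease", "fmd",
                 "rift valley fever", "rvf"]
CONFIRMED_VERBS = {"confirmed", "reported", "happened"}
SUSPECTED_VERBS = {"suspected", "catch"}

def mentions_disease(sentence):
    """True if any known disease name occurs as a whole word or phrase."""
    lowered = sentence.lower()
    return any(re.search(r"\b" + re.escape(name) + r"\b", lowered)
               for name in DISEASE_NAMES)

def classify_sentence(sentence):
    """Return 'suspected', 'confirmed' or 'not_event' for one sentence."""
    if not mentions_disease(sentence):
        return "not_event"
    words = set(re.findall(r"[a-z]+", sentence.lower()))
    if words & SUSPECTED_VERBS:
        return "suspected"
    if words & CONFIRMED_VERBS:
        return "confirmed"
    return "not_event"

for s in [
    "Foot and mouth disease is a highly pathogenic animal disease.",
    "On 9 Jun 2009, the farm's owner reported symptoms of FMD in more than 30 hogs.",
    "RVF is suspected in Saudi Arabia in September 2000.",
]:
    print(classify_sentence(s), "|", s)
```

The first sentence names a disease but has no status verb, so it is filtered out as background information rather than an event.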
  • 94. Step 3. Event Tuple Generation
    Event attributes: disease, date, location, species, confirmation status
    Event tuple: Eventi = <disease; date; location; species; status> = <FMD, 9 Jun 2009, Taoyuan, hog, confirmed>
    Event tuple with missing attributes: Eventj = <FMD, ?, ?, ?, confirmed>
  • 95. Event Recognition Workflow
    Step 1: Entity Recognition
        Foot-and-mouth disease[DIS] on hog[SP] farm in Taoyuan[LOC]. Taiwan's TVBS television station reports that agricultural authorities confirmed foot-and-mouth disease[DIS] on a hog[SP] farm in Taoyuan[LOC]. On 9 Jun 2009[DT], the farm's owner reported symptoms of FMD[DIS] in more than 30 hogs[SP]. Subsequent testing confirmed FMD[DIS]. Agricultural authorities asked the farmer to strengthen immunization. The outbreak has not affected other farms. Authorities stipulated that the affected hog[SP] farm may not sell pork for 2 weeks.
    Step 2: Sentence Classification
        YES 1. Foot-and-mouth disease[DIS] on hog[SP] farm in Taoyuan[LOC].
        YES 2. Taiwan's TVBS television station reports that agricultural authorities confirmed foot-and-mouth disease[DIS] on a hog[SP] farm in Taoyuan[LOC].
        YES 3. On 9 Jun 2009[DT], the farm's owner reported symptoms of FMD[DIS] in more than 30 hogs[SP].
        YES 4. Subsequent testing confirmed FMD[DIS].
        NO 5. Agricultural authorities asked the farmer to strengthen immunization.
        NO 6. The outbreak has not affected other farms.
        NO 7. Authorities stipulated that the affected hog[SP] farm may not sell pork for 2 weeks.
    Step 3a: Tuple Generation
        E1 = <Foot-and-mouth disease, ?, Taoyuan, hog, ?>
        E2 = <Foot-and-mouth disease, ?, Taoyuan, hog, confirmed>
        E3 = <FMD, 9 Jun 2009, ?, hog, reported>
        E4 = <FMD, ?, ?, ?, confirmed>
    Step 3b: Tuple Aggregation
        E = <disease, date, location, species, status> = <Foot-and-mouth disease, 9 Jun 2009, Taoyuan, hog, confirmed>
  • 96. Rule-based event recognition approach
    Step 1. Entity Recognition
    Step 2. Event Sentence Classification
    Step 3. Event Tuple Generation & Aggregation
    Tuple aggregation is based on a set of rules:
        - disease name is the main attribute for the event tuple;
        - optimally combine the event tuples around a disease name (event tuple should have max number of event attributes);
        - if a tuple has a new disease, then form the next tuple.
    The First International Workshop on Web Science and Information Exchange in the Medical Web (MedEx 2010)
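One way to read the aggregation rules above is sketched below. This is an illustrative assumption rather than the thesis implementation: tuples are grouped by a canonical disease name (via a hypothetical one-entry synonym table), missing attributes are filled from later tuples, and the first non-missing value wins. Missing values, shown as “?” on the slides, are represented as `None`.

```python
FIELDS = ("disease", "date", "location", "species", "status")

# Hypothetical synonym table; the thesis uses an ontology for this purpose.
SYNONYMS = {"fmd": "foot-and-mouth disease"}

def canonical(disease):
    """Normalize a disease mention so that synonyms share one key."""
    key = disease.lower()
    return SYNONYMS.get(key, key)

def aggregate(tuples):
    """Merge tuples about the same disease, filling in missing attributes."""
    merged = {}  # canonical disease name -> list of attribute values
    for tup in tuples:
        key = canonical(tup[0])
        if key not in merged:
            merged[key] = list(tup)  # a new disease starts a new tuple
            continue
        current = merged[key]
        for idx in range(1, len(FIELDS)):
            if current[idx] is None and tup[idx] is not None:
                current[idx] = tup[idx]
    return [tuple(values) for values in merged.values()]

events = [
    ("Foot-and-mouth disease", None, "Taoyuan", "hog", None),         # E1
    ("Foot-and-mouth disease", None, "Taoyuan", "hog", "confirmed"),  # E2
    ("FMD", "9 Jun 2009", None, "hog", "reported"),                   # E3
    ("FMD", None, None, None, "confirmed"),                           # E4
]
print(aggregate(events))
```

On the Taoyuan example this yields a single tuple with all five attributes filled, the same result as the aggregated tuple E in the workflow slide. Distinguishing separate outbreaks of the same disease would need additional heuristics.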
  • 97. Experiment D: Event Recognition and Classification
    ~100 event-related documents: Foot-and-mouth disease (FMD), Rift valley fever (RVF)
    Manually created 2 sets of summaries for 100 docs
    DUCView Pyramid Scoring Tool*, score in [0..1]:
        relies on multiple summaries to assign the significance weights to summarization content units (i.e., entities)
        used to compare automatically generated event tuples with entities from human summaries.
    Score_i = < w_d disease; w_t date; w_l location; w_s species; w_c status; … >, subject to disease + status = 2
  • 98. Event Score Distribution by Range
    We interpret the Pyramid score values as event extraction accuracy:
        # of unique contributing entities (TP);
        # of entities not in the summary (FP);
        # of extra contributing entities from summary (FN).
    Multiple summaries: majority voting for annotation.
  • 99. Event Recognition & Classification Results
  • 100. Event Recognition and Classification: Susceptible, Infected and Recovered
    “The signs suggested the 27 pigs could be suffering from foot and mouth disease (FMD) in Island of Anglesey, Wales as reported on 2/21/2001”
    “The UK Ministry of Agriculture confirmed on 2/20/2001 that 27 pigs found with vesicles in an abattoir near Brentwood, Essex, have Foot and Mouth Disease”
    “Almost 2000 cattle and more than 15 000 sheep, have been or are waiting to be slaughtered since the resurgence of the disease in Northumberland as reported on 8/31/2001”
  • 101. Disease-related Event Recognition and Classification (2)
    Event Recognition and Classification in Predictive Epidemiology Domain
  • 102. ENTITY EXTRACTION
        Document 1, sentences s11, s12: The signs suggested the 27 pigs[SP] could be suffering from foot and mouth disease[DIS] in Anglesey, Wales[LOC]. It was reported on 02/18/01[DATE].
        Document 2, sentence s21: The UK Ministry of Agriculture confirmed on 2/20/01[DATE] that 27 pigs[SP] found with vesicles in an abattoir near Brentwood, Essex[LOC] have FMD[DIS].
        Document 3, sentence s31: Almost 2000 cattle[SP] are waiting to be slaughtered on 02/28/2001[DATE] since the resurgence of FMD[DIS] in Northumberland[LOC].
        ……
    EVENT TUPLE GENERATION
        e11 = [27 pigs, FMD, ?, Anglesey, Wales, “suggest”]
        e12 = [?, ?, 02/18/01, ?, “report”]
        e21 = [27 pigs, FMD, 2/20/01, Brentwood, Essex, “confirm”]
        e31 = [2000 cattle, FMD, 2/28/01, Northumberland, “slaughter”]
    EVENT TUPLE CLASSIFICATION
        Susceptible, Infected, Recovered
    EVENT TUPLE AGGREGATION
        E1 = [27 pigs, FMD, 02/18/01, Anglesey, Wales, Susceptible]
        E2 = [27 pigs, FMD, 2/20/01, Brentwood, Essex, Infected]
        E3 = [2000 cattle, FMD, 2/28/01, Northumberland, Recovered]
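The classification step in this example can be sketched as a lookup from the trigger verb of each tuple to a class. The verb-to-class table below contains only the three verbs shown on the slide and is a hypothetical stand-in for the thesis' verb lists; tuples whose verb carries no class information (such as “report”) are returned as `None` so that they can be merged into a neighbouring tuple during aggregation.

```python
# Hypothetical mapping, derived only from the examples on the slide.
SIR_BY_VERB = {
    "suggest": "Susceptible",
    "confirm": "Infected",
    "slaughter": "Recovered",
}

def classify_event(event):
    """Replace the trailing trigger verb of a tuple with its SIR class."""
    *attributes, verb = event
    label = SIR_BY_VERB.get(verb)
    if label is None:
        return None  # verb gives no SIR evidence on its own
    return (*attributes, label)

e21 = ("27 pigs", "FMD", "2/20/01", "Brentwood, Essex", "confirm")
e31 = ("2000 cattle", "FMD", "2/28/01", "Northumberland", "slaughter")
e12 = (None, None, "02/18/01", None, "report")

print(classify_event(e21))
print(classify_event(e31))
print(classify_event(e12))
```

A fuller version would also consider negation, modality and tense, which the slides list as future work.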
  • 103. Modified Algorithm for Event Recognition & SIR Classification
  • 104. The spread of the foot-and-mouth disease outbreak in the UK, 2001
    118 ProMed-Mail reports [map legend: yellow - susceptible; red - infected; green - recovered]
  • 105. Summary
    The accuracy of event recognition depends entirely on the accuracy of the individual entity extractors
    Event aggregation and deduplication require more comprehensive heuristics and additional knowledge, for example co-reference resolution
    BioCaster
        950 disease-location pairs per month
        reported results: 887/950 correct disease-location pairs and 0.934 precision
    MedISys/PULS
        100 English-language documents with 156 events
        reported results: 0.88 precision
  • 106. Conclusions, Contributions and Future Work
    Summary: 1. Disease-related Document Classification; 2. Ontology-based Entity Extraction; 3. Entity Extraction using Sequence Labeling; 4. Event Recognition and Classification
  • 107. Conclusions (1)
    Disease-related Document Classification:
        applied a supervised framework and considered different feature representations for the documents and machine learning classification algorithms;
        evaluated the document relevance classification using binary features, keyword term frequency and the “bag-of-words” representation using word unigrams and bigrams.
    Our experimental results demonstrate the effectiveness of our text categorization component in the designed framework for epidemiological analytics.
  • 108. Conclusions (2)
    Ontology-based Domain-specific Entity Extraction:
        used semantic relationship extraction based on syntactic patterns and POS tagging to construct an ontology;
        compared the automatically-constructed ontology obtained using our relationship extraction approach with an ontology constructed using the GoogleSets expansion approach;
        compared the entities extracted using all ontologies in terms of precision and recall, and reported F-measure.
    The results show that our semantic relationship extraction approach brings new knowledge to an initial ontology and, therefore, boosts the domain-specific biomedical entity extraction results.
  • 109. Conclusions (3)
    Entity Extraction using Sequence Labeling Approach:
        extracted syntactic word-level features (capitalization, POS tagging, abbreviations, term frequency) using different sliding windows;
        evaluated our approach using different machine learning algorithms together with different feature representations and various window sizes;
        reported results of the domain-specific entity extraction in terms of F-measure, precision and recall and compared them with the results from the other surveillance systems.
  • 110. Conclusions (4)
    Event Recognition and Classification:
        presented our novel sentence-based approach;
        applied several lists of verbs for confirmation status extraction, including WordNet and GoogleSets;
        used the DUCView tool to calculate scores for automatically generated event tuples, which can be seen as a measure of accuracy of our approach.
    The highest accuracy was obtained using a WordNet augmented list of verbs.
    Applied the event recognition and classification approach to the predictive epidemiology domain.
  • 111. Contributions
    Paper “Computational Knowledge and Information Management in Veterinary Epidemiology”, IEEE Intelligence and Security Informatics Conference (ISI'10), 23-26 May 2010, Vancouver, BC, Canada
    Paper “Animal Disease Event Recognition and Classification”, First International Workshop on Web Science and Information Exchange in the Medical Web (MedEx'10), WWW Conference, 26-30 April 2010, Raleigh, NC, USA
    Paper “Boosting Biomedical Entity Extraction by Using Syntactic Patterns for Semantic Relation Discovery” (to appear), 2010 IEEE/WIC/ACM International Conference on Web Intelligence (WI'10), August 31 - September 3, York University, Toronto, Canada
    Poster “Named Entity Recognition and Tagging in the Domain of Epizootics”, Women in Machine Learning Workshop (WiML'09), 6-7 Dec 2009, Vancouver, Canada
    ACM Poster Presentation Competition “Automated Event Extraction and Named Entity Recognition in the Domain of Veterinary Medicine”, 2010 Grace Hopper Celebration of Women in Computing (GHC'10), September 28 - October 1, Atlanta, Georgia, USA
  • 112. Future Work
    Domain-specific Entity Extraction
        add automated multilingual ontology construction for the domain of veterinary medicine using other semi-structured sources, e.g., Wikipedia.
    Automated Ontology Construction
        extend our semantic relationship extraction approach to other domains and other generalized named entities.
    Event Recognition and Classification
        apply deeper syntactic analysis of the sentence and part-of-speech tagging in addition to the list of verbs;
        consider negation words, modal words and tense;
        integrate coreference resolution functionality.
  • 113. Acknowledgments
    Faculty: Dr. William H. Hsu, Dr. Doina Caragea, Dr. Gurdip Singh
    KDD Lab alumni: Tim Weninger (crawler deployment) and Jing Xia (rule-based event extraction)
    KDD Lab assistants:
        Information Extraction Team: John Drouhard, Landon Fowles, Swathi Bujuru
        Spatial Data Mining Team: Wesam Elshamy, Andrew Berggren
        Topic Detection & Tracking Team: Danny Jones, Srinivas Reddy
  • 114. Thank you!
    Svitlana Volkova, svitlana.volkova@gmail.com
    http://people.cis.ksu.edu/~svitlana