Unlocking knowledge in biodiversity
legacy literature through automatic
semantic metadata extraction
Riza Batista-Navarro, William Ulate, Jennifer
Hammock, Georgios Kontonatsios, Trish
Rose-Sandler and Sophia Ananiadou
Structured Data ← Text Mining
http://miningbiodiversity.org
The partners
Social Media Lab
10/9/2015 Mining Biodiversity
Mining Biodiversity
• Transform BHL into a next-generation social
digital library
• A multi-disciplinary approach
– Text Mining
– Machine learning
– History of Science
– Environmental History & Studies
– Library and Information Science
– Social Media
What do we want to do?
Social Media
Visualisation
Semantic
Metadata
Biodiversity Heritage Library
• a consortium of botanical and natural history
libraries
• stores digitised legacy literature on
biodiversity
• currently holds 160,000 volumes = millions of
pages (PDFs and OCR-generated text)
• open-access
Current features
• supports keyword-based search
• species names annotated and linked to the
Encyclopedia of Life
• integrates automatic taxonomic name finding
tools (uBio Taxonfinder)
• data access through export functionalities and
Web services
Keyword-based search
and Browsing
Advanced search
(also keyword-based)
What’s wrong with keyword-based search?
• Ambiguity!
– Boxwood: historic place in Alabama? North American term for plants in the Buxaceae family?
– Box: container? the name for boxwood in other English-speaking countries?
– California bay: hardwood tree? location?
– Drum: musical instrument? fish?
– Emperor: fish? person?
– Scrambled eggs: food? plant?
Semantic metadata generation
• Entity types
– species
– location
– habitat
– anatomical parts
– qualities
– persons
– temporal expressions
• Association types
– observation
– habitation
– nutrition
– trait
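The entity and association types listed above could be represented with a small data model. The following is only a sketch: the class names, the `mention` helper and the example sentence are assumptions made for illustration, not the project's actual schema.

```python
from dataclasses import dataclass
from enum import Enum


class EntityType(Enum):
    SPECIES = "species"
    LOCATION = "location"
    HABITAT = "habitat"
    ANATOMICAL_PART = "anatomical part"
    QUALITY = "quality"
    PERSON = "person"
    TEMPORAL_EXPRESSION = "temporal expression"


class AssociationType(Enum):
    OBSERVATION = "observation"
    HABITATION = "habitation"
    NUTRITION = "nutrition"
    TRAIT = "trait"


@dataclass(frozen=True)
class Entity:
    type: EntityType
    start: int  # character offset where the mention begins
    end: int    # character offset just past the mention
    text: str


@dataclass(frozen=True)
class Association:
    type: AssociationType
    arguments: tuple  # the Entity objects linked by this association


def mention(sentence, text, entity_type):
    """Hypothetical helper: locate `text` in `sentence` and wrap it as an Entity."""
    start = sentence.index(text)
    return Entity(entity_type, start, start + len(text), text)


# Invented example sentence, used only to show the shape of the annotations.
sentence = "The California bay grows in moist canyons."
species = mention(sentence, "California bay", EntityType.SPECIES)
habitat = mention(sentence, "moist canyons", EntityType.HABITAT)
habitation = Association(AssociationType.HABITATION, (species, habitat))
```

Storing character offsets alongside the type keeps each annotation tied to its exact position in the text, which is what a viewer would need to highlight mentions.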
Examples of semantic metadata (annotations)
• Observation
• Habitation
• Nutrition
• Trait
How does semantic information help?
• California bay: hardwood tree or location?
– SPECIES: California bay
– LOCATION: California bay
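To make the SPECIES versus LOCATION distinction concrete, here is a minimal sketch of a search index keyed by entity type as well as by term. The document ids, the annotations and the function names are all hypothetical; the real system stores its annotations in a search index rather than in a Python dictionary.

```python
from collections import defaultdict

# Hypothetical annotated documents: doc id -> list of (entity type, mention).
annotations = {
    "doc1": [("SPECIES", "California bay")],
    "doc2": [("LOCATION", "California bay")],
    "doc3": [("SPECIES", "drum"), ("LOCATION", "California bay")],
}


def build_index(annotations):
    """Map each (entity type, lower-cased mention) pair to the documents containing it."""
    index = defaultdict(set)
    for doc_id, mentions in annotations.items():
        for entity_type, text in mentions:
            index[(entity_type, text.lower())].add(doc_id)
    return index


def search(index, text, entity_type=None):
    """Typed search when entity_type is given; plain keyword behaviour otherwise."""
    text = text.lower()
    if entity_type is not None:
        return sorted(index.get((entity_type, text), set()))
    hits = set()
    for (_, indexed_text), doc_ids in index.items():
        if indexed_text == text:
            hits |= doc_ids
    return sorted(hits)


index = build_index(annotations)
species_hits = search(index, "California bay", "SPECIES")
location_hits = search(index, "California bay", "LOCATION")
keyword_hits = search(index, "California bay")
```

Without a type, the query behaves like keyword search and returns every document that mentions the term in any sense; with a type, only documents annotated in that sense come back.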
Text mining-based approach
• Seed documents → Learn semantics
• Unlabelled documents → Annotate
• Annotator/Curator → Validate → Feedback (to Learn semantics)
• Annotate → Store → Search index
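The learn, annotate, validate and feedback cycle on this slide can be sketched as a loop. This is a toy illustration under stated assumptions: the "model" simply memorises term labels, the curator is simulated by a lookup table, and all terms and labels are invented. It is not the project's learning algorithm.

```python
def train(labelled):
    """Toy stand-in for model training: memorise term -> label."""
    return dict(labelled)


def annotate(model, terms):
    """Label each term with what the model knows, or UNKNOWN."""
    return {term: model.get(term, "UNKNOWN") for term in terms}


def validate(predictions, curator):
    """Simulated curator: return the correct label for every predicted term."""
    return {term: curator[term] for term in predictions}


# Hypothetical seed annotations and simulated curator knowledge.
seed = {"boxwood": "SPECIES"}
curator = {"boxwood": "SPECIES", "alabama": "LOCATION", "drum": "SPECIES"}
batches = [
    ["alabama", "boxwood"],
    ["alabama", "boxwood", "drum"],
    ["alabama", "drum"],
]

labelled = dict(seed)
model = train(labelled)
history = []  # accuracy of the model on each batch before correction
for batch in batches:
    predictions = annotate(model, batch)
    corrected = validate(predictions, curator)
    correct = sum(predictions[term] == corrected[term] for term in batch)
    history.append(correct / len(batch))
    labelled.update(corrected)  # feed the corrections back
    model = train(labelled)     # learn again from the enlarged labelled set
```

In this toy run the per-batch accuracy rises as corrections accumulate, which is the point of feeding validated annotations back into training.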
Automatic annotation by text mining (TM)
• Argo
– Web-based, graphical TM workbench
– conforms to the Unstructured Information Management Architecture (UIMA) standard
– facilitates the straightforward integration of various analytics into workflows
– allows for the validation of annotations
Argo interface
Learning semantics
• Training of models using machine learning
– conditional random fields (CRFs) for sequence
labelling
– learning the features of mentions and relations of
interest based on labelled documents
• contextual features: surrounding, co-occurring words
• dictionary matches: presence of certain words in
controlled vocabularies, e.g., Catalogue of Life,
Phenotype and Trait Ontology, Gazetteer
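The two feature families named above, contextual features and dictionary matches, might be computed per token along these lines. The feature names, the tiny vocabularies and the example tokens are assumptions for illustration; the feature set actually used to train the CRF models is not given in the slides.

```python
def token_features(tokens, i, dictionaries):
    """Build a feature dictionary for the token at position i."""
    word = tokens[i]
    features = {
        "word.lower": word.lower(),
        "word.istitle": word.istitle(),
        # Contextual features: the surrounding words.
        "prev.lower": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next.lower": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }
    # Dictionary matches: is the word present in a controlled vocabulary?
    for name, vocabulary in dictionaries.items():
        features["in_dict." + name] = word.lower() in vocabulary
    return features


# Hypothetical stand-ins for resources such as a species catalogue or a gazetteer.
dictionaries = {
    "species": {"boxwood", "drum"},
    "gazetteer": {"alabama"},
}
tokens = ["Boxwood", "grows", "in", "Alabama"]
feature_sequence = [
    token_features(tokens, i, dictionaries) for i in range(len(tokens))
]
```

A sequence labeller such as a CRF is trained on one feature dictionary per token, paired with that token's label, so the label of each token can depend on its neighbours.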
Argo interface
Annotation workflow
• Pre-processing
• Dictionary lookup
• Machine learning-based recognition
• Relation extraction
• Saving
Validation interface
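The workflow stages on this slide could be chained roughly as follows. This is a heavily simplified sketch: the vocabularies and function names are invented, the machine learning-based recognition stage is omitted, and relations are proposed by plain sentence-level co-occurrence rather than by any linguistic analysis. Treating a species and a location in the same sentence as an observation is purely illustrative.

```python
import json
import re

# Hypothetical controlled vocabularies.
SPECIES = {"boxwood"}
LOCATIONS = {"alabama"}


def preprocess(text):
    """Naive sentence splitting and tokenisation."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return [re.findall(r"\w+", sentence) for sentence in sentences]


def dictionary_lookup(tokens):
    """Tag tokens found in the vocabularies with an entity type."""
    entities = []
    for token in tokens:
        lowered = token.lower()
        if lowered in SPECIES:
            entities.append(("SPECIES", token))
        elif lowered in LOCATIONS:
            entities.append(("LOCATION", token))
    return entities


def extract_relations(entities):
    """Pair every species with every location found in the same sentence."""
    species = [text for kind, text in entities if kind == "SPECIES"]
    locations = [text for kind, text in entities if kind == "LOCATION"]
    return [("observation", s, loc) for s in species for loc in locations]


def run_workflow(text):
    """Run all stages and serialise the annotations as JSON (the saving step)."""
    results = []
    for tokens in preprocess(text):
        entities = dictionary_lookup(tokens)
        results.append({
            "tokens": tokens,
            "entities": entities,
            "relations": extract_relations(entities),
        })
    return json.dumps(results)


output = json.loads(run_workflow("Boxwood was collected in Alabama. It flowers early."))
```

Keeping each stage as a separate function mirrors the idea of interconnecting components into a pipeline: a stage can be swapped out without touching the others.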
Enhanced searching of BHL content
• Faceted search
• Automatically generated questions
• Time-sensitive search
Enhanced document viewing
• Page in PDF/image format
• OCR-corrected text with colour-coded annotations
Conclusions
• Literature is a rich source of information but
difficult to search
• Keyword-based search not enough to address
ambiguity
• Semantic metadata allows for more accurate
searching
• Semantic metadata can be extracted using text
mining tools
• The Argo text mining workbench facilitates the
construction of custom semantic metadata
generation workflows

Editor's Notes

  • #3: Most of us in the biodiversity informatics community are reliant on curated databases such as EOL (click) and NCBI Taxonomy (click). Indeed, they are some of the most fundamental sources of structured information that is critical to understanding biodiversity (click) Another rich, albeit less exploited resource is biodiversity literature (click) which provides possibly even more comprehensive information, considering that any significant findings have most likely been published in one form of writing or another: in reports, articles, books or monographs. However, unlike curated databases which provide information in a structured, readily computable form, literature collections are characterised by copious textual data expressed in natural language. This unstructured and voluminous nature of literature makes it difficult to find information of interest, thus posing a barrier to knowledge accessibility and discovery (click). As many of you know, the Biodiversity Heritage Library or BHL holds the biggest literature collection on biodiversity. In this talk, I will be describing our work on how we are extracting semantic content from BHL and putting it in a structured form that is a lot easier to access and search (click), and how we’re using text mining as the enabling technology for this (click).
  • #4: We are doing this work as part of a project funded by the transatlantic Digging Into Data program called Mining Biodiversity.
  • #7: In a nutshell, we have incorporated into BHL three elements, as part of the Mining Biodiversity project: Visualisation, Social Media and Semantic Metadata. The rest of this talk will be focussing on the extraction of semantic metadata aspect (click).
  • #12: One might say, “I’m perfectly happy with how I’m searching BHL. What’s wrong with keywords?” The answer is ambiguity! If a user searches for “Boxwood”, a keyword-based system won’t know whether he/she is referring to a place in Alabama or to the North American term for plants in the Buxaceae family. It will just return all documents pertaining to both. Nor will it know whether the query “Box” refers to the same plant family, since this is apparently what other English-speaking countries call it, or to a container.
  • #13: Or “California bay”. A keyword-based system will not know if the user is referring to the hardwood tree or some location. What about “Drum”? Is it a fish or a musical instrument?
  • #14: “Emperor” too. It wouldn’t know if the user wants the fish or a person. Even “Scrambled eggs”. Is it breakfast or the plant known as such?
  • #15: To alleviate such issues we are enriching BHL content with semantic metadata. To this end, we are marking up mentions of different entity and association types within text. For entities, we are capturing species, locations, habitats, anatomical parts, qualities, people and temporal expressions. To capture associations, we link up these entities to encapsulate relationships such as observation, habitation and nutrition.
  • #18: So why does semantic information help? With semantic categorisation of terms, for example, if a user specified that he/she is looking for California bay in the SPECIES sense of the term, the system knows it should look for documents which contain a species entity of that name. And if the user specifies he/she is looking for a LOCATION called California bay, then similarly the system knows it should look for documents in which “California bay” has been annotated as a name of a place or location.
  • #19: In fleshing out the semantics from BHL documents, we took a text mining-based approach, the overall architecture of which is depicted in this figure (click). Firstly, we set aside a seed set of documents which were manually annotated (click). This set was used by our system to learn the semantics, i.e., entities and associations, in the documents (click). The system then applies what it learns on unlabelled documents (click). The annotations the system produces on these documents are then validated manually by an expert (click). Whatever corrections the expert makes are fed back into the system and are used by the system to learn again, in order to improve itself. (Active Learning) When the performance of the system is satisfactory, we run the final version of the system on the whole BHL collection and (click) store all of the generated annotations or semantic metadata in a search index, e.g., Solr. This index is what we’re using to complement the bibliographic metadata in BHL.
  • #21: This is Argo’s main interface. Argo comes with a library of various text mining components, which you can see on the left panel. Basically, these components can be dragged and dropped to the canvas in the middle which serves as a block diagramming tool. The user can then arrange these components according to the desired order of processing, and interconnect them to form a pipeline or workflow.
  • #22: What did we mean earlier by “learning semantics”? How does the text mining system or Argo workflow do this?
  • #23: This is Argo’s main interface. Argo comes with a library of various text mining components, which you can see on the left panel. Basically, these components can be dragged and dropped to the canvas in the middle which serves as a block diagramming tool. The user can then arrange these components according to the desired order of processing, and interconnect them to form a pipeline or workflow.
  • #24: This is the workflow that we put together using Argo. Without going too much into detail, I will just point out the general types of processing it tries to do: pre-processing (sentence splitting, tokenisation and part-of-speech tagging), matching against dictionaries or controlled vocabularies such as the ENVO and PATO ontologies, machine learning-based recognition of entities, extraction of relations based on the results of dependency parsing, and serialisation of the generated annotations.
  • #25: Additionally, Argo allows users to validate or correct any of the automatically generated annotations.