Taxonomic classification of
digitized specimens using
machine learning
Rutger Vos
Taxonomic classification¹
of digitized specimens²
using machine learning³
1.  To give the right taxonomic name to a thing, or at least
approximate it to a higher level (e.g. Genus, Family)
2.  Photographs of biological objects, e.g. from a natural
history collection and taken in a standardized setup
3.  Machine learning explores the study and construction of
algorithms that can learn from and make predictions on
data
Case study: slipper orchids
Slipper orchids
•  Traded illegally
•  Photographed “in the wild”
Case study: Javanese butterflies
Van Groenendael-Krijger collection
•  Collected in the 1930s
•  Photographed in standardized setup
Project structure overview
•  Open source, freely
available at:
github.com/naturalis
•  Designed as loosely
coupled, swappable
modules
•  Intended for re-use for
multiple cases
Project structure: reference images
photos [table]
    id INTEGER NOT NULL
    md5sum VARCHAR(32) NOT NULL
    path VARCHAR(255)
    title VARCHAR(100)
    description VARCHAR(255)
photos_tags [table]
    photo_id INTEGER NOT NULL
    tag_id INTEGER NOT NULL
tags [table]
    id INTEGER NOT NULL
    name VARCHAR(50) NOT NULL
photos_taxa [table]
    photo_id INTEGER NOT NULL
    taxon_id INTEGER NOT NULL
taxa [table]
    id INTEGER NOT NULL
    rank_id INTEGER NOT NULL
    name VARCHAR(50) NOT NULL
    description VARCHAR(255)
ranks [table]
    id INTEGER NOT NULL
    name VARCHAR(50) NOT NULL
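The tables above can be tried out with a minimal sketch in Python's built-in sqlite3 module. Table names, column names and types follow the slide. The PRIMARY KEY and REFERENCES clauses, the helper taxa_for_photo and all example rows are assumptions added for illustration, not taken from the project.

```python
import hashlib
import sqlite3

# Column names and types follow the slide. PRIMARY KEY and REFERENCES
# clauses are assumptions added for illustration.
SCHEMA = """
CREATE TABLE ranks (
    id INTEGER NOT NULL PRIMARY KEY,
    name VARCHAR(50) NOT NULL
);
CREATE TABLE taxa (
    id INTEGER NOT NULL PRIMARY KEY,
    rank_id INTEGER NOT NULL REFERENCES ranks(id),
    name VARCHAR(50) NOT NULL,
    description VARCHAR(255)
);
CREATE TABLE photos (
    id INTEGER NOT NULL PRIMARY KEY,
    md5sum VARCHAR(32) NOT NULL,
    path VARCHAR(255),
    title VARCHAR(100),
    description VARCHAR(255)
);
CREATE TABLE tags (
    id INTEGER NOT NULL PRIMARY KEY,
    name VARCHAR(50) NOT NULL
);
CREATE TABLE photos_tags (
    photo_id INTEGER NOT NULL REFERENCES photos(id),
    tag_id INTEGER NOT NULL REFERENCES tags(id)
);
CREATE TABLE photos_taxa (
    photo_id INTEGER NOT NULL REFERENCES photos(id),
    taxon_id INTEGER NOT NULL REFERENCES taxa(id)
);
"""


def build_db():
    """Create an in-memory database holding the six tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def taxa_for_photo(conn, photo_id):
    """Return (rank name, taxon name) pairs linked to one photo (hypothetical helper)."""
    rows = conn.execute(
        """SELECT ranks.name, taxa.name
           FROM photos_taxa
           JOIN taxa ON taxa.id = photos_taxa.taxon_id
           JOIN ranks ON ranks.id = taxa.rank_id
           WHERE photos_taxa.photo_id = ?
           ORDER BY ranks.id""",
        (photo_id,),
    )
    return rows.fetchall()


conn = build_db()
# Placeholder rows: the names below are invented, not real reference data.
conn.executemany("INSERT INTO ranks (id, name) VALUES (?, ?)",
                 [(1, "genus"), (2, "species")])
conn.executemany("INSERT INTO taxa (id, rank_id, name) VALUES (?, ?, ?)",
                 [(1, 1, "ExampleGenus"), (2, 2, "ExampleGenus exemplum")])
checksum = hashlib.md5(b"placeholder image bytes").hexdigest()
conn.execute("INSERT INTO photos (id, md5sum, path) VALUES (?, ?, ?)",
             (1, checksum, "example/photo_0001.jpg"))
conn.executemany("INSERT INTO photos_taxa (photo_id, taxon_id) VALUES (?, ?)",
                 [(1, 1), (1, 2)])
print(taxa_for_photo(conn, 1))
```

Linking a photo to one taxon per rank through photos_taxa is what would allow an identification to fall back from species to genus.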
Project structure: image processing
Speeded Up Robust Features
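SURF owes much of its speed to integral images (summed-area tables), which give the sum over any rectangular image region in constant time. The sketch below shows only that building block in plain Python, on a made-up 3x3 image. It is not the project's image processing code, and the function names are hypothetical.

```python
def integral_image(img):
    """Return a summed-area table with one extra leading row and column of zeros."""
    height, width = len(img), len(img[0])
    sat = [[0] * (width + 1) for _ in range(height + 1)]
    for y in range(height):
        row_sum = 0
        for x in range(width):
            row_sum += img[y][x]
            # Sum of everything above this row, plus this row up to column x.
            sat[y + 1][x + 1] = sat[y][x + 1] + row_sum
    return sat


def box_sum(sat, top, left, bottom, right):
    """Sum of pixels in rows top..bottom-1 and columns left..right-1."""
    return sat[bottom][right] - sat[top][right] - sat[bottom][left] + sat[top][left]


# Made-up pixel intensities, for illustration only.
image = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]
sat = integral_image(image)
print(box_sum(sat, 1, 1, 3, 3))
```

Each box sum costs four table lookups regardless of the box size, which keeps box filters cheap at every scale.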
Project structure: machine learning
Project structure: optimization
Project structure: user interface
Results: SURF features
•  PCA plots of the “speeded up robust
features” show clustering both at the
genus (top) and species (bottom) level
•  Some species are so dimorphic that
the sexes are treated as separate
species (not shown)
•  Some individuals are
“gynandromorphic”, though these are likely
over-represented through positive collection bias
•  Some taxa are much more variable
than others
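A PCA plot of this kind projects each specimen's feature vector onto the two directions of greatest variance. The following is a minimal pure-Python sketch using power iteration with deflation. The feature values are invented placeholders rather than SURF output, and the slides do not say which PCA implementation was used.

```python
import math


def center(rows):
    """Subtract the per-column mean from every row."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]


def covariance(centered):
    """Sample covariance matrix of already-centered rows."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]


def top_component(cov, iterations=200):
    """Approximate the dominant eigenpair of a symmetric matrix by power iteration."""
    d = len(cov)
    v = [1.0 + i for i in range(d)]  # arbitrary deterministic start vector
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:  # no variance left; v stays arbitrary
            break
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d)) for i in range(d))
    return eigenvalue, v


def pca(rows, n_components=2):
    """Return (components, scores) for the leading principal components."""
    centered = center(rows)
    cov = covariance(centered)
    d = len(cov)
    components = []
    for _ in range(n_components):
        eigenvalue, v = top_component(cov)
        components.append(v)
        # Deflation: remove the variance explained by the component just found.
        cov = [[cov[i][j] - eigenvalue * v[i] * v[j] for j in range(d)]
               for i in range(d)]
    scores = [[sum(r[j] * c[j] for j in range(d)) for c in components]
              for r in centered]
    return components, scores


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Invented three-dimensional feature vectors, one row per specimen.
features = [
    [2.0, 0.0, 1.0],
    [4.0, 1.0, 1.2],
    [6.0, 0.0, 0.9],
    [8.0, 1.0, 1.1],
    [10.0, 0.5, 1.0],
]
components, scores = pca(features)
for row in scores:
    print(row)  # two coordinates per specimen, ready for a scatter plot
```

Colouring the resulting points by genus or by species is what would reveal clustering at either level.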
Results: k-fold cross-validation
•  Split the data into k (2, 5, 10) partitions
•  Train on k-1 partitions, use the remaining 1 as “out-of-sample” data
•  Count number of correct/incorrect/unknown identifications
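The procedure can be sketched in its conventional form, where each partition is held out once and the remaining k-1 partitions are used for training. The nearest-centroid classifier with a distance cut-off is a hypothetical stand-in for the trained neural network, and the one-dimensional samples are invented. Only the correct/incorrect/unknown tally mirrors the slide.

```python
import random
from collections import Counter


def k_fold_indices(n_items, k, seed=0):
    """Shuffle item indices and deal them into k near-equal folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(samples, labels, k, train, predict):
    """Hold out each fold once and tally correct/incorrect/unknown predictions."""
    tally = Counter()
    for held_out in k_fold_indices(len(samples), k):
        held = set(held_out)
        train_idx = [i for i in range(len(samples)) if i not in held]
        model = train([samples[i] for i in train_idx],
                      [labels[i] for i in train_idx])
        for i in held_out:
            guess = predict(model, samples[i])
            if guess is None:
                tally["unknown"] += 1
            elif guess == labels[i]:
                tally["correct"] += 1
            else:
                tally["incorrect"] += 1
    return tally


def train_centroids(xs, ys):
    """Hypothetical stand-in classifier: store the mean value per label."""
    sums, counts = {}, Counter()
    for x, y in zip(xs, ys):
        sums[y] = sums.get(y, 0.0) + x
        counts[y] += 1
    return {y: sums[y] / counts[y] for y in sums}


def predict_centroid(model, x, max_distance=2.0):
    """Return the nearest label, or None ("unknown") if nothing is close enough."""
    label, centre = min(model.items(), key=lambda item: abs(item[1] - x))
    return label if abs(centre - x) <= max_distance else None


# Invented one-dimensional "features" for two placeholder taxa, A and B.
samples = [0.0, 0.2, 0.4, 0.6, 0.8, 10.0, 10.2, 10.4, 10.6, 10.8]
labels = ["A"] * 5 + ["B"] * 5

results = {k: cross_validate(samples, labels, k, train_centroids, predict_centroid)
           for k in (2, 5, 10)}
for k, tally in results.items():
    print(k, dict(tally))
```

Returning None for distant samples is one way to produce an "unknown" outcome instead of forcing a wrong name.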
Next steps
•  Application of trained neural networks to the entire
Van Groenendael-Krijger Stichting (VGKS) collection (once that is fully digitized)
•  Testing other classifiers in addition to artificial neural networks (ANNs)
•  Improvement of the end user interface, possibly
as a native ‘app’ or on the web
•  Extension of the platform to additional cases,
such as shells (snails, bivalves)
•  Do more with the image feature data: mimicry,
character displacement, dimorphism
Acknowledgements
Naturalis sector Collection
•  Max Caspers
•  Luc Willemse
•  Jan Moonen
•  Digitization volunteers
Hogeschool Leiden
•  Barbara Gravendeel
•  Patrick Wijntjes
•  Saskia de Vetter
LIACS
•  Fons Verbeek
•  Mengke Li
•  Yuanhao Guo
IBL
•  Wim van Tongeren
WUR
•  Feia Matthijssen
Made possible by
•  Naturalis internal grant for
application-oriented research
•  The Van Groenendael-Krijger
Stichting
•  Kind contributions of photos by
numerous orchid breeders
Thanks for
listening!
