Data quality 
Towards a common validator 
Christian Gendreau, Anne Bruneau, David Shorthouse 
Université de Montréal, Biodiversity Centre
What is data quality? 
● Relative 
● Fitness for use 
o Coordinate precision for distribution 
o Hierarchy not provided 
● What – When – Where
Examples 
● 2008 VI 13, 2008-06-13, 13-06-2008, June 13 2008, 13 junho 2008 
● Canada, Québec, Montréal, -73.55399 45.508669 
● Narwalus microcephalus => Monodon monoceros 
Public Domain: Freshwater and Marine Image Bank, University of Washington Libraries Digital Collections
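
The date variants above all denote the same day. A minimal Python sketch of how a validator might normalise them to ISO 8601 follows. The function names, the lookup tables and the day-first reading of 13-06-2008 are assumptions made for illustration; none of this is taken from an existing tool.

```python
import re
from datetime import date
from typing import Optional

# Hypothetical lookup tables covering only the examples above; a real
# validator would load month names from a user-defined dictionary.
ROMAN_MONTHS = {"I": 1, "II": 2, "III": 3, "IV": 4, "V": 5, "VI": 6,
                "VII": 7, "VIII": 8, "IX": 9, "X": 10, "XI": 11, "XII": 12}
MONTH_NAMES = {"june": 6, "junho": 6}  # English, Portuguese

# Named groups let one code path read every layout.
PATTERNS = [
    re.compile(r"(?P<y>\d{4})-(?P<m>\d{1,2})-(?P<d>\d{1,2})"),    # 2008-06-13
    re.compile(r"(?P<d>\d{1,2})-(?P<m>\d{1,2})-(?P<y>\d{4})"),    # 13-06-2008 (day first, assumed)
    re.compile(r"(?P<y>\d{4}) (?P<m>[IVX]+) (?P<d>\d{1,2})"),     # 2008 VI 13
    re.compile(r"(?P<m>[^\W\d_]+) (?P<d>\d{1,2}) (?P<y>\d{4})"),  # June 13 2008
    re.compile(r"(?P<d>\d{1,2}) (?P<m>[^\W\d_]+) (?P<y>\d{4})"),  # 13 junho 2008
]


def month_number(token: str) -> Optional[int]:
    """Map a numeric, Roman or named month token to 1..12, or None."""
    if token.isdigit():
        return int(token)
    if token in ROMAN_MONTHS:
        return ROMAN_MONTHS[token]
    return MONTH_NAMES.get(token.lower())


def to_iso(raw: str) -> Optional[str]:
    """Return the date as YYYY-MM-DD, or None if it cannot be interpreted."""
    text = " ".join(raw.split())  # collapse runs of whitespace
    for pattern in PATTERNS:
        match = pattern.fullmatch(text)
        if not match:
            continue
        month = month_number(match["m"])
        if month is None:
            continue
        try:
            return date(int(match["y"]), month, int(match["d"])).isoformat()
        except ValueError:  # e.g. month 13 or day 40
            return None
    return None
```

Returning None instead of guessing keeps ambiguous or impossible values visible, so they can be reported as potential issues and not silently rewritten.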
Data Quality Information Chain 
Courtesy of Arthur D. Chapman
Brief History 
● First Canadensys Explorer/Harvester (2012) 
● narwhal-processor (2013) 
● TDWG 2013 
o DQ Interest Group 
o Presentation about our plan 
o Discussion with GBIF
Why do we need a validator? 
● Identify and quantify potential issues 
● DarwinCore is permissive 
o DarwinCore itself can change 
● Records and technologies will always evolve
What should a validator allow? 
● Define a validation scope 
o at the source (e.g. collection) 
o national node 
o aggregator 
o GBIF
Validator - Expected design 
● Modular, scalable, reusable 
● Customizable 
o per configuration/extension 
o use a user-defined dictionary 
● Validation Chain
Current Options 
● GBIF validator 
● CRIA tools 
● ALA tools 
Probably all organisations have their own tools.
dwca-validator 
● Starting from previous GBIF validator 
● Building a community project 
● Provide framework for Biodiversity Data Quality Interest Group (TDWG) 
https://github.com/gbif/dwca-validator
Vision 
• Library 
o Core module, reusable (e.g. IPT) 
• Web 
o Send archive, view report 
• narwhal-processor 
o Suggest interpreted value 
• Extensions 
o Domain knowledge / Quality index
Validation chain 
● Chain element 
o Self-contained (never relies on another chain element) 
o Ordering independent 
● Composed chain element (narwhal and extensions) 
o Wrap chain elements under a new element 
o Ordering possible between wrapped elements
Chain element example
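
As an illustration of the chain described above, here is a minimal Python sketch. It is not the dwca-validator API: the type alias, the function names and the message wording are all hypothetical. Each plain element inspects a record on its own, so the chain can run them in any order. A composed element wraps several elements and is the only place where ordering matters.

```python
from typing import Callable, Dict, List, Optional

Record = Dict[str, str]
# A chain element takes one record and returns a message, or None if it passes.
ChainElement = Callable[[Record], Optional[str]]


def require_scientific_name(record: Record) -> Optional[str]:
    """Self-contained element: looks only at scientificName."""
    if not record.get("scientificName", "").strip():
        return "scientificName is empty"
    return None


def latitude_in_range(record: Record) -> Optional[str]:
    """Self-contained element: looks only at decimalLatitude."""
    raw = record.get("decimalLatitude", "")
    if raw == "":
        return None  # a missing value is another element's concern
    try:
        value = float(raw)
    except ValueError:
        return f"decimalLatitude is not a number: {raw}"
    if not -90.0 <= value <= 90.0:
        return f"decimalLatitude out of range: {raw}"
    return None


def compose(*elements: ChainElement) -> ChainElement:
    """Composed element: wrapped elements run in order, stopping at the first failure."""
    def composed(record: Record) -> Optional[str]:
        for element in elements:
            message = element(record)
            if message is not None:
                return message
        return None
    return composed


def run_chain(chain: List[ChainElement], record: Record) -> List[str]:
    """Run every element of the chain and collect the messages."""
    messages = []
    for element in chain:
        message = element(record)
        if message is not None:
            messages.append(message)
    return messages
```

Because no element reads another element's output, reversing the chain changes only the order of the messages, never their content.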
Validation types 
● Structure 
o metadata 
o organization of data 
● Rows 
o dates, coordinates, ... 
● Columns 
o ID uniqueness
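
The three validation types differ mainly in how much of the archive a check needs to see. A minimal Python sketch under stated assumptions: rows are plain dictionaries keyed by Darwin Core term names, and the function names are hypothetical, not taken from an existing validator.

```python
from collections import Counter
from typing import Dict, Iterable, List

Row = Dict[str, str]


def missing_columns(header: Iterable[str], required: Iterable[str]) -> List[str]:
    """Structure check: which required columns are absent from the file header?"""
    present = set(header)
    return [name for name in required if name not in present]


def longitude_problem(row: Row) -> bool:
    """Row check: True if decimalLongitude is present but not a valid longitude."""
    raw = row.get("decimalLongitude", "")
    if raw == "":
        return False
    try:
        return not -180.0 <= float(raw) <= 180.0
    except ValueError:
        return True


def duplicate_ids(rows: Iterable[Row], id_field: str = "taxonID") -> List[str]:
    """Column check: IDs that occur more than once across the whole file."""
    counts = Counter(row.get(id_field, "") for row in rows)
    return sorted(value for value, n in counts.items() if n > 1)
```

A structure check needs only the header, a row check can run in a streaming fashion, and a column check must see every row before it can report, which matters when planning for large archives.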
Result Accumulator 
● Records validation results as they occur 
o ID/Validator/Context/ValidationType/Result/Message 
● Allows different views of result 
o Web view 
o Feed another application
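
A minimal Python sketch of such an accumulator, assuming one entry per check with the six fields listed above. The class names, method names and the PASSED/FAILED values are hypothetical and chosen for illustration only.

```python
import json
from collections import Counter
from dataclasses import asdict, dataclass
from typing import Dict, List


@dataclass(frozen=True)
class ValidationResult:
    # Field names follow ID/Validator/Context/ValidationType/Result/Message.
    record_id: str
    validator: str
    context: str
    validation_type: str
    result: str
    message: str


class ResultAccumulator:
    """Collects results as they occur and exposes them through different views."""

    def __init__(self) -> None:
        self._results: List[ValidationResult] = []

    def accumulate(self, result: ValidationResult) -> None:
        self._results.append(result)

    def summary(self) -> Dict[str, int]:
        """Human-oriented view: number of entries per result value."""
        return dict(Counter(r.result for r in self._results))

    def to_json(self) -> str:
        """Machine-oriented view: every entry, serialised for another application."""
        return json.dumps([asdict(r) for r in self._results])
```

Keeping the accumulator separate from the checks means a web report and a downstream application can read the same entries without the validators knowing about either.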
Current Status 
● Library with CLI (command line interface) 
● Basic evaluators and rules 
● Ready for contributions
Demo 
• Darwin Core Archive, Taxon Checklist 
o Invalid characters 
o Broken link synonym → accepted taxon 
Lynx canadensis, http://www.animalgalleries.org/
Future validations 
● Use semantic web (e.g. GeoNames) 
● Use external resolver (e.g. CoL) 
● Use more complex validation (e.g. climate layer)
Future validations 
Accommodate localisation vs misspellings 
● Brésil (fr) 
● Brazil (en) 
● Brasil (pt) 
● Brasilien (sv) 
● Brézil (??)
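
One possible way to separate the two cases is to look values up in a dictionary of known localised names first, and only then fall back to approximate matching. A minimal Python sketch, assuming a small hand-written dictionary and using the standard library's difflib. The function name, the labels and the 0.8 cutoff are illustrative choices, not tuned values.

```python
import difflib
from typing import Optional, Tuple

# Hypothetical user-defined dictionary: localised country names mapped
# to an ISO 3166-1 alpha-2 code. Only the names from the list above.
COUNTRY_NAMES = {
    "brésil": "BR",     # French
    "brazil": "BR",     # English
    "brasil": "BR",     # Portuguese
    "brasilien": "BR",  # Swedish
}


def classify_country(raw: str, cutoff: float = 0.8) -> Tuple[str, Optional[str]]:
    """Label a value as a known localised name, a possible misspelling, or unknown."""
    key = raw.strip().lower()
    if key in COUNTRY_NAMES:
        return ("localised", COUNTRY_NAMES[key])
    # Approximate match against the known names; cutoff is a similarity ratio in [0, 1].
    close = difflib.get_close_matches(key, list(COUNTRY_NAMES), n=1, cutoff=cutoff)
    if close:
        return ("possible misspelling", COUNTRY_NAMES[close[0]])
    return ("unknown", None)
```

With this dictionary, Brésil, Brazil, Brasil and Brasilien are accepted as localised names, while Brézil is only flagged as a possible misspelling. The validator suggests an interpreted value and leaves the decision to the data holder.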
Questions? 
Public Domain: robynm
Acknowledgements
Contact 
http://www.canadensys.net 
http://github.com/Canadensys 
@Canadensys
