Jo McEntyre, EMBL-EBI
Mining Data Availability Statements for GWAS data
GWAS and the GWAS Catalog
• GWAS analyse variants across the genome to identify loci associated with a disease or phenotype
Study metadata including:
- Trait
- Sample information
Publication information
Results
- Lead associations
- Summary statistics
GWAS Catalog data
GWAS Catalog content
As of October 2019
• 4,220 publications
• 7,661 studies
• 157,336 variant-trait associations
• 276 publications with summary statistics, >8,000 datasets
www.ebi.ac.uk/gwas
What is Europe PMC?
Europe PMC: a free digital archive of biomedical and life sciences research publications
Content in Europe PMC
Europe PMC is a partner in PubMed Central International
Text mining infrastructure
• Gene-disease relationships
• Mutations
• GeneRIFs
• Diseases and phenotypes
• Phosphorylation events
• Transcription factor-target interactions
• Organisms
• Gene/proteins
• GO terms
• ChEBI
• EFO
• Grants
• Accession numbers
Text mining platform: SciLite application
Accession numbers mined from full text publications
ELIXIR Core Data Resources and Deposition Databases
Cross-links between GWAS and Europe PMC
Data Availability statements in Europe PMC
<title> and XML path
Title                            XML path                      Frequency
Data Availability                article:front:notes           90,928
Data accessibility               article:back:sec              2,694
Data Availability                article:back:sec:fn-group     2,580
Data                             article:body:sec              2,265
Availability of supporting data  article:body:sec              1,593
Major datasets                   article:back:sec:sec          1,074
Database survey                  article:body:sec              986
Extended Data                    article:body:sec              851
Data availability                article:body:sec              795
Extended Data Figure 1           article:body:sec:SecTag:fig   689
Top 10 combinations of <title> content containing “data” and XML path
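A tally like the one above could be produced by walking each article's XML and recording where titled elements mentioning "data" sit. The sketch below is only an illustration under assumptions: the function name `data_title_paths` and the sample article are hypothetical, the path is taken to be the chain of element tags down to the parent of `<title>`, and the match is a case-insensitive substring test (which would also explain why "Database survey" appears in the table). It is not the Europe PMC pipeline.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def data_title_paths(xml_text):
    """Count (title, path) pairs for elements whose child <title> mentions 'data'."""
    root = ET.fromstring(xml_text)
    counts = Counter()

    def walk(elem, path):
        path = path + [elem.tag]
        title = elem.find("title")  # direct child <title> only
        if title is not None:
            # Collapse nested markup and whitespace into plain title text.
            text = " ".join("".join(title.itertext()).split())
            if "data" in text.lower():
                counts[(text, ":".join(path))] += 1
        for child in elem:
            walk(child, path)

    walk(root, [])
    return counts


# Hypothetical, heavily simplified article used only to exercise the function.
SAMPLE = """<article>
  <front><notes><title>Data Availability</title><p>All data are in the paper.</p></notes></front>
  <body><sec><title>Methods</title><p>Details.</p></sec></body>
  <back><sec><title>Data accessibility</title><p>Deposited in a repository.</p></sec></back>
</article>"""

counts = data_title_paths(SAMPLE)
for (title, path), n in counts.most_common():
    print(f"{title}\t{path}\t{n}")
```

Summing such counters over a whole corpus and taking `most_common(10)` would give a top-10 listing in the same shape as the table.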
Some unhelpful statements
GWAS and DAS
Curating papers for the GWAS catalog
GWAS Catalog literature identification:
Query based vs machine learning
             Query-based   Machine learning
Precision    6%            27%
Recall       100%          96%
Improved efficiency: 80% reduction in publications to review (average 144 to 30/week)
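The quoted figures can be checked with a little arithmetic. The sketch below uses only the numbers on the slide. Combining the evaluation percentages with the weekly workload is a back-of-envelope assumption, since the two come from different measurements and need not agree exactly; the dictionary and function names are illustrative.

```python
# Figures quoted on the slide (evaluation percentages and weekly curation workload).
QUERY = {"precision": 0.06, "recall": 1.00, "papers_per_week": 144}
ML = {"precision": 0.27, "recall": 0.96, "papers_per_week": 30}


def workload_reduction(before, after):
    """Fraction of weekly triage work removed."""
    return (before - after) / before


def eligible_per_week(method):
    """Rough count of eligible papers implied by precision x papers reviewed."""
    return method["precision"] * method["papers_per_week"]


reduction = workload_reduction(QUERY["papers_per_week"], ML["papers_per_week"])
implied_capture = eligible_per_week(ML) / eligible_per_week(QUERY)

print(f"Workload reduction: {reduction:.0%}")
print(f"Eligible papers/week: query {eligible_per_week(QUERY):.1f}, ML {eligible_per_week(ML):.1f}")
print(f"Implied relative capture: {implied_capture:.0%}")
```

Going from 144 to 30 papers a week is a reduction of 114/144, or about 79%, which the slide rounds to 80%. The implied relative capture lands close to the reported 96% recall, so the two sets of figures are broadly consistent.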
Summary statistics in the GWAS Catalog by publication year
% of publications with summary statistics over time & in the whole Catalog
Summary statistics for users
Facilitating data integration and downstream analyses
GWAS and DAS
The end
GWAS Catalog literature identification
• Previously used a manual, query-based search
• Query: genomewide OR genome wide OR genome-wide OR GWAS
• Now replaced with a machine-learning-based search
• Convolutional neural net trained on a corpus of GWAS Catalog publications
• Collaboration with Zhiyong Lu’s group
Lee et al, PMID 30102703, PLoS Comp Bio
• ML results triaged by a curator in a custom PubTator interface
Old literature search and triage process
• Manual search in PubMed
• Query: genomewide OR genome wide OR genome-wide OR GWAS
• Curator assesses each publication for eligibility for inclusion in the GWAS Catalog
• Specific eligibility criteria
https://www.ebi.ac.uk/gwas/docs/methods/criteria
• Genome-wide association study of >100,000 variants distributed across the genome
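To make the breadth of the old query concrete, the four search terms can be approximated locally with a single regular expression. This is only an illustration: PubMed's own query handling (term mapping, field searching) differs, and `QUERY_TERMS` and `matches_query` are hypothetical names, not part of any curation tool.

```python
import re

# Approximates "genomewide OR genome wide OR genome-wide OR GWAS", case-insensitively.
QUERY_TERMS = re.compile(r"\b(genome[\s-]?wide|GWAS)\b", re.IGNORECASE)


def matches_query(text):
    """Return True if a title or abstract would be caught by the keyword query."""
    return QUERY_TERMS.search(text) is not None


# Made-up titles used only to exercise the function.
titles = [
    "A genome-wide association study of height",
    "Genomewide linkage scan in families",
    "GWAS of asthma in a large cohort",
    "Whole-genome sequencing of yeast isolates",
]
for t in titles:
    print(matches_query(t), t)
```

A keyword match of this kind catches any paper that mentions the phrase, whether or not it reports an eligible study, which is consistent with the high recall and low precision reported for the query-based search.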
Deep learning algorithm (convolutional neural net) trained on a corpus of GWAS Catalog publications
Figure 1. Lee et al, PMID 30102703, PLoS Comp Bio
Machine learning search
Corpus of GWAS Catalog publications
GWAS Catalog machine learning literature search method
• Precision 27%
• Recall 96%
Table 3. Lee et al, PMID 30102703, PLoS Comp Bio
Machine learning:
• Improved efficiency (80% reduction in publications to review, 144 to 30/week)
• Similar capture of eligible studies
GWAS Catalog machine learning literature search method vs query based search
Table 3. Lee et al, PMID 30102703, PLoS Comp Bio
Uses
• Narrow down/prioritise candidate loci
• Drug target discovery
• Predict disease risk
• Understand disease mechanism
• Statistics on disease data and research
DOI citations within DASs
Most popular data repositories based on DOI citations in DASs (Jan-Mar 2019)
(?i)(10[.]\d{4,9})(?=/)(?=[-._;()/:A-Z0-9]+)
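The pattern above captures the DOI prefix (the registrant code before the slash), which is enough to group citations by repository. A minimal sketch of how it might be applied follows. It assumes the `d` on the slide stands for the digit class `\d`; the prefix-to-repository lookup lists three well-known prefixes purely as an illustration; the sample statement and its DOIs are made up.

```python
import re
from collections import Counter

# Pattern from the slide, with the digit class written as \d (assumed).
DOI_PREFIX = re.compile(r"(?i)(10[.]\d{4,9})(?=/)(?=[-._;()/:A-Z0-9]+)")

# Illustrative lookup only; a real analysis would need a fuller registry.
REPOSITORIES = {
    "10.5061": "Dryad",
    "10.5281": "Zenodo",
    "10.6084": "figshare",
}


def repository_counts(statements):
    """Count repositories cited by DOI prefix across data availability statements."""
    counts = Counter()
    for text in statements:
        for prefix in DOI_PREFIX.findall(text):
            counts[REPOSITORIES.get(prefix, f"unknown ({prefix})")] += 1
    return counts


# Made-up statement with made-up DOIs, used only to exercise the code.
SAMPLE = (
    "Data are available from Dryad (https://doi.org/10.5061/dryad.example1) "
    "and Zenodo (doi:10.5281/zenodo.0000000)."
)

prefixes = DOI_PREFIX.findall(SAMPLE)
counts = repository_counts([SAMPLE])
print(prefixes)
print(counts.most_common())
```

Because both lookaheads are zero-width, the match consumes only the prefix, so `findall` returns the registrant codes rather than full DOIs.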
