Data availability and
feasibility of validation –
A genomics case study
Mike Thelwall, Marcus Munafò, Amalia Mas Bleda, Emma
Stuart, Meiko Makita, Verena Weigert, Chris Keene,
Nushrat Khan, Katie Drax, Kayvan Kousha
University of Wolverhampton, University of Bristol & UK
Reproducibility Network & JISC
Data sharing experiment goals
• Find out how often data is shared in a field with
apparently ideal conditions
• Write a program to automatically identify shared
data of a specified type
• Write a program to validate the quality of shared
data of a specified type
• As a step towards more general automatic shared
data discovery and quality control
The ideal case study topic? GWAS
• Genome-Wide Association Study (GWAS) summary
statistics
• Statistics on how likely genetic variation at each of a large set of
locations on the human genome is to be associated with a measurable
trait (e.g. disease susceptibility)
• Data is high value and expensive to collect
• Often stored in a standard format for internal sharing
by consortia
• An international repository exists for hosting it,
emphasising its importance
• NHGRI-EBI Catalog of published genome-wide association
studies
• Meta-analyses benefit from shared files – increased
power and population triangulation
• Genomics has a reputation for data sharing
https://www.ebi.ac.uk/gwas/diagram
Each dot represents a point on the human genome that at least one
research study has found to associate with a measurable trait
Methods
• Medline search for articles that could be primary
human GWAS
"Molecular Epidemiology"[Majr] AND "Genome-Wide Association Study"[Majr]
• Restriction to 2010 and 2017 to identify trends
• Three human coders classified 1799 articles for
being (a) primary human GWAS and (b) publicly
sharing complete primary human GWAS summary
statistics
• MT and MM follow-up checks of results
https://www.biorxiv.org/content/10.1101/622795v1
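The search above could be reproduced roughly as follows. This is a minimal sketch, not the authors' procedure: it assumes the search is run through the NCBI E-utilities ESearch endpoint, and the year restriction via `datetype`, `mindate` and `maxdate` and the `retmax` value are illustrative choices. The code only builds the request URLs and does not contact the service.

```python
from urllib.parse import urlencode

# NCBI E-utilities search endpoint (assumed route; the slides only give the query).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# Query string exactly as given on the Methods slide.
QUERY = '"Molecular Epidemiology"[Majr] AND "Genome-Wide Association Study"[Majr]'


def esearch_url(year, retmax=1000):
    """Build an ESearch URL restricted to one publication year."""
    params = {
        "db": "pubmed",
        "term": QUERY,
        "datetype": "pdat",    # filter on publication date
        "mindate": str(year),
        "maxdate": str(year),
        "retmax": retmax,      # illustrative cap on returned IDs
    }
    return ESEARCH + "?" + urlencode(params)


# One URL per year compared in the study.
urls = {year: esearch_url(year) for year in (2010, 2017)}
```

Fetching each URL would return PubMed IDs for candidate articles, which would still need the manual coding described above.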
Results
Data availability information                     2010  2017  Total  Percent
GWAS location not stated in article                156   139    295    89.4%
Broken link or not findable at stated location       3     1      4     1.2%
On request to the authors                            0     8      8     2.4%
On request via dbGaP                                 2     5      7     2.1%
On request via EGA                                   1     3      4     1.2%
On request via another portal                        0     3      3     0.9%
Free online without login, proprietary format        1     0      1     0.3%
Free online without login, plain text                0     8      8     2.4%
10.6% reported sharing GWAS summary statistics in some form
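The percentages in the table, and the 10.6% headline figure, follow directly from the counts. The short check below recomputes them; the variable names are illustrative, and the counts are copied from the table.

```python
# (2010 count, 2017 count) for each availability category, copied from the table.
counts = {
    "GWAS location not stated in article": (156, 139),
    "Broken link or not findable at stated location": (3, 1),
    "On request to the authors": (0, 8),
    "On request via dbGaP": (2, 5),
    "On request via EGA": (1, 3),
    "On request via another portal": (0, 3),
    "Free online without login, proprietary format": (1, 0),
    "Free online without login, plain text": (0, 8),
}

# Total number of articles across both years.
total = sum(a + b for a, b in counts.values())

# Percentage of all articles in each category, to one decimal place.
percent = {name: round(100 * (a + b) / total, 1) for name, (a, b) in counts.items()}

# Articles that reported sharing in some form: everything except "not stated".
shared = total - sum(counts["GWAS location not stated in article"])
shared_pct = round(100 * shared / total, 1)

print(total, shared, shared_pct)
```

The 35 articles counted as sharing in some form are the same 35 analysed on the next slide.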
Article descriptions of the availability
of GWAS summary statistics
• Usually in a Data Availability article section (26 out of
35).
• Data availability more difficult to identify from the
methods (4 articles) and results (3 articles).
• Only five data sharing statements described the shared
data as GWAS summary statistics, and all five used
different phrases
• “full GWAS summary statistics”, “Case Oncoarray GWAS data”,
“Summary GWAS estimates”, “Summary statistics for the
genome-wide association study”, “genome-wide set of
summary association statistics”
• Descriptions are therefore hard to automatically
identify from articles.
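The effect of this variation on automatic detection can be illustrated with the five phrases quoted above. The sketch below is an illustration only: the two matching rules are hypothetical, not rules used in the study.

```python
import re

# The five descriptions quoted on the slide.
PHRASES = [
    "full GWAS summary statistics",
    "Case Oncoarray GWAS data",
    "Summary GWAS estimates",
    "Summary statistics for the genome-wide association study",
    "genome-wide set of summary association statistics",
]

# Rule 1 (hypothetical): exact phrase match.
exact = [p for p in PHRASES if "gwas summary statistics" in p.lower()]

# Rule 2 (hypothetical): any GWAS term together with any summary-type term.
GWAS_TERM = re.compile(r"gwas|genome-wide", re.IGNORECASE)
SUMMARY_TERM = re.compile(r"summary|statistics|estimates", re.IGNORECASE)
loose = [p for p in PHRASES if GWAS_TERM.search(p) and SUMMARY_TERM.search(p)]

print(len(exact), len(loose))
```

The exact phrase finds one of the five descriptions and the looser rule finds four. A rule this loose would probably also match many sentences that do not describe shared data.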
Conclusions
• Data sharing is unlikely to become near-universal
when it is optional.
• Policy initiatives or mandates are needed to
promote data sharing.
• Automatically identifying shared data is difficult or
impossible in practice because of:
• the complexity of articles (multiple data sources and
article structures)
• a lack of standardisation of terminology
• However, data availability statements help
Follow-up study: Investigating
data availability statements
• A program was written to extract data sharing
statements from full text articles in XML
• Free software Webometric Analyst
(http://lexiurl.wlv.ac.uk/), menu: Citations > PMC full
text > Data availability statements extract
• Manual content analysis for types of information in
extracted PMC Open Access Subset data availability
statements (n=500)
• Test machine learning for classifying data sharing
methods from data availability statements
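The extraction step can be sketched as below. This is not the Webometric Analyst implementation, which is not described here. It assumes JATS-style XML in which the statement sits in an element marked with a `sec-type` or `notes-type` attribute containing `data-availability`, or under a title starting with "Data availability". Real articles vary, and the sample document is invented for illustration.

```python
import xml.etree.ElementTree as ET

# Invented, minimal JATS-like article used only to exercise the function.
SAMPLE = """<article>
  <body>
    <sec><title>Methods</title><p>We ran a GWAS.</p></sec>
  </body>
  <back>
    <sec sec-type="data-availability">
      <title>Data Availability</title>
      <p>All relevant data are within the paper.</p>
    </sec>
  </back>
</article>"""


def extract_das(xml_text):
    """Return the text of each section that looks like a data availability statement."""
    root = ET.fromstring(xml_text)
    statements = []
    for el in root.iter():
        # Assumed markers: a type attribute or a recognisable section title.
        kind = el.get("sec-type") or el.get("notes-type") or ""
        title = (el.findtext("title") or "").strip().lower()
        if "data-availability" in kind or title.startswith("data availability"):
            paragraphs = ["".join(p.itertext()) for p in el.iter("p")]
            # Join paragraphs and collapse runs of whitespace.
            statements.append(" ".join(" ".join(paragraphs).split()))
    return statements


print(extract_das(SAMPLE))
```

Statements extracted this way would be the input to the manual content analysis and to the machine learning test.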
Results – how is data shared?
Almost all papers with a data sharing statement claim
to share data.
Standardised wordings are common,
e.g., “All relevant data are within
the paper.”
Results – what data is shared?
38% of data sharing
statements specify that all
data is shared
Results – why is data [not] shared?
91% of data sharing
statements give no
explanation for their
data sharing policy
Machine learning
• Simple support vector machines (SVM) test for
detecting sharing methods from data sharing
statements
• 87% accurate for: how is the data shared?
• 89% accurate for: is all the data shared? (binary)
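The binary task can be sketched as follows. This is a toy illustration, not the model behind the reported accuracies: it assumes bag-of-words features and trains a linear classifier by sub-gradient descent on the hinge loss (the loss used by SVMs), without the regularisation term that a full SVM would include. The six training statements and all function names are invented for illustration.

```python
import re
from collections import Counter

# Invented examples: +1 = statement says all data is shared, -1 = otherwise.
TRAIN = [
    ("All relevant data are within the paper.", 1),
    ("All data are available in the supporting information files.", 1),
    ("All data are fully available without restriction.", 1),
    ("Data are available on request from the authors.", -1),
    ("Some data are available from the repository on request.", -1),
    ("Data cannot be shared publicly because of privacy.", -1),
]


def featurize(text):
    """Bag-of-words counts plus a constant bias feature."""
    feats = Counter(re.findall(r"[a-z]+", text.lower()))
    feats["<bias>"] = 1
    return feats


def score(weights, feats):
    """Linear decision value for one statement."""
    return sum(weights.get(tok, 0.0) * n for tok, n in feats.items())


def train_hinge(examples, lr=0.1, max_epochs=1000):
    """Update the weights whenever an example violates the margin y * score >= 1."""
    weights = {}
    data = [(featurize(text), label) for text, label in examples]
    for _ in range(max_epochs):
        updated = False
        for feats, label in data:
            if label * score(weights, feats) < 1.0:
                for tok, n in feats.items():
                    weights[tok] = weights.get(tok, 0.0) + lr * label * n
                updated = True
        if not updated:  # every example now satisfies the margin
            break
    return weights


def predict(weights, text):
    """Return +1 (all data shared) or -1 (otherwise)."""
    return 1 if score(weights, featurize(text)) >= 0 else -1


weights = train_hinge(TRAIN)
```

With a few hundred manually coded statements, as in the content analysis above, accuracy would be estimated on statements held out from training rather than on the training set itself.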
Summary
• Data sharing seems to need mandates to become
widespread, even in otherwise best case fields
• Shared data is hard to detect precisely because of
article complexity and language variation.
• Basic information about whether data is shared and
where can be extracted automatically from data
availability statements.

Editor's Notes

  • #5: “A single-nucleotide polymorphism, often abbreviated to SNP, is a substitution of a single nucleotide that occurs at a specific position in the genome, where each variation is present to some appreciable degree within a population (e.g. > 1%).” https://en.wikipedia.org/wiki/Single-nucleotide_polymorphism