Data availability and
feasibility of validation –
A genomics case study
Mike Thelwall, Marcus Munafò, Amalia Mas Bleda, Emma
Stuart, Meiko Makita, Verena Weigert, Chris Keene,
Nushrat Khan, Katie Drax, Kayvan Kousha
University of Wolverhampton, University of Bristol & UK
Reproducibility Network & JISC
Data sharing experiment goals
• Find out how often data is shared in a field with
apparently ideal conditions
• Write a program to automatically identify shared
data of a specified type
• Write a program to validate the quality of shared
data of a specified type
• As a step towards more general automatic shared
data discovery and quality control
The ideal case study topic? GWAS
• Genome Wide Association Study (GWAS) summary
statistics
  • The likelihood that variation at each of a large set of locations in the
  human genome is associated with a measurable trait (e.g. disease susceptibility)
• Data is high value and expensive to collect
• Often stored in a standard format for internal sharing
by consortia
• An international repository exists for hosting it,
emphasising its importance
• NHGRI-EBI Catalog of published genome-wide association
studies
• Meta-analyses benefit from shared files – increased
power and population triangulation
• Genomics has a reputation for data sharing
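One of the stated goals is a program that validates shared data of a specified type. The slides do not describe how such a validator would work, so the following is only a minimal sketch of the idea, assuming a tab-separated summary statistics file. The column names in `REQUIRED` and the function `validate_summary_stats` are hypothetical choices made for illustration and are not taken from the study.

```python
import csv
import io

# Hypothetical required columns; a real validator would follow whatever
# format the repository or consortium actually specifies.
REQUIRED = ("chromosome", "base_pair_location", "effect_allele",
            "other_allele", "p_value")


def validate_summary_stats(handle):
    """Return a list of problems found in a tab-separated summary statistics file."""
    reader = csv.DictReader(handle, delimiter="\t")
    missing = [c for c in REQUIRED if c not in (reader.fieldnames or [])]
    if missing:
        return [f"missing columns: {', '.join(missing)}"]
    problems = []
    # Data rows start on line 2 because line 1 is the header.
    for line_no, row in enumerate(reader, start=2):
        try:
            p = float(row["p_value"])
        except (TypeError, ValueError):
            problems.append(f"line {line_no}: p_value is not a number")
            continue
        if not 0.0 <= p <= 1.0:
            problems.append(f"line {line_no}: p_value {p} outside [0, 1]")
    return problems


HEADER = "chromosome\tbase_pair_location\teffect_allele\tother_allele\tp_value\n"
GOOD = HEADER + "1\t12345\tA\tG\t0.0004\n"
BAD = HEADER + "1\t12345\tA\tG\t1.7\n2\t999\tC\tT\tNA\n"

good_report = validate_summary_stats(io.StringIO(GOOD))
bad_report = validate_summary_stats(io.StringIO(BAD))
```

Checks like these only catch structural problems. They say nothing about whether the statistics themselves are complete or correct.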
https://www.ebi.ac.uk/gwas/diagram
Each dot represents a point on the human genome that at least one
research study has found to associate with a measurable trait
Methods
• Medline search for articles that could be primary
human GWAS
"Molecular Epidemiology"[Majr] AND "Genome-Wide Association Study"[Majr]
• Restriction to 2010 and 2017 to identify trends
• Three human coders classified 1799 articles for
being (a) primary human GWAS and (b) publicly
sharing complete primary human GWAS summary
statistics
• MT and MM follow-up checks of results
https://www.biorxiv.org/content/10.1101/622795v1
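The Medline search above could be reproduced programmatically. The slides do not say how the search was run, so this sketch assumes the NCBI E-utilities `esearch` endpoint and the PubMed `[dp]` (date of publication) field tag. It only builds the request URLs and does not contact the server. The helper name `build_search_url` and the `retmax` value are illustrative assumptions.

```python
from urllib.parse import urlencode, urlparse, parse_qs

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# The MeSH query quoted on the Methods slide.
MESH_QUERY = ('"Molecular Epidemiology"[Majr] AND '
              '"Genome-Wide Association Study"[Majr]')


def build_search_url(year, retmax=1000):
    """Build an esearch URL restricting the MeSH query to one publication year."""
    term = f"({MESH_QUERY}) AND {year}[dp]"
    params = {"db": "pubmed", "term": term, "retmax": retmax, "retmode": "json"}
    return f"{ESEARCH}?{urlencode(params)}"


# One request per year studied (2010 and 2017).
urls = {year: build_search_url(year) for year in (2010, 2017)}

# Round-trip check: decode the query string to recover the search term.
decoded_term_2010 = parse_qs(urlparse(urls[2010]).query)["term"][0]
```

Fetching each URL would return matching PubMed IDs. The articles would still need human coding, as in the study.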
Results
Data availability information                   2010  2017  Total  Percent
GWAS location not stated in article              156   139    295    89.4%
Broken link or not findable at stated location     3     1      4     1.2%
On request to the authors                          0     8      8     2.4%
On request via dbGaP                               2     5      7     2.1%
On request via EGA                                 1     3      4     1.2%
On request via another portal                      0     3      3     0.9%
Free online without login, proprietary format      1     0      1     0.3%
Free online without login, plain text              0     8      8     2.4%
10.6% reported sharing GWAS summary statistics in some form
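The percentages in the table can be rechecked from the per-year counts. The sketch below uses only the numbers shown on the slide; the variable names are illustrative.

```python
# (2010 count, 2017 count) for each row of the Results table.
counts = {
    "GWAS location not stated in article": (156, 139),
    "Broken link or not findable at stated location": (3, 1),
    "On request to the authors": (0, 8),
    "On request via dbGaP": (2, 5),
    "On request via EGA": (1, 3),
    "On request via another portal": (0, 3),
    "Free online without login, proprietary format": (1, 0),
    "Free online without login, plain text": (0, 8),
}

totals = {label: y2010 + y2017 for label, (y2010, y2017) in counts.items()}
grand_total = sum(totals.values())

# Percentage of all classified GWAS articles falling in each row.
percent = {label: round(100 * t / grand_total, 1) for label, t in totals.items()}

# Every row except "not stated" counts as sharing "in some form".
shared = grand_total - totals["GWAS location not stated in article"]
shared_percent = round(100 * shared / grand_total, 1)
```

The rows sum to 330 articles, of which 35 report some form of sharing. That 35 is the denominator used on the next slide ("26 out of 35").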
Article descriptions of the availability
of GWAS summary statistics
• Usually in a Data Availability article section (26 out of
35).
• Data availability more difficult to identify from the
methods (4 articles) and results (3 articles).
• Only five data sharing statements described the shared
data as GWAS summary statistics, and all five used
different phrases
• “full GWAS summary statistics”, “Case Oncoarray GWAS data”,
“Summary GWAS estimates”, “Summary statistics for the
genome-wide association study”, “genome-wide set of
summary association statistics”
• Descriptions are therefore hard to automatically
identify from articles.
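The effect of this variation on automatic detection can be illustrated with the five phrases quoted above. The two patterns below are assumptions made for this sketch, not the rules used in the study: a strict pattern for one canonical wording and a looser keyword heuristic. The sentence in `FALSE_POSITIVE` is invented to show how the looser rule over-matches.

```python
import re

# The five phrases reported on the slide.
REPORTED_PHRASES = [
    "full GWAS summary statistics",
    "Case Oncoarray GWAS data",
    "Summary GWAS estimates",
    "Summary statistics for the genome-wide association study",
    "genome-wide set of summary association statistics",
]

# Strict rule: one canonical wording only.
NAIVE = re.compile(r"gwas summary statistics", re.IGNORECASE)

# Loose rule: a GWAS term plus any word suggesting a dataset.
TOPIC = re.compile(r"\bgwas\b|genome-wide", re.IGNORECASE)
KIND = re.compile(r"summary|estimates|\bdata\b", re.IGNORECASE)


def naive_match(text):
    return bool(NAIVE.search(text))


def loose_match(text):
    return bool(TOPIC.search(text) and KIND.search(text))


naive_hits = sum(naive_match(p) for p in REPORTED_PHRASES)
loose_hits = sum(loose_match(p) for p in REPORTED_PHRASES)

# Invented sentence: mentions GWAS data but is not a sharing statement.
FALSE_POSITIVE = "GWAS data were obtained from a previously published study."
```

The strict pattern finds only one of the five phrases. The loose heuristic finds all five but also accepts the invented sentence, so neither rule is reliable on its own.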
Conclusions
• Data sharing is unlikely to become near-universal
when it is optional.
• Policy initiatives or mandates are needed to
promote data sharing.
• Automatically identifying shared data is difficult or
impossible in practice because of:
  • the complexity of articles (multiple data sources and
  article structures)
  • a lack of standardisation of terminology
• However, data availability statements help
Follow-up study: Investigating
data availability statements
• A program was written to extract data sharing
statements from full text articles in XML
• Free software Webometric Analyst
(http://lexiurl.wlv.ac.uk/), menu: Citations > PMC full
text > Data availability statements extract
• Manual content analysis for types of information in
extracted PMC Open Access Subset data availability
statements (n=500)
• Test machine learning for classifying data sharing
methods from data availability statements
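The extraction step can be sketched as follows. This is not the Webometric Analyst implementation, which the slides do not describe. It assumes JATS-style article XML in which a statement appears either in a `<sec>` marked `sec-type="data-availability"` (or titled "Data availability"), or in a `<custom-meta>` entry named "Data Availability". Tagging varies between publishers, so these two rules are assumptions, and the sample document is invented for illustration.

```python
import xml.etree.ElementTree as ET


def clean(element):
    """Join all text inside an element and normalise whitespace."""
    return " ".join("".join(element.itertext()).split())


def extract_statements(xml_text):
    """Return data availability statements found by two assumed tagging rules."""
    root = ET.fromstring(xml_text)
    found = []
    # Rule 1: a section flagged by attribute or by its title.
    for sec in root.iter("sec"):
        title = (sec.findtext("title") or "").strip().lower()
        if sec.get("sec-type") == "data-availability" or title.startswith("data availability"):
            found.append(" ".join(clean(p) for p in sec.findall("p")))
    # Rule 2: a custom metadata entry in the article front matter.
    for meta in root.iter("custom-meta"):
        name = (meta.findtext("meta-name") or "").strip().lower()
        value = meta.find("meta-value")
        if name == "data availability" and value is not None:
            found.append(clean(value))
    return found


# Invented sample article containing one statement of each kind.
SAMPLE = """<article>
  <front><article-meta><custom-meta-group>
    <custom-meta>
      <meta-name>Data Availability</meta-name>
      <meta-value>All relevant data are within the paper.</meta-value>
    </custom-meta>
  </custom-meta-group></article-meta></front>
  <body>
    <sec><title>Methods</title><p>Not a sharing statement.</p></sec>
    <sec sec-type="data-availability">
      <title>Data availability</title>
      <p>Summary statistics are available from the authors on request.</p>
    </sec>
  </body>
</article>"""

statements = extract_statements(SAMPLE)
```

Statements placed elsewhere, such as in the methods or results sections as noted earlier, would be missed by both rules.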
Results – how is data shared?
Almost all papers with data sharing statements claim
to share data.
Standardised wordings are common,
e.g., “All relevant data are within
the paper.”
Results – what data is shared?
38% of data sharing
statements specify that all
data is shared
Results – why is data [not] shared?
91% of data sharing
statements give no
explanation for their
data sharing policy
Machine learning
• Simple support vector machines (SVM) test for
detecting sharing methods from data sharing
statements
• 87% accurate for: how is the data shared?
• 89% accurate for: is all the data shared? (binary)
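The slides do not say which SVM implementation, features or training data were used. As a minimal sketch of the general technique, the code below trains a linear SVM from scratch by batch sub-gradient descent on the regularised hinge loss, using bag-of-words features. The six training statements, their labels, the stop-word list and the parameters `lam` and `steps` are all hypothetical, and the toy data is deliberately easy, so the sketch says nothing about the accuracy figures reported above.

```python
import re

# Hypothetical stop words, removed so that only content words are used.
STOPWORDS = {"a", "all", "and", "are", "available", "be", "can", "data", "for",
             "from", "in", "is", "its", "of", "on", "relevant", "the", "to", "upon"}


def tokens(text):
    """Set of lower-cased content words in a statement (binary bag of words)."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}


# Invented toy labels: +1 = data in the paper or its files, -1 = on request.
TRAIN = [
    ("All relevant data are within the paper.", 1),
    ("All data are within the paper and its supporting information files.", 1),
    ("The data are included in the manuscript and supporting files.", 1),
    ("Data are available from the corresponding author on request.", -1),
    ("Data available upon reasonable request to the authors.", -1),
    ("Requests for data access can be sent to the corresponding author.", -1),
]


def train_linear_svm(examples, lam=0.1, steps=200):
    """Minimise lam/2*|w|^2 + mean hinge loss by batch sub-gradient descent."""
    data = [(tokens(text), y) for text, y in examples]
    n = len(data)
    w = {}
    for t in range(1, steps + 1):
        eta = 1.0 / (lam * t)  # decaying step size
        grad = {}
        for feats, y in data:
            margin = y * sum(w.get(f, 0.0) for f in feats)
            if margin < 1.0:  # example violates the margin
                for f in feats:
                    grad[f] = grad.get(f, 0.0) - y / n
        # w <- w - eta * (lam * w + grad); note eta * lam == 1 / t
        shrink = 1.0 - 1.0 / t
        w = {f: shrink * v for f, v in w.items()}
        for f, g in grad.items():
            w[f] = w.get(f, 0.0) - eta * g
    return w


def predict(w, text):
    """Classify a statement by the sign of its linear score (no bias term)."""
    score = sum(w.get(f, 0.0) for f in tokens(text))
    return 1 if score >= 0 else -1


weights = train_linear_svm(TRAIN)
```

A call such as `predict(weights, "Available on request from the authors")` then returns the label for an unseen statement. A real experiment would use held-out data to estimate accuracy.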
Software to detect data sharing
• Webometric Analyst (free: http://lexiurl.wlv.ac.uk/)
tool to extract data sharing statements from a
folder of PDFs and classify them
• http://lexiurl.wlv.ac.uk/searcher/datashare.html
• Needs standard format for these statements
• Disciplinary & publisher differences in the uptake of data
sharing statements
Webometric Analyst output
• Attempts to classify what is shared, how (where),
and why
Summary
• Data sharing seems to need mandates to become
widespread, even in otherwise best case fields
• Shared data is hard to detect precisely because of
article complexity and language variation.
• Basic information about whether data is shared and
where can be extracted automatically from data
availability statements.
• Applications: Monitoring; More useful in the longer
term after standardisation?


Editor's Notes

  • #5: “A single-nucleotide polymorphism, often abbreviated to SNP, is a substitution of a single nucleotide that occurs at a specific position in the genome, where each variation is present to some appreciable degree within a population (e.g. > 1%).” https://en.wikipedia.org/wiki/Single-nucleotide_polymorphism