Public data archiving:

     Who shares?
    Who doesn’t?
What can we do about it?
               Heather Piwowar
         Presented at UBC BLISS, Sept 2010

 DataONE postdoc with Dryad and NESCent, @UBC
PhD in Dept of Biomedical Informatics, U of Pittsburgh
http://www.metmuseum.org/toah/ho/09/euwf/ho_24.45.1.htm
http://www.flickr.com/photos/jsmjr/62443357/
http://www.flickr.com/photos/camilleharrington/3587294608/
http://www.flickr.com/photos/rkuhnau/3318245976/
http://www.flickr.com/photos/conformpdx/1796399674/
http://www.flickr.com/photos/rkuhnau/3317418699/
http://www.flickr.com/photos/zemlinki/261617721/
http://www.flickr.com/photos/tracenmatt/3020786491/
http://www.flickr.com/photos/the-o/2078239333/
http://www.flickr.com/photos/ryanr/142455033/
http://www.flickr.com/photos/75166820@N00/5318468/
Find
Organize
Document
Deidentify
Format
Decide
Ask
Submit

Answer questions
Worry about mistakes being found
Worry about data being misinterpreted
Worry about being scooped
Forgo money and IP and prestige???
not very motivating.
As a result, policy makers have spent 
 lots of time and money ....




http://www.flickr.com/photos/johnnyvulkan/381941233/
http://www.flickr.com/photos/tonivc/2283676770/
building databases, 
developing standards, 
articulating best practices

to support public archiving of 
 research datasets 
lots of data sharing!




http://www.genome.jp/en/db_growth.html
but how much isn’t 
 shared?

  what isn’t shared?
              who isn’t sharing it?
why not?
     how much does it matter?
             what can we do 
              about it?
you can not manage 
what you do not measure




               quote: Lord Kelvin
http://www.flickr.com/photos/archeon/2941655917/
As we seek to embrace and
 encourage data sharing,

understanding patterns of adoption
 will allow us to make informed
 decisions about tools, policies, and
 best practices.

Measuring adoption over time will
 allow us to note progress and
 identify best practices and
 opportunities for improvement.
research questions

  1. Is there benefit for those who share?
  2. How can we study data sharing behaviour in
     a scalable, systematic way?
  3. What factors are correlated with sharing
     and withholding data?
http://www.flickr.com/photos/paulhami/1020538523//
Which data?




http://www.flickr.com/photos/paulhami/1020538523//
Where?




http://www.flickr.com/photos/paulhami/1020538523//
With whom?




http://www.flickr.com/photos/paulhami/1020538523//
When?




http://www.flickr.com/photos/paulhami/1020538523//
Under what terms?




http://www.flickr.com/photos/paulhami/1020538523//
• gene expression microarray data
• raw intensity data
• upon publication
• publicly on the internet
• (centralized databases)

http://www.flickr.com/photos/paulhami/1020538523//
http://en.wikipedia.org/wiki/DNA_microarray
http://en.wikipedia.org/wiki/Image:Heatmap.png
http://commons.wikimedia.org/wiki/File:DNA_double_helix_vertikal.PNG




microarray
      data
microarray
      data
1.  Is there benefit for 
 those who share?




http://www.flickr.com/photos/sunrise/35819369/
currency of value?

     Citations.
currency of value?

     Citations.

           $50!




Diamond, Arthur M. What is a Citation Worth? The Journal of Human Resources (1986) vol. 21 (2) pp. 200-215
dataset
85 cancer microarray trials published in 1999-2003, as
identified by Ntzani and Ioannidis (2003)

citations
ISI Web of Science Citation index, citations from
2004-2005

data sharing locations
Publisher and lab websites, microarray databases, WayBack
Internet Archive, Oncomine

statistics
Multivariate linear regression
Note:
 log
 scale
Public data archiving: Who does?  Who doesn't?  What can we do about it?
~70%
2. Need automated methods to:

a) Identify studies that create datasets
b) Determine which of these
        have in fact been shared
c) Extract attributes about the environment
a) Identify studies that create datasets




                                 http://www.flickr.com/photos/lofaesofa/248546821/
Look for wetlab methods in article full text:




http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=1522022&tool=pmcentrez
http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=1590031&tool=pmcentrez
http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=1482311&tool=pmcentrez#id331936
http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=2082469&tool=pmcentrez
http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=126870&tool=pmcentrez#id442745
Combined, these full-text portals reach 85%
of the articles available through
U of Pittsburgh library subscriptions.
But how to generate an effective query?
Use open access articles.
• text analysis:
               automatically catalogued
 single words and word-pairs from full text
• assessed precision and recall
• combined the high performers:
Derived query:
  ("gene expression" AND microarray AND cell AND rna)

  AND (rneasy OR trizol OR "real-time pcr")

  NOT ("tissue microarray*" OR "cpg island*")
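The derived query can be read as a boolean filter over article full text. A minimal sketch, assuming case-insensitive whole-word matching and treating a trailing * as "any word ending" (both are assumptions; the full-text portals each apply their own matching rules). The function name matches_derived_query is hypothetical.

```python
import re


def matches_derived_query(full_text: str) -> bool:
    """Apply the derived boolean query to one article's full text."""
    text = full_text.lower()

    def has(term: str) -> bool:
        # Whole-word or whole-phrase match; a trailing * allows any word ending.
        if term.endswith("*"):
            pattern = r"\b" + re.escape(term[:-1]) + r"\w*"
        else:
            pattern = r"\b" + re.escape(term) + r"\b"
        return re.search(pattern, text) is not None

    # ("gene expression" AND microarray AND cell AND rna)
    required = all(has(t) for t in ("gene expression", "microarray", "cell", "rna"))
    # AND (rneasy OR trizol OR "real-time pcr")
    wetlab = any(has(t) for t in ("rneasy", "trizol", "real-time pcr"))
    # NOT ("tissue microarray*" OR "cpg island*")
    excluded = any(has(t) for t in ("tissue microarray*", "cpg island*"))
    return required and wetlab and not excluded


# Made-up sentence, for illustration only.
example = (
    "Total RNA was extracted with TRIzol from each cell line; "
    "gene expression was profiled on a microarray."
)
is_hit = matches_derived_query(example)
```

Under whole-word matching, "cells" or "mRNA" alone would not satisfy the cell and rna terms; a stemming search engine may behave differently.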
Evaluation:
Ochsner et al. Nature Methods (2008)
400 studies across 20 journals

Precision: 90% (conf int: 86% to 93%)
Recall:    56% (conf int: 52% to 61%)
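How precision, recall and their intervals fall out of a labelled evaluation set, as a sketch. The counts below are made up for illustration and are not the counts behind the figures above; the Wilson score interval is one common choice, and the slides do not say which interval method was used.

```python
import math


def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    return tp / (tp + fp), tp / (tp + fn)


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion (z = 1.96 for roughly 95%)."""
    p = successes / n
    denom = 1 + z ** 2 / n
    centre = (p + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return centre - half, centre + half


# Hypothetical counts: 90 true positives, 10 false positives, 60 false negatives.
tp, fp, fn = 90, 10, 60
precision, recall = precision_recall(tp, fp, fn)
precision_ci = wilson_interval(tp, tp + fp)  # interval around precision
recall_ci = wilson_interval(tp, tp + fn)     # interval around recall
```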
a) Identify studies that create datasets
b) Determine which of these
        have in fact been shared
c) Extract attributes about the environment
b) Determine which datasets
        have in fact been shared
77 % 
a) Identify studies that create datasets
b) Determine which of these
        have in fact been shared
c) Extract attributes about the environment
Funder   Journal       Investigator   Institution   Study




                   Is research data shared
                       after publication?
Funder: funded by NIH?; size of grant; sharing plan req’d?; funded by non-NIH?
Journal: impact factor; strength of policy; open access?; number of microarray studies published
Investigator: years since first paper; # pubs; # citations; previously shared?; previously reused?; gender
Institution: sector; size; impact rank; country
Study: humans?; mice?; plants?; cancer?; clinical trial?; number of authors; year
journal rank
journal data sharing policy


          “An inherent principle of publication is that
           others should be able to replicate and build
           upon the authors' published claims.
           Therefore, a condition of publication
           in a Nature journal is that authors are
           required to make materials, data and
           associated protocols available in a publicly
           accessible database …”


http://www.nature.com/authors/editorial_policies/availability.html
http://www.nature.com/nature/journal/v453/n7197/index.html
institution rank




Yu et al. BMC Medical Informatics and Decision Making (2007) vol. 7 pp. 17
study type
author “experience”

Author publication history:

Author name disambiguation: Author-ity web service
Torvik & Smalheiser. (2009). Author Name Disambiguation in MEDLINE. ACM Transactions on Knowledge Discovery from Data, 3(3):11.



Citation counts:
author gender
funding level

PubMed grant lists   + NIH grant details
funder mandates




     Requires a data sharing plan
     for studies funded after October 2003
     that receive more than $500 000 in
     direct funding per year
funder mandates

Proxy for NIH data sharing policy
applicability:

If in any year since 2004,
• funded by an NIH grant number
   with a “1” or “2” type code
• received more than $750 000 in
   total funding from the grant
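The proxy rule above can be written as a small predicate. A sketch only: the record layout (GrantYear), the function name, and the reading of "since 2004" as inclusive of 2004 are assumptions, and the dollar figure is taken per grant per year. The grant numbers in the example are placeholders, not real grants.

```python
from dataclasses import dataclass


@dataclass
class GrantYear:
    """One year of funding for one grant (hypothetical record layout)."""
    grant_number: str     # e.g. "1R01CA000000-01"; the leading digit is the type code
    fiscal_year: int
    total_dollars: float  # total funding from this grant in this year


def policy_proxy_applies(grant_years, first_year=2004, threshold=750_000):
    """True if any grant-year meets every condition of the proxy rule."""
    return any(
        g.fiscal_year >= first_year               # "since 2004", read as inclusive
        and g.grant_number[:1] in ("1", "2")      # type code "1" or "2"
        and g.total_dollars > threshold           # strictly more than $750 000
        for g in grant_years
    )


# Placeholder records, for illustration only.
study_grants = [
    GrantYear("5R01CA000000-03", 2005, 900_000.0),  # wrong type code
    GrantYear("1R01CA000000-01", 2006, 800_000.0),  # meets all three conditions
]
applies = policy_proxy_applies(study_grants)
```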
and so on...


    124 variables
Now equipped with automated methods to:

a) Identify studies that create datasets
b) Determine which of these
        have in fact been shared
c) Extract attributes about the environment
3.  What factors are correlated 
 with sharing and withholding 
 data?
http://www.flickr.com/photos/cogdog/123072/
11,603 datapoints


25% had links from datasets in databases
univariate analysis
Proportion of articles with shared datasets, by year




[Figure: proportion of articles with datasets found in GEO or ArrayExpress (vertical axis, 0.05 to 0.35), by year article published (horizontal axis, 2000 to 2009). Panel label: Across time.]
[Figure: proportion of datasets shared (0.0 to 1.0), by journal. Callout: Physiological Genomics.]
Journals shown: Physiol Genomics; PLoS Genet; Genome Biol; Microbiology; PLoS One; BMC Genomics; Plant Cell; Genome Res; Eukaryot Cell; Appl Environ Microbiol; BMC Med Genomics; Hum Mol Genet; Proc Natl Acad Sci U S A; Infect Immun; Am J Respir Cell Mol Biol; Dev Biol; J Bacteriol; Mol Endocrinol; BMC Cancer; Plant Physiol; Biol Reprod; Blood; J Immunol; FASEB J; Toxicol Sci; J Exp Bot; Nucleic Acids Res; Diabetes; Mol Cell Biol; Mol Cancer Ther; BMC Bioinformatics; Stem Cells; FEBS Lett; J Neurosci; Am J Pathol; J Biol Chem; J Virol; OTHER; Cancer Res; J Clin Endocrinol Metab; Plant Mol Biol; Clin Cancer Res; Genomics; Invest Ophthalmol Vis Sci; Mol Hum Reprod; Carcinogenesis; Gene; Endocrinology; Oncogene; Cancer Lett; Biochem Biophys Res Commun
[Figure: proportion of datasets shared (0.0 to 1.0), by institution. Callout: Stanford.]
Institutions shown: Stanford University; University of Pennsylvania; University of Illinois; University of California, Los Angeles; University of Wisconsin, Madison; University of Washington; University of California, Davis; The University of British Columbia; University of California, San Francisco; University of Florida; University of California, San Diego; University of Minnesota, Twin Cities; Baylor College of Medicine; OTHER; Max Planck Gesellschaft; Harvard University; Duke University Medical Center; Yale University; Johns Hopkins University; University of Pittsburgh; Washington University in Saint Louis; University of Toronto; University of California, Berkeley; University of Michigan, Ann Arbor; Michigan State University; National Cancer Institute; Tokyo Daigaku
[Figure: proportion of datasets shared (0.0 to 1.0), by institution rank (1 to 1901).]
multivariate analysis
factor analysis
multivariate logistic regression over
the first-order factors
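To help read an odds-ratio axis: an odds ratio above 1.00 means sharing is more likely when the factor is present, below 1.00 less likely. The sketch below computes the simpler single-factor odds ratio from a 2x2 table with a Woolf (log-scale) 95% interval. It is not the multivariate regression used here, and the counts are made up for illustration.

```python
import math


def odds_ratio_with_ci(shared_with, unshared_with, shared_without, unshared_without, z=1.96):
    """Odds ratio for sharing given a factor, with a Woolf (log) interval.

    shared_with / unshared_with: studies WITH the factor that did / did not share.
    shared_without / unshared_without: the same counts for studies WITHOUT it.
    """
    ratio = (shared_with * unshared_without) / (unshared_with * shared_without)
    # Standard error of log(odds ratio) from the four cell counts.
    se = math.sqrt(
        1 / shared_with + 1 / unshared_with + 1 / shared_without + 1 / unshared_without
    )
    low = math.exp(math.log(ratio) - z * se)
    high = math.exp(math.log(ratio) + z * se)
    return ratio, low, high


# Hypothetical counts: 40 of 100 studies share with the factor, 20 of 100 without.
ratio, low, high = odds_ratio_with_ci(40, 60, 20, 80)
```

An interval that excludes 1.00 is the single-factor analogue of a plotted bar that does not cross the 1.00 line.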
Multivariate nonlinear regressions with interactions
[Figure: odds ratio for each first-order factor (axis ticks 0.25, 0.50, 1.00, 2.00, 4.00, 8.00; plot label 0.95)]
• Has journal policy
• Count of R01 & other NIH grants
• Authors prev GEOAE sharing & OA & microarray creation
• NO K funding or P funding
• Journal impact
• Institution high citations & collaboration
• Journal policy consequences & long halflife
• NOT animals or mice
• Institution is government & NOT higher ed
• Last author num prev pubs & first year pub
• Large NIH grant
• Humans & cancer
• NO geo reuse + YES high institution output
• First author num prev pubs & first year pub
logistic regression
using second-order factors
Multivariate nonlinear regression with interactions
[Figure: odds ratio for each second-order factor (axis ticks 0.25, 0.50, 1.00, 2.00, 4.00; plot label 0.95)]
• OA journal & previous GEO-AE sharing
• Amount of NIH funding
• Journal impact factor and policy
• Higher Ed in USA
• Cancer & humans
Conclusions:
   • data sharing rates are increasing,
     but overall levels are low

Preliminary evidence:
   • levels are particularly low in cancer
   • levels are highest for those who
      • publish in a journal with a policy
      • publish in an open access journal
      • have shared data before
•   data and filters were imperfect
•   many assumptions
•   didn’t capture all types of sharing
•   don’t know how generalizable across datatypes
•   should be considered hypothesis-generating


http://www.flickr.com/photos/vlastula/300102949/
http://www.flickr.com/photos/gatewaystreets/3838452287/
NSF-funded distributed framework
 and cyberinfrastructure for
 environmental science.



Dryad is a repository of data
 underlying scientific publications,
 with an initial focus on evolution,
 ecology, and related fields.


The National Evolutionary
  Synthesis Center, NSF-funded:
• Duke University,
• UNC at Chapel Hill
• North Carolina State University
1.  new domain
http://www.flickr.com/photos/paulhami/1020538523//
• evolution and ecology
    datasets
•   raw data that support results
•   upon publication
    or short embargo
•   publicly on the internet




http://www.flickr.com/photos/paulhami/1020538523//
challenges!

  1. No PubMed
  2. Diverse data types, norms, repositories
  3. Data almost always collected for a specific
     hypothesis
  4. Less public sharing so far
2.  new initiatives
JDAP
       •   The American Naturalist
       •   Evolution
       •   Journal of Evolutionary Biology
       •   Molecular Ecology
       •   Evolutionary Applications
       •   Genetics
       •   Heredity
       •   Molecular Biology and Evolution
       •   Systematic Biology
       •   Paleobiology
       •   BMC Evolutionary Biology
Blumenthal et al. Acad Med. 2006
        Campbell et al. JAMA. 2002.
Kyzas et al. J Natl Cancer Inst. 2005.
       Vogeli et al. Acad Med. 2006.
      Reidpath et al. Bioethics 2001.
http://www.flickr.com/photos/jima/606588905/
3.  Reuse.




http://www.flickr.com/photos/boitabulle/3668162701/
who reuses data?
                  why?
     when?
                       who doesn’t?
which datasets are most likely 
 to be reused?
         how many datasets could be 
          reused but aren’t?
 why aren’t they?
      does it matter?
                  what can we do 
                   about it?
http://upload.wikimedia.org/wikipedia/commons/thumb/e/e6/Gamma_distribution_pdf.svg/500px-Gamma_distribution_pdf.svg.png
I post my data, code, and statistical scripts on
GitHub (links from http://researchremix.org)
Share yours too!


http://www.flickr.com/photos/myklroventine/892446624/
“Does anyone want your data?

That’s hard to predict […]
After all, no one ever knocked on your door asking to
buy those figurines collecting dust in your cabinet
before you listed them on eBay.

Your data, too, may simply be awaiting an effective
matchmaker.”




                     Got data? Nature Neuroscience (2007)
Dept of Biomedical Informatics at U of Pittsburgh
Wendy Chapman for support and feedback
Todd Vision, Mike Whitlock for ongoing discussions
NIH NLM. NSF through DataONE, NESCent, Dryad.
Open science online community and those who release their
 articles, datasets and photos openly


                thank you
http://www.flickr.com/photos/jep42/3017149415/in/set-72157608797298056/
Journal
mandates




           variables
• readers
• reusers
• authors
• editors
• reviewers
• funders
• database designers, maintainers, curators
• patients, subjects, or populations

perspectives, and also driving towards actionable results for these groups
http://www.flickr.com/photos/sunrise/35819369/
http://www.flickr.com/photos/fboyd/2156630044/
Correlates with self‐reported data withholding
[Bar chart, horizontal axis 0 to 3]
• industry involvement
• perceived competitiveness of field
• male
• sharing discouraged in training
• human participants
• academic productivity

Blumenthal et al. Acad Med. 2006
Self‐reported reasons for data withholding
[Bar chart, horizontal axis 0% to 80%]
• sharing is too much effort
• want student or jr faculty to publish more
• they themselves want to publish more
• cost
• industrial sponsor
• confidentiality
• commercial value of results

Campbell et al. JAMA 2002.
Table 2: Second-order factor loadings, by first-order factors

Amount of NIH funding
  0.88 Count of R01 & other NIH grants
  0.49 Large NIH grant
 -0.55 NO K funding or P funding

Cancer & humans
  0.83 Humans & cancer

OA journal & previous GEO-AE sharing
  0.59 Authors prev GEOAE sharing & OA & microarray creation
  0.43 Institution high citations & collaboration
  0.31 First author num prev pubs & first year pub
 -0.36 Last author num prev pubs & first year pub

Journal impact factor and policy
  0.57 Journal impact
  0.51 Last author num prev pubs & first year pub

Higher Ed in USA
  0.40 NO geo reuse + YES high institution output
 -0.44 Institution is government & NOT higher ed
Table 3: Second-order factor loadings, by original variables

Amount of NIH funding
  0.87 nih.cumulative.years.tr
  0.85 num.grants.via.nih.tr
  0.84 max.grant.duration.tr
  0.82 num.grant.numbers.tr
  0.80 pubmed.is.funded.nih
  0.79 nih.max.max.dollars.tr
  0.70 nih.sum.avg.dollars.tr
  0.70 nih.sum.sum.dollars.tr
  0.59 has.R.funding
  0.59 num.post2003.morethan500k.tr
  0.58 country.usa
  0.58 has.U.funding
  0.57 has.R01.funding
  0.55 num.post2003.morethan750k.tr
  0.53 has.T.funding
  0.53 num.post2003.morethan1000k.tr
  0.49 num.post2004.morethan500k.tr
  0.45 num.post2004.morethan750k.tr
  0.44 has.P.funding
  0.43 num.post2004.morethan1000k.tr
  0.43 num.nih.is.nci.tr
  0.35 num.post2005.morethan500k.tr
  0.32 num.nih.is.nigms.tr
  0.31 num.post2005.morethan750k.tr

Cancer & humans
  0.60 pubmed.is.cancer
  0.59 pubmed.is.humans
  0.52 pubmed.is.cultured.cells
  0.43 pubmed.is.core.clinical.journal
  0.39 institution.is.medical
 -0.58 pubmed.is.plants
 -0.50 pubmed.is.fungi
 -0.37 pubmed.is.shared.other
 -0.30 pubmed.is.bacteria

OA journal & previous GEO-AE sharing
  0.40 first.author.num.prev.geoae.sharing.tr
  0.37 pubmed.is.open.access
  0.37 first.author.num.prev.oa.tr
  0.35 last.author.num.prev.geoae.sharing.tr
  0.32 pubmed.is.effectiveness
  0.32 last.author.num.prev.oa.tr
  0.31 pubmed.is.geo.reuse
 -0.38 country.japan

Journal impact factor and policy
  0.48 journal.impact.factor.log
  0.47 jour.policy.requires.microarray.accession
  0.46 jour.policy.mentions.exceptions
  0.46 pubmed.num.cites.from.pmc.tr
  0.45 journal.5yr.impact.factor.log
  0.45 jour.policy.contains.word.miame.mged
  0.42 last.author.num.prev.pmc.cites.tr
  0.41 jour.policy.requests.accession
  0.40 journal.immediacy.index.log
  0.40 journal.num.articles.2008.tr
  0.39 years.ago.tr
  0.36 jour.policy.says.must.deposit
  0.35 pubmed.num.cites.from.pmc.per.year
  0.33 institution.mean.norm.citation.score
  0.32 last.author.year.first.pub.ago.tr
  0.31 country.usa
  0.31 last.author.num.prev.pubs.tr
  0.31 jour.policy.contains.word.microarray
 -0.31 pubmed.is.open.access

Higher Ed in USA
  0.36 institution.stanford
  0.36 institution.is.higher.ed
  0.35 country.usa
  0.35 has.R.funding
  0.33 has.R01.funding
  0.30 institution.harvard
 -0.37 institution.is.govnt
