Open access to scientific research data
Gudmundur A. Thorisson, PhD <gt50@leicester.ac.uk>
Research associate, University of Leicester
Guest scientist, University of Iceland
Participant in the GEN2PHEN Consortium and the ORCID Technical Working Group



This work is published under the Creative Commons Attribution license (CC BY:
http://creativecommons.org/licenses/by/3.0/), which means that it can be freely
copied, redistributed and adapted, as long as proper attribution is given.
Overview



   • Intro to the world of Big Science & Big Data
        – Why is inadequate access to data such a problem?
   • Incentive-based approaches to tackling the sharing problem
        – Identification, identification, identification
   • Key relevant developments internationally
   • Some food for thought for funders, institutions, other key players
   • Concluding remarks




RDFC2012 Conference on Open Access and Digital Rights, Reykjavik, March 29th 2012
Big Science, Big Data

• Scientific research increasingly large-scale and data-driven

• High-profile discipline examples

     – High-energy particle physics - experiments
       performed in the Large Hadron Collider

     – Astronomy - data from ground-based and space
       telescopes, the Virtual Observatory (VO)




Doctorow, C. Big data: Welcome to the petacentre. Nature 455, 16-21 (2008). http://dx.doi.org/10.1038/455016a
Hypothesis generation guided by available data




Kell and Oliver. Bioessays (2004) vol. 26 (1)



• Science paradigms
    – 1st: Empirical - describing natural phenomena
    – 2nd: Theoretical - models, generalizations
    – 3rd: Computational - simulating complex phenomena
    – 4th (1+2+3): Data exploration, e-Science


Gray, J. 2009. The Fourth Paradigm: Data-Intensive Scientific Discovery. Microsoft Research




Biological research too is increasingly big and data-driven



  • From: small-scale datasets that
    fit into a printed journal article




Richards, M. et al. Paleolithic and neolithic lineages in the European mitochondrial gene pool. American Journal of Human Genetics 59, 185-203 (1996). http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1915109/





• To: large-scale collection of
  biological data in digital form




• Huge technological advances in last 5-10 years
     – experimental / observations <-- gathering data with high-throughput equipment
     – computer technology <-- storing & analyzing massive data volumes


• Example: massively-parallel sequencing
     – Determine human genome sequence in <1 day - the $1000 genome
     – Metagenomics: sequence *everything* in environment samples
     – Large bio-specimen collections
          • 100,000s of individuals in disease/population biobanks

Examples: domain repositories for sequence data


 • GenBank - genetic sequence
   repository, established 1982




 • UniProt - knowledge base for
   protein sequence & function
“Community resource projects” - large-scale data generation
  for the purpose of making the data available for broad reuse


• The sequence of the human genome
     – International Human Genome Project - mandatory rapid data sharing, the Bermuda
       Principles



• Pattern of variation in the human genome
     – International Haplotype Map Project - genotyping population samples
     – 1000 Genomes Project - sequencing population samples




Big Data – challenges, opportunities
• Managing & making sense of large-scale datasets
     – Data easy/cheap to generate - not so cheap to store & use
     – Favorite quote: “the $1000 genome sequence, followed by the ++$10,000 analysis”



• Integration & analysis - combining datasets
     – more data of the same type - e.g. combine sequences from multiple species
     – related data of different type - e.g. a person’s genome sequence + his/her phenotype


• Potential for accelerating research, creating new knowledge and (in
  biomedicine) improving human health.


• Key driver = unrestricted sharing of scientific data deposited in
  the public domain

Data = “fuel” of science



                            [..] If digital technologies are the engine of this
                            revolution, digital data are its fuel. But for many
                            scientific disciplines, this fuel is in short supply.[..]

                         Smith, V. Data publication: towards a database of everything. BMC Res Notes (2009) vol. 2 (1)




Biology and data sharing in the “long tail”

• Biology is complex, so data are often very heterogeneous
• Technologies changing rapidly
• Lots of small-scale research projects
• Lots of small/medium datasets
• Data in the long tail usually *not* shared
  OR not shared in a useful way

[Figure: The ‘long tail’ of dark bio-data]

 • Contrast with other data-intensive disciplines with
      – a long history of sharing research data - a “culture of sharing”
      – big, expensive, shared facilities = the only way to do this kind of research
      – relatively homogeneous datasets, easier to scale up to big volumes (e.g. telescope images)


[…] Overall, only 47 papers (9%) deposited full primary raw data
online. None of the 149 papers not subject to data availability
policies made their full primary data publicly available.

Conclusion: A substantial proportion of original research papers published in
high-impact journals are either not subject to any data availability
policies, or do not adhere to the data availability instructions in their
respective journals. This empiric evaluation highlights opportunities for
improvement.




[Diagram: DATA → (analysed, synthesised, interpreted) → INFORMATION → (published) → KNOWLEDGE (publication)]

Lots of published knowledge, but hard/impossible to go back and
reproduce work & validate findings
+
Opportunity for maximising the value of data through reuse is wasted

[Image: hand drawing the question “WHY”. Credit: http://cutcaster.com/photo/800902839-The-hand-drawing-question-WHY/]
Lots and lots of diverse reasons!!
Some quotes from researchers:

  “Don't know which digital repository I should upload to”
  “Too much work, got better things to do!”
  “My competitors will just take the data and ‘scoop’ me”
  “It's my data, I collected them and no one else is entitled to use them”
  “[myriad other reasons]”

Worryingly, many authors don't seem to
care whether evidence underpinning their
published findings is accessible or not




Koslow. Should the neuroscience community make a
paradigm shift to sharing primary data? Nat Neurosci
(2000) vol. 3 (9). http://dx.doi.org/10.1038/78760
Gnarly issue #1: “ownership” vs “stewardship”



• Many researchers consider data their property, even if the research
  is funded by public money
   – e.g. want to do further analysis on data in future, publish more papers


• ...which conflicts with the interests of other stakeholders in the game
  (e.g. funders, universities) who want:
   – to maximize return on investment in the funded research
   – to ensure good, solid evidence-based science is done, etc.




Gnarly issue #2 – biomedical data

• Usually sensitive, cannot be shared without restrictions
     – Detailed, reidentifiable biomedical data that cannot be fully anonymized
     – Personal privacy considerations


• Specialized controlled-access archives deal with some of this
     – NCBI's database of Genotypes and Phenotypes – dbGaP
     – European Genome-phenome Archive – EGA
     – [specific diseases / disorders, research consortia, others]




[Image: “How to Make a Tackle in Rugby”. Credit: http://djamba.com/how-to-make-a-tackle-in-rugby.html]
Sharing now tends to be driven by mandates...

• Journals increasingly require data to be made available

  “Provide supporting data in a repository OR we won’t
  publish your paper”

• Funders increasingly require a data sharing plan &
  budget baked into grant proposals

  “Publish data we are funding you to generate OR we
  will not fund your research again”

Using just a stick gets you only so far

...which are an imperfect solution

• Arguments that mandates by themselves are not the way

• Mandates likely to ensure only minimum compliance
     – sharing would be done in minimally useful form (as in, whatever is the least effort)

• ...and are meaningless if not enforced (currently the case with
  many journals)
Strategies focused on encouraging sharing

                                 - Make it easy -
                                - Make it useful -
                                - Make it citable -




Treating data as citable publications in their own right

• Core strategy: enable data to be treated as 1st class citizens of the
  scholarly record which:
         i) are indexed and can be discovered, located and accessed, and
         ii) can be properly identified & cited unambiguously like other scholarly works

• Link datasets with the primary journal publication - citation crosslinks
• Give data creators/curators/analysts proper credit for their contribution
  to the digital resource


• Focus on the benefits to researchers from publishing their data
     –   Data sharing → Data PUBLICATION + CITATION
     – Others reuse & cite their stuff → more citations → more impact
     – The more useful a dataset, the more likely to be used & cited
Exemplar – Data Dryad

“international repository of data
underlying peer-reviewed articles in
the basic and applied biosciences”
  http://datadryad.org



• Combines
     – Mandates (journal policy)
            and
     – Citable data publication



• Citation cross-linking
     – Paper references dataset
     – Dataset references paper



Key building blocks: the 3 I’s of identification




I #1 – Identifying scholarly publications (and other research outputs)
          • Why? So it is possible to..
             ..cite the work unambiguously (‘..we used the method described in Thorisson et al (2009)’)
             ..locate the work (retrieve Nature article as PDF from journal website)
             ..give credit to persons/entities who contributed to the work (G. Thorisson authored paper X)
          • Need for globally unique, persistent identifiers to combat unstable Web URLs, broken hyperlinks
          • e.g. Digital Object Identifiers (DOIs) for pubs, datasets and more:
               – Bell et al. 2009. Science 323(5919) doi:10.1126/science.1170411
               – Goodwillie C et al (2005) Data from: The evolutionary enigma of mixed mating systems in
                 plants: occurrence, theoretical explanations, and empirical evidence. Dryad Digital
                 Repository. doi:10.5061/dryad.292q34fp
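
As a minimal sketch of what a persistent identifier buys you, the snippet below normalises the different ways a DOI is commonly written and builds a resolver link from it. The function names `normalize_doi` and `doi_to_url` are hypothetical, and using the `https://doi.org/` resolver prefix is an assumption of this sketch (the slides themselves use `http://dx.doi.org/`). The two DOIs are the ones cited in this talk.

```python
import re

# Prefixes under which a DOI is commonly written (an assumption of this sketch).
_DOI_PREFIXES = re.compile(r"^(?:https?://(?:dx\.)?doi\.org/|doi:)", re.IGNORECASE)


def normalize_doi(raw: str) -> str:
    """Strip resolver/scheme prefixes and lowercase the DOI name.

    DOI names are case-insensitive, so lowercasing gives one canonical form.
    """
    doi = _DOI_PREFIXES.sub("", raw.strip())
    if not re.match(r"^10\.\d{4,9}/\S+$", doi):
        raise ValueError(f"not a DOI: {raw!r}")
    return doi.lower()


def doi_to_url(raw: str) -> str:
    """Build a resolver link that stays valid even if the landing page moves."""
    return "https://doi.org/" + normalize_doi(raw)


print(doi_to_url("doi:10.5061/dryad.292q34fp"))
print(doi_to_url("http://dx.doi.org/10.1038/455016a"))
```

The point of the indirection is that a citation stores only the DOI; the resolver, not the citing paper, keeps track of where the work currently lives.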




I #2 – Identifying use/reuse - measuring impact

      – Historical reliance on formal citations and citation-based metrics
      – ISI Impact Factor widely used, but really a metric for the influence of a scholarly journal
      – Citation analysis not going away - remains the gold standard


      – Many other use/reuse indicators for impact of individual research outputs
          • Focus on the impact of the *publication* itself, not the journal in which it appears
          • Indicators: no. of full-text downloads, tweets (i.e. mentions on Twitter), social bookmarking
          • AltMetrics - a growing grassroots movement “to better measure and reward all the different
            ways that people contribute to the messy and complex process of scientific progress [..] born out
            of a simple recognition: Many of the traditional measurements are too slow or simplistic to
            keep pace with today’s Internet-age science” http://altmetrics.org
      – Lots of new tools and projects emerging to explore possibilities in this space
          • e.g. http://total-impact.org




I #3 – Identifying contributors – attributing credit

      – Why? So we can..
         ..link content creators with their works - attribute credit accurately
         ..figure out: who contributed to publication X?
                       which publications has person/organization Y contributed to?
      – What kind of contributions? Characterizing ‘contributorship’
         author, creator, analyst, reviewer, ‘conceived of study & designed experiment’ etc




Tackling the author name ambiguity problem
               (or ‘Who’s Who?’)

Are these authors all the same person?
 G. Thorisson, University of Leicester
 G. A. Thorisson, University of Leicester
 G. A. Thorisson, Cold Spring Harbor Laboratory

How about these? Or these?
 J. Smith
 J. Smith
 J. Smith
 J. Smith
 J. Smith
 [etc.]

“∼2/3 of the ∼6 million authors in MEDLINE share a last name and
first initial with at least one other author, and an ambiguous name
refers to ∼8 persons on average.”
Torvik and Smalheiser. Author name disambiguation in MEDLINE. ACM Transactions on Knowledge
Discovery from Data (2009) vol. 3 (3)
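
To make the ambiguity concrete, here is a small sketch that groups author records by the "last name + first initial" key mentioned in the quote above. The helper names `name_key` and `ambiguous_groups` are hypothetical, and the two J. Smith affiliations are placeholders invented for illustration; only the three Thorisson records come from the slide.

```python
from collections import defaultdict


def name_key(author: str) -> str:
    """Reduce e.g. 'G. A. Thorisson' to a 'last name + first initial' key."""
    parts = author.replace(".", " ").split()
    return f"{parts[-1].lower()}_{parts[0][0].lower()}"


def ambiguous_groups(records):
    """Return only the keys that more than one (author, affiliation) record maps to."""
    groups = defaultdict(list)
    for author, affiliation in records:
        groups[name_key(author)].append((author, affiliation))
    return {key: members for key, members in groups.items() if len(members) > 1}


records = [
    ("G. Thorisson", "University of Leicester"),
    ("G. A. Thorisson", "University of Leicester"),
    ("G. A. Thorisson", "Cold Spring Harbor Laboratory"),
    ("J. Smith", "hypothetical affiliation A"),  # placeholder, not from the slides
    ("J. Smith", "hypothetical affiliation B"),  # placeholder, not from the slides
]

for key, members in ambiguous_groups(records).items():
    print(key, len(members))
```

The grouping shows which records collide, but nothing in the name string alone can say whether a group is one person who moved institutions or several different people. That is the gap a persistent contributor identifier is meant to close.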

The Open Researcher & Contributor ID initiative

Launched end of 2009, ORCID will work to
support the creation of a permanent, clear
and unambiguous record of scholarly
communication by enabling reliable
attribution of authors and contributors
through unique identifiers
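
The slides do not say what such an identifier looks like. As an illustration of a unique identifier that software can sanity-check, the sketch below validates the 16-character form that ORCID iDs take, whose final character is an ISO 7064 MOD 11-2 check character. The function names are hypothetical, and `0000-0002-1825-0097` is the sample iD used in ORCID's own documentation, not one that appears in this talk.

```python
import re


def orcid_check_char(base_digits: str) -> str:
    """Compute the ISO 7064 MOD 11-2 check character for the first 15 digits."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_valid_orcid(orcid: str) -> bool:
    """Check the shape and the check character of a hyphenated or compact iD."""
    compact = orcid.replace("-", "").upper()
    if not re.fullmatch(r"[0-9]{15}[0-9X]", compact):
        return False
    return orcid_check_char(compact[:15]) == compact[15]


print(is_valid_orcid("0000-0002-1825-0097"))
```

A check character only catches typing and transcription errors; it says nothing about whether the iD has actually been issued to anyone.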




ORCID will add value for scholars and
the organizations that they are
interacting with, including universities,
scholarly societies, funding
organizations and publishers


• Joins faculty or student body
• Joins scholarly society
• Applies for grant
• Submits manuscript




ORCID transcends discipline, geographic, national and
institutional boundaries - now >300 participants




http://www.orcid.org
Some food for thought / recommendation kind of stuff to conclude
• Status of research data in Iceland is unclear → need research
     – Build on & extend 2007 Rannís report “Gagnagrunnar á Íslandi um náttúru, umhverfi og orku”

Rannís, we're looking at you!

• Funders to take lead
     – Mandates (aka sticks) - require data management plan + budget in grant proposals
          • Many best practices & tools available to draw upon, e.g. by the UK Digital Curation Centre
     – Call for & fund research proposals to build infrastructural foundations & explore
       technologies/initiatives
     – Raise awareness in the local research community




Even more food for thought / recommendation kind of stuff

• Universities & other research institutions need to
     – Take research data seriously
     – Build infrastructure for data storage & preservation, support personnel (e.g. data
       officers / coordinators)
     – Include datasets and other non-conventional outputs in professional evaluations


• Identify & engage with key international initiatives in this space
     – ORCID, DataCite, Dryad, Open Knowledge Foundation, others
     – OpenAIREplus ← Solveig's talk coming up!




Final bite of food-for-thought




                  Let's make research data an integral part of the
                   OA mission in Iceland, NOT an afterthought




Acknowledgements
GEN2PHEN Consortium
   http://www.gen2phen.org/about-gen2phen/partners
Prof Anthony J. Brookes Bioinformatics Group, Leicester

This work has received funding from the European Community's Seventh Framework
Programme (FP7/2007-2013) under grant agreement number 200754 - the GEN2PHEN project.




Contact me!
<gthorisson@gmail.com>
http://www.linkedin.com/in/mummi
http://www.twitter.com/gthorisson
http://www.gthorisson.name
ORCID - http://www.orcid.org

Published under the Creative Commons BY license (http://creativecommons.org/licenses/by/3.0/)

More Related Content

PDF
A Cabinet Of Web2.0 Scientific Curiosities
PPT
Sla2009 D Curation Heidorn
PPTX
SEAD Datanet and Sustainability Science
PPTX
NISO Forum, Denver, Sept. 24, 2012: DataCite and Campus Data Services
KEY
NISO Forum, Denver, Sept. 24, 2012: Data Equivalence
PPTX
NISO Forum, Denver, Sept. 24, 2012: Scientific discovery and innovation in an...
PPTX
Needs for Data Management & Citation Throughout the Information Lifecycle
PPTX
Repository Federation: Towards Data Interoperability
A Cabinet Of Web2.0 Scientific Curiosities
Sla2009 D Curation Heidorn
SEAD Datanet and Sustainability Science
NISO Forum, Denver, Sept. 24, 2012: DataCite and Campus Data Services
NISO Forum, Denver, Sept. 24, 2012: Data Equivalence
NISO Forum, Denver, Sept. 24, 2012: Scientific discovery and innovation in an...
Needs for Data Management & Citation Throughout the Information Lifecycle
Repository Federation: Towards Data Interoperability

What's hot (18)

PDF
Beyond Preservation: Situating Archaeological Data in Professional Practice
PPTX
Cornell 2011 05-13
PDF
Big Data in the Arts and Humanities
PPTX
Open Access: Open Access Looking for ways to increase the reach and impact of...
PDF
OpenData Public Research, University of Toronto, Open Access Week, 25/11/2011
PPTX
DataCite - services and support for opening up research data
PPTX
RDAP13 Lorrie Johnson: Facilitating Access to Scientific Data
ZIP
Open Access, Open Data. Open Research?
PPTX
SEAD Virtual Archive: Building a Federation of Institutional Repositories fo...
PPT
BeSTGRID OpenGridForum 29 GIN session
PPTX
Managing and Sharing Research Data
PPTX
Data Publishing in Archaeozoology
PPTX
Building a Data Discovery Network for Sustainability Science
PPTX
Managing and Sharing Research Data: Good practices for an ideal world...in th...
PDF
Case Study Big Data: Socio-Technical Issues of HathiTrust Digital Texts
PPTX
Building the FAIR Research Commons: A Data Driven Society of Scientists
PPTX
The Future of Open Science
PDF
Chris Marsden, University of Essex (Plenary): Regulation, Standards, Governan...
Beyond Preservation: Situating Archaeological Data in Professional Practice
Cornell 2011 05-13
Big Data in the Arts and Humanities
Open Access: Open Access Looking for ways to increase the reach and impact of...
OpenData Public Research, University of Toronto, Open Access Week, 25/11/2011
DataCite - services and support for opening up research data
RDAP13 Lorrie Johnson: Facilitating Access to Scientific Data
Open Access, Open Data. Open Research?
SEAD Virtual Archive: Building a Federation of Institutional Repositories fo...
BeSTGRID OpenGridForum 29 GIN session
Managing and Sharing Research Data
Data Publishing in Archaeozoology
Building a Data Discovery Network for Sustainability Science
Managing and Sharing Research Data: Good practices for an ideal world...in th...
Case Study Big Data: Socio-Technical Issues of HathiTrust Digital Texts
Building the FAIR Research Commons: A Data Driven Society of Scientists
The Future of Open Science
Chris Marsden, University of Essex (Plenary): Regulation, Standards, Governan...
Ad

Viewers also liked (9)

PDF
ORCID Outreach Meeting dev breakout session
PDF
DataCite workshop at BL April 2011
PDF
Thorisson science online london sep2010
PDF
Identity in research data publication - meeting with SageCite people march2011
PPT
T M 6 Etika Linkungan (4)
PDF
BRIF workshop Toulouse 2012 ORCID intro and status update
PDF
NIH VIVO workshop Indiana March 2011
PDF
Staða opins aðgangs á Íslandi
PPTX
Flickr.com: More than Pretty Pictures (updated for GWA2010)
ORCID Outreach Meeting dev breakout session
DataCite workshop at BL April 2011
Thorisson science online london sep2010
Identity in research data publication - meeting with SageCite people march2011
T M 6 Etika Linkungan (4)
BRIF workshop Toulouse 2012 ORCID intro and status update
NIH VIVO workshop Indiana March 2011
Staða opins aðgangs á Íslandi
Flickr.com: More than Pretty Pictures (updated for GWA2010)
Ad

Similar to RDFC2012 Open Access to Research Data (20)

PPT
Research Data Sharing LERU
PPT
Improving Access to Research Data: What does changing legislation mean for y...
PDF
Knowledge Exchange, Nov 2011, Bonn
PPTX
Scott Edmunds A*STAR open access workshop: how licensing can change the way w...
PPT
Supporting Libraries in Leading the Way in Research Data Management
PPTX
Data sharing and data management – what are they all about?
PPT
Presentation to EASE, Tallinn, June 2012
PDF
Graham Pryor
PPT
BioMed Central's open data initiatives
PDF
Sünje Dallmeier-Tiessen: Research data "publishing": models, roles and respon...
PDF
Value of Unique IDs in Academia, Vilnius - Identifying knowledge contributors
PPT
Where is the opportunity for libraries in the collaborative data infrastructure?
PDF
A research passport: library requirements
PDF
"Why an OPEN attittude" at OpenByDefault, DTU 2012
PDF
Mendeley's Data and Perspectives on Data Challenges
PPTX
Data and science
PPTX
Preserving the Inputs and Outputs of Scholarship
PPT
Informatics Transform : Re-engineering Libraries for the Data Decade
PDF
Sirris innovate2011 - Smart Products with smart data - introduction, Dr. Elen...
PPTX
Altman pitt 2013_v3
Research Data Sharing LERU
Improving Access to Research Data: What does changing legislation mean for y...
Knowledge Exchange, Nov 2011, Bonn
Scott Edmunds A*STAR open access workshop: how licensing can change the way w...
Supporting Libraries in Leading the Way in Research Data Management
Data sharing and data management – what are they all about?
Presentation to EASE, Tallinn, June 2012
Graham Pryor
BioMed Central's open data initiatives
Sünje Dallmeier-Tiessen: Research data "publishing": models, roles and respon...
Value of Unique IDs in Academia, Vilnius - Identifying knowledge contributors
Where is the opportunity for libraries in the collaborative data infrastructure?
A research passport: library requirements
"Why an OPEN attittude" at OpenByDefault, DTU 2012
Mendeley's Data and Perspectives on Data Challenges
Data and science
Preserving the Inputs and Outputs of Scholarship
Informatics Transform : Re-engineering Libraries for the Data Decade
Sirris innovate2011 - Smart Products with smart data - introduction, Dr. Elen...
Altman pitt 2013_v3

More from Gudmundur Thorisson (15)

PDF
ODIN 1st year Conference Oct 2013 Interoperability: connecting identifiers
PDF
ORCID Outreach meeting Oxford may 2013 integration demo
PDF
Elsevier webinar New York
PDF
OA útskýrt: hvað er opinn aðgangur og af hverju?
PDF
BRIF workshop Toulouse 2012 Digital IDs subgroup
PDF
GEN2PHEN GAM9 Toulouse - Launching the ORCID system, what do we do now?
PDF
Afmælisfundur Líf- og umhverfisvísindastofnunar - kynning á vef
PDF
TNC2012 Federated and scholarly identity - match made in heaven?
PDF
GEN2PHEN GAM8 meeting Leiden - Identifiers for LSDBs
PDF
GEN2PHEN GAM8 meeting Leiden - Update on ORCID and other ID developments
PDF
VIVO conference Aug 2011: The VIVO platform and ORCID in the scholarly identi...
PDF
ORCID participant meeting May 2011: The digital scholar, identity on the Web ...
PDF
Data Citation Principles Harvard May 2011: ORCID and data publication - Ident...
PDF
sameAs London May 2011: The digital scholar, identity on the Web and ORCID
PDF
JISC MRD workshop Birmingham march 2011
ODIN 1st year Conference Oct 2013 Interoperability: connecting identifiers
ORCID Outreach meeting Oxford may 2013 integration demo
Elsevier webinar New York
OA útskýrt: hvað er opinn aðgangur og af hverju?
BRIF workshop Toulouse 2012 Digital IDs subgroup
GEN2PHEN GAM9 Toulouse - Launching the ORCID system, what do we do now?
Afmælisfundur Líf- og umhverfisvísindastofnunar - kynning á vef
TNC2012 Federated and scholarly identity - match made in heaven?
GEN2PHEN GAM8 meeting Leiden - Identifiers for LSDBs
GEN2PHEN GAM8 meeting Leiden - Update on ORCID and other ID developments
VIVO conference Aug 2011: The VIVO platform and ORCID in the scholarly identi...
ORCID participant meeting May 2011: The digital scholar, identity on the Web ...
Data Citation Principles Harvard May 2011: ORCID and data publication - Ident...
sameAs London May 2011: The digital scholar, identity on the Web and ORCID
JISC MRD workshop Birmingham march 2011

RDFC2012 Open Access to Research Data

  • 1. Open access to scientific research data. Gudmundur A. Thorisson, PhD <gt50@leicester.ac.uk>. Research associate, University of Leicester; guest scientist, University of Iceland; participant in the GEN2PHEN Consortium and the ORCID Technical Working Group. This work is published under the Creative Commons Attribution license (CC BY: http://creativecommons.org/licenses/by/3.0/) which means that it can be freely copied, redistributed and adapted, as long as proper attribution is given.
  • 2. Overview ๏ Intro to the world of Big Science & Big Data •Why is inadequate access to data such a problem? ๏ Incentive-based approaches to tackling the sharing problem Identification, identification, identification ๏ Key relevant developments internationally ๏ Some food for thought for funders, institutions, other key players ๏ Concluding remarks RDFC2012 Conference on Open Access and Digital Rights, Reykjavik, March 29 th 2012
  • 3. Big Science, Big Data • Scientific research increasingly large-scale and data-driven • High-profile discipline examples – High-energy particle physics - experiments performed in the Large Hadron Collider – Astronomy - data from ground-based and space telescopes, the Virtual Observatory (VO) • Doctorow, C. Big data: Welcome to the petacentre. Nature 455, 16-21 (2008). http://dx.doi.org/10.1038/455016a
  • 4. Hypothesis generation guided by available data Kell and Oliver. Bioessays (2004) vol. 26 (1) • Science paradigms – 1st: Empirical - describing natural phenomena – 2nd: Theoretical - models, generalizations – 3rd: Computational - simulating complex phenomena – 4th (1+2+3): Data exploration, e-Science Gray, J. 2009. The Fourth Paradigm: Data-Intensive Scientific Discovery. Microsoft Research
  • 5. Biological research too is increasingly big and data-driven • From: small-scale datasets that fit into a printed journal article Richards, M. et al. Paleolithic and neolithic lineages in the European mitochondrial gene pool. American Journal of Human Genetics 59, 185-203 (1996). http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1915109/
  • 6. Biological research too is increasingly big and data-driven • To: large-scale collection of biological data in digital form • Huge technological advances in last 5-10 years – experimental / observations <-- gathering data with high-throughput equipment – computer technology <-- storing & analyzing massive data volumes • Example: massively-parallel sequencing – Determine human genome sequence in <1 day - the $1000 genome – Metagenomics: sequence *everything* in environment samples – Large bio-specimen collections • 100,000s of individuals in disease/population biobanks
  • 7. Examples: domain repositories for sequence data • GenBank - genetic sequence repository, established 1986 • UniProt - knowledge base for protein sequence & function
  • 8. “Community resource projects” - large-scale data generation for the purpose of making the data available for broad reuse • The sequence of the human genome – International Human Genome project - mandatory rapid data sharing, the Bermuda principles • Pattern of variation in the human genome – International Haplotype Map Project - genotyping population samples – 1000 Genomes Project - sequencing population samples
  • 9. Big Data – challenges, opportunities • Managing & making sense of large-scale datasets – Data easy/cheap to generate - not so cheap to store & use – Favorite quote: “the $1000 genome sequence, followed by the ++$10,000 analysis” • Integration & analysis - combining datasets – more data of the same type - e.g. combine sequences from multiple species – related data of different type - e.g. a person’s genome sequence + his/her phenotype • Potential for accelerating research, creating new knowledge and (in biomedicine) improving human health. • Key driver = unrestricted sharing of scientific data deposited in the public domain
  • 10. Data = “fuel” of science Smith, V. Data publication: towards a database of everything. BMC Res Notes (2009) vol. 2 (1)
  • 13. Data = “fuel” of science [..] If digital technologies are the engine of this revolution, digital data are its fuel. But for many scientific disciplines, this fuel is in short supply.[..] Smith, V. Data publication: towards a database of everything. BMC Res Notes (2009) vol. 2 (1)
  • 14. Biology and data sharing in the “long tail” • Biology is complex, so data are often very heterogeneous • Technologies changing rapidly • Lots of small-scale research projects • Lots of small/medium datasets The ‘long tail’ of dark bio-data • Data in the long tail usually *not* shared OR not shared in a useful way • Contrast with other data-intensive disciplines with – a long history of sharing research data - a “culture of sharing” – big, expensive, shared facilities = the only way to do this kind of research – relatively homogeneous datasets, easier to scale up to big volumes (e.g. telescope images)
  • 15. […] Overall, only 47 papers (9%) deposited full primary raw data online. None of the 149 papers not subject to data availability policies made their full primary data publicly available. Conclusion: A substantial proportion of original research papers published in high-impact journals are either not subject to any data availability policies, or do not adhere to the data availability instructions in their respective journals. This empiric evaluation highlights opportunities for improvement
  • 16. DATA (analysed, synthesised, interpreted) → INFORMATION (published) → KNOWLEDGE. Publication: lots of published knowledge, but hard/impossible to go back and reproduce work & validate findings + Opportunity for maximising the value of data through reuse is wasted
  • 18. Lots and lots of diverse reasons!! Some quotes from researchers: “Don't know which digital repository I should upload to” “Too much work, got better things to do!” “My competitors will just take the data and ‘scoop’ me” “It's my data, I collected them and no one else is entitled to use them” “[myriad other reasons]” Worryingly, many authors don't seem to care whether evidence underpinning their published findings is accessible or not Koslow. Should the neuroscience community make a paradigm shift to sharing primary data? Nat Neurosci (2000) vol. 3 (9). http://dx.doi.org/10.1038/78760
  • 19. Gnarly issue #1: “ownership” vs “stewardship” • Many researchers consider data their property, even if research funded by public money – e.g. want to do further analysis on data in future, publish more papers • ..which conflicts with interests of other stakeholders in the game (e.g. funders, universities) who want: – to maximize return on investment in the funded research – to ensure good, solid evidence-based science is done, etc.
  • 20. Gnarly issue #2 – biomedical data • Usually sensitive, cannot be shared without restrictions – Detailed, reidentifiable biomedical data that cannot be fully anonymized – Personal privacy considerations • Specialized controlled-access archives deal with some of this – NCBI's database of Genotypes and Phenotypes – dbGaP – European Genome-phenome Archive – EGA – [specific diseases / disorders, research consortia, others]
  • 21. How to Make a Tackle in Rugby Tackling in rugby is one of the most important aspects of the game. [...] Credit: http://djamba.com/how-to-make-a-tackle-in-rugby.html
  • 22. ...which are an imperfect solution • Arguments that mandates by themselves are not the way • Mandates likely to ensure only minimum compliance – sharing would be done in minimally useful form (as in, whatever is the least effort) ...and are meaningless if not enforced (currently the case with many journals)
  • 23. Sharing now tends to be driven by mandates... * Journals increasingly require data to be made available “Provide supporting data in a repository OR we won’t publish your paper” * Funders increasingly require data sharing plan & budget baked into grant proposals. “Publish data we are funding you to generate OR we will not fund your research again” Using just a stick only gets you so far
  • 24. Strategies focused on encouraging sharing - Make it easy - - Make it useful - - Make it citable -
  • 25. Treating data as citable publications in their own right • Core strategy: enable data to be treated as 1st class citizens of the scholarly record which: i) are indexed and can be discovered, located and accessed, and ii) can be properly identified & cited unambiguously like other scholarly works • Link datasets with the primary journal publication - citation crosslinks • Give data creators/curators/analysts proper credit for their contribution to the digital resource • Focus on the benefits to researchers from publishing their data – Data sharing → Data PUBLICATION + CITATION – Others reuse & cite their stuff → more citations → more impact – The more useful a dataset, the more likely to be used & cited
  • 26. Exemplar – Data Dryad “international repository of data underlying peer-reviewed articles in the basic and applied biosciences” http://datadryad.org • Combines – Mandates (journal policy) and – Citable data publication • Citation cross-linking – Paper references dataset – Dataset references paper
  • 27. Key building blocks: the 3 I’s of identification
  • 28. 1 I Identifying scholarly publications (and other research outputs) • Why? So it is possible to.. ..cite the work unambiguously (‘..we used the method described in Thorisson et al (2009)’) ..locate the work (retrieve Nature article as PDF from journal website) ..give credit to persons/entities who contributed to the work (G. Thorisson authored paper X) • Need for globally unique, persistent identifiers to combat unstable Web URLs, broken hyperlinks • e.g. Digital Object Identifiers (DOIs) for pubs, datasets and more: – Bell et al. 2009. Science 323(5919) doi:10.1371/journal.pone.0024357 – Goodwillie C et al (2005) Data from: The evolutionary enigma of mixed mating systems in plants: occurrence, theoretical explanations, and empirical evidence. Dryad Digital Repository. doi:10.5061/dryad.292q34fp
  • 29. 2 I Identifying use/reuse - measuring impact – Historical reliance on formal citations and citation-based metrics – ISI Impact Factor widely used, but really a metric for the influence of a scholarly journal – Citation analysis not going away - remains the gold standard – Many other use/reuse indicators for impact of individual research outputs • Focus on the impact of the *publication* itself, not the journal in which it appears • Indicators: no. full-text downloads, tweets (i.e. mentions on Twitter), social bookmarking • AltMetrics - a growing grassroots movement “to better measure and reward all the different ways that people contribute to the messy and complex process of scientific progress [..] born out of a simple recognition: Many of the traditional measurements are too slow or simplistic to keep pace with today’s Internet-age science” http://altmetrics.org – Lots of new tools and projects emerging to explore possibilities in this space • e.g. http://total-impact.org
  • 30. 3 I Identifying contributors – attributing credit – Why? So we can.. ..link content creators with their works - attribute credit accurately ..figure out: who contributed to publication X? which publications has person/organization Y contributed to? – What kind of contributions? Characterizing ‘contributorship’ author, creator, analyst, reviewer, ‘conceived of study & designed experiment’ etc
  • 31. Tackling the author name ambiguity problem (or ‘Who’s Who?’) Are these authors all the same person? J. Smith; J. Smith; J. Smith; J. Smith; J. Smith [etc.] How about these? Or these? G. Thorisson, University of Leicester; G. A. Thorisson, University of Leicester; G. A. Thorisson, Cold Spring Harbor Laboratory. ∼2/3 of the ∼6 million authors in MEDLINE share a last name and first initial with at least one other author, and an ambiguous name refers to ∼8 persons on average. Torvik and Smalheiser. Author name disambiguation in MEDLINE. ACM Transactions on Knowledge Discovery from Data (2009) vol. 3 (3)
  • 32. The Open Researcher & Contributor ID initiative Launched end of 2009, ORCID will work to support the creation of a permanent, clear and unambiguous record of scholarly communication by enabling reliable attribution of authors and contributors through unique identifiers
  • 33. The Open Researcher & Contributor ID initiative ORCID will add value for scholars and the organizations that they are interacting with, including universities, scholarly societies, funding organizations and publishers •Joins faculty or student body •Joins scholarly society •Applies for grant •Submits manuscript
  • 34. ORCID transcends discipline, geographic, national and institutional boundaries - now >300 participants http://www.orcid.org
  • 35. Some food for thought / recommendation kind of stuff to conclude • Status of research data in Iceland is unclear → need research – Build on & extend 2007 Rannís report “Gagnagrunnar á Íslandi um náttúru, umhverfi og orku” Rannís, we're looking at you! • Funders to take lead – Mandates (aka sticks) - require data management plan + budget in grant proposals • Many best practices & tools available to draw upon, e.g. by the UK Digital Curation Centre – Call for & fund research proposals to build infrastructural foundations & explore technologies/initiatives – Raise awareness in the local research community
  • 36. Even more food for thought / recommendation kind of stuff • Universities & other research institutions need to – Take research data seriously – Build infrastructure for data storage & preservation, support personnel (e.g. data officers / coordinators) – Include datasets and other non-conventional outputs in professional evaluations • Identify & engage with key international initiatives in this space – ORCID, DataCite, Dryad, Open Knowledge Foundation, others – OpenAIREplus ← Solveig's talk coming up!
  • 37. Final bite of food-for-thought Let's make research data an integral part of the OA mission in Iceland, NOT an afterthought
  • 38. Acknowledgements • GEN2PHEN Consortium http://www.gen2phen.org/about-gen2phen/partners • Prof Anthony J. Brookes, Bioinformatics Group, Leicester • ORCID - http://www.orcid.org • This work has received funding from the European Community's Seventh Framework Programme (FP7/2007-2013) under grant agreement number 200754 - the GEN2PHEN project. Contact me! <gthorisson@gmail.com> http://www.linkedin.com/in/mummi http://www.twitter.com/gthorisson http://www.gthorisson.name Published under the Creative Commons BY license (http://creativecommons.org/licenses/by/3.0/)