Because good research needs good data




Big data
– no big deal for curation?
Graham Pryor, Associate Director, UK Digital Curation Centre

Eduserv Symposium 2012: Big Data, Big Deal?


          This work is licensed under a Creative Commons Attribution 2.5 UK: Scotland License
Big data – big deal or same deal?
“What need the bridge much broader than the flood?
The fairest grant is the necessity.
Look, what will serve is fit…”
                        Much Ado About Nothing, Act 1 Scene 1
Eduserv Symposium 2012 – speakers’ Research Areas
• Operating Systems & Networking
• Computer and Network Security
• Distributed Systems
• Mobile Computing
• Wireless Networking
• Software Engineering
• High performance compute clusters
• Cloud and grid technologies
• Effective management of large clusters and cluster file-systems
• Very large database systems (architecture, management and application optimization)
The Digital Curation Centre
• a consortium comprising units from the Universities of Bath
  (UKOLN), Edinburgh (DCC Centre) and Glasgow (HATII)
• launched 1st March 2004 as a national centre for solving
  challenges in digital curation that could not be tackled by
  any single institution or discipline
• funded by JISC to build capacity, capability and skills in
  research data management across the UK HEI community
• awarded additional HEFCE funding 2011/13 for
   • the provision of support to national cloud services
   • targeted institutional development
Three perspectives
 Scale and complexity
   – Volume and pace
   – Infrastructure
   – Open science
 Policy
   – Funders
   – Institutions
   – Ethics & IP
 Management
   – Storage
   – Incentives
   – Costs & Sustainability
                              http://www.nonsolotigullio.com/effettiottici/images/escher.jpg/
Challenges of scale and complexity
• Globally, >100,000 neuroscientists study the CNS, generating massive, intricate and highly interrelated datasets
• Analysts require access to these data to develop algorithms, models and schemata that characterise the underlying system
• Resources and actors are rarely collocated and are therefore difficult to combine.

• The virtual laboratory is a federation of server nodes that allows distributed data to be stored local to acquisition
• Analysis codes can be uploaded and executed on the nodes so that derived datasets need not be transported over low bandwidth connections
• Data and analysis codes are described by structured metadata, providing an index for search, annotation and audit over workflows leading to scientific outcomes
• Users access the distributed resources through a web portal emulating a PC desktop

But this is only talking terabytes…
http://www.carmen.org.uk/
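The virtual laboratory idea on this slide (data stay on the node where they were acquired, analysis code travels to the data, and a metadata index supports search) can be illustrated with a small sketch. This is not CARMEN's actual software or API: the class names `Node` and `Federation`, the method names, and the sample dataset are all hypothetical, chosen only to show the pattern.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Node:
    """Hypothetical server node holding datasets close to where they were acquired."""
    name: str
    datasets: Dict[str, List[float]] = field(default_factory=dict)

    def run(self, dataset_id: str, analysis: Callable[[List[float]], float]) -> float:
        # The analysis code is sent to the node; only the small result leaves it.
        return analysis(self.datasets[dataset_id])


@dataclass
class Federation:
    """Hypothetical federation: a set of nodes plus a shared metadata index."""
    nodes: Dict[str, Node] = field(default_factory=dict)
    # Metadata index: dataset id -> descriptive record, including the hosting node.
    index: Dict[str, Dict[str, str]] = field(default_factory=dict)

    def register(self, node: Node, dataset_id: str, data: List[float], **metadata: str) -> None:
        # Store the data on the node and describe it in the shared index.
        node.datasets[dataset_id] = data
        self.nodes[node.name] = node
        self.index[dataset_id] = {"node": node.name, **metadata}

    def search(self, **criteria: str) -> List[str]:
        # Return ids of datasets whose metadata match every criterion.
        return [ds for ds, meta in self.index.items()
                if all(meta.get(k) == v for k, v in criteria.items())]

    def execute(self, dataset_id: str, analysis: Callable[[List[float]], float]) -> float:
        # Look up the hosting node in the index and run the analysis there.
        node = self.nodes[self.index[dataset_id]["node"]]
        return node.run(dataset_id, analysis)


# Illustrative data only; the node name, dataset id and metadata are made up.
fed = Federation()
fed.register(Node("node-a"), "recording-001", [0.1, 0.4, 0.3], species="rat")
matches = fed.search(species="rat")
peak = fed.execute(matches[0], max)
```

In this sketch the raw recording never leaves `node-a`; the caller receives only the value computed by `max`, which is the point the slide makes about avoiding transfers over low bandwidth connections.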
Big data? – The Large Hadron Collider




                                 Searching for the Higgs Boson




 • Predicted annual generation of around 15
   petabytes (15 million gigabytes) of data
 • Would need >1,700,000 dual layer DVDs
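The DVD comparison can be checked with a few lines of arithmetic. The sketch below assumes a nominal dual layer DVD capacity of 8.5 GB, a figure that is not stated on the slide; the function name `dvds_needed` is illustrative.

```python
import math

# 15 petabytes expressed in gigabytes, as on the slide.
ANNUAL_DATA_GB = 15_000_000
# Assumption (not from the slide): nominal capacity of a dual layer DVD.
DVD_DUAL_LAYER_GB = 8.5


def dvds_needed(data_gb: float, disc_gb: float = DVD_DUAL_LAYER_GB) -> int:
    """Return the number of whole discs needed to hold data_gb gigabytes."""
    return math.ceil(data_gb / disc_gb)


discs = dvds_needed(ANNUAL_DATA_GB)
print(f"{discs:,} dual layer DVDs per year")
```

Under that assumption the result is a little over 1.76 million discs, consistent with the slide's ">1,700,000".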
Big data – the GridPP solution
“With GridPP you need never have those data processing blues again…”
http://www.gridpp.ac.uk/about
Crowd sourcing for the LHC
Home and office computer users can sign up to the LHC at home project (based at Queen Mary, University of London), which makes use of idle CPU time. So far, 40,000 users in more than 100 countries have contributed the equivalent of 3000 years on a single computer to the project.
With the Large Hadron Collider running at CERN the grid is
being used to process the accompanying data deluge. The UK
grid is contributing more than the equivalent of 20,000 PCs to
this worldwide effort.
Yet… Data Preservation in High Energy Physics?
Data from high-energy physics (HEP) experiments are collected with significant
financial and human effort and are in many cases unique. At the same time, HEP has no
coherent strategy for data preservation and re-use, and many important and complex
data sets are simply lost.
David M. South, on behalf of the ICFA DPHEP Study Group
arXiv:1101.3186v1 [hep-ex]
Big data in genomics



   These studies are generating
   valuable datasets which, due to
   their size and complexity, need to
   be skilfully managed…
There’s a bigger deal than big data…
Socio-technical management perspectives
Information systems perspectives
Research practice perspectives

1.
• Identify drivers and champions
• Analyse stakeholders, issues
• Identify capability gaps
• Assess costs, benefits, risks

2.
• Inventory data assets
• Profile norms, roles, values
• Identify capability gaps
• Analyse current workflows

3.
• Produce feasible, desirable changes
• Evaluate fitness for purpose

Adapted from Developing Research Data Management Capabilities by Whyte et al, DCC, 2012
The DCC - building capacity and capability
through targeted institutional development
•   18 institutional engagements, 14 roadshows
•   advice and assistance in strategy and policy
•   use of curation tools for audit and planning
•   training and skills transfer
Why do we do this?
1. Reports that researchers are often unaware
   of threats and opportunities
http://www.flickr.com/photos/mattimattila/3003324844/




“Departments don’t have guidelines or norms for personal back-up and researcher
procedure, knowledge and diligence varies tremendously. Many have experienced
moderate to catastrophic data loss”
Incremental Project Report, June 2010
Why do we do this?
1. Reports that researchers are often unaware
   of threats and opportunities
2. There is a lack of clarity in terms of skills
   availability and acquisition
…researchers are
reluctant to adopt new tools and
services unless they know
someone who can recommend
or share knowledge about
them. Support needs to be
based on a close understanding
of the researchers’ work, its
patterns and timetables.
Why do we do this?
1. Reports that researchers are often unaware
   of threats and opportunities
2. There is a lack of clarity in terms of skills
   availability and acquisition
3. Many institutions are unprepared to meet
   the increasingly prescriptive demands of
   funders
EPSRC expects all those institutions it funds
• to have developed a roadmap aligning their policies
  and processes with EPSRC’s nine expectations by
  1st May 2012
• to be fully compliant with each of those expectations
  by 1st May 2015
• to recognise that compliance will be monitored and
  non-compliance investigated and that
• failure to share research data could result in the
  imposition of sanctions
Why do we do this?
1. Reports that researchers are often unaware
   of threats and opportunities
2. There is a lack of clarity in terms of skills
   availability and acquisition
3. Many institutions are unprepared to meet
   the increasingly prescriptive demands of
   funders
4. …and legislators
Rules and regulations…

Compliance

Data Protection Act 1998
• Rights, Exemptions, Enforcement

Freedom of Information Act 2000
• Climategate, Tree Rings, Tobacco and…(what’s next?)

Computer Misuse Act 1990
• etc. etc. etc.
Why do we do this?
1. Reports that researchers are often unaware
   of threats and opportunities
2. There is a lack of clarity in terms of skills
   availability and acquisition
3. Many institutions are unprepared to meet
   the increasingly prescriptive demands of
   funders
4. …and legislators
5. The advantages from planning, openness
   and sharing are not understood
Open to all? Case studies of openness
in research
Choices are made according to context, with
degrees of openness reached according to:
• The kinds of data to be made available
• The stage in the research process
• The groups to whom data will be made
  available
• On what terms and conditions it will be
  provided

Default position of most:
• YES to protocols, software, analysis tools,
  methods and techniques
• NO to making research data content freely
  available to everyone

After all, where is the incentive?              Angus Whyte, RIN/NESTA, 2010
DCC Institutional Engagements




http://www.dcc.ac.uk/community/institutional-engagements
                      Adapted from Developing Research Data Management Capabilities by Whyte et al, DCC, 2012
Main institutional concerns
– Compliance
– Asset management
– Cost benefits
– Incentivisation
– Complexity of the data environment

And big data? There has been no mention yet of any specific challenge from big data
but…
Institutions are providing resources to work on big data, both equipment and people,
and more importantly…
…the issues central to effective data management are common across the data
spectrum, irrespective of size
Some current institutional engagements
• Assessing needs
• Piloting tools e.g. DataFlow
• RDM roadmaps
• Policy development
• Policy implementation
Support offered by the DCC
DCC support team
• Assess needs
• Make the case
• Develop support and services
• …and support policy implementation

• DAF & CARDIO assessments
• Workflow assessment
• Institutional data catalogues
• Pilot RDM tools
• Guidance and training
• RDM policy development
• Customised Data Management Plans
• Advocacy to senior management
Four DCC Tools
Your Data as Assets: DAF
• What are the characteristics of your
  research data assets?
  –   Number?
  –   Scale?
  –   Complexity?
  –   Dependencies?
  –   Liabilities?
• Why do researchers act the way they do
  with respect to data?
• Which data do they need to undertake
  productive research?
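The characteristics listed on this slide (number, scale, complexity, dependencies, liabilities) suggest the kind of record an inventory of data assets might hold. The sketch below is an illustration only: it is not the DAF schema, and the `DataAsset` fields, the `summarise` function and the two sample assets are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DataAsset:
    """Hypothetical inventory record for one research data asset."""
    name: str
    size_gb: float                      # scale
    file_formats: List[str]             # a crude proxy for complexity
    depends_on: List[str] = field(default_factory=list)   # dependencies
    liabilities: List[str] = field(default_factory=list)  # e.g. legal or ethical constraints


def summarise(assets: List[DataAsset]) -> Dict[str, object]:
    """Roll individual records up into the headline questions on the slide."""
    return {
        "number": len(assets),
        "scale_gb": sum(a.size_gb for a in assets),
        "formats": sorted({f for a in assets for f in a.file_formats}),
        "with_dependencies": [a.name for a in assets if a.depends_on],
        "with_liabilities": [a.name for a in assets if a.liabilities],
    }


# Made-up example assets, for illustration only.
inventory = [
    DataAsset("imaging-archive", 1200.0, ["tiff", "csv"], depends_on=["viewer-software"]),
    DataAsset("interview-transcripts", 50.5, ["docx"], liabilities=["personal data"]),
]
summary = summarise(inventory)
```

A summary like this answers only the first question on the slide; why researchers handle data as they do, and which data they need, still have to be established by talking to them.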
DMP Online is a web-based data management
planning tool that allows you to build and edit plans
according to the requirements of the major UK
funders.

The tool also contains helpful guidance and links for
researchers and other data professionals.

http://www.dcc.ac.uk/dmponline
CARDIO is an online tool for departments or research groups to
identify their current data management capabilities
and identify coordinated pathways to future
enhancement via a dedicated knowledge base.

CARDIO emphasises a collaborative, consensus-
driven approach, and enables benchmarking with
other groups and institutions.

http://cardio.dcc.ac.uk/
DRAMBORA is an audit methodology and tool for
identifying and planning for the management of risks
which may threaten the availability and/or usability of
content in a digital repository or archive.

http://www.repositoryaudit.eu
So, big data
– no big deal for curation?
• Yes, it’s big
• It’s also very complex
• There is no single technology solution
• Issues of human infrastructure are
  possibly a bigger challenge
• But for big data aficionados the
  technology challenges are big enough
Data Management – infrastructure
and data storage challenges...
Scalability
Cost-effectiveness
Security (privacy and IPR)
Robust and resilient
Low entry barrier
Ease-of-use
Data-handling / transfer /
analysis capabilities
         The case for cloud computing in genome informatics.
         Lincoln D Stein, May 2010
Help desk:
0131 651 1239

info@dcc.ac.uk

www.dcc.ac.uk
