UKOLN is supported  by: Repositories and Digital Preservation Michael Day Research and Development Team Leader UKOLN, University of Bath RSP 'Goes back to' School, Matfen Hall, Northumberland, 14-16 September 2009
Presentation outline General context: repositories and digital preservation Digital preservation overview Tools: Preservation Planning (Plato) Repository audit (TRAC, DRAMBORA) Repositories and the curation of research data Roles and responsibilities Infrastructures Curation challenges
The repository context (1) Repository content is one part of a much wider digital preservation problem No major specific digital preservation requirements Repositories are part of the evolving structure of scholarly communication Preservation needs to be considered in the same conceptual space as things like e-journals (e.g., Portico) There is a commonly-held view that e-prints are just duplicates of the conventional literature created for immediate access and will not therefore need preservation
The repository context (2) Repositories can contain many types of materials, but the main focus has usually been on “e-prints” or theses Broadly text-based (analogues of traditional papers) This simplifies preservation requirements Compared with complex multimedia objects, “the preservation of e-prints is  relatively  straightforward from a technical point of view” (Pinfield and James, 2003) A large percentage of repository content has (until very recently) been made up of a relatively limited number of formats, e.g.: PDF, HTML, MS Word, RTF, TeX, PostScript
The repository context (3) Repositories are beginning to consider their role in preserving a wider range of content types Maintain accurate records of the whole research process or lifecycle (digital curation) Includes: research data (simulations, materials, the results of high-throughput instrumentation, open science, etc.), Web pages, Web 2.0 content (blogs, etc.), learning objects, images, time-based media, etc. For example: KeepIt project exemplars - research papers, science data, arts, teaching materials and theses: http://preservation.eprints.org/keepit/ This makes defining preservation requirements for repositories more difficult
The repository context (4) Repositories need to consider carefully their longer-term objectives and ambitions Clifford Lynch: “An institutional repository needs to be a service with continuity behind it … Institutions need to recognise that they are making commitments for the long term” This is dependent on institutional support Shared infrastructures Bilateral, regional, national, international Distributed approaches possible (e.g., SHERPA DP) A potential key role for national or research libraries (as with DARE in the Netherlands)
The repository context (5) Integration of preservation services with repository software Some experimentation in the PRESERV project: http://preservation.eprints.org/ Used third-party registries of format information (DROID and PRONOM from The National Archives) to characterise and validate repository content and to analyse risks RSP briefing paper on preservation and storage formats: http://www.rsp.ac.uk/pubs/briefingpapers-docs/technical-preservformats.pdf
Digital preservation basics An ongoing approach to managing digital content based on: The identification and adoption of appropriate preservation strategies Creation or Ingest stages are normally the best time to ensure that data are fit-for-purpose and “preservable” The collection and management of appropriate metadata Capture of explicit and implicit knowledge, contexts The ongoing monitoring of technical contexts and the application of preservation planning techniques Continual monitoring of the organisation (audit)
Technical challenges Digital media Currently magnetic or optical tape and disks, some devices (e.g., memory sticks) Uncertain lifetimes Hardware and software dependence Most digital objects are dependent on particular configurations of hardware and software Relatively short obsolescence cycles
Conceptual challenges (1) What is a digital object? Some are analogues of traditional objects, e.g. meeting minutes, research papers (e-prints) Others are not, e.g. Web pages, GIS, 3D models of chemical structures Complexity Dynamic nature
Conceptual challenges (2) Three layers: Physical: the bits stored on a particular medium Logical: defines how the bits are used by a software application, based on data types (e.g. ASCII); in order to understand (or preserve) the bits, we need to know how to process this Conceptual: things that we deal with in the real world From: Ken Thibodeau, “Overview of technological approaches to digital preservation and challenges in coming years.” In: The state of digital preservation: an international perspective (CLIR, 2002): http://www.clir.org/
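The gap between the physical and logical layers can be shown with a short Python sketch. The byte values and the two character encodings are assumptions chosen for illustration and do not come from Thibodeau's text. The point is only that identical stored bits yield different content depending on the logical rules applied to them.

```python
# Physical layer: a sequence of bytes as stored on some medium
# (illustrative values, not taken from the source).
stored_bytes = bytes([0x63, 0x61, 0x66, 0xE9])


def interpret(data: bytes, encoding: str) -> str:
    """Logical layer: apply one set of decoding rules to the stored bytes."""
    return data.decode(encoding)


# Two logical interpretations of the same physical bits.
as_latin1 = interpret(stored_bytes, "latin-1")
as_cp437 = interpret(stored_bytes, "cp437")

# The first three characters agree, but the final byte is read as a
# different character under each encoding, so the conceptual object differs.
same_reading = as_latin1 == as_cp437
```

Without a record of which encoding applies, a future reader holding only the bits cannot tell which of the two readings was intended.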
Conceptual challenges (3) On which of these layers should preservation activities focus? We need to preserve the ability to reproduce the objects, not just the bits (would a printout do?) In fact, we can change the bits and logical representation and still reproduce an authentic conceptual object (e.g. converting into PDF) Increased focus on reuse (e.g., data in tables) Authenticity and integrity How can we trust that an object is what it claims to be? Digital information can easily be changed by accident or design
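One common way of detecting accidental or deliberate change at the bit level is to record a checksum when an object is deposited and to recompute it later. The Python sketch below assumes SHA-256 and uses hypothetical function names; the slides do not prescribe any particular algorithm or tool. A matching checksum shows only that the bits are unchanged. It does not by itself establish that the object is what it claims to be.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 65536) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read fixed-size chunks so that large files need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_unchanged(path: Path, recorded_digest: str) -> bool:
    """Compare the current digest with one recorded earlier, e.g. at ingest."""
    return sha256_of(path) == recorded_digest
```

In use, the digest returned by `sha256_of` would be stored with the object's metadata at deposit, and `is_unchanged` would be run periodically against the stored value.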
Some general principles (1)  Most of the technical problems associated with long-term digital preservation can be solved if a life-cycle management approach is adopted  i.e. a continual programme of active management Ideally, combines both managerial and technical processes, e.g., as in the OAIS Reference Model Many current preservation systems are attempting to support this approach Digital preservation strategies need to be seen in this wider context
Some general principles (2) Preservation needs to be considered at a very early stage in an object's life-cycle There is a need to identify 'significant properties' Recognises that preservation is context dependent, even user specific (concept of 'designated community') Helps with choosing an acceptable preservation strategy Encapsulation Surrounding the digital object - at least conceptually - with all of the information needed to decode and understand it (including software) Produces autonomous 'self-describing' objects, reduces external dependencies (linked to the Information Package concept in the OAIS Reference Model)
Some general principles (3) Metadata and documentation are vitally important Relates to OAIS concepts like Representation Information and Preservation Description Information Functions Records scientific meaning Records the research context Enables the development of finding aids Standards are being developed that support digital preservation activities (e.g., the PREMIS Data Dictionary) Wherever possible, retain also the original byte-stream
Digital preservation strategies Three main families: Technology preservation / digital archaeology Emulation Migration
Technology preservation The preservation of an information object together with all of the hardware and software needed to interpret it Successfully preserves the look, feel and behaviour of the whole system (at least while the hardware and software still functions) Severe problems with storage and ongoing maintenance, missing documentation May have a role for historically important hardware May have a shorter-term role for supporting the rescue of digital objects (digital archaeology)
Digital archaeology Not so much a preservation strategy, but the default situation if there isn't one Using various techniques to recover digital content from obsolete or damaged physical objects (media, hardware, etc.) A time consuming process, needs specialised equipment and (in most cases) adequate documentation Considered to be expensive (and risky) Remains an option for content deemed to be of value that has not been dealt with in any other way
Emulation (1) Preserving the original bit-streams and application software; running this on emulator programs that mimic the behaviour of obsolete hardware Emulators evolve over time Chaining, rehosting Emulation Virtual Machines Running emulators on simplified 'virtual machines' that can be run on a range of different platforms
Emulation (2) Benefits: Technique already widely used, e.g. for emulating different hardware, computer games Preserves (and uses) the original bits Reduces the need for regular object transformations (but emulators and virtual machines may themselves need to be migrated) Retains ‘look-and-feel’ May be the only approach possible where objects are complex or dependent on executable code Less 'understanding' of formats is needed; little incremental cost in keeping additional formats
Emulation (3) Challenges: Do organisations have the technical skills necessary to implement the strategy? Preserving 'look and feel' may not be needed for all objects It will be difficult to know definitively whether user experience has been accurately preserved Uses: Promising family of approaches Needs further practical application and research, e.g. Dioscuri software (National Library of the Netherlands)
Migration (1) Based on the managed transformation of content: A set of organised tasks designed to achieve the periodic transfer of digital information from one hardware and software configuration to another, or from one generation of computer technology to a subsequent one - CPA/RLG report (1996) Abandons attempts to keep old technology (or substitutes for it) working A 'known' solution used by data archives and software vendors Focuses on the perceived content (or significant properties) of objects
Migration (2) Challenges: Can be labour intensive (batch process, monitoring, QA) There can be problems with ensuring the ongoing 'integrity and authenticity' of objects Transformations need to be documented (typically as part of the preservation metadata) Uses: Seems to be most suitable for dealing with large collections of similar objects (e-print repositories?) Migration can often be combined with some form of  standardisation process, e.g., as part of ingest A role for repository managers?
Preservation support on ingest Formats can be identified and validated on ingest or deposit into a repository JHOVE (JSTOR/Harvard Object Validation Environment) PRONOM, DROID (The National Archives) Metadata Some tools exist for the automatic capture of metadata Standardisation on ingest Perceived wisdom suggests the adoption of open or non-proprietary standards, e.g. databases structured in XML, uncompressed images, 'preservation friendly' standards like PDF/A
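Format identification on ingest can be illustrated with a simplified Python sketch that matches the leading bytes of a file against known signatures. This is not the API of DROID or JHOVE, which draw on much richer signature registries and, in JHOVE's case, also validate the file. The small signature table and the function name are assumptions for illustration.

```python
from typing import Optional

# Hand-picked table of leading "magic" bytes (illustrative, far from complete).
SIGNATURES = [
    (b"%PDF-", "PDF"),
    (b"%!PS", "PostScript"),
    (b"{\\rtf", "RTF"),
    (b"\x89PNG\r\n\x1a\n", "PNG"),
]


def identify_format(data: bytes) -> Optional[str]:
    """Return a format label if the leading bytes match a known signature."""
    for magic, label in SIGNATURES:
        if data.startswith(magic):
            return label
    # Unknown formats are reported as None so they can be flagged for review.
    return None
```

A repository might call `identify_format` on the first few bytes of each deposited file and record the result, treating `None` as a prompt for manual inspection.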
Choosing a strategy (1) Preservation strategies are not in competition Different strategies will work together; there may be value in diversification Migration strategies mean difficult choices need to be made about target formats But the strategy chosen has implications for: The technical infrastructure required (and metadata) Collection management priorities Rights management Owning the rights to re-engineer software Costs
Choosing a strategy (2) Plato preservation planning tool (EU Planets project) A decision support tool that helps users explore the evaluation of potential preservation solutions against specific requirements and for building a plan for preserving a given set of objects Integrates file format identification (using DROID); some migration services; XML-based generic format characterisation using XCL (eXtensible Characterisation Languages) http://www.ifs.tuwien.ac.at/dp/plato/intro.html
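The general idea of evaluating candidate preservation solutions against weighted requirements can be sketched in Python as a simple weighted sum. This is not Plato's interface or its actual scoring method. The requirement names, weights, candidate actions and scores below are all invented for illustration.

```python
from typing import Dict, List

# Hypothetical requirements and their relative weights.
WEIGHTS: Dict[str, float] = {
    "content_preserved": 0.5,
    "layout_preserved": 0.3,
    "low_cost": 0.2,
}

# Hypothetical scores (0 = worst, 5 = best) for two candidate actions.
ALTERNATIVES: Dict[str, Dict[str, float]] = {
    "migrate_to_pdfa": {"content_preserved": 5, "layout_preserved": 4, "low_cost": 3},
    "keep_original_only": {"content_preserved": 3, "layout_preserved": 3, "low_cost": 5},
}


def weighted_score(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Combine per-requirement scores into a single weighted average."""
    total_weight = sum(weights.values())
    return sum(scores[name] * w for name, w in weights.items()) / total_weight


def rank_alternatives(
    alternatives: Dict[str, Dict[str, float]], weights: Dict[str, float]
) -> List[str]:
    """Return the alternative names ordered from best to worst score."""
    return sorted(
        alternatives,
        key=lambda name: weighted_score(alternatives[name], weights),
        reverse=True,
    )
```

Calling `rank_alternatives(ALTERNATIVES, WEIGHTS)` orders the candidates by score. Changing the weights shows how sensitive the recommendation is to the priorities set by the repository.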
Repository audit frameworks (1) Repository audit frameworks first developed out of the OAIS Reference Model OAIS Mandatory Responsibilities (only six of them): The main focus was on technical and organisational aspects, e.g.: That repositories ensure that preserved information (content) can be understood (independently understandable) That documented policies and procedures are being followed No clear concept of OAIS compliance (although often claimed by system developers)
Repository audit frameworks (2) Trusted Repositories Audit and Certification (TRAC): Criteria and Checklist RLG-NARA Digital Repository Certification Task Force checklist, revised (following pilot audits) by the Center for Research Libraries and OCLC Criteria cover three main aspects: Organisational Infrastructure Governance and viability, structure and staffing, financial sustainability, contracts, etc. Digital Object Management Ingest, preservation planning, archival storage, etc. Technologies, Technical Infrastructure, & Security Systems and infrastructure, etc.
TRAC Checklist example page
Repository audit frameworks (3) DRAMBORA (Digital Repository Audit Method Based on Risk Assessment) Digital Curation Centre / Digital Preservation Europe “Presents a methodology for self-assessment, encouraging organisations to establish a comprehensive self-awareness of their objectives, activities and assets before identifying, assessing and managing the risks implicit within their organisation” Identifying risks and scoring each one on likelihood and impact Covers: organisational context, policies, assets, risks, etc. Online tool (http://www.repositoryaudit.eu/about/)
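Scoring risks on likelihood and impact can be illustrated with a minimal Python sketch of a risk register. The 1 to 5 scales, the multiplication of the two scores and the class and function names are assumptions for illustration. The slides do not give DRAMBORA's own scales or worksheet format.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Risk:
    name: str
    likelihood: int  # assumed scale: 1 (rare) to 5 (almost certain)
    impact: int      # assumed scale: 1 (negligible) to 5 (severe)

    @property
    def severity(self) -> int:
        """Combined score: likelihood multiplied by impact."""
        return self.likelihood * self.impact


def prioritise(risks: List[Risk]) -> List[Risk]:
    """Order risks from highest to lowest combined score."""
    return sorted(risks, key=lambda risk: risk.severity, reverse=True)


# Invented example entries for a repository risk register.
REGISTER = [
    Risk("Loss of key technical staff", likelihood=3, impact=4),
    Risk("Storage media failure without backup", likelihood=2, impact=5),
    Risk("Deposited format becomes obsolete", likelihood=4, impact=2),
]
```

Sorting the register with `prioritise(REGISTER)` gives a starting order for discussion. The scores are only as good as the judgements behind them, which is why the method stresses self-assessment.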
Repository audit frameworks (4) A means of "asking the right questions" about your repository and documenting appropriate procedures and risks Both TRAC and DRAMBORA are under consideration by (different) ISO technical committees External badge of quality (a "certified preservation repository") vs. Management tool for self assessment
Web links: PRESERV project: http://preservation.eprints.org/ KeepIt project: http://preservation.eprints.org/keepit/ Plato Preservation Planning tool: http://www.ifs.tuwien.ac.at/dp/plato/intro.html DRAMBORA: http://www.repositoryaudit.eu/about/ RSP briefing paper on preservation and storage formats: http://www.rsp.ac.uk/pubs/briefingpapers-docs/technical-preservformats.pdf
Questions? “Pigabyte” King Bladud’s Pigs in Bath (public art project), Summer 2008 http://www.kingbladudspigs.org/
Repositories and the curation of research data
Dealing with research data An extremely broad category of material: “... any information that can be stored in digital form, including text, numbers, images, video or movies, audio, software, algorithms, equations, animations, models, simulations, etc.” (National Science Board, Long-lived digital data collections, 2005) In practice, it can mean almost anything
Why curate research data? (1) Part of the normal research process: The need for others to validate and replicate research In some disciplines, supporting data is routinely made available to reviewers and linked from journal papers Principles of sharing and openness are firmly embedded in some disciplines
Why curate research data? (2) Extrinsic and intrinsic value: High investment in research Data can be very expensive to capture and analyse Some data is impossible to recreate once lost Observational data (by definition) is irreplaceable Current generations of instruments can gather more data than can be analysed
Why curate research data? (3) The potential for creating 'new' knowledge from existing data: Re-use, re-analysis, data mining Annotation, e.g. in molecular biology, astronomy Combining datasets in innovative ways, e.g. mapping biodiversity data onto ecological GIS “Science 2.0”
Why curate research data? (4) It is increasingly a requirement of some research funding bodies Some have quite mature data retention policies (not necessarily for permanent retention) Increasing expectation of access to data from publicly-funded research OECD Principles and guidelines for access to research data from public funding (2007)
Why curate research data? (5) Institutional asset management: Universities and other research organisations invest very large sums of money into research activities Research data is a key output of this activity It is, therefore, an institutional asset that needs stewardship
Why curate research data? (6) Promoting the institution, research group or individual: Re-use helps promote visibility and 'impact' Institutions become acknowledged 'centres of competence'
Who undertakes preservation? Researchers Indirectly - they have most direct contact with creation stage, and understand how data can be used Directly - sometimes responsible for maintaining community data collections Information professionals Sometimes, but it depends on the context  IT professionals Primarily informaticians working with scientists
Roles and responsibilities (1) Long-lived data collections (NSB) Data authors Data managers Data scientists Data users Funding agencies Dealing with data (JISC) Scientist Institution Data centre User Funder Publisher
Roles and responsibilities (2) Scientists Initial creation and use of data Expectation of first use and of gaining appropriate credit and recognition Responsible for: Managing data for the life of the project Using standards (where possible) Complying with data policies Making the data available in a form that can (easily?) be used by others
Roles and responsibilities (3) Institutions: Role less clear Institutional policies may require short-term management of data Advocacy and training Some institutions are developing repository services Are rarely currently used for research data Federated approaches maintain disciplinary involvement
Roles and responsibilities (4) Data centres Undertake curation and provide access Responsible for: Selection and ingest Participating in the development of standards Protecting the rights of data creators Supporting ingest and metadata capture Supporting re-use (tools and services) Training
Roles and responsibilities (5) Users: Users of third-party data Responsible for: Adhering to any licenses and restrictions on use Acknowledging data creators and curators Managing any derived data Providing feedback to scientists and data centres
Roles and responsibilities (6) Funding bodies: Acting at policy level Responsible for: Considering wider policy perspectives Developing policies in co-operation with other stakeholders Monitoring and enforcing data policies Support for long-term data management Support for data curation
Research data collections (1) A typology (1): From National Science Board report Long-lived digital data collections (2005) Research data collections – the products of one or more focused research projects Resource or community data collections – collections that emerge to serve particular subject sub-disciplines Reference data collections – serve a broader and more diverse set of user communities
Research data collections (2) Data in “research data collections” is most at risk A modern version of the “file-drawer problem” Data stored on personal hard-drives or on media; largely undocumented Particular challenge when the data creator has retired or moved to another institution Data creators not always aware of its potential value The reward structure of science is not always helpful
Curation infrastructures (1) Focus on the generic: Need for a balance between: The 'bottom-up' discipline-based drivers that promote the generation of research data The policy level, looking to make cost effective investment in curation When building Infrastructures, focus on the generic Storage systems and middleware Preservation services Identifying the needs of the wider community
Curation infrastructures (2) The need for collaboration: Need for 'deep-infrastructure' recognised as far back as 1996 by the Task Force on Archiving of Digital Information Digital preservation involves the "grander problem of organizing ourselves over time and as a society ... [to manoeuvre] effectively in a digital landscape" (p. 7)
Curation challenges: Costs NSF Task Force looking at this subject JISC-funded LIFE (Life Cycle Information for E-Literature) project is developing a predictive costing tool (http://www.life.ac.uk/) JISC-funded study (Keeping research data safe, 2008) focused on research data curation at the institution level The complex service requirements for curating research data mean that institutions are setting up federated approaches to repository development Currently ingest costs are much higher than long-term storage and preservation costs
Curation challenges: Scale (1) The “digital deluge” in e-Science New generations of instruments Computer  simulations Many terabytes generated per day, petabyte scale computing (and growing) Cory Doctorow, “Welcome to the petacentre.” Nature, 455, pp 17-21, 4 Sep 2008 Are Institutional Repositories ready for this? Digitised content: Google Book Search (~7 million items) A role for research libraries?
Curation challenges: Scale (2) Problems of scale are particularly acute in traditional 'big-science' disciplines: Particle physics (e.g., the Large Hadron Collider) Astronomy (sky surveys, etc) But “smaller experiments will grow the fastest” (Szalay & Gray,  Nature , 440, 413-4, 23 Mar 2006) Bioinformatics, crystallography, engineering design, and many others In some cases it may be cheaper just to generate the data again, e.g. for computer simulations
Curation challenges: Complexity (2) Research data is extremely diverse - not really a single category of material tabular data, images, GIS, etc. raw machine output vs. derived data varying levels of structure (XML, legacy formats, etc.) many different standards Research data is not homogeneous No one-size-fits-all approach possible
Curation challenges: Cultures Diverse research cultures Data practices vary widely, even within a single discipline Gene sequence data is typically deposited in public databases In proteomics, sharing is not so widespread; partly driven by lack of standards, but there is also concern about who has exploitation rights Role of commercial interests Pharmaceuticals, architecture and engineering, geological prospecting
The Future ... “It is always a mistake for a historian to try and predict the future. Life, unlike science, is simply too full of surprises” - Richard J. Evans,  In defence of history  (1997, p. 62)
Further reading National Science Board, Long-lived digital data collections: enabling research and education in the 21st century (NSF, 2005) http://www.nsf.gov/pubs/2005/nsb0540/ Liz Lyon, Dealing with data: roles, rights, responsibilities and relationships (JISC, 2007) http://www.jisc.ac.uk/whatwedo/programmes/digitalrepositories2005/dealingwithdata.aspx Neil Beagrie, Julia Chruszcz, and Brian Lavoie, Keeping research data safe: a cost model and guidance for UK universities (JISC, 2008) http://www.jisc.ac.uk/publications/publications/keepingresearchdatasafe.aspx
Acknowledgments UKOLN is funded by the Joint Information Systems Committee (JISC) of the UK higher and further education funding councils, the Museums, Libraries and Archives Council (MLA), as well as by project funding from the JISC, the European Union, and other sources. UKOLN also receives support from the University of Bath, where it is based. More information: http://www.ukoln.ac.uk/
Thank You!


Repositories and digital preservation

  • 1. UKOLN is supported by: Repositories and Digital Preservation Michael Day Research and Development Team Leader UKOLN, University of Bath RSP 'Goes back to' School, Matfen Hall, Northumberland, 14-16 September 2009
  • 2. Presentation outline General context: repositories and digital preservation Digital preservation overview Tools: Preservation Planning (Plato) Repository audit (TRAC, DRAMBORA) Repositories and the curation of research data Roles and responsibilities Infrastructures Curation challenges
  • 3. The repository context (1) Repository content is one part of a much wider digital preservation problem No major specific digital preservation requirements Repositories are part of the evolving structure of scholarly communication Preservation needs to be considered in the same conceptual space as things like e-journals (e.g., Portico) There is a commonly-held view that e-prints are just duplicates of the conventional literature created for immediate access and will not therefore need preservation
  • 4. The repository context (2) Repositories can contain many types of materials, but the main focus has usually been on “e-prints” or theses Broadly text-based (analogues of traditional papers) This simplifies preservation requirements Compared with complex multimedia objects, “the preservation of e-prints is relatively straightforward from a technical point of view” (Pinfield and James, 2003) A large percentage of repository content has (until very recently) been made up of a relatively limited number of formats, e.g.: PDF, HTML, MS Word, RTF, TeX, PostScript
  • 5. The repository context (3) Repositories are beginning to consider their role in preserving a wider range of content types Maintain accurate records of the whole research process or lifecycle (digital curation) Includes: research data (simulations, materials, the results of high-throughput instrumentation, open science, etc.), Web pages, Web 2.0 content (blogs, etc.), learning objects, images, time-based media, etc. For example: KeepIt project exemplars - research papers, science data, arts, teaching materials and theses: http://preservation.eprints.org/keepit/ This makes defining preservation requirements for repositories more difficult
  • 6. The repository context (4) Repositories need to consider carefully their longer-term objectives and ambitions Clifford Lynch: “An institutional repository needs to be a service with continuity behind it … Institutions need to recognise that they are making commitments for the long term” This is dependent on institutional support Shared infrastructures Bilateral, regional, national, international Distributed approaches possible (e.g., SHERPA DP) A potential key role for national or research libraries (as with DARE in the Netherlands)
  • 7. The repository context (5) Integration of preservation services with repository software Some experimentation in the PRESERV project: http://preservation.eprints.org/ Used third-party registries of format information (DROID and PRONOM from The National Archives) to characterise and validate repository content and to analyse risks RSP briefing paper on preservation and storage formats: http://www.rsp.ac.uk/pubs/briefingpapers-docs/technical-preservformats.pdf
  • 8. Digital preservation basics An ongoing approach to managing digital content based on: The identification and adoption of appropriate preservation strategies Creation or Ingest stages are normally the best time to ensure that data are fit-for-purpose and “preservable” The collection and management of appropriate metadata Capture of explicit and implicit knowledge, contexts The ongoing monitoring of technical contexts and the application of preservation planning techniques Continual monitoring of the organisation (audit)
  • 9. Technical challenges Digital media Currently magnetic or optical tape and disks, some devices (e.g., memory sticks) Uncertain lifetimes Hardware and software dependence Most digital objects are dependent on particular configurations of hardware and software Relatively short obsolescence cycles
  • 10. Conceptual challenges (1) What is a digital object? Some are analogues of traditional objects, e.g. meeting minutes, research papers (e-prints) Others are not, e.g. Web pages, GIS, 3D models of chemical structures Complexity Dynamic nature
  • 11. Conceptual challenges (2) Three layers: Physical: the bits stored on a particular medium Logical: defines how the bits are used by a software application, based on data types (e.g. ASCII); in order to understand (or preserve) the bits, we need to know how to process this Conceptual: things that we deal with in the real world From: Ken Thibodeau, “Overview of technological approaches to digital preservation and challenges in coming years.” In: The state of digital preservation: an international perspective (CLIR, 2002): http://www.clir.org/
  • 12. Conceptual challenges (3) On which of these layers should preservation activities focus? We need to preserve the ability to reproduce the objects, not just the bits (would a printout do?) In fact, we can change the bits and logical representation and still reproduce an authentic conceptual object (e.g. converting into PDF) Increased focus on reuse (e.g., data in tables) Authenticity and integrity How can we trust that an object is what it claims to be? Digital information can easily be changed by accident or design
  • 13. Some general principles (1) Most of the technical problems associated with long-term digital preservation can be solved if a life-cycle management approach is adopted i.e. a continual programme of active management Ideally, combines both managerial and technical processes, e.g., as in the OAIS Reference Model Many current preservation systems are attempting to support this approach Digital preservation strategies need to be seen in this wider context
  • 14. Some general principles (2) Preservation needs to be considered at a very early stage in an object's life-cycle There is a need to identify 'significant properties' Recognises that preservation is context dependent, even user specific (concept of 'designated community') Helps with choosing an acceptable preservation strategy Encapsulation Surrounding the digital object - at least conceptually - with all of the information needed to decode and understand it (including software) Produces autonomous 'self-describing' objects, reduces external dependencies (linked to the Information Package concept in the OAIS Reference Model)
  • 15. Some general principles (3) Metadata and documentation are vitally important Relates to OAIS concepts like Representation Information and Preservation Description Information Functions Records scientific meaning Records the research context Enables the development of finding aids Standards are being developed that support digital preservation activities (e.g., the PREMIS Data Dictionary) Wherever possible, retain also the original byte-stream
  • 16. Digital preservation strategies Three main families: Technology preservation / digital archaeology Emulation Migration
  • 17. Technology preservation The preservation of an information object together with all of the hardware and software needed to interpret it Successfully preserves the look, feel and behaviour of the whole system (at least while the hardware and software still functions) Severe problems with storage and ongoing maintenance, missing documentation May have a role for historically important hardware May have a shorter-term role for supporting the rescue of digital objects (digital archaeology)
  • 18. Digital archaeology Not so much a preservation strategy, but the default situation if there isn't one Using various techniques to recover digital content from obsolete or damaged physical objects (media, hardware, etc.) A time consuming process, needs specialised equipment and (in most cases) adequate documentation Considered to be expensive (and risky) Remains an option for content deemed to be of value that has not been dealt with in any other way
  • 19. Emulation (1) Preserving the original bit-streams and application software; running this on emulator programs that mimic the behaviour of obsolete hardware Emulators evolve over time Chaining, rehosting Emulation Virtual Machines Running emulators on simplified 'virtual machines' that can be run on a range of different platforms
  • 20. Emulation (2) Benefits: Technique already widely used, e.g. for emulating different hardware, computer games Preserves (and uses) the original bits Reduces the need for regular object transformations (but emulators and virtual machines may themselves need to be migrated) Retains ‘look-and-feel’ May be the only approach possible where objects are complex or dependent on executable code Less 'understanding' of formats is needed; little incremental cost in keeping additional formats
  • 21. Emulation (3) Challenges: Do organisations have the technical skills necessary to implement the strategy? Preserving 'look and feel' may not be needed for all objects It will be difficult to know definitively whether user experience has been accurately preserved Uses: Promising family of approaches Needs further practical application and research, e.g. Dioscuri software (National Library of the Netherlands)
  • 22. Migration (1) Based on the managed transformation of content: A set of organised tasks designed to achieve the periodic transfer of digital information from one hardware and software configuration to another, or from one generation of computer technology to a subsequent one - CPA/RLG report (1996) Abandons attempts to keep old technology (or substitutes for it) working A 'known' solution used by data archives and software vendors Focuses on the perceived content (or significant properties) of objects
  • 23. Migration (2) Challenges: Can be labour intensive (batch process, monitoring, QA) There can be problems with ensuring the ongoing 'integrity and authenticity' of objects Transformations need to be documented (typically as part of the preservation metadata) Uses: Seems to be most suitable for dealing with large collections of similar objects (e-print repositories?) Migration can often be combined with some form of standardisation process, e.g., as part of ingest A role for repository managers?
  • 24. Preservation support on ingest Formats can be identified and validated on ingest or deposit into a repository JHOVE (JSTOR/Harvard Object Validation Environment) PRONOM, DROID (The National Archives) Metadata Some tools exist for the automatic capture of metadata Standardisation on ingest Perceived wisdom suggests the adoption of open or non-proprietary standards, e.g. databases structured in XML, uncompressed images, 'preservation friendly' standards like PDF/A
  • 25. Choosing a strategy (1) Preservation strategies are not in competition Different strategies will work together, may be value in diversification Migration strategies mean difficult choices need to be made about target formats But the strategy chosen has implications for: The technical infrastructure required (and metadata) Collection management priorities Rights management Owning the rights to re-engineer software Costs
  • 26. Choosing a strategy (2) Plato preservation planning tool (EU Planets project) A decision support tool that helps users explore the evaluation of potential preservation solutions against specific requirements and for building a plan for preserving a given set of objects Integrates file format identification (using DROID); some migration services; XML-based generic format characterisation using XCL (eXtensible Characterisation Languages) http://www.ifs.tuwien.ac.at/dp/plato/intro.html
  • 27. Repository audit frameworks (1) Repository audit frameworks first developed out of the OAIS Reference Model OAIS Mandatory Responsibilities (only six of them): The main focus was on technical and organisational aspects, e.g.: That repositories ensure that preserved information (content) can be understood (independently understandable) That documented policies and procedures are being followed No clear concept of OAIS compliance (although often claimed by system developers)
  • 28. Repository audit frameworks (2) Trusted Repositories Audit and Certification (TRAC): Criteria and Checklist RLG-NARA Digital Repository Certification Task Force checklist, revised (following pilot audits) by the Center for Research Libraries and OCLC Criteria cover three main aspects: Organisational Infrastructure Governance and viability, structure and staffing, financial sustainability, contracts, etc. Digital Object Management Ingest, preservation planning, archival storage, etc. Technologies, Technical Infrastructure, & Security Systems and infrastructure, etc.
  • 30. Repository audit frameworks (3) DRAMBORA (Digital Repository Audit Method Based on Risk Assessment) Digital Curation Centre / Digital Preservation Europe “Presents a methodology for self-assessment, encouraging organisations to establish a comprehensive self-awareness of their objectives, activities and assets before identifying, assessing and managing the risks implicit within their organisation” Identifying risks and scoring each one on likelihood and impact Covers: organisational context, policies, assets, risks, etc. Online tool (http://www.repositoryaudit.eu/about/)
  • 31. Repository audit frameworks (4) A means of "asking the right questions" about your repository and documenting appropriate procedures and risks Both TRAC and DRAMBORA are under consideration by (different) ISO technical committees External badge of quality (a "certified preservation repository") vs. Management tool for self assessment
  • 32. Web links: PRESERV project: http://preservation.eprints.org/ KeepIt project: http://preservation.eprints.org/keepit/ Plato Preservation Planning tool: http://www.ifs.tuwien.ac.at/dp/plato/intro.html DRAMBORA: http://www.repositoryaudit.eu/about/ RSP briefing paper on preservation and storage formats: http://www.rsp.ac.uk/pubs/briefingpapers-docs/technical-preservformats.pdf
  • 33. Questions? “Pigabyte” King Bladud’s Pigs in Bath (public art project), Summer 2008 http://www.kingbladudspigs.org/
  • 34. Repositories and the curation of research data
  • 35. Dealing with research data An extremely broad category of material: “... any information that can be stored in digital form, including text, numbers, images, video or movies, audio, software, algorithms, equations, animations, models, simulations, etc.” (National Science Board, Long-lived digital data collections, 2005) In practice, it can mean almost anything
  • 36. Why curate research data? (1) Part of the normal research process: The need for others to validate and replicate research In some disciplines, supporting data is routinely made available to reviewers and linked from journal papers Principles of sharing and openness are firmly embedded in some disciplines
  • 37. Why curate research data? (2) Extrinsic and intrinsic value; High investment in research Data can be very expensive to capture and analyse Data is impossible to recreate once lost Observational data (by definition) is irreplaceable Current generations of instruments can gather more data than can be analysed
  • 38. Why curate research data? (3) The potential for creating 'new' knowledge from existing data: Re-use, re-analysis, data mining Annotation, e.g. in molecular biology, astronomy Combining datasets in innovative ways, e.g. mapping biodiversity data onto ecological GIS “Science 2.0”
  • 39. Why curate research data? (4) It is increasingly a requirement of some research funding bodies Some have quite mature data retention policies (not necessarily for permanent retention) Increasing expectation of access to data from publicly-funded research OECD Principles and guidelines for access to research data from public funding (2007)
  • 40. Why curate research data? (5) Institutional asset management: Universities and other research organisations invest very large sums of money into research activities Research data is a key output of this activity It is, therefore, an institutional asset that needs stewardship
  • 41. Why curate research data? (6) Promoting the institution, research group or individual: Re-use helps promote visibility and 'impact' Institutions become acknowledged 'centres of competence'
  • 42. Who undertakes preservation? Researchers Indirectly - they have most direct contact with creation stage, and understand how data can be used Directly - sometimes responsible for maintaining community data collections Information professionals Sometimes, but it depends on the context IT professionals Primarily informaticians working with scientists
  • 43. Roles and responsibilities (1) Long-lived data collections (NSB) Data authors Data managers Data scientists Data users Funding agencies Dealing with data (JISC) Scientist Institution Data centre User Funder Publisher
  • 44. Roles and responsibilities (2) Scientists Initial creation and use of data Expectation of first use and in gaining appropriate credit and recognition Responsible for: Managing data for life of project For using standards (where possible) For complying with data policies For making the data available in a form that can (easily?) be used by others
  • 45. Roles and responsibilities (3) Institutions: Role less clear Institutional policies may require short-term management of data Advocacy and training Some institutions are developing repository services Are rarely currently used for research data Federated approaches maintain disciplinary involvement
  • 46. Roles and responsibilities (4) Data centres Undertake curation and provide access Responsible for: Selection and ingest Participating in the development of standards Protecting the rights of data creators Supporting ingest and metadata capture Supporting re-use (tools and services) Training
  • 47. Roles and responsibilities (5) Users: Users of third-party data Responsible for: Adhering to any licenses and restrictions on use Acknowledging data creators and curators Managing any derived data Providing feedback to scientists and data centres
  • 48. Roles and responsibilities (6) Funding bodies: Acting at policy level Responsible for: Considering wider policy perspectives Developing policies in co-operation with other stakeholders Monitoring and enforcing data policies Support for long-term data management Support for data curation
  • 49. Research data collections (1) A typology (1): From National Science Board report Long-lived digital data collections (2005) Research data collections – the products of one or more focused research projects Resource or community data collections – collections that emerge to serve particular subject sub-disciplines Reference data collections – serve a broader and more diverse set of user communities
  • 50. Research data collections (2) Data in “research data collections” is most at risk A modern version of the “file-drawer problem” Data stored on personal hard-drives or on media; largely undocumented Particular challenge when the data creator has retired or moved to another institution Data creators not always aware of its potential value The reward structure of science is not always helpful
  • 51. Curation infrastructures (1) Focus on the generic: Need for a balance between: The 'bottom-up' discipline-based drivers that promote the generation of research data The policy level, looking to make cost effective investment in curation When building Infrastructures, focus on the generic Storage systems and middleware Preservation services Identifying the needs of the wider community
  • 52. Curation infrastructures (2) The need for collaboration: Need for 'deep-infrastructure' recognised as far back as 1996 by the Task Force on Archiving of Digital Information Digital preservation involves the "grander problem of organizing ourselves over time and as a society ... [to manoeuvre] effectively in a digital landscape" (p. 7)
  • 53. Curation challenges: Costs NSF Task Force looking at this subject JISC-funded LIFE (Life Cycle Information for E-Literature) project is developing a predictive costing tool (http://www.life.ac.uk/) JISC-funded study (Keeping research data safe, 2008) focused on research data curation at the institution level The complex service requirements for curating research data mean that institutions are setting up federated approaches to repository development Currently ingest costs are much higher than long-term storage and preservation costs
  • 54. Curation challenges: Scale (1) The “digital deluge” in e-Science New generations of instruments Computer simulations Many terabytes generated per day, petabyte scale computing (and growing) Cory Doctorow, “Welcome to the petacentre.” Nature, 455, pp 17-21, 4 Sep 2008 Are Institutional Repositories ready for this? Digitised content: Google Book Search (~7 million items) A role for research libraries?
  • 55. Curation challenges: Scale (2) Problems of scale are particularly acute in traditional 'big-science' disciplines: Particle physics (e.g., the Large Hadron Collider) Astronomy (sky surveys, etc.) But “smaller experiments will grow the fastest” (Szalay & Gray, Nature, 440, 413-4, 23 Mar 2006) Bioinformatics, crystallography, engineering design, and many others In some cases it may be cheaper just to generate the data again, e.g. for computer simulations
  • 56. Curation challenges: Complexity (2) Research data is extremely diverse - not really a single category of material tabular data, images, GIS, etc. raw machine output vs. derived data varying levels of structure (XML, legacy formats, etc.) many different standards Research data is not homogeneous No one-size-fits-all approach possible
  • 57. Curation challenges: Cultures Diverse research cultures Data practices vary widely, even within a single discipline Gene sequence data is typically deposited in public databases In proteomics, sharing is not so widespread; partly driven by lack of standards, but there is also concern about who has exploitation rights Role of commercial interests Pharmaceuticals, architecture and engineering, geological prospecting
  • 58. The Future ... “It is always a mistake for a historian to try and predict the future. Life, unlike science, is simply too full of surprises” - Richard J. Evans, In defence of history (1997, p. 62)
  • 59. Further reading National Science Board, Long-lived digital data collections: enabling research and education in the 21st century (NSF, 2005) http://www.nsf.gov/pubs/2005/nsb0540/ Liz Lyon, Dealing with data: roles, rights, responsibilities and relationships (JISC, 2007) http://www.jisc.ac.uk/whatwedo/programmes/digitalrepositories2005/dealingwithdata.aspx Neil Beagrie, Julia Chruszcz, and Brian Lavoie, Keeping research data safe: a cost model and guidance for UK universities (JISC, 2008) http://www.jisc.ac.uk/publications/publications/keepingresearchdatasafe.aspx
  • 60. Acknowledgments UKOLN is funded by the Joint Information Systems Committee (JISC) of the UK higher and further education funding councils, the Museums, Libraries and Archives Council (MLA), as well as by project funding from the JISC, the European Union, and other sources. UKOLN also receives support from the University of Bath, where it is based. More information: http://www.ukoln.ac.uk/
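
Slide 12 asks how we can trust that an object is what it claims to be, given that digital information can easily be changed by accident or design. The slides do not name a mechanism; one widely used technique, assumed here purely for illustration, is to record a cryptographic checksum when an object is ingested and to recompute it later. The sketch below uses Python's standard `hashlib` module; the function names `checksum` and `is_unchanged` and the sample byte-stream are hypothetical.

```python
import hashlib


def checksum(data: bytes, algorithm: str = "sha256") -> str:
    """Return the hex digest of a byte-stream (hypothetical helper)."""
    return hashlib.new(algorithm, data).hexdigest()


def is_unchanged(data: bytes, recorded_digest: str, algorithm: str = "sha256") -> bool:
    """Compare a freshly computed digest with the one recorded at ingest."""
    return checksum(data, algorithm) == recorded_digest


# Illustrative data only: a stand-in for a deposited e-print.
original = b"%PDF-1.4 example e-print byte-stream"
recorded = checksum(original)      # stored alongside the object at ingest

tampered = original + b" "         # a single extra byte, by accident or design
print(is_unchanged(original, recorded))
print(is_unchanged(tampered, recorded))
```

A check of this kind only covers the physical layer described on slide 11 (the bits). It says nothing about whether the logical or conceptual object can still be reproduced, which is why the slides treat authenticity as a wider problem than bit-level integrity.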
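
Slide 24 notes that formats can be identified and validated on ingest, using tools such as DROID with the PRONOM registry. The following is a minimal sketch of the underlying idea only, signature ("magic number") matching against the first bytes of a file; it is not DROID's API, and the three-entry `SIGNATURES` table and the `identify` function are hypothetical stand-ins for a real registry.

```python
# Hypothetical, hand-written signature table; a real registry such as
# PRONOM holds far more formats and more precise signatures.
SIGNATURES = [
    (b"%PDF-", "PDF"),
    (b"%!PS", "PostScript"),
    (b"{\\rtf", "RTF"),
]


def identify(data: bytes) -> str:
    """Return a format name based on the leading bytes, or 'unknown'."""
    for magic, name in SIGNATURES:
        if data.startswith(magic):
            return name
    return "unknown"


print(identify(b"%PDF-1.4\n..."))
print(identify(b"{\\rtf1\\ansi Hello}"))
print(identify(b"plain text with no signature"))
```

Identification of this kind says what a file appears to be; validation (the role the slide assigns to JHOVE) goes further and checks that the file actually conforms to the format it claims.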
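
Slide 30 describes DRAMBORA as identifying risks and scoring each one on likelihood and impact. As a rough sketch of what such a risk register might look like, the example below multiplies the two scores to rank risks. The multiplication, the 1 to 5 scales, the `Risk` class and the sample risks are all assumptions made for illustration, not details taken from the slides or from the DRAMBORA toolkit.

```python
from dataclasses import dataclass


@dataclass
class Risk:
    name: str
    likelihood: int  # assumed scale: 1 (rare) to 5 (almost certain)
    impact: int      # assumed scale: 1 (negligible) to 5 (severe)

    @property
    def score(self) -> int:
        # Assumed convention: severity = likelihood x impact.
        return self.likelihood * self.impact


def rank(risks):
    """Return risks ordered from highest to lowest score."""
    return sorted(risks, key=lambda r: r.score, reverse=True)


# Hypothetical register entries for a small institutional repository.
register = [
    Risk("Storage media failure without backup", likelihood=1, impact=5),
    Risk("File format becomes obsolete", likelihood=3, impact=4),
    Risk("Loss of institutional funding", likelihood=2, impact=5),
]

ranked = rank(register)
for risk in ranked:
    print(f"{risk.score:>2}  {risk.name}")
```

Used as a self-assessment aid, a register like this mainly helps with "asking the right questions" (slide 31): the numbers matter less than the discussion needed to agree on them.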

Editor's Notes

  • #11: Reference: Thibodeau, K. (2002). "Overview of technological approaches to digital preservation and challenges in coming years." In: The state of digital preservation: an international perspective. Washington, D.C.: Council on Library and Information Resources. Available: http://www.clir.org/pubs/abstract/pub107abst.html
  • #15: References: Nelson, M.L. (2001). "Buckets: a new digital library technology for preserving NASA research." Journal of Government Information , 28(4), 369-394. http://www.cs.odu.edu/~mln/pubs/jgi/jgi-eprint.pdf Universal Preservation Format: http://info.wgbh.org/upf/
  • #16: References: Nelson, M.L. (2001). "Buckets: a new digital library technology for preserving NASA research." Journal of Government Information , 28(4), 369-394. http://www.cs.odu.edu/~mln/pubs/jgi/jgi-eprint.pdf Universal Preservation Format: http://info.wgbh.org/upf/
  • #18: Reference: Feeny, M. (1999). Digital culture: maximising the nation's investment. London: National Preservation Office.
  • #19: Reference: Ross, S., & Gow, A. (1999). Digital archaeology: rescuing neglected and damaged data resources. JISC/NPO Study: http://www.ukoln.ac.uk/services/elib/papers/supporting/pdf/p2.pdf
  • #20: Reference: Rothenberg, J. (1998). Avoiding technological quicksand: finding a viable technical foundation for digital preservation. Washington, D.C.: Council on Library and Information Resources. http://www.clir.org/pubs/reports/rothenberg/contents.html
  • #23: Reference: Preserving digital information: report of the Task Force on Archiving of Digital Information commissioned by the Commission on Preservation and Access and the Research Libraries Group. Washington, D.C.: Commission on Preservation and Access, 1996. http://www.rlg.org/ArchTF/