Preservation Planning:
Choosing a suitable preservation approach
Long-term Archiving Perspectives of European Union Publications meeting
Office for Official Publications of the European Communities
Luxembourg, November 10-11, 2011



Gareth Knight
Centre for e-Research
Preservation Objectives
Authentic - it is what it purports to be
Understandability – what does this information mean?
[Figure: Priscilla Caplan's revised Preservation Pyramid; visible layers, top to bottom: Content preservation, Bitstream preservation]
Identity
   • The exact sameness of things.
   • Leibniz's law indicates that 2 items that share
     all of their attributes are not merely similar, but are the
     same thing
   • Can two things be the same? “ultimately nothing is
     the same as something else” (Paskin, 2003)

A painting of Leibniz

Questions:
   • Both images are a pictorial representation of Leibniz
       • Image A is constructed using paint on a canvas
       • Image B is constructed as 0s and 1s
   • Do they share the same identity?
   • Is it necessary for all object attributes to be the same, or is
     it acceptable to have some degree of granularity?
   • How much is identity based upon ability to measure
     attributes?

                                                           Scanned copy of painting
Integrity
Is integrity maintained = Yes/No
• Linked to notions of consistency, wholeness and truth
• There has not been deliberate or accidental damage/change
  that has caused meaning to be altered or lost, in part or
  entirety.
• Checksum algorithm applied to a file generates a distinct
  (possibly unique) alphanumeric value




• Commonly used to check for accidental/deliberate data
  change/corruption
   • Generate checksum on October 1st
   • Generate checksum on October 14th & compare to Oct 1st value –
     are they the same? Yes/No
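
The two-date comparison above can be sketched in a few lines of Python. This is only an illustration: SHA-256 is an assumed choice of algorithm, and the function names `file_checksum` and `integrity_maintained` are hypothetical, not taken from any tool named in these slides.

```python
import hashlib
from pathlib import Path


def file_checksum(path: Path, algorithm: str = "sha256") -> str:
    """Return the hex digest of a file, read in blocks to bound memory use."""
    digest = hashlib.new(algorithm)
    with path.open("rb") as handle:
        # Read 64 KiB at a time until read() returns an empty bytes object.
        for block in iter(lambda: handle.read(65536), b""):
            digest.update(block)
    return digest.hexdigest()


def integrity_maintained(path: Path, recorded: str, algorithm: str = "sha256") -> bool:
    """Yes/No answer: does the file still match the checksum recorded earlier?"""
    return file_checksum(path, algorithm) == recorded
```

The value recorded on the first date would be stored with the object's metadata and passed back in as `recorded` on the second date. The answer is strictly binary: a single changed bit gives "No", with no indication of how much of the file is affected.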
Is integrity maintained = 0-100%
If one chunk became corrupted, the hashes for the other chunks,
which hadn't changed, could be used to prove their integrity.

Piecewise hashing:
• Divides an input file into sections and checksums each chunk
separately.
• Intended to measure integrity of disk images (dcfldd).
• However, an insert or delete changes all subsequent hashes

Rolling hash:
• Computes a pseudo-random value at each position in the file
• Depends only on the last few bytes
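
A minimal sketch of piecewise hashing with fixed-size chunks, assuming SHA-256 and a 4096-byte chunk size (both arbitrary choices for illustration). The helper names are hypothetical and this is not how dcfldd itself is implemented.

```python
import hashlib


def piecewise_hashes(data: bytes, chunk_size: int = 4096) -> list[str]:
    """Checksum each fixed-size chunk of the input separately."""
    return [
        hashlib.sha256(data[start:start + chunk_size]).hexdigest()
        for start in range(0, len(data), chunk_size)
    ]


def percent_intact(recorded: list[str], current: list[str]) -> float:
    """Share of recorded chunks whose hash is unchanged at the same position."""
    if not recorded:
        return 100.0
    matches = sum(1 for old, new in zip(recorded, current) if old == new)
    return 100.0 * matches / len(recorded)
```

Overwriting bytes inside one chunk only changes that chunk's hash, so the result is a percentage rather than Yes/No. Inserting or deleting even one byte shifts every later chunk boundary, which is the weakness noted above and the motivation for rolling-hash approaches.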
Example of Piecewise hashing (1)
                   19e33h213a7865b2b664348b
                   ea3fe191227a4eg933bc41ge
                   2d839db2996b412e84h77a33
                   872e73ab867c883e7391ae65
Example of Piecewise hashing (2)
                   19e33h213a7865b2b664348b   SAME!
                   ea3fe191227a4eg933bc41ge   SAME!
                   a73921e173c94e8232fa91bb   DIFFERENT TEXT
                   7894af8211c12bb123ah9912   INCOMPLETE
Renderability
Data Interpretation in practice
OAIS Reference Model
NAA Performance Model
data + computer + OS + application = information content
Information Object
                      Information Properties
Some definitions:
  • Information Property/Description (IP):
     • A description of part of the information
       content (OAIS RM v2, 2009)
  • Property:
     • An abstract attribute, trait or peculiarity
       suitable for describing preservation
       objects, actions or environments
       (Dappert, 2009)

Observations:
  • No interpretation of significance –
    merely exists
  • May be held in different locations and
    different levels of detail
Information Property categories (1)

Rothenberg & Bikson (1999) identify five types of
Information Property:
  • Content: the author’s intellectual work, e.g. text, still image,
    audio waveform, etc.
  • Context: Information that affects the content’s intended
    meaning and establishes its provenance
  • Appearance: Information that contributes to the recreation of
    the performance, e.g. font type/colour/size, bit depth
  • Structure: Relationship between 2+ types of content, e.g.
    e-mail attachments, internal hyperlinks
  • Behaviour: Information that establishes how content interacts
    with the user, or other objects or components, e.g. hyperlink
    handling

                                    http://www.panix.com/~jeffr/Prof/digilong.html
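[Figure: example object annotated with the five property types: Context, Content, Structure, Appearance, Behaviour. Other labels in the figure: "Image & Text link" and "Content and Context?"]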
Information Property categories (2)
The PLANETS Digital Object Properties WP uses a different
classification, based upon the ability to identify:
• Extractable properties:
   • Properties that can be extracted from the object or calculated
     on the fly, e.g. file size, image dimensions, MD
• Observational properties:
   • Can only be determined by human observation, e.g.
     licence restriction(?)
• Performance properties:
   • Properties that emerge through combination of HW,
     SW & Data Object
Source: PLANETS Digital Object Properties WG
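
As an illustration of the first category, the sketch below extracts two properties without human involvement: file size, and pixel dimensions read from a PNG header. The choice of PNG and the function name `extract_properties` are assumptions made for this example; they are not part of the PLANETS work.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def extract_properties(data: bytes) -> dict[str, int]:
    """Extract file size and, for PNG data, pixel dimensions from the IHDR chunk."""
    properties = {"size_bytes": len(data)}
    # A PNG starts with an 8-byte signature, then the IHDR chunk:
    # 4-byte length, 4-byte type "IHDR", then width and height as big-endian uint32.
    if len(data) >= 24 and data[:8] == PNG_SIGNATURE and data[12:16] == b"IHDR":
        width, height = struct.unpack(">II", data[16:24])
        properties["width_px"] = width
        properties["height_px"] = height
    return properties
```

Observational and performance properties cannot be obtained this way: the first needs a person, the second needs the object to be rendered in a specific hardware and software environment.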
[Figure labels: Performance Property, Observational Property, Extractable information]
Preservation Metadata: Documenting
    the technical encoding and
         intellectual content
PREMIS
PREMIS DD 1.0 (May 2005)
PREMIS DD 2.0 (March 2008)

                              • "things that most working repositories are
                                likely to need to know in order to support
                                digital preservation"
                              • Core metadata that defines "viability,
                                renderability, understandability,
                                authenticity, and identity in a preservation
                                context"
                              What metadata assists with rendering?
                              •   Format
                              •   Size
                              •   Fixity
                              •   Creating Application: Name, version, date
                                  data was created
                              •   Inhibitors: Features intended to inhibit
                                  access, use, or migration.
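
A rough sketch of how the rendering-related items listed above might be held together in one record. The class and field names are illustrative only and are not PREMIS element names; SHA-256 for fixity is likewise an assumption.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class RenderingMetadata:
    """Illustrative record only; field names are NOT official PREMIS element names."""
    format_name: str
    size_bytes: int
    fixity_algorithm: str
    fixity_value: str
    creating_application: str
    creating_application_version: str
    date_created: str
    inhibitors: list[str] = field(default_factory=list)


def describe_object(data: bytes, format_name: str, application: str,
                    version: str, date_created: str,
                    inhibitors: Optional[list[str]] = None) -> RenderingMetadata:
    """Derive size and fixity from the bytes; everything else is supplied by the caller."""
    return RenderingMetadata(
        format_name=format_name,
        size_bytes=len(data),
        fixity_algorithm="SHA-256",
        fixity_value=hashlib.sha256(data).hexdigest(),
        creating_application=application,
        creating_application_version=version,
        date_created=date_created,
        inhibitors=list(inhibitors or []),
    )
```

Only size and fixity can be computed from the bitstream alone. Format, creating application and inhibitors have to come from identification tools or from the depositor.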
Technical Metadata for still images




                            http://www.flickr.com/photos/k4chii/200303113/

                      Standards: Z39.87, MIX
                      and others
                      Information on
                         •Image characteristics
                         •Encoding scheme
                         •Metadata
Document MD

    • Applicable to formats that are primarily text, allow choice of font,
      support embedded multimedia & page layouts

    • Example elements
       • Page Count
       • Word Count
       • Character Count
       • Paragraph Count
       • Line count
       • Table Count
       • Graphics Count
       • Language
       • Fonts (list of each font in document)
       • Features (additional document features, e.g. hasTransparency,
         hasOutline, hasAnnotation)
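
Some of these elements can be computed directly once the text is available. The sketch below covers four of them for plain text only, under the assumption that paragraphs are separated by blank lines; the function name is hypothetical. Page, table, graphics and font information depends on the document format and would need a format-aware tool.

```python
def document_counts(text: str) -> dict[str, int]:
    """Compute simple document metadata counts for plain text."""
    # Assumption: a paragraph is a run of text separated by a blank line.
    paragraphs = [part for part in text.split("\n\n") if part.strip()]
    return {
        "character_count": len(text),
        "word_count": len(text.split()),
        "line_count": len(text.splitlines()),
        "paragraph_count": len(paragraphs),
    }
```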
Third party services: Representation Information Registries
• Require trusted third party services capable of identifying
  formats
  • PRONOM, UDFR
• Providing information on rendering data
  • OpenWith, various RI services
Preserving your object across
   changing technologies
Change in process over time
[Figure: SOURCE + PROCESS = PERFORMANCE, shown for three environments:
 Intel PC, 2000; Mac laptop, 2006; X64 Ubuntu laptop, 2010.
 In each row: hardware + operating system + software application = information content]

Potential for changes to the ‘Performance’ over time
Change is a necessity… and a risk
“traditionally, preserving things meant keeping them unchanged; however
… if we hold on to digital information without modifications, accessing the
information will become increasingly more difficult, if not impossible.”
(Su-Shing Chen, 2001)

“The fundamental challenge of digital preservation is to preserve the
accessibility and authenticity of digital objects over time and domains, and
across changing technical environments” (Wilson, 2008)
Authenticity
Authenticity
“the degree to which a person
(or system) may regard an
object as what it is purported to
be”
(OAIS RM v2)


Questions:
• How do you distinguish the
authentic original from the
imitators?
• What is authenticity in the digital
realm?

Which is the real Elvis?
Img src: http://www.flickr.com/photos/mymollypop/2904798835/
http://www.flickr.com/photos/blahflowers/3827096787/
© 1973, Elvis Presley Enterprises, Inc. and RCA Records
http://en.wikipedia.org/wiki/File:ElvisPresleyAlohafromHawaii.jpg
What do we need to keep for an Information
              Object to be authentic?
“Understanding, defining and assessing the individual
properties… important.. for informing decisions about which
characteristics of that object should be preserved over time,
in circumstances where it is not possible, for reasons such as
cost, practicality or technical constraints, to preserve all the
elements of that object”
(Montague et al. The Concept of Significant Properties. 2010)

“Unless such properties can be defined in a rigorous and
measurable manner, cultural memory institutions have no
objective framework for identifying, implementing, and
validating appropriate preservation strategies, nor for
asserting the continued authenticity of their digital collections”
(Dappert, 2009)
Acceptable Vs Unacceptable change

• Easy to identify when preservation has gone wrong, but how do you
decide when it has gone right?
   • Interpretation is a value judgement – often influenced by different
     criteria
   • Uncertainty on level that evaluation should be performed – technical
     encoding, object type (e.g. still image), object sub-type (e.g. business
     document, research paper)
   • How do you measure attributes that are considered significant?
       • Technical properties may vary between formats
       • Observational properties require manual identification
Planning your strategy; strategising your plan

  • Preservation Plan:
    “defines a series of preservation actions to be taken
    by a responsible institution due to an identified risk
    for a given set of digital objects or records”
   http://www.dlib.org/dlib/november09/kulovits/11kulovits.html

  • Preservation strategy:
    indicates commitment to preservation and high-level
    approach adopted – organisational mission, applied
    principles (e.g. use lifecycle approach), sequence of
    actions (immediate, medium term, long-term), risk
    management
Why develop a preservation plan?
Assists decision-making process
            •   Evaluate different strategies
            •   Evaluate different tools
Determine which is the most effective approach for your needs
• Transparency of operation – enable others to view and
  understand approach adopted – inspire confidence and trust
• Provide evidence of decision-making – decisions may be
  questioned. How do you prove that approach taken was
  appropriate for circumstances?
Evaluation frameworks
Various approaches may be adopted to develop preservation plan:
•Produce internal decision tree
   • Fit intrinsic needs of organisation, but requires staff time to develop &
     may be limiting when considering new approaches
•Perform informal “bottom-up” object analysis & develop bespoke
plan
   • Fit requirements of object type, but may be time intensive to produce
     & may be incompatible with broader policies
•Adopt 3rd party standardised plan (aka copy and paste)
   • Adopting existing plan saves time, but may be inappropriate for
     context
•Use analysis frameworks and toolkits
   • Structured process by which organisation can identify objectives &
     develop plan to address them
      • DRAMBORA/DIRKS – analyse environment & practices, identify risks and
        brainstorm methods of mitigating or avoiding them
      • Data Asset Framework – identify data held, assess management practices & make
        recommendations for improvement
      • PLANETS Preservation Planning – define requirements, evaluate alternative
        approaches, analyse and compare results, recommend preferred approach, and
        develop plan
Preservation Planning workflow

•Developed as part of DELOS
project & adopted by PLANETS
Consortium
• Conforms to the ‘General COTS
(Commercial-Off-The-Shelf)
selection process’ (GCS)
•Abstract steps: Define criteria,
Search for products, Create
shortlist, Evaluate candidates,
Analyze data & Select product
•Uses utility analysis approach
PLANETS Planning workflow




        http://olymp.ifs.tuwien.ac.at:8080/plato/
Define Requirements:
              Factors to consider
•Identify & analyse environment in which
decisions are made (e.g. assumptions &
constraints) to determine context:
  • Organisational/dept objectives (e.g. mission
    statement, mandate)
  • National/local policy framework (e.g. acquisition,
    legal framework)
  • Codes of practice
  • Financial limitations – what can you afford?
  • Object types to be maintained
  • Expertise & needs of key stakeholders, e.g.
    Designated Community
Whose views do you need to take into
              account?
Digital archive perspective
  • General trend to simplify object to make it (speculatively) easier to
    manage in future:
     • Reduce cost of preservation process
     • Limit risk that accessibility/preservation issues will emerge
     • Increase number of preservation options available
Creator perspective
  • Author intent difficult to establish
  • Differs for each object – do you seek to treat each object individually
    or identify broad classes?
  • When do you ask them? On creation, after 5 years? May have
    different views on value.
User perspective
  • How do you analyse interpretation of current user community?
  • How do you predict needs of future users?
InSPECT Requirements Analysis
            Framework (2008)
• Adopted a design method used to assist engineers &
  designers to create & re-design artefacts
• Based upon theory that artefact construction is a product
  of designated function(s)
• Assessment based upon two philosophical approaches:
   1. Teleology: study of design and purpose of object – why was
      it created?
   2. Epistemology: Understand meaning and process by which
      knowledge is acquired
• In combination, these encourage evaluation of context of
  creation and information needed to communicate intrinsic
  knowledge to a new audience (designated community)
Requirements Analysis activities
Step 1: Object Analysis
Interpret context of creation:
1. Analyse object to find out what it contains
2. Identify original audience and functions that object was created to
      perform
3. Determine info. properties necessary to achieve each function

Step 2: Stakeholder Analysis
Determine future requirements of digital object
1. Identify Stakeholders that will use object
2. Determine function set they may perform when using object
3. Identify quality thresholds for each information property that must be
     met to allow each function to be achieved – what is acceptable loss?
Define Requirements:
       PLANETS Requirement Categories
•   Produce list of criteria that will be used to evaluate different preservation
    strategies in a specific domain
•   May take top-down (organisational) or bottom-up (object) approach
•   PLANETS identifies four groups of characteristics to be evaluated:

    1. Object: Attributes of information content itself, e.g. behaviour, context
    2. Record: Attributes of record including context, relationships & MD -
       potential overlap with Object in some cases
    3. Process: Attributes of preservation process, e.g. processing speed,
       usability of tool, ability to batch process, etc.
    4. Cost: Set-up of process, cost per object, H/W & S/W, personnel

•   Non-prescriptive - evaluator may identify further top-level & sub-
    categories or ignore existing criteria (e.g. technical characteristics for
    format evaluation)
•   May be expressed as spreadsheet, list, mind-map, post-it notes & other
    forms
Record requirements as Evaluation Tree

•Set of requirements may be
expressed as mind map,
spreadsheet, or other form
•Define structure of
evaluation process, grouping
similar items together
•Assign a measurement
value to each ‘leaf’
  • Objective measure: E.g.
    colour depth, duration
  • Subjective measure:
    Acceptable variance,
Define Requirements:
              Measure each criterion
•Assign a measurement value to each ‘leaf’
•Objective measures:
  • Unambiguous, automated (possibly), E.g. seconds to process
    object, colour depth, cost value
•Subjective measures:
  • Acceptable, but often require manual evaluation, e.g. degree
    of format support
•Type of scale
  • Numeric measure (e.g. 15 bit)
  • Boolean (Yes/No)
  • Controlled vocab (e.g. Yes/Acceptable/No)
  • Ordinal numbers (controlled list)
  • Subjective criteria (0-5)
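
One way to picture the scale types above is as small classes that each decide whether a recorded measurement is valid for a leaf. This is a sketch only: the class names, the `accepts` method and the example leaves are all hypothetical and are not taken from the PLANETS tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NumericScale:
    unit: str

    def accepts(self, value: object) -> bool:
        # bool is a subclass of int in Python, so exclude it explicitly.
        return isinstance(value, (int, float)) and not isinstance(value, bool)


@dataclass(frozen=True)
class BooleanScale:
    def accepts(self, value: object) -> bool:
        return isinstance(value, bool)


@dataclass(frozen=True)
class OrdinalScale:
    levels: tuple  # controlled list, ordered from worst to best

    def accepts(self, value: object) -> bool:
        return value in self.levels


@dataclass(frozen=True)
class SubjectiveScale:
    lowest: int = 0
    highest: int = 5

    def accepts(self, value: object) -> bool:
        is_int = isinstance(value, int) and not isinstance(value, bool)
        return is_int and self.lowest <= value <= self.highest


# Hypothetical leaves of an objective tree, each bound to exactly one scale.
leaves = {
    "colour depth": NumericScale(unit="bit"),
    "batch processing supported": BooleanScale(),
    "format support": OrdinalScale(levels=("No", "Acceptable", "Yes")),
    "usability of tool": SubjectiveScale(),
}
```

Binding each leaf to a scale before any experiment is run makes it harder to record a measurement that cannot later be compared across alternatives.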
Objective tree for web sites
Define Alternatives

• On basis of object type and expressed
  requirements, what strategies are feasible?
• Many different approaches available, e.g. TIFF
  images could undergo following actions:
   •   Format conversion to JPG2k
   •   Format conversion to PNG (to save space)
   •   Format conversion to PDF (though would not recommend)
   •   Emulation/virtual machine
   •   Do nothing!
• For each alternative strategy, may wish to define:
   • Tool to be tested (e.g. name, version, OS)
   • Configuration parameters
   • Function to be tested
Trial the preservation approaches

Develop a set of experiments to trial the
 preservation approach
     • Define workflow
     • Select representative test files
     • Perform evaluation
     • Evaluate the outcome according to
       your objective tree
         • Were there undesired/unexpected
           results?
PLATO conversion tool/format comparison




 Definition of alternative approaches to preserve GIF image (conversion to alt.
     formats) and identification of tool services available to perform action
Compare results
Require common basis for comparing different strategies
Normalise disparate results
    • Each evaluation factor is measured differently (Y/N, cost, speed
      of conversion)
    • Can make them comparable by converting them to a uniform
      scale
Set Important Factors
    • Not all assessment criteria are equal – do you wish to prioritise
      specific reqs. (e.g. scalability, cost)?
Compare outcomes & select most appropriate
   preservation strategy
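
The three steps above can be sketched as follows, assuming a uniform 0-5 scale, linear normalisation and a weighted sum for aggregation. These are illustrative choices, and the function names and criteria are hypothetical; the planning tool itself may transform and aggregate differently.

```python
def normalise_boolean(value: bool) -> float:
    """Map Yes/No onto the uniform 0-5 scale."""
    return 5.0 if value else 0.0


def normalise_lower_is_better(value: float, best: float, worst: float) -> float:
    """Map a cost-like measurement linearly onto 0-5, clamped to [best, worst]."""
    clamped = min(max(value, best), worst)
    return 5.0 * (worst - clamped) / (worst - best)


def weighted_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of normalised scores; weights express importance."""
    total_weight = sum(weights.values())
    return sum(scores[name] * weights[name] for name in weights) / total_weight


def rank(alternatives: dict[str, dict[str, float]], weights: dict[str, float]) -> list[str]:
    """Return alternative names ordered from highest to lowest weighted score."""
    return sorted(alternatives, key=lambda name: weighted_score(alternatives[name], weights), reverse=True)
```

With hypothetical criteria such as "content preserved" (Yes/No), "seconds per object" and "cost per object", each alternative (e.g. format conversion or doing nothing) gets one normalised score per criterion, and `rank` orders the alternatives. Changing the weights and re-running shows how sensitive the recommendation is to the priorities that were set.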
Conclusions

Preservation is an iterative process – must climb many
  steps to reach the top of the pyramid
Preservation Planning enables an organisation to
  understand and document its requirements
Demonstrating decision making inspires confidence &
  trust
Not a "perform once, then forget" process. Must be repeated
Discussion points
• Are traditional checksum techniques acceptable
  for measuring integrity, or do we need a more
  granular approach?

• How should we utilise & build upon third party
  services, such as RI Registries & preservation
  plan tools, to achieve our preservation
  objectives?

• What would a preservation plan for our scanned
  images, documents, metadata look like?
Thank You for your attention




          QUESTIONS?

            Gareth Knight
       gareth.knight@kcl.ac.uk

Preservation Planning: Choosing a suitable digital preservation strategy

  • 1. P res erva tio n P la nning : Choosing a suitable preservation approach Long-term Archiving P erspectives of E uropean Union P ublications meeting Office for Official Publications of the European Communities Luxembourg, November 10-11, 2011 Gareth Knight Centre for e-Research
  • 2. Preservation Objectives Authentic - it is what it Understandability – what does purports to be this information mean? Content preservation Bitstream preservation Priscilla Caplan's revised Preservation Pyramid
  • 3. Identity • The exact sameness of things. • Leibniz's law indicates that 2 items that share common attributes are not only similar, but are the same thing • Can two things be the same? “ultimately nothing is the s ame as something else” (Paskin, 2003) A painting of Leibniz Questions: • Both images are a pictorial representation of Leibniz • Image A is constructed using paint on a canvas • Image B is constructed as 0s and 1s • Do they share the same identity? • Is it necessary for all object attribute to be same, or is it acceptable to have some degree of granularity? • How much is identity based upon ability to measure attributes? Scanned copy of painting
  • 5. Is integrity maintained = Yes/No • Linked to notions of consistency, wholeness and truth • There has not been deliberate or accidental damage/change that has caused meaning to be altered or lost, in part or entirety. • Checksum algorithm applied to a file generates a distinct (possibly unique) alphanumeric value • Commonly used to check for accidental/deliberate data change/corruption • Generate checksum on October 1st • Generate checksum on October 14th & compare to Oct 1st value – are they the same? Y E S /N O
  • 6. Is Integrity maintained = 0- 100% If one chunk became corrupted, the hashes for other chunks, which hadn't changed, could be used to prove its integrity. P iec ew is e ha s hing : •divides an input file into sections and checksums each chunk separately. •Intended to measure integrity of disk images (dcfldd). • However, Insert or delete changes all subsequent hashes •R o lling ha s h: Looks at each point of file in semi-random order Depends only on last few bytes
  • 7. Example of Piecewise hashing (1) 19e33h213a7865b2b664348b ea3fe191227a4eg933bc41ge 2d839db2996b412e84h77a33 872e73ab867c883e7391ae65
  • 8. Example of Piecewise hashing (2) 19e33h213a7865b2b664348b SAME! ea3fe191227a4eg933bc41ge SAME! a73921e173c94e8232fa91bb DIFFERENT TEXT 7894af8211c12bb123ah9912 INCOMPLETE
  • 10. Data Interpretation in practice OAIS Reference Model NAA Performance Model = + + + data computer OS application information content
  • 11. Information Object Information Properties Some definitions: • Information P roperty/ D escription: IP • A description of part of the information content (OAIS RM v2, 2009) • P roperty: • An abstract attribute, trait or peculiarity suitable for describing preservation objects, actions or environments (Dappert, 2009) Observations: • No interpretation of significance – merely exists • May be held in different locations and different levels of detail
  • 12. Information Property categories (1) Rothenberg & Bikson (1999) identify five types of Information Property: • C ontent: the author’s intellectual work, e.g. text, still image, audio waveform, etc. • C ontext: Information that affects the content’s intended meaning and establishes its provenance • Appearance: Information that contributes to the recreation of the performance, e.g. font type/colour/size, bit depth • S tructure: Relationship between 2+ types of content, e.g. e- mail attachments, internal hyperlinks • Behaviour: information that establishes how content interacts with the user, or other objects or components, e.g. hyperlink handling http://www.panix.com/~jeffr/Prof/digilong.html
  • 13. [Figure: example object annotated with the five property categories: Content, Context, Structure, Appearance, Behaviour. Annotations: "Image & Text link"; "Content and Context?"]
  • 14. Information Property categories (2) PLANETS Digital Object Properties WP uses a different classification, based upon ability to identify: • Extractable properties: • Properties that can be extracted from or calculated on the fly, e.g. file size, image dimensions, MD • Observational properties: • Can only be determined by human observation, e.g. licence restriction(?) • Performance properties: • Properties that emerge through combination of HW, SW & Data Object Source: PLANETS Digital Object Properties WG
  • 15. [Figure labels: Performance Property; Observational Property; Extractable information]
  • 16. Preservation Metadata: Documenting the technical encoding and intellectual content
  • 17. PREMIS • "things that most working repositories are likely to need to know in order to support digital preservation" • Core metadata that defines "viability, renderability, understandability, authenticity, and identity in a preservation context" • PREMIS DD 1.0 (May 2005); PREMIS DD 2.0 (March 2008) What metadata assists with rendering? • Format • Size • Fixity • Creating Application: name, version, date data was created • Inhibitors: features intended to inhibit access, use, or migration.
  • 18. Technical Metadata for still images http://www.flickr.com/photos/k4chii/200303113/ Standards: Z39.87, MIX and others Information on •Image characteristics •Encoding scheme •Metadata
  • 19. Document MD • Applicable to formats that are primarily text, allow choice of font, support embedded multimedia & page layouts • Example elements: • Page Count • Word Count • Character Count • Paragraph Count • Line Count • Table Count • Graphics Count • Language • Fonts (list of each font in document) • Features (additional document features, e.g. hasTransparency, hasOutline, hasAnnotation)
  • 20. Third party services: Representation Information Registries •Require trusted third party services capable of identifying formats • PRONOM, UDFR •Providing information on rendering data • OpenWith, various RI services
  • 21. Preserving your object across changing technologies
  • 22. Change in process over time [Figure: for each environment, hardware + operating system + software application = information content; column headings: SOURCE, PROCESS, PERFORMANCE; rows: Intel PC, 2000; Mac laptop, 2006; x64 Ubuntu laptop, 2010] Potential for change to ‘Performance’ over time
  • 23. Change is a necessity… and a risk “traditionally, preserving things meant keeping them unchanged; however … if we hold on to digital information without modifications, accessing the information will become increasingly more difficult, if not impossible.” (Su-Shing Chen, 2001) “The fundamental challenge of digital preservation is to preserve the accessibility and authenticity of digital objects over time and domains, and across changing technical environments” (Wilson, 2008)
  • 25. Authenticity “the degree to which a person (or system) may regard an object as what it is purported to be” (OAIS RM v2) Questions: •How do you distinguish the authentic original from the imitators? •What is authenticity in the digital realm? Which is the real Elvis? Img src: http://www.flickr.com/photos/mymollypop/2904798835/ http://www.flickr.com/photos/blahflowers/3827096787/ © 1973, Elvis Presley Enterprises, Inc. and RCA Records http://en.wikipedia.org/wiki/File:ElvisPresleyAlohafromHawaii.jpg
  • 26. What do we need to keep for information Object to be authentic? “Understanding, defining and assessing the individual properties… important.. for informing decisions about which characteristics of that object should be preserved over time, in circumstances where it is not possible, for reasons such as cost, practicality or technical constraints, to preserve all the elements of that object” (Montague et al. The Concept of Significant Properties. 2010) “Unless such properties can be defined in a rigorous and measurable manner, cultural memory institutions have no objective framework for identifying, implementing, and validating appropriate preservation strategies, nor for asserting the continued authenticity of their digital collections” (Dappert, 2009)
  • 27. Acceptable vs Unacceptable change • Easy to identify when preservation has gone wrong, but how do you decide when it has gone right? • Interpretation is a value judgement – often influenced by different criteria • Uncertainty over the level at which evaluation should be performed – technical encoding, object type (e.g. still image), object sub-type (e.g. business document, research paper) • How do you measure attributes that are considered significant? • Technical properties may vary between formats • Observational properties require manual identification
  • 28. Planning your strategy; strategising your plan • Preservation Plan: “defines a series of preservation actions to be taken by a responsible institution due to an identified risk for a given set of digital objects or records” http://www.dlib.org/dlib/november09/kulovits/11kulovits.html • Preservation strategy: indicates commitment to preservation and the high-level approach adopted – organisational mission, applied principles (e.g. use lifecycle approach), sequence of actions (immediate, medium term, long-term), risk management
  • 29. Why develop a preservation plan? Assists decision-making process • Evaluate different strategies • Evaluate different tools Determine which is the most effective approach for your needs • Transparency of operation – enable others to view and understand approach adopted – inspire confidence and trust • Provide evidence of decision-making – decisions may be questioned. How do you prove that approach taken was appropriate for circumstances?
  • 30. Evaluation frameworks Various approaches may be adopted to develop preservation plan: •Produce internal decision tree • Fit intrinsic needs of organisation, but requires staff time to develop & may be limiting when considering new approaches •Perform informal “bottom-up” object analysis & develop bespoke plan • Fit requirements of object type, but may be time intensive to produce & may be incompatible with broader policies •Adopt 3rd party standardised plan (aka copy and paste) • Adopting existing plan saves time, but may be inappropriate for context •Use analysis frameworks and toolkits • Structured process by which organisation can identify objectives & develop plan to address them • DRAMBORA/DIRKS – analyse environment & practices, identify risks and brainstorm methods of mitigating or avoiding them • Data Asset Framework – identify data held, assess management practices & make recommendations for improvement • PLANETS Preservation Planning –define requirements, evaluate alternative approaches, analyse and compare results, recommend preferred approach, and develop plan
  • 31. Preservation Planning workflow • Developed as part of DELOS project & adopted by PLANETS Consortium • Conforms to the ‘General COTS (Commercial-Off-The-Shelf) selection process’ (GCS) • Abstract steps: Define criteria, Search for products, Create shortlist, Evaluate candidates, Analyze data & Select product • Uses utility analysis approach
  • 32. PLANETS Planning workflow http://olymp.ifs.tuwien.ac.at:8080/plato/
  • 33. Define Requirements: Factors to consider •Identify & analyse environment in which decisions are made (e.g. assumptions & constraints) to determine context: • Organisational/dept objectives (e.g. mission statement, mandate) • National/local policy framework (e.g. acquisition, legal framework) • Codes of practice • Financial limitations – what can you afford? • Object types to be maintained • Expertise & needs of key stakeholders, e.g. Designated Community
  • 34. Whose views do you need to take into account? Digital archive perspective • General trend to simplify object to make it (speculatively) easier to manage in future: • Reduce cost of preservation process • Limit risk that accessibility/preservation issues will emerge • Increase number of preservation options available Creator perspective • Author intent difficult to establish • Differs for each object – do you seek to treat each object individually or identify broad classes? • When do you ask them? On creation, after 5 years? May have different views on value. User perspective • How do you analyse interpretation of current user community? • How do you predict needs of future users?
  • 35. InSPECT Requirements Analysis Framework (2008) • Adopted a design method used to assist engineers & designers to create & re-design artefacts • Based upon theory that artefact construction is a product of designated function(s) • Assessment based upon two philosophical approaches: 1. Teleology: study of design and purpose of object – why was it created? 2. Epistemology: understand meaning and process by which knowledge is acquired • In combination, these encourage evaluation of context of creation and information needed to communicate intrinsic knowledge to a new audience (designated community)
  • 36. Requirements Analysis activities Step 1: Object Analysis Interpret context of creation: 1. Analyse object to find out what it contains 2. Identify original audience and functions that object was created to perform 3. Determine info. properties necessary to achieve each function Step 2: Stakeholder Analysis Determine future requirements of digital object: 1. Identify stakeholders that will use object 2. Determine function set they may perform when using object 3. Identify quality thresholds for each information property that must be met to allow each function to be achieved – what is acceptable loss?
  • 37. Define Requirements: PLANETS Requirement Categories • Produce list of criteria that will be used to evaluate diff. preservation strategies in specific domain • May take top-down (organisational) or bottom-up (object) approach • PLANETS identify four groups of characteristic to be evaluated: 1. Object: Attributes of information content itself, e.g. behaviour, context 2. Record: Attributes of record including context, relationships & MD - potential overlap with Obj in some cases 3. Process: Attributes of preservation process, e.g. processing speed, usability of tool, ability to batch process, etc. 4. Cost: Set-up of process, cost per object, H/W & S/W, personnel • Non-prescriptive - evaluator may identify further top-level & sub-categories or ignore existing criteria (e.g. technical characteristics for format evaluation) • May be expressed as spreadsheet, list, mind-map, post-it notes & other forms
  • 38. Record requirements as Evaluation Tree •Set of requirements may be expressed as mind map, spreadsheet, or other form •Define structure of evaluation process, grouping similar items together •Assign a measurement value to each ‘leaf’ • Objective measure: E.g. colour depth, duration • Subjective measure: Acceptable variance,
  • 39. Define Requirements: Measure each criterion •Assign a measurement value to each ‘leaf’ •Objective measures: • Unambiguous, automated (possibly), E.g. seconds to process object, colour depth, cost value •Subjective measures: • Acceptable, but often require manual evaluation, e.g. degree of format support •Type of scale • Numeric measure (e.g. 15 bit) • Boolean (Yes/No) • Controlled vocab (e.g. Yes/Acceptable/No) • Ordinal numbers (controlled list) • Subjective criteria (0-5)
  • 40. Objective tree for web sites
  • 41. Define Alternatives • On basis of object type and expressed requirements, what strategies are feasible? • Many different approaches available, e.g. TIFF images could undergo following actions: • Format conversion to JPG2k • Format conversion to PNG (to save space) • Format conversion to PDF (though would not recommend) • Emulation/virtual machine • Do nothing! • For each alternative strategy, may wish to define: • Tool to be tested (e.g. name, version, OS) • Configuration parameters • Function to be tested
  • 42. Trial the preservation approaches Develop a set of experiments to trial the preservation approach • Define workflow • Select representative test files • Perform evaluation • Evaluate the outcome according to your objective tree • Were there undesired/unexpected results?
  • 43. PLATO conversion tool/format comparison Definition of alternative approaches to preserve GIF image (conversion to alt. formats) and identification of tool services available to perform action
  • 44. Compare results Require common basis for comparing different strategies Normalise disparate results • Each evaluation factor is measured differently (Y/N, cost, speed of conversion) • Can make them comparable by converting them to a uniform scale Set Important Factors • Not all assessment criteria are equal – do you wish to prioritise specific reqs. (e.g. scalability, cost)? Compare outcomes & select most appropriate preservation strategy
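The comparison step on this slide can be sketched as a small utility analysis. Everything specific here is an assumption made for illustration: the 0-5 target scale, the function names, the criteria, the weights and the measured values are invented and do not come from PLATO or from any real evaluation.

```python
def normalise_boolean(value: bool) -> float:
    """Map a Yes/No measurement onto the assumed 0-5 scale."""
    return 5.0 if value else 0.0


def normalise_numeric(value: float, worst: float, best: float) -> float:
    """Map a numeric measurement linearly so that worst -> 0 and best -> 5."""
    if best == worst:
        raise ValueError("best and worst must differ")
    score = 5.0 * (value - worst) / (best - worst)
    return max(0.0, min(5.0, score))  # clamp values outside the range


def weighted_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Combine normalised scores using importance weights."""
    total_weight = sum(weights.values())
    return sum(scores[name] * weights[name] for name in weights) / total_weight


# Hypothetical importance factors for three criteria.
weights = {"format_supported": 0.5, "seconds_per_object": 0.2, "cost_per_object": 0.3}

# Hypothetical measurements for two alternative strategies.
convert_to_png = {
    "format_supported": normalise_boolean(True),
    "seconds_per_object": normalise_numeric(2.0, worst=10.0, best=0.0),
    "cost_per_object": normalise_numeric(0.05, worst=0.10, best=0.0),
}
do_nothing = {
    "format_supported": normalise_boolean(False),
    "seconds_per_object": normalise_numeric(0.0, worst=10.0, best=0.0),
    "cost_per_object": normalise_numeric(0.0, worst=0.10, best=0.0),
}

ranking = sorted(
    {"convert to PNG": weighted_score(convert_to_png, weights),
     "do nothing": weighted_score(do_nothing, weights)}.items(),
    key=lambda item: item[1],
    reverse=True,
)
```

Changing the weights changes the ranking, which is why the importance factors should be agreed and documented alongside the results.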
  • 45. Conclusions Preservation is an iterative process – must climb many steps to reach the top of the pyramid Preservation Planning enables organisation to understand and document their requirements Demonstrate decision making – inspires confidence & trust Not a perform once, forget process. Must be repeated
  • 46. Discussion points • Are traditional checksum techniques acceptable for measuring integrity, or do we need a more granular approach? • How should we utilise & build upon third party services, such as RI Registries & preservation plan tools, to achieve our preservation objectives? • What would a preservation plan for our scanned images, documents, metadata look like?
  • 47. Thank You for your attention QUESTIONS? Gareth Knight gareth.knight@kcl.ac.uk