SCAPE


Building Digital Preservation Infrastructure
Dr. Ross King
AIT Austrian Institute of Technology GmbH

eSciDoc Days
Berlin, October 27, 2011

Digital Preservation
• For the first time, the rate of
  increase of information creation is
  beginning to exceed the rate of
  increase in storage capacity.

• This massive volume of digital
  material raises a number of issues:
    • What is worth preserving?
    • How to preserve so much?
    • How to access preserved data?
    • How to create incentives to preserve?

 http://arstechnica.com/business/consumerization-of-it/2011/09/information-explosion-how-rapidly-expanding-storage-spurs-innovation.ars




07.11.2011
Digital Preservation
• Standards, best practices, and technologies used to
  ensure access to digital information over time

• How long?

  “Digital documents last forever – or five years,
   whichever comes first.”
       http://www.clir.org/pubs/reports/rothenberg/introduction.html


• Generally we mean decades or centuries

SCAPE – what is it about?

• Planning and managing computing-intensive (digital)
  preservation processes such as the large-scale
  ingestion or migration of large (multi-Terabyte)
  data sets

  SCAPE is a follow-up to the highly successful FP6 Integrated Project (IP) Planets.

SCAPE Project Data
• Project instrument: FP7 Integrated Project
• Call 6
   • Objective ICT-2009.4.1:
     Digital Libraries and Digital Preservation
   • Target outcome (a) Scalable systems and services for
     preserving digital content
• Duration: 42 months
   • February 2011 – July 2014
• Budget: 11.3 Million Euro
   • Funded: 8.6 Million Euro

SCAPE Consortium

No.  Partner name                                             Short name  Country
 1   AIT Austrian Institute of Technology GmbH (coordinator)  AIT         AT
 2   British Library                                          BL          UK
 3   Internet Memory Foundation                               IMF         NL
 4   Ex Libris Ltd                                            EXL         IL
 5   Fachinformationszentrum Karlsruhe                        FIZ         DE
 6   Koninklijke Bibliotheek                                  KB          NL
 7   KEEP Solutions                                           KEEPS       PT
 8   Microsoft Research                                       MSR         UK
 9   Österreichische Nationalbibliothek                       ONB         AT
10   Open Planets Foundation                                  OPF         UK
11   Statsbiblioteket Aarhus                                  SB          DK
12   Science and Technology Facilities Council                STFC        UK
13   Technische Universität Berlin                            TUB         DE
14   Technische Universität Wien                              TUW         AT
15   University of Manchester                                 UNIMAN      UK
16   Pierre & Marie Curie Université Paris 6                  UPMC        FR

SCAPE Project Overview

SCAPE will enhance the state of the art in digital preservation in three ways:
• Infrastructure and tools for scalable preservation actions
• A framework for automated, quality-assured preservation workflows
• Integration of these components with policy-based automated
  preservation planning and watch

SCAPE results will be validated in three large-scale testbeds:
• Digital Repositories
• Web Content
• Research Data Sets

The SCAPE Consortium brings together a broad spectrum of expertise from
• Memory institutions
• Data centres
• Research labs
• Universities
• Industrial firms

[Figure: SCAPE work areas]
• Takeup: Stakeholders, Communities, Dissemination, Training Activities, Sustainability
• Testbeds: Corpora, Integration, Benchmarking, Validation
• Platform: Automation, Workflows, Parallelization, Virtualization
• Planning and Watch: Institutional Policies, Technical Watch, Automated Planning
• Preservation Components: Quality Assurance, Scalable Components, Automation-ready Tools
• Cross-project Activities: Project Management, Technical Coordination, Research Roadmap

Selected SCAPE Testbed Scenarios
• Characterise large video files
   •   The master MPEG2 files are so large that it is difficult to apply JHOVE and
       insufficient detail is provided. A detailed characterisation of the MPEG2 streams
       is needed in order to identify technical dependencies for extracting from or
       rendering the MPEG2 stream. This would enable preservation risks related to
       current access services to be monitored and action taken as necessary to ensure
       continued access and preservation.

• Carry out large scale migrations
   •   Migrating from one format to another introduces the possibility of damaging the
       content or failing to capture significant properties of the original in the resulting
       destination format.
   •   Specific requirements include:
         • Solution tools that operate reliably at scale (80 TB, 2 million pages)
         • Automated QA, ideally with no manual intervention on a file by file basis
         • QA performed by a process independent of the migration process
         • QA demonstrates strong evidence of significant properties being captured
              in the destination format

• Quality assurance in web harvesting
   •   For large scale crawls, automation of the quality control processes is a necessary
       requirement. Currently, this process relies on random sampling and very basic
       quantitative checks.

[Illustration from digitalbevaring.dk]
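
The migration requirements on this slide (QA that runs separately from the migration and gives evidence that significant properties were captured) could be sketched roughly as below. This is only an illustration, not a SCAPE component: the property names (`page_count`, `width_px`, `dpi`), the tolerance table, and the functions `compare_significant_properties` and `migration_ok` are all hypothetical, and the property values are assumed to have been extracted beforehand by some independent characterisation tool.

```python
from dataclasses import dataclass


@dataclass
class PropertyCheck:
    """Outcome of comparing one significant property (hypothetical structure)."""
    name: str
    source: object
    migrated: object
    passed: bool


def compare_significant_properties(source, migrated, tolerances=None):
    """Compare properties of the original against those of the migrated file.

    `source` and `migrated` are plain dicts of property name -> value.
    `tolerances` optionally maps a numeric property name to an allowed
    absolute difference; all other properties must match exactly.
    """
    tolerances = tolerances or {}
    results = []
    for name, src_value in source.items():
        dst_value = migrated.get(name)  # None if the property was lost
        tol = tolerances.get(name)
        numeric = isinstance(src_value, (int, float)) and isinstance(dst_value, (int, float))
        if tol is not None and numeric:
            passed = abs(src_value - dst_value) <= tol
        else:
            passed = src_value == dst_value
        results.append(PropertyCheck(name, src_value, dst_value, passed))
    return results


def migration_ok(results):
    """A migration passes only if every significant property check passed."""
    return all(check.passed for check in results)


if __name__ == "__main__":
    original = {"page_count": 12, "width_px": 2480, "dpi": 300.0}
    converted = {"page_count": 12, "width_px": 2480, "dpi": 299.9}
    for check in compare_significant_properties(original, converted, {"dpi": 0.5}):
        print(check)
```

Keeping the comparison separate from the code that performs the migration mirrors the independence requirement: the check only sees two sets of extracted properties and has no knowledge of how the destination file was produced.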

Selected SCAPE Challenges
• Bridging the gap between test workflows and
  scalable workflows
• Applying Map/Reduce to binary data
• Locality of data
    • Bring the data to the computation, or
      bring the computation to the data?
• Repository Integration
    • Repository Consistency
    • Scalable Ingest
• Preservation Planning
    • How to scale?
    • How to automate?
• Research data sets
    • How to preserve contextual information?

[Illustration from digitalbevaring.dk]

SCAPE Solutions

• SCAPE Platform
  • Hadoop, Stratosphere
  • Virtualized cluster
  • Repository integration
     • HBase, HDFS - Fedora
  • Three levels of parallelization
     • Distribution of files
     • Splitting binary files
     • Parallelisation of algorithms
  • Mapping Taverna to Hadoop

[Illustration from digitalbevaring.dk]
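
The "splitting binary files" level of parallelization on this slide can be illustrated with a small map/reduce-style sketch in plain Python. It is an assumption-laden toy, not the SCAPE platform and not Hadoop code: the helper names (`split_bytes`, `map_chunk`, `byte_histogram`) are hypothetical, the data lives in memory, and a thread pool stands in for cluster nodes. It also assumes the computation (a byte histogram) is indifferent to where the file is cut, which is not true of many real formats whose internal structure would have to be respected when splitting.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def split_bytes(data: bytes, chunk_size: int) -> list[bytes]:
    """Cut a binary blob into fixed-size chunks (the last one may be shorter)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def map_chunk(chunk: bytes) -> Counter:
    """Map step: count how often each byte value occurs in one chunk."""
    return Counter(chunk)


def reduce_counts(left: Counter, right: Counter) -> Counter:
    """Reduce step: merge two partial histograms."""
    return left + right


def byte_histogram(data: bytes, chunk_size: int = 4) -> Counter:
    """Compute a byte histogram by mapping over chunks and reducing the results."""
    chunks = split_bytes(data, chunk_size)
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(map_chunk, chunks))
    return reduce(reduce_counts, partials, Counter())


if __name__ == "__main__":
    print(byte_histogram(b"aabbbc", chunk_size=4))
```

The other two levels map onto the same pattern: distributing whole files corresponds to running `map_chunk` once per file rather than per chunk, while parallelising an algorithm means restructuring the work inside the map step itself.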

SCAPE Solutions

• Automated Planning and Watch
  • Building on the Planets PLATO tool
  • Automated watch based on
     • Results Evaluation Framework (REF) database
     • Monitoring trends in web harvests
  • Automated planning based on semantically
    formalized policies
• Automated Quality Assurance
  • QA in web harvesting through automated comparison of
    rendered pages – combined structural and image analysis
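
As a rough illustration of the image-analysis half of comparing rendered pages, the sketch below measures how many pixels differ between two renderings. It is a deliberately naive stand-in and not the method used in SCAPE, which the slide does not specify: images are assumed to be equally sized grids of grayscale values (0 to 255) held as nested lists, and the function name `pixel_difference_ratio` and the `threshold` parameter are hypothetical.

```python
def pixel_difference_ratio(img_a, img_b, threshold=16):
    """Return the fraction of pixels whose grayscale values differ noticeably.

    `img_a` and `img_b` are lists of rows, each row a list of ints in 0..255.
    A pixel counts as different when the absolute difference exceeds
    `threshold`, so small rendering noise is ignored.
    """
    same_shape = len(img_a) == len(img_b) and all(
        len(row_a) == len(row_b) for row_a, row_b in zip(img_a, img_b)
    )
    if not same_shape:
        raise ValueError("images must have identical dimensions")

    total = sum(len(row) for row in img_a)
    if total == 0:
        return 0.0

    differing = sum(
        1
        for row_a, row_b in zip(img_a, img_b)
        for pixel_a, pixel_b in zip(row_a, row_b)
        if abs(pixel_a - pixel_b) > threshold
    )
    return differing / total


if __name__ == "__main__":
    live_page = [[0, 0], [255, 255]]
    archived_page = [[0, 10], [255, 0]]
    # One of four pixels differs by more than the threshold.
    print(pixel_difference_ratio(live_page, archived_page))
```

A harvest QA step could flag pages whose ratio exceeds some limit for manual review, replacing purely random sampling with targeted sampling; a structural comparison of the page markup would complement this check.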

SCAPE Achievements
• Public Website
    • http://www.scape-project.eu/
• Development Infrastructure
    • Hosted by the Open Planets Foundation and GitHub
    • Development Wiki
        • http://wiki.opf-labs.org/display/SP/Home
• Deliverables
    • First Deliverables available for download
• Publications
    • 13 in the first nine months, including 6 at iPres next week
    • Report: comparative analysis of identification tools
• Platform
    • 10-node, 20 TB experimental cluster hosted by AIT

SCAPE Contact Information

• http://www.scape-project.eu/

• office@list.scape-project.eu

• Dr. Ross King
  AIT Austrian Institute of Technology GmbH
  Donau-City-Strasse 1
  A-1220 Wien





Thank you for your attention!





