Dr. Ross King
AIT Austrian Institute of Technology GmbH
Preservation at Scale Workshop
Lisbon, September 5, 2013
SCAPE
Tools and Infrastructure for Preservation at Scale
Outline
• SCAPE Project
• SCAPE Solutions
• Scalable Planning
• Scalable Tools
• Scalable Computation
• Scalable Repositories
• SCAPE Testbeds
• SCAPE Additional Information
• Online Resources
• Training Events
• Contact Information
This work was partially supported by the SCAPE Project.
The SCAPE project is co‐funded by the European Union under FP7 ICT‐2009.4.1 (Grant Agreement number 270137).
SCAPE – what is it about?
• Planning and executing computing-intensive digital preservation
processes such as the large-scale ingestion, characterisation or
migration of large (multi-Terabyte) and complex data sets
• SCAPE results include
• Preservation scenarios
• Preservation tools
• Preservation workflows
• Preservation infrastructure
• Preservation best practices
SCAPE is a follow-up to the highly successful FP6 IP Planets.
SCAPE Project Data
• Project instrument: FP7 Collaborative Project
• 6th Call
• Objective ICT-2009.4.1: Digital Libraries and Digital
Preservation
• Target outcome (a) Scalable systems and services for
preserving digital content
• 10th Call
• Objective ICT-2013.11.4: Supplements to Strengthen
Cooperation in ICT R&D in an Enlarged European Union
• Duration: 44 months (originally 42)
• February 2011 – September 2014 (originally July 2014)
• Budget: 12.0 Million Euro (originally 11.3)
• Funded: 9.2 Million Euro (originally 8.6)
SCAPE Consortium
SCAPE Solutions
Scalable Planning and Watch
• SCOUT: an automated preservation watch system
• Enables planning tools and decision makers to monitor the world and the organisation
• Collects relevant knowledge and enables automated notification
• Open and extensible
• c3po: scalable content profiling
• c3po analyses characterisation data based on FITS
• Scale-out MongoDB (100k/min/node)
• Visual drill-down and well-documented profile
• Automated sample selection
• PLATO 4.1: scalable preservation planning
• www.ifs.tuwien.ac.at/dp/plato
• Technology upgrade - refactored, rebuilt, standardised, tested
• New features
• Groups allow collaborative planning
• Integration of control policies for groups
• Quality domain – measures
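To make the content-profiling idea above more concrete, here is a minimal Python sketch of what a collection profile and an automated sample selection might compute. It is not c3po's implementation: the record fields (`id`, `format`, `size`) and the function names are hypothetical, and the records are held in memory rather than in MongoDB.

```python
import random
from collections import Counter, defaultdict


def build_profile(records):
    """Aggregate characterisation records into a simple collection profile."""
    formats = Counter(r["format"] for r in records)
    return {
        "count": len(records),
        "total_bytes": sum(r["size"] for r in records),
        "formats": dict(formats),
    }


def select_samples(records, per_format=1, seed=0):
    """Pick up to `per_format` records for each format as representatives."""
    by_format = defaultdict(list)
    for r in records:
        by_format[r["format"]].append(r)
    rng = random.Random(seed)  # fixed seed keeps the selection repeatable
    return {
        fmt: rng.sample(items, min(per_format, len(items)))
        for fmt, items in by_format.items()
    }


if __name__ == "__main__":
    # Hypothetical characterisation output for three files.
    demo = [
        {"id": "a.tif", "format": "image/tiff", "size": 4000},
        {"id": "b.tif", "format": "image/tiff", "size": 6000},
        {"id": "c.pdf", "format": "application/pdf", "size": 1500},
    ]
    print(build_profile(demo))
    print(select_samples(demo))
```

A profile of this kind gives a planner the format distribution at a glance, and the per-format samples are candidates for trying out preservation actions before running them at scale.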
Scalable Tools
• Tool Wrapper
• Application that adapts existing tools to the SCAPE Platform
• https://github.com/openplanets/scape-toolwrapper
• Enhances wrapped tools
• Standard naming scheme for CC, AS and QA tools
• Standard invocation method (CLI)
• Debian packages for easy deployment on the cluster
• Support for data streaming (useful for Hadoop jobs)
• Generates Preservation Components
• Taverna workflows with embedded metadata for easy discovery
• Automatic publication of components on myExperiment (to support discoverability)
• Standard ports to enable composition of Preservation Components (based on well-defined component profiles, CC, AS & QA)
• Digital Preservation Toolkit
• Software suite that contains a large set of DP tools
• 77 operations in total
• Easy to deploy on Linux machines (via apt-get)
• apt-get install digital-preservation-tools
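The idea of a standard invocation method with streaming support can be illustrated with a small Python sketch. This is not the SCAPE Tool Wrapper itself: the `WrappedTool` class, its fields and the example tool name are hypothetical, and a one-line Python command stands in for a real preservation tool.

```python
import subprocess
import sys
from dataclasses import dataclass
from typing import List


@dataclass
class WrappedTool:
    """Hypothetical uniform wrapper around an existing command-line tool."""

    name: str            # a name following some agreed naming scheme
    command: List[str]   # the underlying executable and its fixed arguments

    def run_stream(self, data: bytes) -> bytes:
        """Pipe `data` to the tool's stdin and return what it writes to stdout.

        Raises subprocess.CalledProcessError if the tool exits non-zero.
        """
        completed = subprocess.run(
            self.command,
            input=data,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            check=True,
        )
        return completed.stdout


# Stand-in "tool": upper-cases whatever it reads from stdin.
UPPERCASE = WrappedTool(
    name="example-uppercase",
    command=[
        sys.executable,
        "-c",
        "import sys; sys.stdout.write(sys.stdin.read().upper())",
    ],
)

if __name__ == "__main__":
    print(UPPERCASE.run_stream(b"tiff to jp2"))
```

Because every wrapped tool is called the same way and exchanges data over stdin and stdout, a job running on a cluster node can feed it records without staging temporary files.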
Scalable Computation
• Deployment of environments
• XEN Hypervisor
• Eucalyptus
• Deployment of tools
• Debian Packages
• Tool Spec
• Job Execution Service (JES)
• Apache Oozie
• Apache Hadoop
from digitalbevaring.dk
User‐view on SCAPE development cloud at AIT: Eucalyptus web
interface, Hybridfox browser add‐on, and terminal‐based interaction.
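For readers unfamiliar with how a Hadoop job processes data as a stream, the following Python sketch mimics the Hadoop Streaming convention: a mapper and a reducer exchange tab-separated key/value lines, and the framework sorts by key in between. The input layout (`path<TAB>format<TAB>size`) is an assumption made for illustration, not a SCAPE format, and `run_locally` only simulates the shuffle in a single process.

```python
from itertools import groupby


def mapper(lines):
    """Emit 'format<TAB>size' for each 'path<TAB>format<TAB>size' input line."""
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 3:
            continue  # skip malformed records instead of failing the job
        _path, fmt, size = parts
        yield f"{fmt}\t{size}"


def reducer(sorted_lines):
    """Sum sizes per format. Input must already be sorted by key."""
    pairs = (line.rstrip("\n").split("\t") for line in sorted_lines)
    for fmt, group in groupby(pairs, key=lambda kv: kv[0]):
        total = sum(int(size) for _, size in group)
        yield f"{fmt}\t{total}"


def run_locally(lines):
    """Simulate map, sort and reduce in one process for testing."""
    return list(reducer(sorted(mapper(lines))))


if __name__ == "__main__":
    demo = [
        "a.tif\timage/tiff\t100\n",
        "b.pdf\tapplication/pdf\t50\n",
        "c.tif\timage/tiff\t25\n",
    ]
    for out in run_locally(demo):
        print(out)
```

On a real cluster the two functions would live in separate scripts that read `sys.stdin` and write `sys.stdout`, and Hadoop would take care of distributing the input and sorting the intermediate keys.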
Scalable Repositories
• Fedora 4.0.0
• All REST, no SOAP
• RDF as first class objects
• JCR 2.0 Implementation (ModeShape)
• Infinispan distributed NoSQL datastore
• Lily 2.0
• Built on top of HBase/HDFS
• Integration of computation and storage
SCAPE Architecture
[Figure: SCAPE architecture diagram. Labels recoverable from the diagram: Automated Watch; Automated Watch Sources; Watch Request API; Notification API; Push API; Pull API; Knowledge; Source Adaptor; Client Service; Automated Planning; PLATO; Plan Management API; Plan Management GUI; Preservation Plan Store; Digital Object Repository; Digital Objects/Metadata; Data Connector API; Execution Platform; JES; JES API; Hadoop; Component Catalogue; Component Lookup API; Component Registration API; Component Profile Validator; Taverna Workbench; Report API; Assessment; Data Publication Platform; LDS3 API; Loader Application.]
SCAPE Testbeds
• Large-scale Digital Repositories
• Carry out large scale image migrations
• The master files from legacy digitized image collections are typically TIFF files that can be costly to store due
to their size. The cost benefit can only be realized if one can remove the original TIFFs and this can only be
done if one can provide evidence of successful migration. (2.2 million pages, 80 TB)
• Detect poor sound quality
• In a collection of MP3 files (20 TB, 360,000 files) we have discovered files with very bad sound quality. Before
ingesting everything into our DOMS we would like to be able to discover the bad files and potentially get
those re-digitized from the original analogue media.
• Research Data Sets
• RAW to NeXus conversion
• File size and content volume challenges have been identified for NeXus files.
The RAW to NeXus format migration tool can be customised to account for
various other types of experiment data files in the process of the migration.
However, the scalability challenge here is that these other types of experiment
data files vary significantly between instruments (which are specific to each facility).
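The image migration scenario above depends on having evidence that a migrated file still matches its TIFF master. The slides do not say how that evidence is produced; one common measure is the peak signal-to-noise ratio (PSNR) between the decoded pixel values of the two images. The Python sketch below is only an illustration of that measure. It assumes both images have already been decoded into flat sequences of 8-bit values by some other tool, and the 50 dB threshold is an arbitrary placeholder.

```python
import math
from typing import Sequence


def psnr(original: Sequence[int], migrated: Sequence[int], max_value: int = 255) -> float:
    """Peak signal-to-noise ratio in dB between two equally sized pixel buffers."""
    if len(original) != len(migrated):
        raise ValueError("pixel buffers differ in length")
    if not original:
        raise ValueError("pixel buffers are empty")
    # Mean squared error over all pixel values.
    mse = sum((a - b) ** 2 for a, b in zip(original, migrated)) / len(original)
    if mse == 0:
        return math.inf  # identical pixel data
    return 10 * math.log10(max_value ** 2 / mse)


def migration_ok(original: Sequence[int], migrated: Sequence[int],
                 threshold_db: float = 50.0) -> bool:
    """True if the migrated image is close enough to the original.

    The threshold is a hypothetical policy value, not a SCAPE recommendation.
    """
    return psnr(original, migrated) >= threshold_db
```

For a lossless migration the expectation would be identical pixel data (infinite PSNR); for a lossy target format an institution would have to choose and justify its own threshold.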
from digitalbevaring.dk
See http://wiki.opf-labs.org/display/SP/Scenarios
SCAPE Testbeds
• Web Content
• Quality assurance in web harvesting
• Web crawling is a process that is highly susceptible to errors. Often, essential data is
missed by the crawler and thus not captured and preserved. Currently, quality
assurance requires manual effort and because crawls often contain millions of pages,
manual quality assurance is not very efficient.
• Data Centers
• Anonymization of medical data
• In order to fulfil the requirements for storing medical data in terms of safety
and security, it will be necessary to develop encryption and anonymization
services that will allow medical data transfer to a data center’s remote storage
facilities. On the one hand, the encryption techniques will be used to secure
sensitive personal data (e.g. internal documents, patient databases) which
must only be accessible to authorized services and users. On the other hand,
the anonymization services will enable medical data (like x-ray generator
outputs, x-ray computed tomography outputs, surgery recordings) to be stored
in the data center without having sensitive data attached.
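As a rough illustration of what an anonymization service for the medical data scenario might do before transfer, the Python sketch below drops directly identifying fields from a record and replaces the patient identifier with a keyed hash. Strictly speaking this is pseudonymization, a weaker property than full anonymization. The field names, the record layout and the key handling are all hypothetical and are not taken from the SCAPE testbed.

```python
import hashlib
import hmac

# Hypothetical set of directly identifying fields to strip before transfer.
IDENTIFYING_FIELDS = {"name", "address", "date_of_birth"}


def pseudonym(patient_id: str, secret_key: bytes) -> str:
    """Derive a stable pseudonym from the patient ID using HMAC-SHA256.

    Without the secret key the original ID cannot be recomputed or verified.
    """
    return hmac.new(secret_key, patient_id.encode("utf-8"), hashlib.sha256).hexdigest()


def anonymize(record: dict, secret_key: bytes) -> dict:
    """Return a copy of `record` without identifying fields or the raw patient ID."""
    cleaned = {
        key: value
        for key, value in record.items()
        if key not in IDENTIFYING_FIELDS and key != "patient_id"
    }
    cleaned["pseudonym"] = pseudonym(record["patient_id"], secret_key)
    return cleaned
```

The same patient always maps to the same pseudonym under one key, so recordings from several examinations can still be linked in the data center, while the key itself stays with the originating institution.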
from digitalbevaring.dk
SCAPE Additional Information
Additional Resources of Interest
• Development Infrastructure
• Code repository hosted by the Open Planets Foundation and GitHub
• https://github.com/openplanets/scape/
• Development Wiki
• http://wiki.opf-labs.org/display/SP/Home
• Experimental Workflows
• http://www.myexperiment.org/search?query=SCAPE&type=all&commit=Search
• Publications
• http://www.scape-project.eu/category/publication
• Public Deliverables
• http://www.scape-project.eu/category/deliverable
• Tools
• http://www.scape-project.eu/tools
SCAPE Training Events
• Future Formats First:
Application Infrastructures for Action Services
• 16-17 September 2013, London
• Registration: http://scape-future-formats-first.eventbrite.co.uk/
• Critical Path: Effective Evidence Based Preservation Planning
• 13 November 2013, Aarhus
• Hadoop-driven Digital Preservation (Hackathon)
• 2-4 December 2013, Vienna
See http://www.scape-project.eu/events
SCAPE Contact Information
• http://www.scape-project.eu/
• Twitter: #scapeproject
• office@list.scape-project.eu
• Dr. Ross King
AIT Austrian Institute of Technology GmbH
Donau-City-Strasse 1
A-1220 Wien
Thank you for your attention!
Questions?
