The Use of Big Data Techniques for Digital Archiving
Sven Schlarb, Austrian Institute of Technology
Tuesday 15th March 2016, Cambridge
OUTLINE
• E-ARK Project Overview
• Technical Background
• Integrated Prototype
• Data Mining Use Cases
Project Overview
The E-ARK project is co-funded by the European Commission under the ICT-PSP Programme
www.eark-project.eu
Advisory Boards
Archival
• Archives of Emilia-Romagna, Italy
• Directorate-General of the Book, of Archives & of Libraries, Portugal
• EC Archives & Records Management
• EC Historical Archives
• German Federal Archives
• National Archives of Bulgaria
• National Archives of Finland
• National Archives of France
• National Archives of Sweden
• National Archives of the Netherlands
• Polish Data Archive
• Queensland State Archives
• Swiss Federal Archives
• UK National Archives
• UK Parliamentary Archives
Commercial / Technical
• Arkivum
• ARMA Europe
• DigitalForever
• Discovery Garden
• Microsoft Research
• Open Preservation Foundation
• Open Text Initiative
• Preservica
• Versity
Data Providers
• Danish Agency for Digitisation
• Estonian Ministry of Economic Affairs & Communication
• Estonian Unemployment Insurance Fund
• James Lappin, RM Consultant
Project mission
• Improve access to the archived records of European Archives
• Create guidelines and recommended practices
• Cover relational databases, record management systems, and geographical data
• Create open source implementation evaluated in several pilots
Outcomes
Standardisation of available best practices
• Common terminology (Knowledge Center)
• SIP, AIP and DIP format specifications
• Pre-ingest, ingest and access workflows
Open source tools
• Scalable, modular, and reusable implementation of specifications
• Individual deployments (Pilots) and an integrated reference implementation
Technical Background
Hadoop Cluster
Task Trackers
Data Nodes
Job Tracker
Name Node
Hadoop = MapReduce + HDFS
• Distributed processing (MapReduce)
  example: 2 x quad-core CPUs: 10 Map (parallelisation), 4 Reduce (aggregation)
• Distributed storage (HDFS)
  example: 4 x 1 TB hard disks (replication factor 3): ca. 1.33 TB
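The storage figure above follows from dividing the raw disk capacity by the replication factor. A minimal sketch of that arithmetic; the function name is illustrative, and the calculation ignores space reserved for the operating system and temporary data:

```python
def usable_hdfs_capacity_tb(disks: int, disk_size_tb: float, replication: int) -> float:
    """Approximate usable capacity: every block is stored `replication` times."""
    raw_tb = disks * disk_size_tb
    return raw_tb / replication


capacity = usable_hdfs_capacity_tb(disks=4, disk_size_tb=1.0, replication=3)
print(f"{capacity:.2f} TB")  # prints "1.33 TB"
```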
Map/Reduce in a nutshell
• Input data: Input split 1 (Records 1-3), Input split 2 (Records 4-6), Input split 3 (Records 7-9)
• Map: Task 1, Task 2, Task 3
• Sort, Shuffle, Merge
• Reduce
• Output data: aggregated results
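The flow on this slide can be imitated in a few lines of plain Python. This is only a sketch of the programming model, not Hadoop code: the word-count job, the function names and the sample records are assumptions chosen for illustration, and the map tasks run sequentially here whereas Hadoop distributes them across the cluster.

```python
from collections import defaultdict
from itertools import chain


def map_record(record):
    """Map: emit (key, value) pairs for one record; here (word, 1)."""
    return [(word.lower(), 1) for word in record.split()]


def shuffle(pairs):
    """Sort/shuffle/merge: group all values by key, with keys in sorted order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_group(key, values):
    """Reduce: aggregate the values collected for one key."""
    return key, sum(values)


# Hypothetical input: three splits of three records each, as on the slide.
input_splits = [
    ["archive record", "digital archive", "record"],
    ["digital record", "archive", "big data"],
    ["data archive", "big archive", "data"],
]

# One map task per input split.
map_outputs = [
    list(chain.from_iterable(map_record(record) for record in split))
    for split in input_splits
]
grouped = shuffle(chain.from_iterable(map_outputs))
result = dict(reduce_group(key, values) for key, values in grouped.items())
print(result)
```

Because each map task only sees its own split and each reduce call only sees one key, both phases can be parallelised independently, which is the property the slide's diagram illustrates.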
E-ARK Integrated Prototype
Architecture & Implementation
Base technology stack
E-ARK Web
“Integrated” Prototype?
Information Package processing & Access Repository
• SIP to AIP, AIP to DIP
• Workers (Celery)
• Working area (NAS)
• Hadoop Distributed File System
• Lily Repository
• Search and Access, DIP Delivery
Access Repository - Interfaces
Ingest and Preservation
• Archival records; Content and Records Management Systems
• SIP Creation Tools; E-ARK SIP
• SIP-AIP Conversion; E-ARK AIP
• Digital preservation systems
Access
• CMIS Interface; Data Mining Interface
• AIP-DIP Conversion; Scalable Computation; E-ARK DIP
• Archival Search, Access and Display Tools
• Content and Records Management Systems
• Data Mining Showcase
E-ARK Data Mining
Geographical/timeline search
Peripleo - PELAGIOS Project
Text mining: Text classification
Training
• Train classifier using annotated text corpus
• SVM (support vector machine) based on statistical features
Classification
• Scan for texts during ingest (or run a MapReduce job afterwards)
• Text category estimation
Search
• Add category as a searchable field to the Lily index
• Full-text search using Lily's Solr search interface
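The training and classification steps can be sketched with a small linear SVM. The slide does not say which features or library were used, so everything below is an assumption made for illustration: bag-of-words counts as the statistical features, a plain sub-gradient descent trainer, and a made-up two-category corpus. The predicted category is what would then be added to the index as a searchable field.

```python
def tokenize(text):
    return text.lower().split()


def build_vocabulary(texts):
    """Assign an index to every distinct word in the training corpus."""
    vocab = {}
    for text in texts:
        for word in tokenize(text):
            vocab.setdefault(word, len(vocab))
    return vocab


def featurize(text, vocab):
    """Bag-of-words counts; words outside the vocabulary are ignored."""
    vector = [0.0] * len(vocab)
    for word in tokenize(text):
        if word in vocab:
            vector[vocab[word]] += 1.0
    return vector


def train_linear_svm(X, y, epochs=50, learning_rate=0.1, reg=0.01):
    """Sub-gradient descent on the L2-regularised hinge loss (labels are +1/-1)."""
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        for x, label in zip(X, y):
            margin = label * sum(wi * xi for wi, xi in zip(w, x))
            # The regularisation term shrinks the weights on every step.
            w = [wi * (1.0 - learning_rate * reg) for wi in w]
            if margin < 1.0:
                # Example violates the margin: push the weights towards it.
                w = [wi + learning_rate * label * xi for wi, xi in zip(w, x)]
    return w


def classify(text, w, vocab, positive, negative):
    """Return the positive category if the decision score is above zero."""
    score = sum(wi * xi for wi, xi in zip(w, featurize(text, vocab)))
    return positive if score > 0 else negative


# Hypothetical annotated corpus with two invented categories.
corpus = [
    ("budget tax revenue", "finance"),
    ("annual tax audit", "finance"),
    ("hospital patient care", "health"),
    ("patient vaccination programme", "health"),
]
vocab = build_vocabulary(text for text, _ in corpus)
X = [featurize(text, vocab) for text, _ in corpus]
y = [1 if category == "finance" else -1 for _, category in corpus]
weights = train_linear_svm(X, y)

print(classify("tax audit of the budget", weights, vocab, "finance", "health"))
print(classify("vaccination of hospital staff", weights, vocab, "finance", "health"))
```

A production setup would more likely rely on an established machine-learning library and richer features; the point here is only the sequence of steps: train on annotated texts, estimate a category for each ingested text, and store that category alongside the full text.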
OLAP (Online Analytical Processing)
• Database archiving and re-use (SIARD2)
• Normalization - OLAP/Oracle Data Warehouse
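To illustrate the re-use idea, an archived relational database can be loaded into a fact and dimension layout and queried with OLAP-style aggregations. The sketch below uses Python's built-in sqlite3 module as a stand-in for a data warehouse; the tables, columns and values are invented for illustration and are not taken from SIARD2 or from the E-ARK pilots.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension table: one row per region (hypothetical).
    CREATE TABLE dim_region (region_id INTEGER PRIMARY KEY, name TEXT);
    -- Fact table: one row per region and year (hypothetical).
    CREATE TABLE fact_claims (region_id INTEGER, year INTEGER, claims INTEGER);
    INSERT INTO dim_region VALUES (1, 'North'), (2, 'South');
    INSERT INTO fact_claims VALUES
        (1, 2014, 120), (1, 2015, 150),
        (2, 2014, 80),  (2, 2015, 95);
""")

# Roll-up: aggregate the measure along the region dimension.
by_region = conn.execute("""
    SELECT r.name, SUM(f.claims)
    FROM fact_claims AS f
    JOIN dim_region AS r ON r.region_id = f.region_id
    GROUP BY r.name
    ORDER BY r.name
""").fetchall()

# Slice: the same measure restricted to a single year.
total_2015 = conn.execute(
    "SELECT SUM(claims) FROM fact_claims WHERE year = ?", (2015,)
).fetchone()[0]

print(by_region)
print(total_2015)
conn.close()
```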
Thank you!
• http://www.eark-project.eu
• https://github.com/eark-project


Editor's Notes

  • #6: Purpose is to assess contributions to and from the project. Open to interested parties. Meetings of these groups gather information and contribute to a knowledge base (maintained by the DLM Forum).
  • #17: Technologies: Hadoop MapReduce, Solr, HDFS, Lily Repository, ESSArch Preservation Platform (EPP), E-ARK Web. Vertical integration: [MapReduce] works atop [HDFS]; [Solr] indexes [Lily] records. Horizontal integration: [MapReduce] is used to build the [Solr] index; [HDFS] is used to store [Lily] content; packages ingested via the [EPP] UI are searched and accessed via the [E-ARK Web] UI.