Automatic Preservation Watch
Using Information Extraction on the Web
Luis Faria (lfaria@keep.pt), KEEP SOLUTIONS, www.keep-solutions.com
Alan Akbik, Barbara Sierman, Marcel Ras, Miguel Ferreira, José Carlos Ramalho
iPRES 2013
Lisbon, September 2, 2013
Why do we need monitoring?
[Diagram: a repository and the factors that affect it, labelled "Risks" and "Opportunities"]
• Format obsolescence
• Emerging technology
• Consumer trends
• New standards
• Organisation mission
• Bit rot
• Resource capability
• System availability
• Security breach
• Economic limitations
• Social and political factors
• Producer trends
• Organisation policies
Survey on: Risk Assessment
[Pie chart: 60% / 40%; legend: "Yes, but manual and ad hoc", "None"]
Scout: a preservation watch system
Monitors aspects of the world to detect preservation risks and opportunities
This work was partially supported by the SCAPE Project.
The SCAPE project is co-funded by the European Union under FP7 ICT-2009.4.1 (Grant Agreement number 270137).
Information Sources
• Format registries & software catalogues
• Digital repositories & web archives
• Organizational objectives
• Experiments
• Simulation
• Human knowledge
Currently supported information sources
• PRONOM
• Repository content and events
• Web archive content
• Web archive renderability experiments
• SCAPE Policy model
Define triggers
• Notify me when there are tools that can render the format X.
• Simple query with templates
Receive notifications
• Email
• HTTP Push API
Example notification: "There are tools that can render format X."
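The trigger-and-notification idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `ToolRecord`, `Trigger`, and `tools_can_render` are hypothetical names invented here, not Scout's actual data model or API, and the notification channel is stood in for by a plain callback rather than email or HTTP push.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ToolRecord:
    """Hypothetical knowledge-base entry: a tool and the formats it renders."""
    name: str
    renders: frozenset


@dataclass
class Trigger:
    """Hypothetical trigger: a condition over the knowledge base plus a notifier."""
    description: str
    condition: Callable[[list], bool]
    notify: Callable[[str], None]

    def evaluate(self, knowledge_base: list) -> bool:
        # Fire the notification only when the condition currently holds.
        if self.condition(knowledge_base):
            self.notify(self.description)
            return True
        return False


def tools_can_render(fmt: str) -> Callable[[list], bool]:
    """Build the condition 'there are tools that can render format fmt'."""
    def condition(knowledge_base: list) -> bool:
        return any(fmt in tool.renders for tool in knowledge_base)
    return condition


# Usage: the callback collects messages; a real system would send email
# or call an HTTP endpoint instead.
sent = []
trigger = Trigger(
    description="There are tools that can render format X.",
    condition=tools_can_render("X"),
    notify=sent.append,
)

kb = [ToolRecord("viewer-a", frozenset({"Y"}))]
fired_before = trigger.evaluate(kb)   # no tool renders X yet

kb.append(ToolRecord("viewer-b", frozenset({"X"})))
fired_after = trigger.evaluate(kb)    # now the condition holds
```

Keeping the condition and the notifier as separate callables means the same trigger can be re-evaluated whenever a source adds data, independently of how the message is delivered.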
Automatic Watch Limitations
Machine readable data
• Explicitly and formally specified information
• Controlled vocabulary
• Ontology
• All instances use the same structure and set of values
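As a rough illustration of what "same structure and set of values" means in practice, the sketch below validates incoming records against a controlled vocabulary. The vocabulary (`JournalChange`), the record type, and the field names are all hypothetical, chosen only for the example.

```python
from dataclasses import dataclass
from enum import Enum


class JournalChange(Enum):
    """Hypothetical controlled vocabulary: the only values a record may use."""
    NAME_CHANGE = "name-change"
    ISSN_CHANGE = "issn-change"
    CEASED = "ceased"


@dataclass(frozen=True)
class ChangeRecord:
    """Every instance has the same structure: a journal and a vocabulary term."""
    journal: str
    change: JournalChange


def parse_record(raw: dict) -> ChangeRecord:
    # Enum lookup by value raises ValueError for anything outside the vocabulary.
    return ChangeRecord(journal=raw["journal"], change=JournalChange(raw["change"]))


ok = parse_record({"journal": "Journal A", "change": "ceased"})

try:
    parse_record({"journal": "Journal A", "change": "stopped publishing"})
    rejected = False
except ValueError:
    rejected = True  # free text is not machine readable in this sense
```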
Case study: e-Depot coverage
[Chart: "Publishers" and "Titles per publisher" (scale 0-600) against "% of journal titles" (40%-100%); callout: "97% publishers, 1-10 titles"]
e-journal coverage questions
• Which publisher provides which journal titles
• Publisher changes:
  • Ceases to provide journal
  • Transfers journal to other publisher(s)
  • Publishers merge
• Journal changes:
  • Name changes
  • ISSN changes
  • Ceased to exist
Where is this information?
“In 1991, two years before the merger with Reed, Elsevier
acquired Pergamon Press in the UK.”
“The Asia-Europe Foundation (ASEF) sold the Asia Europe
Journal and transferred the copyright to its long-time partner
Springer.”
“Acta Chirurgica Iugoslavica is available free of charge as an
Open Access journal on the Internet.”
On the publisher's website!
Not machine readable!
Information Extraction
• Extract structured information from unstructured data
• Pattern-based information extraction
• Some training and supervision may be needed
Example pattern: "[X] acquired [Y]"
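A pattern such as "[X] acquired [Y]" can be approximated with a regular expression. The sketch below is a deliberately naive stand-in, not the extraction system used in the experiment: it assumes that an entity is a run of capitalised words, which is a simplification introduced here for illustration.

```python
import re

# Assumption: an entity is one or more consecutive capitalised words.
ENTITY = r"[A-Z][\w&-]*(?:\s+[A-Z][\w&-]*)*"

# "[X] acquired [Y]" expressed as a regular expression with named groups.
ACQUIRED = re.compile(rf"(?P<x>{ENTITY})\s+acquired\s+(?P<y>{ENTITY})")


def extract_acquisitions(sentence: str) -> list:
    """Return (acquirer, acquired) pairs found in one sentence."""
    return [(m.group("x"), m.group("y")) for m in ACQUIRED.finditer(sentence)]


sentence = (
    "In 1991, two years before the merger with Reed, "
    "Elsevier acquired Pergamon Press in the UK."
)
pairs = extract_acquisitions(sentence)
# pairs == [("Elsevier", "Pergamon Press")]
```

The capitalised-word heuristic is fragile: it cuts an entity short at the first lower-case word, so any name containing "of" or "and" would be truncated. That is one reason real systems rely on NLP pre-processing and some supervision.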
Experiment
1. Data acquisition and pre-processing
2. Relation discovery
3. Information extraction
4. Validation of results
1. Data acquisition and pre-processing
• Focused crawler with seed words (12,000 entries)
  • Publisher names
  • Journal titles
➡ 500,000 Web pages
• Pre-process with NLP tools
➡ 18 million sentences
➡ 8 GB
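To make this step concrete, here is a minimal sketch of splitting page text into sentences and keeping those that mention a seed entry. It rests on assumptions: the seed list is invented, the splitter is a naive regular expression rather than the NLP tools used in the experiment, and the slides do not say whether filtering happened at sentence level.

```python
import re

# Hypothetical seed entries (publisher names and journal titles), lower-cased.
SEEDS = {"elsevier", "springer", "asia europe journal"}


def split_sentences(text: str) -> list:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def mentions_seed(sentence: str, seeds=SEEDS) -> bool:
    """True if any seed entry occurs in the sentence (case-insensitive)."""
    lowered = sentence.lower()
    return any(seed in lowered for seed in seeds)


page = (
    "Welcome to our site. Elsevier acquired Pergamon Press in the UK. "
    "Contact us for details!"
)
sentences = split_sentences(page)
relevant = [s for s in sentences if mentions_seed(s)]
```

For scale, the figures on the slide work out to an average of 18,000,000 / 500,000 = 36 sentences per crawled page.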
2. Relation discovery
Prominent pattern                      Rank
[X] journal of [Y]                     1
[X] published by [Y]                   2
[X] journal on [Y]                     3
[X] journal published by [Y]           4
[X] available as [Y] journal           5
PubMed [X] [Y]                         9
[X] science proceedings of [Y]         25
[X] subscription available to [Y]      30
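The slides do not state how the ranking was computed. One plausible approach, sketched here purely as an assumption, is to take entity pairs that are already known, collect the text that connects them in each sentence, and rank the resulting patterns by frequency. The function name, the journal and publisher names, and the sentences are all invented for the example.

```python
from collections import Counter


def discover_patterns(sentences: list, seed_pairs: list) -> list:
    """Rank '[X] ... [Y]' patterns by how often they connect a known pair."""
    counts = Counter()
    for sentence in sentences:
        for x, y in seed_pairs:
            i = sentence.find(x)
            if i == -1:
                continue
            # Look for Y only after the end of X.
            j = sentence.find(y, i + len(x))
            if j == -1:
                continue
            between = sentence[i + len(x):j].strip()
            if between:
                counts[f"[X] {between} [Y]"] += 1
    # most_common() sorts by descending count, i.e. rank 1 first.
    return counts.most_common()


# Invented data for illustration only.
seed_pairs = [
    ("Journal A", "Publisher B"),
    ("Journal C", "Publisher D"),
    ("Journal E", "Publisher F"),
]
sentences = [
    "Journal A published by Publisher B since 2001.",
    "Journal C published by Publisher D.",
    "Journal E is available from Publisher F.",
]
ranked = discover_patterns(sentences, seed_pairs)
```

Frequency alone also surfaces noisy patterns, so a ranked list like this would still need a human to pick the patterns that express the intended relation.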
3. Information extraction
2,000 journal titles
500 journal-publisher attributions
4. Validation of results
[Pie chart: Journal titles in eDepot: 4%, 10%, 86%]
[Pie chart: Title-publisher in the Keepers registry: 15%, 50%, 35%]
Legend: Should add, Existing, False-positives
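The three categories in the legend can be reproduced with a simple comparison, sketched below under assumptions: an extracted item is "Existing" if the registry already holds it, "Should add" if a human reviewer confirmed it as valid, and "False-positive" otherwise. The function names and sample data are hypothetical, and the slides do not describe the actual validation procedure in this detail.

```python
from collections import Counter


def classify(extracted: list, registry: set, confirmed: set) -> dict:
    """Label each extracted item against a registry and a reviewer-confirmed set."""
    labels = {}
    for item in extracted:
        if item in registry:
            labels[item] = "Existing"
        elif item in confirmed:
            labels[item] = "Should add"
        else:
            labels[item] = "False-positive"
    return labels


def shares(labels: dict) -> dict:
    """Percentage of items per label."""
    counts = Counter(labels.values())
    total = len(labels)
    return {label: 100 * n / total for label, n in counts.items()}


# Invented sample data.
extracted = ["Journal A", "Journal B", "Journal C", "Journal of"]
registry = {"Journal A", "Journal B"}
confirmed = {"Journal C"}

labels = classify(extracted, registry, confirmed)
percentages = shares(labels)
```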
False-positives
• Detecting boundaries of titles and publisher names
  "European Journal of Nuclear Medicine and Molecular Imaging"
• Use of abbreviations in titles and publisher names
  IAAE - "International Association of Agricultural Economists"
• Technical problems like encoding
  "├ó╦å┼buda University"
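The garbled university name is typical of text decoded with the wrong character set. The sketch below shows the general mechanism, assuming UTF-8 bytes misread as Latin-1; this is not necessarily the exact chain of encodings that produced the string on the slide.

```python
# "\u00d3" is the letter O with acute accent, as in "Óbuda University".
original = "\u00d3buda University"

# UTF-8 encodes the accented letter as two bytes; reading those bytes as
# Latin-1 turns them into two unrelated characters (mojibake).
garbled = original.encode("utf-8").decode("latin-1")

# If the wrong decoding is known, the damage can be reversed.
repaired = garbled.encode("latin-1").decode("utf-8")

matches_registry = garbled == original      # the garbled name no longer matches
recovered = repaired == original            # the repaired name does
```

A name damaged this way will not match its registry entry by plain string comparison, which is how an encoding problem ends up counted as a false positive.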
Conclusions
• We need data to support digital preservation
• Explicitly and formally specified, for automation
• Registries tend to be incomplete and outdated
• Information Extraction Technologies can help
• Still, some supervision may be needed
Send us your use cases!
Alan Akbik, alan.akbik@tu-berlin.de
Luis Faria, lfaria@keep.pt
Preservation Watch: What risks to monitor?
Information Extraction: What to extract from the web?
Thank you, questions?
• Scout - a preservation watch system
  • Site: http://openplanets.github.io/scout/
  • Demo: http://scout.scape.keep.pt
• SCAPE Planning and Watch suite iPRES poster
  • http://bit.ly/scape-pw
• SCAPE
  • http://www.scape-project.eu
