GRANT AGREEMENT: 601138 | SCHEME FP7 ICT 2011.4.3
Promoting and Enhancing Reuse of Information throughout the Content Lifecycle taking account of Evolving Semantics
[Digital Preservation]
“This project has received funding from the European Union’s Seventh
Framework Programme for research, technological development and
demonstration under grant agreement no. 601138.”
Semi-automated metadata extraction
in the long term
Emma Tonkin, King’s College London
DPC Workshop, Belfast, Dec 2015
Structure of presentation
 Introduction to PERICLES
 Layers of Metadata
 Sources of Metadata
 Time, space and data
 Semi-automated metadata as mitigating factor
Introduction to PERICLES
 Four-year Integrated Project (2013-2017) funded by the European Union
under its Seventh Framework Programme
 Promoting and Enhancing Reuse of Information throughout the Content
Lifecycle taking account of Evolving Semantics
 Two (or three) domains:
− Digital artworks, such as interactive software-based installations, and
other digital media from Tate's collections
− Material from Tate's archives
− Experimental scientific data originating from the European Space
Agency and International Space Station.
Model-driven approach
 Essentially all archives are based around some conceptual model of the
material held
 PERICLES applies formal models to describe
− Objects
− Entities associated with objects
− Broader community
 These models support processes such as appraisal and QA, and
consequently functionality such as maintenance and actions taken for
sustainability
 A broad variety of models are under consideration: semantic (ontological)
models to formally describe objects; social network graphs to describe
community; statistical models to describe technology obsolescence...
Layers of Metadata
Open Archival Information System
 OAIS reference model
− “conceptual framework for an archival system dedicated to preserving
and maintaining access to digital information over the long term”
(Lavoie, B. (2000). Meeting the challenges of digital preservation: The OAIS reference model)
 OAIS-compliance
− adherence to ISO 14721:2003 or (now) ISO 14721:2012
− Specifies conceptual framework, functional model, information model
OAIS information model
Descriptive metadata
 Supporting humans and machines
 Goal: interpreting the data object
 Not always possible to automatically interpret data objects at any level
(some are fully opaque)
 Consider:
− 'Unstructured' natural-language texts, such as letters, books, articles
− Images of artworks
− Images of letters
− Recordings of audiovisual presentations
− Complex data files
Sources of descriptive metadata
Automated metadata extraction
 Popular view on indexing metadata:
− “the more, the merrier”
 Risks of low-quality metadata:
− Low accuracy on search and browse tasks; occasionally embarrassing
misinterpretations
 Benefits:
− Additional metadata can improve search indexing
How good is automated metadata extraction?
 Varies significantly depending on the precise task and source material
 Automated metadata extraction tends to apply probabilistic (machine
learning) or heuristic approaches
 Machine-eye view:
− Describe what is present
− Infer what is not present, based on:
 Knowledge base
 Comparison with other items
 Learning from training examples ('supervised learning')
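As a minimal sketch of the heuristic end of this spectrum, the following Python fragment guesses a title and candidate years from plain text. The function name, the year pattern and the sample text are assumptions made for illustration; they are not taken from PERICLES or from any particular extraction tool, and a real extractor would be tuned to the collection in hand.

```python
import re

# Hypothetical heuristic: any four-digit number from 1500 to 2099 is a candidate year.
YEAR_PATTERN = re.compile(r"\b(1[5-9]\d{2}|20\d{2})\b")


def extract_metadata(text: str) -> dict:
    """Guess a title and candidate years from plain text using simple heuristics."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    # Heuristic: treat the first non-empty line as the title.
    title = lines[0] if lines else None
    # Collect distinct candidate years in ascending order.
    years = sorted({int(match) for match in YEAR_PATTERN.findall(text)})
    return {"title": title, "years": years}


# Invented sample text, for illustration only.
sample = (
    "Letter to the Director\n\n"
    "Written in Belfast, December 2015, recalling the 1998 exhibition."
)
record = extract_metadata(sample)
print(record)
```

Both heuristics describe only what is present in the text. Deciding which of the two years is the date of creation would need inference from a knowledge base, comparison with other items, or training examples.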
Crowdsourcing metadata
 The 'phone a friend' approach to metadata generation
− Make material available to the public
− Encourage them to annotate (example: social tagging)
− Examine the result
 The likely result:
− Some material extensively annotated; some descriptive annotations;
some formally structured; some personal ('cryptic')
− Some/most material receives no notice and is not annotated at all
 Mitigation: engineer more consistent coverage through, for example,
gamification (see Galaxy Zoo)
 Identify incentives that encourage the public to contribute
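The uneven coverage described above can be measured before deciding where to direct incentives. The sketch below is one possible way to do so, assuming annotations arrive as (item identifier, tag) pairs; the function name, identifiers and tags are invented for illustration.

```python
from collections import Counter


def annotation_coverage(item_ids, tag_events):
    """Summarise how evenly a set of (item_id, tag) annotations covers a collection."""
    counts = Counter(item_id for item_id, _tag in tag_events)
    # Items that nobody has annotated at all.
    untagged = [item_id for item_id in item_ids if counts[item_id] == 0]
    coverage = 1 - len(untagged) / len(item_ids) if item_ids else 0.0
    return {
        "coverage": coverage,
        "untagged": untagged,
        "most_tagged": counts.most_common(1),
    }


# Invented data: one popular item, one lightly tagged item, two ignored items.
items = ["img-1", "img-2", "img-3", "img-4"]
events = [
    ("img-1", "portrait"),
    ("img-1", "oil"),
    ("img-1", "my fave"),
    ("img-3", "landscape"),
]
report = annotation_coverage(items, events)
print(report)
```

The list of untagged items is the natural input to a mitigation step such as gamification, which could present those items to contributors first.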
Capturing 'live' metadata
 If the environment is accessible at the time of creation:
− Technical 'live' metadata may be captured
− Within PERICLES, this is referred to as 'significant environment
information'
− Example: steps in creation, time of creation, contextual relevance of
other files…
 Another sort of 'live' metadata emerges from observing the behaviour of
those engaging with the data
− Interaction with search/browse interfaces (cf. information scent)
− Satisfaction with results
− Patterns of sharing and reuse (information diffusion on social networks,
for example)
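A rough illustration of capturing technical 'live' metadata while the environment is still accessible is given below. It is a sketch under stated assumptions, not the PERICLES tooling: the function name and the choice of fields are hypothetical, and listing sibling files is only a crude stand-in for the contextual relevance of other files.

```python
import os
import platform
import sys
import tempfile
from datetime import datetime, timezone


def capture_environment(path: str) -> dict:
    """Record a few facts about the environment at the moment a file is handled."""
    directory = os.path.dirname(os.path.abspath(path))
    filename = os.path.basename(path)
    return {
        "file": filename,
        "captured_at": datetime.now(timezone.utc).isoformat(),
        "platform": platform.platform(),
        "python_version": sys.version.split()[0],
        # Files in the same directory, as a crude proxy for related context.
        "sibling_files": sorted(
            name for name in os.listdir(directory) if name != filename
        ),
    }


# Demonstration in a throwaway directory with two invented files.
with tempfile.TemporaryDirectory() as workdir:
    for name in ("experiment.dat", "calibration.txt"):
        with open(os.path.join(workdir, name), "w") as handle:
            handle.write("placeholder")
    snapshot = capture_environment(os.path.join(workdir, "experiment.dat"))

print(snapshot["file"], snapshot["sibling_files"])
```

The point of the design is timing: none of these values can be recovered reliably once the file has been moved into an archive and the original environment has gone.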
Time, space and data
Theoretical reach of information
Image source: S Korotkiy
Practical reach of information
 Receiving the signal is only the start
 Can we decode the signal?
− Technical decoding
− Practical comprehension
 Confounding factors in decoding metadata:
− Language
− Dialect
− Prerequisite knowledge
Language: space travel
Language change: Time travel
 Language may be viewed as a complex adaptive system (Beckner et al.,
2007)
− Made up of many tiny parts - people talking, writing, gesturing
− Adaptive, because we change our behaviour based on past
interactions
− Many factors influence its development: biology of perception; social
structure; experience
 Probabilistic processes underlie language change: collective experience
and eventual consensus
Example: Photogram (Getty Art & Architecture Thesaurus)
The challenges of decreasing accessibility
Understanding unfamiliar material
 Unfamiliar data
− Technical encoding – well-understood problems
− Challenges of internationalisation
 Unfamiliar texts
− Conventions and best practices change over time
− Coherence degrades long before it fails entirely (texts become slower to
read and take more effort; machines trained on modern texts are likely to
encounter issues with texts outside that timeframe)
 Unfamiliar artefacts
− There are many more questions that may be asked about an object: for
example, in the case of artworks, “artist's intent” may be significant
− Once lost, the answers are very difficult to infer
Term recognition vs generation
 Understanding unfamiliar material, though hard, is easier than finding it
 Separate processes:
− Recognising a term
− Identifying (generating) a term
 Recognition is faster and more reliable
 Why:
− Recognising a term: connecting the term to a concept
− Generating terms: searching around a concept, looking through a large
pool of candidate terms for the one that might work best here
− Think yourself into the curator's shoes: what terms might they have
used for the concept that interests you, and why?
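The asymmetry between recognising and generating a term can be sketched in a few lines of Python. The mini-thesaurus below is invented for illustration and is not taken from the Getty Art & Architecture Thesaurus; the function names are likewise hypothetical.

```python
# Hypothetical mini-thesaurus: concept -> terms used for it by different
# communities or eras. Illustrative only; not drawn from a real vocabulary.
THESAURUS = {
    "cameraless photograph": ["photogram", "rayograph", "schadograph"],
    "moving image": ["film", "motion picture", "movie"],
}

# Recognition index: term -> concept, built once from the thesaurus.
TERM_TO_CONCEPT = {
    term: concept for concept, terms in THESAURUS.items() for term in terms
}


def recognise(term: str):
    """Recognition: a single lookup from a term that is already in front of us."""
    return TERM_TO_CONCEPT.get(term.lower())


def generate(concept: str, indexed_terms: set) -> list:
    """Generation: try each candidate term for a concept against what was indexed."""
    candidates = THESAURUS.get(concept, [])
    return [term for term in candidates if term in indexed_terms]


# Terms a hypothetical curator actually used when cataloguing.
catalogue_terms = {"rayograph", "film"}

print(recognise("Rayograph"))
print(generate("cameraless photograph", catalogue_terms))
```

Recognition is one lookup, whereas generation has to walk through candidate terms and succeeds only if the searcher's pool happens to include the curator's choice. A searcher who knows only "photogram" finds nothing in this catalogue.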
Semi-automated metadata as mitigating factor
Relating concept, feature, agent and term
 Peirce: semiotic triad, relating symbol, object and interpreter
− Software agents: machine-level features (machine perception) –
words found in documents, colours, shapes or patterns found in
images…
− Human agents: perception; comprehension; application of relevant
knowledge; interpretation into a set of concepts; encoding
observations into terms
 Observing the behaviour of human agents throughout the lifecycle of the
digital object allows us to study change in manual interpretation and
encoding
 This permits us to characterise these patterns of change
 It also permits software agents to be brought into line with changing norms
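One simple way to characterise change in how human agents encode a concept is to track which term dominates in each period. The sketch below assumes observations arrive as (period, term) pairs for a single concept; the data and function names are invented for illustration and do not describe the method used in PERICLES.

```python
from collections import Counter, defaultdict


def dominant_term_by_period(observations):
    """Return, for each period, the term most often chosen for one concept."""
    counts = defaultdict(Counter)
    for period, term in observations:
        counts[period][term] += 1
    return {
        period: counter.most_common(1)[0][0]
        for period, counter in sorted(counts.items())
    }


def detect_shifts(dominant):
    """List the periods at which the dominant term differs from the previous period."""
    periods = list(dominant)
    return [
        (later, dominant[earlier], dominant[later])
        for earlier, later in zip(periods, periods[1:])
        if dominant[earlier] != dominant[later]
    ]


# Invented annotation log for a single concept.
observations = [
    (2013, "wireless"), (2013, "wireless"), (2013, "wifi"),
    (2015, "wireless"), (2015, "wifi"), (2015, "wifi"),
    (2017, "wifi"), (2017, "wifi"),
]
dominant = dominant_term_by_period(observations)
shifts = detect_shifts(dominant)
print(dominant)
print(shifts)
```

A detected shift is the kind of signal that could prompt retraining or re-weighting of a software agent so that the terms it emits stay in line with current human usage.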
Conclusion
 PERICLES combines
− model-led approaches to data management
− data-led approaches to modelling and characterising the changing
environment and context(s) of reuse
 The approach acknowledges the dynamic nature of the system in which reuse occurs
 Downside: such an approach requires ongoing availability of material
(ethically) gleaned from observational data
− Consequently, a closed archive or an archive that excites little
interest remains difficult to sustain, unless data is sourced elsewhere
 In conclusion, therefore, data-led approaches gain from joint infrastructure
and open data