Approaches to automated metadata extraction: FixRep Project
Emma Tonkin (UKOLN), www.bath.ac.uk
Wouldn't it be nice if...
  • ...computers could author our metadata for us, thus saving a lot of hassle?
  • Mechanical metadata extraction vs manual metadata input
But...
  • Automated tools are fallible
  • There's never quite enough information available
  • Templates change; different domains have different standards
  • In short, computers are often wrong, and so are people
Hybrid approach:
  • Get what metadata you can
  • Ask the user to check and clean it if necessary
  • Philosophy: if the computer gets it wrong, we can fix it later
  • The 'half a loaf' hypothesis
Wouldn’t it be nice if…
  • …computers could fix our metadata for us?
  • Or, more realistically, help us do this work for ourselves.
All about ‘fixing it later’, doing what we can with what we have
  • Automated metadata extraction + metadata consistency assessment
  • Metadata generation, evaluation, characterisation: enabling metadata triage
  • Challenges in automated metadata extraction
  • Manual metadata generation
  • Metadata extraction in brief
  • Practical use as part of a repository deposit workflow
  • A user study comparing manual and hybrid input
  • Towards metadata triage
Whatever can go wrong...
PDFs can be:
  • Encrypted
  • Corrupted
  • Oddly encoded
  • An image file without embedded text
Occurrence: ~3-6%
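The failure cases listed here suggest a cheap pre-check before any extraction is attempted. The following is a minimal sketch using byte-level heuristics only; the function name `triage_pdf_bytes` and its category labels are assumptions made for illustration and are not taken from FixRep. Telling an image-only PDF from a text PDF needs an actual text-extraction attempt with a PDF library, so that case is deliberately left out.

```python
def triage_pdf_bytes(data: bytes) -> str:
    """Crude first-pass classification of a PDF held in memory.

    Heuristics only (an assumption of this sketch): a real deposit
    workflow would confirm each verdict with a proper PDF parser.
    """
    # A PDF should announce itself with a %PDF- header near the start.
    if b"%PDF-" not in data[:1024]:
        return "not-a-pdf"
    # A complete file normally ends with an %%EOF marker.
    if b"%%EOF" not in data[-1024:]:
        return "truncated"
    # Encrypted files reference an /Encrypt dictionary in the trailer.
    if b"/Encrypt" in data:
        return "encrypted"
    # Anything else is passed on to the extractor.
    return "candidate"


# Hypothetical sample inputs, just enough to exercise each branch.
MINIMAL = b"%PDF-1.4\n1 0 obj\n<<>>\nendobj\ntrailer\n<< /Root 1 0 R >>\n%%EOF\n"
ENCRYPTED = MINIMAL.replace(b"/Root 1 0 R", b"/Root 1 0 R /Encrypt 2 0 R")
```

Files that come back as anything other than `candidate` could be routed straight to manual entry instead of failing later in the pipeline.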
Character sets
  • Ligatures, accents, symbols: may not always be extractable from PDFs
Image © Daniel Ullrich
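When a ligature or accent does survive extraction as a Unicode code point, it can usually be folded back to plain letters before matching. A minimal sketch, assuming Python's standard `unicodedata` module is acceptable for the job; the function name and sample string are illustrative only. It does nothing for characters that the PDF's font mapping lost entirely.

```python
import unicodedata


def normalise_extracted_text(text: str) -> str:
    """Fold compatibility characters (such as the 'fi' ligature) into
    plain letters and compose base letter + combining accent pairs."""
    return unicodedata.normalize("NFKC", text)


# Hypothetical extractor output: U+FB01 is the 'fi' ligature,
# U+0301 is a combining acute accent following a plain 'e'.
raw = "Con\ufb01rmation of cafe\u0301"
clean = normalise_extracted_text(raw)
```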
Document formats/layouts
  • Many possible formats
  • Some formats not widely supported
  • Document layouts vary widely, esp. by discipline
Whatever can go wrong... (II)
  • Function following form – interface
  • Model adapted to suit unique user needs
  • Data model incompletely supported
  • Input validation issues
  • Systematic error; typos; localisation; encoding; etc.
  • Lots of past work in characterising manual input errors
Image segmentation, templating & OCR
Working from text
  • There are a number of possible states (i.e. title, author, email, affiliation, abstract)
  • Directed graph with probabilities
  • Markov chain: for example, Title → Author → Email → Affil.
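The "directed graph with probabilities" can be written down as a table of transition probabilities between states. The sketch below is illustrative only: every number in `TRANSITIONS` is invented, and the `start` pseudo-state and the function name are assumptions rather than anything from the FixRep implementation.

```python
# Transition probabilities between header states (made-up values).
# Each inner dict gives P(next state | current state) and sums to 1.
TRANSITIONS = {
    "start": {"title": 1.0},
    "title": {"title": 0.7, "author": 0.3},
    "author": {"author": 0.5, "email": 0.3, "affiliation": 0.2},
    "email": {"affiliation": 0.6, "abstract": 0.4},
    "affiliation": {"affiliation": 0.5, "abstract": 0.5},
    "abstract": {"abstract": 1.0},
}


def sequence_probability(states):
    """Probability of visiting the given states in order, beginning
    from the 'start' pseudo-state. Missing edges count as zero."""
    prob = 1.0
    prev = "start"
    for state in states:
        prob *= TRANSITIONS.get(prev, {}).get(state, 0.0)
        prev = state
    return prob


p_typical = sequence_probability(["title", "author", "email", "affiliation"])
p_impossible = sequence_probability(["author", "title"])
```

With these invented numbers, the Title → Author → Email → Affil. path gets a non-zero probability, while a sequence that opens with an author is ruled out because the graph has no such edge.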
Hidden Markov Model
  • We cannot directly see these states – only the words
  • But we can gather statistics on the correlation between the words and the underlying states, to inform guesses as to how the data should be segmented
  • This may be expressed in terms of an HMM
  • Bayesian statistics used across term appearance
Example parse
  • Confirmation-Guided Discovery of First-Order Rules, PETER A. FLACH, NICOLAS LACHICHE
  • Confirmation-Guided Discovery of First-Order Rules, PETER A. FLACH, NICOLAS LACHICHE
  • ...
  • Confirmation-Guided Discovery of First-Order Rules, PETER A. FLACH, NICOLAS LACHICHE
  • Self-correcting, to the extent that the knowledge base grows as new papers are added to the collection
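Segmenting the example header above amounts to finding the most likely hidden state for each word. The sketch below does this with Viterbi decoding over a toy two-state HMM. It is a simplification under stated assumptions: all probabilities are invented, only `title` and `author` states are modelled, and the emission model looks at capitalisation alone, whereas the slides describe statistics gathered over the words themselves.

```python
import math

STATES = ["title", "author"]
START = {"title": 0.9, "author": 0.1}            # made-up values
TRANS = {                                         # made-up values
    "title": {"title": 0.7, "author": 0.3},
    "author": {"title": 0.05, "author": 0.95},
}


def emission(state, token):
    """P(token | state), using one assumed feature: is it ALL CAPS?"""
    if state == "author":
        return 0.8 if token.isupper() else 0.2
    return 0.1 if token.isupper() else 0.9


def viterbi(tokens):
    """Return the most probable state label for each token."""
    if not tokens:
        return []
    # best[s] = (log-probability of best path ending in s, that path)
    best = {
        s: (math.log(START[s]) + math.log(emission(s, tokens[0])), [s])
        for s in STATES
    }
    for token in tokens[1:]:
        step = {}
        for s in STATES:
            # Pick the best predecessor state for s.
            score, path = max(
                ((best[p][0] + math.log(TRANS[p][s]), best[p][1]) for p in STATES),
                key=lambda pair: pair[0],
            )
            step[s] = (score + math.log(emission(s, token)), path + [s])
        best = step
    return max(best.values(), key=lambda pair: pair[0])[1]


header = ("Confirmation-Guided Discovery of First-Order Rules, "
          "PETER A. FLACH, NICOLAS LACHICHE")
labels = viterbi(header.split())
```

On this header the decoder labels the first five tokens as title and the remaining five as author, since the upper-case names are far more likely under the assumed author emissions.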
Aims
  • Adaptation of existing interfaces
  • Enhancing rather than rewriting
  • Cross-platform, accessible interface
  • Simple reusable REST API, metadata as DC/XML
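To make "metadata as DC/XML" concrete, here is a minimal sketch that serialises an extracted record as Dublin Core elements using Python's standard library. The wrapper element name `metadata`, the function name and the sample record are assumptions for illustration; the slides do not specify the actual response format of the service.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core Metadata Element Set, version 1.1.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def to_dc_xml(record):
    """Serialise {field: [values]} as Dublin Core elements inside a
    hypothetical <metadata> wrapper element."""
    root = ET.Element("metadata")
    for field, values in record.items():
        for value in values:
            element = ET.SubElement(root, f"{{{DC_NS}}}{field}")
            element.text = value
    return ET.tostring(root, encoding="unicode")


# Sample record built from the example parse in the slides.
record = {
    "title": ["Confirmation-Guided Discovery of First-Order Rules"],
    "creator": ["Peter A. Flach", "Nicolas Lachiche"],
}
dc_xml = to_dc_xml(record)
```

Repeating the `dc:creator` element once per author keeps each name separately addressable for the deposit form that consumes the response.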
Sample interfaces
Architecture
Using what we know...
Question: “Do people accept ‘hybrid’ interfaces?”
  • Here’s one we did earlier…
Hypotheses
  • Correcting extracted metadata is faster than entering or cutting-and-pasting metadata.
  • The resulting metadata has fewer errors when the user is provided with already extracted metadata to correct.
  • User satisfaction with a system is higher if it 'tries' to extract metadata, even if it fails.
Measured: speed and accuracy of entering information manually versus hybrid entry and, qualitatively, user satisfaction
Results: Timing
  • Hybrid faster under both conditions (summary of median times)
Results: Accuracy
  • Tested against ground-truth
  • Keyword accuracy: first keyword listed was relevant for 46% of the publications. The top two were relevant in 66%; the top-5 cover 81% of all desired keywords.
  • Manual metadata accuracy:
      • Few users use cut and paste
      • Capitalisation, punctuation frequently differs
      • Synonyms are accidentally substituted
  • Hybrid closer to ground-truth, and more complete, but results not clear-cut.
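One plausible reading of the "first keyword" and "top two" figures is a hit rate at k: the share of publications for which at least one of the first k suggested keywords appears in the ground truth. The sketch below computes that measure on invented toy data; the function name, the data and this interpretation of the metric are all assumptions, and it does not reproduce the study's numbers.

```python
def hit_rate_at_k(suggested, relevant, k):
    """Fraction of documents with at least one relevant keyword
    among the first k suggestions."""
    hits = sum(
        1
        for doc, keywords in suggested.items()
        if any(kw in relevant[doc] for kw in keywords[:k])
    )
    return hits / len(suggested)


# Toy data (invented): ranked suggestions and ground-truth keywords.
suggested = {
    "paper-a": ["hmm", "pdf"],
    "paper-b": ["ocr", "metadata"],
    "paper-c": ["layout", "fonts"],
}
relevant = {
    "paper-a": {"hmm"},
    "paper-b": {"metadata"},
    "paper-c": {"repositories"},
}

rate_top1 = hit_rate_at_k(suggested, relevant, 1)
rate_top2 = hit_rate_at_k(suggested, relevant, 2)
```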
Qualitative results
  • Most users preferred the hybrid mode
  • Most perceived it to be faster than manual data entry
  • Few believed the hybrid approach to be more accurate; in practice, there was no significant difference in quality between the hybrid and manual approaches
  • Both were of good quality
Discussion
  • Results support hypotheses
  • People prefer the hybrid interface, and found it more satisfying to use
  • Accessibility issues exist, but can be overcome
  • The punchline: one subject actually preferred manual entry because the hybrid system filled in metadata fields that he preferred to leave empty, i.e. it did more than the subject wanted!
MetRe prototype (2008)
  • Characteristic classes of individual/systematic error highlighted
  • NB: local and general best practice.
  • Uses: ranking, browsing, correcting systematic error
  • Uses info from intra-/inter-repository harvested metadata to identify patterns, rank occurrences and co-occurrences
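Ranking occurrences in harvested metadata can surface inconsistent spellings of the same value. The sketch below groups values that differ only in case or punctuation and proposes the most frequent form as the preferred one. It is a hypothetical illustration of the general idea, not MetRe's actual method; the function name, the normalisation rule and the sample values are all assumptions.

```python
from collections import Counter, defaultdict


def rank_variants(values):
    """Group values that match after case-folding and stripping
    non-alphanumerics; report groups with more than one spelling as
    (most frequent form, [other forms])."""
    groups = defaultdict(Counter)
    for value in values:
        key = "".join(ch for ch in value.casefold() if ch.isalnum())
        groups[key][value] += 1
    report = []
    for counts in groups.values():
        if len(counts) > 1:
            (preferred, _), *others = counts.most_common()
            report.append((preferred, [form for form, _ in others]))
    return report


# Hypothetical harvested publisher values.
harvested = [
    "University of Bath",
    "University of Bath",
    "university of bath",
    "Univ. of Bath",
    "UKOLN",
]
variants = rank_variants(harvested)
```

An abbreviation such as "Univ. of Bath" is not caught by this rule, which illustrates why such checks tend to need domain-specific knowledge.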
Issues
  • Discipline/domain-specific issues
  • Lots of information required to do this right (see metadata schema/terminology registry)
  • Some application profiles (APs) present particular difficulties, such as SWAP (FRBR structure, linking objects by ‘Scholarly Work’)
Approach
  • Generally dependent on heuristics over available data
  • Powered by very specific functions (classifiers, validation, etc.)
  • Potentially expensive, not always domain-independent
Future work
  • More!
      • Data
      • Filters (input/output formats)
      • Methods
      • Evaluation
  • Service availability (mail me for announcements!)
Conclusion
  • Metadata creation can be supported through software
  • Specific problem sets in metadata triage
  • Work continues in the FixRep project
Conclusion (II)
  • Formal metadata extraction/evaluation
  • Metadata review process
  • Accessibility metadata
  • Entity extraction (named entities, geographical, temporal [k-int!])
  • Repository integration
Thanks!
  • Comments/Questions?
  • www.ukoln.ac.uk/projects/fixrep
