Metadata Quality Evaluation: Experience from the Open Language Archives Community
Baden Hughes
Department of Computer Science and Software Engineering
University of Melbourne
[email_address]
Presentation Overview
  • Introduction
  • OLAC Community Background
  • Motivation
  • Algorithm Design
  • Implementation
  • Demo
  • Evaluation
  • Future Directions
  • Conclusion
Introduction
  • Distributed metadata creation practices result in highly variable metadata quality
  • A lack of metadata quality evaluation tools and methodologies makes this variation difficult to address
  • Our contribution is a suite of metadata quality evaluation tools within a specific OAI sub-domain, the Open Language Archives Community
  • A significant feature of these tools is the ability to assess metadata quality against both actual community practice and external best practice standards
Open Language Archives Community (OLAC)
  • An open consortium of 29 linguistic data archives cataloguing 27K language-related objects
  • OLAC metadata is a Dublin Core application profile for domain-specific areas: language, linguistic type, subject language, linguistic subject, linguistic role
  • Based on the OAI architecture: a two-tiered data provider and service provider model linked by the OAI Protocol for Metadata Harvesting (OAI-PMH)
  • A number of OLAC innovations have motivated the development of OAI services: static repositories, virtual service providers, personal metadata creation and management tools
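
As a rough illustration of the harvesting link between data providers and service providers, the sketch below builds an OAI-PMH `ListRecords` request URL. The `verb`, `metadataPrefix`, and `resumptionToken` arguments come from the OAI-PMH protocol; the base URL is a placeholder, the function name is hypothetical, and the default prefix `olac` is an assumption about how a repository exposes its OLAC records.

```python
from typing import Optional
from urllib.parse import urlencode


def list_records_url(base_url: str,
                     metadata_prefix: str = "olac",  # assumed prefix for OLAC records
                     resumption_token: Optional[str] = None) -> str:
    """Build an OAI-PMH ListRecords request URL for a data provider."""
    if resumption_token is not None:
        # In OAI-PMH, resumptionToken is an exclusive argument: a follow-up
        # request carries only the verb and the token.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return f"{base_url}?{urlencode(params)}"


# Placeholder base URL, not a real OLAC data provider.
print(list_records_url("http://example.org/oai"))
```

A service provider would fetch this URL, parse the returned XML, and repeat the request with each resumption token until the list is exhausted.
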
Motivation
  • To establish infrastructural support for ongoing metadata quality evaluation
  • Validation tools for higher layer interoperability such as OAI work well for conformance checking
  • At a community level, we generally lack tools that provide qualitative analyses in both semantic and syntactic modes
  • Differentiating factors of our work: establishing a common baseline; assisting individual data providers directly; assessing use of OLAC controlled vocabularies (CVs)
Algorithm Design #1
  • Basic objective is to generate a score for each metadata record based on Dublin Core and OLAC best practice recommendations
  • Code Existence Score: number of elements containing code attributes divided by the number of elements of a type associated with a controlled vocabulary in the record
  • Element Absence Penalty: number of core elements absent divided by total number of core elements in the record
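
The two per-record ratios defined on this slide can be sketched as follows. This is a minimal illustration, not the deployed code: the `Element` type, the choice of the 15 Dublin Core elements as the "core" set, the set of elements treated as CV-associated, and the decision to return 0.0 when a record has no CV-associated elements are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

# Assumption: the 15 Dublin Core elements stand in for the "core elements".
DC_CORE = frozenset({
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
})

# Assumption: an illustrative set of element types that can carry an OLAC code.
CV_ELEMENTS = frozenset({"subject", "language", "type", "creator", "contributor"})


@dataclass(frozen=True)
class Element:
    """One metadata element; `code` is the controlled-vocabulary code, if any."""
    name: str
    code: Optional[str] = None


def code_existence_score(record: Sequence[Element]) -> float:
    """Coded CV-associated elements divided by all CV-associated elements."""
    cv_elements = [e for e in record if e.name in CV_ELEMENTS]
    if not cv_elements:
        return 0.0  # assumption: nothing to code means no credit
    coded = sum(1 for e in cv_elements if e.code is not None)
    return coded / len(cv_elements)


def element_absence_penalty(record: Sequence[Element]) -> float:
    """Absent core elements divided by the size of the core element set."""
    present = {e.name for e in record}
    return len(DC_CORE - present) / len(DC_CORE)


# Hypothetical record; the code values are placeholders, not real OLAC codes.
record = [
    Element("title"),
    Element("subject", code="code-a"),
    Element("subject"),
    Element("type", code="code-b"),
    Element("date"),
]
print(code_existence_score(record), element_absence_penalty(record))
```

In this example two of the three CV-associated elements carry a code, and eleven of the fifteen assumed core elements are missing, so a record can do reasonably well on one measure while doing poorly on the other.
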
Algorithm Design #2
  • Per Metadata Record Weighted Aggregate: an arbitrary maximum multiplied by the weighted product of Code Existence Score and Element Absence Penalty
  • Derivative metrics: archive diversity, metadata quality score, core elements per record, core element usage, code usage, code and element usage, “star rating”
  • Using these metrics, we compute a score for each metadata record in an archive; each archive in total; and for the whole community
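
The slide does not give the weights, the maximum, or the exact way the penalty enters the product, so the following is only one possible reading. It assumes a maximum of 10, treats the weights as exponents (both 1.0 by default), converts the Element Absence Penalty into a presence ratio `1 - penalty` so that missing elements lower the score, and rolls records up to archive and community level with a plain arithmetic mean. All function names are hypothetical.

```python
from statistics import mean
from typing import Dict, List, Tuple

MAX_SCORE = 10.0  # assumption: the "arbitrary maximum"

# Each record is summarised as (code_existence_score, element_absence_penalty).
RecordRatios = Tuple[float, float]


def record_score(ces: float, eap: float,
                 w_code: float = 1.0, w_elem: float = 1.0,
                 max_score: float = MAX_SCORE) -> float:
    """Weighted aggregate for one record under the assumptions stated above."""
    return max_score * (ces ** w_code) * ((1.0 - eap) ** w_elem)


def archive_score(records: List[RecordRatios]) -> float:
    """Assumed rollup: mean record score within one archive."""
    return mean(record_score(ces, eap) for ces, eap in records)


def community_score(archives: Dict[str, List[RecordRatios]]) -> float:
    """Assumed rollup: mean record score over all records of all archives."""
    return mean(
        record_score(ces, eap)
        for records in archives.values()
        for ces, eap in records
    )


# Hypothetical archives with made-up ratios.
archives = {
    "archive-a": [(1.0, 0.0)],
    "archive-b": [(0.5, 0.5), (0.5, 0.5)],
}
for name, records in archives.items():
    print(name, archive_score(records))
print("community", community_score(archives))
```

Pooling all records for the community score lets large archives dominate; averaging the per-archive scores instead would weight every archive equally, which is a different design choice.
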
Implementation
  • Live service at http://www.language-archives.org/tools/reports/archiveReportCard.php
  • Metadata quality evaluation suite installed in the service layer on top of the OLAC Harvester and Aggregator
  • Based on Apache, MySQL, and PHP; runs on Windows/Mac/Linux
  • All codebase components are open source, licensed under GPL, and available from SourceForge: http://sf.net/projects/olac
Demo
  • Metadata quality report on all OLAC Data Providers [ Live ] [ Local ]
  • Metadata quality report for a single OLAC data provider (PARADISEC) [ Live ] [ Local ]
Evaluation #1
  • Creating a data provider ranking system was not a primary goal of the work reported here
  • Per data provider
      • Apparently no systematic correlation between size of archive and overall metadata quality
      • A positive correlation between size of archive and the average number of elements per metadata record
  • Community-wide
      • Additional evidence supporting earlier work as to most common metadata elements
      • 4 distinct classes: (1) subject; (2) title, description, date, identifier, creator; (3) format, type, contributor, publisher, isPartOf; (4) all others (including OLAC CVs)
Evaluation #2
  • Qualitatively-based archive clustering
      • 3 distinct groups of archives based on Per Metadata Record Weighted Aggregate
      • Characterised by metadata creation technique, size, number of elements used, application of OLAC controlled vocabularies
  • Use of OLAC CVs
      • Subject: OLAC CV used 56% of the time, for language identification where the DC recommendation of ISO 639-2 is too coarse
      • Contributor: OLAC CV used 78% of the time, for distinct roles in the linguistic data creation/curation process
      • Type: OLAC CV used 33% of the time, surprising given the domain requirement for differentiating linguistic data types
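
Controlled-vocabulary usage figures of the kind reported on this slide could be derived along the following lines. The sketch assumes usage is counted per element occurrence (the slide does not say whether the unit is the occurrence or the record), and the function name and sample data are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def cv_usage_percent(elements: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Percentage of occurrences of each element that carry a CV code.

    `elements` yields (element_name, has_code) pairs harvested from records.
    """
    total: Counter = Counter()
    coded: Counter = Counter()
    for name, has_code in elements:
        total[name] += 1
        if has_code:
            coded[name] += 1
    return {name: 100.0 * coded[name] / total[name] for name in total}


# Made-up occurrences, not OLAC data.
sample = [
    ("subject", True), ("subject", False),
    ("type", False),
    ("contributor", True),
]
print(cv_usage_percent(sample))
```
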
Future Directions
  • Algorithm improvements, particularly weighting in proportion to size of data provider
  • A longitudinal study of metadata evolution, including qualitative aspects (commenced, and retrofitted to Jan 2002)
  • New services based on quality attributes: the OLAC Search Engine uses metadata quality as a ranking scheme for result sets
  • New metrics which reflect other values of the OLAC community, e.g. online data, use of CVs
Conclusions
  • Reported the design and deployment of scalable, dynamic metadata quality evaluation infrastructure
  • A distinct contribution in the absence of comparable services and models; our code is open to the community to experiment with
  • Allowing more accurate identification of leverage points for metadata enrichment effort
  • Promoting better practice in metadata development and management
  • Ultimately enabling better search and retrieval experiences for end users
Acknowledgements
  • National Science Foundation Grants #9910603 (International Standards in Language Engineering) and #0094934 (Querying Linguistic Databases)
  • Amol Kamat, Steven Bird and Gary Simons
  • ICADL Program Committee and Reviewers

