Metadata
25 October 2010
Weekly reflection
• What digital “stuff” do you have? Where
do you put it? How do you organize it, if
you do? How do you find it when you
need it?
Tool of the week: Self-efficacy
• In the course of your career, you will have to
do things you don’t entirely know how to do.
• Technical and non-technical!
• Without training, guidance, or clear instructions.
• No, of course we don’t teach you everything in library
school!
• Learn to dive in despite imperfect knowledge.
• Use your common sense.
• Trust that those around you want you to succeed.
• If you need to, research! Always be ready to learn.
• Mentors are great... but they’re not babysitters.
• Accept imperfection.
• Please model these behaviors in my class!
Tip of the week: Staying informed
• Weblogs and newsfeeds are your friends.
• If you are not reading at least a few librarian blogs,
you are not staying informed.
• Can’t hurt to pick up some journal TOCs too.
• Blogs are faster than the published literature! And
often written by the same people.
• For (library) tech:
• Librarian in Black
• Planet Code4Lib
• librarian.net
• Lifehacker, Gizmodo, Engadget
• Roy Tennant’s LJ columns
What is metadata?
• Heck, I dunno. I’m not sure that’s even a
useful question.
• This is one reason I’m not a library-school
professor. Definitional pilpul bores me.
• Operationally: when we collect stuff, we
take notes on it so we can organize it,
inventory it, find it later, etc. Those
notes are metadata.
• Is MARC metadata? Well, of course!
• But many librarians don’t think about it that way.
Why are there so many
metadata standards?
• Different things described
• For an image, you want to know its bit depth and
colorspace. This has no meaning for a finding aid.
• Several targeted standards vastly easier to cope
with than one supposedly universal standard.
• Different purposes
• More on this in a moment
• Different provider and user communities
• Level of detail/specificity
• Wheel (or toothbrush) reinvention
Metadata file formats
• You can express metadata in an Excel
spreadsheet, a MARC record, XML, RDF...
• But some expressions are more readable, useful,
and reusable than others!
• Metadata librarians spend a lot of time fixing and
transforming Other People’s Metadata, in as
automated a fashion as possible.
• Large majority of modern metadata
standards expressed in XML.
• Though RDF wants to be a contender, and XML is
only one way of several to express RDF.
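To make the "fixing and transforming, as automated as possible" point concrete, here is a minimal sketch that turns spreadsheet-style rows (CSV) into simple XML records. The column names, element names, and sample row are made up for illustration; a real job would target an actual schema and handle far messier input.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical spreadsheet export: the columns are illustrative only.
# The date cell is deliberately left empty to show how gaps are handled.
SAMPLE_CSV = """title,creator,date
Innkeeper at the Roach Motel,Dorothea Salo,
"""

def csv_to_xml(csv_text):
    """Turn each CSV row into a <record> element with one child per column."""
    root = ET.Element("records")
    for row in csv.DictReader(io.StringIO(csv_text)):
        record = ET.SubElement(root, "record")
        for column, value in row.items():
            # Skip empty cells rather than emitting empty elements.
            if value and value.strip():
                ET.SubElement(record, column).text = value.strip()
    return root

xml_text = ET.tostring(csv_to_xml(SAMPLE_CSV), encoding="unicode")
print(xml_text)
```

Using column headers directly as element names only works when the headers are valid XML names; anything messier needs an explicit mapping.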
So what’s this RDF thing all the
cool kids are talking about?
• Resource Description Framework
• by the W3C
• Like XML, RDF is more or less friendly to
whatever kind of metadata you want to
throw at it.
• Unlike XML, RDF is a data model designed for integrating
information from different metadata vocabularies, and
expressing how items and metadata records relate to one
another. Links and linking!
• (Also, XML works for content, e.g. TEI. RDF doesn’t.)
(very) Basic RDF
• “Triple:” subject, property, value
• A little like subject, verb, object in English.
• Dorothea Salo is the author of “Innkeeper
at the Roach Motel.”
• Subject: either me or the article (works either way,
depending on property chosen)
• Property: authorship (“isAuthorOf” or “isBy”); often
comes from a controlled vocabulary like Dublin Core
• Value: either the article or me, depending
• One annoying thing: URIs as identifiers
• What is my URI? Or the article’s (several versions)?
• Several other annoying things about RDF, but they’re
super-nerdy.
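The triple idea can be sketched with plain Python tuples, no RDF library needed. The example.org URIs below are placeholders invented for illustration (which is exactly the "what is my URI?" problem); the two property URIs are the Dublin Core terms for creator and title.

```python
# Each triple is (subject, property, value).
# The example.org URIs are hypothetical, not real identifiers.
DC_CREATOR = "http://purl.org/dc/terms/creator"
DC_TITLE = "http://purl.org/dc/terms/title"
ARTICLE = "http://example.org/article/roach-motel"
AUTHOR = "http://example.org/person/dorothea-salo"

triples = [
    (ARTICLE, DC_CREATOR, AUTHOR),
    (ARTICLE, DC_TITLE, "Innkeeper at the Roach Motel"),
]

def values_of(subject, prop, store):
    """Return every value recorded for a given subject and property."""
    return [v for (s, p, v) in store if s == subject and p == prop]

def subjects_with(prop, value, store):
    """Look the other way: which subjects have this value for this property?"""
    return [s for (s, p, v) in store if p == prop and v == value]

print(values_of(ARTICLE, DC_CREATOR, triples))     # who made the article?
print(subjects_with(DC_CREATOR, AUTHOR, triples))  # what did this person make?
```

Here the article is the subject and the person is the value; choosing an "isAuthorOf"-style property instead would simply swap the two.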
Linked data
• As the web linked documents and people,
it’s now time (say some) to link data.
• Not a simple proposition!
• RDF is hard. Calling it linked data doesn’t make it easier.
• Data modeling is hard.
• Data integration is hard. RDF makes it easier... up to a
point. Still HUGE problems around people using the
same term differently, other unexamined assumptions.
• Idea gaining traction among governments,
other big data providers.
• So we probably need to keep our eye on it.
• ALWAYS a good idea to think about how
other people might use your metadata.
Kinds of metadata
• Descriptive (“bibliographic”)
• Who made this? When? Where? What’s it about? Etc.
• Technical
• What is this? What is its format? What made it? Etc.
• Administrative
• Who owns this? Who’s changed it? Who has what IP
rights over it? Who can see it? Etc.
• Structural
• How is this thing put together?
• In practice, the landscape is muddier.
• Most standards have bits of two or more types.
• Also, “relationship” metadata coming to the fore.
Descriptive metadata:
MODS
• Metadata Object Description Schema
• Maintained by Library of Congress
• Stripped-down, human-readable MARC
in XML
• http://www.loc.gov/standards/mods/
• Sample: http://www.loc.gov/standards/mods/v3/mods99042030.xml
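To give a feel for what "human-readable MARC in XML" means in practice, here is a rough sketch that reads a title and names out of a MODS record with Python's standard library. The record is a tiny hand-written one for illustration, not the Library of Congress sample, and real records carry many more elements.

```python
import xml.etree.ElementTree as ET

MODS_NS = {"mods": "http://www.loc.gov/mods/v3"}

# A minimal hand-written record, for illustration only.
SAMPLE_MODS = """
<mods xmlns="http://www.loc.gov/mods/v3">
  <titleInfo><title>Innkeeper at the Roach Motel</title></titleInfo>
  <name type="personal">
    <namePart>Salo, Dorothea</namePart>
    <role><roleTerm type="text">author</roleTerm></role>
  </name>
</mods>
"""

def summarize_mods(xml_text):
    """Pull the title and any name parts out of a single MODS record."""
    root = ET.fromstring(xml_text.strip())
    title = root.findtext("mods:titleInfo/mods:title", namespaces=MODS_NS)
    names = [n.text for n in root.findall("mods:name/mods:namePart", MODS_NS)]
    return {"title": title, "names": names}

print(summarize_mods(SAMPLE_MODS))
```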
Technical metadata: MIX
• Metadata for Images in XML
• By Library of Congress, NISO
• Captures information about an image’s file
format and other technical characteristics
• Why? Think about file-format
obsolescence.
• http://www.loc.gov/standards/mix/
• Sample document: http://www.loc.gov/standards/mix/instances/test_mix10.xml
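Technical metadata of this kind can often be read straight out of the file. As a sketch, the function below pulls width, height, bit depth, and color type from the header of a PNG image. It produces a plain Python dictionary, not a MIX record, and the demo bytes are only a synthetic header built for illustration, not a complete image.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
COLOR_TYPES = {
    0: "grayscale",
    2: "truecolor (RGB)",
    3: "indexed",
    4: "grayscale + alpha",
    6: "truecolor + alpha (RGBA)",
}

def png_technical_metadata(data):
    """Read width, height, bit depth, and color type from a PNG's IHDR chunk."""
    if data[:8] != PNG_SIGNATURE:
        raise ValueError("not a PNG file")
    # After the signature: 4-byte chunk length, 4-byte chunk type, then data.
    length, chunk_type = struct.unpack(">I4s", data[8:16])
    if chunk_type != b"IHDR" or length != 13:
        raise ValueError("PNG does not start with a valid IHDR chunk")
    width, height, bit_depth, color_type = struct.unpack(">IIBB", data[16:26])
    return {
        "width": width,
        "height": height,
        "bit_depth": bit_depth,
        "color_type": COLOR_TYPES.get(color_type, "unknown"),
    }

# Synthetic header only (no pixel data, no checksum), for demonstration.
demo_header = PNG_SIGNATURE + struct.pack(
    ">I4sIIBBBBB", 13, b"IHDR", 640, 480, 8, 2, 0, 0, 0
)
print(png_technical_metadata(demo_header))
```

For a real file, pass in the first few dozen bytes read in binary mode.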
Administrative
metadata: PREMIS
• PREservation Metadata: Implementation
Strategies
• who comes up with these acronyms?
• Library of Congress, again
• Designed to track digital preservation
activity across an object’s lifecycle
• http://www.loc.gov/standards/premis/
• Samples: look in http://www.dlib.org/dlib/september08/dappert/09dappert.html
• But be aware that PREMIS is usually embedded in
other metadata, like METS.
Structural metadata:
METS
• Metadata Encoding and Transmission
Standard
• By... guess who?
• Wrapper for other kinds of metadata;
delineates the structure of a complex
digital object
• http://www.loc.gov/standards/mets/
• Samples: http://www.loc.gov/standards/mets/mets-examples.html
Metadata spaghetti: TEI
• Text Encoding Initiative
• by the TEI Consortium
• For digital transcriptions of books,
manuscripts, dictionaries, etc. etc.
• Content standard, not metadata standard!
But contains its own “metadata header”
• This header sometimes reused in other contexts
• Moral: Sometimes content “embeds”
metadata.
• This is OK, but should every content standard roll its
own internal metadata?
Where does metadata
come from?
• Human data entry
• Slow, expensive, error-prone
• Often semi-automatable (80/20 point)
• If you can automate, DO IT. Do not waste keystrokes!
• Auto-extracting from a content object
• Common for technical metadata
• Auto-capture by preservation system
• Common for some administrative metadata
• Grabbing from elsewhere
• From other metadata: “crosswalking”
• HTML screenscraping, Excel spreadsheets
• Issues: authority control? granularity? accuracy?
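Crosswalking, at its simplest, is a lookup table plus a loop. The sketch below maps a handful of MARC tags to Dublin Core element names. The mapping is deliberately simplified and is not the official Library of Congress crosswalk; the sample field values are invented. Note how it reports what it could not map, which is where the granularity and accuracy issues show up.

```python
# Simplified, illustrative mapping from MARC tags to Dublin Core elements.
# A real crosswalk also handles subfields, indicators, and many more tags.
MARC_TO_DC = {
    "100": "creator",
    "245": "title",
    "260": "publisher",
    "650": "subject",
}

def crosswalk(marc_fields):
    """marc_fields: list of (tag, value) pairs.

    Returns (dc_record, unmapped_tags); dc_record maps each Dublin Core
    element to a list of values, since elements may repeat.
    """
    dc_record = {}
    unmapped = []
    for tag, value in marc_fields:
        element = MARC_TO_DC.get(tag)
        if element is None:
            unmapped.append(tag)  # keep track of what got lost
            continue
        dc_record.setdefault(element, []).append(value)
    return dc_record, unmapped

# Made-up sample fields for demonstration.
sample = [
    ("100", "Salo, Dorothea"),
    ("245", "Innkeeper at the Roach Motel"),
    ("650", "Institutional repositories"),
    ("650", "Open access publishing"),
    ("300", "1 online resource"),
]
record, lost = crosswalk(sample)
print(record)
print("Unmapped tags:", lost)
```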
Subject metadata,
specifically
• What is this thing about?
• Plenty of variation in sources
• Author’s keyword vs. indexer’s descriptor
• Controlled vocabulary vs. free-form keywording
• Community tagging/“folksonomy”
• Mechanically-extracted keywords
• All of this matters if you’re searching!
Where does metadata live?
• In XML files (or MARC files, or...)
• In relational databases
• In RDF “triple stores” (special databases)
• In content objects (as with TEI)
• Or some combination of the above!
• E.g. DSpace: can accept metadata in an XML file; stores
all metadata in relational database
• Next trick: associating content with its
metadata!
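Here is one way the "metadata in a relational database, associated with its content" idea might look, sketched with SQLite. The table and column names are hypothetical (this is not DSpace's actual schema), and the file path is invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE item (
    id        INTEGER PRIMARY KEY,
    file_path TEXT NOT NULL          -- where the content object lives
);
CREATE TABLE metadata_value (
    item_id INTEGER NOT NULL REFERENCES item(id),
    element TEXT NOT NULL,           -- e.g. 'title', 'creator'
    value   TEXT NOT NULL
);
""")

# One content object and two metadata statements about it.
conn.execute("INSERT INTO item (id, file_path) VALUES (1, 'files/roach-motel.pdf')")
conn.executemany(
    "INSERT INTO metadata_value (item_id, element, value) VALUES (?, ?, ?)",
    [
        (1, "title", "Innkeeper at the Roach Motel"),
        (1, "creator", "Salo, Dorothea"),
    ],
)

def files_for(element, value):
    """Find the content files whose metadata matches element = value."""
    rows = conn.execute(
        "SELECT item.file_path FROM item "
        "JOIN metadata_value ON metadata_value.item_id = item.id "
        "WHERE metadata_value.element = ? AND metadata_value.value = ?",
        (element, value),
    )
    return [path for (path,) in rows]

print(files_for("creator", "Salo, Dorothea"))
```

The item id is what ties the metadata rows to the content file; lose that link and the metadata describes nothing.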
What is done with metadata?
• To search against it or use it to browse,
you need to “index” it first.
• Turn it inside-out: records containing terms --> list
of terms and the records they appear in
• It’s all more complicated: stemming, phrases,
variant spellings, languages, stopwords, etc.
• The hot new indexing software is Apache “Solr.”
Underlies Blacklight (from UVa), which underlies Forward.
• Full-text search works the same way!
• Google’s index: MASSIVE database of words with
the web pages they appear in.
• Spider/crawler: program that follows links across
the web and indexes page content
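The "turn it inside-out" step can be sketched in a few lines. This is a toy inverted index with a small made-up stopword list and no stemming, phrase handling, or spelling variants, so it is nowhere near what real indexing software does. The sample records are invented.

```python
from collections import defaultdict

STOPWORDS = {"a", "an", "and", "of", "the", "in", "for"}

def build_index(records):
    """records: dict of record id -> text. Returns term -> set of record ids."""
    index = defaultdict(set)
    for record_id, text in records.items():
        for word in text.lower().split():
            term = word.strip(".,;:!?\"'()")
            if term and term not in STOPWORDS:
                index[term].add(record_id)
    return index

def search(index, query):
    """Return ids of records containing every (non-stopword) query term."""
    terms = [t for t in query.lower().split() if t not in STOPWORDS]
    if not terms:
        return set()
    results = set(index.get(terms[0], set()))
    for term in terms[1:]:
        results &= index.get(term, set())
    return results

# Invented sample records.
records = {
    1: "Metadata for images",
    2: "Preservation metadata in practice",
    3: "Images of the prairie",
}
index = build_index(records)
print(search(index, "metadata"))
print(search(index, "metadata images"))
```

Searching is now a lookup in the term list rather than a scan of every record, which is the whole point of indexing first.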
Relevance ranking
• You have a bunch of words and the records
or documents they appear in. How do you
decide which records/pages to display first?
• Traditionally in libraries: last-in-first-out. Awful.
• Using document structure and metadata
• If the word’s in a title, heading, or subject field, take it
more seriously than if it’s just in ordinary text.
• TF/IDF
• Term frequency: how often the search term shows up in
a given record/document
• Inverse document frequency: how rare the search term
is in the whole mass of records/documents.
TF (one record) vs. IDF (whole corpus)
• High TF, rare term: Super-relevant!
• Low TF, rare term: Record not “about” this term
• High TF, common term: Overused word or stopword
• Low TF, common term: Irrelevant
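For the arithmetic behind TF/IDF, here is a small sketch. There are many variants of the formula; this one uses term count divided by record length for TF and the natural log of (records in corpus / records containing the term) for IDF, which is an assumption on my part rather than what any particular search engine does. The three-record corpus is invented.

```python
import math

def tf_idf(term, record, corpus):
    """record: list of words; corpus: list of such records."""
    if not record:
        return 0.0
    # Term frequency: share of this record's words that are the term.
    tf = record.count(term) / len(record)
    # Inverse document frequency: rarer across the corpus means higher.
    containing = sum(1 for r in corpus if term in r)
    if containing == 0:
        return 0.0
    idf = math.log(len(corpus) / containing)
    return tf * idf

# Invented toy corpus, already split into lowercase words.
corpus = [
    ["metadata", "for", "the", "images"],
    ["the", "preservation", "of", "the", "images"],
    ["the", "prairie", "the", "sky"],
]

for term in ["metadata", "images", "the", "prairie"]:
    print(term, round(tf_idf(term, corpus[0], corpus), 3))
```

In the first record, "metadata" (found nowhere else) outscores "images" (found in two records), "the" scores zero because it is everywhere, and "prairie" scores zero because it is not in that record at all.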
What other information can
be used to gauge relevance?
• People pointing
• Google: PageRank, based on counting (and weighting)
links to a document
• Scholarly communication: many metrics based on
later citation of articles
• People choosing
• Google also up-votes pages based on people
clicking on them in search results.
• Individual or social history of interests
• Amazon, Netflix
• Notice who’s doing this and who isn’t.
• Serious question: what about privacy?
http://xkcd.com/522
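To show what "people pointing" can mean computationally, here is a simplified sketch of the published PageRank idea: a page's score depends on how many pages link to it and on how highly scored those pages are. This is a textbook-style iteration with an assumed damping factor of 0.85, not Google's production ranking, and the four-page "web" is invented.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links: dict of page -> list of pages it links to. Returns page -> score."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page gets a small baseline score.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its score among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its score evenly.
                for p in pages:
                    new_rank[p] += damping * rank[page] / n
        rank = new_rank
    return rank

# Invented link graph: three pages point at "c", and "c" points back at "a".
links = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
scores = pagerank(links)
for page in sorted(scores, key=scores.get, reverse=True):
    print(page, round(scores[page], 3))
```

Page "c" comes out on top because everyone points at it, and "a" comes second only because the top-ranked page points at it. Citation-based metrics in scholarly communication rest on a similar intuition.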
Search engine
optimization
• Making sure that your page turns up in
searches for relevant terms.
• Done maliciously, this amounts to spam. Google
spends LOTS of effort despamming its index.
• Clean markup helps. So does putting
highly relevant terms in highly visible/
important locations.
• Also, don’t overload pages! Dilutes vocabulary.
What else can you do with
relevance information?
• Point people to PEOPLE and SERVICES,
not just search results!
• Point people to context that will help
them evaluate search results.
• We know people just throw search terms at boxes.
We might as well work with that.
• This may well be the best work Forward
is doing.
A word about GIS
• “Geographic Information Systems”
• It’s metadata all the way down! Metadata
about places.
• Also a lot about how to represent and visualize that
metadata.
• And how to mash it up with other data.
• Heavily based on relational-database
technology.
• HOT JOB MARKET. If you can get trained, do.
Finding and using
metadata standards
• Nobody knows every metadata
standard out there. I sure don’t.
• But faced with a new standard, I may
have to get up to speed fast.
• I may even be making adoption decisions.
• So here’s how I do it.
Getting up to speed
• Find its website. If it doesn’t have a
website, you don’t want to use it.
• Is the website current? Is there recent activity?
• Is there a list of who’s using this standard?
• Find a sample record.
• How is this standard expressed? XML, RDF, what?
• Does it pass a sniff test?
• Find the documentation and community.
• “Tag libraries” and “data dictionaries” especially helpful.
• Primers, “getting started” documents also nice.
• Look for tools.
• Authoring/crosswalk tools (and programming libraries)
• Validation tools
