A pair of shoes in the thesaurus; some reflexions on human and computer indexing

A pair of shoes in the thesaurus: reflexions on human and computer indexing
Eric Sieverts
Media, information & communication, Amsterdam University of Applied Sciences / Section Innovation & Development, University Library Utrecht
Society of Indexers Conference 2010: The challenging future of indexing
30 September 2010, Middelburg

agenda
- the holy grail for search systems: let people find what they search for
- searching in the world of Google
- what's wrong with Google (and the like)
- metadata and indexing
- indexing and knowledge organization
- knowledge organization and the semantic web
Eric Sieverts | e.g.sieverts@uu.nl | e.g.sieverts@hva.nl | http://www.library.uu.nl/medew/it/eric | Middelburg 30-9-2010

searching in the world of Google
- Google appears to be "the measure of all things" in search: with Google "everything can be found"
- but isn't there a paradox? if Google (or Yahoo! or Bing) contains everything (more than 500 billion items), can "it" still be found?
- >> anticipating the user's intentions and peerless ranking algorithms become increasingly important

the basic search-and-find paradigm
[diagram: the searcher's query ("search, search, search, ...") and the documents, joined by "match"]

validity for free-text matching?
[diagram: search query and documents, joined by "match"]
- "just read; it does not mean what you're reading" (paraphrasing the Dutch poetry title "Lees maar er staat niet wat er staat")
- How does Google know what you mean?
- How does Google know what a document means?

[image: thesaurus.jpg]
- is this meant to be representative of the ease of use of thesauri?
- to what query is this Google's answer?

Want to know something about "hallenkerken" (Dutch for "hall churches") through Google Books? Google's first hit is a book about building thesauri, which contains the word only in a single example of broader and narrower terms.

searching in the world of Google
- the new Google Instant tries to predict user intent (the holy grail for search engine developers)
- after 1 or 2 letters have been typed, it already presents results for the statistically most probable (longer) words
- but is Google really guessing right?

classical situation with controlled human indexing
[diagram: search query and documents, joined by "match"]
- the searcher must enter the "term(s)" that have been used to characterize the subject
- the indexer must assign "correct" terms to characterize the document
- in principle a perfect match is possible

classical situation with controlled human indexing
however:
- not user-friendly: the searcher has to come up with the correct terms
- expensive: indexers must analyze the document in order to assign the correct terms

search in the world of Google
[diagram: search query and documents, joined by "match"]
- the searcher just types some words (or often only one single word)
- the search system contains (all) the words from the documents themselves
- often you don't find all you need: still satisfied?

why still user satisfaction?
despite recall and precision problems:
- the search system looks attractively simple
- the searcher always finds something (in 500 billion web pages)
- smart relevance ranking provides some relevant items among the first 10, for most (simple) questions and for the majority of users; very often even the #1 hit is already relevant
- and: who cares about lousy recall & precision (in the Google world)?

language technology at the searcher's side
[diagram: search query and documents, joined by "match"]
- the original simple query is expanded & disambiguated
- statistics generate additional terms to refine queries
- the search system contains just the words from the documents themselves
- will improved queries result in better answers?

language technology for a better "query"
- "word stemming" and "fuzzy search": automatically search for more word forms >> better recall
- a semantic network (or ontology) contains semantic relations between words: the query is expanded with semantically related terms >> better recall
- for the different meanings of a word, a semantic network (or ontology) contains relations with different words >> disambiguation >> better precision
- no scientific evidence yet about how much improvement
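The first two points above (stemming and expansion with related terms) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the two-suffix "stemmer" is deliberately crude, and the `RELATED` table is a hypothetical stand-in for a real semantic network or thesaurus.

```python
# Toy semantic network: each (stemmed) word maps to related words.
# Hypothetical data, standing in for a real thesaurus or ontology.
RELATED = {
    "church": {"chapel", "cathedral"},
    "index": {"register"},
}

# Deliberately crude; a real stemmer knows many more rules.
SUFFIXES = ("es", "s")


def stem(word: str) -> str:
    """Strip one plural suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def expand_query(query: str) -> set[str]:
    """Return the stemmed query words plus their related terms."""
    terms = set()
    for word in query.lower().split():
        base = stem(word)
        terms.add(base)
        terms |= RELATED.get(base, set())
    return terms


def matches(document: str, terms: set[str]) -> bool:
    """A document matches when any of its stemmed words is a query term."""
    return any(stem(word) in terms for word in document.lower().split())


terms = expand_query("churches")
print(sorted(terms))
print(matches("Two chapels near the river", terms))
```

A document that only mentions "chapels" is now found for the query "churches", which is the recall gain the slide refers to; the price is that every added term may also pull in irrelevant documents.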

language technology for a better "query"
- statistical analysis of the search result generates characteristic terms, from which the user can choose to refine the query
- such words can also be derived from a synonym list, thesaurus, semantic network, et cetera
- mostly >> better precision

language technology at the document
[diagram: search query and documents, joined by "match"]
- search with "correct" or "important" terms
- language technology enriches the document with the "correct" term (from a thesaurus) or derives characteristic terms from the text
- in principle a perfect match is possible

automatic classification or enrichment
1. deriving specific terms from the document itself
   - on the basis of word lists and text analysis, specific types of terms (e.g. names of persons, places, products, parties, companies, etc.) can be recognized and marked as such
2. adding characteristics to classify a document
   - after it has been trained, a system can analyze documents and classify them with terms from a thesaurus or with classes from a taxonomy
despite some limitations it's getting better all the time, even for less tangible tasks such as sentiment analysis

The Calais Web Service automatically creates rich semantic metadata: Named Entities, Facts, Events

geographical recognition in Google Books

training a system
[diagram: thesaurus and training documents, analysis module, "fingerprints", training module, enrichment of thesaurus]
© Joop van Gent, Irion

classification with the system
[diagram: enriched thesaurus and new documents, analysis module, "fingerprints", classification module, enriched documents]
© Joop van Gent, Irion
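The train-then-classify idea in these two diagrams can be sketched in a few lines of Python. This is only an illustration under loud assumptions: a "fingerprint" is taken here to be a plain word-frequency profile compared by cosine similarity, which is not claimed to be the method of the system shown, and the training sentences are invented.

```python
import math
from collections import Counter


def fingerprint(text: str) -> Counter:
    """Word-frequency profile of a text (an assumed stand-in for a 'fingerprint')."""
    return Counter(text.lower().split())


def train(examples: dict[str, list[str]]) -> dict[str, Counter]:
    """Merge the fingerprints of all training documents per thesaurus term."""
    profiles = {}
    for term, documents in examples.items():
        profile = Counter()
        for document in documents:
            profile += fingerprint(document)
        profiles[term] = profile
    return profiles


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-frequency profiles."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(text: str, profiles: dict[str, Counter]) -> str:
    """Assign the thesaurus term whose profile is most similar to the text."""
    doc = fingerprint(text)
    return max(profiles, key=lambda term: cosine(doc, profiles[term]))


# Invented training documents for two hypothetical thesaurus terms.
TRAINING = {
    "chess": ["checkmate with bishop and knight", "the knight attacks the king"],
    "equestrianism": ["the rider saddles the horse", "horse and rider jump the fence"],
}
profiles = train(TRAINING)
print(classify("endgame with bishop and knight", profiles))  # chess
```

With so little training text the result depends heavily on individual words; the sketch only shows the division of labour between a training step and a classification step.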

endgame tips: checkmate with bishop and knight (in Dutch the knight is called "horse"): chess or equestrianism?

knowledge organization systems
metadata: more than keywords or thesauri?

knowledge organization systems
- can be more than just metadata models or tools for subject indexing
- 4 types of KOS:
  - categorization systems (like classifications and taxonomies)
  - metadata models (like MARC or Dublin Core)
  - relational models (like thesauri, semantic networks, ontologies)
  - term lists (like authority files)
- more about ontologies in a moment

knowledge organization systems
4 types of functions for KOS:
- description and labeling (e.g. subject indexing with a thesaurus)
- definition (e.g. specification of the meaning of concepts in a thesaurus or ontology)
- translation (e.g. concordance between systems for interoperability)
- navigation (through the systematic structure of a taxonomy or classification, or the hierarchy of concepts in a thesaurus or ontology)
some of these play a role in the semantic web

ontologies
- "knowledge representation" in which knowledge about (a small part of) the world is stored
- mostly not directly used for subject indexing
- allow more complete and complex representations of reality than a thesaurus
  - with many possible types of relations between concepts
  - with fixed roles and properties of these concepts
- often for limited domains ("wine ontology")
- sometimes broader, in so-called "core ontologies"
- for example: CIDOC-CRM (conceptual reference model) for concepts, relations and properties in the field of cultural heritage

relations between some concepts in a simple "wine ontology"

example of the relations between concepts about the statue of Balzac by Rodin [in CIDOC-CRM]

ontologies
- "ontologies" in relation to the semantic web, in a more general sense: a general name for all kinds of subject indexing systems (thesauri, classifications, taxonomies, name authority lists, ...)
- essential requirement: an ontology must be available in a form that can be read, interpreted and processed by a computer program >> this needs notations and formal languages to describe it

ontology notation for the semantic web
- RDF (Resource Description Framework): standard to describe relations between an object and its metadata
- OWL (Web Ontology Language): standard for computer-readable description of ontologies
- RDFS (RDF Schema): standard for description of a KOS in RDF
- SKOS (Simple Knowledge Organization System): standard for describing KOSs and relations between them in RDF

resource description framework
- RDF is commonly written in XML (RDF/XML) to describe the relation between a resource (or object), its metadata and the metadata standards used
- resources should have a URI to refer to them
- RDF uses "namespaces" to refer to computer-readable descriptions of the standards (linked via a URL)
- RDF is meant to (re)use and to combine existing semantic systems
- properties (metadata) are registered in so-called triples: subject <predicate> object (which we could perhaps also write as: thing <property> value)
- RDF triples are used in "linked data"

rdf triples
subject <predicate> object
  doc1 <has author> auth1
  auth1 <has name> john smith
  auth1 <has affiliation> home inc.
  auth1 <has email> [email_address]
[figure: graphical representation of a simple network of 4 RDF triples]
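To make the triple idea concrete, here is a small Python sketch that stores the triples from this slide as plain tuples and follows links between them. It is only an illustration: real RDF identifies subjects and predicates with URIs, the helper names `objects` and `author_names` are invented here, and the e-mail triple is left out because its value is not given in the source.

```python
# Each triple is (subject, predicate, object), as on the slide.
# The e-mail triple is omitted: its value is not available in the source.
TRIPLES = [
    ("doc1", "has author", "auth1"),
    ("auth1", "has name", "john smith"),
    ("auth1", "has affiliation", "home inc."),
]


def objects(subject: str, predicate: str) -> list[str]:
    """All objects of triples with the given subject and predicate."""
    return [o for s, p, o in TRIPLES if s == subject and p == predicate]


def author_names(document: str) -> list[str]:
    """Follow two links in the network: document -> author -> name."""
    return [
        name
        for author in objects(document, "has author")
        for name in objects(author, "has name")
    ]


print(author_names("doc1"))  # ['john smith']
```

Because every statement has the same three-part shape, triples from different sources can simply be added to the same list and queried together, which is the property that "linked data" builds on.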

SKOS
the representation of a thesaurus term and its relations can be described in RDF
  Term: Economic cooperation
  Used For: Economic co-operation
  Broader terms: Economic policy
  Narrower terms: Economic integration, European economic cooperation, European industrial cooperation, Industrial cooperation
  Related terms: Interdependence
  Scope Note: Includes cooperative measures in banking, trade, industry etc., between and among countries.

SKOS representation in RDF
<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:skos="http://www.w3.org/2004/02/skos/core#">
  <skos:Concept>
    <skos:prefLabel>Economic cooperation</skos:prefLabel>
    <skos:altLabel>Economic co-operation</skos:altLabel>
    <skos:scopeNote>Includes cooperative measures in banking, trade, industry etc., between and among countries.</skos:scopeNote>
    <skos:broader>
      <skos:Concept>
        <skos:prefLabel>Economic policy</skos:prefLabel>
      </skos:Concept>
    </skos:broader>
    <skos:related>
      <skos:Concept>
        <skos:prefLabel>Interdependence</skos:prefLabel>
      </skos:Concept>
    </skos:related>
    <skos:narrower>
      <skos:Concept>
        <skos:prefLabel>Economic integration</skos:prefLabel>
      </skos:Concept>
    </skos:narrower>
    …
  </skos:Concept>
</rdf:RDF>
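That such a description can be "read, interpreted and processed by a computer program" can be shown with a short Python sketch using only the standard library's XML parser. It is a sketch under assumptions: the embedded XML is a shortened copy of the slide's example, the helper `labels` is invented here, and a real application would more likely use a dedicated RDF library than plain XML parsing.

```python
import xml.etree.ElementTree as ET

# Namespace prefixes used in the path expressions below.
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "skos": "http://www.w3.org/2004/02/skos/core#",
}

# Shortened copy of the SKOS example from the slide.
SKOS_XML = """<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:skos="http://www.w3.org/2004/02/skos/core#">
  <skos:Concept>
    <skos:prefLabel>Economic cooperation</skos:prefLabel>
    <skos:altLabel>Economic co-operation</skos:altLabel>
    <skos:broader>
      <skos:Concept><skos:prefLabel>Economic policy</skos:prefLabel></skos:Concept>
    </skos:broader>
    <skos:narrower>
      <skos:Concept><skos:prefLabel>Economic integration</skos:prefLabel></skos:Concept>
    </skos:narrower>
  </skos:Concept>
</rdf:RDF>"""


def labels(concept: ET.Element, relation: str) -> list[str]:
    """Preferred labels of the concepts reached via one SKOS relation."""
    path = f"skos:{relation}/skos:Concept/skos:prefLabel"
    return [label.text for label in concept.findall(path, NS)]


root = ET.fromstring(SKOS_XML)
concept = root.find("skos:Concept", NS)
term = concept.find("skos:prefLabel", NS).text

print("Term:", term)
print("Broader terms:", labels(concept, "broader"))
print("Narrower terms:", labels(concept, "narrower"))
```

The output mirrors the familiar thesaurus display (term, BT, NT), recovered by a program from the RDF description.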

RDF and "linked data"
a lot of buzz recently about "linked (open) data"
- it's just RDF triples, so it's computer-readable
- it's on the internet, so it's open
- it's meant to be re-used, so it's an important ingredient for the semantic web
- it's standardized, so it can be re-used
- everybody can (and has to) contribute data, so it is also somewhat messy

the "linked data cloud" - september 2010 - 24 billion RDF triples online

- viaf: virtual international authority file
- dbpedia: data from Wikipedia
- last.fm: artists
- geonames: 6.2 M toponyms
- BBC: wildlife finder
- LCSH
- Reuters: openCalais
- IMDB

topic maps
XML-based information systems
- that can be considered as ontologies
- that need no additional notations and/or standards to make them computer-readable
- that combine knowledge representations and the indexed information in a single self-contained, interlinked system
- suited to make local knowledge accessible

topic maps consist of concepts (= topics) that are characterized by:
- "names" (any word, or even several words, to describe them) (names are topics themselves as well!)
- "types" (describing to what class of concepts a topic belongs) (types are topics themselves as well!)
- "associations" (specified types of relations between topics) (associations are also topics, thus having types!)
- "occurrences" (information items "about" the concept-topic) (occurrences are also topics, thus having types!)
all of this is described in XML

simple example of an opera topic map (adapted from Pepper)
[diagram with the following elements]
- topic types: composer, opera, city, country
- topics: verdi, puccini, tosca, madame butterfly / madama butterfly, lucca, roma / rome, italy / italia / italië / italien
- association types: situated in, influenced, composed, location for, place of birth
- occurrences
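The elements of this example can be mimicked in a few lines of Python to show how names, types, associations and occurrences hang together. This is a loose sketch, not the Topic Maps XML syntax: the class and function names are invented, the topic identifiers are hypothetical, and which topics each association connects is an assumption, since the diagram itself is not available in this text.

```python
from dataclasses import dataclass, field


@dataclass
class Topic:
    topic_id: str
    topic_type: str                 # e.g. composer, opera, city, country
    names: list[str]                # a topic may carry several names
    occurrences: list[str] = field(default_factory=list)  # items "about" the topic


TOPICS = {
    topic.topic_id: topic
    for topic in [
        Topic("puccini", "composer", ["puccini"]),
        Topic("tosca", "opera", ["tosca"]),
        Topic("madama-butterfly", "opera", ["madama butterfly", "madame butterfly"]),
        Topic("lucca", "city", ["lucca"]),
        Topic("italy", "country", ["italy", "italia", "italië", "italien"]),
    ]
}

# (source topic, association type, target topic); the pairings are assumed.
ASSOCIATIONS = [
    ("puccini", "composed", "tosca"),
    ("puccini", "composed", "madama-butterfly"),
    ("puccini", "place of birth", "lucca"),
    ("lucca", "situated in", "italy"),
]


def find_topic(name: str):
    """Look a topic up by any of its names, whatever the language."""
    for topic in TOPICS.values():
        if name.lower() in topic.names:
            return topic
    return None


def related(topic_id: str, association_type: str) -> list[Topic]:
    """Topics reached from topic_id via one type of association."""
    return [TOPICS[t] for s, a, t in ASSOCIATIONS if s == topic_id and a == association_type]


print(find_topic("Italien").topic_id)
print([topic.topic_id for topic in related("puccini", "composed")])
```

Looking a topic up by any of its names is what lets a Dutch, German or Italian searcher arrive at the same concept and navigate from there along the associations.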

topic map application
- the Royal Academy of Music in London developed a model to describe "everything" around music, from the work/composition to the experience of a particular performance
- conceptually similar to the relational FRBR model in the library world
© Antony Pitts, Kal Ahmed, MusicDNA

semantic web
- ultimate application of interoperability
- using a combination of methods and standards for storing, structuring, filling, formalizing, describing and interpreting metadata:
  - RDF(S)
  - ontologies (as well as thesauri, taxonomies, semantic networks, …)
  - formal languages (like SKOS and OWL)
  - annotation of resources/objects (= subject indexing)
- so that computers will be able to interpret meaning and to combine knowledge from separate systems

"species ontology"
© Guus Schreiber UvA / VU

[diagram: search query and documents, joined by "match"]
the semantic web (and interoperability) still require a lot of subject indexing, but with smart systems that:
- (help to) index dumb documents
- can infer meaning
- can match heterogeneous metadata
- can improve dumb searches
even a monkey may find correct information, even information he didn't know he was looking for
