Comet project

Ed Chamberlain
Cambridge University Library

 Cambridge Open Metadata

 Funded by the JISC Infrastructure for Resource
Discovery Project

 Cambridge, back in 2010 …

 OKFN - Open Bibliography project
(2010-2011)

 Debate around re-use of catalogue
records from vendors (not just
OCLC)

 CUL already provides public APIs

 Increasing interest in linked data

 FAST / VIAF

 Lorcan

 “The initial aim of this project will
be to identify and release a
substantial record set to an
external platform under an open
license” …

 “For OCLC-derived bibliographic
records data will be released in a
fashion compliant with their
WorldCat Rights and
Responsibilities for the OCLC
Cooperative” …

 “The project aims to then deploy
and test a number of
technologies and methodologies
for releasing open bibliographic
data including XML, RDF, SPARQL,
and JSON” …

 Cambridge University Library
 Metadata conversion
 Development
 Project management

 CARET
 Infrastructure support

 OCLC
 Licensing consultancy
  FAST / VIAF enrichment

 Value for money – Taxpayers

 Open data = affiliate marketing for
our collections

 Drive innovation - vital buy-in from
non library developer communities

 One of many open data projects at
the time

 “Library catalogues have imposed on them librarian or supplier-made
decisions about what can/can’t be searched and in what way. Some of these
decisions are limited by current cataloguing rules, but not all; often the data is
recorded, but not in a usable way, or is there but isn’t tapped by the
interface. For example, in most catalogues you can limit by publication type to
newspapers, but you can’t limit by frequency of the issues.”

 “Releasing data means that people can start to use it in the way they want to.”

 Most of the catalogue (3 million +)
 Bulk downloads of RDF triples
 Query-able ‘endpoints’
 FAST / VIAF enriched
 Snapshot
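As a rough illustration of what querying such an endpoint involves, the sketch below builds a SPARQL protocol GET request in Python without sending it. The endpoint address is a placeholder, and the use of dcterms:title in the data is an assumption; neither is taken from the project itself.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Hypothetical endpoint address, used only for illustration.
ENDPOINT = "http://example.org/sparql"

# Generic SPARQL: list ten resources together with their titles.
# Assumes the data describes titles with dcterms:title.
QUERY = """
PREFIX dcterms: <http://purl.org/dc/terms/>
SELECT ?book ?title
WHERE { ?book dcterms:title ?title }
LIMIT 10
"""


def build_request_url(endpoint, query):
    """Return a GET URL carrying the query in the 'query' parameter,
    which is how the SPARQL protocol passes a query over HTTP GET."""
    return endpoint + "?" + urlencode({"query": query.strip()})


url = build_request_url(ENDPOINT, QUERY)
print(url)
```

The resulting URL can be pasted into a browser or fetched with any HTTP client once a real endpoint address is substituted.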

 RDF conversion tools

 Working model and code to determine
Marc21 record origin

 Codebase for ‘library centric’ RDF
publishing website

 SPARQL tutorial

 Verbose blog
Data and code at http://data.lib.cam.ac.uk

 Examine contracts with major
vendors

 Contact them and decide on re-use
conditions

 Deduce record origin from Marc21
fields

 Several places in Marc21 where
this data could be held
(015,035,038,994 …)

 Logic and hierarchy for
examination

 Attempt at scripted analysis
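A scripted check of this kind might look like the Python sketch below. It is not the project's code: the order in which fields are examined and the rules mapping values to suppliers are assumptions chosen for illustration, and records are simplified to a dictionary of tag to field values.

```python
# Order in which fields are examined: an assumed hierarchy for illustration.
FIELD_PRIORITY = ["038", "994", "035", "015"]


def classify(tag, value):
    """Map one field value to a supplier label, or None if it is uninformative.
    All rules here are illustrative assumptions."""
    value = value.strip()
    if not value:
        return None
    if tag == "038":
        return value  # 038 holds a record content licensor code
    if tag == "994":
        return "OCLC"  # 994 is a local field written by OCLC systems
    if tag == "035" and value.startswith("(OCoLC)"):
        return "OCLC"  # OCLC control number prefix
    if tag == "015" and value.startswith("GB"):
        return "BNB"  # assumed: GB-prefixed national bibliography number
    return None


def guess_origin(record):
    """record: dict mapping a Marc21 tag to a list of field value strings."""
    for tag in FIELD_PRIORITY:
        for value in record.get(tag, []):
            origin = classify(tag, value)
            if origin is not None:
                return origin
    return "unknown"


print(guess_origin({"015": ["GBA123456"], "035": ["(OCoLC)12345"]}))
```

Because 035 is examined before 015 in this assumed hierarchy, a record carrying both identifiers is attributed to OCLC.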

 Marc21 fails at ‘IPR’

 Potential down the line for
problem to persist if attribution is
not handled correctly in future
formats

Need the right
license!
 Most vendors happy with
permissive license for ‘non-
marc21’ formats

 RLUK / BL B.N.B. – Public Domain
Dedication and License (PDDL)

 OCLC – ODC-By Attribution license
with community norms

 RDF allows you to freely mix
vocabularies

 Emerging consensus on
bibliographic description

 BL and others leading the way

 Victory for pragmatism?
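To show what mixing vocabularies means in practice, here is a minimal Python sketch that describes one book using Dublin Core terms, FOAF and BIBO together, then writes the result as N-Triples. The resource addresses and values are placeholders, and the choice of properties is an assumption rather than the project's actual data model.

```python
# Three widely used vocabularies, mixed in one description.
DCTERMS = "http://purl.org/dc/terms/"
FOAF = "http://xmlns.com/foaf/0.1/"
BIBO = "http://purl.org/ontology/bibo/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

BOOK = "http://example.org/entry/1"      # hypothetical resource address
AUTHOR = "http://example.org/person/1"   # hypothetical resource address

# Each object is tagged as either a URI or a plain literal.
triples = [
    (BOOK, RDF_TYPE, ("uri", BIBO + "Book")),
    (BOOK, DCTERMS + "title", ("literal", "An example title")),
    (BOOK, DCTERMS + "creator", ("uri", AUTHOR)),
    (AUTHOR, FOAF + "name", ("literal", "An Example Author")),
]


def to_ntriples(triples):
    """Serialise (subject, predicate, object) tuples as N-Triples lines."""
    lines = []
    for subject, predicate, (kind, value) in triples:
        if kind == "uri":
            obj = "<" + value + ">"
        else:
            # Escape backslashes and double quotes inside literals.
            escaped = value.replace("\\", "\\\\").replace('"', '\\"')
            obj = '"' + escaped + '"'
        lines.append("<%s> <%s> %s ." % (subject, predicate, obj))
    return "\n".join(lines)


output = to_ntriples(triples)
print(output)
```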

 Punctuation as a function

 Binary encoding

 Numbers for field names

 Bad characters

 Replication of data in fields
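Two of these problems, punctuation as a function and numbers for field names, can be illustrated with a short Python sketch. It strips the trailing ISBD punctuation that Marc21 stores inside subfield values and swaps a numeric tag for a readable name. The tag names, the sample values and the set of punctuation characters are assumptions for illustration; full stops are deliberately left alone because they may belong to abbreviations or initials.

```python
# Readable names for a few numeric Marc21 tags (names chosen for illustration).
TAG_NAMES = {"245": "title_statement", "260": "publication"}

# Trailing ISBD punctuation that introduces the next subfield.
ISBD_TRAILERS = " /:;,="


def clean_field(tag, subfields):
    """subfields: list of (code, value) pairs taken from one Marc21 field.
    Returns a readable field name and the values without trailing ISBD marks."""
    name = TAG_NAMES.get(tag, tag)
    cleaned = [(code, value.rstrip(ISBD_TRAILERS)) for code, value in subfields]
    return name, cleaned


field = clean_field("245", [
    ("a", "An example title :"),
    ("b", "a subtitle /"),
    ("c", "A. N. Author."),
])
print(field)
```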

 PHP script to match text against
LOC subject headings – enrich with
LOC GUID

 FAST / VIAF enrichment courtesy
of OCLC
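The matching step was done with a PHP script; the Python sketch below only illustrates the general idea of normalising heading text and looking it up against a table of identifiers. The lookup table, its placeholder identifiers and the normalisation rules are all assumptions, not the project's own.

```python
import re

# Hypothetical lookup table: normalised heading -> identifier.
# The identifiers are placeholders, not real LOC identifiers.
LCSH_IDS = {
    "astronomy": "id-0001",
    "comets--observations": "id-0002",
}


def normalise(heading):
    """Lower-case, drop a trailing full stop, tidy subdivision dashes and spaces."""
    text = heading.strip().lower()
    text = text.rstrip(".")
    text = re.sub(r"\s*--\s*", "--", text)
    text = re.sub(r"\s+", " ", text)
    return text


def enrich(headings, lookup=LCSH_IDS):
    """Split headings into (heading, identifier) matches and skipped headings."""
    matched, skipped = [], []
    for heading in headings:
        identifier = lookup.get(normalise(heading))
        if identifier is None:
            skipped.append(heading)
        else:
            matched.append((heading, identifier))
    return matched, skipped


matched, skipped = enrich(["Astronomy.", "Comets -- Observations", "Unknown heading"])
print(matched, skipped)
```

Keeping a list of skipped headings makes it easy to report counts of found and skipped headings after a run.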

No. of records: 3,658,384
No. of records with LCSH headings: 2,709,878
Percentage with LCSH headings: 74%

No. of subject headings found: 5,889,048
No. of subject headings skipped: 45

Valid FAST subjects: 8,134,230

 Marc / AACR2 cannot translate
easily to semantically rich
formats

 Libraries need to better utilise
modern container / transfer
standards (not necessarily RDF)

 No ‘one size fits all’ approach for
future

Karen Coyle criticises the Marc21 Bibliographic Framework Transition Initiative
for not including museums, publishing, and IT professionals …

She argues that our data is not just for us to consume alone …

“The next data carrier for libraries needs to be developed as a truly open effort.
It should be led by a neutral organization (possibly ad hoc) that can bring
together the wide range of interested parties and make sure that all voices are
heard. Technical development should be done by computer professionals with
expertise in metadata design. The resulting system should be rigorous yet flexible
enough to allow growth and specialization.”

Steering for RDA and Marc replacement needs non-librarian input or ownership

http://kcoyle.blogspot.com/2011/08/bibliographic-framework-transition.html

Open Bibliography 2
Lightweight approach to sharing
bibliography now it's open …
 Bottom up, community led software
called Bibserver
 Wikimedia for bib data
 JSON as a container format – flexible, able
to cope with different structures,
vocabularies etc.
 Engagement with UK PubMed Central
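The flexibility claimed for JSON can be seen in a small Python sketch: two records with different shapes sit side by side in one list and survive a round trip unchanged. The field names and values are invented for illustration and are not claimed to follow any particular Bibserver schema.

```python
import json

# Two records with different structures held in the same container.
# Field names and values are illustrative assumptions.
records = [
    {
        "title": "An example book",
        "author": [{"name": "A. N. Author"}],
        "year": "2011",
    },
    {
        "title": "An example article",
        "journal": {"name": "An Example Journal"},
        "identifier": [{"type": "doi", "id": "10.0000/example"}],
    },
]

# Serialise to text and read it back; nothing forces the records to share fields.
serialised = json.dumps(records, indent=2, sort_keys=True)
restored = json.loads(serialised)
print(serialised)
```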

C.L.O.C.K. (Cambridge / Lincoln Open
Cataloguing Knowledgebase)
New approaches to traditional library
workflows (copy cataloguing) using
open data
 Using rich open data to enrich bare bones
data
 NOSQL database technology
 APIs as key deliverables
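One way to picture enriching bare-bones data is the Python sketch below, which fills gaps in a sparse local record from a richer open record matched on ISBN. The matching key, the field names and the rule that non-empty local values win are all assumptions for illustration, not a description of the C.L.O.C.K. design.

```python
def enrich_record(sparse, rich_by_isbn):
    """Return a copy of the sparse record with gaps filled from a richer record.

    sparse: dict for a bare-bones local record (hypothetical field names).
    rich_by_isbn: dict mapping ISBN -> richer open-data record.
    """
    match = rich_by_isbn.get(sparse.get("isbn"))
    if match is None:
        return dict(sparse)  # nothing to enrich from
    merged = dict(match)
    # Locally held values take precedence, but empty ones do not overwrite.
    merged.update({key: value for key, value in sparse.items() if value})
    return merged


open_data = {
    "0000000000": {
        "isbn": "0000000000",
        "title": "Full example title",
        "subjects": ["Astronomy"],
    }
}
local = {"isbn": "0000000000", "title": "", "shelfmark": "X.1"}
result = enrich_record(local, open_data)
print(result)
```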

[Diagram. Labels: Bibliographic, FAST subject headings, LCSH subject headings,
Language, Place of publication, Creator / entity, Special collections, Archives,
Holdings, Libraries, Librarians, Course lists, Transactions]

 Anonymous usage data from
circulation systems

 Aggregated from several University
Libraries

 API feed

 Available openly (CC-BY)

 It becomes (even) easier to go to
Amazon

 Our status as authoritative data
providers will be (further) eroded

 Assume we can

 Assume we should (where we can)

 http://www.discovery.ac.uk -
Discovery

 NGC4LIB mailing list

 http://okfn.org - Open Knowledge
Foundation

 http://data.lib.cam.ac.uk

 Ed Chamberlain

 @edchamberlain
 emc59@cam.ac.uk
 http://www.slideshare.net/EdmundChamberlain/
