Ed Chamberlain
Cambridge University Library
 Cambridge Open Metadata



 Funded by the JISC Infrastructure for Resource
  Discovery Project
 Cambridge, back in 2010 …

 OKFN - Open Bibliography project
  (2010-2011)

 Debate around re-use of catalogue
  records from vendors (not just
  OCLC)

 CUL already provides public APIs

 Increasing interest in linked data

 FAST / VIAF

 Lorcan
 “The initial aim of this project will
  be to identify and release a
  substantial record set to an
  external platform under an open
  license” …

 “For OCLC-derived bibliographic
  records data will be released in a
  fashion compliant with their
  WorldCat Rights and
  Responsibilities for the OCLC
  Cooperative” …

 “The project aims to then deploy
  and test a number of
  technologies and methodologies
  for releasing open bibliographic
  data including XML, RDF, SPARQL,
  and JSON” …
 Cambridge University Library
        Metadata conversion
        Development
        Project management


       CARET
        Infrastructure support


       OCLC
        Licensing consultancy
        FAST / VIAF enrichment
 Value for money – Taxpayers

 Open data = affiliate marketing for
  our collections

 Drive innovation - vital buy-in from
  non library developer communities

 One of many open data projects at
  the time
 “Library catalogues have imposed on them librarian or supplier-made
  decisions about what can/can’t be searched and in what way. Some of these
  decisions are limited by current cataloguing rules, but not all; often the data is
  recorded, but not in a usable way, or is there but isn’t tapped by the
  interface. For example, in most catalogues you can limit by publication type to
  newspapers, but you can’t limit by frequency of the issues.”

 “Releasing data means that people can start to use it in the way they want to.”
 Most of the catalogue (3 million +)
     Bulk downloads of RDF triples
     Query-able ‘endpoints’
     FAST / VIAF enriched
     Snapshot


 RDF conversion tools

 Working model and code to determine
  Marc21 record origin

 Codebase for ‘library centric’ RDF
  publishing website

 SPARQL tutorial

 Verbose blog
                                        Data and code at http://data.lib.cam.ac.uk
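The query-able endpoints listed above speak SPARQL. As a rough sketch of what a client request could look like, the snippet below builds a GET URL that carries a query in the standard `query` parameter. The endpoint address is a placeholder assumption, not the project's real endpoint, and the query only assumes that titles are expressed with Dublin Core terms.

```python
from urllib.parse import urlencode, parse_qs, urlsplit

# Hypothetical endpoint URL, used only for illustration.
ENDPOINT = "http://example.org/sparql"


def build_query_url(endpoint: str, sparql: str) -> str:
    """Return a GET URL carrying the query in the `query` parameter."""
    return endpoint + "?" + urlencode({"query": sparql})


# Ask for ten resources and their Dublin Core titles.
QUERY = """
PREFIX dcterms: <http://purl.org/dc/terms/>
SELECT ?book ?title
WHERE { ?book dcterms:title ?title }
LIMIT 10
""".strip()

url = build_query_url(ENDPOINT, QUERY)
```

No request is sent here; the URL could be passed to any HTTP client once a real endpoint address is known.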
Comet project
 Examine contracts with major
  vendors

 Contact them and decide on re-use
  conditions

 Deduce record origin from Marc21
  fields
 Several places in Marc21 where
  this data could be held
  (015,035,038,994 …)

 Logic and hierarchy for
  examination

 Attempt at scripted analysis

 Marc21 fails at ‘IPR’

 Potential down the line for
  problem to persist if attribution is
  not handled correctly in future
  formats
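The "logic and hierarchy for examination" can be pictured as an ordered list of rules that is walked until one matches. The sketch below is an illustration only: the record shape, the rule order, and the marker strings are all assumptions, and only two of the tags named on the slide are covered.

```python
from typing import Dict, List, Optional, Tuple

# A record is modelled as a plain mapping of Marc21 tag -> field values.
Record = Dict[str, List[str]]

# (tag, marker, vendor) rules, checked in order. Both the order and the
# markers are hypothetical; real rules would come from vendor contracts.
RULES: List[Tuple[str, str, str]] = [
    ("035", "(OCoLC)", "OCLC"),  # prefix commonly seen on OCLC control numbers
    ("015", "GB", "BNB"),        # assumed marker for a national bibliography number
]


def guess_origin(record: Record) -> Optional[str]:
    """Return the vendor of the first rule that matches, else None."""
    for tag, marker, vendor in RULES:
        for value in record.get(tag, []):
            if marker in value:
                return vendor
    return None


sample = {"035": ["(OCoLC)12345678"], "015": ["GBA1-23456"]}
origin = guess_origin(sample)
```

Because the rules are data rather than code, further tags such as 038 or 994 could be added to the table without touching the function.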
Need the right
license!
 Most vendors happy with
  permissive license for ‘non-
  marc21’ formats

 RLUK / BL B.N.B. – Public Domain
  Data License

 OCLC – ODC-By Attribution license
  with community norms
 RDF allows you to freely mix
  vocabularies

 Emerging consensus on
  bibliographic description

 BL and others leading the way

 Victory for pragmatism?
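To make "freely mix vocabularies" concrete, here is a small sketch that writes three N-Triples statements, two using Dublin Core terms and one using FOAF. The book and author URIs are invented placeholders, and the escaping is deliberately minimal; this is not the project's conversion script.

```python
DCTERMS = "http://purl.org/dc/terms/"
FOAF = "http://xmlns.com/foaf/0.1/"


def literal(text: str) -> str:
    """Quote a plain string as an N-Triples literal (minimal escaping)."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def triple(subject: str, predicate: str, obj: str) -> str:
    """Format one statement; `obj` must already be serialised."""
    return f"<{subject}> <{predicate}> {obj} ."


# Hypothetical URIs for one book and its author.
book = "http://example.org/book/1"
author = "http://example.org/person/1"

triples = [
    triple(book, DCTERMS + "title", literal("An example title")),
    triple(book, DCTERMS + "creator", f"<{author}>"),
    triple(author, FOAF + "name", literal("A. N. Author")),
]
```

The same subject can carry predicates from any number of vocabularies, which is what lets a publisher pick fields pragmatically rather than adopt one monolithic schema.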
 Punctuation as a function

 Binary encoding

 Numbers for field names

 Bad characters

 Replication of data in fields
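"Punctuation as a function" means that separators belonging to the display format are stored inside the data values. A minimal sketch of the clean-up this forces on a converter follows; the set of trailing characters is an assumption for illustration and is far from complete.

```python
import re

# One trailing separator (/ : ; = ,) with optional surrounding spaces.
# The character list is an illustrative assumption, not a full rule set.
_TRAILING = re.compile(r"\s*[/:;=,]\s*$")


def strip_trailing_punctuation(value: str) -> str:
    """Remove one trailing separator from a subfield value."""
    return _TRAILING.sub("", value).strip()


cleaned = [
    strip_trailing_punctuation(v)
    for v in ["Open data /", "London :", "Smith, John,"]
]
```

Note that the comma inside "Smith, John," is kept, since only the final separator is removed.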
 PHP script to match text against
  LOC subject headings – enrich with
  LOC GUID

 FAST / VIAF enrichment courtesy
  of OCLC
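The slide describes a PHP script; the sketch below is written in Python and is not that script. It only illustrates the general idea of matching heading text against an authority list after normalising both sides. The lookup table and its identifiers are placeholders, not real LOC identifiers.

```python
import unicodedata
from typing import Dict, Optional


def normalise(heading: str) -> str:
    """Fold case, accents, spacing and trailing full stops for comparison."""
    text = unicodedata.normalize("NFKD", heading)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    return " ".join(text.lower().rstrip(". ").split())


# Stand-in authority table: normalised label -> identifier (placeholders).
AUTHORITY: Dict[str, str] = {
    normalise("Library catalogs"): "http://example.org/subjects/0001",
    normalise("Metadata"): "http://example.org/subjects/0002",
}


def match_heading(heading: str) -> Optional[str]:
    """Return the identifier for a heading, or None when nothing matches."""
    return AUTHORITY.get(normalise(heading))


hit = match_heading("Library  Catalogs.")
miss = match_heading("Unknown heading")
```

Headings that return None correspond to the "skipped" case and would need manual review or a fuzzier matcher.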
No. of records:                         3,658,384
No. of records with LCSH headings:      2,709,878
Percentage with LCSH headings:          74%

No. of subject headings found:          5,889,048
No. of subject headings skipped:        45

Valid FAST subjects:                    8,134,230
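The percentage can be checked directly from the two record counts. The second ratio below rests on an assumption the slide does not state, namely that the "subject headings found" all belong to the records that carry LCSH headings.

```python
records = 3_658_384
records_with_lcsh = 2_709_878
headings_found = 5_889_048

# Share of records that carry at least one LCSH heading.
share = records_with_lcsh / records

# Assumption: all headings found sit on the LCSH-bearing records.
headings_per_record = headings_found / records_with_lcsh

summary = f"{share:.1%} of records, {headings_per_record:.2f} headings each"
```

The share rounds to the 74% quoted on the slide.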
 Marc / AACR2 cannot translate
  easily to semantically rich
  formats

 Libraries need to better utilise
  modern container / transfer
  standards (not necessarily RDF)

 No ‘one size fits all’ approach for
  future
Karen Coyle criticises the Marc21 Bibliographic Framework Transition Initiative
for not including museums, publishing, and IT professionals …

She argues that our data is not just for us to consume alone …

  “The next data carrier for libraries needs to be developed as a truly open effort.
  It should be led by a neutral organization (possibly ad hoc) that can bring
  together the wide range of interested parties and make sure that all voices are
  heard. Technical development should be done by computer professionals with
  expertise in metadata design. The resulting system should be rigorous yet flexible
  enough to allow growth and specialization.”

Steering for RDA and Marc replacement needs non-librarian input or ownership

http://kcoyle.blogspot.com/2011/08/bibliographic-framework-transition.html
Open Bibliography 2
Lightweight approach to sharing
bibliography now it's open …
   Bottom up, community led software
    called Bibserver
   Wikimedia for bib data
   JSON as a container format – flexible, able
    to cope with different structures,
    vocabularies etc.
   Engagement with UK PubMed Central
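The point about JSON coping with different structures can be shown with two records of different shape living in the same list. The field names below are illustrative assumptions and do not follow any published schema.

```python
import json

# Two records with different shapes; field names are hypothetical.
records = [
    {
        "title": "An example book",
        "author": [{"name": "A. N. Author"}],
        "year": 2011,
    },
    {
        "title": "An example article",
        "journal": {"name": "Example Journal"},
        "identifier": [{"type": "doi", "id": "10.0000/example"}],
    },
]

# Round-trip through text to show nothing is lost.
serialised = json.dumps(records, indent=2, sort_keys=True)
restored = json.loads(serialised)
```

Neither record has to declare the fields it lacks, which is the flexibility the slide refers to.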



C.L.O.C.K. (Cambridge/ Lincoln open
cataloguing knowledgebase)
New approaches to traditional library
workflows (copy cataloguing) using
open data
    Using rich open data to enrich bare bones
     data
    NOSQL database technology
    APIs as key deliverables
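Using rich open data to enrich bare-bones data amounts to a merge in which local values are kept and gaps are filled from the richer source. The sketch below is one possible policy under that assumption; the field names and the "local wins" rule are hypothetical and not taken from the C.L.O.C.K. project.

```python
from typing import Any, Dict


def enrich(bare: Dict[str, Any], rich: Dict[str, Any]) -> Dict[str, Any]:
    """Fill fields missing or empty in `bare` from `rich`; local values win."""
    merged = dict(rich)
    merged.update({k: v for k, v in bare.items() if v not in (None, "", [])})
    return merged


bare = {"isbn": "0000000000", "title": "Example title", "subjects": []}
rich = {
    "isbn": "0000000000",
    "title": "Example title : a subtitle",
    "subjects": ["Metadata"],
    "publisher": "Example Press",
}
result = enrich(bare, rich)
```

Here the local title survives while the empty subject list and the missing publisher are taken from the open record.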
[Diagram labels: FAST subject headings, Language, Place of publication, LCSH subject headings, Special collections, Archives, Bibliographic, Creator / entity, Holdings, Libraries, Librarians, Course lists, Transactions]
 Anonymous usage data from
  circulation systems

 Aggregated from several University
  Libraries

 API feed

 Available openly (CC-BY )
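One way to picture anonymous, aggregated usage data is a per-item loan count with the borrower and library dropped before publication. The event shape below is a hypothetical assumption; the slide does not describe the actual feed format.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

# (library, borrower_id, item_id): a hypothetical shape for a loan event.
Loan = Tuple[str, str, str]


def aggregate(loans: Iterable[Loan]) -> Dict[str, int]:
    """Count loans per item, discarding library and borrower identity."""
    counts: Counter = Counter()
    for _library, _borrower, item_id in loans:
        counts[item_id] += 1
    return dict(counts)


loans = [
    ("library-a", "user-1", "item-42"),
    ("library-b", "user-2", "item-42"),
    ("library-a", "user-1", "item-7"),
]
totals = aggregate(loans)
```

Only the totals would be exposed through an API, so no individual borrowing history leaves the circulation system.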
 It becomes (even) easier to go to
  Amazon

 Our status as authoritative data
  providers will be (further) eroded

 Assume we can

 Assume we should (where we can)
 http://www.discovery.ac.uk -
  Discovery

 NGC4Lib mailing list

 http://okfn.org - Open Knowledge
  Foundation

 http://data.lib.cam.ac.uk
 Ed Chamberlain

   @edchamberlain
   emc59@cam.ac.uk
   http://www.slideshare.net/EdmundChamberlain/


Editor's Notes

  • #4: Respond to academic / national demand for Open Data; we had previously given some to the Open Bibliography project. Get our data to non-librarians and provide tax-payer value for money. CUL already provides public APIs. Gain in-house experience of RDF. Move library services forward.
  • #8: This is my colleague Katie’s write-up of a talk led by Owen Stephens; it really sums it all up …
  • #12: See if there were any express contractual clauses saying we could not redistribute.
  • #13: Where does a record come from? In practice this is quite hard to determine. There are several places in Marc21 where this data could be held. Logic for examination. Attempt at scripted analysis: list bib_ids by record vendor.
  • #15: Most vendors are happy with a permissive license for ‘non-marc21’ formats. The non-Marc restriction is not an issue in this context: no one outside of library land cares about a load of binary-encoded numbers, and we are re-purposing Marc-originated data for a wider audience. RLUK / BL BNB: PDDL. OCLC: ODC-By Attribution license. There is no good reason not to re-publish, but we need the right license!
  • #17: RDF allows you to freely mix vocabularies, that is, to choose the fields that describe your data. Emerging consensus on bibliographic description: thankfully no-one is attempting to recreate Marc; it is mainly a use of Qualified Dublin Core, FOAF and other relevant bibliographic-focused vocabularies. There may never emerge such a … Our conversion script is customisable via CSV. BL and others are leading the way on vocabulary choice; they did some great data modelling, which we stayed clear of. It's my personal hope that we never again see a heavyweight approach in the style of Marc. As we move forward with new container formats, pragmatism needs to rule over completionism if we are to successfully share our valuable data with a wider user base.
  • #19: PHP script to match text against LOC subject headings and enrich with the LOC GUID. FAST / VIAF enrichment courtesy of OCLC. FAST: next-generation subject headings, very exciting. VIAF: Virtual International Authority File. OCLC want to develop these as linked services and are keen to help.
  • #21: Marc / AACR2 cannot translate well to semantically rich formats. We need better container / transfer standards (not necessarily RDF).
  • #22: So despite the change, it's my worry that those in charge of Marc21 and RDA developments are not thinking widely enough about the new open ecosystem that our data must inhabit.
  • #24: Two projects, focused less on data release and licensing and more on exploiting its value in an open environment.
  • #27: If we don’t try to shift … It becomes easier to go to Amazon, who have awesome APIs, or even Google Books (theirs are rubbish). Our status as authoritative data providers will be further eroded. No-one will want to play with us if we do not share.