This is an interesting metadata source. Can I import it into Koha?
Marijana Glavica <mglavica@ffzg.hr>
Dobrica Pavlinušić <dpavlin@rot13.org>
Material
● 6000 scans of book front pages

● directories organised by the person who did the scanning and by the location of the books

● filenames - inventory numbers (with duplicates)
Task
● add metadata to the scanned material
  ○ some books are already catalogued somewhere else
  ○ not all sources have Z39.50

● upload images

● keep track of what is done
  ○ a separate spreadsheet file?
Solution
● Create MARC records from file names - both bibliographic records and items
  ○ itemtype set to "not yet processed"

● import the MARC records and upload all images into Koha (see the sketch below)
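
As a rough illustration of the first step, the sketch below walks a directory of scans and writes one minimal MARC21 record per image. The `scans/` directory name, the branch code `MAIN` and the itemtype code `NYP` are placeholders, and the field choices (245, 942, 952) follow common Koha conventions rather than the exact mapping used in our script.

#!/usr/bin/perl
# Sketch: build minimal MARC21 records from scan paths like
#   student/location/inventory_note.jpg
# Output is an ISO2709 file suitable for Koha's MARC import tools.
use strict;
use warnings;
use File::Find;
use File::Spec;
use MARC::Record;
use MARC::Field;

open my $out, '>:raw', 'records.mrc' or die "records.mrc: $!";

find( sub {
    return unless /\.jpe?g$/i;
    my ($inventory) = /^([^_.]+)/;              # inventory number from the filename
    $inventory //= $_;
    my @dirs     = File::Spec->splitdir( $File::Find::dir );
    my $location = $dirs[-1] // '';             # last directory = location of the book

    my $record = MARC::Record->new;
    $record->leader('00000nam a22000007a 4500');
    $record->append_fields(
        MARC::Field->new( '245', '0', '0', a => "Inventory $inventory" ),
        # 942$c = Koha item type; "NYP" ("not yet processed") is a placeholder code
        MARC::Field->new( '942', ' ', ' ', c => 'NYP' ),
        # 952 = embedded Koha item; branch codes and location subfields are placeholders
        MARC::Field->new( '952', ' ', ' ',
            a => 'MAIN', b => 'MAIN', c => $location, o => $inventory ),
    );
    print {$out} $record->as_usmarc;
}, 'scans' );

close $out;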
What is wrong with metadata?
http://www.catholicresearch.net/blog/2012/05/oai/
"The harvested Dublin Core metadata was typical of OAI-PMH repositories: thin, a bit ambiguous, and somewhat inconsistent across repositories." -- Eric Lease Morgan

Europeana is a nice example of this:
● sparse on metadata
● multiple link hops to reach the image of a record (?!)
Importing covers and metadata
● DVD with scanned book front pages
  ○ various resolutions (from stamp size to 300 dpi)
  ○ number_student/location/inventory_note.jpg
● Koha 3.8 has a tool to upload a zip with cover images and idlink.txt
  ○ the zip files are big, and we didn't have biblio records yet
● Create MARC21 records from file names (the only metadata available to us)
● Write a script which uses the Koha API (see the sketch below)
  ○ create MARC21 using MARC::Record
  ○ AddItemFromMarc, PutImage
  ○ https://github.com/dpavlin/Koha/blob/koha_gimpoz/misc/gimpoz/import-images.pl
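
The import script itself is essentially a loop over the scans using Koha's internal Perl API. Below is a minimal sketch; the C4 module names and signatures are as we recall them from Koha 3.8 (AddBiblio, AddItemFromMarc, PutImage) and may differ in other versions, the branch code `MAIN` is a placeholder, and the real, complete script is the import-images.pl linked above.

#!/usr/bin/perl
# Sketch: add a biblio record + item to Koha and attach the scanned cover image.
# Must run inside a Koha environment (KOHA_CONF set) so the C4 modules can
# reach the database.
use strict;
use warnings;
use MARC::Record;
use MARC::Field;
use GD;
use C4::Biblio qw( AddBiblio );
use C4::Items  qw( AddItemFromMarc );
use C4::Images qw( PutImage );          # local cover images, Koha 3.8+

sub import_scan {
    my ( $inventory, $location, $jpeg_path ) = @_;

    my $record = MARC::Record->new;
    $record->append_fields(
        MARC::Field->new( '245', '0', '0', a => "Inventory $inventory" ),
        MARC::Field->new( '952', ' ', ' ',
            a => 'MAIN', b => 'MAIN', c => $location, o => $inventory ),
    );

    # AddBiblio returns ( biblionumber, biblioitemnumber ); '' = default framework
    my ($biblionumber) = AddBiblio( $record, '' );

    # AddItemFromMarc creates the item from the embedded 952 field
    AddItemFromMarc( $record, $biblionumber );

    # PutImage expects a GD::Image; the third argument replaces an existing cover
    my $image = GD::Image->newFromJpeg( $jpeg_path, 1 )
        or die "cannot read $jpeg_path";
    PutImage( $biblionumber, $image, 1 );

    return $biblionumber;
}

import_scan( '1234', 'reading room', 'scans/01_ana/reading room/1234_note.jpg' );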
Scrape cataloging
● It's like copy cataloguing, but you don't have to copy/paste in your browser to do it
● Instead, you use a scraper-to-Z39.50 gateway:
  https://github.com/dpavlin/Biblio-Z3950
● Source formats:
  ○ Aleph - NSK, our national library
  ○ COBISS - they started serving images for records!
  ○ Google Books - another JSON API
  ○ VuFind - HathiTrust (MARC records export)
  ○ DPLA - JSON API (with broken UTF-8 encoding)
● Returns MARC21 records for Koha import
Scraping?!
● It's 2012, where is my semantic web?!
● Various reasons why scraping is easier:
  ○ no public Z39.50 server
    ■ or there is one, but with the wrong encoding
  ○ data source isn't MARC21
    ■ older national MARC standards, UNIMARC, or JSON for Google Books
● This is an open source project
  ○ all parts included, but some assembly required
    ■ URLs to resources, mapping to MARC
  ○ modify existing scrapers to create new ones
● Let the data flow!
Biblio::Z3950
● based on Net::Z3950::SimpleServer
● converts the Z39.50 RPN query to URL params
  ○ API support for and/or/not operators
  ○ enter just one search field in Koha
● uses WWW::Mechanize to issue the search
  ○ the advanced search syntax is the best choice, if available
  ○ scrape the results web page
    ■ a web page with a MARC-like structure
    ■ export formats
● uses MARC::Record to create MARC21 (see the sketch below)
  ○ web pages are UTF-8 encoded
  ○ mapping to MARC is specified in code
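
To give a flavour of how little glue is needed, here is a stripped-down sketch of such a gateway. The catalogue URL and the page-parsing regex are invented for illustration (the real, working drivers live in the Biblio-Z3950 repository above), and Net::Z3950::SimpleServer additionally needs the YAZ toolkit installed.

#!/usr/bin/perl
# Sketch of a scraper behind a Z39.50 front end, in the spirit of Biblio::Z3950.
# The catalogue URL and the page-parsing regex below are placeholders.
use strict;
use warnings;
use Net::Z3950::SimpleServer;
use WWW::Mechanize;
use MARC::Record;
use MARC::Field;

my @results;    # MARC::Record objects for the current result set

sub search_handler {
    my $args  = shift;
    my $query = $args->{QUERY};            # textual form of the Z39.50 query
    $query = $1 if $query =~ /"([^"]+)"/;  # crude: keep just the quoted term

    my $mech = WWW::Mechanize->new;
    $mech->get( 'https://catalogue.example.org/search?q=' . $query );

    @results = ();
    my $html = $mech->content;
    # placeholder scrape: one <div class="record"><h2>Title</h2> per hit
    while ( $html =~ m{<div class="record">\s*<h2>([^<]+)</h2>}g ) {
        my $rec = MARC::Record->new;
        $rec->append_fields( MARC::Field->new( '245', '0', '0', a => $1 ) );
        push @results, $rec;
    }
    $args->{HITS} = scalar @results;
}

sub fetch_handler {
    my $args = shift;
    my $rec  = $results[ $args->{OFFSET} - 1 ];    # OFFSET is 1-based
    if ($rec) {
        $args->{RECORD} = $rec->as_usmarc;         # ISO2709 goes back to Koha
    }
    else {
        $args->{ERR_CODE} = 13;                    # Bib-1: present request out of range
    }
}

my $server = Net::Z3950::SimpleServer->new(
    SEARCH => \&search_handler,
    FETCH  => \&fetch_handler,
);
$server->launch_server( 'scrape-gateway.pl', @ARGV );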
Mappings easy to define (in code :-)
● [the slide shows a screenshot of one scraper's MARC mapping code]
● Google Books JSON to MARC mapping is more complex, but still only 80 lines of code (toy version below)
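
For comparison, here is a toy version of such a mapping for Google Books. The volumeInfo fields (title, subtitle, authors, publisher, publishedDate) come from the public Books API; the MARC field choices are simplified assumptions, not the full 80-line mapping from the repository.

#!/usr/bin/perl
# Sketch: map a Google Books API "volumeInfo" structure to a MARC21 record.
# HTTPS via HTTP::Tiny needs IO::Socket::SSL installed.
use strict;
use warnings;
use HTTP::Tiny;
use JSON qw( decode_json );
use MARC::Record;
use MARC::Field;

my $isbn = '9780596526740';
my $res  = HTTP::Tiny->new->get(
    "https://www.googleapis.com/books/v1/volumes?q=isbn:$isbn" );
die "Google Books request failed: $res->{status}" unless $res->{success};

my $data = decode_json( $res->{content} );
my $info = $data->{items}[0]{volumeInfo} or die "no volume found for $isbn";

my $record = MARC::Record->new;
my @fields = (
    MARC::Field->new( '020', ' ', ' ', a => $isbn ),
    MARC::Field->new( '245', '1', '0',
        a => $info->{title} // 'Unknown title',
        ( $info->{subtitle} ? ( b => $info->{subtitle} ) : () ),
    ),
);
push @fields, MARC::Field->new( '100', '1', ' ', a => $info->{authors}[0] )
    if $info->{authors} && @{ $info->{authors} };
push @fields, MARC::Field->new( '260', ' ', ' ',
    ( $info->{publisher}     ? ( b => $info->{publisher} )     : () ),
    ( $info->{publishedDate} ? ( c => $info->{publishedDate} ) : () ),
) if $info->{publisher} || $info->{publishedDate};

$record->append_fields(@fields);
print $record->as_formatted, "\n";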
Questions?
● Do you have nicely formatted web pages which need conversion to MARC21 for Koha?
● Is storing cover images in the database the right way? (4.9 GB gzipped SQL dump)

● This presentation: http://bit.ly/gimpoz
● Koha instance: http://gimpoz.koha.rot13.org
● Blog: http://blog.rot13.org
Abstract
We live in a world of data. However, data doesn't always come in a format that is as easy to share as we would expect.

We had approximately 6000 scans of book front pages from the Teachers' library stock of the Gymnasium in Požega, which has been proclaimed a movable monument of culture of national significance. Our goal was to make the library stock visible to the public, and we needed to add metadata to those images. Fortunately, some of that data was already available on the web: in the National Library's Aleph system, in several Croatian libraries using Koha, in the HathiTrust digital library (VuFind), Open Library, Google Books, Europeana, etc.

Importing local images is now a standard part of Koha, so we decided to import all those images and to create the initial biblio records using the only kind of metadata that we had: structured directories and filenames which represent some kind of identifier number. After that, we started cataloguing our items. There is a convenient method for adding bibliographic data to a catalogue: Z39.50 search. Unfortunately, not all of our metadata sources provided a Z39.50 interface.

Our solution to the problem was scrape cataloguing, which gave us a way to avoid endless copy & paste cycles and manual data entry. Instead, the job was done by our script, which provides a Z39.50 interface for Koha.

More Related Content

● Linked data-tooling-xml (PPTX)
● Linked data tooling XML (PDF)
● Slides semantic web and Drupal 7 NYCCamp 2012 (PDF)
● Data Integration And Visualization (ODP)
● Apache Marmotta - Introduction (ODP)
● The Semantic Web and Drupal 7 - Loja 2013 (PDF)
● Geospatial querying in Apache Marmotta - ApacheCon Big Data Europe 2015 (PDF)
● WikiNext - Application Semantic Wiki (PPTX)

What's hot (20)

● Drupal and RDF (PDF)
● swib15 ALIADA (PPTX)
● Geospatial Querying in Apache Marmotta - Apache Big Data North America 2016 (PDF)
● Review of KohaCon18 (PDF)
● Querying GrAF data in linguistic analysis (ODP)
● MuseoTorino, first italian project using a GraphDB, RDFa, Linked Open Data (PPT)
● Leaving Blackboxes Behind - ELAG 2016 (PDF)
● Enabling access to Linked Media with SPARQL-MM (PDF)
● Apache Marmotta (incubating) (PDF)
● Semantic Media Management with Apache Marmotta (PDF)
● RDF Seminar Presentation (PDF)
● Hadoop World 2011: Radoop: a Graphical Analytics Tool for Big Data - Gabor Ma... (PPTX)
● DBpedia Japanese (PDF)
● Sasaki datathon-madrid-2015 (PPTX)
● Linked Media Management with Apache Marmotta (PDF)
● Drupal 7 and RDF (PDF)
● Semantics, rdf and drupal (PPTX)
● Produce & Publish Authoring Environment V 2.0 (english version) (PPTX)
● Apache Spark's Built-in File Sources in Depth (PDF)
● Selecting the right database type for your knowledge management needs. (PPTX)

Viewers also liked (20)

● Social Media & Web 2.0 Services for Choirs (PDF)
● Playful Blended Digital Storytelling in 3D Immersive eLearning Environments f... (PDF)
● Mojo Facets – so, you have data and browser? (PDF)
● Intro to Haml (PDF)
● Morocco (PDF)
● Creating And Customizing Your Blackboard Class (PPT)
● Operation Payback (...is a bitch): Hacktivism at the Dawn of Copyright Contro... (PPTX)
● Oslobodimo Hardware (PDF)
● CTE Teaching and Learning Inst. 2008 (PPT)
● REST ili kao sam se prestao brinuti o HTTP-u i zavolio ga (HTTP Server sa RFI... (PDF)
● Χριστούγεννα χωρίς Χριστό (PPT)
● Free Libre Open Source Software at FFZG library (PDF)
● Towards an Instructional Design Motivational Framework to Address the Retenti... (PDF)
● Wiki: Open Collaborative Learning Environment (PPTX)
● Ppt Demo Slideshare (PPT)
● Kako napraviti Google od zgrade sa računalima? (PDF)
● Pubic Diplomacy and Web 2.0 (PDF)
● Cisco Board 18 (PPT)
● Virtual LDAP - kako natjerati strgane aplikacije da koriste LDAP (PDF)
● The Constellation Query Language (PDF)

Similar to This is an interesting metadata source. Can I import it into Koha? (20)

● EuropeanaTech x AI: Qurator.ai @ Berlin State Library (PPTX)
● Data analysis with Pandas and Spark (PDF)
● Linked Open Citation Database (LOC-DB) (PDF)
● Ontology Access Kit_ Workshop Intro Slides.pptx (PPTX)
● Logs aggregation and analysis (PDF)
● AddisDev Meetup ii: Golang and Flow-based Programming (PDF)
● Apache Arrow: Cross-language Development Platform for In-memory Data (PDF)
● Mechanical curator - Technical notes (PPTX)
● Apache Iceberg - A Table Format for Hige Analytic Datasets (PDF)
● A rubyist's naive comparison of some database systems and toolkits (KEY)
● Interactive E-Books (PDF)
● Apache Arrow at DataEngConf Barcelona 2018 (PDF)
● Migrating structured data between Hadoop and RDBMS (PDF)
● Berlin Buzzwords 2019 - Taming the language border in data analytics and scie... (PDF)
● Modern SharePoint, the Good, the Bad, and the Ugly (PPTX)
● Apache Arrow: Present and Future @ ScaledML 2020 (PDF)
● Introduction to Go (PDF)
● Introduction to knime (PPTX)
● On Again; Off Again - Benjamin Young - ebookcraft 2017 (PDF)

More from Dobrica Pavlinušić (20)

● Mainline kernel on ARM Tegra20 devices that are left behind on 2.6 kernels (PDF)
● Linux+sensor+device-tree+shell=IoT ! (PDF)
● bro - what is in my network? (PDF)
● Let's hack cheap hardware 2016 edition (PDF)
● Raspberry Pi - best friend for all your GPIO needs (PDF)
● Cheap, good, hackable tools from China: AVR component tester (PDF)
● Ganeti - build your own cloud (PDF)
● FSEC 2014 - I can haz your board with JTAG (PDF)
● Hardware hacking for software people (PDF)
● Gnu linux on arm for $50 - $100 (PDF)
● Security of Linux containers in the cloud (PDF)
● Web scale monitoring (PDF)
● SysAdmin cookbook (PDF)
● Printing on Linux, simple right? (PDF)
● KohaCon11: Integrating Koha with RFID system (PPT)
● Deploy your own P2P network (PDF)
● Post-relational databases: What's wrong with web development? v3 (PDF)
● Virtualization which isn't: LXC (Linux Containers) (PDF)
● Slobodni softver za digitalne arhive: EPrints u Knjižnici Filozofskog fakulte... (PDF)
● Post-relational databases: What's wrong with web development? (PDF)

