Digitisation at Scale:
Automating the mass acquisition of
digitised content
IS&T Archiving Conference, Washington, April 2016
Dave Thompson
Digital Curator, Wellcome Library
The Wellcome Library
• Part of Wellcome Collection, an astonishing public
venue in London developed by the Wellcome
Trust, where people can learn more about
medicine through the ages & across cultures
• Five-year plan for transforming the Wellcome
Library.
Driver for digitisation
• To make our collections available to anyone,
anywhere, we are digitising as much of our
physical collection as we can, for both our website
and the websites of other organisations. We are
also digitising and hosting collections from
partners that complement our holdings
Transforming the Wellcome Library: 2009-2014.
http://wellcomelibrary.org/what-we-do/library-strategy-and-policy/transforming-the-wellcome-library/
The problem
• How to scale systems & processes to deliver on
our ambition
• How to design & build new high volume systems &
processes for: acquisition, storage, processing,
access
• How to manage volumes of data during
creation/acquisition
Process design – sources of content
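[Diagram: content moving from its sources through Goobi into Preservica. Labels recovered from the slide:]
• Sources: In-house, Institutions, Contractors, Harvesting
• Formats & delivery: TIFF or JP2, by HD & ftp; Grey literature as PDF
• Routes: Manual, Automatic
• Processing: Normalises TIFF to JP2; Jpylyzer validates JP2; Auto harvesting of JP2 & DMD
• Systems: Goobi (METS/OCR), Preservica
• Human role: Ingest Officer / Digital Curator (Snagging)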
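A minimal sketch of how one delivered image might be routed through the steps named on this slide, assuming the format is detected from the file's first bytes. The step names and the rule that unrecognised files go to snagging are illustrative assumptions, not the Library's actual Goobi configuration, and the signature check is far weaker than the validation Jpylyzer performs. PDF grey literature is not covered here.

```python
# First 12 bytes of every JP2 file (the JPEG 2000 signature box).
JP2_SIGNATURE = bytes.fromhex("0000000c6a5020200d0a870a")
# TIFF files start with a byte-order mark followed by the number 42.
TIFF_MAGICS = (b"II*\x00", b"MM\x00*")


def sniff_format(header: bytes) -> str:
    """Guess the image format from the first bytes of a file."""
    if header.startswith(JP2_SIGNATURE):
        return "jp2"
    if header.startswith(TIFF_MAGICS):
        return "tiff"
    return "unknown"


def plan_steps(header: bytes) -> list[str]:
    """Return the (hypothetical) processing steps for one delivered image."""
    fmt = sniff_format(header)
    if fmt == "tiff":
        # TIFF is normalised to JP2 before validation.
        return ["normalise_to_jp2", "validate_jp2", "ingest"]
    if fmt == "jp2":
        return ["validate_jp2", "ingest"]
    # Anything else is handed to a human for snagging.
    return ["snagging"]
```

Calling plan_steps with the first few bytes of a file gives the list of steps a workflow tool would then need to schedule.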
The approach
• (Re)Use/develop existing systems where possible,
e.g. bibliographic system Sierra, Preservica EE
repository
• Identify where new systems would be required,
e.g. workflow middleware
• Take a practical approach & accept that it would
be iterative, learning as we go
The solution was
to use Goobi
Why Goobi?
• Dedicated to digitisation
• Flexibility & process control
• Adaptable & scalable
• Vendor expertise/support
http://www.inspirelancs.org.uk/interested-in-volunteering-family-carers-volunteers-wanted/
Role of Goobi
• Overall management & tracking of
processes
• Initiates ingest into Preservica, our DAM
• Reporting & statistics
Role of humans
• Working at volume did not imply more staff; it
implied efficiency
• Also implied automation
• Human work was focussed on tasks machines
couldn't do
http://planetivy.com/gaming/25273/natural-selection-2-gaming-evolution-in-action/
System & process design
• High volume doesn’t imply use of many systems
• Requires design to be as simple as possible, with
as few moving parts as possible
• Processes need to be efficient & scalable, human
as well as system
http://www.nivenswealthstrategies.com/keeping-it-simple/
Partnership for scalable digitisation
• Relationship with Internet Archive digitising our
Library content
• High volume long term project
• Content harvested from Internet Archive website &
processed automatically
• Dedicated Goobi process for fully automated
harvesting
Harvesting from Internet Archive
• Goobi has a ‘repository’ of IA identifiers for
searching/harvesting.
• Goobi harvests data from Internet Archive
website.
• Content processed automatically, including
creation of METS & ALTO.
• Content stored in Preservica.
• DDS creates JSON for the player & pre-caches
some content.
• Content available in the player.
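The harvesting flow on this slide can be sketched as a simple stage tracker keyed on an Internet Archive identifier. This is an illustration only: the stage names, the HarvestItem class and the example identifier are assumptions, and the slides do not say which Internet Archive interface Goobi actually calls. The URL built here is the Internet Archive's public metadata endpoint, shown as one plausible starting point.

```python
from dataclasses import dataclass, field

# Illustrative stage names, in the order the slide describes them.
STAGES = [
    "harvested",
    "mets_alto_created",
    "stored_in_preservica",
    "player_json_created",
    "available_in_player",
]


@dataclass
class HarvestItem:
    """Tracks one item from the repository of IA identifiers."""
    ia_identifier: str
    completed: list[str] = field(default_factory=list)

    def next_stage(self):
        """Return the next outstanding stage, or None when finished."""
        if len(self.completed) < len(STAGES):
            return STAGES[len(self.completed)]
        return None

    def advance(self) -> str:
        """Mark the next stage as done and return its name."""
        stage = self.next_stage()
        if stage is None:
            raise ValueError(f"{self.ia_identifier} has already finished")
        self.completed.append(stage)
        return stage


def metadata_url(ia_identifier: str) -> str:
    """Build the public Internet Archive metadata URL for an item."""
    return f"https://archive.org/metadata/{ia_identifier}"
```

Keeping the stages in one ordered list means a report of where every item sits is a single pass over the queue, which fits the tracking and reporting role the deck gives to Goobi.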
Challenges – M&Ms
• Multi volume works
• No metadata to support their union
• Have to construct them manually, but process can
be simplified
• Time consuming, still to be fully automated
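One way the manual construction could be simplified is sketched below, assuming staff record by hand which volumes belong to which work and in what order. The row layout and all identifiers are hypothetical; the slides do not describe the format the Library actually uses.

```python
from collections import defaultdict


def build_multivolume_works(rows):
    """Group hand-curated rows into ordered multi-volume works.

    rows: iterable of (work_id, volume_number, volume_id) tuples,
    compiled manually because no metadata links the volumes.
    """
    works = defaultdict(list)
    for work_id, volume_number, volume_id in rows:
        works[work_id].append((volume_number, volume_id))
    # Sort each work's volumes by volume number and keep only the ids.
    return {
        work_id: [volume_id for _, volume_id in sorted(volumes)]
        for work_id, volumes in works.items()
    }
```

The human effort stays in compiling the rows; assembling and ordering the volumes is then mechanical.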
Challenges – Working with partners
• Changes to Internet Archive website broke our
harvesting
• For automated ftp to work 3rd parties need to
follow instructions
• Creation of JPEG2000 images/video
• Incorrect identifiers trip up processes
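A minimal sketch of catching incorrect identifiers before they reach an automated process, assuming each delivery arrives in a folder named after its identifier. The pattern used here (a lower-case "b" followed by eight digits) is a hypothetical rule for illustration, not a statement of the Library's identifier scheme.

```python
import re

# Hypothetical rule: a lower-case "b" followed by exactly eight digits.
IDENTIFIER_PATTERN = re.compile(r"b\d{8}")


def triage_deliveries(folder_names):
    """Split delivered folder names into accepted and rejected lists."""
    accepted, rejected = [], []
    for name in folder_names:
        if IDENTIFIER_PATTERN.fullmatch(name):
            accepted.append(name)
        else:
            # Rejected names go back to the supplier or to a human minder.
            rejected.append(name)
    return accepted, rejected
```

Rejecting a malformed name at the door is cheaper than letting it fail part-way through a workflow, where a person then has to work out what went wrong.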
Opportunities
• Working with IT, flexibility of virtualised
environment
• Working with Intranda, brings in vendor expertise
• Distributed system brings in feedback from many
users
• Small team simplifies decision making
• Success leads to success
Life cycle management
• Good place with regard to life cycle management
• Consistent processes based on common
workflows
• Goobi outputs consistent & predictable
• Unified data set easier to manage in the future
Has automation been successful?
• Yes, with a ‘but’
• Automation can be complex, easy to make
mistakes
• Automation requires metadata to be available
• Automated processes still require a human minder
The scale of things
Lessons learned
• Complexity vs simplicity
• Iterative approaches work but are time consuming
• Vendor support/input crucial when starting from
scratch
• Process design essential
Be bold. Sometimes
it’s the way we work
that has to change
Thank you
Questions now, questions later…?
Dave Thompson, Digital Curator
Wellcome Library
d.thompson@wellcome.ac.uk @d_n_t
http://wellcomelibrary.org/
