Digging into the Web Archive 
at the British Library 
Andrew Jackson 
UK Web Archive Technical Lead
Collections & Scale 
• Three collections: 
– By permission (2004-2013) 
• c. 200 million URLs 
– Legal Deposit (2013 onwards) 
• c. 2 billion URLs/year (30TB/y) 
– JISC/IA Historical (1996-2013) 
• c. 6 billion URLs (57TB) 
• Use data-mining to support: 
– Access 
– Search 
– Preservation 
– Web science 
www.bl.uk 2
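As a rough sanity check on the scale figures above, the quoted totals imply an average stored size per URL. This is a minimal sketch that assumes decimal terabytes (10^12 bytes) and treats the rounded "c." figures as exact. The slide does not say whether the sizes are compressed, so the results are only order-of-magnitude estimates. The helper name `mean_bytes_per_url` is illustrative.

```python
# Assumption: 1 TB = 10**12 bytes (decimal); the slide does not specify.
TB = 10**12

# (approximate URL count, approximate total size in bytes), taken from the slide.
collections = {
    "Legal Deposit (per year)": (2_000_000_000, 30 * TB),
    "JISC/IA Historical (1996-2013)": (6_000_000_000, 57 * TB),
}


def mean_bytes_per_url(url_count, total_bytes):
    """Average number of stored bytes per archived URL."""
    return total_bytes / url_count


for name, (url_count, total_bytes) in collections.items():
    kilobytes = mean_bytes_per_url(url_count, total_bytes) / 1000
    print(f"{name}: ~{kilobytes:.1f} kB per URL")
```

Under these assumptions the Legal Deposit crawl averages about 15 kB per URL and the historical collection about 9.5 kB per URL.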
Single-Item Retrieval 
Web Archive Architecture 
Search & Analytical Access 
• ‘Title-level’ search: 
– Millions of homepages found via metadata 
• Full-text search: 
– Billions of resources 
– Dedicated faceted search service 
• Analytical access: 
– Combine faceted full-text search with: 
• Trend analysis 
• Visualisation tools 
– Working with modern historians to drive development 
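The trend-analysis idea above can be sketched as follows. This is an illustration only, not the archive's actual search service: it assumes the input is a list of hypothetical `(year, text)` records and that matching is plain case-insensitive substring search. Counts are divided by the number of documents per year because the collections grow over time, so raw counts alone would mostly reflect crawl size.

```python
from collections import Counter


def yearly_trend(records, term):
    """Return {year: fraction of that year's documents that contain term}."""
    totals = Counter()  # documents seen per year
    hits = Counter()    # documents per year that mention the term
    needle = term.lower()
    for year, text in records:
        totals[year] += 1
        if needle in text.lower():
            hits[year] += 1
    return {year: hits[year] / totals[year] for year in sorted(totals)}


# Hypothetical sample records, invented purely for illustration.
sample = [
    (2004, "Tony Blair visits a school"),
    (2004, "Weather report for the weekend"),
    (2008, "Gordon Brown presents the budget"),
    (2008, "Speech by Gordon Brown"),
]
print(yearly_trend(sample, "gordon brown"))
```

A faceted full-text index can supply the per-year totals and hit counts directly; the loop here simply makes the normalisation step explicit.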
Longitudinal Analysis (Prime Ministers) 
Embedded Licenses 
Secondary Datasets 
• Facts about content, including: 
– Crawl index 
– Geo-index 
– Format profiles 
– Link graphs 
• Facilitate independent research 
• Can be made available under CC0 
• Hosted at http://data.webarchive.org.uk/opendata/ 
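To illustrate what a host-level link graph records, here is a minimal sketch that aggregates page-to-page links into host-to-host edge counts. The input format (pairs of source and target URLs) is an assumption made for illustration; the slides do not describe the layout of the published dataset, and `host_link_graph` is a hypothetical helper name.

```python
from collections import Counter
from urllib.parse import urlsplit


def host_link_graph(links):
    """Count links between distinct hosts from (source_url, target_url) pairs."""
    edges = Counter()
    for source_url, target_url in links:
        src = urlsplit(source_url).hostname  # lower-cased host, or None
        dst = urlsplit(target_url).hostname
        # Skip malformed URLs and links that stay within one host.
        if src and dst and src != dst:
            edges[(src, dst)] += 1
    return edges


# Hypothetical sample links, invented purely for illustration.
sample_links = [
    ("http://www.example.ac.uk/a.html", "http://www.bl.uk/"),
    ("http://www.example.ac.uk/b.html", "http://WWW.BL.UK/catalogue"),
    ("http://www.example.ac.uk/b.html", "http://www.example.ac.uk/a.html"),
]
print(host_link_graph(sample_links))
```

Collapsing URLs to hosts shrinks the graph a great deal, which is one reason derived datasets of this kind are easier to share than the crawled content itself.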
Exploring Links Between Hosts 
Courtesy of Peter Webster, Rainer Simon and Jules Mataly 
Links From 1996 
Top-Level Links Over Time 
Access Service Spectrum 
• Single-item retrieval 
• ‘Title-level’ search 
• Full-text search 
• Analytics & visualisation (at full scale) 
• Secondary datasets 
• Remote analysis of datasets (an API, e.g. SPARQL) 
• Full computational access service (internal only right now) 
• Not just the web archive? 
Thank you! 
Email: Andrew.Jackson@bl.uk 
Twitter: @anjacks0n 
UK Web Archive: 
http://www.webarchive.org.uk 
Blog: 
http://britishlibrary.typepad.co.uk/webarchive/ 
Twitter: @ukwebarchive 

Digging into the Web Archive at the British Library 2014-11-27

Editor's Notes

  • #2: 20-minute presentation on: web archive activities and status; usage etc. stats in talk for LDLs; including research projects for Sept 19th. Should also include document harvest, as they are expecting to take that on too (CEDAR?).
  • #5: Well, it’s just a bunch of files.
  • #8: Supporting research has helped us improve how we crawl material. Also helps us understand the collection, so we can take care of it. And furthermore, helps us understand how to leverage this collection.
  • #10: Paper accepted for WebSci’14: Mapping the UK Webspace: Fifteen Years of British Universities on the Web MA thesis by Jules Mataly: The Three Truths of Margaret Thatcher: Creating and Analysing Link graph mostly explored secondary dataset 1996 visualisation of the UK web by Rainer Simon Peter Wester’s analysis and visualisation of hosts within the dataset over time link to Rowan Williams’ website as archbishop of Canterbury, and how this relates to the public reaction to his lecture in February 2008 on the interaction between English family law and Islamic shari’a law Student from HK