Archive-It: Scaling Beyond a Billion Archival Web-pages
Aaron Binns, Internet Archive
aaron@archive.org, 2011-10-19
My Background
§    Aaron Binns (aaron@archive.org)
§    Internet Archive
§    Senior Software Engineer
§    Full-text search & cool stuff
   •  Full-text search
   •  Hadoop
   •  “Big Data”
•  http://github.com/aaronbinns




Internet Archive
§    Universal access to all knowledge
§    http://archive.org
§    Founded 1996
§    501(c)(3) non-profit org.
§    Digital Library
§    San Francisco, CA, USA
§    7+ PB of publicly accessible digital materials
      –  Web archive
      –  Books, music, video, etc.


§  http://web.archive.org
§  165,000,000,000+ archived web pages
   –  HTML
   –  Images
   –  CSS
   –  JavaScript
   –  Multimedia
§  1996-today


http://archive-it.org
§  Subscription web archiving service
   –  Select websites to harvest, frequency, depth
   –  Crawling/Harvesting
   –  Wayback
   –  Full-text search
§  Customers
   –  Public, State & University Libraries
   –  Local governments
   –  Museums
   –  Non-Governmental Organizations (NGOs)


Collections & Documents
§  Collection
   –  Web harvest configuration
       •  URLs to crawl
       •  Frequency & depth
   –  Set of documents archived
       •  Access via Wayback Machine
       •  Full-text search
§  Document
   –  Unique version of a URL
   –  “Text” documents: HTML, PDF, Office, etc.


Archive-It: Collection
[image-only slide]

Archive-It: Wayback
[image-only slide]

Archive-It: Replay
[image-only slide, labeled July 27, 2002 and Sept 15, 2011]

Archive-It: Search
[two image-only slides]




Challenges and....Solutions?

§  Scale
§  Archival web search != web search
§  Document formats
  –  HTML (1996....2011)
  –  PDF, Office, text, etc.
§  English, Français, Español, 漢字, …
§  Diversity
§  Time


Scale

§  200+ customers
§  2,272 collections
   –  Largest: 33,470,659 documents
   –  24 collections, 10,000,000+ docs
   –  250 collections, 1,000,000+ docs
§  Total:
     –  1,375,473,187 unique documents



Scale...each day

§  30-40 simultaneous crawls/harvests
§  ~150GB of data: HTML, images, media
§  ~1.3 million new unique documents
   –  New URLs never seen before
   –  New versions of URLs
§  ~1.3 million updates
   –  Documents unchanged
   –  New crawl dates
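
The split between new unique documents and updates can be sketched in code. This is a minimal illustration, not Archive-It's implementation: it assumes a document is identified by its URL plus a content digest, and the class CaptureIndexSketch, its methods, and the digest itself are all hypothetical. A capture with an unseen key adds a document; a capture with a known key only adds a crawl date.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: keying on URL + content digest is an assumption.
class CaptureIndexSketch {
    // One entry per unique version of a URL, mapped to all of its crawl dates.
    private final Map<String, List<String>> crawlDatesByVersion = new LinkedHashMap<>();

    private static String key(String url, String digest) {
        // Assumes URLs carry no raw spaces, so a space is a safe separator.
        return url + " " + digest;
    }

    /** Records one capture; returns true if it introduced a new unique document. */
    boolean record(String url, String digest, String crawlDate) {
        List<String> dates = crawlDatesByVersion.get(key(url, digest));
        boolean isNew = (dates == null);
        if (isNew) {
            dates = new ArrayList<>();
            crawlDatesByVersion.put(key(url, digest), dates);
        }
        dates.add(crawlDate);
        return isNew;
    }

    int uniqueDocuments() {
        return crawlDatesByVersion.size();
    }

    List<String> crawlDates(String url, String digest) {
        List<String> dates = crawlDatesByVersion.get(key(url, digest));
        return dates == null
                ? Collections.<String>emptyList()
                : Collections.unmodifiableList(dates);
    }

    public static void main(String[] args) {
        CaptureIndexSketch index = new CaptureIndexSketch();
        index.record("http://example.org/", "digest-a", "2011-09-01"); // new URL
        index.record("http://example.org/", "digest-a", "2011-09-08"); // unchanged: new crawl date only
        index.record("http://example.org/", "digest-b", "2011-09-15"); // new version of the URL
        System.out.println(index.uniqueDocuments() + " unique documents");
        System.out.println(index.crawlDates("http://example.org/", "digest-a"));
    }
}
```

In this sketch an update never touches the document's text, which is one plausible reason updates can be cheaper than additions.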



Architecture
§  Offline indexing
   –  10 dedicated indexing machines
   –  ~10% of collections per machine
   –  Add new documents
   –  Update existing documents with new dates
   –  1 dual-core CPU, 4 GB RAM, 3 x 2 TB disks
§  Search service
   –  11 machines: 1 master, 10 slaves
   –  ~10% of collections per slave
   –  1 collection → 1 Lucene index
   –  1 dual-core CPU, 8 GB RAM, 3 x 2 TB disks
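
One way to get roughly 10% of collections onto each machine is plain modulo placement. The sketch below is only an illustration under stated assumptions: the slides do not say how collections are assigned, and the sequential integer collection IDs and the class CollectionPartitionSketch are hypothetical.

```java
// Hypothetical sketch: sequential integer collection IDs and modulo
// placement are assumptions; the slides do not describe the real assignment.
class CollectionPartitionSketch {
    /** Maps a collection ID to one of machineCount machines. */
    static int machineFor(int collectionId, int machineCount) {
        return Math.floorMod(collectionId, machineCount);
    }

    /** Counts how many of the IDs 1..collectionCount land on each machine. */
    static int[] collectionsPerMachine(int collectionCount, int machineCount) {
        int[] counts = new int[machineCount];
        for (int id = 1; id <= collectionCount; id++) {
            counts[machineFor(id, machineCount)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        // 2,272 collections and 10 machines are the figures from the slides.
        int[] counts = collectionsPerMachine(2272, 10);
        for (int m = 0; m < counts.length; m++) {
            System.out.println("machine " + m + ": " + counts[m] + " collections");
        }
    }
}
```

Balancing by count does not balance by size: the largest collection alone holds 33,470,659 documents, so a real assignment would probably need to weigh document counts as well.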

Diversity
[three image-only slides]




Field Collapsing / Grouping

§  Applied to web documents
       “Give me the best 1-2 hits from a site”
§  Lucene
   –  Grouping contrib package
§  Solr
   –  Field Collapsing
§  What is the performance cost?
§  Custom solution


Time

§  User experience & understanding
   –  Archival web search != web search
§  Information Architecture
   –  Publication date for web pages – difficult
§  Temporal diversity
   –  Multiple hits per site
   –  Multiple versions per URL




Time
[image-only slide]




Searching across collections

§  Search all collections of a user
§  Search arbitrary group of collections
§  1 collection → 1 Lucene index
     –  Search 100 collections....
     –  Search 100 indexes
§  Collections distributed over 10 searchers
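
With one index per collection, a query over 100 collections becomes 100 index searches whose results must be combined. The sketch below shows one simple way to do that combination: merge the per-index result lists and keep the k best by score. It is not the author's code; the Hit class, mergeTopK, and merging on raw score alone are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch: Hit, mergeTopK and the score-only merge are assumptions.
class FederatedMergeSketch {
    static final class Hit {
        final String collection;
        final String url;
        final double score;

        Hit(String collection, String url, double score) {
            this.collection = collection;
            this.url = url;
            this.score = score;
        }
    }

    /** Merges per-index result lists and keeps the k highest-scoring hits. */
    static List<Hit> mergeTopK(List<List<Hit>> resultsPerIndex, int k) {
        List<Hit> merged = new ArrayList<>();
        for (List<Hit> results : resultsPerIndex) {
            merged.addAll(results);
        }
        // Highest score first; the sort is stable, so ties keep arrival order.
        merged.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
        int limit = Math.max(0, Math.min(k, merged.size()));
        return new ArrayList<>(merged.subList(0, limit));
    }

    public static void main(String[] args) {
        List<Hit> first = new ArrayList<>();
        first.add(new Hit("collection-1", "http://example.org/a", 0.9));
        first.add(new Hit("collection-1", "http://example.org/b", 0.4));
        List<Hit> second = new ArrayList<>();
        second.add(new Hit("collection-2", "http://example.net/c", 0.7));

        List<List<Hit>> resultsPerIndex = new ArrayList<>();
        resultsPerIndex.add(first);
        resultsPerIndex.add(second);

        for (Hit hit : mergeTopK(resultsPerIndex, 2)) {
            System.out.println(hit.score + " " + hit.collection + " " + hit.url);
        }
    }
}
```

Merging on raw score is a simplification: each Lucene index computes scores from its own term statistics, so scores from separately built indexes are not guaranteed to be directly comparable.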




Custom Solutions

§  Java
§  Built on Lucene
§  Investigating Solr
   –  Capabilities
   –  Cost
§  Internet Archive
   –  Open Source
   –  Apache License
   –  http://github.com/aaronbinns


Custom Solutions: Indexing


§    http://github.com/aaronbinns/jbs
§    Archive-It & other archival web collections
§    Hadoop-based, or stand-alone
§    Java code with Lucene
      –  Hard-coded “schema” for web documents
      –  Title, body, keywords, date, mime-type, etc.
      –  Link analysis & curation to augment scoring



Custom Solutions: Searching

§  http://github.com/aaronbinns/tnh
§  Custom Java web application with Lucene
§  Federated search
   –  1 master, 10 slaves
   –  OpenSearch
§  Multiple collections & arbitrary grouping
§  CollapsingCollector




CollapsingCollector


§    http://github.com/aaronbinns/tnh
§    Extends Lucene Collector
§    Field cache: “site”
§    Retains top N hits per “site”
      –  Control N via URL parameter
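
The idea of retaining the top N hits per site can be sketched without Lucene. The code below is a plain-Java illustration, not the CollapsingCollector from the tnh repository: it uses no Lucene classes or field cache, takes the site as a field on each hit, and every name in it (CollapseBySiteSketch, Hit, collect, topHits) is hypothetical. It keeps a small min-heap per site and evicts the weakest hit whenever a site exceeds N.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

// Hypothetical sketch in plain Java: no Lucene classes are used.
class CollapseBySiteSketch {
    static final class Hit {
        final String site;
        final String url;
        final double score;

        Hit(String site, String url, double score) {
            this.site = site;
            this.url = url;
            this.score = score;
        }
    }

    private static final Comparator<Hit> LOWEST_SCORE_FIRST =
            Comparator.comparingDouble((Hit h) -> h.score);

    private final int maxPerSite;
    private final Map<String, PriorityQueue<Hit>> bestBySite = new HashMap<>();

    CollapseBySiteSketch(int maxPerSite) {
        if (maxPerSite < 1) {
            throw new IllegalArgumentException("maxPerSite must be at least 1");
        }
        this.maxPerSite = maxPerSite;
    }

    /** Called once per matching document, in any order. */
    void collect(Hit hit) {
        PriorityQueue<Hit> best = bestBySite.get(hit.site);
        if (best == null) {
            best = new PriorityQueue<Hit>(LOWEST_SCORE_FIRST);
            bestBySite.put(hit.site, best);
        }
        best.add(hit);
        if (best.size() > maxPerSite) {
            best.poll(); // evict the lowest-scoring hit for this site
        }
    }

    /** Returns the retained hits from all sites, best score first. */
    List<Hit> topHits() {
        List<Hit> hits = new ArrayList<>();
        for (PriorityQueue<Hit> best : bestBySite.values()) {
            hits.addAll(best);
        }
        hits.sort(LOWEST_SCORE_FIRST.reversed());
        return hits;
    }

    public static void main(String[] args) {
        CollapseBySiteSketch collector = new CollapseBySiteSketch(2);
        collector.collect(new Hit("a.example", "http://a.example/3", 0.7));
        collector.collect(new Hit("a.example", "http://a.example/1", 0.9));
        collector.collect(new Hit("b.example", "http://b.example/1", 0.5));
        collector.collect(new Hit("a.example", "http://a.example/2", 0.8));
        for (Hit hit : collector.topHits()) {
            System.out.println(hit.score + " " + hit.url);
        }
    }
}
```

Here N is a constructor argument; in the slides it is controlled through a URL parameter.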




Web Archives!
§  Archive-It
   –  http://archive-it.org/
§  US National Archives
   –  http://webharvest.gov/
§  UK Web Archive
   –  http://www.webarchive.org.uk/
   –  Solr-based
§  Web Archive of Catalonia / PADICAT
   –  Biblioteca de Catalunya
   –  http://www.padicat.cat/


