Out with the Old: Archiving Web-Based #musetech for Institutional Memory
SAMANTHA NORLING, DIGITAL COLLECTIONS MANAGER, NEWFIELDS
NOVEMBER 19, 2018
2
Out with the Old?
Archiving Web-Based #musetech for Institutional Memory
Samantha Norling
Digital Collections Manager, Newfields
Museum Computer Network, November 15, 2018
3
What is Web Archiving?
Web archiving is the process of collecting, preserving, and enabling access to content on the Web, such as:
• Websites
• Blogs
• Social media pages
For museums, web-based content may include:
• Exhibition-related websites
• Gallery interactives
• Tour applications
…all of which tend to be time-based.
4
Why Should Museums Archive the Web?
• Meet records retention requirements or capture important institutional content as part of an archives program
• Document online events and/or public responses to the museum and its activities
• Collect web-native resources within your collection scope*
• Combat link rot
* Though this is not the focus of this presentation, there are many interesting projects being led by museums to collect this material.
5
Who Currently Archives the Web?
Source: Web Archiving: A 2016 Survey (National Digital Stewardship Alliance)
6
Source: Web Archiving: A 2016 Survey (National Digital Stewardship Alliance)
What is being Archived?
7
Types of Web Archiving
• Client-side
  • Archiving any site that is freely available on the web
  • Crawlers start from a “seed” page and navigate through hyperlinks, based on parameters set
  • Capture documents, text pages, data, images, audio and video files
  • Can be blocked by robots.txt exclusions on websites
Source: Web Archiving Guidance, National Archives UK (2011)
http://www.nationalarchives.gov.uk/documents/information-management/web-archiving-guidance.pdf
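The seed-and-follow behaviour described on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how Heritrix or any production crawler works: the `crawl` function, its `max_depth` scope parameter, and the in-memory `SITE` dictionary standing in for a live website are all invented for the example.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed, fetch, max_depth=2):
    """Breadth-first crawl from a seed URL, staying on the seed's host.

    `fetch` is any callable that returns the HTML for a URL, or None if
    the page cannot be retrieved. `max_depth` is a hypothetical scope
    parameter: how many link hops from the seed the crawler may travel.
    """
    host = urlparse(seed).netloc
    captured = {}
    queue = deque([(seed, 0)])
    seen = {seed}
    while queue:
        url, depth = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        captured[url] = html
        if depth == max_depth:
            continue  # scope limit reached: keep the page, ignore its links
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            link, _fragment = urldefrag(urljoin(url, href))
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return captured


# A made-up three-page site standing in for a live web server.
SITE = {
    "http://example.org/": '<a href="/blog/">Blog</a> <a href="http://other.example/">Out</a>',
    "http://example.org/blog/": '<a href="post-1.html">Post</a>',
    "http://example.org/blog/post-1.html": "<p>Hello</p>",
}

if __name__ == "__main__":
    for url in crawl("http://example.org/", SITE.get):
        print(url)
```

Passing the fetch function in as an argument keeps the sketch runnable offline; a real crawl would replace `SITE.get` with an HTTP client and would also write each response to an archival format rather than a dictionary.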
8
Types of Web Archiving
• Transaction-based
  • Requires access to the web server
  • Records transactions between users of a site and the server
  • Content that is never viewed will never be archived
  • Records exactly what was seen and when
Source: Web Archiving Guidance, National Archives UK (2011)
http://www.nationalarchives.gov.uk/documents/information-management/web-archiving-guidance.pdf
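One way to picture transaction-based archiving is as a thin layer on the web server that keeps a copy of each response as it is served. The sketch below assumes a Python WSGI application; the `TransactionRecorder` class, the `demo_app` site, and the `visit` helper are hypothetical names made up for illustration, and an in-memory list stands in for durable storage.

```python
from datetime import datetime, timezone


class TransactionRecorder:
    """WSGI middleware that keeps a copy of every response it serves."""

    def __init__(self, app):
        self.app = app
        self.records = []  # a real archive would use durable storage

    def __call__(self, environ, start_response):
        captured = {}

        def recording_start_response(status, headers, exc_info=None):
            captured["status"] = status
            return start_response(status, headers, exc_info)

        body = b"".join(self.app(environ, recording_start_response))
        self.records.append({
            "when": datetime.now(timezone.utc).isoformat(),
            "path": environ.get("PATH_INFO", ""),
            "status": captured.get("status"),
            "body": body,
        })
        return [body]


def demo_app(environ, start_response):
    """A stand-in website with two pages."""
    pages = {"/": b"Home", "/about": b"About the museum"}
    body = pages.get(environ.get("PATH_INFO", "/"))
    if body is None:
        start_response("404 Not Found", [("Content-Type", "text/plain")])
        return [b"Not found"]
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [body]


def visit(app, path):
    """Simulate one user request without starting a server."""
    environ = {"PATH_INFO": path, "REQUEST_METHOD": "GET"}
    return b"".join(app(environ, lambda status, headers, exc_info=None: None))


if __name__ == "__main__":
    archive = TransactionRecorder(demo_app)
    visit(archive, "/")
    # "/about" was never requested, so it never enters the archive.
    print([record["path"] for record in archive.records])
```

The sketch also shows the trade-off named on the slide: only pages that someone actually requests are recorded, each with the time it was served.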
9
Types of Web Archiving
• Server-side
  • Direct copy of files from web server
  • Challenge to make the content usable as a navigable archived site
  • Dynamically generated content is difficult to recreate
  • Faithful reproduction requires replication of the environment in which the live site was run
  • Benefit: archive parts of sites that are inaccessible to web crawlers
Source: Web Archiving Guidance, National Archives UK (2011)
http://www.nationalarchives.gov.uk/documents/information-management/web-archiving-guidance.pdf
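A server-side capture can be as simple as copying the site's files and recording a checksum for each one so the copy can be verified later. The following is a minimal sketch under that assumption; `copy_with_manifest` and the `htdocs` directory name are hypothetical, and the checksum manifest is an addition for illustration rather than something the guidance prescribes.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def copy_with_manifest(document_root, destination):
    """Copy a site's files and return {relative_path: sha256_hex}.

    `destination` must not exist yet; shutil.copytree creates it.
    """
    document_root = Path(document_root)
    destination = Path(destination)
    shutil.copytree(document_root, destination)
    manifest = {}
    for path in sorted(destination.rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            manifest[path.relative_to(destination).as_posix()] = digest
    return manifest


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp) / "htdocs"
        root.mkdir()
        # The copy holds the script's source, not the page a visitor saw.
        (root / "index.php").write_text("<?php echo date('Y'); ?>")
        print(copy_with_manifest(root, Path(tmp) / "copy"))
```

Note what the copy contains: the PHP source rather than the rendered page, which is why faithful replay depends on recreating the environment the live site ran in.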
10
Client-side web archiving in a (super simplified) nutshell:
11
THE WAYBACK MACHINE
WEB ARCHIVING TOOLS AND SERVICES
https://archive.org/web
12
ARCHIVE-IT
WEB ARCHIVING TOOLS AND SERVICES
https://archive-it.org
13
NYARC
https://www.nyarc.org/content/web-archiving
ARCHIVE-IT PARTNER
14
Source: Web Archiving: A 2016 Survey (National Digital Stewardship Alliance)
Local Web-Archiving Tool Usage
15
WEB CRAWLERS
• HTTrack (file directory)
https://www.httrack.com/
• Heritrix
https://github.com/internetarchive/heritrix3
• Warcprox
http://github.com/internetarchive/warcprox
• wget
https://www.petekeen.net/archiving-websites-with-wget
WEB ARCHIVING TOOLS AND SERVICES
16
WEB ARCHIVE PLAYERS
• OpenWayback
https://github.com/iipc/openwayback/wiki
• pywb (Python Wayback)
https://github.com/webrecorder/pywb
• oldweb.today
http://oldweb.today/
• Webrecorder Player
https://github.com/webrecorder/webrecorder-player
WEB ARCHIVING TOOLS AND SERVICES
17
WEBRECORDER
WEB ARCHIVING TOOLS AND SERVICES
https://webrecorder.io (online)
https://github.com/webrecorder/webrecorder (local)
18
CASE STUDY
INDIANAPOLIS MUSEUM OF ART AT NEWFIELDS WEB ARCHIVES
19
SETTING THE STAGE
A PLACE FOR NATURE & THE ARTS
Sutphin Mall at Newfields, with Robert Indiana’s LOVE and Roy Lichtenstein’s Five Brushstrokes.
20
imamuseum.org
21
Site Spider, Mark II
22
RESULTS
• 13,663 pages total
• 7,327 IMA Blog pages
• 2,003 JPG images
• 455 PDF files
• 3 mp3 recordings
• 828 HTTP Status 404
• 2 The Island residency blogs
• Exhibition microsites
Site Spider, Mark II
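Counts like those above can be produced by tallying a spider's output by HTTP status, content type, and site section. The sketch below assumes the crawl report can be reduced to `(url, status, content_type)` rows; that row format, the `summarize` function, and the sample `ROWS` are invented for illustration and are not Site Spider's actual export or the IMA crawl data.

```python
from collections import Counter
from urllib.parse import urlparse


def summarize(rows):
    """Tally crawl rows by HTTP status, content type, and first path segment."""
    by_status = Counter()
    by_type = Counter()
    by_section = Counter()
    for url, status, content_type in rows:
        by_status[status] += 1
        if status == 200:
            by_type[content_type] += 1
            segments = [s for s in urlparse(url).path.split("/") if s]
            by_section[segments[0] if segments else "(root)"] += 1
    return by_status, by_type, by_section


# Invented rows for illustration only; not the IMA crawl data.
ROWS = [
    ("http://example.org/", 200, "text/html"),
    ("http://example.org/blog/post-1", 200, "text/html"),
    ("http://example.org/blog/post-2", 200, "text/html"),
    ("http://example.org/media/map.pdf", 200, "application/pdf"),
    ("http://example.org/old-page", 404, "text/html"),
]

if __name__ == "__main__":
    by_status, by_type, by_section = summarize(ROWS)
    print(by_status, by_type, by_section, sep="\n")
```

Grouping by the first path segment is one simple way to surface sections such as a blog or an exhibition microsite that might otherwise be overlooked when scoping an archive.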
23
HTTRACK
WEB ARCHIVING TESTS
https://www.httrack.com/
24
HTTRACK
WEB ARCHIVING TESTS
https://www.httrack.com/
25
ARCHIVE-IT
WEB ARCHIVING TESTS
https://archive-it.org
26
WEBRECORDER
WEB ARCHIVING TOOLS AND SERVICES
https://webrecorder.io (online)
https://github.com/webrecorder/webrecorder (local)
27
webrecorder.io/imamuseum
28
Retroactive Web Archiving – IMA Archives
29
Guidelines for Preservable Websites
• Provide a standard link to all website content
• Avoid proprietary formats for important content, especially the home page
• Include a user and/or site map
• Omit robots.txt exclusions, or limit them to areas of the site not needed for archiving
• Maintain stable URLs and redirect when necessary
• Correctly identify character set encoding
Source: Columbia University Libraries
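The robots.txt guideline can be checked before a crawl is scheduled. The sketch below uses Python's standard `urllib.robotparser` to list which paths a given robots.txt would block for a crawler; the `blocked_paths` helper and the sample `ROBOTS_TXT` are made up for the example, and the user agent string `archive.org_bot` is used here only as an illustrative value.

```python
from urllib.robotparser import RobotFileParser


def blocked_paths(robots_txt, user_agent, paths):
    """Return the paths that the given robots.txt text forbids for user_agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [p for p in paths if not parser.can_fetch(user_agent, p)]


# A made-up robots.txt that excludes only an admin area.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
"""

if __name__ == "__main__":
    wanted = ["/", "/exhibitions/2018/", "/admin/login"]
    print(blocked_paths(ROBOTS_TXT, "archive.org_bot", wanted))
```

A robots.txt like the sample follows the guideline: the exclusion covers an area that is not needed for archiving, while exhibition pages stay open to the crawler.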
30
Contact Information
Samantha Norling
Digital Collections Manager
snorling@discovernewfields.org
@SamiNorling

Editor's Notes

  • #3: Set the stage for the presentation here. Introduction, context (not an expert in web archiving, but a former archivist who found myself in a position to need to react quickly in anticipation of a major loss with the retirement of the previous site). Following the successful archiving of that site, now pursuing an expanded web archiving program, primarily of institutionally-created content, both current and past, but also including some external content.
  • #5: Essentially, web archiving can help you to maintain a reliable point of access to your most important records.
  • #8: The robots exclusion standard is a tool used by a webmaster to direct a web crawler not to crawl all or specified parts of their website. The webmaster places their request in the form of a robots.txt file that is easily found on their website (ex. example.com/robots.txt). Archive-It (like Google and most other search engines) uses a robot to crawl and archive web pages.
  • #11: A web crawler (also known as a web spider or web robot) is a program or automated script that browses the World Wide Web in a methodical, automated manner, harvesting content from web servers. This process is called web crawling or spidering. Many legitimate sites, in particular search engines, use spidering as a means of providing up-to-date data. Crawlers used for web archiving navigate through live websites and save what they download in the WARC (Web ARChive) format, which provides an archival snapshot of the live website at the time of the crawl. “Replay technologies” allow for viewing of WARC files. (Graphic shows going from WARC to live website.)
  • #12: Launched in 2001, major collecting activity of the Internet Archive. 341 Billion web pages saved over time – 25 petabytes.
  • #13: Find image to represent Archive-It. Launched in 2006, Archive-It is a subscription-based web archiving service with over 400 partner (subscriber) organizations, including the New York Art Resources Consortium (NYARC).
  • #14: NYARC - The New York Art Resources Consortium (NYARC) consists of the research libraries and archives of three leading art museums in New York City: The Brooklyn Museum, The Frick Collection, and The Museum of Modern Art. With funding from The Andrew W. Mellon Foundation, NYARC was formed in 2006 to facilitate collaboration that results in enhanced resources to research communities.
  • #16: List: HTTrack, Heritrix, warcprox, wget
  • #17: List: Webrecorder Player, Wayback Machine, OpenWayback, pywb (Python Wayback), oldweb.today (check that all of these are still active, and provide URLs)
  • #20: Tascha and Sami introductions.
  • #21: Known content on imamuseum.org included the main site and its sub-pages, as well as the IMA Blog (active 2007 – 2014)
  • #23: Find image to represent warc players. List: Webrecorder Player, Wayback Machine, OpenWayback, pywb (Python Wayback), oldweb.today (check that all of these are still active, and provide URLs)
  • #30: https://library.columbia.edu/bts/web_resources_collection/guidelines_for_preservable_websites.html