Archiving Web-Based #musetech for Institutional Memory

Out with the Old: Archiving Web-Based #musetech for Institutional Memory
SAMANTHA NORLING, DIGITAL COLLECTIONS MANAGER, NEWFIELDS
NOVEMBER 19, 2018

2
Out with the Old?
Archiving Web-Based #musetech for
Institutional Memory
Samantha Norling
Digital Collections Manager, Newfields
Museum Computer Network, November 15, 2018

3
What is Web
Archiving?
Web archiving is the process of collecting, preserving,
and enabling access to content on the Web, such as:
• Websites
• Blogs
• Social Media pages
For museums, web-based content may include:
• Exhibition-related websites
• Gallery interactives
• Tour applications
…all of which tend to be time-based.

4
Why Should
Museums
Archive the
Web?
• Meet records retention requirements or
capture important institutional content as
part of an archives program
• Document online events and/or public
responses to the museum and its activities
• Collect web-native resources within your
collection scope*
• Combat link rot
* Though this is not the focus of this presentation, there are many interesting projects being led by museums to
collect this material.

5
Who
Currently
Archives the
Web?
Source: Web Archiving: A 2016 Survey (National Digital Stewardship Alliance)

6
What is being Archived?

7
Types of Web
Archiving
• Client-side
• Archiving any site that is freely
available on the web
• Crawlers start from a “seed” page and
navigate through hyperlinks, based on
parameters set
• Capture documents, text pages, data,
images, audio and video files
• Can be blocked by robots.txt
exclusions on websites
Source: Web Archiving Guidance, National Archives UK (2011)
http://www.nationalarchives.gov.uk/documents/information-management/web-archiving-guidance.pdf
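The seed-and-follow behavior described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not how any particular crawler is implemented: the `fetch` callable, the `example-archiver` user agent, and the `max_pages` limit are hypothetical names chosen for the example. Only the standard library is used.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.robotparser import RobotFileParser


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed, fetch, robots_txt="", max_pages=100, user_agent="example-archiver"):
    """Breadth-first crawl from `seed`, staying on the seed's host.

    `fetch` is a caller-supplied function (hypothetical) that maps a URL to
    its HTML text, or None if the page cannot be retrieved.
    """
    robots = RobotFileParser()
    robots.parse(robots_txt.splitlines())
    host = urlparse(seed).netloc
    queue, seen, captured = deque([seed]), {seed}, {}
    while queue and len(captured) < max_pages:
        url = queue.popleft()
        if not robots.can_fetch(user_agent, url):
            continue  # page is excluded by robots.txt
        html = fetch(url)
        if html is None:
            continue
        captured[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href).split("#")[0]  # resolve and drop fragment
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return captured
```

The scope parameters here are deliberately simple (same host, page limit). Production crawlers such as Heritrix expose far richer scoping rules and write their captures to WARC files rather than a dictionary.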

8
Types of Web
Archiving
• Transaction-based
• Requires access to the web server
• Records transactions between users of
a site and the server
• Content that is never viewed will never
be archived
• Record exactly what was seen and
when
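One way to picture transaction-based capture is a thin layer on the web server that logs every request together with the response actually sent. The sketch below is a hypothetical WSGI middleware (the class name `TransactionRecorder` and the record fields are assumptions made for illustration, not a specific product). It buffers each response in memory, which a real deployment would likely avoid.

```python
from datetime import datetime, timezone


class TransactionRecorder:
    """WSGI middleware that records each request/response pair it serves."""

    def __init__(self, app):
        self.app = app
        self.records = []  # a real system would write to durable storage

    def __call__(self, environ, start_response):
        captured = {}

        def recording_start_response(status, headers, exc_info=None):
            # Remember what the application sent, then pass it through.
            captured["status"] = status
            captured["headers"] = list(headers)
            return start_response(status, headers, exc_info)

        # Buffer the full response body so it can be stored with the request.
        body = b"".join(self.app(environ, recording_start_response))
        self.records.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "method": environ.get("REQUEST_METHOD", "GET"),
            "path": environ.get("PATH_INFO", "/"),
            "status": captured.get("status"),
            "headers": captured.get("headers", []),
            "body": body,
        })
        return [body]
```

Because a record is only created when a request arrives, pages that no visitor ever asks for leave no trace, which is the limitation noted on the slide.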

9
Types of Web
Archiving
• Server-side
• Direct copy of files from web server
• Challenge to make the content usable
as a navigable archived site
• Dynamically-generated content difficult
to recreate
• Faithful reproduction requires
replication of the environment in which
the live site was run
• Benefit: archive parts of sites that are
inaccessible to web crawlers
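The "direct copy" step can be sketched as copying the web root and recording a checksum for every file, so the copy can be verified later. This is a minimal illustration that assumes the site's files are reachable as an ordinary directory. The function name `copy_with_manifest` is hypothetical, and the sketch does nothing about databases or the server environment needed to make dynamic content usable.

```python
import hashlib
import shutil
from pathlib import Path


def copy_with_manifest(source_dir, archive_dir):
    """Copy a web root and return {relative_path: sha256} for the copied files."""
    source, target = Path(source_dir), Path(archive_dir)
    shutil.copytree(source, target)  # raises an error if target already exists
    manifest = {}
    for path in sorted(target.rglob("*")):
        if path.is_file():
            rel = path.relative_to(target).as_posix()
            manifest[rel] = hashlib.sha256(path.read_bytes()).hexdigest()
    return manifest
```

The manifest supports fixity checks on the copied files, but it does not address the harder problem named above: reproducing the environment in which the live site ran.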

10
Client-side web archiving in a
(super simplified) nutshell:

11
THE
WAYBACK
MACHINE
WEB ARCHIVING TOOLS AND
SERVICES
https://archive.org/web

12
ARCHIVE-IT
SERVICES
https://archive-it.org

13
NYARC
https://www.nyarc.org/content/web-archiving
ARCHIVE-IT PARTNER

14
Local Web-Archiving Tool Usage

15
WEB
CRAWLERS
• HTTrack (file directory)
https://www.httrack.com/
• Heritrix
https://github.com/internetarchive/heritrix3
• Warcprox
http://github.com/internetarchive/warcprox
• wget
https://www.petekeen.net/archiving-websites-with-wget
SERVICES
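As an illustration of driving one of these crawlers from a script, the sketch below builds a wget command line for mirroring a site while also writing a WARC file. The flags are standard GNU wget options (WARC output requires wget 1.14 or later). The function name, the example URL, and the choice of a one-second wait are assumptions for the example, not settings taken from the presentation.

```python
def build_wget_command(url, warc_name):
    """Return a wget argument list that mirrors `url` and also writes a WARC file."""
    return [
        "wget",
        "--mirror",            # recursive download with timestamping
        "--page-requisites",   # also fetch images, CSS, and scripts
        "--convert-links",     # rewrite links so the local copy is browsable
        "--adjust-extension",  # add .html to saved pages where needed
        "--no-parent",         # stay at or below the starting path
        "--wait=1",            # pause one second between requests
        f"--warc-file={warc_name}",
        url,
    ]
```

The returned list can be passed to `subprocess.run(cmd, check=True)` on a machine with wget installed. With the hypothetical arguments `"https://example.org/exhibition/"` and `"exhibition-2018"`, wget would write a compressed WARC alongside the mirrored file directory.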

16
WEB
ARCHIVE
PLAYERS
• OpenWayback
https://github.com/iipc/openwayback/wiki
• pywb (Python Wayback)
https://github.com/webrecorder/pywb
• oldweb.today
http://oldweb.today/
• Webrecorder Player
https://github.com/webrecorder/webrecorder-player
SERVICES

17
WEBRECORDER
SERVICES
https://webrecorder.io (online)
https://github.com/webrecorder/webrecorder (local)

18
CASE STUDY
INDIANAPOLIS MUSEUM OF ART AT NEWFIELDS WEB ARCHIVES

19
SETTING THE
STAGE
A PLACE FOR NATURE & THE ARTS
Sutphin Mall at Newfields, with Robert Indiana’s LOVE
and Roy Lichtenstein’s Five Brushstrokes.

22
RESULTS
• 13,663 pages total
• 7,327 IMA Blog pages
• 2,003 JPG images
• 455 PDF files
• 3 mp3 recordings
• 828 HTTP Status 404
• 2 The Island residency blogs
• Exhibition microsites
Site Spider, Mark II

23
HTTRACK
WEB ARCHIVING TESTS

24
HTTRACK
WEB ARCHIVING TESTS

25
ARCHIVE-IT
WEB ARCHIVING TESTS
https://archive-it.org

26
WEBRECORDER
SERVICES
https://webrecorder.io (online)
https://github.com/webrecorder/webrecorder (local)

28
Retroactive Web Archiving – IMA Archives

29
Guidelines for
Preservable
Websites
• Provide a standard link to all website content
• Avoid proprietary formats for important content,
especially the home page
• Include a user and/or site map
• Omit robots.txt exclusions, or limit them to areas of
the site not needed for archiving
• Maintain stable URLs and redirect when necessary
• Correctly identify character set encoding
Source: Columbia University Libraries
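Of these guidelines, the site map is the easiest to automate. The sketch below generates an XML sitemap in the sitemaps.org format so that a crawler has a standard link to every page. It is a minimal illustration: the function name `build_sitemap` and the example URLs are hypothetical, and it requires Python 3.8 or later for the `xml_declaration` argument.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return UTF-8 encoded sitemap XML listing each URL in `urls`."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        # Each page gets a <url><loc>...</loc></url> entry.
        ET.SubElement(ET.SubElement(urlset, "url"), "loc").text = url
    return ET.tostring(urlset, encoding="utf-8", xml_declaration=True)
```

Calling `build_sitemap(["https://example.org/", "https://example.org/exhibitions/"])` returns bytes that can be saved as `sitemap.xml` at the site root. The explicit UTF-8 declaration also helps with the character-encoding guideline.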

30
Samantha Norling
Digital Collections Manager
snorling@discovernewfields.org
@SamiNorling
Contact
Information
