Web and Twitter Archiving at the Library of Congress

Web Archive Globalization Workshop
June 16, 2011

Nicholas Taylor (@nullhandle)
Web Archiving Team
Library of Congress

why archive the web?

• preserve our nation’s history and culture
• identify and preserve at-risk digital content
• develop tools, models, and methods for digital preservation

“World Wide Web 1997: 2 Terabytes in 63 Inches”

LIBRARY OF CONGRESS

various collection strategies

• entire web domain—Internet Archive
• national domain—Sweden, Denmark, others
• selective (individual URLs) and thematic—
Australia
• thematic or event-based—Library of Congress

http://netpreserve.org/about/archiveList.php


web archiving at LC

• began in 2000 with MINERVA pilot
• identify policy issues, establish best practices, build tools (internally and w/ partners)
• broaden expertise and understanding of web archiving within LC
• collect, manage and sustain at-risk digital content
LC Prints & Photographs: design for Minerva Congressional Library


Curators/Recommending Officers
In Library Services, Congressional Research Service, and the Law Library pick the collections and what URLs to archive, and research who to contact for permission.

Web Archiving Team and Web Preservation Engineering Team
In the Office of Strategic Initiatives (OSI). We are project managers and technical staff focused on capture, tools, and permissions.

Bibliographic Access
MODS records are created in Library Services: the Network Development and MARC Standards Office (NetDev) and Acquisitions and Bibliographic Access (ABA) staff do the cataloging.

Information Technology Office and Technical Architecture Team
Also in OSI. Supports Wayback Machine, Heritrix, repository and tools development, and data transfers. Contractors are also used in this area.


LC collections: over 245 TB + 5 TB/month
– ongoing collections, including:
• Congress/Legislative websites
• Legal Blawgs
• Public Policy Topics
– event-based collections, including:
• U.S. National Elections—2000, 2002, 2004, 2006, 2008, 2010
• Iraq War 2003-2009
• September 11, 2001 and September 11 Remembrance 2002
• Civil War Sesquicentennial
• Olympics 2002
• Supreme Court Nominations
• Papal Transition
• Case Studies: health care, terrorism, visual image content,
organizational Web sites, Crisis in Darfur, “single site”
– Overseas Operations collections, including: Egypt 2008; Brazilian, Indian,
Indonesian, Philippine, and Thai Elections; Afghanistan Government;
Pakistan Nationalisms

http://www.loc.gov/webarchiving/collections.html

web archives access: loc.gov/lcwa


essential tools

• capture: Heritrix (contract crawling w/ IA and
in-house)
• replay: Wayback
• permissions/seed management; capture quality
review; reporting; transfer tracking: custom
apps built on LAMP stack
• transfer: BagIt Library (based on BagIt spec);
*nix ingest/staging/storage/access servers;
Internet2 connection
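The transfer step packages content according to the BagIt spec: a payload directory plus a checksum manifest that lets the receiver verify the transfer. The following is a minimal Python sketch of that idea, not the BagIt Library named above; the function names, the SHA-256 choice, and the version string written to bagit.txt are assumptions made for illustration.

```python
import hashlib
from pathlib import Path


def make_bag(bag_dir: Path, payload: dict[str, bytes]) -> Path:
    """Write payload files under data/ plus the two basic BagIt tag files.

    Hypothetical helper: names and layout choices are illustrative only.
    """
    data_dir = bag_dir / "data"
    data_dir.mkdir(parents=True, exist_ok=True)

    manifest_lines = []
    for name, content in sorted(payload.items()):
        (data_dir / name).write_bytes(content)
        digest = hashlib.sha256(content).hexdigest()
        # Manifest format: checksum, whitespace, path relative to the bag root.
        manifest_lines.append(f"{digest}  data/{name}")

    # Bag declaration; the version string is an assumption for this sketch.
    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )
    (bag_dir / "manifest-sha256.txt").write_text(
        "\n".join(manifest_lines) + "\n", encoding="utf-8"
    )
    return bag_dir


def verify_bag(bag_dir: Path) -> bool:
    """Recompute each payload checksum and compare it with the manifest."""
    manifest = (bag_dir / "manifest-sha256.txt").read_text(encoding="utf-8")
    for line in manifest.splitlines():
        digest, rel_path = line.split(maxsplit=1)
        actual = hashlib.sha256((bag_dir / rel_path).read_bytes()).hexdigest()
        if actual != digest:
            return False
    return True
```

A receiving server would run something like verify_bag after the transfer completes and before moving content from staging to storage.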

other useful tools

• web archiving workflow management:
NetArchiveSuite, Web Curator Tool
• small-scale web archiving: HTTrack
• Firefox add-ons: Firebug, Web Developer


cataloging for access

• collection-level metadata
• site-level bibliographic metadata
– nominators provide subject heading
– HTML metadata extraction via cURL
– cataloger assigns keywords
• cataloging metadata stored in MODS
• assisted keyword assignment: HIVE
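The HTML metadata extraction step can be illustrated with a short sketch. The slide names cURL for fetching pages; the Python below is a stand-in that skips the fetch and only shows pulling the title and meta tags out of markup that has already been retrieved. The class name, function name, and sample page are hypothetical.

```python
from html.parser import HTMLParser


class MetadataExtractor(HTMLParser):
    """Collect the <title> text and <meta name=... content=...> pairs."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attr.get("name") and attr.get("content") is not None:
            # Meta names are case-insensitive in practice, so normalize them.
            self.meta[attr["name"].lower()] = attr["content"]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_metadata(html: str) -> dict:
    """Return the page title and named meta tags as a flat dictionary."""
    parser = MetadataExtractor()
    parser.feed(html)
    parser.close()
    return {"title": parser.title.strip(), **parser.meta}


# Hypothetical sample page, not taken from any archived site.
SAMPLE = """<html><head><title> Example Campaign Site </title>
<meta name="description" content="Official site of a hypothetical campaign">
<meta name="Keywords" content="election, campaign"></head><body></body></html>"""

record = extract_metadata(SAMPLE)
```

Values extracted this way could seed a site-level record that a cataloger then reviews and extends.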


collection-level record example


MODS bibliographic record example


Wayback Machine Resource Page


example of an archived site


search and discovery

• bibliographic metadata search
• (not yet) Memento-enabled
• full-text search based on NutchWAX unfeasible
• Lucene/Solr looks promising
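For context on the Memento item: Memento lets a client ask for the version of a URL closest to a given date by sending an Accept-Datetime request header to a "TimeGate". The sketch below only builds such a request and does not send it; the TimeGate prefix is hypothetical, and nothing here describes a service the Library runs.

```python
from datetime import datetime, timezone
from email.utils import format_datetime
from urllib.request import Request


def accept_datetime_value(when: datetime) -> str:
    """Format a timezone-aware datetime as an HTTP date in GMT."""
    return format_datetime(when.astimezone(timezone.utc), usegmt=True)


def build_timegate_request(timegate_prefix: str, target_url: str, when: datetime) -> Request:
    """Build (but do not send) a datetime-negotiation request.

    timegate_prefix is a hypothetical TimeGate endpoint; real archives
    publish their own.
    """
    return Request(
        timegate_prefix + target_url,
        headers={"Accept-Datetime": accept_datetime_value(when)},
    )


workshop_day = datetime(2011, 6, 16, 12, 0, tzinfo=timezone.utc)
request = build_timegate_request(
    "http://timegate.example/", "http://www.loc.gov/", workshop_day
)
```

A Memento-aware archive would answer such a request by redirecting to the capture nearest the requested date.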


Challenges for

WEB ARCHIVES


challenges for web archives
• technical
– large, deep, dynamic, interlinked
– continuous transformation, simultaneously growing and
disappearing
• intellectual property laws and regulations
– legal deposit laws, mandates for preservation, laws that
do not address web content
• economic environment
– few good business models for sustaining web collections
• social environment
– who is responsible and how is responsibility shared?


capture, replay, and preservation

• capturing websites – Heritrix
– “form-fronted” databases (i.e., “deep web”)
– URLs the crawler can’t see that we want
– …and URLs the crawler can see that we don’t
– web 2.0 and other “new” web technologies
• replaying archived versions – Wayback
– non-rigorous website coding
– live site “leakage”
– significant interactivity may be lost
• preserving access to our archives
– billions of files
– thousands of file types
– how do we ensure content is accessible in 10, 25, 50 or more
years?


scaling capacity

• budgetary pressures
• limited access server disk space
– competing w/ other big data projects
• new infrastructure for new capabilities

photo by Henrik Bennetsen under CC BY-SA 2.0


when the only tool you have is a library…

LC Prints & Photographs: exterior view of the LC Jefferson Building


…many things look like collections

• archive behaves more like discrete records than
web
– archived sites not contiguously navigable
– data doesn’t readily allow for downloading
• Twitter archive may prompt re-thinking web
archive data access


web archiving and U.S. copyright law

• legal deposit requirement only applies to “published works” (§ 407)
• § 108 of the Copyright Act provides library exceptions
– doesn’t address digital preservation and web archiving

photo by Gabriel de Urioste under CC BY 2.0


why not rely on robots.txt?

• unreliable proxy for copyright permissions
• archival crawler ≠ search crawler
• LC disregards robots.txt but leaves contact info

last.fm: robots.txt
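The point that an archival crawler is not a search crawler shows up in how robots.txt rules are keyed to user agents and written with search indexing in mind. A minimal sketch using Python's standard urllib.robotparser follows; the robots.txt content, the user-agent names, and the URLs are all hypothetical and do not reproduce last.fm's file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one named search crawler gets broad access,
# every other agent (including an archival crawler) is shut out.
ROBOTS_TXT = """\
User-agent: ExampleSearchBot
Disallow: /private/

User-agent: *
Disallow: /
"""


def parse_robots(text: str) -> RobotFileParser:
    """Parse robots.txt content that has already been fetched."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


rules = parse_robots(ROBOTS_TXT)
page = "http://example.org/page.html"

search_allowed = rules.can_fetch("ExampleSearchBot", page)
archive_allowed = rules.can_fetch("ExampleArchiveBot", page)
```

Here the same public page is open to the search crawler and closed to the archival one, and the file says nothing either way about copyright permission.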


capture and access permissions

• permissions-based approach began in 2002
• permission plans for each collection developed w/ counsel
• permission requirements depend on site type
• more liberal about capture than about offsite access

site type          capture       offsite access
government         no notice     no notice
advocacy/policy    notice        permission
news               permission    permission
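The site types and requirement levels on this slide can be read as a small lookup table. The sketch below encodes them that way; the dictionary layout and the is_allowed function are hypothetical, and the rule that a missing reply counts as no permission follows the opt-in approach described in this deck rather than any published LC code.

```python
# Requirement per site type and action, as listed on the slide.
REQUIREMENTS = {
    "government": {"capture": "no notice", "offsite access": "no notice"},
    "advocacy/policy": {"capture": "notice", "offsite access": "permission"},
    "news": {"capture": "permission", "offsite access": "permission"},
}


def is_allowed(site_type: str, action: str,
               notice_sent: bool = False,
               permission_granted: bool = False) -> bool:
    """Decide whether an action may proceed for a site (illustrative only)."""
    requirement = REQUIREMENTS[site_type][action]
    if requirement == "no notice":
        return True
    if requirement == "notice":
        return notice_sent
    # "permission": under opt-in, no reply is the same as no permission.
    return permission_granted


# An advocacy site that was notified but never replied:
can_capture = is_allowed("advocacy/policy", "capture", notice_sent=True)
can_serve_offsite = is_allowed("advocacy/policy", "offsite access", notice_sent=True)
```

The example shows the asymmetry the slide describes: capture can go ahead on notice alone, while offsite access waits for an explicit yes.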


implications of opt-in permissions

• no response treated as denial
– very few denials
– many non-responsive
• case study: September 11, 2001
– 2300 cataloged, 30000 uncataloged URLs
– many news sites (“high risk” permissions
category)
– no permissions sought
– very few takedown requests

the future of permissions

• risk of more liberal approach appears low
• hope to move to more notice-based, opt-out policy
• may affect previously-captured sites as well
photo by RJ Sangosti, Denver Post under © (fair use)


Challenges for the

TWITTER ARCHIVE


why archive Twitter?

• historical record of communication, news reporting,
and social trends
• complements collections and mission

http://twitter.com/#!/klerner/status/64895357355704320

Twitter Archive FAQs

• currently receiving Tweets through Gnip
• includes only the public archive
– deletions will propagate to archive
• access limitations
– 6 month embargo on new Tweets
– no bulk distribution
• downstream users
– no commercial use
– no substantial re-distribution

a (literally) growing challenge

• ~3 years: time it took from 1st Tweet to billionth
• 1 week: time it now takes users to send a billion Tweets
• average Tweets/day in March 2010: 50 million
• average Tweets/day in March 2011: 149 million

http://blog.twitter.com/2011/03/numbers.html
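The growth figures above can be checked with a few lines of arithmetic. This sketch uses only the numbers on the slide; the variable names are illustrative, and treating the two daily averages as one year apart is an assumption about how the slide's dates should be read.

```python
# Figures as given on the slide.
DAILY_AVERAGE_EARLIER = 50_000_000    # average Tweets/day, earlier figure
DAILY_AVERAGE_LATER = 149_000_000     # average Tweets/day, later figure
TWEETS_PER_WEEK_NOW = 1_000_000_000   # "1 week" to send a billion Tweets

# How many times larger the later daily average is than the earlier one.
growth_factor = DAILY_AVERAGE_LATER / DAILY_AVERAGE_EARLIER

# Daily rate implied by "a billion Tweets per week", for comparison
# with the later daily average.
implied_daily_rate = TWEETS_PER_WEEK_NOW / 7

print(f"growth factor: {growth_factor:.2f}x")
print(f"implied daily rate: {implied_daily_rate / 1e6:.1f} million Tweets/day")
```

The two later figures are roughly consistent with each other: a billion Tweets a week works out to a little under 143 million a day.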


questions to consider

• how does archive fit in w/ existing collections?
• how are agreement guidelines interpreted and
implemented technically?
• what kind(s) of access can we provide?
• what context do we provide for content?


additional goals

• justify value to Congress and public
• understand and respond to researcher needs
• push the institution beyond existing curatorial
models


for more information
• Library of Congress Web Archiving Program:
http://www.loc.gov/webarchiving/
• Library of Congress Web Archives:
http://loc.gov/lcwa/
• National Digital Information Infrastructure and Preservation Program:
http://www.digitalpreservation.gov/
• Library of Congress – Twitter FAQ:
http://blogs.loc.gov/loc/2010/04/the-library-and-twitter
• Section 108 Study Group:
http://www.section108.gov/


for more information
• “Legal Issues in Building Social Media Collections”:
http://www.arl.org/bm~doc/mm11sp-okeeffe.pdf
• “How the Library of Congress is building the Twitter archive”:
http://radar.oreilly.com/2011/06/library-of-congress-tw
• “Web Archives: The Future(s)”:
http://papers.ssrn.com/sol3/papers.cfm?abstract_id=183


questions?

Nicholas Taylor
@nullhandle
ntay@loc.gov
