Making Sense of
Abundance
Opportunity and Challenges Across Three Web
Archive Case Studies
Ian Milligan
Assistant Professor
Why?
The sheer amount of social,
cultural, and political
information generated every
day presents new
opportunities for historians.
Could one
even study
the 1990s
and
beyond
without
web
archives?
No.
Historians need to do this now, or
we’re going to be left behind.
Nightmare Scenario
• Wayback Machine won’t be enough. We won’t use that.
• Historians rely uncritically on date-ordered keyword
search results, putting them at the mercy of search
algorithms they do not understand;
• Historians are completely left out of post-1996
research, letting everybody else do the work (à la the
Culturomics project/Nature magazine article);
• Our profession gets left behind…
But what will web archives
look like?
• Three Distinct Case Studies
• Wide Web Scrape, March - December 2011
(Internet Archive) (sample of 80TB WARC collection);
• GeoCities End-of-Life Torrent, 2009 (Archive
Team);
• Archive-It Longitudinal Collections, Canadian
Political Parties & Labour Organizations,
2005-2014 (Archive-It/University of Toronto)
Similarities -
Windows into the lives of
everyday people.
Differences -
Incredible range of technical
skills/no common platform!
Case Study One
• The Wide Web Scrape (~80TB)
• 85,570 WARC files, CDX
metadata
• Similar in some ways to
traditional humanistic inquiry,
just on a bigger scale.
ca,yorku,justlabour)/ 20110714073726 http://www.justlabour.yorku.ca/ text/html 302 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ http://www.justlabour.yorku.ca/index.php?page=toc&volume=16 - 462 880654831 WIDE-20110714062831-crawl416/WIDE-20110714070859-02373.warc.gz
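A CDX record like the one on this slide can be split into named fields with a few lines of Python. This is a minimal sketch, assuming the record uses the common 11-field layout (often written as `N b a m s k r M S V g`); the class and function names are illustrative and do not come from the slides.

```python
from typing import NamedTuple


class CdxRecord(NamedTuple):
    """One capture, assuming the 11-field 'N b a m s k r M S V g' layout."""
    urlkey: str      # canonicalized key, e.g. ca,yorku,justlabour)/
    timestamp: str   # capture time, YYYYMMDDhhmmss
    original: str    # URL as crawled
    mimetype: str    # MIME type reported for the capture
    status: str      # HTTP status code
    digest: str      # payload checksum
    redirect: str    # redirect target, or "-"
    meta: str        # robots meta tags, or "-"
    length: str      # compressed record size in bytes
    offset: str      # byte offset of the record inside the WARC file
    filename: str    # WARC file that holds the record


def parse_cdx_line(line: str) -> CdxRecord:
    """Split one whitespace-delimited CDX line into named fields."""
    fields = line.split()
    if len(fields) != 11:
        raise ValueError(f"expected 11 fields, got {len(fields)}")
    return CdxRecord(*fields)


# The record from the slide, rejoined onto one line.
SAMPLE = (
    "ca,yorku,justlabour)/ 20110714073726 http://www.justlabour.yorku.ca/ "
    "text/html 302 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ "
    "http://www.justlabour.yorku.ca/index.php?page=toc&volume=16 - 462 "
    "880654831 WIDE-20110714062831-crawl416/WIDE-20110714070859-02373.warc.gz"
)

record = parse_cdx_line(SAMPLE)
print(record.timestamp[:4], record.status, record.filename)
```

Read this way, the line works as a finding aid: it says that a 302 redirect for the Just Labour home page was captured on 14 July 2011 and where in which WARC file the record sits.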
  
WARC File
WARC-Tools/Lynx
(warcfilter.py,
warchtmlindex.py
and filesdump.py)
Indexing
CDX Files
(finding aids)
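The step from WARC file to index can be sketched without any third-party library. The code below is not taken from WARC-Tools; it is a hypothetical, stdlib-only reader that assumes an uncompressed WARC stream whose records consist of a `WARC/` version line, `Name: value` headers, a blank line, and a payload of `Content-Length` bytes. Real `.warc.gz` files are gzip-compressed and would need decompressing first.

```python
import io
from typing import BinaryIO, Dict, Iterator, Tuple


def iter_warc_records(stream: BinaryIO) -> Iterator[Tuple[int, Dict[str, str]]]:
    """Yield (byte offset, headers) for each record in an uncompressed WARC stream."""
    while True:
        offset = stream.tell()
        line = stream.readline()
        if not line:
            return                      # end of stream
        if not line.strip():
            continue                    # blank separator between records
        if not line.startswith(b"WARC/"):
            raise ValueError(f"unexpected data at offset {offset}")
        headers: Dict[str, str] = {}
        for raw in iter(stream.readline, b"\r\n"):
            if not raw:
                raise ValueError("truncated record header")
            name, _, value = raw.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        # Skip the payload; an index only needs the headers and the offset.
        stream.seek(int(headers["Content-Length"]), io.SEEK_CUR)
        yield offset, headers


def make_record(uri: str, date: str, body: bytes) -> bytes:
    """Build a tiny synthetic record for demonstration purposes only."""
    head = (
        "WARC/1.0\r\n"
        "WARC-Type: response\r\n"
        f"WARC-Target-URI: {uri}\r\n"
        f"WARC-Date: {date}\r\n"
        f"Content-Length: {len(body)}\r\n"
        "\r\n"
    ).encode("utf-8")
    return head + body + b"\r\n\r\n"


first = make_record("http://example.org/", "2011-07-14T07:37:26Z", b"hello")
second = make_record("http://example.org/about", "2011-07-14T07:38:00Z", b"about page")

# A CDX-like finding aid: (URL, capture date, offset into the file).
index = [
    (headers["WARC-Target-URI"], headers["WARC-Date"], offset)
    for offset, headers in iter_warc_records(io.BytesIO(first + second))
]
for row in index:
    print(row)
```

The design choice is the same one a CDX file makes: the index stores only where each capture lives, so the heavy payload is read later and only for the records a researcher actually wants.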
The problem is, you need to know what you’re looking for!
Generated using Jimmy Lin’s Warcbase
622k .ca sites, 1,719,167 links
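A site-level link graph of this kind boils down to counting host-to-host edges. The sketch below does not use Warcbase or its API; it is a hypothetical stdlib-only illustration that assumes the (page, outlink) pairs have already been extracted from the archive.

```python
from collections import Counter
from typing import Iterable, Tuple
from urllib.parse import urlsplit


def host_of(url: str) -> str:
    """Lower-cased hostname with a leading 'www.' dropped; '' if there is none."""
    host = (urlsplit(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def domain_edges(links: Iterable[Tuple[str, str]], tld: str = ".ca") -> Counter:
    """Count links between distinct hosts where both ends fall under `tld`."""
    edges: Counter = Counter()
    for source, target in links:
        src, dst = host_of(source), host_of(target)
        if src and dst and src != dst and src.endswith(tld) and dst.endswith(tld):
            edges[(src, dst)] += 1
    return edges


page_links = [  # hypothetical (page, outlink) pairs
    ("http://www.justlabour.yorku.ca/index.php", "http://www.yorku.ca/"),
    ("http://www.justlabour.yorku.ca/about", "http://yorku.ca/research"),
    ("http://www.justlabour.yorku.ca/", "http://example.com/"),
]

for (src, dst), weight in domain_edges(page_links).most_common():
    print(src, "->", dst, weight)
```

The weighted edge list can then be loaded into a network analysis tool to compute measures such as modularity classes.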
Countries Mentioned in .ca TLD (excluding Canada)
Provinces Mentioned in .ca TLD
Canadian Postal Codes visualized
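One plausible way to get from page text to a map of postal codes is a regular expression plus a lookup on the first letter. This is a sketch under stated assumptions, not the method used for the slides: the pattern encodes the Canada Post letter rules as I understand them, the region table maps only the first letter, and upper-casing the text first can admit some false positives.

```python
import re
from collections import Counter

# Letter-digit-letter, optional space or hyphen, digit-letter-digit.
# D, F, I, O, Q and U are never used; W and Z never appear in first position.
POSTAL_CODE = re.compile(
    r"\b([ABCEGHJ-NPRSTVXY])\d[ABCEGHJ-NPRSTV-Z][ -]?\d[ABCEGHJ-NPRSTV-Z]\d\b"
)

# First letter -> province or territory (X is shared by two territories).
REGION = {
    "A": "NL", "B": "NS", "C": "PE", "E": "NB",
    "G": "QC", "H": "QC", "J": "QC",
    "K": "ON", "L": "ON", "M": "ON", "N": "ON", "P": "ON",
    "R": "MB", "S": "SK", "T": "AB", "V": "BC",
    "X": "NT/NU", "Y": "YT",
}


def count_regions(text: str) -> Counter:
    """Tally postal codes found in `text` by province or territory."""
    return Counter(REGION[m.group(1)] for m in POSTAL_CODE.finditer(text.upper()))


sample = (
    "Visit us at 4700 Keele St, Toronto, ON M3J 1P3 or write to "
    "Box 1, Halifax NS B3H 4R2. Call 416-555-0100."
)
print(count_regions(sample))
```

Keeping the first three characters (the forward sortation area) instead of only the first letter would give the finer geographic resolution needed for a map.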
We need longitudinal coverage, but the
size and processing intensity are extreme.
Wide Web Scrapes and
the Dream of Social
History.
Case Study Two
• Archive-It Research
Services: “Canadian Political
Parties and Political Interest
Groups” and “Canadian
Labour Unions.”
• 2005 - 2015
• WAT Files
WAT Files?
Potential sweet spot between
the lightweight CDX and the
heavy-duty WARC?
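To make the "sweet spot" concrete: each WAT record carries a JSON payload describing one capture, including the links found in its HTML. The sketch below assumes the key names as I understand the WAT layout (`Envelope`, `WARC-Header-Metadata`, `Payload-Metadata`, `HTTP-Response-Metadata`, `HTML-Metadata`, `Links`); check them against your own files. The sample payload is hypothetical.

```python
import json
from typing import Any, Dict, List, Tuple
from urllib.parse import urljoin


def extract_links(wat_payload: str) -> Tuple[str, List[Tuple[str, str]]]:
    """Return (page URL, [(absolute link URL, anchor text), ...]) for one WAT record."""
    envelope: Dict[str, Any] = json.loads(wat_payload).get("Envelope", {})
    page = envelope.get("WARC-Header-Metadata", {}).get("WARC-Target-URI", "")
    html = (
        envelope.get("Payload-Metadata", {})
        .get("HTTP-Response-Metadata", {})
        .get("HTML-Metadata", {})
    )
    links = [
        (urljoin(page, link["url"]), link.get("text", ""))
        for link in html.get("Links", [])
        # Keep hyperlinks only (path like "A@/href"), not images or scripts.
        if link.get("path", "").startswith("A@") and "url" in link
    ]
    return page, links


sample = json.dumps({  # hypothetical payload, trimmed to the keys used above
    "Envelope": {
        "WARC-Header-Metadata": {"WARC-Target-URI": "http://example.ca/news/"},
        "Payload-Metadata": {"HTTP-Response-Metadata": {"HTML-Metadata": {
            "Links": [
                {"path": "A@/href", "url": "/platform", "text": "Our platform"},
                {"path": "IMG@/src", "url": "logo.png"},
            ]
        }}},
    }
})

page, links = extract_links(sample)
print(page, links)
```

Because the payload holds links and anchor text but not page bodies, this kind of extraction supports network and metadata analysis while pointing back to the WARC records needed for close reading.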
Do we want metadata
or content analysis?
Historians NEED content,
but metadata can help us
find and contextualize it
Metadata Extraction
• Results @ http://ianmilligan.ca/2015/02/05/topic-modeling-web-archive-modularity-classes/
Metadata Extraction
• Conservative themes (2014): economic
development, family, immigration, legislation,
women’s issues, senior issues, Ukrainians,
constituency offices, some prominent (and
not-so-prominent) MPs, and of course, our economic
action plan.
• Liberal themes (2014): Justin Trudeau (the new
leader), cuts to social programs, child poverty,
mental health, municipal issues, labour, workers,
Stop the Cuts, and housing.
Metadata Extraction
• Conservative themes (2006): education, university,
and a great deal of information on Aboriginal issues;
• Liberal themes (2006): community questions,
electoral topics, universities, human rights, child
care support.
As well as short stories…
2005 Canadian Federal Election
WATs help us find the files
we need to use - and to
contextualize them
(switch to browser)
Some code/walkthroughs/sample data available at
https://github.com/ianmilligan1/WAHR
Case Study Three
• GeoCities: Archive Team End-of-Life
Torrent
• 2009, with content dating back to
1996; sites created pre-1999 can be
found using the
neighbourhood structure
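Filtering by neighbourhood structure can be sketched as a URL test. The pattern below is an assumption on my part: it treats early GeoCities addresses as `/Neighbourhood/[Suburb/]NNNN/` with a four-digit lot number, and treats any other path (for example a username) as a later style. The function name and regular expression are illustrative only.

```python
import re
from typing import Optional, Tuple
from urllib.parse import urlsplit

# Assumed early address pattern: /Neighbourhood/[Suburb/]NNNN/...
ADDRESS = re.compile(r"^/([A-Za-z]+)/(?:([A-Za-z]+)/)?(\d{4})(?:/|$)")


def neighbourhood_address(url: str) -> Optional[Tuple[str, Optional[str], int]]:
    """Return (neighbourhood, suburb or None, lot number), or None for other URL styles."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host != "geocities.com" and not host.endswith(".geocities.com"):
        return None
    match = ADDRESS.match(parts.path)
    if match is None:
        return None
    hood, suburb, lot = match.groups()
    return hood, suburb, int(lot)


urls = [  # hypothetical examples
    "http://www.geocities.com/Athens/Forum/1234/index.html",
    "http://www.geocities.com/EnchantedForest/5678/",
    "http://www.geocities.com/some_user/",
]
for url in urls:
    print(url, "->", neighbourhood_address(url))
```

Grouping the matches by the first element gives one corpus per neighbourhood, which is the unit the topic models later in the deck are run on.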
A substantive
research question?
What was
GeoCities?
Why does it
matter?
GEOCITIES USERS:
OCT. 1995: 10,000 USERS
AUG. 1996: 100,000 USERS
OCT. 1997: 1,000,000 USERS
Selected
Neighbourhoods
Top Two Topics
Athens
“… based on education,
teaching, reading,
writing and philosophy”.
people things time person sense life man work world h
soul make nature body case made point
part parts goddess witch healing incense witchcraft lov
shaman witches sun spirit protection light circle earth r
EnchantedForest
“A place for and about
kids. Games, stories,
educational sites, and
homepages created by
kids themselves.”
blue page school home day kids clues fun

time year room birthday family mom jordan play great
jq battalion show st jonny horse battery

armored lt artillery camp sailor army field col pingu wa
Heartland
“A family oriented
neighborhood that
represents Main Street
in cyberspace. This is
the place to find
parenting, pets, and
home town values.”
people time children book years child information year
school person system state world books government g
family county church home years information st city b
school mrs history birth records great cemetery death
Hollywood
“Entertainment capital
of the world. Movies,
television, and our live
video camera at the
corner of Hollywood
and Vine!”
joey rachel ross monica chandler

don yeah phoebe hey mike back gonna ll chris big uh g
frasier niles martin daphne roz don

back ll door room scene ve dad turns takes crane good
Pentagon
Military men and
women.
war people president government american world state
united general military public soviet political clinton am
fort war civil island iran world adams army british histo
german french american forts walther cap newport
WestHollywood
“A community with a
culture based on gay and
lesbian identity.”
gender women sex male female

people men person woman sexual crossdressing femin
transgendered marriage man children transsexual
Topic
Modelling
Community to
Test Coherence
Looking at
millions of
user-
contributed &
generated
images
And the
stories of
significant
users and
meaningful
experiences.
Shared Problems
• Never have enough processing power or memory;
• Web archive tools are often designed for clusters; probably fewer than ten
historians in North America can use one…
• Tools
• Some work on WARCs;
• Some work on ARCs;
• Some work on WATs;
• And some work on live-web material;
End-user tools and
co-operation with
CS colleagues are
key.
But the shared
promise…
More voices, more
people, the promise of
social history achieved.
Thank you!
@ianmilligan1
ianmilligan1@gmail.com
Ian Milligan
Assistant Professor
