The Challenge of Digital
Sources in the Web Age
Common Tensions Across Three Web Histories,
1994-2015
Ian Milligan
Assistant Professor
Why?
The sheer amount of social,
cultural, and political
information generated every
day presents new
opportunities for historians.
Could one
even study
the 1990s
and
beyond
without
web
archives?
No.
Historians need to do this now, or
we’re going to be left behind.
Nightmare Scenario
• Wayback Machine won’t be enough. We won’t use that.
• Historians rely uncritically on date-ordered keyword
search results, putting them at the mercy of search
algorithms they do not understand;
• Historians are completely left out of post-1996
research, letting everybody else do the work (à la
Culturomics project/Nature magazine article);
• Our profession gets left behind…
But what will web archives
look like?
• Three Distinct Case Studies
• Wide Web Scrape, March - December 2011
(Internet Archive) (sample of 80TB WARC collection);
• GeoCities End-of-Life Torrent, 2009 (Archive
Team);
• Archive-It Longitudinal Collections, Canadian
Political Parties & Labour Organizations,
2005-2015 (Archive-It/University of Toronto)
Similarities -
Windows into the lives of
everyday people.
Differences -
Incredible range of technical
skills/no common platform!
Case Study One
• The Wide Web Scrape (~
80TB) - Snapshot of the Web
• 85,570 WARC files, CDX
metadata
• Similar in some ways to
traditional humanistic inquiry,
just on a bigger scale.
ca,yorku,justlabour)/ 20110714073726 http://www.justlabour.yorku.ca/ text/html 302 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ http://www.justlabour.yorku.ca/index.php?page=toc&volume=16 - 462 880654831 WIDE-20110714062831-crawl416/WIDE-20110714070859-02373.warc.gz
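A CDX record like the one above is a single whitespace-delimited line, which is what makes CDX files usable as finding aids. The following is a minimal sketch of reading one, assuming the record follows the common 11-column CDX layout (URL key, timestamp, original URL, MIME type, status, digest, redirect, meta tags, compressed size, offset, file name). The names `CdxRecord` and `parse_cdx_line` are illustrative, not part of any tool named in the slides.

```python
from collections import namedtuple
from datetime import datetime

# Assumed column order: the common 11-field CDX layout ("N b a m s k r M S V g").
CdxRecord = namedtuple(
    "CdxRecord",
    "urlkey timestamp original mimetype status digest redirect "
    "meta length offset filename",
)


def parse_cdx_line(line):
    """Split one CDX line into named fields and convert the numeric ones."""
    parts = line.split()
    if len(parts) != 11:
        raise ValueError(f"expected 11 fields, got {len(parts)}")
    rec = CdxRecord(*parts)
    # Note: real CDX files sometimes use "-" for a missing status or size;
    # this sketch does not handle that case.
    return rec._replace(
        timestamp=datetime.strptime(rec.timestamp, "%Y%m%d%H%M%S"),
        status=int(rec.status),
        length=int(rec.length),
        offset=int(rec.offset),
    )


SAMPLE = (
    "ca,yorku,justlabour)/ 20110714073726 http://www.justlabour.yorku.ca/ "
    "text/html 302 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ "
    "http://www.justlabour.yorku.ca/index.php?page=toc&volume=16 - 462 "
    "880654831 WIDE-20110714062831-crawl416/WIDE-20110714070859-02373.warc.gz"
)

record = parse_cdx_line(SAMPLE)
print(record.timestamp.isoformat(), record.status, record.redirect)
```

The offset and file name fields are what let a tool jump straight to the matching record inside the WARC file without scanning the whole archive.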
  
WARC File
WARC-Tools/Lynx
(warcfilter.py,
warchtmlindex.py
and filesdump.py)
Indexing
CDX Files
(finding aids)
Problem is.. you need to
know what you’re looking
for!
Generated using Jimmy Lin’s Warcbase
622k .ca sites, 1,719,167 links
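A link graph of this kind boils down to counting how often pages on one host link to pages on another. The sketch below is not Warcbase's API; it is a standard-library illustration of the aggregation step, assuming the (source page, link target) pairs have already been extracted. The `example.ca` hosts are hypothetical.

```python
from collections import Counter
from urllib.parse import urlsplit


def host_of(url):
    """Return the lowercased host of a URL, without a leading 'www.'."""
    host = (urlsplit(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def domain_links(pairs):
    """Count host-to-host links, skipping links that stay on the same host."""
    counts = Counter()
    for source, target in pairs:
        src, dst = host_of(source), host_of(target)
        if src and dst and src != dst:
            counts[(src, dst)] += 1
    return counts


# Hypothetical (source page, link target) pairs for illustration only.
PAIRS = [
    ("http://www.party-a.example.ca/news/", "http://www.party-b.example.ca/"),
    ("http://party-a.example.ca/about/", "http://www.party-b.example.ca/platform"),
    ("http://www.party-a.example.ca/", "http://www.party-a.example.ca/contact"),
    ("http://www.union.example.ca/", "http://www.party-a.example.ca/"),
]

edges = domain_links(PAIRS)
for (src, dst), n in edges.most_common():
    print(f"{src} -> {dst}: {n}")
```

The resulting weighted edge list is the kind of input a network visualization tool expects.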
Countries Mentioned in .ca TLD (excluding Canada)
Provinces Mentioned in .ca TLD
Canadian Postal Codes visualized
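One plausible way to pull postal codes out of page text before mapping them is a regular expression. The sketch below assumes the standard Canadian format (letter, digit, letter, optional space or hyphen, digit, letter, digit), in which the letters D, F, I, O, Q and U are never used and W and Z do not appear in the first position. The sample sentence and the function name `find_postal_codes` are illustrative; the slides do not say how the extraction was actually done.

```python
import re
from collections import Counter

# First letter excludes D, F, I, O, Q, U, W, Z; other letters exclude D, F, I, O, Q, U.
POSTAL_CODE = re.compile(
    r"\b([ABCEGHJ-NPRSTVXY]\d[ABCEGHJ-NPRSTV-Z])[ -]?(\d[ABCEGHJ-NPRSTV-Z]\d)\b",
    re.IGNORECASE,
)


def find_postal_codes(text):
    """Return postal codes found in text, normalized to 'A1A 1A1' form."""
    return [
        f"{m.group(1)} {m.group(2)}".upper() for m in POSTAL_CODE.finditer(text)
    ]


# Illustrative text only.
SAMPLE_TEXT = "Write to us at 4700 Keele St, Toronto ON M3J 1P3 or Ottawa, k1a0a6."

codes = find_postal_codes(SAMPLE_TEXT)
by_first_letter = Counter(code[0] for code in codes)
print(codes, dict(by_first_letter))
```

The first letter of each code identifies a postal district, so counting by first letter gives a rough regional breakdown before any geocoding.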
Need longitudinal, but the
size/intensity = extreme.
Wide Web Scrapes and
the Dream of Social
History.
Case Study Two
• Archive-It Research
Services: “Canadian Political
Parties and Political Interest
Groups” and “Canadian
Labour Unions.”
• 2005 - 2015
• WAT & WARC files
Pivotal Changes in Canadian
Politics, 2005-2015
• Militarization of Canadian
society?
• Change from ‘natural governing
party’ of Liberals to
Conservatives
• Major policy changes on
foreign policy, environment, etc.
• How to measure?
Current Interface
• Very limited - simple search
engine, some advanced
options; no facets
• Great collections.. but
nobody uses them!
How to provide
access?
WAT Files?
Potential sweet spot between
the lightweight CDX and the
heavy-duty WARC?
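A WAT record carries a JSON description of one capture: more than a CDX line, far less than the full WARC payload. The sketch below shows how outbound links might be read from one, assuming the usual WAT nesting (`Envelope`, then `Payload-Metadata`, `HTTP-Response-Metadata`, `HTML-Metadata`, `Links`). The sample record is hand-built for illustration and `links_from_wat` is a hypothetical helper name.

```python
import json


def links_from_wat(payload):
    """Return (source URL, target URL, anchor text) for each <a href> link."""
    envelope = json.loads(payload)["Envelope"]
    source = envelope["WARC-Header-Metadata"]["WARC-Target-URI"]
    html = (
        envelope.get("Payload-Metadata", {})
        .get("HTTP-Response-Metadata", {})
        .get("HTML-Metadata", {})
    )
    return [
        (source, link["url"], link.get("text", ""))
        for link in html.get("Links", [])
        if link.get("path") == "A@/href" and "url" in link
    ]


# Hand-built sample record (assumed structure, hypothetical URLs).
SAMPLE_WAT = json.dumps({
    "Envelope": {
        "WARC-Header-Metadata": {
            "WARC-Target-URI": "http://party-a.example.ca/",
            "WARC-Date": "2005-10-15T12:00:00Z",
        },
        "Payload-Metadata": {
            "HTTP-Response-Metadata": {
                "HTML-Metadata": {
                    "Head": {"Title": "Party A"},
                    "Links": [
                        {"path": "A@/href", "url": "http://union.example.ca/",
                         "text": "Our partners"},
                        {"path": "IMG@/src", "url": "/logo.png"},
                    ],
                }
            }
        },
    }
})

wat_links = links_from_wat(SAMPLE_WAT)
print(wat_links)
```

Because only the metadata is parsed, this kind of extraction stays cheap enough to run on an ordinary laptop.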
Do we want metadata
or content analysis?
Two problems
Problem One:
Historians want to work
with content, but we can
only use metadata on
most computer systems.
(but that’s ok - we can
use metadata to do great
work)
Metadata Extraction
• Results @ http://ianmilligan.ca/2015/02/05/topic-modeling-web-archive-modularity-classes/
Metadata Extraction
• Conservative themes (2014): economic
development, family, immigration, legislation,
women’s issues, senior issues, Ukrainians,
constituency offices, some prominent (and not-so-prominent)
MPs, and of course, our economic
action plan.
• Liberal themes (2014): Justin Trudeau (the new
leader), cuts to social programs, child poverty,
mental health, municipal issues, labour, workers,
Stop the Cuts, and housing.
Metadata Extraction
• Conservative themes (2006): education, university,
but tons of information on Aboriginal issues;
• Liberal themes (2006): community questions,
electoral topics, universities, human rights, child
care support.
As well as short stories..
2005 Canadian Federal Election
WATs help us find the files
we need to use - and to
contextualize them
Problem Two:
You can do amazing things with the
content (WARCs), but you need a
cluster or powerful computer.
WARC Analysis
• 2005-2009: 244 GB of content;
2.9 GB of plain text
• 10,606,822 websites
• On a local powerful node (3
GHz 8-core Intel Xeon E5/64
GB RAM, data on SSD), about
three to four hours per query
• On a cluster, about 10-20
minutes per query, depending
on traffic
Large-Scale Text
Analysis
• With Hadoop about 15-20
minutes to extract all plain-text
from any specified queries:
i.e. all pages belonging to
Green Party, Liberal Party,
Conservative Party, Council of
Canadians, etc.
• Compared to “out of memory”/
go home for an extended
weekend on a local node
Large-Scale Text Analysis
• NER/LDA/Keyword Frequency broken
down by scrape date: i.e. scrape
carried out 2005-10, see change over
time;
• Downside: not everything is optimized
for parallel environment; if not, it crawls
(there goes a day)
• Downside: scrape date != creation
date, requiring temporal analysis
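Breaking keyword frequency down by scrape date can be sketched as follows, assuming each page has already been reduced to a 14-digit crawl timestamp plus plain text. The sample pages and the function name `keyword_share_by_month` are made up for illustration; the grouping key is the crawl month, which (as noted above) is not the same as the date the page was written.

```python
import re
from collections import Counter

TOKEN = re.compile(r"[a-z']+")  # crude ASCII tokenizer; splits accented words


def keyword_share_by_month(pages, keyword):
    """Return {YYYY-MM: share of tokens equal to keyword}, keyed by crawl month."""
    hits = Counter()
    totals = Counter()
    keyword = keyword.lower()
    for timestamp, text in pages:
        month = f"{timestamp[:4]}-{timestamp[4:6]}"
        tokens = TOKEN.findall(text.lower())
        totals[month] += len(tokens)
        hits[month] += tokens.count(keyword)
    return {m: hits[m] / totals[m] for m in sorted(totals) if totals[m]}


# Made-up pages: (crawl timestamp, extracted plain text).
PAGES = [
    ("20051015120000", "Harper spoke about taxes. Harper again."),
    ("20081101080000", "Dion and Harper debated the economy"),
]

shares = keyword_share_by_month(PAGES, "Harper")
for month, share in shares.items():
    print(month, round(share, 3))
```

Normalizing by the total token count per month keeps a large crawl from looking like a spike in interest.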
Recipe Book Idea
Using Warcbase to
analyze links and full-text
Recipe book:
https://github.com/lintool/warcbase/wiki
NER
October 2005
62476 Stephen Harper
30234 Michael Chong
30109 Gwynne Dyer
28011 ami Entrez
26238 Paul Martin
22303 Harper
NER
November 2008
3188 Stéphane Dion
2557 Stephen Harper
2471 Stephen HarperLaureen
2410 Dion
2356 Harper
Visualizing NER
Integration with Wolfram|Alpha
Shine/WebArchives.ca
• UK Web Archive’s Shine
(https://github.com/ukwa/shine)
• Indexing as bottleneck
• ~ 250GB of WARCs takes ~
5 days on a single machine
• Hadoop indexer available if
data in HDFS
• ~ 90GB index size
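The indexing figures above imply a throughput that is worth working out. The sketch below uses only the numbers on the slide, with decimal units; the final line is a naive linear extrapolation to the ~80TB Wide Web Scrape, an assumption made purely for scale and not a measured result.

```python
WARC_GB = 250   # size of the WARC collection (from the slide)
INDEX_GB = 90   # size of the resulting index (from the slide)
DAYS = 5        # single-machine indexing time (from the slide)

gb_per_day = WARC_GB / DAYS                        # 50.0 GB per day
mb_per_second = WARC_GB * 1000 / (DAYS * 86_400)   # roughly 0.58 MB/s
index_ratio = INDEX_GB / WARC_GB                   # index is 36% of the input

# Naive linear extrapolation (assumption): 80 TB at the same rate.
days_for_80tb = 80_000 / gb_per_day                # 1600 days on one machine

print(f"{gb_per_day:.0f} GB/day, {mb_per_second:.2f} MB/s, "
      f"index ratio {index_ratio:.0%}, 80 TB in {days_for_80tb:.0f} days")
```

At that rate a single machine is workable for a curated collection but not for a wide crawl, which is why the Hadoop indexer matters.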
Five Things I’ve Learned
• Political parties delete content
• User-generated comments were more common in
political parties
• Absences can be more informative than presences
• We can see the rise/fall of prominent people
• Enabling user access is truly transformative
Shine
• Advantages: accessible to the general public,
easy to use, interactive trend diagram allows
digging down for context, can move down to level
of document itself.
• Disadvantage: keyword searching requires you
know what to look for; random sampling misleading
when tens of thousands of records; etc.
• Doesn’t take advantage of what makes web
sources so powerful: hyperlinks
Building connections
between Warcbase and
Shine
Case Study Three
• GeoCities: Archive Team End-of-Life
Torrent
• 2009, content dating back to
1996; can find sites created
pre-1999 using
neighbourhood structure
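Using the neighbourhood structure means reading the neighbourhood out of the URL itself. The sketch below assumes early GeoCities addresses looked like `/Neighbourhood/1234/` or `/Neighbourhood/Suburb/1234/`, and that paths without a numeric address (such as later user-name URLs) should be skipped. The sample URLs and the name `parse_geocities_url` are hypothetical.

```python
import re
from urllib.parse import urlsplit

# Assumed path shape: /Neighbourhood[/Suburb]/<numeric address>/...
ADDRESS = re.compile(r"^/([A-Za-z]+)(?:/([A-Za-z]+))?/(\d+)(?:/|$)")


def parse_geocities_url(url):
    """Return (neighbourhood, suburb or None, address number), or None."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host != "geocities.com" and not host.endswith(".geocities.com"):
        return None
    match = ADDRESS.match(parts.path)
    if match is None:
        return None  # no numeric address, e.g. a user-name style URL
    neighbourhood, suburb, number = match.groups()
    return neighbourhood, suburb, int(number)


# Hypothetical URLs for illustration.
print(parse_geocities_url("http://www.geocities.com/EnchantedForest/Glade/1234/index.html"))
print(parse_geocities_url("http://www.geocities.com/Athens/5678/"))
print(parse_geocities_url("http://www.geocities.com/someuser/"))
```

Grouping pages by the first returned element gives one corpus per neighbourhood, which is the unit the later topic comparison works on.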
A substantive
research question?
What was
GeoCities?
Why does it
matter?
GEOCITIES USERS:
OCT. 1995: 10,000 USERS
AUG. 1996: 100,000 USERS
OCT. 1997: 1,000,000 USERS
“largest body of texts
detailing the lives of non-elite
people ever published”?
Massive experiment in
user-generated content
Ethically
Navigating the
Records of
Seven Million
People
Selected
Neighbourhoods
Top Two Topics
Athens
“… based on education, teaching, reading, writing and philosophy”.
• people things time person sense life man work world human good mind soul make nature body case made point
• part parts goddess witch healing incense witchcraft love energy pagan shaman witches sun spirit protection light circle earth religion
EnchantedForest
“A place for and about kids. Games, stories, educational sites, and homepages created by kids themselves.”
• blue page school home day kids clues fun time year room birthday family mom jordan play great party friends
• jq battalion show st jonny horse battery armored lt artillery camp sailor army field col pingu war area quest
Heartland
“A family oriented neighborhood that represents Main Street in cyberspace. This is the place to find parenting, pets, and home town values.”
• people time children book years child information year work make life school person system state world books government good
• family county church home years information st city born state war school mrs history birth records great cemetery death
Hollywood
“Entertainment capital of the world. Movies, television, and our live video camera at the corner of Hollywood and Vine!”
• joey rachel ross monica chandler don yeah phoebe hey mike back gonna ll chris big uh guy guys rock
• frasier niles martin daphne roz don back ll door room scene ve dad turns takes crane good walks yeah
Pentagon
Military men and women.
• war people president government american world states power state united general military public soviet political clinton america make army
• fort war civil island iran world adams army british history badge rhode german french american forts walther cap newport
WestHollywood
“A community with a culture based on gay and lesbian identity.”
• gender women sex male female people men person woman sexual crossdressing feminine society identity transgendered marriage man children transsexual
Topic
Modelling
Community to
Test
Coherence
Looking at
millions of
user-
contributed &
generated
images
And the
stories of
significant
users and
meaningful
experiences.
The possibilities of
such digital scholarship
Shared Problems
• Never have enough processing power or memory;
• Web archive tools often designed for clusters - fewer than ten
historians in North America can probably use one…
• Tools
• Some work on WARCs;
• Some work on ARCs;
• Some work on WATs;
• And some work on live-web material;
End-user tools and
co-operation with
CS colleagues are
key.
But the shared
promise…
More voices, more
people, the promise of
social history achieved.
Thank you!
@ianmilligan1
ianmilligan1@gmail.com
Ian Milligan
Assistant Professor
