Building Event Collections from Crawling Web Archives
@mart1nkle1n
IIPC WAC 2018, 11/13/2018, Wellington, NZ
Martin Klein¹, Lyudmila Balakireva¹, Herbert Van de Sompel²
¹ Research Library, Los Alamos National Laboratory
² Data Archiving and Networked Services, The Netherlands
Inspiration from Previous Work
https://doi.org/10.1007/978-3-319-67008-9_10
Published at WebSci 2018
https://doi.org/10.1145/3201064.3201085
1. Can we create event collections by focused crawling of web
archives that are available online?
2. How do event collections created from the archived web
compare to those created from the live web?
3. How does the amount of time passed since the event affect
the collections built from the live and the archived web?
4. How do event collections built from the archived web
compare to manually curated collections?
Questions
• Often orchestrated by subject matter experts, archivists,
special collection librarians, technicians
• Potentially with guidance from institutional collection policy
• Results in a list of seeds (URIs, social media accounts, etc.)
• Utilization of crawling services such as Archive-It, Social Feed
Manager
Background – Event Collection Building
• Temporal: time passed since event is of concern
  → Use of web archives via Memento infrastructure
• Selection: seeds often picked manually
  → Use of references from Wikipedia pages
• Relevance: seed assessment often done by humans
  → Use of focused crawling with content and temporal relevance assessment
Problems and our Approach
Focused Crawling
[Figure: crawl tree with a Seed at the root, its children Child 1, Child 2, and Child 3, and their children (Child 1.1 through Child 3.2); legend: not crawled, crawled and not relevant, crawled and relevant]
1. Content of Wikipedia page + random 60% of page’s references
   • Generate topic vector (TF-IDF of 1-grams + 2-grams)
2. Content of remaining 40% of Wikipedia page’s references
   • Generate topic vector (TF-IDF of 1-grams + 2-grams)
• Compute cosine similarity value between vectors 1 and 2
• Repeat 10 times
• Take average cosine similarity value as content threshold
Content Relevance
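The threshold computation on this slide can be sketched roughly as follows. This is a minimal, standard-library illustration and not the authors' implementation: the tokenizer, the smoothed IDF computed over the page plus all references, and every function name are assumptions made for the example.

```python
import math
import random
import re
from collections import Counter


def ngrams(text):
    """Return the 1-grams and 2-grams of a lowercased text."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return tokens + bigrams


def smoothed_idf(docs):
    """Smoothed IDF over individual documents (assumed weighting scheme)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(ngrams(doc)))
    return {term: math.log((1 + n) / (1 + count)) + 1 for term, count in df.items()}


def topic_vector(docs, idf):
    """TF-IDF vector over the pooled n-grams of a group of documents."""
    tf = Counter(term for doc in docs for term in ngrams(doc))
    return {term: freq * idf.get(term, 0.0) for term, freq in tf.items()}


def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def content_threshold(page_text, reference_texts, runs=10, seed=0):
    """Average cosine similarity between a 60% group and a 40% group."""
    rng = random.Random(seed)
    idf = smoothed_idf([page_text] + reference_texts)
    scores = []
    for _ in range(runs):
        refs = reference_texts[:]
        rng.shuffle(refs)  # assumption: a fresh random split per run
        cut = round(0.6 * len(refs))
        vector_1 = topic_vector([page_text] + refs[:cut], idf)
        vector_2 = topic_vector(refs[cut:], idf)
        scores.append(cosine(vector_1, vector_2))
    return sum(scores) / len(scores)
```

A crawled page would then be scored by the cosine similarity between its own vector and the event's topic vector, and compared against the returned threshold.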
• Define temporal interval for which crawled pages are
considered relevant
• Event date extracted from Wikipedia event page
Temporal Relevance
[Figure: temporal relevance score (between 0 and 1) over time, with Event Date, Change Point, and Today marked on the time axis]
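One simple reading of the interval idea is a step function: a page counts as temporally relevant if its datetime falls between the event date and the change point. The sketch below assumes exactly that and nothing more; the slide's figure may encode a smoother score, and the function name and the example change point are hypothetical.

```python
from datetime import date


def temporal_relevance(page_date, event_date, change_point):
    """Return 1.0 inside [event_date, change_point], else 0.0.

    Assumption: a hard step function; the original scoring may differ.
    """
    return 1.0 if event_date <= page_date <= change_point else 0.0


# Hypothetical usage: the event date is from the slides, the change point is made up.
event = date(2016, 6, 12)
change = date(2016, 7, 1)
score = temporal_relevance(date(2016, 6, 20), event, change)
```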
Change Point Detection
[Figure: percentage (0–100) over Wikipedia edit dates from 2016-06-12 to 2017-08-24; annotated value: 46]
• Plot number of Wikipedia page
edits per day
• Run R’s changepoint algorithm
• Detect significant change in curve
https://cran.r-project.org/web/packages/changepoint/index.html
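To illustrate what detecting a change in the edit curve means, here is a deliberately simple stand-in: it finds the single split point that best separates a series into two segments with different means. It is not the algorithm of R's changepoint package, and the function name and the sample edit counts are made up for the example.

```python
def single_change_point(values):
    """Return the index k (1 <= k < n) that minimizes the summed squared
    deviation of values[:k] and values[k:] from their own means."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")

    def sse(segment):
        mean = sum(segment) / len(segment)
        return sum((x - mean) ** 2 for x in segment)

    return min(range(1, n), key=lambda k: sse(values[:k]) + sse(values[k:]))


# Hypothetical edits per day: heavy editing right after the event, then a drop.
edits_per_day = [40, 35, 38, 42, 3, 2, 4, 1, 2]
change_index = single_change_point(edits_per_day)  # first day of the quiet segment
```

The date belonging to `change_index` would then serve as the change point that bounds the temporal interval.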
• Extract datetime from pages via:
  • URI
    http://www.cnn.com/2017/12/09/us/wildfire-fighting-tactics/
  • Meta tags
    <meta property="article:published" itemprop="datePublished" content="2017-12-09T10:14:50-05:00" />
  • ODU’s Carbondate tool
    http://carbondate.cs.odu.edu/
  • Memento datetime
  • X-Header
Datetime Extraction
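The first two extraction routes (URI path and meta tags) can be sketched with the standard library alone. The regular expression, the attribute checks, and the function names below are assumptions chosen to fit the two examples on the slide; Carbondate, the Memento datetime, and header-based extraction are not covered.

```python
import re
from datetime import date, datetime
from html.parser import HTMLParser

# Assumed pattern: /YYYY/MM/DD/ somewhere in the URI path.
URI_DATE = re.compile(r"/(\d{4})/(\d{1,2})/(\d{1,2})(?:/|$)")


def date_from_uri(uri):
    """Return a date parsed from a /YYYY/MM/DD/ path segment, or None."""
    match = URI_DATE.search(uri)
    if not match:
        return None
    year, month, day = map(int, match.groups())
    try:
        return date(year, month, day)
    except ValueError:  # e.g. /2017/13/45/
        return None


class _MetaDateParser(HTMLParser):
    """Collects the first publication datetime found in a <meta> tag."""

    def __init__(self):
        super().__init__()
        self.found = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.found is not None:
            return
        attributes = dict(attrs)
        prop = attributes.get("property") or ""
        itemprop = attributes.get("itemprop") or ""
        if prop.startswith("article:published") or itemprop == "datePublished":
            try:
                self.found = datetime.fromisoformat(attributes.get("content") or "")
            except ValueError:
                pass  # not an ISO 8601 value; keep looking


def date_from_meta(html):
    """Return the datetime from a publication meta tag, or None."""
    parser = _MetaDateParser()
    parser.feed(html)
    return parser.found
```

In a crawler, these functions would typically be tried in order, falling back to the next source when one returns `None`.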
• Topics limited to terror attacks and mass shootings in the U.S.
  • From different times in the past
• Take content and temporal relevance into account
  • Equally weighted
• Use events’ Wikipedia page as input for focused crawler
  • Version that was live at change point
Experiment Details
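Equal weighting of the two relevance signals could be expressed as a plain average. The slides do not spell out the combination formula, so the function below is only one plausible reading, with a hypothetical name and a weight parameter added for illustration.

```python
def combined_relevance(content_score, temporal_score, content_weight=0.5):
    """Weighted average of content and temporal relevance.

    Assumption: equal weighting means content_weight = 0.5.
    """
    return content_weight * content_score + (1 - content_weight) * temporal_score


overall = combined_relevance(0.8, 1.0)  # average of the two scores
```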
• Focused crawl of:
  • 22 archives, simultaneously, via Memento infrastructure
  • The live web
• Seeds
  • Memento of Wikipedia page references closest to and after event time
  • Subject to temporal and contextual relevance assessment
• Crawled outlinks
  • Memento of outlinks closest to and after event time
  • Subject to temporal and contextual relevance assessment
Crawl Details
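Choosing the Memento "closest to and after event time" amounts to a minimum over the captures dated at or after the event. The sketch assumes the capture list has already been obtained (for example from a TimeMap) as (datetime, URI) pairs; the function name and the example.org URIs are hypothetical.

```python
from datetime import datetime


def closest_memento_after(mementos, event_time):
    """Return the (memento_datetime, uri) pair with the earliest datetime
    that is not before event_time, or None if no such capture exists."""
    candidates = [m for m in mementos if m[0] >= event_time]
    return min(candidates, key=lambda m: m[0]) if candidates else None


# Hypothetical captures of one referenced page.
captures = [
    (datetime(2010, 12, 30), "https://archive.example.org/20101230/page"),
    (datetime(2011, 1, 9), "https://archive.example.org/20110109/page"),
    (datetime(2011, 3, 2), "https://archive.example.org/20110302/page"),
]
chosen = closest_memento_after(captures, datetime(2011, 1, 8))
```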
• Crawl stop conditions:
• No more relevant documents left
• 5 levels deep
• Utilized crawl priority queue
Crawl Details
[Figure: crawl tree by depth; Level 0: Seed; Level 1: Child 1, Child 2, Child 3; Level 2: their children, labeled Child 1.1 through Child 3.2]
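The stop conditions and the priority queue can be combined in a small crawl loop. This is a sketch under stated assumptions, not the crawler used in the study: outlinks inherit their parent's relevance score as queue priority, outlinks of non-relevant pages are not followed, and `fetch_outlinks` and `relevance` are caller-supplied hypothetical callbacks.

```python
import heapq
from itertools import count


def focused_crawl(seeds, fetch_outlinks, relevance, threshold, max_depth=5):
    """Crawl from seeds, following outlinks of relevant pages only.

    Stops when the queue is empty (no relevant documents left to expand)
    and never follows links beyond max_depth levels below the seeds.
    """
    seeds = list(seeds)
    order = count()  # tie-breaker so heap entries never compare URIs
    queue = [(-1.0, next(order), 0, uri) for uri in seeds]
    heapq.heapify(queue)
    seen = set(seeds)
    collection = []
    while queue:
        _, _, depth, uri = heapq.heappop(queue)
        score = relevance(uri)
        if score < threshold:
            continue  # crawled and not relevant: outlinks stay uncrawled
        collection.append((uri, score))
        if depth == max_depth:
            continue  # depth limit reached
        for link in fetch_outlinks(uri):
            if link not in seen:
                seen.add(link)
                # Assumption: a link's priority is its parent's score.
                heapq.heappush(queue, (-score, next(order), depth + 1, link))
    return collection
```

With a toy link graph, a page below the threshold is recorded as crawled but none of its children are ever fetched, which matches the three node states in the crawl tree figure.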
• New York City, October 31st 2017
• Las Vegas, October 1st 2017
• Orlando, June 12th 2016
• San Bernardino, December 2nd 2015
• Tucson, January 8th 2011
• Binghamton, April 3rd 2009
Collections Crawled (in November 2017)
NYC, 10/31/2017 – URIs per Level
[Figure: two panels, Web Archive Crawl and Live Web Crawl; x-axis: crawl depth (0–5); left y-axis: number of URIs (0–2000); right y-axis: percent (0–100); series: All URIs, Relevant URIs]
TUC, 01/08/2011 – URIs per Level
[Figure: two panels, Web Archive Crawl and Live Web Crawl; x-axis: crawl depth (0–5); left y-axis: number of URIs (0–80000); right y-axis: percent (0–100); series: All URIs, Relevant URIs]
NYC, 10/31/2017 – Relevance over…
[Figure: two panels, relevance over crawled documents and relevance over crawl time]
TUC, 01/08/2011 – Relevance over…
[Figure: two panels, relevance over crawled documents and relevance over crawl time]
TUC, 01/08/2011 – Comparison to Archive-It
[Figure: accumulated relevance (0–15000) over number of documents (0–15000); series: Web Archive Crawl, Archive-It Crawl]
TUC, 01/08/2011 – Web Archive Contributions
web.archive.org 75%
wayback.archive-it.org 14%
webarchive.loc.gov 7%
web.archive.bibalex.org 2%
archive.is 2%
• Web archives are great resources for building event collections
• Crawling web archives is much slower than crawling the live web
• Collections about very recent events benefit more from the live web than from the archived web, but
  collections about events from the distant past benefit more from the archived web than from the live web
• Utilizing multiple web archives is beneficial for the collection
• Focused crawls have the potential to outperform manual collection building
Takeaways
https://web.archive.org/web/20171206181955/https:/twitter.com/TVNewsArchive/status/938466726190096384