Profiling Web Archive Coverage for Top-Level Domain & Content Language
Ahmed AlSum, Michele C. Weigle, Michael L. Nelson,
Herbert Van de Sompel
International Conference on Theory and Practice of Digital Libraries
September 22-26, 2013
Valletta, Malta
Where to find Mementos for …
http://www.japantimes.co.jp/
Where to find Mementos for …
http://www.google.com/
Research Question
Problem
• Profile public web archives according to the following
dimensions:
o Top-level domains
o Languages
o Growth rate
o Archival date
Motivation
• To determine who is archiving what
• To optimize the query routing for a Memento Aggregator
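A minimal sketch of what a top-level-domain profile could look like, assuming we already know which sampled URIs each archive holds. The function names and the `holdings` data are hypothetical and for illustration only; they are not taken from the study.

```python
from collections import Counter, defaultdict
from urllib.parse import urlsplit


def tld_of(uri):
    """Return the last label of the host name, e.g. 'jp' for www.japantimes.co.jp."""
    host = urlsplit(uri).hostname or ""
    return host.rsplit(".", 1)[-1].lower()


def build_tld_profile(holdings):
    """Map each archive to a Counter of TLD -> number of held URIs."""
    profile = defaultdict(Counter)
    for archive, uris in holdings.items():
        for uri in uris:
            profile[archive][tld_of(uri)] += 1
    return dict(profile)


# Hypothetical holdings (assumption, not measured data).
holdings = {
    "archive_a": ["http://www.japantimes.co.jp/", "http://www.google.com/"],
    "archive_b": ["http://www.japantimes.co.jp/"],
}
profile = build_tld_profile(holdings)
```

A Memento Aggregator could consult such a profile to decide which archives are worth querying for a URI in a given TLD.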
Web Archives in this Experiment
Archive                      Full text  URI-lookup
Internet Archive                        √
Library of Congress                     √
Icelandic Web Archive                   √
Library and Archives Canada  √          √
British Library              √          √
UK National Library          √          √
Portuguese Web Archive       √          √
Web Archive of Catalonia     √          √
Croatian Web Archive         √          √
Archive of the Czech Web     √          √
National Taiwan University   √          √
Archive It                   √          √
Experiment Set Up
• Sample URIs from different sources
o Details coming up
• Retrieve the TimeMap for each URI from all archives
o A TimeMap lists all Mementos for a given URI
o A Memento is an archived version of a resource
• Analyze
o Details coming up
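The TimeMap step can be sketched as follows, assuming the archive returns the TimeMap in the `application/link-format` serialization described by the Memento protocol (RFC 7089). The sample TimeMap, the `archive.example.net` host, and the function name are hypothetical; a real client would fetch the text over HTTP and would need a more tolerant parser.

```python
import re
from email.utils import parsedate_to_datetime

# One link: <URI> followed by zero or more ;name="value" parameters.
LINK_RE = re.compile(r'<([^>]+)>((?:\s*;\s*[\w-]+="[^"]*")*)')
PARAM_RE = re.compile(r'([\w-]+)="([^"]*)"')


def parse_timemap(text):
    """Return (memento URI, datetime) pairs found in a link-format TimeMap."""
    mementos = []
    for uri, raw_params in LINK_RE.findall(text):
        params = dict(PARAM_RE.findall(raw_params))
        # rel may hold several tokens, e.g. "first memento".
        if "memento" in params.get("rel", "").split():
            mementos.append((uri, parsedate_to_datetime(params["datetime"])))
    return mementos


# Hypothetical TimeMap for illustration (assumption, not real archive output).
SAMPLE_TIMEMAP = '''<http://example.org/>;rel="original",
<http://archive.example.net/timemap/http://example.org/>;rel="self";type="application/link-format",
<http://archive.example.net/20000620180259/http://example.org/>;rel="first memento";datetime="Tue, 20 Jun 2000 18:02:59 GMT",
<http://archive.example.net/20130922000000/http://example.org/>;rel="last memento";datetime="Sun, 22 Sep 2013 00:00:00 GMT"
'''

mementos = parse_timemap(SAMPLE_TIMEMAP)
```

Counting the entries in `mementos` per archive and per URI gives the raw numbers that the later analysis slides summarize.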
Sampling URIs
Web
1. DMOZ:Random
2. DMOZ:TLD - 2% of each TLD from DMOZ (.com, .org, .jp, etc.; 52 TLDs)
3. DMOZ:Languages - 100 URIs for each language (24 languages)
Web Archives Full Text
4. Top 1-grams from Bing
5. Top 1,000 query terms from Yahoo in 9 languages
User requests
6. IA Wayback Machine log files
7. Memento Aggregator log files
Sampling URIs - DMOZ
1. DMOZ:Random
o 10,000 URIs randomly sampled from DMOZ directory (~5M URIs).
2. DMOZ:TLD - 2% of each TLD from DMOZ or 100 URIs, whichever is greater
o 52 TLDs: (com 23,470), (de 6,332), (org 4,025), (uk 3,309), (net
2,073), (it 1,775), (jp 1,379), (ru 1,244), (fr 1,154), (pl 1,062), (au
764), (ca 642), (at 438), (edu 390), (cz 385), (tr 334), (info 319),
(cn 278), (us 266), (nz 265), (es 238), (ar 213), (no 150), (br 149),
(tw 141), (za 118), (fi 113), and 100 URIs each for [ae, cat, cl, cu,
eg, gov, id, in, ir, is, ke, kr, ma, mt, mx, my, na, pe, pk, pt, sa,
to, uy, zw]
3. DMOZ:Languages - 100 URIs for each language
o 24 languages: Icelandic, Portuguese, Catalan, Afrikaans, Arabic,
Indonesian, Chinese (Simplified), Chinese (Traditional), Dutch,
Spanish, French, Greek, Hindi, Italian, Japanese, Korean,
Norwegian, Persian, Polish, Russian, Turkish, Ukrainian
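The DMOZ:TLD rule (2% of each TLD or 100 URIs, whichever is greater) can be sketched as a stratified sample. This is an illustration under assumptions: the rounding of the 2% share, the cap at the population size for TLDs with fewer than 100 URIs, the fixed seed, and the function names are all hypothetical choices, not details from the slides.

```python
import random
from collections import defaultdict
from urllib.parse import urlsplit


def sample_size(population, rate=0.02, floor=100):
    """2% of the population or `floor`, whichever is greater, capped at the population."""
    return min(population, max(floor, round(rate * population)))


def sample_by_tld(uris, seed=0):
    """Group URIs by top-level domain and draw a random sample from each group."""
    by_tld = defaultdict(list)
    for uri in uris:
        host = urlsplit(uri).hostname or ""
        by_tld[host.rsplit(".", 1)[-1].lower()].append(uri)
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    return {
        tld: rng.sample(group, sample_size(len(group)))
        for tld, group in by_tld.items()
    }


# Synthetic directory for illustration (assumption, not DMOZ data).
directory = [f"http://site{i}.example.com/" for i in range(6000)]
directory += [f"http://site{i}.example.is/" for i in range(50)]
samples = sample_by_tld(directory)
```

With this rule, large TLDs such as com contribute proportionally, while small TLDs still contribute enough URIs to be profiled.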
Sampling URIs – Web Archives Full Text
• Query the full-text search interface of selected web archives
with two sets of query terms.
4. Top 1-grams from Bing
o Most are English
5. Top 1,000 query terms from Yahoo in 9 languages
o Excluding general keywords such as Obama and Facebook.
Sampling URIs – Web Archives Full Text
Archives with full-text search
Archive  Chinese  English  French  German  Italian  Japanese  Korean  Portuguese  Spanish  Yahoo  Bing
AIT           26     2066    3512    3837     3321       119       2        2434     2141  12617  3953
BL           163     2354    2350    2240     2068       225     131        1940     2056   6430  3187
CAN           49      800     804     646      601        77     113         580      514   1351  1107
CR            54      706     697     703      701        74      19         599      600   1599  1201
CZ           363     1782    1578    1695     1519       577     114        1310     1278   6081  3360
CAT           28     2775    2496    2448     2280       209     129        2164     2429   8996  4241
PO            91     2460    3603    3081     3113        53      69        3267     3177  14126  5004
TW           357      178     176     165      157       106       7         198      119   1004   354
Sampling URIs – User Requests
• Sampling from user requests for archived web resources
6. Sample from IA Wayback Machine Log files
o 1,000 URIs randomly sampled from Feb 22, 2012 to Feb 26,
2012.
7. Sample from Memento Aggregator log files
o 100 URIs randomly sampled from the LANL Memento Aggregator logs
between 2011 and 2013.
Archive Coverage per Sample
Entire Sample (chart labels: 100%, 35%)
TLD Coverage across Archives (1)
Entire Sample
TLD Coverage across Archives (2)
Entire Sample
TLD Distribution per Archive
DMOZ:TLD Sample
TLD Distribution per Archive
Web Archives Full Text Sample
Language Coverage per Archive
DMOZ Sample
Archive Growth Rate
Entire Sample
TLD Coverage across Archives
Entire Sample
Query Routing Evaluation
Conclusions
• Introduced an automated methodology to profile web
archives using available infrastructure, with no privileged
access required
• Coverage:
o Internet Archive provides broad coverage
o National archives have good coverage for their domains
o Surprising coverage by certain archives
• Query Routing:
o In 84% of the cases, all existing Mementos for a TLD can be
found by using IA and two additional top archives for a TLD
o In 55% of the cases, all existing Mementos for a TLD can be
found by using the top 3 archives for a TLD, excluding IA
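One way to read the query-routing figures is as a recall check: for a given TLD, pick a small set of archives and count how often that set already contains every archive holding the URI. The sketch below follows that reading for the "IA plus two additional top archives" case. The ranking criterion (number of URIs held), the function names, and the `jp_holdings` data are assumptions for illustration, not the evaluation procedure used in the study.

```python
from collections import Counter


def top_archives(holdings, k, always=()):
    """Return `always` plus the k other archives holding the most URIs."""
    counts = Counter()
    for archives in holdings.values():
        counts.update(a for a in archives if a not in always)
    return set(always) | {name for name, _ in counts.most_common(k)}


def full_recall_rate(holdings, selected):
    """Fraction of archived URIs whose holding archives are all in `selected`."""
    archived = [archives for archives in holdings.values() if archives]
    if not archived:
        return 0.0
    hits = sum(1 for archives in archived if set(archives) <= selected)
    return hits / len(archived)


# Hypothetical holdings for one TLD: URI -> archives with at least one Memento.
jp_holdings = {
    "http://a.example.jp/": {"IA", "AIT"},
    "http://b.example.jp/": {"IA"},
    "http://c.example.jp/": {"IA", "BL", "TW"},
    "http://d.example.jp/": {"AIT", "TW"},
    "http://e.example.jp/": set(),  # not archived anywhere, so ignored
}
selected = top_archives(jp_holdings, k=2, always=("IA",))
rate = full_recall_rate(jp_holdings, selected)
```

In this toy data the aggregator would query IA, AIT, and TW, and would miss only the URI that is also held by BL.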
