Web Archive Profiling Through Fulltext Search
Sawood Alam and Michael L. Nelson
Computer Science Department, Old Dominion University
Norfolk, Virginia - 23529
Herbert Van de Sompel
Los Alamos National Laboratory, Los Alamos, NM
David S. H. Rosenthal
Stanford University Libraries, Stanford, CA
Supported in part by the IIPC and NSF 1526700
Unorganized Collections
Organized Collections
Collection Understanding
Memento Aggregator
From: Michael Nelson [mailto:mln@cs.odu.edu]
Sent: Wednesday, December 02, 2015 12:33 PM
To: Jones, Gina
Cc: Rourke, Patrick; Grotke, Abigail
Subject: Re: WebSciDL
Hi Gina, I'll investigate. memgator is software that one of my students wrote,
but I suspect the traffic you're seeing is b/c it is deployed in
http://oldweb.today/ can you share the IP addr from where you're seeing
the traffic? I presume the requests are for Memento TimeMaps? It should
not actually be scraping HTML pages.
regards,
Michael
On Wed, 2 Dec 2015, Jones, Gina wrote:
> Hi Michael, we have a slight configuration issue with the current OW
> set up for our webarchives. I think, from looking at the logs, that
> "MemGator:1.0-rc3 <@WebSciDL>" is really causing some issues
> on our wayback.
> Do you know who is running this scraper? Itʼs not part of memento is it?
>
> Gina Jones
> Web Archiving Team
> Library of Congress
From: Ilya Kreymer <ikreymer@gmail.com>
Date: Wed, 2 Dec 2015 10:33:56 -0800
Subject: high traffic on oldweb!
To: Herbert Van de Sompel <hvdsomp@gmail.com>, Sawood Alam
<ibnesayeed@gmail.com>
Hi Herbert, Sawood,
Herbert: Perhaps you are lucky that I am not using the LANL aggregator,
as the traffic has gotten really high, and also I was asked to remove an
archive due to the traffic it was causing temporarily..
I am thinking that ability to remove source archives quickly is an
important aspect of an aggregator.
Sawood: Hopefully yours will support something like this so I don't need
to restart the container to change the archivelist ;)
Ilya
Broadcasting is Bad
Availability and Overlap
● Archives are sparse
● Broadcasting is wasteful; both clients and archives suffer
Memento Routing
Routing Pros & Cons
● Pros
○ Minimizes traffic and resource consumption
○ Improves throughput
● Cons
○ Upfront profile maintenance cost
○ May miss Mementos (false negatives)
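The difference between broadcasting and routing can be sketched in a few lines. This is only an illustration: the archive names and profile keys are made up, and each profile is assumed to be a plain set of SURT-like keys.

```python
def route(lookup_key, profiles):
    """Return only the archives whose profile contains the lookup key."""
    return [name for name, keys in profiles.items() if lookup_key in keys]

# Hypothetical archives and profile keys, for illustration only.
profiles = {
    "archive-a": {"uk,co,bbc,)/", "org,example,)/"},
    "archive-b": {"org,example,)/"},
    "archive-c": {"com,cnn,)/"},
}

broadcast_targets = list(profiles)                 # broadcasting polls every archive
routed_targets = route("uk,co,bbc,)/", profiles)   # routing polls likely holders only

print(len(broadcast_targets), "archives polled when broadcasting")
print(routed_targets, "polled when routing")
```

A key that is missing from an outdated profile would be skipped here even if the archive holds the page, which is the false-negative risk listed under Cons.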
Why Do Small Archives Matter?
● 400B+ web pages at IA do not cover everything
● Top three archives after IA produce full TimeMap 52% of the time (AlSum, et al., TPDL 2013)
● Targeted crawls
● Special focus archives
● Restricted resources
● Private archives
● Censorship
While the Internet Archive was Down...
$ memgator -f cdxj example.org | cut -c-4 | grep -v "^@" | uniq -c
2 2002
1 2005
1 2008
6 2009
67 2010
17 2011
64 2012
108 2013
108 2014
186 2015
51 2016
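The pipeline above keeps the first four characters (the year) of each TimeMap line, drops metadata lines, and counts them. A rough Python equivalent follows, assuming that data lines begin with a 14-digit datetime and that metadata lines begin with "@", as the grep pattern implies. The sample lines are made up and are not real MemGator output.

```python
from collections import Counter

def mementos_per_year(cdxj_lines):
    """Count mementos per year from CDXJ TimeMap lines.

    Assumes data lines start with a YYYYMMDDhhmmss datetime and that
    metadata lines start with "@" (as the grep in the slide implies).
    """
    years = Counter()
    for line in cdxj_lines:
        line = line.strip()
        if not line or line.startswith("@"):
            continue  # skip blank lines and metadata
        years[line[:4]] += 1
    return dict(sorted(years.items()))

# Made-up sample input for illustration.
sample = [
    '@context ["http://tools.ietf.org/html/rfc7089"]',
    '20020120142510 {"uri": "http://archive.example/20020120142510/http://example.org/"}',
    '20021129080049 {"uri": "http://archive.example/20021129080049/http://example.org/"}',
    '20050401000000 {"uri": "http://archive.example/20050401000000/http://example.org/"}',
]
print(mementos_per_year(sample))
```

Unlike `uniq -c`, which counts adjacent runs, the counter totals each year regardless of order; the two agree when the TimeMap is sorted by datetime.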
Archive Profile
● High-level summary of an archive
● Predicts presence of mementos of a URI-R in an archive
● Provides various statistics about the holdings
● Small in size
● Publicly available
● Easy to update and partially patch
● Useful for Memento query routing and other applications
Profiling Strategies
● Sample URI Profiling (AlSum, et al., TPDL 2013)
● CDX Profiling (Alam, et al., TPDL 2015)
● Response Cache Profiling (Bornand, et al., JCDL 2016)
● Fulltext Search Profiling
Methodology
Top Nouns
time
year
people
way
man
day
thing
child
mr
government
Random Dict
analogies
unbolt
consonant
coils
stolidly
cigar
decrepit
rhododendron
cannibal
honeydew
Dynamic Words Discovery
the ‫وﻛﺎﻟﺔ‬ war
angry ‫أﻧﺑﺎء‬ the
arab ‫اﻟﻌرﺑﻲ‬ middle
news ‫اﻟﻐﺎﺿب‬ east
service on arabic
a politics poetry
source war art
Random Searcher Model (RSM)
[Figure: RSM flowchart. Elements: START; Vocabulary seeding needed? (Yes/No); Seed Vocabulary; NextWord(); Search(); Store search results; Termination condition reached? (Yes/No); Select a random link from the search results; Fetch the contents of the selected document; ExtractWords(); GenerateProfile(); STOP]
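A minimal sketch of one loop that is consistent with the flowchart labels. The ordering of steps is an assumption, since the arrows of the figure are not recoverable here. The in-memory `DOCS` dictionary and the `search` function are hypothetical stand-ins for a real fulltext search service, the vocabulary is seeded only once, and the termination condition is assumed to be a simple search budget.

```python
import random
import re

# Toy in-memory "archive" standing in for a real fulltext search service.
DOCS = {
    "http://a.example/": "arab news service politics",
    "http://b.example/": "war poetry art news",
    "http://c.example/": "middle east politics war",
}

def search(word):
    """Return URIs of documents containing the word (hypothetical backend)."""
    return [uri for uri, text in DOCS.items() if word in text.split()]

def extract_words(text):
    """Split fetched text into a set of lowercase words."""
    return set(re.findall(r"[a-z]+", text.lower()))

def random_searcher(seed_words, max_searches=50, rng=None):
    rng = rng or random.Random(0)
    vocabulary = set(seed_words)    # words not searched yet
    used = set()                    # words already searched
    discovered = set()              # URIs seen in search results
    for _ in range(max_searches):   # assumed termination condition: search budget
        if not vocabulary:
            break                   # nothing left to search for
        word = rng.choice(sorted(vocabulary))        # NextWord()
        vocabulary.discard(word)
        used.add(word)
        results = search(word)                       # Search()
        discovered.update(results)                   # store search results
        if results:
            link = rng.choice(results)               # select a random link
            text = DOCS[link]                        # fetch the selected document
            vocabulary |= extract_words(text) - used # ExtractWords()
    return discovered               # would feed profile generation

print(sorted(random_searcher(["news"])))
```

Starting from the single seed word "news", the walk reaches the third document only through words learned from the first two, which is the point of dynamic word discovery.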
RSM Illustration
Teaching Resources Adjunct Toolkit NC NET Academy PD Planning Tools Regional
Centers Campus Liaisons Nontraditional Careers College Tech Prep NC ACCESS Co op
Education Green Technology You are here NC NET Teaching Resources Discipline Specific
English English Self Paced Modules Writing Across the Curriculum NC NET Western Center
Incorporating Visuals in Workplace Documents Sections 1 2 Wake Tech Community College
Incorporating Visuals in Workplace Documents Section 3 Wake Tech Community College
All self paced modules can be accessed through the NC NET Blackboard server Log in with
the user name faculty and the password nc net Once connected you can view the courses
by topic or alphabetically by title English Webliography North Carolina Community College
System 2012
RSM Modes
● Static: Externally supplied static word list
● PopularityBiased: Refresh Vocabulary after every search attempt and consider term frequency for selecting next search keyword
● EqualOpportunity: Refresh Vocabulary after every search attempt and ignore term frequency for selecting next search keyword
● Conservative: Discover new words only when the Vocabulary is exhausted
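The difference between PopularityBiased and EqualOpportunity lies in how the next keyword is drawn. A small sketch, assuming the vocabulary is kept as a word-to-frequency mapping; the function name, the mapping, and the counts are illustrative only.

```python
import random
from collections import Counter

def next_word(term_counts, mode, rng):
    """Pick the next search keyword from observed term counts.

    PopularityBiased weights words by term frequency;
    EqualOpportunity ignores frequency and picks uniformly.
    """
    words = sorted(term_counts)
    if mode == "PopularityBiased":
        weights = [term_counts[w] for w in words]
        return rng.choices(words, weights=weights, k=1)[0]
    if mode == "EqualOpportunity":
        return rng.choice(words)
    raise ValueError(f"unsupported mode: {mode}")

rng = random.Random(42)
counts = Counter({"news": 98, "poetry": 1, "art": 1})  # made-up frequencies
biased = Counter(next_word(counts, "PopularityBiased", rng) for _ in range(1000))
uniform = Counter(next_word(counts, "EqualOpportunity", rng) for _ in range(1000))
print("PopularityBiased picks of 'news':", biased["news"])
print("EqualOpportunity picks of 'news':", uniform["news"])
```

Static and Conservative differ from these two in when the vocabulary is refreshed rather than in how a word is drawn, so they are not modeled here.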
Profiling Policies & Archive-It Dataset
Policy  # Keys      Example
URIR    30,800,406  uk,co,bbc,news,)/Images/Logo.png?height=80&width=200
HxP1     1,724,284  uk,co,bbc,news,)/Images
DDom        91,629  uk,co,bbc,)/
H1P0           212  uk,)/

Sample URI: https://www.news.BBC.co.uk/Images/Logo.png?width=80&height=40

For a detailed list of profiling policies please refer to:
Alam, et al.: Web Archive Profiling Through CDX Summarization. IJDL (2016) 17: 223-238
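A simplified sketch of how keys of this shape could be derived from a URI. It is not the canonicalization used to build the dataset: it lowercases and reverses the host, drops a leading "www", sorts query parameters, and keeps the path case. `MULTI_LABEL_SUFFIXES` is a one-entry stand-in for a real public suffix list, and all function names are hypothetical.

```python
from urllib.parse import urlsplit, parse_qsl, urlencode

# Hypothetical stand-in for a public suffix list; real code would load the full list.
MULTI_LABEL_SUFFIXES = {"co.uk"}

def host_labels(uri):
    """Host labels in reversed (SURT-like) order, without a leading 'www'."""
    labels = (urlsplit(uri).hostname or "").split(".")  # hostname is lowercased
    if labels[0] == "www":
        labels = labels[1:]
    return labels[::-1]

def path_segments(uri):
    return [s for s in urlsplit(uri).path.split("/") if s]

def make_key(host, path="", query=""):
    key = ",".join(host) + ",)/" + path
    return key + ("?" + query if query else "")

def urir_key(uri):
    """Full key: all host labels, full path, sorted query parameters."""
    query = urlencode(sorted(parse_qsl(urlsplit(uri).query)))
    return make_key(host_labels(uri), "/".join(path_segments(uri)), query)

def hxpy_key(uri, max_host=None, max_path=0):
    """Keep at most max_host host labels (None = all) and max_path path segments."""
    host = host_labels(uri)
    if max_host is not None:
        host = host[:max_host]
    return make_key(host, "/".join(path_segments(uri)[:max_path]))

def ddom_key(uri):
    """Registered domain only, using the tiny suffix set above."""
    host = host_labels(uri)
    two_label_suffix = len(host) >= 2 and f"{host[1]}.{host[0]}" in MULTI_LABEL_SUFFIXES
    return make_key(host[:3 if two_label_suffix else 2])

uri = "https://www.news.BBC.co.uk/Images/Logo.png?width=80&height=40"
print("URIR", urir_key(uri))
print("HxP1", hxpy_key(uri, max_host=None, max_path=1))
print("DDom", ddom_key(uri))
print("H1P0", hxpy_key(uri, max_host=1, max_path=0))
```

Each policy is a truncation of the same reversed-host key, which is why the number of distinct keys drops sharply from URIR to H1P0.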
Searches vs Coverage
[Figure: searches vs. coverage. Panel annotations: 100% in 11K searches; 100% in 27K searches; 100% in 337K searches; 100% in 1.9M searches]
RSM Operation Mode Costs
Mode              Query Cost  HTTP Cost             Remarks
Static            C           C                     Suitable for specialized collections with known top keywords
PopularityBiased  C           2 * C                 Human-like model, but costly
EqualOpportunity  C           2 * C                 Human-like model, but costly
Conservative      C           C + ε (where ε << C)  Suitable for any collection; works without any supplementary materials and with very little overhead
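The HTTP cost column can be read as simple arithmetic over the number of searches C. A sketch under stated assumptions: the doubling is taken to mean one search plus one document fetch per iteration, and `document_fetches` is a hypothetical name for the small extra term of the Conservative mode.

```python
def http_cost(mode, searches, document_fetches=0):
    """Estimate HTTP requests for a run of `searches` queries (C in the table).

    `document_fetches` is a hypothetical count of page fetches made when the
    Conservative vocabulary runs out; it plays the role of the small extra term.
    """
    if mode == "Static":
        return searches                     # searches only, no page fetches
    if mode in ("PopularityBiased", "EqualOpportunity"):
        return 2 * searches                 # assumed: one search plus one fetch each time
    if mode == "Conservative":
        return searches + document_fetches  # fetch only when the vocabulary is exhausted
    raise ValueError(f"unknown mode: {mode}")

for mode in ("Static", "PopularityBiased", "EqualOpportunity", "Conservative"):
    print(mode, http_cost(mode, searches=1000, document_fetches=12))
```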
Routing Confusion Matrix
Predicted \ Actual         Present in the Archive  Not in the Archive
Routed to the Archive      True Positive (TP)      False Positive (FP)
Not Routed to the Archive  False Negative (FN)     True Negative (TN)

Recall = TP / (TP + FN)
Accuracy = (TP + TN) / (TP + FP + FN + TN)
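Using the standard definitions, recall is TP / (TP + FN) and accuracy is (TP + TN) / (TP + FP + FN + TN). A short worked example with made-up counts:

```python
def recall(tp, fn):
    """Fraction of archives that hold the resource and were routed to."""
    return tp / (tp + fn) if (tp + fn) else 0.0

def accuracy(tp, fp, fn, tn):
    """Fraction of all routing decisions that were correct."""
    total = tp + fp + fn + tn
    return (tp + tn) / total if total else 0.0

# Made-up counts for illustration, not results from the study.
tp, fp, fn, tn = 90, 150, 10, 750
print("recall:", recall(tp, fn))
print("accuracy:", accuracy(tp, fp, fn, tn))
```

Here the 150 false positives are wasted requests to archives, while the 10 false negatives are mementos a user would never see.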
Accuracy, Recall, & Coverage (10-100%)
[Figure: accuracy and recall at 10-100% coverage. Panels: DMOZ, IA Wayback, UK Wayback, Memento Proxy]
Low Accuracy (high FP) => Archives & Aggregator suffer
Low Recall (high FN) => Users suffer
Profile Policy Recommendations
● IF complete CDX is available THEN
○ Generate HxP1 profile
● ELSE IF fulltext search is available THEN
○ Generate DDom profile
● ELSE
○ Generate H1P0 or other smaller profiles using Sample URIs
Note: It is possible to perform less detailed queries on more specific (higher order) profiles, but not the other way around
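The note can be illustrated by rolling a more specific profile up into a coarser one. This sketch assumes a profile is a mapping from SURT-like keys to counts; the keys and counts are hypothetical.

```python
from collections import Counter

def roll_up(profile, host_segments):
    """Derive a coarser host-only profile from a more specific one.

    Keys are truncated to at most `host_segments` host labels and no path,
    and the counts of keys that collapse together are summed.
    """
    coarse = Counter()
    for key, count in profile.items():
        host = key.split(",)/", 1)[0].split(",")
        coarse[",".join(host[:host_segments]) + ",)/"] += count
    return dict(coarse)

# Hypothetical HxP1 profile fragment.
hxp1 = {
    "uk,co,bbc,news,)/Images": 5,
    "uk,co,bbc,)/sport": 2,
    "org,example,)/about": 1,
}
print(roll_up(hxp1, 3))  # three host labels, no path
print(roll_up(hxp1, 1))  # H1P0-style keys
```

The reverse direction is not possible: a key such as "uk,)/" no longer records which hosts or paths contributed to its count.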
RSM Mode Recommendations
● IF the collection is about a specific topic in a specific language AND a suitable top keywords list is available THEN
○ Use Static mode
● ELSE
○ Use Conservative mode
Who Knows Term Frequency for Estonian Nouns?
https://en.wiktionary.org/wiki/Category:Estonian_nouns
Future Work
● Evaluation of combination profiles such as URI-Key along with Datetime
● Utilize archive profiles to generate a rank-ordered list of archives
● Profiles for usage other than Memento routing, such as site-classification-based profiles (e.g., news, wiki, social media, blog, etc.)
Conclusions
● Evaluated the search cost as a function of archive holdings' coverage and profiling policy
● Developed the Random Searcher Model
● Correctly routed 80% of requests while maintaining 0.9 Recall by discovering only 10% of the archive holdings and generating a profile that costs less than 1% of the complete knowledge profile
