Memento Tracer Framework: Balancing Quality and Scalability for Web Archiving
@mart1nkle1n
TPDL, Oslo, Norway, September 10 2019
Martin Klein
Los Alamos National Laboratory
martinklein0815@gmail.com
@mart1nkle1n
with
Harihar Shankar (98point6)
Lyudmila Balakireva (LANL)
Herbert Van de Sompel (DANS)
The Memento Tracer Framework:
Balancing Quality and Scalability
for Web Archiving
A major challenge in web archiving:
Scale vs. Quality
IA’s Scale!
https://twitter.com/brewster_kahle/status/1016003169589981184
IA’s Scale!!
https://twitter.com/brewster_kahle/status/1118172506777509890
IA’s Scale!!!
https://twitter.com/brewster_kahle/status/1139700494748663809
IA’s Scale!!!!
https://twitter.com/brewster_kahle/status/1170820482104348672
Fidelity?
http://web.archive.org/web/*/http://cnn.com
Fidelity?
http://web.archive.org/web/20190808041346/https://www.cnn.com/
Fidelity?
https://ws-dl.blogspot.com/2017/01/2017-01-20-cnncom-has-been-unarchivable.html
Webrecorder’s Fidelity!
https://webrecorder.io/martinklein/tpdl_test_collection/20190417221002/https://www.cnn.com/
Webrecorder’s Fidelity!!
https://twitter.com/ianmilligan1/status/1136703505442324481
https://twitter.com/MellonFdn/status/1138811967060267011
Webrecorder’s Scale?
https://twitter.com/mart1nkle1n/status/1136705116738904067
Scale vs. Quality
• Crawler-based approaches scale well
• Crawling quality is not always as desired
• Human-driven approaches often result in great quality
• Not necessarily designed for (web) scale
Memento Tracer
http://tracer.mementoweb.org
Memento Tracer Framework
http://tracer.mementoweb.org
Inspired by:
• LOCKSS
  • Same automated approach for resources of a class
• Webrecorder
  • Manual recording of web resources
• Various attempts aimed at automating interactions/behaviors
  • E.g., Brozzler, Browsertrix
Memento Tracer Implementation
• Client-side:
  • Tracer Chrome extension leveraging Selenium IDE
  • JSON-formatted Trace for download
• Server-side:
  • Stormcrawler
  • Selenium (Chrome) with Tracer plug-in
  • WarcProxy
  • File-system storage for WARC files
http://stormcrawler.net/
https://www.seleniumhq.org/projects/webdriver/
https://github.com/odie5533/WarcProxy
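To make the idea of a JSON-formatted Trace more concrete, here is a minimal sketch of how a server-side component might read one. The slides do not show the actual Tracer schema, so every field name below (`uri_pattern`, `actions`, `type`, `selector`, `repeat_until`) and the selectors themselves are illustrative assumptions, not the format produced by the Tracer Chrome extension.

```python
import json

# Hypothetical trace document. The field names and selectors are illustrative
# assumptions, not the schema emitted by the Tracer Chrome extension.
TRACE_JSON = """
{
  "uri_pattern": "https://www.slideshare.net/*",
  "actions": [
    {"type": "click", "selector": "#next-slide", "repeat_until": "selector_missing"},
    {"type": "click_all", "selector": "div.notes a"}
  ]
}
"""


def describe_trace(trace_text):
    """Parse a JSON trace and return one human-readable line per action."""
    trace = json.loads(trace_text)
    lines = []
    for step, action in enumerate(trace["actions"], start=1):
        line = f"{step}. {action['type']} on {action['selector']}"
        if "repeat_until" in action:
            line += f" (repeat until {action['repeat_until']})"
        lines.append(line)
    return lines


if __name__ == "__main__":
    for line in describe_trace(TRACE_JSON):
        print(line)
```

In a real deployment each action would be handed to a browser driver rather than printed; the sketch only shows that a trace is plain data that can be shared, versioned, and replayed.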
Tracer Interactions
https://github.com/mementoweb/memento_extensions
https://www.slideshare.net/martinklein0815/evaluating-memento-service-optimizations
Current Memento Tracer Capabilities
• Single clicks/links
• All links in an area
• Repeated clicks on links, with stop condition
  • Slides
  • Pagination
• Nested traces, i.e., “trace in a trace”
  • Trace for portal A → follow link to portal B → execute trace for portal B
• Identification of page/portal for which a trace exists by URI (pattern)
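Two of the capabilities above, selecting a trace by URI pattern and repeating a click until a stop condition holds, can be sketched without a browser. This is a minimal illustration under stated assumptions: the regular expressions, trace names, and the `make_fake_deck` stand-in for a browser are all hypothetical and are not taken from the Tracer code base.

```python
import re

# Hypothetical registry mapping URI patterns to trace names. The patterns and
# names are illustrative assumptions, not entries from a shared trace repository.
TRACE_REGISTRY = [
    (re.compile(r"^https://github\.com/[^/]+/[^/]+/?$"), "github-repository"),
    (re.compile(r"^https://www\.slideshare\.net/[^/]+/[^/]+/?$"), "slideshare-deck"),
]


def find_trace(uri):
    """Return the name of the first trace whose pattern matches, else None."""
    for pattern, name in TRACE_REGISTRY:
        if pattern.match(uri):
            return name
    return None


def repeat_click(click_next, max_clicks=1000):
    """Call click_next() until it reports no further target (the stop condition).

    click_next stands in for a browser click on a "next" control. It returns
    the newly revealed URI, or None once the control is no longer available.
    max_clicks is a safety bound against pages that never stop paginating.
    """
    captured = []
    for _ in range(max_clicks):
        uri = click_next()
        if uri is None:
            break
        captured.append(uri)
    return captured


def make_fake_deck(n_slides):
    """Simulate a slide deck that starts on slide 1 (no real browser involved)."""
    state = {"slide": 1}

    def click_next():
        if state["slide"] >= n_slides:
            return None
        state["slide"] += 1
        return f"https://example.org/deck/slide-{state['slide']}"

    return click_next


if __name__ == "__main__":
    print(find_trace("https://github.com/mementoweb/memento_extensions"))
    print(repeat_click(make_fake_deck(4)))
```

A nested trace would follow the same shape: when a URI revealed by one trace matches another pattern in the registry, that second trace is executed on it.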
Memento Tracer Benefits
• Scalability
  • Trace created once is applicable to all web resources of the same class
  • Traces shared via repository (edits, versioning)
• Quality
  • Trace used as set of instructions for browser-based capture framework
  • Resource boundary explicit
• Tradeoff
  • Quality vs. performance
Evaluation of Scalability & Quality
• Dataset made of GitHub repositories and Slideshare slide decks
  • 17,646 GitHub repositories (via changelog.com)
  • 12,280 Slideshare decks (via Explore feature)
• Archival goals:
  • GitHub: get all repository files and ZIP file
  • Slideshare: get all slides and notes
• Quality eval:
  • Compare against Webrecorder
• Scalability eval:
  • Large number of high-quality captures
  • Compare against crawl time of common crawler
Quality
• Not a trivial dimension to evaluate!
• Decision to evaluate by number of URIs in live web version vs. archived snapshot
  • Based on manually generated snapshots with Webrecorder
  • Random sample of 100 repos and 100 slide decks
• Expectation:
  • 100% of URIs from live web in archived snapshot
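The quality measure described above compares the URIs seen on the live web with those present in the archived snapshot. A minimal sketch of such a coverage ratio follows; the function name and the example URIs are hypothetical, and this is not the evaluation code used for the reported results.

```python
def uri_coverage(live_uris, archived_uris):
    """Fraction of live-web URIs that also appear in the archived snapshot.

    Returns a value between 0.0 and 1.0. The expectation stated on the slide
    corresponds to a value of 1.0. URIs found only in the archive are ignored.
    """
    live = set(live_uris)
    if not live:
        return 1.0  # nothing to capture, so nothing is missing
    return len(live & set(archived_uris)) / len(live)


if __name__ == "__main__":
    # Hypothetical URIs for illustration only.
    live = [
        "https://example.org/repo",
        "https://example.org/repo/README.md",
        "https://example.org/repo/src/main.py",
        "https://example.org/repo/archive.zip",
    ]
    archived = [
        "https://example.org/repo",
        "https://example.org/repo/README.md",
        "https://example.org/repo/src/main.py",
        "https://example.org/unrelated.css",
    ]
    print(uri_coverage(live, archived))
```

Using sets means duplicate requests for the same URI count once, which matches a "was this resource captured at all" reading of the measure.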
Quality
[Charts: results for the sample of 100 GitHub repositories and the sample of 100 Slideshare decks]
Quality at Scale
[Charts: results for all 17,646 GitHub repositories and all 12,280 Slideshare decks]
Cost of Quality at Scale
• Runtime difference between Memento Tracer and a common web crawler for the same number of URIs
  • Plus 20 seconds per URI, on average
• Faster than previous approaches, discovers many more URIs
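The average overhead of 20 seconds per URI can be turned into a rough estimate of added crawl time. The sketch below is back-of-the-envelope arithmetic only: the URI counts are hypothetical, and the `workers` parameter assumes the overhead divides evenly across parallel browser instances, which the slides do not claim.

```python
EXTRA_SECONDS_PER_URI = 20  # average overhead stated on the slide


def extra_runtime_hours(n_uris, extra_seconds_per_uri=EXTRA_SECONDS_PER_URI, workers=1):
    """Estimate the additional crawl time, in hours, relative to a common crawler.

    Assumes the per-URI overhead is constant and, when workers > 1, that the
    work splits evenly across workers (an idealization).
    """
    if workers < 1:
        raise ValueError("workers must be at least 1")
    return n_uris * extra_seconds_per_uri / workers / 3600


if __name__ == "__main__":
    # Hypothetical crawl of 3,600 URIs: 3,600 * 20 s = 72,000 s = 20 hours.
    print(extra_runtime_hours(3600))
    # The same crawl spread over 4 idealized workers: 5 hours.
    print(extra_runtime_hours(3600, workers=4))
```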
Takeaways
• Memento Tracer aims at finding a balance between quality and scale
• Human in the loop, benefits from patterns of web resources
• Experiments provide indicators for high quality, reliability, scale
• Cost involved, slower than simple crawlers
• Optimizations possible, further potential and limitations to be explored

