Memento to Assess Content Drift in Scholarly Communication
@mart1nkle1n
IIPC WAC, 06/16/2017, London, UK
Using the Memento Framework
to Assess Content Drift
in Scholarly Communication
Acknowledgements:
Shawn Jones, Harihar Shankar (LANL)
Richard Tobin, Claire Grover (University of Edinburgh)
Andy Jackson (British Library)
Martin Klein
@mart1nkle1n
Herbert Van de Sompel
@hvdsomp
Research Library
Los Alamos National Laboratory
Link Rot
Content Drift
http://dl00.org
2000
http://dl00.org
2004
http://dl00.org
2005
http://dl00.org
2008
Content Drift
(in legal documents)
Content Drift
(in scholarly articles)
Referenced in
http://dx.doi.org/10.1016/j.nuclphysa.2009.05.110
published on August 15th 2009
[Screenshots of the referenced resource: May 8th 2009 and August 27th 2009]
Referenced in
http://arxiv.org/abs/astro-ph/9707064
published on July 4th 1997
[Screenshots of the referenced resource: June 7th 1997 and today]
[Figure: arXiv corpus, number of articles and URI references per publication year, 1997 to 2011]
http://hiberlink.org/
Definition:
• Link Rot + Content Drift = Reference Rot
Observation:
• Web at large resources are referenced in scholarly articles
• Links to these resources are subject to Reference Rot
Problem:
• Threat to integrity of the web-based scholarly record
• Resources do not have the same sense of fixity as, e.g., journal articles
• Resources’ custodianship is different in terms of long-term archiving, integrity, and access
http://dx.doi.org/10.1371/journal.pone.0115253
Focus: Content Drift
http://dx.doi.org/10.1371/journal.pone.0167475
Study Dataset
• 3.5 million articles from arXiv, Elsevier, PMC
• Published between Jan 1997 and Dec 2012
• Converted from PDF to XML
• Extraction of URIs to web at large resources (>1 million)
• Keep track of articles’ publication date
Novel Approach to Assess Content Drift
Step 1: Find Mementos
• ~ 1 million URI references
• ~ 650k Memento Pre/Post pairs discovered via Memento
https://mementoweb.org
https://tools.ietf.org/html/rfc7089
[Timeline figure: t-1, t, t+1]
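The Pre/Post selection in Step 1 can be sketched roughly as follows. This is an illustration, not the study's code. It assumes that t is the publication date of the referencing article, that the Memento datetimes for a URI have already been collected (for example from a TimeMap as described in RFC 7089), and that "Pre" and "Post" mean strictly before and strictly after t. The function names are hypothetical.

```python
from bisect import bisect_left, bisect_right
from datetime import datetime, timezone
from email.utils import format_datetime


def accept_datetime(t):
    """Format an aware datetime as an Accept-Datetime header value (RFC 7089)."""
    return format_datetime(t.astimezone(timezone.utc), usegmt=True)


def pre_post_pair(memento_datetimes, t):
    """Return (pre, post): the closest Memento datetime strictly before t
    and the closest strictly after t. Either element is None when missing.
    Assumption: strict inequality; the slides do not specify this detail."""
    ordered = sorted(memento_datetimes)
    i = bisect_left(ordered, t)    # first index with ordered[i] >= t
    j = bisect_right(ordered, t)   # first index with ordered[j] > t
    pre = ordered[i - 1] if i > 0 else None
    post = ordered[j] if j < len(ordered) else None
    return pre, post


# Dates taken from the scholarly-article example earlier in the deck.
published = datetime(2009, 8, 15, tzinfo=timezone.utc)
captures = [
    datetime(2009, 8, 27, tzinfo=timezone.utc),
    datetime(2009, 5, 8, tzinfo=timezone.utc),
]
pre, post = pre_post_pair(captures, published)
print(accept_datetime(published), pre.date(), post.date())
```

A URI reference contributes a Pre/Post pair only when both values are present.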
Step 2: Select Representative Mementos
• Apply content similarity measures
• How similar is representative?
Content Similarity Measures
• Compute normalized scores (values between 0...100) for:
• Simhash
• Jaccard
• Sørensen-Dice
• Cosine
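For illustration, the four measures could be computed on a 0 to 100 scale along these lines. This is a minimal sketch under stated assumptions: plain word tokenization, a 64-bit Simhash built from SHA-256 token hashes, and Simhash similarity derived from Hamming distance. The preprocessing and parameters used in the study are not given on the slides, so all of these choices are hypothetical.

```python
import hashlib
import math
import re
from collections import Counter


def tokens(text):
    """Lowercased word tokens (assumed tokenization)."""
    return re.findall(r"\w+", text.lower())


def jaccard(a, b):
    sa, sb = set(tokens(a)), set(tokens(b))
    if not sa and not sb:
        return 100.0
    return 100.0 * len(sa & sb) / len(sa | sb)


def sorensen_dice(a, b):
    sa, sb = set(tokens(a)), set(tokens(b))
    if not sa and not sb:
        return 100.0
    return 100.0 * 2 * len(sa & sb) / (len(sa) + len(sb))


def cosine(a, b):
    ca, cb = Counter(tokens(a)), Counter(tokens(b))
    dot = sum(ca[t] * cb[t] for t in ca.keys() & cb.keys())
    norm = math.sqrt(sum(v * v for v in ca.values()) * sum(v * v for v in cb.values()))
    if norm == 0:
        return 100.0 if ca == cb else 0.0
    return 100.0 * dot / norm


def simhash(text, bits=64):
    """Fingerprint: per-bit weighted vote over token hashes."""
    weights = [0] * bits
    for tok, count in Counter(tokens(text)).items():
        digest = hashlib.sha256(tok.encode("utf-8")).digest()
        h = int.from_bytes(digest[: bits // 8], "big")
        for i in range(bits):
            weights[i] += count if (h >> i) & 1 else -count
    return sum(1 << i for i, w in enumerate(weights) if w > 0)


def simhash_similarity(a, b, bits=64):
    distance = bin(simhash(a, bits) ^ simhash(b, bits)).count("1")
    return 100.0 * (1 - distance / bits)


def scores(a, b):
    """All four normalized scores for a pair of texts."""
    return {
        "simhash": simhash_similarity(a, b),
        "jaccard": jaccard(a, b),
        "sorensen_dice": sorensen_dice(a, b),
        "cosine": cosine(a, b),
    }
```

Identical inputs score 100 on every measure, which is the "perfect score" condition used to select representative Mementos.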
Representative Mementos
• Idea
  • If perfect score in all 4 similarity measures
    → Memento Pre and Post are the same
    → Representative Mementos
• Sanity check needed
  • Via HTTP headers: ETag and Last-Modified
  • If same for Pre and Post Memento
    → HTTP-same
• Sanity check passed!
  • 98.88% of Memento pairs that are HTTP-same have perfect score in all 4 similarity measures
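The selection rule and its sanity check can be sketched as below. The sketch assumes that each pair comes with a dict of four scores on the 0 to 100 scale and with the archived response headers of the Pre and Post Memento as plain dicts keyed by canonical header name. It also assumes that "HTTP-same" requires both headers to be present and equal. The slides do not state these details, and all names are hypothetical.

```python
def is_representative(scores):
    """Perfect score (100) in all similarity measures."""
    return all(value == 100 for value in scores.values())


def http_same(pre_headers, post_headers):
    """True when ETag and Last-Modified are present and identical in both.
    Assumption: both headers must match; a missing header means 'not same'."""
    for name in ("ETag", "Last-Modified"):
        pre, post = pre_headers.get(name), post_headers.get(name)
        if pre is None or pre != post:
            return False
    return True


def agreement_rate(pairs):
    """Share (in percent) of HTTP-same pairs that also have perfect scores.
    pairs: iterable of (scores, pre_headers, post_headers)."""
    same = [s for s, pre, post in pairs if http_same(pre, post)]
    if not same:
        return None
    return 100.0 * sum(is_representative(s) for s in same) / len(same)


# Toy data, not from the study.
h1 = {"ETag": '"abc"', "Last-Modified": "Fri, 08 May 2009 10:00:00 GMT"}
h2 = {"ETag": '"xyz"', "Last-Modified": "Thu, 27 Aug 2009 10:00:00 GMT"}
perfect = {"simhash": 100, "jaccard": 100, "sorensen_dice": 100, "cosine": 100}
drifted = {"simhash": 91, "jaccard": 70, "sorensen_dice": 82, "cosine": 88}
toy_pairs = [(perfect, h1, h1), (drifted, h1, h1), (drifted, h1, h2)]
print(agreement_rate(toy_pairs))
```

On the study data, this kind of agreement rate is the figure reported as 98.88%.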
• ~ 313k referenced URIs have representative Mementos
Representative Mementos in arXiv
[Figure panels: arXiv, Elsevier, PMC]
Step 3: Dereference Live Web Version of URI
• 241k out of 313k URIs have a live web version
Step 4: Representative Memento vs. Live Version
• Apply content similarity measures
• Bin results into 6 clusters
Aggregate Similarity Score
Good: 23.7% of URIs have *not* drifted!
Bad: 3/4 URIs *have* drifted!
Content Drift & Link Rot Over Time - arXiv
[Figure panels: arXiv, Elsevier, PMC]
Take-Aways
1. Scholarly articles increasingly contain URI references to web at
large resources.
2. Such resources are subject to reference rot (link rot + content drift).
3. Custodians of these resources are typically not overly concerned
with archiving of their content and longevity of the scholarly record.
4. Spoiler: Authors, publishers, web archives, and other parties can
help tackle this problem (see my lightning talk + poster on Robust
Links).



Editor's Notes

  • #13: IceCube Neutrino Observatory at the University of Wisconsin http://icecube.wisc.edu
  • #14: Institute for Astronomy at the University of Hawaii http://www.ifa.hawaii.edu/~cowie/k_table.html
  • #21: Previously, archival status (14-day window) as proxy
  • #22: Previously, archival status (14-day window) as proxy