On the Persistence of Persistent Identifiers of the Scholarly Web
@mart1nkle1n
CNI Spring Virtual Meeting 2020
Martin Klein
Los Alamos National Laboratory
martinklein0815@gmail.com
For more background, details, results:
https://arxiv.org/abs/2004.03011
Regardless of location, phone used …
… when calling a well-known number that uniquely identifies a(n)
(emergency) resource …
… would you not expect to get the same response?
Do you trust in the persistence of that number (and the response)?
No more scary emergency scenarios!
• Phones == web clients
• Locations == network environments
• 911 calls == HTTP requests against DOIs
• Regardless of the web client and network location, would you
not expect the same response from a web server when
requesting the same DOI?
Idea…
• Comparative study investigating scholarly publishers’ responses
• To common HTTP requests
• Against DOIs
• Using different web clients and request methods, resembling
• Machines “browsing”, crawling
• Humans browsing
• From network environments with different subscriptions/licenses
• Amazon Web Services EC2 instance
• LANL internal
How does this work?
10.1007/978-3-540-87599-4_38
How does this (not) work?
10.1007/978-3-540-87599-4_38
How does this work via HTTP?
https://doi.org/10.1007/978-3-540-87599-4_38
What do you see?
https://doi.org/10.1007/978-3-540-87599-4_38
How does this work via HTTP?
https://doi.org/10.1007/978-3-540-87599-4_38
(HTTP redirect)
http://link.springer.com/10.1007/978-3-540-87599-4_38
(HTTP redirect)
https://link.springer.com/10.1007/978-3-540-87599-4_38
(HTTP redirect)
https://link.springer.com/chapter/10.1007%2F978-3-540-87599-4_38
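The redirect chain above can be traced hop by hop instead of letting the client follow it silently. The following is a minimal sketch using only the Python standard library; the helper names `fetch_once` and `redirect_chain`, the timeout, and the hop limit are illustrative assumptions, not the tooling used in the study.

```python
import http.client
from urllib.parse import urljoin, urlsplit


def fetch_once(url, method="GET", timeout=10.0):
    """Send one request without following redirects.

    Returns (status code, Location header or None).
    """
    parts = urlsplit(url)
    conn_cls = (http.client.HTTPSConnection if parts.scheme == "https"
                else http.client.HTTPConnection)
    conn = conn_cls(parts.netloc, timeout=timeout)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request(method, path)
        resp = conn.getresponse()
        return resp.status, resp.getheader("Location")
    finally:
        conn.close()


def redirect_chain(url, fetch=fetch_once, max_hops=20):
    """Follow 3xx responses one hop at a time; return [(url, status), ...]."""
    chain = []
    for _ in range(max_hops):
        status, location = fetch(url)
        chain.append((url, status))
        if not (300 <= status < 400 and location):
            break  # final link of the chain (or a redirect without a target)
        url = urljoin(url, location)  # Location may be a relative reference
    return chain
```

Calling `redirect_chain("https://doi.org/10.1007/978-3-540-87599-4_38")` needs network access, and the hops it returns may differ from those listed above, since publishers can change their redirects at any time. The `fetch` parameter allows the chain logic to be exercised with a stub instead of live requests.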
DOI dataset
• Gathering a representative sample is not trivial!
• Internet Archive conducts crawls of the scholarly domain
• June 2018: 93 million DOIs
• Obtained WARC files and extracted DOI redirect chain
• Investigate publisher distribution
• Take the final link of each redirect chain and extract the host, e.g.:
https://link.springer.com/chapter/10.1007%2F978-3-540-87599-4_38
Domain: springer.com
• Randomly pick 100 DOIs from the 100 most frequent domains
• 10,000 DOIs
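The sampling step can be sketched as follows. This is an illustration under stated assumptions: `final_urls` is a hypothetical mapping from DOI to the final URL of its redirect chain, and the domain is approximated by the last two labels of the host, which is a simplification (it gives the wrong answer for suffixes such as `co.uk`). The function names and the fixed seed are not from the study.

```python
import random
from collections import Counter, defaultdict
from urllib.parse import urlsplit


def domain_of(url):
    """Naive domain: last two labels of the host (simplifying assumption)."""
    host = urlsplit(url).hostname or ""
    return ".".join(host.split(".")[-2:])


def sample_dois(final_urls, top_n=100, per_domain=100, seed=0):
    """Pick up to `per_domain` random DOIs from each of the `top_n` most frequent domains.

    final_urls maps DOI -> final URL of its redirect chain (hypothetical input).
    """
    by_domain = defaultdict(list)
    for doi, url in final_urls.items():
        by_domain[domain_of(url)].append(doi)

    counts = Counter({d: len(dois) for d, dois in by_domain.items()})
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    sample = {}
    for domain, _ in counts.most_common(top_n):
        dois = by_domain[domain]
        sample[domain] = rng.sample(dois, min(per_domain, len(dois)))
    return sample
```

With the defaults, 100 domains times 100 DOIs gives the 10,000 DOIs mentioned above, provided every selected domain has at least 100 DOIs.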
Web clients and HTTP requests 1/4
• HEAD request
• Server responds with response headers
• *but no* response body
• Client: cURL
Web clients and HTTP requests 2/4
• GET request
• Server responds with response headers
• *and* response body
• Client: cURL
Web clients and HTTP requests 3/4
• GET+
• GET request with request headers
• User Agent (desktop Chrome browser)
• Specified connection timeout
• Specified maximum number of redirects
• Cookies accepted and stored
• Insecure connections allowed
• Client: cURL
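The three cURL-based methods (HEAD, GET, GET+) could be expressed as command lines roughly like the ones built below. This is a sketch, not the study's actual invocation: the flags are standard cURL options, but the User-Agent string, the timeout of 30 seconds, the limit of 20 redirects, the cookie file name, and the choice to follow redirects with `-L` for all three methods are placeholder assumptions.

```python
# Placeholder desktop Chrome User-Agent string (assumption, not from the slides).
CHROME_UA = ("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
             "(KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36")


def curl_command(method, url, cookie_jar="cookies.txt"):
    """Build a cURL argument list for the HEAD, GET, or GET+ method."""
    # -s: silent, -L: follow redirects, -o /dev/null: discard the output,
    # -w: print the status code of the last response in the chain.
    base = ["curl", "-s", "-L", "-o", "/dev/null", "-w", "%{http_code}\n"]
    if method == "HEAD":
        return base + ["--head", url]
    if method == "GET":
        return base + [url]
    if method == "GET+":
        return base + [
            "--user-agent", CHROME_UA,       # look like a desktop Chrome browser
            "--connect-timeout", "30",       # assumed value, in seconds
            "--max-redirs", "20",            # assumed value
            "--cookie", cookie_jar,          # send previously stored cookies
            "--cookie-jar", cookie_jar,      # store cookies received
            "--insecure",                    # allow insecure TLS connections
            url,
        ]
    raise ValueError(f"unknown method: {method}")
```

The returned list can be passed to `subprocess.run(..., capture_output=True, text=True)` on a machine with cURL installed; building the arguments separately keeps the three methods easy to compare side by side.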
Web clients and HTTP requests 4/4
• Chrome:
• GET request via Selenium WebDriver-controlled browser
• Client: Chrome
Regarding response headers, RFC 7231 states:
(highlights mine)
“The server SHOULD send the same header
fields in response to a HEAD request as it would
have sent if the request had been a GET...”.
https://tools.ietf.org/html/rfc7231
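What the RFC asks for can be checked mechanically: send HEAD and GET to the same resource and compare the header field names. The sketch below does this against a small local server built with the Python standard library, so it runs without network access. The handler `LandingPage` and the helpers `probe` and `compare_head_get` are illustrative names, and the server is a well-behaved stand-in, not a model of any publisher's platform.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

BODY = b"<html><body>landing page</body></html>"


class LandingPage(BaseHTTPRequestHandler):
    """Toy server that answers HEAD with the same header fields as GET."""

    def _send_headers(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(BODY)))
        self.end_headers()

    def do_GET(self):
        self._send_headers()
        self.wfile.write(BODY)

    def do_HEAD(self):
        self._send_headers()  # headers only, no response body

    def log_message(self, *args):
        pass  # keep the demo quiet


def probe(port, method):
    """Return (status, sorted lower-cased header names, body) for one request."""
    conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
    try:
        conn.request(method, "/")
        resp = conn.getresponse()
        body = resp.read()
        names = sorted(name.lower() for name, _ in resp.getheaders())
        return resp.status, names, body
    finally:
        conn.close()


def compare_head_get():
    """Start the toy server, probe it with HEAD and GET, and shut it down."""
    server = HTTPServer(("127.0.0.1", 0), LandingPage)  # port 0: any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        port = server.server_address[1]
        return probe(port, "HEAD"), probe(port, "GET")
    finally:
        server.shutdown()
        server.server_close()
```

For a server that follows the RFC's recommendation, the two header-name lists match and only the GET response carries a body. Pointing `probe`-style requests at real landing pages is where differences would show up.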
HTTP response codes
• 2xx
• Success
• 3xx
• Redirection
• 4xx
• Client error
• 5xx
• Server error
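Grouping individual status codes into these classes is a one-line computation. The sketch below also includes an `Err` bucket for requests that yield no usable status code (for example a timeout); treating `None` and out-of-range values that way is an assumption made for this illustration.

```python
from collections import Counter


def status_class(status):
    """Map an HTTP status code to '2xx', '3xx', '4xx', '5xx', or 'Err'.

    `None` (no response) and codes outside 200-599 fall into 'Err' (assumption).
    """
    if status is None or not 200 <= status <= 599:
        return "Err"
    return f"{status // 100}xx"


def tally(statuses):
    """Count how many responses fall into each class."""
    return Counter(status_class(s) for s in statuses)
```

For example, `tally([200, 302, 404, None])` counts one response each in `2xx`, `3xx`, `4xx`, and `Err`.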
Response codes of last link in redirect chain by DOI
[Figure: response code class (2xx, 3xx, 4xx, 5xx, Err) per DOI for each method (HEAD, GET, GET+, Chrome), 10,000 DOIs]
• < 50% of DOIs (48.3%) with successful requests across all methods
• > 40% 300-level responses w/ GET
• 25% of them 200-level w/ HEAD/Chrome
• 13% 400-level responses w/ HEAD
• 25% of them w/ 200-level response w/ any other method
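Figures of this kind can be derived from a per-DOI table of final status codes. The sketch below assumes a hypothetical `results` mapping of DOI to `{method: final status code or None}`; the function names and the `Err` convention are illustrative, and the toy data in the usage note is invented for the example, not taken from the study.

```python
METHODS = ("HEAD", "GET", "GET+", "Chrome")


def bucket(status):
    """'2xx'/'3xx'/'4xx'/'5xx' for valid codes, 'Err' otherwise (assumed convention)."""
    if isinstance(status, int) and 200 <= status <= 599:
        return f"{status // 100}xx"
    return "Err"


def share_all_successful(results):
    """Fraction of DOIs whose final response is 2xx for every method."""
    if not results:
        return 0.0
    ok = sum(all(bucket(r.get(m)) == "2xx" for m in METHODS)
             for r in results.values())
    return ok / len(results)


def conditional_share(results, given, then):
    """Among DOIs matching `given` = (method, bucket), the fraction matching `then`."""
    g_method, g_bucket = given
    t_method, t_bucket = then
    subset = [r for r in results.values()
              if bucket(r.get(g_method)) == g_bucket]
    if not subset:
        return 0.0
    hits = sum(bucket(r.get(t_method)) == t_bucket for r in subset)
    return hits / len(subset)
```

With invented data such as `{"a": {"HEAD": 200, "GET": 302, "GET+": 200, "Chrome": 200}}`, `conditional_share(results, ("GET", "3xx"), ("HEAD", "2xx"))` answers the question "of the DOIs that end in a 3xx with GET, how many end in a 2xx with HEAD?".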
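Response code comparison external vs internal network
[Figure: response code classes (2xx, 3xx, 4xx, 5xx, Err) for HEAD, GET, GET+, Chrome, one panel per network]
• External network: 48.3% of DOIs successful across all methods
• Internal network: 66.9% of DOIs successful across all methods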
Response code comparison OA vs non-OA
[Figure: response code classes (2xx, 3xx, 4xx, 5xx, Err) for HEAD, GET, GET+, Chrome, one panel per group]
• All 10,000 DOIs: 48.3%
• OA (973 DOIs): 59.5%
• non-OA (9,027 DOIs): 47.1%
Response code comparison SUB vs non-SUB
[Figure: response code classes (2xx, 3xx, 4xx, 5xx, Err) for HEAD, GET, GET+, Chrome, one panel per group]
• All 10,000 DOIs: 66.9%
• SUB (1,266 DOIs): 84.3%
• non-SUB (8,734 DOIs): 64.4%
Take-aways
• Frequently, scholarly publishers respond inconsistently to different
requests against the same DOI, depending on:
• HTTP client, request method, network environment
• Implications for (perceived) persistence of DOIs?
• Inconsistent DOI resolution does not build trust in DOIs
• Lack of adherence to standards does not build trust
• More work needed but initial findings seem to indicate:
• OA DOIs more consistent than non-OA DOIs
• DOIs for subscribed & licensed content show more consistency
• Implications for archival efforts?
• Test different combinations of clients/request methods/networks
• Pretend to be as human as possible
Photo credits
• https://unsplash.com/photos/SBiVq9eWEtQ
• https://unsplash.com/photos/9e9PD9blAto
• https://myescambia.com/our-services/public-safety/communications
• https://unsplash.com/photos/97HfVpyNR1M
• https://unsplash.com/photos/UYwjKbrwUos
• https://unsplash.com/photos/Se7vVKzYxTI
• https://unsplash.com/photos/_geAgtjqLzY
• https://unsplash.com/photos/r-enAOPw8Rs
• https://unsplash.com/photos/A4qmsfG6ywM
• https://unsplash.com/photos/eWqOgJ-lfiI
• https://unsplash.com/photos/goholCAVTRs
• https://unsplash.com/photos/vpXbwh6Qk9U
• https://unsplash.com/photos/K21Dn4OVxNw
• https://unsplash.com/photos/HzOclMmYryc
• https://unsplash.com/photos/OW5KP_Pj85Q
Thank you & stay safe!
