CrossRef Text & Data Mining - UKSG 2015

Rachael Lammey
Product Manager, CrossRef
UKSG 2015
CrossRef Text and Data Mining Services:
one year in

Not-for-profit association of scholarly publishers
All subjects, all business models
5,000+ organizations from all over the world
83 non-publisher affiliates, 2000 library affiliates
72 million + DOIs assigned to content items

User clicks on
CrossRef DOI
reference link
in Journal A
Tani, N., N. Tomaru, M. Araki, AND K. Ohba. 1996. Genetic diversity and
differentiation in populations of Japanese stone pine (Pinus pumila) in
Japan. Canadian Journal of Forest Research 26: 1454–1462.[CrossRef]
DOI
directory
returns URL
User accesses
cited article in
Journal B

A Text and Data Mining Hub for Researchers

What is Text and Data Mining
(TDM)?
Text Mining is an interdisciplinary field combining techniques
from linguistics, computer science and statistics to build
tools that can efficiently retrieve and extract information
from digital text.
http://blogs.plos.org/everyone/2013/04/17/announcing-the-plos-text-mining-collection/
It uses powerful computers to find links between drugs
and side effects, or genes and diseases, that are hidden
within the vast scientific literature. These are discoveries
that a person scouring through papers one by one may
never notice.
http://www.theguardian.com/science/2012/may/23/text-mining-research-tool-forbidden

Why?• Researchers find it impractical to
negotiate multiple bilateral agreements
with hundreds of subscription-based
publishers in order to authorise TDM of
subscribed content.
• Subscription-based publishers find it
impractical to negotiate multiple bilateral
agreements with thousands of
researchers and institutions in order to
authorise TDM of subscribed content.
• All parties would benefit from support of
standard APIs and data representations in
order to enable TDM across both open
access and subscription-based publishers.

CrossRef Text & Data Mining - UKSG 2015

Build Cross-Publisher
API for TDM

Access To Full Text
Problem: Researchers want to get full text
content from publishers’ sites for OA or
subscribed content. Solution:
Solution: Common API (protocol) for requesting
machine readable full text from many different
publishers

Negotiating Permissions
Problem: Researchers want to know whether text
and data mining is allowed, and if not, get
permission.
Solution: Licensing information embedded in article
metadata and a registry for supplemental text and
data mining terms and conditions (licenses).

Text and Data Mining Steps
• Define problem
• Identify potential corpus to mine
• Discovery (full text links)
• Identification of subset which can be
accessed (license information)
• Download identified corpus
• Text and data mine corpus

Publisher Participation
To enable their content for use by the service, publishers have
to provide CrossRef with two additional pieces of metadata:
• Full text URIs (to show where the full-text is located)
• License URIs (to show the Terms & Conditions under
which they can use it)
• Can implement rate limiting
CrossRef doesn’t charge publishers for participating in this
service.

Researcher Use
• The CrossRef REST API is the main aspect of this service
• It is designed to allow researchers to easily harvest full text
documents from all participating publishers regardless of their
business model (e.g. open access, subscription).
• It makes use of CrossRef DOI content negotiation to provide
researchers with links to the full text of content located on the
publisher’s site.
• The publisher remains responsible for actually delivering the full
text of the content requested
• CrossRef does not charge researchers for using the service

Publisher Metadata for CrossRef TDM:
Hindawi

Publisher Metadata for CrossRef TDM:
Elsevier

Researcher queries DOI using CN + API
token
Publisher verifies API token
If token verified AND access control allows,
publisher returns full text
(frequency at publisher discretion)

Benefits
• Streamlines researcher access to distributed full text for
TDM
• Enables machine-to-machine, automated access for
recognized TDM (i.e. researchers won’t be locked out of
publisher sites)
• Enables article-level licensing info and easy mechanism
for supplemental T&Cs for text and data mining
(publishers discussing model license via STM)

Publishers
Over 14 million articles with full-text links and license
information deposited

Usable as is:
https://blogs.nd.edu/emorgan/

http://tdmsupport.crossref.org/

www.crossref.org
http://www.crossref.org/tdm/index.html
tdm@crossref.org

How can researchers use
the service?
• Modify TDM tools to make use of the API token
• Modify TDM tools to look for <lic_ref> elements
• Register with the click-through service and
accept/decline licenses (if applicable)
• Details at: http://tdmsupport.crossref.org/researchers/

Using the DOI as the basis for a common text and data mining
API provides several benefits. For example, the DOI provides:
•An easy way to de-duplicate documents that may be found on
several sites.
•Persistent provenance information.
•An easy way to document, share and compare coropra without
having to exchange the actual documents
•A mechanism to ensure the reproducibility of TDM results using
the source documents.
•A mechanism to track the impact of updates, corrections
retractions and withdrawals on corpora.
Why use the DOI?

CrossRef Text & Data Mining - UKSG 2015

More Related Content

What's hot (20)

Viewers also liked (6)

Similar to CrossRef Text & Data Mining - UKSG 2015 (20)

More from Crossref (20)

Recently uploaded (20)

CrossRef Text & Data Mining - UKSG 2015

Editor's Notes