Text Data Mining: Unlocking the hidden potential from scholarly content.

TDM: Unlocking the hidden potential from scholarly content

Making sense of unstructured content
Until recently, text mining has mostly been restricted to post-publication PDFs and has proved slow and difficult. The focus for scholarly content has often been limited to metadata and abstracts.
TDM is evolving to extract a wealth of information that can support the entire scholarly community, from authors to publishers.

The research keeps growing
Published work and preprints
6% year-on-year (YoY) growth in manuscript submissions
42% of authors post their preprint before journal submission
300% increase in the number of preprint servers since 2015

Too many manuscripts. Not enough time.
Submission-to-publication time is expanding.
Screening: 48 hours
First review round: 13 weeks
Submission to publication: 400 days

The format challenge
XML is often made available for Open Access articles, but not all publishers make XML available to TDM services (API).
The rise of preprint servers, and of journals inviting article submission via these servers, increases the need to mine non-XML content.
Most authors still submit manuscripts to publishers and preprint servers in Word or PDF.
Some servers convert content into XML, but the majority of platforms only allow the preprint to be downloaded in the format in which it was uploaded.

Software used by authors
Word still the preferred format
Writing software used by authors submitting to bioRxiv.
Source: Sever et al. (2019) bioRxiv: the preprint server for biology. https://dx.doi.org/10.1101/833400

Extracting structured content from any document
Dixon WG, Beukenhorst AL, Yimer BB et al. 2019. doi:10.1038/s41746-019-0180-3
Content extracted to a structured format

Distilling research into headlines and key information
Rosyadi S, Haryanto A. 2019. doi:10.31124/advance.9989639.v1
Distillation to unified format

TDM: What are the opportunities?
TDM can work at any stage of the publishing process, opening up a huge number of opportunities from manuscript drafting and screening to promoting the published article.
Stages: Manuscript submission, Manuscript screening, Peer review, Promotion

Automating the submissions process
• Metadata extraction to automate population of the submissions system (title, authors, affiliations, abstract, keywords).
• Reduces author friction / duplication of effort.
• Previous work in this area has focused on the biomedical domain, but this opportunity can apply to any domain.
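
The metadata extraction described on this slide can be illustrated with a small sketch. It is not the pipeline behind the talk: it assumes a plain-text manuscript whose first line is the title, whose second line lists the authors, and which marks its abstract and keywords with "Abstract" and "Keywords:" labels. The `SubmissionMetadata` class, the `extract_metadata` function and the sample manuscript are all hypothetical names and data introduced for illustration.

```python
from __future__ import annotations

import re
from dataclasses import dataclass, field


@dataclass
class SubmissionMetadata:
    """Fields a submissions system typically asks authors to retype."""
    title: str = ""
    authors: list[str] = field(default_factory=list)
    affiliations: list[str] = field(default_factory=list)
    abstract: str = ""
    keywords: list[str] = field(default_factory=list)


def extract_metadata(text: str) -> SubmissionMetadata:
    """Heuristic front-matter parser for the assumed layout described above."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    meta = SubmissionMetadata()
    if not lines:
        return meta

    # Assumption: line 1 is the title, line 2 is the author list.
    meta.title = lines[0]
    if len(lines) > 1:
        meta.authors = [a.strip() for a in re.split(r",|\band\b", lines[1]) if a.strip()]

    # Everything up to the "Abstract"/"Keywords" label is treated as affiliations.
    i = 2
    while i < len(lines) and not re.match(r"(?i)^(abstract|keywords?)\b", lines[i]):
        meta.affiliations.append(lines[i])
        i += 1

    abstract_parts: list[str] = []
    in_abstract = False
    for ln in lines[i:]:
        kw = re.match(r"(?i)^keywords?\s*:\s*(.*)$", ln)
        ab = re.match(r"(?i)^abstract\b\s*:?\s*(.*)$", ln)
        if kw:
            # Keywords end the front matter in this sketch.
            meta.keywords = [k.strip() for k in re.split(r"[;,]", kw.group(1)) if k.strip()]
            break
        if ab:
            in_abstract = True
            if ab.group(1):
                abstract_parts.append(ab.group(1))
        elif in_abstract:
            abstract_parts.append(ln)
    meta.abstract = " ".join(abstract_parts)
    return meta


# Hypothetical manuscript front matter, invented for this example.
SAMPLE = """Mining preprints at scale
Jane Doe, John Smith and Ann Lee
University of Examples, Department of Testing
Abstract: We describe a pipeline.
It handles Word and PDF input.
Keywords: text mining; preprints; metadata
"""

if __name__ == "__main__":
    print(extract_metadata(SAMPLE))
```

Real manuscripts vary far more than this layout allows, so a production system would more likely rely on a trained model than on fixed labels. The sketch only shows the shape of the output a submissions form could be pre-filled from.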

Speeding up peer review
• Data extraction for manuscript screening (key methods, results, sample size, participants, ethical compliance etc.)
• Clear article context/overview for reviewers.
• One-click access to cited sources & main findings.
• Table extraction for analysis of statistical calculations.
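
As a rough illustration of data extraction for screening, the sketch below pulls reported sample sizes and ethics-related phrases out of manuscript text with regular expressions. The patterns, the `screen_manuscript` function and the example sentence are assumptions made for this sketch. They are far simpler than what a real screening tool would need and are not taken from the talk.

```python
import re

# Matches "n = 1,250" or "N=48"; thousands separators are allowed.
SAMPLE_SIZE_RE = re.compile(r"\b[nN]\s*=\s*(\d[\d,]*)")

# A deliberately small list of ethics-related phrases (illustrative only).
ETHICS_RE = re.compile(
    r"\b(?:ethic(?:s|al)\s+(?:approval|committee|review)"
    r"|institutional review board"
    r"|informed consent)\b",
    re.IGNORECASE,
)


def screen_manuscript(text: str) -> dict:
    """Return sample sizes and ethics mentions found in the text."""
    sizes = [int(raw.replace(",", "")) for raw in SAMPLE_SIZE_RE.findall(text)]
    ethics = sorted({m.group(0).lower() for m in ETHICS_RE.finditer(text)})
    return {
        "sample_sizes": sizes,
        "ethics_mentions": ethics,
        "has_ethics_statement": bool(ethics),
    }


# Hypothetical methods paragraph, invented for this example.
EXAMPLE_TEXT = (
    "We recruited participants (n = 1,250) across two sites; a pilot group "
    "(N=48) was excluded. Ethical approval was granted by the university "
    "and informed consent was obtained."
)

if __name__ == "__main__":
    print(screen_manuscript(EXAMPLE_TEXT))
```

A flag such as `has_ethics_statement` is best treated as a prompt for an editor to look more closely, since a phrase match cannot confirm actual compliance.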

Surfacing cited sources & their main findings
Krohn L, Ruskey JA, Rudakou U et al. 2019. doi:10.1101/19010991
Cited sources and their main findings surfaced

Exposing more content through citation networks
• Extract, parse and link citations from archives dating back hundreds of years.
• Large-scale reference population of open citation networks (BMJ case study).
• Improve exposure/discovery of older research.
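
The extract, parse and link steps can be sketched for the simplest case, where references already carry a DOI. The example below finds DOIs in reference text, builds a citing-to-cited map, and inverts it to show which articles cite a given work. The function names and the `10.5555/example.*` DOIs are hypothetical. Older archive references usually lack DOIs and would need fuzzy matching against a bibliographic database, which this sketch does not attempt.

```python
import re
from collections import defaultdict

# A common loose pattern for DOIs; it will not catch every valid DOI.
DOI_RE = re.compile(r'\b10\.\d{4,9}/[^\s"<>]+')


def extract_dois(reference_text: str) -> list:
    """Return unique DOIs in order of appearance, lower-cased."""
    dois = []
    for match in DOI_RE.findall(reference_text):
        doi = match.rstrip(".,;)").lower()  # drop trailing punctuation
        if doi not in dois:
            dois.append(doi)
    return dois


def build_citation_network(articles: dict) -> dict:
    """Map each citing DOI to the DOIs found in its reference list."""
    return {citing.lower(): extract_dois(refs) for citing, refs in articles.items()}


def cited_by(network: dict) -> dict:
    """Invert the network: for each cited DOI, list the articles citing it."""
    inverse = defaultdict(list)
    for citing, cited in network.items():
        for doi in cited:
            inverse[doi].append(citing)
    return dict(inverse)


# Hypothetical articles and reference lists, invented for this example.
ARTICLES = {
    "10.5555/example.1": (
        "Doe J. 2019. doi:10.5555/example.2. "
        "Smith A. 2018. https://doi.org/10.5555/example.3"
    ),
    "10.5555/example.2": "Smith A. 2018. doi:10.5555/example.3.",
}

if __name__ == "__main__":
    network = build_citation_network(ARTICLES)
    print(network)
    print(cited_by(network))
```

The inverted map is what makes older work discoverable: an article with no landing-page traffic still surfaces once newer articles that cite it are linked to it.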

How publishers can help.
1. Make XML, rather than just the final PDF, available for text mining for all Open Access articles.
2. Enrich citation networks with additional content (e.g. abstract, highlights) in a machine-readable format.
3. Make all cited sources more easily verifiable for authors and researchers.
4. Convert articles and preprints into a universally structured format for more effective TDM. Allow authors to write articles natively in a machine-readable format.

And finally…
…equal rights for friendly bots!
