1
Text Data Mining (TDM): Unlocking the hidden potential from scholarly content
2
Making sense of unstructured content
Until recently, text mining has mostly been restricted to post-publication PDFs and has proved slow and difficult. The focus for scholarly content has often been limited to metadata and abstracts.
TDM is evolving to extract a wealth of information that can support the entire scholarly community – from authors to publishers.
3
Landscape
4
The research keeps growing
Published work and preprints
6% year-on-year growth in manuscript submissions
42% of authors post their preprint before journal submission
300% increase in the number of preprint servers since 2015
5
Too many manuscripts. Not enough time.
Submission-to-publication time is expanding.
Screening: 48 hours
First review round: 13 weeks
Submission to publication: 400 days
6
The format challenge
XML is often made available for Open Access articles, but not all publishers make XML available to TDM services (via API).
The rise of preprint servers, and of journals inviting article submission via these servers, increases the need to mine non-XML content.
Most authors still submit manuscripts to publishers and preprint servers in Word or PDF.
Some servers convert content into XML, but the majority of platforms only allow the preprint to be downloaded in the format in which it was uploaded.
7
Software used by authors
Word still the preferred format
Writing software used by authors submitting to bioRxiv.
Source: Sever et al. (2019) bioRxiv: the preprint server for biology. https://dx.doi.org/10.1101/833400
8
Format shouldn’t matter
9
Extracting structured content from any document
Dixon WG, Beukenhorst AL, Yimer BB et al. 2019. doi:10.1038/s41746-019-0180-3
Content extracted to a structured format
10
Distilling research into headlines and key information
Rosyadi S, Haryanto A. 2019. doi:10.31124/advance.9989639.v1
Distillation to unified format
11
Opportunities
12
TDM: What are the opportunities?
TDM can work at any stage of the publishing process, opening up a huge number of opportunities from manuscript drafting and screening to promoting the published article.
Manuscript submission
Manuscript screening
Peer review
Promotion
13
Automating the submission process
• Metadata extraction to automate population of the submission system (title, authors, affiliations, abstract, keywords).
• Reduces author friction and duplication of effort.
• Previous work in this area has focused on the biomedical domain, but this opportunity can apply to any domain.
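As a rough illustration of the metadata extraction described on this slide, the sketch below pulls a title, author list, abstract and keywords out of a plain-text manuscript. It is a minimal heuristic, not the method used in the presentation: the `SubmissionMetadata` class and `extract_metadata` function are hypothetical names, and the layout assumptions (title on the first non-empty line, authors on the second, labelled "Abstract" and "Keywords" sections) will not hold for every manuscript.

```python
import re
from dataclasses import dataclass, field


@dataclass
class SubmissionMetadata:
    """Fields a submission system typically asks authors to retype (hypothetical schema)."""
    title: str = ""
    authors: list = field(default_factory=list)
    abstract: str = ""
    keywords: list = field(default_factory=list)


def extract_metadata(text: str) -> SubmissionMetadata:
    """Heuristic extraction. Assumes: title on the first non-empty line,
    authors on the second, and labelled 'Abstract' / 'Keywords' sections."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    meta = SubmissionMetadata()
    if lines:
        meta.title = lines[0]
    if len(lines) > 1:
        # Split the author line on commas and the word "and".
        parts = re.split(r",|\band\b", lines[1])
        meta.authors = [p.strip() for p in parts if p.strip()]
    # Abstract: everything after the "Abstract" label up to the "Keywords" label.
    abstract = re.search(r"(?ims)^abstract[:\s]+(.*?)(?=^keywords\b|\Z)", text)
    if abstract:
        meta.abstract = " ".join(abstract.group(1).split())
    # Keywords: the remainder of the line that starts with "Keywords".
    keywords = re.search(r"(?im)^keywords[:\s]+(.+)$", text)
    if keywords:
        parts = re.split(r"[;,]", keywords.group(1))
        meta.keywords = [p.strip() for p in parts if p.strip()]
    return meta


if __name__ == "__main__":
    manuscript = (
        "Mining preprints at scale\n"
        "Ada Example, Ben Sample and Cy Test\n"
        "Abstract: We study text mining\nof preprints.\n"
        "Keywords: text mining; preprints; XML\n"
        "Introduction\nBody text.\n"
    )
    print(extract_metadata(manuscript))
```

In practice the input would first have to be converted from Word or PDF to text, and a production system would likely replace these regular expressions with a trained model; the point here is only that the extracted fields map directly onto the form fields an author would otherwise fill in by hand.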
14
Speeding up peer review
• Data extraction for manuscript screening (key methods, results, sample size, participants, ethical compliance, etc.).
• Clear article context and overview for reviewers.
• One-click access to cited sources and their main findings.
• Table extraction for analysis of statistical calculations.
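The screening idea above can be sketched with a few pattern matches. The example below looks for reported sample sizes and ethics-related statements in manuscript text. The function name `screen_manuscript`, the patterns and the output keys are all assumptions made for illustration; real screening tools would need far broader coverage than two regular expressions.

```python
import re

# Matches reported sample sizes such as "n = 1,204" or "(N=52)".
SAMPLE_SIZE = re.compile(r"\b[nN]\s*=\s*(\d[\d,]*)")

# A small, illustrative set of ethics-related phrases (far from exhaustive).
ETHICS = re.compile(
    r"\b(ethic(?:s|al)\s+(?:approval|committee|review)"
    r"|institutional review board|informed consent)\b",
    re.IGNORECASE,
)


def screen_manuscript(text: str) -> dict:
    """Return simple screening signals for an editor or reviewer."""
    sizes = [int(s.replace(",", "")) for s in SAMPLE_SIZE.findall(text)]
    ethics = sorted({m.lower() for m in ETHICS.findall(text)})
    return {
        "sample_sizes": sizes,
        "ethics_mentions": ethics,
        # Flag manuscripts with no recognisable ethics statement for a manual check.
        "needs_ethics_check": not ethics,
    }


if __name__ == "__main__":
    methods = (
        "We recruited participants (n = 1,204) and a pilot group (N=52). "
        "Ethical approval was granted and informed consent obtained."
    )
    print(screen_manuscript(methods))
```

A missing match does not mean the manuscript lacks an ethics statement, only that none of the listed phrases appeared, which is why the sketch flags such cases for a human check rather than rejecting them.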
15
Surfacing cited sources & their main findings
Krohn L, Ruskey JA, Rudakou U et al. 2019. doi:10.1101/19010991
Cited sources and their main findings surfaced
16
Exposing more content through citation networks
• Extract, parse and link citations from archives dating back hundreds of years.
• Large-scale reference population of open citation networks (BMJ case study).
• Improve exposure and discovery of older research.
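To make "extract, parse and link" concrete, the sketch below parses reference strings in the style quoted elsewhere in these slides and turns them into citation links. The names `parse_reference` and `build_citation_links`, and the identifier `example-article`, are hypothetical; the parsing assumes each reference starts with the first author and may carry a DOI, which older archive material often will not.

```python
import re

# DOI pattern: "10." + registrant code + "/" + suffix (no whitespace).
DOI_PATTERN = re.compile(r"\b10\.\d{4,9}/[^\s\"<>]+")
# Publication years from 1500 to 2099.
YEAR_PATTERN = re.compile(r"\b(1[5-9]\d{2}|20\d{2})\b")


def parse_reference(ref: str) -> dict:
    """Parse one reference string into first author, year and DOI."""
    doi_match = DOI_PATTERN.search(ref)
    doi = doi_match.group(0).rstrip(".,;)") if doi_match else None
    # Look for the year only in the text before the DOI, so digits
    # inside the DOI are never mistaken for a publication year.
    remainder = ref[: doi_match.start()] if doi_match else ref
    year_match = YEAR_PATTERN.search(remainder)
    return {
        "first_author": ref.split(",")[0].strip(),
        "year": int(year_match.group(1)) if year_match else None,
        "doi": doi,
    }


def build_citation_links(citing_id: str, references: list) -> list:
    """Return (citing, cited DOI) pairs for every reference with a DOI."""
    links = []
    for ref in references:
        parsed = parse_reference(ref)
        if parsed["doi"]:
            links.append((citing_id, parsed["doi"]))
    return links


if __name__ == "__main__":
    refs = [
        "Dixon WG, Beukenhorst AL, Yimer BB et al. 2019. doi:10.1038/s41746-019-0180-3",
        "Krohn L, Ruskey JA, Rudakou U et al. 2019. doi:10.1101/19010991",
    ]
    for link in build_citation_links("example-article", refs):
        print(link)
```

References without a DOI are parsed but not linked here; matching them against a bibliographic database would be the harder, and for historical archives the more important, part of the task.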
17
What’s needed?
18
How publishers can help
1. Make XML, rather than just the final PDF, available for text mining of all Open Access articles.
2. Enrich citation networks with additional content (e.g. abstract, highlights) in a machine-readable format.
3. Make all cited sources more easily verifiable for authors and researchers.
4. Convert articles and preprints into a universally structured format for more effective TDM. Allow authors to write articles natively in a machine-readable format.
19
And finally…
…equal rights for friendly bots!
