Improving Text Mining Results with
Access to Full-Text Scientific Articles
Mike Iarrobino
Product Manager, CCC
Introduction
Mike Iarrobino
Product Manager
RightFind™ XML for Mining
Copyright Clearance Center
Making Copyright Work – CCC and RightsDirect
• Licensing Solutions
• Rights Management
• Content Delivery
• Copyright Education
Rightsholders
950+ million rights from:
• Publishers
• Authors
• Agents
• Creators
Content Users
• 35,000 companies
• Workers worldwide
• 1,200 colleges and universities
• Publishers and Authors
CCC and Text Mining
Rightsholders Content Users
• Servicing many text mining license and content requests
• Managing text mining feeds
• Negotiating text mining rights with multiple publishers
“Text mining” is the process of
deriving high-quality information
from text materials using software.
Text Mining Non-Patent Literature
• Mining limited to abstracts
• High cost to obtain formatted full-text content and permission from multiple publishers
• Multiple formats
• Researchers can’t mine content to which they are not subscribed
What is the Benefit of Full Text?
Volume Timeliness Quality
Catherine Blake. “Beyond genes, proteins, and abstracts: Identifying scientific claims from full-text biomedical articles.” Journal of Biomedical Informatics, Volume 43, Issue 2, April 2010, Pages 173–189
Elsevier (2015). Harnessing the Power of Content - Extracting value from scientific literature: the power of mining full-text articles for pathway analysis. Available at www.elsevier.com/__data/assets/pdf_file/0016/83005/R_D-Solutions_Harnessing-Power-of-Content_DIGITAL.pdf
Enrique Bernal-Delgado and Elliot S Fisher. “Abstracts in high profile journals often fail to report harm.” BMC Medical Research Methodology (2008); 8:14
Volume and Recall
December 2015
(Abstract: "tau hyperphosphorylation" AND Abstract: kinase OR (GSK3β OR (CDK5 OR (MAPK1 OR (MARK1 OR (MARK2 OR (MARK3 OR MARK4))))))) AND (Abstract: alzheimer OR alzheimer's)

content:"tau hyperphosphorylation kinase"~25 OR "tau hyperphosphorylation GSK3β"~25 OR "tau hyperphosphorylation CDK5"~25 OR "tau hyperphosphorylation MAPK1"~25 OR "tau hyperphosphorylation MARK1"~25 OR "tau hyperphosphorylation MARK2"~25 OR "tau hyperphosphorylation MARK3"~25 OR "tau hyperphosphorylation MARK4"~25
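The full-text query repeats one proximity clause per kinase term. A minimal sketch of how such a query string could be generated is shown here, assuming the Lucene-style `"..."~N` proximity syntax that appears on the slide. The function name and its parameters are illustrative and not part of any CCC tool.

```python
def build_proximity_query(anchor, terms, slop=25, field="content"):
    """Build one proximity clause per term and join the clauses with OR."""
    clauses = [f'"{anchor} {term}"~{slop}' for term in terms]
    return f"{field}:" + " OR ".join(clauses)


# Kinase terms taken from the slide's query.
kinases = ["kinase", "GSK3β", "CDK5", "MAPK1", "MARK1", "MARK2", "MARK3", "MARK4"]
query = build_proximity_query("tau hyperphosphorylation", kinases)
print(query)
```

Keeping the term list separate from the query template makes it easier to add or drop a kinase without hand-editing a long boolean string.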
Volume and Recall - Results
[Bar chart: number of articles (y-axis, 0 to 800) retrieved from abstracts vs. full text for two queries, BTK and tau hyperphosphorylation]
Text Mining Today – Example Workflow
Search
Get permission
Download PDFs
Convert PDFs
Import into text mining software
• Perform search
• Obtain permission from publishers to mine full text for commercial use
• Requires automated tool or custom software to download in bulk
• Requires text mining permission from multiple publishers
• Requires content storage and feed management
• PDF is converted to a “blob of text”
• No tags
• Loss of metadata
• Low fidelity of content
• References induce noise
• Requires structuring text into XML
• Article text does not have “fields”
• Combining content from multiple sources takes time to normalize the metadata
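To illustrate why structured XML helps with the "blob of text" problems listed above, here is a minimal sketch that reads title, abstract, and body as separate fields and leaves the reference list out. The sample document is a hypothetical, heavily simplified JATS-like article, and the field names are assumptions; real publisher XML varies and is considerably richer.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified JATS-like article used only for illustration.
SAMPLE = """<article>
  <front>
    <article-title>Tau kinases in Alzheimer's disease</article-title>
    <abstract><p>Short summary.</p></abstract>
  </front>
  <body>
    <sec><title>Results</title><p>GSK3β drives tau hyperphosphorylation.</p></sec>
  </body>
  <back>
    <ref-list><ref>Unrelated cited title mentioning BTK.</ref></ref-list>
  </back>
</article>"""


def extract_fields(xml_text):
    """Return title, abstract, and body text as separate fields, skipping references."""
    root = ET.fromstring(xml_text)

    def text_of(path):
        node = root.find(path)
        if node is None:
            return ""
        # Collect all nested text and collapse runs of whitespace.
        return " ".join(" ".join(node.itertext()).split())

    return {
        "title": text_of("front/article-title"),
        "abstract": text_of("front/abstract"),
        "body": text_of("body"),
    }


fields = extract_fields(SAMPLE)
print(fields["body"])
```

Because the reference list sits in its own element, a term that only appears in a cited title (BTK in this sample) never reaches the body field, which is one way the noise from references can be avoided.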
TEXT MINING TOOLS
Run queries
View results
MANUAL WORK
Typically takes 4-8 weeks
CCC’s RightFind™ XML for Mining Service
Build a corpus of full-text articles in XML format for mining
Text Mining Software
CCC’s Text Mining Service
XML for Mining
• Rapid inventory growth
• MEDLINE abstract corpus
• Purchase non-subscribed articles with cost optimization process
• MeSH article tagging and flat synonym list
Market Observations and Future Vision
ACCESS
AUTOMATION
Thank you!
Mike Iarrobino
Product Manager, CCC
+1.978.646.2633
miarrobino@copyright.com

II-SDV 2016 Michael Iarrobino - Improving Text Mining Results with Access to Full-Text Scientific Articles
