Text Mining - as Normal as Data Mining?
Andrew Hinton, Application Specialist
II-SDV 2016, Tuesday 19th April 2016, Nice
Agenda
Introduction to text mining
The challenge
Applications of specialised normalization solutions
− Maximising Source Normalization
− EASL (Extraction and Search Language)
  − Allows programmatic access to unstructured data, similar to SQL over structured data.
− Numeric Normalization & Range search
  − Capturing weights between 60 and 80kg, whether expressed in kilograms or pounds, for patient selection from EHRs.
− Gene Mutation Normalization
  − Use case where gene mutations have been linked to rare disease progression.
© 2016 Linguamatics Ltd
Answers to Our Questions are in Free Text
80% of information at companies is in free text
Most of the answers to our questions are there
Ever-increasing amounts of text data to examine
[Chart: PubMed Records, axis from 0 to 25,000,000]
− Different kinds of documents
  − External literature, patents, EHRs, internal reports, blogs, presentations
− Different formats
  − HTML, PDF, XML, Word, PPT, Wiki, TXT, HL7
Keyword Searching
OLED
Documents, Web Pages, Folders
All these documents contain the keyword ‘Additive’.
Read ALL of each document to find the bit relevant to you.
Linguamatics in Healthcare
Electronic Health Record
Enterprise Data Warehouse
Pathology, radiology, initial assessment, discharge, check up
Structured data
Clinical Risk Monitor
Patient characteristics
Patient lists
ClinicalTrials.gov
Patient characteristics
Matching Clinical trials
Patient Narrative
Semantic search tags
Semantic Enrichment
Clinical case histories and/or genomic interpretation
Patient characteristics
Scientific literature
I2E Transforms Text into Actionable Insights
Turn text into structured data using sophisticated queries
Accurate results: only retrieves relevant results
Complete results: comprehensive and systematic
To drive analytics
Diagram labels: Analytics, Enterprise Warehouse
Search vs. Text Mining
Search Engine
− Filter to find most relevant documents, then read
Text Mining
− Natural Language Processing (NLP): understand meaning
− Use of ontologies and clustered results
− Efficient review, without reading every document
Sources: News Feeds, Literature, Patents, Internal Reports, Social Media
Challenges in Unstructured Data
Different word, same meaning
− cyclosporine
− ciclosporin
− Neoral
− Sandimmune
Different expression, same meaning
− Non-smoker
− Does not smoke
− Does not drink or smoke
− Denies tobacco use
Different grammar, same meaning
− 5mg/kg of cyclosporine per day
− 5mg/kg per day of cyclosporine
− cyclosporine 5mg/kg per day
Same word, different context
− Diagnosed with diabetes
− Family history of diabetes
− No family history of diabetes
NLP
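The four challenges above can be illustrated with a small sketch. This is not how I2E works internally; the dictionaries, patterns and function names below are hypothetical and only cover the phrases shown on the slide.

```python
import re

# Hypothetical mini-terminology: surface form -> preferred concept name.
SYNONYMS = {
    "cyclosporine": "cyclosporine",
    "ciclosporin": "cyclosporine",
    "neoral": "cyclosporine",
    "sandimmune": "cyclosporine",
}

# Hypothetical phrasings that all express "non-smoker".
NON_SMOKER_PATTERNS = [
    r"\bnon-smoker\b",
    r"\bdoes not (?:drink or )?smoke\b",
    r"\bdenies tobacco use\b",
]


def normalize_drug(text):
    """Return the set of preferred drug concepts mentioned in the text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return {SYNONYMS[t] for t in tokens if t in SYNONYMS}


def is_non_smoker(text):
    """True if the text matches one of the known non-smoker phrasings."""
    lowered = text.lower()
    return any(re.search(p, lowered) for p in NON_SMOKER_PATTERNS)


def diabetes_context(text):
    """Classify a diabetes mention as 'patient', 'family' or 'negated_family'."""
    lowered = text.lower()
    if "diabetes" not in lowered:
        return None
    if re.search(r"\bno family history of diabetes\b", lowered):
        return "negated_family"
    if re.search(r"\bfamily history of diabetes\b", lowered):
        return "family"
    return "patient"
```

A real system would rely on curated ontologies and linguistic analysis rather than a handful of hard-coded patterns; the sketch only shows why keyword matching alone is not enough.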
Linguistic Processing Using NLP
Interprets meaning of the text
Groups words into meaningful units
Search for different forms of words
We find that p42mapk phosphorylates c-Myb on serine and threonine.
Purified recombinant p42 MAPK was found to phosphorylate Wee1.
− sentences
− morphology: different forms
− noun groups: match entities
− verb groups: match actions
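As a rough illustration of matching different word forms, the sketch below pulls the same relation out of the two example sentences with a single regular expression. The pattern and the normalized output form are assumptions made for this example; a real NLP pipeline would use proper morphology and noun/verb grouping.

```python
import re

SENTENCES = [
    "We find that p42mapk phosphorylates c-Myb on serine and threonine.",
    "Purified recombinant p42 MAPK was found to phosphorylate Wee1.",
]

# One pattern covers spelling variants of the agent ("p42mapk", "p42 MAPK")
# and several inflected forms of the verb "phosphorylate".
PATTERN = re.compile(
    r"(?P<agent>p42\s?mapk)\b.*?\bphosphorylat(?:e|es|ed|ing)\b\s+(?P<target>[\w-]+)",
    re.IGNORECASE,
)


def extract_relations(sentences):
    """Return (agent, verb lemma, target) triples found in the sentences."""
    triples = []
    for sentence in sentences:
        match = PATTERN.search(sentence)
        if match:
            # Collapse whitespace and case so both agent spellings agree.
            agent = re.sub(r"\s+", "", match.group("agent")).lower()
            triples.append((agent, "phosphorylate", match.group("target")))
    return triples


relations = extract_relations(SENTENCES)
```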
From Words to Meaning
“Among them, nimesulide, a selective COX2 inhibitor, …”
Identifying entities and relations: nimesulide inhibits COX2 (Entrez Gene ID: 5743)
Linguistics to establish relationships
Text Mining - as Normal as Data Mining?
CHALLENGE
How can we capture information from free text as conveniently as accessing a database?
One of the essential differences is the lack of normalization of terms and concepts in free text.
SOLUTION
NLP-based text mining provides the capability to look through unstructured text, normalizing:
• Keywords to concepts
• Numerical data
• Range Search
• Gene Mutations
• Content source
BENEFIT
A set of structured facts, relationships or assertions, from different data sources, that can be used for decision support.
Providing tabular or visual analytics to fill data warehouses and support better patient care.
Sources: Literature, Patents, Reports, Clinical Trials
Examples of Normalization
Content Source Normalization
I2E: A Fully Federated Text Mining Platform
Merge into a single set of results
Content Server 1, Content Server 2, Content Server 3, Content Server 4
Federated Architecture
Normalizing Data from Different Sources
Single query
Differently structured data sources on different servers
− Journal articles (PubMed Central) on local Enterprise Server
− MEDLINE on remote cloud server
Single set of results
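The idea of one query returning a single merged result set from several servers can be sketched as follows. The server names, document identifiers and the `merge_results` function are hypothetical; the sketch assumes each server has already returned its hits as (normalized fact, document id) pairs.

```python
from collections import defaultdict


def merge_results(per_server_hits):
    """Merge hits from several servers into one list keyed by normalized fact.

    per_server_hits maps a server name to a list of (fact, doc_id) tuples.
    """
    merged = defaultdict(set)
    for server, hits in per_server_hits.items():
        for fact, doc_id in hits:
            merged[fact].add((server, doc_id))
    # Most frequently supported facts first; ties are broken alphabetically.
    return sorted(
        ((fact, sorted(docs)) for fact, docs in merged.items()),
        key=lambda item: (-len(item[1]), item[0]),
    )


hits = {
    "local_pmc": [
        ("nimesulide inhibits COX2", "PMC1"),
        ("aspirin inhibits COX1", "PMC2"),
    ],
    "remote_medline": [("nimesulide inhibits COX2", "12345")],
}
result = merge_results(hits)
```

Because the facts are normalized before merging, the same assertion found on two differently structured sources collapses into one row with two supporting documents.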
Using EASL
EASL: Extraction And Search Language
Representing a Query in EASL
EASL Example
query:
  document:
  - phrase:
    - class: {snid: nci.C1909, pt: Pharmacologic Substance}
    - treat
    - class: {snid: nlm.C04.588.180, pt: Breast Neoplasms}
output:
  outputSettings: {documentsPerAssertion: -1, hitsPerDocPerAssertion: 10, outputOrdering: frequency, resultType: standard}
Benefits of EASL
Automation
− Richer language for WSAPI applications
− Can build a completely new query vs. adapting smart query parameters
− Allows on-the-fly query production
Re-use
− Save, share and compare components of queries, e.g.
  − Save out Alternatives
  − Load complex expressions in smart query parameters
Audit
− Human readable language for documenting the text mining strategy
− Using open mark-up language (YAML)
Conversion
− Enable scripts to convert from other query languages e.g. advanced search
Different interfaces
− Enables 3rd party applications to create I2E queries
− Developers can produce innovative specialized interfaces, e.g. advanced search plus terminologies
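On-the-fly query production could look something like the sketch below, which assembles a query shaped like the EASL example as plain Python data and serializes it. The nesting of keys is an assumption about the example's structure, and `build_treatment_query` is a hypothetical helper, not part of any Linguamatics API. The output is JSON, which YAML 1.2 parsers generally accept; whether I2E accepts this exact serialization is not confirmed by the slides.

```python
import json


def build_treatment_query(drug_class, disease_class, hits_per_doc=10):
    """Assemble an EASL-like '<drug> treat <disease>' query as nested data."""
    return {
        "query": {
            "document": [
                {
                    "phrase": [
                        {"class": drug_class},
                        "treat",
                        {"class": disease_class},
                    ]
                }
            ]
        },
        "output": {
            "outputSettings": {
                "documentsPerAssertion": -1,
                "hitsPerDocPerAssertion": hits_per_doc,
                "outputOrdering": "frequency",
                "resultType": "standard",
            }
        },
    }


query = build_treatment_query(
    {"snid": "nci.C1909", "pt": "Pharmacologic Substance"},
    {"snid": "nlm.C04.588.180", "pt": "Breast Neoplasms"},
)
text = json.dumps(query, indent=2)
```

Keeping the query as data rather than as a string is what makes the re-use and audit points practical: components can be saved, compared and swapped before serialization.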
EASL: Enhancing the Value of Federated Search
Merge into a single set of results
Content Server 1, Content Server 2, Content Server 3, Content Server 4
Federated Architecture
translate2easl
Espacenet query, translated by espacenet2easl
PubMed query, translated by pubmed2easl
EASL: keywords + index terms
Refine query
EASL: terminologies, linguistics …
Content sources: Clinical Trials, OMIM, FDA Drug Labels, Patents, NIH Grants, MEDLINE
Range Search and Normalization
What Do We Want to Find?
Patients
− below 60 years old
− weight ≥ 80kg
− not having chemotherapy after 2010
− with a mutation C677T
Challenge: Variety Within the Text
Below 60 years old
− aged 58
− 35 years old
− 42-year-old
− 39 y/o
Weight ≥ 80kg
− 267 pounds
− 280 lbs
− 80.4kg
− 82 kilograms
After 2010
− January 21, 2011
− October of 2012
− 08/21/11
− 2012-05-04
Mutation C677T
− C677T
− 677C>T
− 677C/T
− 677C->T
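To make the age, weight and date variety above concrete, here is a minimal sketch that maps the example expressions onto comparable values. The regular expressions, the list of date formats and the `eligible` helper are assumptions for illustration and only cover the phrasings listed on the slide. Month names are parsed with `%B`, which depends on the locale being English.

```python
import re
from datetime import date, datetime

LB_TO_KG = 0.45359237  # exact definition of the international pound


def weight_kg(text):
    """Extract a weight and convert it to kilograms, rounded to 0.1 kg."""
    m = re.search(r"(\d+(?:\.\d+)?)\s*(kg|kilograms?|lbs?|pounds?)\b", text, re.I)
    if not m:
        return None
    value, unit = float(m.group(1)), m.group(2).lower()
    if unit.startswith(("lb", "pound")):
        value *= LB_TO_KG
    return round(value, 1)


def age_years(text):
    """Extract an age from phrasings such as 'aged 58' or '42-year-old'."""
    m = re.search(
        r"\baged\s+(\d{1,3})\b|\b(\d{1,3})[\s-]*(?:years?[\s-]*old|y/o)",
        text,
        re.I,
    )
    if not m:
        return None
    return int(m.group(1) or m.group(2))


# Assumed formats; 08/21/11 is read as US month/day/year.
DATE_FORMATS = ["%B %d, %Y", "%B of %Y", "%m/%d/%y", "%Y-%m-%d"]


def parse_date(text):
    """Try each known format and return a date, or None if none fits."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    return None


def eligible(age_text, weight_text):
    """Patient selection: below 60 years old and weighing at least 80 kg."""
    age, weight = age_years(age_text), weight_kg(weight_text)
    return age is not None and weight is not None and age < 60 and weight >= 80


after_2010 = parse_date("January 21, 2011") > date(2010, 12, 31)
```

Once every mention is reduced to a number in a single unit, or to a calendar date, the range conditions become ordinary comparisons.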
Normalizing Gene Mutations
Different types of mutation description, including:
− positional e.g. +869(T>C)
− rsID e.g. rs100
Transform different syntax e.g.
− 1166A/C -> A1166C
− Asn to Ser substitution at codon 127 -> N127S
− +1196C/T -> C1196T
− g.655C/A>G -> C655G, A655G
− M567V/A -> M567V, M567A
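The transformations listed above can be sketched with a few regular expressions. This is a simplified illustration, not the I2E mutation ontology: `normalize_mutation` is a hypothetical function, it handles only the syntaxes shown on the slides, and it does not cover parenthesised positional forms or rsIDs.

```python
import re

# Standard three-letter to one-letter amino acid codes.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}


def normalize_mutation(text):
    """Return a list of mutations in <wild type><position><variant> form."""
    s = text.strip()
    # Already normalized, e.g. C677T.
    if re.fullmatch(r"[A-Z]\d+[A-Z]", s):
        return [s]
    # Position first: 677C>T, 677C/T, 677C->T, +1196C/T, g.655C/A>G.
    m = re.fullmatch(
        r"(?:[cg]\.)?\+?(\d+)([ACGT](?:/[ACGT])*)(?:->|>|/)([ACGT])", s
    )
    if m:
        pos, refs, alt = m.groups()
        return [f"{ref}{pos}{alt}" for ref in refs.split("/")]
    # Alternative variants at one position: M567V/A.
    m = re.fullmatch(r"([A-Z])(\d+)([A-Z](?:/[A-Z])+)", s)
    if m:
        ref, pos, alts = m.groups()
        return [f"{ref}{pos}{alt}" for alt in alts.split("/")]
    # Prose description: Asn to Ser substitution at codon 127.
    m = re.fullmatch(
        r"([A-Z][a-z]{2}) to ([A-Z][a-z]{2}) substitution at codon (\d+)", s
    )
    if m and m.group(1) in THREE_TO_ONE and m.group(2) in THREE_TO_ONE:
        return [f"{THREE_TO_ONE[m.group(1)]}{m.group(3)}{THREE_TO_ONE[m.group(2)]}"]
    return []
```

With every variant reduced to the same string, a search for C677T also finds 677C>T, 677C/T and 677C->T.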
Mutation Normalization Examples
Range Search
Allows search for values within a range
− in fixed fields e.g. publication date
− within free text e.g. dosages
Can directly ask for e.g.
− patients with diabetes under 60 with BMI under 30
Can find intervals within the text and return them when searching for a number or an overlapping range
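Matching an interval found in the text against a queried number or range comes down to an overlap test on closed intervals. The sketch below assumes values have already been extracted and normalized; the mention labels and helper names are hypothetical.

```python
def as_interval(value):
    """Treat a single number as the degenerate interval (value, value)."""
    return value if isinstance(value, tuple) else (value, value)


def overlaps(found, query):
    """True if the closed interval found=(lo, hi) intersects query=(lo, hi)."""
    return found[0] <= query[1] and query[0] <= found[1]


# Hypothetical normalized dosage mentions: a label and a value or interval.
mentions = [
    ("dose 5 mg", 5),
    ("dose 10-20 mg", (10, 20)),
    ("dose 40 mg", 40),
]

# Query for dosages between 15 and 30 mg.
hits = [label for label, v in mentions if overlaps(as_interval(v), (15, 30))]
```

The interval 10-20 is returned even though neither endpoint lies fully inside the query, because the two ranges overlap between 15 and 20.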
Range Search with Normalization
Range Search (Age, Date)
− Patients aged < 60yrs
− Date before 2010
Normalizing:
− Report Date, Age, Weight & BMI
Normalization Benefits
Ability to compare measurements with different units e.g. kg vs. lbs
Ability to perform range search for numerics, measurements, dates
Standardized representations to link to structured data e.g. mutation databases
Better clustering of results e.g. drug lab codes
Real World Example: Mutation Normalization
Mucopolysaccharidosis II: Hunter Syndrome
Rare X-linked recessive disorder
Deficiency of the lysosomal enzyme iduronate-2-sulfatase
Leads to progressive accumulation of glycosaminoglycans throughout the body
Signs & symptoms:
− Bone deformities with joint stiffness; frequent respiratory infections; cardiomyopathy; hepatosplenomegaly; neurocognitive impairment; reduced lifespan
− Some symptoms partially improved with enzyme replacement therapy
Spectrum of clinical severity (mild to severe); main difference is progressive development of neurodegeneration in the severe form
TEXT ANALYTICS FOR RARE DISEASES: GENOTYPE-PHENOTYPE ASSOCIATION IN HUNTER SYNDROME
CHALLENGE
• Scarcity of knowledge of natural history of disease
• Sparse data, needs high recall across full text papers
• Mutation patterns very variable
• Structured databases lack broad phenotypic association data
SOLUTION
• Developed workflow with Linguamatics I2E
• Abstracts identified in MEDLINE using broad vocabularies
• Full text PDFs processed for text analytics
• I2E mutation ontology and bespoke severity vocabularies enabled extraction of genotype-phenotype associations
BENEFIT
• Extraction of patient mutations matched or bettered genetic databases
• Increased understanding of IDS mutational spectrum for provider diagnostics and patient awareness
• Enabled rational approach to immune response classification
Shire use case
In Summary
Better normalization of
− Numbers, dates, drug codes, TNM cancer stage
− Subsequent range search
− Gene mutations
In combination with a human readable open query language, EASL
− Maximises the ease and flexibility of asking complex questions simultaneously across different content sources
Ultimately, agile NLP text mining
− Provides high quality, structured, clustered & normalized results in the format you need
− Improves speed to insight for faster decision making
