Tutorial 1:
Methods and Applications of Natural Language
Processing in Medicine
Rui Zhang1, Hua Xu2, Yanshan Wang3, Yifan Peng4
1University of Minnesota, 2University of Texas Health,
3Mayo Clinic, 4Weill Cornell Medicine
International Conference on Artificial Intelligence in Medicine (AIME 2020)
August 25, 2020
Purpose of this tutorial
• Review NLP systems and tools in solving clinical problems and
facilitating clinical research
• Showcase our real-world NLP application in clinical practice and
research across four institutions
• Discuss opportunities and challenges of NLP in medicine
JAMA 2014;311(24):2479-80
Healthcare Big Data
Motivation for Clinical NLP
• ~20% Structured Data: demographics, lab results, medications, diagnoses…
• ~80% Unstructured Data: clinical notes, patient-provided information, family history, social history, radiology reports, pathology reports, …
Speakers and Topics
Developing high-performance NLP solutions for
healthcare applications
Dr. Hua Xu is a Professor at the University of Texas Health School of Biomedical Informatics and a fellow of
the American College of Medical Informatics. His primary research interest is to develop NLP methods
and systems and apply them to clinical research and operations. He has worked on diverse clinical NLP
topics, such as entity recognition, relation extraction, syntactic parsing, word sense disambiguation,
and active learning, with over 200 publications. He has built multiple clinical NLP systems, including the
medication information extraction tool MedEx and, more recently, the comprehensive clinical NLP system CLAMP,
using machine learning and deep learning methods. These tools have been widely used in large clinical
consortia such as OHDSI and CTSA.
• NLP concepts and tasks
• Issues affecting NLP performance
• Tools to facilitate NLP development
• Applications to healthcare
https://sbmi.uth.edu/faculty-and-staff/hua-xu.htm
Transfer Learning of NLP in Medicine
Dr. Yifan Peng is an assistant professor of population health sciences in the Division of Health
Informatics at Weill Cornell Medicine. After receiving his Ph.D. in Computer Science from the
University of Delaware in 2016, Dr. Peng worked as a research fellow at the National Center for
Biotechnology Information at the National Library of Medicine, NIH. Dr. Peng's main research interests
include biomedical and clinical natural language processing and medical image analysis (by courtesy).
His current project focuses on applying information extracted through NLP and image analysis to
radiological data classification.
• Transfer learning
• Pre-training of BERT model on large-scale clinical corpora
• Fine-tuning the BERT model on specific tasks such as NER and RE
• Multi-task learning
http://vivo.med.cornell.edu/display/cwid-yip4002
Digital Phenotyping for Cohort Discovery
Dr. Yanshan Wang is an Assistant Professor at Mayo Clinic. His current work centers on developing
novel NLP and artificial intelligence (AI) methodologies to facilitate clinical research and solve real-world
clinical problems. Dr. Wang has extensive collaborative research experience with physicians,
epidemiology researchers, and statisticians. He has published over 40 peer-reviewed articles at
refereed computational linguistics conferences (e.g., NAACL) and in medical informatics journals and
conferences (e.g., JBI, JAMIA, JMIR, and AMIA). He has served on program committees for EMNLP,
NAACL, IEEE-ICHI, and IEEE-BIBM.
• Cohort retrieval
• Approaches for cohort retrieval
• Case study
• Patient cohort retrieval for clinical trials accrual
https://www.mayo.edu/research/faculty/wang-yanshan-ph-d/bio-20199713
Advances of NLP in Clinical Research
Dr. Rui Zhang is a McKnight Presidential Fellow and Associate Professor in the College of Pharmacy
and the Institute for Health Informatics (IHI), and also graduate faculty in Data Science, at the University
of Minnesota (UMN). He is the Director of NLP Services in the Clinical and Translational Science Institute
(CTSI) at the UMN. Dr. Zhang's research focuses on health and biomedical informatics, especially
biomedical NLP and text mining. His research interests include the secondary analysis of EHR data for
patient care as well as pharmacovigilance knowledge discovery through mining the biomedical literature.
• Background of NLP to Support Clinical Research
• NLP Systems and Tools for Clinical Research
• Case study
• NLP to Support Dietary Supplement Safety Research
http://ruizhang.umn.edu
Schedule
Time           Session                                                                 Presenter
9:00 – 9:05    Introduction                                                            Rui Zhang
9:05 – 9:45    Developing high-performance NLP solutions for healthcare applications   Hua Xu
9:45 – 10:25   Transfer Learning of NLP in Medicine: A case study with BERT            Yifan Peng
10:25 – 10:30  Break
10:30 – 11:10  Digital Phenotyping for Cohort Discovery                                Yanshan Wang
11:10 – 11:50  Advances of NLP for Clinical Research                                   Rui Zhang
11:50 – 12:00  Q&A
Building High-performance
NLP Systems in Healthcare
Hua Xu PhD
School of Biomedical Informatics, University of Texas
AIME NLP Tutorial
8/25/2020
[Diagram: intersection of Data Science, Biomedicine, and NLP]
Disclosure
§ Founder and CEO:
§ Melax Technologies Inc.
§ Consultant:
§ Hebta LLC
§ More Health Inc.
§ DCHealth Technologies Inc.
Outline
01 Overview & Challenges
02 Select the right algorithms
03 Annotate good data
04 Bring humans into the loop
Part 1
01 Overview & Challenges
NLP Tasks – Let’s focus on Information Extraction
• Information retrieval
• Information extraction
• Document classification
• Question answering
• Language generation
Text sources: Wikipedia, websites, social media, email, office files
NLP: "Computational techniques for analyzing and representing naturally occurring languages at one or more levels of linguistic analysis for the purpose of achieving human-like language processing for a range of tasks or applications."
Applications of Biomedical IE Systems
Inputs: clinical documents, drug labels, clinical trial protocols, biomedical literature
→ NLP →
Applications: decision support, business intelligence, clinical research, surveillance
Active Development of Biomedical IE Systems
General purpose: MedLEE, MetaMap, CLAMP, cTAKES
Specific purpose: smoking status, PHI de-identification, social determinants, bleeding events, cancer metastasis, ……
Challenges for End-user to Utilize Biomedical NLP
§ General clinical NLP systems exist, but their performance is often
suboptimal for user-specific applications
§ Specific-purpose NLP systems often show good performance on a given
task, but performance drops when these tools are transported to new settings
§ Generalizability issues arise when users build or deploy NLP applications
§ From one type of document to another
§ From one organization to another
§ From one application to another
An Example of Smoking Status Detection
§ Mayo Clinic cTAKES for smoking detection
§ Sentence-level mention detection and classification – machine learning (ML)
§ Document-level status classification – rules
§ Patient-level summarization – rules
§ Performance drops at deployment
§ i2b2 dataset, F-measure 85.5% (Savova et al. JAMIA 2008)
§ Vanderbilt dataset, F-measure 75% (Liu et al. AMIA 2012)
§ Steps to customize it to improve performance to 89%
§ Collect and annotate local data
§ Re-train models using specific algorithms
§ Specify rules by local physicians
Optimizing NLP performance
could be time-consuming and
costly …
Components for Building High-performance NLP Systems
§ Algorithm: rules, machine learning, deep learning
§ Human: conduct annotation, specify rules, curate knowledge bases
§ Data: what to annotate, annotation quality, annotation cost
Together, these enable practical NLP for biomedicine.
Part 2
02 Select the right algorithms
Rules vs. Machine Learning vs. Deep Learning
Rule-based approach to medication information extraction
§ Input: a clinical document, e.g., discharge summary
§ Output: all drug names with associated signature information such as dose,
frequency, route…
§ Issues:
§ Misspellings and abbreviations
§ ibuprofen ("ibuprfen"), augmentin ("qugmentin"), insulin ("inuslin"), and ASA ("aspirin")
§ Context of drug mentions
§ Allergy: pt is allergic to penicillin
§ Negation: never on warfarin
§ Lab tests: potassium level is normal vs. take potassium
§ Temporal status: was on warfarin 3 days before admission
§ Multiple signatures and multiple drugs in one sentence
§ Coumadin 2.5mg po dly except 5mg qTu,Th
§ start the patient on Lovenox for the duration of this pregnancy, followed by a transition to Coumadin postpartum, to be
continued for likely long-term, possibly lifelong duration.
MedEx – a rule-based tool to identify drug information from free text
• Semantic-based parsing (drug names and signatures)
• Maps to RxNorm concepts
Pipeline: Clinical Text → Pre-processing → Semantic Tagger (Lexicon & Rules) → Parser (Semantic Grammar) → Structured Output
Example: "She is currently maintained on Prograf 3mg bid." → Drugname: Prograf; Strength: 3mg; Frequency: bid

Table 1. Evaluation on discharge summaries from Vanderbilt.
Findings   Prec  Rec   F-Score
DrugName   95.0  91.5  93.2
Strength   98.8  90.5  94.5
Route      98.8  89.6  93.9
Frequency  98.9  93.2  96.0

Table 2. Evaluation on clinic visit notes from Vanderbilt.
Findings   Prec  Rec   F-Score
DrugName   96.7  88.0  92.1
Strength   94.7  94.7  94.7
Route      96.0  87.0  91.3
Frequency  96.8  89.2  92.9

Xu et al. JAMIA 2010; 17:19-24
Define Semantic Categories
Category         Examples
Drug Name        Lisinopril, Famotidine
Strength         50mg, 500/50
Route            by mouth, iv
Frequency        b.i.d., every 2 days
Form             tablet, ointment
Dose Amount      take one tablet
IntakeTime       cc, at 10am
Duration         for 10 days
Dispense Amount  dispensed #30
Refill           refills: 2
Necessity        prn, as needed
Entity Recognition
§ Lexicon lookup tagger
§ Drug names
§ Include RxNorm, UMLS, and manually collected drug terms
§ Exclude certain English terms ("sleep")
§ Regular expression-based tagger
§ Frequency, such as q8hrs
§ Transformation/Disambiguation
§ Rule-based transformation/disambiguation of initial tags into final semantic tags
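A regular-expression-based frequency tagger can be illustrated in a few lines of Python. This is a minimal sketch: the patterns below are illustrative stand-ins, not MedEx's actual rule set.

```python
import re

# Illustrative frequency patterns (q8hrs, bid, tid, qam, prn, ...)
FREQ_PATTERN = re.compile(
    r"\b(q\s*\d+\s*(?:h|hr|hrs|hours?)"                    # q8h, q8hrs
    r"|q\.?d\.?|b\.?i\.?d\.?|t\.?i\.?d\.?|q\.?i\.?d\.?"    # qd, bid, tid, qid
    r"|qam|qpm|prn)\b",
    re.IGNORECASE,
)

def tag_frequencies(text):
    """Return (start, end, matched_text) spans tagged as FREQ."""
    return [(m.start(), m.end(), m.group(0)) for m in FREQ_PATTERN.finditer(text)]

print(tag_frequencies("Prograf 3mg bid, ibuprofen 600mg q8hrs prn"))
```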
Parsing
§ A Chart Parser in NLTK
§ Semantic grammar
§ Parse tree → structured output
§ A regular-expression-based chunker

Figure 1. Simplified semantic grammar.
<S> := <DrugList>
<DrugList> := <Drug> | <Drug><DrugList>
<Drug> := <DGSSIG> | <DGMSIG>
<DGSSIG> := <DGN> | <DGN><SIG>
<SIG> := <DOSE> | <FORM> | <RUT> ….
Example parse (DGMSIG): "Prograf 3mg qam and 2mg qpm" → DGN [Prograf], SIG [3mg qam], SIG [2mg qpm]
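The chart-parsing step can be sketched with NLTK's ChartParser over a toy CFG in the spirit of Figure 1; MedEx's real semantic grammar and lexicon are far larger, so this is only a sketch.

```python
import nltk

# Toy semantic grammar: a drug name followed by signature items (dose, frequency).
grammar = nltk.CFG.fromstring("""
  S    -> DRUG
  DRUG -> DGN SIG | DGN
  SIG  -> DOSE | FREQ | DOSE SIG | FREQ SIG
  DGN  -> 'prograf'
  DOSE -> '3mg' | '2mg'
  FREQ -> 'qam' | 'qpm' | 'bid'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("prograf 3mg bid".split()):
    print(tree)  # the parse tree is then walked to emit structured output
```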
Extend MedEx for the 2009 i2b2 Challenge
Pipeline: Clinical Notes → Sentence Splitter → Section Identification → Spell Checker → MedEx (Semantic Tagger, Parser) → Post-processing → i2b2 Output

Table 3. Evaluation on the 2009 i2b2 data set.
Findings   Prec  Rec   F-Score
DrugName   84.2  87.1  85.6
Dose       89.5  81.8  85.5
Route      91.8  85.8  88.7
Frequency  87.9  85.8  86.8
Reason     45.9  29.6  36.0
Duration   36.4  35.8  36.1
All        83.9  80.3  82.1

Ranked 2nd out of 20 participating teams.
Doan et al. JAMIA 2010; 17:528-31
Machine Learning for clinical entity recognition
§ The 2010 i2b2 Challenge: recognize problem, treatment, and test
§ Convert it into a machine learning task
§ Optimize the ML models
§ ML algorithms: CRFs, SSVMs
§ Features: words, sections, dictionary, representations
§ Entity tag sets: BIO, BIESO
Example (BIO tagging):
She was given 1 unit of packed red blood cell .
O   O   O     O O    O  B      I   I     I    O
"Plavix was not recommended, given her recent GI bleeding."
Jiang et al. JAMIA 2011
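A minimal sketch of CRF-based BIO tagging, using sklearn-crfsuite as a stand-in; the original systems used far richer, optimized feature sets.

```python
import sklearn_crfsuite

def token_features(sent, i):
    # A deliberately tiny feature set: lowercase form plus immediate context.
    word = sent[i]
    return {
        "word.lower": word.lower(),
        "word.isdigit": word.isdigit(),
        "prev_word": sent[i - 1].lower() if i > 0 else "<BOS>",
        "next_word": sent[i + 1].lower() if i < len(sent) - 1 else "<EOS>",
    }

sent = "She was given 1 unit of packed red blood cell .".split()
labels = ["O", "O", "O", "O", "O", "O", "B", "I", "I", "I", "O"]

X_train = [[token_features(sent, i) for i in range(len(sent))]]
y_train = [labels]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=100)
crf.fit(X_train, y_train)
print(crf.predict(X_train))
```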
Results
Tags   Features   SSVMs – F (R/P)       CRFs – F (R/P)
BIO    Baseline   84.51 (82.61/86.49)   84.02 (81.32/86.90)
BIO    Optimized  85.22 (84.05/86.43)   85.16 (82.94/87.50)
BIESO  Baseline   84.71 (82.53/87.02)   84.22 (81.40/87.23)
BIESO  Optimized  85.82 (84.31/87.38)   85.59 (83.16/88.16)
Tang et al. BMC Medical Informatics and Decision Making 2013
Contextual embeddings for deep learning-based NER
Embedding methods:
• Traditional: word2vec, GloVe, fastText
• Contextual: ELMo, BERT-base, BERT-large, BioBERT
Pre-training:
• Open-domain: off-the-shelf (general) embeddings
• Clinical domain: pre-trained on clinical notes from MIMIC-III, starting from an open-domain checkpoint
Evaluation (entity recognition tasks): i2b2 2010, i2b2 2012, SemEval 2014 Task 7, SemEval 2014 Task 14
Si Y et al. Enhancing clinical concept extraction with contextual embeddings. JAMIA. 2020
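A sketch of the fine-tuning setup with Hugging Face transformers; "bert-base-uncased" is a placeholder where Si et al. would instead start from a checkpoint pre-trained on clinical notes, and the label set follows the 2010 i2b2 scheme.

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

model_name = "bert-base-uncased"  # stand-in for a clinical checkpoint
labels = ["O", "B-problem", "I-problem", "B-treatment", "I-treatment", "B-test", "I-test"]

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name, num_labels=len(labels))

enc = tokenizer("Plavix was not recommended, given her recent GI bleeding.",
                return_tensors="pt")
with torch.no_grad():
    logits = model(**enc).logits        # (1, seq_len, num_labels)
pred = logits.argmax(-1)[0]
print([labels[i] for i in pred])        # random until the head is fine-tuned
```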
Algorithm Comparison on Benchmark Dataset
§ Task: 2010 i2b2 challenge – entity recognition for problem, treatment, and test in discharge summaries

Algorithm                                    Features                                                        F1
CRFs (Jiang et al., 2010) (#2 in challenge)  Bag of words                                                    77.33
                                             Optimized features                                              83.60
Semi-Markov (de Bruijn B, et al., 2010)
(#1 in challenge)                            Optimized features + Brown clustering                           85.23
SSVMs (Tang et al., 2014)                    Optimized features + Brown clustering + random indexing         85.82
CNN (Wu et al., 2015)                        Word embedding                                                  82.77
Bi-LSTM-CRF (Wu et al., 2017)                Word embedding                                                  85.91
BERT (Si et al., 2020)                       Pre-trained language model (BERT), fine-tuned on clinical text  90.25
Additional thoughts on deep learning approaches
§ Parameter optimization
§ Computation resources (e.g., GPUs)
§ Prediction speed
§ CRF-based NER – ~1 second per discharge summary
§ BERT-based NER – ~20 seconds per discharge summary
§ Reliability and explainability
A Review of Deep Learning in Clinical NLP
Wu S et al. Deep learning in clinical natural language processing: a methodical review. JAMIA 2019
NLP Tasks and Applications
NLP Task                        Sub-tasks                                                                                                                    Applications
Word sequence                   POS tagging, language models, named entity recognition, relation extraction / semantic annotation (semantic role labeling, event detection, FrameNet)   Information extraction
Sequence to sequence            Encoders and decoders                                                                                                        Machine translation, summarization
Text classification/clustering  Document classification, sentence classification, sentiment analysis, topic models                                           Email spam, product sentiment
Information retrieval           Query expansion, indexing, relevance ranking                                                                                 Search engines
Dialog systems                  Speech recognition, natural language generation                                                                              Chat bots
Summary about algorithm selection
§ The simplest approach that can achieve good performance is the best
§ Take available resources into consideration
§ Computation resources
§ Both labeled and unlabeled data
§ Expertise in machine learning/deep learning
§ Keep deployment in mind
§ Technical architecture and infrastructure
§ Fitting into your workflow
§ Other requirements such as speed, robustness etc.
Part 3
03 Annotate good data
Availability, Quality, and Sample Size
Data Availability
§ Large unlabeled data is useful, especially for deep learning based approaches
§ High-quality annotated data is the key to machine learning/deep learning based approaches
§ Be aware of the privacy issues of biomedical textual data – de-identification programs that can remove protected health information (e.g., names, addresses, dates) are available
§ De-identification is essentially an NER task – many rule-based, ML-based, and hybrid approaches exist
§ Performance varies (some as high as 95%)
§ Examples: MIST, De-ID…
What about synthetic text?
§ Generating synthetic notes
§ Task – generate HPI section
§ Data – 826 clinical notes
§ Methods – SeqGAN, GPT-2, and CTRL
This is a 39 year-old female with a history of diabetes mellitus , coronary artery
disease , who presents with shortness of breath and cough . She has no relief from
antacids or antiinflammatories . She is admitted now with increasing radiation
damage to her home and extensive medical bills . She denies any pleural chest pain.
Annotation Quality Matters
§ Task: 2014 i2b2 challenge – extracting 36 risk factors, a document classification task
§ Dataset: 790 training and 514 test notes with document labels and evidence spans highlighted
§ The top-ranked system:
§ Traditional SVM classifiers
§ Re-annotated the corpus to:
§ Fix inconsistent boundaries
§ Identify negative mentions
Roberts, K., Shooshan, S.E., Rodriguez, L., Abhyankar, S., Kilicoglu, H. and Demner-Fushman, D., 2015. The role of fine-grained annotations in supervised recognition of risk factors for heart disease from EHRs. Journal of Biomedical Informatics, 58, pp.S111-S119.
Requirements for annotation
§ Annotation guideline
§ Clear definitions of entities and relations (an information model)
§ Appropriate granularity to benefit your application
§ Consistent and robust representation of information
§ High-quality annotation (e.g., consistent)
§ Annotator knowledge
§ Sufficient training
§ Adequate sample size
Annotation Workflow
Data collection → pre-annotation → guideline development → training & annotation → quality control → model training → ML models
Guideline development – content
§ Goals of the annotation
§ Definitions of entities, relations, etc.
§ Information model
§ Granularity
§ Detailed annotation rules (different scenarios)
§ Human vs. computer’s thinking
§ Provide many examples
§ Positive examples
§ Negative examples
Guideline development - workflow
§ Iterative process
§ Involvement of both domain experts and linguists/informaticians
Guideline development – example
Annotator selection and training
§ Annotator selection
§ Background: domain experts/linguists/lay persons
§ Sources: physicians/nurses, residents, students, or crowdsourcing (e.g., Amazon Mechanical Turk)
§ Annotator training
§ Iterative training until an expected performance is achieved
§ Quality checking during the annotation
§ Multi-annotator management
§ Train each annotator to ensure consistent annotations before starting
§ If resources allow, each sample can be double-annotated by two people, with a third, more experienced annotator adjudicating discrepancies
§ Otherwise, assign a small portion of the data to both annotators so that inter-annotator agreement (IAA) can be calculated
Annotation tools
§ BRAT
§ MAE
§ eHOST
§ ….
§ Prodigy
§ LightTag
§ CLAMP
Quality checking
§ Inter-annotator agreement
§ Precision/Recall/F-measure
§ Cohen's kappa, Fleiss' kappa (https://en.wikipedia.org/wiki/Cohen%27s_kappa)
§ Self-train and self-test
§ Build a model on one dataset and predict the same dataset again
§ Performance should be high; otherwise it indicates annotation inconsistencies
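Cohen's kappa can be computed directly with scikit-learn; the toy labels below stand in for two annotators' token-level output.

```python
from sklearn.metrics import cohen_kappa_score

# Toy token-level BIO labels from two annotators over the same sentence.
annotator_a = ["O", "B", "I", "O", "O", "B", "O", "O"]
annotator_b = ["O", "B", "I", "O", "B", "B", "O", "O"]

print(cohen_kappa_score(annotator_a, annotator_b))  # 1.0 = perfect agreement
```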
Sample Size
§ How many samples are needed for the required performance of the specific task? – No definite answer…
§ Many studies reported results on several hundred documents
§ Sample size could be estimated based on a power calculation
§ More precisely, we can plot a learning curve
[Figure: learning curve – F-measure (0.25–0.85) vs. number of sentences in the training set (8–8,192), uncertainty sampling]
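A learning curve can be plotted with scikit-learn; synthetic data stands in for an annotated corpus in this sketch.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Synthetic stand-in for an annotated corpus.
X, y = make_classification(n_samples=2000, n_features=50, random_state=0)

sizes, train_scores, test_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.05, 1.0, 8), cv=5, scoring="f1")

for n, score in zip(sizes, test_scores.mean(axis=1)):
    print(f"{n:5d} samples -> F1 {score:.3f}")  # watch where the curve flattens
```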
Challenges and potential solutions
§ Annotation cost/time
§ Requires a reasonably sized annotated corpus
§ Annotation by experts (e.g., physicians) is expensive
§ Technologies to save annotation time
§ Weak supervision: get low-quality labels more efficiently
§ Transfer learning: leverage labeled data/models from a different domain/task
§ Active learning: label informative samples to build better models
Summary
§ Data and annotation play important roles in machine learning/deep
learning based NLP systems
§ A good annotated corpus that leads to high performance ML models should
include:
§ Annotation guideline designed for the task
§ Knowledgeable and well-trained annotators
§ Enough annotated samples
§ Annotation could be costly and time-consuming
Part 4
04 Bring humans into the loop
Human annotation, rule augmentation, biomedical knowledge bases
Rule Augmentation is Effective
Task: 2018 n2c2 Drug-ADE challenge (each model with and without rule-based post-processing)
Relation           SVM     +post   CNN-RNN  +post   biLSTM-CRF  +post
Strength → Drug    0.9704  0.9792  0.9760   0.9853  0.9865      0.9916
Dosage → Drug      0.9637  0.9798  0.9642   0.9818  0.9720      0.9860
Duration → Drug    0.84    0.8947  0.8519   0.9125  0.8829      0.9292
Frequency → Drug   0.9525  0.9735  0.9592   0.9810  0.9692      0.9873
Form → Drug        0.9728  0.9867  0.9713   0.9864  0.9765      0.9890
Route → Drug       0.9581  0.9742  0.9668   0.9805  0.9736      0.9858
Reason → Drug      0.7328  0.8364  0.7464   0.8466  0.7579      0.8488
ADE → Drug         0.7604  0.8221  0.7528   0.8112  0.7946      0.8502
Overall            0.9256  0.9521  0.9304   0.9574  0.9399      0.9630
Wei, Q., et al. A study of deep learning approaches for medication and adverse drug event extraction from clinical text. JAMIA, 2020 27(1), pp.13-21.
Active Learning to Reduce Annotation Cost
§ Goal: minimize annotation cost while maximizing the quality of the ML-based model
Active learning: select the most informative samples from the pool of unlabeled data for the human annotator; the machine learner retrains on the growing labeled set.
Passive learning: select samples randomly.
Querying Algorithms
§ Uncertainty-based querying
§ Clustering and uncertainty sampling engine (CAUSE): query the most uncertain and representative sentences
Toy example – inputs: clusters c1, c2, c3 with Score(c1)=0.6, Score(c2)=0.4, Score(c3)=0.1; number of queries = 2. Steps: (1) cluster scoring, (2) representative sampling. Output: samples a and c.
An active learning-enabled annotation system for clinical named entity recognition. Chen Y et al. BMC Med Inform & Decis Mak. 2017
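A minimal pool-based uncertainty-sampling loop (least confidence); CAUSE additionally clusters the pool for representativeness, which this sketch omits.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic pool standing in for unlabeled clinical sentences.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
labeled = list(range(10))                       # small seed set
pool = [i for i in range(len(X)) if i not in labeled]

model = LogisticRegression(max_iter=1000)
for _ in range(5):                              # 5 querying rounds
    model.fit(X[labeled], y[labeled])
    proba = model.predict_proba(X[pool])
    uncertainty = 1.0 - proba.max(axis=1)       # least confidence
    query = [pool[i] for i in np.argsort(-uncertainty)[:20]]
    labeled += query                            # simulated oracle: reuse gold labels
    pool = [i for i in pool if i not in query]
print(model.score(X, y))
```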
Simulation Study
[Figure: learning curves – F-measure vs. number of sentences in the training set, comparing uncertainty, diversity, length, and random querying]
To achieve a model with 0.80 F-measure:
Annotation cost   Random sampling   Uncertainty sampling   Reduction
Sentences         8,702             2,971                  66%
Online Learning Interface
Active Learning Workflow
Start: load the data pool into the learning activator; rank the unlabeled sentences with the querying algorithm.
Annotation loop: the interface selects the top-ranked unlabeled sentence → the user annotates it → the sentence is encoded and added to the labeled set → the CRF model is retrained → the remaining unlabeled sentences are re-ranked.
End: when the user quits or time runs out.
Real-Time User Study on Active vs. Passive Learning
A linear regression model was used to estimate annotation time from the basic information, semantic complexity, and syntactic complexity of the sentence:
Cost(s) = b0 + Σ_i b_i · x_i(s)
Categories  Features
Basic       Number of words (NOW), number of entities (NOE), number of entity words (NOEW)
Syntactic   Entropy of POS tags (EOP)
Semantic    TFIDF
Example sentence: "MRI by report showed bilateral rotator cuff repairs and he was admitted for repair of the left rotator cuff."
Feature  NOW  NOE  NOEW  TFIDF  EOP
Value    20   3    11    35.36  2.28
Cost-aware Active Learning: rank by utility per cost, UPC(s) = Utility(s)/Cost(s)
A Larger User Study: 8 out of 9 users showed better performance with active learning than with random sampling; AL saved 20–30% of annotation time.
Wei Q. et al. Cost-aware Active Learning for Named Entity Recognition in Clinical Text. JAMIA 2019
Human Annotation Process is Complicated
§ Annotation speed vs. quality
[Figure: per-user annotation statistics – speed (words/minute) and annotation quality (F1 score) for 9 users]
Wei, Q., et al. AMIA 2018
§ Syntactic structure impact (regression coefficients per user)
      user1    user2   user3    user4   user5    user6    user7   user8   user9
DD    -0.038*  -0.036  -0.002   -0.036  -0.088*  -0.107*  -0.273  0.001   -0.13*
EOP   0.338*   0.245   1.046*   -0.459  0.695*   -1.066*  -0.459  -0.349  0.95*
NOP   0.243    0.705   -0.372   -0.609  0.431*   0.583*   0.478   -0.147  -0.297
ISC   -0.083   -0.397  -0.38    -0.196  0.421*   -0.737*  -0.452  -0.49*  -0.261
NOV   0.402*   0.634*  -0.35*   0.201   -0.22    0.139    1.175*  0.514*  1.17*
DOP   -0.234   -0.903  0.58*    0.592   -0.756*  1.213*   -0.885  0.716*  1.25*
DD – dependency distance; EOP – entropy of POS tags; NOP – number of phrase nodes; …….
Mapping to Standard Clinical Terminologies is Important
§ Encoding (Entity Linking) – find the corresponding concept ID in a
terminology for a given term/entity
§ Example:
§ Entity: “right below - knee amputation”
§ Candidates:
• 1: C2202463 amput below knee leg right
• 2: C0002692 amput below knee
• 3: C0002692 amput below bka knee
• …
§ Challenges
§ Lexical variation
§ Polysemy
§ Granularity differences
Entity Linking Framework – Map to UMLS
Pipeline: NE term → query expansion (LVG/abbreviations/synonyms/adjective-to-noun…) → query on UMLS concepts (against a UMLS index built by the index builder) → ranking by similarity scores (learning to rank) → post-processing, adjust CUI offset
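The first-stage candidate ranking can be sketched with the rank_bm25 package, scoring the UMLS candidate strings from the example above against the mention; a learning-to-rank model would then re-rank these candidates.

```python
from rank_bm25 import BM25Okapi

# Candidate CUIs and normalized names from the example above.
candidates = {
    "C2202463": "amput below knee leg right",
    "C0002692": "amput below knee",
}
corpus = [name.split() for name in candidates.values()]
bm25 = BM25Okapi(corpus)

mention = "right below knee amput".split()   # normalized mention tokens
scores = bm25.get_scores(mention)
for (cui, name), s in sorted(zip(candidates.items(), scores), key=lambda t: -t[1]):
    print(f"{cui}  {s:.3f}  {name}")
```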
Encoding Algorithms and Performance on Benchmark Data
Task                         Dataset                         Method                                                             Accuracy
SNOMED-CT, clinical text     2013 ShARe/CLEF, 2014 SemEval   BM25 + domain knowledge + RankSVM (#1 in challenge) (Zhang, 2014)  0.873
                                                             BM25 + domain knowledge + CNN (Tang, 2017)                         0.903
                                                             BM25 + BERT (Ji, 2019)                                             0.911
MedDRA, drug labels          2018 TAC ADR                    BM25 + translational model + RankSVM (#1 in challenge) (Xu, 2018)  0.911
                                                             BM25 + BERT (Ji, 2019)                                             0.932
MeSH, biomedical literature  NCBI                            BM25 + domain knowledge + CNN (Tang, 2017)                         0.861
                                                             BM25 + BERT (Ji, 2019)                                             0.891
Summary
§ Rules are still important when optimizing performance of biomedical NLP
systems – hybrid approaches often achieve best performance
§ Keeping humans in the loop with the data and algorithms is one way to improve model
performance while reducing annotation cost
§ Biomedical ontologies and other knowledge bases are valuable for many
NLP applications
Integrate All to Better Support End-Users
The CLAMP system
CLAMP - Clinical Language Annotation, Modeling, and Processing
Track record in clinical NLP challenges:
NLP Task                  Challenge                                                   Ranking
Named entity recognition  2009 i2b2 medication information extraction                 #2
                          2010 i2b2 problem, treatment, test extraction               #2
                          2013 ShARe/CLEF abbreviation recognition                    #1
                          2016 CEGS N-GRID de-identification                          #2
UMLS encoding             2014 SemEval disorder encoding                              #1
Relation extraction       2012 i2b2 temporal information extraction                   #1
                          2015 SemEval disease-modifier extraction                    #1
                          2015 BioCreative chemical-induced disease from literature   #1
                          2016 SemEval temporal information extraction                #1
                          2017 TAC ADR extraction from drug labels                    #1
                          2018 n2c2 medication and associated ADR extraction          #1
CLAMP Algorithms – the DeepMed Framework: CRFs, SSVMs, Bi-LSTM-CRF, BERT/BioBERT, ….., AutoML, Docker containers
CLAMP Data Annotation
CLAMP Rule Interface for Human
Available as:
• CLAMP-CMD
• CLAMP-GUI
• CLAMP-EE
CLAMP Users
When developing biomedical NLP applications, please
§ Identify the right NLP tasks for your projects
§ Assemble the development team with the required expertise (domain experts, business owners, informaticians, developers ….)
§ Collect and annotate data following a standard protocol (guideline development, annotator training/agreement check, annotation quality control …)
§ Select appropriate algorithms (accuracy, speed, implementation …) and carefully evaluate their performance/usability/interoperability
§ Keep humans (multidisciplinary) in the loop during the life cycle of the development
Transfer Learning of NLP in Medicine:
A Case Study with BERT
Yifan Peng
Department of Population Health Sciences
Transfer learning
• A technique that reuses a model already trained on one dataset and adapts it to a different dataset
• In the field of computer vision, researchers have repeatedly shown the value of transfer learning
• Two steps:
  • Pre-training: use a large training set to learn network parameters and save them for later use
  • Fine-tuning: train all (or part of) the layers of the pretrained network on the target dataset
Example: Detect lung diseases from chest X-rays
• Step 1: Pre-train the CNN on ImageNet (14 million images)
• Step 2: Fine-tune the model on NIH Chest X-ray (100,000 chest X-ray images)
Wang et al., ChestX-ray8: Hospital-scale Chest X-ray Database and Benchmarks on Weakly-Supervised Classification and Localization of Common Thorax Diseases. CVPR. 2017.
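In code, the two steps look roughly like this with torchvision; the number of classes and the freezing policy are assumptions, not the paper's exact setup.

```python
import torch.nn as nn
from torchvision import models

num_classes = 14                                          # assumed label count
model = models.resnet50(pretrained=True)                  # Step 1: ImageNet weights
model.fc = nn.Linear(model.fc.in_features, num_classes)   # Step 2: new task head

for p in model.parameters():        # optionally freeze the backbone first
    p.requires_grad = False
for p in model.fc.parameters():     # train only the new head initially
    p.requires_grad = True
```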
Transfer learning
• Makes it less difficult to train a complex network
• Speeds up the convergence of training
• How can transfer learning benefit NLP in medicine?
Outlines
• Word embedding
• ELMo
• BERT
• How to use pre-trained BERT
• Performance comparison of BERT in medicine
• Multi-task learning
Guiding questions: How did BERT's ideas gradually form? What is innovative? Why does it work so well?
How do we represent the meaning of a word?
• There are an estimated 13 million words in the English language
• They are not completely unrelated: hotel vs. motel vs. dog
• We want to encode each word into some representation that the machine can understand
Word vectors
• Encode similarity in the vectors themselves
• Some N-dimensional space (e.g., 200D) that is sufficient to encode all the semantics of our language
• Each dimension would encode some meaning that we transfer using speech
  • tense (past vs. present vs. future)
  • count (singular vs. plural)
Word Embeddings
• Word2vec, fastText, etc.
• BioWordVec: https://github.com/ncbi-nlp/BioWordVec
Sources                   Documents  Tokens
PubMed                    30M        4,000M
MIMIC-III clinical notes  2M         500M
Zhang et al., Improving biomedical word embeddings with subword information and MeSH. Scientific Data. 2019
Interesting semantic patterns emerge in the vectors
Word pair                       word2vec  BioWordVec
thalassemia / hemoglobinopathy  —         0.834
mycosis / histoplasmosis        0.353     0.706
thirsty / hunger                0.252     0.629
influenza / pneumoniae          0.482     0.611
atherosclerosis / angina        0.503     0.589
Zhang et al., Improving biomedical word embeddings with subword information and MeSH. Scientific Data. 2019
Interesting syntactic patterns emerge in the vectors
Rohde et al. An Improved Model of Semantic Similarity Based on Lexical Co-Occurrence. 2005
Linear translations
• Algebraic relations:
  • vec("man") − vec("woman") + vec("aunt") ≈ vec("uncle")
  • vec("king") − vec("man") + vec("woman") ≈ vec("queen")
Tomas Mikolov, Wen-tau Yih, Geoffrey Zweig, "Linguistic Regularities in Continuous Space Word Representations", NAACL-HLT 2013
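These relations can be tested with gensim; 'vectors.bin' is a placeholder path for any pretrained word2vec-format file (e.g., BioWordVec).

```python
from gensim.models import KeyedVectors

# Load pretrained vectors (placeholder path).
wv = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)

# king - man + woman: "queen" (or a close neighbor) should rank first.
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
```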
How to use Word Embeddings
[Figures: Convolutional Neural Network and Recurrent Neural Network architectures that take word embeddings as input]
Peng et al. Extracting chemical-protein relations with ensembles of SVM and deep learning models. Database. 2018.
Word Embeddings in DL
• Evaluation of word embeddings in the protein-protein interaction (PPI) extraction task
Data Set  word2vec  BioWordVec
AIMed     0.445     0.487
BioInfer  0.524     0.549
BioInfer  0.603     0.623
IEPA      0.484     0.511
HPRD50    0.679     0.713
Limitations of Word Embeddings
The polysemy problem:
• "I arrived at the bank after crossing the river"
• "The bank has plans to branch through the country…"
Static word embeddings cannot resolve polysemous words: both senses of "bank" receive the same vector.
From Word Embeddings to ELMo
• ELMo: "Embedding from Language Models"
• Adjusts the word embedding representation of a word according to the semantics of the context words
Peters et al., Deep contextualized word representations. NAACL. 2018
ELMo
• A typical two-stage process
  • Stage 1: use a language model for pre-training
    [Figure: bidirectional language model over "no evidence of infiltrate", with left and right context around the target word]
  • Stage 2: extract the embeddings of each layer
• ELMo word representations are functions of the entire input sentence
ELMo in medical NLP
• Evaluation of ELMo in named entity recognition and relation extraction tasks
Task                      Dataset     SOTA  ELMo
Named entity recognition  ShARe/CLEF  70.0  75.6
Relation extraction       DDI         72.9  78.9
Relation extraction       ChemProt    64.1  66.6
Limitations of ELMo
• Hard to capture long-distance information
• Computationally expensive
From ELMo to BERT
• BERT: Bidirectional Encoder Representations from Transformers
Devlin et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL. 2019
Transformer
Why transformer?
• A self-attention mechanism that directly models relationships between all words in a sentence
https://ai.googleblog.com/2017/08/transformer-novel-neural-network.html
Why transformer?
• Computation is done in parallel across positions: much faster and more space-efficient than recurrent models
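The core operation is scaled dot-product self-attention, a single parallel matrix computation; a toy NumPy version:

```python
import numpy as np

def self_attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(K.shape[-1])           # (seq, seq) pairwise scores
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)         # softmax over each row
    return weights @ V                                # weighted mix of value vectors

x = np.random.randn(5, 8)                 # 5 tokens, 8-dim embeddings (Q = K = V here)
print(self_attention(x, x, x).shape)      # (5, 8)
```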
BERT and BlueBERT
• Pre-training corpora:
  Corpus            Words   Domain
  PubMed abstracts  4,000M  Biomedical
  MIMIC-III         500M    Clinical
• Fine-tuning tasks: sentence similarity, named entity recognition, relation extraction, etc.
Outlines
• Word embedding
• ELMo
• BERT
• How to use pre-trained BERT
• Performance of BERT in medicine (BLUE Benchmark)
• Multi-task learning
Guiding questions: How did BERT's ideas gradually form? What is innovative? Why does it work so well?
How to use BERT – Sentence classification
• Assign tags or categories to text according to its content
• For example: organizing millions of cancer-related references from PubMed into the Hallmarks of Cancer
How to use BERT – Relation extraction
• Extract semantic relationships from text
How to use BERT – Sentence similarity
• Predict similarity scores for sentence pairs. For example:
  • "The above was discussed with the patient, and she voiced understanding of the content and plan."
  • "The patient verbalized understanding of the information and was satisfied with the plan of care."
How to use BERT – Named entity recognition
• Locate and classify named entity mentions in text into pre-defined categories
Available at https://github.com/ncbi-nlp/bluebert:
• Pre-trained models
• Fine-tuning code
• Preprocessed PubMed texts
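A sketch of loading a BlueBERT checkpoint with Hugging Face transformers; the model id below is an assumption — check the bluebert repository for the official names.

```python
from transformers import AutoTokenizer, AutoModel

# Assumed hub id for the PubMed+MIMIC-III BlueBERT base model.
model_id = "bionlp/bluebert_pubmed_mimic_uncased_L-12_H-768_A-12"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

enc = tokenizer("The patient denies chest pain.", return_tensors="pt")
out = model(**enc)
print(out.last_hidden_state.shape)   # contextual embeddings for each token
```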
Outlines
• Word embedding
• ELMo
• BERT
• How to use pre-trained BERT
• Performance of BERT in medicine (BLUE Benchmark)
• Multi-task learning
Guiding questions: How did BERT's ideas gradually form? What is innovative? Why does it work so well?
BLUE Benchmark
• Significant advances in pretrained language representations in the general domain: ELMo, BERT, Transformer-XL, XLNet
• The General Language Understanding Evaluation (GLUE) benchmark exists for the general domain, but no publicly available benchmark existed for biomedicine
Biomedical Language Understanding Evaluation (BLUE) benchmark:
• Contains a diverse range of text genres (biomedical literature and clinical notes)
• Highlights common biomedical text-mining challenges
• Promotes development of language representations for the biomedical domain
BLUE benchmark
[Table: BLUE tasks and datasets; some datasets are not publicly available, but permissions can be requested]
Results
[Figure: BLUE results by task type – sentence similarity, named entity recognition, relation extraction, document classification, and inference]
https://github.com/ncbi-nlp/BLUE
Outlines
• Word embedding
• ELMo
• BERT
• How to use pre-trained BERT
• Performance of BERT in medicine (BLUE Benchmark)
• Multi-task learning
Guiding questions: How did BERT's ideas gradually form? What is innovative? Why does it work so well?
Multi-task learning
• Multi-task learning (MTL) is a field of machine learning where multiple tasks are learned in parallel using a shared representation
• Increases the effective sample size for training the model, which can improve performance by increasing the generalization of the model
• Particularly helpful in applications such as medical informatics, where (labeled) datasets are hard to collect
• Also helpful when researchers face the hassle of choosing a suitable model for new problems with limited training resources
Multi-task model
mt-dnn (https://github.com/namisan/mt-dnn)
Training procedure
• Pretraining
  • BlueBERT: pretrained on PubMed and MIMIC-III
  • BioBERT: pretrained on PubMed
• Refining via multi-task learning: refine all layers in the model
• Fine-tuning MT-BERT: continue training all layers on each specific task
Test results
[Tables: test results on clinical tasks, biomedical tasks, and all eight BLUE tasks]
• Fine-tuning BERT (4 models)
• Refining via multi-task learning (1 model): refine all layers in the model
• Fine-tuning MT-BERT (4 models): continue training all layers on each specific task
u Word embeddings à ELMo à BERT
u Pre-trained BERT models
u How to use BERT
u Performance comparison and benchmark
u Multi-task learning
Summary
Resources
• https://github.com/ncbi-nlp/BioWordVec
• https://github.com/ncbi-nlp/BioSentVec
• https://github.com/ncbi-nlp/bluebert
• https://github.com/ncbi-nlp/BLUE
Acknowledgment
• BERT, ELMo, and mt-dnn
• Shared tasks and datasets: BIOSSES, MedSTS, BioCreative V chemical-disease relation task, ShARe/CLEF eHealth task, DDI extraction 2013 task, BioCreative VI ChemProt, i2b2 2010 shared task, Hallmarks of Cancer corpus
• This work was supported by the Intramural Research Programs of the National Library of Medicine, National Institutes of Health, and K99LM013001.
Thank you!
yip4002@med.cornell.edu
AIME 2020
Digital Phenotyping for Cohort
Discovery using Electronic Health
Records
Yanshan Wang
Assistant Professor of Biomedical Informatics
Division of Digital Sciences Research
Mayo Clinic
Why take this tutorial?
• Patient cohort retrieval is still labor-intensive today.
• Most information is embedded in unstructured EHRs.
• Natural language processing is under-utilized for cohort retrieval.
Goal of this tutorial
• To gain an understanding of basic concepts of cohort retrieval in the clinical domain.
• To connect NLP theory with clinical knowledge.
• To get an introduction to clinical use cases of cohort retrieval.
Suggested reading
• Books
Suggested reading
• Papers
• A review of approaches to identifying patient phenotype
cohorts using electronic health records. Shivade et al. 2013.
• Case-based reasoning using electronic health records
efficiently identifies eligible patients for clinical trials. Miotto
et al. 2015.
• A survey of practices for the use of electronic health
records to support research recruitment. Obeid et al. 2017.
• Clinical information extraction applications: a literature
review. Wang et al. 2018
• Using clinical natural language processing for health
outcomes research: Overview and actionable suggestions
for future advances. Velupillai et al. 2018.
Agenda
• Basic Concepts
• EHR, Phenotyping, Evidence-based Clinical
Research, Knowledge Base, Common Data
Model
• Patient Cohort Discovery
• Brief Introduction to NLP
• NLP for Cohort Discovery
Basic Concepts
• Electronic Health Record
• Phenotyping
• Evidence-based clinical research
• Knowledge bases
• Common Data Model
Basic Concepts
• Electronic Health Record
Basic Concepts
• Phenotyping
• The phenotype (as opposed to genotype, which is the set of
genes in our DNA responsible for a particular trait) is the
physical expression, or characteristics, of that trait.
• Phenotyping is the practice of developing algorithms
designed to identify specific phenomic traits within an
individual1.
• Digital phenotyping using EHRs
• Traditionally, clinical studies often use self-report
questionnaires or clinical staff to obtain phenotypes from
patients (slow, expensive, and does not scale).
• EHR data come in both structured and unstructured
formats, and the use of both types of information can be
essential for creating accurate phenotypes2.
1. eMERGE network.
2. Wei, W. Q., & Denny, J. C. (2015). Extracting research-quality phenotypes from electronic health
records to support precision medicine. Genome medicine, 7(1), 41.
[Figure: extracting research-quality phenotypes from both structured and unstructured (NLP) EHR data]
Source: Wei, W. Q., & Denny, J. C. (2015). Extracting research-quality phenotypes from electronic health records to support precision medicine. Genome medicine, 7(1), 41.
Evidence-based clinical research
• Observational studies
• Types of studies in epidemiology, such as the cohort study
and the case-control study.
• The investigators retrospectively assess associations
between the treatments given to participants and their
health status.
• Randomized control trials
• Clinical trials are prospective biomedical or behavioral
research studies on human participants that are designed
to answer specific questions about biomedical or behavioral
interventions including new treatments, such as novel
vaccines, drugs, and medical devices.
Basic Concepts
• Cohort/Eligibility Criteria
• Inclusion criteria
• Exclusion criteria
Example: https://clinicaltrials.gov/ct2/show/NCT03690193?cond=alzheimer%27s+disease&rank=5 (clinicaltrials.gov)
Basic Concepts
• Knowledge Bases
  • UMLS (Unified Medical Language System), including the Metathesaurus, Semantic Network, and Specialist Lexicon: used as a knowledge base and as a resource for a lexicon. The Metathesaurus provides the medical concept identifiers; the Semantic Network specifies the semantic categories for the medical concepts.
  • SNOMED-CT: standardized vocabulary of clinical terminology.
  • LOINC: standardized vocabulary for identifying health measurements, observations, and documents.
  • MeSH: NLM controlled vocabulary thesaurus used for indexing PubMed articles.
  • MedDRA: terminology specific to adverse events.
  • RxNorm: terminology specific to medications.
Basic Concepts
• Common Data Model
• Common Data Model (CDM) is a specification that
describes how data from multiple sources (e.g., multiple
EHR systems) can be combined. Many CDMs use a
relational database.
• Observational Medical Outcomes Partnership (OMOP)
CDM by Observational Health Data Sciences and
Informatics (OHDSI)
OMOP CDM v. 5.0
Source: https://www.ohdsi.org/data-standardization/the-common-data-model/
Why Natural Language Processing (NLP)?
Facts
• Artificial Intelligence (AI) is one of the most interesting fields of research today.
• The growth of and interest in AI is due to the recent advances in deep learning.
• Language is the most compelling manifestation of intelligence.
Natural Language Processing
• What is NLP?
  • "Natural language processing (NLP) is a subfield of computer science, information engineering, and artificial intelligence concerned with the interactions between computers and human (natural) languages, in particular how to program computers to process and analyze large amounts of natural language data." (Wikipedia)
Question Answering
[Examples: IBM Watson, voice assistants]
Source: https://www.youtube.com/watch?v=BkpAro4zIwU
Information Extraction
Example: "The patient's maternal grandmother was diagnosed with breast cancer at age 59 and passed away at age 80."
Entity normalization: "The patient's FAMILY_MEMBER was diagnosed with CONDITION at age AGE and LIVING_STATUS at age AGE."
A dependency parser then yields the structured output:
  Family Member: maternal grandmother
  Condition: breast cancer
  Age: 59
  Living Status: deceased
  Age: 80
Sentiment Analysis
Reviews:
■ "nice and compact to carry!"
■ "since the camera is small and light, I won't need to carry around those heavy, bulky professional cameras either!"
■ "the camera feels flimsy, is plastic and very light in weight; you have to be very delicate in the handling of this camera"
Attributes: zoom, affordability, size and weight, flash, ease of use
Per-review sentiment toward "size and weight": ✓ ✓ ✗
Source: https://web.stanford.edu/~jurafsky/NLPCourseraSlides.html
Information Retrieval
[Example: search engine result screenshots]
How to represent Natural Language
• Natural language text = sequences of discrete symbols (e.g., words).
• Vector representations of words: Vector Space Model
• Bag-of-words with a vocabulary list (I, love, NLP, and, like, dogs); each word of "I love NLP and like dogs" maps to a one-hot vector:
  I    = [1 0 0 0 0 0]
  love = [0 1 0 0 0 0]
  NLP  = [0 0 1 0 0 0]
  like = [0 0 0 0 1 0]
How to represent Natural Language
• Drawbacks of this sparse representation:
  • love = [0,1,0,0,0,0] AND like = [0,0,0,0,1,0] = 0!
  • Using such a representation, there is no meaningful (semantic) comparison we can make between words.
How to represent Natural Language
• With AI models, we learn the "meaning" of a word using dense semantic representations / word embeddings
  • Learning semantic representations from data (a corpus).
  • Simply by examining a large corpus, it is possible to learn word vectors that capture the semantics and relationships between words in a surprisingly expressive way.
  I    = [0.99 0.05 0.1 0.87 0.1 0.1]
  love = [0.1 0.85 0.99 0.1 0.83 0.09]
  NLP  = [0.67 0.23 0.01 0.02 0.01 0.81]
  like = [0.1 0.73 0.99 0.05 1.79 0.09]
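The contrast between the two representations is easy to verify numerically (toy numbers from the slides above):

```python
import numpy as np

# One-hot vectors: no similarity signal at all.
love_onehot = np.array([0, 1, 0, 0, 0, 0])
like_onehot = np.array([0, 0, 0, 0, 1, 0])
print(love_onehot @ like_onehot)        # 0: "love" and "like" look unrelated

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Dense embeddings: similar words get similar vectors.
love = np.array([0.1, 0.85, 0.99, 0.1, 0.83, 0.09])
like = np.array([0.1, 0.73, 0.99, 0.05, 1.79, 0.09])
print(round(cosine(love, like), 3))     # high cosine similarity
```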
NLP in AI is All About Learning a Better Representation of Language
• Images are easily represented by RGB values
Source: https://analyticsindiamag.com/computer-vision-primer-how-ai-sees-an-image/
• Language is much harder…
• “The weather’s looking gloomy today. I better wear
my trusty rubber boots!”
• “The weather’s looking gloomy today. I’m going to
stay inside.”1
• Will, will Will will Will Will's will? – Will (a person),
will (future tense helping verb) Will (a second
person) will (bequeath) [to] Will (a third person)
Will's (the second person) will (a document)?
(Someone asked Will 1 directly if Will 2 plans to
bequeath his own will, the document, to Will 3)2
Source: 1. Carlson L. Moral and Linguistic Perspectives on Pain and Suffering in Doctor-Patient Discourse. UMN Thesis.
2. Han, Bianca-Oana (2015). "On Language Peculiarities: when language evolves that much that speakers find it strange" (PDF). Philologia
(18): 140. ISSN 1582-9960. Archived (PDF) from the original on 14 October 2015.
Learning Better Representations ("Representation Learning")
Language → Features → Better Word Representation
Source: https://www.datasciencecentral.com/profiles/blogs/overview-of-artificial-intelligence-and-role-of-natural-language
NLP for Patient Cohort Discovery
Clinical Research Pathway
Research Question → Protocol Design → Feasibility → Identify Patients → Invite Patients → Pre-screening → Consent → Analysis → Report
Clinical Trials Eligibility Screening and Recruitment
• Clinical trials recruitment
  • Randomized clinical trials are fundamental to the advancement of medicine. However, patient recruitment for clinical trials remains the biggest barrier to clinical and translational research.
  • 20% of cancer patients are eligible¹, yet <5% of cancer patients participate¹, and 85% of clinical trials fail to retain enough patients².
1. Haddad TC, Helgeson J, Pomerleau K, Makey M, Lombardo P, Coverdill S, Urman A, Rammage M, Goetz MP, LaRusso N. Impact of a cognitive computing clinical trial matching system in an ambulatory oncology practice. American Society of Clinical Oncology; 2018.
2. Cote DN. Minimizing Trial Costs by Accelerating and Improving Enrollment and Retention. Global Clinical Trials for Alzheimer's Disease: Elsevier; 2014. p. 197-215.
[Figure: number of clinical trials between 2007 and 2010 at leading academic medical centers, ranging from 268 (Mayo Clinic) down to 68 (Case Western Reserve University)]
Source: Chen et al. Publication and reporting of clinical trial results: cross sectional analysis across academic medical centers. BMJ. 2016
NLP for Clinical Trials Eligibility Screening
EHR (all patients) + clinical trial criteria → Natural Language Processing → eligible patients → recruit
Goal: expedite patient screening and increase patient recruitment rates.
A Real-World Project
Example: Clinical trials eligibility
screening for GERD
Identify a cohort of patients with and without chronic reflux using the definitions spelled out below. We wish to
test people with and without chronic reflux as our working hypothesis is that the prevalence of Barrett's
esophagus is comparable between those with and without chronic reflux.
Inclusion criteria :
1. Age greater than 50 years.
2. Gastroesophageal reflux disease. This can be defined using ICD-9 or ICD-10 codes. Additional criteria which
could be used to define GERD broadly are chronic (> 3 mo) use of a proton pump inhibitor (drug names include
omeprazole, esomeprazole, pantoprazole, rabeprazole, dexlansoprazole, lansoprazole) or a H2 receptor blocker
(ranitidine, famotidine, cimetidine). Prior endoscopic diagnosis of erosive esophagitis can also be used to make
a diagnosis of GERD.
3. Male gender
4. Obesity defined as body mass index greater than equal to 30. This is a surrogate marker for central obesity.
5. Current or previous history of smoking
6. Family history of esophageal adenocarcinoma/cancer or Barrett's esophagus
Exclusion criteria
1. Previous history of esophageal adenocarcinoma/cancer or Barrett's esophagus, previous history of
endoscopic ablation for Barrett's esophagus.
2. Previous history of esophageal squamous cancer or squamous dysplasia.
3. Treatment with oral anticoagulation including warfarin/Coumadin.
4. History of cirrhosis or esophageal varices
5. History of Barrett’s esophagus : this can be defined with ICD 9/10 codes.
6. History of endoscopy (will need to use a procedure code for EGD) in the last 5 years.
Criteria                                                        ICD-9   ICD-10                       CPT-4                 Medication
Inclusion
1. Age greater than 50 years.
2. Gastroesophageal reflux disease (any of 2.1, 2.2, 2.3)
  2.1 GERD defined by Dx                                        530.81  K21.9
  2.2 GERD defined by drug, duration of use >= 3 months
      over the last 5 years                                                                                                omeprazole, esomeprazole, pantoprazole, rabeprazole, dexlansoprazole, lansoprazole, ranitidine, famotidine, cimetidine
  2.3 GERD defined by prior endoscopic diagnosis of
      erosive esophagitis                                       530.19  K21.0                        (no specific code for esophagitis)
3. Male gender
4. Obesity defined as body mass index >= 30.
5. Current or previous history of smoking
6. Family history of esophageal adenocarcinoma/cancer or Barrett's esophagus
7. Caucasian
Exclusion
1. Previous history of esophageal adenocarcinoma/cancer         150.9   C15.9
2. Previous history of endoscopic ablation for Barrett's
   esophagus                                                                                         43229, 43270, 43228, 43258
3. Previous history of esophageal squamous carcinoma
   (included in 1)                                              150.9   C15.9
4. Previous history of esophageal squamous dysplasia            622.10  N87.9
5. Current treatment with oral anticoagulation – warfarin                                                                  warfarin
6. Current treatment with oral anticoagulation – Coumadin
   (included in 5)                                                                                                         Coumadin
7. History of cirrhosis                                         571.5   K74.60
8. History of esophageal varices                                456.20  I85.00
9. History of Barrett's esophagus                               530.85  K22.7, K22.710, K22.711, K22.719
10. History of endoscopy in the last 5 years                                                         43235-43270
NLP-based Digital Phenotyping Algorithm
Screening workflow (i2b2 + NLP):
1. Screen patients by inclusion criteria 1, 3, 4, 7 and all exclusion criteria using i2b2 → patient set A (n=31,749)
2. From set A, screen by inclusion criterion 2.1 using i2b2 → patient set B (n=8,667)
3. From set A, screen by inclusion criterion 2.2 using i2b2 → patient set C (n=1,577)
4. From set A, screen by inclusion criteria 2.3, 5, 6 using ACE and NLP → patient set D (n=230)
5. Union of patient sets B, C, and D → patient set E (n=9,080)
Architecture of Current Solutions
Structured data → collating results → postprocessing using unstructured data → user interface (visualization, analytics, reporting, etc.)
An Integrated Framework
Structured data + unstructured data → collating results → user interface (visualization, analytics, reporting, etc.)
Information Retrieval for Cohort
Discovery
• Cohort retrieval is similar to modern search
engines.
CREATE (Cohort Retrieval Enhanced by the Analysis of Text from Electronic health records)
• Structured data flow: structured EHR data (ICD-9/10, CPT, SNOMED CT, ...) is transformed and loaded into a structured index, which structured queries run against.
• Unstructured data flow: clinical texts are processed with NLP into unstructured concepts and indexed; a full-text query (e.g., "Adults with inflammatory bowel disease (ulcerative colitis or Crohn's disease)") is parsed/edited by the end user.
• Both flows draw on the EHR clinical data repository (CDR); the combined queries produce a filtered cohort, and machine learning ranks the relevant cohort for the end user.
Liu et al. CREATE: Cohort Retrieval Enhanced by Analysis of Text from Electronic Health Records using OMOP Common Data Model. 2019.
Another Way of Thinking of Cohort Retrieval: Patient Representation

Patient representation (illustrative values):
Patient 1: -0.0011 -0.0008 -0.0050 ...
Patient 2: 0.0108 -0.0194 0.0101 ...
Patient 3: -0.0433 0.0361 0.0272 ...
Patient 4: -0.0935 0.0655 ...

[Diagram: AI encodes each patient's EHR into a vector; the clinical trial is embedded in the same space, and a similarity measurement between the trial and patient vectors identifies target eligible patients. A sketch follows below.]
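A minimal sketch of the similarity measurement step, assuming patient and trial embeddings already live in the same vector space (all numbers are illustrative):

```python
import numpy as np

# Hypothetical embeddings: one row per patient, same space as the trial vector.
patients = np.array([
    [-0.0011, -0.0008, -0.0050],
    [ 0.0108, -0.0194,  0.0101],
    [-0.0433,  0.0361,  0.0272],
])
trial = np.array([0.0100, -0.0200, 0.0100])  # embedded eligibility criteria

# Cosine similarity between the trial and every patient.
sims = patients @ trial / (np.linalg.norm(patients, axis=1) * np.linalg.norm(trial))
ranking = np.argsort(-sims)  # most similar (most likely eligible) patients first
print(ranking, sims[ranking])
```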
Unsupervised Machine Learning for Patient Representation using EHRs
Poisson Dirichlet Model (PDM): an unsupervised generative probabilistic machine learning model.
[Figure: plate diagrams comparing Latent Dirichlet Allocation (LDA) and the Poisson Dirichlet Model (PDM).]
Wang et al. Unsupervised Machine Learning for the Discovery of Latent Disease Clusters and Patient Subgroups Using Electronic Health Records. Journal of Biomedical Informatics. 2019.
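PDM itself is not available in common toolkits, but the general topic-model recipe can be illustrated with gensim's standard LDA, which turns each patient's bag of extracted clinical concepts into a topic-proportion vector usable as a patient representation (toy data, substituted method):

```python
from gensim import corpora, models

# Each "document" is one patient's bag of extracted clinical concepts (toy data).
patients = [
    ["osteoporosis", "fracture", "vitamin_d", "calcium"],
    ["copd", "bronchiectasis", "dyspnea", "smoking"],
    ["dementia", "delirium", "confusion", "memory_loss"],
]

dictionary = corpora.Dictionary(patients)
corpus = [dictionary.doc2bow(p) for p in patients]

lda = models.LdaModel(corpus, num_topics=3, id2word=dictionary, random_state=0)

# Topic-proportion vector for patient 0: a dense patient representation
# that can feed the similarity measurement shown earlier.
print(lda.get_document_topics(corpus[0], minimum_probability=0.0))
```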
Unsupervised Machine Learning for Patient Representation using EHRs
[Diagram: AI applied to EHRs supports discovering disease clusters, discovering patient subgroups (e.g., diabetes comorbidities), enhancing disease risk prediction, discovering new underlying disease mechanisms, and ultimately personalized care, diagnosis, treatment, and prevention.]
Disease Clusters
[Figure: latent disease clusters learned by LDA vs. PDM for the Osteoporosis, Delirium/Dementia, and COPD/Bronchiectasis cohorts.]
Patient Subgroups
[Figure: patient subgroups identified within the Osteoporosis, Delirium/Dementia, and COPD/Bronchiectasis cohorts.]
NLP for Cohort Discovery Is All About Learning a Better Representation of the Patient
Research Collaborations
Thank you!
Q&A
Wang.Yanshan@mayo.edu
Advances of Natural Language Processing
in Clinical Research
Rui Zhang, Ph.D.
Associate Professor and McKnight Presidential Fellow
Institute for Health Informatics, Department of Pharmaceutical Care & Health
Systems, and Data Science
University of Minnesota, Twin Cities
August 25, 2020
Outline
• Part 1: NLP for Dietary Supplement Clinical
Research
• Part 2: Information Extraction in EHRs and Clinical
Trials
Clinical Research Informatics (CRI)
• CRI involves the use of informatics in the discovery and
management of new knowledge relating to health and
disease.
• It includes management of information related to clinical
trials and also involves informatics related to secondary
research use of clinical data.
• It involves approaches to collect, process, analyze, and
display health care and biomedical data for research.
Leveraging Big Data for Pharmacovigilance
[Figure: big data analytics pipeline for pharmacovigilance.]
https://knowledgent.com/whitepaper/big-data-enabling-better-pharmacovigilance/
Leveraging NLP in Healthcare Analytics
[Diagram: NLP (extract, classify, summarize) is applied to biomedical literature, clinical notes, and social media. From clinical notes it surfaces adverse events, substance use, family history, and medical history; from the literature it extracts biomedical knowledge as subject-predicate-object triples; from social media it yields pharmacovigilance signals (drug/supplement - adverse events). The outputs serve healthcare providers and clinical researchers.]
Part 1: NLP for Dietary Supplement Clinical Research
1R01AT009457 (PI: Rui Zhang)
• Integrated DS Knowledge Base (iDISK)
• Expanding DS terminology
• Detecting DS safety signals in clinical notes
• Mining biomedical literature to discover DSIs
• Active learning to reduce annotation costs
• Detecting DS safety signals on Twitter
Data sources: online resources, clinical notes, literature, social media
Introduction to Dietary Supplements
• Dietary supplements
Ø Herbs, vitamins, minerals, probiotics, amino acids, others.
• Use of supplements increasing
Ø More than half of U.S. adults take dietary supplements (Centers
for Disease Control and Prevention)
Ø One in six U.S. adults takes a supplement simultaneously with
prescription medications
Ø Sales over $6 billion per year in U.S. (American Botanical
Council, 2014)
https://nccih.nih.gov/health/supplements
Use of complementary and alternative medicine by children in Europe: Published data and expert perspectives. Complement
Ther Med. 2013 4;21.
Kaufman, Kelly, JAMA. 2002;287(3):337-344.
Dietary Supplement Use Among U.S. Adults Has Increased Since NHANES III (1988–1994). CDC, Nov 4, 2014.
Safety of Dietary Supplements
• Doctors are often poorly informed about supplements
Ø 75.5% of 1,157 clinicians
• Supplements are NOT always safe
Ø An average of 23,000 emergency department visits per year for supplement adverse events
Ø Drug-supplement interactions (DSIs)
• Concomitant administration of supplements and drugs
increases risks of DSIs
• Example: Docetaxel & St John’s Wort (hyperforin component
induces docetaxel metabolism via P450 3A4)
Kaufman, Kelly, JAMA. 2002;287(3):337-344.
Geller et al. New England J Med. 2015; 373:1531-40.
Gurley BJ. Molecular nutrition & food research. 2008, 52(7):772-9.
Regulation for Dietary Supplements
• Regulated by Dietary Supplement Health and Education
Act of 1994 (DSHEA)
Ø Different regulatory framework from prescription and over-
the-counter drugs
Ø Safety testing and FDA approval NOT required before
marketing
Ø Postmarketing reporting only required for serious adverse
events (hospitalization, significant disability or death)
Department of Health and Human Services, Food and Drug Administration. New dietary ingredients in dietary supplements —
background for industry. March 3, 2014
Dietary Supplement and Nonprescription Drug Consumer Protection Act. Public Law 109-462, 120 Stat 4500.
Limited Supplements Research
• Supplement safety research is limited
Ø Not required for clinical trials
Ø Not found until new supplement is on the market
Ø Voluntary adverse events reporting underestimates
the safety issues
Ø Pharmacy studies focus only on specific supplements
Ø DSI documentation is limited due to less rigorous
regulatory rules on supplements
Informatics and AI for Supplements Safety Research
• Online resources
Ø Provide DS knowledge across various resources
Ø Informatics methods are needed to standardize and integrate this knowledge
• Electronic health records
Ø EHR provides patient data for supplement use
Ø Detailed supplements usage information documented in
clinical notes
• Biomedical literature
Ø Contains pharmacokinetics and pharmacodynamics knowledge
Ø Discover undefined pathways for DSIs
Ø Find potential DSIs by linking information
Informatics and AI for Supplements Safety Research
• Social media
Ø Contains customers' DS use experiences
Ø Discover their information needs
• Adverse Event Reporting System (CAERS)
Ø Contains reported AEs
Ø A good resource to mine DS-AE signals
Challenges for Supplement Clinical Research
• No standardized and consistent DS knowledge
representation
• Lexical variations of supplements in clinical notes
• Detailed usage information related to supplements
• Differentiating adverse events from purpose of use
1.1. Supplement Knowledge Base
To generate an integrated and standardized DS knowledge base
Rizvi R, et al. AMIA CRI (student paper competition finalist) 2018.
JAMIA 2019. doi: 10.1093/jamia/ocz216
iDISK Development
q To build a one-stop Integrated DIetary Supplement Knowledge base (iDISK)
q DS-related content is represented in consistent and standardized forms
JAMIA 2019. doi: 10.1093/jamia/ocz216
DSLD, Dietary Supplement Label Database; MSKCC, Memorial Sloan Kettering
Cancer Center; NHP, Natural Health Products Database; NMCD, Natural Medicines
Comprehensive Database.
iDISK data model
Alfalfa in iDISK
• Evaluation showed that iDISK achieved high accuracy (98.5%-100%) across all data elements
iDISK Statistics
iDISK vs UMLS on DS Coverage
iDISK: 41,628 unique DS ingredient names
UMLSDistilled : Only with certain semantic types (Nucleic Acid, Nucleoside, or Nucleotide,
Organic Chemical, Pharmacologic Substance, Vitamin, Bacterium, Fish, Fungus, Plant, or
Food, etc)
UMLSDS: select all concepts using Parent-Children relationship under the “Dietary
Supplements” (C0242295) and “Vitamin” (C0042890) concepts.
| Matched against | iDISK element | Exact match (%) | +luiNorm (+%) | Total (%) | UMLS concepts |
| UMLS | Atoms | 27,992 (45.7%) | +550 (+0.9%) | 28,542 (46.6%) | 10,716 |
| UMLS | Unique terms | 12,744 (30.6%) | +474 (+1.1%) | 13,218 (31.7%) | |
| UMLSDistilled | Atoms | 27,553 (45.0%) | +524 (+0.9%) | 28,077 (45.9%) | 8,684 |
| UMLSDistilled | Unique terms | 12,397 (29.8%) | +450 (+1.0%) | 12,847 (30.8%) | |
| UMLSDS | Atoms | 12,096 (19.7%) | +407 (+0.7%) | 12,503 (20.4%) | 5,817 |
| UMLSDS | Unique terms | 4,899 (11.8%) | +308 (+0.7%) | 5,207 (12.5%) | |
Evaluation on a DS NER Task
Identifying 3,710 DS entities in 351 abstracts.

| Evaluation criterion | QuickUMLS installation | Precision | Recall | F1 |
| Lenient | UMLS | 0.08 | 0.91 | 0.15 |
| Lenient | UMLSDistilled | 0.25 | 0.89 | 0.39 |
| Lenient | UMLSDS | 0.32 | 0.86 | 0.46 |
| Lenient | iDISK | 0.51 | 0.82 | 0.63 |
| Lenient | Union | 0.32 | 0.91 | 0.48 |
| Strict | UMLS | 0.05 | 0.67 | 0.10 |
| Strict | UMLSDistilled | 0.19 | 0.69 | 0.30 |
| Strict | UMLSDS | 0.22 | 0.61 | 0.33 |
| Strict | iDISK | 0.43 | 0.69 | 0.53 |
| Strict | Union | 0.23 | 0.77 | 0.36 |
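This kind of dictionary-based matching can be reproduced with the QuickUMLS Python API; a sketch assuming a locally built QuickUMLS installation (the path and threshold below are placeholders):

```python
from quickumls import QuickUMLS

# Path to a locally built QuickUMLS data directory (hypothetical).
matcher = QuickUMLS("/path/to/quickumls_data", threshold=0.7)

text = "Patient takes black cohosh and melatonin for sleep."
for candidates in matcher.match(text, best_match=True, ignore_syntax=False):
    for c in candidates:
        # Each candidate carries the matched span ("ngram"), the UMLS CUI,
        # and a similarity score against the dictionary term.
        print(c["ngram"], c["cui"], round(c["similarity"], 2))
```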
1.2. Expanding Supplement Terminology

Objective
• To apply word embedding models to expand the terminology of DS found in clinical notes: semantic variants, brand names, misspellings
• Word embeddings
  • Reveal hidden relationships between words (similarity and relatedness)
  • More efficient; can be trained on large amounts of unannotated data
Supplements studied: calcium, chamomile, cranberry, dandelion, flaxseed, garlic, ginger, ginkgo, ginseng, glucosamine, lavender, melatonin, turmeric, valerian
Method Overview

Model Training
• Corpus size
• Hyperparameter tuning (see the training sketch below)
  • Window size (i.e., 4, 6, 8, 10, and 12)
  • Vector size (i.e., 100, 150, 200, 250)
• GloVe trained on the same corpus
  • Window size and vector size
• Optimal parameters were chosen based on human annotation (intrinsic evaluation)
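A minimal gensim (4.x) sketch of the training-and-expansion loop; the corpus is a toy placeholder for tokenized clinical-note sentences, and the hyperparameters are drawn from the tuning grid above:

```python
from gensim.models import Word2Vec

# Placeholder for tokenized clinical-note sentences.
sentences = [
    ["pt", "taking", "melatonin", "for", "sleep"],
    ["try", "melotonin", "nightly", "for", "insomnia"],
    ["continue", "alteril", "as", "needed", "for", "sleep"],
] * 200  # repeated only so the toy corpus clears min_count

model = Word2Vec(
    sentences,
    vector_size=200,  # from the tuning grid: 100, 150, 200, 250
    window=8,         # from the tuning grid: 4, 6, 8, 10, 12
    min_count=5,
    sg=1,             # skip-gram
    workers=4,
)

# Expand a supplement query with its nearest neighbors; on real notes these
# surface misspellings, brand names, and semantic variants.
print(model.wv.most_similar("melatonin", topn=10))
```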
Results: Query Expansion Examples

Black cohosh
• word2vec expansions: misspellings (black kohosh, black kohash); brand names (Remifemin, Estroven, Estrovan, estraven, icool, amberen, amberin, Estrovera, EstroFactor)
• Example notes: "Please try black cohash or Estroven for hot flashes." "Pt has discontinued Remifemin but still has symptoms." "Recommend Estroven trial for symptoms of menopause."

Turmeric
• word2vec expansions: misspelling (tumeric)
• Example notes: "Pt emailed wondering about taking Tumeric." "Patient states that she sometimes takes the supplements Tumeric."

Folic acid
• word2vec expansions: brand names (Folgard, Folbic); other name (folate)
• Example notes: "Patient is willing to try Folgard if ok with provider." "Patient is on folate and does not smoke."

Valerian
• word2vec expansions: misspelling (velarian); brand names (myocalm pm, somnapure)
• Example notes: "Taking Velarian root and benadryl as well." "I would recommend moving to 6mg dose first, then trying somnapure if still not helping."

Melatonin
• word2vec expansions: misspellings (melantonin, melotonin); brand names (alteril, neuro sleep)
• Example notes: "Can try melantonin for sleep aid." "Try alteril - it is over the counter sleep aid. Let me know if this is not better over the next few weeks."
Results: Comparison of Base and Expanded Queries

Results: Comparison of Word-Embedding-Expanded versus External-Resource-Expanded Queries
1.3. Detecting DS Indications and Adverse Event Signals in Clinical Texts
• Clinical notes document information related to patient safety
• AEs
  • "Patient gets headaches with black cohosh"
• Indications
  • "Presently, patient is taking black cohosh for night sweats and hot flashes"
• Temporal relationship between medical events
  • "Also headaches did start shortly after starting black cohosh"
• DS safety surveillance
  • NLP for medical concept and relation extraction
Fan et al. J Am Med Inform Assoc. 2020
Objectives
• To demonstrate the feasibility of deep learning models
applied to clinical notes to facilitate discovery of DS
safety knowledge
• To evaluate different deep learning (e.g., pre-trained
BERT) models on annotated DS-specific clinical corpora
Results of NER Models
7,000 sentences covering 7 DS were randomly selected. DS entities include generic names, brand names, abbreviations, and misspellings. P = precision, R = recall, Num = number of entities.

| Model | Entity type | P | R | F1 | Num |
| CRF | DS | 0.900 ± 0.00 | 0.791 ± 0.00 | 0.842 ± 0.00 | 1247 |
| CRF | Symptom | 0.714 ± 0.00 | 0.567 ± 0.00 | 0.632 ± 0.00 | 356 |
| CRF | Overall (micro) | 0.861 ± 0.00 | 0.741 ± 0.00 | 0.797 ± 0.00 | 1603 |
| Bi-LSTM-CRF (word only) | DS | 0.905 ± 0.002 | 0.854 ± 0.007 | 0.879 ± 0.003 | 1247 |
| Bi-LSTM-CRF (word only) | Symptom | 0.812 ± 0.015 | 0.825 ± 0.007 | 0.818 ± 0.009 | 356 |
| Bi-LSTM-CRF (word only) | Overall (micro) | 0.884 ± 0.004 | 0.847 ± 0.003 | 0.865 ± 0.003 | 1603 |
| Bi-LSTM-CRF (char lstm) | DS | 0.900 ± 0.006 | 0.860 ± 0.002 | 0.879 ± 0.003 | 1247 |
| Bi-LSTM-CRF (char lstm) | Symptom | 0.806 ± 0.008 | 0.837 ± 0.011 | 0.822 ± 0.008 | 356 |
| Bi-LSTM-CRF (char lstm) | Overall (micro) | 0.877 ± 0.003 | 0.855 ± 0.003 | 0.866 ± 0.002 | 1603 |
| Bi-LSTM-CRF (char cnn) | DS | 0.905 ± 0.006 | 0.864 ± 0.004 | 0.884 ± 0.003 | 1247 |
| Bi-LSTM-CRF (char cnn) | Symptom | 0.847 ± 0.018 | 0.845 ± 0.007 | 0.846 ± 0.011 | 356 |
| Bi-LSTM-CRF (char cnn) | Overall (micro) | 0.892 ± 0.006 | 0.860 ± 0.003 | 0.876 ± 0.004 | 1603 |
| Clinical BERT | DS | 0.931 ± 0.002 | 0.845 ± 0.002 | 0.886 ± 0.002 | 1247 |
| Clinical BERT | Symptom | 0.836 ± 0.014 | 0.840 ± 0.007 | 0.838 ± 0.008 | 356 |
| Clinical BERT | Overall (micro) | 0.908 ± 0.003 | 0.845 ± 0.002 | 0.875 ± 0.001 | 1603 |
| BERT | DS | 0.931 ± 0.005 | 0.850 ± 0.003 | 0.889 ± 0.003 | 1247 |
| BERT | Symptom | 0.860 ± 0.010 | 0.854 ± 0.006 | 0.857 ± 0.004 | 356 |
| BERT | Overall (micro) | 0.914 ± 0.007 | 0.851 ± 0.003 | 0.881 ± 0.003 | 1603 |
Results for Relation Extraction
3,000 sentences (200 sentences each) covering 15 DS: black cohosh, chamomile, cranberry, dandelion, folic acid, garlic, ginger, ginkgo, ginseng, glucosamine, green tea, lavender, melatonin, milk thistle, and saw palmetto.

| Model | Relation | P | R | F1 | Num |
| Random Forest | Positive | 0.835 ± 0.002 | 0.939 ± 0.003 | 0.884 ± 0.002 | 336 |
| Random Forest | Negative | 0.782 ± 0.009 | 0.716 ± 0.007 | 0.747 ± 0.006 | 109 |
| Random Forest | Not related | 0.825 ± 0.011 | 0.438 ± 0.006 | 0.572 ± 0.005 | 69 |
| Random Forest | Overall (micro) | 0.823 ± 0.003 | 0.824 ± 0.002 | 0.813 ± 0.002 | 514 |
| CNN | Positive | 0.937 ± 0.013 | 0.936 ± 0.031 | 0.936 ± 0.010 | 336 |
| CNN | Negative | 0.804 ± 0.057 | 0.926 ± 0.021 | 0.859 ± 0.026 | 109 |
| CNN | Not related | 0.824 ± 0.095 | 0.634 ± 0.060 | 0.721 ± 0.040 | 69 |
| CNN | Overall (micro) | 0.899 ± 0.013 | 0.896 ± 0.016 | 0.890 ± 0.016 | 514 |
| Att-BLSTM | Positive | 0.913 ± 0.011 | 0.967 ± 0.017 | 0.939 ± 0.004 | 336 |
| Att-BLSTM | Negative | 0.869 ± 0.035 | 0.861 ± 0.063 | 0.863 ± 0.024 | 109 |
| Att-BLSTM | Not related | 0.876 ± 0.028 | 0.798 ± 0.009 | 0.826 ± 0.007 | 69 |
| Att-BLSTM | Overall (micro) | 0.897 ± 0.006 | 0.899 ± 0.005 | 0.893 ± 0.004 | 514 |
Positive relationships (indication): 18,348 pairs

| Entity pair | In NMCD | Sentence |
| Vitamin C, Wound | ✓ | Starting mv, Vitamin C and zinc for wound healing. |
| Fish oil, Hyperlipidemia | ✓ | Patient has history of hyperlipidemia which was until recently well-controlled with fish oil and simvastatin. |
| Peppermint, Nausea | ✓ | He has much less nausea with peppermint oil and marijuana. |
| Vitamin E, Scar | ✓ | Vitamin E po apply 1 capsule daily as needed to scar on forehead. |
| Fish oil, Pain | ✓ | I suggested that she could try daily fish oil which may help the breast pain when it is taken for at least a month or two and could use iburprofen and heat for the pain as well. |
| Psyllium, Constipation | ✓ | Patients states she takes psyllium powder daily for constipation, and needs refills. |
| Vitamin C, UTI | ✓ | Patient with hx recurrent utis, on vitamin c for urinary acidification |
| Fish oil, Anxiety | ✗ | I encourage over the counter multi vitamin and fish oil pills, as they can help improve some anxiety and depression symptoms. |
| Peppermint, Pain | ✓ | She also has experienced pain relief when rubbing peppermint essential oil on the low back. |
Negative relationships (adverse event): 13,130 pairs

| Entity pair | In NMCD | Sentence |
| Niacin, Rash | ✗ | Lisinopril causes a cough and niacin causes a rash. |
| Niacin, Flushing | ✗ | She was having significant flushing with niacin, so she discontinued this about 6 months ago. |
| Niacin, Hives | ✗ | Patient stating reaction to niacin is hives though has used mvi in past without issues. |
| Fish oil, Rash | ✓ | Pt states when she takes fish oil tablets she get a small rash on her chin. |
| Fish oil, Vomiting | ✗ | Also, discussed vomiting with fish oil caps because she bit into them- would not pursue further at this time. |
| Vitamin C, Nausea | ✗ | She did have 1 or 2 episodes of nausea related to taking delayed-release vitamin c for wound healing |
| Niacin, GI disturbance | ✗ | Allergen reactions: niacin: gi disturnbance; simvastation: cramps. |
| Fish oil, Diarrhea | ✓ | Discussed titrating back up on fish oil as he tolerates, previously has been causing a lot of diarrhea so going slow. |
1.4. Mining Biomedical Literature to Discover
Drug-Supplement Interactions (DSIs)
http://www.wsj.com/articles/what-you-should-know-about-how-your-supplements-interact-with-prescription-drugs-1456777548
Researchers at the University of Minnesota in
Minneapolis are exploring interactions between
cancer drugs and dietary supplements, based on
data extracted from 23 million scientific
publications, according to lead author Rui
Zhang, a clinical assistant professor in health
informatics. In a study published last year by a
conference of the American Medical
Informatics Association, he says, they identified
some that were previously unknown.
Objective
• Explore potential DSIs by linking knowledge
extracted from biomedical literature
Literature-based Discovery

"We have shown that ECHINACEA preparations and some common alkylamides weakly inhibit several cytochrome P450 (CYP) isoforms, with considerable variation in potency." (19790031)
→ Echinacea - INHIBITS - CYP450

"Tamoxifen and toremifene are metabolised by the cytochrome p450 enzyme system, and raloxifene is metabolised by glucuronide conjugation." (12648026)
→ CYP450 - INTERACTS_WITH - Toremifene

Named entity recognition (NER) and relation extraction over big data (29 million abstracts) produce predication pairs; linking supplement-gene pairs (X-Y) with gene-drug pairs (Y-Z) infers candidate supplement-drug interactions (X-Z), e.g.:
Echinacea - <Potentially Interacts With> - Toremifene
(A join sketch follows below.)
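The X-Y / Y-Z linkage amounts to a join over extracted predications on the shared gene; a self-contained sketch using the slide's example triples:

```python
from collections import defaultdict

# Supplement -> gene/enzyme predications extracted from the literature.
supp_gene = [("Echinacea", "INHIBITS", "CYP450"),
             ("Kava preparation", "STIMULATES", "CYP3A4")]

# Gene/enzyme -> drug predications.
gene_drug = [("CYP450", "INTERACTS_WITH", "Toremifene"),
             ("CYP3A4", "INTERACTS_WITH", "Docetaxel")]

# Index drug predications by the shared gene, then join.
by_gene = defaultdict(list)
for gene, pred, drug in gene_drug:
    by_gene[gene].append((pred, drug))

for supp, pred1, gene in supp_gene:
    for pred2, drug in by_gene[gene]:
        print(f"{supp} -<potentially interacts with>- {drug} "
              f"(via {pred1} {gene} {pred2})")
```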
Results: Selected Interactions

| Supplement | Predicate | Gene/Gene class | Predicate | Drug | Known |
| Echinacea | INH | CYP450 | INT | Docetaxel | Y |
| Echinacea | INH | CYP450 | INT | Toremifene | N |
| Echinacea | STI | CYP1A1 | INT | Exemestane | N |
| Grape seed extract | INH | CYP3A4 | INT | Docetaxel | N |
| Kava preparation | STI | CYP3A4 | INT | Docetaxel | Y |

INH, INHIBITS; STI, STIMULATES; INT, INTERACTS_WITH
Echinacea: fights the common cold and viral infections
Grape seed extract: cardiac conditions
Kava: treats sleep problems, relieves anxiety and stress
Results: Selected Predications

| Semantic predication | Citation |
| Echinacea INHIBITS CYP450 | We have shown that ECHINACEA preparations and some common alkylamides weakly inhibit several cytochrome P450 (CYP) isoforms, with considerable variation in potency. (19790031) |
| Grape seed extract INHIBITS CYP3A4 | Four brands of GSE had no effect, while another five produced mild to moderate but variable inhibition of CYP3A4, ranging from 6.4% by Country Life GSE to 26.8% by Loma Linda Market brand. (19353999) |
| Melatonin INHIBITS Cyclooxygenase-2 | Moreover, Western blot analysis showed that melatonin inhibited LPS/IFN-gamma-induced expression of COX-2 protein, but not that of constitutive cyclooxygenase. (18078452) |
| CYP450 INTERACTS_WITH Toremifene | Tamoxifen and toremifene are metabolised by the cytochrome p450 enzyme system, and raloxifene is metabolised by glucuronide conjugation. (12648026) |
| CYP3A INHIBITS Docetaxel | Because docetaxel is inactivated by CYP3A, we studied the effects of the St. John's wort constituent hyperforin on docetaxel metabolism in a human hepatocyte model. (16203790) |
1.5. Active Learning to Reduce Annotation Costs for NLP Tasks
• NLP tasks require human annotations
  Ø Time-consuming and labor-intensive
• Active learning reduces annotation costs
  Ø Used in biomedical and clinical texts
  Ø Effectiveness varies across datasets and tasks
Chen Y, Lasko TA, Mei Q, Denny JC, Xu H. A study of active learning methods for named entity recognition in clinical text. J Biomed Inform 2015; 58: 11–8.
Chen Y, Cao H, Mei Q, Zheng K, Xu H. Applying active learning to supervised word sense disambiguation in MEDLINE. J Am Med Inform Assoc 2013; 20(5): 1001–6.
Objectives
• To assess the effectiveness of AL methods for filtering incorrect semantic predications
• To evaluate various query strategies and provide a comparative analysis of AL methods through visualization
Vasilakes J, Rizvi R, Melton G, Pakhomov S, Zhang R. J Am Med Info Assoc Open. 2018
Method Overview
Query strategies:
• Uncertainty sampling
• Representative sampling
• Combined sampling
Evaluation:
• 10-fold cross-validation
• Training = 2,700; L0 = 270
• Testing = 300, evaluated using AUC
Query Strategies (a least-confidence sketch follows below)
• Uncertainty
  Ø Simple margin
  Ø Least confidence
  Ø Least confidence with dynamic bias
• Representative
  Ø Distance to center
  Ø Density
  Ø Min-max
• Combined
  Ø Information density
  Ø Dynamic
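A minimal sketch of one least-confidence query round with scikit-learn, using random features as stand-ins for the predication feature vectors (the set sizes mirror the evaluation setup above):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_labeled = rng.normal(size=(270, 20))     # L0 = 270 seed examples
y_labeled = rng.integers(0, 2, size=270)   # correct vs. incorrect predication
X_unlabeled = rng.normal(size=(2430, 20))  # remaining unlabeled pool

clf = LogisticRegression(max_iter=1000).fit(X_labeled, y_labeled)

# Least confidence: query the instances whose top predicted class
# probability is lowest, i.e., where the model is most uncertain.
confidence = clf.predict_proba(X_unlabeled).max(axis=1)
query_idx = np.argsort(confidence)[:10]  # next batch to send to annotators
print(query_idx)
```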
Datasets and Annotations
• Substance interaction (3,000):
  Ø INTERACTS_WITH, STIMULATES, or INHIBITS
• Clinical Medicine (3,000):
  Ø ADMINISTERED_TO, COEXISTS_WITH, COMPLICATES, DIAGNOSES, MANIFESTATION_OF, PRECEDES, PREVENTS, PROCESS_OF, PRODUCES, TREATS, or USES
• Inter-rater agreement:
  Ø Kappa: 0.74 (SI), 0.72 (CM)
  Ø Percentage agreement: 87% (SI), 91% (CM)
Performance Comparison
When L is small and U is large:
• it is unlikely that L is representative of U
• given that L is small and unrepresentative, the prediction model trained on L is likely to be poor
|U| is the size of the current unlabeled set; |L| is the size of the current labeled set
Results

| Query strategy | ALC |
| Passive learning | 0.590 |
| Uncertainty sampling | 0.597 - 0.607 |
| Representative sampling | 0.622 - 0.634 |
| ID (manual β) | 0.642 |
| ID (dynamic β) | 0.641 |

ID = information density (combined sampling).
Performance Analysis
Uncertainty Sampling
(worst performing)
Representative Sampling
(best performing)
Vasilakes J, Rizvi R, Melton G, Pakhomov S, Zhang R. J Am Med Info Assoc Open. 2018
1.6. Mining Twitter to Detect DS Adverse Events
• Objectives
  Ø To develop an end-to-end AI pipeline for identifying DS-AEs from tweets
  Ø To compare the DS-AEs discovered from tweets to those curated in iDISK

Data Collection
• Data collection
  Ø 332 DS terms, including 40 commonly used DS and their name variants
  Ø 14,143 AE terms from the ADR lexicon and the iDISK knowledge base
  Ø The final dataset includes 247,807 tweets (2012 to 2018) that contain at least one DS-AE pair
• Data preprocessing (see the sketch below)
  Ø Remove URLs, user handles (@username), hashtag symbols (#), and emojis
  Ø Contractions (e.g., can't) were expanded
  Ø Hashtags were segmented into constituent words
  Ø Stop words were kept (e.g., "throw up" is different from "throw")
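A sketch of the preprocessing rules; the contraction map and the emoji filter are simplified placeholders for the fuller resources used in the study:

```python
import re

CONTRACTIONS = {"can't": "cannot", "won't": "will not", "i'm": "i am"}

def preprocess(tweet: str) -> str:
    t = tweet.lower()
    t = re.sub(r"https?://\S+", "", t)  # remove URLs
    t = re.sub(r"@\w+", "", t)          # remove user handles
    t = re.sub(r"#(\w+)", r"\1", t)     # keep hashtag text, drop the '#'
    t = re.sub(r"[^\x00-\x7f]", "", t)  # crude emoji / non-ASCII removal
    for short, full in CONTRACTIONS.items():
        t = t.replace(short, full)      # expand contractions
    return " ".join(t.split())          # note: stop words are kept

print(preprocess("I can't sleep, #niacin flush again @someuser 😩 https://t.co/x"))
```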
Results – Concept Extraction

| Concept type | Deep learning method | Precision | Recall | F1-measure |
| Supplement | LSTM-CRF + PubMed Word2Vec | 0.8587 ± 0.0211 | 0.8055 ± 0.0280 | 0.8310 ± 0.0218 |
| Supplement | LSTM-CRF + GloVe Twitter | 0.8491 ± 0.0321 | 0.8127 ± 0.0196 | 0.8300 ± 0.0179 |
| Supplement | LSTM-CRF + GloVe Crawl | 0.8736 ± 0.0210 | 0.8375 ± 0.0152 | 0.8551 ± 0.0157 |
| Supplement | LSTM-CRF + fastText | 0.8538 ± 0.0160 | 0.8092 ± 0.0231 | 0.8308 ± 0.0175 |
| Supplement | BioBERT | 0.8570 ± 0.0248 | 0.8725 ± 0.0212 | 0.8646 ± 0.0220 |
| Supplement | BERT | 0.8560 ± 0.0185 | 0.8736 ± 0.0198 | 0.8647 ± 0.0184 |
| Symptom | LSTM-CRF + PubMed Word2Vec | 0.7909 ± 0.0188 | 0.6794 ± 0.0258 | 0.7306 ± 0.0173 |
| Symptom | LSTM-CRF + GloVe Twitter | 0.8048 ± 0.0150 | 0.6994 ± 0.0244 | 0.7482 ± 0.0155 |
| Symptom | LSTM-CRF + GloVe Crawl | 0.8012 ± 0.0205 | 0.7146 ± 0.0344 | 0.7550 ± 0.0232 |
| Symptom | LSTM-CRF + fastText | 0.7784 ± 0.0247 | 0.6841 ± 0.0271 | 0.7277 ± 0.0182 |
| Symptom | BioBERT | 0.8416 ± 0.0204 | 0.8582 ± 0.0200 | 0.8497 ± 0.0172 |
| Symptom | BERT | 0.8393 ± 0.0161 | 0.8664 ± 0.0147 | 0.8526 ± 0.0138 |
Results – Relation Extraction (RE)

| Relation type | Deep learning method | Precision | Recall | F1-measure |
| Indication | CNN + GloVe Twitter | 0.7774 ± 0.0252 | 0.7946 ± 0.0318 | 0.7850 ± 0.0124 |
| Indication | CNN + GloVe Wiki GigaWord | 0.7720 ± 0.0206 | 0.7901 ± 0.0280 | 0.7804 ± 0.0142 |
| Indication | BioBERT | 0.8177 ± 0.0214 | 0.8595 ± 0.0321 | 0.8374 ± 0.0147 |
| Indication | BERT | 0.8181 ± 0.0319 | 0.8522 ± 0.0409 | 0.8335 ± 0.0169 |
| Adverse events | LSTM-CRF + PubMed Word2Vec | 0.6995 ± 0.0653 | 0.6381 ± 0.0539 | 0.6645 ± 0.0410 |
| Adverse events | LSTM-CRF + GloVe Twitter | 0.7069 ± 0.0553 | 0.5995 ± 0.0783 | 0.6456 ± 0.0561 |
| Adverse events | BioBERT | 0.7349 ± 0.0430 | 0.7603 ± 0.0519 | 0.7459 ± 0.0341 |
| Adverse events | BERT | 0.7312 ± 0.0694 | 0.7845 ± 0.1041 | 0.7538 ± 0.0376 |
Results
• 194,190 pairs were identified as DS indications
• 45,668 pairs were identified as DS-AEs
• 190,170 pairs had no relation
Results – DS-AE pairs examples
• Vitamin C – Kidney stones tweets: (iDISK has this entry)
Ø some medications yes even prolonged high dose vitamin c causes kidney
stones
Ø vitamin c is not actually an effective treatment for the common cold and
high doses may cause kidney stones nausea and diarrhea
Ø too much vitamin c can cause kidney stones
• Vitamin C – diarrhea tweets: (iDISK has this entry)
Ø i would eat this whole bag of oranges but vitamin c in high doses can
induce skin breakouts and diarrhea facts
Ø too much vitamin c or zinc could cause nausea diarrhea and stomach
cramps check your dose
Ø too much vitamin c can cause diarrhea and or nausea
Ø it can cause diarrhea because of all the vitamin c
Results – DS-AE pairs examples
• Niacin – Flush tweets: (not in iDISK)
Ø the niacin flush may be uncomfortable for a few mins but it is well
worth it it may be itchy or burn a little but it passes in 10 30
Ø note to self if you are used to 250 mg of niacin jump up to 500 mg
the niacin flush is so intense
Ø already got a niacin flush crap
• Fish oil – prostate cancer tweets: (not in iDISK)
Ø fish oil makes you more likely to get prostate cancer good enough for
me to stop taking it just a heads up
Ø some docs say fish oil can raise your risk of prostate cancer wait so i
should stop stuffing goldfish up there tgif
Ø correction study finds fish oil increases risk of high grade prostate
cancer by 71 percent <url>
Part 2:
Information Extraction in EHR and Clinical
Trials
• Extract Breast Cancer Receptor Status
• Identify Clinically New Information
• Parse Clinical Trial Eligibility Criteria
2.1. Breast Cancer Receptor Status
Phenotyping from EHR
Breitenstein MK, Liu H, Maxwell KN, Pathak J, Zhang R. Electronic health record phenotypes for precision medicine:
perspectives and caveats from treatment of breast cancer at a single institution. Clinical and Translational Science.
2018 Jan;11(1):85-92.
Phenotyping Granularity
• Phenotyping usually identifies cases or controls
• Precision medicine phenotypes of breast cancer subtypes
  Ø Estrogen receptor (ER)
  Ø Progesterone receptor (PR)
  Ø Human epidermal growth factor receptor 2 (HER2)
  Ø Triple-negative breast cancer (TNBC: ER-, PR-, HER2-)
Objectives
• Develop NLP-based breast cancer precision
medicine phenotyping methods to identify
receptor status
• Compare the clinical data source coverage on
receptor status
2.2. Identifying Clinically Relevant New Versus Redundant Information in Clinical Texts
• EHR "copy-and-paste" functionality
  Ø 74-90% of physicians copy and paste
  Ø 20-78% of physician notes are copied text
• Results
  Ø Little deletion, only addition
  Ø Longer notes, recombinant versions of previous notes
  Ø Errors repeat
• User issues
  Ø Information overload
  Ø Difficulties in finding information
Statistical N-gram Language Model
• Predict the probability of a word based on all previous words:
  P(w1...wn) = P(w1) P(w2|w1) P(w3|w1 w2) ... P(wn|w1...wn-1) = ∏ k=1..n P(wk | w1...wk-1)
• Markov assumption
  Ø The probability of a word depends only on the previous n words:
  P(wk | w1...wk-1) ≈ P(wk | wk-n+1...wk-1)
• N-gram model
  Ø An (n-1)th-order Markov model
Example: P(congestion | a female presenting with a chief complaint of nasal)
  Bigram: P(congestion | nasal)
  Trigram: P(congestion | of nasal)
  Four-gram: P(congestion | complaint of nasal)
Manning and Schütze. Foundations of Statistical Natural Language Processing. The MIT Press; 2003.
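A minimal maximum-likelihood bigram sketch over the example sentence above:

```python
from collections import Counter

corpus = "a female presenting with a chief complaint of nasal congestion".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p_bigram(word: str, prev: str) -> float:
    """MLE estimate P(word | prev) = C(prev, word) / C(prev)."""
    return bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0

print(p_bigram("congestion", "nasal"))  # 1.0 in this one-sentence corpus
```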
Statistical N-gram Language Model
• Sparseness of the corpus
  Ø Unseen events receive zero probability
  Ø A zero will propagate to the probability of a long string
• Smoothing methods
  Ø Decrease the probability of seen events to allow for the occurrence of unseen n-grams
  Ø Good-Turing estimation:
  If C(w1...wn) = r > 0: P_GT(w1...wn) = r*/N, where r* = (r+1) N_{r+1} / N_r
  If C(w1...wn) = 0: P_GT(w1...wn) = (1 - Σ r≥1 N_r r*/N) / N_0 ≈ N_1 / (N_0 N)
Manning and Schütze. Foundations of Statistical Natural Language Processing. The MIT Press; 2003.
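The adjusted count r* falls out of the frequency-of-frequencies table directly; a small sketch with toy n-gram counts:

```python
from collections import Counter

# Toy n-gram counts; N_r = number of distinct n-grams seen exactly r times.
ngram_counts = Counter({"of nasal": 3, "nasal congestion": 2,
                        "chief complaint": 2, "a female": 1, "a chief": 1})
N = sum(ngram_counts.values())
freq_of_freq = Counter(ngram_counts.values())  # r -> N_r

def good_turing_r_star(r: int) -> float:
    """r* = (r + 1) * N_{r+1} / N_r (only defined when N_r and N_{r+1} > 0)."""
    return (r + 1) * freq_of_freq[r + 1] / freq_of_freq[r]

print(good_turing_r_star(1))  # adjusted count for singletons
print(freq_of_freq[1] / N)    # N_1/N: total probability mass left for unseen n-grams
```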
Semantic Similarity Measures
• Measure semantic similarity between two
biomedical concepts by determining the closeness in
a hierarchy
• UMLS brings many biomedical vocabularies and
standards together
• UMLS::Similarity provides a platform to calculate
similarity using various methods
• Methods: Resnik; Jiang and Conrath; Lin
Pedersen T, Pakhomov S et al. 2nd ACM SIGHIT IHI Symp Proc, 2012.
Pakhomov S, McInnes B et al. AMIA Annu Symp Proc 2010: 572-6.
McInnes B, Pedersen T, Pakhomov S. AMIA Annu Symp Proc 2009:431-5.
Pedersen T, Pakhomov S et al. J Biomed Inform. 2007 Jun;40(3):288-99.
P. Resnik, International Joint Conference for Artificial Intelligence, 448-53, 1995.
J. Jiang and D. Conrath, Proceedings on International Conference on Research in CL, 9-33, 1997.
D. Lin, Proceedings of the International Conference on ML., 296-304, 1998.
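All three measures reduce to a few lines once the information content (IC) of each concept and of their least common subsumer (LCS) is known; the IC values below are hypothetical stand-ins for corpus-derived estimates:

```python
# Hypothetical information-content values, IC(c) = -log P(c), where P(c)
# is the concept's probability in a clinical corpus.
ic = {"pain": 2.1, "headache": 5.8, "migraine": 7.2}
ic_lcs = ic["pain"]  # assume "pain" is the LCS of headache and migraine

def resnik(c1, c2):  # similarity = IC of the LCS
    return ic_lcs

def lin(c1, c2):     # 2 * IC(lcs) / (IC(c1) + IC(c2))
    return 2 * ic_lcs / (ic[c1] + ic[c2])

def jcn(c1, c2):     # inverse of the Jiang-Conrath distance
    return 1 / (ic[c1] + ic[c2] - 2 * ic_lcs)

print(resnik("headache", "migraine"), lin("headache", "migraine"),
      jcn("headache", "migraine"))
```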
Results: Performance Comparison

| Algorithm | Recall | Precision | F1-measure | Optimal threshold |
| Baseline | 0.85 | 0.64 | 0.73 | - |
| Baseline + Lin | 0.87 | 0.62 | 0.72 | 0.9 |
| Baseline + Res | 0.87 | 0.61 | 0.72 | 0.9 |
| Baseline + Jcn | 0.87 | 0.61 | 0.72 | 0.9 |

Baseline: rule-based section information adjustment + removal of note formatting and noise + removal of stop words + lexical normalization
Semantic similarity methods: Lin; Resnik; Jiang and Conrath
Precision = TP/(TP+FP); Recall = TP/(TP+FN); F1-measure = 2 × Precision × Recall / (Precision + Recall)
NIP Score to Navigate Notes
[Figure: New Information Proportion (NIP, %) plotted against note index (notes 29-38) for 10-note and 20-note windows. Annotations: notes 30, 32, 33 & 35: nothing new; note 31, NEW: RUQ pain worse with eating greasy foods; note 34, NEW: pt visits diabetes RN; note 36, NEW: sore throat x 3 days; note 37, NEW: having chest pain, will try colchicine for pericarditis; note 38, NEW: depressive symptoms, bulging L TM on exam.]
• Cyclical pattern
• High correlation with human judgment
• Source note of redundant information
*NIP: New Information Proportion
New Information Semantic Types
[Figure: plot of NDIP (disease), NMIP (medication), and NLIP (laboratory) over time for a patient. Biomedical concepts for each note were automatically extracted and included in boxes. NDIP, new problem/disease information proportion; NMIP, new medication information proportion; NLIP, new laboratory information proportion.]
NIP = NDIP + NMIP + NLIP + NOIP
!"
#"
$!"
$#"
%!"
%#"
%#&'()&!*"
$+&,-.&!*"
/&01-&!*"
%%&234&!5"
$/&637&!5"
%&638&!5"
%$&2(4&!5"
$!&'()&!5"
%5&91:&!5"
$*&;<=&!5"
>&234&$!"
%?&@1A&$!"
$>&':7&$!"
?&2(4&$!"
%?&2(B&$!"
$+&91:&$!"
/&;<=&$!"
%/&01-&$!"
$$&@1A&$$"
%&':7&$$"
%%&638&$$"
$$&2(B&$$"
/!&'()&$$"
$5&,-.&$$"
!"#$%
!"
#"
$!"
$#"
%!"
%#"
%#&'()&!*"
$+&,-.&!*"
/&01-&!*"
%%&234&!5"
$/&637&!5"
%&638&!5"
%$&2(4&!5"
$!&'()&!5"
%5&91:&!5"
$*&;<=&!5"
>&234&$!"
%?&@1A&$!"
$>&':7&$!"
?&2(4&$!"
%?&2(B&$!"
$+&91:&$!"
/&;<=&$!"
%/&01-&$!"
$$&@1A&$$"
%&':7&$$"
%%&638&$$"
$$&2(B&$$"
/!&'()&$$"
$5&,-.&$$"
!"#$%
5-Sep-08: clonazepam
24-Sep-08: sertraline,
clonazepam
22-Oct-08:
sertraline
31-Dec-08:
glimepiride
24-Mar-09: tylenol,
ibuprofen, Imitrex
8-Mar-10: janumet,
metformin, Imitrex,
sertraline, estroven
7-May-10:
glipizide
25-Mar-11: buspirone,
venlafaxine
17-Sep-11: influenza vaccine
New Medication
Information
5-Sep-08: elbow pain, hand pain,
stress, depression, weight gain,
fatigure, osteoarthritis
24-Sep-08: sleepy, dizziness,
nausea, numbness, low back
pain, hip pain
22-Oct-08: anxiety
31-Dec-08: obesity, joint
tenderness, depression
23-Jan-09: hypoglycemia, hot
flushes, menorrhagia,
headache
24-Mar-09: arm pain,
migraine headaches, anxiety
8-Mar-10: depression,
back pain, fatigue
7-May-10: weight loss, family
stress, thirsty,
hypercholesterolemia
25-Mar-11: shoulder pain,
cramping, Leg pain,
patellofemoral syndrome
New Disease
Information
!"
#"
$!"
$#"
%!"
%#"
%#&'()&!*"
$+&,-.&!*"
/&01-&!*"
%%&234&!5"
$/&637&!5"
%&638&!5"
%$&2(4&!5"
$!&'()&!5"
%5&91:&!5"
$*&;<=&!5"
>&234&$!"
%?&@1A&$!"
$>&':7&$!"
?&2(4&$!"
%?&2(B&$!"
$+&91:&$!"
/&;<=&$!"
%/&01-&$!"
$$&@1A&$$"
%&':7&$$"
%%&638&$$"
$$&2(B&$$"
/!&'()&$$"
$5&,-.&$$"
!"#$%
5-Sep-08: BP, weight
24-Sep-08: breast cancer
screeing, X-ray spine
31-Dec-08: A1C, CHOL, HDL, LDL,
TRIG, Microalbuminuria
measurement, X-ray knee
24-Mar-09: A1C, BP
7-May-10: glucuse monitoring,
A1C, HDL, LDL, GLC, BP,
blood glucose
25-Mar-11: blood glucose
New Laboratory
Information
NOIP (Other types of new information)
New Information Visualization in Epic EHR System
2.3. Parsing Clinical Trial Eligibility Criteria
• Patient recruitment delays are remarkably common and costly
  Ø Nearly 80 percent of patient recruitment timelines in clinical trials are not met
  Ø Over 50 percent of patients are not enrolled within the planned time frames
• Objective
  Ø Use NLP to parse entities in trial inclusion/exclusion eligibility criteria
  Ø Use state-of-the-art methods on the CLAMP platform
NLP for Clinical Trial Matching: Entities and Attributes
Entities: Demographics, Observation, Procedure, Condition, Drug, Dietary Supplement, Diet, Device
Attributes: Measurement, Temporal Measurement, Qualifier
Annotations

| | Semantic class | Example criteria (entities and attributes are underlined and marked in blue in the original slide) |
| Entity | Demographics | Women must be > 18 to 45 years of age; BMI = 27 kg/m2 |
| Entity | Observation | Bilirubin greater than 1.2 g/dl; MMSE below 24, dementia or unstable clinical depression by exam |
| Entity | Procedure | History of bilateral hip replacement |
| Entity | Condition | Uncontrolled hypertension (BP over 180mm HG) |
| Entity | Drug | Taking metformin, propranolol and other medications |
| Entity | Dietary supplement (DS) | Use of St. John's Wort or any other dietary supplement |
| Entity | Device | Claustrophobia, metal implants, pacemaker or other factors affecting feasibility and/or safety of MRI scanning |
| Attribute | Measurement | BUN above 40 mg/dl, Cr above 1.8 mg/dl, CrCl < 60 mg/dl |
| Attribute | Qualifier | Signs and symptoms of increased intracranial pressure; severe hypercalcemia |
| Attribute | Temporal_measurement | Use of systemic corticosteroids within the last year |
| Attribute | Negation | Use of anti-diabetic drugs other than metformin |
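As a toy illustration of what criteria parsing produces (this is not the CLAMP pipeline), a regex sketch that pulls an entity, comparator, value, and unit out of simple measurement criteria:

```python
import re

criteria = [
    "Age greater than 50 years",
    "BMI = 27 kg/m2",
    "CrCl < 60 mg/dl",
]

# Hypothetical pattern: entity name, comparator, numeric value, optional unit.
pattern = re.compile(
    r"(?P<entity>[A-Za-z ]+?)\s*"
    r"(?P<op>greater than|less than|[<>=]+)\s*"
    r"(?P<value>\d+(?:\.\d+)?)\s*(?P<unit>[\w/%]*)"
)

for c in criteria:
    m = pattern.search(c)
    if m:
        print(m.groupdict())  # e.g. {'entity': 'Age', 'op': 'greater than', ...}
```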
Mapping to UMLS semantic groups across NLP systems
A: BioMedICUS, B: CLAMP, C: cTAKES and D: MetaMap.
Performances of individual NLP systems & Boolean ensemble
Anusha Bompelli, et al. Comparing NLP Systems to Extract Entities of Eligibility Criteria in Dietary Supplements Clinical
Trials using NLP-ADAPT. AIME 2020 (will present on Aug 25 at 14:00, NLP session)
Performance Comparison of the Deep Learning Models
Eligibility criteria corpus (149 trials). Each cell reports (strict, lenient) scores; demographics through device are entity classes, measurement through negation are attribute classes.

| Semantic class | # | BERT(a) P | BERT R | BERT F1 | RoBERTa(b) P | RoBERTa R | RoBERTa F1 | ELECTRA(c) P | ELECTRA R | ELECTRA F1 |
| demographics | 194 | 0.856, 0.916 | 0.409, 0.451 | 0.554, 0.604 | 0.654, 0.928 | 0.537, 0.801 | 0.586, 0.859 | 0.500, 0.541 | 0.273, 0.294 | 0.353, 0.380 |
| observation | 868 | 0.694, 0.829 | 0.658, 0.829 | 0.675, 0.829 | 0.721, 0.910 | 0.684, 0.897 | 0.702, 0.904 | 0.663, 0.865 | 0.590, 0.795 | 0.624, 0.829 |
| procedure | 148 | 0.667, 1.000 | 0.600, 0.800 | 0.632, 0.889 | 0.542, 0.708 | 0.650, 0.850 | 0.591, 0.773 | 0.448, 0.552 | 0.650, 0.850 | 0.531, 0.669 |
| condition | 1832 | 0.794, 0.995 | 0.698, 0.851 | 0.743, 0.917 | 0.813, 0.949 | 0.767, 0.900 | 0.789, 0.924 | 0.778, 0.918 | 0.788, 0.893 | 0.744, 0.905 |
| drug | 890 | 0.935, 0.959 | 0.505, 0.543 | 0.655, 0.693 | 0.707, 0.846 | 0.699, 0.952 | 0.703, 0.895 | 0.423, 0.453 | 0.391, 0.447 | 0.406, 0.449 |
| supplement | 188 | 0.111, 0.278 | 0.250, 0.625 | 0.154, 0.385 | 0.296, 0.412 | 0.625, 1.000 | 0.400, 0.583 | 0.250, 0.438 | 0.500, 0.875 | 0.333, 0.583 |
| device | 37 | 0.857, 0.857 | 1.000, 1.000 | 0.923, 0.923 | 0.857, 0.857 | 1.000, 1.000 | 0.923, 0.923 | 0.750, 0.750 | 1.000, 1.000 | 0.857, 0.857 |
| measurement | 397 | 0.731, 0.851 | 0.700, 0.829 | 0.715, 0.840 | 0.781, 0.938 | 0.725, 0.870 | 0.752, 0.902 | 0.667, 0.810 | 0.600, 0.786 | 0.632, 0.797 |
| qualifier | 1137 | 0.730, 0.795 | 0.754, 0.822 | 0.742, 0.808 | 0.817, 0.872 | 0.761, 0.795 | 0.788, 0.831 | 0.705, 0.750 | 0.788, 0.839 | 0.744, 0.792 |
| temporal | 646 | 0.805, 0.931 | 0.729, 0.823 | 0.765, 0.874 | 0.837, 0.989 | 0.811, 0.926 | 0.824, 0.957 | 0.859, 0.976 | 0.760, 0.844 | 0.807, 0.905 |
| negation | 261 | 0.818, 0.879 | 0.562, 0.769 | 0.667, 0.820 | 0.914, 0.943 | 0.821, 0.846 | 0.825, 0.892 | 0.735, 0.765 | 0.641, 0.667 | 0.685, 0.712 |

(a) arXiv:1810.04805; (b) arXiv:1907.11692; (c) arXiv:2003.10555.
Acknowledgements
Extramural Funding
NCCIH 1R01AT009457 (Zhang)
OD R01AT009457-03S1 (Zhang)
NIA 3R01AT009457-04S1 (Zhang)
CTSA 1UL1TR002494 (Blazer)
AHRQ 1R01HS022085 (Melton)
Medtronic Inc. (Speedie)
Collaborators
Mayo Clinic (Liu, Wang), U of Florida (Bian), Florida State U
(He), NIH/NLM (Rindflesch, Bodenreider), UIUC (Kilicoglu)
Contact Information
Rui Zhang, Ph.D.
Email: zhan1386@umn.edu
Research Lab: http://ruizhang.umn.edu/

More Related Content

PDF
Using Large Language Models in 10 Lines of Code
PPTX
Evolutionary Computing
PPT
Artificial Intelligence: Knowledge Engineering
PPTX
Machine learning (ML) and natural language processing (NLP)
PDF
Train foundation model for domain-specific language model
PPTX
NAMED ENTITY RECOGNITION
PDF
Visual reasoning
PDF
Natural language processing (NLP) introduction
Using Large Language Models in 10 Lines of Code
Evolutionary Computing
Artificial Intelligence: Knowledge Engineering
Machine learning (ML) and natural language processing (NLP)
Train foundation model for domain-specific language model
NAMED ENTITY RECOGNITION
Visual reasoning
Natural language processing (NLP) introduction

What's hot (20)

PPTX
Natural language processing PPT presentation
PDF
Introduction to ChatGPT and Overview of its capabilities and functionality.pdf
PPTX
A brief primer on OpenAI's GPT-3
PDF
AI simple search strategies
PPT
Planning
PPTX
AI_Session 3 Problem Solving Agent and searching for solutions.pptx
PPT
Latent Semantic Indexing and Analysis
PDF
IE: Named Entity Recognition (NER)
PDF
Transformers - Part 1
ODP
Topic Modeling
PDF
LanGCHAIN Framework
PPT
Natural language procssing
PPTX
‘Big models’: the success and pitfalls of Transformer models in natural langu...
PPT
Artificial intelligence and knowledge representation
PDF
Introduction to Few shot learning
PPT
Natural language processing
PDF
Introduction to Transformers for NLP - Olga Petrova
PDF
Large Language Models Bootcamp
PDF
Transformer Introduction (Seminar Material)
PPTX
Data Science with Python Libraries
Natural language processing PPT presentation
Introduction to ChatGPT and Overview of its capabilities and functionality.pdf
A brief primer on OpenAI's GPT-3
AI simple search strategies
Planning
AI_Session 3 Problem Solving Agent and searching for solutions.pptx
Latent Semantic Indexing and Analysis
IE: Named Entity Recognition (NER)
Transformers - Part 1
Topic Modeling
LanGCHAIN Framework
Natural language procssing
‘Big models’: the success and pitfalls of Transformer models in natural langu...
Artificial intelligence and knowledge representation
Introduction to Few shot learning
Natural language processing
Introduction to Transformers for NLP - Olga Petrova
Large Language Models Bootcamp
Transformer Introduction (Seminar Material)
Data Science with Python Libraries
Ad

Similar to NLP tutorial at AIME 2020 (20)

PPTX
Natural Language Processing to Curate Unstructured Electronic Health Records
PDF
Apache Spark NLP for Healthcare: Lessons Learned Building Real-World Healthca...
PPTX
Natural Language Understanding in Healthcare
PDF
CV_Min_Jiang
PPTX
Applying NLP to Personalized Healthcare - 2021
PDF
Nlp based retrieval of medical information for diagnosis of human diseases
PDF
Nlp based retrieval of medical information for diagnosis of human diseases
PDF
NLP Prescription for Healthcare Challenges.pdf
PDF
Challenges in understanding clinical notes: Why NLP Engines Fall Short
PDF
Learning to speak medicine
PDF
Natural Language Processing In Healthcare
PDF
1555 track2 talby
PDF
additional Reading dnbvbfdvfivddcdsvfbivdcsdlcd
PPTX
New Frontiers in Applied NLP​ - PAW Healthcare 2022
PPTX
"Can NLP techniques be utilized as a reliable tool for medical science?" -Bui...
DOCX
ROBOTICS ESSAYS ANSWERS BY KANTE- IRVIN MAKUWAZA.docx
PPTX
Natural Language Understanding with Machine Learned Annotators and Deep Learn...
PDF
KCI_NLP_OHSUResearchWeek2016-NLPatOHSU-final
PPTX
How to Apply NLP to Analyze Clinical Trials
PPTX
HXR 2016: Data Insights: Mining, Modeling, and Visualizations- Niraj Katwala
Natural Language Processing to Curate Unstructured Electronic Health Records
Apache Spark NLP for Healthcare: Lessons Learned Building Real-World Healthca...
Natural Language Understanding in Healthcare
CV_Min_Jiang
Applying NLP to Personalized Healthcare - 2021
Nlp based retrieval of medical information for diagnosis of human diseases
Nlp based retrieval of medical information for diagnosis of human diseases
NLP Prescription for Healthcare Challenges.pdf
Challenges in understanding clinical notes: Why NLP Engines Fall Short
Learning to speak medicine
Natural Language Processing In Healthcare
1555 track2 talby
additional Reading dnbvbfdvfivddcdsvfbivdcsdlcd
New Frontiers in Applied NLP​ - PAW Healthcare 2022
"Can NLP techniques be utilized as a reliable tool for medical science?" -Bui...
ROBOTICS ESSAYS ANSWERS BY KANTE- IRVIN MAKUWAZA.docx
Natural Language Understanding with Machine Learned Annotators and Deep Learn...
KCI_NLP_OHSUResearchWeek2016-NLPatOHSU-final
How to Apply NLP to Analyze Clinical Trials
HXR 2016: Data Insights: Mining, Modeling, and Visualizations- Niraj Katwala
Ad

Recently uploaded (20)

PPTX
Acceptance and paychological effects of mandatory extra coach I classes.pptx
PPTX
Leprosy and NLEP programme community medicine
PPT
lectureusjsjdhdsjjshdshshddhdhddhhd1.ppt
DOCX
Factor Analysis Word Document Presentation
PDF
Optimise Shopper Experiences with a Strong Data Estate.pdf
PPTX
Microsoft-Fabric-Unifying-Analytics-for-the-Modern-Enterprise Solution.pptx
PDF
Systems Analysis and Design, 12th Edition by Scott Tilley Test Bank.pdf
PPTX
Qualitative Qantitative and Mixed Methods.pptx
PDF
Data Engineering Interview Questions & Answers Batch Processing (Spark, Hadoo...
PPTX
IBA_Chapter_11_Slides_Final_Accessible.pptx
PDF
REAL ILLUMINATI AGENT IN KAMPALA UGANDA CALL ON+256765750853/0705037305
PPTX
AI Strategy room jwfjksfksfjsjsjsjsjfsjfsj
PPTX
Database Infoormation System (DBIS).pptx
PPTX
importance of Data-Visualization-in-Data-Science. for mba studnts
PPTX
sac 451hinhgsgshssjsjsjheegdggeegegdggddgeg.pptx
PPTX
Market Analysis -202507- Wind-Solar+Hybrid+Street+Lights+for+the+North+Amer...
PDF
[EN] Industrial Machine Downtime Prediction
PPTX
01_intro xxxxxxxxxxfffffffffffaaaaaaaaaaafg
PPTX
Managing Community Partner Relationships
PPTX
Copy of 16 Timeline & Flowchart Templates – HubSpot.pptx
Acceptance and paychological effects of mandatory extra coach I classes.pptx
Leprosy and NLEP programme community medicine
lectureusjsjdhdsjjshdshshddhdhddhhd1.ppt
Factor Analysis Word Document Presentation
Optimise Shopper Experiences with a Strong Data Estate.pdf
Microsoft-Fabric-Unifying-Analytics-for-the-Modern-Enterprise Solution.pptx
Systems Analysis and Design, 12th Edition by Scott Tilley Test Bank.pdf
Qualitative Qantitative and Mixed Methods.pptx
Data Engineering Interview Questions & Answers Batch Processing (Spark, Hadoo...
IBA_Chapter_11_Slides_Final_Accessible.pptx
REAL ILLUMINATI AGENT IN KAMPALA UGANDA CALL ON+256765750853/0705037305
AI Strategy room jwfjksfksfjsjsjsjsjfsjfsj
Database Infoormation System (DBIS).pptx
importance of Data-Visualization-in-Data-Science. for mba studnts
sac 451hinhgsgshssjsjsjheegdggeegegdggddgeg.pptx
Market Analysis -202507- Wind-Solar+Hybrid+Street+Lights+for+the+North+Amer...
[EN] Industrial Machine Downtime Prediction
01_intro xxxxxxxxxxfffffffffffaaaaaaaaaaafg
Managing Community Partner Relationships
Copy of 16 Timeline & Flowchart Templates – HubSpot.pptx

NLP tutorial at AIME 2020

  • 1. Tutorial 1: Methods and Applications of Natural Language Processing in Medicine Rui Zhang1, Hua Xu2, Yanshan Wang3, Yifan Peng4 1University of Minnesota, 2University of Texas Health, 3Mayo Clinic, 4Weill Cornell Medicine International Conference on Artificial Intelligence in Medicine (AIME 2020) August 25, 2020
  • 2. Purpose of this tutorial • Review NLP systems and tools in solving clinical problems and facilitating clinical research • Showcase our real-world NLP application in clinical practice and research across four institutions • Discuss opportunities and challenges of NLP in medicine
  • 4. Motivation for Clinical NLP 20% 80% Demographics, Lab results, Medication, Diagnosis… Clinical notes Patient provided information Family history Social history Radiology reports Pathology reports … Structured Data Unstructured Data
  • 6. Developing high-performance NLP solutions for healthcare applications Dr. Hua Xu is a Professor at University of Texas Health School of Biomedical Informatics and a fellow of the American College of Medical Informatics. His primary research interest is to develop NLP methods and systems and apply them to clinical research and operation. He has worked on different clinical NLP topics, such as entity recognition, relation extraction, syntactic parsing, word sense disambiguation, and active learning, with over 200 publications. He has built multiple clinical NLP systems including the medication information extraction tool MedEx and a recent comprehensive clinical NLP system CLAMP, using machine learning and deep learning methods. Those tools have been widely used in large clinical consortia such as OHDSI and CTSA. • NLP concepts and tasks • Issues affecting NLP performance • Tools to facilitate NLP development • Applications to healthcare https://sbmi.uth.edu/faculty-and-staff/hua-xu.htm
  • 7. Transfer Learning of NLP in Medicine Dr. Yifan Peng is an assistant professor of population health sciences in the Division of Health Informatics at Weill Cornell Medicine. After receiving his Ph.D. in Computer Science from the University of Delaware in 2016, Dr. Peng worked as a research fellow at the National Center for Biotechnology Information at National Library of Medicine at NIH. Dr. Peng’s main research interests include biomedical and clinical natural language processing and medical image analysis (by courtesy). His current project focuses on applying information extracted through NLP and image analysis on radiological data classification. • Transfer learning • Pre-training of BERT model on large-scale clinical corpora • Fine-tuning the BERT model on specific tasks such as NER and RE • Multi-task learning http://vivo.med.cornell.edu/display/cwid-yip4002
  • 8. Digital Phenotyping for Cohort Discovery Dr. Yanshan Wang is an Assistant Professor at Mayo Clinic. His current work is centered on developing novel NLP and artificial intelligence (AI) methodologies for facilitating clinical research and solving real- world clinical problems. Dr. Wang has extensive collaborative research experience with physicians, epidemiology researchers, and statisticians. Dr. Wang has published over 40 peer-reviewed articles at referred computational linguistic conferences (e.g., NAACL), and medical informatics journals and conference (e.g., JBI, JAMIA, JMIR and AMIA). He has served on program committees for EMNLP, NAACL, IEEE-ICHI, IEEE-BIBM. • Cohort retrieval • Approaches for cohort retrieval • Case study • Patient cohort retrieval for clinical trials accrual https://www.mayo.edu/research/faculty/wang-yanshan-ph-d/bio-20199713.
  • 9. Advances of NLP in Clinical Research Dr. Rui Zhang is an McKnight Presidential Fellow and Associate Professor in the College of Pharmacy and the Institute for Health Informatics (IHI), and also graduate faculty in Data Science at the University of Minnesota (UMN). He is the Director of NLP Services in Clinical and Transnational Science Institution (CTSI) at the UMN. Dr. Zhang’s research focuses on health and biomedical informatics, especially biomedical NLP and text mining. His research interests include the secondly analysis of EHR data for patient care as well as pharmacovigilance knowledge discovery through mining biomedical literature. • Background of NLP to Support Clinical Research • NLP Systems and Tools for Clinical Research • Case study • NLP to Support Dietary Supplement Safety Research http://ruizhang.umn.edu
  • 10. Schedule Time Session Presenter 9:00 - 9:05 Introduction Rui Zhang 9:05 - 9:45 Developing high-performance NLP solutions for healthcare applications Hua Xu 9:45 - 10:25 Transfer Learning of NLP in Medicine: A case study with BERT Yifan Peng 10:25 – 10:30 Break 10:30 – 11:10 Digital Phenotyping for Cohort Discovery Yanshan Wang 11:10 – 11:50 Advances of NLP for Clinical Research Rui Zhang 11:50 – 12: 00 Q&A
  • 11. Building High-performance NLP Systems in Healthcare Hua Xu PhD School of Biomedical Informatics, University of Texas AIME NLP Tutorial 8/25/2020 Data Science Biomedcine NLP
  • 12. Disclosure § Founder and CEO: § Melax Technologies Inc. § Consultant: § Hebta LLC § More Health Inc. § DCHealth Technologies Inc. 2
  • 13. Outline 01 Overview & Challenges 02 Select right algorithms 03 Annotate good data 3 04 Bring human into the loop
  • 14. Part 1 01 Overview & Challenges 4
  • 15. NLP Tasks – Let’s focus on Information Extraction Information Retrieval Information extraction Document classification Question answering Language generation Wikipedia Website Social media Email Office files Computational techniques for analyzing and representing naturally occurring languages at one or more levels of linguistic analysis for the purpose of achieving human-like language processing for a range of tasks or applications. 5
  • 16. Applications of Biomedical IE Systems Application Clinical document Drug labels Clinical trial protocols Biomedical literature NLP Decision support Business Intelligence Clinical Research Surveillance 6
  • 17. Active Development of Biomedical IE Systems General Purpose Specific Purpose MedLEE MetaMap CLAMP cTAKES Smoking status PHI De-identification Social determinants …… 7 Bleeding events Cancer metastasis
  • 18. Challenges for End-user to Utilize Biomedical NLP § General clinical NLP systems exist, but their performance is often suboptimal for user-specific applications § Specific-purpose NLP systems often show good performance in a given task, but performance drop when transporting these NLP tools § The generalizability issue when users build or deploy NLP applications § From one type of document to another § From one organization to another § From one application to another 8
  • 19. An Example of Smoking Status Detection § Mayo Clinic cTAKES for smoking detection § Sentence-level mention detection and classification – machine learning (ML) § Document-level status classification – rules § Patient-level summarization – rules § Performance drop at deployment § I2b2 dataset, F-measure 85.5% (Savova et al. JAMIA 2008) § Vanderbilt dataset, F-measure 75% (Liu et al. AMIA 2012) § Steps to customize it to improve performance to 89% § Collect and annotate local data § Re-train models using specific algorithms § Specify rules by local physicians 9 Optimizing NLP performance could be time-consuming and costly …
  • 20. Components for Building High-performance NLP Systems 10 Rules Machine learning Deep learning Algorithm Conduct annotation Specify rules Curate Knowledgebases Human What to annotate Annotation Quality Annotation Cost Data Practical NLP for Biomedcine
  • 21. Part 2 02 Select right algorithms 11 Rules vs. Machine Learning vs. Deep Learning
  • 22. Rule-based approach to medication information extraction § Input: a clinical document, e.g., discharge summary § Output: all drug names with associated signature information such as dose, frequency, route… § Issues: § Misspellings and abbreviations § ibuprofen ("ibuprfen"), augmentin ("qugmentin"), insulin ("inuslin"), and ASA ( aspirin ) § Context of drug mentions § Allergy: pt is allergic to penicillin § Negation: never on warfarin § Lab tests: potassium level is normal vs. take potassium § Temporal status: was on warfarin 3 days before admission § Multiple signatures and multiple drugs in one sentence § Coumadin 2.5mg po dly except 5mg qTu,Th § start the patient on Lovenox for the duration of this pregnancy, followed by a transition to Coumadin postpartum, to be continued for likely long-term, possibly lifelong duration. 12
  • 23. Findings Prec Rec F-Score DrugName 95.0 91.5 93.2 Strength 98.8 90.5 94.5 Route 98.8 89.6 93.9 Frequency 98.9 93.2 96.0 Table 1. Evaluation on discharge summaries from Vanderbilt. • Semantic-based parsing (Drug names and signatures) • Maps to RxNorm concepts Semantic Tagger Parser Semantic Grammar Lexicon & Rules MedEx Clinical Text She is currently maintained on Prograf 3mg bid. Structured Output Drugname: Prograf Strength: 3mg Frequency: bid Pre-processing Xu et al. JAMIA 2010; 17:19-24 Findings Prec Rec F-Score DrugName 96.7 88.0 92.1 Strength 94.7 94.7 94.7 Route 96.0 87.0 91.3 Frequency 96.8 89.2 92.9 Table 2. Evaluation on clinic visit notes from Vanderbilt. 13 MedEx – a rule-based tool to identify drug information from free text
  • 24. Drug Name Lisinopril , Famotidine Strength 50mg , 500/50 Route by mouth , iv Frequency b.i.d. , every 2 days Form tablet , ointment Dose Amount take one tablet IntakeTime cc , at 10am Duration for 10 days Dispense Amount dispensed #30 Refill refills: 2 Necessity prn , as needed 14 Define Semantic Categories
  • 25. Entity Recognition § Lexicon lookup tagger § Drug names § Include RxNorm, UMLS, and manually collected drug terms § Exclude certain English terms ( sleep ) § Regular expression-based tagger § Frequency, such as q8hrs § Transformation/Disambiguation § Rule-based transformation/disambiguation of initial tags to final semantic tags 15
  • 26. Parsing §A Chart Parser in NLTK § Semantic grammar § Parse Tree à Structured output §A Regular Expression based Chunker DGMSIG <S> :=<DrugList> <DrugList> := <Drug>|<Drug><DrugList> <Drug> := <DGSSIG> | <DGMSIG> <DGSSIG> := <DGN> | <DGN> <SIG> <SIG> := <DOSE> | <FORM> | <RUT> …. DGN SIG SIG FreqStr FreqStr Prograf 3mg qam and 2mg qpm Figure 1. Simplified semantic grammar. 16
  • 27. Extend MedEx for the 2009 i2b2 Challenge Semantic Tagger Parser MedEx Clinical Notes I2b2 Output Sentence Splitter Section Identification Post-processing Spell Checker Findings Prec Rec F-Score DrugName 84.2 87.1 85.6 Dose 89.5 81.8 85.5 Route 91.8 85.8 88.7 Frequency 87.9 85.8 86.8 Reason 45.9 29.6 36.0 Duration 36.4 35.8 36.1 All 83.9 80.3 82.1 Table 3. Evaluation on 2009 i2b2 data set. Ranked 2nd out of 20 participating teams. Doan et al. JAMIA 2010; 17: 528- 31 17
  • 28. § The 2010 i2b2 Challenge: recognize problem, treatment and test § Convert it into a machine learning task § Optimize the ML models § ML algorithms: CRFs, SSVMs § Features: words, sections, dictionary, representations § Entity tag sets: BIO, BIESO 18 Machine Learning for clinical entity recognition She was given 1 unit of packed red blood cell . O O O O O O B I I I O “Plavix was not recommended, given her recent GI bleeding.” Jiang et al. JAMIA 2011
  • 29. Results 19 Tags Features SSVMs - F(R/P) CRFs - F(R/P) BIO Baseline 84.51 (82.61/86.49) 84.02 (81.32/86.90) Optimized 85.22 (84.05/86.43) 85.16 (82.94/87.50) BIESO Baseline 84.71 (82.53/87.02) 84.22 (81.40/87.23) Optimized 85.82 (84.31/87.38) 85.59 (83.16/88.16) Tang et al. BMC Medical Informatics and Decision Making 2013
  • 30. Embedding Methods Traditional - word2vec - GloVe - fastText Contextual - ELMo - BERTBASE - BERTLARGE - BioBERT Open-domain - Off-the-shelf (General) Clinical domain - Pre-trained on clinical notes from MIMIC-III starting from open-domain checkpoint Entity Recognition Tasks - i2b2 2010 - i2b2 2012 - SemEval 2014 Task 7 - SemEval 2014 Task 14 Pre-training Evaluation Si Y et al. Enhancing clinical concept extraction with contextual embeddings. JAMIA. 2020 Contextual embeddings for deep learning-based NER
  • 31. Algorithm Comparison on Benchmark Dataset 21 Algorithms Feature F1 CRFs (Jiang et al., 2010) (#2 in challenge) Bag of words 77.33 Optimized features 83.60 Semi-Markov (deBruijn B, et al., 2010) (#1 in challenge) Optimized features + Brown clustering 85.23 SSVMs (Tang et al., 2014) Optimized features + Brown clustering + Random indexing 85.82 CNN (Wu et al., 2015) Word embedding 82.77 Bi-LSTM-CRF (Wu et al., 2017) Word embedding 85.91 BERT (Si et al., 2020) Pre-trained language model - BERT, fine tuned on clinical text 90.25 § Task: 2010 i2b2 challenge – entity recognition for problem, treatment, and test in discharge summaries
  • 32. Additional thoughts on deep learning approaches § Parameter optimization § Computation resources (e.g., GPU) § Prediction speed § CRF-based NER – 1 second per discharge summary § BERT-based NER – 20 second per discharge summary § Reliability and explainability
• 33. A Review of Deep Learning in Clinical NLP
Wu S et al. Deep learning in clinical natural language processing: a methodical review. JAMIA 2019
• 34. NLP Tasks and Applications
§ Word sequence labeling – sub-tasks: POS tagging, language models, named entity recognition, relation extraction / semantic annotation (semantic role labeling, event detection, FrameNet) – application: information extraction
§ Sequence to sequence (encoders and decoders) – applications: machine translation, summarization
§ Text classification/clustering – sub-tasks: document classification, sentence classification, sentiment analysis, topic models – applications: email spam filtering, product sentiment
§ Information retrieval – sub-tasks: query expansion, indexing, relevance ranking – application: search engines
§ Dialog systems – sub-tasks: speech recognition, natural language generation – application: chat bots
  • 35. Summary about algorithm selection § The simplest approach that can achieve good performance is the best § Take available resources into consideration § Computation resources § Both labeled and unlabeled data § Expertise in machine learning/deep learning § Keep deployment in mind § Technical architecture and infrastructure § Fitting into your workflow § Other requirements such as speed, robustness etc.
• 36. Part 3: Annotate Good Data (Availability, Quality, and Sample Size)
• 37. Data Availability
§ Large unlabeled data is useful, especially for deep learning-based approaches
§ High-quality annotated data is the key to machine learning/deep learning-based approaches
§ Be aware of the privacy issues of biomedical textual data - de-identification programs that can remove protected health information (e.g., names, addresses, dates) are available
§ De-identification is itself an NER task – many rule-based, ML-based, and hybrid approaches
§ Performance varies (some as high as 95%)
§ Examples: MIST, De-ID, ...
• 38. What about synthetic text?
§ Generating synthetic notes
§ Task – generate the HPI section
§ Data – 826 clinical notes
§ Methods – SeqGAN, GPT-2, and CTRL
Example generated note (note the implausible content): "This is a 39 year-old female with a history of diabetes mellitus, coronary artery disease, who presents with shortness of breath and cough. She has no relief from antacids or antiinflammatories. She is admitted now with increasing radiation damage to her home and extensive medical bills. She denies any pleural chest pain."
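A minimal sketch of note generation with an off-the-shelf GPT-2 via Hugging Face Transformers; the cited study fine-tuned on 826 real HPI sections first, which this sketch omits, and the prompt text is illustrative.

    from transformers import pipeline

    generator = pipeline("text-generation", model="gpt2")
    prompt = ("This is a 39 year-old female with a history of "
              "diabetes mellitus who presents with")
    out = generator(prompt, max_new_tokens=40, do_sample=True,
                    num_return_sequences=1)
    print(out[0]["generated_text"])   # expect fluent but clinically unreliable text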
• 39. Annotation Quality Matters
§ Task: 2014 i2b2 challenge – extracting 36 risk factors, a document classification task
§ Dataset: 790 training and 514 test notes with document labels and evidence spans highlighted
§ The top-ranked system: traditional SVM classifiers
§ Re-annotated the corpus to: fix inconsistent boundaries; identify negative mentions
Roberts K, Shooshan SE, Rodriguez L, Abhyankar S, Kilicoglu H, Demner-Fushman D. The role of fine-grained annotations in supervised recognition of risk factors for heart disease from EHRs. J Biomed Inform. 2015;58:S111-S119.
• 40. Requirements for annotation
§ Annotation guideline
§ Clear definitions of entities and relations (an information model)
§ Appropriate granularity to benefit your application
§ Consistent and robust for representing information
§ High-quality annotation (e.g., consistent)
§ Annotator knowledge
§ Sufficient training
§ Adequate sample size
• 41. Annotation Workflow
Data collection → Pre-annotation → Guideline development → Training & annotation → Model training → Quality control → ML models
  • 42. Guideline development – content § Goals of the annotation § Definitions of entities, relations, etc. § Information model § Granularity § Detailed annotation rules (different scenarios) § Human vs. computer’s thinking § Provide many examples § Positive examples § Negative examples 32
  • 43. Guideline development - workflow § Iterative process § Involvement of both domain experts and linguists/informaticians 33
• 45. Annotator selection and training
§ Annotator selection
§ Background: domain experts / linguists / lay persons
§ Sources: physicians/nurses, residents, students, or crowdsourcing (e.g., Amazon Mechanical Turk)
§ Annotator training
§ Iterative training until an expected performance is achieved
§ Quality checking during the annotation
§ Multi-annotator management
§ Train each annotator to ensure consistent annotations before starting
§ If resources allow, each sample can be double-annotated by two people, with a third, more experienced annotator adjudicating discrepancies
§ Otherwise, assign a small portion of data to both annotators so that inter-annotator agreement (IAA) can be calculated
• 46. Annotation tools
§ BRAT
§ MAE
§ eHOST
§ Prodigy
§ LightTag
§ CLAMP
§ ...
• 47. Quality checking
§ Inter-annotator agreement
§ Precision/recall/F-measure
§ Cohen's kappa, Fleiss' kappa (https://en.wikipedia.org/wiki/Cohen%27s_kappa)
§ Self-train and self-test
§ On one dataset, build the model and then predict on the same dataset again
§ Performance should be high; otherwise it indicates issues with the annotation
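A toy sketch of computing inter-annotator agreement with Cohen's kappa using scikit-learn; the two label lists are fabricated for illustration (1 = entity, 0 = not an entity, over the same 8 mentions).

    from sklearn.metrics import cohen_kappa_score

    annotator_a = [1, 0, 1, 1, 0, 0, 1, 0]
    annotator_b = [1, 0, 1, 0, 0, 0, 1, 1]
    # kappa corrects raw agreement (6/8 here) for chance agreement
    print(cohen_kappa_score(annotator_a, annotator_b))   # 0.5, moderate agreement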
• 48. Sample Size
§ How many samples are needed for the required performance of the specific task? – No definite answer
§ Many studies report results on several hundred documents
§ Sample size can be estimated based on a power calculation
§ More precisely, we can plot a learning curve
[Figure: learning curve – F-measure (0.25 to 0.85) vs. number of sentences in the training set (8 to 8,192), uncertainty-based sampling]
• 49. Challenges and potential solutions
§ Annotation cost/time
§ Requires a reasonably sized annotated corpus
§ Annotation by experts (e.g., physicians) is expensive
§ Technologies to save annotation time
§ Weak supervision – get lower-quality labels more efficiently
§ Transfer learning – leverage labeled data/models from a different domain/task
§ Active learning – label informative samples to build better models
• 50. Summary
§ Data and annotation play important roles in machine learning/deep learning-based NLP systems
§ A good annotated corpus that leads to high-performance ML models should include:
§ An annotation guideline designed for the task
§ Knowledgeable and well-trained annotators
§ Enough annotated samples
§ Annotation can be costly and time-consuming
• 51. Part 4: Bring Human into the Loop (Human annotation, Rule augmentation, Biomedical knowledge bases)
• 52. Rule Augmentation is Effective
Task: 2018 n2c2 Drug-ADE challenge (relation F1, with and without rule-based post-processing)

Relation            SVM     +post    CNN-RNN  +post    biLSTM-CRF  +post
Strength -> Drug    0.9704  0.9792   0.9760   0.9853   0.9865      0.9916
Dosage -> Drug      0.9637  0.9798   0.9642   0.9818   0.9720      0.9860
Duration -> Drug    0.84    0.8947   0.8519   0.9125   0.8829      0.9292
Frequency -> Drug   0.9525  0.9735   0.9592   0.9810   0.9692      0.9873
Form -> Drug        0.9728  0.9867   0.9713   0.9864   0.9765      0.9890
Route -> Drug       0.9581  0.9742   0.9668   0.9805   0.9736      0.9858
Reason -> Drug      0.7328  0.8364   0.7464   0.8466   0.7579      0.8488
ADE -> Drug         0.7604  0.8221   0.7528   0.8112   0.7946      0.8502
Overall             0.9256  0.9521   0.9304   0.9574   0.9399      0.9630

Wei Q, et al. A study of deep learning approaches for medication and adverse drug event extraction from clinical text. JAMIA 2020;27(1):13-21.
• 53. Active Learning to Reduce Annotation Cost
§ Goal: minimize annotation cost while maximizing the quality of the ML-based model
[Figure: the active learning loop – the machine learner selects the most informative samples from a pool of unlabeled data, a human annotator labels them, and the labeled data retrains the learner; passive learning, by contrast, selects samples randomly]
• 54. Querying Algorithms
§ Uncertainty-based querying
§ Clustering and uncertainty sampling engine (CAUSE): query the most uncertain and representative sentences
Example (number of queries = 2): given clusters c1, c2, c3 with uncertainty scores Score(c1) = 0.6, Score(c2) = 0.4, Score(c3) = 0.1, the steps are (1) cluster scoring and (2) representative sampling; the outputs are samples a and c.
Chen Y et al. An active learning-enabled annotation system for clinical named entity recognition. BMC Med Inform Decis Mak. 2017
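A minimal sketch of the uncertainty half of such querying (least confidence); CAUSE additionally clusters the pool and samples representatives, which is omitted here. The `pool_probs` array stands in for any probabilistic model's predictions and is randomly fabricated.

    import numpy as np

    def least_confidence(probs):
        # probs: (n_samples, n_classes) predicted probabilities over the pool
        return 1.0 - probs.max(axis=1)     # high score = model is unsure

    rng = np.random.default_rng(0)
    pool_probs = rng.dirichlet(np.ones(3), size=10)   # fake model outputs
    scores = least_confidence(pool_probs)
    query_idx = np.argsort(scores)[::-1][:2]          # 2 most informative samples
    print("query these pool items next:", query_idx)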
• 55. Simulation Study
[Figure: learning curves – F-measure (0.25 to 0.85) vs. number of sentences in the training set (8 to 8,192) for uncertainty, diversity, length, and random querying]

Annotation cost to reach a model with 0.80 F-measure:
            Random sampling   Uncertainty sampling   Reduction
Sentences   8,702             2,971                  66%
• 57. Active Learning Workflow
[Flowchart: load the data pool and start; the Learning component encodes a CRF model and ranks unlabeled sentences with the querying algorithm; the Annotation interface presents the top-ranked unlabeled sentence to the user, whose annotation moves it to the labeled set; the loop repeats until the user quits or time runs out]
• 58. Real-Time User Study on Active vs. Passive Learning
• 59. Cost-aware Active Learning
Feature categories:
§ Basic: number of words (NOW), number of entities (NOE), number of entity words (NOEW)
§ Syntactic: entropy of POS tags (EOP)
§ Semantic: TF-IDF

Example sentence: "MRI by report showed bilateral rotator cuff repairs and he was admitted for repair of the left rotator cuff."
NOW = 20, NOE = 3, NOEW = 11, TFIDF = 35.36, EOP = 2.28

A linear regression model estimates annotation time from the basic, semantic, and syntactic complexity features of the sentence:
Cost(s) = w0 + Σ_i w_i · x_i(s)
Samples are then ranked by utility per cost: UPC(s) = Utility(s) / Cost(s)
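A toy sketch of the cost-aware idea: regress annotation time on the sentence features above, then rank pool samples by utility per predicted cost. All numbers (feature values, seconds, utilities) are fabricated for illustration.

    import numpy as np
    from sklearn.linear_model import LinearRegression

    # Columns: NOW, NOE, NOEW, TFIDF, EOP
    X = np.array([[20, 3, 11, 35.4, 2.3],
                  [ 8, 1,  2, 10.1, 1.4],
                  [30, 5, 14, 50.2, 2.9]])
    seconds = np.array([42.0, 15.0, 66.0])        # observed annotation times
    cost_model = LinearRegression().fit(X, seconds)

    utility = np.array([0.9, 0.4, 0.8])           # e.g., model uncertainty
    upc = utility / cost_model.predict(X)         # UPC(s) = Utility(s) / Cost(s)
    print(upc.argsort()[::-1])                    # annotate highest-UPC sentences first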
• 60. A Larger User Study
8 out of 9 users showed better performance with active learning than with random sampling; AL saves 20-30% of annotation time.
Wei Q et al. Cost-aware active learning for named entity recognition in clinical text. JAMIA 2019
• 61. Human Annotation Process is Complicated
§ Annotation speed vs. quality
[Figure: per-user statistics of annotations – speed (words/minute, 0-100) and quality (F1 score, 68-88) across 9 users]
§ Syntactic structure impact (regression coefficients per user; * = significant):

       user1    user2   user3   user4   user5    user6    user7   user8   user9
DD     -0.038*  -0.036  -0.002  -0.036  -0.088*  -0.107*  -0.273  0.001   -0.13*
EOP    0.338*   0.245   1.046*  -0.459  0.695*   -1.066*  -0.459  -0.349  0.95*
NOP    0.243    0.705   -0.372  -0.609  0.431*   0.583*   0.478   -0.147  -0.297
ISC    -0.083   -0.397  -0.38   -0.196  0.421*   -0.737*  -0.452  -0.49*  -0.261
NOV    0.402*   0.634*  -0.35*  0.201   -0.22    0.139    1.175*  0.514*  1.17*
DOP    -0.234   -0.903  0.58*   0.592   -0.756*  1.213*   -0.885  0.716*  1.25*

DD – dependency distance; EOP – entropy of POS tags; NOP – number of phrase nodes; ...
Wei Q, et al. AMIA 2018
• 62. Mapping to Standard Clinical Terminologies is Important
§ Encoding (entity linking) – find the corresponding concept ID in a terminology for a given term/entity
§ Example:
§ Entity: "right below-knee amputation"
§ Candidates:
• 1: C2202463 amput below knee leg right
• 2: C0002692 amput below knee
• 3: C0002692 amput below bka knee
• ...
§ Challenges: lexical variation, polysemy, granularity differences
• 63. Entity Linking Framework – Map to UMLS
Pipeline: NE term → query expansion (LVG / abbreviations / synonyms / adjective-to-noun, ...) → query over UMLS concepts (via a UMLS index builder and UMLS index) → ranking by similarity scores (learning to rank) → post-processing, adjust CUI offsets
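A minimal sketch of the candidate-ranking step: character n-gram TF-IDF similarity between a mention and candidate concept names. Real systems in the next slide use BM25 retrieval plus learning-to-rank or BERT re-ranking; the CUIs below come from the slide's example, and the ranking method here is a stand-in.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    candidates = {
        "C2202463": "amputation below knee leg right",
        "C0002692": "amputation below knee",
    }
    mention = "right below - knee amputation"

    # Character n-grams are robust to word order and lexical variation
    vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))
    matrix = vec.fit_transform([mention] + list(candidates.values()))
    sims = cosine_similarity(matrix[0], matrix[1:]).ravel()
    for (cui, name), s in sorted(zip(candidates.items(), sims), key=lambda x: -x[1]):
        print(cui, name, round(s, 3))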
• 64. Encoding Algorithms and Performance on Benchmark Data

Task                          Dataset                        Method                                                              Accuracy
SNOMED-CT, clinical text      2013 ShARe/CLEF, 2014 SemEval  BM25 + domain knowledge + RankSVM (#1 in challenge) (Zhang, 2014)   0.873
                                                             BM25 + domain knowledge + CNN (Tang, 2017)                          0.903
                                                             BM25 + BERT (Ji, 2019)                                              0.911
MedDRA, drug labels           2018 TAC ADR                   BM25 + translation model + RankSVM (#1 in challenge) (Xu, 2018)     0.911
                                                             BM25 + BERT (Ji, 2019)                                              0.932
MeSH, biomedical literature   NCBI                           BM25 + domain knowledge + CNN (Tang, 2017)                          0.861
                                                             BM25 + BERT (Ji, 2019)                                              0.891
  • 65. Summary § Rules are still important when optimizing performance of biomedical NLP systems – hybrid approaches often achieve best performance § Interacting human with data/algorithm is one way to improve model performance while reducing annotation cost § Biomedical ontologies and other knowledge bases are valuable for many NLP applications
  • 66. Integrate All to Better Support End-Users The CLAMP system
  • 67. CLAMP - Clinical Language Annotation, Modeling, and Processing
• 68. CLAMP Algorithms: Track Record in Clinical NLP Challenges

Task                  Challenge                                                 Ranking
Named entity          2009 i2b2, medication information extraction              #2
recognition           2010 i2b2, problem/treatment/test extraction              #2
                      2013 ShARe/CLEF, abbreviation recognition                 #1
                      2016 CEGS N-GRID, de-identification                       #2
UMLS encoding         2014 SemEval, disorder encoding                           #1
Relation extraction   2012 i2b2, temporal information extraction                #1
                      2015 SemEval, disease-modifier extraction                 #1
                      2015 BioCreative, chemical-induced disease (literature)   #1
                      2016 SemEval, temporal information extraction             #1
                      2017 TAC, ADR extraction from drug labels                 #1
                      2018 n2c2, medication and associated ADRs                 #1

The DeepMed framework: CRFs, SSVMs, Bi-LSTM-CRF, BERT/BioBERT, ..., AutoML, Docker container
  • 70. CLAMP Rule Interface for Human
• 71. CLAMP Users
Available as:
• CLAMP-CMD
• CLAMP-GUI
• CLAMP-EE
• 72. When developing biomedical NLP applications, please
§ Identify the right NLP tasks for your projects
§ Assemble a development team with the required expertise (domain experts, business owners, informaticians, developers, ...)
§ Collect and annotate data following a standard protocol (guideline development, annotator training/agreement checks, annotation quality control, ...)
§ Select appropriate algorithms (accuracy, speed, implementation, ...) and carefully evaluate their performance/usability/interoperability
§ Keep humans (multidisciplinary) in the loop throughout the life cycle of development
  • 73. 1 Transfer Learning of NLP in Medicine: A Case Study with BERT Yifan Peng Department of Population Health Sciences
• 74. Transfer learning
u A technique that reuses a model already trained on one dataset by adapting it to a different dataset
u In the field of computer vision, researchers have repeatedly shown the value of transfer learning
u Two steps:
u Pre-training: use a large training set to learn network parameters and save them for later use
u Fine-tuning: train all (or part) of the layers of the pretrained network on the target dataset
  • 75. 3 u Step 1: Pre-train the CNN on ImageNet (14 million images) Example: Detect lung diseases from chest X-ray
  • 76. 4 u Step 2: Fine-tune the model on NIH Chest X-ray (100,000 chest X-ray images) Example: Detect lung diseases from chest X-ray Wang et al., ChestX-ray8: Hospital-scale Chest X-ray Database and Benchmarks on Weakly-Supervised Classification and Localization of Common Thorax Diseases. CVPR. 2017.
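A minimal sketch of this two-step recipe in PyTorch/torchvision: load ImageNet-pretrained weights, replace the classifier head, and fine-tune. Dataset loading is omitted; `num_findings = 14` matches a ChestX-ray14-style label set and the training loop is indicated in comments only.

    import torch
    import torchvision

    # Step 1: start from a network pre-trained on ImageNet
    model = torchvision.models.resnet50(weights="IMAGENET1K_V1")
    num_findings = 14
    model.fc = torch.nn.Linear(model.fc.in_features, num_findings)  # new head

    # Step 2: fine-tune all (or part of the) layers on the target dataset
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    criterion = torch.nn.BCEWithLogitsLoss()   # multi-label chest findings
    # for images, labels in chest_xray_loader:   # assumed DataLoader
    #     loss = criterion(model(images), labels)
    #     optimizer.zero_grad(); loss.backward(); optimizer.step()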
• 77. Transfer learning
u Makes it less difficult to train a complex network
u Speeds up the convergence of training
u How can transfer learning benefit NLP in medicine?
• 78. Outlines
u Word embedding
u ELMo
u BERT
u How to use pre-trained BERT
u Performance comparison of BERT in medicine
u Multi-task learning
• How were BERT's ideas gradually formed?
• What has been innovated?
• Why does it work well?
• 79. How do we represent the meaning of a word?
u There are an estimated 13 million words in the English language
u They are not completely unrelated: hotel vs. motel vs. dog
u We want to encode each word into some representation that the machine can understand
  • 80. 8 u Encode similarity in the vectors themselves u Some N-dimensional space (e.g., 200D) that is sufficient to encode all semantics of our language. u Each dimension would encode some meaning that we transfer using speech. u tense (past vs. present vs. future) u count (singular vs. plural) Word vector
• 81. Word Embeddings
u Word2vec, fastText, etc.
u BioWordVec: https://github.com/ncbi-nlp/BioWordVec

Sources                    Documents   Tokens
PubMed                     30M         4,000M
MIMIC-III clinical notes   2M          500M

Zhang et al., Improving biomedical word embeddings with subword information and MeSH. Scientific Data. 2019
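A minimal sketch of querying pre-trained vectors with gensim; the file name is a placeholder for whichever BioWordVec binary is downloaded from the GitHub link above, and the analogy query previews the "linear translations" slide below.

    from gensim.models import KeyedVectors

    # Placeholder path; use the actual BioWordVec release file
    wv = KeyedVectors.load_word2vec_format(
        "BioWordVec_PubMed_MIMICIII_d200.vec.bin", binary=True)

    print(wv.similarity("thirsty", "hunger"))          # cosine similarity
    print(wv.most_similar(positive=["aunt", "man"],
                          negative=["woman"])[:3])     # aunt - woman + man ≈ uncle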
• 82. Interesting semantic patterns emerge in the vectors

Word pair                        word2vec   BioWordVec
thalassemia, hemoglobinopathy    —          0.834
mycosis, histoplasmosis          0.353      0.706
thirsty, hunger                  0.252      0.629
influenza, pneumoniae            0.482      0.611
atherosclerosis, angina          0.503      0.589

Zhang et al., Improving biomedical word embeddings with subword information and MeSH. Scientific Data. 2019
  • 83. 11 Interesting syntactic patterns emerge in the vectors Rohde et al. An Improved Model of Semantic Similarity Based on Lexical Co-Occurrence. 2005
• 84. Linear translations
u Algebraic relations:
u vec("man") − vec("woman") + vec("aunt") ≈ vec("uncle")
u vec("man") − vec("woman") + vec("queen") ≈ vec("king")
Tomas Mikolov, Wen-tau Yih, Geoffrey Zweig, "Linguistic Regularities in Continuous Space Word Representations", NAACL-HLT 2013
• 85. How to use Word Embeddings
[Figure: embedding layers feeding a Convolutional Neural Network and a Recurrent Neural Network]
Peng et al. Extracting chemical-protein relations with ensembles of SVM and deep learning models. Database. 2018.
• 86. Word Embeddings in DL
u Evaluation of word embeddings in the protein-protein interaction (PPI) extraction task (F1)

Data set   Word2vec   BioWordVec
AIMed      0.445      0.487
BioInfer   0.524      0.549
BioInfer   0.603      0.623
IEPA       0.484      0.511
HPRD50     0.679      0.713
• 87. Limitations of Word Embeddings
The polysemy problem:
u "I arrived at the bank after crossing the river"
u "The bank has a plan to branch through the country..."
Static word embeddings cannot distinguish the different senses of a polysemous word.
• 88. From Word Embedding to ELMo
u ELMo ("Embeddings from Language Models") adjusts the word embedding representation of a word according to the semantics of its context
Peters et al., Deep contextualized word representations. NAACL. 2018
• 89. ELMo
u A typical two-stage process
u The first stage uses a language model for pre-training
[Figure: a bidirectional language model predicts the target word from its left context and right context, e.g., "no evidence of infiltrate"]
• 90. ELMo
u A typical two-stage process
u The first stage uses a language model for pre-training
u The second stage extracts the embeddings of each layer
ELMo word representations are functions of the entire input sentence.
• 91. ELMo in medical NLP
u Evaluation of ELMo in named entity recognition and relation extraction tasks

Task                       Dataset      SOTA   ELMo
Named entity recognition   ShARe/CLEF   70.0   75.6
Relation extraction        DDI          72.9   78.9
Relation extraction        ChemProt     64.1   66.6
• 92. Limitations of ELMo
u Hard to capture long-distance information
u Computationally expensive
  • 93. 21 u Bidirectional Encoder Representations from Transformers From ELMo to BERT Devlin et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL. 2019
  • 95. 23 u A self-attention mechanism which directly models relationships between all words in a sentence. Why transformer? https://ai.googleblog.com/2017/08/transformer-novel-neural-network.html
• 97. Why transformer?
u Computation runs in parallel across tokens: much faster and more space-efficient
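A minimal numpy sketch of scaled dot-product self-attention, the mechanism inside the Transformer: every token scores every other token at once, so the whole sentence is processed in parallel. All matrices are random and purely illustrative.

    import numpy as np

    def self_attention(X, Wq, Wk, Wv):
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        scores = Q @ K.T / np.sqrt(K.shape[-1])        # token-to-token relevance
        scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
        weights = np.exp(scores)
        weights /= weights.sum(axis=-1, keepdims=True) # softmax over tokens
        return weights @ V                             # context-mixed representations

    rng = np.random.default_rng(0)
    X = rng.normal(size=(5, 8))                        # 5 tokens, 8-dim embeddings
    Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
    print(self_attention(X, Wq, Wk, Wv).shape)         # (5, 8)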
• 98. BERT and BlueBERT
u Pre-training: PubMed abstracts and clinical notes
u Fine-tuning: sentence similarity, named entity recognition, relation extraction, etc.

Corpus             Words    Domain
PubMed abstracts   4,000M   Biomedical
MIMIC-III          500M     Clinical
• 99. Outlines
u Word embedding
u ELMo
u BERT
u How to use pre-trained BERT
u Performance of BERT in medicine (BLUE Benchmark)
u Multi-task learning
• How were BERT's ideas gradually formed?
• What has been innovated?
• Why does it work well?
• 100. How to use BERT - Sentence classification
Assign tags or categories to text according to its content, for example:
u Organizing millions of cancer-related references from PubMed into the Hallmarks of Cancer
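A minimal sketch of BERT sentence classification with Hugging Face Transformers: a fresh classification head on top of a pretrained encoder, to be fine-tuned on labeled sentences. The generic `bert-base-uncased` checkpoint and `num_labels=10` are placeholders; for medical text, swap in a clinical/biomedical checkpoint such as the BlueBERT weights linked below.

    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    name = "bert-base-uncased"   # placeholder; use a clinical checkpoint in practice
    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=10)

    batch = tok(["Sustained proliferative signaling drives tumour growth."],
                padding=True, truncation=True, return_tensors="pt")
    logits = model(**batch).logits   # one score per class (e.g., a Hallmark of Cancer)
    print(logits.shape)              # torch.Size([1, 10])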
  • 101. 29 Extract semantic relationships from a text How to use BERT - Relation extraction
• 102. How to use BERT - Sentence similarity
Predict similarity scores based on sentence pairs. For example:
u "The above was discussed with the patient, and she voiced understanding of the content and plan."
u "The patient verbalized understanding of the information and was satisfied with the plan of care."
  • 103. 31 u Locate and classify named entity mentions in text into pre-defined categories How to use BERT - Named entity recognition
  • 104. 32 u Pre-trained models u Fine-tuning codes u Preprocessed texts in PubMed https://github.com/ncbi-nlp/bluebert
• 105. Outlines
u Word embedding
u ELMo
u BERT
u How to use pre-trained BERT
u Performance of BERT in medicine (BLUE Benchmark)
u Multi-task learning
• How were BERT's ideas gradually formed?
• What has been innovated?
• Why does it work well?
• 106. BLUE Benchmark
u Significant advances in the development of pretrained language representations in the general domain: ELMo, BERT, Transformer-XL, XLNet
u The General Language Understanding Evaluation (GLUE) benchmark exists in the general domain, but there was no publicly available benchmark in the biomedical domain
The Biomedical Language Understanding Evaluation (BLUE) benchmark:
u Contains a diverse range of text genres (biomedical literature and clinical notes)
u Highlights common biomedical text-mining challenges
u Promotes development of language representations in the biomedical domain
• 107. BLUE benchmark
[Table: BLUE benchmark datasets] Some datasets are not publicly available, but permission to use them can be requested.
• 108. Results
[Table: results across the five task types – sentence similarity, named entity recognition, relation extraction, document classification, and inference]
• 110. Outlines
u Word embedding
u ELMo
u BERT
u How to use pre-trained BERT
u Performance of BERT in medicine (BLUE Benchmark)
u Multi-task learning
• How were BERT's ideas gradually formed?
• What has been innovated?
• Why does it work well?
• 111. Multi-task learning
u Multi-task learning (MTL) is a field of machine learning where multiple tasks are learned in parallel using a shared representation
u It increases the effective sample size for training the model, improving performance by increasing the generalization of the model
u This is particularly helpful in applications such as medical informatics, where (labeled) datasets are hard to collect
u It may also help when researchers face the hassle of choosing a suitable model for new problems where training resources are limited
• 113. Training procedure
u Pretraining
u BlueBERT: pretrained on PubMed and MIMIC-III
u BioBERT: pretrained on PubMed
u Refining via multi-task learning: refine all layers in the model
u Fine-tuning MT-BERT: continue training all layers on each specific task
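A minimal PyTorch sketch of the MT-BERT idea: one shared encoder and one small head per task, so batches from every task update the shared layers. To stay self-contained, `encoder` below is a tiny stand-in for BERT, and the task sizes and inputs are fabricated.

    import torch
    from torch import nn

    class MultiTaskModel(nn.Module):
        def __init__(self, hidden=128, task_sizes=(2, 3, 5)):
            super().__init__()
            # Shared representation (stand-in for a BERT encoder)
            self.encoder = nn.Sequential(nn.Linear(768, hidden), nn.ReLU())
            # One lightweight classification head per task
            self.heads = nn.ModuleList([nn.Linear(hidden, n) for n in task_sizes])

        def forward(self, x, task_id):
            return self.heads[task_id](self.encoder(x))

    model = MultiTaskModel()
    x = torch.randn(4, 768)                    # e.g., [CLS] vectors from BERT
    loss = nn.functional.cross_entropy(model(x, task_id=1),
                                       torch.randint(0, 3, (4,)))
    loss.backward()                            # gradients flow into the shared encoder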
  • 114. 42 Test results on clinical tasks • Fine-tuning BERT (4 models) • Refining via Multi-task learning (1 model) • Refine all layers in the model • Fine-tuning MT-BERT (4 models) • Continue training all layers on each specific task
  • 115. 43 Test results on biomedical tasks • Fine-tuning BERT (4 models) • Refining via Multi-task learning (1 model) • Refine all layers in the model • Fine-tuning MT-BERT (4 models) • Continue training all layers on each specific task
  • 116. 44 Test results on eight BLUE tasks
  • 117. 45 u Word embeddings à ELMo à BERT u Pre-trained BERT models u How to use BERT u Performance comparison and benchmark u Multi-task learning Summary
  • 118. 46 u https://github.com/ncbi-nlp/BioWordVec u https://github.com/ncbi-nlp/BioSentVec u https://github.com/ncbi-nlp/bluebert u https://github.com/ncbi-nlp/BLUE Resources
• 119. Acknowledgment
u BERT, ELMo, and MT-DNN
u Shared tasks and datasets: BIOSSES, MedSTS, BioCreative V chemical-disease relation task, ShARe/CLEF eHealth task, DDI extraction 2013 task, BioCreative VI ChemProt, i2b2 2010 shared task, Hallmarks of Cancer corpus
u This work was supported by the Intramural Research Programs of the National Library of Medicine, National Institutes of Health, and K99LM013001.
  • 121. AIME 2020 Digital Phenotyping for Cohort Discovery using Electronic Health Records Yanshan Wang Assistant Professor of Biomedical Informatics Division of Digital Sciences Research Mayo Clinic
• 122. AIME 2020 Why take this tutorial?
• Patient cohort retrieval is still labor-intensive today.
• Most information is embedded in unstructured EHRs.
• Natural language processing is under-utilized for cohort retrieval.
• 123. AIME 2020 Goal of this tutorial
• To gain an understanding of basic concepts of cohort retrieval in the clinical domain.
• To connect NLP theory with clinical knowledge.
• To get an introduction to clinical use cases of cohort retrieval.
  • 125. AIME 2020 Suggested reading • Papers • A review of approaches to identifying patient phenotype cohorts using electronic health records. Shivade et al. 2013. • Case-based reasoning using electronic health records efficiently identifies eligible patients for clinical trials. Miotto et al. 2015. • A survey of practices for the use of electronic health records to support research recruitment. Obeid et al. 2017. • Clinical information extraction applications: a literature review. Wang et al. 2018 • Using clinical natural language processing for health outcomes research: Overview and actionable suggestions for future advances. Velupillai et al. 2018. 5
  • 126. AIME 2020 Agenda • Basic Concepts • EHR, Phenotyping, Evidence-based Clinical Research, Knowledge Base, Common Data Model • Patient Cohort Discovery • Brief Introduction to NLP • NLP for Cohort Discovery
  • 127. AIME 2020 Basic Concepts • Electronic Health Record • Phenotyping • Evidence-based clinical research • Knowledge bases • Common Data Model
  • 128. AIME 2020 Basic Concepts • Electronic Health Record
• 129. AIME 2020 Basic Concepts
• Phenotyping
• The phenotype (as opposed to genotype, the set of genes in our DNA responsible for a particular trait) is the physical expression, or characteristics, of that trait.
• Phenotyping is the practice of developing algorithms designed to identify specific phenomic traits within an individual1.
• Digital phenotyping using EHRs
• Traditionally, clinical studies often use self-report questionnaires or clinical staff to obtain phenotypes from patients (slow, expensive, and does not scale).
• EHR data come in both structured and unstructured formats, and the use of both types of information can be essential for creating accurate phenotypes2.
1. eMERGE network. 2. Wei, W. Q., & Denny, J. C. (2015). Extracting research-quality phenotypes from electronic health records to support precision medicine. Genome medicine, 7(1), 41.
• 130. AIME 2020 Basic Concepts
• Phenotyping: the practice of developing algorithms designed to identify specific phenomic traits within an individual1.
• Digital phenotyping using EHRs: EHR data come in both structured and unstructured formats, and the use of both types of information can be essential for creating accurate phenotypes2 (NLP handles the unstructured part).
Source: Wei, W. Q., & Denny, J. C. (2015). Extracting research-quality phenotypes from electronic health records to support precision medicine. Genome medicine, 7(1), 41.
  • 131. AIME 2020 Evidence-based clinical research • Observational studies • Types of studies in epidemiology, such as the cohort study and the case-control study. • The investigators retrospectively assess associations between the treatments given to participants and their health status. • Randomized control trials • Clinical trials are prospective biomedical or behavioral research studies on human participants that are designed to answer specific questions about biomedical or behavioral interventions including new treatments, such as novel vaccines, drugs, and medical devices.
  • 132. AIME 2020 Basic Concepts • Cohort/Eligibility Criteria • Inclusion criteria • Exclusion criteria
  • 133. AIME 2020 Basic Concepts • Cohort/Eligibility Criteria • Inclusion criteria • Exclusion criteria https://clinicaltrials.gov/ct2/show/NCT03690193?cond=alzheimer%27s+disease&rank=5 clinicaltrials.gov
• 138. AIME 2020 Basic Concepts
• Knowledge Bases
• UMLS (Unified Medical Language System), including the Metathesaurus, Semantic Network, and SPECIALIST Lexicon
• Used as a knowledge base and as a resource for a lexicon. The Metathesaurus provides the medical concept identifiers; the Semantic Network specifies the semantic categories for the medical concepts.
• SNOMED-CT
• Standardized vocabulary of clinical terminology.
• LOINC
• Standardized vocabulary for identifying health measurements, observations, and documents.
• MeSH
• NLM controlled vocabulary thesaurus used for indexing PubMed articles.
• MedDRA
• Terminology specific to adverse events.
• RxNorm
• Terminology specific to medications.
  • 139. AIME 2020 Basic Concepts • Common Data Model • Common Data Model (CDM) is a specification that describes how data from multiple sources (e.g., multiple EHR systems) can be combined. Many CDMs use a relational database. • Observational Medical Outcomes Partnership (OMOP) CDM by Observational Health Data Sciences and Informatics (OHDSI)
  • 140. AIME 2020 OMOP CDM v. 5.0 Source: https://www.ohdsi.org/data-standardization/the-common-data-model/
  • 141. AIME 2020 Why Natural Language Processing (NLP)?
• 142. AIME 2020 Facts
• Artificial Intelligence (AI) is one of the most interesting fields of research today.
• The growth of and interest in AI is due to recent advances in deep learning.
• Language is the most compelling manifestation of intelligence.
• 145. AIME 2020 Natural Language Processing
• What is NLP?
• "Natural language processing (NLP) is a subfield of computer science, information engineering, and artificial intelligence concerned with the interactions between computers and human (natural) languages, in particular how to program computers to process and analyze large amounts of natural language data." (Wikipedia)
  • 147. AIME 2020 Question Answering 27 Source: https://www.youtube.com/watch?v=BkpAro4zIwU IBM Watson Voice Assistant
  • 149. AIME 2020 Information Extraction 29 The patient’s maternal grandmother was diagnosed with breast cancer at age 59 and passed away at age 80.
• 150. AIME 2020 Information Extraction
The patient's maternal grandmother was diagnosed with breast cancer at age 59 and passed away at age 80.
Entity normalization: The patient's FAMILY_MEMBER was diagnosed with CONDITION at age AGE and LIVING_STATUS at age AGE.
Dependency parser output: Family Member = maternal grandmother; Condition = breast cancer; Age = 59; Living Status = deceased; Age = 80
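A minimal regex sketch of the slide's template over this one sentence; real systems combine entity normalization with a dependency parser rather than a single hand-written pattern, and the group names below are illustrative.

    import re

    text = ("The patient's maternal grandmother was diagnosed with breast cancer "
            "at age 59 and passed away at age 80.")
    pattern = (r"patient's (?P<member>[\w ]+?) was diagnosed with "
               r"(?P<condition>[\w ]+?) at age (?P<age>\d+)"
               r"(?: and (?P<status>passed away|is alive) at age (?P<age2>\d+))?")
    m = re.search(pattern, text)
    print(m.groupdict())
    # {'member': 'maternal grandmother', 'condition': 'breast cancer',
    #  'age': '59', 'status': 'passed away', 'age2': '80'}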
  • 151. AIME 2020 Sentiment Analysis 31 ■ nice and compact to carry! ■ since the camera is small and light, I won't need to carry around those heavy, bulky professional cameras either! ■ the camera feels flimsy, is plastic and very light in weight you have to be very delicate in the handling of this camera Reviews: Attributes: zoom affordability size and weight flash ease of use ✓ ✗ ✓ Source: https://web.stanford.edu/~jurafsky/NLPCourseraSlides.html
• 155. AIME 2020 How to represent Natural Language
• Natural language text = sequences of discrete symbols (e.g., words).
• Vector representations of words: Vector Space Model
• Bag-of-words: each word is a one-hot vector over the vocabulary list (here, "I love NLP and like dogs"):

I     1 0 0 0 0 0
love  0 1 0 0 0 0
NLP   0 0 1 0 0 0
like  0 0 0 0 1 0
• 156. AIME 2020 How to represent Natural Language
• Drawbacks of this sparse representation:
• love = [0,1,0,0,0,0] and like = [0,0,0,0,1,0]: their dot product is 0!
• Using such a representation, there is no meaningful (semantic) comparison we can make between words.
• 157. AIME 2020 How to represent Natural Language
• With AI models, we learn the "meaning" of a word using a dense semantic representation (word embeddings)
• Learning semantic representations from data (a corpus)
• Simply by examining a large corpus, it is possible to learn word vectors that capture the semantics of and relationships between words in a surprisingly expressive way.

Dense semantic representation (values illustrative):
I    = [ 0.99 0.05 0.10 0.87 0.10 0.10 ]
love = [ 0.10 0.85 0.99 0.10 0.83 0.09 ]
NLP  = [ 0.67 0.23 0.01 0.02 0.01 0.81 ]
like = [ 0.10 0.73 0.99 0.05 1.79 0.09 ]
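A toy contrast between the sparse one-hot vectors two slides above and these dense vectors: cosine similarity is exactly 0 for any two distinct one-hot words, but high for semantically related dense vectors (the numbers are the illustrative values from the slides).

    import numpy as np

    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    love_onehot = np.array([0, 1, 0, 0, 0, 0])
    like_onehot = np.array([0, 0, 0, 0, 1, 0])
    print(cos(love_onehot, like_onehot))           # 0.0 - no notion of similarity

    love_dense = np.array([0.10, 0.85, 0.99, 0.10, 0.83, 0.09])
    like_dense = np.array([0.10, 0.73, 0.99, 0.05, 1.79, 0.09])
    print(round(cos(love_dense, like_dense), 2))   # ~0.92 - "love" is close to "like"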
  • 158. AIME 2020 NLP in AI is All About Learning A Better Representation of Language
• 159. AIME 2020 • Images are easy to represent, as RGB values
Source: https://analyticsindiamag.com/computer-vision-primer-how-ai-sees-an-image/
  • 160. AIME 2020 • Language is much harder… • “The weather’s looking gloomy today. I better wear my trusty rubber boots!” • “The weather’s looking gloomy today. I’m going to stay inside.”1 • Will, will Will will Will Will's will? – Will (a person), will (future tense helping verb) Will (a second person) will (bequeath) [to] Will (a third person) Will's (the second person) will (a document)? (Someone asked Will 1 directly if Will 2 plans to bequeath his own will, the document, to Will 3)2 Source: 1. Carlson L. Moral and Linguistic Perspectives on Pain and Suffering in Doctor-Patient Discourse. UMN Thesis. 2. Han, Bianca-Oana (2015). "On Language Peculiarities: when language evolves that much that speakers find it strange" (PDF). Philologia (18): 140. ISSN 1582-9960. Archived (PDF) from the original on 14 October 2015.
• 161. AIME 2020 Learning Better Representations
[Figure: "representation learning" – from language features to better word representations]
Source: https://www.datasciencecentral.com/profiles/blogs/overview-of-artificial-intelligence-and-role-of-natural-language
  • 162. AIME 2020 NLP for Patient Cohort Discovery
• 163. AIME 2020 Clinical Research Pathway
Research Question → Protocol Design → Feasibility → Identify Patients → Invite Patients → Pre-screening → Consent → Analysis → Report
• 165. AIME 2020 Clinical Trials Eligibility Screening and Recruitment
• Clinical trials recruitment
• Randomized clinical trials are fundamental to the advancement of medicine. However, patient recruitment for clinical trials remains the biggest barrier to clinical and translational research.
• 20% of cancer patients are eligible1, fewer than 5% participate1, and 85% of clinical trials fail to retain enough patients2.
1. Haddad TC, Helgeson J, Pomerleau K, Makey M, Lombardo P, Coverdill S, Urman A, Rammage M, Goetz MP, LaRusso N. Impact of a cognitive computing clinical trial matching system in an ambulatory oncology practice. American Society of Clinical Oncology; 2018.
2. Cote DN. Minimizing Trial Costs by Accelerating and Improving Enrollment and Retention. Global Clinical Trials for Alzheimer's Disease: Elsevier; 2014. p. 197-215.
• 166. AIME 2020
[Figure: number of trials between 2007 and 2010 by academic medical center – Mayo Clinic leads with 268, followed by Johns Hopkins University (190), Duke University (187), MD Anderson Cancer Center (177), Massachusetts General Hospital (170), and roughly two dozen other institutions ranging from 141 down to 68]
Source: Chen et al. Publication and reporting of clinical trial results: cross sectional analysis across academic medical centers. BMJ. 2016
• 167. AIME 2020 NLP for Clinical Trials Eligibility Screening
[Figure: all patients' EHRs plus clinical trial criteria → Natural Language Processing → eligible patients → recruit]
Expedite patient screening and increase patient recruitment rates.
• 169. AIME 2020 Example: Clinical trials eligibility screening for GERD
Identify a cohort of patients with and without chronic reflux using the definitions spelled out below. We wish to test people with and without chronic reflux, as our working hypothesis is that the prevalence of Barrett's esophagus is comparable between those with and without chronic reflux.
Inclusion criteria:
1. Age greater than 50 years.
2. Gastroesophageal reflux disease. This can be defined using ICD-9 or ICD-10 codes. Additional criteria which could be used to define GERD broadly are chronic (> 3 mo) use of a proton pump inhibitor (drug names include omeprazole, esomeprazole, pantoprazole, rabeprazole, dexlansoprazole, lansoprazole) or an H2 receptor blocker (ranitidine, famotidine, cimetidine). Prior endoscopic diagnosis of erosive esophagitis can also be used to make a diagnosis of GERD.
3. Male gender
4. Obesity defined as body mass index greater than or equal to 30. This is a surrogate marker for central obesity.
5. Current or previous history of smoking
6. Family history of esophageal adenocarcinoma/cancer or Barrett's esophagus
Exclusion criteria:
1. Previous history of esophageal adenocarcinoma/cancer or Barrett's esophagus, previous history of endoscopic ablation for Barrett's esophagus.
2. Previous history of esophageal squamous cancer or squamous dysplasia.
3. Treatment with oral anticoagulation including warfarin/Coumadin.
4. History of cirrhosis or esophageal varices
5. History of Barrett's esophagus: this can be defined with ICD-9/10 codes.
6. History of endoscopy (will need to use a procedure code for EGD) in the last 5 years.
• 170. AIME 2020 Mapping the criteria to structured codes and medications
Inclusion:
1. Age greater than 50 years.
2. Gastroesophageal reflux disease (any of 2.1, 2.2, 2.3)
  2.1 GERD defined by diagnosis: ICD-9 530.81 / ICD-10 K21.9
  2.2 GERD defined by drug, duration of use >= 3 months over the last 5 years: omeprazole, esomeprazole, pantoprazole, rabeprazole, dexlansoprazole, lansoprazole, ranitidine, famotidine, cimetidine
  2.3 GERD defined by prior endoscopic diagnosis of erosive esophagitis: ICD-9 530.19 / ICD-10 K21.0 (no specific code exists for erosive esophagitis)
3. Male gender
4. Obesity defined as body mass index >= 30.
5. Current or previous history of smoking
6. Family history of esophageal adenocarcinoma/cancer or Barrett's esophagus
7. Caucasian
Exclusion:
1. Previous history of esophageal adenocarcinoma/cancer: ICD-9 150.9 / ICD-10 C15.9
2. Previous history of endoscopic ablation for Barrett's esophagus: CPT 43229, 43270, 43228, 43258
3. Previous history of esophageal squamous carcinoma (included in 1): ICD-9 150.9 / ICD-10 C15.9
4. Previous history of esophageal squamous dysplasia: ICD-9 622.10 / ICD-10 N87.9
5. Current treatment with oral anticoagulation - warfarin
6. Current treatment with oral anticoagulation - Coumadin (included in 5)
7. History of cirrhosis: ICD-9 571.5 / ICD-10 K74.60
8. History of esophageal varices: ICD-9 456.20 / ICD-10 I85.00
9. History of Barrett's esophagus: ICD-9 530.85 / ICD-10 K22.7, K22.710, K22.711, K22.719
10. History of endoscopy in the last 5 years: CPT 43235-43270
• 171. AIME 2020 Criteria without reliable structured codes (e.g., 2.2 duration of drug use, 2.3 erosive esophagitis, smoking history, family history) are targeted by the NLP-based digital phenotyping algorithm (next slide).
• 172. AIME 2020
Screening patients by inclusion criteria 1, 3, 4, 7 and all exclusion criteria using i2b2 → patient set A (n=31,749)
From patient set A, screening by inclusion criterion 2.1 using i2b2 → patient set B (n=8,667)
From patient set A, screening by inclusion criterion 2.2 using i2b2 → patient set C (n=1,577)
From patient set A, screening by inclusion criteria 2.3, 5, 6 using ACE and NLP → patient set D (n=230)
Union of patient sets B, C, and D → patient set E (n=9,080)
• 173. AIME 2020 Architecture of Current Solutions
[Figure: structured data is queried first, results are post-processed using unstructured data, collated, and presented in a user interface (visualization, analytics, reporting, etc.)]
• 174. AIME 2020 An Integrated Framework
[Figure: structured data and unstructured data are queried together; results are collated and presented in a user interface (visualization, analytics, reporting, etc.)]
  • 175. AIME 2020 Information Retrieval for Cohort Discovery • Cohort retrieval is similar to modern search engines.
• 176. AIME 2020 CREATE (Cohort Retrieval Enhanced by the Analysis of Text from Electronic Health Records)
[Figure: the end user's query, e.g., "Adults with inflammatory bowel disease (ulcerative colitis or Crohn's disease)", is parsed/edited and transformed into a structured query over coded data (ICD-9/10, CPT, SNOMED CT, ...) and a full-text query over clinical texts; structured EHR data and NLP-derived unstructured concepts are indexed separately, and the structured index yields a filtered cohort while the unstructured index, with machine learning, yields the relevant cohort]
Liu et al. CREATE: Cohort Retrieval Enhanced by Analysis of Text from Electronic Health Records using OMOP Common Data Model. 2019.
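A minimal sketch of treating cohort discovery as search: index each patient's note text and rank patients against a free-text criterion with BM25 (here via the open-source rank-bm25 package). CREATE additionally normalizes criteria to OMOP concepts and combines structured filters; the patient notes below are toy strings.

    from rank_bm25 import BM25Okapi   # pip install rank-bm25

    patients = {
        "pt1": "ulcerative colitis managed with mesalamine, colonoscopy 2018",
        "pt2": "type 2 diabetes mellitus, metformin, no GI history",
        "pt3": "crohn's disease with prior bowel resection",
    }
    corpus = [doc.lower().split() for doc in patients.values()]
    bm25 = BM25Okapi(corpus)

    query = "adults with inflammatory bowel disease ulcerative colitis or crohn's disease"
    scores = bm25.get_scores(query.lower().split())
    print(sorted(zip(patients, scores), key=lambda x: -x[1]))   # pt1/pt3 rank first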
• 180. AIME 2020 Another Way of Thinking of Cohort Retrieval: Patient Representation
[Figure: EHR data plus AI produce a vector representation per patient (e.g., Patient 1: -0.0011, -0.0008, -0.0050, ...; Patient 2: 0.0108, -0.0194, 0.0101, ...); a clinical trial target is embedded in the same space, and similarity measurement identifies eligible patients]
• 181. AIME 2020 Unsupervised Machine Learning for Patient Representation using EHRs
Poisson Dirichlet Model: an unsupervised generative probabilistic machine learning model.
[Figure: graphical models of Latent Dirichlet Allocation (LDA) and the Poisson Dirichlet Model (PDM)]
Wang et al. Unsupervised Machine Learning for the Discovery of Latent Disease Clusters and Patient Subgroups Using Electronic Health Records. Journal of biomedical informatics. 2019.
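A minimal sketch of topic-model patient representations with LDA (the paper's Poisson Dirichlet Model has no off-the-shelf scikit-learn implementation, so LDA stands in here). Each patient is treated as a "document" of diagnosis/term tokens; the tokens are fabricated for illustration.

    from sklearn.decomposition import LatentDirichletAllocation
    from sklearn.feature_extraction.text import CountVectorizer

    patients = ["copd bronchiectasis dyspnea smoker",
                "osteoporosis fracture vitamin_d",
                "dementia delirium confusion fall"]
    counts = CountVectorizer().fit_transform(patients)
    lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)
    patient_vectors = lda.transform(counts)   # low-dimensional patient representation
    print(patient_vectors.round(2))           # rows can be compared by similarity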
• 182. AIME 2020 Unsupervised Machine Learning for Patient Representation using EHRs
[Figure: EHR data plus AI support discovering disease clusters (e.g., diabetes comorbidities) and discovering patient subgroups, which enhance disease risk prediction, help discover new underlying disease mechanisms, and enable personalized care, diagnosis, treatment, and prevention]
Wang et al. Unsupervised Machine Learning for the Discovery of Latent Disease Clusters and Patient Subgroups Using Electronic Health Records. Journal of biomedical informatics. 2019.
• 184. AIME 2020 Patient Subgroups
[Figure: patient subgroups discovered within the osteoporosis, delirium/dementia, and COPD/bronchiectasis cohorts]
Wang et al. Unsupervised Machine Learning for the Discovery of Latent Disease Clusters and Patient Subgroups Using Electronic Health Records. Journal of biomedical informatics. 2019.
  • 185. AIME 2020 NLP for Cohort Discovery is All About Learning A Better Representation of Patient
  • 188. Advances of Natural Language Processing in Clinical Research Rui Zhang, Ph.D. Associate Professor and McKnight Presidential Fellow Institute for Health Informatics, Department of Pharmaceutical Care & Health Systems, and Data Science University of Minnesota, Twin Cities August 25, 2020 1 AIME 2020 Tutorial 1
  • 189. Outline • Part 1: NLP for Dietary Supplement Clinical Research • Part 2: Information Extraction in EHRs and Clinical Trials 2
  • 190. Clinical Research Informatics (CRI) • CRI involves the use of informatics in the discovery and management of new knowledge relating to health and disease. • It includes management of information related to clinical trials and also involves informatics related to secondary research use of clinical data. • It involves approaches to collect, process, analyze, and display health care and biomedical data for research 3
  • 191. Leveraging Big Data for Pharmacovigilance Big Data Analytics
  • 192. Leveraging Big Data for Pharmacovigilance https://knowledgent.com/whitepaper/big-data-enabling-better-pharmacovigilance/ 5
• 193. Leveraging NLP in Healthcare Analytics
[Figure: NLP (extract, classify, summarize) applied to biomedical literature, clinical notes (notes 1 to n), and social media yields biomedical knowledge (subject - predicate - object triples), patient information (adverse events, substance use, family history, medical history), and pharmacovigilance signals (drug/supplement - adverse events) for healthcare providers and clinical researchers]
• 194. Part 1: NLP for Dietary Supplement Clinical Research 1R01AT009457 (PI: Rui Zhang)
• Integrated DS Knowledge Base (iDISK)
• Expanding DS terminology
• Detecting DS safety signals in clinical notes
• Mining biomedical literature to discover DSIs
• Active learning to reduce annotation costs
• Detecting DS safety signals on Twitter
(Data sources: online resources, clinical notes, literature, social media)
  • 195. Introduction to Dietary Supplements • Dietary supplements Ø Herbs, vitamins, minerals, probiotics, amino acids, others. • Use of supplements increasing Ø More than half of U.S. adults take dietary supplements (Center for Disease Control and Prevention) Ø One in six U.S. adults takes a supplement simultaneously with prescription medications Ø Sales over $6 billion per year in U.S. (American Botanical Council, 2014) https://nccih.nih.gov/health/supplements Use of complementary and alternative medicine by children in Europe: Published data and expert perspectives. Complement Ther Med. 2013 4;21. Kaufman, Kelly, JAMA. 2002;287(3):337-344. Dietary Supplement Use Among U.S. Adults Has Increased Since NHANES III (1988–1994). 2014(Nov 4, 2014). CDC. 8
  • 196. Safety of Dietary Supplements • Doctors often poorly informed about supplements Ø 75.5% of 1,157 clinicians • Supplements are NOT always safe Ø Averagely 23,000 annual emergency visits for supplements adverse events Ø Drug-supplement interactions (DSIs) • Concomitant administration of supplements and drugs increases risks of DSIs • Example: Docetaxel & St John’s Wort (hyperforin component induces docetaxel metabolism via P450 3A4) Kaufman, Kelly, JAMA. 2002;287(3):337-344. Geller et al. New England J Med. 2015; 373:1531-40. Gurley BJ. Molecular nutrition & food research. 2008, 52(7):772-9. 9
  • 197. Regulation for Dietary Supplements • Regulated by Dietary Supplement Health and Education Act of 1994 (DSHEA) Ø Different regulatory framework from prescription and over- the-counter drugs Ø Safety testing and FDA approval NOT required before marketing Ø Postmarketing reporting only required for serious adverse events (hospitalization, significant disability or death) Department of Health and Human Services, Food and Drug Administration. New dietary ingredients in dietary supplements — background for industry. March 3, 2014 Dietary Supplement and Nonprescription Drug Consumer Protection Act. Public Law 109-462, 120 Stat 4500. 10
  • 198. Limited Supplements Research • Supplement safety research is limited Ø Not required for clinical trials Ø Not found until new supplement is on the market Ø Voluntary adverse events reporting underestimates the safety issues Ø Pharmacy studies only focuses on specific supplements Ø DSI documentation is limited due to less rigorous regulatory rules on supplements 11
  • 199. Informatics and AI for Supplements Safety Research • Online resources Ø Provides DS knowledge across various resources Ø Need informatics method to standard and integrate knowledge • Electronic health records Ø EHR provides patient data for supplement use Ø Detailed supplements usage information documented in clinical notes • Biomedical literature Ø Contains pharmacokinetics and pharmacodynamics knowledge Ø Discover undefined pathways for DSIs Ø Find potential DSIs by linking information 12
  • 200. Informatics and AI for Supplements Safety Research • Social media Ø Contains customer’s DS use experience Ø Discover their information needs • Adverse Event Reporting System (CARES) Ø Contains reported AEs Ø A good resource to mine DS-AE signals 13
  • 201. Challenges for Supplement Clinical Research • No standardized and consistent DS knowledge representation • Lexical variations of supplements in clinical notes • Detailed usage information related to supplements • Differentiate adverse events vs purpose use 14
  • 202. 1.1. Supplement Knowledge Base 15 To generate an integrated and standardized DS knowledge base Rizvi R, et al. AMIA CRI (student paper competition finalist) 2018. JAMIA 2019. doi: 10.1093/jamia/ocz216
  • 203. iDISK Development q To build a one-stop Integrated DIetary Supplement Knowledge base (iDISK) q DS related content is represented in: consistent and standardized forms 16JAMIA 2019. doi: 10.1093/jamia/ocz216 DSLD, Dietary Supplement Label Database; MSKCC, Memorial Sloan Kettering Cancer Center; NHP, Natural Health Products Database; NMCD, Natural Medicines Comprehensive Database.
  • 206. • Evaluation showed that iDISK achieved high accuracy (98.5%-100%) across all data elements iDISK Statistics 19
• 207. iDISK vs UMLS on DS Coverage
iDISK: 41,628 unique DS ingredient names
UMLSDistilled: only concepts with certain semantic types (Nucleic Acid, Nucleoside, or Nucleotide; Organic Chemical; Pharmacologic Substance; Vitamin; Bacterium; Fish; Fungus; Plant; Food; etc.)
UMLSDS: all concepts under the "Dietary Supplements" (C0242295) and "Vitamin" (C0042890) concepts via parent-child relationships.

Matched against   iDISK element   Exact match (%)   +luiNorm (+%)   Total (%)        UMLS concepts
UMLS              Atoms           27,992 (45.7%)    +550 (+0.9%)    28,542 (46.6%)   10,716
                  Unique terms    12,744 (30.6%)    +474 (+1.1%)    13,218 (31.7%)
UMLSDistilled     Atoms           27,553 (45.0%)    +524 (+0.9%)    28,077 (45.9%)   8,684
                  Unique terms    12,397 (29.8%)    +450 (+1.0%)    12,847 (30.8%)
UMLSDS            Atoms           12,096 (19.7%)    +407 (+0.7%)    12,503 (20.4%)   5,817
                  Unique terms    4,899 (11.8%)     +308 (+0.7%)    5,207 (12.5%)
• 208. Evaluation on a DS NER Task (identifying 3,710 DS entities in 351 abstracts)

Criterion   QuickUMLS installation   Precision   Recall   F1
Lenient     UMLS                     0.08        0.91     0.15
            UMLSDistilled            0.25        0.89     0.39
            UMLSDS                   0.32        0.86     0.46
            iDISK                    0.51        0.82     0.63
            Union                    0.32        0.91     0.48
Strict      UMLS                     0.05        0.67     0.10
            UMLSDistilled            0.19        0.69     0.30
            UMLSDS                   0.22        0.61     0.33
            iDISK                    0.43        0.69     0.53
            Union                    0.23        0.77     0.36
  • 209. 1.2. Expanding Supplement Terminology 22
  • 210. Objective 23 • To apply word embedding models to expand the terminology of DS from clinical notes: semantic variants, brand names, misspellings • Word embeddings • Reveal hidden relationship between words (similarity and relatedness) • More efficient; can be trained a large amount of unannotated data calcium chamomile cranberry dandelion flaxseed garlic ginger ginkgo ginseng glucosamine lavender melatonin turmeric valerian
• 212. Model Training
• Corpus size
• Hyperparameter tuning
• Window size (i.e., 4, 6, 8, 10, and 12)
• Vector size (i.e., 100, 150, 200, 250)
• GloVe trained on the same corpus, with the same window and vector sizes
• Optimal parameters were chosen based on human annotation (intrinsic evaluation)
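A minimal sketch of the training-and-expansion loop with gensim's word2vec; the three "sentences" below are a toy stand-in for the clinical-note corpus, so the neighbors it returns are not meaningful, but the same call on a large corpus surfaces variant/brand/misspelling candidates for manual review.

    from gensim.models import Word2Vec

    sentences = [["patient", "takes", "black_cohosh", "for", "hot", "flashes"],
                 ["recommend", "estroven", "trial", "for", "menopause"],
                 ["has", "tried", "black_kohosh", "without", "relief"]]
    model = Word2Vec(sentences, vector_size=100, window=8, min_count=1, seed=0)
    # Nearest neighbors of a seed term = candidate query expansions
    print(model.wv.most_similar("black_cohosh", topn=5))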
• 213. Results: Query Expansion Examples

Black cohosh – misspellings: black kohosh, black kohash; brand name: Remifemin; expanded: Estroven, Estrovan, estraven, icool, amberen, amberin, Estrovera, EstroFactor
• "Please try black cohash or Estroven for hot flashes." • "Pt has discontinued Remifemin but still has symptoms." • "Recommend Estroven trial for symptoms of menopause."
Turmeric – misspelling: tumeric
• "Pt emailed wondering about taking Tumeric" • "Patient states that she sometimes takes the supplements Tumeric"
Folic acid – brand names: Folgard, Folbic; other name: folate
• "Patient is willing to try Folgard if ok with provider." • "Patient is on folate and does not smoke."
Valerian – misspelling: velarian; brand names: myocalm pm, somnapure
• "Taking Velarian root and benadryl as well" • "I would recommend moving to 6mg dose first, then trying somnapure if still not helping."
Melatonin – misspellings: melantonin, melotonin; brand names: alteril, neuro sleep
• "Can try melantonin for sleep aid." • "Try alteril - it is over the counter sleep aid. Let me know if this is not better over the next few weeks."
  • 214. Results: Comparison of Base and Expanded Queries 27
  • 215. Results: Comparison of word embedding expanded versus external resource expanded queries 28
• 216. 1.3. Detecting DS Indications and Adverse Event Signals in Clinical Texts
• Clinical notes document information related to patient safety
• AEs
• "Patient gets headaches with black cohosh"
• Indications
• "Presently, patient is taking black cohosh for night sweats and hot flashes"
• Temporal relationships between medical events
• "Also headaches did start shortly after starting black cohosh"
• DS safety surveillance
• NLP for medical concept and relation extraction
Fan, et al, J Am Med Inform Assoc. 2020
  • 217. Objectives • To demonstrate the feasibility of deep learning models applied to clinical notes to facilitate discovery of DS safety knowledge • To evaluate different deep learning (e.g., pre-trained BERT) models on annotated DS-specific clinical corpora 30
• 218. Results of NER Models
7,000 sentences on 7 DS were randomly selected. DS entities include generic names, brand names, abbreviations, and misspellings. (Values are P / R / F1, mean ± s.d.)

Model                     DS (n=1247)                              Symptom (n=356)                          Overall, micro (n=1603)
CRF                       0.900±0.00 / 0.791±0.00 / 0.842±0.00     0.714±0.00 / 0.567±0.00 / 0.632±0.00     0.861±0.00 / 0.741±0.00 / 0.797±0.00
Bi-LSTM-CRF (word only)   0.905±0.002 / 0.854±0.007 / 0.879±0.003  0.812±0.015 / 0.825±0.007 / 0.818±0.009  0.884±0.004 / 0.847±0.003 / 0.865±0.003
Bi-LSTM-CRF (char LSTM)   0.900±0.006 / 0.860±0.002 / 0.879±0.003  0.806±0.008 / 0.837±0.011 / 0.822±0.008  0.877±0.003 / 0.855±0.003 / 0.866±0.002
Bi-LSTM-CRF (char CNN)    0.905±0.006 / 0.864±0.004 / 0.884±0.003  0.847±0.018 / 0.845±0.007 / 0.846±0.011  0.892±0.006 / 0.860±0.003 / 0.876±0.004
Clinical BERT             0.931±0.002 / 0.845±0.002 / 0.886±0.002  0.836±0.014 / 0.840±0.007 / 0.838±0.008  0.908±0.003 / 0.845±0.002 / 0.875±0.001
BERT                      0.931±0.005 / 0.850±0.003 / 0.889±0.003  0.860±0.010 / 0.854±0.006 / 0.857±0.004  0.914±0.007 / 0.851±0.003 / 0.881±0.003
• 219. Results for Relation Extraction
3,000 sentences (200 per DS) across 15 DS: black cohosh, chamomile, cranberry, dandelion, folic acid, garlic, ginger, ginkgo, ginseng, glucosamine, green tea, lavender, melatonin, milk thistle, and saw palmetto. Num: Positive = 336; Negative = 109; Not related = 69; Overall = 514.

Model | Positive (P / R / F1) | Negative (P / R / F1) | Not related (P / R / F1) | Overall, micro (P / R / F1)
Random Forest | 0.835 ± 0.002 / 0.939 ± 0.003 / 0.884 ± 0.002 | 0.782 ± 0.009 / 0.716 ± 0.007 / 0.747 ± 0.006 | 0.825 ± 0.011 / 0.438 ± 0.006 / 0.572 ± 0.005 | 0.823 ± 0.003 / 0.824 ± 0.002 / 0.813 ± 0.002
CNN | 0.937 ± 0.013 / 0.936 ± 0.031 / 0.936 ± 0.010 | 0.804 ± 0.057 / 0.926 ± 0.021 / 0.859 ± 0.026 | 0.824 ± 0.095 / 0.634 ± 0.060 / 0.721 ± 0.040 | 0.899 ± 0.013 / 0.896 ± 0.016 / 0.890 ± 0.016
Att-BLSTM | 0.913 ± 0.011 / 0.967 ± 0.017 / 0.939 ± 0.004 | 0.869 ± 0.035 / 0.861 ± 0.063 / 0.863 ± 0.024 | 0.876 ± 0.028 / 0.798 ± 0.009 / 0.826 ± 0.007 | 0.897 ± 0.006 / 0.899 ± 0.005 / 0.893 ± 0.004
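The relation task here is sentence-level classification over three labels; with BERT it reduces to a sequence-classification head, roughly as in this sketch (assumed stack; the label names and checkpoint are illustrative, and the head is untrained until fine-tuned on the 3,000 annotated sentences).

    # Classify a DS-symptom sentence as positive / negative / not related.
    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    labels = ["positive", "negative", "not_related"]
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=len(labels)
    )

    sentence = "Presently, patient is taking black cohosh for night sweats and hot flashes"
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        pred = model(**enc).logits.argmax(dim=-1).item()
    print(labels[pred])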
• 220. Positive Relationships (Indication): 18,348 pairs
Entity pair | In NMCD? | Sentence
Vitamin C, Wound | ✓ | Starting mv, Vitamin C and zinc for wound healing.
Fish oil, Hyperlipidemia | ✓ | Patient has history of hyperlipidemia which was until recently well-controlled with fish oil and simvastatin.
Peppermint, Nausea | ✓ | He has much less nausea with peppermint oil and marijuana.
Vitamin E, Scar | ✓ | Vitamin E po apply 1 capsule daily as needed to scar on forehead.
Fish oil, Pain | ✓ | I suggested that she could try daily fish oil which may help the breast pain when it is taken for at least a month or two and could use iburprofen and heat for the pain as well.
Psyllium, Constipation | ✓ | Patients states she takes psyllium powder daily for constipation, and needs refills.
Vitamin C, UTI | ✓ | Patient with hx recurrent utis, on vitamin c for urinary acidification
Fish oil, Anxiety | ✗ | I encourage over the counter multi vitamin and fish oil pills, as they can help improve some anxiety and depression symptoms.
Peppermint, Pain | ✓ | She also has experienced pain relief when rubbing peppermint essential oil on the low back.
• 221. Negative Relationships (Adverse Event): 13,130 pairs
Entity pair | In NMCD? | Sentence
Niacin, Rash | ✗ | Lisinopril causes a cough and niacin causes a rash.
Niacin, Flushing | ✗ | She was having significant flushing with niacin, so she discontinued this about 6 months ago.
Niacin, Hives | ✗ | Patient stating reaction to niacin is hives though has used mvi in past without issues.
Fish oil, Rash | ✓ | Pt states when she takes fish oil tablets she get a small rash on her chin.
Fish oil, Vomiting | ✗ | Also, discussed vomiting with fish oil caps because she bit into them- would not pursue further at this time.
Vitamin C, Nausea | ✗ | She did have 1 or 2 episodes of nausea related to taking delayed-release vitamin c for wound healing
Niacin, GI disturbance | ✗ | Allergen reactions: niacin: gi disturnbance; simvastation: cramps.
Fish oil, Diarrhea | ✓ | Discussed titrating back up on fish oil as he tolerates, previously has been causing a lot of diarrhea so going slow.
• 222. 1.4. Mining Biomedical Literature to Discover Drug-Supplement Interactions (DSIs)
http://www.wsj.com/articles/what-you-should-know-about-how-your-supplements-interact-with-prescription-drugs-1456777548
"Researchers at the University of Minnesota in Minneapolis are exploring interactions between cancer drugs and dietary supplements, based on data extracted from 23 million scientific publications, according to lead author Rui Zhang, a clinical assistant professor in health informatics. In a study published last year by a conference of the American Medical Informatics Association, he says, they identified some that were previously unknown."
• 223. Objective
• Explore potential DSIs by linking knowledge extracted from biomedical literature
• 224. Literature-based Discovery
• "We have shown that ECHINACEA preparations and some common alkylamides weakly inhibit several cytochrome P450 (CYP) isoforms, with considerable variation in potency." (PMID 19790031) → Echinacea - INHIBITS - CYP450
• "Tamoxifen and toremifene are metabolised by the cytochrome p450 enzyme system, and raloxifene is metabolised by glucuronide conjugation." (PMID 12648026) → CYP450 - INTERACTS_WITH - Toremifene
• Named entity recognition (NER) and relation extraction over big data: 29 million abstracts
• Joining the two predications yields: Echinacea - <Potentially Interacts With> - Toremifene
• Schematically: known X-Y pairs (X1-Y1, X2-Y2, ..., Xm-Ym) are joined with known Y-Z pairs (Y1-Z1, Y3-Z2, ..., Yn-Zn) to infer novel X-Z pairs (X1-Z1, ..., Xk-Zt)
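The discovery step itself is a join on the shared B term. A minimal sketch in Python (the triples are the two examples from this slide plus one more from the next slide, not the full literature-scale extraction):

    # Join supplement->gene predications with gene->drug predications
    # to propose supplement-drug interaction candidates (A-B-C model).
    supp_gene = {
        ("Echinacea", "INHIBITS", "CYP450"),
        ("Grape seed extract", "INHIBITS", "CYP3A4"),
    }
    gene_drug = {
        ("CYP450", "INTERACTS_WITH", "Toremifene"),
        ("CYP3A4", "INTERACTS_WITH", "Docetaxel"),
    }

    candidates = {
        (supp, drug)
        for supp, _, gene_a in supp_gene
        for gene_b, _, drug in gene_drug
        if gene_a == gene_b  # shared B term links the two predications
    }
    print(candidates)  # {('Echinacea', 'Toremifene'), ('Grape seed extract', 'Docetaxel')}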
• 225. Results: Selected Interactions
Supplement | Predicate | Gene/Gene Class | Predicate | Drug | Known
Echinacea | INH | CYP450 | INT | Docetaxel | Y
Echinacea | INH | CYP450 | INT | Toremifene | N
Echinacea | STI | CYP1A1 | INT | Exemestane | N
Grape seed extract | INH | CYP3A4 | INT | Docetaxel | N
Kava preparation | STI | CYP3A4 | INT | Docetaxel | Y
INH, INHIBITS; STI, STIMULATES; INT, INTERACTS_WITH
Echinacea: fights the common cold and viral infections. Grape seed extract: cardiac conditions. Kava: treats sleep problems, relieves anxiety and stress.
• 226. Results: Selected Predications
Ø Echinacea INHIBITS CYP450: "We have shown that ECHINACEA preparations and some common alkylamides weakly inhibit several cytochrome P450 (CYP) isoforms, with considerable variation in potency." (PMID 19790031)
Ø Grape seed extract INHIBITS CYP3A4: "Four brands of GSE had no effect, while another five produced mild to moderate but variable inhibition of CYP3A4, ranging from 6.4% by Country Life GSE to 26.8% by Loma Linda Market brand." (PMID 19353999)
Ø Melatonin INHIBITS Cyclooxygenase-2: "Moreover, Western blot analysis showed that melatonin inhibited LPS/IFN-gamma-induced expression of COX-2 protein, but not that of constitutive cyclooxygenase." (PMID 18078452)
Ø CYP450 INTERACTS_WITH Toremifene: "Tamoxifen and toremifene are metabolised by the cytochrome p450 enzyme system, and raloxifene is metabolised by glucuronide conjugation." (PMID 12648026)
Ø CYP3A INHIBITS Docetaxel: "Because docetaxel is inactivated by CYP3A, we studied the effects of the St. John's wort constituent hyperforin on docetaxel metabolism in a human hepatocyte model." (PMID 16203790)
• 227. 1.5. Active Learning to Reduce Annotation Costs for NLP Tasks
• NLP tasks require human annotations
Ø Time-consuming and labor-intensive
• Active learning reduces annotation costs
Ø Used on biomedical and clinical texts
Ø Effectiveness varies across datasets and tasks
Chen Y, Lasko TA, Mei Q, Denny JC, Xu H. A study of active learning methods for named entity recognition in clinical text. J Biomed Inform 2015; 58: 11-8.
Chen Y, Cao H, Mei Q, Zheng K, Xu H. Applying active learning to supervised word sense disambiguation in MEDLINE. J Am Med Inform Assoc 2013; 20 (5): 1001-6.
• 228. Objectives
• To assess the effectiveness of AL methods for filtering incorrect semantic predications
• To evaluate various query strategies and provide a comparative analysis of AL methods through visualization
Vasilakes J, Rizvi R, Melton G, Pakhomov S, Zhang R. J Am Med Info Assoc Open. 2018
• 229. Method Overview
Query strategies:
• Uncertainty sampling
• Representative sampling
• Combined sampling
Evaluation:
• 10-fold cross-validation
• Training = 2,700; L0 = 270
• Testing = 300; evaluated using AUC
• 230. Query Strategies
• Uncertainty
Ø Simple margin
Ø Least confidence (see the sketch after this list)
Ø Least confidence with dynamic bias
• Representative
Ø Distance to center
Ø Density
Ø Min-max
• Combined
Ø Information density
Ø Dynamic
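As an illustration of the first family, least confidence queries the instances whose top predicted class probability is lowest. A minimal sketch with scikit-learn (assumed stack; the synthetic pools stand in for L and U, sized to echo the L0 = 270 split on the previous slide):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    X_L, y_L = rng.normal(size=(270, 20)), rng.integers(0, 2, 270)  # labeled pool L0
    X_U = rng.normal(size=(2430, 20))                               # unlabeled pool U

    clf = LogisticRegression(max_iter=1000).fit(X_L, y_L)

    # Least confidence: 1 - max class probability; higher = more uncertain.
    probs = clf.predict_proba(X_U)
    uncertainty = 1.0 - probs.max(axis=1)
    query_idx = np.argsort(uncertainty)[-10:]  # 10 most uncertain instances
    print(query_idx)                           # send these to the annotator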
• 231. Datasets and Annotations
• Substance Interaction (3,000):
Ø INTERACTS_WITH, STIMULATES, or INHIBITS
• Clinical Medicine (3,000):
Ø ADMINISTERED_TO, COEXISTS_WITH, COMPLICATES, DIAGNOSES, MANIFESTATION_OF, PRECEDES, PREVENTS, PROCESS_OF, PRODUCES, TREATS, or USES
• Inter-rater agreement:
Ø Kappa: 0.74 (SI), 0.72 (CM)
Ø Percentage agreement: 87% (SI), 91% (CM)
• 232. Performance Comparison
|L| is the size of the current labeled set; |U| is the size of the current unlabeled set.
When L is small and U is large:
• it is unlikely that L is representative of U
• given that L is small and unrepresentative, the prediction model trained on L is likely to be poor
• 233. Results: Query Strategy Comparison (slides 233-236 build this table cumulatively; ID, information density)
Query Strategy | ALC
Passive Learning | 0.590
Uncertainty Sampling | 0.597 - 0.607
Representative Sampling | 0.622 - 0.634
ID (manual β) | 0.642
ID (dynamic β) | 0.641
• 237. Performance Analysis
(Figure: visualization of query behavior for Uncertainty Sampling, the worst performing strategy, and Representative Sampling, the best performing.)
Vasilakes J, Rizvi R, Melton G, Pakhomov S, Zhang R. J Am Med Info Assoc Open. 2018
• 238. 1.6. Mining Twitter to Detect DS Adverse Events
• Objectives
Ø To develop an end-to-end AI pipeline for identifying DS-AEs from tweets
Ø To compare the DS-AEs discovered from the tweets to those curated in iDISK
• 239. Data Collection
• Data collection
Ø 332 DS terms, including 40 commonly used DS and their name variants
Ø 14,143 AE terms from an ADR lexicon and the iDISK knowledge base
Ø The final dataset includes 247,807 tweets from 2012 to 2018 that contain at least one DS-AE pair
• Data preprocessing (a sketch follows below)
Ø Remove URLs, user handles (@username), hashtag symbols (#), and emojis
Ø Contractions (e.g., can't) were expanded
Ø Hashtags were segmented into constituent words
Ø Stop words were kept (e.g., "throw up" is different from "throw")
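A minimal sketch of that preprocessing (the regex patterns, contraction table, and lowercasing are illustrative assumptions; the original pipeline's exact rules and its hashtag segmenter are not shown on the slide):

    import re

    CONTRACTIONS = {"can't": "cannot", "won't": "will not", "n't": " not"}

    def preprocess(tweet: str) -> str:
        tweet = tweet.lower()
        tweet = re.sub(r"https?://\S+", "", tweet)             # remove URLs
        tweet = re.sub(r"@\w+", "", tweet)                     # remove user handles
        tweet = tweet.replace("#", "")                         # drop the hashtag symbol only
        tweet = re.sub(r"[\U0001F300-\U0001FAFF]", "", tweet)  # strip most emojis
        for short, full in CONTRACTIONS.items():               # expand contractions
            tweet = tweet.replace(short, full)
        return re.sub(r"\s+", " ", tweet).strip()              # note: stop words are kept

    print(preprocess("Can't sleep, trying #melatonin again @friend https://t.co/x"))
    # -> "cannot sleep, trying melatonin again"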
• 240. Results – Concept Extraction

Supplement: Method | Precision | Recall | F1-measure
LSTM-CRF + PubMed Word2Vec | 0.8587 ± 0.0211 | 0.8055 ± 0.0280 | 0.8310 ± 0.0218
LSTM-CRF + GloVe Twitter | 0.8491 ± 0.0321 | 0.8127 ± 0.0196 | 0.8300 ± 0.0179
LSTM-CRF + GloVe Crawl | 0.8736 ± 0.0210 | 0.8375 ± 0.0152 | 0.8551 ± 0.0157
LSTM-CRF + fastText | 0.8538 ± 0.0160 | 0.8092 ± 0.0231 | 0.8308 ± 0.0175
BioBERT | 0.8570 ± 0.0248 | 0.8725 ± 0.0212 | 0.8646 ± 0.0220
BERT | 0.8560 ± 0.0185 | 0.8736 ± 0.0198 | 0.8647 ± 0.0184

Symptom: Method | Precision | Recall | F1-measure
LSTM-CRF + PubMed Word2Vec | 0.7909 ± 0.0188 | 0.6794 ± 0.0258 | 0.7306 ± 0.0173
LSTM-CRF + GloVe Twitter | 0.8048 ± 0.0150 | 0.6994 ± 0.0244 | 0.7482 ± 0.0155
LSTM-CRF + GloVe Crawl | 0.8012 ± 0.0205 | 0.7146 ± 0.0344 | 0.7550 ± 0.0232
LSTM-CRF + fastText | 0.7784 ± 0.0247 | 0.6841 ± 0.0271 | 0.7277 ± 0.0182
BioBERT | 0.8416 ± 0.0204 | 0.8582 ± 0.0200 | 0.8497 ± 0.0172
BERT | 0.8393 ± 0.0161 | 0.8664 ± 0.0147 | 0.8526 ± 0.0138
• 241. Results – Relation Extraction (RE)

Indication: Method | Precision | Recall | F1-measure
CNN + GloVe Twitter | 0.7774 ± 0.0252 | 0.7946 ± 0.0318 | 0.7850 ± 0.0124
CNN + GloVe Wiki GigaWord | 0.7720 ± 0.0206 | 0.7901 ± 0.0280 | 0.7804 ± 0.0142
BioBERT | 0.8177 ± 0.0214 | 0.8595 ± 0.0321 | 0.8374 ± 0.0147
BERT | 0.8181 ± 0.0319 | 0.8522 ± 0.0409 | 0.8335 ± 0.0169

Adverse events: Method | Precision | Recall | F1-measure
LSTM-CRF + PubMed Word2Vec | 0.6995 ± 0.0653 | 0.6381 ± 0.0539 | 0.6645 ± 0.0410
LSTM-CRF + GloVe Twitter | 0.7069 ± 0.0553 | 0.5995 ± 0.0783 | 0.6456 ± 0.0561
BioBERT | 0.7349 ± 0.0430 | 0.7603 ± 0.0519 | 0.7459 ± 0.0341
BERT | 0.7312 ± 0.0694 | 0.7845 ± 0.1041 | 0.7538 ± 0.0376
• 242. Results
• 194,190 pairs were identified as DS indications
• 45,668 pairs were identified as DS-AEs
• 190,170 pairs have no relation
• 243. Results – DS-AE Pair Examples
• Vitamin C - kidney stones tweets (iDISK has this entry):
Ø some medications yes even prolonged high dose vitamin c causes kidney stones
Ø vitamin c is not actually an effective treatment for the common cold and high doses may cause kidney stones nausea and diarrhea
Ø too much vitamin c can cause kidney stones
• Vitamin C - diarrhea tweets (iDISK has this entry):
Ø i would eat this whole bag of oranges but vitamin c in high doses can induce skin breakouts and diarrhea facts
Ø too much vitamin c or zinc could cause nausea diarrhea and stomach cramps check your dose
Ø too much vitamin c can cause diarrhea and or nausea
Ø it can cause diarrhea because of all the vitamin c
• 244. Results – DS-AE Pair Examples
• Niacin - flush tweets (not in iDISK):
Ø the niacin flush may be uncomfortable for a few mins but it is well worth it it may be itchy or burn a little but it passes in 10 30
Ø note to self if you are used to 250 mg of niacin jump up to 500 mg the niacin flush is so intense
Ø already got a niacin flush crap
• Fish oil - prostate cancer tweets (not in iDISK):
Ø fish oil makes you more likely to get prostate cancer good enough for me to stop taking it just a heads up
Ø some docs say fish oil can raise your risk of prostate cancer wait so i should stop stuffing goldfish up there tgif
Ø correction study finds fish oil increases risk of high grade prostate cancer by 71 percent <url>
• 245. Part 2: Information Extraction in EHRs and Clinical Trials
• Extract Breast Cancer Receptor Status
• Identify Clinically New Information
• Parse Clinical Trial Eligibility Criteria
• 246. 2.1. Breast Cancer Receptor Status Phenotyping from EHR
Breitenstein MK, Liu H, Maxwell KN, Pathak J, Zhang R. Electronic health record phenotypes for precision medicine: perspectives and caveats from treatment of breast cancer at a single institution. Clinical and Translational Science. 2018 Jan;11(1):85-92.
• 247. Phenotyping Granularity
• Phenotyping usually identifies cases or controls
• Precision medicine phenotypes of breast cancer subtypes
Ø Estrogen receptor (ER)
Ø Progesterone receptor (PR)
Ø Human epidermal growth factor receptor 2 (HER2)
Ø Triple-negative breast cancer (TNBC: ER-, PR-, HER2-)
• 248. Objectives
• Develop NLP-based breast cancer precision medicine phenotyping methods to identify receptor status
• Compare the coverage of receptor status across clinical data sources
• 250. 2.2. Identifying Clinically Relevant New Versus Redundant Information in Clinical Texts
• EHR "copy-and-paste" functionality
Ø 74-90% of physicians copy and paste
Ø 20-78% of physician note text is copied
• Results
Ø Little deletion, only addition
Ø Longer notes; recombinant versions of previous notes
Ø Errors repeat
• User issues
Ø Information overload
Ø Difficulties in finding information
• 251. Statistical N-gram Language Model
• Predict the probability of a word based on all previous words:
P(w_1^n) = P(w_1) P(w_2 | w_1) P(w_3 | w_1 w_2) ... P(w_n | w_1 w_2 ... w_{n-1}) = ∏_{k=1}^{n} P(w_k | w_1^{k-1})
• Markov assumption
Ø The probability of a word depends only on the previous n-1 words (for an n-gram model):
P(w_k | w_1^{k-1}) ≈ P(w_k | w_{k-n+1}^{k-1})
• N-gram model: an (n-1)th-order Markov model
• Example: P(congestion | a female presenting with a chief complaint of nasal)
Ø Bigram: P(congestion | nasal)
Ø Trigram: P(congestion | of nasal)
Ø Four-gram: P(congestion | complaint of nasal)
Manning and Schütze. Foundations of Statistical Natural Language Processing. The MIT Press; 2003.
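A toy instance of the bigram case in the example above, using plain maximum-likelihood counts (a sketch; smoothing is the subject of the next slide):

    from collections import Counter

    tokens = "a female presenting with a chief complaint of nasal congestion".split()
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))

    # MLE bigram estimate: count of the bigram over the count of its history.
    def p_bigram(w, prev):
        return bigrams[(prev, w)] / unigrams[prev]

    print(p_bigram("congestion", "nasal"))  # 1.0 in this one-sentence corpus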
• 252. Statistical N-gram Language Model: Smoothing
• Sparseness of corpus
Ø Unseen events get zero probability
Ø A single zero propagates to the probability of an entire long string
• Smoothing methods
Ø Decrease the probability of seen events to allow for the occurrence of unseen n-grams
Ø Good-Turing:
If C(w_1 ... w_n) = r > 0: P_GT(w_1 ... w_n) = r*/N, where r* = (r+1) N_{r+1} / N_r
If C(w_1 ... w_n) = 0: P_GT(w_1 ... w_n) = (1 − Σ_{r=1}^{∞} N_r r*/N) / N_0 ≈ N_1 / (N_0 N)
(N, total number of observed n-grams; N_r, number of distinct n-grams occurring exactly r times; N_0, number of unseen n-grams)
Manning and Schütze. Foundations of Statistical Natural Language Processing. The MIT Press; 2003.
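A minimal sketch of the Good-Turing estimate above with toy counts (real implementations also smooth the N_r curve itself, and N_0 must come from the vocabulary size; both are simplified assumptions here):

    from collections import Counter

    ngram_counts = Counter({"of nasal": 3, "nasal congestion": 2,
                            "chief complaint": 2, "a chief": 1, "a female": 1})
    N = sum(ngram_counts.values())       # total observed n-gram tokens
    Nr = Counter(ngram_counts.values())  # N_r: how many n-grams occur exactly r times

    def r_star(r):
        # Discounted count r* = (r + 1) * N_{r+1} / N_r
        return (r + 1) * Nr[r + 1] / Nr[r]

    def p_gt(ngram, N0=100.0):           # N0: number of unseen n-grams (assumed)
        r = ngram_counts[ngram]
        if r > 0:
            return r_star(r) / N
        return Nr[1] / (N0 * N)          # leftover mass spread over unseen events

    print(p_gt("a chief"), p_gt("never seen"))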
• 253. Semantic Similarity Measures
• Measure semantic similarity between two biomedical concepts by determining their closeness in a hierarchy
• The UMLS brings many biomedical vocabularies and standards together
• UMLS::Similarity provides a platform to calculate similarity using various methods
• Methods: Resnik; Jiang and Conrath; Lin
Pedersen T, Pakhomov S, et al. 2nd ACM SIGHIT IHI Symp Proc, 2012.
Pakhomov S, McInnes B, et al. AMIA Annu Symp Proc 2010: 572-6.
McInnes B, Pedersen T, Pakhomov S. AMIA Annu Symp Proc 2009: 431-5.
Pedersen T, Pakhomov S, et al. J Biomed Inform. 2007 Jun;40(3):288-99.
P. Resnik, International Joint Conference for Artificial Intelligence, 448-53, 1995.
J. Jiang and D. Conrath, Proceedings on International Conference on Research in CL, 9-33, 1997.
D. Lin, Proceedings of the International Conference on ML, 296-304, 1998.
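Of the measures named above, Lin's has a compact closed form: sim(c1, c2) = 2 × IC(lcs(c1, c2)) / (IC(c1) + IC(c2)), with information content IC(c) = -log P(c) and lcs the least common subsumer in the hierarchy. The sketch below is a toy stand-in for what UMLS::Similarity derives from the UMLS; the concept probabilities and the LCS are invented for illustration:

    import math

    # Toy corpus probabilities for three concepts in a small hierarchy.
    p = {"disorder": 0.30, "diabetes": 0.01, "hypoglycemia": 0.005}
    ic = {c: -math.log(prob) for c, prob in p.items()}  # information content

    def lin(c1, c2, lcs):
        return 2 * ic[lcs] / (ic[c1] + ic[c2])

    # "disorder" assumed to be the least common subsumer of the two concepts.
    print(round(lin("diabetes", "hypoglycemia", lcs="disorder"), 3))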
• 254. Results: Performance Comparison
Algorithm | Recall | Precision | F1-Measure | Optimal Threshold
Baseline | 0.85 | 0.64 | 0.73 | -
Baseline + Lin | 0.87 | 0.62 | 0.72 | 0.9
Baseline + Res | 0.87 | 0.61 | 0.72 | 0.9
Baseline + Jcn | 0.87 | 0.61 | 0.72 | 0.9
Baseline: rule-based section-information adjustment + removal of note formatting and noise + removal of stop words + lexical normalization
Semantic similarity methods: Lin; Resnik; Jiang and Conrath
Precision = TP/(TP+FP); Recall = TP/(TP+FN); F1-Measure = 2 × Precision × Recall / (Precision + Recall)
• 255. NIP Score to Navigate Notes
(Figure: New Information Proportion (%) plotted against note index, with series for the last 10 and 20 notes.)
• Notes 30, 32, 33 & 35: nothing new
• Note 31, NEW: RUQ pain worse with eating greasy foods
• Note 34, NEW: pt visits diabetes RN
• Note 36, NEW: sore throat x 3 days
• Note 37, NEW: having chest pain, will try colchicine for pericarditis
• Note 38, NEW: depressive symptoms, bulging L TM on exam
• Cyclical pattern
• High correlation with human judgment
• Points to the source note of redundant information
*NIP: New Information Proportion
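The published method scores each word of a note against the smoothed n-gram language model of the patient's prior notes (combined with semantic similarity). As a rough, set-based simplification of the same idea, a note's new-information proportion can be approximated as the share of its bigrams unseen in prior notes; a minimal sketch under that assumption:

    def nip(note, prior_notes):
        """Percent of the note's bigrams not seen in any prior note."""
        def bigrams(text):
            toks = text.lower().split()
            return set(zip(toks, toks[1:]))
        seen = set().union(*(bigrams(n) for n in prior_notes)) if prior_notes else set()
        current = bigrams(note)
        return 100.0 * len(current - seen) / len(current)

    prior = ["patient reports nasal congestion", "nasal congestion improving"]
    print(nip("patient reports sore throat x 3 days", prior))  # mostly new content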
• 256. New Information Semantic Types
Figure: Plot of NDIP (disease), NMIP (medication), and NLIP (laboratory) over time for a patient; biomedical concepts for each note were automatically extracted. NDIP, new problem/disease information proportion; NMIP, new medication information proportion; NLIP, new laboratory information proportion; NOIP, other types of new information.
NIP = NDIP + NMIP + NLIP + NOIP
New medication information:
Ø 5-Sep-08: clonazepam
Ø 24-Sep-08: sertraline, clonazepam
Ø 22-Oct-08: sertraline
Ø 31-Dec-08: glimepiride
Ø 24-Mar-09: tylenol, ibuprofen, Imitrex
Ø 8-Mar-10: janumet, metformin, Imitrex, sertraline, estroven
Ø 7-May-10: glipizide
Ø 25-Mar-11: buspirone, venlafaxine
Ø 17-Sep-11: influenza vaccine
New disease information:
Ø 5-Sep-08: elbow pain, hand pain, stress, depression, weight gain, fatigue, osteoarthritis
Ø 24-Sep-08: sleepy, dizziness, nausea, numbness, low back pain, hip pain
Ø 22-Oct-08: anxiety
Ø 31-Dec-08: obesity, joint tenderness, depression
Ø 23-Jan-09: hypoglycemia, hot flushes, menorrhagia, headache
Ø 24-Mar-09: arm pain, migraine headaches, anxiety
Ø 8-Mar-10: depression, back pain, fatigue
Ø 7-May-10: weight loss, family stress, thirsty, hypercholesterolemia
Ø 25-Mar-11: shoulder pain, cramping, leg pain, patellofemoral syndrome
New laboratory information:
Ø 5-Sep-08: BP, weight
Ø 24-Sep-08: breast cancer screening, X-ray spine
Ø 31-Dec-08: A1C, CHOL, HDL, LDL, TRIG, microalbuminuria measurement, X-ray knee
Ø 24-Mar-09: A1C, BP
Ø 7-May-10: glucose monitoring, A1C, HDL, LDL, GLC, BP, blood glucose
Ø 25-Mar-11: blood glucose
  • 257. New Information Visualization in Epic EHR System
• 258. 2.3. Parsing Clinical Trial Eligibility Criteria
• Patient recruitment delays are remarkably common and costly
Ø Nearly 80 percent of patient recruitment timelines in clinical trials are not met
Ø Over 50 percent of patients are not enrolled within the planned time frames
• Objective
Ø Use NLP to parse entities in trial inclusion/exclusion eligibility criteria
Ø Use state-of-the-art methods on the CLAMP platform
• 260. Annotations
Semantic classes with example criteria (entities and attributes are underlined and marked in blue on the original slide):
Entity classes:
Ø Demographics: Women must be > 18 to 45 years of age; BMI = 27 kg/m2
Ø Observation: Bilirubin greater than 1.2 g/dl; MMSE below 24, dementia or unstable clinical depression by exam
Ø Procedure: History of bilateral hip replacement
Ø Condition: Uncontrolled hypertension (BP over 180mm HG)
Ø Drug: Taking metformin, propranolol and other medications
Ø Dietary supplement (DS): Use of St. John's Wort or any other dietary supplement
Ø Device: Claustrophobia, metal implants, pacemaker or other factors affecting feasibility and / or safety of MRI scanning
Attribute classes:
Ø Measurement: BUN above 40 mg/dl, Cr above 1.8 mg/dl, CrCl < 60 mg/dl
Ø Qualifier: Signs and symptoms of increased intracranial pressure; severe hypercalcemia
Ø Temporal_measurement: Use of systemic corticosteroids within the last year
Ø Negation: Use of anti-diabetic drugs other than metformin
  • 261. Mapping to UMLS semantic groups across NLP systems
• 262. Performances of Individual NLP Systems & Boolean Ensemble
(Figure panels: A: BioMedICUS, B: CLAMP, C: cTAKES, D: MetaMap.)
Anusha Bompelli, et al. Comparing NLP Systems to Extract Entities of Eligibility Criteria in Dietary Supplements Clinical Trials using NLP-ADAPT. AIME 2020 (will present on Aug 25 at 14:00, NLP session)
• 263. Performance Comparison of the Deep Learning Models
Eligibility Criteria Corpus (149 trials). Each cell reports strict/lenient scores. Models: BERT (arXiv:1810.04805), RoBERTa (arXiv:1907.11692), ELECTRA (arXiv:2003.10555).

Entity classes:
demographics (n=194): BERT P 0.856/0.916, R 0.409/0.451, F1 0.554/0.604 | RoBERTa P 0.654/0.928, R 0.537/0.801, F1 0.586/0.859 | ELECTRA P 0.500/0.541, R 0.273/0.294, F1 0.353/0.380
observation (n=868): BERT P 0.694/0.829, R 0.658/0.829, F1 0.675/0.829 | RoBERTa P 0.721/0.910, R 0.684/0.897, F1 0.702/0.904 | ELECTRA P 0.663/0.865, R 0.590/0.795, F1 0.624/0.829
procedure (n=148): BERT P 0.667/1.000, R 0.600/0.800, F1 0.632/0.889 | RoBERTa P 0.542/0.708, R 0.650/0.850, F1 0.591/0.773 | ELECTRA P 0.448/0.552, R 0.650/0.850, F1 0.531/0.669
condition (n=1832): BERT P 0.794/0.995, R 0.698/0.851, F1 0.743/0.917 | RoBERTa P 0.813/0.949, R 0.767/0.900, F1 0.789/0.924 | ELECTRA P 0.778/0.918, R 0.788/0.893, F1 0.744/0.905
drug (n=890): BERT P 0.935/0.959, R 0.505/0.543, F1 0.655/0.693 | RoBERTa P 0.707/0.846, R 0.699/0.952, F1 0.703/0.895 | ELECTRA P 0.423/0.453, R 0.391/0.447, F1 0.406/0.449
supplement (n=188): BERT P 0.111/0.278, R 0.250/0.625, F1 0.154/0.385 | RoBERTa P 0.296/0.412, R 0.625/1.000, F1 0.400/0.583 | ELECTRA P 0.250/0.438, R 0.500/0.875, F1 0.333/0.583
device (n=37): BERT P 0.857/0.857, R 1.000/1.000, F1 0.923/0.923 | RoBERTa P 0.857/0.857, R 1.000/1.000, F1 0.923/0.923 | ELECTRA P 0.750/0.750, R 1.000/1.000, F1 0.857/0.857

Attribute classes:
measurement (n=397): BERT P 0.731/0.851, R 0.700/0.829, F1 0.715/0.840 | RoBERTa P 0.781/0.938, R 0.725/0.870, F1 0.752/0.902 | ELECTRA P 0.667/0.810, R 0.600/0.786, F1 0.632/0.797
qualifier (n=1137): BERT P 0.730/0.795, R 0.754/0.822, F1 0.742/0.808 | RoBERTa P 0.817/0.872, R 0.761/0.795, F1 0.788/0.831 | ELECTRA P 0.705/0.750, R 0.788/0.839, F1 0.744/0.792
temporal (n=646): BERT P 0.805/0.931, R 0.729/0.823, F1 0.765/0.874 | RoBERTa P 0.837/0.989, R 0.811/0.926, F1 0.824/0.957 | ELECTRA P 0.859/0.976, R 0.760/0.844, F1 0.807/0.905
negation (n=261): BERT P 0.818/0.879, R 0.562/0.769, F1 0.667/0.820 | RoBERTa P 0.914/0.943, R 0.821/0.846, F1 0.825/0.892 | ELECTRA P 0.735/0.765, R 0.641/0.667, F1 0.685/0.712
• 264. Acknowledgements
Extramural funding: NCCIH 1R01AT009457 (Zhang); OD R01AT009457-03S1 (Zhang); NIA 3R01AT009457-04S1 (Zhang); CTSA 1UL1TR002494 (Blazer); AHRQ 1R01HS022085 (Melton); Medtronic Inc. (Speedie)
Collaborators: Mayo Clinic (Liu, Wang), U of Florida (Bian), Florida State U (He), NIH/NLM (Rindflesch, Bodenreider), UIUC (Kilicoglu)
Contact: Rui Zhang, Ph.D. Email: zhan1386@umn.edu Research Lab: http://ruizhang.umn.edu/