Extreme-scale text-based classification of medical data
Anton Hristov & Svetla Boytcheva
18 May 2021
making sense of text and data

Presentation outline
o Medical Ontologies
o Linked Open Data
o Dataset generation
o Data Augmentation
o Text-based classification
o Classification model
o Embeddings
o eXtreme scale classification

o About 80% of Electronic Health Records are in unstructured format
o Need for NLP tools for processing clinical text
o Lack of multilingual terminology resources and domain-specific ontologies
The automatic processing of medical records and knowledge extraction from them is a task of public importance

Clinical text
HISTORY OF PRESENT ILLNESS :The patient is an 80 female with
a history of diastolic function and heart failure , hypertension and
rheumatoid arthritis who presents from an outside hospital with
presyncope.

Clinical text
OPERATIONS / PROCEDURES :Dobutamine stress test , cardiac
ultrasound , EGD , chest x-ray , PICC placement .The patient is a
62-year-old female with a history of diabetes mellitus ,
hypertension , COPD , hypercholesterolemia , depression and CHF

Clinical text
HISTORY OF PRESENT ILLNESS :The patient is a 63 year-old
woman transferred for evaluation of thrombotic thrombocytopenic
purpura and bronchiolitis obliterans organizing pneumonia .

Why is the task of concept normalization so important?
o Disambiguation
o Usage of URIs
o Data integration
o Reasoning
o Similarity search
o Phenotypes

Text-based classification
a process of assigning tags or categories to text
according to its content.

Standard Classification & Ontologies

Objective
To develop methods for automatically associating SNOMED CT codes
with textual descriptions of diagnoses

How to find training data?
o For 150,000 classes we will need a huge training dataset
o Clinical data are not publicly available due to GDPR issues
o There are very few manually annotated datasets
o We need to rely only on publicly available sources:
− Other standard classifications and ontologies
− Open data

Medical Ontologies Mappings
o 1:1
o 1:N
o N:M
o No mappings
Source: https://library.ahima.org/doc?oid=106975#.YKOy_agzaHu
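The mapping types listed above can be told apart mechanically once the mappings are available as (source code, target code) pairs. The following is a minimal sketch under that assumption; the function name and the codes are made up for illustration, and the extra "N:1" outcome (several source codes sharing a single target) is not on the slide but falls out of the same check.

```python
from collections import defaultdict

def mapping_cardinality(pairs, source_codes):
    """Label each source code with the cardinality of its mapping."""
    targets_of = defaultdict(set)   # source code -> mapped target codes
    sources_of = defaultdict(set)   # target code -> source codes mapped to it
    for src, tgt in pairs:
        targets_of[src].add(tgt)
        sources_of[tgt].add(src)

    result = {}
    for src in source_codes:
        targets = targets_of.get(src, set())
        if not targets:
            result[src] = "no mapping"
            continue
        # A target is "shared" when more than one source code maps to it.
        shared = any(len(sources_of[t]) > 1 for t in targets)
        if len(targets) == 1:
            result[src] = "N:1" if shared else "1:1"
        else:
            result[src] = "N:M" if shared else "1:N"
    return result

# Hypothetical codes, for illustration only.
pairs = [("A1", "X1"), ("A2", "X2"), ("A2", "X3"),
         ("A3", "X4"), ("A4", "X4"), ("A4", "X5")]
kinds = mapping_cardinality(pairs, ["A1", "A2", "A3", "A4", "A5"])
print(kinds)
```

Only 1:1 mappings transfer a label from one classification to another without ambiguity; the other cases need heuristics or manual review before they can be used as training data.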

ExaMode dataset
Dataset version 1
• Summary:
– 22M+ data records
– 128K+ SNOMED codes
– 280K+ textual descriptions
– 17K+ undiscovered connections

Dataset Generation
o More data – more problems
o Data cleaning
o Unbalanced dataset
o Overrepresented vs underrepresented classes
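One simple way to see which classes are overrepresented or underrepresented is to compare each class count with the mean count per class. This is a minimal sketch; the thresholds `low` and `high` and the class names are assumptions chosen for illustration, not values from the project.

```python
from collections import Counter

def class_balance(labels, low=0.5, high=2.0):
    """Flag classes whose sample count is far from the mean count per class."""
    counts = Counter(labels)
    mean = sum(counts.values()) / len(counts)
    report = {}
    for label, n in counts.items():
        ratio = n / mean
        if ratio < low:
            status = "underrepresented"
        elif ratio > high:
            status = "overrepresented"
        else:
            status = "balanced"
        report[label] = (n, status)
    return report

# Hypothetical label distribution: 90, 25 and 5 samples (mean 40).
labels = ["C1"] * 90 + ["C2"] * 25 + ["C3"] * 5
report = class_balance(labels)
print(report)
```

Classes flagged as underrepresented are natural candidates for data augmentation, while overrepresented ones can be downsampled.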

Data Augmentation
o The original idea: dataset enlargement
− Image datasets for neural network training
o Popular techniques:
− Flip
− Rotation
− Scale
− Crop
− Translate
− Pixel/Region change (fill with constant)
− Pixel/Region swap
− …
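To make the image techniques concrete, here is a minimal sketch that treats an image as a nested list of pixel values. The helper names are made up for illustration; real pipelines would normally use an image library instead.

```python
def flip_horizontal(img):
    """Mirror every row left to right."""
    return [row[::-1] for row in img]

def rotate_90_clockwise(img):
    """Rotate by 90 degrees: the first row becomes the last column."""
    return [list(col) for col in zip(*img[::-1])]

def crop(img, top, left, height, width):
    """Keep a height x width window starting at (top, left)."""
    return [row[left:left + width] for row in img[top:top + height]]

def translate_right(img, shift, fill=0):
    """Shift pixels to the right and fill the freed region with a constant."""
    return [[fill] * shift + row[:len(row) - shift] for row in img]

img = [[1, 2, 3],
       [4, 5, 6]]
print(flip_horizontal(img))
print(rotate_90_clockwise(img))
print(crop(img, 0, 1, 2, 2))
print(translate_right(img, 1))
```

Each transformation yields a new training sample that keeps the original label, which is the property the text augmentations try to preserve as well.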

Types of data augmentation that are applicable
for textual data
o Swap random letters within a single word
o Swap random words within a text
o Replace a word with its synonym
o Delete random letter within a single word
o Replace a random letter with a letter close to it on the keyboard
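The five text augmentations above can be sketched with the standard library alone. Everything named here is an assumption for illustration: the function names, the partial QWERTY neighbour map, and the tiny synonym table are placeholders, not the resources used to build the dataset.

```python
import random

# Partial QWERTY neighbour map (illustrative, not complete).
KEYBOARD_NEIGHBORS = {"a": "qwsz", "e": "wsdr", "i": "ujko",
                      "o": "iklp", "n": "bhjm", "t": "rfgy"}
# Hypothetical synonym table.
SYNONYMS = {"tumor": ["neoplasm"], "disorder": ["disease"]}

def swap_letters(word, rng):
    """Swap two randomly chosen letters within one word."""
    if len(word) < 2:
        return word
    i, j = rng.sample(range(len(word)), 2)
    chars = list(word)
    chars[i], chars[j] = chars[j], chars[i]
    return "".join(chars)

def swap_words(text, rng):
    """Swap two randomly chosen words within a text."""
    return " ".join(swap_letters_in_list(text.split(), rng))

def swap_letters_in_list(tokens, rng):
    """Helper: swap two random items of a list (returns a new list)."""
    tokens = list(tokens)
    if len(tokens) >= 2:
        i, j = rng.sample(range(len(tokens)), 2)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens

def replace_synonym(text, synonyms, rng):
    """Replace one word that has a known synonym."""
    tokens = text.split()
    candidates = [i for i, t in enumerate(tokens) if t in synonyms]
    if not candidates:
        return text
    i = rng.choice(candidates)
    tokens[i] = rng.choice(synonyms[tokens[i]])
    return " ".join(tokens)

def delete_letter(word, rng):
    """Delete one randomly chosen letter."""
    if len(word) < 2:
        return word
    i = rng.randrange(len(word))
    return word[:i] + word[i + 1:]

def keyboard_typo(word, rng):
    """Replace one letter with a neighbouring key, if the map covers it."""
    positions = [i for i, c in enumerate(word) if c in KEYBOARD_NEIGHBORS]
    if not positions:
        return word
    i = rng.choice(positions)
    return word[:i] + rng.choice(KEYBOARD_NEIGHBORS[word[i]]) + word[i + 1:]

rng = random.Random(0)  # fixed seed so runs are repeatable
text = "malignant tumor of lung"
print(swap_letters("diabetes", rng))
print(swap_words(text, rng))
print(replace_synonym(text, SYNONYMS, rng))
print(delete_letter("diabetes", rng))
print(keyboard_typo("diabetes", rng))
```

Letter-level changes imitate typing errors, while word swaps and synonym replacement vary the phrasing; in all cases the augmented description keeps the code of the original one.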

ExaMode dataset
Dataset version 2
• Remove noise
• Additional data augmentations
• Additional heuristics
• Additional data cleaning
• Split the dataset into 3 subgroups:
– Disorders
– Procedures
– Findings

ExaMode dataset
Dataset version 2
Summary:
– Disorders: ~105K SNOMED codes
– Procedures: ~67K SNOMED codes
– Findings: ~70K SNOMED codes

Text-based classification
o Binary classification
o Multiclass classification
o Multilabel classification

Binary classification
o Each sample takes exactly 1 label out of 2 classes

Review                              Sentiment
Delivered as expected               Positive
Good quality                        Positive
There are scratches on the surface  Negative
Works great                         Positive
I do not recommend it               Negative

Multiclass classification
o Each sample takes exactly 1 label out of several classes

Movie              Rating
Palmer             7
Bad Trip           6
Godzilla vs. Kong  6
Band of Brothers   9
Big fish           8

Multilabel classification
o Each sample takes one or more labels out of several classes

Movie              Drama  Comedy  Action  Sci-Fi  War  Adventure  Fantasy
Palmer             1      0       0       0       0    0          0
Bad Trip           0      1       0       0       0    0          0
Godzilla vs. Kong  0      0       1       1       0    0          0
Band of Brothers   1      0       1       0       1    0          0
Big fish           1      0       0       0       0    1          1
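The rows of the multilabel table are multi-hot vectors: one position per class, set to 1 when the label applies. A minimal sketch of the encoding and decoding, using the genre columns from the table (the function names are made up for illustration):

```python
GENRES = ["Drama", "Comedy", "Action", "Sci-Fi", "War", "Adventure", "Fantasy"]

def to_multi_hot(labels, classes=GENRES):
    """Encode a set of labels as a 0/1 vector with one slot per class."""
    unknown = set(labels) - set(classes)
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    return [1 if c in labels else 0 for c in classes]

def from_multi_hot(vector, classes=GENRES):
    """Decode a 0/1 vector back into the list of labels it marks."""
    return [c for c, flag in zip(classes, vector) if flag]

band_of_brothers = to_multi_hot(["Drama", "Action", "War"])
print(band_of_brothers)
print(from_multi_hot([1, 0, 0, 0, 0, 1, 1]))
```

Binary and multiclass targets are special cases of the same representation in which exactly one position is set.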

Classification model
o BERT (Bidirectional Encoder Representations from Transformers)
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 6000–6010, 2017.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.

o Why was BERT created?
− Big gap in the data
o BERT core idea
Source: Park, Dongju & Ahn, Chang Wook. Self-Supervised Contextual Data Augmentation for Natural Language Processing, 2019

o BERT used for classification
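In the setup described by Devlin et al., classification adds a single linear layer with a softmax on top of the encoder's representation of the special [CLS] token. The sketch below shows only that final step; the 3-dimensional vector, the weights, and the bias are toy values chosen for illustration, and a real model would learn them during fine-tuning.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify(cls_vector, weights, bias):
    """Linear layer followed by softmax; returns (best class index, probabilities)."""
    logits = [sum(w * x for w, x in zip(row, cls_vector)) + b
              for row, b in zip(weights, bias)]
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs

# Toy stand-in for the encoder output of the [CLS] token.
cls_vector = [0.2, -0.1, 0.7]
weights = [[1.0, 0.0, 0.0],   # one row of weights per class
           [0.0, 1.0, 0.0],
           [0.0, 0.0, 1.0]]
bias = [0.0, 0.0, 0.0]

best, probs = classify(cls_vector, weights, bias)
print(best, probs)
```

The output layer has one row per class, so its size grows linearly with the number of classes, which becomes a practical problem at the scale of SNOMED.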

o BERT advantages
− Incredible performance
− Open source
− Easy to pretrain with a small amount of medical data

o BERT pretrained models:
− BioBERT
− Multilingual BERT
− SlavicBERT
− ClinicalBERT
− PubMedBERT
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL, 2019.
Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, and Jaewoo Kang. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics, 2019.
Mikhail Arkhipov, Maria Trofimova, Yurii Kuratov, and Alexey Sorokin. Tuning multilingual transformers for language-specific named entity recognition. 2019.
Emily Alsentzer, John R. Murphy, Willie Boag, Wei-Hung Weng, Di Jin, Tristan Naumann, and Matthew B. A. McDermott. Publicly available clinical BERT embeddings. In ClinicalNLP workshop at NAACL, 2019.
Yu Gu et al. Domain-specific language model pretraining for biomedical natural language processing. arXiv preprint arXiv:2007.15779, 2020.

Embeddings
o Student: [2, 7]
o School: [3, 6]
o University: [1, 5]
o Dog: [6, 2.5]
o Cat: [5, 2]
o Fish: [7.5, 1]
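With these toy 2-D vectors, related words sit close together, which can be checked by computing distances. A minimal sketch using the coordinates from the slide; the `nearest` helper is made up for illustration.

```python
import math

EMBEDDINGS = {
    "Student": (2, 7),
    "School": (3, 6),
    "University": (1, 5),
    "Dog": (6, 2.5),
    "Cat": (5, 2),
    "Fish": (7.5, 1),
}

def nearest(word, embeddings=EMBEDDINGS):
    """Return the other word whose vector is closest in Euclidean distance."""
    return min((w for w in embeddings if w != word),
               key=lambda w: math.dist(embeddings[word], embeddings[w]))

for word in EMBEDDINGS:
    print(word, "->", nearest(word))
```

The education words and the animal words each form their own neighbourhood, which is the property that embedding-based clustering relies on.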

Embeddings
o Deep learning embeddings
Figure is based on: Park, Dongju & Ahn, Chang Wook. Self-Supervised Contextual Data Augmentation for Natural Language Processing, 2019

eXtreme scale classification
o Labels clustering
o Dataset with 10K+ classes

Labels clustering
o Labels embeddings
o Embeddings clustering

Embeddings clustering
o Clustering algorithms:
− Agglomerative clustering
− DBSCAN
− K-Means
− Mean Shift
− Spectral Clustering
− ...
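As an illustration of clustering label embeddings, here is a minimal K-Means sketch (Lloyd's algorithm) in plain Python. It reuses the toy 2-D vectors from the Embeddings slide as stand-ins for label embeddings; the starting centroids are an arbitrary choice for the example, and nothing here implies that K-Means was the algorithm finally selected.

```python
import math

def k_means(points, centroids, max_iter=100):
    """Lloyd's algorithm: alternate assignment and centroid update until stable."""
    centroids = [tuple(c) for c in centroids]
    assignment = []
    for _ in range(max_iter):
        # Assignment step: each point goes to its closest centroid.
        new_assignment = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no point changed cluster, so the result is stable
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:  # an empty cluster keeps its previous centroid
                centroids[k] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return assignment, centroids

names = ["Student", "School", "University", "Dog", "Cat", "Fish"]
points = [(2, 7), (3, 6), (1, 5), (6, 2.5), (5, 2), (7.5, 1)]

# Arbitrary starting centroids: the first and the last point.
assignment, centroids = k_means(points, [points[0], points[-1]])
for name, cluster in zip(names, assignment):
    print(name, cluster)
```

Once labels are grouped this way, a classifier only has to choose among clusters first, which is a much smaller problem than choosing among all labels at once.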

Refinement
o Possible solutions:
− Classical shallow ANN
− Deep learning approach
− Binary classifiers for every label

Acknowledgements
o Alexander Tahchiev
o Andrey Avramov
o Hristo Papazov
o Pavlin Gyurov
o Todor Primov
o Stanislav Slavkov
https://www.datasciencesociety.net/
https://www.ontotext.com

Thank you!
See Ontotext Platform demos
Star Wars API: https://swapi-platform.ontotext.com/graphiql/
Platform monitoring: https://test-platform.ontotext.com/grafana/
