Kavita Ganesan & Michael Subotin 
Presented at: 2014 IEEE International Conference on Big Data
All sorts of note types!
▪ Admit notes
◦ documenting why patient is being admitted
◦ baseline status, etc.
▪ Progress notes
◦ progress during course of hospitalization
▪ Discharge notes
◦ conclusion of a hospital stay or series of treatments
▪ Others
◦ Operative notes
◦ Procedure notes
◦ Delivery notes
◦ Emergency Department notes, etc.
PRIMARY CARE PHYSICIAN: 
Dr. XXXXX XXXXXXXX. 
CHIEF COMPLAINT: 
Injured right little toe. 
HISTORY OF PRESENT ILLNESS: 
This is a 63-year-old male with a past medical history of multiple 
myeloma who presents today after hitting his fifth toe of the right foot 
on a wood panel yesterday…… 
Review of Systems: 
CONSTITUTIONAL: No fever, chills, or weight loss. 
RESPIRATORY: No cough, shortness of breath, or wheezing. 
CARDIOVASCULAR: No chest pain, chest pressure, or palpitations. 
............... 
PAST MEDICAL HISTORY 
Multiple myeloma, peripheral neuropathy, hypertension.. 
PAST SURGICAL HISTORY:- 
Stem cell transplant. 
SOCIAL HISTORY 
The patient formerly smoked tobacco; however, quit within the last 10 
years. 
FAMILY HISTORY: 
Hypertension. 
ALLERGIES: 
ASPIRIN. 
……… 
Purpose of visit
Patient’s current condition in narrative form
Ongoing issues, issues in the past
Information on allergies
This is how most notes look: 
• some longer, some shorter 
• different set of headers, etc 
[Figure: three overlapping copies of the example note (PRIMARY CARE PHYSICIAN, CHIEF COMPLAINT, HISTORY OF PRESENT ILLNESS, Review of Systems, …)]
▪ Very unstructured
◦ formatting cues → inconsistent
◦ varies: across physicians, notes, hospitals
▪ Hard to analyze specific sections
◦ E.g. analyze allergies across a patient population
◦ Need to segment notes to extract all allergy info.
◦ Information collected varies from note type to note type
  ▪ Ex. info on progress notes vs. admit note
◦ Contents & formatting can vary from hospital to hospital
  ▪ Even within the same organization – E.g. Kaiser
◦ Contents & formatting vary between physicians
  ▪ Different styles, speed of typing, etc.
▪ If you are looking at a single note type, from a single hospital - then maybe
▪ Not suitable as a general segmentation approach:
▪ Can easily break:
◦ on unseen note types and minor format variations
◦ Example:
  ▪ regex based on all caps
  ▪ regex based on seen headers only
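The brittleness of an all-caps rule can be seen on the example note from the earlier slides. The sketch below is illustrative only: the pattern `ALL_CAPS_HEADER` is a hypothetical heuristic (all-caps phrase ending in a colon), not a rule taken from the presentation.

```python
import re

# Hypothetical heuristic (an assumption for illustration):
# a header is an all-caps phrase that ends with a colon.
ALL_CAPS_HEADER = re.compile(r"^[A-Z][A-Z ]+:$")

# Lines taken from the example note shown on the earlier slides.
note_lines = [
    "CHIEF COMPLAINT:",
    "Injured right little toe.",
    "Review of Systems:",
    "CONSTITUTIONAL: No fever, chills, or weight loss.",
    "PAST MEDICAL HISTORY",
    "Multiple myeloma, peripheral neuropathy, hypertension..",
    "PAST SURGICAL HISTORY:-",
    "Stem cell transplant.",
    "ALLERGIES:",
    "ASPIRIN.",
]

# Headers the rule finds.
detected = [line for line in note_lines if ALL_CAPS_HEADER.match(line.strip())]

# Real headers the rule misses because of minor format variations:
# mixed case, a missing colon, and a trailing ":-".
true_headers = [
    "CHIEF COMPLAINT:",
    "Review of Systems:",
    "PAST MEDICAL HISTORY",
    "PAST SURGICAL HISTORY:-",
    "ALLERGIES:",
]
missed = [h for h in true_headers if h not in detected]

print(detected)  # ['CHIEF COMPLAINT:', 'ALLERGIES:']
print(missed)    # ['Review of Systems:', 'PAST MEDICAL HISTORY', 'PAST SURGICAL HISTORY:-']
```

Loosening the pattern to catch the missed headers would also start matching lines such as "CONSTITUTIONAL: No fever, …", which is the trade-off that makes hand-written rules hard to generalize.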
▪ Several works have explored supervised methods for segmenting clinical notes
  [Cho et al. 2003, Tepper et al. 2012, Apostolova et al. 2009]
▪ Problem: methods not general!
◦ Cho et al. 2003: One model for each type of note
  ▪ 20 note types → 20 models!
  ▪ Not practical → have to maintain each model
◦ Tepper et al. 2012: Model had low adaptability to unseen documents
  ▪ features used, training data used, etc.
▪ General segmentation approach for clinical texts
▪ Requirements:
◦ Single model/approach for most note types
◦ Discount extreme non-standard formatting, e.g. tabular format
▪ Segment:
◦ Header
◦ Top level sections
◦ Footer
PRIMARY CARE PHYSICIAN: 
Dr. XXXXX XXXXXXXX. 
CHIEF COMPLAINT: 
Injured right little toe. 
HISTORY OF PRESENT ILLNESS: 
This is a 63-year-old male with a past medical history of multiple 
myeloma who presents today after hitting his fifth toe of the right foot 
on a wood panel yesterday…… 
Review of Systems: 
CONSTITUTIONAL: No fever, chills, or weight loss. 
RESPIRATORY: No cough, shortness of breath, or wheezing. 
CARDIOVASCULAR: No chest pain, chest pressure, or palpitations. 
............... 
PAST MEDICAL HISTORY 
Multiple myeloma, peripheral neuropathy, hypertension.. 
PAST SURGICAL HISTORY:- 
Stem cell transplant. 
SOCIAL HISTORY 
The patient formerly smoked tobacco; however, quit within the last 10 
years. 
FAMILY HISTORY: 
Hypertension. 
ALLERGIES: 
ASPIRIN. 
……… 
Header 
Top-level section 
Top-level section 
Top-level section 
Top-level section 
Top-level section 
Top-level section 
Top-level section
▪ Supervised approach using L1-regularized logistic regression with a constraint combination approach
▪ Idea: scan each line in a clinical document and label it as:
◦ BeginHeader
◦ ContHeader
◦ BeginSection
◦ ContSection
◦ Footer
▪ Labels are predicted with a certain confidence
▪ But there is a problem with using line-wise predictions as is:
◦ Label sequences may not make sense
◦ E.g. there may be a BeginHeader after a BeginSection → incorrect
▪ Post-processing: enforce sequence combination rules:
◦ First line of document: BeginHeader or BeginSection
◦ BeginHeader cannot come right after BeginHeader or ContHeader
◦ ContHeader must come after BeginHeader or ContHeader
◦ ContSection must come after BeginSection or ContSection
◦ Footer cannot come right after BeginHeader or ContHeader
▪ Rules applied after all lines in the document are labeled
◦ Applied to consecutive label pairs
◦ Computed efficiently: Viterbi algorithm
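The rules above can be sketched as a constrained Viterbi decode over per-line label probabilities. This is a minimal illustration, not the authors' implementation: the function names, the probability floor `1e-12`, and the toy probabilities are assumptions, and only the five rules listed on this slide are encoded.

```python
import math

LABELS = ["BeginHeader", "ContHeader", "BeginSection", "ContSection", "Footer"]
HEADER = ("BeginHeader", "ContHeader")
SECTION = ("BeginSection", "ContSection")


def allowed_start(label):
    # Rule: the first line must be BeginHeader or BeginSection.
    return label in ("BeginHeader", "BeginSection")


def allowed_pair(prev, cur):
    # Pairwise rules from the slide, applied to consecutive labels.
    if cur == "BeginHeader" and prev in HEADER:
        return False
    if cur == "ContHeader" and prev not in HEADER:
        return False
    if cur == "ContSection" and prev not in SECTION:
        return False
    if cur == "Footer" and prev in HEADER:
        return False
    return True


def constrained_decode(line_probs):
    """Return the most probable label sequence that satisfies the rules.

    line_probs: one dict per line mapping label -> classifier probability.
    Missing or zero probabilities are floored at 1e-12 (an assumption).
    """
    def logp(probs, label):
        return math.log(max(probs.get(label, 0.0), 1e-12))

    neg_inf = float("-inf")
    scores = [{lab: logp(line_probs[0], lab) if allowed_start(lab) else neg_inf
               for lab in LABELS}]
    back = [{}]
    for probs in line_probs[1:]:
        cur_scores, cur_back = {}, {}
        for cur in LABELS:
            # Best allowed predecessor for this label.
            candidates = [p for p in LABELS if allowed_pair(p, cur)]
            best_prev = max(candidates, key=lambda p: scores[-1][p])
            cur_scores[cur] = scores[-1][best_prev] + logp(probs, cur)
            cur_back[cur] = best_prev
        scores.append(cur_scores)
        back.append(cur_back)

    # Follow back-pointers from the best final label.
    path = [max(LABELS, key=lambda lab: scores[-1][lab])]
    for i in range(len(line_probs) - 1, 0, -1):
        path.append(back[i][path[-1]])
    path.reverse()
    return path


# Toy probabilities (made up): taking the top label per line independently
# gives BeginSection, ContHeader, ContSection, which violates the rules.
toy = [
    {"BeginSection": 0.9, "BeginHeader": 0.1},
    {"ContHeader": 0.5, "ContSection": 0.4, "BeginSection": 0.1},
    {"ContSection": 0.8, "BeginSection": 0.2},
]
decoded = constrained_decode(toy)
print(decoded)  # ['BeginSection', 'ContSection', 'ContSection']
```

Working in log space keeps long documents from underflowing, and forbidden transitions are simply excluded from the candidate set rather than given a penalty.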
Inpatient Outpatient 
• Notes from 12 different enterprises 
• Some large enterprises 
• All sorts of note types 
• Some noisy sectioning, some clean 
• 100 radiology notes 
• Fairly clean sections 
• One hospital 
• All sorts of note types 
• Fairly well sectioned 
• 35,000 notes in total
• 2000 randomly sampled notes (inpatient)
• 100 radiology notes 
• Fairly clean sections
▪ Emphasis on training data
▪ Variation in training data
◦ Use different note types for training
◦ Intuition: helps the model generalize well
▪ Sample training data:
◦ Instead of using all training data from 2100 notes
◦ Generated subsets of training data with varying size and cross-validated on test sets
◦ Intuition: allows us to pick the best model
▪ Best model used fewer than 700 notes (out of 2100)
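The sampling strategy above can be sketched as a loop over subset sizes. The sketch assumes a caller-supplied `score_fn` (hypothetical) that trains a model on a subset and returns its validation accuracy; the slides do not specify the subset sizes or the sampling procedure, so those are placeholders.

```python
import random


def pick_best_subset(train_notes, sizes, score_fn, seed=0):
    """Sample one subset per size and keep the best-scoring one.

    score_fn is a hypothetical callback: it should train on the subset
    and return a validation accuracy. Returns (score, size, subset).
    """
    rng = random.Random(seed)
    best = None
    for size in sizes:
        subset = rng.sample(train_notes, min(size, len(train_notes)))
        score = score_fn(subset)
        if best is None or score > best[0]:
            best = (score, size, subset)
    return best


# Stand-in scorer for demonstration only: it pretends accuracy peaks
# at 500 notes. A real scorer would train and validate a model.
def dummy_score(subset):
    return -abs(len(subset) - 500)


notes = list(range(2100))  # placeholder note IDs
best_score, best_size, best_subset = pick_best_subset(
    notes, sizes=[100, 500, 1000, 2100], score_fn=dummy_score
)
print(best_size)  # 500
```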
▪ 5 test sets
◦ 4 of the 5 test sets are from hospitals not in the training set
  → true estimate of accuracy
◦ Covers both inpatient and outpatient notes
◦ Covers different note types
◦ ~12,500 test notes
▪ Primary evaluation metric: line-wise accuracy
◦ percentage of correctly predicted line labels
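Line-wise accuracy as defined above is straightforward to compute. A minimal sketch follows; the function name and the toy labels are illustrative.

```python
def linewise_accuracy(gold_labels, predicted_labels):
    """Percentage of lines whose predicted label equals the gold label."""
    if len(gold_labels) != len(predicted_labels):
        raise ValueError("gold and predicted sequences must have equal length")
    if not gold_labels:
        return 0.0
    correct = sum(g == p for g, p in zip(gold_labels, predicted_labels))
    return 100.0 * correct / len(gold_labels)


gold = ["BeginSection", "ContSection", "BeginSection", "ContSection"]
pred = ["BeginSection", "ContSection", "ContSection", "ContSection"]
accuracy = linewise_accuracy(gold, pred)
print(accuracy)  # 75.0
```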
1st model: limited variety (hp + discharge)
2nd model: variety (11 types - hp, ds, pn…)

Train set                    3-fold cross-validation   Unseen test accuracy
Inp1HospB (300 - limited)    96.70%                    67.00%
Inp3HospD (300 - varied)     96.58%                    88.23%

• 3-fold cross-validation accuracy: high in both
• Model with variety: higher accuracy on unseen test set
Important to have variety in training notes when building a general segmentation model
Accuracy consistently > 90% across enterprises

Client/Data          In/Outpatient   # Test Docs   Accuracy
1. Inp1HospB         In              300           92.58%
2. Inp2HospC         In              1000          93.29%
3. Inp3HospD         In              300           95.81%
4. Rad1MixedHosps    Out             9000          92.45%
5. Rad2HospA         Out             1902          93.67%
Average                                            93.56%

• Average accuracy: 93.56%
• Covers inpatient/outpatient
Single model, but performs well across enterprises
Accuracy Breakdown for Inp2HospC

Document Type              Accuracy
1. History and Physical    95.70%
2. Physician Clinicals     93.10%
3. Discharge Summary       94.00%
4. Consult Note            94.60%
5. Short Stay Summary      94.60%
6. Operative Note          92.20%
7. Progress Note           87.80%
8. Cardiac Cath Report     85.40%
9. Procedure Note          83.60%

• Performs very well (> 90%): document types 1 to 6
• Reasonable (> 80%): document types 7 to 9
• Model performs well across note types
• Lowest performance: procedure notes (low recall on segmenting “technique” sections)
[Figure: # Notes vs. Accuracy. x-axis: # Training Notes (0 to 2000); y-axis: Accuracy (86.00% to 94.00%)]
• Avg. accuracy peaks @ 500 notes on all test sets
• No benefit with more notes
No need for big data for a general model.
We need good data from all that big data!
▪ Unigrams of each line (LineUnigram)
▪ Relative position of line in document (PosInDoc)
◦ Top, Middle, Bottom
▪ Known header features (KnownHeader)
◦ Find potential headers using a repository of seen headers
◦ Seen headers can have a canonical type
  E.g. Past Medical History, Previous Med History → “PAST_MEDICAL_HISTORY”
◦ If potential headers are found, we include features:
  ▪ Canonical type
  ▪ Unigram & char n-gram of potential header
  ▪ Caps/colon info – mixed case, all caps, lowercase
  ▪ Length of potential header
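The three feature groups above could be extracted per line roughly as follows. This is a sketch under stated assumptions: the `KNOWN_HEADERS` repository, the feature-name strings, the equal-thirds position buckets, the choice of character trigrams, and treating the text before the first colon as the potential header are all illustrative choices, not details given in the slides.

```python
import re

# Hypothetical repository of seen headers mapped to canonical types.
KNOWN_HEADERS = {
    "past medical history": "PAST_MEDICAL_HISTORY",
    "previous med history": "PAST_MEDICAL_HISTORY",
    "allergies": "ALLERGIES",
}


def position_bucket(index, total):
    # PosInDoc: split the document into thirds (an assumed bucketing).
    frac = index / max(total - 1, 1)
    if frac < 1 / 3:
        return "Top"
    if frac < 2 / 3:
        return "Middle"
    return "Bottom"


def char_ngrams(text, n=3):
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def line_features(line, index, total):
    """Build a sparse feature dict for one line of a note."""
    feats = {}
    # LineUnigram: lowercase word tokens of the line.
    for tok in re.findall(r"[A-Za-z0-9]+", line.lower()):
        feats["LineUnigram=" + tok] = 1.0
    feats["PosInDoc=" + position_bucket(index, total)] = 1.0

    # KnownHeader: text before the first colon (or the whole line)
    # is looked up in the repository of seen headers.
    candidate = line.split(":", 1)[0].strip()
    key = candidate.lower()
    if key in KNOWN_HEADERS:
        feats["KnownHeader.type=" + KNOWN_HEADERS[key]] = 1.0
        for tok in key.split():
            feats["KnownHeader.unigram=" + tok] = 1.0
        for gram in char_ngrams(key):
            feats["KnownHeader.char3=" + gram] = 1.0
        if candidate.isupper():
            case = "all_caps"
        elif candidate.islower():
            case = "lowercase"
        else:
            case = "mixed_case"
        feats["KnownHeader.case=" + case] = 1.0
        feats["KnownHeader.has_colon"] = 1.0 if ":" in line else 0.0
        feats["KnownHeader.length"] = float(len(candidate))
    return feats


header_feats = line_features("PAST MEDICAL HISTORY", index=4, total=10)
body_feats = line_features("Injured right little toe.", index=1, total=10)
print(sorted(k for k in header_feats if not k.startswith("KnownHeader.char3")))
```

A dict of named features like this can be passed to any sparse linear classifier; the mapping from "Previous Med History" to the same canonical type is what lets differently worded headers share evidence.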
Feature Set                         Avg. Accuracy   Improvement
LineUnigram                         85.55%
LineUnigram+PosInDoc                88.62%          +3.46%
LineUnigram+PosInDoc+KnownHeader    93.10%          +4.81%
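The slides do not say how the Improvement column is computed. One reading that reproduces both reported values is the accuracy gain divided by the new (higher) accuracy; this is an inference from the numbers, not something the presentation states. The function name below is illustrative.

```python
def gain_relative_to_new(old_acc, new_acc):
    """Accuracy gain as a percentage of the new accuracy (assumed formula)."""
    return 100.0 * (new_acc - old_acc) / new_acc


step1 = round(gain_relative_to_new(85.55, 88.62), 2)
step2 = round(gain_relative_to_new(88.62, 93.10), 2)
print(step1, step2)  # 3.46 4.81
```

For comparison, the absolute gains are 3.07 and 4.48 percentage points.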
▪ Explored:
◦ Supervised approach to building a very general segmentation model for clinical texts
▪ Evaluation showed:
◦ Model works well on notes across enterprises
◦ Model works across note types
▪ Key to effectiveness:
◦ Variation in training data – all sorts of note types
◦ Training data selection strategy – sample and cross-validate
◦ Feature set – not explored in existing works
Contact: 
Kavita Ganesan 
ganesan.kavita@gmail.com 
www.kavita-ganesan.com 
www.text-analytics101.com


Segmentation of Clinical Texts
