Marina Sokolova
Institute for Big Data Analytics
DECM and School of EECS,
University of Ottawa,
sokolova@uottawa.ca
M. Sokolova
Opinion Mining of Online Medical Forums
 19%-28% of Internet users participate in online
health discussions.
 In North America, 59% of all adults have
looked online for information about a range of
health issues, the most popular being specific
diseases and treatments.
• Personal health information (PHI) is information about one’s health discussed by a patient in a clinical setting.
• PHI is the most vulnerable private information posted online.
  - I have a family history of Alzheimer's disease. I have seen what it does and its sadness is a part of my life. I am already burdened with the knowledge that I am at risk.
  - We're going for the basic blood tests, the NT scan, and the "Ashkenazi panel" since both XX and I are Jewish from E. European descent.
• 26% of adult internet users have read or watched someone else’s account of a health or medical issue in the past 12 months.
• 16% of adult internet users in the U.S. have gone online in the past 12 months to find others who share the same health concerns.
• Up to 49% of the users are most interested in personal testimonials related to health.
• Fewer than 25% of the users are interested only in facts.
• Understanding PHI posted by the general public is important for the development of health care policies.
  - I really dont know why everyones freaking out about the H1N1 vaccine. I got it the first day it came out (about a week and a half ago) and so did 4 of my family members. None of us had any problems and were all really glad we got the vaccine.
• Before social networks, PHI studies were conducted on restricted and controlled groups (e.g., nuns from the same monastery, patients of the same clinic).
• Time-, event- and location-dependent!
• Manual analysis
  - pros: the most accurate (80-95% of labels coincide); can work with any type of data
  - cons: effort- and labor-intensive; inviting more annotators can improve or hurt accuracy; agreement depends on the topic (kappa 0.60-0.73)
  - data size: up to 1,000 text units per person
• Fully automated
  - pros: fast and portable
  - cons: works only with certain types of data; accuracy varies widely (50-80%)
  - data size: 1,000-10,000 units
• Automated, with humans in the loop
  - pros: relatively accurate (65-80%); can work with any type of data
  - cons: less portable than fully automated analysis; less accurate than manual analysis
  - data size: 500-10,000 units
• Open-source data
• Analytics kit:
  - Social Mining: framework
  - Information Extraction: knowledge-based search
  - Machine Learning: processing of (extremely) large data
  - Opinion mining: semantic analysis of data
• Strategically placed humans in the loop
  - Data annotation by 2-3 annotators
  - Exhaustive verification of positive results
  - Random verification of negative results
• PHI resources
  - ontology of PHI terms
  - HealthAffect lexicon
• We used random posts to verify whether the messages were self-evident for sentiment annotation or required additional context.
• We looked for discussions in which the forum participants discussed only one topic.
  - A preliminary analysis showed that discussions with ≤ 30 posts satisfied this condition.
• We wanted discussions to be long enough to form a meaningful discourse.
  - This condition was satisfied when a discussion had ≥ 5 messages.
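The two length criteria above can be combined into a single filter. The following is a minimal sketch, assuming each discussion is available as a list of posts keyed by a thread id; the function name, constants, and toy data are hypothetical and not part of the original study.

```python
from typing import Dict, List

MIN_POSTS = 5    # shortest discussion that still forms a meaningful discourse
MAX_POSTS = 30   # longest discussion that stayed on a single topic

def select_discussions(discussions: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Keep discussions whose number of posts lies in [MIN_POSTS, MAX_POSTS]."""
    return {
        thread_id: posts
        for thread_id, posts in discussions.items()
        if MIN_POSTS <= len(posts) <= MAX_POSTS
    }

# Hypothetical toy input: thread ids mapped to lists of posts.
toy = {
    "short": ["post"] * 3,
    "ok": ["post"] * 12,
    "long": ["post"] * 45,
}
selected = select_discussions(toy)
```

Here only the thread named `ok` survives; both bounds are inclusive, matching the ≤ 30 and ≥ 5 conditions.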
• Social Mining identifies demographic parameters which influence and imply language parameters.
• Information Extraction uses those language parameters to find and retrieve relevant information from text.
  - I have a family history of Alzheimer's disease. I have seen what it does and its sadness is a part of my life. I am already burdened with the knowledge that I am at risk.
Record   A1   ...   A5000   Class
1        1    ...   0       P
2        1    ...   1       N
...
3000     0    ...   1       N

In the table, each record represents one comment.
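A table of this shape can be produced by a binary bag-of-words encoding. The sketch below is an illustration only: the tokenizer, the toy comments, and their P/N labels are assumptions, and the source does not say how its attributes A1 to A5000 were actually built.

```python
import re
from typing import List, Tuple

def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def binary_bag_of_words(comments: List[str]) -> Tuple[List[str], List[List[int]]]:
    """Return the vocabulary and one 0/1 row per comment (1 = word occurs)."""
    vocabulary = sorted({tok for c in comments for tok in tokenize(c)})
    index = {tok: i for i, tok in enumerate(vocabulary)}
    rows = []
    for comment in comments:
        row = [0] * len(vocabulary)
        for tok in tokenize(comment):
            row[index[tok]] = 1
        rows.append(row)
    return vocabulary, rows

# Hypothetical toy data: two comments with made-up class labels.
comments = ["I am at risk", "I am glad"]
labels = ["N", "P"]
vocab, matrix = binary_bag_of_words(comments)
```

Each row of `matrix` plays the role of one record, each vocabulary entry the role of one attribute column, and `labels` the role of the Class column.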
• Learning modes: do we know the data labels?
• Training and test stages: training and test sets should not overlap!
• Algorithms: simpler often means more robust.
• Model selection: it is always worth trying several parameter settings.
• Performance evaluation: verify results on positive AND negative classes.
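The requirement that training and test sets never overlap can be enforced mechanically when splitting the data. A minimal sketch of k-fold splitting follows; the function name and the default of 10 folds are choices made for this illustration, and a real experiment would usually shuffle or stratify the items first.

```python
from typing import Iterator, List, Tuple

def k_fold_indices(n_items: int, k: int = 10) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, test) index lists; every item is tested exactly once."""
    if not 2 <= k <= n_items:
        raise ValueError("k must be between 2 and n_items")
    indices = list(range(n_items))
    for fold in range(k):
        test = indices[fold::k]            # every k-th item, starting at `fold`
        test_set = set(test)
        train = [i for i in indices if i not in test_set]
        yield train, test

folds = list(k_fold_indices(25, k=10))
```

Within each fold the two index lists are disjoint by construction, and across folds each item appears in a test set once.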
• General health information: they are promoting cancer awareness particularly lung cancer
• Personal health information: I had a rare condition and half of my lung had to be removed
• Irrelevant: I saw a guy chasing someone and screaming at the top of his lungs
• Terminology: the transfer went well - my RE did it himself which was comforting. 2 embies (grade 1 but slow in development) so I am not holding my breath for a positive
• Technical terms: Someone with 50 DB hearing aid gain with a total loss of 70 DB may not know that the place is producing 107 DB since it may not appear too loud to him since he only perceives 47 DB
• Sentiment: I am sickened by the thought …
• Ailment: I feel sick for awhile; should see my physician
• Opinion: I think it is evident that …
• Improvement: The benefit is usually evident within a few days of starting it
• Humor: don't forget that it's better for your health to enjoy your steak than to resent your sprouts
• Complaint: After that my health deteriorated …
• Non-textual features
  - people use few emoticons when discussing PHI
  - people do not create new hashtags about PHI
• General-purpose lexicons
  - WordNet’s semantics needs substantiation
  - SentiWordNet and WordNet-Affect require considerable adjustment
• Electronic medical dictionaries are developed to analyze scientific publications
  - the Medical Dictionary for Regulatory Activities (MedDRA): 8,561 unique terms / 86 PHI terms
  - the Systematized Nomenclature of Medicine Clinical Terms (SNOMED CT): 44,802 unique terms / 108 PHI terms
55 discussions, 1008 comments from North American online forums gathered in Summer 2013.
Topics discussed by many participants:
• Security and safety of retention/storage of DNA data
• Exploitation of DNA data by government and/or insurance companies
• Anonymity of the data for research purposes
  - And you have no problem with this government owning their genetic code, potentially knowing illness, disabilities, strengths, weaknesses and potential? A trusting soul you are indeed.
From 7238 discourse units, we identified that the forum users
• are patient
  - Elaboration: 49%
  - Explanation and background: 8%
• want to convince other participants
  - Joint units: 17%
  - Enabling, attribution, summary: 7%
• can be skeptical
  - Evaluation and conditions: 10%
• can argue
  - Contrast: 8%
• Text unit: one comment
• Classification categories: positive, negative, mix, and irrelevant.
  - Each comment was assigned a label. There were no “other” comments.
• Text representations
  - Bag-of-words, # of words identified as positive (negative) by SentiWordNet and General Inquirer, # of punctuation marks, emoticons, elongated words.
  - DepecheMood, SentiWordNet, General Inquirer
• SVM, NB, 10-fold cross-validation.
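Some of the surface features named above (punctuation marks, emoticons, elongated words) can be counted with regular expressions. This is a rough sketch under stated assumptions: the emoticon pattern covers only a handful of common forms, "elongated" is taken to mean a letter repeated three or more times, and the lexicon-based counts (SentiWordNet, General Inquirer) are not covered because they need the lexicon files.

```python
import re
from typing import Dict

# Illustrative patterns only; a real system would use a fuller emoticon list.
EMOTICON = re.compile(r"[:;=]-?[)(DPp]")
REPEATED_LETTER = re.compile(r"([A-Za-z])\1{2,}")   # same letter 3+ times in a row
PUNCTUATION = re.compile(r"[!?.,;:]")

def surface_features(comment: str) -> Dict[str, int]:
    """Count punctuation marks, emoticons and elongated words in one comment."""
    emoticons = EMOTICON.findall(comment)
    # Remove emoticons first so ":)" is not also counted as punctuation.
    text = EMOTICON.sub(" ", comment)
    words = re.findall(r"[A-Za-z]+", text)
    return {
        "punctuation": len(PUNCTUATION.findall(text)),
        "emoticons": len(emoticons),
        "elongated": sum(1 for w in words if REPEATED_LETTER.search(w)),
    }

features = surface_features("Grow embies grow!!!! Sooo happy :)")
```

The resulting counts can be appended to a bag-of-words row as extra numeric attributes before training a classifier.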
• 5-class flat classification
  - On Bag-of-words+, accuracy 66.3%
  - On sentiment lexicons, accuracy 63.9-64.2%
• Two-level semi-automated classification
  - Irrelevant comments removed manually
  - Positive/negative/mix from the relevant comments
  - Accuracy 60.8%
• Hierarchical classification
  - Relevant/irrelevant classifier: accuracy 77.3%
  - Positive/negative/mix from the result of the above classifier: accuracy 58.7%
• Irrelevant comments help!?
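The hierarchical setup can be pictured as a two-stage cascade: a relevance classifier runs first, and only comments it passes on reach the sentiment classifier. The sketch below shows the control flow only; the two keyword-based stand-in classifiers are hypothetical placeholders, not the SVM or NB models used in the study.

```python
from typing import Callable, List

Classifier = Callable[[str], str]

def hierarchical_classify(
    comments: List[str],
    relevance_clf: Classifier,
    sentiment_clf: Classifier,
) -> List[str]:
    """Stage 1 filters out irrelevant comments; stage 2 labels the rest."""
    labels = []
    for comment in comments:
        if relevance_clf(comment) == "irrelevant":
            labels.append("irrelevant")
        else:
            labels.append(sentiment_clf(comment))
    return labels

# Hypothetical stand-ins for trained models, for illustration only.
def toy_relevance(comment: str) -> str:
    return "relevant" if "dna" in comment.lower() else "irrelevant"

def toy_sentiment(comment: str) -> str:
    text = comment.lower()
    has_pos, has_neg = "good" in text, "bad" in text
    if has_pos and has_neg:
        return "mix"
    return "negative" if has_neg else "positive"

predicted = hierarchical_classify(
    [
        "Storing DNA data is a good idea",
        "Bad idea to share DNA, good for research though",
        "Nice weather today",
    ],
    toy_relevance,
    toy_sentiment,
)
```

In a cascade like this, mistakes made by the first stage are passed on to the second, which is one possible reason the second-stage accuracy can fall below that of a flat classifier.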
• 6 sub-forums of IVF.ca
  - 95% of participants are women
• Empirical results were obtained on the sub-forum Age 35+
  - 130 discussions, 1438 messages.
  - Each separate discussion contained a coherent discourse.
  - Unexpected shifts in the discourse flow can be introduced by a new participant joining the discussion.
• Five emotional and non-emotional categories: encouragement, gratitude, confusion, facts, and endorsement.
  - identified bottom-up: from specific to general
• Alice: Jane - whats going on??
• Jane: We have our appt. Wednesday!! EEE!!!
• Beth: Good luck on your transfer! Grow embies grow!!!!
• Jane: The transfer went well - my RE did it himself which was comforting. 2 embies (grade 1 but slow in development) so I am not holding my breath for a positive. This really was my worst cycle yet; it was the Antagonist protocol which is supposed to be great when you are over 40 but not so much for me!!
Manual Annotation
+ Two raters annotated each post with the dominant sentiment.
+ Only the author’s subjective comments were marked as such;
  - if the author conveyed sentiments of others, we did not mark them.
+ We obtained Fleiss' kappa = 0.737, which indicated strong agreement between annotators.
  - The kappa value demonstrated an adequate selection of sentiment classes and appropriate annotation guidelines.
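For reference, Fleiss' kappa compares the observed agreement between raters with the agreement expected by chance, given how often each category was used. The sketch below implements the standard formula on toy data; the function name and the example ratings are illustrative and do not reproduce the study's annotations.

```python
from collections import Counter
from typing import Sequence

def fleiss_kappa(ratings: Sequence[Sequence[str]]) -> float:
    """ratings[i] holds the labels given to item i, one per rater."""
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number (>= 2) of ratings")

    category_totals = Counter()
    agreement_sum = 0.0
    for item in ratings:
        counts = Counter(item)
        category_totals.update(counts)
        # Share of rater pairs that agree on this item.
        agreement_sum += (sum(c * c for c in counts.values()) - n_raters) / (
            n_raters * (n_raters - 1)
        )

    observed = agreement_sum / n_items
    expected = sum((t / (n_items * n_raters)) ** 2 for t in category_totals.values())
    if expected == 1.0:      # only one category was ever used
        return 1.0
    return (observed - expected) / (1.0 - expected)

# Toy example: two raters, four posts, one disagreement.
toy_ratings = [
    ("facts", "facts"),
    ("facts", "facts"),
    ("gratitude", "gratitude"),
    ("gratitude", "facts"),
]
kappa = fleiss_kappa(toy_ratings)
```

On the toy data the observed agreement is 0.75 and the chance agreement is 34/64, which gives a kappa of 7/15, about 0.47.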
Classification category           # posts   Percent
Facts                             494       34.4%
Encouragement                     333       23.2%
Endorsement                       166       11.5%
Confusion                         146       10.2%
Gratitude                         131       9.1%
Ambiguous, i.e. raters disagree   168       11.7%
Total                             1438      100%
• Discussions usually start with a participant expressing her doubts and concerns, continue with a description of the treatment, and conclude with an announcement of the results.
• All these cornerstone messages received corresponding replies.
• Within discussions, messages were related: every posted message replied to one or several previous messages.
• 4-class classification, where all 1269 unambiguous posts are classified into encouragement, gratitude, confusion, and neutral (i.e., facts and endorsement)
• 3-class classification (positive: encouragement, gratitude; negative: confusion; neutral: facts and endorsement)
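Both groupings amount to a relabeling of the five annotated categories. A small sketch of that mapping follows; the dictionary and function names are made up for illustration, and posts labeled ambiguous are simply dropped, in line with the use of unambiguous posts only.

```python
from typing import Dict, List

# Five annotated categories regrouped for the 4-class experiment.
FOUR_CLASS: Dict[str, str] = {
    "encouragement": "encouragement",
    "gratitude": "gratitude",
    "confusion": "confusion",
    "facts": "neutral",
    "endorsement": "neutral",
}

# The same categories regrouped for the 3-class experiment.
THREE_CLASS: Dict[str, str] = {
    "encouragement": "positive",
    "gratitude": "positive",
    "confusion": "negative",
    "facts": "neutral",
    "endorsement": "neutral",
}

def regroup(labels: List[str], mapping: Dict[str, str]) -> List[str]:
    """Map labels to coarser classes; labels not in the mapping are dropped."""
    return [mapping[label] for label in labels if label in mapping]

three = regroup(["facts", "gratitude", "ambiguous", "confusion"], THREE_CLASS)
four = regroup(["facts", "gratitude", "ambiguous", "endorsement"], FOUR_CLASS)
```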
Metrics                  4-class classification   3-class classification
microaverage F-score     0.633                    0.672
macroaverage Precision   0.593                    0.625
macroaverage Recall      0.686                    0.679
macroaverage F-score     0.636                    0.651
Baseline F-score         0.281                    0.356

Precision (P): how many of the comments identified as C indeed belong to class C; rate of false hits.
Recall (Re): how many of all comments from class C are identified as C; rate of mi…
F-score: the harmonic mean of P and Re.
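The difference between micro- and macroaveraging can be made concrete with a short computation. The sketch below assumes single-label predictions and computes the macroaverage F-score as the harmonic mean of macroaverage precision and recall, a convention that reproduces the macroaverage F-scores in the table (0.636 and 0.651); the function names and toy labels are illustrative.

```python
from typing import Dict, List

def f_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0

def _ratio(numerator: int, denominator: int) -> float:
    return numerator / denominator if denominator else 0.0

def averaged_scores(gold: List[str], predicted: List[str]) -> Dict[str, float]:
    classes = sorted(set(gold) | set(predicted))
    tp = {c: 0 for c in classes}
    fp = {c: 0 for c in classes}
    fn = {c: 0 for c in classes}
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1   # a false hit for the predicted class
            fn[g] += 1   # a miss for the true class

    # Macroaverage: average the per-class scores, so every class counts equally.
    macro_p = sum(_ratio(tp[c], tp[c] + fp[c]) for c in classes) / len(classes)
    macro_r = sum(_ratio(tp[c], tp[c] + fn[c]) for c in classes) / len(classes)

    # Microaverage: pool the counts first, so every comment counts equally.
    all_tp, all_fp, all_fn = sum(tp.values()), sum(fp.values()), sum(fn.values())
    micro_p = _ratio(all_tp, all_tp + all_fp)
    micro_r = _ratio(all_tp, all_tp + all_fn)

    return {
        "macro_precision": macro_p,
        "macro_recall": macro_r,
        "macro_f": f_score(macro_p, macro_r),
        "micro_f": f_score(micro_p, micro_r),
    }

# Toy labels for illustration only.
scores = averaged_scores(
    gold=["pos", "pos", "neg", "neu"],
    predicted=["pos", "neg", "neg", "neu"],
)
```

As a check against the table, `f_score(0.593, 0.686)` rounds to 0.636 and `f_score(0.625, 0.679)` rounds to 0.651.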
• The most accurate classification occurred for gratitude.
  - It was correctly classified in 83.6% of its occurrences.
  - It was most commonly misclassified as encouragement (9.7%).
• The second most accurate result was achieved for encouragement.
  - It was correctly classified in 76.7% of cases.
  - It was misclassified as neutral, i.e. facts + endorsement, in 9.8% of cases.
• The least correctly classified class was neutral (50.8%).
  - One possible explanation is the presence of sentiment-bearing words in descriptions of facts, in posts that are objective in general and were marked as factual by the annotators.
Pairs                                         Occurrence   Percent
facts, facts                                  170          19.5%
encouragement, encouragement                  119          13.7%
facts, encouragement                          55           6.3%
endorsement, facts                            53           6.1%
encouragement, facts                          44           5.1%

Triads                                        Occurrence   Percent
factual, factual, factual                     94           12.8%
encouragement, encouragement, encouragement   63           8.6%
encouragement, gratitude, encouragement       18           2.4%
factual, endorsement, factual                 18           2.4%
confusion, factual, factual                   17           2.3%
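Counts like these can be obtained by sliding a window of length two or three over the label sequence of each discussion. The following is a minimal sketch on made-up sequences; how the study treated ambiguous posts and which total the percentages refer to are not stated, so neither is modeled here.

```python
from collections import Counter
from typing import List, Sequence, Tuple

def label_ngrams(discussions: Sequence[Sequence[str]], n: int) -> Counter:
    """Count runs of n consecutive labels, never crossing discussion borders."""
    counts: Counter = Counter()
    for labels in discussions:
        for start in range(len(labels) - n + 1):
            counts[tuple(labels[start:start + n])] += 1
    return counts

# Hypothetical label sequences, one list per discussion.
toy_discussions: List[List[str]] = [
    ["facts", "facts", "facts", "encouragement"],
    ["confusion", "facts", "facts"],
]
pairs = label_ngrams(toy_discussions, 2)
triads = label_ngrams(toy_discussions, 3)
top_pair: Tuple[Tuple[str, ...], int] = pairs.most_common(1)[0]
```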
• The most reinforcing transition: facts -> facts (0.47)
• The least reinforcing transition: gratitude -> gratitude (0.14)
• The most frequent changes: confusion -> facts and gratitude -> encouragement (0.30 each)
• The least frequent change: facts -> confusion (0.02)
• The most frequent 1st comment: confusion (0.57)
• The most frequent last comment: facts (0.39)
• The most ambiguous comments: 1st (0.26)
• The least ambiguous comment: endorsement - ambiguous (0.06)
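One way to arrive at transition values of this kind is to estimate, for each category, how often the next comment in the same discussion carries each category. The sketch below assumes the reported values are such conditional proportions, which the slides do not state explicitly; the function name and toy sequences are illustrative.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Sequence

def transition_probabilities(
    discussions: Sequence[Sequence[str]],
) -> Dict[str, Dict[str, float]]:
    """Estimate P(next label | current label) from consecutive comments."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for labels in discussions:
        for current, following in zip(labels, labels[1:]):
            counts[current][following] += 1
    return {
        current: {nxt: n / sum(nxt_counts.values()) for nxt, n in nxt_counts.items()}
        for current, nxt_counts in counts.items()
    }

# Hypothetical label sequences, one list per discussion.
toy_sequences: List[List[str]] = [
    ["confusion", "facts", "facts", "gratitude"],
    ["confusion", "encouragement", "gratitude", "encouragement"],
]
probs = transition_probabilities(toy_sequences)
```

Each inner dictionary sums to 1, so a value such as `probs["facts"]["facts"]` reads as the share of comments after a factual comment that are factual again.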
• The 15 most active authors posted 15-50 comments each. This comes to 387 texts, or 29% of the data.
  - 71% of comments convey facts, endorsement and encouragement
  - the remaining non-ambiguous comments are evenly split between confusion and gratitude
• 11 authors had posts marked as confusion.
• Only 8 authors had posts marked as gratitude.
• Comments of prolific authors were more ambiguous than those of other authors: 16% vs 12.5%
Distribution of categories (pie charts):

Category        First comments   Last comments   Comments written by prolific authors   All comments
ambiguous       26%              16%             16%                                    13%
confusion       57%              1%              5%                                     9%
encouragement   0%               25%             25%                                    24%
endorsement     1%               8%              13%                                    12%
facts           16%              39%             34%                                    33%
gratitude       0%               11%             7%                                     9%
• Too early for conclusions :)
  - Proud to say: our group was the 1st TDM group to analyse PHI in user-generated Web content
• In the future, Social Mining may play a bigger role
  - Eventually, PHI resources will be developed on the scale of current medical resources
• Privacy protection will start at the source
  - Authorship attribution F-score = 0.97
• Personal Health Information retrieval
  - Twitter (Sokolova et al, RANLP 2013)
  - MySpace (Ghazinour et al, AI 2013)
• Opinion Mining on Health Care
  - medical forums (Ali et al, IJCNLP 2013; Poursepanj, M.Sc. thesis, in progress)
  - Commentosphere (Sobhani, Ph.D. thesis, in progress)
• Sentiment Analysis of Personal Health Information
  - Twitter (Bobicev et al, AI 2012)
  - medical forums (Ali et al, RANLP 2013; Bobicev & Sokolova, RANLP 2013)
• Sentiment propagation of Personal Health Information
  - IVF forums (Bobicev et al, SocialNLP 2014)
Thank you!
Questions?