To Label or Not?
Advances and Open Challenges in
SE-specific Sentiment Analysis
@NicoleNovielli
Nicole Novielli
University of Bari, Italy
Collaborative Development Group
My research
• Human Aspects in Software
Engineering
• Affective Computing
• Natural Language Processing
People at Collab
Faculty
– Filippo Lanubile
– Fabio Calefato
– Nicole Novielli
collab.di.uniba.it
PhD Students
– Luigi Quaranta
– Daniela Girardi
Graduate students and final-year undergraduates
Visiting Professors and
Researchers
Sentiment analysis and analysis of biometrics (e.g., the BrainLink EEG headset for monitoring brain activity)
Emotional Awareness in SE
Acknowledgements
Alexander Serebrenik, TU/e
Walid Maalej, Uni. Hamburg
Mika Mäntylä, Uni. Oulu
Davide Fucci, BTH
Viviana Patti, Uni. Turin
Valerio Basile, Uni. Turin
Malvina Nissim, Uni. Groningen
Danilo Croce, Tor Vergata
…
Emotion awareness in Software Engineering
Emotions in social media
To Label or Not? Advances and Open Challenges in SE-specific Sentiment Analysis
To label or not?
Does SE-specific tuning enhance sentiment analysis?
Do we need a theoretical model of affect?
What size for the gold standard?
Can we reuse SE-specific tools off-the-shelf?
Sentiment analysis for software engineering
Collaborative software development and knowledge-sharing
– Correlation with issue-fixing time (Ortu et al., MSR 2015)
– Early burnout discovery (Mäntylä et al., MSR 2015)
– Anger detection (Gachechiladze et al., ICSE-NIER 2017)
– Empirically-driven guidelines for question writing (Calefato et al., IST 2018)
Recommender systems
– Pattern-based mining of opinions in Q&A websites (Lin et al., ICSE 2019)
– Opinion search and summarization for APIs (Uddin and Khomh, ASE 2017)
Requirements engineering
– User feedback (Guzman and Maalej, RE‘14)
– App improvement (Panichella et al., ICSME ‘14)
Actionable insights for
More…
General-purpose tools

• Supervised learning, bag-of-words (http://text-processing.com/)
  Output: probabilities p(positive), p(negative), p(neutral)
  Validated on: movie reviews, tweets
• Supervised learning (http://nlp.stanford.edu/sentiment/)
  Output: sentiment score in [0,4] (0 = very negative, 2 = neutral, 4 = very positive)
  Validated on: movie reviews
• Lexicon-based, with dictionaries of a priori polarity scores in [-5,5] (http://sentistrength.wlv.ac.uk/)
  Output: sentiment scores, negative in [-5,-1], positive in [1,5], neutral in (-1,1)
  Validated on: social media (YouTube, Twitter, MySpace, …)
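As a toy illustration of the lexicon-based approach in the last entry above, the sketch below scores a text against a tiny hand-made dictionary with a priori polarity scores in [-5, 5]; the dictionary, thresholds, and scoring rule are simplified assumptions, not SentiStrength's actual algorithm.

```python
# Toy lexicon-based scorer (illustration only; not SentiStrength's algorithm).
LEXICON = {"love": 3, "good": 2, "great": 3, "problem": -2, "wrong": -2, "hate": -4}

def lexicon_polarity(text):
    """Return (positive_score, negative_score, label) for a text."""
    scores = [LEXICON.get(tok.strip(".,!?").lower(), 0) for tok in text.split()]
    pos = max([s for s in scores if s > 0], default=0)   # strongest positive word
    neg = min([s for s in scores if s < 0], default=0)   # strongest negative word
    if pos >= abs(neg) and pos >= 1:
        label = "positive"
    elif abs(neg) > pos and neg <= -1:
        label = "negative"
    else:
        label = "neutral"
    return pos, neg, label

print(lexicon_polarity("I have a problem, please explain what is wrong"))
print(lexicon_polarity("This works great, thanks"))
```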
Need for SE-specific tools
Are off-the-shelf sentiment analysis tools
reliable for software engineering research?
• Poor performance on technical texts
• They disagree with each other
• Disagreement can lead to diverging conclusions
Contextual semantics
‘I am missing a parenthesis. But where?’
N. Novielli, F. Calefato, F. Lanubile. “The Challenges of Sentiment Detection in the Social Programmer Ecosystem” (SSE’15)
Domain-dependent semantics
‘What is the best way to kill a critical
process?’
Platform-specific use of non-neutral lexicon
‘I have a problem, […] please explain what is
wrong’
(a neutral text classified as negative)
Why do general-purpose tools fail?
Representing textual content in a DSM
Training the Distributional Semantic Model (DSM) from unlabeled posts
Words as vectors
Comparison using cosine similarity
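A minimal sketch of the idea, assuming a generic word2vec-style model trained on unlabeled posts (here via gensim on a toy corpus); the corpus and hyperparameters are placeholders for the millions of real posts a DSM would be trained on.

```python
# Sketch: train a small word-vector model on unlabeled text and compare words
# by cosine similarity. The toy corpus stands in for millions of real posts.
from gensim.models import Word2Vec

corpus = [
    "how do i kill a process that hangs".split(),
    "the parent process should terminate the child process".split(),
    "save the file before you close the editor".split(),
    "store the result and save it to disk".split(),
] * 50  # repeat so the toy model has something to learn from

model = Word2Vec(sentences=corpus, vector_size=50, window=3, min_count=1, epochs=20, seed=1)

# Cosine similarity between word vectors; nearest neighbours of a target word.
print(model.wv.similarity("kill", "terminate"))
print(model.wv.most_similar("save", topn=3))
```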
Domain-dependent semantics
Nearest neighbors of the target word “kill” in three DSMs: Stack Overflow (20M posts), Wikipedia (English abstracts), 50M tweets
kill kill kill
kills decapitate marry
killing kills die
terminate kidnap destroy
dies destroy embarrass
killed incapacitate beat
spawn behead tell
spawned assassinate punch
terminates disarm ruin
quits overpower leave
suspends killing drown
quit imprison stab
suspend betray give
respawn abduct replace
re-launch seduce cry
died injure strangle
exits steal burn
shutdown subdue hurt
Nearest neighbors of the target word “save” in three DSMs: Stack Overflow (20M posts), Wikipedia (English abstracts), 50M tweets
save save save
saves saved bring
saving saving give
saved saves donate
store recover make
resave protect protect
persist restore fix
re-save resurrect saving
retrieve destroy send
retrive banish add
re-store redeem raise
strore steal saves
restore evict remove
re-load erase get
store/retrieve eliminate collect
upload bring buy
save/load hide pick
reupload wipe use
Need for domain-specific tools
SE-specific sentiment analysis tools
• Senti4SD[1]
• SentiCR[2]
• SentiStrength-SE[3]
• DEVA[4]
Supervised learning
Lexicon-based
[1] F. Calefato, F. Lanubile, F. Maiorano, N. Novielli. Sentiment Polarity Detection for Software Development. EMSE, 2017
[2] T. Ahmed, A. Bosu, A. Iqbal, and S. Rahimi. SentiCR: a customized sentiment analysis tool for code review interactions, ASE 2017.
[3] M.D.R. Islam and M.F. Zibran, Leveraging automated sentiment analysis in software engineering, MSR 2017.
[4] M.D.R. Islam and M.F. Zibran, DEVA: sensing emotions in the valence arousal space in software engineering text, SAC 2018.
More…
Collab Emotion Mining Toolkit
Senti4SD allows customization
• Retrain classification model from gold standard
• Use different sentiment lexicons
– E.g., you could bootstrap a sentiment lexicon from your data (Mäntylä et al., MSR’17); a toy sketch follows below
• Enable/disable feature sets
M. V. Mäntylä, N. Novielli, F. Lanubile, M. Claes, and M. Kuutila. Bootstrapping a lexicon for emotional arousal in software
engineering. MSR '17
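One simple way to bootstrap a custom polarity lexicon from your own labeled data, for illustration only and not necessarily the approach of Mäntylä et al.: score each word by the smoothed log-odds of occurring in positive versus negative texts.

```python
# Illustrative lexicon bootstrapping from a labeled gold standard (assumption:
# log-odds of word occurrence in positive vs. negative texts, add-one smoothing).
import math
from collections import Counter

positive_texts = ["thanks this works great", "good catch, nice solution"]
negative_texts = ["this is wrong and ugly", "bad idea, it breaks everything"]

pos_counts = Counter(w for t in positive_texts for w in t.lower().split())
neg_counts = Counter(w for t in negative_texts for w in t.lower().split())
vocab = set(pos_counts) | set(neg_counts)

# Positive scores lean positive, negative scores lean negative.
lexicon = {w: math.log((pos_counts[w] + 1) / (neg_counts[w] + 1)) for w in vocab}
print(sorted(lexicon.items(), key=lambda kv: kv[1], reverse=True)[:5])
```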
Developing Senti4SD
Supervised learning
• Support Vector Machines
– #features >> #text items
• Three feature sets:
– Keyword-based
– Semantic features (distributional semantic model)
– Sentiment lexicon-based
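A minimal sketch of this kind of pipeline, assuming scikit-learn; the lexicon and the "semantic" features are toy stand-ins, not Senti4SD's actual feature extractors.

```python
# Illustrative sketch: a supervised polarity classifier combining three feature
# families (keyword-, semantic-, and lexicon-based), in the spirit of Senti4SD.
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.svm import LinearSVC

POSITIVE = {"good", "great", "thanks", "happy"}   # toy lexicon (assumption)
NEGATIVE = {"problem", "wrong", "error", "worry"}

class LexiconFeatures(BaseEstimator, TransformerMixin):
    """Counts of positive/negative lexicon words per document."""
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        rows = []
        for text in X:
            tokens = text.lower().split()
            rows.append([sum(t in POSITIVE for t in tokens),
                         sum(t in NEGATIVE for t in tokens)])
        return np.array(rows, dtype=float)

class ToyDSMFeatures(BaseEstimator, TransformerMixin):
    """Stand-in for DSM word vectors: hashes tokens into a small dense vector."""
    def __init__(self, dim=16):
        self.dim = dim
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        rows = []
        for text in X:
            v = np.zeros(self.dim)
            for t in text.lower().split():
                v[hash(t) % self.dim] += 1.0
            rows.append(v / max(len(text.split()), 1))
        return np.array(rows)

model = Pipeline([
    ("features", FeatureUnion([
        ("keywords", TfidfVectorizer()),      # keyword-based
        ("semantic", ToyDSMFeatures()),       # semantic (DSM) stand-in
        ("lexicon", LexiconFeatures()),       # sentiment-lexicon-based
    ])),
    ("svm", LinearSVC()),                     # SVM copes with #features >> #documents
])

texts = ["Thanks, the code looks good", "I have a problem, please explain what is wrong"]
labels = ["positive", "negative"]
model.fit(texts, labels)
print(model.predict(["This is great, no problem at all"]))
```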
Reducing the negative bias
As per your code, the application was completely killed
NEUTRAL
Reducing strong disagreement
So u need not to worry!
POSITIVE
Does SE-specific tuning enhance sentiment analysis?
To Label or Not? Advances and Open Challenges in SE-specific Sentiment Analysis
SE-specific sentiment analysis tools
• Senti4SD (Calefato et al. EMSE 2017)
• SentiCR(Ahmed et al., ASE ‘17)
• SentiStrength-SE (Islam and Zibran, MSR’17)
Supervised
Lexicon-based
F. Calefato, F. Lanubile, F. Maiorano, N. Novielli. Sentiment Polarity Detection for Software Development. EMSE, 2017
T. Ahmed, A. Bosu, A. Iqbal, and S. Rahimi. SentiCR: a customized sentiment analysis tool for code review interactions, ASE 2017.
M.D.R. Islam and M.F. Zibran, Leveraging automated sentiment analysis in software engineering, MSR 2017.
Our replication
Research questions (original study → our replication)
RQ1: Do different sentiment analysis tools agree with the emotions of software developers? → Do SE-specific sentiment analysis tools agree with the emotions of software developers?
RQ2: Do sentiment analysis tools agree with each other? → Do SE-specific sentiment analysis tools agree with each other?
Off-the-shelf (original study): NLTK, Stanford NLP, Alchemy API, SentiStrength
SE-specific (our replication): Senti4SD (Calefato et al., EMSE 2017), SentiCR (Ahmed et al., ASE ’17), SentiStrength-SE (Islam and Zibran, MSR ’17), plus SentiStrength as a baseline
Our replication
Gold standard datasets
392 comments
(Murgia et al., MSR’14)
5869 comments
(Ortu et al., MSR’16)
4423 Qs, As, Cs
(Calefato et al., EMSE 2017)
Experimental setting
Gold standard datasets → stratified sampling → Train (70%) / Test (30%)
Training of supervised tools on the train set: Senti4SD (updated model), SentiCR (updated model)
Assessment of performance on the test set: retrained Senti4SD and SentiCR, plus SentiStrength-SE and SentiStrength (no retraining)
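A minimal sketch of the stratified 70/30 split, assuming scikit-learn; the toy texts and labels stand in for a real gold standard.

```python
# Minimal sketch of the stratified 70/30 split used before retraining the
# supervised tools; 'texts' and 'labels' stand in for a gold-standard dataset.
from collections import Counter
from sklearn.model_selection import train_test_split

texts = ["great answer, thanks"] * 30 + ["this is wrong"] * 15 + ["see the docs"] * 55
labels = ["positive"] * 30 + ["negative"] * 15 + ["neutral"] * 55

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.30, stratify=labels, random_state=42)

# Stratification keeps the polarity distribution of the full dataset in both splits.
print(Counter(y_train), Counter(y_test))
```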
Weighted Cohen’s Kappa
Disagreement weights: strong vs. mild

           negative  neutral  positive
negative      0         1        2
neutral       1         0        1
positive      2         1        0
Interpretation (Viera and Garrett, 2005)
• less than chance κ ≤ 0
• slight if 0.01 ≤ κ ≤ 0.20
• fair if 0.21 ≤ κ ≤ 0.40
• moderate if 0.41 ≤ κ ≤ 0.60
• substantial if 0.61 ≤ κ ≤ 0.80
• almost perfect if 0.81 ≤ κ ≤ 1
A.J. Viera, J.M. Garrett. 2005. Understanding interobserver agreement: the kappa statistic. Family Medicine, 37,5, 360–363
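For reference, scikit-learn's cohen_kappa_score with linear weights reproduces the 0/1/2 disagreement costs above once the labels are placed on an ordinal scale; the label lists below are made-up examples.

```python
# Sketch: weighted Cohen's kappa with linear weights, which reproduce the
# 0/1/2 costs above for the ordered labels negative < neutral < positive.
from sklearn.metrics import cohen_kappa_score

tool = ["positive", "neutral", "negative", "neutral", "positive"]
gold = ["positive", "negative", "negative", "neutral", "neutral"]

# Map labels to an ordinal scale so 'linear' weighting penalises
# positive<->negative confusions (cost 2) more than mild ones (cost 1).
scale = {"negative": 0, "neutral": 1, "positive": 2}
k = cohen_kappa_score([scale[x] for x in tool],
                      [scale[x] for x in gold],
                      weights="linear")
print(round(k, 2))
```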
RQ1: Do SE-specific sentiment analysis tools agree with the emotions of software developers?
SE-specific tools vs. manual annotation: substantial agreement
General-purpose tools vs. manual annotation: fair agreement
SE-Specific tools
vs. manual annotation
RQ1: Do SE-specific sentiment analysis tools
agree with emotions of software developers?
• SE-specific optimization
improves the classification
accuracy
• Retraining supervised tools
produces better performance
Our replication
vs. manual annotation
RQ1: Do SE-specific sentiment analysis tools
agree with emotions of software developers?
• SE-specific optimization
improves the classification
accuracy
• Retraining supervised tools
produces better performance
• Comparable performance for
SentiStrength-SE (lexicon-
based)
RQ2: Do SE-specific sentiment analysis tools agree with each other?
SE-specific tools: from substantial to perfect agreement
General-purpose tools: from less than chance to fair agreement
Does SE-specific tuning enhance sentiment analysis?
To Label or Not? Advances and Open Challenges in SE-specific Sentiment Analysis
“I'm happy with the
approach and the
code looks good”
“I will come over to
your work and slap
you”
“This code has problems”
Polar facts
“This is not ideal, why not trying to find another solution?”
Opinions ≠ Emotions
Affective States
Affective states can be characterized by their duration and intensity.
Scherer, 1984. Emotion as a Multicomponent Process: A model and some cross-cultural data. In P. Shaver, ed., Review of Personality and Social Psychology 5: 37–63.
Do we need a theoretical model
of affect?
To what extent does the labeling approach impact the performance of SE-specific sentiment analysis tools?
392 comments
(Murgia et al., MSR’14)
5869 comments
(Ortu et al., MSR’16)
4423 Qs, As, Cs
(Calefato et al., EMSE 2017)
Model-driven annotation
1500 sentences QA on
Java libraries (Lin et al., ICSE’18)
1600 comments from
code review (Ahmed et al., ASE’17)
Ad-hoc annotation
Model-driven annotation of emotions (Shaver et al., 1987)

Emotion       Original study   Our replication
Love          Positive         Positive
Joy           Positive         Positive
Surprise      Positive         Ambiguous
Anger         Negative         Negative
Sadness       Negative         Negative
Fear          Negative         Negative
No emotion    Neutral          Neutral
Mapping emotions to polarity
I'm happy with the approach and
the code looks good
Joy -> Positive Polarity
Joy Happiness Satisfaction
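A small sketch of this mapping in code; how conflicting emotion labels on the same comment are collapsed ("mixed" below) is an assumption for illustration, not necessarily the rule used in the studies.

```python
# Sketch: model-driven labeling, mapping Shaver-style basic emotions to polarity
# using the table above ('surprise' treated as ambiguous in the replication).
EMOTION_TO_POLARITY = {
    "love": "positive", "joy": "positive", "surprise": "ambiguous",
    "anger": "negative", "sadness": "negative", "fear": "negative",
}

def polarity_label(annotated_emotions):
    """Collapse the emotions annotated on a comment into a single polarity label."""
    if not annotated_emotions:
        return "neutral"                     # no emotion annotated -> neutral
    polarities = {EMOTION_TO_POLARITY.get(e.lower(), "ambiguous")
                  for e in annotated_emotions}
    return polarities.pop() if len(polarities) == 1 else "mixed"

print(polarity_label(["joy"]))           # positive
print(polarity_label([]))                # neutral
print(polarity_label(["joy", "fear"]))   # mixed (assumption)
```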
To what extent does the labeling approach impact the performance of SE-specific sentiment analysis tools?
Model-driven annotation vs. ad-hoc annotation
RQ3: To what extent does the labeling approach impact the performance of SE-specific sentiment analysis tools?
Model-driven annotation: from substantial to perfect agreement, also between supervised and lexicon-based tools
Ad-hoc annotation: from fair to moderate agreement; better agreement for supervised approaches
Error analysis
Polar facts but neutral sentiment
‘I tried the following and it returns nothing’
---
‘This creates an unnecessary garbage list.
Sets.newHashSet should accept an Iterable.’
Do we need a theoretical model
of affect?
Reliable sentiment analysis in SE is possible
✓ Tuning of tools for software engineering improves classification accuracy
✓ SE-specific tools agree with manual annotation
✓ SE-specific tools agree with each other
Can we use SE-specific tools off the shelf?
Within-platform: MSR 2018
Cross-platform: MSR 2020
Model-driven annotation
Gold standard datasets:
• Jira: 5,869 comments (Ortu et al., MSR’16)
• Stack Overflow: 4,423 questions, answers, and comments (Calefato et al., EMSE 2018)
• GitHub: 7,122 comments (https://doi.org/10.6084/m9.figshare.11604597)
Polarity classes: neutral, positive, negative
Cross-platform setting
Train/test pairs over GitHub, Stack Overflow, and Jira: each supervised tool is retrained on one platform and tested on every platform; the diagonal (GitHub–GitHub, Stack Overflow–Stack Overflow, Jira–Jira) is the within-platform setting.
For each gold standard dataset: stratified sampling into Train (70%) and Test (30%).
Training of supervised tools: Senti4SD (updated model), Senti4SD without bag-of-words (updated model), SentiCR (updated model).
Assessment of performance on the test sets, including the lexicon-based tools SentiStrength-SE and DEVA (no retraining).
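A sketch of this protocol under simplifying assumptions: the classifier is a plain TF-IDF plus SVM stand-in (not Senti4SD or SentiCR), and the datasets are toy placeholders for the real gold standards.

```python
# Sketch of the cross-platform protocol: train on each platform's 70% split and
# evaluate macro-averaged F1 on every platform's 30% split (diagonal = within-platform).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

def toy_gold_standard(platform):
    """Toy stand-in for a labeled dataset (assumption); replace with real data."""
    pos = [f"{platform} example {i}: thanks, this works great" for i in range(6)]
    neg = [f"{platform} example {i}: this is broken and wrong" for i in range(6)]
    return pos + neg, ["positive"] * 6 + ["negative"] * 6

platforms = ["github", "stackoverflow", "jira"]
splits = {}
for p in platforms:
    X, y = toy_gold_standard(p)
    splits[p] = train_test_split(X, y, test_size=0.30, stratify=y, random_state=42)

for train_p in platforms:
    X_tr, _, y_tr, _ = splits[train_p]
    model = make_pipeline(TfidfVectorizer(), LinearSVC()).fit(X_tr, y_tr)
    for test_p in platforms:
        _, X_te, _, y_te = splits[test_p]
        f1 = f1_score(y_te, model.predict(X_te), average="macro")
        print(f"train={train_p:13s} test={test_p:13s} macro-F1={f1:.2f}")
```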
Cross-platform performance of supervised tools, F-measure (macro-average)

Test on GitHub:
• Train on GitHub (within-platform): Senti4SD = .92, SentiCR = .82
• Train on Stack Overflow: Senti4SD = .80 (-.12), S4SD noBoW = .82 (-.10), SentiCR = .65 (-.17)
• Train on Jira: Senti4SD = .66 (-.26), S4SD noBoW = .69 (-.23), SentiCR = .52 (-.30)

Test on Stack Overflow:
• Train on Stack Overflow (within-platform): Senti4SD = .87, SentiCR = .82
• Train on GitHub: Senti4SD = .77 (-.10), S4SD noBoW = .80 (-.07), SentiCR = .69 (-.13)
• Train on Jira: Senti4SD = .65 (-.22), S4SD noBoW = .65 (-.22), SentiCR = .46 (-.36)

Test on Jira:
• Train on Jira (within-platform): Senti4SD = .77, SentiCR = .80
• Train on GitHub: Senti4SD = .73 (-.04), S4SD noBoW = .71 (-.06), SentiCR = .77 (-.03)
• Train on Stack Overflow: Senti4SD = .61 (-.16), S4SD noBoW = .71 (-.06), SentiCR = .49 (-.31)

Excluding bag-of-words features (S4SD noBoW) reduces the drop: a shift in lexical semantics across platforms.
The drop observed in every cross-platform setting suggests the need to narrow down the domain to the level of the specific platform.
Gold standard datasets
• Jira: 5,869 comments (Ortu et al., MSR’16)
• Stack Overflow: 4,423 questions, answers, and comments (Calefato et al., EMSE 2018)
• GitHub: 7,122 comments
Polarity classes: neutral, positive, negative
The datasets have different label distributions: Jira is skewed towards the neutral class.
Interrater agreement ranges from substantial (Stack Overflow) to almost perfect (GitHub), reflecting a shared conceptualization of affect and labeling guidelines in line with the research goals.
Comparison with lexicon-based tools (no retraining)
• Test on GitHub: SentiStrength-SE = .80, DEVA = .73
• Test on Stack Overflow: SentiStrength-SE = .79, DEVA = .78
• Test on Jira: SentiStrength-SE = .78, DEVA = .72
Within-platform, supervised retraining yields the best accuracy (Senti4SD = .92 on GitHub, .87 on Stack Overflow; SentiCR = .80 on Jira); in the cross-platform settings reported above, the lexicon-based tools can outperform the retrained supervised ones, as in the case of Jira.
Can we use SE-specific tools off the shelf?
To label or not?
GitHub and Stack Overflow
• Retraining is convenient compared to the performance of lexicon-based tools
• Nearly optimal performance is obtained with a minimal set of about 1,200 documents
Jira
• Retraining is beneficial only if a larger set of about 1,600 documents is available
• Negligible improvement over the SentiStrength-SE performance
• SMOTE doesn’t help with the class imbalance
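The gold-standard-size question can be explored with a simple learning-curve loop, sketched below with a toy dataset and a TF-IDF plus SVM stand-in for the supervised tools; the subset sizes and data are illustrative assumptions.

```python
# Sketch of the learning-curve analysis: retrain on increasingly larger training
# subsets and track macro-F1 on a fixed 30% test split.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

def toy_dataset():
    """Toy stand-in for a platform gold standard (assumption)."""
    texts = (["thanks, works great"] * 200 + ["this is broken and wrong"] * 200
             + ["see the documentation"] * 200)
    labels = ["positive"] * 200 + ["negative"] * 200 + ["neutral"] * 200
    return texts, labels

X, y = toy_dataset()
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42)

for n in (200, 400, len(X_train)):          # incrementally larger training sets
    if n < len(X_train):
        X_sub, _, y_sub, _ = train_test_split(
            X_train, y_train, train_size=n, stratify=y_train, random_state=42)
    else:
        X_sub, y_sub = X_train, y_train
    model = make_pipeline(TfidfVectorizer(), LinearSVC()).fit(X_sub, y_sub)
    print(n, round(f1_score(y_test, model.predict(X_test), average="macro"), 2))
```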
Lessons learned
✓ Use SE-specific tools
✓ Use training data from your target platform
– Platform-specific tuning enhances accuracy
– Build a robust, well-balanced gold standard with good interrater agreement
✓ If you cannot retrain, perform a sanity check
– Rule-based tools offer comparable performance
✓ Define your goal first, then select the tool
– The choice depends on the research goals: polarity vs. fine-grained emotions, emotions vs. attitudes, etc.
@NicoleNovielli
nicole.novielli@uniba.it
http://collab.di.uniba.it/nicole/ | http://collab.di.uniba.it/
Editor's Notes
  • #3: My research focuses on human aspects in software engineering. Within the group, I leverage my background in affective computing, emotion recognition, and natural language processing to study human aspects in software engineering, both at the individual and at the team level.
  • #5: In my research I focus on language analysis, using sentiment analysis on developers’ communication traces on several platforms, such as GitHub, Stack Overflow, Jira, etc. More recently, our group started investigating the role of emotion in software development by running user studies aimed at assessing the correlation between self-reported emotion and perceived productivity during software development activities. Specifically, we are also building classifiers that detect positive and negative emotion, as well as high or low emotional activation, based on biometric sensors that measure heart-related metrics, EEG (brain activity), and galvanic skin response (the skin’s electrical conductance).
  • #10: This research fits in the vein of a research trend that emerged in recent years as researchers became interested in the use of sentiment analysis for software engineering applications. For example, correlation between emotions and productivity was investigated in issue tracking systems. Other studies investigate the possibility of implementing early detection of anger and burnout to prevent undesired turnover. The role and impact of emotions in collaborative knowledge sharing was also studied on Stack Overflow. Also, mining of opinions about APIs from developer-generated content has been leveraged for building recommender systems. Similarly, in the field of requirements engineering, positive and negative sentiment in users’ feedback might be used to inform prioritization of development tasks by suggesting new features or enabling identification of bugs based on the users’ complaints.
  • #11: Early studies in this field (and when I say early I mean studies published about five years ago) used general-purpose sentiment analysis tools. For example, NLTK has sentiment analysis libraries whose classification models were fine-tuned using movie reviews or tweets. Movie reviews were also used by the Stanford researchers to train their deep-learning model. Other than supervised approaches, there are also tools relying on lexicon-based heuristics, such as SentiStrength, which leverages a set of rules that compute sentiment scores based on the presence of positive or negative sentiment words in the text. The performance of SentiStrength was validated on six manually annotated datasets containing texts from social media, including YouTube comments, Twitter, MySpace texts, and so on.
  • #13: The reliability of general-purpose tools was questioned by benchmark studies showing how they perform poorly on technical texts and they disagree with each other, inducing a threat to conclusion validity for empirical software engineering studies. This is kind of expected as general-purpose tools are trained on corpora collected in different domains, such as movie reviews, newspaper, social media comments, and so on.
  • #14: So why do general-purpose tools fail? In search of an answer to this question, we manually investigated 800 questions, answers, and comments from Stack Overflow, with specific focus on texts that were misclassified by SentiStrength, and found that the main causes of errors are the inability of tools to deal with contextual semantics and the fact that the polarity of the lexicon might depend on the domain.
  • #19: Trying to overcome the limitations posed by general-purpose tools, researchers started to develop software engineering-specific tools for sentiment analysis. Two main approaches prevailed: on one hand, tools such as Senti4SD or SentiCR rely on supervised machine learning, combining different lexical features extracted from the raw text in a manually annotated gold standard. On the other hand, tools such as SentiStrength-SE and DEVA rely on a lexicon-based approach, which means that they implement rules and heuristics based on the presence of positive and negative sentiment words as listed in sentiment vocabularies. So these tools cannot be retrained using a gold standard.
  • #20: We also developed our own tool for sentiment polarity detection, which you can download and re-train using your own custom gold standard dataset.
  • #25: Misclassification of positive posts as negative occurs in 6.6% of the cases when the classification is performed with SentiStrength (see TABLE 9). This is what we consider a strong disagreement that should be avoided. Senti4SD reduces such misclassification to 2.4% of the cases. For example, a sentence like ‘Is in so u need not worry! Internally the data is always stored as TEXT, so even if you create table with, SQLite is going to follow the rules of data type’ is erroneously classified by SentiStrength as negative due to the presence of negative lexicon (‘worry’), even if SentiStrength is supposed to correctly deal with negations (Thelwall et al. 2012), which should determine polarity inversion.
  • #26: So, the question now is: given the lesson learned here, that platform-specific training is required, how many texts do we need to label to reliably retrain classifiers?
  • #34: Better performance for supervised approaches
  • #35: Better performance for supervised approaches
  • #37: So, the question now is: given the lesson learned here, that platform-specific training is required, how many texts do we need to label to reliably retrain classifiers?
  • #47: So, the question now is: given the lesson learned here, that platform-specific training is required, how many texts do we need to label to reliably retrain classifiers?
  • #52: Manual inspection of texts misclassified by all tools reveals that polar facts are the most common cause of misclassification and that they are mainly observed in ad hoc annotation datasets.
  • #53: So, the question now is: given the lesson learned here, that platform-specific training is required, how many texts do we need to label to reliably retrain classifiers?
  • #54: And indeed, we found that tuning of tools for software engineering improves the classification accuracy. Specifically, we observed that SE-specific tuning enhances the agreement of tools with manual annotation and with each other.
  • #55: A question remains open, related to the off-the-shelf use of such tools. So the question we address in this study is whether we can use software engineering-specific tools for sentiment analysis off the shelf, without retraining them. Manual labeling is time consuming and gold standard datasets are not always available. So if I don’t have a manually annotated gold standard from the platform I want to study, can I safely reuse these tools off the shelf, without retraining?
  • #56: Specifically, we formulate three research questions to check if tools agree with manual annotation and with each other in a cross-platform setting, that is, in the absence of a gold standard for retraining.
  • #57: We reused the Jira and Stack Overflow datasets we already leveraged in our previous benchmark, and we specifically developed a third dataset with GitHub comments for the sake of this study. The GitHub dataset, of course, is available for reuse and can be downloaded from figshare. The three datasets are labeled based on the same emotion model, the Shaver model of emotions. In this model we map emotions such as joy into positive polarity and emotions such as anger into negative polarity. Neutral polarity refers to the absence of emotions. The Stack Overflow and GitHub datasets are well balanced in terms of distribution of polarity labels, while the Jira one includes a prevalence of neutral texts. A difference in interrater agreement is also observed, with GitHub and Stack Overflow showing the highest kappa values, indicating substantial agreement among human raters.
  • #58: And of course we repeated this process for all pairs of datasets, that is, we trained three different models for each tool using the train set of each dataset and then tested on the test set of each dataset, to simulate the off-the-shelf use of such tools. The settings on the diagonal correspond to the within-platform setting, that is, when you have a gold standard from your own target platform and you can retrain.
  • #59: The paper contains all the detailed information regarding performance metrics. Here, for time constraints, I will only discuss the F-measure. Let’s start from supervised tools and GitHub. Let’s assume the target platform in your study is GitHub and you have a gold standard for retraining. In a within-platform condition, we observe very satisfying performance, with Senti4SD achieving an F-measure of .92 and SentiCR .82.
  • #60: The situation changes when training is performed on a different platform: by training on Stack Overflow we observe a drop of .12 and .17 for Senti4SD and SentiCR, respectively. The situation is even worse when training on Jira.
  • #61: Interestingly, we observe that in both cases the drop is reduced when bag-of-words features are excluded from the feature sets. This is in line with the fact that the drop is higher for SentiCR, which only relies on bag of words, while Senti4SD also leverages semantic and lexicon-based features, thus mitigating the drop and reducing overfitting to the training data. This provides evidence of the lower ability of such features to generalize in cross-dataset settings and suggests that the positive and negative lexicon might be platform dependent. Furthermore, this indicates that shifts in lexical semantics might occur across platforms due to platform-specific lingo or communication style.
  • #62: This drop in performance is observed for all settings, thus reinforcing the intuition that the definition of domain, which is currently identified with the broad field of software engineering, should probably be narrowed down to the specific platform.
  • #63: In particular, we observe the highest drop when the Jira dataset is used for training. This indicates a higher ‘distance’ of the Jira gold standard from the other two, which might be potentially explained in several ways. First of all, the lexical distance we just discussed might be an explanation.
  • #64: Also, this could be due to the difference in label distributions, which induces a bias towards the majority class for the Jira dataset, which is imbalanced in favor of the neutral class.
  • #65: Another explanation might lie in the robustness of the gold standard in terms of interrater agreement, which is substantial to almost perfect for Stack Overflow and GitHub. Furthermore, the labeling of these two datasets was coordinated by researchers at the Collab research lab at the University of Bari, who hold a shared understanding of the conceptual model of emotions and who defined the guidelines for annotation according to their research goals.
  • #66: As for the comparison between supervised and lexicon-based tools, we can see that in a within-platform condition supervised retraining enables achieving the best accuracy.
  • #67: However, this is not necessarily true if the training is performed in a cross-platform setting, with lexicon-based tools outperforming supervised ones in some of the studied conditions, as in the case of Jira.
  • #69: So, the question now is: given the lesson learned here, that platform-specific training is required, how many texts do we need to label to reliably retrain classifiers?
  • #70: To answer this question, we retrained the supervised tools in a within-platform condition using incrementally bigger training sets and testing on the same 30% test partition used so far for the other experiments. Here are the learning curves for the GitHub and Stack Overflow datasets. We also report the performance of lexicon-based tools for reference, which of course does not change as we cannot retrain them. We observe that retraining is convenient compared to the performance of lexicon-based tools, and nearly optimal performance is obtained already with a minimal set of about 1,200 documents for GitHub and Stack Overflow.
  • #71: A different situation is observed for Jira, which is unbalanced. In this case retraining is beneficial only if a larger set of about 1,600 documents is available, but the improvement over the lexicon-based tool SentiStrength-SE is negligible. This might be due to the fact that SentiStrength-SE has been fine-tuned by the authors using a subset of the Jira dataset we are using in this study. Another explanation might be the unbalanced polarity label distribution, which we didn’t manage to address successfully even when techniques such as SMOTE are applied, as you can see from the bottom orange line in the chart.