To Label or Not?
Advances and Open Challenges in
SE-specific Sentiment Analysis
@NicoleNovielli
Nicole Novielli
University of Bari, Italy
Collaborative Development Group
My research
• Human Aspects in Software
Engineering
• Affective Computing
• Natural Language Processing
People at Collab
Faculty
– Filippo Lanubile
– Fabio Calefato
– Nicole Novielli
collab.di.uniba.it
PhD Students
– Luigi Quaranta
– Daniela Girardi
Graduate students and final-year undergraduates
Visiting Professors and
Researchers
Sentiment analysis and analysis of biometrics (e.g., the BrainLink EEG headset for monitoring brain activity)
Emotional Awareness in SE
Acknowledgements
Alexander Serebrenik, TU/e
Walid Maalej, Uni. Hamburg
Mika Mäntylä, Uni. Oulu
Davide Fucci, BTH
Viviana Patti, Uni. Turin
Valerio Basile, Uni. Turin
Malvina Nissim, Uni. Groningen
Danilo Croce, Tor Vergata
…
Emotion awareness in Software Engineering
Emotions in social media
To Label or Not? Advances and Open Challenges in SE-specific Sentiment Analysis
To label or not?
Does SE-specific tuning enhance sentiment analysis?
Do we need a theoretical model of affect?
What size for the gold standard?
Can we reuse SE-specific tools off-the-shelf?
Sentiment analysis for software engineering
Collaborative software development and knowledge-sharing
– Correlation with issue-fixing time (Ortu et al., MSR 2015)
– Early burnout discovery (Mäntylä et al., MSR 2015)
– Anger detection (Gachechiladze et al., ICSE-NIER 2017)
– Empirically-driven guidelines for question writing (Calefato et al., IST 2018)
Recommender systems
– Pattern-based mining of opinions in Q&A websites (Lin et al., ICSE 2019)
– Opinion search and summarization for APIs (Uddin and Khomh, ASE 2017)
Requirements engineering
– User feedback (Guzman and Maalej, RE‘14)
– App improvement (Panichella et al., ICSME ‘14)
Actionable insights for
More…
General-purpose tools

• Supervised learning, bag-of-words (http://text-processing.com/)
  Output: probabilities p(positive), p(negative), p(neutral)
  Validated on: movie reviews, tweets
• Supervised learning (http://nlp.stanford.edu/sentiment/)
  Output: sentiment score in [0,4] (0 = very negative, 2 = neutral, 4 = very positive)
  Validated on: movie reviews
• Lexicon-based, with dictionaries of a priori polarity scores in [-5,5] (http://sentistrength.wlv.ac.uk/)
  Output: sentiment scores, negative in [-5,-1], positive in [1,5], neutral in (-1,1)
  Validated on: social media (YouTube, Twitter, MySpace, …)
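As a toy illustration of the lexicon-based approach in the last entry above, the sketch below scores a text against a tiny hand-made dictionary with a priori polarity scores in [-5, 5]; the dictionary, thresholds, and scoring rule are simplified assumptions, not SentiStrength's actual algorithm.

```python
# Toy lexicon-based scorer (illustration only; not SentiStrength's algorithm).
LEXICON = {"love": 3, "good": 2, "great": 3, "problem": -2, "wrong": -2, "hate": -4}

def lexicon_polarity(text):
    """Return (positive_score, negative_score, label) for a text."""
    scores = [LEXICON.get(tok.strip(".,!?").lower(), 0) for tok in text.split()]
    pos = max([s for s in scores if s > 0], default=0)   # strongest positive word
    neg = min([s for s in scores if s < 0], default=0)   # strongest negative word
    if pos >= abs(neg) and pos >= 1:
        label = "positive"
    elif abs(neg) > pos and neg <= -1:
        label = "negative"
    else:
        label = "neutral"
    return pos, neg, label

print(lexicon_polarity("I have a problem, please explain what is wrong"))
print(lexicon_polarity("This works great, thanks"))
```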
Need for SE-specific tools
Are off-the-shelf sentiment analysis tools
reliable for software engineering research?
• Poor performance on technical texts
• They disagree with each other
• Disagreement can lead to diverging conclusions
Contextual semantics
‘I am missing a parenthesis. But where?’
N. Novielli, F. Calefato, F. Lanubile. “The Challenges of Sentiment Detection in the Social Programmer Ecosystem” (SSE’15)
Domain-dependent semantics
‘What is the best way to kill a critical
process?’
Platform-specific use of non-neutral lexicon
‘I have a problem, […] please explain what is
wrong’
(a neutral text classified as negative)
Why do general-purpose tools fail?
Representing textual content in a DSM
Training the Distributional Semantic Model (DSM) from unlabeled posts
Words as vectors
Comparison using cosine similarity
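A minimal sketch of the idea, assuming a generic word2vec-style model trained on unlabeled posts (here via gensim on a toy corpus); the corpus and hyperparameters are placeholders for the millions of real posts a DSM would be trained on.

```python
# Sketch: train a small word-vector model on unlabeled text and compare words
# by cosine similarity. The toy corpus stands in for millions of real posts.
from gensim.models import Word2Vec

corpus = [
    "how do i kill a process that hangs".split(),
    "the parent process should terminate the child process".split(),
    "save the file before you close the editor".split(),
    "store the result and save it to disk".split(),
] * 50  # repeat so the toy model has something to learn from

model = Word2Vec(sentences=corpus, vector_size=50, window=3, min_count=1, epochs=20, seed=1)

# Cosine similarity between word vectors; nearest neighbours of a target word.
print(model.wv.similarity("kill", "terminate"))
print(model.wv.most_similar("save", topn=3))
```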
Domain-dependent semantics
Nearest neighbors of the target word “kill” in three DSMs: Stack Overflow (20M posts), Wikipedia (English abstracts), 50M tweets
kill kill kill
kills decapitate marry
killing kills die
terminate kidnap destroy
dies destroy embarrass
killed incapacitate beat
spawn behead tell
spawned assassinate punch
terminates disarm ruin
quits overpower leave
suspends killing drown
quit imprison stab
suspend betray give
respawn abduct replace
re-launch seduce cry
died injure strangle
exits steal burn
shutdown subdue hurt
Nearest neighbors of the target word “save” in three DSMs: Stack Overflow (20M posts), Wikipedia (English abstracts), 50M tweets
save save save
saves saved bring
saving saving give
saved saves donate
store recover make
resave protect protect
persist restore fix
re-save resurrect saving
retrieve destroy send
retrive banish add
re-store redeem raise
strore steal saves
restore evict remove
re-load erase get
store/retrieve eliminate collect
upload bring buy
save/load hide pick
reupload wipe use
Need for domain-specific tools
SE-specific sentiment analysis tools
• Senti4SD[1]
• SentiCR[2]
• SentiStrength-SE[3]
• DEVA[4]
Supervised learning
Lexicon-based
[1] F. Calefato, F. Lanubile, F. Maiorano, N. Novielli. Sentiment Polarity Detection for Software Development. EMSE, 2017
[2] T. Ahmed, A. Bosu, A. Iqbal, and S. Rahimi. SentiCR: a customized sentiment analysis tool for code review interactions, ASE 2017.
[3] M.D.R. Islam and M.F. Zibran, Leveraging automated sentiment analysis in software engineering, MSR 2017.
[4] M.D.R. Islam and M.F. Zibran, DEVA: sensing emotions in the valence arousal space in software engineering text, SAC 2018.
More…
Collab Emotion Mining Toolkit
Senti4SD allows customization
• Retrain classification model from gold standard
• Use different sentiment lexicons
– E.g., you could bootstrap a sentiment lexicon from your data (Mäntylä et al., MSR’17); a toy sketch follows below
• Enable/disable feature sets
M. V. Mäntylä, N. Novielli, F. Lanubile, M. Claes, and M. Kuutila. Bootstrapping a lexicon for emotional arousal in software
engineering. MSR '17
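One simple way to bootstrap a custom polarity lexicon from your own labeled data, for illustration only and not necessarily the approach of Mäntylä et al.: score each word by the smoothed log-odds of occurring in positive versus negative texts.

```python
# Illustrative lexicon bootstrapping from a labeled gold standard (assumption:
# log-odds of word occurrence in positive vs. negative texts, add-one smoothing).
import math
from collections import Counter

positive_texts = ["thanks this works great", "good catch, nice solution"]
negative_texts = ["this is wrong and ugly", "bad idea, it breaks everything"]

pos_counts = Counter(w for t in positive_texts for w in t.lower().split())
neg_counts = Counter(w for t in negative_texts for w in t.lower().split())
vocab = set(pos_counts) | set(neg_counts)

# Positive scores lean positive, negative scores lean negative.
lexicon = {w: math.log((pos_counts[w] + 1) / (neg_counts[w] + 1)) for w in vocab}
print(sorted(lexicon.items(), key=lambda kv: kv[1], reverse=True)[:5])
```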
Developing Senti4SD
Supervised learning
• Support Vector Machines
– #features >> #text items
• Three feature sets:
– Keyword-based
– Semantic features (distributional semantic model)
– Sentiment lexicon-based
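A minimal sketch of this kind of pipeline, assuming scikit-learn; the lexicon and the "semantic" features are toy stand-ins, not Senti4SD's actual feature extractors.

```python
# Illustrative sketch: a supervised polarity classifier combining three feature
# families (keyword-, semantic-, and lexicon-based), in the spirit of Senti4SD.
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.svm import LinearSVC

POSITIVE = {"good", "great", "thanks", "happy"}   # toy lexicon (assumption)
NEGATIVE = {"problem", "wrong", "error", "worry"}

class LexiconFeatures(BaseEstimator, TransformerMixin):
    """Counts of positive/negative lexicon words per document."""
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        rows = []
        for text in X:
            tokens = text.lower().split()
            rows.append([sum(t in POSITIVE for t in tokens),
                         sum(t in NEGATIVE for t in tokens)])
        return np.array(rows, dtype=float)

class ToyDSMFeatures(BaseEstimator, TransformerMixin):
    """Stand-in for DSM word vectors: hashes tokens into a small dense vector."""
    def __init__(self, dim=16):
        self.dim = dim
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        rows = []
        for text in X:
            v = np.zeros(self.dim)
            for t in text.lower().split():
                v[hash(t) % self.dim] += 1.0
            rows.append(v / max(len(text.split()), 1))
        return np.array(rows)

model = Pipeline([
    ("features", FeatureUnion([
        ("keywords", TfidfVectorizer()),      # keyword-based
        ("semantic", ToyDSMFeatures()),       # semantic (DSM) stand-in
        ("lexicon", LexiconFeatures()),       # sentiment-lexicon-based
    ])),
    ("svm", LinearSVC()),                     # SVM copes with #features >> #documents
])

texts = ["Thanks, the code looks good", "I have a problem, please explain what is wrong"]
labels = ["positive", "negative"]
model.fit(texts, labels)
print(model.predict(["This is great, no problem at all"]))
```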
Reducing the negative bias
As per your code, the application was completely killed
NEUTRAL
Reducing strong disagreement
So u need not to worry!
POSITIVE
Does SE-specific tuning enhance sentiment analysis?
To Label or Not? Advances and Open Challenges in SE-specific Sentiment Analysis
SE-specific sentiment analysis tools
• Senti4SD (Calefato et al. EMSE 2017)
• SentiCR(Ahmed et al., ASE ‘17)
• SentiStrength-SE (Islam and Zibran, MSR’17)
Supervised
Lexicon-based
F. Calefato, F. Lanubile, F. Maiorano, N. Novielli. Sentiment Polarity Detection for Software Development. EMSE, 2017
T. Ahmed, A. Bosu, A. Iqbal, and S. Rahimi. SentiCR: a customized sentiment analysis tool for code review interactions, ASE 2017.
M.D.R. Islam and M.F. Zibran, Leveraging automated sentiment analysis in software engineering, MSR 2017.
Our replication
Research questions (original study → our replication)
RQ1: Do different sentiment analysis tools agree with the emotions of software developers? → Do SE-specific sentiment analysis tools agree with the emotions of software developers?
RQ2: Do sentiment analysis tools agree with each other? → Do SE-specific sentiment analysis tools agree with each other?
Off-the-shelf (original study): NLTK, Stanford NLP, Alchemy API, SentiStrength
SE-specific (our replication): Senti4SD (Calefato et al., EMSE 2017), SentiCR (Ahmed et al., ASE ’17), SentiStrength-SE (Islam and Zibran, MSR ’17), plus SentiStrength as a baseline
Our replication
Gold standard datasets
392 comments
(Murgia et al., MSR’14)
5869 comments
(Ortu et al., MSR’16)
4423 Qs, As, Cs
(Calefato et al., EMSE 2017)
Experimental setting
Gold standard datasets → stratified sampling → Train (70%) / Test (30%)
Training of supervised tools on the train set: Senti4SD (updated model), SentiCR (updated model)
Assessment of performance on the test set: retrained Senti4SD and SentiCR, plus SentiStrength-SE and SentiStrength (no retraining)
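A minimal sketch of the stratified 70/30 split, assuming scikit-learn; the toy texts and labels stand in for a real gold standard.

```python
# Minimal sketch of the stratified 70/30 split used before retraining the
# supervised tools; 'texts' and 'labels' stand in for a gold-standard dataset.
from collections import Counter
from sklearn.model_selection import train_test_split

texts = ["great answer, thanks"] * 30 + ["this is wrong"] * 15 + ["see the docs"] * 55
labels = ["positive"] * 30 + ["negative"] * 15 + ["neutral"] * 55

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.30, stratify=labels, random_state=42)

# Stratification keeps the polarity distribution of the full dataset in both splits.
print(Counter(y_train), Counter(y_test))
```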
Weighted Cohen’s Kappa
Disagreement weights: strong vs. mild

           negative  neutral  positive
negative      0         1        2
neutral       1         0        1
positive      2         1        0
Interpretation (Viera and Garrett, 2005)
• less than chance κ ≤ 0
• slight if 0.01 ≤ κ ≤ 0.20
• fair if 0.21 ≤ κ ≤ 0.40
• moderate if 0.41 ≤ κ ≤ 0.60
• substantial if 0.61 ≤ κ ≤ 0.80
• almost perfect if 0.81 ≤ κ ≤ 1
A.J. Viera, J.M. Garrett. 2005. Understanding interobserver agreement: the kappa statistic. Family Medicine, 37,5, 360–363
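For reference, scikit-learn's cohen_kappa_score with linear weights reproduces the 0/1/2 disagreement costs above once the labels are placed on an ordinal scale; the label lists below are made-up examples.

```python
# Sketch: weighted Cohen's kappa with linear weights, which reproduce the
# 0/1/2 costs above for the ordered labels negative < neutral < positive.
from sklearn.metrics import cohen_kappa_score

tool = ["positive", "neutral", "negative", "neutral", "positive"]
gold = ["positive", "negative", "negative", "neutral", "neutral"]

# Map labels to an ordinal scale so 'linear' weighting penalises
# positive<->negative confusions (cost 2) more than mild ones (cost 1).
scale = {"negative": 0, "neutral": 1, "positive": 2}
k = cohen_kappa_score([scale[x] for x in tool],
                      [scale[x] for x in gold],
                      weights="linear")
print(round(k, 2))
```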
RQ1: Do SE-specific sentiment analysis tools agree with the emotions of software developers?
SE-specific tools vs. manual annotation: substantial agreement
General-purpose tools vs. manual annotation: fair agreement
SE-Specific tools
vs. manual annotation
RQ1: Do SE-specific sentiment analysis tools
agree with emotions of software developers?
• SE-specific optimization
improves the classification
accuracy
• Retraining supervised tools
produces better performance
Our replication
vs. manual annotation
RQ1: Do SE-specific sentiment analysis tools
agree with emotions of software developers?
• SE-specific optimization
improves the classification
accuracy
• Retraining supervised tools
produces better performance
• Comparable performance for
SentiStrength-SE (lexicon-
based)
RQ2: Do SE-specific sentiment analysis tools agree with each other?
SE-specific tools: from substantial to perfect agreement
General-purpose tools: from less than chance to fair agreement
Does SE-specific tuning enhance sentiment analysis?
To Label or Not? Advances and Open Challenges in SE-specific Sentiment Analysis
“I'm happy with the
approach and the
code looks good”
“I will come over to
your work and slap
you”
“This code has problems”
Polar facts
“This is not ideal, why not trying to find another solution?”
Opinions ≠ Emotions
Affective States
Affective states can be characterized by their duration and intensity.
Scherer, 1984. Emotion as a Multicomponent Process: A model and some cross-cultural data. In P. Shaver, ed., Review of Personality and Social Psychology 5: 37–63.
Do we need a theoretical model
of affect?
To what extent does the labeling approach impact the performance of SE-specific sentiment analysis tools?
392 comments
(Murgia et al., MSR’14)
5869 comments
(Ortu et al., MSR’16)
4423 Qs, As, Cs
(Calefato et al., EMSE 2017)
Model-driven annotation
1500 sentences QA on
Java libraries (Lin et al., ICSE’18)
1600 comments from
code review (Ahmed et al., ASE’17)
Ad-hoc annotation
Model-driven annotation of emotions (Shaver et al., 1987)

Emotion       Original study   Our replication
Love          Positive         Positive
Joy           Positive         Positive
Surprise      Positive         Ambiguous
Anger         Negative         Negative
Sadness       Negative         Negative
Fear          Negative         Negative
No emotion    Neutral          Neutral
Mapping emotions to polarity
I'm happy with the approach and
the code looks good
Joy -> Positive Polarity
Joy Happiness Satisfaction
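A small sketch of this mapping in code; how conflicting emotion labels on the same comment are collapsed ("mixed" below) is an assumption for illustration, not necessarily the rule used in the studies.

```python
# Sketch: model-driven labeling, mapping Shaver-style basic emotions to polarity
# using the table above ('surprise' treated as ambiguous in the replication).
EMOTION_TO_POLARITY = {
    "love": "positive", "joy": "positive", "surprise": "ambiguous",
    "anger": "negative", "sadness": "negative", "fear": "negative",
}

def polarity_label(annotated_emotions):
    """Collapse the emotions annotated on a comment into a single polarity label."""
    if not annotated_emotions:
        return "neutral"                     # no emotion annotated -> neutral
    polarities = {EMOTION_TO_POLARITY.get(e.lower(), "ambiguous")
                  for e in annotated_emotions}
    return polarities.pop() if len(polarities) == 1 else "mixed"

print(polarity_label(["joy"]))           # positive
print(polarity_label([]))                # neutral
print(polarity_label(["joy", "fear"]))   # mixed (assumption)
```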
To what extent does the labeling approach impact the performance of SE-specific sentiment analysis tools?
Model-driven annotation vs. ad-hoc annotation
RQ3: To what extent does the labeling approach impact the performance of SE-specific sentiment analysis tools?
Model-driven annotation: from substantial to perfect agreement, also between supervised and lexicon-based tools
Ad-hoc annotation: from fair to moderate agreement; better agreement for supervised approaches
Error analysis
Polar facts but neutral sentiment
‘I tried the following and it returns nothing’
---
‘This creates an unnecessary garbage list.
Sets.newHashSet should accept an Iterable.’
Do we need a theoretical model
of affect?
Reliable sentiment analysis in SE is possible
✓ Tuning of tools for software engineering improves classification accuracy
✓ SE-specific tools agree with manual annotation
✓ SE-specific tools agree with each other
Can we use SE-specific tools off the shelf?
Within-platform: MSR 2018
Cross-platform: MSR 2020
Model-driven annotation
Gold standard datasets:
• Jira: 5,869 comments (Ortu et al., MSR’16)
• Stack Overflow: 4,423 questions, answers, and comments (Calefato et al., EMSE 2018)
• GitHub: 7,122 comments (https://doi.org/10.6084/m9.figshare.11604597)
Polarity classes: neutral, positive, negative
Cross-platform setting
Train/test pairs over GitHub, Stack Overflow, and Jira: each supervised tool is retrained on one platform and tested on every platform; the diagonal (GitHub–GitHub, Stack Overflow–Stack Overflow, Jira–Jira) is the within-platform setting.
For each gold standard dataset: stratified sampling into Train (70%) and Test (30%).
Training of supervised tools: Senti4SD (updated model), Senti4SD without bag-of-words (updated model), SentiCR (updated model).
Assessment of performance on the test sets, including the lexicon-based tools SentiStrength-SE and DEVA (no retraining).
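A sketch of this protocol under simplifying assumptions: the classifier is a plain TF-IDF plus SVM stand-in (not Senti4SD or SentiCR), and the datasets are toy placeholders for the real gold standards.

```python
# Sketch of the cross-platform protocol: train on each platform's 70% split and
# evaluate macro-averaged F1 on every platform's 30% split (diagonal = within-platform).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

def toy_gold_standard(platform):
    """Toy stand-in for a labeled dataset (assumption); replace with real data."""
    pos = [f"{platform} example {i}: thanks, this works great" for i in range(6)]
    neg = [f"{platform} example {i}: this is broken and wrong" for i in range(6)]
    return pos + neg, ["positive"] * 6 + ["negative"] * 6

platforms = ["github", "stackoverflow", "jira"]
splits = {}
for p in platforms:
    X, y = toy_gold_standard(p)
    splits[p] = train_test_split(X, y, test_size=0.30, stratify=y, random_state=42)

for train_p in platforms:
    X_tr, _, y_tr, _ = splits[train_p]
    model = make_pipeline(TfidfVectorizer(), LinearSVC()).fit(X_tr, y_tr)
    for test_p in platforms:
        _, X_te, _, y_te = splits[test_p]
        f1 = f1_score(y_te, model.predict(X_te), average="macro")
        print(f"train={train_p:13s} test={test_p:13s} macro-F1={f1:.2f}")
```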
Cross-platform performance of supervised tools, F-measure (macro-average)

Test on GitHub:
• Train on GitHub (within-platform): Senti4SD = .92, SentiCR = .82
• Train on Stack Overflow: Senti4SD = .80 (-.12), S4SD noBoW = .82 (-.10), SentiCR = .65 (-.17)
• Train on Jira: Senti4SD = .66 (-.26), S4SD noBoW = .69 (-.23), SentiCR = .52 (-.30)

Test on Stack Overflow:
• Train on Stack Overflow (within-platform): Senti4SD = .87, SentiCR = .82
• Train on GitHub: Senti4SD = .77 (-.10), S4SD noBoW = .80 (-.07), SentiCR = .69 (-.13)
• Train on Jira: Senti4SD = .65 (-.22), S4SD noBoW = .65 (-.22), SentiCR = .46 (-.36)

Test on Jira:
• Train on Jira (within-platform): Senti4SD = .77, SentiCR = .80
• Train on GitHub: Senti4SD = .73 (-.04), S4SD noBoW = .71 (-.06), SentiCR = .77 (-.03)
• Train on Stack Overflow: Senti4SD = .61 (-.16), S4SD noBoW = .71 (-.06), SentiCR = .49 (-.31)

Excluding bag-of-words features (S4SD noBoW) reduces the drop: a shift in lexical semantics across platforms.
The drop observed in every cross-platform setting suggests the need to narrow down the domain to the level of the specific platform.
Gold standard datasets
• Jira: 5,869 comments (Ortu et al., MSR’16)
• Stack Overflow: 4,423 questions, answers, and comments (Calefato et al., EMSE 2018)
• GitHub: 7,122 comments
Polarity classes: neutral, positive, negative
The datasets have different label distributions: Jira is skewed towards the neutral class.
Interrater agreement ranges from substantial (Stack Overflow) to almost perfect (GitHub), reflecting a shared conceptualization of affect and labeling guidelines in line with the research goals.
Comparison with lexicon-based tools (no retraining)
• Test on GitHub: SentiStrength-SE = .80, DEVA = .73
• Test on Stack Overflow: SentiStrength-SE = .79, DEVA = .78
• Test on Jira: SentiStrength-SE = .78, DEVA = .72
Within-platform, supervised retraining yields the best accuracy (Senti4SD = .92 on GitHub, .87 on Stack Overflow; SentiCR = .80 on Jira); in the cross-platform settings reported above, the lexicon-based tools can outperform the retrained supervised ones, as in the case of Jira.
Can we use SE-specific tools off the shelf?
To label or not?
GitHub and Stack Overflow
• Retraining is convenient compared to the performance of lexicon-based tools
• Nearly optimal performance is obtained with a minimal set of about 1,200 documents
Jira
• Retraining is beneficial only if a larger set of about 1,600 documents is available
• Negligible improvement over the SentiStrength-SE performance
• SMOTE doesn’t help with the class imbalance
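The gold-standard-size question can be explored with a simple learning-curve loop, sketched below with a toy dataset and a TF-IDF plus SVM stand-in for the supervised tools; the subset sizes and data are illustrative assumptions.

```python
# Sketch of the learning-curve analysis: retrain on increasingly larger training
# subsets and track macro-F1 on a fixed 30% test split.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

def toy_dataset():
    """Toy stand-in for a platform gold standard (assumption)."""
    texts = (["thanks, works great"] * 200 + ["this is broken and wrong"] * 200
             + ["see the documentation"] * 200)
    labels = ["positive"] * 200 + ["negative"] * 200 + ["neutral"] * 200
    return texts, labels

X, y = toy_dataset()
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42)

for n in (200, 400, len(X_train)):          # incrementally larger training sets
    if n < len(X_train):
        X_sub, _, y_sub, _ = train_test_split(
            X_train, y_train, train_size=n, stratify=y_train, random_state=42)
    else:
        X_sub, y_sub = X_train, y_train
    model = make_pipeline(TfidfVectorizer(), LinearSVC()).fit(X_sub, y_sub)
    print(n, round(f1_score(y_test, model.predict(X_test), average="macro"), 2))
```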
Lessons learned
✓ Use SE-specific tools
✓ Use training data from your target platform
– Platform-specific tuning enhances accuracy
– Build a robust, well-balanced gold standard with good interrater agreement
✓ If you cannot retrain, perform a sanity check
– Rule-based tools offer comparable performance
✓ Define your goal first, then select the tool
– The choice depends on the research goals: polarity vs. fine-grained emotions, emotions vs. attitudes, etc.
@NicoleNovielli
nicole.novielli@uniba.it
http://collab.di.uniba.it/nicole/ | http://collab.di.uniba.it/
Editor's Notes
  • #3: My research focuses on human aspects in software engineering. Within the group, I leverage my background in affective computing, emotion recognition, and natural language processing to study human aspects in software engineering, both at the individual and at the team level.
  • #5: In my research I focus on language analysis, using sentiment analysis on developers’ communication traces on several platforms, such as GitHub, Stack Overflow, Jira, etc. More recently, our group started investigating the role of emotion in software development by running user studies aimed at assessing the correlation between self-reported emotion and perceived productivity during software development activities. Specifically, we are also building classifiers that detect positive and negative emotion, as well as high or low emotional activation, based on biometric sensors that measure heart-related metrics, EEG (brain activity), and galvanic skin response (the skin’s electrical conductance).
  • #10: This research fits in the vein of a research trend that emerged in recent years as researchers became interested in the use of sentiment analysis for software engineering applications. For example, correlation between emotions and productivity was investigated in issue tracking systems. Other studies investigate the possibility of implementing early detection of anger and burnout to prevent undesired turnover. The role and impact of emotions in collaborative knowledge sharing was also studied on Stack Overflow. Also, mining of opinions about APIs from developer-generated content has been leveraged for building recommender systems. Similarly, in the field of requirements engineering, positive and negative sentiment in users’ feedback might be used to inform prioritization of development tasks by suggesting new features or enabling identification of bugs based on the users’ complaints.
  • #11: Early studies in this field (and when I say early I mean studies published about five years ago) used general-purpose sentiment analysis tools. For example, NLTK has sentiment analysis libraries whose classification models were fine-tuned using movie reviews or tweets. Movie reviews were also used by the Stanford researchers to train their deep-learning model. Other than supervised approaches, there are also tools relying on lexicon-based heuristics, such as SentiStrength, which leverages a set of rules that compute sentiment scores based on the presence of positive or negative sentiment words in the text. The performance of SentiStrength was validated on six manually annotated datasets containing texts from social media, including YouTube comments, Twitter, MySpace texts, and so on.
  • #13: The reliability of general-purpose tools was questioned by benchmark studies showing how they perform poorly on technical texts and they disagree with each other, inducing a threat to conclusion validity for empirical software engineering studies. This is kind of expected as general-purpose tools are trained on corpora collected in different domains, such as movie reviews, newspaper, social media comments, and so on.
  • #14: So why do general-purpose tools fail? In search of an answer to this question, we manually investigated 800 questions, answers, and comments from Stack Overflow, with specific focus on texts that were misclassified by SentiStrength, and found that the main causes of errors are the inability of tools to deal with contextual semantics and the fact that the polarity of the lexicon might depend on the domain.
  • #19: Trying to overcome the limitations posed by general-purpose tools, researchers started to develop software engineering-specific tools for sentiment analysis. Two main approaches prevailed: on one hand, tools such as Senti4SD or SentiCR rely on supervised machine learning, combining different lexical features extracted from the raw text in a manually annotated gold standard. On the other hand, tools such as SentiStrength-SE and DEVA rely on a lexicon-based approach, which means that they implement rules and heuristics based on the presence of positive and negative sentiment words as listed in sentiment vocabularies. So these tools cannot be retrained using a gold standard.
  • #20: We also developed our own tool for sentiment polarity detection, which you can download and re-train using your own custom gold standard dataset.
  • #25: Misclassification of positive posts as negative occurs in 6.6% of the cases when the classification is performed with SentiStrength (see TABLE 9). This is what we consider a strong disagreement that should be avoided. Senti4SD reduces such misclassification to 2.4% of the cases. For example, a sentence like ‘Is in so u need not worry! Internally the data is always stored as TEXT, so even if you create table with, SQLite is going to follow the rules of data type’ is erroneously classified by SentiStrength as negative due to the presence of negative lexicon (‘worry’), even if SentiStrength is supposed to correctly deal with negations (Thelwall et al. 2012), which should determine polarity inversion.
  • #26: So, the question now is: given the lesson learned here, that platform-specific training is required, how many texts do we need to label to reliably retrain classifiers?
  • #34: Better performance for supervised approaches
  • #35: Better performance for supervised approaches
  • #37: So, the question now is: given the lesson learned here, that platform-specific training is required, how many texts do we need to label to reliably retrain classifiers?
  • #47: So, the question now is: given the lesson learned here, that platform-specific training is required, how many texts do we need to label to reliably retrain classifiers?
  • #52: Manual inspection of texts misclassified by all tools reveals that polar facts are the most common cause of misclassification and that they are mainly observed in ad hoc annotation datasets.
  • #53: So, the question now is: given the lesson learned here, that platform-specific training is required, how many texts do we need to label to reliably retrain classifiers?
  • #54: And indeed, we found that tuning of tools for software engineering improves the classification accuracy. Specifically, we observed that SE-specific tuning enhances the agreement of tools with manual annotation and with each other.
  • #55: A question remains open, related to the off-the-shelf use of such tools. So the question we address in this study is whether we can use software engineering-specific tools for sentiment analysis off the shelf, without retraining them. Manual labeling is time consuming and gold standard datasets are not always available. So if I don’t have a manually annotated gold standard from the platform I want to study, can I safely reuse these tools off the shelf, without retraining?
  • #56: Specifically, we formulate three research questions to check if tools agree with manual annotation and with each other in a cross-platform setting, that is, in the absence of a gold standard for retraining.
  • #57: We reused the Jira and Stack Overflow datasets we already leveraged in our previous benchmark, and we specifically developed a third dataset with GitHub comments for the sake of this study. The GitHub dataset, of course, is available for reuse and can be downloaded from figshare. The three datasets are labeled based on the same emotion model, the Shaver model of emotions. In this model we map emotions such as joy into positive polarity and emotions such as anger into negative polarity. Neutral polarity refers to the absence of emotions. The Stack Overflow and GitHub datasets are well balanced in terms of distribution of polarity labels, while the Jira one includes a prevalence of neutral texts. A difference in interrater agreement is also observed, with GitHub and Stack Overflow showing the highest kappa values, indicating substantial agreement among human raters.
  • #58: And of course we repeated this process for all pairs of datasets, that is, we trained three different models for each tool using the train set of each dataset and then tested on the test set of each dataset, to simulate the off-the-shelf use of such tools. The settings on the diagonal correspond to the within-platform setting, that is, when you have a gold standard from your own target platform and you can retrain.
  • #59: The paper contains all the detailed information regarding performance metrics. Here, for time constraints, I will only discuss the F-measure. Let’s start from supervised tools and GitHub. Let’s assume the target platform in your study is GitHub and you have a gold standard for retraining. In a within-platform condition, we observe very satisfying performance, with Senti4SD achieving an F-measure of .92 and SentiCR .82.
  • #60: The situation changes when training is performed on a different platform: by training on Stack Overflow we observe a drop of .12 and .17 for Senti4SD and SentiCR, respectively. The situation is even worse when training on Jira.
  • #61: Interestingly, we observe that in both cases the drop is reduced when bag-of-words features are excluded from the feature sets. This is in line with the fact that the drop is higher for SentiCR, which only relies on bag of words, while Senti4SD also leverages semantic and lexicon-based features, thus mitigating the drop and reducing overfitting to the training data. This provides evidence of the lower ability of such features to generalize in cross-dataset settings and suggests that the positive and negative lexicon might be platform dependent. Furthermore, this indicates that shifts in lexical semantics might occur across platforms due to platform-specific lingo or communication style.
  • #62: This drop in performance is observed for all settings, thus reinforcing the intuition that the definition of domain, which is currently identified with the broad field of software engineering, should probably be narrowed down to the specific platform.
  • #63: In particular, we observe the highest drop when the Jira dataset is used for training. This indicates a higher ‘distance’ of the Jira gold standard from the other two, which might be potentially explained in several ways. First of all, the lexical distance we just discussed might be an explanation.
  • #64: Also, this could be due to the difference in label distributions, which induces a bias towards the majority class for the Jira dataset, which is imbalanced in favor of the neutral class.
  • #65: Another explanation might lie in the robustness of the gold standard in terms of interrater agreement, which is substantial to almost perfect for Stack Overflow and GitHub. Furthermore, the labeling of these two datasets was coordinated by researchers at the Collab research lab at the University of Bari, who hold a shared understanding of the conceptual model of emotions and who defined the guidelines for annotation according to their research goals.
  • #66: As for the comparison between supervised and lexicon-based tools, we can see that in a within-platform condition supervised retraining enables achieving the best accuracy.
  • #67: However, this is not necessarily true if the training is performed in a cross-platform setting, with lexicon-based tools outperforming supervised ones in some of the studied conditions, as in the case of Jira.
  • #69: So, the question now is: given the lesson learned here, that platform-specific training is required, how many texts do we need to label to reliably retrain classifiers?
  • #70: To answer this question, we retrained the supervised tools in a within-platform condition using incrementally bigger training sets and testing on the same 30% test partition used so far for the other experiments. Here are the learning curves for the GitHub and Stack Overflow datasets. We also report the performance of lexicon-based tools for reference, which of course does not change as we cannot retrain them. We observe that retraining is convenient compared to the performance of lexicon-based tools, and nearly optimal performance is obtained already with a minimal set of about 1,200 documents for GitHub and Stack Overflow.
  • #71: A different situation is observed for Jira, which is unbalanced. In this case retraining is beneficial only if a larger set of about 1,600 documents is available, but the improvement over the lexicon-based tool SentiStrength-SE is negligible. This might be due to the fact that SentiStrength-SE has been fine-tuned by the authors using a subset of the Jira dataset we are using in this study. Another explanation might be the unbalanced polarity label distribution, which we didn’t manage to address successfully even when techniques such as SMOTE are applied, as you can see from the bottom orange line in the chart.