Analyzing Reviews and Code of Mobile Apps for Better Release Planning

Adelina Ciurumelea, Andreas Schaufenbühl,
Sebastiano Panichella, Harald C. Gall
software evolution & architecture lab

2
Extremely Popular Apps
8,087,067 reviews
3,505,905 reviews
38,742,600 reviews

3
Open Source Apps
62,707 reviews

4
The number of reviews is large compared
to the available development resources.

5
• reviews contain valuable
feedback directly from the
users
• users often report bugs, describe
their user experience and request
features
• the review content influences
the number of downloads
Importance of reviews

6
INFORMATIVE NON-INFORMATIVE
“AR-Miner: Mining informative reviews for developers from mobile app marketplace”
N. Chen, J. Lin, S. Hoi, X. Xiao, and B. Zhang

7
BUG FEATURE REQUEST OTHER
“Release planning of mobile apps based on user reviews”
L. Villarroel, G. Bavota, B. Russo, R. Oliveto, and M. Di Penta

8
BUG FEATURE REQUEST
• the developer has to manually analyse the unstructured groups of
reviews, understand what they talk about and extract actionable change
tasks
• what does a particular cluster talk about? Does it talk about the UI or
about the performance of the app, etc.?

9
What are the mobile-specific topics
users talk about in their reviews?

10
manual analysis of ~1600
reviews

11
Hmmm...
Mm No…
This is IT
Nope Nopity nope
• not all reviews are useful

12
Hmmm...
Mm No…
This is IT
Nope Nopity nope
Sucks Way to many errors
0 stars Garbage.
problem bro
Garbage Bla bla bla
• not all reviews are useful
• some are even offensive

13
Pretty close to perfect, this app is
way better than any comic book
reader I've ever used. It's small, it
operates fast, and the interface is
incredibly clean and simple.
• others can provide valuable
information for the developer

14
Pretty close to perfect, this app is
way better than any comic book
reader I've ever used. It's small,
it operates fast, and the
interface is incredibly clean and
simple.
Resources
Usage

15
For info (in case dev not already
aware!), there is a graphical
glitch when scrolling output in
marshmallow on a nexus 5.
Compatibility
Usage
Complaint

16
Building the taxonomy
• feature extraction: TF-IDF scores and 2- and 3-gram counts
Content analysis in 2 passes:
• start with an empty list of categories
• analyse each review and add a new category
(including definition and keywords) if necessary
• label the review with all the matching categories
• second pass: revisit the list of reviews and label
them with the appropriate categories

17
High Level Taxonomy
Category: Description
Compatibility: mentions the OS, mobile device or a specific hardware component.
Usage: talks about the UI or the usability of the app.
Resources: mentions the app’s influence on the battery and memory usage or the performance of the app/phone.
Pricing: statements mentioning the license model or the price of the app.
Protection: statements referring to security or privacy issues.
Complaint: the user reports or complains about an issue with the app.

18
specialise the taxonomy
further

19
Liked it and worked very well in
lollipop, but not MM The plugins
don't refresh, manual navigation
to next image doesn't work.
Some plugins give error.
Altogether seems broken after
MM update on Note 4.
Compatibility

20
Liked it and worked very well in
lollipop, but not MM The plugins
don't refresh, manual navigation
to next image doesn't work.
Some plugins give error.
Altogether seems broken after
MM update on Note 4.
Compatibility
Device
Android Version

21
Low Level Taxonomy
High Level: Low Level Categories
Compatibility: Device, Android Version, Hardware
Usage: App Usability, UI
Resources: Performance, Battery, Memory
Pricing: Licensing, Price
Protection: Security, Privacy
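
The two-level taxonomy can be written down as a small data structure. The sketch below is only an illustration: the category names come from the slides, while `TAXONOMY`, `all_categories` and `parent_of` are hypothetical names introduced here. Complaint is listed without sub-categories because the slides give none for it.

```python
# Two-level taxonomy as listed on the slides.
# Complaint has no low level categories on the slides, so its list is empty.
TAXONOMY = {
    "Compatibility": ["Device", "Android Version", "Hardware"],
    "Usage": ["App Usability", "UI"],
    "Resources": ["Performance", "Battery", "Memory"],
    "Pricing": ["Licensing", "Price"],
    "Protection": ["Security", "Privacy"],
    "Complaint": [],
}


def all_categories(taxonomy):
    """Flatten high and low level categories into one list of labels."""
    labels = []
    for high, lows in taxonomy.items():
        labels.append(high)
        labels.extend(lows)
    return labels


def parent_of(taxonomy, low_level):
    """Return the high level category that owns a low level one, or None."""
    for high, lows in taxonomy.items():
        if low_level in lows:
            return high
    return None


if __name__ == "__main__":
    # 6 high level + 12 low level categories
    print(len(all_categories(TAXONOMY)))
    print(parent_of(TAXONOMY, "Battery"))
```

Counting the flattened labels gives 6 high level plus 12 low level categories, which matches the 18 classifiers mentioned later in the deck.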

23
ML Approach
Preprocessing & Feature Extraction → Gradient Boosted Trees Training → Multi-label Classification

24
Preprocessing & Feature Extraction
• preprocessing: stop-word removal and stemming
• feature extraction: TF-IDF scores and 2- and 3-gram counts
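
A minimal sketch of these steps, assuming a tiny hand-written stop-word list and a crude suffix-stripping stand-in for stemming. The slides do not say which stemmer, stop-word list or TF-IDF variant was used, so `STOP_WORDS`, `crude_stem` and the `tf * log(N / df)` weighting are assumptions made for illustration only.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (assumption, not the list used by the authors).
STOP_WORDS = {"the", "a", "an", "is", "it", "and", "this", "to", "of", "in", "on"}


def crude_stem(token):
    """Very rough suffix stripping; a real pipeline would use a proper stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Lowercase, tokenize, drop stop words, then stem each remaining token."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]


def ngram_counts(tokens, sizes=(2, 3)):
    """Count 2-grams and 3-grams over a token list."""
    counts = Counter()
    for n in sizes:
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


def tfidf(docs_tokens):
    """One TF-IDF dict per document: tf = count / length, idf = log(N / df)."""
    n_docs = len(docs_tokens)
    df = Counter()
    for tokens in docs_tokens:
        df.update(set(tokens))
    vectors = []
    for tokens in docs_tokens:
        tf = Counter(tokens)
        vectors.append(
            {t: (c / len(tokens)) * math.log(n_docs / df[t]) for t, c in tf.items()}
        )
    return vectors


if __name__ == "__main__":
    reviews = [
        "The battery drains fast",
        "It crashes on Marshmallow",
        "Battery usage is fine",
    ]
    tokens = [preprocess(r) for r in reviews]
    print(tokens[0])
    print(ngram_counts(tokens[0]))
    print(tfidf(tokens)[0])
```

In this toy corpus "battery" appears in two of the three reviews, so it receives a lower TF-IDF weight in the first review than "drain", which appears only once.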

25
Training
• one-vs-all strategy: separate classifier for each
high and low level category (18 in total)
• used the Gradient Boosted Trees model
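
The one-vs-all idea can be sketched as follows: one binary classifier is trained per category, and a review receives every label whose classifier fires. This is a toy stand-in, not the authors' implementation. It boosts depth-1 trees (stumps) on log-loss over binary word-presence features, and `VOCAB`, the four training reviews and all function names are invented for the illustration. A real setup would use a library implementation of Gradient Boosted Trees on the TF-IDF and n-gram features.

```python
import math

LEARNING_RATE = 0.5
VOCAB = ["battery", "drain", "crash", "marshmallow", "screen"]  # toy vocabulary


def featurize(text):
    """Binary word-presence features over the toy vocabulary."""
    words = set(text.lower().split())
    return [1 if w in words else 0 for w in VOCAB]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_stump(X, residuals):
    """Pick the binary feature whose split fits the residuals best (least squares)."""
    best = None
    for j in range(len(X[0])):
        left = [r for x, r in zip(X, residuals) if x[j] == 0]
        right = [r for x, r in zip(X, residuals) if x[j] == 1]
        if not left or not right:
            continue
        lv, rv = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, j, lv, rv)
    return best[1:] if best else None


def train_boosted_stumps(X, y, rounds=200):
    """Gradient boosting on log-loss: each stump fits the residuals y - p."""
    scores = [0.0] * len(X)
    stumps = []
    for _ in range(rounds):
        residuals = [yi - sigmoid(s) for yi, s in zip(y, scores)]
        stump = fit_stump(X, residuals)
        if stump is None:
            break
        j, lv, rv = stump
        stumps.append(stump)
        scores = [s + LEARNING_RATE * (rv if x[j] == 1 else lv)
                  for s, x in zip(scores, X)]
    return stumps


def predict_score(stumps, x):
    return sum(LEARNING_RATE * (rv if x[j] == 1 else lv) for j, lv, rv in stumps)


def train_one_vs_all(X, label_sets, labels):
    """Train one binary model per label (one-vs-all)."""
    return {lab: train_boosted_stumps(X, [1 if lab in s else 0 for s in label_sets])
            for lab in labels}


def predict_labels(models, x):
    """Multi-label prediction: every label whose model scores above zero."""
    return {lab for lab, stumps in models.items() if predict_score(stumps, x) > 0}


# Invented training data, for illustration only.
TEXTS = ["battery drain after update", "crash on marshmallow",
         "clean and simple screen", "low battery use"]
LABEL_SETS = [{"Resources", "Complaint"}, {"Compatibility", "Complaint"},
              {"Usage"}, {"Resources"}]
LABELS = ["Compatibility", "Usage", "Resources", "Complaint"]
X = [featurize(t) for t in TEXTS]
MODELS = train_one_vs_all(X, LABEL_SETS, LABELS)

if __name__ == "__main__":
    for text, x in zip(TEXTS, X):
        print(text, "->", sorted(predict_labels(MODELS, x)))
```

Training the labels independently keeps each model simple and lets a single review carry several labels at once, for example both Resources and Complaint.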

26
Multi-label Classification
Preprocessing → Feature Extraction → Classification → High & Low Level Categories
Example labels: Battery, UI, Complaint, Resources, Usage

27
Example
RQ2: Does our approach correctly recommend the software
artifacts that need to be modified in order to handle user
requests and complaints?
• 752 user reviews from our dataset
belong to AcDisplay
• analyse Compatibility and
Complaint reviews (61 reviews)
• Complaint and Android Version (22
reviews)

28
Example
“Good but has some issues with Marshmallow I used this on
my old phone and if was flawless and I loved it. I noticed that
sometimes when I had AcDisplay activated I would not be
able to use the fingerprint sensor even after I unlocked
AcDisplay and had to enter a password. This is very frustrating
so I cannot use AcDisplay.”
“Love the design I love the app. It’s super sleek and nice. But
ever since my phone updated to marshmallow it’s stopped
working. Hope it comes back soon.”
“On Marshmallow, the screen is buggy and sometimes shows
the notification shade.”

29
Source Code Localisation
• can we link reviews to the related source code?
• information retrieval (IR) methods based on the vector space model (VSM);
a hard task, because reviews and source code use different vocabularies
• use additional Android project specific information (e.g.
UI functionality is implemented in Activity classes)
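
A minimal sketch of VSM-based linking, assuming TF-IDF weighting and cosine similarity. The slides only name "IR - VSM", so the weighting scheme, the camelCase splitting and every file name below are hypothetical and chosen only to show the ranking step.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase words; camelCase identifiers are split into their parts first."""
    spaced = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", text)
    return re.findall(r"[a-z]+", spaced.lower())


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_artifacts(review, artifacts):
    """Rank source artifacts by TF-IDF cosine similarity to a review."""
    docs = {name: tokenize(text) for name, text in artifacts.items()}
    n_docs = len(docs)
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))

    def vectorize(tokens):
        # Smoothed idf so a term that occurs in every artifact keeps some weight.
        return {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1)
                for t, c in Counter(tokens).items()}

    query = vectorize(tokenize(review))
    scored = [(name, cosine(query, vectorize(tokens))) for name, tokens in docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical artifacts; names and contents are invented for illustration.
    artifacts = {
        "FingerprintActivity.java": "class FingerprintActivity unlock fingerprint sensor password",
        "NotificationShadeView.java": "class NotificationShadeView draw notification shade screen",
        "BatteryMonitor.java": "class BatteryMonitor battery level",
    }
    review = "not able to use the fingerprint sensor, had to enter a password"
    for name, score in rank_artifacts(review, artifacts):
        print(f"{name}: {score:.2f}")
```

The example works only because the review and the top-ranked file happen to share words such as "fingerprint" and "password". When users describe a problem in words that never occur in the code, the similarity drops to zero, which is the vocabulary mismatch the slide points out.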

30
Source Code Localisation
User Reviews + App’s Source Code → IR - VSM + Android Project Structure Info → Software Artifacts

31
Evaluation
RQ1: To what extent does our approach organise reviews
according to meaningful maintenance and evolution tasks
for developers?
RQ2: Does our approach correctly recommend the software
artifacts that need to be modified in order to handle user
requests and complaints?

33
Study RQ1
• ~7800 user reviews from 39 apps

34
Study RQ1
• 2 external evaluators
• evaluate 200 reviews for
each category (3600 total)

35
Results RQ1
High Level Category Precision Recall F1 Score
Compatibility 71% 97% 82%
Usage 89% 94% 91%
Resources 79% 99% 88%
Pricing 85% 97% 90%
Protection 89% 98% 93%
Complaint 85% 80% 82%
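
The F1 Score column is the harmonic mean of precision and recall. The helper below is a small sketch (the function name is chosen here, not taken from the slides) that recomputes it from the rounded precision and recall values shown in the table.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Compatibility row: precision 71%, recall 97%
    print(round(f1_score(0.71, 0.97), 2))
    # Complaint row: precision 85%, recall 80%
    print(round(f1_score(0.85, 0.80), 2))
```

Because the inputs are already rounded to whole percentages, a recomputed value can differ by about one percentage point from the reported F1 score.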

36
Results RQ1
High Level Category   Low Level Category   Precision   Recall   F1 Score
Compatibility         Device               85%         98%      91%
Compatibility         OS Version           89%         86%      87%
Compatibility         Hardware             61%         95%      74%
Usage                 App Usability        92%         91%      91%
Usage                 UI                   83%         93%      88%
Resources             Performance          64%         97%      77%
Resources             Battery              78%         95%      86%
Resources             Memory               68%         95%      79%
Pricing               Licensing            91%         98%      94%
Pricing               Price                85%         96%      90%
Protection            Security             87%         98%      92%
Protection            Privacy              83%         96%      89%

37
Results RQ1
Our approach is able to classify reviews with high precision
and recall according to the mobile-specific topics we derived.
The most important categories are Usage, Resources and
Compatibility.

38
Study RQ2
• 1 external evaluator
• 91 user reviews from 2 apps

39
Results RQ2
Quality of Reviews Precision Recall F1 Score
Difficult to Link 41% 83% 55%
Easier to Link 52% 79% 63%
All 51% 79% 62%

40
Results RQ2
Our approach achieves promising results in recommending
related software artifacts for specific user reviews. Furthermore,
better quality reviews are easier to link than lower quality ones.

41
Conclusion & Future Work
• reviews can be classified with high precision and recall
using machine learning according to mobile specific
topics
• linking reviews to source code using textual similarity
based methods is difficult
• future work: summarise reviews, improve localisation
(static analysis)

42
Discussion
What mechanisms can we adopt for enabling a reliable
and practical solution for code localisation?
