178 - A replicated study on duplicate detection: Using Apache Lucene to search among Android defects

A replicated study on duplicate detection:
Using Apache Lucene to search among Android defects
M. BORG, P. RUNESON, J. JOHANSSON, M. MÄNTYLÄ

Core problem: Issue duplicate detection
[Diagram: issue inflow, analyst, merge]

Background
ICSE'07 paper
[cited by 57 (ISI), 214 (GS)]:
• Mobile phone with proprietary embedded OS
• Thousands of reports, 10% duplicates
• 40% of true duplicates found
• 2/3 is the conceptual limit
• Saves 20 h per 1,000 issues
Detection of Duplicate Defect Reports Using Natural Language Processing
Per Runeson, Magnus Alexandersson and Oskar Nyholm
Software Engineering Research Group
Lund University,
Box 118, SE-221 00 Lund, Sweden
per.runeson@telecom.lth.se
Abstract
Defect reports are generated from various testing and
development activities in software engineering. Sometimes
two reports are submitted that describe the same
problem, leading to duplicate reports. These reports
are mostly written in structured natural language, and
as such, it is hard to compare two reports for similarity
with formal methods. In order to identify duplicates,
we investigate using Natural Language Processing
(NLP) techniques to support the identification. A prototype
tool is developed and evaluated in a case study
analyzing defect reports at Sony Ericsson Mobile Communications.
The evaluation shows that about 2/3 of
the duplicates can possibly be found using the NLP
techniques. Different variants of the techniques provide
only minor result differences, indicating a robust
technology. User testing shows that the overall attitude
towards the technique is positive and that it has a
growth potential.
1. Introduction
When a complex software product like a mobile
phone is developed, it is natural and common that
software defects slip into the product, leading to functional
failures, i.e. the phone does not have the expected
behavior. These failures are found in testing or
other development activities and reported in a defect
management system [5][18]. If the development process
is highly parallel, or a product line architecture is
used, where components are used in different products,
the same defect may easily be reported multiple times,
resulting in duplicate reports in the defect management
system. These duplicates cost effort in identification
and handling, hence support to speed up the duplicate
detection process is appreciated.
The defect reports are written in natural language,
and the duplicate identification requires suitable information
retrieval methods. In this study, we investigate
the use of Natural Language Processing (NLP) [17]
techniques to help automate this process. NLP is previously
used in requirements engineering [12][3][19],
program comprehension [2] and in defect report management
[15], although with a different angle.
Basically, we take the words in the defect report in
plain English, make some processing of the text and
then use the statistics on the occurrences of the words
to identify similar defect reports. We implemented a
prototype tool and evaluated its effects on the internal
defect reporting system of Sony Ericsson Mobile
Communications which contained thousands of reports.
Further, we interviewed some users of the prototype
tool to get a qualitative view of the effects. The prototype
tool identified about 40% of the marked duplicate
defect reports, which can be seen as low figure. However,
since only one type of duplicate reports are possibly
found by the technique, we estimate that the technique
finds 2/3 of the possible duplicates. Also, in
terms of working hours, reducing the effort to identify
duplicate reports with 40% is still a substantial saving
for a major software development company, which
handles thousands of defect reports every year.
The paper is outlined as follows. Section 2 introduces
the theory on defect reporting and on natural
language processing. Section 3 presents the tailoring
made of the NLP techniques to fit the duplicate detection
purpose. In Section 4, we specify the case study
conducted for evaluation of the technique, and Section
5 presents the case study results. Finally Section 6 concludes
the paper and outlines further work.
29th International Conference on Software Engineering (ICSE'07)
0-7695-2828-7/07 $20.00 © 2007

Conceptual replication study
This paper:
• Android OS
• 20,175 reports, 1,158 duplicates (5.7%)
• 14% of true duplicates found
• Did it replicate?

Replication study methodology
• Use all defect reports as search queries in Apache Lucene
• Evaluate the output at a cut-off of 10 defect reports
– RQ1: Recall and Mean Average Precision (Rc@10, MAP@10)
– RQ2: Relative importance of title and description
– RQ3: Filter on submission date
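
The two RQ1 measures can be sketched in a few lines of Python. This is a minimal illustration, not the study's evaluation script: the function names, the input layout (a ranked result list per query and a set of known duplicates per query), and the choice to normalize average precision by the number of duplicates found within the cut-off are all assumptions, and the study's exact definitions may differ.

```python
from typing import Dict, List, Set


def recall_at_k(ranked: Dict[str, List[str]],
                truth: Dict[str, Set[str]], k: int = 10) -> float:
    """Share of all known duplicate links that show up in the top-k results."""
    found = 0
    total = 0
    for query, dups in truth.items():
        top = ranked.get(query, [])[:k]
        found += len(dups.intersection(top))
        total += len(dups)
    return found / total if total else 0.0


def average_precision_at_k(results: List[str], dups: Set[str], k: int = 10) -> float:
    """Average of precision values at each rank where a duplicate is found.

    Assumption: normalized by the number of duplicates found in the top k.
    """
    hits = 0
    precision_sum = 0.0
    for rank, doc in enumerate(results[:k], start=1):
        if doc in dups:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def map_at_k(ranked: Dict[str, List[str]],
             truth: Dict[str, Set[str]], k: int = 10) -> float:
    """Mean of the per-query average precision over queries with known duplicates."""
    aps = [average_precision_at_k(ranked.get(q, []), dups, k)
           for q, dups in truth.items() if dups]
    return sum(aps) / len(aps) if aps else 0.0


# Toy data (hypothetical report IDs): report A has duplicates B and C, report B has A.
ranked = {"A": ["X", "B", "Y"], "B": ["Z", "W"]}
truth = {"A": {"B", "C"}, "B": {"A"}}
print(recall_at_k(ranked, truth))  # 1 of 3 duplicate links found
print(map_at_k(ranked, truth))     # (0.5 + 0.0) / 2
```

In the toy data only one of three duplicate links is retrieved, and it appears at rank 2, so recall is 1/3 and MAP is 0.25 under the assumed definitions.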

Results
• RQ1: Rc@10 = 0.138, MAP@10 = 0.632
• RQ2: Title is more important than description, but differences are small
• RQ3: Time filtering is not beneficial
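
The RQ2 and RQ3 variations can be illustrated with a small stand-in ranker. The sketch below does not use Apache Lucene and does not reproduce its scoring: it uses plain term-frequency cosine similarity, and the `Report` class, the `search` function, the weight parameters, and the symmetric `max_days` window are all hypothetical names and choices made for illustration.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class Report:
    report_id: str
    title: str
    description: str
    submitted: date


def tokens(text: str) -> Counter:
    """Lower-cased bag of words."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def search(query: Report, corpus: List[Report],
           title_weight: float = 2.0, desc_weight: float = 1.0,
           max_days: Optional[int] = None, k: int = 10) -> List[str]:
    """Rank corpus reports against the query; return the top-k report IDs."""
    scored = []
    for r in corpus:
        if r.report_id == query.report_id:
            continue  # never return the query itself
        # RQ3 (assumed form): skip reports submitted too far from the query.
        if max_days is not None and abs((r.submitted - query.submitted).days) > max_days:
            continue
        # RQ2: weighted sum of title and description similarity.
        score = (title_weight * cosine(tokens(query.title), tokens(r.title))
                 + desc_weight * cosine(tokens(query.description), tokens(r.description)))
        scored.append((score, r.report_id))
    scored.sort(key=lambda p: (-p[0], p[1]))
    return [rid for _, rid in scored[:k]]


# Invented example reports, not data from the study.
q = Report("1", "Camera crashes on launch", "App closes when opening camera", date(2013, 1, 1))
corpus = [
    q,
    Report("2", "Camera crash at launch", "Camera app closes immediately", date(2013, 1, 5)),
    Report("3", "Bluetooth pairing fails", "Cannot pair headset", date(2013, 1, 2)),
    Report("4", "Camera crashes on launch", "App closes when opening camera", date(2014, 6, 1)),
]
print(search(q, corpus))               # ['4', '2', '3']
print(search(q, corpus, max_days=30))  # ['2', '3']
```

In this toy data the 30-day window drops report 4, which has identical text but was submitted more than a year later. That shows one way a date filter can exclude a true duplicate; it is an illustration only, not the study's explanation for the RQ3 result.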

Replication?
Original study
• Search one by one
• Unique master report (not realistic)
• Precision not reported
Conceptual replication
• Search for all duplicates
• Clusters of duplicates (160 reports have 20+ duplicates)
• Recall much lower
Conclusions on replication
• Precision-Recall numbers are not comparable
• Principal results confirmed (weighting) or rejected (time filter)
• Need for empirical evaluations beyond the basic Precision-Recall “race”
