A replicated study on duplicate detection: Using Apache Lucene to search among Android defects
M. BORG, P. RUNESON, J. JOHANSSON, M. MÄNTYLÄ
Core problem: Issue Duplicate Detection
[Figure: Issue inflow, Analyst, Merge]
Background 
ICSE '07 paper [cited by 57 (ISI), 214 (GS)]:
• Mobile phone with proprietary embedded OS
• Thousands of reports, 10% duplicates
• 40% of true duplicates found
• 2/3 is the conceptual limit
• Saves 20 h per 1,000 issues
Detection of Duplicate Defect Reports Using Natural Language Processing 
Per Runeson, Magnus Alexandersson and Oskar Nyholm 
Software Engineering Research Group 
Lund University, 
Box 118, SE-221 00 Lund, Sweden 
per.runeson@telecom.lth.se 
Abstract 
Defect reports are generated from various testing and development activities in software engineering. Sometimes two reports are submitted that describe the same problem, leading to duplicate reports. These reports are mostly written in structured natural language, and as such, it is hard to compare two reports for similarity with formal methods. In order to identify duplicates, we investigate using Natural Language Processing (NLP) techniques to support the identification. A prototype tool is developed and evaluated in a case study analyzing defect reports at Sony Ericsson Mobile Communications. The evaluation shows that about 2/3 of the duplicates can possibly be found using the NLP techniques. Different variants of the techniques provide only minor result differences, indicating a robust technology. User testing shows that the overall attitude towards the technique is positive and that it has a growth potential.
1. Introduction 
When a complex software product like a mobile phone is developed, it is natural and common that software defects slip into the product, leading to functional failures, i.e. the phone does not have the expected behavior. These failures are found in testing or other development activities and reported in a defect management system [5][18]. If the development process is highly parallel, or a product line architecture is used, where components are used in different products, the same defect may easily be reported multiple times, resulting in duplicate reports in the defect management system. These duplicates cost effort in identification and handling, hence support to speed up the duplicate detection process is appreciated.
The defect reports are written in natural language, and the duplicate identification requires suitable information retrieval methods. In this study, we investigate the use of Natural Language Processing (NLP) [17] techniques to help automate this process. NLP is previously used in requirements engineering [12][3][19], program comprehension [2] and in defect report management [15], although with a different angle.
Basically, we take the words in the defect report in plain English, make some processing of the text and then use the statistics on the occurrences of the words to identify similar defect reports. We implemented a prototype tool and evaluated its effects on the internal defect reporting system of Sony Ericsson Mobile Communications which contained thousands of reports. Further, we interviewed some users of the prototype tool to get a qualitative view of the effects. The prototype tool identified about 40% of the marked duplicate defect reports, which can be seen as low figure. However, since only one type of duplicate reports are possibly found by the technique, we estimate that the technique finds 2/3 of the possible duplicates. Also, in terms of working hours, reducing the effort to identify duplicate reports with 40% is still a substantial saving for a major software development company, which handles thousands of defect reports every year.
The paper is outlined as follows. Section 2 introduces the theory on defect reporting and on natural language processing. Section 3 presents the tailoring made of the NLP techniques to fit the duplicate detection purpose. In Section 4, we specify the case study conducted for evaluation of the technique, and Section 5 presents the case study results. Finally Section 6 concludes the paper and outlines further work.
29th International Conference on Software Engineering (ICSE'07) 
0-7695-2828-7/07 $20.00 © 2007
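The approach described in the excerpt above (take the words of a report, process the text, and compare word-occurrence statistics) can be illustrated with a minimal sketch. This is not the prototype tool from the ICSE '07 paper: the tokenizer, the choice of cosine similarity over raw term frequencies, and the names `tokenize`, `cosine_similarity` and `rank_candidates` are all assumptions made here for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens (assumed preprocessing)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(text_a, text_b):
    """Cosine similarity between the term-frequency vectors of two texts."""
    tf_a, tf_b = Counter(tokenize(text_a)), Counter(tokenize(text_b))
    # Only terms present in both texts contribute to the dot product.
    dot = sum(tf_a[t] * tf_b[t] for t in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(c * c for c in tf_a.values()))
    norm_b = math.sqrt(sum(c * c for c in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_candidates(query_text, reports, top_n=10):
    """Return up to top_n (report_id, score) pairs, most similar first.

    `reports` is a hypothetical mapping of report id -> report text.
    """
    scored = [(rid, cosine_similarity(query_text, text)) for rid, text in reports.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]
```

An analyst-facing tool would show the top-ranked reports as duplicate candidates; a production system would normally add stop-word removal, stemming and term weighting on top of this.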
Conceptual replication study 
This paper: 
• Android OS
• 20,175 reports, 1,158 duplicates (5.7%)
• 14% of true duplicates found
• Did it replicate?
Replication study methodology 
• Use all defect reports as search queries in Apache Lucene
• Evaluate the output at a cut-off of 10 defect reports
  – RQ1: Recall and Mean Average Precision (Rc@10, MAP@10)
  – RQ2: Relative importance of title and description
  – RQ3: Filter on submission date
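The two RQ1 measures can be sketched as follows. The slides do not define them precisely, so this is one common reading: recall is averaged over queries that have at least one known duplicate, and average precision is normalised by the number of hits within the cut-off. The function names and the data layout (`results`, `ground_truth`) are hypothetical.

```python
def recall_at_k(ranked, relevant, k=10):
    """Fraction of the known duplicates that appear in the top k results."""
    if not relevant:
        return 0.0
    hits = sum(1 for rid in ranked[:k] if rid in relevant)
    return hits / len(relevant)


def average_precision_at_k(ranked, relevant, k=10):
    """Mean of the precision values at each rank (<= k) where a duplicate is found.

    Normalised by the number of hits in the top k (an assumed convention).
    """
    hits = 0
    precision_sum = 0.0
    for rank, rid in enumerate(ranked[:k], start=1):
        if rid in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def evaluate(results, ground_truth, k=10):
    """Return (mean recall@k, MAP@k) over queries with at least one known duplicate.

    results:      query id -> ranked list of retrieved report ids
    ground_truth: query id -> set of report ids marked as duplicates of the query
    """
    queries = [q for q in results if ground_truth.get(q)]
    if not queries:
        return 0.0, 0.0
    mean_recall = sum(recall_at_k(results[q], ground_truth[q], k) for q in queries) / len(queries)
    mean_ap = sum(average_precision_at_k(results[q], ground_truth[q], k) for q in queries) / len(queries)
    return mean_recall, mean_ap
```

Under this convention a high MAP@10 together with a low Rc@10 is possible: the duplicates that are found tend to rank near the top, while many duplicates are not retrieved at all.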
Results 
• RQ1: Rc@10 = 0.138, MAP@10 = 0.632
• RQ2: Title is more important than description, but differences are small
• RQ3: Time filtering is not beneficial
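The title/description weighting (RQ2) and the submission-date filter (RQ3) can be illustrated with a small ranking sketch. It assumes that per-field similarity scores have already been computed by a search engine. The weights, the window length, the function name `rank_with_weights` and the candidate layout are hypothetical; the slides give neither the weights nor the window that was evaluated.

```python
from datetime import date


def rank_with_weights(query_date, candidates, title_weight=2.0,
                      description_weight=1.0, max_days=None, top_n=10):
    """Rank candidates by a weighted sum of per-field scores.

    candidates: list of dicts with keys "id", "title_score",
                "description_score" and "submitted" (a datetime.date).
    max_days:   if set, drop candidates submitted more than this many days
                before or after the query report (assumed filter definition).
    """
    kept = []
    for cand in candidates:
        if max_days is not None and abs((query_date - cand["submitted"]).days) > max_days:
            continue
        score = (title_weight * cand["title_score"]
                 + description_weight * cand["description_score"])
        kept.append((cand["id"], score))
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:top_n]
```

Rerunning the evaluation while varying `title_weight`, `description_weight` and `max_days` is one way to compare configurations of this kind.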
Replication? 
Original study 
• Search one by one
• Unique master report (not realistic)
• Precision not reported
Conceptual replication
• Search for all duplicates
• Clusters of duplicates (160 reports have 20+ duplicates)
• Recall much lower
Conclusions on replication
• Precision-recall numbers not comparable
• Principal results confirmed (weighting) or rejected (time filter)
• Need for empirical evaluations beyond the basic precision-recall “race”
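One arithmetic reason why recall at a fixed cut-off is hard to compare across the two setups is that large duplicate clusters cap the achievable value. The sketch below is an illustration added here, not an analysis taken from the slides, and `max_recall_at_k` is a hypothetical helper: if a report has more known duplicates than the cut-off, even a perfect ranking cannot retrieve all of them.

```python
def max_recall_at_k(num_duplicates, k=10):
    """Upper bound on recall@k for a query with num_duplicates known duplicates."""
    if num_duplicates <= 0:
        raise ValueError("num_duplicates must be positive")
    # At most k results are inspected, so at most min(k, num_duplicates) can be hits.
    return min(k, num_duplicates) / num_duplicates
```

For a report with 20 duplicates the bound at cut-off 10 is 0.5, whereas a report with a single master report can reach 1.0.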