Is Text Search an Effective Approach for Fault Localization: A Practitioners Perspective
Vibha Singhal Sinha, Senthil Mani and Debdoot Mukherjee
IBM Research - India
23rd October 2012, SPLASH-Wavefront, Tucson, AZ, USA
Can Text Search help in Debugging?
1. Search within past bug reports
   • Find similar bug reports and identify patches linked to them
2. Search within source code
   • Search comments, method names, variable names, etc. to identify code regions with high text overlap
 No dependence on program sizes, programming languages, types of faults or the presence of passing & failing test inputs, unlike existing program-analysis based approaches:
  Program slicing
  Statistical debugging / spectra-based techniques
  Delta debugging / mutation-based approaches
 Can be readily applied to jumpstart debugging

Possible Tactic: Identify a small set of files with text search and feed that as input to a program-analysis based technique to localize to a set of lines
 IR systems have been proposed in different areas of software maintenance to recommend relevant artifacts in the context of developer tasks
  Hipikat, Lassie, DebugAdvisor
 Efficacy of different language models has been evaluated for fault localization (Rao et al., Marcus et al., Cleary et al.)
  Vector Space Model, Latent Semantic Indexing, Latent Dirichlet Allocation, Cluster-Based Decision Making
 Rao and Kak suggest that IR-based bug localization is at least as effective as static and dynamic analysis techniques
 Enslen proposed Identifier Splitting to increase vocabulary overlap between bug reports and the code base
  E.g., the code word TextFieldTool is split into three words: text, field, tool.
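A minimal sketch of such identifier splitting on underscore and camel-case boundaries (illustrative only; the splitter used in the study is not described here):

import re

def split_identifier(identifier):
    # Split on underscores, then on camel-case boundaries,
    # e.g. "TextFieldTool" -> ["text", "field", "tool"].
    parts = []
    for chunk in identifier.split('_'):
        parts.extend(re.findall(r'[A-Z]+(?![a-z])|[A-Z][a-z]*|[a-z]+|\d+', chunk))
    return [p.lower() for p in parts if p]

print(split_identifier("TextFieldTool"))  # ['text', 'field', 'tool']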
[Architecture diagram: past resolved bugs and the linked code repository feed an Index Creator; an incoming bug feeds a Query Creator; the Search Module fires the query on the created indices; a Results Collator returns a ranked list of files.]

Indexing Strategies
 Bug Index (BI) – built from the repository of past resolved bugs
 Code Index (CI) – built from the code repository
 Meta Index (MI) – built from the code repository processed through identifier splitting

Querying Strategies
 Collate Title & Description (A)
 Boost weight of Title Words
 Boost weight of Code Words
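A rough sketch of this search pipeline using off-the-shelf TF-IDF retrieval (scikit-learn is an assumption; the slides do not name the engine or retrieval model used). Each index – BI, CI or MI – is treated as a plain bag-of-words corpus and queried with the collated bug title and description:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def build_index(documents):
    # documents: list of (doc_id, text) pairs making up one index (BI, CI or MI)
    ids, texts = zip(*documents)
    vectorizer = TfidfVectorizer(stop_words='english')
    matrix = vectorizer.fit_transform(texts)
    return ids, vectorizer, matrix

def search(index, query, top_k=5):
    # Rank the indexed documents by cosine similarity to the query text.
    ids, vectorizer, matrix = index
    scores = cosine_similarity(vectorizer.transform([query]), matrix).ravel()
    return sorted(zip(ids, scores), key=lambda p: p[1], reverse=True)[:top_k]

# Querying strategy A: collate the bug's title and description into one query string.
# code_index = build_index([(path, text) for path, text in repo_files])
# print(search(code_index, bug_title + " " + bug_description))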
 RQ1: How do the following search approaches compare in terms of efficacy? Are they any better than chance?
  Search on past bug reports – Bug Index (BI)
  Search on code repository – Code Index (CI)
  Search on processed code repository – Meta Index (MI)
 RQ2: Can we combine them to increase efficacy?
 RQ3: How do different features of the source code and the bugs available in a project impact the effectiveness of search?
 4 open source subjects
  BIRT, Datatools (Eclipse)
  Derby, Hadoop (Apache)
 Linking bug reports to change-sets
  Mined from references to bug-ids in commit comments (see the sketch after this list)
  Tracing JIRA links
 Test set has bug reports with at least one source file associated with them
  1177 bugs in test set
  35% of total bugs in chosen releases
  3-4% of the bug repositories
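A minimal, hypothetical sketch of the bug-id mining step (the actual id patterns and JIRA link tracing used in the study are not shown on the slides):

import re

# Assumed pattern for issue keys such as "DERBY-1234" or "HADOOP-567" in commit comments.
ISSUE_KEY = re.compile(r'\b([A-Z][A-Z0-9]+-\d+)\b')

def link_commits_to_bugs(commits):
    # commits: iterable of (commit_id, comment, changed_files) tuples
    links = {}
    for commit_id, comment, changed_files in commits:
        for bug_id in ISSUE_KEY.findall(comment):
            links.setdefault(bug_id, set()).update(changed_files)
    return links  # bug_id -> set of files changed to resolve it (the ground truth)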
 Average Precision, Recall and F1-Score
  For each bug in the test set taken as a query, we calculate precision, recall and F1-score and then average across the test set.
 Bug Coverage
  Percentage of bugs in the test set for which at least one file in the recommendation set matches the ground truth.
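For concreteness, a small sketch of these metrics, assuming the recommendation set and the ground-truth fixed files for each bug are given as sets of file paths:

def precision_recall_f1(recommended, fixed):
    # recommended, fixed: sets of file paths for one bug
    hits = len(recommended & fixed)
    precision = hits / len(recommended) if recommended else 0.0
    recall = hits / len(fixed) if fixed else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

def bug_coverage(results):
    # results: list of (recommended_set, fixed_set) pairs, one per bug in the test set
    covered = sum(1 for rec, fix in results if rec & fix)
    return covered / len(results)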
RQ1: How do the search approaches (BI, CI, MI) compare in terms of efficacy? Are they any better than chance?
[Charts: average precision, recall and F1-score of CI:A, MI:A and BI:A for increasing result-set sizes]
 Recall increases much more slowly than precision drops, so the F1-score dips beyond a result-set size of 3
 Suggests that search techniques may NOT help in identifying ALL the files that need to be fixed

[Charts: bug coverage of the three techniques vs. result-set size for BIRT, Datatools, Derby and Hadoop]
 Bug coverage increases with increase in result-set size
 None of the techniques emerges as the clear winner
 MI isn't any better than CI; sometimes it performs worse
 Hadoop gives much better results than the other 3 subjects
 Compare with the efficacy of a user who randomly selects source files from the code repository as the files to be fixed to resolve a bug
 Think of the code repository as a bin of black and white balls, where the files that need a fix to resolve the bug are white balls and the rest are black balls
 The hyper-geometric distribution gives the probability of choosing white balls without replacement
 Probability p of getting at least x files that require a fix by choosing k files at random from the repository (see the formula after this list)
 If p < 0.05, reject the null hypothesis that the search technique is no better than chance; apply an FDR correction for multiple hypothesis testing
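The tail probability in question is the standard hypergeometric one; writing N for the total number of files in the repository and w for the number of files that need a fix (the white balls), it is

p = P(X >= x) = \sum_{i=x}^{\min(k, w)} \binom{w}{i} \binom{N-w}{k-i} \Big/ \binom{N}{k}

(in SciPy, scipy.stats.hypergeom.sf(x - 1, N, w, k) computes this tail).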
 Even if only one correct result is returned for a bug, the result is usually significant
 Datatools has many queries failing the FDR test
  Certain queries have a large number of fixed files (e.g., 491 in 2 bugs)
 Record the average number of files in the repository at which the techniques break even with chance: p >= 0.05
  Ranges from 66 in Derby (MI:A) to 158 in Datatools (CI:A)
RQ2: Can we combine the search approaches to increase efficacy?
 Fleiss' Kappa analysis to measure the degree of agreement amongst the three techniques (a computation sketch follows)
  Each technique rates a bug: Yes if the technique covers the bug, else No
 Code-based techniques (CI, MI) agree with each other but are quite different from the bug-based technique (BI)
 Can we combine bug-based and code-based search to get better results?
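A small sketch of this agreement analysis using statsmodels (an assumption; the slides do not name the tool used), where each row is a bug and each column is one technique's Yes/No rating:

import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# One row per bug, one column per technique (BI, CI, MI); 1 = covers the bug, 0 = does not.
ratings = np.array([
    [1, 0, 0],
    [1, 1, 1],
    [0, 1, 1],
    # ... one row per bug in the test set
])
counts, _ = aggregate_raters(ratings)   # per-bug counts of each rating category
print(fleiss_kappa(counts))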
 Fire the same query on the 3 different indices and choose the top X search results using the following ranking schemes (a sketch follows this list):
  RankScore: Rank using the absolute search similarity scores returned by the search engine
  NormScore: Rank using a normalized similarity score – the fraction of the maximum score returned for the query
  AggregateScore: Rank on the basis of the sum of scores from the different techniques
  Sample: Pick the top 2*(X/5) search results from the results of BI:A and CI:A, and the remaining X/5 results from MI:A
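A rough sketch of three of these schemes, assuming each technique returns a list of (file, score) pairs for the query, e.g. results_by_technique = {'BI:A': [...], 'CI:A': [...], 'MI:A': [...]}; names and shapes are illustrative, not the study's implementation:

def rank_score(results_by_technique, top_x):
    # Merge all (file, score) pairs and rank by the raw similarity score.
    merged = [pair for results in results_by_technique.values() for pair in results]
    return dedupe(sorted(merged, key=lambda p: p[1], reverse=True))[:top_x]

def norm_score(results_by_technique, top_x):
    # Normalize each technique's scores by its maximum score before merging.
    merged = []
    for results in results_by_technique.values():
        if results:
            top = max(score for _, score in results)
            merged.extend((f, score / top) for f, score in results)
    return dedupe(sorted(merged, key=lambda p: p[1], reverse=True))[:top_x]

def aggregate_score(results_by_technique, top_x):
    # Sum each file's scores across techniques and rank by the total.
    totals = {}
    for results in results_by_technique.values():
        for f, score in results:
            totals[f] = totals.get(f, 0.0) + score
    return sorted(totals.items(), key=lambda p: p[1], reverse=True)[:top_x]

def dedupe(pairs):
    # Keep only the first (highest-scoring) occurrence of each file.
    seen, out = set(), []
    for f, score in pairs:
        if f not in seen:
            seen.add(f)
            out.append((f, score))
    return out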
 RankScore works better than the best of the
individual techniques across all subjects
 Improvement in bug coverage ranges from 1% to 46%
RQ3: How do different features of the source code and the bugs available in a project impact the effectiveness of search?
 Since queries can become very large, important words – TitleWords and CodeWords – may need to be artificially boosted (a boosting sketch follows the charts)
 TitleBoost helps improve bug coverage
  Except in Hadoop, where the fraction of titleWords that come up significant is already high even without the boost
[Charts: bug coverage of BI, CI and MI per subject, with and without boosting]
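One way such boosting can be expressed, assuming a Lucene-style query syntax in which term^2.0 doubles a term's weight (the engine actually used is not named on the slides):

def boosted_query(title_words, code_words, other_words, title_boost=2.0, code_boost=2.0):
    # Build a query string in which title words and code words carry extra weight.
    terms = ["%s^%.1f" % (w, title_boost) for w in title_words]
    terms += ["%s^%.1f" % (w, code_boost) for w in code_words]
    terms += list(other_words)
    return " ".join(terms)

# boosted_query(["parser", "crash"], ["TextFieldTool"], ["when", "opening", "report"])
# -> "parser^2.0 crash^2.0 TextFieldTool^2.0 when opening report"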
 Compared the efficacy of techniques that directly search the code repository with those that search over past bug reports
  No clear winner is observed
  Bug coverage ranges from 20% to 60% across the 4 subjects
  Techniques are better than chance
  Identifier splitting does not yield much benefit
 The techniques are complementary
  Bug coverage improves by 1% - 46% by combining them
  Favoring title-words helps in most cases