Muffler a tool using mutation to facilitate fault localization 2.3

Muffler: An Approach Using Mutation
to Facilitate Fault Localization
Tao He
elfinhe@gmail.com
Department of Computer Science, Sun Yat-Sen University
Department of Computer Science and Engineering, HKUST

Group Discussion
February 2012
HKUST, Hong Kong, China

1/34

Outline

 Background
 Motivation
 Why does our approach work?
 Our Approach – Muffler
 Empirical Evaluation
 Conclusion

2/34

Background

 Coverage-Based Fault Localization (CBFL)
 Input
 Coverage
 Testing results (passed or failed)

 Output
 A ranking list of statements
 Ranking functions
 Most CBFL techniques are similar to each other,
except that they use different ranking functions to
compute suspiciousness.
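The ranking functions compared later in this deck can be sketched as follows. This is a minimal sketch using the commonly published forms of Tarantula, Ochiai, χDebug (Wong3), and Naish; the function and parameter names are illustrative. `p` and `f` are the numbers of passed and failed test cases that cover a statement, and `P` and `F` are the total numbers of passed and failed test cases. The Naish score is written in the scaled integer form `f*(P+1) - p`, which orders statements the same way as `f - p/(P+1)` and matches the values shown in the deck's schedule v2 example.

```python
import math


def tarantula(p, f, P, F):
    """Failed-coverage rate divided by the sum of both coverage rates."""
    fail_rate = f / F if F else 0.0
    pass_rate = p / P if P else 0.0
    denom = fail_rate + pass_rate
    return fail_rate / denom if denom else 0.0


def ochiai(p, f, P, F):
    """f / sqrt(F * (f + p))."""
    denom = math.sqrt(F * (f + p))
    return f / denom if denom else 0.0


def wong3(p, f, P, F):
    """chiDebug (Wong3): f - h, where h discounts passed runs piecewise."""
    if p <= 2:
        h = p
    elif p <= 10:
        h = 2 + 0.1 * (p - 2)
    else:
        h = 2.8 + 0.001 * (p - 10)
    return f - h


def naish(p, f, P, F):
    """Scaled form of f - p/(P+1): failed count first, passed count as tie-breaker."""
    return f * (P + 1) - p


# Statement S1 of the schedule v2 example: Passed(s)=1798, Failed(s)=210,
# TotalPassed=2440, TotalFailed=210.
s1 = (1798, 210, 2440, 210)
print(round(tarantula(*s1), 2), round(ochiai(*s1), 2),
      round(wong3(*s1), 2), naish(*s1))
```

All four functions consume the same coverage counts, which is why the techniques differ only in how they rank.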

3/34

What is the limitation of existing
CBFL techniques?

4/34

Motivation

 One fundamental assumption [YPW08] of CBFL
 The observed behaviors from passed runs can precisely
represent the correct behaviors of this program;
 and the observed behaviors from failed runs can represent the
incorrect behaviors.
 Therefore, the different observed behaviors of program
entities between passed runs and failed runs will indicate the
fault’s location.
 But this does not always hold.

[YPW08] C. Yilmaz, A. Paradkar, and C. Williams. Time will tell: fault localization using time spectra. In Proceedings
of the 30th international conference on Software engineering (ICSE '08). ACM, New York, NY, USA, 81-90. 2008.
5/34

Motivation
 Coincidental Correctness (CC)
 “No failure is detected, even though a fault has been executed.” [RT93]
 i.e., the passed runs may cover the fault.
 This weakens the first part of CBFL's assumption:
 The observed behaviors from passed runs can precisely represent
the correct behaviors of this program;
 Moreover, CC occurs frequently in practice [MAE+09].

[RT93] D.J. Richardson and M.C. Thompson, An analysis of test data selection criteria using the RELAY model of
fault detection, Software Engineering, IEEE Transactions on, vol. 19, (no. 6), pp. 533-553, 1993.
[MAE+09] W. Masri, R. Abou-Assi, M. El-Ghali, and N. Al-Fatairi, An empirical study of the factors that reduce the
effectiveness of coverage-based fault localization, in Proceedings of the 2nd International Workshop on Defects in
Large Software Systems: Held in conjunction with the ACM SIGSOFT International Symposium on Software Testing
and Analysis (ISSTA 2009), pp. 1-5, 2009.
6/34

Our goal is to address the CC issue via mutation analysis
What is the idea?

7/34

Why does our approach work?
- Key hypothesis
 Mutating the faulty statement tends to maintain the
results of passed test cases.
 By contrast, mutating a correct statement tends to
change the results of passed test cases (from passed to
failed).
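A minimal sketch of how the result changes in this hypothesis could be counted, assuming each run of the test suite is summarised as a list of per-test verdicts (True for passed, False for failed). The function name and the toy verdict lists are illustrative and not taken from the deck.

```python
def passed_to_failed(original, mutant):
    """Count tests that pass on the original program but fail on the mutant."""
    if len(original) != len(mutant):
        raise ValueError("both runs must cover the same test cases")
    return sum(1 for before, after in zip(original, mutant) if before and not after)


# Toy verdicts for a faulty program with three passing tests and one failing test.
original = [True, True, True, False]

# Mutating a correct statement tends to break passing tests.
mutant_of_correct_stmt = [False, False, False, False]
# Mutating the faulty statement tends to leave passing tests untouched.
mutant_of_faulty_stmt = [True, True, True, False]

print(passed_to_failed(original, mutant_of_correct_stmt))  # 3
print(passed_to_failed(original, mutant_of_faulty_stmt))   # 0
```

Under the hypothesis, a statement whose mutants produce few passed-to-failed changes is more likely to be the faulty one.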

8/34

- Three comprehensive scenarios (1/3)
- If we mutate an M in a different basic block from F
[Figure: test cases run through the program to produce test results (passed or failed). M: mutant point. F: fault point.]
3 test results change from passed to failed
9/34

- If we mutate an M in a different basic block from F
[Figure: test cases run through the program to produce test results (passed or failed), with M placed in another basic block. M: mutant point.]
3 test results change from passed to failed
10/34

- If we mutate F
[Figure: test cases run through the program to produce test results (passed or failed), with the mutant point M placed on the fault point F (F+M).]
0 test results change from passed to failed
11/34

- If we mutate an M in the same basic block as F
Due to a different data flow affecting the output
[Figure: test cases run through the program to produce test results (passed or failed), with control-flow and data-flow edges. M: mutant point. F: fault point.]
3 test results change from passed to failed
12/34

- If we mutate F
[Figure: test cases run through the program to produce test results (passed or failed), with control-flow and data-flow edges and the mutant point M placed on the fault point F (F+M).]
0 test results change from passed to failed
13/34

- When CC occurs frequently
- If we mutate F
Due to weak ability to affect output: weak ability to generate an infectious state or to propagate the infectious state to output
[Figure: test cases run through the program to produce test results (passed or failed), with the mutant point M placed on the fault point F (F+M).]
0 test results change from passed to failed
14/34

Does this work in real programs?

15/34

- A feasibility study
[Figure: one panel per faulty version: tcas v7, tot_info v17, schedule v4, schedule2 v1, print_tokens v7, print_tokens2 v3, replace v24, space v20.]
Figure: Distribution of statements' result changes
and faulty statement's testing result changes.
The vertical axis denotes the number of testing result changes (from "passed" to
"failed"), and the horizontal width denotes the probability density at the corresponding number of
testing result changes.
16/34

- Another feasibility study (when CC% ≥ 95%)
[Figure: histogram of the frequency of faulty versions (vertical axis, 0 to 25) against the percentage of code examined (horizontal axis, 0% to 80%). Series: result changes (avg. 16.33%) and Naish (avg. 47.55%).]
Figure: Frequency distribution of effectiveness
when CC% ≥ 95%.
 When CC% is greater than or equal to 95%, ranking by result changes reduces the code examination effort
of Naish by 65.66% (= 100% - 16.33%/47.55%).
 With Naish, only 6 faulty versions require examining less than 20% of the statements;
with result changes, 22 versions do.
17/34

How to design our new ranking
function?

18/34

Our Approach – Muffler



[NLR11] L. Naish, H. J. Lee, and K. Ramamohanarao, A model for spectra-based software diagnosis. ACM
Transactions on Software Engineering and Methodology, 20(3):11, 2011.
19/34

How do we evaluate our approach?
What is the result?

20/34

Empirical Evaluation



Program suite    Number of versions    Lines of Executable Code    Number of test cases    LOC
tcas             41                    63-67                       1608                    133-137
tot_info         23                    122-123                     1052                    272-273
schedule         9                     149-152                     2650                    290-294
schedule2        10                    127-129                     2710                    261-263
print_tokens     7                     189-190                     4130                    341-343
print_tokens2    10                    199-200                     4115                    350-355
replace          32                    240-245                     5542                    508-515
space            38                    3633-3647                   13585                   5882-5904
21/34

Empirical Evaluation
[Figure: percentage of faults located (vertical axis, 0% to 100%) against percentage of code examined (horizontal axis, 0% to 100%). Techniques: Muffler, Naish, Ochiai, Tarantula, Wong3.]
Figure: Overall effectiveness comparison.
22/34

% of code examined    Tarantula    Ochiai    χDebug    Naish    Muffler
1%                    14           18        19        21       35
5%                    38           48        56        58       74
10%                   54           63        68        68       85
15%                   57           65        80        80       94
20%                   60           67        84        84       99
30%                   79           88        91        92       110
40%                   92           98        98        99       117
50%                   98           99        101       102      121
60%                   99           103       105       106      123
70%                   101          107       117       119      123
80%                   114          122       122       123      123
90%                   123          123       122       123      123
100%                  123          123       123       123      123
Table: Number of faults located at different levels of code
examination effort using Naish and Muffler.
 When 1% of the statements have been examined, Naish can reach the
fault in 17.07% of faulty versions. At the same time, Muffler can reach
the fault in 28.46% of faulty versions.
23/34
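The two percentages quoted for the 1% row follow directly from the fault counts. A minimal sketch, assuming the 123 faulty versions stated in the conclusion; the function name is illustrative.

```python
TOTAL_VERSIONS = 123  # faulty versions in the study


def percent_located(faults_located):
    """Share of faulty versions whose fault is reached, as a percentage."""
    return round(100.0 * faults_located / TOTAL_VERSIONS, 2)


# Counts from the 1% row of the table: Naish locates 21 faults, Muffler 35.
print(percent_located(21))  # 17.07
print(percent_located(35))  # 28.46
```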

Tarantula Ochiai χDebug Naish Muffler
Min 0.00 0.00 0.00 0.00 0.00
Max 87.89 84.25 93.85 78.46 55.38
Median 20.33 9.52 7.69 7.32 3.25
Mean 27.68 23.62 20.04 19.34 9.62
Stdev 28.29 26.36 24.61 23.86 13.22

Table: Statistics of code examination effort.

Among these five techniques, Muffler scores the best (or ties for the best) in the rows that correspond to
the minimum, median, and mean code examination effort. In addition, Muffler has a much
lower standard deviation, which means that its performance varies less widely than that of the
others and is more stable in terms of effectiveness. Results also show that
Muffler reduces the average code examination effort of Naish by 50.26%
(= 100% - 9.62%/19.34%).
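The reduction figures in this deck are relative reductions of the mean code examination effort. A minimal sketch of that arithmetic, with an illustrative function name, applied to the two pairs of means reported in the slides:

```python
def effort_reduction(baseline_mean, improved_mean):
    """Relative reduction in mean code examination effort, as a percentage."""
    return 100.0 * (1.0 - improved_mean / baseline_mean)


# Overall study: Naish mean 19.34%, Muffler mean 9.62%.
print(round(effort_reduction(19.34, 9.62), 2))   # 50.26
# Feasibility study with CC% >= 95%: Naish 47.55%, result changes 16.33%.
print(round(effort_reduction(47.55, 16.33), 2))  # 65.66
```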

24/34

How about the coincidental
correctness issue?

25/34

Conclusion and future work
 We propose Muffler, a technique using mutation to
help locate program faults.
 On 123 faulty versions of seven programs, we compare
its effectiveness and efficiency with those of the
Naish technique. Results show that Muffler reduces the
average code examination effort per faulty version
by 50.26%.
 For future work, we plan to generalize our approach to
locate faults in multi-fault programs.

27/34

Thank you!
Contact me via elfinhe@gmail.com

29/34

# Background
 Mutation analysis, first proposed by Hamlet [Ham77] and
DeMillo et al. [DLS78], is a fault-based testing technique
used to measure the effectiveness of a test suite.
 In mutation analysis, one introduces syntactic code
changes, one at a time, into a program to generate
various faulty programs (called mutants).
 A mutation operator is a change-seeding rule to
generate a mutant from the original program.

[Ham77] R.G. Hamlet, Testing Programs with the Aid of a Compiler, Software Engineering, IEEE Transactions
on, vol. SE-3, (no. 4), pp. 279- 290, 1977.
[DLS78] R.A. DeMillo, R.J. Lipton and F.G. Sayward, Hints on Test Data Selection: Help for the Practicing
Programmer, Computer, vol. 11, (no. 4), pp. 34-41, 1978.
30/34

# Ranking functions
 Tarantula [JHS02], Ochiai [AZV07], χDebug [WQZ+07], and Naish [NLR11]

Table: Ranking functions

[JHS02] J.A. Jones, M.J. Harrold, and J. Stasko. Visualization of test information to assist fault localization. In Proceedings of the
24th International Conference on Software Engineering (ICSE '02), pp. 467-477, 2002.
[AZV07] R. Abreu, P. Zoeteweij, and A.J.C. van Gemund. On the accuracy of spectrum-based fault localization. In Proceedings of
Testing: Academic and Industrial Conference Practice and Research Techniques (TAIC PART-Mutation 2007), pp. 89-98, 2007.
[WQZ+07] W.E. Wong, Yu Qi, Lei Zhao, and Kai-Yuan Cai. Effective Fault Localization using Code Coverage. In Proceedings of the
31st Annual International Computer Software and Applications Conference (COMPSAC '07), Vol. 1, pp. 449-456, 2007.
[NLR11] L. Naish, H.J. Lee, and K. Ramamohanarao. A model for spectra-based software diagnosis. ACM Transactions on Software
Engineering and Methodology, 20(3):11, 2011.
31/34

# Our Approach – Muffler
Input: faulty program, test suite
1. Instrument the program and execute it against the test suite -> coverage and testing results
2. Select statements to mutate -> candidate statements
3. Mutate the selected statements -> mutants
4. Run the mutants against the test suite -> changes of testing results
5. Calculate suspiciousness and sort statements -> ranking list of all statements
Output: ranking list of all statements

Figure: Dataflow diagram of Muffler.
32/34
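The dataflow can be sketched as a single ranking routine. This is a sketch under stated assumptions, not the tool's implementation: the `mutate` and `run` callables are hypothetical hooks standing in for a real mutation engine and test harness; the rule "mutate only statements covered by at least one failed test" is an assumed selection criterion, since the deck does not state one; and the score (Naish-style base minus the average number of passed-to-failed changes) is inferred from the numbers in the deck's worked example.

```python
from typing import Callable, Dict, List, Sequence


def muffler_rank(
    coverage: Dict[str, List[bool]],          # statement -> per-test coverage flags
    passed: List[bool],                       # per-test verdicts of the faulty program
    mutate: Callable[[str], Sequence[str]],   # hypothetical: statement -> mutant ids
    run: Callable[[str], List[bool]],         # hypothetical: mutant id -> per-test verdicts
) -> List[str]:
    """Return statements sorted from most to least suspicious."""
    total_passed = sum(passed)
    scores: Dict[str, float] = {}
    for stmt, covered in coverage.items():
        cov_failed = sum(1 for c, ok in zip(covered, passed) if c and not ok)
        cov_passed = sum(1 for c, ok in zip(covered, passed) if c and ok)
        score = float(cov_failed * (total_passed + 1) - cov_passed)
        if cov_failed:  # assumed selection rule for candidate statements
            changes = [
                sum(1 for ok, now in zip(passed, run(m)) if ok and not now)
                for m in mutate(stmt)
            ]
            if changes:
                score -= sum(changes) / len(changes)  # average mutation impact
        scores[stmt] = score
    return sorted(scores, key=lambda s: scores[s], reverse=True)


# Toy run: two statements with identical coverage, so coverage alone cannot separate them.
passed = [True, True, True, False]
coverage = {"ok": [True] * 4, "fault": [True] * 4}
mutant_runs = {"ok_m": [False] * 4, "fault_m": [True, True, True, False]}
ranking = muffler_rank(coverage, passed, lambda s: [s + "_m"], lambda m: mutant_runs[m])
print(ranking)  # ['fault', 'ok']
```

Passing the mutation engine and the test runner in as callables keeps the sketch runnable without committing to any particular tooling.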

# Our Approach – Muffler



Primary Key: imprecise when multiple faults occur
Secondary Key: invalid when the coincidental correctness % is high
Additional Key: inclined to handle coincidental correctness

33/34

# An Example
Part I: TotalPassed = 2440, TotalFailed = 210
Part II: susp* and r** under Tarantula, Ochiai, χDebug, and Naish

Statement    Passed(s)    Failed(s)    Tarantula (susp, r)    Ochiai (susp, r)    χDebug (susp, r)    Naish (susp, r)

S1 if (block_queue){ 1798 210 0.58 8 0.32 8 205.41 8 510812 8

S2 count = block_queue->mem_count + 1; /* fault: insert ‘+1’ */ 1382 210 0.64 7 0.36 7 205.83 7 511228 7

S3 n = (int) (count*ratio); /* fault: missing ‘+1’ */ 1382 210 0.64 7 0.36 7 205.83 7 511228 7

S4 proc = find_nth(block_queue, n); 1382 210 0.64 7 0.36 7 205.83 7 511228 7

S5 if (proc) { 1382 210 0.64 7 0.36 7 205.83 7 511228 7

S6 block_queue = del_ele(block_queue, proc); 1358 210 0.64 3 0.37 3 205.85 3 511252 3

S7 prio = proc->priority; 1358 210 0.64 3 0.37 3 205.85 3 511252 3

S8 prio_queue[prio] = append_ele(prio_queue[prio], proc);}} 1358 210 0.64 3 0.37 3 205.85 3 511252 3

Code examination effort to locate S2 and S3: 88% 88% 88% 88%

Figure: Faulty version v2 of program “schedule”. 34/34

# An Example

Part III Part IV Muffler

Mutated statement for each mutant    Change p→f (five columns)    Impact    susp    r

M1 if (!block_queue ) { 1644 1798 1101 1101 1644 1457.6 509354.4 8

M2 count = block_queue->mem_count != 1; 249 1097 1097 249 1382 814.8 510413.2 2

M3 n = (int) (count <= ratio) ; 249 1116 1101 494 1101 812.2 510415.8 2

M4 proc = find_nth(block_queue , ratio); 1088 638 1136 744 1382 997.6 510230.4 5

M5 if (!proc) { 1136 1358 1101 1382 1101 1215.6 510012.4 6

M6 block_queue = del_ele(block_queue , proc-1); 1123 349 1358 814 1358 1000.4 510251.6 4

M7 prio /= proc->priority; 1358 1358 1101 1101 1358 1255.2 509996.8 7

M8 prio_queue[prio] = append_ele(prio_queue[__MININT__] , proc); }} 598 598 1138 1358 1101 958.6 510293.4 3

Code examination effort to locate S2 and S3: 25%
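The Muffler column of this example can be reproduced from the other columns. The sketch below assumes a formula inferred from the numbers themselves, since the slide with the ranking function did not survive: Impact is the mean of the five Change p→f values, and the Muffler score is the Naish score minus Impact. The variable and function names are illustrative.

```python
TOTAL_PASSED = 2440  # TotalPassed in Part I of the example

# statement: (Passed(s), Failed(s), Change p->f values from the example)
ROWS = {
    "S1": (1798, 210, [1644, 1798, 1101, 1101, 1644]),
    "S2": (1382, 210, [249, 1097, 1097, 249, 1382]),
    "S3": (1382, 210, [249, 1116, 1101, 494, 1101]),
    "S4": (1382, 210, [1088, 638, 1136, 744, 1382]),
    "S5": (1382, 210, [1136, 1358, 1101, 1382, 1101]),
    "S6": (1358, 210, [1123, 349, 1358, 814, 1358]),
    "S7": (1358, 210, [1358, 1358, 1101, 1101, 1358]),
    "S8": (1358, 210, [598, 598, 1138, 1358, 1101]),
}


def muffler_susp(passed_s, failed_s, changes):
    """Naish-style score minus the average passed-to-failed change (inferred form)."""
    impact = sum(changes) / len(changes)
    return failed_s * (TOTAL_PASSED + 1) - passed_s - impact


SUSP = {stmt: muffler_susp(*row) for stmt, row in ROWS.items()}
RANKING = sorted(SUSP, key=lambda s: SUSP[s], reverse=True)

# Effort to reach both faulty statements (S2 and S3), as a share of all statements.
last_fault_pos = max(RANKING.index("S2"), RANKING.index("S3")) + 1
EFFORT = 100.0 * last_fault_pos / len(RANKING)

print(round(SUSP["S2"], 1), round(SUSP["S3"], 1))  # 510413.2 510415.8
print(EFFORT)  # 25.0
```

Coverage alone ties S2 through S5, so the faulty statements are reached only after 7 of 8 statements; subtracting the mutation impact moves S2 and S3 to the top two positions.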


# Empirical Evaluation
Versus Tarantula    Versus Ochiai    Versus χDebug    Versus Naish

More effective 102 96 93 89

Same effectiveness 19 23 23 25

Less effective 2 4 7 9

Table: Pair-wise comparison between
Muffler and existing techniques.

Muffler is more effective (examining fewer statements before encountering the faulty
statement) than Naish for 89 out of 123 faulty versions; is as effective (examining the same
number of statements before encountering the faulty statement) as Naish for 25 out of 123
faulty versions; and is less effective (examining more statements before encountering the
faulty statement) than Naish for only 9 out of 123 faulty versions.

38/34

 Experience on real faults

Faulty version    CC%    Code examination effort (Naish)    Code examination effort (Muffler)
v5 1% 0% 0%
v9 7% 1% 0%
v17 31% 12% 7%
v28 49% 11% 5%
v29 99% 25% 9%

Table: Results with real faults in space

Five faulty versions are chosen to represent low, medium, and high occurrence of
coincidental correctness. In this table, the column "CC%" presents the percentage of
coincidentally passed test cases out of all passed test cases. The columns under the heading
"Code examination effort" present the percentage of code to be examined before the fault is
encountered.

39/34

 Efficiency analysis
Program suite CBFL (seconds) Muffler (seconds)
tcas 18.00 868.68
tot_info 11.92 573.12
schedule 34.02 2703.01
schedule2 27.76 1773.14
print_tokens 59.11 2530.17
print_tokens2 62.07 5062.87
replace 69.13 4139.19
Average 40.29 2521.46
Table: Time spent by each technique on subject programs.

We have shown experimentally that, by taking advantage of both coverage and mutation
impact, Muffler outperforms Naish regardless of the occurrence of coincidental correctness.
Unfortunately, our approach, Muffler, needs to execute a large number of mutants to compute mutation
impact. Executing the mutants against the test suite increases the time cost of fault
localization. The time mainly comprises the cost of instrumentation, execution, and coverage
collection. From this table, we observe that Muffler takes approximately 62.59 times the
average time of the Naish technique.
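The 62.59 ratio can be checked against the per-program timings in the table. A minimal sketch, assuming the averages are plain means over the seven listed programs (the deck does not say how they were computed); the variable names are illustrative.

```python
# Seconds per program suite, in table order: tcas, tot_info, schedule,
# schedule2, print_tokens, print_tokens2, replace.
cbfl_seconds = [18.00, 11.92, 34.02, 27.76, 59.11, 62.07, 69.13]
muffler_seconds = [868.68, 573.12, 2703.01, 1773.14, 2530.17, 5062.87, 4139.19]

avg_cbfl = sum(cbfl_seconds) / len(cbfl_seconds)
avg_muffler = sum(muffler_seconds) / len(muffler_seconds)
ratio = avg_muffler / avg_cbfl

print(round(avg_cbfl, 2))  # 40.29
print(round(ratio, 2))     # 62.59
```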
40/34

 Efficiency analysis
Program suite    Mutated statements    Total statements    Mutants    Time per mutant (seconds)
tcas 40.15 65.10 199.90 4.26
tot_info 39.57 122.96 191.87 2.92
schedule 80.60 150.20 351.60 7.59
schedule2 75.33 127.56 327.78 5.32
print_tokens 67.43 189.86 260.29 9.49
print_tokens2 86.67 199.44 398.67 12.54
replace 71.14 242.86 305.93 13.30
Average 56.52 142.79 256.90 7.92

Table: Information about mutants generated.

This table gives detailed data on the number of mutated and total executable
statements, the number of mutants generated, and the time cost of running each mutant. For
example, for the program tcas, on average 40.15 statements are mutated by
Muffler out of 65.10 executable statements in total; 199.90 mutants are generated, and it takes
4.26 seconds to run each of them, on average. Notice that there is no need to collect coverage
from the mutants' executions, and running a mutant without
instrumentation and coverage collection takes about 1/4 of the time.
41/34
