Muffler: An Approach Using Mutation
to Facilitate Fault Localization
                                                        Tao He
                                              elfinhe@gmail.com
           Department of Computer Science, Sun Yat-Sen University
          Department of Computer Science and Engineering, HKUST

                                               Group Discussion
                                                  February 2012
                                       HKUST, Hong Kong, China




Outline

   Background
   Motivation
   Why does our approach work?
   Our Approach – Muffler
   Empirical Evaluation
   Conclusion



Background

   Coverage-Based Fault Localization (CBFL)
       Input
          Coverage
          Testing results (passed or failed)

       Output
            A ranking list of statements
       Ranking functions
            Most CBFL techniques are similar to each other,
             except that they use different ranking functions to
             compute suspiciousness.
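
To illustrate how such techniques differ only in the ranking function, the sketch below implements the four functions compared later in this deck (Tarantula, Ochiai, χDebug/Wong3, and Naish) using their commonly published definitions. The deck's own formula table did not survive extraction, so these definitions are assumptions, and the parameter names (`passed_s`, `failed_s`, `total_passed`, `total_failed`) are illustrative. Naish is written in an integer-scaled form because that form reproduces the scores in the "schedule" v2 example in the backup slides.

```python
import math


def tarantula(passed_s, failed_s, total_passed, total_failed):
    """Tarantula: failed ratio over the sum of failed and passed ratios."""
    fail_ratio = failed_s / total_failed
    pass_ratio = passed_s / total_passed
    if fail_ratio + pass_ratio == 0:
        return 0.0  # statement not covered by any test
    return fail_ratio / (fail_ratio + pass_ratio)


def ochiai(passed_s, failed_s, total_passed, total_failed):
    """Ochiai: failed coverage over a geometric-mean style denominator."""
    denom = math.sqrt(total_failed * (failed_s + passed_s))
    return failed_s / denom if denom else 0.0


def wong3(passed_s, failed_s, total_passed, total_failed):
    """Wong3 (listed as χDebug in the deck): later passed runs weigh less."""
    if passed_s <= 2:
        h = passed_s
    elif passed_s <= 10:
        h = 2 + 0.1 * (passed_s - 2)
    else:
        h = 2.8 + 0.001 * (passed_s - 10)
    return failed_s - h


def naish(passed_s, failed_s, total_passed, total_failed):
    """Naish, scaled by (total_passed + 1) so that scores stay integral."""
    return failed_s * (total_passed + 1) - passed_s


# Statement S1 of "schedule" v2: covered by 1798 passed and 210 failed runs,
# out of 2440 passed and 210 failed runs in total.
s1 = (1798, 210, 2440, 210)
scores = {fn.__name__: fn(*s1) for fn in (tarantula, ochiai, wong3, naish)}
print(scores)
```

All four functions take the same coverage counts and return a score where higher means more suspicious, which is why they are interchangeable inside a CBFL tool.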

What is the limitation of existing
CBFL techniques?




Motivation

                     One fundamental assumption [YPW08] of CBFL
                           The observed behaviors from passed runs can precisely
                            represent the correct behaviors of this program;
                           and the observed behaviors from failed runs can represent the
                            incorrect behaviors.
                           Therefore, the different observed behaviors of program
                            entities between passed runs and failed runs will indicate the
                            fault’s location.
                           But this does not always hold.


[YPW08] C. Yilmaz, A. Paradkar, and C. Williams. Time will tell: fault localization using time spectra. In Proceedings
of the 30th international conference on Software engineering (ICSE '08). ACM, New York, NY, USA, 81-90. 2008.
Motivation
            Coincidental Correctness (CC)
                 “No failure is detected, even though a fault has been executed.” [RT93]
                 i.e., the passed runs may cover the fault.
            CC weakens the first part of CBFL's assumption:
                 The observed behaviors from passed runs can precisely represent
                  the correct behaviors of this program;
                 Moreover, CC occurs frequently in practice [MAE+09].



[RT93] D.J. Richardson and M.C. Thompson, An analysis of test data selection criteria using the RELAY model of
fault detection, Software Engineering, IEEE Transactions on, vol. 19, (no. 6), pp. 533-553, 1993.
[MAE+09] W. Masri, R. Abou-Assi, M. El-Ghali, and N. Al-Fatairi, An empirical study of the factors that reduce the
effectiveness of coverage-based fault localization, in Proceedings of the 2nd International Workshop on Defects in
Large Software Systems: Held in conjunction with the ACM SIGSOFT International Symposium on Software Testing
and Analysis (ISSTA 2009), pp. 1-5, 2009.
Our goal is to address the CC issue via mutation analysis
What is the idea?




Why does our approach work?
- Key hypothesis
   Mutating the faulty statement tends to maintain the
    results of passed test cases.
   By contrast, mutating a correct statement tends to
    change the results of passed test cases (from passed to
    failed).
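
The hypothesis rests on one measurement: how many test cases that passed on the original (faulty) program fail on a mutant. A minimal sketch, assuming test outcomes are stored in hypothetical dictionaries that map test names to `True` (passed) or `False` (failed):

```python
def count_pass_to_fail(original_results, mutant_results):
    """Count tests that passed on the original program but fail on the mutant."""
    return sum(
        1
        for test, passed in original_results.items()
        if passed and not mutant_results[test]
    )


# Hypothetical outcomes: t1..t3 pass and t4 fails on the faulty program.
original = {"t1": True, "t2": True, "t3": True, "t4": False}
# Mutating a correct statement tends to break passed tests.
mutant_of_correct_stmt = {"t1": False, "t2": False, "t3": False, "t4": False}
# Mutating the faulty statement tends to leave passed tests passing.
mutant_of_faulty_stmt = {"t1": True, "t2": True, "t3": True, "t4": False}

changes_correct = count_pass_to_fail(original, mutant_of_correct_stmt)
changes_faulty = count_pass_to_fail(original, mutant_of_faulty_stmt)
print(changes_correct, changes_faulty)  # 3 0
```

Under the hypothesis, a statement whose mutants produce few such changes is a stronger candidate for the fault.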




Why does our approach work?
- Three comprehensive scenarios (1/3)
  - If we mutate a statement M in a different basic block from F
   [Diagram: test cases (passed/failed), program with fault point F and mutant point M side by side, test results]
      3 test results change from passed to failed
Why does our approach work?
- Three comprehensive scenarios (1/3)
  - If we mutate a statement M in a different basic block from F
   [Diagram: test cases (passed/failed), program with mutant point M above fault point F, test results]
      3 test results change from passed to failed
Why does our approach work?
- Three comprehensive scenarios (1/3)
  - If we mutate F
   [Diagram: test cases (passed/failed), program with the mutant point M on the fault point F (F+M), test results]
      0 test results change from passed to failed
Why does our approach work?
- Three comprehensive scenarios (2/3)
  - If we mutate a statement M in the same basic block as F
  - Results still change, because M affects the output through a different data flow
   [Diagram: test cases (passed/failed), program with fault point F and mutant point M, control-flow and data-flow edges, test results]
      3 test results change from passed to failed
Why does our approach work?
- Three comprehensive scenarios (2/3)
  - If we mutate F
   [Diagram: test cases (passed/failed), program with the mutant point M on the fault point F (F+M), control-flow and data-flow edges, test results]
      0 test results change from passed to failed
Why does our approach work?
- Three comprehensive scenarios (3/3)
  - When CC occurs frequently
  - If we mutate F
  - Results do not change, because F has only a weak ability to affect the output:
    a weak ability to generate an infectious state or to propagate the
    infectious state to the output
   [Diagram: test cases (passed/failed), program with the mutant point M on the fault point F (F+M), test results]
      0 test results change from passed to failed
Does this work in real programs?




Why does our approach work?
            - A feasibility study
   [Figure: eight panels, one per faulty version: tcas v7, tot_info v17, schedule v4, schedule2 v1,
    print_tokens v7, print_tokens2 v3, replace v24, space v20]
                             Figure: Distribution of statements’ result changes
                               and faulty statement’s testing result changes.
       The vertical axis denotes the number of testing result changes (from 'passed' to
       'failed'), and the horizontal width denotes the probability density at the corresponding
       number of testing result changes.
Why does our approach work?
    - Another feasibility study (when CC% ≥ 95%)
   [Figure: horizontal axis: percentage of code examined (0% to 80%);
    vertical axis: frequency of faulty versions (0 to 25).
    Series: Result changes (avg. 16.33%), Naish (avg. 47.55%)]
                                            Figure: Frequency distribution of effectiveness
                                                          when CC% ≥ 95%.
   When CC% is greater than or equal to 95%, ranking by result changes reduces the
    code examination effort by 65.66% (= 100% - 16.33%/47.55%) relative to Naish.
   With Naish, only 6 faulty versions require examining less than 20% of the
    statements; with result changes, 22 versions do.
How to design our new ranking
function?




Our Approach – Muffler

            




[NLR11] L. Naish, H. J. Lee, and K. Ramamohanarao, A model for spectra-based software diagnosis. ACM
Transactions on Software Engineering and Methodology, 20(3):11, 2011.
How do we evaluate our approach?
What is the result?




Empirical Evaluation






    Program suite    Number of    Lines of           Number of     LOC
                     versions     executable code    test cases
    tcas                41          63-67               1608       133-137
    tot_info            23         122-123              1052       272-273
    schedule             9         149-152              2650       290-294
    schedule2           10         127-129              2710       261-263
    print_tokens         7         189-190              4130       341-343
    print_tokens2       10         199-200              4115       350-355
    replace             32         240-245              5542       508-515
    space               38        3633-3647            13585      5882-5904
Empirical Evaluation
   [Figure: horizontal axis: percentage of code examined (0% to 100%);
    vertical axis: percentage of faults located (0% to 100%).
    Techniques: Muffler, Naish, Ochiai, Tarantula, Wong3]
                                             Figure: Overall effectiveness comparison.
Empirical Evaluation
    % of code
    examined     Tarantula    Ochiai     χDebug      Naish     Muffler
       1%            14         18         19         21          35
       5%            38         48         56         58          74
       10%           54         63         68         68          85
       15%           57         65         80         80          94
       20%           60         67         84         84          99
       30%           79         88         91         92         110
       40%           92         98         98         99         117
       50%           98         99        101        102         121
       60%           99        103        105        106         123
       70%          101        107        117        119         123
       80%          114        122        122        122         123
       90%          123        123        123        123         123
      100%          123        123        123        123         123
        Table: Number of faults located at different levels of code
               examination effort using Naish and Muffler.

    When 1% of the statements have been examined, Naish can reach the
     fault in 17.07% of faulty versions. At the same time, Muffler can reach
     the fault in 28.46% of faulty versions.
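
The percentages quoted for the 1% row follow directly from the counts in the table and the 123 faulty versions. A small sketch of that conversion (the helper name `percent_located` is illustrative):

```python
TOTAL_VERSIONS = 123  # faulty versions in the study


def percent_located(count, total=TOTAL_VERSIONS):
    """Share of faulty versions whose fault is reached, as a percentage."""
    return round(100.0 * count / total, 2)


# Faults located after examining 1% of the statements (first table row).
located_at_1_percent = {
    "Tarantula": 14,
    "Ochiai": 18,
    "χDebug": 19,
    "Naish": 21,
    "Muffler": 35,
}
percentages = {name: percent_located(n) for name, n in located_at_1_percent.items()}
print(percentages["Naish"], percentages["Muffler"])  # 17.07 28.46
```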
Empirical Evaluation
               Tarantula       Ochiai        χDebug          Naish         Muffler
    Min           0.00           0.00          0.00           0.00           0.00
   Max           87.89          84.25          93.85         78.46          55.38
 Median          20.33           9.52          7.69           7.32           3.25
   Mean          27.68          23.62          20.04         19.34           9.62
   Stdev         28.29          26.36          24.61         23.86          13.22

               Table: Statistics of code examination effort.

Among these five techniques, Muffler scores best in the rows that correspond to
the minimum, median, and mean code examination effort. In addition, Muffler has a much
lower standard deviation, which means that its performance varies less widely than that of
the others and is more stable in terms of effectiveness. The results also show that
Muffler reduces the average code examination effort relative to Naish by 50.26%
(= 100% - 9.62%/19.34%).
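
The reduction figure can be reproduced from the "Mean" row of the table. A minimal sketch (the function name `relative_reduction` is illustrative):

```python
def relative_reduction(baseline, improved):
    """Percentage by which `improved` is lower than `baseline`."""
    return 100.0 * (1.0 - improved / baseline)


# Mean code examination effort (%) taken from the statistics table.
mean_effort = {"Naish": 19.34, "Muffler": 9.62}
reduction = relative_reduction(mean_effort["Naish"], mean_effort["Muffler"])
print(f"{reduction:.2f}%")  # 50.26%
```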

How about the coincidental
correctness issue?




Conclusion and future work
   We propose Muffler, a technique using mutation to
    help locate program faults.
   On 123 faulty versions of seven programs, we compare
    effectiveness and efficiency with the Naish technique.
    The results show that Muffler reduces the
    average code examination effort on each faulty version
    by 50.26%.
   For future work, we plan to generalize our approach to
    locate faults in multi-fault programs.



Q&A




Thank you!
Contact me via elfinhe@gmail.com




# Background
                  Mutation analysis, first proposed by Hamlet [Ham77] and
                   DeMillo et al. [DLS78], is a fault-based testing technique
                   used to measure the effectiveness of a test suite.
                  In mutation analysis, one introduces syntactic code
                   changes, one at a time, into a program to generate
                   various faulty programs (called mutants).
                  A mutation operator is a change-seeding rule to
                   generate a mutant from the original program.

[Ham77] R.G. Hamlet, Testing Programs with the Aid of a Compiler, Software Engineering, IEEE Transactions
on, vol. SE-3, (no. 4), pp. 279- 290, 1977.
[DLS78] R.A. DeMillo, R.J. Lipton and F.G. Sayward, Hints on Test Data Selection: Help for the Practicing
Programmer, Computer, vol. 11, (no. 4), pp. 34-41, 1978.
# Ranking functions
                       Tarantula [JHS02], Ochiai [AZV07], χDebug [WQZ+07], and Naish [NLR11]




                                               Table: Ranking functions

[JHS02] J.A. Jones, M. J. Harrold, and J. Stasko. Visualization of test information to assist fault localization. In Proceedings of the
24th International Conference on Software Engineering (ICSE '02), pp. 467-477, 2002.
[AZV07] R. Abreu, P. Zoeteweij and A.J.C. Van Gemund, On the accuracy of spectrum-based fault localization, in Proceedings -
Testing: Academic and Industrial Conference Practice and Research Techniques, TAIC PART-Mutation 2007, pp. 89-98, 2007.
[WQZ+07] W.E. Wong, Yu Qi, Lei Zhao, and Kai-Yuan Cai. Effective Fault Localization using Code Coverage. In Proceedings of the
31st Annual International Computer Software and Applications Conference (COMPSAC '07), Vol. 1, pp. 449-456, 2007.
[NLR11] L. Naish, H. J. Lee, and K. Ramamohanarao, A model for spectra-based software diagnosis. ACM Transactions on Software
Engineering and Methodology, 20(3):11, 2011.
# Our Approach – Muffler
     Input: Faulty Program, Test Suite

       Instrument program & execute against test suite
             -> Coverage & Testing Results
       Select statements to mutate
             -> Candidate Statements
       Mutate selected statements
             -> Mutants
       Run mutants against test suite
             -> Changes of testing results
       Calculate suspiciousness & sort statements
             -> Ranking List of all statements

     Output: Ranking List of all statements

     Figure: Dataflow diagram of Muffler.
# Our Approach – Muffler




         Primary Key       Secondary Key          Additional Key
         (imprecise when   (invalid when the      (inclined to handle
         multiple faults   coincidental           coincidental correctness)
         occur)            correctness
                           percentage is high)




# An Example
                                                                            TotalPassed   TotalFailed   Part II


Part I                                                                         2440          210         Tarantula       Ochiai        χDebug          Naish


                                      Statement                              Passed(s)     Failed(s)    susp*     r**   susp      r    susp     r    susp      r


 S1      if (block_queue){                                                     1798          210        0.58      8     0.32      8   205.41    8   510812     8


 S2        count = block_queue->mem_count + 1; /* fault: insert ‘+1’ */        1382          210        0.64      7     0.36      7   205.83    7   511228     7


 S3        n = (int) (count*ratio); /* fault: missing ‘+1’ */                  1382          210        0.64      7     0.36      7   205.83    7   511228     7


 S4        proc = find_nth(block_queue, n);                                    1382          210        0.64      7     0.36      7   205.83    7   511228     7


 S5        if (proc) {                                                         1382          210        0.64      7     0.36      7   205.83    7   511228     7


 S6          block_queue = del_ele(block_queue, proc);                         1358          210        0.64      3     0.37      3   205.85    3   511252     3

 S7          prio = proc->priority;                                            1358          210        0.64      3     0.37      3   205.85    3   511252     3

 S8          prio_queue[prio] = append_ele(prio_queue[prio], proc);}}          1358          210        0.64      3     0.37      3   205.85    3   511252     3


                                                       Code examination effort to locate S2 and S3:         88%           88%            88%            88%


                                      Figure: Faulty version v2 of program “schedule”.
# An Example

Part III                                                                                                                                      Part IV       Muffler


                                  Mutated statement for each mutant               Changep→f   Changep→f   Changep→f   Changep→f   Changep→f   Impact      susp           r


M1         if (!block_queue ) {                                                     1644       1798       1101        1101        1644        1457.6    509354.4         8


M2           count = block_queue->mem_count != 1;                                   249        1097       1097         249        1382         814.8    510413.2         2


M3           n = (int) (count <= ratio) ;                                           249        1116       1101         494        1101         812.2    510415.8         2


M4           proc = find_nth(block_queue , ratio);                                  1088       638        1136         744        1382         997.6    510230.4         5


M5           if (!proc) {                                                           1136       1358       1101        1382        1101        1215.6    510012.4         6


M6             block_queue = del_ele(block_queue , proc-1);                         1123       349        1358         814        1358        1000.4    510251.6         4


M7             prio /= proc->priority;                                              1358       1358       1101        1101        1358        1255.2    509996.8         7


M8             prio_queue[prio] = append_ele(prio_queue[__MININT__] , proc); }}     598        598        1138        1358        1101         958.6    510293.4         3


                                                                                      Code examination effort to locate S2 and S3:                           25%


                                        Figure: Faulty version v2 of program “schedule”.
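
The numbers in Parts III and IV are consistent with the following reading, which is inferred from the figure rather than stated as a formula on the slide: Impact is the mean of a statement's passed-to-failed changes over its mutants, and the Muffler suspiciousness is the Naish score minus Impact. A minimal sketch under that assumption, using the rows for S1 to S3 (function names are illustrative):

```python
TOTAL_PASSED = 2440  # passed runs of "schedule" v2, from Part I


def naish(passed_s, failed_s, total_passed=TOTAL_PASSED):
    """Naish score in the scaled form that matches Part II of the figure."""
    return failed_s * (total_passed + 1) - passed_s


def impact(changes):
    """Mean number of passed-to-failed result changes over the mutants."""
    return sum(changes) / len(changes)


def muffler(passed_s, failed_s, changes):
    """Assumed Muffler score: Naish score lowered by the mutation impact."""
    return naish(passed_s, failed_s) - impact(changes)


# statement: (Passed(s), Failed(s), passed-to-failed changes of its five mutants)
rows = {
    "S1": (1798, 210, [1644, 1798, 1101, 1101, 1644]),
    "S2": (1382, 210, [249, 1097, 1097, 249, 1382]),
    "S3": (1382, 210, [249, 1116, 1101, 494, 1101]),
}
susp = {stmt: round(muffler(*row), 1) for stmt, row in rows.items()}
ranking = sorted(susp, key=susp.get, reverse=True)
print(susp)
print(ranking)
```

Because mutants of the faulty statements S2 and S3 break fewer passed tests, their scores drop less than that of S1, which moves them up the ranking.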
# Empirical Evaluation
                              Versus          Versus          Versus          Versus
                            Tarantula         Ochiai         χDebug           Naish

     More effective             102              96             93              89

   Same effectiveness            19              23             23              25

      Less effective              2              4               7               9

                    Table: Pair-wise comparison between
                      Muffler and existing techniques.

Muffler is more effective (examining fewer statements before encountering the faulty
statement) than Naish for 89 out of 123 faulty versions; is as effective (examining the same
number of statements before encountering the faulty statement) as Naish for 25 out of 123
faulty versions; and is less effective (examining more statements before encountering the
faulty statement) than Naish for only 9 out of 123 faulty versions.


# Empirical Evaluation
   Experience on real faults

         Faulty versions         CC%               Code examination effort
                                                   Naish           Muffler
               v5                1%                 0%               0%
               v9                7%                 1%               0%
               v17               31%               12%               7%
               v28               49%               11%               5%
               v29               99%               25%               9%

                 Table: Results with real faults in space



Five faulty versions are chosen to represent low, medium, and high occurrence of
coincidental correctness. In this table, the column “CC%” presents the percentage of
coincidentally passed test cases out of all passed test cases. The columns under the head
“Code examination effort” present the percentage of code to be examined before the fault is
encountered.


# Empirical Evaluation
   Efficiency analysis
    Program suite                 CBFL (seconds)                   Muffler (seconds)
          tcas                       18.00                              868.68
       tot_info                      11.92                              573.12
       schedule                      34.02                             2703.01
      schedule2                      27.76                             1773.14
     print_tokens                    59.11                             2530.17
    print_tokens2                    62.07                             5062.87
        replace                      69.13                             4139.19
       Average                       40.29                             2521.46
     Table: Time spent by each technique on subject programs.

We have shown experimentally that, by taking advantage of both coverage and mutation
impact, Muffler outperforms Naish regardless of how often coincidental correctness occurs.
Unfortunately, Muffler needs to execute a large number of mutants to compute mutation
impact, and executing the mutants against the test suite increases the time cost of fault
localization. The time mainly consists of the cost of instrumentation, execution, and coverage
collection. From this table, we observe that Muffler takes approximately 62.59 times the
average time cost of the Naish technique.
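
The 62.59 figure can be reproduced from the table as the ratio of summed times. The sketch below assumes that the "CBFL (seconds)" column is the Naish baseline mentioned in the text; the variable and function names are illustrative.

```python
# Fault-localization time in seconds, copied from the table above.
cbfl_seconds = {
    "tcas": 18.00, "tot_info": 11.92, "schedule": 34.02, "schedule2": 27.76,
    "print_tokens": 59.11, "print_tokens2": 62.07, "replace": 69.13,
}
muffler_seconds = {
    "tcas": 868.68, "tot_info": 573.12, "schedule": 2703.01, "schedule2": 1773.14,
    "print_tokens": 2530.17, "print_tokens2": 5062.87, "replace": 4139.19,
}

def slowdown(muffler, baseline):
    """Ratio of total Muffler time to total baseline time."""
    return sum(muffler.values()) / sum(baseline.values())

overall = slowdown(muffler_seconds, cbfl_seconds)
per_program = {p: muffler_seconds[p] / cbfl_seconds[p] for p in cbfl_seconds}
print(f"overall slowdown: {overall:.2f}x")  # overall slowdown: 62.59x
```

The `per_program` dictionary shows how the slowdown varies between subject programs, which the single average hides.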
                                                                                              40/34
# Empirical Evaluation
   Efficiency analysis
   Program               Mutated                Total            Mutants     Time per mutant
     suite              statements           statements                          (seconds)
      tcas                 40.15                65.10             199.90            4.26
   tot_info                39.57               122.96             191.87            2.92
   schedule                80.60               150.20             351.60            7.59
  schedule2                75.33               127.56             327.78            5.32
 print_tokens              67.43               189.86             260.29            9.49
print_tokens2              86.67               199.44             398.67           12.54
    replace                71.14               242.86             305.93           13.30
   Average                 56.52               142.79             256.90            7.92

                Table: Information about mutants generated.

This table gives the number of mutated and total executable statements, the number of
mutants generated, and the time cost of running each mutant. For example, for the program
tcas there are, on average, 40.15 statements mutated by Muffler out of 65.10 executable
statements in total; 199.90 mutants are generated, and each takes 4.26 seconds to run on
average. Notice that there is no need to collect coverage from the mutants' executions, and
running a mutant without instrumentation and coverage collection takes about a quarter of
the time of an instrumented run.
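
A rough consistency check between the two efficiency tables: multiplying the average number of mutants by the average time per mutant approximates the time spent executing mutants. This is only a sketch under the assumption that the per-program averages can be multiplied directly; the function name is illustrative. For tcas the product is about 851.6 seconds, against a reported Muffler total of 868.68 seconds.

```python
# (mutants generated, seconds per mutant, total Muffler seconds),
# copied from the two efficiency tables.
rows = {
    "tcas": (199.90, 4.26, 868.68),
    "tot_info": (191.87, 2.92, 573.12),
    "schedule": (351.60, 7.59, 2703.01),
    "schedule2": (327.78, 5.32, 1773.14),
    "print_tokens": (260.29, 9.49, 2530.17),
    "print_tokens2": (398.67, 12.54, 5062.87),
    "replace": (305.93, 13.30, 4139.19),
}

def mutant_execution_estimate(mutants, seconds_per_mutant):
    """Estimated seconds spent running all mutants of one faulty version."""
    return mutants * seconds_per_mutant

for program, (mutants, per_mutant, total) in rows.items():
    estimate = mutant_execution_estimate(mutants, per_mutant)
    print(f"{program}: about {estimate:.0f}s of {total:.0f}s ({estimate / total:.0%})")
```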
                                                                                                 41/34

Editor's Notes

  • #4: I assume that you already know many of these techniques, so I give only a quick review.
  • #7: Please find another definition, using passed runs to describe CC.
  • #35: Please remember to annotate the CC, e.g., 1382. Please remember to add animation.
  • #37: Please remember to annotate the CC, e.g., 1382. Please remember to add animation.
  • #42: It is worthwhile to mention that Muffler’s time cost can be greatly reduced with a simple test selection strategy: do not re-run a test case that does not cover the mutated statement. Furthermore, because the executions of mutants do not depend on each other, we can parallelize them with little effort. Nonetheless, we have to admit that Muffler needs more time to offer better effectiveness in fault localization.
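
The test selection and parallelization ideas in note #42 can be sketched as below. This is a minimal illustration under stated assumptions, not Muffler's implementation: `select_tests`, `run_mutant`, `evaluate_mutants`, and the `run_test` callback are hypothetical names, and coverage is assumed to come from the instrumented run of the original faulty program.

```python
from concurrent.futures import ThreadPoolExecutor

def select_tests(coverage, mutated_statement):
    """Keep only the tests whose coverage includes the mutated statement."""
    return [t for t, covered in coverage.items() if mutated_statement in covered]

def run_mutant(mutant, tests, run_test):
    """Run the selected tests on one mutant and return those that now fail."""
    return [t for t in tests if not run_test(mutant, t)]

def evaluate_mutants(mutants, coverage, run_test, workers=4):
    """mutants maps a mutant id to the statement it mutates.

    Mutants are independent of each other, so they are evaluated concurrently.
    """
    def job(item):
        mutant, statement = item
        tests = select_tests(coverage, statement)
        return mutant, run_mutant(mutant, tests, run_test)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(job, mutants.items()))

# Hypothetical data: coverage of three passed tests and two mutants.
coverage = {"t1": {1, 2}, "t2": {2, 3}, "t3": {3}}
mutants = {"m1": 1, "m2": 3}

def fake_run_test(mutant, test):
    """Stand-in for executing a test on a mutant; here t2 always fails."""
    return test != "t2"

print(evaluate_mutants(mutants, coverage, fake_run_test))  # {'m1': [], 'm2': ['t2']}
```

Threads are enough here because a real `run_test` would mostly wait on an external process running the mutant; a process pool would be the alternative if the work were CPU-bound in Python.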