Data Mining for Software Engineering


          Tao Xie                                     Jian Pei
North Carolina State University               Simon Fraser University
 www.csc.ncsu.edu/faculty/xie                   www.cs.sfu.ca/~jpei
      xie@csc.ncsu.edu                            jpei@cs.sfu.ca



              An up-to-date version of this tutorial is available at
                     http://ase.csc.ncsu.edu/dmse/dmse.pdf
      Attendees of the tutorial are encouraged to download the latest
                        version 2-3 days before the tutorial
Outline
• Introduction
• What software engineering tasks can be
  helped by data mining?
• What kinds of software engineering data can
  be mined?
• How are data mining techniques used in
  software engineering?
• Case studies
• Conclusions
T. Xie and J. Pei: Data Mining for Software Engineering   2
Introduction
• A large amount of data is produced in
  software development
   – Data from software repositories
   – Data from program executions
• Data mining techniques can be used to
  analyze software engineering data
   – Understand software artifacts or processes
   – Assist software engineering tasks

Examples
• Data in software development
   – Programming: versions of programs
   – Testing: execution traces
   – Deployment: error/bug reports
   – Reuse: open source packages
• Software development needs data analysis
   – How should I use this class?
   – Where are the bugs?
   – How to implement a typical functionality?
Overview

[Overview diagram, three layers:]

• Software engineering tasks helped by data mining: programming, defect
  detection, testing, debugging, maintenance
• Data mining techniques: classification, association/patterns, clustering, …
• Software engineering data: code bases, change history, program states,
  structural entities, bug reports
Software Engineering Tasks
•     Programming
•     Static defect detection
•     Testing
•     Debugging
•     Maintenance




Software Categorization – Why?
• SourceForge hosts 70,000+ software systems
   – How can one find the software needed?
   – How can developers collaborate effectively?
• Why software categorization?
   – SourceForge categorizes software according to their
     primary function (editors, databases, etc.)
         • Software foundries – related software
   – Keep developers informed about related software
         • Learn the “best practice”
         • Promote software reuse
                                                          [Kawaguchi et al. 04]
Software Categorization – What?
• Organize software systems into categories
   – Software systems in each category share a
     common theme
   – A software system may belong to one or
     multiple categories
• What are the categories?
   – Defined by domain experts manually
   – Discovered automatically
• Example system: MUDABlue                                [Kawaguchi et al. 04]
Version Comparison and Search
• What does the current code segment look
  like in previous versions?
   – How has it changed across versions?
• Using standard search tools, e.g., grep?
   – Source code may not be well documented
   – The code may be changed
• Can we have some source code friendly
  search engines?
   – E.g., www.koders.com, corp.krugle.com,
     demo.spars.info
Software Library Reuse
• Issues in reusing software libraries
   – Which components should I use?
   – What is the right way to use them?
   – Multiple components are often used in
     combination, e.g., Smalltalk’s Model/View/Controller
• Frequent patterns help
   – Specifically, inheritance information is important
   – Example: most application classes inheriting from library
     class Widget tend to override its member function
     paint(); most application classes instantiating library
     class Painter and calling its member function
     begin() also call end()
                                                          [Michail 99/00]
API Usage
• How should an API be used correctly?
   – An API may serve multiple functionalities
   – Different styles of API usage
• “I know what type of object I need, but I don’t know
  how to write the code to get the object” [Mandelin
  et al. 05]
   – Can we synthesize jungloid code fragments
     automatically?
   – Given a simple query describing the desired code in
     terms of input and output types, return a code segment
• “I know what method call I need, but I don’t know
  how to write code before and after this method
  call” [Xie & Pei 06]
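The jungloid idea above can be sketched as a graph search over API signatures: types are nodes, method signatures are edges, and a query (input type, output type) is answered by the shortest call chain. The mini API table below is invented for illustration; it is not Prospector's actual signature database:

```python
from collections import deque

# Hypothetical mini API: (receiver type, method) -> return type.
API = {
    ("File", "toURI"): "URI",
    ("URI", "toURL"): "URL",
    ("URL", "openStream"): "InputStream",
}

def find_chain(input_type, output_type):
    """Breadth-first search over API signatures: the shortest call
    chain that turns input_type into output_type."""
    queue = deque([(input_type, [])])
    seen = {input_type}
    while queue:
        t, chain = queue.popleft()
        if t == output_type:
            return chain
        for (recv, meth), ret in API.items():
            if recv == t and ret not in seen:
                seen.add(ret)
                queue.append((ret, chain + [f"{recv}.{meth}()"]))
    return None  # no chain of calls produces the requested type

print(find_chain("File", "InputStream"))
```

A real tool would rank the many possible chains; the sketch only returns one shortest chain.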
How Can Data Mining Help?
• Identify characteristic usage of the library
  automatically
• Understand the reuse of library classes from
  real-life applications instead of toy programs
• Keep reuse patterns up to date w.r.t. the
  most recent version of the library and
  applications
• General patterns may cover inheritance
  cases
Software Engineering Tasks
•     Programming
•     Static defect detection
•     Testing
•     Debugging
•     Maintenance




Locating Matching Method Calls
• Many bugs due to unmatched method calls
   – E.g., fail to call free() to deallocate a data
     structure
   – One-line code changes: many bugs can be
     fixed by changing only one line of source code
• Problem: how to find highly correlated pairs
  of method calls
   – E.g., <fopen, fclose>, <malloc, free>

                             [Li&Zhou 05, Livshits&Zimmermann 05, Yang et al. 06]
Inferring Errors from Source Code

• A system must follow some correctness
  rules
   – Unfortunately, the rules are often documented or
     specified only in an ad hoc manner
• Deriving the rules requires a lot of a priori
  knowledge
• Can we detect some errors without knowing
  the rules by data mining?
                                                          [Engler et al. 01]
Inference in Large Systems
• Execution traces → inferred properties → static
  checker
• Inference algorithms need to be scalable with the
  size of the programs and the input traces
• Only imperfect traces are available in industrial
  environments – how can those imperfect traces be used?
• Many inferred properties may be uninteresting; it is
  hard for a developer to review those properties
  thoroughly for large programs
                                                          [Yang et al. 06]
Detecting Copy-Paste and Bugs
• Copy-pasted code is common in large
  systems
   – Code reuse
• Prone to bugs
   – E.g., identifiers are not changed consistently
• How to detect copy-paste code?
   – How to scale up to large software?
   – How to handle minor modifications?
                                                          [Li et al. 04]
Software Engineering Tasks
•     Programming
•     Static defect detection
•     Testing
•     Debugging
•     Maintenance




Inspecting Test Behavior
• Automatically generated tests or field
  executions lack test oracles
   – Sample/summarize behavior for inspection
• Examples:
   – Select tests (executions/outputs) for inspection
         • E.g., clustering path/branch profiles [Podgurski et al.
           01, Bowring et al. 04]
   – Summarize object behavior [Xie&Notkin 04,
     Dallmeier et al. 06]


Mining Object Behavior
• Can we discover the undocumented behavior of
  classes? Such behavior may not be observable
  from the program source code directly




[Figure omitted: behavior model for the Java Vector class. Picture from
“Mining object behavior with ADABU” [Dallmeier et al. WODA 06]]
Mining Specifications
• Specifications are very useful for testing
   – test generation + test oracle
• Major obstacle: protocol specifications are
  often unavailable
   – Example: what is the right way to use the socket
     API?
• How can data mining help?
   – If a protocol holds in well-tested programs (i.e.,
     their executions), the protocol is likely valid
                                                           [Ammons et al. 02]
Specification Helps

[Figure: a code fragment checked against a mined socket-API specification]

Does the code in the figure follow the correct socket API protocol?
[Ammons et al. 02]
Software Engineering Tasks
•     Programming
•     Static defect detection
•     Testing
•     Debugging
•     Maintenance




Fault Localization
• Running tests produces execution traces
   – Some tests fail and the others pass
• Given many execution traces generated by tests,
  can we suggest likely faulty statements?
  [Liblit et al. 03/05, Liu et al. 05]
   – Some traces may lead to program failures
   – It would be better if we could even suggest the
     likelihood of a statement being faulty
• For large programs, how can we collect traces
  effectively?
• What if there are multiple faults?
Analyzing Bug Repositories
• Most open source software development projects
  have bug repositories
    – Report and track problems and potential enhancements
    – Valuable information for both developers and users
• Bug repositories are often messy
    – Duplicate error reports; Related errors
• Challenge: how to analyze effectively?
    – Who are reporting and at what rate?
    – How are reports resolved and by whom?
• Automatic bug report assignment & duplicate
  detection                               [Anvik et al. 06]
Stabilizing Buggy Applications
• Users may report bugs in a program; can
  those bug reports be used to prevent the
  program from crashing?
   – When a user attempts an action that previously
     led to errors, a warning should be issued
• Given a program state S and an event e,
  predict whether e likely results in a bug
   – Positive samples: past bugs
   – Negative samples: “not bug” reports
                                                          [Michail&Xie 05]
Software Engineering Tasks
•     Programming
•     Static defect detection
•     Testing
•     Debugging
•     Maintenance




Guiding Software Changes
• Programmers start changing some locations
   – Suggest locations that other programmers have
     changed together with this location
          • E.g., “Programmers who changed this function
            also changed …”
• Mine association rules from change histories
   – coarse-granular entities: directories, modules,
     files
   – fine-granular entities: methods, variables,
     sections
                                           [Zimmermann et al. 04, Ying et al. 04]
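A minimal sketch of this style of co-change rule mining (the entity names and transactions below are invented for illustration): count how often two entities were checked in together, and suggest the partners whose confidence passes a threshold:

```python
from itertools import combinations
from collections import Counter

# Hypothetical change history: each transaction is the set of entities
# committed together.
transactions = [
    {"initDisplay", "disposeDisplay"},
    {"initDisplay", "disposeDisplay", "logError"},
    {"initDisplay", "disposeDisplay"},
    {"initDisplay", "readConfig"},
]

pair_count = Counter()
item_count = Counter()
for t in transactions:
    item_count.update(t)
    pair_count.update(frozenset(p) for p in combinations(sorted(t), 2))

def suggest(changed, min_conf=0.6):
    """Entities that co-changed with `changed` often enough:
    conf(changed -> other) = sup({changed, other}) / sup({changed})."""
    out = {}
    for pair, sup in pair_count.items():
        if changed in pair:
            other = next(iter(pair - {changed}))
            conf = sup / item_count[changed]
            if conf >= min_conf:
                out[other] = conf
    return out

print(suggest("initDisplay"))  # disposeDisplay co-changed in 3 of 4 transactions
```

Real tools mine such rules at several granularities (files down to methods), as the slide notes.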
Aspect Mining
• Discover crosscutting concerns that can
  potentially be refactored into one place (an
  aspect in aspect-oriented programs)
   – E.g., logging, timing, communication
• Mine recurring execution patterns
   – Event traces [Breu&Krinke 04, Tonella&Ceccato
     04]
   – Source code [Shepherd et al. 05]


Software Engineering Data
•     Static code bases
•     Software change history
•     Profiled program states
•     Profiled structural entities
•     Bug reports




Code Entities
• Identifiers within a system [Kawaguchi et al. 04]
    – E.g., variable names, function names
• Statement sequence within a basic block [Li et al.
  04]
    – E.g., variables, operators, constants, functions,
      keywords
• Element set within a function [Li&Zhou 05]
    – E.g., functions, variables, data types
• Call sites within a function [Xie&Pei 05]
• API signatures [Mandelin et al. 05]
                  [Mandelin et al. 05] http://snobol.cs.berkeley.edu/prospector/index.jsp
Relationships btw Code Entities
• Membership relationships
   – A class contains member functions
• Reuse relationships
   – Class inheritance
   – Class instantiation
   – Function invocations
   – Function overriding

                                 [Michail 99/00] http://codeweb.sourceforge.net/ for C++

Software Engineering Data
•     Static code bases
•     Software change history
•     Profiled program states
•     Profiled structural entities
•     Bug reports




Concurrent Versions System (CVS)
[Screenshot: CVS check-in comments associated with source code lines]
                                          [Chen et al. 01] http://cvssearch.sourceforge.net/
CVS Comments

• cvs log – displays all revisions and their comments for each file:

    RCS file: /repository/file.h,v
    Working file: file.h
    head: 1.5
    ...
    description:
    ----------------------------
    Revision 1.5
    Date: ...
    cvs comment ...
    ----------------------------
    ...

• cvs diff – shows differences between different versions of a file:

    RCS file: /repository/file.h,v
    …
    9c9,10
    < old line
    ---
    > new line
    > another new line
                                        [Chen et al. 01] http://cvssearch.sourceforge.net/
Code Version Histories
• CVS provides file versioning
   – Group individual per-file changes into
     transactions (atomic change sets): checked in by the
     same author, with the same check-in comment, close in
     time
• CVS manages only files and line numbers
   – Associate syntactic entities with line ranges
• Filter out long transactions not corresponding to
  meaningful atomic changes
   – E.g., feature requests, bug fixes, branch merging
                                                                         [Ying et al. 04]
                       [Zimmermann et al. 04] http://www.st.cs.uni-sb.de/softevo/erose/
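The grouping step above can be sketched as follows; the check-ins and the 200-second window are invented for illustration and are not the thresholds of the cited tools:

```python
# Hypothetical per-file check-ins: (file, author, comment, timestamp).
checkins = [
    ("a.c", "kim", "fix null deref", 100),
    ("a.h", "kim", "fix null deref", 130),
    ("b.c", "kim", "fix null deref", 500),   # same comment, but too late
    ("c.c", "lee", "add feature", 140),
]

WINDOW = 200  # seconds: check-ins closer than this may join one transaction

def group_transactions(checkins):
    """Group per-file changes by same author + same comment + closeness
    in time (a sliding window), approximating atomic change sets."""
    txns = []
    for f, author, comment, ts in sorted(checkins, key=lambda c: c[3]):
        for txn in txns:
            if (txn["author"] == author and txn["comment"] == comment
                    and ts - txn["last"] <= WINDOW):
                txn["files"].append(f)
                txn["last"] = ts  # sliding window: extend from the last member
                break
        else:
            txns.append({"author": author, "comment": comment,
                         "files": [f], "last": ts})
    return [t["files"] for t in txns]

print(group_transactions(checkins))
```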
Software Engineering Data
•     Static code bases
•     Software change history
•     Profiled program states
•     Profiled structural entities
•     Bug reports




Method-Entry/Exit States
• State of an object
   – Values of transitively reachable fields
• Method-entry state
   – Receiver-object state, method argument values
• Method-exit state
   – Receiver-object state, updated method
     argument values, method return value
                                              [Ernst et al. 02] http://pag.csail.mit.edu/daikon/
                                      [Dallmeier et al. 06] http://www.st.cs.uni-sb.de/models/
                                                                            [Henkel&Diwan 03]
                                                                             [Xie&Notkin 04/05]
Other Profiled Program States
• Values of variables at certain code locations
  [Hangal&Lam 02]
   – Object/static field read/write
   – Method-call arguments
   – Method returns
• Sampled predicates on values of variables
  [Liblit et al. 03/05]


                                               [Hangal&Lam 02] http://diduce.sourceforge.net/
                                                 [Liblit et al. 03/05] http://www.cs.wisc.edu/cbi/

Software Engineering Data
•     Static code bases
•     Software change history
•     Profiled program states
•     Profiled structural entities
•     Bug reports




Executed Structural Entities
• Executed branches/paths, def-use pairs
• Executed function/method calls
   – Group methods invoked on the same object
• Profiling options
   – Execution hit vs. count
   – Execution order (sequences)


                                 [Dallmeier et al. 05] http://www.st.cs.uni-sb.de/ample/
         More related tools: http://www.csc.ncsu.edu/faculty/xie/research.htm#related
Software Engineering Data
•     Static code bases
•     Software change history
•     Profiled program states
•     Profiled structural entities
•     Bug reports




Processing Bug Reports

[Flow diagram: a User submits a Bug Report to the Bug Repository; a Triager
examines it and either assigns it to a Developer or resolves it as Duplicate,
Works For Me, Invalid, or Won’t Fix]
Adapted from Anvik et al.’s slides
Sample Bugzilla Bug Report
• Bug report image
• Overlay the triage questions

[Screenshot: a Bugzilla bug report overlaid with the triage questions:
Assigned To? Duplicate? Reproducible?]
                                                      Bugzilla: open source bug tracking tool
                                                                      http://www.bugzilla.org/
                                                                              [Anvik et al. 06]
                                        http://www.cs.ubc.ca/labs/spl/projects/bugTriage.html
Eclipse Bug Data
• Defect counts are listed at the plug-in,
  package, and compilation-unit levels.
• The value field contains the actual number of
  pre- ("pre") and post-release defects ("post").
• The average ("avg") and maximum ("max") values
  refer to the defects found in the compilation
  units ("compilationunits").
               [Schröter et al. 06] http://www.st.cs.uni-sb.de/softevo/bug-data/eclipse/
Data Mining Techniques in SE
•     Association rules and frequent patterns
•     Classification
•     Clustering
•     Misc.




Frequent Itemsets

• Itemset: a set of items
   – E.g., acm = {a, c, m}
• Support of itemsets
   – Sup(acm) = 3
• Given min_sup = 3, acm is a frequent pattern
• Frequent pattern mining: find all frequent
  patterns in a database

  Transaction database TDB
  TID    Items bought
  100    f, a, c, d, g, i, m, p
  200    a, b, c, f, l, m, o
  300    b, f, h, j, o
  400    b, c, k, s, p
  500    a, f, c, e, l, p, m, n

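As a sketch of the definitions above, a naive miner can enumerate candidate itemsets over the slide's transaction database and keep those meeting min_sup (a real miner such as Apriori prunes candidates level by level instead of enumerating blindly):

```python
from itertools import combinations
from collections import Counter

# The transaction database TDB from the slide, items as single letters.
TDB = [set("facdgimp"), set("abcflmo"), set("bfhjo"),
       set("bcksp"), set("afcelpmn")]
MIN_SUP = 3

def frequent_itemsets(db, min_sup):
    """Naive frequent-itemset mining: count every subset up to size 3
    and keep those whose support reaches min_sup."""
    counts = Counter()
    for t in db:
        for k in (1, 2, 3):
            for combo in combinations(sorted(t), k):
                counts[frozenset(combo)] += 1
    return {s for s, c in counts.items() if c >= min_sup}

freq = frequent_itemsets(TDB, MIN_SUP)
print(frozenset("acm") in freq)  # True: sup(acm) = 3, as on the slide
```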
Association Rules
• (Time ∈ {Fri, Sat}) ∧ buy(X, diaper) → buy(X, beer)
   – Dads taking care of babies on weekends drink beer
• Itemsets should be frequent
   – So that the rule applies to many cases
• Rules should be confident
   – With strong prediction capability

A Road Map
• Boolean vs. quantitative associations
   – buys(x, “SQLServer”) ∧ buys(x, “DMBook”) →
     buys(x, “DM Software”) [0.2%, 60%]
   – age(x, “30..39”) ∧ income(x, “42..48K”) →
     buys(x, “PC”) [1%, 75%]
• Single dimension vs. multiple dimensional
  associations
• Single level vs. multiple-level analysis
   – What brands of beers are associated with what
     brands of diapers?
Frequent Pattern Mining Methods
• Apriori and its variations/improvements
• Mining frequent-patterns without candidate
  generation
• Mining max-patterns and closed itemsets
• Mining multi-dimensional, multi-level
  frequent patterns with flexible support
  constraints
• Interestingness: correlation and causality

A Simple Case
• Finding highly correlated method call pairs
• Confidence of pairs helps
   – Conf(<a,b>)=support(<a,b>)/support(<a,a>)
• Check the revisions (fixes to bugs), and find the
  pairs of method calls whose confidences are
  improved dramatically by frequently added
  fixes
   – Those are the matching method call pairs that
     may often be violated by programmers
                                                          [Livshits&Zimmermann 05]
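The confidence computation above can be sketched directly; the per-function call sets below are invented for illustration:

```python
from collections import Counter

# Hypothetical call sets, one per function in the code base.
functions = [
    {"fopen", "fclose", "fread"},
    {"fopen", "fclose"},
    {"fopen", "printf"},        # fclose forgotten: a candidate bug
    {"malloc", "free"},
]

sup = Counter()
for calls in functions:
    for a in calls:
        sup[a] += 1               # support of the single call a
        for b in calls:
            if a != b:
                sup[(a, b)] += 1  # support of the pair <a, b>

def conf(a, b):
    """Conf(<a,b>) = support(<a,b>) / support(<a,a>), i.e. how often
    a function that calls a also calls b."""
    return sup[(a, b)] / sup[a]

print(conf("fopen", "fclose"))  # 2 of the 3 fopen sites also call fclose
```

The one fopen site without fclose is exactly the kind of violation the slide's approach flags.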
Conflicting Patterns
• 999 out of 1000 times spin_unlock
  follows spin_lock
   – The single time that spin_unlock does not
     follow is likely an error
• We can detect an error without knowing the
  correctness rule



              [Li&Zhou 05, Livshits&Zimmermann 05, Yang et al. 06]
Frequent Library Reuse Patterns
• Items: classes, member functions, reuse relationships
  (e.g., inheritance, overriding, instantiation)
• Transactions: for every application class A, the set of all
  items that are involved in a reuse relationship with A
• Pruning
    – Uninteresting rules, e.g., a rule holds for every class
    – Misleading rules, e.g., xy → z (conf: 60%) is pruned if y → z (conf:
      80%)
    – Statistically insignificant rules, prune rules of a high p-value
• Constrained rules
    – Rules involving a particular class
    – Rules that are violated in a particular application
                                                                  [Michail 99/00]
MAPO: Mining Frequent API Patterns

[Figure: overview of MAPO, which mines frequent API usage patterns from
method-call sequences extracted from open source code]
                                                          [Xie&Pei 06]
Sequential Pattern Mining in MAPO
• Use BIDE [Wang&Han 04] to mine closed sequential
  patterns from the preprocessed method-call
  sequences
• Postprocessing in MAPO
   – Remove frequent sequences that do not contain the
     entities interesting to the user
   – Compress consecutive calls of the same method into
     one
   – Remove duplicate frequent sequences after the
     compression
   – Remove frequent sequences that are subsequences of
     some other frequent sequences
                                                          [Xie&Pei 06]
Detecting Copy-Paste Code
• Apply closed sequential pattern mining techniques
• Customizing the techniques
   – A copy-pasted segment typically does not have big gaps –
     use a maximum gap threshold to control
   – Output the instances of patterns (i.e., the copy-pasted
     code segments) instead of the patterns
   – Use small copy-pasted segments to form larger ones
   – Prune false positives: tiny segments, unmappable
     segments, overlapping segments, and segments with
     large gaps
                                                          [Li et al. 04]
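A toy sketch of the matching idea (not the cited tool itself): after statements are abstracted to tokens, equal fixed-length windows across files are clone candidates, which a real tool would then grow into larger segments and filter as described above. The files and tokens are invented:

```python
from collections import defaultdict

# Hypothetical tokenized statement streams, one list per file; identifiers
# are already abstracted away, so renamed copies still match.
files = {
    "a.c": ["open", "check", "read", "close", "log"],
    "b.c": ["init", "open", "check", "read", "close"],
}

def clone_windows(files, length=3):
    """Group equal statement windows of a fixed length across files;
    any window occurring in more than one place is a clone candidate."""
    index = defaultdict(list)
    for name, stmts in files.items():
        for i in range(len(stmts) - length + 1):
            index[tuple(stmts[i:i + length])].append((name, i))
    return {w: locs for w, locs in index.items() if len(locs) > 1}

clones = clone_windows(files)
print(sorted(clones))
```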
Find Bugs in Copy-Pasted Segments

• For two copy-pasted segments, are the
  modifications consistent?
   – Identifier a in segment S1 is changed to b in
     segment S2 3 times, but remains unchanged
     once – likely a bug
   – The heuristic may not be right all the time
• The lower the unchanged rate of an
  identifier, the more likely there is a bug
                                                          [Li et al. 04]
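The heuristic above reduces to a single ratio; the renaming record below is an invented example matching the slide's 3-changed-1-unchanged scenario:

```python
# Hypothetical identifier mapping between two copy-pasted segments:
# each entry records what identifier `a` became in the copied segment.
renamings = ["b", "b", "b", "a"]  # changed to b 3 times, unchanged once

def unchanged_rate(orig, renamings):
    """Fraction of occurrences left unchanged; a low-but-nonzero rate
    suggests an inconsistent rename, i.e. a likely copy-paste bug."""
    return renamings.count(orig) / len(renamings)

rate = unchanged_rate("a", renamings)
print(rate)  # 0.25: 'a' was mostly renamed, so the leftover is suspicious
```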
Approximate Patterns for Inferences

• Use an alternating template to find
  interesting properties
   – Example: template – (PS)*; an instance: the
     alternation loc.acq loc.rel
• Handling imperfect traces
   – Instead of requiring perfect matches, check the
     ratio of matching
   – Explore contexts of matching
                                                          [Yang et al. 06]
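The ratio-of-matching idea can be sketched as a single scan over the trace; the event names below follow the slide's loc.acq/loc.rel example, and the scoring is a simplified stand-in for the cited technique:

```python
def alternation_ratio(trace, p, s):
    """Fraction of p/s events in a trace that fit the alternating
    template (PS)*; imperfect traces lower the ratio instead of
    rejecting the property outright."""
    expect, matched, total = p, 0, 0
    for e in trace:
        if e not in (p, s):
            continue  # ignore unrelated events in the trace
        total += 1
        if e == expect:
            matched += 1
            expect = s if e == p else p
    return matched / total if total else 0.0

perfect = ["loc.acq", "other", "loc.rel", "loc.acq", "loc.rel"]
noisy = ["loc.acq", "loc.acq", "loc.rel"]
print(alternation_ratio(perfect, "loc.acq", "loc.rel"))  # 1.0
```

A property is kept when its ratio exceeds a threshold, rather than requiring a perfect match.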
Context Handling

[Figure omitted]
Figure from “Perracotta: mining temporal API rules
from imperfect traces”, in [Yang et al. ICSE’06]
Cross-Checking of Execution Traces
• Mine association rules or sequential
  patterns S → F, where S is a statement and
  F is the status of program failure
• The higher the confidence, the more likely S
  is faulty or related to a fault
• Using only one statement on the left side of
  the rule can be misleading, since a fault may
  be caused by a combination of statements
   – Frequent patterns can be used to improve this
                                                           [Denmat et al. 05]
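The single-statement rule S → F amounts to a per-statement confidence over run outcomes; the coverage data below is invented for illustration:

```python
# Hypothetical coverage data: statements executed in each run, plus outcome.
runs = [
    ({"s1", "s2", "s3"}, "fail"),
    ({"s1", "s3"}, "fail"),
    ({"s1", "s2"}, "pass"),
    ({"s1"}, "pass"),
]

def failure_confidence(stmt, runs):
    """Conf(stmt -> failure): among runs executing stmt, the fraction
    that failed. High confidence flags statements close to a fault."""
    covering = [out for stmts, out in runs if stmt in stmts]
    return covering.count("fail") / len(covering) if covering else 0.0

print(failure_confidence("s3", runs))  # s3 appears only in failing runs
```

Note how s1 illustrates the slide's caveat: it is executed everywhere, so its confidence of 0.5 says nothing about the fault.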
Emerging Patterns of Traces
• A method executed only in failing runs is
  likely to point to the defect
   – Comparing the coverage of passing and failing
     program runs helps
• Mining patterns frequent in failing program
  runs but infrequent in passing program runs
   – Sequential patterns may be used


                 [Dallmeier et al. 05, Denmat et al. 05, Yang et al. 06]
Learning Object Behavior
• Extracting models
   – A static analysis identifies all side-effect-free
     methods in the program
   – Some side-effect-free methods are selected as
     inspectors
   – The program is executed and inspectors are
     called to extract information about an object’s
     state – a vector of inspector values
• Merge models of all objects in a program
                                                          [Dallmeier et al. 06]
Data Mining Techniques in SE
•     Association rules and frequent patterns
•     Classification
•     Clustering
•     Misc.




Classification: A 2-step Process
• Model construction: describe a set of
  predetermined classes
   – Training dataset: tuples for model construction
         • Each tuple/sample belongs to a predefined class
   – Classification rules, decision trees, or math formulae
• Model application: classify unseen objects
   – Estimate accuracy of the model using an independent
     test set
   – Acceptable accuracy → apply the model to classify
     tuples with unknown class labels

Model Construction

  Training Data → Classification Algorithms → Classifier (Model)

  Name    Rank        Years   Tenured
  Mike    Ass. Prof   3       No
  Mary    Ass. Prof   7       Yes
  Bill    Prof        2       Yes
  Jim     Asso. Prof  7       Yes
  Dave    Ass. Prof   6       No
  Anne    Asso. Prof  3       No

  Learned model: IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’
Model Application

  Testing Data → Classifier → classify Unseen Data, e.g., (Jeff, Professor, 4): Tenured?

  Name     Rank        Years   Tenured
  Tom      Ass. Prof   2       No
  Merlisa  Asso. Prof  7       No
  George   Prof        5       Yes
  Joseph   Ass. Prof   7       Yes
T. Xie and J. Pei: Data Mining for Software Engineering                             66
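The two-step process can be sketched on the toy tenure data above. This is a hedged illustration: the "model" here is simply the rule the slide shows, not the output of a real learning algorithm.

```python
# Sketch of the 2-step classification process using the slides' toy
# tenure data. The "learned" model is hand-written here; a real
# classifier would induce it from the training tuples.

training = [
    ("Mike", "Assistant Prof", 3, "no"),
    ("Mary", "Assistant Prof", 7, "yes"),
    ("Bill", "Professor", 2, "yes"),
    ("Jim", "Associate Prof", 7, "yes"),
    ("Dave", "Assistant Prof", 6, "no"),
    ("Anne", "Associate Prof", 3, "no"),
]

# Step 1, model construction: IF rank = 'Professor' OR years > 6
# THEN tenured = 'yes' (the rule shown on the slide).
def model(rank, years):
    return "yes" if rank == "Professor" or years > 6 else "no"

train_acc = sum(model(r, y) == t for _, r, y, t in training) / len(training)

# Step 2, model application: first estimate accuracy on an
# independent test set, then classify unseen tuples.
test = [
    ("Tom", "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "no"),
    ("George", "Professor", 5, "yes"),
    ("Joseph", "Assistant Prof", 7, "yes"),
]
accuracy = sum(model(r, y) == t for _, r, y, t in test) / len(test)

print(train_acc)              # 1.0 on the training tuples
print(accuracy)               # 0.75: Merlisa is misclassified
print(model("Professor", 4))  # unseen tuple (Jeff, Professor, 4) -> "yes"
```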
Supervised vs. Unsupervised
Learning
• Supervised learning (classification)
   – Supervision: objects in the training data set
     have labels
   – New data is classified based on the training set
• Unsupervised learning (clustering)
   – The class labels of training data are unknown
   – Given a set of measurements, observations,
     etc. with the aim of establishing the existence of
     classes or clusters in the data

T. Xie and J. Pei: Data Mining for Software Engineering   67
GUI-Application Stabilizer
• Given a program state S and an event e, predict
  whether e likely results in a bug
   – Positive samples: past bugs
   – Negative samples: “not bug” reports
• A k-NN based approach
   – Consider the k closest cases reported before
   – Compare Σ 1/d over the k nearest bug cases and not-bug
     cases, where d is the distance between the current state
     and a reported state
   – If the current state is more similar to bugs, predict a bug
                                                          [Michail&Xie 05]
T. Xie and J. Pei: Data Mining for Software Engineering              68
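The Σ 1/d comparison can be sketched as a distance-weighted k-NN vote. The state vectors, distance function, and data below are illustrative assumptions, not the tool's actual representation.

```python
# Hedged sketch of the k-NN idea behind the GUI-application stabilizer:
# among the k closest past reports, compare the sum of 1/d for "bug"
# vs. "not-bug" cases, where d is the distance to the current state.

def predict_bug(current, reports, k=3):
    # reports: list of (state_vector, label) with label "bug" / "not-bug"
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    nearest = sorted(reports, key=lambda r: dist(current, r[0]))[:k]
    score = {"bug": 0.0, "not-bug": 0.0}
    for state, label in nearest:
        d = dist(current, state)
        score[label] += 1.0 / (d + 1e-9)   # avoid division by zero
    return "bug" if score["bug"] > score["not-bug"] else "not-bug"

reports = [((0, 0), "bug"), ((0, 1), "bug"),
           ((5, 5), "not-bug"), ((6, 5), "not-bug")]
print(predict_bug((1, 0), reports))  # closest past cases are bugs
```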
Data Mining Techniques in SE
•     Association rules and frequent patterns
•     Classification
•     Clustering
•     Misc.




    T. Xie and J. Pei: Data Mining for Software Engineering   69
What Is Clustering?
• Group data into clusters
   – Similar to one another within the same cluster
   – Dissimilar to the objects in other clusters
   – Unsupervised learning: no predefined classes

[Figure: two clusters (Cluster 1, Cluster 2) with outliers]

T. Xie and J. Pei: Data Mining for Software Engineering               70
Categories of Clustering
Approaches (1)
• Partitioning algorithms
   – Partition the objects into k clusters
   – Iteratively reallocate objects to improve the
     clustering
• Hierarchy algorithms
   – Agglomerative: each object is a cluster, merge
     clusters to form larger ones
   – Divisive: all objects are in a cluster, split it up
     into smaller clusters

T. Xie and J. Pei: Data Mining for Software Engineering    71
Categories of Clustering
Approaches (2)
• Density-based methods
   – Based on connectivity and density functions
   – Filter out noise, find clusters of arbitrary shape
• Grid-based methods
   – Quantize the object space into a grid structure
• Model-based
   – Use a model to find the best fit of data


T. Xie and J. Pei: Data Mining for Software Engineering   72
K-Means: Example

K = 2. Arbitrarily choose K objects as the initial cluster centers.
Loop: assign each object to the most similar center, then update the
cluster means; reassign objects and update the means again until the
clusters no longer change.

[Figure: four scatter plots illustrating the assign/update iterations]
         T. Xie and J. Pei: Data Mining for Software Engineering                                                                                                                                                                              73
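The loop in the figure can be sketched in a few lines. This minimal version assumes 2-D points and takes the first K objects as the arbitrary initial centers.

```python
# Minimal k-means sketch matching the slide's loop: pick K initial
# centers, assign each object to its most similar center, update the
# cluster means, and repeat until the assignment stops changing.

def kmeans(points, k, iters=100):
    centers = list(points[:k])  # "arbitrarily choose K objects"
    def nearest(p):
        return min(range(k),
                   key=lambda c: (p[0] - centers[c][0]) ** 2 +
                                 (p[1] - centers[c][1]) ** 2)
    assign = None
    for _ in range(iters):
        new_assign = [nearest(p) for p in points]
        if new_assign == assign:
            break                      # clusters stabilized
        assign = new_assign
        for c in range(k):             # update each cluster mean
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centers[c] = (sum(x for x, _ in members) / len(members),
                              sum(y for _, y in members) / len(members))
    return assign, centers

points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
assign, centers = kmeans(points, 2)
print(assign)  # first three points in one cluster, last three in the other
```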
Clustering and Categorization
• Software categorization
   – Partitioning software systems into categories
• Categories predefined – a classification
  problem
• Categories discovered automatically – a
  clustering problem



T. Xie and J. Pei: Data Mining for Software Engineering   74
Software Categorization - MUDABlue

• Understanding source code
    – Use latent semantic analysis (LSA) to find similarity
      between software systems
    – Use identifiers (e.g., variable names, function names)
      as features
          • “gtk_window” represents some window
          • The source code near “gtk_window” contains some GUI
            operation on the window
• Extracting categories using frequent identifiers
    – “gtk_window”, “gtk_main”, and “gpointer” → a GTK-related
      software system
    – Use LSA to find relationships between identifiers
                                                           [Kawaguchi et al. 04]
 T. Xie and J. Pei: Data Mining for Software Engineering                   75
Overview of MUDABlue
• Extract identifiers
• Create identifier-by-software matrix
• Remove useless identifiers
• Apply LSA, and retrieve categories
• Make software clusters from identifier
  clusters
• Title software clusters

                                                              [Kawaguchi et al. 04]
    T. Xie and J. Pei: Data Mining for Software Engineering                     76
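A toy identifier-by-software matrix illustrates the intuition behind this step. Note the hedge: MUDABlue applies LSA (an SVD-based reduction) before comparing; the sketch below only compares raw matrix columns with cosine similarity, and every identifier count is made up.

```python
# Toy identifier-by-software matrix in the spirit of MUDABlue.
# Rows: identifiers; columns: software systems A, B, C.
import math

identifiers = ["gtk_window", "gtk_main", "gpointer", "socket", "bind"]
matrix = [
    [3, 2, 0],   # gtk_window occurs in A and B (GUI-related)
    [2, 3, 0],   # gtk_main
    [1, 1, 0],   # gpointer
    [0, 0, 4],   # socket occurs only in C (network-related)
    [0, 0, 2],   # bind
]

def column(m, j):
    return [row[j] for row in m]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

# Even without the SVD step, column similarity hints at the categories:
# A and B share GTK identifiers, C does not.
sim_ab = cosine(column(matrix, 0), column(matrix, 1))
sim_ac = cosine(column(matrix, 0), column(matrix, 2))
print(sim_ab > sim_ac)  # True: A is closer to B than to C
```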
Data Mining Techniques in SE
•     Association rules and frequent patterns
•     Classification
•     Clustering
•     Misc.




    T. Xie and J. Pei: Data Mining for Software Engineering   77
Searching Source Code/Comments
• CVSSearch: searching using CVS
  comments
• Comments are often more stable than code
  segments
   – Describe a segment of code
   – May hold for many future versions
• Compare differences of successive versions
   – For two versions, associate a comment to the
     corresponding changes
   – Propagate changes over versions       [Chen et al. 01]
T. Xie and J. Pei: Data Mining for Software Engineering   78
Jungloid Mining
• Given a query describing the input and output
  types, synthesize code fragments automatically
• Prospector: using API method signatures and
  jungloids mined from a corpus of sample client
  programs
• Elementary jungloids
   –   Field access
   –   Static method or constructor invocation
   –   Instance method invocation
   –   Widening reference conversion
   –   Downcast (narrowing reference conversions)
                                                          [Mandelin et al. 05]
T. Xie and J. Pei: Data Mining for Software Engineering                 79
Finding Jungloids
Example query: parsing a Java source code file in an IFile object
using the Eclipse IDE framework




• Use signatures of elementary jungloids and APIs to form a
  signature graph
• Represent a solution as a path in the graph matching the
  constraints
• Rank the paths by their lengths – short paths are preferred
• Learn downcast from sample programs           [Mandelin et al. 05]
T. Xie and J. Pei: Data Mining for Software Engineering                                     80
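The shortest-path preference can be sketched as breadth-first search over a signature graph. The miniature graph below is hypothetical: its node and edge names only imitate the Eclipse parsing example and are not the real API.

```python
# Illustrative sketch of Prospector's core idea: nodes are types,
# edges are elementary jungloids (method calls, conversions, ...),
# and a query (input type, output type) is answered by a shortest
# path. The graph is a made-up miniature, not the real Eclipse API.
from collections import deque

edges = {
    "IFile":       [("IFile.getContents()", "InputStream")],
    "InputStream": [("new InputStreamReader(in)", "Reader")],
    "Reader":      [("parser.setSource(reader)", "ASTParser")],
    "ASTParser":   [("parser.createAST(null)", "CompilationUnit")],
}

def shortest_jungloid(src, dst):
    # breadth-first search returns a shortest chain of elementary steps
    queue = deque([(src, [])])
    seen = {src}
    while queue:
        node, path = queue.popleft()
        if node == dst:
            return path
        for step, nxt in edges.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [step]))
    return None

path = shortest_jungloid("IFile", "CompilationUnit")
print(len(path))  # a 4-step chain from IFile to CompilationUnit
```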
Sampling Programs
• During the execution of a program, each
  execution of a statement takes a probability
  to be sampled
   – Sampling large programs becomes feasible
   – Many traces can be collected
• Bug isolation by analyzing samples
   – Correlation between some specific statements
     or function calls with program errors/crashes

                                                          [Liblit et al. 03/05]
T. Xie and J. Pei: Data Mining for Software Engineering                    81
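A minimal sketch of the sampling idea, assuming plain Bernoulli sampling per statement execution (the actual framework uses a cheaper countdown scheme to decide which executions to record):

```python
# Each dynamic execution of an instrumented statement is recorded
# only with probability p, so monitoring overhead stays low while
# many runs still yield usable trace data.
import random

def run_with_sampling(events, p, rng):
    sampled = []
    for e in events:
        if rng.random() < p:    # record this execution with probability p
            sampled.append(e)
    return sampled

rng = random.Random(42)         # fixed seed for reproducibility
events = ["stmt%d" % i for i in range(10000)]
sample = run_with_sampling(events, 0.01, rng)
print(len(sample))              # roughly 1% of 10,000 events
```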
Outline
• Introduction
• What software engineering tasks can be
  helped by data mining?
• What kinds of software engineering data can
  be mined?
• How are data mining techniques used in
  software engineering?
• Case studies
• Conclusions
T. Xie and J. Pei: Data Mining for Software Engineering   82
Case Studies
• MAPO: mining API usages from open source
  repositories [Xie&Pei 06]
   • Code bases → sequence analysis → programming
• DynaMine: finding common error patterns by mining
  software revision histories [Livshits&Zimmermann 05]
   • Change history → association rules → defect detection
• BugTriage: Who should fix this bug? [Anvik et al. 06]
   • Bug reports → classification → debugging



  T. Xie and J. Pei: Data Mining for Software Engineering                            83
Motivation
• APIs in class libraries or frameworks are
  popularly reused in software development.

• An example programming task:
  “instrument the bytecode of a Java class by
  adding an extra method to the class”
   – org.apache.bcel.generic.ClassGen
           public void addMethod(Method m)


T. Xie and J. Pei: Data Mining for Software Engineering   84
First Try: ClassGen Java API Doc
addMethod

public void addMethod(Method m)
       Add a method to this class.
       Parameters:
        m - method to add




T. Xie and J. Pei: Data Mining for Software Engineering   85
Second Try: Code Search Engine




T. Xie and J. Pei: Data Mining for Software Engineering   86
MAPO Approach
• Analyze code segments returned from code
  search engines and disclose the inherent
  usage patterns
   – Input: an API characterized by a method, class,
     or package
     code bases: open source repositories or
     proprietary source repositories
   – Output: a short list of frequent API usage
     patterns related to the API

T. Xie and J. Pei: Data Mining for Software Engineering   87
Sample Tool Output
InstructionList.<init>()
InstructionFactory.createLoad(Type, int)
InstructionList.append(Instruction)
InstructionFactory.createReturn(Type)
InstructionList.append(Instruction)
MethodGen.setMaxStack()
MethodGen.setMaxLocals()
MethodGen.getMethod()
ClassGen.addMethod(Method)
InstructionList.dispose()
        • Mined from 36 Java source files (1,087 method sequences)
 T. Xie and J. Pei: Data Mining for Software Engineering     88
Tool Architecture




T. Xie and J. Pei: Data Mining for Software Engineering   89
Results
 A tool that integrates various components
 • Relevant code extractor
       – download returns from code search engine (koders.com)
 • Code analyzer
       – implemented a lightweight tool for Java programs
 • Sequence preprocessor
       – employed various heuristics
 • Frequent sequence miner
       – reused BIDE [Wang&Han ICDE 2004]
 • Frequent sequence postprocessor
       – employed various heuristics



T. Xie and J. Pei: Data Mining for Software Engineering          90
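The frequent-sequence step can be illustrated by brute force on tiny data. BIDE mines closed sequences far more efficiently; the call sequences below are shortened, made-up fragments in the style of the sample output.

```python
# Brute-force illustration of the support notion behind MAPO's
# frequent sequence miner: a subsequence is frequent if enough
# extracted method-call sequences contain it in order.
from itertools import combinations

sequences = [
    ["InstructionList.<init>", "InstructionList.append",
     "MethodGen.getMethod", "ClassGen.addMethod"],
    ["InstructionList.<init>", "InstructionList.append",
     "ClassGen.addMethod", "InstructionList.dispose"],
    ["MethodGen.getMethod", "ClassGen.addMethod", "InstructionList.dispose"],
]

def frequent_subsequences(seqs, min_support, max_len=3):
    counts = {}
    for seq in seqs:
        seen = set()
        for n in range(2, max_len + 1):
            # combinations() preserves order, so each tuple is an
            # order-preserving subsequence of seq
            for sub in combinations(seq, n):
                seen.add(sub)
        for sub in seen:  # count each subsequence once per sequence
            counts[sub] = counts.get(sub, 0) + 1
    return {sub: c for sub, c in counts.items() if c >= min_support}

freq = frequent_subsequences(sequences, min_support=2)
print(len(freq) > 0)  # several patterns appear in at least 2 sequences
```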
Case Studies
• MAPO: mining API usages from open source
  repositories [Xie&Pei 06]
   • Code bases → sequence analysis → programming
• DynaMine: finding common error patterns by mining
  software revision histories [Livshits&Zimmermann 05]
   • Change history → association rules → defect detection
• BugTriage: Who should fix this bug? [Anvik et al. 06]
   • Bug reports → classification → debugging



  T. Xie and J. Pei: Data Mining for Software Engineering                            91
Co-Change Pattern
• Things that are frequently changed together
often form a pattern (a.k.a. co-change)
    • E.g., co-added method calls
     public void createPartControl(Composite parent) {
       ...
       // add listener for editor page activation
       getSite().getPage().addPartListener(partListener);
     }

     public void dispose() {
       ...
       getSite().getPage().removePartListener(partListener);  // co-added
     }

T. Xie and J. Pei: Data Mining for Software Engineering   Adapted from Livshits et al.’s slides   92
DynaMine
revision history mining: mine CVS histories → patterns → rank and filter

dynamic analysis: instrument relevant method calls → run the
application → post-process → usage patterns / error patterns /
unlikely patterns

reporting: report patterns, report bugs
  T. Xie and J. Pei: Data Mining for Software Engineering     Adapted from Livshits et al.’s slides   93
Mining Patterns
revision history mining: mine CVS histories → patterns → rank and filter

dynamic analysis: instrument relevant method calls → run the
application → post-process → usage patterns / error patterns /
unlikely patterns

reporting: report patterns, report bugs
                                                 Adapted from Livshits et al.’s slides
  T. Xie and J. Pei: Data Mining for Software Engineering                                             94
Mining Method Calls
Foo.java 1.12:  o1.addListener(); o1.removeListener()
Bar.java 1.47:  o2.addListener(); o2.removeListener(); System.out.println()
Baz.java 1.23:  o3.addListener(); o3.removeListener(); list.iterator();
                iter.hasNext(); iter.next()
Qux.java 1.41:  o4.addListener(); System.out.println()
         1.42:  o4.removeListener()

T. Xie and J. Pei: Data Mining for Software Engineering   Adapted from Livshits et al.’s slides   95
Finding Pairs
Foo.java 1.12:  o1.addListener(); o1.removeListener()              → 1 pair
Bar.java 1.47:  o2.addListener(); o2.removeListener();
                System.out.println()                               → 1 pair
Baz.java 1.23:  o3.addListener(); o3.removeListener(); list.iterator();
                iter.hasNext(); iter.next()                        → 2 pairs
Qux.java 1.41:  o4.addListener(); System.out.println()             → 0 pairs
         1.42:  o4.removeListener()                                → 0 pairs
T. Xie and J. Pei: Data Mining for Software Engineering   Adapted from Livshits et al.’s slides   96
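The pair counting can be sketched directly from the revision example; receiver objects are stripped so that o1.addListener and o4.addListener count as the same method.

```python
# Count method-call pairs that were added together in the same
# check-in, mirroring the slide's CVS-revision example.

revisions = {
    ("Foo.java", "1.12"): ["o1.addListener", "o1.removeListener"],
    ("Bar.java", "1.47"): ["o2.addListener", "o2.removeListener",
                           "System.out.println"],
    ("Baz.java", "1.23"): ["o3.addListener", "o3.removeListener",
                           "list.iterator", "iter.hasNext", "iter.next"],
    ("Qux.java", "1.41"): ["o4.addListener", "System.out.println"],
    ("Qux.java", "1.42"): ["o4.removeListener"],
}

def method_name(call):
    return call.split(".", 1)[1]  # drop the receiver, keep the method

pair_support = {}
for calls in revisions.values():
    methods = sorted({method_name(c) for c in calls})
    for i in range(len(methods)):
        for j in range(i + 1, len(methods)):
            pair = (methods[i], methods[j])
            pair_support[pair] = pair_support.get(pair, 0) + 1

print(pair_support[("addListener", "removeListener")])  # co-added in 3 check-ins
```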
Mining Method Calls
Foo.java 1.12:  o1.addListener(); o1.removeListener()
Bar.java 1.47:  o2.addListener(); o2.removeListener(); System.out.println()
Baz.java 1.23:  o3.addListener(); o3.removeListener(); list.iterator();
                iter.hasNext(); iter.next()
Qux.java 1.41:  o4.addListener(); System.out.println()
         1.42:  o4.removeListener()

Co-added calls often represent a usage pattern.
T. Xie and J. Pei: Data Mining for Software Engineering     Adapted from Livshits et al.’s slides   97
Finding Patterns

Find “frequent itemsets” (with Apriori) among the co-added calls:

    o.enterAlignment(), o.exitAlignment(), o.redoAlignment(),
    iter.hasNext(), iter.next(), …

    → {enterAlignment(), exitAlignment(), redoAlignment()}
T. Xie and J. Pei: Data Mining for Software Engineering   Adapted from Livshits et al.’s slides   98
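A minimal Apriori sketch over made-up check-in transactions shows how a frequent itemset such as {enterAlignment(), exitAlignment(), redoAlignment()} emerges:

```python
# Minimal Apriori: grow frequent itemsets level by level, keeping only
# candidates whose support meets the threshold. Transactions are toy
# data in the spirit of the slide's alignment example.

transactions = [
    {"enterAlignment", "exitAlignment", "redoAlignment"},
    {"enterAlignment", "exitAlignment", "redoAlignment"},
    {"enterAlignment", "exitAlignment", "redoAlignment", "hasNext"},
    {"hasNext", "next"},
]

def apriori(transactions, min_support):
    def support(itemset):
        return sum(itemset <= t for t in transactions)
    items = sorted({i for t in transactions for i in t})
    frequent = {frozenset([i]) for i in items if support({i}) >= min_support}
    result = set(frequent)
    while frequent:
        # candidate (k+1)-itemsets are unions of frequent k-itemsets
        candidates = {a | b for a in frequent for b in frequent
                      if len(a | b) == len(a) + 1}
        frequent = {c for c in candidates if support(c) >= min_support}
        result |= frequent
    return result

patterns = apriori(transactions, min_support=3)
print(frozenset({"enterAlignment", "exitAlignment", "redoAlignment"})
      in patterns)  # True
```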
Ranking Patterns
Foo.java 1.12:  o1.addListener(); o1.removeListener()
Bar.java 1.47:  o2.addListener(); o2.removeListener(); System.out.println()
Baz.java 1.23:  o3.addListener(); o3.removeListener(); list.iterator();
                iter.hasNext(); iter.next()
Qux.java 1.41:  o4.addListener(); System.out.println()
         1.42:  o4.removeListener()

Support count = # occurrences of a pattern
Confidence = strength of a pattern, P(A|B)
T. Xie and J. Pei: Data Mining for Software Engineering     Adapted from Livshits et al.’s slides   99
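Both measures can be computed directly on the revision example (method calls abbreviated to bare names):

```python
# Support count and confidence for the addListener/removeListener
# pattern, over the slide's five check-ins.

revisions = [
    {"addListener", "removeListener"},                  # Foo.java 1.12
    {"addListener", "removeListener", "println"},       # Bar.java 1.47
    {"addListener", "removeListener", "iterator",
     "hasNext", "next"},                                # Baz.java 1.23
    {"addListener", "println"},                         # Qux.java 1.41
    {"removeListener"},                                 # Qux.java 1.42
]

def support(itemset):
    # number of check-ins containing every call in the itemset
    return sum(itemset <= r for r in revisions)

def confidence(antecedent, consequent):
    # P(consequent | antecedent), estimated from the check-ins
    return support(antecedent | consequent) / support(antecedent)

print(support({"addListener", "removeListener"}))       # 3
print(confidence({"addListener"}, {"removeListener"}))  # 0.75
```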
Ranking Patterns
Foo.java 1.12:  o1.addListener(); o1.removeListener()
Bar.java 1.47:  o2.addListener(); o2.removeListener(); System.out.println()
Baz.java 1.23:  o3.addListener(); o3.removeListener(); list.iterator();
                iter.hasNext(); iter.next()
Qux.java 1.41:  o4.addListener(); System.out.println()
         1.42:  o4.removeListener()      ← this is a fix!
                                           Rank removeListener() patterns higher
T. Xie and J. Pei: Data Mining for Software Engineering       Adapted from Livshits et al.’s slides 100
Dynamic Validation
revision history mining: mine CVS histories → patterns → rank and filter

dynamic analysis: instrument relevant method calls → run the
application → post-process → usage patterns / error patterns /
unlikely patterns

reporting: report patterns, report bugs
                                                 Adapted from Livshits et al.’s slides
  T. Xie and J. Pei: Data Mining for Software Engineering                                             101
Matches and Mismatches
        Find and count matches and mismatches in the traces:

        o.register(d) followed by o.deregister(d)       → match
        o.register(d) with no matching o.deregister(d)  → mismatch

        Keep both static and dynamic counts.
T. Xie and J. Pei: Data Mining for Software Engineering   Adapted from Livshits et al.’s slides 102
Pattern Classification

Post-process with v validations and e violations:

           usage patterns:  e < v/10
           error patterns:  v/10 ≤ e ≤ 2v
        unlikely patterns:  otherwise
T. Xie and J. Pei: Data Mining for Software Engineering   Adapted from Livshits et al.’s slides 103
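The classification rule is small enough to transcribe directly:

```python
# Direct transcription of the slide's rule: with v dynamic validations
# and e violations, a pattern is a usage pattern, an error pattern,
# or an unlikely pattern.

def classify_pattern(v, e):
    if e < v / 10:
        return "usage"
    if v / 10 <= e <= 2 * v:
        return "error"
    return "unlikely"

print(classify_pattern(100, 5))   # usage: few violations
print(classify_pattern(100, 50))  # error: violations comparable to validations
print(classify_pattern(10, 99))   # unlikely: mostly violated
```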
Experiments
                      jEdit        Eclipse
since                 2000         2001
developers            92           112
lines of code         700,000      2,900,000
revisions             40,000       400,000

Total: 56 patterns
T. Xie and J. Pei: Data Mining for Software Engineering        Adapted from Livshits et al.’s slides 104
Case Studies
• MAPO: mining API usages from open source
  repositories [Xie&Pei 06]
   • Code bases → sequence analysis → programming
• DynaMine: finding common error patterns by mining
  software revision histories [Livshits&Zimmermann 05]
   • Change history → association rules → defect detection
• BugTriage: Who should fix this bug? [Anvik et al. 06]
   • Bug reports → classification → debugging



  T. Xie and J. Pei: Data Mining for Software Engineering                            105
Assigning a Bug
• Many considerations
   – who has the expertise?
   – who is available?
   – how quickly does this have to be fixed?


• Not always an obvious or correct assignment
   – multiple developers may be suitable
   – difficult to know what the bug is about
   – bug fixes get delayed
         • triage and fix rate indicates ‘liveness’ of OSS projects

T. Xie and J. Pei: Data Mining for Software Engineering   Adapted from Anvik et al.’s slides 106
Assigning a Bug Today




                 bill@firefox.org




T. Xie and J. Pei: Data Mining for Software Engineering   Adapted from Anvik et al.’s slides 107
Recommending assignment




                  bill@firefox.com
                  ted@gmail.com
                  cindy-loo@whoville.org




T. Xie and J. Pei: Data Mining for Software Engineering   Adapted from Anvik et al.’s slides 108
Overview of approach
             Approach tuned using Eclipse and Firefox

  Resolved Bug Reports → Machine Learning Algorithm →
  Assignment Recommender → bill@firefox.com, ted@gmail.com,
  cindy-loo@whoville.org
 T. Xie and J. Pei: Data Mining for Software Engineering   Adapted from Anvik et al.’s slides 109
Steps to the approach
1. Characterize the reports

2. Label the reports

3. Select the reports

4. Use a machine learning algorithm


T. Xie and J. Pei: Data Mining for Software Engineering   Adapted from Anvik et al.’s slides 110
Step 1: Characterizing a report
• Based on two fields
   – textual summary
   – description
• Use text categorization approach
   – represent with a word vector
   – remove stop words
   – intra- and inter-document frequency



T. Xie and J. Pei: Data Mining for Software Engineering   Adapted from Anvik et al.’s slides 111
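Step 1 can be sketched as a simple bag-of-words conversion. The stop-word list and report text below are illustrative, and the real approach additionally applies intra- and inter-document frequency weighting.

```python
# Represent a bug report's summary and description as a word vector
# after removing stop words (toy stop-word list and report text).

STOP_WORDS = {"the", "a", "is", "on", "when", "in", "and", "to"}

def word_vector(text):
    vector = {}
    for word in text.lower().split():
        word = word.strip(".,;:!?")     # drop trailing punctuation
        if word and word not in STOP_WORDS:
            vector[word] = vector.get(word, 0) + 1
    return vector

report = "Crash when opening the editor. The editor crashes on large files."
vec = word_vector(report)
print(vec["editor"])  # 2: term frequency within the report
```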
Step 2: Labeling a report
• Must determine who really fixed it
   – “Assigned-to” field is not accurate

• Project-specific heuristics




T. Xie and J. Pei: Data Mining for Software Engineering   Adapted from Anvik et al.’s slides 112
Step 2: Labeling a report
• Must determine who really fixed it
   – “Assigned-to” field is not accurate

• Project-specific heuristics
   – simple
      • If a report is FIXED, label with who marked it as
        fixed. (Eclipse)
      • If a report is DUPLICATE, use the label of the report
        it duplicates. (Eclipse and Firefox)
 T. Xie and J. Pei: Data Mining for Software Engineering   Adapted from Anvik et al.’s slides 113
Step 2: Labeling a report
• Must determine who really fixed it
   – “Assigned-to” field is not accurate

• Project-specific heuristics
   – simple
   – complex (Firefox): if the report is FIXED and has
     attachments approved by a reviewer, then
      – If one submitter of patches, use their name.
      – If more than one submitter, choose the name of who
        submitted the most patches.
      – If submitters cannot be determined, label with the
        person assigned to the report.
T. Xie and J. Pei: Data Mining for Software Engineering   Adapted from Anvik et al.’s slides 114
Step 2: Labeling a report
• Must determine who really fixed it
   – “Assigned-to” field is not accurate

• Project-specific heuristics
   – simple
   – complex
   – unclassifiable (Firefox): reports marked as WONTFIX are
     often resolved after discussion and developer consensus.
      – Unknown who would have fixed the bug
      – Report is labeled unclassifiable
T. Xie and J. Pei: Data Mining for Software Engineering   Adapted from Anvik et al.’s slides 115
Step 2: Labeling a report
• Must determine who really fixed it
   – “Assigned-to” field is not accurate

• Project-specific heuristics
                         Eclipse   Firefox
   – simple                 5         4
   – complex                2         1
   – unclassifiable         1         4
T. Xie and J. Pei: Data Mining for Software Engineering   Adapted from Anvik et al.’s slides 116
Step 3: Selecting the reports
• Exclude those with no label

• Include those of active developers
   – developer profiles

  [Two charts: monthly report counts per developer, Sep-04 through
  Apr-05, y-axis 0–40]
     T. Xie and J. Pei: Data Mining for Software Engineering                                         Adapted from Anvik et al.’s slides 117
Step 3: Selecting the reports

  [Chart: monthly report counts, Sep-04 through Apr-05, y-axis 0–40,
  with an activity threshold of 3 reports / month]
T. Xie and J. Pei: Data Mining for Software Engineering       Adapted from Anvik et al.’s slides 118
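The activity cut-off above can be applied to developer profiles with a simple filter; the profile structure (developer → monthly report counts) is a hypothetical stand-in:

```python
def active_developers(profiles, min_reports_per_month=3):
    """Keep developers whose average monthly report activity meets the
    threshold; `profiles` maps developer -> list of monthly counts."""
    return {
        dev for dev, monthly_counts in profiles.items()
        if sum(monthly_counts) / len(monthly_counts) >= min_reports_per_month
    }

# Illustrative profiles, Sep-04 .. Apr-05 (hypothetical data)
profiles = {
    "paulw":  [4, 6, 5, 3, 7, 2, 4, 5],
    "tryder": [1, 0, 2, 1, 0, 1, 0, 1],
}
```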
Step 4: Use an ML algorithm
• Supervised Algorithms
   – Naïve Bayes
   – C4.5
   – Support Vector Machines
• Unsupervised Algorithms
   – Expectation Maximization
• Incremental Algorithms
   – Naïve Bayes

T. Xie and J. Pei: Data Mining for Software Engineering   Adapted from Anvik et al.’s slides 119
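As a sketch of the supervised setting, a toy multinomial naïve Bayes classifier can map the words of a bug report to a developer label. This is a minimal from-scratch illustration, not the implementation evaluated in the study:

```python
import math
from collections import Counter, defaultdict

class NaiveBayesTriager:
    """Toy multinomial naive Bayes: words of a bug report -> developer."""

    def fit(self, reports, developers):
        self.word_counts = defaultdict(Counter)  # developer -> word counts
        self.class_counts = Counter(developers)  # developer -> #reports
        self.vocab = set()
        for text, dev in zip(reports, developers):
            words = text.lower().split()
            self.word_counts[dev].update(words)
            self.vocab.update(words)
        return self

    def predict(self, text):
        words = text.lower().split()
        total = sum(self.class_counts.values())
        best, best_score = None, -math.inf
        for dev, n in self.class_counts.items():
            # log prior + log likelihoods with add-one smoothing
            score = math.log(n / total)
            denom = sum(self.word_counts[dev].values()) + len(self.vocab)
            for w in words:
                score += math.log((self.word_counts[dev][w] + 1) / denom)
            if score > best_score:
                best, best_score = dev, score
        return best

# Hypothetical training data
triager = NaiveBayesTriager().fit(
    ["crash in layout engine", "layout overflow bug", "ssl handshake fails"],
    ["paulw", "paulw", "tryder"])
triager.predict("layout crash")  # predicts "paulw"
```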
Evaluating Recommenders
Precision = (# of relevant recommendations) / (# of recommendations made)

Recall = (# of relevant recommendations) / (# of possibly relevant developers)

 How do we find this?
 T. Xie and J. Pei: Data Mining for Software Engineering   Adapted from Anvik et al.’s slides 120
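Both measures follow directly from the formulas above; a minimal helper, with illustrative developer names:

```python
def precision_recall(recommended, relevant):
    """Precision and recall of a recommendation list against the set
    of possibly relevant developers."""
    recommended, relevant = set(recommended), set(relevant)
    hits = len(recommended & relevant)
    precision = hits / len(recommended) if recommended else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```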
Determining Possibly Relevant Developers

  [Diagram] Fixed bug report
     → modules touched by the fix (Module C, Module J, Module Q, …)
     → CVS usernames from the CVS repository (paulw, tryder, stibbs, …)
     → bug repository usernames / email addresses
       (paulw@..., tryder@..., vendger@..., …)
  T. Xie and J. Pei: Data Mining for Software Engineering            Adapted from Anvik et al.’s slides 121
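The resolution pipeline on this slide — from the modules a fix touched, through CVS committer usernames, to bug-repository email addresses — can be sketched as follows; the sample data are hypothetical stand-ins for the CVS and bug repositories:

```python
def possibly_relevant_developers(fix_modules, module_committers,
                                 username_to_email):
    """Resolve the CVS usernames who committed to the modules touched
    by a fix into bug-repository email addresses."""
    emails = set()
    for module in fix_modules:
        for username in module_committers.get(module, ()):
            email = username_to_email.get(username)
            if email:
                emails.add(email)
    return emails

# Hypothetical repository data
module_committers = {"Module C": ["paulw"], "Module J": ["tryder", "stibbs"]}
username_to_email = {"paulw": "paulw@example.org",
                     "tryder": "tryder@example.org",
                     "stibbs": "vendger@example.org"}
```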
Still not Straightforward (e.g., Firefox)

  [Diagram] Bug report → module list (Module C, Module J, Module Q, …)
     → CVS usernames (paulw, tryder, stibbs, …)
     → each CVS username fans out to a set of patch submitters’
       email addresses, e.g.:
        – paulw@..., bboggs@..., vendger@..., …
        – tryder@..., axelf@..., vendger@..., …
        – jlpicard@..., bchater@..., kpollac@..., …
 T. Xie and J. Pei: Data Mining for Software Engineering            Adapted from Anvik et al.’s slides 122
Precision vs. Recall

  A small set of “right” developers (precision) is more important than
  the set of all possible developers (recall)

  [Two bar charts, y-axis 0–100%: precision (left) and recall (right)
  of Multi. NB, C4.5, and SVM on Eclipse, Firefox, and gcc]
  T. Xie and J. Pei: Data Mining for Software Engineering                   Adapted from Anvik et al.’s slides 123
Overview

programming      defect detection      testing      debugging      maintenance
                        software engineering tasks


          classification      association/patterns      clustering      etc.
                           data mining techniques


   code bases    change history    program states    structural entities    bug reports
                         software engineering data
  T. Xie and J. Pei: Data Mining for Software Engineering                                        124
Conclusions
• Software development generates a large
  amount of different types of data
• Data mining and data analysis can help
  software engineering substantially
• Successful cases
   – What software engineering data can be mined?
   – What software engineering tasks can be helped?
   – How to conduct the mining?

T. Xie and J. Pei: Data Mining for Software Engineering   125
Challenges
• Complexity in software development
   – Specific data mining techniques are needed
• Software development and maintenance are
  dynamic and user-centered
   – Interactive data mining
   – Visual data mining and analysis
   – Online, incremental mining



T. Xie and J. Pei: Data Mining for Software Engineering   126
Questions?


Mining Software Engineering Data Bibliography
http://ase.csc.ncsu.edu/dmse/
• What software engineering tasks can be helped by data mining?
• What kinds of software engineering data can be mined?
• How are data mining techniques used in software engineering?
• Resources

More Related Content

PPTX
HotSoS16 Tutorial "Text Analytics for Security" by Tao Xie and William Enck
PDF
Software Analytics: Towards Software Mining that Matters
PDF
Software bug prediction
PDF
Software Analytics: Data Analytics for Software Engineering
PDF
PPTX
Survey on Software Defect Prediction
PPTX
Big(ger) Data in Software Engineering
PDF
Software Mining and Software Datasets
HotSoS16 Tutorial "Text Analytics for Security" by Tao Xie and William Enck
Software Analytics: Towards Software Mining that Matters
Software bug prediction
Software Analytics: Data Analytics for Software Engineering
Survey on Software Defect Prediction
Big(ger) Data in Software Engineering
Software Mining and Software Datasets

What's hot (7)

PPTX
Transferring Software Testing Tools to Practice
PDF
Survey on Software Defect Prediction (PhD Qualifying Examination Presentation)
PDF
Survey on Software Defect Prediction
PDF
User Expectations in Mobile App Security
PDF
Proactive Empirical Assessment of New Language Feature Adoption via Automated...
PPTX
Toward a Traceable, Explainable and fair JD/Resume Recommendation System
PDF
Wcre13b.ppt
Transferring Software Testing Tools to Practice
Survey on Software Defect Prediction (PhD Qualifying Examination Presentation)
Survey on Software Defect Prediction
User Expectations in Mobile App Security
Proactive Empirical Assessment of New Language Feature Adoption via Automated...
Toward a Traceable, Explainable and fair JD/Resume Recommendation System
Wcre13b.ppt
Ad

Viewers also liked (7)

PPT
Ejemplo de Aplicaciones en Weka
PDF
Fundamentos de Data Mining con R
DOCX
Mineria de datos
PPT
Presentación colombia junio 2011
PPTX
Mineria de datos
PPT
MIneria de datos
Ejemplo de Aplicaciones en Weka
Fundamentos de Data Mining con R
Mineria de datos
Presentación colombia junio 2011
Mineria de datos
MIneria de datos
Ad

Similar to Datamingse (20)

PDF
Mining Software Engineering Data
PPTX
Towards Reusable Research Software
PDF
Software Analytics - Achievements and Challenges
PDF
Precise and Complete Requirements? An Elusive Goal
PDF
Data Con LA 2018 - Towards Data Science Engineering Principles by Joerg Schad
PPT
Software Engineering Lec 1-introduction
PDF
01 - Course setup software sustainability
PPTX
Log Engineering: Towards Systematic Log Mining to Support the Development of ...
PPTX
Log Engineering: Towards Systematic Log Mining to Support the Development of ...
DOC
V1_I2_2012_Paper3.doc
PDF
Improvement of Software Maintenance and Reliability using Data Mining Techniques
PPTX
Introduction Software engineering
PPTX
ACM Chicago March 2019 meeting: Software Engineering and AI - Prof. Tao Xie, ...
PPTX
Intelligent Software Engineering: Synergy between AI and Software Engineering
PPSX
Scope of software engineering
PPTX
ISO 15926 Reference Data Engineering Methodology
PPTX
Interactive SDLC
PDF
Software Engineering Research: Leading a Double-Agent Life.
PDF
AI for Software Engineering
PPTX
Big Data: the weakest link
Mining Software Engineering Data
Towards Reusable Research Software
Software Analytics - Achievements and Challenges
Precise and Complete Requirements? An Elusive Goal
Data Con LA 2018 - Towards Data Science Engineering Principles by Joerg Schad
Software Engineering Lec 1-introduction
01 - Course setup software sustainability
Log Engineering: Towards Systematic Log Mining to Support the Development of ...
Log Engineering: Towards Systematic Log Mining to Support the Development of ...
V1_I2_2012_Paper3.doc
Improvement of Software Maintenance and Reliability using Data Mining Techniques
Introduction Software engineering
ACM Chicago March 2019 meeting: Software Engineering and AI - Prof. Tao Xie, ...
Intelligent Software Engineering: Synergy between AI and Software Engineering
Scope of software engineering
ISO 15926 Reference Data Engineering Methodology
Interactive SDLC
Software Engineering Research: Leading a Double-Agent Life.
AI for Software Engineering
Big Data: the weakest link

Recently uploaded (20)

PPTX
Configure Apache Mutual Authentication
PDF
Improvisation in detection of pomegranate leaf disease using transfer learni...
PDF
Taming the Chaos: How to Turn Unstructured Data into Decisions
PDF
sustainability-14-14877-v2.pddhzftheheeeee
PDF
Developing a website for English-speaking practice to English as a foreign la...
PDF
1 - Historical Antecedents, Social Consideration.pdf
PDF
Getting started with AI Agents and Multi-Agent Systems
PPTX
Custom Battery Pack Design Considerations for Performance and Safety
PPTX
TEXTILE technology diploma scope and career opportunities
PDF
Architecture types and enterprise applications.pdf
PPTX
Final SEM Unit 1 for mit wpu at pune .pptx
PDF
Hybrid horned lizard optimization algorithm-aquila optimizer for DC motor
PDF
Zenith AI: Advanced Artificial Intelligence
PPTX
GROUP4NURSINGINFORMATICSREPORT-2 PRESENTATION
PDF
ENT215_Completing-a-large-scale-migration-and-modernization-with-AWS.pdf
PDF
The influence of sentiment analysis in enhancing early warning system model f...
PDF
NewMind AI Weekly Chronicles – August ’25 Week III
PDF
STKI Israel Market Study 2025 version august
PDF
OpenACC and Open Hackathons Monthly Highlights July 2025
PPTX
The various Industrial Revolutions .pptx
Configure Apache Mutual Authentication
Improvisation in detection of pomegranate leaf disease using transfer learni...
Taming the Chaos: How to Turn Unstructured Data into Decisions
sustainability-14-14877-v2.pddhzftheheeeee
Developing a website for English-speaking practice to English as a foreign la...
1 - Historical Antecedents, Social Consideration.pdf
Getting started with AI Agents and Multi-Agent Systems
Custom Battery Pack Design Considerations for Performance and Safety
TEXTILE technology diploma scope and career opportunities
Architecture types and enterprise applications.pdf
Final SEM Unit 1 for mit wpu at pune .pptx
Hybrid horned lizard optimization algorithm-aquila optimizer for DC motor
Zenith AI: Advanced Artificial Intelligence
GROUP4NURSINGINFORMATICSREPORT-2 PRESENTATION
ENT215_Completing-a-large-scale-migration-and-modernization-with-AWS.pdf
The influence of sentiment analysis in enhancing early warning system model f...
NewMind AI Weekly Chronicles – August ’25 Week III
STKI Israel Market Study 2025 version august
OpenACC and Open Hackathons Monthly Highlights July 2025
The various Industrial Revolutions .pptx

Datamingse

  • 1. Data Mining for Software Engineering Tao Xie Jian Pei North Carolina State University Simon Fraser University www.csc.ncsu.edu/faculty/xie www.cs.sfu.ca/~jpei xie@csc.ncsu.edu jpei@cs.sfu.ca An up-to-date version of this tutorial is available at http://ase.csc.ncsu.edu/dmse/dmse.pdf Attendants of the tutorial are kindly suggested to download the latest version 2-3 days before the tutorial
  • 2. Outline • Introduction • What software engineering tasks can be helped by data mining? • What kinds of software engineering data can be mined? • How are data mining techniques used in software engineering? • Case studies • Conclusions T. Xie and J. Pei: Data Mining for Software Engineering 2
  • 3. Introduction • A large amount of data is produced in software development – Data from software repositories – Data from program executions • Data mining techniques can be used to analyze software engineering data – Understand software artifacts or processes – Assist software engineering tasks T. Xie and J. Pei: Data Mining for Software Engineering 3
  • 4. Examples • Data in software development – Programming: versions of programs – Testing: execution traces – Deployment: error/bug reports – Reuse: open source packages • Software development needs data analysis – How should I use this class? – Where are the bugs? – How to implement a typical functionality? T. Xie and J. Pei: Data Mining for Software Engineering 4
  • 5. Overview programming defect detection testing debugging maintenance software engineering tasks helped by data mining association/ classification clustering … patterns data mining techniques code change program structural bug bases history states entities reports software engineering data T. Xie and J. Pei: Data Mining for Software Engineering 5
  • 6. Software Engineering Tasks • Programming • Static defect detection • Testing • Debugging • Maintenance T. Xie and J. Pei: Data Mining for Software Engineering 6
  • 7. Software Categorization – Why? • SourceForge hosts 70,000+ software systems – How can one find the software needed? – How can developers collaborate effectively? • Why software categorization? – SourceForge categorizes software according to their primary function (editors, databases, etc.) • Software foundries – related software – Keep developers informed about related software • Learn the “best practice” • Promote software reuse [Kawaguchi et al. 04] T. Xie and J. Pei: Data Mining for Software Engineering 7
  • 8. Software Categorization – What? • Organize software systems into categories – Software systems in each category share a somehow same theme – A software system may belong to one or multiple categories • What are the categories? – Defined by domain experts manually – Discovered automatically • Example system: MUDABlue [Kawaguchi et al. 04] T. Xie and J. Pei: Data Mining for Software Engineering 8
  • 9. Version Comparison and Search • What does the current code segment look like in previous versions? – How have they been changed over versions? • Using standard search tools, e.g., grep? – Source code may not be well documented – The code may be changed • Can we have some source code friendly search engines? – E.g., www.koders.com, corp.krugle.com, demo.spars.info T. Xie and J. Pei: Data Mining for Software Engineering 9
  • 10. Software Library Reuse • Issues in reusing software libraries – Which components should I use? – What is the right way to use? – Multiple components may often be used in combinations, e.g., Smalltalk’s Model/View/Controller • Frequent patterns help – Specifically, inheritance information is important – Example: most application classes inheriting from library class Widget tend to override its member function paint(); most application classes instantiating library class Painter and calling its member function begin() also call end() [Michail 99/00] T. Xie and J. Pei: Data Mining for Software Engineering 10
  • 11. API Usage • How should an API be used correctly? – An API may serve multiple functionalities – Different styles of API usage • “I know what type of object I need, but I don’t know how to write the code to get the object” [Mandelin et al. 05] – Can we synthesize jungloid code fragments automatically? – Given a simple query describing the desired code in terms of input and output types, return a code segment • “I know what method call I need, but I don’t know how to write code before and after this method call” [Xie & Pei 06] T. Xie and J. Pei: Data Mining for Software Engineering 11
  • 12. How Can Data Mining Help? • Identify characteristic usage of the library automatically • Understand the reuse of library classes from real-life applications instead of toy programs • Keep reuse patterns up to date w.r.t. the most recent version of the library and applications • General patterns may cover inheritance cases T. Xie and J. Pei: Data Mining for Software Engineering 12
  • 13. Software Engineering Tasks • Programming • Static defect detection • Testing • Debugging • Maintenance T. Xie and J. Pei: Data Mining for Software Engineering 13
  • 14. Locating Matching Method Calls • Many bugs due to unmatched method calls – E.g., fail to call free() to deallocate a data structure – One-line-code-changes: many bugs can be fixed by changing only one line in source code • Problem: how to find highly correlated pairs of method calls – E.g., <fopen, fclose>, <malloc, free> [Li&Zhou 05, Livshits&Zimmermann 05, Yang et al. 06] T. Xie and J. Pei: Data Mining for Software Engineering 14
  • 15. Inferring Errors from Source Code • A system must follow some correctness rules – Unfortunately, the rules are documented or specified in an ad hoc manner • Deriving the rules requires a lot of a priori knowledge • Can we detect some errors without knowing the rules by data mining? [Engler et al. 01] T. Xie and J. Pei: Data Mining for Software Engineering 15
  • 16. Inference in Large Systems • Execution traces inferred properties static checker • Inference algorithms need to be scalable with the size of the programs and the input traces • Due to only imperfect traces available in industrial environments, how to use those imperfect traces • Many inferred properties may be uninteresting; it is hard for a developer to review those properties thoroughly for large programs [Yang et al. 06] T. Xie and J. Pei: Data Mining for Software Engineering 16
  • 17. Detecting Copy-Paste and Bugs • Copy-pasted code is common in large systems – Code reuse • Prone to bugs – E.g., identifiers are not changed consistently • How to detect copy-paste code? – How to scale up to large software? – How to handle minor modifications? [Li et al. 04] T. Xie and J. Pei: Data Mining for Software Engineering 17
  • 18. Software Engineering Tasks • Programming • Static defect detection • Testing • Debugging • Maintenance T. Xie and J. Pei: Data Mining for Software Engineering 18
  • 19. Inspecting Test Behavior • Automatically generated tests or field executions lack test oracles – Sample/summarize behavior for inspection • Examples: – Select tests (executions/outputs) for inspection • E.g., clustering path/branch profiles [Podgurski et al. 01, Bowring et al. 04] – Summarize object behavior [Xie&Notkin 04, Dallmeier et al. 06] T. Xie and J. Pei: Data Mining for Software Engineering 19
  • 20. Mining Object Behavior • Can we find the undocumented behavior of classes? It may not be observed from program source code directly Behavior model for JAVA Vector class. Picture from “Mining object behavior with ADABU” [Dallmeier et al. WODA 06] T. Xie and J. Pei: Data Mining for Software Engineering 20
  • 21. Mining Specifications • Specifications are very useful for testing – test generation + test oracle • Major obstacle: protocol specifications are often unavailable – Example: what is the right way to use the socket API? • How can data mining help? – If a protocol is held in well tested programs (i.e., their executions), the protocol is likely valid [Ammons et al. 02] T. Xie and J. Pei: Data Mining for Software Engineering 21
  • 22. Specification Helps Specification Does the above code follow the correct socket API protocol? [Ammons et al. 02] T. Xie and J. Pei: Data Mining for Software Engineering 22
  • 23. Software Engineering Tasks • Programming • Static defect detection • Testing • Debugging • Maintenance T. Xie and J. Pei: Data Mining for Software Engineering 23
  • 24. Fault Localization • Running tests produces execution traces – Some tests fail and the other tests pass • Given many execution traces generated by tests, can we suggest likely faulty statements? [Liblit et al. 03/05, Liu et al. 05] – Some traces may lead to program failures – It would be better if we can even suggest the likeliness of a statement being faulty • For large programs, how can we collect traces effectively? • What if there are multiple faults? T. Xie and J. Pei: Data Mining for Software Engineering 24
  • 25. Analyzing Bug Repositories • Most open source software development projects have bug repositories – Report and track problems and potential enhancements – Valuable information for both developers and users • Bug repositories are often messy – Duplicate error reports; Related errors • Challenge: how to analyze effectively? – Who are reporting and at what rate? – How are reports resolved and by whom? • Automatic bug report assignment & duplicate detection [Anvik et al. 06] T. Xie and J. Pei: Data Mining for Software Engineering 25
  • 26. Stabilizing Buggy Applications • Users may report bugs in a program, can those bug reports be used to prevent the program from crashing? – When a user attempts an action that led to some errors before, a warning should be issued • Given a program state S and an event e, predict whether e likely results in a bug – Positive samples: past bugs – Negative samples: “not bug” reports [Michail&Xie 05] T. Xie and J. Pei: Data Mining for Software Engineering 26
  • 27. Software Engineering Tasks • Programming • Static defect detection • Testing • Debugging • Maintenance T. Xie and J. Pei: Data Mining for Software Engineering 27
  • 28. Guiding Software Changes • Programmers start changing some locations – Suggest locations that other programmers have changed together with this location E.g., “Programmers who changed this function also changed …” • Mine association rules from change histories – coarse-granular entities: directories, modules, files – fine-granular entities: methods, variables, sections [Zimmermann et al. 04, Ying et al. 04] T. Xie and J. Pei: Data Mining for Software Engineering 28
  • 29. Aspect Mining • Discover crosscutting concerns that can be potentially turned into one place (an aspect in aspect-oriented programs) – E.g., logging, timing, communication • Mine recurring execution patterns – Event traces [Breu&Krinke 04, Tonella&Ceccato 04] – Source code [Shepherd et al. 05] T. Xie and J. Pei: Data Mining for Software Engineering 29
  • 30. Software Engineering Data • Static code bases • Software change history • Profiled program states • Profiled structural entities • Bug reports T. Xie and J. Pei: Data Mining for Software Engineering 30
  • 31. Code Entities • Identifiers within a system [Kawaguchi et al. 04] – E.g., variable names, function names • Statement sequence within a basic block [Li et al. 04] – E.g., variables, operators, constants, functions, keywords • Element set within a function [Li&Zhou 05] – E.g., functions, variables, data types • Call sites within a function [Xie&Pei 05] • API signatures [Mandelin et al. 05] [Mandelin et al. 05] http://snobol.cs.berkeley.edu/prospector/index.jsp T. Xie and J. Pei: Data Mining for Software Engineering 31
  • 32. Relationships btw Code Entities • Membership relationships – A class contains membership functions • Reuse relationships – Class inheritance – Class instantiation – Function invocations – Function overriding [Michail 99/00] http://codeweb.sourceforge.net/ for C++ T. Xie and J. Pei: Data Mining for Software Engineering 32
  • 33. Software Engineering Data • Static code bases • Software change history • Profiled program states • Profiled structural entities • Bug reports T. Xie and J. Pei: Data Mining for Software Engineering 33
  • 34. Concurrent Versions System (CVS) Comments [Chen et al. 01] http://cvssearch.sourceforge.net/ T. Xie and J. Pei: Data Mining for Software Engineering 34
  • 35. CVS Comments RCS files:/repository/file.h,v • cvs log – displays Working file: file.h head: 1.5 ... for all revisions and description: ---------------------------- its comments for each Revision 1.5 Date: ... file cvs comment ... ---------------------------- ... … • cvs diff – shows RCS file: /repository/file.h,v … differences between 9c9,10 < old line different versions of a --- > new line > another new line file [Chen et al. 01] http://cvssearch.sourceforge.net/ T. Xie and J. Pei: Data Mining for Software Engineering 35
  • 36. Code Version Histories • CVS provides file versioning – Group individual per-file changes into individual transactions (atomic change sets): checked in by the same author with the same check-in comment close in time • CVS manages only files and line numbers – Associate syntactic entities with line ranges • Filter out long transactions not corresponding to meaningful atomic changes – E.g., feature requests, bug fixes, branch merging [Ying et al. 04] [Zimmermann et al. 04] http://www.st.cs.uni-sb.de/softevo/erose/ T. Xie and J. Pei: Data Mining for Software Engineering 36
  • 37. Software Engineering Data • Static code bases • Software change history • Profiled program states • Profiled structural entities • Bug reports T. Xie and J. Pei: Data Mining for Software Engineering 37
  • 38. Method-Entry/Exit States • State of an object – Values of transitively reachable fields • Method-entry state – Receiver-object state, method argument values • Method-exit state – Receiver-object state, updated method argument values, method return value [Ernst et al. 02] http://pag.csail.mit.edu/daikon/ [Dallmeier et al. 06] http://www.st.cs.uni-sb.de/models/ [Henkel&Diwan 03] [Xie&Notkin 04/05] T. Xie and J. Pei: Data Mining for Software Engineering 38
  • 39. Other Profiled Program States • Values of variables at certain code locations [Hangal&Lam 02] – Object/static field read/write – Method-call arguments – Method returns • Sampled predicates on values of variables [Liblit et al. 03/05] [Hangal&Lam 02] http://diduce.sourceforge.net/ [Liblit et al. 03/05] http://www.cs.wisc.edu/cbi/ T. Xie and J. Pei: Data Mining for Software Engineering 39
  • 40. Software Engineering Data • Static code bases • Software change history • Profiled program states • Profiled structural entities • Bug reports T. Xie and J. Pei: Data Mining for Software Engineering 40
  • 41. Executed Structural Entities • Executed branches/paths, def-use pairs • Executed function/method calls – Group methods invoked on the same object • Profiling options – Execution hit vs. count – Execution order (sequences) [Dallmeier et al. 05] http://www.st.cs.uni-sb.de/ample/ More related tools: http://www.csc.ncsu.edu/faculty/xie/research.htm#related T. Xie and J. Pei: Data Mining for Software Engineering 41
  • 42. Software Engineering Data • Static code bases • Software change history • Profiled program states • Profiled structural entities • Bug reports T. Xie and J. Pei: Data Mining for Software Engineering 42
  • 43. Processing Bug Reports User Triager Developer Bug Report Bug Duplicate Works Invalid Won’t Repository For Me Fix T. Xie and J. Pei: Data Mining for Software Engineering Adapted from Anvik et al.’s slides 43
  • 44. Sample Bugzilla Bug Report • Bug report image • Overlay the triage questions Assigned To: ? Assignment? Duplicate? Reproducible? Bugzilla: open source bug tracking tool http://www.bugzilla.org/ [Anvik et al. 06] http://www.cs.ubc.ca/labs/spl/projects/bugTriage.html T. Xie and J. Pei: Data Mining for Software Engineering Adapted from Anvik et al.’s slides 44
  • 45. Eclipse Bug Data • Defect counts are listed as count at the plug-in, package and compilationunit levels. • The value field contains the actual number of pre- ("pre") and post-release defects ("post"). • The average ("avg") and maximum ("max") values refer to the defects found in the compilation units ("compilationunits"). [Schröter et al. 06] http://www.st.cs.uni-sb.de/softevo/bug-data/eclipse/ T. Xie and J. Pei: Data Mining for Software Engineering 45
• 46. Data Mining Techniques in SE
• Association rules and frequent patterns
• Classification
• Clustering
• Misc.
• 47. Frequent Itemsets
• Itemset: a set of items
  – E.g., acm = {a, c, m}
• Support of itemsets
  – Sup(acm) = 3
• Given min_sup = 3, acm is a frequent pattern
• Frequent pattern mining: find all frequent patterns in a database

Transaction database TDB
  TID   Items bought
  100   f, a, c, d, g, i, m, p
  200   a, b, c, f, l, m, o
  300   b, f, h, j, o
  400   b, c, k, s, p
  500   a, f, c, e, l, p, m, n
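The support computation on this slide can be sketched in a few lines of Python; the transaction database below is the TDB shown above.

```python
# A minimal sketch of itemset support counting over the slide's
# transaction database TDB (transaction IDs 100..500).
tdb = {
    100: {"f", "a", "c", "d", "g", "i", "m", "p"},
    200: {"a", "b", "c", "f", "l", "m", "o"},
    300: {"b", "f", "h", "j", "o"},
    400: {"b", "c", "k", "s", "p"},
    500: {"a", "f", "c", "e", "l", "p", "m", "n"},
}

def support(itemset, db=tdb):
    """Number of transactions containing every item of the itemset."""
    s = set(itemset)
    return sum(1 for items in db.values() if s <= items)

# Sup(acm) = 3, so with min_sup = 3 the itemset {a, c, m} is frequent.
```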
• 48. Association Rules
• (Time ∈ {Fri, Sat}) ∧ buy(X, diaper) → buy(X, beer)
  – Dads taking care of babies on weekends drink beer
• Itemsets should be frequent
  – So the rule can be applied extensively
• Rules should be confident
  – With strong prediction capability
• 49. A Road Map
• Boolean vs. quantitative associations
  – buys(x, "SQLServer") ∧ buys(x, "DMBook") → buys(x, "DM Software") [0.2%, 60%]
  – age(x, "30..39") ∧ income(x, "42..48K") → buys(x, "PC") [1%, 75%]
• Single-dimensional vs. multi-dimensional associations
• Single-level vs. multiple-level analysis
  – What brands of beers are associated with what brands of diapers?
• 50. Frequent Pattern Mining Methods
• Apriori and its variations/improvements
• Mining frequent patterns without candidate generation
• Mining max-patterns and closed itemsets
• Mining multi-dimensional, multi-level frequent patterns with flexible support constraints
• Interestingness: correlation and causality
• 51. A Simple Case
• Finding highly correlated method-call pairs
• Confidence of pairs helps
  – Conf(<a,b>) = support(<a,b>) / support(<a,a>)
• Check the revisions (fixes to bugs); find the pairs of method calls whose confidences are improved dramatically by frequently added fixes
  – Those are the matching method-call pairs that may often be violated by programmers
[Livshits&Zimmermann 05]
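The pair-confidence measure can be sketched as follows; the revision data is made up for illustration, with each revision modeled as the set of method calls it adds, and support(<a,a>) read as the support of call a alone.

```python
# A hedged sketch of pair confidence over revisions (example data
# is illustrative, not from the study).
revisions = [
    {"addListener", "removeListener"},
    {"addListener", "removeListener", "println"},
    {"addListener"},
    {"iterator", "hasNext", "next"},
]

def sup(calls):
    """Number of revisions whose added calls include all given calls."""
    return sum(1 for rev in revisions if set(calls) <= rev)

def conf(a, b):
    """Conf(<a,b>) = support(<a,b>) / support(<a,a>)."""
    return sup({a, b}) / sup({a})

# addListener/removeListener co-occur in 2 of the 3 revisions
# that add addListener, so conf = 2/3.
```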
• 52. Conflicting Patterns
• 999 out of 1000 times spin_unlock follows spin_lock
  – The single time that spin_unlock does not follow is likely an error
• We can detect an error without knowing the correctness rule
[Li&Zhou 05, Livshits&Zimmermann 05, Yang et al. 06]
• 53. Frequent Library Reuse Patterns
• Items: classes, member functions, reuse relationships (e.g., inheritance, overriding, instantiation)
• Transactions: for every application class A, the set of all items that are involved in a reuse relationship with A
• Pruning
  – Uninteresting rules, e.g., a rule that holds for every class
  – Misleading rules, e.g., x∧y → z (conf: 60%) is pruned if y → z (conf: 80%)
  – Statistically insignificant rules: prune rules with a high p-value
• Constrained rules
  – Rules involving a particular class
  – Rules that are violated in a particular application
[Michail 99/00]
• 54. MAPO: Mining Frequent API Patterns
[Xie&Pei 06]
• 55. Sequential Pattern Mining in MAPO
• Use BIDE [Wang&Han 04] to mine closed sequential patterns from the preprocessed method-call sequences
• Postprocessing in MAPO
  – Remove frequent sequences that do not contain the entities interesting to the user
  – Compress consecutive calls of the same method into one
  – Remove duplicate frequent sequences after the compression
  – Remove frequent sequences that are subsequences of some other frequent sequences
[Xie&Pei 06]
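Two of the postprocessing heuristics above can be sketched like this; the function names and the exact filtering order are assumptions, not MAPO's actual implementation.

```python
# A sketch of two MAPO-style postprocessing heuristics: compress
# consecutive calls of the same method, then keep only maximal
# frequent sequences (drop subsequences and duplicates).
def compress(seq):
    """Collapse runs of the same method call into a single call."""
    out = []
    for call in seq:
        if not out or out[-1] != call:
            out.append(call)
    return out

def is_subseq(short, long_):
    """True if `short` is a (non-contiguous) subsequence of `long_`."""
    it = iter(long_)
    return all(any(c == x for x in it) for c in short)

def keep_maximal(seqs):
    uniq = []
    for s in (compress(s) for s in seqs):
        if s not in uniq:  # remove duplicates after compression
            uniq.append(s)
    return [s for s in uniq
            if not any(s != t and is_subseq(s, t) for t in uniq)]
```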
• 56. Detecting Copy-Paste Code
• Apply closed sequential pattern mining techniques
• Customizing the techniques
  – A copy-paste segment typically does not have big gaps – use a maximum gap threshold to control
  – Output the instances of patterns (i.e., the copy-pasted code segments) instead of the patterns
  – Use small copy-pasted segments to form larger ones
  – Prune false positives: tiny segments, unmappable segments, overlapping segments, and segments with large gaps
[Li et al. 04]
• 57. Find Bugs in Copy-Pasted Segments
• For two copy-pasted segments, are the modifications consistent?
  – Identifier a in segment S1 is changed to b in segment S2 three times, but remains unchanged once – likely a bug
  – The heuristic may not be right all the time
• The lower the unchanged rate of an identifier, the more likely there is a bug
[Li et al. 04]
• 58. Approximate Patterns for Inferences
• Use an alternating template to find interesting properties
  – Example: template (PS)*; an instance: loc.acq loc.rel
• Handling imperfect traces
  – Instead of requiring perfect matches, check the ratio of matching
  – Explore contexts of matching
[Yang et al. 06]
• 59. Context Handling
Figure from "Perracotta: mining temporal API rules from imperfect traces", in [Yang et al. ICSE'06]
• 60. Cross-Checking of Execution Traces
• Mine association rules or sequential patterns S → F, where S is a statement and F is the status of program failure
• The higher the confidence, the more likely S is faulty or related to a fault
• Using only one statement on the left side of the rule can be misleading, since a fault may be caused by a combination of statements
  – Frequent patterns can be used to improve this
[Denmat et al. 05]
• 61. Emerging Patterns of Traces
• A method executed only in failing runs is likely to point to the defect
  – Comparing the coverage of passing and failing program runs helps
• Mining patterns frequent in failing program runs but infrequent in passing program runs
  – Sequential patterns may be used
[Dallmeier et al. 05, Denmat et al. 05, Yang et al. 06]
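The simplest form of this coverage comparison can be sketched as follows; the run data and method names are illustrative only.

```python
# A minimal sketch of coverage comparison: methods executed only
# in failing runs are suspicious (the run data is made up).
passing_runs = [{"open", "read", "close"}, {"open", "read"}]
failing_runs = [{"open", "read", "parseHeader"}, {"open", "parseHeader"}]

def suspicious_methods(passing, failing):
    """Methods covered by some failing run but by no passing run."""
    executed_passing = set().union(*passing)
    executed_failing = set().union(*failing)
    return executed_failing - executed_passing
```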
• 62. Learning Object Behavior
• Extracting models
  – A static analysis identifies all side-effect-free methods in the program
  – Some side-effect-free methods are selected as inspectors
  – The program is executed and inspectors are called to extract information about an object's state – a vector of inspector values
• Merge models of all objects in a program
[Dallmeier et al. 06]
• 63. Data Mining Techniques in SE
• Association rules and frequent patterns
• Classification
• Clustering
• Misc.
• 64. Classification: A 2-Step Process
• Model construction: describe a set of predetermined classes
  – Training dataset: tuples for model construction
    • Each tuple/sample belongs to a predefined class
  – Classification rules, decision trees, or math formulae
• Model application: classify unseen objects
  – Estimate accuracy of the model using an independent test set
  – If the accuracy is acceptable, apply the model to classify tuples with unknown class labels
• 65. Model Construction
Training Data → Classification Algorithms → Classifier (Model)
  Name   Rank        Years   Tenured
  Mike   Ass. Prof   3       No
  Mary   Ass. Prof   7       Yes
  Bill   Prof        2       Yes
  Jim    Asso. Prof  7       Yes
  Dave   Ass. Prof   6       No
  Anne   Asso. Prof  3       No
Learned model: IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
• 66. Model Application
Testing Data → Classifier → label for unseen data, e.g., (Jeff, Professor, 4) → Tenured?
  Name     Rank        Years   Tenured
  Tom      Ass. Prof   2       No
  Merlisa  Asso. Prof  7       No
  George   Prof        5       Yes
  Joseph   Ass. Prof   7       Yes
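The model-application step on these two slides can be sketched directly from the learned rule; the function name is just for illustration.

```python
# A sketch of applying the learned classifier from slide 65:
# IF rank = 'professor' OR years > 6 THEN tenured = 'yes'.
def predict_tenured(rank, years):
    return "yes" if rank == "Prof" or years > 6 else "no"

# Unseen tuple from slide 66: (Jeff, Professor, 4) -> tenured = 'yes'.
```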
• 67. Supervised vs. Unsupervised Learning
• Supervised learning (classification)
  – Supervision: objects in the training data set have labels
  – New data is classified based on the training set
• Unsupervised learning (clustering)
  – The class labels of training data are unknown
  – Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data
• 68. GUI-Application Stabilizer
• Given a program state S and an event e, predict whether e likely results in a bug
  – Positive samples: past bugs
  – Negative samples: "not bug" reports
• A k-NN based approach
  – Consider the k closest cases reported before
  – Compare Σ 1/d for bug cases and not-bug cases, where d is the distance between the current state and a reported state
  – If the current state is more similar to bugs, predict a bug
[Michail&Xie 05]
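The Σ 1/d vote can be sketched as a small weighted k-NN function; the distances and labels below are made up for illustration.

```python
# A hedged sketch of the k-NN vote: compare sum(1/d) over the
# k nearest bug and not-bug cases.
def knn_predict(neighbors, k=3):
    """neighbors: list of (distance, label) pairs, label in {'bug', 'not-bug'}."""
    nearest = sorted(neighbors)[:k]
    score = {"bug": 0.0, "not-bug": 0.0}
    for d, label in nearest:
        score[label] += 1.0 / d  # closer cases get a heavier vote
    return "bug" if score["bug"] > score["not-bug"] else "not-bug"
```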
• 69. Data Mining Techniques in SE
• Association rules and frequent patterns
• Classification
• Clustering
• Misc.
• 70. What Is Clustering?
• Group data into clusters
  – Similar to one another within the same cluster
  – Dissimilar to the objects in other clusters
  – Unsupervised learning: no predefined classes
[Figure: two clusters of points, plus outliers]
• 71. Categories of Clustering Approaches (1)
• Partitioning algorithms
  – Partition the objects into k clusters
  – Iteratively reallocate objects to improve the clustering
• Hierarchical algorithms
  – Agglomerative: each object is a cluster; merge clusters to form larger ones
  – Divisive: all objects are in one cluster; split it up into smaller clusters
• 72. Categories of Clustering Approaches (2)
• Density-based methods
  – Based on connectivity and density functions
  – Filter out noise, find clusters of arbitrary shape
• Grid-based methods
  – Quantize the object space into a grid structure
• Model-based methods
  – Use a model to find the best fit of data
• 73. K-Means: Example
[Figure: with K = 2, arbitrarily choose K objects as the initial cluster centers; assign each object to the most similar center; update the cluster means; reassign objects and update the means until the clusters stabilize]
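The loop illustrated on this slide can be sketched in one dimension; the data points are made up, and a fixed iteration count stands in for a convergence test.

```python
# A minimal 1-D k-means sketch of the slide's loop: assign each
# object to the most similar center, then update the cluster means.
def kmeans(points, centers, iters=10):
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for p in points:  # assignment step
            i = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[i].append(p)
        # update step: move each center to its cluster's mean
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return sorted(centers)
```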
• 74. Clustering and Categorization
• Software categorization
  – Partitioning software systems into categories
• Categories predefined – a classification problem
• Categories discovered automatically – a clustering problem
• 75. Software Categorization – MUDABlue
• Understanding source code
  – Use latent semantic analysis (LSA) to find similarity between software systems
  – Use identifiers (e.g., variable names, function names) as features
    • "gtk_window" represents some window
    • The source code near "gtk_window" contains some GUI operation on the window
• Extracting categories using frequent identifiers
  – "gtk_window", "gtk_main", and "gpointer" → GTK-related software system
  – Use LSA to find relationships between identifiers
[Kawaguchi et al. 04]
• 76. Overview of MUDABlue
• Extract identifiers
• Create an identifier-by-software matrix
• Remove useless identifiers
• Apply LSA and retrieve categories
• Make software clusters from identifier clusters
• Title the software clusters
[Kawaguchi et al. 04]
• 77. Data Mining Techniques in SE
• Association rules and frequent patterns
• Classification
• Clustering
• Misc.
• 78. Searching Source Code/Comments
• CVSSearch: searching using CVS comments
• Comments are often more stable than code segments
  – Describe a segment of code
  – May hold for many future versions
• Compare differences of successive versions
  – For two versions, associate a comment to the corresponding changes
  – Propagate changes over versions
[Chen et al. 01]
• 79. Jungloid Mining
• Given a query describing the input and output types, synthesize code fragments automatically
• Prospector: using API method signatures and jungloids mined from a corpus of sample client programs
• Elementary jungloids
  – Field access
  – Static method or constructor invocation
  – Instance method invocation
  – Widening reference conversion
  – Downcast (narrowing reference conversion)
[Mandelin et al. 05]
• 80. Finding Jungloids
Example query: parsing a Java source code file in an IFile object using the Eclipse IDE framework
• Use signatures of elementary jungloids and APIs to form a signature graph
• Represent a solution as a path in the graph matching the constraints
• Rank the paths by their lengths – short paths are preferred
• Learn downcasts from sample programs
[Mandelin et al. 05]
• 81. Sampling Programs
• During the execution of a program, each execution of a statement is sampled with some probability
  – Sampling large programs becomes feasible
  – Many traces can be collected
• Bug isolation by analyzing samples
  – Correlation between some specific statements or function calls and program errors/crashes
[Liblit et al. 03/05]
• 82. Outline
• Introduction
• What software engineering tasks can be helped by data mining?
• What kinds of software engineering data can be mined?
• How are data mining techniques used in software engineering?
• Case studies
• Conclusions
• 83. Case Studies
• MAPO: mining API usages from open source repositories [Xie&Pei 06]
  – Code bases → sequence analysis → programming
• DynaMine: finding common error patterns by mining software revision histories [Livshits&Zimmermann 05]
  – Change history → association rules → defect detection
• BugTriage: who should fix this bug? [Anvik et al. 06]
  – Bug reports → classification → debugging
• 84. Motivation
• APIs in class libraries or frameworks are widely reused in software development.
• An example programming task: "instrument the bytecode of a Java class by adding an extra method to the class"
  – org.apache.bcel.generic.ClassGen
    public void addMethod(Method m)
• 85. First Try: ClassGen Java API Doc
  addMethod
  public void addMethod(Method m)
  Add a method to this class.
  Parameters: m - method to add
• 86. Second Try: Code Search Engine
• 87. MAPO Approach
• Analyze code segments returned from code search engines and disclose the inherent usage patterns
  – Input: an API characterized by a method, class, or package; code bases: open source repositories or proprietary source repositories
  – Output: a short list of frequent API usage patterns related to the API
• 88. Sample Tool Output
  InstructionList.<init>()
  InstructionFactory.createLoad(Type, int)
  InstructionList.append(Instruction)
  InstructionFactory.createReturn(Type)
  InstructionList.append(Instruction)
  MethodGen.setMaxStack()
  MethodGen.setMaxLocals()
  MethodGen.getMethod()
  ClassGen.addMethod(Method)
  InstructionList.dispose()
• Mined from 36 Java source files, 1087 method sequences
• 89. Tool Architecture
• 90. Results
A tool that integrates various components:
• Relevant code extractor
  – downloads returns from a code search engine (koders.com)
• Code analyzer
  – implemented a lightweight tool for Java programs
• Sequence preprocessor
  – employed various heuristics
• Frequent sequence miner
  – reused BIDE [Wang&Han ICDE 2004]
• Frequent sequence postprocessor
  – employed various heuristics
• 91. Case Studies
• MAPO: mining API usages from open source repositories [Xie&Pei 06]
  – Code bases → sequence analysis → programming
• DynaMine: finding common error patterns by mining software revision histories [Livshits&Zimmermann 05]
  – Change history → association rules → defect detection
• BugTriage: who should fix this bug? [Anvik et al. 06]
  – Bug reports → classification → debugging
• 92. Co-Change Pattern
• Things that are frequently changed together often form a pattern (a.k.a. co-change)
• E.g., co-added method calls:
  public void createPartControl(Composite parent) {
    ... // add listener for editor page activation
    getSite().getPage().addPartListener(partListener);   // co-added
  }
  public void dispose() {
    ...
    getSite().getPage().removePartListener(partListener); // co-added
  }
Adapted from Livshits et al.'s slides
• 93. DynaMine
[Figure: pipeline – revision history mining: mine CVS histories, rank and filter patterns; dynamic analysis: instrument relevant method calls, run the application, post-process into usage patterns, error patterns, and unlikely patterns; reporting: report patterns and bugs]
Adapted from Livshits et al.'s slides
• 94. Mining Patterns
[Figure: the DynaMine pipeline with the revision-history-mining stage highlighted: mine CVS histories, then rank and filter patterns]
Adapted from Livshits et al.'s slides
• 95. Mining Method Calls
[Figure: revisions of Foo.java, Bar.java, Baz.java, and Qux.java with the method calls added in each revision, e.g., addListener()/removeListener() pairs, System.out.println(), and list.iterator()/iter.hasNext()/iter.next()]
Adapted from Livshits et al.'s slides
• 96. Finding Pairs
[Figure: the same revisions with co-added pairs counted – one addListener()/removeListener() pair each in Foo.java and Bar.java, two pairs in Baz.java, and no pairs in Qux.java, where the two calls were added in different revisions (1.41 and 1.42)]
Adapted from Livshits et al.'s slides
• 97. Mining Method Calls
[Figure: the same revisions, annotated with the observation below]
• Co-added calls often represent a usage pattern
Adapted from Livshits et al.'s slides
• 98. Finding Patterns
• Find "frequent itemsets" (with Apriori), e.g., from revisions that repeatedly co-add o.enterAlignment(), o.exitAlignment(), o.redoAlignment() together with iter.hasNext(), iter.next()
• Mined pattern: {enterAlignment(), exitAlignment(), redoAlignment()}
Adapted from Livshits et al.'s slides
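The Apriori step on this slide can be sketched as a small level-wise search; the transactions mirror the slide's example, and the candidate-generation scheme is a simplified version of the real algorithm.

```python
# A compact Apriori-style sketch: grow candidate itemsets level by
# level and keep those meeting min_sup.
transactions = [
    {"enterAlignment", "exitAlignment", "redoAlignment"},
    {"enterAlignment", "exitAlignment", "redoAlignment"},
    {"enterAlignment", "exitAlignment", "redoAlignment", "hasNext", "next"},
    {"hasNext", "next"},
]

def apriori(db, min_sup=2):
    items = set().union(*db)
    frequent, level = [], [frozenset([i]) for i in items]
    while level:
        # keep candidates whose support meets the threshold
        level = [c for c in level
                 if sum(1 for t in db if c <= t) >= min_sup]
        frequent += level
        # next-level candidates: unions of frequent sets, one item larger
        level = list({a | b for a in level for b in level
                      if len(a | b) == len(a) + 1})
    return frequent
```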
• 99. Ranking Patterns
• Support count = #occurrences of a pattern
• Confidence = strength of a pattern, P(A|B)
Adapted from Livshits et al.'s slides
• 100. Ranking Patterns
[Figure: in Qux.java, removeListener() is added in revision 1.42 to match the addListener() added in 1.41 – this is a fix!]
• Rank removeListener() patterns higher
Adapted from Livshits et al.'s slides
• 101. Dynamic Validation
[Figure: the DynaMine pipeline with the dynamic-analysis stage highlighted: instrument relevant method calls, run the application, and post-process the results]
Adapted from Livshits et al.'s slides
• 102. Matches and Mismatches
• Find and count matches and mismatches
  – o.register(d) followed by o.deregister(d): matches
  – o.register(d) without a corresponding o.deregister(d): mismatch
• Static vs. dynamic counts
Adapted from Livshits et al.'s slides
• 103. Pattern Classification
• Post-process: v validations, e violations
  – Usage patterns: e < v/10
  – Error patterns: v/10 ≤ e ≤ 2v
  – Unlikely patterns: otherwise
Adapted from Livshits et al.'s slides
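The classification rule on this slide translates directly into a small function; the function name is just for illustration.

```python
# A sketch of the slide's post-processing rule over
# v validations and e violations of a pattern.
def classify_pattern(v, e):
    if e < v / 10:
        return "usage pattern"
    if v / 10 <= e <= 2 * v:
        return "error pattern"
    return "unlikely pattern"
```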
• 104. Experiments
                 JEDIT     ECLIPSE
  since          2000      2001
  developers     92        112
  lines of code  700,000   2,900,000
  revisions      40,000    400,000
• In total: 56 patterns
Adapted from Livshits et al.'s slides
• 105. Case Studies
• MAPO: mining API usages from open source repositories [Xie&Pei 06]
  – Code bases → sequence analysis → programming
• DynaMine: finding common error patterns by mining software revision histories [Livshits&Zimmermann 05]
  – Change history → association rules → defect detection
• BugTriage: who should fix this bug? [Anvik et al. 06]
  – Bug reports → classification → debugging
• 106. Assigning a Bug
• Many considerations
  – Who has the expertise?
  – Who is available?
  – How quickly does this have to be fixed?
• Not always an obvious or correct assignment
  – Multiple developers may be suitable
  – Difficult to know what the bug is about
  – Bug fixes get delayed
• Triage and fix rate indicate the 'liveness' of OSS projects
Adapted from Anvik et al.'s slides
• 107. Assigning a Bug Today
[Figure: a bug report is manually assigned to a single developer, e.g., bill@firefox.org]
Adapted from Anvik et al.'s slides
• 108. Recommending Assignment
[Figure: the recommender suggests a ranked list of developers, e.g., bill@firefox.com, ted@gmail.com, cindy-loo@whoville.org]
Adapted from Anvik et al.'s slides
• 109. Overview of Approach
• Approach tuned using Eclipse and Firefox
[Figure: resolved bug reports feed a machine learning algorithm, which builds an assignment recommender that suggests developers such as bill@firefox.com, ted@gmail.com, cindy-loo@whoville.org]
Adapted from Anvik et al.'s slides
• 110. Steps of the Approach
1. Characterize the reports
2. Label the reports
3. Select the reports
4. Use a machine learning algorithm
Adapted from Anvik et al.'s slides
• 111. Step 1: Characterizing a Report
• Based on two fields
  – textual summary
  – description
• Use a text categorization approach
  – represent with a word vector
  – remove stop words
  – intra- and inter-document frequency
Adapted from Anvik et al.'s slides
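The representation described above can be sketched as a word vector with stop-word removal and tf-idf weighting (one common reading of "intra- and inter-document frequency"); the stop-word list and sample reports are illustrative only.

```python
# A hedged sketch of the report representation: word vectors over
# the summary/description text, stop words removed, tf-idf weighted.
import math

STOP_WORDS = {"the", "a", "is", "on", "when", "in"}

def tokenize(text):
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def tf_idf_vectors(reports):
    docs = [tokenize(r) for r in reports]
    n = len(docs)
    df = {}  # inter-document frequency of each word
    for doc in docs:
        for w in set(doc):
            df[w] = df.get(w, 0) + 1
    # intra-document frequency (term count) times idf
    return [{w: doc.count(w) * math.log(n / df[w]) for w in set(doc)}
            for doc in docs]
```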
• 112. Step 2: Labeling a Report
• Must determine who really fixed it
  – the "Assigned-to" field is not accurate
• Project-specific heuristics
Adapted from Anvik et al.'s slides
• 113. Step 2: Labeling a Report
• Must determine who really fixed it
  – the "Assigned-to" field is not accurate
• Project-specific heuristics
  – Simple:
    • If a report is FIXED, label it with who marked it as fixed. (Eclipse)
    • If a report is a DUPLICATE, use the label of the report it duplicates. (Eclipse and Firefox)
Adapted from Anvik et al.'s slides
• 114. Step 2: Labeling a Report
• Must determine who really fixed it
  – the "Assigned-to" field is not accurate
• Project-specific heuristics
  – Complex: if the report is FIXED and has attachments approved by a reviewer (Firefox), then
    • If there is one submitter of patches, use their name.
    • If there is more than one submitter, choose the name of whoever submitted the most patches.
    • If the submitters cannot be determined, label with the person assigned to the report.
Adapted from Anvik et al.'s slides
• 115. Step 2: Labeling a Report
• Must determine who really fixed it
  – the "Assigned-to" field is not accurate
• Project-specific heuristics
  – Unclassifiable: reports marked as WONTFIX are often resolved after discussion and developers reaching a consensus (Firefox)
    • It is unknown who would have fixed the bug
    • The report is labeled unclassifiable
Adapted from Anvik et al.'s slides
• 116. Step 2: Labeling a Report
• Number of heuristics used per project:
                    Eclipse   Firefox
  Simple            5         4
  Complex           2         1
  Unclassifiable    1         4
Adapted from Anvik et al.'s slides
• 117. Step 3: Selecting the Reports
• Exclude those with no label
• Include those of active developers
  – developer profiles
[Charts: monthly report-fixing activity of two developers, Sep-04 to Apr-05]
Adapted from Anvik et al.'s slides
• 118. Step 3: Selecting the Reports
[Chart: a developer's monthly activity, Sep-04 to Apr-05, against a threshold of 3 reports / month]
Adapted from Anvik et al.'s slides
• 119. Step 4: Use an ML Algorithm
• Supervised algorithms
  – Naïve Bayes
  – C4.5
  – Support Vector Machines
• Unsupervised algorithms
  – Expectation Maximization
• Incremental algorithms
  – Naïve Bayes
Adapted from Anvik et al.'s slides
• 120. Evaluating Recommenders
• Precision = (# of relevant recommendations) / (# of recommendations made)
• Recall = (# of relevant recommendations) / (# of possibly relevant developers)
• How do we find the possibly relevant developers?
Adapted from Anvik et al.'s slides
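The two measures translate directly into code; the developer names in the usage comment are illustrative only.

```python
# A sketch of the slide's two measures, over sets of developers.
def precision(recommended, relevant):
    return len(set(recommended) & set(relevant)) / len(recommended)

def recall(recommended, relevant):
    return len(set(recommended) & set(relevant)) / len(relevant)

# E.g., recommending ["paulw", "tryder", "stibbs"] when the possibly
# relevant developers are {"paulw", "vendger"} gives
# precision = 1/3 and recall = 1/2.
```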
• 121. Determining Possibly Relevant Developers
[Figure: from a fixed bug report, find the modules touched by the fix in the CVS repository, then map the CVS usernames for those modules (e.g., paulw, tryder, stibbs) to developer email addresses]
Adapted from Anvik et al.'s slides
• 122. Still Not Straightforward (e.g., Firefox)
[Figure: for Firefox, the CVS usernames of the touched modules must be combined with the patch submitters on the bug report and on the module list to obtain the set of possibly relevant developers]
Adapted from Anvik et al.'s slides
• 123. Precision vs. Recall
• A small set of "right" developers (precision) is more important than the set of all possible developers (recall)
[Charts: precision and recall of Multinomial Naïve Bayes, C4.5, and SVM on Eclipse, Firefox, and gcc]
Adapted from Anvik et al.'s slides
• 124. Overview
  programming | defect detection | testing | debugging | maintenance
    – software engineering tasks helped by data mining
  classification | association/patterns | clustering | etc.
    – data mining techniques
  code bases | change history | program states | structural entities | bug reports
    – software engineering data
• 125. Conclusions
• Software development generates a large amount of data of different types
• Data mining and data analysis can help software engineering substantially
• Successful cases show:
  – What software engineering data can be mined
  – What software engineering tasks can be helped
  – How to conduct the mining
• 126. Challenges
• Complexity in software development
  – Specific data mining techniques are needed
• Software development and maintenance are dynamic and user-centered
  – Interactive data mining
  – Visual data mining and analysis
  – Online, incremental mining
• 127. Questions?
Mining Software Engineering Data Bibliography
http://ase.csc.ncsu.edu/dmse/
• What software engineering tasks can be helped by data mining?
• What kinds of software engineering data can be mined?
• How are data mining techniques used in software engineering?
• Resources