Data Mining for Software Engineering


          Tao Xie                                     Jian Pei
North Carolina State University               Simon Fraser University
 www.csc.ncsu.edu/faculty/xie                   www.cs.sfu.ca/~jpei
      xie@csc.ncsu.edu                            jpei@cs.sfu.ca



              An up-to-date version of this tutorial is available at
                     http://ase.csc.ncsu.edu/dmse/dmse.pdf
      Attendees of the tutorial are encouraged to download the latest
                        version 2-3 days before the tutorial
Outline
• Introduction
• What software engineering tasks can be
  helped by data mining?
• What kinds of software engineering data can
  be mined?
• How are data mining techniques used in
  software engineering?
• Case studies
• Conclusions
T. Xie and J. Pei: Data Mining for Software Engineering   2
Introduction
• A large amount of data is produced in
  software development
   – Data from software repositories
   – Data from program executions
• Data mining techniques can be used to
  analyze software engineering data
   – Understand software artifacts or processes
   – Assist software engineering tasks

Examples
• Data in software development
   – Programming: versions of programs
   – Testing: execution traces
   – Deployment: error/bug reports
   – Reuse: open source packages
• Software development needs data analysis
   – How should I use this class?
   – Where are the bugs?
   – How to implement a typical functionality?
Overview

[Overview diagram, three layers:]

• Software engineering tasks helped by data mining: programming, defect
  detection, testing, debugging, maintenance
• Data mining techniques: classification, association/patterns, clustering, …
• Software engineering data: code bases, change history, program states,
  structural entities, bug reports
Software Engineering Tasks
•     Programming
•     Static defect detection
•     Testing
•     Debugging
•     Maintenance




Software Categorization – Why?
• SourceForge hosts 70,000+ software systems
   – How can one find the software needed?
   – How can developers collaborate effectively?
• Why software categorization?
   – SourceForge categorizes software according to their
     primary function (editors, databases, etc.)
         • Software foundries – related software
   – Keep developers informed about related software
         • Learn the “best practice”
         • Promote software reuse
                                                          [Kawaguchi et al. 04]
Software Categorization – What?
• Organize software systems into categories
   – Software systems in each category share a
     common theme
   – A software system may belong to one or
     multiple categories
• What are the categories?
   – Defined by domain experts manually
   – Discovered automatically
• Example system: MUDABlue                                [Kawaguchi et al. 04]
Version Comparison and Search
• What does the current code segment look
  like in previous versions?
   – How has it changed across versions?
• Using standard search tools, e.g., grep?
   – Source code may not be well documented
   – The code may be changed
• Can we have some source code friendly
  search engines?
   – E.g., www.koders.com, corp.krugle.com,
     demo.spars.info
Software Library Reuse
• Issues in reusing software libraries
   – Which components should I use?
   – What is the right way to use them?
   – Multiple components are often used in
     combination, e.g., Smalltalk’s Model/View/Controller
• Frequent patterns help
   – Specifically, inheritance information is important
   – Example: most application classes inheriting from library
     class Widget tend to override its member function
     paint(); most application classes instantiating library
     class Painter and calling its member function
     begin() also call end()
                                                          [Michail 99/00]
API Usage
• How should an API be used correctly?
   – An API may serve multiple functionalities
   – Different styles of API usage
• “I know what type of object I need, but I don’t know
  how to write the code to get the object” [Mandelin
  et al. 05]
   – Can we synthesize jungloid code fragments
     automatically?
   – Given a simple query describing the desired code in
     terms of input and output types, return a code segment
• “I know what method call I need, but I don’t know
  how to write code before and after this method
  call” [Xie & Pei 06]
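The jungloid idea above can be sketched as a graph search over API signatures: types are nodes, method signatures are edges, and a query (input type, output type) is answered by the shortest call chain. The mini API table below is invented for illustration; it is not Prospector's actual signature database:

```python
from collections import deque

# Hypothetical mini API: (receiver type, method) -> return type.
API = {
    ("File", "toURI"): "URI",
    ("URI", "toURL"): "URL",
    ("URL", "openStream"): "InputStream",
}

def find_chain(input_type, output_type):
    """Breadth-first search over API signatures: the shortest call
    chain that turns input_type into output_type."""
    queue = deque([(input_type, [])])
    seen = {input_type}
    while queue:
        t, chain = queue.popleft()
        if t == output_type:
            return chain
        for (recv, meth), ret in API.items():
            if recv == t and ret not in seen:
                seen.add(ret)
                queue.append((ret, chain + [f"{recv}.{meth}()"]))
    return None  # no chain of calls produces the requested type

print(find_chain("File", "InputStream"))
```

A real tool would rank the many possible chains; the sketch only returns one shortest chain.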
How Can Data Mining Help?
• Identify characteristic usage of the library
  automatically
• Understand the reuse of library classes from
  real-life applications instead of toy programs
• Keep reuse patterns up to date w.r.t. the
  most recent version of the library and
  applications
• General patterns may cover inheritance
  cases
Software Engineering Tasks
•     Programming
•     Static defect detection
•     Testing
•     Debugging
•     Maintenance




Locating Matching Method Calls
• Many bugs due to unmatched method calls
   – E.g., fail to call free() to deallocate a data
     structure
   – One-line code changes: many bugs can be
     fixed by changing only one line of source code
• Problem: how to find highly correlated pairs
  of method calls
   – E.g., <fopen, fclose>, <malloc, free>

                             [Li&Zhou 05, Livshits&Zimmermann 05, Yang et al. 06]
Inferring Errors from Source Code

• A system must follow some correctness
  rules
   – Unfortunately, the rules are often documented or
     specified only in an ad hoc manner
• Deriving the rules requires a lot of a priori
  knowledge
• Can we detect some errors without knowing
  the rules by data mining?
                                                          [Engler et al. 01]
Inference in Large Systems
• Execution traces → inferred properties → static
  checker
• Inference algorithms need to be scalable with the
  size of the programs and the input traces
• Only imperfect traces are available in industrial
  environments – how can those imperfect traces be used?
• Many inferred properties may be uninteresting; it is
  hard for a developer to review those properties
  thoroughly for large programs
                                                          [Yang et al. 06]
Detecting Copy-Paste and Bugs
• Copy-pasted code is common in large
  systems
   – Code reuse
• Prone to bugs
   – E.g., identifiers are not changed consistently
• How to detect copy-paste code?
   – How to scale up to large software?
   – How to handle minor modifications?
                                                          [Li et al. 04]
Software Engineering Tasks
•     Programming
•     Static defect detection
•     Testing
•     Debugging
•     Maintenance




Inspecting Test Behavior
• Automatically generated tests or field
  executions lack test oracles
   – Sample/summarize behavior for inspection
• Examples:
   – Select tests (executions/outputs) for inspection
         • E.g., clustering path/branch profiles [Podgurski et al.
           01, Bowring et al. 04]
   – Summarize object behavior [Xie&Notkin 04,
     Dallmeier et al. 06]


Mining Object Behavior
• Can we discover the undocumented behavior of
  classes? Such behavior may not be observable
  from the program source code directly




[Figure omitted: behavior model for the Java Vector class. Picture from
“Mining object behavior with ADABU” [Dallmeier et al. WODA 06]]
Mining Specifications
• Specifications are very useful for testing
   – test generation + test oracle
• Major obstacle: protocol specifications are
  often unavailable
   – Example: what is the right way to use the socket
     API?
• How can data mining help?
   – If a protocol holds in well-tested programs (i.e.,
     their executions), the protocol is likely valid
                                                           [Ammons et al. 02]
Specification Helps

[Figure: a code fragment checked against a mined socket-API specification]

Does the code in the figure follow the correct socket API protocol?
[Ammons et al. 02]
Software Engineering Tasks
•     Programming
•     Static defect detection
•     Testing
•     Debugging
•     Maintenance




Fault Localization
• Running tests produces execution traces
   – Some tests fail and the others pass
• Given many execution traces generated by tests,
  can we suggest likely faulty statements?
  [Liblit et al. 03/05, Liu et al. 05]
   – Some traces may lead to program failures
   – It would be better if we could even suggest the
     likelihood of a statement being faulty
• For large programs, how can we collect traces
  effectively?
• What if there are multiple faults?
Analyzing Bug Repositories
• Most open source software development projects
  have bug repositories
    – Report and track problems and potential enhancements
    – Valuable information for both developers and users
• Bug repositories are often messy
    – Duplicate error reports; Related errors
• Challenge: how to analyze effectively?
    – Who are reporting and at what rate?
    – How are reports resolved and by whom?
• Automatic bug report assignment & duplicate
  detection                               [Anvik et al. 06]
Stabilizing Buggy Applications
• Users may report bugs in a program; can
  those bug reports be used to prevent the
  program from crashing?
   – When a user attempts an action that previously
     led to errors, a warning should be issued
• Given a program state S and an event e,
  predict whether e likely results in a bug
   – Positive samples: past bugs
   – Negative samples: “not bug” reports
                                                          [Michail&Xie 05]
Software Engineering Tasks
•     Programming
•     Static defect detection
•     Testing
•     Debugging
•     Maintenance




Guiding Software Changes
• Programmers start changing some locations
   – Suggest locations that other programmers have
     changed together with this location
          • E.g., “Programmers who changed this function
            also changed …”
• Mine association rules from change histories
   – coarse-granular entities: directories, modules,
     files
   – fine-granular entities: methods, variables,
     sections
                                           [Zimmermann et al. 04, Ying et al. 04]
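A minimal sketch of this style of co-change rule mining (the entity names and transactions below are invented for illustration): count how often two entities were checked in together, and suggest the partners whose confidence passes a threshold:

```python
from itertools import combinations
from collections import Counter

# Hypothetical change history: each transaction is the set of entities
# committed together.
transactions = [
    {"initDisplay", "disposeDisplay"},
    {"initDisplay", "disposeDisplay", "logError"},
    {"initDisplay", "disposeDisplay"},
    {"initDisplay", "readConfig"},
]

pair_count = Counter()
item_count = Counter()
for t in transactions:
    item_count.update(t)
    pair_count.update(frozenset(p) for p in combinations(sorted(t), 2))

def suggest(changed, min_conf=0.6):
    """Entities that co-changed with `changed` often enough:
    conf(changed -> other) = sup({changed, other}) / sup({changed})."""
    out = {}
    for pair, sup in pair_count.items():
        if changed in pair:
            other = next(iter(pair - {changed}))
            conf = sup / item_count[changed]
            if conf >= min_conf:
                out[other] = conf
    return out

print(suggest("initDisplay"))  # disposeDisplay co-changed in 3 of 4 transactions
```

Real tools mine such rules at several granularities (files down to methods), as the slide notes.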
Aspect Mining
• Discover crosscutting concerns that can
  potentially be refactored into one place (an
  aspect in aspect-oriented programs)
   – E.g., logging, timing, communication
• Mine recurring execution patterns
   – Event traces [Breu&Krinke 04, Tonella&Ceccato
     04]
   – Source code [Shepherd et al. 05]


Software Engineering Data
•     Static code bases
•     Software change history
•     Profiled program states
•     Profiled structural entities
•     Bug reports




Code Entities
• Identifiers within a system [Kawaguchi et al. 04]
    – E.g., variable names, function names
• Statement sequence within a basic block [Li et al.
  04]
    – E.g., variables, operators, constants, functions,
      keywords
• Element set within a function [Li&Zhou 05]
    – E.g., functions, variables, data types
• Call sites within a function [Xie&Pei 05]
• API signatures [Mandelin et al. 05]
                  [Mandelin et al. 05] http://snobol.cs.berkeley.edu/prospector/index.jsp
Relationships btw Code Entities
• Membership relationships
   – A class contains member functions
• Reuse relationships
   – Class inheritance
   – Class instantiation
   – Function invocations
   – Function overriding

                                 [Michail 99/00] http://codeweb.sourceforge.net/ for C++

Software Engineering Data
•     Static code bases
•     Software change history
•     Profiled program states
•     Profiled structural entities
•     Bug reports




Concurrent Versions System (CVS)
[Screenshot: CVS check-in comments associated with source code lines]
                                          [Chen et al. 01] http://cvssearch.sourceforge.net/
CVS Comments

• cvs log – displays all revisions and their comments for each file:

    RCS file: /repository/file.h,v
    Working file: file.h
    head: 1.5
    ...
    description:
    ----------------------------
    Revision 1.5
    Date: ...
    cvs comment ...
    ----------------------------
    ...

• cvs diff – shows differences between different versions of a file:

    RCS file: /repository/file.h,v
    …
    9c9,10
    < old line
    ---
    > new line
    > another new line
                                        [Chen et al. 01] http://cvssearch.sourceforge.net/
Code Version Histories
• CVS provides file versioning
   – Group individual per-file changes into
     transactions (atomic change sets): checked in by the
     same author, with the same check-in comment, close in
     time
• CVS manages only files and line numbers
   – Associate syntactic entities with line ranges
• Filter out long transactions not corresponding to
  meaningful atomic changes
   – E.g., feature requests, bug fixes, branch merging
                                                                         [Ying et al. 04]
                       [Zimmermann et al. 04] http://www.st.cs.uni-sb.de/softevo/erose/
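The grouping step above can be sketched as follows; the check-ins and the 200-second window are invented for illustration and are not the thresholds of the cited tools:

```python
# Hypothetical per-file check-ins: (file, author, comment, timestamp).
checkins = [
    ("a.c", "kim", "fix null deref", 100),
    ("a.h", "kim", "fix null deref", 130),
    ("b.c", "kim", "fix null deref", 500),   # same comment, but too late
    ("c.c", "lee", "add feature", 140),
]

WINDOW = 200  # seconds: check-ins closer than this may join one transaction

def group_transactions(checkins):
    """Group per-file changes by same author + same comment + closeness
    in time (a sliding window), approximating atomic change sets."""
    txns = []
    for f, author, comment, ts in sorted(checkins, key=lambda c: c[3]):
        for txn in txns:
            if (txn["author"] == author and txn["comment"] == comment
                    and ts - txn["last"] <= WINDOW):
                txn["files"].append(f)
                txn["last"] = ts  # sliding window: extend from the last member
                break
        else:
            txns.append({"author": author, "comment": comment,
                         "files": [f], "last": ts})
    return [t["files"] for t in txns]

print(group_transactions(checkins))
```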
Software Engineering Data
•     Static code bases
•     Software change history
•     Profiled program states
•     Profiled structural entities
•     Bug reports




Method-Entry/Exit States
• State of an object
   – Values of transitively reachable fields
• Method-entry state
   – Receiver-object state, method argument values
• Method-exit state
   – Receiver-object state, updated method
     argument values, method return value
                                              [Ernst et al. 02] http://pag.csail.mit.edu/daikon/
                                      [Dallmeier et al. 06] http://www.st.cs.uni-sb.de/models/
                                                                            [Henkel&Diwan 03]
                                                                             [Xie&Notkin 04/05]
Other Profiled Program States
• Values of variables at certain code locations
  [Hangal&Lam 02]
   – Object/static field read/write
   – Method-call arguments
   – Method returns
• Sampled predicates on values of variables
  [Liblit et al. 03/05]


                                               [Hangal&Lam 02] http://diduce.sourceforge.net/
                                                 [Liblit et al. 03/05] http://www.cs.wisc.edu/cbi/

Software Engineering Data
•     Static code bases
•     Software change history
•     Profiled program states
•     Profiled structural entities
•     Bug reports




Executed Structural Entities
• Executed branches/paths, def-use pairs
• Executed function/method calls
   – Group methods invoked on the same object
• Profiling options
   – Execution hit vs. count
   – Execution order (sequences)


                                 [Dallmeier et al. 05] http://www.st.cs.uni-sb.de/ample/
         More related tools: http://www.csc.ncsu.edu/faculty/xie/research.htm#related
Software Engineering Data
•     Static code bases
•     Software change history
•     Profiled program states
•     Profiled structural entities
•     Bug reports




Processing Bug Reports

[Flow diagram: a User submits a Bug Report to the Bug Repository; a Triager
examines it and either assigns it to a Developer or resolves it as Duplicate,
Works For Me, Invalid, or Won’t Fix]
Adapted from Anvik et al.’s slides
Sample Bugzilla Bug Report
• Bug report image
• Overlay the triage questions

[Screenshot: a Bugzilla bug report overlaid with the triage questions:
Assigned To? Duplicate? Reproducible?]
                                                      Bugzilla: open source bug tracking tool
                                                                      http://www.bugzilla.org/
                                                                              [Anvik et al. 06]
                                        http://www.cs.ubc.ca/labs/spl/projects/bugTriage.html
Eclipse Bug Data
• Defect counts are listed at the plug-in,
  package, and compilation-unit levels.
• The value field contains the actual number of
  pre- ("pre") and post-release defects ("post").
• The average ("avg") and maximum ("max") values
  refer to the defects found in the compilation
  units ("compilationunits").
               [Schröter et al. 06] http://www.st.cs.uni-sb.de/softevo/bug-data/eclipse/
Data Mining Techniques in SE
•     Association rules and frequent patterns
•     Classification
•     Clustering
•     Misc.




Frequent Itemsets

• Itemset: a set of items
   – E.g., acm = {a, c, m}
• Support of itemsets
   – Sup(acm) = 3
• Given min_sup = 3, acm is a frequent pattern
• Frequent pattern mining: find all frequent
  patterns in a database

  Transaction database TDB
  TID    Items bought
  100    f, a, c, d, g, i, m, p
  200    a, b, c, f, l, m, o
  300    b, f, h, j, o
  400    b, c, k, s, p
  500    a, f, c, e, l, p, m, n

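As a sketch of the definitions above, a naive miner can enumerate candidate itemsets over the slide's transaction database and keep those meeting min_sup (a real miner such as Apriori prunes candidates level by level instead of enumerating blindly):

```python
from itertools import combinations
from collections import Counter

# The transaction database TDB from the slide, items as single letters.
TDB = [set("facdgimp"), set("abcflmo"), set("bfhjo"),
       set("bcksp"), set("afcelpmn")]
MIN_SUP = 3

def frequent_itemsets(db, min_sup):
    """Naive frequent-itemset mining: count every subset up to size 3
    and keep those whose support reaches min_sup."""
    counts = Counter()
    for t in db:
        for k in (1, 2, 3):
            for combo in combinations(sorted(t), k):
                counts[frozenset(combo)] += 1
    return {s for s, c in counts.items() if c >= min_sup}

freq = frequent_itemsets(TDB, MIN_SUP)
print(frozenset("acm") in freq)  # True: sup(acm) = 3, as on the slide
```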
Association Rules
• (Time ∈ {Fri, Sat}) ∧ buy(X, diaper) → buy(X, beer)
   – Dads taking care of babies on weekends drink beer
• Itemsets should be frequent
   – So that the rule applies to many cases
• Rules should be confident
   – With strong prediction capability

A Road Map
• Boolean vs. quantitative associations
   – buys(x, “SQLServer”) ∧ buys(x, “DMBook”) →
     buys(x, “DM Software”) [0.2%, 60%]
   – age(x, “30..39”) ∧ income(x, “42..48K”) →
     buys(x, “PC”) [1%, 75%]
• Single dimension vs. multiple dimensional
  associations
• Single level vs. multiple-level analysis
   – What brands of beers are associated with what
     brands of diapers?
Frequent Pattern Mining Methods
• Apriori and its variations/improvements
• Mining frequent-patterns without candidate
  generation
• Mining max-patterns and closed itemsets
• Mining multi-dimensional, multi-level
  frequent patterns with flexible support
  constraints
• Interestingness: correlation and causality

A Simple Case
• Finding highly correlated method call pairs
• Confidence of pairs helps
   – Conf(<a,b>)=support(<a,b>)/support(<a,a>)
• Check the revisions (fixes to bugs), and find the
  pairs of method calls whose confidences are
  improved dramatically by frequently added
  fixes
   – Those are the matching method call pairs that
     may often be violated by programmers
                                                          [Livshits&Zimmermann 05]
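The confidence computation above can be sketched directly; the per-function call sets below are invented for illustration:

```python
from collections import Counter

# Hypothetical call sets, one per function in the code base.
functions = [
    {"fopen", "fclose", "fread"},
    {"fopen", "fclose"},
    {"fopen", "printf"},        # fclose forgotten: a candidate bug
    {"malloc", "free"},
]

sup = Counter()
for calls in functions:
    for a in calls:
        sup[a] += 1               # support of the single call a
        for b in calls:
            if a != b:
                sup[(a, b)] += 1  # support of the pair <a, b>

def conf(a, b):
    """Conf(<a,b>) = support(<a,b>) / support(<a,a>), i.e. how often
    a function that calls a also calls b."""
    return sup[(a, b)] / sup[a]

print(conf("fopen", "fclose"))  # 2 of the 3 fopen sites also call fclose
```

The one fopen site without fclose is exactly the kind of violation the slide's approach flags.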
Conflicting Patterns
• 999 out of 1000 times spin_unlock
  follows spin_lock
   – The single time that spin_unlock does not
     follow is likely an error
• We can detect an error without knowing the
  correctness rule



              [Li&Zhou 05, Livshits&Zimmermann 05, Yang et al. 06]
Frequent Library Reuse Patterns
• Items: classes, member functions, reuse relationships
  (e.g., inheritance, overriding, instantiation)
• Transactions: for every application class A, the set of all
  items that are involved in a reuse relationship with A
• Pruning
    – Uninteresting rules, e.g., a rule holds for every class
    – Misleading rules, e.g., xy → z (conf: 60%) is pruned if y → z (conf:
      80%)
    – Statistically insignificant rules, prune rules of a high p-value
• Constrained rules
    – Rules involving a particular class
    – Rules that are violated in a particular application
                                                                  [Michail 99/00]
MAPO: Mining Frequent API Patterns

[Figure: overview of MAPO, which mines frequent API usage patterns from
method-call sequences extracted from open source code]
                                                          [Xie&Pei 06]
Sequential Pattern Mining in MAPO
• Use BIDE [Wang&Han 04] to mine closed sequential
  patterns from the preprocessed method-call
  sequences
• Postprocessing in MAPO
   – Remove frequent sequences that do not contain the
     entities interesting to the user
   – Compress consecutive calls of the same method into
     one
   – Remove duplicate frequent sequences after the
     compression
   – Remove frequent sequences that are subsequences of
     some other frequent sequences
                                                          [Xie&Pei 06]
Detecting Copy-Paste Code
• Apply closed sequential pattern mining techniques
• Customizing the techniques
   – A copy-pasted segment typically does not have big gaps –
     use a maximum gap threshold to control
   – Output the instances of patterns (i.e., the copy-pasted
     code segments) instead of the patterns
   – Use small copy-pasted segments to form larger ones
   – Prune false positives: tiny segments, unmappable
     segments, overlapping segments, and segments with
     large gaps
                                                          [Li et al. 04]
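A toy sketch of the matching idea (not the cited tool itself): after statements are abstracted to tokens, equal fixed-length windows across files are clone candidates, which a real tool would then grow into larger segments and filter as described above. The files and tokens are invented:

```python
from collections import defaultdict

# Hypothetical tokenized statement streams, one list per file; identifiers
# are already abstracted away, so renamed copies still match.
files = {
    "a.c": ["open", "check", "read", "close", "log"],
    "b.c": ["init", "open", "check", "read", "close"],
}

def clone_windows(files, length=3):
    """Group equal statement windows of a fixed length across files;
    any window occurring in more than one place is a clone candidate."""
    index = defaultdict(list)
    for name, stmts in files.items():
        for i in range(len(stmts) - length + 1):
            index[tuple(stmts[i:i + length])].append((name, i))
    return {w: locs for w, locs in index.items() if len(locs) > 1}

clones = clone_windows(files)
print(sorted(clones))
```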
Find Bugs in Copy-Pasted Segments

• For two copy-pasted segments, are the
  modifications consistent?
   – Identifier a in segment S1 is changed to b in
     segment S2 3 times, but remains unchanged
     once – likely a bug
   – The heuristic may not be right all the time
• The lower the unchanged rate of an
  identifier, the more likely there is a bug
                                                          [Li et al. 04]
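The heuristic above reduces to a single ratio; the renaming record below is an invented example matching the slide's 3-changed-1-unchanged scenario:

```python
# Hypothetical identifier mapping between two copy-pasted segments:
# each entry records what identifier `a` became in the copied segment.
renamings = ["b", "b", "b", "a"]  # changed to b 3 times, unchanged once

def unchanged_rate(orig, renamings):
    """Fraction of occurrences left unchanged; a low-but-nonzero rate
    suggests an inconsistent rename, i.e. a likely copy-paste bug."""
    return renamings.count(orig) / len(renamings)

rate = unchanged_rate("a", renamings)
print(rate)  # 0.25: 'a' was mostly renamed, so the leftover is suspicious
```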
Approximate Patterns for Inferences

• Use an alternating template to find
  interesting properties
   – Example: template – (PS)*; an instance: the
     alternation loc.acq loc.rel
• Handling imperfect traces
   – Instead of requiring perfect matches, check the
     ratio of matching
   – Explore contexts of matching
                                                          [Yang et al. 06]
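The ratio-of-matching idea can be sketched as a single scan over the trace; the event names below follow the slide's loc.acq/loc.rel example, and the scoring is a simplified stand-in for the cited technique:

```python
def alternation_ratio(trace, p, s):
    """Fraction of p/s events in a trace that fit the alternating
    template (PS)*; imperfect traces lower the ratio instead of
    rejecting the property outright."""
    expect, matched, total = p, 0, 0
    for e in trace:
        if e not in (p, s):
            continue  # ignore unrelated events in the trace
        total += 1
        if e == expect:
            matched += 1
            expect = s if e == p else p
    return matched / total if total else 0.0

perfect = ["loc.acq", "other", "loc.rel", "loc.acq", "loc.rel"]
noisy = ["loc.acq", "loc.acq", "loc.rel"]
print(alternation_ratio(perfect, "loc.acq", "loc.rel"))  # 1.0
```

A property is kept when its ratio exceeds a threshold, rather than requiring a perfect match.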
Context Handling

[Figure omitted]
Figure from “Perracotta: mining temporal API rules
from imperfect traces”, in [Yang et al. ICSE’06]
Cross-Checking of Execution Traces
• Mine association rules or sequential
  patterns S → F, where S is a statement and
  F is the status of program failure
• The higher the confidence, the more likely S
  is faulty or related to a fault
• Using only one statement on the left side of
  the rule can be misleading, since a fault may
  be caused by a combination of statements
   – Frequent patterns can be used to improve this
                                                           [Denmat et al. 05]
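The single-statement rule S → F amounts to a per-statement confidence over run outcomes; the coverage data below is invented for illustration:

```python
# Hypothetical coverage data: statements executed in each run, plus outcome.
runs = [
    ({"s1", "s2", "s3"}, "fail"),
    ({"s1", "s3"}, "fail"),
    ({"s1", "s2"}, "pass"),
    ({"s1"}, "pass"),
]

def failure_confidence(stmt, runs):
    """Conf(stmt -> failure): among runs executing stmt, the fraction
    that failed. High confidence flags statements close to a fault."""
    covering = [out for stmts, out in runs if stmt in stmts]
    return covering.count("fail") / len(covering) if covering else 0.0

print(failure_confidence("s3", runs))  # s3 appears only in failing runs
```

Note how s1 illustrates the slide's caveat: it is executed everywhere, so its confidence of 0.5 says nothing about the fault.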
Emerging Patterns of Traces
• A method executed only in failing runs is
  likely to point to the defect
   – Comparing the coverage of passing and failing
     program runs helps
• Mining patterns frequent in failing program
  runs but infrequent in passing program runs
   – Sequential patterns may be used


                 [Dallmeier et al. 05, Denmat et al. 05, Yang et al. 06]
Learning Object Behavior
• Extracting models
   – A static analysis identifies all side-effect-free
     methods in the program
   – Some side-effect-free methods are selected as
     inspectors
   – The program is executed and inspectors are
     called to extract information about an object’s
     state – a vector of inspector values
• Merge models of all objects in a program
                                                          [Dallmeier et al. 06]
Data Mining Techniques in SE
•     Association rules and frequent patterns
•     Classification
•     Clustering
•     Misc.




Classification: A 2-step Process
• Model construction: describe a set of
  predetermined classes
   – Training dataset: tuples for model construction
         • Each tuple/sample belongs to a predefined class
   – Classification rules, decision trees, or math formulae
• Model application: classify unseen objects
   – Estimate accuracy of the model using an independent
     test set
   – Acceptable accuracy → apply the model to classify
     tuples with unknown class labels

Model Construction

  Training Data → Classification Algorithms → Classifier (Model)

  Name    Rank        Years   Tenured
  Mike    Ass. Prof   3       No
  Mary    Ass. Prof   7       Yes
  Bill    Prof        2       Yes
  Jim     Asso. Prof  7       Yes
  Dave    Ass. Prof   6       No
  Anne    Asso. Prof  3       No

  Learned model: IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’
Model Application

  Testing Data → Classifier → classify Unseen Data, e.g., (Jeff, Professor, 4): Tenured?

  Name     Rank        Years   Tenured
  Tom      Ass. Prof   2       No
  Merlisa  Asso. Prof  7       No
  George   Prof        5       Yes
  Joseph   Ass. Prof   7       Yes
T. Xie and J. Pei: Data Mining for Software Engineering                             66
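The two-step process can be sketched on the toy tenure data above. This is a hedged illustration: the "model" here is simply the rule the slide shows, not the output of a real learning algorithm.

```python
# Sketch of the 2-step classification process using the slides' toy
# tenure data. The "learned" model is hand-written here; a real
# classifier would induce it from the training tuples.

training = [
    ("Mike", "Assistant Prof", 3, "no"),
    ("Mary", "Assistant Prof", 7, "yes"),
    ("Bill", "Professor", 2, "yes"),
    ("Jim", "Associate Prof", 7, "yes"),
    ("Dave", "Assistant Prof", 6, "no"),
    ("Anne", "Associate Prof", 3, "no"),
]

# Step 1, model construction: IF rank = 'Professor' OR years > 6
# THEN tenured = 'yes' (the rule shown on the slide).
def model(rank, years):
    return "yes" if rank == "Professor" or years > 6 else "no"

train_acc = sum(model(r, y) == t for _, r, y, t in training) / len(training)

# Step 2, model application: first estimate accuracy on an
# independent test set, then classify unseen tuples.
test = [
    ("Tom", "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "no"),
    ("George", "Professor", 5, "yes"),
    ("Joseph", "Assistant Prof", 7, "yes"),
]
accuracy = sum(model(r, y) == t for _, r, y, t in test) / len(test)

print(train_acc)              # 1.0 on the training tuples
print(accuracy)               # 0.75: Merlisa is misclassified
print(model("Professor", 4))  # unseen tuple (Jeff, Professor, 4) -> "yes"
```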
Supervised vs. Unsupervised
Learning
• Supervised learning (classification)
   – Supervision: objects in the training data set
     have labels
   – New data is classified based on the training set
• Unsupervised learning (clustering)
   – The class labels of training data are unknown
   – Given a set of measurements, observations,
     etc. with the aim of establishing the existence of
     classes or clusters in the data

T. Xie and J. Pei: Data Mining for Software Engineering   67
GUI-Application Stabilizer
• Given a program state S and an event e, predict
  whether e likely results in a bug
   – Positive samples: past bugs
   – Negative samples: “not bug” reports
• A k-NN based approach
   – Consider the k closest cases reported before
   – Compare Σ 1/d over the k nearest bug cases and not-bug
     cases, where d is the distance between the current state
     and a reported state
   – If the current state is more similar to bugs, predict a bug
                                                          [Michail&Xie 05]
T. Xie and J. Pei: Data Mining for Software Engineering              68
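The Σ 1/d comparison can be sketched as a distance-weighted k-NN vote. The state vectors, distance function, and data below are illustrative assumptions, not the tool's actual representation.

```python
# Hedged sketch of the k-NN idea behind the GUI-application stabilizer:
# among the k closest past reports, compare the sum of 1/d for "bug"
# vs. "not-bug" cases, where d is the distance to the current state.

def predict_bug(current, reports, k=3):
    # reports: list of (state_vector, label) with label "bug" / "not-bug"
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    nearest = sorted(reports, key=lambda r: dist(current, r[0]))[:k]
    score = {"bug": 0.0, "not-bug": 0.0}
    for state, label in nearest:
        d = dist(current, state)
        score[label] += 1.0 / (d + 1e-9)   # avoid division by zero
    return "bug" if score["bug"] > score["not-bug"] else "not-bug"

reports = [((0, 0), "bug"), ((0, 1), "bug"),
           ((5, 5), "not-bug"), ((6, 5), "not-bug")]
print(predict_bug((1, 0), reports))  # closest past cases are bugs
```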
Data Mining Techniques in SE
•     Association rules and frequent patterns
•     Classification
•     Clustering
•     Misc.




    T. Xie and J. Pei: Data Mining for Software Engineering   69
What Is Clustering?
• Group data into clusters
   – Similar to one another within the same cluster
   – Dissimilar to the objects in other clusters
   – Unsupervised learning: no predefined classes

[Figure: two clusters (Cluster 1, Cluster 2) with outliers]

T. Xie and J. Pei: Data Mining for Software Engineering               70
Categories of Clustering
Approaches (1)
• Partitioning algorithms
   – Partition the objects into k clusters
   – Iteratively reallocate objects to improve the
     clustering
• Hierarchy algorithms
   – Agglomerative: each object is a cluster, merge
     clusters to form larger ones
   – Divisive: all objects are in a cluster, split it up
     into smaller clusters

T. Xie and J. Pei: Data Mining for Software Engineering    71
Categories of Clustering
Approaches (2)
• Density-based methods
   – Based on connectivity and density functions
   – Filter out noise, find clusters of arbitrary shape
• Grid-based methods
   – Quantize the object space into a grid structure
• Model-based
   – Use a model to find the best fit of data


T. Xie and J. Pei: Data Mining for Software Engineering   72
K-Means: Example

K = 2. Arbitrarily choose K objects as the initial cluster centers.
Loop: assign each object to the most similar center, then update the
cluster means; reassign objects and update the means again until the
clusters no longer change.

[Figure: four scatter plots illustrating the assign/update iterations]
         T. Xie and J. Pei: Data Mining for Software Engineering                                                                                                                                                                              73
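The loop in the figure can be sketched in a few lines. This minimal version assumes 2-D points and takes the first K objects as the arbitrary initial centers.

```python
# Minimal k-means sketch matching the slide's loop: pick K initial
# centers, assign each object to its most similar center, update the
# cluster means, and repeat until the assignment stops changing.

def kmeans(points, k, iters=100):
    centers = list(points[:k])  # "arbitrarily choose K objects"
    def nearest(p):
        return min(range(k),
                   key=lambda c: (p[0] - centers[c][0]) ** 2 +
                                 (p[1] - centers[c][1]) ** 2)
    assign = None
    for _ in range(iters):
        new_assign = [nearest(p) for p in points]
        if new_assign == assign:
            break                      # clusters stabilized
        assign = new_assign
        for c in range(k):             # update each cluster mean
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centers[c] = (sum(x for x, _ in members) / len(members),
                              sum(y for _, y in members) / len(members))
    return assign, centers

points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
assign, centers = kmeans(points, 2)
print(assign)  # first three points in one cluster, last three in the other
```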
Clustering and Categorization
• Software categorization
   – Partitioning software systems into categories
• Categories predefined – a classification
  problem
• Categories discovered automatically – a
  clustering problem



T. Xie and J. Pei: Data Mining for Software Engineering   74
Software Categorization - MUDABlue

• Understanding source code
    – Use latent semantic analysis (LSA) to find similarity
      between software systems
    – Use identifiers (e.g., variable names, function names)
      as features
          • “gtk_window” represents some window
          • The source code near “gtk_window” contains some GUI
            operation on the window
• Extracting categories using frequent identifiers
    – “gtk_window”, “gtk_main”, and “gpointer” → a GTK-related
      software system
    – Use LSA to find relationships between identifiers
                                                           [Kawaguchi et al. 04]
 T. Xie and J. Pei: Data Mining for Software Engineering                   75
Overview of MUDABlue
• Extract identifiers
• Create identifier-by-software matrix
• Remove useless identifiers
• Apply LSA, and retrieve categories
• Make software clusters from identifier
  clusters
• Title software clusters

                                                              [Kawaguchi et al. 04]
    T. Xie and J. Pei: Data Mining for Software Engineering                     76
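A toy identifier-by-software matrix illustrates the intuition behind this step. Note the hedge: MUDABlue applies LSA (an SVD-based reduction) before comparing; the sketch below only compares raw matrix columns with cosine similarity, and every identifier count is made up.

```python
# Toy identifier-by-software matrix in the spirit of MUDABlue.
# Rows: identifiers; columns: software systems A, B, C.
import math

identifiers = ["gtk_window", "gtk_main", "gpointer", "socket", "bind"]
matrix = [
    [3, 2, 0],   # gtk_window occurs in A and B (GUI-related)
    [2, 3, 0],   # gtk_main
    [1, 1, 0],   # gpointer
    [0, 0, 4],   # socket occurs only in C (network-related)
    [0, 0, 2],   # bind
]

def column(m, j):
    return [row[j] for row in m]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

# Even without the SVD step, column similarity hints at the categories:
# A and B share GTK identifiers, C does not.
sim_ab = cosine(column(matrix, 0), column(matrix, 1))
sim_ac = cosine(column(matrix, 0), column(matrix, 2))
print(sim_ab > sim_ac)  # True: A is closer to B than to C
```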
Data Mining Techniques in SE
•     Association rules and frequent patterns
•     Classification
•     Clustering
•     Misc.




    T. Xie and J. Pei: Data Mining for Software Engineering   77
Searching Source Code/Comments
• CVSSearch: searching using CVS
  comments
• Comments are often more stable than code
  segments
   – Describe a segment of code
   – May hold for many future versions
• Compare differences of successive versions
   – For two versions, associate a comment to the
     corresponding changes
   – Propagate changes over versions       [Chen et al. 01]
T. Xie and J. Pei: Data Mining for Software Engineering   78
Jungloid Mining
• Given a query describing the input and output
  types, synthesize code fragments automatically
• Prospector: using API method signatures and
  jungloids mined from a corpus of sample client
  programs
• Elementary jungloids
   –   Field access
   –   Static method or constructor invocation
   –   Instance method invocation
   –   Widening reference conversion
   –   Downcast (narrowing reference conversions)
                                                          [Mandelin et al. 05]
T. Xie and J. Pei: Data Mining for Software Engineering                 79
Finding Jungloids
Example query: parsing a Java source code file in an IFile object
using the Eclipse IDE framework




• Use signatures of elementary jungloids and APIs to form a
  signature graph
• Represent a solution as a path in the graph matching the
  constraints
• Rank the paths by their lengths – short paths are preferred
• Learn downcast from sample programs           [Mandelin et al. 05]
T. Xie and J. Pei: Data Mining for Software Engineering                                     80
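The shortest-path preference can be sketched as breadth-first search over a signature graph. The miniature graph below is hypothetical: its node and edge names only imitate the Eclipse parsing example and are not the real API.

```python
# Illustrative sketch of Prospector's core idea: nodes are types,
# edges are elementary jungloids (method calls, conversions, ...),
# and a query (input type, output type) is answered by a shortest
# path. The graph is a made-up miniature, not the real Eclipse API.
from collections import deque

edges = {
    "IFile":       [("IFile.getContents()", "InputStream")],
    "InputStream": [("new InputStreamReader(in)", "Reader")],
    "Reader":      [("parser.setSource(reader)", "ASTParser")],
    "ASTParser":   [("parser.createAST(null)", "CompilationUnit")],
}

def shortest_jungloid(src, dst):
    # breadth-first search returns a shortest chain of elementary steps
    queue = deque([(src, [])])
    seen = {src}
    while queue:
        node, path = queue.popleft()
        if node == dst:
            return path
        for step, nxt in edges.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [step]))
    return None

path = shortest_jungloid("IFile", "CompilationUnit")
print(len(path))  # a 4-step chain from IFile to CompilationUnit
```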
Sampling Programs
• During the execution of a program, each
  execution of a statement takes a probability
  to be sampled
   – Sampling large programs becomes feasible
   – Many traces can be collected
• Bug isolation by analyzing samples
   – Correlation between some specific statements
     or function calls with program errors/crashes

                                                          [Liblit et al. 03/05]
T. Xie and J. Pei: Data Mining for Software Engineering                    81
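A minimal sketch of the sampling idea, assuming plain Bernoulli sampling per statement execution (the actual framework uses a cheaper countdown scheme to decide which executions to record):

```python
# Each dynamic execution of an instrumented statement is recorded
# only with probability p, so monitoring overhead stays low while
# many runs still yield usable trace data.
import random

def run_with_sampling(events, p, rng):
    sampled = []
    for e in events:
        if rng.random() < p:    # record this execution with probability p
            sampled.append(e)
    return sampled

rng = random.Random(42)         # fixed seed for reproducibility
events = ["stmt%d" % i for i in range(10000)]
sample = run_with_sampling(events, 0.01, rng)
print(len(sample))              # roughly 1% of 10,000 events
```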
Outline
• Introduction
• What software engineering tasks can be
  helped by data mining?
• What kinds of software engineering data can
  be mined?
• How are data mining techniques used in
  software engineering?
• Case studies
• Conclusions
T. Xie and J. Pei: Data Mining for Software Engineering   82
Case Studies
• MAPO: mining API usages from open source
  repositories [Xie&Pei 06]
   • Code bases → sequence analysis → programming
• DynaMine: finding common error patterns by mining
  software revision histories [Livshits&Zimmermann 05]
   • Change history → association rules → defect detection
• BugTriage: Who should fix this bug? [Anvik et al. 06]
   • Bug reports → classification → debugging



  T. Xie and J. Pei: Data Mining for Software Engineering                            83
Motivation
• APIs in class libraries or frameworks are
  popularly reused in software development.

• An example programming task:
  “instrument the bytecode of a Java class by
  adding an extra method to the class”
   – org.apache.bcel.generic.ClassGen
           public void addMethod(Method m)


T. Xie and J. Pei: Data Mining for Software Engineering   84
First Try: ClassGen Java API Doc
addMethod

public void addMethod(Method m)
       Add a method to this class.
       Parameters:
        m - method to add




T. Xie and J. Pei: Data Mining for Software Engineering   85
Second Try: Code Search Engine




T. Xie and J. Pei: Data Mining for Software Engineering   86
MAPO Approach
• Analyze code segments returned from code
  search engines and disclose the inherent
  usage patterns
   – Input: an API characterized by a method, class,
     or package
     code bases: open source repositories or
     proprietary source repositories
   – Output: a short list of frequent API usage
     patterns related to the API

T. Xie and J. Pei: Data Mining for Software Engineering   87
Sample Tool Output
InstructionList.<init>()
InstructionFactory.createLoad(Type, int)
InstructionList.append(Instruction)
InstructionFactory.createReturn(Type)
InstructionList.append(Instruction)
MethodGen.setMaxStack()
MethodGen.setMaxLocals()
MethodGen.getMethod()
ClassGen.addMethod(Method)
InstructionList.dispose()
        • Mined from 36 Java source files (1,087 method sequences)
 T. Xie and J. Pei: Data Mining for Software Engineering     88
Tool Architecture




T. Xie and J. Pei: Data Mining for Software Engineering   89
Results
 A tool that integrates various components
 • Relevant code extractor
       – download returns from code search engine (koders.com)
 • Code analyzer
       – implemented a lightweight tool for Java programs
 • Sequence preprocessor
       – employed various heuristics
 • Frequent sequence miner
       – reused BIDE [Wang&Han ICDE 2004]
 • Frequent sequence postprocessor
       – employed various heuristics



T. Xie and J. Pei: Data Mining for Software Engineering          90
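The frequent-sequence step can be illustrated by brute force on tiny data. BIDE mines closed sequences far more efficiently; the call sequences below are shortened, made-up fragments in the style of the sample output.

```python
# Brute-force illustration of the support notion behind MAPO's
# frequent sequence miner: a subsequence is frequent if enough
# extracted method-call sequences contain it in order.
from itertools import combinations

sequences = [
    ["InstructionList.<init>", "InstructionList.append",
     "MethodGen.getMethod", "ClassGen.addMethod"],
    ["InstructionList.<init>", "InstructionList.append",
     "ClassGen.addMethod", "InstructionList.dispose"],
    ["MethodGen.getMethod", "ClassGen.addMethod", "InstructionList.dispose"],
]

def frequent_subsequences(seqs, min_support, max_len=3):
    counts = {}
    for seq in seqs:
        seen = set()
        for n in range(2, max_len + 1):
            # combinations() preserves order, so each tuple is an
            # order-preserving subsequence of seq
            for sub in combinations(seq, n):
                seen.add(sub)
        for sub in seen:  # count each subsequence once per sequence
            counts[sub] = counts.get(sub, 0) + 1
    return {sub: c for sub, c in counts.items() if c >= min_support}

freq = frequent_subsequences(sequences, min_support=2)
print(len(freq) > 0)  # several patterns appear in at least 2 sequences
```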
Case Studies
• MAPO: mining API usages from open source
  repositories [Xie&Pei 06]
   • Code bases → sequence analysis → programming
• DynaMine: finding common error patterns by mining
  software revision histories [Livshits&Zimmermann 05]
   • Change history → association rules → defect detection
• BugTriage: Who should fix this bug? [Anvik et al. 06]
   • Bug reports → classification → debugging



  T. Xie and J. Pei: Data Mining for Software Engineering                            91
Co-Change Pattern
• Things that are frequently changed together
often form a pattern (a.k.a. co-change)
    • E.g., co-added method calls
     public void createPartControl(Composite parent) {
       ...
       // add listener for editor page activation
       getSite().getPage().addPartListener(partListener);
     }

     public void dispose() {
       ...
       getSite().getPage().removePartListener(partListener);  // co-added
     }

T. Xie and J. Pei: Data Mining for Software Engineering   Adapted from Livshits et al.’s slides   92
DynaMine
revision history mining: mine CVS histories → patterns → rank and filter

dynamic analysis: instrument relevant method calls → run the
application → post-process → usage patterns / error patterns /
unlikely patterns

reporting: report patterns, report bugs
  T. Xie and J. Pei: Data Mining for Software Engineering     Adapted from Livshits et al.’s slides   93
Mining Patterns
revision history mining: mine CVS histories → patterns → rank and filter

dynamic analysis: instrument relevant method calls → run the
application → post-process → usage patterns / error patterns /
unlikely patterns

reporting: report patterns, report bugs
                                                 Adapted from Livshits et al.’s slides
  T. Xie and J. Pei: Data Mining for Software Engineering                                             94
Mining Method Calls
Foo.java 1.12:  o1.addListener(); o1.removeListener()
Bar.java 1.47:  o2.addListener(); o2.removeListener(); System.out.println()
Baz.java 1.23:  o3.addListener(); o3.removeListener(); list.iterator();
                iter.hasNext(); iter.next()
Qux.java 1.41:  o4.addListener(); System.out.println()
         1.42:  o4.removeListener()

T. Xie and J. Pei: Data Mining for Software Engineering   Adapted from Livshits et al.’s slides   95
Finding Pairs
Foo.java 1.12:  o1.addListener(); o1.removeListener()              → 1 pair
Bar.java 1.47:  o2.addListener(); o2.removeListener();
                System.out.println()                               → 1 pair
Baz.java 1.23:  o3.addListener(); o3.removeListener(); list.iterator();
                iter.hasNext(); iter.next()                        → 2 pairs
Qux.java 1.41:  o4.addListener(); System.out.println()             → 0 pairs
         1.42:  o4.removeListener()                                → 0 pairs
T. Xie and J. Pei: Data Mining for Software Engineering   Adapted from Livshits et al.’s slides   96
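The pair counting can be sketched directly from the revision example; receiver objects are stripped so that o1.addListener and o4.addListener count as the same method.

```python
# Count method-call pairs that were added together in the same
# check-in, mirroring the slide's CVS-revision example.

revisions = {
    ("Foo.java", "1.12"): ["o1.addListener", "o1.removeListener"],
    ("Bar.java", "1.47"): ["o2.addListener", "o2.removeListener",
                           "System.out.println"],
    ("Baz.java", "1.23"): ["o3.addListener", "o3.removeListener",
                           "list.iterator", "iter.hasNext", "iter.next"],
    ("Qux.java", "1.41"): ["o4.addListener", "System.out.println"],
    ("Qux.java", "1.42"): ["o4.removeListener"],
}

def method_name(call):
    return call.split(".", 1)[1]  # drop the receiver, keep the method

pair_support = {}
for calls in revisions.values():
    methods = sorted({method_name(c) for c in calls})
    for i in range(len(methods)):
        for j in range(i + 1, len(methods)):
            pair = (methods[i], methods[j])
            pair_support[pair] = pair_support.get(pair, 0) + 1

print(pair_support[("addListener", "removeListener")])  # co-added in 3 check-ins
```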
Mining Method Calls
Foo.java 1.12:  o1.addListener(); o1.removeListener()
Bar.java 1.47:  o2.addListener(); o2.removeListener(); System.out.println()
Baz.java 1.23:  o3.addListener(); o3.removeListener(); list.iterator();
                iter.hasNext(); iter.next()
Qux.java 1.41:  o4.addListener(); System.out.println()
         1.42:  o4.removeListener()

Co-added calls often represent a usage pattern.
T. Xie and J. Pei: Data Mining for Software Engineering     Adapted from Livshits et al.’s slides   97
Finding Patterns

Find “frequent itemsets” (with Apriori) among the co-added calls:

    o.enterAlignment(), o.exitAlignment(), o.redoAlignment(),
    iter.hasNext(), iter.next(), …

    → {enterAlignment(), exitAlignment(), redoAlignment()}
T. Xie and J. Pei: Data Mining for Software Engineering   Adapted from Livshits et al.’s slides   98
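A minimal Apriori sketch over made-up check-in transactions shows how a frequent itemset such as {enterAlignment(), exitAlignment(), redoAlignment()} emerges:

```python
# Minimal Apriori: grow frequent itemsets level by level, keeping only
# candidates whose support meets the threshold. Transactions are toy
# data in the spirit of the slide's alignment example.

transactions = [
    {"enterAlignment", "exitAlignment", "redoAlignment"},
    {"enterAlignment", "exitAlignment", "redoAlignment"},
    {"enterAlignment", "exitAlignment", "redoAlignment", "hasNext"},
    {"hasNext", "next"},
]

def apriori(transactions, min_support):
    def support(itemset):
        return sum(itemset <= t for t in transactions)
    items = sorted({i for t in transactions for i in t})
    frequent = {frozenset([i]) for i in items if support({i}) >= min_support}
    result = set(frequent)
    while frequent:
        # candidate (k+1)-itemsets are unions of frequent k-itemsets
        candidates = {a | b for a in frequent for b in frequent
                      if len(a | b) == len(a) + 1}
        frequent = {c for c in candidates if support(c) >= min_support}
        result |= frequent
    return result

patterns = apriori(transactions, min_support=3)
print(frozenset({"enterAlignment", "exitAlignment", "redoAlignment"})
      in patterns)  # True
```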
Ranking Patterns
Foo.java 1.12:  o1.addListener(); o1.removeListener()
Bar.java 1.47:  o2.addListener(); o2.removeListener(); System.out.println()
Baz.java 1.23:  o3.addListener(); o3.removeListener(); list.iterator();
                iter.hasNext(); iter.next()
Qux.java 1.41:  o4.addListener(); System.out.println()
         1.42:  o4.removeListener()

Support count = # occurrences of a pattern
Confidence = strength of a pattern, P(A|B)
T. Xie and J. Pei: Data Mining for Software Engineering     Adapted from Livshits et al.’s slides   99
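Both measures can be computed directly on the revision example (method calls abbreviated to bare names):

```python
# Support count and confidence for the addListener/removeListener
# pattern, over the slide's five check-ins.

revisions = [
    {"addListener", "removeListener"},                  # Foo.java 1.12
    {"addListener", "removeListener", "println"},       # Bar.java 1.47
    {"addListener", "removeListener", "iterator",
     "hasNext", "next"},                                # Baz.java 1.23
    {"addListener", "println"},                         # Qux.java 1.41
    {"removeListener"},                                 # Qux.java 1.42
]

def support(itemset):
    # number of check-ins containing every call in the itemset
    return sum(itemset <= r for r in revisions)

def confidence(antecedent, consequent):
    # P(consequent | antecedent), estimated from the check-ins
    return support(antecedent | consequent) / support(antecedent)

print(support({"addListener", "removeListener"}))       # 3
print(confidence({"addListener"}, {"removeListener"}))  # 0.75
```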
Ranking Patterns
Foo.java 1.12:  o1.addListener(); o1.removeListener()
Bar.java 1.47:  o2.addListener(); o2.removeListener(); System.out.println()
Baz.java 1.23:  o3.addListener(); o3.removeListener(); list.iterator();
                iter.hasNext(); iter.next()
Qux.java 1.41:  o4.addListener(); System.out.println()
         1.42:  o4.removeListener()      ← this is a fix!
                                           Rank removeListener() patterns higher
T. Xie and J. Pei: Data Mining for Software Engineering       Adapted from Livshits et al.’s slides 100
Dynamic Validation
revision history mining: mine CVS histories → patterns → rank and filter

dynamic analysis: instrument relevant method calls → run the
application → post-process → usage patterns / error patterns /
unlikely patterns

reporting: report patterns, report bugs
                                                 Adapted from Livshits et al.’s slides
  T. Xie and J. Pei: Data Mining for Software Engineering                                             101
Matches and Mismatches
        Find and count matches and mismatches in the traces:

        o.register(d) followed by o.deregister(d)       → match
        o.register(d) with no matching o.deregister(d)  → mismatch

        Keep both static and dynamic counts.
T. Xie and J. Pei: Data Mining for Software Engineering   Adapted from Livshits et al.’s slides 102
Pattern Classification

Post-process with v validations and e violations:

           usage patterns:  e < v/10
           error patterns:  v/10 ≤ e ≤ 2v
        unlikely patterns:  otherwise
T. Xie and J. Pei: Data Mining for Software Engineering   Adapted from Livshits et al.’s slides 103
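The classification rule is small enough to transcribe directly:

```python
# Direct transcription of the slide's rule: with v dynamic validations
# and e violations, a pattern is a usage pattern, an error pattern,
# or an unlikely pattern.

def classify_pattern(v, e):
    if e < v / 10:
        return "usage"
    if v / 10 <= e <= 2 * v:
        return "error"
    return "unlikely"

print(classify_pattern(100, 5))   # usage: few violations
print(classify_pattern(100, 50))  # error: violations comparable to validations
print(classify_pattern(10, 99))   # unlikely: mostly violated
```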
Experiments
                      jEdit        Eclipse
since                 2000         2001
developers            92           112
lines of code         700,000      2,900,000
revisions             40,000       400,000

Total: 56 patterns
T. Xie and J. Pei: Data Mining for Software Engineering        Adapted from Livshits et al.’s slides 104
Case Studies
• MAPO: mining API usages from open source
  repositories [Xie&Pei 06]
   • Code bases → sequence analysis → programming
• DynaMine: finding common error patterns by mining
  software revision histories [Livshits&Zimmermann 05]
   • Change history → association rules → defect detection
• BugTriage: Who should fix this bug? [Anvik et al. 06]
   • Bug reports → classification → debugging



  T. Xie and J. Pei: Data Mining for Software Engineering                            105
Assigning a Bug
• Many considerations
   – who has the expertise?
   – who is available?
   – how quickly does this have to be fixed?


• Not always an obvious or correct assignment
   – multiple developers may be suitable
   – difficult to know what the bug is about
   – bug fixes get delayed
         • triage and fix rate indicates ‘liveness’ of OSS projects

T. Xie and J. Pei: Data Mining for Software Engineering   Adapted from Anvik et al.’s slides 106
Assigning a Bug Today




                 bill@firefox.org




T. Xie and J. Pei: Data Mining for Software Engineering   Adapted from Anvik et al.’s slides 107
Recommending assignment




                  bill@firefox.com
                  ted@gmail.com
                  cindy-loo@whoville.org




T. Xie and J. Pei: Data Mining for Software Engineering   Adapted from Anvik et al.’s slides 108
Overview of approach
             Approach tuned using Eclipse and Firefox

  Resolved Bug Reports → Machine Learning Algorithm →
  Assignment Recommender → bill@firefox.com, ted@gmail.com,
  cindy-loo@whoville.org
 T. Xie and J. Pei: Data Mining for Software Engineering   Adapted from Anvik et al.’s slides 109
Steps to the approach
1. Characterize the reports

2. Label the reports

3. Select the reports

4. Use a machine learning algorithm


T. Xie and J. Pei: Data Mining for Software Engineering   Adapted from Anvik et al.’s slides 110
Step 1: Characterizing a report
• Based on two fields
   – textual summary
   – description
• Use text categorization approach
   – represent with a word vector
   – remove stop words
   – intra- and inter-document frequency



T. Xie and J. Pei: Data Mining for Software Engineering   Adapted from Anvik et al.’s slides 111
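Step 1 can be sketched as a simple bag-of-words conversion. The stop-word list and report text below are illustrative, and the real approach additionally applies intra- and inter-document frequency weighting.

```python
# Represent a bug report's summary and description as a word vector
# after removing stop words (toy stop-word list and report text).

STOP_WORDS = {"the", "a", "is", "on", "when", "in", "and", "to"}

def word_vector(text):
    vector = {}
    for word in text.lower().split():
        word = word.strip(".,;:!?")     # drop trailing punctuation
        if word and word not in STOP_WORDS:
            vector[word] = vector.get(word, 0) + 1
    return vector

report = "Crash when opening the editor. The editor crashes on large files."
vec = word_vector(report)
print(vec["editor"])  # 2: term frequency within the report
```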
Step 2: Labeling a report
• Must determine who really fixed it
   – “Assigned-to” field is not accurate

• Project-specific heuristics




T. Xie and J. Pei: Data Mining for Software Engineering   Adapted from Anvik et al.’s slides 112
Step 2: Labeling a report
• Must determine who really fixed it
   – “Assigned-to” field is not accurate

• Project-specific heuristics
   – simple
      • If a report is FIXED, label with who marked it as
        fixed. (Eclipse)
      • If a report is DUPLICATE, use the label of the report
        it duplicates. (Eclipse and Firefox)
 T. Xie and J. Pei: Data Mining for Software Engineering   Adapted from Anvik et al.’s slides 113
Step 2: Labeling a report
• Must determine who really fixed it
   – “Assigned-to” field is not accurate

• Project-specific heuristics
   – simple
   – complex (Firefox): if the report is FIXED and has
     attachments approved by a reviewer, then
      – If one submitter of patches, use their name.
      – If more than one submitter, choose the name of who
        submitted the most patches.
      – If submitters cannot be determined, label with the
        person assigned to the report.
T. Xie and J. Pei: Data Mining for Software Engineering   Adapted from Anvik et al.’s slides 114
Step 2: Labeling a report
• Must determine who really fixed it
   – “Assigned-to” field is not accurate

• Project-specific heuristics
   – simple
   – complex
   – unclassifiable (Firefox): reports marked as WONTFIX are
     often resolved after discussion and developer consensus.
      – Unknown who would have fixed the bug
      – Report is labeled unclassifiable
T. Xie and J. Pei: Data Mining for Software Engineering   Adapted from Anvik et al.’s slides 115
Step 2: Labeling a report
• Must determine who really fixed it
   – “Assigned-to” field is not accurate

• Project-specific heuristics
                         Eclipse   Firefox
   – simple                 5         4
   – complex                2         1
   – unclassifiable         1         4
T. Xie and J. Pei: Data Mining for Software Engineering   Adapted from Anvik et al.’s slides 116
Step 3: Selecting the reports
• Exclude those with no label

• Include those of active developers
   – developer profiles

  [Two charts: monthly report counts per developer, Sep-04 through
  Apr-05, y-axis 0–40]
     T. Xie and J. Pei: Data Mining for Software Engineering                                         Adapted from Anvik et al.’s slides 117
Step 3: Selecting the reports

  [Chart: monthly report counts, Sep-04 through Apr-05, y-axis 0–40,
  with an activity threshold of 3 reports / month]
T. Xie and J. Pei: Data Mining for Software Engineering       Adapted from Anvik et al.’s slides 118
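The activity cut-off above can be applied to developer profiles with a simple filter; the profile structure (developer → monthly report counts) is a hypothetical stand-in:

```python
def active_developers(profiles, min_reports_per_month=3):
    """Keep developers whose average monthly report activity meets the
    threshold; `profiles` maps developer -> list of monthly counts."""
    return {
        dev for dev, monthly_counts in profiles.items()
        if sum(monthly_counts) / len(monthly_counts) >= min_reports_per_month
    }

# Illustrative profiles, Sep-04 .. Apr-05 (hypothetical data)
profiles = {
    "paulw":  [4, 6, 5, 3, 7, 2, 4, 5],
    "tryder": [1, 0, 2, 1, 0, 1, 0, 1],
}
```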
Step 4: Use an ML algorithm
• Supervised Algorithms
   – Naïve Bayes
   – C4.5
   – Support Vector Machines
• Unsupervised Algorithms
   – Expectation Maximization
• Incremental Algorithms
   – Naïve Bayes

T. Xie and J. Pei: Data Mining for Software Engineering   Adapted from Anvik et al.’s slides 119
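As a sketch of the supervised setting, a toy multinomial naïve Bayes classifier can map the words of a bug report to a developer label. This is a minimal from-scratch illustration, not the implementation evaluated in the study:

```python
import math
from collections import Counter, defaultdict

class NaiveBayesTriager:
    """Toy multinomial naive Bayes: words of a bug report -> developer."""

    def fit(self, reports, developers):
        self.word_counts = defaultdict(Counter)  # developer -> word counts
        self.class_counts = Counter(developers)  # developer -> #reports
        self.vocab = set()
        for text, dev in zip(reports, developers):
            words = text.lower().split()
            self.word_counts[dev].update(words)
            self.vocab.update(words)
        return self

    def predict(self, text):
        words = text.lower().split()
        total = sum(self.class_counts.values())
        best, best_score = None, -math.inf
        for dev, n in self.class_counts.items():
            # log prior + log likelihoods with add-one smoothing
            score = math.log(n / total)
            denom = sum(self.word_counts[dev].values()) + len(self.vocab)
            for w in words:
                score += math.log((self.word_counts[dev][w] + 1) / denom)
            if score > best_score:
                best, best_score = dev, score
        return best

# Hypothetical training data
triager = NaiveBayesTriager().fit(
    ["crash in layout engine", "layout overflow bug", "ssl handshake fails"],
    ["paulw", "paulw", "tryder"])
triager.predict("layout crash")  # predicts "paulw"
```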
Evaluating Recommenders
Precision = (# of relevant recommendations) / (# of recommendations made)

Recall = (# of relevant recommendations) / (# of possibly relevant developers)

 How do we find this?
 T. Xie and J. Pei: Data Mining for Software Engineering   Adapted from Anvik et al.’s slides 120
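Both measures follow directly from the formulas above; a minimal helper, with illustrative developer names:

```python
def precision_recall(recommended, relevant):
    """Precision and recall of a recommendation list against the set
    of possibly relevant developers."""
    recommended, relevant = set(recommended), set(relevant)
    hits = len(recommended & relevant)
    precision = hits / len(recommended) if recommended else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```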
Determining Possibly Relevant Developers

  [Diagram] Fixed bug report
     → modules touched by the fix (Module C, Module J, Module Q, …)
     → CVS usernames from the CVS repository (paulw, tryder, stibbs, …)
     → bug repository usernames / email addresses
       (paulw@..., tryder@..., vendger@..., …)
  T. Xie and J. Pei: Data Mining for Software Engineering            Adapted from Anvik et al.’s slides 121
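The resolution pipeline on this slide — from the modules a fix touched, through CVS committer usernames, to bug-repository email addresses — can be sketched as follows; the sample data are hypothetical stand-ins for the CVS and bug repositories:

```python
def possibly_relevant_developers(fix_modules, module_committers,
                                 username_to_email):
    """Resolve the CVS usernames who committed to the modules touched
    by a fix into bug-repository email addresses."""
    emails = set()
    for module in fix_modules:
        for username in module_committers.get(module, ()):
            email = username_to_email.get(username)
            if email:
                emails.add(email)
    return emails

# Hypothetical repository data
module_committers = {"Module C": ["paulw"], "Module J": ["tryder", "stibbs"]}
username_to_email = {"paulw": "paulw@example.org",
                     "tryder": "tryder@example.org",
                     "stibbs": "vendger@example.org"}
```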
Still not Straightforward (e.g., Firefox)

  [Diagram] Bug report → module list (Module C, Module J, Module Q, …)
     → CVS usernames (paulw, tryder, stibbs, …)
     → each CVS username fans out to a set of patch submitters’
       email addresses, e.g.:
        – paulw@..., bboggs@..., vendger@..., …
        – tryder@..., axelf@..., vendger@..., …
        – jlpicard@..., bchater@..., kpollac@..., …
 T. Xie and J. Pei: Data Mining for Software Engineering            Adapted from Anvik et al.’s slides 122
Precision vs. Recall

  A small set of “right” developers (precision) is more important than
  the set of all possible developers (recall)

  [Two bar charts, y-axis 0–100%: precision (left) and recall (right)
  of Multi. NB, C4.5, and SVM on Eclipse, Firefox, and gcc]
  T. Xie and J. Pei: Data Mining for Software Engineering                   Adapted from Anvik et al.’s slides 123
Overview

programming      defect detection      testing      debugging      maintenance
                        software engineering tasks


          classification      association/patterns      clustering      etc.
                           data mining techniques


   code bases    change history    program states    structural entities    bug reports
                         software engineering data
  T. Xie and J. Pei: Data Mining for Software Engineering                                        124
Conclusions
• Software development generates a large
  amount of different types of data
• Data mining and data analysis can help
  software engineering substantially
• Successful cases
   – What software engineering data can be mined?
   – What software engineering tasks can be helped?
   – How to conduct the mining?

T. Xie and J. Pei: Data Mining for Software Engineering   125
Challenges
• Complexity in software development
   – Specific data mining techniques are needed
• Software development and maintenance are
  dynamic and user-centered
   – Interactive data mining
   – Visual data mining and analysis
   – Online, incremental mining



T. Xie and J. Pei: Data Mining for Software Engineering   126
Questions?


Mining Software Engineering Data Bibliography
http://ase.csc.ncsu.edu/dmse/
• What software engineering tasks can be helped by data mining?
• What kinds of software engineering data can be mined?
• How are data mining techniques used in software engineering?
• Resources

More Related Content

PPTX
HotSoS16 Tutorial "Text Analytics for Security" by Tao Xie and William Enck
PDF
Software Analytics: Towards Software Mining that Matters
PDF
Software bug prediction
PDF
Software Analytics: Data Analytics for Software Engineering
PDF
PPTX
Survey on Software Defect Prediction
PPTX
Big(ger) Data in Software Engineering
PDF
Software Mining and Software Datasets
HotSoS16 Tutorial "Text Analytics for Security" by Tao Xie and William Enck
Software Analytics: Towards Software Mining that Matters
Software bug prediction
Software Analytics: Data Analytics for Software Engineering
Survey on Software Defect Prediction
Big(ger) Data in Software Engineering
Software Mining and Software Datasets

What's hot (7)

PPTX
Transferring Software Testing Tools to Practice
PDF
Survey on Software Defect Prediction (PhD Qualifying Examination Presentation)
PDF
Survey on Software Defect Prediction
PDF
User Expectations in Mobile App Security
PDF
Proactive Empirical Assessment of New Language Feature Adoption via Automated...
PPTX
Toward a Traceable, Explainable and fair JD/Resume Recommendation System
PDF
Wcre13b.ppt
Transferring Software Testing Tools to Practice
Survey on Software Defect Prediction (PhD Qualifying Examination Presentation)
Survey on Software Defect Prediction
User Expectations in Mobile App Security
Proactive Empirical Assessment of New Language Feature Adoption via Automated...
Toward a Traceable, Explainable and fair JD/Resume Recommendation System
Wcre13b.ppt
Ad

Viewers also liked (7)

PPT
Ejemplo de Aplicaciones en Weka
PDF
Fundamentos de Data Mining con R
DOCX
Mineria de datos
PPT
Presentación colombia junio 2011
PPTX
Mineria de datos
PPT
MIneria de datos
Ejemplo de Aplicaciones en Weka
Fundamentos de Data Mining con R
Mineria de datos
Presentación colombia junio 2011
Mineria de datos
MIneria de datos
Ad

Similar to Datamingse (20)

PDF
Mining Software Engineering Data
PPTX
Towards Reusable Research Software
PDF
Software Analytics - Achievements and Challenges
PDF
Precise and Complete Requirements? An Elusive Goal
PDF
Data Con LA 2018 - Towards Data Science Engineering Principles by Joerg Schad
PPT
Software Engineering Lec 1-introduction
PDF
01 - Course setup software sustainability
PPTX
Log Engineering: Towards Systematic Log Mining to Support the Development of ...
PPTX
Log Engineering: Towards Systematic Log Mining to Support the Development of ...
DOC
V1_I2_2012_Paper3.doc
PDF
Improvement of Software Maintenance and Reliability using Data Mining Techniques
PPTX
Introduction Software engineering
PPTX
ACM Chicago March 2019 meeting: Software Engineering and AI - Prof. Tao Xie, ...
PPTX
Intelligent Software Engineering: Synergy between AI and Software Engineering
PPSX
Scope of software engineering
PPTX
ISO 15926 Reference Data Engineering Methodology
PPTX
Interactive SDLC
PDF
Software Engineering Research: Leading a Double-Agent Life.
PDF
AI for Software Engineering
PPTX
Big Data: the weakest link
Mining Software Engineering Data
Towards Reusable Research Software
Software Analytics - Achievements and Challenges
Precise and Complete Requirements? An Elusive Goal
Data Con LA 2018 - Towards Data Science Engineering Principles by Joerg Schad
Software Engineering Lec 1-introduction
01 - Course setup software sustainability
Log Engineering: Towards Systematic Log Mining to Support the Development of ...
Log Engineering: Towards Systematic Log Mining to Support the Development of ...
V1_I2_2012_Paper3.doc
Improvement of Software Maintenance and Reliability using Data Mining Techniques
Introduction Software engineering
ACM Chicago March 2019 meeting: Software Engineering and AI - Prof. Tao Xie, ...
Intelligent Software Engineering: Synergy between AI and Software Engineering
Scope of software engineering
ISO 15926 Reference Data Engineering Methodology
Interactive SDLC
Software Engineering Research: Leading a Double-Agent Life.
AI for Software Engineering
Big Data: the weakest link

Recently uploaded (20)

PPTX
Configure Apache Mutual Authentication
PDF
Improvisation in detection of pomegranate leaf disease using transfer learni...
PDF
Taming the Chaos: How to Turn Unstructured Data into Decisions
PDF
sustainability-14-14877-v2.pddhzftheheeeee
PDF
Developing a website for English-speaking practice to English as a foreign la...
PDF
1 - Historical Antecedents, Social Consideration.pdf
PDF
Getting started with AI Agents and Multi-Agent Systems
PPTX
Custom Battery Pack Design Considerations for Performance and Safety
PPTX
TEXTILE technology diploma scope and career opportunities
PDF
Architecture types and enterprise applications.pdf
PPTX
Final SEM Unit 1 for mit wpu at pune .pptx
PDF
Hybrid horned lizard optimization algorithm-aquila optimizer for DC motor
PDF
Zenith AI: Advanced Artificial Intelligence
PPTX
GROUP4NURSINGINFORMATICSREPORT-2 PRESENTATION
PDF
ENT215_Completing-a-large-scale-migration-and-modernization-with-AWS.pdf
PDF
The influence of sentiment analysis in enhancing early warning system model f...
PDF
NewMind AI Weekly Chronicles – August ’25 Week III
PDF
STKI Israel Market Study 2025 version august
PDF
OpenACC and Open Hackathons Monthly Highlights July 2025
PPTX
The various Industrial Revolutions .pptx
Configure Apache Mutual Authentication
Improvisation in detection of pomegranate leaf disease using transfer learni...
Taming the Chaos: How to Turn Unstructured Data into Decisions
sustainability-14-14877-v2.pddhzftheheeeee
Developing a website for English-speaking practice to English as a foreign la...
1 - Historical Antecedents, Social Consideration.pdf
Getting started with AI Agents and Multi-Agent Systems
Custom Battery Pack Design Considerations for Performance and Safety
TEXTILE technology diploma scope and career opportunities
Architecture types and enterprise applications.pdf
Final SEM Unit 1 for mit wpu at pune .pptx
Hybrid horned lizard optimization algorithm-aquila optimizer for DC motor
Zenith AI: Advanced Artificial Intelligence
GROUP4NURSINGINFORMATICSREPORT-2 PRESENTATION
ENT215_Completing-a-large-scale-migration-and-modernization-with-AWS.pdf
The influence of sentiment analysis in enhancing early warning system model f...
NewMind AI Weekly Chronicles – August ’25 Week III
STKI Israel Market Study 2025 version august
OpenACC and Open Hackathons Monthly Highlights July 2025
The various Industrial Revolutions .pptx

Datamingse

  • 1. Data Mining for Software Engineering Tao Xie Jian Pei North Carolina State University Simon Fraser University www.csc.ncsu.edu/faculty/xie www.cs.sfu.ca/~jpei xie@csc.ncsu.edu jpei@cs.sfu.ca An up-to-date version of this tutorial is available at http://ase.csc.ncsu.edu/dmse/dmse.pdf Attendants of the tutorial are kindly suggested to download the latest version 2-3 days before the tutorial
  • 2. Outline • Introduction • What software engineering tasks can be helped by data mining? • What kinds of software engineering data can be mined? • How are data mining techniques used in software engineering? • Case studies • Conclusions T. Xie and J. Pei: Data Mining for Software Engineering 2
  • 3. Introduction • A large amount of data is produced in software development – Data from software repositories – Data from program executions • Data mining techniques can be used to analyze software engineering data – Understand software artifacts or processes – Assist software engineering tasks T. Xie and J. Pei: Data Mining for Software Engineering 3
  • 4. Examples • Data in software development – Programming: versions of programs – Testing: execution traces – Deployment: error/bug reports – Reuse: open source packages • Software development needs data analysis – How should I use this class? – Where are the bugs? – How to implement a typical functionality? T. Xie and J. Pei: Data Mining for Software Engineering 4
  • 5. Overview programming defect detection testing debugging maintenance software engineering tasks helped by data mining association/ classification clustering … patterns data mining techniques code change program structural bug bases history states entities reports software engineering data T. Xie and J. Pei: Data Mining for Software Engineering 5
  • 6. Software Engineering Tasks • Programming • Static defect detection • Testing • Debugging • Maintenance T. Xie and J. Pei: Data Mining for Software Engineering 6
  • 7. Software Categorization – Why? • SourceForge hosts 70,000+ software systems – How can one find the software needed? – How can developers collaborate effectively? • Why software categorization? – SourceForge categorizes software according to their primary function (editors, databases, etc.) • Software foundries – related software – Keep developers informed about related software • Learn the “best practice” • Promote software reuse [Kawaguchi et al. 04] T. Xie and J. Pei: Data Mining for Software Engineering 7
  • 8. Software Categorization – What? • Organize software systems into categories – Software systems in each category share a somehow same theme – A software system may belong to one or multiple categories • What are the categories? – Defined by domain experts manually – Discovered automatically • Example system: MUDABlue [Kawaguchi et al. 04] T. Xie and J. Pei: Data Mining for Software Engineering 8
  • 9. Version Comparison and Search • What does the current code segment look like in previous versions? – How have they been changed over versions? • Using standard search tools, e.g., grep? – Source code may not be well documented – The code may be changed • Can we have some source code friendly search engines? – E.g., www.koders.com, corp.krugle.com, demo.spars.info T. Xie and J. Pei: Data Mining for Software Engineering 9
  • 10. Software Library Reuse • Issues in reusing software libraries – Which components should I use? – What is the right way to use? – Multiple components may often be used in combinations, e.g., Smalltalk’s Model/View/Controller • Frequent patterns help – Specifically, inheritance information is important – Example: most application classes inheriting from library class Widget tend to override its member function paint(); most application classes instantiating library class Painter and calling its member function begin() also call end() [Michail 99/00] T. Xie and J. Pei: Data Mining for Software Engineering 10
  • 11. API Usage • How should an API be used correctly? – An API may serve multiple functionalities – Different styles of API usage • “I know what type of object I need, but I don’t know how to write the code to get the object” [Mandelin et al. 05] – Can we synthesize jungloid code fragments automatically? – Given a simple query describing the desired code in terms of input and output types, return a code segment • “I know what method call I need, but I don’t know how to write code before and after this method call” [Xie & Pei 06] T. Xie and J. Pei: Data Mining for Software Engineering 11
  • 12. How Can Data Mining Help? • Identify characteristic usage of the library automatically • Understand the reuse of library classes from real-life applications instead of toy programs • Keep reuse patterns up to date w.r.t. the most recent version of the library and applications • General patterns may cover inheritance cases T. Xie and J. Pei: Data Mining for Software Engineering 12
  • 13. Software Engineering Tasks • Programming • Static defect detection • Testing • Debugging • Maintenance T. Xie and J. Pei: Data Mining for Software Engineering 13
  • 14. Locating Matching Method Calls • Many bugs due to unmatched method calls – E.g., fail to call free() to deallocate a data structure – One-line-code-changes: many bugs can be fixed by changing only one line in source code • Problem: how to find highly correlated pairs of method calls – E.g., <fopen, fclose>, <malloc, free> [Li&Zhou 05, Livshits&Zimmermann 05, Yang et al. 06] T. Xie and J. Pei: Data Mining for Software Engineering 14
  • 15. Inferring Errors from Source Code • A system must follow some correctness rules – Unfortunately, the rules are documented or specified in an ad hoc manner • Deriving the rules requires a lot of a priori knowledge • Can we detect some errors without knowing the rules by data mining? [Engler et al. 01] T. Xie and J. Pei: Data Mining for Software Engineering 15
  • 16. Inference in Large Systems • Execution traces inferred properties static checker • Inference algorithms need to be scalable with the size of the programs and the input traces • Due to only imperfect traces available in industrial environments, how to use those imperfect traces • Many inferred properties may be uninteresting; it is hard for a developer to review those properties thoroughly for large programs [Yang et al. 06] T. Xie and J. Pei: Data Mining for Software Engineering 16
  • 17. Detecting Copy-Paste and Bugs • Copy-pasted code is common in large systems – Code reuse • Prone to bugs – E.g., identifiers are not changed consistently • How to detect copy-paste code? – How to scale up to large software? – How to handle minor modifications? [Li et al. 04] T. Xie and J. Pei: Data Mining for Software Engineering 17
  • 18. Software Engineering Tasks • Programming • Static defect detection • Testing • Debugging • Maintenance T. Xie and J. Pei: Data Mining for Software Engineering 18
  • 19. Inspecting Test Behavior • Automatically generated tests or field executions lack test oracles – Sample/summarize behavior for inspection • Examples: – Select tests (executions/outputs) for inspection • E.g., clustering path/branch profiles [Podgurski et al. 01, Bowring et al. 04] – Summarize object behavior [Xie&Notkin 04, Dallmeier et al. 06] T. Xie and J. Pei: Data Mining for Software Engineering 19
  • 20. Mining Object Behavior • Can we find the undocumented behavior of classes? It may not be observed from program source code directly Behavior model for JAVA Vector class. Picture from “Mining object behavior with ADABU” [Dallmeier et al. WODA 06] T. Xie and J. Pei: Data Mining for Software Engineering 20
  • 21. Mining Specifications • Specifications are very useful for testing – test generation + test oracle • Major obstacle: protocol specifications are often unavailable – Example: what is the right way to use the socket API? • How can data mining help? – If a protocol is held in well tested programs (i.e., their executions), the protocol is likely valid [Ammons et al. 02] T. Xie and J. Pei: Data Mining for Software Engineering 21
  • 22. Specification Helps Specification Does the above code follow the correct socket API protocol? [Ammons et al. 02] T. Xie and J. Pei: Data Mining for Software Engineering 22
  • 23. Software Engineering Tasks • Programming • Static defect detection • Testing • Debugging • Maintenance T. Xie and J. Pei: Data Mining for Software Engineering 23
  • 24. Fault Localization • Running tests produces execution traces – Some tests fail and the other tests pass • Given many execution traces generated by tests, can we suggest likely faulty statements? [Liblit et al. 03/05, Liu et al. 05] – Some traces may lead to program failures – It would be better if we can even suggest the likeliness of a statement being faulty • For large programs, how can we collect traces effectively? • What if there are multiple faults? T. Xie and J. Pei: Data Mining for Software Engineering 24
  • 25. Analyzing Bug Repositories • Most open source software development projects have bug repositories – Report and track problems and potential enhancements – Valuable information for both developers and users • Bug repositories are often messy – Duplicate error reports; Related errors • Challenge: how to analyze effectively? – Who are reporting and at what rate? – How are reports resolved and by whom? • Automatic bug report assignment & duplicate detection [Anvik et al. 06] T. Xie and J. Pei: Data Mining for Software Engineering 25
  • 26. Stabilizing Buggy Applications • Users may report bugs in a program, can those bug reports be used to prevent the program from crashing? – When a user attempts an action that led to some errors before, a warning should be issued • Given a program state S and an event e, predict whether e likely results in a bug – Positive samples: past bugs – Negative samples: “not bug” reports [Michail&Xie 05] T. Xie and J. Pei: Data Mining for Software Engineering 26
  • 27. Software Engineering Tasks • Programming • Static defect detection • Testing • Debugging • Maintenance T. Xie and J. Pei: Data Mining for Software Engineering 27
  • 28. Guiding Software Changes • Programmers start changing some locations – Suggest locations that other programmers have changed together with this location E.g., “Programmers who changed this function also changed …” • Mine association rules from change histories – coarse-granular entities: directories, modules, files – fine-granular entities: methods, variables, sections [Zimmermann et al. 04, Ying et al. 04] T. Xie and J. Pei: Data Mining for Software Engineering 28
  • 29. Aspect Mining • Discover crosscutting concerns that can be potentially turned into one place (an aspect in aspect-oriented programs) – E.g., logging, timing, communication • Mine recurring execution patterns – Event traces [Breu&Krinke 04, Tonella&Ceccato 04] – Source code [Shepherd et al. 05] T. Xie and J. Pei: Data Mining for Software Engineering 29
  • 30. Software Engineering Data • Static code bases • Software change history • Profiled program states • Profiled structural entities • Bug reports T. Xie and J. Pei: Data Mining for Software Engineering 30
  • 31. Code Entities • Identifiers within a system [Kawaguchi et al. 04] – E.g., variable names, function names • Statement sequence within a basic block [Li et al. 04] – E.g., variables, operators, constants, functions, keywords • Element set within a function [Li&Zhou 05] – E.g., functions, variables, data types • Call sites within a function [Xie&Pei 05] • API signatures [Mandelin et al. 05] [Mandelin et al. 05] http://snobol.cs.berkeley.edu/prospector/index.jsp T. Xie and J. Pei: Data Mining for Software Engineering 31
  • 32. Relationships btw Code Entities • Membership relationships – A class contains membership functions • Reuse relationships – Class inheritance – Class instantiation – Function invocations – Function overriding [Michail 99/00] http://codeweb.sourceforge.net/ for C++ T. Xie and J. Pei: Data Mining for Software Engineering 32
  • 33. Software Engineering Data • Static code bases • Software change history • Profiled program states • Profiled structural entities • Bug reports T. Xie and J. Pei: Data Mining for Software Engineering 33
  • 34. Concurrent Versions System (CVS) Comments [Chen et al. 01] http://cvssearch.sourceforge.net/ T. Xie and J. Pei: Data Mining for Software Engineering 34
  • 35. CVS Comments RCS files:/repository/file.h,v • cvs log – displays Working file: file.h head: 1.5 ... for all revisions and description: ---------------------------- its comments for each Revision 1.5 Date: ... file cvs comment ... ---------------------------- ... … • cvs diff – shows RCS file: /repository/file.h,v … differences between 9c9,10 < old line different versions of a --- > new line > another new line file [Chen et al. 01] http://cvssearch.sourceforge.net/ T. Xie and J. Pei: Data Mining for Software Engineering 35
  • 36. Code Version Histories • CVS provides file versioning – Group individual per-file changes into individual transactions (atomic change sets): checked in by the same author with the same check-in comment close in time • CVS manages only files and line numbers – Associate syntactic entities with line ranges • Filter out long transactions not corresponding to meaningful atomic changes – E.g., feature requests, bug fixes, branch merging [Ying et al. 04] [Zimmermann et al. 04] http://www.st.cs.uni-sb.de/softevo/erose/ T. Xie and J. Pei: Data Mining for Software Engineering 36
  • 37. Software Engineering Data • Static code bases • Software change history • Profiled program states • Profiled structural entities • Bug reports T. Xie and J. Pei: Data Mining for Software Engineering 37
  • 38. Method-Entry/Exit States • State of an object – Values of transitively reachable fields • Method-entry state – Receiver-object state, method argument values • Method-exit state – Receiver-object state, updated method argument values, method return value [Ernst et al. 02] http://pag.csail.mit.edu/daikon/ [Dallmeier et al. 06] http://www.st.cs.uni-sb.de/models/ [Henkel&Diwan 03] [Xie&Notkin 04/05] T. Xie and J. Pei: Data Mining for Software Engineering 38
  • 39. Other Profiled Program States • Values of variables at certain code locations [Hangal&Lam 02] – Object/static field read/write – Method-call arguments – Method returns • Sampled predicates on values of variables [Liblit et al. 03/05] [Hangal&Lam 02] http://diduce.sourceforge.net/ [Liblit et al. 03/05] http://www.cs.wisc.edu/cbi/ T. Xie and J. Pei: Data Mining for Software Engineering 39
  • 40. Software Engineering Data • Static code bases • Software change history • Profiled program states • Profiled structural entities • Bug reports T. Xie and J. Pei: Data Mining for Software Engineering 40
  • 41. Executed Structural Entities • Executed branches/paths, def-use pairs • Executed function/method calls – Group methods invoked on the same object • Profiling options – Execution hit vs. count – Execution order (sequences) [Dallmeier et al. 05] http://www.st.cs.uni-sb.de/ample/ More related tools: http://www.csc.ncsu.edu/faculty/xie/research.htm#related T. Xie and J. Pei: Data Mining for Software Engineering 41
  • 42. Software Engineering Data • Static code bases • Software change history • Profiled program states • Profiled structural entities • Bug reports T. Xie and J. Pei: Data Mining for Software Engineering 42
  • 43. Processing Bug Reports User Triager Developer Bug Report Bug Duplicate Works Invalid Won’t Repository For Me Fix T. Xie and J. Pei: Data Mining for Software Engineering Adapted from Anvik et al.’s slides 43
  • 44. Sample Bugzilla Bug Report • Bug report image • Overlay the triage questions Assigned To: ? Assignment? Duplicate? Reproducible? Bugzilla: open source bug tracking tool http://www.bugzilla.org/ [Anvik et al. 06] http://www.cs.ubc.ca/labs/spl/projects/bugTriage.html T. Xie and J. Pei: Data Mining for Software Engineering Adapted from Anvik et al.’s slides 44
  • 45. Eclipse Bug Data • Defect counts are listed as count at the plug-in, package and compilationunit levels. • The value field contains the actual number of pre- ("pre") and post-release defects ("post"). • The average ("avg") and maximum ("max") values refer to the defects found in the compilation units ("compilationunits"). [Schröter et al. 06] http://www.st.cs.uni-sb.de/softevo/bug-data/eclipse/ T. Xie and J. Pei: Data Mining for Software Engineering 45
• 46. Data Mining Techniques in SE
• Association rules and frequent patterns
• Classification
• Clustering
• Misc.
• 47. Frequent Itemsets
• Itemset: a set of items
  – E.g., acm = {a, c, m}
• Support of itemsets
  – Sup(acm) = 3
• Given min_sup = 3, acm is a frequent pattern
• Frequent pattern mining: find all frequent patterns in a database

Transaction database TDB
  TID   Items bought
  100   f, a, c, d, g, i, m, p
  200   a, b, c, f, l, m, o
  300   b, f, h, j, o
  400   b, c, k, s, p
  500   a, f, c, e, l, p, m, n
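The support computation on this slide can be sketched in a few lines of Python; the transaction database below is the TDB shown above.

```python
# A minimal sketch of itemset support counting over the slide's
# transaction database TDB (transaction IDs 100..500).
tdb = {
    100: {"f", "a", "c", "d", "g", "i", "m", "p"},
    200: {"a", "b", "c", "f", "l", "m", "o"},
    300: {"b", "f", "h", "j", "o"},
    400: {"b", "c", "k", "s", "p"},
    500: {"a", "f", "c", "e", "l", "p", "m", "n"},
}

def support(itemset, db=tdb):
    """Number of transactions containing every item of the itemset."""
    s = set(itemset)
    return sum(1 for items in db.values() if s <= items)

# Sup(acm) = 3, so with min_sup = 3 the itemset {a, c, m} is frequent.
```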
• 48. Association Rules
• (Time ∈ {Fri, Sat}) ∧ buy(X, diaper) → buy(X, beer)
  – Dads taking care of babies on weekends drink beer
• Itemsets should be frequent
  – So the rule can be applied extensively
• Rules should be confident
  – With strong prediction capability
• 49. A Road Map
• Boolean vs. quantitative associations
  – buys(x, "SQLServer") ∧ buys(x, "DMBook") → buys(x, "DM Software") [0.2%, 60%]
  – age(x, "30..39") ∧ income(x, "42..48K") → buys(x, "PC") [1%, 75%]
• Single-dimensional vs. multi-dimensional associations
• Single-level vs. multiple-level analysis
  – What brands of beers are associated with what brands of diapers?
• 50. Frequent Pattern Mining Methods
• Apriori and its variations/improvements
• Mining frequent patterns without candidate generation
• Mining max-patterns and closed itemsets
• Mining multi-dimensional, multi-level frequent patterns with flexible support constraints
• Interestingness: correlation and causality
• 51. A Simple Case
• Finding highly correlated method-call pairs
• Confidence of pairs helps
  – Conf(<a,b>) = support(<a,b>) / support(<a,a>)
• Check the revisions (fixes to bugs); find the pairs of method calls whose confidences are improved dramatically by frequently added fixes
  – Those are the matching method-call pairs that may often be violated by programmers
[Livshits&Zimmermann 05]
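The pair-confidence measure can be sketched as follows; the revision data is made up for illustration, with each revision modeled as the set of method calls it adds, and support(<a,a>) read as the support of call a alone.

```python
# A hedged sketch of pair confidence over revisions (example data
# is illustrative, not from the study).
revisions = [
    {"addListener", "removeListener"},
    {"addListener", "removeListener", "println"},
    {"addListener"},
    {"iterator", "hasNext", "next"},
]

def sup(calls):
    """Number of revisions whose added calls include all given calls."""
    return sum(1 for rev in revisions if set(calls) <= rev)

def conf(a, b):
    """Conf(<a,b>) = support(<a,b>) / support(<a,a>)."""
    return sup({a, b}) / sup({a})

# addListener/removeListener co-occur in 2 of the 3 revisions
# that add addListener, so conf = 2/3.
```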
• 52. Conflicting Patterns
• 999 out of 1000 times spin_unlock follows spin_lock
  – The single time that spin_unlock does not follow is likely an error
• We can detect an error without knowing the correctness rule
[Li&Zhou 05, Livshits&Zimmermann 05, Yang et al. 06]
• 53. Frequent Library Reuse Patterns
• Items: classes, member functions, reuse relationships (e.g., inheritance, overriding, instantiation)
• Transactions: for every application class A, the set of all items that are involved in a reuse relationship with A
• Pruning
  – Uninteresting rules, e.g., a rule that holds for every class
  – Misleading rules, e.g., x∧y → z (conf: 60%) is pruned if y → z (conf: 80%)
  – Statistically insignificant rules: prune rules with a high p-value
• Constrained rules
  – Rules involving a particular class
  – Rules that are violated in a particular application
[Michail 99/00]
• 54. MAPO: Mining Frequent API Patterns
[Xie&Pei 06]
• 55. Sequential Pattern Mining in MAPO
• Use BIDE [Wang&Han 04] to mine closed sequential patterns from the preprocessed method-call sequences
• Postprocessing in MAPO
  – Remove frequent sequences that do not contain the entities interesting to the user
  – Compress consecutive calls of the same method into one
  – Remove duplicate frequent sequences after the compression
  – Remove frequent sequences that are subsequences of some other frequent sequences
[Xie&Pei 06]
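Two of the postprocessing heuristics above can be sketched like this; the function names and the exact filtering order are assumptions, not MAPO's actual implementation.

```python
# A sketch of two MAPO-style postprocessing heuristics: compress
# consecutive calls of the same method, then keep only maximal
# frequent sequences (drop subsequences and duplicates).
def compress(seq):
    """Collapse runs of the same method call into a single call."""
    out = []
    for call in seq:
        if not out or out[-1] != call:
            out.append(call)
    return out

def is_subseq(short, long_):
    """True if `short` is a (non-contiguous) subsequence of `long_`."""
    it = iter(long_)
    return all(any(c == x for x in it) for c in short)

def keep_maximal(seqs):
    uniq = []
    for s in (compress(s) for s in seqs):
        if s not in uniq:  # remove duplicates after compression
            uniq.append(s)
    return [s for s in uniq
            if not any(s != t and is_subseq(s, t) for t in uniq)]
```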
• 56. Detecting Copy-Paste Code
• Apply closed sequential pattern mining techniques
• Customizing the techniques
  – A copy-paste segment typically does not have big gaps – use a maximum gap threshold to control
  – Output the instances of patterns (i.e., the copy-pasted code segments) instead of the patterns
  – Use small copy-pasted segments to form larger ones
  – Prune false positives: tiny segments, unmappable segments, overlapping segments, and segments with large gaps
[Li et al. 04]
• 57. Find Bugs in Copy-Pasted Segments
• For two copy-pasted segments, are the modifications consistent?
  – Identifier a in segment S1 is changed to b in segment S2 three times, but remains unchanged once – likely a bug
  – The heuristic may not be right all the time
• The lower the unchanged rate of an identifier, the more likely there is a bug
[Li et al. 04]
• 58. Approximate Patterns for Inferences
• Use an alternating template to find interesting properties
  – Example: template (PS)*; an instance: loc.acq loc.rel
• Handling imperfect traces
  – Instead of requiring perfect matches, check the ratio of matching
  – Explore contexts of matching
[Yang et al. 06]
• 59. Context Handling
Figure from "Perracotta: mining temporal API rules from imperfect traces", in [Yang et al. ICSE'06]
• 60. Cross-Checking of Execution Traces
• Mine association rules or sequential patterns S → F, where S is a statement and F is the status of program failure
• The higher the confidence, the more likely S is faulty or related to a fault
• Using only one statement on the left side of the rule can be misleading, since a fault may be caused by a combination of statements
  – Frequent patterns can be used to improve this
[Denmat et al. 05]
• 61. Emerging Patterns of Traces
• A method executed only in failing runs is likely to point to the defect
  – Comparing the coverage of passing and failing program runs helps
• Mining patterns frequent in failing program runs but infrequent in passing program runs
  – Sequential patterns may be used
[Dallmeier et al. 05, Denmat et al. 05, Yang et al. 06]
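The simplest form of this coverage comparison can be sketched as follows; the run data and method names are illustrative only.

```python
# A minimal sketch of coverage comparison: methods executed only
# in failing runs are suspicious (the run data is made up).
passing_runs = [{"open", "read", "close"}, {"open", "read"}]
failing_runs = [{"open", "read", "parseHeader"}, {"open", "parseHeader"}]

def suspicious_methods(passing, failing):
    """Methods covered by some failing run but by no passing run."""
    executed_passing = set().union(*passing)
    executed_failing = set().union(*failing)
    return executed_failing - executed_passing
```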
• 62. Learning Object Behavior
• Extracting models
  – A static analysis identifies all side-effect-free methods in the program
  – Some side-effect-free methods are selected as inspectors
  – The program is executed and inspectors are called to extract information about an object's state – a vector of inspector values
• Merge models of all objects in a program
[Dallmeier et al. 06]
• 63. Data Mining Techniques in SE
• Association rules and frequent patterns
• Classification
• Clustering
• Misc.
• 64. Classification: A 2-Step Process
• Model construction: describe a set of predetermined classes
  – Training dataset: tuples for model construction
    • Each tuple/sample belongs to a predefined class
  – Classification rules, decision trees, or math formulae
• Model application: classify unseen objects
  – Estimate accuracy of the model using an independent test set
  – If the accuracy is acceptable, apply the model to classify tuples with unknown class labels
• 65. Model Construction
Training Data → Classification Algorithms → Classifier (Model)
  Name   Rank        Years   Tenured
  Mike   Ass. Prof   3       No
  Mary   Ass. Prof   7       Yes
  Bill   Prof        2       Yes
  Jim    Asso. Prof  7       Yes
  Dave   Ass. Prof   6       No
  Anne   Asso. Prof  3       No
Learned model: IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
• 66. Model Application
Testing Data → Classifier → label for unseen data, e.g., (Jeff, Professor, 4) → Tenured?
  Name     Rank        Years   Tenured
  Tom      Ass. Prof   2       No
  Merlisa  Asso. Prof  7       No
  George   Prof        5       Yes
  Joseph   Ass. Prof   7       Yes
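The model-application step on these two slides can be sketched directly from the learned rule; the function name is just for illustration.

```python
# A sketch of applying the learned classifier from slide 65:
# IF rank = 'professor' OR years > 6 THEN tenured = 'yes'.
def predict_tenured(rank, years):
    return "yes" if rank == "Prof" or years > 6 else "no"

# Unseen tuple from slide 66: (Jeff, Professor, 4) -> tenured = 'yes'.
```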
• 67. Supervised vs. Unsupervised Learning
• Supervised learning (classification)
  – Supervision: objects in the training data set have labels
  – New data is classified based on the training set
• Unsupervised learning (clustering)
  – The class labels of training data are unknown
  – Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data
• 68. GUI-Application Stabilizer
• Given a program state S and an event e, predict whether e likely results in a bug
  – Positive samples: past bugs
  – Negative samples: "not bug" reports
• A k-NN based approach
  – Consider the k closest cases reported before
  – Compare Σ 1/d for bug cases and not-bug cases, where d is the distance between the current state and a reported state
  – If the current state is more similar to bugs, predict a bug
[Michail&Xie 05]
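The Σ 1/d vote can be sketched as a small weighted k-NN function; the distances and labels below are made up for illustration.

```python
# A hedged sketch of the k-NN vote: compare sum(1/d) over the
# k nearest bug and not-bug cases.
def knn_predict(neighbors, k=3):
    """neighbors: list of (distance, label) pairs, label in {'bug', 'not-bug'}."""
    nearest = sorted(neighbors)[:k]
    score = {"bug": 0.0, "not-bug": 0.0}
    for d, label in nearest:
        score[label] += 1.0 / d  # closer cases get a heavier vote
    return "bug" if score["bug"] > score["not-bug"] else "not-bug"
```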
• 69. Data Mining Techniques in SE
• Association rules and frequent patterns
• Classification
• Clustering
• Misc.
• 70. What Is Clustering?
• Group data into clusters
  – Similar to one another within the same cluster
  – Dissimilar to the objects in other clusters
  – Unsupervised learning: no predefined classes
[Figure: two clusters of points, plus outliers]
• 71. Categories of Clustering Approaches (1)
• Partitioning algorithms
  – Partition the objects into k clusters
  – Iteratively reallocate objects to improve the clustering
• Hierarchical algorithms
  – Agglomerative: each object is a cluster; merge clusters to form larger ones
  – Divisive: all objects are in one cluster; split it up into smaller clusters
• 72. Categories of Clustering Approaches (2)
• Density-based methods
  – Based on connectivity and density functions
  – Filter out noise, find clusters of arbitrary shape
• Grid-based methods
  – Quantize the object space into a grid structure
• Model-based methods
  – Use a model to find the best fit of data
• 73. K-Means: Example
[Figure: with K = 2, arbitrarily choose K objects as the initial cluster centers; assign each object to the most similar center; update the cluster means; reassign objects and update the means until the clusters stabilize]
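The loop illustrated on this slide can be sketched in one dimension; the data points are made up, and a fixed iteration count stands in for a convergence test.

```python
# A minimal 1-D k-means sketch of the slide's loop: assign each
# object to the most similar center, then update the cluster means.
def kmeans(points, centers, iters=10):
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for p in points:  # assignment step
            i = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[i].append(p)
        # update step: move each center to its cluster's mean
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return sorted(centers)
```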
• 74. Clustering and Categorization
• Software categorization
  – Partitioning software systems into categories
• Categories predefined – a classification problem
• Categories discovered automatically – a clustering problem
• 75. Software Categorization – MUDABlue
• Understanding source code
  – Use latent semantic analysis (LSA) to find similarity between software systems
  – Use identifiers (e.g., variable names, function names) as features
    • "gtk_window" represents some window
    • The source code near "gtk_window" contains some GUI operation on the window
• Extracting categories using frequent identifiers
  – "gtk_window", "gtk_main", and "gpointer" → GTK-related software system
  – Use LSA to find relationships between identifiers
[Kawaguchi et al. 04]
• 76. Overview of MUDABlue
• Extract identifiers
• Create an identifier-by-software matrix
• Remove useless identifiers
• Apply LSA and retrieve categories
• Make software clusters from identifier clusters
• Title the software clusters
[Kawaguchi et al. 04]
• 77. Data Mining Techniques in SE
• Association rules and frequent patterns
• Classification
• Clustering
• Misc.
• 78. Searching Source Code/Comments
• CVSSearch: searching using CVS comments
• Comments are often more stable than code segments
  – Describe a segment of code
  – May hold for many future versions
• Compare differences of successive versions
  – For two versions, associate a comment to the corresponding changes
  – Propagate changes over versions
[Chen et al. 01]
• 79. Jungloid Mining
• Given a query describing the input and output types, synthesize code fragments automatically
• Prospector: using API method signatures and jungloids mined from a corpus of sample client programs
• Elementary jungloids
  – Field access
  – Static method or constructor invocation
  – Instance method invocation
  – Widening reference conversion
  – Downcast (narrowing reference conversion)
[Mandelin et al. 05]
• 80. Finding Jungloids
Example query: parsing a Java source code file in an IFile object using the Eclipse IDE framework
• Use signatures of elementary jungloids and APIs to form a signature graph
• Represent a solution as a path in the graph matching the constraints
• Rank the paths by their lengths – short paths are preferred
• Learn downcasts from sample programs
[Mandelin et al. 05]
• 81. Sampling Programs
• During the execution of a program, each execution of a statement is sampled with some probability
  – Sampling large programs becomes feasible
  – Many traces can be collected
• Bug isolation by analyzing samples
  – Correlation between some specific statements or function calls and program errors/crashes
[Liblit et al. 03/05]
• 82. Outline
• Introduction
• What software engineering tasks can be helped by data mining?
• What kinds of software engineering data can be mined?
• How are data mining techniques used in software engineering?
• Case studies
• Conclusions
• 83. Case Studies
• MAPO: mining API usages from open source repositories [Xie&Pei 06]
  – Code bases → sequence analysis → programming
• DynaMine: finding common error patterns by mining software revision histories [Livshits&Zimmermann 05]
  – Change history → association rules → defect detection
• BugTriage: who should fix this bug? [Anvik et al. 06]
  – Bug reports → classification → debugging
• 84. Motivation
• APIs in class libraries or frameworks are widely reused in software development.
• An example programming task: "instrument the bytecode of a Java class by adding an extra method to the class"
  – org.apache.bcel.generic.ClassGen
    public void addMethod(Method m)
• 85. First Try: ClassGen Java API Doc
  addMethod
  public void addMethod(Method m)
  Add a method to this class.
  Parameters: m - method to add
• 86. Second Try: Code Search Engine
• 87. MAPO Approach
• Analyze code segments returned from code search engines and disclose the inherent usage patterns
  – Input: an API characterized by a method, class, or package; code bases: open source repositories or proprietary source repositories
  – Output: a short list of frequent API usage patterns related to the API
• 88. Sample Tool Output
  InstructionList.<init>()
  InstructionFactory.createLoad(Type, int)
  InstructionList.append(Instruction)
  InstructionFactory.createReturn(Type)
  InstructionList.append(Instruction)
  MethodGen.setMaxStack()
  MethodGen.setMaxLocals()
  MethodGen.getMethod()
  ClassGen.addMethod(Method)
  InstructionList.dispose()
• Mined from 36 Java source files, 1087 method sequences
• 89. Tool Architecture
• 90. Results
A tool that integrates various components:
• Relevant code extractor
  – downloads returns from a code search engine (koders.com)
• Code analyzer
  – implemented a lightweight tool for Java programs
• Sequence preprocessor
  – employed various heuristics
• Frequent sequence miner
  – reused BIDE [Wang&Han ICDE 2004]
• Frequent sequence postprocessor
  – employed various heuristics
• 91. Case Studies
• MAPO: mining API usages from open source repositories [Xie&Pei 06]
  – Code bases → sequence analysis → programming
• DynaMine: finding common error patterns by mining software revision histories [Livshits&Zimmermann 05]
  – Change history → association rules → defect detection
• BugTriage: who should fix this bug? [Anvik et al. 06]
  – Bug reports → classification → debugging
• 92. Co-Change Pattern
• Things that are frequently changed together often form a pattern (a.k.a. co-change)
• E.g., co-added method calls:
  public void createPartControl(Composite parent) {
    ... // add listener for editor page activation
    getSite().getPage().addPartListener(partListener);   // co-added
  }
  public void dispose() {
    ...
    getSite().getPage().removePartListener(partListener); // co-added
  }
Adapted from Livshits et al.'s slides
• 93. DynaMine
[Figure: pipeline – revision history mining: mine CVS histories, rank and filter patterns; dynamic analysis: instrument relevant method calls, run the application, post-process into usage patterns, error patterns, and unlikely patterns; reporting: report patterns and bugs]
Adapted from Livshits et al.'s slides
• 94. Mining Patterns
[Figure: the DynaMine pipeline with the revision-history-mining stage highlighted: mine CVS histories, then rank and filter patterns]
Adapted from Livshits et al.'s slides
• 95. Mining Method Calls
[Figure: revisions of Foo.java, Bar.java, Baz.java, and Qux.java with the method calls added in each revision, e.g., addListener()/removeListener() pairs, System.out.println(), and list.iterator()/iter.hasNext()/iter.next()]
Adapted from Livshits et al.'s slides
• 96. Finding Pairs
[Figure: the same revisions with co-added pairs counted – one addListener()/removeListener() pair each in Foo.java and Bar.java, two pairs in Baz.java, and no pairs in Qux.java, where the two calls were added in different revisions (1.41 and 1.42)]
Adapted from Livshits et al.'s slides
• 97. Mining Method Calls
[Figure: the same revisions, annotated with the observation below]
• Co-added calls often represent a usage pattern
Adapted from Livshits et al.'s slides
• 98. Finding Patterns
• Find "frequent itemsets" (with Apriori), e.g., from revisions that repeatedly co-add o.enterAlignment(), o.exitAlignment(), o.redoAlignment() together with iter.hasNext(), iter.next()
• Mined pattern: {enterAlignment(), exitAlignment(), redoAlignment()}
Adapted from Livshits et al.'s slides
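The Apriori step on this slide can be sketched as a small level-wise search; the transactions mirror the slide's example, and the candidate-generation scheme is a simplified version of the real algorithm.

```python
# A compact Apriori-style sketch: grow candidate itemsets level by
# level and keep those meeting min_sup.
transactions = [
    {"enterAlignment", "exitAlignment", "redoAlignment"},
    {"enterAlignment", "exitAlignment", "redoAlignment"},
    {"enterAlignment", "exitAlignment", "redoAlignment", "hasNext", "next"},
    {"hasNext", "next"},
]

def apriori(db, min_sup=2):
    items = set().union(*db)
    frequent, level = [], [frozenset([i]) for i in items]
    while level:
        # keep candidates whose support meets the threshold
        level = [c for c in level
                 if sum(1 for t in db if c <= t) >= min_sup]
        frequent += level
        # next-level candidates: unions of frequent sets, one item larger
        level = list({a | b for a in level for b in level
                      if len(a | b) == len(a) + 1})
    return frequent
```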
• 99. Ranking Patterns
• Support count = #occurrences of a pattern
• Confidence = strength of a pattern, P(A|B)
Adapted from Livshits et al.'s slides
• 100. Ranking Patterns
[Figure: in Qux.java, removeListener() is added in revision 1.42 to match the addListener() added in 1.41 – this is a fix!]
• Rank removeListener() patterns higher
Adapted from Livshits et al.'s slides
• 101. Dynamic Validation
[Figure: the DynaMine pipeline with the dynamic-analysis stage highlighted: instrument relevant method calls, run the application, and post-process the results]
Adapted from Livshits et al.'s slides
• 102. Matches and Mismatches
• Find and count matches and mismatches
  – o.register(d) followed by o.deregister(d): matches
  – o.register(d) without a corresponding o.deregister(d): mismatch
• Static vs. dynamic counts
Adapted from Livshits et al.'s slides
• 103. Pattern Classification
• Post-process: v validations, e violations
  – Usage patterns: e < v/10
  – Error patterns: v/10 ≤ e ≤ 2v
  – Unlikely patterns: otherwise
Adapted from Livshits et al.'s slides
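The classification rule on this slide translates directly into a small function; the function name is just for illustration.

```python
# A sketch of the slide's post-processing rule over
# v validations and e violations of a pattern.
def classify_pattern(v, e):
    if e < v / 10:
        return "usage pattern"
    if v / 10 <= e <= 2 * v:
        return "error pattern"
    return "unlikely pattern"
```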
• 104. Experiments
                 JEDIT     ECLIPSE
  since          2000      2001
  developers     92        112
  lines of code  700,000   2,900,000
  revisions      40,000    400,000
• In total: 56 patterns
Adapted from Livshits et al.'s slides
• 105. Case Studies
• MAPO: mining API usages from open source repositories [Xie&Pei 06]
  – Code bases → sequence analysis → programming
• DynaMine: finding common error patterns by mining software revision histories [Livshits&Zimmermann 05]
  – Change history → association rules → defect detection
• BugTriage: who should fix this bug? [Anvik et al. 06]
  – Bug reports → classification → debugging
• 106. Assigning a Bug
• Many considerations
  – Who has the expertise?
  – Who is available?
  – How quickly does this have to be fixed?
• Not always an obvious or correct assignment
  – Multiple developers may be suitable
  – Difficult to know what the bug is about
  – Bug fixes get delayed
• Triage and fix rate indicate the 'liveness' of OSS projects
Adapted from Anvik et al.'s slides
• 107. Assigning a Bug Today
[Figure: a bug report is manually assigned to a single developer, e.g., bill@firefox.org]
Adapted from Anvik et al.'s slides
• 108. Recommending Assignment
[Figure: the recommender suggests a ranked list of developers, e.g., bill@firefox.com, ted@gmail.com, cindy-loo@whoville.org]
Adapted from Anvik et al.'s slides
• 109. Overview of Approach
• Approach tuned using Eclipse and Firefox
[Figure: resolved bug reports feed a machine learning algorithm, which builds an assignment recommender that suggests developers such as bill@firefox.com, ted@gmail.com, cindy-loo@whoville.org]
Adapted from Anvik et al.'s slides
• 110. Steps of the Approach
1. Characterize the reports
2. Label the reports
3. Select the reports
4. Use a machine learning algorithm
Adapted from Anvik et al.'s slides
• 111. Step 1: Characterizing a Report
• Based on two fields
  – textual summary
  – description
• Use a text categorization approach
  – represent with a word vector
  – remove stop words
  – intra- and inter-document frequency
Adapted from Anvik et al.'s slides
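The representation described above can be sketched as a word vector with stop-word removal and tf-idf weighting (one common reading of "intra- and inter-document frequency"); the stop-word list and sample reports are illustrative only.

```python
# A hedged sketch of the report representation: word vectors over
# the summary/description text, stop words removed, tf-idf weighted.
import math

STOP_WORDS = {"the", "a", "is", "on", "when", "in"}

def tokenize(text):
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def tf_idf_vectors(reports):
    docs = [tokenize(r) for r in reports]
    n = len(docs)
    df = {}  # inter-document frequency of each word
    for doc in docs:
        for w in set(doc):
            df[w] = df.get(w, 0) + 1
    # intra-document frequency (term count) times idf
    return [{w: doc.count(w) * math.log(n / df[w]) for w in set(doc)}
            for doc in docs]
```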
• 112. Step 2: Labeling a Report
• Must determine who really fixed it
  – the "Assigned-to" field is not accurate
• Project-specific heuristics
Adapted from Anvik et al.'s slides
• 113. Step 2: Labeling a Report
• Must determine who really fixed it
  – the "Assigned-to" field is not accurate
• Project-specific heuristics
  – Simple:
    • If a report is FIXED, label it with who marked it as fixed. (Eclipse)
    • If a report is a DUPLICATE, use the label of the report it duplicates. (Eclipse and Firefox)
Adapted from Anvik et al.'s slides
• 114. Step 2: Labeling a Report
• Must determine who really fixed it
  – the "Assigned-to" field is not accurate
• Project-specific heuristics
  – Complex: if the report is FIXED and has attachments approved by a reviewer (Firefox), then
    • If there is one submitter of patches, use their name.
    • If there is more than one submitter, choose the name of whoever submitted the most patches.
    • If the submitters cannot be determined, label with the person assigned to the report.
Adapted from Anvik et al.'s slides
• 115. Step 2: Labeling a Report
• Must determine who really fixed it
  – the "Assigned-to" field is not accurate
• Project-specific heuristics
  – Unclassifiable: reports marked as WONTFIX are often resolved after discussion and developers reaching a consensus (Firefox)
    • It is unknown who would have fixed the bug
    • The report is labeled unclassifiable
Adapted from Anvik et al.'s slides
• 116. Step 2: Labeling a Report
• Number of heuristics used per project:
                    Eclipse   Firefox
  Simple            5         4
  Complex           2         1
  Unclassifiable    1         4
Adapted from Anvik et al.'s slides
• 117. Step 3: Selecting the Reports
• Exclude those with no label
• Include those of active developers
  – developer profiles
[Charts: monthly report-fixing activity of two developers, Sep-04 to Apr-05]
Adapted from Anvik et al.'s slides
• 118. Step 3: Selecting the Reports
[Chart: a developer's monthly activity, Sep-04 to Apr-05, against a threshold of 3 reports / month]
Adapted from Anvik et al.'s slides
• 119. Step 4: Use an ML Algorithm
• Supervised algorithms
  – Naïve Bayes
  – C4.5
  – Support Vector Machines
• Unsupervised algorithms
  – Expectation Maximization
• Incremental algorithms
  – Naïve Bayes
Adapted from Anvik et al.'s slides
• 120. Evaluating Recommenders
• Precision = (# of relevant recommendations) / (# of recommendations made)
• Recall = (# of relevant recommendations) / (# of possibly relevant developers)
• How do we find the possibly relevant developers?
Adapted from Anvik et al.'s slides
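The two measures translate directly into code; the developer names in the usage comment are illustrative only.

```python
# A sketch of the slide's two measures, over sets of developers.
def precision(recommended, relevant):
    return len(set(recommended) & set(relevant)) / len(recommended)

def recall(recommended, relevant):
    return len(set(recommended) & set(relevant)) / len(relevant)

# E.g., recommending ["paulw", "tryder", "stibbs"] when the possibly
# relevant developers are {"paulw", "vendger"} gives
# precision = 1/3 and recall = 1/2.
```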
• 121. Determining Possibly Relevant Developers
[Figure: from a fixed bug report, find the modules touched by the fix in the CVS repository, then map the CVS usernames for those modules (e.g., paulw, tryder, stibbs) to developer email addresses]
Adapted from Anvik et al.'s slides
• 122. Still Not Straightforward (e.g., Firefox)
[Figure: for Firefox, the CVS usernames of the touched modules must be combined with the patch submitters on the bug report and on the module list to obtain the set of possibly relevant developers]
Adapted from Anvik et al.'s slides
• 123. Precision vs. Recall
• A small set of "right" developers (precision) is more important than the set of all possible developers (recall)
[Charts: precision and recall of Multinomial Naïve Bayes, C4.5, and SVM on Eclipse, Firefox, and gcc]
Adapted from Anvik et al.'s slides
• 124. Overview
  programming | defect detection | testing | debugging | maintenance
    – software engineering tasks helped by data mining
  classification | association/patterns | clustering | etc.
    – data mining techniques
  code bases | change history | program states | structural entities | bug reports
    – software engineering data
• 125. Conclusions
• Software development generates a large amount of data of different types
• Data mining and data analysis can help software engineering substantially
• Successful cases show:
  – What software engineering data can be mined
  – What software engineering tasks can be helped
  – How to conduct the mining
• 126. Challenges
• Complexity in software development
  – Specific data mining techniques are needed
• Software development and maintenance are dynamic and user-centered
  – Interactive data mining
  – Visual data mining and analysis
  – Online, incremental mining
• 127. Questions?
Mining Software Engineering Data Bibliography
http://ase.csc.ncsu.edu/dmse/
• What software engineering tasks can be helped by data mining?
• What kinds of software engineering data can be mined?
• How are data mining techniques used in software engineering?
• Resources