Mining Software Engineering Data

More information available at
http://ase.csc.ncsu.edu/dmse/
Ahmed E. Hassan
Queen’s University
www.cs.queensu.ca/~ahmed
ahmed@cs.queensu.ca
Tao Xie
North Carolina State University
www.csc.ncsu.edu/faculty/xie
xie@csc.ncsu.edu

• NSERC/RIM Software Engineering Research Chair, Queen’s University, Canada
• Leads the SAIL research group at Queen’s
• Co-chair for Workshop on Mining Software Repositories (MSR) from 2004-2006
• Chair of the steering committee for MSR

A. E. Hassan and T. Xie: Mining Software Engineering Data
• Associate Professor at North Carolina State University, USA
• Leads the ASE research group at NCSU
• PC Co-Chair of ICSM 2009, MSR 2011/2012
• Co-organizer of 2007 Dagstuhl Seminar on Mining Programs and Processes

http://msrconf.org
An international effort to
make software repositories actionable
http://promisedata.org

• Transforms static record-keeping repositories to active repositories
• Makes repository data actionable by uncovering hidden patterns and trends
[Figure: example repositories: mailing lists, Bugzilla, crashes, field logs, CVS/SVN]

[Figure: types of repositories]
• Historical repositories: source control (CVS/SVN), Bugzilla, mailing lists
• Runtime repositories: field logs, crash repositories
• Code repositories: Sourceforge, GoogleCode

[Figure: a new bug report is matched against Bugzilla (fixed bugs), the mailing list (discussions), CVS/SVN (buggy changes and fixing changes), and field crashes]
• Estimate fix effort
• Mark duplicates
• Suggest experts and fix!

[Figure: a new change is matched against Bugzilla (fixed bugs), the mailing list (discussions), CVS/SVN (buggy changes and fixing changes), and field crashes]
• Suggest APIs
• Warn about risky code or bugs
• Suggest locations to co-change

Example Repositories:
Source Control and Bug Repositories

Source Control Repositories
• A source control system tracks changes to ChangeUnits
• Example of ChangeUnits:
– File (most common)
– Function
– Dependency (e.g., Call)
• Each ChangeUnit:
– Records the developer, change time, change message, co-changing Units
[Figure: data model. A ChangeList (Developer, Time, Message, ChangeList Type: FI, FR, GM) is related many-to-many (* .. *) to Changes; each Change has a ChangeUnit and a Change Type (Modify, Add, Remove)]
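The data model on this slide could be sketched in Python roughly as follows. The class and field names mirror the slide labels; everything else (the use of dataclasses, the `co_changed_units` helper, storing the ChangeList type as a plain string) is an assumption made for illustration, not the authors' schema.

```python
from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum


class ChangeType(Enum):
    MODIFY = "Modify"
    ADD = "Add"
    REMOVE = "Remove"


@dataclass(frozen=True)
class Change:
    change_unit: str          # e.g. a file path, function name, or call dependency
    change_type: ChangeType


@dataclass
class ChangeList:
    developer: str
    time: datetime
    message: str
    list_type: str            # "FI", "FR" or "GM", as labelled on the slide
    changes: list[Change] = field(default_factory=list)

    def co_changed_units(self, unit: str) -> set[str]:
        """Hypothetical helper: units changed together with `unit` in this list."""
        units = {c.change_unit for c in self.changes}
        return units - {unit} if unit in units else set()
```

Grouping changes under one ChangeList is what makes co-change information directly available to later mining steps.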

“How does a change in one source code entity propagate to other entities?”
[Figure: change propagation process. New Req., Bug Fix → Determine Initial Entity To Change → Change Entity → Determine Other Entities To Change (For Each Entity: Change Entity) → Consult Guru for Advice → either Suggested Entity (back to Change Entity) or No More Changes]

• We want:
– High Precision to avoid wasting time
– High Recall to avoid bugs
$$\text{Recall} = \frac{|\text{predicted entities which changed}|}{|\text{entities changed}|}$$
$$\text{Precision} = \frac{|\text{predicted entities which changed}|}{|\text{predicted entities}|}$$
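As a small sketch of the two measures above, assuming predictions and actual changes are available as sets of entity names (the function name and the zero-denominator convention are illustrative choices):

```python
def precision_recall(predicted: set[str], changed: set[str]) -> tuple[float, float]:
    """Return (precision, recall) for one change-propagation prediction."""
    hits = predicted & changed  # predicted entities which actually changed
    precision = len(hits) / len(predicted) if predicted else 0.0
    recall = len(hits) / len(changed) if changed else 0.0
    return precision, recall
```

Low precision means a developer inspects suggested entities that did not need changing; low recall means entities that did need changing were never suggested.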

• Mine association rules from change history
• Use rules to help propagate changes:
– Recall as high as 44%
– Precision around 30%
• High precision and recall reached in less than one month
• Prediction accuracy improves prior to a release (i.e., during maintenance phase)
• Better predictor than static dependencies alone
[Zimmermann et al. 05]
[Hassan&Holt 04]
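A heavily simplified sketch of mining co-change rules from change history: it only considers pairwise rules "a changed, so b probably changes too", which is a simplification and not the cited approaches themselves. The function names and the support and confidence thresholds are assumptions.

```python
from collections import Counter
from itertools import permutations


def mine_cochange_rules(transactions, min_support=2, min_confidence=0.5):
    """Return {(a, b): confidence} for rules 'a changed => b changes too'.

    transactions: iterable of collections of entity names changed together.
    """
    unit_count = Counter()
    pair_count = Counter()
    for units in transactions:
        units = set(units)
        unit_count.update(units)
        pair_count.update(permutations(sorted(units), 2))  # ordered pairs
    rules = {}
    for (a, b), support in pair_count.items():
        confidence = support / unit_count[a]
        if support >= min_support and confidence >= min_confidence:
            rules[(a, b)] = confidence
    return rules


def suggest(rules, changed_unit):
    """Entities to consider co-changing, strongest rule first."""
    candidates = [(conf, b) for (a, b), conf in rules.items() if a == changed_unit]
    return [b for conf, b in sorted(candidates, reverse=True)]
```

Each transaction would typically be the set of entities committed together in one change list.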

• Traditional dependency graphs and program understanding models usually do not use historical information
• Static dependencies capture only a static view of a system – not enough detail!
• Development history can help understand the current structure (architecture) of a software system
[Hassan & Holt 04]

[Figure: two dependency views of a subsystem with components Hardware Trans., Kernel Fault Handler, Pager, FileSystem, Virtual Addr. Maint., and VM Policy, highlighting dependency divergence and convergence. Questions to ask of each dependency: Why? Who? When? Where?]

• Eight unexpected dependencies
• All except two dependencies existed since day one:
– Virtual Address Maintenance → Pager
– Pager → Hardware Translations
Which? vm_map_entry_create (in src/sys/vm/Attic/vm_map.c) depends on pager_map (in /src/sys/uvm/uvm_pager.c)
Who? cgd
When? 1993/04/09 15:54:59, Revision 1.2 of src/sys/vm/Attic/vm_map.c
Why? from sean eric fagan: it seems to keep the vm system from deadlocking the system when it runs out of swap + physical memory. prevents the system from giving the last page(s) to anything but the referenced "processes" (especially important is the pager process, which should never have to wait for a free page).
(Auto-generated from CVS repository)

• Conway’s Law: “The structure of a software system is a direct reflection of the structure of the development team”

[Figure: Conceptual Architecture, Ownership Architecture, Concrete Architecture]

import org.eclipse.jdt.internal.compiler.lookup.*;
import org.eclipse.jdt.internal.compiler.*;
import org.eclipse.jdt.internal.compiler.ast.*;
import org.eclipse.jdt.internal.compiler.util.*;
...
import org.eclipse.pde.core.*;
import org.eclipse.jface.wizard.*;
import org.eclipse.ui.*;
14% of all files that import ui packages had to be fixed later on.
71% of files that import compiler packages had to be fixed later on.
[Schröter et al. 06]
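Percentages like the ones above could be computed along these lines, assuming each file has already been reduced to its list of imported packages plus a flag saying whether it was fixed later. The function name and input format are hypothetical, and this is only the counting step, not the cited study's prediction model.

```python
def fix_rate_by_import(files, package_prefix):
    """Fraction of files importing `package_prefix` that were fixed later.

    files: iterable of (imports, was_fixed_later) pairs, where imports is a
    list of imported package names. Returns None if no file imports the prefix.
    """
    importing = [
        fixed
        for imports, fixed in files
        if any(name.startswith(package_prefix) for name in imports)
    ]
    if not importing:
        return None
    return sum(importing) / len(importing)
```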

Percentage of bug-introducing changes for Eclipse
Don’t program on Fridays ;-)
[Zimmermann et al. 05]

• Given a change, can we warn a developer that there is a bug in it?
– Recall/Precision in 50-60% range
[Kim et al. 06]

Project Communication – Mailing lists

• Most open source projects communicate through mailing lists or IRC channels
• Rich source of information about the inner workings of large projects
• Discussions cover topics such as future plans, design decisions, project policies, code or patch reviews
• Social network analysis could be performed on discussion threads

• Study the content of messages before and after a release
• Use dimensions from a psychometric text analysis tool:
– After Apache 1.3 release there was a drop in optimism
– After Apache 2.0 release there was an increase in sociability
[Rigby & Hassan 07]

• Mailing list activity:
– strongly correlates with code change activity
– moderately correlates with document change activity
• Social network measures (in-degree, out-degree, betweenness) indicate that committers play a more significant role in the mailing list community than non-committers [Bird et al. 06]
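A minimal sketch of two of the social network measures named above, assuming the mailing list has already been reduced to (sender, recipient) reply pairs. The input format and function name are assumptions; betweenness is left out because it needs a shortest-path computation, for which a graph library would normally be used.

```python
from collections import defaultdict


def reply_degrees(replies):
    """Return {person: (in_degree, out_degree)} for a reply network.

    replies: iterable of (sender, recipient) pairs, one per reply message.
    Degrees count distinct people, not messages.
    """
    out_neighbors = defaultdict(set)
    in_neighbors = defaultdict(set)
    for sender, recipient in replies:
        if sender == recipient:
            continue  # ignore self-replies
        out_neighbors[sender].add(recipient)
        in_neighbors[recipient].add(sender)
    people = set(out_neighbors) | set(in_neighbors)
    return {p: (len(in_neighbors[p]), len(out_neighbors[p])) for p in people}
```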

• When will a developer be invited to join a
project?
– Expertise vs. interest
[Bird et al. 07]

Program Source Code

Source data | Mined info
Variable names and function names | Software categories [Kawaguchi et al. 04]
Statement seq in a basic block | Copy-paste code [Li et al. 04]
Set of functions, variables, and data types within a C function | Programming rules [Li&Zhou 05]
Sequence of methods within a Java method | API usages [Xie&Pei 06]
API method signatures | API Jungloids [Mandelin et al. 05]

• How should an API be used correctly?
– An API may serve multiple functionalities
– Different styles of API usage
• “I know what type of object I need, but I don’t know how to write the code to get the object” [Mandelin et al. 05]
– Can we synthesize jungloid code fragments automatically?
– Given a simple query describing the desired code in terms of input and output types, return a code segment
• “I know what method call I need, but I don’t know how to write code before and after this method call” [Xie&Pei 06]
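The input-type/output-type query described above can be illustrated with a breadth-first search over a graph whose nodes are types and whose edges are API calls. This is a sketch loosely in the spirit of that idea, not the cited tool's algorithm; the triple format and the function name are assumptions.

```python
from collections import deque


def find_jungloid(signatures, source_type, target_type):
    """Return a shortest chain of calls turning source_type into target_type.

    signatures: list of (input_type, call_text, output_type) triples.
    Returns a list of call texts, or None if no chain exists.
    """
    edges = {}
    for in_type, call, out_type in signatures:
        edges.setdefault(in_type, []).append((call, out_type))
    queue = deque([(source_type, [])])
    seen = {source_type}
    while queue:
        current, path = queue.popleft()
        if current == target_type:
            return path
        for call, out_type in edges.get(current, []):
            if out_type not in seen:
                seen.add(out_type)
                queue.append((out_type, path + [call]))
    return None
```

Breadth-first search returns a chain with the fewest calls, which is one plausible way to rank candidate code fragments.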

• Mine framework reuse patterns [Michail 00]
– Membership relationships
• A class contains membership functions
– Reuse relationships
• Class inheritance/ instantiation
• Function invocations/overriding
• Mine software plagiarism [Liu et al. 06]
– Program dependence graphs
[Michail 99/00] http://codeweb.sourceforge.net/ for C++

• Apply closed sequential pattern mining techniques
• Customizing the techniques
– A copy-paste segment typically does not have big gaps, so use a maximum gap threshold to control the gap size
– Output the instances of patterns (i.e., the copy-pasted code segments) instead of the patterns
– Use small copy-pasted segments to form larger ones
– Prune false positives: tiny segments, unmappable segments, overlapping segments, and segments with large gaps
[Li et al. 04]

• For two copy-pasted segments, are the modifications consistent?
– Identifier a in segment S1 is changed to b in segment S2 3 times, but remains unchanged once: likely a bug
– The heuristic may not be correct all the time
• The lower the unchanged rate of an identifier (as long as it is not zero, which means a consistent rename), the more likely there is a bug
[Li et al. 04]
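The unchanged-rate heuristic can be sketched as below, assuming the two segments have already been aligned so that each identifier occurrence in S1 is paired with the identifier at the same position in S2. The function names and the 0.4 threshold are assumptions for illustration.

```python
from collections import Counter


def unchanged_ratios(identifier_pairs):
    """Map each S1 identifier to the fraction of its occurrences left unchanged.

    identifier_pairs: (name_in_S1, name_in_S2) for each aligned occurrence.
    """
    total = Counter()
    unchanged = Counter()
    for original, pasted in identifier_pairs:
        total[original] += 1
        if original == pasted:
            unchanged[original] += 1
    return {name: unchanged[name] / total[name] for name in total}


def suspicious_identifiers(identifier_pairs, threshold=0.4):
    """Identifiers renamed in most, but not all, occurrences."""
    ratios = unchanged_ratios(identifier_pairs)
    return sorted(name for name, ratio in ratios.items() if 0 < ratio <= threshold)
```

For the slide's example, identifier a is renamed three times and left alone once, giving an unchanged rate of 0.25, so it would be flagged.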

Program Execution Traces

• Goal: mine specifications (pre/post conditions) or object behavior (object transition diagrams)
• State of an object
– Values of transitively reachable fields
• Method-entry state
– Receiver-object state, method argument values
• Method-exit state
– Receiver-object state, updated method argument values, method return value
[Ernst et al. 02] http://pag.csail.mit.edu/daikon/
[Xie&Notkin 04/05][Dallmeier et al. 06] http://www.st.cs.uni-sb.de/models/

• Goal: detect or locate bugs
• Values of variables at certain code locations
[Hangal&Lam 02]
– Object/static field read/write
– Method-call arguments
– Method returns
• Sampled predicates on values of variables [Liblit et al. 03/05][Liu et al. 05]
[Hangal&Lam 02] http://diduce.sourceforge.net/
[Liblit et al. 03/05] http://www.cs.wisc.edu/cbi/
[Liu et al. 05] http://www.ews.uiuc.edu/~chaoliu/sober.htm

• Goal: locate bugs
• Executed branches/paths, def-use pairs
• Executed function/method calls
– Group methods invoked on the same object
• Profiling options
– Execution hit vs. count
– Execution order (sequences)
[Dallmeier et al. 05] http://www.st.cs.uni-sb.de/ample/
More related tools: http://www.csc.ncsu.edu/faculty/xie/research.htm#related

• Given a program state S and an event e, predict whether e likely results in a bug
– Positive samples: past bugs
– Negative samples: “not bug” reports
• A k-NN based approach
– Consider the k closest cases reported before
– Compare Σ 1/d for bug cases and not-bug cases, where d is the distance between the current state and each reported state
– If the current state is more similar to bugs, predict a bug
[Michail&Xie 05]
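A minimal sketch of the k-NN comparison above, treating d as a distance so that closer reports weigh more. Representing states as numeric vectors, using Euclidean distance, and the small epsilon are all assumptions made for illustration.

```python
import math


def predict_bug(current, reports, k=3):
    """Return True if the k nearest reports lean towards 'bug'.

    current: numeric state vector.
    reports: list of (state_vector, is_bug) pairs from earlier reports.
    """
    eps = 1e-9  # avoids division by zero when a report matches exactly

    def distance(state):
        return math.dist(current, state)

    nearest = sorted(reports, key=lambda r: distance(r[0]))[:k]
    bug_score = sum(1 / (distance(s) + eps) for s, is_bug in nearest if is_bug)
    ok_score = sum(1 / (distance(s) + eps) for s, is_bug in nearest if not is_bug)
    return bug_score > ok_score
```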

• A method executed only in failing runs is likely to point to the defect
– Comparing the coverage of passing and failing program runs helps
• Mining patterns frequent in failing program runs but infrequent in passing program runs
– Sequential patterns may be used
[Dallmeier et al. 05, Denmat et al. 05]
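The first bullet can be sketched as a set difference over method coverage, assuming each run is recorded as the set of method names it executed (the function name and input format are illustrative):

```python
def methods_only_in_failing(passing_runs, failing_runs):
    """Methods covered by at least one failing run and by no passing run.

    Each run is a set of executed method names.
    """
    covered_by_passing = set().union(*passing_runs)
    covered_by_failing = set().union(*failing_runs)
    return covered_by_failing - covered_by_passing
```

This only captures hit/no-hit coverage; the sequence-based variants mentioned on the slide would compare ordered call patterns instead.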

Must show value before
data quality improves
Correlation vs. Causation

• Very active research area in SE:
– MSR is the most attended ICSE event in the last 8 years
• http://msrconf.org
– Special Issue of IEEE TSE 2005 on MSR:
• 15% of all submissions of TSE in 2004
• Fastest review cycle in TSE history: 8 months
– Special Issue Empirical Software Engineering 2009, 2011
– MSR 2012!

• Report the statistical significance of your results:
– Get a statistics book (one for social scientists, not for mathematicians)
• Discuss any limitations of your findings based on the characteristics of the studied repositories:
– Make sure you manually examine the repositories. Do not fully automate the process!
– Use random sampling to resolve issues about data noise
• Relevant conferences/workshops:
– main SE conferences, ICSM, ISSTA, MSR, WODA, PROMISE …
