More information available at
http://ase.csc.ncsu.edu/dmse/
Mining Software Engineering Data
Ahmed E. Hassan
Queen’s University
www.cs.queensu.ca/~ahmed
ahmed@cs.queensu.ca
Tao Xie
North Carolina State University
www.csc.ncsu.edu/faculty/xie
xie@csc.ncsu.edu
Ahmed E. Hassan
• NSERC/RIM Software Engineering Research Chair, Queen’s University, Canada
• Leads the SAIL research group at Queen’s
• Co-chair of the Workshop on Mining Software Repositories (MSR) from 2004 to 2006
• Chair of the steering committee for MSR
A. E. Hassan and T. Xie:
Mining Software Engineering
Data
Tao Xie
• Associate Professor at North Carolina State University, USA
• Leads the ASE research group at NCSU
• PC Co-Chair of ICSM 2009, MSR 2011/2012
• Co-organizer of the 2007 Dagstuhl Seminar on Mining Programs and Processes
http://msrconf.org
An international effort to
make software repositories actionable
http://promisedata.org
• Transforms static record-keeping repositories to active repositories
• Makes repository data actionable by uncovering hidden patterns and trends
[Figure: mailing lists, Bugzilla, crashes, field logs, CVS/SVN]
[Figure: kinds of software engineering repositories]
• Historical Repositories: Source Control (CVS/SVN), Bugzilla, Mailing lists
• Runtime Repos: Field Logs, Crash Repos
• Code Repos: Sourceforge, GoogleCode
New Bug Report
[Figure: fixed bugs (Bugzilla), discussions (mailing lists), buggy and fixing changes (CVS/SVN), and field crashes are mined to support a new bug report]
• Estimate fix effort
• Mark duplicates
• Suggest experts and fix!
New Change
[Figure: fixed bugs (Bugzilla), discussions (mailing lists), buggy and fixing changes (CVS/SVN), and field crashes are mined to support a new change]
• Suggest APIs
• Warn about risky code or bugs
• Suggest locations to co-change
Example Repositories:
Source Control and Bug Repositories
Source Control Repositories
• A source control system tracks changes to ChangeUnits
• Examples of ChangeUnits:
– File (most common)
– Function
– Dependency (e.g., Call)
• Each ChangeUnit:
– Records the developer, change time, change message, and co-changing units
[Diagram: a ChangeList (Developer, Time, Message, ChangeList Type: FI, FR, or GM) is related many-to-many (* .. *) to ChangeUnits through Changes; each Change has a Change Type: Modify, Add, or Remove]
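The change model on this slide can be sketched as a small set of Python types. This is only an illustration under assumptions: the class and field names (`ChangeList`, `Change`, `co_changing_units`) are chosen here to mirror the diagram labels, the `FI`/`FR`/`GM` codes are kept as opaque strings, and the sample developer and file names are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum


class ChangeType(Enum):
    MODIFY = "Modify"
    ADD = "Add"
    REMOVE = "Remove"


@dataclass(frozen=True)
class Change:
    unit: str                # name of the ChangeUnit, e.g. a file or function
    change_type: ChangeType


@dataclass
class ChangeList:
    developer: str
    time: datetime
    message: str
    list_type: str           # "FI", "FR", or "GM", as labeled in the diagram
    changes: list[Change] = field(default_factory=list)

    def co_changing_units(self, unit: str) -> set[str]:
        """Units changed together with `unit` in this change list."""
        units = {c.unit for c in self.changes}
        return units - {unit} if unit in units else set()


# Hypothetical sample record, for illustration only.
cl = ChangeList(
    developer="alice",
    time=datetime(2004, 5, 1, 12, 0),
    message="fix null check",
    list_type="FR",
    changes=[Change("a.c", ChangeType.MODIFY), Change("b.c", ChangeType.ADD)],
)
```

Keeping the co-changing units derivable from the change list, rather than stored per unit, avoids duplicating data that the repository already groups by commit.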
“How does a change in one source code entity propagate to other entities?”
[Flowchart: New Req., Bug Fix → Determine Initial Entity To Change → Change Entity → Determine Other Entities To Change → (Suggested Entity, For Each Entity) Change Entity; (No More Changes) → Consult Guru for Advice → (Suggested Entity) Change Entity, or (No More Changes) done]
• We want:
– High Precision to avoid wasting time
– High Recall to avoid bugs
\[ \text{Recall} = \frac{|\text{predicted entities which changed}|}{|\text{changed entities}|} \]
\[ \text{Precision} = \frac{|\text{predicted entities which changed}|}{|\text{predicted entities}|} \]
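As a worked example of the two measures, the following sketch computes them for one change from a set of predicted entities and a set of entities that actually changed. The function name and the file names are hypothetical.

```python
def precision_recall(predicted: set[str], changed: set[str]) -> tuple[float, float]:
    """Precision and recall of a change-propagation prediction."""
    hits = len(predicted & changed)          # predicted entities which changed
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(changed) if changed else 0.0
    return precision, recall


# Three entities were suggested, two actually changed, one of them was suggested.
precision, recall = precision_recall({"a.c", "b.c", "c.c"}, {"a.c", "d.c"})
```

Here precision is 1/3 (two of three suggestions wasted the developer's time) and recall is 1/2 (one changed entity was missed).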
• Mine association rules from change history
• Use rules to help propagate changes:
– Recall as high as 44%
– Precision around 30%
• High precision and recall reached in less than one month
• Prediction accuracy improves prior to a
release (i.e., during maintenance phase)
• Better predictor than static dependencies
alone
[Zimmermann et al. 05]
[Hassan&Holt 04]
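A minimal sketch of mining co-change rules from history, assuming each change list is simply a collection of file names. It counts pairwise co-changes and keeps rules "a changed, so b probably changes too" that meet a support and confidence threshold. The thresholds, function names, and sample history are assumptions for illustration; the cited work uses fuller association rule mining.

```python
from collections import Counter
from itertools import permutations


def mine_cochange_rules(change_lists, min_support=2, min_confidence=0.5):
    """Return {(a, b): confidence} for rules 'a changed => b also changes'."""
    unit_count = Counter()
    pair_count = Counter()
    for units in change_lists:
        units = set(units)
        unit_count.update(units)
        # Ordered pairs, so (a, b) and (b, a) get separate confidences.
        pair_count.update(permutations(sorted(units), 2))
    rules = {}
    for (a, b), support in pair_count.items():
        confidence = support / unit_count[a]
        if support >= min_support and confidence >= min_confidence:
            rules[(a, b)] = confidence
    return rules


def suggest(rules, changed_unit):
    """Units to consider co-changing, highest confidence first."""
    hits = [(b, conf) for (a, b), conf in rules.items() if a == changed_unit]
    return [b for b, _ in sorted(hits, key=lambda h: -h[1])]


# Hypothetical change history: one list of files per commit.
history = [
    ["parser.c", "parser.h"],
    ["parser.c", "parser.h", "lexer.c"],
    ["parser.c", "main.c"],
    ["lexer.c"],
]
rules = mine_cochange_rules(history)
```

With this history, changing `parser.c` suggests `parser.h` (confidence 2/3), while the single co-change with `lexer.c` falls below the support threshold.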
• Traditional dependency graphs and program
understanding models usually do not use
historical information
• Static dependencies capture only a static view
of a system – not enough detail!
• Development history can help understand the
current structure (architecture) of a software
system
[Hassan & Holt 04]
[Figure: dependency graph among subsystems (Hardware Trans., Kernel Fault Handler, Pager, FileSystem, Virtual Addr. Maint., VM Policy), with dependencies marked as convergences or divergences]
Why? Who? When? Where?
• Eight unexpected dependencies
• All except two dependencies existed since day one:
– Virtual Address Maintenance → Pager
– Pager → Hardware Translations
Which?
vm_map_entry_create (in src/sys/vm/Attic/vm_map.c)
depends on pager_map (in /src/sys/uvm/uvm_pager.c)
Who? cgd
When?
1993/04/09 15:54:59
Revision 1.2 of src/sys/vm/Attic/vm_map.c
Why?
from sean eric fagan:
it seems to keep the vm system from deadlocking the
system when it runs out of swap + physical memory.
prevents the system from giving the last page(s) to
anything but the referenced "processes" (especially
important is the pager process, which should never
have to wait for a free page).
Auto-generated
from CVS repository
• Conway’s Law:
“The structure of a software system is a direct
reflection of the structure of the development
team”
[Figure: Conceptual Architecture, Ownership Architecture, Concrete Architecture]
import org.eclipse.jdt.internal.compiler.lookup.*;
import org.eclipse.jdt.internal.compiler.*;
import org.eclipse.jdt.internal.compiler.ast.*;
import org.eclipse.jdt.internal.compiler.util.*;
...
import org.eclipse.pde.core.*;
import org.eclipse.jface.wizard.*;
import org.eclipse.ui.*;
14% of all files that import ui packages had to be fixed later on.
71% of files that import compiler packages had to be fixed later on.
[Schröter et al. 06]
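The kind of statistic quoted above can be computed with a few lines, assuming each file is described by the set of packages it imports and a flag saying whether it was fixed later. The function name and the toy data are hypothetical and do not reproduce the Eclipse numbers.

```python
from collections import defaultdict


def fix_rate_by_import(files):
    """files: iterable of (imported_packages, was_fixed_later) pairs.
    Returns {package: fraction of importing files that were fixed later}."""
    importing = defaultdict(int)
    fixed = defaultdict(int)
    for packages, was_fixed in files:
        for pkg in set(packages):
            importing[pkg] += 1
            if was_fixed:
                fixed[pkg] += 1
    return {pkg: fixed[pkg] / importing[pkg] for pkg in importing}


# Hypothetical toy data, not the Eclipse measurements from the slide.
files = [
    ({"org.eclipse.jdt.internal.compiler"}, True),
    ({"org.eclipse.jdt.internal.compiler", "org.eclipse.ui"}, True),
    ({"org.eclipse.ui"}, False),
    ({"org.eclipse.ui"}, False),
]
rates = fix_rate_by_import(files)
```

A package whose importers are fixed far more often than average is a candidate signal for defect prediction, though the rate alone says nothing about why.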
Percentage of bug-introducing changes for Eclipse
Don’t program on Fridays ;-)
[Zimmermann et al. 05]
• Given a change, can we warn a developer that
there is a bug in it?
– Recall/Precision in 50-60% range
[Kim et al. 06]
Example Repositories:
Project Communication – Mailing lists
• Most open source projects communicate
through mailing lists or IRC channels
• Rich source of information about the inner
workings of large projects
• Discussions cover topics such as future plans,
design decisions, project policies, code or
patch reviews
• Social network analysis could be performed on
discussion threads
• Study the content of messages before and after a release
• Use dimensions from a psychometric text analysis tool:
– After the Apache 1.3 release there was a drop in optimism
– After the Apache 2.0 release there was an increase in sociability
[Rigby & Hassan 07]
• Mailing list activity:
– strongly correlates with code change activity
– moderately correlates with document change activity
• Social network measures (in-degree, out-degree, betweenness) indicate that committers play a more significant role in the mailing list community than non-committers [Bird et al. 06]
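Two of the social network measures named above, in-degree and out-degree, can be sketched directly from reply relationships. The sketch assumes the mailing list has already been reduced to (sender, recipient) pairs, one per reply; that preprocessing, the function name, and the sample names are assumptions. Betweenness is omitted because it needs shortest-path computation over the whole graph.

```python
from collections import defaultdict


def degrees(replies):
    """replies: iterable of (sender, recipient) pairs, one per reply message.
    Returns (in_degree, out_degree) dicts counting distinct neighbours."""
    incoming = defaultdict(set)
    outgoing = defaultdict(set)
    for sender, recipient in replies:
        if sender != recipient:              # ignore self-replies
            outgoing[sender].add(recipient)
            incoming[recipient].add(sender)
    in_degree = {person: len(s) for person, s in incoming.items()}
    out_degree = {person: len(s) for person, s in outgoing.items()}
    return in_degree, out_degree


# Hypothetical reply pairs.
replies = [("bob", "alice"), ("carol", "alice"), ("alice", "bob"), ("bob", "alice")]
in_degree, out_degree = degrees(replies)
```

In this toy thread `alice` has in-degree 2 because two distinct people replied to her; repeated replies from `bob` count once.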
• When will a developer be invited to join a
project?
– Expertise vs. interest
[Bird et al. 07]
Example Repositories:
Program Source Code
Source data | Mined info
Variable names and function names | Software categories [Kawaguchi et al. 04]
Statement seq in a basic block | Copy-paste code [Li et al. 04]
Set of functions, variables, and data types within a C function | Programming rules [Li&Zhou 05]
Sequence of methods within a Java method | API usages [Xie&Pei 06]
API method signatures | API Jungloids [Mandelin et al. 05]
• How should an API be used correctly?
– An API may serve multiple functionalities
– Different styles of API usage
• “I know what type of object I need, but I don’t know
how to write the code to get the object” [Mandelin
et al. 05]
– Can we synthesize jungloid code fragments automatically?
– Given a simple query describing the desired code in terms
of input and output types, return a code segment
• “I know what method call I need, but I don’t know
how to write code before and after this method call”
[Xie&Pei 06]
• Mine framework reuse patterns [Michail 00]
– Membership relationships
• A class contains membership functions
– Reuse relationships
• Class inheritance/ instantiation
• Function invocations/overriding
• Mine software plagiarism [Liu et al. 06]
– Program dependence graphs
[Michail 99/00] http://codeweb.sourceforge.net/ for C++
• Apply closed sequential pattern mining techniques
• Customizing the techniques
– A copy-paste segment typically does not have big gaps – use
a maximum gap threshold to control
– Output the instances of patterns (i.e., the copy-pasted code
segments) instead of the patterns
– Use small copy-pasted segments to form larger ones
– Prune false positives: tiny segments, unmappable segments,
overlapping segments, and segments with large gaps
[Li et al. 04]
• For two copy-pasted segments, are the
modifications consistent?
– Identifier a in segment S1 is changed to b in
segment S2 3 times, but remains unchanged once
– likely a bug
– The heuristic may not be correct all the time
• The lower the unchanged rate of an identifier,
the more likely there is a bug
[Li et al. 04]
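The consistency check can be illustrated with the slide's own example: identifier `a` becomes `b` three times and stays `a` once, giving an unchanged rate of 1/4. The sketch below assumes the two segments have already been aligned so that, for each occurrence of an identifier in S1, the corresponding name in S2 is known. The function names and the 0.4 threshold are hypothetical.

```python
def unchanged_ratio(identifier, counterparts):
    """Fraction of occurrences where `identifier` from S1 was left as-is in S2."""
    if not counterparts:
        return 0.0
    unchanged = sum(1 for name in counterparts if name == identifier)
    return unchanged / len(counterparts)


def looks_inconsistent(identifier, counterparts, threshold=0.4):
    """Flag a partly renamed identifier: some, but few, occurrences unchanged."""
    ratio = unchanged_ratio(identifier, counterparts)
    return 0.0 < ratio <= threshold


# The slide's example: a -> b three times, unchanged once.
ratio = unchanged_ratio("a", ["b", "b", "b", "a"])
```

A ratio of exactly 0 (renamed everywhere) or 1 (never renamed) is treated as consistent here; only small non-zero ratios are flagged, and as the slide notes, the heuristic can still be wrong.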
Example Repositories:
Program Execution Traces
• Goal: mine specifications (pre/post conditions) or
object behavior (object transition diagrams)
• State of an object
– Values of transitively reachable fields
• Method-entry state
– Receiver-object state, method argument values
• Method-exit state
– Receiver-object state, updated method argument values,
method return value
[Ernst et al. 02] http://pag.csail.mit.edu/daikon/
[Xie&Notkin 04/05][Dallmeier et al. 06] http://www.st.cs.uni-sb.de/models/
• Goal: detect or locate bugs
• Values of variables at certain code locations
[Hangal&Lam 02]
– Object/static field read/write
– Method-call arguments
– Method returns
• Sampled predicates on values of variables [Liblit
et al. 03/05][Liu et al. 05]
[Hangal&Lam 02] http://diduce.sourceforge.net/
[Liblit et al. 03/05] http://www.cs.wisc.edu/cbi/
[Liu et al. 05] http://www.ews.uiuc.edu/~chaoliu/sober.htm
• Goal: locate bugs
• Executed branches/paths, def-use pairs
• Executed function/method calls
– Group methods invoked on the same object
• Profiling options
– Execution hit vs. count
– Execution order (sequences)
[Dallmeier et al. 05] http://www.st.cs.uni-sb.de/ample/
More related tools: http://www.csc.ncsu.edu/faculty/xie/research.htm#related
• Given a program state S and an event e, predict
whether e likely results in a bug
– Positive samples: past bugs
– Negative samples: “not bug” reports
• A k-NN based approach
– Consider the k closest cases reported before
– Compare Σ 1/d for bug cases and not-bug cases, where d is
the distance between the current state and each reported
state
– If the current state is more similar to bugs, predict a bug
[Michail&Xie 05]
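A minimal sketch of the k-NN comparison, under two assumptions that are not in the slide: program states are encoded as numeric vectors, and d is read as a Euclidean distance so that 1/d grows as states become more similar. The function name, the small epsilon, and the sample reports are hypothetical.

```python
import math


def predict_bug(current, reported, k=3):
    """reported: list of (state_vector, is_bug). True if a bug is predicted."""
    # Keep the k reported states closest to the current state.
    nearest = sorted(
        (math.dist(current, state), is_bug) for state, is_bug in reported
    )[:k]
    eps = 1e-9  # avoids division by zero when a reported state is identical
    bug_score = sum(1 / (d + eps) for d, is_bug in nearest if is_bug)
    ok_score = sum(1 / (d + eps) for d, is_bug in nearest if not is_bug)
    return bug_score > ok_score


# Hypothetical past reports: two bugs near the origin, two "not bug" far away.
reported = [
    ((0.0, 0.0), True),
    ((1.0, 0.0), True),
    ((5.0, 5.0), False),
    ((6.0, 5.0), False),
]
near_bugs = predict_bug((0.5, 0.0), reported)
near_ok = predict_bug((5.5, 5.0), reported)
```

Weighting by 1/d lets one very close past bug outweigh several distant "not bug" reports, which a plain majority vote among the k neighbours would not do.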
• A method executed only in failing runs is likely
to point to the defect
– Comparing the coverage of passing and failing
program runs helps
• Mining patterns frequent in failing program
runs but infrequent in passing program runs
– Sequential patterns may be used
[Dallmeier et al. 05, Denmat et al. 05]
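The coverage comparison can be sketched by ranking methods on how much more often they execute in failing runs than in passing runs. This difference-of-rates score is a simple stand-in chosen for illustration, not the scoring used by the cited tools; the function name and the sample runs are hypothetical.

```python
def suspicious_methods(passing_runs, failing_runs):
    """Each run is a set of executed method names.
    Returns (method, score) pairs, most suspicious first."""
    methods = set().union(*passing_runs, *failing_runs)

    def rate(runs, method):
        # Fraction of the given runs that executed the method.
        return sum(method in run for run in runs) / len(runs) if runs else 0.0

    scores = {m: rate(failing_runs, m) - rate(passing_runs, m) for m in methods}
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical coverage data.
passing = [{"open", "read", "close"}, {"open", "close"}]
failing = [{"open", "read", "retry", "close"}, {"open", "retry"}]
ranking = suspicious_methods(passing, failing)
```

Here `retry` runs in every failing run and in no passing run, so it ranks first with a score of 1.0, matching the slide's observation about methods executed only in failing runs.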
Must show value before
data quality improves
Correlation vs. Causation
• Very active research area in SE:
– MSR is the most attended ICSE event in the last 8 years
• http://msrconf.org
– Special Issue of IEEE TSE 2005 on MSR:
• 15% of all submissions to TSE in 2004
• Fastest review cycle in TSE history: 8 months
– Special Issues of Empirical Software Engineering in 2009 and 2011
– MSR 2012!
• Report the statistical significance of your results:
– Get a statistics book (one for social scientists, not for
mathematicians)
• Discuss any limitations of your findings based on the
characteristics of the studied repositories:
– Make sure you manually examine the repositories. Do not fully
automate the process!
– Use random sampling to resolve issues about data noise
• Relevant conferences/workshops:
– main SE conferences, ICSM, ISSTA, MSR, WODA, PROMISE …
