CSMR11b.ppt

Sub-graph Mining: Identifying
Micro-architectures in Evolving
Object-oriented Software

Ahmed Belderrar, Segla Kpodjedo, Yann-Gaël Guéhéneuc,
Giuliano Antoniol, Philippe Galinier

CSMR 2011
Oldenburg

Context & Motivation
[Diagram: the coding activity yields the design of an OO application, whose organizations of classes and relations are either known or new.]
• Known: patterns (good?) and anti-patterns (bad?). Active research.
• New: unknown micro-architectures (ugly?). Almost nothing.

Preliminary study exploring the feasibility and prospects of an approach
aimed at building empirical knowledge about organizations of classes.

Related work (I)
In general, the proposed approaches rely on a library of known motifs
and use architectural recovery techniques based on sub-graph matching
or on motif properties.

• Constraint programming to recognize plans
[Rich and Waters-90]
• Explanation-based constraint programming
[Guéhéneuc and Antoniol-2008]
• Use of queries
[Kullbach and Winter-99]
• Generic fuzzy reasoning nets
[Niere et al.-2002]
• Graph-transformation techniques
[Tsantalis et al.-2006]

Related work (II)
• Similar to our approach: [Tonella and Antoniol-01], in which concept
analysis was used to infer domain-specific design patterns.
Main difference:
“maximal collections of class sets having the same relations
between them” vs. isomorphic induced subgraphs of fixed size
• Our approach improves scalability via an efficient sub-graph enumeration
and eliminates the need for manual inspection of the “concept lattice”.
• Inspiration from sub-graph mining algorithms
• Sub-graph discovery problem: discover all recurring sub-graphs, or
only the most frequent ones.
Limitations of previous works:
restriction to undirected graphs, or to sub-graphs with a limited
number of labels on their arcs [Wernicke and Rasche-2006]
• Our own tool: SGFinder

Our approach – Introduction
• Considering a class diagram as a labeled graph, we define a
micro-architecture as the connected subgraph induced by a given subset
of classes.
Prerequisite
• We model a class diagram as a labeled graph, with nodes being
classes (and interfaces) and arcs representing the relations among
classes.
• Arc label → type of relation between two classes (e.g., association,
aggregation, or inheritance).

Example
1: Association
2: Aggregation/Composition
3: Inheritance

Figure 1. A sub-graph G(V,E) extracted from Rhino (release 1.7R1).

Our approach – Definitions
• Labeled graph: given a set of labels L, it is defined by a triplet
G = (V, A, l) where V is the vertex set of G, A ⊆ V × V the arc set,
and l : A → L the labeling function.
• Connected graph: there is a chain between any two vertices.
• Isomorphic graphs: there exists a perfect one-to-one
correspondence between the vertices and edges of those graphs.
• Induced subgraph: GX induced by X (X ⊆ V) is the graph whose
vertex set is X and whose arc set contains all the arcs of G that link
two vertices of X.
• Embedding of a sub-graph H of G: a set X of vertices such that
GX is isomorphic to H.
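
The definitions above can be made concrete with a minimal Python sketch. The dictionary encoding of arcs, the sample graph `ARCS`, and the function names are assumptions made for illustration; they are not taken from SGFinder. Connectivity is checked while ignoring arc direction, which is one reading of "a chain between any two vertices".

```python
from collections import deque

# Hypothetical labeled graph: (source, target) -> label.
# Labels follow the slides: 1 association, 2 aggregation/composition, 3 inheritance.
ARCS = {(0, 1): 1, (1, 2): 3, (3, 2): 2, (3, 4): 1}


def induced_subgraph(arcs, X):
    """Keep every arc of the graph whose two endpoints both belong to X."""
    X = set(X)
    return {(u, v): lab for (u, v), lab in arcs.items() if u in X and v in X}


def is_connected(arcs, X):
    """True if a chain (arc direction ignored) links any two vertices of X."""
    X = set(X)
    if not X:
        return False
    neighbours = {v: set() for v in X}
    for u, v in induced_subgraph(arcs, X):
        neighbours[u].add(v)
        neighbours[v].add(u)
    start = next(iter(X))
    seen, queue = {start}, deque([start])
    while queue:
        u = queue.popleft()
        for w in neighbours[u] - seen:
            seen.add(w)
            queue.append(w)
    return seen == X
```

With this encoding, a set X is a candidate micro-architecture exactly when `is_connected(ARCS, X)` holds.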

Example
Figure 2. A sub-graph G=(V,E) extracted from
Rhino (release 1.7R1) with two embeddings
of size three {7, 1, 3} and {2, 4, 5}.

Our approach – Algorithm (I)
• Inputs
A graph G and the size k of the sought sub-graphs
• Output
The set of induced subgraphs with their embeddings.
• Principle
Generate all k-subsets X of V such that GX is connected.

Generate only a limited number of k-subsets of V
• Underlying idea
If GX is a connected induced sub-graph of G, then X cannot
contain two vertices x and y such that dist(x, y) ≥ k.
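
To illustrate the output ("induced subgraphs with their embeddings"), here is a hedged sketch that groups vertex subsets by isomorphism class. It uses a brute-force canonical form over all vertex orderings, which is only practical for the small k considered here. This is a stand-in chosen for clarity, not a description of how SGFinder compares subgraphs, and all names are hypothetical.

```python
from itertools import permutations


def canonical_form(arcs, X):
    """Smallest relabeled arc list over all orderings of X (brute force, small k only)."""
    X = sorted(X)
    sub = {(u, v): lab for (u, v), lab in arcs.items() if u in X and v in X}
    best = None
    for order in permutations(X):
        pos = {v: i for i, v in enumerate(order)}
        code = tuple(sorted((pos[u], pos[v], lab) for (u, v), lab in sub.items()))
        if best is None or code < best:
            best = code
    return best


def group_embeddings(arcs, subsets):
    """Map each isomorphism class (canonical form) to the list of its embeddings."""
    groups = {}
    for X in subsets:
        groups.setdefault(canonical_form(arcs, X), []).append(frozenset(X))
    return groups
```

Two k-subsets land in the same group exactly when their induced labeled subgraphs are isomorphic, so each key plays the role of one micro-architecture and each value lists its embeddings.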

Our approach – Algorithm (II)
SGFinder
Build k disjoint subsets of vertices, denoted by
L0...Lk−1 (lines 3-5), where each Li contains the
vertices whose distance to x equals i (L0 = {x}).

To build X (line 6, developed in the CompleteSG()
procedure), we choose the unique vertex present in
L0 (namely x), plus one or more vertices chosen in
L1, plus one or more vertices chosen in L2, and so
on until we reach a total number k of vertices.
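
The layer construction can be sketched with a breadth-first search. This is a minimal illustration, assuming distances are measured while ignoring arc direction and that the graph is given as an adjacency mapping. The function name and the sample graph are hypothetical and do not come from the SGFinder pseudocode.

```python
from collections import deque


def distance_layers(neighbours, x, k):
    """Return [L0, ..., L(k-1)], where Li holds the vertices at distance i from x."""
    layers = [set() for _ in range(k)]
    dist = {x: 0}
    queue = deque([x])
    while queue:
        u = queue.popleft()
        layers[dist[u]].add(u)
        if dist[u] == k - 1:
            # Vertices at distance >= k cannot join x in a connected k-subset.
            continue
        for w in neighbours.get(u, ()):
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return layers


# Hypothetical undirected adjacency lists.
SAMPLE = {"a": ["b", "c"], "b": ["a", "d"], "c": ["a"], "d": ["b", "e"], "e": ["d"]}
```

For example, `distance_layers(SAMPLE, "a", 3)` stops at distance 2, so vertex "e" (distance 3) never enters a layer.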


Our approach – Algorithm (III)
• Illustration

Figure 1 A sub-graph G(V,E) extracted from Rhino (release 1.7R1).

- Let us choose k = 4,
- And let us assume that the first chosen vertex is x = 0.
1. We have L0 = {0}, L1 = {3}, L2 = {1, 2}, and L3 = {4, 6, 7}.
2. There are now seven possibilities for X:
{0, 3, 1, 2}, {0, 3, 1, 4}, {0, 3, 1, 6}, {0, 3, 1, 7},
{0, 3, 2, 4}, {0, 3, 2, 6}, {0, 3, 2, 7}.
3. Some of the induced sub-graphs are not connected, e.g.,
G{0,3,1,4} and are discarded.
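
The seven candidates of step 2 can be reproduced from the four layers of step 1 alone. The sketch below is an assumption about how such an enumeration might be written, not the CompleteSG() procedure itself: it takes at least one vertex from each of the layers L0..Lj, without skipping a layer, until k vertices are chosen. Filtering out the non-connected candidates of step 3 would additionally need the arcs of Figure 1, which are not reproduced here.

```python
from itertools import combinations, product


def layered_candidates(layers, k):
    """Yield k-subsets taking >= 1 vertex from each of L0..Lj, with no layer skipped."""

    def compositions(total, parts):
        # All ways to write `total` as an ordered sum of `parts` positive integers.
        if parts == 1:
            yield (total,)
            return
        for first in range(1, total - parts + 2):
            for rest in compositions(total - first, parts - 1):
                yield (first,) + rest

    for depth in range(1, min(k, len(layers)) + 1):
        for sizes in compositions(k, depth):
            picks = [combinations(sorted(layer), n) for layer, n in zip(layers, sizes)]
            for parts in product(*picks):
                yield frozenset(v for part in parts for v in part)


# Layers from the slide, for k = 4 and x = 0.
SLIDE_LAYERS = [{0}, {3}, {1, 2}, {4, 6, 7}]
CANDIDATES = set(layered_candidates(SLIDE_LAYERS, 4))
```

`CANDIDATES` then contains the seven sets listed on the slide, for instance {0, 3, 1, 2} and {0, 3, 2, 7}.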

Case Study
• Goal
Identify, in OO systems, micro-architectures with desirable or
harmful properties.
• Quality focus
Verify the applicability of SGFinder to small and medium-size OO
programs in order to detect micro-architectures that are stable or
fault-prone.
• Perspective
Researchers, developers, and managers who want to get
information about undocumented micro-architectures found
in OO systems and their possible drawbacks.
• Context
Two open-source systems: the Rhino JavaScript/ECMAScript
interpreter, and the ArgoUML CASE tool.

Case Study - Objects
• Rhino: JavaScript/ECMAScript interpreter and compiler
  • Defect data extracted from the [Eaddy et al.-2008] study
  • Change data mined from the Rhino CVS logs.
• ArgoUML: UML CASE tool written in Java.
  • Defect data from the ArgoUML customized Bugzilla repository
  • Change data from the SVN logs.

Class diagrams recovered using the Ptidej tool suite and its PADL
meta-model.
Size of micro-architectures of interest: 3, 4 and 5 nodes.

Case Study – Research Questions

• RQ1 – Does SGFinder scale up to medium-size
programs and what kind of micro-architectures are
found in OO systems?

• RQ2 – Are there micro-architectures particularly
fault-prone or fault-free?

• RQ3 – Are there micro-architectures particularly stable
or change-prone?


Case Study – Analysis Method (I)
• Characterizing a micro-architecture mAi

Connectivity
• Total number of relations (nbRel),
• Number of associations (nbAssoc),
• Number of aggregations or compositions (nbAggr),
• Number of inheritances (nbInher),
• Number of “cyclic relations” (nbCycl).

Presence and distribution
• Number of releases in which mAi can be found (nbReleases)
• Number of embeddings of mAi in a given release (nbEmbeddings)
• Number of zones (nbZones):
an embedding is counted only when it does not share any
common edge with a previously found embedding
• Number of classes appearing at least once in mAi (nbClasses)
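
The nbZones rule can be read as a single pass over the embeddings in discovery order. The sketch below follows the literal wording: an embedding counts as a new zone only if none of its edges already appeared in any previously found embedding, whether that one was counted or not. The slide does not settle this, so the interpretation, the function name, and the data layout are all assumptions.

```python
def count_zones(embeddings, arcs):
    """Count embeddings that share no edge with any embedding found before them."""
    seen_edges = set()
    zones = 0
    for X in embeddings:
        X = set(X)
        edges = {(u, v) for (u, v) in arcs if u in X and v in X}
        if not (edges & seen_edges):
            zones += 1
        # Edges of every embedding, counted or not, are remembered.
        seen_edges |= edges
    return zones
```

Under this reading, nbZones is at most nbEmbeddings and depends on the order in which embeddings are found.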

Case Study – Analysis Method (II)
• Answering the Research Questions

RQ1
Execution times and characterization of micro-architectures
RQ2 and RQ3
1. For each mA, compute the percentage of its classes that are
fault-prone (i.e., with one or more documented faults) or changed
between two subsequent releases.
2. Rank micro-architectures from the most fault- (or change-) prone to the
most fault- (or change-) free using the precision measure of step 1.
3. Inspect the top 10% and bottom 10% of ranked
micro-architectures → hints of what makes those
micro-architectures outstanding.

Use of descriptive statistics: the five-number summary
(Min, Quartile1, Median, Quartile3, Max).
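
Steps 1 and 2 and the five-number summary can be sketched as follows. The class names and data layout are invented for illustration, and the quartile convention (inclusive interpolation from Python's `statistics.quantiles`) is an assumption, since the slides do not say which convention was used.

```python
from statistics import quantiles


def precision(classes, flagged):
    """Percentage of a micro-architecture's classes that are flagged (faulty or changed)."""
    classes = set(classes)
    return 100.0 * len(classes & set(flagged)) / len(classes)


def rank(micro_architectures, flagged):
    """Sort micro-architectures from the most to the least fault- (or change-) prone."""
    scored = [(precision(cls, flagged), name) for name, cls in micro_architectures.items()]
    return sorted(scored, reverse=True)


def five_number_summary(values):
    """(Min, Quartile1, Median, Quartile3, Max) with inclusive quartiles."""
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    return (min(values), q1, median, q3, max(values))


# Hypothetical input: micro-architecture name -> classes appearing in its embeddings.
MAS = {"mA1": {"A", "B", "C", "D"}, "mA2": {"A", "E", "F", "G"}, "mA3": {"E", "F", "G", "H"}}
FAULTY = {"A", "B", "C"}
```

Here `rank(MAS, FAULTY)` places mA1 first with a precision of 75, and the top and bottom 10% of such a ranking are the slices that step 3 inspects.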

Results – RQ1.1 SGFinder Applicability
Computation times:
From a few seconds to ... two and a half days.

3-node micro-architectures → 20 seconds at most
4-node micro-architectures → less than 30 minutes
5-node micro-architectures ... another story
Rhino: max < 8 min
ArgoUML: max > 60 h to retrieve the
82,877 different 5-node mAs in ArgoUML 0.17.5
and their 13,741,073,588 embeddings.

Computation times depend mostly on the edge density of the considered class diagrams.
More specifically, the single most time-costly factor is the presence of highly
connected nodes, which lead to a near-combinatorial explosion of the
number of embeddings.


Results – RQ1.2 Micro-architectures’ Description
Number of different micro-architectures.
- 3 basic relations (association 1, aggregation 2 and inheritance 3)
- 4 derived mixed cases (12,13,23,123)
- No relation
8 possible connections between two nodes

For k × k pairs of nodes (including loops), if we do not take
symmetry into account:
At most $8^{k \times k-(k-1)} \times 7^{k-1}$ connected subgraphs of k nodes.

For k=5, in theory, possibly $22 \times 10^{21}$ different micro-architectures.
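
The k = 5 figure can be checked with exact integer arithmetic, reading the bound as 8 raised to the power k×k−(k−1), times 7 raised to the power k−1. The function name is hypothetical.

```python
def max_micro_architectures(k):
    """Upper bound on the number of connected labeled subgraphs of k nodes."""
    return 8 ** (k * k - (k - 1)) * 7 ** (k - 1)


BOUND_5 = max_micro_architectures(5)  # 8**21 * 7**4
```

`BOUND_5` is about 2.2 × 10^22, which agrees with the 22 × 10^21 quoted above.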

In practice, the union of five-node micro-architectures from Rhino and
ArgoUML contains only about $32 \times 10^{4}$ different micro-architectures.

This suggests, unsurprisingly, that the set of micro-architectures is
restricted to specific subgraphs.


Characterization

Averages over the two systems



Illustration - Most Connected 5-node micro-architectures

Results – RQ2 Fault proneness

Figure 4. Example of a particularly faulty
(precision=81) 4-node micro-architecture

(Almost) every class “talks” to every other class. In
fact, when we retrieve the set of micro-architectures
with three cyclic relations and where every class
is connected to every other class, we obtain a
five-number summary (28,44,56,67,81) with values mostly
higher than those from the top 10% most faulty
micro-architectures (42,45,48,54,83).

Results – RQ3 Change proneness

Figure 5. Example of a particularly stable
(precision=30) 4-node micro-architecture

It presents a “cascade”-like pattern, and we were able
to retrieve many other similar organizations with
mostly low changeability.

Threats to Validity
Threats to conclusion validity concern the relationship between the
treatment and the outcome.
We do not claim any causal relation between the microarchitectures
and unwanted or desired features.

Threats to external validity concern the possibility of generalizing our
results.
The study is limited to 2 systems: Rhino and ArgoUML.
But threats are mitigated given that the two systems
correspond to different domains and applications,
have different sizes,
and are developed by different teams.


Conclusion
• We presented SGFinder, an algorithm and a tool to support
micro-architecture discovery based on a reformulation as a sub-graph
mining problem.
• SGFinder uses an effective enumeration technique that allows us to
infer instances of micro-architectures and to study
micro-architecture properties such as stability or fault proneness.
• We used SGFinder on 8 releases of 2 well-known Java applications:
Rhino and ArgoUML.
• We provided insight about the kind of micro-architectures (of 3, 4
or 5 nodes) found in OO systems.
• We characterized the most and least faulty/changed
micro-architectures and reported some of the most interesting
micro-architectures with respect to their connectivity and frequency.


Future Ongoing Work
Despite the encouraging results, more work is needed to

(i) further optimize the proposed algorithm and gain in
scalability, and/or investigate the partitioning of large OO
diagrams into subsystems

(ii) define heuristics and rules able to classify micro-architectures
discovered in a system under development,

(iii) go beyond the micro-architectures and analyze, similarly to
design patterns, the roles played by participant classes, and

(iv) provide qualitative analysis and validation of the findings. For
generalization purposes, our plan for future work also includes
replicating the study on a larger system.


References
• [Guéhéneuc and Antoniol-2008] Y.-G. Guéhéneuc and G. Antoniol,
“DeMIMA: A multi-layered framework for design pattern identification,”
IEEE Transactions on Software Engineering (TSE), vol. 34, no. 5, pp. 667–684,
September 2008.
• [Tonella and Antoniol-01] P. Tonella and G. Antoniol, “Inference of object
oriented design patterns,” Journal of Software Maintenance - Research and
Practice, vol. 13, no. 5, pp. 309–330, September-October 2001.
• [Rich and Waters-90] C. Rich and R. C. Waters, The Programmer’s Apprentice,
1st ed. ACM Press Frontier Series and Addison-Wesley, January 1990.
• [Kullbach and Winter-99] B. Kullbach and A. Winter, “Querying as an
enabling technology in software reengineering,” in Proceedings of the 3rd
Conference on Software Maintenance and Reengineering, P. Nesi and C. Verhoef,
Eds. IEEE Computer Society Press, March 1999, pp. 42–50.
• [Niere et al.-2002] J. Niere, W. Schäfer, J. P. Wadsack, L. Wendehals, and J.
Welsh, “Towards pattern-based design recovery,” in Proceedings of the 24th
International Conference on Software Engineering, M. Young and J. Magee, Eds.
ACM Press, May 2002, pp. 338–348.

References
• [Tsantalis et al.-2006] N. Tsantalis, A. Chatzigeorgiou, G. Stephanides, and S.
Halkidis, “Design pattern detection using similarity scoring,” IEEE Transactions on
Software Engineering, vol. 32, no. 11, November 2006.
• [Wernicke and Rasche-2006] S. Wernicke and F. Rasche, “A tool for fast
network motif detection,” Bioinformatics, vol. 22, pp. 1152–1153, 2006.
• [Eaddy et al.-2008] M. Eaddy, T. Zimmermann, K. D. Sherwood, V. Garg, G. C.
Murphy, N. Nagappan, and A. V. Aho, “Do crosscutting concerns cause
defects?” IEEE Transactions on Software Engineering, vol. 34, no. 4, pp. 497–515,
2008.

Thanks for the attention

Questions?
