Sub-graph Mining: Identifying
             Micro-architectures in Evolving
             Object-oriented Software

             Ahmed Belderrar, Segla Kpodjedo, Yann-Gaël Guéhéneuc,
             Giuliano Antoniol, Philippe Galinier




CSMR 2011
 Oldenburg
Context & Motivation
              [Diagram: a coding activity produces an OO application. Its organizations of
               classes are either Known (design patterns: good? anti-patterns: bad?), which are
               the subject of active research, or Unknown (new organizations of classes and
               relations, i.e., micro-architectures: ugly?), about which almost nothing is known.]

              Preliminary study exploring the feasibility and prospects of an approach
              aimed at building empirical knowledge about organizations of classes.
Related work (I)
                 In general, the proposed approaches rely on a library of known motifs
                 and use architecture recovery techniques based on sub-graph matching
                 or on motif properties.

                Constraint Programming to recognize plans
                 [Rich and Waters-90]
                Explanation-based constraint programming
                 [Guéhéneuc and Antoniol-2008]
                Use of queries
                 [Kullbach and Winter-99]
                Generic fuzzy reasoning nets
                 [Niere et al.-2002].
                Graph-transformation techniques
                 [Tsantalis et al.-2006]
Related work (II)
                Similar to our approach: [Tonella and Antoniol-01], in which concept
                 analysis was used to infer domain-specific design patterns.
                 Main difference:
                 “maximal collections of class sets having the same relations
                 between them” vs. isomorphic induced subgraphs of fixed size.
                 Our approach improves scalability via an efficient sub-graph enumeration
                 and eliminates the need for manual inspection of the “concept lattice”.
                Inspiration from sub-graph mining algorithms
                Sub-graph discovery problem: discover all recurring sub-graphs, or
                 only the most frequent ones.
                 Limitations of previous work
                 (restriction to undirected graphs, or to sub-graphs with a limited
                 number of labels on their arcs [Wernicke and Rasche-2006])
                 Hence our own tool, SGFinder.
Our approach – Introduction
              Considering a class diagram as a labeled graph, we define a
               micro-architecture as the connected subgraph induced by a given subset
               of classes.
             Pre-requisite
              We model a class diagram as a labeled graph, with nodes being
               classes (and interfaces) and arcs representing the relations among
               classes.
                Arc label: the type of relation between two classes (e.g., association,
                 aggregation, or inheritance).

                 Example
                 Arc labels: 1: Association, 2: Aggregation/Composition, 3: Inheritance

                 Figure 1. A sub-graph G(V,E) extracted from Rhino (release 1.7R1).
Our approach – Definitions
                Labeled graph: given a set of labels L, it is defined by a triplet
                 G = (V, A, l) where V is the vertex set of G, A ⊆ V × V the arc set,
                 and l : A → L the labeling function.
                Connected graph: there is a chain between any two vertices.
                Isomorphic graphs: there exists a perfect one-to-one
                 correspondence between the vertices and edges of those graphs.
                Induced subgraph: GX induced by X (X ⊆ V) is the graph whose
                 vertex set is X and whose arc set contains all the arcs of G that link
                 two vertices of X.
                Embedding of a sub-graph H of G: a set X of vertices such that
                 GX is isomorphic to H.

             Example
             Figure 2. A sub-graph G=(V,E) extracted from
             Rhino (release 1.7R1) with two embeddings
             of size three, {7, 1, 3} and {2, 4, 5}.
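The definitions above can be made concrete with a small sketch. It is illustrative only: the vertices and arcs are invented (they are not the Rhino sub-graph), the names `ARCS`, `induced_subgraph` and `is_connected` are hypothetical, and "chain" is read as a path that ignores arc direction.

```python
from collections import deque

# Arc labels as on the slides: 1 = association,
# 2 = aggregation/composition, 3 = inheritance.
# The vertices and arcs are invented for illustration only.
ARCS = {
    ("A", "B"): 1,
    ("B", "C"): 3,
    ("C", "D"): 2,
    ("D", "A"): 1,
}

def induced_subgraph(arcs, X):
    """Return the labeled arcs of G whose two end points both lie in X (the arcs of GX)."""
    X = set(X)
    return {(u, v): lab for (u, v), lab in arcs.items() if u in X and v in X}

def is_connected(arcs, X):
    """True if a chain links any two vertices of X inside GX (arc direction ignored)."""
    X = set(X)
    if not X:
        return False
    neighbours = {v: set() for v in X}
    for (u, v) in induced_subgraph(arcs, X):
        neighbours[u].add(v)
        neighbours[v].add(u)
    start = next(iter(X))
    seen, queue = {start}, deque([start])
    while queue:
        u = queue.popleft()
        for v in neighbours[u] - seen:
            seen.add(v)
            queue.append(v)
    return seen == X
```

With this toy graph, `{"A", "B", "C"}` induces a connected subgraph, whereas `{"A", "C"}` does not because no arc links A and C directly.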
Our approach – Algorithm (I)
              Inputs
             A graph G and the size k of the sought sub-graphs
              Output
             The set of induced subgraphs with their embeddings.
              Principle
             Generate all k-subsets X of V such that GX is connected.

             Only a limited number of the k-subsets of V are generated.
              Underlying idea
             If GX is a connected induced sub-graph of G, then X cannot
                contain two vertices x and y such that dist(x, y) ≥ k.
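The pruning idea can be checked by brute force on a toy graph: a connected subgraph on k vertices contains a chain of at most k − 1 edges between any two of its vertices, so their distance in G is below k. The sketch below is not SGFinder; the adjacency `ADJ` and all function names are invented for illustration.

```python
from collections import deque
from itertools import combinations

# Hypothetical undirected graph: a path 0-1-2-3-4 plus a chord 1-3.
ADJ = {0: {1}, 1: {0, 2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}

def distances_from(adj, source):
    """Breadth-first search: number of edges on a shortest chain from source."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def connected_within(adj, X):
    """True if the subgraph induced by X is connected."""
    X = set(X)
    start = next(iter(X))
    seen, queue = {start}, deque([start])
    while queue:
        u = queue.popleft()
        for v in (adj[u] & X) - seen:
            seen.add(v)
            queue.append(v)
    return seen == X

def max_pairwise_distance(adj, X):
    """Largest dist(x, y), measured in G, over pairs of X (G is assumed connected)."""
    return max(distances_from(adj, x)[y] for x, y in combinations(X, 2))

k = 3
connected_subsets = [X for X in combinations(sorted(ADJ), k) if connected_within(ADJ, X)]
```

Here vertices 0 and 4 are at distance 3, so with k = 3 no candidate containing both needs to be generated, and every connected 3-subset indeed has pairwise distances of at most 2.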
Our approach – Algorithm (II)
                             SGFinder
             Build k disjoint subsets of vertices, denoted by L0...Lk−1 (lines 3-5),
             where each Li contains the vertices whose distance to x equals i (L0 = {x}).

             To build X (line 6, developed in the CompleteSG()
             procedure), we choose the unique vertex present in
             L0 (namely x), plus one or more vertices chosen in
             L1, plus one or more vertices chosen in L2, and so
             on until we reach a total number k of vertices.
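The layer construction can be sketched with a breadth-first search that stops at distance k − 1. This is a hedged illustration, not the SGFinder code: `distance_layers` is a hypothetical name, and the adjacency `ADJ` is invented (it is not the Rhino sub-graph of Figure 1) and treated as undirected.

```python
from collections import deque

def distance_layers(adj, x, k):
    """Return [L0, ..., L(k-1)], where Li holds the vertices at distance i from x."""
    layers = [set() for _ in range(k)]
    dist = {x: 0}
    layers[0].add(x)
    queue = deque([x])
    while queue:
        u = queue.popleft()
        if dist[u] == k - 1:
            continue  # vertices at distance >= k cannot join x in a connected k-subset
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                layers[dist[v]].add(v)
                queue.append(v)
    return layers

# Invented undirected adjacency, for illustration only.
ADJ = {
    0: {3},
    3: {0, 1, 2},
    1: {3, 6, 7},
    2: {3, 4},
    4: {2, 5},
    5: {4},
    6: {1},
    7: {1},
}

layers = distance_layers(ADJ, 0, 4)
```

Vertex 5 lies at distance 4 from vertex 0, so it does not appear in any layer when k = 4.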




Our approach – Algorithm (III)
             Illustration

             Figure 1. A sub-graph G(V,E) extracted from Rhino (release 1.7R1).

             - Let us choose k = 4,
             - and let us assume that the first chosen vertex is x = 0.
               1. We have L0 = {0}, L1 = {3}, L2 = {1, 2}, and L3 = {4, 6, 7}.
               2. There are now seven possibilities for X:
                  {0, 3, 1, 2}, {0, 3, 1, 4}, {0, 3, 1, 6}, {0, 3, 1, 7},
                  {0, 3, 2, 4}, {0, 3, 2, 6}, {0, 3, 2, 7}.
               3. Some of the induced sub-graphs are not connected, e.g.,
                  G{0,3,1,4}, and are discarded.
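The seven candidates of step 2 can be reproduced from the layers alone. The sketch below is one possible reading of the candidate-building step, not the CompleteSG() procedure itself: it takes the vertex of L0, then one or more vertices from each consecutive layer until k vertices are chosen. The name `candidate_subsets` is hypothetical.

```python
from itertools import combinations

def candidate_subsets(layers, k):
    """Yield k-subsets made of L0's vertex plus one or more vertices from each
    consecutive layer L1, L2, ... until k vertices have been chosen."""
    def extend(chosen, depth):
        remaining = k - len(chosen)
        if remaining == 0:
            yield frozenset(chosen)
            return
        if depth >= len(layers):
            return
        # Take at least one vertex from this layer, and at most what is still needed.
        for size in range(1, min(remaining, len(layers[depth])) + 1):
            for picked in combinations(sorted(layers[depth]), size):
                yield from extend(chosen + list(picked), depth + 1)
    yield from extend(list(layers[0]), 1)

# Layers quoted in the illustration (x = 0, k = 4).
LAYERS = [{0}, {3}, {1, 2}, {4, 6, 7}]
candidates = list(candidate_subsets(LAYERS, 4))
```

Each candidate would then be kept only if its induced sub-graph is connected. That check needs the arcs of Figure 1, which are not reproduced here.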
Case Study
                Goal
                 Identify, in OO systems, micro-architectures with desirable or
                 harmful properties.
                Quality focus
                 Verify the applicability of SGFinder to small and medium-sized OO
                 programs in order to detect micro-architectures that are stable or
                 fault-prone.
                Perspective
                 Researchers, developers, and managers who want
                 information about undocumented micro-architectures found
                 in OO systems and their possible drawbacks.
                Context
                 Two open-source systems: the Rhino JavaScript/ECMAScript
                 interpreter, and the ArgoUML CASE tool.
Case Study - Objects
                Rhino: JavaScript/ECMAScript interpreter and compiler
                  • Defect data extracted from the [Eaddy et al.-2008] study
                  • Change data mined from the Rhino CVS logs.
                ArgoUML: UML CASE tool written in Java.
                  • Defect data from the ArgoUML customized Bugzilla repository
                  • Change data from the SVN logs.




             Class diagrams recovered using the Ptidej tool suite and its PADL
                meta-model.
             Size of micro-architectures of interest: 3, 4 and 5 nodes.
Case Study – Research Questions

                RQ1 – Does SGFinder scale up to medium-sized
                 programs and what kind of micro-architectures are
                 found in OO systems?

                RQ2 – Are there micro-architectures particularly
                 fault-prone or fault-free?

                RQ3 – Are there micro-architectures particularly stable
                 or change-prone?



Case Study – Analysis Method (I)
                Characterizing a micro-architecture mAi

                 Connectivity
                 • Total number of relations (nbRel)
                 • Number of associations (nbAssoc)
                 • Number of aggregations or compositions (nbAggr)
                 • Number of inheritances (nbInher)
                 • Number of “cyclic relations” (nbCycl)

                 Presence and distribution
                 • Number of releases in which mAi can be found (nbReleases)
                 • Number of embeddings of mAi in a given release (nbEmbeddings)
                 • Number of zones (nbZones): an embedding is counted only when
                   it does not share any common edge with a previously found embedding
                 • Number of classes appearing at least once in mAi (nbClasses)
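The zone count can be sketched as a single pass over the embeddings. This is a hedged reading of the one-line description, not the authors' implementation: `count_zones` is a hypothetical name, each embedding is represented by its collection of edges, and "previously found" is assumed to mean every earlier embedding, whether or not it was itself counted.

```python
def count_zones(embeddings):
    """Count embeddings that share no edge with any embedding seen before them.

    `embeddings` is an iterable of edge collections, one per embedding, in the
    order in which they were found; each edge is a (source, target) pair.
    """
    seen_edges = set()
    zones = 0
    for edges in embeddings:
        edges = set(edges)
        if seen_edges.isdisjoint(edges):
            zones += 1
        # Assumption: every earlier embedding blocks later ones, not only counted zones.
        seen_edges |= edges
    return zones

# Invented embeddings of a 3-node micro-architecture, for illustration only.
EMBEDDINGS = [
    [(1, 2), (2, 3)],
    [(2, 3), (3, 4)],   # shares edge (2, 3) with the first embedding
    [(3, 4), (4, 5)],   # shares edge (3, 4) with the second embedding
    [(6, 7), (7, 8)],   # shares nothing with earlier embeddings
]
```

Under the alternative reading, where only counted zones block later embeddings, the third embedding would also be counted. The slide does not say which reading applies.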
Case Study – Analysis Method (II)
                Answering the Research Questions

             RQ1
             Execution times and Characterization of micro-architectures
             RQ2 and RQ3
             1. For each mA, we compute the percentage of its classes that are
                 fault-prone (i.e., with one or more documented faults) or changed
                 between two subsequent releases.
             2. Rank micro-architectures from the most fault- (change-) prone to the
                 most fault- (change-) free using the precision measure of step 1.
             3. Inspect the top 10% and bottom 10% of ranked micro-architectures
                 for hints of what makes those micro-architectures outstanding.

             Use of descriptive statistics: the five-number summary
             (Min, Quartile 1, Median, Quartile 3, Max).
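Steps 1 and 2 and the five-number summary can be sketched as follows. All names and data are hypothetical: the class sets and the set of faulty classes are invented, and the quartile convention (the "inclusive" method of Python's `statistics.quantiles`) is an assumption, since the slides do not say which one was used.

```python
from statistics import quantiles

def precision(classes, flagged):
    """Percentage of a micro-architecture's classes that are flagged
    (fault-prone, or changed between two releases)."""
    classes = set(classes)
    return 100.0 * len(classes & set(flagged)) / len(classes)

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max); quartiles use the 'inclusive' method."""
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    return (min(values), q1, median, q3, max(values))

# Hypothetical data: classes per micro-architecture and the set of faulty classes.
MICRO_ARCHITECTURES = {
    "mA1": {"A", "B", "C", "D"},
    "mA2": {"A", "E", "F", "G"},
    "mA3": {"E", "F", "G", "H"},
}
FAULTY = {"A", "B", "C", "E"}

# Step 1: precision per micro-architecture. Step 2: most fault-prone first.
scores = {name: precision(cls, FAULTY) for name, cls in MICRO_ARCHITECTURES.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
summary = five_number_summary(list(scores.values()))
```

The same two functions apply to change proneness by passing the set of changed classes instead of `FAULTY`.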
Results – RQ1.1 SGFinder Applicability
             Computation times:
             From a few seconds to ... two and a half days.

             3-node micro-architectures: 20 seconds at most
             4-node micro-architectures: less than 30 minutes
             5-node micro-architectures: another story
                     Rhino: max < 8 min
                     ArgoUML: max > 60 h to retrieve the
                               82,877 different 5-node mAs in ArgoUML 0.17.5
                               and their 13,741,073,588 embeddings.

             Computation times depend mostly on the edge density of the considered class diagrams.
             More specifically, the single most time-costly factor is the presence of highly
             connected nodes, which leads to a near-combinatorial explosion of the
             number of embeddings.
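A back-of-the-envelope sketch shows why highly connected nodes are costly. A node with d neighbours, together with any k − 1 of them, always induces a connected sub-graph, so it alone contributes at least C(d, k − 1) connected k-subsets. The function name and the degrees used below are illustrative, not measurements from Rhino or ArgoUML.

```python
from math import comb

def hub_subsets_lower_bound(degree, k):
    """Lower bound on the connected k-subsets that contain a node of the given degree:
    the node plus any k - 1 of its neighbours (valid for k >= 2)."""
    return comb(degree, k - 1)

# Illustrative degrees only; growth is polynomial of order k - 1 in the degree.
growth = {d: hub_subsets_lower_bound(d, 5) for d in (10, 100, 1000)}
```

For k = 5, a tenfold increase in the degree multiplies this bound by roughly 10^4.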

Results – RQ1.2 Micro-architectures’ Description
              Number of different micro-architectures
              - 3 basic relations (association 1, aggregation 2, and inheritance 3)
              - 4 derived mixed cases (12, 13, 23, 123)
              - No relation
              This gives 8 possible connections between two nodes.

              For k × k pairs of nodes (including loops), if we do not take
              symmetry into account, there are at most
              $8^{k \times k-(k-1)} \times 7^{k-1}$ connected subgraphs of k nodes.

              For k = 5, in theory, possibly $22 \times 10^{21}$ different micro-architectures.

              In practice, the union of five-node micro-architectures from Rhino and
              ArgoUML contains only about $32 \times 10^{4}$ different micro-architectures.

              This suggests, unsurprisingly, that the set of micro-architectures is
              restricted to specific subgraphs.
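The order of magnitude quoted for k = 5 can be checked directly. The sketch reads the bound as 8 raised to k × k − (k − 1), times 7 raised to k − 1. That reading is an assumption about the flattened exponents, and it does reproduce the quoted figure of about 22 × 10^21. The function name is hypothetical.

```python
def upper_bound(k):
    """Bound quoted on the slide, read as 8^(k*k - (k-1)) * 7^(k-1)."""
    return 8 ** (k * k - (k - 1)) * 7 ** (k - 1)

bound_5 = upper_bound(5)          # 8^21 * 7^4
observed = 32 * 10 ** 4           # approximate count reported for Rhino and ArgoUML
ratio = bound_5 / observed        # how far practice stays below the theoretical bound
```

Under this reading, the observed set is smaller than the theoretical bound by a factor of more than 10^16.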
Results – RQ1.2 Micro-architectures’ Description

              [Table: characterization of the micro-architectures, averages over the two systems.]
Results – RQ1.2 Micro-architectures’ Description




              [Figure: illustration of the most connected 5-node micro-architectures.]
Results – RQ2 Fault proneness




               Figure 4. Example of a particularly faulty
               (precision = 81) 4-node micro-architecture


             (Almost) every class “talks” to every other class. In
             fact, when we retrieve the set of micro-architectures
             with three cyclic relations and where every class
             is connected to every other class, we obtain a
             5-number summary (28, 44, 56, 67, 81) with values mostly
             higher than those from the top 10% most faulty
             micro-architectures (42, 45, 48, 54, 83).
Results – RQ3 Change proneness




             Figure 5. Example of a particularly stable
             (precision = 30) 4-node micro-architecture

             It presents a “cascade”-like pattern, and we were able
             to retrieve many other similar organizations with
             mostly low changeability.
Threats to Validity
             Threats to conclusion validity concern the relationship between the
             treatment and the outcome.
             We do not claim any causal relation between the micro-architectures
             and unwanted or desired features.

             Threats to external validity concern the possibility of generalizing our
             results.
             The study is limited to two systems: Rhino and ArgoUML.
             These threats are mitigated because the two systems
                      correspond to different domains and applications,
                      have different sizes,
                      and are developed by different teams.



Conclusion
                I presented SGFinder, an algorithm and a tool to support
                 micro-architecture discovery based on a reformulation as a sub-graph
                 mining problem.
                SGFinder uses an effective enumeration technique that allows us to
                 infer instances of micro-architectures and to study
                 micro-architecture properties such as stability or fault proneness.
                We used SGFinder on 8 releases of 2 well-known Java applications:
                 Rhino and ArgoUML.
                We provided insight into the kind of micro-architectures (of 3, 4,
                 or 5 nodes) found in OO systems.
                We characterized the most and least faulty/changed
                 micro-architectures and reported some of the most interesting
                 micro-architectures with respect to their connectivity and frequency.


Future and Ongoing Work
             Despite the encouraging results, more work is needed to

             (i) further optimize the proposed algorithm and gain in
                 scalability, and/or investigate the partitioning of large OO
                 diagrams into subsystems

             (ii) define heuristics and rules able to classify micro-architectures
                  discovered in a system under development,

             (iii) go beyond the micro-architectures and analyze, similarly to
                   design patterns, the roles played by participant classes, and

             (iv) provide qualitative analysis and validation of the findings. For
                  generalization purposes, our plan for future work also includes
                  replication of the study on a larger system.

References
                [Guéhéneuc and Antoniol-2008] Y.-G. Guéhéneuc and G. Antoniol,
                 “DeMIMA: A multi-layered framework for design pattern identification,”
                 Transactions on Software Engineering (TSE), vol. 34, no. 5, pp. 667–684,
                 September 2008.
                [Tonella and Antoniol-01] P. Tonella and G. Antoniol, “Inference of object
                 oriented design patterns,” Journal of Software Maintenance - Research and
                 Practice, vol. 13, no. 5, pp. 309–330, September-October 2001.
                [Rich and Waters-90] C. Rich and R. C. Waters, The Programmer’s Apprentice,
                 1st ed. ACM Press Frontier Series and Addison-Wesley, January 1990.
                [Kullbach and Winter-99] B. Kullbach and A. Winter, “Querying as an
                 enabling technology in software reengineering,” in Proceedings of the 3rd
                 Conference on Software Maintenance and Reengineering, P. Nesi and C. Verhoef,
                 Eds. IEEE Computer Society Press, March 1999, pp. 42–50.
                [Niere et al.-2002] J. Niere, W. Schäfer, J. P. Wadsack, L. Wendehals, and J.
                 Welsh, “Towards pattern-based design recovery,” in Proceedings of the 24th
                 International Conference on Software Engineering, M. Young and J. Magee, Eds.
                 ACM Press, May 2002, pp. 338–348.
                [Tsantalis et al.-2006] N. Tsantalis, A. Chatzigeorgiou, G. Stephanides, and S.
                 Halkidis, “Design pattern detection using similarity scoring,” Transactions on
                 Software Engineering, vol. 32, no. 11, November 2006.
                [Wernicke and Rasche-2006] S. Wernicke and F. Rasche, “A tool for fast
                 network motif detection,” Bioinformatics, vol. 22, pp. 1152–1153, 2006.
                [Eaddy et al.-2008] M. Eaddy, T. Zimmermann, K. D. Sherwood, V. Garg, G. C.
                 Murphy, N. Nagappan, and A. V. Aho, “Do crosscutting concerns cause
                 defects?” IEEE Transactions on Software Engineering, vol. 34, no. 4, pp. 497–515,
                 2008.
Thanks for the attention




                     Questions?


CSMR11b.ppt

  • 1. Sub-graph Mining: Identifying Micro-architectures in Evolving Object-oriented Software Ahmed Belderrar, Segla Kpodjedo, Yann-Gaël Guéhéneuc, Giuliano Antoniol, Philippe Galinier CSMR 2011 Oldenburg
  • 2. Context & Motivation Coding activity Active Research Design Good? OO Patterns application Known Anti Bad? Patterns New organizations of classes and Almost Nothing relations Micro-Architectures UnKnown Ugly? Preliminary study exploring the feasibility and prospects of an approach CSMR 2011 aimed at building empirical knowledge about organizations of classes. Oldenburg
  • 3. Related work (I) In general, the proposed approaches rely on a library of known motifs and use architectural recovery techniques based on sub-graph matching or on motifs properties.  Constraint Programming to recognize plans [Rich and Waters-90]  Explanation-based constraint programming [Guéhéneuc and Antoniol-2008]  Use of queries [Kullbach and Winter-99]  Generic fuzzy reasoning nets [Niere et al.-2002].  Graph-transformation techniques [Tsantalis et al.-2006] CSMR 2011 Oldenburg
  • 4. Related work (II)  Similar to our approach: [Tonella and Antoniol-01] in which concept analysis was used to infer domain-specific design patterns. Main Difference : “maximal collections of class sets having the same relations between them” VS isomorphic induced subgraphs of fixed size  improves on scalability via an efficient sub-graph enumeration eliminates the need of manual inspection of “concept lattice”.  Inspiration from sub-graph mining algorithms  Sub-graph discovery problem : discover all recurring sub-graphs, or only the most frequent ones. Limitations of previous works (restriction to undirected graphs, or to sub-graphs with limited number of labels on their arcs [Wernicke and Rasche-2006] )  Our own tool SGFinder CSMR 2011 Oldenburg
  • 5. Our approach – Introduction  Considering a class diagram as a labeled graph, we define a micro- architecture as the connected subgraph induced by a given subset of classes. Pre-requisite  We model a class diagram as a labeled graph, with nodes being classes (and interfaces) and arcs representing the relations among classes.  Arc label  type of relation between two classes (e.g., association, aggregation, or inheritance). Example 1: Association 2: Aggregation/Composition 3: Inheritance CSMR 2011 Figure 1 A sub-graph G(V,E) extracted from Rhino Oldenburg (release 1.7R1).
  • 6. Our approach – Definitions  Labeled graph: given a set of labels L, it is defined by a triplet G = (V,A, l) where V is the vertex set of G, A ⊆ V × V the arc set, and l : A  L the labeling function.  Connected Graph: There is a chain between any two vertices.  Isomorphic Graphs: There exists a perfect one-to-one correspondence between the vertices and edges of those graphs.  Induced Subgraph: GX induced by X (X⊆V) is a graph which vertex set is X and which arc set contains all the arcs of G that link two vertices of X.  Embedding of a sub-graph H of G: set X of vertices such that GX is isomorphic to H. Example Figure 2 A sub-graph G=(V,E) extracted from Rhino (release 1.7R1) with two embeddings CSMR 2011 of size three {7, 1, 3} and {2, 4, 5}. Oldenburg
  • 7. Our approach – Algorithm (I)  Inputs A graph G and the size k of the sought sub-graphs  Output The set of induced subgraphs with their embeddings.  Principle Generate all k-subsets X of V such that GX is connected. Generates only a limited number of k-subsets of V  Underlying idea if GX is a connected induced sub-graph of G, then X can not contain two vertices x and y such that dist(x, y) ≥ k. CSMR 2011 Oldenburg
  • 8. Our approach – Algorithm (II) SGFinder Build k disjoint subsets of vertices, denoted by L0...Lk−1 (line 3-5), where each Li contains the vertices which distance to x equals i (L0 = {x}) To build X (line 6, developed in the CompleteSG() procedure), we choose the unique vertex present in L0 (namely x), plus one or more vertices chosen in L1, plus one or more vertices chosen in L2, and so on until we reach a total number k of vertices. CSMR 2011 Oldenburg
  • 9. Our approach – Algorithm (III)  Illustration Figure 1 A sub-graph G(V,E) extracted from Rhino (release 1.7R1). - Let us choose k = 4, - And let us assume that the first chosen vertex is x = 0. 1. We have L0 = {0}, L1 = {3}, L2 = {1, 2}, and L3 = {4, 6, 7}. 2. There are now seven possibilities for X: {0, 3, 1, 2}, {0, 3, 1, 4}, {0, 3, 1, 6}, {0, 3, 1, 7}, {0, 3, 2, 4}, {0, 3, 2, 6}, {0, 3, 2, 7}. 3. Some of the induced sub-graphs are not connected, e.g., G{0,3,1,4} and are discarded. CSMR 2011 Oldenburg
  • 10. Case Study  Goal Identify in OO systems, micro-architectures with desirable or harmful properties.  Quality focus Verify applicability of SGFinder to small and medium size OO programs in order to detect micro-architectures that are stable or fault-prone.  Perspective Researchers, developers, and managers, who want to get information about undocumented micro-architectures found in OO systems and possible related drawbacks.  Context Two open-source systems: the Rhino JavaScript/ECMAScript interpreter, and the ArgoUML CASE tool. CSMR 2011 Oldenburg
  • 11. Case Study - Objects  Rhino: JavaScript/ECMAScript interpreter and compiler • Defect data extracted from [Eaddy et al-2008] study • Change data mined from the Rhino CVS logs.  ArgoUML: UML CASE tool written in Java. • Defect data from the ArgoUML customized Bugzilla repository • Change data from the SVN logs. Class diagrams recovered using the Ptidej tool suite and its PADL meta-model. Size of micro-architectures of interest: 3, 4 and 5 nodes. CSMR 2011 Oldenburg
  • 12. Case Study – Research Questions
      RQ1 – Does SGFinder scale up to medium-size programs, and what kind of micro-architectures are found in OO systems?
      RQ2 – Are there micro-architectures that are particularly fault-prone or fault-free?
      RQ3 – Are there micro-architectures that are particularly stable or change-prone?
  • 13. Case Study – Analysis Method (I)
      Characterizing a micro-architecture mAi
      Connectivity
        • Total number of relations (nbRel)
        • Number of associations (nbAssoc)
        • Number of aggregations or compositions (nbAggr)
        • Number of inheritances (nbInher)
        • Number of "cyclic relations" (nbCycl)
      Presence and distribution
        • Number of releases in which mAi can be found (nbReleases)
        • Number of embeddings of mAi in a given release (nbEmbeddings)
        • Number of zones (nbZones): an embedding is counted only when it does not share any edge with a previously found embedding
        • Number of classes appearing at least once in mAi (nbClasses)
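The difference between nbEmbeddings and nbZones can be illustrated with a short Python sketch. It follows one literal reading of the definition above (an embedding counts as a zone only if none of its edges appears in any embedding found before it), which is an assumption; the function name `count_zones`, the edge-set representation, and the sample data are hypothetical. The result depends on the order in which embeddings are found.

```python
def count_zones(embeddings):
    """Count the embeddings that share no edge with any embedding seen earlier."""
    seen_edges = set()
    zones = 0
    for edges in embeddings:
        if seen_edges.isdisjoint(edges):
            zones += 1
        # Edges are recorded whether or not the embedding was counted.
        seen_edges.update(edges)
    return zones

# Hypothetical embeddings, each given as a set of (source, target) edges.
embeddings = [
    {("A", "B"), ("B", "C")},
    {("B", "C"), ("C", "D")},  # shares the edge B -> C with the first embedding
    {("X", "Y"), ("Y", "Z")},
]
nb_embeddings = len(embeddings)
nb_zones = count_zones(embeddings)
print(nb_embeddings, nb_zones)
```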
  • 14. Case Study – Analysis Method (II)
      Answering the Research Questions
      RQ1: Execution times and characterization of micro-architectures.
      RQ2 and RQ3:
      1. For each mA, compute the percentage of its classes that are fault-prone (i.e., with one or more documented faults) or changed between two subsequent releases.
      2. Rank the micro-architectures from the most fault-prone (change-prone) to the most fault-free (change-free) using the precision measure of step 1.
      3. Inspect the top 10% and bottom 10% of the ranked micro-architectures for hints of what makes those micro-architectures outstanding.
      Use of descriptive statistics: the five-number summary (Min, Quartile 1, Median, Quartile 3, Max).
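The three steps and the five-number summary can be sketched in Python as follows. This is an illustrative sketch only: the class sets, the set of faulty classes, and all function names are hypothetical, and the slides do not say which quartile convention was used, so the "inclusive" method of `statistics.quantiles` is an assumption.

```python
from statistics import quantiles

def precision(classes, flagged):
    """Percentage of a micro-architecture's classes that are flagged
    (fault-prone or changed)."""
    return 100.0 * len(classes & flagged) / len(classes)

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using inclusive quartiles."""
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    return (min(values), q1, median, q3, max(values))

# Hypothetical data: the classes participating in each micro-architecture.
classes_of = {
    "mA1": {"A", "B", "C", "D"},
    "mA2": {"A", "B"},
    "mA3": {"C", "E", "F", "G"},
    "mA4": {"E", "F"},
    "mA5": {"A", "E"},
}
faulty = {"A", "B", "C"}

# Step 1: precision per micro-architecture.
scores = {name: precision(cls, faulty) for name, cls in classes_of.items()}
# Step 2: rank from the most to the least fault-prone.
ranked = sorted(scores, key=scores.get, reverse=True)
# Step 3: select the top and bottom 10% (at least one element each).
cut = max(1, len(ranked) // 10)
top, bottom = ranked[:cut], ranked[-cut:]

summary = five_number_summary(list(scores.values()))
print(ranked, top, bottom, summary)
```

The same code applies to change proneness by passing the set of changed classes instead of `faulty`.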
  • 15. Results – RQ1.1 SGFinder Applicability
      Computation times range from a few seconds to two and a half days.
      - 3-node micro-architectures: 20 seconds at most
      - 4-node micro-architectures: less than 30 minutes
      - 5-node micro-architectures are another story: the maximum is under 8 minutes for Rhino, but more than 60 hours for ArgoUML, to retrieve the 82,877 different 5-node micro-architectures in ArgoUML 0.17.5 and their 13,741,073,588 embeddings.
      Computation times depend mostly on the edge density of the class diagrams considered. More specifically, the single most time-costly factor is the presence of highly connected nodes, which lead to a near-combinatorial explosion of the number of embeddings.
  • 16. Results – RQ1.2 Micro-architectures’ Description
      Number of different micro-architectures:
      - 3 basic relations (association 1, aggregation 2, and inheritance 3)
      - 4 derived mixed cases (12, 13, 23, 123)
      - No relation
      This gives 8 possible connections between two nodes.
      For the k × k pairs of nodes (including loops), and without taking symmetry into account, there are at most 8^(k×k−(k−1)) × 7^(k−1) connected subgraphs of k nodes.
      For k = 5, this gives, in theory, about 22 × 10^21 different micro-architectures.
      In practice, the union of the 5-node micro-architectures from Rhino and ArgoUML contains only about 32 × 10^4 different micro-architectures.
      This suggests, unsurprisingly, that the set of micro-architectures is restricted to specific subgraphs.
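The theoretical bound for k = 5 can be checked with a few lines of Python. The function name `max_micro_architectures` is hypothetical. One plausible reading of the formula, not stated on the slide, is that k−1 pairs must carry one of the 7 actual relations to keep the subgraph connected, while the remaining k×k−(k−1) pairs may take any of the 8 connections.

```python
def max_micro_architectures(k, connections=8):
    """Evaluate 8^(k*k - (k-1)) * 7^(k-1), the upper bound given on the slide."""
    free_pairs = k * k - (k - 1)   # pairs that may take any of the 8 connections
    forced_pairs = k - 1           # pairs assumed to carry an actual relation
    return connections ** free_pairs * (connections - 1) ** forced_pairs

bound_5 = max_micro_architectures(5)
print(f"{bound_5:.3e}")            # order of magnitude of the bound for k = 5

observed = 32 * 10**4              # approximate count reported on the slide
print(f"{bound_5 / observed:.1e}") # how much larger the bound is than what was observed
```

The bound is a loose over-estimate by construction, since it ignores symmetry, which is consistent with the observed count being many orders of magnitude smaller.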
  • 17. Results – RQ1.2 Micro-architectures’ Description: Characterization (averages over the two systems)
  • 18. Results – RQ1.2 Micro-architectures’ Description: Illustration of the most connected 5-node micro-architectures
  • 19. Results – RQ2 Fault proneness
      Figure 4. Example of a particularly faulty (precision = 81) 4-node micro-architecture, in which (almost) every class "talks" to every other class.
      In fact, when we retrieve the set of micro-architectures with three cyclic relations and where every class is connected to every other class, we obtain a five-number summary (28, 44, 56, 67, 81) whose values are mostly higher than those of the top 10% most faulty micro-architectures (42, 45, 48, 54, 83).
  • 20. Results – RQ3 Change proneness
      Figure 5. Example of a particularly stable (precision = 30) 4-node micro-architecture. It presents a "cascade"-like pattern, and we were able to retrieve many other similar organizations with mostly low changeability.
  • 21. Threats to Validity
      Threats to conclusion validity concern the relationship between the treatment and the outcome. We do not claim any causal relation between the micro-architectures and unwanted or desired features.
      Threats to external validity concern the possibility of generalizing our results. The study is limited to two systems, Rhino and ArgoUML. However, these threats are mitigated because the two systems belong to different domains and applications, have different sizes, and are developed by different teams.
  • 22. Conclusion
      We presented SGFinder, an algorithm and a tool to support micro-architecture discovery, based on a reformulation as a sub-graph mining problem.
      SGFinder uses an effective enumeration technique that allows us to infer instances of micro-architectures and to study micro-architecture properties such as stability or fault proneness.
      We used SGFinder on 8 releases of 2 well-known Java applications: Rhino and ArgoUML.
      We provided insight into the kinds of micro-architectures (of 3, 4, or 5 nodes) found in OO systems.
      We characterized the most and least faulty/changed micro-architectures and reported some of the most interesting micro-architectures with respect to their connectivity and frequency.
  • 23. Future and Ongoing Work
      Despite the encouraging results, more work is needed to:
      (i) further optimize the proposed algorithm to gain in scalability, and/or investigate the partitioning of large OO diagrams into subsystems;
      (ii) define heuristics and rules able to classify micro-architectures discovered in a system under development;
      (iii) go beyond the micro-architectures and analyze, similarly to design patterns, the roles played by participant classes; and
      (iv) provide qualitative analysis and validation of the findings.
      For generalization purposes, our plan for future work also includes replicating the study on a larger system.
  • 24. References
      [Guéhéneuc and Antoniol-2008] Y.-G. Guéhéneuc and G. Antoniol, “DeMIMA: A multi-layered framework for design pattern identification,” Transactions on Software Engineering (TSE), vol. 34, no. 5, pp. 667–684, September 2008.
      [Tonella and Antoniol-01] P. Tonella and G. Antoniol, “Inference of object oriented design patterns,” Journal of Software Maintenance - Research and Practice, vol. 13, no. 5, pp. 309–330, September-October 2001.
      [Rich and Waters-90] C. Rich and R. C. Waters, The Programmer’s Apprentice, 1st ed. ACM Press Frontier Series and Addison-Wesley, January 1990.
      [Kullbach and Winter-99] B. Kullbach and A. Winter, “Querying as an enabling technology in software reengineering,” in Proceedings of the 3rd Conference on Software Maintenance and Reengineering, P. Nesi and C. Verhoef, Eds. IEEE Computer Society Press, March 1999, pp. 42–50.
      [Niere et al.-2002] J. Niere, W. Schäfer, J. P. Wadsack, L. Wendehals, and J. Welsh, “Towards pattern-based design recovery,” in Proceedings of the 24th International Conference on Software Engineering, M. Young and J. Magee, Eds. ACM Press, May 2002, pp. 338–348.
  • 25. References
      [Tsantalis et al.-2006] N. Tsantalis, A. Chatzigeorgiou, G. Stephanides, and S. Halkidis, “Design pattern detection using similarity scoring,” Transactions on Software Engineering, vol. 32, no. 11, November 2006.
      [Wernicke and Rasche-2006] S. Wernicke and F. Rasche, “A tool for fast network motif detection,” Bioinformatics, vol. 22, pp. 1152–1153, 2006.
      [Eaddy et al.-2008] M. Eaddy, T. Zimmermann, K. D. Sherwood, V. Garg, G. C. Murphy, N. Nagappan, and A. V. Aho, “Do crosscutting concerns cause defects?” IEEE Transactions on Software Engineering, vol. 34, no. 4, pp. 497–515, 2008.
  • 26. Thanks for your attention. Questions?