Statistical distributions of software metrics: do
                      they matter?

                                     Israel Herraiz

                          Technical University of Madrid


                         israel.herraiz@upm.es


                               Grab these slides from
     http://slideshare.net/herraiz/statistical-distributions-of-metrics




Israel Herraiz, UPM       Statistical distributions of software metrics: do they matter?   1/17
Outline



1    Some background


2    Statistical properties of software metrics


3    Evidence of impact on quality


4    Summary of findings and further work




1    Some background

A (not so) long time ago...



Statistical distribution of software metrics
Software size follows a double Pareto distribution
“Towards a theoretical model for software growth” MSR 2007

More recently
Not only size, but some OO metrics too (and some complexity metrics)
“On the Statistical Distribution of Object-Oriented System Properties” WETSoM 2012




OK, but what is that double Pareto thing?
[Figure: log-log plot of P[X > x] against SLOC, showing three curves: Data, Double Pareto, and Lognormal]

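To give a feel for the shape in the figure, here is a minimal sketch that compares the tail of a pure lognormal with a curve that follows the lognormal up to a threshold and a power law beyond it. This is an illustration only and not the double Pareto formula from the cited papers; the function names and the values of MU, SIGMA, ALPHA and X_MIN are made up for the example.

```python
import math

def lognormal_ccdf(x, mu, sigma):
    """P[X > x] for a lognormal with log-mean mu and log-std sigma."""
    z = (math.log(x) - mu) / (sigma * math.sqrt(2.0))
    return 0.5 * math.erfc(z)

def spliced_ccdf(x, mu, sigma, alpha, x_min):
    """Lognormal up to x_min, power law above it, continuous at x_min."""
    if x <= x_min:
        return lognormal_ccdf(x, mu, sigma)
    return lognormal_ccdf(x_min, mu, sigma) * (x / x_min) ** (-alpha)

# Hypothetical parameters, chosen only for illustration.
MU, SIGMA, ALPHA, X_MIN = 4.0, 1.2, 1.5, 1000.0

for sloc in (1, 100, 1000, 10000, 100000):
    ln = lognormal_ccdf(sloc, MU, SIGMA)
    sp = spliced_ccdf(sloc, MU, SIGMA, ALPHA, X_MIN)
    print(f"{sloc:>7}  lognormal={ln:.2e}  with power-law tail={sp:.2e}")
```

Below the threshold the two curves coincide; above it the power-law tail decays much more slowly, which is the kind of departure from the lognormal that the plot shows for large files.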
But does it matter?




Most of the files are on the lognormal side
[Figure: bar chart of % Files by language (C, C++, Java, Python, Lisp)]

But the power law minority matters a lot
[Figure: bar chart of % SLOC by language (C, C++, Java, Python, Lisp)]

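The two quantities contrasted on this slide can be computed as follows. This is a minimal sketch with made-up file sizes; the function name tail_shares and the choice to count files exactly at the threshold as being above it are assumptions.

```python
def tail_shares(sizes, x_min):
    """Return (fraction of files, fraction of total size) at or above x_min."""
    tail = [s for s in sizes if s >= x_min]
    return len(tail) / len(sizes), sum(tail) / sum(sizes)

# Made-up sizes in SLOC: 99 small files and one very large one.
sizes = [100] * 99 + [6600]
file_share, sloc_share = tail_shares(sizes, x_min=1000)
print(f"{file_share:.0%} of the files hold {sloc_share:.0%} of the SLOC")
```

A handful of very large files is enough to hold a large share of the code, even when almost every file is small.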
Large files have a large impact

Size estimation models
Some software size estimation models are based on the log-normality of size
metrics. These models systematically underestimate the size of software.

[Figure: RE against SLOC, one panel each for C, C++, Java, and Python]

“On the distribution of source code file sizes” ICSOFT 2011

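A toy illustration of why a lognormal assumption can underestimate size: fit a lognormal to the logarithms of the file sizes, predict the mean file size from the fit, and compare with the actual mean. This is a sketch under stated assumptions, not the estimation model evaluated in the paper; the data are made up, and the sign convention for the relative error (predicted minus actual, over actual) is assumed.

```python
import math
from statistics import fmean, pstdev

# Made-up sizes in SLOC: a body of ordinary files plus two very large ones.
body = [50, 80, 100, 120, 150, 200, 250, 300, 400, 500] * 10
tail = [20000, 50000]
sizes = body + tail

# Fit a lognormal by taking the mean and standard deviation of log sizes.
logs = [math.log(s) for s in sizes]
mu, sigma = fmean(logs), pstdev(logs)

# Mean of a lognormal with parameters mu and sigma.
predicted_mean = math.exp(mu + sigma ** 2 / 2)
actual_mean = fmean(sizes)
relative_error = (predicted_mean - actual_mean) / actual_mean

print(f"predicted mean size: {predicted_mean:.0f} SLOC")
print(f"actual mean size:    {actual_mean:.0f} SLOC")
print(f"relative error:      {relative_error:.0%}")
```

With these numbers the lognormal fit predicts a mean well below the actual one, because the two large files contribute far more to the total than a lognormal tail allows for.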
2    Statistical properties of software metrics

Parameters of the statistical distribution

Power law parameters: λ and xmin
Transition from lognormal to power law
[Figure: log-log plot of P[X > x] against SLOC, showing three curves: Data, Double Pareto, and Lognormal]

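The slides do not say how the power law parameters were fitted. As one possible approach, the sketch below uses the standard maximum-likelihood estimator for the exponent of a continuous power law with density proportional to x^(-alpha), given a known xmin: alpha = 1 + n / sum(ln(x_i / xmin)). Whether the slide's λ is parameterized the same way is an assumption, and the function name and synthetic data are made up.

```python
import math

def fit_power_law_exponent(values, x_min):
    """MLE of alpha for a density proportional to x**(-alpha) on x >= x_min."""
    tail = [v for v in values if v >= x_min]
    log_sum = sum(math.log(v / x_min) for v in tail)
    if log_sum == 0:
        raise ValueError("need at least one value strictly above x_min")
    return 1.0 + len(tail) / log_sum

# Synthetic tail: evenly spaced quantiles of a power law with alpha = 2.5.
TRUE_ALPHA, X_MIN, N = 2.5, 1000.0, 1000
synthetic = [X_MIN * (1 - (i + 0.5) / N) ** (-1 / (TRUE_ALPHA - 1))
             for i in range(N)]

# Values below x_min belong to the lognormal side and are ignored by the fit.
estimate = fit_power_law_exponent(synthetic + [10.0, 50.0, 200.0], X_MIN)
print(f"estimated exponent: {estimate:.3f}")
```

Choosing xmin itself is a separate problem that this sketch does not address; here it is simply taken as given.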
3    Evidence of impact on quality

Probability of finding defects


Probability of finding defects
We have seen that files above xmin account for 40% of total size, while
being only about 1% of the files.
What about defects? Probability of finding defects in three software
projects (using CYCLO as metric)

                      Project             Below xmin               Above xmin
                      Apache                   .4178                   .7708
                      OpenIntents              .2500                   .7500
                      Zxing                    .2143                   .4161

* Data extracted from “ReLink: Recovering Links between Bugs and Changes” FSE
2011.
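A minimal sketch of how such probabilities could be computed from file-level data, followed by the above/below ratios implied by the table. The data layout (pairs of metric value and a defect flag), the toy records, and the choice to count files exactly at xmin as above it are assumptions; only the three pairs of probabilities come from the table.

```python
import math

def defect_probability(files, x_min):
    """Fraction of files with at least one defect, below and above x_min.

    files: iterable of (metric_value, has_defect) pairs (hypothetical layout).
    """
    below = [bool(d) for m, d in files if m < x_min]
    above = [bool(d) for m, d in files if m >= x_min]

    def share(flags):
        return sum(flags) / len(flags) if flags else math.nan

    return share(below), share(above)

# Made-up records: (CYCLO value, did the file have a defect?)
toy_files = [(2, False), (3, False), (5, True), (7, False), (9, False),
             (40, True), (55, False), (80, True)]
p_below, p_above = defect_probability(toy_files, x_min=30)
print(f"toy data: below={p_below:.2f} above={p_above:.2f}")

# Ratios computed from the probabilities reported in the table.
table = {"Apache": (0.4178, 0.7708),
         "OpenIntents": (0.2500, 0.7500),
         "Zxing": (0.2143, 0.4161)}
ratios = {name: above / below for name, (below, above) in table.items()}
for name, ratio in ratios.items():
    print(f"{name}: above/below = {ratio:.2f}")
```

In all three projects the reported probability of finding a defect is higher above xmin than below it.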



Probability of finding defects




Probability of finding defects (normalized metrics)
Using CYCLO / WMC as metric (cyclomatic complexity per LOC)

                      Project             Below xmin               Above xmin
                      Apache                   .4159                   .6296
                      OpenIntents              .2813                   .5417
                      Zxing                    .3181                   .2389




Probability of finding defects

Defects density (only pre-release defects)
Using Number of Methods and number of pre-release defects per LOC

[Figure: two panels of pre-release defect density, one for files below xmin and one for files above xmin]

                      Below xmin: Avg. Dens. = .2685          Above xmin: Avg. Dens. = .4565

* Data obtained from “Predicting Defects for Eclipse” PROMISE 2007

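The average densities on the density slides could be computed along these lines. This is a sketch under assumptions: the record layout (metric value, defect count, LOC) and the toy data are made up, and the slides do not say whether the average is a mean of per-file densities (used here) or total defects over total LOC.

```python
from statistics import fmean

def average_density(files, x_min):
    """Mean defects-per-LOC of the files below and above x_min.

    files: iterable of (metric_value, defects, loc) tuples (hypothetical layout).
    """
    below = [d / loc for m, d, loc in files if m < x_min and loc > 0]
    above = [d / loc for m, d, loc in files if m >= x_min and loc > 0]
    return (fmean(below) if below else float("nan"),
            fmean(above) if above else float("nan"))

# Made-up records: (metric value, number of defects, lines of code)
toy = [(4, 0, 100), (6, 1, 200), (9, 0, 50), (35, 3, 300), (60, 4, 200)]
dens_below, dens_above = average_density(toy, x_min=30)
print(f"below xmin: {dens_below:.4f}  above xmin: {dens_above:.4f}")
```

Dividing by LOC matters here: without it, larger files would show more defects simply because they contain more code.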
Probability of finding defects

Defects density (only post-release defects)
Using Number of Methods and number of post-release defects per LOC

[Figure: two panels of post-release defect density, one for files below xmin and one for files above xmin]

                      Below xmin: Avg. Dens. = .1437          Above xmin: Avg. Dens. = .2690

Probability of finding defects
Defects density (pre + post-release defects)
Using CYCLO/SLOC and number of total defects per LOC

[Figure: two log-log panels, one for files below xmin and one for files above xmin; the y-axis of the left panel is Pr(X ≥ x) against x]

       Below xmin: Avg. Dens. = .3335 (>9000 files)          Above xmin: Avg. Dens. = .7747 (364 files)

4    Summary of findings and further work

Summary and further work

Summary of preliminary findings
    Some metrics have a transition from lognormal to power law
    Clear relation between normalized metrics and defect density
    Although the threshold might not be perfect (e.g., you might find a
    high defect density in a file on the lower side), it greatly reduces
    the search space for potentially problematic files

Further work
    Verify in more projects
        Do you have defects data at the file level?
    Find an explanation for the transition and its influence on quality
    How do the statistical parameters change over time? Do defects
    evolve accordingly?
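
The claim about the reduced search space can be checked with the figures reported for the pre + post-release case (364 files above xmin, more than 9000 below, average densities .7747 and .3335). The arithmetic below uses only those numbers; since the slide gives ">9000", the computed share is an upper bound, and the variable names are made up.

```python
files_above = 364        # files above xmin, from the slide
files_below_min = 9000   # the slide says ">9000", so this is a lower bound

# Upper bound on the share of files that lie above xmin.
share_above_max = files_above / (files_above + files_below_min)

# Ratio of the reported average defect densities (above / below).
density_ratio = 0.7747 / 0.3335

print(f"files above xmin: less than {share_above_max:.1%} of all files")
print(f"average defect density ratio (above/below): {density_ratio:.2f}")
```

Under these figures, inspecting only the files above xmin means looking at a few percent of the files, in a group whose average defect density is more than twice that of the rest.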
