Statistical Distribution of Metrics

Statistical distributions of software metrics: do they matter?

Israel Herraiz

Technical University of Madrid

israel.herraiz@upm.es

Grab these slides from
http://slideshare.net/herraiz/statistical-distributions-of-metrics

Israel Herraiz, UPM Statistical distributions of software metrics: do they matter? 1/17

Outline

1 Some background

2 Statistical properties of software metrics

3 Evidence of impact on quality

4 Summary of findings and further work


1 Some background





A (not so) long time ago...

Statistical distribution of software metrics
Software size follows a double Pareto distribution
Towards a theoretical model for software growth, MSR 2007

More recently
Not only size, but some OO metrics too (and some complexity metrics)
On the Statistical Distribution of Object-Oriented System Properties, WETSoM 2012


OK, but what is that double Pareto thing?
[Figure: P[X > x] versus SLOC on logarithmic axes, showing the data together with the double Pareto and lognormal curves]
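The quantity on the vertical axis, P[X > x], is the complementary cumulative distribution of file size. A minimal sketch of how such a curve could be computed from a list of SLOC counts follows. The function name `empirical_ccdf` and the sample values are illustrative assumptions, not the tooling used for the slides.

```python
from bisect import bisect_right


def empirical_ccdf(sizes):
    """Return (x, P[X > x]) pairs for each distinct value in sizes."""
    ordered = sorted(sizes)
    n = len(ordered)
    points = []
    for x in sorted(set(ordered)):
        # bisect_right counts the observations <= x, so the rest are > x.
        greater = n - bisect_right(ordered, x)
        points.append((x, greater / n))
    return points


# Hypothetical SLOC counts, for illustration only.
sloc = [10, 10, 25, 40, 40, 40, 120, 900, 15000, 80000]
for x, p in empirical_ccdf(sloc):
    print(x, p)
```

Plotting these pairs with both axes on a logarithmic scale gives the kind of view shown on the slide, where a power law tail appears as a roughly straight line.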


But does it matter?

Most of the files are on the lognormal side
But the power law minority matters a lot

[Figure: two panels by language (C, C++, Java, Python, Lisp), % Files (left) and % SLOC (right)]
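The two percentages compared on this slide can be sketched as follows, assuming the power law side is taken to be the files whose size is at least some threshold `xmin`. The function name `tail_shares` and the choice of `>=` rather than `>` are assumptions made for illustration.

```python
def tail_shares(sizes, xmin):
    """Return (% of files at or above xmin, % of total SLOC held by those files)."""
    # Assumption: a file belongs to the power law side when its size >= xmin.
    tail = [s for s in sizes if s >= xmin]
    pct_files = 100.0 * len(tail) / len(sizes)
    pct_sloc = 100.0 * sum(tail) / sum(sizes)
    return pct_files, pct_sloc


# Hypothetical data: nine small files and one large one.
print(tail_shares([100] * 9 + [900], 500))
```

In this toy input a single file out of ten holds half of the code, which is the same kind of imbalance the slide points at.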


Large files have a large impact

Size estimation models
Some software size estimation models are based on the log-normality of size
metrics. These models systematically underestimate the size of software.

[Figure: RE versus SLOC, one panel each for C, C++, Java and Python]

On the distribution of source code file sizes, ICSOFT 2011

2 Statistical properties of software metrics





Parameters of the statistical distribution

Power law parameters: λ and xmin
Transition from lognormal to power law
[Figure: P[X > x] versus SLOC on logarithmic axes, showing the data together with the double Pareto and lognormal curves]
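For a given xmin, the exponent of the power law tail can be estimated by maximum likelihood. The sketch below uses the standard continuous estimator, 1 + n / Σ ln(x_i / xmin), for a density proportional to x^(-α). The slides call the exponent λ, and their parametrization may differ from this one; the function name `power_law_exponent` is also an assumption. Choosing xmin itself is a separate step that is not shown here.

```python
import math


def power_law_exponent(sizes, xmin):
    """Maximum-likelihood exponent of a continuous power law tail (x >= xmin)."""
    # Assumption: the tail is every observation at or above xmin.
    tail = [s for s in sizes if s >= xmin]
    if not tail:
        raise ValueError("no observations at or above xmin")
    log_sum = sum(math.log(s / xmin) for s in tail)
    if log_sum == 0:
        raise ValueError("all tail observations equal xmin")
    return 1.0 + len(tail) / log_sum


# Hypothetical file sizes, for illustration only.
print(power_law_exponent([120, 300, 450, 900, 2500, 15000], xmin=300))
```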


3 Evidence of impact on quality





Probability of finding defects

We have seen that files above xmin account for 40% of total size, although
they are only about 1% of the files.
What about defects? Probability of finding defects in three software
projects (using CYCLO as metric)

Project        Below xmin    Above xmin
Apache         .4178         .7708
OpenIntents    .2500         .7500
Zxing          .2143         .4161

* Data extracted from “ReLink: Recovering Links between Bugs and Changes”, FSE 2011.
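The slides do not spell out how the probabilities in the table are computed. One plausible reading, used in the sketch below, is the fraction of files in each group that have at least one linked defect. The function name `defect_probability`, the `(metric_value, defect_count)` input shape, and the use of `>=` for the upper group are all assumptions.

```python
def defect_probability(files, xmin):
    """files: iterable of (metric_value, defect_count) pairs.

    Returns (share of defective files below xmin, share at or above xmin).
    """
    files = list(files)
    below = [d for m, d in files if m < xmin]
    above = [d for m, d in files if m >= xmin]

    def share(defects):
        # Fraction of files with at least one defect; NaN for an empty group.
        if not defects:
            return float("nan")
        return sum(1 for d in defects if d > 0) / len(defects)

    return share(below), share(above)


# Hypothetical (CYCLO, defects) pairs, for illustration only.
sample = [(1, 0), (2, 1), (3, 0), (4, 0), (50, 2), (60, 0)]
print(defect_probability(sample, xmin=10))
```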



Probability of finding defects (normalized metrics)
Using CYCLO / WMC as metric (cyclomatic complexity per LOC)

Project        Below xmin    Above xmin
Apache         .4159         .6296
OpenIntents    .2813         .5417
Zxing          .3181         .2389



Defects density (only pre-release defects)
Using Number of Methods and number of pre-release defects per LOC

[Figure: two panels, files below xmin (left) and files above xmin (right)]

Below xmin: Avg. Dens. = .2685
Above xmin: Avg. Dens. = .4565

* Data obtained from “Predicting Defects for Eclipse”, PROMISE 2007.
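The average densities reported for the two groups could be computed along the following lines. This sketch assumes that density means defects per LOC for each file and that the average is the mean of those per-file values; the slides do not say whether that or total defects over total LOC was used. The function name `average_defect_density` and the input tuple shape are assumptions.

```python
from statistics import mean


def average_defect_density(files, xmin):
    """files: iterable of (metric_value, defects, loc) tuples.

    Returns (mean density below xmin, mean density at or above xmin).
    """
    below, above = [], []
    for metric, defects, loc in files:
        if loc <= 0:
            continue  # skip empty files to avoid dividing by zero
        # Assumption: a file counts as "above" when its metric is >= xmin.
        (above if metric >= xmin else below).append(defects / loc)
    avg_below = mean(below) if below else float("nan")
    avg_above = mean(above) if above else float("nan")
    return avg_below, avg_above


# Hypothetical (number of methods, defects, LOC) tuples, for illustration only.
sample = [(3, 1, 10), (4, 0, 10), (20, 2, 4), (30, 1, 4)]
print(average_defect_density(sample, xmin=10))
```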



Defects density (only post-release defects)
Using Number of Methods and number of post-release defects per LOC

[Figure: two panels, left and right]

Left panel: Avg. Dens. = .1437
Right panel: Avg. Dens. = .2690


Defects density (pre + post-release defects)
Using CYCLO/SLOC and number of total defects per LOC

[Figure: two panels on logarithmic axes; the recoverable axis labels are Pr(X ≥ x) and x]

Left panel: Avg. Dens. = .3335 (>9000 files)
Right panel: Avg. Dens. = .7747 (364 files)

4 Summary of findings and further work





Summary and further work

Summary of preliminary findings
Some metrics have a transition from lognormal to power law
Clear relation between normalized metrics and defects density
Although the threshold might not be perfect (e.g., you might find a
high defects density in a file below the threshold), it greatly reduces the
search space for potentially problematic files

Further work
Verify in more projects
Do you have defects data at the file level?
Find explanation for the transition and its influence on quality
How do the statistical parameters change over time? Do defects
evolve accordingly?

