Software size distribution - Why we always underestimate software cost

On the distribution of source code file sizes
Israel Herraiz – Universidad Politécnica de Madrid, Spain
Daniel German – University of Victoria, Canada
Ahmed E. Hassan – Queen's University, Canada

ICSOFT 2011
Sevilla, July 19th 2011

Preprint available at http://oa.upm.es/6791/
This presentation is available at http://slideshare.net/herraiz/on-the-distribution-of-source-code-file-sizes

Software size
● Important metric
  ● Estimation of effort and cost
● Examples
  ● COCOMO
    ● $\mathrm{Effort} = a \cdot \mathrm{KLOC}^{b}$
    ● $\mathrm{Time} = c \cdot \mathrm{Effort}^{d}$
    ● $\mathrm{People} = \mathrm{Effort} / \mathrm{Time}$
  ● Function points
● Guess size and you will find out software cost
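As a rough illustration of the COCOMO relations on this slide, the sketch below derives effort, time and staffing from a size guess. The slides do not give values for a, b, c and d; the defaults used here are the Basic COCOMO "organic" coefficients and are an assumption made only for this example.

```python
def cocomo(kloc, a=2.4, b=1.05, c=2.5, d=0.38):
    """Return (effort, time, people) for a size given in KLOC.

    The default coefficients are the Basic COCOMO "organic" values,
    assumed here for illustration; the slides do not state any.
    """
    effort = a * kloc ** b   # person-months
    time = c * effort ** d   # calendar months
    people = effort / time   # average staff
    return effort, time, people


effort, time, people = cocomo(10.0)
print(f"effort={effort:.1f} PM, time={time:.1f} months, people={people:.1f}")
```

Every output depends on the size guess alone, so any error in the guessed size carries through to effort, time and staffing.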

Goal of this paper
● Find out the statistical distribution of software size
● Why is the distribution important?
  ● Estimate overall size
  ● Estimate number of modules within a given size range

Image extracted from http://en.wikipedia.org/wiki/Normal_distribution
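Once a size distribution is known, both estimates listed above follow from its cumulative distribution function and its mean. The sketch below uses a normal distribution only as a stand-in, because that is the curve pictured on this slide; the parameter values and the function names are hypothetical.

```python
from statistics import NormalDist


def expected_files_in_range(n_files, low, high, dist):
    """Expected number of files whose size falls in (low, high]."""
    return n_files * (dist.cdf(high) - dist.cdf(low))


def expected_total_size(n_files, dist):
    """Expected overall size: number of files times the mean file size."""
    return n_files * dist.mean


# Hypothetical model: mean 100 SLOC, standard deviation 20 SLOC.
model = NormalDist(mu=100.0, sigma=20.0)
print(expected_files_in_range(1000, 80.0, 120.0, model))
print(expected_total_size(1000, model))
```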

The lognormal distribution
● Software size is believed to follow a lognormal distribution
  ● “The Distribution of Program Sizes and Its Implications: An Eclipse Case Study”. Hongyu Zhang, Hee Beng Kuan Tan, Michele Marchesi
  ● http://arxiv.org/abs/0905.2288

Images extracted from http://en.wikipedia.org/wiki/Lognormal_distribution
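A variable is lognormal when its logarithm is normally distributed. The sketch below writes the lognormal density and cumulative distribution in terms of the normal ones; the parameter names mu and sigma (mean and standard deviation of the log sizes) and the example values are assumptions for illustration.

```python
import math
from statistics import NormalDist


def lognormal_pdf(x, mu, sigma):
    """Density of X at x > 0, where ln(X) is Normal(mu, sigma)."""
    return NormalDist(mu, sigma).pdf(math.log(x)) / x


def lognormal_cdf(x, mu, sigma):
    """P(X <= x) for x > 0, where ln(X) is Normal(mu, sigma)."""
    return NormalDist(mu, sigma).cdf(math.log(x))


# Hypothetical parameters: the median size is exp(mu), about 55 SLOC.
print(lognormal_cdf(math.exp(4.0), 4.0, 1.0))
print(lognormal_pdf(50.0, 4.0, 1.0))
```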

Lognormal vs. Double Pareto
● Contrary to previous results, we have found that software size follows a double Pareto distribution
● Large files are found more often than predicted by the lognormal distribution

Image extracted from http://en.wikipedia.org/wiki/Lognormal_distribution
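To make the difference in the upper tail concrete, the sketch below compares the probability of exceeding a size x under a pure lognormal model with that of a power-law tail attached at a threshold. It is a simplified stand-in for a double Pareto model, and every numeric value (MU, SIGMA, XMIN, ALPHA) is hypothetical rather than taken from the paper.

```python
import math


def lognormal_sf(x, mu, sigma):
    """P(X > x) under a lognormal model."""
    z = (math.log(x) - mu) / sigma
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def power_law_tail_sf(x, xmin, alpha, mass_at_xmin):
    """P(X > x) for x >= xmin when the density decays as x^-alpha."""
    return mass_at_xmin * (x / xmin) ** (1.0 - alpha)


MU, SIGMA = 4.0, 1.2        # hypothetical lognormal body
XMIN, ALPHA = 1000.0, 2.5   # hypothetical tail threshold and exponent

# Attach the tail so that both models agree at the threshold.
mass = lognormal_sf(XMIN, MU, SIGMA)
for x in (1_000.0, 10_000.0, 100_000.0):
    print(x, lognormal_sf(x, MU, SIGMA), power_law_tail_sf(x, XMIN, ALPHA, mass))
```

With these values the two models agree at the threshold, and beyond it the power-law tail assigns more probability to very large files than the lognormal does.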

How did we find out?
● Measuring Debian 5.0.2, about 1.4M source code files
● Measured SLOC for different programming languages
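The slides do not say which tool produced the SLOC counts. As a deliberately naive sketch of the idea, the code below counts lines that are neither blank nor whole-line comments; the extension table and the function names are hypothetical, and block comments and string literals are ignored.

```python
from pathlib import Path

# Hypothetical, incomplete mapping from file extension to the
# line-comment prefix of the language.
LINE_COMMENT = {
    ".py": "#", ".sh": "#", ".pl": "#",
    ".c": "//", ".cpp": "//", ".java": "//",
    ".lisp": ";",
}


def count_sloc(text, comment_prefix):
    """Count lines that are neither blank nor whole-line comments."""
    count = 0
    for line in text.splitlines():
        stripped = line.strip()
        if stripped and not stripped.startswith(comment_prefix):
            count += 1
    return count


def sloc_by_file(root):
    """Map each recognised source file under root to its naive SLOC."""
    result = {}
    for path in Path(root).rglob("*"):
        prefix = LINE_COMMENT.get(path.suffix)
        if prefix is not None and path.is_file():
            result[str(path)] = count_sloc(path.read_text(errors="replace"), prefix)
    return result
```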

Estimation of the density function
● Looks lognormally distributed
● Figure is in logarithmic scale
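The slides do not say which density estimator was used. One simple option, assumed here, is a histogram with logarithmically spaced bins, which suits data plotted on a logarithmic scale. The function name and the bins_per_decade parameter are hypothetical.

```python
import math


def log_binned_density(sizes, bins_per_decade=5):
    """Histogram density estimate over logarithmically spaced bins.

    Returns (left_edge, right_edge, density) triples; density is the
    fraction of files in the bin divided by the bin width.
    """
    data = [s for s in sizes if s > 0]
    lo = math.floor(math.log10(min(data)))
    hi = math.floor(math.log10(max(data))) + 1
    n_bins = (hi - lo) * bins_per_decade
    edges = [10 ** (lo + i / bins_per_decade) for i in range(n_bins + 1)]
    counts = [0] * n_bins
    for s in data:
        i = int((math.log10(s) - lo) * bins_per_decade)
        counts[min(i, n_bins - 1)] += 1
    return [
        (edges[i], edges[i + 1], counts[i] / (len(data) * (edges[i + 1] - edges[i])))
        for i in range(n_bins)
    ]
```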

But is it normal?
● Graphical normality test
  ● Compare the sample quantiles with the theoretical quantiles of a normal distribution
● Quite lognormal, except for the tails
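The graphical test described on this slide amounts to a quantile-quantile plot of the log sizes. A minimal sketch, assuming plotting positions (i + 0.5) / n and a normal model fitted to the logs; the function name is hypothetical. If the sizes are lognormal, the returned points lie close to the diagonal.

```python
import math
from statistics import NormalDist, fmean, stdev


def lognormal_qq_points(sizes):
    """Return (theoretical, observed) quantile pairs for ln(size)."""
    logs = sorted(math.log(s) for s in sizes if s > 0)
    n = len(logs)
    model = NormalDist(fmean(logs), stdev(logs))
    # Plotting positions (i + 0.5) / n avoid the infinite 0 and 1 quantiles.
    return [(model.inv_cdf((i + 0.5) / n), obs) for i, obs in enumerate(logs)]
```

Points that bend away from the diagonal at either end correspond to tails that are heavier or lighter than the lognormal model predicts.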

The density function from another point of view
● Complementary cumulative distribution function
● Easier to find shapes of known statistical distributions
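The empirical CCDF gives, for each observed size x, the fraction of files that are at least that large. A minimal sketch, with a hypothetical function name:

```python
def empirical_ccdf(sizes):
    """Return sorted (x, P(X >= x)) pairs, one per distinct size."""
    data = sorted(sizes)
    n = len(data)
    points = []
    for i, x in enumerate(data):
        # Record only the first occurrence of each distinct value.
        if i == 0 or x != data[i - 1]:
            points.append((x, (n - i) / n))
    return points
```

On log-log axes a power law appears as a straight line, which is what makes this view convenient for spotting known shapes.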

Conclusions so far
● The shape of the actual distribution is very close to lognormal
● However, the evidence is not clear enough
● The tails deviate from lognormality, and do not show a clear shape in the CCDF plot

But are the tails important?
● The tails are only a minority of files
● We will come back to this plot later

Impact of the minority
● But the tails are an immense minority
● Impact of large files in the tails on the overall size of the system

Model fitting
● Two-part model fitting
  ● Lognormal
    ● Straightforward procedure
  ● The tails
    ● Probably power laws, a less straightforward procedure
● How do we decide where the lognormal body ends and the tails begin?
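Fitting the lognormal part is straightforward because the maximum likelihood estimates are the mean and the standard deviation of the log sizes. A minimal sketch with a hypothetical function name; it fits all the data passed in and makes no attempt to correct for excluding the tails.

```python
import math
from statistics import fmean, pstdev


def fit_lognormal(sizes):
    """Maximum likelihood estimates (mu, sigma) of a lognormal model."""
    logs = [math.log(s) for s in sizes if s > 0]
    # The MLE of sigma divides by n, hence the population standard deviation.
    return fmean(logs), pstdev(logs)
```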

Maximum likelihood power law fitting
● Fitting power laws to empirical data
  ● Clauset et al. “Power-law distributions in empirical data”.
  ● http://www.santafe.edu/~aaronc/powerlaws/
● Estimate the parameters that minimize the Kolmogorov-Smirnov distance
  ● Maximum vertical distance in the CCDF between model and data
● Calculates a threshold value for data that deviate from the power law model
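The procedure on this slide can be sketched for continuous data as follows: for each candidate threshold, estimate the exponent by maximum likelihood, measure the Kolmogorov-Smirnov distance of the resulting fit, and keep the threshold with the smallest distance. This is a simplified illustration, not the authors' code nor the reference implementation linked above; the function names and the min_tail cut-off are assumptions.

```python
import bisect
import math
import random


def fit_alpha(tail, xmin):
    """Continuous maximum likelihood estimate of the exponent alpha."""
    log_sum = sum(math.log(x / xmin) for x in tail)
    return 1.0 + len(tail) / log_sum


def ks_distance(tail, xmin, alpha):
    """Largest gap between the empirical and model CDFs of the tail.

    The gap between two CDFs equals the gap between their CCDFs.
    """
    ordered = sorted(tail)
    n = len(ordered)
    worst = 0.0
    for i, x in enumerate(ordered):
        model = 1.0 - (x / xmin) ** (1.0 - alpha)
        # The empirical CDF steps from i/n to (i + 1)/n at x.
        worst = max(worst, abs(model - i / n), abs(model - (i + 1) / n))
    return worst


def fit_power_law_tail(sizes, min_tail=50):
    """Return (xmin, alpha, ks) with the smallest ks, or None if no fit."""
    data = sorted(sizes)
    best = None
    for xmin in sorted(set(data)):
        tail = data[bisect.bisect_left(data, xmin):]
        if len(tail) < min_tail or tail[-1] == xmin:
            break  # too few points left, or all remaining values are equal
        alpha = fit_alpha(tail, xmin)
        ks = ks_distance(tail, xmin, alpha)
        if best is None or ks < best[2]:
            best = (xmin, alpha, ks)
    return best


# Synthetic check: a Pareto sample whose density exponent is 1.5 + 1 = 2.5.
rng = random.Random(0)
sample = [rng.paretovariate(1.5) for _ in range(1000)]
print(fit_power_law_tail(sample))
```

The scan is quadratic in the number of files, so a data set as large as the one in these slides would need a coarser grid of candidate thresholds.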

Example of model fitting
● The data and two models in the CCDF plot
● Showing only Lisp source code files

Results for all the languages
● Two languages do not have power law tails
● Shell and Perl

What about the lognormal body?
● Shell and Perl do not fit the lognormal model well either

Timeout!

Conclusions so far

● Lognormal body + power law tail
  ● C, C++, Java, Python and Lisp
● Unknown distribution
  ● Shell and Perl
● Large files are more frequent than predicted by a lognormal model

Using the threshold value to show the impact of large files
● Even though large files are very scarce, they account for a large part of the overall size
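Given a threshold, the impact of large files can be summarised by two fractions: the share of files at or above the threshold and the share of the total size they contain. A minimal sketch with a hypothetical function name and made-up numbers:

```python
def tail_share(sizes, threshold):
    """Return (fraction of files, fraction of total size) at or above threshold."""
    tail = [s for s in sizes if s >= threshold]
    return len(tail) / len(sizes), sum(tail) / sum(sizes)


# Made-up example: 99 small files and one very large one.
print(tail_share([10] * 99 + [1010], 1000))
```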

Estimation errors using double Pareto and lognormal models
● This impact causes a large error in the predictions of the lognormal model
● Showing relative error for Lisp
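The slides do not state how the predictions were computed. One plausible reading, assumed here, is to predict the overall size from a fitted lognormal as the number of files times the lognormal mean, and to compare that figure with the measured total. The function names are hypothetical.

```python
import math
from statistics import fmean, pvariance


def lognormal_predicted_total(sizes):
    """Overall size predicted by a lognormal fitted to the log sizes."""
    logs = [math.log(s) for s in sizes]
    mu, var = fmean(logs), pvariance(logs)
    # The mean of a lognormal is exp(mu + sigma^2 / 2).
    return len(sizes) * math.exp(mu + var / 2.0)


def relative_error(predicted, actual):
    """Signed relative error; negative values mean underestimation."""
    return (predicted - actual) / actual
```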

So what?
● Estimation techniques based on lognormal size models will always underestimate the size of software
  ● Because they underestimate the number of large files
  ● And large files have an impact of more than 30% on the overall size

Any more juice extracted from these oranges?
● More fuel for the programming languages holy war
● The power law parameters could be related to the properties of the different programming languages

And what about Shell and Perl?

● These languages are used to a great extent for package maintenance activities in Debian
● So they are of a different nature
● Does it mean that double Pareto is the signature of the programming process?

Further work
● Analysis over time
  ● How do files reach the threshold value?
  ● What happens when files get large? Do they split? Are they abandoned?
● Domains of application
  ● How do the power law parameters change with domain of application?
● Can we find more “non-double Pareto” languages?

Take away

● Software size is an important metric (effort, cost)
● Size is not lognormal, it is double Pareto
● Lognormal models underestimate software size by design
● Double Pareto as the signature of programming?

Preprint available at http://oa.upm.es/6791/
