On the distribution of source code
                  file sizes
         Israel Herraiz – Universidad Politécnica de Madrid, Spain
              Daniel German – University of Victoria, Canada
             Ahmed E. Hassan – Queen's University, Canada


                              ICSOFT 2011
                          Sevilla, July 19th 2011

                            Preprint available at
                           http://oa.upm.es/6791/
                       This presentation available at
http://slideshare.net/herraiz/on-the-distribution-of-source-code-file-sizes

                                                                              1
Software size
●
    Important metric
      ●
          Estimation of effort and cost
●
    Examples
      ●
          COCOMO
           ●
               Effort = a × KLOC^b
           ●
               Time = c × Effort^d
           ●
               People = Effort / Time
      ●
          Function points
           ●
               Guess size and you will find out software cost
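
The COCOMO relations above can be sketched in a few lines. The coefficient values are an assumption and do not come from the slides: they are the commonly quoted basic COCOMO "organic mode" constants, and the function name `cocomo_basic` is hypothetical.

```python
def cocomo_basic(kloc, a=2.4, b=1.05, c=2.5, d=0.38):
    """Return (effort, time, people) for a project of `kloc` thousand lines.

    The default coefficients are assumed (basic COCOMO, organic mode).
    """
    effort = a * kloc ** b      # person-months
    time = c * effort ** d      # months
    people = effort / time      # average team size
    return effort, time, people


effort, time, people = cocomo_basic(32)
print(f"effort={effort:.1f} pm, time={time:.1f} months, people={people:.1f}")
```

Because the exponent b is greater than 1, doubling the size estimate more than doubles the effort estimate, which is why an error in the size input matters so much.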




                                                                2
Goal of this paper
●
    Find out the statistical
    distribution of software
    size
●
    Why is the distribution
    important?
      ●
          Estimate overall size
      ●
          Estimate the number of
          modules within a given
          size range




                         Image extracted from http://en.wikipedia.org/wiki/Normal_distribution   3
The lognormal distribution
●
    Software size is believed to follow a lognormal distribution
          ●
              “The Distribution of Program Sizes and Its Implications: An
              Eclipse Case Study”. Hongyu Zhang, Hee Beng Kuan Tan, Michele
              Marchesi
          ●
              http://arxiv.org/abs/0905.2288
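
As a minimal sketch of what a lognormal size model allows, the code below fits the two parameters from the logarithms of the file sizes and then answers the two questions from the goal slide: the expected file size and the fraction of files within a size range. The sample values and the function names are made up for illustration.

```python
import math
from statistics import NormalDist, fmean, pstdev


def fit_lognormal(sizes):
    """Return (mu, sigma): mean and standard deviation of log(size)."""
    logs = [math.log(s) for s in sizes]
    return fmean(logs), pstdev(logs)


def fraction_in_range(mu, sigma, low, high):
    """Fraction of files with low <= size <= high under the lognormal model."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(math.log(high)) - dist.cdf(math.log(low))


def expected_mean_size(mu, sigma):
    """Mean of a lognormal distribution with parameters mu and sigma."""
    return math.exp(mu + sigma ** 2 / 2)


sizes = [12, 30, 45, 80, 120, 200, 350, 900]  # made-up SLOC values
mu, sigma = fit_lognormal(sizes)
print(f"mu={mu:.2f} sigma={sigma:.2f}")
print(f"share of files with 50-500 SLOC: {fraction_in_range(mu, sigma, 50, 500):.2f}")
print(f"expected file size: {expected_mean_size(mu, sigma):.0f} SLOC")
```

Multiplying the expected file size by the number of files gives the overall size estimate implied by the model.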




                                           Log




                         Images extracted from http://en.wikipedia.org/wiki/Lognormal_distribution   4
Lognormal vs. Double Pareto
●
    Contrary to previous results, we have found that software
    size follows a double Pareto distribution
●
    Large files are found more often than predicted by the
    lognormal distribution




                   Image extracted from http://en.wikipedia.org/wiki/Lognormal_distribution   5
How did we find out?
●   Measuring Debian 5.0.2, about 1.4M source code files
      ●   Measured SLOC for different programming languages




                                                              6
The numbers




              7
Estimation of the density function
●
    Looks lognormally distributed
      ●
          Figure is in logarithmic scale




                                                  8
But is it normal?
●
    Graphical normality test
      ●
          Compare the quantiles of the sample with the theoretical
          quantiles of a normal distribution
      ●
          Quite log-normal, except for the tails
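
The graphical test can be sketched as follows: take logarithms, standardize them, and pair each sorted value with the standard normal quantile at the same rank. The plotting position (i - 0.5) / n and the function name are assumptions for this sketch. If the sizes were lognormal, the points would lie close to the diagonal.

```python
import math
from statistics import NormalDist, fmean, pstdev


def qq_points(sizes):
    """Return [(theoretical, observed)] quantile pairs for log(size)."""
    logs = sorted(math.log(s) for s in sizes)
    n = len(logs)
    mu, sigma = fmean(logs), pstdev(logs)
    std_normal = NormalDist()
    points = []
    for i, value in enumerate(logs, start=1):
        p = (i - 0.5) / n                      # assumed plotting position
        theoretical = std_normal.inv_cdf(p)    # ideal normal quantile
        observed = (value - mu) / sigma        # standardized sample quantile
        points.append((theoretical, observed))
    return points


for theoretical, observed in qq_points([10, 20, 40, 80, 160]):
    print(f"{theoretical:+.2f}  {observed:+.2f}")
```

Points that bend away from the diagonal at either end are the tail deviations mentioned in the last bullet.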




                                                                    9
The density function from another point of view
●
    Complementary cumulative distribution function
      ●
          Easier to find shapes of known statistical distributions
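
A minimal sketch of the empirical complementary cumulative distribution function, here defined as P(X >= x). The function name and the toy input are made up.

```python
from collections import Counter


def empirical_ccdf(sizes):
    """Return [(x, P(X >= x))] for each distinct value x, in ascending order."""
    n = len(sizes)
    counts = Counter(sizes)
    remaining = n                      # number of files with size >= x
    ccdf = []
    for x in sorted(counts):
        ccdf.append((x, remaining / n))
        remaining -= counts[x]
    return ccdf


print(empirical_ccdf([1, 2, 2, 4]))
```

Plotted on logarithmic axes, a power law appears in the CCDF as a straight line, which is what makes the shapes easier to recognize.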




                                                                     10
Conclusions so far




                     11
Conclusions so far
●
    The shape of the actual distribution is
    very close to lognormal

●
    However, it is not clear enough

●
    The tails deviate from lognormality,
    and do not show a clear shape in the
    CCDF plot
                                              12
But are the tails important?
●   The tails are only a minority of files
●   We will come back to this plot later




                                               13
Impact of the minority
●
    But the tails are an immense minority
●
    Impact of large files in the tails on the overall size of the
    system




                                                                    14
Model fitting
●
    Two-part model fitting
      ●
          Lognormal
            ●
                Straightforward procedure
      ●
          The tails
            ●
                Probably power laws; the procedure is not so straightforward
●
    How do we decide where the lognormal body ends and the
    tails begin?




                                                                        15
Maximum likelihood power law fitting
●
    Fitting power laws to empirical data
         ●
             Clauset et al. “Power-law distributions in empirical data”.
         ●
             http://www.santafe.edu/~aaronc/powerlaws/
●
    Estimate the parameters that minimize the
    Kolmogorov-Smirnov distance
         ●
             Maximum vertical distance in the CCDF between model and data
●
    Calculate a threshold value for data that deviate from the
    power law model
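
The general recipe can be sketched as follows: for every candidate threshold, estimate the exponent by maximum likelihood on the data above it, measure the Kolmogorov-Smirnov distance between the fitted model and the data, and keep the threshold with the smallest distance. This is not the code used for the paper. It assumes the continuous approximation of the estimator (SLOC counts are discrete), strictly positive sizes, and a hypothetical `min_tail` cutoff; the sample is synthetic.

```python
import math


def fit_power_law_tail(sizes, min_tail=10):
    """Return (xmin, alpha, ks) for the best power-law tail, or None."""
    data = sorted(sizes)
    best = None
    for xmin in sorted(set(data)):
        tail = [x for x in data if x >= xmin]
        n = len(tail)
        if n < min_tail:          # too few points left to fit a tail
            break
        log_sum = sum(math.log(x / xmin) for x in tail)
        if log_sum == 0:          # all tail values equal xmin
            continue
        # Maximum likelihood exponent (continuous approximation).
        alpha = 1 + n / log_sum
        # Kolmogorov-Smirnov distance between empirical and model CDF.
        ks = 0.0
        for i, x in enumerate(tail):
            model_cdf = 1 - (x / xmin) ** (1 - alpha)
            ks = max(ks, abs(model_cdf - i / n), abs(model_cdf - (i + 1) / n))
        if best is None or ks < best[2]:
            best = (xmin, alpha, ks)
    return best


# Synthetic, deterministic Pareto sample (assumed alpha = 2.5, xmin = 100).
N = 500
sample = [100 * (1 - (i + 0.5) / N) ** (-1 / 1.5) for i in range(N)]
xmin, alpha, ks = fit_power_law_tail(sample)
print(f"xmin={xmin:.1f} alpha={alpha:.2f} KS={ks:.4f}")
```

The loop is quadratic in the number of files, which is acceptable for a sketch but would need a faster implementation for a data set the size of Debian.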




                                                                            16
Example of model fitting
●
    The data and two models in the CCDF plot
●
    Showing only Lisp source code files




                                               17
Results for all the languages
●   Two languages do not have power law tails
      ●   Shell and Perl




                                                 18
What about the lognormal body?
●
    Shell and Perl do not fit the lognormal model well either




                                                                19
Timeout!

Conclusions so far




                     20
Conclusions so far

●
    Lognormal body + power law tail
      ●
          C, C++, Java, Python and Lisp

●
    Unknown distribution
      ●
          Shell and Perl

●
    Large files are more frequent than predicted by a
    lognormal model



                                                        21
Using the threshold value to show the impact of
                  large files
●
    Even though large files are very scarce, they account for a
    large part of the overall size
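
A minimal sketch of how the threshold can be used to quantify this impact: count the files at or above the threshold and add up their share of the total size. The function name and the toy numbers are made up.

```python
def large_file_share(sizes, threshold):
    """Return (fraction of files, fraction of total size) at or above threshold."""
    large = [s for s in sizes if s >= threshold]
    return len(large) / len(sizes), sum(large) / sum(sizes)


sizes = [10] * 9 + [910]                      # made-up SLOC values
files, size = large_file_share(sizes, 500)
print(f"{files:.0%} of the files hold {size:.0%} of the code")
```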




                                                                  22
Estimation errors using double Pareto and
                 lognormal models
●
    This impact causes a great error in the prediction of the
    lognormal model
●
    Showing relative error for Lisp
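
The lognormal side of this comparison can be sketched as follows: fit the lognormal parameters, predict the total size as the number of files times the lognormal mean, and compute the relative error against the actual total. The sample is made up (one very large file among many small ones) and the function names are hypothetical; the double Pareto prediction is not reproduced here.

```python
import math
from statistics import fmean, pstdev


def lognormal_total(sizes):
    """Total size predicted by a lognormal model fitted to `sizes`."""
    logs = [math.log(s) for s in sizes]
    mu, sigma = fmean(logs), pstdev(logs)
    return len(sizes) * math.exp(mu + sigma ** 2 / 2)


def relative_error(predicted, actual):
    """Signed relative error; negative values mean underestimation."""
    return (predicted - actual) / actual


sizes = [10] * 99 + [100_000]                 # made-up SLOC values
err = relative_error(lognormal_total(sizes), sum(sizes))
print(f"relative error of the lognormal estimate: {err:+.0%}")
```

In this toy case the single large file dominates the total, and the fitted lognormal model falls far short of it.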




                                                                23
So what?
●
    Estimation techniques based on lognormal size models
    will always underestimate the size of software
      ●
          Because they underestimate the number of large files
      ●
          And large files account for more than 30% of the overall size




                                                                    24
Any more juice extracted from these oranges?
●
    More fuel for the programming languages holy war
      ●
          The power law parameters could be related to the
          properties of the different programming languages




                                                              25
And what about Shell and Perl?

●
    These languages are used to a great extent for package
    maintenance activities in Debian

●
    So they are of a different nature

●
    Does it mean that double Pareto is the signature of the
    programming process?




                                                              26
Further work
●
    Analysis over time
         ●
             How do files reach the threshold value?
         ●
             What happens when files get large? Do they split? Are they
             abandoned?

●
    Application domains
         ●
             How do the power law parameters change with the
             application domain?

●
    Can we find more “non-double Pareto” languages?




                                                                          27
Take away

●
    Software size is an important metric (effort, cost)
●
    Size is not lognormal, it is double Pareto
●
    Lognormal models underestimate software size by design
●
    Double Pareto as the signature of programming?

   Preprint available at http://oa.upm.es/6791/   28

