A Principled
Methodology
A Dozen Principles of
   Software Effort
     Estimation


 Ekrem Kocaguneli, 11/07/2012
2



         Agenda
• Introduction
• Publications
• What to Know
   • 8 Questions
• Answers
   • 12 Principles
• Validity Issues
• Future Work
3



                      Introduction
Software effort estimation (SEE) is the process of estimating the total
   effort required to complete a software project (Keung2008 [1]).

  Successful estimation is critical for an organization:
       Over-estimation: killing promising projects
       Under-estimation: wasting the entire effort! E.g. NASA’s
       launch-control system was cancelled after its initial estimate of
       $200M was overrun by another $200M [22]

        Among IT projects developed in 2009, only 32% were completed
    on time and with full functionality [23]
4



               Introduction (cntd.)
 We will discuss algorithms, but it would be irresponsible to say
that SEE is merely an algorithmic problem. Organizational factors
                       are just as important


  E.g. organizations operating in different domains report common
          experiences with data collection and user interaction
5



             Introduction (cntd.)

This presentation is not about a single algorithm/answer targeting a
                           single problem.

               Because there is not just one question.


           Nor is it (unfortunately) everything there is to know about SEE.


      It brings together critical questions and related solutions.
6



                         What to know?
1    When do I have perfect data?
2    What is the best effort estimation method?
3    Can I use multiple methods?
4    ABE methods are easy to use. How can I improve them?
5    What if I lack resources for local data?
6    I don’t believe in size attributes. What can I do?
7    Are all attributes and all instances necessary?
8    How to experiment, which sampling method to use?
7



                                 Publications
Journals
•   E. Kocaguneli, T. Menzies, J. Keung, “On the Value of Ensemble Effort Estimation”, IEEE Transactions on
    Software Engineering, 2011.
•   E. Kocaguneli, T. Menzies, A. Bener, J. Keung, “Exploiting the Essential Assumptions of Analogy-based
    Effort Estimation”, IEEE Transactions on Software Engineering, 2011.
•   E. Kocaguneli, T. Menzies, J. Keung, “Kernel Methods for Software Effort Estimation”, Empirical
    Software Engineering Journal, 2011.
•   J. Keung, E. Kocaguneli, T. Menzies, “A Ranking Stability Indicator for Selecting the Best Effort Estimator
    in Software Cost Estimation”, Journal of Automated Software Engineering, 2012.
Under review Journals
•   E. Kocaguneli, T. Menzies, J. Keung, “Active Learning for Effort Estimation”, third round review at IEEE
    Transactions on Software Engineering.
•   E. Kocaguneli, T. Menzies, E. Mendes, “Transfer Learning in Effort Estimation”, submitted to ACM
    Transactions on Software Engineering.
•   E. Kocaguneli, T. Menzies, “Software Effort Models Should be Assessed Via Leave-One-Out Validation”,
    under second round review at Journal of Systems and Software.
•   E. Kocaguneli, T. Menzies, E. Mendes, “Towards Theoretical Maximum Prediction Accuracy Using D-
    ABE”, submitted to IEEE Transactions on Software Engineering.
Conference
•   E. Kocaguneli, T. Menzies, J. Hihn, Byeong Ho Kang, “Size Doesn't Matter? On the Value of Software Size
    Features for Effort Estimation”, Predictive Models in Software Engineering (PROMISE) 2012.
•   E. Kocaguneli, T. Menzies, “How to Find Relevant Data for Effort Estimation”, International Symposium
    on Empirical Software Engineering and Measurement (ESEM) 2011
•   E. Kocaguneli, G. Gay, Y. Yang, T. Menzies, “When to Use Data from Other Projects for Effort Estimation”,
    International Conference on Automated Software Engineering (ASE) 2010, Short-paper.
8



  1       When do I have the perfect data?
Principle #1: Know your domain
Domain knowledge is important in every step (Fayyad1996 [2])
Yet, this knowledge takes time and effort to gain,
e.g. percentage commit information
                       Principle #2: Let the experts talk
                     Initial results may be off according to domain experts
                    Success means creating discussion, interest and suggestions

                             Principle #3: Suspect your data
             “Curiosity” to question is a key characteristic (Rauser2011 [3])
                     e.g. in an SEE project, 200+ test cases, 0 bugs

       Principle #4: Data collection is cyclic
               Any step, from mining to presentation, may be repeated
9


2   What is the best effort estimation
                method?
 There is no agreed-upon best estimation method (Shepperd2001 [4]);
 methods change ranking w.r.t. conditions such as data sets and error
 measures (Myrtveit2005 [5]).

 Experimenting with 90 solo-methods, 20 public data sets and 7 error
 measures, the top 13 methods are CART & ABE methods (1NN, 5NN).
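Several of the comparisons below are reported in terms of MRE and MMRE. As a reference, a minimal sketch of these two standard error measures (only two of the seven measures used in the experiments):

```python
import numpy as np

def mre(actual, predicted):
    # magnitude of relative error for a single project
    return abs(actual - predicted) / actual

def mmre(actuals, predictions):
    # mean MRE over all projects in a data set
    return float(np.mean([mre(a, p) for a, p in zip(actuals, predictions)]))

# e.g. mmre([100, 200], [120, 150]) -> mean of 0.20 and 0.25 = 0.225
```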
10


 3       How to use superior subset of
                  methods?
  We have a set of superior methods to recommend.

  Assembling solo-methods may be a good idea, e.g. the fusion of 3
  biometric modalities (Ross2003 [20]). But the previous evidence of
  assembling multiple methods in SEE is discouraging: Baker2007 [7],
  Kocaguneli2009 [8] and Khoshgoftaar2009 [9] failed to outperform
  solo-methods.

  We combine the top 2, 4, 8 and 13 solo-methods via mean, median
  and IRWM.
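A minimal sketch of how the top solo-methods' estimates might be combined. The mean and median are as named on the slide; IRWM is assumed here to be an inverse-rank weighted mean in which better-ranked solo-methods get proportionally larger weights:

```python
import numpy as np

def combine_estimates(estimates, ranks):
    """Combine the predictions of the top-M solo-methods for one project.

    estimates : predictions of the M solo-methods (e.g. person-months)
    ranks     : 1 = best-ranked solo-method, ..., M = worst (used by IRWM only)
    """
    estimates = np.asarray(estimates, dtype=float)
    ranks = np.asarray(ranks, dtype=float)
    weights = ranks.max() + 1 - ranks          # ranks 1..M -> weights M..1
    return {
        "mean":   float(estimates.mean()),
        "median": float(np.median(estimates)),
        "irwm":   float(np.sum(weights * estimates) / weights.sum()),
    }

# e.g. the top-4 solo-methods predicting effort for one test project
print(combine_estimates([120, 135, 150, 180], ranks=[1, 2, 3, 4]))
```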
11

2 What is the best effort estimation method?
3 How to use superior subset of methods?

                                Principle #5: Use a ranking stability indicator

  Principle #6: Assemble superior solo-methods




        A method to identify successful methods using their rank changes
                         A novel scheme for assembling solo-methods
                       Multi-methods that outperform all solo-methods
This research was published in:
• E. Kocaguneli, T. Menzies, J. Keung, “On the Value of Ensemble Effort Estimation”, IEEE Transactions on
    Software Engineering, 2011.
• J. Keung, E. Kocaguneli, T. Menzies, “A Ranking Stability Indicator for Selecting the Best Effort Estimator in
    Software Cost Estimation”, Journal of Automated Software Engineering, 2012.
12


4    How can we improve ABE methods?

Analogy-based estimation (ABE) methods make use of similar past
projects for estimation. They are very widely used (Walkerden1999 [10]) as:
• No model calibration to local data is needed
• They can better handle outliers
• They can work with 1 or more attributes
• They are easy to explain


     Two promising research areas
     • weighting the selected analogies
       (Mendes2003 [11], Mosley2002 [12])
     • improving design options (Keung2008 [1])
13

     How can we improve ABE methods?
                  (cntd.)
   Building on previous research (Mendes2003 [11], Mosley2002 [12],
   Keung2008 [1]), we adopted two different strategies:


   a) Weighting analogies

We used kernel weighting to
weigh selected analogies

 Compare performance of
 each k-value with and
 without weighting.

 In none of the scenarios did we
 see a significant improvement
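A minimal sketch of the kernel-weighting strategy: the k nearest analogies are retrieved and their efforts are combined with weights taken from a kernel of their distance. The Gaussian kernel and single bandwidth here are illustrative assumptions; the study tried several kernels, bandwidths and k values.

```python
import numpy as np

def kernel_weighted_abe(train_X, train_effort, test_x, k=5, bandwidth=1.0):
    """Estimate effort for test_x from its k nearest past projects,
    weighting each analogy by a Gaussian kernel of its distance."""
    train_X = np.asarray(train_X, dtype=float)
    train_effort = np.asarray(train_effort, dtype=float)
    dists = np.linalg.norm(train_X - np.asarray(test_x, dtype=float), axis=1)
    nearest = np.argsort(dists)[:k]                          # the k analogies
    w = np.exp(-(dists[nearest] ** 2) / (2 * bandwidth ** 2))
    return float(np.sum(w * train_effort[nearest]) / np.sum(w))
```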
14

     How can we improve ABE methods?
                  (cntd.)
b) Designing ABE methods: D-ABE

Easy-path: remove training instances that violate assumptions.
(TEAK, which uses this design, will be discussed later.)

D-ABE is built on theoretical maximum prediction accuracy (TMPA)
(Keung2008 [1]):
• Get the best estimates of all training instances
• Remove all the training instances within half of the worst MRE
  (according to TMPA)
• Return the closest neighbor’s estimate to the test instance

(Figure: test instance t and training instances a–f; the instances close
to the worst MRE are removed, and the closest remaining neighbor’s
estimate is returned.)
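A minimal sketch of the D-ABE design described above; the distance measure, the exact "best estimate" step and the tie handling are assumptions, and the real method is driven by TMPA:

```python
import numpy as np

def d_abe(train_X, train_effort, test_x):
    """D-ABE sketch: estimate each training project from its own nearest
    training neighbour, prune the projects whose MRE falls within half of
    the worst MRE (the "close to the worst MRE" region of the figure),
    then return the estimate of the test instance's closest survivor."""
    train_X = np.asarray(train_X, dtype=float)
    train_effort = np.asarray(train_effort, dtype=float)
    n = len(train_X)
    est = np.empty(n)
    for i in range(n):
        d = np.linalg.norm(train_X - train_X[i], axis=1)
        d[i] = np.inf                        # exclude the instance itself
        est[i] = train_effort[np.argmin(d)]  # best estimate via 1NN in training
    errors = np.abs(train_effort - est) / train_effort
    keep = errors < errors.max() / 2         # prune instances near the worst MRE
    if not keep.any():                       # degenerate case: keep everything
        keep[:] = True
    d_test = np.linalg.norm(train_X[keep] - np.asarray(test_x, dtype=float), axis=1)
    return float(train_effort[keep][np.argmin(d_test)])
```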
15

   How can we improve ABE methods?
                (cntd.)

(Figures: D-ABE compared to static-k ABE w.r.t. MMRE, and w.r.t. win,
tie and loss counts.)
16

     How can we improve ABE methods?
                  (cntd.)

                  Principle #7: Weighting analogies is overelaboration

                            Principle #8: Use easy-path design

          Investigation of an unexplored and promising ABE option
          of kernel-weighting
               A negative result published at ESE Journal
            An ABE design option that can be applied to different
                         ABE methods (D-ABE, TEAK)

This research was published in:
• E. Kocaguneli, T. Menzies, A. Bener, J. Keung, “Exploiting the Essential Assumptions of Analogy-based Effort
    Estimation”, IEEE Transactions on Software Engineering, 2011.
• E. Kocaguneli, T. Menzies, J. Keung, “Kernel Methods for Software Effort Estimation”, Empirical Software
    Engineering Journal, 2011.
17



  5       How to handle lack of local data?
   Finding enough local training data is a fundamental problem of SEE
   (Turhan2009 [13]). The merit of using cross-data from another
   company is questionable (Kitchenham2007 [14]).

                    We use a relevancy filtering method called TEAK
                    on public and proprietary data sets.

(Figure: similar projects with dissimilar effort values, hence high
variance, are pruned; similar projects with similar effort values,
hence low variance, are kept.)

                    Cross data works as well as within data for 6 out
                    of 8 proprietary data sets and 19 out of 21 public
                    data sets after TEAK’s relevancy filtering.
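A much-simplified sketch of the variance-based idea behind TEAK's relevancy filtering; the real method grows a cluster tree and prunes high-variance subtrees, so the fixed neighbourhood size and the quantile cut-off below are illustrative assumptions:

```python
import numpy as np

def relevancy_filter(train_X, train_effort, k=3, variance_quantile=0.5):
    """Keep only the training projects that sit in low effort-variance
    (i.e. locally consistent) neighbourhoods; drop the rest."""
    train_X = np.asarray(train_X, dtype=float)
    train_effort = np.asarray(train_effort, dtype=float)
    n = len(train_X)
    local_var = np.empty(n)
    for i in range(n):
        d = np.linalg.norm(train_X - train_X[i], axis=1)
        neighbours = np.argsort(d)[:k + 1]        # the project plus its k neighbours
        local_var[i] = np.var(train_effort[neighbours])
    keep = local_var <= np.quantile(local_var, variance_quantile)
    return train_X[keep], train_effort[keep]
```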
18

          How to handle lack of local data?
                      (cntd.)

                             Principle #9: Use relevancy filtering



                       A novel method to handle lack of local data
             Successful application on public as well as proprietary data



This research was published in:
• E. Kocaguneli, T. Menzies, “How to Find Relevant Data for Effort Estimation”, International Symposium on
    Empirical Software Engineering and Measurement (ESEM) 2011
• E. Kocaguneli, G. Gay, Y. Yang, T. Menzies, “When to Use Data from Other Projects for Effort Estimation”,
    International Conference on Automated Software Engineering (ASE) 2010, Short-paper.
19



           E(k) matrices & Popularity
     This concept underlies the solutions to the next 2 problems: size features
     and the essential content, i.e. the pop1NN and QUICK algorithms, respectively




A similar concept is the reverse nearest neighbor (RNN) in ML, used to find the
instances whose k-NNs include a specific query (Achtert2006 [26]).
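A minimal sketch of the E(k)-matrix / popularity idea (Euclidean distance is an assumption for illustration): E(k)[i, j] marks whether instance i is among the k nearest neighbours of instance j, and an instance's popularity is how often it is picked, i.e. a reverse-nearest-neighbour count.

```python
import numpy as np

def popularity(X, k=1):
    """Build the E(k) matrix and return each instance's popularity."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    E = np.zeros((n, n), dtype=int)
    for j in range(n):
        d = np.linalg.norm(X - X[j], axis=1)
        d[j] = np.inf                      # an instance cannot be its own neighbour
        for i in np.argsort(d)[:k]:
            E[i, j] = 1                    # i is among the k nearest neighbours of j
    return E.sum(axis=1)                   # popularity = row sums of E(k)
```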
20



     E(k) matrices & Popularity (cntd.)
         Outlier pruning                         Sample steps
1.   Calculate the “popularity” of
     instances
2.   Sort instances by popularity
3.   Label one instance at a time
4.   Find the stopping point
5.   Return estimates from the labeled
     training data

                       Finding the stopping point
       1. If all popular instances are exhausted.
       2. Or if there is no MRE improvement for n consecutive times.
       3. Or if the ∆ between the best and the worst error of the last n
          instances is very small. (∆ = 0.1; n = 3)
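A sketch of the pop1NN labelling loop, reconstructed from the five steps and the stopping rules above; the leave-one-out error check inside the labelled pool and other details are assumptions:

```python
import numpy as np

def pop1nn_labelling(X, effort, delta=0.1, n_stop=3):
    """Label instances in order of decreasing popularity and stop when a
    stopping rule fires; returns the indices of the labelled projects."""
    X = np.asarray(X, dtype=float)
    effort = np.asarray(effort, dtype=float)
    n = len(X)
    pop = np.zeros(n, dtype=int)           # popularity: how often an instance is a 1-NN
    for j in range(n):
        d = np.linalg.norm(X - X[j], axis=1)
        d[j] = np.inf
        pop[np.argmin(d)] += 1
    order = np.argsort(-pop)               # most popular instances first
    labelled, history, best, since_best = [], [], np.inf, 0
    for idx in order:
        labelled.append(int(idx))          # label one instance at a time
        if len(labelled) < 2:
            continue
        mres = []                          # leave-one-out 1NN error in the labelled pool
        for i in labelled:
            others = [j for j in labelled if j != i]
            d = np.linalg.norm(X[others] - X[i], axis=1)
            pred = effort[others[int(np.argmin(d))]]
            mres.append(abs(effort[i] - pred) / effort[i])
        err = float(np.mean(mres))
        history.append(err)
        best, since_best = (err, 0) if err < best else (best, since_best + 1)
        last = history[-n_stop:]
        # stop: no MRE improvement for n_stop labels, or error spread below delta
        if since_best >= n_stop or (len(last) == n_stop and max(last) - min(last) < delta):
            break
    return labelled
```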
21



    E(k) matrices & Popularity (cntd.)
(Figure: picking a random training instance is not a good idea; adding
more popular instances to the active pool decreases error, until one of
the stopping-point conditions fires.)
22



6     Do I have to use size attributes?
  At the heart of widely accepted SEE methods lie the software size
  attributes: COCOMO uses LOC (Boehm1981 [15]), whereas FP
  (Albrecht1983 [16]) uses logical transactions.

        Size attributes are beneficial if used properly (Lum2002
        [17]); e.g. the DoD and NASA use them successfully.

      Yet the size attributes may not be trusted, or may not be estimable
      at the early stages. That disrupts the adoption of SEE methods.

“Measuring software productivity by lines of code is like measuring
progress on an airplane by how much it weighs.” – B. Gates
“This is a very costly measuring unit because it encourages the
writing of insipid code.” – E. Dijkstra
23



Do I have to use size attributes? (cntd.)

pop1NN (w/o size) vs. CART and 1NN (w/ size)

                                          Given enough resources
                                          for correct collection and
                                          estimation, size features
                                          are helpful


                                          If not, then outlier pruning
                                          helps.
24



  Do I have to use size attributes? (cntd.)

                              Principle #10: Use outlier pruning




            Promotion of SEE methods that can compensate for the lack
                         of software size features
              A method called pop1NN that shows that size features
                               are not a “must”


This research was published in:
• E. Kocaguneli, T. Menzies, J. Hihn, Byeong Ho Kang, “Size Doesn't Matter? On the Value of Software Size
    Features for Effort Estimation”, Predictive Models in Software Engineering (PROMISE) 2012.
25


 7      What is the essential content of SEE
                       data?
 SEE is populated with overly complex methods for marginal performance
 increase (Jorgensen2007 [18]). In a matrix of N instances and F
 features, the essential content is N′ ∗ F′.

       QUICK is an active learning method that combines outlier
       removal and synonym pruning.

 Synonym pruning (sketched below):
 1. Transpose the normalized matrix and calculate the popularity of features
 2. Select the non-popular features

 Related work: similar tasks both remove cells in the hypercube of all
 cases times all columns (Lipowezky1998 [24]); removal of features based
 on distance seemed to be reserved for instances; an ABE method has been
 framed as a two-dimensional reduction (Ahn2007 [25]); in our lab a
 variance-based feature selector is used as a row selector.
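A minimal sketch of the synonym-pruning step, under the assumption that a feature's popularity is a nearest-neighbour vote count over the transposed (feature-as-row) matrix and that only zero-popularity features are kept:

```python
import numpy as np

def synonym_pruning(X):
    """Return the indices of the features QUICK-style synonym pruning
    would keep: the non-popular (least redundant) ones."""
    X = np.asarray(X, dtype=float)
    # column-wise min-max normalization
    Xn = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0) + 1e-12)
    F = Xn.T                                  # transpose: one row per feature
    n_feat = len(F)
    pop = np.zeros(n_feat, dtype=int)
    for j in range(n_feat):
        d = np.linalg.norm(F - F[j], axis=1)
        d[j] = np.inf
        pop[np.argmin(d)] += 1                # feature j votes for its nearest feature
    return np.where(pop == 0)[0]              # select the non-popular features
```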
26

What is the essential content of SEE
           data? (cntd.)
                       The essential content is at most 31% of all
                       the cells; on median, 10% of the cells.


                       “There is a consensus in the high-dimensional
                       data analysis community that the only
                       reason any methods work in very high
                       dimensions is that, in fact, the data are not
                       truly high-dimensional.” (Levina & Bickel 2005)




                                Performance?
27

     What is the essential content of SEE data? (cntd.)
(Figures: QUICK vs passiveNN (1NN); QUICK vs CART)
                    Only one dataset
                    where QUICK is
                    significantly worse
                    than passiveNN




                    4 such data sets
                    when QUICK is
                    compared to CART
28

      What is the essential content of SEE
                 data? (cntd.)
                Principle #11: Combine outlier and synonym pruning



               An unsupervised method to find the essential content of
                      SEE data sets and reduce the data needs
                 Promoting research to elaborate on the data, not on the
                                       algorithm



This research is under 3rd round review:
• E. Kocaguneli, T. Menzies, J. Keung, “Active Learning for Effort Estimation”, third round review at IEEE
    Transactions on Software Engineering.
29



8       How should I choose the right SM?
     (Figures: expected bias & variance (Kitchenham2007 [14]) vs. observed)

    No significant difference in bias & variance (B&V) values among the 90 methods

              Only minutes of run time difference (<15)

LOO is not probabilistic and its results can be easily shared
30

       How should I choose the right SM?
                     (cntd.)

              Principle #12: Be aware of sampling method trade-off




                 The first investigation of B&V trade-off in SEE domain

                  Recommendation based on experimental concerns


This research is under 2nd round review:
• E. Kocaguneli, T. Menzies, “Software Effort Models Should be Assessed Via Leave-One-Out Validation”,
    under second round review at Journal of Systems and Software.
31




                     What to know?

  When do I have perfect data?
     1. Know your domain
     2. Let the experts talk
     3. Suspect your data
     4. Data collection is cyclic

  What is the best effort estimation method?
     5. Use a ranking stability indicator

  Can I use multiple methods?
     6. Assemble superior solo-methods

  ABE methods are easy to use. How can I improve them?
     7. Weighting analogies is over-elaboration
     8. Use easy-path design

  What if I lack resources for local data?
     9. Use relevancy filtering

  I don’t believe in size attributes. What can I do?
     10. Use outlier pruning

  Are all attributes and all instances necessary?
     11. Combine outlier and synonym pruning

  How to experiment, which sampling method to use?
     12. Be aware of sampling method trade-off
32



                       Validity Issues
 Construct validity, i.e. do we measure what
           we intend to measure?
                   Use of previously recommended estimation
                       methods, error measures and data sets

External validity, i.e. can we generalize results
        outside the current specifications?
                   It is difficult to assert that results will definitely hold,
              yet we use almost all the publicly available SEE data sets.

                         The median number of projects used by the
                       studies reviewed is 186 (Kitchenham2007 [14]);
                          our experimentation uses 1000+ projects.
33



          Future Work
       Application on publicly accessible big data sets
       (300K projects, 2M users; 250K open source projects)

     Smarter, larger-scale algorithms for general conclusions:
     current methods may face scalability issues, so common ideas
     must be improved for scalability, e.g. linear-time NN methods

     Application to different domains, e.g. defect prediction

     Combining intrinsic dimensionality techniques in ML to lower-bound
     the dimensionality of SEE data sets (Levina2004 [27])
What have we covered?




                        34
35
36



                                   References
[1] J. W. Keung, “Theoretical Maximum Prediction Accuracy for Analogy-Based Software Cost
Estimation,” 15th Asia-Pacific Software Engineering Conference, pp. 495– 502, 2008.
[2] U. Fayyad, G. Piatetsky-Shapiro, and P. Smyth, “The kdd process for extracting useful knowledge
from volumes of data,” Commun. ACM, vol. 39, no. 11, pp. 27–34, Nov. 1996.
[3] J. Rauser, “What is a career in big data?” 2011. [Online]. Available: http:
//strataconf.com/stratany2011/public/schedule/speaker/10070
[4] M. Shepperd and G. Kadoda, “Comparing Software Prediction Techniques Using Simulation,” IEEE
Trans. Softw. Eng., vol. 27, no. 11, pp. 1014–1022, 2001.
[5] I. Myrtveit, E. Stensrud, and M. Shepperd, “Reliability and validity in comparative studies of
software prediction models,” IEEE Trans. Softw. Eng., vol. 31, no. 5, pp. 380–391, May 2005.
[6] E. Alpaydin, “Techniques for combining multiple learners,” Proceedings of Engineering of Intelligent
Systems, vol. 2, pp. 6–12, 1998.
[7] D. Baker, “A hybrid approach to expert and model-based effort estimation,” Master’s thesis, Lane
Department of Computer Science and Electrical Engineering, West Virginia University, 2007.
[8] E. Kocaguneli, Y. Kultur, and A. Bener, “Combining multiple learners induced on multiple datasets
for software effort prediction,” in International Symposium on Software Reliability Engineering (ISSRE),
2009, student Paper.
[9] T. M. Khoshgoftaar, P. Rebours, and N. Seliya, “Software quality analysis by combining multiple
projects and learners,” Software Quality Control, vol. 17, no. 1, pp. 25–49, 2009.
[10] F. Walkerden and R. Jeffery, “An empirical study of analogy-based software effort estima- tion,”
Empirical Software Engineering, vol. 4, no. 2, pp. 135–158, 1999.
[11] E. Mendes, I. D. Watson, C. Triggs, N. Mosley, and S. Counsell, “A comparative study of cost
estimation models for web hypermedia applications,” Empirical Software Engineering, vol. 8, no. 2, pp.
163–196, 2003.
[12] E. Mendes and N. Mosley, “Further investigation into the use of CBR and stepwise regression to
predict development effort for web hypermedia applications,” in International Symposium on Empirical
Software Engineering, 2002.
[13] B. Turhan, T. Menzies, A. Bener, and J. Di Stefano, “On the relative value of cross-company and
within-company data for defect prediction,” Empirical Software Engineering, vol. 14, no. 5, pp. 540–
578, 2009.
[14] B. A. Kitchenham, E. Mendes, and G. H. Travassos, “Cross versus within-company cost
estimation studies: A systematic review,” IEEE Trans. Softw. Eng., vol. 33, no. 5, pp. 316– 329, 2007.
[15] B. W. Boehm, C. Abts, A. W. Brown, S. Chulani, B. K. Clark, E. Horowitz, R. Madachy, D. J.
Reifer, and B. Steece, Software Cost Estimation with Cocomo II. Upper Saddle River, NJ, USA:
Prentice Hall PTR, 2000.
[16] A. Albrecht and J. Gaffney, “Software function, source lines of code and development effort
prediction: A software science validation,” IEEE Trans. Softw. Eng., vol. 9, pp. 639–648, 1983.
[17] K. Lum, J. Powell, and J. Hihn, “Validation of spacecraft cost estimation models for flight and
ground systems,” in ISPA’02: Conference Proceedings, Software Modeling Track, 2002.
[18] M. Jorgensen and M. Shepperd, “A systematic review of software development cost estimation
studies,” IEEE Trans. Softw. Eng., vol. 33, no. 1, pp. 33–53, 2007.
[19] B. A. Kitchenham, E. Mendes, and G. H. Travassos, “Cross versus within-company cost
estimation studies: A systematic review,” IEEE Trans. Softw. Eng., vol. 33, no. 5, pp. 316– 329, 2007.
[20] A. Ross, “Information fusion in biometrics,” Pattern Recognition Letters, vol. 24, no. 13, pp. 2115–
2125, Sep. 2003.
[21] Raymond P. L. Buse, Thomas Zimmermann: Information needs for software development
analytics. ICSE 2012: 987-996
[22] Spareref.com. Nasa to shut down checkout & launch control system, August 26, 2002.
http://www.spaceref.com/news/viewnews.html?id=475.
[23] Standish Group (2004). CHAOS Report. West Yarmouth, Massachusetts: Standish Group.
[24] U. Lipowezky, Selection of the optimal prototype subset for 1-NN classification, Pattern
Recognition Lett. 19 (1998) 907–918.
[25] Hyunchul Ahn, Kyoung-jae Kim, Ingoo Han, A case-based reasoning system with the two-
dimensional reduction technique for customer classification, Expert Systems with Applications, Volume
32, Issue 4, May 2007, Pages 1011-1019, ISSN 0957-4174, 10.1016/j.eswa.2006.02.021.
[26] Elke Achtert, Christian Böhm, Peer Kröger, Peter Kunath, Alexey Pryakhin, and Matthias Renz.
2006. Efficient reverse k-nearest neighbor search in arbitrary metric spaces. In Proceedings of the
2006 ACM SIGMOD international conference on Management of data (SIGMOD '06)
[27] E. Levina and P.J. Bickel. Maximum likelihood estimation of intrinsic dimension. In Advances in
Neural Information Processing Systems, volume 17, Cambridge, MA, USA, 2004. The MIT Press.
38




Detail Slides
39




Pre-processors and learners
40

    What is the best effort estimation
             method? (cntd.)
1. Rank methods acc. to win, loss
       and win-loss values

  2. δr is the max. rank change

3. Sort methods acc. to loss and
        observe δr values
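A minimal sketch of the three steps above; the way ranks are assigned and δr is computed is an assumed reconstruction, not the paper's exact procedure:

```python
import numpy as np

def ranking_stability(win, loss):
    """Rank methods by wins, losses and win-loss, compute delta_r (the
    maximum rank change per method), and sort the methods by losses."""
    win = np.asarray(win, dtype=float)
    loss = np.asarray(loss, dtype=float)

    def ranks(scores, descending):
        order = np.argsort(-scores) if descending else np.argsort(scores)
        r = np.empty(len(scores), dtype=int)
        r[order] = np.arange(1, len(scores) + 1)
        return r

    r_win = ranks(win, descending=True)          # many wins  -> good rank
    r_loss = ranks(loss, descending=False)       # few losses -> good rank
    r_wl = ranks(win - loss, descending=True)    # high win-loss -> good rank
    all_r = np.vstack([r_win, r_loss, r_wl])
    delta_r = all_r.max(axis=0) - all_r.min(axis=0)   # max rank change per method
    return delta_r, np.argsort(loss)                  # methods sorted by loss
```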
41

What is the best effort estimation
         method? (cntd.)
What about aggregate results reflecting on specific
       scenarios? (question of a reviewer)
      Sort methods according to increasing MdMRE
    Group MRE values that are statistically the same

                                         Highlighted are the
                                       cases, where superior-
                                       methods do not occur
                                          in the top group

                                         Note how superior
                                           solo-methods
                                       correspond to the best
                                        (lowest MRE) groups
42

     How can we improve ABE methods?
                  (cntd.)

We used kernel weighting
with 4 kernels with 5
bandwidth values plus
IRWM to weigh selected
analogies (5 different k
values)



 A total of 2090 settings:
 • 19 datasets * 5 k-values = 95
 • 19 datasets * 5 k values * 4 kernels * 5 bandwidths = 1900
 • IRWM: 19 datasets * 5 k values = 95
43

    How can we improve ABE methods?
                 (cntd.)
We used kernel weighting
to weigh selected
analogies

  Compare performance of
  each k-value with and
  without weighting.

 • o = tie for 3 or more k values
 • - = loss for 3 or more k values
 • + = win for 3 or more k values

 In none of the scenarios did we
 see a significant improvement
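The win/tie/loss bookkeeping used here and elsewhere in the deck can be sketched as below; the choice of a rank-based test (Mann-Whitney) and alpha = 0.05 are assumptions, not necessarily the exact test used in the study:

```python
import numpy as np
from scipy.stats import mannwhitneyu

def win_tie_loss(errors_a, errors_b, alpha=0.05):
    """Compare two error samples: a statistically indistinguishable pair
    ties; otherwise the method with the lower median error wins."""
    _, p = mannwhitneyu(errors_a, errors_b, alternative="two-sided")
    if p >= alpha:
        return "tie"
    return "a wins" if np.median(errors_a) < np.median(errors_b) else "b wins"
```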
44

How to handle lack of local data?
            (cntd.)
TEAK on proprietary data   TEAK on public data
45



Do I have to use size attributes? (cntd.)
    Can standard methods tolerate the lack of size attributes?

 CART w/o size vs. CART w/ size                CART and 1NN
46



8    How should I choose the right SM?
Only one work (Kitchenham2007 [14]) discusses the implications of the
sampling method (SM) on bias and variance.

The expectation is:
  LOO: high variance, low bias
  3Way: low variance, high bias
  10Way: in between

Does the expectation hold? What about run time and ease of replication?
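As a reference for the sampling-method discussion, a minimal sketch of leave-one-out (LOO) validation of an effort estimator; the 1NN estimator is only an illustrative choice:

```python
import numpy as np

def loo_mmre(X, effort, estimator):
    """Hold out each project in turn, estimate it from the rest, and
    return the mean MRE; contrast with 3-way or 10-way cross-validation
    to study the bias-variance trade-off."""
    X = np.asarray(X, dtype=float)
    effort = np.asarray(effort, dtype=float)
    mres = []
    for i in range(len(X)):
        mask = np.arange(len(X)) != i
        pred = estimator(X[mask], effort[mask], X[i])
        mres.append(abs(effort[i] - pred) / effort[i])
    return float(np.mean(mres))

def one_nn(train_X, train_y, test_x):
    # the closest analogue's effort is the estimate
    d = np.linalg.norm(train_X - test_x, axis=1)
    return train_y[np.argmin(d)]

# e.g. loo_mmre(X, effort, one_nn)
```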
