Project examples for sampling and the law of large numbers

Examples for the Project Manager
SAMPLING AND THE LAW OF LARGE NUMBERS
The Law of Large Numbers, LLN, tells us it’s possible to estimate
certain information about a population from just the data
measured, calculated, or observed from a sample of the population.

Sampling saves the project manager time and money, but
introduces risk. How much risk for how much savings?

The answer to these questions is the subject of this paper.

A whitepaper by
John C. Goodpasture, PMP
Managing Principal
Square Peg Consulting, LLC

Sampling and the Law of Large Numbers
Examples for the Project Manager

The Law of Large Numbers, LLN, tells us it’s possible to
estimate certain information about a population from just the data
measured, calculated, or observed from a sample of the population.

A population is any frame of like entities. For statistical purposes,
entities should be individually independent and subject to identical
distributions of values of interest.
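
The idea can be illustrated with a small simulation. This is only a sketch under stated assumptions: the population is imagined as normally distributed with a mean of 100 and a standard deviation of 20 (both values invented for illustration), and `typical_error` is a hypothetical helper name. It shows the average gap between the sample average and the true mean shrinking as the sample size N grows.

```python
import random
import statistics

def typical_error(n, trials=100, mean=100.0, sd=20.0, seed=2010):
    """Average absolute gap between the sample average and the true mean."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    gaps = []
    for _ in range(trials):
        # n independent draws from the same (assumed normal) distribution
        sample = [rng.gauss(mean, sd) for _ in range(n)]
        gaps.append(abs(statistics.fmean(sample) - mean))
    return statistics.fmean(gaps)

for n in (10, 100, 1_000, 10_000):
    print(f"N = {n:>6}: typical error of the sample average = {typical_error(n):.2f}")
```

Each tenfold increase in N cuts the typical error by roughly the square root of ten, which is why large improvements in precision become expensive.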

Sampling saves project managers a lot of time and money:

• Obtains practical and useful results even when it is not
economical to obtain and evaluate every data point in a
population.
• Extends the project access even though it may not be
practical to reach every member of the population.
• Provides actionable information even when it is not
possible to know every member of the population.
• Avoids spending too much time to observe, measure, or
interview every member of the population.
• Avoids collecting too much data to handle even if every
member of the population were readily available, to
include expense of data handling and timeliness of data
handling.

Analysis by sampling is called ‘drawing an inference’, and the branch of statistics from which it
comes is called ‘inferential statistics’. Drawing an inference is similar to ‘inductive reasoning’.

In both cases, inference and induction, one works from a set of specific observations back to the
more general case, or to the rules that govern the observations.

What about risks?
Sampling introduces risk into the project:
• Risk that the data sample may not accurately portray the population: there may
be inadvertent exclusions, clusters, strata, or other population attributes not
understood and accounted for.
• Risk that some required information in the population may not be sampled at all;
thus the sample data may be deficient or may misrepresent the true condition of
the population.
• Risk that in other situations, the data in the sample are outliers and misrepresent
their true relationship to the population; the sample may not be discarded when it
should be.

©Copyright 2010 John C Goodpasture

Risk assessments
There are two risk assessments to be made. Examples in this paper will illustrate these
two assessments.

1. “Margin of error”, which refers to the estimated error around the measurement,
observation, or calculation of statistics within the interval of the sample data.
Margin of error is the percentage of the interval relative to the statistic being
estimated:

% error = (Interval / Average) x 100, or
% error = (Interval / Proportion) x 100

Here ‘Interval’ is the +/- half-width of the interval, as in the appendix formulas.

Because margin of error is a ratio, the risk manager actually has to be concerned with both
the numerator and the denominator: for small statistical values [for a small denominator]
the interval [the numerator] must be likewise small, and a small interval is achieved by
having a large sample size, N.

2. “Confidence interval”, which refers to the interval within which the true
population parameters are likely to lie with a specified probability.

Confidence intervals have their own risk. The principal risk is that the sample
misrepresents the population. If confidence is stated as 95% for some interval, then there
is a 5% chance that the true population parameter lies outside the interval. Consider this
case: a population with a parameter real value of 8 is sampled [of course, this fact, the
real value of 8, is unknown to the project team]. But, also unknown to the project team,
for example, the sample may be influenced by some infrequent outliers in the population.
From the sample data the sample average may be calculated to be 10. The question is:
what is the quality of this metric value? We will use confidence interval and margin of
error as surrogates for sample statistics quality.
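
The margin-of-error ratio from item 1 can be sketched in a few lines. The function name and the numbers are illustrative assumptions, not values from this paper; the point is only that the same interval half-width produces a much larger percentage error when the statistic being estimated is small.

```python
def margin_of_error_pct(half_width, statistic):
    """Margin of error as a percentage of the statistic being estimated."""
    return 100.0 * half_width / statistic

# Same +/- 0.03 half-width, two different estimated proportions (hypothetical values)
for p in (0.7, 0.1):
    print(f"p = {p}: margin of error = +/- {margin_of_error_pct(0.03, p):.1f}%")
```

To hold the percentage constant for the smaller proportion, the half-width itself has to shrink, which in practice means a larger N.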

Sample design and sample risk
Would more trials of the same sample size improve quality? Perhaps. However, the
definition of confidence covers the case: Of all the sample intervals obtained in multiple
trials, 95% of them will contain the true population parameter; or, for only one trial, there
is a 95% chance that the true population parameter is within the interval of that trial.
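
The "multiple trials" reading of confidence can be checked by simulation. The sketch below is an illustration under assumptions: a true proportion of 0.5, samples of 500, and 2,000 repeated trials are invented values, and the function names are hypothetical. Each trial builds a 95% interval with the appendix formula (using Z = 1.96) and records whether it contains the true proportion.

```python
import math
import random

def interval_covers(true_p, n, z, rng):
    """Draw one sample of size n; report whether its interval contains true_p."""
    hits = sum(1 for _ in range(n) if rng.random() < true_p)
    p_hat = hits / n
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width <= true_p <= p_hat + half_width

def coverage(true_p=0.5, n=500, trials=2000, z=1.96, seed=1):
    """Fraction of repeated trials whose interval contains the true proportion."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    covered = sum(interval_covers(true_p, n, z, rng) for _ in range(trials))
    return covered / trials

print(f"Share of trial intervals containing the true proportion: {coverage():.3f}")
```

The share should land close to 0.95, with the remaining trials being the ones in which the sample misrepresents the population.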


Generally, to reduce risk, the sample size, N, is made larger, rather than independently
resampling the same population with the same size sample.

Deciding upon the sample size, meaning the value of N, introduces a tension between
the project’s budget and/or schedule managers, and the risk managers. Tension is another
word for risk.

• Budget managers want to limit the cost of gathering more data than is needed and
thereby limit cost risk; in other words, avoid oversampling.
• Risk managers want to limit the impact of not having enough data and thereby
limit functional, feature, or performance risk.

Sampling policy
The risk plan customarily invokes a project management policy regarding the degree of
risk that is acceptable:
• “Margin of error” is customarily accepted between +/- 3 to 5%
• “Confidence Interval” is customarily a pre-selected percentage between 80 and
99%, most commonly 95% or 99%.
The sampling protocol for a given project is designed by the risk manager to support
these policy objectives.

General examples
Below are several population examples that are common in project situations. They fall
into one of two population types, discrete proportions and continuous data.
• Project managers and the project office often deal with proportions.
• Project control account managers and team leaders often deal with “continuous
data”.

1. Populations of categorical data characterized with proportions: Proportional data is
sometimes called ‘categorical data’ or ‘category data’; proportions are a form of ‘count’
data. Proportions are formed from the ratio of the count.

In Six Sigma, such category data is called ‘attribute data’. For example, a
semiconductor wafer fits either into a category of ‘defect free’ or into another category of
‘defective’. The metric is the count in each category.

Proportion is often notated as ‘p’ for the proportional count in one category, and ‘1-p’ for
the other. ‘1-p’ is sometimes denoted ‘q’.

The true proportion, p, is often unknown. An estimate of p is measurable but the
estimate is probabilistic and thus has statistical characteristics.


The underlying entity is often not quantitative. In other words, we speak of the average
proportion of defects, but not the average defect.

2. Populations of continuous data: Continuous data is measured on a continuous number
scale. Continuous data from one measurement can be compared with other continuous
data and can be manipulated with arithmetic operations. The ‘distance’ between one
point on the scale and another has a real meaning, not just a relative position as on an
ordinal scale.

Continuous data is descriptive: the data values describe features and attributes, like size,
weight, density, and the like. Collections or sets of continuous data values are
characterized with descriptive statistics, like average weight, or average hours of
experience; and other statistics that can be calculated from data, like standard deviation
and variance.

Six Sigma refers to such populations as having ‘continuous or variable data’ metrics,
referring to the idea that such metrics can be measured on a continuous scale.

Examples of Categorical data populations characterized with proportions:

Opinion: A proportion of Users/operators/maintenance and support/beneficiaries
who have one opinion or another about a feature or function.

Defects and pass/fail: A proportion of devices or objects have a defect, and others do not,
or possess an attribute that pass/fails some metric limit, like power consumed.
Typically, pass/fail results are observed in a number of independent
tests, inspections, or ‘trials’.

Classification and position: A proportion of devices or objects that are of a certain
type/category/classification.
This situation comes up often in database projects where database
records may or may not meet a specific type classification.
But all manner of tangible objects also have type classifications,
such as hard wood or soft wood, steel or stainless steel.
A proportion of devices that are positioned above, between, or below some
‘critical’ boundary, like a quartile or percentile limit.

Examples of Continuous measurement populations

Objects with measurable attributes: Average age of a user group, average drying time of a
coating, average time to code a design object, or average time to repair an object.
Average difference between user groups, drying time, coding time, or repair
time of one or another object.
Average distance to [or between] object coordinates.

Process events and Opportunity:
Process example: A process for which the arrival rate of events, like a
trigger or a device failure, or the count of events in a unit of time or space
is important.
In a web commerce project, an example is the arrival rate of
customers to the product ordering page.
Opportunity example: An ‘area’ in which events can occur.
In a chemical development project, an event could be the
appearance, yes or no, of a certain molecule after some process
activity; the measurable opportunity is the count of a certain
molecule per cubic centimeter.

Project Estimates
Regardless of the nature of the population, the issues for the project manager are the
same:

• Effort: How much effort will sampling take?
The LLN tells us the sample statistics will be ‘good enough’ if the sample is
‘large enough’. For project managers the question is: How large is ‘large
enough’?
• Impact: What is the impact of the risk to be mitigated?
Confidence statistics and margins of error of the sample provide the ranges of the
impact.

Risk management and estimating rules of thumb

Population size: The actual size of the population is irrelevant, so long as it is ‘large’
compared to the sample. Population size is not used in estimates, even if
known, unless the population is ‘small’ when compared to the sample size.

Sample size [count of values]: Sample size [count of values in the sample] is driven by
risk tolerance for the possible error in the sample results. A larger count reduces error
possibilities. There are formulas for sample size that take into account risk tolerance.

Margin of error: The margin of error in the estimated statistic improves with increasing count
of data values in the sample.

Population parameter confidence: Confidence that the actual population parameter is within
the sample data interval improves as the interval is made wider for a given number of
sample values.
Thus, for a sample of 30 values, the confidence interval for 99% confidence is
wider than for 90% confidence.

Common intervals: The most common confidence intervals are 80, 90, 95, and 99%.

Estimating proportional parameters
Sample proportion notation:
• One category is given a proportion notated ‘p’.
• ‘1-p’ notates the sample proportion of the other category [sometimes ‘1-p’ is
denoted as ‘q’].

Project example with proportion:
Project description: Let’s say that a project deliverable is a database for which over 10M
data records are to be loaded from a very much larger library [population]. Depending
on the mix of categories of data records in the population, the scheduling manager will
schedule more loading time if mostly Category-1, or less time if not mostly Category-1.

The project manager elects to sample the data record population to determine the
proportionality, p, of records that are Category-1 so that the scheduling manager has
information to guide project scheduling.

The project risk management plan requires estimates to have 95% confidence for design
parameters, and a margin of error of less than +/- 5% on sample data values.

Sample design: With no a priori hypothesis of the expected proportionality of ‘p’, some
iteration may be required. A good starting point is to assume p = 0.5. The risk manager
refers to the chart given in the appendix entitled “Proportion ‘p’ vs +/- Margin of Error
%” that is a plot of error percentage for a confidence of 95%. From that chart, the risk
manager finds that for a +/- 5% margin of error of ‘p’ with 95% confidence a sample size
greater than 1,000 but smaller than 3,000 is needed.

Solving the margin of error equation in the appendix for N, with p = 0.5, Z = 1.96, and a
5% margin of error, in fact gives 1,536 as the appropriate starting point for N.
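
The rearrangement can be sketched as follows. Setting the appendix margin of error, Z * √[p * (1 - p) / N] / p, equal to the target margin and solving for N gives N = [(1 - p) / p] * (Z / margin)^2. The function name below is a hypothetical choice, and rounding up to a whole record is an assumption of this sketch; the unrounded value for p = 0.5 is about 1,536.6.

```python
import math

def sample_size_for_margin(p, margin, z=1.96):
    """Smallest whole N for which z * sqrt(p * (1 - p) / N) / p <= margin."""
    return math.ceil(((1 - p) / p) * (z / margin) ** 2)

# Starting assumption from the example: p = 0.5, +/- 5% margin, 95% confidence
print(sample_size_for_margin(0.5, 0.05))
# A larger assumed proportion needs fewer records for the same relative margin
print(sample_size_for_margin(0.7, 0.05))
```

Because the margin is stated relative to p, the required N grows quickly as p becomes small, which is why p = 0.5 is only a starting point when nothing is known in advance.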

Starting with N = 1,536, if the first sample returns a ‘p’ value that is 0.5 or greater, the
margin of error is likely less than +/- 5%; no further sampling is required. Otherwise, a
larger sample size is required.

Sample analysis: Assume the sample returns a value of ‘p’ of 0.7. From the confidence
interval equation for proportions given in the appendix, the 95% confidence interval for
the estimated proportion is calculated to be 67% to 73%, centered on 70%.

Risk management analysis: There is a 5% probability that the proportion ‘p’ is not
within the confidence interval of 67% to 73%. There is not enough information to
forecast whether the proportion ‘p’ is more likely less than 67% or greater than 73%.

From the chart in the appendix for margin of error, the margin of error of the
proportionality value 0.7 is about +/- 4.7 %, or +/- 0.032, from 0.668 to 0.732.
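
As a cross-check, the appendix formulas can be applied directly to this example. The sketch assumes N = 1,536, p = 0.7, and Z = 1.96; the function name is hypothetical. With these inputs the formula gives an interval of roughly 0.677 to 0.723 and a margin near +/- 3.3%, somewhat tighter than the chart-read figures in the text, and still inside the policy limits.

```python
import math

def proportion_interval(p_hat, n, z=1.96):
    """Return (low, high, half_width) of the interval p +/- z * sqrt(p(1-p)/n)."""
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width, half_width

low, high, half = proportion_interval(0.7, 1536)
relative_margin_pct = 100 * half / 0.7  # half-width as a percentage of p

print(f"95% interval: {low:.3f} to {high:.3f}")
print(f"Margin of error: +/- {relative_margin_pct:.1f}% of p")
```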


The sample data supports the project risk tolerance policy objectives of 95% confidence
and < +/- 5% margin of error.

Estimating continuous data parameters:
Project example with descriptive statistics
Project description: Let’s say that a project deliverable is an ejector seat for a military
aircraft; the average weight of the pilot population needs to be known for the design.

The project manager elects to sample the pilot population rather than weigh every pilot.

The project risk management plan requires estimates to have 95% confidence for design
parameters and +/- 3% margin of error for sample data statistics.

Sample Frame: From the chart in the appendix entitled “% Margin of Error v N, 95%
Confidence” the risk manager finds that a sample of size 85 is required to meet the +/-
3% policy metric and simultaneously meet the 95% confidence interval metric. So, in this
example, 85 pilots are weighed from a population frame of active duty military pilots,
both men and women.

Assume the sample average is found from the sample data to be 175 lbs [79.4 kg], and
the Sample σ is calculated from the sample data by spreadsheet function. Assume the
Sample σ is calculated to be 25 lbs [11.3 kg].

Sample analysis: From the equation given in the appendix for continuous data, the 95%
confidence interval for the estimated average weight of the pilot population is estimated
to be about +/- 5.4 lbs [+/- 2.4 kg], or from 169.6 to 180.4 lbs [76.9 to 81.8 kg].

Risk management analysis: There is a 5% probability that the average pilot weight is
not within the confidence interval.

The sample average of 175 pounds is estimated to have a margin of error of +/- 3%, or
+/- 5.2 pounds [+/- 2.4 kg].
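
The arithmetic of this example can be reproduced with the appendix equation for continuous data. This is a sketch only: the function name is hypothetical, and the factor of 2 is the rounded 95% value used in the appendix.

```python
import math

def mean_interval(sample_mean, sample_sd, n, factor=2.0):
    """Return (low, high, half_width) for sample_mean +/- (factor / sqrt(n)) * sample_sd."""
    half_width = factor * sample_sd / math.sqrt(n)
    return sample_mean - half_width, sample_mean + half_width, half_width

# Values from the example: 85 pilots, average 175 lbs, Sample sigma 25 lbs
low, high, half = mean_interval(175.0, 25.0, 85)
margin_pct = 100 * half / 175.0

print(f"95% interval: {low:.1f} to {high:.1f} lbs (+/- {half:.1f} lbs)")
print(f"Margin of error: +/- {margin_pct:.1f}%")
```

The half-width comes out at about 5.4 lbs, which is roughly 3% of the 175 lb average.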


Acknowledgement
The author is indebted to Dr. Walter P. Bond, Associate Professor (retired) of Florida
Institute of Technology for suggestions and peer review.

Appendix

Proportional category data

Confidence Interval, proportions
The following equations define the confidence interval for varying confidence objectives,
where √ is the symbol for ‘square root’ [sqrt]. Note the square root is multiplied by a
numerical factor. The numerical factor is a so-called ‘Z’ number taken from the standard
normal bell curve. Z = 1 corresponds to one standard deviation, σ. Z values typically
range +/- 3 about the standard normal mean of ‘0’. In order to use this equation, ‘p’
cannot be very close to 0 or 1 since the validity of the equation depends on ‘p’ being in the
mid-range of the confidence probability.

80% Interval = p +/- 1.3 * √[p * (1 - p) / N]
90% Interval = p +/- 1.7 * √[p * (1 - p) / N]
95% Interval = p +/- 2 * √[p * (1 - p) / N]
99% Interval = p +/- 2.7 * √[p * (1 - p) / N]

The confidence objective, expressed as a %, is read as, for example, 80% confidence
the real population parameter is within the interval, and 20% confidence the real
population parameter is outside the interval.
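
The numerical factors above are rounded. For comparison, the two-sided factors from the standard normal curve can be computed with Python's standard library; the helper name is a hypothetical choice for this sketch.

```python
from statistics import NormalDist

def z_for_confidence(confidence):
    """Two-sided standard normal factor for a confidence level such as 0.95."""
    # Half of the excluded probability sits in each tail
    return NormalDist().inv_cdf(0.5 + confidence / 2)

for level in (0.80, 0.90, 0.95, 0.99):
    print(f"{level:.0%} confidence: Z = {z_for_confidence(level):.3f}")
```

The 95% factor comes out at about 1.96. The 99% factor comes out near 2.58, a little below the 2.7 used above, so the 99% equation as written gives a slightly wider interval than the normal curve alone would require.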

Margin of Error, proportions
The following chart is a plot of three different sample sizes, N, showing the margin of
error as the proportion ‘p’ changes. This chart is based on the formula for margin of error
given below:

+/- Margin of Error = ½ Interval width / p
Where ½ Interval width = +/- Z * √[p * (1 - p) / N]
And where
Z = 2* for 95% confidence
* more precisely: 1.96
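
The margin of error formula can be tabulated for a few sample sizes. In this sketch the three values of N (1,000, 1,536, and 3,000) and the three proportions are assumptions chosen for illustration, the function name is hypothetical, and Z = 1.96 is used.

```python
import math

def relative_margin_pct(p, n, z=1.96):
    """Margin of error as a percentage of p: (z * sqrt(p(1-p)/n)) / p * 100."""
    return 100 * z * math.sqrt(p * (1 - p) / n) / p

for n in (1000, 1536, 3000):  # assumed sample sizes
    row = ", ".join(f"p={p:.1f}: +/-{relative_margin_pct(p, n):.1f}%" for p in (0.3, 0.5, 0.7))
    print(f"N = {n}: {row}")
```

For a fixed N the percentage margin falls as p rises, and for a fixed p it falls as N rises.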


Continuous data and descriptive statistics
Confidence Interval
The following equations give approximations of the interval range. ‘N’ is the count of
data values in the sample; √N is the square root of ‘N’. The numerical factor in the
numerator comes from a table of ‘t’ values that are developed by statisticians for
sampling analysis. The ‘t’ value depends on the count of the sample points. The ‘t’ value
typically ranges +/- 3; it is taken from the T-distribution that is approximately Normal.

80% Interval = Sample average +/- (1.3 / √N) x Sample σ [narrowest interval]
90% Interval = Sample average +/- (1.7 / √N) x Sample σ
95% Interval = Sample average +/- (2 / √N) x Sample σ
99% Interval = Sample average +/- (2.7 / √N) x Sample σ [widest interval]

Note: The Sample σ, or sample standard deviation, is calculated, usually by spreadsheet
function, from the sample data.
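
Outside a spreadsheet, the same calculation can be sketched with Python's standard library. The ten data values below are invented purely for illustration, the function name is hypothetical, and the factor of 2 is the rounded 95% value from the equations above; for a sample this small a value from a ‘t’ table would be somewhat larger.

```python
import math
import statistics

# Hypothetical measurements, invented for illustration only
weights_lbs = [162, 181, 175, 190, 168, 171, 185, 158, 177, 183]

def interval_from_data(values, factor=2.0):
    """Return (low, high) for Sample average +/- (factor / sqrt(N)) * Sample sigma."""
    n = len(values)
    average = statistics.mean(values)
    sigma = statistics.stdev(values)  # sample standard deviation (n - 1 divisor)
    half_width = factor * sigma / math.sqrt(n)
    return average - half_width, average + half_width

low, high = interval_from_data(weights_lbs)
print(f"Approximate 95% interval: {low:.1f} to {high:.1f} lbs")
```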


Margin of error
The margin of error is based on the following equation:

Margin of error = +/- [‘t’ * Sample σ / √N] / Sample average
Where
‘t’ = 2* for 95% confidence interval

*more precisely: 1.96

The following is a plot of the margin of error as a function of the sample size, N.
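
The relationship between margin of error and N can also be tabulated from the equation. The sketch below borrows the pilot example's values (average 175 lbs, Sample σ 25 lbs) and the rounded factor of 2 as assumptions; the function name and the list of sample sizes are hypothetical.

```python
import math

def margin_pct_for_mean(sample_mean, sample_sd, n, factor=2.0):
    """Margin of error (%) = factor * sample_sd / sqrt(n) / sample_mean * 100."""
    return 100 * factor * sample_sd / math.sqrt(n) / sample_mean

for n in (10, 30, 85, 300, 1000):  # assumed sample sizes
    print(f"N = {n:>4}: margin of error = +/- {margin_pct_for_mean(175.0, 25.0, n):.1f}%")
```

The margin shrinks in proportion to 1/√N, so halving the margin of error takes about four times as many data values.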




John C. Goodpasture, PMP and Managing Principal at
Square Peg Consulting, is a program manager, coach,
author, and project consultant specializing in technology
projects with emphasis on quantitative methods, project
planning, and risk management.

His career in program management has spanned the U.S.
Department of Defense; the defense, intelligence, and
aerospace industry; and the IT back office where he led
several efforts in ERP systems.

He has coached many project teams in the U.S., Europe,
and Asia.

John is the author of numerous books, magazine articles,
and web logs in the field of project management, the most
recent of which is “Project Management the agile way:
Making it work in the enterprise”.

He blogs at johngoodpasture.com, and his work products
are found in the library at www.sqpegconsulting.com.

