Central limit. of data bioinformatic studies

CENTRAL LIMIT THEOREM
INFERENCE
Critical Regions
Neha Jain
School of Biotechnology

Types of Biological Variables
• There are three main types of variables: measurement variables,
which are expressed as numbers (such as 3.7 mm); nominal
variables, which are expressed as names (such as "female"); and
ranked variables, which are expressed as positions (such as
"third").
• Measurement variables
• Measurement variables are, as the name implies, things you can
measure. An individual observation of a measurement variable is
always a number. Examples include length, weight, pH, and bone
density. Other names for them include "numeric" or
"quantitative" variables.

• Nominal variables
• Nominal variables classify observations into discrete categories. Examples of nominal
variables include sex (the possible values are male or female), genotype (values
are AA, Aa, or aa), or ankle condition (values are normal, sprained, torn ligament, or
broken).
• A good rule of thumb is that an individual observation of a nominal variable can be
expressed as a word, not a number. If you have just two values of what would
normally be a measurement variable, it's nominal instead: think of it as "present" vs.
"absent" or "low" vs. "high".
• Nominal variables are often used to divide individuals up into categories, so that
other variables may be compared among the categories. In the comparison of head
width in male vs. female isopods, the isopods are classified by sex, a nominal
variable, and the measurement variable head width is compared between the sexes.

• Independent and dependent variables
• Another way to classify variables is as independent or dependent variables. An independent
variable (also known as a predictor, explanatory, or exposure variable) is a variable that you
think may cause a change in a dependent variable (also known as an outcome or response
variable).
• For example, if you grow isopods with 10 different mannose concentrations in their food
and measure their growth rate, the mannose concentration is an independent variable and
the growth rate is a dependent variable, because you think that different mannose
concentrations may cause different growth rates. Any of the three variable types
(measurement, nominal or ranked) can be either independent or dependent.
• For example, if you want to know whether sex affects body temperature in mice, sex would
be an independent variable and temperature would be a dependent variable. If you wanted
to know whether the incubation temperature of eggs affects sex in turtles, temperature
would be the independent variable and sex would be the dependent variable.

• The normal distribution
• Many measurement variables in biology fit the normal distribution fairly
well.
• According to the central limit theorem, if you have several different
independent variables that each have some distribution of values and add them
together, the sum follows the normal distribution fairly well. Whatever the
shape of the distribution of the individual variables, the sum will still be
approximately normal. The distribution of the sum fits the normal
distribution more closely as the number of variables increases. The graphs
below are frequency histograms of 5,000 numbers. The first graph shows
the distribution of a single number with a uniform distribution between 0
and 1. The other graphs show the distributions of the sums of two, three, or
four random numbers.

• As you can see, as more random numbers are added together, the frequency
distribution of the sum quickly approaches a bell-shaped curve.
• This is analogous to a biological variable that is the result of several different factors. For
example, let's say that you've captured 100 lizards and measured their maximum
running speed. The running speed of an individual lizard would be a function of its
genotype at many genes; its nutrition as it was growing up; the diseases it's had; how
full its stomach is now; how much water it's drunk; and how motivated it is to run fast
on a lizard racetrack. Each of these variables might not be normally distributed; the
effect of disease might be to either subtract 10 cm/sec if it has had lizard-slowing
disease, or add 20 cm/sec if it has not; the effect of gene A might be to add 25 cm/sec
for genotype AA, 20 cm/sec for genotype Aa, or 15 cm/sec for genotype aa. Even though
the individual variables might not have normally distributed effects, the running speed
that is the sum of all the effects would be approximately normally distributed.
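
The histograms described above can be approximated with a short simulation. This is a minimal sketch that assumes only the Python standard library. The function names `simulate_sums` and `excess_kurtosis` are illustrative choices, not part of the original slides. Excess kurtosis is used here as a rough numerical stand-in for "how bell-shaped" a histogram looks: it is 0 for a normal distribution and -1.2 for a uniform one.

```python
import random
import statistics


def simulate_sums(k, draws=5000, seed=0):
    """Return `draws` values, each the sum of k uniform(0, 1) random numbers."""
    rng = random.Random(seed)
    return [sum(rng.random() for _ in range(k)) for _ in range(draws)]


def excess_kurtosis(values):
    """Population excess kurtosis: 0 for a normal, -1.2 for a uniform distribution."""
    mean = statistics.fmean(values)
    var = statistics.pvariance(values, mu=mean)
    fourth_moment = statistics.fmean((v - mean) ** 4 for v in values)
    return fourth_moment / var ** 2 - 3.0


# Sums of 1, 2, 3 and 4 uniform numbers, 5,000 values each (as in the slides).
for k in (1, 2, 3, 4):
    sums = simulate_sums(k)
    print(k, round(statistics.fmean(sums), 3), round(excess_kurtosis(sums), 3))
```

As more numbers are added together, the printed excess kurtosis should move from about -1.2 toward 0, which mirrors the histograms becoming more bell-shaped.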

CENTRAL LIMIT THEOREM
• In probability theory, the central limit theorem (CLT) states that, given certain
conditions, “the mean of a sufficiently large number of independent random
variables, each with finite mean and variance, will be approximately normally
distributed.”
• The Central Limit Theorem states that if you draw a sample from a population and
calculate the mean of the sample, and then repeat this many times, the sample means will
be approximately normally distributed around the true mean of the original population. This
means that even if the original population has a wild distribution, the means of repeated
samples cluster around the true mean, and more tightly so as the sample size grows.
• When sampling is from a normal population, the means of samples drawn from
such a population are themselves normally distributed.
• But when sampling is not from a normal population, the size of the sample plays a
critical role.
• When n is small, the shape of the distribution will depend largely on the shape of
the parent population, but as n gets large (a common rule of thumb is n > 30), the shape
of the sampling distribution will become more and more like a normal distribution,
irrespective of the shape of the parent population.
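
The role of sample size can be illustrated with a small simulation. This sketch assumes a strongly skewed parent population (exponential with mean 1), which is chosen only for illustration. The names `sample_means` and `skewness` are hypothetical helpers, not from the slides. Skewness is 0 for a symmetric distribution such as the normal.

```python
import random
import statistics


def sample_means(n, repeats=5000, seed=1):
    """Draw `repeats` samples of size n from an exponential population
    (mean 1) and return the mean of each sample."""
    rng = random.Random(seed)
    return [
        statistics.fmean(rng.expovariate(1.0) for _ in range(n))
        for _ in range(repeats)
    ]


def skewness(values):
    """Population skewness: 0 for a symmetric distribution such as the normal."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values, mu=mean)
    return statistics.fmean(((v - mean) / sd) ** 3 for v in values)


# Compare a small sample size with one above the n > 30 rule of thumb.
for n in (2, 40):
    means = sample_means(n)
    print(n, round(statistics.fmean(means), 3), round(skewness(means), 3))
```

With n = 2 the sample means should remain clearly skewed like the parent population, while with n = 40 the skewness should be much closer to 0.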

• The theorem which explains this sort of
relationship between the shape of the population
distribution and the sampling distribution of the
mean is known as the central limit theorem.
• This theorem is by far the most important
theorem in statistical inference.
• It assures that the sampling distribution of the
mean approaches normal distribution as the
sample size increases.

• In formal terms, the central limit theorem states that “the
distribution of means of random samples taken from a
population having mean µ and finite variance σ²
approaches the normal distribution with mean µ and
variance σ²/n as n goes to infinity.”
• The significance of the central limit theorem lies in the
fact that it permits us to use sample statistics to make
inferences about population parameters without knowing
anything about the shape of the frequency distribution of
that population other than what we can get from the
sample.
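
The formal statement can be written out in symbols. This sketch assumes the sample consists of n independent observations X_1, ..., X_n drawn from the same population. The notation \bar{X}_n for the sample mean and Z_n for its standardized form is introduced here for illustration.

```latex
\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i ,
\qquad
\operatorname{E}\!\left[\bar{X}_n\right] = \mu ,
\qquad
\operatorname{Var}\!\left(\bar{X}_n\right) = \frac{\sigma^2}{n}

Z_n = \frac{\bar{X}_n - \mu}{\sigma / \sqrt{n}}
\;\xrightarrow{\;d\;}\; N(0, 1)
\quad \text{as } n \to \infty
```

The quantity σ/√n is the standard error of the mean: quadrupling the sample size halves the spread of the sample means.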

INFERENCE
• An inference is the act of coming to a logical
conclusion without actually witnessing or
having first-hand knowledge of certain events.
• Biological network inference is the process of
making inferences and predictions about
biological networks.
• Authors don’t always tell every detail or give
every bit of information in nonfiction or in
fiction stories.

INFERENCE
• Readers make inferences to supply information
that authors leave out.
• When you make an inference, you add what you
already know to what an author has told you.
• Sometimes you have to come to a conclusion
when you don’t have all the facts.
• You can use the clues given to help you make an
inference (a guess based on known facts).

Examples
• What the author said + what I know = my inference
• "The weather had been scorching for weeks." + "Summer is the hottest time of the year." = "It is summer."
• "Alvin took out a pitcher of cold lemonade." + "You keep things cold in a refrigerator." = "Alvin took the lemonade out of the refrigerator."

Inference vs. Hypothesis
• Inference is a logical conclusion based on
experiments, and a hypothesis is what one thinks is
going to happen (an educated guess).
• In science, an inference refers to reasonable
conclusions or possible hypotheses drawn from a
small sampling of data.
• Scientists make inferences all the time. These may establish correlations, but
they do not prove causation. In fact, most “known” scientific facts are inferences, since it
would be impossible to gather all the evidence on a subject.

Critical Value(s)
• The critical value(s) for a hypothesis test is a
threshold to which the value of the test
statistic in a sample is compared to determine
whether or not the null hypothesis is rejected.
• The critical value for any hypothesis test
depends on the significance level at which the
test is carried out, and whether the test is
one-sided or two-sided.
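
How the critical value depends on the significance level and on the sidedness of the test can be sketched for a test statistic that follows the standard normal distribution (a z-test). That choice of distribution is an assumption made for illustration, and `z_critical_values` is a hypothetical helper name. Only `statistics.NormalDist` from the Python standard library is used.

```python
from statistics import NormalDist


def z_critical_values(alpha, tail="two-sided"):
    """Critical value(s) of a standard normal test statistic.

    alpha: significance level of the test (for example 0.05).
    tail:  "two-sided", "upper" (one-sided, right tail) or
           "lower" (one-sided, left tail).
    """
    std_normal = NormalDist()  # mean 0, standard deviation 1
    if tail == "two-sided":
        upper = std_normal.inv_cdf(1 - alpha / 2)  # alpha is split over both tails
        return (-upper, upper)
    if tail == "upper":
        return (std_normal.inv_cdf(1 - alpha),)
    if tail == "lower":
        return (std_normal.inv_cdf(alpha),)
    raise ValueError("tail must be 'two-sided', 'upper' or 'lower'")


print(z_critical_values(0.05, "two-sided"))
print(z_critical_values(0.05, "upper"))
print(z_critical_values(0.01, "two-sided"))
```

At the same significance level, the two-sided critical values lie further from zero than the one-sided value, because the rejection probability is divided between the two tails.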

Critical Region
The set of all values of the test statistic that would cause a rejection of the null hypothesis.

• The sample space for the test statistic is
partitioned into two regions; one region (the
critical region) will lead us to reject the null
hypothesis H0, the other will not. So, if the
observed value of the test statistic is a
member of the critical region, we conclude
"Reject H0"; if it is not a member of the critical
region then we conclude "Do not reject H0".
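
The decision rule above can be sketched for a two-sided one-sample z-test. The choice of test, the assumption that the population standard deviation is known, and the numbers in the usage lines are all hypothetical and serve only as an illustration. `z_test_decision` is not a name from the slides.

```python
from math import sqrt
from statistics import NormalDist


def z_test_decision(sample_mean, mu0, sigma, n, alpha=0.05):
    """Two-sided one-sample z-test of H0: population mean == mu0.

    Assumes the population standard deviation `sigma` is known and that the
    sample mean is approximately normal (for example via the central limit
    theorem). Returns the test statistic and the decision.
    """
    z = (sample_mean - mu0) / (sigma / sqrt(n))   # observed test statistic
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # boundary of the critical region
    in_critical_region = abs(z) > z_crit
    decision = "Reject H0" if in_critical_region else "Do not reject H0"
    return z, decision


# Hypothetical data: H0 says the mean is 10, sigma is 2, sample size is 100.
print(z_test_decision(sample_mean=10.5, mu0=10.0, sigma=2.0, n=100))
print(z_test_decision(sample_mean=10.2, mu0=10.0, sigma=2.0, n=100))
```

In the first call the test statistic falls inside the critical region, so H0 is rejected. In the second it falls outside, so H0 is not rejected.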
