Review of Basic Statistics and Terminology
What are Statistics?
There are various steps to consider in scientific research. The key steps are:
• Research Design
• Data Collection
• Description of data and statistical analysis (our focus for this course)
(Daniels, D., Nizam, Z, 1999, Biostatistics (notes), RSPH)
All of these steps involve, directly or indirectly, the data and statistics used to do
scientific research.
There are many different definitions of statistics, but the one that is most general
and fits most situations is the following:
Statistics are produced from data. The dictionary definition of "statistics" refers
to “numeric indicators of nations. Popular usage of the term points to numeric
summaries that condense information, or numbers that are used to make
comparisons, or numbers that portray relationships or associations. The term
statistics also refers to a formal discipline of study. The field of statistics is the
science of generalization. Built upon theories of probability and inference,
statistics support the making of broad generalizations from a smaller number of
specific observations.”
There are two types of statistics researchers are usually most interested in:
Descriptive and Inferential.
Descriptive Statistics “give us information around discovering and describing
the important features and trends contained in a set of data using quick and easy
graphical and numerical methods. In other words, they are most often used to
describe populations. Even though descriptive statistics most often yield
information about one variable, the relationship between two variables can be
described as well.
With inferential statistics, research hypotheses about a population of interest
are investigated using information on the basis of measurements contained in a
random sample of data from the population.”
(Price, I., 2000, University of New England, School of Psychology)
An important concept for understanding descriptive statistics is the Shape of
Distribution. A variable’s distribution shape is a very important aspect of the
variable’s description: it tells you the frequency of values in different ranges of
the variable. Four common terms to understand when evaluating the shape of a
distribution are:
• Normal
• Skewness
• Kurtosis
• Symmetrical/Asymmetrical
A normal distribution is a bell shaped curve that is symmetrical with scores more
frequent in the middle. A standard normal distribution has a mean of 0 and a
standard deviation of 1.
In a skewed distribution, most scores pile up at either the high or the low end;
usually, a small percentage of scores trail off in one direction away from the
majority, producing what is known as a “tail.” If the “tail” points toward the
upper end of the score continuum, the distribution is positively skewed. If the “tail”
points toward the lower end of the score continuum, the distribution is negatively
skewed. (Huck, S.W. & Cormier, W.H., 1996. Reading Statistics and Research,
New York, New York, p. 31).
Distributions that are skewed are asymmetrical, while normal distributions are
symmetrical. Kurtosis measures the “peakedness” of a distribution: a distribution
with high kurtosis has a sharper peak and heavier (“fat”) tails, while a distribution
with low kurtosis has a flatter peak, thinner tails, and values spread more evenly.
(http://www.statsoft.com/textbook/stbasic.html)
Readings:
http://www.itl.nist.gov/div898/handbook/eda/section3/eda35b.htm
http://davidmlane.com/hyperstat/normal_distribution.html
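As a small illustration of these shape measures, skewness and kurtosis can be computed directly from their moment definitions. This is a minimal Python sketch with hypothetical scores (statistical packages such as SPSS report bias-corrected versions of these statistics, so their values will differ slightly):

```python
from statistics import mean, pstdev

def shape(values):
    """Skewness and excess kurtosis from the standardized third and
    fourth moments (simple population formulas, not bias-corrected)."""
    m, s, n = mean(values), pstdev(values), len(values)
    skew = sum(((x - m) / s) ** 3 for x in values) / n
    kurt = sum(((x - m) / s) ** 4 for x in values) / n - 3
    return skew, kurt

# Hypothetical scores with a long upper tail
scores = [2, 3, 3, 4, 4, 4, 5, 5, 6, 15]
skew, kurt = shape(scores)
# skew > 0 here: the "tail" points toward the upper end (positive skew)
```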
Common concepts/terms
A Population in statistics refers to an entire group of people or objects. When a
statistical inference is made, a selection called a sample is first drawn from the
larger group, the population. The sample, usually comprised of people or objects,
is then measured, and the measurements are summarized with statistics. An
estimate is then made of the value of the same statistical concept in the
population. (Huck, S.W. & Cormier, W.H., 1996. Reading Statistics and Research,
New York, New York, p. 100).
Correlations – A correlation is a measure of the relation between two or more
variables. The most popular technique for assessing the strength of a bivariate
(two-variable) relationship is Pearson's product-moment correlation, also called
linear correlation. Correlation coefficients can range from -1.00 to +1.00: a value
of -1.00 represents a perfect negative correlation, a value of +1.00 represents a
perfect positive correlation, and a value of 0.00 represents a lack of correlation.
In order to evaluate a correlation between two variables, it’s important to know
the “magnitude” as well as the significance of the correlation. A perfect correlation
rarely exists; a correlation of .6 or above is often considered satisfactory, and
sometimes correlations of .3 are significant and worth reporting. Also, a negative
correlation does not imply an unsatisfactory correlation, as even a correlation of
-.129 can be significant. A negative correlation simply means there is an indirect
or inverse relationship between the variables.
Readings:
http://www.statsoft.com/textbook/stbasic.html (Basic Statistics tab - Correlation
section)
http://www.uwsp.edu/psych/stat/7/correlat.htm (Lessons I, II, III, and V)
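Pearson's r can be computed directly from its definition. A short Python sketch with hypothetical data (both variable names and values are made up for illustration):

```python
from statistics import mean

def pearson_r(x, y):
    """Pearson product-moment correlation: covariance of x and y
    divided by the product of their deviations' magnitudes."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

hours_exercised = [1, 2, 3, 4, 5]
resting_hr = [80, 76, 73, 70, 66]  # hypothetical data
r = pearson_r(hours_exercised, resting_hr)
# r is close to -1: a strong inverse (negative) relationship
```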
Outliers – Outliers in a set of data have values so far removed from the other
values in the distribution that their presence cannot be attributed to the random
combination of chance causes. An outlier is an atypical value that does not belong
to the distribution of the rest of the values in the data set. Values that are more
than 2.5 standard deviations from the mean can be defined as outliers; this holds
when the distribution is unimodal and symmetrical. For skewed data, the median is
a better indicator of central location than the mean. In that case, an observation
can be considered an outlier if it is more than 1.5 interquartile ranges away from
the closest quartile. An outlier is labeled extreme if it is more than 3 interquartile
ranges from the closest quartile, and mild otherwise.
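The interquartile-range rule can be sketched in Python using the standard library's `statistics.quantiles` to compute the quartiles (the data here are hypothetical):

```python
from statistics import quantiles

def iqr_outliers(values, extreme=False):
    """Flag outliers by distance from the closest quartile:
    more than 1.5 IQRs = mild, more than 3 IQRs = extreme."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    k = 3.0 if extreme else 1.5
    return [x for x in values if x < q1 - k * iqr or x > q3 + k * iqr]

ages = [23, 25, 26, 27, 28, 29, 30, 31, 32, 95]  # 95 looks atypical
mild_or_worse = iqr_outliers(ages)
extreme_only = iqr_outliers(ages, extreme=True)
```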
Frequencies – A frequency distribution shows how many subjects or objects were
similar, that is, ended up in the same category or had the same score. The total
frequency is usually labeled N and tells us how many subjects were measured.
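A frequency distribution is a simple tally; in Python this is `collections.Counter` over a column of (hypothetical) responses:

```python
from collections import Counter

responses = ["Agree", "Agree", "Neutral", "Disagree", "Agree", "Neutral"]
freq = Counter(responses)   # category -> frequency
N = sum(freq.values())      # total frequency, N
# freq.most_common() lists categories from most to least frequent
```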
T-Tests – The t-test is the most commonly used method to evaluate the
difference in means between two groups. For example, a t-test can be used
to test for the difference in knowledge scores between a group of students who
participated in a sex education class and those who did not. As long as the
variable is normally distributed within each group and the variation of scores in the
two groups is not reliably different, small sample sizes can be used when
performing a t-test. The p-level reported with a t-test represents the probability of
observing a difference this large by chance if there were really no difference
between the groups.
Reading:
http://www.texasoft.com/tutorial-statistics-compare-2-groups.htm
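The pooled two-sample t statistic behind this comparison can be sketched in Python. The scores below are hypothetical; a full analysis (as SPSS would report) also converts t and its degrees of freedom into a p-value via the t distribution, which is omitted here:

```python
from statistics import mean, variance

def two_sample_t(a, b):
    """Pooled-variance t statistic for the difference in means of two
    independent groups (assumes roughly equal variances)."""
    na, nb = len(a), len(b)
    sp2 = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    t = (mean(a) - mean(b)) / (sp2 * (1 / na + 1 / nb)) ** 0.5
    df = na + nb - 2
    return t, df

# Hypothetical knowledge scores for the sex-education example
took_class = [78, 85, 90, 72, 88, 81]
no_class = [70, 65, 74, 68, 72, 71]
t, df = two_sample_t(took_class, no_class)
# large positive t suggests the class group scored higher on average
```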
Confidence Intervals – A Confidence Interval is a range of values that is
normally used to describe the uncertainty around a point estimate of a quantity,
for example, a mortality rate. Confidence intervals provide a means of assessing
and reporting the precision of a point estimate and account for the uncertainty
that arises from the natural variation inherent in data collection. Confidence
intervals are constructed by adding a specific amount to the statistic (the upper
limit) and subtracting a specific amount from it (the lower limit). Researchers
also attach a confidence level to any interval constructed, usually 95% or 99%.
Reading:
http://www.stat.yale.edu/Courses/1997-98/101/confint.htm
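A confidence interval for a mean can be sketched as the point estimate plus or minus a critical value times the standard error. The sketch below uses the large-sample normal critical value 1.96 for 95% confidence (an assumption; small samples would use a t critical value instead), with hypothetical data:

```python
from statistics import mean, stdev

def ci_mean(values, z=1.96):
    """Approximate 95% CI for the mean: estimate +/- z * standard error.
    Uses the normal (z) critical value -- a large-sample simplification."""
    m = mean(values)
    se = stdev(values) / len(values) ** 0.5  # standard error of the mean
    return m - z * se, m + z * se            # (lower limit, upper limit)

rates = [58, 61, 59, 63, 60, 57, 62, 60]  # hypothetical rates
lower, upper = ci_mean(rates)
# the interval is centered on the sample mean
```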
Working with Data and questions
It’s important to be prepared for working with data before you enter, manipulate,
or analyze data. This also involves knowing which program you will use for data
management and analysis.
Part of the preparation involves knowing your questions. What questions do you
want answered from the data? What are your key research questions? Data
collection should be systematic and done with purpose, even though it’s
sometimes easy to explore data first, especially if you are working with an
existing data set or a large data set. If you are creating a data collection
instrument, it should be designed using specific techniques. When
designing the instrument, it’s important to think about how you are going
to analyze each question and what exactly you want to gain from each
question. Several factors to think about when creating a survey are the
following:
Variable name – A unique identifier for each question. Oftentimes,
variable names should be no longer than 8 characters, because some
statistical programs only allow 8 characters, and longer names may
pose a problem when importing/exporting between programs. Also,
variable names usually don’t begin with numbers or contain symbols,
punctuation, or spaces. Lastly, variable names should relate to the
question and be easy to identify or recall.
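These naming rules are easy to check mechanically. A small Python sketch of the conservative rules above (start with a letter, then letters/digits/underscores, at most 8 characters; exact rules vary by statistical package):

```python
import re

# At most 8 characters, starting with a letter, no spaces or punctuation
VALID_NAME = re.compile(r"^[A-Za-z][A-Za-z0-9_]{0,7}$")

def is_valid_name(name):
    """True if the variable name satisfies the conservative rules above."""
    return bool(VALID_NAME.match(name))
```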
Level of measurement – Refers to the nominal, ordinal, or interval/ratio
level of measurement. Each variable has one of these levels of
measurement. This is important to know because how we analyze our
data often depends on the level of measurement. We will discuss this
more in the independent/dependent variable lab week.
Type of variable – Field types refer to the data types associated with
a variable in your survey. They vary from program to program, but the
two most common field types are numeric (or integer) and string (or
text) variables. Numeric variables often contain whole numbers; string
variables contain letters, numbers, or special symbols. In survey
development, we may wish to use text responses for certain questions,
but when entering data we associate the text responses with numbers,
so the variable is entered and analyzed as numeric. For example, if we
have a question with a Likert scale such as:
1. I enjoy exercise.
Strongly Agree
Agree
Neutral
Disagree
Strongly Disagree
Then we may want to associate these responses with numbers such as:
1 - Strongly Agree
2 - Agree
3 - Neutral
4 - Disagree
5 - Strongly Disagree
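That text-to-number coding step can be sketched as a simple lookup table applied to the raw responses:

```python
# Numeric codes for the Likert responses, as listed above
LIKERT = {
    "Strongly Agree": 1,
    "Agree": 2,
    "Neutral": 3,
    "Disagree": 4,
    "Strongly Disagree": 5,
}

raw = ["Agree", "Strongly Agree", "Neutral", "Disagree"]  # hypothetical
coded = [LIKERT[r] for r in raw]  # entered and analyzed as numeric
```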
Types of Data – Terms associated with types of data are Quantitative
and Qualitative. Quantitative data refers to data that are numeric;
qualitative data refers to data that are non-numeric. The ways we
analyze these types of data are very different. Quantitative data can be
further categorized into discrete and continuous data: discrete data are
numeric data with a finite number of possible values, while continuous
data are numeric data with infinite possibilities. Qualitative data are
also referred to as categorical data, meaning non-numerical data that
may be divided into groups. We will also discuss this more in the
independent/dependent variable lab week.
Labels – Refers to a description of the variable or of the response
categories. There are variable labels and value labels. Variable labels
assign descriptive labels to variables; variable names and labels may
be the same for some variables, and often a label is the entire
question or part of it. Value labels assign descriptive labels to the
response categories in the data file. Not all variables necessarily have
assigned value labels.
Codebook – A codebook displays information for the variables in a
data set and is a description of the data that were collected. According
to Data and Statistical Services at Princeton University, the best
codebooks have:
1. Description of the study: who did it, why they did it, how they
did it.
2. Sampling information: what population was studied, how the
sample was drawn, what the response rate was.
3. Technical information about the files themselves: number of
observations, record length, number of records per observation,
etc.
4. Structure of the data within the file: hierarchical, multiple cards, etc.
5. Details about the data: columns in which specific variables can be
found, whether they are character or numeric, and if numeric, what
format.
6. Text of the questions and responses: some even have how many
people responded a particular way.
Review of Research Designs
Every research project should begin with a purpose. When thinking about the
purpose of the research, you will also want to think about the research design, or
the structure of the study. Research designs fall into two categories:
Experimental – a study or experiment that imposes a treatment on a group of
subjects or objects in order to observe the response. This differs from
observational studies, which involve collecting and analyzing data about a group
of subjects or objects without imposing a treatment or changing existing
conditions. A “true” experimental design has the following characteristics: 1)
random assignment of participants to groups, and 2) manipulation of an
independent variable.
Quasi-experimental – Also used to investigate cause-and-effect relationships
while controlling for confounding variables. This differs from an experimental
design in that there is no random assignment of participants to groups.
Readings:
http://www.socialresearchmethods.net/kb/quasiexp.php
Study hypotheses
Once a problem or area of interest has been identified and researched, a
hypothesis is created and stated by the investigator. A hypothesis is a
“statement that describes a phenomenon and the relationships between the
variables in the problem” (Southeastern Institute for Training and Evaluation). In
order to formulate a hypothesis test, usually some theory has been researched
and stated. The hypothesis is used for several reasons:
1. Gives a statement of relationships between variables to be tested
2. Offers a possible explanation of the research problem and a focus for the
testing
3. Provides direction for the research
4. Provides a framework or organization for reporting the findings of the
study
(Southeastern Institute for Training and Evaluation)
Hypothesis Testing
There typically is a 6 step process involved in hypothesis testing:
• State the null hypothesis – a statement as to the unknown quantitative value
of the parameter of interest. The null hypothesis relates to the statement
being tested.
• State the alternative hypothesis – the predicted relationship between the
variables of interest. The alternative hypothesis relates to the statement to be
accepted if/when the null is rejected.
• Select a level of significance – the fixed probability of wrongly rejecting the null
hypothesis if it is in fact true. Usually the significance level is chosen to be
.05 (5%), often seen as p < .05.
• Collect and summarize the sample data – process used to move from the
beginning points of the hypothesis testing procedure to the final decision
• Refer to a criterion for evaluating the sample evidence – involves asking the
question, are the sample data inconsistent with what would likely occur if the
null hypothesis is true? If yes, the null hypothesis would be rejected.
• Make a Decision to discard/retain the null hypothesis
(Huck, S.W. & Cormier, W.H., 1996. Reading Statistics and Research, New
York, New York, p. 150).
More information on hypothesis testing can be found in the readings.
Readings:
http://davidmlane.com/hyperstat/logic_hypothesis.html
http://www.stats.gla.ac.uk/steps/glossary/hypothesis_testing.html
Analyzing Data
There are many programs, procedures and commands used to analyze data.
For the purposes of this class, only a select group of data analysis procedures
will be used to analyze data using SPSS software. In order to analyze data, it
must first be collected and entered into a database. Most analysis programs will
allow you to import a variety of database files. Setting up databases is not within
the scope of this class and will not be discussed. I will show you exporting and
importing using various file formats at the closing on-campus session.
Data cleaning – If you enter data and set up your own database, it’s important to
make sure that the data is clean and examined for possible data errors.
Common errors may be:
• Values entered outside of the response categories
• Missing data
• Unexpected responses that are not easily identified
• Spelling errors
(Southeastern Institute for Training and Evaluation)
Reading:
http://www.tulane.edu/~panda2/Analysis2/datclean/dataclean.htm
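Two of the checks above, out-of-range values and missing data, can be sketched as simple scans over a column. The column below is hypothetical (Likert codes with a valid range of 1 to 5, `None` marking missing entries):

```python
# Hypothetical column of Likert codes; valid range is 1-5.
# None marks missing data; 9 is an out-of-range entry error.
column = [1, 3, 5, 9, None, 2]

missing = [i for i, v in enumerate(column) if v is None]
out_of_range = [i for i, v in enumerate(column)
                if v is not None and not 1 <= v <= 5]
# The row indices flagged here would be checked against the source forms
```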
Basic Data Analysis
This section will mainly discuss basic descriptive statistics.
Measures of Central Tendency
Mean - The mean is a descriptive statistic probably used most often as a
measure of central tendency. Means are often reported in conjunction with
confidence intervals. Confidence intervals for the mean yield a range of values
around the mean where the “true” expected mean is found. For example, if the
mean in your sample is 60, and the lower and upper limits of a 95% confidence
interval are 53 and 67, then you can be 95% confident that the population mean
lies between 53 and 67.
(http://www.mste.uiuc.edu/hill/dstat/dstat.html)
Median – The median is the middle value of a distribution when the data are
arranged from highest to lowest. Exactly half the data is above the median and
half is below it. If there is an even number of values, then it becomes the
average of the two middle observations.
Mode – The most frequently occurring value. It is possible to have more than
one mode or no mode at all.
Reading: https://statistics.laerd.com/statistical-guides/measures-central-
tendency-mean-mode-median.php
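All three measures of central tendency are available in Python's standard `statistics` module; the sample below is hypothetical and includes one extreme value to show how the mean and median react differently:

```python
from statistics import mean, median, mode

scores = [60, 62, 58, 62, 61, 90]  # one high score pulls the mean up
m = mean(scores)     # sensitive to the extreme value
md = median(scores)  # average of the two middle values (even N)
mo = mode(scores)    # most frequently occurring value
```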
Measures of Variability around the mean
After reviewing a set of values, measures of central tendency and shape of the
distribution, it’s important to know more about the variability of a set of values.
Most groups of values have some degree of variability, meaning at least some of
the values differ (vary) from one another. There are two main measures of
variability.
Variance – The variance is the sum of squared deviations from the mean divided
by N-1. The variance is simply the standard deviation squared.
Standard deviation – Determined by figuring out how much each score deviates
from the mean and by placing these deviation scores into a formula.
The standard deviation is the square root of the variance.
Readings: https://statistics.laerd.com/statistical-guides/measures-of-spread-
range-quartiles.php
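The relationship between the two measures, variance as the squared standard deviation, is easy to verify with the `statistics` module (hypothetical scores):

```python
from statistics import variance, stdev

scores = [2, 4, 4, 4, 5, 5, 7, 9]
v = variance(scores)  # sum of squared deviations from the mean / (N - 1)
s = stdev(scores)     # square root of the variance
# s ** 2 equals v (up to floating-point rounding)
```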
Cross Tabulations
Cross tabulations show the relationship between two variables. In most programs,
cross tabulations are used to get basic descriptive statistics and to conduct other
tests, such as a chi-square. Different programs may use different commands
and/or procedures to conduct a cross tabulation. When performing a cross
tabulation, it’s important to know your independent and dependent variables,
since independent variables are used as column headings and dependent
variables are found in the rows.
Reading:
http://web.idrc.ca/en/ev-56452-201-1-DO_TOPIC.html
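A cross tabulation is just a count of value pairs. A minimal Python sketch with hypothetical records, placing the independent variable in the columns and the dependent variable in the rows as described above:

```python
from collections import Counter

# Hypothetical records: (independent variable, dependent variable)
records = [("Male", "Agree"), ("Male", "Disagree"), ("Female", "Agree"),
           ("Female", "Agree"), ("Male", "Agree"), ("Female", "Disagree")]

table = Counter(records)  # (column, row) -> cell count
cols = sorted({c for c, _ in records})
rows = sorted({r for _, r in records})
for r in rows:
    # one printed row of the crosstab: counts per column category
    print(r, [table[(c, r)] for c in cols])
```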
More Related Content

PPTX
PPTX
What is an independent samples-t test?
PPTX
Multivariate
PPTX
Kolmogorov Smirnov
PPT
Two sample t-test
PPT
Teac lesson 5
PPTX
Discrete distributions: Binomial, Poisson & Hypergeometric distributions
PPTX
Statistical Estimation
What is an independent samples-t test?
Multivariate
Kolmogorov Smirnov
Two sample t-test
Teac lesson 5
Discrete distributions: Binomial, Poisson & Hypergeometric distributions
Statistical Estimation

What's hot (20)

PPTX
Variance & standard deviation
PPTX
MONOVA
PPTX
Sample and Population in Research - Meaning, Examples and Types
PPTX
One-Sample Hypothesis Tests
PPT
The research instruments
PDF
Introduction to random variables
PPTX
Estimating population mean
PPTX
6 typesofvariables
PPT
One Sample T Test
PPTX
CHAPTER 3: FREQUENCY DISTRIBUTION ..pptx
PPT
Quantitative data analysis
PPTX
Measures of Variation
PPTX
Types of random sampling
PDF
Regression Analysis
PDF
Stat 130 chi-square goodnes-of-fit test
PPTX
Research Variables
PPTX
An outline of Quantitative Research Methods
PPTX
Simple linear regression
PPTX
Item analysis ppt
PPT
Formative_and_Summative_Assessment_ppt.ppt
Variance & standard deviation
MONOVA
Sample and Population in Research - Meaning, Examples and Types
One-Sample Hypothesis Tests
The research instruments
Introduction to random variables
Estimating population mean
6 typesofvariables
One Sample T Test
CHAPTER 3: FREQUENCY DISTRIBUTION ..pptx
Quantitative data analysis
Measures of Variation
Types of random sampling
Regression Analysis
Stat 130 chi-square goodnes-of-fit test
Research Variables
An outline of Quantitative Research Methods
Simple linear regression
Item analysis ppt
Formative_and_Summative_Assessment_ppt.ppt
Ad

Viewers also liked (6)

PDF
9th ICCS Noordwijkerhout
PDF
Introductory Lecture to Applied Mathematics Stream
PDF
Applied Statistics - Introduction
PDF
Problems statistics 1
PPTX
Applied Statistics : Sampling method & central limit theorem
PPTX
Role of Statistics in Scientific Research
9th ICCS Noordwijkerhout
Introductory Lecture to Applied Mathematics Stream
Applied Statistics - Introduction
Problems statistics 1
Applied Statistics : Sampling method & central limit theorem
Role of Statistics in Scientific Research
Ad

Similar to Review of Basic Statistics and Terminology (20)

PPTX
Lecture 1.pptx
PPT
Introduction to statistics
PPT
Intro statistics
PPTX
Bio Statistics.pptx by Dr.REVATHI SIVAKUMAR
PPTX
Understanding statistics in research
PPTX
Statistical techniques for interpreting and reporting quantitative data i
PPTX
Basics of statistics
PPT
grade7statistics-150427083137-conversion-gate01.ppt
PDF
2_54248135948895858599595585887869437 2.pdf
PDF
STATISTICS-E.pdf
PPTX
01 Introduction (1).pptx
PPT
Introduction To Statistics.ppt
PPTX
Stat and prob a recap
PPT
Introduction-To-Statistics-18032022-010747pm (1).ppt
PDF
Lesson 1.pdf probability and statistics.
PPTX
Medical Statistics.pptx
DOCX
Statistics  What you Need to KnowIntroductionOften, when peop.docx
PPTX
fundamentals of data science and analytics on descriptive analysis.pptx
PPTX
Presentation1
PDF
Statistics of engineer’s with basic concepts in statistics
Lecture 1.pptx
Introduction to statistics
Intro statistics
Bio Statistics.pptx by Dr.REVATHI SIVAKUMAR
Understanding statistics in research
Statistical techniques for interpreting and reporting quantitative data i
Basics of statistics
grade7statistics-150427083137-conversion-gate01.ppt
2_54248135948895858599595585887869437 2.pdf
STATISTICS-E.pdf
01 Introduction (1).pptx
Introduction To Statistics.ppt
Stat and prob a recap
Introduction-To-Statistics-18032022-010747pm (1).ppt
Lesson 1.pdf probability and statistics.
Medical Statistics.pptx
Statistics  What you Need to KnowIntroductionOften, when peop.docx
fundamentals of data science and analytics on descriptive analysis.pptx
Presentation1
Statistics of engineer’s with basic concepts in statistics

More from aswhite (20)

PDF
Lab birth linear
PDF
Parc linear regression analysis in spss(1)
PDF
Linear regression interpretation
DOCX
Prs530 schedule sp17
DOCX
PRS530 Syllabus sp17
DOC
Parc variables
DOC
Codebook
PPT
Parc slides
PDF
Failed health system ebola jama2014
PDF
Who mbhss 2010 full web
PDF
Everybody's business
PDF
Aepi555 scheduleSP17
PDF
Interpretation
PDF
Independent and dependent variables
PPT
PARC Slides
PPT
BRFSS
PDF
SPSS Getting Started Tutorial
DOC
Computer Codebook
PPT
Overview spss student
PPT
Overview spss instructor
Lab birth linear
Parc linear regression analysis in spss(1)
Linear regression interpretation
Prs530 schedule sp17
PRS530 Syllabus sp17
Parc variables
Codebook
Parc slides
Failed health system ebola jama2014
Who mbhss 2010 full web
Everybody's business
Aepi555 scheduleSP17
Interpretation
Independent and dependent variables
PARC Slides
BRFSS
SPSS Getting Started Tutorial
Computer Codebook
Overview spss student
Overview spss instructor

Recently uploaded (20)

PDF
Supply Chain Operations Speaking Notes -ICLT Program
PDF
Hazard Identification & Risk Assessment .pdf
PPTX
Onco Emergencies - Spinal cord compression Superior vena cava syndrome Febr...
PDF
Indian roads congress 037 - 2012 Flexible pavement
PPTX
Final Presentation General Medicine 03-08-2024.pptx
PPTX
Introduction to Building Materials
PPTX
1st Inaugural Professorial Lecture held on 19th February 2020 (Governance and...
PDF
Chinmaya Tiranga quiz Grand Finale.pdf
PPTX
202450812 BayCHI UCSC-SV 20250812 v17.pptx
PDF
Practical Manual AGRO-233 Principles and Practices of Natural Farming
PPTX
Final Presentation General Medicine 03-08-2024.pptx
PPTX
CHAPTER IV. MAN AND BIOSPHERE AND ITS TOTALITY.pptx
PPTX
Chinmaya Tiranga Azadi Quiz (Class 7-8 )
PDF
OBE - B.A.(HON'S) IN INTERIOR ARCHITECTURE -Ar.MOHIUDDIN.pdf
PPTX
Digestion and Absorption of Carbohydrates, Proteina and Fats
PPTX
A powerpoint presentation on the Revised K-10 Science Shaping Paper
PPTX
Unit 4 Skeletal System.ppt.pptxopresentatiom
PPTX
Lesson notes of climatology university.
PDF
Paper A Mock Exam 9_ Attempt review.pdf.
PPTX
Radiologic_Anatomy_of_the_Brachial_plexus [final].pptx
Supply Chain Operations Speaking Notes -ICLT Program
Hazard Identification & Risk Assessment .pdf
Onco Emergencies - Spinal cord compression Superior vena cava syndrome Febr...
Indian roads congress 037 - 2012 Flexible pavement
Final Presentation General Medicine 03-08-2024.pptx
Introduction to Building Materials
1st Inaugural Professorial Lecture held on 19th February 2020 (Governance and...
Chinmaya Tiranga quiz Grand Finale.pdf
202450812 BayCHI UCSC-SV 20250812 v17.pptx
Practical Manual AGRO-233 Principles and Practices of Natural Farming
Final Presentation General Medicine 03-08-2024.pptx
CHAPTER IV. MAN AND BIOSPHERE AND ITS TOTALITY.pptx
Chinmaya Tiranga Azadi Quiz (Class 7-8 )
OBE - B.A.(HON'S) IN INTERIOR ARCHITECTURE -Ar.MOHIUDDIN.pdf
Digestion and Absorption of Carbohydrates, Proteina and Fats
A powerpoint presentation on the Revised K-10 Science Shaping Paper
Unit 4 Skeletal System.ppt.pptxopresentatiom
Lesson notes of climatology university.
Paper A Mock Exam 9_ Attempt review.pdf.
Radiologic_Anatomy_of_the_Brachial_plexus [final].pptx

Review of Basic Statistics and Terminology

  • 1. Review of Basic Statistics and Terminology What are Statistics? There are various steps to consider in scientific research. The key steps are: • Research Design • Data Collection • Description of data and statistical analysis (our focus for this course) (Daniels, D., Nizam, Z, 1999, Biostatistics (notes), RSPH) All of these steps consider directly or indirectly data or the statistics used to do scientific research. There are actually many different definitions of statistics but the one that is more general and fits most situations is the following: Statistics are produced from data. The dictionary definition of "statistics" refers to “numeric indicators of nations. Popular usage of the term points to numeric summaries that condense information, or numbers that are used to make comparisons, or numbers that portray relationships or associations. The term statistics also refers a formal discipline of study. The field of statistics is the science of generalization. Built upon theories of probability and inference, statistics support the making of broad generalizations from a smaller number of specific observations.” There are two types of statistics researchers are usually most interested in: Descriptive and Inferential. Descriptive Statistics “give us information around discovering and describing the important features and trends contained in a set of data using quick and easy graphical and numerical methods. In other words, they are most often used to describe populations. Even though descriptive statistics most often yield information about one variable, the relationship between two variables can be described as well. 
With inferential statistics, research hypotheses about a population of interest are investigated using information on the basis of measurements contained in a random sample of data from the population.“ (Price, I., 2000, University of New England, School of Psychology) An important concept to understanding descriptive statistics is the Shape of Distribution. A variable’s distribution shape is a very important aspect of the variables description. The shape of the distribution tells you the frequency of values from different ranges of the variable. Four common terms to understand when evaluating shape of the distribution is: • Normal • Skewness • Kurtosis • Symetrical/Asymmetrical
  • 2. A normal distribution is a bell shaped curve that is symmetrical with scores more frequent in the middle. A standard normal distribution has a mean of 0 and a standard deviation of 1. Skewed distributions yield most scores as being high or low. Usually, a small percentage of scores are shown in one direction away from the majority. Skewed distributions produce what is known as a “tail.” If the “tail” points toward the upper end of the score continuum, the distribution is “positive.” If the “tail “points toward the lower end of the score continuum, the tail is “negative.” (Huck, S.W. & Cormier, W.H., 1996. Reading Statistics and Research, New York, New York, p. 31). Distributions that are skewed are typically considered asymmetrical while normal distributions are symmetrical. Kurtosis measures “peakedness” of the distribution. A high peak gives a distribution with “fat” tails and a low even distribution while a low peak gives a “skinny” tail and a distribution concentrated towards the mean. (http://www.statsoft.com/textbook/stbasic.html, Readings: http://www.itl.nist.gov/div898/handbook/eda/section3/eda35b.htm http://davidmlane.com/hyperstat/normal_distribution.html Common concepts/terms A Population in statistics is often referred to as an entire group of people or objects. When a statistical inference is made, a selection, called a sample, is first removed from a larger group called a population. The sample, usually comprised of people or objects, is then measured using statistics. These measurements are then summarized. A guess is then made as to the numerical value of the same statistical concept in the population. (Huck, S.W. & Cormier, W.H., 1996. Reading Statistics and Research, New York, New York, p. 100). Correlations – A correlation is a measure of the relation between two or more variables. 
The most popular technique for assessing the strength of a bivariate (having two variables) relationship is Pearson's product-moment correlation, also called linear or product-moment correlation. Correlation coefficients can range from -1.00 to +1.00. The value of -1.00 represents a negative correlation while a value of +1.00 represents a perfect positive correlation. A value of 0.00 represents a lack of correlation. In order to evaluate a correlation between 2 variables, it’s important to know the “magnitude” as well as the significance of the correlation. A perfect correlation rarely exists but a correlation at .6 or above is often considered to be satisfactory. Sometimes correlations observed at .3 are significant and worth reporting. Also, a negative correlation does not infer an unsatisfactory correlation, as a -.129 can be a significant correlation. A negative correlation really means there is an indirect or inverse relationship between the variables. Readings: http://www.statsoft.com/textbook/stbasic.html (Basic Statistics tab - Correlation section) http://www.uwsp.edu/psych/stat/7/correlat.htm (Lessons I, II, III, and V)
  • 3. Outliers – Outliers in a set of data, have a value so far removed from other values in the distribution that its presence cannot be attributed to the random combination of chance causes. An outlier is an atypical value that does not belong to the distribution of the rest of the values in the data set. Values that are more than 2.5 standard deviations from the mean can be defined as outliers. This would hold when the distribution is unimodal and symmetrical. For skewed data, median is a better indicator of central location than mean. In such a case, an observation can be considered as an outlier if it more than 1.5 inter-quartile range away from the closest quartile. An outlier would be labeled an extreme outlier if it is more than 3 inter-quartile range from the closest quartile, and it would be called mild otherwise. Frequencies - A frequency distribution yields how many subjects or objects were similar and ended up in the same category or had the same score. The total frequency is usually labeled as N, and tells us how many subjects were measured. T-Tests – the t-test is the most commonly used method to evaluate the differences in means between two groups. For example, the t-test can be used to test for the difference in knowledge scores between a group of students who participated in a sex education class and those students who did not. As long as variables are normally distributed within each group and variation of scores in the two groups is not reliably different, small sample sizes can be used when performing a t-test. The p-level reported with a t-test represents the probability of error involved in accepting our research hypothesis about the existence of a difference. Reading: http://www.texasoft.com/tutorial-statistics-compare-2-groups.htm Confidence Intervals – A Confidence Interval is a range of values that is normally used to describe the uncertainty around a point estimate of a quantity, for example, a mortality rate. 
Confidence intervals provide a means of assessing and reporting the precision of a point estimate and account for the uncertainty that arises from the natural variation inherent in data collection. Confidence intervals are constructed by adding a specific amount to the statistic (the upper limit) and by subtracting a specific amount from the statistic (the lower limit). Researchers also attach a percentage to any interval constructed, usually either 95 or 99.

Reading: http://www.stat.yale.edu/Courses/1997-98/101/confint.htm

Working with Data and questions

It's important to be prepared for working with data before you enter, manipulate, or analyze it. This includes knowing which program you will use for data management and analysis. Part of the preparation involves knowing your questions: What questions do you want answered from the data? What are your key research questions? Data
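The add-to/subtract-from construction described above can be sketched for a 95% confidence interval around a sample mean. One assumption: this sketch uses the normal-approximation multiplier 1.96 rather than an exact t-table value, so it is only a reasonable approximation for moderately large samples. The scores are invented for illustration.

```python
import math
import statistics

def mean_ci_95(values):
    """95% confidence interval for a mean: the point estimate plus and
    minus a margin (1.96 standard errors; a normal approximation)."""
    n = len(values)
    mean = statistics.mean(values)
    se = statistics.stdev(values) / math.sqrt(n)   # standard error of the mean
    margin = 1.96 * se
    return mean - margin, mean + margin            # (lower limit, upper limit)

low, high = mean_ci_95([61, 58, 63, 60, 59, 62, 57, 60])
print(round(low, 1), round(high, 1))   # interval around the sample mean of 60
```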
collection should be systematic and done with purpose, even though it's sometimes tempting to explore the data first, especially if you are working with an existing or large data set. If you are creating a data collection instrument, it should be designed using specific techniques. When designing the instrument, it's important to think about how you are going to analyze each question and what exactly you want to gain from it. Several factors to think about when creating a survey are the following:

Variable name – A unique identifier for each question. Variable names often should not be longer than 8 characters, because some statistical programs only allow 8 characters, which can pose a problem when importing/exporting between programs. Also, variable names usually should not begin with numbers or contain symbols, punctuation, or spaces. Lastly, variable names should relate to the question and be easy to identify or recall.

Level of measurement – Refers to the nominal, ordinal, or interval/ratio level of measurement. Each variable will have one of these levels of measurement. This is important to know because how we analyze our data often depends on the level of measurement. We will discuss this more in the independent/dependent variable lab week.

Type of variable – Field types refer to the data types associated with a variable in your survey. They vary from program to program, but the two most common field types are numeric (or integer) variables and string (or text) variables. Numeric variables often contain whole numbers; string variables contain letters, numbers, or special symbols. In survey development, we may wish to use text responses for certain questions, but when entering data we associate the text responses with numbers, so they are entered and analyzed as a numeric variable. For example, consider a question with a Likert scale such as:

1. I enjoy exercise.
Strongly Agree / Agree / Neutral / Disagree / Strongly Disagree

Then we may want to associate these responses with numbers, such as:

1 - Strongly Agree
2 - Agree
3 - Neutral
4 - Disagree
5 - Strongly Disagree
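This text-to-number coding step can be sketched directly. The variable name `enjoy_ex` and the response data are hypothetical examples (note the name follows the conventions above: 8 characters or fewer, no spaces or symbols).

```python
# Mapping Likert text responses to numeric codes for data entry,
# matching the 1-5 scheme shown above.
LIKERT_CODES = {
    "Strongly Agree": 1,
    "Agree": 2,
    "Neutral": 3,
    "Disagree": 4,
    "Strongly Disagree": 5,
}

# Hypothetical raw survey responses for the question "I enjoy exercise."
raw_responses = ["Agree", "Strongly Agree", "Neutral", "Agree"]

# The numeric variable as it would be entered and analyzed.
enjoy_ex = [LIKERT_CODES[r] for r in raw_responses]
print(enjoy_ex)   # [2, 1, 3, 2]
```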
Types of Data – Terms associated with types of data are quantitative and qualitative. Quantitative data refers to data that are numeric; qualitative data refers to data that are non-numeric. The ways we analyze these types of data are very different. Quantitative data can be further categorized into discrete and continuous data. Discrete data are numeric data that have a finite number of possible values; continuous data are numeric data that have infinite possibilities. Qualitative data are also referred to as categorical data, meaning non-numerical data which may be divided into groups. We will also discuss this more in the independent/dependent variable lab week.

Labels – Refers to a description of the variable or of the response categories. There are variable labels and value labels. Variable labels assign descriptive labels to variables; the variable name and label may be the same for some variables. Often a label might be the entire question or a part of the question. Value labels give a description of the response categories, assigning descriptive labels to the values in the data file. Not all variables necessarily have assigned value labels.

Codebook – A codebook displays information for the variables in a data set and is a description of the data that were collected. According to Data and Statistical Services at Princeton University, the best codebooks have:

1. Description of the study: who did it, why they did it, how they did it.
2. Sampling information: what population was studied, how the sample was drawn, what the response rate was.
3. Technical information about the files themselves: number of observations, record length, number of records per observation, etc.
4. Structure of the data within the file: hierarchical, multiple cards, etc.
5. Details about the data: the columns in which specific variables can be found, whether they are character or numeric, and if numeric, what format.
6. Text of the questions and responses: some even include how many people responded a particular way.

Review of Research Designs

Every research project should begin with a purpose. When thinking about the purpose of the research, you will also want to think about the research design, or the structure of the study. Research designs fall into two categories:
Experimental – A study or experiment that imposes a treatment on a group of subjects or objects in order to observe the response. This differs from an observational study, which involves collecting and analyzing data about a group of subjects or objects without imposing a treatment or changing existing conditions. A "true" experimental design has the following characteristics: 1) random assignment of participants to groups, and 2) manipulation of an independent variable.

Quasi-experimental – Controls for confounding variables and is used to investigate cause-and-effect relationships. It differs from an experimental design in that there is no random assignment of participants to groups.

Reading: http://www.socialresearchmethods.net/kb/quasiexp.php

Study hypotheses

Once a problem or area of interest has been identified and researched, a hypothesis is created and stated by the investigator. A hypothesis is a "statement that describes a phenomenon and the relationships between the variables in the problem" (Southeastern Institute for Training and Evaluation). In order to formulate a hypothesis test, usually some theory has been researched and stated. The hypothesis is used for several reasons:

1. Gives a statement of relationships between variables to be tested
2. Offers a possible explanation of the research problem and a focus for the testing
3. Provides direction for the research
4. Provides a framework or organization for reporting the findings of the study
(Southeastern Institute for Training and Evaluation)

Hypothesis Testing

There is typically a six-step process involved in hypothesis testing:

• State the null hypothesis – a statement as to the unknown quantitative value of the parameter of interest. The null hypothesis relates to the statement being tested.
• State the alternative hypothesis – the predicted relationship between the variables of interest.
The alternative hypothesis relates to the statement to be accepted if/when the null is rejected.
• Select a level of significance – the fixed probability of wrongly rejecting the null hypothesis when it is in fact true. Usually the significance level is chosen to be .05 (5%), often written as p < .05.
• Collect and summarize the sample data – the process used to move from the beginning of the hypothesis testing procedure to the final decision.
• Refer to a criterion for evaluating the sample evidence – involves asking the question: are the sample data inconsistent with what would likely occur if the null hypothesis were true? If yes, the null hypothesis is rejected.
• Make a decision to retain or discard (reject) the null hypothesis.
(Huck, S.W. & Cormier, W.H., 1996. Reading Statistics and Research, New York, New York, p. 150)

More information on hypothesis testing can be found in the readings.

Readings:
http://davidmlane.com/hyperstat/logic_hypothesis.html
http://www.stats.gla.ac.uk/steps/glossary/hypothesis_testing.html

Analyzing Data

There are many programs, procedures, and commands used to analyze data. For the purposes of this class, only a select group of data analysis procedures will be used, with SPSS software. In order to analyze data, the data must first be collected and entered into a database. Most analysis programs will allow you to import a variety of database files. Setting up databases is not within the scope of this class and will not be discussed; I will show you exporting and importing using various file formats at the closing on-campus session.

Data cleaning – If you enter data and set up your own database, it's important to make sure that the data are clean and examined for possible data errors. Common errors may be:

• Values entered outside of the response categories
• Missing data
• Unexpected responses that are not easily identified
• Spelling errors
(Southeastern Institute for Training and Evaluation)

Reading: http://www.tulane.edu/~panda2/Analysis2/datclean/dataclean.htm

Basic Data Analysis

This section will mainly discuss basic descriptive statistics.

Measures of Central Tendency

Mean – The mean is the descriptive statistic probably used most often as a measure of central tendency. Means are often reported in conjunction with confidence intervals. A confidence interval for the mean yields a range of values around the mean where the "true" expected mean is found.
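The hypothesis-testing procedure outlined earlier can be sketched end to end using the two-group t-test from the earlier section. Assumptions to flag: the knowledge scores are invented, and the critical value 2.45 is quoted from a t table (6 degrees of freedom at p < .05) rather than computed here.

```python
import math

def two_sample_t(group_a, group_b):
    """Pooled two-sample t statistic for comparing two group means."""
    na, nb = len(group_a), len(group_b)
    ma, mb = sum(group_a) / na, sum(group_b) / nb
    ssa = sum((x - ma) ** 2 for x in group_a)   # sum of squared deviations, group A
    ssb = sum((x - mb) ** 2 for x in group_b)   # sum of squared deviations, group B
    pooled_var = (ssa + ssb) / (na + nb - 2)
    se = math.sqrt(pooled_var * (1 / na + 1 / nb))
    return (ma - mb) / se

# Null hypothesis: equal mean knowledge scores in the two groups.
# Alternative: the means differ. Significance level: p < .05.
t = two_sample_t([85, 90, 88, 92], [78, 80, 75, 83])

# Decision step: compare |t| to the critical value for the chosen
# significance level (about 2.45 here, taken from a t table).
print(round(t, 2), "reject null" if abs(t) > 2.45 else "retain null")
```

In practice a statistics package such as SPSS reports the exact p-level instead of requiring a table lookup.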
For example, if the mean in your sample is 60, and the lower and upper limits of the 95% confidence interval (p = .05) are 53 and 67, then you can conclude that there is a 95% probability that the population mean is greater than 53 and lower than 67. (http://www.mste.uiuc.edu/hill/dstat/dstat.html)

Median – The median is the middle value of a distribution when the data are arranged from highest to lowest; exactly half the data lie above the median and half below it. If there is an even number of values, the median is the average of the two middle observations.
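The even/odd rule for the median can be sketched in a few lines of Python; the small data sets are invented for illustration.

```python
def median(values):
    """Middle value of a sorted distribution; with an even number of
    values, the average of the two middle observations."""
    data = sorted(values)
    n = len(data)
    mid = n // 2
    if n % 2 == 1:
        return data[mid]
    return (data[mid - 1] + data[mid]) / 2

print(median([7, 1, 5]))      # odd count: the middle value, 5
print(median([7, 1, 5, 3]))   # even count: average of 3 and 5, i.e. 4.0
```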
Mode – The most frequently occurring value. It is possible to have more than one mode or no mode at all.

Reading: https://statistics.laerd.com/statistical-guides/measures-central-tendency-mean-mode-median.php

Measures of Variability around the Mean

After reviewing a set of values, the measures of central tendency, and the shape of the distribution, it's important to know more about the variability of the set of values. Most groups of values have some degree of variability, meaning at least some of the values differ (vary) from one another. There are two main measures of variability.

Variance – The variance is the sum of squared deviations from the mean divided by N-1. The variance is simply the standard deviation squared.

Standard deviation – Determined by figuring out how much each score deviates from the mean and placing these deviation scores into a formula. The standard deviation is the square root of the variance.

Reading: https://statistics.laerd.com/statistical-guides/measures-of-spread-range-quartiles.php

Cross Tabulations

Cross tabulations show the relationship between two variables. In most programs, cross tabulations are used to get basic descriptive statistics and to conduct other tests, such as a chi-square. Different programs may use different commands and/or procedures to conduct a cross tabulation. When performing a cross tabulation, it's important to know your independent and dependent variables, since independent variables are used as column headings and dependent variables are found in the rows.

Reading: http://web.idrc.ca/en/ev-56452-201-1-DO_TOPIC.html
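The variance and standard deviation formulas above can be checked with a short sketch; the scores are invented for illustration.

```python
import math

def sample_variance(values):
    """Sum of squared deviations from the mean divided by N - 1,
    matching the variance definition given above."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)

scores = [2, 4, 4, 4, 5, 5, 7, 9]
var = sample_variance(scores)
sd = math.sqrt(var)   # standard deviation = square root of the variance

print(round(var, 2), round(sd, 2))
# And the reverse relationship: the variance is the standard deviation squared.
print(abs(sd ** 2 - var) < 1e-9)
```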