Review of Basic Statistics and Terminology
What are Statistics?
There are various steps to consider in scientific research. The key steps are:
• Research Design
• Data Collection
• Description of data and statistical analysis (our focus for this course)
(Daniels, D., Nizam, Z, 1999, Biostatistics (notes), RSPH)
All of these steps involve, directly or indirectly, the data and statistics used to do
scientific research.
There are many different definitions of statistics, but the one that is most general
and fits most situations is the following:
Statistics are produced from data. The dictionary definition of "statistics" refers
to “numeric indicators of nations. Popular usage of the term points to numeric
summaries that condense information, or numbers that are used to make
comparisons, or numbers that portray relationships or associations. The term
statistics also refers to a formal discipline of study. The field of statistics is the
science of generalization. Built upon theories of probability and inference,
statistics support the making of broad generalizations from a smaller number of
specific observations.”
There are two types of statistics researchers are usually most interested in:
Descriptive and Inferential.
Descriptive Statistics “give us information around discovering and describing
the important features and trends contained in a set of data using quick and easy
graphical and numerical methods. In other words, they are most often used to
describe populations. Even though descriptive statistics most often yield
information about one variable, the relationship between two variables can be
described as well.
With inferential statistics, research hypotheses about a population of interest
are investigated using information on the basis of measurements contained in a
random sample of data from the population.”
(Price, I., 2000, University of New England, School of Psychology)
An important concept for understanding descriptive statistics is the Shape of
Distribution. A variable’s distribution shape is a very important aspect of the
variable’s description: it tells you the frequency of values in different ranges of
the variable. Four common terms to understand when evaluating the shape of a
distribution are:
• Normal
• Skewness
• Kurtosis
• Symmetrical/Asymmetrical
A normal distribution is a bell shaped curve that is symmetrical with scores more
frequent in the middle. A standard normal distribution has a mean of 0 and a
standard deviation of 1.
In a skewed distribution, most scores pile up at either the high or the low end;
usually, a small percentage of scores trail off in one direction away from the
majority, producing what is known as a “tail.” If the “tail” points toward the
upper end of the score continuum, the distribution is positively skewed. If the “tail”
points toward the lower end of the score continuum, the distribution is negatively
skewed. (Huck, S.W. & Cormier, W.H., 1996. Reading Statistics and Research,
New York, New York, p. 31).
Distributions that are skewed are asymmetrical, while normal distributions are
symmetrical. Kurtosis measures the “peakedness” of a distribution: a distribution
with high kurtosis has a sharper peak and heavier (“fat”) tails, while a distribution
with low kurtosis has a flatter peak, thinner tails, and values spread more evenly.
(http://www.statsoft.com/textbook/stbasic.html)
Readings:
http://www.itl.nist.gov/div898/handbook/eda/section3/eda35b.htm
http://davidmlane.com/hyperstat/normal_distribution.html
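As a small illustration of these shape measures, skewness and kurtosis can be computed directly from their moment definitions. This is a minimal Python sketch with hypothetical scores (statistical packages such as SPSS report bias-corrected versions of these statistics, so their values will differ slightly):

```python
from statistics import mean, pstdev

def shape(values):
    """Skewness and excess kurtosis from the standardized third and
    fourth moments (simple population formulas, not bias-corrected)."""
    m, s, n = mean(values), pstdev(values), len(values)
    skew = sum(((x - m) / s) ** 3 for x in values) / n
    kurt = sum(((x - m) / s) ** 4 for x in values) / n - 3
    return skew, kurt

# Hypothetical scores with a long upper tail
scores = [2, 3, 3, 4, 4, 4, 5, 5, 6, 15]
skew, kurt = shape(scores)
# skew > 0 here: the "tail" points toward the upper end (positive skew)
```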
Common concepts/terms
A Population in statistics refers to an entire group of people or objects. When a
statistical inference is made, a selection called a sample is first drawn from the
larger group, the population. The sample, usually comprised of people or objects,
is then measured, and the measurements are summarized with statistics. An
estimate is then made of the value of the same statistical concept in the
population. (Huck, S.W. & Cormier, W.H., 1996. Reading Statistics and Research,
New York, New York, p. 100).
Correlations – A correlation is a measure of the relation between two or more
variables. The most popular technique for assessing the strength of a bivariate
(two-variable) relationship is Pearson's product-moment correlation, also called
linear correlation. Correlation coefficients can range from -1.00 to +1.00: a value
of -1.00 represents a perfect negative correlation, a value of +1.00 represents a
perfect positive correlation, and a value of 0.00 represents a lack of correlation.
In order to evaluate a correlation between two variables, it’s important to know
the “magnitude” as well as the significance of the correlation. A perfect correlation
rarely exists; a correlation of .6 or above is often considered satisfactory, and
sometimes correlations of .3 are significant and worth reporting. Also, a negative
correlation does not imply an unsatisfactory correlation, as even a correlation of
-.129 can be significant. A negative correlation simply means there is an indirect
or inverse relationship between the variables.
Readings:
http://www.statsoft.com/textbook/stbasic.html (Basic Statistics tab - Correlation
section)
http://www.uwsp.edu/psych/stat/7/correlat.htm (Lessons I, II, III, and V)
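Pearson's r can be computed directly from its definition. A short Python sketch with hypothetical data (both variable names and values are made up for illustration):

```python
from statistics import mean

def pearson_r(x, y):
    """Pearson product-moment correlation: covariance of x and y
    divided by the product of their deviations' magnitudes."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

hours_exercised = [1, 2, 3, 4, 5]
resting_hr = [80, 76, 73, 70, 66]  # hypothetical data
r = pearson_r(hours_exercised, resting_hr)
# r is close to -1: a strong inverse (negative) relationship
```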
Outliers – Outliers in a set of data have values so far removed from the other
values in the distribution that their presence cannot be attributed to the random
combination of chance causes. An outlier is an atypical value that does not belong
to the distribution of the rest of the values in the data set. Values that are more
than 2.5 standard deviations from the mean can be defined as outliers; this holds
when the distribution is unimodal and symmetrical. For skewed data, the median is
a better indicator of central location than the mean. In that case, an observation
can be considered an outlier if it is more than 1.5 interquartile ranges away from
the closest quartile. An outlier is labeled extreme if it is more than 3 interquartile
ranges from the closest quartile, and mild otherwise.
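The interquartile-range rule can be sketched in Python using the standard library's `statistics.quantiles` to compute the quartiles (the data here are hypothetical):

```python
from statistics import quantiles

def iqr_outliers(values, extreme=False):
    """Flag outliers by distance from the closest quartile:
    more than 1.5 IQRs = mild, more than 3 IQRs = extreme."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    k = 3.0 if extreme else 1.5
    return [x for x in values if x < q1 - k * iqr or x > q3 + k * iqr]

ages = [23, 25, 26, 27, 28, 29, 30, 31, 32, 95]  # 95 looks atypical
mild_or_worse = iqr_outliers(ages)
extreme_only = iqr_outliers(ages, extreme=True)
```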
Frequencies – A frequency distribution shows how many subjects or objects were
similar, that is, ended up in the same category or had the same score. The total
frequency is usually labeled N and tells us how many subjects were measured.
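A frequency distribution is a simple tally; in Python this is `collections.Counter` over a column of (hypothetical) responses:

```python
from collections import Counter

responses = ["Agree", "Agree", "Neutral", "Disagree", "Agree", "Neutral"]
freq = Counter(responses)   # category -> frequency
N = sum(freq.values())      # total frequency, N
# freq.most_common() lists categories from most to least frequent
```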
T-Tests – The t-test is the most commonly used method to evaluate the
difference in means between two groups. For example, a t-test can be used
to test for the difference in knowledge scores between a group of students who
participated in a sex education class and those who did not. As long as the
variable is normally distributed within each group and the variation of scores in the
two groups is not reliably different, small sample sizes can be used when
performing a t-test. The p-level reported with a t-test represents the probability of
observing a difference this large by chance if there were really no difference
between the groups.
Reading:
http://www.texasoft.com/tutorial-statistics-compare-2-groups.htm
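The pooled two-sample t statistic behind this comparison can be sketched in Python. The scores below are hypothetical; a full analysis (as SPSS would report) also converts t and its degrees of freedom into a p-value via the t distribution, which is omitted here:

```python
from statistics import mean, variance

def two_sample_t(a, b):
    """Pooled-variance t statistic for the difference in means of two
    independent groups (assumes roughly equal variances)."""
    na, nb = len(a), len(b)
    sp2 = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    t = (mean(a) - mean(b)) / (sp2 * (1 / na + 1 / nb)) ** 0.5
    df = na + nb - 2
    return t, df

# Hypothetical knowledge scores for the sex-education example
took_class = [78, 85, 90, 72, 88, 81]
no_class = [70, 65, 74, 68, 72, 71]
t, df = two_sample_t(took_class, no_class)
# large positive t suggests the class group scored higher on average
```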
Confidence Intervals – A Confidence Interval is a range of values that is
normally used to describe the uncertainty around a point estimate of a quantity,
for example, a mortality rate. Confidence intervals provide a means of assessing
and reporting the precision of a point estimate and account for the uncertainty
that arises from the natural variation inherent in data collection. Confidence
intervals are constructed by adding a specific amount to the statistic (the upper
limit) and subtracting a specific amount from it (the lower limit). Researchers
also attach a confidence level to any interval constructed, usually 95% or 99%.
Reading:
http://www.stat.yale.edu/Courses/1997-98/101/confint.htm
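A confidence interval for a mean can be sketched as the point estimate plus or minus a critical value times the standard error. The sketch below uses the large-sample normal critical value 1.96 for 95% confidence (an assumption; small samples would use a t critical value instead), with hypothetical data:

```python
from statistics import mean, stdev

def ci_mean(values, z=1.96):
    """Approximate 95% CI for the mean: estimate +/- z * standard error.
    Uses the normal (z) critical value -- a large-sample simplification."""
    m = mean(values)
    se = stdev(values) / len(values) ** 0.5  # standard error of the mean
    return m - z * se, m + z * se            # (lower limit, upper limit)

rates = [58, 61, 59, 63, 60, 57, 62, 60]  # hypothetical rates
lower, upper = ci_mean(rates)
# the interval is centered on the sample mean
```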
Working with Data and questions
It’s important to be prepared for working with data before you enter, manipulate,
or analyze data. This also involves knowing which program you will use for data
management and analysis.
Part of the preparation involves knowing your questions. What questions do you
want answered from the data? What are your key research questions? Data
collection should be systematic and done with purpose, even though it’s
sometimes easy to explore data first, especially if you are working with an
existing data set or a large data set. If you are creating a data collection
instrument, it should be designed using specific techniques. When
designing the instrument, it’s important to think about how you are going
to analyze each question and what exactly you want to gain from each
question. Several factors to think about when creating a survey are the
following:
Variable name – A unique identifier for each question. Oftentimes,
variable names should be no longer than 8 characters, because some
statistical programs only allow 8 characters, and longer names may
pose a problem when importing/exporting between programs. Also,
variable names usually don’t begin with numbers or contain symbols,
punctuation, or spaces. Lastly, variable names should relate to the
question and be easy to identify or recall.
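These naming rules are easy to check mechanically. A small Python sketch of the conservative rules above (start with a letter, then letters/digits/underscores, at most 8 characters; exact rules vary by statistical package):

```python
import re

# At most 8 characters, starting with a letter, no spaces or punctuation
VALID_NAME = re.compile(r"^[A-Za-z][A-Za-z0-9_]{0,7}$")

def is_valid_name(name):
    """True if the variable name satisfies the conservative rules above."""
    return bool(VALID_NAME.match(name))
```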
Level of measurement – Refers to the nominal, ordinal, or interval/ratio
level of measurement. Each variable has one of these levels of
measurement. This is important to know because how we analyze our
data often depends on the level of measurement. We will discuss this
more in the independent/dependent variable lab week.
Type of variable – Field types refer to the data types associated with
a variable in your survey. They vary from program to program, but the
two most common field types are numeric (or integer) and string (or
text) variables. Numeric variables often contain whole numbers; string
variables contain letters, numbers, or special symbols. In survey
development, we may wish to use text responses for certain questions,
but when entering data we associate the text responses with numbers,
so the variable is entered and analyzed as numeric. For example, if we
have a question with a Likert scale such as:
1. I enjoy exercise.
Strongly Agree
Agree
Neutral
Disagree
Strongly Disagree
Then we may want to associate these responses with numbers such as:
1 - Strongly Agree
2 - Agree
3 - Neutral
4 - Disagree
5 - Strongly Disagree
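That text-to-number coding step can be sketched as a simple lookup table applied to the raw responses:

```python
# Numeric codes for the Likert responses, as listed above
LIKERT = {
    "Strongly Agree": 1,
    "Agree": 2,
    "Neutral": 3,
    "Disagree": 4,
    "Strongly Disagree": 5,
}

raw = ["Agree", "Strongly Agree", "Neutral", "Disagree"]  # hypothetical
coded = [LIKERT[r] for r in raw]  # entered and analyzed as numeric
```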
Types of Data – Terms associated with types of data are Quantitative
and Qualitative. Quantitative data refers to data that are numeric;
qualitative data refers to data that are non-numeric. The ways we
analyze these types of data are very different. Quantitative data can be
further categorized into discrete and continuous data: discrete data are
numeric data with a finite number of possible values, while continuous
data are numeric data with infinite possibilities. Qualitative data are
also referred to as categorical data, meaning non-numerical data that
may be divided into groups. We will also discuss this more in the
independent/dependent variable lab week.
Labels – Refers to a description of the variable or of the response
categories. There are variable labels and value labels. Variable labels
assign descriptive labels to variables; variable names and labels may
be the same for some variables, and often a label is the entire
question or part of it. Value labels assign descriptive labels to the
response categories in the data file. Not all variables necessarily have
assigned value labels.
Codebook – A codebook displays information for the variables in a
data set and is a description of the data that were collected. According
to Data and Statistical Services at Princeton University, the best
codebooks have:
1. Description of the study: who did it, why they did it, how they
did it.
2. Sampling information: what population was studied, how the
sample was drawn, what the response rate was.
3. Technical information about the files themselves: number of
observations, record length, number of records per observation,
etc.
4. Structure of the data within the file: hierarchical, multiple cards, etc.
5. Details about the data: columns in which specific variables can be
found, whether they are character or numeric, and if numeric, what
format.
6. Text of the questions and responses: some even have how many
people responded a particular way.
Review of Research Designs
Every research project should begin with a purpose. When thinking about the
purpose of the research, you will also want to think about the research design, or
the structure of the study. Research designs fall into two categories:
Experimental – a study or experiment that imposes a treatment on a group of
subjects or objects in order to observe the response. This differs from
observational studies, which involve collecting and analyzing data about a group
of subjects or objects without imposing a treatment or changing existing
conditions. A “true” experimental design has the following characteristics: 1)
random assignment of participants to groups, and 2) manipulation of an
independent variable.
Quasi-experimental – Also used to investigate cause-and-effect relationships
while controlling for confounding variables. This differs from an experimental
design in that there is no random assignment of participants to groups.
Readings:
http://www.socialresearchmethods.net/kb/quasiexp.php
Study hypotheses
Once a problem or area of interest has been identified and researched, a
hypothesis is created and stated by the investigator. A hypothesis is a
“statement that describes a phenomenon and the relationships between the
variables in the problem” (Southeastern Institute for Training and Evaluation). In
order to formulate a hypothesis test, usually some theory has been researched
and stated. The hypothesis is used for several reasons:
1. Gives a statement of relationships between variables to be tested
2. Offers a possible explanation of the research problem and a focus for the
testing
3. Provides direction for the research
4. Provides a framework or organization for reporting the findings of the
study
(Southeastern Institute for Training and Evaluation)
Hypothesis Testing
There typically is a 6 step process involved in hypothesis testing:
• State the null hypothesis – a statement as to the unknown quantitative value
of the parameter of interest. The null hypothesis relates to the statement
being tested.
• State the alternative hypothesis – the predicted relationship between the
variables of interest. The alternative hypothesis relates to the statement to be
accepted if/when the null is rejected.
• Select a level of significance – the fixed probability of wrongly rejecting the null
hypothesis if it is in fact true. Usually the significance level is chosen to be
.05 (5%), often seen as p < .05.
• Collect and summarize the sample data – process used to move from the
beginning points of the hypothesis testing procedure to the final decision
• Refer to a criterion for evaluating the sample evidence – involves asking the
question, are the sample data inconsistent with what would likely occur if the
null hypothesis is true? If yes, the null hypothesis would be rejected.
• Make a Decision to discard/retain the null hypothesis
(Huck, S.W. & Cormier, W.H., 1996. Reading Statistics and Research, New
York, New York, p. 150).
More information on hypothesis testing can be found in the readings.
Readings:
http://davidmlane.com/hyperstat/logic_hypothesis.html
http://www.stats.gla.ac.uk/steps/glossary/hypothesis_testing.html
Analyzing Data
There are many programs, procedures and commands used to analyze data.
For the purposes of this class, only a select group of data analysis procedures
will be used to analyze data using SPSS software. In order to analyze data, it
must first be collected and entered into a database. Most analysis programs will
allow you to import a variety of database files. Setting up databases is not within
the scope of this class and will not be discussed. I will show you exporting and
importing using various file formats at the closing on-campus session.
Data cleaning – If you enter data and set up your own database, it’s important to
make sure that the data is clean and examined for possible data errors.
Common errors may be:
• Values entered outside of the response categories
• Missing data
• Unexpected responses that are not easily identified
• Spelling errors
(Southeastern Institute for Training and Evaluation)
Reading:
http://www.tulane.edu/~panda2/Analysis2/datclean/dataclean.htm
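Two of the checks above, out-of-range values and missing data, can be sketched as simple scans over a column. The column below is hypothetical (Likert codes with a valid range of 1 to 5, `None` marking missing entries):

```python
# Hypothetical column of Likert codes; valid range is 1-5.
# None marks missing data; 9 is an out-of-range entry error.
column = [1, 3, 5, 9, None, 2]

missing = [i for i, v in enumerate(column) if v is None]
out_of_range = [i for i, v in enumerate(column)
                if v is not None and not 1 <= v <= 5]
# The row indices flagged here would be checked against the source forms
```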
Basic Data Analysis
This section will mainly discuss basic descriptive statistics.
Measures of Central Tendency
Mean - The mean is a descriptive statistic probably used most often as a
measure of central tendency. Means are often reported in conjunction with
confidence intervals. Confidence intervals for the mean yield a range of values
around the mean where the “true” expected mean is found. For example, if the
mean in your sample is 60, and the lower and upper limits of a 95% confidence
interval are 53 and 67, then you can be 95% confident that the population mean
lies between 53 and 67.
(http://www.mste.uiuc.edu/hill/dstat/dstat.html)
Median – The median is the middle value of a distribution when the data are
arranged from highest to lowest. Exactly half the data is above the median and
half is below it. If there is an even number of values, then it becomes the
average of the two middle observations.
Mode – The most frequently occurring value. It is possible to have more than
one mode or no mode at all.
Reading: https://statistics.laerd.com/statistical-guides/measures-central-
tendency-mean-mode-median.php
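All three measures of central tendency are available in Python's standard `statistics` module; the sample below is hypothetical and includes one extreme value to show how the mean and median react differently:

```python
from statistics import mean, median, mode

scores = [60, 62, 58, 62, 61, 90]  # one high score pulls the mean up
m = mean(scores)     # sensitive to the extreme value
md = median(scores)  # average of the two middle values (even N)
mo = mode(scores)    # most frequently occurring value
```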
Measures of Variability around the mean
After reviewing a set of values, measures of central tendency and shape of the
distribution, it’s important to know more about the variability of a set of values.
Most groups of values have some degree of variability, meaning at least some of
the values differ (vary) from one another. There are two main measures of
variability.
Variance – The variance is the sum of squared deviations from the mean divided
by N-1. The variance is simply the standard deviation squared.
Standard deviation – Determined by figuring out how much each score deviates
from the mean and by placing these deviation scores into a formula.
The standard deviation is the square root of the variance.
Readings: https://statistics.laerd.com/statistical-guides/measures-of-spread-
range-quartiles.php
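The relationship between the two measures, variance as the squared standard deviation, is easy to verify with the `statistics` module (hypothetical scores):

```python
from statistics import variance, stdev

scores = [2, 4, 4, 4, 5, 5, 7, 9]
v = variance(scores)  # sum of squared deviations from the mean / (N - 1)
s = stdev(scores)     # square root of the variance
# s ** 2 equals v (up to floating-point rounding)
```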
Cross Tabulations
Cross tabulations show the relationship between two variables. In most programs,
cross tabulations are used to get basic descriptive statistics and to conduct other
tests, such as a chi-square. Different programs may use different commands
and/or procedures to conduct a cross tabulation. When performing a cross
tabulation, it’s important to know your independent and dependent variables,
since independent variables are used as column headings and dependent
variables are found in the rows.
Reading:
http://web.idrc.ca/en/ev-56452-201-1-DO_TOPIC.html
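A cross tabulation is just a count of value pairs. A minimal Python sketch with hypothetical records, placing the independent variable in the columns and the dependent variable in the rows as described above:

```python
from collections import Counter

# Hypothetical records: (independent variable, dependent variable)
records = [("Male", "Agree"), ("Male", "Disagree"), ("Female", "Agree"),
           ("Female", "Agree"), ("Male", "Agree"), ("Female", "Disagree")]

table = Counter(records)  # (column, row) -> cell count
cols = sorted({c for c, _ in records})
rows = sorted({r for _, r in records})
for r in rows:
    # one printed row of the crosstab: counts per column category
    print(r, [table[(c, r)] for c in cols])
```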
More Related Content

PPTX
PPTX
What is an independent samples-t test?
PPTX
Multivariate
PPTX
Kolmogorov Smirnov
PPT
Two sample t-test
PPT
Teac lesson 5
PPTX
Discrete distributions: Binomial, Poisson & Hypergeometric distributions
PPTX
Statistical Estimation
What is an independent samples-t test?
Multivariate
Kolmogorov Smirnov
Two sample t-test
Teac lesson 5
Discrete distributions: Binomial, Poisson & Hypergeometric distributions
Statistical Estimation

What's hot (20)

PPTX
Variance & standard deviation
PPTX
MONOVA
PPTX
Sample and Population in Research - Meaning, Examples and Types
PPTX
One-Sample Hypothesis Tests
PPT
The research instruments
PDF
Introduction to random variables
PPTX
Estimating population mean
PPTX
6 typesofvariables
PPT
One Sample T Test
PPTX
CHAPTER 3: FREQUENCY DISTRIBUTION ..pptx
PPT
Quantitative data analysis
PPTX
Measures of Variation
PPTX
Types of random sampling
PDF
Regression Analysis
PDF
Stat 130 chi-square goodnes-of-fit test
PPTX
Research Variables
PPTX
An outline of Quantitative Research Methods
PPTX
Simple linear regression
PPTX
Item analysis ppt
PPT
Formative_and_Summative_Assessment_ppt.ppt
Variance & standard deviation
MONOVA
Sample and Population in Research - Meaning, Examples and Types
One-Sample Hypothesis Tests
The research instruments
Introduction to random variables
Estimating population mean
6 typesofvariables
One Sample T Test
CHAPTER 3: FREQUENCY DISTRIBUTION ..pptx
Quantitative data analysis
Measures of Variation
Types of random sampling
Regression Analysis
Stat 130 chi-square goodnes-of-fit test
Research Variables
An outline of Quantitative Research Methods
Simple linear regression
Item analysis ppt
Formative_and_Summative_Assessment_ppt.ppt
Ad

Viewers also liked (6)

PDF
9th ICCS Noordwijkerhout
PDF
Introductory Lecture to Applied Mathematics Stream
PDF
Applied Statistics - Introduction
PDF
Problems statistics 1
PPTX
Applied Statistics : Sampling method & central limit theorem
PPTX
Role of Statistics in Scientific Research
9th ICCS Noordwijkerhout
Introductory Lecture to Applied Mathematics Stream
Applied Statistics - Introduction
Problems statistics 1
Applied Statistics : Sampling method & central limit theorem
Role of Statistics in Scientific Research
Ad

Similar to Review of Basic Statistics and Terminology (20)

PPTX
Lecture 1.pptx
PPT
Introduction to statistics
PPT
Intro statistics
PPTX
Bio Statistics.pptx by Dr.REVATHI SIVAKUMAR
PPTX
Understanding statistics in research
PPTX
Statistical techniques for interpreting and reporting quantitative data i
PPTX
Basics of statistics
PPT
grade7statistics-150427083137-conversion-gate01.ppt
PDF
2_54248135948895858599595585887869437 2.pdf
PDF
STATISTICS-E.pdf
PPTX
01 Introduction (1).pptx
PPT
Introduction To Statistics.ppt
PPTX
Stat and prob a recap
PPT
Introduction-To-Statistics-18032022-010747pm (1).ppt
PDF
Lesson 1.pdf probability and statistics.
PPTX
Medical Statistics.pptx
DOCX
Statistics  What you Need to KnowIntroductionOften, when peop.docx
PPTX
fundamentals of data science and analytics on descriptive analysis.pptx
PPTX
Presentation1
PDF
Statistics of engineer’s with basic concepts in statistics
Lecture 1.pptx
Introduction to statistics
Intro statistics
Bio Statistics.pptx by Dr.REVATHI SIVAKUMAR
Understanding statistics in research
Statistical techniques for interpreting and reporting quantitative data i
Basics of statistics
grade7statistics-150427083137-conversion-gate01.ppt
2_54248135948895858599595585887869437 2.pdf
STATISTICS-E.pdf
01 Introduction (1).pptx
Introduction To Statistics.ppt
Stat and prob a recap
Introduction-To-Statistics-18032022-010747pm (1).ppt
Lesson 1.pdf probability and statistics.
Medical Statistics.pptx
Statistics  What you Need to KnowIntroductionOften, when peop.docx
fundamentals of data science and analytics on descriptive analysis.pptx
Presentation1
Statistics of engineer’s with basic concepts in statistics

More from aswhite (20)

PDF
Lab birth linear
PDF
Parc linear regression analysis in spss(1)
PDF
Linear regression interpretation
DOCX
Prs530 schedule sp17
DOCX
PRS530 Syllabus sp17
DOC
Parc variables
DOC
Codebook
PPT
Parc slides
PDF
Failed health system ebola jama2014
PDF
Who mbhss 2010 full web
PDF
Everybody's business
PDF
Aepi555 scheduleSP17
PDF
Interpretation
PDF
Independent and dependent variables
PPT
PARC Slides
PPT
BRFSS
PDF
SPSS Getting Started Tutorial
DOC
Computer Codebook
PPT
Overview spss student
PPT
Overview spss instructor
Lab birth linear
Parc linear regression analysis in spss(1)
Linear regression interpretation
Prs530 schedule sp17
PRS530 Syllabus sp17
Parc variables
Codebook
Parc slides
Failed health system ebola jama2014
Who mbhss 2010 full web
Everybody's business
Aepi555 scheduleSP17
Interpretation
Independent and dependent variables
PARC Slides
BRFSS
SPSS Getting Started Tutorial
Computer Codebook
Overview spss student
Overview spss instructor

Recently uploaded (20)

PDF
Supply Chain Operations Speaking Notes -ICLT Program
PDF
Hazard Identification & Risk Assessment .pdf
PPTX
Onco Emergencies - Spinal cord compression Superior vena cava syndrome Febr...
PDF
Indian roads congress 037 - 2012 Flexible pavement
PPTX
Final Presentation General Medicine 03-08-2024.pptx
PPTX
Introduction to Building Materials
PPTX
1st Inaugural Professorial Lecture held on 19th February 2020 (Governance and...
PDF
Chinmaya Tiranga quiz Grand Finale.pdf
PPTX
202450812 BayCHI UCSC-SV 20250812 v17.pptx
PDF
Practical Manual AGRO-233 Principles and Practices of Natural Farming
PPTX
Final Presentation General Medicine 03-08-2024.pptx
PPTX
CHAPTER IV. MAN AND BIOSPHERE AND ITS TOTALITY.pptx
PPTX
Chinmaya Tiranga Azadi Quiz (Class 7-8 )
PDF
OBE - B.A.(HON'S) IN INTERIOR ARCHITECTURE -Ar.MOHIUDDIN.pdf
PPTX
Digestion and Absorption of Carbohydrates, Proteina and Fats
PPTX
A powerpoint presentation on the Revised K-10 Science Shaping Paper
PPTX
Unit 4 Skeletal System.ppt.pptxopresentatiom
PPTX
Lesson notes of climatology university.
PDF
Paper A Mock Exam 9_ Attempt review.pdf.
PPTX
Radiologic_Anatomy_of_the_Brachial_plexus [final].pptx
Supply Chain Operations Speaking Notes -ICLT Program
Hazard Identification & Risk Assessment .pdf
Onco Emergencies - Spinal cord compression Superior vena cava syndrome Febr...
Indian roads congress 037 - 2012 Flexible pavement
Final Presentation General Medicine 03-08-2024.pptx
Introduction to Building Materials
1st Inaugural Professorial Lecture held on 19th February 2020 (Governance and...
Chinmaya Tiranga quiz Grand Finale.pdf
202450812 BayCHI UCSC-SV 20250812 v17.pptx
Practical Manual AGRO-233 Principles and Practices of Natural Farming
Final Presentation General Medicine 03-08-2024.pptx
CHAPTER IV. MAN AND BIOSPHERE AND ITS TOTALITY.pptx
Chinmaya Tiranga Azadi Quiz (Class 7-8 )
OBE - B.A.(HON'S) IN INTERIOR ARCHITECTURE -Ar.MOHIUDDIN.pdf
Digestion and Absorption of Carbohydrates, Proteina and Fats
A powerpoint presentation on the Revised K-10 Science Shaping Paper
Unit 4 Skeletal System.ppt.pptxopresentatiom
Lesson notes of climatology university.
Paper A Mock Exam 9_ Attempt review.pdf.
Radiologic_Anatomy_of_the_Brachial_plexus [final].pptx

Review of Basic Statistics and Terminology

  • 1. Review of Basic Statistics and Terminology What are Statistics? There are various steps to consider in scientific research. The key steps are: • Research Design • Data Collection • Description of data and statistical analysis (our focus for this course) (Daniels, D., Nizam, Z, 1999, Biostatistics (notes), RSPH) All of these steps consider directly or indirectly data or the statistics used to do scientific research. There are actually many different definitions of statistics but the one that is more general and fits most situations is the following: Statistics are produced from data. The dictionary definition of "statistics" refers to “numeric indicators of nations. Popular usage of the term points to numeric summaries that condense information, or numbers that are used to make comparisons, or numbers that portray relationships or associations. The term statistics also refers a formal discipline of study. The field of statistics is the science of generalization. Built upon theories of probability and inference, statistics support the making of broad generalizations from a smaller number of specific observations.” There are two types of statistics researchers are usually most interested in: Descriptive and Inferential. Descriptive Statistics “give us information around discovering and describing the important features and trends contained in a set of data using quick and easy graphical and numerical methods. In other words, they are most often used to describe populations. Even though descriptive statistics most often yield information about one variable, the relationship between two variables can be described as well. 
With inferential statistics, research hypotheses about a population of interest are investigated using information on the basis of measurements contained in a random sample of data from the population.“ (Price, I., 2000, University of New England, School of Psychology) An important concept to understanding descriptive statistics is the Shape of Distribution. A variable’s distribution shape is a very important aspect of the variables description. The shape of the distribution tells you the frequency of values from different ranges of the variable. Four common terms to understand when evaluating shape of the distribution is: • Normal • Skewness • Kurtosis • Symetrical/Asymmetrical
  • 2. A normal distribution is a bell shaped curve that is symmetrical with scores more frequent in the middle. A standard normal distribution has a mean of 0 and a standard deviation of 1. Skewed distributions yield most scores as being high or low. Usually, a small percentage of scores are shown in one direction away from the majority. Skewed distributions produce what is known as a “tail.” If the “tail” points toward the upper end of the score continuum, the distribution is “positive.” If the “tail “points toward the lower end of the score continuum, the tail is “negative.” (Huck, S.W. & Cormier, W.H., 1996. Reading Statistics and Research, New York, New York, p. 31). Distributions that are skewed are typically considered asymmetrical while normal distributions are symmetrical. Kurtosis measures “peakedness” of the distribution. A high peak gives a distribution with “fat” tails and a low even distribution while a low peak gives a “skinny” tail and a distribution concentrated towards the mean. (http://www.statsoft.com/textbook/stbasic.html, Readings: http://www.itl.nist.gov/div898/handbook/eda/section3/eda35b.htm http://davidmlane.com/hyperstat/normal_distribution.html Common concepts/terms A Population in statistics is often referred to as an entire group of people or objects. When a statistical inference is made, a selection, called a sample, is first removed from a larger group called a population. The sample, usually comprised of people or objects, is then measured using statistics. These measurements are then summarized. A guess is then made as to the numerical value of the same statistical concept in the population. (Huck, S.W. & Cormier, W.H., 1996. Reading Statistics and Research, New York, New York, p. 100). Correlations – A correlation is a measure of the relation between two or more variables. 
The most popular technique for assessing the strength of a bivariate (having two variables) relationship is Pearson's product-moment correlation, also called linear or product-moment correlation. Correlation coefficients can range from -1.00 to +1.00. The value of -1.00 represents a negative correlation while a value of +1.00 represents a perfect positive correlation. A value of 0.00 represents a lack of correlation. In order to evaluate a correlation between 2 variables, it’s important to know the “magnitude” as well as the significance of the correlation. A perfect correlation rarely exists but a correlation at .6 or above is often considered to be satisfactory. Sometimes correlations observed at .3 are significant and worth reporting. Also, a negative correlation does not infer an unsatisfactory correlation, as a -.129 can be a significant correlation. A negative correlation really means there is an indirect or inverse relationship between the variables. Readings: http://www.statsoft.com/textbook/stbasic.html (Basic Statistics tab - Correlation section) http://www.uwsp.edu/psych/stat/7/correlat.htm (Lessons I, II, III, and V)
  • 3. Outliers – Outliers in a set of data, have a value so far removed from other values in the distribution that its presence cannot be attributed to the random combination of chance causes. An outlier is an atypical value that does not belong to the distribution of the rest of the values in the data set. Values that are more than 2.5 standard deviations from the mean can be defined as outliers. This would hold when the distribution is unimodal and symmetrical. For skewed data, median is a better indicator of central location than mean. In such a case, an observation can be considered as an outlier if it more than 1.5 inter-quartile range away from the closest quartile. An outlier would be labeled an extreme outlier if it is more than 3 inter-quartile range from the closest quartile, and it would be called mild otherwise. Frequencies - A frequency distribution yields how many subjects or objects were similar and ended up in the same category or had the same score. The total frequency is usually labeled as N, and tells us how many subjects were measured. T-Tests – the t-test is the most commonly used method to evaluate the differences in means between two groups. For example, the t-test can be used to test for the difference in knowledge scores between a group of students who participated in a sex education class and those students who did not. As long as variables are normally distributed within each group and variation of scores in the two groups is not reliably different, small sample sizes can be used when performing a t-test. The p-level reported with a t-test represents the probability of error involved in accepting our research hypothesis about the existence of a difference. Reading: http://www.texasoft.com/tutorial-statistics-compare-2-groups.htm Confidence Intervals – A Confidence Interval is a range of values that is normally used to describe the uncertainty around a point estimate of a quantity, for example, a mortality rate. 
Confidence intervals provide a means of assessing and reporting the precision of a point estimate and account for the uncertainty that arises from the natural variation inherent in data collection. Confidence intervals are constructed by adding a specific amount to the statistic (the upper limit) and by subtracting a specific amount from the statistic (the lower limit). Researchers also attach a percentage to any interval constructed, usually either 95 or 99.

Reading: http://www.stat.yale.edu/Courses/1997-98/101/confint.htm

Working with Data and questions

It's important to be prepared for working with data before you enter, manipulate, or analyze it. This includes knowing which program you will use for data management and analysis. Part of the preparation involves knowing your questions: What questions do you want answered from the data? What are your key research questions? Data
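The add-to/subtract-from construction described above can be sketched for a 95% confidence interval around a sample mean. One assumption: this sketch uses the normal-approximation multiplier 1.96 rather than an exact t-table value, so it is only a reasonable approximation for moderately large samples. The scores are invented for illustration.

```python
import math
import statistics

def mean_ci_95(values):
    """95% confidence interval for a mean: the point estimate plus and
    minus a margin (1.96 standard errors; a normal approximation)."""
    n = len(values)
    mean = statistics.mean(values)
    se = statistics.stdev(values) / math.sqrt(n)   # standard error of the mean
    margin = 1.96 * se
    return mean - margin, mean + margin            # (lower limit, upper limit)

low, high = mean_ci_95([61, 58, 63, 60, 59, 62, 57, 60])
print(round(low, 1), round(high, 1))   # interval around the sample mean of 60
```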
collection should be systematic and done with purpose, even though it's sometimes tempting to explore the data first, especially if you are working with an existing or large data set. If you are creating a data collection instrument, it should be designed using specific techniques. When designing the instrument, it's important to think about how you are going to analyze each question and what exactly you want to gain from it. Several factors to think about when creating a survey are the following:

Variable name – A unique identifier for each question. Variable names often should not be longer than 8 characters, because some statistical programs only allow 8 characters, which can pose a problem when importing/exporting between programs. Also, variable names usually should not begin with numbers or contain symbols, punctuation, or spaces. Lastly, variable names should relate to the question and be easy to identify or recall.

Level of measurement – Refers to the nominal, ordinal, or interval/ratio level of measurement. Each variable will have one of these levels of measurement. This is important to know because how we analyze our data often depends on the level of measurement. We will discuss this more in the independent/dependent variable lab week.

Type of variable – Field types refer to the data types associated with a variable in your survey. They vary from program to program, but the two most common field types are numeric (or integer) variables and string (or text) variables. Numeric variables often contain whole numbers; string variables contain letters, numbers, or special symbols. In survey development, we may wish to use text responses for certain questions, but when entering data we associate the text responses with numbers, so they are entered and analyzed as a numeric variable. For example, consider a question with a Likert scale such as:

1. I enjoy exercise.
Strongly Agree / Agree / Neutral / Disagree / Strongly Disagree

Then we may want to associate these responses with numbers, such as:

1 - Strongly Agree
2 - Agree
3 - Neutral
4 - Disagree
5 - Strongly Disagree
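This text-to-number coding step can be sketched directly. The variable name `enjoy_ex` and the response data are hypothetical examples (note the name follows the conventions above: 8 characters or fewer, no spaces or symbols).

```python
# Mapping Likert text responses to numeric codes for data entry,
# matching the 1-5 scheme shown above.
LIKERT_CODES = {
    "Strongly Agree": 1,
    "Agree": 2,
    "Neutral": 3,
    "Disagree": 4,
    "Strongly Disagree": 5,
}

# Hypothetical raw survey responses for the question "I enjoy exercise."
raw_responses = ["Agree", "Strongly Agree", "Neutral", "Agree"]

# The numeric variable as it would be entered and analyzed.
enjoy_ex = [LIKERT_CODES[r] for r in raw_responses]
print(enjoy_ex)   # [2, 1, 3, 2]
```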
Types of Data – Terms associated with types of data are quantitative and qualitative. Quantitative data refers to data that are numeric; qualitative data refers to data that are non-numeric. The ways we analyze these types of data are very different. Quantitative data can be further categorized into discrete and continuous data. Discrete data are numeric data that have a finite number of possible values; continuous data are numeric data that have infinite possibilities. Qualitative data are also referred to as categorical data, meaning non-numerical data which may be divided into groups. We will also discuss this more in the independent/dependent variable lab week.

Labels – Refers to a description of the variable or of the response categories. There are variable labels and value labels. Variable labels assign descriptive labels to variables; the variable name and label may be the same for some variables. Often a label might be the entire question or a part of the question. Value labels give a description of the response categories, assigning descriptive labels to the values in the data file. Not all variables necessarily have assigned value labels.

Codebook – A codebook displays information for the variables in a data set and is a description of the data that were collected. According to Data and Statistical Services at Princeton University, the best codebooks have:

1. Description of the study: who did it, why they did it, how they did it.
2. Sampling information: what population was studied, how the sample was drawn, what the response rate was.
3. Technical information about the files themselves: number of observations, record length, number of records per observation, etc.
4. Structure of the data within the file: hierarchical, multiple cards, etc.
5. Details about the data: the columns in which specific variables can be found, whether they are character or numeric, and if numeric, what format.
6. Text of the questions and responses: some even include how many people responded a particular way.

Review of Research Designs

Every research project should begin with a purpose. When thinking about the purpose of the research, you will also want to think about the research design, or the structure of the study. Research designs fall into two categories:
Experimental – A study or experiment that imposes a treatment on a group of subjects or objects in order to observe the response. This differs from an observational study, which involves collecting and analyzing data about a group of subjects or objects without imposing a treatment or changing existing conditions. A "true" experimental design has the following characteristics: 1) random assignment of participants to groups, and 2) manipulation of an independent variable.

Quasi-experimental – Controls for confounding variables and is used to investigate cause-and-effect relationships. It differs from an experimental design in that there is no random assignment of participants to groups.

Reading: http://www.socialresearchmethods.net/kb/quasiexp.php

Study hypotheses

Once a problem or area of interest has been identified and researched, a hypothesis is created and stated by the investigator. A hypothesis is a "statement that describes a phenomenon and the relationships between the variables in the problem" (Southeastern Institute for Training and Evaluation). In order to formulate a hypothesis test, usually some theory has been researched and stated. The hypothesis is used for several reasons:

1. Gives a statement of relationships between variables to be tested
2. Offers a possible explanation of the research problem and a focus for the testing
3. Provides direction for the research
4. Provides a framework or organization for reporting the findings of the study
(Southeastern Institute for Training and Evaluation)

Hypothesis Testing

There is typically a six-step process involved in hypothesis testing:

• State the null hypothesis – a statement as to the unknown quantitative value of the parameter of interest. The null hypothesis relates to the statement being tested.
• State the alternative hypothesis – the predicted relationship between the variables of interest.
The alternative hypothesis relates to the statement to be accepted if/when the null is rejected.
• Select a level of significance – the fixed probability of wrongly rejecting the null hypothesis when it is in fact true. Usually the significance level is chosen to be .05 (5%), often written as p < .05.
• Collect and summarize the sample data – the process used to move from the beginning of the hypothesis testing procedure to the final decision.
• Refer to a criterion for evaluating the sample evidence – involves asking the question: are the sample data inconsistent with what would likely occur if the null hypothesis were true? If yes, the null hypothesis is rejected.
• Make a decision to retain or discard (reject) the null hypothesis.
(Huck, S.W. & Cormier, W.H., 1996. Reading Statistics and Research, New York, New York, p. 150)

More information on hypothesis testing can be found in the readings.

Readings:
http://davidmlane.com/hyperstat/logic_hypothesis.html
http://www.stats.gla.ac.uk/steps/glossary/hypothesis_testing.html

Analyzing Data

There are many programs, procedures, and commands used to analyze data. For the purposes of this class, only a select group of data analysis procedures will be used, with SPSS software. In order to analyze data, the data must first be collected and entered into a database. Most analysis programs will allow you to import a variety of database files. Setting up databases is not within the scope of this class and will not be discussed; I will show you exporting and importing using various file formats at the closing on-campus session.

Data cleaning – If you enter data and set up your own database, it's important to make sure that the data are clean and examined for possible data errors. Common errors may be:

• Values entered outside of the response categories
• Missing data
• Unexpected responses that are not easily identified
• Spelling errors
(Southeastern Institute for Training and Evaluation)

Reading: http://www.tulane.edu/~panda2/Analysis2/datclean/dataclean.htm

Basic Data Analysis

This section will mainly discuss basic descriptive statistics.

Measures of Central Tendency

Mean – The mean is the descriptive statistic probably used most often as a measure of central tendency. Means are often reported in conjunction with confidence intervals. A confidence interval for the mean yields a range of values around the mean where the "true" expected mean is found.
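The hypothesis-testing procedure outlined earlier can be sketched end to end using the two-group t-test from the earlier section. Assumptions to flag: the knowledge scores are invented, and the critical value 2.45 is quoted from a t table (6 degrees of freedom at p < .05) rather than computed here.

```python
import math

def two_sample_t(group_a, group_b):
    """Pooled two-sample t statistic for comparing two group means."""
    na, nb = len(group_a), len(group_b)
    ma, mb = sum(group_a) / na, sum(group_b) / nb
    ssa = sum((x - ma) ** 2 for x in group_a)   # sum of squared deviations, group A
    ssb = sum((x - mb) ** 2 for x in group_b)   # sum of squared deviations, group B
    pooled_var = (ssa + ssb) / (na + nb - 2)
    se = math.sqrt(pooled_var * (1 / na + 1 / nb))
    return (ma - mb) / se

# Null hypothesis: equal mean knowledge scores in the two groups.
# Alternative: the means differ. Significance level: p < .05.
t = two_sample_t([85, 90, 88, 92], [78, 80, 75, 83])

# Decision step: compare |t| to the critical value for the chosen
# significance level (about 2.45 here, taken from a t table).
print(round(t, 2), "reject null" if abs(t) > 2.45 else "retain null")
```

In practice a statistics package such as SPSS reports the exact p-level instead of requiring a table lookup.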
For example, if the mean in your sample is 60, and the lower and upper limits of the 95% confidence interval (p = .05) are 53 and 67, then you can conclude that there is a 95% probability that the population mean is greater than 53 and lower than 67. (http://www.mste.uiuc.edu/hill/dstat/dstat.html)

Median – The median is the middle value of a distribution when the data are arranged from highest to lowest; exactly half the data lie above the median and half below it. If there is an even number of values, the median is the average of the two middle observations.
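The even/odd rule for the median can be sketched in a few lines of Python; the small data sets are invented for illustration.

```python
def median(values):
    """Middle value of a sorted distribution; with an even number of
    values, the average of the two middle observations."""
    data = sorted(values)
    n = len(data)
    mid = n // 2
    if n % 2 == 1:
        return data[mid]
    return (data[mid - 1] + data[mid]) / 2

print(median([7, 1, 5]))      # odd count: the middle value, 5
print(median([7, 1, 5, 3]))   # even count: average of 3 and 5, i.e. 4.0
```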
Mode – The most frequently occurring value. It is possible to have more than one mode or no mode at all.

Reading: https://statistics.laerd.com/statistical-guides/measures-central-tendency-mean-mode-median.php

Measures of Variability around the Mean

After reviewing a set of values, the measures of central tendency, and the shape of the distribution, it's important to know more about the variability of the set of values. Most groups of values have some degree of variability, meaning at least some of the values differ (vary) from one another. There are two main measures of variability.

Variance – The variance is the sum of squared deviations from the mean divided by N-1. The variance is simply the standard deviation squared.

Standard deviation – Determined by figuring out how much each score deviates from the mean and placing these deviation scores into a formula. The standard deviation is the square root of the variance.

Reading: https://statistics.laerd.com/statistical-guides/measures-of-spread-range-quartiles.php

Cross Tabulations

Cross tabulations show the relationship between two variables. In most programs, cross tabulations are used to get basic descriptive statistics and to conduct other tests, such as a chi-square. Different programs may use different commands and/or procedures to conduct a cross tabulation. When performing a cross tabulation, it's important to know your independent and dependent variables, since independent variables are used as column headings and dependent variables are found in the rows.

Reading: http://web.idrc.ca/en/ev-56452-201-1-DO_TOPIC.html
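The variance and standard deviation formulas above can be checked with a short sketch; the scores are invented for illustration.

```python
import math

def sample_variance(values):
    """Sum of squared deviations from the mean divided by N - 1,
    matching the variance definition given above."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)

scores = [2, 4, 4, 4, 5, 5, 7, 9]
var = sample_variance(scores)
sd = math.sqrt(var)   # standard deviation = square root of the variance

print(round(var, 2), round(sd, 2))
# And the reverse relationship: the variance is the standard deviation squared.
print(abs(sd ** 2 - var) < 1e-9)
```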