Introduction_to_Statistics_as_used_in_th.ppt

INTRODUCTION TO STATISTICS
April 25th 2013
James.Hall@Education.ox.ac.uk
Coral.Milburn-Curtis@Education.ox.ac.uk

STRUCTURE OF THIS MORNING
 9.30-10.15: Introduction & basic concepts
15 minute break
 10.30:11.15: Introduction to SPSS
15 minute break
 11.30-12.30: Two worked examples
2

PURPOSE OF THIS MORNING
 To “set the scene” for people to go away and learn
statistics for themselves
 With the help of a textbook at the appropriate level:
1. Pallant. SPSS Survival Manual
2. Field. Discovering Statistics Using SPSS
3. Tabachnick and Fidell. Using Multivariate Statistics
 At the end of this morning, you should be able to:
 Understand that there are four areas of knowledge
required to successfully produce statistics in the social
sciences
 Understand basic statistical terminology
 Have an introductory understanding of SPSS
3

SESSION ONE: INTRODUCTION AND
BASIC CONCEPTS
9.30-10.15
4

CONTENTS
1. Background
2. Essential Ideas (no maths though)
 Including Descriptive Statistics
3. Inferential Statistics
4. Things that you can easily do in Microsoft Excel
5

WHERE YOU FIND STATISTICS WITHIN JOURNAL
ARTICLES & DISSERTATIONS: EVERYWHERE!
 Literature Review
 Determining the statistical weaknesses in past research in order to
identify the gap which will be addressed
 Method
 Participants
 Description of participants – including numbers and description of their
background (can include representativeness)
 Materials/Measures
 Presentation of all measures and full description
 Design
 Description of the design of the study and presentation of variables
 Procedure
 Description of what was undertaken – including manipulation of variables
 Results
 Descriptive statistics first, then Inferential. The aim is the same as
in a literature review/essay – to tell a coherent story
 Discussion
 The strengths and weaknesses of this research compared to
previous studies – suggestions for future research
7

FOUR KEY AREAS OF KNOWLEDGE TO ACQUIRE
SHOULD YOU NEED TO DO ANY OF THIS
1. Data Management
 Data Entry (into a suitable software)
 Variable Creation
 Data Cleaning & Variable Modification
2. Producing Statistics: Descriptive
 Documenting Missing Data
 Measures of Central Tendency
 Range & Dispersion of Scores
3. Producing Statistics: Inferential
 Tests (...of Difference, ...of Association)
 Models
4. Presentation Skills
 Ability to write-up/report statistics
 Ability to generate suitable tables & graphs
8

(NOTE)
 All four areas of knowledge are needed should you ever
need statistics for your own research
 However, learning how to carry out statistics (numbers 2 and
3 in the previous slide) can be quicker and easier
than learning Data Management (1) and learning
how to present/write-up statistics (4)
 Further, textbooks and online courses commonly
skip areas 1 and 4.
9

SOURCES OF KNOWLEDGE
 Textbooks:
 BASICS->INTERMEDIATE:
 Field, A. (2009) Discovering statistics using SPSS (and sex, drugs
and rock'n'roll). 3rd ed. London: Sage Publications Inc.
 INTERMEDIATE->ADVANCED:
 Tabachnick, B.G. & Fidell, L.S. (2013) Using Multivariate Statistics. 6th
ed. Boston: Allyn and Bacon
 Websites:
 Andy Field’s website: http://www.statisticshell.com/
 (Warning: he has a very quirky sense of humour & there is bad
language)
 Includes videos of his statistics lectures including how to write up
statistics which can otherwise be found here:
 http://www.youtube.com/watch?v=vekCPvF016A
10

COMPUTER PACKAGES
 Microsoft Excel will only take you so far...
 Perhaps the most common statistical software
package is SPSS
 The University of Oxford has a site-license for this
meaning that you can get it installed on your
machines
 It’s also on the machines in the Department’s
computer room
 Excel is a Spreadsheet programme – SPSS is a
database programme – don’t be confused
 Other statistical software packages (that do the
basics well) include: STATA and SAS
11

LEVELS OF MEASUREMENT
 Measurement is the representation of information with
numbers
 There are different levels of complexity in how we use
numbers in measurement – from simple to complex
 From most-simple to most-complex, there are three
commonly used levels of complexity:
1. “Discrete”/“Nominal”/”Categorical” Level: e.g. east =1,
west=2
2. “Ordinal”: e.g. small=1, medium=2, large=3
3. “Continuous”
a. “Interval”: equal steps between values but no true zero. e.g.
temperature in degrees Celsius
b. “Ratio”: a special type of interval data. One where zero
represents nothing. e.g. income or age in years
13

LEVELS OF MEASUREMENT
 One “level of measurement” is special however – it is
both categorical and ordinal at the same time:
 “Dichotomous”/“Binary”
 discrete data with only 2 conditions. (0 and 1 is the usual way of
coding) e.g. Employed/not-employed
 Being able to identify a level of measurement is the
most important first thing to learn:
 It informs which “measure of central tendency” and
method of documenting “dispersion” that you should
report
 And it informs which “Inferential” “Test” or “Model” you
should carry out and how you should go about this
14

CLASS EXERCISE
 Talk to your neighbour: Which level of measurement best
describes the following measures?:
 Telephone numbers
 Gender
 Participants’ scores on a self-report anxiety question:
 Strongly Agree (5), Agree (4), Neither Agree nor Disagree (3),
Disagree (2), Strongly Disagree (1)
 Height
 University Rankings
15

DESCRIPTIVE STATISTICS
 Once we have identified each variable’s level of
measurement we can then describe this variable with
descriptive statistics
 Measures of Central Tendency:
 For Discrete Data: Mode
 For Ordinal Data: Median
 (though you can report the mode as well)
 For Continuous Data: Mean
 (though you can report the median and mode as well)
 Measures of Dispersion:
 For Ordinal Data: Inter-quartile Range
 For Continuous Data: Standard Deviation
16
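The pairing of level of measurement with a measure of central tendency can be sketched in a few lines of Python. This is only an illustration: the data values and the function name `describe` are made up for the example, and the standard-library `statistics` module stands in for what SPSS or Excel would report.

```python
import statistics

# Hypothetical example data, one variable per level of measurement.
region = ["east", "west", "east", "east", "west"]  # discrete / nominal
size = [1, 2, 2, 3, 3, 3, 2]                        # ordinal: small=1, medium=2, large=3
age = [23.0, 31.0, 27.0, 45.0, 38.0, 29.0]          # continuous


def describe(values, level):
    """Return the measure of central tendency suggested for this level."""
    if level == "discrete":
        return {"mode": statistics.mode(values)}
    if level == "ordinal":
        return {"median": statistics.median(values)}
    if level == "continuous":
        return {"mean": statistics.mean(values)}
    raise ValueError(f"unknown level of measurement: {level}")


print(describe(region, "discrete"))
print(describe(size, "ordinal"))
print(describe(age, "continuous"))
```

Passing the level in explicitly is deliberate: the software cannot tell from the numbers alone whether 1, 2, 3 are categories, ranks or quantities.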

INTER-QUARTILE RANGE
 The median is the middle value of an ordered set of
scores. It is the “second quartile” (“Q2”)
 If we divide the ordered scores into four equal-sized groups, the second
quartile is the cut point in the middle – as the median is.
 The “Inter-Quartile Range” (IQR) is the middle-
range that surrounds the median
 First quartile (“Q1”) to the third (“Q3”)
 We get this range by the simple subtraction of Q1 from
Q3:
 IQR=Q3-Q1
17
[Figure: ordered scores divided into quarters, marked Q1, Q2 (the Median), Q3 and Q4]
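As a small worked example of IQR = Q3 - Q1, the sketch below uses Python's standard-library `statistics.quantiles`. The scores are hypothetical, and different packages (SPSS, Excel) use slightly different conventions for locating quartiles, so results on small samples may differ a little from this.

```python
import statistics

scores = [1, 2, 3, 4, 5, 6, 7]  # hypothetical ordinal scores

# quantiles(n=4) returns the three cut points Q1, Q2 (the median) and Q3.
q1, q2, q3 = statistics.quantiles(scores, n=4)
iqr = q3 - q1

print(q1, q2, q3, iqr)  # prints 2.0 4.0 6.0 4.0
```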

GRAPHING ORDINAL DATA
18
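This is a box plot
Box: the range encompassing the middle 50% of values. AKA: the “inter-quartile range”
Line inside the box: the middle value. AKA the Median
Whiskers: the range encompassing the remaining values that are not outliers (in the usual convention, values within 1.5 × IQR of the box)
“Outliers”: values that fall beyond the whiskers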

STANDARD DEVIATION
 When we calculate a mean, we understand that not everyone
actually has this score
 Some scores are closer, some are further away from the mean
 But how close are people’s scores to the mean - on average?
 This is the Standard Deviation:
19
[Figure: distribution of scores marked at the Mean, +1 standard deviation and -1 standard deviation]
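The idea of "how far scores are from the mean, on average" can be made concrete with a short calculation. This is a sketch with hypothetical scores; it shows both the population version (divide by n) and the sample version (divide by n-1) that statistics packages usually report.

```python
import math
import statistics

scores = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical scores

mean = sum(scores) / len(scores)
squared_deviations = [(x - mean) ** 2 for x in scores]

# Population SD: average squared distance from the mean, then square root.
population_sd = math.sqrt(sum(squared_deviations) / len(scores))

# Sample SD: divide by n-1 instead of n when the scores are a sample.
sample_sd = math.sqrt(sum(squared_deviations) / (len(scores) - 1))

print(mean, population_sd, round(sample_sd, 3))
# Cross-check against the standard library's sample SD.
print(round(statistics.stdev(scores), 3))
```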

THE NORMAL DISTRIBUTION
 This is a special “distribution” of continuous data
 Many real-life continuous variables are normally distributed
 95% of the cases in a normally distributed continuous variable occur in
the blue area (mean ±1.96 standard deviations[SDs])
20
[Figure: normal curve with the horizontal axis marked at -3SDs, -1.96SDs, -1SDs, Mean, +1SDs, +1.96SDs and +3SDs]
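The 95% figure for mean ±1.96 SDs can be checked directly from the normal distribution. The sketch below uses the error function from Python's `math` module; the function name `proportion_within` is made up for the example.

```python
import math


def proportion_within(k):
    """Proportion of a normal distribution lying within k SDs of the mean."""
    return math.erf(k / math.sqrt(2))


for k in (1, 1.96, 3):
    print(f"within ±{k} SDs: {proportion_within(k):.4f}")
```

Roughly 68% of cases fall within ±1 SD, 95% within ±1.96 SDs and 99.7% within ±3 SDs.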

GRAPHING CONTINUOUS DATA
21
This is a histogram
If you request a histogram of continuous data, SPSS creates arbitrary pots of scores(!)
Don’t rely on this fitted “normal curve” to establish the “normality” of a continuous measure
A LITTLE MORE ON DESCRIBING CONTINUOUS
DATA
 Two more descriptive statistics are “Skewness” and “Kurtosis”:
 A rule-of-thumb for assessing whether a continuous measure is
“normally distributed”:
 Divide each of the above statistics by its “standard error”
 (a good statistics software will calculate all these values for you)
 Scores outside the range -2 to +2 suggest you have non-normality
 Quotable source: http://web.ipac.caltech.edu/staff/fmasci/home/statistics_refs/SkewStatSignif.pdf
22
[Figure: example distributions labelled zero, negative (-ve) and positive (+ve)]
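The rule-of-thumb above can be sketched as follows. This assumes the bias-adjusted sample formulas for skewness, kurtosis and their standard errors that statistics packages commonly report; check your own software's documentation, as this is an assumption rather than a statement of what SPSS does. The function name and the data are hypothetical, and the calculation needs at least four scores that are not all identical.

```python
import math


def skew_kurtosis_z(values):
    """Return (skewness / its SE, kurtosis / its SE) for a list of scores."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))
    z = [(x - mean) / sd for x in values]

    # Bias-adjusted sample skewness and excess kurtosis.
    skew = n / ((n - 1) * (n - 2)) * sum(v ** 3 for v in z)
    kurt = (n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * sum(v ** 4 for v in z)
            - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3)))

    # Standard errors depend only on the sample size.
    se_skew = math.sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))
    se_kurt = 2 * se_skew * math.sqrt((n * n - 1) / ((n - 3) * (n + 5)))
    return skew / se_skew, kurt / se_kurt


symmetric = [1, 2, 3, 4, 5, 6, 7, 8, 9]
right_skewed = [1] * 20 + [2] * 5 + [50]

for name, data in (("symmetric", symmetric), ("right-skewed", right_skewed)):
    z_skew, z_kurt = skew_kurtosis_z(data)
    flag = "suggests non-normality" if max(abs(z_skew), abs(z_kurt)) > 2 else "within -2 to +2"
    print(f"{name}: skew/SE={z_skew:.2f}, kurtosis/SE={z_kurt:.2f} ({flag})")
```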

TYPES
 “Bivariate” and “Multivariate”
 In other words: “Two variables” and “Multiple variables”
 “Tests”
 ...Of the difference between groups or time-points
 ...Of the association between two or more measures
 “Models”
 Miniature representations of reality
 supposedly(!)
 lots of ways in which this can be determined
24

HYPOTHESIS TESTING
 We state a hypothesis (H1) about how we believe two or more
measures should be related to one another and then try to
disprove its opposite – its “null hypothesis” (H0)
 Because we are trying to disprove H0, the full name for H1 is the
“alternative hypothesis”
 We gather a “sample” of data to do this, but then
infer/generalise conclusions back to the real-world “population”
from which we believe our sample came
 We estimate the accuracy of these generalisations back to the
real-world with probabilities
 We ask how probable data like ours would be if the null hypothesis
were true in the real-world, and so we look for low probabilities
 We usually want our inferential statistics to reject our null-hypotheses with
95% confidence (so we look for probabilities <5%)
 Moving from percentage to proportion: we look for p<0.05
25
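To make the logic of p<0.05 concrete, here is a minimal sketch of one simple test of difference: an exact permutation test. This is a swapped-in technique chosen because it needs no statistical tables; it is not the t-test that SPSS would normally run. The null hypothesis is that group labels do not matter, so the p-value is the proportion of all possible relabellings that give a difference in means at least as large as the one observed. The function name and the data are hypothetical.

```python
from itertools import combinations


def permutation_p_value(group_a, group_b):
    """Two-sided exact permutation test for a difference in group means."""
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    n_b = len(pooled) - n_a
    total = sum(pooled)
    observed = abs(sum(group_a) / n_a - sum(group_b) / n_b)

    extreme = 0
    count = 0
    # Try every way of assigning n_a of the pooled scores to "group A".
    for indices in combinations(range(len(pooled)), n_a):
        sum_a = sum(pooled[i] for i in indices)
        difference = abs(sum_a / n_a - (total - sum_a) / n_b)
        if difference >= observed - 1e-12:  # tolerance for float rounding
            extreme += 1
        count += 1
    return extreme / count


control = [12, 14, 15, 16, 18]       # hypothetical scores
intervention = [20, 21, 23, 24, 26]  # hypothetical scores

p = permutation_p_value(control, intervention)
print(f"p = {p:.4f}", "-> reject H0" if p < 0.05 else "-> do not reject H0")
```

Exhaustive relabelling is only practical for small samples; with larger samples one would draw random relabellings or use a conventional test.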

EFFECT SIZE
 Though we look for probabilities <5% (“p<0.05”)
 The likelihood that we find one is also affected by how
many people we consider:
 More people considered = more chance of p<0.05
 This means that p<0.05 is not a reliable enough
measure of a statistical (probabilistic) effect
 We also need a measure that is not affected by the number of
people we consider
 Such quantities are termed “Effect Sizes”
 As they estimate the “size of a statistical effect”
26
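The point that more people means more chance of p<0.05 can be illustrated by holding the effect size fixed and changing only the sample size. The sketch below assumes two equal-sized groups whose means differ by d standard deviations, and uses a normal approximation for the p-value (an assumption made to keep the example short; a real analysis would use a t-test). The function name is made up for the example.

```python
import math


def approx_p_value(d, n_per_group):
    """Approximate two-sided p-value for a standardised mean difference d."""
    z = d * math.sqrt(n_per_group / 2)         # test statistic grows with n
    return math.erfc(abs(z) / math.sqrt(2))    # two-sided normal tail area


d = 0.2  # the same 'small' effect every time
for n in (20, 100, 500):
    print(f"n per group = {n:3d}: effect size d = {d}, p = {approx_p_value(d, n):.4f}")
```

The effect size stays at 0.2 throughout, while p drops below 0.05 only once the groups are large.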

INFERENTIAL STATISTICS - TESTS
27

COMMON EFFECT SIZES FOR STATISTICAL TESTS
28
Test                    Statistic  Effect Size                      ‘Small’  ‘Medium’  ‘Large’
                                                                    effect   effect    effect
Chi-square              χ²         φ² = χ² / (N × (k − 1))          .01      .09       .25
                                   (k = smaller of number of rows
                                   or number of columns)
Pearson’s correlation   r          r                                ±.1      ±.3       ±.5
t-test (related or      t          d = 2t / √(degrees of freedom)   .2       .5        .8
unrelated)
One-way ANOVA           F          η² = SSeffect / SStotal          .01      .06       .14
 Good news: There are lots of online calculators that
will calculate these for you! Just try searching for
“effect size calculator”
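The effect-size formulas in the table above are simple enough to compute by hand. The sketch below reads them as: chi-square divided by N × (k − 1); d = 2t / √(degrees of freedom); and SSeffect / SStotal. The function names and input numbers are hypothetical, and the d formula is the usual approximation for two independent groups of similar size.

```python
import math


def chi_square_effect(chi2, n, rows, cols):
    """Chi-square effect size: chi2 / (N * (k - 1)), k = smaller of rows/cols."""
    k = min(rows, cols)
    return chi2 / (n * (k - 1))


def d_from_t(t, df):
    """Effect size d from a t statistic and its degrees of freedom."""
    return 2 * t / math.sqrt(df)


def eta_squared(ss_effect, ss_total):
    """One-way ANOVA effect size: proportion of total variation explained."""
    return ss_effect / ss_total


print(chi_square_effect(chi2=9.0, n=100, rows=2, cols=3))  # 0.09
print(d_from_t(t=2.5, df=25))                              # 1.0
print(eta_squared(ss_effect=14.0, ss_total=100.0))         # 0.14
```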

PARAMETRIC DATA
 A property of Continuous Data
 It strongly dictates which “Inferential Test” to carry out
 Three assumptions:
1. The continuous measure is “Normally Distributed”
 We can test this (see slide 22!)
2. When comparing groups of scores to one another, all groups
should have the same “standard deviation”
 We can test this (good software packages will do it automatically)
3. Each score was gathered independently from the others
 We can’t test this; it concerns how we gathered our data
29

INFERENTIAL STATISTICS - MODELS
Reality – The Underlying Population
Drawn Sample – The data available to us
(With unavoidable “error”)
Statistical Model – Our version of Reality created with our Sample data
(With unavoidable “residual” aspects left unaccounted for)
30

STATISTICAL REGRESSION
 The most common “Model” in “Inferential Statistics”
 At its simplest, it “models” how much one measure (“y”)
is driven by another (“x”):
 We say that we “regress” “y on x”
 y=mx+C
 y=b1x+b0 [+e] (the same line: b1 is the slope m, b0 is the intercept C, and e is the “error”)
31
[Figure: straight line plotted on axes x and y, with slope m and intercept C]
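Regressing y on x can be sketched with ordinary least squares, the usual way of fitting the straight line. The data and the function name `fit_line` are hypothetical; the residuals returned are the part of each y score that the line leaves unaccounted for.

```python
def fit_line(x, y):
    """Least-squares fit of y = b1*x + b0; returns (b0, b1, residuals)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b1 = sxy / sxx              # slope (m)
    b0 = mean_y - b1 * mean_x   # intercept (C)
    residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    return b0, b1, residuals


x = [1, 2, 3, 4, 5]              # hypothetical predictor scores
y = [2.9, 5.2, 6.8, 9.1, 11.0]   # hypothetical outcome scores

b0, b1, residuals = fit_line(x, y)
print(f"y = {b1:.2f}x + {b0:.2f}")
print("residuals:", [round(r, 2) for r in residuals])
```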

4. THINGS THAT YOU CAN EASILY DO
IN MICROSOFT EXCEL
32

THINGS YOU CAN EASILY DO IN EXCEL
1. Simple aspects of Data Management
 E.g. Looking at your data & simple manipulation of data
2. Generate Descriptive Statistics
 Get Measures of Central Tendency
 ...& Dispersion
3. Carry out simple Inferential Statistics
 E.g. Correlations (at a push)
4. Create Tables and simple Figures
 Perhaps Excel’s most useful purpose. You can even copy and
paste SPSS output into Excel and so simplify it to a level
suitable for reporting in a report/dissertation/paper
 You really can’t rely on Excel for anything other than the
basics however...
 ...It will struggle to give you the necessary statistics for any
solely quantitative dissertation.
33
