Association and its different measures using SPSS

UNIVERSITY OF LUCKNOW
Presented By:
Ankur Dhangar
M.Sc. Biostatistics Sem -3
Roll No. 2210014145008

Association and its different measures
• Association refers to the relationship or dependency between two or more categorical
variables. It concerns whether the occurrence or distribution of values in one
categorical variable is related to the values in another categorical variable.
• When exploring association between categorical variables, there are two possibilities:
1. Independence: If two categorical variables are independent, the occurrence of one
variable's categories does not affect the distribution of the other variable's categories.
For example, there might not be any association between gender and favorite ice
cream flavor.
2. Association or Dependency: When there's an association between categorical
variables, the occurrence or distribution of values in one variable is related to the
values in another variable. For instance, there might be an association between
smoking habits (yes/no) and the incidence of a particular health condition
(present/absent).

Measures of association
Association can be assessed with several statistics and tests, such as:
For nominal variables
1. Chi-square test
2. Fisher's exact test
3. Phi coefficient and Cramer's V
4. Lambda
5. Uncertainty coefficient
For ordinal variables
1. Gamma
2. Somers' d
3. Kendall's tau-b
4. Kendall's tau-c
For nominal by interval
1. Eta
Some others
1. Kappa
2. Risk
3. McNemar, etc.

The chi-square test for independence, also called Pearson's chi-square test or the chi-
square test of association, is used to discover if there is a relationship between two
categorical variables.
Assumptions:
When you choose to analyse your data using a chi-square test for independence, you
need to make sure that the data you want to analyse "passes" two assumptions:
Assumption #1: Your two variables should be measured at an ordinal or nominal level
(i.e., categorical data).
Assumption #2: Your two variables should each consist of two or more categorical,
independent groups. Example variables that meet this criterion include
gender (2 groups: males and females) and ethnicity (e.g., 3 groups: Caucasian, African
American and Hispanic).

Example
Educators are always looking for novel ways in which to teach statistics to
undergraduates as part of a non-statistics degree course (e.g., psychology).
With current technology, it is possible to present how-to guides for statistical
programs online instead of in a book. However, different people learn in
different ways. An educator would like to know whether gender (male/female)
is associated with the preferred type of learning medium (online vs. books).
Therefore, we have two nominal variables: Gender (male/female) and
Preferred Learning Medium (online/books).
Setup in SPSS
In SPSS Statistics, we created two variables so that we could enter our data:
Gender and Preferred_Learning_Medium.

Procedure:
1. Click Analyze > Descriptive Statistics > Crosstabs... on the top menu.
2. You will be presented with the following Crosstabs dialogue box:

3. Transfer one of the variables into the Row(s): box and the other variable into
the Column(s): box. In our example, we will transfer the Gender variable into
the Row(s): box and Preferred_Learning_Medium into the Column(s): box.
4. Click on the Statistics button. You will be presented with
the following Crosstabs: Statistics dialogue box:

5. Select the Chi-square and Phi and Cramer's V options, as shown below:
6. Click on the Continue button.
7. Click on the Cells button. You will be presented with the
following Crosstabs: Cell Display dialogue box:

8. Select Observed from the –Counts– area, and Row, Column and Total from
the –Percentages– area.
9. Click on the Continue button.
10. Click on the Format button.
Note: This next option is only really useful if you have more than two categories in one of
your variables, but we show it here in case you do. If you don't, you can skip to
STEP 12.
11. You will be presented with the Crosstabs: Table Format dialogue box.
This option allows you to change the order of the values to either ascending
or descending.
12. Once you have made your choice, click on the Continue button.
13. Click on the OK button to generate your output.
Output:
You will be presented with some tables in the Output Viewer under the title
"Crosstabs". The tables of note are presented below:
The Crosstabulation Table (Gender*Preferred Learning Medium
Crosstabulation)

This table allows us to understand that both males and females prefer to learn
using online materials versus books.
The Chi-Square Tests Table
When reading this table we are interested in the results of the "Pearson
Chi-Square" row. We can see here that χ²(1) = 0.487, p = .485. This tells us that
there is no statistically significant association between Gender and Preferred
Learning Medium; that is, the preference for online learning versus books does
not differ significantly between males and females.
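As a rough cross-check on the reported values, the p-value of a chi-square statistic with 1 degree of freedom can be recovered from the statistic alone. The sketch below uses Python rather than SPSS; the function name is hypothetical, and the identity it relies on holds only for df = 1.

```python
import math

def chi2_p_value_df1(chi2_stat):
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom.

    With df = 1 the statistic is the square of a standard normal variable,
    so P(X > x) = erfc(sqrt(x / 2)).
    """
    return math.erfc(math.sqrt(chi2_stat / 2.0))

# Statistic reported in the "Pearson Chi-Square" row of the example
p = chi2_p_value_df1(0.487)
print(round(p, 3))
```

The result rounds to .485, which agrees with the p-value quoted from the Chi-Square Tests table.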

The Symmetric Measures Table
Phi and Cramer's V are both measures of the strength of association. We can see
that the strength of association between the variables is very weak.
Fisher’s Exact Test
Fisher’s Exact Test is used to determine whether or not there is a significant
association between two categorical variables.
It is typically used as an alternative to the Chi-Square Test of
Independence when one or more of the expected cell counts in a 2×2 table is less than 5.

Example:
Suppose we want to know whether or not gender is associated with political
party preference at a particular college. To explore this, we randomly poll
25 students on campus. The number of students who are Democrats or
Republicans, based on gender, is shown in the table below:
          Democrat   Republican
Female    8          4
Male      4          9
To determine if there is a statistically significant association between gender
and political party preference, we can use the following steps to perform
Fisher’s Exact Test in SPSS:
Step 1: Enter the data.
First, enter the data as shown below:
Each row shows an individual’s ID, their political party preference, and their
gender.

Step 2: Perform Fisher’s Exact Test.
Click the Analyze tab, then Descriptive Statistics, then Crosstabs:

Drag the variable Gender into the box labelled Rows and the
variable Party into the box labelled Columns. Then click the button
labelled Statistics and make sure that the box next to Chi-square is checked.
Then click Continue.
Next, click the button labelled Exact and make sure the box next to Exact is
checked. Then click Continue.

Lastly, click OK to perform Fisher’s Exact Test.
Interpret the results
Once you click OK, the results of Fisher’s Exact Test will be displayed:

The first table displays the number of missing cases in the dataset. We can see
that there are 0 missing cases in this example.
The second table displays a crosstab of the total number of individuals by
gender and political party preference.
The third table shows the results of Fisher’s Exact Test. We can see the
following two p-values for the test:
•Two-sided p-value: .115
•One-sided p-value: .081
The null hypothesis for Fisher’s Exact Test is that the two variables are
independent. In this case, our null hypothesis is that gender and political party
preference are independent, which is a two-sided test so we would use the
two-sided p-value of 0.115.
Since this p-value is not less than 0.05, we do not reject the null hypothesis.
Thus, we don’t have sufficient evidence to say that there is a significant
association between gender and political party preference.
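The two p-values can be reproduced outside SPSS directly from the 2×2 table. The following is a minimal Python sketch, not SPSS's own routine: the function name is hypothetical, and the two-sided value uses one common convention (summing every table that is no more probable than the observed one).

```python
from math import comb

def fisher_exact_2x2(a, b, c, d):
    """Return (two_sided_p, one_sided_p) for the 2x2 table [[a, b], [c, d]].

    All margins are treated as fixed. The one-sided value is the probability
    of a top-left count of at least `a`; the two-sided value sums every table
    whose probability does not exceed that of the observed table.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    total = comb(row1 + row2, col1)  # number of equally likely arrangements

    def weight(x):
        # Number of arrangements with x in the top-left cell
        return comb(row1, x) * comb(row2, col1 - x)

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    observed = weight(a)
    one_sided = sum(weight(x) for x in range(a, highest + 1)) / total
    two_sided = sum(
        weight(x) for x in range(lowest, highest + 1) if weight(x) <= observed
    ) / total
    return two_sided, one_sided

# Female: 8 Democrat, 4 Republican; Male: 4 Democrat, 9 Republican
two_sided, one_sided = fisher_exact_2x2(8, 4, 4, 9)
print(round(two_sided, 3), round(one_sided, 3))
```

For the table in this example the sketch gives .115 (two-sided) and .081 (one-sided), matching the SPSS output described above.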

Phi coefficient and Cramer's V
Phi Coefficient:
•Use Case: Measures association between two dichotomous variables in a
2x2 table.
•Range: Varies between -1 and 1.
•Interpretation:
• 1 indicates a perfect association.
• 0 indicates no association.
• -1 indicates a perfect negative association.
•Specificity: Applicable only to 2x2 contingency tables.
•Calculation: Derived when analyzing two dichotomous variables using
Crosstabs in SPSS with the "Phi and Cramer's V" option selected.

Cramer's V:
•Use Case: Measures association between categorical variables in
contingency tables larger than 2x2.
•Range: Varies between 0 and 1.
•Interpretation:
• 1 indicates a perfect association.
• 0 indicates no association.
•Applicability: Suitable for larger contingency tables beyond 2x2, providing
a measure of association strength.
•Calculation: Calculated by SPSS Crosstabs when the "Phi and Cramer's V" option is
selected. For a 2x2 table its value equals the absolute value of Phi.
Both coefficients help assess the strength of association between categorical
variables, with Phi specific to 2x2 tables and Cramer's V extending to larger
contingency tables. They aid in understanding relationships within datasets or
research studies.
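To make the two coefficients concrete, the sketch below computes the Pearson chi-square, Phi and Cramer's V in Python for the 2×2 gender by party table used in the Fisher's Exact Test example. The function names are hypothetical and the code illustrates the textbook formulas, not SPSS's internal implementation.

```python
import math

def chi_square_and_cramers_v(table):
    """Pearson chi-square and Cramer's V for an r x c table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    k = min(len(row_totals), len(col_totals))
    v = math.sqrt(chi2 / (n * (k - 1)))
    return chi2, v

def phi_2x2(table):
    """Signed phi coefficient for a 2x2 table [[a, b], [c, d]]."""
    (a, b), (c, d) = table
    return (a * d - b * c) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))

party = [[8, 4],   # Female: Democrat, Republican
         [4, 9]]   # Male:   Democrat, Republican
chi2, v = chi_square_and_cramers_v(party)
phi = phi_2x2(party)
print(round(chi2, 3), round(phi, 3), round(v, 3))
```

Because the table is 2×2, Cramer's V coincides with the absolute value of Phi (about .359 here); the two only differ in tables with more rows or columns, where Phi is no longer bounded by 1.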

Lambda Coefficient: A measure of association that reflects the proportional
reduction in error when values of the independent variable are used to
predict values of the dependent variable. A value of 1 means that the
independent variable perfectly predicts the dependent variable. A value of 0
means that the independent variable is no help in predicting the dependent
variable.
Uncertainty coefficient: A measure of association that indicates the
proportional reduction in error when values of one variable are used to
predict values of the other variable. For example, a value of 0.83 indicates
that knowledge of one variable reduces error in predicting values of the
other variable by 83%. The program calculates both symmetric and
asymmetric versions of the uncertainty coefficient
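Both measures express a proportional reduction in error, which a short computation can illustrate. The Python sketch below treats the column variable as dependent and reuses the gender by party counts from the Fisher's Exact Test example; the function names are hypothetical and only the asymmetric versions are shown.

```python
import math

def goodman_kruskal_lambda(table):
    """Lambda with the column variable dependent (rows predict columns)."""
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(col_totals)
    # Errors when always guessing the overall modal column
    errors_without = n - max(col_totals)
    # Errors when guessing the modal column within each row
    errors_with = sum(sum(row) - max(row) for row in table)
    return (errors_without - errors_with) / errors_without

def entropy(counts):
    """Shannon entropy (natural log) of a list of counts."""
    total = sum(counts)
    return -sum(c / total * math.log(c / total) for c in counts if c > 0)

def uncertainty_coefficient(table):
    """Asymmetric uncertainty coefficient with the column variable dependent."""
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(col_totals)
    h_cols = entropy(col_totals)
    h_cols_given_rows = sum(sum(row) / n * entropy(row) for row in table)
    return (h_cols - h_cols_given_rows) / h_cols

party = [[8, 4],   # Female: Democrat, Republican
         [4, 9]]   # Male:   Democrat, Republican
lam = goodman_kruskal_lambda(party)
u = uncertainty_coefficient(party)
print(round(lam, 3), round(u, 3))
```

Lambda counts prediction errors, whereas the uncertainty coefficient measures the reduction in entropy, so the two values generally differ for the same table.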

For Ordinal variables: For tables in which both rows and columns contain
ordered values
Goodman and Kruskal's gamma: Goodman and Kruskal's gamma is a nonparametric
measure of the strength and direction of association between two ordinal
variables. It can also be used when both ordinal variables have just two
categories. For example, you could use Goodman and Kruskal's gamma to understand
whether there is an association between exam performance (i.e., with two
categories: "pass" or "fail") and test anxiety level (i.e., with two categories:
"high" or "low").
Assumptions:
1.Your two variables should be measured on an ordinal scale. Examples
of ordinal variables include Likert items (e.g., a 7-point scale from "strongly
agree" through to "strongly disagree").
2. There needs to be a monotonic relationship between the two variables.

Example
A researcher at the Department of Health wants to determine if there is an association
between the amount of physical activity people undertake and obesity levels. They recruited
250 people to take part in a study to find out. These participants were randomly sampled
from the population.
Participants were asked to complete a questionnaire explaining their level of physical
activity. Based on the results from this questionnaire, participants were categorized into one
of five physical activity levels: "sedentary", "low", "moderate", "high" and "very high".
Participants were also assessed by a nurse practitioner to determine their body fat
classification. Based on this assessment, participants were categorized into one of four
levels: "morbidly obese", "obese", "normal" and "underweight". These ordered responses
reflected the categories of our two variables: physical_activity_level (i.e., with five
categories: "sedentary", "low", "moderate", "high" and "very high")
and body_fat_classification (i.e., with four categories: "morbidly obese", "obese", "normal"
and "underweight").

Data setup
For a Goodman and Kruskal's gamma, you will have either two or three
variables:
(1) The ordinal variable, physical_activity_level, which has five ordered
categories: "sedentary", "low", "moderate", "high" and "very high";
(2) The ordinal variable, body_fat_classification, which has four ordered
categories: "underweight", "normal", "obese" and "morbidly obese".
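(3) The frequencies (i.e., total counts) for the two ordinal variables above (i.e.,
the number of participants for each cell combination). This is captured in the
variable, freq. This third variable is needed only when the data are entered as
cell counts rather than one row per participant; in that case, weight the cases
by freq (Data > Weight Cases...) before running Crosstabs.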

Procedure:
Click Analyze > Descriptive Statistics > Crosstabs... on the top menu.
You will be presented with the Crosstabs dialogue box.
Transfer the variable, physical_activity_level, into the Row(s): box, and the
variable, body_fat_classification, into the Column(s): box, by
dragging-and-dropping or by clicking the relevant buttons.

Click on the Statistics button. You will be presented with the
Crosstabs: Statistics dialogue box.
Select the Gamma tick box in the –Ordinal– area.
Click on the Continue button. You will be returned to the Crosstabs dialogue
box.

Click on the OK button. This will generate the output for Goodman and
Kruskal's gamma.
Interpreting the results for Goodman and Kruskal's gamma
The Case Processing Summary table provides a useful check of your data
to determine the valid sample size, N, and whether you have any missing
data. In our example, there were 250 participants with no missing data.

Finally, you should consult the Symmetric Measures table, which provides
the result of Goodman and Kruskal's gamma, as shown below:
Goodman and Kruskal's gamma is presented in the "Gamma" row of the "Value" column
and is -.509 in this example. This indicates that there is a strong, negative
association between the level of physical activity and body fat classification. In other
words, higher levels of physical activity (e.g., a "very high" level of physical activity) are
associated with a lower body fat classification (e.g., an "underweight" body fat
classification); and vice versa, with lower levels of physical activity (e.g., a "sedentary"
level of physical activity) being associated with a higher body fat classification (e.g., a
"morbidly obese" body fat classification).
Furthermore, the "Approx. Sig." column shows that the statistical significance
value (i.e., p-value) is less than .001. Therefore, the association between
physical activity level and body fat classification is statistically
significant.
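Gamma is built from counts of concordant and discordant pairs: gamma = (C − D) / (C + D). The Python sketch below shows that calculation on a small hypothetical 3×3 table; the counts are invented for illustration and are not the study data, and the function names are assumptions.

```python
def concordant_discordant(table):
    """Count concordant and discordant pairs in an ordered r x c table.

    Rows and columns must both be listed in increasing category order.
    """
    concordant = discordant = 0
    for i, row in enumerate(table):
        for j, count in enumerate(row):
            # Compare this cell with every cell in a later (higher) row
            for later_row in table[i + 1:]:
                for k, other in enumerate(later_row):
                    if k > j:
                        concordant += count * other   # both variables higher
                    elif k < j:
                        discordant += count * other   # one higher, one lower
    return concordant, discordant

def goodman_kruskal_gamma(table):
    c, d = concordant_discordant(table)
    return (c - d) / (c + d)

# Hypothetical counts: rows and columns ordered from low to high
toy = [[4, 2, 0],
       [1, 3, 1],
       [0, 2, 4]]
print(concordant_discordant(toy), round(goodman_kruskal_gamma(toy), 3))
```

Tied pairs (same row or same column) do not enter the formula at all, which is why gamma tends to be larger in magnitude than tau-b or Somers' d for the same table.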

Somers’ d :
Somers' delta (or Somers' d, for short), is a nonparametric measure of the
strength and direction of association that exists between an ordinal
dependent variable and an ordinal independent variable.
For Example: We can use Somers' d to understand whether there is an
association between customer satisfaction and hotel room cleanliness (i.e.,
the ordinal dependent variable is "customer satisfaction", measured on a five
point scale from "very satisfied" to "very dissatisfied", and the ordinal
independent variable is "hotel room cleanliness", measured on a three point
scale from "above average" to "below average").
Interpretation:
When running the Somers' d procedure, start with the Case Processing
Summary table:

The Case Processing Summary table provides a useful check of your data to determine the
valid sample size, N, and whether you have any missing data. In our example, there were 189
participants with no missing data.
Next, you should get a 'feel' for your data using the table showing the crosstabulation of the
data (this will be labelled based on your two variables; in our case,
the hotel_room_cleanliness * customer_satisfaction Crosstabulation table), as shown
below:
Finally, you should consult the Directional Measures table, which provides
the result of Somers' d, as shown below:

Somers' d is presented in the "customer_satisfaction Dependent" row of the
"Value" column and is .603 in this example. This indicates that increased hotel
room cleanliness is associated with increased customer satisfaction.
Furthermore, the "Approx. Sig." column shows that the statistical significance
value (i.e., p-value) is .000, which means p < .0005. Therefore, the association
between the ordinal dependent variable, "customer satisfaction", and ordinal
independent variable, "hotel room cleanliness", is statistically significant.
In our example, you might present the results as follows:
Somers' d was run to determine the association between customer satisfaction
and hotel room cleanliness amongst 189 participants. There was a strong,
positive association between customer satisfaction and hotel room cleanliness,
which was statistically significant (d = .603, p < .0005).

Kendall's Tau-b:
Kendall's tau-b assesses the strength and direction of association between two ordinal
variables. Unlike gamma, it takes ties in the data into account.
1. Open Data in SPSS: Load your dataset.
2. Access Cross-tabulation Analysis: Go to the menu bar.
   1. Click on "Analyze."
   2. Select "Descriptive Statistics."
   3. Choose "Crosstabs."
3. Select Variables: In the "Crosstabs" dialog box:
   1. Choose the ordinal variables you want to analyze.
   2. Place one variable in the "Rows" box and the other in the "Columns" box.
4. Run the Analysis: Click on the "Statistics" button in the Crosstabs dialog box:
   1. Check the box for "Kendall's tau-b" in the "Ordinal" area.
   2. Click "Continue" to return to the Crosstabs dialog box.
   3. Click on the "OK" button to execute the analysis.

Kendall's tau-c: A nonparametric measure of association for ordinal variables
that ignores ties. The sign of the coefficient indicates the direction of the
relationship, and its absolute value indicates the strength, with larger absolute
values indicating stronger relationships. Possible values range from -1 to 1, but
a value of -1 or +1 can be obtained only from square tables.
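The ordinal measures differ mainly in how they treat tied pairs in the denominator. The Python sketch below computes Kendall's tau-b, Kendall's tau-c and Somers' d (column variable dependent) from a contingency table using the standard textbook formulas. The 3×3 counts are hypothetical and the function names are assumptions; this is not a reproduction of SPSS's code.

```python
import math

def pair_counts(table):
    """Concordant and discordant pair counts for an ordered r x c table."""
    concordant = discordant = 0
    for i, row in enumerate(table):
        for j, count in enumerate(row):
            for later_row in table[i + 1:]:
                for k, other in enumerate(later_row):
                    if k > j:
                        concordant += count * other
                    elif k < j:
                        discordant += count * other
    return concordant, discordant

def ordinal_measures(table):
    c, d = pair_counts(table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    all_pairs = n * (n - 1) // 2
    tied_rows = sum(t * (t - 1) // 2 for t in row_totals)  # pairs tied on the row variable
    tied_cols = sum(t * (t - 1) // 2 for t in col_totals)  # pairs tied on the column variable
    m = min(len(row_totals), len(col_totals))
    return {
        # tau-b: ties on each variable are removed from the denominator
        "tau_b": (c - d) / math.sqrt((all_pairs - tied_rows) * (all_pairs - tied_cols)),
        # tau-c: denominator depends only on n and the smaller table dimension
        "tau_c": 2 * m * (c - d) / (n ** 2 * (m - 1)),
        # Somers' d: only pairs untied on the independent (row) variable count
        "somers_d": (c - d) / (all_pairs - tied_rows),
    }

# Hypothetical counts: rows and columns ordered from low to high
toy = [[4, 2, 0],
       [1, 3, 1],
       [0, 2, 4]]
for name, value in ordinal_measures(toy).items():
    print(name, round(value, 3))
```

Swapping the roles of rows and columns changes Somers' d but leaves tau-b and tau-c unchanged, which reflects the fact that only Somers' d is a directional measure.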

We are going to perform a cross-tabulation of the variables “Prayer
Frequency” and “Fundamentalist”. To test for the existence of a relationship
between those two variables, we use the SPSS Crosstabs output.
THE ASSUMPTIONS
We use Kendall’s tau-b, Kendall’s tau-c and Gamma to check for a relationship, which
is appropriate because we are analyzing two ordinal variables.
The hypotheses:
We want to test the following null and alternative hypotheses:
Ho: There is no relationship between ''Prayer Frequency'' and ''Fundamentalist''
Ha: There is a relationship between ''Prayer Frequency'' and ''Fundamentalist''
In order to test these hypotheses we use SPSS crosstabs analysis and the Kendall’s
tau-b, Kendall’s tau-c and Gamma statistics.

Level of Significance:
We choose the level of significance alpha = 0.05. The level of significance is
the probability of making a Type I error, that is, the probability of rejecting the null
hypothesis when it is actually true.
Results:
The significance of the Kendall’s tau-b, Kendall’s tau-c and Gamma statistics is
reported as .000 (i.e., p < .0005) for all of them. Since the p-values are all less
than 0.05, our previously chosen level of significance, we have enough evidence to
reject the null hypothesis.
Conclusions:
We reject the null hypothesis at the 0.05 level of significance and conclude that
there is a relationship between the variables. The values of Kendall’s tau-b,
Kendall’s tau-c and Gamma are small (0.262, 0.282 and 0.360, respectively), which
indicates a rather weak relationship.

Nominal by Interval: When one variable is categorical and the other is
quantitative, select Eta. The categorical variable must be coded numerically.
Eta: A measure of association that ranges from 0 to 1, with 0 indicating no
association between the row and column variables and values close to 1
indicating a high degree of association. Eta is appropriate for a dependent
variable measured on an interval scale (for example, income) and an
independent variable with a limited number of categories (for example,
gender). Two eta values are computed: one treats the row variable as the
interval variable, and the other treats the column variable as the interval
variable.
Interpret Results:
Eta measures the strength of association between a categorical (nominal) variable
and an interval-scale variable. Its square, eta squared, is the proportion of variance
in the interval variable that is accounted for by differences between the categories.
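Eta can be computed from the between-group and total sums of squares of the interval variable. The Python sketch below illustrates this with two small hypothetical groups; the values and the function name are invented for illustration.

```python
import math

def eta(groups):
    """Eta for an interval variable split by a categorical variable.

    `groups` maps each category code to the list of interval-scale values
    observed in that category. Eta is sqrt(SS_between / SS_total).
    """
    values = [x for group in groups.values() for x in group]
    grand_mean = sum(values) / len(values)
    ss_total = sum((x - grand_mean) ** 2 for x in values)
    ss_between = sum(
        len(group) * (sum(group) / len(group) - grand_mean) ** 2
        for group in groups.values()
    )
    return math.sqrt(ss_between / ss_total)

# Hypothetical data: category codes 1 and 2 with three observations each
scores = {1: [1, 2, 3], 2: [4, 5, 6]}
result = eta(scores)
print(round(result, 3), round(result ** 2, 3))
```

A value near 0 means the group means are nearly identical, while a value near 1 means almost all of the variation in the interval variable lies between the groups rather than within them.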

Cohen's kappa
Cohen's kappa is a statistical measure that assesses the level of agreement
between two raters or observers when dealing with categorical or nominal
data. Unlike simple percentage agreement, it corrects for the agreement that
would be expected by chance alone.
For a Cohen's kappa, you will have two variables. In this example, these are:
(1) the scores for "Rater 1", Officer1, which reflect Police Officer 1's
decision to rate a person's behaviour as being either "normal" or
"suspicious"; and (2) the scores for "Rater 2", Officer2, which reflect Police
Officer 2's decision to rate a person's behaviour as being either "normal" or
"suspicious".

Assumptions:
1. The response (e.g., judgement) that is made by your two raters is measured on
a categorical scale (i.e., either an ordinal or nominal variable) and the categories need to
be mutually exclusive.
2. The response data are paired observations of the same phenomenon, meaning that both
raters assess the same observations.
3. Each response variable must have the same number of categories and
the crosstabulation must be symmetric (i.e., "square") (e.g., a 2x2 crosstabulation, 3x3
crosstabulation, 4x4 crosstabulation, etc.). For example, a 2x2 crosstabulation means that the
responses of both raters are measured on a dichotomous scale; that is, a nominal scale with
two categories (e.g., no scarring vs scarring).
4. The two raters are independent (i.e., one rater's judgement does not affect the other rater's
judgement).
5. The same two raters are used to judge all observations. This has been referred to as
having fixed or unique raters. If different raters were used for each observation (e.g.,
patient), Cohen's kappa is not the appropriate test to use.

Click Analyze > Descriptive Statistics > Crosstabs... on the main menu.
You will be presented with the Crosstabs dialogue box.

You need to transfer one variable (e.g., Officer1) into the Row(s): box, and the
second variable (e.g., Officer2) into the Column(s): box.
Click on the Statistics button. You will be presented with the Crosstabs:
Statistics dialogue box.
Select the Kappa checkbox.

• Click on the Continue button and you will be returned to
the Crosstabs dialogue box.
• Click on the Cells button. You will be presented with
the Crosstabs: Cell Display dialogue box.
• Keep the Observed checkbox selected.
• Click on the Continue button. You will be returned to
the Crosstabs dialogue box.
• Click on the OK button to generate the output for Cohen's kappa.
Output of Cohen's kappa:
SPSS Statistics generates two main tables of output for Cohen's kappa:
the Crosstabulation table and Symmetric Measures table

We can use the Crosstabulation table, amongst other things, to understand the degree to
which the two raters (i.e., both police officers) agreed and disagreed on their judgement of
suspicious behaviour. You can see from the table above that of the 100 people evaluated by
the police officers, 85 people displayed normal behaviour as agreed by both police officers.
In addition, both officers agreed that there were seven people who displayed suspicious
behaviour. Therefore, there were eight individuals (i.e., 6 + 2 = 8) for whom the two police
officers could not agree on their behaviour.
You can see that Cohen's kappa (κ) is .593. This is the proportion of
agreement over and above chance agreement. Cohen's kappa (κ) can range
from -1 to +1. A kappa (κ) of .593 represents a moderate strength of
agreement. Furthermore, since p < .001 (i.e., p is less than .001), our kappa
(κ) coefficient is statistically significantly different from zero.
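The kappa value can be reproduced from the four cell counts described above. In the Python sketch below the function name is hypothetical, and the placement of the two disagreement counts (6 and 2) in the off-diagonal cells is an assumption, since the crosstabulation itself is not reproduced here; kappa is the same whichever way round they are placed.

```python
def cohens_kappa(table):
    """Cohen's kappa for a square agreement table (rater 1 in rows, rater 2 in columns)."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    # Observed agreement: proportion of cases on the main diagonal
    observed = sum(table[i][i] for i in range(len(table))) / n
    # Chance agreement: product of the raters' marginal proportions, summed over categories
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / n ** 2
    return (observed - expected) / (1 - expected)

# Rows: Officer 1 (normal, suspicious); columns: Officer 2 (normal, suspicious)
ratings = [[85, 6],
           [2, 7]]
kappa = cohens_kappa(ratings)
print(round(kappa, 3))
```

Although the officers agree on 92 of the 100 people, most of that agreement would be expected by chance because "normal" is by far the most common rating, which is why kappa (.593) is much lower than the raw agreement of .92.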

McNemar test
The McNemar test assesses the association or difference between two related categorical
variables. This test is often used to analyze paired categorical data, especially in situations
where you're dealing with a binary outcome (e.g., yes/no, success/failure) measured on the
same subjects or entities at different points in time or under different conditions.
To perform a McNemar test for association in SPSS:
1. Data Preparation: Ensure that each row (case) represents one subject, with two dichotomous
variables holding that subject's paired responses (e.g., before and after). Cross-tabulating
these two variables gives the 2x2 table on which the test is based.
2. Open SPSS: Start by opening your dataset in SPSS.
3. Conduct the Test:
   1. Go to "Analyze" > "Nonparametric Tests" > "Legacy Dialogs" > "2 Related
   Samples..."
   2. In the dialog box that appears, move your paired categorical variables to the "Test
   Pairs" box.
   3. In the "Test Type" area, select McNemar (and deselect Wilcoxon if it is ticked).
   4. Click "OK" to run the analysis.

Example
A researcher wanted to investigate the impact of an intervention on smoking. In this
hypothetical study, 50 participants were recruited to take part, consisting of 25 smokers and
25 non-smokers. All participants watched an emotive video showing the impact that deaths
from smoking-related cancers had on families. Two weeks after this video intervention, the
same participants were asked whether they remained smokers or non-smokers.
Therefore, participants were categorized as being either smokers or non-smokers before the
intervention and then re-assessed as either smokers or non-smokers after the intervention.
Due to the same participants being measured twice, we have paired-samples. We also have
a dependent variable that is dichotomous with two mutually exclusive categories (i.e.,
"smoker" and "non-smoker"). As a result, a McNemar's test is the appropriate choice to
analyze the data.
Output of the McNemar's test
Crosstabulation Table:

Consulting the bottom-left cell first, you can see that there were 16 participants that were
originally smokers, but following the intervention, they became non-smokers. In the sense
that the intervention was designed to reduce smoking, these participants could be
considered the intervention's successes. However, by consulting the top-right cell, you can
see that five non-smokers actually took up smoking following the intervention! Clearly,
this is not the effect you were looking for, and it is important that you note this in your
report. So, although overall there were more 'positive' changes than 'negative' changes, it
can be enlightening to know the different 'directions of travel' that the participants took.
Test Statistics Table:
Fifty participants were recruited to take part in an intervention
designed to warn about the dangers of smoking. An exact
McNemar's test determined that there was a statistically
significant difference in the proportion of non-smokers pre-
and post-intervention, p = .027.
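The exact McNemar p-value depends only on the two discordant cells: the 16 smokers who became non-smokers and the 5 non-smokers who became smokers. The following is a minimal Python sketch of the exact (binomial) version of the test; the function name is hypothetical.

```python
from math import comb

def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from the two discordant counts.

    Under the null hypothesis each discordant pair is equally likely to
    fall in either cell, so the smaller count follows Binomial(b + c, 0.5).
    """
    n = b + c
    k = min(b, c)
    lower_tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * lower_tail)

# 5 non-smokers took up smoking; 16 smokers quit
p = mcnemar_exact(5, 16)
print(round(p, 3))
```

The result rounds to .027, in line with the reported value. Participants whose status did not change contribute nothing to the test, so the same p-value would be obtained whatever the two concordant cell counts were.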
