Selection of appropriate data analysis technique

DESCRIPTION OF THE TOPIC
Choosing the right statistical method for data analysis is always a challenge, as it depends on a
host of factors.
Before we discuss the major determinants of choice of a method in detail, it is also important to
understand that one should have a Research/Data Analysis blueprint of the study one is
undertaking.
1. Research/Data Analysis Blueprint
Generally, the research starts with a broad research question that is often divided into more
measurable, narrower objectives (See Figure 1). Each objective is achieved by splitting the subject
matter into certain statistically testable hypotheses.
Course: Data Analysis for Social Science Teachers
Topic: Choosing the Right Statistical Method for Data Analysis

Figure 1: The Research Blueprint--Objective Hypotheses Mapping
There is no standard rule as to how many hypotheses a research objective can have. One research
objective might have one or two or more hypotheses. However, it is important that each objective
be split into one or more testable hypotheses.
In order that one is clear about how a hypothesis is tested, one must identify the variables
associated with each of the hypotheses (see Figure 2). There is no rule as to how many variables a
hypothesis will have. There could be a hypothesis with just one variable (such as test of population
mean to be equal to a number) or there could be two variables (like tests of hypothesis of
association or difference) or even more (like factor analysis/multiple regression).
Each of the variables is then identified as a Dependent or Independent variable given the nature of
the hypothesis being tested. Further, against each variable, its level of measurement is noted as
Nominal, Ordinal, Interval or Ratio. Often the nominal and ordinal levels are combined into
Categorical, whereas the Interval and Ratio levels are labeled as Numerical.

A categorical variable is also called a non-metric or non-parametric variable. A numerical
variable is also called a metric or parametric variable, and some authors even call it a continuous
variable.
Figure 2: The Research Blueprint—Objective-Hypothesis-Variable-Test Mapping
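The objective-hypothesis-variable mapping described above can be written down as a small data structure before any test is chosen. The following is only a minimal sketch: the class names, field names and the example objective are our own illustrative assumptions and do not come from the figures.

```python
from dataclasses import dataclass, field
from enum import Enum


class Level(Enum):
    """The four levels of measurement named in the text."""
    NOMINAL = "nominal"
    ORDINAL = "ordinal"
    INTERVAL = "interval"
    RATIO = "ratio"

    @property
    def broad_type(self) -> str:
        # Nominal and ordinal collapse to categorical; interval and ratio to numerical.
        if self in (Level.NOMINAL, Level.ORDINAL):
            return "categorical"
        return "numerical"


@dataclass
class Variable:
    name: str
    level: Level
    role: str  # "dependent" or "independent"


@dataclass
class Hypothesis:
    statement: str
    variables: list = field(default_factory=list)


@dataclass
class Objective:
    description: str
    hypotheses: list = field(default_factory=list)


# Hypothetical example: one objective, one hypothesis, two variables.
objective = Objective(
    description="Examine whether teaching method is related to exam performance",
    hypotheses=[
        Hypothesis(
            statement="Mean exam score differs between two teaching methods",
            variables=[
                Variable("exam_score", Level.INTERVAL, "dependent"),
                Variable("teaching_method", Level.NOMINAL, "independent"),
            ],
        )
    ],
)

for hypothesis in objective.hypotheses:
    for variable in hypothesis.variables:
        print(f"{variable.name}: {variable.role}, "
              f"{variable.level.value} ({variable.level.broad_type})")
```

Filling in such a structure for every hypothesis makes the later questions (how many variables, what level of measurement, which one is dependent) straightforward to answer.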
2. Major Determinants of Choice of a Statistical Method
The choice of a particular statistical method is generally determined by the following:
a) Number and Level of Measurement of Variables
b) Distribution of the variable
c) Dependence and Independence Structure
d) Nature of the Hypothesis
e) Sample Size
We shall now briefly discuss the above:

2.1. Level of Measurement of Variables
We know that there are four levels of measurement:
a) Nominal
b) Ordinal
c) Interval
d) Ratio
Often the nominal and ordinal levels are combined into Categorical, whereas the Interval and
Ratio levels are labeled as Numerical. A categorical variable is also called a non-metric or
non-parametric variable. A numerical variable is also called a metric or parametric variable, and
some authors even call it a continuous variable.
While choosing a particular test, we shall be asking the question:
What is the level of measurement of the data?
--Nominal/Ordinal/Interval/Ratio
Or simply Categorical or Numerical?
2.2. Distribution of Underlying Variables
Depending on the level of measurement, the data might follow a distribution such as the Normal,
Binomial or Poisson, or it might not follow any particular distribution. Variables measured on
nominal and ordinal scales are generally treated as having no such distribution, whereas numerical
variables might follow a normal or some other distribution. The tests that are used when categorical
variables are involved are called non-parametric or distribution-free tests. The tests that are used
with numerical variables are called parametric tests.
While choosing a particular test we shall be asking the question:

Is the data parametric (measured on a numerical scale) or non-parametric (measured on
a categorical scale)?
2.3. Nature of Hypothesis
Broadly a hypothesis can be categorized as:
a) Hypothesis of Association/Causation and
b) Hypothesis of Differences
The hypothesis of association/causation examines the nature and strength of the relationship
between variables. Correlation and regression are examples.
The hypothesis of difference examines whether two or more populations differ on a parameter such as
the mean. Using a hypothesis of difference, we generally test the equality of two or more
population means.
What is the nature of the hypothesis?
---Hypothesis of Association/Causation OR Hypothesis of Differences
2.4. No. of Variables in the Hypothesis
The number of variables associated with a hypothesis is also an important determinant of the
choice of a statistical technique.
Based on the number of variables, we sometimes even classify the statistical techniques as
Univariate (involving one variable) /Bi-variate (two variables)/ Multivariate (more than two)
techniques.

How many Variables are there in the hypothesis?
-- One or two or more than two
3. An approach for Choosing a Statistical Method
Several authors present different approaches to choose a statistical method. An approach generally
involves starting with one of the above determinants and drilling down with other determinants.
For instance, we might start with the question: What is the nature of the hypothesis? Then, ask the
question: How many variables are involved? And then ask: What is the level of measurement of
each of the variables? And so on. Alternatively, we might start with, say, the number of variables
in the hypothesis, then the nature of the hypothesis and so on.
We suggest starting with the number of variables. The following sections present
self-explanatory flow charts showing how to choose a test once you have started with the question:
How many variables are involved in the hypothesis? One, two, or more than two? Accordingly, the
sections are titled Statistical Methods for Univariate/Bi-variate/Multivariate Data.
3.1. Statistical Methods for Univariate Data
Figure 3 presents the flowchart of how a method can be chosen when the hypothesis involves just
one variable.

Figure 3: Statistical Methods for Univariate Data
We first ask what it is that we are trying to do. Are we trying to describe the data, or are we
trying to make an inference? Making an inference with univariate data generally involves testing
whether the population mean equals a particular value, for example whether µ = 3.
Let us look at the first wing: Descriptive statistics.
The kind of descriptive statistics we can use to describe the univariate data straight away depends
on the level of measurement of the variable.
● For nominal data, the only measure of central tendency is the mode. Further, we don't have any
measure of spread or variance when data is on a nominal scale.
● When data is on an ordinal scale, we have two choices of central tendency: mode and median. We
can use the interquartile range as a measure of dispersion.

● When data is measured in interval or ratio scale, we can use all the three measures of central
tendency, i.e. mean, median and mode. And we can also use several measures of dispersion
such as interquartile range, range, variance and standard deviation.
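The three cases above can be sketched with Python's standard statistics module. This is an illustration only: the function name describe, the level labels passed as strings, and the sample values are our own assumptions, and ordinal data is assumed to be coded as ordered integers so that a median and quartiles can be computed.

```python
import statistics


def describe(values, level):
    """Return the descriptive measures the text allows for a level of measurement."""
    # The mode is available at every level; multimode returns all tied modes.
    summary = {"mode": statistics.multimode(values)}
    if level == "nominal":
        return summary  # mode only, no measure of spread

    # Ordinal and above: median and interquartile range become available.
    q1, _, q3 = statistics.quantiles(values, n=4)
    summary["median"] = statistics.median(values)
    summary["iqr"] = q3 - q1
    if level == "ordinal":
        return summary

    # Interval or ratio: mean, range, variance and standard deviation as well.
    summary["mean"] = statistics.mean(values)
    summary["range"] = max(values) - min(values)
    summary["variance"] = statistics.variance(values)  # sample variance (n - 1)
    summary["std_dev"] = statistics.stdev(values)      # sample standard deviation
    return summary


# Hypothetical data for each level of measurement.
print(describe(["urban", "rural", "urban"], "nominal"))
print(describe([1, 2, 2, 3, 4], "ordinal"))
print(describe([2, 4, 4, 5, 7], "ratio"))
```

Note that the function adds measures as the level of measurement rises, which mirrors the order of the three bullets.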
On the other hand, we may be interested in testing whether the population mean equals a
particular value, such as µ = 3. We call this a hypothesis of difference involving a single
variable, and the test is the one-sample t-test. It applies because our univariate data is on a
numerical scale (interval or ratio).
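To make the one-sample t-test concrete, the sketch below computes the test statistic t = (sample mean - µ0) / (s / √n) by hand. The function name and the ratings are hypothetical. Only the statistic and its degrees of freedom are computed; turning them into a p-value needs the t distribution, which the Python standard library does not provide, so a statistics package or a t table would be used for that step.

```python
import math
import statistics


def one_sample_t(sample, mu0):
    """Return (t statistic, degrees of freedom) for H0: population mean == mu0."""
    n = len(sample)
    mean = statistics.mean(sample)
    s = statistics.stdev(sample)            # sample standard deviation (n - 1)
    standard_error = s / math.sqrt(n)
    t = (mean - mu0) / standard_error       # distance from mu0 in standard errors
    return t, n - 1


# Hypothetical ratings on a 1-5 scale, tested against mu0 = 3.
ratings = [3, 4, 2, 5, 4, 3, 4, 5, 3, 4]
t_stat, df = one_sample_t(ratings, mu0=3)
print(f"t = {t_stat:.3f}, df = {df}")
```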
3.2. Statistical Methods for Bi-variate Data
Quite often, we will be interested in testing the hypothesis that involves two variables or
sometimes we also have one variable measured across two samples.
Figure 4 presents the flowchart of how a method can be chosen when the hypothesis involves two
variables or two samples measured on one variable.

Figure 4: Statistical Methods for Bi-variate Data
We will start with the question:
What is the nature of the hypothesis?
---Hypothesis of Association/Causation OR Hypothesis of Differences
1. Hypothesis of Difference: A hypothesis of difference in this context generally involves testing
for the equality of two population means (whether µ1=µ2?).

Then, we can ask this question:
Is this data parametric or non-parametric?
When the data is parametric (meaning the underlying variable has a distribution), we ask whether
the samples are independent or dependent. With independent samples, we measure one variable on two
separate samples, whereas dependent samples generally involve repeated measurements (twice) of the
same variable on a single sample.
If the samples happen to be independent, we use an independent sample t-test, otherwise we use a
paired sample t-test.
And for non-parametric data, we use the Mann-Whitney U test to test the hypothesis of differences.
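The difference between the two parametric tests shows up in how the standard error is built. The sketch below is illustrative only: the function names and data are hypothetical, and the independent-samples version assumes equal population variances (the pooled-variance form), which is our assumption and not something stated in the text.

```python
import math
import statistics


def independent_t(a, b):
    """Pooled-variance t statistic for two independent samples (equal variances assumed)."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * statistics.variance(a)
                  + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    standard_error = math.sqrt(pooled_var * (1 / na + 1 / nb))
    return (statistics.mean(a) - statistics.mean(b)) / standard_error, na + nb - 2


def paired_t(first, second):
    """t statistic for two measurements taken on the same subjects."""
    if len(first) != len(second):
        raise ValueError("paired samples must have the same length")
    diffs = [y - x for x, y in zip(first, second)]  # one difference per subject
    n = len(diffs)
    standard_error = statistics.stdev(diffs) / math.sqrt(n)
    return statistics.mean(diffs) / standard_error, n - 1


# Hypothetical scores: two separate groups, then one group measured twice.
group_a = [12, 15, 14, 10, 13]
group_b = [9, 11, 10, 8, 12]
t_ind, df_ind = independent_t(group_a, group_b)

before = [10, 12, 9, 11, 13]
after = [12, 14, 10, 13, 14]
t_pair, df_pair = paired_t(before, after)

print(f"independent: t = {t_ind:.3f}, df = {df_ind}")
print(f"paired:      t = {t_pair:.3f}, df = {df_pair}")
```

The paired version works on the per-subject differences, which is why it applies only when the same subjects are measured twice.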
2. Hypothesis of Association: For a hypothesis of association, we again ask whether the data is
parametric or non-parametric. If the data is parametric, the next question is whether we want to
look at the association between the two variables or at a cause-effect relationship. In
association, we simply try to know whether the two variables are related. In causation, one of the
variables is dependent and the other is independent, and we want to see to what extent the
independent variable explains the changes in the dependent variable.
For parametric data, when we are examining association, the test is the Pearson correlation
coefficient. For causation, we use Regression.
For non-parametric data, we ask a next-level question: is the data measured on a nominal or an
ordinal scale? If it is measured on a nominal scale, we use the Chi-square test of association. If
it is measured on an ordinal scale, we use Spearman’s rank correlation.
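The whole decision path for two variables can be summarized as a small lookup function. This is a sketch of the questions asked in this section, not a replacement for the flowchart; the function name, parameter names and string labels are our own illustrative choices.

```python
def choose_bivariate_test(hypothesis, parametric, independent_samples=None,
                          causal=False, level=None):
    """Return the test named in this section for a two-variable hypothesis.

    hypothesis: "difference" or "association"
    parametric: True for numerical data, False for categorical data
    independent_samples: needed for a parametric hypothesis of difference
    causal: True when a cause-effect relationship is being examined
    level: "nominal" or "ordinal", needed for non-parametric association
    """
    if hypothesis == "difference":
        if not parametric:
            return "Mann-Whitney U test"
        if independent_samples is None:
            raise ValueError("state whether the samples are independent or dependent")
        if independent_samples:
            return "independent-samples t-test"
        return "paired-samples t-test"

    if hypothesis == "association":
        if parametric:
            # Association only -> correlation; cause-effect -> regression.
            return "regression" if causal else "Pearson correlation"
        if level == "nominal":
            return "Chi-square test of association"
        if level == "ordinal":
            return "Spearman's rank correlation"

    raise ValueError("combination not covered by the flowchart")


print(choose_bivariate_test("difference", parametric=True, independent_samples=False))
print(choose_bivariate_test("association", parametric=False, level="ordinal"))
```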

3.3. Statistical Methods for Multivariate Data
Figure 5 presents the flowchart of how a method can be chosen when the hypothesis involves more
than two variables.
Figure 5: Statistical Methods for Multivariate Data (1)
With multivariate data, we again start with the same question: is it a hypothesis of difference
or a hypothesis of association?
Under the hypothesis of difference, we again need to know whether the data is parametric or
non-parametric. When the data is parametric, we use ANOVA; if the data is non-parametric, we use
the Kruskal-Wallis test.

When testing a Hypothesis of Association, we can ask the question: What is the level of
measurement of the dependent variable, i.e., numerical or categorical?
When the dependent variable is numerical, the next question is whether all the independent
variables are also numerical. If they are, we use Multiple Regression.
When the dependent variable is categorical, then we look at the type of independent variables. If
all the independent variables are numerical, then we use Multiple Discriminant Analysis. We may
have a case where one or two independent variables are categorical and other variables are
numerical. In this case we use Logistic Regression.
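The choices described for more than two variables can be sketched in the same way. The function and parameter names below are hypothetical. The text does not say which method applies when the dependent variable is numerical but some independent variables are categorical, so the sketch deliberately raises an error for that case instead of guessing.

```python
def choose_multivariate_test(hypothesis, parametric=None,
                             dependent=None, independents=()):
    """Return the method named in this section for a hypothesis with 3+ variables.

    hypothesis: "difference" or "association"
    parametric: True/False, used for a hypothesis of difference
    dependent: "numerical" or "categorical", used for association
    independents: one "numerical"/"categorical" label per independent variable
    """
    if hypothesis == "difference":
        if parametric is None:
            raise ValueError("state whether the data is parametric")
        return "ANOVA" if parametric else "Kruskal-Wallis test"

    if hypothesis == "association":
        if not independents:
            raise ValueError("list the type of each independent variable")
        all_numerical = all(kind == "numerical" for kind in independents)
        if dependent == "numerical" and all_numerical:
            return "Multiple Regression"
        if dependent == "categorical":
            if all_numerical:
                return "Multiple Discriminant Analysis"
            return "Logistic Regression"  # mix of categorical and numerical predictors

    raise ValueError("combination not covered in this section")


print(choose_multivariate_test("difference", parametric=False))
print(choose_multivariate_test("association", dependent="categorical",
                               independents=["numerical", "categorical"]))
```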
Figure 6 presents the flowchart of how a method is chosen in some special cases involving more
than two variables.

Figure 6: Statistical Methods for Multivariate Data (2)
When we are interested in variable/dimension reduction, meaning there is no dependent-independent
relation between the variables, or when we are working at the item level and would like to group
the items into certain variables, we use Factor Analysis. Factor analysis has two variants:
exploratory factor analysis and confirmatory factor analysis.
Sometimes we are interested in grouping the cases or respondents (not the variables) of our study
based on some criteria. In such a case, we use Cluster Analysis.
The major difference between the factor Analysis and Cluster analysis is:
In Factor Analysis, several variables or several items are grouped into fewer Dimensions or fewer
Variables. In Cluster Analysis, the respondents or subjects in the study are grouped into certain
clusters.

We might also have a situation where you examine several relationships and there are multiple
dependencies. Then, we use Structural Equation Modelling.
4. Choosing between the Z-test and t-test
A common source of confusion is when to use the Z-test and when to use the t-test. In the previous
discussion, wherever we used a t-test, a Z-test could be a possible alternative.
Figure 7 presents the flow chart of how to choose between a t-test and z-test.
Figure 7: Choosing between Z test and t-test
We start with the question: Is the population normal? If the population is normal, then we go with
another question: Is the standard deviation of the population known?

If the population is normal and the standard deviation of the population is known, we use the
Z-test. If the standard deviation of the population is not known, then we use the t-test.
If the population is not normal, then we ask the question as to whether the sample size is more than
or equal to 30. If the sample size is more than or equal to 30, then we go back to the same logic of
asking the question: Is the standard deviation of the population known? If the standard deviation
of the population is known, we use Z-test. If the standard deviation of the population is not known
then we use t-test.
If the sample size is less than 30, we need to ask whether it is a large population. If it is a
large population, we use the Binomial test; if it is not a large population, we use the
Hypergeometric test.
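The sequence of questions in this section can be written as a short function. This is a sketch of the decision logic as described here; the function name and parameter names are illustrative assumptions.

```python
def choose_z_or_t(population_normal, sigma_known, n, large_population=None):
    """Follow the questions of this section and return the name of the test.

    population_normal: is the population normally distributed?
    sigma_known: is the population standard deviation known?
    n: sample size
    large_population: only needed when the population is not normal and n < 30
    """
    if population_normal or n >= 30:
        # Normal population, or a sample of at least 30: the choice
        # depends only on whether the population standard deviation is known.
        return "Z-test" if sigma_known else "t-test"

    # Non-normal population and a small sample (n < 30).
    if large_population is None:
        raise ValueError("state whether the population is large")
    return "Binomial test" if large_population else "Hypergeometric test"


print(choose_z_or_t(population_normal=True, sigma_known=False, n=12))
print(choose_z_or_t(population_normal=False, sigma_known=True, n=45))
print(choose_z_or_t(population_normal=False, sigma_known=False, n=20,
                    large_population=True))
```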
