Introduction to data analysis

DESCRIPTION OF THE TOPIC
Items        Description of the Topic
Course       Data Analysis for Social Science Teachers
Topic        Introduction to Data Analysis
Module Id    1.1
Introduction
In the recent past, considerable importance has been given to data analysis in
research. One possible reason is that empirical evidence provides a firm basis
for either accepting or rejecting the proposed hypotheses. The choice of statistical technique
depends on the nature of the research problem or question and on the nature of the data
set.
The research questions that address a research gap or problem may relate to
identifying the degree of relationship among variables, checking the significance of
group differences, predicting group membership or structure, or they may be time-related.
To identify associations between two or more variables, correlation, regression or
chi-square techniques may be adopted, depending on whether the variables are parametric or
non-parametric in nature. This can be done as bi-variate correlation and regression,
multiple correlation and regression, canonical correlation, Multiple Discriminant
Analysis, or logit regression. Bi-variate correlation is a good starting point for identifying
the degree of relationship between two continuous variables, such as job satisfaction and
family satisfaction, where either of them can be treated as the dependent variable (DV) or
the independent variable (IV), as the research question requires. Bi-variate regression,
however, requires one of them to be defined as the DV and the other
as the IV. Although these are not multivariate techniques, they form the basis of
Multivariate Analysis (MVA).
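The difference between bi-variate correlation and bi-variate regression can be seen in a
small sketch. The satisfaction scores below are made-up values used only for illustration,
and the function names are my own. The correlation is the same whichever variable comes
first, whereas the regression slope changes when the DV and the IV are swapped.

```python
import math

def pearson_r(x, y):
    """Pearson correlation; symmetric in x and y."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def regression_slope(iv, dv):
    """Least-squares slope of dv regressed on iv; depends on which is the DV."""
    mean_iv, mean_dv = sum(iv) / len(iv), sum(dv) / len(dv)
    sxy = sum((a - mean_iv) * (b - mean_dv) for a, b in zip(iv, dv))
    sxx = sum((a - mean_iv) ** 2 for a in iv)
    return sxy / sxx

# Hypothetical scores for five respondents (not from any real study)
job_satisfaction = [3, 4, 5, 6, 7]
family_satisfaction = [2, 4, 5, 4, 5]

r = pearson_r(job_satisfaction, family_satisfaction)
slope_fs_on_js = regression_slope(job_satisfaction, family_satisfaction)  # FS as DV
slope_js_on_fs = regression_slope(family_satisfaction, job_satisfaction)  # JS as DV
print(r, slope_fs_on_js, slope_js_on_fs)
```

In this sketch the two slopes differ, but their product equals the squared correlation,
which is one way to see that correlation carries no direction while regression does.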
1. Importance of Multivariate Analysis (MVA)
If watching a movie is to be a pleasant experience, the lighting, the projected film
light and the sound effects in the theatre must be optimum. Other factors that may contribute
to a pleasant viewing experience include, but are not limited to, seating arrangements,
air-conditioning and hall odour. If one has to study or measure the pleasantness of watching a
movie in a theatre, all of the above factors must be studied together and not in isolation.
Even an unpleasant parking experience may negatively affect the
pleasantness of watching a movie. So, the real value of measuring the pleasantness of a
movie-watching experience lies in measuring all the influencing factors together.
This is exactly what multivariate analysis is all about. Analysing multiple
variables simultaneously gives a better picture from which to draw inferences than
multiple uni-variate analyses done on the individual variables. Statistical techniques that
simultaneously analyse multiple measurements of the observed variables are known as
Multivariate Analysis (MVA). We may perform MVA by using multiple variables in a single
relationship or in multiple relationships.

In a truly multivariate scenario, all variables must be:
i. Random in nature,
ii. Inter-related, and
iii. Interpreted in unison.
Reading the paper by Pattusamy and Jacob (2015), which tests the Greenhaus and Allen
model, will help in understanding the forthcoming discussion and in answering a few
questions at the end. The theoretical model is shown in Figure 1.
Figure 1 - Theoretical Model
From Figure 1, it is seen that family-work conflict (FWC) will have a negative effect
on job satisfaction (JS) while family-work facilitation (FWF) will have a positive effect on
job satisfaction. Similarly, work-family facilitation (WFF) will have a positive effect on
family satisfaction (FS) while work-family conflict (WFC) will have a negative effect on
family satisfaction. Both job and family satisfaction will positively influence feelings of
work-family balance, which in turn will positively influence life satisfaction (LS). All the
above statements are hypotheses; they can be stated conclusively only if we have empirical
data that support them. The use of appropriate statistical methods will facilitate the
data analysis needed to arrive at well-grounded inferences and conclusions.
Univariate statistical tests involve one dependent variable. Examples include, but are
not limited to, t-tests of means, analysis of variance (ANOVA), analysis of covariance and
simple linear regression (with one dependent and one independent variable). Having
established the importance of data analysis, let us take a quick look at a few multivariate
techniques that we are likely to study in detail during this course.
The next section leads us to the classification of MVA.

2. Classification of MVA
MVA can be classified as Dependence techniques and Interdependence techniques.
2.1 Dependence techniques (used when there are one or more dependent variables and
one or more independent variables, e.g. multiple regression analysis)
i. Multiple regression and multiple correlation
ii. Multiple Discriminant Analysis (MDA) and Logistic Regression
iii. Canonical Correlation Analysis
iv. Multivariate Analysis of Variance and Covariance
v. Conjoint Analysis
vi. Structural Equation Modelling (SEM) and Confirmatory Factor Analysis (CFA)
2.1.1 Multiple Regression
Let us presume that previous research has established that cars with higher
engine capacity and higher unladen weight offer lower fuel efficiency (possibly validated
using a correlation analysis). If a researcher wants to predict fuel efficiency from
engine capacity and unladen weight, then fuel efficiency is treated as the dependent variable
while engine capacity and unladen weight are treated as the independent variables. The
researcher collects data on the fuel efficiency, engine capacity and unladen weight of
100 or more cars (that run on the same type of fuel) and would possibly use the multiple
regression (MR) method to predict fuel efficiency. To use the MR method, the dependent
variable must be metric. The independent variables (two or more) are typically metric, as in
this example, although non-metric independent variables can also be included (see the table
in Section 3), usually through dummy coding.
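As a minimal sketch of the MR idea, the code below fits fuel efficiency on engine capacity
and unladen weight by ordinary least squares, solving the normal equations directly. The
seven cars, the units (litres, tonnes, km per litre) and the generating coefficients are
assumptions made up for illustration; a real study would use observed data and a
statistical package.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x

def fit_multiple_regression(predictors, response):
    """Return [intercept, b1, b2, ...] from the normal equations (X'X) b = X'y."""
    X = [[1.0] + list(row) for row in predictors]
    k = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * y for row, y in zip(X, response)) for i in range(k)]
    return solve_linear_system(xtx, xty)

# Hypothetical cars: (engine capacity in litres, unladen weight in tonnes)
cars = [(0.8, 0.70), (1.0, 0.90), (1.2, 0.85), (1.5, 1.10),
        (1.8, 1.05), (2.0, 1.40), (2.4, 1.30)]
# Synthetic, noise-free fuel efficiency so that the fit can be checked by eye
fuel_efficiency = [30 - 5 * litres - 4 * tonnes for litres, tonnes in cars]

coefficients = fit_multiple_regression(cars, fuel_efficiency)
print(coefficients)  # intercept, engine-capacity slope, unladen-weight slope
```

Because the response here is generated without noise, the fitted coefficients should
reproduce the generating values up to rounding error; with real data they would be estimates
with standard errors.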
2.1.2 Multiple Discriminant Analysis (MDA) and Logit Analysis
If the dependent variable is non-metric, for example dichotomous (Yes/No, Men/Women), then
MDA is an appropriate technique. The independent variables need to be metric data. MDA helps
to understand group differences and to predict the likelihood that an observation or object
belongs to a specific group. Returning to the example discussed under MR in the previous
section, suppose we have data on the engine capacity and unladen weight of more than 100
cars (that run on the same type of fuel) and we want to classify them as big and small cars;
MDA would then be a relevant technique.
Logit analysis, also known as logistic regression, is a combination of MR and
MDA. Although the regression principle is similar to that of MR, the DV in logit regression
need not be metric as in MR but can be a dichotomous variable as in MDA.
Another distinguishing feature of logit regression is that it can accommodate both metric and
non-metric IVs and does not require the multivariate normality assumption.
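The sketch below illustrates logit regression with a dichotomous DV (big = 1, small = 0)
and two metric IVs. It is fitted by plain gradient descent only to keep the example
self-contained; statistical packages normally use maximum likelihood routines. The six cars,
the learning rate and the number of iterations are assumptions chosen for illustration.

```python
import math

def standardize(column):
    """Rescale a list of numbers to mean 0 and standard deviation 1."""
    mean = sum(column) / len(column)
    sd = math.sqrt(sum((v - mean) ** 2 for v in column) / len(column))
    return [(v - mean) / sd for v in column]

def predict_probability(weights, bias, x):
    """Logistic function applied to the linear predictor."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))

def fit_logit(features, labels, learning_rate=0.5, n_iter=2000):
    """Fit weights and bias by gradient descent on the mean log-loss."""
    weights, bias, n = [0.0] * len(features[0]), 0.0, len(labels)
    for _ in range(n_iter):
        grad_w, grad_b = [0.0] * len(weights), 0.0
        for x, y in zip(features, labels):
            error = predict_probability(weights, bias, x) - y
            grad_b += error
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
        bias -= learning_rate * grad_b / n
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
    return weights, bias

# Hypothetical cars: engine capacity (cc), unladen weight (kg), 1 = big, 0 = small
engine_cc = [800, 1000, 1200, 1800, 2000, 2200]
weight_kg = [700, 800, 900, 1300, 1400, 1500]
is_big = [0, 0, 0, 1, 1, 1]

features = list(zip(standardize(engine_cc), standardize(weight_kg)))
weights, bias = fit_logit(features, is_big)
predicted = [1 if predict_probability(weights, bias, x) > 0.5 else 0 for x in features]
print(predicted)
```

The output of the model is a probability of group membership, which is what allows a
dichotomous DV to be handled with a regression-like equation.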

2.1.3 Canonical Correlation Analysis
If there are multiple metric dependent and metric independent variables to be
correlated and regressed, then the right tool is Canonical Correlation Analysis. We actually
try to determine the associations between two sets of variables. For example, we might study
the relationship between a number of indices of fuel efficiency (the DVs such as Indicated
Horse Power (IHP) and Brake Horse Power (BHP)) and the IVs (such as engine capacity,
unladen weight of the car, and age of the car).
2.1.4 Multivariate Analysis of Variance and Covariance
In order to simultaneously explore the relationship between multiple categorical
independent variables, which are also called treatments, and two or more metric dependent
variables, an ideal technique is the Multivariate Analysis of Variance
(MANOVA). If the analysis requires eliminating the effect of uncontrolled metric
independent variables, known as covariates, on the dependent variables, then the
Multivariate Analysis of Covariance (MANCOVA) is used. Both MANOVA and MANCOVA
may be done as one-way or factorial designs. In our car example with fuel efficiency as the
DV, the age of the car can be treated as a covariate.
2.1.5 Conjoint Analysis
Conjoint Analysis is a contemporary dependence technique that helps a decision-maker
(such as a product design head) evaluate the importance of attributes (typically product
attributes) along with their levels. Let us say we have three attributes of a car, namely,
airbags (2, 4 or 6 airbags), speakers for infotainment (2, 4 or 6 speakers) and steering
wheel height adjustment (low, medium and high). With three levels for each of the three
attributes, there are 3 x 3 x 3 = 27 possible combinations. For example, a car
enthusiast may prefer 6 airbags, 4 speakers and medium height for the steering wheel.
If we want to know the combinations preferred by car enthusiasts, we may have to ask them
to rate all 27 combinations. However, using conjoint analysis it is possible
to capture the preferences of the prospective car buyer with ratings of just 9 or more
combinations. Conjoint analysis helps a great deal in product design simulation studies.
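The arithmetic behind the 27 and 9 combinations can be sketched as follows. The text does
not say how the 9 combinations are chosen; the sketch assumes one common approach, an
orthogonal (fractional factorial) design in which every pair of levels from any two
attributes appears exactly once. The variable and function names are my own.

```python
from itertools import combinations, product

attributes = {
    "airbags": [2, 4, 6],
    "speakers": [2, 4, 6],
    "steering_height": ["low", "medium", "high"],
}
levels = list(attributes.values())

# Full factorial: every level of every attribute combined, 3 x 3 x 3 profiles
full_factorial = list(product(*levels))

# Assumed reduced design: the third attribute's level index is (i + j) mod 3,
# which gives a 9-profile orthogonal array for three 3-level attributes
reduced_design = [
    (levels[0][i], levels[1][j], levels[2][(i + j) % 3])
    for i in range(3)
    for j in range(3)
]

def is_pairwise_balanced(design):
    """True if each pair of levels from any two attributes occurs exactly once."""
    for a, b in combinations(range(3), 2):
        pairs = [(profile[a], profile[b]) for profile in design]
        if len(pairs) != 9 or len(set(pairs)) != 9:
            return False
    return True

print(len(full_factorial), len(reduced_design), is_pairwise_balanced(reduced_design))
```

Respondents would then rate only the profiles in the reduced design, and the importance of
each attribute level is estimated from those ratings.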
2.1.6 Structural Equation Modelling (SEM) and Confirmatory Factor Analysis
(CFA)
While multiple regression examines a single relationship between a DV and multiple
IVs, in an SEM it is possible to examine multiple relationships simultaneously. Generally, a
CFA is done prior to the SEM. The SEM consists of the structural model and the measurement
model. The structural model may have one or more DVs and one or more IVs with all
relationships defined. Each of the DVs and IVs may be either uni- or multi-dimensional and
each of the dimensions may be measured using scale items as indicators. The CFA shows
the contribution of each scale item to its dimension and the extent to which it measures
that dimension. In this way the measurement model is evaluated. After the validity and
reliability of the measurement model are established, the structural model is evaluated to
test the hypotheses. Hence, SEM supports simultaneous assessment of relationships and
accommodates multi-item scales.
2.2 Interdependence techniques (there are no dependent or independent variables; instead,
all variables in the set are analysed simultaneously, e.g. factor
analysis)
a) Factor Analysis (both Principal Component Analysis and Common Factor Analysis)
b) Cluster Analysis
c) Perceptual Mapping (also called Multidimensional Scaling)
d) Correspondence Analysis
2.2.1 Factor Analysis
The objective of factor analysis is to reduce the number of measured variables to a smaller
set of meaningful factors (or variates) with minimal loss of information. This can be done
either by Principal Component Analysis (PCA) or by common factor analysis. Suppose a
prospective car buyer is considering the colour of the car, the aerodynamic design,
body-coloured bumpers, height-adjustable steering column, driver seat height adjustment,
touch screen for infotainment, ABS and airbags. If the opinion of the car buyer on each of
these is captured using a 7-point Likert scale, either PCA or common factor analysis may
group these eight variables into three groups, namely, external features (colour of the car,
the aerodynamic design, body-coloured bumpers), internal features (height-adjustable steering
column, driver seat height adjustment, touch screen for infotainment) and safety features
(ABS and airbags). So factor analysis helps us to reduce
eight variables to three meaningful factors (variates).
2.2.2 Cluster Analysis
In the car example that we have been discussing so far, suppose we have data on the
engine capacities of about 130 cars, ranging from 799 cc to 2399 cc, and we want to place
these 130 cars in three groups, namely, small, medium and large cars. Cluster analysis
would be a recommended technique. The cluster
analysis algorithm places the objects in homogeneous groups based on the characteristics
specified by the researcher. In our example, the cars would be placed in groups based on
engine capacity. Clustering can be done on multiple characteristics too. Either
hierarchical or non-hierarchical clustering procedures may be adopted. Hierarchical
methods can be either agglomerative or divisive. The algorithms commonly used in
hierarchical methods are the single, complete and average linkage methods, along with the
centroid and Ward methods. Alternatively, non-hierarchical clustering commonly
follows the k-means algorithm, which places objects in clusters once the number of
clusters is specified. The decision on whether to adopt a hierarchical or non-hierarchical
procedure depends on the choice of the researcher and the problem defined.
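The k-means procedure mentioned above can be sketched for the single characteristic of
engine capacity. The nine capacities are made-up values within the 799 cc to 2399 cc range,
and starting the three centroids at the minimum, median and maximum is my own simplifying
assumption; packages usually choose starting points differently.

```python
def kmeans_1d(values, initial_centroids, max_iter=100):
    """Assign each value to its nearest centroid, then recompute centroids,
    repeating until the centroids stop changing."""
    centroids = list(initial_centroids)
    clusters = [[] for _ in centroids]
    for _ in range(max_iter):
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        new_centroids = [
            sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return clusters, centroids

# Hypothetical engine capacities in cc
engine_cc = sorted([799, 814, 998, 1397, 1498, 1598, 2179, 2298, 2399])

# The researcher specifies the number of clusters (three) through the start points
start = [engine_cc[0], engine_cc[len(engine_cc) // 2], engine_cc[-1]]
clusters, centroids = kmeans_1d(engine_cc, start)

for name, members in zip(["small", "medium", "large"], clusters):
    print(name, members)
```

With multiple characteristics the only change in principle is the distance measure, which
would then be computed across all characteristics rather than on engine capacity alone.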

2.2.3 Perceptual Mapping
If we consider two dimensions of a car, namely, fuel efficiency and driving comfort,
and we want to know how the brands of cars currently available in the market are positioned
in the minds of car enthusiasts, the right technique is
Perceptual Mapping (PM), also known as Multi-dimensional Scaling (MDS). MDS
typically helps a researcher to determine the perceived relative image of the objects (cars,
in this case) on the two dimensions. In MDS, unlike in factor or cluster analysis, a solution
can be obtained for each respondent and there is no variate. The researcher chooses
between similarity and preference data, between disaggregate and aggregate analysis, and
between compositional and decompositional methods. Although earlier MDS programs were
predominantly non-metric in output, contemporary programs provide metric output.
2.2.4 Correspondence Analysis
If we have non-metric data such as colors of the cars, classification of car size such as
small, medium and large and we want to position the cars in a perceptual map, then the
technique to be adopted is the Correspondence Analysis (CA). It starts with a cross-tabulation
of the two attributes, namely, colors and car size; after that it carries out a non-metric to
metric conversion, and then leads to dimension reduction and finally the perceptual map is
prepared. CA is the best option for a multivariate representation of interdependence for non-
metric data.
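The first two steps of CA, the cross-tabulation and the non-metric to metric conversion,
can be sketched as below. The counts are made-up, and the conversion shown is the usual
chi-square based one (standardized residuals), which I am assuming is what is meant here.
The later steps, dimension reduction and drawing the perceptual map, need a singular value
decomposition and are left out of this sketch.

```python
import math

# Hypothetical cross-tabulation of car colour by car size (counts of cars)
crosstab = {
    "white": {"small": 10, "medium": 20, "large": 30},
    "red": {"small": 30, "medium": 20, "large": 10},
}

def standardized_residuals(table):
    """Return (observed - expected) / sqrt(expected) per cell, plus the total count.
    Expected counts assume colour and size are independent."""
    row_totals = {r: sum(cols.values()) for r, cols in table.items()}
    col_totals = {}
    for cols in table.values():
        for c, count in cols.items():
            col_totals[c] = col_totals.get(c, 0) + count
    n = sum(row_totals.values())
    residuals = {}
    for r, cols in table.items():
        residuals[r] = {}
        for c, count in cols.items():
            expected = row_totals[r] * col_totals[c] / n
            residuals[r][c] = (count - expected) / math.sqrt(expected)
    return residuals, n

residuals, n = standardized_residuals(crosstab)
chi_square = sum(v ** 2 for row in residuals.values() for v in row.values())
total_inertia = chi_square / n  # the quantity that CA splits across dimensions

print(residuals)
print(chi_square, total_inertia)
```

Positive residuals mark colour and size categories that occur together more often than
expected, and such categories end up close to each other on the perceptual map.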
3. Nature of Data
The following table gives a summary of the nature of data:

Name of the Multivariate      Nature of the Data
Technique                     DV                    IV
Canonical Correlation         Metric, Non-metric    Metric, Non-metric
MANOVA                        Metric                Non-metric
ANOVA                         Metric                Non-metric
MDA                           Non-metric            Metric
Multiple Regression           Metric                Metric, Non-metric
Conjoint Analysis             Non-metric, Metric    Non-metric
SEM                           Metric                Metric, Non-metric
4. Some Generic Tips to Perform Multivariate Analysis

While performing MVA on a research problem, it helps if the researcher observes the
following tips:
1. Ensure that the research has both statistical and practical significance.
2. The sample size should be adequate: neither too small nor too large.
3. Clearly understand the nature of the data.
4. Use the minimum number of variables in the model needed to obtain the desired results.
5. Identify and eliminate errors.
6. Validate the results thoroughly.
I hope the above content gives you a fair idea of the multivariate techniques
that we will be covering in our course and a snapshot of their applications. For further
learning, may I also suggest the open courseware by Rudin et al. (2011), titled “Statistical
Thinking and Data Analysis”.
Although at the beginning of this discussion I suggested reading the paper
by Pattusamy and Jacob (2015), throughout the discussion I used examples relating to cars. If
you have understood the application of the discussed MVA tests with the variables in the car
example, you should be able to answer a few fundamental questions relating to data analysis
with respect to the variables in the paper. Here are your challenges.
Self-Assessment:
Suggest appropriate statistical tests to answer the following research
questions. It helps if you can also justify your choice of technique.
1. Are men more satisfied with their jobs than women?
2. Does life satisfaction vary with age?
3. Will feelings of work-life balance influence the relationship between job
satisfaction and life satisfaction?
4. Would there be a difference in the strength of the relationship between family
satisfaction and life satisfaction between men and women?
5. Would it be possible to categorize men who are highly and moderately satisfied in
their lives?

References
1. Tabachnick B.G. and Fidell L.S., Using Multivariate Statistics, 6th Edition, Pearson
Education Inc., pp. 612-680.
2. Cynthia Rudin, Allison Chang, and Dimitrios Bisias. 15.075J Statistical Thinking and
Data Analysis. Fall 2011. Massachusetts Institute of Technology: MIT
OpenCourseWare, https://ocw.mit.edu. License: Creative Commons BY-NC-SA.
3. Hair J.F., Black W.C., Babin B.J. and Anderson R.E., Multivariate Data Analysis, 7th
Edition, Pearson Education (South Asia), pp. 89-149.
4. Murugan Pattusamy and Jayanth Jacob, A test of Greenhaus and Allen (2011) model on
Work Family Balance, Current Psychology, Springer, 2015.
5. Zumbo B.D. (2014) Univariate Tests. In: Michalos A.C. (ed.) Encyclopedia of Quality
of Life and Well-Being Research. Springer, Dordrecht.