Overview of Multivariate
Statistical Methods
Thomas Uttaro, Ph.D., M.S.
Deputy Director and CIO,
South Beach Psychiatric Center
11th Annual NYS-OMH Institute on Mental Health Management Information
Introduction
 Concerned with data collected on several dimensions of
the same individual, or on units of analysis such as
geographic regions.
 Common in the social, behavioral, life, and medical
sciences: medical and mental health outcomes,
economic indicators, demography.
 An extension of univariate statistics (the analysis of
variation in a single random variable): t tests, correlation,
regression, ANOVA, ANCOVA, survival analysis.
 Multivariate techniques account for the correlation among
measures that arises from a common source (each individual or
other unit of analysis). They also control the overall
(experimentwise) type I error rate through omnibus
significance tests.
Preview of Multivariate Methods
 Multivariate General Linear Model: the
extension of ANOVA, ANCOVA, and regression
to a family of methods for multivariate outcomes.
 Principal Components Analysis: accounts for
variation in multivariate observations with a
smaller number of observed indices that are
linear combinations of the original variables.
 Factor Analysis: accounts for variation in
multiple outcomes with a linear combination of
unobserved factors plus a variable-specific term.
Preview of Multivariate Methods (cont.)
 Discriminant Analysis: concerned with
separating observations into known groups
based on multivariate observations.
 Cluster Analysis: concerned with identifying
unknown but interpretable groups and placing
individual observations within them.
 Canonical Correlation: the variables are
divided into two sets, and the relationship
between the two sets is described through
multivariate correlations.
General Linear Model: ANOVA, ANCOVA
and Multiple Regression
 The General Linear Model is a unified framework that relates
the ANOVA, ANCOVA, and regression methods.
 ANOVA relates factor (categorical) predictors to a
single continuous outcome variable.
 Multiple Regression relates continuous predictors to a
single continuous outcome variable.
 ANCOVA relates both factor and continuous predictors
to a single continuous outcome variable.
 A regression approach with dummy variables can be
used to perform ANOVA, hence the unified GLM.
 Variations on these models predict binary or
categorical outcomes (logistic regression, multinomial
regression).
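The dummy-variable equivalence above can be sketched in a few lines of Python (a minimal illustration with made-up data; the deck's own examples use SPSS). For a two-level factor coded 0/1, the regression intercept is the mean of group 0 and the slope is the difference in group means:

```python
# ANOVA via regression with a dummy variable (hypothetical data).
from statistics import mean

# hypothetical outcome values for two groups
group0 = [10.0, 12.0, 11.0, 13.0]
group1 = [15.0, 17.0, 16.0, 18.0]

y = group0 + group1
d = [0.0] * len(group0) + [1.0] * len(group1)  # dummy: 0 = group A, 1 = group B

# ordinary least squares for y = b0 + b1*d (closed form, one predictor)
dbar, ybar = mean(d), mean(y)
b1 = sum((di - dbar) * (yi - ybar) for di, yi in zip(d, y)) / \
     sum((di - dbar) ** 2 for di in d)
b0 = ybar - b1 * dbar

print(b0, b1)  # b0 = mean(group0) = 11.5, b1 = mean difference = 5.0
```

The F test for the dummy coefficient is algebraically the same as the one-way ANOVA F test for the factor, which is the sense in which the GLM unifies the two.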
ANOVA and Multiple Regression
Examples using SPSS 10.5
 ANOVA example relates FACA (2 levels, perhaps gender), FACB (3
levels, perhaps region), and the FACA x FACB interaction to a single
dependent variable (perhaps annual income). (steve296anova.SPS)
 F tests indicate that the main effects FACA and FACB are significant
at the p<.001 level; the FACA x FACB interaction is non-significant.
 Multiple regression example predicts instructor evaluation from 5
continuous predictors: clarity, stimulation, knowledge, interest, and
course evaluation. The analysis accounts for correlation among the
predictors and determines which are most important.
(stevep84MBAreg.SPS)
 t tests of the regression coefficients indicate that all variables except
interest are significant predictors of instructor evaluation.
 Several diagnostics are output, including residuals, leverages, and
Cook's distance values (influential data points). The plot of
standardized regression residuals should be approximately normal.
Multivariate General Linear Model
 Extensions of single dependent variable procedures
such as ANOVA, ANCOVA, and multiple regression.
 Statistical framework includes MANOVA (factor
predictors), MANCOVA (factors and continuous
predictors), and multivariate multiple regression
(continuous predictors).
 Prevents an inflated overall type I error rate, accounts for
correlations among the dependent variables, and can detect
the joint significance of a set of variables even when the
univariate analyses would not be significant.
 Hotelling’s T² is the overall multivariate test statistic, a
generalization of the univariate t. It tests the H₀ that the
population mean vectors are equal for two or more groups.
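As a minimal sketch of the idea (made-up bivariate data for two groups, not the deck's data), Hotelling's T² can be computed from first principles: pool the within-group covariance, then form a quadratic in the mean-difference vector:

```python
# Hotelling's T-squared for two groups on two variables (hypothetical data).
from statistics import mean

g1 = [(4.0, 3.0), (5.0, 4.0), (6.0, 5.0), (5.0, 5.0)]   # group 1
g2 = [(7.0, 6.0), (8.0, 7.0), (7.0, 8.0), (9.0, 7.0)]   # group 2

def mean_vec(rows):
    return [mean(c) for c in zip(*rows)]

def scatter(rows, m):
    """2x2 matrix of cross-products about the mean."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for x, y in rows:
        dx, dy = x - m[0], y - m[1]
        s[0][0] += dx * dx; s[0][1] += dx * dy
        s[1][0] += dy * dx; s[1][1] += dy * dy
    return s

m1, m2 = mean_vec(g1), mean_vec(g2)
n1, n2 = len(g1), len(g2)
s1, s2 = scatter(g1, m1), scatter(g2, m2)

# pooled within-group covariance matrix
S = [[(s1[i][j] + s2[i][j]) / (n1 + n2 - 2) for j in range(2)] for i in range(2)]

# invert the 2x2 pooled covariance
det = S[0][0] * S[1][1] - S[0][1] * S[1][0]
Sinv = [[S[1][1] / det, -S[0][1] / det], [-S[1][0] / det, S[0][0] / det]]

d = [m1[0] - m2[0], m1[1] - m2[1]]           # mean-difference vector
quad = sum(d[i] * Sinv[i][j] * d[j] for i in range(2) for j in range(2))
T2 = (n1 * n2) / (n1 + n2) * quad            # Hotelling's T-squared
print(T2)
```

With one variable this reduces to the square of the two-sample t statistic, which is the sense in which T² generalizes t.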
Multivariate General Linear Model (cont.)
 Multivariate significance implies that there is a linear
combination of dependent variables (the discriminant
function) that is separating the k groups.
 Multivariate test statistics are a function of eigenvalues
which are fundamental to all multivariate analyses.
 Four multivariate test statistics are commonly used:
Wilks' Λ, Roy's largest root, the Hotelling-Lawley trace,
and the Pillai-Bartlett trace. Wilks' Λ is the most common.
 Following a significant finding, post hoc or planned
comparisons are then used to determine which variables
are driving the significance between groups.
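Wilks' Λ can be illustrated the same way (made-up data again): it is the ratio det(W)/det(T) of the within-groups scatter determinant to the total scatter determinant, so values near 0 indicate strong group separation and values near 1 indicate none:

```python
# Wilks' Lambda = det(W)/det(T) for two groups on two variables
# (hypothetical data).
from statistics import mean

g1 = [(4.0, 3.0), (5.0, 4.0), (6.0, 5.0), (5.0, 5.0)]
g2 = [(7.0, 6.0), (8.0, 7.0), (7.0, 8.0), (9.0, 7.0)]

def scatter(rows):
    """2x2 scatter (cross-product) matrix about the mean of rows."""
    mx = mean(r[0] for r in rows); my = mean(r[1] for r in rows)
    s = [[0.0, 0.0], [0.0, 0.0]]
    for x, y in rows:
        dx, dy = x - mx, y - my
        s[0][0] += dx * dx; s[0][1] += dx * dy
        s[1][0] += dy * dx; s[1][1] += dy * dy
    return s

def det2(m):
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

w1, w2 = scatter(g1), scatter(g2)
W = [[w1[i][j] + w2[i][j] for j in range(2)] for i in range(2)]  # within
T = scatter(g1 + g2)                                            # total

wilks_lambda = det2(W) / det2(T)
print(wilks_lambda)   # near 0: groups well separated
```

The determinants here are functions of the eigenvalues mentioned above, which is why all four multivariate test statistics can be written in terms of the same eigenvalue problem.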
Multivariate Regression Example using
SPSS 10.5
 Timm data on differences in cognitive tests due to
learning tasks. Scores on Raven's Progressive Matrices
and the Peabody Picture Vocabulary Test are regressed
on 3 learning tasks. (steve132multivreg.SPS)
 The multivariate test statistic Wilks' Λ is significant,
indicating a significant relationship between the dependent
variables and the 3 predictors beyond the .01 level.
 Univariate F tests examine the regression on each
variable separately. In particular, NA (named action) is
related to PEVOCAB at t=2.68, p<.011.
 Univariate prediction equations do not take into account
correlations among the dependent variables.
MANOVA Example with Tukey
post-hoc tests using SAS V8
 Novince data on improving social skills among college
women. 3 groups: control, behavioral rehearsal, and
cognitive restructuring; 4 variables: anxiety, social
interaction skills, appropriateness, and assertiveness.
 SAS program used for 3 treatment group MANOVA with
4 measures to determine treatment effectiveness.
SASstevep204.sas
 Overall significant multivariate tests indicate true
differences between groups on one or more variables
and their linear combinations. Excellent optional output.
 Tukey post hoc tests generate significance levels and
confidence intervals to examine effects of variables.
Crisis Residence Treatment and the
Basis-32 at South Beach PC
 Treatment, gender, and GAF covariate effects on
BASIS-32 subscale scores, n=73 paired admissions and
discharges.
 Highly significant pre/post treatment effect, F=4.216, 5
df, p<.001. CR effective in terms of all BASIS-32
subscales.
 Significant GAF covariate effect, F=5.271, 5 df, p<.001:
a strong relationship between the clinician-rated GAF and the
self-report BASIS-32 on the relation to self/others and
depression subscales.
 Gender by treatment interaction non-significant. CR
equally effective for both genders in terms of subscale
scores.
 Statistical diagnostics indicate excellent power for all
tests.
Principal Components Analysis
 Analysis based on a large number of original variables
can be simplified to a smaller number of standardized
linear combinations of original variables.
 x → y = Γ′(x − μ), where Γ is orthogonal and Γ′ΣΓ = Λ. The
ith principal component of x is the ith element of y,
yᵢ = γᵢ′(x − μ), where γᵢ is the ith column of Γ, so the
principal components are uncorrelated. Finding the principal
components essentially amounts to finding the eigenvalues and
eigenvectors of the covariance matrix Σ.
 The first principal component has the largest variance of
all standardized linear combinations of x.
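A minimal pure-Python sketch of the idea, with made-up bivariate data: for two variables the eigen-decomposition of the 2×2 covariance matrix has a closed form, and the variance of the scores on the first component equals the largest eigenvalue, as the slide states:

```python
# Principal components of two correlated variables (hypothetical data),
# via the eigen-decomposition of the 2x2 sample covariance matrix.
import math
from statistics import mean

x = [2.5, 0.5, 2.2, 1.9, 3.1, 2.3, 2.0, 1.0, 1.5, 1.1]
y = [2.4, 0.7, 2.9, 2.2, 3.0, 2.7, 1.6, 1.1, 1.6, 0.9]
n = len(x)
mx, my = mean(x), mean(y)

# sample covariance matrix Sigma
sxx = sum((xi - mx) ** 2 for xi in x) / (n - 1)
syy = sum((yi - my) ** 2 for yi in y) / (n - 1)
sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / (n - 1)

# eigenvalues of a symmetric 2x2 matrix (closed form)
tr, det = sxx + syy, sxx * syy - sxy * sxy
lam1 = (tr + math.sqrt(tr * tr - 4 * det)) / 2   # largest eigenvalue
lam2 = (tr - math.sqrt(tr * tr - 4 * det)) / 2

# unit eigenvector for lam1 gives the first principal component direction
vx, vy = lam1 - syy, sxy
norm = math.hypot(vx, vy)
vx, vy = vx / norm, vy / norm

# scores on the first PC; their variance equals the largest eigenvalue
pc1 = [vx * (xi - mx) + vy * (yi - my) for xi, yi in zip(x, y)]
var_pc1 = sum(s * s for s in pc1) / (n - 1)
print(var_pc1, lam1)   # the two numbers agree
```

In practice one uses a library routine (princomp in S-Plus, as in the next slide); the closed form here only works in two dimensions, but the property it demonstrates holds in general.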
Principal Components Example using
S-Plus V6 (PrinComp.ssc)
 Ph.D. qualifying examinations in five areas of
mathematics for 25 students.
 Analysis carried out using S-Plus princomp function
which returns object of mode princomp.
 A large coefficient (absolute value) corresponds to a high
loading, while a coefficient near zero has a low loading.
 First principal component loadings are of moderate size
in the same direction representing an average score.
 Second principal component contrasts two closed book
exams with three open book exams, with the first and
last exams weighted most heavily.
 Plots show the principal component loadings and the biplot
of original and transformed test scores in two-dimensional
principal component space.
Factor Analysis
 Factor Analysis explains correlations between observed
variables with underlying factors.
 x = μ + Λf + u, where Λ = {λᵢⱼ} is the matrix of factor
loadings and f and u represent the common and unique factors
respectively. Equivalently, Σ = ΛΛ′ + Ψ, a decomposition into
factor and error covariances.
 The diagonal of the factor covariance matrix ΛΛ′ is the vector
of communalities h², the variation shared through the factors,
and the diagonal elements ψᵢ of Ψ are the uniquenesses, the
variation in xᵢ not shared with the other variables. For
standardized variables these sum to 1 for each variable.
 The factor solution is not unique. Factors can be rotated to
ease interpretation: for an orthogonal G, Σ = (ΛG)(G′Λ′) + Ψ,
and Δ = ΛG is the matrix of rotated factor loadings. The analyst
seeks simple structure in the rotation: each variable should load
highly on one factor, and every loading should be either large in
absolute value or near zero.
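A small numeric sketch of the communality/uniqueness decomposition, with hypothetical loadings (not the testscores results): each communality is the sum of squared loadings for a variable, and for standardized variables communality plus uniqueness reconstructs a unit variance:

```python
# Communalities and uniquenesses in a two-factor model for standardized
# variables. The loadings below are made up for illustration.
L = [
    [0.8,  0.3],   # variable 1 loadings on factors 1 and 2
    [0.7, -0.4],   # variable 2
    [0.6,  0.5],   # variable 3
]

h2  = [sum(l * l for l in row) for row in L]   # communalities h^2
psi = [1.0 - h for h in h2]                    # uniquenesses psi

# diagonal of Sigma = Lambda Lambda' + Psi; for standardized x it is 1
diag = [h2[i] + psi[i] for i in range(len(L))]
print(h2, psi, diag)
```

This is the bookkeeping behind the "sum to 1 for each variable" statement above; fitting the loadings themselves requires an estimation routine such as factanal.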
Factor Analysis Example using
S-Plus V6 (FactorAnal.ssc)
 S-Plus uses factanal, a weighted covariance estimation
function, to perform factor analysis.
 Using testscores, we analyze whether a two-factor
model (overall ability and closed vs. open book) explains
the overall variation in the scores.
 The two factor model explains about 80% of the variation in
the original data, with the first factor accounting for 45%.
 The rotated factor loadings indicate the importance of the
first overall ability factor and the relative effects of closed
and open book exams.
 Plots show the factor loadings and the biplot of test scores
in two-dimensional factor space.
Discriminant Function Analysis
 Concerned with allocating observations to one of several
a priori defined classes.
 Calibrated on a training sample in which membership is
known and then applied to test cases which are
unknown.
 In medicine, post-mortem information (classes based on
survival) can be used to classify at-risk patients for
mortality or morbidity.
 An observation is classified into one of two groups on a
series of measurements x₁, x₂, x₃, …, xₚ using a linear
function z of the variables: z = a₁x₁ + a₂x₂ + … + aₚxₚ
Discriminant Function Analysis (cont.)
 The coefficients maximize the ratio of the between-groups
variance of z to the within-groups variance: V = a′Ba/a′Sa.
 It is assumed that the data in both groups have a multivariate
normal distribution and that the covariance matrices of the
groups are equal.
 The function evaluates z for each observation: assign to group
1 if zi − zc < 0, and to group 2 if zi − zc ≥ 0.
 Performance can be assessed through the misclassification rate
on the known cases in the training set.
 Significance tests are available, including 1) Wilks' Λ or the
others previously mentioned, as well as Hotelling's multivariate
T²; 2) a test of whether the discriminant function differs
between groups; and 3) a chi-square test of the Mahalanobis
distances from observations to their group centers: a large
chi-square means it is unlikely that the observation came from
that group.
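The rule above can be sketched with made-up data for two groups on two variables (a hand-rolled version of what proc discrim automates): the coefficients are a = S⁻¹(x̄₁ − x̄₂) with S the pooled covariance, and the cutpoint zc is the midpoint of the two group-mean scores:

```python
# Fisher's linear discriminant for two groups on two variables
# (hypothetical data), with resubstitution error on the training set.
from statistics import mean

g1 = [(2.0, 3.0), (3.0, 3.5), (2.5, 4.0), (3.5, 4.5)]
g2 = [(6.0, 7.0), (7.0, 6.5), (6.5, 8.0), (7.5, 7.5)]

def mean_vec(rows):
    return [mean(c) for c in zip(*rows)]

def scatter(rows, m):
    s = [[0.0, 0.0], [0.0, 0.0]]
    for x, y in rows:
        dx, dy = x - m[0], y - m[1]
        s[0][0] += dx * dx; s[0][1] += dx * dy
        s[1][0] += dy * dx; s[1][1] += dy * dy
    return s

m1, m2 = mean_vec(g1), mean_vec(g2)
n1, n2 = len(g1), len(g2)
s1, s2 = scatter(g1, m1), scatter(g2, m2)
S = [[(s1[i][j] + s2[i][j]) / (n1 + n2 - 2) for j in range(2)] for i in range(2)]

det = S[0][0] * S[1][1] - S[0][1] * S[1][0]
Sinv = [[S[1][1] / det, -S[0][1] / det], [-S[1][0] / det, S[0][0] / det]]

d = [m1[0] - m2[0], m1[1] - m2[1]]
a = [Sinv[0][0] * d[0] + Sinv[0][1] * d[1],    # discriminant coefficients
     Sinv[1][0] * d[0] + Sinv[1][1] * d[1]]

def z(obs):
    return a[0] * obs[0] + a[1] * obs[1]

zc = (z(m1) + z(m2)) / 2                       # cutpoint between group means

def classify(obs):
    # assign to the group whose mean score lies on the same side of zc
    # (this keeps the rule independent of the sign convention for a)
    return 1 if (z(obs) - zc) * (z(m1) - zc) > 0 else 2

errors = sum(classify(o) != 1 for o in g1) + sum(classify(o) != 2 for o in g2)
print(errors)   # resubstitution misclassification count
```

With groups this well separated the resubstitution error is zero; on real data the resubstitution rate is optimistic, which is why cross-validation (as in the SAS example that follows) is preferred.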
Discriminant Function Analysis Example
using SAS V8 (SASHandDA.sas)
 Archeological study of two types of skulls from the Tibetan areas
of Sikkim or Kharis (fundamental human type). Five measurements
were taken on each of 32 skulls.
 This can be considered a training set for future classification;
the analysis will also identify the most important variables for
discrimination.
 Proc discrim output includes within-group and between-group
covariance matrices, covariance diagnostics, generalized
pairwise distances between groups, discriminant function
coefficients, and the misclassification (resubstitution) rate.
 Proc stepdisc finds that faceheight is the most important variable
for classifying members into groups. This is cross-validated with
another proc discrim run using only faceheight.
Cluster Analysis
 Concerned with allocating observations to discrete groups or
clusters that are not known in advance.
 A hierarchy of solutions, from single-observation clusters to a
single cluster containing all observations, is displayed in a
dendrogram. A particular clustering partition is considered
optimal based on statistical and practical criteria.
 Clustering methods operate on the inter-individual Euclidean
distance matrix calculated from the raw data.
 Single Linkage or Nearest Neighbors: two groups are merged at a
given distance if the closest individuals from each group are
within that distance.
 Complete Linkage or Furthest Neighbors: two groups merge only if
their most distant members are close enough together.
 Average Linkage: two groups merge if the average distance
between them is close enough.
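The three linkage rules can be sketched on made-up one-dimensional data (not the air-quality data): agglomerate from singletons, always merging the closest pair of clusters under the chosen rule, and record the merge distances that a dendrogram would plot as heights:

```python
# Agglomerative clustering with single, complete, and average linkage
# (hypothetical 1-D points).
def linkage_distance(c1, c2, dist, rule):
    pair = [dist(a, b) for a in c1 for b in c2]
    if rule == "single":   return min(pair)   # nearest neighbors
    if rule == "complete": return max(pair)   # furthest neighbors
    return sum(pair) / len(pair)              # average linkage

def agglomerate(points, rule):
    dist = lambda a, b: abs(points[a] - points[b])
    clusters = [[i] for i in range(len(points))]
    heights = []                               # dendrogram merge heights
    while len(clusters) > 1:
        # find the closest pair of clusters under the chosen rule
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: linkage_distance(clusters[ij[0]],
                                                   clusters[ij[1]],
                                                   dist, rule))
        heights.append(linkage_distance(clusters[i], clusters[j], dist, rule))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return heights

pts = [0.0, 0.5, 4.0, 4.6, 10.0]
print(agglomerate(pts, "single"))
print(agglomerate(pts, "complete"))
```

Different rules can merge the same points in a different order and at different heights, which is why the SAS example below runs several methods and compares the solutions.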
Cluster Analysis Example
using SAS V8 (SASHandCA.sas)
 Analysis of air quality in U.S. cities. The object is to
identify groups of similar cities for policy
intervention.
 Clustering variables include SO2, temperature, factories,
population, windspeed, rain, and rainydays.
 The first step is to look for outliers using proc univariate.
Chicago is an outlier on manufacturing and population,
and Phoenix has the lowest value on all three climate
variables; these cities are excluded from the analysis.
 Results from several runs each based on a different
clustering method are complex and require interpretation
and a feel for the technique.
Cluster Analysis Example
using SAS V8 (cont.)
 Cluster history indicates the stages at which various cities and
clusters are joined at particular distances along with other
diagnostics.
 A bimodality index of at least .55 suggests clustering on a
particular variable. Factories and population are at .55.
 The value of the cubic clustering criterion (CCC) is a guide to
the number of clusters in the data. It peaks at 4 clusters for
the single and complete linkage runs. The number of large
eigenvalues of the correlation matrix may also suggest the
dimensionality of the data. Four clusters is only an
approximation, as the evidence is not that clear.
 Dendrograms may also suggest evidence of structure but
generally do not make the optimal number of groups obvious.
Cluster Analysis Example
using SAS V8 (cont.)
 Means for clustering variables can be examined to
understand how clusters differ on the variables. Mean
differences on these variables can be tested.
 Clustering solutions can be displayed by plotting the data
in principal component space, since the components are linear
transformations of the clustering variables.
 In this example the first two principal components are
derived and the observations are graphed by cluster. The
clusters are distinct in location, although the solution is
not optimal.
 A box plot was created and means tested for differences
on the SO2 level.