MULTIPLE FACTOR ANALYSIS WITH ESTIMATED DATA
AN APPLICATION IN TUBERCULOSIS PREVALENCE
Dec 4, 2014
ISyE 6405 Fall 2014 Project
Chaoyi Wu Farida Jariwala
OVERVIEW
Problem Statement
• Background and motivation
• Challenges
Methodology
• Extend PCA to MFA (multiple factor analysis)
• Consider interval data in MFA
Analysis
• Data
• MFA modeling in R: package “FactoMineR”
• Output analysis
Conclusion and next steps
Reference
Problem Statement
Background and motivation
Tuberculosis (TB) is an infectious bacterial disease that most commonly
affects the lungs. It is second only to HIV/AIDS as the greatest
killer worldwide due to a single infectious agent. In 2013, 9 million people fell ill
with TB and 1.5 million died from the disease. People living with HIV are 26-31
times more likely to develop TB than people without HIV (WHO 2014).
Understanding the patterns among TB prevalence, HIV cases, healthcare resources, and other
factors can help curb TB prevalence.
Problem Statement
Challenges
• The number of potential factors that affect TB is large
Healthcare input: TB immunization, TB testing, general health expenditure
HIV
Tobacco use
Other
→ group the factors (variables) into categories and reduce dimensions with MFA
• The data (e.g., population, cases) are estimates reported with uncertainty intervals
→ include the intervals in the pattern analysis with vertices-method symbolic PCA (V-SPCA)
Snapshot from the WHO dataset: http://apps.who.int/gho/data
Methodology
• Multiple factor analysis (MFA)
MFA analyzes observations described by several groups (sets) of variables in
two steps:
(1) A PCA is performed on each group, and the group is then normalized: every
variable in the group receives the same weight, the inverse of the
largest eigenvalue of the group's PCA.
(2) The normalized data sets are merged into a single matrix, and a
global PCA is performed on this matrix.
Variables can be continuous or categorical, but all variables
within one group must be of the same type (Abdi and Valentin 2007).
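The normalization in step (1) can be sketched in a few lines. This is a minimal illustration, not the FactoMineR implementation: it assumes a standardized PCA on each group (so the group's eigenvalues are those of its correlation matrix), uses a simple power iteration to find the largest eigenvalue, and runs on made-up numbers. The function names and the toy values are hypothetical.

```python
import math

def standardize(column):
    """Center a column and scale it to unit (population) variance."""
    n = len(column)
    mean = sum(column) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in column) / n)
    return [(x - mean) / sd for x in column]

def correlation_matrix(columns):
    """Correlation matrix of one group, given as a list of columns."""
    z = [standardize(col) for col in columns]
    n = len(z[0])
    return [[sum(a * b for a, b in zip(zi, zj)) / n for zj in z] for zi in z]

def largest_eigenvalue(matrix, iterations=200):
    """Power iteration plus a Rayleigh quotient (adequate for small PSD matrices)."""
    k = len(matrix)
    v = [float(i + 1) for i in range(k)]  # asymmetric start vector
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(k)) for i in range(k)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    mv = [sum(matrix[i][j] * v[j] for j in range(k)) for i in range(k)]
    return sum(a * b for a, b in zip(v, mv))

def mfa_group_weights(groups):
    """Map each group name to 1 / (largest eigenvalue of that group's PCA)."""
    return {name: 1.0 / largest_eigenvalue(correlation_matrix(cols))
            for name, cols in groups.items()}

# Made-up values for five observations (NOT the WHO data used in the project).
toy_groups = {
    "HC": [[100, 200, 300, 400, 500],
           [110, 190, 310, 420, 480],
           [90, 210, 290, 410, 520]],
    "SMK": [[20, 25, 30, 22, 28]],
}
weights = mfa_group_weights(toy_groups)
for name, w in weights.items():
    print(name, round(w, 3))
```

In this toy case the three strongly correlated HC columns each receive a weight close to 1/3, while the single SMK column keeps a weight of 1, so neither group dominates the global PCA simply because it contains more variables.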
Methodology
• Use V-SPCA to handle interval data in MFA
Vertices-method symbolic PCA (V-SPCA) performs a classical PCA on interval
data. Given a dataset 𝑿 that contains 𝑵 observations described by 𝒑 variables
of interval type, derive a new dataset 𝑿^𝑽 from it, in which each observation is
replaced by the 2^𝒑 vertices of its interval box, and use the new dataset for
PCA (Zuccolotto 2006). The method remains valid for MFA.
Interval data:
Country   Prevalent TB cases   Number of adults aged 15 and over living with HIV
Chile     3500 [1500-6400]     39 000 [25 000-61 000]
Vertex data:
Country   Prevalent TB cases   Number of adults aged 15 and over living with HIV
Chile1    1500                 25 000
Chile2    1500                 61 000
Chile3    6400                 25 000
Chile4    6400                 61 000
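The vertex expansion can be illustrated with a short sketch. It assumes each interval variable is given as a (lower, upper) pair and enumerates every lower/upper combination; the function name `vertex_expand` is hypothetical, and the numbers are the Chile intervals from the table above.

```python
from itertools import product

def vertex_expand(name, intervals):
    """Return one row per vertex of the box defined by `intervals`.

    `intervals` is a list of (lower, upper) pairs, one per interval variable,
    so p variables produce 2**p rows.
    """
    rows = []
    for i, vertex in enumerate(product(*intervals), start=1):
        rows.append((f"{name}{i}", *vertex))
    return rows

# Prevalent TB cases and adults (15+) living with HIV for Chile.
chile = vertex_expand("Chile", [(1500, 6400), (25000, 61000)])
for row in chile:
    print(row)
```

With two interval variables each country becomes four rows; variables reported as single values would simply be repeated on every vertex row.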
Analysis
Data structure
• 7 countries: Sweden, Malaysia, Hungary, Sri Lanka, Chile, Mexico
• 19 variables, 8 groups
• Data for TB and HIV are estimates with confidence intervals
Group   Variable name        Note
TB      TB                   Prevalent TB cases in 2012 / total population (%)
HIV     HIV                  Adults (15+) living with HIV / population aged 15+ (%)
PM      PM10 2012            2012 PM10 (annual mean, µg/m3)
TST     DST 2012             2012 Laboratories providing DST (drug susceptibility testing) (per 5 million population)
TST     DST 2011             2011 Laboratories providing DST (drug susceptibility testing) (per 5 million population)
TST     DST 2010             2010 Laboratories providing DST (drug susceptibility testing) (per 5 million population)
TST     TB dgns clt 2012     2012 Laboratories providing TB diagnostic services using culture (per 5 million population)
TST     TB dgns clt 2011     2011 Laboratories providing TB diagnostic services using culture (per 5 million population)
TST     TB dgns clt 2010     2010 Laboratories providing TB diagnostic services using culture (per 5 million population)
TST     TB dgns micro 2012   2012 Laboratories providing TB diagnostic services using sputum smear microscopy (per 100 000 population)
TST     TB dgns micro 2011   2011 Laboratories providing TB diagnostic services using sputum smear microscopy (per 100 000 population)
TST     TB dgns micro 2010   2010 Laboratories providing TB diagnostic services using sputum smear microscopy (per 100 000 population)
GNI     GNI 2012             Gross national income per capita (current USD)
HC      Health exps 2010     2010 Health expenditure (public) per capita
HC      Health exps 2011     2011 Health expenditure (public) per capita
HC      Health exps 2012     2012 Health expenditure (public) per capita
BCG     BCG 1 y Imz 1992     1992 Immunization, BCG (% of one-year-old children)
BCG     BCG 1 y Imz 2012     2012 Immunization, BCG (% of one-year-old children)
SMK     SMK 2011             2011 Smoking prevalence (% of adults)
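The grouping in the table above can be written down as data and checked. The sketch below only counts variables per group, in column order; such a vector of group sizes is the kind of input that FactoMineR's `MFA` function takes as its `group` argument. The Python variable names are illustrative, and the deck does not show the authors' actual call.

```python
# (group, variable) pairs in the column order of the table above.
variables = [
    ("TB", "TB"), ("HIV", "HIV"), ("PM", "PM10 2012"),
    ("TST", "DST 2012"), ("TST", "DST 2011"), ("TST", "DST 2010"),
    ("TST", "TB dgns clt 2012"), ("TST", "TB dgns clt 2011"),
    ("TST", "TB dgns clt 2010"), ("TST", "TB dgns micro 2012"),
    ("TST", "TB dgns micro 2011"), ("TST", "TB dgns micro 2010"),
    ("GNI", "GNI 2012"),
    ("HC", "Health exps 2010"), ("HC", "Health exps 2011"),
    ("HC", "Health exps 2012"),
    ("BCG", "BCG 1 y Imz 1992"), ("BCG", "BCG 1 y Imz 2012"),
    ("SMK", "SMK 2011"),
]

# Count variables per group, keeping the order of first appearance.
group_sizes = {}
for group, _ in variables:
    group_sizes[group] = group_sizes.get(group, 0) + 1

print(len(variables), "variables in", len(group_sizes), "groups")
print(list(group_sizes.values()))
```

The counts confirm the 19 variables and 8 groups stated on the slide, with the TST group holding 9 of the 19 variables.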
Output analysis
• MFA with mean estimation
Eigenvalues           Dim.1    Dim.2    Dim.3    Dim.4    Dim.5    Dim.6
Variance              3.792    2.496    1.115    0.667    0.567    0.036
% of var              43.719   28.781   12.857   7.695    6.538    0.41
Cumulative % of var   43.719   72.5     85.357   93.052   99.59    100
Groups   Dim.1   cos2    Dim.2   cos2    Dim.3   cos2
TB       0.058   0.003   0.783   0.614   0.137   0.019
HIV      0.000   0.000   0.798   0.637   0.189   0.036
PM       0.560   0.314   0.002   0.000   0.007   0.000
TST      0.399   0.125   0.340   0.091   0.272   0.058
GNI      0.948   0.899   0.004   0.000   0.003   0.000
HC       0.969   0.939   0.000   0.000   0.007   0.000
BCG      0.829   0.688   0.105   0.011   0.007   0.000
SMK      0.027   0.001   0.463   0.215   0.495   0.245
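The "% of var" and "Cumulative % of var" rows follow directly from the eigenvalues: each percentage is an eigenvalue divided by the sum of all eigenvalues. The sketch below recomputes them from the rounded values in the eigenvalue table above, so small differences from the reported figures are expected.

```python
# Eigenvalues (row "Variance") from the mean-estimation table.
eigenvalues = [3.792, 2.496, 1.115, 0.667, 0.567, 0.036]

total = sum(eigenvalues)
percent = [100 * ev / total for ev in eigenvalues]

# Running sum of the percentages.
cumulative = []
running = 0.0
for p in percent:
    running += p
    cumulative.append(running)

for dim, (p, c) in enumerate(zip(percent, cumulative), start=1):
    print(f"Dim.{dim}: {p:.3f}% of variance, {c:.3f}% cumulative")
```

The recomputed percentages agree with the reported row to within about 0.01 percentage points, which is consistent with the eigenvalues having been rounded to three decimals.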
Output analysis
• MFA with interval estimation
Eigenvalues           Dim.1    Dim.2    Dim.3    Dim.4    Dim.5    Dim.6
Variance              3.794    2.38     1.087    0.658    0.57     0.149
% of var              43.749   27.445   12.532   7.59     6.572    1.72
Cumulative % of var   43.749   71.194   83.726   91.316   97.888   99.608
Groups   Dim.1   cos2    Dim.2   cos2    Dim.3   cos2
TB       0.059   0.003   0.688   0.474   0.148   0.022
HIV      0.000   0.000   0.719   0.517   0.207   0.043
PM       0.564   0.318   0.005   0.000   0.002   0.000
TST      0.402   0.127   0.358   0.100   0.259   0.053
GNI      0.944   0.890   0.004   0.000   0.002   0.000
HC       0.966   0.934   0.000   0.000   0.006   0.000
BCG      0.834   0.695   0.102   0.010   0.005   0.000
SMK      0.026   0.001   0.505   0.255   0.457   0.208
Output analysis
• Comparison
Cumulative % of var   Dim.1    Dim.2    Dim.3    Dim.4    Dim.5    Dim.6
Mean                  43.719   72.5     85.357   93.052   99.59    100
Interval              43.749   71.194   83.726   91.316   97.888   99.608
         Dim.1              Dim.2
Groups   Mean    Interval   Mean    Interval
TB       0.058   0.059      0.783   0.688
HIV      0.000   0.000      0.798   0.719
PM       0.560   0.564      0.002   0.005
TST      0.399   0.402      0.340   0.358
GNI      0.948   0.944      0.004   0.004
HC       0.969   0.966      0.000   0.000
BCG      0.829   0.834      0.105   0.102
SMK      0.027   0.026      0.463   0.505
The comparison shows that the two MFAs are almost the same. Why?
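How close the two analyses are can be quantified from the comparison table. The sketch below takes the group values on Dim.1 and Dim.2 for the mean and interval runs and reports, for each dimension, the group with the largest absolute change; the helper name `largest_shift` is hypothetical.

```python
groups = ["TB", "HIV", "PM", "TST", "GNI", "HC", "BCG", "SMK"]

# Group values from the comparison table (mean vs. interval estimation).
dim1_mean     = [0.058, 0.000, 0.560, 0.399, 0.948, 0.969, 0.829, 0.027]
dim1_interval = [0.059, 0.000, 0.564, 0.402, 0.944, 0.966, 0.834, 0.026]
dim2_mean     = [0.783, 0.798, 0.002, 0.340, 0.004, 0.000, 0.105, 0.463]
dim2_interval = [0.688, 0.719, 0.005, 0.358, 0.004, 0.000, 0.102, 0.505]

def largest_shift(mean_vals, interval_vals):
    """Return (group, |difference|) for the group that changes the most."""
    diffs = [abs(m - v) for m, v in zip(mean_vals, interval_vals)]
    idx = max(range(len(diffs)), key=diffs.__getitem__)
    return groups[idx], diffs[idx]

g1, d1 = largest_shift(dim1_mean, dim1_interval)
g2, d2 = largest_shift(dim2_mean, dim2_interval)
print(f"Dim.1: largest shift {d1:.3f} ({g1})")
print(f"Dim.2: largest shift {d2:.3f} ({g2})")
```

On Dim.1 no group moves by more than 0.005, while on Dim.2 the largest shifts are in TB and HIV, the two groups whose data are estimated with confidence intervals.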
Output analysis
• Pattern analysis with interval data
o The first two dimensions account for more than 70% of the variance
o {HIV, TB, SMK} and {PM, BCG, HC, GNI} are almost orthogonal
1. TB contributes almost nothing to the 1st dimension
2. The variables in the latter set are not strongly correlated with TB
o The individual contributions to the 2nd dimension call for further investigation:
the dimension separates TB and HIV from tobacco use
Guess: smoking prevalence in 2011 is not a reasonable factor
It confirms that HIV and TB are highly correlated
o The first dimension can be read as
an economic score: the higher, the better
(see the individual factor map in slide 9)
Variable             Contribution to Dim.2
TB                   -0.948
HIV                  -0.794
SMK.2011             0.71
Variable             Contribution to Dim.1
PM10                 -0.751
DST.2012             0.656
DST.2011             0.656
DST.2010             0.67
TB.dgns.clt.2012     -0.1
TB.dgns.clt.2011     -0.093
TB.dgns.clt.2010     -0.117
TB.dgns.micro.2012   -0.532
TB.dgns.micro.2011   -0.499
TB.dgns.micro.2010   -0.556
GNI                  0.971
Health.exps.2010     0.985
Health.exps.2011     0.984
Health.exps.2012     0.98
BCG.1.y.Imz.1992     -0.87
BCG.1.y.Imz.2012     -0.94
Conclusion and next steps
• HIV and TB are correlated
• V-SPCA makes little difference in this project
• Factor (variable) selection needs to be improved
• Model validation needs to be conducted
• The analysis of the dimensions can be more detailed
Reference
Abdi, H. and Valentin, D. (2007). Multiple Factor Analysis. In Neil Salkind
(Ed.): Encyclopedia of Measurement and Statistics. Thousand Oaks: Sage.
Zuccolotto, P. (2006). Principal Components of Sample Estimates: an
Approach through Symbolic Data Analysis. Stat Meth & Appl. 16. 173-192.
Springer (Verlag). DOI: 10.1007/s10260-006-0024-6.
Le, S., Josse, J. and Husson, F. (2008). FactoMineR: an R package for
multivariate analysis. Journal of Statistical Software. 25(1). American
Statistical Association.
WHO. (2014). World Health Organization/Media center/Tuberculosis:
http://www.who.int/mediacentre/factsheets/fs104/en/
The World Bank. (2014). The World Bank/Data: http://www.worldbank.org/
Thank you
Q & A
