1/24/2018 Project - Part 1
Statistical Modeling
Rashmi Subrahmanya (M12383010)
UNIVERSITY OF CINCINNATI
Contents
Tables
Executive Summary
Introduction
  Variable Dictionary
Chapter 1: Initial Data Exploration
  R Code
  R Output
  Observations
  Conclusion
Chapter 2: Data Cleaning and Exploration
  R Code
  R Output
  Observations
  Conclusion
Chapter 3: Data Analysis
  R Code
  R Output
  Observations
  Conclusion
Chapter 4: Regression Analysis
  R Code
  R Output
  Observations
  Conclusion
Chapter 5: Check for collinearity
  R Code
  R Output
  Observation
  Conclusion
Chapter 6: Variable Selection
  R Code
  R Output
  Observations
  Conclusion
Chapter 7: Variable Selection based on automated algorithm
  R Code
  R Output
  Observation
  Conclusion
Tables
Table 1: Pairwise correlation between distance and each factor
Table 2: Ranking factors based on p-value
Table 3: Ranking factors after standardization of variables
Table 4: Ranking of variables
Table 5: Checking for collinearity
Table 6: Comparing different models
Executive Summary
This project examines which factors influence the landing distance of flights, with the aim of
minimizing the risk of overrun. A summary of the variables used in the project is provided in the
Introduction. In chapter 1, two data sets are imported, examined and merged into one data set.
100 duplicate rows were found in the merged data set and removed. Missing values were observed
in the ‘duration’ and ‘speed_air’ columns. Summary statistics for each variable are provided in
this chapter. Chapter 2 checks for abnormal values, as defined by the variable dictionary. 17 rows
were found to contain abnormal values and were removed. A histogram of each variable is plotted
to understand its distribution.
In chapter 3, the correlation matrix is calculated, and scatter plots are used to see which factors
are correlated with landing distance, and the strength and direction of each correlation. The
aircraft variable is also recoded to numeric values. The predictor variables are ranked according
to strength of correlation. In chapter 4, landing distance is regressed on each predictor variable,
one at a time, and the p-values of the resulting linear regression models are noted. The variables
are then standardized, and the process is repeated. The ranking of predictor variables, in terms of
influence on landing distance, is the same in all three approaches: the correlation matrix, and the
regression models before and after standardization.
Chapter 5 checks for collinearity between predictor variables. Speed_air and speed_ground are
found to be highly correlated, and speed_ground is dropped from further analysis. In chapter 6,
linear regression models are built by adding one variable at a time. The R squared, adjusted R
squared and AIC values of the models are plotted against the number of variables. Based on
these, all variables except speed_ground would be used to build a predictive model for landing
distance. However, based on the p-values of the models, only speed_air, aircraft and height are
significant. Note also that the model is built using only 195 observations, because of missing
values in the speed_air and duration columns. In the final chapter, the stepAIC function in R is
used to perform forward variable selection. Based on the results, speed_air, height and aircraft
are used to build the predictive model.
Introduction
The goal of the project is to study which factors affect the landing distance of a commercial
flight, and how, in order to reduce the risk of landing overrun. The input data are landing
records from 950 commercial flights.
Variable Dictionary
Aircraft: The make of an aircraft (Boeing or Airbus).
Duration (in minutes): Flight duration between taking off and landing. The duration of a normal
flight should always be greater than 40 minutes.
No_pasg: The number of passengers in a flight.
Speed_ground (in miles per hour): The ground speed of an aircraft when passing over the
threshold of the runway. If its value is less than 30 mph or greater than 140 mph, then the landing
is considered abnormal.
Speed_air (in miles per hour): The air speed of an aircraft when passing over the threshold of
the runway. If its value is less than 30 mph or greater than 140 mph, then the landing is
considered abnormal.
Height (in meters): The height of an aircraft when it is passing over the threshold of the runway.
The landing aircraft is required to be at least 6 meters high at the threshold of the runway.
Pitch (in degrees): Pitch angle of an aircraft when it is passing over the threshold of the runway.
Distance (in feet): The landing distance of an aircraft. More specifically, it refers to the distance
between the threshold of the runway and the point where the aircraft can be fully stopped. The
length of the airport runway is typically less than 6000 feet.
Chapter 1: Initial Data Exploration
In this chapter, we import two data sets, FAA1 and FAA2, into R and look at their structure. Then,
we combine both data sets into a new data set named ‘FAA’. We check for duplicate rows in
the new data set and look at its structure. We also obtain summary statistics for each variable
in the FAA data set.
R Code
#Importing FAA1 and FAA2 excel files
FAA1 <- readxl::read_xls('FAA1.xls', col_names = TRUE)
FAA2 <- readxl::read_xls('FAA2.xls', col_names = TRUE)
#A look at first few rows of FAA1 and FAA2 data set
head(FAA1)
head(FAA2)
#Checking structure of the data sets
str(FAA1)
str(FAA2)
#Merging FAA1 and FAA2 into a single data set - FAA
FAA <- plyr::rbind.fill(FAA1,FAA2)
head(FAA)
#A look at structure of new data set
str(FAA)
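#Checking for duplicates. Rows from FAA2 have NA in duration, so whole rows cannot be
#compared directly; a repeated speed_ground value is used here to flag a repeated flight
dup_rows <- duplicated(FAA$speed_ground)
sum(dup_rows)
FAA <- FAA[!dup_rows, ]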
#Summary of each variable in FAA
summary(FAA)
R Output
#Structure of FAA1
> str(FAA1)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 800 obs. of 8 variables:
$ aircraft : chr "boeing" "boeing" "boeing" "boeing" ...
$ duration : num 98.5 125.7 112 196.8 90.1 ...
$ no_pasg : num 53 69 61 56 70 55 54 57 61 56 ...
$ speed_ground: num 107.9 101.7 71.1 85.8 59.9 ...
$ speed_air : num 109 103 NA NA NA ...
$ height : num 27.4 27.8 18.6 30.7 32.4 ...
$ pitch : num 4.04 4.12 4.43 3.88 4.03 ...
$ distance : num 3370 2988 1145 1664 1050 ...
#Structure of FAA2
> str(FAA2)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 150 obs. of 7 variables:
$ aircraft : chr "boeing" "boeing" "boeing" "boeing" ...
$ no_pasg : num 53 69 61 56 70 55 54 57 61 56 ...
$ speed_ground: num 107.9 101.7 71.1 85.8 59.9 ...
$ speed_air : num 109 103 NA NA NA ...
$ height : num 27.4 27.8 18.6 30.7 32.4 ...
$ pitch : num 4.04 4.12 4.43 3.88 4.03 ...
$ distance : num 3370 2988 1145 1664 1050 ...
#Structure of FAA (combined data set)
> str(FAA)
'data.frame': 850 obs. of 8 variables:
$ aircraft : chr "boeing" "boeing" "boeing" "boeing" ...
$ duration : num 98.5 125.7 112 196.8 90.1 ...
$ no_pasg : num 53 69 61 56 70 55 54 57 61 56 ...
$ speed_ground: num 107.9 101.7 71.1 85.8 59.9 ...
$ speed_air : num 109 103 NA NA NA ...
$ height : num 27.4 27.8 18.6 30.7 32.4 ...
$ pitch : num 4.04 4.12 4.43 3.88 4.03 ...
$ distance : num 3370 2988 1145 1664 1050 ...
#Summary of each variable
Observations
• FAA1 data set has 800 observations and 8 variables while FAA2 data set has 150
observations and 7 variables. Variable ‘duration’ is missing in FAA2 data set. Both FAA1
and FAA2 are data frames. In both data sets, variable ‘aircraft’ is of character data type,
while other variables are numeric.
• ‘duration’ column has 50 missing values, while speed_air has 642 missing values. There
are no missing values in other columns.
• The minimum values of duration and distance are too low, and the minimum height is negative.
These are abnormal values, as defined by the variable dictionary.
• There are 100 duplicate rows in the FAA data set, which are deleted from further analysis.
After removing duplicates, FAA has 850 observations and 8 variables. ‘aircraft’ is of
character data type, while the other variables are numeric.
Conclusion
• Duplicate rows are removed from further analysis as they do not provide any meaningful
information and they can affect results.
• Both speed_air and duration columns are retained for now, even though they have missing
values. They can be dropped later, if required.
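The duplicate and missing-value checks described in this chapter can be sketched on a small example. The data frames below are made-up stand-ins for FAA1 and FAA2, and base `rbind` is used in place of `plyr::rbind.fill`. Comparing rows on the columns that both files share is one possible way to find repeated flights when one file lacks ‘duration’; it is an assumption of this sketch, not necessarily the rule used above.

```r
# Toy stand-ins for FAA1 and FAA2 (values are made up for illustration)
faa1 <- data.frame(aircraft = c("boeing", "airbus", "boeing"),
                   duration = c(98.5, 125.7, 112.0),
                   speed_ground = c(107.9, 101.7, 71.1),
                   distance = c(3370, 2988, 1145),
                   stringsAsFactors = FALSE)
# The second file has no duration column and repeats the first flight
faa2 <- data.frame(aircraft = c("boeing", "airbus"),
                   speed_ground = c(107.9, 88.2),
                   distance = c(3370, 1500),
                   stringsAsFactors = FALSE)

# Stack the two sets; duration is NA for rows that come from faa2
faa2$duration <- NA_real_
faa <- rbind(faa1, faa2[, names(faa1)])

# Compare rows on the shared columns only, so that the NA duration
# does not hide a repeated flight
shared <- setdiff(names(faa), "duration")
is_dup <- duplicated(faa[, shared])
n_dup <- sum(is_dup)
faa_clean <- faa[!is_dup, ]

# Missing values per column after removing duplicates
n_missing <- colSums(is.na(faa_clean))
```

In this toy example one repeated flight is found and removed, and one missing duration remains.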
Chapter 2: Data Cleaning and Exploration
In this chapter, we check for abnormal values, as defined in the variable dictionary. If there are
any rows with abnormal values, we remove them. We plot a histogram of each variable to understand
its distribution.
R Code
#Removing abnormal values from data set
FAA <- FAA[(FAA$duration > 40 | is.na(FAA$duration)), ]
FAA <- FAA[(FAA$height >= 6), ]
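#Both bounds must hold at the same time, so they are joined with & (with | every non-missing value passes)
FAA <- FAA[((FAA$speed_air >= 30 & FAA$speed_air <= 140) | is.na(FAA$speed_air)), ]
FAA <- FAA[(FAA$speed_ground >= 30 & FAA$speed_ground <= 140), ]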
FAA <- FAA[(FAA$distance < 6000), ]
dim(FAA)
#summary of cleaned data set
summary(FAA)
#Plotting histogram of each variable
hist(FAA$duration, breaks = 30, main = 'Histogram of duration variable', xlab = 'Duration')
hist(FAA$no_pasg, breaks = 30, main = 'Histogram of no_pasg', xlab = 'Number of Passengers')
hist(FAA$speed_ground, breaks = 30, main = 'Histogram of speed_ground', xlab = 'Speed Ground')
hist(FAA$speed_air, breaks = 30, main = 'Histogram of speed_air', xlab = 'Speed Air')
hist(FAA$height, breaks = 30, main = 'Histogram of height', xlab = 'Height')
hist(FAA$pitch, breaks = 30, main = 'Histogram of pitch', xlab = 'Pitch')
hist(log(FAA$distance), breaks = 30, main = 'Histogram of log(distance)', xlab = 'log(Landing Distance)')
R Output
#Structure of FAA after removing abnormal values
> str(FAA)
'data.frame': 831 obs. of 8 variables:
$ aircraft : chr "boeing" "boeing" NA NA ...
$ duration : num 98.5 125.7 NA NA NA ...
$ no_pasg : num 53 69 NA NA NA NA NA NA NA NA ...
$ speed_ground: num 108 102 NA NA NA ...
$ speed_air : num 109 103 NA NA NA ...
$ height : num 27.4 27.8 NA NA NA ...
$ pitch : num 4.04 4.12 NA NA NA ...
$ distance : num 3370 2988 NA NA NA ...
#summary of cleaned data set
#Histogram of each variable
Figure 1: Histogram of duration
Figure 2: Histogram of no_pasg
Figure 3: Histogram of speed_ground
Figure 4: Histogram of speed_air
Figure 5: Histogram of height
Figure 6: Histogram of pitch
Figure 7: Histogram of distance
Observations
• There are abnormal values in the data set, as defined by the variable dictionary. There are
5 such values in duration, 3 in speed_ground, 1 in speed_air, 10 in height and 2 in distance
column.
• 17 rows/observations with abnormal values were removed.
• Distribution of speed_air shows that it is right-skewed.
Conclusion
• Final data set has 833 observations and 8 columns.
• The observations with abnormal values are deleted, since the number of such observations
is very low.
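The range rules in the variable dictionary can be sketched as a small filter. The helper `in_range` and the toy data below are hypothetical and only illustrate the logic: a value is kept when it satisfies the lower and the upper bound at the same time, and missing speed_air values are kept rather than treated as abnormal.

```r
# Toy landings (made-up values); NA in speed_air mimics the real data
toy <- data.frame(speed_ground = c(25, 80, 150, 95),
                  speed_air = c(NA, NA, 151, 96),
                  height = c(30, 5, 28, 40))

# Hypothetical helper: TRUE when x lies in [lo, hi]; a missing x is
# kept or dropped according to keep_na
in_range <- function(x, lo, hi, keep_na = FALSE) {
  ok <- x >= lo & x <= hi          # both bounds must hold
  ok[is.na(x)] <- keep_na
  ok
}

keep <- in_range(toy$speed_ground, 30, 140) &
  in_range(toy$speed_air, 30, 140, keep_na = TRUE) &
  in_range(toy$height, 6, Inf)
toy_clean <- toy[keep, ]

# Joining the two bounds with | instead is TRUE for every non-missing
# value, so such a filter would remove nothing
always_true <- toy$speed_ground >= 30 | toy$speed_ground <= 140
```

Only the last toy row passes all three rules. Setting missing values explicitly inside the helper also avoids the all-NA rows that R returns when a logical row index contains NA.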
Chapter 3: Data Analysis
This chapter comprises the initial data analysis, where we try to identify the factors which impact
the response variable, landing distance.
R Code
#Recoding aircraft to numeric values: Boeing - 0, Airbus - 1
FAA$aircraft <- ifelse(FAA$aircraft == 'boeing', 0, 1)
#Computing pairwise correlation and storing it for the correlation plot
FAACor <- cor(FAA, use = 'pairwise.complete.obs')
round(FAACor, 4)
corrplot::corrplot(FAACor, method = "ellipse")
#Scatter plots
par(mfrow = c(4, 2))
plot(FAA$aircraft, FAA$distance)
plot(FAA$duration, FAA$distance)
plot(FAA$no_pasg, FAA$distance)
plot(FAA$speed_ground, FAA$distance)
plot(FAA$speed_air, FAA$distance)
plot(FAA$height, FAA$distance)
plot(FAA$pitch, FAA$distance)
par(mfrow = c(1,1))
R Output
#Correlation Matrix
Figure 8: Correlation plot of variables
Table 1: Pairwise correlation between distance and each factor
Variables      Strength of correlation   Direction
Speed_air      0.9421                    Positive
Speed_ground   0.8608                    Positive
Aircraft       0.2369                    Negative
Height         0.0999                    Positive
Pitch          0.0863                    Positive
Duration       0.0520                    Negative
No_pasg        0.0173                    Negative
#Scatter plots
Figure 9: Scatter plot
Observations
• From the correlation matrix and scatter plots, speed_air and speed_ground are the factors
most strongly related to landing distance. Both have a strong positive correlation with it.
• Aircraft make is also related to landing distance, but the correlation is weak and
negative.
• The other factors are only weakly correlated with landing distance and have little impact on it.
Conclusion
• Speed_ground, speed_air and aircraft are the important factors which impact landing distance.
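A ranking such as the one in Table 1 can be produced directly from the pairwise correlations. The sketch below uses synthetic data (not the FAA values) with three assumed predictors, and ranks them by the absolute value of their correlation with distance.

```r
set.seed(1)
n <- 200
# Synthetic stand-in for the cleaned data (not the real FAA values)
sim <- data.frame(speed_air = rnorm(n, 104, 10),
                  height = rnorm(n, 30, 10),
                  no_pasg = round(rnorm(n, 60, 7)))
sim$distance <- 80 * sim$speed_air + 14 * sim$height + rnorm(n, 0, 100)

# Correlation of each predictor with the response
predictors <- setdiff(names(sim), "distance")
r <- sapply(predictors, function(v)
  cor(sim[[v]], sim$distance, use = "pairwise.complete.obs"))

# Rank by strength (absolute value) and record the direction separately
ranking <- data.frame(variable = predictors,
                      strength = round(abs(r), 4),
                      direction = ifelse(r > 0, "Positive", "Negative"),
                      row.names = NULL,
                      stringsAsFactors = FALSE)
ranking <- ranking[order(-ranking$strength), ]
```

Keeping strength and direction in separate columns mirrors the layout of Table 1, where the sign is reported in words.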
Chapter 4: Regression Analysis
In this chapter, we regress landing distance on each of the factors and observe the p-values of each
model. Then, we standardize each variable using the following formula:
X' = (X - mean(X)) / sd(X)
We regress landing distance on each of standardized variables and note down p-values. Then, we
compare results from correlation matrix and the regression analysis to see if the results are
consistent.
R Code
#Regression using single factor each time
model1 <- lm(distance ~ aircraft, data = FAA)
summary(model1)
model2 <- lm(distance ~ duration, data = FAA)
summary(model2)
model3 <- lm(distance ~ no_pasg, data = FAA)
summary(model3)
model4 <- lm(distance ~ speed_ground, data = FAA)
summary(model4)
model5 <- lm(distance ~ speed_air, data = FAA)
summary(model5)
model6 <- lm(distance ~ height, data = FAA)
summary(model6)
model7 <- lm(distance ~ pitch, data = FAA)
summary(model7)
#Standardizing and creating new variables
FAA$aircraft.std <- (FAA$aircraft - mean(FAA$aircraft))/sd(FAA$aircraft)
FAA$duration.std <- (FAA$duration - mean(FAA$duration, na.rm = TRUE))/sd(FAA$duration, na.rm = TRUE)
FAA$no_pasg.std <- (FAA$no_pasg - mean(FAA$no_pasg))/sd(FAA$no_pasg)
FAA$speed_ground.std <- (FAA$speed_ground - mean(FAA$speed_ground))/sd(FAA$speed_ground)
FAA$speed_air.std <- (FAA$speed_air - mean(FAA$speed_air, na.rm = TRUE))/sd(FAA$speed_air, na.rm = TRUE)
FAA$height.std <- (FAA$height - mean(FAA$height))/sd(FAA$height)
FAA$pitch.std <- (FAA$pitch - mean(FAA$pitch))/sd(FAA$pitch)
#Regression using standardized variables
model8 <- lm(distance ~ aircraft.std, data = FAA)
summary(model8)
model9 <- lm(distance ~ duration.std, data = FAA)
summary(model9)
model10 <- lm(distance ~ no_pasg.std, data = FAA)
summary(model10)
model11 <- lm(distance ~ speed_ground.std, data = FAA)
summary(model11)
model12 <- lm(distance ~ speed_air.std, data = FAA)
summary(model12)
model13 <- lm(distance ~ height.std, data = FAA)
summary(model13)
model14 <- lm(distance ~ pitch.std, data = FAA)
summary(model14)
R Output
#p-value of different models
Table 2: Ranking factors based on p-value
Variables      p-value    Direction of regression coefficient
Speed_air      <0.0001    Positive
Speed_ground   <0.0001    Positive
Aircraft       <0.0001    Negative
Height         0.00389    Positive
Pitch          0.0127     Positive
Duration       0.146      Negative
No_pasg        0.618      Negative
#p-value after standardizing the variables
Table 3: Ranking factors after standardization of variables
Variables      p-value    Direction of regression coefficient
Speed_air      <0.0001    Positive
Speed_ground   <0.0001    Positive
Aircraft       <0.0001    Negative
Height         0.00389    Positive
Pitch          0.0127     Positive
Duration       0.146      Negative
No_pasg        0.618      Negative
Observations
• Comparing the results in tables 1, 2 and 3, we observe that they are consistent. In table 4
below, the factors are ranked based on their relative importance in determining the landing
distance.
Table 4: Ranking of variables
Rank   Variable
1      Speed_air
2      Speed_ground
3      Aircraft
4      Height
5      Pitch
6      Duration
7      No_pasg
• Speed_air, speed_ground, height and pitch have a positive correlation with landing distance,
while aircraft has a negative correlation with landing distance.
Conclusion
Assuming a significance level of 0.05, speed_air, speed_ground, aircraft, height and pitch are the
most important factors influencing landing distance.
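Tables 2 and 3 report identical p-values, and this is expected: standardizing a predictor is a linear rescaling, which changes the size of the slope but not its t statistic. A minimal sketch on synthetic data (the variables `x` and `y` are stand-ins, not FAA columns) illustrates this.

```r
set.seed(2)
n <- 150
x <- rnorm(n, 30, 10)                    # stand-in for a predictor such as height
y <- 1500 + 14 * x + rnorm(n, 0, 800)    # stand-in for landing distance

# Same standardization formula as in this chapter
x_std <- (x - mean(x)) / sd(x)

fit_raw <- summary(lm(y ~ x))$coefficients
fit_std <- summary(lm(y ~ x_std))$coefficients

p_raw <- fit_raw["x", "Pr(>|t|)"]
p_std <- fit_std["x_std", "Pr(>|t|)"]

# The slope is multiplied by sd(x); the t statistic and p-value are unchanged
slope_ratio <- fit_std["x_std", "Estimate"] / fit_raw["x", "Estimate"]
```

What standardization does change is the scale of the coefficients: after it, each slope is the change in distance per one standard deviation of the predictor, so the slopes (rather than the p-values) become comparable across variables.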
Chapter 5: Check for collinearity
Speed_air and speed_ground provide largely the same information. In this chapter, we check for
correlation between speed_air and speed_ground. If the two variables are highly correlated, we
retain only one of them.
R Code
#Checking for collinearity between speed_ground and speed_air
model1 <- lm (distance ~ speed_ground, data = FAA)
summary(model1)
model2 <- lm (distance ~ speed_air, data = FAA)
summary(model2)
model3 <- lm (distance ~ speed_ground + speed_air, data = FAA)
summary(model3)
#Correlation between speed_ground and speed_air
cor(FAA$speed_air, FAA$speed_ground, use = "pairwise.complete.obs")
R Output
#Table showing regression coefficients
Table 5: Checking for collinearity
Model Number   Model                           Variable       Regression Coefficient   p-value
1              LD ~ speed_ground               Speed_ground   40.8252                  <0.0001
2              LD ~ speed_air                  Speed_air      79.532                   <0.0001
3              LD ~ speed_ground + speed_air   Speed_ground   -14.37                   0.258
                                               Speed_air      93.96                    <0.0001
Observation
• We observe from models 1 and 2 that both speed_ground and speed_air are significant
factors in determining landing distance. However, according to model 3, only speed_air is a
significant factor (p-value < 0.0001).
• We also observe a sign change in the regression coefficient of speed_ground and a change in
its significance. The p-value of speed_ground in model 3 is greater than 0.05, which would
suggest that it is not a significant factor. This contradicts model 1 and is a symptom of collinearity.
• We can say that collinearity exists, that is, speed_ground and speed_air are correlated with
each other. In fact, they have a strong correlation of 0.9879. It is better to drop
one of them, since including both might result in an unstable model.
• Speed_air can be considered as speed_ground plus wind speed.
Conclusion
Speed_air is retained, even though it has a lot of missing values, for the following reasons:
• It is an important factor, from domain knowledge.
• From the scatter plot, speed_air has a nearly linear relation with landing distance,
which is not the case for speed_ground. This makes it possible to fit a linear regression
model for speed_air.
• The speed_air column has the observations required for predicting landing overrun, while for
speed_ground a large portion of the observations is below 90 mph, which is not very
useful in predicting landing overrun.
• It is easier to get values of speed_air.
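One way to quantify the collinearity reported in this chapter is the variance inflation factor (VIF). For a model with exactly two predictors, the VIF of each is 1 / (1 - r^2), where r is their correlation. The sketch below applies this to the reported correlation of 0.9879; the function name is hypothetical, and the VIF itself is an addition here, not something computed in the analysis above.

```r
# Variance inflation factor for either predictor in a model that
# contains exactly two predictors with correlation r
vif_two_predictors <- function(r) 1 / (1 - r^2)

r_speed <- 0.9879   # correlation between speed_air and speed_ground reported above
vif_speed <- vif_two_predictors(r_speed)
round(vif_speed, 2)   # 41.57

# Standard errors are inflated by the square root of the VIF
se_inflation <- sqrt(vif_speed)
```

A VIF of about 41.6 means the standard errors of the two speed coefficients are roughly 6.4 times larger than they would be for uncorrelated predictors, which is consistent with the loss of significance of speed_ground in model 3. A common rule of thumb treats values above 10 as a sign of serious collinearity.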
Chapter 6: Variable Selection
We fit models based on the variable ranking in table 4 by adding one variable at a time. We obtain
the R squared, adjusted R squared and AIC values for each model.
R Code
#Fitting models by adding one variable at a time, in the order of table 4 (speed_ground excluded)
model1 <- lm(distance ~ speed_air, data = FAA)
model2 <- lm(distance ~ speed_air + aircraft, data = FAA)
model3 <- lm(distance ~ speed_air + aircraft + height, data = FAA)
model4 <- lm(distance ~ speed_air + aircraft + height + pitch, data = FAA)
model5 <- lm(distance ~ speed_air + aircraft + height + pitch + duration, data = FAA)
model6 <- lm(distance ~ speed_air + aircraft + height + pitch + duration + no_pasg, data = FAA)
models <- list(model1, model2, model3, model4, model5, model6)
#Plotting R squared values against number of predictors
r.squared <- sapply(models, function(m) summary(m)$r.squared)
plot(1:6, r.squared, type = "b", ylab = "R squared", xlab = "Number of predictors")
#Plotting Adjusted R squared values against number of predictors
r.adj.squared <- sapply(models, function(m) summary(m)$adj.r.squared)
plot(1:6, r.adj.squared, type = "b", ylab = "Adjusted R squared", xlab = "Number of predictors")
#Plotting AIC values against number of predictors
r.AIC <- sapply(models, AIC)
plot(1:6, r.AIC, type = "b", ylab = "AIC", xlab = "Number of predictors")
R Output
Table 6: Comparing different models
Model Number   Model                                                                  R-squared value   Adjusted R-squared value   AIC value
1              LD ~ speed_air                                                         0.8875            0.8870                     2862.423
2              LD ~ speed_air + aircraft                                              0.9493            0.9488                     2702.784
3              LD ~ speed_air + aircraft + height                                     0.9737            0.9733                     2571.310
4              LD ~ speed_air + aircraft + height + pitch                             0.9737            0.9732                     2573.300
5              LD ~ speed_air + aircraft + height + pitch + duration                  0.9744            0.9737                     2473.168
6              LD ~ speed_air + aircraft + height + pitch + duration + no_pasg        0.9747            0.9739                     2473.010
Figure 10: Plot of R squared values of different models Vs Number of predictors
Figure 11: Plot of Adjusted R Squared of different models Vs Number of predictors
Figure 12: Plot of AIC of different models Vs Number of predictors
Observations
• Model 6 has the highest adjusted R squared value and the lowest AIC value. When comparing
models, we prefer the one with the higher adjusted R squared value, that is, the one whose
predictor variables better explain the variation in the dependent variable, and the one with
the lower AIC value. Models 5 and 6 include duration, which has missing values, so they are
fitted on fewer observations than models 1 to 4. Their AIC values are therefore not strictly
comparable with those of the smaller models.
• However, if we look at the p-values of the models, only speed_air, aircraft and height are
significant.
• It should also be noted that, in the final data set, speed_air has 630 missing values. While
modeling, only 195 observations are taken into consideration.
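Because speed_air and duration both contain missing values, `lm()` drops a different set of rows depending on which variables a model uses, and AIC values are only comparable between models fitted to the same observations. The sketch below, on synthetic data with assumed column names, shows the effect and one possible remedy: refitting every candidate model on the complete cases.

```r
set.seed(3)
n <- 120
# Synthetic stand-in data (not the FAA values)
sim <- data.frame(speed_air = rnorm(n, 104, 10),
                  duration = rnorm(n, 150, 50))
sim$distance <- 80 * sim$speed_air + rnorm(n, 0, 150)
sim$duration[1:20] <- NA        # mimic missing flight durations

# Fitted separately, lm() silently uses a different number of rows
m_small <- lm(distance ~ speed_air, data = sim)
m_large <- lm(distance ~ speed_air + duration, data = sim)
c(nobs(m_small), nobs(m_large))   # 120 and 100

# Refit both on the complete cases so that the AIC values are comparable
cc <- na.omit(sim)
m_small_cc <- lm(distance ~ speed_air, data = cc)
m_large_cc <- lm(distance ~ speed_air + duration, data = cc)
aic_cc <- c(small = AIC(m_small_cc), large = AIC(m_large_cc))
```

Checking `nobs()` for each fitted model is a quick way to see whether a drop in AIC reflects a better fit or simply a smaller sample.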
Conclusion
• Based on adjusted R squared and AIC values, I would choose speed_air, aircraft, height, pitch,
duration and no_pasg to build the predictive model for landing distance.
• Based on the p-values of the models, I would choose speed_air, aircraft and height.
• The final model is as follows:
Distance = -5796.9430 + (81.9833 * speed_air) - (437.8295 * aircraft) + (13.71 * height)
• Among the influential factors, height has the least impact on landing distance.
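As a worked example of the final model above, the equation can be wrapped in a small function. The function name and the example landing (100 mph air speed, 30 m height) are hypothetical; the coefficients are copied from the equation, with aircraft coded as 0 for Boeing and 1 for Airbus as in chapter 3.

```r
# Coefficients copied from the final model stated above
predict_distance <- function(speed_air, aircraft, height) {
  -5796.9430 + 81.9833 * speed_air - 437.8295 * aircraft + 13.71 * height
}

# Hypothetical landing: 100 mph air speed, 30 m height at the threshold
d_boeing <- predict_distance(speed_air = 100, aircraft = 0, height = 30)
d_airbus <- predict_distance(speed_air = 100, aircraft = 1, height = 30)

round(c(boeing = d_boeing, airbus = d_airbus), 1)   # 2812.7 and 2374.9

# Compare with the typical runway length from the variable dictionary
overrun_risk <- c(boeing = d_boeing, airbus = d_airbus) > 6000
```

For this landing the predicted distances are about 2813 feet for a Boeing and about 2375 feet for an Airbus, both well under 6000 feet. The difference between the two equals the aircraft coefficient, 437.8 feet.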
Chapter 7: Variable Selection based on automated algorithm
In this chapter, the stepAIC function in R is used to perform forward variable selection. The results
are compared with those obtained in the previous chapter to see if they are consistent.
R Code
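#Forward selection starts from the intercept-only model; rows with missing values are dropped first
FAA_step <- na.omit(FAA[, c("distance", "speed_air", "aircraft", "height", "pitch", "duration", "no_pasg")])
null_model <- lm(distance ~ 1, data = FAA_step)
full_model <- lm(distance ~ ., data = FAA_step)
step <- MASS::stepAIC(null_model, scope = list(lower = ~ 1, upper = formula(full_model)), direction = "forward")
summary(step)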
R Output
#Summary of stepAIC function
Observation
• Using the stepAIC function to perform forward variable selection, I would select three
variables to build the predictive model for landing distance. From the output above, it can be seen
that the p-values for aircraft, speed_air and height are significant.
Conclusion
• The final model after using the stepAIC function is as follows:
Distance = -5791.6573 - (437.9428 * aircraft) + (85.5469 * speed_air) + (13.6756 * height)
• Based on the results from chapters 6 and 7, I would choose speed_air, aircraft and height in the
final model.
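Forward selection adds one variable at a time to an intercept-only model, at each step choosing the variable that lowers AIC the most, and stops when no addition helps. The sketch below shows the mechanics on synthetic data. It uses base R's `step()` rather than `MASS::stepAIC` so that it has no package dependency; the data and the three candidate predictors are assumptions for illustration.

```r
set.seed(4)
n <- 200
# Synthetic stand-in data (not the FAA values); no_pasg is pure noise here
sim <- data.frame(speed_air = rnorm(n, 104, 10),
                  height = rnorm(n, 30, 10),
                  no_pasg = round(rnorm(n, 60, 7)))
sim$distance <- 80 * sim$speed_air + 14 * sim$height + rnorm(n, 0, 100)

# Start from the intercept-only model and let step() add terms
null_model <- lm(distance ~ 1, data = sim)
selected <- step(null_model,
                 scope = list(lower = ~ 1,
                              upper = ~ speed_air + height + no_pasg),
                 direction = "forward", trace = 0)

# Names of the variables that were added
chosen <- attr(terms(selected), "term.labels")
```

The starting model matters: with `direction = "forward"`, a search that starts from a model already containing every variable has nothing left to add and returns that model unchanged.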

More Related Content

PPTX
Example flow process charts
PDF
Numerical investigation of winglet angles influence on vortex shedding
PDF
A study on gap acceptance of unsignalized intersection under mixed traffic co...
PDF
IRJET-CFD Analysis of conceptual Aircraft body
PDF
Vehicle Headway Distribution Models on Two-Lane Two-Way Undivided Roads
PDF
STUDY OF FIVE-AXIS COMMERCIAL SOFTWARE POST-PROCESSOR CONVERSION APPLIED TO A...
PPTX
Presentation1
PDF
Volume 2-issue-6-2177-2185
Example flow process charts
Numerical investigation of winglet angles influence on vortex shedding
A study on gap acceptance of unsignalized intersection under mixed traffic co...
IRJET-CFD Analysis of conceptual Aircraft body
Vehicle Headway Distribution Models on Two-Lane Two-Way Undivided Roads
STUDY OF FIVE-AXIS COMMERCIAL SOFTWARE POST-PROCESSOR CONVERSION APPLIED TO A...
Presentation1
Volume 2-issue-6-2177-2185

Similar to Rashmi subrahmanya project (20)

DOCX
Structures proyect
PDF
Assignment 5
PDF
finalReport
PDF
IRJET- Aerodynamic Analysis of Aircraft Wings using CFD
PDF
Low Cost Airports in India Part 1 - Applying implementation frameworks 7-S mo...
PDF
Trajectory pricing for the European Air Traffic Management system using modul...
PDF
A Linear Programming Solution To The Gate Assignment Problem At Airport Termi...
PDF
Aviation articles - Aircraft Evaluation and selection
PDF
Final Report Wind Tunnel
PDF
M.G.Goman, A.V.Khramtsovsky (2008) - Computational framework for investigatio...
PDF
Assignment 6
PDF
Aviation Article : Getting The Right Picture
PDF
Autonomous cargo transporter report
PDF
Goman, Khramtsovsky, Shapiro (2001) – Aerodynamics Modeling and Dynamics Simu...
PDF
CFD Analysis of conceptual Aircraft body
PDF
CONTAINER TRAFFIC PROJECTIONS USING AHP MODEL IN SELECTING REGIONAL TRANSHIPM...
PDF
Rashmi subrahmanya project

  • 1. 1/24/2018 Project - Part 1 Statistical Modeling Rashmi Subrahmanya (M12383010) UNIVERSITY OF CINCINNATI
  • 2. Contents
Tables
Executive Summary
Introduction
    Variable Dictionary
Chapter 1: Initial Data Exploration
    R Code, R Output, Observations, Conclusion
Chapter 2: Data Cleaning and Exploration
    R Code, R Output, Observations, Conclusion
Chapter 3: Data Analysis
    R Code, R Output, Observations, Conclusion
Chapter 4: Regression Analysis
    R Code, R Output, Observations, Conclusion
Chapter 5: Check for collinearity
    R Code, R Output, Observation, Conclusion
Chapter 6: Variable Selection
    R Code, R Output, Observations, Conclusion
Chapter 7: Variable Selection based on automated algorithm
    R Code, R Output, Observation, Conclusion
  • 3. Tables
Table 1: Pairwise correlation between distance and each factor
Table 2: Ranking factors based on p-value
Table 3: Ranking factors after standardization of variables
Table 4: Ranking of variables
Table 5: Checking for collinearity
Table 6: Comparing different models
  • 4. Executive Summary
This project aims to identify the factors that influence the landing distance of flights, in order to minimize the risk of overrun. A summary of the variables used in the project is provided in the Introduction section.
In chapter 1, two data sets are imported, examined and merged into one data set. 100 duplicate rows were found in the merged data set and removed. Missing values were observed in the ‘duration’ and ‘speed_air’ columns. Summary statistics of each variable are provided in this chapter.
Chapter 2 checks for abnormal values as defined by the variable dictionary. 17 rows were found to contain abnormal values and were removed. A histogram of each variable is plotted to understand its distribution.
In chapter 3, the correlation matrix is calculated and scatter plots are used to see which factors are correlated with landing distance, along with the strength and direction of each correlation. The aircraft variable is also recoded. The predictor variables are ranked according to strength of correlation.
In chapter 4, landing distance is regressed on each predictor variable, one at a time, and the p-values of the resulting linear regression models are noted. The variables are then standardized and the process is repeated. The ranking of the predictor variables, in terms of influence on landing distance, is the same under all three approaches: the correlation matrix and the regression models before and after standardization.
Chapter 5 checks for collinearity between predictor variables. Speed_air and speed_ground are found to be highly correlated, and speed_ground is dropped from further analysis.
In chapter 6, linear regression models are built by adding one variable at a time. The R squared, adjusted R squared and AIC values of the models are plotted against the number of variables. Based on these, all variables except speed_ground are used to build a predictive model for landing distance. However, based on the p-values of the models, only speed_air, aircraft and height are significant. Note also that the model is built using only 195 observations because of missing values in the speed_air and duration columns.
In the final chapter, the stepAIC function in R is used to perform forward variable selection. Based on the results, speed_air, height and aircraft are used to build the predictive model.
  • 5. Introduction
The goal of the project is to study which factors affect the landing distance of a commercial flight, and how, in order to reduce the risk of landing overrun. The input data are landing records from 950 commercial flights.
Variable Dictionary
    Aircraft: The make of an aircraft (Boeing or Airbus).
    Duration (in minutes): Flight duration between taking off and landing. The duration of a normal flight should always be greater than 40 minutes.
    No_pasg: The number of passengers in a flight.
    Speed_ground (in miles per hour): The ground speed of an aircraft when passing over the threshold of the runway. If its value is less than 30 MPH or greater than 140 MPH, the landing is considered abnormal.
    Speed_air (in miles per hour): The air speed of an aircraft when passing over the threshold of the runway. If its value is less than 30 MPH or greater than 140 MPH, the landing is considered abnormal.
    Height (in meters): The height of an aircraft when it is passing over the threshold of the runway. The landing aircraft is required to be at least 6 meters high at the threshold of the runway.
    Pitch (in degrees): Pitch angle of an aircraft when it is passing over the threshold of the runway.
    Distance (in feet): The landing distance of an aircraft. More specifically, it refers to the distance between the threshold of the runway and the point where the aircraft can be fully stopped. The length of the airport runway is typically less than 6000 feet.
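The rules in the variable dictionary can be written down as a single check. The following is a minimal sketch, not the report's own code: the helper name `is_abnormal` and the toy data are hypothetical, missing values are assumed not to count as abnormal, and a distance of 6000 feet or more is treated as abnormal to mirror the cleaning step used later in the report.

```r
# Hypothetical helper: TRUE for rows that break at least one dictionary rule.
# Assumption: a missing value (NA) is never flagged as abnormal.
is_abnormal <- function(df) {
  # TRUE where x is present and lies outside [lo, hi]
  out_of_range <- function(x, lo, hi) !is.na(x) & (x < lo | x > hi)

  (!is.na(df$duration) & df$duration <= 40) |   # flight of 40 minutes or less
    out_of_range(df$speed_ground, 30, 140) |    # ground speed outside 30-140 MPH
    out_of_range(df$speed_air, 30, 140) |       # air speed outside 30-140 MPH
    (!is.na(df$height) & df$height < 6) |       # below 6 meters at the threshold
    (!is.na(df$distance) & df$distance >= 6000) # beyond a typical runway length
}

# Toy data (made up for illustration only)
toy <- data.frame(
  duration     = c(100, 35, NA, 120, 90),
  speed_ground = c(80, 85, 90, 150, 70),
  speed_air    = c(NA, NA, 101, NA, NA),
  height       = c(30, 25, 28, 27, 4),
  distance     = c(1500, 1600, 2500, 5000, 900)
)

flags <- is_abnormal(toy)
print(flags)
```

Each range rule combines its two bounds so that a value must satisfy both to pass, and the explicit `!is.na()` guards keep rows with missing duration or speed_air from being flagged.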
  • 6. Chapter 1: Initial Data Exploration
In this chapter, we import two data sets, FAA1 and FAA2, into R and look at their structure. We then combine both data sets into a new data set named ‘FAA’, check it for duplicate rows and look at its structure. We also obtain summary statistics of each variable in the FAA data set.
R Code
    #Importing FAA1 and FAA2 excel files
    FAA1 <- readxl::read_xls('FAA1.xls', col_names = TRUE)
    FAA2 <- readxl::read_xls('FAA2.xls', col_names = TRUE)
    #A look at first few rows of FAA1 and FAA2 data set
    head(FAA1)
    head(FAA2)
    #Checking structure of the data sets
    str(FAA1)
    str(FAA2)
    #Merging FAA1 and FAA2 into a single data set - FAA
    FAA <- plyr::rbind.fill(FAA1, FAA2)
    head(FAA)
    #A look at structure of new data set
    str(FAA)
    #Checking for duplicates (rows are matched on speed_ground only)
    FAA_dup <- FAA[duplicated(FAA$speed_ground), ]
    nrow(FAA_dup)
    FAA <- FAA[!duplicated(FAA$speed_ground), ]
    #Summary of each variable in FAA
    summary(FAA)
R Output
    #Structure of FAA1
    > str(FAA1)
    Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 800 obs. of 8 variables:
     $ aircraft : chr "boeing" "boeing" "boeing" "boeing" ...
     $ duration : num 98.5 125.7 112 196.8 90.1 ...
     $ no_pasg : num 53 69 61 56 70 55 54 57 61 56 ...
     $ speed_ground: num 107.9 101.7 71.1 85.8 59.9 ...
     $ speed_air : num 109 103 NA NA NA ...
  • 7.
     $ height : num 27.4 27.8 18.6 30.7 32.4 ...
     $ pitch : num 4.04 4.12 4.43 3.88 4.03 ...
     $ distance : num 3370 2988 1145 1664 1050 ...
    #Structure of FAA2
    > str(FAA2)
    Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 150 obs. of 7 variables:
     $ aircraft : chr "boeing" "boeing" "boeing" "boeing" ...
     $ no_pasg : num 53 69 61 56 70 55 54 57 61 56 ...
     $ speed_ground: num 107.9 101.7 71.1 85.8 59.9 ...
     $ speed_air : num 109 103 NA NA NA ...
     $ height : num 27.4 27.8 18.6 30.7 32.4 ...
     $ pitch : num 4.04 4.12 4.43 3.88 4.03 ...
     $ distance : num 3370 2988 1145 1664 1050 ...
    #Structure of FAA (combined data set)
    > str(FAA)
    'data.frame': 850 obs. of 8 variables:
     $ aircraft : chr "boeing" "boeing" "boeing" "boeing" ...
     $ duration : num 98.5 125.7 112 196.8 90.1 ...
     $ no_pasg : num 53 69 61 56 70 55 54 57 61 56 ...
     $ speed_ground: num 107.9 101.7 71.1 85.8 59.9 ...
     $ speed_air : num 109 103 NA NA NA ...
     $ height : num 27.4 27.8 18.6 30.7 32.4 ...
     $ pitch : num 4.04 4.12 4.43 3.88 4.03 ...
     $ distance : num 3370 2988 1145 1664 1050 ...
    #Summary of each variable
Observations
    • FAA1 data set has 800 observations and 8 variables, while FAA2 data set has 150 observations and 7 variables. Variable ‘duration’ is missing in FAA2 data set. Both FAA1 and FAA2 are data frames. In both data sets, variable ‘aircraft’ is of character data type, while the other variables are numeric.
    • ‘duration’ column has 50 missing values, while speed_air has 642 missing values. There are no missing values in other columns.
  • 8.
    • The minimum values of duration and distance are too low, while the minimum height is negative. These are abnormal values, as defined by the variable dictionary.
    • There are 100 duplicate rows in the FAA data set, which are deleted from further analysis. After removing duplicates, FAA has 850 observations and 8 variables. ‘aircraft’ is of character data type, while the other variables are of numeric type.
Conclusion
    • Duplicate rows are removed from further analysis, as they do not provide any meaningful information and can affect results.
    • Both speed_air and duration columns are retained for now, even though they have missing values. They can be dropped later, if required.
Chapter 2: Data Cleaning and Exploration
In this chapter, we check for abnormal values, as defined in the variable dictionary. If there are any rows with abnormal values, we remove them. We plot a histogram of each variable to understand its distribution.
R Code
    #Removing abnormal values from data set (rows with missing duration or speed_air are kept)
    FAA <- FAA[(FAA$duration > 40 | is.na(FAA$duration)), ]
    FAA <- FAA[(FAA$height >= 6), ]
    #Both bounds must hold, so the range checks use & rather than |
    FAA <- FAA[((FAA$speed_air >= 30 & FAA$speed_air <= 140) | is.na(FAA$speed_air)), ]
    FAA <- FAA[(FAA$speed_ground >= 30 & FAA$speed_ground <= 140), ]
    FAA <- FAA[(FAA$distance < 6000), ]
    dim(FAA)
    #summary of cleaned data set
    summary(FAA)
    #Plotting histogram of each variable
    hist(FAA$duration, breaks = 30, main = 'Histogram of duration variable', xlab = 'Duration')
    hist(FAA$no_pasg, breaks = 30, main = 'Histogram of no_pasg', xlab = 'Number of Passengers')
    hist(FAA$speed_ground, breaks = 30, main = 'Histogram of speed_ground', xlab = 'Speed Ground')
    hist(FAA$speed_air, breaks = 30, main = 'Histogram of speed_air', xlab = 'Speed Air')
    hist(FAA$height, breaks = 30, main = 'Histogram of height', xlab = 'Height')
    hist(FAA$pitch, breaks = 30, main = 'Histogram of pitch', xlab = 'Pitch')
    hist(log(FAA$distance), breaks = 30, main = 'Histogram of log(distance)', xlab = 'log(Landing Distance)')
R Output
    #Structure of FAA after removing abnormal values
  • 9.
    > str(FAA)
    'data.frame': 831 obs. of 8 variables:
     $ aircraft : chr "boeing" "boeing" NA NA ...
     $ duration : num 98.5 125.7 NA NA NA ...
     $ no_pasg : num 53 69 NA NA NA NA NA NA NA NA ...
     $ speed_ground: num 108 102 NA NA NA ...
     $ speed_air : num 109 103 NA NA NA ...
     $ height : num 27.4 27.8 NA NA NA ...
     $ pitch : num 4.04 4.12 NA NA NA ...
     $ distance : num 3370 2988 NA NA NA ...
    #summary of cleaned data set
    #Histogram of each variable
Figure 1: Histogram of duration
Figure 2: Histogram of no_pasg
Figure 3: Histogram of speed_ground
Figure 4: Histogram of speed_air
  • 10.
Figure 5: Histogram of height
Figure 6: Histogram of pitch
Figure 7: Histogram of distance
Observations
    • There are abnormal values in the data set, as defined by the variable dictionary. There are 5 such values in duration, 3 in speed_ground, 1 in speed_air, 10 in height and 2 in distance column.
    • 17 rows/observations with abnormal values were removed.
    • Distribution of speed_air shows that it is right-skewed.
Conclusion
    • Final data set has 833 observations and 8 columns.
    • The observations with abnormal values are deleted, since the number of such observations is very low.
  • 11. Chapter 3: Data Analysis
This chapter comprises the initial data analysis, where we try to identify factors which impact the response variable, landing distance.
R Code
    library(corrplot)
    #Recoding aircraft to numeric values: Boeing - 0, Airbus - 1
    FAA$aircraft <- ifelse(FAA$aircraft == 'boeing', 0, 1)
    #Computing pairwise correlation and storing it for the correlation plot
    FAACor <- cor(FAA, use = 'pairwise.complete.obs')
    round(FAACor, 4)
    corrplot(FAACor, method = "ellipse")
    #Scatter plots of distance against each factor
    par(mfrow = c(4, 2))
    plot(FAA$aircraft, FAA$distance)
    plot(FAA$duration, FAA$distance)
    plot(FAA$no_pasg, FAA$distance)
    plot(FAA$speed_ground, FAA$distance)
    plot(FAA$speed_air, FAA$distance)
    plot(FAA$height, FAA$distance)
    plot(FAA$pitch, FAA$distance)
    par(mfrow = c(1, 1))
R Output
    #Correlation Matrix
  • 12.
Figure 8: Correlation plot of variables
Table 1: Pairwise correlation between distance and each factor
    Variables | Strength of correlation | Direction
    Speed_air | 0.9421 | Positive
    Speed_ground | 0.8608 | Positive
    Aircraft | 0.2369 | Negative
    Height | 0.0999 | Positive
    Pitch | 0.0863 | Positive
    Duration | 0.0520 | Negative
    No_pasg | 0.0173 | Negative
    #Scatter plots
  • 13.
Figure 9: Scatter plot
Observations
    • From the correlation matrix and scatter plots, it is evident that speed_ground and speed_air are the most important factors affecting landing distance, and both are strongly and positively correlated with it.
    • Aircraft make also affects landing distance, but the correlation is weak and negative.
    • The other factors are only weakly correlated with landing distance and have little impact on it.
Conclusion
    • Speed_ground, speed_air and aircraft are important factors which impact landing distance.
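The ranking in Table 1 can be produced programmatically rather than read off the correlation matrix. The sketch below is illustrative only: the function name `rank_by_correlation` is hypothetical, and it runs on simulated data with arbitrary coefficients, not on the FAA data set.

```r
# Simulated stand-in data (assumption: not the FAA data set)
set.seed(1)
n <- 200
sim <- data.frame(
  speed_air = rnorm(n, 100, 10),
  height    = rnorm(n, 30, 10),
  no_pasg   = round(rnorm(n, 60, 7))
)
sim$distance <- 80 * sim$speed_air + 14 * sim$height + rnorm(n, sd = 300)
sim$speed_air[sample(n, 50)] <- NA  # mimic missing speed_air values

# Hypothetical helper: rank predictors by |correlation| with the response,
# using pairwise complete observations as in the report.
rank_by_correlation <- function(df, response) {
  predictors <- setdiff(names(df), response)
  r <- sapply(predictors, function(v) {
    cor(df[[v]], df[[response]], use = "pairwise.complete.obs")
  })
  out <- data.frame(
    variable  = predictors,
    strength  = round(abs(r), 4),
    direction = ifelse(r >= 0, "Positive", "Negative"),
    row.names = NULL
  )
  out[order(-out$strength), ]
}

ranking <- rank_by_correlation(sim, "distance")
print(ranking)
```

Sorting on the absolute value keeps a strong negative correlation, such as the one for aircraft in Table 1, from being ranked below a weak positive one.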
  • 14. Chapter 4: Regression Analysis
In this chapter, we regress landing distance on each of the factors and observe the p-value of each model. Then, we standardize each variable using the following formula:
    X’ = (X - mean(X)) / sd(X)
We regress landing distance on each of the standardized variables and note down the p-values. Then, we compare results from the correlation matrix and the regression analysis to see if they are consistent.
R Code
    #Regression using single factor each time
    model1 <- lm(distance ~ aircraft, data = FAA)
    summary(model1)
    model2 <- lm(distance ~ duration, data = FAA)
    summary(model2)
    model3 <- lm(distance ~ no_pasg, data = FAA)
    summary(model3)
    model4 <- lm(distance ~ speed_ground, data = FAA)
    summary(model4)
    model5 <- lm(distance ~ speed_air, data = FAA)
    summary(model5)
    model6 <- lm(distance ~ height, data = FAA)
    summary(model6)
    model7 <- lm(distance ~ pitch, data = FAA)
    summary(model7)
    #Standardizing and creating new variables (na.rm = TRUE for columns with missing values)
    FAA$aircraft.std <- (FAA$aircraft - mean(FAA$aircraft))/sd(FAA$aircraft)
    FAA$duration.std <- (FAA$duration - mean(FAA$duration, na.rm = TRUE))/sd(FAA$duration, na.rm = TRUE)
    FAA$no_pasg.std <- (FAA$no_pasg - mean(FAA$no_pasg))/sd(FAA$no_pasg)
    FAA$speed_ground.std <- (FAA$speed_ground - mean(FAA$speed_ground))/sd(FAA$speed_ground)
    FAA$speed_air.std <- (FAA$speed_air - mean(FAA$speed_air, na.rm = TRUE))/sd(FAA$speed_air, na.rm = TRUE)
    FAA$height.std <- (FAA$height - mean(FAA$height))/sd(FAA$height)
    FAA$pitch.std <- (FAA$pitch - mean(FAA$pitch))/sd(FAA$pitch)
    #Regression using standardized variables
    model8 <- lm(distance ~ aircraft.std, data = FAA)
    summary(model8)
    model9 <- lm(distance ~ duration.std, data = FAA)
  • 15.
    summary(model9)
    model10 <- lm(distance ~ no_pasg.std, data = FAA)
    summary(model10)
    model11 <- lm(distance ~ speed_ground.std, data = FAA)
    summary(model11)
    model12 <- lm(distance ~ speed_air.std, data = FAA)
    summary(model12)
    model13 <- lm(distance ~ height.std, data = FAA)
    summary(model13)
    model14 <- lm(distance ~ pitch.std, data = FAA)
    summary(model14)
R Output
    #p-value of different models
Table 2: Ranking factors based on p-value
    Variables | p-value | Direction of regression coefficient
    Speed_air | <0.0001 | Positive
    Speed_ground | <0.0001 | Positive
    Aircraft | <0.0001 | Negative
    Height | 0.00389 | Positive
    Pitch | 0.0127 | Positive
    Duration | 0.146 | Negative
    No_pasg | 0.618 | Negative
    #p-value after standardizing the variables
Table 3: Ranking factors after standardization of variables
    Variables | p-value | Direction of regression coefficient
    Speed_air | <0.0001 | Positive
    Speed_ground | <0.0001 | Positive
    Aircraft | <0.0001 | Negative
    Height | 0.00389 | Positive
    Pitch | 0.0127 | Positive
    Duration | 0.146 | Negative
    No_pasg | 0.618 | Negative
Observations
    • Comparing results from tables 1, 2 and 3, we observe that the results are consistent. The p-values in tables 2 and 3 are identical, as expected: standardization is a linear rescaling of the predictor, which changes the size of the regression coefficient but not its t-statistic. In table 4 below, the factors are ranked based on their relative importance in determining the landing distance.
  • 16.
Table 4: Ranking of variables
    Rank | Variable
    1 | Speed_air
    2 | Speed_ground
    3 | Aircraft
    4 | Height
    5 | Pitch
    6 | Duration
    7 | No_pasg
    • Speed_air, speed_ground, height and pitch are positively correlated with landing distance, while aircraft is negatively correlated with it.
Conclusion
Assuming a significance level of 0.05, speed_air, speed_ground, aircraft, height and pitch are the most important factors influencing landing distance.
Chapter 5: Check for collinearity
Speed_air and speed_ground provide largely the same information. In this chapter, we check for correlation between speed_air and speed_ground. If there is high correlation between the two variables, we retain only one of them.
R Code
    #Checking for collinearity between speed_ground and speed_air
    model1 <- lm(distance ~ speed_ground, data = FAA)
    summary(model1)
    model2 <- lm(distance ~ speed_air, data = FAA)
    summary(model2)
    model3 <- lm(distance ~ speed_ground + speed_air, data = FAA)
    summary(model3)
    #Correlation between speed_ground and speed_air
    cor(FAA$speed_air, FAA$speed_ground, use = "pairwise.complete.obs")
R Output
    #Table showing regression coefficients
  • 17.
Table 5: Checking for collinearity
    Model Number | Model | Variable | Regression Coefficient | p-value
    1 | LD ~ speed_ground | Speed_ground | 40.8252 | <0.0001
    2 | LD ~ speed_air | Speed_air | 79.532 | <0.0001
    3 | LD ~ speed_ground + speed_air | Speed_ground | -14.37 | 0.258
      |  | Speed_air | 93.96 | <0.0001
Observation
    • We observe from models 1 and 2 that both speed_ground and speed_air are significant factors in determining landing distance. However, according to model 3, only speed_air is a significant factor (p-value < 0.0001).
    • We also observe a sign change in the regression coefficient of speed_ground and a change in its significance. The p-value of speed_ground in model 3 is greater than 0.05, suggesting that it is not a significant factor, which contradicts model 1.
    • We can say that collinearity exists, that is, speed_ground and speed_air are correlated with each other. In fact, they are strongly correlated, with a correlation coefficient of 0.9879. It is better to drop one of them, since including both might result in an unstable model.
    • Speed_air can be considered as speed_ground plus wind speed.
Conclusion
Speed_air is retained, even though it has a lot of missing values, for the following reasons:
    • It is an important factor, from domain knowledge.
    • From the scatter plot, speed_air has a nearly linear relation with landing distance, which is not the case for speed_ground. This makes it possible to fit a linear regression model for speed_air.
    • The speed_air column has the observations required for predicting landing overrun, while for speed_ground a large portion of the observations is below 90 mph, which is not very useful in predicting landing overrun.
    • It is easier to get values of speed_air.
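A common numeric companion to the sign-flip check in Table 5 is the variance inflation factor (VIF), which the report does not compute. The sketch below is an assumption-laden illustration: `vif_manual` is a hypothetical helper, the data are simulated so that speed_air is roughly speed_ground plus a small disturbance, and the coefficients are arbitrary.

```r
# Simulated stand-in data (assumption: not the FAA data set)
set.seed(42)
n <- 200
speed_ground <- rnorm(n, 100, 10)
speed_air    <- speed_ground + rnorm(n, 0, 1.5)  # nearly collinear by construction
height       <- rnorm(n, 30, 10)
distance     <- 80 * speed_air + 14 * height + rnorm(n, sd = 200)
sim <- data.frame(distance, speed_ground, speed_air, height)

# Hypothetical helper: VIF_j = 1 / (1 - R^2_j), where R^2_j comes from
# regressing predictor j on all the other predictors.
vif_manual <- function(df, predictors) {
  sapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    r2 <- summary(lm(reformulate(others, response = p), data = df))$r.squared
    1 / (1 - r2)
  })
}

vifs <- vif_manual(sim, c("speed_ground", "speed_air", "height"))
print(round(vifs, 2))

# Pairwise correlation, as computed in the report
print(cor(sim$speed_air, sim$speed_ground))
```

In this simulated setting the two speed variables get large VIFs while height stays near 1, which is the same pattern that motivates keeping only one of the two speeds.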
  • 18. Chapter 6: Variable Selection
We fit models based on the variable ranking in table 4 by adding one variable at a time. We obtain R squared, adjusted R squared and AIC values for each model.
R Code
    #Fitting models by adding one variable at a time (models as listed in table 6)
    model1 <- lm(distance ~ speed_air, data = FAA)
    model2 <- lm(distance ~ speed_air + aircraft, data = FAA)
    model3 <- lm(distance ~ speed_air + aircraft + height, data = FAA)
    model4 <- lm(distance ~ speed_air + aircraft + height + pitch, data = FAA)
    model5 <- lm(distance ~ speed_air + aircraft + height + pitch + duration, data = FAA)
    model6 <- lm(distance ~ speed_air + aircraft + height + pitch + duration + no_pasg, data = FAA)
    #Plotting R squared values against number of parameters
    r.squared.1 <- summary(model1)$r.squared
    r.squared.2 <- summary(model2)$r.squared
    r.squared.3 <- summary(model3)$r.squared
    r.squared.4 <- summary(model4)$r.squared
    r.squared.5 <- summary(model5)$r.squared
    r.squared.6 <- summary(model6)$r.squared
    plot(c(1,2,3,4,5,6), c(r.squared.1,r.squared.2,r.squared.3,r.squared.4,r.squared.5,r.squared.6), type = "b", ylab = "R squared", xlab = "Number of predictors")
    #Plotting Adjusted R squared values against number of parameters
    r.adj.squared.1 <- summary(model1)$adj.r.squared
    r.adj.squared.2 <- summary(model2)$adj.r.squared
    r.adj.squared.3 <- summary(model3)$adj.r.squared
    r.adj.squared.4 <- summary(model4)$adj.r.squared
    r.adj.squared.5 <- summary(model5)$adj.r.squared
    r.adj.squared.6 <- summary(model6)$adj.r.squared
    plot(c(1,2,3,4,5,6), c(r.adj.squared.1,r.adj.squared.2,r.adj.squared.3,r.adj.squared.4,r.adj.squared.5,r.adj.squared.6), type = "b", ylab = "Adjusted R squared", xlab = "Number of predictors")
    #Plotting AIC values against number of parameters
    r.AIC.1 <- AIC(model1)
    r.AIC.2 <- AIC(model2)
    r.AIC.3 <- AIC(model3)
    r.AIC.4 <- AIC(model4)
    r.AIC.5 <- AIC(model5)
    r.AIC.6 <- AIC(model6)
    plot(c(1,2,3,4,5,6), c(r.AIC.1,r.AIC.2,r.AIC.3,r.AIC.4,r.AIC.5,r.AIC.6), type = "b", ylab = "AIC", xlab = "Number of predictors")
R Output
Table 6: Comparing different models
    Model Number | Model | R-squared value | Adjusted R-squared value | AIC value
    1 | LD ~ speed_air | 0.8875 | 0.8870 | 2862.423
    2 | LD ~ speed_air + aircraft | 0.9493 | 0.9488 | 2702.784
    3 | LD ~ speed_air + aircraft + height | 0.9737 | 0.9733 | 2571.310
    4 | LD ~ speed_air + aircraft + height + pitch | 0.9737 | 0.9732 | 2573.300
    5 | LD ~ speed_air + aircraft + height + pitch + duration | 0.9744 | 0.9737 | 2473.168
    6 | LD ~ speed_air + aircraft + height + pitch + duration + no_pasg | 0.9747 | 0.9739 | 2473.010
  • 19.
Figure 10: Plot of R squared values of different models vs. number of predictors
Figure 11: Plot of adjusted R squared of different models vs. number of predictors
Figure 12: Plot of AIC of different models vs. number of predictors
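One caveat when reading Table 6: `lm` drops rows with missing values separately for each model, so models that include duration are fitted on fewer rows than models that do not, and AIC values are only directly comparable across models fitted to the same observations. The sharp AIC drop between models 4 and 5 may partly reflect this. The sketch below shows one way to build the same kind of comparison on a common set of complete cases. It uses simulated data with arbitrary coefficients and hypothetical object names, so it is an illustration, not a re-analysis of the FAA data.

```r
# Simulated stand-in data (assumption: not the FAA data set)
set.seed(7)
n <- 300
sim <- data.frame(
  speed_air = rnorm(n, 104, 10),
  aircraft  = rbinom(n, 1, 0.5),
  height    = rnorm(n, 30, 10),
  duration  = rnorm(n, 150, 50)
)
sim$distance <- -5800 + 82 * sim$speed_air - 440 * sim$aircraft +
  14 * sim$height + rnorm(n, sd = 130)
sim$duration[sample(n, 30)] <- NA  # mimic missing duration values

# Keep only rows that are complete for every candidate predictor,
# so that all nested models use exactly the same observations.
complete <- na.omit(sim)
ordered_predictors <- c("speed_air", "aircraft", "height", "duration")

comparison <- do.call(rbind, lapply(seq_along(ordered_predictors), function(k) {
  fit <- lm(reformulate(ordered_predictors[1:k], response = "distance"),
            data = complete)
  s <- summary(fit)
  data.frame(
    n_predictors  = k,
    n_obs         = nobs(fit),
    r_squared     = s$r.squared,
    adj_r_squared = s$adj.r.squared,
    aic           = AIC(fit)
  )
}))
print(comparison)
```

The `n_obs` column makes it easy to confirm that every row of the comparison is based on the same number of observations before the AIC values are compared.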
  • 20.
Observations
    • Model 6 has the highest adjusted R squared value and the lowest AIC value. When comparing models, we prefer the one with the higher adjusted R squared value, that is, the one whose predictor variables better explain the variation in the dependent variable. We also prefer the model with the lower AIC value.
    • However, if we look at the p-values of the models, only speed_air, aircraft and height are significant.
    • It should also be noted that, in the final data set, speed_air has 630 missing values. While modeling, only 195 observations are taken into consideration.
Conclusion
    • Based on adjusted R squared and AIC values, I would choose speed_air, aircraft, height, pitch, duration and no_pasg (model 6) to build a predictive model for landing distance.
    • Based on p-values of models, I would choose speed_air, aircraft and height.
    • The final model is as follows: Distance = -5796.9430 + (81.9833 * speed_air) - (437.8295 * aircraft) + (13.71 * height)
    • Among the influential factors, height has the least impact on landing distance.
Chapter 7: Variable Selection based on automated algorithm
In this chapter, the stepAIC function in R is used to perform forward variable selection. The results are compared with those obtained in the previous chapter to see if they are consistent.
R Code
    library(MASS)
    #Forward selection starts from the intercept-only model; speed_ground is excluded (chapter 5)
    FAA_cc <- na.omit(FAA[, c('distance', 'aircraft', 'duration', 'no_pasg', 'speed_air', 'height', 'pitch')])
    model <- lm(distance ~ 1, data = FAA_cc)
    step <- stepAIC(model, scope = ~ aircraft + duration + no_pasg + speed_air + height + pitch, direction = "forward")
    summary(step)
R Output
    #Summary of stepAIC function
  • 21.
Observation
    • Using the stepAIC function to perform forward variable selection, I would select three variables to build a predictive model for landing distance. From the output above, it can be seen that the p-values for aircraft, speed_air and height are significant.
Conclusion
    • The final model after using the stepAIC function is as follows: Distance = -5791.6573 - (437.9428 * aircraft) + (85.5469 * speed_air) + (13.6756 * height)
    • Based on the results from chapters 6 and 7, I would choose speed_air, aircraft and height in the final model.
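Forward selection with `stepAIC` adds terms to a starting model, so it is usually started from an intercept-only model with the candidate predictors supplied through the `scope` argument. The sketch below illustrates that pattern on simulated data. The data, coefficients and object names are assumptions made for illustration, and the output is not the report's result.

```r
library(MASS)  # provides stepAIC

# Simulated stand-in data (assumption: not the FAA data set)
set.seed(11)
n <- 200
sim <- data.frame(
  speed_air = rnorm(n, 104, 10),
  aircraft  = rbinom(n, 1, 0.5),
  height    = rnorm(n, 30, 10),
  pitch     = rnorm(n, 4, 0.5),
  no_pasg   = round(rnorm(n, 60, 7))
)
sim$distance <- -5800 + 82 * sim$speed_air - 440 * sim$aircraft +
  14 * sim$height + rnorm(n, sd = 130)

# Start from the intercept-only model and let stepAIC add terms
# one at a time while AIC keeps decreasing.
null_model <- lm(distance ~ 1, data = sim)
selected <- stepAIC(
  null_model,
  scope = list(lower = ~ 1,
               upper = ~ speed_air + aircraft + height + pitch + no_pasg),
  direction = "forward",
  trace = FALSE
)

chosen <- attr(terms(selected), "term.labels")
print(chosen)
summary(selected)
```

Because all candidate models must be fitted to the same rows, data with missing values would need to be reduced to complete cases before calling `stepAIC`; the simulated data here has none.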