Rashmi subrahmanya project

1/24/2018 Project - Part 1
Statistical Modeling
Rashmi Subrahmanya (M12383010)
UNIVERSITY OF CINCINNATI

Contents
Tables
Executive Summary
Introduction
  Variable Dictionary
Chapter 1: Initial Data Exploration
  R Code
  R Output
  Observations
  Conclusion
Chapter 2: Data Cleaning and Exploration
  R Code
  R Output
  Observations
  Conclusion
Chapter 3: Data Analysis
  R Code
  R Output
  Observations
  Conclusion
Chapter 4: Regression Analysis
  R Code
  R Output
  Observations
  Conclusion
Chapter 5: Check for collinearity
  R Code
  R Output
  Observation
  Conclusion
Chapter 6: Variable Selection
  R Code
  R Output
  Observations
  Conclusion
Chapter 7: Variable Selection based on an automated algorithm
  R Code
  R Output
  Observation
  Conclusion
Tables
Table 1: Pairwise correlation between distance and each factor
Table 2: Ranking factors based on p-value
Table 3: Ranking factors after standardization of variables
Table 4: Ranking of variables
Table 5: Checking for collinearity
Table 6: Comparing different models

Executive Summary
This project investigates which factors influence the landing distance of flights, with the aim of
minimizing the risk of overrun. A summary of the variables used in the project is provided in the
Introduction section. In Chapter 1, two data sets are imported and examined. They are then merged
into one data set. 100 duplicate rows were found in the merged data set and were subsequently
removed. Missing values were also found in the ‘duration’ and ‘speed_air’ columns. Summary
statistics of each variable are provided in this chapter. Chapter 2 checks for abnormal values as
defined by the variable dictionary. 17 rows were found to contain abnormal values and were
removed. A histogram of each variable is plotted to understand its distribution.
In Chapter 3, a correlation matrix is calculated, and scatter plots are used to see which factors
are correlated with landing distance, and with what strength and direction. The aircraft variable is
also recoded. The predictor variables are ranked according to strength of correlation. In Chapter 4,
landing distance is regressed on each predictor variable, one at a time, and the p-values of the
resulting linear regression models are noted. The variables are then standardized, and the process
is repeated. The rank of the predictor variables, in terms of influence on landing distance, is the
same in all three approaches: the correlation matrix and the regression models before and after
standardization.
Chapter 5 checks for collinearity between predictor variables. Speed_air and speed_ground are
found to be highly correlated, so speed_ground is dropped from further analysis. In Chapter 6,
linear regression models are built by adding one variable at a time. The R squared, adjusted R
squared and AIC values of the models are plotted against the number of variables. Based on this,
all variables except speed_ground are used to build a predictive model for landing distance.
However, based on the p-values of the models, only speed_air, aircraft and height are significant.
Note also that the model is built using only 195 observations because of missing values in the
speed_air and duration columns. In the final chapter, the stepAIC function in R is used to perform
forward variable selection. Based on the results, speed_air, height and aircraft are used to build
the predictive model.

Introduction
The goal of the project is to study which factors affect the landing distance of a commercial
flight, and how, in order to reduce the risk of landing overrun. The input data are landing records
from 950 commercial flights.
Variable Dictionary
Aircraft: The make of an aircraft (Boeing or Airbus).
Duration (in minutes): Flight duration between taking off and landing. The duration of a normal
flight should always be greater than 40 minutes.
No_pasg: The number of passengers in a flight.
Speed_ground (in miles per hour): The ground speed of an aircraft when passing over the
threshold of the runway. If its value is less than 30 mph or greater than 140 mph, then the landing
would be considered abnormal.
Speed_air (in miles per hour): The air speed of an aircraft when passing over the threshold of
the runway. If its value is less than 30 mph or greater than 140 mph, then the landing would be
considered abnormal.
Height (in meters): The height of an aircraft when it is passing over the threshold of the runway.
The landing aircraft is required to be at least 6 meters high at the threshold of the runway.
Pitch (in degrees): Pitch angle of an aircraft when it is passing over the threshold of the runway.
Distance (in feet): The landing distance of an aircraft. More specifically, it refers to the distance
between the threshold of the runway and the point where the aircraft can be fully stopped. The
length of the airport runway is typically less than 6000 feet.

Chapter 1: Initial Data Exploration
In this chapter, we import two data sets FAA1 and FAA2 in R and look at their structure. Then,
we combine both data sets to get a new data set named ‘FAA’. We check for duplicate values in
the new data set and have a look at its structure. We also obtain summary statistics of each variable
in FAA data set.
R Code
#Importing FAA1 and FAA2 excel files
FAA1 <- readxl::read_xls('FAA1.xls', col_names = TRUE)
FAA2 <- readxl::read_xls('FAA2.xls', col_names = TRUE)
#A look at first few rows of FAA1 and FAA2 data set
head(FAA1)
head(FAA2)
#Checking structure of the data sets
str(FAA1)
str(FAA2)
#Merging FAA1 and FAA2 into a single data set - FAA
FAA <- plyr::rbind.fill(FAA1,FAA2)
head(FAA)
#A look at structure of new data set
str(FAA)
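#Checking for duplicates. Rows are compared on speed_ground only: FAA2 has no duration column,
#so its rows carry NA for duration after the merge and a full-row comparison would not match them
FAA_dup <- FAA[duplicated(FAA$speed_ground), ]
nrow(FAA_dup)
FAA <- FAA[!duplicated(FAA$speed_ground), ]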
#Summary of each variable in FAA
summary(FAA)
R Output
#Structure of FAA1
> str(FAA1)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 800 obs. of 8 variables:
$ aircraft : chr "boeing" "boeing" "boeing" "boeing" ...
$ duration : num 98.5 125.7 112 196.8 90.1 ...
$ no_pasg : num 53 69 61 56 70 55 54 57 61 56 ...
$ speed_ground: num 107.9 101.7 71.1 85.8 59.9 ...
$ speed_air : num 109 103 NA NA NA ...

$ height : num 27.4 27.8 18.6 30.7 32.4 ...
$ pitch : num 4.04 4.12 4.43 3.88 4.03 ...
$ distance : num 3370 2988 1145 1664 1050 ...
#Structure of FAA2
> str(FAA2)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 150 obs. of 7 variables:
$ no_pasg : num 53 69 61 56 70 55 54 57 61 56 ...
$ speed_ground: num 107.9 101.7 71.1 85.8 59.9 ...
$ height : num 27.4 27.8 18.6 30.7 32.4 ...
$ pitch : num 4.04 4.12 4.43 3.88 4.03 ...
$ distance : num 3370 2988 1145 1664 1050 ...
#Structure of FAA (combined data set, after removing duplicates)
> str(FAA)
'data.frame': 850 obs. of 8 variables:
$ duration : num 98.5 125.7 112 196.8 90.1 ...
$ no_pasg : num 53 69 61 56 70 55 54 57 61 56 ...
$ speed_ground: num 107.9 101.7 71.1 85.8 59.9 ...
$ height : num 27.4 27.8 18.6 30.7 32.4 ...
$ pitch : num 4.04 4.12 4.43 3.88 4.03 ...
$ distance : num 3370 2988 1145 1664 1050 ...
#Summary of each variable
Observations
• FAA1 data set has 800 observations and 8 variables while FAA2 data set has 150
observations and 7 variables. Variable ‘duration’ is missing in FAA2 data set. Both FAA1
and FAA2 are data frames. In both data sets, variable ‘aircraft’ is of character data type,
while other variables are numeric.
• ‘duration’ column has 50 missing values, while speed_air has 642 missing values. There
are no missing values in other columns.

• Minimum value of duration and distance is too low while minimum height is negative.
These are abnormal values, as defined by variable dictionary.
• There are 100 duplicate rows in FAA data set, which are deleted from further analysis.
After removing duplicates, FAA has 850 observations and 8 variables. ‘aircraft’ is of
character data type, while other variables are of numeric type.
Conclusion
• Duplicate rows are removed from further analysis as they do not provide any meaningful
information and they can affect results.
• Both speed_air and duration columns are retained for now, even though they have missing
values. They can be dropped later, if required.
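The duplicate check in this chapter keys on a single column. A minimal sketch with made-up rows (the toy data frames `a` and `b` and all their values are illustrative assumptions, not the FAA data) shows why a full-row comparison would miss duplicates once one source lacks a column, and how a comparison restricted to shared columns finds them:

```r
# Toy stand-ins for FAA1 and FAA2; all values are invented for illustration
a <- data.frame(duration = c(98.5, 125.7, 112.0),
                speed_ground = c(107.9, 101.7, 71.1),
                distance = c(3370, 2988, 1145))
b <- data.frame(speed_ground = c(107.9, 101.7, 60.0),
                distance = c(3370, 2988, 900))

# Give b the missing column (as NA) so the two can be stacked with rbind()
b$duration <- NA_real_
merged <- rbind(a, b[, names(a)])

# Full-row check: duration is NA in b's rows, so no row matches exactly
full_row_dups <- sum(duplicated(merged))

# Check restricted to columns that both sources share
shared <- c("speed_ground", "distance")
shared_dups <- sum(duplicated(merged[, shared]))

print(c(full_row = full_row_dups, shared_cols = shared_dups))
```

Comparing on several shared columns rather than one is a slightly more conservative choice, since two distinct landings could in principle share a single value.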
Chapter 2: Data Cleaning and Exploration
In this chapter, we check for abnormal values, as defined in the variable dictionary. If there are
any rows with abnormal values, we remove them. We plot a histogram of each variable to understand
its distribution.
R Code
#Removing abnormal values from data set (missing duration and speed_air values are retained)
FAA <- FAA[(FAA$duration > 40 | is.na(FAA$duration)), ]
FAA <- FAA[(FAA$height >= 6), ]
#Both bounds must hold, so the range conditions are combined with & rather than |
FAA <- FAA[((FAA$speed_air >= 30 & FAA$speed_air <= 140) | is.na(FAA$speed_air)), ]
FAA <- FAA[(FAA$speed_ground >= 30 & FAA$speed_ground <= 140), ]
FAA <- FAA[(FAA$distance < 6000), ]
dim(FAA)
#summary of cleaned data set
summary(FAA)
#Plotting histogram of each variable
hist(FAA$duration, breaks = 30, main = 'Histogram of duration variable', xlab = 'Duration')
hist(FAA$no_pasg, breaks = 30, main = 'Histogram of no_pasg', xlab = 'Number of Passengers')
hist(FAA$speed_ground, breaks = 30, main = 'Histogram of speed_ground', xlab = 'Speed Ground')
hist(FAA$speed_air, breaks = 30, main = 'Histogram of speed_air', xlab = 'Speed Air')
hist(FAA$height, breaks = 30, main = 'Histogram of height', xlab = 'Height')
hist(FAA$pitch, breaks = 30, main = 'Histogram of pitch', xlab = 'Pitch')
hist(log(FAA$distance), breaks = 30, main = 'Histogram of log(distance)', xlab = 'log(Landing Distance)')
R Output
#Structure of FAA after removing abnormal values

> str(FAA)
'data.frame': 831 obs. of 8 variables:
$ aircraft : chr "boeing" "boeing" NA NA ...
$ duration : num 98.5 125.7 NA NA NA ...
$ no_pasg : num 53 69 NA NA NA NA NA NA NA NA ...
$ speed_ground: num 108 102 NA NA NA ...
$ height : num 27.4 27.8 NA NA NA ...
$ pitch : num 4.04 4.12 NA NA NA ...
$ distance : num 3370 2988 NA NA NA ...
#summary of cleaned data set
#Histogram of each variable
Figure 1: Histogram of duration
Figure 2: Histogram of no_pasg
Figure 3: Histogram of speed_ground
Figure 4: Histogram of speed_air
Figure 5: Histogram of height
Figure 6: Histogram of pitch
Figure 7: Histogram of distance
Observations
• There are abnormal values in the data set, as defined by the variable dictionary. There are
5 such values in duration, 3 in speed_ground, 1 in speed_air, 10 in height and 2 in distance
column.
• 17 rows/observations with abnormal values were removed.
• Distribution of speed_air shows that it is right-skewed.
Conclusion
• Final data set has 833 observations and 8 columns.
• The observations with abnormal values are deleted, since the number of such observations
is very low.
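The variable dictionary treats a speed below 30 mph or above 140 mph as abnormal, so a row is normal only when both bounds hold at once. A minimal sketch on a hypothetical vector (`speed` and its values are assumptions for illustration) shows how the logical operator used to combine the two bounds changes the result:

```r
# Hypothetical speeds: 20 and 150 fall outside the 30-140 mph range, NA is missing
speed <- c(20, 85, 150, NA)

# OR of the two bounds: every non-missing value satisfies at least one of them,
# so nothing is filtered out
keep_or <- speed >= 30 | speed <= 140 | is.na(speed)

# AND of the two bounds: only values inside the range pass; NA is kept explicitly
keep_and <- (speed >= 30 & speed <= 140) | is.na(speed)

print(keep_or)
print(keep_and)
```

Keeping `is.na(speed)` outside the parentheses retains rows with missing speeds, which matches the decision in Chapter 1 to keep columns with missing values for now.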

Chapter 3: Data Analysis
This chapter comprises the initial data analysis, in which we try to identify factors that affect the
response variable, landing distance.
R Code
#Recoding aircraft to numeric values: Boeing - 0, Airbus - 1
FAA$aircraft <- ifelse(FAA$aircraft == 'boeing', 0, 1)
#Computing pairwise correlation and storing it for the correlation plot
FAACor <- cor(FAA, use = 'pairwise.complete.obs')
round(FAACor, 4)
corrplot::corrplot(FAACor, method = "ellipse")
#Scatter plots
par(mfrow = c(4, 2))
plot(FAA$aircraft, FAA$distance)
plot(FAA$duration, FAA$distance)
plot(FAA$no_pasg, FAA$distance)
plot(FAA$speed_ground, FAA$distance)
plot(FAA$speed_air, FAA$distance)
plot(FAA$height, FAA$distance)
plot(FAA$pitch, FAA$distance)
par(mfrow = c(1,1))
R Output
#Correlation Matrix

Figure 8: Correlation plot of variables
Table 1: Pairwise correlation between distance and each factor
Variables      Strength of correlation   Direction
Speed_air      0.9421                    Positive
Speed_ground   0.8608                    Positive
Aircraft       0.2369                    Negative
Height         0.0999                    Positive
Pitch          0.0863                    Positive
Duration       0.0520                    Negative
No_pasg        0.0173                    Negative
#Scatter plots

Figure 9: Scatter plot
Observations
• From the correlation matrix and scatter plots, speed_ground and speed_air are the factors
most strongly associated with landing distance, each with a strong positive correlation
coefficient.
• Aircraft make is also associated with landing distance, but the correlation is weak and
negative.
• The other factors are only weakly correlated with landing distance.
Conclusion
• Speed_ground, speed_air and aircraft are the factors most strongly associated with landing distance.
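The ranking in Table 1 can be produced programmatically rather than read off the correlation matrix. The following is a sketch on simulated data (the data frame `sim`, its columns and coefficients are assumptions for illustration, not the FAA data), using the same pairwise handling of missing values as the code above:

```r
set.seed(42)
n <- 100
# Simulated predictors and response; none of this is the FAA data
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
sim$y <- 5 * sim$x1 - 1 * sim$x2 + rnorm(n)
sim$x3[1:20] <- NA  # missing values, handled pairwise below

# Correlation of each predictor with the response
r <- cor(sim, use = "pairwise.complete.obs")[c("x1", "x2", "x3"), "y"]

# Rank by absolute strength and report the sign as the direction
ranking <- data.frame(variable = names(r),
                      strength = round(abs(r), 4),
                      direction = ifelse(r >= 0, "Positive", "Negative"))
ranking <- ranking[order(-ranking$strength), ]
print(ranking, row.names = FALSE)
```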

Chapter 4: Regression Analysis
In this chapter, we regress landing distance on each of the factors and observe the p-values of each
model. Then, we standardize each variable using the following formula:
X' = (X - mean(X)) / sd(X)
We regress landing distance on each of the standardized variables and note down the p-values.
Then, we compare results from the correlation matrix and the regression analysis to see if the
results are consistent.
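As a side note on what standardization does in a single-predictor regression, the sketch below uses simulated data (the variables `x` and `y` and their parameters are assumptions, not the FAA data). It illustrates that the p-value of the slope is unchanged by standardizing the predictor, while the slope itself is rescaled by sd(X), which is what makes coefficient sizes comparable across predictors:

```r
set.seed(7)
n <- 150
# Simulated predictor and response; not the FAA data
x <- rnorm(n, mean = 80, sd = 15)
y <- 40 * x + rnorm(n, sd = 2000)

# Standardize with the formula X' = (X - mean(X)) / sd(X)
x_std <- (x - mean(x)) / sd(x)

fit_raw <- lm(y ~ x)
fit_std <- lm(y ~ x_std)

p_raw <- summary(fit_raw)$coefficients["x", "Pr(>|t|)"]
p_std <- summary(fit_std)$coefficients["x_std", "Pr(>|t|)"]

# The p-values agree; the standardized slope equals the raw slope times sd(x)
print(c(p_raw = p_raw, p_std = p_std))
print(c(slope_raw = unname(coef(fit_raw)["x"]),
        slope_std = unname(coef(fit_std)["x_std"]),
        raw_times_sd = unname(coef(fit_raw)["x"]) * sd(x)))
```

Because the p-values cannot change, a ranking after standardization is more informative when it is based on the absolute size of the standardized coefficients.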
R Code
#Regression using single factor each time
model1 <- lm(distance ~ aircraft, data = FAA)
summary(model1)
model2 <- lm(distance ~ duration, data = FAA)
summary(model2)
model3 <- lm(distance ~ no_pasg, data = FAA)
summary(model3)
model4 <- lm(distance ~ speed_ground, data = FAA)
summary(model4)
model5 <- lm(distance ~ speed_air, data = FAA)
summary(model5)
model6 <- lm(distance ~ height, data = FAA)
summary(model6)
model7 <- lm(distance ~ pitch, data = FAA)
summary(model7)
#Standardizing and creating new variables
FAA$aircraft.std <- (FAA$aircraft - mean(FAA$aircraft))/sd(FAA$aircraft)
FAA$duration.std <- (FAA$duration - mean(FAA$duration, na.rm = TRUE))/sd(FAA$duration, na.rm = TRUE)
FAA$no_pasg.std <- (FAA$no_pasg - mean(FAA$no_pasg))/sd(FAA$no_pasg)
FAA$speed_ground.std <- (FAA$speed_ground - mean(FAA$speed_ground))/sd(FAA$speed_ground)
FAA$speed_air.std <- (FAA$speed_air - mean(FAA$speed_air, na.rm = TRUE))/sd(FAA$speed_air, na.rm = TRUE)
FAA$height.std <- (FAA$height - mean(FAA$height))/sd(FAA$height)
FAA$pitch.std <- (FAA$pitch - mean(FAA$pitch))/sd(FAA$pitch)
#Regression using standardized variables
model8 <- lm(distance ~ aircraft.std, data = FAA)
summary(model8)
model9 <- lm(distance ~ duration.std, data = FAA)

summary(model9)
model10 <- lm(distance ~ no_pasg.std, data = FAA)
summary(model10)
model11 <- lm(distance ~ speed_ground.std, data = FAA)
summary(model11)
model12 <- lm(distance ~ speed_air.std, data = FAA)
summary(model12)
model13 <- lm(distance ~ height.std, data = FAA)
summary(model13)
model14 <- lm(distance ~ pitch.std, data = FAA)
summary(model14)
R Output
#p-value of different models
Table 2: Ranking factors based on p-value
Variables      p-value    Direction of regression coefficient
Speed_air      <0.0001    Positive
Speed_ground   <0.0001    Positive
Aircraft       <0.0001    Negative
#p-value after standardizing the variables
Table 3: Ranking factors after standardization of variables
Variables      p-value    Direction of regression coefficient
Speed_air      <0.0001    Positive
Speed_ground   <0.0001    Positive
Aircraft       <0.0001    Negative
Observations
• Comparing results from Tables 1, 2 and 3, we observe that the results are consistent. In Table 4
below, the factors are ranked based on their relative importance in determining the landing
distance.
Table 4: Ranking of variables
Rank   Variable
1      Speed_air
2      Speed_ground
3      Aircraft
4      Height
5      Pitch
6      Duration
7      No_pasg
• Speed_air, speed_ground, height and pitch are positively correlated with landing distance,
while aircraft is negatively correlated with it.
Conclusion
Assuming a significance level of 0.05, speed_air, speed_ground, aircraft, height and pitch are the
most important factors influencing landing distance.
Chapter 5: Check for collinearity
Speed_air and speed_ground provide largely the same information. In this chapter, we check for
correlation between speed_air and speed_ground. If there is high correlation between the two
variables, we retain only one of them.
R Code
#Checking for collinearity between speed_ground and speed_air
model1 <- lm (distance ~ speed_ground, data = FAA)
summary(model1)
model2 <- lm (distance ~ speed_air, data = FAA)
summary(model2)
model3 <- lm (distance ~ speed_ground + speed_air, data = FAA)
summary(model3)
#Correlation between speed_ground and speed_air
cor(FAA$speed_air, FAA$speed_ground, use = "pairwise.complete.obs")
R Output
#Table showing regression coefficients
Table 5: Checking for collinearity
Model Number   Model                           Variable       Regression Coefficient   p-value
1              LD ~ speed_ground               Speed_ground   40.8252                  <0.0001
2              LD ~ speed_air                  Speed_air      79.532                   <0.0001
3              LD ~ speed_ground + speed_air   Speed_ground   -14.37                   0.258
                                               Speed_air      93.96                    <0.0001
Observation
• We observe from models 1 and 2 that both speed_ground and speed_air are significant
factors in determining landing distance when each is used alone. However, in model 3, only
speed_air is a significant factor (p-value < 0.0001).
• We also observe a sign change in the regression coefficient of speed_ground and a change in
its significance. The p-value of speed_ground in model 3 is greater than 0.05, suggesting that
it is not a significant factor, which contradicts model 1.
• We can say that collinearity exists, that is, speed_ground and speed_air are correlated with
each other. In fact, they are strongly correlated, with a correlation of 0.9879. It is better to
drop one of them, since including both might result in an unstable model.
• Speed_air can be considered as speed_ground plus wind speed.
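To put the reported correlation of 0.9879 on a more familiar scale, it can be converted into a variance inflation factor (VIF). This is a sketch of the standard two-predictor formula, VIF = 1 / (1 - r^2), and is not part of the original analysis:

```r
# Correlation between speed_air and speed_ground reported in this chapter
r <- 0.9879

# With only two predictors, the R squared from regressing one on the other is r^2
vif <- 1 / (1 - r^2)
print(round(vif, 1))  # 41.6
```

A commonly quoted rule of thumb treats a VIF above 10 as a sign of serious collinearity, so a value near 42 is consistent with the unstable coefficients seen in model 3.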
Conclusion
Speed_air is retained, even though it has a lot of missing values, for the following reasons:
• It is an important factor, from domain knowledge.
• From scatter plot, it is seen that speed_air has nearly linear relation with landing distance
which is not the case for speed_ground. This makes it possible to fit a linear regression
model for speed_air.
• Speed_air column has observations required for predicting landing over run, while in case
of speed_ground, a large portion of the observations is less than 90 mph which is not very
useful in predicting landing over run.
• It is easier to get values of speed_air.

Chapter 6: Variable Selection
We fit models based on the variable ranking in Table 4, adding one variable at a time. We obtain
R squared, adjusted R squared and AIC values for each model.
R Code
#Fitting nested models in the order of Table 4 (speed_ground excluded), adding one variable at a time
vars <- c("speed_air", "aircraft", "height", "pitch", "duration", "no_pasg")
models <- lapply(seq_along(vars), function(k) lm(reformulate(vars[1:k], response = "distance"), data = FAA))
#Collecting R squared, adjusted R squared and AIC of each model
r.squared <- sapply(models, function(m) summary(m)$r.squared)
r.adj.squared <- sapply(models, function(m) summary(m)$adj.r.squared)
r.AIC <- sapply(models, AIC)
#Plotting R squared values against number of predictors
plot(seq_along(vars), r.squared, type = "b", ylab = "R squared", xlab = "Number of predictors")
#Plotting Adjusted R squared values against number of predictors
plot(seq_along(vars), r.adj.squared, type = "b", ylab = "Adjusted R squared", xlab = "Number of predictors")
#Plotting AIC values against number of predictors
plot(seq_along(vars), r.AIC, type = "b", ylab = "AIC", xlab = "Number of predictors")
R Output
Table 6: Comparing different models
Model Number   Model                                                             R-squared value   Adjusted R-squared value   AIC value
1              LD ~ speed_air                                                    0.8875            0.8870                     2862.423
2              LD ~ speed_air + aircraft                                         0.9493            0.9488                     2702.784
3              LD ~ speed_air + aircraft + height                                0.9737            0.9733                     2571.310
4              LD ~ speed_air + aircraft + height + pitch                        0.9737            0.9732                     2573.300
5              LD ~ speed_air + aircraft + height + pitch + duration             0.9744            0.9737                     2473.168
6              LD ~ speed_air + aircraft + height + pitch + duration + no_pasg   0.9747            0.9739                     2473.010
Figure 10: Plot of R squared values of different models Vs Number of predictors
Figure 11: Plot of Adjusted R Squared of different models Vs Number of predictors
Figure 12: Plot of AIC of different models Vs Number of predictors

Observations
• Model 6 has highest adjusted r squared value and lowest AIC value. While comparing
models, we choose the one with higher adjusted r squared value, that is, one whose
predictor variables are better able to explain variation in dependent variable. Also, we
choose model with lower AIC value.
• However, if we look at p-values of the models, only speed_air, aircraft and height are
significant.
• It should also be noted that, in the final data set, speed_air has 630 missing values. Only
195 observations are used when fitting the models.
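Table 6 shows AIC falling by about 100 when duration is added, while R squared barely changes. One possible explanation (an assumption, not verified against the data) is that the models containing duration are fit on fewer rows because of its missing values. AIC values are only comparable between models fit to the same observations. The sketch below uses simulated data (the data frame `sim` and its columns are illustrative) to show the effect and one way to put the models on a common footing:

```r
set.seed(1)
n <- 200
# Simulated data: y depends on x1 only; x2 is noise with 50 missing values
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$y <- 2 * sim$x1 + rnorm(n)
sim$x2[1:50] <- NA

# lm() silently drops rows with NA in any variable used by the model
m_small_all <- lm(y ~ x1, data = sim)        # uses all 200 rows
m_big <- lm(y ~ x1 + x2, data = sim)         # uses only 150 rows

# Refit the smaller model on the rows that are complete for every candidate variable
cc <- sim[complete.cases(sim), ]
m_small_cc <- lm(y ~ x1, data = cc)

print(c(nobs(m_small_all), nobs(m_big), nobs(m_small_cc)))
print(c(AIC_all_rows = AIC(m_small_all), AIC_big = AIC(m_big), AIC_same_rows = AIC(m_small_cc)))
```

Only the comparison between `m_big` and `m_small_cc` is a like-for-like AIC comparison, since both models use the same 150 rows.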
Conclusion
• Based on adjusted R squared and AIC values, I would choose speed_air, aircraft, height, pitch,
duration and no_pasg to build a predictive model for landing distance.
• Based on p-values of models, I would choose speed_air, aircraft and height.
• The final model is as follows:
Distance = -5796.9430 + (81.9833 * speed_air) – (437.8295 * aircraft) + (13.71 * height)
• It is seen that among the influential factors, height has least impact on landing distance.
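The final model can be applied directly to a new landing. The sketch below wraps the reported coefficients in a helper function (`predict_distance` is a hypothetical name, and the input values are made up for illustration), using the aircraft coding from Chapter 3 (Boeing = 0, Airbus = 1):

```r
# Coefficients of the final model as reported in this chapter
predict_distance <- function(speed_air, aircraft, height) {
  -5796.9430 + 81.9833 * speed_air - 437.8295 * aircraft + 13.71 * height
}

# Hypothetical landing: air speed of 100 mph and height of 30 m
boeing <- predict_distance(speed_air = 100, aircraft = 0, height = 30)
airbus <- predict_distance(speed_air = 100, aircraft = 1, height = 30)
print(c(boeing = boeing, airbus = airbus))
```

For these inputs the model predicts about 2813 feet for a Boeing and about 2375 feet for an Airbus. Since the model was fit only on rows where speed_air is recorded, predictions for much lower air speeds would be extrapolations.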
Chapter 7: Variable Selection based on an automated algorithm
In this chapter, stepAIC function in R is used to perform forward variable selection. The results so
obtained are compared with results obtained in previous chapter to see if they are consistent.
R Code
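#Forward selection: start from the intercept-only model and add variables up to the full model
FAA_cc <- na.omit(FAA[, c("distance", "aircraft", "duration", "no_pasg", "speed_air", "height", "pitch")])
null_model <- lm(distance ~ 1, data = FAA_cc)
full_model <- lm(distance ~ ., data = FAA_cc)
step <- MASS::stepAIC(null_model, scope = formula(full_model), direction = "forward")
summary(step)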
R Output
#Summary of stepAIC function

Observation
• Using the stepAIC function to perform forward variable selection, I would select three
variables to build the predictive model for landing distance. From the output above, it can be
seen that the p-values for aircraft, speed_air and height are significant.
Conclusion
• The final model after using stepAIC function is as follows:
Distance = -5791.6573 – (437.9428 * aircraft) + (85.5469 * speed_air) + (13.6756 * height)
• Based on results from chapter 6 and 7, I would choose speed_air, aircraft and height in the
final model.
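Forward selection starts from a small model and adds one variable at a time for as long as AIC improves. The sketch below illustrates this on simulated data (the data frame `sim` and its columns are assumptions, not the FAA data). It uses the base R function `step()`, which takes the same `scope` and `direction` arguments as `MASS::stepAIC`, so that the example runs without extra packages:

```r
set.seed(123)
n <- 200
# Simulated data: y depends on x1 and x2; x3 is pure noise
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
sim$y <- 3 * sim$x1 + 2 * sim$x2 + rnorm(n)

# Starting point: intercept-only model
null_model <- lm(y ~ 1, data = sim)

# The scope gives the smallest and largest models the search may visit
selected <- step(null_model,
                 scope = list(lower = ~ 1, upper = ~ x1 + x2 + x3),
                 direction = "forward", trace = 0)
print(formula(selected))
```

Starting a forward search from the full model would leave nothing to add, which is why the search begins at the intercept-only model. The data passed in should have no missing values in the candidate variables, so that every model in the search is fit to the same rows.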
