KNN and regression Tree
STAT 6620 Asmar Farooq and Weizhong Li
Project# 1
Abstract
The main purpose of the project is to predict the delay status of flights using the KNN algorithm and
to predict the duration of the arrival delay using a regression tree. The data used to develop the
models come from the American Statistical Association’s website at
http://stat-computing.org/dataexpo/2009/the-data.html. The data span 1987 to 2008. The flight dataset
records 29 variables and 24,117,234 observations. Instead of building a model for the entire dataset,
the team focuses on predicting flight delay status for JFK airport in New York. This decision reflects
the team’s view that airport capacity, temperature patterns, and local air travel demand differ from
airport to airport and affect the number of delayed flights differently at each one. A model built for a
single airport is therefore more practical and useful for that airport.
Based on the modeling exercise, the team concludes that the KNN algorithm is a poor tool for
estimating the delay status of flights, with a prediction error rate of more than 90 percent for flights
that are actually delayed. The regression tree, on the other hand, is an appropriate model: its
predicted arrival delays have a correlation of 0.88 with the actual arrival delays in the test set.
Data Summary
For each year, we are given the following 29 features.
Feature             Type          Feature             Type
Year                Categorical   DepDelay            Numerical
Month               Categorical   Origin              Categorical
DayofMonth          Categorical   Dest                Categorical
DayOfWeek           Categorical   Distance            Numerical
DepTime             Categorical   TaxiIn              Numerical
CRSDepTime          Categorical   TaxiOut             Numerical
ArrTime             Categorical   Cancelled           Categorical
CRSArrTime          Categorical   CancellationCode    Categorical
UniqueCarrier       Categorical   Diverted            Categorical
FlightNum           Categorical   CarrierDelay        Categorical
TailNum             Categorical   WeatherDelay        Categorical
ActualElapsedTime   Numerical     NASDelay            Categorical
CRSElapsedTime      Numerical     SecurityDelay       Categorical
AirTime             Numerical     LateAircraftDelay   Categorical
ArrDelay            Numerical
Here are the mean and standard deviation (Sd) of the 1987 data, grouped by month.
                         Oct               Nov               Dec
Feature              Mean     Sd       Mean     Sd       Mean     Sd
ActualElapsedTime    100.71   60.85    102.16   61.52    103.72   63.08
CRSElapsedTime        88.53   60.65    100.62   61.21    101.73   61.82
ArrDelay               6.00   18.56      8.47   24.46     13.99   32.18
DepDelay               5.00   19.79      7.13   20.62     12.10   29.86
Distance             587.99  496.51    590.80  497.70    594.98  500.14
AirTime, TaxiIn, TaxiOut: N/A for 1987
Below, the counts and relative frequencies for a few of the categorical features are listed. The
remaining categorical features have 31 or more levels, so tabulating every level would be
impractical. For example, FlightNum has 2161 levels.
str(x2)
'data.frame': 1311826 obs. of 13 variables:
$ Month : int 10 10 10 10 10 10 10 10 10 10 ...
$ DayofMonth : Factor w/ 31 levels "1","2","3","4",..: 14 15 17 18 19 21 22 23 24 25 ...
$ DayOfWeek : Factor w/ 7 levels "1","2","3","4",..: 3 4 6 7 1 3 4 5 6 7 ...
$ DepTime : Factor w/ 1430 levels "1","2","3","4",..: 451 439 451 439 459 438 438 441 454 439 ...
$ CRSDepTime : Factor w/ 1174 levels "1","5","6","8",..: 209 209 209 209 209 209 209 209 209 209 ...
$ ArrTime : Factor w/ 1440 levels "1","2","3","4",..: 552 543 558 527 562 528 532 542 548 531 ...
$ CRSArrTime : Factor w/ 1301 levels "1","2","3","4",..: 390 390 390 390 390 390 390 390 390 390 ...
$ UniqueCarrier: Factor w/ 14 levels "AA","AS","CO",..: 10 10 10 10 10 10 10 10 10 10 ...
$ FlightNum : Factor w/ 2161 levels "1","2","3","4",..: 1359 1359 1359 1359 1359 1359 1359 1359 1359
$ Origin : Factor w/ 237 levels "ABE","ABQ","ACV",..: 198 198 198 198 198 198 198 198 198 198 ...
$ Dest : Factor w/ 237 levels "ABE","ABQ","ACV",..: 207 207 207 207 207 207 207 207 207 207 ...
$ Cancelled : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ Diverted : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
Day of Week
        October                     November                    December
Level   Counts    Rel. Freq.   Level   Counts    Rel. Freq.   Level   Counts    Rel. Freq.
1       59,243    0.1321       1       73,057    0.1728       1       58,411    0.1326
2       59,214    0.1320       2       58,441    0.1382       2       72,583    0.1648
3       59,076    0.1317       3       58,763    0.1390       3       72,396    0.1644
4       73,966    0.1649       4       55,614    0.1315       4       71,331    0.1620
5       73,739    0.1644       5       54,637    0.1292       5       56,537    0.1284
6       67,256    0.1499       6       52,767    0.1248       6       53,347    0.1211
7       56,126    0.1251       7       69,524    0.1644       7       55,798    0.1267
Total   448,620   1.0000       Total   422,803   1.0000       Total   440,403   1.0000
Unique Carrier
October November December
Level Counts Rel. Freq. Level Counts Rel. Freq. Level Counts Rel. Freq.
DL 63,104 0.1407 DL 60,150 0.1423 DL 62,559 0.142
AA 56,091 0.125 AA 53,200 0.1258 AA 55,830 0.1267
UA 52,952 0.118 UA 48,702 0.1152 UA 50,970 0.1157
CO 42,756 0.0953 CO 39,408 0.0932 CO 40,838 0.0927
PI 39,228 0.0874 PI 37,707 0.0892 PI 39,547 0.0898
NW 37,590 0.0838 EA 34,865 0.0825 EA 36,863 0.0837
EA 37,048 0.0826 NW 34,342 0.0812 NW 36,341 0.0825
US 32,293 0.072 US 31,006 0.0733 US 31,515 0.0715
TW 23,823 0.0531 TW 22,125 0.0523 TW 23,792 0.054
WN 21,738 0.0485 WN 20,237 0.0479 WN 20,000 0.0454
HP 15,026 0.0335 HP 14,939 0.0353 HP 15,434 0.035
PS 14,405 0.0321 PS 13,540 0.032 PS 13,761 0.0312
AS 7,432 0.0166 AS 6,967 0.0165 AS 7,007 0.0159
PA 5,134 0.0114 PA 5,615 0.0133 PA 6,036 0.0137
Total 448,620 1 Total 422,803 1 Total 440,493 1
Cancelled
        October                     November                    December
Level   Counts    Rel. Freq.   Level   Counts    Rel. Freq.   Level   Counts    Rel. Freq.
0       445,619   0.9933       0       417,612   0.9877       0       428,910   0.9739
1       3,001     0.0067       1       5,191     0.0123       1       11,493    0.0261
Total   448,620   1.0000       Total   422,803   1.0000       Total   440,403   1.0000
Diverted
        October                     November                    December
Level   Counts    Rel. Freq.   Level   Counts    Rel. Freq.   Level   Counts    Rel. Freq.
0       447,781   0.9982       0       421,708   0.9974       0       438,522   0.9957
1       829       0.0018       1       1,095     0.0026       1       1,881     0.0043
Total   448,610   1.0000       Total   422,803   1.0000       Total   440,403   1.0000
Model Variable Construction
To clean the data, we excluded all observations with an NA value in airtime as well as
observations where the flight was cancelled, as those observations do not add any information. We
exclude observations with an NA value in airtime because we consider taxiout (the time it takes a
flight to leave the terminal and take off from the airport) and taxiin (the time it takes a flight to
land and reach the terminal) important variables for explaining flight delay status and arrival delay
time. In general, observations with a null airtime also have null taxiin and taxiout values.
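The cleaning rule can be illustrated with a minimal base R sketch. It is not the project's code (the filtering was done in SQL, see the appendix); the toy data frame and its values are invented for illustration, and missing values are assumed to be coded as NA.

```r
# Toy flights (invented values). NA marks a missing measurement.
flights <- data.frame(
  AirTime   = c(100, NA, 85, NA),
  TaxiIn    = c(5, NA, 7, NA),
  TaxiOut   = c(12, NA, 20, NA),
  Cancelled = c(0, 1, 0, 0)
)

# Keep rows with a recorded airtime that were not cancelled.
keep  <- !is.na(flights$AirTime) & flights$Cancelled != 1
clean <- flights[keep, ]
print(clean)

# Check whether any taxi values are still missing after the filter.
print(colSums(is.na(clean[, c("TaxiIn", "TaxiOut")])))
```

Checking the remaining missing values after the filter is a cheap way to confirm the claim that a null airtime goes together with null taxi times.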
After cleaning the data, five additional variables are added to the dataset. The first one
is the delay flag. A flight is considered late if either of the following two conditions is true.
• Actual elapsed time is more than 30 minutes longer than the scheduled elapsed time.
• Actual arrival time is more than 30 minutes later than the scheduled arrival time.
The 30-minute threshold was chosen based on personal experience and the assumption that a
passenger expects to wait at least 30 minutes before considering a flight delayed. The delay flag is a
binary variable with values 0 (on time) and 1 (delayed flight) and is used as the response variable in
the KNN model.
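The two conditions can be sketched in base R as follows. This is an illustration under stated assumptions, not the project's implementation (the flag was built in SQL, see the appendix). The helper `hhmm_to_minutes` and the toy data are hypothetical. Because the dataset codes clock times as hhmm integers, the sketch converts them to minutes since midnight before subtracting; it does not handle arrivals that cross midnight.

```r
# Hypothetical helper: convert an hhmm-coded clock time (e.g. 1745)
# to minutes since midnight (17 * 60 + 45 = 1065).
hhmm_to_minutes <- function(hhmm) {
  (hhmm %/% 100) * 60 + (hhmm %% 100)
}

# Toy flights, invented for illustration only.
flights <- data.frame(
  ActualElapsedTime = c(120, 200, 95),
  CRSElapsedTime    = c(115, 160, 90),
  ArrTime           = c(1005, 1830, 1305),
  CRSArrTime        = c(1000, 1750, 1245)
)

elapsed_over <- flights$ActualElapsedTime - flights$CRSElapsedTime
arrival_over <- hhmm_to_minutes(flights$ArrTime) -
  hhmm_to_minutes(flights$CRSArrTime)

# Flag a flight as delayed (1) if either excess is more than 30 minutes.
flights$delay <- as.integer(elapsed_over > 30 | arrival_over > 30)
print(flights)
```

The third toy flight shows why the conversion matters: 1305 minus 1245 is 60 as raw hhmm digits, but the flight arrived only 20 minutes after schedule.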
Three temperature measures for JFK (mean, minimum, and maximum temperature) are added using
the R package “weatherData”. Common experience tells the team that temperature plays a big role,
especially for airports in locations with adverse winter conditions, of which JFK is one.
Lastly, the variable “Total number of flights per day” is created. It reflects the daily usage
of the airport. The team suspects that on days when more flights need to take off from the airport,
the probability of delay differs from days when fewer flights than normal use the airport.
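A flights-per-day variable of this kind can be sketched in base R with `ave()`. The project computed it in SQL (see the appendix); the toy data frame below is invented, and the column name `num_flight` simply mirrors the appendix.

```r
# Toy data: one row per flight with its flight date (invented values).
flights <- data.frame(
  date = as.Date(c("2008-01-01", "2008-01-01", "2008-01-02",
                   "2008-01-01", "2008-01-02")),
  DepDelay = c(5, 40, 0, 12, 3)
)

# ave() applies sum() to a vector of ones within each date and returns one
# value per row, so every flight carries the flight count of its own day.
flights$num_flight <- ave(rep(1L, nrow(flights)), flights$date, FUN = sum)
print(flights)
```

Keeping the count on every row, rather than in a separate daily table, lets it be used directly as a predictor.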
All in all, we dropped 19 variables, either for the reasons above or because we found them to be
of no use in explaining flight delays. Those 19 variables are Year, ArrTime, CRSArrTime,
UniqueCarrier, FlightNum, TailNum, ActualElapsedTime, CRSElapsedTime, AirTime, Origin, Dest,
Cancelled, CancellationCode, Diverted, CarrierDelay, WeatherDelay, NASDelay, SecurityDelay,
LateAircraftDelay.
By the end of this process, we had a little more than 150K observations, of which about 36K
were delayed flights. The 150K observations were then split into a training set with 80 percent of
the observations (120,555) and a test set with the rest.
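An 80/20 split of this kind can be sketched in base R as below. The toy data frame is invented and much smaller than the project data; the appendix shuffles rows with `order(runif(...))`, while this sketch uses `sample()`, which is an equivalent way to draw a random subset.

```r
set.seed(12345)                  # a fixed seed makes the split reproducible

n   <- 1000                      # toy size; the project had 150,694 rows
toy <- data.frame(id = seq_len(n), arrdelay = rnorm(n, mean = 10, sd = 30))

n_train   <- floor(0.8 * n)      # 80 percent of the rows for training
train_idx <- sample(n, n_train)  # random row indices without replacement

toy_training <- toy[train_idx, ]
toy_test     <- toy[-train_idx, ]

print(c(training = nrow(toy_training), test = nrow(toy_test)))
```

Applied to 150,694 rows, `floor(0.8 * 150694)` gives the 120,555 training observations reported above.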
Delay Status Classification Model
Using the KNN algorithm, ArrivedLate (the delay flag) is predicted. The following results were
generated in R.
Variables: airtime, distance, taxiin, taxiout, max temperature, min temperature, number of flights.
Normalization method: min-max normalization.
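Min-max normalization rescales each variable as (x - min) / (max - min). The base R sketch below shows one way to apply it. Note the assumption: it rescales the test set with the minimum and maximum of the training set, which is a variation on the appendix code, where each set is normalized with its own range. The function name `min_max` and the toy values are hypothetical.

```r
# Min-max normalization: rescale x using a given minimum (lo) and maximum (hi).
min_max <- function(x, lo = min(x), hi = max(x)) {
  (x - lo) / (hi - lo)
}

# Toy training and test sets (invented values).
train <- data.frame(taxiout = c(10, 20, 30), distance = c(100, 500, 900))
test  <- data.frame(taxiout = c(15, 40),     distance = c(300, 900))

# Column-wise ranges taken from the training set only.
lo <- sapply(train, min)
hi <- sapply(train, max)

train_norm <- as.data.frame(Map(min_max, train, lo, hi))
test_norm  <- as.data.frame(Map(min_max, test,  lo, hi))

print(train_norm)
print(test_norm)  # values can fall outside [0, 1] for unseen extremes
```

Using the training range for both sets keeps a given raw value at the same position on the scale in training and test data, which matters for a distance-based method such as KNN.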
Model confusion matrices for k=10, k=100, k=200, k=300, and k=400. Actual values are listed on the
vertical side and predicted values on the horizontal side.
[Figures: confusion matrices for K=10, K=100, K=200, K=300, K=400]
K=500 doesn’t work
As the number of nearest neighbors increases, the number of incorrectly classified on-time
flights decreases while the number of incorrectly classified delayed flights increases. The model
results are unsatisfactory, as the error rate for delayed flights is more than 20 percent for all of
the models. Based on these results, KNN is not a good model for predicting the delay status of
flights.
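The error rates discussed here can be read off a confusion matrix. The base R sketch below uses invented labels, not the project's predictions, and shows how the error rate for each actual class and the overall error rate are computed.

```r
# Toy actual and predicted labels (invented; 1 = delayed, 0 = on time).
actual    <- factor(c(0, 0, 0, 0, 0, 0, 1, 1, 1, 1), levels = c(0, 1))
predicted <- factor(c(0, 0, 0, 0, 0, 1, 0, 0, 0, 1), levels = c(0, 1))

# Rows are actual classes, columns are predicted classes.
cm <- table(actual = actual, predicted = predicted)
print(cm)

# Share of each actual class that was predicted incorrectly.
class_error <- 1 - diag(cm) / rowSums(cm)

# Share of all flights that were predicted incorrectly.
overall_error <- 1 - sum(diag(cm)) / sum(cm)

print(class_error)
print(overall_error)
```

Because on-time flights are the majority (about 36K of roughly 150K flights were delayed), the overall error rate can look moderate while the error rate on delayed flights is much higher, so the per-class figure is the one to watch.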
Prediction for Delay Arrival Time
A regression tree is used to predict the arrival delay time. The explanatory variables
include: month, day of month, day of week, departure time, planned time, airtime, taxiin, taxiout, max
temperature, mean temperature, min temperature, number of flights, departure delay.
Correlation matrix between the numerical variables:
Regression graph:
Model effectiveness measure, RMSE:
Model effectiveness measure, correlation with the test set:
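Both effectiveness measures are simple to compute. The base R sketch below uses invented delays rather than the project's predictions; the function name `rmse` is hypothetical (the appendix uses `RMSE` from the caret package instead).

```r
# Root mean squared error between predictions and observed values.
rmse <- function(predicted, observed) {
  sqrt(mean((predicted - observed)^2))
}

# Toy arrival delays in minutes (invented for illustration).
observed  <- c(-5, 0, 12, 45, 90)
predicted <- c( 0, 3, 10, 40, 75)

print(rmse(predicted, observed))  # typical size of a prediction error
print(cor(observed, predicted))   # linear agreement between the two
```

RMSE is expressed in the units of the response (minutes of delay), while the correlation is unit-free, so the two measures complement each other.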
Summary
In summary, after cleaning and preparing the data set, the KNN algorithm was applied to predict
whether a flight at JFK would be delayed. While the KNN algorithm is simple to apply, the error rate
it generated is unacceptable at more than 20%. Furthermore, increasing k worsened the model’s
ability to predict late flights, because more late flights were misclassified as on-time flights.
The regression tree algorithm, on the other hand, generated better results and produced an
easy-to-follow graph. According to the graph, the main cause of a late arrival is a delay in
departure. In other words, if a flight leaves late, it is very likely to arrive late. Moreover,
according to our model, shorter taxi-out and air times result in fewer delays, as common sense would
suggest. The correlation of 0.88 between the test data and the model’s predictions suggests that this
model is acceptable and a good candidate for predicting late flights at JFK airport.
Since this model was applied only to JFK airport, our team suggests applying the same
technique to other airports to further evaluate the effectiveness of the regression tree algorithm for
airline flights.
APPENDIX
R CODE
## Loading Data
library(RODBC)
myconn <- odbcConnect('project')
flight_data <- sqlQuery(myconn,
                        "select * from [project].[dbo].[jfk_revised]")
close(myconn)
str(flight_data)

## Build the modeling data frame; with() avoids attach()-ing the data
fit_data <- with(flight_data, data.frame(
  factor(Month), factor(DayofMonth), factor(DayOfWeek),
  round(DepTime/100, digits = 0), round(CRSDepTime/100, digits = 0),
  AirTime, Distance, TaxiIn, TaxiOut,
  Max_TemperatureF, Mean_TemperatureF, Min_TemperatureF, num_flight,
  factor(delay),
  ArrDelay, DepDelay))
names(fit_data) <- c('month', 'dayofmonth', 'dayofweek', 'deptime',
                     'crsdeptime', 'airtime', 'distance', 'taxiin', 'taxiout',
                     'maxt', 'meant', 'mint', 'num_flight', 'delay',
                     'arrdelay', 'depdelay')

## Shuffle the rows, then split 80/20 into training and test sets
set.seed(12345)
fit_data <- fit_data[order(runif(150694)), ]
fit_training <- fit_data[1:120555, ]
fit_test <- fit_data[120556:150694, ]
## KNN
## Normalize function (min-max scaling to [0, 1])
normalize <- function(x) {
  return((x - min(x)) / (max(x) - min(x)))
}

## Keep complete rows; drop the five date/time columns (1:5), then drop
## delay, arrdelay and depdelay (columns 9:11 of what remains)
knn_training_full <- fit_training[complete.cases(fit_training), ]
knn_training <- knn_training_full[, -(1:5)]
knn_training_cor <- knn_training[, -(9:11)]   # unscaled copy for cor()
knn_training <- as.data.frame(lapply(knn_training[, -(9:11)], normalize))
knn_test_full <- fit_test[complete.cases(fit_test), ]
knn_test <- knn_test_full[, -(1:5)]
knn_test <- as.data.frame(lapply(knn_test[, -(9:11)], normalize))

## Fit KNN with k = 10; column 14 of the full data is the delay flag
library(class)
fit_pred <- knn(train = knn_training, test = knn_test,
                cl = knn_training_full[, 14], k = 10)
library(gmodels)
CrossTable(x = knn_test_full$delay, y = fit_pred,
           prop.chisq = FALSE)

## Repeat for larger k; each call must use the current value x, not the
## whole vector i
i <- seq(100, 1000, 100)
lapply(i, function(x) {
  knn_i <- knn(train = knn_training, test = knn_test,
               cl = knn_training_full[, 14], k = x)
  CrossTable(x = knn_test_full$delay, y = knn_i,
             prop.chisq = FALSE)
})
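## Regression Tree
library(rpart)
## Fit arrival delay on all other columns of the training set
m.rpart <- rpart(arrdelay ~ ., data = fit_training)
library(rpart.plot)
rpart.plot(m.rpart, digits = 3)

## Evaluate on the test set: RMSE and correlation with observed delays
library(caret)
rpart.pred <- predict(m.rpart, fit_test)
RMSE(rpart.pred, fit_test$arrdelay)
cor(fit_test$arrdelay, rpart.pred)

## Correlation matrix between the numerical variables
cor(knn_training_cor)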
SQL CODE
USE [project]
GO
/****** Object: View [dbo].[vw_jfk] Script Date: 6/7/2015 11:08:32 AM
******/
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
CREATE view [dbo].[vw_jfk] as
/* First chain: keep flights where origin = jfk and create a date variable
used to aggregate the number of flights departing from jfk.*/
with cte1 as
(SELECT [Year]
,[Month]
,[DayofMonth]
,[DayOfWeek]
,[DepTime]
,[CRSDepTime]
,[ArrTime]
,[CRSArrTime]
,[UniqueCarrier]
,[FlightNum]
,[TailNum]
,[ActualElapsedTime]
,[CRSElapsedTime]
,[AirTime]
,[ArrDelay]
,[DepDelay]
,[Origin]
,[Dest]
,[Distance]
,[TaxiIn]
,[TaxiOut]
,[Cancelled]
,[CancellationCode]
,[Diverted]
,[CarrierDelay]
,[WeatherDelay]
,[NASDelay]
,[SecurityDelay]
,[LateAircraftDelay]
,DATEFROMPARTS([year],[month],[dayofmonth]) as [date]
FROM [project].[dbo].[part1_data]
where origin = 'jfk'
)
/* Joining the first chain with the min, mean, and max temperature while
filtering out cancelled flights and null values for airtime.
The resultant dataset will have taxiin and taxiout fields filled with
values.*/
,cte2 as
(select a.*
,b.[Max_TemperatureF]
,b.[Mean_TemperatureF]
,b.[Min_TemperatureF]
from cte1 a left join [project].[dbo].[temp] b on a.[date] = b.[date]
where cancelled <> 1 and airtime <>'NA'
)
/*creating the delay flag using the second chain*/
,cte3 as
(select a.*
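,case when cast(a.[ActualElapsedTime] as int) - cast(a.[CRSElapsedTime] as int) > 30 then '1'
/* ArrTime and CRSArrTime are hhmm-coded, so this difference is taken on
the hhmm digits rather than on minutes */
when cast(a.[ArrTime] as int) - cast(a.[CRSArrTime] as int) > 30 then '1'
else '0' end as [delay]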
from cte2 a
)
/*aggregating the number of flights taken place in JFK on a daily basis*/
,cte4 as
(select [date]
,count([date]) as [num_flight]
from cte3
group by [date]
)
/*combining chain number three and four into a final table*/
,cte5 as
(select a.*
,b.[num_flight]
from cte3 a left join cte4 b on a.[date] = b.[date]
)
select * from cte5
GO
