Report Statistical Analysis

US Commercial Flights
Francesca Pappalardo
29 January 2019
US commercial flight analysis
Introduction
This report describes the analyses performed on a data set provided by the site http://stat-computing.org/. The data come from the
Research and Innovative Technology Administration (RITA). They cover 22 years, from 1987 to 2008, with a total of about 123 million
observations and 29 variables. The main variables used are listed below with their descriptions.
Data Description (used in this analysis)
Year 1987-2008
Month 1-12
ArrTime actual arrival time (local, hhmm)
UniqueCarrier unique carrier code
FlightNum flight number
TailNum plane tail number
AirTime in minutes
ArrDelay arrival delay, in minutes
DepDelay departure delay, in minutes
Distance in miles
Cancelled was the flight cancelled?
CancellationCode reason for cancellation (A = carrier, B = weather, C = NAS, D = security)
The dataset is about 12 GB in size, so it was appropriate to load it into an SQLite database and open a connection to it, which makes the
data easier to query during the analysis.
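library(DBI)     # dbConnect(), dbGetQuery()
library(dplyr)   # tbl(); needs the dbplyr package installed for database tables
# Path to the SQLite file holding the flight data
path_db <- "C:/ProjectInferential/ontimefly.sqlite3"
con <- dbConnect(RSQLite::SQLite(), dbname = path_db)
# Helper: run a SQL query on the open connection and return a data frame
from_db <- function(sql) {
  dbGetQuery(con, sql)
}
# Lazy reference to the "ontimefly" table (rows are fetched only on demand)
ontime <- tbl(con, "ontimefly")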
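The connection pattern above can be tried without the full 12 GB file. The following is a minimal sketch that runs the same kind of helper against a small in-memory SQLite database; the toy rows and every `demo_` name are invented for illustration and are not part of the original analysis.

```r
library(DBI)

# Hypothetical toy rows standing in for the real "ontimefly" table
demo_flights <- data.frame(
  Year = c(1987L, 1987L, 2008L),
  UniqueCarrier = c("DL", "US", "DL"),
  Cancelled = c(0L, 1L, 0L),
  stringsAsFactors = FALSE
)

# ":memory:" creates a temporary database instead of a file on disk
demo_con <- dbConnect(RSQLite::SQLite(), dbname = ":memory:")
dbWriteTable(demo_con, "ontimefly", demo_flights)

# Same idea as from_db(): send SQL text, get a data frame back
demo_from_db <- function(sql) {
  dbGetQuery(demo_con, sql)
}

demo_counts <- demo_from_db(
  "SELECT UniqueCarrier, COUNT(*) AS n FROM ontimefly
   GROUP BY UniqueCarrier ORDER BY UniqueCarrier"
)
print(demo_counts)

dbDisconnect(demo_con)
```

Aggregating inside SQLite and pulling back only the summary keeps memory use small, which is the main reason to put a data set of this size in a database.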
EDA (Exploratory Data Analysis)
The analyses in this report focus on the cancellations, delays and performance of flights. Before
extracting the specific data, it is useful to run some exploratory analyses on the entire dataset.
1. What is the main reason why flights are canceled?
To avoid errors and inconsistencies, it is advisable to remove all null values.
Taking all canceled flights, i.e. those with a value of 1 in the Cancelled variable, the values of the CancellationCode variable were
relabelled as follows:
Unknown (spelled 'Uknown' in the code and output): NA or empty
Carrier: A
Weather: B
NAS: C
Security: D
library(ggplot2)
library(scales)   # percent() for the axis labels
# Keep only canceled flights
cancellation <- flights[flights$Cancelled == 1,]
# Relabel the cancellation codes; missing or empty codes become 'Uknown'
code <- cancellation$CancellationCode
cancellation$CancellationCode[is.na(code) | code == 'NA' | code == ''] <- 'Uknown'
cancellation$CancellationCode[cancellation$CancellationCode == 'A'] <- 'Carrier'
cancellation$CancellationCode[cancellation$CancellationCode == 'B'] <- 'Weather'
cancellation$CancellationCode[cancellation$CancellationCode == 'C'] <- 'NAS'
cancellation$CancellationCode[cancellation$CancellationCode == 'D'] <- 'Security'
# Bar chart of the share of cancellations per cause
plot_cancellation <- ggplot(data = cancellation, aes(x = CancellationCode)) +
  geom_bar(aes(y = (..count..)/sum(..count..), fill = CancellationCode)) +
  scale_y_continuous(labels = percent) +
  ggtitle("Cancellation Causes") +
  ylab("% Cancellation")
plot_cancellation

Causes of flight cancellation
The most common cancellation cause is Unknown, at about 80%, followed by Carrier at
around 15%.
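The relabelling of cancellation codes can also be written as a single lookup instead of one assignment per code. This is a minimal sketch of that alternative, not the code used in the report; `cancel_labels` and `raw_codes` are hypothetical names and the input values are made up.

```r
# Lookup table from raw code to label; the names are the raw codes
cancel_labels <- c(A = "Carrier", B = "Weather", C = "NAS", D = "Security")

# Hypothetical raw codes, including missing, empty and literal "NA" values
raw_codes <- c("A", "", NA, "B", "D", "C", "NA")

# Indexing by name returns NA for any code that is not in the lookup
labels <- unname(cancel_labels[raw_codes])

# Everything that did not match a known code is treated as unknown
labels[is.na(labels)] <- "Unknown"
print(labels)
```

A lookup keeps the mapping in one place, so adding or renaming a cause means editing a single vector.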
2. Carrier distribution
Specific analyses were also carried out by carrier, so it is important to know how flights are distributed among carriers.
carrier <- flights%>%
filter(UniqueCarrier != "NA")
carrier$UniqueCarrier[carrier$UniqueCarrier != "NW" &
carrier$UniqueCarrier != "DL" &
carrier$UniqueCarrier != "US" &
carrier$UniqueCarrier != "AA" &
carrier$UniqueCarrier != "UA"] <- "Other"
carrier <- carrier %>%
group_by(UniqueCarrier) %>%
dplyr::summarize(Num = n())
The dataset contains 29 different carriers, but I analyze only the most important ones, as identified by the subsequent analyses.
Carrier descriptions:
American Airlines Inc.: AA
Delta Air Lines Inc.: DL
US Airways Inc.: US
Northwest Airlines Inc.: NW
United Air Lines Inc.: UA
# Share of flights per carrier group (same row order as 'carrier')
x <- carrier$Num / sum(carrier$Num)
# Labels ("etichette") built from the same rows, so each slice gets its own carrier code
etichette <- paste0(carrier$UniqueCarrier, " (", round(x * 100, 1), "%)")
pie(x, labels = etichette)

Specific Analysis
This report performs several analyses to answer the following questions:
1. In which month did the most cancellations occur?
2. Which season has the fewest delays?
3. Which aircraft manufacturer gives the best performance?
4. Which of the two carriers that operate the most flights is faster?
1. In which month did the most cancellations occur?
Dataframe: flights_cancelled
## # A tibble: 123,534,969 x 4
## Month FlightNum Cancelled CancellationCode
## <int> <int> <int> <chr>
## 1 1 335 0 ""
## 2 1 3231 0 ""
## 3 1 448 0 ""
## 4 1 1746 0 ""
## 5 1 3920 0 ""
## 6 1 378 0 ""
## 7 1 509 0 ""
## 8 1 535 0 ""
## 9 1 11 0 ""
## 10 1 810 0 ""
## # ... with 123,534,959 more rows
To make the data quicker and clearer to read, I label flights that were not canceled (Cancelled equal to 0) as 'No' and canceled flights
(Cancelled equal to 1) as 'Yes'.
Furthermore, the reason for a cancellation is given by the variable CancellationCode. The possible values are listed below.
CancellationCode | Description
A | Carrier
B | Weather
C | NAS
D | Security
NA or "" | Unknown ('Uknown' in the code)

cancelled_analysis <- flights_cancelled
cancelled_analysis$Cancelled[cancelled_analysis$Cancelled == 0] <- 'No' # assign 'No' if the flight was not canceled
cancelled_analysis$Cancelled[cancelled_analysis$Cancelled == 1] <- 'Yes' # assign 'Yes' if the flight was canceled
cancelled_analysis$CancellationCode[ cancelled_analysis$CancellationCode == 'NA' | cancelled_analysis$CancellationCode == ''] <- 'Uknown'
cancelled_analysis$CancellationCode[cancelled_analysis$CancellationCode == 'A'] <- 'Carrier'
cancelled_analysis$CancellationCode[cancelled_analysis$CancellationCode == 'B'] <- 'Weather'
cancelled_analysis$CancellationCode[cancelled_analysis$CancellationCode == 'C'] <- 'NAS'
cancelled_analysis$CancellationCode[cancelled_analysis$CancellationCode == 'D'] <- 'Security'
cancelled_analysis$Month[cancelled_analysis$Month == 1] <- 'Juanuary'
cancelled_analysis$Month[cancelled_analysis$Month == 2] <- 'February'
cancelled_analysis$Month[cancelled_analysis$Month == 3] <- 'March'
cancelled_analysis$Month[cancelled_analysis$Month == 4] <- 'April'
cancelled_analysis$Month[cancelled_analysis$Month == 5] <- 'May'
cancelled_analysis$Month[cancelled_analysis$Month == 6] <- 'June'
cancelled_analysis$Month[cancelled_analysis$Month == 7] <- 'July'
cancelled_analysis$Month[cancelled_analysis$Month == 8] <- 'August'
cancelled_analysis$Month[cancelled_analysis$Month == 9] <- 'September'
cancelled_analysis$Month[cancelled_analysis$Month == 10] <- 'October'
cancelled_analysis$Month[cancelled_analysis$Month == 11] <- 'November'
cancelled_analysis$Month[cancelled_analysis$Month == 12] <- 'December'
na.omit(cancelled_analysis)
## # A tibble: 123,534,969 x 4
## <chr> <int> <chr> <chr>
## 1 Juanuary 335 No Uknown
## # ... with 123,534,959 more rows
To get an overview of flight cancellations, I show the number of flights for each value of CancellationCode.
cancelled_analysis %>%
group_by(CancellationCode) %>%
tally %>%
arrange(desc(n))
## # A tibble: 5 x 2
## CancellationCode n
## <chr> <int>
## 1 Uknown 122800263
## 2 Carrier 317972
## 3 Weather 267054
## 4 NAS 149079
## 5 Security 601
Results: causes of cancellation
Carrier: 317972 flights
Weather: 267054 flights
NAS: 149079 flights
Security: 601 flights
The Uknown row also includes the flights that were not canceled, which carry no cancellation code.
Next, I count the number of canceled and non-canceled flights.
cancelled_analysis %>%
group_by(Cancelled) %>%
tally %>%
arrange(desc(n))
## Cancelled n
## <chr> <int>
## 1 No 121231645
## 2 Yes 2303324
Result:
The number of canceled flights: 2303324
The number of flights not canceled: 121231645
percent(2303324/121231645)
## [1] "1.90%"
The value 1.90% is the ratio of canceled to non-canceled flights. As a share of all 123,534,969 flights, the cancellation rate is about 1.86%.
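The percentage computed above divides canceled flights by non-canceled flights. A minimal sketch, using only the two counts from the tally, of how that ratio compares with the share of all flights that were canceled (the variable names are illustrative):

```r
# Counts taken from the tally of the Cancelled variable
n_cancelled <- 2303324
n_not_cancelled <- 121231645
n_total <- n_cancelled + n_not_cancelled

# Share of all flights that were canceled
cancel_rate <- n_cancelled / n_total

# Ratio of canceled to non-canceled flights (the quantity computed in the report)
cancel_ratio <- n_cancelled / n_not_cancelled

sprintf("rate = %.2f%%, ratio = %.2f%%", 100 * cancel_rate, 100 * cancel_ratio)
```

The two numbers are close here because cancellations are rare, but only the first one is an estimate of the probability that a flight is canceled.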
Analysis of canceled flights only
Dataframe: cancelled_Flights (Contains only canceled flights)

cancelled_Flights <- cancelled_analysis%>%
filter(Cancelled == 'Yes') %>%
group_by(FlightNum, Month, CancellationCode, Cancelled) %>% as_data_frame()
na.omit(cancelled_Flights)
## # A tibble: 2,303,324 x 4
## <chr> <int> <chr> <chr>
## 1 Juanuary 126 Yes Carrier
## 4 Juanuary 618 Yes NAS
## # ... with 2,303,314 more rows
cancelledplot <- ggplot(cancelled_Flights, aes( Month, fill=CancellationCode)) +
geom_bar(aes(y = (..count..)/sum(..count..))) +
ylab("Percentages")+
xlab("Month")
cancelledplot
Results: The largest share of cancellations has an “Unknown” cause, and it is highest in
January, followed by September.
2. Which season has the fewest delays?
Data preparation
In this analysis, we look for the best season to travel, i.e. the one with the fewest departure delays.
Since the analysis concerns the delays that occurred in the four seasons, the following legend is used for clarity:
1: winter (Month: 12, 1, 2)
2: spring (Month: 3, 4, 5)
3: summer (Month: 6, 7, 8)
4: fall (Month: 9, 10, 11)

flights_effective$Month [flights_effective$Month == 1] <- 1
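The season legend can be expressed as a lookup from month number to season code. This is a minimal sketch under that legend; `season_of_month` and the sample `months` vector are hypothetical and do not come from the report's data.

```r
# Season code for each calendar month (position = month number):
# 1 = winter (12, 1, 2), 2 = spring (3-5), 3 = summer (6-8), 4 = fall (9-11)
season_of_month <- c(1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 1)

# Hypothetical Month values
months <- c(1, 4, 7, 10, 12)

# Index the lookup by month number, then attach readable labels
season_code <- season_of_month[months]
season_name <- factor(season_code, levels = 1:4,
                      labels = c("Winter", "Spring", "Summer", "Fall"))

print(data.frame(Month = months, Season = season_name))
```

Fixing the factor levels to 1:4 ties each label to its code, so a plot built on `season_name` cannot show a season under the wrong name.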
To provide a detailed analysis, we calculate the mean, standard error, confidence interval and t-test for the DepDelay variable.
meanDepDelay <- mean(flights_effective$DepDelay)
# Standard error of the mean: sd / sqrt(n)
standardDepDelay <- sd(flights_effective$DepDelay)/sqrt(length(flights_effective$DepDelay))
ci <- CI(flights_effective$DepDelay) # 95% confidence interval for the mean DepDelay; CI() is assumed to come from the Rmisc package
t.test(flights_effective$DepDelay, alternative="two.sided", conf.level = .95) # tests H0: mean = 0 (the default mu)
##
## One Sample t-test
##
## data: flights_effective$DepDelay
## t = 3155.6, df = 121230000, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 8.165247 8.175396
## sample estimates:
## mean of x
## 8.170322
meanDepDelay
## [1] 8.170322
standardDepDelay # how accurate is this point estimate?
## [1] 0.002589136
ci
## upper mean lower
## 8.175396 8.170322 8.165247
Results: The sample mean of the variable DepDelay is 8.170322 minutes. How accurate is this point estimate? The standard error answers the question: it is
about 0.0026 minutes. A more readable result is given by the 95% confidence interval (CI):
upper: 8.175396
mean: 8.170322
lower: 8.165247
Testing the mean DepDelay of a flight: the t-test also produces a p-value, which is the probability of observing a test statistic at least as extreme as the one
obtained if the null hypothesis (here, a mean of 0) were true. The p-value is compared with the significance level of the test. The result
is p < 2.2e-16, which suggests that the null hypothesis is unlikely to be true. The smaller the p-value, the stronger the evidence against the null hypothesis.
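The standard error and the confidence interval reported above are linked by a simple formula: the interval is the mean plus or minus a t quantile times the standard error. A minimal sketch on simulated delays (the data are random numbers, not the flight data) that rebuilds the interval by hand and compares it with `t.test`:

```r
set.seed(1)
# Simulated departure delays in minutes; purely illustrative values
delays <- rnorm(200, mean = 8, sd = 30)

n <- length(delays)
m <- mean(delays)

# Standard error of the mean
se <- sd(delays) / sqrt(n)

# Two-sided 95% interval: mean +/- t quantile * standard error
t_crit <- qt(0.975, df = n - 1)
ci_manual <- c(lower = m - t_crit * se, upper = m + t_crit * se)

# The same interval as reported by t.test()
ci_ttest <- as.numeric(t.test(delays, conf.level = 0.95)$conf.int)

print(ci_manual)
print(ci_ttest)
```

With more than 121 million observations the standard error becomes tiny, which is why the interval in the report is only about 0.01 minutes wide.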
seasonplot <- boxplot(formula = DepDelay ~ Month,
data = flights_effective,
main = 'Departures delays depending on the season',
xlab = 'Season',
ylab = 'Departure delay',
border = c('springgreen', 'yellow', 'orange', 'skyblue'),
names = c('Spring', 'Summer', 'Fall', 'Winter'))

Result: The plot shows the four seasons on the x-axis and the departure delay of a flight, in minutes, on the y-axis, over a range from -1000 to 2000. The longest delay occurs in
Winter, at more than 2000 minutes, while the minimum departure delay, a negative value, occurs in Spring.
3. Which aircraft manufacturer gives the best performance?
It is important to evaluate which manufacturer's planes perform best.
This analysis needs additional information, contained in the CSV file “plane-data.csv”.
plane_data <- read_csv("C:/ProjectInferential/plane-data.csv")
plane_data <- na.omit(plane_data)
To get a clear view of the manufacturers, I create a legend to identify them quickly. It highlights the manufacturers that occur most often.
Legend:
Embraer: E
Boeing: B
Airbus Industrie: A
Other: O
plane_performance <- na.omit(plane_performance)
For more information, we count the planes of each manufacturer.
plane_performance %>%
group_by(manufacturer) %>%
tally %>%
arrange(desc(n))
## manufacturer n
## <chr> <int>
## 1 B 2061
## 2 O 1397
## 3 E 588
## 4 A 434
Results:
Manufacturer B: 2061
Manufacturer O: 1397
Manufacturer E: 588
Manufacturer A: 434
I create a single data frame containing flight and plane information by joining the two data frames on the TailNum variable, which should be unique for each plane.
For easier modeling and interpretation, I draw a density plot. Kernel density plots are usually a much more effective way to view the
distribution of a variable.
mdensity <- ggplot(plane_performance, aes(x=airtime))
mdensity + geom_density(aes(colour=manufacturer, fill=manufacturer), alpha=0.3)+
theme_gray(base_size=14)
## Warning: Removed 3 rows containing non-finite values (stat_density).

Now, to explore AirTime by manufacturer, I calculate the mean, standard deviation and median and show them in plots.
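library(ggpubr)   # ggarrange()
# Bar chart of the mean AirTime per manufacturer
Meanplot <- ggplot(airtimeMean, aes(Manufacturer, x, fill=Manufacturer))+
  geom_bar(stat="identity", position="dodge") +
  xlab("Manufacturer")+
  ylab("Hourly Mean AirTime")+
  theme_gray(base_size = 14)
# The next two plots are assumed to use the same layers as Meanplot
StandardDeviationplot <- ggplot(airtimeSD, aes(Manufacturer, x, fill=Manufacturer))+
  geom_bar(stat="identity", position="dodge") +
  xlab("Manufacturer")+
  ylab("Hourly AirTime SD")+
  theme_gray(base_size = 14)
Medianplot <- ggplot(airtimemedian, aes(Manufacturer, x, fill=Manufacturer))+
  geom_bar(stat="identity", position="dodge") +
  xlab("Manufacturer")+
  ylab("Hourly AirTime Median")+
  theme_gray(base_size = 14)
ggarrange(Meanplot, StandardDeviationplot, Medianplot, ncol = 2, nrow = 2)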
## Warning: Removed 1 rows containing missing values (geom_bar).

To evaluate flight performance, I calculate performance_index by adding DepDelay and ArrDelay and dividing the sum by the flight time AirTime, so the index measures minutes of delay per minute of air time.
## # A tibble: 4,480 x 6
## # Groups: tailnum, manufacturer [4,480]
## tailnum manufacturer arrdelay depdelay airtime performance_index
## <chr> <chr> <int> <int> <int> <dbl>
## 1 N10156 E 34 35 160 0.431
## 2 N102UW A 23 20 308 0.140
## 3 N10323 B 88 70 208 0.760
## 4 N103US A 6 11 251 0.0677
## 5 N104UA B 193 185 240 1.58
## 6 N104UW A 14 23 288 0.128
## 7 N10575 E 80 85 55 3
## 8 N105UA B 3 27 206 0.146
## 9 N105UW A 45 24 256 0.270
## 10 N106US A 7 5 65 0.185
## # ... with 4,470 more rows
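The performance index in the table can be reproduced directly from its definition, (DepDelay + ArrDelay) / AirTime. A minimal sketch using three rows copied from the output above; the data frame name `demo_planes` is hypothetical.

```r
# Three rows copied from the printed tibble (tail numbers N10156, N102UW, N10575)
demo_planes <- data.frame(
  tailnum  = c("N10156", "N102UW", "N10575"),
  arrdelay = c(34, 23, 80),
  depdelay = c(35, 20, 85),
  airtime  = c(160, 308, 55)
)

# Minutes of total delay per minute spent in the air
demo_planes$performance_index <-
  (demo_planes$depdelay + demo_planes$arrdelay) / demo_planes$airtime

print(demo_planes)
```

The computed values round to 0.431, 0.140 and 3, matching the performance_index column of the corresponding rows.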
plane_performance_index <- plane_performance %>%
group_by(tailnum)%>%
dplyr::summarise(avg_performance_index = mean(performance_index, na.rm=FALSE)) %>% as_data_frame()
plane_performance_index <- na.omit(plane_performance_index)
Result:
The highest performance index belongs to TailNum N581SW, with a value of 50.000000.
The lowest performance index belongs to TailNum N5ETAA, with a value of 0.005524862.
To add the manufacturer to the per-plane index, I match the two data frames on the TailNum identifier, which is unique for each plane.
pos1 <- match(plane_performance_index$tailnum, plane_data$tailnum)
plane_performance_index$manufacturer <- plane_data$manufacturer[pos1]
plot1 <- ggplot(f, aes(x = factor(manufacturer) , y =p))+
geom_bar(colour ="blue", stat = "identity")+
ggtitle("Performance Index based on the manufacturer of plane")+
guides(fill=FALSE)+
xlab("Manufacturer") +
ylab("Performance Index") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
plot1 <- ggplotly(plot1)
plot1

Result: As the plot shows, the best performance is given by the manufacturer FEDERICK CHRISK; conversely, the worst performance is given by the manufacturer BAUMAN RANDY.
4. Which of the two carriers that operate the most flights is faster?
In this analysis, we first determine the carriers that operate the most flights, then compare the two companies to find out which is faster.
data_carrier_air <- select(ontime, Distance, AirTime, UniqueCarrier, DepDelay, ArrDelay) %>%
filter((ArrDelay > 0) & (DepDelay > 0) & (AirTime > 0)) %>%
group_by(UniqueCarrier, DepDelay, ArrDelay) %>% as_data_frame()
data_carrier_air <- na.omit(data_carrier_air)
Covariance between AirTime and Distance
cov(data_carrier_air$Distance, data_carrier_air$AirTime)
## [1] 25409.49
Covariance result: The positive value returned by the cov (covariance) function indicates that AirTime and Distance vary in the same direction: as Distance increases,
AirTime tends to increase as well. The size of the covariance depends on the units of the two variables, so the strength of the linear relationship is measured later with a correlation test.
The absolute and relative frequencies of each carrier are calculated to find the carriers that operate the most flights.
Calculation of absolute and relative frequencies of carriers
flight_to_carrier <- cbind (Frequency = table(data_carrier_air$UniqueCarrier), RelFreq = prop.table(table(data_carrier_air$UniqueCarrier)) )
flight_to_carrier

## Frequency RelFreq
## 9E 134604 0.0033550012
## AA 4505617 0.1123023864
## AQ 41702 0.0010394213
## AS 928903 0.0231528831
## B6 261440 0.0065163852
## CO 2348172 0.0585281260
## DH 199362 0.0049690927
## DL 6239652 0.1555231636
## EA 215526 0.0053719799
## EV 588954 0.0146796631
## F9 117590 0.0029309277
## FL 394566 0.0098345473
## HA 38710 0.0009648457
## HP 1176193 0.0293165799
## ML (1) 14393 0.0003587452
## MQ 1299845 0.0323986028
## NW 2702993 0.0673720301
## OH 413364 0.0103030869
## OO 893097 0.0222604195
## PA (1) 61635 0.0015362508
## PI 466058 0.0116164835
## PS 35534 0.0008856840
## TW 1092832 0.0272388091
## TZ 47440 0.0011824408
## UA 4669724 0.1163927491
## US 5217884 0.1300556228
## WN 5083808 0.1267137820
## XE 664729 0.0165683530
## YV 266076 0.0066319374
The largest number of flights is operated by DL (Delta Air Lines Inc.), with 15.55%, followed by US (US Airways Inc.), with 13.01%.
Pearson’s Correlation Test
Correlation between AirTime and Distance
cor.test(data_carrier_air$AirTime, data_carrier_air$Distance)
##
## Pearson's product-moment correlation
##
## data: data_carrier_air$AirTime and data_carrier_air$Distance
## t = 5112.5, df = 40120000, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.6278910 0.6282657
## sample estimates:
## cor
## 0.6280784
Correlation: 0.6280784
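Pearson's correlation is the covariance rescaled by the two standard deviations, which is why it is unit-free while the covariance is not. A minimal sketch on simulated distance and air-time values (random numbers, not the flight data; all names are illustrative):

```r
set.seed(42)
# Simulated distances in miles and air times in minutes
distance <- runif(500, min = 100, max = 2500)
airtime <- 0.085 * distance + rnorm(500, sd = 40)

# Correlation rebuilt from the covariance
cov_xy <- cov(distance, airtime)
r_manual <- cov_xy / (sd(distance) * sd(airtime))
r_builtin <- cor(distance, airtime)

# Converting miles to kilometres changes the covariance but not the correlation
cov_km <- cov(distance * 1.609344, airtime)
r_km <- cor(distance * 1.609344, airtime)

c(r_manual = r_manual, r_builtin = r_builtin, r_km = r_km)
c(cov_miles = cov_xy, cov_km = cov_km)
```

This is the reason the report's covariance of 25409.49 says little on its own, while the correlation of 0.628 can be read directly as a moderately strong positive linear association.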
Regression Analysis
airtime.lm.DL <- lm(formula = AirTime ~ Distance,
data = data_carrier_air,
subset = UniqueCarrier == "DL" )
summary (airtime.lm.DL)
##
## Call:
## lm(formula = AirTime ~ Distance, data = data_carrier_air, subset = UniqueCarrier ==
## "DL")
##
## Residuals:
## Min 1Q Median 3Q Max
## -381.96 -37.16 20.30 39.22 481.85
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.595e-01 3.822e-02 14.64 <2e-16 ***
## Distance 8.472e-02 4.236e-05 1999.98 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 58.28 on 6239650 degrees of freedom
## Multiple R-squared: 0.3906, Adjusted R-squared: 0.3906
## F-statistic: 4e+06 on 1 and 6239650 DF, p-value: < 2.2e-16
airtime.lm.US <- lm(formula = AirTime ~ Distance,
data = data_carrier_air,
subset = UniqueCarrier == "US" )
summary (airtime.lm.US)

##
## Call:
## lm(formula = AirTime ~ Distance, data = data_carrier_air, subset = UniqueCarrier ==
## "US")
##
## Residuals:
## Min 1Q Median 3Q Max
## -256.40 -28.59 15.26 35.60 339.54
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.590e-01 3.515e-02 -21.59 <2e-16 ***
## Distance 8.632e-02 4.552e-05 1896.59 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 52.6 on 5217882 degrees of freedom
## Multiple R-squared: 0.4081, Adjusted R-squared: 0.4081
## F-statistic: 3.597e+06 on 1 and 5217882 DF, p-value: < 2.2e-16
For the US model, Multiple R-squared is 0.4081, meaning that Distance explains about 41% of the variance in AirTime; for the DL model it is 0.3906.
Multiple R-squared never decreases when more independent variables are added, whereas Adjusted R-squared decreases if an added variable does not help the model.
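The relation between the two R-squared values can be checked on a small example. This is a minimal sketch on simulated data (not the flight data), using the standard formula adjusted R² = 1 - (1 - R²)(n - 1)/(n - p - 1), where p is the number of predictors; all variable names are illustrative.

```r
set.seed(7)
# Simulated predictor, an unrelated extra predictor, and a response
x <- runif(100, min = 100, max = 2500)
noise_var <- rnorm(100)
y <- 0.085 * x + rnorm(100, sd = 40)

fit1 <- lm(y ~ x)               # one predictor
fit2 <- lm(y ~ x + noise_var)   # adds a predictor unrelated to y

# Adjusted R-squared rebuilt by hand for the one-predictor model
r2 <- summary(fit1)$r.squared
n <- length(y)
p <- 1
adj_manual <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

c(r2_one = r2, r2_two = summary(fit2)$r.squared)
c(adj_manual = adj_manual, adj_builtin = summary(fit1)$adj.r.squared)
```

The plain R-squared of the second model cannot be lower than that of the first, even though the extra predictor carries no information; the adjustment penalizes the added parameter.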
plotDL <- plot(airtime.lm.DL)

plotDL
## NULL
plotUS <- plot(airtime.lm.US)

plotUS
## NULL
Residuals vs Fitted: a scatterplot with residuals on the y-axis and fitted values on the x-axis. This plot is used to detect non-linearity by checking the red line for questionable patterns.
The characteristics of a well-behaved residuals vs fitted plot are: the residuals bounce randomly around the 0 line; the residuals roughly form a horizontal band around the 0 line; no residual
stands out. Here the residuals are scattered randomly around the 0 line, which suggests that the assumption of a linear relationship is reasonable.
Scale-Location: this plot tells us whether the residuals are spread equally along the range of the predictor.
Normal Q-Q: this scatterplot shows whether the residuals are normally distributed. The closer the points fall to the diagonal line, the more reasonable it is to treat the residuals as normally
distributed.
DL <- subset(data_carrier_air, UniqueCarrier == 'DL')
US <- subset(data_carrier_air, UniqueCarrier == 'US')
finalplot <- plot(x=DL$Distance,
y=DL$AirTime,
xlab = 'Distance',
ylab = 'Air Time',
main = 'Air time based on distance by carrier',
pch=20,
col='dodgerblue1'
)
points (x = US$Distance,
y = US$AirTime,
pch=20,
col='forestgreen'
)
abline (airtime.lm.DL , col = 'slateblue1')
abline (airtime.lm.US, col = 'springgreen1')
legend ('topleft',
legend = c('Delta Air Lines Inc.', 'US Airways Inc.'),
col = c('dodgerblue1', 'forestgreen'),
pch = 20)
finalplot
## NULL
The plot shows the relationship between Distance and AirTime for the DL and US carriers (all flight times are positive integers). Which carrier is faster? The DL regression line is drawn in slateblue1
and the US regression line in springgreen1. Reading from the plot, it is best to travel with Delta Air Lines Inc. for distances of less than 1500 miles and with US Airways Inc. for distances greater than 1500 miles.
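The distance at which the two fitted lines cross can also be computed from the coefficients instead of being read from the plot. The sketch below uses the rounded estimates printed in the two regression summaries; the helper `predict_airtime` and the variable names are hypothetical. With these rounded values the lines cross near 824 miles, with the US line lower below that point and the DL line lower above it, so the 1500-mile reading is worth re-checking against the unrounded fit.

```r
# Rounded estimates copied from the two summary() outputs
dl <- c(intercept = 0.5595, slope = 0.08472)
us <- c(intercept = -0.7590, slope = 0.08632)

# Lines a1 + b1 * d and a2 + b2 * d meet where d = (a1 - a2) / (b2 - b1)
crossover <- unname((dl["intercept"] - us["intercept"]) /
                    (us["slope"] - dl["slope"]))

# Predicted air time in minutes for a given distance in miles
predict_airtime <- function(coefs, d) {
  unname(coefs["intercept"] + coefs["slope"] * d)
}

# Difference DL minus US: positive means the DL line predicts a longer air time
diff_500 <- predict_airtime(dl, 500) - predict_airtime(us, 500)
diff_2000 <- predict_airtime(dl, 2000) - predict_airtime(us, 2000)

round(crossover)
c(diff_500 = diff_500, diff_2000 = diff_2000)
```

The gap between the two lines is small (under two minutes at 2000 miles), well within the residual standard errors of 58.28 and 52.6 minutes, so neither carrier is clearly faster on the basis of these fits.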
Conclusion
The analyses carried out on all the years (1987-2008) confirm that the worst month to depart is January, as it has the greatest number of cancellations with an unknown reason. The average
departure delay is 8.17 minutes. The worst season for departure delays is Winter, while the minimum (negative) DepDelay occurs in Spring. The best performing plane is tail number N581SW, built by the
manufacturer FEDERICK CHRISK. The worst performing plane is tail number N5ETAA, built by the manufacturer BAUMAN RANDY. The two companies that operate the most flights are DL (Delta
Air Lines), with about 15%, and US (US Airways), with about 13%. In addition, DL is faster on flights with a distance of less than 1500 miles.
