US Commercial Flights
Francesca Pappalardo
29 January 2019
US commercial flight analysis
Introduction
This report describes the analyses performed on a data set provided by the site http://stat-computing.org/. The data come from the
Research and Innovative Technology Administration (RITA) and cover 22 years, from 1987 to 2008, with a total of about 123 million
observations and 29 variables. The main variables used are listed below with their descriptions.
Data Description (used in this analysis)
Variable          Description
Year              1987-2008
Month             1-12
ArrTime           actual arrival time (local, hhmm)
UniqueCarrier     unique carrier code
FlightNum         flight number
TailNum           plane tail number
AirTime           in minutes
ArrDelay          arrival delay, in minutes
DepDelay          departure delay, in minutes
Distance          in miles
Cancelled         was the flight cancelled?
CancellationCode  reason for cancellation (A = carrier, B = weather, C = NAS, D = security)
The dataset has a size of 12 GB compressed, so an SQLite database and a connection to it were created to make the data easier to
query during the analysis.
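library(DBI)    # dbConnect(), dbGetQuery()
library(dplyr)  # tbl(), %>% (tbl() on a database connection also needs dbplyr installed)
path_db <- "C:/ProjectInferential/ontimefly.sqlite3"
con <- dbConnect(RSQLite::SQLite(), dbname = path_db)
# Helper that runs a SQL query on the open connection
from_db <- function(sql) {
  dbGetQuery(con, sql)
}
ontime <- tbl(con, "ontimefly")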
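As a minimal sketch of how such a database can be queried, the example below builds a small in-memory SQLite table and runs an aggregate query on it. The table name toy_flights and its rows are made up for illustration; only the column names follow the data description above, and the DBI and RSQLite packages are assumed to be installed.

```r
library(DBI)

# In-memory database used only for this illustration (not the real ontimefly file)
con_toy <- dbConnect(RSQLite::SQLite(), ":memory:")

# Hypothetical rows; the column names follow the data description
toy_flights <- data.frame(
  Month = c(1L, 1L, 2L, 2L, 2L),
  Cancelled = c(0L, 1L, 0L, 0L, 1L)
)
dbWriteTable(con_toy, "toy_flights", toy_flights)

# Let the database aggregate instead of loading every row into R
per_month <- dbGetQuery(
  con_toy,
  "SELECT Month, COUNT(*) AS n_flights, SUM(Cancelled) AS n_cancelled
   FROM toy_flights GROUP BY Month ORDER BY Month"
)
dbDisconnect(con_toy)
per_month
```

Pushing the aggregation into SQL keeps memory use low, which matters when the full table holds more than a hundred million rows.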
EDA (Exploratory Data Analysis)
The analyses in this report focus on flight cancellations, delays and performance. Before extracting the specific data, it is
useful to run exploratory analyses on the entire dataset.
1. What is the main reason why flights are canceled?
To avoid errors and inconsistencies it is advisable to remove all null values.
Considering all canceled flights, i.e. those with a value of 1 in the Cancelled variable, the values of the CancellationCode variable are
recoded as follows:
Unknown: NA or empty
Carrier: A
Weather: B
NAS: C
Security: D
cancellation <- flights[flights$Cancelled == 1,]
cancellation$CancellationCode[cancellation$CancellationCode == 'NA' | cancellation$CancellationCode == ''] <- 'Unknown'
cancellation$CancellationCode[cancellation$CancellationCode == 'A'] <- 'Carrier'
cancellation$CancellationCode[cancellation$CancellationCode == 'B'] <- 'Weather'
cancellation$CancellationCode[cancellation$CancellationCode == 'C'] <- 'NAS'
cancellation$CancellationCode[cancellation$CancellationCode == 'D'] <- 'Security'
plot_cancellation <- ggplot(data = cancellation, aes(x = CancellationCode)) +
geom_bar(aes(y = (..count..)/sum(..count..), fill = CancellationCode)) +
scale_y_continuous(labels=percent)+
ggtitle("Cancellation Causes")+
ylab("% Cancellation")
plot_cancellation
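The recoding and the shares shown in the bar chart can also be sketched in base R with a named lookup vector. This is an illustrative alternative on made-up codes, not the code used in the report; the names codes, cause_labels and shares are hypothetical.

```r
# Hypothetical cancellation codes, including empty strings and the literal "NA"
codes <- c("A", "B", "", "NA", "C", "A", "D", "")

# Named lookup vector: code -> readable cause
cause_labels <- c(A = "Carrier", B = "Weather", C = "NAS", D = "Security")

# Codes without an entry in the lookup come back as NA and are labelled "Unknown"
cause <- unname(cause_labels[codes])
cause[is.na(cause)] <- "Unknown"

# Relative frequency of each cause (the quantity drawn on the y-axis)
shares <- prop.table(table(cause))
round(100 * shares, 1)
```

A lookup vector keeps the code-to-label mapping in one place, so adding or renaming a cause does not require another assignment line.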
Cause of Cancellation of flight
The most frequent cause of cancellation is Unknown, at about 80%, followed by Carrier at
around 15%.
2. Carrier Distribution
Specific analyses are also carried out by carrier, so it is important to know how flights are distributed among carriers.
carrier <- flights%>%
filter(UniqueCarrier != "NA")
# Group every carrier outside the five main ones under "Other"
main_carriers <- c("NW", "DL", "US", "AA", "UA")
carrier$UniqueCarrier[!(carrier$UniqueCarrier %in% main_carriers)] <- "Other"
carrier <- carrier %>%
group_by(UniqueCarrier) %>%
dplyr::summarize(Num = n())
The dataset contains 29 different carriers, but I analyze only the most important ones, which emerge from the analyses that follow.
Carrier description:
American Airlines Inc.: AA
Delta Air Lines Inc.: DL
US Airways Inc.: US
Northwest Airlines Inc.: NW
United Air Lines Inc.: UA
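x <- carrier$Num / sum(carrier$Num)
# Build the labels from the summarised rows so that each label matches its own slice
# (a fixed label vector would not follow the row order produced by group_by/summarize)
etichette <- paste0(carrier$UniqueCarrier, " (", round(x * 100, 1), "%)")
p <- pie(x, labels = etichette)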
Specific Analysis
This report carries out several analyses to answer the following questions:
1. In which month do the most cancellations occur?
2. Which season has the fewest delays?
3. Which aircraft manufacturer gives the best performance?
4. Which of the two carriers with the most flights is faster?
1. In which month do the most cancellations occur?
Dataframe: flights_cancelled
## # A tibble: 123,534,969 x 4
## Month FlightNum Cancelled CancellationCode
## <int> <int> <int> <chr>
## 1 1 335 0 ""
## 2 1 3231 0 ""
## 3 1 448 0 ""
## 4 1 1746 0 ""
## 5 1 3920 0 ""
## 6 1 378 0 ""
## 7 1 509 0 ""
## 8 1 535 0 ""
## 9 1 11 0 ""
## 10 1 810 0 ""
## # ... with 123,534,959 more rows
To make the data quicker to read, flights that were not canceled (Cancelled equal to 0) are labelled ‘No’ and canceled flights
(Cancelled equal to 1) are labelled ‘Yes’.
The reason for a cancellation is given by the CancellationCode variable, recoded as follows:
CancellationCode  Description
A                 Carrier
B                 Weather
C                 NAS
D                 Security
NA or ""          Unknown
cancelled_analysis <- flights_cancelled
cancelled_analysis$Cancelled[cancelled_analysis$Cancelled == 0] <- 'No' # 'No' if the flight was not canceled
cancelled_analysis$Cancelled[cancelled_analysis$Cancelled == 1] <- 'Yes' # 'Yes' if the flight was canceled
cancelled_analysis$CancellationCode[cancelled_analysis$CancellationCode == 'NA' | cancelled_analysis$CancellationCode == ''] <- 'Unknown'
cancelled_analysis$CancellationCode[cancelled_analysis$CancellationCode == 'A'] <- 'Carrier'
cancelled_analysis$CancellationCode[cancelled_analysis$CancellationCode == 'B'] <- 'Weather'
cancelled_analysis$CancellationCode[cancelled_analysis$CancellationCode == 'C'] <- 'NAS'
cancelled_analysis$CancellationCode[cancelled_analysis$CancellationCode == 'D'] <- 'Security'
# month.name is the built-in vector "January", ..., "December"; indexing it by the
# month number replaces twelve separate assignments
cancelled_analysis$Month <- month.name[cancelled_analysis$Month]
na.omit(cancelled_analysis)
## # A tibble: 123,534,969 x 4
## Month FlightNum Cancelled CancellationCode
## <chr> <int> <chr> <chr>
## 1 January 335 No Unknown
## 2 January 3231 No Unknown
## 3 January 448 No Unknown
## 4 January 1746 No Unknown
## 5 January 3920 No Unknown
## 6 January 378 No Unknown
## 7 January 509 No Unknown
## 8 January 535 No Unknown
## 9 January 11 No Unknown
## 10 January 810 No Unknown
## # ... with 123,534,959 more rows
To get an overview of the cancellations, I show the number of flights for each value of CancellationCode.
cancelled_analysis %>%
group_by(CancellationCode) %>%
tally %>%
arrange(desc(n))
## # A tibble: 5 x 2
## CancellationCode n
## <chr> <int>
## 1 Unknown 122800263
## 2 Carrier 317972
## 3 Weather 267054
## 4 NAS 149079
## 5 Security 601
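Results: causes of cancellation
Carrier: 317972 flights
Weather: 267054 flights
NAS: 149079 flights
Security: 601 flights
The Unknown group (122800263 flights) consists mostly of flights that were not canceled and therefore carry no code.
I then count the canceled and non-canceled flights.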
cancelled_analysis %>%
group_by(Cancelled) %>%
tally %>%
arrange(desc(n))
## # A tibble: 2 x 2
## Cancelled n
## <chr> <int>
## 1 No 121231645
## 2 Yes 2303324
Result: 2303324 flights were canceled and 121231645 flights were not canceled.
percent(2303324/121231645)
## [1] "1.90%"
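1.90% is the ratio of canceled to non-canceled flights. As a share of all 123534969 flights, cancellations account for about 1.86%, which is the estimated probability that a flight is canceled.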
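The figure above divides canceled flights by non-canceled flights. A small sketch, using only the two counts reported in the tally, compares that ratio with the share of canceled flights over all flights; the variable names are illustrative.

```r
# Counts taken from the tally above
n_cancelled <- 2303324
n_not_cancelled <- 121231645

# Ratio of canceled to non-canceled flights (what percent() was applied to above)
ratio <- n_cancelled / n_not_cancelled

# Share of canceled flights among all flights (an estimate of the cancellation probability)
share <- n_cancelled / (n_cancelled + n_not_cancelled)

sprintf("ratio: %.2f%%, share of all flights: %.2f%%", 100 * ratio, 100 * share)
```

The two quantities are close because cancellations are rare, but only the second one is a proportion of all flights.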
Analysis of canceled flights only
Dataframe: cancelled_Flights (Contains only canceled flights)
cancelled_Flights <- cancelled_analysis%>%
filter(Cancelled == 'Yes') %>%
group_by(FlightNum, Month, CancellationCode, Cancelled) %>% as_data_frame()
na.omit(cancelled_Flights)
## # A tibble: 2,303,324 x 4
## Month FlightNum Cancelled CancellationCode
## <chr> <int> <chr> <chr>
## 1 January 126 Yes Carrier
## 2 January 1146 Yes Carrier
## 3 January 469 Yes Carrier
## 4 January 618 Yes NAS
## 5 January 2528 Yes Carrier
## 6 January 437 Yes Carrier
## 7 January 934 Yes Carrier
## 8 January 3326 Yes Carrier
## 9 January 1402 Yes Carrier
## 10 January 2205 Yes Carrier
## # ... with 2,303,314 more rows
cancelledplot <- ggplot(cancelled_Flights, aes( Month, fill=CancellationCode)) +
geom_bar(aes(y = (..count..)/sum(..count..))) +
ylab("Percentages")+
xlab("Month")
cancelledplot
Results: The largest share of cancellations has an “Unknown” cause, and it is highest in January, followed by
September.
2. Which season has the fewest delays?
Data Preparation
This analysis looks for the best season to travel, i.e. the one with the fewest departure delays.
Since the analysis concerns the delays in the four seasons, the following legend is used for clarity:
1: winter (months 12, 1, 2)
2: spring (months 3, 4, 5)
3: summer (months 6, 7, 8)
4: fall (months 9, 10, 11)
# Season code for each month number 1..12 (1 = winter, 2 = spring, 3 = summer, 4 = fall)
season_by_month <- c(1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 1)
# Overwrite Month with its season code, as in the legend above
flights_effective$Month <- season_by_month[flights_effective$Month]
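As a complementary sketch, the month-to-season mapping from the legend can be combined with a per-season mean of the departure delay. The data below are made up and the summary is not part of the original analysis; the names toy, Season and mean_delay are hypothetical.

```r
# Season code for each month number 1..12 (1 = winter, 2 = spring, 3 = summer, 4 = fall)
season_by_month <- c(1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 1)

# Hypothetical flights: month of departure and departure delay in minutes
toy <- data.frame(
  Month = c(1, 2, 4, 7, 8, 10, 12),
  DepDelay = c(10, 20, 5, 8, 12, 3, 30)
)

# Keep the season as a labelled factor so tables and plots carry the right names
toy$Season <- factor(season_by_month[toy$Month],
                     levels = 1:4,
                     labels = c("Winter", "Spring", "Summer", "Fall"))

# Mean departure delay per season and the season with the smallest mean
mean_delay <- tapply(toy$DepDelay, toy$Season, mean)
mean_delay
names(which.min(mean_delay))
```

Storing the season as a factor with explicit labels ties each name to its code, so no separate vector of names has to be kept in step with the codes.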
For a detailed analysis, we compute the mean, standard error, confidence interval and t-test for the DepDelay variable.
library(Rmisc) # CI() is assumed to come from Rmisc; the report does not show its library calls
meanDepDelay <- mean(flights_effective$DepDelay)
standardDepDelay <- sd(flights_effective$DepDelay)/sqrt(length(flights_effective$DepDelay)) # standard error of the mean
ci <- CI(flights_effective$DepDelay) # 95% confidence interval for the mean DepDelay
t.test(flights_effective$DepDelay, alternative="two.sided", conf.level = .95) # default null hypothesis: mu = 0
##
## One Sample t-test
##
## data: flights_effective$DepDelay
## t = 3155.6, df = 121230000, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 8.165247 8.175396
## sample estimates:
## mean of x
## 8.170322
meanDepDelay
## [1] 8.170322
standardDepDelay # how accurate is this point estimate?
## [1] 0.002589136
ci
## upper mean lower
## 8.175396 8.170322 8.165247
Results: The sample mean of the DepDelay variable is 8.170322 minutes.
How accurate is this point estimate? The standard error answers the question: it is about 0.0026 minutes.
A more readable result is the 95% confidence interval (CI): lower 8.165247, mean 8.170322, upper 8.175396.
Testing the mean DepDelay of a flight: the t-test also returns a p-value, which is the probability, assuming the null hypothesis is true, of
observing a test statistic at least as extreme as the one obtained. The p-value is compared with the significance level of the test. Here
p < 2.2e-16, which suggests that the null hypothesis (a true mean of 0) is very unlikely to be true. The smaller the p-value, the more
confidently we can reject the null hypothesis.
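The relationship between the mean, the standard error, the confidence interval and the t statistic can be checked with a short sketch that starts from the two values printed above. It assumes the normal quantile is an adequate stand-in for the t quantile, which is reasonable with about 121 million observations; the variable names are illustrative.

```r
# Values reported in the output above
mean_dep <- 8.170322
se_dep <- 0.002589136

# 97.5% quantile of the standard normal, used in place of the t quantile
z <- qnorm(0.975)

# 95% confidence interval: mean plus or minus quantile times standard error
ci_lower <- mean_dep - z * se_dep
ci_upper <- mean_dep + z * se_dep

# t statistic for the null hypothesis that the true mean is 0
t_stat <- mean_dep / se_dep

round(c(lower = ci_lower, upper = ci_upper), 4)
round(t_stat, 1)
```

The interval is very narrow because the standard error shrinks with the square root of the sample size, not because the delays themselves vary little.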
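seasonplot <- boxplot(formula = DepDelay ~ Month,
data = flights_effective,
main = 'Departure delays depending on the season',
xlab = 'Season',
ylab = 'Departure delay',
# colours and names follow the season codes 1..4 (winter, spring, summer, fall)
border = c('skyblue', 'springgreen', 'yellow', 'orange'),
names = c('Winter', 'Spring', 'Summer', 'Fall'))
Result: The plot shows the four seasons on the x-axis and the departure delay in minutes on the y-axis, ranging from about -1000 to 2000. In the
original figure the box labelled Winter contains the longest delay, more than 2000 minutes, and the box labelled Spring contains the minimum
departure delay, a negative value. The original call passed names = c('Spring', 'Summer', 'Fall', 'Winter'), which does not follow the code
order 1 to 4, so those two labels correspond to season codes 4 (fall) and 1 (winter); the names argument above is corrected to match the legend.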
3. Which aircraft manufacturer gives the best performance?
It is important to evaluate which manufacturer's planes perform best.
To perform this analysis we need additional information, contained in the CSV file “plane-data.csv”.
plane_data <- read_csv("C:/ProjectInferential/plane-data.csv")
plane_data <- na.omit(plane_data)
To get a clear view of the manufacturers, I create a legend that identifies them quickly. Only the manufacturers that occur most often are given their own code.
Legend:
Embraer: E
Boeing: B
Airbus Industrie: A
Other: O
plane_performance <- na.omit(plane_performance)
For more information, we count the planes of each manufacturer.
plane_performance %>%
group_by(manufacturer) %>%
tally %>%
arrange(desc(n))
## # A tibble: 4 x 2
## manufacturer n
## <chr> <int>
## 1 B 2061
## 2 O 1397
## 3 E 588
## 4 A 434
Results:
Manufacturer B: 2061
Manufacturer O: 1397
Manufacturer E: 588
Manufacturer A: 434
I create a single data frame containing flight and plane information by joining the two data frames on the TailNum variable, which identifies each aircraft.
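The join itself is not shown in the report. A minimal base R sketch of such a join, on made-up rows, could look like this; the data frames flights_toy and planes_toy are hypothetical, and the tail number N999ZZ is invented to show what happens to a flight with no plane record.

```r
# Hypothetical flight rows: tail number and air time in minutes
flights_toy <- data.frame(
  tailnum = c("N10156", "N102UW", "N10156", "N999ZZ"),
  airtime = c(160, 308, 150, 90)
)

# Hypothetical plane rows: one row per aircraft with its manufacturer code
planes_toy <- data.frame(
  tailnum = c("N10156", "N102UW"),
  manufacturer = c("E", "A")
)

# Inner join on the tail number: flights without a plane record are dropped
joined <- merge(flights_toy, planes_toy, by = "tailnum")
joined
```

Because one aircraft flies many times, the join is many-to-one: the plane table must have a single row per tail number, otherwise flights are duplicated.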
For easier modelling and interpretation, I plot the density of AirTime by manufacturer. Kernel density plots are usually an effective way to view the
distribution of a variable.
mdensity <- ggplot(plane_performance, aes(x=airtime))
mdensity + geom_density(aes(colour=manufacturer, fill=manufacturer), alpha=0.3)+
theme_gray(base_size=14)
## Warning: Removed 3 rows containing non-finite values (stat_density).
To explore AirTime by manufacturer, I compute the mean, standard deviation and median and show them in plots.
Meanplot <- ggplot(airtimeMean, aes(Manufacturer, x, fill=Manufacturer))+
geom_bar(stat="identity", position="dodge") +
xlab("Manufacturer")+
ylab("Hourly Mean AirTime")+
theme_gray(base_size = 14)
StandardDeviationplot <-ggplot(airtimeSD, aes(Manufacturer, x, fill=Manufacturer))+
geom_bar(stat="identity", position="dodge") +
xlab("Manufacturer")+
ylab("Hourly AirTime SD")+
theme_gray(base_size = 14)
Medianplot <- ggplot(airtimemedian, aes(Manufacturer, x, fill=Manufacturer))+
geom_bar(stat="identity", position="dodge") +
xlab("Manufacturer")+
ylab("Hourly AirTime Median")+
theme_gray(base_size = 14)
ggarrange(Meanplot, StandardDeviationplot, Medianplot, ncol = 2, nrow=2);
## Warning: Removed 1 rows containing missing values (geom_bar).
## Warning: Removed 1 rows containing missing values (geom_bar).
## Warning: Removed 1 rows containing missing values (geom_bar).
To evaluate flight performance, performance_index is computed as the sum of DepDelay and ArrDelay divided by the flight time AirTime.
## # A tibble: 4,480 x 6
## # Groups: tailnum, manufacturer [4,480]
## tailnum manufacturer arrdelay depdelay airtime performance_index
## <chr> <chr> <int> <int> <int> <dbl>
## 1 N10156 E 34 35 160 0.431
## 2 N102UW A 23 20 308 0.140
## 3 N10323 B 88 70 208 0.760
## 4 N103US A 6 11 251 0.0677
## 5 N104UA B 193 185 240 1.58
## 6 N104UW A 14 23 288 0.128
## 7 N10575 E 80 85 55 3
## 8 N105UA B 3 27 206 0.146
## 9 N105UW A 45 24 256 0.270
## 10 N106US A 7 5 65 0.185
## # ... with 4,470 more rows
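The index in the table above can be reproduced from its own columns. The sketch below uses the first three rows of the table; the data frame name toy is illustrative.

```r
# First three rows of the table above
toy <- data.frame(
  tailnum = c("N10156", "N102UW", "N10323"),
  arrdelay = c(34, 23, 88),
  depdelay = c(35, 20, 70),
  airtime = c(160, 308, 208)
)

# Total delay (arrival plus departure) per minute of air time
toy$performance_index <- (toy$arrdelay + toy$depdelay) / toy$airtime

# Three significant digits, as printed in the tibble
signif(toy$performance_index, 3)
```

Because the index is total delay per minute of air time, a smaller value means less delay relative to the time spent flying.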
plane_performance_index <- plane_performance %>%
group_by(tailnum)%>%
dplyr::summarise(avg_performance_index = mean(performance_index, na.rm=FALSE)) %>% as_data_frame()
plane_performance_index <- na.omit(plane_performance_index)
Result: The highest performance index belongs to TailNum N581SW, with a value of 50.000000.
The lowest performance index belongs to TailNum N5ETAA, with a value of 0.005524862.
To combine the plane and flight information, the two data frames are matched on the TailNum index.
pos1 <- match(plane_performance_index$tailnum, plane_data$tailnum)
plane_performance_index$manufacturer <- plane_data$manufacturer[pos1]
plot1 <- ggplot(f, aes(x = factor(manufacturer) , y =p))+
geom_bar(colour ="blue", stat = "identity")+
ggtitle("Performance Index based on the manufacturer of plane")+
guides(fill=FALSE)+
xlab("Manufacturer") +
ylab("Performance Index") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
plot1 <- ggplotly(plot1)
plot1
Result: As the plot shows, the best performance is given by the manufacturer FEDERICK CHRISK; conversely, the worst performance is given by the manufacturer BAUMAN RANDY.
4. Which of the two carriers with the most flights is faster?
In this analysis we first determine which carriers operate the most flights, then compare the two and find out which is faster.
data_carrier_air <- select(ontime, Distance, AirTime, UniqueCarrier, DepDelay, ArrDelay) %>%
filter((ArrDelay > 0) & (DepDelay > 0) & (AirTime > 0)) %>%
group_by(UniqueCarrier, DepDelay, ArrDelay) %>% as_data_frame()
data_carrier_air <- na.omit(data_carrier_air)
Covariance between AirTime and Distance
cov(data_carrier_air$Distance, data_carrier_air$AirTime)
## [1] 25409.49
Covariance result: The positive value returned by the cov (covariance) function indicates that AirTime and Distance vary in the same direction: as
Distance increases, AirTime tends to increase too. The magnitude of a covariance depends on the units of the two variables, so it does not by itself measure the strength of the linear relationship.
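A small sketch on made-up values shows why the sign of the covariance is informative while its size is not: changing the unit of Distance rescales the covariance but leaves the correlation unchanged. The vectors and variable names are hypothetical.

```r
# Hypothetical flights: distance in miles and air time in minutes
distance <- c(200, 500, 800, 1200, 2000)
airtime <- c(45, 80, 120, 170, 260)

# Covariance and correlation with distance in miles
cov_miles <- cov(distance, airtime)
cor_miles <- cor(distance, airtime)

# The same quantities with distance converted to kilometres
km_per_mile <- 1.609344
cov_km <- cov(distance * km_per_mile, airtime)
cor_km <- cor(distance * km_per_mile, airtime)

# The correlation is the covariance divided by the two standard deviations
cor_from_cov <- cov_miles / (sd(distance) * sd(airtime))

c(cov_miles = cov_miles, cov_km = cov_km, cor_miles = cor_miles, cor_km = cor_km)
```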
The absolute and relative frequencies of each carrier are computed to find the carriers that operate the most flights.
Calculation of absolute and relative frequencies by carrier
flight_to_carrier <- cbind (Frequency = table(data_carrier_air$UniqueCarrier), RelFreq = prop.table(table(data_carrier_air$UniqueCarrier)) )
flight_to_carrier
## Frequency RelFreq
## 9E 134604 0.0033550012
## AA 4505617 0.1123023864
## AQ 41702 0.0010394213
## AS 928903 0.0231528831
## B6 261440 0.0065163852
## CO 2348172 0.0585281260
## DH 199362 0.0049690927
## DL 6239652 0.1555231636
## EA 215526 0.0053719799
## EV 588954 0.0146796631
## F9 117590 0.0029309277
## FL 394566 0.0098345473
## HA 38710 0.0009648457
## HP 1176193 0.0293165799
## ML (1) 14393 0.0003587452
## MQ 1299845 0.0323986028
## NW 2702993 0.0673720301
## OH 413364 0.0103030869
## OO 893097 0.0222604195
## PA (1) 61635 0.0015362508
## PI 466058 0.0116164835
## PS 35534 0.0008856840
## TW 1092832 0.0272388091
## TZ 47440 0.0011824408
## UA 4669724 0.1163927491
## US 5217884 0.1300556228
## WN 5083808 0.1267137820
## XE 664729 0.0165683530
## YV 266076 0.0066319374
Among the flights kept by the filter above (positive ArrDelay, DepDelay and AirTime), the largest number is operated by DL, Delta Air Lines Inc., with 15.55%, followed by US, US Airways Inc., with 13.01%.
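The frequency table above combines table() and prop.table(). The same steps on a made-up vector of carrier codes, followed by picking the two most frequent carriers, can be sketched as follows; the vector carriers and the other names are illustrative.

```r
# Hypothetical carrier code of six flights
carriers <- c("DL", "US", "DL", "AA", "US", "DL")

# Absolute and relative frequencies, side by side as in the report
freq <- table(carriers)
rel_freq <- prop.table(freq)
flight_to_carrier_toy <- cbind(Frequency = freq, RelFreq = rel_freq)
flight_to_carrier_toy

# The two carriers with the most flights
top_two <- names(sort(freq, decreasing = TRUE))[1:2]
top_two
```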
Pearson’s Correlation Test
Correlation between AirTime and Distance
cor.test(data_carrier_air$AirTime, data_carrier_air$Distance)
##
## Pearson's product-moment correlation
##
## data: data_carrier_air$AirTime and data_carrier_air$Distance
## t = 5112.5, df = 40120000, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.6278910 0.6282657
## sample estimates:
## cor
## 0.6280784
Correlation: 0.6280784
Regression Analysis
airtime.lm.DL <- lm(formula = AirTime ~ Distance,
data = data_carrier_air,
subset = UniqueCarrier == "DL" )
summary (airtime.lm.DL)
##
## Call:
## lm(formula = AirTime ~ Distance, data = data_carrier_air, subset = UniqueCarrier ==
## "DL")
##
## Residuals:
## Min 1Q Median 3Q Max
## -381.96 -37.16 20.30 39.22 481.85
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.595e-01 3.822e-02 14.64 <2e-16 ***
## Distance 8.472e-02 4.236e-05 1999.98 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 58.28 on 6239650 degrees of freedom
## Multiple R-squared: 0.3906, Adjusted R-squared: 0.3906
## F-statistic: 4e+06 on 1 and 6239650 DF, p-value: < 2.2e-16
airtime.lm.US <- lm(formula = AirTime ~ Distance,
data = data_carrier_air,
subset = UniqueCarrier == "US" )
summary (airtime.lm.US)
##
## Call:
## lm(formula = AirTime ~ Distance, data = data_carrier_air, subset = UniqueCarrier ==
## "US")
##
## Residuals:
## Min 1Q Median 3Q Max
## -256.40 -28.59 15.26 35.60 339.54
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.590e-01 3.515e-02 -21.59 <2e-16 ***
## Distance 8.632e-02 4.552e-05 1896.59 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 52.6 on 5217882 degrees of freedom
## Multiple R-squared: 0.4081, Adjusted R-squared: 0.4081
## F-statistic: 3.597e+06 on 1 and 5217882 DF, p-value: < 2.2e-16
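Multiple R-squared is 0.4081 for the US model (0.3906 for DL): Distance explains about 41% (39%) of the variance in AirTime.
Multiple R-squared never decreases when more independent variables are added, whereas Adjusted R-squared decreases if an added variable does not help the model. With a single predictor and millions of observations, the two values coincide here.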
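How the two R-squared values relate can be sketched on a tiny made-up sample. With one predictor, R-squared equals the squared Pearson correlation, and the adjusted value follows from the sample size n and the number of predictors p; the data and names below are hypothetical.

```r
# Hypothetical sample: distance in miles (x) and air time in minutes (y)
x <- c(200, 500, 800, 1200, 2000, 2500)
y <- c(50, 75, 130, 160, 270, 300)

fit <- lm(y ~ x)
s <- summary(fit)

# With a single predictor, R-squared is the squared correlation
r2_from_cor <- cor(x, y)^2

# Adjusted R-squared penalises the number of predictors p
n <- length(y)
p <- 1
adj_manual <- 1 - (1 - s$r.squared) * (n - 1) / (n - p - 1)

c(r.squared = s$r.squared, adj.r.squared = s$adj.r.squared, adj_manual = adj_manual)
```

On six observations the adjustment is visible; on millions of rows the factor (n - 1) / (n - p - 1) is practically 1, which is why both values print as 0.4081 above.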
plotDL <- plot(airtime.lm.DL)
Report Statistical Analysis
plotDL
## NULL
plotUS <- plot(airtime.lm.US)
plotUS
## NULL
Residuals vs Fitted: a scatterplot of the residuals on the y-axis against the fitted values on the x-axis. It is used to detect non-linearity by looking for a questionable pattern in the red line.
A well-behaved residuals vs fitted plot has these characteristics: the residuals bounce randomly around the 0 line; the residuals roughly form a horizontal band around the 0 line; no residual
stands out. Here the residuals are spread randomly around the 0 line, which suggests that the assumption of a linear relationship is reasonable.
Scale-Location: this plot shows whether the residuals are spread equally along the range of the predictor.
Normal Q-Q: this plot shows whether the residuals are normally distributed. The closer the points fall to the diagonal line, the more reasonable it is to treat the residuals as normally
distributed.
DL <- subset(data_carrier_air, UniqueCarrier == 'DL')
US <- subset(data_carrier_air, UniqueCarrier == 'US')
finalplot <- plot(x=DL$Distance,
y=DL$AirTime,
xlab = 'Distance',
ylab = 'Air Time',
main = 'Air time based on distance by carrier',
pch=20,
col='dodgerblue1'
)
points (x = US$Distance,
y = US$AirTime,
pch=20,
col='forestgreen'
)
abline (airtime.lm.DL , col = 'slateblue1')
abline (airtime.lm.US, col = 'springgreen1')
legend ('topleft',
legend = c('Delta Air Lines Inc.', 'US Airways Inc.'),
col = c('dodgerblue1', 'forestgreen'),
pch = 20)
finalplot
## NULL
The plot shows the relationship between Distance and AirTime for the carriers DL and US (flight times are all positive integers).
Which carrier is faster? The DL regression line is drawn in slateblue1 and the US line in springgreen1. Reading from the plot, it is best to travel with Delta Air Lines Inc. for distances of less than 1500 miles and with US Airways Inc. for distances greater than 1500 miles.
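As a cross-check on the plot reading, the distance at which the two fitted lines intersect can be computed from the coefficients printed in the two summary() outputs. This is a sketch that assumes the rounded coefficients shown above are accurate enough; the function and variable names are illustrative.

```r
# Coefficients reported in the regression summaries above
dl_intercept <- 0.5595
dl_slope <- 0.08472
us_intercept <- -0.7590
us_slope <- 0.08632

# Predicted air time (minutes) for a given distance (miles)
pred_dl <- function(d) dl_intercept + dl_slope * d
pred_us <- function(d) us_intercept + us_slope * d

# Distance at which the two lines cross: equal predictions
crossover <- (dl_intercept - us_intercept) / (us_slope - dl_slope)

round(crossover)
round(c(DL_500 = pred_dl(500), US_500 = pred_us(500),
        DL_2000 = pred_dl(2000), US_2000 = pred_us(2000)), 1)
```

With these rounded coefficients the lines cross at roughly 824 miles: US has the lower predicted air time below that distance and DL above it. The gap is under two minutes at 2000 miles, so the difference between the carriers is small and does not match the 1500-mile reading from the plot.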
Conclusion
The analyses carried out on all the years (1987-2008) indicate that the worst month to depart is January, which has the largest number of cancellations with an unknown reason. The average
departure delay is 8.17 minutes. The worst season for departure delays is Winter, although Winter also shows negative DepDelay values. The best performing aircraft is N581SW, built by the
manufacturer FEDERICK CHRISK. The worst performing aircraft is N5ETAA, built by the manufacturer BAUMAN RANDY. The two carriers that operate the most flights are DL (Delta
Air Lines) with 15% and US (US Airways) with 13%. In addition, the plot suggests that DL is faster on flights shorter than 1,500 miles.

Report Statistical Analysis

  • 1. US Commercial Flights Francesca Pappalardo 29 gennaio 2019 Us commercial flight analysis Introduction This report describes and detects the analysis pages performed on a data set provided by the site http://stat-computing.org/. Data comes from Research and Innovative Technology Administration (RITA). Data includes 22 years from the year 1987 to the year 2007 with a total of 123 million observations and 29 different variables. I highlight the main variables used with the related description. Data Description (used in this analysis) Year 1987-2008 Month 1-12 ArrTime actual arrival time (local, hhmm) UniqueCarrier unique carrier code FlightNum flight number TailNum plane tail number AirTime in minutes ArrDelay arrival delay, in minutes DepDelay departure delay, in minutes Distance in miles Cancelled was the flight cancelled? CancellationCode reason for cancellation (A = carrier, B = weather, C = NAS, D = security) The dataset has a size of 12 GB compressed for which it was appropriate to create an SQlite database and a connection to facilitate the use of data to perform the analysis. path_db ="C:/ProjectInferential/ontimefly.sqlite3" con <- dbConnect(RSQLite::SQLite(), dbname=path_db) from_db <- function(sql) { dbGetQuery(ontimefly, sql) } ontime <- tbl(con, "ontimefly") *EDA (Exploration Data Analysis) The analyzes performed in this report focus on cancellation, delays and performance of an flight. Before obtaining the specific data, it is good to perform cognitive analyzes on the entire dataset. 1. What is the main reason why flights are canceled? To avoid errors of inconsistencies it is advisable to eliminate all null values. 
By analyzing all the canceled flights which have a value of 1 within the Cancelled variable, the values of the CancellationCode variable have been set in the following order: Uknown: NA or empty Carrier: A Weather: B NAS: C Security: D cancellation <- flights[flights$Cancelled == 1,] cancellation$CancellationCode[ cancellation$CancellationCode == 'NA' | cancellation$CancellationCode == ''] <- 'Uk nown' cancellation$CancellationCode[cancellation$CancellationCode == 'A'] <- 'Carrier' cancellation$CancellationCode[cancellation$CancellationCode == 'B'] <- 'Weather' cancellation$CancellationCode[cancellation$CancellationCode == 'C'] <- 'NAS' cancellation$CancellationCode[cancellation$CancellationCode == 'D'] <- 'Security' plot_cancellation <- ggplot( data = cancellation, aes(x = CancellationCode))+ geom_bar(aes(y =(..count..)/sum(..co unt..), fill=CancellationCode))+ scale_y_continuous(labels=percent)+ ggtitle("Cancellation Causes")+ ylab("% Cancellation") plot_cancellation
  • 2. Cause of Cancellation of flight The largest percentage that represents the cause of cancellation of a flight is Uknown with a value of about 80%, followed by Carrier with a value of around 15%. 2. Distribution Carrier Specific analyzes have also been carried out on Carrier types, so it is important to know Carrier Distribution. carrier <- flights%>% filter(UniqueCarrier != "NA") carrier$UniqueCarrier[carrier$UniqueCarrier != "NW" & carrier$UniqueCarrier != "DL" & carrier$UniqueCarrier != "US" & carrier$UniqueCarrier != "AA" & carrier$UniqueCarrier != "UA"] <- "Other" carrier <- carrier %>% group_by(UniqueCarrier) %>% dplyr::summarize(Num = n()) In the dataset there are 29 different carrier but I analyze olny the most important frutto delle analisi successive Description Carrier American Airlines Inc. : AA Delta Air Lines Inc. : DL US Airways Inc. : US Northwest Airlines Inc.: NW United Air Lines Inc.: UA uniquec <- c('AA', 'DL', 'US', 'NW', 'UA','Other') x=carrier$Num/sum(carrier$Num) etichette <- paste(carrier$UniqueCarrier," (",round(x*100, 1), "%)") p <- pie(carrier$Num/sum(carrier$Num), labels=uniquec)
Specific Analysis
This report carries out several analyses to answer the following questions:
1. In which month did the most cancellations occur?
2. Which season has the fewest delays?
3. Which aircraft manufacturer allows better performance?
4. Which of the two carriers that make the most flights is faster?
1. In which month did the most cancellations occur?
Dataframe: flights_cancelled
## # A tibble: 123,534,969 x 4
##    Month FlightNum Cancelled CancellationCode
##    <int>     <int>     <int> <chr>
##  1     1       335         0 ""
##  2     1      3231         0 ""
##  3     1       448         0 ""
##  4     1      1746         0 ""
##  5     1      3920         0 ""
##  6     1       378         0 ""
##  7     1       509         0 ""
##  8     1       535         0 ""
##  9     1        11         0 ""
## 10     1       810         0 ""
## # ... with 123,534,959 more rows
To make the data quicker and clearer to read, I label the flights that were not canceled (Cancelled equal to 0) as 'No' and the canceled flights (Cancelled equal to 1) as 'Yes'. The reason for each cancellation is given by the CancellationCode variable, with the values listed below.
CancellationCode | DescriptionCancellationCode
A | Carrier
B | Weather
C | NAS
D | Security
NA or "" | Unknown
cancelled_analysis <- flights_cancelled
cancelled_analysis$Cancelled[cancelled_analysis$Cancelled == 0] <- 'No' # assign 'No' if the flight was not canceled
cancelled_analysis$Cancelled[cancelled_analysis$Cancelled == 1] <- 'Yes' # assign 'Yes' if the flight was canceled
cancelled_analysis$CancellationCode[cancelled_analysis$CancellationCode == 'NA' | cancelled_analysis$CancellationCode == ''] <- 'Unknown'
cancelled_analysis$CancellationCode[cancelled_analysis$CancellationCode == 'A'] <- 'Carrier'
cancelled_analysis$CancellationCode[cancelled_analysis$CancellationCode == 'B'] <- 'Weather'
cancelled_analysis$CancellationCode[cancelled_analysis$CancellationCode == 'C'] <- 'NAS'
cancelled_analysis$CancellationCode[cancelled_analysis$CancellationCode == 'D'] <- 'Security'
# replace the month number (1-12) with the month name
cancelled_analysis$Month <- month.name[cancelled_analysis$Month]
na.omit(cancelled_analysis)
## # A tibble: 123,534,969 x 4
##    Month   FlightNum Cancelled CancellationCode
##    <chr>       <int> <chr>     <chr>
##  1 January       335 No        Unknown
##  2 January      3231 No        Unknown
##  3 January       448 No        Unknown
##  4 January      1746 No        Unknown
##  5 January      3920 No        Unknown
##  6 January       378 No        Unknown
##  7 January       509 No        Unknown
##  8 January       535 No        Unknown
##  9 January        11 No        Unknown
## 10 January       810 No        Unknown
## # ... with 123,534,959 more rows
To get an overview of flight cancellations, I show the number of flights for each value of CancellationCode. The Unknown group also includes the flights that were not canceled, because their code is empty.
cancelled_analysis %>% group_by(CancellationCode) %>% tally %>% arrange(desc(n))
## # A tibble: 5 x 2
##   CancellationCode         n
##   <chr>                <int>
## 1 Unknown          122800263
## 2 Carrier             317972
## 3 Weather             267054
## 4 NAS                 149079
## 5 Security               601
Results: causes of cancellations
Carrier: 317972 flights
Weather: 267054 flights
NAS: 149079 flights
Security: 601 flights
I count the number of canceled and non-canceled flights.
cancelled_analysis %>% group_by(Cancelled) %>% tally %>% arrange(desc(n))
## # A tibble: 2 x 2
##   Cancelled         n
##   <chr>         <int>
## 1 No        121231645
## 2 Yes         2303324
Result:
The number of canceled flights: 2303324
The number of flights not canceled: 121231645
percent(2303324/121231645)
## [1] "1.90%"
There are 1.90 canceled flights for every 100 flights that were not canceled. As a share of all 123,534,969 flights, the cancellation rate is about 1.86%.
Analysis of canceled flights only
Dataframe: cancelled_Flights (contains only canceled flights)
cancelled_Flights <- cancelled_analysis %>%
  filter(Cancelled == 'Yes') %>%
  group_by(FlightNum, Month, CancellationCode, Cancelled) %>%
  as_data_frame()
na.omit(cancelled_Flights)
## # A tibble: 2,303,324 x 4
##    Month   FlightNum Cancelled CancellationCode
##    <chr>       <int> <chr>     <chr>
##  1 January       126 Yes       Carrier
##  2 January      1146 Yes       Carrier
##  3 January       469 Yes       Carrier
##  4 January       618 Yes       NAS
##  5 January      2528 Yes       Carrier
##  6 January       437 Yes       Carrier
##  7 January       934 Yes       Carrier
##  8 January      3326 Yes       Carrier
##  9 January      1402 Yes       Carrier
## 10 January      2205 Yes       Carrier
## # ... with 2,303,314 more rows
# stacked bars: share of all canceled flights per month, colored by cause
cancelledplot <- ggplot(cancelled_Flights, aes(Month, fill = CancellationCode)) +
  geom_bar(aes(y = (..count..)/sum(..count..))) +
  ylab("Percentages") +
  xlab("Month")
cancelledplot
Results: The largest share of cancellations has an "Unknown" cause, and this share is highest in January, followed by September.
2. Which season has the fewest delays?
Data Preparation
In this analysis we look for the best season to travel, that is, the one with the fewest departure delays. Because the analysis concerns the delays in the four seasons, the following legend is used for clarity:
Legend:
1: winter (Month: 1, 2, 12)
2: spring (Month: 3, 4, 5)
3: summer (Month: 6, 7, 8)
4: fall (Month: 9, 10, 11)
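The legend above can be sketched as a lookup vector indexed by month number, which maps every month in one step and does not depend on the order of the assignments. This is an illustrative sketch in base R: `months` is hypothetical sample data, and `season_of_month` is a name introduced here.

```r
# Season code for each calendar month, indexed by month number (1-12):
# 1 = winter, 2 = spring, 3 = summer, 4 = fall.
season_of_month <- c(1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 1)

# Hypothetical month values standing in for flights_effective$Month.
months <- c(1, 2, 3, 6, 9, 11, 12)

season_code <- season_of_month[months]

# A factor keeps the code order and attaches readable labels to it.
season <- factor(season_code, levels = 1:4,
                 labels = c("Winter", "Spring", "Summer", "Fall"))

print(data.frame(Month = months, Season = season))
```

Storing the season as a factor also means that plots grouped by season pick up the labels in the order of the codes.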
# recode the month as a season: 1 = winter, 2 = spring, 3 = summer, 4 = fall
flights_effective$Month[flights_effective$Month == 1] <- 1
flights_effective$Month[flights_effective$Month == 2] <- 1
flights_effective$Month[flights_effective$Month == 12] <- 1
flights_effective$Month[flights_effective$Month == 3] <- 2
flights_effective$Month[flights_effective$Month == 4] <- 2
flights_effective$Month[flights_effective$Month == 5] <- 2
flights_effective$Month[flights_effective$Month == 6] <- 3
flights_effective$Month[flights_effective$Month == 7] <- 3
flights_effective$Month[flights_effective$Month == 8] <- 3
flights_effective$Month[flights_effective$Month == 9] <- 4
flights_effective$Month[flights_effective$Month == 10] <- 4
flights_effective$Month[flights_effective$Month == 11] <- 4
To provide a detailed analysis, we calculate the mean, standard error, confidence interval and t-test for the DepDelay variable.
meanDepDelay <- mean(flights_effective$DepDelay)
# standard error of the mean
standardDepDelay <- sd(flights_effective$DepDelay)/sqrt(length(flights_effective$DepDelay))
ci <- CI(flights_effective$DepDelay)
# a 95% confidence interval for the mean DepDelay is given by
t.test(flights_effective$DepDelay, alternative = "two.sided", conf.level = .95) # mu defaults to 0
##
##  One Sample t-test
##
## data:  flights_effective$DepDelay
## t = 3155.6, df = 121230000, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  8.165247 8.175396
## sample estimates:
## mean of x
##  8.170322
meanDepDelay
## [1] 8.170322
standardDepDelay # how accurate is this point estimate?
## [1] 0.002589136
ci
##    upper     mean    lower
## 8.175396 8.170322 8.165247
Results:
The sample mean of the DepDelay variable is 8.170322 minutes.
How accurate is this point estimate (the mean)? I answer the question with the standard error.
The standard error is about 0.0026 minutes.
A more readable result is given by the confidence interval CI:
upper: 8.175396
mean: 8.170322
lower: 8.165247
Testing the mean DepDelay of a flight: the t-test also produces the p-value, which is the probability of observing a sample mean at least as far from the hypothesized value as the one obtained, assuming that the null hypothesis (here, a true mean of 0) is true. The p-value is compared with the significance level of the test. The result is p < 2.2e-16, which suggests that the null hypothesis is very unlikely to be true. The smaller the p-value, the more confidently we can reject the null hypothesis.
# boxes are drawn in season-code order, so names and colors follow the legend:
# 1 = winter, 2 = spring, 3 = summer, 4 = fall
seasonplot <- boxplot(formula = DepDelay ~ Month,
  data = flights_effective,
  main = 'Departure delays depending on the season',
  xlab = 'Season',
  ylab = 'Departure delay',
  border = c('skyblue', 'springgreen', 'yellow', 'orange'),
  names = c('Winter', 'Spring', 'Summer', 'Fall'))
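The standard error and the 95% confidence interval reported for DepDelay follow from the usual formulas, mean ± critical value × standard error. With the report's values, 8.170322 ± 1.96 × 0.002589 gives an interval from about 8.1652 to 8.1754, which agrees with the t-test output. The sketch below reproduces the calculation by hand in base R on a small hypothetical sample (`dep_delay` is not the report's data) and compares it with `t.test()`.

```r
# Hypothetical departure delays in minutes (not the report's data).
dep_delay <- c(-3, 0, 5, 12, 7, 45, -1, 9, 20, 2)

n <- length(dep_delay)
mean_delay <- mean(dep_delay)

# Standard error of the mean: sample standard deviation over sqrt(n).
se <- sd(dep_delay) / sqrt(n)

# 95% confidence interval from the t distribution with n - 1 degrees of freedom.
t_crit <- qt(0.975, df = n - 1)
ci <- c(lower = mean_delay - t_crit * se, upper = mean_delay + t_crit * se)

# The same interval as computed by t.test().
tt <- t.test(dep_delay, conf.level = 0.95)

print(ci)
print(tt$conf.int)
```

For a sample as large as the one in the report, the t critical value is practically equal to the normal value 1.96.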
Result: The plot shows the four seasons on the x-axis and the departure delay of a flight, in minutes, on the y-axis, in the range from -1000 to 2000. The longest delay, more than 2000 minutes, occurs in Winter, while the minimum departure delay, a negative value, occurs in Spring.
3. Which aircraft manufacturer allows better performance?
It is important to evaluate which manufacturer allows better performance of the plane. To perform this analysis we need additional information, contained in the csv file "plane-data.csv".
plane_data <- read_csv("C:/ProjectInferential/plane-data.csv")
plane_data <- na.omit(plane_data)
To get a clear view of the manufacturers, I create a legend to identify them quickly. In particular, the manufacturers that occur most often are highlighted.
Legend:
Embraer: E
Boeing: B
Airbus Industrie: A
Other: O
plane_performance <- na.omit(plane_performance)
For more information, we count the aircraft of each manufacturer.
plane_performance %>% group_by(manufacturer) %>% tally %>% arrange(desc(n))
## # A tibble: 4 x 2
##   manufacturer     n
##   <chr>        <int>
## 1 B             2061
## 2 O             1397
## 3 E              588
## 4 A              434
Results
Manufacturer B: 2061 aircraft
Manufacturer O: 1397 aircraft
Manufacturer E: 588 aircraft
Manufacturer A: 434 aircraft
I create a single data frame containing flight and plane information by joining the two data frames on the TailNum variable, which should be unique for each aircraft. For easier interpretation I plot the density of the air time. Kernel density plots are usually an effective way to view the distribution of a variable.
mdensity <- ggplot(plane_performance, aes(x = airtime))
mdensity + geom_density(aes(colour = manufacturer, fill = manufacturer), alpha = 0.3) +
  theme_gray(base_size = 14)
## Warning: Removed 3 rows containing non-finite values (stat_density).
Now, to explore AirTime by manufacturer, I calculate the mean, standard deviation and median and show them in plots.
Meanplot <- ggplot(airtimeMean, aes(Manufacturer, x, fill = Manufacturer)) +
  geom_bar(stat = "identity", position = "dodge") +
  xlab("Manufacturer") +
  ylab("Hourly Mean AirTime") +
  theme_gray(base_size = 14)
StandardDeviationplot <- ggplot(airtimeSD, aes(Manufacturer, x, fill = Manufacturer)) +
  geom_bar(stat = "identity", position = "dodge") +
  xlab("Manufacturer") +
  ylab("Hourly AirTime SD") +
  theme_gray(base_size = 14)
Medianplot <- ggplot(airtimemedian, aes(Manufacturer, x, fill = Manufacturer)) +
  geom_bar(stat = "identity", position = "dodge") +
  xlab("Manufacturer") +
  ylab("Hourly AirTime Median") +
  theme_gray(base_size = 14)
# arrange the three plots on a 2 x 2 grid
ggarrange(Meanplot, StandardDeviationplot, Medianplot, ncol = 2, nrow = 2)
## Warning: Removed 1 rows containing missing values (geom_bar).
## Warning: Removed 1 rows containing missing values (geom_bar).
## Warning: Removed 1 rows containing missing values (geom_bar).
To evaluate the flight performance, I calculate performance_index by adding DepDelay and ArrDelay and dividing the sum by the flight time AirTime.
## # A tibble: 4,480 x 6
## # Groups:   tailnum, manufacturer [4,480]
##    tailnum manufacturer arrdelay depdelay airtime performance_index
##    <chr>   <chr>           <int>    <int>   <int>             <dbl>
##  1 N10156  E                  34       35     160            0.431
##  2 N102UW  A                  23       20     308            0.140
##  3 N10323  B                  88       70     208            0.760
##  4 N103US  A                   6       11     251            0.0677
##  5 N104UA  B                 193      185     240            1.58
##  6 N104UW  A                  14       23     288            0.128
##  7 N10575  E                  80       85      55            3
##  8 N105UA  B                   3       27     206            0.146
##  9 N105UW  A                  45       24     256            0.270
## 10 N106US  A                   7        5      65            0.185
## # ... with 4,470 more rows
# average performance index per aircraft
plane_performance_index <- plane_performance %>%
  group_by(tailnum) %>%
  dplyr::summarise(avg_performance_index = mean(performance_index, na.rm = FALSE)) %>%
  as_data_frame()
plane_performance_index <- na.omit(plane_performance_index)
Result:
The highest performance index is given by TailNum N581SW, with value 50.000000
The lowest performance index is given by TailNum N5ETAA, with value 0.005524862
To bring together the information on the planes and on the flights, I join the two data frames on the unique TailNum index.
# look up the manufacturer of each aircraft by tail number
pos1 <- match(plane_performance_index$tailnum, plane_data$tailnum)
plane_performance_index$manufacturer <- plane_data$manufacturer[pos1]
plot1 <- ggplot(f, aes(x = factor(manufacturer), y = p)) +
  geom_bar(colour = "blue", stat = "identity") +
  ggtitle("Performance Index based on the manufacturer of plane") +
  guides(fill = FALSE) +
  xlab("Manufacturer") +
  ylab("Performance Index") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
plot1 <- ggplotly(plot1)
plot1
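The performance index defined above, (ArrDelay + DepDelay) / AirTime, can be checked against the printed tibble. The sketch below recomputes it in base R for three rows copied from that output; the data frame name `planes` is introduced only for this example.

```r
# Three rows copied from the tibble shown in the report.
planes <- data.frame(
  tailnum  = c("N10156", "N102UW", "N10575"),
  arrdelay = c(34, 23, 80),
  depdelay = c(35, 20, 85),
  airtime  = c(160, 308, 55)
)

# Minutes of delay (arrival plus departure) per minute of air time.
planes$performance_index <-
  (planes$arrdelay + planes$depdelay) / planes$airtime

print(planes)
```

The recomputed values round to 0.431, 0.140 and 3, matching the corresponding rows of the report's output.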
Result
As the plot shows, the best performance is given by manufacturer FEDERICK CHRISK; conversely, the worst performance is given by manufacturer BAUMAN RANDY.
4. Which of the two carriers that make the most flights is faster?
In this analysis we first determine the carriers that make the most flights, then compare the two companies to find out which is faster.
# keep only flights with positive delays and positive air time
data_carrier_air <- select(ontime, Distance, AirTime, UniqueCarrier, DepDelay, ArrDelay) %>%
  filter((ArrDelay > 0) & (DepDelay > 0) & (AirTime > 0)) %>%
  group_by(UniqueCarrier, DepDelay, ArrDelay) %>%
  as_data_frame()
data_carrier_air <- na.omit(data_carrier_air)
Covariance between AirTime and Distance
cov(data_carrier_air$Distance, data_carrier_air$AirTime)
## [1] 25409.49
Covariance Result
The positive value returned by the cov (covariance) function indicates that AirTime and Distance tend to increase together: as Distance increases, AirTime increases on average. The covariance depends on the units of the two variables, so it does not measure the strength of the relationship; that is assessed with the correlation test below.
The absolute and relative frequencies of each carrier are calculated to find the carriers that make the most flights. Because data_carrier_air contains only flights with positive departure and arrival delays, the frequencies refer to delayed flights.
Calculation of absolute and relative frequencies of carriers
flight_to_carrier <- cbind(
  Frequency = table(data_carrier_air$UniqueCarrier),
  RelFreq = prop.table(table(data_carrier_air$UniqueCarrier))
)
flight_to_carrier
##        Frequency      RelFreq
## 9E        134604 0.0033550012
## AA       4505617 0.1123023864
## AQ         41702 0.0010394213
## AS        928903 0.0231528831
## B6        261440 0.0065163852
## CO       2348172 0.0585281260
## DH        199362 0.0049690927
## DL       6239652 0.1555231636
## EA        215526 0.0053719799
## EV        588954 0.0146796631
## F9        117590 0.0029309277
## FL        394566 0.0098345473
## HA         38710 0.0009648457
## HP       1176193 0.0293165799
## ML (1)     14393 0.0003587452
## MQ       1299845 0.0323986028
## NW       2702993 0.0673720301
## OH        413364 0.0103030869
## OO        893097 0.0222604195
## PA (1)     61635 0.0015362508
## PI        466058 0.0116164835
## PS         35534 0.0008856840
## TW       1092832 0.0272388091
## TZ         47440 0.0011824408
## UA       4669724 0.1163927491
## US       5217884 0.1300556228
## WN       5083808 0.1267137820
## XE        664729 0.0165683530
## YV        266076 0.0066319374
The largest number of flights is made by DL (Delta Air Lines Inc.) with 15.55%, followed by US (US Airways Inc.) with about 13%.
Pearson's Correlation Test
Correlation between AirTime and Distance
cor.test(data_carrier_air$AirTime, data_carrier_air$Distance)
##
##  Pearson's product-moment correlation
##
## data:  data_carrier_air$AirTime and data_carrier_air$Distance
## t = 5112.5, df = 40120000, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6278910 0.6282657
## sample estimates:
##       cor
## 0.6280784
Correlation: 0.6280784
Regression Analysis
airtime.lm.DL <- lm(formula = AirTime ~ Distance,
  data = data_carrier_air,
  subset = UniqueCarrier == "DL")
summary(airtime.lm.DL)
##
## Call:
## lm(formula = AirTime ~ Distance, data = data_carrier_air, subset = UniqueCarrier ==
##     "DL")
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -381.96  -37.16   20.30   39.22  481.85
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.595e-01  3.822e-02   14.64   <2e-16 ***
## Distance    8.472e-02  4.236e-05 1999.98   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 58.28 on 6239650 degrees of freedom
## Multiple R-squared:  0.3906, Adjusted R-squared:  0.3906
## F-statistic: 4e+06 on 1 and 6239650 DF,  p-value: < 2.2e-16
airtime.lm.US <- lm(formula = AirTime ~ Distance,
  data = data_carrier_air,
  subset = UniqueCarrier == "US")
summary(airtime.lm.US)
##
## Call:
## lm(formula = AirTime ~ Distance, data = data_carrier_air, subset = UniqueCarrier ==
##     "US")
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -256.40  -28.59   15.26   35.60  339.54
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.590e-01  3.515e-02  -21.59   <2e-16 ***
## Distance     8.632e-02  4.552e-05 1896.59   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 52.6 on 5217882 degrees of freedom
## Multiple R-squared:  0.4081, Adjusted R-squared:  0.4081
## F-statistic: 3.597e+06 on 1 and 5217882 DF,  p-value: < 2.2e-16
Multiple R-squared is 0.4081, which is the R-squared value: about 41% of the variance of AirTime is explained by Distance. Multiple R-squared never decreases when more independent variables are added, whereas Adjusted R-squared decreases if an added independent variable does not help the model.
plotDL <- plot(airtime.lm.DL)
plotDL
## NULL
plotUS <- plot(airtime.lm.US)
plotUS
## NULL
Residuals vs Fitted
This is a scatterplot of the residuals on the y-axis against the fitted values on the x-axis. It is used to detect non-linearity by checking the red line for a questionable pattern. The characteristics of a well-behaved residuals vs fitted plot are:
The residuals bounce randomly around the 0 line.
The residuals roughly form a horizontal band around the 0 line.
No residual stands out.
In this plot the residuals are scattered randomly around the 0 line, which suggests that the assumption of a linear relationship is reasonable.
Scale-Location
This plot shows whether the residuals are spread equally along the range of the predictor.
Normal Q-Q
This scatterplot shows whether the residuals are normally distributed. The closer the points fall to the diagonal line, the more we can interpret the residuals as normally distributed.
DL <- subset(data_carrier_air, UniqueCarrier == 'DL')
US <- subset(data_carrier_air, UniqueCarrier == 'US')
# scatterplot of the DL flights, then the US flights added as points
finalplot <- plot(x = DL$Distance, y = DL$AirTime,
  xlab = 'Distance',
  ylab = 'Air Time',
  main = 'Air time based on distance by carrier',
  pch = 20,
  col = 'dodgerblue1')
points(x = US$Distance, y = US$AirTime, pch = 20, col = 'forestgreen')
# fitted regression lines of the two carriers
abline(airtime.lm.DL, col = 'slateblue1')
abline(airtime.lm.US, col = 'springgreen1')
legend('topleft',
  legend = c('Delta Air Lines Inc.', 'US Airways Inc.'),
  col = c('dodgerblue1', 'forestgreen'),
  pch = 20)
finalplot
## NULL
The plot shows the relationship between Distance and AirTime for the DL and US companies (note: flight times are all positive integers).
Who is faster?
The DL regression line is drawn in slateblue1.
The US regression line is drawn in springgreen1.
According to the fitted coefficients the two carriers are very close: the regression lines cross at roughly 820 miles, with US Airways Inc. having the slightly shorter predicted air time below that distance and Delta Air Lines Inc. above it.
Conclusion
The analyses carried out on all the years (1987-2008) confirm that the worst month to depart is January, as it has the greatest number of cancellations with an unknown reason. The average departure delay is 8.17 minutes. The worst season for departure delays is Winter, although negative DepDelay values also occur in Winter. The best performing aircraft is N581SW, built by the manufacturer FEDERICK CHRISK. The worst performing aircraft is N5ETAA, built by the manufacturer BAUMAN RANDY.
The two companies that operate the most flights are DL (Delta Air Lines) with about 15% and US (US Airways) with about 13%. In addition, the fitted regression lines indicate that DL is the faster of the two on flights longer than roughly 820 miles.
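The distance at which the two fitted regression lines cross can be worked out directly from the coefficients printed by summary(). The sketch below is a base R check under the assumption that the rounded coefficients shown in the report are used as they are; `predict_airtime` is a helper introduced here for illustration.

```r
# Coefficients copied from the two summary() outputs in the report.
dl <- c(intercept = 0.5595, slope = 0.08472)
us <- c(intercept = -0.7590, slope = 0.08632)

# Predicted air time in minutes for a given distance in miles.
predict_airtime <- function(coefs, distance) {
  unname(coefs["intercept"] + coefs["slope"] * distance)
}

# The lines cross where both predictions are equal.
crossover <- unname((dl["intercept"] - us["intercept"]) /
                    (us["slope"] - dl["slope"]))
print(crossover)  # about 824 miles

# Compare the two predictions on either side of the crossover.
for (d in c(500, 2000)) {
  cat(d, "miles: DL", predict_airtime(dl, d),
      "min, US", predict_airtime(us, d), "min\n")
}
```

Because the slopes differ only in the third significant digit, the predicted difference between the two carriers stays within a few minutes over typical flight distances, and the exact crossover is sensitive to rounding of the coefficients.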