University of Cincinnati, Carl H. Lindner College of Business
MS BANA 2017-18
Statistical Computing Project
Study of factors affecting aircraft landing distance
Samrudh Keshava Kumar
M12420395
(003)
The aim of this project is to study simulated data on 950 commercial flight landings
and to understand the factors affecting landing distance. Initially, the data was processed to remove any
missing or abnormal values before proceeding with the analysis. Bivariate analysis was performed
between the variables; the speed and type of aircraft had a strong positive association with the landing
distance. A regression model was built with all the available variables, and the model was improved
based on its diagnostic plots. The speed, type, pitch and height of the
aircraft were found to have a significant effect on the landing distance through the regression analysis.
The initial model had an R-Squared of 0.85 and a MAPE of 22.5%; in the final model the R-Squared
increased to 0.97 and the MAPE fell to 10.8%.
Chapter 1
Data exploration and data cleaning
Aim: To verify data quality and correct any problems before proceeding with the analysis.
Loading the datasets into the SAS environment
PROC IMPORT DATAFILE='/home/samrudhkumar0/Project/FAA1.csv'
    DBMS=CSV
    OUT=FAA1;
    GETNAMES=YES;
RUN;
PROC IMPORT DATAFILE='/home/samrudhkumar0/Project/FAA2.csv'
    DBMS=CSV
    REPLACE
    OUT=FAA2;
    GETNAMES=YES;
RUN;
/* Print the top 10 rows of each dataset */
PROC PRINT DATA=faa1(obs=10);
RUN;
PROC PRINT DATA=faa2(obs=10);
RUN;
The datasets have been loaded into the SAS environment as FAA1 and FAA2. A summary of the
data is obtained using PROC MEANS.
PROC MEANS DATA = FAA1 n nmiss max min mean median var;
    TITLE "Basic Summary of FAA1";
RUN;
PROC MEANS DATA = FAA2 n nmiss max min mean median var;
    TITLE "Basic Summary of FAA2";
RUN;
/* FAA2 has a few empty rows; they are removed in the next step */
DATA NO_DEADROWS;
    SET FAA2;
    IF MISSING(AIRCRAFT) THEN DELETE;
RUN;
/* The 50 empty observations have been removed; the dataset now contains 150
   observations */
The empty rows of data have been removed. The missing values under speed_air will be dealt with
later.
Combining data sets from different sources
Before merging the datasets, SAS requires that both be sorted by the same BY variables.
The aircraft make and speed_ground are the unique variables by which the two datasets
can be merged.
/* Sorting the datasets before merging */
PROC SORT DATA = FAA1;
    BY aircraft speed_ground;
RUN;
PROC SORT DATA = NO_DEADROWS;
    BY aircraft speed_ground;
RUN;
DATA MERGED;
    MERGE FAA1 NO_DEADROWS;
    BY aircraft speed_ground;
    /* Merging by speed_ground because there is repetition in the data;
       speed_ground has unique values, so it works well as a primary key */
RUN;
/* 850 observations after merge */
The combined dataset should have had 800 + 150 = 950 observations, but it contains 850
observations. This shows that 100 observations were not unique. A summary of the
merged data is shown in the table below.
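The shortfall of 100 rows can be checked directly. The following is a minimal sketch, not part of the original analysis, that keeps only the keys present in both input files; the dataset name key_overlap and the flags in_faa1 and in_faa2 are hypothetical. If each (aircraft, speed_ground) pair is unique within its own file, the count would be expected to equal the 100 rows lost in the merge.

```sas
/* Sketch: count the BY keys that appear in both files.
   Assumes FAA1 and NO_DEADROWS are already sorted by aircraft speed_ground. */
data key_overlap;
    merge faa1        (in = in_faa1 keep = aircraft speed_ground)
          no_deadrows (in = in_faa2 keep = aircraft speed_ground);
    by aircraft speed_ground;
    if in_faa1 and in_faa2;   /* keep only keys found in both datasets */
run;

proc sql;
    select count(*) as n_shared_keys from key_overlap;
quit;
```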
Performing the completeness check of each variable
The MEANS procedure is used with the options N and NMISS to display the number of observations and
the number of missing values in each variable.
PROC MEANS DATA = MERGED N NMISS;
RUN;
/* Treating missing values - duration, speed_air */
642 and 50 values are missing from the variables speed_air and duration respectively.
Performing the validity check of each variable
The UNIVARIATE procedure is run to determine the quantile ranges; any values above the 99% and
below the 1% levels can be treated as abnormal values.
PROC UNIVARIATE DATA=MERGED PLOT;
RUN;
[Figures: PROC UNIVARIATE plots for No of passengers, Speed_ground, Height, Pitch, Distance and Speed_air]
Data cleaning
Based on the understanding of the data from the previous steps, abnormal values of speed_ground,
height, duration and distance are deleted from the analytical dataset and moved to a new dataset
containing only outliers. For the variable ‘duration’, out of 781 observations, 50 (~6%) were missing; the
missing values can be approximated with the average value. For the variable ‘speed_air’, which has 203
missing out of 628 (~32%), the missing values are not replaced since it would lead to approximation
errors.
DATA TREATED_DATA;
    SET MERGED;
    IF SPEED_GROUND < 30 THEN DELETE;
    IF SPEED_GROUND > 140 THEN DELETE;
    IF HEIGHT < 6 THEN DELETE;
    /* DURATION > 0 keeps rows with missing duration, which SAS sorts below any number */
    IF (DURATION < 40 AND DURATION > 0) THEN DELETE;
    /* Impute missing duration with the average value */
    IF MISSING(DURATION) THEN DURATION = 154.0065385;
    IF DISTANCE > 6000 THEN DELETE;
RUN;
/* 831 observations remaining */
PROC MEANS DATA = TREATED_DATA N NMISS;
RUN;
The treated dataset contains 831 observations and 0 missing values for all variables except
‘speed_air’.
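The imputation above hard-codes the average duration. As a hedged alternative, the sketch below lets SAS compute the mean itself with PROC STDIZE; the output name merged_imputed is hypothetical, and the result may differ slightly from 154.0065385 because the source does not state which rows that constant was computed over.

```sas
/* Sketch: replace missing duration with the mean of the observed values.
   REPONLY replaces missing values only; non-missing values are left unchanged. */
proc stdize data = merged out = merged_imputed reponly method = mean;
    var duration;
run;
```

Computing the mean in the same program avoids a stale constant if the input files change.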
PROC SORT DATA = TREATED_DATA;
    BY AIRCRAFT SPEED_GROUND;
RUN;
PROC SORT DATA = MERGED;
    BY AIRCRAFT SPEED_GROUND;
RUN;
/* Keep the rows that appear in only one of the two datasets, i.e. the removed rows */
DATA COMPLEMENT;
    MERGE TREATED_DATA (IN = X) MERGED (IN = Y);
    BY AIRCRAFT SPEED_GROUND;
    IF (X = 1 AND Y = 0) OR (X = 0 AND Y = 1);
    DROP DURATION SPEED_AIR;
RUN;
PROC PRINT DATA = COMPLEMENT;
RUN;
The above statements generate the table of observations that were removed from the main dataset.
It contains 19 observations, as expected (850 - 831 = 19).
Summarizing the distribution
To summarize the distribution of each variable, it is sufficient to look at the mean and
median values of each.
PROC MEANS DATA=TREATED_DATA N MEAN MEDIAN;
    TITLE "MEAN AND MEDIAN OF TREATED DATA";
RUN;
The mean and median values of all the variables except distance are close to each other, indicating
that their distributions are roughly symmetric, which is consistent with a normal distribution.
Using the UNIVARIATE procedure, the distance variable is plotted to understand its distribution.
PROC UNIVARIATE DATA=TREATED_DATA PLOT;
    VAR DISTANCE;
RUN;
The distance variable follows a skewed pattern, and most observations occur between 600 and
1000 feet.
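Closeness of mean and median only suggests symmetry. A more direct check of normality, sketched here as an optional addition rather than part of the original analysis, is the NORMAL option of PROC UNIVARIATE together with a histogram overlaid by a fitted normal curve. The variable list is an assumption based on the numeric columns used elsewhere in this report.

```sas
/* Sketch: formal normality tests and histograms with a fitted normal curve. */
proc univariate data = treated_data normal;
    var no_pasg speed_ground height pitch duration;
    histogram / normal;
run;
```

Note that duration contains 50 values imputed with the mean, which will show up as a spike in its histogram.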
It was observed that 100 observations were duplicates, and these were removed. The variable speed_air
had 628 observations missing; the missing values will be treated during the data analysis steps.
Chapter 2
Data Visualization
Aim: To understand how the independent variables/factors affect the dependent variable (distance)
being modelled.
Since the data is being modelled using linear regression, it is assumed that the independent variables
have a linear relationship with the predicted variable. The slope of each plot indicates the impact
the independent variable has on the dependent variable (the variable being predicted); the
shape indicates the type of relationship (linear, quadratic, etc.) and the spread/variability of the
data.
/* Chapter 2 visualization */
/* Plotting landing distance against the other variables to understand the relationships */
proc plot data = treated_data;
    plot distance*pitch;
    plot distance*height;
    plot distance*speed_air;
    plot distance*speed_ground;
    plot distance*no_pasg;
    plot distance*duration;
    plot distance*aircraft;
run;
The plot indicates that the pitch of the aircraft does not have much of an impact on the landing
distance; the data is concentrated in the centre of the plot and has high variability.
Height of the aircraft above the threshold of the runway has a slight positive impact on the
landing distance.
The variable speed_air has a minimum value of 90 MPH, below which values have not
been captured in the data. Speed_air shows a high positive correlation with the
landing distance, and the spread of the data points looks minimal. In the regression
analysis carried out later, this variable should have high significance.
The speed_ground variable has a quadratic relationship with the landing distance: below 70 MPH
the impact is almost negligible, but above 70 MPH there seems to be a high positive correlation
similar to what is observed for the speed_air variable.
The no_pasg variable (number of passengers) does not seem to have an impact on the landing distance.
Duration of the flight seems to have a slight negative impact on the landing distance.
The type of aircraft seems to affect the landing distance; Airbus seems to exhibit shorter
landing distances compared to Boeing.
To further understand the strength of the relationships, the correlation between the variables is
calculated using the CORR procedure in SAS.
In the previous plot for speed_ground, the curve seems to be flat below 70 MPH. To test this, a subset
of the data below 70 MPH is taken and the correlation between speed_ground and
distance is calculated.
data ground_speed_low;
    set treated_data;
    if speed_ground > 70 then delete;
    keep speed_ground distance;
run;
proc corr data = ground_speed_low;
run;
The correlation between the two variables is 0.11 for speeds below 70 MPH, meaning the speed of the
aircraft has minimal impact on the landing distance in that range; it rises to 0.39 for speeds below
80 MPH and 0.65 for speeds below 90 MPH. For speed_air, the missing values could be approximated
as equal to the speed_ground values.
/* Calculating the correlation between the variables */
proc corr data = treated_data;
run;
The highlighted values indicate variables which are highly correlated. The variables speed_air and
speed_ground are highly correlated with each other and with the predicted variable
(distance). One of the two should be eliminated to prevent multicollinearity.
The variables speed_ground, speed_air and aircraft type seem to have an impact on the landing
distance. ‘speed_ground’ and ‘speed_air’ have the highest correlation coefficients with distance and
are correlated with each other. The missing values of speed_air could be imputed with the values
from speed_ground, and the speed_ground variable could then be eliminated altogether.
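The multicollinearity concern can also be quantified with variance inflation factors. The sketch below is an illustrative addition, not taken from the original analysis; it uses the VIF option of the MODEL statement in PROC REG. A common rule of thumb treats values well above 10 as a sign of problematic collinearity, though that threshold is a convention rather than a strict rule.

```sas
/* Sketch: variance inflation factors for the two speed variables.
   PROC REG uses only rows where every model variable is non-missing,
   so this fit is limited to landings with a recorded speed_air. */
proc reg data = treated_data;
    model distance = speed_air speed_ground / vif;
run;
quit;
```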
Chapter 3
Statistical Modelling
Aim: Understand the variables significantly affecting the landing distance and fit a linear model to
predict the landing distance of the aircraft.
SAS codes and outputs:
From the previous chapter, the variables speed_air, speed_ground and aircraft have a significant impact
on the landing distance. To include aircraft as a variable in the linear model, a dummy variable
called aircraft_type is created with values 0 and 1 for Airbus and Boeing respectively.
/* Run a paired t-test on speed_ground and speed_air */
data speeds_df;
    set treated_data;
    if missing(speed_air) then delete;
    keep speed_air speed_ground;
run;
proc ttest data = speeds_df;
    paired speed_air*speed_ground;
run;
The null hypothesis being tested is that the difference between the means of the two variables is
zero. The null hypothesis cannot be rejected because p > 0.05; therefore we can say that the two
variables are similar. The mean difference between the two is 0.0739 MPH and the correlation is
0.987. Given this evidence, speed_ground is very similar to speed_air. The missing values of
speed_air can be imputed with values from speed_ground.
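For reference, the paired test above works on the per-landing differences. The standard form of the statistic is written out below; the symbols d_i, \bar{d}, s_d and n are notation introduced here, with n the number of landings that have a recorded speed_air.

```latex
d_i = \text{speed\_air}_i - \text{speed\_ground}_i, \qquad
\bar{d} = \frac{1}{n}\sum_{i=1}^{n} d_i, \qquad
t = \frac{\bar{d}}{s_d / \sqrt{n}}
```

Here s_d is the sample standard deviation of the differences, and t is compared against a t distribution with n - 1 degrees of freedom. Failing to reject the null hypothesis shows only that the average difference is not distinguishable from zero; the high correlation (0.987) is what supports treating the two speeds as near-interchangeable for individual landings.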
A new dataset is created with the above-mentioned changes.
/* Create a dummy variable for aircraft type so that it can be included
   as a variable in the linear model */
data final_model_data;
    set treated_data;
    if aircraft = 'airbus' then aircraft_type = 0;
    else aircraft_type = 1;
    /* Impute missing speed_air with speed_ground */
    if missing(speed_air) then speed_air = speed_ground;
    drop aircraft speed_ground;
run;
proc means data = final_model_data N Nmiss;
run;
/* Generate correlation matrix */
proc corr data = final_model_data;
run;
Variables with high correlation with distance are highlighted. None of the independent variables are
correlated with each other.
The final dataset has no missing values and 831 observations. The variables speed_ground and aircraft
have been eliminated; further analysis is performed on this dataset.
A regression model is fitted on the dataset.
/* Fitting a regression model */
proc reg data = final_model_data;
    model distance = speed_air aircraft_type no_pasg pitch height duration;
run;
Below is the summary of the correlation and the regression analysis of the independent variables.
Independent     Direction            Correlation   P-value of     Regression coefficient   P-value of reg. coeff.
variable                             coefficient   corr. coeff.   (Distance ~ All)         (Distance ~ All)
speed_air       Strong positive      0.8675        <.0001         42.45547                 <.0001
aircraft_type   -                    0.2381        <.0001         481.22446                <.0001
no_pasg         No visible relation  -0.0177       0.6093         -2.15925                 0.1806
pitch           No visible relation  0.08703       0.0121         34.84949                 0.1552
height          No visible relation  0.09941       0.5082         14.07733                 <.0001
duration        Slight negative      -0.04995      0.1503         0.00415                  0.9871
The next step is to eliminate, one by one, the variables whose regression p-value is greater than 0.05.
The resultant model uses speed_air, aircraft_type and height as independent variables. The R-Squared
is 0.85.
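Written out, the reduced model has the form below. The coefficient symbols \beta_0 to \beta_3 and the error term \varepsilon_i are notation introduced here; the fitted values for this reduced model are not reported in the text, so none are filled in.

```latex
\text{distance}_i = \beta_0
  + \beta_1\,\text{speed\_air}_i
  + \beta_2\,\text{aircraft\_type}_i
  + \beta_3\,\text{height}_i
  + \varepsilon_i,
\qquad
R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}
```

Because aircraft_type is coded 0 for Airbus and 1 for Boeing, \beta_2 is the estimated difference in landing distance between the two makes with speed and height held fixed.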
Chapter 4
Model Validation
Aim: Diagnose the model performance by analysing the plot of the residuals, the R-Squared and the
MAPE of the predicted values.
/* Model validation: check whether the residuals are normally distributed */
proc reg data = final_model_data;
    model distance = speed_air aircraft_type height;
run;
The fit diagnostics for the predicted variable show that the residuals are not random. The
non-random pattern shows that the linear model is inappropriate and the data needs some
transformation. The model underestimates the landing distance in the extreme ranges of landing
distance.
Calculation of MAPE
proc reg data = final_model_data;
    model distance = speed_air aircraft_type height;
    output out = predicted_values predicted = py;
run;
data predicted_values;
    set predicted_values;
    /* Absolute error as a fraction of the observed distance */
    error_abs = abs(distance - py) / distance;
    keep distance py error_abs;
run;
proc means data = predicted_values N mean;
    var error_abs;
run;
/* MAPE is 22.575% */
The prediction accuracy of the model is poor; the predictions could be improved by transforming
a few predictor variables.
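The quantity computed by the steps above is the mean absolute percentage error. In the notation introduced here, y_i is the observed distance, \hat{y}_i is the predicted value (py in the code) and n is the number of landings:

```latex
\text{MAPE} = \frac{100\%}{n} \sum_{i=1}^{n} \frac{\lvert y_i - \hat{y}_i \rvert}{y_i}
```

PROC MEANS reports the mean of error_abs as a fraction, so a value of about 0.22575 corresponds to the 22.575% quoted. Because each error is divided by the observed distance, short landings contribute more to MAPE than long ones for the same absolute error.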
Chapter 5
Remodelling and model Validation
Aim: Transform predictor variables and ensure the residual plot is random. Compare the new models
with the base model.
SAS codes:
data remodelling_data;
    set final_model_data;
    speed_air_4 = speed_air**4;
    speed_air_3 = speed_air**3;
    speed_air_2 = speed_air**2;
    height_pitch = height*pitch;
run;
proc means data = remodelling_data N Nmiss Min max median;
run;
proc corr data = remodelling_data;
run;
From the correlation output, speed_air_4 gives the highest correlation with distance. height_pitch,
the product of height and pitch, has a higher correlation than either of the individual
variables. Both have been selected for the final model's independent variable list.
/* speed_air has no missing values */
proc plot data = remodelling_data;
    plot distance*speed_air;
    plot distance*speed_air_2;
    plot distance*speed_air_3;
    plot distance*speed_air_4;
    plot distance*height_pitch;
run;
The transformed speed_air variable (speed_air^4) shows a linear relationship with the landing
distance. The speed_air variable will be replaced with speed_air_4.
/* Fitting a regression model */
proc reg data = remodelling_data;
    model distance = speed_air_4 aircraft_type height_pitch;
run;
The model has a better residual plot, though it is under-predicting the longer
landing distances; this is acceptable given the lack of data points covering these scenarios. The
R-Squared has improved from 0.85 to 0.97, indicating higher prediction accuracy.
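In equation form, the remodelled fit is the following, where the \beta symbols and \varepsilon_i are notation introduced here and no fitted values are implied:

```latex
\text{distance}_i = \beta_0
  + \beta_1\,\text{speed\_air}_i^{\,4}
  + \beta_2\,\text{aircraft\_type}_i
  + \beta_3\,(\text{height}_i \times \text{pitch}_i)
  + \varepsilon_i
```

Although speed enters through its fourth power, the model is still linear in the coefficients, which is why ordinary least squares in PROC REG can fit it without any change of procedure.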
proc reg data = remodelling_data;
    model distance = speed_air_4 aircraft_type height_pitch;
    output out = predicted_values predicted = py;
run;
data predicted_values;
    set predicted_values;
    error_abs = abs(distance - py) / distance;
    keep distance py error_abs;
run;
proc means data = predicted_values N mean;
    var error_abs;
run;
The MAPE (Mean Absolute Percentage Error) of the improved model is 10.88%.
The MAPE has fallen from 22.58% to 10.88%; the transformation of the data improved the
accuracy of the predictions.
The model can be further improved with more data points, especially in the scenarios where the
landing distance is greater than 4000 feet, since these are the cases to be predicted. More variables
such as gross weight of the aircraft, aircraft model number, wind direction etc. could significantly
improve this model.
Appendix.
Variable dictionary:
Aircraft: The make of an aircraft (Boeing or Airbus).
Duration (in minutes): Flight duration between taking off and landing. The duration of a normal
flight should always be greater than 40 min.
No_pasg: The number of passengers in a flight.
Speed_ground (in miles per hour): The ground speed of an aircraft when passing over the threshold
of the runway. If its value is less than 30 MPH or greater than 140 MPH, then the landing would be
considered abnormal.
Speed_air (in miles per hour): The air speed of an aircraft when passing over the threshold of the
runway. If its value is less than 30 MPH or greater than 140 MPH, then the landing would be
considered abnormal.
Height (in meters): The height of an aircraft when it is passing over the threshold of the runway. The
landing aircraft is required to be at least 6 meters high at the threshold of the runway.
Pitch (in degrees): Pitch angle of an aircraft when it is passing over the threshold of the runway.
Distance (in feet): The landing distance of an aircraft. More specifically, it refers to the distance
between the threshold of the runway and the point where the aircraft can be fully stopped. The
length of the airport runway is typically less than 6000 feet.

Predicting aircraft landing distances using linear regression

  • 1. University of Cincinnati, Carl H. Lindner College of Business MS BANA 2017-18 Statistical Computing Project Study of factors affecting aircraft landing distance Samrudh Keshava Kumar M12420395 (003) The aim of thisprojectis to studythe simulateddataof 950 commercial flightlandingperformances and understandthe factorsaffectingthe same. Initially,the datawasprocessedtoremove any missingorabnormal valuesbefore proceedingwiththe analysis.Bivariate analysiswasperformed betweenthe variables,the speedandtype of aircraft hada highpositive impactonthe landing distance.A regressionmodelwasbuiltwithall the available variablesandthe model wasimproved basedon the diagnosticplotsof the regressionmodel.The speed,type,pitchandthe heightof the aircraft wasfoundto have significanteffectonthe landingdistancesthroughthe regression analysis. The initial model hadanR-Squaredof 0.85 and MAPE of 22.5%, the R-Squared wasincreasedto0.97 and MAPE reducedto10.8% inthe final model.
  • 2. Chapter 1 Data exploration and data cleaning Aim:To verifydataquality&correct thembefore proceedingwiththe analysis. Loading the datasets into the SAS environment PROCIMPORTDATAFILE='/home/samrudhkumar0/Project/FAA1.csv' DBMS=CSV OUT=FAA1; GETNAMES=YES; RUN; PROCIMPORTDATAFILE='/home/samrudhkumar0/Project/FAA2.csv' DBMS=CSV REPLACE OUT=FAA2; GETNAMES=YES; RUN; /*Print the top 10 rowsof dataof each dataset*/ PROC PRINTDATA=faa1(obs= 10); RUN; PROCPRINTDATA=faa2(obs=10); RUN;
  • 3. The datasetshave beenloadedintothe SASenvironmentasFAA1andFAA2. The summaryof the data isobtained usingPROCMEANS. PROCmeansDATA = FAA1n nmiss max min mean median var; Title "Basic Summary of FAA1"; RUN; PROCmeansDATA = FAA2n nmiss max min mean median var; Title "Basic Summary of FAA2"; RUN; /*Observed thatFAA2hasa fewempty rows,in the subsequentstep itwill be removed*/ DATA NO_DEADROWS; SET FAA2; IFMISSING(AIRCRAFT) then delete; RUN; /*The 50 missing observationshavebeen removed thedatasetnow contains150 observations*/
  • 4. The emptyrowsof data has beenremoved.The missingvaluesunder speed_airwill be dealtwith later. Combiningdata sets from differentsources Before mergingthe datasetstogether,SASrequiresthatboththe datasetsbe sortedinthe same fashion.The aircraftname and speed_groundare the unique variablesbywhichthe twodatasets can be merged. /*Sorting the datasetbeforemerging*/ PROCSORT DATA = FAA1; BY aircraft speed_ground; RUN; PROCSORT DATA = NO_DEADROWS; BY aircraft speed_ground; RUN; DATA MERGED; MERGE FAA1 NO_DEADROWS; BY aircraft speed_ground; /*Merging by speed_ground sincethereis repetation in the data,speed_groundhasuniquevaluesso isperfect asa primary key*/ RUN; /*850 OBSERVATIONSAFTERMERGE*/ The combineddatasetshouldhave had800+150 = 950 observationsbutitcontains850 observations.Thisshowsthatthere were 100observationswhichwere notunique. Summaryof the mergeddatais showninthe table below Performingthe completenesscheckofeach variable Usingthe MEANSprocedure withoptionsN andN Miss to displaythe numberof observationsand the numberof missingvaluesineachvariable. PROCMEANSDATA = MERGED N NMISS; RUN; /*Treating missing values - duration,speed_air*/
  • 5. 642 and50 valuesare missingfromthe variablesspeed_airanddurationrespectively. Performingthe validitycheck of each variable Runningthe UNIVARIATEprocedure todetermine the quartile rangesanyvaluesabovethe 99% and below1% levelscanbe treatedasabnormal values. PROCUNIVARIATEDATA=MERGEDPLOT; RUN; No of passengers Speed_ground Height Pitch Distance Speed_air
  • 6. Data cleaning Basedon the understandingof the datafromthe previoussteps. Abnormal valuesof speed_ground, height,durationanddistance are deletedfromthe analytical datasetandmovedtoanew datasets containingonlyoutliers.Forvariable ‘duration’ outof 781 observations,50(~6%) were missing,the missingvaluescanbe approximatedwiththe average value. Forvariable ‘speed_air’whichhas203 missingoutof 628 (~32%),the missingvaluesare notreplacedsince itwouldleadtoapproximation errors. DATA TREATED_DATA; SET MERGED; IF SPEED_GROUND< 30 THEN DELETE; IF SPEED_GROUND> 140 THEN DELETE; IF HEIGHT < 6 THEN DELETE; IF (DURATION <40 ANDDURATION >0) THEN DELETE; IF MISSING(DURATION) THEN DURATION =154.0065385; IF DISTANCE> 6000 THEN DELETE; RUN; /*831 OBSERVATIONSREMAINING*/ PROCMEANSDATA = TREATED_DATA N NMISS; RUN;
  • 7. The treateddatasetcontains831 observationsand0 missingvaluesforall variablesexpect ‘speed_air’. PROCSORT DATA = TREATED_DATA; BY AIRCRAFTSPEED_GROUND; RUN; PROCSORT DATA = MERGED; BY AIRCRAFTSPEED_GROUND; RUN; DATA COMPLEMENT; MERGED TREATED_DATA (IN = X) MERGED (IN = Y); BY AIRCRAFTSPEED_GROUND; IF (X= 1 ANDY = 0) OR (X=0 ANDY = 1); DROPDURATION SPEED_AIR; RUN; PROCPRINTDATA = COMPLEMENT; RUN; The above statementsgenerate the table of observationsthatwere removedfromthe maindataset. It contains19 observationsasexpected.
  • 8. Summarizingthe distribution To summarize the distributionof eachvariable,itwouldbe sufficienttolookatthe meanand medianvaluesof each. PROCMEANSDATA=TREATED_DATA N MEAN MEDIAN; TITLE "MEAN ANDMEDIAN OFTREATED DATA"; RUN; The mean andmedianvaluesof all the variablesexceptdistance are close toeachotherindicating that theyfollow anormal distribution. Usingthe UNIVARIATEprocedure the distance variable isplottedtounderstandthe distribution. PROCUNIVARIATEDATA=TREATED_DATA PLOT; VARDISTANCE; RUN; The distance variable followsaskewedpatternandmaximumobservationsoccurbetween600to 1000 feet. It was observedthat100 observationswere duplicate andwere removed.The variable speed_air had 628 observationsmissing,the missingvalueswouldbe treatedduringthe dataanalysissteps.
  • 9. Chapter 2 Data Visualization Aim: To understandhowthe independentvariables/factorsaffectthe dependentvariable(distance) beingmodelled. Since the data isbeingmodelled usinglinearregression, itisassumedthatthe independentvariables have a linearrelationshipwiththe predictedvariable.The slope of the plotswillindicate the impact the independentvariableshave onthe independentvariable (variable beingpredicted) and, the shape will indicate the type of relationshipi.e.linear,quadraticetc. andthe spread/variabilityof the data. /*Chapter2 visualization*/ /*Plottingdistance of landingwithothervariablestounderstandthe relationships*/ proc plotdata = treated_data; plotdistance*pitch; plotdistance*height; plotdistance*speed_air; plotdistance*speed_ground; plotdistance*no_pasg; plotdistance*duration; plotdistance*aircraft; run; The plot indicatesthatthe pitchof the aircraft doesnothave much of an impacton the landing distance,the datais concentratedinthe centre of the plotand has highvariability.
  • 10. Hightof the aircraft above the thresholdof the runwayhasa slight positive impactonthe landingdistance. The variable speed_airhasaminimumvalue of 90 MPH, below whichthe valueshave not beencapturedinthe data. The variable speed_airshows ahighpositive correlationwiththe landingdistance andthe spreadof the data pointslooksminimal.Fromthe regression analysiswhichwouldbe carriedoutlater,thisvariable should have ahighersignificance.
  • 11. The speed_groundvariable hasaquadraticrelationshipwiththe landingdistances,below 70mph the impact is almostnegligiblebutabove 70mphthere seemstobe a highpositive correlation similartowhatis beingobservedforthe speed_airvariable. The no_pasg (No.of passengers) doesnotseemtohave animpacton the landingdistances.
  • 12. Durationof the flightseemstohave aslightnegative impactonthe landingdistance. The type of aircraftseemsto be affectingthe landingdistance,Airbusseemstoexhibitshorter landingdistancescomparedtoBoeing. Furtherto understandthe strengthof the relationships,the correlationbetweenthe variablesis calculatedusingthe PROCCORRprocedure inSAS.
  • 13. In the previousplotforspeed_ground,the curve seemstobe flatbelow 70MPH,to testthis a subset of the data below70MPH is takenandthe correlationis calculatedbetweenspeed_groundand distance. data ground_speed_low; settreated_data; if speed_ground>70thendelete; keepspeed_grounddistance; run; proc corr data = ground_speed_low; run; The correlationbetweenthe twovariablesare 0.11 meaning the speedof the aircrafthasminimal impacton the landingdistancesbelow70MPH,0.39 forspeedsbelow 80MPH and0.65 forspeeds below90MPH. For speed_air,the missingvaluescouldbe approximatedtobe equal tothe speed_groundvalues. /*Calculatingthe correlationbetweenthe variables*/ proc corr data = treated_data; run;
  • 14. The highlightedvaluesindicate variableswhichare highlycorrelated. The variablesspeed_airand speed_ground are highlycorrelated witheachotherandare correlatedwiththe predictorvariable (distance).One of the variables should be eliminatedtopreventmulticollinearityerrors. The variablesspeed_ground,speed_airandaircrafttype seemtohave an impacton the landing distances.‘speed_ground’and‘speed_air’have the highestcorrelationcoefficientwithdistance and are correlatedwitheachother.The missingvaluesof speed_aircouldbe imputedwiththe values fromspeed_groundandthe speed_groundvariable couldbe eliminatedaltogether.
  • 15. Chapter 3 Statistical Modelling Aim:Understandthe variablessignificantlyaffectingthe landingdistance andfitalinearmodel to predictlandingdistance of the aircraft SASCodesand outputs: From the previouschapter,the variablesspeed_air, speed_groundandaircrafthassignificantimpact on the landingdistances.Toinclude aircraftasa variable inthe linearmodel, adummyvariable calledaircraft_type iscreated withvalues0and1 for AirbusandBoeingrespectively. /*Run tteston the speed_groundspeed_air*/ data speeds_df; settreated_data; if missing(speed_air) thendelete; keepspeed_airspeed_ground; run; proc ttestdata = speeds_df; pairedspeed_air*speed_ground; run; The null hypothesisbeingtestedisthatthe difference betweenthe meansof the twovariablesis zero.The null hypothesiscannotbe rejectedbecause p>0.05,therefore we couldsaythatthe two variablesare similar.The meandifference betweenthe twois0.0739 MPH and the correlationis 0.987. Giventhese evidence,the speed_groundisverysimilartospeed_air.The missingvaluesof speed_aircanbe imputedwithvaluesfromspeed_ground.
  • 16. A newdatasetiscreatedwiththe above-mentioned changes. /*Creatinga dummyvariable foraircrafttype to include aircrafttype asa *variable inthe linearmodel */ data final_model_data; settreated_data; if aircraft = 'airbus' thenaircraft_type = 0; else aircraft_type =1; if missing(speed_air) then speed_air=speed_ground; drop aircraftspeed_ground; run; proc meansdata = final_model_dataN Nmiss; run; /*Generate corelationmatrix*/ proc corr data = final_model_data; run; Variableswithhighcorrelationwithdistance ishighlighted.None of the independentvariablesare correlatedwitheachother. The final datasethasnot missingvaluesand831 observations.Variablesspeed_groundandaircraft have beeneliminatedfurtheranalysisisperformedonthisdataset.
  • 17. A regressionmodelisfittedonthe dataset. /*Fittinga regressionmodel*/ proc reg data = final_model_data; model distance =speed_airaircraft_type no_pasgpitchheightduration; run; Belowisthe summaryof the correlationand the regressionanalysisof the independentvariables. Independent Variables Direction Correlation Coefficient P - Value of corr coefficient Regression Coefficient Distance ~ All P Value reg coeff Distance~All speed_air Strongpositive 0.8675 <.0001 42.45547 <.0001 aircraft_type 0.2381 <.0001 481.22446 <.0001 no_pasg no visible realtion -0.0177 0.6093 -2.15925 0.1806 pitch no visible realtion 0.08703 0.0121 34.84949 0.1552 height no visible realtion 0.09941 0.5082 14.07733 <.0001 duration Slightnegative -0.04995 0.1503 0.00415 0.9871 Nextstepisto eliminatevariables whichhave p-value <0.005 one by one. The resultantmodel usesair_speed,aircrafttype andheightasdependantvariables.The r -Squared is0.85.
  • 18. Chapter 4
    Model Validation
    Aim: Diagnose the model performance by analysing the plot of the residuals, the R-Squared and the MAPE of the predicted values.
    /*Model validation: check if the residuals are normally distributed*/
    proc reg data = final_model_data;
    model distance = speed_air aircraft_type height;
    run;
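PROC REG produces diagnostic plots by default, but the residuals can also be saved and tested explicitly. The following is a minimal sketch under the assumption that a formal normality check is wanted; the dataset name resid_values and the variable name res are illustrative and do not appear in the original project.

```sas
/* Sketch: save the residuals of the three-variable model */
proc reg data = final_model_data noprint;
    model distance = speed_air aircraft_type height;
    output out = resid_values residual = res predicted = py;
run;
quit;

/* Normality tests and a Q-Q plot of the saved residuals */
proc univariate data = resid_values normal;
    var res;
    qqplot res / normal(mu = est sigma = est);
run;
```

A curved Q-Q plot or a systematic pattern in the residuals against py would point to a missing transformation rather than to noise.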
  • 19. The fit diagnostics for the predicted variable show that the residuals are not random. The non-random pattern shows that the linear model is inappropriate and that the data needs some transformations. The model underestimates the landing distance in the extreme ranges.
    Calculation of MAPE
    proc reg data = final_model_data;
    model distance = speed_air aircraft_type height;
    output out=predicted_values predicted=py;
    run;
    data predicted_values;
    set predicted_values;
    error_abs = abs(distance - py)/distance;
    keep distance py error_abs;
    run;
    proc means data = predicted_values N mean;
    var error_abs;
    run;
    /*MAPE is 22.575%*/
    The prediction accuracy of this model is poor; the predictions could be improved by transforming a few predictor variables.
    Chapter 5
    Remodelling and model Validation
    Aim: Transform predictor variables and ensure the residual plot is random. Compare the new models with the base model.
    SAS codes:
    data remodelling_data;
    set final_model_data;
    speed_air_4 = speed_air**4;
    speed_air_3 = speed_air**3;
    speed_air_2 = speed_air**2;
    height_pitch = height*pitch;
    run;
    proc means data = remodelling_data N Nmiss Min max median;
    run;
    proc corr data = remodelling_data;
    run;
  • 20. From the correlation plot, speed_air_4 has the highest correlation with distance. height_pitch, which is the product of height and pitch, has a higher correlation than either of the individual variables. Both have been selected for the final model's list of independent variables.
    /*Speed_air has no missing values*/
    proc plot data = remodelling_data;
    plot distance*speed_air;
    plot distance*speed_air_2;
    plot distance*speed_air_3;
    plot distance*speed_air_4;
    plot distance*height_pitch;
    run;
  • 21. The transformed speed_air variable (speed_air^4) shows a linear relationship with the landing distance. The speed_air variable will be replaced with speed_air_4.
    /*Fitting a regression model*/
    proc reg data = remodelling_data;
    model distance = speed_air_4 aircraft_type height_pitch;
    run;
  • 22. The model has a better residual plot, though it still underpredicts the longer landing distances; this is acceptable given the lack of data points covering these scenarios. The R-Squared has improved from 0.85 to 0.97, indicating a better fit.
    proc reg data = remodelling_data;
    model distance = speed_air_4 aircraft_type height_pitch;
    output out=predicted_values predicted=py;
    run;
    data predicted_values;
    set predicted_values;
    error_abs = abs(distance - py)/distance;
    keep distance py error_abs;
    run;
    proc means data = predicted_values N mean;
    var error_abs;
    run;
    The MAPE (Mean Absolute Percentage Error) of the improved model is 10.88%.
    The MAPE has reduced from 22.58% to 10.88%; the transformation of the data improved the accuracy of the predictions.
    The model can be further improved with more data points, especially in the scenarios where the landing distances are greater than 4000 feet, since these are the cases to be predicted. More variables, such as the gross weight of the aircraft, the aircraft model number and the wind direction, would likely improve this model.
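For reference, the quantity computed by the error_abs step and the PROC MEANS call above can be written as follows, where y_i is the observed distance, \hat{y}_i is the predicted value py, and n is the number of observations. The SAS code reports the mean as a fraction (for example 0.1088), which corresponds to 10.88% after multiplying by 100.

```latex
\[
\mathrm{MAPE} \;=\; \frac{100\%}{n} \sum_{i=1}^{n} \frac{\lvert y_i - \hat{y}_i \rvert}{y_i}
\]
```

Because each error is divided by the observed distance, short landings with a given absolute error contribute more to the MAPE than long landings with the same absolute error.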
  • 23. Appendix.
    Variable dictionary:
    Aircraft: The make of an aircraft (Boeing or Airbus).
    Duration (in minutes): Flight duration between taking off and landing. The duration of a normal flight should always be greater than 40 min.
    No_pasg: The number of passengers in a flight.
    Speed_ground (in miles per hour): The ground speed of an aircraft when passing over the threshold of the runway. If its value is less than 30 MPH or greater than 140 MPH, then the landing would be considered as abnormal.
    Speed_air (in miles per hour): The air speed of an aircraft when passing over the threshold of the runway. If its value is less than 30 MPH or greater than 140 MPH, then the landing would be considered as abnormal.
    Height (in meters): The height of an aircraft when it is passing over the threshold of the runway. The landing aircraft is required to be at least 6 meters high at the threshold of the runway.
    Pitch (in degrees): Pitch angle of an aircraft when it is passing over the threshold of the runway.
    Distance (in feet): The landing distance of an aircraft. More specifically, it refers to the distance between the threshold of the runway and the point where the aircraft can be fully stopped. The length of the airport runway is typically less than 6000 feet.