Data Source:
The dataset was downloaded from Kaggle.com. The following is the link to the dataset:
https://www.kaggle.com/wsj/college-salaries. All data was obtained from the Wall Street Journal, based on
data from PayScale Inc.
Dataset Description:
The dataset includes three tables:
• Salaries for colleges by type
• Salaries for colleges by region
• Degrees that pay you back
The dataset gives the starting salary, mid-level salary and percentage increase
according to the school attended. It also gives the type of the school attended
and the region in which the school is located. The last table gives the median salaries
according to undergraduate major.
The following are the details of the columns in the tables:
Salaries for colleges by type:
• This table has 269 rows and 8 columns.
• School name and school type give the name and type of the school. There are 249 unique
school names and 5 school types.
• The remaining columns give the range of salaries of the students who graduated from the
respective schools over a 10-year period.
Degrees that pay you back:
• This table has 50 rows and 8 columns.
• Undergraduate Major has 50 distinct values. The remaining columns give the range of salaries
according to the undergraduate degree.
Salaries for colleges by region:
• This table has 320 rows and 8 columns.
• Region gives the region where the school is located. It has 5 distinct values. The other columns
give the range of salaries by school and region.
Importing Data into SQL and Creating Tables:
The tables were downloaded from the source website in CSV format. They were imported into SQL using
the import wizard. While importing the data for the 'salaries for colleges by type' table, we had to define
school name and school type as a composite primary key, because several schools are listed under
more than one type, depending on the courses offered. The composite primary key is therefore defined
to maintain entity integrity.
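The composite key can be illustrated with a minimal sketch. It uses Python's built-in sqlite3 module rather than the SQL Server import wizard used in this report, and the table name, column names and rows are hypothetical, chosen only to show that the same school may appear under two types while an exact (name, type) duplicate is rejected.

```python
import sqlite3

# In-memory SQLite database; a stand-in for the SQL Server database in the report.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE college_type (
        school_name     TEXT NOT NULL,
        school_type     TEXT NOT NULL,
        starting_median NUMERIC,
        PRIMARY KEY (school_name, school_type)  -- composite primary key
    )
    """
)

# Hypothetical rows: the same school listed under two different types is allowed.
conn.executemany(
    "INSERT INTO college_type VALUES (?, ?, ?)",
    [
        ("Example University", "Party", 45000),
        ("Example University", "State", 45000),
    ],
)

# Repeating an existing (school_name, school_type) pair violates the key.
try:
    conn.execute(
        "INSERT INTO college_type VALUES (?, ?, ?)",
        ("Example University", "Party", 46000),
    )
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM college_type").fetchone()[0]
print(duplicate_rejected, row_count)
```

With school name alone as the key, the second row would have been rejected, which is why both columns are needed.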
After importing the data, new tables are created, and the imported data is inserted into the new tables.
While inserting the data into the new tables, the data types of the columns containing the salary information
are converted from money to decimal.
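The conversion itself was done in SQL. As a rough illustration of the same idea outside the database, the sketch below turns a currency-formatted string into an exact decimal value. The input format ("$45,000.00") and the helper name money_to_decimal are assumptions made for this example, not something taken from the import step above.

```python
from decimal import Decimal


def money_to_decimal(text):
    """Convert a currency string such as '$45,000.00' to a Decimal.

    Assumes a leading dollar sign and comma thousands separators.
    """
    cleaned = text.replace("$", "").replace(",", "").strip()
    return Decimal(cleaned)


salary = money_to_decimal("$45,000.00")
print(salary)
```

Decimal is used instead of float so that the cents are kept exactly, which mirrors the reason for choosing a decimal column type.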
General Statistics Using SQL:
After normalizing and inserting the data we get the following tables:
• DBO.DEGREE
• DBO.COLLEGE
• DBO.REGION
We compute the following general statistics using SQL:
• Calculating the average of median salaries by school type:
• Calculating the average of median salaries by region:
• Highest starting median salary by region with college name:
• Highest starting median salary by school type with school name:
• Selecting the top 5 undergraduate majors by starting median salary:
• Selecting the top 5 undergraduate majors according to percentage change in salaries:
• Adding a category variable according to starting median salary:
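The queries listed above follow a few common patterns: grouping with an average, ordering with a row limit, and deriving a category with CASE. The sketch below shows these patterns using Python's sqlite3 module. The table, column names, rows and salary thresholds are all made up for illustration and are not the report's data or its actual queries. SQLite uses LIMIT where SQL Server uses TOP.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE college (school_name TEXT, school_type TEXT, starting_median REAL)"
)
# Hypothetical rows, for illustration only.
conn.executemany(
    "INSERT INTO college VALUES (?, ?, ?)",
    [
        ("School A", "Engineering", 60000),
        ("School B", "Engineering", 58000),
        ("School C", "Party", 45000),
        ("School D", "Party", 47000),
        ("School E", "Liberal Arts", 44000),
    ],
)

# Average of starting median salaries by school type.
avg_by_type = conn.execute(
    """
    SELECT school_type, AVG(starting_median)
    FROM college
    GROUP BY school_type
    ORDER BY school_type
    """
).fetchall()

# Top 2 schools by starting median salary (TOP 2 in SQL Server syntax).
top_two = conn.execute(
    "SELECT school_name FROM college ORDER BY starting_median DESC LIMIT 2"
).fetchall()

# Category variable derived from the starting median salary; thresholds are arbitrary.
categories = conn.execute(
    """
    SELECT school_name,
           CASE WHEN starting_median >= 55000 THEN 'High'
                WHEN starting_median >= 45000 THEN 'Medium'
                ELSE 'Low'
           END AS salary_category
    FROM college
    ORDER BY school_name
    """
).fetchall()

print(avg_by_type)
print(top_two)
print(categories)
```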
Analysis in R:
The database is imported into R for further analysis. The RODBC library is used to establish a
connection and import the tables into R.
The three tables from the database are imported and saved in data frames. Also, we run an inner join on
the region and college tables, and the result is imported into R as a data frame named reg.col, using the
sqlQuery function.
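To show what the inner join produces, here is a small sketch using Python's sqlite3 module instead of RODBC. The join key (school name), the column names and the rows are assumptions for this example. Only schools present in both tables survive the join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE college (school_name TEXT, school_type TEXT)")
conn.execute(
    "CREATE TABLE region (school_name TEXT, region TEXT, starting_median REAL)"
)
# Hypothetical rows: School B has no region row, School C has no college row.
conn.executemany(
    "INSERT INTO college VALUES (?, ?)",
    [("School A", "Engineering"), ("School B", "Party")],
)
conn.executemany(
    "INSERT INTO region VALUES (?, ?, ?)",
    [("School A", "Northeastern", 60000), ("School C", "Southern", 42000)],
)

# Inner join on school name (assumed key); unmatched schools are dropped.
reg_col = conn.execute(
    """
    SELECT r.school_name, r.region, c.school_type, r.starting_median
    FROM region AS r
    INNER JOIN college AS c ON c.school_name = r.school_name
    ORDER BY r.school_name
    """
).fetchall()

print(reg_col)
```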
The following output gives the summary of the imported tables:
The following output gives the total number of missing values and the missing values by column for the
data frames:
There are no missing values in the degree data frame. There are 88 missing values in the reg.col data
frame.
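The missing-value check amounts to counting empty cells per column and summing the counts. A minimal sketch of that counting in plain Python follows. The records are hypothetical, None stands in for a missing value, and missing_by_column is a helper name introduced here.

```python
# Hypothetical records; None marks a missing value.
rows = [
    {"school_name": "School A", "region": "Midwestern", "school_type": "State", "starting_median": 45000.0},
    {"school_name": "School B", "region": "Southern", "school_type": None, "starting_median": 42000.0},
    {"school_name": "School C", "region": "Western", "school_type": None, "starting_median": None},
]


def missing_by_column(records):
    """Count missing (None) values for each column of a list of dict records."""
    counts = {column: 0 for column in records[0]}
    for record in records:
        for column, value in record.items():
            if value is None:
                counts[column] += 1
    return counts


by_column = missing_by_column(rows)
total_missing = sum(by_column.values())
print(by_column, total_missing)
```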
The following are the histograms of the starting median salary and mid-level median salary for the two
data frames:
The histograms give the distribution of the starting median salary and the mid-level median salary in the
two tables.
The following shows the boxplots for the starting median salary and mid-level median salary. The boxplot
depicts the inter-quartile range and shows possible outliers.
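The usual boxplot convention flags points lying more than 1.5 times the inter-quartile range beyond the quartiles as possible outliers. The sketch below applies that rule to made-up salary values. It uses Python's statistics.quantiles with its default method, which can give slightly different quartiles from the hinges R's boxplot uses, so treat it as an approximation of the idea rather than a reproduction of the plots.

```python
from statistics import quantiles

# Hypothetical starting median salaries, for illustration only.
salaries = [40000, 42000, 45000, 47000, 50000, 52000, 55000, 100000]

# Quartiles (default 'exclusive' method) and inter-quartile range.
q1, q2, q3 = quantiles(salaries, n=4)
iqr = q3 - q1

# 1.5 * IQR fences, the conventional boxplot whisker limits.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [s for s in salaries if s < lower_fence or s > upper_fence]
print(q1, q3, iqr, outliers)
```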
Starting median salary in the degree data frame:
Mid-level median salary in the degree data frame:
Starting median salary in the reg.col data frame:
Mid-level median salary in the reg.col data frame:
The following is a bar graph of the top 7 undergraduate degrees according to the starting median
salaries:
We can see that physician assistant, chemical engineering and computer engineering are the top 3
majors.
A linear regression model for predicting the starting median salary based on the school type and
region is fitted. Below is the summary of the model:
As school type and region are categorical variables, R creates dummy variables for them in the
regression model. The coefficients of the categories for the two variables are shown in the summary
output. We can see that all the coefficient estimates except the one for school type Ivy League are significant. The
R-squared value is 57%, which means that 57% of the variance in the response variable is explained by the
predictor variables.
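How dummy variables and R-squared work can be seen in a simplified sketch with a single categorical predictor (the model above uses two). With an intercept and one dummy per non-baseline category, the least-squares intercept equals the baseline group's mean and each dummy coefficient equals that group's mean minus the baseline mean. The data is hypothetical, and the baseline is taken as the alphabetically first level, which is R's default for factors.

```python
from statistics import mean

# Hypothetical (school type, starting median salary) observations.
observations = [
    ("Engineering", 60000.0),
    ("Engineering", 58000.0),
    ("Party", 45000.0),
    ("Party", 47000.0),
    ("State", 43000.0),
    ("State", 45000.0),
]

# Baseline level: alphabetically first, as R does by default.
levels = sorted({school_type for school_type, _ in observations})
baseline = levels[0]

group_means = {
    level: mean(salary for school_type, salary in observations if school_type == level)
    for level in levels
}

# For one categorical predictor, the least-squares fit reduces to group means.
intercept = group_means[baseline]
dummy_coefficients = {level: group_means[level] - intercept for level in levels[1:]}


def predict(school_type):
    """Fitted value: intercept plus the dummy coefficient (0 for the baseline)."""
    return intercept + dummy_coefficients.get(school_type, 0.0)


# R-squared = 1 - residual sum of squares / total sum of squares.
salaries = [salary for _, salary in observations]
grand_mean = mean(salaries)
ss_total = sum((s - grand_mean) ** 2 for s in salaries)
ss_residual = sum((s - predict(t)) ** 2 for t, s in observations)
r_squared = 1 - ss_residual / ss_total

print(intercept, dummy_coefficients, round(r_squared, 3))
```

Each coefficient is read relative to the baseline category, which is why one category per variable does not appear in the summary output.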
Also, a linear regression model with the mid-level median salary as the response and the starting median
salary and school type as the predictors is developed. Below is the summary of the model.
All the coefficient estimates are significant. The p-value for the F-statistic also suggests that there is a
linear relationship between the response variable and the predictor variables. The R-squared is 85%,
which means that 85% of the variance in the response variable can be explained by the predictors.
Visualization in Tableau:
Below is the bar graph of the average salaries according to school type:
Below is the bar graph of the average salaries according to region:
Below is the scatter plot of starting median salaries against mid-level median salaries according to
school type:
Below is the text plot of the school names: the size varies according to the average of mid-level
median salaries and the color varies according to the average of starting median salaries.
Summary:
From the data we can see that starting salaries vary significantly according to college type, but the
growth in earning power shows less disparity. After 10 years, graduates of Ivy League schools earned
99% more than they did at graduation. Party school graduates saw an 85% increase. Engineering school
graduates had the least growth, earning 76% more 10 years after school.
Midwest college graduates tend to earn the lowest salaries both at graduation and at mid-career,
according to the PayScale Inc. survey. Graduates of schools in the Northeast and California fared best.
The data shows that graduates of majors like Philosophy and International Relations earned 103.5% and
97.8% more, respectively, about 10 years post-commencement. Majors that didn't show as much salary
growth include Nursing and Information Technology.
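The growth figures quoted above are percentage changes from the starting salary to the mid-career salary. As a short worked sketch of that arithmetic, the salary values below are hypothetical, chosen so that the result comes out at 99%, and percent_increase is a helper name introduced here.

```python
def percent_increase(starting_salary, mid_career_salary):
    """Percentage change from the starting salary to the mid-career salary."""
    return (mid_career_salary - starting_salary) * 100 / starting_salary


# Hypothetical salaries, for illustration only.
growth = percent_increase(50000, 99500)
print(round(growth, 1))
```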
Challenges:
While importing the CSV files into SQL, I faced errors related to data type and delimiter. I converted the source
files into xlsx format and then imported them without any errors. Also, a regression model involving
categorical variables creates dummy variables and assigns coefficient estimates to them. It becomes
a little confusing when there are multiple categorical variables as predictors.

Hw5 deone pranil
