University of Cincinnati Capstone Project Summer 2016
Optimizing a baseball lineup:
Getting the most bang for your buck
Nicholas Imholte
First Reader: Dr. Michael Magazine
Second Reader: Dr. Yichen Qin
Table of Contents
1. Abstract
2. Introduction
3. Visualizations and Regressions
4. Clustering
5. Optimization
6. Simulation
7. Conclusion
8. Appendix
Abstract:
Given a fixed payroll, and focusing purely on the offensive side of the ball, how should a
baseball team assign its funds to give itself the highest average number of runs possible? In this
essay, I will attempt to answer this question using regression, clustering, optimization, and
simulation. First, I will use regression to model baseball scores, with the goal being to
determine how each event in a baseball game impacts how many runs a team scores. Second, I
will use clustering to determine what kinds of hitters there are, and how much each type of
hitter costs. Third, I will use optimization to determine the optimal arrangement of hitter
clusters for a variety of payrolls. Finally, I will complement this analysis with a simulation, and
see how the results from the two approaches compare.
Introduction
Every business in the world has to at some point ask the question “What is the best way for me
to deploy my resources?” Sporting competitions are no different, and though teams have a large
amount of money to spend, the players they need to purchase are also expensive. Therefore, a natural
question to ask is “Which players should I hire?” In this essay, I will be restricting my attention to
baseball and offensive lineups. My goal will be to determine what kinds of players are most cost
efficient, and exactly how you should design your lineup to maximize your run potential, as this is the
ultimate measure of team offense. Although pitching and defense are of course important topics, they
will not be considered in this analysis.
To answer this question, I will employ a five-step procedure. First, I will look at thirty-five years
of regular season baseball data from Retrosheet. I will create various models using linear and
multinomial regression, regression and classification trees, and Poisson regression. This will allow me to
determine the intrinsic value of each event in a baseball game. In order to determine the best model, I
will use Mean Squared Error (MSE) on a testing dataset. Next, I will look at ten years of player data from
Baseball Reference to determine player types. I will use various clustering methods and metrics in order
to group players into a few categories according to their hitting abilities. I will then use salary data from
USA Today in order to determine how much players from each cluster cost on average. Next, I will use
Solver to optimize a lineup based on budget and player constraints. Finally, I will use a simulation in
Arena and compare these results.
Visualizations and Regressions
To begin, I started with regular season game data from the past 35 years. The response variable
is runs scored, and the predictor variables are split into three categories: Offensive, Defensive, and
Location. On the offensive side I included Hits, Doubles, Triples, Home Runs, Sacrifice Flies and Hits,
Walks and Hit By Pitches, Strike Outs, Stolen Bases and Caught Stealing, and finally Grounding into
Double Plays. On the defensive side I included number of pitchers faced, Balks, Passed Balls and Wild
Pitches, Catcher's Interference, and Errors. Lastly, I included whether a team was home or visiting.
Although an offense cannot affect defensive or location statistics, including them allowed me to account
for noise factors and thus record a more precise measurement of offensive impact. Before building the
regression models, I created visualizations in Tableau, demonstrating the nature of the Runs scored
variable and how it relates to some of the significant predictors. First, here is a histogram of runs.
Visualization 1: Histogram of Runs Scored.
The most common score is 3 runs, and around 90% of scores are within the 0 to 8 range. Next, here is a
table showing the 6-number summary of runs scored.
Visualization 2: 6-number summary of Runs.
The mean is slightly larger than the median, indicating a right-skewed dataset, and while there are a few
outliers, the interquartile range is a relatively compact 2 to 6. Finally, here is a chart indicating the
relationship between hits and home runs, and their combined effect on runs.
Visualization 3: Hits and Home Runs vs Runs.
The two axes are hits and home runs, and the dots are colored by average number of runs scored in
such games. Red is below average, grey is average, and black is above average. Now that I have a feel for
the data, I can move on to building regression models. To begin with, I split the data into 90% training
and 10% testing, and built five predictive models: Linear, Multinomial, Regression Tree, Classification
Tree, and Poisson. Here is a table showing the result of these models. I compared the 5 models using
mean squared error and mean absolute error, on both training and testing.
                     Training        Testing
Model                MSE     MAE     MSE     MAE
Linear               2.25    1.17    2.26    1.17
Multinomial          2.32    1.12    2.32    1.12
Regression Tree      3.06    1.35    3.1     1.36
Classification Tree  29.96   4.51    30.19   4.5
Poisson              17.2    3.25    17.35   3.24
Table 1: Comparison of models
Clearly the best performing models are the Linear and Multinomial regressions. What is interesting is
that the Linear model has slightly better MSE but slightly worse MAE. This means that the Multinomial
model is slightly closer on a typical prediction, but has a few predictions that are off by larger amounts.
However, the difference between the two is minimal enough to not be of consequence. Further, since the
Linear model is simpler and more interpretable, I decided to move forward with the Linear Regression model.
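To illustrate how one model can win on MSE while losing on MAE, here is a minimal sketch in Python. The four game scores and both sets of predictions are hypothetical values chosen for illustration; they are not taken from the Retrosheet data or from the fitted models.

```python
def mse(actual, predicted):
    """Mean squared error: the average of the squared prediction errors."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def mae(actual, predicted):
    """Mean absolute error: the average of the absolute prediction errors."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical runs scored in four games (illustrative only).
actual = [3, 5, 2, 7]
# Hypothetical model A: off by exactly one run in every game.
model_a = [4, 4, 3, 6]
# Hypothetical model B: exact in three games, off by three runs in one.
model_b = [3, 5, 2, 4]

mse_a, mae_a = mse(actual, model_a), mae(actual, model_a)
mse_b, mae_b = mse(actual, model_b), mae(actual, model_b)

print(f"Model A: MSE={mse_a:.2f}, MAE={mae_a:.2f}")
print(f"Model B: MSE={mse_b:.2f}, MAE={mae_b:.2f}")
```

Model A has the lower MSE and model B the lower MAE, because squaring penalizes the single large miss more heavily than several small ones. This is the same pattern seen between the Linear and Multinomial rows of Table 1.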
The linear model was created using an automatic stepwise selection procedure, with AIC as the
defining metric. See Appendix C for the regression coefficients and Appendix A for abbreviation
definitions. We can glean quite a few pieces of information from this table. First of all, almost all of the
variables are significant. The only variable not selected to be in the model is pitchers faced. At first I was
surprised by this result, as typically if a team is forced to use a lot of pitchers, it means the starter was
pulled early, usually because he allowed a lot of runs. But on the other hand, a lot of pitchers can
mean that the game went deep into extra innings, usually because of lack of offense. Second, note that
we can use the coefficients to determine at what rate a team must be successful at stealing bases for
steal attempts to be worthwhile. Suppose a team is successful at stealing p of the time. Then for a steal
attempt to be worthwhile, p must satisfy:
p*(.0331) + (1-p)*(-.2964) > 0.
Solving for p, we get p > .8995. Thus, a team must be successful at least 90% of the time for an attempt
to be worthwhile on average. As we shall later see, no player group manages this level of success.
Finally, note that all of the coefficients have the sign you would expect them to have. All of the events
typically seen as “positive” have positive coefficients, while all of the “negative” events have negative
coefficients. Note that sacrifice hits have a negative coefficient. This suggests that sacrificing outs for
bases tends to result in a loss of runs.
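The break-even stolen base rate can be checked with a short calculation. The sketch below uses the two coefficients quoted in the inequality above (.0331 for a stolen base, -.2964 for a caught stealing); the function names are my own and purely illustrative.

```python
SB_COEF = 0.0331    # runs added per successful steal (coefficient quoted in the text)
CS_COEF = -0.2964   # runs lost per caught stealing (coefficient quoted in the text)


def expected_runs_per_attempt(p, gain=SB_COEF, loss=CS_COEF):
    """Expected change in runs for one steal attempt with success rate p."""
    return p * gain + (1 - p) * loss


def break_even_rate(gain=SB_COEF, loss=CS_COEF):
    """Success rate p at which p*gain + (1-p)*loss equals zero."""
    return -loss / (gain - loss)


p_star = break_even_rate()
print(round(p_star, 4))  # approximately 0.8995, matching the text
```

For example, a team succeeding 85% of the time has a negative expected value per attempt, while a team succeeding 95% of the time has a positive one.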
Having established our linear model, we can move to diagnostics to determine if the
assumptions of linear regression hold, and determine the goodness of fit. We start with a plot of
residuals vs fitted values to assess linearity and homoscedasticity. Ideally we would like the graph to be
random, with no obvious patterns, and residuals spread evenly between positive and negative.
Graph 1: Residual plot of Linear Model
Everything seems to check out with this graph. Although there are some large residuals, this is to be
expected with a sample of nearly 150,000. Next, we move to the Q-Q plot to assess the normality of
the residuals. Ideally we would like to see all the residuals fall on a 45 degree line.
Graph 2: Q-Q plot of Linear Model.
Most of the residuals lie on the 45 degree line, so even though there are a few deviations at the end, we
can assume the residuals are normally distributed. Finally, we can look at a plot of Cook’s Distance.
These values measure the influence of observations on the model.
Graph 3: Cook’s Distance
Although observation 1489 does stand out on this plot, it still has an extremely small Cook’s Distance, so
there is no reason to be concerned, especially considering the sample size of close to 150,000. At this
point we conclude that our model sufficiently explains the variation in the data, and move on to player
clustering.
Clustering
Now that we have a model for how each event impacts run scoring, our next goal is to attempt
to cluster hitters into a few distinct clusters. To do this, I first gathered 10 years of hitting data from
Baseballreference.com. I restricted my analysis to rate statistics, for instance batting average instead
of hits, to account for the disparity in plate appearances between players. I further restricted my data
by only considering players with at least 100 plate appearances in a given year. Further, I needed to be
sure that I had salary data from USA Today for each observation. Taking all of this into consideration, I
ended up with 3,626 observations, which is more than enough to perform clustering analysis. Next, I had
to determine which variables to consider for the analysis. I chose only attributes that were both
significant, according to my linear model, and definitively under the control of an individual player.
I thus chose the following statistics in their corresponding rate form: hits, doubles, triples, home runs,
strike outs, grounding into double plays, walks and hit by pitches. I did not consider defensive statistics,
as an offensive player has no control over, say, when an opposing pitcher throws a wild pitch. I also did
not include either kind of sacrifice, as a player has no control over when he will come to bat in a
situation when they can occur. The final question is whether or not to consider stolen bases, caught
stealing, stolen base percentage, or none of the above. As indicated above, stolen bases are positive
events, but only if they are successful at least 90% of the time. After some experimentation with
clustering, I found that no group ever comes close to that kind of success rate on a large scale.
Therefore, for the purposes of optimizing runs, I will assume that my theoretical team will not attempt
to steal. Having determined the observations and the variables, the next step in clustering is to
determine the number of clusters. Since there is no one way to determine this, I went through a number
of different methods and compared their results.
Table 2: Cluster goodness of fit metrics. Top Left: Within Sum of Squares. Top Right: Silhouette Index.
Bottom Left: Dunn Index. Bottom Right: Dendrogram using Ward’s Method.
The first graph is a within-group sum of squares plot. The idea is to look for an “elbow”, where adding
extra groups does not significantly reduce the WSS. This plot has no obvious elbow, but seems to
taper off between 5 and 10. The second graph, called the Silhouette Index, measures likeness of an
object to its cluster vs other clusters. Here, spikes are desirable, and thus this metric suggests either 2 or
5 clusters. The third graph, called the Dunn Index, measures compactness of clusters and distance to
other clusters. Again, spikes in the graph are desirable, thus this metric suggests 6, 8, or 10 clusters.
Finally, the fourth graph takes a different approach to clustering, using hierarchical clustering with Ward’s
Method. In this case, we are looking to balance a small height with a small number of clusters. We can
see that at a height of 200 there are seven clusters, while at a height of 250 there are only five. Putting
all of this analysis together, I decided that five was an appropriate number of clusters to consider. I used
a k-means algorithm with 5 centers to create the final clustering. Here is a summary of the cluster
statistics, along with average salary.
Table 3: Player clustering final results. Salary and mean hitting statistics.
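As a small illustration of the within-group sum of squares (WSS) used when choosing the number of clusters, the sketch below computes WSS for one grouping and for a two-cluster split of the same points. The four hitters and their statistics are hypothetical numbers expressed in percent, and the cluster assignments are fixed by hand rather than produced by k-means.

```python
def centroid(points):
    """Coordinate-wise mean of a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))


def within_sum_of_squares(clusters):
    """Total squared Euclidean distance from each point to its cluster centroid."""
    total = 0.0
    for points in clusters:
        center = centroid(points)
        total += sum(
            sum((x - m) ** 2 for x, m in zip(p, center)) for p in points
        )
    return total


# Hypothetical hitters as (hit rate %, home run rate %); illustrative values only.
hitters = [(25, 2), (26, 3), (30, 5), (31, 6)]

wss_1 = within_sum_of_squares([hitters])                   # everyone in one cluster
wss_2 = within_sum_of_squares([hitters[:2], hitters[2:]])  # a hand-picked 2-cluster split

print(wss_1, wss_2)
```

Splitting into two groups reduces the WSS sharply in this toy example. On an elbow plot, the point of interest is where further splits stop producing reductions of that size.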
Having done this, I looked for patterns in this table to see if I could determine a convenient label for
each group. To begin with, player type 3 has the best batting average and home run rate, along with the
best double and walk rate. Clearly, group 3 has the best players overall, along with the steepest price
tag. Thus, I labeled this group “Premium” players. Group 5 has low batting average, high home run rate,
and the worst strike out rate. Therefore, I labeled this group “Power”. Group 4 has the highest triple rate
and the smallest double play rate. Thus, I labeled this group “Speed”. Group 2 has the second best hit
rate, the best strike out rate, and a small home run rate. Hence, I labeled this group “Contact”. Group 1
is somewhat in the middle on every statistic, so I labeled this group “Average”. Do not be confused by
the word average, as we are not talking about it in the sense of batting average, but rather in the sense
of mean or middle class.
Optimization
Now that we have both a model for how events impact scoring and a clustering of player types
along with cost, we can combine these two into an optimization model. To do this, first we assign each
player type a “value”, which is simply the product of their ability to cause an event, times the linear
coefficient we determined in the linear regression portion. For example, here is how the value of a
Premium player is calculated.
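As a rough sketch of this kind of value calculation, the snippet below multiplies each event rate by a coefficient and sums the products. Summing over events is my reading of the description above. All numbers are hypothetical placeholders: the real coefficients are in Appendix C and the real rates are in Table 3, and neither is reproduced here.

```python
# Hypothetical regression coefficients (runs per event); NOT the Appendix C values.
HYPOTHETICAL_COEFS = {"H": 0.5, "HR": 1.0, "BB": 0.3, "SO": -0.1}

# Hypothetical per-plate-appearance event rates for one player type; NOT from Table 3.
HYPOTHETICAL_RATES = {"H": 0.28, "HR": 0.05, "BB": 0.10, "SO": 0.20}


def player_value(rates, coefs):
    """Sum over events of (rate of causing the event) * (regression coefficient)."""
    return sum(rates[event] * coefs[event] for event in rates)


value = player_value(HYPOTHETICAL_RATES, HYPOTHETICAL_COEFS)
print(round(value, 3))
```

With the placeholder numbers the contributions are 0.14 + 0.05 + 0.03 - 0.02, so negative events such as strike outs lower a player type's value in the same way the negative coefficients lower predicted runs.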
Here is a summary of player type, along with value and cost.
Table 4: Player summary.
Now that we have values assigned to each player type, there is one more question we must address
before beginning the optimization. Namely: “How many hitters do I need?” The answer to this question
depends on which league you are in. In the American League, pitchers do not bat, and thus we need 9
hitters. However, in the National League, the pitcher does bat, and thus we really only have control over
8 of the hitters. Fortunately, the pitcher is generally a very poor hitter, so including him in the lineup
should only have the effect of reducing the number of hitters by one. Other than that, the optimization
process is the same. I used Solver to optimize a lineup consisting of 9 hitters, subject to the constraint of
various budgets. Here is a table summarizing my findings, with the total budget presented in millions.
Table 5: Optimization results using Solver. Columns indicate the optimal number of hitters to use.
First, note that once you have maxed out on premium hitters, there is no point in going any further, as it
cannot get any better. On the other end, you cannot go any cheaper than all speed hitters, so if you
cannot afford them, you have no team. Second, note that power and contact hitters are not chosen for
any budget. Contact hitters are too expensive, while power hitters are not effective enough. Finally, in
both leagues the general strategy is to hire as many premium hitters as possible, then fill out the
remainder with average, and finally speed hitters.
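The structure of this optimization (choose how many hitters of each type to hire, subject to a lineup size and a budget) can be sketched with a brute-force search. This is not the Solver model used for Table 5: exhaustive enumeration is swapped in for Solver, and the values and costs below are hypothetical placeholders, not the figures from Table 4.

```python
from collections import Counter
from itertools import combinations_with_replacement

# Hypothetical (value, cost in $ millions) per player type; NOT the Table 4 figures.
HYPOTHETICAL_TYPES = {
    "Premium": (0.30, 10.0),
    "Contact": (0.23, 6.0),
    "Average": (0.22, 4.0),
    "Power": (0.19, 3.0),
    "Speed": (0.20, 2.0),
}


def best_lineup(budget, lineup_size=9, types=HYPOTHETICAL_TYPES):
    """Return (total value, total cost, Counter of types) for the best affordable
    lineup, or None if no lineup of the required size fits the budget."""
    best = None
    # Each combination is an unordered multiset of player types.
    for combo in combinations_with_replacement(types, lineup_size):
        cost = sum(types[t][1] for t in combo)
        if cost > budget:
            continue
        value = sum(types[t][0] for t in combo)
        if best is None or value > best[0]:
            best = (value, cost, Counter(combo))
    return best


for budget in (17, 18, 40, 90):
    print(budget, best_lineup(budget))
```

With these placeholder numbers, a budget below the cost of nine Speed hitters yields no lineup at all, and a budget that covers nine Premium hitters selects all Premium. These are the two boundary cases described above.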
This concludes the regression portion of my analysis. Although this gives a broad picture of
effectiveness, the main critique of this approach is that it assumes the value of each event is
independent of when it occurs. This is of course not true, as a double with a runner on 2nd is going to be
more effective than a double with no one on base. A natural way to continue would be to run a
simulation, using the 5 player types as inputs.
Simulation
In order to perform a simulation, I created a baseball model in Arena. It works by creating a
predetermined number of players of each player type, using triangular distributions for each statistic, and
then uses a random number generator to determine the outcome of each plate appearance. A simple flow
chart then models the progress of base runners, runs scored, and outs recorded. In creating the model, I
made the following assumptions:
1. Each hit advances all base runners the same number of positions as the batter. So for instance, a
single always moves every base runner exactly one base.
2. Strikeouts and double plays never advance base runners.
3. All other outs advance base runners exactly one position.
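The base-running assumptions listed above can be sketched as a small function that plays out one half-inning from a fixed sequence of events. This is an illustrative Python sketch, not the Arena model. Walks and double plays are left out because the list above does not fully specify how they are handled, and the rule that no run scores on the third out is my own assumption.

```python
# Bases gained by the batter (and, per assumption 1, by every runner) on each hit.
HIT_BASES = {"single": 1, "double": 2, "triple": 3, "home_run": 4}


def play_half_inning(events):
    """Return the runs scored from a sequence of plate-appearance outcomes.

    Supported events: the keys of HIT_BASES, "strikeout", and "out".
    """
    runners = []  # current base (1 to 3) of each runner on base
    outs = 0
    runs = 0
    for event in events:
        if outs >= 3:
            break
        if event in HIT_BASES:
            gained = HIT_BASES[event]
            # Every runner moves as far as the batter; the batter ends on base `gained`.
            runners = [base + gained for base in runners] + [gained]
        elif event == "strikeout":
            outs += 1  # assumption 2: runners hold
        elif event == "out":
            outs += 1
            if outs < 3:  # assumed: nothing scores on the third out
                runners = [base + 1 for base in runners]  # assumption 3
        else:
            raise ValueError(f"unsupported event: {event}")
        # Any runner at or past home (base 4) has scored.
        runs += sum(1 for base in runners if base >= 4)
        runners = [base for base in runners if base < 4]
    return runs


inning = ["single", "single", "double", "strikeout", "out", "home_run", "out"]
print(play_half_inning(inning))
```

In the example inning, the double scores the lead runner, the ordinary out moves a runner home from third, and the home run clears the remaining runner along with the batter.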
I used OptQuest to determine the optimal number of each class of hitters, using 300 replications per
event. Here are the final results for both leagues.
Table 6: Optimization results using OptQuest in Arena.
To get the NL result, I added in a pitcher, using average pitcher statistics for the last 5 years. The two
models, Solver and OptQuest, agree on a few points but disagree on others. First, neither model selected
a Power hitter at any budget. Second, both models agree you should get as many Premium hitters as
you can afford, and fill out the rest of the lineup as best as you can. However, the two methods clearly
disagree on the value of Speed hitters vs Average hitters. In the regression approach, Average hitters are
slightly more valuable overall, but during simulation, Speed hitters perform better. Here is a more
detailed comparison between the two clusters.
Table 7: Detailed comparison of Speed vs Average
I believe the difference between the two may lie in the distribution I chose to use in the simulation
approach. In most simulations, the triangular distribution is typically used when you have a minimum, a
maximum, and a “typical” value. However, in this case, it resulted in the Speed hitters having a larger On
Base Percentage plus Slugging Average (OPS) than the Average hitters. This likely caused them to score
more runs in simulation. One way to address this might be to try different distributions and see if they
give different results.
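One way a triangular distribution can shift a simulated statistic is through its mean, which is (minimum + typical + maximum) / 3 and therefore differs from the “typical” value whenever the range is lopsided. The sketch below illustrates this with Python's standard `random.triangular`. The minimum, typical, and maximum values are hypothetical and are not the parameters used in the Arena model, so this shows a possible mechanism rather than a diagnosis.

```python
import random


def triangular_mean(low, mode, high):
    """Theoretical mean of a triangular distribution."""
    return (low + mode + high) / 3


# Hypothetical parameters for a home run rate with a long upper tail.
low, mode, high = 0.00, 0.02, 0.10

random.seed(0)  # fixed seed so the sketch is repeatable
# Note the argument order of random.triangular: (low, high, mode).
draws = [random.triangular(low, high, mode) for _ in range(100_000)]
sample_mean = sum(draws) / len(draws)

print(f"typical value: {mode:.3f}")
print(f"theoretical mean: {triangular_mean(low, mode, high):.3f}")
print(f"simulated mean: {sample_mean:.3f}")
```

Here the simulated players average roughly twice the “typical” rate. If a cluster's observed mean were entered as the typical value alongside a wide maximum, the simulated hitters could outperform the real ones.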
Conclusion
In conclusion, I have used many different analytical tools to help study how to produce an
efficient baseball lineup. I used regression to determine how to best value each baseball event,
clustering to differentiate between player skill sets, optimization to create efficient lineups, and
simulation to see how these lineups fare in a model. While the two main methods have a small amount
of disagreement, the main takeaway is that having a few superstar players on your squad is generally
more efficient than having a roster full of average players.
Appendix
A. List of baseball abbreviations used.
H = Hits. The number of hits recorded by the offense.
D = Doubles. The number of doubles recorded by the offense.
T = Triples. The number of triples recorded by the offense.
HR = Home Runs. The number of home runs recorded by the offense.
SH = Sacrifice Hits. The number of sacrifice hits performed by the offense.
SF = Sacrifice Flies. The number of sacrifice flies performed by the offense.
SB = Stolen Bases. The number of successful attempts at stealing a base.
CS = Caught Stealing. The number of unsuccessful attempts at stealing a base.
GIDP = Grounding into a Double Play. The number of double plays recorded by the defense.
SO = Strike Out. The number of strike outs recorded by the opposing pitchers.
BB = Base on Balls [Also known as Walks]. The number of walks issued by the opposing pitchers.
HBP = Hit by Pitch. The number of batters hit by the opposing pitchers.
Errors. The number of errors committed by the opposing defense.
Wild = Wild Pitches. The number of wild pitches thrown by the opposing pitchers.
Passed = Passed Balls. The number of passed balls allowed by the opposing catcher.
Balks. The number of balks committed by the opposing pitchers.
CatchInt = Catcher’s Interference. The number of times a batter was awarded first base due to
interference.
HomeTeamV = A dummy variable indicating whether a team is visiting or home. 0 = Home, 1 = Visiting.
AVG = Batting Average. Hits / At Bats.
OBP = On Base Percentage. Total times on base / Total number of plate appearances.
SLG = Slugging Average [More commonly called Slugging Percentage]. A weighted average of bases per
at bat.
OPS = On base Plus Slugging. A general term used to convey overall hitting ability.
B. List of statistical abbreviations used.
MSE = Mean Squared Error
MAE = Mean Absolute Error
C. Regression coefficients for linear model, along with p-values.

More Related Content

PPT
Algebra unit 9.2
PPT
Algebra unit 9.3
PPT
Chapter 7 – Confidence Intervals And Sample Size
PDF
ELEMENTS OF STATISTICS / TUTORIALOUTLET DOT COM
PDF
You've Been Doing Statistics All Along
PPTX
Sec 3.1 measures of center
PPTX
Confidence Level and Sample Size
PDF
Quantitative Methods for Lawyers - Class #15 - R Boot Camp - Part 2 - Profess...
Algebra unit 9.2
Algebra unit 9.3
Chapter 7 – Confidence Intervals And Sample Size
ELEMENTS OF STATISTICS / TUTORIALOUTLET DOT COM
You've Been Doing Statistics All Along
Sec 3.1 measures of center
Confidence Level and Sample Size
Quantitative Methods for Lawyers - Class #15 - R Boot Camp - Part 2 - Profess...

What's hot (20)

DOCX
Mth 245 lesson 17 notes sampling distributions sam
PPTX
(7) Lesson 10.3
PPTX
Pengenalan Ekonometrika
ODP
QT1 - 07 - Estimation
PPTX
Chap009
PPTX
A.6 confidence intervals
PPTX
Normal as Approximation to Binomial
PPT
Chapter 7 – Confidence Intervals And Sample Size
PPTX
Two Means, Two Dependent Samples, Matched Pairs
PPT
Confidence intervals
DOCX
Statistik Chapter 6
PDF
Types of Probability Distributions - Statistics II
PPTX
05 confidence interval & probability statements
PPTX
Real Applications of Normal Distributions
PPTX
3by9on w week_6
PPT
Normal distribution
PPTX
M1 regression metrics_middleschool
PPTX
What is a single sample t test?
PPTX
Central tendency
Mth 245 lesson 17 notes sampling distributions sam
(7) Lesson 10.3
Pengenalan Ekonometrika
QT1 - 07 - Estimation
Chap009
A.6 confidence intervals
Normal as Approximation to Binomial
Chapter 7 – Confidence Intervals And Sample Size
Two Means, Two Dependent Samples, Matched Pairs
Confidence intervals
Statistik Chapter 6
Types of Probability Distributions - Statistics II
05 confidence interval & probability statements
Real Applications of Normal Distributions
3by9on w week_6
Normal distribution
M1 regression metrics_middleschool
What is a single sample t test?
Central tendency
Ad

Viewers also liked (20)

PDF
FUNCIONES
PPTX
Velocidad 4 g lte
PDF
PROTFOLIO VM
PDF
Memorias del programa en capacitación ciudadana y control social: En la garan...
PPTX
Marcos regulatorios
PDF
Sesion 5 compu i-google drive
PDF
Sesion 5 compu i-google drive
PDF
MFL Academy - Tutorial MercadoPago
PPTX
Planeación tributaria unidad 2
PDF
#MFLAcademy Cursos y Talleres Vacacionales 2016
PDF
1. MsC diploma and supplement original and translation
PDF
MFL Academy | Oferta Académica
DOCX
Excerpt of Student Comments
PPTX
Escuela secundaria técnica no
PPTX
Planificacion y desarrollo
PDF
2.7. alcances-y-limitaciones-del-profesional-informático-en-las-organizaciones
PDF
Unidad 6 memorando de planeación
PDF
Subneting -
PPTX
Unidad 1 Admo Empresas Familiares
PDF
La estructura de la célula
FUNCIONES
Velocidad 4 g lte
PROTFOLIO VM
Memorias del programa en capacitación ciudadana y control social: En la garan...
Marcos regulatorios
Sesion 5 compu i-google drive
Sesion 5 compu i-google drive
MFL Academy - Tutorial MercadoPago
Planeación tributaria unidad 2
#MFLAcademy Cursos y Talleres Vacacionales 2016
1. MsC diploma and supplement original and translation
MFL Academy | Oferta Académica
Excerpt of Student Comments
Escuela secundaria técnica no
Planificacion y desarrollo
2.7. alcances-y-limitaciones-del-profesional-informático-en-las-organizaciones
Unidad 6 memorando de planeación
Subneting -
Unidad 1 Admo Empresas Familiares
La estructura de la célula
Ad

Similar to Capstone Project - Nicholas Imholte - Final Draft (20)

DOCX
Identifying Key Factors in Winning MLB Games Using a Data-Mining Approach
PPTX
Data Visualization and Clustering of Players in Major League Baseball
PPTX
Clustering of Players in Major League Baseball
PPTX
Py con2020
PDF
Stats111Final
PPTX
Data mining for baseball new ppt
DOCX
Final Research Paper
DOCX
Predicting Salary for MLB Players
DOC
Multi Criteria Selection of All-Star Pitching Staff
DOCX
Econometrics Paper
PPTX
The Year of the Pitcher: Analyzing No-Hitters
DOCX
Statistical Model Report
DOCX
Statistical Model Report
DOCX
What Innings Determine Total Wins
DOCX
Final+draft
PPT
Web Quest Baseball
PPTX
Loras College 2014 Business Analytics Symposium | Dan Conway: Sports Analytics
PDF
CLanctot_DSlavin_JMiron_Stats415_Project
PDF
Machine Learning Based Selection of Optimal Sports team based on the Players ...
PPTX
Why Does a Team Outperform its Run Differential?
Identifying Key Factors in Winning MLB Games Using a Data-Mining Approach
Data Visualization and Clustering of Players in Major League Baseball
Clustering of Players in Major League Baseball
Py con2020
Stats111Final
Data mining for baseball new ppt
Final Research Paper
Predicting Salary for MLB Players
Multi Criteria Selection of All-Star Pitching Staff
Econometrics Paper
The Year of the Pitcher: Analyzing No-Hitters
Statistical Model Report
Statistical Model Report
What Innings Determine Total Wins
Final+draft
Web Quest Baseball
Loras College 2014 Business Analytics Symposium | Dan Conway: Sports Analytics
CLanctot_DSlavin_JMiron_Stats415_Project
Machine Learning Based Selection of Optimal Sports team based on the Players ...
Why Does a Team Outperform its Run Differential?

Capstone Project - Nicholas Imholte - Final Draft

  • 1. 1 University of Cincinnati Capstone Project Summer 2016 Optimizing a baseball lineup: Getting the most bang for your buck Nicholas Imholte First Reader: Dr. Michael Magazine Second Reader: Dr. Yichen Qin
  • 2. 2 Table of Contents 1. Abstract Page 2 2. Introduction Page 3 3. Visualizations and Regressions Pages 3-7 4. Clustering Pages 8-10 5. Optimization Pages 10-11 6. Simulation Pages 11-13 7. Conclusion Page 13 8. Appendix Pages 13-14 Abstract: Given a fixed payroll, and focusing purely on the offensive side of the ball, how should a baseball team assign its funds to give itself the highest average number of runs possible? In this essay, I will attempt to answer this question using regression, clustering, optimization, and simulation. First, I will use regression to model baseball scores, with the goal being to determine how each event in a baseball game impacts how many runs a team scores. Second, I will use clustering to determine what kinds of hitters there are, and how much each type of hitter costs. Third, I will use optimization to determine the optimal arrangement of hitter clusters for a variety of payrolls. Finally, I will complement this analysis with a simulation, and see how the results from the two approaches compare.
  • 3. 3 Introduction Everybusinessinthe worldhasto at some pointaskthe question“Whatis the bestway for me to deploymyresources?”Sportingcompetitionsare nodifferent,andthoughteamshave alarge amountof moneytospend,the playerstheyneedtopurchase are alsoexpensive.Therefore,anatural questiontoaskis “WhichplayersshouldIhire?”Inthisessay,I will be restrictingmyattentionto baseball andoffensivelineups.Mygoal will be to determinewhatkindsof playersare mostcost efficient,andexactlyhowyoushoulddesignyourline uptomaximizeyourrunpotential asthisisthe ultimate measure of teamoffense.Althoughpitchinganddefenseare of course importanttopics,they will notbe consideredinthisanalysis. To answerthisquestion, Iwill employafive stepprocedure.First, Iwill lookatthirty-fiveyears of regularseasonbaseball datafrom Retrosheet. I will create variousmodelsusinglinear and multinomialregression,regressionandclassificationtrees,and Poisson regression.Thiswillallow me to determine the intrinsicvalue of eacheventinabaseball game. Inorderto determinethe bestmodel, I will use MeanSquare Error (MSE) on a testingdataset.Next, Iwill lookatten yearsof playerdata from Baseballreference todetermineplayertypes. Iwill use variousclusteringmethodsandmetricsinorder to groupplayersintoa fewcategories accordingtotheirhittingabilities.Iwill thenuse salary datafrom USAToday in orderto determine howmuchplayersfromeachclustercostonaverage.Next, Iwill use solvertooptimize alineup basedon budgetandplayerconstraints.Finally, Iwill use asimulationin Arenaand compare these results. Visualizations and Regressions To begin, Istartedwithregularseasongame data fromthe past 35 years. 
The response variable isruns scored,and the predictorvariablesare splitintothree categories:Offensive,Defensive,and Location.On the offensive sideIincludedHits,Doubles,Triples,Home Runs,Sacrifice FliesandHits, Walksand Hit By Pitches,Strike Outs,StolenBasesandCaughtStealing,andfinallyGroundinginto Double Plays.Onthe defensiveside Iincludednumberof pitchersfaced,Balks,PassedBallsandWild Pitches,CatchersInterference,andErrors.Lastly,I includedwhetherateamwashome or visiting. Althoughanoffense cannoteffectdefensiveorlocationstatistics,includingthemallowedme toaccount for noise factorsandthus recorda more precise measurementof offensive impact. Beforebuildingthe regressionmodels,Icreated visualizationsinTableau,demonstratingthe nature of the Runsscored variable andhowitrelatesto some of the significantpredictors. First,here isahistogramof runs.
  • 4. 4 Visualization 1: Histogramof RunsScored. The most commonscore is 3 runs,and around90% of scoresare withinthe 0 to 8 range. Next,here isa table showingthe 6-numbersummaryof runsscored. Visualization 2: 6-numbersummary of Runs. The mean isslightlylargerthanthe median,indicatingarightskeweddataset,andwhile there are afew outliers,the Interquartile range isarelativelycompact2to 6. Finally,here isachart indicatingthe relationshipbetweenhitsandhomeruns,andtheircombinedeffectonruns.
  • 5. 5 Visualization 3: Hits and HomerunsvsRuns. The two axesare hitsand home runs,and the dots are coloredby average numberof runsscoredin such games.Redisbelowaverage,greyisaverage,andblackisabove average. Now thatI have a feel for the data, I can move on to buildingregressionmodels.Tobeginwith,I splitthe datainto90% training and 10% testing,and builtfive prescriptivemodels:Linear,Multinomial,RegressionTree,Classification Tree,and Poisson.Here isatable showingthe resultof these models. Icomparedthe 5 modelsusing meansquaredandabsolute value error,onbothtrainingandtesting. Training Testing Model MSE MAE MSE MAE Linear 2.25 1.17 2.26 1.17 Multinomial 2.32 1.12 2.32 1.12 Tree 3.06 1.35 3.1 1.36 ClassificationTree 29.96 4.51 30.19 4.5 Poisson 17.2 3.25 17.35 3.24 Table 1: Comparison of models Clearlythe bestperformingmodelsare the LinearandMultinomial regressions.Whatisinterestingis that the Linearmodel hasslightlybetterMSEbutslightlyworse MAE.Thismeansthat multinomial has a fewbetterpredictionsoverall,butafewthatare off by largeramounts.However,the difference betweenthe twoisminimal enoughtonotbe of consequence.Further,since the Linearmodel issimpler and more interpretable,Idecidedtomove forwardwiththe LinearRegressionmodel. The linearmodel wascreatedusinganautomaticselection,stepwise procedure,withAICasthe definingmetric. See Appendix Cforthe regressioncoefficientsandAppendix A forabbreviation definitions. We cangleanquite a fewpiecesof informationfromthistable.Firstof all,almostall of the variablesare significant.The onlyvariablenotselectedtobe inthe model ispitchersfaced.AtfirstIwas
  • 6. 6 surprisedbythisresult,astypicallyif ateamis forcedtouse a lotof pitchers,itmeansthe starterwas pulledearly,usuallybecause he allowedalotof runs.But on the otherhand,a lotof pitcherscanbe meanthat the game wentdeepintoextrainnings,usuallybecauseof lackof offense. Second,note that we can use the coefficientstodetermine atwhatrate a team mustbe successfullyatstolenbasesfor themto be worthwhile.Supposeateamis successful atstealing pof the time.Thenfora steal attempt to be worthwhile,pmustsatisfy: p*(.0331) + (1-p)*(-.2964) > 0. Solvingforp,we get p > .8995. Thus,a team mustbe successful atleast90% of the time for an attempt to be worthwhile onaverage.Aswe shall latersee,noplayergroupmanagesthislevelof success. Finally,note thatall of the coefficients have the signyouwouldexpectthemtohave.All of the events typicallyseenas“positive”have positivecoefficients,while all of the “negative”eventshave negative coefficients.Note thatsacrifice hits hasanegative coefficient.Thissuggests thatsacrificingoutsfor basestendstoresultina lossof runs. Havingestablishedoutlinearmodels,we canmove todiagnosticstodetermineif the assumptionsof linearregressionhold,anddetermine the goodnessof fit.We startwitha plot of residuals vsfittedvaluestoassesslinearityandhomoscedasticity.Ideallywe wouldlike the graphtobe random,withno obviouspatterns,andresidualsspreadevenlybetweenpositiveandnegative. Graph 1: Residualplot of Linear Model Everythingseemsto checkoutwiththisgraph.Althoughthere are some large residuals,thisistobe expectedwithapopulationof nearly150,000. Next,we move tothe Q-Qplot to assessthe normalityof the residuals.Ideallywe wouldliketosee all the residualsfall ona 45 degree line.
  • 7. 7 Graph 2: Q-Qplot of Linear Model. Most of the residualslie onthe 45 degree line, soeventhough thereare afew deviationsatthe end,we can assume the residualsare normallydistributed. Finally,we canlookata plotof Cook’sDistance. These valuesmeasure influence of observationsonthe model. Graph 3: Cook’sDistance Althoughobservation1489 doesstandout on thisplot,itstill hasan extremelysmall Cook’sDistance,so there isno reasonto be concerned,especiallyconsideringthe sample size of close to150,000. At this pointwe conclude thatour model sufficientlyexplainsthe variationinthe data,and move on to player clustering.
  • 8. Clustering
Now that we have a model for how each event impacts run scoring, our next goal is to group hitters into a few distinct clusters. To do this, I first gathered 10 years of hitting data from Baseballreference.com. I restricted my analysis to rate statistics (for instance, batting average instead of hits) to account for the disparity in plate appearances between players. I further restricted my data by only considering players with at least 100 plate appearances in a given year. Further, I needed to be sure that I had salary data from USA Today for each observation. Taking all of this into consideration, I ended up with 3,626 observations, which is more than enough to perform clustering analysis.
Next, I had to determine which variables to consider for the analysis. I chose only attributes that were both significant, according to my linear model, and definitively under the control of an individual player. I thus chose the following statistics in their corresponding rate form: hits, doubles, triples, home runs, strike outs, grounding into double plays, walks and hit by pitches. I did not consider defensive statistics, as an offensive player has no control over, say, when an opposing pitcher throws a wild pitch. I also did not include either kind of sacrifice, as a player has no control over when he will come to bat in a situation where one can occur.
The final question is whether to consider stolen bases, caught stealing, stolen base percentage, or none of the above. As indicated above, stolen base attempts are worthwhile only if they succeed at least 90% of the time. After some experimentation with clustering, I found that no group ever comes close to that kind of success rate on a large scale. Therefore, for the purposes of optimizing runs, I will assume that my theoretical team will not attempt to steal.
Having determined the observations and the variables, the next step in clustering is to determine the number of clusters. Since there is no one way to determine this, I went through a number of different methods and compared their results.
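As an illustration of one way to compare cluster counts, the sketch below runs a bare-bones k-means for several values of k and records the within-cluster sum of squares for each. It is a minimal stand-in, not the tooling used in the analysis: the six (hit rate, home run rate) pairs are invented, and the function names are hypothetical.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of tuples."""
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))


def kmeans_wss(points, k, iterations=50, seed=0):
    """Run Lloyd's k-means and return the within-cluster sum of squares."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # start from k distinct observations
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centers[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous center.
        new_centers = [mean_point(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        if new_centers == centers:
            break
        centers = new_centers
    return sum(min(squared_distance(p, c) for c in centers) for p in points)


# Invented (hit rate, home run rate) pairs, for illustration only.
points = [(0.30, 0.02), (0.31, 0.03), (0.29, 0.02),
          (0.22, 0.06), (0.23, 0.07), (0.21, 0.06)]

wss_by_k = {k: kmeans_wss(points, k) for k in range(1, 4)}
for k, wss in wss_by_k.items():
    print(k, round(wss, 5))
```

Plotting these values against k and looking for the point where the curve flattens is the usual "elbow" reading; on real data the variables would normally be standardized first so that no single rate dominates the distance.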
  • 9. Table 2: Cluster goodness of fit metrics. Top Left: Within Sum of Squares. Top Right: Silhouette Index. Bottom Left: Dunn Index. Bottom Right: Dendrogram using Ward’s Method.
The first graph is a within group sum of squares (WSS) plot. The idea is to look for an “elbow”, where adding extra groups does not significantly reduce the WSS. This plot has no obvious elbow, but seems to taper off between 5 and 10. The second graph, called the Silhouette Index, measures the likeness of an object to its own cluster vs other clusters. Here, spikes are desirable, and thus this metric suggests either 2 or 5 clusters. The third graph, called the Dunn Index, measures compactness of clusters and distance to other clusters. Again, spikes in the graph are desirable, thus this metric suggests 6, 8, or 10 clusters. Finally, the fourth graph takes a different approach to clustering, using hierarchical clustering with Ward’s Method. In this case, we are looking to balance a small height with a small number of clusters. We can see that at a height of 200 there are seven clusters, while at a height of 250 there are only five. Putting all of this analysis together, I decided that five was an appropriate number of clusters to consider.
I used a k-means algorithm with 5 centers to create the final clustering. Here is a summary of the cluster statistics, along with average salary.
Table 3: Player clustering final results. Salary and mean hitting statistics.
  • 10. Having done this, I looked for patterns in this table to see if I could determine a convenient label for each group. To begin with, player type 3 has the best batting average and home run rate, along with the best double and walk rate. Clearly, group 3 has the best players overall, along with the steepest price tag. Thus, I labeled this group “Premium” players. Group 5 has a low batting average, a high home run rate, and the worst strike out rate. Therefore, I labeled this group “Power”. Group 4 has the highest triple rate and the smallest double play rate. Thus, I labeled this group “Speed”. Group 2 has the second best hit rate, the best strike out rate, and a small home run rate. Hence, I labeled this group “Contact”. Group 1 is somewhat in the middle on every statistic, so I labeled this group “Average”. Do not be confused by the word average, as we are not talking about it in the sense of batting average, but rather in the sense of mean or middle class.
Optimization
Now that we have both a model for how events impact scoring and a clustering of player types along with cost, we can combine these two into an optimization model. To do this, first we assign each player type a “value”, which is simply their rate of producing each event multiplied by that event’s coefficient from the linear regression portion, summed over events. For example, here is how the value of a Premium player is calculated.
Here is a summary of player type, along with value and cost.
Table 4: Player summary.
Now that we have values assigned to each player type, there is one more question we must address before beginning the optimization. Namely: “How many hitters do I need?”. The answer to this question depends on which league you are in. In the American League, pitchers do not bat, and thus we need 9 hitters. However, in the National League, the pitcher does bat, and thus we really only have control over 8 of the hitters. Fortunately, the pitcher is generally a very poor hitter, so including him in the lineup should only have the effect of reducing the number of hitters by one. Other than that, the optimization process is the same.
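The setup above, a value and a cost per player type plus a fixed number of lineup slots, can be sketched as a small integer program. The example below solves it by brute-force enumeration. This is a stand-in for illustration only, not the spreadsheet optimization used in the analysis; every value and salary in `PLAYER_TYPES` is invented, and the function names are hypothetical.

```python
from itertools import product


def player_value(event_rates, coefficients):
    """Sum over events of (rate of the event) * (regression coefficient)."""
    return sum(rate * coefficients[event] for event, rate in event_rates.items())


# name: (value, salary in $ millions). All numbers are made up for this sketch.
PLAYER_TYPES = {
    "Premium": (0.160, 8.0),
    "Power": (0.120, 4.0),
    "Speed": (0.110, 1.0),
    "Contact": (0.125, 5.0),
    "Average": (0.130, 3.0),
}


def best_lineup(budget, hitters=9):
    """Return (total value, total cost, counts per type), or None if unaffordable."""
    names = list(PLAYER_TYPES)
    best = None
    # Enumerate every way to assign 0..hitters players to each type.
    for counts in product(range(hitters + 1), repeat=len(names)):
        if sum(counts) != hitters:
            continue
        cost = sum(n * PLAYER_TYPES[name][1] for name, n in zip(names, counts))
        if cost > budget:
            continue
        value = sum(n * PLAYER_TYPES[name][0] for name, n in zip(names, counts))
        if best is None or value > best[0]:
            best = (value, cost, dict(zip(names, counts)))
    return best


for budget in (9.0, 30.0, 72.0):
    print(budget, best_lineup(budget))
```

Calling `best_lineup(budget, hitters=8)` gives the corresponding eight-hitter problem for a league in which the pitcher bats. Enumeration is practical here only because there are five types and nine slots; a larger problem would call for a proper integer programming solver.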
I used Solver to optimize a lineup consisting of 9 hitters, subject to the constraint of various budgets. Here is a table summarizing my findings, with the total budget presented in millions.
  • 11. Table 5: Optimization results using Solver. Columns indicate the optimal number of hitters to use.
First, note that once you have maxed out on premium hitters, there is no point in going any further, as the lineup cannot get any better. On the other end, you cannot go any cheaper than all speed hitters, so if you cannot afford them, you have no team. Second, note that power and contact hitters are not chosen for any budget. Contact hitters are too expensive, while power hitters are not effective enough. Finally, in both leagues the general strategy is to hire as many premium hitters as possible, then fill out the remainder with average, and finally speed hitters.
This concludes the regression portion of my analysis. Although this gives a broad picture of effectiveness, the main critique of this approach is that it assumes the value of each event is independent of when it occurs. This is of course not true, as a double with a runner on 2nd is going to be more effective than a double with no one on base. A natural way to continue would be to run a simulation, using the 5 player types as inputs.
Simulation
In order to perform a simulation, I created a baseball model in Arena. It works by creating a predetermined number of players of each player type, using triangle distributions for each statistic, and then uses a random number generator to determine the outcome of each plate appearance. A simple flow chart then models the progress of base runners, runs scored, and outs recorded. In creating the model, I made the following assumptions:
1. Each hit advances all base runners the same number of positions as the batter. So for instance, a single always moves every base runner exactly one base.
2. Strikeouts and double plays never advance base runners.
  • 12.
3. All other outs advance base runners exactly one position.
I used OptQuest to determine the optimal number of each class of hitters, using 300 replications per event. Here are the final results for both leagues.
Table 6: Optimization results using OptQuest in Arena.
To get the NL result, I added in a pitcher, using average pitcher statistics for the last 5 years. The two models, Solver and OptQuest, agree on a few points but disagree on others. First, neither model selected a Power hitter at any budget. Second, both models agree you should get as many Premium hitters as you can afford, and fill out the rest of the lineup as best as you can. However, the two methods clearly disagree on the value of Speed hitters vs Average hitters. In the regression approach, Average hitters are slightly more valuable overall, but during simulation, Speed hitters perform better. Here is a more detailed comparison between the two clusters.
Table 7: Detailed comparison of Speed vs Average
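The three base-running assumptions listed in the Simulation section can be sketched as a small state-transition function with a half-inning loop around it. This is a simplified Python illustration, not the Arena model. The event probabilities are invented, and a single shared table replaces per-player triangle distributions. Walks are treated like singles, a double play needs a runner on first with fewer than two outs, and no run scores on the third out; those three rules are assumptions of this sketch and are not stated in the text.

```python
import random

# Bases gained by the batter. Treating a walk ("BB") like a single is an
# assumption of this sketch.
BASES_GAINED = {"1B": 1, "2B": 2, "3B": 3, "HR": 4, "BB": 1}


def advance(bases, n, batter_reaches):
    """Move every runner n bases; return (occupied bases, runs scored)."""
    runners = sorted(bases) + ([0] if batter_reaches else [])
    moved = [b + n for b in runners]
    runs = sum(1 for b in moved if b >= 4)
    return {b for b in moved if b <= 3}, runs


def apply_event(bases, outs, event):
    """Apply one plate appearance; return (bases, outs, runs scored)."""
    if event in BASES_GAINED:  # assumption 1: runners advance with the batter
        new_bases, runs = advance(bases, BASES_GAINED[event], batter_reaches=True)
        return new_bases, outs, runs
    if event == "SO":  # assumption 2: strikeouts never advance runners
        return set(bases), outs + 1, 0
    if event == "GIDP":  # assumption 2: double plays never advance runners
        if 1 in bases and outs < 2:
            return set(bases) - {1}, outs + 2, 0
        return set(bases), outs + 1, 0  # no double play possible (sketch rule)
    outs += 1  # assumption 3: any other out advances runners one base
    if outs >= 3:
        return set(bases), outs, 0  # nothing scores on the third out (sketch rule)
    new_bases, runs = advance(bases, 1, batter_reaches=False)
    return new_bases, outs, runs


def simulate_half_inning(event_probs, rng):
    """Draw plate appearances until three outs; return runs scored."""
    events = list(event_probs)
    weights = [event_probs[e] for e in events]
    bases, outs, runs = set(), 0, 0
    while outs < 3:
        event = rng.choices(events, weights=weights)[0]
        bases, outs, scored = apply_event(bases, outs, event)
        runs += scored
    return runs


def average_runs(event_probs, replications=300, seed=0):
    """Mean runs per half-inning over a number of replications."""
    rng = random.Random(seed)
    total = sum(simulate_half_inning(event_probs, rng) for _ in range(replications))
    return total / replications


# Invented per-plate-appearance probabilities (they sum to 1).
HYPOTHETICAL_PROBS = {"1B": 0.15, "2B": 0.05, "3B": 0.005, "HR": 0.03,
                      "BB": 0.085, "SO": 0.20, "GIDP": 0.02, "OUT": 0.46}

print(average_runs(HYPOTHETICAL_PROBS))
```

Because context now matters (a double with a runner on second scores a run, a double with the bases empty does not), this kind of model can rank player types differently from the linear values, which is the disagreement discussed above.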
  • 13. I believe the difference between the two may lie in the distribution I chose to use in the simulation approach. In most simulations, the triangle distribution is typically used when you have a minimum, a maximum, and a “typical” value. However, in this case, it resulted in the Speed hitters having a larger On Base Percentage plus Slugging Average (OPS) than the Average hitters. This likely caused them to score more runs in simulation. One way to address this might be to try different distributions and see if they give different results.
Conclusion
In conclusion, I have used many different analytical tools to help study how to produce an efficient baseball lineup. I used regression to determine how to best value each baseball event, clustering to differentiate between player skill sets, optimization to create efficient lineups, and simulation to see how these lineups fare in a model. While the two main methods have a small amount of disagreement, the main takeaway is that having a few superstar players on your squad is generally more efficient than having a roster full of average players.
Appendix
A. List of baseball abbreviations used.
H = Hits. The number of hits recorded by the offense.
D = Doubles. The number of doubles recorded by the offense.
T = Triples. The number of triples recorded by the offense.
HR = Home Runs. The number of home runs recorded by the offense.
SH = Sacrifice Hits. The number of sacrifice hits performed by the offense.
SF = Sacrifice Flies. The number of sacrifice flies performed by the offense.
SB = Stolen Bases. The number of successful attempts at stealing a base.
CS = Caught Stealing. The number of unsuccessful attempts at stealing a base.
GIDP = Grounding into a Double Play. The number of double plays recorded by the defense.
SO = Strike Out. The number of strike outs recorded by the opposing pitchers.
BB = Base on Balls [Also known as Walks]. The number of walks issued by the opposing pitchers.
HBP = Hit by Pitch. The number of batters hit by the opposing pitchers.
Errors. The number of errors committed by the opposing defense.
Wild = Wild Pitches. The number of wild pitches thrown by the opposing pitchers.
Passed = Passed Balls. The number of passed balls allowed by the opposing catcher.
Balks. The number of balks committed by the opposing pitchers.
CatchInt = Catcher’s Interference. The number of times a batter was awarded first base due to interference.
HomeTeamV = A dummy variable indicating whether a team is visiting or home. 0 = Home, 1 = Visiting.
AVG = Batting Average. Hits / At Bats.
OBP = On Base Percentage. Total times on base / Total number of plate appearances.
SLG = Slugging Average [More commonly called Slugging Percentage]. A weighted average of bases per at bat.
OPS = On base Plus Slugging. A general term used to convey overall hitting ability.
  • 14. B. List of statistical abbreviations used.
MSE = Mean Squared Error
MAE = Mean Absolute Error
C. Regression coefficients for linear model, along with p-values.