Capstone Project - Nicholas Imholte - Final Draft

University of Cincinnati Capstone Project Summer 2016
Optimizing a baseball lineup:
Getting the most bang for your buck
Nicholas Imholte
First Reader: Dr. Michael Magazine
Second Reader: Dr. Yichen Qin

Table of Contents
1. Abstract
2. Introduction
3. Visualizations and Regressions
4. Clustering
5. Optimization
6. Simulation
7. Conclusion
8. Appendix
Abstract:
Given a fixed payroll, and focusing purely on the offensive side of the ball, how should a
baseball team assign its funds to give itself the highest average number of runs possible? In this
essay, I will attempt to answer this question using regression, clustering, optimization, and
simulation. First, I will use regression to model baseball scores, with the goal being to
determine how each event in a baseball game impacts how many runs a team scores. Second, I
will use clustering to determine what kinds of hitters there are, and how much each type of
hitter costs. Third, I will use optimization to determine the optimal arrangement of hitter
clusters for a variety of payrolls. Finally, I will complement this analysis with a simulation, and
see how the results from the two approaches compare.

Introduction
Every business in the world has to at some point ask the question “What is the best way for me
to deploy my resources?” Sporting competitions are no different, and though teams have a large
amount of money to spend, the players they need to purchase are also expensive. Therefore, a natural
question to ask is “Which players should I hire?” In this essay, I will be restricting my attention to
baseball and offensive lineups. My goal will be to determine what kinds of players are most cost
efficient, and exactly how you should design your lineup to maximize your run potential, as this is the
ultimate measure of team offense. Although pitching and defense are of course important topics, they
will not be considered in this analysis.
To answer this question, I will employ a five-step procedure. First, I will look at thirty-five years
of regular season baseball data from Retrosheet. I will create various models using linear and
multinomial regression, regression and classification trees, and Poisson regression. This will allow me to
determine the intrinsic value of each event in a baseball game. In order to determine the best model, I
will use Mean Squared Error (MSE) on a testing dataset. Next, I will look at ten years of player data from
Baseball Reference to determine player types. I will use various clustering methods and metrics in order
to group players into a few categories according to their hitting abilities. I will then use salary data from
USA Today in order to determine how much players from each cluster cost on average. Next, I will use
Solver to optimize a lineup based on budget and player constraints. Finally, I will run a simulation in
Arena and compare the results of the two approaches.
Visualizations and Regressions
To begin, I started with regular season game data from the past 35 years. The response variable
is runs scored, and the predictor variables are split into three categories: Offensive, Defensive, and
Location. On the offensive side I included Hits, Doubles, Triples, Home Runs, Sacrifice Flies and Hits,
Walks and Hit By Pitches, Strike Outs, Stolen Bases and Caught Stealing, and finally Grounding into
Double Plays. On the defensive side I included number of pitchers faced, Balks, Passed Balls and Wild
Pitches, Catcher's Interference, and Errors. Lastly, I included whether a team was home or visiting.
Although an offense cannot affect defensive or location statistics, including them allowed me to account
for noise factors and thus record a more precise measurement of offensive impact. Before building the
regression models, I created visualizations in Tableau, demonstrating the nature of the Runs scored
variable and how it relates to some of the significant predictors. First, here is a histogram of runs.

Visualization 1: Histogram of Runs Scored.
The most common score is 3 runs, and around 90% of scores are within the 0 to 8 range. Next, here is a
table showing the 6-number summary of runs scored.
Visualization 2: 6-number summary of Runs.
The mean is slightly larger than the median, indicating a right-skewed dataset, and while there are a few
outliers, the interquartile range is a relatively compact 2 to 6. Finally, here is a chart indicating the
relationship between hits and home runs, and their combined effect on runs.

Visualization 3: Hits and Home Runs vs Runs.
The two axes are hits and home runs, and the dots are colored by average number of runs scored in
such games. Red is below average, grey is average, and black is above average. Now that I have a feel for
the data, I can move on to building regression models. To begin with, I split the data into 90% training
and 10% testing, and built five predictive models: Linear, Multinomial, Regression Tree, Classification
Tree, and Poisson. Here is a table showing the results of these models. I compared the five models using
mean squared error and mean absolute error, on both the training and testing data.
                       Training          Testing
Model                  MSE      MAE      MSE      MAE
Linear                 2.25     1.17     2.26     1.17
Multinomial            2.32     1.12     2.32     1.12
Regression Tree        3.06     1.35     3.1      1.36
Classification Tree    29.96    4.51     30.19    4.5
Poisson                17.2     3.25     17.35    3.24
Table 1: Comparison of models
Clearly the best performing models are the Linear and Multinomial regressions. What is interesting is
that the Linear model has slightly better MSE but slightly worse MAE. This means that the Multinomial
model is typically a little closer, but has a few predictions that are off by larger amounts. However, the
difference between the two is minimal enough to not be of consequence. Further, since the Linear model is
simpler and more interpretable, I decided to move forward with the Linear Regression model.
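As a minimal sketch of how one model can win on MSE while another wins on MAE, consider the toy example below. The run totals and predictions are invented for illustration and are not data from this project; the function names `mse` and `mae` are likewise just illustrative.

```python
def mse(actual, predicted):
    """Mean squared error: the average of the squared differences."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def mae(actual, predicted):
    """Mean absolute error: the average of the absolute differences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Invented run totals for five games (illustration only).
actual = [3, 5, 2, 8, 4]
model_a = [4, 4, 3, 7, 5]  # off by one run in every game
model_b = [3, 5, 2, 5, 4]  # exact in four games, off by three runs in one

for name, preds in (("A", model_a), ("B", model_b)):
    print(name, "MSE:", mse(actual, preds), "MAE:", mae(actual, preds))
```

Model A has the lower MSE and model B the lower MAE, because squaring penalizes B's single large miss much more heavily than A's many small ones.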
The linear model was created using an automatic stepwise selection procedure, with AIC as the
defining metric. See Appendix C for the regression coefficients and Appendix A for abbreviation
definitions. We can glean quite a few pieces of information from this table. First of all, almost all of the
variables are significant. The only variable not selected to be in the model is pitchers faced. At first I was
surprised by this result, as typically if a team is forced to use a lot of pitchers, it means the starter was
pulled early, usually because he allowed a lot of runs. But on the other hand, a lot of pitchers can also
mean that the game went deep into extra innings, usually because of a lack of offense. Second, note that
we can use the coefficients to determine at what rate a team must be successful at stolen bases for
them to be worthwhile. Suppose a team is successful at stealing p of the time. Then for a steal attempt
to be worthwhile, p must satisfy:
p*(.0331) + (1-p)*(-.2964) > 0.
Solving for p, we get p > .8995. Thus, a team must be successful at least 90% of the time for an attempt
to be worthwhile on average. As we shall later see, no player group manages this level of success.
Finally, note that all of the coefficients have the sign you would expect them to have. All of the events
typically seen as “positive” have positive coefficients, while all of the “negative” events have negative
coefficients. Note that sacrifice hits has a negative coefficient. This suggests that sacrificing outs for
bases tends to result in a loss of runs.
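The stolen base break-even rate can be checked with a short calculation. Rearranging p*(.0331) + (1-p)*(-.2964) > 0 gives p > .2964 / (.0331 + .2964). The sketch below uses the two coefficients quoted in the text; the function name `break_even_rate` is illustrative only.

```python
def break_even_rate(gain_per_success, loss_per_failure):
    """Success rate p at which p * gain - (1 - p) * loss equals zero."""
    return loss_per_failure / (gain_per_success + loss_per_failure)


SB_COEF = 0.0331   # runs added per stolen base (from the text)
CS_COEF = -0.2964  # runs lost per caught stealing (from the text)

p_star = break_even_rate(SB_COEF, -CS_COEF)
print(round(p_star, 4))  # 0.8995
```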
Having established our linear model, we can move to diagnostics to determine if the
assumptions of linear regression hold, and determine the goodness of fit. We start with a plot of
residuals vs fitted values to assess linearity and homoscedasticity. Ideally we would like the graph to be
random, with no obvious patterns, and residuals spread evenly between positive and negative.
Graph 1: Residual plot of Linear Model
Everything seems to check out with this graph. Although there are some large residuals, this is to be
expected with a population of nearly 150,000. Next, we move to the Q-Q plot to assess the normality of
the residuals. Ideally we would like to see all the residuals fall on a 45 degree line.

Graph 2: Q-Q plot of Linear Model.
Most of the residuals lie on the 45 degree line, so even though there are a few deviations at the ends, we
can assume the residuals are approximately normally distributed. Finally, we can look at a plot of Cook’s
Distance. These values measure the influence of each observation on the model.
Graph 3: Cook’s Distance
Although observation 1489 does stand out on this plot, it still has an extremely small Cook’s Distance, so
there is no reason to be concerned, especially considering the sample size of close to 150,000. At this
point we conclude that our model sufficiently explains the variation in the data, and move on to player
clustering.
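To make the idea of Cook's Distance concrete, here is a minimal sketch for a regression with a single predictor. It is not the multi-predictor model used in this project, and the hits and runs values are invented for illustration. It uses the standard formula D_i = (e_i^2 / (p * s^2)) * (h_i / (1 - h_i)^2), where e_i is the residual, h_i the leverage, p the number of fitted parameters, and s^2 the residual variance estimate.

```python
def cooks_distances(x, y):
    """Cook's distance of each point in a simple regression y = a + b * x."""
    n = len(x)
    p = 2  # fitted parameters: intercept and slope
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    s2 = sum(e ** 2 for e in residuals) / (n - p)  # residual variance estimate
    leverage = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]
    return [
        (e ** 2 / (p * s2)) * (h / (1 - h) ** 2)
        for e, h in zip(residuals, leverage)
    ]


# Invented games: the last one has many hits but few runs.
hits = [4, 6, 7, 8, 9, 10, 11, 18]
runs = [1, 2, 3, 3, 4, 5, 5, 2]

distances = cooks_distances(hits, runs)
most_influential = max(range(len(distances)), key=distances.__getitem__)
print(most_influential, distances[most_influential])
```

In this toy data the last game combines an unusual hit total with a large residual, so it dominates the Cook's Distance values. With only eight points a single game can matter a great deal, whereas with close to 150,000 observations no single game is likely to.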

Clustering
Now that we have a model for how each event impacts run scoring, our next goal is to attempt
to cluster hitters into a few distinct clusters. To do this, I first gathered 10 years of hitting data from
Baseballreference.com. I restricted my analysis to rate statistics (for instance, batting average instead
of hits) to account for the disparity in plate appearances between players. I further restricted my data
by only considering players with at least 100 plate appearances in a given year. Further, I needed to be
sure that I had salary data from USA Today for each observation. Taking all of this into consideration, I
ended up with 3,626 observations, which is more than enough to perform clustering analysis. Next, I had
to determine which variables to consider for the analysis. I chose only attributes that were both
significant, according to my linear model, and definitively under the control of an individual player.
I thus chose the following statistics in their corresponding rate form: hits, doubles, triples, home runs,
strike outs, grounding into double plays, walks and hit by pitches. I did not consider defensive statistics,
as an offensive player has no control over, say, when an opposing pitcher throws a wild pitch. I also did
not include either kind of sacrifice, as a player has no control over when he will come to bat in a
situation when they can occur. The final question is whether or not to consider stolen bases, caught
stealing, stolen base percentage, or none of the above. As indicated above, stolen bases are positive
events, but only if they are successful at least 90% of the time. After some experimentation with
clustering, I found that no group ever comes close to that kind of success rate on a large scale.
Therefore, for the purposes of optimizing runs, I will assume that my theoretical team will not attempt
to steal. Having determined the observations and the variables, the next step in clustering is to
determine the number of clusters. Since there is no one way to determine this, I went through a number
of different methods and compared their results.

Table 2: Cluster goodness of fit metrics. Top Left: Within Sum of Squares. Top Right: Silhouette Index.
Bottom Left: Dunn Index. Bottom Right: Dendrogram using Ward’s Method.
The first graph is a within-group sum of squares (WSS) plot. The idea is to look for an “elbow”, where
adding extra groups does not significantly reduce the WSS. This plot has no obvious elbow, but seems to
taper off between 5 and 10. The second graph, called the Silhouette Index, measures the likeness of an
object to its cluster vs other clusters. Here, spikes are desirable, and thus this metric suggests either 2 or
5 clusters. The third graph, called the Dunn Index, measures compactness of clusters and distance to
other clusters. Again, spikes in the graph are desirable, thus this metric suggests 6, 8, or 10 clusters.
Finally, the fourth graph takes a different approach to clustering, using hierarchical clustering with Ward’s
Method. In this case, we are looking to balance a small height with a small number of clusters. We can
see that at a height of 200 there are seven clusters, while at a height of 250 there are only five. Putting
all of this analysis together, I decided that five was an appropriate number of clusters to consider. I used
a k-means algorithm with 5 centers to create the final clustering. Here is a summary of the cluster
statistics, along with average salary.
Table 3: Player clustering final results. Salary and mean hitting statistics.
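The within-group sum of squares idea can be sketched with a small k-means (Lloyd's algorithm) written from scratch. This is a toy illustration only: the nine hitters below are invented (batting average, home run rate) pairs, the deterministic choice of starting centers is a simplification, and the names `kmeans` and `sq_dist` are not taken from the software used in this project.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm; returns (centers, within-group sum of squares)."""
    # Deterministic start: take initial centers spread evenly through the input.
    centers = [points[i * len(points) // k] for i in range(k)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for pt in points:
            nearest = min(range(k), key=lambda j: sq_dist(pt, centers[j]))
            groups[nearest].append(pt)
        # Move each center to the mean of its group (keep it if the group is empty).
        centers = [
            tuple(sum(c) / len(g) for c in zip(*g)) if g else centers[j]
            for j, g in enumerate(groups)
        ]
    wss = sum(min(sq_dist(pt, c) for c in centers) for pt in points)
    return centers, wss


# Invented (batting average, home run rate) pairs forming three loose groups.
hitters = [
    (0.30, 0.05), (0.31, 0.06), (0.29, 0.05),
    (0.25, 0.02), (0.26, 0.02), (0.24, 0.03),
    (0.22, 0.06), (0.21, 0.07), (0.23, 0.06),
]

wss_by_k = {k: kmeans(hitters, k)[1] for k in range(1, 5)}
for k, wss in wss_by_k.items():
    print(k, wss)
```

In this toy data the WSS falls sharply up to three clusters and changes little afterwards, which is the "elbow" pattern described above. The real data showed no such obvious elbow, which is why several metrics were compared.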

Having done this, I looked for patterns in this table to see if I could determine a convenient label for
each group. To begin with, player type 3 has the best batting average and home run rate, along with the
best double and walk rate. Clearly, group 3 has the best players overall, along with the steepest price
tag. Thus, I labeled this group “Premium” players. Group 5 has low batting average, high home run rate,
and the worst strike out rate. Therefore, I labeled this group “Power”. Group 4 has the highest triple rate
and the smallest double play rate. Thus, I labeled this group “Speed”. Group 2 has the second best hit
rate, the best strike out rate, and a small home run rate. Hence, I labeled this group “Contact”. Group 1
is somewhat in the middle on every statistic, so I labeled this group “Average”. Do not be confused by
the word average, as we are not talking about it in the sense of batting average, but rather in the sense
of mean or middle class.
Optimization
Now that we have both a model for how events impact scoring and a clustering of player types
along with cost, we can combine these two into an optimization model. To do this, first we assign each
player type a “value”, which is simply the sum, over all events, of the player's rate of causing each event
times the linear coefficient we determined in the linear regression portion. For example, here is how the
value of a Premium player is calculated.
Here is a summary of player type, along with value and cost.
Table 4: Player summary.
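The value calculation described above can be sketched as a rate-weighted sum of regression coefficients. Every number below is a hypothetical placeholder chosen for easy arithmetic: the actual coefficients are those of Appendix C and the actual rates are those of Table 3, neither of which is reproduced here. The function name `player_value` is also illustrative.

```python
# Hypothetical regression coefficients (runs per event), NOT the fitted values.
coefficients = {"H": 0.5, "D": 0.25, "HR": 0.9, "BB": 0.3, "SO": -0.1}

# Hypothetical per-plate-appearance event rates for one player type.
example_rates = {"H": 0.28, "D": 0.06, "HR": 0.05, "BB": 0.10, "SO": 0.20}


def player_value(rates, coefs):
    """Sum over events of (rate of the event) * (runs attributed to the event)."""
    return sum(rate * coefs[event] for event, rate in rates.items())


value = player_value(example_rates, coefficients)
print(round(value, 3))  # 0.21
```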
Now that we have values assigned to each player type, there is one more question we must address
before beginning the optimization. Namely: “How many hitters do I need?” The answer to this question
depends on which league you are in. In the American League, pitchers do not bat, and thus we need 9
hitters. However, in the National League, the pitcher does bat, and thus we really only have control over
8 of the hitters. Fortunately, the pitcher is generally a very poor hitter, so including him in the lineup
should only have the effect of reducing the number of hitters by one. Other than that, the optimization
process is the same. I used Solver to optimize a lineup consisting of 9 hitters, subject to the constraint of
various budgets. Here is a table summarizing my findings, with the total budget presented in millions.

Table 5: Optimization results using Solver. Columns indicate the optimal number of hitters to use.
First, note that once you have maxed out on premium hitters, there is no point in going any further, as it
cannot get any better. On the other end, you cannot go any cheaper than all speed hitters, so if you
cannot afford them, you have no team. Second, note that power and contact hitters are not chosen for
any budget. Contact hitters are too expensive, while power hitters are not effective enough. Finally, in
both leagues the general strategy is to hire as many premium hitters as possible, then fill out the
remainder with average, and finally speed hitters.
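The lineup optimization is a small integer program: choose how many hitters of each type to hire so that total value is maximized, the lineup has 9 hitters, and total cost stays within budget. The sketch below solves it by brute-force enumeration rather than with Solver, which is feasible because there are only five types. The values and costs are hypothetical placeholders, not the figures from Table 4, and `best_lineup` is an illustrative name.

```python
from itertools import product

# Hypothetical (value, cost in $ millions) per player type, NOT Table 4.
TYPES = {
    "Premium": (0.25, 10.0),
    "Contact": (0.20, 6.0),
    "Average": (0.19, 3.0),
    "Power": (0.17, 3.5),
    "Speed": (0.18, 2.0),
}


def best_lineup(budget, hitters=9):
    """Return (total value, counts per type) of the best affordable lineup, or None."""
    names = list(TYPES)
    best = None
    for counts in product(range(hitters + 1), repeat=len(names)):
        if sum(counts) != hitters:
            continue  # lineup must have exactly the required number of hitters
        cost = sum(n * TYPES[t][1] for n, t in zip(counts, names))
        if cost > budget:
            continue  # over budget
        value = sum(n * TYPES[t][0] for n, t in zip(counts, names))
        if best is None or value > best[0]:
            best = (value, dict(zip(names, counts)))
    return best


for budget in (17, 18, 40, 90):
    print(budget, best_lineup(budget))
```

With these placeholder numbers the same boundary behaviour appears as in the text: below the cost of nine of the cheapest hitters there is no feasible team, and once nine Premium hitters are affordable a larger budget changes nothing.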
This concludes the regression portion of my analysis. Although this gives a broad picture of
effectiveness, the main critique of this approach is that it assumes the value of each event is
independent of when it occurs. This is of course not true, as a double with a runner on 2nd is going to be
more effective than a double with no one on base. A natural way to continue would be to run a
simulation, using the 5 player types as inputs.
Simulation
In order to perform a simulation, I created a baseball model in Arena. It works by creating a
predetermined number of players of each player type, using triangle distributions for each statistic, and
then uses a random number generator to determine outcomes of each plate appearance. A simple flow
chart then models the progress of base runners, runs scored, and outs recorded. In creating the model, I
made the following assumptions:
1. Each hit advances all base runners the same number of positions as the batter. So for instance, a
single always moves every base runner exactly one base.
2. Strikeouts and double plays never advance base runners.
3. All other outs advance base runners exactly one position.
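The base-running assumptions above can be sketched in a few lines of Python. This is not the Arena model: the outcome probabilities are hypothetical, walks and double plays are left out for brevity, runners do not advance on the third out, and the names `advance`, `simulate_inning`, `HIT_BASES`, and `PROBS` are illustrative.

```python
import random

HIT_BASES = {"single": 1, "double": 2, "triple": 3, "home_run": 4}

# Hypothetical outcome probabilities; the remainder is an ordinary out.
PROBS = {"single": 0.16, "double": 0.05, "triple": 0.005,
         "home_run": 0.03, "strikeout": 0.20}


def advance(runners, n, batter_reaches=False):
    """Move every runner n bases; return (runners left on base, runs scored)."""
    moved = [r + n for r in runners]
    if batter_reaches:
        moved.append(n)  # the batter ends up on base n (4 means he scores)
    runs = sum(1 for r in moved if r >= 4)
    return [r for r in moved if r < 4], runs


def simulate_inning(rng, probs):
    """Simulate one half-inning and return the number of runs scored."""
    runners, outs, runs = [], 0, 0
    while outs < 3:
        u = rng.random()
        outcome = "out"
        for name, p in probs.items():
            if u < p:
                outcome = name
                break
            u -= p
        if outcome in HIT_BASES:
            # Assumption 1: runners move as many bases as the batter.
            runners, scored = advance(runners, HIT_BASES[outcome], True)
            runs += scored
        elif outcome == "strikeout":
            outs += 1  # Assumption 2: runners hold on a strikeout.
        else:
            outs += 1
            if outs < 3:
                # Assumption 3: any other out moves runners up one base.
                runners, scored = advance(runners, 1)
                runs += scored
    return runs


rng = random.Random(2016)
innings = [simulate_inning(rng, PROBS) for _ in range(9000)]
print("average runs per 9 innings:", 9 * sum(innings) / len(innings))
```

Unlike the regression-based value, this kind of model lets the same event score a different number of runs depending on who is already on base.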
I used OptQuest to determine the optimal number of each class of hitters, using 300 replications per
event. Here are the final results for both leagues.
Table 6: Optimization results using OptQuest in Arena.
To get the NL result, I added in a pitcher, using average pitcher statistics for the last 5 years. The two
models, Solver and OptQuest, agree on a few points but disagree on others. First, neither model selected
a Power hitter at any budget. Second, both models agree you should get as many Premium hitters as
you can afford, and fill out the rest of the lineup as best as you can. However, the two methods clearly
disagree on the value of Speed hitters vs Average hitters. In the regression approach, Average hitters are
slightly more valuable overall, but during simulation, Speed hitters perform better. Here is a more
detailed comparison between the two clusters.
Table 7: Detailed comparison of Speed vs Average

I believe the difference between the two may lie in the distribution I chose to use in the simulation
approach. In most simulations, the triangle distribution is typically used when you have a minimum, a
maximum, and a “typical” value. However, in this case, it resulted in the Speed hitters having a larger On
Base Percentage plus Slugging Average (OPS) than the Average hitters. This likely caused them to score
more runs in simulation. One way to address this might be to try different distributions and see if they
give different results.
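One property of the triangle distribution worth keeping in mind here is that its mean is (minimum + typical + maximum) / 3, so when the minimum and maximum are not symmetric around the typical value, the simulated average drifts away from it. Whether this is what happened in the Arena model is not established; the sketch below only illustrates the mechanism, using hypothetical parameters for some rate statistic.

```python
import random


def triangular_mean(low, mode, high):
    """Theoretical mean of a triangle distribution."""
    return (low + mode + high) / 3


# Hypothetical minimum, typical, and maximum values of a rate statistic.
low, mode, high = 0.200, 0.250, 0.400

rng = random.Random(0)
# Note the argument order of random.triangular: (low, high, mode).
draws = [rng.triangular(low, high, mode) for _ in range(100_000)]
sample_mean = sum(draws) / len(draws)

print("typical value:", mode)
print("theoretical mean:", triangular_mean(low, mode, high))
print("simulated mean:", sample_mean)
```

Here the simulated players average noticeably more than the "typical" value they were built around, which is one way a simulation and a regression based on mean rates could come to value the same cluster differently.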
Conclusion
In conclusion, I have used many different analytical tools to help study how to produce an
efficient baseball lineup. I used regression to determine how to best value each baseball event,
clustering to differentiate between player skill sets, optimization to create efficient lineups, and
simulation to see how these lineups fare in a model. While the two main methods have a small amount
of disagreement, the main takeaway is that having a few superstar players on your squad is generally
more efficient than having a roster full of average players.
Appendix
A. List of baseball abbreviations used.
H = Hits. The number of hits recorded by the offense.
D = Doubles. The number of doubles recorded by the offense.
T = Triples. The number of triples recorded by the offense.
HR = Home Runs. The number of home runs recorded by the offense.
SH = Sacrifice Hits. The number of sacrifice hits performed by the offense.
SF = Sacrifice Flies. The number of sacrifice flies performed by the offense.
SB = Stolen Bases. The number of successful attempts at stealing a base.
CS = Caught Stealing. The number of unsuccessful attempts at stealing a base.
GIDP = Grounding into a Double Play. The number of double plays recorded by the defense.
SO = Strike Out. The number of strike outs recorded by the opposing pitchers.
BB = Base on Balls [Also known as Walks]. The number of walks issued by the opposing pitchers.
HBP = Hit by Pitch. The number of batters hit by the opposing pitchers.
Errors. The number of errors committed by the opposing defense.
Wild = Wild Pitches. The number of wild pitches thrown by the opposing pitchers.
Passed = Passed Balls. The number of passed balls allowed by the opposing catcher.
Balks. The number of balks committed by the opposing pitchers.
CatchInt = Catcher’s Interference. The number of times a batter was awarded first base due to
interference.
HomeTeamV = A dummy variable indicating whether a team is visiting or home. 0 = Home, 1 = Visiting.
AVG = Batting Average. Hits / At Bats.
OBP = On Base Percentage. Total times on base / Total number of plate appearances.
SLG = Slugging Average [More commonly called Slugging Percentage]. A weighted average of bases per
at bat.
OPS = On base Plus Slugging. A general term used to convey overall hitting ability.

B. List of statistical abbreviations used.
MSE = Mean Squared Error
MAE = Mean Absolute Error
C. Regression coefficients for linear model, along with p-values.
