ITERATING OVER STATISTICAL MODELS
NCAA TOURNAMENT EDITION
Daniel Lee
WHY ARE YOU LISTENING TO ME?
▸ Stan developer

http://mc-stan.org

▸ Researcher at Columbia

▸ Co-founder of Stan Group

training / statistical support /
consulting






▸ email: bearlee@alum.mit.edu

twitter: @djsyclik / @mcmc_stan

web: http://syclik.com
MARCH MADNESS
WHAT IS MARCH MADNESS?
P(VILLANOVA WIN)?
CHAMPIONSHIP GAME: 

VILLANOVA VS NORTH CAROLINA
P(VILLANOVA WIN)?
VILLANOVA WILDCATS VS. NORTH CAROLINA TAR HEELS
▸ 1:00. 70 - 69
▸ 0:35. 72 - 69
▸ 0:23. 72 - 71
▸ 0:13. 74 - 71
▸ 0:06. 74 - 74
▸ 0:00. 77 - 74. Villanova wins.
TREAT STATISTICAL MODELS AS CODE
WRITING CODE IS A DISCIPLINE
▸ design patterns
▸ testing
▸ code review
▸ maintenance
▸ modularity
▸ collaboration
WRITING CODE CAN BE A DISCIPLINE
IS STATISTICAL MODELING A DISCIPLINE?
▸ Art or science?

▸ Models have names

▸ Statistical model vs implementation

▸ Collaboration on the statistical model?
TREAT STATISTICAL MODELS AS CODE
WHAT DO WE NEED TO DO?
▸ Elevate statistical models: first class entity

▸ Modularize

▸ Language

▸ Discuss subtle details

▸ Collaborate
TREAT STATISTICAL MODELS AS CODE
STAN GETS US CLOSE
▸ statistical modeling language
▸ domain-specific language

has its own grammar; not R or BUGS!
▸ rstan

shinystan, RStudio integration, rstanarm
▸ open-source

core libraries are new BSD
▸ Stan program
▸ plain text
▸ plays nicely with source repositories
▸ imperative language
Note: Stan isn’t the only thing you can do this with
MODELING
BASKETBALL
MARCH MADNESS
BASKETBALL HISTORY
▸ 1891
▸ Dr. James Naismith. Springfield, MA
▸ Non-contact conditioning
▸ 13 simple rules
▸ Peach basket

10. The umpire shall be judge of the men and shall note the fouls and notify the
referee when three consecutive fouls have been made. He shall have power to
disqualify men according to Rule 5.
MARCH MADNESS
BASKETBALL NOW
▸ 2 x 20 min halves
▸ Increasing score
▸ Points increment by 2, 3, and 1
▸ 5 players, unlimited substitutions
▸ player DQ: 5th foul
▸ Bonus: 7 team fouls

Double bonus: 10 team fouls
MARCH MADNESS
DATA
▸ 351 NCAA Division I men’s basketball teams
▸ 33 conferences
▸ 5421 games
▸ 24 - 35 games per team
▸ Max 3 observations
IS THIS BIG DATA?
TALL DATA VS WIDE DATA
▸ Tall data

lots of replications



▸ Wide data

lots of fields



day, home, score, ot, fgm, fga, 3pm, 3pa, 3a, fta, ftm, or, dr, ast, to, stl, blk, pf
THREE STEPS OF
BAYESIAN DATA ANALYSIS
ANDREW GELMAN IN BDA
THE THREE STEPS OF BAYESIAN DATA ANALYSIS
1. Set up full probability model

Write a Stan program

2. Condition on observed data

Run RStan

3. Evaluate the fit of the model

R, ShinyStan, posterior predictive checks
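The three steps can be sketched without Stan on a toy problem. The block below is a minimal illustration, not the workflow from the talk: it assumes a hypothetical one-parameter model (an ability gap `d` between two teams with a normal(0, 1) prior), made-up win counts, and a grid approximation in place of Stan's sampler.

```python
import math

def inv_logit(x: float) -> float:
    """Map a real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

# Step 1: set up a full probability model (hypothetical toy model).
# d is the ability gap between two teams, with prior d ~ normal(0, 1).
# Team 1 wins each game with probability inv_logit(d).
def log_prior(d: float) -> float:
    return -0.5 * d * d  # normal(0, 1) log density, up to a constant

def log_likelihood(d: float, wins: int, games: int) -> float:
    p = inv_logit(d)
    return wins * math.log(p) + (games - wins) * math.log(1.0 - p)

# Step 2: condition on observed data (made-up counts, illustration only).
wins, games = 7, 10
grid = [-4.0 + 8.0 * i / 400 for i in range(401)]
log_post = [log_prior(d) + log_likelihood(d, wins, games) for d in grid]
peak = max(log_post)
weights = [math.exp(lp - peak) for lp in log_post]  # subtract peak for stability
total = sum(weights)
posterior = [w / total for w in weights]

# Step 3: evaluate the fit, here with one crude check: compare the
# posterior mean win probability with the observed win rate.
posterior_win_prob = sum(w * inv_logit(d) for w, d in zip(posterior, grid))
observed_rate = wins / games
print(f"posterior mean P(win) = {posterior_win_prob:.3f}, observed = {observed_rate:.3f}")
```

The prior pulls the estimate toward 0.5, so the posterior mean win probability lands between 0.5 and the observed rate. A real analysis would replace the grid with a Stan program and the single comparison with fuller posterior predictive checks.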

ITERATING OVER
MODELS
ITERATING OVER STATISTICAL MODELS
STATISTICAL MODEL #1
▸ Only 2015-2016 matters
▸ Teams have a latent ability
▸ “logistic regression”
▸ “Bradley-Terry model”
$y \sim \mathrm{bernoulli}(\mathrm{logit}^{-1}(\theta_1 - \theta_2))$
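A minimal sketch of what the Bradley-Terry formula computes, assuming the latent abilities are already known. The team names and ability values are hypothetical placeholders, not estimates from the talk.

```python
import math

def inv_logit(x: float) -> float:
    """Inverse logit: maps a real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def win_probability(theta_1: float, theta_2: float) -> float:
    """P(team 1 beats team 2) under a Bradley-Terry model."""
    return inv_logit(theta_1 - theta_2)

# Hypothetical latent abilities, chosen only for illustration.
abilities = {"Team A": 1.0, "Team B": 0.0}

p = win_probability(abilities["Team A"], abilities["Team B"])
print(f"P(Team A beats Team B) = {p:.3f}")
```

Only the difference in abilities matters, so adding the same constant to every team leaves all predictions unchanged. In a fitted model this is usually handled with a prior or a sum-to-zero constraint on the abilities.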
P(VILLANOVA > UNC | DATA) = 0.73
ITERATING OVER STATISTICAL MODELS
TREAT STATISTICAL MODELS AS CODE
▸ Statistical model in a separate file
▸ Git
▸ Testing
▸ Inspection of fit
▸ Backtest on historical data
▸ Priors
▸ “Model #1”
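One way to make "Testing" and "Backtest on historical data" concrete is to score a model's predicted win probabilities against known outcomes. The sketch below assumes log loss as the scoring rule (the slides do not name the metric) and uses made-up predictions and outcomes.

```python
import math

def log_loss(predictions, outcomes, eps=1e-15):
    """Mean negative log-likelihood of binary outcomes (1 = win, 0 = loss).

    Predictions are clipped to [eps, 1 - eps] so that a confident
    wrong prediction yields a large finite penalty instead of infinity.
    """
    if len(predictions) != len(outcomes) or not outcomes:
        raise ValueError("predictions and outcomes must be non-empty and the same length")
    total = 0.0
    for p, y in zip(predictions, outcomes):
        p = min(max(p, eps), 1.0 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(outcomes)

# Made-up backtest: predicted P(team 1 wins) and what actually happened.
predicted = [0.73, 0.52, 0.90, 0.35]
actual = [1, 0, 1, 0]

baseline = log_loss([0.5] * len(actual), actual)  # always predicting a coin flip
score = log_loss(predicted, actual)
print(f"model log loss = {score:.3f}, coin-flip baseline = {baseline:.3f}")
```

Keeping a check like this next to the model file means each new model version can be compared against the previous one on the same historical games.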
IN WRITING, YOU MUST KILL ALL YOUR
DARLINGS.
William Faulkner
ITERATING OVER STATISTICAL MODELS
STATISTICAL MODEL #2
▸ Home court advantage!

▸ Teams have a latent ability
▸ “logistic regression”
▸ “Bradley-Terry model”
$y \sim \mathrm{bernoulli}(\mathrm{logit}^{-1}(\alpha + \theta_1 - \theta_2))$
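A minimal sketch of how the home-court term changes a prediction. It assumes that team 1 is the home team and that the term is dropped for neutral-site games; the slide states neither. All numeric values are hypothetical.

```python
import math

def inv_logit(x: float) -> float:
    """Inverse logit: maps a real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def home_win_probability(alpha: float, theta_home: float, theta_away: float) -> float:
    """P(home team wins) with a home-court advantage alpha on the logit scale."""
    return inv_logit(alpha + theta_home - theta_away)

def neutral_win_probability(theta_1: float, theta_2: float) -> float:
    """P(team 1 wins) at a neutral site: no home-court term (an assumption)."""
    return inv_logit(theta_1 - theta_2)

# Hypothetical values, chosen only for illustration.
alpha = 0.4
theta_a, theta_b = 0.8, 0.8  # two equally strong teams

print(f"A at home:    {home_win_probability(alpha, theta_a, theta_b):.3f}")
print(f"neutral site: {neutral_win_probability(theta_a, theta_b):.3f}")
```

With a home-court term in the model, regular-season home wins no longer inflate a team's estimated ability, which can shift predictions for neutral-site tournament games.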
P(VILLANOVA > UNC | DATA) = 0.52
ITERATING OVER STATISTICAL MODELS
STATISTICAL MODEL #3
▸ Assumptions:
▸ Only 2015-2016 matters
▸ Teams have a latent ability
▸ Model points
▸ Add a home court advantage
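The slide does not give a formula for the points model. One common way to model points, shown here purely as an assumed sketch and not as the model from the talk, treats the point margin as normal(alpha + theta_home - theta_away, sigma) and reads the win probability off the normal CDF. The names `alpha`, `sigma`, and all values are hypothetical.

```python
import math

def normal_cdf(x: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def expected_margin(alpha: float, theta_home: float, theta_away: float) -> float:
    """Expected point margin (home minus away), with abilities in points."""
    return alpha + theta_home - theta_away

def win_probability_from_margin(mean_margin: float, sigma: float) -> float:
    """P(margin > 0) when margin ~ normal(mean_margin, sigma)."""
    return normal_cdf(mean_margin / sigma)

# Hypothetical values, chosen only for illustration.
alpha, sigma = 3.0, 10.0        # home advantage and game-to-game spread, in points
theta_home, theta_away = 5.0, 5.0

mean = expected_margin(alpha, theta_home, theta_away)
p_home = win_probability_from_margin(mean, sigma)
print(f"expected margin = {mean:.1f} points, P(home win) = {p_home:.3f}")
```

Modeling points uses more information per game than win/loss alone: a 20-point win and a 1-point win are no longer treated as the same observation.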
ITERATING OVER STATISTICAL MODELS
HOW DID WE DO?
▸ Kaggle: 87 / 608
▸ What went wrong?
▸ What’s next?
TREAT STATISTICAL MODELS AS CODE
THANKS
▸ Collaborative effort
▸ NCAA modeling:

Rob Trangucci
▸ Stan team

Andrew Gelman, Bob Carpenter, Matt
Hoffman, Ben Goodrich, Michael Betancourt,
Marcus Brubaker, Jiqiang Guo, Peter Li, Allen
Riddell, Marco Inacio, Jeff Arnold, Mitzi
Morris, Rob Goedman, Brian Lau, Jonah
Gabry, Alp Kucukelbir, Robert Grant, Dustin
Tran, Krzysztof Sakrejda, Aki Vehtari, Rayleigh
Lei, Sebastian Weber
HELP
▸ http://mc-stan.org
▸ stan-users mailing list

▸ Stan Group Inc.

http://stan.fit

training / statistical support / consulting



▸ bearlee@alum.mit.edu / @djsyclik / @mcmc_stan
