Practical Machine Learning in Python
Matt Spitz
       via
@mattspitz

This is the Age of Aquarius Data
• Data is plentiful
  • application logs
  • external APIs
    • Facebook, Twitter
  • public datasets
• Analysis adds value
  • understanding your users
  • dynamic application decisions
• Storage / CPU time is cheap

Machine Learning in Python
• Python is well-suited for data analysis
• Versatile
  • quick and dirty scripts
  • full-featured, realtime applications
• Mature ML packages
  • tons of choices (see: mloss.org)
  • plug-and-play or DIY

Classification Problem: Terminology
• Data points
  • feature set: “interesting” facts about an event/thing
  • label: a description of that event/thing
• Classification
  • training set: a bunch of labeled feature sets
  • given a training set, build a classifier to predict labels for
    unlabeled feature sets

SluggerML
• Two questions
   • What features are strong predictors for home runs and strikeouts?
   • Given a particular situation, with what probability will the batter
     hit a home run or strike out?
• Feature sets represent game state for a plate appearance
   • game: day vs. night, wind direction...
   • at-bat: inning, #strikes, left-right matchup...
   • batter/pitcher: age, weight, fielding position...
• Labels represent outcome
   • HR (home run), K (strikeout), OTHER
• Poor Man’s Sabermetrics

SluggerML: Example
• Training set
  • {game_daynight: day, batter_age: 24, pitcher_weight: 211}
    • label: HR
  • {game_daynight: day, batter_age: 36, pitcher_weight: 242}
    • label: K
  • {game_daynight: night, batter_age: 27, pitcher_weight: 195}
    • label: OTHER
• Classifier predictions
  • {game_daynight: night, batter_age: 36, pitcher_weight: 225}
    • 2.6% HR, 15.6% K
  • {game_daynight: day, batter_age: 20, pitcher_weight: 216}
    • 2.2% HR, 19.1% K
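
One way to write the example above down in Python is sketched here, assuming plain dicts for feature sets and (features, label) tuples for training examples. The variable names are illustrative and not taken from the SluggerML source. The tuple shape is the one nltk's classifier `train` methods accept.

```python
# Labeled feature sets from the slide, written as (features, label) tuples.
training_set = [
    ({"game_daynight": "day", "batter_age": 24, "pitcher_weight": 211}, "HR"),
    ({"game_daynight": "day", "batter_age": 36, "pitcher_weight": 242}, "K"),
    ({"game_daynight": "night", "batter_age": 27, "pitcher_weight": 195}, "OTHER"),
]

# Unlabeled feature sets that a trained classifier would be asked about.
unlabeled = [
    {"game_daynight": "night", "batter_age": 36, "pitcher_weight": 225},
    {"game_daynight": "day", "batter_age": 20, "pitcher_weight": 216},
]

# Sanity check: every feature set exposes the same feature names.
feature_names = set(training_set[0][0])
assert all(set(features) == feature_names for features, _ in training_set)
assert all(set(features) == feature_names for features in unlabeled)

# The distinct labels the classifier can predict.
labels = sorted({label for _, label in training_set})
print(labels)
```

Keeping feature sets as flat dicts makes the "scrubbing" step easy to check: a missing or misspelled feature name fails the assertions before any training happens.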

SluggerML: Gathering Data
• Sources
  • Retrosheet
     • play-by-play logs for every game since 1956
  • Sean Lahman’s Baseball Archive
     • detailed stats about individual players

• Coalescing
  • 1st pass, Lahman: create player database
    • shelve module
  • 2nd pass, Retrosheet: track game state, join on player db
• Scrubbing
  • ensure consistency
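
The two-pass coalescing step could look roughly like the sketch below. The record layouts (`lahman_rows`, `retrosheet_plays`) and field names are hypothetical simplifications; the real Lahman and Retrosheet files have many more columns and a different format. Only the use of the standard-library `shelve` module comes from the slide.

```python
import os
import shelve
import tempfile

# Hypothetical, simplified records standing in for the real data files.
lahman_rows = [
    {"player_id": "batter01", "birth_year": 1975, "weight": 215},
    {"player_id": "pitcher01", "birth_year": 1980, "weight": 195},
]
retrosheet_plays = [
    {"year": 2004, "batter": "batter01", "pitcher": "pitcher01", "outcome": "HR"},
]

db_path = os.path.join(tempfile.mkdtemp(), "players")

# 1st pass: build a persistent player database keyed by player id.
with shelve.open(db_path) as players:
    for row in lahman_rows:
        players[row["player_id"]] = row

# 2nd pass: walk the plays and join each one against the player database.
labeled = []
with shelve.open(db_path) as players:
    for play in retrosheet_plays:
        batter = players[play["batter"]]
        pitcher = players[play["pitcher"]]
        features = {
            "batter_age": play["year"] - batter["birth_year"],
            "pitcher_weight": pitcher["weight"],
        }
        labeled.append((features, play["outcome"]))

print(labeled)
```

A shelf behaves like a dict backed by a file, so the player table does not have to fit in memory or be rebuilt on every run of the second pass.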

SluggerML: Gathering Data
• Training set
  • regular-season games from 1980-2011
  • 5,669,301 plate appearances
     • 135,602 home runs
     • 871,226 strikeouts
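
The counts above fix the base rates any classifier trained on this set starts from: roughly 2.4% of plate appearances end in a home run and roughly 15.4% in a strikeout. A short calculation, using only the numbers on the slide:

```python
plate_appearances = 5_669_301
home_runs = 135_602
strikeouts = 871_226

# Prior probability of each label across the whole training set.
p_hr = home_runs / plate_appearances
p_k = strikeouts / plate_appearances
p_other = 1 - p_hr - p_k

print(f"P(HR)    = {p_hr:.2%}")
print(f"P(K)     = {p_k:.2%}")
print(f"P(OTHER) = {p_other:.2%}")
```

With labels this imbalanced, a prediction is most informative when read relative to these priors rather than as an absolute number.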

Selecting a Toolkit: Tradeoffs
• Speed
  • offline vs. realtime
• Transparency
   • internal visibility
   • customizability
• Support
  • maturity
  • community

Selecting a Toolkit: High-Level Options
• External bindings
  • python interfaces to popular packages
  • Matlab, R, Octave, SHOGUN Toolbox
  • transition legacy workflows
• Python implementations
  • collections of algorithms
  • (mostly) python
  • external subcomponents
• DIY
  • building blocks

Selecting a Toolkit: Python Implementations
• nltk
  • focus on NLP
  • book: Natural Language Processing with Python (O’Reilly ’09)
• mlpy
  • regression, classification, clustering
• PyML
  • focus on SVM
• PyBrain
  • focus on neural networks

Selecting a Toolkit: Python Implementations
• mdp-toolkit
  • data processing management
  • nodes represent tasks in a data workflow
  • scheduling, parallelization
• scikit-learn
  • supervised, unsupervised, feature selection, visualization
  • heavy development, large team
  • excellent documentation
  • active community

Selecting a Toolkit: Do It Yourself
• Basic building blocks
  • NumPy
  • SciPy
• C/C++ implementations
  • LIBLINEAR
  • LIBSVM
  • OpenCV
  • ...your own?

SluggerML: Two Questions
• What features are strong predictors for home runs
  and strikeouts?
• Given a particular situation, with what probability will
  the batter hit a home run or strike out?

SluggerML: Feature Selection
• Identifies predictive features
  • strongly correlated with labels
  • predictive: max_benchpress
  • not predictive: favorite_cookie
• scikit-learn: chi-square feature selection
• Visualizing significance
  • for each well-supported value, find correlation with HR/K
     • “well-supported”: >= 0.05% of samples with feature=value
     • correlation: ( P(HR | feature=value) / P(HR) ) - 1
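
The "correlation" defined above is a relative lift: how much more (or less) likely a label is when a feature takes a given value, compared with the label's overall rate. A minimal sketch of that calculation follows. The function name `lift_by_value`, the `batter_team` feature, and the toy samples are invented for illustration and are not from the SluggerML source.

```python
from collections import Counter


def lift_by_value(samples, feature, label, min_support=0.0005):
    """Return {value: P(label | feature=value) / P(label) - 1}.

    Only "well-supported" values are kept: those seen in at least
    min_support (0.05% by default) of all samples.
    """
    total = len(samples)
    value_counts = Counter()
    value_label_counts = Counter()
    label_count = 0
    for features, sample_label in samples:
        value = features[feature]
        value_counts[value] += 1
        if sample_label == label:
            label_count += 1
            value_label_counts[value] += 1
    if label_count == 0:
        return {}  # the label never occurs, so lift is undefined
    p_label = label_count / total
    result = {}
    for value, count in value_counts.items():
        if count / total < min_support:
            continue  # too few samples with feature=value
        p_label_given_value = value_label_counts[value] / count
        result[value] = p_label_given_value / p_label - 1
    return result


# Toy data, invented for illustration only.
samples = (
    [({"batter_team": "home"}, "HR")] * 3
    + [({"batter_team": "home"}, "OTHER")] * 47
    + [({"batter_team": "visiting"}, "HR")] * 2
    + [({"batter_team": "visiting"}, "OTHER")] * 48
)
lift = lift_by_value(samples, "batter_team", "HR")
print(lift)
```

A value of 0 means the feature value tells you nothing beyond the base rate; positive and negative values correspond to bars above and below the 0.0% line in the charts that follow.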

SluggerML: Feature Selection
• Chart: Batter: Home vs. Visiting
  • y-axis: correlation with Home Run and Strikeout, -50.0% to 50.0%
  • x-axis: home team, visiting team

SluggerML: Feature Selection
• Chart: Batter: Fielding Position
  • y-axis: correlation with Home Run and Strikeout, -50.0% to 50.0%
  • x-axis: P, C, 1B, 2B, 3B, SS, LF, CF, RF, DH, PH

SluggerML: Feature Selection
• Chart: Game: Temperature (˚F)
  • y-axis: correlation with Home Run and Strikeout, -50.0% to 50.0%
  • x-axis: 5-degree bins from 35-39 to 100-104

SluggerML: Feature Selection
• Chart: Game: Year
  • y-axis: correlation with Home Run and Strikeout, -50.0% to 50.0%
  • x-axis: 1980-1984, 1985-1989, 1990-1994, 1995-1999, 2000-2004, 2005-2009, 2010-2011

SluggerML: Realtime Classification
• Given features, predict label probabilities
• nltk: NaiveBayesClassifier
• Web frontend
  • gunicorn, nginx
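
To show what "given features, predict label probabilities" involves, here is a hand-rolled categorical naive Bayes with add-one smoothing. It is a stand-in written for illustration, not nltk's implementation and not the SluggerML code; with nltk, `NaiveBayesClassifier.train` and `prob_classify` would play the roles of `fit` and `predict_proba`. The class name and the toy data are assumptions.

```python
import math
from collections import Counter, defaultdict


class TinyNaiveBayes:
    """Categorical naive Bayes with add-one smoothing (illustrative only)."""

    def fit(self, labeled_featuresets):
        self.label_counts = Counter()
        # (label, feature name) -> Counter of feature values
        self.value_counts = defaultdict(Counter)
        # feature name -> set of values seen in training
        self.values_seen = defaultdict(set)
        for features, label in labeled_featuresets:
            self.label_counts[label] += 1
            for name, value in features.items():
                self.value_counts[(label, name)][value] += 1
                self.values_seen[name].add(value)
        return self

    def predict_proba(self, features):
        total = sum(self.label_counts.values())
        log_scores = {}
        for label, count in self.label_counts.items():
            # Start from the log prior, then add one log likelihood per feature.
            score = math.log(count / total)
            for name, value in features.items():
                seen = self.value_counts[(label, name)][value]
                n_values = len(self.values_seen[name] | {value})
                score += math.log((seen + 1) / (count + n_values))
            log_scores[label] = score
        # Normalize in log space to avoid underflow with many features.
        top = max(log_scores.values())
        unnormalized = {l: math.exp(s - top) for l, s in log_scores.items()}
        norm = sum(unnormalized.values())
        return {l: p / norm for l, p in unnormalized.items()}


# Toy training data, invented for illustration only.
training = (
    [({"game_daynight": "day"}, "HR")] * 3
    + [({"game_daynight": "night"}, "HR")] * 1
    + [({"game_daynight": "day"}, "OTHER")] * 2
    + [({"game_daynight": "night"}, "OTHER")] * 4
)
classifier = TinyNaiveBayes().fit(training)
probs = classifier.predict_proba({"game_daynight": "day"})
print(probs)
```

Prediction is only a handful of dictionary lookups per feature, which is what makes this family of classifiers a reasonable fit for a web frontend where a user is waiting on the response.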

Tips and Tricks
• Persistent classifier internals
   • once trained, save and reuse
   • depends on implementation
    • string representation may exist
    • create your own
• Using generators where possible
  • avoid keeping data in memory
    • single-pass algorithms
    • conversion pass before training
• Multicore text processing
  • scrubbing: low memory footprint
  • multiprocessing module
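
Two of the tips above, generators and persisted internals, can be sketched with the standard library alone. The comma-separated input format and the helper name `iter_labeled_featuresets` are hypothetical; a real run would iterate over an open file rather than an in-memory string, and would write the pickle to disk with `pickle.dump`.

```python
import io
import pickle


def iter_labeled_featuresets(lines):
    """Yield (features, label) one line at a time instead of building a list."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        daynight, batter_age, label = line.split(",")
        yield {"game_daynight": daynight, "batter_age": int(batter_age)}, label


# Hypothetical input; stands in for a large file on disk.
raw = io.StringIO("day,24,HR\nday,36,K\n\nnight,27,OTHER\n")

# Single pass: count labels without keeping the feature sets in memory.
label_counts = {}
for _features, label in iter_labeled_featuresets(raw):
    label_counts[label] = label_counts.get(label, 0) + 1

# Persist the "trained" state and load it back, as a web process might at startup.
blob = pickle.dumps(label_counts)
restored = pickle.loads(blob)
print(restored)
```

Pickle is convenient for state you produced yourself; unpickling data from an untrusted source can execute arbitrary code, so a plain textual representation is the safer choice when files change hands.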

The Fine Print™
• Plug-and-play is easy!
• Don’t blindly apply ML
  • understand your data
  • understand your algorithms
     • ml-class.org is an excellent resource

Thanks!
github.com/mattspitz/sluggerml
slideshare.net/mattspitz/practical-machine-learning-in-python


@mattspitz

Editor's Notes

  • #3: Data is everywhere: clickstream data, for one. Users are bad at managing fb permissions, so you can get a lot out of the Graph API. There’s value in learning about data: how people use your site, feature or advertisement personalization. One thing that enables this is that resources are cheap these days.
  • #4: Python is a fantastic programming environment for data processing and analytics. On one end of the spectrum are quick and dirty scripts; on the other, full-featured applications ready for deployment at scale. There is a wide variety of toolkits for off-the-shelf analysis or for building out your own data processing applications.
  • #5: For this talk, I’m discussing one flavor of analytics and machine learning: the classification problem. Intuition: the training set is what you know about the world; you train a classifier to predict the things that you don’t.
  • #6: As a concrete example, I started playing around with some baseball stats to illustrate how one might go about building ML applications in Python. Even if you’re not into baseball, you know that the iconic visions of success and failure are the home run and the strikeout. In all the movies, hitting a home run is equivalent to getting the girl, and striking out is seen as a major setback.
  • #8: As with any machine learning problem, you want to get your data into a classifier-consumable format, that is, labeled feature sets. For each play in the game, keep track of the game state and output a labeled feature bundle representing the situation and its outcome: HR, K, or OTHER.
  • #10: Speed: offline means a deadline of hours or days; realtime means a user is waiting on the other side (user actions => milliseconds). Transparency: seeing what’s going on with an algorithm in case the docs aren’t clear; modifying or patching an algorithm to meet your needs. Support: maturity and active development. How strong is the community around the project? Are there tutorials available?
  • #11: Interface with external packages if you’ve done some analysis already and want to transition to Python without throwing away code. Python toolkits provide sets of algorithms, mostly as Python implementations; they often use external packages with C bindings, and some even use other toolkits. DIY: use the external packages yourself.
  • #12: To give a sampling of what’s available, I chose some toolkits that were last updated within a year. As a disclaimer: this is not exhaustive, just a sampling; some of these tools I’ve used, some I haven’t; I’m sure I’ve missed your favorite, and for that I apologize. Different packages focus on different things, so one isn’t necessarily going to suit all of your needs.
  • #13: There was buzz around scikit-learn last year; I checked it out recently and it’s been built out a lot.
  • #14: NumPy: fast and efficient arrays. SciPy: scientific tools and algorithms built on NumPy. You can also use popular C/C++ implementations through Python bindings. Python is a modular language, so you can always sub out your implementation without disrupting your workflow too much. Now, as an example of applying these toolkits...
  • #15: For the first question, speed isn’t critical. For the second, speed is critical (imagine that you’re a coach): baseball is slow, but it’s not THAT slow.
  • #16: Identifies predictive features: certain values are strongly correlated with certain labels. sklearn: I wasn’t clear on the documented usage, so I looked at the code.
  • #21: For a coach.
  • #22: Don’t we need to train our classifier to run our web application? Save the internals on disk! Pickle them or pull out a textual representation (another argument for using a package that allows you to do this). Why compute things twice? Use generators: with lots and lots of data, avoid keeping it all in memory. Use a single-pass algorithm (Bayes), or a first-pass conversion to compact data (NumPy vectors, not Python objects). It’s not always possible, but keep it in mind. Take advantage of multiple cores: if your processing step has a minimal memory footprint (just one line at a time), do it on multiple cores, either as multiple processes on different input files or with the multiprocessing module, which is great at this.
  • #23: You don’t need to know everything about the algorithms you use, but you can’t just blindly apply these things and hope that they magically work. ml-class.org is a free class that provides an excellent foundation and starting point for understanding ML. In no time, you, too, can be a number muncher.
  • #24: The source code for SluggerML is on GitHub; it’s kind of a mess, and I’m sorry about that. And I’m @mattspitz on the twitters.