Predicting the Oscars with data science
March 2017
Data Science Process
• Frame the question.
• Collect the raw data.
• Process the data.
• Explore the data.
• Communicate results.
Frame the question
• Who will win the Oscar for Best Picture?
Collect the Data
• What kind of data do we need?
• Financial data (Budget, box office…)
• Reviews, ratings and scores.
• Awards and nominations.
Process the data
• How is the data “dirty”, and how can we fix it?
• Common problems: user input errors, redundancies, missing data…
• Formatting: adapt the data to meet certain
specifications.
• Cleaning: detecting and correcting
corrupt or inaccurate records.
Explore the data
• What are the meaningful patterns in the
data?
• How meaningful is each data point for our
predictions?
Communicate results
• Tell the story at the right technical level for each
audience.
• Make sure to focus on What’s In It For You
(WIIFY!).
• Be objective; don’t lie with statistics.
• Be visual! Show, don’t just tell.
Goals
• Introduction to a data scientist's tools and
methods:
• Jupyter notebooks, numpy, pandas,
sklearn…
• Overview of basic machine learning
concepts:
• Data formatting and cleaning, Decision
trees, Overfitting, Random Forests…
Jupyter Notebooks
• One of a data scientist’s everyday tools.
• Find the link in our classroom tool:
• (bit.ly/atl-oscars)
• Contains cells with code. They have already
been executed for you.
NumPy
• The fundamental package for scientific
computing with Python.
• Provides powerful multi-dimensional array
objects.
• Many methods for fast operations on arrays.
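As a rough illustration of the array objects and fast operations mentioned above, here is a minimal sketch. The numbers are toy values invented for the example, not data from the talk.

```python
import numpy as np

# Toy values, invented for illustration only.
budgets = np.array([[10.0, 25.0],
                    [40.0, 5.0]])  # a 2 x 2 multi-dimensional array

# Element-wise arithmetic runs over the whole array without an explicit loop.
doubled = budgets * 2

# Reductions can work along one axis; axis=0 gives the mean of each column.
column_means = budgets.mean(axis=0)
```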
Pandas
• Fundamental high-level building block for
doing practical, real world data analysis in
Python.
• Built on top of NumPy.
• Offers data structures and operations for
manipulating numerical tables and time
series.
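A small sketch of what a Pandas table looks like in practice. The column names (`film`, `budget_musd`, `nominations`) and the rows are hypothetical placeholders, not the dataset used in the talk.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration; the column names are hypothetical.
films = pd.DataFrame(
    {
        "film": ["Film A", "Film B", "Film C"],
        "budget_musd": [10.0, 25.0, 40.0],
        "nominations": [8, 3, 11],
    }
)

# Table-style operation: sort by a column and take the top row.
most_nominated = films.sort_values("nominations", ascending=False).iloc[0]["film"]

# Numeric columns can be handed back to NumPy when needed.
budget_array = films["budget_musd"].to_numpy()
```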
Scikit-learn
• Python module for machine learning.
• Built on top of NumPy and SciPy.
• Provides tools for classification, regression,
clustering, model selection and data
preprocessing.
Initial imports and loading data with Pandas
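The code from this slide is not preserved in the transcript. The sketch below shows what such a step might look like, assuming the data comes as a CSV file. To keep the example runnable, an in-memory string stands in for the file; its columns and rows are invented.

```python
import io

import numpy as np
import pandas as pd
from sklearn import tree

# Stand-in for the real data file: a tiny in-memory CSV with made-up rows.
# With a real file you would pass its path to pd.read_csv instead.
csv_text = io.StringIO(
    "film,year,rate,nominations,winner\n"
    "Film A,2001,PG-13,8,1\n"
    "Film B,2001,R,3,0\n"
    "Film C,2002,PG,11,1\n"
)

films = pd.read_csv(csv_text)
```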
Understanding your data
• .head(n) method: Returns first n rows.
• .value_counts() method: Returns the count of
each unique value in a column (a Series).
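A minimal sketch of both methods on a toy table. The `rate` and `winner` columns and their values are assumptions made for the example.

```python
import pandas as pd

# Toy table; column names and values are invented for illustration.
films = pd.DataFrame(
    {
        "rate": ["PG-13", "R", "R", "PG", "R"],
        "winner": [0, 1, 0, 0, 1],
    }
)

# First two rows of the table.
first_two = films.head(2)

# How often each distinct value appears in one column, most frequent first.
rate_counts = films["rate"].value_counts()
```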
Formatting your Data
• The rate values are in a non-numeric format. Thus,
we will need to assign each rate a unique
integer so that scikit-learn can handle the
information.
• With the .loc indexer (the older .ix is deprecated
and was removed in pandas 1.0) you select a subset of
rows and assign a value to a certain variable
of that subset of observations.
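The formatting step could be sketched as follows. It uses `.loc`, the current replacement for the deprecated `.ix`. The rate labels and the integer codes assigned to them are hypothetical; the talk's actual mapping is not preserved.

```python
import pandas as pd

# Toy column; the labels are placeholders for whatever the real data holds.
films = pd.DataFrame({"rate": ["PG", "PG-13", "R", "PG-13"]})

# Start with a default code, then overwrite it for each subset of rows.
films["rate_code"] = 0
films.loc[films["rate"] == "PG", "rate_code"] = 1      # hypothetical code
films.loc[films["rate"] == "PG-13", "rate_code"] = 2   # hypothetical code
films.loc[films["rate"] == "R", "rate_code"] = 3       # hypothetical code
```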
Cleaning your Data
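The code from this slide is not preserved either. One common cleaning approach, shown here only as an assumption about what the step might involve, is to drop duplicated rows and fill missing numeric values with the column median.

```python
import numpy as np
import pandas as pd

# Toy table with one duplicated row and one missing budget (all invented).
films = pd.DataFrame(
    {
        "film": ["Film A", "Film A", "Film B", "Film C", "Film D"],
        "budget_musd": [10.0, 10.0, np.nan, 40.0, 20.0],
    }
)

# Remove exact duplicate rows, then renumber the index.
films = films.drop_duplicates().reset_index(drop=True)

# Replace the missing budget with the median of the known budgets.
median_budget = films["budget_musd"].median()
films["budget_musd"] = films["budget_musd"].fillna(median_budget)
```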
Decision Trees
• A decision tree breaks down a dataset into smaller and
smaller subsets.
• The final result is a model with a tree
structure that has:
• Decision nodes: ask a question and have
two or more branches.
• Leaf nodes: represent a classification or
decision.
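To make the two node types concrete, here is a hand-written tree as plain Python. The questions and the threshold are invented for illustration; a learned tree would pick its own.

```python
def classify_film(nominations: int, won_guild_award: bool) -> str:
    """Tiny hand-built decision tree; the rules are hypothetical."""
    # Decision node: asks a question and branches on the answer.
    if nominations >= 8:
        # Second decision node further down the same branch.
        if won_guild_award:
            return "winner"   # leaf node: a final classification
        return "nominee"      # leaf node
    return "nominee"          # leaf node
```

For example, `classify_film(10, True)` follows both "yes" branches and ends at the "winner" leaf.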
Classification vs Regression
• Classification — Predict categories.
• Identifying group membership.
• Regression — Predict values.
• Involves estimating or predicting a
response.
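The difference can be sketched with scikit-learn's two tree estimators. The feature, labels and values below are toy numbers chosen for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# One toy feature (say, number of nominations) for four made-up films.
X = np.array([[1], [2], [8], [9]])

# Classification target: a category (0 = did not win, 1 = won).
is_winner = np.array([0, 0, 1, 1])

# Regression target: a continuous value (say, box office in $M).
box_office = np.array([5.0, 6.0, 50.0, 55.0])

clf = DecisionTreeClassifier(random_state=0).fit(X, is_winner)
reg = DecisionTreeRegressor(random_state=0).fit(X, box_office)

predicted_class = clf.predict([[9]])[0]   # a category label
predicted_value = reg.predict([[9]])[0]   # a number
```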
Classification
Creating your first Decision Tree
You will use the scikit-learn and NumPy
libraries to build your first decision tree. We
will need the following to build a decision tree:
• target: A one-dimensional NumPy array
containing the target from the training data.
• features: A multidimensional NumPy array
containing the features/predictors from the
training data.
Creating your first Decision Tree
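The slide's code is not in the transcript. A minimal sketch of the step described above might look like this, assuming a training table with hypothetical columns `nominations`, `rate_code` and `winner`, filled with invented rows.

```python
import numpy as np
import pandas as pd
from sklearn import tree

# Made-up training rows; the column names are assumptions.
train = pd.DataFrame(
    {
        "nominations": [2, 3, 9, 11, 4, 10],
        "rate_code": [1, 2, 3, 2, 3, 1],
        "winner": [0, 0, 1, 1, 0, 1],
    }
)

# target: one-dimensional array of the labels we want to predict.
target = train["winner"].values

# features: two-dimensional array, one row per film, one column per predictor.
features = train[["nominations", "rate_code"]].values

# Fit the decision tree on the training data.
my_tree = tree.DecisionTreeClassifier(random_state=1)
my_tree = my_tree.fit(features, target)
```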
Importances and Score
• .feature_importances_ attribute: tells us
how important the features are for the final
result.
• .score(X, y) method: returns the mean accuracy
of the model’s predictions on the given data and labels.
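Both can be read off a fitted tree as sketched below. The data is a toy set invented for the example, so the numbers say nothing about the real Oscars model.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy data: column 0 = nominations, column 1 = rate code (both hypothetical).
features = np.array([[2, 1], [3, 2], [9, 3], [11, 2], [4, 3], [10, 1]])
target = np.array([0, 0, 1, 1, 0, 1])

my_tree = DecisionTreeClassifier(random_state=1).fit(features, target)

# One importance value per feature column; the values sum to 1.
importances = my_tree.feature_importances_

# Mean accuracy on the data passed in. Here that is the training data,
# so this measures fit, not how well the model generalizes.
train_accuracy = my_tree.score(features, target)
```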
Predicting
Overfitting
• The resulting model is too closely tied to the training set.
• It doesn’t generalize to new data, which is
the point of prediction.
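One way to see overfitting is to hold out part of the data and compare accuracy on the two parts. The sketch below uses synthetic noisy data, not the Oscars dataset. The `max_depth=2` limit is just one illustrative way to restrain a tree.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: the label depends on the first feature plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + rng.normal(scale=1.0, size=200) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# A fully grown tree can memorize every training row, noise included.
deep_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
train_accuracy = deep_tree.score(X_train, y_train)
test_accuracy = deep_tree.score(X_test, y_test)

# A depth-limited tree cannot memorize as much of the noise.
shallow_tree = DecisionTreeClassifier(max_depth=2, random_state=0)
shallow_tree = shallow_tree.fit(X_train, y_train)
shallow_test_accuracy = shallow_tree.score(X_test, y_test)
```

A large gap between `train_accuracy` and `test_accuracy` is the usual warning sign of an overfitted model.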
Random Forest Classifier
• Random Forest Classifiers use many
Decision Trees to build a classifier.
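• We introduce a bit of randomness: each tree is trained on a
random sample of the rows and considers a random subset of
the features at each split.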
• Each Tree can give a different answer (a
vote). The final classification is the most
common amongst the Trees.
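A minimal sketch with scikit-learn's `RandomForestClassifier`, on invented toy data. The parameter values are illustrative choices, not the ones used in the talk. Note that scikit-learn combines the trees by averaging their predicted class probabilities rather than counting one hard vote per tree, which usually gives the same answer as a majority vote.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy data: column 0 = nominations, column 1 = rate code (both hypothetical).
features = np.array([[2, 1], [3, 2], [9, 3], [11, 2], [4, 3], [10, 1]])
target = np.array([0, 0, 1, 1, 0, 1])

# 100 trees, each grown on a bootstrap sample of the rows.
forest = RandomForestClassifier(n_estimators=100, max_depth=3, random_state=1)
forest = forest.fit(features, target)

n_trees = len(forest.estimators_)

# Predict for a new, made-up film.
new_film = np.array([[12, 2]])
prediction = forest.predict(new_film)[0]

# Averaged class probabilities across the trees (columns: class 0, class 1).
vote_share = forest.predict_proba(new_film)[0]
```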
Predicting with Random Forest Classifiers
Results
• 1976: Rocky
• 1984: Amadeus
• 1996: The English Patient
• 2009: The Hurt Locker
And the Oscar goes to…
La La Land!!
The End
Nothing happened after that.
Right?? RIGHT??
We can predict the Oscars
Except for 2017 ¯\_(ツ)_/¯
What is Thinkful?
Online skills bootcamp with 1-on-1 mentorship —
learn anytime & anywhere & get a job, guaranteed.
Anyone who’s committed can learn to code.
Our Philosophy
• 1-on-1 mentorship is the best way to learn
• Flexibility matters — learn anywhere, anytime
• We only make money when you get a job…
Our Results — Job Guarantee
Bhaumik Liz
Data Science Bootcamp
Syllabus: Python Toolkit, Statistics & Probability,
Experimentation, Machine Learning,
Communicating Data, Algorithms and Big Data
Web Development Bootcamp
Syllabus: Beginner and Intermediate Frontend
Development, Backend Development, CS
Fundamentals, Product Engineering
Special Prep Course Offer
• Three-week program, includes six mentor sessions
• Web: HTML/CSS, JavaScript, jQuery, Responsive Design
• Data: Basic Python & Stats, Data Science Toolkit, Project
• Option to continue into web development bootcamp
• Prep courses cost $500 (can apply to cost of full bootcamp)
• Talk to us about special offers for both programs
