Predict the Oscars with Data Science

March 2017

Data Science Process
• Frame the question.
• Collect the raw data.
• Process the data.
• Explore the data.
• Communicate results.

Frame the question
• Who will win the Oscar for Best Picture?

Collect the data
• What kind of data do we need?
• Financial data (budget, box office…)
• Reviews, ratings and scores.
• Awards and nominations.

Process the data
• How is the data “dirty”, and how can we fix it?
• User input, redundancies, missing data…
• Formatting: adapt the data to meet certain
specifications.
• Cleaning: detect and correct
corrupt or inaccurate records.
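
As a rough sketch of formatting and cleaning with pandas, the snippet below removes a redundant row, converts a text column to numbers, and fills missing values with the column median. The table, its column names, and the "$30M" budget format are invented for illustration and are not taken from the workshop data.

```python
import pandas as pd

# Invented raw data with a duplicate row, text-formatted budgets and missing values.
raw = pd.DataFrame({
    "film": ["Film A", "Film A", "Film B", "Film C"],
    "budget": ["$30M", "$30M", "$15M", None],
    "score": [8.1, 8.1, None, 6.9],
})

# Redundancies: drop exact duplicate rows.
clean = raw.drop_duplicates().copy()

# Formatting: turn "$30M" into the number 30.0 (missing entries stay missing).
clean["budget"] = clean["budget"].str.strip("$M").astype(float)

# Cleaning: fill missing values with the column median (one common choice among many).
clean["budget"] = clean["budget"].fillna(clean["budget"].median())
clean["score"] = clean["score"].fillna(clean["score"].median())

print(clean)
```

Filling with the median is only one option; dropping incomplete rows is another, and which is better depends on the data.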

Explore the data
• What are the meaningful patterns in the
data?
• How meaningful is each data point for our
predictions?

Communicate results
• Tell the story at the right technical level for
each audience.
• Make sure to focus on What's In It For You
(WIIFY!)
• Be objective; don't lie with statistics.
• Be visual! Show, don't just tell.

Goals
• Introduction to a data scientist's tools and
methods:
• Jupyter notebooks, numpy, pandas,
sklearn…
• Overview of basic machine learning
concepts:
• Data formatting and cleaning, Decision
trees, Overfitting, Random Forests…

Jupyter Notebooks
• One of a data scientist's everyday tools.
• Find the link in our classroom tool:
• (bit.ly/atl-oscars)
• Contains cells with code. They have already
been executed for you.

NumPy
• The fundamental package for scientific
computing with Python.
• Provides powerful multi-dimensional array
objects.
• Many methods for fast operations on arrays.
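
A small sketch of what "fast operations on arrays" can look like in practice. The scores below are made-up numbers, not workshop data.

```python
import numpy as np

# Made-up review scores: one row per film, one column per review source.
scores = np.array([[8.1, 90.0],
                   [7.4, 76.0],
                   [6.9, 64.0]])

print(scores.shape)         # (rows, columns)
print(scores.mean(axis=0))  # mean of each column, computed without a Python loop

# Broadcasting: divide every column by its own maximum in one expression.
scaled = scores / scores.max(axis=0)
print(scaled)
```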

Pandas
• Fundamental high-level building block for
doing practical, real world data analysis in
Python.
• Built on top of NumPy.
• Offers data structures and operations for
manipulating numerical tables and time
series.

Scikit-learn
• Python module for machine learning.
• Built on NumPy and SciPy; provides tools for
classification, regression, clustering, model
selection and preprocessing.

Initial imports and loading data with Pandas
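
The notebook's own code is not reproduced in these slides, so the following is only a sketch of what the initial imports and a load step might look like. The CSV text and its column names are invented stand-ins, read from memory so the example runs without a file.

```python
import io

import numpy as np
import pandas as pd

# Invented stand-in for the workshop's data file.
csv_text = """film,rate,budget,nominations,winner
Film A,PG-13,30.0,8,1
Film B,R,15.5,6,0
Film C,PG,90.0,,0
Film D,R,5.0,4,0
"""

# pd.read_csv accepts a path, a URL or a file-like object such as StringIO.
movies = pd.read_csv(io.StringIO(csv_text))

print(movies.shape)   # (number of rows, number of columns)
print(movies.dtypes)  # the column with a missing value is read as float
```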

Understanding your data
• .head(n) method: Returns the first n rows.
• .value_counts() method: Returns the counts
of unique values in a column (a Series).
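
A minimal sketch of both methods on an invented table (the "film" and "rate" columns are assumptions for illustration):

```python
import pandas as pd

# Invented example table.
movies = pd.DataFrame({
    "film": ["Film A", "Film B", "Film C", "Film D", "Film E"],
    "rate": ["R", "PG-13", "R", "PG", "R"],
})

first_two = movies.head(2)                   # first 2 rows of the DataFrame
rate_counts = movies["rate"].value_counts()  # counts per unique value, most frequent first

print(first_two)
print(rate_counts)
```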

Formatting your Data
• The rate values are in a non-numeric format, so
we need to assign each rate a unique integer
that scikit-learn can handle.
• With the .ix indexer (deprecated in pandas 0.20
and removed in 1.0; use .loc instead) you select
a subset of rows and assign a value to a certain
variable of that subset of observations.
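
A sketch of this subset-and-assign pattern, written with .loc because .ix is no longer available in current pandas. The rate labels and the integer codes are assumptions; any one-to-one assignment would serve.

```python
import pandas as pd

# Invented example column of non-numeric rate labels.
movies = pd.DataFrame({"rate": ["R", "PG-13", "R", "PG", "G"]})

# Hypothetical integer codes, one per label.
rate_codes = {"G": 0, "PG": 1, "PG-13": 2, "R": 3}

movies["rate_code"] = -1  # placeholder until every row has been assigned
for label, code in rate_codes.items():
    # Select the rows with this label, then assign the code to one column.
    movies.loc[movies["rate"] == label, "rate_code"] = code

# Series.map gives the same result in a single step.
movies["rate_code_map"] = movies["rate"].map(rate_codes)

print(movies)
```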

Decision Trees
• A decision tree breaks down a dataset into
smaller and smaller subsets.
• The final result is a model with a tree
structure that has:
• Decision nodes: ask a question and have
two or more branches.
• Leaf nodes: represent a classification or
decision.
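
To make decision nodes and leaf nodes concrete, the sketch below fits a tiny tree on invented data and prints its structure with scikit-learn's export_text. The feature names "nominations" and "won_guild_award" are hypothetical.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Invented toy data: [nominations, won_guild_award] -> won Best Picture (1) or not (0).
X = [[10, 1], [8, 1], [9, 0], [4, 0], [5, 0], [3, 1]]
y = [1, 1, 0, 0, 0, 0]

tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# Lines with a comparison are decision nodes; lines starting "class:" are leaf nodes.
print(export_text(tree, feature_names=["nominations", "won_guild_award"]))
```

Note that scikit-learn trees are binary: each decision node has exactly two branches.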


Classification vs Regression
• Classification: predict categories.
• Identifying group membership.
• Regression: predict values.
• Involves estimating or predicting a
numeric response.
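
A sketch of the difference using scikit-learn's two tree estimators on the same invented feature. The numbers, and the idea of predicting box office gross as the regression target, are assumptions for illustration.

```python
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Invented data: one feature (number of nominations) per film.
X = [[2.0], [4.0], [6.0], [8.0], [10.0], [12.0]]
won = [0, 0, 0, 1, 1, 1]                         # category: classification target
gross = [10.0, 20.0, 30.0, 80.0, 100.0, 120.0]   # number: regression target

clf = DecisionTreeClassifier(random_state=0).fit(X, won)
reg = DecisionTreeRegressor(random_state=0).fit(X, gross)

print(clf.predict([[11.5]]))  # a class label
print(reg.predict([[11.5]]))  # a numeric estimate
```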

Creating your first Decision Tree
You will use the scikit-learn and NumPy
libraries to build your first decision tree. We
will need the following to build a decision tree:
• target: A one-dimensional NumPy array
containing the target from the train data.
• features: A multidimensional NumPy array
containing the features/predictors from the
train data.

Creating your first Decision Tree
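
The original code for this slide is not included here, so this is a sketch of how target and features might be built and passed to a tree. The train table and its column names ("nominations", "rate_code", "winner") are invented.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Invented training data.
train = pd.DataFrame({
    "nominations": [10, 8, 9, 4, 5, 3, 12, 6],
    "rate_code": [3, 2, 3, 1, 2, 3, 2, 1],
    "winner": [1, 1, 0, 0, 0, 0, 1, 0],
})

target = train["winner"].values                       # one-dimensional array
features = train[["nominations", "rate_code"]].values # two-dimensional array

my_tree = DecisionTreeClassifier(random_state=1)
my_tree = my_tree.fit(features, target)

print(target.shape, features.shape)
```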

Importances and Score
• .feature_importances_ attribute: tells us
how important each feature is for the final
result.
• .score() method: returns the mean accuracy
of the fitted model on the data and labels
passed to it.
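
A sketch of reading both values from a fitted tree. The arrays are invented; the two feature columns stand for hypothetical "nominations" and "rate_code" values.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Invented training data: columns are [nominations, rate_code].
features = np.array([[10, 3], [8, 2], [9, 3], [4, 1],
                     [5, 2], [3, 3], [12, 2], [6, 1]])
target = np.array([1, 1, 0, 0, 0, 0, 1, 0])

my_tree = DecisionTreeClassifier(random_state=1).fit(features, target)

importances = my_tree.feature_importances_   # one value per feature, summing to 1
accuracy = my_tree.score(features, target)   # fraction of rows predicted correctly

print(importances)
print(accuracy)
```

Scoring on the same rows used for fitting tends to flatter the model, which is why held-out data is normally used to judge it.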

Overfitting
• The resulting model is too closely tied to the
training set.
• It doesn't generalize to new data, which is
the point of prediction.
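
One way to see overfitting is to compare accuracy on the training rows with accuracy on rows held back from training. The sketch below uses synthetic, deliberately noisy data from scikit-learn rather than film data, and the max_depth and min_samples_split values are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic noisy data stands in for the film data (an assumption for illustration).
X, y = make_classification(n_samples=300, n_features=10, n_informative=3,
                           n_redundant=0, flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# An unconstrained tree keeps splitting until it memorizes the training set.
deep = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Limiting depth and split size is one simple way to restrain the tree.
shallow = DecisionTreeClassifier(max_depth=3, min_samples_split=5,
                                 random_state=0).fit(X_train, y_train)

for name, model in [("deep", deep), ("shallow", shallow)]:
    print(name, model.score(X_train, y_train), model.score(X_test, y_test))
```

A large gap between the two scores of the unconstrained tree is the usual symptom of overfitting.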

Random Forest Classifier
• Random Forest Classifiers use many
Decision Trees to build a classifier.
• We introduce a bit of randomness: typically each
Tree sees a random sample of the rows and a
random subset of the features.
• Each Tree can give a different answer (a
vote). The final classification is the most
common amongst the Trees.

Predicting with Random Forest Classifiers
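
The slide's code is not reproduced here; the following sketch shows one way to fit a forest and predict for unseen films. The data, the feature meaning ([nominations, rate_code]) and the parameter values are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Invented training data: columns are [nominations, rate_code].
features = np.array([[10, 3], [8, 2], [9, 3], [4, 1],
                     [5, 2], [3, 3], [12, 2], [6, 1]])
target = np.array([1, 1, 0, 0, 0, 0, 1, 0])

forest = RandomForestClassifier(n_estimators=100, max_depth=3, random_state=1)
forest.fit(features, target)

# Hypothetical new films to classify.
new_films = np.array([[11, 2], [4, 3]])
predictions = forest.predict(new_films)        # one class label per film
probabilities = forest.predict_proba(new_films)  # share of support for each class

print(predictions)
print(probabilities)
```

scikit-learn's forest combines its trees by averaging their predicted class probabilities rather than counting one hard vote per tree, which usually agrees with a majority vote but is not identical to it.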

The End
Nothing happened after that.
Right?? RIGHT??

We can predict the Oscars
Except for 2017 ¯\_(ツ)_/¯

What is Thinkful?
Online skills bootcamp with 1-on-1 mentorship —
learn anytime & anywhere & get a job, guaranteed.
Anyone who’s committed can learn to code.

Our Philosophy
• 1-on-1 mentorship is the best way to learn
• Flexibility matters — learn anywhere, anytime
• We only make money when you get a job…

Our Results — Job Guarantee
Bhaumik Liz

Data Science Bootcamp
Syllabus: Python Toolkit, Statistics & Probability,
Experimentation, Machine Learning,
Communicating Data, Algorithms and Big Data

Web Development Bootcamp
Syllabus: Beginner and Intermediate Frontend
Development, Backend Development, CS
Fundamentals, Product Engineering

Special Prep Course Offer
• Three-week program, includes six mentor sessions
• Web: HTML/CSS, JavaScript, jQuery, Responsive Design
• Data: Basic Python & Stats, Data Science Toolkit, Project
• Option to continue into web development bootcamp
• Prep courses cost $500 (can apply to cost of full bootcamp)
• Talk to us about special offers for both programs
