Predicting Box Office Hits: Data-Driven Insights into Movie Success

CONFIDENTIAL: The information in this document belongs to Boston Institute of Analytics LLC. Any unauthorized sharing of this
material is prohibited and subject to legal action under breach of IP and confidentiality clauses.

Index
1. Introduction
   1. Problem Statement
   2. Project Benefits
2. Data Exploration
   1. Importing Dataset and Libraries
   2. Categorizing the Target Variables
   3. Handling Missing Values
   4. Label Encoding
   5. Correlation
3. Classification Model Building
   1. Train Test Split
   2. Scaling
   3. Feature Selection using RFECV
   4. Random Forest
   5. Confusion Matrix
   6. Classification Report

Methodology:
• Load and explore the dataset using pandas, matplotlib, and seaborn.
• Preprocess the data, including handling missing values, label encoding, and addressing multicollinearity.
• Implement feature selection.
• Split the data into training and testing sets, and apply feature scaling.
• Train and evaluate a Random Forest classifier for predicting movie success categories.
• Generate and interpret performance metrics and visualizations.

Problem Statement
• The goal is to develop a data analysis pipeline and a machine learning model that predicts movie success categories (Hit, Average, Flop) from various movie attributes. With this model, the studio aims to improve movie production decisions, marketing strategies, and overall film industry insights.

Project Benefits
• Production Optimization: The model will help identify factors influencing movie success, allowing for more informed decisions in movie production.
• Marketing Strategy: Accurate prediction of movie success can assist in tailoring marketing efforts and budget allocation.
• Industry Insights: Understanding success patterns can guide future trends and innovations in filmmaking.

Series of Steps: Data Collection, Data Exploration, Categorizing the target variables, Handling the missing values
2. Data Exploration (Exploratory Data Analysis)

1. Importing the Dataset and Libraries:
We import the libraries required for preprocessing: pandas, numpy, seaborn and matplotlib. We then load the dataset provided by the client into the notebook using the ‘pd.read_csv’ function from the pandas library.
After importing the dataset, we check its shape and description.
The original dataset has 5043 rows and 28 columns.
We check the data type of each column.
dtypes: float64(13), int64(3), object(12)
We check whether the dataset contains any null cells.
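The checks above can be sketched as follows. This is a minimal illustration, not the project's notebook: the in-memory CSV and its column values are made up so the snippet runs on its own, whereas the real code would pass the path of the client's file to pd.read_csv.

```python
import io
import pandas as pd

# Hypothetical stand-in for the client's CSV file (values are invented).
toy_csv = io.StringIO(
    "movie_title,director_name,budget,imdb_score\n"
    "Movie A,Director X,1000000,7.1\n"
    "Movie B,,250000,2.8\n"
    "Movie C,Director Y,,5.4\n"
)

df = pd.read_csv(toy_csv)

print(df.shape)           # (rows, columns)
print(df.describe())      # summary statistics of the numeric columns
print(df.dtypes)          # data type of each column
print(df.isnull().sum())  # number of null cells per column
```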

2. Categorizing the Target Variables:
We create a new column, Classify, that labels each movie as "Hit", "Average", or "Flop" based on its IMDB score range:
1-3: Flop Movie
3-6: Average Movie
6-10: Hit Movie
As seen in the graph, Hit is the most frequent category.
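One way to build such a column is pandas' pd.cut. The sketch below assumes the score column is named imdb_score and that a score sitting exactly on a boundary (3 or 6) falls into the lower category; the slides do not say how boundary scores were handled, so that choice is an assumption.

```python
import pandas as pd

# Toy scores for illustration only.
df = pd.DataFrame({"imdb_score": [2.5, 3.0, 4.8, 6.0, 6.1, 8.3]})

# Assumed bin edges: (0, 3] -> Flop, (3, 6] -> Average, (6, 10] -> Hit.
df["Classify"] = pd.cut(
    df["imdb_score"],
    bins=[0, 3, 6, 10],
    labels=["Flop", "Average", "Hit"],
)

# Count of movies per category, the numbers behind a bar chart of classes.
print(df["Classify"].value_counts())
```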
3. Handling Missing Values:
We drop every sample (row) that has a missing value. This leaves a clean dataset of 3755 rows and 29 columns (the 28 original columns plus Classify).
No column has been dropped.
We save the clean data as a separate CSV file for building a dashboard.
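A minimal sketch of this step on a toy frame is shown below. The column values are invented, and the export writes to an in-memory buffer so the snippet is self-contained; the actual notebook would write to a file path of its choosing.

```python
import io
import numpy as np
import pandas as pd

# Toy data with two incomplete rows.
df = pd.DataFrame({
    "budget": [1.0e6, np.nan, 3.0e6, 4.0e6],
    "director_name": ["X", "Y", None, "Z"],
    "imdb_score": [7.1, 5.4, 2.8, 6.6],
})

# Drop rows (axis=0) that contain any missing value; columns are untouched.
clean = df.dropna(axis=0).reset_index(drop=True)
print(df.shape, "->", clean.shape)

# Export the clean data as CSV (a buffer here instead of a real file).
buffer = io.StringIO()
clean.to_csv(buffer, index=False)
```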

4. Label Encoding
In this step, all the categorical columns are label encoded, so that each category is represented by an integer.
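Label encoding could look like the sketch below, which uses scikit-learn's LabelEncoder on every non-numeric column. The column names and values are illustrative, and keeping one encoder per column (so codes can be mapped back to names later) is a design choice assumed here, not something the slides state.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame with two categorical columns and one numeric column.
df = pd.DataFrame({
    "color": ["Color", "Black and White", "Color"],
    "genres": ["Action", "Drama", "Action"],
    "budget": [1.0e6, 2.5e5, 3.0e6],
})

encoders = {}
for col in df.select_dtypes(exclude="number").columns:
    enc = LabelEncoder()
    # Classes are sorted alphabetically and mapped to 0, 1, 2, ...
    df[col] = enc.fit_transform(df[col])
    encoders[col] = enc

print(df)
```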
5. Correlation
We check whether the columns are related to one another. Multicollinearity (strong correlation between independent variables) can distort the predictions, so we remove the columns that cause it. We also remove the column ‘imdb_score’, since the column ‘classify’ is derived from it and keeping it would leak the target.
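The slides do not say how collinear columns were identified. A common approach, sketched here under that caveat, is to drop one column from every pair whose absolute correlation exceeds a threshold. The 0.9 threshold, the toy column names, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
gross = rng.normal(size=200)
df = pd.DataFrame({
    "gross": gross,
    # Nearly a linear copy of "gross", so the pair is collinear.
    "gross_copy": gross * 2 + rng.normal(scale=0.01, size=200),
    "duration": rng.normal(size=200),
    "imdb_score": rng.uniform(1, 10, size=200),
})

# Remove the raw score, since the target class is derived from it.
features = df.drop(columns=["imdb_score"])

# Keep only the upper triangle so each pair is inspected once.
corr = features.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

threshold = 0.9  # assumed cut-off
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
reduced = features.drop(columns=to_drop)
print(to_drop, list(reduced.columns))
```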

3. Classification Model Building
• Classification is a supervised machine learning method in which the model tries to predict the correct label for a given input. The model is trained on the training data and then evaluated on test data before being used to make predictions on new, unseen data.
• We split the data into X and y, where X contains the independent variables and y contains the target (dependent) variable.

1. Train Test Split
We need data both to train the model and to test it, so we split the dataset in a 70:30 (train:test) ratio using the train_test_split function from the scikit-learn library.
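A 70:30 split with train_test_split can be sketched as below. The synthetic X and y stand in for the real features and target; the random_state value and the stratify argument (which keeps class proportions similar in both parts) are assumptions, since the slides only give the ratio.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for the feature matrix X and the target y.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = rng.choice(["Flop", "Average", "Hit"], size=100)

# test_size=0.30 gives the 70:30 train:test ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y
)
print(X_train.shape, X_test.shape)
```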
2. Scaling
Some variables are in the range of millions and others in the tens, so we bring all of them onto the same scale.
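The slides do not name the scaler that was used. The sketch below assumes standardization with scikit-learn's StandardScaler, fitted on the training data only and then applied to the test data so that no test-set statistics influence training.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy features: one in the millions (e.g. a budget), one in the tens.
X_train = np.array([[1_000_000.0, 10.0],
                    [5_000_000.0, 30.0],
                    [9_000_000.0, 50.0]])
X_test = np.array([[3_000_000.0, 20.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std on train
X_test_scaled = scaler.transform(X_test)        # reuse the same mean/std

print(X_train_scaled)
print(X_test_scaled)
```

After scaling, both columns have zero mean and unit variance on the training set, so neither dominates purely because of its units.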

3. Feature Selection using RFECV
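RFECV (recursive feature elimination with cross-validation) repeatedly removes the least important feature and uses cross-validation to pick the number of features that scores best. The sketch below runs it on synthetic data; the wrapped estimator, the number of folds, and the scoring metric are assumptions, because the slides give no configuration details.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold

# Synthetic three-class data: 10 features, only 4 of them informative.
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=4, n_redundant=0,
    n_classes=3, random_state=0,
)

selector = RFECV(
    estimator=RandomForestClassifier(n_estimators=50, random_state=0),
    step=1,                         # remove one feature per iteration
    cv=StratifiedKFold(n_splits=3), # assumed number of folds
    scoring="accuracy",
    min_features_to_select=1,
)
selector.fit(X, y)

# Boolean mask of kept features and the reduced feature matrix.
X_selected = selector.transform(X)
print(selector.n_features_, selector.support_)
```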

4. Random Forest
Random Forest is a classifier that builds a number of decision trees on various subsets of the given dataset and combines their predictions (by majority vote or by averaging the trees' class probabilities) to improve predictive accuracy.
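Training such a model with scikit-learn might look like the following sketch. It uses synthetic data and assumed hyperparameters (100 trees, fixed random seed); the project's actual settings are not given in the slides.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic three-class data standing in for the movie features.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=0
)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# predict_proba averages the class probabilities of the individual trees;
# predict returns the class with the highest averaged probability.
proba = model.predict_proba(X_test)
y_pred = model.predict(X_test)
print(model.score(X_test, y_test))
```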
5. Confusion Matrix
A confusion matrix is a matrix that summarizes the performance of a
machine learning model on a set of test data. It is a means of displaying
the number of accurate and inaccurate instances based on the model’s
predictions. It is often used to measure the performance of classification
models, which aim to predict a categorical label for each input instance.
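As a small worked example, the snippet below computes a confusion matrix for six made-up predictions with scikit-learn's confusion_matrix. The labels are toy values, not the project's results. Rows correspond to the true class and columns to the predicted class, in the order given by labels.

```python
from sklearn.metrics import confusion_matrix

labels = ["Flop", "Average", "Hit"]
y_true = ["Hit", "Hit", "Average", "Flop", "Hit", "Average"]
y_pred = ["Hit", "Average", "Average", "Average", "Hit", "Hit"]

cm = confusion_matrix(y_true, y_pred, labels=labels)
print(cm)
# [[0 1 0]
#  [0 1 1]
#  [0 1 2]]
```

The diagonal holds the correct predictions (here 0 + 1 + 2 = 3 of 6), and every off-diagonal cell counts one kind of mistake, such as the single Flop movie predicted as Average.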

6. Classification Report
As seen in the classification report, the model has an accuracy of 80%.
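For reference, a report of this kind can be produced with scikit-learn's classification_report, which lists precision, recall and F1-score per class alongside the overall accuracy. The labels below are toy values chosen for illustration and do not reproduce the 80% figure.

```python
from sklearn.metrics import accuracy_score, classification_report

y_true = ["Hit", "Hit", "Average", "Flop", "Hit", "Average"]
y_pred = ["Hit", "Average", "Average", "Average", "Hit", "Hit"]

# zero_division=0 avoids warnings for classes that are never predicted.
print(classification_report(y_true, y_pred, zero_division=0))

# Accuracy is the share of predictions that match the true label.
acc = accuracy_score(y_true, y_pred)
print(acc)
```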

Questions?

Thank You!
