Udacity Data Analyst Nanodegree
P2: Investigate [TMDb Movie] dataset
Author: Mouhamadou GUEYE
Date: May 26, 2019
Table of contents
Introduction
Data Wrangling
Exploratory Data Analysis
Conclusions
Introduction
In this project we analyze a dataset containing information about roughly 10,000 movies collected from The Movie Database
(TMDb). In particular, we are interested in trends relating to the most popular movies by genre, and in how movie rating and
popularity vary with budget and revenue.
Background:
The [Movie Database TMDB](https://www.themoviedb.org/) is a community built movie and TV database.
Every piece of data has been added by our amazing community dating back to 2008. TMDb's strong
international focus and breadth of data is largely unmatched and something we're incredibly proud of. Put
simply, we live and breathe community and that's precisely what makes us different.
The TMDb Advantage:
1. Every year since 2008, the number of contributions to our database has increased. With over 125,000
developers and companies using our platform, TMDb has become a premiere source for metadata.
2. Along with extensive metadata for movies, TV shows and people, we also offer one of the best selections
of high resolution posters and fan art. On average, over 1,000 images are added every single day.
3. We're international. While we officially support 39 languages we also have extensive regional data.
Every single day TMDb is used in over 180 countries.
4. Our community is second to none. Between our staff and community moderators, we're always here to
help. We're passionate about making sure your experience on TMDb is nothing short of amazing.
5. Trusted platform. Every single day our service is used by millions of people while we process over
2 billion requests. We've proven for years that this is a service that can be trusted and relied on.
This organization profile is not owned or maintained by TMDb: datasets hosted under this organization
profile use the TMDb API but are not endorsed or certified by TMDb[1].
Research questions for investigation
1. Which movie genres are the most popular?
2. Which movie genres are the most popular from year to year?
3. Are movies with higher revenue more popular?
4. Are movies with higher budgets more popular?
5. Do movies with higher revenue receive better ratings?
6. Do movies with higher budgets receive better ratings?
Dataset
This dataset contains information about 10,000 movies collected from The Movie Database (TMDb), including user ratings and
revenue. The dataset used in this project is a cleaned version of the original dataset on Kaggle, where its full description
can be found.
In [1]: # packages import
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('darkgrid')
%matplotlib inline
Data Wrangling
General Properties
In this step we inspect the dataset in order to understand its properties and structure:
The data type of each column
The number of samples in the dataset
The number of columns in the dataset
Duplicate rows, if any
Features with missing values
The number of non-null unique values for each feature
What those unique values are, and the count of each
In [2]: movies = pd.read_csv('tmdb-movies.csv')
In [3]: # Print the first five rows of the dataframe
movies.head()
Columns Data Types
In [4]: movies.dtypes
Number of samples/columns
In [5]: # number of rows for the movie dataset
movies.shape[0]
In [6]: # Number of columns for the movie dataset
movies.shape[1]
Duplicate Rows
In [7]: # Duplicate rows in the movies dataset
sum(movies.duplicated())
Deletion of duplicates
In [8]: # Select the duplicated rows in the movies dataset, then drop them
movies[movies.duplicated()]
movies.drop_duplicates(inplace=True)
Missing Values
We notice that there are missing values in the following columns (according to the non-null counts reported by movies.info()):
imdb_id
cast
homepage
director
tagline
keywords
overview
genres
production_companies
In [9]: # summary information about the dataset
movies.info()
In [10]: # Inspecting rows with missing values
movies[movies.isnull().any(axis='columns')].head()
Number of Distinct Observations
In [11]: movies.nunique()
Descriptive Statistics Summary
In [12]: movies.describe()
Data Cleaning
In this step we clean the dataset: we remove columns that are irrelevant to our analysis, convert the release_date column
from a string to a datetime object, replace the many zero values in the budget and revenue columns with the column means,
and split the columns that hold multiple pipe-separated (|) values into separate rows.
In [13]: # movies dataset columns
print(movies.columns)
Dropping Extraneous Columns
These columns will be dropped since they are not relevant to our analysis.
In [14]: # columns to drop from the movies dataset; these columns are irrelevant for our analysis
columns = ['homepage', 'tagline', 'overview', 'keywords']
In [15]: movies.drop(labels=columns, axis=1, inplace=True)
In [16]: movies.info()
Convert release_date to a datetime Object
The release_date column is stored as a string; we use the pandas to_datetime function to convert it from string to datetime
dtype.
In [17]: movies['release_date'] = pd.to_datetime(movies['release_date'],format='%m/%d/%y')
In [18]: # check the date format after cleaning
movies['release_date'].dtype
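One caveat with the `%y` directive used above: two-digit years 00 to 68 are read as 2000 to 2068, so a release from the 1960s (release_year starts at 1960 in this dataset) would be parsed as a date in the 2060s. The following is a minimal sketch of one possible correction, assuming the four-digit release_year column can be trusted; the three sample rows are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical sample mimicking the release_date / release_year columns.
sample = pd.DataFrame({
    "release_date": ["6/9/15", "12/25/66", "3/1/99"],
    "release_year": [2015, 1966, 1999],
})

# '%y' maps 00-68 to 2000-2068, so '66' is read as 2066.
parsed = pd.to_datetime(sample["release_date"], format="%m/%d/%y")

# Rebuild each date from the four-digit release_year plus the parsed month and day.
fixed = pd.to_datetime(pd.DataFrame({
    "year": sample["release_year"],
    "month": parsed.dt.month,
    "day": parsed.dt.day,
}))

print(parsed.dt.year.tolist())
print(fixed.dt.year.tolist())
```

Since the analysis in this notebook groups by release_year rather than release_date, this affects only work that uses the full date.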
Dealing with Multiple Values Columns
In [19]: # splitting the genres column into rows, one row per genre
movies = (movies.drop('genres', axis=1)
          .join(movies['genres'].str.split('|', expand=True)
                .stack().reset_index(level=1, drop=True)
                .rename('genres'))
          .loc[:, movies.columns])
movies.head()
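The split, stack and join chain above can also be written with `DataFrame.explode`, available in pandas 0.25 and later. This is only a sketch of that alternative on a made-up two-row frame (titles "A" and "B" are placeholders), not the method used in this notebook.

```python
import pandas as pd

# Hypothetical frame with a pipe-separated genres column.
sample = pd.DataFrame({
    "original_title": ["A", "B"],
    "genres": ["Action|Adventure", "Adventure"],
})

# Turn each string into a list, then emit one row per list element.
exploded = (
    sample.assign(genres=sample["genres"].str.split("|"))
          .explode("genres")
)

print(exploded)
```

As with the original chain, the index is repeated for every row that comes from the same movie, which is why index 0 appears several times in the output of cell 19.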
In [20]: # splitting the production_companies column into rows, one row per company
movies = (movies.drop('production_companies', axis=1)
          .join(movies['production_companies'].str.split('|', expand=True)
                .stack().reset_index(level=1, drop=True)
                .rename('production_companies'))
          .loc[:, movies.columns])
movies.head()
Fill zero values in the revenue and budget columns
Here we inspect the revenue, revenue_adj, budget and budget_adj columns, counting the number of rows with 0 values
before replacing those values with the mean.
In [21]: # number of rows with zero revenue
movies[movies['revenue'] == 0].count()['revenue']
In [22]: # number of rows with zero adjusted revenue
movies[movies['revenue_adj'] == 0].count()['revenue_adj']
In [23]: # number of rows with zero budget
movies[movies['budget'] == 0].count()['budget']
In [24]: # number of rows with zero adjusted budget
movies[movies['budget_adj'] == 0].count()['budget_adj']
In [25]: # replace the zero values in the budget and revenue columns with the column mean
cols = ['budget', 'budget_adj', 'revenue', 'revenue_adj']
for item in cols:
    print(item, movies[item].mean())
    movies[item] = movies[item].replace({0: movies[item].mean()})
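Note that the mean used in the loop above is computed while the zero placeholders are still in the column, so the fill value is pulled toward zero. A possible alternative, shown here only as a sketch on made-up numbers, is to mark zeros as missing first and take the mean of the reported values only.

```python
import numpy as np
import pandas as pd

# Hypothetical budget column where 0 stands for "not reported".
sample = pd.DataFrame({"budget": [0, 10, 0, 30]})

# Mean with the zero placeholders included: (0 + 10 + 0 + 30) / 4
mean_with_zeros = sample["budget"].mean()

# Treat 0 as missing, so the mean reflects reported budgets only: (10 + 30) / 2
observed = sample["budget"].replace(0, np.nan)
mean_observed = observed.mean()

# Fill the gaps with the mean of the observed values.
filled = observed.fillna(mean_observed)

print(mean_with_zeros, mean_observed, filled.tolist())
```

Leaving the values as NaN instead of filling them is another option, since pandas aggregations such as `mean` and `median` skip NaN by default.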
In [26]: # Check whether the columns have been successfully filled
movies[movies['revenue'].notnull()].count()
In [27]: # should return False: no zero value left in revenue
(movies['revenue'] == 0).any()
In [28]: # should return False: no zero value left in revenue_adj
(movies['revenue_adj'] == 0).any()
In [29]: # should return False: no zero value left in budget
(movies['budget'] == 0).any()
In [30]: # should return False: no zero value left in budget_adj
(movies['budget_adj'] == 0).any()
Check Number of samples/columns
In [31]: movies.shape
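After both splits, each movie appears once for every combination of genre and production company, which is why the row count grows far beyond the number of movies. A per-genre mean taken on this frame therefore weights a movie by how many companies it lists. The sketch below, on a made-up frame, shows one way to let every movie count once per genre; it is an assumption about how the aggregation could be done, not what the notebook does.

```python
import pandas as pd

# Hypothetical exploded frame: movie 1 has 2 genres x 2 companies, movie 2 has 1 x 1.
exploded = pd.DataFrame({
    "id": [1, 1, 1, 1, 2],
    "genres": ["Action", "Action", "Drama", "Drama", "Action"],
    "production_companies": ["X", "Y", "X", "Y", "Z"],
    "popularity": [4.0, 4.0, 4.0, 4.0, 1.0],
})

# Mean on the exploded frame: movie 1 is counted twice in 'Action'.
inflated = exploded.groupby("genres")["popularity"].mean()

# Keep one row per movie and genre, so every movie counts once.
per_movie = exploded.drop_duplicates(subset=["id", "genres"])
balanced = per_movie.groupby("genres")["popularity"].mean()

print(inflated["Action"], balanced["Action"])
```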
Visual Trends
In [32]: movies.hist(figsize=(15,10));
Exploratory data analysis
1. Which genres are the most popular?
In [33]: # unique movie genres in the dataframe
genres = movies['genres'].unique()
print(genres)
In [34]: # grouping movies by genre, selecting the popularity column and computing the mean
movies_by_genres = movies.groupby('genres')['popularity'].mean()
# plotting the bar chart of movies by genre
movies_by_genres.plot(kind='bar', alpha=.7, figsize=(15,6))
plt.xlabel("Genres", fontsize=18);
plt.ylabel("Popularity", fontsize=18);
plt.xticks(fontsize=10)
plt.title('Average movie popularity by genre', fontsize=18);
plt.grid(True)
2. Which genres are the most popular from year to year?
In [35]: # plot data
fig, ax = plt.subplots(figsize=(15,7))
# counting the movies per release year and genre, one line per genre
grouped = (movies.groupby(['release_year', 'genres']).count()['popularity']
           .unstack().plot(ax=ax, figsize=(15,6)))
plt.xlabel("release year", fontsize=18);
plt.ylabel("count", fontsize=18);
plt.xticks(fontsize=10)
plt.title('movie count by genre, year by year', fontsize=18);
3. Which movie genres receive the highest average rating?
In [36]: # grouping the movies by genres and projecting on the rating column
rating = movies.groupby('genres')['vote_average'].mean()
rating
In [37]: # bar chart of the movies mean rating by genre
rating.plot(kind='bar', alpha=0.7)
plt.xlabel('Movie Genre', fontsize=12)
plt.ylabel('Vote Average', fontsize=12)
plt.title('Average Movie Quality by Genre', fontsize=12)
plt.grid(True)
4. Do movies with high revenue receive the highest rating?
In [38]: plt.scatter(movies['revenue_adj'], movies['vote_average'], linewidth=5)
plt.title('Vote Ratings by Revenue Level', fontsize=15)
plt.xlabel('Revenue Level', fontsize=15)
plt.ylabel('Average Vote Rating', fontsize=15);
plt.show()
In [39]: # mean rating for each revenue level
median_rev = movies['revenue_adj'].median()
low = movies.query('revenue_adj < {}'.format(median_rev))
high = movies.query('revenue_adj >= {}'.format(median_rev))
# filtering to vote_average columns and calculation of the mean
mean_low = low['vote_average'].mean()
mean_high = high['vote_average'].mean()
In [40]: heights = [mean_low, mean_high]
print(heights)
labels = ['low', 'high']
locations = [1,2]
plt.bar(locations, heights, tick_label=labels)
plt.title('Average Vote Ratings by Revenue Level', fontsize=15)
plt.xlabel('Revenue Level', fontsize=12)
plt.ylabel('Average Vote Rating', fontsize=15);
In [41]: # counting the movie revenue unique values
movies.revenue.value_counts().head()
In [42]: # 10 first values
movies.groupby('revenue_adj')['vote_average'].value_counts().head(10)
In [43]: # 10 last values
movies.groupby('revenue_adj')['vote_average'].value_counts().tail(10)
In [44]: # comparison of the median rating of movies with low and high revenue
(movies.query('revenue_adj < {}'.format(median_rev))['vote_average'].median(),
movies.query('revenue_adj > {}'.format(median_rev))['vote_average'].median())
Partial conclusion
It is difficult to say that movies with high revenue have a better rating: in the bar chart, the two bars have approximately
the same height (a mean rating of about 5.97 for both groups). For a closer comparison, the median vote average of movies
with low and high revenue is calculated. The median vote average is 6.0 for movies with low revenue and 6.3 for movies
with high revenue.
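A median split reduces each movie to "low" or "high" and discards the rest of the information in the revenue column. A complementary check is a rank correlation between revenue and rating. The sketch below computes a Spearman coefficient as the Pearson correlation of the ranks, which needs only pandas; the five sample rows are made up for illustration.

```python
import pandas as pd

# Hypothetical revenue / rating pairs.
sample = pd.DataFrame({
    "revenue_adj": [1e6, 5e6, 2e7, 1e8, 9e8],
    "vote_average": [5.0, 6.5, 6.0, 7.0, 6.8],
})

# Spearman's rho is the Pearson correlation computed on the ranks.
ranks = sample.rank()
rho = ranks["revenue_adj"].corr(ranks["vote_average"])

print(round(rho, 2))
```

A value near 0 would support the reading that revenue and rating are unrelated, while a value near 1 or -1 would indicate a monotonic relationship that the two-bar chart cannot show.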
5. Do movies with high budget get the highest rating?
In [45]: # scatter plot of the budget versus vote rating
plt.scatter(movies['budget_adj'], movies['vote_average'], linewidth=5)
plt.title('Vote Ratings by Budget Level', fontsize=15)
plt.xlabel('Budget Level', fontsize=15)
plt.ylabel('Vote Rating', fontsize=15);
plt.show()
In [46]: # mean rating for each budget level
median_bud = movies['budget_adj'].median()
low = movies.query('budget_adj < {}'.format(median_bud))
high = movies.query('budget_adj >= {}'.format(median_bud))
# filtering to vote_average columns and calculation of the mean
mean_low = low['vote_average'].mean()
mean_high = high['vote_average'].mean()
print([mean_low, mean_high])
In [47]: heights = [mean_low, mean_high]
print(heights)
labels = ['low', 'high']
locations = [1,2]
plt.bar(locations, heights, tick_label=labels)
plt.title('Average Vote Ratings by Budget Level', fontsize=15)
plt.xlabel('Budget Level', fontsize=12)
plt.ylabel('Average Vote Rating', fontsize=15);
In [48]: # counting the movie budget unique values
movies.budget.value_counts().head()
In [49]: # 10 first values
movies.groupby('budget_adj')['vote_average'].value_counts().head(10)
In [50]: # 10 last values
movies.groupby('budget_adj')['vote_average'].value_counts().tail(10)
In [51]: # comparison of the median rating of movies with low and high budget
(movies.query('budget_adj < {}'.format(median_bud))['vote_average'].median(),
movies.query('budget_adj > {}'.format(median_bud))['vote_average'].median())
Partial conclusion
It is difficult to say that movies with a high budget have a better rating: in the bar chart, the two bars have approximately
the same height (mean ratings of about 5.98 and 5.96). For a closer comparison, the median vote average of movies with low
and high budget is calculated. The median vote average is 6.0 for movies with a low budget and 6.2 for movies with a high
budget.
6. Are movies with the highest revenue more popular?
In [52]: plt.scatter(movies['revenue_adj'], movies['popularity'], linewidth=5)
plt.title('Popularity by Revenue Level', fontsize=15)
plt.xlabel('Revenue Level', fontsize=15)
plt.ylabel('Average Popularity', fontsize=15);
plt.show()
In [53]: # mean popularity for each revenue level
median_rev = movies['revenue_adj'].median()
low = movies.query('revenue_adj < {}'.format(median_rev))
high = movies.query('revenue_adj >= {}'.format(median_rev))
# filtering to popularity columns and calculation of the mean
mean_low = low['popularity'].mean()
mean_high = high['popularity'].mean()
In [54]: # mean popularity of the low- and high-revenue groups for the bar chart
heights = [mean_low, mean_high]
print(heights)
labels = ['low', 'high']
locations = [1,2]
plt.bar(locations, heights, tick_label=labels)
plt.title('Average Popularity by Revenue Level', fontsize=15)
plt.xlabel('Revenue Level', fontsize=12)
plt.ylabel('Average Popularity', fontsize=15);
In [55]: # counting the movie revenue unique values
movies.revenue_adj.value_counts().head()
In [56]: # 10 first values
movies.groupby('revenue_adj')['popularity'].value_counts().head(10)
In [57]: # 10 last values
movies.groupby('revenue_adj')['popularity'].value_counts().tail(10)
In [58]: # comparison of the median popularity of movies with low and high revenue
(movies.query('revenue_adj < {}'.format(median_rev))['popularity'].median(),
movies.query('revenue_adj > {}'.format(median_rev))['popularity'].median())
Partial conclusion
We can see that movies with high revenue seem to be more popular than those with low revenue: the average popularity is
about 0.742 for low-revenue movies and 0.999 for high-revenue movies. Moreover, comparing the median popularity of the two
groups (about 0.59 versus 1.49) shows the same pattern: movies with high revenue are more popular.
7. Are movies with the highest budget more popular?
In [59]: # scatter plot of the movies budget versus popularity
plt.scatter(movies['budget_adj'], movies['popularity'], linewidth=5)
plt.title('Popularity by Budget Level', fontsize=15)
plt.xlabel('Budget Level', fontsize=15)
plt.ylabel('Average Popularity', fontsize=15);
plt.show()
In [60]: # mean popularity for each budget level (median_rev is reused here to hold the budget median)
median_rev = movies['budget_adj'].median()
low = movies.query('budget_adj < {}'.format(median_rev))
high = movies.query('budget_adj >= {}'.format(median_rev))
# filtering to popularity columns and calculation of the mean
mean_low = low['popularity'].mean()
mean_high = high['popularity'].mean()
In [61]: heights = [mean_low, mean_high]
print(heights)
labels = ['low', 'high']
locations = [1,2]
plt.bar(locations, heights, tick_label=labels)
plt.title('Average Popularity by Budget Level', fontsize=15)
plt.xlabel('Budget Level', fontsize=12)
plt.ylabel('Average Popularity', fontsize=15);
In [62]: # counting the movie budget unique values
movies.budget_adj.value_counts().head()
In [63]: # 10 first values
movies.groupby('budget_adj')['popularity'].value_counts().head(10)
In [64]: # 10 last values
movies.groupby('budget_adj')['popularity'].value_counts().tail(10)
In [65]: # comparison of the median popularity of movies with low and high budget
(movies.query('budget_adj < {}'.format(median_rev))['popularity'].median(),
movies.query('budget_adj > {}'.format(median_rev))['popularity'].median())
Partial conclusion
We can see that movies with a high budget seem to be more popular than those with a low budget: the average popularity is
about 0.756 for low-budget movies and 0.979 for high-budget movies. Moreover, comparing the median popularity of the two
groups (about 0.53 versus 1.14) shows the same pattern: movies with a high budget seem more popular.
Conclusions
In this project, we started our analysis by examining the most popular movies by genre. We found that adventure movies are
the most popular genre. We then examined movie popularity year by year. Since there is no correlation between release_year
and movie popularity, the count of movies released each year is used for this analysis. Based on the relation between
genres and vote average, we found that the Documentary genre receives the highest rating. Moreover, we analyzed the dataset
to answer different questions relating movie popularity and rating to revenue and budget. While movies with high revenue
and budget seem to be more popular, we could not find a correlation between movie budget or revenue and rating.
Limitations
For a better analysis, more detail would be useful about the variables popularity and vote_average: how they are calculated,
and which factors or criteria enter their calculation. The columns of interest in this analysis (budget, revenue, budget_adj
and revenue_adj) contain many zero values, treated here as missing, which have been filled using the mean. This is probably
not the best way to fix those columns, since the mean is not always the best measure of center. Another limitation of this
analysis is the use of the median to categorize movies into low and high revenue and budget. Since some movies have very
large budgets and revenues, and since many missing values were filled with the mean, the median may not be the best
threshold for categorizing the movies.
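One way to soften the two-group split discussed above is to bin the movies into quantile-based levels with `pandas.qcut` and compare a robust statistic per level. This is only a sketch on made-up values; the four labels and the choice of quartiles are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical budgets and popularity scores.
sample = pd.DataFrame({
    "budget_adj": [1e6, 2e6, 5e6, 1e7, 3e7, 6e7, 1e8, 2e8],
    "popularity": [0.2, 0.3, 0.4, 0.5, 0.8, 1.0, 1.5, 3.0],
})

# Four equally populated budget levels instead of a single median cut.
sample["budget_level"] = pd.qcut(
    sample["budget_adj"], q=4,
    labels=["low", "medium", "high", "very high"],
)

# Median popularity within each level.
summary = sample.groupby("budget_level", observed=True)["popularity"].median()

print(summary)
```

On the real columns, where many rows share the same fill value, `qcut` can fail because several quantile edges coincide; working on the rows with a reported budget, or passing `duplicates='drop'`, are two possible ways around that.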
References:
[1]: https://www.themoviedb.org/about
Out[3]:
   id      imdb_id    popularity  budget     revenue     original_title                cast                                               homepage
0  135397  tt0369610  32.985763   150000000  1513528810  Jurassic World                Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi...  http://www.jurassicworld...
1  76341   tt1392190  28.419936   150000000  378436354   Mad Max: Fury Road            Tom Hardy|Charlize Theron|Hugh Keays-Byrne|Nic...  http://www.madmaxmovie...
2  262500  tt2908446  13.112507   110000000  295238201   Insurgent                     Shailene Woodley|Theo James|Kate Winslet|Ansel...  http://www.thedivergentseries.movie/#insu...
3  140607  tt2488496  11.173104   200000000  2068178225  Star Wars: The Force Awakens  Harrison Ford|Mark Hamill|Carrie Fisher|Adam D...  http://www.starwars.com/films/star-w...
4  168259  tt2820852  9.335014    190000000  1506249360  Furious 7                     Vin Diesel|Paul Walker|Jason Statham|Michelle ...  http://www.furious7...
5 rows × 21 columns
Out[4]: id int64
imdb_id object
popularity float64
budget int64
revenue int64
original_title object
cast object
homepage object
director object
tagline object
keywords object
overview object
runtime int64
genres object
production_companies object
release_date object
vote_count int64
vote_average float64
release_year int64
budget_adj float64
revenue_adj float64
dtype: object
Out[5]: 10866
Out[6]: 21
Out[7]: 1
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10865 entries, 0 to 10865
Data columns (total 21 columns):
id 10865 non-null int64
imdb_id 10855 non-null object
popularity 10865 non-null float64
budget 10865 non-null int64
revenue 10865 non-null int64
original_title 10865 non-null object
cast 10789 non-null object
homepage 2936 non-null object
director 10821 non-null object
tagline 8041 non-null object
keywords 9372 non-null object
overview 10861 non-null object
runtime 10865 non-null int64
genres 10842 non-null object
production_companies 9835 non-null object
release_date 10865 non-null object
vote_count 10865 non-null int64
vote_average 10865 non-null float64
release_year 10865 non-null int64
budget_adj 10865 non-null float64
revenue_adj 10865 non-null float64
dtypes: float64(4), int64(6), object(11)
memory usage: 1.8+ MB
Out[10]:
    id      imdb_id    popularity  budget    revenue    original_title   cast                                              homepage  director          tagline
18  150689  tt1661199  5.556818    95000000  542351353  Cinderella       Lily James|Cate Blanchett|Richard Madden|Helen...  NaN       Kenneth Branagh   Midnight is just the beginning.
21  307081  tt1798684  5.337064    30000000  91709827   Southpaw         Jake Gyllenhaal|Rachel McAdams|Forest Whitaker...  NaN       Antoine Fuqua     Believe in Hope.
26  214756  tt2637276  4.564549    68000000  215863606  Ted 2            Mark Wahlberg|Seth MacFarlane|Amanda Seyfried|...  NaN       Seth MacFarlane   Ted is Coming, Again.
32  254470  tt2848292  3.877764    29000000  287506194  Pitch Perfect 2  Anna Kendrick|Rebel Wilson|Hailee Steinfeld|Br...  NaN       Elizabeth Banks   We're back pitches
33  296098  tt3682448  3.648210    40000000  162610473  Bridge of Spies  Tom Hanks|Mark Rylance|Amy Ryan|Alan Alda|Seba...  NaN       Steven Spielberg  In the shadow of war, one man showed the world...
5 rows × 21 columns
Out[11]: id 10865
imdb_id 10855
popularity 10814
budget 557
revenue 4702
original_title 10571
cast 10719
homepage 2896
director 5067
tagline 7997
keywords 8804
overview 10847
runtime 247
genres 2039
production_companies 7445
release_date 5909
vote_count 1289
vote_average 72
release_year 56
budget_adj 2614
revenue_adj 4840
dtype: int64
Out[12]:
id popularity budget revenue runtime vote_count vote_average release_year
count 10865.000000 10865.000000 1.086500e+04 1.086500e+04 10865.000000 10865.000000 10865.000000 10865.000000
mean 66066.374413 0.646446 1.462429e+07 3.982690e+07 102.071790 217.399632 5.975012 2001.321859
std 92134.091971 1.000231 3.091428e+07 1.170083e+08 31.382701 575.644627 0.935138 12.813260
min 5.000000 0.000065 0.000000e+00 0.000000e+00 0.000000 10.000000 1.500000 1960.000000
25% 10596.000000 0.207575 0.000000e+00 0.000000e+00 90.000000 17.000000 5.400000 1995.000000
50% 20662.000000 0.383831 0.000000e+00 0.000000e+00 99.000000 38.000000 6.000000 2006.000000
75% 75612.000000 0.713857 1.500000e+07 2.400000e+07 111.000000 146.000000 6.600000 2011.000000
max 417859.000000 32.985763 4.250000e+08 2.781506e+09 900.000000 9767.000000 9.200000 2015.000000
Index(['id', 'imdb_id', 'popularity', 'budget', 'revenue', 'original_title',
'cast', 'homepage', 'director', 'tagline', 'keywords', 'overview',
'runtime', 'genres', 'production_companies', 'release_date',
'vote_count', 'vote_average', 'release_year', 'budget_adj',
'revenue_adj'],
dtype='object')
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10865 entries, 0 to 10865
Data columns (total 17 columns):
id 10865 non-null int64
imdb_id 10855 non-null object
popularity 10865 non-null float64
budget 10865 non-null int64
revenue 10865 non-null int64
original_title 10865 non-null object
cast 10789 non-null object
director 10821 non-null object
runtime 10865 non-null int64
genres 10842 non-null object
production_companies 9835 non-null object
release_date 10865 non-null object
vote_count 10865 non-null int64
vote_average 10865 non-null float64
release_year 10865 non-null int64
budget_adj 10865 non-null float64
revenue_adj 10865 non-null float64
dtypes: float64(4), int64(6), object(7)
memory usage: 1.5+ MB
Out[18]: dtype('<M8[ns]')
Out[19]:
   id      imdb_id    popularity  budget     revenue     original_title      cast                                               director         runtime  genres           production_companies
0  135397  tt0369610  32.985763   150000000  1513528810  Jurassic World      Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi...  Colin Trevorrow  124      Action           Universa... Entertain...
0  135397  tt0369610  32.985763   150000000  1513528810  Jurassic World      Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi...  Colin Trevorrow  124      Adventure        Universa... Entertain...
0  135397  tt0369610  32.985763   150000000  1513528810  Jurassic World      Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi...  Colin Trevorrow  124      Science Fiction  Universa... Entertain...
0  135397  tt0369610  32.985763   150000000  1513528810  Jurassic World      Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi...  Colin Trevorrow  124      Thriller         Universa... Entertain...
1  76341   tt1392190  28.419936   150000000  378436354   Mad Max: Fury Road  Tom Hardy|Charlize Theron|Hugh Keays-Byrne|Nic...  George Miller    120      Action           Vi... Pictures...
Out[20]:
   id      imdb_id    popularity  budget     revenue     original_title  cast                                               director         runtime  genres  production_companies
0  135397  tt0369610  32.985763   150000000  1513528810  Jurassic World  Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi...  Colin Trevorrow  124      Action  Univers...
0  135397  tt0369610  32.985763   150000000  1513528810  Jurassic World  Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi...  Colin Trevorrow  124      Action  Amblin Ent...
0  135397  tt0369610  32.985763   150000000  1513528810  Jurassic World  Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi...  Colin Trevorrow  124      Action  Legenda...
0  135397  tt0369610  32.985763   150000000  1513528810  Jurassic World  Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi...  Colin Trevorrow  124      Action  Fuji Televisio...
0  135397  tt0369610  32.985763   150000000  1513528810  Jurassic World  Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi...  Colin Trevorrow  124      Action  ...
Out[21]: 76851
Out[22]: 76851
Out[23]: 69433
Out[24]: 69433
budget 26775356.47371121
budget_adj 30889712.59859798
revenue 70334023.35863845
revenue_adj 84260108.11800905
Out[26]: id 181294
imdb_id 181249
popularity 181294
budget 181294
revenue 181294
original_title 181294
cast 181023
director 181104
runtime 181294
genres 181270
production_companies 179082
release_date 181294
vote_count 181294
vote_average 181294
release_year 181294
budget_adj 181294
revenue_adj 181294
dtype: int64
Out[27]: False
Out[28]: False
Out[29]: False
Out[30]: False
Out[31]: (181294, 17)
['Action' 'Adventure' 'Science Fiction' 'Thriller' 'Fantasy' 'Crime'
'Western' 'Drama' 'Family' 'Animation' 'Comedy' 'Mystery' 'Romance' 'War'
'History' 'Music' 'Horror' 'Documentary' 'TV Movie' nan 'Foreign']
Out[36]: genres
Action 5.859801
Adventure 5.962865
Animation 6.333965
Comedy 5.917464
Crime 6.112665
Documentary 6.957312
Drama 6.156389
Family 5.973175
Fantasy 5.895793
Foreign 5.892970
History 6.417070
Horror 5.444786
Music 6.302175
Mystery 5.986585
Romance 6.059295
Science Fiction 5.738771
TV Movie 5.651250
Thriller 5.848404
War 6.336557
Western 6.101556
Name: vote_average, dtype: float64
[5.971800334804548, 5.967507674675677]
Out[41]: 7.033402e+07 76851
2.000000e+06 152
2.000000e+07 126
1.200000e+07 126
5.318650e+08 125
Name: revenue, dtype: int64
Out[42]: revenue_adj vote_average
2.370705 6.4 12
2.861934 6.8 12
3.038360 7.7 5
5.926763 6.8 4
6.951084 4.9 8
8.585801 4.5 4
9.056820 6.7 20
9.115080 5.1 2
10.000000 4.2 9
10.296367 6.5 18
Name: vote_average, dtype: int64
Out[43]: revenue_adj vote_average
1.443191e+09 7.3 9
1.574815e+09 6.6 16
1.583050e+09 5.6 25
1.791694e+09 7.2 32
1.902723e+09 7.5 48
1.907006e+09 7.3 18
2.167325e+09 7.2 18
2.506406e+09 7.3 27
2.789712e+09 7.9 18
2.827124e+09 7.1 64
Name: vote_average, dtype: int64
Out[44]: (6.0, 6.3)
[5.981346153846566, 5.963905517657313]
[5.981346153846566, 5.963905517657313]
Out[48]: 2.677536e+07 69433
2.000000e+07 4331
2.500000e+07 4255
3.000000e+07 3902
4.000000e+07 3584
Name: budget, dtype: int64
Out[49]: budget_adj vote_average
0.921091 4.1 3
0.969398 5.3 20
1.012787 6.5 48
1.309053 4.8 8
2.908194 6.5 12
3.000000 7.3 12
4.519285 5.6 9
4.605455 6.0 1
5.006696 5.8 27
8.102293 6.9 45
Name: vote_average, dtype: int64
Out[50]: budget_adj vote_average
2.504192e+08 5.8 16
2.541001e+08 7.3 18
2.575999e+08 7.4 27
2.600000e+08 7.3 8
2.713305e+08 5.8 27
2.716921e+08 7.3 27
2.920507e+08 5.3 64
3.155006e+08 6.8 27
3.683713e+08 6.3 27
4.250000e+08 6.4 25
Name: vote_average, dtype: int64
Out[51]: (6.0, 6.2)
[0.7420684714824547, 0.9989869505300212]
Out[55]: 8.426011e+07 76851
2.358000e+07 125
4.978434e+08 125
2.231273e+07 125
1.934053e+07 125
Name: revenue_adj, dtype: int64
Out[56]: revenue_adj popularity
2.370705 0.462609 12
2.861934 0.552091 12
3.038360 0.352054 5
5.926763 0.208637 4
6.951084 0.578849 8
8.585801 0.183034 4
9.056820 0.450208 20
9.115080 0.113082 2
10.000000 0.559371 9
10.296367 0.222776 18
Name: popularity, dtype: int64
Out[57]: revenue_adj popularity
1.443191e+09 7.637767 9
1.574815e+09 2.631987 16
1.583050e+09 1.136610 25
1.791694e+09 2.900556 32
1.902723e+09 11.173104 48
1.907006e+09 2.563191 18
2.167325e+09 2.010733 18
2.506406e+09 4.355219 27
2.789712e+09 12.037933 18
2.827124e+09 9.432768 64
Name: popularity, dtype: int64
Out[58]: (0.58808, 1.4886709999999999)
[0.7564478409230605, 0.9790179787846799]
Out[62]: 3.088971e+07 69433
2.032801e+07 532
2.103337e+07 421
4.065602e+07 385
2.908194e+07 381
Name: budget_adj, dtype: int64
Out[63]: budget_adj popularity
0.921091 0.177102 3
0.969398 0.520430 20
1.012787 0.472691 48
1.309053 0.090186 8
2.908194 0.228643 12
3.000000 0.028456 12
4.519285 0.464188 9
4.605455 0.002922 1
5.006696 0.317091 27
8.102293 0.626646 45
Name: popularity, dtype: int64
Out[64]: budget_adj popularity
2.504192e+08 1.232098 16
2.541001e+08 5.076472 18
2.575999e+08 5.944927 27
2.600000e+08 2.865684 8
2.713305e+08 2.520912 27
2.716921e+08 4.355219 27
2.920507e+08 1.957331 64
3.155006e+08 4.965391 27
3.683713e+08 4.955130 27
4.250000e+08 0.250540 25
Name: popularity, dtype: int64
Out[65]: (0.534192, 1.138395)

More Related Content

PDF
6 Year Transformation Map Product Roadmap Categories Template
PDF
Transformation Process Flows Powerpoint Show
PDF
как пользоваться Base camp (для новых клиентов)
PPTX
Best Practices for a CoE
PDF
Project Communication Plan PowerPoint Presentation Slides
PDF
Service Delivery Model Flow
PDF
Introduction to ETL and Data Integration
KEY
The Art of Scalability - Managing growth
6 Year Transformation Map Product Roadmap Categories Template
Transformation Process Flows Powerpoint Show
как пользоваться Base camp (для новых клиентов)
Best Practices for a CoE
Project Communication Plan PowerPoint Presentation Slides
Service Delivery Model Flow
Introduction to ETL and Data Integration
The Art of Scalability - Managing growth

What's hot (20)

PDF
Current State Vs Future State Info Graphics
PDF
Digital Operating Model & IT4IT
PPTX
Create Value with ITIL 4
PPTX
What is ETL testing & how to enforce it in Data Wharehouse
PDF
ドメイン駆動設計入門
PPTX
Site Reliability Engineer (SRE), We Keep The Lights On 24/7
PDF
【Unite 2017 Tokyo】Unityで楽しむノンフォトリアルな絵づくり講座:トゥーンシェーダー・マニアクス
PDF
Business Intelligence Maturity Model
PPTX
プログラムで映像をつくるとは?? ~超入門編~
PPTX
Program, Project and Change Management Toolkit and Playbook
PDF
Weekly Project Status Report With Project Number
PDF
PPTX
DevOps-as-a-Service: Towards Automating the Automation
PDF
Visual Dataprepで建築データを美味しく下ごしらえ UNREAL FEST EXTREME 2021 SUMMER
PDF
リーン・チェンジマネジメント - チーム・組織に変化を起こす!オリジナルのチェンジ・フレームワークを構築する方法
ODP
Unity ネイティブプラグインの作成について
PPTX
Qlik Sense ストーリーテリングベストプラクティス
PDF
Evolution of The Twitter Stack
PDF
チケット駆動開発の解説~タスク管理からプロセス改善へ
PDF
Business Process Management Tools Process Management Tools Process Performanc...
Current State Vs Future State Info Graphics
Digital Operating Model & IT4IT
Create Value with ITIL 4
What is ETL testing & how to enforce it in Data Wharehouse
ドメイン駆動設計入門
Site Reliability Engineer (SRE), We Keep The Lights On 24/7
【Unite 2017 Tokyo】Unityで楽しむノンフォトリアルな絵づくり講座:トゥーンシェーダー・マニアクス
Business Intelligence Maturity Model
プログラムで映像をつくるとは?? ~超入門編~
Program, Project and Change Management Toolkit and Playbook
Weekly Project Status Report With Project Number
DevOps-as-a-Service: Towards Automating the Automation
Visual Dataprepで建築データを美味しく下ごしらえ UNREAL FEST EXTREME 2021 SUMMER
リーン・チェンジマネジメント - チーム・組織に変化を起こす!オリジナルのチェンジ・フレームワークを構築する方法
Unity ネイティブプラグインの作成について
Qlik Sense ストーリーテリングベストプラクティス
Evolution of The Twitter Stack
チケット駆動開発の解説~タスク管理からプロセス改善へ
Business Process Management Tools Process Management Tools Process Performanc...
Ad

TMDb movie dataset by kaggle

Udacity Data Analyst Nanodegree
P2: Investigate [TMDb Movie] dataset
Author: Mouhamadou GUEYE
Date: May 26, 2019

Table of contents
  • Introduction
  • Data Wrangling
  • Exploratory Data Analysis
  • Conclusions

Introduction
In this project we analyze a dataset containing information about 10,000 movies collected from the movie database TMDb. In particular, we are interested in finding trends relating the most popular movies to their genre, and relating movie rating and popularity to budget and revenue.

Background:
The Movie Database TMDb (https://www.themoviedb.org/) is a community built movie and TV database. Every piece of data has been added by our amazing community dating back to 2008. TMDb's strong international focus and breadth of data is largely unmatched and something we're incredibly proud of. Put simply, we live and breathe community and that's precisely what makes us different.

The TMDb Advantage:
1. Every year since 2008, the number of contributions to our database has increased. With over 125,000 developers and companies using our platform, TMDb has become a premiere source for metadata.
2. Along with extensive metadata for movies, TV shows and people, we also offer one of the best selections of high resolution posters and fan art. On average, over 1,000 images are added every single day.
3. We're international. While we officially support 39 languages we also have extensive regional data. Every single day TMDb is used in over 180 countries.
4. Our community is second to none. Between our staff and community moderators, we're always here to help. We're passionate about making sure your experience on TMDb is nothing short of amazing.
5. Trusted platform. Every single day our service is used by millions of people while we process over 2 billion requests. We've proven for years that this is a service that can be trusted and relied on.
This organization profile is not owned or maintained by TMDb: datasets hosted under this organization profile use the TMDb API but are not endorsed or certified by TMDb [1].

Research Questions
1. Which movie genres are the most popular?
2. Which movie genres are the most popular from year to year?
3. Do movies with the highest revenue have more popularity?
4. Do movies with the highest budget have more popularity?
5. Do movies with the highest revenue receive a better rating?
6. Do movies with the highest budget receive a better rating?

Dataset
This data set contains information about 10,000 movies collected from The Movie Database (TMDb), including user ratings and revenue. The dataset used in this project is a cleaned version of the original dataset on Kaggle, where its full description can be found.

In [1]:
    # package imports
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns
    sns.set_style('darkgrid')
    %matplotlib inline

Data Wrangling

General Properties
In this step we inspect the dataset in order to understand its properties and structure:
  • The data type of each column
  • The number of samples in the dataset
  • The number of columns in the dataset
  • Duplicate rows, if any
  • Features with missing values
  • The number of non-null unique values for each feature
  • What those unique values are, and the count of each

In [2]:
    movies = pd.read_csv('tmdb-movies.csv')

In [3]:
    # print the first five rows of the dataframe
    movies.head()

Columns Data Types
In [4]:
    movies.dtypes

Number of samples/columns
In [5]:
    # number of rows in the movies dataset
    movies.shape[0]

In [6]:
    # number of columns in the movies dataset
    movies.shape[1]

Duplicate Rows
In [7]:
    # number of duplicate rows in the movies dataset
    sum(movies.duplicated())

Deletion of duplicates
In [8]:
    # show the duplicate rows in the movies dataset, then drop them
    movies[movies.duplicated()]
    movies.drop_duplicates(inplace=True)

Missing Values
According to the output of movies.info(), there are missing values in the following columns:
  • imdb_id
  • cast
  • homepage
  • director
  • tagline
  • keywords
  • overview
  • genres
  • production_companies

In [9]:
    # information about the dataset
    movies.info()

In [10]:
    # inspect rows with missing values
    movies[movies.isnull().any(axis='columns')].head()

Number of Distinct Observations
In [11]:
    movies.nunique()

Descriptive Statistics Summary
In [12]:
    movies.describe()

Data Cleaning
In this step we clean the dataset: we remove the columns that are irrelevant to our analysis, convert the release_date column from a string to a datetime object, replace the zero values in the budget and revenue columns (which contain a large number of zeros) with the column means, and split the columns holding multiple values separated by a pipe (|) into separate rows.

In [13]:
    # movies dataset columns
    print(movies.columns)

Dropping Extraneous Columns
These columns are dropped since they are not relevant to our data analysis.

In [14]:
    # columns to drop from the movies dataset; they are irrelevant to our analysis
    columns = ['homepage', 'tagline', 'overview', 'keywords']

In [15]:
    movies.drop(labels=columns, axis=1, inplace=True)

In [16]:
    movies.info()

Convert release_date to a datetime Object
The release_date column is stored as strings. We use pandas' to_datetime function to convert the column from string to a datetime dtype.
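One caveat with a two-digit year format such as '%m/%d/%y': two-digit years from 00 to 68 are read as 2000 to 2068, so a release date from the 1960s (release_year starts at 1960 in this dataset) would come out a century too late. The sketch below illustrates the effect on two made-up rows and one possible repair. It assumes the release_year column is reliable; the frame `sample` and the names `parsed` and `fixed` are illustrative and not part of the notebook.

```python
import pandas as pd

# Illustrative rows only; the real data would come from tmdb-movies.csv.
sample = pd.DataFrame({
    "release_date": ["6/9/15", "11/15/66"],
    "release_year": [2015, 1966],
})

# Two-digit years 00-68 are read as 2000-2068, so "66" becomes 2066.
parsed = pd.to_datetime(sample["release_date"], format="%m/%d/%y")

# Assumption: release_year is correct, so rebuild each date from it,
# keeping the parsed month and day.
fixed = pd.to_datetime(pd.DataFrame({
    "year": sample["release_year"],
    "month": parsed.dt.month,
    "day": parsed.dt.day,
}))

print(parsed.dt.year.tolist())
print(fixed.dt.year.tolist())
```

Whether this matters depends on how release_date is used later; an analysis that relies only on release_year is not affected.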
In [17]:
    movies['release_date'] = pd.to_datetime(movies['release_date'], format='%m/%d/%y')

In [18]:
    # check the dtype of the column after the conversion
    movies['release_date'].dtype

Dealing with Multiple Values Columns
In [19]:
    # split the pipe-separated genres column into one row per genre
    movies = (movies.drop('genres', axis=1)
                    .join(movies['genres'].str.split('|', expand=True)
                                          .stack()
                                          .reset_index(level=1, drop=True)
                                          .rename('genres'))
                    .loc[:, movies.columns])
    movies.head()

In [20]:
    # split the production_companies column into one row per company
    movies = (movies.drop('production_companies', axis=1)
                    .join(movies['production_companies'].str.split('|', expand=True)
                                                        .stack()
                                                        .reset_index(level=1, drop=True)
                                                        .rename('production_companies'))
                    .loc[:, movies.columns])
    movies.head()

Fill zero values in the revenue and budget columns
Here we inspect the columns revenue, revenue_adj, budget and budget_adj, counting the number of rows with a value of 0 before replacing those values with the column mean.

In [21]:
    # number of rows with zero revenue
    movies[movies['revenue'] == 0].count()['revenue']

In [22]:
    # number of rows with zero adjusted revenue
    movies[movies['revenue_adj'] == 0].count()['revenue_adj']

In [23]:
    # number of rows with zero budget
    movies[movies['budget'] == 0].count()['budget']

In [24]:
    # number of rows with zero adjusted budget
    movies[movies['budget_adj'] == 0].count()['budget_adj']

In [25]:
    # replace the zeros in the revenue and budget columns with the column mean
    cols = ['budget', 'budget_adj', 'revenue', 'revenue_adj']
    for item in cols:
        print(item, movies[item].mean())
        movies[item] = movies[item].replace({0: movies[item].mean()})

In [26]:
    # check whether the columns have been successfully filled
    movies[movies['revenue'].notnull()].count()

In [27]:
    # should return False: no zero left in revenue
    (movies['revenue'] == 0).any()

In [28]:
    # should return False: no zero left in revenue_adj
    (movies['revenue_adj'] == 0).any()

In [29]:
    # should return False: no zero left in budget
    (movies['budget'] == 0).any()

In [30]:
    # should return False: no zero left in budget_adj
    (movies['budget_adj'] == 0).any()

Check Number of samples/columns
In [31]:
    movies.shape

Visual Trends
In [32]:
    movies.hist(figsize=(15,10));

Exploratory Data Analysis

1. Which genres are most popular?
In [33]:
    # unique genres in the dataframe
    genres = movies['genres'].unique()
    print(genres)

In [34]:
    # group movies by genre and compute the mean of the popularity column
    movies_by_genres = movies.groupby('genres')['popularity'].mean()
    # plot the bar chart of average popularity by genre
    movies_by_genres.plot(kind='bar', alpha=.7, figsize=(15,6))
    plt.xlabel("Genres", fontsize=18);
    plt.ylabel("Popularity", fontsize=18);
    plt.xticks(fontsize=10)
    plt.title('Average movie popularity by genre', fontsize=18);
    plt.grid(True)

2. Which genres are most popular from year to year?
In [35]:
    # plot data
    fig, ax = plt.subplots(figsize=(15,7))
    # count the movies per release year and genre, and draw one line per genre
    (movies.groupby(['release_year', 'genres'])
           .count()['popularity']
           .unstack()
           .plot(ax=ax))
    plt.xlabel("release year", fontsize=18);
    plt.ylabel("count", fontsize=18);
    plt.xticks(fontsize=10)
    plt.title('movie popularity year by year', fontsize=18);

3. Which movie genres receive the highest average rating?
In [36]:
    # group the movies by genre and compute the mean of the rating column
    rating = movies.groupby('genres')['vote_average'].mean()
    rating

In [37]:
    # bar chart of the mean movie rating by genre
    rating.plot(kind='bar', alpha=0.7)
    plt.xlabel('Movie Genre', fontsize=12)
    plt.ylabel('Vote Average', fontsize=12)
    plt.title('Average Movie Quality by Genre', fontsize=12)
    plt.grid(True)

4. Do movies with high revenue receive the highest rating?
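One way to put a single number on this question is a rank correlation between revenue and rating. The sketch below is a hedged illustration on made-up data, not part of the notebook: it keeps one row per movie (splitting multi-valued columns repeats each movie several times), treats a revenue of 0 as "not reported" instead of filling it, and computes a Spearman correlation as the Pearson correlation of the ranks. The frame `toy` and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Illustrative data: movie 1 appears twice, as it would after a genre split.
toy = pd.DataFrame({
    "id": [1, 1, 2, 3, 4, 5],
    "revenue_adj": [500.0, 500.0, 0.0, 120.0, 40.0, 900.0],
    "vote_average": [7.1, 7.1, 6.5, 6.0, 6.4, 7.8],
})

# Keep one row per movie and treat a revenue of 0 as missing.
per_movie = toy.drop_duplicates(subset="id").copy()
per_movie["revenue_adj"] = per_movie["revenue_adj"].replace(0, np.nan)
reported = per_movie.dropna(subset=["revenue_adj"])

# Spearman rank correlation, computed as the Pearson correlation of the ranks.
rho = reported["revenue_adj"].rank().corr(reported["vote_average"].rank())
print(len(reported), round(rho, 3))
```

A value near 0 would support the reading that revenue and rating are unrelated, while a value near 1 or -1 would point to a monotonic relationship. The same pattern could be applied to budget_adj or popularity.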
In [38]:
    # scatter plot of the revenue versus vote rating
    plt.scatter(movies['revenue_adj'], movies['vote_average'], linewidth=5)
    plt.title('Vote Ratings by Revenue Level', fontsize=15)
    plt.xlabel('Revenue Level', fontsize=15)
    plt.ylabel('Average Vote Rating', fontsize=15);
    plt.show()

In [39]:
    # mean rating for each revenue level
    median_rev = movies['revenue_adj'].median()
    low = movies.query('revenue_adj < {}'.format(median_rev))
    high = movies.query('revenue_adj >= {}'.format(median_rev))
    # select the vote_average column and compute the mean
    mean_low = low['vote_average'].mean()
    mean_high = high['vote_average'].mean()

In [40]:
    heights = [mean_low, mean_high]
    print(heights)
    labels = ['low', 'high']
    locations = [1,2]
    plt.bar(locations, heights, tick_label=labels)
    plt.title('Average Vote Ratings by Revenue Level', fontsize=15)
    plt.xlabel('Revenue Level', fontsize=12)
    plt.ylabel('Average Vote Rating', fontsize=15);

In [41]:
    # most frequent values of the revenue column
    movies.revenue.value_counts().head()

In [42]:
    # first 10 values
    movies.groupby('revenue_adj')['vote_average'].value_counts().head(10)

In [43]:
    # last 10 values
    movies.groupby('revenue_adj')['vote_average'].value_counts().tail(10)

In [44]:
    # comparison of the median rating of movies with low and high revenue
    (movies.query('revenue_adj < {}'.format(median_rev))['vote_average'].median(),
     movies.query('revenue_adj > {}'.format(median_rev))['vote_average'].median())

Partial conclusion
It is difficult to say that movies with high revenue have a better rating: in the bar chart, the two bars have approximately the same height. For a deeper comparison, the median vote average for low and high revenue is calculated. The median vote average is 6.0 for movies with low revenue and 6.3 for movies with high revenue.

5. Do movies with high budget get the highest rating?
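A two-way split at the median is only one way to form budget levels. As a hedged alternative, the sketch below bins a made-up budget column into quartiles with pd.qcut and summarizes the rating per bin, which shows whether the rating changes steadily across the range and not only across one threshold. The frame `toy`, the level labels and the column name `budget_level` are illustrative assumptions, not part of the notebook.

```python
import pandas as pd

# Illustrative data only.
toy = pd.DataFrame({
    "budget_adj": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
    "vote_average": [5.0, 6.0, 5.5, 6.5, 6.0, 7.0, 6.5, 7.5],
})

# Four equally populated bins based on the quartiles of budget_adj.
toy["budget_level"] = pd.qcut(
    toy["budget_adj"], q=4, labels=["low", "medium", "high", "very high"]
)

# Count, mean and median rating in each budget level.
by_level = (toy.groupby("budget_level", observed=True)["vote_average"]
               .agg(["count", "mean", "median"]))
print(by_level)
```

On real data, qcut can fail when many rows share the same value (for example a large block of identical filled-in budgets), because the bin edges are then not unique; passing duplicates='drop' or excluding those rows first are two possible ways around that.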
In [45]:
    # scatter plot of the budget versus vote rating
    plt.scatter(movies['budget_adj'], movies['vote_average'], linewidth=5)
    plt.title('Vote Ratings by Budget Level', fontsize=15)
    plt.xlabel('Budget Level', fontsize=15)
    plt.ylabel('Vote Rating', fontsize=15);
    plt.show()

In [46]:
    # mean rating for each budget level
    median_bud = movies['budget_adj'].median()
    low = movies.query('budget_adj < {}'.format(median_bud))
    high = movies.query('budget_adj >= {}'.format(median_bud))
    # select the vote_average column and compute the mean
    mean_low = low['vote_average'].mean()
    mean_high = high['vote_average'].mean()
    print([mean_low, mean_high])

In [47]:
    heights = [mean_low, mean_high]
    print(heights)
    labels = ['low', 'high']
    locations = [1,2]
    plt.bar(locations, heights, tick_label=labels)
    plt.title('Average Vote Ratings by Budget Level', fontsize=15)
    plt.xlabel('Budget Level', fontsize=12)
    plt.ylabel('Average Vote Rating', fontsize=15);

In [48]:
    # most frequent values of the budget column
    movies.budget.value_counts().head()

In [49]:
    # first 10 values
    movies.groupby('budget_adj')['vote_average'].value_counts().head(10)

In [50]:
    # last 10 values
    movies.groupby('budget_adj')['vote_average'].value_counts().tail(10)

In [51]:
    # comparison of the median rating of movies with low and high budget
    # note: the threshold used here is median_rev (the revenue median from In [39]);
    # median_bud from In [46] is probably what was intended
    (movies.query('budget_adj < {}'.format(median_rev))['vote_average'].median(),
     movies.query('budget_adj > {}'.format(median_rev))['vote_average'].median())

Partial conclusion
It is difficult to say that movies with a high budget have a better rating: in the bar chart, the two bars have approximately the same height. For a deeper comparison, the median vote average for low and high budget is calculated. The median vote average is 6.0 for movies with a low budget and 6.2 for movies with a high budget.

6. Do movies with highest revenue have more popularity?
In [52]:
    # scatter plot of the revenue versus popularity
    plt.scatter(movies['revenue_adj'], movies['popularity'], linewidth=5)
    plt.title('Popularity by Revenue Level', fontsize=15)
    plt.xlabel('Revenue Level', fontsize=15)
    plt.ylabel('Average Popularity', fontsize=15);
    plt.show()

In [53]:
    # mean popularity for each revenue level
    median_rev = movies['revenue_adj'].median()
    low = movies.query('revenue_adj < {}'.format(median_rev))
    high = movies.query('revenue_adj >= {}'.format(median_rev))
    # select the popularity column and compute the mean
    mean_low = low['popularity'].mean()
    mean_high = high['popularity'].mean()

In [54]:
    # mean popularity for low and high revenue, shown as a bar chart
    heights = [mean_low, mean_high]
    print(heights)
    labels = ['low', 'high']
    locations = [1,2]
    plt.bar(locations, heights, tick_label=labels)
    plt.title('Average Popularity by Revenue Level', fontsize=15)
    plt.xlabel('Revenue Level', fontsize=12)
    plt.ylabel('Average Popularity', fontsize=15);

In [55]:
    # most frequent values of the revenue_adj column
    movies.revenue_adj.value_counts().head()

In [56]:
    # first 10 values
    movies.groupby('revenue_adj')['popularity'].value_counts().head(10)

In [57]:
    # last 10 values
    movies.groupby('revenue_adj')['popularity'].value_counts().tail(10)

In [58]:
    # comparison of the median popularity of movies with low and high revenue
    (movies.query('revenue_adj < {}'.format(median_rev))['popularity'].median(),
     movies.query('revenue_adj > {}'.format(median_rev))['popularity'].median())

Partial conclusion
Movies with high revenue seem to be more popular than those with low revenue: the average popularity is 0.9989869505300212 for high revenue versus 0.7420684714824547 for low revenue. Moreover, comparing the median popularity of the two groups (1.488671 versus 0.58808) also shows that movies with high revenue are more popular.

7. Do movies with highest budget have more popularity?
In [59]:
    # scatter plot of the movies budget versus popularity
    plt.scatter(movies['budget_adj'], movies['popularity'], linewidth=5)
    plt.title('Popularity by Budget Level', fontsize=15)
    plt.xlabel('Budget Level', fontsize=15)
    plt.ylabel('Average Popularity', fontsize=15);
    plt.show()

In [60]:
    # mean popularity for each budget level
    # note: median_rev is reused here to hold the budget median
    median_rev = movies['budget_adj'].median()
    low = movies.query('budget_adj < {}'.format(median_rev))
    high = movies.query('budget_adj >= {}'.format(median_rev))
    # select the popularity column and compute the mean
    mean_low = low['popularity'].mean()
    mean_high = high['popularity'].mean()

In [61]:
    heights = [mean_low, mean_high]
    print(heights)
    labels = ['low', 'high']
    locations = [1,2]
    plt.bar(locations, heights, tick_label=labels)
    plt.title('Average Popularity by Budget Level', fontsize=15)
    plt.xlabel('Budget Level', fontsize=12)
    plt.ylabel('Average Popularity', fontsize=15);

In [62]:
    # most frequent values of the budget_adj column
    movies.budget_adj.value_counts().head()

In [63]:
    # first 10 values
    movies.groupby('budget_adj')['popularity'].value_counts().head(10)

In [64]:
    # last 10 values
    movies.groupby('budget_adj')['popularity'].value_counts().tail(10)

In [65]:
    # comparison of the median popularity of movies with low and high budget
    (movies.query('budget_adj < {}'.format(median_rev))['popularity'].median(),
     movies.query('budget_adj > {}'.format(median_rev))['popularity'].median())

Partial conclusion
Movies with a high budget seem to be more popular than those with a low budget: the average popularity is 0.9790179787846799 for high budget versus 0.7564478409230605 for low budget. Moreover, comparing the median popularity of the two groups (1.138395 versus 0.534192) also suggests that movies with a high budget are more popular.

Conclusions
In this project, we started our analysis by examining the most popular movies by genre. We found that Adventure is the most popular movie genre.
We then examined movie popularity year by year. Since there is no correlation between release_year and movie popularity, the count of the movies released each year is used for this analysis. Based on the relation between genres and vote average, we found that the Documentary genre receives the highest rating. Moreover, we analyzed the dataset to answer different questions relating movie popularity and rating to revenue and budget. While movies with high revenue and budget seem to be more popular, we could not find a correlation between movie budget or revenue and rating.

Limitations
For a better analysis, more details would be useful about the variables popularity and vote_average: how they are calculated, and which factors or criteria are used in their calculation.
The columns we are most interested in for this analysis (budget, revenue, budget_adj and revenue_adj) contain many zero values, treated here as missing, which have been replaced with the mean. This is probably not the best way to fix those columns, since the mean is not always the best measure of center.
Another limitation of this analysis is the way movies are categorized into low and high revenue and budget using the median. Since some movies have very large budgets and revenues, and since many missing values were filled with the mean, the median may not be the best threshold for categorizing the movies.

References:
[1] https://www.themoviedb.org/about

Cell outputs

Out[3]: (homepage cells are cut off in the source)
       id    imdb_id  popularity     budget     revenue                original_title                                               cast  homepage
0  135397  tt0369610   32.985763  150000000  1513528810                Jurassic World  Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi...  http://www.jurassicworld
1   76341  tt1392190   28.419936  150000000   378436354            Mad Max: Fury Road  Tom Hardy|Charlize Theron|Hugh Keays-Byrne|Nic...  http://www.madmaxmovie
2  262500  tt2908446   13.112507  110000000   295238201                     Insurgent  Shailene Woodley|Theo James|Kate Winslet|Ansel...  http://www.thedivergentseries.movie/#insu
3  140607  tt2488496   11.173104  200000000  2068178225  Star Wars: The Force Awakens  Harrison Ford|Mark Hamill|Carrie Fisher|Adam D...  http://www.starwars.com/films/star-w epi
4  168259  tt2820852    9.335014  190000000  1506249360                     Furious 7  Vin Diesel|Paul Walker|Jason Statham|Michelle ...  http://www.furious7
5 rows × 21 columns

Out[4]:
id                        int64
imdb_id                  object
popularity              float64
budget                    int64
revenue                   int64
original_title           object
cast                     object
homepage                 object
director                 object
tagline                  object
keywords                 object
overview                 object
runtime                   int64
genres                   object
production_companies     object
release_date             object
vote_count                int64
vote_average            float64
release_year              int64
budget_adj              float64
revenue_adj             float64
dtype: object

Out[5]: 10866
Out[6]: 21
Out[7]: 1

Output of In [9]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10865 entries, 0 to 10865
Data columns (total 21 columns):
id                      10865 non-null int64
imdb_id                 10855 non-null object
popularity              10865 non-null float64
budget                  10865 non-null int64
revenue                 10865 non-null int64
original_title          10865 non-null object
cast                    10789 non-null object
homepage                2936 non-null object
director                10821 non-null object
tagline                 8041 non-null object
keywords                9372 non-null object
overview                10861 non-null object
runtime                 10865 non-null int64
genres                  10842 non-null object
production_companies    9835 non-null object
release_date            10865 non-null object
vote_count              10865 non-null int64
vote_average            10865 non-null float64
release_year            10865 non-null int64
budget_adj              10865 non-null float64
revenue_adj             10865 non-null float64
dtypes: float64(4), int64(6), object(11)
memory usage: 1.8+ MB

Out[10]:
        id    imdb_id  popularity    budget    revenue   original_title                                               cast  homepage          director  tagline
18  150689  tt1661199    5.556818  95000000  542351353       Cinderella  Lily James|Cate Blanchett|Richard Madden|Helen...       NaN   Kenneth Branagh  Midnight is just the beginning.
21  307081  tt1798684    5.337064  30000000   91709827         Southpaw  Jake Gyllenhaal|Rachel McAdams|Forest Whitaker...       NaN     Antoine Fuqua  Believe in Hope.
26  214756  tt2637276    4.564549  68000000  215863606            Ted 2  Mark Wahlberg|Seth MacFarlane|Amanda Seyfried|...       NaN   Seth MacFarlane  Ted is Coming, Again.
32  254470  tt2848292    3.877764  29000000  287506194  Pitch Perfect 2  Anna Kendrick|Rebel Wilson|Hailee Steinfeld|Br...       NaN   Elizabeth Banks  We're back pitches
33  296098  tt3682448    3.648210  40000000  162610473  Bridge of Spies  Tom Hanks|Mark Rylance|Amy Ryan|Alan Alda|Seba...       NaN  Steven Spielberg  In the shadow of war, one man showed the world...
5 rows × 21 columns

Out[11]:
id                      10865
imdb_id                 10855
popularity              10814
budget                    557
revenue                  4702
original_title          10571
cast                    10719
homepage                 2896
director                 5067
tagline                  7997
keywords                 8804
overview                10847
runtime                   247
genres                   2039
production_companies     7445
release_date             5909
vote_count               1289
vote_average               72
release_year               56
budget_adj               2614
revenue_adj              4840
dtype: int64

Out[12]:
                  id    popularity        budget       revenue       runtime    vote_count  vote_average  release_year
count   10865.000000  10865.000000  1.086500e+04  1.086500e+04  10865.000000  10865.000000  10865.000000  10865.000000
mean    66066.374413      0.646446  1.462429e+07  3.982690e+07    102.071790    217.399632      5.975012   2001.321859
std     92134.091971      1.000231  3.091428e+07  1.170083e+08     31.382701    575.644627      0.935138     12.813260
min         5.000000      0.000065  0.000000e+00  0.000000e+00      0.000000     10.000000      1.500000   1960.000000
25%     10596.000000      0.207575  0.000000e+00  0.000000e+00     90.000000     17.000000      5.400000   1995.000000
50%     20662.000000      0.383831  0.000000e+00  0.000000e+00     99.000000     38.000000      6.000000   2006.000000
75%     75612.000000      0.713857  1.500000e+07  2.400000e+07    111.000000    146.000000      6.600000   2011.000000
max    417859.000000     32.985763  4.250000e+08  2.781506e+09    900.000000   9767.000000      9.200000   2015.000000

Output of In [13]:
Index(['id', 'imdb_id', 'popularity', 'budget', 'revenue', 'original_title',
       'cast', 'homepage', 'director', 'tagline', 'keywords', 'overview',
       'runtime', 'genres', 'production_companies', 'release_date',
       'vote_count', 'vote_average', 'release_year', 'budget_adj',
       'revenue_adj'],
      dtype='object')

Output of In [16]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10865 entries, 0 to 10865
Data columns (total 17 columns):
id                      10865 non-null int64
imdb_id                 10855 non-null object
popularity              10865 non-null float64
budget                  10865 non-null int64
revenue                 10865 non-null int64
original_title          10865 non-null object
cast                    10789 non-null object
director                10821 non-null object
runtime                 10865 non-null int64
genres                  10842 non-null object
production_companies    9835 non-null object
release_date            10865 non-null object
vote_count              10865 non-null int64
vote_average            10865 non-null float64
release_year            10865 non-null int64
budget_adj              10865 non-null float64
revenue_adj             10865 non-null float64
dtypes: float64(4), int64(6), object(7)
memory usage: 1.5+ MB

Out[18]: dtype('<M8[ns]')

Out[19]: (production_companies cells are cut off in the source)
       id    imdb_id  popularity     budget     revenue      original_title                                               cast         director  runtime           genres  production_companies
0  135397  tt0369610   32.985763  150000000  1513528810      Jurassic World  Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi...  Colin Trevorrow      124           Action  Universa Entertain
0  135397  tt0369610   32.985763  150000000  1513528810      Jurassic World  Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi...  Colin Trevorrow      124        Adventure  Universa Entertain
0  135397  tt0369610   32.985763  150000000  1513528810      Jurassic World  Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi...  Colin Trevorrow      124  Science Fiction  Universa Entertain
0  135397  tt0369610   32.985763  150000000  1513528810      Jurassic World  Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi...  Colin Trevorrow      124         Thriller  Universa Entertain
1   76341  tt1392190   28.419936  150000000   378436354  Mad Max: Fury Road  Tom Hardy|Charlize Theron|Hugh Keays-Byrne|Nic...    George Miller      120           Action  Vi Pictures

Out[20]: (production_companies cells are cut off in the source)
       id    imdb_id  popularity     budget     revenue  original_title                                               cast         director  runtime  genres  production_companies
0  135397  tt0369610   32.985763  150000000  1513528810  Jurassic World  Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi...  Colin Trevorrow      124  Action  Univers
0  135397  tt0369610   32.985763  150000000  1513528810  Jurassic World  Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi...  Colin Trevorrow      124  Action  Amblin Ent
0  135397  tt0369610   32.985763  150000000  1513528810  Jurassic World  Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi...  Colin Trevorrow      124  Action  Legenda
0  135397  tt0369610   32.985763  150000000  1513528810  Jurassic World  Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi...  Colin Trevorrow      124  Action  Fuji Televisio
0  135397  tt0369610   32.985763  150000000  1513528810  Jurassic World  Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi...  Colin Trevorrow      124  Action

Out[21]: 76851
Out[22]: 76851
Out[23]: 69433
Out[24]: 69433

Output of In [25]:
budget 26775356.47371121
budget_adj 30889712.59859798
revenue 70334023.35863845
revenue_adj 84260108.11800905

Out[26]:
id                      181294
imdb_id                 181249
popularity              181294
budget                  181294
revenue                 181294
original_title          181294
cast                    181023
director                181104
runtime                 181294
genres                  181270
production_companies    179082
release_date            181294
vote_count              181294
vote_average            181294
release_year            181294
budget_adj              181294
revenue_adj             181294
dtype: int64

Out[27]: False
Out[28]: False
Out[29]: False
Out[30]: False
Out[31]: (181294, 17)

Output of In [33]:
['Action' 'Adventure' 'Science Fiction' 'Thriller' 'Fantasy' 'Crime'
 'Western' 'Drama' 'Family' 'Animation' 'Comedy' 'Mystery' 'Romance' 'War'
 'History' 'Music' 'Horror' 'Documentary' 'TV Movie' nan 'Foreign']

Out[36]:
genres
Action             5.859801
Adventure          5.962865
Animation          6.333965
Comedy             5.917464
Crime              6.112665
Documentary        6.957312
Drama              6.156389
Family             5.973175
Fantasy            5.895793
Foreign            5.892970
History            6.417070
Horror             5.444786
Music              6.302175
Mystery            5.986585
Romance            6.059295
Science Fiction    5.738771
TV Movie           5.651250
Thriller           5.848404
War                6.336557
Western            6.101556
Name: vote_average, dtype: float64

Output of In [40]:
[5.971800334804548, 5.967507674675677]

Out[41]:
7.033402e+07    76851
2.000000e+06      152
2.000000e+07      126
1.200000e+07      126
5.318650e+08      125
Name: revenue, dtype: int64

Out[42]:
revenue_adj  vote_average
2.370705     6.4             12
2.861934     6.8             12
3.038360     7.7              5
5.926763     6.8              4
6.951084     4.9              8
8.585801     4.5              4
9.056820     6.7             20
9.115080     5.1              2
10.000000    4.2              9
10.296367    6.5             18
Name: vote_average, dtype: int64

Out[43]:
revenue_adj   vote_average
1.443191e+09  7.3              9
1.574815e+09  6.6             16
1.583050e+09  5.6             25
1.791694e+09  7.2             32
1.902723e+09  7.5             48
1.907006e+09  7.3             18
2.167325e+09  7.2             18
2.506406e+09  7.3             27
2.789712e+09  7.9             18
2.827124e+09  7.1             64
Name: vote_average, dtype: int64

Out[44]: (6.0, 6.3)

Output of In [46]:
[5.981346153846566, 5.963905517657313]

Output of In [47]:
[5.981346153846566, 5.963905517657313]

Out[48]:
2.677536e+07    69433
2.000000e+07     4331
2.500000e+07     4255
3.000000e+07     3902
4.000000e+07     3584
Name: budget, dtype: int64

Out[49]:
budget_adj  vote_average
0.921091    4.1              3
0.969398    5.3             20
1.012787    6.5             48
1.309053    4.8              8
2.908194    6.5             12
3.000000    7.3             12
4.519285    5.6              9
4.605455    6.0              1
5.006696    5.8             27
8.102293    6.9             45
Name: vote_average, dtype: int64

Out[50]:
budget_adj    vote_average
2.504192e+08  5.8             16
2.541001e+08  7.3             18
2.575999e+08  7.4             27
2.600000e+08  7.3              8
2.713305e+08  5.8             27
2.716921e+08  7.3             27
2.920507e+08  5.3             64
3.155006e+08  6.8             27
3.683713e+08  6.3             27
4.250000e+08  6.4             25
Name: vote_average, dtype: int64

Out[51]: (6.0, 6.2)

Output of In [54]:
[0.7420684714824547, 0.9989869505300212]

Out[55]:
8.426011e+07    76851
2.358000e+07      125
4.978434e+08      125
2.231273e+07      125
1.934053e+07      125
Name: revenue_adj, dtype: int64

Out[56]:
revenue_adj  popularity
2.370705     0.462609      12
2.861934     0.552091      12
3.038360     0.352054       5
5.926763     0.208637       4
6.951084     0.578849       8
8.585801     0.183034       4
9.056820     0.450208      20
9.115080     0.113082       2
10.000000    0.559371       9
10.296367    0.222776      18
Name: popularity, dtype: int64

Out[57]:
revenue_adj   popularity
1.443191e+09  7.637767       9
1.574815e+09  2.631987      16
1.583050e+09  1.136610      25
1.791694e+09  2.900556      32
1.902723e+09  11.173104     48
1.907006e+09  2.563191      18
2.167325e+09  2.010733      18
2.506406e+09  4.355219      27
2.789712e+09  12.037933     18
2.827124e+09  9.432768      64
Name: popularity, dtype: int64

Out[58]: (0.58808, 1.4886709999999999)

Output of In [61]:
[0.7564478409230605, 0.9790179787846799]

Out[62]:
3.088971e+07    69433
2.032801e+07      532
2.103337e+07      421
4.065602e+07      385
2.908194e+07      381
Name: budget_adj, dtype: int64

Out[63]:
budget_adj  popularity
0.921091    0.177102       3
0.969398    0.520430      20
1.012787    0.472691      48
1.309053    0.090186       8
2.908194    0.228643      12
3.000000    0.028456      12
4.519285    0.464188       9
4.605455    0.002922       1
5.006696    0.317091      27
8.102293    0.626646      45
Name: popularity, dtype: int64

Out[64]:
budget_adj    popularity
2.504192e+08  1.232098      16
2.541001e+08  5.076472      18
2.575999e+08  5.944927      27
2.600000e+08  2.865684       8
2.713305e+08  2.520912      27
2.716921e+08  4.355219      27
2.920507e+08  1.957331      64
3.155006e+08  4.965391      27
3.683713e+08  4.955130      27
4.250000e+08  0.250540      25
Name: popularity, dtype: int64

Out[65]: (0.534192, 1.138395)