Investigating TMDB Movie Titles

Introduction

Dataset Description

For this report, I will be analyzing a single-table dataset containing roughly 10,000 titles from The Movie Database (TMDB). The dataset contains many useful metrics for each title, such as popularity, average review score, budget, revenue, keywords, and genre.

The columns in the dataset are:

  • id: primary key, unique identifier

  • imdb_id: the id for each title on IMDb.

  • popularity: a composite metric built from various factors, such as number of votes, number of views, release date, and number of users adding the title to their "watchlist." More information can be found at: https://developers.themoviedb.org/3/getting-started/popularity

  • budget: the recorded budget of the film at time of release

  • revenue: the cumulative revenue made by the film by the dataset's release in 2015

  • original_title: the title of the film at release

  • cast: a list of prominent cast-members in the film

  • homepage: contains a link to the homepage of the movie's website

  • director: contains either the lead director, or a list of directors associated with the film.

  • tagline: the one or two sentence tagline accompanying the film's title for promotion

  • keywords: a list of SEO keywords for searching and indexing

  • overview: a brief description of the movie's plot

  • runtime: the length of the movie in minutes

  • genres: a list of genres that the film falls under

  • production_companies: a list of companies involved in the production of the film

  • release_date: the day the film was released

  • vote_count: the number of unique votes that have been submitted for the movie on TMDB

  • vote_average: the mean score calculated from all votes

  • release_year: the year the title was released

  • budget_adj: the film's budget, adjusted for inflation to 2015 dollars.

  • revenue_adj: the film's revenue, adjusted for inflation to 2015 dollars.

Questions for Analysis

There are two primary questions I would like to explore in my analysis.

  1. How have audience tastes in different genres changed over time?

  2. What factors are associated with high-revenue films?

*Please note that this report has been made without the use of inferential statistics or machine learning, so all findings presented are tentative.

Data Wrangling

Before we can do anything useful with our data, we will have to load it into a dataframe, inspect it, and clean it.

General Properties

We can immediately see a few redundancies in this data. We have two identifier columns (id and imdb_id), and both standard and inflation-adjusted columns for budget and revenue.

There are also certain columns that will not be useful for addressing our questions, such as homepage and overview.

We will remove these momentarily when we are cleaning the data, but first let us inspect the columns more directly.

Here, we can see that there are a large number of null values that may need to be trimmed.

There are 10,866 total entries in this dataset. The columns for homepage, tagline, keywords, cast, and production_companies are all missing moderate (cast) to vast (homepage) amounts of data. We will have to decide which columns are essential for analysis and have an acceptable amount of null values, and which will need to be dropped.
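As a minimal sketch of how missing values per column could be tallied, the snippet below counts nulls with pandas. The three-row `sample` table and its values are invented stand-ins for the real dataset, which is not loaded here.

```python
import pandas as pd

# Hypothetical three-row stand-in for the real table; all values are invented.
sample = pd.DataFrame({
    "original_title": ["Film A", "Film B", "Film C"],
    "homepage": [None, None, "http://example.com"],
    "cast": ["Actor One|Actor Two", None, "Actor Three"],
    "runtime": [95, 120, 101],
})

# Count missing values per column, then express them as a share of all rows.
null_counts = sample.isna().sum()
null_share = (null_counts / len(sample)).round(2)

print(pd.DataFrame({"nulls": null_counts, "share": null_share}))
```

Looking at the share rather than the raw count makes it easier to judge which columns are too sparse to keep.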

Data Cleaning

We will be dropping the tagline, homepage, keywords, and overview columns from the dataset.

We are also not currently interested in cast, director, or production companies.

budget, revenue, budget_adj, and revenue_adj are redundant metrics. Since budget_adj and revenue_adj account for inflation, we will drop budget and revenue.
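The column removal described above might look something like the following sketch. The `sample` table is invented, and the final `dropna()` call is an assumption about how the remaining null rows were handled, since the original cell is not shown.

```python
import pandas as pd

# Hypothetical stand-in for the real table; all values are invented.
sample = pd.DataFrame({
    "id": [1, 2, 3],
    "original_title": ["Film A", "Film B", "Film C"],
    "tagline": ["A tagline", None, None],
    "homepage": [None, None, "http://example.com"],
    "keywords": ["k1|k2", None, "k3"],
    "overview": ["Plot A", "Plot B", "Plot C"],
    "cast": ["Actor One", None, "Actor Two"],
    "director": ["Director One", "Director Two", None],
    "production_companies": [None, "Studio X", "Studio Y"],
    "genres": ["Drama", None, "Comedy|Romance"],
    "budget": [100, 200, 300],
    "revenue": [150, 250, 900],
    "budget_adj": [120.0, 210.0, 300.0],
    "revenue_adj": [180.0, 262.5, 900.0],
})

# Columns named in the text as unneeded or redundant.
not_needed = [
    "tagline", "homepage", "keywords", "overview",
    "cast", "director", "production_companies",
    "budget", "revenue",
]
trimmed = sample.drop(columns=not_needed)

# Assumption: rows still containing nulls are dropped after trimming.
cleaned = trimmed.dropna()
print(cleaned)
```

Dropping the sparse columns first means fewer rows are lost when the remaining nulls are removed.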

We can see here that the number of non-null entries in each column matches the total number of rows. This confirms that we have removed all null values.

We can also see that all numerical columns have an appropriate datatype, so we will not need to worry about them.

There are a lot of duplicate titles in this list. While it is possible that many of these actually are duplicates, they may also happen to have the same name, or may be reboots. I can't jump to conclusions and drop them here. I can at least review the titles, and then check for reboots by testing title and year together.

It looks like these duplicate values are mostly just reboots, so we will only remove rows that share both the same title and the same release year. We will also remove any rows that are exact duplicates of another row. Since only three of the values are true duplicate titles, this won't make much of an impact, but it's still better than leaving bad data in.
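A rough sketch of the duplicate check described above, using an invented six-row table: testing title alone flags reboots, while testing title and year together isolates the likely true duplicates.

```python
import pandas as pd

# Hypothetical rows: one reboot pair, one same-title-same-year pair,
# and one pair of fully identical rows. All values are invented.
sample = pd.DataFrame({
    "original_title": ["Title A", "Title A", "Title B", "Title B", "Title C", "Title C"],
    "release_year": [1990, 2000, 2005, 2005, 2012, 2012],
    "vote_average": [6.5, 7.0, 5.0, 5.1, 6.0, 6.0],
})

# Rows sharing a title with any other row (includes reboots).
same_title = sample[sample.duplicated(subset="original_title", keep=False)]

# Rows sharing both title and release year (likely true duplicates).
same_title_and_year = sample[
    sample.duplicated(subset=["original_title", "release_year"], keep=False)
]

# Number of rows that exactly repeat an earlier row.
full_row_dupes = sample.duplicated().sum()

# Remove exact duplicates first, then title-and-year duplicates.
deduped = (
    sample.drop_duplicates()
          .drop_duplicates(subset=["original_title", "release_year"], keep="first")
)
print(deduped)
```

Here the reboot pair ("Title A" in 1990 and 2000) survives, while only the first row of each title-and-year pair is kept.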

There are a couple more things we need to take care of after reviewing the data in its present state.

First, our minimum values for runtime, budget_adj, and revenue_adj are zero. While these titles clearly have incomplete data, they may still have useful review data. We will just set them to null values to prevent them from interfering with later data visualization.
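One way to turn placeholder zeros into nulls is sketched below. It uses `Series.mask` on an invented table; the original notebook may have used a different call.

```python
import pandas as pd

# Hypothetical rows where 0 stands in for "unknown"; all values are invented.
sample = pd.DataFrame({
    "runtime": [0, 95, 120],
    "budget_adj": [0.0, 1.5e7, 0.0],
    "revenue_adj": [2.0e6, 0.0, 4.0e7],
    "vote_average": [6.1, 5.4, 7.0],
})

zero_means_missing = ["runtime", "budget_adj", "revenue_adj"]

cleaned = sample.copy()
for col in zero_means_missing:
    # mask() replaces values where the condition is True with NaN.
    cleaned[col] = cleaned[col].mask(cleaned[col] == 0)

# Aggregations such as mean() skip NaN by default, so zeros no longer
# drag the averages down.
print(cleaned["runtime"].mean())
```

The review columns are left untouched, so these rows can still contribute to any analysis of votes.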

Exploratory Data Analysis

Now that our data has been cleaned, we will begin our exploratory analysis to see what information we can find to help us answer our questions.

Question 1: How have audience tastes in different genres changed over time?

Our dataset contains information on many different genres from the 1960s up through the mid-2010s. How have our preferences for different types of movies changed, or not changed, over those roughly 50 years?

While we could just compare the average vote for each year, this would only tell us a modern audience's opinion of movies from different eras. We will have to choose other metrics to get a glimpse of audience preferences at the time of release.

Two assumptions will be made here.

  1. Supply-and-demand will result in more movies being made in genres that have a higher demand.
  2. Genres with a higher revenue will have had a greater viewership.

These assumptions may or may not hold, but they are the best we have to work with for now.

The genres column is formatted strangely, split up by bar-lines, as you can see here:

We will need to be able to compare each genre individually, so here we will split the genres up into a numpy array containing each genre.
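To illustrate the bar-separated format and one way of separating it, here is a small sketch on invented rows. It swaps in pandas `str.split` plus `explode` (one row per title and genre) instead of the numpy array mentioned above, so it should be read as an alternative rather than the notebook's own code.

```python
import pandas as pd

# Hypothetical rows showing the bar-separated genres format; values are invented.
sample = pd.DataFrame({
    "original_title": ["Film A", "Film B", "Film C"],
    "release_year": [1985, 1985, 2010],
    "genres": ["Drama|Thriller", "Comedy", "Drama|Comedy|Romance"],
})

# Split each string on "|" into a list, then give every genre its own row.
by_genre = (
    sample.assign(genre=sample["genres"].str.split("|"))
          .explode("genre")
          .drop(columns="genres")
)

print(by_genre)
print(by_genre["genre"].value_counts())
```

A title tagged with several genres appears once per genre in `by_genre`, which is what allows each genre to be counted and averaged on its own.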

We will also now define some useful functions to help us with styling our visualizations.

There appears to be steady, roughly exponential growth in the number of titles released in every genre from 1960 to 2015. Drama has had the greatest number of titles released throughout most of this period, with a notable exception in the second half of the 1980s, when it was surpassed by comedy.

Comedies briefly surpassed dramas again in 1994-1995, and came close twice in the early 2000s.

At first glance, westerns appear to have consistently ranked among the genres with the fewest titles in every decade sampled.

Since almost all genres have seen some degree of growth since the 1960s, it will probably be more helpful to look at the number of titles released per genre as a percent of the total titles released per year. This should give a good indicator of changes in market share over time.
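The per-year share could be computed roughly as follows. The `by_genre` table is invented, and as a simplifying assumption the denominator is the number of genre tags per year (so shares sum to 100), not the number of distinct titles.

```python
import pandas as pd

# Hypothetical one-row-per-title-and-genre table; values are invented.
by_genre = pd.DataFrame({
    "release_year": [1985, 1985, 1985, 1985, 2010, 2010, 2010, 2010, 2010],
    "genre": ["Drama", "Comedy", "Comedy", "Western",
              "Drama", "Drama", "Drama", "Comedy", "Thriller"],
})

# Rows = years, columns = genres, cells = number of titles.
counts = by_genre.groupby(["release_year", "genre"]).size().unstack(fill_value=0)

# Divide each row by its yearly total to get a percentage share.
share = counts.div(counts.sum(axis=1), axis=0) * 100

print(share.round(1))
```

Normalizing by year removes the overall growth in releases, so what remains is each genre's relative position.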

Here we can see a clearer picture of the market dominance of drama titles. We can also more clearly see a period in the mid-70s where thrillers almost overtook dramas. These fluctuations appear to be deviations from the norm, however. For the most part, each genre seems to have maintained its relative position in the percent of total releases across the years.

Examining the number of titles released alone, however, may not be the best metric for determining audience interest. Drama movies may be more commonly produced for a variety of reasons, such as lower production costs. We can only see a correlation here, but we can make no claim as to the cause.

We will now explore the average budget of films in each genre to see if there is a correlation with the number of releases. If we can eliminate budget as a factor in the number of titles, then we can increase our confidence in the assumption that interest in genres has stayed relatively fixed over time.

Here we can see that drama movies have a pretty average budget overall. They're certainly not as cheap to produce as documentaries or TV Movies, but they are far cheaper than Adventure, Animation, and Fantasy. This alone does not appear to be a strong enough indicator to explain why so many drama titles are produced compared to other genres.

At this point, we cannot confidently explain the consistency with which each genre has held its place in the market. While this does increase confidence in the possibility that audience tastes have remained relatively fixed over time, we cannot yet confirm it.

It will, perhaps, be useful to zoom in on drama titles. Let us explore their mean revenue over time.

There appears to be a downward trend in the mean revenue for drama titles over time. This is an interesting result when considering their dominance in the market in terms of sheer number of titles released.

Are dramas actually making less money? Let's take a look at the total revenue per year to see if it gives us any more insight.

As expected, the total revenue shows us that the entire drama genre is making more money than it did in the 1960s. While it cannot be definitively stated, it is possible that the market is just flooded with a high quantity of drama titles, driving the average revenue down.
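The gap between mean and total revenue can be illustrated with a toy example. The numbers below are invented purely to show how a larger number of titles can raise the yearly total while lowering the yearly mean.

```python
import pandas as pd

# Hypothetical drama titles: few high earners in 1965, many mixed earners in 2005.
drama = pd.DataFrame({
    "release_year": [1965, 1965, 2005, 2005, 2005, 2005, 2005, 2005],
    "revenue_adj": [9.0e7, 7.0e7, 1.2e8, 6.0e7, 3.0e7, 2.0e7, 1.0e7, 0.6e7],
})

# Title count, mean revenue, and total revenue for each year.
summary = drama.groupby("release_year")["revenue_adj"].agg(["count", "mean", "sum"])

print(summary)
```

In this toy data the 2005 total is higher than the 1965 total even though the 2005 mean is lower, which matches the pattern described above.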

Question 2: What factors are associated with high-revenue films?

What factors influence the profit a film brings in? For instance:

How does the choice of genre affect profit?

Does the audience liking a title necessarily result in that title making more money? i.e. how strongly correlated are user reviews and revenue?

For the purposes of this exploration, we will define profit as revenue_adj - budget_adj.

First, let's take a look at some statistics on revenue and budget individually.

Here we can see that the vast majority of titles have under a \$100,000,000 budget.

At the same time, we can see that it is rare for a movie to make greater than \$200,000,000 revenue, and exceedingly rare to gross over \$500,000,000.

This is extremely interesting. It would appear that many movies are either bringing in very little profit, or are even losing money. Perhaps the profit made by highly successful movies is enough to offset the risk of a flop.
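A small sketch of how the share of money-losing titles could be measured, using the profit definition from above (revenue_adj - budget_adj). The five rows are invented.

```python
import pandas as pd

# Hypothetical titles: several small losses or gains and one large hit.
sample = pd.DataFrame({
    "budget_adj": [1.0e7, 2.0e7, 5.0e7, 1.0e8, 3.0e7],
    "revenue_adj": [5.0e6, 2.5e7, 4.0e7, 6.0e8, 3.1e7],
})

# Profit as defined in this report.
profit = sample["revenue_adj"] - sample["budget_adj"]

# The mean of a boolean Series is the fraction of True values.
share_losing = (profit < 0).mean()
total_profit = profit.sum()

print(f"share of titles losing money: {share_losing:.0%}")
print(f"combined profit: {total_profit:,.0f}")
```

In this toy data two of five titles lose money, yet the combined profit is strongly positive because of a single hit.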

Comparing finances by genre

Now that we've explored some basic financial information in the industry, let's zoom in on some average financial information for each genre.

Specifically, we will look at mean revenue, mean profit, and mean margin.

We will be using the standard formula for calculating gross margin:

$$m = {100 \cdot \frac{r - c}{r}}$$

Where m = margin, c = costs, and r = revenue
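As a hedged sketch, the margin formula can be applied per genre as shown below. The values are invented, and computing the margin from genre-level means (not per film and then averaged) is an assumption; the two approaches can give different numbers.

```python
import pandas as pd

# Hypothetical one-row-per-title-and-genre table; values are invented.
by_genre = pd.DataFrame({
    "genre": ["Adventure", "Adventure", "Documentary", "Documentary"],
    "budget_adj": [1.0e8, 2.0e8, 1.0e6, 3.0e6],
    "revenue_adj": [3.0e8, 4.0e8, 5.0e6, 5.0e6],
})

# Mean cost (c) and mean revenue (r) for each genre.
means = by_genre.groupby("genre")[["budget_adj", "revenue_adj"]].mean()

# profit = r - c, margin m = 100 * (r - c) / r
means["profit"] = means["revenue_adj"] - means["budget_adj"]
means["margin"] = 100 * means["profit"] / means["revenue_adj"]

print(means.round(1))
```

In this toy data the documentary rows have a far smaller mean profit than the adventure rows but a slightly higher margin, which shows why the two metrics can rank genres differently.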

If your only goal is to pull in a large revenue, documentaries and foreign films appear to be the worst choices.

It would seem that adventure and animation films have the highest mean profit. Documentaries seem to be bad choices when you look only at their mean profit, but their margin looks much better. Perhaps they would be a better choice for a company with a small budget looking for a low-risk return on investment.

With unlimited resources, however, adventure and animation are clearly the top two genres by mean profit.

Interestingly, foreign films seem to have a negative margin. It is possible that the data only includes information on revenue generated by United States audiences.

How are audience votes and revenue correlated?

How does the opinion of the audience factor into the revenue a movie generates? Are better-liked movies bigger money-makers?

It would seem that the average vote for a title does have a correlation with the revenue brought in by that title. Movies with scores below a 4 or 5 have very little revenue. Revenue continues to increase up to about a 6.7-6.8, and then sharply drops off for scores greater than 7.
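One way to check a rise-then-fall pattern like this is to bin the vote average and compare mean revenue per bin, as in the sketch below. The rows and the bin edges are invented for illustration.

```python
import pandas as pd

# Hypothetical titles; all values are invented.
sample = pd.DataFrame({
    "vote_average": [3.5, 4.8, 5.6, 6.2, 6.8, 6.9, 7.4, 8.1],
    "revenue_adj": [1e6, 5e6, 4e7, 9e7, 2.5e8, 2.1e8, 8e7, 3e7],
})

# Assumed bin edges; each bin is open on the left and closed on the right.
bins = [0, 5, 6, 7, 8, 10]
sample["vote_bin"] = pd.cut(sample["vote_average"], bins=bins)

# Mean revenue within each score band.
mean_by_bin = sample.groupby("vote_bin", observed=True)["revenue_adj"].mean()

# Pearson correlation measures only the linear part of the relationship.
pearson_r = sample["vote_average"].corr(sample["revenue_adj"])

print(mean_by_bin)
print(f"Pearson r: {pearson_r:.2f}")
```

Because the relationship rises and then falls, a single correlation coefficient can understate it, which is why the binned means are worth inspecting alongside it.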

There are many factors that could explain this. It makes sense that generally disliked movies would make less money, but why would movies that are universally loved make less money than those with just average reception?

Conclusions

In this report, we have indirectly explored audience preferences in film genre over time, as well as specific factors that may affect a film's profit and revenue.

As we did not perform any statistical tests, all findings are tentative and any assumptions may be incorrect.

First, we explored audience preferences over time and discovered that the total number of films in general, across all genres, has increased since the 1960s. It is possible that this is due to a general increase in demand, but this cannot be definitively stated.

We then found that, as a percent of total films released in a given year, most genres maintained a relatively stable share of the titles released each year from the 1960s through 2015.

We then looked specifically at drama titles to see if we could find any information that would be useful in identifying changes in tastes over time. We found that, while there are overall more drama movies being made, those movies are making less money on average. A possible explanation is the sheer number of titles bringing the average down, but this cannot be proven.

All in all, no specific information was found that would indicate any major change in genre tastes over time.

Secondly, we examined factors that may affect a film's profit and revenue. We started by gathering data about basic financial information. Interestingly, we discovered that many films either lost money, or profited very little.

We found that adventure and animation films had the highest mean profit, while documentaries had a comparatively high profit margin despite their small budgets.

We then compared average voter score and total revenue, and found that movies with a vote average between 5 and 8 made the most money, with a peak around 6.8.