Data Sources
TMDb Dataset

The TMDb Movies Dataset is a community-built, comprehensive movie dataset consisting of about 1.1 million movies with 24 movie attributes (columns). The data in this dataset is pulled from The Movie Database (TMDB), which is a crowdsourced platform that enables the public to add movies and ratings. Therefore, the information in the database is populated by movie fans all over the world. The data was pulled using the TMDb API by Kaggle user Asaniczka and has been downloaded almost 40K times.
The dataset includes the following columns: ID (the unique movie identifier), title, vote average (average rating given by users on a scale of 1-10), vote count (total number of votes the movie received), status (released, rumored, etc.), release date, revenue, runtime, adult (suitable for children or not), budget, homepage, IMDB ID, original language, original title, overview, popularity, tagline, genres, production companies, production countries, spoken languages, keywords, backdrop path, and poster path.
Letterboxd Movies Dataset

In order to have a more focused analysis, we will utilize a subset of the dataset rather than the million movies. To do this, we will merge another dataset with our current TMDB one. Through merging, we can cut the samples into 16K films with a more holistic and comprehensive scope.
The second dataset is the Letterboxd Movies Dataset, which contains 16,246 movies as rows and 28 columns. The Letterboxd data is collected by Kaggle user Kutay Şahin from Letterboxd, a popular social platform for film enthusiasts. The dataset includes the following columns: title, year, decade, runtime, runtime_category (short, standard, long, or epic), genres, primary_genre, genre_count (the total number of genres assigned to the movie), country, language, description (movie description), and title_length (the number of characters in the movie title).
What the Combined Dataset Captures
The two datasets that are merged range across a broader spectrum. The Letterboxd dataset is a lot about the details of the actual film. This includes genre, country of origin, and language. Compared to the first dataset, the TMDB dataset also includes the revenue that was generated and the public opinion on the film, which is essentially the rating of the film. When combined, the dataset seems to encourage interpretations that link cultural identity, such as is_english, to be a contributing factor to a film’s success, which is also shown by English-language films being much more prominent.

Data Critique
If this dataset were the only source, certain details of the film that might have contributed to its success might be left out. Examples of this are aesthetic style, narrative complexity, or thematic meanings.
Our dataset can illuminate broad patterns in film metadata and platform engagement, such as how releases vary over time, which genres correlate to higher ratings, vote counts, revenue, and popularity, and how these variables relate to one another for tasks like trend analysis. However, it can’t explain why a movie has a particular popularity score or number of votes, so we have to use our own analysis to fill in the gaps. It also can’t measure cultural impact or influence, as films can have low TMDb ratings but be highly influential through other means, such as community significance, but this would have to be determined with data outside of what our current data set provides.
Moreover, the data recorded are fixed, meaning the information is specific to a certain year or day. This prevents us from studying the changing rating to trace the impact of social and economic events. We have to rely more on the film production year and release date to study how the films reflected the social and economic context from the studio point of view, with additional consideration on the latency of influence of social events, such as COVID-19, on the film industry.
Another limitation of this dataset comes from how popularity and success are defined and measured. The two datasets we used are both collected from English-dominant platforms or communities. The “data silence” led to the conclusion to be more regional than global. In the dataset, success is largely represented through numeric indicators such as popularity scores, vote averages, vote counts, and revenue. These metrics reflect user engagement on the TMDB platform rather than the real audience experience. Films popular in regions with lower TMDB usage may appear less significant in the data than they are in reality. In addition, because TMDB relies on voluntary user contributions, some films have incomplete or inaccurate records. This uneven data quality can shape the conclusion we draw, pushing analysis toward mainstream films. If this dataset were our only source, we would miss how films are discussed, remembered, or valued outside of online platforms.
