Underlying Themes of Movies Through Time
Authors: Cedric Conol, Rosely Pena and Bok Torregoza
Executive Summary
Internet Movie Database, commonly known as IMDb, is an online database of information related to films, television programs, home videos and video games. The study aims to determine the different themes of movies through time using unsupervised clustering. Movie titles, description, genre, ratings, directors, country of origin and languages were scraped from IMDb using Scrapy, a Python Web Scraping Framework, and were stored in a SQLite3 database. The analysis focused only on American movies produced between 1980 and 2019, filtered by ratings and grouped into four (4) decades: 1980-1989, 1990-1999, 2000-2009 and 2010-2019. Movies with ranking of 6.0 and above were classified as ‘above average’ and with ratings below 6.0 were tagged as below average. Descriptions of each movie were vectorized using term-frequency inverse document frequency (TF-IDF) weighting and dimensionalities of the design matrices were reduced to n components that would explain at least 80% of the variance using Latent Semantic Analysis (LSA). Unsupervised clustering using the Ward Method was used to generate clusters per decade. Results show that movies set in New York city consistently received above average ratings throughout the four decades. Movies based on true stories had significantly increased in ratings in the past two decades (2000-2019) while movies about violence, deaths and killings received lower ratings in three decades, specifically 1980-1989, 2000-2009 and 2010-2019. Findings of the study can be used by producers to determine what type of movies to develop, or create movies with themes that have never been explored however, addition of additional features such as movie keywords should be done to generate more significant results.
Note: If you wish to have a copy of the technical paper (PDF), data and the code used in this project, you may send me an email or you can also drop your message via LinkedIn.