# MovieSearch: a smart movie search engine

MovieSearch is a content-specific search engine that retrieves movie information based on the contents of a user's query. The search engine relies on the Okapi BM25 algorithm and takes into consideration the text present in the overview, the title, the names of the cast, and the production companies of each movie. The backend has been developed with the Django framework, while the front end relies extensively on Bootstrap 4 and HTML5. Movie data, reviews, and a reverse index used to speed up the search are stored in a local MongoDB database. A recommendation system that suggests similar movies using a nearest-neighbors machine learning algorithm has also been implemented. Moreover, before being uploaded, movie reviews have been classified into two categories, positive and negative, using a pretrained BERT model fine-tuned on the IMDb reviews dataset.

### Introduction

Many content-specific search engines focused on movie retrieval already exist. Among them, one of the most popular is without doubt the website IMDb. It contains information about movies from the very beginning of the history of cinema up to movies not yet released, such as Avatar 5, which is set to be released in 2027 (2). Although our project is less complete and complex than IMDb, MovieSearch relies on a more powerful, fast, and easy-to-use retrieval system based on the Okapi BM25 algorithm. It also benefits from a simple, friendly, and mobile-responsive user interface, an automatic way to classify movie reviews without relying on human judgment, and a reliable recommendation system.

Overall, the website is composed of two main pages. The first is the "search results" page (figure 1), which shows the movies retrieved for an input query. When the user clicks the "More info" button of a retrieved movie, the system dynamically generates the corresponding "movie details" page (figure 1), which shows more complete information about the movie, the results of the recommendation system and, whenever present, the reviews given by other users. The text box for entering a new query is located at the top of every page, so the user can perform a new search from anywhere on the site. The next paragraphs explain each of the above-mentioned features in more detail.

### Dataset

Overall, three different datasets have been used to build the contents of the search engine. The first dataset has been crawled from IMDb and is publicly available on Kaggle. It contains information on about 5000 movies, including title, overview, release date, main actors, director, production companies and production countries, original languages, genres, plot keywords, popularity, runtime, IMDb average score, votes, and id, as well as budget and revenue in dollars. Given the different types present in the data (integers, floats, strings, and arrays), SQL-like relational databases were discarded in favor of a document-based database such as MongoDB. As is common in many data science projects, most of the work went into cleaning the dataset. Many empty cells have been filled with N.A (not available) or 0, depending on the type of data. Movies whose `overview` field was absent have been discarded outright, since this field is fundamental for the search engine algorithm to work.
The second dataset has also been crawled from IMDb, is available on Kaggle, and contains the posters of many movies. The title and the poster of a movie are the first two things users look at to judge whether they may like it or not. It was therefore considered important to show the poster of a movie along with its title, both in the search results and, even more so, on the details page.
The third dataset is a typical text classification dataset containing over 50000 movie reviews, also taken from IMDb, to be displayed on the details page of a specific movie. Reviews are very important because they help users understand, based on the experiences of other users, whether a movie suits them and whether it is worth watching.
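
The cleaning rules described for the first dataset (fill empty cells according to their type, drop movies without an overview) can be sketched as follows. This is only an illustration: the function name `clean_movies`, the list of numeric fields, and the representation of movies as plain dictionaries are assumptions, not the project's actual code.

```python
def clean_movies(rows, numeric_fields=("budget", "revenue", "runtime")):
    """Return cleaned copies of `rows`, a list of movie dictionaries.

    Assumption: an "empty cell" is a value that is None or an empty string.
    """
    cleaned = []
    for row in rows:
        # Movies without an overview are discarded: the search needs this field.
        if not row.get("overview"):
            continue
        fixed = {}
        for key, value in row.items():
            if value is None or value == "":
                # Fill according to the type of data: 0 for numbers, "N.A" otherwise.
                fixed[key] = 0 if key in numeric_fields else "N.A"
            else:
                fixed[key] = value
        cleaned.append(fixed)
    return cleaned
```

Copying each row instead of modifying it in place keeps the raw data available for later checks.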

### Movies retrieval

Before diving into the details of this part, it is convenient to define a function f that takes a text x as input and processes it. This function splits the text into single words, converts them to lowercase, removes stopwords, removes punctuation, and applies stemming. It therefore returns a list of words w. Moreover, let us define the text of a movie as the concatenation of the words obtained by applying f to its title, overview, actors, director, and production companies fields. These operations are very common in the text preprocessing pipeline of many NLP applications, and their goal is to make a text easier for machines to process.
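
A minimal sketch of f and of the movie text is given here, using only the standard library. The stopword list and the suffix-stripping `stem` helper are deliberately tiny stand-ins (a real pipeline would more likely use a full stopword list and a proper stemmer such as Porter's), and the field names are assumptions.

```python
import string

# Hypothetical, deliberately short stopword list.
STOPWORDS = {"a", "an", "and", "are", "the", "of", "in", "on", "to", "is", "for", "with"}


def stem(word: str) -> str:
    """Toy suffix stripper standing in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def f(text: str) -> list[str]:
    """Lowercase, delete punctuation, split into words, drop stopwords, stem."""
    no_punct = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [stem(w) for w in no_punct.split() if w not in STOPWORDS]


def movie_text(movie: dict) -> list[str]:
    """Concatenate f() applied to the fields that make up the text of a movie."""
    fields = ("title", "overview", "actors", "director", "production_companies")
    words = []
    for name in fields:
        value = movie.get(name, "")
        if isinstance(value, list):  # e.g. a list of actor names
            value = " ".join(value)
        words.extend(f(value))
    return words
```

For example, `f("The Avengers are fighting heroes!")` yields `["avenger", "fight", "hero"]` with this toy stemmer.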
As already introduced, Okapi BM25 is the algorithm chosen for retrieving relevant movies given a user's input query. A query is first processed with f(query), and all movies whose text contains at least one of the resulting words wq are then retrieved from the database. This lets users refine their search by using cast and production company names as keywords, rather than relying only on titles and overviews. Moreover, words in the title carry twice the weight of the other words in the text, since users usually look for a movie by typing its title or some keywords contained in it.
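
The scoring step can be illustrated with a compact BM25 sketch. The source gives neither the BM25 parameters nor the way the title weight is applied, so the values `K1 = 1.5` and `B = 0.75` (common defaults) and the trick of repeating title words to double their weight are assumptions made for this example.

```python
import math
from collections import Counter

K1, B = 1.5, 0.75  # common BM25 defaults, assumed here


def weighted_text(title_words, other_words):
    """Assumed way to give title words double weight: count them twice."""
    return list(title_words) * 2 + list(other_words)


def bm25_scores(query_words, docs):
    """Score movies against a processed query.

    `docs` maps a movie id to its list of processed words.
    Returns a dict of movie id -> score for movies matching at least one word.
    """
    n_docs = len(docs)
    avg_len = sum(len(words) for words in docs.values()) / n_docs
    terms = set(query_words)
    # Document frequency: number of movies containing each query word.
    df = {w: sum(1 for words in docs.values() if w in words) for w in terms}
    scores = {}
    for doc_id, words in docs.items():
        tf = Counter(words)
        score = 0.0
        for w in terms:
            if tf[w] == 0:
                continue
            idf = math.log((n_docs - df[w] + 0.5) / (df[w] + 0.5) + 1)
            norm = tf[w] + K1 * (1 - B + B * len(words) / avg_len)
            score += idf * tf[w] * (K1 + 1) / norm
        if score > 0:
            scores[doc_id] = score
    return scores
```

Sorting the returned dictionary by value in descending order gives the ranked result list.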
Since the database contains over 5000 movies, searching for the query keywords in every movie one by one would be very slow. A reverse (inverted) index has therefore been implemented: its keys are the union of the words present in the text of each movie, and its values are arrays of indices of the movies whose text includes the key. In this way, the number of movies retrieved depends on the length of the processed query and on the frequencies of its words, and it is usually not necessary to scan the whole database. To further speed up the retrieval phase, the system temporarily caches movies the first time they are retrieved from the database, so that they do not need to be fetched again in later searches. Table 1 shows the first 5 search results for the queries "Avengers heroes" and "Kung Fu".
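
The index and the cache described above might look like the following sketch. It keeps everything in memory for clarity, whereas the project stores the index in MongoDB; the names `build_index`, `candidate_ids`, and `MovieCache`, and the `fetch` callable that loads a movie from the database, are hypothetical.

```python
from collections import defaultdict


def build_index(texts):
    """Build word -> sorted list of movie ids from movie id -> processed words."""
    index = defaultdict(set)
    for movie_id, words in texts.items():
        for word in words:
            index[word].add(movie_id)
    return {word: sorted(ids) for word, ids in index.items()}


def candidate_ids(index, query_words):
    """Ids of all movies whose text contains at least one query word."""
    ids = set()
    for word in query_words:
        ids.update(index.get(word, []))
    return ids


class MovieCache:
    """Remember movies already loaded so later searches skip the database."""

    def __init__(self, fetch):
        self._fetch = fetch  # hypothetical callable: movie id -> movie document
        self._store = {}

    def get(self, movie_id):
        if movie_id not in self._store:
            self._store[movie_id] = self._fetch(movie_id)
        return self._store[movie_id]
```

Only the movies returned by `candidate_ids` need to be loaded and scored, which is what avoids a full scan.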

### Recommendation system

The goal of the recommendation engine is to recommend n movies whose content is similar to that of a target movie; in our implementation n=5. Movie plot keywords therefore play an important role in the functioning of the system. Fortunately, the movie dataset already includes a set of keywords for each entry. Before being used, however, the keywords were cleaned: first, rare terms that appeared fewer than 5 times were replaced by a synonym with a higher frequency; second, all keywords appearing in fewer than 3 films in total were removed. To compute the similarity between a target movie t and the rest of the movies, a similarity matrix (see table 2) is first built, with as many rows as there are movies and a number of columns that depends on t. In particular, the columns correspond to the fields title, director, actors, genres, and keywords of the movie t.

Apart from the title, which is a string, every other field $a_{i,j}$ with $2 \le j \le N+K+H+2$ holds a boolean value that is true only if the content of film $i$ matches the meaning of column $j$ with respect to the movie $t$. For example, if keyword1 of movie $t$ also appears in movie $i$, with $t \neq i$, then $a_{i,N+K+3}=1$; otherwise it is 0. Once this matrix has been defined, the distance between two films, used for the nearest-neighbors search, is computed according to the formula:

$d_{t,i}=\sqrt{\sum_{j=2}^{J}(a_{t,j}-a_{i,j})^2}$

Here $J=N+K+H+2$ is the number of features for the target movie $t$. Next, the $m$ movies ($m \ge n$) with the lowest $d$ are selected. The similarity score of these $m$ movies is then refined by taking into account their IMDb score, the number of votes they received, their popularity, and their year of release. The reasoning is that the smaller the difference in the last three parameters, the more related two movies are. For example, it is very likely that a person's favorite movies mostly come from the same era. The new score for each movie is calculated according to the formula:

$score=IMDb_{score}^2 \times N_{votes}(votes_i) \times N_{popularity}(popularity_i) \times N_{year}(year_i)$

Here each $N$ is a Gaussian whose mean and standard deviation are both set to the maximum of the corresponding parameter among the subset of selected movies. The Gaussians give more weight to entries with a large number of votes and high popularity, and to movies whose release year is close to that of the title selected by the user. In addition, movies with a higher IMDb score tend to appear first, since the goal of the recommendation system is to suggest movies that are similar but also of good quality.
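
The two-stage ranking described in this section (nearest movies by distance, then re-scoring) can be sketched as follows. Several points are assumptions: every feature field is stored as a list; N, K, and H count the actors, genres, and keywords of the target (which is consistent with keyword1 sitting in column N+K+3); the Gaussian is used without its normalization constant; and the default `m=10` is arbitrary, since the text only requires m >= n.

```python
import math

FEATURE_FIELDS = ("director", "actors", "genres", "keywords")  # title column omitted


def feature_vector(target, movie):
    """One 0/1 entry per director, actor, genre, and keyword of the target."""
    vector = []
    for field in FEATURE_FIELDS:
        present = set(movie.get(field, []))
        for value in target.get(field, []):
            vector.append(1 if value in present else 0)
    return vector


def distance(target, movie):
    """Euclidean distance between the target's row (all ones) and the movie's row."""
    a_t = feature_vector(target, target)
    a_i = feature_vector(target, movie)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a_t, a_i)))


def gaussian(x, mean, sigma):
    """Unnormalized Gaussian weight in (0, 1]."""
    return math.exp(-((x - mean) ** 2) / (2 * sigma ** 2))


def recommend(target, movies, n=5, m=10):
    """Pick the m closest movies, then return the n with the best refined score."""
    others = [mv for mv in movies if mv is not target]
    nearest = sorted(others, key=lambda mv: distance(target, mv))[:m]
    if not nearest:
        return []
    keys = ("votes", "popularity", "year")
    maxima = {k: max(mv[k] for mv in nearest) for k in keys}

    def score(mv):
        s = mv["imdb_score"] ** 2
        for k in keys:
            # Mean and standard deviation both equal the subset maximum.
            s *= gaussian(mv[k], maxima[k], maxima[k] or 1.0)
        return s

    return sorted(nearest, key=score, reverse=True)[:n]
```

Because the first stage only looks at shared features, a movie with a slightly larger distance can still end up ranked first after the second stage if its score, votes, popularity, and year are stronger.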

One last point concerns the sequels and prequels of the target movie. Many blockbusters have sequels that share the same director, actors, keywords, and so on. Moreover, the very existence of sequels usually means that the original was a "fair" box-office success, which tends to go hand in hand with a good IMDb score and similar popularity. Success is usually inherited among sequels, and given how the recommendation system is built, it is quite likely that if the engine matches one movie of a series, it will end up recommending several of them. To overcome this drawback, movies whose title is too similar to the title of the movie t are discarded. For example, "Pirates of the Caribbean: Dead Man's Chest" and "Pirates of the Caribbean: The Curse of the Black Pearl" have very similar titles, so the system recognizes one of them as the sequel of the other. Finally, table 3 shows some recommendation results for the movies "Avatar" and "I Robot".
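
The source does not say how title similarity is measured. One possible way, shown purely as an assumption, is the ratio computed by `difflib.SequenceMatcher` from the Python standard library, with a hypothetical threshold of 0.5.

```python
from difflib import SequenceMatcher


def titles_too_similar(title_a, title_b, threshold=0.5):
    """True if two titles look like entries of the same series.

    The threshold is an assumed value and would need tuning on real data.
    """
    ratio = SequenceMatcher(None, title_a.lower(), title_b.lower()).ratio()
    return ratio > threshold


def drop_sequels(target_title, candidate_titles):
    """Keep only candidates whose title is not too close to the target's."""
    return [t for t in candidate_titles if not titles_too_similar(target_title, t)]
```

With these settings the two "Pirates of the Caribbean" titles are flagged as too similar, because their long shared prefix alone pushes the ratio above 0.5, while unrelated titles such as "Avatar" and "I Robot" are kept.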

### Reviews classification

To make the website more interesting, movie reviews have been labeled by a pretrained BERT model into 2 categories: positive and negative. The model has been fine-tuned on the IMDb movie review dataset using the 25000 reviews contained in the training set. It was then used to predict labels for the other 25000 reviews contained in the test set, before all of them were uploaded to the database. However, given the mismatch between the reviews dataset and the movies dataset, not all movies have associated reviews. Future work may therefore consist of crawling more reviews, so that most movies have at least one associated review. We believe that labeled reviews give the user immediate feedback on how much a specific movie has been appreciated by the public, without the need to read all of them.

### Conclusions

The aim of this project was to design and implement a simple but effective content-specific search engine. Moreover, we enhanced it with a recommendation system and a movie review classifier in order to provide a better user experience. This work could still be extended by crawling more data; in particular, the posters and reviews that are still missing from the present work should be crawled. Other enhancements could include adding movie trailers, building a login system that lets users create their own accounts and write more reviews, and modifying the recommendation system so that recommendations are based not only on the content of the movies but also on user interactions.