Face Recognition and Similarity in Movies using Python

8 min read · Mar 31, 2021

Have you ever wondered how much you look like your favorite movie character? Or perhaps you want to build your own dataset about a specific person but all you have is a video file? In either case, this hands-on guide should give you some ideas on how to get started. It covers common practices in image processing and computer vision, as well as some Machine Learning techniques that will sharpen your ML skillset. We may extend this guide later on to cover how you can deploy your models to the real world. (You can check the final result here)

Simply put, we will be covering how to scrape the movie “American Psycho” for faces, preprocess the images for better classification results, and train a model that is able to assign a probability that a given face is Patrick Bateman. We will use this probability as a proxy for how alike two faces are. It is not exactly a similarity score per se, but calculating similarity between two faces is not a trivial topic, and this approach offers reliable results with an easy-to-follow method.

Similarly to standard ML workflows, this guide is structured in the following manner:

Scraping Frames
Pre-processing Images
Training a Model
Making Predictions
Saving the Model

Before starting, if you are developing in an Anaconda environment, you can skip this. But if not, make sure you start a virtual environment by running:

$ python3 -m venv venv

And activate it using:

$ source venv/bin/activate

Although this won’t make much difference when developing, later on it will make the project much more portable, since we will have all the necessary dependencies already installed and then it’s just a matter of pip freezing the environment. With that out of the way, let’s get into it.

Scraping Frames

The first step in any Machine Learning task is to gather the data. Usually this involves googling the dataset you are looking for, going on Kaggle, browsing a bit and downloading the best fit. Nothing against Kaggle, but most datasets uploaded there have already been thoroughly processed and reviewed, with their outliers trimmed and their missing values imputed. These are all very important tasks that need to be done in most cases, but in the real world things are seldom this “clean”.

As such, in this article we are going to be the ones to handle everything about the data: from its rawest form to being training-ready.

The source of our data is naturally going to be the video file of the movie “American Psycho”. Python has a few flexible libraries designed specifically for handling video files and manipulating their frames. What we are going to do is pretty much standard, but with a twist: to avoid populating our dataset with only the opening frames of the movie, we are going to set an offset (a number of frames to skip so that we get past the intro) and a fraction f. This corresponds to the number of frames that are going to be skipped, on average, before we collect the next frame, and it can easily be implemented using the random module. Finally, we will also place a hard cap on how many frames we extract.
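The sampling scheme described above could be sketched roughly as follows. This is a minimal sketch, not the code used for the article: the function names, the output file naming and the choice of OpenCV (`cv2`) as the video library are all assumptions, and f is interpreted here as "keep each frame with probability 1/f" so that about f frames pass between two kept ones.

```python
import os
import random


def pick_frame_indices(total_frames, offset, f, max_frames, seed=None):
    """Return the indices of the frames to keep.

    Frames before `offset` are skipped. Every later frame is kept with
    probability 1/f (an assumed reading of "f frames skipped on average").
    """
    rng = random.Random(seed)
    picked = []
    for index in range(offset, total_frames):
        if len(picked) >= max_frames:  # hard cap reached
            break
        if rng.random() < 1.0 / f:
            picked.append(index)
    return picked


def scrape_frames(video_path, out_dir, offset, f, max_frames, seed=None):
    """Write the sampled frames of `video_path` to `out_dir` as JPEG files."""
    import cv2  # OpenCV, imported here so the sampler above works without it

    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    wanted = set(pick_frame_indices(total, offset, f, max_frames, seed))
    saved = 0
    index = 0
    while wanted:
        ok, frame = cap.read()
        if not ok:  # end of file or decoding error
            break
        if index in wanted:
            # Hypothetical naming scheme: frame_0001234.jpg
            cv2.imwrite(os.path.join(out_dir, f"frame_{index:07d}.jpg"), frame)
            saved += 1
            wanted.discard(index)
        index += 1
    cap.release()
    return saved
```

Splitting the index selection from the video reading keeps the random part easy to check on its own, and passing a seed makes a run repeatable.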

Pre-processing Images

Now that we have extracted max_frames images, let’s work on enhancing the quality of our data. Since we are dealing with frames, we cannot apply outlier trimming, missing value imputation or many of the other standard methods used on numerical data. There are a few commonly used techniques for images, however, and in the next step we are going to go over a great one used specifically for face recognition. These steps are based on this article:

Image Pre-processing (towardsdatascience.com)

First and foremost, we are going to iterate over every image in the subset of frames we have just collected and try to detect a face in it. If we find one, we crop it, resize it to a uniform resolution (we cannot have varying sizes when we feed this to a model later on) and save it to a file. Additionally, it is common practice to use greyscale images in tasks such as this one. It’s fairly easy to understand why: an RGB picture means 3 different matrices, tripling the data that has to be stored, processed and accounted for when training. Reducing the dimensionality of the data means faster training times and possibly less noise (see the curse of dimensionality).
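One possible way to do the detect, crop, resize and greyscale step is sketched below. The article does not say which detector was used, so the Haar cascade bundled with the `opencv-python` package is an assumption here, as are the function names and the rule of keeping only the largest detection. The default size of 224 mirrors the 224 * 224 reshape used later in this guide.

```python
def largest_box(boxes):
    """Return the (x, y, w, h) box with the largest area, or None if empty."""
    boxes = list(boxes)
    if not boxes:
        return None
    return max(boxes, key=lambda b: b[2] * b[3])


def extract_face(frame_path, out_path, size=224):
    """Detect a face in a frame, crop it, and save it in greyscale.

    Returns True when a face was found and written, False otherwise.
    """
    import cv2  # OpenCV; imported here so largest_box has no dependencies

    image = cv2.imread(frame_path)
    if image is None:  # unreadable file
        return False
    grey = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    # Assumed detector: the frontal-face Haar cascade shipped with OpenCV.
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
    )
    faces = cascade.detectMultiScale(grey, scaleFactor=1.1, minNeighbors=5)
    box = largest_box(faces)
    if box is None:  # no face in this frame
        return False
    x, y, w, h = box
    face = cv2.resize(grey[y:y + h, x:x + w], (size, size))
    return bool(cv2.imwrite(out_path, face))
```

Frames where `extract_face` returns False can simply be skipped, which also filters out shots with no visible face.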

Now we need to start working on labelling the images. There are a few methods to do this: indexing ranges, renaming files or storing them in separate folders. I found the latter to be the easiest. Put simply, what I did was create a folder named 0 and a folder named 1. As you might have guessed, pictures of Patrick Bateman were dragged and dropped into 1 and the rest were placed in 0. Afterwards, we can load the images and their corresponding labels according to the folder in which they were stored.
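Loading the images together with their folder-based labels might look something like this. It is only a sketch: the helper names, the `.jpg` extension and the idea of a single root folder holding 0 and 1 are assumptions, not details taken from the original code.

```python
from pathlib import Path


def list_labelled_images(root):
    """Return (paths, labels) for the images stored under root/0 and root/1."""
    paths, labels = [], []
    for label in (0, 1):
        # Assumed layout: root/0/*.jpg (not Bateman), root/1/*.jpg (Bateman)
        for path in sorted((Path(root) / str(label)).glob("*.jpg")):
            paths.append(path)
            labels.append(label)
    return paths, labels


def load_dataset(root):
    """Read every labelled image in greyscale into arrays X and y."""
    # Imported here so list_labelled_images works with the standard library only.
    import cv2
    import numpy as np

    paths, labels = list_labelled_images(root)
    X = np.array([cv2.imread(str(p), cv2.IMREAD_GRAYSCALE) for p in paths])
    return X, np.array(labels)
```

Because every face was resized to the same resolution beforehand, the list of images stacks cleanly into one array of shape (n_images, height, width).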

Let’s dive into the processing of the images we just cropped. We are going to normalize, blur and segment. Normalizing ensures that all greyscale values are on the same scale, with none too high or too low. Blurring helps us get rid of some of the noise that might exist in our frames, which is particularly relevant given how inherently noisy the frames of old movies are. Lastly, segmentation helps in distinguishing the background elements from the foreground.

Normalizing

# Normalize the data
X = X / X.max()

Blurring

# Remove noise
import cv2
import matplotlib.pyplot as plt

# Gaussian blur
no_noise = []
for i in range(len(X)):
    blur = cv2.GaussianBlur(X[i], (5, 5), 0).astype('double')
    no_noise.append(blur)

# Display two images
def display(a, b, title1 = "Original", title2 = "Edited"):
    plt.subplot(121), plt.imshow(a, cmap='gray'), plt.title(title1)
    plt.xticks([]), plt.yticks([])
    plt.subplot(122), plt.imshow(b, cmap='gray'), plt.title(title2)
    plt.xticks([]), plt.yticks([])
    plt.show()

display(X[100], no_noise[100], 'Original', 'Blurred')

Segmentation

# Segmentation
seg = []
for i in range(len(X)):
    ret, thresh = cv2.threshold(X[i], 0.25, 1, cv2.THRESH_BINARY)
    seg.append(thresh)

# Displaying segmented images
display(X[223], seg[223], 'Original', 'Segmented')

That was easy. Before going into training, let’s just split the images between training and test sets and flatten the input.

# Split between training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_valid, y_train, y_valid = train_test_split(X, d['y'], test_size=0.25, random_state=42)    # preparing the validation set

# Flatten each 224 x 224 image into a single row
X_train = X_train.reshape(X_train.shape[0], 224 * 224)
X_valid = X_valid.reshape(X_valid.shape[0], 224 * 224)

Training a Model

Let us now train a model to distinguish between the desirable and undesirable frames. This step is a bit of an extra so that we can extend our dataset and force it to contain only desirable frames.

One of the most standard approaches to image classification is to implement a Convolutional Neural Network. Other equally popular options are pre-trained networks (VGG16) or NNs specifically optimized for facial recognition (VGGFace2).

However, in our case, since we do not have a lot of faces to distinguish, we can use an SVM with the little pre-processing we did beforehand, plus one extra: we are going to use Principal Component Analysis to extract the Eigenfaces from our data.

The concept of Eigenfaces was introduced for face recognition by M. Turk and A. Pentland in 1991. Put simply, the Eigenfaces are the eigenvectors of the covariance matrix of the face images, and a small set of them is enough to describe the dataset. An eigenvector, in turn, is a vector that a linear transformation only scales by a factor (its eigenvalue) instead of rotating it off its own line. There is a bit of linear algebra underneath all this, so if you want to dig deep, 3blue1brown has a great video on it.
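To make the eigenvector idea a little more concrete, here is a small NumPy sketch. It uses random numbers as a toy stand-in for flattened face images (an assumption made purely for illustration) and checks that applying the covariance matrix to its leading eigenvector only rescales that vector by the eigenvalue.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy stand-in for the flattened faces: 50 "images" of 16 pixels each.
faces = rng.random((50, 16))

# Centre the data and build the pixel-by-pixel covariance matrix.
centred = faces - faces.mean(axis=0)
covariance = centred.T @ centred / (len(faces) - 1)

# eigh handles symmetric matrices and returns eigenvalues in ascending order,
# so reorder them from largest to smallest.
eigenvalues, eigenvectors = np.linalg.eigh(covariance)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# The leading eigenvector plays the role of the first "eigenface".
v, lam = eigenvectors[:, 0], eigenvalues[0]

# Applying the covariance matrix only rescales v by its eigenvalue.
print(np.allclose(covariance @ v, lam * v))  # True
```

With real data, each eigenvector has one entry per pixel, which is why it can be reshaped back into an image and displayed as a ghostly face.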

Using Grid Search we can easily find the hyper-parameters that work best with our data and model. What it does is try out every combination of the parameter values we give it and select the one that yields the best results.
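A minimal sketch of how PCA, an SVM and a grid search can be chained with scikit-learn is shown below. The synthetic data, the number of components and the parameter grid are placeholder assumptions chosen so the example runs quickly; only `probability=True` is taken from this guide. With the real data, `X_train` and `y_train` would be the flattened faces and their labels.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Synthetic stand-in for the flattened faces: two noisy clusters.
rng = np.random.default_rng(42)
X_train = np.vstack([rng.normal(0.3, 0.1, (40, 64)),
                     rng.normal(0.6, 0.1, (40, 64))])
y_train = np.array([0] * 40 + [1] * 40)

# PCA extracts the eigenfaces; the SVM classifies in that reduced space.
model = make_pipeline(
    PCA(n_components=10, whiten=True, random_state=42),
    SVC(kernel="rbf", class_weight="balanced", probability=True),
)

# Grid search fits one model per (C, gamma) pair and keeps the pair with
# the best cross-validated score. The values below are assumptions.
param_grid = {
    "svc__C": [1, 10, 100],
    "svc__gamma": [0.001, 0.01, 0.1],
}
clf = GridSearchCV(model, param_grid, cv=3)
clf.fit(X_train, y_train)

print(clf.best_params_)
print(round(clf.best_score_, 3))
```

Putting PCA inside the pipeline means it is refitted on each cross-validation fold, so the held-out fold never leaks into the components.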

Lastly, we can now evaluate the model with our data. Let’s see the report, confusion matrix and accuracy.

Output:
Predicting people's names on the test set
done in 0.001s
              precision    recall  f1-score   support

           0       0.89      0.93      0.91        44
           1       0.90      0.84      0.87        31

    accuracy                           0.89        75
   macro avg       0.89      0.89      0.89        75
weighted avg       0.89      0.89      0.89        75

Accuracy: 0.8933
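A report like the one above can be produced with scikit-learn's metrics functions. The sketch below does not rerun the model: the label arrays are rebuilt from counts that are back-derived from the rounded figures in the report (41 of 44 class-0 faces and 26 of 31 class-1 faces correct), which is an inference rather than something printed in the original output. In the real workflow, `y_true` would be `y_valid` and `y_pred` would come from `clf.predict(X_valid)`.

```python
import numpy as np
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Counts back-derived from the rounded report (an inference, see above):
# class 0: 41 correct, 3 wrong; class 1: 26 correct, 5 wrong.
y_true = np.array([0] * 44 + [1] * 31)
y_pred = np.array([0] * 41 + [1] * 3 + [0] * 5 + [1] * 26)

print(classification_report(y_true, y_pred))
# Rows are true classes, columns are predicted classes.
print(confusion_matrix(y_true, y_pred))
print(f"Accuracy: {accuracy_score(y_true, y_pred):.4f}")
```

Reading the confusion matrix row by row shows where the errors fall: under these counts, more Bateman faces are missed (5) than other faces are mistaken for him (3).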

Making Predictions

Great! Let’s see the model in action.

Erm.. pretty subpar for the first few examples. The accuracy may vary slightly depending on the train-test split. Nevertheless, 89.33% accuracy is not bad at all, especially considering how rough and real the data was when we started out.

Additionally, in case you are still wondering what the Eigenfaces are or what they look like, look no further:

I hope you aren’t too disappointed. Go on and look into that video I mentioned above now, I promise you won’t regret it.

However, if you recall, we are also interested in getting a probability that we will interpret as a similarity score. Since we instantiated our model with probability=True, we now have access to the predict_proba method:

ynew = clf.predict_proba(X_new)

This should return an array with one row of the form [p1, p2] per sample in X_new, where p1 is the probability of that sample belonging to class 0 (not Bateman, in our example) and p2 is the probability of it belonging to class 1 (Bateman), so naturally they add up to 1.
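Here is a small self-contained sketch of reading that probability. The clusters are toy data standing in for the processed faces and the variable names `bateman_column` and `similarity` are made up for this example. It also shows that the column order follows `clf.classes_`, which is worth checking rather than assuming.

```python
import numpy as np
from sklearn.svm import SVC

# Toy stand-in data: two well-separated clusters labelled 0 and 1.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 1.0, (30, 5)), rng.normal(2.0, 1.0, (30, 5))])
y = np.array([0] * 30 + [1] * 30)

clf = SVC(kernel="rbf", probability=True, random_state=0).fit(X, y)

X_new = rng.normal(2.0, 1.0, (1, 5))  # one unseen sample near class 1
proba = clf.predict_proba(X_new)      # shape (1, 2): one row per sample

# Columns follow clf.classes_, so look up the column for label 1.
bateman_column = list(clf.classes_).index(1)
similarity = proba[0, bateman_column]
print(clf.classes_, proba, similarity)
```

Keep in mind that these probabilities are calibrated internally by the SVM, so they are best read as a relative score, not an exact likelihood.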

Saving Model to a File

Lastly, to save time the next time we want to make predictions, let’s save the trained model to a file. This is pretty straightforward using the pickle library. The extension doesn’t have to be .pkl; I just found that it went along nicely since we’re using pickle.
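Saving and reloading with pickle could look like the sketch below. The helper names are made up for illustration, and any file name passed to them (for example model.pkl) is your own choice.

```python
import pickle


def save_model(model, path):
    """Serialize a fitted model to `path`."""
    with open(path, "wb") as handle:
        pickle.dump(model, handle)


def load_model(path):
    """Load a model back from `path`. Only unpickle files you trust."""
    with open(path, "rb") as handle:
        return pickle.load(handle)
```

A call such as `save_model(clf, "model.pkl")` followed later by `clf = load_model("model.pkl")` would restore the classifier without retraining, provided the same library versions are installed when loading.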

And this concludes our guide! 🎉 I’m glad you made it to the end. If you’ve just skipped the whole thing to look for the source code, here it is.