Facial expressions are one of the most important forms of non-verbal communication between humans. As Alan Fridlund, a psychology professor at the University of California, Santa Barbara, puts it: “our faces are ways we direct the trajectory of a social interaction”. Given their importance in our day-to-day lives, there is a lot to be gained from a system that can automatically recognize facial expressions. In this post, we describe how we tackled this challenging task at Mobius Labs.
While the two terms are often used interchangeably, it is important to highlight that facial expressions and emotions are not the same thing. Emotions are things that we feel, caused by neurons firing along pathways inside our brains. Facial expressions refer to the positions and motions of the muscles beneath the skin of the face. They can be an indicator of the emotion we are feeling, but they do not always reflect it. For example, we can be smiling while being in an emotional state that is far from happy.
Rather than training a system that recognizes a limited set of emotions, we decided to train one that distinguishes facial expressions.
For the above reasons, we decided to train a system that can distinguish facial expressions, rather than teaching a machine to classify a limited set of “emotions”. As we will see, this design choice allows us to easily add new facial expression tags to the system, and to distinguish facial expressions at a very fine-grained level.
It is worth highlighting that the ability to distinguish facial expressions goes beyond classifying them into a set of emotion — or rather facial expression — tags. Other use-cases include searching for a specific expression by providing a sample face, or photo gallery summarisation, where for each person in the gallery we want to show a range of facial expressions, as shown below.
In this example, around half of the faces would be tagged as “happy”, and the system would have a hard time telling apart different levels of happiness; it would probably show only one “happy face” in the summary. By teaching a system to assign similar facial expressions a small distance and dissimilar ones a large distance, we are able to differentiate much more fine-grained levels of facial expressions, and hence show a range of happiness in the facial expression summary.
More formally, our goal is to learn a feature embedding network F(x) that takes a face image x as input and outputs a facial expression embedding e, such that the distance between two embeddings is small for faces with similar expressions, whereas it is large for faces with different expressions. The figure below gives an illustrative example:
In this example, the distance between the facial expression embeddings of the left and the middle image should be (much) smaller than the distance between the right image and either of the other two.
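Writing this out with the embedding network F, and assuming the Euclidean distance between embeddings (the precise distance measure is introduced together with the loss further below), the requirement for this triplet reads:

\lVert F(x_{\text{left}}) - F(x_{\text{middle}}) \rVert_2 \;\ll\; \min\bigl( \lVert F(x_{\text{left}}) - F(x_{\text{right}}) \rVert_2,\; \lVert F(x_{\text{middle}}) - F(x_{\text{right}}) \rVert_2 \bigr)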
Let us now have a look at how we trained the facial expression embedding. In the following, we give an overview of the key ingredients: the dataset, the architecture, and the loss function.
In order to train such a facial expression embedding (FEE), one has to provide a large number of annotated samples so that the CNN can learn which features to extract. Luckily, there is an excellent facial expression dataset out there called Facial Expression Comparison (FEC). In this dataset, we are given triplets of face images (like the ones above), where at least six human annotators had to select the face whose facial expression is most dissimilar to the other two faces of the triplet. The images of these triplets have been carefully sampled from an internal emotion dataset that covers 30 different emotions; see [1] for more details.
A note on the dataset size: the original training dataset contains around 130K faces, which are combined into a total of 360K triplets where at least 60% of the raters agreed on the annotation. At the time we downloaded the dataset, only around 80% of it was still accessible.
As with other tasks that focus on faces, the first step of our processing pipeline is to detect the faces in an image. After this, we use a facial landmark detector (in our case RetinaNet [2]) to extract the locations of the eyes, the nose tip, and both corners of the mouth. Much more sophisticated landmark extractors that detect over 60 landmarks exist, but we found this one to be sufficient, as we only use the landmarks to align the face into a reference coordinate system: essentially, we undo rotation, scale the face such that the inter-ocular distance (the distance between the eyes) is 70px, and crop it to 224x224 pixels.
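As an illustration, the alignment step could look roughly as follows. This is a minimal sketch using OpenCV; the reference landmark coordinates are hypothetical values (chosen only so that the inter-ocular distance comes out at 70px in a 224x224 crop), not the exact ones we use.

```python
import cv2
import numpy as np

# Hypothetical reference landmark positions inside a 224x224 crop,
# laid out so that the inter-ocular distance is 70 px.
REFERENCE_LANDMARKS = np.float32([
    [77.0,  90.0],   # left eye
    [147.0, 90.0],   # right eye (147 - 77 = 70 px inter-ocular distance)
    [112.0, 130.0],  # nose tip
    [84.0,  165.0],  # left mouth corner
    [140.0, 165.0],  # right mouth corner
])

def align_face(image, landmarks, size=224):
    """Warp `image` so that the five detected landmarks (5x2 array of x, y
    pixel coordinates) roughly match the reference layout, which undoes
    rotation and scale, and crop the result to `size` x `size` pixels."""
    landmarks = np.float32(landmarks)
    # Estimate a similarity transform (rotation + uniform scale + translation).
    matrix, _ = cv2.estimateAffinePartial2D(landmarks, REFERENCE_LANDMARKS)
    return cv2.warpAffine(image, matrix, (size, size))
```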
This aligned face is then input to a convolutional neural network (CNN), which extracts a feature vector. We played around with a variety of CNN architectures, and found that the recently proposed EfficientNet [3] performs best as backbone architecture.
In the case of EfficientNet-B0, the output is a 1280-dimensional feature vector. This feature vector is then passed through two fully-connected (FC) layers, reducing the dimensionality down to just 16. Lastly, we apply L2-normalization to obtain the facial expression embedding.
The resulting FEENet architecture is very simple and lightweight, requiring just 4.7M parameters.
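To make the architecture description concrete, here is a minimal PyTorch sketch, assuming the timm library for the EfficientNet-B0 backbone; the hidden size of the first FC layer (512 below) is an assumption, as it is not stated above.

```python
import timm
import torch.nn as nn
import torch.nn.functional as F

class FEENet(nn.Module):
    """EfficientNet-B0 backbone followed by two FC layers and L2-normalization,
    producing a 16-dimensional facial expression embedding."""

    def __init__(self, embedding_dim=16, hidden_dim=512):
        super().__init__()
        # num_classes=0 makes timm return pooled 1280-d features instead of logits.
        self.backbone = timm.create_model("efficientnet_b0", pretrained=True, num_classes=0)
        self.fc1 = nn.Linear(1280, hidden_dim)  # hidden size is an assumption
        self.fc2 = nn.Linear(hidden_dim, embedding_dim)

    def forward(self, x):                          # x: (B, 3, 224, 224) aligned faces
        features = self.backbone(x)                # (B, 1280)
        embedding = self.fc2(F.relu(self.fc1(features)))
        return F.normalize(embedding, p=2, dim=1)  # unit-length 16-d embeddings
```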
We train the facial expression embedding network using a triplet loss function L(a, p, n) [4], which encourages the distance between the two more similar facial expressions (denoted as anchor a and positive p) to be smaller than the distance of either of them to the third facial expression of the triplet (denoted as negative n). We can write the triplet loss function as follows:
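One common form that matches this description (the exact variant and margin we use may differ) is the following, where δ is a margin hyper-parameter and d(·,·) the distance between the corresponding embeddings:

L(a, p, n) = \max\bigl(0,\; d(a, p) - d(a, n) + \delta\bigr) + \max\bigl(0,\; d(a, p) - d(p, n) + \delta\bigr)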
The figure above illustrates the main idea behind triplet loss, where we used the following simplified notation for the distance:
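Here, d(·,·) presumably stands for the Euclidean distance between the embeddings produced by F:

d(x, y) = \lVert F(x) - F(y) \rVert_2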
On the left, we see a possible configuration of the distances for one particular triplet (there are hundreds of thousands of such triplets in the training set) before training, where the distances between the anchor a, the positive p, and the negative n are all very similar. Using the above triplet loss function, we push the embeddings such that the distance between the anchor and the positive, denoted d(a,p), becomes much smaller than d(a,n) and d(p,n).
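As a sketch, the loss could be implemented in PyTorch along the following lines; the symmetric two-term form and the margin value of 0.2 are assumptions on our part.

```python
import torch

def triplet_loss(e_a, e_p, e_n, margin=0.2):
    """Triplet loss over batches of L2-normalized embeddings of shape (B, 16).
    The margin value is an assumption; the actual value is not given here."""
    d_ap = torch.norm(e_a - e_p, dim=1)  # distances anchor-positive
    d_an = torch.norm(e_a - e_n, dim=1)  # distances anchor-negative
    d_pn = torch.norm(e_p - e_n, dim=1)  # distances positive-negative
    # Push d(a, p) below both d(a, n) and d(p, n) by at least the margin.
    loss = torch.clamp(d_ap - d_an + margin, min=0) + \
           torch.clamp(d_ap - d_pn + margin, min=0)
    return loss.mean()

# Usage: loss = triplet_loss(model(anchor), model(positive), model(negative))
```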
This section shows quantitative results for the architecture presented above. As mentioned earlier, at the time we downloaded the dataset, only around 80% of the FEC dataset was still available; this has to be taken into account when comparing our results with those of FECNet [1], which had access to the whole dataset for training.
The chart below shows the triplet prediction accuracy (i.e., the percentage of triplets where the distance between the anchor and the positive is smallest) of the proposed FEENet and compares it to FECNet from Google AI [1]; in addition, we show the average performance of the human annotators who created the ground truth labels for the FEC dataset.
Somewhat surprisingly, the proposed FEENet has a 2.7% improvement in triplet prediction accuracy over FECNet [1], despite the simpler architecture (4.7M versus 7M parameters).
The proposed FEENet model achieves an accuracy very close to that of human annotators.
To put these results in context, it is worth highlighting that the annotators who created the ground truth annotations for the FEC dataset have an average triplet prediction accuracy of 86.2%. In other words, the model we trained almost reaches human performance on this challenging task.
Now to the most fun part — Applications. The list below is by no means complete, but should give the reader an idea of the versatility of the facial expression embeddings. If you have a specific application in mind that is not listed below, please visit our website (linked at the end of this article) to get in touch with us — we are always happy and excited to test out new things.
Perhaps the most obvious application of the facial expression embeddings is to use them to find similar facial expressions in a database of images. This is particularly useful because facial expressions are often difficult to describe in words; instead, one simply provides an image containing a face with the desired expression.
Below are a few examples. For each row, the leftmost face is the “query” face, and the others are the best matches from the FEC validation set (22K faces).
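In practice, such a search boils down to a nearest-neighbor lookup in the embedding space. Below is a hypothetical sketch using scikit-learn; `embeddings` is assumed to be an (N, 16) array of precomputed facial expression embeddings for the database faces.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def build_expression_index(embeddings):
    """Index an (N, 16) array of precomputed facial expression embeddings."""
    return NearestNeighbors(metric="euclidean").fit(embeddings)

def search_similar_expressions(index, query_embedding, k=10):
    """Return (indices, distances) of the k database faces whose expression is
    closest to the query face; a smaller distance means a more similar expression."""
    distances, indices = index.kneighbors(
        np.asarray(query_embedding).reshape(1, -1), n_neighbors=k)
    return indices[0], distances[0]
```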
Instead of searching “just” for the best matching facial expressions in the whole database, one can also first narrow down the database to a specific person, and then search for matching facial expressions of a different person. In the example below, we collected 130 images of Donald Trump and extracted their facial expression embeddings. We then searched with images of Angela Merkel to find the best matching facial expressions. Note that due to the limited size of the search database, we only show the top 3 matches here.
The facial expression embeddings can also be used to create facial expression summaries of specific people. That is, given a set of photos of the same person, one can find the dominant facial expressions, for instance by using some form of clustering in the facial expression embedding space. For the examples below, we selected around 150 samples per celebrity, ran simple K-Means with K=8 clusters, and show for each cluster the face that is closest to the cluster center.
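A minimal sketch of this summarisation step, assuming scikit-learn's K-Means (the actual clustering setup may differ), could look like this:

```python
import numpy as np
from sklearn.cluster import KMeans

def expression_summary(embeddings, n_clusters=8):
    """Cluster the (N, 16) embeddings of one person's faces and return, for
    each cluster, the index of the face closest to the cluster center."""
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(embeddings)
    summary = []
    for center in kmeans.cluster_centers_:
        distances = np.linalg.norm(embeddings - center, axis=1)
        summary.append(int(np.argmin(distances)))  # most representative face
    return summary
```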
The learned embeddings also lend themselves to classifying faces into different facial expression tags. In order to get a sense of what the learned features can encode, we ran an off-the-shelf K-Means clustering algorithm on all the faces in the FEC validation dataset (around 22,000 samples). We played around with the number of clusters K and found K=100 to give good results.
Let us have a look at some of the clusters the facial expression model is able to distinguish. One of our favourite examples is how it is able to pick up different levels of happiness. Below we show the 15 faces from the FEC validation dataset that are closest to different cluster centers (as obtained using K-Means with K=100 clusters):
Note how all the above facial expressions would most likely be given the tag “happy” — with the learned embedding, we are able to go into much more fine-grained levels of happiness, which should make everyone happy.
The facial expression embedding is able to encode fine nuances, which allows us to assign very fine-grained tags to the clusters.
As mentioned earlier, the proposed system makes it very easy to add new tags. All that has to be done is to provide at least one face along with the tag(s) that describe its facial expression. This is drastically different from existing technologies, where a limited set of predefined emotions (typically fewer than 10) is trained, and where it is not possible to add new (emotion) tags without retraining the whole architecture using thousands of sample faces that show the desired emotion.
Adding a new facial expression tag is as easy as providing an image of a face that conveys that tag.
As an example, let us focus on facial expressions where people have their mouth open. In the figure below, each row shows on the left a face containing the facial expression of interest, and to its right the closest neighbors of that face, which would hence be assigned the same tags.
In the first row, for example, one might provide a face along with the tag ‘screaming’. In addition, one might want to add further tags that describe specific facial features, such as ‘eyes open’ and ‘frowning’ in this case.
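Under the hood, assigning these tags can again be a nearest-neighbor lookup in the embedding space. The sketch below is hypothetical; in particular, the distance threshold is an assumed parameter.

```python
import numpy as np

def tag_faces(exemplar_embeddings, exemplar_tags, query_embeddings, max_distance=0.5):
    """Assign to each query face the tag(s) of its nearest tagged exemplar face,
    provided the embedding distance is below `max_distance` (an assumed threshold).
    exemplar_embeddings: (M, 16), exemplar_tags: list of M tag lists, query_embeddings: (N, 16)."""
    results = []
    for query in query_embeddings:
        distances = np.linalg.norm(exemplar_embeddings - query, axis=1)
        nearest = int(np.argmin(distances))
        results.append(exemplar_tags[nearest] if distances[nearest] < max_distance else [])
    return results
```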
Overall, the figure above reinforces the fact that the embedding is able to encode fine nuances, which allows us to assign very detailed facial expression tags.