Multimodal Learning for Facial Expression Recognition

Wei Zhang1, Youmei Zhang1, Lin Ma2, Jingwei Guan2, and Shijie Gong1

1School of Control Science and Engineering, Shandong University, Jinan, China

2Huawei Noah's Ark Lab, Hong Kong


In this paper, multimodal learning for facial expression recognition (FER) is proposed. The method makes the first attempt to learn a joint representation from the texture and landmark modalities of facial images, which are complementary to each other. To learn the representation of each modality as well as the correlations and interactions between modalities, structured regularization (SR) is employed to enforce modality-specific sparsity and density. By introducing SR, the comprehensiveness of the facial expression is fully taken into account, so the method can both handle subtle expressions and perform robustly across different input facial images. With the proposed multimodal learning network, the joint representation learned from multimodal inputs is better suited to FER.

The contributions of this work:

  • Multimodal learning for facial expression recognition (FER) is proposed. The method makes the first attempt to learn a joint representation by integrating the texture and landmark modalities of facial images.
  • Structured regularization (SR) is employed to enforce modality-specific sparsity and density when learning from the two modalities.

Proposed Algorithm

The databases provide both texture and landmark modalities for each facial image. These two modalities reflect different properties of the facial expression and should be considered together for FER. Each modality is first preprocessed separately before being fed into the multimodal FER system.

Fig. 1. The structure of our approach

Fig. 2. The structure of the network

Multimodal FER:

The proposed learning architecture is illustrated in Fig. 2, which takes different numbers and types of modalities as inputs. The output will be the joint representation, which not only considers each modality property but also accounts for the interactions of different modalities.
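As a rough illustration of what a joint representation over several modalities can look like, the sketch below concatenates the per-modality feature vectors and passes them through one shared hidden layer. This is only a minimal sketch under stated assumptions: the function name `joint_representation`, the sigmoid activation, the single hidden layer, and the toy weights are all hypothetical and are not taken from the network in Fig. 2. Plain Python lists are used to keep the example dependency-free.

```python
import math


def sigmoid(x):
    """Logistic activation (an assumed choice, not specified by the source)."""
    return 1.0 / (1.0 + math.exp(-x))


def joint_representation(modalities, weights, biases):
    """Map any number of modality feature vectors to one shared hidden vector.

    modalities: list of feature vectors, one per modality.
    weights:    one row per hidden unit; each row spans the concatenated input.
    biases:     one bias per hidden unit.
    """
    # Concatenate all modalities into a single input vector.
    x = [value for modality in modalities for value in modality]
    if len(weights) != len(biases):
        raise ValueError("need exactly one bias per hidden unit")
    if any(len(row) != len(x) for row in weights):
        raise ValueError("each weight row must match the concatenated input size")
    # Every hidden unit sees all modalities, so it can model their interaction.
    return [
        sigmoid(sum(w * v for w, v in zip(row, x)) + b)
        for row, b in zip(weights, biases)
    ]


# Toy usage: a 2-dimensional texture feature and a 1-dimensional landmark feature.
texture = [1.0, 0.0]
landmark = [0.5]
hidden = joint_representation(
    [texture, landmark],
    weights=[[0.0, 0.0, 0.0], [1.0, 1.0, 2.0]],
    biases=[0.0, -2.0],
)
```

Because the function takes a list of modalities, the same code accepts a different number or type of inputs as long as the weight rows match the concatenated size.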

For the texture modality, image patches are extracted around the eyes and mouth in each frame, since these regions contain the facial features most pivotal to expressions. For the landmark modality, the movement of each landmark is computed as the difference between its positions in the current frame and the previous frame along the X and Y directions.
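The two preprocessing steps described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the helper names `landmark_movement` and `extract_patch`, the square patch shape, the border clipping, and the (row, column) center convention are assumptions made for the example.

```python
def landmark_movement(prev_landmarks, curr_landmarks):
    """Return the flattened (dx, dy) displacement of each landmark.

    Both arguments are sequences of (x, y) pairs for the same landmarks
    in the previous and the current frame.
    """
    if len(prev_landmarks) != len(curr_landmarks):
        raise ValueError("both frames must contain the same landmarks")
    movement = []
    for (px, py), (cx, cy) in zip(prev_landmarks, curr_landmarks):
        movement.append(cx - px)  # movement in the X direction
        movement.append(cy - py)  # movement in the Y direction
    return movement


def extract_patch(frame, center, size):
    """Crop a size x size patch around center = (row, col).

    frame is a list of pixel rows. The window is shifted inward when it
    would cross the frame border (an assumed policy for this sketch).
    """
    rows, cols = len(frame), len(frame[0])
    if size > rows or size > cols:
        raise ValueError("patch is larger than the frame")
    half = size // 2
    row, col = center
    top = min(max(row - half, 0), rows - size)
    left = min(max(col - half, 0), cols - size)
    return [line[left:left + size] for line in frame[top:top + size]]


# Toy usage: two landmarks tracked over two frames, and a 10 x 10 frame.
movement = landmark_movement([(0, 0), (1, 1)], [(1, 2), (1, 0)])
frame = [[r * 10 + c for c in range(10)] for r in range(10)]
corner_patch = extract_patch(frame, center=(9, 9), size=4)
```

In practice the patch centers would come from the eye and mouth landmarks, so the two modalities share the same landmark detection step.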

Fig. 3. The structure of the AE (a) without SR and (b) with SR

The network is pre-trained with an auto-encoder (AE). To account for both the properties of each modality and the interactions between modalities, structured regularization (SR) is attached to the AE. The structure of the AE without and with SR is shown in Fig. 3.
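The exact form of the SR term is not given on this page. One common way to obtain modality-specific sparsity together with within-modality density is a group penalty: sum the L2 norm of each block of weights that connects one hidden unit to one modality. The sketch below assumes that form; the function names, the squared-error reconstruction term, and the weighting factor `lam` are hypothetical and should not be read as the authors' definition.

```python
import math


def structured_penalty(weights, modality_sizes):
    """Sum of L2 norms over every (hidden unit, modality) weight block.

    weights:        one row per hidden unit, spanning all modalities.
    modality_sizes: input dimension of each modality, in concatenation order.
    Summing un-squared norms pushes whole blocks to zero (sparsity across
    modalities), while the L2 norm inside a block keeps it dense.
    """
    total_dim = sum(modality_sizes)
    penalty = 0.0
    for row in weights:
        if len(row) != total_dim:
            raise ValueError("weight row does not match the modality sizes")
        start = 0
        for size in modality_sizes:
            block = row[start:start + size]
            penalty += math.sqrt(sum(w * w for w in block))
            start += size
    return penalty


def regularized_ae_loss(x, x_hat, weights, modality_sizes, lam):
    """Squared reconstruction error plus lam times the structured penalty."""
    reconstruction = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    return reconstruction + lam * structured_penalty(weights, modality_sizes)


# Toy usage: 2 hidden units, a 2-dimensional and a 1-dimensional modality.
weights = [[3.0, 4.0, 0.0], [0.0, 0.0, 2.0]]
penalty = structured_penalty(weights, modality_sizes=[2, 1])
loss = regularized_ae_loss(
    x=[1.0, 0.0, 0.0], x_hat=[0.0, 0.0, 0.0],
    weights=weights, modality_sizes=[2, 1], lam=0.1,
)
```

In the toy weights, the first hidden unit uses only the first modality and the second hidden unit uses only the second, which is the kind of block structure such a penalty favors.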

Experimental Results

Comparison to prior studies on the CK+ database (first 6 frames)

Comparison to prior studies on the CK+ database (first and last frames)

Comparison of the algorithms with and without AE

The recognition result with respect to the number of (a) hidden layers, (b) hidden units

Experimental results on spontaneous facial expression database


Wu et al.
Wu T, Bartlett M S, Movellan J R. Facial expression recognition using Gabor motion energy filters[C]//Computer Vision and Pattern Recognition Workshops (CVPRW), 2010 IEEE Computer Society Conference on. IEEE, 2010: 42-47.
Long et al.
Long F, Wu T, Movellan J R, et al. Learning spatiotemporal features by using independent component analysis with application to facial expression recognition[J]. Neurocomputing, 2012, 93: 126-132.
Jeni et al.
Jeni L A, Girard J M, Cohn J F, et al. Continuous AU intensity estimation using localized, sparse facial feature space[C]//Automatic Face and Gesture Recognition (FG), 2013 10th IEEE International Conference and Workshops on. IEEE, 2013: 1-7.
Lorincz et al.
Lorincz A, Jeni L, Szabo Z, et al. Emotional expression classification using time-series kernels[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. 2013: 889-895.
Yang et al.
Yang P, Liu Q, Cui X, et al. Facial expression recognition using encoded dynamic features[C]//Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on. IEEE, 2008: 1-8.
Wang et al.
Wang Z, Simoncelli E P, Bovik A C. Multiscale structural similarity for image quality assessment[C]//Proc. Asilomar Conference on Signals, Systems, and Computers. 2003.
He et al.
He S, Wang S, Lv Y. Spontaneous facial expression recognition based on feature point tracking[C]//Image and Graphics (ICIG), 2011 Sixth International Conference on. IEEE, 2011: 760-765.

Contact Me

If you have any questions, please feel free to contact Prof. Zhang.

Update: Apr. 27, 2016