Modality-fused representation is an essential and challenging task in multimodal emotion analysis. Previous studies have already yielded remarkable achievements; however, two problems remain: insufficient feature interaction and coarse data fusion. To address these two challenges, first, a hybrid architecture consisting of convolution and a transformer is proposed to extract local and global features. Second, to extract richer mutual features from multimodal data, our model comprises three parts: (1) the interior transformer encoder (TE) extracts intramodality characteristics from the current single modality; (2) the between TE extracts intermodality features between two different modalities; and (3) the enhance TE extracts an enhanced feature of the target modality from all modalities. Finally, instead of directly fusing features with a linear function, we employ a widely used multimodal factorized high-order pooling mechanism to obtain a more discriminative feature representation. Extensive experiments on three multimodal sentiment datasets (CMU-MOSEI, CMU-MOSI, and IEMOCAP) demonstrate that our approach reaches the state of the art in the unaligned setting. Compared with mainstream methods, the proposed method is superior in both word-aligned and unaligned settings.
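To make the final fusion step concrete, the following is a minimal PyTorch sketch of a factorized bilinear pooling layer, in the spirit of the multimodal factorized high-order pooling the abstract refers to; the class name, feature dimensions, and factor size are illustrative assumptions rather than the paper's exact configuration.

```python
# Minimal sketch of factorized bilinear pooling for two modality vectors.
# Dimensions (dim_x, dim_y, out_dim, factor_k) are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FactorizedBilinearPooling(nn.Module):
    def __init__(self, dim_x, dim_y, out_dim, factor_k=5):
        super().__init__()
        self.out_dim = out_dim
        self.factor_k = factor_k
        # Project both modalities into a shared (out_dim * factor_k) space.
        self.proj_x = nn.Linear(dim_x, out_dim * factor_k)
        self.proj_y = nn.Linear(dim_y, out_dim * factor_k)

    def forward(self, x, y):
        # Element-wise interaction in the expanded space.
        joint = self.proj_x(x) * self.proj_y(y)                # (B, out_dim * k)
        joint = joint.view(-1, self.out_dim, self.factor_k)
        joint = joint.sum(dim=2)                               # sum-pool over k
        # Signed square-root and L2 normalization stabilize training.
        joint = torch.sign(joint) * torch.sqrt(torch.abs(joint) + 1e-8)
        return F.normalize(joint, dim=1)

# Usage: fuse a 300-d text vector with a 74-d audio vector into a 512-d feature.
fusion = FactorizedBilinearPooling(300, 74, 512)
z = fusion(torch.randn(8, 300), torch.randn(8, 74))            # (8, 512)
```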
Datasets play a crucial role in the development of facial expression recognition (FER), but most datasets suffer from obvious uncertainties and biases caused by different cultures and collection conditions. To look deeper into these issues, this paper first conducts two sets of experiments, face detection and facial expression (FE) classification, on three datasets (CK+, FER2013, and RAF–DB) collected in laboratory and in-the-wild environments. This paper then proposes a depthwise separable convolutional neural network (CNN) with an embedded attention mechanism (DSA–CNN) for expression recognition. First, at the preprocessing stage, we crop the maximum expression range, computed from 81 facial landmark points, to filter out non-face interference. Then, we use DSA–CNN, which is built on coordinate squeeze-and-excitation (CSE) attention, for feature extraction. Finally, to further address class imbalance and label uncertainty, this paper proposes a class-weighted cross-entropy loss (CCE-loss) to alleviate the imbalance among the seven emotion classes, and combines CCE-loss with ranking regularization loss (RR-loss) and self-importance weighting cross-entropy loss (SCE-loss) at the label-correction stage to jointly guide the training of the network. Extensive experiments on three FER datasets demonstrate that the proposed method outperforms most state-of-the-art methods.
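As an illustration of the class-weighting idea behind CCE-loss, the sketch below weights a standard cross-entropy by inverse class frequency; the exact weighting scheme used in the paper may differ, and the class counts shown are only illustrative.

```python
# Minimal sketch of a class-weighted cross-entropy loss for seven emotion
# classes. The inverse-frequency weighting here is an assumption, not
# necessarily the paper's exact CCE-loss formulation.
import torch
import torch.nn.functional as F

def class_weighted_ce(logits, targets, class_counts):
    # Weight each class inversely to its frequency, normalized so that the
    # average weight is 1.
    counts = torch.as_tensor(class_counts, dtype=torch.float, device=logits.device)
    weights = counts.sum() / (len(counts) * counts)
    return F.cross_entropy(logits, targets, weight=weights)

# Usage with imbalanced class counts (illustrative numbers only).
logits = torch.randn(32, 7)
targets = torch.randint(0, 7, (32,))
counts = [4953, 547, 5121, 8989, 6077, 4002, 6198]
loss = class_weighted_ce(logits, targets, counts)
```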
Emotion is strongly subjective, and different parts of an image may affect emotion to different degrees. The key to image emotion recognition is to effectively mine discriminative local regions. We present a deep architecture that guides the network to extract discriminative and diverse affective semantic information. First, a fully convolutional network is trained with a cross-channel max pooling strategy (CCMP) to extract discriminative feature maps. Second, to ensure that most of the discriminative sentiment regions are located accurately, we add a second module consisting of a convolution layer and the CCMP. After the first module has located its discriminative regions, the feature elements corresponding to those regions are erased, and the erased features are fed into the second module. This adversarial erasure operation forces the network to discover different sentiment-discriminative regions. Third, an adaptive feature fusion mechanism is proposed to better integrate the discriminative and diverse sentiment representations. Extensive experiments on the benchmark datasets FI, EmotionROI, Instagram, and Twitter1 achieve recognition accuracies of 72.17%, 61.13%, 81.97%, and 85.44%, respectively, demonstrating that the proposed network outperforms state-of-the-art results.
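The cross-channel max pooling (CCMP) step can be sketched in a few lines of PyTorch: channels are partitioned into groups and each group is collapsed by a per-location max. The group count below is an assumption for illustration, not the paper's configuration.

```python
# Minimal sketch of cross-channel max pooling (CCMP): channels are split
# into groups, and each group is reduced to one map by a max over its channels.
import torch

def cross_channel_max_pool(feature_maps, num_groups):
    b, c, h, w = feature_maps.shape
    assert c % num_groups == 0, "channels must divide evenly into groups"
    grouped = feature_maps.view(b, num_groups, c // num_groups, h, w)
    # Max over the channels within each group -> one map per group.
    return grouped.max(dim=2).values                     # (B, num_groups, H, W)

# Usage: 512 channels pooled into 8 discriminative maps per image.
maps = cross_channel_max_pool(torch.randn(4, 512, 14, 14), num_groups=8)
```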
Facial expression recognition under partial occlusion is a challenging research problem. This paper proposes a novel framework for facial expression recognition under occlusion that fuses global and local features. For the global aspect, information entropy is first employed to locate the occluded region. Second, Principal Component Analysis (PCA) is adopted to reconstruct the occluded region of the image. After that, a replacement strategy reconstructs the image by replacing the occluded region with the corresponding region of the best-matched image in the training set, and the Pyramid Weber Local Descriptor (PWLD) feature is then extracted. Finally, the outputs of an SVM are fitted to the probabilities of the target classes using a sigmoid function. For the local aspect, an overlapping block-based method is adopted to extract WLD features, each block is weighted adaptively by information entropy, and Chi-square distance and similar-block summation methods are then applied to obtain the probability of each emotion class. Finally, decision-level fusion of the global and local features is performed with the Dempster-Shafer theory of evidence. Experimental results on the Cohn-Kanade and JAFFE databases demonstrate the effectiveness and fault tolerance of this method.
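For the decision-level fusion step, the sketch below applies Dempster's rule of combination to two class-probability vectors, restricted to singleton hypotheses; the paper's evidence model may also assign mass to compound sets, so this is an illustrative simplification rather than the full method.

```python
# Minimal sketch of decision-level fusion with Dempster's rule of combination,
# restricted to singleton hypotheses (one mass per emotion class).
import numpy as np

def dempster_combine(m_global, m_local):
    m1 = np.asarray(m_global, dtype=float)
    m2 = np.asarray(m_local, dtype=float)
    joint = m1 * m2                       # agreement on each class
    conflict = 1.0 - joint.sum()          # mass assigned to conflicting pairs
    if np.isclose(conflict, 1.0):
        raise ValueError("total conflict: evidence cannot be combined")
    return joint / (1.0 - conflict)       # normalized combined belief

# Usage: SVM-derived global probabilities fused with local-block beliefs
# (probability values are illustrative).
global_probs = [0.55, 0.20, 0.10, 0.05, 0.05, 0.03, 0.02]
local_probs = [0.40, 0.30, 0.10, 0.08, 0.05, 0.04, 0.03]
fused = dempster_combine(global_probs, local_probs)
print(fused.argmax())                     # index of the predicted emotion
```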