Abstract:
Confusion is one of the most frequently observed emotions in daily life and can greatly affect the effectiveness and efficiency of communication. Detecting confusion in learners and resolving it in a timely manner is critical for successful teaching. Most Facial Expression Recognition (FER) research focuses only on detecting six basic emotions: happiness, sadness, anger, fear, disgust, and surprise. Although the confusion detection problem has recently gained more attention from researchers, analysis that exploits both spatial and temporal information with sufficient data remains scarce. In this study, we present a spatial-temporal network for video-level confusion detection, trained on the BAUM-1 database, which is, to the best of our knowledge, the largest public video dataset in which confusion is labeled. The model combines a ResNet-18 Convolutional Neural Network (CNN) with a Long Short-Term Memory (LSTM) recurrent neural network (RNN). By cascading these two deep learning structures, our method achieves 73% accuracy, outperforming the baseline LSTM network, which achieves 67% on the same BAUM-1s dataset. We also test the proposed method on our own confusion video dataset, collected by recording 15 participants in an uncontrolled environment. On this dataset, the model predicts one instance of 30 consecutive facial images within 0.04 seconds and achieves 66% accuracy.
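The cascade described above (per-frame CNN features fed to an LSTM, with a classification head on the final hidden state) can be sketched in miniature. The following is a toy, pure-Python illustration under stated assumptions, not the authors' implementation: per-frame ResNet-18 features are assumed to be precomputed (here replaced by random vectors with toy dimensions), and the LSTM is a single hand-rolled cell unrolled over the 30 frames of one instance.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(M, v):
    # Matrix-vector product over plain Python lists.
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def add(*vs):
    # Element-wise sum of vectors.
    return [sum(t) for t in zip(*vs)]

def lstm_step(x, h, c, W, U, b, H):
    # One LSTM step: gate pre-activations from input x and previous hidden h,
    # stacked as [input gate, forget gate, output gate, candidate].
    z = add(matvec(W, x), matvec(U, h), b)
    i = [sigmoid(v) for v in z[0:H]]
    f = [sigmoid(v) for v in z[H:2*H]]
    o = [sigmoid(v) for v in z[2*H:3*H]]
    g = [math.tanh(v) for v in z[3*H:4*H]]
    c = [fj*cj + ij*gj for fj, cj, ij, gj in zip(f, c, i, g)]
    h = [oj*math.tanh(cj) for oj, cj in zip(o, c)]
    return h, c

def classify_clip(frames, W, U, b, Wo, bo, H):
    # Unroll the LSTM over the per-frame feature vectors, then classify
    # (confused vs. not confused) from the final hidden state.
    h, c = [0.0]*H, [0.0]*H
    for x in frames:
        h, c = lstm_step(x, h, c, W, U, b, H)
    logits = add(matvec(Wo, h), bo)
    return max(range(len(logits)), key=lambda k: logits[k])

random.seed(0)
D, H = 8, 4  # toy dims; a real ResNet-18 pooled feature would be 512-d (assumption)
frames = [[random.gauss(0, 1) for _ in range(D)] for _ in range(30)]  # 30 frames
W  = [[random.gauss(0, 0.1) for _ in range(D)] for _ in range(4*H)]
U  = [[random.gauss(0, 0.1) for _ in range(H)] for _ in range(4*H)]
b  = [0.0] * (4*H)
Wo = [[random.gauss(0, 0.1) for _ in range(H)] for _ in range(2)]  # 2 classes
bo = [0.0] * 2
print(classify_clip(frames, W, U, b, Wo, bo, H))
```

In practice such a model would be trained end to end in a deep learning framework; this sketch only illustrates how the CNN stage supplies spatial features per frame while the LSTM stage aggregates the temporal information across the 30-frame instance.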