Research Article

FPT-Former: A Flexible Parallel Transformer of Recognizing Depression by Using Audiovisual Expert-Knowledge-Based Multimodal Measures

Table 3

The performance of depression recognition on E-DAIC databases.

| Study | Model name | Modality | RMSE | MAE |
|---|---|---|---|---|
| Al Hanai et al. 2018 [43] | Long short-term memory (LSTM) neural network | Audio and text features | 6.50 | 5.13 |
| Zhang et al. 2020 [44] | An autoencoder model based on a bidirectional gated recurrent unit (BiGRU) | Speech signals | 5.68 | 4.64 |
| Yang et al. 2020 [45] | Deep convolutional generative adversarial network (DCGAN) | Speech, text, and face data | 5.52 | 4.63 |
| Han et al. 2023 [46] | Spatial-temporal feature network (STFN) | Speech data | 6.29 | 5.38 |
| Fang et al. 2023 [47] | A multimodal fusion model with a multilevel attention mechanism (MFM-Att) | Audiovisual and text data | 5.17 | — |
| Ours | A flexible parallel transformer model (FPT-Former) | Audiovisual expert-knowledge-based measures | **4.80** | **4.58** |

The bold font indicates the lowest value among the compared studies.