Research Article
FPT-Former: A Flexible Parallel Transformer of Recognizing Depression by Using Audiovisual Expert-Knowledge-Based Multimodal Measures
Table 3
The performance of depression recognition on the E-DAIC database.
| Study | Model name | Modality | RMSE | MAE |
| --- | --- | --- | --- | --- |
| Al Hanai et al. 2018 [43] | Long short-term memory (LSTM) neural network | Audio and text features | 6.50 | 5.13 |
| Zhang et al. 2020 [44] | Autoencoder based on a bidirectional gated recurrent unit (BiGRU) | Speech signals | 5.68 | 4.64 |
| Yang et al. 2020 [45] | Deep convolutional generative adversarial network (DCGAN) | Speech, text, and face data | 5.52 | 4.63 |
| Han et al. 2023 [46] | Spatial-temporal feature network (STFN) | Speech data | 6.29 | 5.38 |
| Fang et al. 2023 [47] | Multimodal fusion model with a multilevel attention mechanism (MFM-Att) | Audiovisual and text data | 5.17 | – |
| Ours | Flexible parallel transformer model (FPT-Former) | Audiovisual expert-knowledge-based measures | **4.80** | **4.58** |
The bold font indicates the lowest value among the compared studies.