Research Article
Machine Reading Comprehension-Enabled Public Service Information System: A Large-Scale Dataset and Neural Network Models
Table 5
The evaluation results of pretrained models and human performance.
| Model | Development EM | Development F1 | Test EM | Test F1 |
| --- | --- | --- | --- | --- |
| Human performance | 92.45 | 96.13 | 91.85 | 96.33 |
| Pretrained on a general-domain corpus (e.g., Chinese Wikipedia) by Google and others | | | | |
| BERT (base) | 76.30 | 85.34 | 76.40 | 85.04 |
| BERT-wwm (base) | 76.95 | 86.42 | 76.75 | 85.93 |
| ALBERT (base) | 75.85 | 85.39 | 76.20 | 85.26 |
| ALBERT (large) | 79.05 | 88.76 | 77.70 | 87.79 |
| RoBERTa-wwm (base) | 78.75 | 86.66 | 78.30 | 85.79 |
| RoBERTa-wwm (large) | 79.35 | 88.38 | 78.60 | 87.43 |
| Continually pretrained by us on a Chinese public service corpus | | | | |
| Our BERT (base) | 81.80 | 89.44 | 79.30 | 87.70 |
| Our BERT-wwm (base) | 81.60 | 89.25 | 79.65 | 88.20 |
| Our ALBERT (base) | 81.00 | 89.46 | 78.40 | 87.45 |
| Our ALBERT (large) | 80.90 | 89.90 | 79.60 | 88.10 |
| Our RoBERTa-wwm (base) | 82.65 | 90.05 | 79.85 | 88.30 |
| Our RoBERTa-wwm (large) | 83.10 | 90.79 | 80.45 | 88.51 |
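The EM and F1 columns follow the standard span-extraction MRC evaluation: EM counts predictions that match a gold answer exactly after normalization, while F1 gives partial credit for token-level overlap, so F1 is always at least as high as EM. The snippet below is a minimal sketch of such a scorer, not the paper's own evaluation script; the function names, the answer normalization, and the character-level tokenization (common for Chinese MRC) are illustrative assumptions.

```python
# Minimal sketch of SQuAD-style EM/F1 scoring for span-extraction MRC.
# Normalization and character-level tokenization are assumptions, not the
# paper's exact evaluation procedure.
from collections import Counter


def normalize(text: str) -> str:
    # Strip whitespace and lowercase; real scripts may also remove punctuation.
    return "".join(text.lower().split())


def exact_match(prediction: str, gold: str) -> float:
    # 1.0 if the normalized prediction equals the normalized gold answer.
    return float(normalize(prediction) == normalize(gold))


def f1_score(prediction: str, gold: str) -> float:
    # Character-level overlap between prediction and gold answer.
    pred_chars = list(normalize(prediction))
    gold_chars = list(normalize(gold))
    common = Counter(pred_chars) & Counter(gold_chars)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_chars)
    recall = num_same / len(gold_chars)
    return 2 * precision * recall / (precision + recall)


def evaluate(predictions: dict, golds: dict) -> dict:
    # predictions: question id -> predicted answer string
    # golds: question id -> list of acceptable gold answer strings
    # Per-question scores take the max over gold answers, then are averaged.
    em = sum(max(exact_match(predictions[qid], g) for g in gs) for qid, gs in golds.items())
    f1 = sum(max(f1_score(predictions[qid], g) for g in gs) for qid, gs in golds.items())
    n = len(golds)
    return {"EM": 100.0 * em / n, "F1": 100.0 * f1 / n}
```

Under this reading, each number in the table corresponds to running such a scorer over all questions in the development or test split for the given model's predictions.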