Research Article

Parallel Cleaning Algorithm for Similar Duplicate Chinese Data Based on BERT

Table 1

Error rate on IMDB and Sogou.

Method                 IMDB    Sogou

Head only              5.63    2.58
Tail only              5.44    3.17
Head + tail            5.42    2.43
hier.mean              5.89    2.83
hier.max               5.71    2.47
hier.self-attention    5.49    2.65
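
The first three rows of Table 1 name truncation strategies for inputs longer than the model's length limit. This section does not say how they are implemented, so the following is only a minimal sketch of one plausible reading: keep the first tokens, the last tokens, or a mix of both. The function name `truncate_tokens`, the parameters `max_len` and `head_len`, and their default values are assumptions made for illustration and are not taken from the article.

```python
from typing import List


def truncate_tokens(
    tokens: List[int],
    strategy: str,
    max_len: int = 510,   # assumed budget: 512 positions minus [CLS] and [SEP]
    head_len: int = 128,  # assumed split point for "head+tail"; illustrative only
) -> List[int]:
    """Shorten a token-id sequence to at most max_len items.

    strategy is one of "head", "tail", or "head+tail" (hypothetical labels
    mirroring the first three rows of Table 1).
    """
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    if len(tokens) <= max_len:
        return list(tokens)  # short inputs pass through unchanged

    if strategy == "head":
        return list(tokens[:max_len])
    if strategy == "tail":
        return list(tokens[-max_len:])
    if strategy == "head+tail":
        if not 0 < head_len < max_len:
            raise ValueError("head_len must satisfy 0 < head_len < max_len")
        tail_len = max_len - head_len
        return list(tokens[:head_len]) + list(tokens[-tail_len:])
    raise ValueError(f"unknown strategy: {strategy!r}")


if __name__ == "__main__":
    ids = list(range(1000))
    print(truncate_tokens(ids, "head+tail", max_len=10, head_len=4))
```

The hier.* rows presumably refer to splitting a long text into chunks and pooling the chunk representations (mean, max, or self-attention). That step depends on the encoder itself and is left out of the sketch.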