Research Article
A Systematic Comparison of Data Selection Criteria for SMT Domain Adaptation
Table 2
Corpora statistics.
| Data Set | Language | Sentences | Tokens | Avg. length (tokens per sentence) |
| --- | --- | --- | --- | --- |
| Test set | EN | 2,050 | 60,399 | 29.46 |
| Test set | ZH | 2,050 | 59,628 | 29.09 |
| Dev. set | EN | 2,000 | 59,732 | 29.26 |
| Dev. set | ZH | 2,000 | 59,064 | 29.07 |
| In-domain | EN | 43,621 | 1,330,464 | 29.16 |
| In-domain | ZH | 43,621 | 1,321,655 | 28.97 |
| Training set | EN | 1,138,044 | 28,626,367 | 25.15 |
| Training set | ZH | 1,138,044 | 28,239,747 | 24.81 |