Research Article

Sentence Classification Using N-Grams in Urdu Language Text

Table 2

Urdu language datasets.

Sr. No.  Dataset
1        Enabling Minority Language Engineering (EMILLE) (only 200,000 tokens) [18]
2        Becker-Riaz corpus (only 50,000 tokens) [19]
3        Computing Research Laboratory (CRL) annotated corpus (only 55,000 tokens publicly available) [20]
4        International Joint Conference on Natural Language Processing (IJCNLP) workshop corpus (only 58,252 tokens)
5        Urdu Named Entity Recognition (UNER) [4]
6        Corpus of 705 sentences [21]
7        Corpus of BBC Urdu, Daily Jang [22]
8        Corpus of 19.3 million words [23]
9        COUNTER, Naïve, NPUU [24, 25]
10       DSL Urdu news [26]