Research Article
Using Natural Language Preprocessing Architecture (NLPA) for Big Data Text Sources
Table 2
List of available corpora for spam classification.
| Dataset name | Language | Type of contents | Size: ham | Size: spam | Size: total | Available at |
|---|---|---|---|---|---|---|
| CSDMC 2010 Spam Corpus | English | E-mail messages | 2,949 | 1,378 | | https://github.com/hexgnu/spam_filter/tree/master/data |
| TREC 2007 Public Corpus | English | E-mail messages | 25,220 | 50,199 | | http://plg.uwaterloo.ca/~gvcormac/treccorpus07/ |
| SpamAssassin | English | E-mail messages | 4,150 | 1,897 | | http://spamassassin.apache.org/old/publiccorpus/ |
| Enron e-mail | English | E-mail messages | 619,446 | 0 | | http://spamassassin.apache.org/old/publiccorpus/ |
| Bruce Guenter spam collection | English | E-mail messages | 0 | >3M | | http://untroubled.org/spam/ |
| Ling spam | English | E-mail messages | 2,412 | 481 | | http://csmining.org/index.php/ling-spam-datasets.html |
| SMS-spam-collection v.1 | English | SMS messages | 4,827 | 747 | | http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/ |
| British English SMS corpora | English | SMS messages | 450 | 425 | | https://mtaufiqnzz.wordpress.com/british-English-sms-corpora/ |
| Webspam-UK 2007 | English | Web pages | — | — | 105,896,555 | http://chato.cl/webspam/datasets/index.php |
| Webspam-UK 2011 | English | Web pages | 1,769 | 1,998 | | https://sites.google.com/site/heiderawahsheh/home/web-spam-2011-datasets/uk-2011-web-spam-dataset |
| DC 2010/EU 2010 | English, French, and German | Web pages | — | — | 23M | https://dms.sztaki.hu/en/letoltes/ecmlpkdd-2010-discovery-challenge-data-set |
| Webb spam 2011 | - | Web pages | 0 | 330,000 | | http://www.cc.gatech.edu/projects/doi/WebbSpamCorpus.html |
| ClueWeb 09 | Multilingual | Web pages | — | — | 1,040M | http://www.lemurproject.org/clueweb09.php/ |
| ClueWeb 12 | English | Web pages | — | — | 870M | http://www.lemurproject.org/clueweb12.php/ |
| Common Crawl Data | Multilingual | Web pages | 0 | 9B | | http://commoncrawl.org/ |
| YouTube Comments Dataset | Multilingual | YouTube comments | 5,950,137 | 481,334 | | http://mlg.ucd.ie/yt/ |
| YouTube Spam Collection Dataset | English | YouTube comments | 951 | 1,005 | | https://archive.ics.uci.edu/ml/datasets/YouTube+Spam+Collection |
| HSpam14.s2 | - | Twitter messages | — | — | 14M | http://www3.ntu.edu.sg/home/axsun/datasets.html |