Research Article

Data Mining Technology Application in False Text Information Recognition

Table 2

Division of the feature set used in the classification experiment.

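| Feature | Description | Amount | Label |
|---|---|---|---|
| Lexical features (based on content feature) | (1) Amount of Chinese characters | 1 | F1 |
| Lexical features (based on content feature) | (2) Total number of characters | 1 | F1 |
| Lexical features (based on content feature) | (3) Total number of numeric characters | 1 | F1 |
| Lexical features (based on content feature) | (4) Amount of non-Chinese characters | 1 | F1 |
| Lexical features (based on lexical features) | (5) Amount of words | 1 | F1 |
| Lexical features (based on lexical features) | (6) Different words | 1 | F1 |
| Lexical features (based on lexical features) | (7) Hapax legomena: words that appear only once | 1 | F1 |
| Lexical features (based on lexical features) | (8) Hapax dislegomena: words that appear only twice | 1 | F1 |
| Lexical features (based on lexical features) | (9) Average sentence length | 1 | F1 |
| Syntactic feature | (10–17) Punctuation frequency: “,”, “.”, “?”, “!”, “:”, “;”, “/”, “”” | 8 | F2 |
| Syntactic feature | (18–59) Frequency of function words: put, be, about, accord, from, as, compare, include, like, make, need, possible, ‘s, get, pass, ah, yeah, oh, maybe, all, just, later, then, if, though, actually, but later, then, after, in short, until, often, feel, how, but yes, indeed | 42 | F2 |
| Syntactic feature | (60–79) Frequency of parts of speech: n.; V.; adj.; adv.; vl; nt; pron.; nS; m; f; q; prep.; conj.; aux.v; int.; nh; W; x; vu; i | 20 | F2 |
| Based on content feature | (80) Total number of sentences | 1 | F3 |
| Based on content feature | (81–95) Frequency of specific keywords: Treatment, symptoms, side effects, patient, function, alleviation, period, safety, health, improvement, effect, treatment, significant, recurrence, effective | 15 | F3 |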
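
As an illustration of how the lexical features (1–9) and the punctuation features (10–17) in Table 2 might be computed, the following is a minimal Python sketch and not the authors' implementation. Several choices in it are assumptions: the function names `lexical_features` and `punctuation_counts` are hypothetical, the text is assumed to be already segmented into a word list by some external tool, "Chinese character" is approximated by the basic CJK Unified Ideographs range, average sentence length is taken as words per sentence, and punctuation "frequency" is returned as a raw count covering both ASCII and full-width forms.

```python
import re
from collections import Counter

# Assumption: the basic CJK Unified Ideographs block approximates "Chinese characters".
CHINESE_CHAR = re.compile(r"[\u4e00-\u9fff]")
# Assumption: sentences end at ASCII or full-width terminal punctuation.
SENTENCE_END = re.compile(r"[。！？.!?]+")
# Assumption: each of the eight marks is counted in its ASCII and full-width forms.
PUNCTUATION = {
    "comma": ",，",
    "period": ".。",
    "question": "?？",
    "exclamation": "!！",
    "colon": ":：",
    "semicolon": ";；",
    "slash": "/／",
    "quote": '"“”',
}


def lexical_features(text, words):
    """Features 1-9 for one document; `words` is a pre-segmented token list."""
    chars = [c for c in text if not c.isspace()]
    n_chinese = sum(1 for c in chars if CHINESE_CHAR.match(c))
    counts = Counter(words)
    sentences = [s for s in SENTENCE_END.split(text) if s.strip()]
    return {
        "chinese_chars": n_chinese,
        "total_chars": len(chars),
        "numeric_chars": sum(1 for c in chars if c.isdigit()),
        "non_chinese_chars": len(chars) - n_chinese,
        "words": len(words),
        "different_words": len(counts),
        # Words whose count in the document is exactly one / exactly two.
        "hapax_legomena": sum(1 for v in counts.values() if v == 1),
        "hapax_dislegomena": sum(1 for v in counts.values() if v == 2),
        # Assumption: sentence length is measured in words.
        "avg_sentence_length": len(words) / len(sentences) if sentences else 0.0,
    }


def punctuation_counts(text):
    """Features 10-17: raw count of each punctuation mark in the text."""
    return {
        name: sum(text.count(mark) for mark in marks)
        for name, marks in PUNCTUATION.items()
    }
```

Calling `lexical_features(text, words)` and `punctuation_counts(text)` on each document and concatenating the values in a fixed order would give the first 17 entries of a feature vector. The function-word, part-of-speech, and keyword frequencies (features 18–59, 60–79, and 81–95) would additionally need a word list and a part-of-speech tagger, which are not sketched here.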