Mobile Information Systems

Research Article

Data Mining Technology Application in False Text Information Recognition

Feature subsets used in the classification experiment.


Feature			Description	Amount	Label

Lexical features	Based on content features	(1) Number of Chinese characters		1	F1
		(2) Total number of characters		1
		(3) Total number of numeric characters		1
		(4) Number of non-Chinese characters		1
	Based on lexical features	(5) Number of words		1
		(6) Number of distinct words		1
		(7) Hapax legomena	Words that appear exactly once	1
		(8) Hapax dislegomena	Words that appear exactly twice	1
		(9) Average sentence length		1

Syntactic features		(10–17) Punctuation frequency	“,”, “.”, “?”, “!”, “:”, “;”, “/”, “””	8	F2
		(18–59) Frequency of function words	put, be, about, accord, from, as, compare, include, like, make, need, possible, ’s, get, pass, ah, yeah, oh, maybe, all, just, later, then, if, though, actually, but later, then, after, in short, until, often, feel, how, but yes, indeed	42
		(60–79) Frequency of parts of speech	n.; V.; adj.; adv.; vl; nt; pron.; nS; m; f; q; prep.; conj.; aux.v; int.; nh; W; x; vu; i	20

Based on content features		(80) Total number of sentences		1	F3
Based on content features		(81–95) Frequency of specific keywords	Treatment, symptoms, side effects, patient, function, alleviation, period, safety, health, improvement, effect, treatment, significant, recurrence, effective	15	F3
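
As an illustration of how some of the tabulated features could be computed, the following minimal Python sketch derives features (5) to (8) and the eight punctuation counts (10) to (17). It is not the authors' implementation. It assumes that the text has already been segmented into words by an external tool, that "frequency" means a raw occurrence count, and that the punctuation marks are the ASCII forms printed in the table; the function names `lexical_counts` and `punctuation_counts` are hypothetical.

```python
from collections import Counter

# Assumption: the eight marks of features (10)-(17), in the ASCII forms
# shown in the table; the last entry is the closing quotation mark.
PUNCTUATION = [",", ".", "?", "!", ":", ";", "/", "\u201d"]


def lexical_counts(tokens):
    """Features (5)-(8), computed from a list of already segmented words."""
    freq = Counter(tokens)
    return {
        "num_words": len(tokens),            # (5) number of words
        "num_distinct_words": len(freq),     # (6) number of distinct words
        # (7) words that appear exactly once
        "hapax_legomena": sum(1 for c in freq.values() if c == 1),
        # (8) words that appear exactly twice
        "hapax_dislegomena": sum(1 for c in freq.values() if c == 2),
    }


def punctuation_counts(text):
    """Features (10)-(17) as raw counts of each mark in the text."""
    return {mark: text.count(mark) for mark in PUNCTUATION}
```

For Chinese text, the full-width variants of these marks would probably need to be added to `PUNCTUATION`, and the counts could be normalized by text length if relative frequencies are intended.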