Security and Communication Networks

Research Article

Detecting Web Spam Based on Novel Features from Web Page Source Code

Calculating semantic similarity features from the homepage.

Require:
    hp: homepage of each domain
    hp_domain: homepage's domain
Initialize the lists link_texts and link_domains, and set them to null
if (hp is not null) then
    hp_text := ExtractText(hp)    /* extract hp's title, keywords, description */
    links := Collectlinks(hp)    /* collect all external links in hp */
    if (hp_text and links are not null) then
        for each link in links do
            append ExtractLinkText(link) to link_texts    /* extract link's anchor text */
            append ExtractDomain(link) to link_domains    /* extract link's domain */
        end for
        wmd_text = WMD(hp_text, link_texts)
        /* compute the WMD distance between hp's text and the external links' anchor text */
        wmd_domain = WMD(hp_domain, link_domains)
        /* compute the WMD distance between hp's domain and the external links' domains */
    else if (links is not null and hp_text is null) then
        wmd_text = 0
    else
        wmd_text = wmd_domain = 0
    end if
end if
return wmd_text, wmd_domain
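
The procedure above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `HomepageParser`, `tokens`, `jaccard_distance`, and `homepage_features` are hypothetical, text extraction is limited to the title and the keywords/description meta tags, a link counts as external when its host differs from the homepage's domain, and the default distance is a token-set Jaccard distance standing in for WMD so that the example runs without a word-embedding model. When the homepage text or the external links are missing, the sketch leaves both features at 0.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urlparse


class HomepageParser(HTMLParser):
    """Collects title/keywords/description text and external links with anchor text."""

    def __init__(self, hp_domain):
        super().__init__()
        self.hp_domain = hp_domain
        self.text_parts = []    # title, meta keywords, meta description
        self.link_texts = []    # anchor text of each external link
        self.link_domains = []  # domain of each external link
        self._in_title = False
        self._anchor = None     # text buffer while inside an external <a>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            name = (attrs.get("name") or "").lower()
            if name in ("keywords", "description") and attrs.get("content"):
                self.text_parts.append(attrs["content"])
        elif tag == "a":
            domain = urlparse(attrs.get("href") or "").netloc
            if domain and domain != self.hp_domain:  # assumption: external = other host
                self.link_domains.append(domain)
                self._anchor = []

    def handle_data(self, data):
        if self._in_title:
            self.text_parts.append(data)
        if self._anchor is not None:
            self._anchor.append(data)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "a" and self._anchor is not None:
            self.link_texts.append(" ".join(self._anchor).strip())
            self._anchor = None


def tokens(text):
    """Lowercase alphanumeric tokens."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]


def jaccard_distance(a, b):
    """Stand-in for WMD (NOT WMD itself): 1 - |A & B| / |A | B| over token sets."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return 1.0 - len(sa & sb) / len(union) if union else 0.0


def homepage_features(hp, hp_domain, distance=jaccard_distance):
    """Return (text distance, domain distance) for one homepage's HTML."""
    wmd_text = wmd_domain = 0.0
    if not hp:
        return wmd_text, wmd_domain
    parser = HomepageParser(hp_domain)
    parser.feed(hp)
    hp_text = tokens(" ".join(parser.text_parts))
    if hp_text and parser.link_domains:
        wmd_text = distance(hp_text, tokens(" ".join(parser.link_texts)))
        wmd_domain = distance(tokens(hp_domain),
                              tokens(" ".join(parser.link_domains)))
    # Assumption: with no homepage text or no external links, both stay 0.
    return wmd_text, wmd_domain
```

To use an actual Word Mover's Distance, any callable that takes two token lists can be passed as `distance`; for example, a loaded gensim `KeyedVectors` model provides a `wmdistance(document1, document2)` method with that shape.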