Research Article

Detecting Web Spam Based on Novel Features from Web Page Source Code

Algorithm 1: Calculating semantic similarity features from the homepage.
Require:
(1) hp: homepage of each domain
(2) hp_domain: homepage's domain
(3) Initialize the lists link_texts and link_domains, and set them to null
(4) if (hp is not null) then
(5)   hp_text := ExtractText(hp)
(6)   /* extract hp's title, keywords, description */
(7)   links := Collectlinks(hp)
(8)   /* collect all external links in hp */
(9)   if (hp_text and links are not null) then
(10)    for each link in links do
(11)      append ExtractLinkText(link) to link_texts
(12)      append ExtractDomain(link) to link_domains
(13)      /* extract link's anchor text and domain */
(14)    end for
(15)    sim_text := WMD(hp_text, link_texts)
(16)    /* compute the WMD distance between hp's text and the external links' anchor text */
(17)    sim_domain := WMD(hp_domain, link_domains)
(18)    /* compute the WMD distance between hp's domain and the external links' domains */
(19)  else if (links is not null and hp_text is null) then
(20)    sim_text := 0
(21)  else
(22)    sim_text := sim_domain := 0
(23)  end if
(24) end if
(25) return sim_text, sim_domain
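
The control flow of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `HomepageParser`, `similarity_features`, and `jaccard_distance` are hypothetical, a link counts as external when its host differs from the homepage's domain, and both features default to 0 whenever the listing does not compute them. The distance function is passed in as a parameter; the toy Jaccard distance below is only a stand-in so the sketch runs without word embeddings. With a trained embedding model, a real Word Mover's Distance such as gensim's `KeyedVectors.wmdistance` could be supplied instead.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class HomepageParser(HTMLParser):
    """Collects title/keywords/description text and external links (hypothetical helper)."""

    def __init__(self, hp_domain):
        super().__init__()
        self.hp_domain = hp_domain
        self.text_parts = []   # title, meta keywords, meta description
        self.links = []        # (anchor_text, link_domain) pairs
        self._in_title = False
        self._current = None   # [domain, anchor text pieces] of the open external <a>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and (attrs.get("name") or "").lower() in ("keywords", "description"):
            self.text_parts.append(attrs.get("content") or "")
        elif tag == "a":
            domain = urlparse(attrs.get("href") or "").netloc
            # Assumption: "external" means the host differs from the homepage's domain.
            if domain and domain != self.hp_domain:
                self._current = [domain, []]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "a" and self._current is not None:
            domain, pieces = self._current
            self.links.append((" ".join(pieces).strip(), domain))
            self._current = None

    def handle_data(self, data):
        if self._in_title:
            self.text_parts.append(data)
        elif self._current is not None:
            self._current[1].append(data.strip())


def jaccard_distance(tokens_a, tokens_b):
    """Toy stand-in for WMD: 1 - |A ∩ B| / |A ∪ B| over token sets."""
    a, b = set(tokens_a), set(tokens_b)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


def similarity_features(hp, hp_domain, distance):
    """Return (sim_text, sim_domain) following the branches of Algorithm 1."""
    sim_text = sim_domain = 0.0   # assumed default for every branch that computes nothing
    if hp is None:
        return sim_text, sim_domain

    parser = HomepageParser(hp_domain)
    parser.feed(hp)
    hp_text = " ".join(p.strip() for p in parser.text_parts if p.strip())
    links = parser.links

    if hp_text and links:
        link_texts = " ".join(text for text, _ in links)
        link_domains = [domain for _, domain in links]
        # Homepage text vs. anchor text of all external links.
        sim_text = distance(hp_text.split(), link_texts.split())
        # Homepage domain vs. domains of all external links.
        sim_domain = distance([hp_domain], link_domains)
    return sim_text, sim_domain


if __name__ == "__main__":
    page = (
        "<html><head><title>cheap shoes</title>"
        '<meta name="keywords" content="shoes sale"></head>'
        '<body><a href="http://other.example/x">cheap shoes</a>'
        '<a href="/local">home</a></body></html>'
    )
    print(similarity_features(page, "shop.example", jaccard_distance))
```

Passing the distance as a callable keeps the parsing and branching logic independent of the embedding model, so the same function can be exercised with a cheap stand-in during testing and with an embedding-based distance in practice.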