Research Article
Detecting Web Spam Based on Novel Features from Web Page Source Code
Algorithm 1
Calculating semantic similarity features from the homepage.
Require:
    hp: homepage of each domain
    hp_domain: homepage's domain
Initialize the lists anchor_texts and link_domains, and set sim_text and sim_domain to null
if (hp is not null) then
    hp_text := ExtractText (hp)    /* extract hp's title, keywords, description */
    links := Collectlinks (hp)    /* collect all external links in hp */
    if (hp_text and links are not null) then
        for each link in links do
            anchor_texts := ExtractLinkText (link)    /* extract link's anchor text */
            link_domains := ExtractDomain (link)
        end for
        sim_text = WMD (hp_text, anchor_texts)    /* compute the WMD distance between hp's text and the external links' anchor text */
        sim_domain = WMD (hp_domain, link_domains)    /* compute the WMD distance between hp's domain and the external links' domains */
    else if (links is not null and hp_text is null) then
        sim_text = 0
    else
        sim_text = sim_domain = 0
    end if
end if
return sim_text, sim_domain
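The control flow of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `HomepageParser`, `tokenize`, `jaccard_distance`, and `semantic_similarity_features` are hypothetical, a link is treated as external whenever its hostname differs from the homepage's domain, and domains are tokenized by splitting on non-alphanumeric characters. Because WMD needs pretrained word embeddings, the sketch takes the distance function as a parameter and defaults to a toy Jaccard distance that is only a stand-in for WMD. The branch that handles links without homepage text follows one plausible reading of the algorithm.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urlparse


class HomepageParser(HTMLParser):
    """Collects title/keywords/description text and (href, anchor text) pairs."""

    def __init__(self):
        super().__init__()
        self.text_parts = []  # title, meta keywords, meta description
        self.links = []       # [href, anchor_text] for every <a href=...>
        self._in_title = False
        self._in_anchor = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        name = (attrs.get("name") or "").lower()
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and name in ("keywords", "description"):
            self.text_parts.append(attrs.get("content") or "")
        elif tag == "a" and attrs.get("href"):
            self.links.append([attrs["href"], ""])
            self._in_anchor = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "a":
            self._in_anchor = False

    def handle_data(self, data):
        if self._in_title:
            self.text_parts.append(data)
        elif self._in_anchor:
            self.links[-1][1] += data


def tokenize(text):
    """Lowercase and split on anything that is not a letter or digit."""
    return re.findall(r"[a-z0-9]+", text.lower())


def jaccard_distance(a, b):
    """Toy stand-in for WMD (assumption): 1 - |A & B| / |A | B| on token sets."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 0.0
    return 1.0 - len(sa & sb) / len(sa | sb)


def semantic_similarity_features(hp, hp_domain, distance=jaccard_distance):
    """Return (sim_text, sim_domain) following the branches of Algorithm 1."""
    sim_text = sim_domain = None
    if not hp:
        return sim_text, sim_domain

    parser = HomepageParser()
    parser.feed(hp)
    hp_tokens = tokenize(" ".join(parser.text_parts))

    # Keep only links whose hostname differs from the homepage's domain.
    external = []
    for href, anchor in parser.links:
        link_domain = urlparse(href).hostname
        if link_domain and link_domain != hp_domain.lower():
            external.append((link_domain, anchor))

    if hp_tokens and external:
        anchor_tokens = [t for _, anchor in external for t in tokenize(anchor)]
        domain_tokens = [t for dom, _ in external for t in tokenize(dom)]
        sim_text = distance(hp_tokens, anchor_tokens)
        sim_domain = distance(tokenize(hp_domain), domain_tokens)
    elif external:
        # Links exist but the homepage has no title/keywords/description.
        sim_text = 0
    else:
        sim_text = sim_domain = 0
    return sim_text, sim_domain
```

To use an actual Word Mover's Distance, one option is to pass a callable that wraps gensim's `KeyedVectors.wmdistance`, which accepts two token lists, as the `distance` argument; this requires a pretrained embedding model that the sketch does not supply.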