Research Article
Detecting Web Spam Based on Novel Features from Web Page Source Code
Algorithm 1
Calculating semantic similarity features from the homepage.
| Require: | |
| --- | --- |
| (1) | hp: homepage of each domain |
| (2) | : homepage’s domain |
| (3) | Initialize the list , and set to null |
| (4) | if (hp is not null) then |
| (5) | := ExtractText(hp) |
| (6) | /* extract hp’s title, keywords, description */ |
| (7) | := Collectlinks(hp) |
| (8) | /* collect all external links in hp */ |
| (9) | if ( and is not null) then |
| (10) | for each link do |
| (11) | := ExtractLinkText(link) |
| (12) | := ExtractDomain(link) |
| (13) | /* extract link’s anchor text */ |
| (14) | end for |
| (15) | = WMD() |
| (16) | /* compute the WMD distance between hp’s text and the external links’ anchor text */ |
| (17) | = WMD(, ) |
| (18) | /* compute the WMD distance between hp’s domain and the external links’ domains */ |
| (19) | else if ( is not null and is null) then |
| (20) | = 0 |
| (21) | else |
| (22) | = = 0 |
| (23) | end if |
| (24) | end if |
| (25) | return , |
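The control flow of Algorithm 1 can be illustrated with a minimal Python sketch. It is not the authors' implementation: the variable symbols in the listing were lost, so every name here (`HomepageParser`, `similarity_features`, `jaccard_distance`, `tokens`) is hypothetical. To stay self-contained, the sketch replaces WMD with a token-level Jaccard distance as a stand-in. It also assumes that a missing homepage yields no features, and that a feature defaults to 0 whenever one of its inputs is empty, which is only one possible reading of steps (19) to (22).

```python
import re
from html.parser import HTMLParser
from urllib.parse import urlparse


class HomepageParser(HTMLParser):
    """Collects title/keywords/description text and external links with anchor text."""

    def __init__(self, domain):
        super().__init__()
        self.domain = domain
        self.text_parts = []   # title, meta keywords, meta description
        self.links = []        # (link_domain, anchor_text) for external links only
        self._in_title = False
        self._href = None      # href of the <a> element currently open, if any
        self._anchor = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and (attrs.get("name") or "").lower() in ("keywords", "description"):
            self.text_parts.append(attrs.get("content") or "")
        elif tag == "a" and attrs.get("href"):
            self._href, self._anchor = attrs["href"], []

    def handle_data(self, data):
        if self._in_title:
            self.text_parts.append(data)
        if self._href is not None:
            self._anchor.append(data)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "a" and self._href is not None:
            host = urlparse(self._href).hostname
            # Simplification: a link is "external" if its host differs from the domain.
            if host and host != self.domain:
                self.links.append((host, " ".join(self._anchor).strip()))
            self._href = None


def tokens(s):
    """Lowercase alphanumeric tokens; also splits domains on dots and hyphens."""
    return re.findall(r"[a-z0-9]+", s.lower())


def jaccard_distance(a, b):
    """Stand-in for WMD (assumption): 1 - |A and B| / |A or B| over token sets."""
    ta, tb = set(tokens(a)), set(tokens(b))
    if not ta or not tb:
        return 1.0
    return 1.0 - len(ta & tb) / len(ta | tb)


def similarity_features(hp, domain, distance=jaccard_distance):
    """Returns (text_distance, domain_distance) for one homepage."""
    if not hp:
        return None, None          # assumption: no homepage, no features
    parser = HomepageParser(domain)
    parser.feed(hp)
    page_text = " ".join(parser.text_parts).strip()
    anchor_text = " ".join(text for _, text in parser.links).strip()
    link_domains = " ".join(host for host, _ in parser.links)

    text_dist = domain_dist = 0.0  # assumption: default when inputs are missing
    if page_text and anchor_text:
        text_dist = distance(page_text, anchor_text)
    if parser.links:
        domain_dist = distance(domain, link_domains)
    return text_dist, domain_dist
```

The `distance` parameter is kept injectable so that a real WMD implementation can replace the stand-in, for instance a small wrapper around gensim's `KeyedVectors.wmdistance`, which expects token lists and pretrained word vectors.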