Research Article
Edge-Based Detection and Classification of Malicious Contents in Tor Darknet Using Machine Learning
| Input: Web Set |
| Output: Corpus set ([0], []) |
| (1) | for each page in the Web Set do |
| (2) | content = obtain the HTML content of the page |
| (3) | use the “lxml()” function to parse the content, then remove HTML tags, scripts, and so on |
| (4) | text = preserve the text content displayed on the page |
| (5) | for each character in text do |
| (6) | if the character = ‘’ or the character = ‘’ then |
| (7) | the character = ‘’ |
| (8) | end if |
| (9) | end for |
| (10) | Lowercase all English words |
| (11) | for each character in text do |
| (12) | if the character is a punctuation mark or a number then |
| (13) | the character = ‘’ |
| (14) | end if |
| (15) | end for |
| (16) | PorterStemmer(text) // Unify all word formats |
| (17) | word_list = text.split(‘’) |
| (18) | for i = 1 TO len(word_list) do |
| (19) | if word_list[i] ∈ stopwords or len(word_list[i]) is outside the range 2 to 12 then |
| (20) | delete word_list[i] |
| (21) | end if |
| (22) | end for |
| (23) | add word_list.join(‘’) to the Corpus set // Words are concatenated into a string |
| (24) | end for |
| (25) | return Corpus set |
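The preprocessing steps listed above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses the standard-library `html.parser` in place of lxml so that it runs without extra dependencies, the stop-word list is a small hypothetical sample, the stemmer is a pluggable `stem` argument that defaults to no stemming (a Porter stemmer such as NLTK's `PorterStemmer().stem` could be passed instead), and the length bounds 2 and 12 are treated as inclusive. The names `preprocess_page` and `build_corpus` are introduced here for illustration only.

```python
import string
from html.parser import HTMLParser

# Hypothetical stop-word sample (assumption); the source does not list its stop words.
STOPWORDS = frozenset({"the", "a", "an", "and", "or", "of", "to", "in", "is", "on", "for"})


class _VisibleText(HTMLParser):
    """Collects text nodes while skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def preprocess_page(html, stem=lambda w: w, stopwords=STOPWORDS, min_len=2, max_len=12):
    """Turn one HTML page into a cleaned, space-joined string of words."""
    # Steps 2-4: parse the page and keep only the displayed text.
    parser = _VisibleText()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.parts)
    # Steps 5-10: collapse whitespace (assumed intent of the replacement loop), lowercase.
    text = " ".join(text.split()).lower()
    # Steps 11-15: replace ASCII punctuation and digits with spaces.
    drop = string.punctuation + string.digits
    text = text.translate(str.maketrans(drop, " " * len(drop)))
    # Steps 16-17: stem each word (identity by default) after splitting on whitespace.
    words = [stem(w) for w in text.split()]
    # Steps 18-22: drop stop words and words whose length is outside [min_len, max_len].
    kept = [w for w in words if w not in stopwords and min_len <= len(w) <= max_len]
    # Step 23: concatenate the remaining words into one string.
    return " ".join(kept)


def build_corpus(pages, **kwargs):
    """Apply preprocess_page to every page of the web set (steps 1, 24, 25)."""
    return [preprocess_page(page, **kwargs) for page in pages]
```

Stemming each word after the split, rather than stemming the whole text before it as in step (16), is a design choice made here because word-level stemmers operate on single tokens. The stop-word and length filters still run on the stemmed words, as in the listing.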