Research Article

Content Deduplication with Granularity Tweak Based on Base and Deviation for Large Text Dataset

Table 1

Documents and bag of words.

D. no    Documents

D1       John has some cats
D2       Cats eat a fish
D3       Shipment of gold damaged in a fire
D4       Shipment of gold arrived in a truck
D5       Romeo Juliet
D6       Romeo died by dagger
D7       I eat a fish
D8       All that glitters is not gold
D9       Money makes many

Bag of words: {“arrived: T1”, “cats: T2”, “dagger: T3”, “damaged: T4”, “died: T5”, “eat: T6”, “fish: T7”, “glitters: T8”, “gold: T9”, “John: T10”, “Juliet: T11”, “makes: T12”, “money: T13”, “Romeo: T14”, “shipment: T15”, “truck: T16”}.
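As an illustration of how the documents in Table 1 relate to the bag of words, the following minimal sketch maps each document to the term identifiers listed above. It is not the article's implementation: the names `DOCUMENTS`, `VOCABULARY`, `TERM_IDS`, and `to_term_ids` are hypothetical, and the sketch assumes whitespace tokenization with case-insensitive matching. Words that do not appear in the listed bag of words are simply skipped.

```python
# Documents as given in Table 1.
DOCUMENTS = {
    "D1": "John has some cats",
    "D2": "Cats eat a fish",
    "D3": "Shipment of gold damaged in a fire",
    "D4": "Shipment of gold arrived in a truck",
    "D5": "Romeo Juliet",
    "D6": "Romeo died by dagger",
    "D7": "I eat a fish",
    "D8": "All that glitters is not gold",
    "D9": "Money makes many",
}

# Terms in the order listed in the bag of words (lowercased for matching;
# lowercasing is an assumption of this sketch).
VOCABULARY = [
    "arrived", "cats", "dagger", "damaged", "died", "eat", "fish",
    "glitters", "gold", "john", "juliet", "makes", "money", "romeo",
    "shipment", "truck",
]

# Assign T1..T16 following the listed order.
TERM_IDS = {term: f"T{i}" for i, term in enumerate(VOCABULARY, start=1)}


def to_term_ids(text):
    """Return the IDs of vocabulary terms in `text`, in order of appearance."""
    return [TERM_IDS[w] for w in text.lower().split() if w in TERM_IDS]


# Term-ID representation of every document in Table 1.
doc_terms = {doc_id: to_term_ids(text) for doc_id, text in DOCUMENTS.items()}

if __name__ == "__main__":
    for doc_id, terms in doc_terms.items():
        print(doc_id, terms)
```

Under these assumptions, D2 ("Cats eat a fish") and D7 ("I eat a fish") share the terms T6 and T7, which is the kind of overlap a deduplication scheme can build on.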