Input: - The set of confidential documents
- The set of non-confidential documents
- The minimum similarity threshold
Output: - The set of clusters, each with its centroid and corresponding graph
- The set of confidential terms in the clusters
- The set of context terms
(1)
(2) % The result of clustering is saved in
(3) Initializing % The scores of confidential terms are saved in
(4) Initializing % The set of context terms of each confidential term is saved in
(5) for (each cluster in )
(6) Calculate the similarity between the current cluster and the other clusters
(7) Create a language model for the current cluster, and calculate the score of each confidential term
(8) Initialize the threshold of cluster similarity
(9) while () |
(10) All clusters whose similarity to > |
(11) Create language model for the documents of |
(12) Based on the new language model, update the scores of the confidential terms
(13) for (each confidential term ct in )
(14) Detect the occurrences of ct in
(15) For each context term of ct, calculate the probability that ct and the context term both appear in confidential documents.
(16) For each context term of ct, calculate the probability that ct and the context term both appear in non-confidential documents.
(17) Calculate the value of for each confidential term |
(18) Detect all clusters whose similarity is greater than , and detect the occurrences of all terms in the clusters. |
(19) Update the probability of the context terms that appear in the scopes of different confidential terms |
(20) |
(21) Reduce the value of |
(22) |
(23) |
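
The training procedure above can be illustrated with a minimal Python sketch. Several details are assumptions, because the listing does not state them: the term score is taken to be a smoothed ratio of a term's relative frequency in confidential versus non-confidential documents, the context-term value in step (17) is taken to be the ratio of the two co-occurrence probabilities from steps (15) and (16), cluster similarity is cosine similarity between centroids, and the loop runs until the threshold falls below the minimum similarity threshold. The names `term_scores`, `cooccurrence_prob`, `context_scores`, `train`, `window`, `step`, and `top_k` are hypothetical. Clustering itself (steps 1 and 2) is assumed to have been done already.

```python
import math
from collections import Counter


def tokens(doc):
    # Naive whitespace tokenizer (assumption; the listing does not specify one).
    return doc.lower().split()


def cosine(a, b):
    # Cosine similarity between two term-frequency Counters.
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def term_scores(conf_docs, nonconf_docs):
    # Steps (7) and (12), assumed form: add-one smoothed ratio of a term's
    # relative frequency in confidential vs. non-confidential documents.
    conf = Counter(t for d in conf_docs for t in tokens(d))
    non = Counter(t for d in nonconf_docs for t in tokens(d))
    n_conf, n_non = sum(conf.values()), sum(non.values())
    vocab = len(set(conf) | set(non))
    return {
        t: ((c + 1) / (n_conf + vocab)) / ((non[t] + 1) / (n_non + vocab))
        for t, c in conf.items()
    }


def cooccurrence_prob(ct, docs, window):
    # Steps (14)-(16): fraction of documents in which each context term
    # appears within `window` tokens of the confidential term ct.
    counts = Counter()
    for d in docs:
        toks = tokens(d)
        seen = set()
        for i, t in enumerate(toks):
            if t == ct:
                seen.update(toks[max(0, i - window):i])
                seen.update(toks[i + 1:i + 1 + window])
        seen.discard(ct)
        counts.update(seen)
    n = len(docs)
    return {t: c / n for t, c in counts.items()} if n else {}


def context_scores(ct, conf_docs, nonconf_docs, window=3, eps=1e-6):
    # Step (17), assumed form: ratio of the two co-occurrence probabilities.
    p_conf = cooccurrence_prob(ct, conf_docs, window)
    p_non = cooccurrence_prob(ct, nonconf_docs, window)
    return {t: p / (p_non.get(t, 0.0) + eps) for t, p in p_conf.items()}


def train(clusters, min_threshold, step=0.25, top_k=2):
    # clusters: list of dicts with keys 'centroid' (Counter), 'conf', 'nonconf'.
    results = []
    for c in clusters:                                   # step (5)
        scores = term_scores(c["conf"], c["nonconf"])    # steps (6)-(7)
        context = {}
        t = 1.0                                          # step (8)
        while t >= min_threshold:                        # step (9), assumed condition
            near = [o for o in clusters                  # step (10)
                    if o is c or cosine(c["centroid"], o["centroid"]) > t]
            conf = [d for o in near for d in o["conf"]]
            non = [d for o in near for d in o["nonconf"]]
            scores = term_scores(conf, non)              # steps (11)-(12)
            best = sorted(scores, key=scores.get, reverse=True)[:top_k]
            for ct in best:                              # steps (13)-(19)
                context.setdefault(ct, {}).update(context_scores(ct, conf, non))
            t -= step                                    # step (21)
        results.append({"scores": scores, "context": context})
    return results
```

Lowering the threshold on each pass widens the neighborhood of clusters whose documents feed the language model, so later passes re-estimate the term and context scores against a broader background. The sketch overwrites a context term's score on each pass; the listing's update rule in step (19) is not recoverable and may differ.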