| Input: - Confidential documents set |
| - Non-confidential documents set |
| - The minimum similarity threshold |
| Output: - The set of clusters, each with the centroid and corresponding graph |
| - The set of confidential terms in clusters |
| - The set of context terms |
| (1) |
| (2) % The result of clustering is saved in |
| (3) Initializing % The scores of confidential terms are saved in |
| (4) Initializing % The context terms set of each confidential term is saved in |
| (5) for (each in ) |
| (6) Calculate the similarity between and the other clusters |
| (7) Create language model for , and calculate the scores for each confidential term |
| (8) Initialize the threshold of cluster similarity |
| (9) while () |
| (10) All clusters whose similarity to > |
| (11) Create language model for the documents of |
| (12) Based on the new language model, update the scores of confidential terms |
| (13) for (each confidential term ct in ) |
| (14) Detect the occurrence of ct in |
| (15) For each context term of ct, calculate the probability that both ct and the context term appear in confidential documents. |
| (16) For each context term of ct, calculate the probability that both ct and the context term appear in non-confidential documents. |
| (17) Calculate the value of for each confidential term |
| (18) Detect all clusters whose similarity is greater than , and detect the occurrences of all terms in the clusters. |
| (19) Update the probability of the context terms that appear in the scopes of different confidential terms |
| (20) |
| (21) Reduce the value of |
| (22) |
| (23) |
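
Two parts of the procedure above can be illustrated with a minimal Python sketch: the co-occurrence probabilities of steps (15) and (16), and the loop of steps (8) to (21) that gradually lowers the cluster-similarity threshold. Everything named here is an assumption made for illustration and does not come from the algorithm listing: the function names `cooccurrence_probabilities` and `relaxed_cluster_sets`, the fixed token `window` used to decide what counts as a context term, whitespace tokenization, the fixed `step` by which the threshold is reduced, and the stopping condition (the threshold falling below the minimum similarity threshold). The probability is estimated simply as the fraction of documents in which the context term appears near the confidential term; the original method may use a different estimator.

```python
from collections import Counter


def cooccurrence_probabilities(ct, documents, window=2):
    """Estimate, for each term found within `window` tokens of `ct`,
    the fraction of documents in which it appears near `ct` at least once.

    Assumption: documents are plain strings split on whitespace.
    """
    counts = Counter()
    for doc in documents:
        tokens = doc.lower().split()
        near = set()  # context terms seen near ct in this document
        for i, tok in enumerate(tokens):
            if tok != ct:
                continue
            lo, hi = max(0, i - window), i + window + 1
            near.update(t for t in tokens[lo:hi] if t != ct)
        counts.update(near)  # each document counts a term at most once
    n = len(documents)
    return {term: c / n for term, c in counts.items()} if n else {}


def relaxed_cluster_sets(similarities, start, minimum, step):
    """Yield (threshold, clusters whose similarity exceeds it) while the
    threshold is lowered from `start` down to `minimum`.

    Assumption: the loop stops once the threshold drops below `minimum`.
    """
    threshold = start
    while threshold >= minimum:
        selected = sorted(c for c, s in similarities.items() if s > threshold)
        yield threshold, selected
        threshold -= step  # corresponds to "reduce the value of the threshold"


if __name__ == "__main__":
    confidential = [
        "the secret merger plan",
        "merger plan approved today",
        "weather report",
    ]
    public = ["merger news from abroad", "weather report"]

    # Steps (15) and (16): the same estimate on both document sets.
    print(cooccurrence_probabilities("merger", confidential, window=1))
    print(cooccurrence_probabilities("merger", public, window=1))

    # Steps (8) to (21): more clusters are included as the threshold drops.
    sims = {"c1": 0.9, "c2": 0.6, "c3": 0.3}
    for threshold, clusters in relaxed_cluster_sets(sims, 0.75, 0.25, 0.25):
        print(threshold, clusters)
```

A context term that co-occurs with a confidential term far more often in confidential documents than in non-confidential ones is the kind of evidence steps (15) to (17) are meant to capture; how the two probabilities are combined into a score is left out of the sketch because the listing does not preserve that formula.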