Abstract

Semantic mining of big biomedical text data remains a challenge. Ontologies have been widely validated and used to extract semantic information. However, ontology-based semantic similarity calculation is so computationally complex that it cannot scale to big text data. To solve this problem, we propose a parallelized semantic similarity measurement method based on Hadoop MapReduce for big text data. First, we preprocess the documents and extract their semantic features. Then, we calculate the document semantic similarity based on the ontology network structure under the MapReduce framework. Finally, document clusters are generated by applying clustering algorithms to the resulting semantic document similarities. To validate the effectiveness, we use two kinds of open datasets. The experimental results show that traditional methods can hardly handle more than ten thousand biomedical documents, whereas the proposed method remains efficient and accurate on big datasets and offers high parallelism and scalability.

1. Introduction

Recently, researchers have paid much attention to semantic information discovery. Semantic data mining has been introduced into various fields of text mining, such as text clustering [1, 2], text classification [3, 4], information extraction [5-7], named entity recognition [8-10], and sentiment analysis [11-13]. Machine learning is the most commonly used approach in text mining. In recent research on text classification, ensemble strategies are often applied because they can capture multiple characteristics of complex text data [14-17].

For text clustering, the continuous growth of data scale makes it challenging to mine the information hidden in big text data. Since similarities between texts are required before clustering, it is imperative to explore effective methods for computing similarity in the big data setting [18].

Document clustering is an important application in the text clustering domain that helps people navigate documents of interest conveniently [19, 20]. Detecting text similarity is of great importance in document clustering, as it directly affects clustering performance. Numerous similarity detection methods have been proposed, including vector-based [21-23] and ontology-based [24, 25] ones. Vector-based methods convert the text into a vector representation and take the cosine similarity between vectors as the text similarity. Ontology-based methods use a structural knowledge representation network to describe the meanings of concepts and the relationships between them. Since vector-based methods ignore the semantic information between words, ontology-based methods currently attract much attention [26].

An ontology is a hierarchical structure in which concepts are represented as nodes connected by relationships such as "is a" and "part of." Thus, the semantic similarity between concepts can be quantified in an ontology by measuring the correlation between nodes in the structure. Existing ontology-based semantic similarity measurements can be divided into four categories. The first type is path-based, which takes the path distance between nodes in the structure as a measure of correlation. Bulskov et al. [27] used the path length between two nodes in the ontology, computed over the edges connecting the nodes. Wang [28] extended Bulskov's method by assigning precomputed weights to the edges. The second type is information content (IC) based. Information content is the amount of information that a concept expresses, which can be computed from the ontology and a corpus. The more frequently a concept occurs, the less information content it carries. Resnik [29] took the IC value of the least common ancestor (LCA) of the two nodes as the semantic similarity. Lin [30] extended the method by normalizing the IC value of the LCA by the IC values of both nodes. The third type is depth-based. Leacock and Chodorow [31] and Li et al. [32] took the depth of nodes in the ontology into account, since the depth of a node reflects the specificity of its information. The fourth type is hybrid. Hybrid methods use more than one class of information. Jiang and Conrath [33] and Zhao and Wang [34] combined the IC and the depth of nodes to compute the similarity.

In the domain of biomedical text mining, Medical Subject Headings (MeSH) is one of the most commonly used ontologies; as of 2020, it contains 29,638 MeSH headings arranged hierarchically in a tree structure [35, 36].

Nowadays, with the rapid development of biomedicine, the amount of biomedical literature grows rapidly. Even when people narrow the search scope, a large number of publications are retrieved. For instance, over the past five years, PubMed (http://pubmed.ncbi.nlm.nih.gov) has indexed more than 900 thousand biomedical citations matching the query "cancer" in all fields. In addition, owing to the complexity of ontology-based semantic similarity calculation, computing the semantic similarity between large numbers of documents is inefficient. Our experiments show that existing methods can hardly work with more than ten thousand documents. However, clustering is more valuable when the amount of data is larger.

To solve these problems, we propose a method based on Hadoop MapReduce. Hadoop is a framework that allows distributed processing across clusters of computers, and MapReduce is the Hadoop module that enables parallel processing of large data sets. Traditionally, document similarity is computed pair by pair, which causes redundant computation. The proposed method parallelizes the computation of document similarity in order to reduce this redundancy and to increase the amount of data that can be processed.

2. Materials and Methods

2.1. Definition

The set of documents to be clustered is denoted by $D = \{d_1, d_2, \ldots\}$. Similarly, the set of MeSH headings is denoted by $M = \{m_1, m_2, \ldots\}$. In this article, we define the MeSH headings as the semantic features of biomedical documents, since the MeSH headings describe the subject of each article in MEDLINE. Thus, we use a set of MeSH headings to represent a document: $d = \{m_1, m_2, \ldots, m_i\}$, where $i$ is the index of MeSH headings.

In the MeSH ontology, each MeSH heading is mapped to one or more nodes associated with tree numbers. The deeper a node is, the more specific the information it carries. The MeSH Tree nodes are denoted by $N = \{n_1, n_2, \ldots\}$. Similarly, a set of nodes is used to represent a MeSH heading: $m = \{n_1, n_2, \ldots, n_j\}$, where $j$ is the index of nodes.

Table 1 shows an example of a document with MeSH headings and corresponding tree number of nodes.

Define a function $\mathrm{Sim}(\cdot, \cdot)$ that outputs the similarity between its two inputs. For example, $\mathrm{Sim}(n_1, n_2)$ outputs the similarity between two nodes, and $\mathrm{Sim}(m, d)$ outputs the similarity of a MeSH heading to a document. Define $\mathrm{LCA}(n_1, n_2)$ as the function that outputs the LCA (least common ancestor) of two nodes in the MeSH ontology.

In the MapReduce programming model, data is represented as key-value pairs. A key-value pair is denoted by $\langle k, v \rangle$. Generally, a MapReduce task consists of three stages: map, shuffle, and reduce. The input file is first divided into multiple splits through the input format, and each split is assigned a map task. The map task processes the input file line by line and outputs intermediate key-value pairs: $\langle k_1, v_1 \rangle \rightarrow \mathrm{list}(\langle k_2, v_2 \rangle)$. Shuffle is the process after the map task: it copies data from the map tasks to the reduce tasks, sorts the data by key, and aggregates the data with the same key: $\mathrm{list}(\langle k_2, v_2 \rangle) \rightarrow \langle k_2, \mathrm{list}(v_2) \rangle$. The reduce task processes the shuffled data line by line and outputs new key-value pairs: $\langle k_2, \mathrm{list}(v_2) \rangle \rightarrow \mathrm{list}(\langle k_3, v_3 \rangle)$.
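
For concreteness, the following minimal Python sketch (ours; the word-count example and all identifiers are illustrative) simulates the three stages on a single machine. It shows only the data flow of the model, not Hadoop itself:

from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    """Simulate map -> shuffle -> reduce over a list of <k, v> pairs."""
    # Map: each input pair may emit any number of intermediate pairs.
    intermediate = []
    for k, v in records:
        intermediate.extend(map_fn(k, v))
    # Shuffle: sort by key and aggregate the values with the same key.
    groups = defaultdict(list)
    for k, v in intermediate:
        groups[k].append(v)
    # Reduce: each <k, list(v)> group yields new key-value pairs.
    output = []
    for k, values in sorted(groups.items()):
        output.extend(reduce_fn(k, values))
    return output

# Toy word count: <line number, text> -> <word, count>.
records = [(0, "mesh heading"), (1, "mesh term")]
print(run_mapreduce(
    records,
    map_fn=lambda _, line: [(w, 1) for w in line.split()],
    reduce_fn=lambda k, vs: [(k, sum(vs))],
))  # [('heading', 1), ('mesh', 2), ('term', 1)]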

2.2. Overview

The workflow of the proposed method is shown in Figure 1. The input is the set of biomedical documents, and the output is the document clusters. The details are as follows:

(1) Preprocessing. The first step is to extract the semantic features of each document. The second step is to transform the data with MapReduce so that documents sharing the same semantic feature are grouped together.

(2) MapReduce-Based Semantic Similarity Calculation. Calculate the MeSH heading similarities in advance and then calculate the document similarities with the average maximum match.

(3) Document Clustering. Apply a clustering algorithm to the document similarities. In this article, we perform k-means, agglomerative clustering, and spectral clustering, respectively, over the document similarity.

2.3. Preprocessing

In MEDLINE, each document is associated with a unique PubMed ID (PMID) and is tagged with several MeSH headings. Since the MeSH headings describe the subject of a document, they can be viewed as its semantic features. Furthermore, the semantic similarity between documents can be represented by the semantic similarity between their sets of MeSH headings; Zhu et al. [37] and Zhou et al. [38] have demonstrated the feasibility of this approach. Therefore, we first extract the corresponding MeSH headings of the documents through Efetch in NCBI. To group together documents that share the same semantic features, we transform the input documents, denoted by $\langle d, \mathrm{list}(m) \rangle$, into the format $\langle m, \mathrm{list}(d) \rangle$, which means that the documents in $\mathrm{list}(d)$ all contain the same MeSH heading $m$. The output is denoted as $\langle m, \mathrm{list}(d) \rangle$, where $m$ is a MeSH heading.
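
As a sketch of the extraction step, the snippet below uses Biopython's Entrez and Medline modules, one possible client for NCBI Efetch (the e-mail address and the PMID are placeholders, and network access is assumed). In MEDLINE format, the MeSH headings appear in the MH field:

# Fetch the MeSH headings (MH field) of the given PMIDs via NCBI Efetch.
from Bio import Entrez, Medline

Entrez.email = "your.name@example.org"  # placeholder required by NCBI policy

def fetch_mesh_headings(pmids):
    """Return <d, list(m)>: a dict mapping each PMID to its MeSH headings."""
    handle = Entrez.efetch(db="pubmed", id=",".join(pmids),
                           rettype="medline", retmode="text")
    # Qualifiers ("Neoplasms/drug therapy") and major-topic markers ("*")
    # are stripped so that only the heading itself remains.
    return {rec["PMID"]: [mh.split("/")[0].lstrip("*")
                          for mh in rec.get("MH", [])]
            for rec in Medline.parse(handle)}

doc_to_headings = fetch_mesh_headings(["31452104"])  # example PMID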

The data transformation algorithm is as follows:

Data transformation
Input: <d, list(m)>
Output: <m, list(d)>
Notation: Write (k, v) outputs <k, v>
Class mapper
 Method map (d, list(m))
  For each m ∈ list(m)
   Write (m, d)
  End for
Class reducer
 Method reduce (m, list(d))
   s ← string(list(d))
   Write (m, s)
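
A single-machine Python equivalent of this job, assuming the documents are given as a dict from PMID to MeSH headings, might look as follows:

from collections import defaultdict

def invert_index(doc_to_headings):
    """Data transformation: <d, list(m)> -> <m, list(d)>.

    Groups the documents that share each MeSH heading, mirroring the
    map stage (Write (m, d)) and the reduce stage (concatenation) above."""
    heading_to_docs = defaultdict(list)
    for d, headings in doc_to_headings.items():
        for m in headings:             # map: Write (m, d)
            heading_to_docs[m].append(d)
    return dict(heading_to_docs)       # reduce: <m, list(d)>

docs = {"d1": ["Neoplasms", "Humans"], "d2": ["Neoplasms"]}
print(invert_index(docs))  # {'Neoplasms': ['d1', 'd2'], 'Humans': ['d1']}
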
2.4. Semantic Similarity Calculation

To compute the semantic contribution of each MeSH heading to a document, we use Wang's average maximum match (AMM) strategy [39]. In Wang's study, the AMM strategy was used to compute the semantic similarity between two sets of Gene Ontology (GO) terms. Since the AMM strategy can accurately detect the similarity between sets of semantic features, we apply it to compute the semantic similarity between two sets of MeSH headings. The AMM strategy is defined as follows:

$$\mathrm{Sim}(d_1, d_2) = \frac{\sum_{m \in d_1} \mathrm{Sim}(m, d_2) + \sum_{m \in d_2} \mathrm{Sim}(m, d_1)}{|d_1| + |d_2|}, \qquad \mathrm{Sim}(m, d) = \max_{m' \in d} \mathrm{Sim}(m, m'),$$

and, analogously, at the heading level,

$$\mathrm{Sim}(m_1, m_2) = \frac{\sum_{n \in m_1} \mathrm{Sim}(n, m_2) + \sum_{n \in m_2} \mathrm{Sim}(n, m_1)}{|m_1| + |m_2|}, \qquad \mathrm{Sim}(n, m) = \max_{n' \in m} \mathrm{Sim}(n, n'),$$

where $|m|$ returns the node number of the MeSH heading, and $|d|$ returns the MeSH heading number of the document. The node-level semantic measure is optional. The measures used in this paper are as follows:

(1) SP [27]:

$$\mathrm{Sim}_{\mathrm{SP}}(n_1, n_2) = \frac{lp - sp(n_1, n_2)}{lp},$$

where $lp$ returns the longest path length in the ontology, and $sp$ returns the shortest path length between the two nodes.

(2) WP [28]:

$$\mathrm{Sim}_{\mathrm{WP}}(n_1, n_2) = \frac{2 \cdot dep(\mathrm{LCA}(n_1, n_2))}{dep(n_1) + dep(n_2)},$$

where $dep$ returns the tree depth of the node.

(3) LC [31]:

$$\mathrm{Sim}_{\mathrm{LC}}(n_1, n_2) = -\log \frac{sp(n_1, n_2)}{2 \cdot dep_{\max}},$$

where $dep_{\max}$ is the maximum depth in the MeSH ontology.

(4) Res [29]:

$$\mathrm{Sim}_{\mathrm{Res}}(n_1, n_2) = \mathrm{IC}(\mathrm{LCA}(n_1, n_2)).$$

(5) Lin [30]:

$$\mathrm{Sim}_{\mathrm{Lin}}(n_1, n_2) = \frac{2 \cdot \mathrm{IC}(\mathrm{LCA}(n_1, n_2))}{\mathrm{IC}(n_1) + \mathrm{IC}(n_2)}.$$

(6) Sch [40]:

$$\mathrm{Sim}_{\mathrm{Sch}}(n_1, n_2) = \mathrm{Sim}_{\mathrm{Lin}}(n_1, n_2) \cdot \left(1 - p(\mathrm{LCA}(n_1, n_2))\right), \qquad p(n) = \frac{|C(n)| + 1}{T}.$$

$\mathrm{IC}(n)$ returns the IC value of the node, and $dep(n)$ returns the depth of the node in the ontology. $C(n)$ is the set of the children of the node, and $T$ is the total node number of the ontology.

We use SORA [41] to calculate the IC value. SORA is an ontology structure-based method that outperforms corpus-based methods in computation time.
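
As an illustration, the node-level measures can be sketched in Python, assuming the depth and IC value of each MeSH tree number are available as precomputed lookups (the dictionaries and constants below are toy values, not real MeSH statistics):

import math

# Hypothetical precomputed lookups over MeSH tree numbers (toy values):
DEPTH = {"C04": 1, "C04.557": 2, "C04.557.337": 3}
IC = {"C04": 0.2, "C04.557": 0.6, "C04.557.337": 0.9}
MAX_DEPTH = 13  # assumed maximum depth of the ontology

def lca(a, b):
    """LCA of two MeSH tree numbers: their longest common dotted prefix
    (an empty prefix is treated as a virtual root)."""
    common = []
    for x, y in zip(a.split("."), b.split(".")):
        if x != y:
            break
        common.append(x)
    return ".".join(common)

def sp(a, b):
    """Shortest path length between two nodes through their LCA."""
    d = len(lca(a, b).split("."))
    return (len(a.split(".")) - d) + (len(b.split(".")) - d)

def sim_wp(a, b):   # WP: depth-based measure
    return 2 * DEPTH[lca(a, b)] / (DEPTH[a] + DEPTH[b])

def sim_lc(a, b):   # LC: path length scaled by the maximum depth
    return -math.log(max(sp(a, b), 1) / (2 * MAX_DEPTH))  # guard log(0)

def sim_res(a, b):  # Res: IC of the least common ancestor
    return IC[lca(a, b)]

def sim_lin(a, b):  # Lin: Res normalized by the IC of both nodes
    return 2 * IC[lca(a, b)] / (IC[a] + IC[b])

print(sim_lin("C04.557", "C04.557.337"))  # 0.8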

According to AMM, the semantic similarity calculation is divided into MeSH heading similarity calculation and document similarity calculation. Since the MeSH heading similarities are used repeatedly when computing the document similarities, we calculate the similarity of all pairs of extracted MeSH headings in advance. The MapReduce-based MeSH heading similarity calculation algorithm is as follows:

MeSH heading similarity calculation
Input: <m, list(nodes)>
Output: <pair of m, semantic similarity>
Notation: Write (k, v) outputs <k, v>
Class mapper
 Method map (m, list(nodes))
  m1 ← MeSH heading
  For each m2 ∈ M
   r ← Sim (m1, m2)
   s ← string (m1 + " & " + m2)
   Write (s, r)
  End for
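
Under AMM, the heading-level similarity that this mapper computes can be sketched as follows, with the node-level measure passed in as a parameter (the stand-in measure in the last two lines is ours, purely for the usage example):

def sim_node_to_heading(n, m2, sim):
    """Best match of node n against heading m2 (a set of tree numbers)."""
    return max(sim(n, n2) for n2 in m2)

def sim_heading(m1, m2, sim):
    """AMM similarity between two MeSH headings given as sets of nodes."""
    total = (sum(sim_node_to_heading(n, m2, sim) for n in m1)
             + sum(sim_node_to_heading(n, m1, sim) for n in m2))
    return total / (len(m1) + len(m2))

def all_pairs(headings, sim):
    """All-pairs heading similarity, mirroring the map-only job above."""
    return {(m1, m2): sim_heading(ns1, ns2, sim)
            for m1, ns1 in headings.items()
            for m2, ns2 in headings.items()}

exact = lambda a, b: 1.0 if a == b else 0.0  # trivial stand-in measure
print(sim_heading({"C04"}, {"C04", "C14"}, exact))  # 0.666...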

Traditionally, document similarity is calculated pair by pair, which leads to a large computational cost and is the main reason why existing methods can hardly handle a large number of documents. To adapt AMM to parallel execution, we designed a MapReduce-based algorithm that calculates the document similarities in parallel. In this method, the similarity contribution of a MeSH heading to a document is viewed as the basic computation element. By splitting the semantic similarity between documents into an aggregation of multiple heading-to-document similarities, denoted as $\mathrm{Sim}(m, d)$, we realize the parallel computation of the document similarity. In addition, for each line of input, we directly output the semantic similarity of the MeSH heading to the other documents, avoiding redundant computation. The algorithm is as follows, and an example is given in Figure 2:

Document similarity calculation
Input: <m, list(d)>
Output: <pair of d, similarity>
Notation: Write (k, v) outputs <k, v>
Class mapper
 Method map (heading, list(d))
  m ← heading
  For each d1 ∈ D
   r ← Sim (m, d1)
   For each d2 in list(d)
     s ← string (d1 + " & " + d2)
    Write (s, r)
   End for
  End for
Class reducer
 Method reduce (s, list(r))
   sum ← 0, count ← 0
   For each r in list(r)
    sum ← sum + r
    count ← count + 1
   End for
   Write (s, sum/count)

Suppose that the number of documents is $|D|$, the average number of MeSH headings per document is $a$, and the total number of extracted MeSH headings is $|M|$. With the heading similarities precomputed, the time complexity of the traditional pairwise method is $O(|D|^2 \cdot a^2)$. For the proposed MapReduce-based algorithm, the time complexity of the map stage is $O(|M| \cdot |D| \cdot a + |D|^2 \cdot a)$, and the time complexity of the reduce stage is $O(|D|^2 \cdot a)$.
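
The following single-machine Python sketch mirrors the job above (all identifiers are ours). It assumes the inverted index from Section 2.3 and a lookup holding the precomputed similarity of every ordered pair of headings; the pair key is sorted so that both directions of the AMM sum aggregate under one key:

from collections import defaultdict

def document_similarity(heading_to_docs, doc_to_headings, heading_sim):
    """heading_to_docs: <m, list(d)>; doc_to_headings: <d, list(m)>;
    heading_sim: dict with the precomputed similarity of each heading pair."""
    def sim_m_d(m, d):  # best match of heading m against document d
        return max(heading_sim[(m, m2)] for m2 in doc_to_headings[d])

    # Map: each heading m emits its contribution Sim(m, d1) to every pair
    # (d1, d2) such that d2 contains m.
    groups = defaultdict(list)
    for m, docs_with_m in heading_to_docs.items():
        for d1 in doc_to_headings:            # d1 ranges over all documents
            r = sim_m_d(m, d1)
            for d2 in docs_with_m:
                if d1 != d2:                  # self-pairs are skipped
                    groups[tuple(sorted((d1, d2)))].append(r)

    # Reduce: average the contributions; each pair (d1, d2) receives
    # exactly |d1| + |d2| values, matching the AMM denominator.
    return {pair: sum(v) / len(v) for pair, v in groups.items()}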

2.5. Document Clustering

Spectral clustering [42, 43], agglomerative clustering [44, 45], and k-means [46, 47] are commonly used in text clustering. Spectral clustering is graph-based: it transforms the clustering problem into an optimal graph partition problem by treating each document as a vertex and the similarity between documents as the edge weight. The clusters are obtained by cutting the graph according to rules such as Ncut [48] and Mcut [49]. Agglomerative clustering first treats each document as a cluster and then repeatedly merges the most similar clusters. k-means proceeds through multiple iterations; in each iteration, every document is assigned to the most similar cluster until the clusters no longer change. In this paper, these three clustering algorithms are performed, respectively, over the document similarity.
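
This step can be sketched with scikit-learn as one possible backend (the method itself does not prescribe a library), assuming the document similarities have been assembled into a symmetric matrix S with entries scaled to [0, 1]:

from sklearn.cluster import AgglomerativeClustering, KMeans, SpectralClustering

def cluster_documents(S, k):
    """Run the three clustering algorithms over a similarity matrix S."""
    # Spectral clustering accepts the similarity matrix directly.
    spectral = SpectralClustering(n_clusters=k, affinity="precomputed",
                                  random_state=0).fit_predict(S)
    # Agglomerative clustering expects distances, so similarity is inverted
    # (metric="precomputed" requires scikit-learn >= 1.2).
    agglo = AgglomerativeClustering(n_clusters=k, metric="precomputed",
                                    linkage="average").fit_predict(1.0 - S)
    # k-means needs feature vectors; one simple choice is to treat each
    # document's row of similarities as its feature vector.
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(S)
    return spectral, agglo, kmeans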

2.6. Data

For a multi-angle analysis, two kinds of datasets were used in the experiments. One is a small, labelled dataset named SL, used for verifying the accuracy of the proposed method. The other is a large, unlabelled dataset named LUs, used for testing the efficiency of the method when dealing with a large number of documents.

SL is generated from the Text REtrieval Conference (TREC) 2005 genomics track, which contains biomedical documents on 50 topics. In the TREC genomics track, each document is judged as definitely relevant (DR), possibly relevant (PR), or not relevant (NR) to a topic. We remove the PR and NR documents, keeping only the DR documents.

When generating the dataset, we followed the practice of Gu et al. [50]. To avoid small clusters, we removed the topics with fewer than 10 documents. Furthermore, we removed the documents relevant to 2 or more topics. Finally, a dataset of 2,317 documents with 24 topics was obtained. We randomly selected documents of 3-12 topics to generate 100 different datasets. Table 2 summarizes these datasets.

LUs comprises six datasets of 10,000 to 60,000 documents randomly extracted from PubMed, covering more than 20,000 distinct MeSH headings. Each dataset is labelled with its number of documents, such as LUs-10000 and LUs-20000. Table 3 summarizes the LUs datasets.

2.7. Evaluation Criteria

In the experiment, the performance is evaluated by comparing the predicted labels with the true labels. We take Normalized Mutual Information (NMI) as the evaluation index, since NMI has been shown to outperform many other clustering evaluation indexes [40]. The NMI formula [41] is defined as follows:

$$\mathrm{NMI} = \frac{\sum_{i} \sum_{j} n_{ij} \log \dfrac{n \cdot n_{ij}}{n_i \cdot n_j}}{\sqrt{\left(\sum_{i} n_i \log \dfrac{n_i}{n}\right) \left(\sum_{j} n_j \log \dfrac{n_j}{n}\right)}},$$

where $n$ is the total number of documents to be clustered, $n_i$ is the number of documents with true class $i$, $n_j$ is the number of documents with predicted class $j$, and $n_{ij}$ is the number of documents with true class $i$ and predicted class $j$.

NMI ranges from 0 to 1. A high NMI value means the strong correlation between the predicted label and the true label.
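
In practice, NMI can be computed with an off-the-shelf implementation; for instance, scikit-learn's normalized_mutual_info_score with the geometric-mean normalization matches the formula above (the toy labels are ours):

from sklearn.metrics import normalized_mutual_info_score

true_labels = [0, 0, 1, 1, 2, 2]
pred_labels = [1, 1, 0, 0, 2, 2]  # cluster names may differ; NMI is invariant
print(normalized_mutual_info_score(true_labels, pred_labels,
                                   average_method="geometric"))  # 1.0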

2.8. Experimental Environment

The experimental environment is a Hadoop cluster composed of five computers with the same configuration. The hardware and software details of each computer are shown in Table 4.

3. Results and Discussion

3.1. Optimization of MapReduce Job Settings

Before testing the efficiency of the proposed method on a large number of documents, the MapReduce job settings are optimized according to the characteristics of the method, since job settings have a great impact on task execution [51]. The input and output of the proposed method are compact, but the map tasks of the MapReduce-based semantic similarity calculation generate a large number of intermediate key-value pairs, producing much data to be sorted and aggregated in shuffle. Therefore, following the MapReduce optimization principle of multiset homomorphisms proposed by Dörre et al. [51], we increase the number of reduce tasks to enhance parallelism and add a combiner that aggregates the data before shuffle.
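
The two settings can be illustrated with mrjob, one Python front end to Hadoop streaming (our implementation is not tied to it; the job class, input format, and reducer count below are illustrative assumptions). Note that a combiner for an average must carry partial (sum, count) pairs rather than averages, so that the reduce result is unchanged:

# Illustrative mrjob sketch of the reduce side of Section 2.4 with a
# combiner and an increased reduce-task count (the values are assumptions).
from mrjob.job import MRJob

class PairSimilarity(MRJob):
    # More reduce tasks increase the parallelism of the reduce stage.
    JOBCONF = {"mapreduce.job.reduces": "20"}

    def mapper(self, _, line):
        pair, r = line.rsplit("\t", 1)  # input line: "d1 & d2<TAB>r"
        yield pair, (float(r), 1)

    def combiner(self, pair, values):
        # Pre-aggregate locally before shuffle to cut the volume of
        # intermediate data that must be copied and sorted.
        sums, counts = zip(*values)
        yield pair, (sum(sums), sum(counts))

    def reducer(self, pair, values):
        sums, counts = zip(*values)
        yield pair, sum(sums) / sum(counts)

if __name__ == "__main__":
    PairSimilarity.run()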

As shown in Figure 3, on the dataset LUs-10000 with the Resnik measure, the elapsed time of map is reduced from 0.5 minutes to 0.45 minutes, the elapsed time of shuffle from 2.77 minutes to 1.3 minutes, and the elapsed time of reduce from 2.05 minutes to 0.56 minutes. The results show that the optimization reduces the intermediate data to be shuffled and improves the efficiency of the reduce stage, effectively decreasing the computation time. We used the same optimized job settings in the following experiments.

3.2. Evaluation of Clustering and Computation Efficiency

In the experiment, the proposed method was run with six semantic measures and three clustering algorithms. We then compared the traditional method with the proposed method on both computation time and NMI. Tables 5 and 6 show the results.

(A) Computational efficiency

For the small dataset SL, the traditional method takes more than an hour, while the proposed method takes no more than 3 minutes on the cluster of five computers. For the big dataset LUs-10000, the traditional method can hardly work, while the proposed method remains efficient. Various semantic measures are available in this method, and the IC-based measures take less time than the others.

(B) Clustering validation

For spectral clustering, the highest NMI of 0.647 is achieved with Resnik. For k-means, the highest NMI of 0.526 is achieved with Lin. For agglomerative clustering, the highest NMI of 0.591 is achieved with Resnik. The results reveal that the information content-based measures (Resnik and Lin) outperform the other semantic measures, and spectral clustering performs better than the other two clustering algorithms. The highest NMI is obtained by the Resnik measure with the spectral clustering algorithm. Compared with the result in Zhu et al.'s study [37], where the same data and evaluation criteria were used, the NMI of the proposed method is slightly higher, implying that the proposed method greatly improves the computational efficiency without decreasing the clustering accuracy.

3.3. Speedup and Elapsed Time with Different Cluster Node Number

To study the parallelism of the method, the proposed method was run with different numbers of cluster nodes on the LUs dataset of 10,000 documents. Figure 4 shows that the elapsed time drops from 12.63 minutes to 2.31 minutes and the speedup rises almost linearly from 1 to 5.45 as the number of cluster nodes increases, implying that the proposed method has high parallelism and that adding nodes reduces the computation time effectively.

3.4. Computation Time with More Documents

In this section, the experiment was performed on the LUs datasets to observe the trend of the elapsed time and the proportion of each stage in the MapReduce job. Figure 5 shows that the proposed method remains effective when processing a large number of documents. As the number of documents increases, the elapsed time of map and reduce grows slowly while the time consumed in shuffle grows rapidly, and shuffle accounts for the largest proportion of the computation time among all MapReduce stages. The results reveal that the sorting and copying of data become the key factor in the computation time when processing a large number of documents.

4. Conclusions

In this paper, we developed an efficient ontology-based semantic similarity measure for clustering big document data. Traditionally, the semantic similarity between documents is computed pairwise, which can hardly work with a large number of documents. To solve this problem, we developed a MapReduce-based method that processes the data in parallel. By splitting the document similarity into an aggregation of multiple heading-to-document similarities, the proposed method avoids redundant computation and can process a large number of documents in a short time. Additionally, the experimental results show that the proposed method has high parallelism and scalability, implying that more documents can be processed as long as we increase the cluster nodes, upgrade the hardware, and optimize the job settings properly. In this work, both the semantic measure and the clustering algorithm are interchangeable, depending on the dataset and the ontology. For the TREC 2005 genomics track dataset and the MeSH ontology, the spectral clustering algorithm and the Resnik semantic measure perform better than the other options. Furthermore, the proposed method is not limited to biomedical documents and the MeSH ontology; it can also work in situations where semantic similarity is combined from different semantic features.

Data Availability

The data are available from Text REtrieval Conference (TREC, https://trec.nist.gov/) and PubMed (https://pubmed.ncbi.nlm.nih.gov/).

Conflicts of Interest

The authors declare that there is no conflict of interest regarding the publication of this paper.

Acknowledgments

This study was supported by the National Natural Science Foundation of China (61911540482 and 61702324).