Abstract

With the increasing volume of published biomedical literature, fast and effective retrieval of the literature on the sequence, structure, and function of biological entities is an essential task for the rapid development of biology and medicine. To capture the semantic information in biomedical literature more effectively when clustering biomedical documents, we propose a new multi-evidence-based semantic text similarity calculation method. It combines two semantic similarities, one based on MeSH and one based on word embeddings, with one content similarity. After the two semantic similarities and the content similarity between biomedical documents are calculated, a feedforward neural network is applied to integrate the two semantic similarities, and a weighted linear combination method is then used to integrate the resulting semantic similarity with the content similarity. To evaluate its effectiveness, the proposed method is compared with existing basic methods, and it outperforms the related methods. Based on these results, the method can be used not only in actual biological or medical experiments such as protein sequence or function analysis but also in broader biological and medical research, where it will help to provide, use, and understand thematically consistent documents.

1. Introduction

With the development of the biological and medical fields, the volume of biomedical documents is increasing rapidly. Every year, a large number of papers are published and indexed in PubMed, a standard biomedical document database; by 2022, PubMed comprised more than 34 million biomedical documents. Experts in biology and related fields spend considerable effort finding the documents they need, and many search technologies have emerged in response. In particular, semantically clustering or classifying biomedical documents [1, 2] has always been a very active field. Clustering of medical documents is of great importance to biologists, specialists, and document searchers in all fields of biological research; furthermore, it greatly facilitates knowledge discovery at a higher level.

In text clustering and retrieval, the text similarity measure is a critical step. Content-based similarity measures are classical: they extract keywords from the texts as features [3]. For example, term frequency (TF) or term frequency-inverse document frequency (TF-IDF) is usually applied to extract features and measure document similarity [4]. In PubMed, the PubMed-related articles (PMRA) ranking metric [5] is used to find "related articles," that is, a collection of articles similar to a query article. PMRA and BM25 [6] share the same theoretical basis: they calculate the similarity between biomedical texts from their contents (titles, abstracts, etc.) by modeling term frequencies with a Poisson distribution. Obviously, similarity based on text content alone cannot capture the semantic information of the text; in many cases, two texts with the same textual content express different semantic information in different contexts. This is where semantic text similarity becomes particularly important.

Semantic similarity of text was first applied to the vector space model in information retrieval [7]. This model uses the semantic similarity between queries and documents to retrieve the documents most relevant to a given query from a collection, with applications such as web search, subtopic mining, word sense disambiguation, relevance feedback, and text classification [8-10]. Meanwhile, natural language processing (NLP) applications make extensive use of semantic text similarity, including text summarization, machine translation, paraphrase detection, and sentiment analysis [11]. Semantic similarity among biomedical documents is likewise important for information mining in the biomedical field. In medicine, many biomedical words have different meanings in different contexts. Therefore, studies on semantic similarity measures for biomedical documents can capture subtle differences between documents at the semantic level and thus cluster biomedical documents more accurately.

With the rapid development of neural networks for word representation learning [12-14], word embedding has received more and more attention in recent years. Word embedding technology applies well to the study of semantic similarity in both general and special fields. Word embedding is a general term for language modeling and representation learning techniques in NLP. Conceptually, it refers to embedding a high-dimensional space, whose dimension is the number of all words, into a continuous vector space of much lower dimension, where each word or phrase is mapped to a vector over the real numbers. Word embedding methods include artificial neural networks [15], dimensionality reduction of the word cooccurrence matrix [16-18], probability models [19], and explicit representations of the context in which a word resides. Currently, the popular word embedding models include models of context-free word representations, such as Word2Vec [15] and GloVe [14], and models of context-dependent word representations, such as ELMo [20] and BERT [21]. In [22], Word2Vec is applied to calculate the similarity of biomedical terms. In [23], Wu et al. cluster short documents based on semantic similarity using the biterm topic model (BTM) and GloVe. Y. Li et al. recognize Chinese clinical named entities in electronic medical records based on ELMo and a lattice long short-term memory model [24]. In [25], semantic similarities between biomedical documents are calculated using the BERT model. Although word embedding-based document similarity calculation methods consider the context of the text, they do not consider professional knowledge in the biomedical field, and they still miss biomedical domain relationships between documents.

To address this weakness, biomedical ontologies such as MeSH and the Gene Ontology are applied to measure document similarity. MeSH is a standard biomedical ontology published by the National Library of Medicine (NLM), and each article in the MEDLINE database is indexed by several MeSH headings that represent the biomedical domain of the article and summarize its semantic content. All MeSH headings are organized by semantics into a tree structure (the MeSH tree). When computing the similarity between articles, their semantics can therefore be captured by extracting MeSH features, and ontology structure-based semantic similarity measures have attracted attention. There are two kinds of methods for measuring the similarity between MeSH headings: path-based methods and information content-based methods. Path-based similarity builds on the propagation activation theory [26], which assumes that the hierarchy of MeSH is organized according to semantic similarity. Since all headings in the ontology are organized hierarchically, broader MeSH headings tend to be near the root of the MeSH tree and more specific MeSH headings near the bottom. The similarity of nodes in an ontology-based hierarchy depends on the path length (distance) between the nodes and their depth, so the similarity between the nodes of MeSH headings can be calculated from their positions in the MeSH tree, as in SP [27], WL [28], WP [29], LC [30], Li [31], etc. The information content of a MeSH heading, in contrast, is related to its frequency in a particular corpus. Because MeSH is a tree-like structure, MeSH headings may contain or be contained by one another; therefore, when counting the occurrences of each MeSH heading, the occurrences of all MeSH headings that have an IS-A relationship with it must be included. Information content-based methods for measuring the relationships between ontology terms such as MeSH headings include Lord et al. [32], Resnik [33], Lin [34], and Jiang and Conrath [35]. In [36, 37], semantic similarity calculation methods based on the MeSH ontology were proposed and implemented for biomedical documents. Ontology-based semantic relationships are also applied in other fields, such as the similarity among functional terms and gene products in chicken [38] and semantic similarity within biomedical knowledge resources [39], such as the Systematized Nomenclature of Medicine-Clinical Terms (SNOMED CT) and the Unified Medical Language System (UMLS). However, for the similarity of texts in the biomedical field, it is not enough to consider only the semantic features embedded in the MeSH headings, which are a high-level generalization of the semantics of a biomedical text; it is also essential to consider the semantic features embedded in the textual content itself.

To fully consider the biomedical semantic information in documents, we propose a multi-evidence-based ensemble method in this paper. We use vectors from multiple pretrained word embedding models trained on two corpora to represent word semantics and capture the word-level semantics of the abstract. For the biomedical semantic information contained in MeSH, we use the MeSH tree to capture the semantic relationships among MeSH headings. We also use the traditional TF-IDF method to obtain the content features of the text. Finally, we compute the semantic similarity between biomedical texts by fusing the multiple features.

The main contributions of this work are as follows:
(i) Three different kinds of features are extracted and fused by the proposed multi-evidence-based document similarity measurement method, which substantially benefits the understanding of semantic information in biomedical documents. We perform a full comparison of the similarity calculation processes and analyze the characteristics of the methods for each feature.
(ii) A new ensemble text similarity calculation method based on an FNN is proposed to integrate the two semantic similarities, and a weighted linear combination method is applied to integrate the semantic and content similarities for biomedical documents.

2. Materials and Methods

2.1. Framework

Our proposed method mainly includes three steps: (1) preprocessing, (2) similarity calculation, and (3) similarity integration, as shown in Figure 1. The input is a biomedical document data set, and the output is the fused semantic similarity matrix of the documents. First, three kinds of features are extracted from the documents during preprocessing. Based on the extracted features, we measure two semantic similarities and one content similarity between biomedical documents: the semantic similarities are the word embedding-based similarity and the MeSH-based similarity. We then apply two suitable integration methods to fuse the similarities: an FNN generates the semantic similarity matrix, and a weighted linear combination method integrates the content similarity with the semantic similarity.

2.2. Preprocessing

To extract multiple features from biomedical texts, we use several different preprocessing methods. For the semantic features, we consider the semantic information embedded in the abstract and the MeSH headings; therefore, given a biomedical document, we first extract its abstract and MeSH terms. For the content features, we tokenize the abstract, filter out stopwords, and then use the TF-IDF method to generate TF-IDF-based content features.
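As a hedged illustration, the preprocessing could look like the sketch below; the record field names and the regex tokenizer are assumptions, not the actual data schema or tooling used in this study.

```python
import re
from nltk.corpus import stopwords  # requires nltk.download("stopwords")

STOPWORDS = set(stopwords.words("english"))

def preprocess(record):
    """Split one document into its semantic and content feature sources."""
    abstract = record["abstract"]            # source of word embedding features
    mesh_headings = record["mesh_headings"]  # source of MeSH-based features
    # Tokenized, stopword-filtered abstract for the TF-IDF content features.
    tokens = [t for t in re.findall(r"[a-z0-9]+", abstract.lower())
              if t not in STOPWORDS]
    return abstract, mesh_headings, tokens
```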

2.3. Similarity Calculation

To capture various semantic information, we apply two different semantic similarity calculation methods: a word embedding-based measure and a MeSH-based measure. The word embedding-based similarity reflects contextual information, and the MeSH-based similarity captures semantic information specific to the biomedical field. The TF-IDF-based content similarity reflects the word-level content information in the documents.

2.3.1. Word Embedding-Based Semantic Similarity

To capture semantic meaning, we use a word embedding model to measure the similarity between texts. From the semantic vectors of the texts, we obtain the semantic similarities between whole documents.

(1) Model Training. Since our research target is text in the biomedical field, we need a word embedding model trained on biomedical corpora. To construct a robust model, we used two large corpora, a Wikipedia corpus and a MEDLINE corpus, so that we can obtain general contextual information and biomedical domain information together. In this study, Word2Vec [15] is adopted to construct the word embedding models, which are named Wiki_W2V and MEDLINE_W2V according to their corpus. Wiki_W2V represents the general-domain model and MEDLINE_W2V the medical-domain model; the dimension of the word vectors in both models is 300, as shown in Table 1.
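A minimal training sketch with gensim, assuming the two corpora are available as iterables of token lists; apart from the 300-dimensional vectors stated in Table 1, all hyperparameters here are illustrative assumptions.

```python
from gensim.models import Word2Vec

def train_model(sentences, path):
    """sentences: iterable of token lists from the Wikipedia or MEDLINE corpus."""
    model = Word2Vec(sentences, vector_size=300, window=5, min_count=5, workers=4)
    model.save(path)
    return model

# wiki_w2v    = train_model(wiki_sentences, "Wiki_W2V.model")
# medline_w2v = train_model(medline_sentences, "MEDLINE_W2V.model")
```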

(2) Average Semantic Vector of Document. In general, the more similar the semantic information of two words is in the word embedding space, the greater the dot product between their word vectors. The Word2Vec model uses words as the basic unit for extracting semantic information, but in this study, we need the semantic information of the whole document. Therefore, we average the word vectors from the Word2Vec model to obtain a semantic vector ASV that represents the document. Algorithm 1 describes the computation of ASV for a document.

Input: Document d, Word2Vec model W2V
Output: Average semantic vector ASV
Begin
1  W ← preprocess(d)
2  ASV ← n-dimensional zero vector
3  count ← 0
4  for each word w in W do
5    if w ∈ vocabulary(W2V) then
6      ASV ← ASV + W2V[w]
7      count ← count + 1
8    end
9  end
10 ASV ← ASV / count
11 ASV ← Z-score(ASV)
End

In Algorithm 1, each document $d$ is preprocessed to obtain the word list $W$, and ASV is initialized as an $n$-dimensional zero vector. For each word in $W$, if the word exists in the vocabulary of the Word2Vec model W2V, its word vector is extracted from the model and added to ASV, and the counter count is incremented by 1. After traversing the words of $W$, ASV is divided by count. After this step, the vector ASV may be offset in different directions, so we normalize it; here, we choose Z-score normalization. Let $\mathrm{ASV} = (v_1, v_2, \dots, v_n)$; the Z-score normalization of each element is

$$v_i' = \frac{v_i - \mu}{\sigma}, \quad i = 1, 2, \dots, n, \qquad (1)$$

where $\mu$ is the mean and $\sigma$ is the standard deviation of the elements of ASV.

(3) Similarity of Document Based on Word Embedding. Assume that $\mathrm{ASV}_a$ and $\mathrm{ASV}_b$ are the average semantic vectors calculated by Algorithm 1 for documents $d_a$ and $d_b$; then, the word embedding-based similarity between the documents is expressed as

$$\mathrm{sim}_{\mathrm{WE}}(d_a, d_b) = \frac{\sum_{i=1}^{n} \mathrm{ASV}_a[i] \cdot \mathrm{ASV}_b[i]}{\sqrt{\sum_{i=1}^{n} \mathrm{ASV}_a[i]^2} \sqrt{\sum_{i=1}^{n} \mathrm{ASV}_b[i]^2}}, \qquad (2)$$

where $\mathrm{ASV}_a[i]$ and $\mathrm{ASV}_b[i]$ are the $i$th elements of the vectors $\mathrm{ASV}_a$ and $\mathrm{ASV}_b$, respectively.
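A sketch of Algorithm 1 and Equation (2) in Python, assuming a gensim KeyedVectors model for the vocabulary lookup; the guards for empty documents are our additions.

```python
import numpy as np

def average_semantic_vector(tokens, w2v, dim=300):
    """Algorithm 1: average the in-vocabulary word vectors, then Z-score normalize."""
    asv, count = np.zeros(dim), 0
    for word in tokens:
        if word in w2v.key_to_index:   # vocabulary test in gensim 4.x
            asv += w2v[word]
            count += 1
    if count:
        asv /= count
    std = asv.std()
    return (asv - asv.mean()) / std if std else asv   # Equation (1)

def sim_we(tokens_a, tokens_b, w2v):
    """Equation (2): cosine similarity of the two average semantic vectors."""
    a = average_semantic_vector(tokens_a, w2v)
    b = average_semantic_vector(tokens_b, w2v)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```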

2.3.2. MeSH-Based Semantic Similarity

Documents in the MEDLINE database are labeled by a set of MeSH headings (usually 10-15), which are assigned by biomedical experts and represent the subject of the document, as shown in Table 2. These MeSH headings can therefore represent the semantic information of the document. At the same time, each MeSH heading contains multiple nodes, appearing at different locations in the MeSH tree, and each node is represented by a unique tree number (see Table 3). Given a set of documents $D = \{d_1, d_2, \dots, d_m\}$, each document $d_i$ is marked with a set of MeSH headings $M_i$. When calculating the similarity based on MeSH features, we proceed in two steps: (1) calculate the similarity between MeSH headings and (2) calculate the similarity between documents.

(1) Similarity of MeSH Heading. The MeSH tree contains 16 subtrees, each of which represents a biomedical subfield. Two path-based methods and two information content-based methods are used to calculate the similarity between nodes: WP [29], LC [30], Lin [34], and JC [35]. In particular, if two nodes are in two different subtrees, we take the similarity between them to be 0. Each MeSH heading contains multiple nodes, so the similarities between all the nodes contained in two MeSH headings must be taken into account. In addition, one MeSH heading may correspond to several nodes, and those nodes may lie in different subtrees. The traditional approach calculates the pairwise similarities between all corresponding node pairs and takes the average value as the final similarity between the two MeSH headings, which may make the similarity smaller than its real value. To avoid this problem, we choose the average maximum match (AMM). Given two MeSH headings $h_a$ and $h_b$, the similarity between $h_a$ and $h_b$ is expressed as follows:

$$\mathrm{sim}(h_a, h_b) = \frac{\sum_{n \in h_a} \mathrm{sim}_{\max}(n, h_b) + \sum_{n' \in h_b} \mathrm{sim}_{\max}(n', h_a)}{|h_a| + |h_b|}, \qquad (3)$$

where $\mathrm{sim}_{\max}(n, h_b)$ represents the maximum similarity between the node $n$ and any node in $h_b$.
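As a hedged illustration of the node-level measures, the sketch below implements the standard WP [29] and Lin [34] formulations; depth, lcs (lowest common subsumer), and ic (information content) are assumed helper routines over the MeSH tree, not part of the original method description.

```python
# One path-based (WP) and one information content-based (Lin) node similarity.
def sim_wp(n1, n2, tree):
    anc = tree.lcs(n1, n2)            # lowest common subsumer of the two nodes
    if anc is None:                   # nodes in different subtrees
        return 0.0
    return 2 * tree.depth(anc) / (tree.depth(n1) + tree.depth(n2))

def sim_lin(n1, n2, tree):
    anc = tree.lcs(n1, n2)
    if anc is None:
        return 0.0
    return 2 * tree.ic(anc) / (tree.ic(n1) + tree.ic(n2))
```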

(2) Similarity of MEDLINE Document Based on MeSH. Considering that each document contains multiple MeSH headings and each MeSH heading contains multiple nodes, moving from MeSH heading similarity to document similarity is analogous to moving from node similarity to MeSH heading similarity. Here, we also use the AMM method; given two documents $d_a$ and $d_b$ with MeSH heading sets $M_a$ and $M_b$, the similarity between $d_a$ and $d_b$ is expressed as

$$\mathrm{sim}_{\mathrm{MeSH}}(d_a, d_b) = \frac{\sum_{h \in M_a} \mathrm{sim}_{\max}(h, d_b) + \sum_{h' \in M_b} \mathrm{sim}_{\max}(h', d_a)}{|M_a| + |M_b|}. \qquad (4)$$

Similar to Equation (3), $\mathrm{sim}_{\max}(h, d_b)$ denotes the maximum similarity between the MeSH heading $h$ and any MeSH heading contained in the document $d_b$.
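A sketch of the AMM aggregation of Equations (3) and (4); representing MeSH headings as frozensets of tree numbers and documents as sets of headings is our modeling assumption.

```python
def amm(set_a, set_b, sim):
    """Average maximum match of two sets under a pairwise similarity function."""
    fwd = sum(max(sim(x, y) for y in set_b) for x in set_a)
    bwd = sum(max(sim(y, x) for x in set_a) for y in set_b)
    return (fwd + bwd) / (len(set_a) + len(set_b))

def heading_sim(h_a, h_b, node_sim):
    """Equation (3): h_a, h_b are frozensets of MeSH tree numbers."""
    return amm(h_a, h_b, node_sim)

def sim_mesh(doc_a, doc_b, node_sim):
    """Equation (4): doc_a, doc_b are sets of MeSH headings."""
    return amm(doc_a, doc_b, lambda x, y: heading_sim(x, y, node_sim))
```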

2.3.3. Content Similarity

Content similarity takes the content of a MEDLINE document as its feature. Here, a document $d$ is represented by a real-valued vector $V$ that contains the content feature information of $d$; we apply the traditional TF-IDF method to extract these content features. Given documents $d_a$ and $d_b$ with corresponding real-valued vectors $V_a$ and $V_b$, the content feature-based similarity of $d_a$ and $d_b$ can be expressed as

$$\mathrm{sim}_{\mathrm{con}}(d_a, d_b) = \frac{V_a \cdot V_b}{\|V_a\| \, \|V_b\|}. \qquad (5)$$
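A minimal sketch of the content similarity with scikit-learn; the choice of TfidfVectorizer and cosine similarity as the concrete tooling is our assumption.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def content_similarity_matrix(abstracts):
    """abstracts: one preprocessed abstract string per document."""
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(abstracts)
    return cosine_similarity(tfidf)   # m x m matrix of sim_con values, Equation (5)
```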

2.4. Similarity Integration

In the previous part, we generated a document similarity matrix for each single feature: the word embedding-based and MeSH-based semantic features and the TF-IDF-based content feature. We integrate these three document similarities with two integration methods.

2.4.1. Semantic Similarity Integration

To capture the joint influence of the multiple semantic features on the semantic similarity of a document, we apply a feedforward neural network model to integrate the MeSH features and word embedding features of the documents.

The feedforward neural network model FNN_sem which we constructed for semantic feature integration is shown in Figure 2. The input layer contains 2 input neurons, the hidden layer contains 300 hidden layer neurons, and the output layer contains 1 output neuron. The activation function of the input layer and the hidden layer is ReLU, and the activation function of the output layer is sigmoid.

The purpose of constructing the FNN_sem model is to integrate the MeSH features and word embedding features. We take the MeSH feature-based similarity $\mathrm{sim}_{\mathrm{MeSH}}$ and the word embedding-based similarity $\mathrm{sim}_{\mathrm{WE}}$ as the input and the integrated semantic similarity $\mathrm{sim}_{\mathrm{sem}}$ (the similarity after semantic feature integration) as the output. During training, for any two documents $d_a$ and $d_b$ in the given data set, if $d_a$ and $d_b$ are DR for the same topic, the target similarity between them is set to 0.9; if $d_a$ and $d_b$ are PR for the same topic, it is set to 0.5; otherwise, it is set to 0.1. The number of iterations is set to 100.

After training, we can input $\mathrm{sim}_{\mathrm{MeSH}}$ and $\mathrm{sim}_{\mathrm{WE}}$ between documents to obtain $\mathrm{sim}_{\mathrm{sem}}$, thereby achieving the integration of the semantic features.
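A hedged PyTorch sketch of FNN_sem under the stated architecture (2 input neurons, a 300-neuron ReLU hidden layer, a sigmoid output, targets in {0.9, 0.5, 0.1}, 100 iterations); the mean squared error loss and Adam optimizer are our assumptions, since the paper does not specify them.

```python
import torch
import torch.nn as nn

fnn_sem = nn.Sequential(
    nn.Linear(2, 300), nn.ReLU(),     # input: (sim_MeSH, sim_WE) pairs
    nn.Linear(300, 1), nn.Sigmoid(),  # output: integrated semantic similarity
)

def train(pairs, targets, epochs=100):
    """pairs: (N, 2) float tensor; targets: (N, 1) tensor of 0.9 / 0.5 / 0.1."""
    opt = torch.optim.Adam(fnn_sem.parameters(), lr=1e-3)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(fnn_sem(pairs), targets)
        loss.backward()
        opt.step()
```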

2.4.2. Fusion Similarity Generation

The weighted linear combination method is applied to integrate the content similarity and the semantic similarity. First, $\mathrm{sim}_{\mathrm{sem}}$ and $\mathrm{sim}_{\mathrm{con}}$ are normalized so that their minimum value is 0 and their maximum value is 1. Our normalization method is SumNorm, whose excellent performance in GO term clustering is demonstrated in Zhou et al. [36]. After this, $\mathrm{sim}_{\mathrm{sem}}$ and $\mathrm{sim}_{\mathrm{con}}$ are integrated linearly. Specifically, setting the weight of $\mathrm{sim}_{\mathrm{con}}$ as $\lambda$, the similarity after integration is

$$\mathrm{sim}_{\mathrm{fusion}} = \lambda \cdot \mathrm{sim}_{\mathrm{con}} + (1 - \lambda) \cdot \mathrm{sim}_{\mathrm{sem}}. \qquad (6)$$

From Equation (6), we can see that $\lambda$ determines the contribution of $\mathrm{sim}_{\mathrm{con}}$ to $\mathrm{sim}_{\mathrm{fusion}}$. When $\lambda$ is 0, $\mathrm{sim}_{\mathrm{fusion}}$ is equal to $\mathrm{sim}_{\mathrm{sem}}$. We find the most appropriate $\lambda$ by adjusting its value to make $\mathrm{sim}_{\mathrm{fusion}}$ more accurate.
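Equation (6) reduces to a one-line combination of the two normalized similarity matrices, so the search for the best weight can be sketched as follows (the grid of candidate values is an assumption):

```python
import numpy as np

def fuse(sim_sem, sim_con, lam):
    """Equation (6): weighted linear combination of normalized similarity matrices."""
    return lam * sim_con + (1.0 - lam) * sim_sem

# Sweep candidate weights and keep the one whose clustering scores best, e.g.:
# for lam in np.linspace(0.0, 1.0, 11):
#     labels = spectral_cluster(fuse(sim_sem, sim_con, lam), n_topics)
```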

3. Experimental Data and Evaluation Methods

In this section, we evaluate the performance of the proposed method. Since there is no official ground truth for the similarity between MEDLINE documents, we evaluate the method by clustering documents based on the proposed similarity.

3.1. Data

The Text REtrieval Conference (TREC) 2005 Genomics Track data contain 4,591,008 MEDLINE documents (records from 1994 to 2003) and 50 biomedical research topics. The 50 topics simulate real information needs in the biomedical field and were distributed as query topics to all competing information retrieval systems. For each topic, there are definitely relevant (DR), possibly relevant (PR), and not relevant (NR) documents; the relevant documents were returned by the different retrieval systems and then aggregated for manual evaluation by biologists.

The data obtained above require further processing. First, we delete the topics with nine or fewer documents, because clusters that are too small distort the evaluation. Then, we further delete the documents related to multiple topics; finally, we are left with 24 topics containing 2,317 documents. To fully test the performance of the proposed similarity, we constructed 100 different data sets, each of which randomly selects 3-12 of the 24 topics together with the documents they contain. Table 4 shows the basic information of the 100 data sets.

3.2. Evaluation Method

For clustering, we select the spectral clustering algorithm; many studies have shown [40] that spectral clustering is an effective and stable clustering method.

3.2.1. Spectral Clustering

The spectral clustering algorithm is well suited to clustering data given as a similarity matrix and outperforms other clustering algorithms on biomedical text clustering, so we use it to investigate the performance of the proposed similarity. The idea of spectral clustering comes from spectral partitioning, which regards data clustering as a multiway partitioning problem on an undirected graph. The data points are regarded as the vertices of an undirected graph $G = (V, E)$, and the edge weights $W$ represent the pairwise similarities computed by some similarity measure. $W$ is thus the similarity matrix of the data points to be clustered, which is regarded as the adjacency matrix of the undirected graph and contains all the information required for clustering. A partition criterion is then defined and optimized so that points within the same class have a high degree of similarity, while points in different classes have a low degree of similarity.
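A minimal sketch of this step with scikit-learn, where the fused similarity matrix is passed as a precomputed affinity (the adjacency matrix $W$ above); the fixed random seed is an illustrative choice.

```python
from sklearn.cluster import SpectralClustering

def spectral_cluster(sim_matrix, n_clusters):
    """Cluster documents directly on a precomputed similarity (affinity) matrix."""
    sc = SpectralClustering(n_clusters=n_clusters, affinity="precomputed",
                            random_state=0)
    return sc.fit_predict(sim_matrix)   # one cluster label per document
```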

3.2.2. Metrics

In our data, the cluster to which each document belongs is known, so we can conduct an external evaluation by comparing the spectral clustering results based on the proposed similarity with the true clusters. We select four evaluation metrics: purity, adjusted Rand index (ARI), normalized mutual information (NMI), and Fowlkes-Mallows index (FMI).

Purity is a simple and transparent evaluation criterion. Each cluster is assigned to the category with the most documents in it, and accuracy is then measured by counting the correctly assigned documents and dividing by the total number of documents:

$$\mathrm{Purity}(\Omega, C) = \frac{1}{N} \sum_{k} \max_{j} |\omega_k \cap c_j|, \qquad (7)$$

where $C = \{c_1, c_2, \dots\}$ represents the set of true classes, $\Omega = \{\omega_1, \omega_2, \dots\}$ represents the clusters of the clustering result, and $N$ represents the total number of documents.

The Rand index (RI) measures the similarity between two clusterings by considering all sample pairs and counting the pairs assigned to the same or different clusters in the predicted and true clusterings; its value ranges from 0 to 1:

$$\mathrm{RI} = \frac{TP + TN}{TP + FP + FN + TN}, \qquad (8)$$

where $TP$ is the number of document pairs in the same class and the same cluster, $FN$ is the number of pairs in the same class but different clusters, $FP$ is the number of pairs in different classes but the same cluster, and $TN$ is the number of pairs in different classes and different clusters.

The adjusted Rand index (ARI) is an improvement on RI proposed by Hubert and Arabie in 1985 [41]. The problem with RI is that for two random partitions its value is not a constant close to 0. ARI takes a random model as the null hypothesis; that is, the partitions $\Omega$ and $C$ are assumed random with the number of data points in each class and cluster fixed. Its value range becomes -1 to 1, and the larger the value, the better the clustering:

$$\mathrm{ARI} = \frac{\mathrm{RI} - E[\mathrm{RI}]}{\max(\mathrm{RI}) - E[\mathrm{RI}]}, \qquad (9)$$

where $E[\mathrm{RI}]$ is the expected value of RI.

NMI measures the degree of correspondence between two data distributions. Ghosh [42] found that the NMI index achieves a good evaluation effect for clustering. Therefore, we also use NMI to evaluate the clustering performance:

$$\mathrm{NMI}(Y, C) = \frac{2 I(Y; C)}{H(Y) + H(C)}. \qquad (10)$$

In Equation (10), $C$ and $Y$ are the predicted labels after clustering and the correct labels, respectively, $I(Y; C)$ represents their mutual information, $H(Y)$ is the entropy of $Y$, and $H(C)$ is the entropy of $C$ [40]. Equation (10) can be rewritten as Equation (11):

$$\mathrm{NMI}(Y, C) = \frac{2 \sum_{i} \sum_{j} \frac{N_{ij}}{N} \log \frac{N \cdot N_{ij}}{N_i N_j}}{-\sum_{i} \frac{N_i}{N} \log \frac{N_i}{N} - \sum_{j} \frac{N_j}{N} \log \frac{N_j}{N}}, \qquad (11)$$

where $N$ is the total number of documents in the data set, $N_i$ is the number of documents in the correct class $y_i$, $N_j$ is the number of documents in the predicted cluster $c_j$, and $N_{ij}$ is the number of documents in both class $y_i$ and cluster $c_j$.

FMI was proposed by Fowlkes and Mallows [43] in 1983 as the geometric mean of the pairwise precision and recall for document pairs:

$$\mathrm{FMI} = \frac{TP}{\sqrt{(TP + FP)(TP + FN)}}, \qquad (12)$$

where $TP$, $FP$, and $FN$ have the same meaning as in RI.
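All four metrics can be computed as in the sketch below: ARI, NMI, and FMI come directly from scikit-learn, and purity follows Equation (7) via the contingency matrix.

```python
from sklearn.metrics import (adjusted_rand_score, fowlkes_mallows_score,
                             normalized_mutual_info_score)
from sklearn.metrics.cluster import contingency_matrix

def evaluate(true_labels, pred_labels):
    """External evaluation of a clustering against the known topic labels."""
    cm = contingency_matrix(true_labels, pred_labels)
    purity = cm.max(axis=0).sum() / cm.sum()   # majority true class per cluster
    return {"purity": purity,
            "ARI": adjusted_rand_score(true_labels, pred_labels),
            "NMI": normalized_mutual_info_score(true_labels, pred_labels),
            "FMI": fowlkes_mallows_score(true_labels, pred_labels)}
```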

4. Experiment Result and Discussion

To fully evaluate the performance of each similarity mentioned in this paper, we set up three groups of clustering experiments.

4.1. The Analysis of the Result Based on Single Similarity

To find the best calculation method for each single feature, we set up clustering experiments based on single similarities: (1) the word embedding feature-based methods WE_M (with MEDLINE_W2V as the word embedding model) and WE_W (with Wiki_W2V as the word embedding model); (2) the MeSH feature-based methods LC [30], WP [29], Lin [34], and JC [35] (for the JC method, the value of its parameter is varied to determine the optimal result); and (3) the content-based method, Con.

The clustering results on the 100 data sets for the word embedding-based and MeSH-based similarity methods are shown in Tables 5 and 6, respectively, with the maximum values shown in italics. As can be seen from Table 5 and Figure 3(a), among the word embedding-based similarities, the clustering performance of the WE_M method is better than that of WE_W on all four metrics, indicating that the MEDLINE_W2V model outperforms the Wiki_W2V model overall. As can be seen from Table 6 and Figure 4, among the MeSH feature-based methods, the JC method performs best overall, especially at its optimal parameter setting, while the LC method performs worst. In addition, Figure 4 shows that the methods differ little on each evaluation metric and that the JC method has the highest average value on all metrics. This shows that the methods behave similarly on data sets of different sizes and again demonstrates the superiority of the JC method among the MeSH feature-based methods.

We also compared the methods that achieved the best clustering results for each feature; the comparison results are shown in Table 7 and Figure 3(b), with the maximum values shown in italics. Overall, at its optimal parameter setting, the JC method achieves the maximum purity and FMI, 0.834 and 0.762, respectively, while Con achieves the maximum ARI and NMI, 0.701 and 0.738, respectively. The experimental results show that the MeSH feature-based semantic similarity is better than the word embedding-based semantic similarity and performs similarly to the content feature-based similarity.

4.2. The Analysis of the Result Based on Semantic Similarity

We conduct a clustering experiment that integrates the MeSH-based and word embedding-based similarities to explore the optimal combination. There are four MeSH feature-based methods, and the parameter of the JC method is set from 1 to 5, giving a total of 8 MeSH-based similarities; the two word embedding-based methods are WE_M and WE_W. Pairwise combination therefore yields 16 combined methods, as listed in Table 8.

The clustering results of the combined MeSH and word embedding similarities on the 100 data sets are shown in Table 8, with the maximum values shown in italics. From Table 8, we can see that the overall clustering performance of the integrated semantic similarity is generally higher than that of any single similarity. For example, the average values of all metrics for the combination of the LC and WE_M methods are all higher than those of LC alone. At the same time, we find that every MeSH-based method combined with WE_M clusters better than the same method combined with WE_W, which again shows that WE_M outperforms WE_W. Overall, the combination of the JC method (parameter set to 1) with the WE_M method performs best, while the combination of the WP and WE_W methods performs worst. Therefore, the subsequent experiments use the combination of the JC (parameter 1) and WE_M methods (the JC_1_WE_M method) to calculate semantic similarity.

4.3. The Analysis of the Result Based on Fusion Similarity

As described in the method, the semantic similarity and content similarity are integrated by the weighted linear combination method: the JC_1_WE_M method calculates the semantic similarity as above, and the Con method calculates the content feature-based similarity. According to Equation (6), we adjust the value of $\lambda$ to find its best value.

Combining Table 9 and Figure 5, we can see that as $\lambda$ increases, the averages of all metrics slowly increase and the standard deviations slowly decrease, reaching a peak at a certain value of $\lambda$, where the averages reach their maxima and the standard deviations their minima (see Table 9). This shows that the content feature-based similarity has a great influence on the results. Compared with the semantic similarity alone, integrating the semantic and content similarities significantly improves the average of each metric and significantly reduces the standard deviation, indicating that the more features are considered, the better the clustering effect.

4.4. Comparison with Related Work

To fully verify the clustering performance of our proposed similarity, we compare it with previous clustering methods. Table 10 shows the comparison between our proposed similarity ($\mathrm{sim}_{\mathrm{fusion}}$ at the optimal value of $\lambda$) and the clustering method proposed by Zhu et al. [37] on the same data set.

The clustering method of Zhu et al. performs spectral clustering on a linear integration of the MeSH feature similarity and the content feature similarity. From Table 10, it can be clearly seen that the averages of all metrics of the similarity proposed in this paper are higher than those of Zhu et al.'s method under spectral clustering. The improvement is particularly significant for FMI (an increase of 0.061). This result further demonstrates the validity of our proposed similarity, which integrates multiple features of MEDLINE documents.

4.5. Discussion for Interpretability of Proposed Method

Research on text similarity generally extracts features from two aspects: the content of the text and the semantics of the text. Texts in special fields often contain domain-specific semantic features, which also need to be extracted. In addition, the number of features considered within each aspect often determines the performance of the similarity. Previous studies have two kinds of defects: some [5, 6] consider only the content without the semantics, while others [37] consider too few features within one aspect.

Our method makes up for both defects at the same time: we consider the content and semantic features of biomedical documents together. On this basis, we consider multiple semantic features (extracted from MeSH and extracted from the text content through word embeddings) and fuse them, thereby improving the performance of biomedical text similarity. In the experiments, we first screened the best calculation methods for the semantic and content aspects, as shown in Tables 5, 6, and 7. Then, we fused the two semantic features through the FNN to find the optimal semantic fusion scheme, as shown in Table 8. Finally, the content and semantic similarities were fused through the linear model, as shown in Table 10, to obtain the final similarity calculation method.

Furthermore, our experimental results show that the proposed multi-evidence-based semantic text similarity calculation method improves biomedical document clustering compared with various existing methods, with superior performance on several evaluation metrics, including purity, ARI, NMI, and FMI. Essentially, the proposed method extracts and fuses different features that together represent the semantic information of biomedical documents.

5. Conclusion

Similarity calculation over large document collections has limitations in terms of accuracy. To address these problems, in this paper, we proposed a new semantic similarity for MEDLINE documents that extracts the semantic information contained in the MeSH headings and the abstract of each document and combines it with the content information. In the proposed method, after calculating the semantic and content similarities between medical documents, an FNN and a weighted linear combination method were applied to integrate them. The proposed method was compared with the existing basic methods for analyzing medical documents. The experimental results showed that the clustering effect improved significantly as more features were considered: integrating the MeSH-based and word embedding-based semantic similarities improved on any single similarity, and further integrating the content similarity improved every clustering metric. This confirmed that the multi-evidence method outperforms the traditional methods.

One of the strengths of this study is that it improves performance by considering and integrating various semantic features. Our proposed method is based on the idea of multifeature fusion, which can play an important role not only for experts in the biomedical field but also for information mining in general fields. Therefore, this multifeature clustering approach can be applied to other similarity research, such as calculating the similarity between genes. At the same time, in a general domain, if the study subject contains multiple features, the same idea can be applied to improve performance.

Data Availability

The datasets used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that there is no conflict of interest regarding the publication of this paper.

Acknowledgments

This research was supported by the National Natural Science Foundation of China (61911540482 and 61702324). This work was also supported in part by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Science, ICT and Future Planning, South Korea, under Grant 2019K2A9A2A06020672 and Grant 2020R1A2B5B02001717.