Abstract

Due to the increasing use of information technologies by biomedical experts, researchers, public health agencies, and healthcare professionals, a large volume of scientific literature, clinical notes, and other structured and unstructured text resources is rapidly accumulating in data sources such as PubMed. These massive text resources can be leveraged to extract valuable knowledge and insights using machine learning techniques. Neural network-based classification models, which have recently gained popularity, take numeric vectors (aka word representations) of the training data as input: the better the input vectors, the more accurate the classification. Word representations are learned as the distribution of words in an embedding space, wherein each word has its own vector and semantically similar words, based on their contexts, appear near each other. However, such distributional word representations are incapable of encapsulating the relational semantics between distant words. In the biomedical domain, relation mining is a well-studied problem that aims to extract relational words, which associate distant entities generally representing the subject and object of a sentence. Our goal is to capture relational semantic information between distant words from a large corpus to learn enhanced word representations and to employ the learned word representations for natural language processing tasks such as text classification. In this article, we propose an application of biomedical relation triplets to learn word representations by incorporating relational semantic information into the distributional representation of words. In other words, the proposed approach captures both the distributional and relational contexts of words to learn their numeric vectors from a text corpus. We also propose an application of the learned word representations to text classification. The proposed approach is evaluated over multiple benchmark datasets, and the efficacy of the learned word representations is tested on word similarity and concept categorization tasks. Our approach provides better performance than the state-of-the-art GloVe model. Furthermore, we apply the learned word representations to classify biomedical texts using four neural network-based classification models, and the classification accuracy further confirms the effectiveness of the word representations learned by the proposed approach.

1. Introduction

Biomedical literature, medical records, clinical notes, and online databases such as PubMed are treasuries of valuable information that are rapidly growing in volume. Biomedical professionals and researchers explore and analyze these large volumes of structured and unstructured texts to extract and curate valuable information using different knowledge discovery and data mining techniques. Along this line, automated text classification using machine learning techniques has long been considered a key technique to categorize, filter, search, manage, or process large volumes of text documents. Text classification is a key natural language processing (NLP) task wherein texts are labeled with specific classes based on their contents. Such labeling helps to extract valuable information for applications such as disease surveillance, information extraction, named-entity recognition, topic labeling, and social media monitoring.

In the biomedical domain, the existing literature is a valuable source of a large number of named entities, concepts, features, and their associations. In this domain, text classification has many applications, including allocating medical subject headings (MeSH terms) to biomedical articles [1, 2], identifying reportable disease cases from clinical and pathological reports, and categorizing biomedical documents based on their content. Furthermore, classifying biomedical texts could help to improve the performance of gene-disease association extraction, protein-protein interaction extraction, understanding the functioning of genes, or discovering other kinds of knowledge. The efficiency and accuracy of any classification system depend on the classification algorithm (the classifier) used and the input features on which it operates. Since a classifier learns a model from the training data in the form of feature vectors, the role of feature vectors, or feature representation, is very important for classification performance. In NLP tasks, word representation (aka word embedding) has a notable influence on the performance of deep learning-based classification models.

1.1. Traditional Word Representation and Its Limitations

In traditional word representation techniques, words are encoded as vectors of binary, tf (term frequency), or tf-idf (term frequency-inverse document frequency) values, and such vectors have yielded promising results for classification tasks. These vectors rely on lexical features such as uni-grams, bi-grams, or n-grams to represent text documents as feature vectors, with each entry of the vector consisting of either a Boolean value or a frequency count to indicate the presence of a lexical feature. However, such vectors are unable to capture semantic information because they ignore the context and the order of the words in the documents. Besides ignoring word order and contextual information, these feature vectors also suffer from data sparsity. Such issues have been addressed using neural network models that learn word representations as low-dimensional dense vectors.

1.2. Modern Word Representation and Its Limitations

Recently, the distributional representation of words as feature vectors (aka word embeddings) has opened a new horizon in NLP applications because of its ability to capture contextual information and, hence, the semantics of the words mentioned within textual content. Learning such word representations as low-dimensional dense vectors in an embedding space from a large corpus has gained popularity since the pioneering work of Mikolov et al. [3]. Such word vectors aim to capture the distributional features of words in a large corpus. Many NLP problems such as classification, clustering, and sentiment analysis have been solved by employing these word representations. Furthermore, the resurgence of neural network-based machine learning algorithms has shown their capability to accomplish high accuracy even with less-engineered features.

In this direction, Word2Vec [3] and GloVe [4] are two important algorithms that are widely used to learn distributional representations of words as low-dimensional dense vectors, which can be employed to enhance the performance of neural network-based classification systems. These algorithms consider the neighboring context words on either side of a target word within a fixed context window to preserve the distributional similarity of words. However, such distributional word representations have two major shortcomings: (i) they are incapable of capturing the relational semantics of words because of their dependence on a fixed context window, and (ii) rare co-occurrences of word pairs are problematic, as even a large corpus may not contain a sufficient co-occurrence count for rare word pairs. To eliminate these shortcomings, researchers have tried to incorporate relational knowledge from third-party knowledge bases (KBs) such as WordNet [5] and Freebase [6] into the distributional representation of words. Semantic relations such as synonymy, hypernymy, and meronymy from the KBs have been incorporated into the distributional representation of words to learn better word representations [7, 8]. The relations from KBs, though rich in semantic information, may have incomplete coverage and also lack contextual information. Furthermore, KBs are generally manually curated and maintained, so they may not be comprehensive.

In addition, the existing works consider only linear contexts to derive the contextual information of a target word, wherein the context words are the surrounding words within a fixed-size window of tokens that precede and follow the target word. For example, in the sentence "Whipple disease is a rare systemic illness characterized by arthralgias, chronic diarrhea, weight loss, fever, and abdominal pain," the word pairs (Whipple, fever) and (Whipple, pain) have long-range associations representing their relational semantics. Both fever and pain are semantically related to Whipple because they are symptoms of Whipple disease. These distant relationships will not be captured by a fixed context window of size 5 or 10. A smaller context window, say of size 2, may fail to capture important context, while a very large context window may capture weak and irrelevant contexts, resulting in an adverse impact on the embedding representation. In the existing literature, the most commonly used context window size for capturing the distributional context of words is 5. Additionally, if we aspire to learn word embeddings from a domain-specific corpus, say a biomedical text corpus, then the semantic associations between Whipple disease and fever or between Whipple disease and abdominal pain would be of extreme importance, as fever and abdominal pain are symptoms of Whipple disease. Furthermore, the rare co-occurrence of such semantically associated words may carry little or no weight during their distributional representation, which may fail to capture the semantics of such associations. Therefore, the inclusion of such relational information into the distributional representation will enrich and enhance the quality of word representation.

In addition to linear window-based bag-of-word contexts, syntactic contexts have also been used to generate dependency-based word embeddings [9]. The syntactic contexts are the words that are linked with a target word through syntactic dependency relations generated by a parser. These syntactic contexts can capture the functional similarity of words [9]. For example, the dependency graph of an example sentence produced by the Stanford parser is shown in Figure 1, which depicts the dependency relations as edge labels of the graph. Levy and Goldberg [9] used direct and inverse dependency relations of the target word to generate its dependency-based contexts and learn syntactic dependency-based word embeddings. However, these dependency-based contexts, with direct and inverse relations at a one-hop distance in the dependency graph, are unable to capture the semantics of words that are at multihop (distant) dependency relations in the graph.

Many traditional approaches to text classification exist in the biomedical literature; however, the recent popularity of deep learning models such as convolutional neural networks (CNNs) and long short-term memory (LSTM) networks has drawn the attention of researchers in the biomedical domain seeking better performance in various NLP and text classification tasks. These deep learning models, together with word embeddings, have shown remarkable performance in biomedical text classification.

1.3. Our Contributions

The contributions of this article are twofold: first, learning effective word representations based on distributional, syntactic, and relational contexts; and second, employing the learned word representations for the classification of biomedical texts using deep learning-based classification models. It is a major extension of one of our conference papers [11], considering larger datasets, more benchmark evaluation datasets, an effective application of the learned word representations for text classification using deep learning models, and a comparative evaluation of the classification performance against the vectors learned by one of the existing state-of-the-art methods, GloVe.

1.3.1. Learning Word Representation

This article presents an approach to learning word representations using distributional, syntactic, and relational contexts. The relational contexts take into account how a target word is semantically related to the context words in a sentence. We refer to such semantic associations between the target and context words in a sentence as relational semantic information. The proposed approach incorporates relational semantic information distilled from a large corpus using dependency-based syntactic patterns [10] to augment the distributional representation of words from the same corpus through a neural network-based learning and updating process. We employ dependency-based syntactic patterns to extract long-range and multihop dependencies between a target word, say, Whipple, and semantically related words such as arthralgias, chronic diarrhea, weight loss, fever, and abdominal pain, representing the symptoms of Whipple disease. We extract these semantically related words in the form of semantic triples using the syntactic structure of the dependency tree and further use these triples to augment the distributional representation of the words. The repository of the extracted triples is called the relational semantic repository, which is used to augment the distributional information of the words from the given corpus. To start the learning process, we first obtain initial vectors by singular value decomposition (SVD) of a positive pointwise mutual information (PPMI) matrix produced from the corpus and from the relational semantic repository separately. The initial vectors are merged and updated to minimize a loss such that the PPMI value between co-occurring words from the corpus can be correctly predicted. To optimize the least-square minimization objective, we implement an objective function similar to that of the GloVe [4] model. The initial vectors are augmented such that, if any of the co-occurring words from the corpus have a word representation in the relational semantic repository, we merge the vectors from the corpus and the relational semantic repository and jointly optimize them using gradient descent-based adaptive optimization. As a result, we obtain enhanced word representations that can be used for various NLP applications.

1.3.2. Biomedical Text Classification

We evaluate the efficacy of the learned word representations using four different neural network-based classification models over two biomedical datasets. Neural network models, in particular CNN-based models, have shown exceptional performance in many NLP and text classification tasks compared to traditional ML algorithms. A CNN model performs high-level feature extraction using convolution filters to capture important features during the training process, which helps to improve classification performance. Other neural networks, including LSTM, have also shown remarkable performance for text classification. To evaluate the versatility of the word representations for the classification task, we employ the CNN, LSTM, CNN-LSTM, and bidirectional LSTM (BiLSTM) models.

In brief, the contributions of this article can be summarized as follows:
(i) It proposes an approach to learn and augment word representations from a corpus using the relational semantic repository extracted from the same corpus to handle both long- and short-range dependencies among semantically similar words.
(ii) It incorporates the strengths of pointwise mutual information, singular value decomposition, and neural network-based updating to learn efficient word representations.
(iii) It employs the learned word representations to train four deep learning-based classification models, namely, CNN, LSTM, BiLSTM, and CNN-LSTM, to classify biomedical texts.
(iv) It compares the efficacy of the learned word representations and their classification performance with the word representations learned by one of the state-of-the-art methods, GloVe.

The remaining part of the article is organized as follows. Section 2 presents a brief review of the existing works on text classification and word representation learning. Section 3 presents preliminary information about the various concepts used in the article. Section 4 provides a detailed description of the proposed approach of learning word representation and biomedical text classification. Section 5 presents the experimental details, and Section 6 presents the evaluation results. Finally, Section 7 concludes the article and presents future directions of the research.

2. Related Work

The text classification problem has been extensively studied in fields such as text analytics, information retrieval, and data mining by means of machine learning techniques, with a wide range of applications including text document clustering, sentiment analysis, language identification, and topic labeling [12]. There are different approaches to text classification, and they follow common processes such as document representation, feature selection or transformation, vector representation, and the application of statistical or machine learning techniques to achieve the desired performance. The popular traditional machine learning (ML) techniques explored by researchers include support vector machines, k-nearest neighbors, naive Bayes, decision trees, and their variants [13, 14]. Biomedical and clinical text classification has received much attention from researchers using these machine learning techniques [2, 15–17]. However, in recent years, there has been a drastic shift from traditional ML techniques to modern neural network-based classification techniques because of their potential for adaptive learning and generalized prediction. To this end, deep learning models have been widely used in fields such as computer vision, image analysis, and natural language processing, and they have shown outstanding performance in many biomedical applications because of their ability to model the nonlinear and complex patterns and relationships present within the data [18–21]. Deep learning methods use several layers to extract important features from the raw inputs through various learning steps and transformations at different layers. Raw inputs to deep learning models are presented as vector representations whose quality affects the performance of NLP tasks such as text classification. The initial vectors are nowadays taken as distributional representations of words in an embedding space, which have shown remarkable performance with deep learning models.

In recent years, there has been growing interest in learning distributional word representations from large unstructured corpora [3, 4]. The advancement of various word representation learning techniques to learn low-dimensional dense representations of words as vectors, commonly known as word embeddings, has efficiently solved many NLP problems such as named entity recognition [22], sentiment analysis [23], and sentence classification [24]. In this direction, two renowned neural network-based learning models, commonly known as the continuous bag-of-words (CBOW) and skip-gram (SG) models [25], have been widely used to learn distributional representations of words. These models exploit the neighboring context words that co-occur on either side of a target word within a fixed context window. CBOW uses the surrounding context words to predict a target word, while SG uses the current word to predict the surrounding context words. Likewise, GloVe [4] is another familiar model based on the global co-occurrence matrix, which minimizes a least-square loss while predicting the global co-occurrence between target and context words using initial random vectors of the desired dimension. These models learn distributional word representations from the corpus without incorporating any external knowledge. To enhance the quality of word representations and to incorporate domain knowledge, several studies [7, 26–29] have used external KBs. Yu and Dredze [26] proposed a joint objective of a relation constraint model and CBOW to learn word representations from a corpus and a similarity lexicon (synonymy) by assigning high probabilities to words that appear in the similarity lexicon. Likewise, Xu et al. [27] used the SG training objective with additional regularization parameters to incorporate relational and categorical information and learn better word representations. In [30], Ghosh et al. applied a vocabulary-driven skip-gram with negative sampling (SGNS) model to learn word representations exclusively associated with diseases from a health-related news corpus by incorporating domain knowledge as a vocabulary of terms associated with diseases, symptoms, and their transmission methods. Most of these approaches use either CBOW or SG and its variants like SGNS and jointly optimize them with a linear combination of some additional objective function or regularizer. In contrast, Alsuhaibani et al. [7], in their joint embedding learning, used a linear combination of GloVe and KB-based objective functions to incorporate relations such as synonymy, antonymy, hypernymy, and meronymy from WordNet. All of these and other existing approaches use third-party knowledge bases to enhance distributional word representations without extracting entities and their associations directly from the corpus, and hence ignore the relational semantics between words that lie outside the range of the context window. Furthermore, these models use linear window-based bag-of-word contexts to capture contextual features from the corpus. Besides this, there is another approach to learning word representations that uses the syntactic contexts produced by the dependency parse tree generated by a parser rather than window-based contexts. To this end, Levy and Goldberg [9] used dependency-based syntactic contexts and showed that dependency-based embeddings exhibit better functional similarity than the original SG embeddings.
Likewise, Komninos and Manandhar [31] have also shown that dependency-based word embeddings capture better functional properties and improve classification performance. Moreover, recent advancements in NLP have led to a focus on domain-specific tasks by fine-tuning sizeable pretrained neural language models such as bidirectional encoder representations from transformers (BERT) [32] for NLP tasks such as named-entity recognition and question answering. Researchers have demonstrated the adaptability of Word2Vec and BERT to the biomedical domain by developing models such as BioWordVec [33] and BioBERT [34], as well as other domain-specific models such as SciBERT [35] trained on various scientific and biomedical corpora, ClinicalBERT [36] trained on clinical notes for various NLP tasks, and MatSciBERT [37] trained on materials science publications. Deep learning models that take such trained word representations as input have been employed by researchers to classify unstructured text documents [38], medical notes [39], and health-related social media texts [40], and to perform biomedical text mining tasks [41]. Besides these, handwritten script recognition [42], disease detection [43–45], and healthcare solutions [46] also involve potential applications of deep learning models.

Word representations learned through the aforementioned algorithms are used and accordingly evaluated for various NLP applications, as they capture the contextual features of words. These semantically rich word representations, or word vectors, are fed as input to neural networks such as CNN and LSTM for tasks such as sentiment analysis [47–49] and text classification [24, 50]. As the proposed approach learns word representations related to the biomedical domain, we evaluate the quality of the trained word vectors through a text classification task over biomedical datasets.

3. Preliminaries

This section describes the background details of the essential concepts used in the proposed approach. Assume that a corpus D consists of documents d_1, d_2, ..., d_n, and that P is the collection of target and context word pairs (w, c) extracted from D such that, for any target word w, the context words c are the neighboring words of w within a fixed context window of size l. Additionally, V_W and V_C represent the word and context vocabularies of D, respectively. Throughout the article, bold letters represent vectors. Table 1 presents a list of the notations used in this article and their brief descriptions.

3.1. GloVe

GloVe (https://nlp.stanford.edu/projects/glove/) is a neural network-based method to learn the distributional representation of words in an embedding space by exploiting the global statistical information of words from a text corpus in an unsupervised manner. Given a fixed context window, the algorithm first creates a co-occurrence matrix from the corpus, whose rows correspond to target words and whose columns correspond to the context words within a fixed window surrounding each target word, and then uses this matrix to obtain efficient word representations through a neural network-based learning and updating process. The matrix entries represent the sum of the reciprocal distances of the co-occurring context words from the target word. The algorithm minimizes the weighted least-square regression loss shown in equation (1), where the weight function defined in equation (2) assigns a weight to each pair of target word w and context word c, and b_w and b_c represent their corresponding bias terms [4]. The hyperparameters α and x_max in equation (2) are assigned the values 0.75 and 100, respectively, to control the overweighting of rare and frequent co-occurrences [4].

The GloVe algorithm starts the learning process from randomly initialized vectors of the desired dimension for the target and context words and gradually updates these initial vectors using the stochastic gradient descent (SGD) algorithm. The primary goal of the GloVe algorithm is to minimize the weighted least-square loss such that the logarithm of the word co-occurrence statistics can be accurately predicted by the dot product of the target and context word vectors.
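For illustration, the following is a minimal Python sketch of the GloVe-style weighting function and weighted least-square loss described above. The co-occurrence statistic X, the embedding matrices, and the bias vectors are placeholders, and GloVe's published defaults x_max = 100 and α = 0.75 are assumed.

```python
import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    """GloVe weighting: down-weights rare co-occurrences, caps frequent ones."""
    return (x / x_max) ** alpha if x < x_max else 1.0

def glove_loss(W, C, b_w, b_c, X):
    """Weighted least-square loss over all nonzero co-occurrence entries X[i, j];
    the dot product of word and context vectors predicts log X[i, j]."""
    loss = 0.0
    for i, j in zip(*np.nonzero(X)):
        diff = W[i] @ C[j] + b_w[i] + b_c[j] - np.log(X[i, j])
        loss += glove_weight(X[i, j]) * diff ** 2
    return loss
```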

3.2. Pointwise Mutual Information

Word and context associations are most commonly represented by the co-occurrence counts of word and context pairs from the corpus. However, a mere co-occurrence count does not include any contextual information; hence, it may not be the best measure of association. Pointwise mutual information (PMI) is a more powerful measure of association that quantifies how often two events (a target word w and a context word c) appear together compared with what one would expect if they occurred independently, as defined by equation (3) [51]. In other words, the PMI value between the target word w and the context word c is the log ratio of the joint probability of the word pair to the product of their marginal probabilities. It gives an estimate of the strength of the association between the target and context words. When w and c do not co-occur within the fixed window in the corpus, the joint probability is zero, which drives the PMI value to negative infinity. Furthermore, negative PMI values tend to be unreliable unless we have massive corpora. To circumvent these situations, another familiar measure called positive PMI (PPMI) is used, which maps negative PMI values to zero using equation (4). It has been shown in [52] that PPMI is a better metric than PMI for measuring the semantic similarity between two words. Equation (4) takes the maximum of the PMI value and 0, as it is preferable to give a higher score to word pairs with more evidence supporting their similarity when measuring word similarity. However, PPMI matrices are highly sparse and require extensive computational resources. One remedy is to map such sparse matrices into low-dimensional dense vectors for generalization and computational efficiency by employing matrix factorization techniques such as SVD.
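For illustration, a minimal Python sketch of mapping a co-occurrence count matrix to its PPMI matrix is shown below; the matrix layout (rows as target words, columns as context words) follows the description above.

```python
import numpy as np

def ppmi_matrix(counts):
    """counts: co-occurrence count matrix (rows: target words, columns: context words)."""
    total = counts.sum()
    p_w = counts.sum(axis=1, keepdims=True) / total   # marginal P(w)
    p_c = counts.sum(axis=0, keepdims=True) / total   # marginal P(c)
    p_wc = counts / total                             # joint P(w, c)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(p_wc / (p_w * p_c))              # PMI(w, c)
    pmi[~np.isfinite(pmi)] = 0.0                      # zero counts give log(0)
    return np.maximum(pmi, 0.0)                       # PPMI = max(PMI, 0)
```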

3.3. Singular Value Decomposition

Singular value decomposition (SVD) is a dimensionality reduction technique that factorizes a matrix M into three matrices U, Σ, and V such that M = UΣV^T, where U and V are orthogonal matrices and Σ is a diagonal matrix of positive real values called singular values. It reduces the data dimensionality while preserving the main relationships of interest in a low-dimensional representative matrix. To produce d-dimensional dense vectors, we can truncate the decomposition of M to U_d, Σ_d, and V_d, corresponding to the top d singular values. In NLP applications, we can thus produce a d-dimensional dense matrix that is an approximate representation of the high-dimensional sparse matrix [53]. Furthermore, in the word and context setting, we can obtain the target and context word representative matrices W and C, respectively, from the truncated factors of the decomposition, as stated in [53]. These initial representative matrices (W and C) should satisfy the criterion of minimizing the matrix decomposition error.
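For illustration, a minimal Python sketch of producing d-dimensional word and context matrices from a PPMI matrix via truncated SVD is shown below; splitting the singular values symmetrically between the two factors is one common choice and is only an assumption here.

```python
import numpy as np

def svd_embeddings(ppmi, d=100):
    """Truncated SVD of the PPMI matrix; returns d-dimensional word and context matrices."""
    U, S, Vt = np.linalg.svd(ppmi, full_matrices=False)
    U_d, S_d, V_d = U[:, :d], S[:d], Vt[:d].T   # top-d singular triplets
    W = U_d * np.sqrt(S_d)                      # target word matrix
    C = V_d * np.sqrt(S_d)                      # context word matrix
    return W, C
```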

4. Proposed Approach

This section presents a detailed description of the proposed approach of learning augmented word representations from a large corpus and a relational semantic repository and of their application to biomedical text classification. Figure 2 illustrates the workflow of the proposed approach, which comprises methods to produce the initial word representations, augment and update the initial word vectors through the relational semantics, and use the learned word representations for text classification. It depicts a document crawler that crawls PubMed documents using a set of query patterns. The crawled documents constitute the corpus D, which we use to evaluate the proposed approach. The same corpus is exploited to extract the relational semantic information, as discussed in [10, 54], and is utilized to construct a relational semantic repository R. The corpus and the relational semantic repository are employed to generate the initial word representations by applying SVD to their underlying PPMI matrices.

A detailed description of various processes involved in learning word representation is presented in the following subsections.

4.1. Initial Vector Representation

The first step of our proposed approach is to initialize vectors of the desired dimension for each target and context word. We augment and update these initial vectors using the relational semantic repository and a weighted least-square loss minimization function to obtain enriched embeddings. Traditionally, distributed word representations relied on count-based vectors such as tf-idf or SVD-based vectors. However, neural network-based word representations that consider a target word and its context within a fixed window have proven to be very effective in various NLP applications. The word representations learned using the GloVe [4] and Word2Vec [3] methods have shown their applicability in various NLP applications. However, Levy et al. [53, 55] have shown that neural network-based word representations are comparable in performance to traditional word representations generated by the decomposition of the PPMI matrix formed from the co-occurrence matrix of a corpus. Hence, to include the strength of traditional decomposition-based vectors, the proposed approach adopts the PPMI approach to generate the initial word representations by factorizing the PPMI matrix using SVD. Accordingly, we first build a co-occurrence matrix using the co-occurrence counts of target and context word pairs (w, c) from the corpus D, with w ∈ V_W and c ∈ V_C. The matrix is then mapped to a PPMI matrix, which is further decomposed using SVD to produce U_d, Σ_d, and V_d. Consequently, we obtain the initial word representations for the target and context words as matrices W_D and C_D from the truncated factors. Likewise, we also obtain the initial word representations from the relational semantic repository R and represent them as W_R and C_R, respectively, for the target and context words. Furthermore, to obtain better word representations, the resulting initial word representations from the corpus need to minimize the matrix decomposition error. To minimize this error and to incorporate the relational semantic information from R, we augment and update the initial word representations from the corpus in such a manner that the weighted least-square loss is minimized. The augmentation and updating process of the initial word representations is described in the following subsection.

4.2. Objective Function Augmentation

In the proposed approach, we adopt the GloVe approach of minimizing the decomposition error to optimize the initial word representations. GloVe learns a low-dimensional dense representation of word vectors from a corpus without incorporating any additional or external relational knowledge; we discussed its important limitations in Section 1. To address these limitations, we incorporate information from the relational semantic repository into the initial word representations from the corpus by merging the initial word representations from the repository with those from the corpus. We perform this merging of vectors during the optimization process to produce augmented and enhanced word representations. To this end, we define an objective function analogous to the GloVe objective function, as shown in equation (5), where the function defined in equation (6) assigns a weight to each co-occurrence pair (w, c), PPMI(w, c) is the PPMI value of the pair, and b_w and b_c are the biases of the target and context word vectors, respectively. The target and context word vectors in equation (5) are the merged initial word and context vectors of w and c. The merging process of the initial vectors is described in the following paragraphs.

We consider three categories of word pairs from the vocabulary of the pair collection P, based on their presence or absence in the vocabulary V_R of the relational semantic repository R:
(i) pairs in which both the target and the context word are members of V_R;
(ii) pairs in which neither the target nor the context word is a member of V_R;
(iii) pairs in which either the target or the context word is a member of V_R.

Each of the three categories of word pairs needs to be handled appropriately while merging the initial vectors of w and c. Consider the first case, wherein both the target and context words are members of V_R: we have initial vectors from both the corpus D and the repository R for the target and context words w and c. These initial vectors are merged such that the resultant vector corresponding to the target word combines its corpus-based and repository-based initial vectors, and likewise for the context word. It should be noted that one set of initial vectors comes from the corpus D, while the other comes from the relational semantic repository R.

Likewise, in the second case, wherein neither the target word nor the context word is a member of V_R, we have the initial vector representations of w and c from the corpus only. In this case, as w and c are not found in R, no merging is needed, and the resultant vectors corresponding to w and c are simply their corpus-based initial vectors. Similarly, in the third case, wherein either the target or the context word is contained in V_R, we have the initial vector representation of one of the two words (target or context) in both D and R. In this case, we use the merged initial vector representation of whichever word belongs to both the corpus and the repository: if the target word is present in both, its resultant vector is the merged target vector, and if the context word is present in both, its resultant vector is the merged context vector.
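For illustration, a minimal Python sketch of this case-wise merging is shown below; corpus_vecs and repo_vecs are hypothetical lookups from words to their initial SVD-based vectors, and element-wise addition is assumed as the merge operation purely for illustration.

```python
import numpy as np

def merged_vector(word, corpus_vecs, repo_vecs):
    """Return the initial vector for `word`; merge the corpus-based and
    repository-based vectors when the word also appears in the repository."""
    if word in repo_vecs:                            # word is in the repository vocabulary
        return corpus_vecs[word] + repo_vecs[word]   # assumed merge: element-wise sum
    return corpus_vecs[word]                         # otherwise, corpus-based vector only
```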

4.3. Adaptive Updating of Parameters

Gradient descent techniques are widely used optimization techniques for updating parameters during the training of neural networks. Like the GloVe model, we use the AdaGrad [56] gradient descent technique to update the parameters during the learning process. AdaGrad is an adaptive update algorithm that automatically adjusts the learning rate. The gradients for the target and context words and their corresponding biases are calculated using the following equations:

AdaGrad efficiently handles sparse data by performing larger updates for rarely occurring words and smaller updates for frequently occurring words. Equation (8) is used for updating the target word vectors, where the combined target word vector is updated using its gradient at time step t, scaled by the accumulated squared gradients up to time t. Likewise, equations (9)–(11) are used for updating the merged context word vector and the target and context word biases, respectively.
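For illustration, a minimal Python sketch of a single AdaGrad step of the kind described above is shown below; the per-parameter accumulator of squared gradients and the small epsilon term are standard, and the learning rate of 0.05 matches the setting reported in Section 5.

```python
import numpy as np

def adagrad_update(param, grad, grad_sq_sum, lr=0.05, eps=1e-8):
    """One AdaGrad step: the accumulated squared gradient scales the learning
    rate per dimension, so rarely updated parameters receive larger steps."""
    grad_sq_sum = grad_sq_sum + grad ** 2
    param = param - lr * grad / (np.sqrt(grad_sq_sum) + eps)
    return param, grad_sq_sum
```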

4.4. Deep Learning Models

This section presents a detailed description of the deep learning models used for text classification. Deep learning models, together with word embeddings as the input, are proving to be very effective for text classification. These are essentially machine learning models with better intelligence, efficient learning ability, high accuracy, and robust performance. The most popular basic deep learning models used for text classification are CNN and LSTM networks and their variants, such as BiLSTM and CNN-LSTM. All these models take a sequence of input vectors corresponding to the textual data and exploit these vectors to capture important features that help map the text to its respective label. The texts to be classified are first preprocessed by tokenizing them and removing symbols, punctuation marks, numbers, and stopwords. The preprocessed text documents, each consisting of a sequence of tokens, are then transformed into sequences of d-dimensional vectors, where the vectors correspond to the learned word representations obtained either by the proposed approach or by other state-of-the-art approaches such as GloVe. All the deep learning models used for the text classification task are given the input text document as a sequence of d-dimensional vectors forming an embedding matrix corresponding to the tokens. Given a preprocessed text document with n tokens and the vectors v_1, v_2, ..., v_n corresponding to those tokens, the embedding matrix can be represented by equation (12), where the concatenation operation stacks the token vectors.

We consider token sequences of a fixed length of 25 to form the embedding matrix. The embedding matrices thus formed constitute the embedding layer of each model, and they are then fed into the different deep learning models to learn high-level features for efficient classification. The deep learning models used in this article for biomedical text classification are discussed in the following subsections.
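For illustration, a minimal Python sketch of building the fixed-length embedding matrix for one document is shown below; `vectors` is a hypothetical dictionary mapping tokens to their learned d-dimensional NumPy vectors, and zero-padding of shorter documents is an assumption.

```python
import numpy as np

def embedding_matrix(tokens, vectors, max_len=25, dim=100):
    """Stack the learned vectors of the first `max_len` tokens into a
    (max_len x dim) matrix; unknown tokens and padding positions stay zero."""
    mat = np.zeros((max_len, dim))
    for i, tok in enumerate(tokens[:max_len]):
        if tok in vectors:
            mat[i] = vectors[tok]
    return mat
```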

4.4.1. Convolutional Neural Network (CNN)

A CNN model comprises several layers: it converts texts into an embedding matrix and learns high-level features by passing the embedding matrix through a convolution layer and the intermediate outputs through a max-pooling layer and fully connected dense layers to predict the class labels. The given text is preprocessed by tokenizing it and removing symbols, punctuation, numbers, and stopwords. The preprocessed tokens, say n tokens per text document, are then mapped into an embedding matrix (a sequence of vectors) at the embedding layer using the learned word representations. The embedding matrices formed from the input texts are fed as input to the convolution layer, which employs filters of different widths, convolving them over the embedding matrices to extract high-level features and create feature maps. A filter of a given width convolves over the embedding matrix with a fixed stride to create a feature map, as determined by equation (13), where the filter is applied to successive windows of token vectors, a bias term is added, and an activation function is applied to the result. The rectified linear unit (ReLU) activation function is used to introduce nonlinearity into the system, as represented by equation (14).

The feature maps are further passed through a max-pooling layer, which selects the maximum value from the feature map corresponding to each filter to form a max-pooled feature vector. To control overfitting, dropout is used, which drops some neurons while keeping the others with some probability. The last layer of the network is the fully connected dense layer, which predicts the class probabilities using the softmax activation function [57]. A detailed description of the basic CNN architecture applied in our experiments can be found in [50]. The categorical cross-entropy loss function is used to calculate the loss, while the AdaDelta [58] algorithm is used to update and optimize the parameters.
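For illustration, a minimal Keras (TensorFlow) sketch of such a CNN classifier is shown below, using the 100 filters, max-pooling of size 2, dropout of 0.5, regularization of 0.03, AdaDelta optimizer, and categorical cross-entropy loss mentioned in this article; the filter width of 3 and the frozen embedding layer are illustrative assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_cnn(vocab_size, emb_dim, emb_weights, num_classes):
    """CNN classifier over pretrained word vectors (emb_weights: vocab_size x emb_dim)."""
    model = models.Sequential([
        layers.Embedding(vocab_size, emb_dim, trainable=False,
                         embeddings_initializer=tf.keras.initializers.Constant(emb_weights)),
        layers.Conv1D(100, 3, activation="relu",  # 100 filters; width 3 is assumed
                      kernel_regularizer=tf.keras.regularizers.l2(0.03)),
        layers.MaxPooling1D(pool_size=2),
        layers.Flatten(),
        layers.Dropout(0.5),
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adadelta", loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```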

4.4.2. Long Short-Term Memory (LSTM)

LSTM networks are a slightly tweaked form of recurrent neural network (RNN) that makes them suitable for text classification tasks. LSTM networks contain "memory cells," which are controlled by input, output, and forget gates. The gates control the inflow and outflow of information through the memory cells. The input gate adds new information to the cell and uses an activation function to regulate the value to be added. Similarly, the forget gate discards some information from the current content of the memory cell, while the output gate decides how much information should be forwarded to the next hidden state. LSTM uses a two-way storage of information, where the short-term recent history is stored as the activations of neurons, while the long-term memory is stored in the weights, which are modified through backpropagation. During the forward pass, the input and output gates learn when to allow the activation into the internal state and when to pass it to the output state, respectively. When these entry and exit points are closed, the activation is retained inside the memory cell and hence does not expand, shrink, or affect the output of any intermediate state across multiple time steps. Similarly, during backpropagation, the gradients neither vanish nor explode across time steps. This allows LSTM to capture long-term dependencies more effectively than a simple RNN.

As stated above, a memory cell consists of input, output, and forget gates and a candidate memory cell, and their values are updated at each time step for the current input vector using the following equations, where the gate activations are computed with the sigmoid function, elementwise multiplications combine the gate values with the cell contents, and the corresponding weight matrices and bias vectors are the parameters of the input, forget, and output gates. The final hidden vector obtained from the LSTM cell, representing high-level features of the input text, is fed into a dense layer with the softmax activation function, which maps the output to the probabilities of classifying the text into its corresponding class labels. The softmax activation function is frequently employed to solve multiclass classification problems; it computes the relative probability that the high-level feature vector obtained from the LSTM cell belongs to each particular class. We apply the LSTM model to the biomedical text classification tasks in the experimental section.
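For illustration, a minimal Keras (TensorFlow) sketch of the LSTM classifier is shown below, using the 256 hidden units, AdaDelta optimizer, and categorical cross-entropy loss reported in Section 5; the frozen embedding layer is an illustrative assumption.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_lstm(vocab_size, emb_dim, emb_weights, num_classes):
    """LSTM classifier over pretrained word vectors."""
    model = models.Sequential([
        layers.Embedding(vocab_size, emb_dim, trainable=False,
                         embeddings_initializer=tf.keras.initializers.Constant(emb_weights)),
        layers.LSTM(256),   # 256 hidden units, as in Section 5
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adadelta", loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```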

4.4.3. Bidirectional Long Short-Term Memory (BiLSTM)

Bidirectional LSTM (BiLSTM) is an extension of the unidirectional LSTM that incorporates both historical and future contexts by introducing another hidden layer. BiLSTM captures contextual information in both directions, reading the inputs in the forward (normal) and reverse directions, which is quite advantageous in text classification tasks. If the hidden state for the forward sequence context and the hidden state for the backward sequence context are computed for a word, then the output for that word is given by the following equation, which takes the elementwise sum of the forward and backward hidden state vectors. The softmax function is then used to map the text to its corresponding label.
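For illustration, a minimal Keras (TensorFlow) sketch of the BiLSTM classifier is shown below; merge_mode="sum" mirrors the elementwise sum of the forward and backward hidden states described above, while the layer sizes follow the LSTM sketch and remain assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_bilstm(vocab_size, emb_dim, emb_weights, num_classes):
    """BiLSTM classifier; forward and backward hidden states are summed elementwise."""
    model = models.Sequential([
        layers.Embedding(vocab_size, emb_dim, trainable=False,
                         embeddings_initializer=tf.keras.initializers.Constant(emb_weights)),
        layers.Bidirectional(layers.LSTM(256), merge_mode="sum"),
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adadelta", loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```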

4.4.4. CNN-LSTM

The CNN-LSTM model consists of a CNN layer that extracts local n-gram features from the input data for an LSTM layer, which interprets the features for sequence prediction across time steps. In other words, the CNN-LSTM model comprises two submodels, CNN and LSTM. For the text classification task, the CNN submodel comprises a convolution layer followed by a max-pooling layer to capture and consolidate important high-level features as vectors. The max-pooled feature vectors are then fed into the LSTM layer, which captures long-distance dependency features and gives the final text representation. This representation is further passed through a dense layer with the softmax activation function to map the text into the corresponding class probabilities.
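For illustration, a minimal Keras (TensorFlow) sketch of the CNN-LSTM classifier is shown below, stacking the convolution and max-pooling layers of the CNN submodel ahead of the LSTM layer; the filter width and layer sizes are again illustrative assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_cnn_lstm(vocab_size, emb_dim, emb_weights, num_classes):
    """CNN submodel extracts local n-gram features; the LSTM layer models the sequence."""
    model = models.Sequential([
        layers.Embedding(vocab_size, emb_dim, trainable=False,
                         embeddings_initializer=tf.keras.initializers.Constant(emb_weights)),
        layers.Conv1D(100, 3, activation="relu"),   # local n-gram feature maps
        layers.MaxPooling1D(pool_size=2),
        layers.LSTM(256),                           # sequence-level features
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adadelta", loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```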

5. Experimental Setup and Results

We use a biomedical text corpus for learning word representation and evaluate the learned word vectors over multiple benchmark datasets for two evaluation tasks: word similarity and concept categorization. We also present an application of the learned word representation for the biomedical text classification task. The following subsections briefly describe the corpus and the relational semantic repository used for experimentation, the experimental setup, and the evaluation results over various benchmark datasets.

5.1. Corpus and the Relational Semantic Repository

The proposed approach is evaluated over a biomedical text corpus crawled from the PubMed (https://www.ncbi.nlm.nih.gov/pubmed/) database, an online repository of a large number of abstracts and citations related to various biomedical fields such as health, biomedicine, bioengineering, and the life and behavioural sciences. These biomedical abstracts encapsulate much useful disease-related information such as disease names, their associated symptoms, vectors, pathogens, etiologies, and transmitting agents, as well as drug-related information. PubMed gives access to the abstracts of biomedical literature through its NCBI Entrez system API (Axis 2.1.6.2 (https://axis.apache.org/axis2/java/core/)) by querying its server with the desired keywords. We retrieved 67,516 abstracts related to cholera, dengue, diarrhoea, influenza, leishmaniasis, malaria, and meningitis by querying the PubMed database; these abstracts constitute the corpus D. The document retrieval process is discussed in detail in [10, 54]. Moreover, we created the relational semantic repository R from the relation triples (subject, relation, object) extracted from the corpus. R consists of diseases, symptoms, and their associations in the form of semantic triples, which are extracted using the typed dependencies generated by the Stanford parser (https://nlp.stanford.edu/software/lex-parser.shtm) and filtered by employing MetaMap (https://metamap.nlm.nih.gov/). The process of extracting the relation triples is discussed in [10, 54].

5.2. Experimental Setup

The documents from the corpus are tokenized and preprocessed by eliminating punctuation marks, stopwords, and numbers. We first generate a co-occurrence matrix from the corpus using the co-occurrence counts of the target and context words within the fixed context window. The experimental evaluation is performed with two different context window sizes for the neighboring context of a target word; for example, for a window size of 5, the context words of a target word are the 5 preceding and 5 following words within the document. The co-occurrence matrix thus formed is converted into a PPMI matrix according to the method discussed in Section 4. The PPMI matrix is further factorized using SVD to obtain the initial vector representations of the corpus words. The same procedure is applied to obtain the initial word representations from the relational semantic repository R. We consider two different dimensions of the initial vectors when reporting the evaluation results of the proposed approach. To optimize the initial vectors by minimizing the least-square loss, we use the objective function defined in equation (5). We use AdaGrad [56], an SGD-based adaptive update algorithm, for updating the parameters and optimizing the vectors. The initial learning rate is set to 0.05 for the parameter updates. The algorithm of the proposed approach was executed for 50 iterations to converge to an optimal solution. Consequently, we obtain two sets of improved vectors, one for the target words (referred to as WE) and the other for the context words (referred to as CE). Furthermore, their combined vectors, referred to as Merged, are obtained by taking the average of the corresponding target and context vectors for each word in the vocabulary. We consider the Merged vectors because the authors in [4] reported that merged vectors perform better than either the word or the context vectors alone. We report the evaluation results for all three forms (target word, context word, and merged) of the vectors learned by the proposed approach and the corresponding forms of the vectors (GloVe_W, GloVe_C, and GloVe_Merged) learned by GloVe.

5.2.1. Parameters Setting for Biomedical Text Classification Models

For the biomedical text classification task, we employ the four basic neural network-based models discussed in Section 4, namely CNN, LSTM, BiLSTM, and CNN-LSTM, considering various parameter settings for the underlying models. We executed each model for 100 epochs and report the best results for each model in terms of training and validation accuracy. For all the models, we used the AdaDelta optimizer [58], which adapts dynamically over time and does not require hyperparameter tuning. Furthermore, we used the categorical cross-entropy loss function to estimate the loss of a model for updating the weights. For the CNN model, the initial filter and softmax weights are sampled from the interval [−0.1, 0.1]. We applied 100 filters of fixed width and stride, max-pooling of size 2, a dropout of 0.5 prior to the dense layer, and regularization of 0.03 at the convolution layer. Similarly, for the LSTM model, we used 256 hidden LSTM units, and for the remaining two models, the parameter settings remain the same.

We evaluate the quality of the vectors learned through the proposed approach on two assessment tasks: word similarity and concept categorization. We also provide an application of the learned word representations for classifying biomedical texts into different labels using four neural network-based classification models.

5.2.2. Word Similarity

For word similarity evaluation, we compare the cosine similarity of word pairs, computed using the learned word representations, against the similarity scores assigned to the corresponding word pairs by human annotators. The evaluation is based on the principle that the semantics of words are preserved by the trained word representations if there is a positive correlation between the calculated similarity values and the human-rated similarity values for the word pairs. In this regard, we use Spearman's rank correlation coefficient to measure the correlation between the calculated similarity values and the annotated similarity values for the word pairs of the benchmark datasets. The quality of the word vectors learned using the proposed approach is evaluated over fifteen benchmark datasets: BioSimLex [59], BioSimVerb [59], MEN (https://clic.cimec.unitn.it/elia.bruni/MEN.html), MTurk [60], RG65 [61], RW (https://www-nlp.stanford.edu/%20lmthang/morphoNLM/) [62], SCWS [63], SimLex999 [64], TR9856 [65], UMNSRS-Rel [66], UMNSRS-Sim [66], VERB143 [67], WS353 [68], WS353R [68], and WS353S [68]. The BioSimLex and BioSimVerb datasets cover concept pairs in biomedicine and comprise 988 noun pairs and 1000 verb pairs, respectively [59]. The MEN, MTurk, and RG65 datasets contain collections of 3000, 771, and 66 English word pairs, respectively, for the evaluation of semantic similarity and relatedness. RW is a rare-word dataset containing 2034 low-frequency word pairs to assess rare word representations [62], while SCWS contains 2003 word pairs along with their contexts [63]. Similarly, SimLex999 contains word pairs of different POS categories together with concreteness levels and association strengths [64]. Likewise, the UMNSRS-Sim and UMNSRS-Rel datasets contain 566 and 587 pairs of medical terms, respectively, for the evaluation of semantic similarity and relatedness [66, 69]. The VERB143 dataset contains 143 annotated verb pairs for the similarity task. Finally, WS353 is the original dataset, and WS353S and WS353R are its two subsets; they contain 353, 203, and 252 word pairs, respectively, associated with semantic similarity and relatedness [68].
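For illustration, a minimal Python sketch of this evaluation protocol is shown below; `vectors` is a hypothetical dictionary from words to their learned NumPy vectors, and word pairs with out-of-vocabulary words are skipped, which is an assumption rather than a detail specified in this article.

```python
import numpy as np
from scipy.stats import spearmanr

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def word_similarity_score(pairs, human_scores, vectors):
    """Spearman correlation between model cosine similarities and human ratings."""
    model_scores, gold = [], []
    for (w1, w2), score in zip(pairs, human_scores):
        if w1 in vectors and w2 in vectors:   # skip out-of-vocabulary pairs
            model_scores.append(cosine(vectors[w1], vectors[w2]))
            gold.append(score)
    rho, _ = spearmanr(model_scores, gold)
    return rho
```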

We compare the performance of the word representations learned using the proposed approach and the GloVe method on the word similarity task. We consider different window sizes and vector dimensions to assess their effects on the learned vectors. The word similarity evaluation results for the various combinations of vector dimension and window size are presented in Tables 2–5. It can be observed from these tables that the word vectors trained using the proposed approach report the best results for all combinations of window size and vector dimension compared to the GloVe-based vectors, except for four instances over the RW, VERB143, and WS353 datasets. Although the GloVe-based vectors report better results in these four instances (two in the case of RW and one each in the cases of VERB143 and WS353), the difference in performance between the vectors trained using the proposed approach and GloVe is not significant. Another interesting observation is that at the larger window size, the word vectors learned using the proposed approach perform better on all the datasets for both vector dimensions, which signifies that long-range dependencies are also vital. The best performance for each dataset over the different combinations of window size and vector dimension is highlighted in bold typeface. Furthermore, we can also observe from these tables that the word vectors learned using the proposed approach perform significantly better over the UMNSRS-Rel and UMNSRS-Sim datasets in comparison to the GloVe-based vectors. The results also show that the CE and Merged vectors learned using the proposed approach dominate over all other vectors. Other interesting insights may similarly be inferred from these tables.

5.2.3. Concept Categorization

Concept categorization is another way of evaluating the quality of word representations, wherein a set of concepts is grouped into distinct categories. It is based on clustering the vectors into distinct groups, and the performance is measured by how many concepts of a given category each cluster contains. Here, the purity metric is used: a purity value of 1 indicates that the given category is completely reproduced and hence the vectors are of the highest quality, whereas a purity value close to 0 indicates the worst cluster quality. We used seven benchmark datasets: AP [70], BLESS [71], Battig [72], ESSLI_1a (https://wordspace.collocations.de/doku.php/data:esslli2008:concrete_nouns_categorization), ESSLI_2b (https://wordspace.collocations.de/doku.php/data:esslli2008:abstract_concrete_nouns_discrimination), ESSLI_2c (https://wordspace.collocations.de/doku.php/data:esslli2008:verb_categorization), and Ohta-10-bio-words (https://github.com/spyysalo/wvlib/tree/master/word-classes/Ohta-10-bio-words) for the evaluation of the learned word vectors on the concept categorization task. The AP dataset contains 402 words in 21 concept categories [70], BLESS contains 200 concepts in 17 semantic classes [71], Battig contains 5231 words listed in 56 taxonomic categories [72], ESSLI_1a contains 44 concrete nouns belonging to 6 semantic categories, ESSLI_2b contains 40 nouns classified into three classes, ESSLI_2c contains 45 verbs belonging to 9 semantic classes, and Ohta-10-bio-words contains 12 word classes from the biomedical domain.
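For illustration, a minimal Python sketch of the purity-based concept categorization evaluation is shown below; k-means is assumed as the clustering algorithm, which is a common but not the only possible choice.

```python
import numpy as np
from sklearn.cluster import KMeans

def categorization_purity(words, labels, vectors, n_clusters):
    """Cluster the word vectors and compute purity: for each cluster, count its
    most frequent gold category, then divide the total by the number of words."""
    X = np.array([vectors[w] for w in words])
    _, y = np.unique(labels, return_inverse=True)   # encode categories as integers
    clusters = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(X)
    correct = 0
    for k in range(n_clusters):
        members = y[clusters == k]
        if members.size:
            correct += np.bincount(members).max()
    return correct / len(words)
```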

The evaluation results for the concept categorization task on the various combinations of vector dimension and window size are presented in Tables 6–9. It can be observed from these tables that the word vectors trained using the proposed approach show the best performance for all combinations of window size and vector dimension compared to the GloVe-based vectors, except for five instances over the ESSLI_1a, ESSLI_2b, and ESSLI_2c datasets. Among these five instances, the GloVe-based vectors show the best performance in three cases over the ESSLI_2c dataset and in one case each over the ESSLI_1a and ESSLI_2b datasets. The best performance for each dataset in these tables is highlighted in bold typeface. Furthermore, it can be observed from these tables that for each of the four combinations of window size and vector dimension, the vectors learned by both approaches show the worst performance over the Battig dataset, whereas the best performance alternates between the ESSLI_1a and ESSLI_2b datasets. Moreover, the Merged vectors learned using the proposed approach dominate and show the best results in most of the cases.

6. Comparative Analysis and Evaluation for Biomedical Text Classification Tasks

We investigate the performance of the learned word embeddings on two different text classification tasks: a binary classification task over the BioText Berkeley dataset and a multiclass classification task over the PubMed RCT 20K dataset. The details of the datasets and the text classification performance are presented in the following subsections.

6.1. Comparative Analysis on the BioText Berkeley Dataset

The BioText Berkeley dataset (https://biotext.berkeley.edu/dis_treat_data.html) is a benchmark dataset containing the labeled sentences of 100 titles and 40 abstracts obtained from MEDLINE 2001, with each sentence labeled based on its content [73]. The sentences are labeled based on the roles and relationships of disease and treatment mentions, considering eight different categories. During dataset preprocessing, we discarded two categories, namely, "vague" and "to_see." Thereafter, the remaining categories were grouped into two classes, wherein the first class contains all the disease- and treatment-related sentences while the remaining sentences constitute the second class. Finally, the curated dataset, containing 3415 labeled sentences, is used as the evaluation dataset for the binary text classification problem.

Following the dataset curation process, the four neural network-based classification models discussed in Section 5 are trained, and the resulting training and validation accuracies are presented in Tables 10–13. The best results corresponding to the word vectors trained using the proposed approach and the GloVe method for every combination of window size and vector dimension are shown in bold typeface. It can be observed from these tables that, in most cases, the classification accuracy using the vectors trained by the proposed approach is significantly better. An interesting observation is that the CE and WE vectors trained using the proposed approach achieve the best performance in most cases in terms of training and validation accuracy for the various combinations of window size and vector dimension. Therefore, it can be inferred that averaging CE and WE does not yield impressive results for the text classification task, in contrast to the concept categorization and word similarity tasks, where the merged vectors showed good results. Furthermore, among the four neural network-based classification models, the CNN-LSTM model shows the best performance, followed by the CNN model, whereas the BiLSTM model shows the worst performance.

6.2. Comparative Analysis on the PubMed RCT 20K Dataset

The efficacy of the trained word vectors from both approaches is also evaluated over another benchmark dataset associated with the biomedical domain, PubMed RCT 20K [74]. The PubMed RCT 20K dataset is extracted and curated from PubMed for sequential sentence classification and consists of 20,000 abstracts of randomized controlled trials [74]. Each sentence of the dataset is labeled based on its role in the abstract, considering that a sentence can belong to one of five categories: background, objective, method, result, or conclusion [74]. The original dataset was preprocessed to filter out numbers, symbols, and stopwords. As a result, the final dataset comprises 176,560 training and 29,667 validation sentences. As with the BioText Berkeley dataset, we trained the same set of four neural network-based classification models. The resulting training and validation accuracies are presented in Tables 14–17. It can be observed from these tables that there is a slight increase in the training and validation accuracies with increasing vector dimension and context window size. Furthermore, in contrast to the BioText Berkeley dataset, the BiLSTM and LSTM models perform better than the CNN and CNN-LSTM models. This may be because the dataset is sequential and the sentences are sequentially associated with each other. The CNN model shows the worst performance compared to the other models. On this dataset as well, the CE and WE vectors show better performance than the Merged vectors. Other interesting observations can similarly be inferred from these tables.

7. Conclusion and Future Works

Biomedical text classification is becoming important for extracting valuable information from proliferating biomedical repositories, and deep learning has encouraged researchers to develop neural network-based classification models for efficient text classification using low-dimensional dense vectors (aka word embeddings). In this article, we presented a method of incorporating the relational semantic information of distant words, and of words with infrequent co-occurrence within the corpus, into the distributional representation of words by augmenting the vectors learned from a corpus with those learned from a relational semantic repository, thereby learning enriched word representations. The effectiveness of the proposed approach is evaluated by performing word similarity and concept categorization tasks over various benchmark datasets using the learned word vectors. We also applied the learned word vectors to classifying biomedical texts and found that they perform significantly better than the vectors learned by the widely used GloVe model. Since relation mining is a well-studied problem in the biomedical domain, we have considered the biomedical domain as one of the potential application domains for our proposed word representation method based on distributional and relational contexts. However, the proposed approach is generic and can be applied to any domain for which the required relation triplets are available. Exploiting external knowledge bases along with the distributional and relational contexts to further improve the word representations is an interesting direction for future research.

Data Availability

The data used to support the findings of this study are available upon reasonable request to the corresponding authors.

Conflicts of Interest

The authors declare that they have no conflicts of interest.