Abstract
Drug combinations can reduce drug resistance and side effects and improve the efficacy of disease treatment. Effectively identifying drug-drug interactions (DDIs) is therefore a challenging and important problem. Several existing approaches leverage advanced representation learning and graph-based techniques for DDI prediction. While these methods have demonstrated promising results, few of them effectively exploit the potential of knowledge graphs (KGs), which provide information on drug attributes and multiple relations among entities. In this work, we introduce a novel attention-based KG representation learning framework. A pretrained model is used to encode drug SMILES sequences, while a message-passing neural network maps molecular structure information into the initial embeddings of drug nodes within the KG. In the KG representation module, a knowledge-aware graph attention network captures the representation of each drug and its topological neighbours. To prevent oversmoothing, a residual layer is used in the DDI prediction module. Comprehensive experiments on several datasets demonstrate that the proposed method outperforms state-of-the-art algorithms on the DDI prediction task across a range of evaluation metrics. It achieves an accuracy of 0.924 and an AUC of 0.9705 on the KEGG dataset, and an accuracy of 0.9777 and an AUC of 0.9959 on the OGB-biokg dataset. These experimental findings affirm that our approach is a reliable model for predicting drug-drug associations.
1. Introduction
Drug combinations are a promising therapeutic strategy [1]. However, combining drugs may increase the risk of unexpected adverse drug reactions (ADRs), such as reduced drug efficacy or increased drug toxicity, which can result in injury and death [2, 3]. Hence, it is necessary to effectively detect potential drug-drug interactions (DDIs) to mitigate the impact of unexpected pharmacological effects.
Several machine learning-based approaches identify potential DDIs by utilizing a range of drug-related similarity features, such as drug structure [4–6], adverse or side effects [7, 8], and phenotypic similarity [9, 10]. However, these studies heavily rely on manually engineered features and domain expertise. Other works use embedding approaches to automatically discover drug representations. Various methods have been employed to model DDIs, such as matrix factorization [11–13], random walk [14], and graph neural networks [15]. Despite the remarkable performance achieved by these methods, a notable limitation is that they treat DDIs as independent data samples, without considering the correlations among them, such as shared drug-target pairs. Several researchers have explored the application of knowledge graphs (KGs) to DDI prediction, as seen in the works of Celebi et al. [16] and Karim et al. [17], where drug embeddings are obtained using conventional KG embedding methods such as ComplEx [18]. Lin et al. introduced an end-to-end neural method based on the knowledge graph to predict potential DDIs [19]. Although these studies effectively used knowledge graphs with ample biomedical information and achieved improved performance, they omitted the attribute information of drugs. It has been shown that node attributes are essential for accurately analysing complex networks [20–22]. Su et al. introduced an attention-based framework for KG representation learning that employs the simplified molecular input line entry system (SMILES) sequence as an attribute of drug entities within the knowledge graph [23]. Drugs are chemical molecules with distinct spatial structures [24, 25], and the sequence alone cannot fully represent the spatial structure of a drug. Therefore, it is crucial to include the atomic graph of a drug as an additional attribute in the knowledge graph.
This research presents a novel attention-based KG representation learning framework, which considers two drug attributes (the molecular graph and the SMILES sequence) together with the triple facts in the KG. Our framework consists of three blocks. The first block is the drug representation initialization module, where we take the SMILES sequence and the molecular graph as drug attributes. Based on these attributes, we leverage message-passing neural networks (MPNNs) [26] and the SELFormer pretrained model [27] to initialize the drug embeddings. The second block, the knowledge graph representation learning module, learns the representation of each drug and its topological neighbourhood through a knowledge-aware graph attention network; both high-order structural information and semantic relation features are captured in this block. In the final block, potential DDIs are predicted as a binary classification task. We summarize the main contributions of our research as follows:
(i) For DDI prediction, we devise a novel KG representation learning framework that exploits attention mechanisms. This framework effectively utilizes the information of biomedical KGs as well as drug molecular structure and sequence information.
(ii) We establish a drug representation initialization module designed to acquire the initial drug embeddings based on their attributes within the KGs. At the same time, we use a knowledge-aware graph attention network that calculates attention weights by considering a drug node, its surrounding neighbourhood, and the triple facts, which better learns the representation of the drug and its topological neighbourhood.
(iii) Through a series of experiments conducted on two biomedical datasets, our model consistently demonstrates superior performance compared to current state-of-the-art methods.
2. Related Work
Recently, numerous works have emerged to address the problem of predicting DDIs. Based on the hypothesis that similar drugs exhibit a higher propensity to interact, some previous studies attempt to predict drug interactions using analogous features derived from molecular structure [28] and various properties (e.g., phenotypes [29], functionality [30], and side effects [31]). In the study by Ryu et al., DDI types were estimated using a deep neural network (DNN) model trained on chemical structure similarity [32]. With the development of neural network methods in the graph domain, several investigators have employed graph neural network methodologies to extract features related to molecular properties and structure for molecular interaction tasks. Xu et al. introduced an approach that uses multiple graph convolution layers to extract node features from their neighbours in a structured entity graph [33]. Deac et al. introduced a method that combines the side-effect type and the molecular structures through a co-attentional mechanism to generate drug-level representations [24]. GNN-DDI constructs a five-layer graph attention network (GAT) encoder to capture drug representations [34]. These works demonstrate the crucial role of drug molecular structure in DDI prediction.
With the advancement of deep learning techniques and the increasing availability of large biomedical networks, network-based methods employ various advanced techniques, which can be roughly categorized into three groups: graph embedding, link prediction, and knowledge graph-based methods. Some works adopt diverse graph embedding algorithms to capture potentially influential network-based features, such as GraRep [12], HOPE [35], DeepWalk [14], and node2vec [36]. Other works treat DDI prediction as a link prediction task within the drug-drug interaction graph or network. Based on a graph auto-encoder, the Decagon model predicts multirelational links on multimodal graphs comprising multiple types of interactions among different entities (e.g., drug-protein target interactions) [8]. Liu et al. constructed several drug feature networks and utilized graph embedding techniques to obtain drug representations from these networks [37]. Feng et al. introduced a method that uses a relational graph convolution network (RGCN)-based encoder and similarity regularisation of multi-drug features to learn the topological features of DDI networks [38]. In addition, some works show that using information from multiple sources (e.g., a knowledge graph) improves prediction performance. KG-DDI targets the identification of drug-drug interactions by using different embedding approaches to learn node representations within the KG [17]. The KGNN method uses GCNs and neighbourhood sampling to effectively identify and analyse relationships between neighbourhoods [19]. DDKG takes drug attributes in the KG to learn drug embeddings and uses an attention mechanism to jointly consider neighbouring node embeddings and triple facts [23]. DDKG is the most relevant to our work. Our method differs from DDKG in that we include the molecular graph as an additional drug attribute in the KG. Additionally, we utilize a knowledge-aware graph attention network that considers the features of both entities (nodes) and relations (edges) in a multihop neighbourhood of a designated entity/node.
3. Materials and Methods
3.1. Datasets
In this research, we utilize two datasets to evaluate the effectiveness of the proposed model: KEGG-drug [39] and OGB-biokg [40]. First, to obtain the attribute information of drugs, we processed the two datasets separately and selected verified drugs according to the latest version of DrugBank [41]. Second, we also use these datasets to construct the KGs. It is important to ensure that the knowledge graphs do not contain any information about drug-drug interactions (DDIs). Therefore, we removed the relations labelled drug-drug from OGB-biokg and the relations presented as URL: Drug-Drug-Interaction from the KEGG-drug dataset, respectively. Table 1 shows the statistics of the remaining knowledge graph datasets and drug data.
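As an illustration of this cleaning step, the sketch below removes DDI-bearing triples from a raw triple list. The file names, column layout, and relation labels are assumptions for illustration and do not reflect the actual dataset schemas.

```python
# Minimal sketch of the KG-cleaning step: drop every triple whose relation
# encodes a drug-drug interaction, so the KG leaks no DDI labels.
import pandas as pd

def remove_ddi_triples(triple_file: str, ddi_relation_labels: set) -> pd.DataFrame:
    """Load (head, relation, tail) triples and filter out DDI relations."""
    triples = pd.read_csv(triple_file, sep="\t", names=["head", "relation", "tail"])
    mask = triples["relation"].isin(ddi_relation_labels)
    print(f"Removed {mask.sum()} DDI triples out of {len(triples)}")
    return triples[~mask].reset_index(drop=True)

# Hypothetical usage for the two datasets (file names are placeholders).
kegg_kg = remove_ddi_triples("kegg_triples.tsv", {"Drug-Drug-Interaction"})
ogb_kg = remove_ddi_triples("ogb_biokg_triples.tsv", {"drug-drug"})
```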
3.2. Problem Formulation
In our study, given the DDI matrix $Y$ and the biomedical knowledge graph $\mathcal{G}$, we aim to learn a function for estimating the probability of interaction between drug $d_i$ and drug $d_j$, as follows:

$\hat{y}_{ij} = \mathcal{F}\big(d_i, d_j \mid \Theta, Y, \mathcal{G}\big),$

where $\Theta$ represents the set of trainable parameters associated with the function $\mathcal{F}$. The DDI matrix and the knowledge graph (KG) are specified below.
3.2.1. Knowledge Graph
The knowledge graph is a collection of knowledge bases about the real world. It is formally presented as a set of triples $\mathcal{G} = \{(h, r, t)\}$, where $\mathcal{E}$ and $\mathcal{R}$ represent the set of entities and the set of relationships, respectively. Each triple $(h, r, t)$ indicates the relationship $r$ between head entity $h$ and tail entity $t$, where $h, t \in \mathcal{E}$, $r \in \mathcal{R}$, and $|\mathcal{G}|$ is the number of triples within the KG.
3.2.2. DDI Matrix
Let $\mathcal{D} = \{d_1, d_2, \ldots, d_N\}$ denote the drug set and $\mathcal{S} = \{s_1, s_2, \ldots, s_N\}$ the corresponding set of SMILES sequences, where $N$ refers to the number of drugs in the DDI matrix. For the DDI prediction task, we construct the DDI matrix $Y \in \{0, 1\}^{N \times N}$ with entries $y_{ij}$, where $y_{ij} = 1$ indicates that there is a reaction between drug $d_i$ and drug $d_j$. It is important to note that $y_{ij} = 0$ does not necessarily mean that there is no interaction between the two drugs, as this could be a potential interaction that has not yet been identified.
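A minimal sketch of this construction, assuming drugs are indexed from 0 to N-1 and known interacting pairs are given as index pairs:

```python
# Build the DDI matrix Y: y_ij = 1 for known interacting pairs, 0 otherwise
# (0 may still be an undiscovered interaction rather than a true negative).
import numpy as np

def build_ddi_matrix(num_drugs: int, known_pairs):
    """known_pairs: iterable of (i, j) index pairs with verified interactions."""
    Y = np.zeros((num_drugs, num_drugs), dtype=np.int8)
    for i, j in known_pairs:
        Y[i, j] = Y[j, i] = 1  # interactions are symmetric
    return Y

Y = build_ddi_matrix(5, [(0, 1), (2, 4)])
assert Y[1, 0] == 1 and Y[3, 3] == 0
```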
3.3. Methodology
Figure 1 illustrates the overall framework of our approach, which includes three modules for predicting DDIs: (a) the drug representation initialization module, (b) the knowledge graph representation learning module, and (c) the DDI prediction module. In the drug representation initialization module, we first process the different attribute features, namely SMILES sequences and drug molecular graphs, and then concatenate the resulting embeddings to initialize the drug nodes of the knowledge graph. In the knowledge graph representation learning module, to capture the high-order neighbourhood topology of drugs in the KG, we employ convolution-like operations that aggregate and integrate topological neighbourhood information. Simultaneously, we use a knowledge-aware attention mechanism to capture entity and relation features in a multihop neighbourhood of a given drug. In the final DDI prediction module, the final latent representations of a given drug pair are used to calculate the probability of interaction.

3.4. Drug Representation Initialization Module
Within the drug representation initialization module, we develop both a drug molecular graph representation module and a drug sequence representation module to acquire diverse drug representations from different perspectives.
3.4.1. Drug Molecular Graph Representation Module
Figure 2 shows the structure of the drug molecular graph representation module. For each drug $d_i$, its 2D molecular graph $G_i = (V_i, E_i)$ can be generated from its SMILES sequence by the RDKit tool [42], where $V_i$ denotes the atoms and $E_i$ represents the chemical bonds. To generate the structure representation, we employ MPNNs. This process is divided into two main phases:
(1) Message-passing phase. For node $v$, we first aggregate relevant information from its neighbouring nodes and then perform $K$ iterations to update its representation. Formally, this phase is described by

$m_v^{(k+1)} = \sum_{u \in \mathcal{N}(v)} M_k\big(h_v^{(k)}, h_u^{(k)}, e_{vu}\big), \qquad h_v^{(k+1)} = U_k\big(h_v^{(k)}, m_v^{(k+1)}\big),$

where $U_k$ denotes the node update function, $M_k$ denotes the message function, $\mathcal{N}(v)$ is the set of neighbours of node $v$ in graph $G_i$, $e_{vu}$ is the representation of the edge between node $v$ and node $u$, and $h_v^{(k)}$ denotes the representation of node $v$ after $k$ iterations.
(2) Readout phase. The global drug graph representation, denoted as $e_i^{mol}$, is obtained from the node representations generated after $K$ iterations:

$e_i^{mol} = R\big(\{h_v^{(K)} \mid v \in V_i\}\big),$

where $R$ denotes the readout function. We use the average readout function [43], which computes the mean of all node representations to produce the graph-level representation.
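The following PyTorch sketch illustrates the two MPNN phases above. The choice of message function (a linear layer), update function (a GRU cell), and the feature dimensions are assumptions for illustration rather than the exact configuration used in our experiments.

```python
# Minimal MPNN sketch: message passing over atoms/bonds, then mean readout.
import torch
import torch.nn as nn

class SimpleMPNN(nn.Module):
    def __init__(self, node_dim: int, edge_dim: int, num_iters: int = 3):
        super().__init__()
        self.num_iters = num_iters
        self.message_fn = nn.Linear(2 * node_dim + edge_dim, node_dim)  # M_k
        self.update_fn = nn.GRUCell(node_dim, node_dim)                 # U_k

    def forward(self, h, edge_index, edge_attr):
        # h: [num_atoms, node_dim]; edge_index: [2, num_bonds]; edge_attr: [num_bonds, edge_dim]
        src, dst = edge_index
        for _ in range(self.num_iters):
            # Message phase: compute and aggregate messages from each node's neighbours.
            msg = self.message_fn(torch.cat([h[dst], h[src], edge_attr], dim=-1))
            agg = torch.zeros_like(h).index_add_(0, dst, msg)
            # Update phase: refine each node representation with its aggregated message.
            h = self.update_fn(agg, h)
        # Readout phase: average all atom representations into one graph-level vector.
        return h.mean(dim=0)
```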

3.4.2. Drug Sequence Representation Module
Figure 3 displays the structure of the drug sequence representation module. The SMILES sequence, composed of molecular characters, is the most common way to represent drugs. Compared to the molecular graph, which captures the interconnectivity between atoms, the sequence carries additional information, provides essential functional information about the atoms, and enables the representation of long-term dependencies. However, some researchers have pointed out that the SMILES representation has drawbacks; for example, a syntactically valid SMILES string may correspond to invalid chemistry, such as surpassing the natural valency of an atom. Krenn et al. introduced SELFIES (SELF-referencing embedded strings), a string-based representation of molecular graphs known for its complete robustness [42]. Therefore, we convert SMILES notations into SELFIES representations using the SELFIES API and apply byte-level byte-pair encoding (BPE) to tokenize the SELFIES sequences of all drugs. Then we use SELFormer, a pretrained model proposed by Yüksel et al. [27], to generate the drug sequence embedding $e_i^{seq}$:

$e_i^{seq} = \mathrm{SELFormer}\big(\mathrm{BPE}(\mathrm{SELFIES}(s_i))\big).$
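A sketch of this sequence branch is shown below, using the SELFIES package for the SMILES-to-SELFIES conversion and a HuggingFace checkpoint for SELFormer. The checkpoint identifier and the mean-pooling step are assumptions for illustration; substitute the actual SELFormer weights and pooling strategy you use.

```python
# Sketch of the sequence branch: SMILES -> SELFIES -> pretrained encoder.
import selfies as sf
import torch
from transformers import AutoTokenizer, AutoModel

MODEL_NAME = "HUBioDataLab/SELFormer"  # assumed checkpoint identifier

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
encoder = AutoModel.from_pretrained(MODEL_NAME)

def drug_sequence_embedding(smiles: str) -> torch.Tensor:
    selfies_str = sf.encoder(smiles)                  # SMILES -> SELFIES
    tokens = tokenizer(selfies_str, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**tokens).last_hidden_state  # [1, seq_len, hidden]
    return hidden.mean(dim=1).squeeze(0)              # pooled sequence embedding

e_seq = drug_sequence_embedding("CC(=O)Oc1ccccc1C(=O)O")  # aspirin
```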

After obtaining the molecular embedding $e_i^{mol}$ and the sequence embedding $e_i^{seq}$, we combine them by simple concatenation to form the initial drug embedding:

$e_i^{(0)} = e_i^{mol} \,\|\, e_i^{seq},$

where $\|$ denotes the concatenation operation.
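A minimal sketch of this fusion step (the embedding dimensions are illustrative; non-drug KG entities would be initialized separately, which is not detailed here):

```python
# Concatenate the two views to initialize a drug node in the KG.
import torch

e_mol = torch.randn(16)   # molecular-graph embedding from the MPNN branch
e_seq = torch.randn(16)   # sequence embedding from the SELFormer branch
e_init = torch.cat([e_mol, e_seq], dim=-1)  # e_i^(0): initial drug-node embedding
print(e_init.shape)  # torch.Size([32])
```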
3.5. Knowledge Graph Representation Learning Module
After initializing the drug nodes of the knowledge graph with the initial embeddings $e^{(0)}$, it becomes essential to discern the associations between drug nodes and other entities within the knowledge graph. Within the KG, each node is connected to multiple neighbouring nodes, which carry distinct levels of significance. Furthermore, entities assume different roles depending on the relations they are linked with. Therefore, we employ a knowledge-aware graph attention network that operates on both entities and relations. Figure 4 illustrates the knowledge-aware attention network.

As shown in Figure 4, given a triple $(h, r, t)$, we concatenate the embeddings of the entities and the relation and generate the representation of the triple by a linear transformation:

$c_{hrt} = W_1 \big[e_h \,\|\, e_r \,\|\, e_t\big],$

where the vectors $e_h$, $e_r$, and $e_t$ denote the embeddings of the head entity, the relation, and the tail entity, and $W_1$ represents the linear transformation matrix. Subsequently, the significance of each triple is ascertained by employing another linear transformation and applying a nonlinearity:

$b_{hrt} = \mathrm{LeakyReLU}\big(W_2\, c_{hrt}\big),$

where $W_2$ represents the linear transformation matrix and $b_{hrt}$ represents the importance of the triple. To obtain the relative attention values $\alpha_{hrt}$, the softmax function is applied over $b_{hrt}$:

$\alpha_{hrt} = \operatorname{softmax}\big(b_{hrt}\big) = \frac{\exp\big(b_{hrt}\big)}{\sum_{t' \in \mathcal{N}_h} \sum_{r' \in \mathcal{R}_{ht'}} \exp\big(b_{hr't'}\big)},$

where $\mathcal{N}_h$ denotes the neighbourhood of entity $h$ and $\mathcal{R}_{ht'}$ represents the set of relations between entity $h$ and entity $t'$.
By summing the triple representations weighted by their associated attention values, we derive the aggregated neighbourhood embedding of entity $h$:

$e_{\mathcal{N}_h} = \sigma\Big(\sum_{t \in \mathcal{N}_h} \sum_{r \in \mathcal{R}_{ht}} \alpha_{hrt}\, c_{hrt}\Big),$

where $\sigma$ is the activation function, $\mathcal{N}_h$ denotes the neighbourhood of entity $h$, and $\mathcal{R}_{ht}$ represents the set of relations between entity $h$ and entity $t$.
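The attention and aggregation steps above can be sketched in PyTorch for a single head entity and its sampled neighbouring triples; the layer sizes, the LeakyReLU scoring nonlinearity, and the sigmoid output activation are assumptions for illustration.

```python
# Knowledge-aware attention over the sampled triples of one head entity:
# score each (h, r, t), softmax over the neighbourhood, and aggregate.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TripleAttention(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.W1 = nn.Linear(3 * dim, dim, bias=False)  # triple representation c_hrt
        self.W2 = nn.Linear(dim, 1, bias=False)        # importance score b_hrt

    def forward(self, e_h, e_r, e_t):
        # e_h: [dim]; e_r, e_t: [num_neighbours, dim] for the sampled triples of h.
        e_h = e_h.expand_as(e_t)
        c = self.W1(torch.cat([e_h, e_r, e_t], dim=-1))   # c_hrt
        b = F.leaky_relu(self.W2(c))                      # b_hrt
        alpha = torch.softmax(b, dim=0)                   # attention over neighbours
        return torch.sigmoid((alpha * c).sum(dim=0))      # aggregated e_N(h)
```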
After obtaining the propagated neighbourhood information $e_{\mathcal{N}_h}$, we combine it with the initial embedding $e_h^{(0)}$ to update the representation of the drug:

$e_h^{(1)} = f_{\mathrm{agg}}\big(e_h^{(0)}, e_{\mathcal{N}_h}\big),$

where $f_{\mathrm{agg}}$ denotes the aggregation function.
To better learn the global representation, we stack propagation layers to broaden the entity's receptive field. To elaborate, given a total of $L$ propagation layers, the representation of entity $h$ at the $l$-th layer is

$e_h^{(l)} = f_{\mathrm{agg}}\big(e_h^{(l-1)}, e_{\mathcal{N}_h}^{(l-1)}\big), \quad l = 1, 2, \ldots, L,$

where $e_h^{(0)}$ is the initial embedding generated by the drug representation initialization module.
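Building on the TripleAttention sketch above, the stacked propagation can be outlined as follows; the concrete aggregation function (concatenation followed by a linear layer and ReLU) is one common choice, assumed here for illustration.

```python
# Stack L propagation layers: each layer aggregates neighbourhood information
# and merges it with the previous entity embedding via f_agg.
import torch
import torch.nn as nn

class StackedPropagation(nn.Module):
    def __init__(self, dim: int, num_layers: int = 2):
        super().__init__()
        self.attn_layers = nn.ModuleList([TripleAttention(dim) for _ in range(num_layers)])
        self.agg_layers = nn.ModuleList([nn.Linear(2 * dim, dim) for _ in range(num_layers)])

    def forward(self, e_h0, neighbour_fn):
        # neighbour_fn(layer) returns (e_r, e_t) tensors for the sampled neighbours of h.
        e_h = e_h0
        for layer, (attn, agg) in enumerate(zip(self.attn_layers, self.agg_layers)):
            e_r, e_t = neighbour_fn(layer)
            e_nbr = attn(e_h, e_r, e_t)                     # e_N(h) at this layer
            e_h = torch.relu(agg(torch.cat([e_h, e_nbr])))  # f_agg(e_h^(l-1), e_N(h))
        return e_h                                          # e_h^(L)
```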
3.6. DDI Prediction Module
Because the oversmoothing problem emerges as the depth of graph neural networks grows, we introduce a residual layer to mitigate it. In this layer, we combine the initialized drug representation $e^{(0)}$ with the global drug representation $e^{(L)}$ to obtain the final drug embedding vector.
Finally, we generate the final representations of the drug pair and feed them into the prediction function to obtain the interaction score $\hat{y}_{ij}$. Our objective is to minimize the discrepancy between the predictions and the actual labels with a binary cross-entropy loss:

$\mathcal{L} = -\sum_{(d_i, d_j)} \Big[y_{ij} \log \hat{y}_{ij} + \big(1 - y_{ij}\big) \log\big(1 - \hat{y}_{ij}\big)\Big],$

where $y_{ij}$ is the real label for the drug pair $(d_i, d_j)$.
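A minimal sketch of the prediction module, assuming an inner-product scoring function (any pairwise scoring function could be substituted):

```python
# Residual combination of initial and propagated drug embeddings,
# inner-product scoring, and binary cross-entropy training loss.
import torch
import torch.nn.functional as F

def predict_interaction(e_i0, e_iL, e_j0, e_jL):
    # Residual layer: combine the initialized and the propagated representations.
    e_i = e_i0 + e_iL
    e_j = e_j0 + e_jL
    # Interaction probability from the final drug-pair representations.
    return torch.sigmoid((e_i * e_j).sum(dim=-1))

def ddi_loss(y_hat, y_true):
    # Binary cross-entropy between predicted probabilities and real labels.
    return F.binary_cross_entropy(y_hat, y_true.float())
```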
4. Experiment
4.1. Baselines
To illustrate the superior performance of our method, we use several state-of-the-art models as baselines:
(i) Laplacian [11]: an exemplary matrix factorization method, which generates network embeddings by factorizing the input data matrix into lower dimensional matrices.
(ii) DeepWalk [14]: this method employs random walks to generate node sequences, which are subsequently fed to the skip-gram algorithm [44] for node embedding learning.
(iii) LINE [45]: this method uses a neural network and incorporates local and global graph information for node embedding learning.
(iv) TransE [46]: it assesses the credibility of a fact by considering the proximity between two entities, typically after a transformation performed by the relation.
(v) ComplEx [18]: it measures the plausibility of facts by matching the latent semantics of entities and relations, which are embedded into vector space representations.
(vi) KGNN [19]: this method employs a graph neural network on the knowledge graph to capture both structural information and rich semantic features.
(vii) DDKG [23]: it acquires drug embeddings from their attributes within the knowledge graph and subsequently leverages an attention mechanism to consider neighbouring node embeddings and triple facts.
(viii) LaGAT [47]: a link-aware graph attention method, which considers various links between drug pairs in the knowledge graph to create multiple attention pathways for the drug entity.
4.2. Experimental Settings
In the experiments, we fix the embedding size to 32 for all methods. For the Laplacian, DeepWalk, and LINE methods, we use the BioNEV toolkit [48] to generate the node embeddings. We train TransE and ComplEx with the DGL-KE toolkit [49] to generate node embeddings of the knowledge graph. For KGNN and DDKG, we configure the parameters to match the settings described in their original works. For our model, we set the receptive-field depth to 2, the neighbourhood size to 4, and the batch size to 4096, and we adopt the Adam algorithm to optimize all trainable parameters. For the LaGAT method, we set its key parameters (e.g., the number of neighbour samples and the depth) to match ours for comparison. For both datasets, we use all drug pairs that are known to interact as positive samples and randomly allocate them into training, validation, and test sets in an 8 : 1 : 1 ratio. Furthermore, we randomly draw an equal number of negative samples from the complement of the positive set.
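The sampling protocol can be sketched as follows; the random seed and the uniform negative-sampling strategy are assumptions for illustration.

```python
# 8:1:1 split of positive drug pairs plus an equal number of negatives
# drawn from pairs not known to interact.
import random

def split_and_sample(positive_pairs, num_drugs, seed=42):
    rng = random.Random(seed)
    positives = list(positive_pairs)
    rng.shuffle(positives)
    n = len(positives)
    train = positives[: int(0.8 * n)]
    valid = positives[int(0.8 * n): int(0.9 * n)]
    test = positives[int(0.9 * n):]
    # Draw negatives uniformly from pairs not known to interact.
    known = set(map(tuple, positives)) | {(j, i) for i, j in positives}
    negatives = set()
    while len(negatives) < n:
        i, j = rng.randrange(num_drugs), rng.randrange(num_drugs)
        if i != j and (i, j) not in known and (j, i) not in negatives:
            negatives.add((i, j))
    return train, valid, test, list(negatives)
```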
5. Results and Discussion
In this section, we present and analyse the performance of our method and all baselines with 5-fold cross-validation. Several metrics are employed to evaluate the prediction performance, including the area under the precision-recall curve (AUPR), accuracy (Acc.), F1 score, and the area under the ROC curve (AUC). Table 2 reports these metrics on the KEGG-drug dataset and the OGB-biokg dataset. The AUC and AUPR curves of each algorithm are depicted in Figures 5 and 6.


Table 2 shows that our approach demonstrates a significant performance advantage over the baseline methods on both datasets, as evidenced by superior results on all four metrics. On the KEGG-drug dataset, our method improves AUC by 0.6%, Acc. by 0.93%, F1 score by 0.89%, and AUPR by 0.93% compared to LaGAT, which achieves the best results among the baselines. On the OGB-biokg dataset, the improvements in AUC and AUPR are modest, only 0.07 and 0.08, while the F1 score and Acc. improve by 0.41%. These gains can be attributed to our model's integration of drug attribute information, including molecular graph and drug sequence data, into the representation learning process. In addition, LAP, DeepWalk, and LINE show relatively poor performance compared to the other baselines because these approaches focus only on the topological properties of biomedical networks and ignore the wealth of attribute information present in biomedical entities and their relationships. ComplEx and TransE perform better than LAP, DeepWalk, and LINE, but not as well as the other baselines. This is because ComplEx and TransE take both entities and relationships into account but consider only the 1-hop information of the KG when performing representation learning, whereas other baseline methods (e.g., KGNN and DDKG) employ spatial-based graph convolutional networks (GCNs) to capture multihop information. Furthermore, DDKG performs better than KGNN because it not only considers the topological neighbourhood representations of drugs and relationships but also incorporates drug attribute information. Nevertheless, DDKG does not outperform our method. Building on DDKG, we exploit molecular structure information as an additional drug attribute in the KG and adopt a residual mechanism to avoid the oversmoothing caused by graph neural networks.
5.1. Ablation Study
As mentioned before, the core of our work is to incorporate the molecular graph as an additional drug attribute in the KG and to utilize an attention mechanism that considers the features of both entities (nodes) and relations (edges) in a multihop neighbourhood of a designated entity/node. Additionally, a residual mechanism is adopted to reduce the oversmoothing caused by graph neural networks. To investigate the effectiveness of these central ideas, we implemented the following variants:
(i) w/o residual mechanism: removes the residual mechanism from the original model.
(ii) w/o attention: replaces our attention mechanism with the one used in DDKG.
(iii) w/o seq: uses only the drug molecular graph information as the initial feature.
(iv) w/o mole: uses only the drug sequence information as the initial feature.
(v) w/o both: uses neither the drug sequence information nor the drug molecular graph information as the initial feature.
Table 3 exhibits the experimental results. All variants show slightly lower results than the original model. When the residual mechanism is removed, performance decreases consistently on both datasets, indicating that the residual mechanism can reduce oversmoothing by reintroducing the attribute information of the drug. Moreover, the variant using DDKG's attention mechanism scores lower than our method: the attention mechanism used in our method considers both entity and relation features in a multihop neighbourhood and captures the diverse roles that an entity plays under different relations, which enhances the expressive capacity of the drug representations. In the absence of either drug sequence information or molecular graph information, performance inevitably degrades on both datasets, confirming the importance of integrating drug attribute information when initializing drug embeddings.
5.2. Parameter Analysis
Here, we investigate how several key parameters affect the performance of our approach. When examining a particular parameter, the remaining parameters are fixed. Figures 7 and 8 display the values of AUC, Acc., F1 score, and AUPR on the KEGG-drug dataset.
5.2.1. Impact of Neighbourhood Size
First, we examined the influence of the neighbourhood size, i.e., the number of neighbouring nodes sampled, varying it from 1 to 5. The results corresponding to the different values are depicted in Figure 7. We observe that (1) our model performs best when the neighbourhood size is set to 4; (2) the unsatisfactory performance observed with small values can be attributed to their limited capacity to capture enough neighbouring nodes during the information aggregation process; and (3) when the size is set to 5, the performance begins to deteriorate due to the presence of unexpected noise.

5.2.2. Impact of the Depth of Receptive Field
Second, the effect of the depth of the receptive field, denoted as L, is investigated through a series of experiments in which L is varied from 1 to 4. Figure 8 shows the results. The proposed model performs best when L is set to 2. As the value of L increases, the perceptive path of each drug node becomes longer, causing an influx of noisy data, which adversely affects performance. Moreover, a higher value of L requires more CPU time and more epochs to achieve convergence. According to the experimental results, we set L to 2.

6. Conclusions
In this paper, we develop a new attention-based KG representation learning framework for drug-drug interaction prediction. First, by taking the sequence and molecular graph information of drugs as the attributes of drug nodes in the KG, we use the drug representation initialization module to acquire the initial embeddings of drugs. Then, we adopt a knowledge-aware graph attention network that considers the features of both entities (nodes) and relations (edges) in a multihop neighbourhood of a given entity/node. Based on this, the proposed model can capture the topological structure information and the semantic relationships within the KG. Furthermore, a residual mechanism is added to the proposed model to avoid the oversmoothing caused by GNNs. Finally, we implemented the proposed model and conducted extensive experiments on two benchmark biomedical KG datasets. The results of these experiments consistently demonstrate the superior performance of our model compared to several state-of-the-art drug-drug interaction prediction models across a spectrum of evaluation metrics.
Data Availability
The dataset and source code can be accessed at https://github.com/Nokeli/MFIKGDDI. All datasets in this study are from public resources. KEGG is available at https://www.genome.jp/kegg/. DrugBank is available at https://go.drugbank.com/releases/5-1-10/downloads/all-full-database; OGB-biokg is available at https://ogb.stanford.edu/docs/linkprop/.
Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.
Acknowledgments
This work was supported in part by the National Science Fund for Distinguished Young Scholars of China (62325308).