Abstract
Automatic malware detection aims to determine whether an application is malicious by means of automated systems. Android malware attacks have grown rapidly owing to the widespread use of mobile devices. Although significant progress has been made in antimalware techniques, existing methods mainly rely on program features and ignore the importance of source code analysis. Furthermore, dynamic analysis suffers from low code coverage and poor efficiency. Hence, we propose an automatic Android malware detection approach, named HyGNN-Mal. It analyzes Android applications at the source code level by exploiting sequence and structure information. In addition, we incorporate the typical static features, permissions, and APIs. In HyGNN-Mal, we use a deep traversal tree neural network (Deep_TNN) to process the code structure information. In particular, we add position information to the code sequence information before feeding it into the self-attention mechanism. Evaluations conducted on multiple public datasets indicate that our method can accurately identify and classify malicious software, with best accuracies of 99.62% and 99.2%, respectively.
1. Introduction
Android, the most popular mobile operating system, has attracted millions of users around the world. The market share of Android smartphones was 70.69 percent in April 2020 [1], and there were no signs of a decrease in the near future [2]. The global dominance of the Android operating system and the rich data stored on smartphones make Android users an attractive target for cyberattackers. Kaspersky’s mobile malware (a short form used for malicious software) evolution report showed that about 3.5 million malicious installation packages appeared in 2019 [3] and that the number of new Android malware samples in 2020 grew by 2.1 million, breaking the established trend of continuous decline [4]. Malware developers usually use obfuscation methods to generate malware variants, which lead to false and missed detections. Therefore, studying the accurate automatic detection of Android malware is an urgent and important task for researchers.
Malware detection techniques mainly fall into three approaches: (1) Static analysis obtains Android features by scanning the decompiled Android package’s files, permissions, source code, and API calls. (2) Dynamic analysis observes and selects features from the system calls, read and write operations, and network data while running the application in a closely monitored virtual environment. (3) Hybrid analysis is the combination of dynamic and static analysis. Many existing Android malware detection approaches have reported excellent results [5–13]. For example, Kim et al. [5] built malware detection models based on seven kinds of static features such as string features, opcode features, and API features, while Fan et al. [6] used subgraphs as sample features for Android malware familial classification. Many significant studies [14–21] adopt hybrid analysis to detect Android malware. For example, Wong and Lie [21] proposed a hybrid system, named IntelliDroid, a generic Android input generator for the analysis of any Android application. Dynamic analysis is effective against all types of malware. Tam et al. [22] built a dynamic analysis system, CopperDroid, to extract interactions between the app and the operating system, as well as interprocess and in-process communications. CopperDroid successfully disclosed more than 60% of the malicious behaviors of 1365 malware samples. The method has the limitation that it can only automatically check the parameters of methods contained in the interface that owns the AIDL file [23]. Alzaylaee et al. [24] adopted a DL solution based on dynamic analysis to detect malicious Android apps; the method used a real Android device and 31,125 apps for the experiments and achieved an accuracy of 0.952.
Currently, most previous studies ignore in-depth analysis of the application, such as the structure information, the interdependency of the program, and time complexity. In particular, dynamic analysis is accurate but time-consuming and labor-intensive [25]. Hybrid analysis methods can generate a better overall picture of the program but are more complex and time-consuming [26]. The application of deep learning (DL) techniques to malware detection is effective [27–32]. Talha et al. [30] suggested a permission-based malware identification system, which analyzed the permissions requested by an application to identify whether it is malicious. Similarly, Cen et al. [31] analyzed Android malware apps at the code level and trained a probabilistic classifier to predict whether an application is malicious. Alazab et al. [32] proposed permission and frequency analysis of API calls, and their approach can determine the similarities among malware families. They achieved an F-measure of 0.943 on a dataset of 27,891 apps. Besides, a typical malware detection approach is to convert the disassembled malware code into an image or graph representation, such as a grayscale image [33, 34], a control-flow graph (CFG), or a data flow graph (DFG), and then use machine learning algorithms to identify the malware. Specifically, DL methods are often more suitable for capturing the semantic knowledge within Android apps than traditional machine learning methods [35]. Considering the large size of datasets in practice, our method follows static analysis, which does not depend on the compiler or the execution environment of the program. Furthermore, we incorporate joint-feature vectors into the hybrid model and evaluate the model on public datasets. In this way, we can fully utilize the syntactic and semantic information of an Android app without dataset bias affecting the experimental results.
Therefore, we present a malware detection approach based on a hybrid graph neural network, which is designed for identifying and detecting Android malware. We extract the semantic information of an Android application at the source code and program levels. At the source code level, we propose an effective analysis method (see Section 3 for details) inspired by natural language processing (NLP) technology. At the program level, we use the Androguard tool to extract the typical features, permissions, and APIs. The Android applications are represented by combining these three semantic vectors to address the Android malware detection issue.
The main contributions of this paper are as follows:
(i) We propose a new automatic Android malware detection approach, HyGNN-Mal, which can accurately identify malware and its type via a hybrid graph neural network
(ii) We add row and column information to the source code sequence, which enhances the semantics of the Android application
(iii) We use an abstract syntax tree (AST) to represent the structure information of source code. To the best of our knowledge, this is the first work that uses AST to analyze Android applications
(iv) We verify the effectiveness of HyGNN-Mal on multiple real-world public datasets to avoid data bias, and the experimental results show superior accuracy compared to existing methods
The rest of this paper is organized as follows. Section 2 introduces the related work. Section 3 presents our approach in detail. Section 4 reports and analyzes the experimental results. Finally, Section 5 concludes the paper and discusses future trends.
2. Related Work
2.1. Deep_TNN (Deep Traversal Tree Neural Network)
The deep traversal tree neural network (Deep_TNN) is a graph model structure that we propose in this paper to traverse the AST. Unlike images (a set of pixels) and natural language (a sequence of words), graph data is more complex because a graph contains at least two types of information: the nodes and the edges. Hence, we build a suitable graph neural network architecture, Deep_TNN, which can be viewed as a variant of the graph convolutional network (GCN) [36].
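Suppose a graph $G = (V, E)$, where the vertex set is $V = \{v_1, v_2, \ldots, v_N\}$ and $N$ is the number of vertices. Let the number of features for each vertex be $D$; then the feature of vertex $v_i$ is a vector $x_i \in \mathbb{R}^{D}$, and the features of all vertices form a matrix $X \in \mathbb{R}^{N \times D}$. The edge set is $E = \{e_{ij}\}$, where $e_{ij}$ represents the relationship between vertices $v_i$ and $v_j$, and it forms an adjacency matrix $A \in \mathbb{R}^{N \times N}$. $X$ and $A$ are the input of the GCN, and the message transmission between GCN layers is as follows:

$$H^{(l+1)} = \sigma\left(\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} H^{(l)} W^{(l)}\right),$$

where $\tilde{A} = A + I$, $I$ is an identity matrix, and $\tilde{D}$ is the degree matrix of $\tilde{A}$. The diagonal elements of $\tilde{D}$ are the degrees of the vertices of $\tilde{A}$; for example, the degree of $v_i$ is the number of edges in $\tilde{A}$ connected to $v_i$. $H^{(l)}$ is the feature matrix of the $l$-th GCN layer (for the input layer, $H^{(0)} = X$), $W^{(l)}$ is the trainable weight matrix of that layer, and $\sigma$ is a nonlinear activation function.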
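As an illustration, the GCN propagation rule of [36], in which each vertex averages its own and its neighbours' features under a symmetric degree normalisation before a linear map and a nonlinearity, can be sketched in plain Python. This is a minimal sketch of a standard GCN layer, not of Deep_TNN itself; the function names, the toy path graph, the identity weight matrix, and the choice of ReLU as the activation are all assumptions made for the example.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def gcn_layer(adj, feats, weight):
    """One propagation step: ReLU(D^-1/2 (A + I) D^-1/2 H W)."""
    n = len(adj)
    # Add self-loops so every vertex keeps its own features.
    a_tilde = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    # Vertex degrees of the self-looped graph (row sums).
    deg = [sum(row) for row in a_tilde]
    # Symmetric normalisation of the adjacency matrix.
    a_hat = [[a_tilde[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]
    # Aggregate neighbour features, apply the weights, then ReLU.
    out = matmul(matmul(a_hat, feats), weight)
    return [[max(0.0, v) for v in row] for row in out]

# Toy path graph 0 - 1 - 2 with two features per vertex (hypothetical data).
adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weight = [[1.0, 0.0], [0.0, 1.0]]  # identity, so the output shows pure aggregation
hidden = gcn_layer(adj, feats, weight)
```

With the identity weight matrix, each row of `hidden` is simply the normalised sum of the features of a vertex and its neighbours, which makes the effect of the normalisation easy to inspect.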
2.2. Self-Attention Mechanism
Self-attention [37] is a feature extractor layer that allows the inputs to interact with each other and find out where they should pay more attention. Unlike an RNN, the self-attention mechanism can capture long-distance interdependence and can be computed in parallel. The self-attention mechanism is introduced below.
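Let $x_i \in \mathbb{R}^{d}$ be one input vector of self-attention and $y_i$ the corresponding output, where $d$ is the number of features of $x_i$. Each $x_i$ has three different vectors, query ($q_i$), key ($k_i$), and value ($v_i$), calculated as follows:

$$q_i = W^{Q} x_i, \quad k_i = W^{K} x_i, \quad v_i = W^{V} x_i,$$

where $W^{Q}$, $W^{K}$, and $W^{V}$ are the trainable parameter matrices. The output for each $x_i$ can be calculated as follows:

$$y_i = \sum_{j} \alpha_{ij} v_j, \qquad \alpha_{ij} = \operatorname{softmax}_j\left(\frac{q_i^{\top} k_j}{\sqrt{d_k}}\right),$$

where $d_k$ is the dimension of the key vectors, as in [37]. Yet, the weights $\alpha_{ij}$ of the self-attention mechanism depend only on the relativity between $q_i$ and $k_j$, ignoring the position information of $x_i$.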
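The position-blindness noted above can be made concrete with a small sketch of scaled dot-product self-attention [37] in plain Python. The function names and the toy parameter matrices are hypothetical. Because the attention weights depend only on query-key dot products, reversing the input sequence simply reverses the outputs: the mechanism by itself cannot tell token order apart.

```python
import math

def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    mx = max(xs)
    exps = [math.exp(x - mx) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(xs, w_q, w_k, w_v):
    """Scaled dot-product self-attention over a list of input vectors."""
    qs = [matvec(w_q, x) for x in xs]
    ks = [matvec(w_k, x) for x in xs]
    vs = [matvec(w_v, x) for x in xs]
    scale = math.sqrt(len(ks[0]))
    outputs = []
    for q in qs:
        # Attention weights of this query over all keys.
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in ks]
        weights = softmax(scores)
        # Output is the weighted sum of the value vectors.
        outputs.append([sum(w * v[d] for w, v in zip(weights, vs))
                        for d in range(len(vs[0]))])
    return outputs

# Hypothetical toy inputs and parameter matrices.
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w_q = [[1.0, 0.5], [0.0, 1.0]]
w_k = [[0.5, 0.0], [1.0, 1.0]]
w_v = [[1.0, 0.0], [0.0, 2.0]]
out = self_attention(xs, w_q, w_k, w_v)
out_reversed = self_attention(xs[::-1], w_q, w_k, w_v)
```

Comparing `out_reversed` with `out[::-1]` shows that the two agree up to rounding, which is why explicit position information has to be injected into the input when token order matters.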
2.3. Bidirectional Gated Recurrent Unit
The gated recurrent unit (GRU) is a variant of long short-term memory (LSTM). The main difference between them is that GRU uses an update gate and a reset gate, whereas LSTM uses a forget gate, an input gate, and an output gate [38–40]. The resulting GRU model is more concise and is effective in long-sequence applications. A bidirectional GRU (Bi-GRU) is a combination of two GRUs running in opposite directions.
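Let $x_t \in \mathbb{R}^{d}$ be an input vector of the Bi-GRU and $h_t$ the corresponding output, where $d$ is the number of features of $x_t$. Take one input at time $t$ for example:

$$z_t = \sigma(W_z x_t + U_z h_{t-1} + b_z),$$
$$r_t = \sigma(W_r x_t + U_r h_{t-1} + b_r),$$
$$\tilde{h}_t = \tanh(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h),$$
$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t,$$

where $z_t$ and $r_t$ represent the update gate and reset gate. $W_z$, $U_z$, $W_r$, $U_r$, $W_h$, and $U_h$ are all weight matrices. $b_z$, $b_r$, and $b_h$ are all bias vectors. $\tilde{h}_t$ is the current hidden layer information. $h_t$ is the final result. Here $\sigma$ is the sigmoid function and $\odot$ denotes element-wise multiplication.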
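A single GRU step and its bidirectional use can be sketched as follows. This is a minimal plain-Python sketch under stated assumptions: the parameter names (`W_z`, `U_z`, and so on), the update convention in which the update gate weights the candidate state, and the concatenation of forward and backward states are common choices rather than details confirmed by the text.

```python
import math

def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gru_step(x, h_prev, p):
    """One GRU update with update gate z and reset gate r."""
    z = [sigmoid(a + b + c) for a, b, c in
         zip(matvec(p["W_z"], x), matvec(p["U_z"], h_prev), p["b_z"])]
    r = [sigmoid(a + b + c) for a, b, c in
         zip(matvec(p["W_r"], x), matvec(p["U_r"], h_prev), p["b_r"])]
    # The reset gate decides how much of the previous state feeds the candidate.
    gated = [ri * hi for ri, hi in zip(r, h_prev)]
    h_cand = [math.tanh(a + b + c) for a, b, c in
              zip(matvec(p["W_h"], x), matvec(p["U_h"], gated), p["b_h"])]
    # The update gate interpolates between the previous state and the candidate.
    return [(1.0 - zi) * hp + zi * hc for zi, hp, hc in zip(z, h_prev, h_cand)]

def bi_gru(xs, fwd, bwd, hidden):
    """Run two GRUs in opposite directions and concatenate their states."""
    h, forward = [0.0] * hidden, []
    for x in xs:
        h = gru_step(x, h, fwd)
        forward.append(h)
    h, backward = [0.0] * hidden, []
    for x in reversed(xs):
        h = gru_step(x, h, bwd)
        backward.append(h)
    backward.reverse()
    return [f + b for f, b in zip(forward, backward)]  # list concatenation

# Hypothetical parameters: zero gates, identity input map for the candidate.
zeros = [[0.0, 0.0], [0.0, 0.0]]
params = {"W_z": zeros, "U_z": zeros, "b_z": [0.0, 0.0],
          "W_r": zeros, "U_r": zeros, "b_r": [0.0, 0.0],
          "W_h": [[1.0, 0.0], [0.0, 1.0]], "U_h": zeros, "b_h": [0.0, 0.0]}
first = gru_step([1.0, 0.0], [0.0, 0.0], params)
states = bi_gru([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], params, params, hidden=2)
```

Each element of `states` has twice the hidden size, since it holds the forward state followed by the backward state for the same time step.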
3. Approach
Figure 1 shows a typical data-driven workflow. It contains raw Android APK file collection, feature extraction and representation of the application, malware detection modelling, and model evaluation. Following this workflow, we illustrate the implementation of HyGNN-Mal in Figure 2.


3.1. Problem Definition
Given one Android app $a$, we assign it a label $y$ that indicates whether $a$ is benign or malicious, or even the type of malware, where $y$ takes values in the label set $Y$. Then, for a set of apps with known labels, we can build a training set $T = \{(a_i, y_i)\}$. We aim to train a deep learning model that learns a function $f$ mapping an Android app $a$ to a feature vector $f(a)$, so that for any Android app, based on a similarity score computed from $f(a)$, we can determine which label $a$ most likely belongs to.
3.2. Hybrid Model
The process of HyGNN-Mal is as follows. In the training stage, a set of labeled Android apps is processed by three separate parallel branches: self-attention, Deep_TNN, and Bi-GRU. We merge the Seq_vector, AST_vector, and Behavior_vector produced by these branches into one APP_vector to represent the Android app, and a multilayer perceptron (MLP) identifies whether the app is malicious. In the testing stage, we input a set of unlabeled Android apps to HyGNN-Mal; if an app is malicious, HyGNN-Mal returns its type; otherwise, it returns benign. The training and testing workflow is shown in Figure 2.
3.3. Feature Extraction
We extract the features of Android apps at both the source code and program levels. At the source code level, we obtain the source code of Android apps by reverse engineering the APK files. At the program level, we use the Androguard tool to extract the typical features, permissions, and APIs.
3.3.1. Source Code Analysis
An application (also called a program) contains multiple .java files, and each .java file contains multiple functions. We analyze the function-level source code in two ways, as a source code sequence and as its parsed AST, as shown in Figure 3.

On the one hand, we use the javalang package of Python to parse the function code into an abstract syntax tree (AST). According to the production rules in javalang, the AST nodes are divided into three types: string (terminal nodes), set (“Modifier”), and class name. Terminal nodes refer to the identifier tokens in the function code, and the nonterminal nodes represent the syntactic structure of the function code. In particular, we incorporate some edge types to exploit useful information in the AST. (1) Next_TNode: connects a terminal node to the next terminal node, which shows the order of the source code tokens. For example, the purple connection “Public-static-ForDisplay” in the red box (aa) of Figure 3 corresponds to the edges (Public and static) and (static and ForDisplay). (2) Next_BNode: connects a node to its next brother node (from left to right), which solves the problem that graph neural networks do not consider the order of nodes. For example, the green connections “Public-static” and “Modifier-ForDisplay” correspond to the edges (Public and static) and (Modifier and ForDisplay). (3) We build a directed graph based on the AST. The effect of a directed graph is similar to that of an undirected graph, but it has half as many edges, so time efficiency can be greatly improved. Besides, the directed graph can represent structural information more accurately. At last, we apply depth-first traversal to get the vertex set $V$ and edge set $E$ of the AST, where $V = \{v_1, v_2, \ldots, v_n\}$, $n$ is the number of function code tokens; $E = \{e_1, e_2, \ldots, e_m\}$, $m$ is the number of edges; $e_k = (v_s, v_e)$; and $v_s$ and $v_e$ indicate the start node and end node of each edge $e_k$. After embedding the vertices, we input them into the Deep_TNN shown in Figure 4 for feature extraction and finally get the vector of the AST, named AST_vector.
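The edge construction described above can be sketched on a hand-built toy tree. This is a minimal sketch rather than the authors' implementation: the tree is written directly as nested tuples instead of being parsed with javalang, and the function name `build_edges` and the `"Child"` edge label for ordinary parent-to-child edges are hypothetical. A depth-first traversal collects the vertices, the directed parent-to-child edges, the Next_BNode edges between adjacent siblings, and the Next_TNode edges between consecutive terminal nodes.

```python
def build_edges(tree):
    """Depth-first traversal returning vertex labels and typed directed edges."""
    vertices = []   # node labels in depth-first order
    edges = []      # (source_index, target_index, edge_type)
    terminals = []  # indices of terminal nodes in source order

    def visit(node):
        label, children = node
        idx = len(vertices)
        vertices.append(label)
        if not children:
            terminals.append(idx)
        child_ids = []
        for child in children:
            cid = visit(child)
            edges.append((idx, cid, "Child"))
            child_ids.append(cid)
        # Link each child to its next sibling, left to right.
        for left, right in zip(child_ids, child_ids[1:]):
            edges.append((left, right, "Next_BNode"))
        return idx

    visit(tree)
    # Link each terminal node to the next terminal node in token order.
    for a, b in zip(terminals, terminals[1:]):
        edges.append((a, b, "Next_TNode"))
    return vertices, edges

# Toy AST for the header "public static ... ForDisplay" (hand-built, hypothetical).
tree = ("MethodDeclaration", [
    ("Modifier", [("public", []), ("static", [])]),
    ("ForDisplay", []),
])
vertices, edges = build_edges(tree)
named = {(vertices[s], vertices[t], kind) for s, t, kind in edges}
```

On this toy tree the sketch produces the Next_TNode edges (public, static) and (static, ForDisplay) and the Next_BNode edges (public, static) and (Modifier, ForDisplay), matching the examples given for Figure 3.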

On the other hand, we add row and column position information to the source code, as shown in Figure 5. This addresses the shortcoming of the self-attention mechanism mentioned in Section 2.2. First, we convert the source code to a plain text sequence and assign a column position value to each token, as shown in the blue box. Second, we assign a row position value to each line of the source code, as shown in the red box. At last, we combine the embedding of the token lexical information and the two additional pieces of position information (row and column) into a feature vector and input it into the self-attention mechanism to train a new feature vector, named Seq_vector; the workflow is shown in Figure 6. For example, for the word “static,” the row position information is “0,” with embedding $e_{row}$. Meanwhile, the column position information is “1,” with embedding $e_{col}$. Moreover, the embedding of the lexical information is $e_{lex}$; thus, the final embedding of “static” combines $e_{lex}$, $e_{row}$, and $e_{col}$, and after the self-attention, we can get the vector of “static.”
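The assignment of row and column positions can be sketched as follows. This is only an illustration under two assumptions that the text does not confirm: tokens are separated by whitespace, and the column index restarts at 0 on every line. The function name `position_tokens` and the sample function body are hypothetical.

```python
def position_tokens(source):
    """Return (token, row, column) triples for whitespace-separated tokens."""
    triples = []
    for row, line in enumerate(source.splitlines()):
        # Column index counts tokens within the current line (an assumption).
        for col, token in enumerate(line.split()):
            triples.append((token, row, col))
    return triples

# Hypothetical function body, pre-tokenised with spaces for simplicity.
source = "public static void forDisplay ( ) {\n    int x = 0 ;\n}"
triples = position_tokens(source)
```

Under these assumptions the token "static" receives row 0 and column 1, which agrees with the example in the text; the row and column indices would then be looked up in their own embedding tables and combined with the lexical embedding.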


3.3.2. Program Analysis
The Android APK is analyzed with the Androguard tool, which yields the APK object, the DEX file object, and the analysis object. In most cases, malware applies for more permissions than normal software and tends to request a special set of permissions. APIs can also reflect malicious behaviors to a certain extent. We get the permission sequences from the APK object and the API sequences from the analysis object. Each APK is represented as {label, permissions list, API sequences list}, which is input into the Bi-GRU to get the behavior feature, Behavior_vector. The process is shown in Figure 7.
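Before a record of the form {label, permissions list, API sequences list} can be fed to a recurrent network, its string items have to be mapped to integer indices of a fixed length. The sketch below shows one common way to do this; it does not call Androguard, and the record contents, the API string format, the `<pad>` and `<unk>` tokens, and the function names are all assumptions made for illustration.

```python
def build_vocab(sequences):
    """Assign an integer id to every distinct item, reserving 0 and 1."""
    vocab = {"<pad>": 0, "<unk>": 1}
    for seq in sequences:
        for item in seq:
            vocab.setdefault(item, len(vocab))
    return vocab

def encode(seq, vocab, max_len):
    """Map items to ids, truncating or padding to max_len."""
    ids = [vocab.get(item, vocab["<unk>"]) for item in seq[:max_len]]
    return ids + [vocab["<pad>"]] * (max_len - len(ids))

# Hypothetical record for one APK.
record = {
    "label": 1,
    "permissions": ["android.permission.SEND_SMS", "android.permission.INTERNET"],
    "apis": ["SmsManager.sendTextMessage", "URL.openConnection"],
}
vocab = build_vocab([record["permissions"], record["apis"]])
perm_ids = encode(record["permissions"], vocab, max_len=4)
api_ids = encode(record["apis"], vocab, max_len=4)
```

Padding to a fixed length lets records with different numbers of permissions or API calls be batched together; items never seen when the vocabulary was built fall back to the `<unk>` id.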

Finally, by merging Seq_vector, AST_vector, and Behavior_vector, we get the feature vector $F$ of a function, where $F$ is a matrix in $\mathbb{R}^{k \times d}$; $k$ is the number of features, and $d$ is the dimension of each feature. In this way, we represent an application as a feature vector full of semantic information, APP_vector $= \{F_1, F_2, \ldots, F_m\}$, where $m$ is the number of functions in the application. HyGNN-Mal uses APP_vector to realize the automatic detection of Android malware through the fully connected layer.
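The final fusion and classification step can be sketched as follows. This is a minimal sketch under stated assumptions: the three branch outputs (Seq_vector, AST_vector, Behavior_vector) are taken to have the same dimension, they are stacked and flattened, and a single fully connected layer followed by a softmax stands in for the classifier. The function names, the weights, and the toy vectors are hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of logits."""
    mx = max(xs)
    exps = [math.exp(x - mx) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def classify(seq_vec, ast_vec, behavior_vec, weights, bias):
    """Fuse three branch vectors and apply one fully connected layer."""
    stacked = [seq_vec, ast_vec, behavior_vec]    # k x d matrix with k = 3
    flat = [v for row in stacked for v in row]    # flatten for the dense layer
    logits = [sum(w * x for w, x in zip(row, flat)) + b
              for row, b in zip(weights, bias)]
    probs = softmax(logits)
    # The predicted label is the class with the highest probability.
    return probs.index(max(probs)), probs

# Hypothetical two-class example with d = 2.
weights = [[0.0] * 6, [1.0] * 6]
bias = [0.0, 0.0]
label, probs = classify([1.0, 0.0], [0.0, 1.0], [1.0, 1.0], weights, bias)
```

For the multiclass setting, the same sketch applies with one row of `weights` per malware category.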
4. Experiments
4.1. Hardware Details
The hardware on which we implemented, trained, and tested our model included two Intel(R) Xeon(R) Gold 6154 CPU @ 3.00 GHz, 128 GB RAM, and two Titan XP GPUs. It was necessary to train on GPUs with 64 GB VRAM due to the large size of our model.
4.2. Datasets
In order to avoid bias in the experimental results caused by dataset construction, we select samples from multiple public data sources (see Table 1). In the binary malware classification study, we download 16,000 malicious applications from Drebin (DB) [41] and AndroZoo (AZ) (AppChina and Genome) [42] and the same number of benign applications from Google Play (GP) of AZ. In the multiclass classification study, we consider a new Android malware dataset, CICMalDroid (CICMD) 2020 [25], which intentionally spans five distinct categories (adware, banking malware, SMS malware, riskware, and benign), and download 15,000 samples from it. Each sample corresponds to an APK file, and only the APKs that remain after feature extraction are kept. Furthermore, to avoid potential biases introduced by our random sampling, we have removed duplicate samples to ensure that no samples overlap in our classification study.
In the following experimental tasks, we randomly split the combined data of the DB and AZ datasets into a training set and a test set with a ratio of 8 : 2. The same split is also applied to the CICMD dataset.
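Duplicate removal followed by a random 8 : 2 split can be sketched as follows. This is an illustrative sketch only: identifying duplicates by the SHA-256 digest of the file content, the fixed random seed, and the function names are assumptions, since the text does not state how duplicates were detected or how the split was seeded.

```python
import hashlib
import random

def dedupe(samples):
    """Keep the first sample for each distinct content hash."""
    seen = set()
    unique = []
    for name, content in samples:
        digest = hashlib.sha256(content).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append((name, content))
    return unique

def split_train_test(items, train_ratio=0.8, seed=0):
    """Shuffle reproducibly and cut the list at the given ratio."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical samples: eleven files, one of which is a byte-for-byte copy.
samples = [("app%d.apk" % i, bytes([i])) for i in range(10)]
samples.append(("copy_of_app0.apk", bytes([0])))
unique = dedupe(samples)
train, test = split_train_test(unique)
```

Removing duplicates before splitting guarantees that no identical file can appear in both the training set and the test set.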
4.3. Performance Measures
The metrics used to verify the effectiveness of the HyGNN-Mal model are accuracy, recall, precision, and F1 score. They are computed from the following counts:
(1) True positive (TP): the number of samples correctly classified as malicious
(2) False positive (FP): the number of samples wrongly classified as malicious
(3) True negative (TN): the number of samples correctly classified as benign
(4) False negative (FN): the number of samples wrongly classified as benign
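The four metrics follow from these counts through their standard definitions: accuracy is (TP + TN) / (TP + FP + TN + FN), precision is TP / (TP + FP), recall is TP / (TP + FN), and the F1 score is the harmonic mean of precision and recall. A small sketch, with hypothetical function names and toy labels in which 1 denotes malicious and 0 denotes benign:

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, TN, FN for a binary task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    return tp, fp, tn, fn

def metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall, and F1 from the confusion counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, precision, recall, f1

# Toy labels (hypothetical).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
counts = confusion_counts(y_true, y_pred)
scores = metrics(*counts)
```

The guards against zero denominators return 0.0 when a class is never predicted or never present, which is one common convention among several.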
4.4. Results
We optimize the performance of HyGNN-Mal by tuning variable hyperparameters. The final parameter settings used for all experiments are shown in Table 2. The Deep-TNN layer means the number of graph convolution layers. The Deep-TNN filter indicates the number of output channels in a graph convolution layer. The dropout rate is used to overcome overfitting by randomly excluding nodes during training. The batch size represents the number of samples selected for one training. The epoch means the number of iterations for the model training.
For the purpose of verifying the effectiveness of HyGNN-Mal, we conduct a number of experiments. These experiments answer the following research questions.
RQ1. What is the performance of HyGNN-Mal compared to the other well-performing methods?
RQ2. Why does the proposed HyGNN-Mal model perform well?
RQ3. What is the impact of source code level analysis on malicious code detection?
4.4.1. Comparison with Different Methods
We compare our method with four well-performing methods, and the experimental results are shown in Table 3. We choose these benchmarks because the datasets or models they use are very similar to ours, which makes the comparison suitable for highlighting the effectiveness of HyGNN-Mal.
Marcheggiani and Titov [43] showed that bidirectional LSTM (Bi-LSTM) and syntax-based GCNs have complementary modeling power when GCNs are stacked on top of LSTM layers. Haq et al. [44] leveraged a convolutional neural network (CNN) and Bi-LSTM to efficiently identify persistent malware. They conducted comparative experiments on hybrid DL-driven architectures and DL benchmarks using publicly available datasets. Pei et al. [11] proposed AMalNet for Android malware detection and family attribution. They applied GCNs and an independently recurrent neural network (IndRNN) to fully consider the semantic distribution information of malware, such as character, word, and lexical features. The above methods performed well, but they still fall short in application semantic extraction. We implemented these models (Bi-LSTM+GCNs, CNN+Bi-LSTM, and GCNs+IndRNN) on the features described in Section 3.3. Mirzaei et al. [45] used a fast graph-mining algorithm to extract ensembles of API calls and modeled them as Markov chains to construct the feature vectors for apps. They also utilized machine learning algorithms for classification. Furthermore, the authors of [45] have released their source code, so we can easily reproduce their method on our dataset.
As Table 3 shows, the results of [11, 43] are close to those of our method. However, they ignore the structural information of source code, whereas we use the AST and the position information of source code to represent it. We also consider the behavioral characteristics of malware, permissions and APIs. The comparison indicates that the proposed HyGNN-Mal surpasses previous efforts.
4.4.2. Ablation Study of Different Models
The selection of models is important in the experiment. Next, we demonstrate why the HyGNN-Mal model performs well through model ablation experiments. We verify the performance of different model combinations, as shown in Table 4. Considering the effectiveness of GCN in dealing with graph structure, we choose it to learn the structural information of the source code. We found that the performance of the Bi-LSTM+GCNs model in Table 3 is similar to that of the Bi-GRU+GCNs model in Table 4. However, Bi-GRU takes up fewer computational resources than Bi-LSTM, so we choose Bi-GRU to process sequence information. To handle long source code sequences, we use self-attention. As expected, the HyGNN-Mal model performs best compared with the other model combinations.
To further assess the capabilities of our approach versus the baselines, we compute the receiver operating characteristic (ROC) curves of HyGNN-Mal, shown in Figure 8. For the binary classification task, the AUC value is about 0.9998. In the five-class classification task, the AUC values of the categories are very close. The AUC value of adware is the highest at 0.9999, which is close to 1. These results show that HyGNN-Mal is effective for the Android malware classification task.

Figure 8: ROC curves of HyGNN-Mal. (a) Binary classification task. (b) Five-class classification task.
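The area under the ROC curve (AUC) has a simple interpretation: it equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below computes it directly from that definition; the function name and the toy scores are hypothetical, and for a multiclass task the same computation would be applied to each category in a one-vs-rest fashion, which is an assumption about how per-category AUC values are usually obtained.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of positive-negative pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))

# Toy example: three of the four positive-negative pairs are ordered correctly.
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

The pairwise loop is quadratic in the number of samples, which is fine for illustration; rank-based formulas give the same value more efficiently on large test sets.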
4.4.3. The Effect of Source Code Features
To demonstrate the importance of source code analysis in HyGNN-Mal, we compare different feature combinations and observe their performance on malware detection. The experimental results are shown in Table 5.
In Table 5, we find that malware detection improves when source code semantic information is added to the P+API combination. This reflects that source-code-level analysis produces an information gain for the application semantics. Besides, when only one feature is used, AST performs the worst, because it considers only the structural information of apps and ignores lexical information, and different structures may have the same behavior. For example, “for” and “while” code segments have different structures but both represent loop operations. When we further add the source code lexical information, S, the effect is clearly enhanced, because the combination of sequence and structure information characterizes the applications more comprehensively. Finally, we supplement the behavioral features P+APIs to enhance the semantic information of applications, which yields the best result. This illustrates that the performance of HyGNN-Mal is significantly improved by using the multiple feature vectors we propose.
5. Conclusion and Discussion
In this paper, we propose a novel approach to the automatic detection of Android malware via a hybrid graph neural network, HyGNN-Mal. First, we analyze Android apps at the source code level, using the AST to represent structural information and the token sequence to represent syntactic information. Second, we propose the Deep_TNN model to process the AST and utilize the self-attention mechanism to process the source code sequences augmented with row and column position information. In addition, we use the Bi-GRU to handle permission and API features. Finally, a series of experiments is conducted to demonstrate the feasibility and effectiveness of HyGNN-Mal.
With the increase in Android malware variants, the challenges of automatic malware detection are as follows: (1) Small and even outdated datasets lead to inaccurate malware detection. (2) There is a lack of consensus on labeling malware. (3) The time cost and the large storage space required for malicious samples are still challenges that need to be considered. Based on our study, we also summarize the future directions:
(i) With the widespread application of DL, the combination of DL and source code graph analysis in Android malware detection is noteworthy
(ii) Taking the app descriptions as comments on the purpose of the permissions used, NLP techniques will be a potential future direction for detecting malware
(iii) Given that few malicious code samples are currently available, transfer learning will be a good choice
(iv) Compared with hybrid analysis methods, research on which apps are suitable for static analysis and which are suitable for dynamic analysis is more promising. It will greatly save resources and improve the efficiency of feature extraction
In our next work, we consider using DL technology to process apps suitable for dynamic analysis and NLP technology to process apps suitable for static analysis, for achieving a more accurate classification of families and even striving to effectively detect the newly generated malware.
Data Availability
Data sharing is not applicable to this article as no new data were created or analyzed in this study.
Conflicts of Interest
The authors declare that they have no conflicts of interest.