Automatic Detection of Android Malware via Hybrid Graph Neural Network

Zhang, Chunyan; Zhou, Qinglei; Huang, Yizhao; Tang, Ke; Gui, Hairen; Liu, Fudong

doi:https://doi.org/10.1155/2022/7245403

Wireless Communications and Mobile Computing

Research Article | Open Access

Volume 2022 | Article ID 7245403 | https://doi.org/10.1155/2022/7245403

Academic Editor: Lei Zhang

Received 07 Mar 2022

Accepted 19 Apr 2022

Published 11 May 2022

Abstract

Automatic malware detection aims to determine, with automated systems, whether an application is malicious. Android malware attacks have accelerated rapidly owing to the widespread use of mobile devices. Although significant progress has been made in antimalware techniques, existing methods mainly rely on program features and ignore the importance of source code analysis. Furthermore, dynamic analysis suffers from low code coverage and poor efficiency. Hence, we propose an automatic Android malware detection approach, named HyGNN-Mal. It analyzes Android applications at the source code level by exploiting sequence and structure information. Meanwhile, we incorporate the typical static features, permissions and APIs. In HyGNN-Mal, we utilize a deep traversal tree neural network (Deep-TNN) to process the code structure information. In particular, we add position information to the code sequence information before feeding it into the self-attention mechanism. Evaluations conducted on multiple public datasets indicate that our method can accurately identify and classify malicious software, achieving best accuracies of 99.62% and 99.2% on the two tasks, respectively.

1. Introduction

Android, the most popular mobile operating system, has attracted millions of users around the world. The market share of Android smartphones was 70.69 percent in April 2020 [1], and there were no signs of a decrease in the near future [2]. The global dominance of the Android operating system and the rich data stored on smartphones make Android users an attractive target for cyberattackers. Kaspersky’s report on the evolution of mobile malware (short for malicious software) showed that about 3.5 million malicious installation packages appeared in 2019 [3] and that the number of new Android malware samples in 2020 grew by 2.1 million, breaking the established trend of continuous decline [4]. Malware developers usually use obfuscation methods to generate malware variants, which leads to misjudgments and omissions in detection results. Therefore, it is an urgent and important task for researchers to study the accurate automatic detection of Android-based malware.

Malware detection techniques mainly fall into three approaches: (1) Static analysis obtains Android features by scanning the decompiled Android package’s files, permissions, source code, and API calls. (2) Dynamic analysis observes and selects features from the system calls, read and write operations, and network data when running the application in a closely monitored virtual environment. (3) Hybrid analysis combines dynamic and static analysis. Many existing Android malware detection approaches have reported excellent results [5–13]. For example, Kim et al. [5] built malware detection models based on seven kinds of static features such as string, opcode, and API features, while Fan et al. [6] used subgraphs as sample features for Android malware familial classification. Many significant studies [14–21] detect Android malware through hybrid analysis. For example, Wong and Lie [21] proposed a hybrid system named IntelliDroid, a generic Android input generator for the analysis of any Android application. Dynamic analysis is effective against all types of malware. Tam et al. [22] built a dynamic analysis system, CopperDroid, to extract interactions between the app and the operating system, as well as interprocess and in-process communications. CopperDroid successfully disclosed more than 60% of the malicious behaviors of 1365 malware samples. The method has a limitation: it can only automatically check parameters of methods contained in the interface that owns the AIDL file [23]. Alzaylaee et al. [24] adopted a DL solution via dynamic analysis to detect malicious Android apps; the method used a real Android device and 31,125 apps for the experiments and achieved an accuracy of 0.952.

Currently, most previous studies ignore in-depth analysis of the application, such as the structure information, the interdependency of the program, and the time complexity. In particular, dynamic analysis is accurate but time-consuming and labor-intensive [25]. Hybrid analysis methods can generate a better overall picture of the program but are more complex and time-consuming [26]. The application of deep learning (DL) techniques to malware detection is effective [27–32]. Talha et al. [30] suggested a permission-based malware identification system, which analyzed the permissions requested by the application to identify whether it is malicious. Similarly, Cen et al. [31] analyzed Android malware apps at the code level and trained a probabilistic classifier to predict whether an application is malicious. Alazab et al. [32] proposed permission and frequency analysis of API calls, and their approach can determine the similarities among malware families. They achieved an F-measure of 0.943 on a dataset of 27,891 apps. Besides, a typical malware detection approach is to convert the disassembled malware code into a graph representation, such as a grayscale image [33, 34], control-flow graph (CFG), or data flow graph (DFG), and then use machine learning algorithms to identify the malware. Specifically, DL methods are often more suitable for capturing the semantic knowledge within Android apps than traditional machine learning methods [35]. Considering the large number of datasets in reality, our method follows static analysis, which does not depend on the compiler or the execution environment of the program. Furthermore, we incorporate joint-feature vectors into the hybrid model and evaluate the model on public datasets. In this way, we can fully utilize the syntactic and semantic information of an Android app without dataset bias affecting the experimental result.

Therefore, we present a malware detection approach based on a hybrid graph neural network, which is designed for identifying and detecting Android malware. We extract semantic information of an Android application at the source code and program levels. At the source code level, we propose an effective analysis method (see Section 3 for details) inspired by natural language processing (NLP) technology. At the program level, we use the Androguard tool to extract the typical features, permissions and APIs. The Android applications are represented by combining these three semantic vectors to address the Android malware detection issue.

The main contributions of this paper are as follows:
(i) We propose a new automatic Android malware detection approach, HyGNN-Mal, which can accurately identify the malware and its type via a hybrid graph neural network
(ii) We add the row and column information to the source code sequence, which enhances the semantics of the Android application
(iii) We use the abstract syntax tree (AST) to represent the structure information of source code. To the best of our knowledge, this is the first work that uses AST to analyze Android applications
(iv) We verify the effectiveness of HyGNN-Mal on multiple real-world public datasets to avoid data bias, and the experimental results show superior accuracy compared to existing methods

The rest of this paper is organized as follows. Section 2 introduces the related work. Section 3 presents our approach in detail. Section 4 reports and analyzes experimental results. Finally, Section 5 gives the paper conclusion and future trend discussion.

2. Related Work

2.1. Deep_TNN (Deep Traversal Tree Neural Network)

The deep traversal tree neural network (Deep_TNN) is the graph model structure that we propose in this paper to traverse the AST. Unlike images (a set of pixels) and natural language (a sequence of words), graph data is more complex because a graph carries at least two types of information: the nodes and the edges. Hence, we build a suitable graph neural network architecture, Deep_TNN, which can be viewed as a variant of the graph convolutional network (GCN) [36].

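Suppose a graph $G = (V, E)$, where the vertex set is $V = \{v_1, v_2, \dots, v_N\}$ and $N$ is the number of vertices. Let the number of features for each vertex be $d$; then the feature of vertex $v_i$ is a vector $x_i \in \mathbb{R}^{d}$, and the features of all vertices form a matrix $X \in \mathbb{R}^{N \times d}$. The edge set $E$ represents the relationships among vertices and forms an adjacency matrix $A \in \mathbb{R}^{N \times N}$. $X$ and $A$ are the input of GCN, and the message transmission between GCN layers is as follows:

$$H^{(l+1)} = \sigma\left(\tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} H^{(l)} W^{(l)}\right),$$

where $\tilde{A} = A + I$, $I$ is an identity matrix, and $\tilde{D}$ is the degree matrix of $\tilde{A}$. The diagonal elements of $\tilde{D}$ are the degrees of each vertex of $G$; for example, the degree of $v_i$ is the number of edges in $E$ connected to $v_i$. $H^{(l)}$ is the feature matrix of the $l$-th GCN layer; for the input layer, $H^{(0)}$ is $X$. $W^{(l)}$ is the trainable weight matrix of the $l$-th layer, and $\sigma$ is a nonlinear activation function.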
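To make the layer-wise propagation concrete, the following is a minimal sketch of one GCN layer in the standard form of [36], that is, an activation applied to the symmetrically normalised adjacency matrix (with self-loops) times the features times a weight matrix. It is not the authors' Deep_TNN implementation: the function names `matmul` and `gcn_layer`, the use of ReLU as the activation, and the toy undirected path graph are all assumptions made for illustration.

```python
import math


def matmul(a, b):
    """Multiply two dense matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def gcn_layer(adj, feats, weight):
    """One propagation step: ReLU(D^-1/2 (A + I) D^-1/2 H W)."""
    n = len(adj)
    # Add self-loops so each vertex keeps its own features: A_tilde = A + I.
    a_tilde = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
               for i in range(n)]
    # Diagonal of the degree matrix of A_tilde.
    deg = [sum(row) for row in a_tilde]
    # Symmetric normalisation of the adjacency matrix.
    a_hat = [[a_tilde[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
             for i in range(n)]
    out = matmul(matmul(a_hat, feats), weight)
    # ReLU is assumed here as the nonlinear activation.
    return [[max(0.0, v) for v in row] for row in out]


# Toy undirected path graph 0 - 1 - 2 with one feature per vertex (assumed data).
adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
feats = [[1.0], [2.0], [3.0]]
weight = [[1.0]]
hidden = gcn_layer(adj, feats, weight)
```

Each output row mixes a vertex's own feature with those of its neighbours, weighted by the inverse square roots of their degrees; stacking several such layers lets information travel along longer paths in the graph.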

2.2. Self-Attention Mechanism

Self-attention [37] is a feature extractor layer that allows the inputs to interact with each other and find the areas to which they should pay more attention. Unlike an RNN, the self-attention mechanism can capture long-distance interdependence and can be computed in parallel. Here is an introduction to the self-attention mechanism.

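Let $X = (x_1, x_2, \dots, x_n)$ be an input of self-attention with corresponding output $Z = (z_1, z_2, \dots, z_n)$, where $n$ is the number of features of $X$. Each $x_i$ has three different vectors, query (Q), key (K), and value (V), and the calculation formulas are as follows:

$$q_i = x_i W^{Q}, \quad k_i = x_i W^{K}, \quad v_i = x_i W^{V},$$

where $W^{Q}$, $W^{K}$, and $W^{V}$ are the trainable parameter matrices. The output of each $x_i$ can be calculated as follows:

$$z_i = \sum_{j=1}^{n} \alpha_{ij} v_j, \qquad \alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{l=1}^{n} \exp(e_{il})}, \qquad e_{ij} = \frac{q_i k_j^{\top}}{\sqrt{d_k}},$$

where $d_k$ is the dimension of the key vectors. Yet the weights of the self-attention mechanism depend only on the relativity between $x_i$ and $x_j$, ignoring the position information of $x_i$ and $x_j$.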
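The position-blindness noted above can be checked numerically. The sketch below implements scaled dot-product self-attention in the standard form of [37] (queries, keys, and values obtained by multiplying the input with three parameter matrices, softmax over scaled query-key products) and shows that reversing the input order merely reverses the output order. The function names, the identity parameter matrices, and the toy input are assumptions for illustration, not the authors' code.

```python
import math


def matmul(a, b):
    """Multiply two dense matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x, w_q, w_k, w_v):
    """Return (outputs, attention weights) for input rows x."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(k[0])
    # Scaled dot product between every query and every key.
    scores = [[sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d_k) for kj in k]
              for qi in q]
    weights = [softmax(r) for r in scores]
    # Each output is a weighted sum of the value vectors.
    return matmul(weights, v), weights


# Toy input of three tokens with two features each (assumed data).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
eye = [[1.0, 0.0], [0.0, 1.0]]  # identity parameter matrices, for illustration
out, attn = self_attention(x, eye, eye, eye)
out_rev, _ = self_attention(x[::-1], eye, eye, eye)
```

Because `out_rev` is just `out` in reverse order, the layer cannot tell which token came first unless position information is added to the input.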

2.3. Bidirectional Gated Recurrent Unit

GRU is a variant of long short-term memory (LSTM). The main difference between LSTM and GRU lies in that GRU uses an update gate and a reset gate and LSTM uses forget gate, input gate, and output gate [38–40]. The final GRU model is more concise and effective in long-sequence applications. A bidirectional GRU (Bi-GRU) is a combination of two GRUs in opposite directions.

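Let $X = (x_1, x_2, \dots, x_n)$ be an input of Bi-GRU with corresponding output $H = (h_1, h_2, \dots, h_n)$, where $n$ is the number of features of $X$. Taking one input $x_t$ at time $t$ as an example,

$$z_t = \sigma(W_z x_t + U_z h_{t-1} + b_z),$$
$$r_t = \sigma(W_r x_t + U_r h_{t-1} + b_r),$$
$$\tilde{h}_t = \tanh(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h),$$
$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t,$$

where $z_t$ and $r_t$ represent the update gate and reset gate. $W_z$, $W_r$, $W_h$, $U_z$, $U_r$, and $U_h$ are all weight matrices. $b_z$, $b_r$, and $b_h$ are all bias vectors. $\tilde{h}_t$ is the current hidden layer information. $h_t$ is the final result.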
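As an illustration of how the two gates interact, here is a minimal sketch of one GRU step and of a bidirectional pass that concatenates the forward and backward hidden states. It assumes the common GRU formulation in which the update gate interpolates between the previous state and a tanh candidate state; the parameter names (`W_z`, `U_z`, `b_z`, and so on), the helper `zero_params`, and the toy values are hypothetical and chosen only so the result is easy to verify by hand.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def gate(act, w, u, b, x, h):
    """Apply act(W x + U h + b) element-wise."""
    return [act(p + q + r) for p, q, r in zip(matvec(w, x), matvec(u, h), b)]


def gru_step(x_t, h_prev, p):
    z = gate(sigmoid, p["W_z"], p["U_z"], p["b_z"], x_t, h_prev)  # update gate
    r = gate(sigmoid, p["W_r"], p["U_r"], p["b_r"], x_t, h_prev)  # reset gate
    reset_h = [ri * hi for ri, hi in zip(r, h_prev)]
    cand = gate(math.tanh, p["W_h"], p["U_h"], p["b_h"], x_t, reset_h)
    # Interpolate between the previous state and the candidate state.
    return [(1 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h_prev, cand)]


def bi_gru(seq, p_fwd, p_bwd, hidden):
    """Run two GRUs in opposite directions and concatenate their states."""
    h, fwd = [0.0] * hidden, []
    for x_t in seq:
        h = gru_step(x_t, h, p_fwd)
        fwd.append(h)
    h, bwd = [0.0] * hidden, []
    for x_t in reversed(seq):
        h = gru_step(x_t, h, p_bwd)
        bwd.append(h)
    bwd.reverse()
    return [f + b for f, b in zip(fwd, bwd)]


def zero_params(in_dim, hidden):
    """Hypothetical helper: all-zero GRU parameters of the right shapes."""
    p = {}
    for g in ("z", "r", "h"):
        p["W_" + g] = [[0.0] * in_dim for _ in range(hidden)]
        p["U_" + g] = [[0.0] * hidden for _ in range(hidden)]
        p["b_" + g] = [0.0] * hidden
    return p


params = zero_params(1, 1)
params["b_h"] = [1.0]  # candidate state becomes tanh(1) at every step
states = bi_gru([[0.3], [0.7]], params, params, 1)
```

With these toy parameters both gates equal 0.5, so each step moves the hidden state halfway toward tanh(1); the forward and backward halves of each output therefore mirror each other.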

3. Approach

Figure 1 shows a typical data-driven workflow. It contains raw Android APK file collection, feature extraction and representation of the application, malware detection modelling, and model evaluation. According to this workflow, we illustrate the implementation of HyGNN-Mal in Figure 2.

3.1. Problem Definition

Given one Android app $a$, we assign it a constant label $y \in Y$ that indicates whether $a$ is benign or malicious and, if malicious, the type of malware, where $Y$ is the set of labels. Then, for a set of apps with known labels, we can build a training set $D = \{(a_i, y_i)\}$. We aim to train a deep learning model to learn a function $f$ that maps an Android app $a$ to a feature vector $f(a)$, so that for any Android app, based on the similarity score computed from $f(a)$, we can determine which label $a$ most likely belongs to.

3.2. Hybrid Model

The process of HyGNN-Mal is as follows. In the training stage, a set of Android apps with labels are processed by three separate parallel branches, self-attention, Deep_TNN, and Bi-GRU. We merge the Seq_vector, AST_vector, and Behavior_vector into one APP_vector to represent the Android app and identify whether the app is malicious or not by MLP. In the testing stage, we input a set of Android apps without labels to HyGNN-Mal; if the app is malicious, return its type, else return benign. The training and testing workflow is shown in Figure 2.

3.3. Feature Extraction

We extract the features of Android apps at both the source code and program levels. At the source code level, we obtain the source code of Android apps by reverse engineering the APK files. At the program level, we use the Androguard tool to extract the typical features, permissions and APIs.

3.3.1. Source Code Analysis

As we all know, one application (also named program) contains multiple .java files, while each .java file contains multiple functions. We analyze the function-level source code in two ways, source code sequence and its parsed AST, as shown in Figure 3.

On the one hand, we use the javalang package of Python to parse the function code into an abstract syntax tree (AST). According to the production rules in javalang, the AST nodes are divided into three types: string (terminal nodes), set (“Modifier”), and class name. Terminal nodes refer to the identifier tokens in the function code, and the nonterminal nodes represent the syntactic structure of the function code. In particular, we incorporate some edge types to exploit useful information of the AST. (1) Next_TNode: connects a terminal node to the next terminal node, which shows the order of the tokens in the function code. For example, the purple connection “Public-static-ForDisplay” in the red box (aa) of Figure 3 corresponds to the edges (Public and static) and (static and ForDisplay). (2) Next_BNode: connects a node to its next brother node (from left to right), which addresses the problem that graph neural networks do not consider the order of nodes. For example, the green connections “Public-static” and “Modifier-ForDisplay” correspond to the edges (Public and static) and (Modifier and ForDisplay). (3) We build a directed graph based on the AST. The effect of a directed graph is similar to that of an undirected graph, but it has half as many edges, so time efficiency can be greatly improved. Besides, the directed graph can represent structural information more accurately. Finally, we apply depth-first traversal to get the vertex set $V$ and edge set $E$ of the AST, where $V = \{v_1, \dots, v_n\}$ and $n$ is the number of function code tokens; $E = \{e_1, \dots, e_m\}$ and $m$ is the number of edges; $e_k = (v_s, v_e)$; and $v_s$ and $v_e$ indicate the start node and end node of each edge $e_k$. After embedding the graph, we input it into Deep_TNN shown in Figure 4 for feature extraction and finally get the vector of the function code, named AST_vector.
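The edge construction can be sketched on a toy tree that mirrors the “Public-static-ForDisplay” example. The code below does not call javalang; it assumes the AST has already been converted into nested `(label, children)` tuples, and the function name `build_edges` and the root label `MethodDeclaration` are hypothetical. A depth-first traversal collects the vertices, the parent-to-child edges, the Next_TNode edges between consecutive terminal nodes, and the Next_BNode edges between consecutive sibling nodes.

```python
def build_edges(tree):
    """Depth-first traversal returning vertices and three directed edge lists."""
    nodes, child_edges, next_b, terminals = [], [], [], []

    def visit(node):
        label, children = node
        idx = len(nodes)
        nodes.append(label)
        if not children:
            terminals.append(idx)  # leaves are the terminal nodes
        prev = None
        for child in children:
            c_idx = visit(child)
            child_edges.append((idx, c_idx))
            if prev is not None:
                next_b.append((prev, c_idx))  # left sibling -> right sibling
            prev = c_idx
        return idx

    visit(tree)
    # Consecutive terminals in traversal order follow the token order.
    next_t = list(zip(terminals, terminals[1:]))
    return nodes, child_edges, next_t, next_b


# Toy AST for a method header "public static ... ForDisplay" (assumed shape).
tree = ("MethodDeclaration", [
    ("Modifier", [("public", []), ("static", [])]),
    ("ForDisplay", []),
])
nodes, child_edges, next_t, next_b = build_edges(tree)
named_t = [(nodes[a], nodes[b]) for a, b in next_t]
named_b = [(nodes[a], nodes[b]) for a, b in next_b]
print(named_t)
print(named_b)
```

All edges are stored once, as directed pairs from start node to end node, which is what keeps the edge count at half of the equivalent undirected representation.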

On the other hand, we add row and column position information to the source code, as shown in Figure 5. This corrects the shortcoming of the self-attention mechanism mentioned in Section 2.2. First, we convert the source code to a plain text sequence and assign a column position value to each token, as shown in the blue box. Second, we assign a row position value to each line of the source code, as shown in the red box. Finally, we combine the embedding of the token lexical information and the two additional pieces of position information (row and column) into a feature vector and input it to the self-attention mechanism to train a new feature vector, named Seq_vector; the workflow is shown in Figure 6. For example, for the word “static,” the row position information is “0,” with embedding $e_{row}$; the column position information is “1,” with embedding $e_{col}$; and the lexical information has embedding $e_{lex}$. The final embedding of “static” combines $e_{lex}$, $e_{row}$, and $e_{col}$, and after the self-attention we get the vector of “static.”
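The position assignment can be sketched as follows. This is a simplified illustration rather than the authors' tokenizer: it assumes whitespace tokenisation, zero-based indices, and a column counter that restarts on every line, and the function name `token_positions` and the sample snippet are hypothetical. Under these assumptions the token "static" on the first line receives row 0 and column 1, matching the example in the text.

```python
def token_positions(source):
    """Return (token, row, column) triples for whitespace-separated tokens."""
    triples = []
    for row, line in enumerate(source.splitlines()):
        # Column index counts tokens within the line, starting at 0.
        for col, token in enumerate(line.split()):
            triples.append((token, row, col))
    return triples


snippet = "public static void ForDisplay ( ) {\n    int x = 0 ;\n}"
positions = token_positions(snippet)
for token, row, col in positions[:4]:
    print(token, row, col)
```

Each triple would then be mapped to three embeddings (lexical, row, column) that are combined into one input vector for the self-attention layer; how they are combined is left open here.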

3.3.2. Program Analysis

We analyze the Android APK with the Androguard tool, which yields the APK object, the DEX file object, and the analysis object. In most cases, malware applies for more permissions than normal software and tends to request a special set of permissions. APIs can effectively reflect malicious behaviors to a certain extent. We get the permission sequences from the APK object and the API sequences from the analysis object. Each APK is represented as {label, permissions list, API sequences list}, which is input into the Bi-GRU to get the behavior feature, Behavior_vector. The process is shown in Figure 7.

Finally, by merging Seq_vector, AST_vector, and Behavior_vector, we get the feature vector APP_vector, which is a matrix of size $k \times d$, where $k$ is the number of features and $d$ is the dimension of each feature. In this way, we represent an application as a feature vector full of semantic information that covers all $M$ functions in the application, where $M$ is the number of functions. HyGNN-Mal uses APP_vector to realize the automatic detection of Android malware through the fully connected layer.

4. Experiments

4.1. Hardware Details

The hardware on which we implemented, trained, and tested our model included two Intel(R) Xeon(R) Gold 6154 CPU @ 3.00 GHz, 128 GB RAM, and two Titan XP GPUs. It was necessary to train on GPUs with 64 GB VRAM due to the large size of our model.

4.2. Datasets

To avoid bias in the experimental results caused by dataset construction, we select samples from multiple public data sources (see Table 1). In the binary malware classification study, we download 16,000 malicious applications from Drebin (DB) [41] and AndroZoo (AZ) (AppChina and Genome) [42] and the same number of benign applications from the Google Play (GP) portion of AZ. In the multiclass study, we consider a new Android malware dataset, CICMalDroid (CICMD) 2020 [25], which intentionally spans five distinct categories (adware, banking malware, SMS malware, riskware, and benign), and download 15,000 samples from it. Each sample corresponds to an APK file; Table 1 lists the Android APKs that remain after feature extraction. Furthermore, to avoid potential biases introduced by our random sampling, we remove duplicate samples to ensure that no samples overlap in our classification study.

In the following experimental tasks, we randomly split the total data of the DB and AZ datasets into a training set and a test set with a ratio of 8 : 2. The same split is also applied to the CICMD dataset.
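A minimal sketch of such a split, assuming each sample is identified by a hashable key such as an APK hash: duplicates are dropped first so that no sample can appear in both sets, and the remaining samples are shuffled with a fixed seed before being cut at 80%. The function name `split_dataset`, the seed, and the toy identifiers are assumptions, not details taken from the experiments.

```python
import random


def split_dataset(samples, train_ratio=0.8, seed=0):
    """Deduplicate, shuffle reproducibly, and split into train and test lists."""
    # dict.fromkeys keeps the first occurrence of each sample, in order.
    unique = list(dict.fromkeys(samples))
    random.Random(seed).shuffle(unique)
    cut = int(round(len(unique) * train_ratio))
    return unique[:cut], unique[cut:]


# Toy sample identifiers with two duplicates (assumed data).
apks = ["apk%02d" % i for i in range(10)] + ["apk03", "apk07"]
train, test = split_dataset(apks)
print(len(train), len(test))
```

Fixing the seed makes the partition reproducible across runs, which helps when several model combinations are compared on the same split.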

4.3. Performance Measures

Accuracy, recall, precision, and F1 score are used to verify the effectiveness of the HyGNN-Mal model. They are computed from the following counts:
(1) True positive (TP): the number of samples correctly classified as malicious
(2) False positive (FP): the number of samples wrongly classified as malicious
(3) True negative (TN): the number of samples correctly classified as benign
(4) False negative (FN): the number of samples wrongly classified as benign
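For reference, the four metrics follow from these counts by their standard definitions: accuracy is (TP + TN) / (TP + FP + TN + FN), precision is TP / (TP + FP), recall is TP / (TP + FN), and F1 is the harmonic mean of precision and recall. The sketch below computes them for hypothetical counts; the function name and the numbers are illustrative only and are not results from the paper.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute accuracy, precision, recall, and F1 from confusion counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts, chosen only to exercise the formulas.
metrics = classification_metrics(tp=90, fp=10, tn=85, fn=15)
print(metrics)
```

The guards against zero denominators matter for rare classes in the multiclass setting, where a category may receive no predictions at all.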

4.4. Results

We optimize the performance of HyGNN-Mal by tuning its hyperparameters. The final parameter settings used for all experiments are shown in Table 2. The Deep-TNN layer is the number of graph convolution layers. The Deep-TNN filter is the number of output channels in a graph convolution layer. The dropout rate is used to reduce overfitting by randomly excluding nodes during training. The batch size is the number of samples used in one training step. The epoch is the number of passes over the training set during model training.

For the purpose of verifying the effectiveness of HyGNN-Mal, we conduct a number of experiments. These experiments answer the following research questions.

RQ1. What is the performance of HyGNN-Mal compared to the other well-performing methods?

RQ2. Why does the proposed HyGNN-Mal model perform well?

RQ3. What is the impact of source code level analysis on malicious code detection?

4.4.1. Comparison with Different Methods

We compare our methods with four well-performing methods, and the experimental results are shown in Table 3. The reason we choose these benchmarks is that the datasets or models they used are very similar to ours. It is suitable to highlight the effectiveness of the HyGNN-Mal by comparison.

Marcheggiani and Titov [43] showed that bidirectional LSTM (Bi-LSTM) and syntax-based GCNs have complementary modeling power when GCNs are stacked on top of LSTM layers. Haq et al. [44] leveraged a convolutional neural network (CNN) and Bi-LSTM to efficiently identify persistent malware. They conducted comparative experiments on hybrid DL-driven architectures and DL benchmarks using publicly available datasets. Pei et al. [11] proposed AMalNet for Android malware detection and family attribution. They applied GCNs and an independently recurrent neural network (IndRNN) to fully consider the semantic distribution information of malware, such as character, word, and lexical features. The above methods performed well, but they still fall short in application semantic extraction. We implemented these models (Bi-LSTM+GCNs, CNN+Bi-LSTM, and GCNs+IndRNN) on the features described in Section 3.3. Mirzaei et al. [45] used a fast graph-mining algorithm to extract ensembles of API calls and modeled them as Markov chains to construct the feature vectors for apps. They also utilized machine learning algorithms for classification. Furthermore, the authors of [45] have released their source code, so we can easily reproduce their method on our dataset.

In Table 3, the results of [11, 43] are close to those of our method. However, they ignore the structural information of source code, whereas we use the AST and the position information of source code to represent it. We also consider the behavioral characteristics of malware, permissions and APIs. The comparison indicates that the proposed HyGNN-Mal surpasses previous efforts.

4.4.2. Ablation Study of Different Models

The selection of models is important in the experiment. Next, we demonstrate why the HyGNN-Mal model performs well through model ablation experiments. We verify the experimental performance and results of different model combinations, as shown in Table 4. Considering the effectiveness of GCN in dealing with graph structure, we choose it to learn the structural information of the source code. We found that the performance of the Bi-LSTM+GCNs model in Table 3 is similar to that of the Bi-GRU+GCNs model in Table 4. However, Bi-GRU takes up fewer computational resources than Bi-LSTM, so we choose Bi-GRU to process sequence information. To handle long source code sequences, we use self-attention. As expected, the HyGNN-Mal model performs best compared with the other model combinations.

To further assess the capabilities of our approach versus the baselines, we compute the receiver operating characteristic (ROC) curve of HyGNN-Mal, shown in Figure 8. For the binary classification task, the AUC value is about 0.9998. In the five-class classification task, the AUC values of the categories are very close. The AUC value of adware is the highest at 0.9999, which is close to 1. These results show that HyGNN-Mal is effective for the Android malware classification task.

Figure 8: ROC curves of HyGNN-Mal. (a) Binary classification task. (b) Five classification tasks.
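The AUC values quoted for Figure 8 can be read as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The sketch below computes AUC directly from that pairwise definition, counting ties as one half; the function name `roc_auc` and the four toy scores are assumptions for illustration and are unrelated to the paper's data. For a multiclass task, one common convention (assumed here) is to compute one such value per category by treating that category as positive and all others as negative.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    # A positive ranked above a negative counts 1, a tie counts 0.5.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy labels (1 = malicious) and classifier scores (assumed data).
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(labels, scores)
print(auc)
```

The pairwise form is quadratic in the number of samples, so for datasets of tens of thousands of apps a sort-based implementation would be preferable; the value it produces is the same.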

4.4.3. The Effect of Source Code Features

To demonstrate the importance of source code analysis in HyGNN-Mal, we compare different feature combinations and observe their performance on malware detection. The experimental results are shown in Table 5.

In Table 5, we find that malware detection improves when source code semantic information is added to the P+API combination. This reflects that analysis at the source code level produces an information gain for the application semantics. Besides, when only one feature is used, AST performs the worst, because it considers only the structural information of apps and ignores lexical information, and different structures may have the same behavior. For example, “for” and “while” code segments have different structures but both represent loop operations. When we further add the source code lexical information, S, the effect is clearly enhanced, because the combination of sequence and structure information characterizes the applications more comprehensively. Finally, we supplement the behavioral features P+APIs to enhance the semantic information of applications, which gives the best result. This illustrates that the performance of HyGNN-Mal is significantly improved by the multiple feature vectors we propose.

5. Conclusion and Discussion

In this paper, we propose a novel approach for the automatic detection of Android malware via a hybrid graph neural network, HyGNN-Mal. First, we analyze Android apps at the source code level, using the AST to represent structural information and the token sequence to represent syntactic information. Second, we propose the Deep-TNN model to process the AST and utilize the self-attention mechanism to process the source code sequences augmented with row and column position information. In addition, we use the Bi-GRU to handle permission and API features. Finally, a series of experiments is conducted to demonstrate the feasibility and effectiveness of HyGNN-Mal.

With the increasing number of Android malware variants, the challenges of automatic malware detection are as follows: (1) Small and even outdated datasets lead to inaccurate malware detection. (2) There is a lack of consensus on labeling malware. (3) The time cost and large storage space required by malicious samples still need to be considered. Based on our study, we also summarize the future directions:
(i) With the widespread application of DL, the combination of DL and source code graph analysis in Android malware detection is noteworthy
(ii) Taking the app descriptions as comments on the purpose of the permissions used, NLP techniques will be a potential future direction to detect malware
(iii) Given the few malicious code samples currently available, transfer learning will be a good choice
(iv) Compared with hybrid analysis methods, research on which apps are suitable for static analysis and which are suitable for dynamic analysis is more promising. It will greatly save resources and improve the efficiency of feature extraction

In future work, we will consider using DL technology to process apps suitable for dynamic analysis and NLP technology to process apps suitable for static analysis, in order to achieve a more accurate classification of families and to detect newly generated malware effectively.

Data Availability

Data sharing is not applicable to this article as no new data were created or analyzed in this study.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

References

S. G. Stats, “The market share of android smartphones,” 2021, http://gs.statcounter.com/osmarket-share/mobile/worldwide.
View at: Google Scholar
M. Statista, “Operating systems’ market share worldwide from January 2012,” https://www.statista.com/statistics/272698/global-market-share-held-by-mobile-operating-systems-since-2009.
View at: Google Scholar
K. M. Malware Evolution, “The number of malicious installation packages appeared per day,” 2021, https://securelist.com/mobile-malware-evolution-2018/89689/.
View at: Google Scholar
Mobile Malware Evolution, 2021, https://securelist.com/mobile-malware-evolution-2020/101029/.
T. Kim, B. Kang, M. Rho, S. Sezer, and E. G. Im, “A multimodal deep learning method for android malware detection using various features,” IEEE Transactions on Information Forensics and Security, vol. 14, no. 3, pp. 773–788, 2019.
View at: Google Scholar
M. Fan, J. Liu, X. Luo et al., “Android malware familial classification and representative sample selection via frequent subgraph analysis,” IEEE Transactions on Information Forensics and Security, vol. 13, no. 8, pp. 1890–1905, 2018.
View at: Publisher Site | Google Scholar
M. Sewak, S. K. Sahay, and H. Rathore, “An investigation of a deep learning-based malware detection system,” in Proceedings of the 13th International Conference on Availability, Reliability and Security, pp. 1–5, Hamburg, Germany, 2018.
View at: Google Scholar
T. Song, W. Zheng, P. Song, and Z. Cui, “EEG emotion recognition using dynamical graph convolutional neural networks,” IEEE Transactions on Affective Computing, vol. 11, no. 3, pp. 532–541, 2020.
View at: Google Scholar
R. Levie, F. Monti, X. Bresson, and M. M. Bronstein, “Cayleynets: graph convolutional neural networks with complex rational spectral filters,” IEEE Transactions on Signal Processing, vol. 67, no. 1, pp. 97–109, 2019.
View at: Google Scholar
Q. Zhang, J. Chang, G. Meng, S. Xu, S. Xiang, and C. Pan, “Learning graph structure via graph convolutional networks,” Pattern Recognition, vol. 95, pp. 308–318, 2019.
View at: Publisher Site | Google Scholar
X. Pei, L. Yu, and S. Tian, “AMalNet: a deep learning framework based on graph convolutional networks for malware detection,” Computers & Security, vol. 93, article 101792, 2020.
View at: Publisher Site | Google Scholar
H. Safa, M. Nassar, and W. A. R. Al Orabi, “Benchmarking convolutional and recurrent neural networks for malware classification,” in 15th International Wireless Communications & Mobile Computing Conference (IWCMC), pp. 561–566, Tangier, Morocco, 2019.
View at: Google Scholar
A. Pektaş and T. Acarman, “Deep learning for effective android malware detection using API call graph embeddings,” Soft Computing, vol. 24, no. 2, pp. 1027–1043, 2020.
View at: Publisher Site | Google Scholar
Y. Feng, O. Bastani, R. Martins, I. Dillig, and S. Anand, “Automated synthesis of semantic malware signatures using maximum satisfiability,” 2016, https://arxiv.org/abs/1608.06254.
View at: Google Scholar
T. Chakraborty, F. Pierazzi, and V. S. Subrahmanian, “Ec2: ensemble clustering and classification for predicting android malware families,” IEEE Transactions on Dependable and Secure Computing, vol. 17, no. 2, pp. 262–277, 2020.
View at: Google Scholar
A. Atzeni, F. Diaz, A. Marcelli, A. Sánchez, G. Squillero, and A. Tonda, “Countering android malware: a scalable semi-supervised approach for family-signature generation,” IEEE Access, vol. 6, pp. 59540–59556, 2018.
View at: Publisher Site | Google Scholar
R. Surendran, T. Thomas, and S. Emmanuel, “A TAN based hybrid model for android malware detection,” Journal of Information Security and Applications, vol. 54, article 102483, 2020.
View at: Publisher Site | Google Scholar
G. D'Angelo, F. Palmieri, A. Robustelli, and A. Castiglione, “Effective classification of android malware families through dynamic features and neural networks,” Connection Science, vol. 33, no. 3, pp. 786–801, 2021.
View at: Publisher Site | Google Scholar
A. F. A. Kadir, N. Stakhanova, and A. A. Ghorbani, “Android botnets: what urls are telling us,” International Conference on Network and System Security. Springer, vol. 9408, pp. 78–91, 2015.
View at: Publisher Site | Google Scholar
S. Hou, Y. Ye, Y. Song, and M. Abdulhayoglu, “Make evasion harder: an intelligent android malware detection system,” in Proceedings of the 27th International Joint Conference on Artifificial Intelligence (IJCAI), pp. 5279–5283, Stockholm, Sweden, 2018.
View at: Google Scholar
M. Y. Wong and D. I. D. Lie, “A targeted input generator for the dynamic analysis of android malware,” in Proceedings of the Network and Distributed System Security Symposium (NDSS), pp. 21–24, San Diego, CA, USA, 2016.
View at: Google Scholar
K. Tam, S. J. Khan, A. Fattori, and L. Cavallaro, “CopperDroid: automatic reconstruction of android malware behaviors,” in Proceedings of the Network and Distributed System Security Symposium (NDSS), pp. 8–11, San Diego, CA, USA, 2015.
View at: Google Scholar
W. Niu, R. Cao, X. Zhang, K. Ding, K. Zhang, and T. Li, “OpCode-level function call graph based android malware classification using deep learning,” Sensors, vol. 20, no. 13, p. 3645, 2020.
View at: Publisher Site | Google Scholar
M. K. Alzaylaee, S. Y. Yerima, and S. Sezer, “DL-Droid: deep learning based android malware detection using real devices,” Computers & Security, vol. 89, article 101663, 2020.
View at: Publisher Site | Google Scholar
T. Liu, H. Wang, L. Li, G. Bai, Y. Guo, and G. Xu, “DaPanda: detecting aggressive push notifications in android apps,” in 34th IEEE/ACM International Conference on Automated Software Engineering (ASE), pp. 66–78, IEEE, San Diego, CA, USA, 2019.
A. Guerra-Manzanares, S. Nomm, and H. Bahsi, “In-depth feature selection and ranking for automated detection of mobile malware,” ICISSP, vol. 1, pp. 274–283, 2019.
J. Gao, L. Li, P. Kong, T. F. Bissyandé, and J. Klein, “Understanding the evolution of android app vulnerabilities,” IEEE Transactions on Reliability (TRel), vol. 70, no. 1, pp. 212–230, 2021.
Y. Hu, H. Wang, Y. Zhou et al., “Dating with scambots: understanding the ecosystem of fraudulent dating applications,” IEEE Transactions on Dependable and Secure Computing (TDSC), vol. 2019, 2019.
L. Li, T. F. Bissyandé, and J. Klein, “Rebooting research on detecting repackaged android apps: literature review and benchmark,” IEEE Transactions on Software Engineering (TSE), vol. 2019, 2019.
K. A. Talha, D. I. Alper, and C. Aydin, “APK auditor: permission-based android malware detection system,” Digital Investigation, vol. 13, pp. 1–14, 2015.
L. Cen, C. S. Gates, L. Si, and N. Li, “A probabilistic discriminative model for android malware detection with decompiled source code,” IEEE Transactions on Dependable and Secure Computing, vol. 12, no. 4, pp. 400–412, 2015.
M. Alazab, M. Alazab, A. Shalaginov, A. Mesleh, and A. Awajan, “Intelligent mobile malware detection using permission requests and API calls,” Future Generation Computer Systems, vol. 107, pp. 509–521, 2020.
J. Su, D. V. Vasconcellos, S. Prasad, D. Sgandurra, Y. Feng, and K. Sakurai, “Lightweight classification of IoT malware based on image recognition,” in IEEE 42nd Annual Computer Software and Applications Conference (COMPSAC), vol. 2, pp. 664–669, 2018.
S. Ni, Q. Qian, and R. Zhang, “Malware identification using visualization images and deep learning,” Computers & Security, vol. 77, pp. 871–885, 2018.
J. Qiu, J. Zhang, W. Luo, L. Pan, S. Nepal, and Y. Xiang, “A survey of android malware detection with deep neural models,” ACM Computing Surveys (CSUR), vol. 53, no. 6, pp. 1–36, 2021.
T. N. Kipf and M. Welling, “Semi-supervised classification with graph convolutional networks,” 2016, http://arxiv.org/abs/1609.02907.
A. Vaswani, N. Shazeer, N. Parmar et al., “Attention is all you need,” Advances in Neural Information Processing Systems, vol. 30, 2017.
L. Zhang, Z. Huang, W. Liu, Z. Guo, and Z. Zhang, “Weather radar echo prediction method based on convolution neural network and long short-term memory networks for sustainable e-agriculture,” Journal of Cleaner Production, vol. 298, article 126776, 2021.
L. Zhang, C. Xu, Y. Gao, Y. Han, X. Du, and Z. Tian, “Improved Dota2 lineup recommendation model based on a bidirectional LSTM,” Tsinghua Science and Technology, vol. 25, no. 6, pp. 712–720, 2020.
L. Lv, Z. Wu, L. Zhang, B. B. Gupta, and Z. Tian, “An edge-AI based forecasting approach for improving smart microgrid efficiency,” IEEE Transactions on Industrial Informatics, pp. 1–1, 2022.
D. Arp, M. Spreitzenbarth, M. Hubner, H. Gascon, and K. Rieck, “Drebin: effective and explainable detection of android malware in your pocket,” in Proceedings of the Network and Distributed System Security Symposium (NDSS), vol. 14, pp. 23–26, 2014.
K. Allix, T. F. Bissyandé, J. Klein, and Y. Le Traon, “AndroZoo: collecting millions of android apps for the research community,” in 13th International Conference on Mining Software Repositories, pp. 468–471, Austin, TX, USA, 2016.
D. Marcheggiani and I. Titov, “Encoding sentences with graph convolutional networks for semantic role labeling,” in Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 1506–1515, Copenhagen, Denmark, 2017.
I. U. Haq, T. A. Khan, and A. Akhunzada, “A dynamic robust DL-based model for android malware detection,” IEEE Access, vol. 9, pp. 74510–74521, 2021.
O. Mirzaei, G. Suarez-Tangil, J. M. de Fuentes, J. Tapiador, and G. Stringhini, “AndrEnsemble: leveraging API ensembles to characterize android malware families,” in Proceedings of the 2019 ACM Asia Conference on Computer and Communications Security, pp. 307–314, Auckland, New Zealand, 2019.

Copyright

Copyright © 2022 Chunyan Zhang et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
