Abstract
Malware continuously evolves and becomes increasingly sophisticated. Learning from execution behavior has proven effective for malware detection. However, little work has been done to explore the implications of full process information for malware detection. In this paper, we present a deep neural network based malware detection approach that performs learning on process-aware behaviors of Windows programs. It first employs a logistic regression-based weighting method and a machine learning-based API score learning method to capture inner-process behavior, including API sequences and their run-time arguments. Next, it constructs a process graph from inter-process interactions, from which a set of attributes is extracted to characterize the relationships among processes in terms of invocation actions. Finally, it feeds the process-aware features into a deep neural network to train a binary classifier for malware detection. Beyond the design, we have implemented and evaluated our proposed method on two datasets. The results demonstrate that our method outperforms naïve models that take raw APIs as input, verifying its effectiveness. Moreover, we have evaluated the robustness of our model to adversarial attacks and concept drift, and the results demonstrate that it remains robust under both.
1. Introduction
Over the past few years, a growing number of malware samples and new variants have been reported in cyberspace. According to a recent report [1], a total of 442,151 previously unknown malware variants were identified by SonicWall in 2021, a 65% increase year-over-year and an average of 1,211 per day. Although many security solutions are intended to protect systems and devices, such as intrusion detection systems, next-generation firewalls, and anti-virus software, malware infections still occur. Most of these solutions act upon known malware behavior and signatures. However, when new or variant malware is released and its behavior and signature are still unknown, these solutions are ineffective against the resulting infections.
The ability to understand and model malware behavior plays a key role in many security applications [2]. Many recent works perform detection based on dynamic execution behavior [3–14]. Generally, the execution behavior of a program can be represented by system and API call traces, process interactions, network communications, and so on. Some studies employ machine learning methods to acquire the underlying patterns hidden in a large dataset and are thus promising for detecting known malware and identifying new malware [4, 5]. In recent years especially, researchers have shown increased interest in designing automatic malware detection methods based on deep neural network (DNN) models, such as convolutional neural networks (CNN) and recurrent neural networks (RNN). Some studies extract features from API call sequences, possibly associated with their arguments, to learn a malware detector [3, 6–8]; e.g., they represent API calls as graphs and employ graph-based models to learn on them [13, 14], or employ RNNs to learn on the sequence of APIs [11, 12].
Other studies employ the graph-like structure of process relations to detect malware or suspicious behavior. Process trees were first introduced for malware detection in [9], where structures from process trees were compared against reference trees known to be benign. The researchers in [10] apply a supervised learning method for malware detection on streaming trees, which represent computer processes.
These studies either apply a single feature such as the API call sequence or process-related information (e.g., [9, 10]) for malware detection, or employ multiple features of API calls and their arguments (and/or return values), such as [3, 6–8]. Generally, the relationship between inner-process behavior and inter-process actions is valuable for identifying malicious activity; e.g., new variants of the HUIGEZI malware create new processes in the infected OS and hide themselves, making them difficult to discover. However, little work has been done to explore the implications of full process information for malware detection models. In addition, if the full run-time arguments served as a feature, the DNN model would require much deeper layers and numerous neurons in each layer, which consequently leads to heavy computing-resource consumption.
In this paper, we aim to leverage full process information to enhance malware detection and to strike a trade-off between detection performance and resource consumption. To address the above challenges, we present a DNN based malware detection approach that performs learning on process-aware behaviors of Windows programs. The key insight of our method is that process-aware behaviors carry security information that assists accurate malware detection. Process-aware behaviors lie in two aspects: inner-process behavior awareness, which assesses the sensitivity of an API to malicious behavior, and inter-process behavior awareness, which perceives the relationships among processes. First, inner-process behavior awareness. It employs a logistic regression-based weighting method to determine with high confidence whether the arguments of an API call are malicious. In this way, an API call can be weighted differently according to the weight of its run-time arguments, yielding a weighted API, where the weight denotes its sensitivity to malicious behavior. Furthermore, it applies an ML-based API score learning method to pre-train an importance score for each API from the context of benign and malicious API call sequences. The API score denotes the importance of an API, which helps the DNN model better perceive the difference between malicious and benign features. Second, inter-process behavior awareness. It constructs process graphs from inter-process interactions, from which a set of attributes is extracted to capture the key context of inter-process invocation actions and represent it with a small number of key properties. Third, deep process-aware behavior learning. It takes the weighted API sequence, API scores, and the attributes of the process graph as input, and the DNN model organically integrates the weighted-scored API feature and the process-graph feature for learning, which effectively perceives the security information of process-aware behaviors.
We have implemented our method and evaluated it on two slightly different datasets, one as the training dataset and the other as the testing dataset, each of which contains 10K benign and 10K malicious execution records. The results show that our method reaches 97.63% accuracy on the testing dataset, 6.54% higher than that of the naïve Text-CNN model. In addition, we have evaluated the robustness of our method to adversarial attacks and concept drift, and the results illustrate that it remains robust against both.
To summarize, this paper makes the following contributions. (i) First, we present a DNN based malware detection method, which leverages process-aware behavior awareness to mine the security sensitivity of full process information. (ii) Second, we propose an inner-process behavior awareness method to perceive API sequences and their run-time arguments. (iii) Third, we propose an inter-process behavior awareness method to perceive the interactions of invocation actions among processes. (iv) Finally, we implement our method and run comprehensive experiments to evaluate its effectiveness. The results demonstrate that our method achieves the best performance.
2. Related Work
In recent years, many studies based on dynamic behavior features have been proposed in the machine learning area for malware detection. They fall into three main categories: API call feature based methods, process related feature based methods, and multiple feature based methods.
2.1. API Call Feature Based Method
Cho et al. [15] presented a similarity assessment method based on API call sequences for distinguishing between different variants within the same malware family. They achieved significantly improved accuracy by removing repeated subsequences of APIs from the API call sequence and reducing the overhead associated with sequence alignment. Zhang and Zhang [5] proposed API-Graph, which builds API relationship graphs to represent the internal relationships between various programming entities; they fed the API graphs into Random Forest (RF), Model Pool, support vector machine (SVM), and DNN classifiers to train malware detectors. Zhang et al. [6] utilized a feature hashing trick to encode API call names associated with their arguments and input them into a deep neural network architecture composed of multiple Gated-CNNs and bidirectional long short-term memory (LSTM) networks. Chen et al. [3] proposed an API Labelling method to assess the sensitivity of an API to malicious behavior with the assistance of run-time arguments. They then encoded the sensitivity into the API embedding and fed the sequences of API embeddings into Text-CNN and bidirectional LSTM networks to train malware detectors.
Our work is similar to [3] in generating embeddings for APIs and applying Text-CNN to train a malware detector. However, the API Labelling method in [3] depends on rules, which require certain field knowledge for sensitivity-degree classification. In this paper, we employ a logistic regression-based weighting method to weight APIs and adopt an ML-based API score learning method to obtain each API's importance score, thereby reducing the reliance on field knowledge and better distinguishing the importance of different APIs. In addition, we augment the feature set with a process graph dimension and improve the Text-CNN to learn the combined features of weighted-scored APIs and process graphs, further improving malware detection performance.
2.2. Process Related Feature Based Method
Henry et al. [19] employed the call stack, which stores information about the active subroutines of a process and indicates to which subroutine control should return after execution. The return addresses of the system calls issued by a process are analyzed to detect possible malicious behavior. They ran the processes many times and built a list of the return addresses used by each process to train the detection model. Wagner and Wagener [16] applied the relationships among processes for malicious behavior detection; this information can help build better real-time host-level intrusion detection. They modeled process information into tree-based structures, evaluated the constructed process trees using tree-based kernels, built a labeled graph, and fed it into support vector machines to obtain a malware detector. Wijnands [9] proposed a malware detection algorithm using the combined data from process activities and process trees. They employed a distance measure on process characteristics, depth, and cluster between two processes, and minimized the distance for every process. The results showed that their algorithm could detect processes from two of the three malware samples used. Cochrane and Foster [10] represented computer processes as streaming trees that evolve in continuous time, and proposed a supervised learning method for systematic malware detection on streaming trees that is robust to irregular sampling and the high dimensionality of the underlying streams.
2.3. Multiple Feature Based Method
Rieck and Holz [17] proposed a malware detection method employing various behavior data, including changes to the file system, changes to the Windows registry, infection of running processes, creation and acquisition of mutexes, network activity and transfers, and the starting and stopping of Windows services. These data were collected using a sandbox; after classification, 88% of the malware was assigned to the correct malware family. Shan and Wang [18] used OS-level behavior information to cluster the objects of a program together. The composed clusters are compared with predefined templates of malicious behavior to determine whether a program is malicious. The drawback is that known malicious behavior is still used to detect malicious behavior, which does not suffice for detecting new kinds of malicious behavior.
2.4. Comparison with Related Works
As illustrated in Table 1, our proposed method is inspired by these studies. However, unlike methods that employ a single feature of API calls or process information, we weight the run-time arguments of APIs according to their sensitivities to malicious behavior. Then, we utilize inter-process behavior awareness to obtain the attributes of the process graph and feed the weighted API sequences, API scores, and process graph attributes into the DNN. The main advantage of our method over these studies is that it effectively perceives the relationship between inner-process and inter-process behavior, thereby achieving efficient learning and better performance.
3. Overview
We present a DNN based malware detection method, which efficiently exploits the process-aware behavior information of a program. The overall architecture of our method is illustrated in Figure 1. As can be seen, it is composed of three modules, i.e., Process-Aware Behavior, Deep Process-Aware Behavior Learning, and DNN Based Malware Detection.

3.1. Process-Aware Behavior
This module is responsible for analyzing the whole process-aware behavior of malicious or benign programs to mine deep-level security associations. As stated before, general malware detection methods based on behavioral (or dynamic analysis) data use features such as API sequences or API sequences with their arguments, while the behavioral features and relationships between processes are often ignored. Besides, a simple approach is to use the run-time arguments directly for argument-based malware detection. However, the space of arguments is extremely large, e.g., a malware sample passes 7,974 arguments during a 6-minute execution and 1,604 of them are unique, which makes it impossible to learn security semantics using a DNN (unless it is deep enough) [3].
To address these challenges, we propose an Inner-Process Behavior Awareness method and an Inter-Process Behavior Awareness method to analyze and perceive more comprehensive behavioral features while acquiring the security information. On the one hand, in order to automate argument classification and reduce reliance on field knowledge, the Inner-Process Behavior Awareness method performs logistic regression-based argument weighting, which formalizes several weighting rules that distinguish malicious and benign arguments more accurately. In this way, the run-time arguments matching the rules are identified directly, enabling the associated APIs to be weighted. The weighted APIs then constitute a weighted API sequence, which differs from the raw API sequence. Besides, the Inner-Process Behavior Awareness method employs an ML-based API score learning method that pre-trains a random forest model to learn the importance score of each API from the context of benign and malicious API call sequences.
On the other hand, the Inter-Process Behavior Awareness method constructs a directed graph from the invocation actions of various processes, from which a set of attributes is extracted to perceive the security information among processes more accurately. In this way, the attributes of the process graph capture the key context of inter-process invocation actions and represent it with a small number of key properties.
3.2. Deep Process-Aware Behavior Learning
We build an improved deep neural network based on the Text-CNN model. We first utilize the classic Text-CNN network to obtain the weighted-scored API feature from the combined embedding, which concatenates the weighted-API embedding and the API score embedding. Then, we use a fully connected layer to convert the attributes of the process graph into the process-graph feature, and finally combine these two features. By concatenating them, the produced embedding enables the DNN model to better understand behavior data spanning weighted-API sequences, API importance scores, and inter-process interactions. In this way, we are able to feed the security information into the DNN, from which it learns a process-aware behavior model.
3.3. DNN Based Malware Detection
We set up the process-aware behavior model to train on behavior data for malware detection. The trained model is essentially a binary classifier that reports the input as benign or malicious, given a sequence of API calls along with run-time arguments and the invocation actions of various processes.
4. Process-Aware Behavior
4.1. Inner-Process Behavior Awareness
4.1.1. Logistic Regression-Based Argument Weighting
In order to reduce time and space overhead, high-frequency APIs are selected from the entire API space for argument weighting analysis. According to statistics on our datasets, the top-30 APIs ranked by frequency account for 94.27% and 95.29% of all API calls in the malicious and benign corpus, respectively. Therefore, we adopt a logistic regression-based method to perform weighting analysis on the arguments of high-frequency API calls: we compute the top-30 most frequent APIs in malicious traces and in benign traces separately, and regard the union of the two sets as the high-frequency API set $S$.
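As a minimal sketch (assuming the execution traces have already been parsed into per-sample API call lists; function and variable names here are illustrative), the high-frequency API set can be built as follows:

```python
from collections import Counter

def high_frequency_api_set(malicious_traces, benign_traces, top_k=30):
    """Build the high-frequency API set S as the union of the top-k most
    frequent APIs in the malicious corpus and in the benign corpus."""
    mal_counts = Counter(api for trace in malicious_traces for api in trace)
    ben_counts = Counter(api for trace in benign_traces for api in trace)
    mal_top = {api for api, _ in mal_counts.most_common(top_k)}
    ben_top = {api for api, _ in ben_counts.most_common(top_k)}
    return mal_top | ben_top
```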
Note that the core of the logistic regression algorithm is the logistic function $\sigma(z) = \frac{1}{1 + e^{-z}}$ (equation (1)), which measures the correspondence between the label to be predicted and the input features by estimating a probability. The value of $\sigma(z)$ approaches 1 as $z$ increases and approaches 0 as $z$ decreases. Meanwhile, the logistic regression algorithm not only achieves good binary classification performance, but also yields the intermediate weights of the influencing factors related to the prediction results, which is conducive to analyzing the factors that influence classification.
More specifically, we train a logistic regression model for the high-frequency API set $S$, acquire the weights of all argument features of each API, and classify the run-time API calls as malicious or benign. The logistic regression model training steps are as follows.
Firstly, construct the feature matrix. Let $a \in S$; the frequency of each run-time argument passed by $a$ in both malicious and benign traces is used to construct the feature matrix. For example, LoadLibraryExW passes a total of three arguments in a malicious trace and a benign trace, namely user32.dll, gdi32.dll, and ws2_32.dll. As illustrated in Table 2, the frequencies of LoadLibraryExW passing these three arguments are 2, 1, and 0 in Sample1, and 0, 0, and 1 in Sample2. Note that the last column denotes the label of the samples.
Secondly, perform a linear transformation, as shown in equation (2): $z = \sum_{i=1}^{n} w_i x_i + b$, where $n$ denotes the dimension of the feature matrix (i.e., the number of features). We multiply each feature $x_i$ (i.e., the frequency of a run-time argument) by a regression coefficient $w_i$, accumulate the results, and add an offset $b$.
Thirdly, perform the sigmoid transformation. We apply the function $\sigma$ to convert the value obtained after the linear transformation into an intermediate weight between 0 and 1.
Finally, weight the arguments and API calls. We compare the intermediate weight corresponding to each argument with 0. If the weight of an argument passed by the given API exceeds 0, the argument is weighted as malicious; otherwise, it is weighted as benign. An API call belonging to the high-frequency API set can then be weighted in the form APINAME_WEIGHT, where WEIGHT depends on the security weight of its arguments. For example, NtReadFile_M denotes that a call to NtReadFile is weighted as malicious given its run-time arguments, while NtReadFile_B denotes a benign call.
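The following sketch illustrates the weighting procedure with scikit-learn, assuming each sample is summarized as the frequency of each run-time argument passed to a given high-frequency API (as in Table 2). Following the comparison with zero described above, the sign of each argument's learned regression coefficient is used as its weight; the rule that an API call is weighted as malicious if it passes any malicious-weighted argument is an assumption, since the exact aggregation is not spelled out here.

```python
from sklearn.linear_model import LogisticRegression

def weight_arguments(freq_matrix, labels, argument_names):
    """Train logistic regression on per-sample argument-frequency features of
    one high-frequency API (label 1 = malicious sample, 0 = benign) and weight
    each argument by the sign of its learned coefficient."""
    lr = LogisticRegression(max_iter=1000)
    lr.fit(freq_matrix, labels)
    return {arg: ("M" if coef > 0 else "B")
            for arg, coef in zip(argument_names, lr.coef_[0])}

def weight_api_call(api_name, passed_args, arg_weights):
    """Weight a single API call as APINAME_M if it passes any argument weighted
    as malicious, otherwise APINAME_B (aggregation rule assumed)."""
    label = "M" if any(arg_weights.get(a) == "M" for a in passed_args) else "B"
    return f"{api_name}_{label}"
```

For instance, a LoadLibraryExW call passing ws2_32.dll would be mapped to LoadLibraryExW_M or LoadLibraryExW_B depending on the sign of the coefficient learned for that argument.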
4.1.2. Weighted-API Sequence Extraction
To strengthen the capture of inner-process behavior features without losing the security weight of run-time arguments, we employ the logistic regression-based method to weight raw API calls based on the security weight of their associated arguments in both malicious and benign traces. After API weighting, we obtain the weighted-API sequence; the coverage of weighted APIs in the malicious and benign corpus reaches 71.19% and 71.39% respectively, covering a large part of the run-time API calls in our dataset.
4.1.3. ML-Based API Score Learning
The contribution of each API to the performance of the classifier is not uniform. Generally, existing malware detection models based on API sequences treat each API in the input sequence equally. However, different APIs contribute differently to distinguishing malicious from benign samples, and the importance of various APIs may help the neural network model better perceive the difference between malicious and benign features. Unlike existing DNN-based studies that treat all APIs as equally important, we argue that different APIs exhibit varying semantics. Therefore, we employ a machine learning (ML) algorithm to pre-train the importance scores of all APIs in the dataset, embed the Scored-API feature of the API sequence, and input it into the DNN model to further capture inner-process behavior.
Feature selection based on the random forest (RF) classifier [20] has been found to provide multivariate feature importance scores that are relatively cheap to obtain and have been successfully applied to high-dimensional data arising from microarrays [21, 22], time series [23], and even spectra [24]. Random forest is an ensemble learner based on randomized decision trees and provides different feature importance measures [25].
We apply the Gini Index [26] to measure the variable importance in the random forest classifier, which captures how much each API feature in the dataset contributes to each tree in the random forest. We refer to the variable importance of each API as its API score.
We use $VIM$ to denote the API score and $GI$ to denote the Gini Index. Suppose there are $M$ API features $X_1, X_2, \ldots, X_M$, $T$ decision trees, and $C$ classes. The score $VIM_j$ of API feature $X_j$ is obtained by calculating its Gini Index contribution; more specifically, it is the average change in Gini Index caused by the node splits on $X_j$ over all decision trees of the random forest.
The Gini Index of node $q$ in decision tree $t$ is calculated as follows:

$$GI_q = 1 - \sum_{c=1}^{C} p_{qc}^2,$$

where $C$ denotes the number of classes and $p_{qc}$ denotes the proportion of class $c$ in node $q$.
Then, we calculate the importance score of each API feature at each node of each decision tree, i.e., the variation of the Gini Index before and after the node branches. The importance score of API feature $X_j$ at node $q$ of decision tree $t$ is as follows:

$$VIM_{jq}^{(t)} = GI_q - GI_l - GI_r,$$

where $GI_l$ and $GI_r$ respectively denote the Gini Index of the two new child nodes after node $q$ branches.
Next, we calculate the importance score of each API feature in each decision tree. Let $Q_j^{(t)}$ denote the set of nodes at which API feature $X_j$ appears in decision tree $t$; then the importance score of feature $X_j$ in decision tree $t$ is as follows:

$$VIM_j^{(t)} = \sum_{q \in Q_j^{(t)}} VIM_{jq}^{(t)}.$$
Assuming that there are $T$ decision trees in the random forest, the importance score of API feature $X_j$ in the entire random forest is calculated as follows:

$$VIM_j = \sum_{t=1}^{T} VIM_j^{(t)}.$$
Finally, we normalize the score of each API feature using linear function normalization (i.e., min-max scaling) as follows:

$$VIM_j^{norm} = \frac{VIM_j - \min(VIM)}{\max(VIM) - \min(VIM)},$$

where $VIM_j^{norm}$ denotes the normalized score, and $\max(VIM)$ and $\min(VIM)$ are the maximum and minimum values over the scores of all API features, respectively.
4.1.4. API Score Extraction
To obtain the importance score of each API in our dataset through ML-based model pre-training, and to further extract the Scored-API feature that helps the DNN model perceive inner-process behavior, we first construct the pre-training data for the machine learning model. DataSet1, which contains a total of 94 APIs and is described in detail in Section 6.1, is adopted as the pre-training dataset. We perform data pre-processing on the API call sequences of benign and malicious samples. Each sample is represented by a one-dimensional array of size 94; each position of the array represents an API feature, and the value denotes the execution frequency of the corresponding API in this sample.
Then, the pre-training data is randomly divided into three parts, two-thirds of which are used as the training set and the remaining one-third as the test set. Next, the RF algorithm is trained on this dataset, where the number of classifiers in the random forest (i.e., n_estimators) is set to 52 and the maximum depth of the decision trees (i.e., max_depth) is set to 68. Note that the random forest model is trained with various parameter settings and the one with the best performance is adopted. Finally, after training the ML-based model, we extract the feature weights of the pre-trained model; each feature weight denotes the importance score of the corresponding API.
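The pre-training step can be sketched as follows, assuming each sample is a 94-dimensional API frequency vector as described above; scikit-learn's feature_importances_ already implements the Gini-based importance of Section 4.1.3, so the scores only need min-max scaling afterwards.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def learn_api_scores(api_freq_matrix, labels):
    """Pre-train a random forest on API frequency vectors and return the
    min-max normalized Gini importance of each API as its API score."""
    x_train, x_test, y_train, y_test = train_test_split(
        api_freq_matrix, labels, test_size=1/3, random_state=0)
    rf = RandomForestClassifier(n_estimators=52, max_depth=68, random_state=0)
    rf.fit(x_train, y_train)
    scores = rf.feature_importances_                        # Gini-based VIM per API
    scores = (scores - scores.min()) / (scores.max() - scores.min())
    return scores, rf.score(x_test, y_test)                 # API scores and test accuracy
```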
Figure 2 shows the importance scores of eight representative APIs after normalization, which are limited to the range [0, 1]. We observe that WritePEFile is the highest-scoring API, and the importance scores of NtSetValueKey, BeCreatedEx, and SetFileHiddenOrReadOnly are much higher than those of Module32FirstW, NtCreatThread, SendARP, and EnumServicesStatusW. We suspect that these API scores are correlated with the behaviors of malware and benign software in real-world environments. For example, some typical downloader malware secretly downloads spyware and virus programs, thereby writing PE executable files on the hosting system. Moreover, much malware calls NtSetValueKey to modify the registry so that the executable program runs automatically at OS boot-up. On the contrary, APIs such as NtCreatThread, SendARP, and EnumServicesStatusW represent relatively common behaviors that occur in both benign and malicious programs, and are less important for distinguishing between benign and malicious samples than more sensitive APIs such as WritePEFile.

4.2. Inter-Process Behavior Awareness
4.2.1. Process Graph
Computer processes interact with each other and invoke actions; this information can provide useful insights into process behavior. We construct a directed graph of inter-process interactions from the behavior logs and call it the process graph. The vertices signify processes and the edges represent invocation actions between vertices, e.g., A → B denotes that process A calls process B. The root process of all processes during execution is not unique, which may leave the process graph too scattered and make it difficult to perceive the connections between processes. Therefore, we add a super-vertex to connect all root processes. Figure 3 shows a snippet of the process graph of a malicious sample, where the vertex name denotes the process ID and vertex 1 is the super-vertex.
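A minimal sketch of the construction with NetworkX, assuming the behavior log has been parsed into (caller_pid, callee_pid) invoke records; the super-vertex ID of 1 follows Figure 3.

```python
import networkx as nx

SUPER_VERTEX = 1  # artificial root connecting all root processes (as in Figure 3)

def build_process_graph(invoke_records):
    """Build a directed process graph from (caller_pid, callee_pid) records and
    connect every root process (one that never appears as a callee) to the
    super-vertex."""
    g = nx.DiGraph()
    g.add_edges_from(invoke_records)
    callees = {callee for _, callee in invoke_records}
    roots = [pid for pid in g.nodes if pid not in callees]
    for root in roots:
        g.add_edge(SUPER_VERTEX, root)
    return g
```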

4.2.2. Process Graph Attribute
NetworkX [27] is a Python package for the exploration and analysis of networks and network algorithms. The core package provides data structures for representing many types of graphs, including simple graphs, directed graphs, and graphs with parallel edges and self-loops. Inspired by NetworkX, we represent process graphs by seven attributes of directed graphs: the number of vertices, the number of edges, the size of a dominating set, the average clustering, the transitivity, the sum of closeness centrality, and the sum of degree centrality. They capture the core context of the process graph and represent it with a small number of key attributes, which reduces the consumption of computing resources without discarding the inter-process invocation actions. The seven attributes are described as follows.
(1) Number of vertices. The sum of the number of all vertices in the process graph. The number of vertices can reflect the size of the graph. In Figure 3, there are six vertices, indicating that there are six processes. Notice that vertex 1 is the super-vertex.
(2) Number of edges. The sum of all edges of the process graph. The edges denote the invoke actions between various vertices. Moreover, they reflect the complexity of calling relationship between different processes. In Figure 3, there are six edges, indicating that there are six calling relationships.
(3) Number of dominating set. A dominating set for a graph $G$ with node set $V$ is a subset $D \subseteq V$ such that every node not in $D$ is adjacent to at least one member of $D$ [28]. We adopt the number of nodes in $D$ as one of the attributes of the process graph. E.g., in Figure 3, the dominating set is {1, 472} and its size is 2.
(4) Average clustering. The process graph shows a tendency for link formation between neighboring vertices. This tendency is called clustering, and it reflects the clustering of edges into tightly connected neighborhoods [29]. We use the average of the clustering coefficients of all vertices. The clustering coefficient describes the degree of clustering around a vertex and, for a directed graph, is defined as follows:

$$c_u = \frac{T(u)}{2\left[deg^{tot}(u)\left(deg^{tot}(u)-1\right) - 2\,deg^{\leftrightarrow}(u)\right]},$$

where $c_u$ is the clustering coefficient of vertex $u$, $T(u)$ denotes the number of directed triangles through $u$ (i.e., the number of triangles formed by two edges with a shared vertex), $deg^{tot}(u)$ is the sum of the in-degree and out-degree of vertex $u$, and $deg^{\leftrightarrow}(u)$ denotes the number of vertices that are bidirectionally connected to vertex $u$.
The average clustering is the average of the clustering coefficients of all vertices in the process graph $G$:

$$C(G) = \frac{1}{n}\sum_{u \in G} c_u,$$

where $C(G)$ is the average clustering of process graph $G$ and $n$ denotes the number of vertices. To illustrate with the example of Figure 3, the clustering coefficient of vertex 369 is 0.167 and the average clustering is 0.125.
(5) Transitivity. Transitivity is the fraction of all possible triangles present in the process graph. It clearly captures the triangular relationships between vertices, which is significant for representing process-graph features. The transitivity can be written as:

$$T(G) = \frac{\#\text{triangles}}{\#\text{possible triangles}},$$

where $T(G)$ is the transitivity of process graph $G$ and the number of possible triangles is counted from pairs of edges with a shared vertex. E.g., in Figure 3, there is a triangle only at vertex 369, so the total number of triangles is 1, and the number of possible triangles in the process graph is 8. Therefore, the transitivity of the process graph is 0.125.
(6) The sum of closeness centrality. The closeness centrality [30, 31] of a vertex $u$ is the reciprocal of the average shortest-path distance to $u$ over all reachable vertices:

$$C(u) = \frac{r_u - 1}{\sum_{v \in R(u)} d(v, u)} \cdot \frac{r_u - 1}{n - 1},$$

where $d(v, u)$ is the shortest-path distance between $v$ and $u$, $n$ denotes the number of vertices in the graph, and $r_u = |R(u)|$ is the number of vertices that can reach $u$. For process graphs, the distance function uses the incoming distance to $u$. We accumulate the closeness centrality of all vertices as one of the attributes of the process graph; note that higher values of closeness indicate higher centrality. E.g., in Figure 3, the closeness centrality of vertex 369 is 0.2 and the sum of closeness centrality in this process graph is 1.517.
(7) The sum of degree centrality. The degree centrality of a vertex is the fraction of vertices it is connected to, and quantifies the importance of vertices in the process graph. It is defined as follows:

$$C_D(u) = \frac{deg(u)}{n - 1},$$

where $C_D(u)$ is the degree centrality of vertex $u$, $deg(u)$ denotes the number of edges incident to $u$ (including in-degree and out-degree), and $n$ denotes the number of vertices in process graph $G$. We accumulate the degree centrality of all vertices as one of the attributes of the process graph. E.g., in Figure 3, the degree centrality of vertex 369 is 0.6 and the sum of degree centrality in this process graph is 2.4.
4.2.3. Process Graph Attribute Extraction
As mentioned above, the process graph is constructed in Section 4.2.1 and its seven attributes are calculated in Section 4.2.2. Consequently, we can obtain an attribute vector from the behavior logs of each sample; its dimension is seven, one entry per attribute. E.g., we get the attribute vector [6, 6, 2, 0.125, 0.125, 1.517, 2.4] for the sample in Figure 3. However, we observe that the numbers of vertices and edges are much larger than the other attributes in the seven-dimensional vector, which would lead to inconsistent attribute weights during deep learning and may affect the performance of the malware detection model. Min-max scaling normalization preserves the relationships in the original attribute data and eliminates the effects of dimensions and value ranges. Therefore, we employ min-max scaling to normalize the values of each dimension to the range [0, 1].
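A sketch of the attribute extraction and normalization with NetworkX and NumPy follows; computing transitivity on the undirected view of the graph is an assumption, and min-max scaling is applied per attribute across the whole dataset rather than per sample.

```python
import networkx as nx
import numpy as np

def process_graph_attributes(g):
    """Extract the seven process-graph attributes of Section 4.2.2 from a DiGraph."""
    closeness = nx.closeness_centrality(g)        # uses incoming distances on DiGraphs
    degree = nx.degree_centrality(g)              # (in-degree + out-degree) / (n - 1)
    return np.array([
        g.number_of_nodes(),                      # (1) number of vertices
        g.number_of_edges(),                      # (2) number of edges
        len(nx.dominating_set(g)),                # (3) size of a dominating set
        nx.average_clustering(g),                 # (4) average clustering
        nx.transitivity(g.to_undirected()),       # (5) transitivity (undirected view, assumed)
        sum(closeness.values()),                  # (6) sum of closeness centrality
        sum(degree.values()),                     # (7) sum of degree centrality
    ])

def min_max_scale(attribute_matrix):
    """Scale each attribute column over the whole dataset to the range [0, 1]."""
    mins, maxs = attribute_matrix.min(axis=0), attribute_matrix.max(axis=0)
    return (attribute_matrix - mins) / np.maximum(maxs - mins, 1e-12)
```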
5. Malware Detection Model
In this section, we first present the process-aware behavior representation, which effectively captures the inner-process and inter-process behavior features. Then, we train a binary classifier with a DNN to detect malware.
5.1. Deep Process-Aware Behavior Learning
Text-CNN is a variant of CNN [32] used for text classification tasks and has proven effective in malware detection [3]. It uses multiple kernels of different sizes to extract key information from sentences (similar to n-grams with multiple window sizes), which better captures local correlations. We further improve Text-CNN to build the malware detection model, which extracts the weighted-scored API feature and the process-graph feature from the weighted-API sequence and the attributes of the process graph, respectively. The details of our DNN model are illustrated in Figure 4.

We input the weighted-API sequences and process graph attributes into the deep neural network model. Firstly, we apply the Text-CNN network to obtain the weighted-scored API feature. It concatenates the weighted-API embedding and the API score embedding into a combined embedding. The Text-CNN network employs multiple filters of different sizes that slide over the combined embedding to extract a fixed number of features. The convolution layer converts the original two-dimensional matrix into column vectors, and the results are fed into the max pooling layer, which extracts the most significant n-gram features of the entire sequence in a translation-invariant manner [33]. The last layer of the Text-CNN combines all the pooled features and outputs the weighted-scored API feature. Secondly, we construct an attribute embedding from the attributes of the process graph and feed it into a fully connected layer, which outputs the process-graph feature of the given size. Thirdly, we concatenate the weighted-scored API feature and the process-graph feature and feed the result into a combination of dense and dropout layers. Finally, our DNN model outputs a classification decision as a score, where 0 is benign and 1 is malicious.
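A condensed PyTorch sketch of this architecture is given below, using the default sizes from Section 6.1.3. The weighted-API embedding dimension, the linear projections that produce the 64-dimensional weighted-scored API feature and embed the scalar API score, and the final classifier width are assumptions where the paper does not specify them.

```python
import torch
import torch.nn as nn

class ProcessAwareTextCNN(nn.Module):
    """Text-CNN over the combined (weighted-API + API-score) embedding, fused
    with a process-graph feature produced by a fully connected layer."""

    def __init__(self, vocab_size, api_dim=56, score_dim=8, num_filters=128,
                 filter_sizes=(3, 4, 5, 6), graph_attr_dim=7, feat_dim=64):
        super().__init__()
        self.api_emb = nn.Embedding(vocab_size, api_dim)      # weighted-API embedding
        self.score_emb = nn.Linear(1, score_dim)              # API-score embedding (size 8)
        self.convs = nn.ModuleList(
            [nn.Conv1d(api_dim + score_dim, num_filters, k) for k in filter_sizes])
        self.api_proj = nn.Linear(num_filters * len(filter_sizes), feat_dim)
        self.graph_fc = nn.Linear(graph_attr_dim, feat_dim)   # process-graph feature (size 64)
        self.dropout = nn.Dropout(0.5)
        self.classifier = nn.Linear(2 * feat_dim, 1)

    def forward(self, api_ids, api_scores, graph_attrs):
        # api_ids: (batch, 1000) long, api_scores: (batch, 1000) float, graph_attrs: (batch, 7) float
        x = torch.cat([self.api_emb(api_ids),
                       self.score_emb(api_scores.unsqueeze(-1))], dim=-1)
        x = x.transpose(1, 2)                                 # (batch, channels, seq_len) for Conv1d
        pooled = [torch.relu(conv(x)).max(dim=2).values for conv in self.convs]
        api_feat = torch.relu(self.api_proj(torch.cat(pooled, dim=1)))
        graph_feat = torch.relu(self.graph_fc(graph_attrs))
        fused = self.dropout(torch.cat([api_feat, graph_feat], dim=1))
        return torch.sigmoid(self.classifier(fused)).squeeze(-1)  # 0 = benign, 1 = malicious
```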
5.2. DNN Based Malware Detection
Given a test program, it is first launched in a sandbox, which records its dynamic behavior, including API calls with arguments and return values, process invocation actions, etc. Then, the DNN analyzes the process-aware behavior and determines whether it exhibits malicious behavior. Our method first extracts the APIs along with their arguments to form a new concise sequence. The arguments are weighted with the logistic regression-based argument weighting method, and the weighted APIs are generated according to the weights of their arguments. Secondly, API scores are obtained from the pre-trained ML-based model, and we construct the API score embedding accordingly. Thirdly, process graphs are generated from the invocation actions of the processes, and we extract the process graph attributes from them. Fourthly, we feed the weighted-API sequences, API scores, and the attributes of the process graphs to the trained malware detection model. Finally, the model outputs the predicted label (i.e., malicious or benign) and the corresponding probability.
6. Evaluation
6.1. DataSet and Settings
6.1.1. DataSet
We use two datasets to evaluate our proposed method. DataSet1 contains a total of 20,000 execution logs of Windows x86 PE files, of which 10,000 are benign and the other 10,000 are malicious. DataSet2 records logs of 20,000 Windows x86 PE files, in which 10,000 are benign and 10,000 are malicious. The logs are generated with a self-developed sandbox and are available on GitHub [34]. Note that the two datasets differ slightly in the number of traced APIs, i.e., 94 APIs in DataSet1 and 98 APIs in DataSet2.
6.1.2. Metrics and Setting
We measure the performance of our method in terms of Accuracy, Precision, Recall, and F1-score. The definitions are given in equations (13)–(16) below, where TP represents the number of samples that are correctly predicted as positive (or malicious), TN denotes the number of samples that are correctly classified as negative (or benign), FN denotes the number of samples that are positive but are incorrectly predicted as negative (or benign), and FP indicates the number of samples that are negative but are classified as positive (or malicious).
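For completeness, the standard definitions referenced as equations (13)–(16) are:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \text{Precision} = \frac{TP}{TP + FP},$$

$$\text{Recall} = \frac{TP}{TP + FN}, \qquad \text{F1-score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}.$$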
We train the malware detection model on DataSet1 as the training set and verify its effectiveness on DataSet2 as the testing set. During training, we perform 5-fold cross-validation, i.e., we randomly divide DataSet1 into five even subsets, use four of them for training and the remaining one for validation. We thus repeat training five times and report the average result. We refer to the result on the validation subset as the Validation Result and the result on DataSet2 as the Test Result.
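A sketch of this evaluation protocol is shown below; train_fn and eval_fn are placeholders for training and scoring our model and are not part of the paper's implementation.

```python
import numpy as np
from sklearn.model_selection import KFold

def cross_validate(train_fn, eval_fn, data1, labels1, data2, labels2, seed=0):
    """5-fold cross-validation on DataSet1; each fold's model is also evaluated
    on the held-out DataSet2 to obtain the Test Result."""
    val_results, test_results = [], []
    kfold = KFold(n_splits=5, shuffle=True, random_state=seed)
    for train_idx, val_idx in kfold.split(data1):
        model = train_fn(data1[train_idx], labels1[train_idx])
        val_results.append(eval_fn(model, data1[val_idx], labels1[val_idx]))
        test_results.append(eval_fn(model, data2, labels2))
    return np.mean(val_results), np.mean(test_results)
```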
6.1.3. DNN Model Setting
In our DNN model, the length of API sequences is limited to 1,000, the filter sizes are 3, 4, 5, and 6, the number of filters is 128, the dropout rate is 0.5, the size of the API score embedding is 8, and the batch size, the size of the weighted-scored API feature, and the size of the process-graph feature are 64 by default.
6.2. Performance Comparison
6.2.1. Comparison with State-of-the-Art
To investigate the performance improvement, we conduct comprehensive experiments comparing our method with a classic decision tree algorithm on raw API sequences, the model proposed by Zhang et al. [6] on raw API sequences with run-time arguments, and our previously proposed model MalPro [35] on process-aware behaviors. All experiments are conducted on our datasets. We use 5-fold cross-validation over DataSet1 to train the models and then perform testing over DataSet2. In addition, to measure efficiency, we also record the processing time per sample, which includes the time for processing the inner-process and inter-process behavior features, and the detection time per sample, which denotes the time cost of prediction.
As illustrated in Table 3, our proposed model outperforms the three existing malware detection models on the Test Results. Compared with the traditional machine learning method (decision tree) and the deep learning method of Zhang et al. [6], the process-aware behavior learning methods (MalPro and our proposed method) deliver superior malware detection performance. For example, the test accuracy of MalPro and our proposed method reaches 96.78% and 97.63% respectively, while the decision tree and the method of Zhang et al. [6] only reach 76.32% and 91.76%. These results also show that process-aware behavior significantly improves performance over models that exclude arguments and invocation actions.
Compared with our previously proposed model MalPro [35], the model proposed in this paper achieves improvements of 0.85%, 0.82%, 0.85%, and 0.84% in Accuracy, Precision, Recall, and F1-score on the test results, respectively, showing a strong generalization ability in malware detection. The results also indicate that our efforts on the API scores further assist the DNN in accurate malware detection.
As for time consumption, the models that use run-time arguments take slightly longer. Specifically, the processing time of our model is longer than that of the methods using only API calls, but shorter than that of the method combining API calls and full arguments. This is mainly because models that use the original run-time arguments require more time for feature processing, consequently leading to significant time overhead. In addition, the detection time of our model is comparable to the other models, i.e., 2.41 ms compared to 0.34 ms, 17.05 ms, and 2.38 ms. Therefore, our proposed method achieves a good trade-off between time overhead and performance for malware detection.
6.2.2. Gain of Proposed Methods
To measure the gains from weighted APIs, API scores, and process graph attributes, we evaluate the performance of our model with different inputs, as follows: (i) APIr represents raw APIs, directly extracted from execution traces; (ii) APIw represents weighted APIs, generated by the logistic regression-based argument weighting method; (iii) APIrs represents raw APIs with API scores, pre-trained by ML-based API score learning; (iv) APIr_Attr represents raw APIs and process graph attributes, extracted from process graphs; (v) APIw_Attr represents weighted APIs and process graph attributes; (vi) APIws_Attr represents weighted APIs with API scores and process graph attributes.
Table 4 reports the Validation Results and Test Results of the DNN model in terms of Accuracy, Precision, Recall, and F1-score. Note that if only raw APIs or weighted APIs are fed as input, the DNN model is the classic Text-CNN model without the fully connected layer for the process-graph attribute feature.
On DataSet1, we find that APIw, APIrs, APIr_Attr, and APIw_Attr slightly outperform APIr, and APIws_Attr improves a little further. E.g., compared to the accuracy of APIr, APIw improves by 0.07%, APIrs by 0.16%, APIr_Attr by 0.12%, and APIw_Attr by 0.2%, while APIws_Attr improves by a further 0.25%, 0.16%, 0.2%, and 0.12% over them respectively, achieving a total improvement of 0.32%.
On the testing set DataSet2, as expected, the proposed methods greatly improve all four metrics. Specifically, APIw and APIrs improve the F1-score of APIr from 91.50% to 94.71% and 94.79%, i.e., improvements of 3.21% and 3.29%. This result indicates that the API weighting and the API scores characterize the inner-process security information of APIs at a finer granularity, enabling the DNN to learn from the augmented data. APIr_Attr achieves a 1.87% improvement over APIr in F1-score, indicating that the process graph attributes effectively exploit inter-process invocation actions. Moreover, APIws_Attr improves the F1-score by a further 2.93%, 2.85%, and 4.27% over APIw, APIrs, and APIr_Attr respectively, reaching 97.64%. This is attributed to concatenating the inner-process and inter-process awareness features together.
6.2.3. ROC Curve
Figure 5 plots the Receiver Operating Characteristic (ROC) [36] curves of the DNN model with different inputs on the testing dataset DataSet2, where the X axis is the false positive rate (FPR) and the Y axis is the true positive rate (TPR). Models whose curves lie closer to the top-left corner perform better. As we can see, our model achieves a high TPR even at a low FPR. For example, when the FPR is 0.05, i.e., only 5% of benign samples are incorrectly predicted as malicious, the TPR of APIws_Attr is as high as 0.9754, implying that 97.54% of all malicious samples are correctly detected. Note that this behavior is critical in malware detection, i.e., detecting as much malware as possible with few false positives. In contrast, APIr only reaches the same TPR when the FPR is as high as 0.7597, meaning a large number of false positives.

It can also be observed that our methods achieve the largest area under the curve (AUC) values, e.g., APIw, APIrs, APIr_Attr, and APIws_Attr reach AUC values above 0.98, while APIr achieves about 0.96, suggesting that our model provides better predictive accuracy.
To summarize, the proposed methods, i.e., inner-process behavior awareness, inter-process behavior awareness, and the DNN model, are effective in enhancing malware detection. Meanwhile, they exhibit good robustness on the test dataset, demonstrating their effectiveness in practical usage.
6.3. Effects of Various Settings
The performance of our method is mainly affected by five factors: the ML method used for the API score, the size of the API score embedding, the size of the weighted-scored API feature, the size of the process graph feature, and the batch size. In this section, we examine the impact of these factors by configuring our method with different settings.
6.3.1. Varying ML Methods of API Score
Different machine learning algorithms produce different API importance scores, which leads to different performance of the DNN model when learning with API scores. We adopt three machine learning algorithms, random forest (RF) [20], support vector machine (SVM) [37], and multi-layer perceptron (MLP) [38], to obtain API importance scores. Figure 6 shows the performance of our DNN model on the testing set DataSet2 when using each of these three algorithms to pre-train the API scores. The results illustrate that pre-training the API scores with random forest achieves the best results across the evaluation metrics compared with the other two machine learning algorithms.

6.3.2. Varying Sizes of API Score Embedding
The size of the API score embedding determines its contribution to the combined embedding for learning inner-process behavior. Generally, a larger size implies that the API score embedding attracts more attention during training. We increase the size exponentially from 1 to 16 and measure the performance of our model. Figure 7 reports the ROC curves with varying sizes of the API score embedding. As can be seen, the DNN model performs best when the size is 8. More specifically, the AUC value increases as the size grows from 1 to 8, and then decreases as the size continues to increase. A size of 8 strikes a balance between the weighted-API embedding and the API score embedding in the combined embedding, which enables the DNN model to perform best.

6.3.3. Varying Sizes of Weighted-Scored API Feature
The size of the weighted-scored API feature determines the contribution of inner-process behavior awareness in the feature combination. Generally, a larger size implies that the weighted-scored API feature attracts more attention during training. We increase the size exponentially from 32 to 256 and measure the performance of our model. Figure 8 reports the ROC curves with varying sizes of the weighted-scored API feature. As can be seen, the DNN model performs best when the size is 64. More specifically, the AUC value increases as the size grows from 32 to 64, and then decreases as the size continues to increase. The reason is as follows: a larger size allows more inner-process behavior information to be described in the feature combination and thus lets the DNN pay more attention to these characteristics; however, once the size is too large, the modeled correspondence between features degrades and hurts performance.

6.3.4. Varying Sizes of Process Graph Feature
The size of the process graph feature determines the weight of inter-process behavior awareness in the feature combination. Similarly, a larger size implies that the process graph attributes attract more attention during training. We also increase the size exponentially from 32 to 256 and measure the performance of our model. Figure 9 reports the ROC curves with varying sizes of the process graph feature and shows that the model performs best when the size is 64.

6.3.5. Varying Sizes of Batch
We vary the batch size in our model. From Figure 10, we observe that when the batch size is 64, the ROC curve has the largest AUC value. There is a clear improvement when increasing the size from 32 to 64; however, continuing to increase the batch size does not help much.

6.4. Robustness to Adversarial Attack
Deep learning based malware detection models are vulnerable to adversarial attacks, in which an adversary fools the target model's predictions by modifying its inputs. To further evaluate whether our proposed method is robust to adversarial attacks, we apply the same adversarial malware generation method as CruParamer [3] for API call sequences against the Text-CNN model, which leverages CLARE [39] from the field of natural language processing. We perturb the sequence by inserting no-op APIs, since replacing or removing APIs from the API sequence may break the functionality of the original program [40].
From the adversary's perspective, we first use a part of the samples to train the substitute Text-CNN model, while the victim models are trained on the training dataset. Then, we attack the trained substitute model on all malicious samples of the testing dataset to generate the corresponding adversarial API sequences. Finally, we successfully generate 6,346 adversarial malware samples.
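A heavily simplified sketch of the no-op insertion perturbation is shown below; the actual attack selects insertion positions and candidate APIs with CLARE guidance against the substitute model, and the no-op API list here is a hypothetical example.

```python
import random

# Hypothetical no-op APIs assumed not to change program behavior when inserted.
NO_OP_APIS = ["GetCurrentProcessId", "GetTickCount", "NtQuerySystemTime"]

def insert_no_ops(api_sequence, num_insertions=10, seed=0):
    """Perturb an API call sequence by inserting no-op APIs at random positions,
    preserving the functionality of the original program."""
    rng = random.Random(seed)
    perturbed = list(api_sequence)
    for _ in range(num_insertions):
        pos = rng.randint(0, len(perturbed))
        perturbed.insert(pos, rng.choice(NO_OP_APIS))
    return perturbed
```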
As shown in Table 5, on the 6,346 original and adversarial malware samples, the victim Text-CNN model suffers a 54.3% decrease in detection rate under the adversarial attack, which drops from 98.63% to 43.33%. In contrast, our proposed method is effective in defending against the adversarial samples, i.e., its detection rate decreases by only 6.3%. These results show that our proposed method is robust against adversarial attacks, mainly because of the multi-dimensional features applied in this study. In other words, the features combine both the inner-process and the inter-process behavior features before being fed into the DNN model for learning. Therefore, to attack our detection model more effectively, the perturbed features would have to include not only the API sequence but also the process relationships. However, state-of-the-art adversarial attack methods for malware detection models are limited to perturbations of the API call sequences.
6.5. Robustness to Concept Drift
The concept drift problem [41] occurs because the distribution of data evolves over time. More specifically, as new malware or variants with new characteristics emerge, detection models trained on a priori malware knowledge may fail to detect the emerging malware, resulting in model aging, also known as concept drift. We intend to measure how the proposed model ages and its robustness to concept drift. The malware samples of DataSet1 and DataSet2 all fall within one year (i.e., 2019), which cannot expose the concept drift problem. Therefore, we prepare a new dataset whose samples span from 2011 to 2021. The new dataset consists of 10,947 malware samples from VirusSign [42] and 3,735 benign samples from NSRL [43]. We apply the Cuckoo sandbox to trace the API call sequence of each sample for two minutes. We use the samples from 2011 to 2018 as the training dataset (6,148 malicious and 3,131 benign), train a binary classifier, and measure its performance on the testing samples of 2019, 2020, and 2021 respectively. The numbers of testing samples are 432 (247 malicious and 185 benign), 840 (575 malicious and 265 benign), and 4,131 (3,977 malicious and 154 benign), respectively.
Table 6 illustrates the performance of the different models in terms of accuracy. It can be seen that the compared models experience a performance drop over time. For example, the accuracy of the Decision Tree decreases from 81.48% in 2019 to 69.55% in 2021, i.e., an 11.93% drop. Although our model also ages gradually, its aging rate is much lower and its performance remains the best compared to the two state-of-the-art methods, i.e., its accuracy is 95.60%, 93.32%, and 92.57% respectively. This is mainly owing to the process-aware behavior awareness. More specifically, although new variants of malware show different syntax from previous ones, they are behaviorally similar; in other words, the behaviors inside a process and across processes are less likely to evolve substantially, and our proposed model can perceive these inner-process and inter-process behaviors.
7. Limitations and Discussions
Although our proposed method performs well in malware detection, there still exist several limitations.
7.1. Model Deterioration Problem
As malware constantly evolves over time, its malicious features diverge from existing data. As a result, the trained detection model may fail to make the right prediction. Section 6.5 shows that our method achieves the best performance on testing samples from different periods among all the compared models, demonstrating that it mitigates the model aging problem to some extent. Despite this slowdown in aging, however, we believe that trained models will struggle to maintain high accuracy, especially for new malware after a long period of time.
Model deterioration is a challenging problem. Many methods [44–48] have been proposed recently to mitigate the aging problem of Android malware detection models, e.g., online learning based algorithms [44], sensitive access distribution [45], and time stamping based methods [46]. Most of these approaches target static feature changes of malware and thus cannot provide generalization for dynamic feature based detection models. One possible approach for dynamic feature based models is to augment the detection model with lifelong learning, so that it can learn features from the new process-aware behavior of new malware and variants. We will explore resisting the aging problem of dynamic feature based malware detection models in future work.
7.2. Inter-Process Behavior Awareness
The proposed method only uses the attribute information of the process graph for inter-process behavior awareness and does not fully exploit the rich behavioral information contained in inter-process behavior. In future work, we may adopt more efficient methods or other deep neural networks (such as graph neural networks) to learn inter-process behaviors and further improve the performance of malware detection and classification.
8. Conclusions
This paper presents a DNN based malware detection approach that performs learning on process-aware behaviors. It proposes inner-process behavior awareness to assess the sensitivity of an API to malicious behavior, and inter-process behavior awareness to obtain the attributes of the process graph. By feeding the weighted API sequences, API scores, and the attributes of the process graph into the DNN, the trained model outperforms naïve DNN models taking raw APIs as input. Additionally, it demonstrates good robustness and generalization when identifying new malware and adversarial samples, indicating that it may be useful in real-world applications.
Data Availability
The benign and malicious execution logs in this paper can be obtained free of charge from https://github.com/kericwy1337.
Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.
Acknowledgments
This work is supported by the National Natural Science Foundation of China under Grant 62172006, Grant 62072453 and Grant 61972392.