Abstract

The explosive spread of Android malware raises serious concerns about Android application security. One solution to detecting malicious payloads hidden in an application is to treat detection as a binary classification problem, which can be effectively tackled with traditional machine learning techniques. The key factors in detecting Android malware with machine learning techniques are feature selection and generation. Most existing approaches select and generate features without fully examining the structures of programs; the important semantic information associated with these features is therefore lost, resulting in low detection accuracy. To address this issue, we propose a new feature generation approach for Android applications, which takes components and program structures into consideration and extracts features in a graph-based, semantics-rich style. The approach has two major distinguishing aspects: context-based feature selection and graph-based feature generation. We abstract an Android application as a collection of reduced iCFGs (interprocedural control flow graphs) and extract original features from these graphs. Combining the original features with their contexts, we generate new features that hold richer semantic information than the original ones. By embedding the features into a feature vector space, we can use machine learning techniques to train a malware detector. The experimental results show that this approach achieves an accuracy of 95.4% and a recall of 96.5%, which demonstrates its effectiveness and advantages.

1. Introduction

Android system, as one of the most popular mobile platforms, faces various serious security challenges due to its open-source characteristics, imperfect permission mechanisms, and the absence of full certification of applications at their publications. Malware or malicious payload exploits the vulnerabilities in Android system to implement a variety of attacks, such as privilege escalation, remote control, illegal financial charges, and personal information stealing, resulting in privacy leakage and even serious financial losses. Therefore, there is an imperative need to detect malicious payload and analyze its potential impact when installing and using an Android application.

Machine learning techniques have been widely used in malware detection [18]. This kind of approach treats malware detection as a binary classification problem (i.e., classifying an application as malicious or benign), which can be tackled with traditional techniques from the pattern recognition or machine learning disciplines. Since this kind of approach does not need to fully investigate the semantics and all details of programs, it achieves better performance than traditional dynamic approaches [1, 2, 9] and static approaches [4–6, 10–15] in terms of scalability and time consumption.

The key factors affecting the performance of machine learning-based approaches are feature selection and generation. APIs and permissions [1, 3] are commonly selected as features because they hold rich security-related information about which critical resources can be accessed by which operations. However, in most existing works, these features are extracted at whole-application granularity and their associated contexts are neglected, resulting in a high false-positive rate in detection.

To tackle this problem, we propose a static detection method for Android malware, which improves the existing works by additionally considering the contexts of features. We identify three kinds of program characteristics as raw features: security-sensitive broadcast events, security-sensitive permissions, and bigrams of API calls, each kind of which associates with a kind of context. We combine these raw features with their contexts to generate new features, which are subsequently used to train and test a classifier model by embedding them into a feature vector space.

The main challenge of our approach is to retrieve the features and their contexts from original program codes. To achieve a better performance in terms of time consumption, we adopt a graph-based technique to the feature generation. We define the structure of a program as a set of iCFGs and offer a graph reduction transformation for the iCFGs to simplify their structures and improve the performance of feature generation. Based on the reduced iCFGs, we interpret the set of their edges as the set of bigrams of API calls involved in callbacks, which allows us to design an efficient graph algorithm to extract the features.

In summary, the main contributions of this paper are as follows:

(1) We propose a context-based feature selection approach, which combines the three kinds of raw features with their contexts to serve as newly generated features. Since these features hold rich semantic information about program behaviors, we achieve better results than traditional machine learning-based approaches.

(2) We propose a graph-based feature generation approach. Since we are only concerned with the APIs being called, we can safely remove irrelevant graph nodes and edges and reduce the complex structure of an iCFG to a simplified version. In addition, we establish a direct mapping from the edges of a reduced iCFG to the bigrams of API calls, which leads to an efficient algorithm for feature generation.

(3) With the newly generated features, we trained a classifier model using several state-of-the-art machine learning algorithms on 4972 samples in total, 3732 for training and 1240 for testing. The comparison experiments show that the random forest algorithm has the best performance on the selected feature set, with an accuracy of 95.4% and a recall of 96.5%, which demonstrates the effectiveness of our approach.

The remainder of this paper is organized as follows. Section 2 describes the features selected for Android malware detection. The process of generating these features from an Android application is presented in Section 3. Section 4 focuses on transforming the features into values and embedding them into a feature vector space. We implement our approach and report the experimental results in Section 5. Finally, we review related work in Section 6 and conclude in Section 7.

2. Feature Selection

Since API calls, permissions, and system broadcast events usually hold rich security-related information, they are commonly selected as features in traditional malware detection approaches. In this paper, we refer to these features as original features or raw features and combine them with their contexts to form newly generated features that achieve better detection results. For convenience, we simply refer to these new features as features when no confusion is possible. In what follows, we first introduce the original features we selected and then associate different contexts with them to form the features.

2.1. Raw Features

The approach makes use of three categories of information as raw features:

(i) Bigrams of security-sensitive API calls
(ii) Sensitive permissions
(iii) Sensitive broadcast events

2.1.1. Bigrams of API Calls

The N-gram model [3] is a language model widely used in natural language processing (NLP) tasks such as large-vocabulary continuous speech recognition, machine translation, and text classification. The N-gram model is based on the assumption that the appearance of a word is only related to the previous N − 1 words. Moving a sliding window of fixed length N from the beginning to the end of a text generates a number of N-grams, each containing N sequential words. For example, the division of “I am a citizen of the People’s Republic of China” forms a set of bigrams: {I am, am a, a citizen, citizen of, of the, the People’s, People’s Republic, Republic of, of China}.

If we consider the generalized sensitive APIs defined in RepassDroid [11] as the vocabulary, then an N-gram of words is just a sequence of N consecutive API calls. Since a large value of N sharply increases the computational burden, N is set to 2 or 3 in most cases. Here we adopt N = 2, i.e., bigrams of API calls, as one kind of feature.
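As a concrete illustration of the sliding-window construction, the sketch below extracts bigrams from a linearized sequence of API calls; the call names are illustrative, not drawn from the paper's selected feature list.

```python
# Sliding-window N-gram extraction over a token sequence.
def ngrams(tokens, n=2):
    """Return all n-grams as tuples of n consecutive tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# A linearized sequence of API calls observed in one callback (illustrative).
calls = ["getDeviceId", "getSubscriberId", "openConnection", "connect"]
bigrams = ngrams(calls, 2)
# bigrams == [("getDeviceId", "getSubscriberId"),
#             ("getSubscriberId", "openConnection"),
#             ("openConnection", "connect")]
```

With N fixed at 2, each window of two consecutive calls becomes one candidate raw feature.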

We analyze 1,000 typical malicious applications and then extract all bigrams of API calls from them and calculate the frequency of each bigram. In order to ensure the accuracy and efficiency of the detection, we finally choose 300 bigrams of API calls that appear most frequently in malicious applications as the raw features. A part of these bigrams of API calls is shown in Table 1.

2.1.2. Sensitive Permissions

To select a proper number of permissions as raw features, we borrow the statistical indicator TF-IDF [16] (term frequency-inverse document frequency) from the field of text mining to measure the importance of a permission with respect to a program.

We employ a TF-IDF-like approach [17] to select 20 sensitive permissions that can effectively distinguish malicious applications from benign ones. First, we divide the corpus into two categories, a malicious corpus (k = 1) and a benign corpus (k = 2), and then define three metrics for each permission s in category k (k = 1, 2) to characterize its usage in the different categories:

number(s, k): the number of samples that use permission s in category k.
per(s, k): the percentage of samples that use permission s in category k, i.e., per(s, k) = number(s, k)/totalNumber(k), where totalNumber(k) denotes the number of samples in category k.
tfidf(s, k): the TF-IDF value of permission s in category k.

In addition, we use allNumber to denote the number of all collected samples and totalNumber(s) to denote the number of samples that use permission s in both categories, i.e.,

totalNumber(s) = number(s, 1) + number(s, 2).

According to the above definitions, the value of tfidf(s, k) should be positively related to per(s, k) and negatively related to totalNumber(s). Accordingly, tfidf(s, k) of permission s in category k is defined as follows:

tfidf(s, k) = per(s, k) × log(allNumber/totalNumber(s)).

We collect 150 malicious samples and 150 benign samples to calculate tfidf(s, k). Finally, we select the top 20 permissions with the highest tfidf values as the raw permission features, as shown in Table 2.
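A minimal sketch of this TF-IDF-like scoring, assuming the standard form in which the score grows with per(s, k) and shrinks logarithmically with totalNumber(s); the sample counts are toy values, not the paper's data.

```python
import math

def tfidf_like(number_s_k, total_number_k, total_number_s, all_number):
    """TF-IDF-like importance of permission s in category k.

    number_s_k:     samples in category k that use permission s
    total_number_k: samples in category k
    total_number_s: samples in both categories that use permission s
    all_number:     all collected samples
    """
    per = number_s_k / total_number_k                   # term-frequency-like part
    return per * math.log(all_number / total_number_s)  # IDF-like part

# Toy counts: 90 of 150 malicious samples use a permission, 10 of 150 benign ones.
score_malicious = tfidf_like(90, 150, 100, 300)
score_benign = tfidf_like(10, 150, 100, 300)
```

Permissions would then be ranked by their malicious-category score and the top 20 kept as raw features.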

2.1.3. Sensitive System Broadcasts

Android malware usually relies on system broadcast events and their corresponding callbacks to trigger malicious payloads [18]. For example, BOOT_COMPLETED is a broadcast event sent by the Android system whenever it finishes its booting process; some malware listens for this event and triggers malicious payloads or kicks off background services after it occurs. The work in [18] collects the 25 events most associated with Android malware, and we take them as the raw features of broadcast events.

2.2. Contexts of Raw Features

Each kind of raw feature is associated with a kind of context. We prepare for our formal description by first defining the following sets:

BIGRAMS: the finite set of all bigrams of API calls
PERMS: the finite set of sensitive permissions
EVENTS: the finite set of sensitive broadcast events
APPS: the finite set of applications
COMPS: the finite set of components defined in applications
CALLBACKS: the finite set of callback functions defined in components

Definition 1. (contexts of the raw features). For a raw feature of a bigram of API calls, its context is defined as a function context_bigram:

context_bigram: BIGRAMS ⟶ 2^CALLBACKS

Here, 2^CALLBACKS denotes the powerset of CALLBACKS. Likewise, the contexts for permission and broadcast event features are defined as follows:

context_perm: PERMS ⟶ 2^COMPS
context_event: EVENTS ⟶ APPS

For a bigram feature b ∈ BIGRAMS, its context is the set of callback functions that invoke the API sequence of b; for a permission feature p ∈ PERMS, its context is the set of components using p as one of their actually used permissions; and for an event feature e ∈ EVENTS, its context is the application that listens to e and triggers some behavior whenever e is received.

Definition 2. (features). A feature of an Android application is defined as a pair (f, c), where f is a raw feature and c is the context of f.

Note that the context of a raw feature may comprise more than one element. A bigram of API calls may be invoked in multiple callback functions, and thus its context should include all of the callbacks calling it; a permission may be used in multiple components, and thus its context should comprise all of these components. For example, if the bigram getDeviceId (); getSubscriberId () is invoked by both OnReceive () and OnClick (), then its context is {OnReceive (), OnClick ()}, and the feature is written:
(getDeviceId (); getSubscriberId (), {OnReceive (), OnClick ()}).
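Definitions 1 and 2 translate directly into plain data structures; the sketch below uses illustrative names for the callbacks, component, and application.

```python
# Context functions of Definition 1, encoded as dictionaries (names illustrative).
context_bigram = {
    # a bigram of API calls -> the set of callbacks invoking it
    ("getDeviceId", "getSubscriberId"): {"OnReceive", "OnClick"},
}
context_perm = {
    # a permission -> the set of components actually using it
    "SEND_SMS": {"AdCustomService"},
}
context_event = {
    # a broadcast event -> the application listening to it
    "BOOT_COMPLETED": "com.example.app",
}

# A feature per Definition 2 is the pair (raw feature, context).
feature = (("getDeviceId", "getSubscriberId"),
           frozenset(context_bigram[("getDeviceId", "getSubscriberId")]))
```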

3. Feature Generation

In this section, we introduce how to generate features from an Android application. Generating features is far from simply parsing program texts; instead, it requires a comprehensive analysis of program structures. We first outline the process of feature generation and then focus on a graph model of program structures and a graph reduction transformation for iCFGs.

3.1. Outline of Feature Generation Process

The sensitive broadcast events are usually configured statically in an application’s manifest.xml file, and thus they can be captured directly from the manifest. However, extracting permissions is not straightforward, even though they are also configured in the manifest. The difficulty comes from two factors. First, the permissions declared in manifest.xml are requested at whole-application granularity rather than for individual components, so we cannot directly acquire the contexts of the permissions. Second, the declared permissions are not always identical to the permissions the application actually uses; it is common for an application to request more permissions than it truly needs. In short, simply capturing permissions from manifest.xml does not work well for generating permission features.

Our approach extracts both sensitive permissions and the bigrams of API calls in the same process via a comprehensive analysis of programs. The complete process of feature generation is illustrated in Figure 1.

3.2. Graph-Based Feature Generation

The critical step in the preceding feature generation process is constructing an iCFG for each callback function. In this section, we first give definitions of CGs, CFGs, and iCFGs, then propose a graph reduction transformation rule to simplify an iCFG’s structure, and finally give a theorem that states how to extract bigrams of API calls from the reduced iCFG.

3.2.1. CG, CFG, and iCFG

Definition 3. (call graph). A call graph of an application a is defined as a labeled multigraph

CG_a = (N, S, E, r)

where

(i) N is a finite set of nodes, each labeled with a class method name in the set {m, m1, m2, …, mk}. The leaf nodes of CG_a represent API calls.
(ii) S is a finite set of call-site labels, each representing a method-call statement.
(iii) E ⊆ N × S × N is the finite set of labeled directed edges. An edge (mi, sk, mj) denotes that class method mi calls class method mj at call site sk.
(iv) r ∈ N is the unique entry node of CG_a, representing the root class method m.

Note that since a class method may be called at different call sites, the call graph is actually a multigraph, with multiple directed edges from one node to another or, alternatively, multiple labels on some edges.
A call graph merely reveals a hierarchy of class method calls but cannot further specify in what sequence these methods are invoked. To describe the complete internal behavior of a class method, we need to introduce the control flow graph for a class method, which specifies the control flow transfers and the sequence of method calls (including API calls) within the class method.

Definition 4. (control flow graph). A CFG of a method m is defined as a directed simple graph

CFG_m = (N, E, s, e)

where

(i) N is a finite set of nodes. Each node represents a statement in the set {s1, s2, …, sn}. A statement can be an assignment, a conditional branch, a method call, or a call return (including API calls and return statements).
(ii) E ⊆ N × N is the finite set of edges. An edge (ni, nj) indicates a control flow transfer from node ni to node nj.
(iii) s ∈ N is the unique start node, representing the point where control flow enters the method m.
(iv) e ∈ N is the unique exit node, representing the point where control flow exits the method m.

As a class method may invoke other class methods in its body (and may call itself, forming recursion), we need a graph that describes the control flow transfers at the points of method calls and call returns; this kind of graph is called an interprocedural CFG, or iCFG for short. An iCFG is a supergraph that integrates a number of individual CFGs into a holistic one.
Given a callback function m0, let {m1, m2, …, mk} be the set of class methods called by m0, and CFGi = (Ni, Ei, si, ei) be the control flow graphs of mi, i ∈ {0, 1, 2, …, k}. Every method invocation in mi contributes two nodes: a call node and a return node. Let Calli ⊆ Ni, Returni ⊆ Ni be the sets of call nodes and return nodes of CFGi, respectively.
CFGi’s edges are divided into two disjoint subsets, Ei = Ei^0 ∪ Ei^1, where an edge (n1, n2) ∈ Ei^0 is an ordinary control flow edge, representing a direct transfer of control from one node to another, and an edge (n1, n2) ∈ Ei^1 connects a call node n1 to its corresponding return node n2.

Definition 5. (interprocedural CFG). An iCFG for a callback function m0 is a supergraph

iCFG = (N, E, s0, e0)

where

(i) N = ∪i∈{0,1,2,…,k} Ni is a finite set of nodes.
(ii) E = E0 ∪ E1 ∪ E2, in which E0 = ∪i∈{0,1,2,…,k} Ei^0 is the collection of all ordinary control-flow edges; E1 = ∪i∈{0,1,2,…,k} Ei^1 is the collection of all edges from call nodes to the corresponding return nodes; and E2 is the set of call edges and return edges. An edge (n1, n2) ∈ E2 is a call edge if n1 is a call node and n2 is the start node of the called method; an edge (n1, n2) ∈ E2 is a return edge if n1 is an exit node of some method m and n2 is a return node immediately following a call to m. A call edge (n1, si) and a return edge (ej, n2) correspond to each other if i = j and (n1, n2) ∈ E1.
(iii) s0 is the entry node of callback m0.
(iv) e0 is the exit node of m0.

For simplicity, we use Call, Return, Start, and Exit to denote the sets of all call nodes, return nodes, start nodes, and exit nodes in an iCFG, respectively, and collectively call them structural nodes:

Call = ∪i∈{0,1,2,…,k} Calli
Return = ∪i∈{0,1,2,…,k} Returni
Start = {s0, s1, …, sk}
Exit = {e0, e1, …, ek}
StructNodes = Call ∪ Return ∪ Start ∪ Exit

3.2.2. Reduction of iCFGs

To generate the features from iCFGs, we only need to concentrate on these nodes that represent the security-sensitive API calls; that is, we shall reduce an original iCFG to a simplified version by removing the nodes and edges without contribution to the concerned features. Before introducing the reduction transformation, we shall first review some helpful terminologies in the following.

Given a general-purpose simple graph G = (N, E), where N is the set of nodes and E is the set of edges, a path from node u to node v in G, denoted as a (u, v)-path, is a sequence of nodes

u = x0, x1, x2, …, xn = v

where (xi, xi+1) ∈ E for i = 0, 1, …, n − 1. In this case, we say that v is reachable from u via the (u, v)-path, or that the path passes through (traverses) the nodes x1, x2, …, xn−1 from u to v.

Assume N′ ⊆ N is a subset of N; a (u, v)-path of G is said to traverse N′ from u to v if u, v ∉ N′ and xi ∈ N′ for i = 1, …, n − 1. Note that a (u, v)-path may be a circuit (or cycle) if it begins and ends at the same node, that is, if u = v.

Figure 2 shows a graph G, in which N = {u, x1, x2, x3, x4, v} and N′ = {x1, x2, x3}, whose nodes are shown shaded. There are three (u, v)-paths: (u, x1, x2, x3, v), (u, x1, x2, x4, v), and (u, x4, v), among which only the path (u, x1, x2, x3, v) traverses N′ from u to v.

Definition 6. (reduction of iCFG). Let 𝒢 be a finite set of iCFGs. The graph reduction transformation for iCFGs is a function T : 𝒢 ⟶ 𝒢 that transforms an iCFG G = (N, E, s0, e0) ∈ 𝒢 into a reduced iCFG G′ = (N′, E′, s0′, e0′), where

(i) N′ ⊆ N is the set of nodes comprising only StructNodes and the nodes representing the sensitive API calls in G
(ii) E′ = {(n1, n2) ∈ N′ × N′ | (n1, n2) ∈ E} ∪ {(n1, n2) ∈ N′ × N′ | (n1, n2) ∉ E and there exists a (n1, n2)-path that traverses N∖N′ from n1 to n2 in G}
(iii) s0′ = s0
(iv) e0′ = e0

For convenience, we refer to the reduced iCFGs as API graphs. From the above definition, we can conclude the following properties of the reduction transformation.
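Reading this reduction on a plain directed graph (edge kinds ignored), it keeps a chosen node set and adds an edge between two kept nodes whenever some path between them passes only through removed nodes. A sketch under that assumed, simplified reading:

```python
def reduce_graph(nodes, edges, keep):
    """Keep only `keep` nodes; connect kept nodes n1, n2 if an edge
    (n1, n2) exists or some (n1, n2)-path traverses removed nodes only."""
    removed = nodes - keep
    succ = {n: set() for n in nodes}
    for a, b in edges:
        succ[a].add(b)

    def reaches_via_removed(src, dst):
        # DFS from src through removed intermediate nodes only.
        stack = [m for m in succ[src] if m in removed]
        seen = set()
        while stack:
            n = stack.pop()
            if n in seen:
                continue
            seen.add(n)
            if dst in succ[n]:
                return True
            stack.extend(m for m in succ[n] if m in removed)
        return False

    new_edges = {(a, b) for (a, b) in edges if a in keep and b in keep}
    new_edges |= {(a, b) for a in keep for b in keep
                  if (a, b) not in new_edges and reaches_via_removed(a, b)}
    return keep, new_edges

# Figure 2-style example: x1, x2, x3 are removed; the chain collapses to (u, v).
nodes = {"u", "x1", "x2", "x3", "x4", "v"}
edges = {("u", "x1"), ("x1", "x2"), ("x2", "x3"), ("x3", "v"),
         ("u", "x4"), ("x4", "v")}
kept, reduced_edges = reduce_graph(nodes, edges, keep={"u", "x4", "v"})
# reduced_edges == {("u", "x4"), ("x4", "v"), ("u", "v")}
```

The removed chain x1, x2, x3 is replaced by a single direct edge, which is what makes the later bigram extraction over edges efficient.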

Property 1. Critical nodes are preserved: all of the sensitive API call nodes and the structural nodes (i.e., the start, exit, call, and return nodes) in G are preserved in G′.

Property 2. Reachability is preserved: for any n1, n2 ∈ N′, if n2 is reachable from n1 in G, then n2 is also reachable from n1 in G′.

We can easily observe that each ordinary directed edge (n1, n2) ∈ E0 of an API graph actually represents a bigram of API calls l(n1); l(n2), where l(n1) and l(n2) are the sensitive API calls associated with n1 and n2, respectively. This observation indicates that we can calculate the set of bigrams of API calls from the edges of API graphs, as the following theorem states.

Theorem 1. Let G′ = (N′, E′, s0, e0) be an API graph for a callback method m0, let n1 and n2 be two nodes in N′, and let BIGRAM be the set of bigrams of API calls appearing in m0. Then, BIGRAM can be calculated using the following equation:

BIGRAM = {l(n1); l(n2) | (n1, n2) ∈ E0}
    ∪ {l(n1); l(n2) | ∃p ∈ Call, q ∈ Start • (n1, p) ∈ E0 ∧ (p, q) ∈ E2 ∧ (q, n2) ∈ E0}
    ∪ {l(n1); l(n2) | ∃p ∈ Exit, q ∈ Return • (n1, p) ∈ E0 ∧ (p, q) ∈ E2 ∧ (q, n2) ∈ E0}.
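Theorem 1 can be read as a set computation over the API-graph edges. The sketch below is an assumed implementation over a toy graph; the node numbers and API labels are purely illustrative.

```python
def bigrams_from_api_graph(label, E0, E2, Call, Start, Exit, Return):
    """Collect bigrams per Theorem 1: direct E0 edges between labeled
    (sensitive API) nodes, plus pairs bridged by a call edge
    (call node -> callee start) or a return edge (callee exit -> return)."""
    out = {(label[a], label[b]) for (a, b) in E0 if a in label and b in label}
    for (p, q) in E2:
        if (p in Call and q in Start) or (p in Exit and q in Return):
            preds = [a for (a, x) in E0 if x == p and a in label]
            succs = [b for (x, b) in E0 if x == q and b in label]
            out |= {(label[a], label[b]) for a in preds for b in succs}
    return out

# Toy API graph: node 3 is a call node, node 4 the callee's start node.
label = {1: "getDeviceId", 2: "getSubscriberId", 5: "connect"}
E0 = {(1, 2), (2, 3), (4, 5)}
E2 = {(3, 4)}
found = bigrams_from_api_graph(label, E0, E2,
                               Call={3}, Start={4}, Exit=set(), Return=set())
# found == {("getDeviceId", "getSubscriberId"), ("getSubscriberId", "connect")}
```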

Example 1. To illustrate the reduction transformation of iCFGs, we give a realistic example. In Table 3, the callback function OnClick () calls both the method method1 () and some APIs; Figure 3 shows how to reduce the iCFG of OnClick () to the corresponding API graph and then generate the bigrams of API calls from the resulting API graph. After obtaining the BIGRAM, we can easily obtain the permissions through the PScout mapping.

4. Feature Transformation

In order to utilize machine learning techniques to train a malware classification model, we need to transform the generated features into a feature vector space. Each feature vector represents the set of features for an application sample.

4.1. Feature Vectors

We define a feature vector as an array of 345 elements, as shown in Figure 4. We select 300 bigrams of API calls (Table 1 gives a part of the bigrams), 20 permissions (shown in Table 2), and 25 system broadcast events as the prominent features in the feature vector. An index i (1 ≤ i ≤ 345) of the vector corresponds to a fixed raw feature, and the ith element’s value Vector[i] represents the quantitative value induced from this raw feature. For example, the first element of the vector corresponds to the raw feature of the bigram of API calls getDeviceId (); getSubscriberId (), and Vector[1], i.e., a1, represents the value calculated from this feature.

Let fi be the ith raw feature in the vector; its context is ci = context(fi), and thus the feature corresponding to fi is (fi, ci) (see Definition 2). The value of the feature (fi, ci), namely Vector[i], can be calculated by applying a function f to (fi, ci):

Vector[i] = f (fi, ci).

Here, f is referred to as a feature transformation function, or simply a feature function, which takes different forms depending on the kind of raw feature; the values f (fi, ci) are called the feature values of (fi, ci). The definitions of the feature functions are given in the following section.

4.2. Feature Functions
4.2.1. Feature Function for Bigrams

Definition 7. (feature function for bigrams). Let b ∈ BIGRAMS be a raw bigram feature; it constitutes a feature (b, context_bigram (b)). The feature function on this feature is defined as

f (b, context_bigram (b)) = ω11·|UI| + ω12·|Non_UI|

where

(i) UI and Non_UI are disjoint subsets of context_bigram (b) with UI ∪ Non_UI = context_bigram (b). UI is the set of callbacks related to user interface operations, such as OnClick () and onPress (); likewise, Non_UI is the set of callbacks unrelated to user interfaces. |UI| is the cardinality of the set UI.
(ii) ω1i (i = 1, 2) are the weights associated with the callbacks in context_bigram (b): ω11 is the weight of the UI-related callbacks and ω12 is the weight of the Non_UI-related callbacks. Thus, f computes the sum of the weights associated with each of the callbacks in the context of b.

The values of ω1i are calculated using a statistical method. We first take 1000 malicious applications and 1000 benign ones as samples and then count the average frequency of the bigrams’ occurrences in UI-related callbacks and Non_UI-related callbacks. The statistical results are shown in Table 4.
According to Table 4, we calculate the frequency difference of the bigram features in the different contexts by the following formula:

φ (c) = Cm (c) − Cn (c)

where Cm (c) is the frequency of bigrams with context c in malicious applications and Cn (c) is that in benign applications. Table 5 gives the value of φ (c) for each context. For convenience, φ (c) is normalized as φ̄ (c), which is assigned to the weight ω1i; i.e., ω11 takes the value φ̄ (UI) = 0.14 and ω12 takes the value φ̄ (Non_UI) = 0.86.
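With the weights above (0.14 for UI-related callbacks, 0.86 for non-UI ones), the bigram feature function reduces to a weighted count of the callbacks in a bigram's context. A minimal sketch; the UI-callback list is illustrative, not exhaustive.

```python
# Illustrative set of UI-related callback names (not the paper's full list).
UI_CALLBACKS = {"OnClick", "onPress"}

def bigram_feature_value(context, w_ui=0.14, w_non_ui=0.86):
    """Weighted count per Definition 7: w_ui*|UI| + w_non_ui*|Non_UI|."""
    ui = {c for c in context if c in UI_CALLBACKS}
    non_ui = context - ui
    return w_ui * len(ui) + w_non_ui * len(non_ui)

value = bigram_feature_value({"OnReceive", "OnClick"})
# 0.86 (OnReceive, non-UI) + 0.14 (OnClick, UI) = 1.0
```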

4.2.2. Feature Function for Sensitive Permissions

The contexts of sensitive permissions are divided into four categories according to the component types: Activity, Service, Receiver, and Provider. The feature function is defined as follows.

Definition 8. (feature function for permissions). Let p ∈ PERMS be a raw sensitive-permission feature; it constitutes a feature (p, context_perm (p)). The feature function for this feature is defined as

f (p, context_perm (p)) = ω21·|SetActivity| + ω22·|SetService| + ω23·|SetReceiver| + ω24·|SetProvider|

where

(i) SetActivity, SetService, SetReceiver, and SetProvider are mutually exclusive subsets of context_perm (p). For example, SetActivity is the set of activity components in context_perm (p). |Set| denotes the cardinality of a set.
(ii) ω2i (1 ≤ i ≤ 4) are the weights assigned to each type of component.

Likewise, the ω2i are calculated using a similar statistical approach. In the malicious samples and the benign ones, the average frequencies of the top 20 sensitive permissions in Table 2 are calculated for the different component types, and the results are shown in Table 6. Similarly, the φ (c) are calculated using formula (9) and then normalized to obtain φ̄ (c), as shown in Table 7. Finally, the values of φ̄ (c) are assigned to ω2i (1 ≤ i ≤ 4), respectively.

4.2.3. Feature Function for Sensitive System Broadcasts

Since the context of a sensitive system broadcast is the application where the broadcast is registered, we assign a weight of 1 to this feature if it is registered in the application and 0 if unregistered.

In summary, the values of the feature vector in Figure 4 are calculated with the feature functions defined above: the bigram and permission entries take the weighted sums of Definitions 7 and 8, and each broadcast-event entry takes the value of the indicator function 1 (•), which is 1 if • is true and 0 otherwise.
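Assembling the 345-element vector then amounts to evaluating the three feature functions slot by slot. The sketch below uses toy stand-ins for the paper's 300 bigrams, 20 permissions, and 25 broadcast events:

```python
# Toy feature lists standing in for the paper's selected 300/20/25 features.
BIGRAM_FEATURES = [("getDeviceId", "getSubscriberId")]
PERM_FEATURES = ["SEND_SMS", "READ_CONTACTS"]
EVENT_FEATURES = ["BOOT_COMPLETED"]

def build_vector(bigram_values, perm_values, registered_events):
    """One vector slot per raw feature, in a fixed Figure 4-style layout."""
    vec = [bigram_values.get(b, 0.0) for b in BIGRAM_FEATURES]
    vec += [perm_values.get(p, 0.0) for p in PERM_FEATURES]
    # Broadcast events use the indicator: 1 if registered, else 0.
    vec += [1 if e in registered_events else 0 for e in EVENT_FEATURES]
    return vec

v = build_vector({("getDeviceId", "getSubscriberId"): 1.0},
                 {"SEND_SMS": 0.3},
                 {"BOOT_COMPLETED"})
# v == [1.0, 0.3, 0.0, 1]
```

Absent features default to 0, which matches the ellipses standing for zeros in the Geinimi example below.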

Example 2. We use the malware sample Geinimi to illustrate the values of a feature vector. Geinimi registers the broadcast event BOOT_COMPLETED in its manifest.xml file to trigger its payload flexibly. Once the system is initialized, BOOT_COMPLETED is broadcast and the component AdServiceReceiver is launched. In the body of AdServiceReceiver.onReceive(), URL.openConnection() and URLConnection.connect() are invoked to open a network connection. In the component AdCustomService, the malicious payload does stealthy things such as reading user contacts, sending text messages, and reading messages, which are privileged by the permissions READ_CONTACTS, SEND_SMS, and READ_SMS. Thus, the feature vector of Geinimi is as follows:

[..., 1.72, ..., 0.14, ..., 2.42, ..., 0.86, ..., 0.86, ..., 0.86, ..., 0.3, ..., 0.3, ..., 0.15, ..., 0.15, ..., 0.52, ..., 1, ..., 1, ...], where the ellipses stand for values of 0.

5. Implementation and Experiment

The implementation of our approach is based on Soot [19] and the Sklearn library [20]. Soot is a Java bytecode optimization framework originally developed by the Sable Research Group at McGill University. It provides various kinds of analyses for Java programs, including an IFDS/IDE dataflow analysis framework, call graph construction, and points-to analysis. In this paper, we use Soot only to generate call graphs for applications and control flow graphs for every class method. The Sklearn library is a Python-based third-party machine learning library; it integrates a variety of commonly used machine learning algorithms, which helps us train the classifiers. The implementation framework is shown in Figure 5.

5.1. Training Datasets

The malware samples are collected from the sample libraries provided by Drebin [5], VirusShare [21], and the DroidBench suite of FlowDroid [12]. To make the samples more representative, the selected malware samples cover all malware families in each sample database. The benign samples come from Google Play, the official market of Android apps; they are screened in advance by Computer Butler and 360 Security Guard [22] to ensure that they are benign. Table 8 shows the datasets adopted in this paper. Of all the samples, 75% are used as the training set and the remaining 25% as the test set. The processing time for each sample is capped at 3 minutes.

5.2. Effectiveness of Graph Reduction Transformation

To demonstrate the necessity and effectiveness of the reduction transformation, we compare the complexities of original iCFGs and their reduced versions. The complexity of a graph is measured by the total number of nodes and edges of the graph.

In order to facilitate the processing, we select 300 APK files whose sizes are limited to less than 2 MB from the datasets. Figure 6 shows the comparison results between the complexities of the iCFGs and their corresponding API graphs. It shows that the average reduction rate (see formula (12)) is about 75.4%, implying that the number of nodes and edges in API graphs is significantly decreased after the reduction transformation.

To further illustrate that the reduction transformation does improve the performance of feature generation in terms of time consumption, we carried out another comparative experiment, and the results are shown in Figure 7. It can be concluded that no matter how large the APK is, the time cost of the feature generation based on API graphs is lower than that based on iCFG. The average time cost (see formula (13)) is improved about 26.3% for the case of API graphs.

5.3. Training a Detector

The detection of Android malware is treated as a classification problem, which can be solved effectively using machine learning techniques. The classification process is separated into two steps: training and testing the detector. All 4972 samples are divided into two parts: 3732 for training and 1240 for testing. The training is a supervised learning process that takes as input the feature vectors labeled 1 or 0 (i.e., benign or malicious) and trains a learner (i.e., a detector) to detect malware; the testing process is used to evaluate the performance of the resulting detector.

As there is generally no absolutely optimal machine learning algorithm for a particular classification problem, we apply several classification algorithms to our feature set, including naive Bayes (NB), k-nearest neighbors (KNN), random forest (RF), logistic regression (LR), and support vector machine (SVM). The detection performance of each algorithm is shown in Table 9.
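The comparison can be reproduced in outline with scikit-learn; the data here is synthetic (make_classification), not the paper's 4972-sample corpus, so the scores will not match Table 9.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic stand-in for the 345-dimensional feature vectors (75/25 split).
X, y = make_classification(n_samples=1000, n_features=345,
                           n_informative=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

models = {
    "NB": GaussianNB(),
    "KNN": KNeighborsClassifier(),
    "RF": RandomForestClassifier(random_state=0),
    "LR": LogisticRegression(max_iter=1000),
    "SVM": SVC(),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    print(f"{name}: accuracy={accuracy_score(y_te, pred):.3f} "
          f"recall={recall_score(y_te, pred):.3f}")
```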

From the results, we can conclude that the RF classification algorithm has the best performance on the selected feature set, with a recall of 96.5%, an accuracy of 95.4%, and a false-positive rate of 5.6%. The SVM algorithm performs the worst.

5.4. Evaluation of the Selected Feature Set

To evaluate the performance of the selected features, we design and implement a number of comparison experiments. The overall experiments are organized into four groups, Ex1–Ex4, each of which evaluates the performance of one feature while keeping the other features fixed. The test cases of each group of experiments are shown in Table 10. For example, experiment Ex1 compares the bigrams of API calls with unigrams of API calls to observe the effectiveness of the bigrams with respect to the accuracy, recall, and precision rates. The experiments are run with the same machine learning classification algorithm, RF, and the results are shown in Figure 8.

From Figure 8, we conclude that the features we choose present the best performance among the corresponding test cases. We attribute this result to the following reasons:

(1) The bigrams of API calls, as one of the raw features, reflect the behavioral characteristics of an application better than individual API calls, because a bigram contains a sequence of two consecutive API calls and thus captures not only which APIs are called but also the order in which they are invoked.

(2) When extracting the sensitive permissions from an application, we acquire them from the source code instead of directly from the manifest.xml file, so the extracted permissions agree with those the application actually uses. Permissions generated in this way yield a better detection result.

(3) Since system broadcast events are frequently used by malicious payloads to trigger their behaviors, selecting them as a feature improves the detection accuracy.

(4) Additionally, we combine the raw features with their contexts to form new features, which hold richer semantic information than the individual raw features; using this feature set to classify an application therefore achieves better detection performance.
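Point (1) above hinges on bigrams preserving call order; this can be sketched in a few lines. The helper name api_bigrams is ours, and the Android API names serve only as sample data.

```python
# Build bigrams from an ordered sequence of API calls, as point (1) above
# describes: each feature is a pair of consecutive calls, preserving order.
def api_bigrams(call_sequence):
    # zip the sequence against itself shifted by one position
    return list(zip(call_sequence, call_sequence[1:]))

calls = [
    "TelephonyManager.getDeviceId",
    "HttpURLConnection.connect",
    "OutputStream.write",
]
print(api_bigrams(calls))
# -> [('TelephonyManager.getDeviceId', 'HttpURLConnection.connect'),
#     ('HttpURLConnection.connect', 'OutputStream.write')]
```

Note that reversing the call sequence produces different bigrams, whereas a unigram (bag-of-APIs) representation would be unchanged; this is exactly the ordering information the paper argues is valuable.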

5.5. Comparison with the Typical Works and the State-of-the-Art Tools

To justify the effectiveness of the proposed approach, we compare it with typical works from the past five years. The results are shown in Table 11, from which we can conclude that the accuracy of our approach is better than that of most other works and only slightly lower than DroidMat's. Although DroidMat has the highest accuracy, its recall rate is lower than ours. This result demonstrates the effectiveness of the proposed approach.

Furthermore, we compare our approach with state-of-the-art industry-scale detection tools. To this end, we submit our samples to VirusTotal, a free malware detection website that aggregates 69 malware detection tools. The comparison results are shown in Figure 9, from which we conclude that our detection performance is better than that of several mainstream tools in VirusTotal.

6. Related Work

Android malware detection has been extensively studied in recent years, and many methods and tools have been proposed to detect the fast-growing body of Android malware. In this section, we review the related works and give a brief classification of them.

6.1. Dynamic Analysis Approaches

Dynamic analysis approaches detect Android malware by monitoring an application’s behavior at runtime. Taint analysis is the most typical approach of this kind; it is generally implemented in two ways: platform-based [9, 23] and compiler-based [24, 25]. The platform-based techniques customize or modify the underlying execution virtual machines (i.e., Dalvik VM or ART), which automatically mark sensitive data, track taint propagation, and feed the critical information back to the analyzer to help detect potential misbehavior. Alternatively, the compiler-based techniques, instead of modifying the virtual machines, customize dex2oat, the ahead-of-time (AOT) compiler adopted in Android 5.0 and later versions. The customized compiler instruments taints and taint propagation rules into the original code of an application when the application is first installed. SysDroid [7] proposes a dynamic analysis approach that uses Monkey Runner to mimic human interactions and extracts system API calls at runtime. It also proposes a new feature selection approach, known as SAIL, to discover prominent system calls from applications and then uses machine learning techniques to detect potential malware samples. In addition, SysDroid concludes experimentally that bigrams and trigrams of API calls perform better than unigrams; this conclusion agrees with the results of our experiments (see Figure 8(a)).

6.2. Static Analysis Approaches

Static analysis approaches directly analyze the code of programs without executing them. Depending on whether they take the internal details of programs into consideration, static approaches are divided into two categories: white-box approaches [12, 26] and black-box approaches. White-box approaches fully investigate the internal structures or semantics of programs and detect misbehavior by using program analysis techniques (such as dataflow analysis) or by discovering certain semantic inconsistencies in programs. These approaches usually suffer from problems such as high complexity and poor scalability.

In contrast, black-box methods do not analyze the internal structures and behaviors of programs but instead use statistical methods to detect malicious payloads. With the accumulation of large numbers of malware samples and the rapid development of statistical machine learning, machine learning-based techniques have been widely used for malware detection [16, 11, 13, 14, 17, 27].

The work [8] is a typical work that uses permissions and APIs as program features. It further differentiates permissions into standard permissions (defined by the Android system) and nonstandard permissions (defined by developers). Permissions and APIs are extracted from the manifest.xml and smali files, respectively. A feature selection technology (FST) is then used to select the most prominent features in order to train the classifier. The feature set adopted by [8] is similar to ours, but it omits the contexts of features, so some semantic information about the program is lost. The work [28] extracts second-step behavior features (SSBFs for short), i.e., what is triggered by security-sensitive operations, to assist application analysis in differentiating between malicious and benign operations. The SSBFs include structural features (e.g., in-degree and out-degree of nodes in the iCFGs) and semantic features (such as the number of GUI behaviors and data-save behaviors). The work [28] also takes into account the dependency relations between GUIs and API calls when considering program semantic characteristics; such dependency relations are similar to the contexts of the bigrams of API calls in our paper (see Section 4.2.1 for distinguishing UI and Non_UI callbacks). DaDiDroid [29] also performs feature extraction on the basis of Soot. It builds a directed weighted graph by modeling the calling relationships between APIs and then uses the graph structure as a feature to perform malware detection. Additionally, this method provides resilience to code obfuscation by abstracting specific API calls into API family names. Unlike our paper, DaDiDroid focuses only on the structures of programs and ignores other significant features such as permissions and system events.

6.3. Context-Based Approaches

Many detection methods take permissions, APIs, and their contexts as features [1, 36, 10, 11, 15, 30, 31]. AppContext [15], DroidSIFT [30], and DESCRIBEME [31] are typical context-based detection methods. Our work is partly motivated by AppContext and RepassDroid, but it improves on them in two aspects. First, we combine all three kinds of raw features with their contexts, whereas RepassDroid considers only security-sensitive APIs and permissions, and considers contexts only for APIs; it thus omits system events and the contexts of permissions, which also hold semantically rich information. Second, the generalized sensitive APIs used in RepassDroid ignore the order of API calls, which is preserved in our bigrams of API calls; thus, in theory, our approach should achieve better performance than RepassDroid.

7. Conclusion

In this paper, we propose a static approach to detecting Android malware that focuses on context-based feature selection and graph-based feature generation. We combine three kinds of raw features with their contexts to form new features and use machine learning techniques to train a classifier. The evaluation results show that random forest is the optimal classification technique for our feature set, achieving 95.4% accuracy and 96.5% recall. At present, our approach can only give a binary classification of an application as malicious or benign; it can neither identify the malware family of the application nor reveal the impact of a malicious payload on the application’s behavior. These two weaknesses will be addressed in future work.

Data Availability

The malware samples are collected from the sample libraries provided by Drebin [5], Virus Share [21], and DroidBench of FlowDroid [12]; the benign applications are randomly collected from Google Play (https://play.google.com/store). All of these samples have been deposited in the Baidu cloud storage: https://pan.baidu.com/s/1UhKReGtEKSg7rLQDAUKZPQ. Please contact the authors for the password of access.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This study was supported by the Chinese NSFC Project (61702408), Science and Technology Plan Project of Shaanxi (2017JM6105), and Ministry of Education Collaborative Education Project “Mobile Operating System and Software Security Experimental Teaching Resources and Development” (2010918001).