Abstract
Deep Neural Networks (DNNs) have become an essential technology for edge intelligence. Due to the significant resource and energy requirements of large-scale DNN inference, executing it directly on energy-constrained Internet of Things (IoT) devices is impractical. DNN partitioning provides a feasible solution to this problem by offloading some DNN layers to execute on an edge server. However, the resources of edge servers are also typically limited, so a realistic environment gives rise to an optimization problem that is both energy- and resource-constrained. Motivated by this, we investigate the optimization of DNN partitioning and offloading in a multiuser resource-constrained environment, which is an intractable Mixed-Integer Nonlinear Problem (MINLP). We decompose the problem into two subproblems and propose an Energy-Efficient DNN Partitioning and Offloading (EEDPO) strategy that solves it in polynomial time based on the min-cut/max-flow theorem and dynamic programming. Finally, we test the impact of the energy constraint, DNN type, and device number on the performance of EEDPO. Simulation results on realistic DNN models demonstrate that the proposed strategy significantly improves the DNN inference task completion rate compared to other methods.
1. Introduction
With the development of deep learning [1], Deep Neural Networks (DNNs) have achieved remarkable results in many fields, such as face recognition and self-driving [2]. Meanwhile, the thriving Internet of Things (IoT) industry [3] makes it possible to deploy DNNs on intelligent IoT devices like mobile phones and wearable devices [4] for executing inference tasks. It has been pointed out that large-scale DNNs commonly provide more accurate analysis results [5]. However, deeper network structures consume more computational resources and energy, which seriously limits the application of large-scale DNNs on energy-constrained IoT devices.
Mobile Edge Computing (MEC) [6, 7] addresses this problem well by allowing devices to offload some DNN layers to execute on an edge server. In a DNN, some intermediate results (the outputs of intermediate layers) are significantly smaller than the raw input data [8]. This gives us the chance to take advantage of the powerful computing resources of MEC: we can compute part of the DNN on the device side, transfer a small amount of intermediate data to the edge, and compute the remaining part on the edge side. Partitioning a DNN thus constitutes a tradeoff between computation and transmission. As shown in Figure 1, partitioning at different layers incurs different computation energy and transmission energy, so an optimal partition is desirable. The challenge is that DNNs are no longer limited to a chain topology, and DAG topologies are gaining popularity. Partitioning a DAG instead of a chain involves much more complicated graph-theoretic analysis, which may lead to NP-hardness.
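For intuition, the tradeoff above can be sketched for a chain-topology DNN by enumerating every partition point k (layers before k run locally, the rest at the edge). All per-layer FLOPs, output sizes, rates, and powers below are hypothetical values chosen only for illustration.

```python
# Hypothetical chain DNN: per-layer FLOPs and output sizes.
flops = [0.5e9, 1.2e9, 0.8e9, 0.3e9]
out_bytes = [2.0e6, 0.5e6, 0.1e6, 4e3]
raw_input_bytes = 2.7e6

local_rate = 2e9     # device processing rate, FLOPs/s (assumed)
edge_rate = 20e9     # edge processing rate, FLOPs/s (assumed)
uplink_Bps = 1e6     # uplink rate, bytes/s (assumed)
p_work, p_tx, p_idle = 0.4, 0.3, 0.05  # device powers, W (assumed)

n = len(flops)

def device_energy(k):
    """Device-side energy when layers [0, k) run locally and [k, n) at the edge."""
    e_local = p_work * sum(f / local_rate for f in flops[:k])
    if k < n:  # something must still be sent to the edge
        sent = raw_input_bytes if k == 0 else out_bytes[k - 1]
        e_tx = p_tx * sent / uplink_Bps
    else:
        e_tx = 0.0
    e_idle = p_idle * sum(f / edge_rate for f in flops[k:])
    return e_local + e_tx + e_idle

best_k = min(range(n + 1), key=device_energy)
```

With these numbers the optimal cut lies in the middle of the network; making a layer's output cheaper to transmit or its computation costlier shifts the cut accordingly.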

Moreover, we focus on the low-power-requirement situation (each task has a low power requirement). Based on the partition methodology shown in Figure 1, we gather the best achievable energy of AlexNet under different edge computing resources. As shown in Figure 2, more edge computing resources are required to meet the low power requirement. However, in a realistic environment, the resources of the edge server are typically limited, so optimal resource allocation is also desirable in a multiuser environment.

To this end, we study a novel energy-aware DNN inference task completion maximization problem, which aims to maximize the number of energy-aware DNN inference tasks completed by jointly exploring DNN model partitioning and resource allocation for DNN inference. The main contributions of this paper are as follows:
(1) A problem of maximizing the task completion rate in a multiuser edge environment is proposed, which further considers the low power requirement of tasks and the resource limitation of the edge server.
(2) We propose an optimization algorithm, EEDPO, based on min-cut/max-flow theory and dynamic programming to maximize the user task completion rate by achieving the optimal DNN partitioning and offloading decisions.
(3) Based on accurate data of DNN models obtained via Python's official package index PyPI, extensive simulation experiments on different types of DNN are conducted. The results demonstrate that EEDPO achieves significant performance.
The remainder of the paper is organized as follows. Section 2 reviews the related works on DNN inference with resource-limited devices. Section 3 introduces the system model, notions, and notations. The defined problem is given formally in Section 4. Section 5 devises an approximation algorithm for the task completion rate maximization problem. Section 6 presents extensive simulation experiments to evaluate the performance of our proposed algorithm, and Section 7 concludes the paper.
2. Related Works
Recent research on DNN inference for resource-constrained devices falls into two types. One is DNN model compression [9], aimed at reducing the device's workload, but its accuracy cannot be guaranteed. The other is offloading the entire DNN, which imposes a significant workload on the core network, resulting in high delay and energy loss. Fortunately, DNN partitioning has recently been proposed [10] as an effective method to solve the above problem; its core idea is to partition the DNN and offload part of it to execute on edge servers. For instance, the Neurosurgeon framework proposed by Kang et al. [11] used a regression model to predict the execution delay and energy consumption of each DNN layer on edge servers and IoT devices to select the partition point. Dong et al. [12] created an online partitioning and resource allocation algorithm that used Lyapunov optimization to minimize server rental costs under delay constraints. However, the DNN partition strategies mentioned above and those proposed in [13, 14] apply only to the few DNN types with a chain structure.
Obviously, DNNs with DAG structures like GoogleNet and ResNet are more complex. Xue et al. [15, 16] preprocess the topological DNN model by designing a two-step strategy for offloading large-scale DNN models in a local-edge-cloud collaborative environment. In [17], Wu et al. propose a Min-Cost Offloading Partitioning (MCOP) algorithm aimed at finding the optimal DNN partitioning plan in arbitrary topological consumption graphs under different cost models and mobile environments. Hu et al. [8] considered DNNs with DAG structures and proposed two approximation algorithms to solve the delay minimization problem. In [18], Chen et al. predicted the delay of each network layer with a random forest algorithm to realize auto-offloading of DNN applications. Furthermore, an approximation algorithm was proposed by Li et al. [19] to maximize the throughput of user requests. However, the preceding papers consider the device's energy to be infinite. Unlike them, we consider that multiple devices have only limited energy to execute DNN inference tasks and that edge servers have finite resources to allocate. Accordingly, we put forward the problem of maximizing the completion rate of DNN inference tasks by making efficient partition and offloading decisions.
3. System Model
As shown in Figure 3, our system consists of one Access Point (AP) linked to a colocated edge server and a set of IoT devices. The edge server connects to the AP via a cable, so the transmission delay between them is neglected. Besides, the edge server has limited computing resources; each device has one DNN inference task with a specific DNN model and a given low power requirement (the termination condition of the task). Note that the models are pretrained and stored both locally and at the edge. Local devices reduce energy consumption by offloading some DNN layers to execute on the edge server.

In this paper, we focus on DNN inference. Compared with DNN training, DNN inference requires far fewer CPU and GPU resources. Therefore, instead of using costly hardware accelerators (e.g., GPUs) for DNN inference, we consider CPU-based edge servers, where each CPU core corresponds to a thread [19]. We consider that each server has a computing capacity measured by the number of threads (CPU cores) in it. Note that the results of this paper can easily be extended to servers equipped with GPU resources by incorporating the threads of GPU cores and leveraging the GPUs for inference acceleration.
3.1. DNN Model and User Tasks
Most modern DNN models are made up of several essential elements known as "layers." A "layer" represents a combination of computations with a similar effect on a set of inputs (e.g., the convolution layer in a CNN). As in the approximate GoogleNet model shown in Figure 4, each layer takes the previous layer's output as its input. This article considers layers the atomic elements of a DNN model. To be consistent with the structure of a real neural network, we model a DNN as a Directed Acyclic Graph (DAG). Figure 5 introduces an example of the DNN model we consider, where each node denotes a layer of the DNN.


Each device's task is denoted by the topological structure of its DNN model, whose node set comprises the layers of the model and whose edge set represents the dependencies between layers. Each layer is characterized by the Floating Point Operations (FLOPs) it requires and by its output data size, and each task has an energy cost limitation: if the task's energy consumption exceeds it, the task fails. Note that the first layer is a virtual layer used only to transmit the input parameters of the DNN. Besides, each layer may have one or more successor layers. After DNN partitioning, the network layers are divided into three sets, representing the layers of local inference, transmitting, and edge inference, respectively. When the local set is empty, the DNN is executed entirely on the edge server; when the edge set is empty, the DNN is executed entirely locally.
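The task model above can be written down as a small data structure; the class and field names below are our own stand-ins for the paper's symbols, and the numeric values are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Layer:
    flops: float       # FLOPs required by the layer
    out_bytes: float   # output data size of the layer

@dataclass
class Task:
    layers: list          # DAG nodes; layers[0] is the virtual input layer
    edges: list           # dependency edges (i, j): layer j consumes layer i's output
    energy_budget: float  # the task fails if device energy exceeds this

# Toy instance: a 3-layer chain preceded by the virtual input layer.
task = Task(
    layers=[Layer(0.0, 2.7e6),       # virtual layer: no compute, raw input as output
            Layer(0.7e9, 0.5e6), Layer(1.1e9, 0.1e6), Layer(0.4e9, 4e3)],
    edges=[(0, 1), (1, 2), (2, 3)],
    energy_budget=0.3,
)
```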
3.2. Inference Energy Model
The energy consumption of tasks considered here refers explicitly to the energy consumption of the IoT devices. It is worth noting that we also account for the idle power consumption of a device while its offloaded layers execute at the edge. Because the amount of returned data is tiny, we ignore the energy required to receive the returned results. The energy consumption model is detailed below.
3.2.1. The Energy Consumption of Local Inference
The local inference delay of a task is given by the sum of the local delays of executing its layers, where each layer's delay depends on the local processing rate of the device, measured in FLOPs per second. Then, the local energy consumption of the task is given in terms of the energy efficiency coefficient of the device. In addition, we consider that if all network layers are executed locally, the total energy consumed by the task is the local energy of the whole network.
3.2.2. The Energy Consumption of Transmitting Intermediate Parameters
We use Time Division Multiplexing (TDM) as the communication mode, and the transmission rate between a device and the AP is calculated by the Shannon formula, whose parameters are the channel bandwidth, the transmission power of the device, the distance between the device and the AP, the channel fading coefficient, and the noise power. The transmission delay of a task is expressed as the sum of the delays of transmitting the output parameters of the transmitted layers, and the transmission energy consumption of the task follows as the transmission power multiplied by this delay.
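The rate and energy model of this subsection can be sketched as follows; the parameter values are assumed for illustration only and do not reproduce the paper's symbols.

```python
import math

# Sketch of the TDM uplink model from the Shannon formula; all values assumed.
B = 2e6          # channel bandwidth, Hz
sigma2 = 1e-9    # noise power, W
theta = 3.0      # channel fading (path-loss) exponent

def uplink_rate(p_tx, d):
    """r = B * log2(1 + p_tx * d**(-theta) / sigma2), in bits/s."""
    return B * math.log2(1.0 + p_tx * d ** (-theta) / sigma2)

def tx_energy(p_tx, d, out_bits):
    """Transmission energy of the device: power x transmission delay."""
    return p_tx * out_bits / uplink_rate(p_tx, d)
```

A device farther from the AP sees a lower rate and therefore spends more energy sending the same intermediate output.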
3.2.3. The Energy Consumption of Edge Inference
We consider the server's CPU resources to be multithreaded parallel resources [20, 21]. Each task is allocated a number of threads, and each thread has a calculation rate measured in FLOPs per second, so a task's calculation rate is the number of allocated threads times the per-thread rate. The edge inference delay of a task is given by the sum of the edge inference delays of its offloaded layers. Further, the idle energy consumption of the task is the idle power of the device multiplied by this delay. To sum up, the total energy consumption of a task is the sum of its local, transmission, and idle energy.
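The edge-inference and idle-energy terms can be sketched as follows; all values are assumed for illustration.

```python
# Sketch of the edge-inference delay and idle energy; values are assumed.
# With c threads at r_s FLOPs/s each, the task computes at c * r_s.
r_s = 10e9                         # per-thread rate: 2.5 GHz x 4 FLOPs/cycle
p_idle = 0.05                      # device idle power, W
offloaded_flops = [0.8e9, 0.3e9]   # FLOPs of the offloaded layers (hypothetical)

def idle_energy(c):
    """Device energy spent idling while the edge executes the offloaded layers."""
    t_edge = sum(f / (c * r_s) for f in offloaded_flops)
    return p_idle * t_edge
```

Doubling the allocated threads halves the edge delay and hence the idle energy, which is why the achievable energy is nonincreasing in the allocation (used in Section 4).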
Note that a task can be completed only if its total energy consumption satisfies the energy cost limitation.
4. Problem Formulation
As discussed in the previous section, a task fails when its energy consumption exceeds its termination condition. Therefore, we reduce energy consumption by making use of edge computing resources. However, the computing resources of servers are also typically limited. Our optimization goal is to maximize the number of energy-aware DNN inference tasks completed, under the given low power requirements and constrained edge computing resources. We give a formal definition of the task completion rate maximization problem, whose decision variables are the vector of partition decisions and the vector of server resource allocation decisions, and whose indicator variable equals 1 when a task is completed and 0 otherwise. Constraint (10b) indicates that each DNN is partitioned into two disjoint parts, (10c) states that the sum of resources allocated to tasks cannot exceed the total server resources, and (10d) requires that each allocation is a nonnegative integer bounded by the server capacity.
Noticing that (10c) is a nonlinear function over integer and binary variables, our problem (P) is naturally a Mixed-Integer Nonlinear Problem (MINLP), which is intractable in general. However, we found that once the optimal resource allocation of each task is given, (P) reduces to a Knapsack Problem (KP). Specifically, we solve the problem in two steps. In Step 1, we consider offloading part of each task with allocated resources; if such an assignment is feasible, the task becomes a candidate, and we determine the minimum allocation that meets its power requirement. In Step 2, we treat the completion indicator as the "price" and the minimum allocation as the "volume" of each task; if the allocation solved in Step 1 meets constraint (10d), the task is a valid item, otherwise it is discarded. (P) is thereby reduced to a KP. As a result, we decompose (P) into two subproblems (P1) and (P2), analyzed below. (P1) is formulated as
For a given resource allocation, we can obtain the optimal partition strategy that minimizes the task's energy consumption. Note that we model this minimization problem in graph theory as a min-cut/max-flow problem and solve it in the next section. Once the partition is confirmed, it is easy to observe that the minimum achievable energy is monotonically nonincreasing in the allocated resources. Thus, we can solve for the minimum resource allocation that satisfies the energy limit by exhaustive search. After solving (P1) for each task, its minimum allocation is confirmed; a sentinel value indicates that the task cannot be completed even if given all server resources.
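Step 1 can be sketched as the following search, exploiting that the minimum energy is nonincreasing in the allocation. Here `min_energy` is a stand-in for the min-cut solver of Section 5 (an assumption about its interface), and the toy curve is purely illustrative.

```python
# Sketch of Step 1: the smallest allocation c meeting the budget.

def min_threads(min_energy, budget, C):
    """Smallest c in [1, C] with min_energy(c) <= budget, or 0 if none."""
    for c in range(1, C + 1):
        if min_energy(c) <= budget:
            return c
    return 0   # infeasible even with all C threads

# Toy nonincreasing energy curve for illustration.
toy_curve = lambda c: 1.0 / c
```

Because the energy curve is nonincreasing, a binary search over [1, C] would find the same allocation in logarithmically many min-cut calls.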
After solving (P1) and confirming each task's minimum allocation, we can make the optimal offloading decision to maximize the number of completed tasks under the server resource limitation. The subproblem (P2) is formulated as
Obviously, (P2) is a binary KP: the server's resource capacity corresponds to the capacity of the knapsack, each task's minimum allocation to the item's volume, and its completion indicator to the item's price.
5. Proposed Energy-Efficient DNN Partitioning and Offloading Algorithm
For the problem formulated in Section 4, we put forward the Energy-Efficient DNN Partitioning and Offloading (EEDPO) algorithm, which adaptively makes the partition and offloading decisions of each DNN according to the task's low power requirement and the resource limit of the edge server. As shown in Algorithm 1, the steps of EEDPO are as follows:
(1) Based on Algorithm 2, solve the optimal resource allocation decision and obtain the completion indicators by Equation (12).
(2) Based on Algorithm 3, solve for the maximum completion rate.
5.1. Minimizing Energy Consumption by Partitioning DNN
We model the partition problem in graph theory as a min-cut/max-flow problem. Based on the original DNN graph, some new virtual nodes and edges are added to form an auxiliary graph. Figure 6 shows an example of an original graph and its auxiliary graph.

The capacity of each edge is set as follows: (1) Each edge from the source to a layer node has a capacity equal to the idle energy consumption of that layer when it executes at the edge with the allocated resources. The capacities of the edges incident to the virtual input layer are set so that it must be executed locally, because it is only used to input parameters to the neural network. (2) Each edge from a layer node to the sink has a capacity equal to the local energy consumption of that layer. (3) Each edge between layer nodes has a capacity equal to the energy of transmitting the corresponding intermediate parameters. (4) Virtual nodes and edges are added, with suitable capacities, to ensure that a node with more than one successor layer transmits its intermediate parameters only once.
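A toy instance of this construction for a 3-layer chain DNN can be solved with a plain max-flow routine. The capacities below are hypothetical integer-scaled energies, and the orientation is one plausible arrangement (the paper's exact construction may differ): a layer that lands on the sink side runs at the edge, so cutting (s, j) pays its idle energy, cutting (j, t) pays its local energy, and cutting (j, j+1) pays the cost of transmitting layer j's output.

```python
from collections import defaultdict, deque

def max_flow(cap, s, t):
    """Edmonds-Karp max flow; equals the min cut by the max-flow/min-cut theorem."""
    flow = 0
    while True:
        parent = {s: None}
        q = deque([s])
        while q and t not in parent:
            u = q.popleft()
            for v, c in cap[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    q.append(v)
        if t not in parent:          # no augmenting path left
            return flow
        b, v = float("inf"), t       # bottleneck along the augmenting path
        while parent[v] is not None:
            b = min(b, cap[parent[v]][v])
            v = parent[v]
        v = t                        # push flow, update residual capacities
        while parent[v] is not None:
            cap[parent[v]][v] -= b
            cap[v][parent[v]] += b
            v = parent[v]
        flow += b

cap = defaultdict(lambda: defaultdict(int))
edge_e = {1: 9, 2: 2, 3: 2}     # idle energy if the layer runs at the edge
local_e = {1: 1, 2: 6, 3: 4}    # energy if the layer runs locally
tx_e = {1: 5, 2: 1}             # energy to transmit a layer's output

for j in (1, 2, 3):
    cap["s"][j] += edge_e[j]
    cap[j]["t"] += local_e[j]
for j in (1, 2):
    cap[j][j + 1] += tx_e[j]

# Min cut here: run layer 1 locally, transmit its output, run 2 and 3 at the edge.
min_energy = max_flow(cap, "s", "t")
```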
For the example cut shown in Figure 5, the value of the cut denotes the total energy consumption of the corresponding partition of the DNN. If the minimum cut is found, its value is the minimum energy required to complete the task, according to the min-cut/max-flow theorem [22].
As illustrated in Algorithm 2, we first traverse each task and calculate its fully local energy consumption; a task needs no offloading if this already meets its limit, and the tasks remaining in the queue all need to be partitioned. To determine the minimum allocation, we traverse the thread number from 1 to the server capacity, at each step constructing the auxiliary graph and finding its minimum cut with the Boykov-Kolmogorov algorithm [22]. Finally, if even the full capacity is insufficient to meet the limit, we mark the task as failed.
5.2. Solving Knapsack Problem by Dynamic Programming
We use dynamic programming to find the optimal solution of the knapsack problem. Algorithm 3, which solves problem (P2), is described below; the table entry indicates the maximum benefit obtainable when considering the first several DNNs under a given resource budget. The recurrence relation is
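The recurrence can be sketched as a standard 0/1 knapsack DP. Here `demands[i]` stands in for the minimum thread allocation of task i from Step 1 (with 0 as a stand-in for the paper's infeasibility sentinel), each completed task contributes benefit 1, and C is the server's total thread count.

```python
# Minimal sketch of Algorithm 3's dynamic program (0/1 knapsack).

def max_completed(demands, C):
    """Maximum number of tasks completable within C threads."""
    dp = [0] * (C + 1)           # dp[r]: best benefit using at most r threads
    for d in demands:
        if d == 0:               # infeasible task, skip
            continue
        for r in range(C, d - 1, -1):   # reverse scan so each task is used once
            dp[r] = max(dp[r], dp[r - d] + 1)
    return dp[C]
```

For example, with thread demands 3, 4, 5, and 6 and C = 10, at most two tasks fit (e.g., 4 + 6).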
5.3. The Analysis of Algorithm Computation Complexity
The time complexity of Algorithm 2 combines the enumeration of thread allocations with solving the minimum cut problem by the Boykov-Kolmogorov algorithm [22], whose running time depends on the number of layers and the number of edges. Algorithm 3 traverses the optimal solution of each subproblem, and its computational complexity is polynomial in the number of tasks and the server capacity. Therefore, the overall complexity of EEDPO is polynomial.
6. Numerical Results
6.1. Simulation Setup
The channel bandwidth is set in MHz, and the distance between the AP and the devices lies within a range of meters. The DNN models deployed on the devices include not only chain models like AlexNet but also DAG models like GoogleNet and ResNet. Each model's input is a pixel image with a file size of 2.7 MB. We use Python packages to obtain the actual parameters of the DNN models. The working power is in the range of [0.1, 0.5] Watt, with given noise power and channel fading coefficient, and the energy of each task is limited to a range in Joules. Each local device's clock speed lies in a range of GHz; the servers' clock speed is set at 2.5 GHz per thread. All threads share the same clock speed, and each clock cycle can perform 4 floating point operations. The experimental device is a laptop with a 2.20 GHz Intel Core(TM) i5-5200 CPU and 8 GB of RAM. The simulation parameters are summarized in Table 1.
6.2. Performance Evaluation
To evaluate the performance of EEDPO in terms of completion rate, we introduce three heuristic algorithms as benchmarks. EDGY-GREEDY offloads all tasks and uses a greedy strategy to make the offloading decision under the limitation of server resources. RANDOF randomly chooses some tasks to offload with limited server resources. LOCAL-ONLY executes all tasks locally. Moreover, we compare EEDPO with a state-of-the-art DNN partitioning strategy, Neurosurgeon, on energy consumption. Since Neurosurgeon is only effective for chain topologies, we treat a DAG-topology DNN as a sequential connection of several logical layers, as in [13], and then adopt Neurosurgeon to partition the DNN. We use the Edge-Only method as the baseline, i.e., performance is normalized to the Edge-Only method.
6.2.1. Performance Evaluation of the EEDPO Compared to Other Methods
We first evaluated the task completion rate of EEDPO against EDGY-GREEDY, RANDOF, and LOCAL-ONLY with 75 devices over different DNN types. As seen in Figure 7, EEDPO completes more tasks than the other methods. Using SqueezeNet as an example, the completion rate of EEDPO is 9.4%, 35.2%, and 77.4% higher than that of EDGY-GREEDY, RANDOF, and LOCAL-ONLY, respectively. This is because the optimal partition and offloading strategy reduces energy consumption and maximizes server resource utilization. Furthermore, AlexNet's completion rate is 32% and 45.4% higher than those of GoogleNet and ResNet, respectively, since they require more FLOPs (14.53 G and 62.77 G) than AlexNet; to meet the energy limitation, they naturally require more server resources.

Then, we evaluated the task completion rate of EEDPO by varying the number of devices from 50 to 150, with AlexNet as the task's DNN type. Figure 8 illustrates the completion rate of EEDPO over different device numbers. EEDPO again completes more tasks than the other methods, except when the device number equals 50, where server resources are sufficient to complete the few offloaded DNN layers. Besides, the overall completion rate decreases as the number of devices increases, because a larger number of devices implies that more DNN layers are offloaded to the edge server, which the server cannot complete with its limited resources.

6.2.2. Impact of Energy Constraint on the Performance of EEDPO
Then, we evaluated the impact of the energy constraint on the performance of EEDPO by varying the DNN type. Figure 9 shows that, for GoogleNet, the task completion rate with energy level [0.15, 0.3] is 13% and 18% higher than with energy levels [0.1, 0.25] and [0.05, 0.2], respectively. That is because the larger a task's energy budget, the fewer server resources are required to complete the offloaded DNN layers. Similarly, SqueezeNet and AlexNet have higher completion rates than the other models, owing to their lower FLOP requirements.

6.2.3. Impact of DNN Type, Device Number, and Energy Constraint on the Performance of Algorithm for Reaching Completion Rate 1
Firstly, we evaluated the resource requirement over different types of DNN. As shown in Figure 10, the numbers of threads SqueezeNet and AlexNet need to achieve a completion rate of 1 are 80 and 84, while VGGNet needs 740. This is because VGGNet requires 8.52 and 10.46 times as many FLOPs as SqueezeNet and AlexNet, respectively; only offloading more DNN layers can meet the task's energy cost limitation, which naturally requires more server resources. This result is consistent with the evaluation in Figure 7 and clarifies why the completion rate is low when the DNN type is ResNet50 or VGG11.

Then, as shown in Figure 11, we evaluated the resource requirement as the number of devices increased. With more devices, reaching a completion rate of 1 requires more threads, though the requirement grows more slowly. That is because the number of offloaded layers increases with the number of devices, so more resources are required to meet the tasks' energy cost limitations. In a realistic environment, this can help us estimate how many resources need to be deployed on the edge server when the number of devices reaches a specific level.

Finally, we evaluated the resource requirement over different energy constraints. As seen in Figure 12, the number of threads required with energy level [0.15, 0.3] is 20 and 52 fewer than with energy levels [0.1, 0.25] and [0.05, 0.2], respectively, because fewer DNN layers require offloading when the local energy is sufficient. This result illustrates that a task should choose an appropriate DNN model according to its energy level. For instance, a task with a low energy level should choose a lightweight DNN model like AlexNet or SqueezeNet instead of ResNet or VGGNet.

6.2.4. Comparing EEDPO with Neurosurgeon on the Performance of Energy Consumption
Firstly, we evaluated the energy-saving performance of EEDPO and Neurosurgeon with different DNN types, using the Edge-Only method as the baseline (performance is normalized to Edge-Only). From Figure 13, we can see that for chain topology models, EEDPO and Neurosurgeon achieve similar energy savings, while for DAG topology models, EEDPO outperforms Neurosurgeon significantly, saving 30% more energy. Then, we evaluated the energy consumption of GoogleNet with different edge computing capacities. As shown in Figure 14, the energy consumption of EEDPO is always lower than that of Neurosurgeon, although both decrease as the edge computing capacity increases. This observation validates the effectiveness of EEDPO.


7. Conclusion
In this paper, we investigated the DNN inference task completion rate maximization problem in a multiuser edge environment, further considering the energy limitation of IoT devices and the resource limitation of the edge server. To reduce the energy consumption of tasks and maximize server resource utilization, we proposed an Energy-Efficient DNN Partitioning and Offloading (EEDPO) strategy. Experimental results demonstrated that the proposed algorithms are promising. Besides, we drew some valuable conclusions from the evaluation results (e.g., Figure 11 lets us estimate how many resources need to be deployed on the edge server when the number of devices reaches a specific level). In future work, we will consider more complicated scenarios with environmental fluctuation, including network delay, wireless channel conditions, and server failures, and improve the adaptability of the proposed partition strategy. Moreover, we plan to study DNN inference and offloading in wireless-powered MEC systems.
Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.
Conflicts of Interest
The authors declare that there is no conflict of interest regarding the publication of this paper.
Acknowledgments
This work was supported by the National Natural Science Foundation of China (Grant No. 61672465) and the Natural Science Foundation of Zhejiang Province (Grant No. LZ22F020004).