Abstract

This paper addresses trajectory planning for aerial/unmanned aerial vehicle- (UAV-) aided communications in disaster regions using a practical budget-constrained multiarmed bandit (BC-MAB) formulation. We propose two cost-efficient trajectory planning algorithms (TPAs) based on two variants of the upper confidence bound (UCB) algorithm, namely, UCB-BC1 and UCB-BC2, under orthogonal multiple access (OMA) transmission. The former assumes prior information about the minimum expected costs, while the latter estimates the minimum costs from empirical observations. Simulation results confirm that the proposed algorithms outperform the benchmark schemes in terms of the total number of assisted survivors, battery consumption, and convergence speed.

1. Introduction

Recently, aerials have gained remarkable capabilities that make them key players in the forthcoming beyond fifth generation (B5G) and sixth generation (6G) communication systems. Aerials can carry out high-risk missions with high maneuverability, which has led to several key applications, including data collection, security monitoring, aerial photography, disaster management, rescue operations, and environmental monitoring and control. Additionally, their low cost compared with human-crewed aircraft has allowed them to spread across a wide range of commercial, scientific, and recreational applications such as drone racing, crop monitoring, irrigation, and mail delivery systems [1, 2]. An aerial can hover above a disaster area to assist mobile survivors in postdisaster emergency communications. To do so, it needs to set up a trajectory that encircles the target area and consecutively serves the survivors along its path, thereby enlarging its coverage inside the disaster region [3].

In disaster area scenarios, efficient aerial trajectory planning (TP) enables the aerials to adjust their activity according to their communication needs and the survivors to be assisted, hence serving more victims and improving overall network performance. Meanwhile, machine learning (ML) techniques, in both offline and online forms, have gained great interest in wireless communications [4, 5]. Offline learning needs training on ground-truth data collected from the environment, while online learning provides self-decision-making and quick adaptation to environmental variations [4, 6]. Among the online algorithms is the multiarmed bandit (MAB) policy, a stateless reinforcement learning (RL) approach that employs sequential decision-making to maximize the agent’s cumulative payoff without any prior knowledge. MAB schemes are lightweight and suited to dynamic communication environments with varying path loss, blocking, and shadowing [4]. In a typical MAB problem, a player/learner/agent stands in front of a slot machine with multiple arms and tries to maximize his/her cumulative reward by balancing exploitation of the best arm found so far against exploration of other arms that may offer better payoffs [6, 7].

Previous aerial TP solutions include model predictive control [8], heuristic techniques [9], and deep RL (DRL) [2]. In [4, 10], MAB schemes were leveraged to efficiently solve aerial gateway selection problems; however, the targeted problem was different from ours. In [3], the authors proposed two MAB-based TP algorithms (TPAs) named distance aware UCB (DUCB) and ε-exploration, in which the cost (distance over the aerial’s remaining battery capacity) is subtracted from the exploration part. This is improper, especially when the payoff and the cost are of different types. Other previous work applies deep learning techniques, which need offline training and are therefore unsuitable for dynamic communication environments. Moreover, except for DUCB and ε-exploration [3], most RL-based solutions neglect battery consumption in their formulations, unlike our suggested solution, which is more realistic and adaptable to environmental changes.

Therefore, we reformulate the aerial TP problem as a multiobjective problem, i.e., maximizing reward and minimizing cost, and propose two cost-efficient MAB algorithms for coordination and trajectory design that maximize the number of served survivors under a constrained aerial battery. Moreover, the two algorithms handle different types of rewards and costs and outperform DUCB, ε-exploration [3], and naive UCB1, as will be shown later.

In this paper, we formulate the aerial TP problem as a budget-constrained MAB (BC-MAB), where the aerial is the MAB player that attempts to maximize its payoff (i.e., the number of served survivors). The bandit arms are the grids of the disaster area, and the budget is the aerial’s battery capacity, which is consumed at a varying cost per grid [3, 10]. Consequently, we propose two BC upper confidence bound (UCB) schemes, namely, UCB-BC1 and UCB-BC2. Both algorithms have the same exploitation behavior, which is the ratio of the observed rewards to the arms’ cost. However, their exploration management is different. UCB-BC1 assumes prior knowledge of the minimum expected cost over all arms (i.e., the locations of the survivors are known a priori), and UCB-BC2 estimates this minimal cost from previous observations (i.e., the survivors’ locations are unknown). We compare our proposed solutions with the DUCB and ε-exploration algorithms given in [3] and with the original UCB1 algorithm. The evaluation results confirm the efficiency of our proposed solutions in terms of the total number of assisted survivors, energy consumption, and convergence speed.

The paper is organized as follows: Section 2 reviews related work. Section 3 presents the system model and the mathematical formulation of the aerial TP problem. Section 4 explains the envisioned UCB-BC1/BC2 schemes after highlighting the UCB1, DUCB, and ε-exploration schemes. Section 5 carries out numerical simulations of the proposed algorithms compared with classical MAB schemes. Finally, Section 6 concludes the work.

2. Related Work

Owing to the unique merits and promising applications of UAVs, a large body of UAV-related work has appeared recently. In [11], the total battery consumption of the UAV is minimized by jointly designing the sensor node scheduling scheme, the power allocation strategy, and the flight trajectory via successive convex optimization, without any AI utilization. Additionally, a UAV-mounted base station (BS) deployment scheme is presented in [12] that minimizes the number of BSs necessary to provide wireless coverage for a group of distributed ground terminals. However, it neither considers disaster area scenarios nor utilizes any ML solution. Moreover, the authors of [13] proposed a battery consumption model for rotary-wing UAVs, in which a novel path discretization method was used to minimize UAV battery consumption. Still, their energy consumption model needs to be more general, with more accurate throughput estimation. According to [14], multiple UAV BSs could be operated at a minimum average rate by optimizing UAV trajectories and battery consumption. This was done by successive convex optimization without leveraging any ML solution and without considering the energy consumed by UAV movement. In [15], the authors investigated UAV data collection in wireless sensor networks, reducing the flight time in order to save battery power while the UAVs collect data. However, they considered fixed sensor and channel fading scenarios, not mobile ones. The authors of [16] studied the optimization of solar-powered UAV trajectory design and resource allocation for static mobile users to maximize the sum throughput within a specific operation duration. Furthermore, the works of [17–19] handled resource allocation in aerial-aided networks without making use of ML.

Recently, MABs have also been leveraged for UAV wireless communication problems due to their unique benefits. The work in [20, 21] discussed a distributed UAV-assisted MAB approach in disaster area scenarios. According to their locations, the UAVs were categorized as access UAVs (which collect information from survivors) and gateway UAVs (which deliver information to the closest working ground BSs). Also, [22] presented a MAB-aided solution that finds the UAV location maximizing the network’s sum rate. The authors of [3] proposed MAB schemes that optimize battery consumption during task execution, solving a MAB optimization problem with UCB algorithms to find the ideal trajectory. None of the above work utilized the budget-constrained MAB for the aerial TP problem, which is more practical and effective.

Unlike the related work mentioned above, we propose UCB-BC1 and UCB-BC2 schemes that not only maximize the data rate but also minimize the limited UAV energy consumption, including the consumption due to changing direction. Both algorithms have the same exploitation behavior, which is the ratio of the observed rewards to the arms’ cost. However, their exploration management is different. UCB-BC1 assumes prior knowledge of the minimum expected cost over all arms (i.e., the locations of the survivors are known a priori), and UCB-BC2 estimates this minimal cost from previous observations (i.e., the survivors’ locations are unknown). For performance analysis, we compare our proposed algorithms with the DUCB, ε-exploration, and original UCB1 algorithms. The evaluation results confirm the efficiency of our proposed solutions in terms of the total number of assisted survivors, energy consumption, and convergence speed. Furthermore, due to the limited battery capacity of UAVs, an energy-efficient trajectory is critical for UAV-assisted emergency communication.

3. System Model and Problem Formulation

Figure 1 illustrates the system model of the aerial-assisted communication system, in which a single aerial acts as a flying/hovering base station (BS) in a postdisaster region where the whole ground infrastructure has malfunctioned due to a natural disaster, e.g., a tsunami or an earthquake. The emergency area is divided into grids; during operation, the aerial flies from the charging point and covers the crisis zone [4, 10, 23]. Then, it returns to the initial launch place to recharge its battery before the battery is depleted. We assume that the aerial first flies to a certain point and then hovers around that point to serve the survivors located within horizontal distance . The number of served survivors is known by using the global positioning system (GPS). For analysis clarity, the postdisaster zone under inspection is equally divided into grids. The aerial can support all survivors inside a grid while hovering over the grid center [4, 24]. defines the set of all grids and implies the aerial trajectory, where is the grid number inside the region. The aerial trajectory starts at grid , assists grids, and returns to grid for recharging. Total stationary survivors are located within the zone, assuming an equal probability of requesting radio access assistance. refers to the traffic demands of survivors at grid . For simplicity, , which means that the aerial returns to the starting grid at the end. reflects the set of all possible trajectories that start and end at the central grid. A survivor requests assistance once a natural disaster occurs. Hence, repeatedly assisting a few grids will waste the aerial’s battery power. Thus, the critical task in aerial-assisted postdisaster communications is to discover the most useful trajectory, i.e., the one that delivers communication service to the maximum number of survivors while satisfying the aerial’s battery constraint.
This optimization problem can be expressed as follows: where and are the aerial’s flying speed, hovering time, battery capacity, average engine flying power, and hovering power, respectively. is the distance between grids and . is the estimated energy consumption of the aerial due to changing its direction to move from grid to [24]. where is the aerial’s changing direction angle expressed in terms of , which is the distance between two-dimensional coordinate of th way point, as follows [25]:

Since the aerial has no prior information about the survivors’ available data rates and traffic demands, it should self-optimize its trajectory. According to the traffic demand, the UAV should fly to the grid with the highest traffic/number of survivors while minimizing its battery consumption per visited grid. This is done by observing the survivors’ traffic at each visited grid.
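The battery constraint described in this section can be checked numerically for a candidate trajectory. The following is a minimal sketch under stated assumptions, not the exact formulation of this paper: the function names, the dictionary of grid centers, and the linear per-radian turning cost `turn_cost_per_rad` are hypothetical, and the turning model of [24] may take a different form. Flying energy is taken as flying power times flight time, and hovering energy as hovering power times hovering time per served grid.

```python
import math

def turn_angle(p_prev, p_cur, p_next):
    """Absolute heading change (radians) at waypoint p_cur."""
    h_in = math.atan2(p_cur[1] - p_prev[1], p_cur[0] - p_prev[0])
    h_out = math.atan2(p_next[1] - p_cur[1], p_next[0] - p_cur[0])
    diff = abs(h_out - h_in) % (2 * math.pi)
    return min(diff, 2 * math.pi - diff)

def trajectory_energy(centers, path, speed, hover_time, p_fly, p_hover,
                      turn_cost_per_rad=0.0):
    """Energy (J) of a closed path that starts and ends at the charging grid.

    centers: grid id -> (x, y) in meters; path: list of grid ids.
    turn_cost_per_rad is an assumed linear turning cost (hypothetical).
    """
    energy = 0.0
    # Flying energy: power * (leg distance / speed) for every leg.
    for a, b in zip(path, path[1:]):
        energy += p_fly * math.dist(centers[a], centers[b]) / speed
    # Hovering energy: every grid between launch and return is served once.
    energy += p_hover * hover_time * len(path[1:-1])
    # Turning energy at every interior waypoint.
    for a, b, c in zip(path, path[1:], path[2:]):
        energy += turn_cost_per_rad * turn_angle(centers[a], centers[b], centers[c])
    return energy

def is_feasible(centers, path, battery, **kwargs):
    """True if the whole path fits within the battery capacity (J)."""
    return trajectory_energy(centers, path, **kwargs) <= battery
```

A trajectory is admissible only if `is_feasible` holds, which mirrors the requirement that the aerial must be able to return to the charging grid before its battery is depleted.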

4. Bandit-Based Aerial TPAs

Herein, we first describe how we model the aerial TP problem using the MAB hypothesis. Then, we explain the conventional solutions to the problem, i.e., UCB1, DUCB, and ε-exploration, followed by the proposed UCB-BC1 and UCB-BC2 algorithms.

4.1. MAB Formulation to Aerial TP Problem

The typical bandit game considers a player gambling in front of a slot machine with several arms. Every distinct arm delivers a payoff drawn from a specified distribution that is unknown to the player. In each trial, the player selects only one arm and pulls it. The player’s target is to maximize his long-term reward over all trials [4, 10]. Accordingly, (1) can be viewed as an extended MAB problem. The player is the aerial, the arms are the grids to visit, and the payoff is the number of assisted survivors from all hovered grids. Different from the original MAB criteria, here, the aerial/player has to fly over different grids to assist more survivors, leading to energy consumption that depends on the movement between the current grid and the next targeted one. Therefore, we develop two BC-MAB algorithms to efficiently address this problem, namely, UCB-BC1 and UCB-BC2.

4.2. UCB1 Algorithm

UCB1 is a well-known bandit algorithm suited to balancing the exploration-exploitation trade-off [3, 26]. It keeps updating its exploration-exploitation balance as it gathers more information about the environment. First, it focuses on exploring all arms; then, once every arm has been tried, it favors the arm with the highest calculated index. Applying this to the aerial TP problem, the aerial first selects each grid once, following the naive UCB1 policy [3]. Afterwards, at every trial , the aerial draws a grid/arm according to the following formula: where reflects the average reward (no. of aided survivors) delivered from grid at trial and is the selection count of grid/arm . The confidence interval of a grid shrinks as that grid is pulled more often; hence, its exploration term diminishes, and the player/aerial tries other, less-drawn arms/grids. Otherwise, the player/aerial exploits the grid with the highest past payoff to gain the maximum allowable reward.
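As an illustration of the selection rule described above, the following minimal sketch uses the textbook UCB1 index, i.e., the empirical mean plus an exploration bonus of sqrt(2 ln t / n). The function name and argument layout are assumptions made for this example.

```python
import math

def ucb1_select(mean_reward, pulls, t):
    """Return the index of the grid/arm chosen by textbook UCB1 at trial t.

    mean_reward[j]: average observed reward of grid j
    pulls[j]:       number of times grid j has been selected so far
    """
    # Initial phase: every grid is selected once before indices are compared.
    for j, n in enumerate(pulls):
        if n == 0:
            return j
    # Index = exploitation term + exploration bonus that shrinks with pulls[j].
    scores = [m + math.sqrt(2.0 * math.log(t) / n)
              for m, n in zip(mean_reward, pulls)]
    return max(range(len(scores)), key=scores.__getitem__)
```

Because the bonus decreases with the selection count, a rarely visited grid can win the comparison even when its average reward is lower, which is the exploration behavior discussed above.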

4.3. DUCB Algorithm [3]

DUCB, proposed in [3] for addressing the aerial TP problem, belongs to the pure exploration category of BC-MABs, where the budget is associated only with the exploration part and has no relation to the exploitation one. Hence, the target is to find the best arm among all explored arms given the constrained budget. It is an improved UCB version that addresses the aerial TP problem by subtracting the cost from the exploration term. This is done by appending two new terms to the exploration part of (2), namely the aerial’s navigation cost and its remaining energy . In the first trials, similar to the naive UCB1 algorithm, each arm/grid is selected once to assign each grid’s starting reward and to learn the aerial’s remaining energy and the flying cost of serving the next grid’s survivors. Here, the aerial maneuvers to the following grid only when its remaining energy is enough to return to the recharging grid. DUCB selects the following grid from via the following formula: where reflects the number of times that the aerial visited grid and is the mean payoff of grid . The distance between the following grid and the current one is . The distance between the last recharging grid and the next grid is denoted by , and is the aerial’s remaining power.

4.4. ε-Exploration Algorithm [3]

This is another proposal by the authors of [3], in which an arm/grid is selected with probability . The aerial chooses the th from based on a function that converts the payoffs into probabilities after checking the remaining battery power at each round with probability [3].

The aerial selects grid from as follows: where is the mean payoff of grid at time and . If , all grids are equally likely to be drawn, and if , the probability of drawing the grid with the highest average payoff approaches 1. Hence, the aerial draws the surrounding grids while reflecting the cost of the flight.
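The two limiting cases described above (equally probable grids at one extreme, near-certain choice of the best grid at the other) are characteristic of a Boltzmann (softmax) selection rule. The sketch below assumes such a rule with a hypothetical temperature parameter `tau`; the exact payoff-to-probability function of [3] may differ, and the battery check is omitted for brevity.

```python
import math
import random

def payoff_probabilities(mean_payoff, tau):
    """Convert mean payoffs to selection probabilities (assumed softmax form).

    Large tau -> nearly uniform; small tau -> mass on the best grid.
    """
    best = max(mean_payoff)
    # Subtracting the maximum keeps the exponentials numerically stable.
    weights = [math.exp((m - best) / tau) for m in mean_payoff]
    total = sum(weights)
    return [w / total for w in weights]

def draw_grid(mean_payoff, tau, rng=random):
    """Sample one grid index according to the probabilities above."""
    probs = payoff_probabilities(mean_payoff, tau)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]
```

The temperature controls how strongly the aerial commits to the grid with the highest average payoff, which plays the same role as the exploration parameter discussed in the text.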

4.5. Proposed UCB-BC1/BC2 Algorithms

BC-MABs fall into two main categories, namely, the pure exploration category, also known as best arm identification, and the exploitation-exploration category [27]. In the first category, as in [3], the budget is reflected only in the exploration term, without updating the exploitation term, in order to find the best arm. Unlike DUCB, the proposed UCB-BC1/BC2 algorithms belong to the second category, where the budget is reflected in both the exploitation and exploration terms.

Herein, we illustrate how the UCB-BC1 and UCB-BC2 algorithms are well suited to solving the aerial TP problem so as to serve more survivors. Many applications have complex scenarios in which the cost is time-varying even if the same arm is pulled, as in our considered scenario. Hence, the cost should be treated as a random variable, and we have to explore both the payoff and the cost of each arm in the UCB-BC1 and UCB-BC2 algorithms. The two proposed algorithms share the same exploitation term, which is the ratio of the payoffs (number of aided victims) to the costs (aerial energy consumption) of the arms. UCB-BC1 assumes that the minimum cost over all arms is known a priori, i.e., the survivors’ locations are well known. UCB-BC2 eliminates this need by making use of previous observations. The two algorithms therefore differ in scope: UCB-BC2 has a looser bound but a wider range of application than UCB-BC1, because the latter requires more prior knowledge.

Algorithm 1: Proposed UCB-BC1 and UCB-BC2 algorithms.
Output: ,
Input: , .
Initialization: At t = 1
 During the first grids, pull each arm/grid once.
 Calculate the index for each arm using the formula below.
while the aerial’s battery is not exhausted do
 UCB-BC1
 UCB-BC2
 Pull the grid that achieves: .
 Obtain and update of the selected grid.
end while

Unlike the UCB1-based TPA, which has no explicit stopping time, UCB-BC1/BC2 stop when the aerial runs out of battery. As the average payoff-to-cost ratio used as the exploitation term is the same in both algorithms, both favor the grids with the highest marginal payoffs. The difference between the two algorithms lies in their exploration policy.

In UCB-BC1, a parameter represents the prior knowledge about the lower bound of the expected costs [27]: where is the mean of the aerial’s cost over grid . This prior knowledge can be easily obtained using global positioning system- (GPS-) based localization [1]. However, such information may be difficult to obtain in other scenarios, where the GPS signal is lost or where localization would heavily drain the battery of the survivor’s handset. In this context, UCB-BC2 estimates the expected costs of the distributed survivors at each time step, i.e., , based on the previous observations of the visited grids’ costs. Thus, UCB-BC2 uses empirical observations to compute both the achievable payoffs and the minimum necessary cost expectations, as follows [27]: where is the estimated average aerial’s cost over grid from previous observations. After that, the estimate is used to calculate the exploration term. As a result, unlike UCB-BC1, this method does not require prior information and can be used in many applications. It is noteworthy that the UCB-BC2 equation in (9) cannot be obtained by simply replacing in UCB-BC1 with , because this simple replacement does not lead to a reasonable regret bound.
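To make the difference between the two exploration terms concrete, the sketch below follows the index forms commonly stated for budgeted UCB with variable costs, assuming rewards and costs are normalized to (0, 1]. The constants and the handling of early rounds are assumptions for illustration and are not guaranteed to match Equations (8) and (9) term by term; `lam` denotes the known lower bound on the expected costs and `lam_t` its empirical estimate.

```python
import math

def _width(t, n):
    """Confidence width sqrt(ln(t - 1) / n); requires t >= 2 and n >= 1."""
    return math.sqrt(math.log(t - 1) / n)

def ucb_bc1_index(mean_reward, mean_cost, n, t, lam):
    """Index with a known lower bound lam on the expected costs (assumed form)."""
    eps = _width(t, n)
    if lam <= eps:
        # Assumption: the bonus is undefined here, so force exploration.
        return math.inf
    return mean_reward / mean_cost + (1.0 + 1.0 / lam) * eps / (lam - eps)

def ucb_bc2_index(mean_reward, mean_cost, n, t, lam_t):
    """Index with lam_t estimated as the smallest empirical mean cost (assumed form)."""
    eps = _width(t, n)
    if lam_t <= eps:
        return math.inf
    return mean_reward / mean_cost + (1.0 / lam_t) * (1.0 + 1.0 / (lam_t - eps)) * eps
```

Both functions share the payoff-to-cost ratio as the exploitation term and differ only in the bonus, which shrinks as the grid is visited more often.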

Algorithm 1 explains the proposed UCB-BC1 and UCB-BC2 in detail. The input is the next grid, and the output is the number of aided survivors and the energy cost of the aerial. is the number of times that grid has been pulled before step , is the average payoff of grid before step , is the average cost, and is the index of the grid pulled by the algorithm at time . The confidence interval of a grid shrinks as that grid is pulled more often; hence, its exploration term diminishes, and the player/aerial tries other, less-drawn arms/grids. Otherwise, the player/aerial exploits the grid with the highest past payoff to gain the maximum allowable payoff.
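The overall loop of Algorithm 1 can be sketched as follows for the UCB-BC2 variant. This is an illustrative sketch under assumptions: `pull` is a hypothetical callback that returns a (reward, cost) pair with cost in (0, 1], the index form follows the variable-cost budgeted UCB literature rather than a verified copy of Equation (9), and the loop stops as soon as the next cost no longer fits the remaining budget.

```python
import math

def run_ucb_bc2(pull, n_arms, budget):
    """Budget-limited bandit loop; returns (total_reward, spent, pulls)."""
    pulls = [0] * n_arms
    sum_r = [0.0] * n_arms
    sum_c = [0.0] * n_arms
    total_reward, spent, t = 0.0, 0.0, 0

    def index(i, lam_t):
        eps = math.sqrt(math.log(t - 1) / pulls[i])
        if lam_t <= eps:
            return math.inf  # assumed rule: keep exploring this grid
        bonus = (1.0 / lam_t) * (1.0 + 1.0 / (lam_t - eps)) * eps
        return sum_r[i] / sum_c[i] + bonus

    while True:
        t += 1
        if t <= n_arms:
            arm = t - 1  # initialization: pull each grid once
        else:
            # Smallest empirical mean cost replaces the prior knowledge.
            lam_t = min(sum_c[i] / pulls[i] for i in range(n_arms))
            arm = max(range(n_arms), key=lambda i: index(i, lam_t))
        reward, cost = pull(arm)
        if spent + cost > budget:
            break  # battery exhausted: stop without counting this pull
        spent += cost
        total_reward += reward
        pulls[arm] += 1
        sum_r[arm] += reward
        sum_c[arm] += cost
    return total_reward, spent, pulls
```

The budget must cover at least the initialization phase for the empirical averages to be defined, which corresponds to the first visits of all grids in Algorithm 1.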

5. Numerical Simulations

Herein, we evaluate the performance of the proposed algorithms (UCB-BC1 and UCB-BC2) against the helical path, UCB1, DUCB, and ε-exploration-based TPAs. The helical path scans the disaster area via circles with expanding radii and returns to the initial point. Survivors are allocated randomly to each grid with a binomial distribution traffic , where is the number of survivors in grid and is the on-demand radio access probability, which equals 0.2. The simulation parameters are as follows: flying speed  km/h, hovering interval  s, grid area  m², survivor density of 48 survivors/grid, engine hovering power , engine flying power , aerial coverage  m, and .
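The binomial traffic model above can be reproduced with a few lines of code. The sketch below is an assumption-labeled illustration: the function name and the `seed` argument are hypothetical, and each of the 48 survivors in a grid is taken to request access independently with probability 0.2.

```python
import random

def sample_grid_demand(n_grids, survivors_per_grid=48, p_request=0.2, seed=None):
    """Draw the number of requesting survivors per grid from Binomial(n, p)."""
    rng = random.Random(seed)
    return [
        # A binomial draw is a sum of independent Bernoulli trials.
        sum(rng.random() < p_request for _ in range(survivors_per_grid))
        for _ in range(n_grids)
    ]
```

Under these parameters, the expected demand is 48 × 0.2 = 9.6 requests per grid.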

Figure 2 shows the total number of assisted survivors against the battery capacity in joules within a 9-grid area for the TPAs. For all schemes, as the battery capacity increases, the number of assisted survivors improves due to the prolonged aerial flying time. Both UCB-BC1 and UCB-BC2 assist more survivors than ε-exploration, DUCB, naive UCB1, and the helical path due to their efficient policies and cost-effectiveness under variable costs. Moreover, UCB-BC1 shows the best performance because its exploration term relies on prior knowledge of the minimum expected cost, whereas UCB-BC2 has to estimate it from previous observations. At  J, the assisted survivors are 1097, 2691, 3288, 4985, 6112, and 5577 for helical path, UCB1, DUCB, ε-exploration, UCB-BC1, and UCB-BC2, respectively.

Figure 3 repeats Figure 2 but for 25 grids. Since the number of inspected grids is increased, the number of served survivors also increases. Our proposed algorithms serve more survivors than the ε-exploration, DUCB, UCB1, and helical path-based TPAs. At  J, the outcomes are 3839, 6040, 6875, 9000, 13670, and 12460 assisted survivors for helical path, UCB1, DUCB, ε-exploration, UCB-BC1, and UCB-BC2, respectively.

Figure 4 shows the performance of the TPAs in terms of the total number of assisted survivors as a function of the number of visited grids at two battery capacities of  J and  J. As the number of grids increases, the overall number of survivors decreases slightly, especially for the BC-MAB schemes, because a larger number of visited grids leads to more flying power consumption by the aerial. Even so, our proposed algorithms retain an efficient energy management policy. Moreover, at  J, the number of assisted survivors for all compared schemes is larger than in the  J case due to the larger battery capacity and hence the longer flying and hovering times in the area. For both battery capacities, UCB-BC1 has the best performance, followed by UCB-BC2, due to their appropriate policies; besides, UCB-BC1 knows the exact locations of the survivors via GPS. At  J and 25 grids, the algorithms assist 251, 300, 450, 501, 620, and 580 survivors for helical path, UCB1, DUCB, ε-exploration, UCB-BC1, and UCB-BC2, respectively. For  J, the results are 711, 801, 951, 1051, 1349, and 1259 survivors in the same order.

Figure 5 shows the convergence rate performance (reward vs. time horizon) of the compared TPAs. UCB-BC1 has the fastest convergence, followed by UCB-BC2, while DUCB has the slowest convergence. Due to their efficient policies, our proposed algorithms converge faster, not only maximizing the number of served survivors but also minimizing the aerial’s energy consumption. At , the number of aided survivors equals 113, 211, 286, 423, 753, and 677 for helical path, UCB1, DUCB, ε-exploration, UCB-BC1, and UCB-BC2, respectively.

It is clear from the above figures that leveraging UCB-BC1 and UCB-BC2 prolongs the aerial battery lifetime via an efficient trajectory planning policy. This increases the number of assisted survivors, owing to the more realistic exploration-exploitation treatment obtained by including the cost (battery consumption, including the aerial’s changes of direction in Equation (2)) in Equations (8) and (9), respectively. UCB-BC1 knows the exact locations of the survivors via the GPS service, hence providing the best performance. Furthermore, both schemes converge to the optimum faster than the other solutions.

Regarding computational complexity, the complexity of the ε-exploration algorithm comes from two parts. The first part is due to selecting a grid location at random with the probability of resulting in computational complexity of . The second part comes from selecting a grid that yields the highest average reward with probability , resulting in computational complexity of . Thus, the total computational complexity of the ε-exploration algorithm will be . For the DUCB algorithm and the proposed UCB-BC1 and UCB-BC2 algorithms, the computational complexities are comparable to that of the original UCB1, as their main source is selecting the upper bound grid and updating its related parameters, with computational complexity order of [28]. Finally, the computational complexity of the helical-based trajectory planning comes from selecting the next grid in the helical path, with computational complexity of .

6. Conclusion

In this paper, we addressed the trajectory planning optimization of an aerial in a disaster zone. The objective is to find the trajectory that maximizes the number of assisted survivors and minimizes battery consumption. This optimization problem was addressed as a MAB game by using two self-learning algorithms, i.e., UCB-BC1 and UCB-BC2. Numerical simulations demonstrated the effectiveness of the proposed algorithms over the helical path, naive UCB1, DUCB, and ε-exploration algorithms due to their efficient management of the exploration-exploitation trade-off. Future work will include investigations of UAV-NOMA scenarios and multiplayer MAB scenarios.

Data Availability

Data are available upon request from the corresponding author.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Authors’ Contributions

Sherief Hashima and Ramez Hosny are joint first authors.

Acknowledgments

This work was supported by the Japan Society for the Promotion of Science (JSPS) KAKENHI Grant Numbers JP21K14162 and JP22H03649 and also supported via funding from the Prince Sattam bin Abdulaziz University project number (PSAU/2023/R/1444).