Abstract
Although cloud service providers increasingly adopt advanced cloud infrastructure management, substantial execution time is still lost to under-utilised servers. Because reducing the total execution time (makespan) while sustaining Quality-of-Service (QoS) is a vital concern for cloud service providers, this study develops MCS-DQN, an enhanced scheduling algorithm that minimises the cloudlet scheduling (CS) makespan using the deep Q-network (DQN) algorithm. A novel reward function is proposed to improve the convergence of the DQN model. An open-source simulator (CloudSim) is employed to assess the performance of the suggested work. The results show that the proposed MCS-DQN scheduler achieves the best outcomes in minimising the makespan as well as the other considered metrics (task waiting time, virtual machine resource utilisation, and the degree of imbalance) compared with the benchmark algorithms.
1. Introduction
Cloud computing is an established shared-computing technology that dynamically delivers measurable on-demand services over the Internet [1]. It offers users virtually unlimited and diverse virtual resources that can be obtained on demand under different billing models (subscription-based and static) [2]. Cloudlet scheduling (CS), also called task scheduling (TS), maps independent tasks (including the tasks of workflow applications) onto the set of available resources in a cloud environment so that they execute within the QoS constraints specified by users (e.g., makespan and cost). Workflows, which are common in scientific applications such as astronomy, earthquake analysis, and biology, are increasingly migrated to the cloud for execution. Although identifying the optimal resource for every workflow task (to satisfy user-defined QoS) has been widely studied over the years, substantial difficulties remain:
(1) TS on a cloud computing platform is a well-known NP-hard problem
(2) TS has multiple optimisation objectives, such as reducing the completion time and increasing resource usage for the entire task queue
(3) the dynamics, scalability, and heterogeneity of cloud resources result in high complexity
Recent research has sought to enhance TS in cloud environments through artificial intelligence algorithms, particularly metaheuristics such as particle swarm optimisation (PSO) [3], ant colony optimisation, and the genetic algorithm (GA). This article does not rely on these algorithms; instead, it proposes a viable alternative and compares it with one of the metaheuristic algorithms, PSO, a technique widely used in the task scheduling area. The primary objective of the proposed method is to provide a novel DQN-based scheduler that achieves better results on common TS measures (waiting time, makespan reduction, and resource utilisation).
The remaining sections are arranged as follows: Section 2 reviews related work, Section 3 presents the background on reinforcement learning and the DQN algorithm, Section 4 describes the proposed work, Section 5 explains the experimental setup and simulation outcomes, and Section 6 concludes the study.
2. Related Work
In cloud computing, TS (also called job scheduling or resource selection) is one of the most challenging problems and has attracted the attention of both cloud service providers and customers. Several studies on TS have reported positive outcomes. The research contributions on cloud resource scheduling in the literature can be divided into the following categories according to the techniques used.
2.1. Heuristic-Based Research
Heuristic algorithms, including metaheuristics [4] built on intuition or experimental development, offer an affordable way to find near-optimal solutions to optimisation problems. Because the gap between the optimal and a feasible solution is hard to bound, many studies applied metaheuristic algorithms such as PSO [5], GA [6], and ACO [7] to derive TS policies in cloud computing. Huang et al. [8] recommended a PSO-based scheduler with a logarithm-reducing approach for makespan optimisation and achieved higher performance than other heuristic algorithms. Liang et al. [9] suggested a PSO-based TS approach for cloud computing that omits some inferior particles to accelerate the convergence rate and dynamically adjusts the PSO parameters; their experiments showed improved outcomes over competing algorithms. A proposed modification of the genetic algorithm crossover and mutation operators led to the flexible genetic algorithm operators (FGAO) [10]; FGAO reduced execution time and the number of iterations compared with GA. Musa et al. [11] recommended an improved GA-PSO hybrid that applies small position values (SPV) to the initial population to reduce randomness and improve convergence speed; the hybrid achieved better resource usage and makespan than the conventional GA-PSO algorithm. Yi et al. [12] recommended a task scheduler model based on an enhanced ant colony algorithm for cyber-physical systems (CPS); numerical simulation showed that the model resolves local search ability and TS quality concerns. Peng et al. [13] proposed a scheduling algorithm based on two-phase best heuristic scheduling in cloud computing to reduce the makespan and energy metrics. The authors of [14] suggested a VM clustering technique that allocates VMs according to the duration of the requested task and the bandwidth level in order to improve efficiency, availability, and other factors such as VM utilisation, bucket size, and task execution time. Sun and Qi [15] proposed a hybrid task scheduler based on local search and differential evolution (DE) to improve the makespan and cost metrics. The authors of [16] presented a parallel optimised relay selection protocol to minimise latency, collision, and energy for wake-up radio-enabled WSNs.
2.2. Reinforcement Learning-Based Research
Reinforcement learning (RL) is a machine-learning paradigm in which an agent interacts with a given environment through consecutive trials to learn an optimal policy, here an optimal TS method. RL has recently attracted much attention in cloud computing. For example, a higher TS success rate and lower delay and energy consumption were attained in [17] with a Q-learning-oriented and flexible TS approach from a global viewpoint (QFTS-GV). In [18], Ding et al. recommended a Q-learning-based task scheduler for energy-efficient cloud computing (QEEC); driven by an M/M/S queueing model and the Q-learning method, QEEC proved to be the most energy-efficient task scheduler among the compared approaches. In [19], a Q-learning-based TS algorithm was proposed for wireless sensor networks (WSN), establishing Q-learning scheduling on time division multiple access (QS-TDMA); the results indicate that QS-TDMA approaches an optimal TS algorithm and can enhance real-time WSN performance. In [20], Che et al. recommended a novel TS model with a deep RL (DRL) algorithm that incorporates TS into resource-utilisation (RU) optimisation; evaluated against conventional TS algorithms on real datasets, the model achieved higher performance on the defined metrics. Another task scheduler under the DRL architecture (RLTS) was suggested by Dong et al. [21] to minimise task execution time in a setting where tasks are dynamically linked to cloud servers; compared against four heuristic counterparts, RLTS efficiently resolves TS in a cloud manufacturing setting. In [22], a cloud-edge collaboration scheduler was constructed following the asynchronous advantage actor-critic method (CECS-A3C); the simulation outcomes demonstrate that CECS-A3C decreases the task processing period compared with the existing DQN and RL-G algorithms. The authors of [23] suggest a learning-based approach built on the deep deterministic policy gradient algorithm to improve fog resource provisioning for mobile devices. Wang et al. [24] introduced an adaptive data placement architecture based on LSTM and Q-learning that can modify the data placement strategy to maximise data availability while minimising overall cost. The authors of [25] presented a hybrid deep neural network scheduler to solve task scheduling issues and minimise the makespan metric. Wu et al. [26] utilised DRL to address scheduling in edge computing, enhancing the quality of the services offered to consumers in IoT applications. The authors of [27] applied a DQN model in a multiagent reinforcement learning setting to control task scheduling in cloud computing.
3. Background
3.1. The RL
RL theory is inspired by psychological and neuroscientific views of human behaviour [28]: an agent contextually selects a pertinent action (from a set of actions) so as to maximise the cumulative reward. Although a trial-and-error approach is initially used to attain the goal (RL is not given a direct path), the accumulated experience eventually guides the agent toward an optimal path. The agent determines the most appropriate action based only on the current condition, as in a Markov decision process [29]. Figure 1 presents a pictorial RL representation, where the RL model encompasses the following elements [30]:
(1) a set of environment and agent states (S)
(2) a set of actions (A) of the agent
(3) policies for transitioning from states to actions
(4) rules that identify the immediate scalar reward of a transition
(5) rules that outline what the agent perceives

3.2. The Q-Learning
One of the solutions for the reinforcement problem in polynomial time is Q-learning. Because Q-learning can handle problems involving stochastic transitions and rewards without requiring a transition or reward model, it is known as a "model-free" approach. Although RL proved successful in different domains (e.g., game playing), it was previously restricted to low-dimensional state spaces or to domains in which features were assigned manually. Equation (1) presents the Q-value update, where $s$ denotes the current agent state, $a$ the chosen action, $r$ the immediate reward, $\alpha$ the learning rate, $\gamma$ the discount factor, and $Q(s, a)$ the value of taking action $a$ in state $s$:

$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$   (1)

Learning begins with trial and error; after training, decisions follow the policy values that yielded the highest rewards.
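To make the update concrete, the following minimal sketch applies equation (1) to a small tabular Q-function; the toy state/action sizes, the sample transition, and the hyperparameter values are illustrative assumptions, not taken from the paper.

```python
# Minimal tabular Q-learning sketch of equation (1); the toy sizes, transition,
# and hyperparameters are illustrative assumptions only.
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.95):
    """Apply Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q

# Example: 4 states, 2 actions, one hypothetical transition.
Q = np.zeros((4, 2))
Q = q_learning_update(Q, s=0, a=1, r=1.0, s_next=2)
print(Q[0, 1])  # 0.1 after one update
```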
3.3. The DQN Architecture
Training stores specific parameters as agent experiences: $s$ (the current state), $a$ (the action), $r$ (the reward), $s'$ (the next state), and a Boolean flag identifying goal attainment. The initial idea is to feed the state to the neural network; the output then denotes the Q-value indicating how good each possible action would be in the given state (see Figure 2).

3.4. Experience Replay
Experience replay [31] reflects the capacity to learn from mistakes and adjust rather than repeating the same errors. Training stores several parameters as agent experiences: $s$ (the current state), $a$ (the action), $r$ (the reward), $s'$ (the next state), and the Boolean flag identifying goal attainment. All experiences are stored in a fixed-size memory as raw data for the neural network, without being linked to Q-values. Once the memory fills during the training process, arbitrary batches of a specific size are drawn from it, and when new experiences are inserted into a full memory, the oldest experiences are eliminated. In this way, experience replay deters overfitting; notably, the same data can be used multiple times for network training, which mitigates insufficient training data.
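A minimal sketch of such a buffer follows: a fixed-size FIFO memory of (state, action, reward, next_state, done) tuples with random mini-batch sampling, as described above. The capacity and batch size are assumptions.

```python
# Minimal experience-replay buffer sketch; capacity and batch size are assumed.
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=10_000):
        self.memory = deque(maxlen=capacity)  # oldest experiences drop out first

    def store(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # Random batches break correlations and let the same experience be
        # reused across several training steps.
        return random.sample(self.memory, batch_size)

    def __len__(self):
        return len(self.memory)
```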
4. Proposed DQN Algorithm
4.1. The TS Problem
TS in cloud computing is one of the vital mechanisms for reconciling the largely overlapping needs of cloud providers and users, including QoS and high profit [32]. Cloud service providers strive to attain optimal utilisation of the virtual machine (VM) group by reducing makespan and waiting time. As shown in Figure 3, a large set of independent tasks with varying parameters is submitted by multiple users and managed by the cloud provider; the cloud broker then delegates the tasks to the available VMs [33]. Different optimisation algorithms can be employed to attain optimal VM utilisation. Equation (2) computes the overall execution time (makespan) as follows:

$\text{makespan} = \max_{j \in \{1, \dots, m\}} \sum_{i \in C_j} ET_{ij}$   (2)

where $ET_{ij}$ denotes the execution time of cloudlet $i$ on $VM_j$ [34], $C_j$ is the set of cloudlets assigned to $VM_j$, $n$ is the total number of cloudlets, and the makespan reflects the complete execution time of the whole set of cloudlets. Figure 4 presents an example of the first-come first-served (FCFS) scheduling process with 2 virtual machines and 7 tasks, each with a different length in time units. The makespan is the largest total execution time among the VMs; here it is reached on VM2 and equals 45.
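The makespan of an assignment can be computed directly from equation (2). In the sketch below, the seven task lengths and their assignment to the two VMs are hypothetical values chosen so that VM2 totals 45, mirroring the Figure 4 example; the actual task lengths in the figure are not given in the text.

```python
# Makespan of a cloudlet-to-VM assignment (equation (2)): the largest total
# execution time over all VMs. Task lengths are hypothetical; only the VM2
# total of 45 matches the Figure 4 example.
def makespan(assignment: dict) -> int:
    return max(sum(lengths) for lengths in assignment.values())

fcfs_assignment = {
    "VM1": [10, 12, 8],      # 30 time units in total
    "VM2": [15, 20, 5, 5],   # 45 time units in total
}
print(makespan(fcfs_assignment))  # 45, reached on VM2
```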

4.2. Environment Definition
This study considers a system with multiple virtual machines and cloudlets. Every VM has specific attributes (processing power in MIPS, memory in GB, and bandwidth in GB/s). Users submit distinct cloudlets that arrive in a queue, and the broker applies the defined scheduling algorithm to assign every cloudlet to an adequate VM. Because the broker scheduling algorithm must make an assignment decision for every cloudlet taken from the queue, the system state changes with each decision. Figure 5 presents the scheduling of a cloudlet with a length of 3 to the selected VM.

4.2.1. State Space
Only the time taken by each virtual machine to execute its assigned tasks is considered when defining the system state. The time counted for each virtual machine is the total run time of the cloudlets currently assigned to it. These per-VM run times allow the makespan to be computed, and the state changes after each new cloudlet assignment. For a system with $m$ VMs, the state is therefore given by $S = (T_1, T_2, \dots, T_m)$, where $T_j = \sum_{i \in C_j} ET_{ij}$ denotes the total run time of the cloudlets in the set $C_j$ assigned to $VM_j$. Figure 5 illustrates the state before and after the assignment.
4.2.2. Action Space
Available agent actions are defined in the action space. The broker scheduling algorithm must choose one VM among all available VMs to schedule the current task from the queue, so the agent acts in a space whose dimension equals the number of VMs; the action space thus corresponds to all VMs in the system. With $m$ VMs, the action space is given by $A = \{1, 2, \dots, m\}$, where the chosen value denotes the index of the VM conceded by the scheduler for the cloudlet assignment. In Figure 5, the chosen action is 2.
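A compact representation of this environment is sketched below: the state is the vector of total run times per VM and an action is simply a VM index. The helper names and the example run-time values are illustrative, not taken from Figure 5; the cloudlet length of 3 and the chosen action 2 follow the example above.

```python
# State = total run time per VM; action = index of the VM receiving the
# arriving cloudlet. Names and example run times are illustrative.
from typing import List

def apply_action(state: List[float], action: int, cloudlet_length: float) -> List[float]:
    """Return the next state after assigning a cloudlet to VM `action`."""
    next_state = list(state)
    next_state[action] += cloudlet_length
    return next_state

state = [7.0, 10.0, 4.0, 9.0, 6.0]    # five VMs (hypothetical run times)
next_state = apply_action(state, action=2, cloudlet_length=3.0)
print(next_state)                      # the VM at index 2 now carries 3 more time units
```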
4.3. Model Training
The MCS-DQN model is retrained in each episode following the workflow in Figure 6, as shown in the sketch after this list:
Step 1: the environment and agent contexts are established, including the server, virtual machine, and cloudlet attributes.
Step 2: the environment state and the cloudlet queues are reset.
Step 3: the next cloudlet is selected from the cloudlet queues.
Step 4: the agent selects the next action according to the current environment state and the ε factor. The ε factor (exploration rate) governs the choice between exploration and exploitation in every iteration: the probability that the agent arbitrarily chooses a VM (exploration) is ε, while the probability that the agent chooses a VM according to the model (exploitation) is 1 − ε. The ε factor (initialised to one) is reduced in every iteration by a decay factor.
Step 5: the environment state is updated by adding the cloudlet execution time to the chosen VM.
Step 6: the environment produces a reward according to the recommended reward function described in the following subsection.
Step 7: the agent saves the played experience into the experience replay queue.
Step 8: after storing the experience, the algorithm checks whether more cloudlets remain to be scheduled; if so, it repeats from Step 3.
Step 9: the model is retrained in every episode (after completing all cloudlet queues) with a batch of experiences drawn from the experience queue. The experience replay queue is used as a FIFO queue, and the oldest experience is discarded when the queue reaches its limit.
Step 10: the algorithm repeats from Step 2 if the number of iterations has not yet reached the predefined episode limit.
Step 11: the trained MCS-DQN model is saved and the procedure exits.
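The following sketch mirrors Steps 2–10 for a single training episode: ε-greedy action selection, state update, reward, experience storage, and ε decay. The `env` and `agent` objects and the decay constants are assumed placeholders rather than the authors' implementation; `buffer` can be the replay buffer sketched in Section 3.4.

```python
# One MCS-DQN training episode following Steps 2-10 (sketch only).
# `env`, `agent`, and the decay constants are assumed placeholders.
import random

def run_episode(env, agent, buffer, epsilon, eps_min=0.01, eps_decay=0.995,
                batch_size=32):
    state = env.reset()                        # Step 2: reset state and queues
    for cloudlet in env.cloudlet_queue:        # Step 3: next cloudlet
        if random.random() < epsilon:          # Step 4: explore ...
            action = random.randrange(env.num_vms)
        else:                                  # ... or exploit the model
            action = agent.best_action(state)
        next_state, reward = env.step(cloudlet, action)         # Steps 5-6
        buffer.store(state, action, reward, next_state, False)  # Step 7
        state = next_state
        epsilon = max(eps_min, epsilon * eps_decay)              # decay per iteration
    if len(buffer) >= batch_size:              # Step 9: retrain once per episode
        agent.train(buffer.sample(batch_size))
    return epsilon
```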

4.4. Reward Function
The recommended reward function used with the MCS-DQN model is given in Algorithm 1. First, the makespan of every potential assignment is computed; then every VM is ranked according to the resulting makespan. A simple example of the MCS-DQN reward computation is shown in Figure 7. The example shows the reward computation for a specific VM state, described by the total execution time on each VM. With five VMs, each VM holds a set of cloudlet execution times, and a newly arrived cloudlet with a length of five is to be scheduled to VM2 (see Figure 7(a)). The reward is obtained by iterating over the VMs, creating a copy of the VM state in each iteration, adding the cloudlet to the VM chosen in that iteration, and computing the resulting makespan. Figure 7(b) shows the first iteration, where the arrived cloudlet is added to VM1, and Figure 7(c) shows the computed makespans: the makespan would be 14 if the cloudlet were added to VM1, 13 if added to VM2, and so on. The computed makespans are then ranked from lowest to highest (see Figure 7(d)), with the highest score given to the lowest makespan and the score decreasing for each subsequent makespan (see Figure 7(e)). Finally, the reward is the score corresponding to the VM actually chosen for scheduling; in this example, scheduling to VM2 yields a reward of 2.
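To make the ranking procedure concrete, the sketch below computes, for each candidate VM, the would-be makespan of assigning the arriving cloudlet there, ranks the candidates from lowest to highest makespan, and returns the rank score of the VM that was actually chosen. The variable names, the scoring scale, and the example values are our own reading of Algorithm 1 and Figure 7, not the authors' exact pseudocode.

```python
# Ranking-based reward (our reading of Algorithm 1 / Figure 7): the lowest
# would-be makespan gets the highest score, and the reward is the score of
# the VM actually chosen. Example values are hypothetical.
from typing import List

def mcs_dqn_reward(vm_run_times: List[float], cloudlet_length: float,
                   chosen_vm: int) -> int:
    num_vms = len(vm_run_times)
    # Would-be makespan if the cloudlet were placed on each VM in turn.
    candidate_makespans = []
    for vm in range(num_vms):
        trial = list(vm_run_times)        # copy the VM state
        trial[vm] += cloudlet_length      # add the cloudlet to this VM
        candidate_makespans.append(max(trial))
    # Rank: lowest makespan -> highest score (num_vms - 1 down to 0).
    order = sorted(range(num_vms), key=lambda vm: candidate_makespans[vm])
    scores = {vm: num_vms - 1 - rank for rank, vm in enumerate(order)}
    return scores[chosen_vm]

# Hypothetical five-VM state; a cloudlet of length 5 is scheduled to VM index 1.
print(mcs_dqn_reward([9.0, 8.0, 10.0, 7.0, 11.0], 5.0, chosen_vm=1))
```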
5. Results and Discussion
5.1. Experimental Setup
The trained deep Q-learning model was assessed against the FCFS and PSO algorithms using the CloudSim simulator.
5.1.1. CloudSim Parameters
CloudSim is a modular simulation toolkit for modelling and simulating cloud computing systems and application provisioning environments [33]. It enables the modelling of cloud system components such as data centres, virtual machines (VMs), and resource provisioning rules on both a system and behavioural level [33].
The CloudSim configuration in the implementation begins by establishing one data centre, two hosts, and five VMs with the parameters in Table 1. This configuration is taken from example 6 of the CloudSim source code available on GitHub (CloudSim codebase: https://github.com/Cloudslab/cloudsim), which is based on real server and VM information. At the VM level, a time-shared policy (one of the two scheduling policies available in CloudSim) was selected; the time-shared policy allows VMs and cloudlets to multitask and progress simultaneously within the host. Moreover, the task data used in the experiments are real-world workloads of real computer systems recorded by the High-Performance Computing Center North (HPC2N) in Sweden (the HPC2N data: https://www.cse.huji.ac.il/labs/parallel/workload/l_hpc2n/). The data contain information about tasks such as the number of processors, the average CPU time, the used memory, and other task specifications. The tasks taken from this workload are completely different from the independent tasks used to train the model.
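For illustration, the sketch below reads jobs from a Standard Workload Format (SWF) file such as the HPC2N trace, keeping the fields mentioned above (processors, run time, memory). The field positions follow the published SWF specification, and the conversion of run time and processors into a cloudlet length in million instructions is an assumption we add, not the authors' procedure.

```python
# Sketch: read jobs from an SWF-format workload file such as the HPC2N trace.
# Field positions follow the SWF specification; converting run time x
# processors into a cloudlet length (MI) is our own assumption.
def read_swf_tasks(path: str, mips_per_core: int = 1000):
    tasks = []
    with open(path) as f:
        for line in f:
            if line.startswith(";") or not line.strip():
                continue                      # skip SWF header comments
            fields = line.split()
            run_time = int(fields[3])         # field 4: run time (s)
            processors = int(fields[4])       # field 5: allocated processors
            memory = int(fields[6])           # field 7: used memory (KB)
            if run_time <= 0 or processors <= 0:
                continue                      # drop cancelled/invalid jobs
            length_mi = run_time * processors * mips_per_core
            tasks.append({"length_mi": length_mi, "pes": processors,
                          "memory_kb": memory})
    return tasks
```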
5.1.2. The MCS-DQN Model Parameters
The MCS-DQN model employs a neural network with five fully connected layers (see Figure 8): an input layer (for the state), three hidden layers (64 × 128 × 128), and an output layer (for the actions). The network was taken from an original Keras RL tutorial [35] and modified to fit the defined environment. Training was executed with the parameters in Table 2; these parameters were obtained after several training runs, selecting those that gave a high score in queue scheduling.
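A minimal Keras sketch of the architecture described above is shown below. The layer sizes follow Figure 8; the activation functions, optimiser, and learning rate are assumptions, not values reported in the paper.

```python
# Sketch of the MCS-DQN network: state input, hidden layers of 64, 128, and
# 128 units, and one output per VM/action. Activations and optimiser settings
# are assumptions.
from tensorflow import keras
from tensorflow.keras import layers

def build_mcs_dqn(num_vms: int) -> keras.Model:
    model = keras.Sequential([
        keras.Input(shape=(num_vms,)),               # state: total run time per VM
        layers.Dense(64, activation="relu"),
        layers.Dense(128, activation="relu"),
        layers.Dense(128, activation="relu"),
        layers.Dense(num_vms, activation="linear"),  # one Q-value per VM
    ])
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-3),
                  loss="mse")
    return model
```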

5.1.3. PSO Parameters
The PSO algorithm was applied following the version recommended in [5], with 1000 iterations, 500 particles, acceleration coefficients c1 and c2 both set to 1.49445, and a fixed inertia weight of 0.9.
5.2. Experimental Results and Analysis
Figure 9 presents the average evaluation score of the MCS-DQN agent over 800 episodes. Learning remained stable across the roughly 800 training iterations. The evolution of the ε parameter of the ε-greedy exploration method during training is also shown. Since the agent's score increased once ε began decaying, MCS-DQN could already generate sufficiently good Q-value estimates to explore states and actions more deliberately, which accelerated the agent's learning process.

After the training process, various cloudlet sets were executed with the saved MCS-DQN scheduler model and with the FCFS and PSO algorithms to assess every metric. Since all cloudlets of a set were executed simultaneously, this study focuses primarily on the makespan metric (the elapsed time when executing a group of cloudlets simultaneously on the available VMs). Figure 10 shows that the proposed scheduler achieves a lower makespan than the other algorithms.

The makespan metric (employed as the primary model training objective) impacts other performance metrics:
(1) The degree of imbalance (DI) metric demonstrates the load balancing between VMs; it measures the imbalance between VMs when executing a set of cloudlets simultaneously, and reducing DI yields a more balanced system. Equation (3) is used to compute DI:

$DI = \frac{T_{\max} - T_{\min}}{T_{avg}}$   (3)

where $T_{avg}$, $T_{\min}$, and $T_{\max}$ are the average, minimum, and maximum total execution time over all VMs [34]. Figure 11 shows that the recommended MCS-DQN scheduler minimised the DI metric for every cloudlet set, yielding a better load-balanced system.
(2) For the waiting time (WT) metric, cloudlets arrive in a queue and are executed according to the scheduling algorithm. The average waiting time over the whole cloudlet sequence is computed with equation (4):

$WT_{avg} = \frac{1}{n} \sum_{i=1}^{n} WT_i$   (4)

where $WT_i$ denotes the waiting time of cloudlet $i$ and $n$ is the queue length. Figure 12 shows that the recommended MCS-DQN scheduler offers an efficient alternative that improves the speed and effectiveness of cloudlet queue management by reducing the cloudlet waiting time and queue length.
(3) The RU metric is vital for keeping resource utilisation high during the CS process. Equation (5) is employed to compute the average RU [34].


Reconstructed from the definitions given, the average RU can be written as $RU_{avg} = \frac{\sum_{j=1}^{m} T_j}{\text{makespan} \times m}$, where the makespan denotes the duration needed to complete all cloudlets, $m$ is the number of resources, and $T_j$ is the busy time of $VM_j$ as defined in Section 4.2.1. In Figure 13, the recommended MCS-DQN scheduler outperforms PSO and FCFS in RU. The MCS-DQN scheduler keeps resources busy during CS, which matters because service providers aim to earn high profit from renting limited resources.
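The three metrics above can be computed directly from the per-VM busy times and the per-cloudlet waiting times. The sketch below follows equations (3)–(5) as reconstructed here; the input values are hypothetical.

```python
# Degree of imbalance, average waiting time, and average resource utilisation
# following equations (3)-(5) as reconstructed above; input values are hypothetical.
def degree_of_imbalance(vm_times):
    t_avg = sum(vm_times) / len(vm_times)
    return (max(vm_times) - min(vm_times)) / t_avg

def average_waiting_time(waiting_times):
    return sum(waiting_times) / len(waiting_times)

def average_resource_utilisation(vm_times):
    makespan = max(vm_times)
    return sum(vm_times) / (makespan * len(vm_times))

vm_times = [38.0, 45.0, 41.0, 36.0, 44.0]      # busy time per VM (hypothetical)
print(degree_of_imbalance(vm_times))           # lower means better balanced
print(average_resource_utilisation(vm_times))  # closer to 1 means busier VMs
```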

Furthermore, to demonstrate the effectiveness of the proposed work, additional executions were conducted with the same VM configuration. Figure 14 illustrates the results of these executions, in which the number of virtual machines was increased to 10, 15, 20, and 30; for each set of VMs, 60, 140, and 200 tasks were scheduled, respectively. These experiments focused on the makespan, since it is the main metric, and compared MCS-DQN with PSO, the baseline scheduling algorithm chosen in this work. The proposed MCS-DQN algorithm still outperforms the PSO scheduler in these additional experiments.

However, the suggested approach is restricted to a fixed number of virtual machines, and any change in the number of virtual machines requires retraining the model. In the future, we intend to concentrate on variable-length output prediction so that the number of VMs does not impact the model and no retraining is necessary for every change in the VMs.
6. Conclusion
This study presented an effective CS approach using deep Q-learning in cloud computing. The MCS-DQN scheduler addresses the TS problem and optimises the related metrics. The simulation outcomes reveal that the presented work attains the best performance in minimising waiting time and makespan and maximising resource utilisation. The recommended algorithm also accounts for load balancing when distributing cloudlets to the available resources, outperforming the PSO and FCFS algorithms. The proposed model can be applied to solve task scheduling problems in cloud computing, specifically in the cloud broker. To address the limitation of a fixed number of VMs, we plan to enhance this work by relying on variable-length output prediction using dynamic neural networks so that various numbers of VMs can be handled, as well as by adding other optimisation approaches and taking into account more efficiency metrics such as task priority, VM migration, and energy consumption. Furthermore, assuming that n tasks are scheduled to m fog computing resources, the proposed algorithm can be adjusted to work on edge computing; this may also be an idea for future work.
Data Availability
The data for this research are available in the “Parallel Workloads Archive: HPC2N Seth”: https://www.cse.huji.ac.il/labs/parallel/workload/l_hpc2n/.
Conflicts of Interest
The authors declare no conflicts of interest.
Authors’ Contributions
All authors participated in the article's development, including information gathering, editing, modelling, and reviewing. The final manuscript was reviewed and approved by all authors.
Acknowledgments
The Laboratory of Emerging Technologies Monitoring in Settat, Morocco, provided assistance for this study.