Abstract
In a distributed online optimization problem with a convex constraint set over an undirected multiagent network, the local objective functions are convex and vary over time. Most existing methods for solving this problem are based on gradient descent. However, the convergence speed of these methods decreases as the number of iterations increases. To accelerate convergence, we present a distributed online conjugate gradient algorithm, which differs from a gradient method in that the search directions are a set of vectors that are conjugate to each other and the step sizes are obtained through an exact line search. We analyze the convergence of the algorithm theoretically and obtain a regret bound of $O(\sqrt{T})$, where T is the number of iterations. Finally, numerical experiments conducted on a sensor network demonstrate the performance of the proposed algorithm.
1. Introduction
Distributed optimization has received considerable interest in science and engineering and can be applied in numerous fields such as distributed tracking and localization [1], multiagent coordination [2], distributed estimation using sensor networks [3–5], and machine learning [6]. Such problems can be modeled as minimizing (or maximizing) the sum of local convex functions, where each local function relies only on local computation and communication in a distributed manner. With the increase in network size and data volume, more effective distributed algorithms have become a hot research topic. In recent years, many scholars have proposed various distributed optimization algorithms to solve such problems [7–15].
Most of the existing algorithms assume that the cost function at each agent is fixed. However, in practical problems, the environment of an agent is uncertain and the cost function of each agent changes over time, requiring us to solve such problems in an online setting. To be more precise, in distributed online optimization, the cost function of each agent changes at every step: at iteration t, the cost function of each agent is unknown before a decision is made; only after choosing a decision from the constraint set does the agent learn the cost function, and it simultaneously incurs a loss. This loss reflects the gap in objective cost between the current decision point and the best fixed decision in hindsight, which we call regret. Regret is an important criterion for evaluating a distributed online algorithm. A well-performing distributed online optimization algorithm should drive the average regret to zero over time.
Because online distributed optimization algorithms are more consistent with practical problems, many scholars have studied them and some effective algorithms have been proposed [16–24]. Yan et al. [20] introduced a distributed autonomous online learning algorithm, namely, a projected subgradient descent method, and derived regret bounds for both strongly convex and general convex local objective functions. The authors in [22] introduced an online distributed push-sum algorithm in which the search direction in each iteration is a negative subgradient, achieving a regret of $O((\log T)^2)$ when the local functions are strongly convex. For a time-varying directed network, Zhu et al. [25, 26] proposed a distributed online optimization algorithm in which a negative subgradient is randomly selected as the search direction during each iteration. The authors in [27] presented a distributed online algorithm based on primal-dual dynamic mirror descent for problems with time-varying coupling inequality constraints and obtained a dynamic regret bound. The authors in [28] proposed a distributed online conditional gradient algorithm for constrained distributed online optimization problems in the Internet of Things.
Existing distributed online optimization algorithms based on gradient methods are simple to compute and require little storage; however, to ensure convergence, the iterative step length usually needs to decrease as the number of iterations grows, which leads to a zigzag path at the end of the algorithm. That is, the algorithm carries out multiple iterations along the same or nearly the same direction, which greatly increases its computational time. The conjugate gradient algorithm also has the advantages of simple calculations and guaranteed convergence under certain conditions [29–31] but differs from the gradient method in that its search directions form a group of conjugate or approximately conjugate vectors, so during the later stage of the algorithm there are no repeated iterations along the same or nearly the same direction. Thus, the conjugate gradient method generally converges faster than the gradient descent method. In particular, for a quadratic objective function, the conjugate gradient method terminates in finitely many steps (quadratic termination). Based on these advantages, the conjugate gradient method has been used to solve numerous centralized offline optimization problems [32–35]. According to the existing literature, however, the conjugate gradient method has not been applied to distributed online optimization problems. To fill this gap, we present a distributed online conjugate gradient algorithm herein.
There are two main contributions of the present study. First, a new algorithm for the distributed online constrained convex optimization problem, namely, a distributed online conjugate gradient algorithm, is proposed. In our algorithm, a set of conjugate directions replaces the gradient directions used in a traditional gradient descent method, and the step size is obtained through an exact line search, thus effectively avoiding the slow convergence of a traditional gradient descent algorithm during the later stage. Second, we provide a careful analysis of the convergence of the proposed algorithm and obtain a regret bound of $O(\sqrt{T})$.
The remainder of this paper is organized as follows: in Section 2, we first briefly introduce the distributed online optimization model, followed by some necessary mathematical preliminaries and assumptions used in this study. We also provide a detailed statement of our algorithm in Section 3 and an analysis of the convergence of the algorithm in Section 4. The simulation results of our algorithm are then presented in Section 5. Finally, we provide some concluding remarks in Section 6. In addition, further detailed proofs of some of the lemmas applied can be found in the Appendix.
2. Preliminaries
In this section, we provide a brief background on the distributed online optimization and the conjugate gradient method. At the same time, some constructs used in this study and some relevant assumptions regarding our analysis are provided.
2.1. Distributed Online Optimization
Consider a network system with multiple agents; in this network, each agent i is associated with a time-varying convex cost function $f_t^i: \mathcal{X} \rightarrow \mathbb{R}$. All agents aim to solve the following general consensus problem cooperatively:
$$\min_{x \in \mathcal{X}} \; \sum_{t=1}^{T} \sum_{i=1}^{n} f_t^i(x). \tag{1}$$
During each round t ∈ {1, …, T}, the ith agent is required to generate a decision point $x_i(t)$ from a convex compact set $\mathcal{X} \subseteq \mathbb{R}^d$. Then, the adversary replies to each agent's decision with a cost function $f_t^i: \mathcal{X} \rightarrow \mathbb{R}$, and each agent simultaneously incurs a loss $f_t^i(x_i(t))$. The communication between agents is specified by a graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, where $\mathcal{V} = \{1, \ldots, n\}$ is the vertex set and $\mathcal{E} \subseteq \mathcal{V} \times \mathcal{V}$ is the edge set. Each agent i can only communicate with its immediate neighbors $N(i) = \{j \in \mathcal{V} : (i, j) \in \mathcal{E}\}$. The goal of the agents is to seek a sequence of decision points such that the cumulative regret of each agent i with respect to any fixed decision in hindsight,
$$R_T = \sum_{t=1}^{T} \sum_{j=1}^{n} f_t^j(x_i(t)) - \min_{x \in \mathcal{X}} \sum_{t=1}^{T} \sum_{j=1}^{n} f_t^j(x),$$
is sublinear in T, that is, $\lim_{T \rightarrow \infty} R_T/T = 0$.
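Regret as defined above is straightforward to evaluate numerically once the losses have been revealed. The following minimal Python sketch computes the cumulative regret of one agent's decision sequence; the function names and the finite candidate grid standing in for the set $\mathcal{X}$ are our own illustrative assumptions, not part of the model.

```python
def cumulative_regret(losses, decisions, candidates):
    """Cumulative regret of one agent's decision sequence.

    losses[t][j] : callable, the cost f_t^j revealed at round t
    decisions[t] : the point x_i(t) the agent actually played
    candidates   : finite grid approximating the constraint set X
    """
    T = len(losses)
    # total network cost actually incurred at the agent's decisions
    incurred = sum(losses[t][j](decisions[t])
                   for t in range(T) for j in range(len(losses[t])))
    # cost of the best fixed decision chosen in hindsight
    best_fixed = min(sum(losses[t][j](x)
                         for t in range(T) for j in range(len(losses[t])))
                     for x in candidates)
    return incurred - best_fixed
```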
2.2. Conjugate Gradient Method
For the following optimization problem,
$$\min_{x \in \mathbb{R}^d} f(x),$$
where f(x) is twice continuously differentiable, the iterative form of the conjugate gradient (CG) method is usually designed as
$$x(k+1) = x(k) + \alpha_k d_k,$$
where x(k) is the point from the kth iteration, $\alpha_k > 0$ is the step length, and the search direction $d_k$ is defined as
$$d_k = \begin{cases} -g_k, & k = 1, \\ -g_k + \beta_k d_{k-1}, & k \geq 2, \end{cases}$$
in which $g_k = \nabla f(x(k))$ is the gradient of the objective function at the current iterate x(k), $\beta_k$ is a scalar, and different definitions of $\beta_k$ yield different conjugate gradient methods [27]. Well-known conjugate gradient methods include the Polak–Ribiere–Polyak (PRP) method and the Fletcher–Reeves (FR) method. In this study, we define the parameter $\beta_k$ using the PRP method, the specific form of which is as follows:
$$\beta_k^{PRP} = \frac{g_k^{\top}(g_k - g_{k-1})}{\|g_{k-1}\|^2}.$$
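For concreteness, the following Python sketch implements the PRP iteration above together with the restart rule $\beta_k \leftarrow \max\{\beta_k, 0\}$ used later in the paper; the Armijo backtracking line search and the quadratic test function are our illustrative substitutes for the exact line search assumed in the analysis.

```python
import numpy as np

def prp_cg(f, grad, x0, iters=200, tol=1e-8):
    """Polak-Ribiere-Polyak conjugate gradient with restarts."""
    x = np.asarray(x0, dtype=float)
    g = grad(x)
    d = -g                                   # first direction: steepest descent
    for _ in range(iters):
        if np.linalg.norm(g) < tol:
            break
        alpha = 1.0                          # backtracking (Armijo) step size
        while f(x + alpha * d) > f(x) + 1e-4 * alpha * g.dot(d):
            alpha *= 0.5
        x_new = x + alpha * d
        g_new = grad(x_new)
        beta = max(g_new.dot(g_new - g) / g.dot(g), 0.0)  # PRP, restart if <= 0
        d = -g_new + beta * d
        if g_new.dot(d) >= 0:                # safeguard: keep a descent direction
            d = -g_new
        x, g = x_new, g_new
    return x

# Example: strongly convex quadratic with minimizer (1, 2)
Q = np.array([[3.0, 0.5], [0.5, 2.0]])
b = Q @ np.array([1.0, 2.0])
x_star = prp_cg(lambda x: 0.5 * x @ Q @ x - b @ x,
                lambda x: Q @ x - b, x0=np.zeros(2))
```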
Gilbert and Nocedal [36] proved that the CG method converges globally if the parameter βk is appropriately bounded in magnitude and the method satisfies the sufficient descent condition; we work under this hypothesis throughout.
To analyze the convergence of our algorithm, we provide the bound of the conjugate gradient as follows.
Lemma 1 (see [37]). Let f(x) be a twice continuously differentiable convex function, and let $\nabla^2 f(x)$ be the Hessian matrix of the function. For any $x \in \mathcal{C}$ and any $y \in \mathbb{R}^d$, there exist two positive numbers m and M such that
$$m\|y\|^2 \leq y^{\top} \nabla^2 f(x)\, y \leq M\|y\|^2.$$
Taking an initial point x(1) ∈ C, the iterates $x_k$, the directions $d_k$, and the parameters $\beta_k$ are all defined using the PRP method, in which $g_k = \nabla f(x(k))$.
2.3. Some Constructs and Assumptions
The following assumptions are given throughout this paper:
(i) Each cost function $f_t^i(x)$ is convex, twice continuously differentiable, and L-Lipschitz on the convex set $\mathcal{X}$.
(ii) The set $\mathcal{X}$ is compact and convex, and $0 \in \mathcal{X}$, where 0 denotes the vector with all entries equal to zero.
(iii) The Euclidean diameter of $\mathcal{X}$ is bounded by R; that is, $\sup_{x, y \in \mathcal{X}} \|x - y\| \leq R$.
As the Lipschitz condition in (i) implies, for any $x \in \mathcal{X}$ and any gradient $g = \nabla f_t^i(x)$, we have the following:
$$\|g\|_* \leq L,$$
where $\|\cdot\|_*$ denotes the dual norm.
The next definition is used throughout this paper.
Definition 1 (see [38]). Let f(x) be a differentiable function on an open set $\mathcal{C} \subseteq \mathbb{R}^d$, and let $\mathcal{X}$ be a convex subset of $\mathcal{C}$. Then, f(x) is convex on $\mathcal{X}$ if and only if
$$f(y) \geq f(x) + \langle \nabla f(x), y - x \rangle$$
for all $x, y \in \mathcal{X}$.
Now, we give an important inequality in [39] that is often used in optimization problems.
Let f(x) be a continuously differentiable function on the set $\mathcal{X}$ whose gradient satisfies the Lipschitz condition; then, for all $x, y \in \mathcal{X}$,
$$f(y) \leq f(x) + \langle \nabla f(x), y - x \rangle + \frac{L}{2}\|y - x\|^2,$$
where L is the Lipschitz constant and $\|\cdot\|$ denotes the Euclidean norm.
3. Distributed Online Conjugate Gradient Algorithm
For the distributed online optimization problem (1), each local cost function $f_t^i(x)$ satisfies the assumptions in Section 2. The network topology among agents is specified by an undirected graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$; that is, if $(i, j) \in \mathcal{E}$, then $(j, i) \in \mathcal{E}$. Each agent i can only communicate with its immediate neighbors. The adjacency matrix of the undirected graph is a doubly stochastic symmetric matrix $P = [p_{ij}] \in \mathbb{R}^{n \times n}$ such that $p_{ij} > 0$ only if $(i, j) \in \mathcal{E}$ or $i = j$; otherwise, $p_{ij} = 0$. Moreover, $\sum_{j=1}^{n} p_{ij} = 1$ for all i and $\sum_{i=1}^{n} p_{ij} = 1$ for all j.
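One standard way to construct a weight matrix with these properties is the Metropolis rule; this particular construction is our illustration and is not prescribed by the paper.

```python
import numpy as np

def metropolis_weights(adj):
    """Symmetric doubly stochastic P from the 0/1 adjacency matrix of an
    undirected graph: p_ij = 1/(1 + max(deg_i, deg_j)) on edges, zero
    elsewhere off the diagonal, and p_ii chosen so each row sums to 1."""
    n = adj.shape[0]
    deg = adj.sum(axis=1)
    P = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j and adj[i, j]:
                P[i, j] = 1.0 / (1.0 + max(deg[i], deg[j]))
        P[i, i] = 1.0 - P[i].sum()
    return P

# Example: a 4-node cycle graph; rows and columns of P each sum to 1
adj = np.array([[0, 1, 0, 1],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [1, 0, 1, 0]])
P = metropolis_weights(adj)
```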
To solve (1), we present a distributed online conjugate gradient algorithm. After making a decision based on the current information, we obtain the cost function $f_t^i(x)$ and compute the gradient $\nabla f_t^i(x_i(t))$. We can then calculate the value of $\beta_i(t)$ using the gradients at the current iteration point $x_i(t)$ and the previous iteration point $x_i(t-1)$. If $\beta_i(t) > 0$, a new search direction $d_i(t)$ is constructed through Gram–Schmidt conjugation of the gradient at the current iteration point $x_i(t)$ with the previous search direction $d_i(t-1)$. If $\beta_i(t) \leq 0$, we instead take the negative gradient as the new search direction, which is equivalent to restarting the distributed online conjugate gradient algorithm in the direction of steepest descent. The iteration step length $\alpha_i(t)$ is obtained through an exact line search, and the next iteration point $x_i(t+1)$ is obtained using the conjugate direction vector $d_i(t)$ and step $\alpha_i(t)$. The specific algorithm is summarized in Algorithm 1.
(Algorithm 1: The distributed online conjugate gradient algorithm (D-OCG).)
Here, we define the projection function used in this algorithm as follows:
$$\Pi_{\mathcal{X}}(z, \alpha) = \operatorname*{arg\,min}_{x \in \mathcal{X}} \left\{ \langle z, x \rangle + \frac{1}{\alpha} \varphi(x) \right\},$$
where $\varphi(x)$ is a strongly convex regularization function.
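To make the update concrete, the following Python sketch outlines one round of the procedure described above for a single agent, taking $\varphi(x) = \frac{1}{2}\|x\|^2$ and a box constraint so that the projection reduces to clipping $-\alpha z$. The dual update shown, which averages the neighbors' dual variables and then moves along the new search direction, follows the dual-averaging structure suggested by the analysis in Section 4; it is a sketch under these assumptions, not a verbatim transcription of Algorithm 1.

```python
import numpy as np

def project(z, alpha, lo=-1.0, hi=1.0):
    """Pi_X(z, alpha) with phi(x) = 0.5*||x||^2 and X = [lo, hi]^d:
    argmin_x <z, x> + ||x||^2 / (2 * alpha) = clip(-alpha * z)."""
    return np.clip(-alpha * z, lo, hi)

def docg_round(g, g_prev, d_prev, z_neighbors, weights, alpha):
    """One D-OCG round for agent i (illustrative sketch).

    g, g_prev   : gradients at x_i(t) and x_i(t-1)
    d_prev      : previous search direction d_i(t-1)
    z_neighbors : dual variables z_j(t) of the neighbors (including i)
    weights     : the matching entries p_ij of the weight matrix P
    alpha       : step-size parameter alpha(t)"""
    # PRP parameter from the current and previous gradients
    beta = g.dot(g - g_prev) / max(g_prev.dot(g_prev), 1e-12)
    if beta > 0:
        d = -g + beta * d_prev       # Gram-Schmidt conjugate direction
    else:
        d = -g                       # restart with steepest descent
    # consensus on the dual variable, then a move along the direction
    z_new = sum(w * zj for w, zj in zip(weights, z_neighbors)) - d
    x_new = project(z_new, alpha)    # next iterate x_i(t+1)
    return x_new, z_new, d
```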
4. Regret Bound Analysis
To analyze the regret bound for D-OCG, we provide some preliminary remarks and a few definitions. Using Algorithm 1, we can determine the following:
Now, we define the running average of the dual variables and its projection,
$$\bar{z}(t) = \frac{1}{n} \sum_{i=1}^{n} z_i(t), \qquad y(t) = \Pi_{\mathcal{X}}(\bar{z}(t), \alpha(t)),$$
and from the evolution of $z_i(t+1)$ in Algorithm 1, together with the double stochasticity of P, we can obtain the corresponding recursion for $\bar{z}(t+1)$.
Now, the main results in our paper can be stated.
Theorem 1. Consider the sequences of $x_i(t)$ and $z_i(t)$ generated by Algorithm 1 for all $t \in \{1, \ldots, T\}$, with the step-size parameters chosen as in the analysis below. The cumulative regret owing to the actions of agent i then satisfies a bound of order $O(\sqrt{T})$, where $\lambda = \max_{1 \leq i \leq n,\, 1 \leq t \leq T}\{\lambda_i(t)\}$, b and D are two nonnegative constants, M and m are as defined in Lemma 1, n is the number of agents, and $\sigma_2(P)$ is the second largest eigenvalue of the adjacency matrix P.
From Theorem 1, we obtain a regret bound for the proposed algorithm under local convexity that is sublinear in T; i.e., the average regret of the D-OCG algorithm approaches zero as T, the number of iterations, increases. It is evident that the value of the regret bound is related to the upper bound L on the gradients of the local objective functions and the diameter R of the constraint set $\mathcal{X}$. By Lemma 1, the regret bound is also related to the Hessian matrices of the local objective functions. Moreover, the value of the regret bound is related to the scale and topology of the network.
To prove Theorem 1, we now present the following lemmas.
Lemma 2. For any $x \in \mathcal{X}$ and $i \in \mathcal{V}$, we can obtain the following inequality:
Proof. Based on assumption (i), the function $f_t$ is L-Lipschitz continuous on the convex set $\mathcal{X}$; this yields inequalities (19) and (20). Combining equations (19) and (20), the proof of Lemma 2 is completed.
Now, we prove that the last term of inequality (20) has a particular bound.
Lemma 3. For any $i \in \mathcal{V}$ and $t \in \{1, \ldots, T\}$,
Proof. Because $\nabla f_t(x_i(t))$ is a gradient of $f_t(x)$ at $x_i(t)$, using the convexity of the function $f_t(x)$, we have the corresponding first-order inequality. Then, based on assumption (i), we know that $\|\nabla f_t(x_i(t))\|_* \leq L$, and we can then bound the resulting terms. Summing for t = 1, …, T over the average of the iterates yields the stated bound, and the proof of Lemma 3 is completed.
Now, we turn our attention to the term in equation (25). According to the definition of the conjugate gradient, we give the bound of equation (25) in Lemma 4.
Lemma 4. For any $i \in \mathcal{V}$ and $\beta_i(t) \leq b$ (where b is a nonnegative constant), the following bound holds:
Proof. Based on the definitions of $d_i(t)$ and $\Pi_{\mathcal{X}}$, the left-hand side of the above inequality can be split into two terms, as in equation (27). We first prove that the first term in equation (27) has a bound. For any function f(x), equation (28) holds, where dom f is the domain of the function f(x). Therefore, for the function in equation (29), we can obtain, for any admissible point, the relations (30)–(32). Based on the definition of the conjugate function [40] and the updates for $z_i(t)$, we have equation (33). Because $\alpha(t)$ is a nonincreasing sequence, based on the definition of the conjugate function, we can obtain, for all t, inequality (34), and thus we obtain (35). According to inequality (11), we know that inequality (36) holds; a detailed proof of inequality (36) is provided in Appendix B. The inequality (37) is then established. Summing both sides of this inequality from t = 1 to T, we obtain (38). Through equations (33) and (38), we can write (39). We then analyze the bound on the second term in inequality (27). Because $\beta_i(t) = \max\{0, \beta_i^{PRP}(t)\}$ and $\beta_i(t) \leq b$, we analyze the following two cases.
Case 1. If $\beta_i^{PRP}(t) \leq 0$, then $\beta_i(t) = 0$ and $d_i(t) = -\nabla f_t^i(x_i(t))$, and the conclusion therefore clearly holds.
Case 2. If $\beta_i^{PRP}(t) > 0$, then, without loss of generality, we assume that the stated relation holds for all t, so that the subsequent bound follows; summing for t = 1, 2, …, T yields (46). Next, we give the bound of the remaining supremum term.
Proceeding as in the proof of inequality (33), we can obtain a corresponding bound, and through inequality (11), we obtain a further one. In addition, by the definition of the function $\varphi^*$, when z = 0, the definition of the projection shows that the supremum in this expression is uniquely attained at $\Pi_{\mathcal{X}}(z, \alpha)$. Moreover, for every fixed z, taking x = 0 shows that the supremum can be restricted to a bounded subset of the feasible set. Because the set $\mathcal{X}$ is closed and φ(x) is strongly convex (for the definition of strong convexity, see [40]), the set described above is compact. On the other hand, we know that 〈z, x〉 is differentiable in z, and the supremum is unique, and thus we can obtain the derivative of $\varphi^*$.
Then, we derive the next two equations through a Taylor expansion, and thus obtain a further bound. Applying the stated conditions, we therefore obtain the next estimate. Summing both sides of this inequality from t = 1 to T, we obtain the summed bound, and combining equations (46)–(57), we obtain equation (58). Through equations (27), (39), and (58), we finalize the proof of Lemma 4.
Remark. (1) The definition of the conjugate function of $\varphi$ is as follows. (2) Through inequality (15) and step 2 in Algorithm 1, for all $i \in \mathcal{V}$ and $t \in \{1, \ldots, T\}$, we can obtain the corresponding bound. Next, we provide an important inequality, which will be used in the proof of Lemma 6.
Lemma 5 (α-Lipschitz continuity of the projections). For any pair $u, v \in \mathbb{R}^d$, we have the following:
$$\|\Pi_{\mathcal{X}}(u, \alpha) - \Pi_{\mathcal{X}}(v, \alpha)\| \leq \alpha \|u - v\|_*.$$
A detailed proof of this lemma can be found in Appendix A.
Now, we analyze a key quantity in the regret analysis, namely, $\|x_i(t) - y(t)\|$, in Lemma 6.
Lemma 6. For all $i \in \mathcal{V}$ and t ∈ {0, …, T}, the following inequality is true:
Proof. Because $x_i(t)$ and $y(t)$ are the projections of $z_i(t)$ and $\bar{z}(t)$ onto the set $\mathcal{X}$, through Lemma 5, we have the bound in (63). Now, considering the evolution of the sequence $\{z_i(t)\}$ in Algorithm 1, we obtain the corresponding expansion. Because $p_{ij}$ is an element of a doubly stochastic matrix, $\sum_{j=1}^{n} p_{ij} = 1$, we then have the next identity. Based on Algorithm 1 and the definition of $\bar{z}(t)$, we can determine the resulting difference, and, in addition, we can obtain a bound based on the definition of the 1-norm of a vector (see [41]). To obtain a more specific bound, we introduce a useful property of a stochastic matrix as follows [12]:
$$\left\| P^{t-r-1} e_i - \frac{1}{n}\mathbf{1} \right\|_1 \leq \sqrt{n}\, \sigma_2(P)^{t-r-1},$$
where $P^{t-r-1}$ denotes the (t − r − 1)th power of the matrix P, $e_i$ is the ith basis coordinate vector of the n-dimensional space $\mathbb{R}^n$, $\mathbf{1}$ denotes the vector with all entries equal to 1, and $\sigma_2(P)$ is the second largest eigenvalue of the stochastic matrix P, with $\sigma_2(P) \leq 1$. Through this property, we obtain inequality (70). Combining equations (63), (68), and (70) yields (71), and thus we complete the proof of Lemma 6.
Now, we can provide a brief proof of Theorem 1.
Proof of Theorem 1. Combining Lemmas 2–6 yields the following regret bound. By equation (71) and the stated choices of the parameters, we obtain the conclusion of Theorem 1.
5. Simulation Experiments
To verify the performance of the D-OCG, we consider a distributed sensor network problem [18] with n sensors whose goal is to estimate a random vector $x \in \mathbb{R}^d$. In this network, at each time t ∈ {1, 2, …, T}, each sensor i receives an observation vector $y_t^i$, which is time-varying owing to the effect of observation noise. Assume that each sensor i has a linear model $\phi_i(x) = A_i x$, where $A_i$ is the observation matrix of sensor i with $\|A_i\|_1 \leq \phi_{\max}$. The local cost function of sensor i is defined as $f_t^i(x) = \frac{1}{2}\|y_t^i - A_i x\|^2$, where $y_t^i = A_i x + \eta_t^i$, in which $\eta_t^i$ is white noise. The mathematical model of this problem is
$$\min_{x \in \mathcal{X}} \sum_{t=1}^{T} \sum_{i=1}^{n} \frac{1}{2}\left\| y_t^i - A_i x \right\|^2.$$
In the offline case, the cost function of each sensor i is fixed, and because all information about the cost functions is known in advance, the centralized optimal estimate for this problem can be obtained by
$$x^{\star} = \operatorname*{arg\,min}_{x \in \mathcal{X}} \sum_{t=1}^{T} \sum_{i=1}^{n} f_t^i(x).$$
In a practical problem, the characteristics of the white noise may be unknown, or some sensors might not work properly for a particular reason, and we therefore need to estimate the vector x using a distributed online algorithm. Here, we set d = 1, and sensor i observes $y_t^i = a_t^i x + b_t^i$, where $a_t^i \sim U(0, 1)$ and $b_t^i \sim U(0, 1)$ (in which $x \sim U(a, b)$ indicates a random variable uniformly distributed on (a, b)). Then, the cost function of sensor i at each time t is given by $f_t^i(x) = \frac{1}{2}(y_t^i - a_t^i x)^2$.
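A minimal Python sketch of this scalar experiment is given below, assuming the quadratic loss form stated above; the horizon, the random seed, and the true value of x are our own choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, T = 9, 1000                 # sensors and rounds (our choices)
x_true = 0.5                   # unknown scalar to be estimated (assumed)

def make_round_losses():
    """One round of time-varying costs f_t^i(x) = 0.5 * (y - a * x)^2
    with a ~ U(0, 1) and additive noise b ~ U(0, 1)."""
    a = rng.uniform(0.0, 1.0, size=n)
    b = rng.uniform(0.0, 1.0, size=n)
    y = a * x_true + b          # noisy scalar observations
    return [lambda x, a=ai, y=yi: 0.5 * (y - a * x) ** 2
            for ai, yi in zip(a, y)]

losses = [make_round_losses() for _ in range(T)]   # losses[t][i] is f_t^i
```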
We verified the performance of the proposed algorithm based on the following three aspects:
(1) First, we determined how the number of nodes in the network affects the performance of the D-OCG. We can see from Figure 1 that the average regret decreases slowly when we increase the number of nodes, and the algorithm converges on networks of different scales. When n = 1, the problem is equivalent to a centralized optimization problem, and our distributed optimization algorithm achieves the same effect as a centralized algorithm.
(2) We then checked how the network topology influences the performance of the D-OCG. We implemented the algorithm on three types of graphs with nine nodes. In a complete graph, each node is connected to all remaining nodes; that is, all nodes can exchange information with each other. In a cycle graph, each node is only connected to the two nodes directly adjacent to it. The connectivity of a Watts–Strogatz graph lies between that of the complete graph and the cycle graph. From Figure 2, it can be seen that better connectivity leads to slightly faster convergence.
(3) We next compared our algorithm with the classic algorithm D-OGD in [20]. The parameters used in the two algorithms follow their theoretical analyses. The network topology among the nodes is complete, with n = 9 nodes and step sizes chosen accordingly. As shown in Figure 3, the convergence speeds of the two algorithms are initially close, but as the number of iterations increases, the D-OCG converges faster than the D-OGD, which fully reflects the excellent performance of the proposed algorithm.
(Figure 1: Average regret of the D-OCG for different numbers of network nodes.)
(Figure 2: Average regret of the D-OCG on the complete, Watts–Strogatz, and cycle graphs.)
(Figure 3: Comparison of the average regret of the D-OCG and the D-OGD.)
6. Conclusion
We proposed a distributed online conjugate gradient algorithm to solve the distributed optimization problem with a convex constraint over a network. In this algorithm, the conjugate gradient replaces the gradient or subgradient of a traditional gradient descent method. Because the search directions are mutually conjugate throughout the iteration process, we avoid the slow convergence that gradient descent suffers in its later stage. We also presented a detailed analysis of the convergence of the proposed algorithm and obtained a regret bound for the optimization problem; the regret bound is sublinear in T. We applied the proposed algorithm (D-OCG) to a distributed sensor estimation problem. The numerical results show that our algorithm is feasible and effective and that, under the same assumptions, the D-OCG has a better convergence rate than the traditional D-OGD gradient method.
Appendix
A. Proof of Lemma 5
Let $x = \Pi_{\mathcal{X}}(u, \alpha)$ and $\omega = \Pi_{\mathcal{X}}(v, \alpha)$. Based on the first-order optimality condition for convex optimization, for any $y \in \mathcal{X}$, we obtain the following two inequalities:
Through equation (A.1), we obtain the following, and thus:
In addition, because $\varphi$ is a strongly convex function, we obtain the following; namely,
We therefore have the next inequality, and because α(t) ≥ αi(t) ≥ 0, we obtain:
In addition, we have the following; that is,
Setting y = ω in equation (A.2) and y = x in equation (A.11) yields
Adding the above two inequalities, we obtain the following bound:
On the other hand, φ(x) is a strongly convex function, which implies that
Adding the above two inequalities, we have the following, and thus:
Combining equations (A.13) and (A.16), we can obtain the following; namely,
Thus, we have the next bound, and we therefore obtain
This completes the proof of the claim in Lemma 5.
B. Proof of Inequality (36)
Based on the definition of the conjugate function, we can obtain the following, where $\delta_{\mathcal{X}}(x)$ is the indicator function of the set $\mathcal{X}$; that is, $\delta_{\mathcal{X}}(x) = 0$ when $x \in \mathcal{X}$, and $\delta_{\mathcal{X}}(x) = +\infty$ otherwise. On the other hand, the set $\mathcal{X}$ is compact, and thus the supremum defining the conjugate function is uniquely attained at $\Pi_{\mathcal{X}}(z, \alpha)$. Here, $\langle z, x \rangle$ is differentiable in z.
Because the projection is Lipschitz continuous, we have the following:
Through Lemma 1.2.3 and Corollary 4.4.5 in [39], we can obtain
However, we also have the following, and thus:
Data Availability
No data were used to support this study.
Conflicts of Interest
The authors declare no conflicts of interest regarding the publication of this paper.
Acknowledgments
This work was supported in part by the National Natural Science Foundation of China (NSFC), under Grant nos. 11471102, 61976243 and 61871430, the basic research projects in the University of Henan Province, under Grant nos. 19zx010 and 20zx001, and the Science and Technology Development Programs of Henan Province, under Grant no. 192102210284.