Basic Definitions and Some Properties
By N we denote the set of natural numbers, i.e., N = {0, 1, 2, …}. Consider the n-dimensional Euclidean vector space X = R^n equipped with the canonical inner product ⟨x, u⟩ := Σ_{i=1}^{n} x_i u_i for all vectors x = (x_1, …, x_n) and u = (u_1, …, u_n). Although vectors in R^n are displayed as rows of real numbers in this text, they are treated as columns in matrix computations. The transpose of a matrix A ∈ R^{m×n} is denoted by A^T; thus ⟨x, u⟩ = x^T u.
The norm in X is given by ‖x‖ = ⟨x, x⟩^{1/2}. Then the dual space Y of X can be identified with X.
A function θ : X → R ∪ {±∞} is said to be proper if it does not take the value −∞ and is not identically equal to +∞, i.e., there exists at least one x ∈ X with θ(x) ∈ R.
The effective domain of θ is defined by domθ := {x ∈ X : θ(x) < +∞}.
Let Γ_0(X) be the set of all lower semicontinuous, proper, convex functions on X. The Fenchel conjugate function g* of a function g ∈ Γ_0(X) is defined by g*(y) = sup{⟨x, y⟩ − g(x) | x ∈ X} for all y ∈ Y.
Note that g* : Y → R ∪ {+∞} is also a lower semicontinuous, proper, convex function [38, Proposition 3, p. 174]. From the definition it follows that g(x) + g*(y) ≥ ⟨x, y⟩ for all x ∈ X and y ∈ Y.
Denote by g** the conjugate function of g*, i.e., g**(x) = sup{⟨x, y⟩ − g*(y) | y ∈ Y}. Since g ∈ Γ_0(X), one has g**(x) = g(x) for all x ∈ X by the Fenchel–Moreau theorem ([38, Theorem 1, p. 175]). This fact is the basis for various duality theorems in convex programming and DC programming.
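As a quick illustration of the conjugate operation, the following display (a routine computation, not taken from [38]) derives the conjugate of the one-dimensional quadratic g(x) = (x − 1)^2; the result is used again in Examples 1.3 and 1.4 below.

```latex
% Conjugate of g(x) = (x-1)^2 on X = R, computed directly from the definition.
\[
  g^{*}(y) = \sup_{x \in \mathbb{R}} \bigl\{ xy - (x-1)^{2} \bigr\}.
\]
% The supremum is attained where the derivative vanishes: y - 2(x - 1) = 0,
% i.e., at x = 1 + y/2.  Substituting this maximizer gives
\[
  g^{*}(y) = \Bigl(1 + \tfrac{y}{2}\Bigr)y - \tfrac{y^{2}}{4}
           = \tfrac{1}{4}y^{2} + y .
\]
```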
Definition 1.1 The subdifferential of a convex function ϕ : R^n → R ∪ {+∞} at u ∈ dom ϕ is the set
∂ϕ(u) := {v ∈ R^n | ⟨v, x − u⟩ ≤ ϕ(x) − ϕ(u) for all x ∈ R^n}. (1.1)
If x ∉ dom ϕ then one puts ∂ϕ(x) = ∅.
Clearly, the subdifferential ∂ϕ(u) in (1.1) is a closed, convex set. The Fermat Rule for convex optimization problems asserts that x̄ ∈ R^n is a solution of the minimization problem min{ϕ(x) | x ∈ R^n} if and only if 0 ∈ ∂ϕ(x̄).
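For later use in Examples 1.1, 1.3, and 1.4, here are the subdifferentials of the two one-dimensional convex functions that appear there, computed directly from (1.1) together with a Fermat Rule check (a routine verification, included only for reference).

```latex
% Subdifferentials of |x - 1| and (x - 1)^2 on R, obtained from (1.1):
\[
  \partial |\,\cdot - 1\,|(x) =
  \begin{cases}
    \{-1\},  & x < 1,\\
    [-1, 1], & x = 1,\\
    \{1\},   & x > 1,
  \end{cases}
  \qquad
  \partial \bigl((\cdot - 1)^{2}\bigr)(x) = \{\,2(x - 1)\,\}.
\]
% Fermat Rule check: 0 belongs to [-1, 1], so x = 1 minimizes |x - 1| over R.
```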
We now recall some useful properties of the Fenchel conjugate functions. The proofs of the next two propositions can be found in [77].
Proposition 1.1 The inclusion x ∈ ∂g*(y) is equivalent to the equality g(x) + g*(y) = ⟨x, y⟩.
Proposition 1.2 The inclusions y ∈ ∂g(x) and x ∈ ∂g ∗ (y) are equivalent.
In the sequel, we use the convention (+∞)−(+∞)=+∞.
Definition 1.2 The optimization problem inf{f(x) := g(x) − h(x) | x ∈ X}, (P) where g and h are functions belonging to Γ_0(X), is called a DC program. The functions g and h are called DC components of f.
Definition 1.3 For any g, h ∈ Γ 0 (X), the DC program inf{h ∗ (y)−g ∗ (y) | y ∈ Y}, (D) is called the dual problem of (P).
Proposition 1.3 (Toland’s Duality Theorem; see [79]) The DC programs (P) and (D) have the same optimal value.
Definition 1.4 One says that ¯x ∈ R n is a local solution of (P) if the value f(¯x) = g(¯x) − h(¯x) is finite (i.e., ¯x ∈ domg ∩ domh) and there exists a neighborhood U of ¯x such that g(¯x)−h(¯x) ≤ g(x)−h(x) ∀x ∈ U.
If we can choose U = R n , then ¯x is called a (global) solution of (P).
The set of the solutions (resp., the local solutions) of (P) is denoted by sol(P) (resp., by loc(P)).
Proposition 1.4 (First-order optimality condition; see [77]) If x¯ is a local solution of (P), then ∂h(¯x) ⊂ ∂g(¯x).
Definition 1.5 A point ¯x ∈ R n satisfying ∂h(¯x) ⊂ ∂g(¯x) is called a station- ary point of (P).
The forthcoming example, which is similar to Example 1.1 in [93], shows that a stationary point needs not to be a local solution.
Example 1.1 Consider the DC program (P) with f(x) = g(x) − h(x), where g(x) = |x − 1| and h(x) = (x − 1)^2 for all x ∈ R. For x̄ := 1/2, we have ∂g(x̄) = ∂h(x̄) = {−1}. Since ∂h(x̄) ⊂ ∂g(x̄), x̄ is a stationary point of (P). But x̄ is not a local solution of (P), because f(x) = x − x^2 for all x ≤ 1.
Definition 1.6 A vector x̄ ∈ R^n is said to be a critical point of (P) if ∂g(x̄) ∩ ∂h(x̄) ≠ ∅.
If ∂h(x̄) ≠ ∅ and x̄ is a stationary point of (P), then x̄ is a critical point of (P). The reverse implication does not hold in general. The following example is similar to Example 1.2 in [93].
Example 1.2 Consider the DC program (P) with f(x) = g(x) − h(x), where g(x) = (x − 1/2)^2 and h(x) = |x − 1| for all x ∈ R. For x̄ := 1, we have ∂g(x̄) = {1} and ∂h(x̄) = [−1, 1]. Hence ∂g(x̄) ∩ ∂h(x̄) ≠ ∅, so x̄ is a critical point of (P). But x̄ is not a stationary point of (P), because ∂h(x̄) is not a subset of ∂g(x̄).
In problem (P), if h is Gâteaux differentiable at x̄ ∈ dom h, then ∂h(x̄) is a singleton and ∂h(x̄) = {∇_G h(x̄)}, where ∇_G h(x̄) denotes the Gâteaux derivative of h at x̄; conversely, if ∂h(x̄) is a singleton, then h is Gâteaux differentiable at x̄. In this situation, the condition ∂g(x̄) ∩ ∂h(x̄) ≠ ∅ is equivalent to the inclusion ∂h(x̄) ⊂ ∂g(x̄). So, if h is Gâteaux differentiable at x̄, then x̄ is a critical point of (P) if and only if it is a stationary point of (P).
DCA Schemes
The idea of DCAs (DC Algorithms) is to decompose the DC program (P) into two sequences of convex programs (P_k) and (D_k), k ∈ N. A DCA constructs two sequences {x^k} and {y^k} such that, for each k ∈ N, x^k is a solution of the convex program (P_k) and y^k is a solution of the convex program (D_k), and the following properties hold:
(i) The sequences {(g−h)(x k )} and {(h ∗ −g ∗ )(y k )} are decreasing;
(ii) Any cluster point ¯x (resp ¯y) of {x k } (resp., of {y k }) is a critical point of (P) (resp., of (D)).
Following Tuan [93], we can formulate and analyze the general DC algo- rithm of [77] as follows.
Step 3 k ← k+ 1 and return to Step 2.
For each k ≥ 0, we have constructed a pair (x^k, y^k) satisfying (1.2) and (1.3). Thanks to Proposition 1.2, we can transform the inclusion (1.2) equivalently as y^k ∈ ∂h(x^k). Consequently, the condition (1.2) is equivalent to the requirement that y^k is a solution of the problem
min{h*(y) − [g*(y^{k−1}) + ⟨x^k, y − y^{k−1}⟩] | y ∈ Y}, (D_k)
where y^{k−1} ∈ dom g* is the vector defined at the previous step k − 1. The inclusion x^k ∈ ∂g*(y^{k−1}) means that g*(y) − g*(y^{k−1}) ≥ ⟨x^k, y − y^{k−1}⟩ for all y ∈ Y. Thus the affine function g*(y^{k−1}) + ⟨x^k, y − y^{k−1}⟩ is a lower approximation of g*(y). Replacing g*(y) in the objective function of problem (D) by this lower approximation at step k, we obtain the auxiliary problem (D_k).
Since (D k ) is a convex program, solving (D k ) is much easier than solving the DC program (D) Recall that y k is a solution of (D k ).
Similarly, at each step k + 1, the DC program (P) is replaced by the problem min{g(x) − [h(x^k) + ⟨x − x^k, y^k⟩] | x ∈ X}, (P_k) where x^k ∈ dom h* has been defined at step k.
Since (P k ) is a convex program, solving (P k ) is much easier than solving the original DC program (P) As x k+1 satisfies (1.3), it is a solution of (P k ).
The objective function of (D_k) is a convex upper approximation of the objective function of (D), and the two functions have the same value at y^{k−1}. Removing constant terms from the objective function of (D_k), we can rewrite the problem in the equivalent form
min{h*(y) − ⟨x^k, y⟩ | y ∈ Y}. (1.4)
Likewise, the objective function of (P_k) is a convex upper approximation of the objective function of (P), and the two functions have the same value at x^k. Removing constant terms from the objective function of (P_k), we can rewrite the problem in the equivalent form
min{g(x) − ⟨x, y^k⟩ | x ∈ X}. (1.5)
If x^k is a critical point of (P), i.e., ∂g(x^k) ∩ ∂h(x^k) ≠ ∅, then DCA may produce a sequence {(x^ℓ, y^ℓ)} with x^ℓ = x^k and y^ℓ = y^k for all ℓ ≥ k.
Indeed, since there exists a point ¯x ∈ ∂g(x k ) ∩ ∂h(x k ), to satisfy (1.2) we can choose y k = ¯x Next, by Proposition 1.2, the inclusion (1.3) is equivalent to y k ∈ ∂g(x k+1 ) So, if we choose x k+1 = x k then (1.3) is fulfilled, because y k = ¯x ∈ ∂g(x k ).
The DCA leads us to critical points, but it has no means of escaping from them. When a critical point that is not a local minimizer is encountered, more sophisticated tools of variational analysis are needed to find a descent direction.
The following observations can be found in Tuan [93]:
• The DCA is a decomposition procedure which decomposes the solution of the pair of optimization problems (P) and (D) into the parallel solution of the sequence of convex minimization problems (P k ) and (D k ), k ∈ N;
• The DCA does not include any specific technique for solving the convex problems (P k ) and (D k ) Such techniques should be imported from convex programming;
• The performance of DCA depends greatly on a concrete decomposition of the objective function into DC components;
• Although DCA is a deterministic optimization method, it can produce different sequences {x^k} and {y^k} for the same initial point x^0, because the selections of y^k ∈ sol(D_k) and x^{k+1} ∈ sol(P_k) at each step k are heuristic whenever (D_k) or (P_k) has more than one solution.
The above analysis allows us to formulate a simplified version of DCA (DCA Scheme 1.2), which includes a termination procedure, as follows; a small computational sketch is given right after the scheme.
Output: Finite or infinite sequences {x^k} and {y^k}.
Step 1. Choose x^0 ∈ dom g. Take ε > 0. Put k = 0.
Step 2. Calculate y^k by solving the convex program (1.4). Calculate x^{k+1} by solving the convex program (1.5).
Step 3. If ‖x^{k+1} − x^k‖ ≤ ε then stop, else go to Step 4.
Step 4. k := k + 1 and return to Step 2.
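The following Python sketch implements the scheme just stated under the assumption that the caller supplies solvers for the convex subproblems (1.4) and (1.5); the names dca, solve_D and solve_P are illustrative, and the demo instantiates them with the closed-form solutions available for the one-dimensional data of Example 1.4 below.

```python
# Minimal sketch of DCA Scheme 1.2.  The subproblem solvers are supplied by the
# caller; here they are closed-form formulas valid only for the demo data.

def dca(x0, solve_D, solve_P, eps=0.0, max_iter=50):
    """Run DCA Scheme 1.2 and return the iterates {x^k} and {y^k}."""
    xs, ys = [x0], []
    for _ in range(max_iter):
        y = solve_D(xs[-1])          # Step 2: y^k solves (1.4), min h*(y) - <x^k, y>
        x_next = solve_P(y)          # Step 2: x^{k+1} solves (1.5), min g(x) - <x, y^k>
        ys.append(y)
        xs.append(x_next)
        if abs(x_next - xs[-2]) <= eps:   # Step 3: termination test
            break
    return xs, ys

# Demo data (Example 1.4): g(x) = |x - 1|, h(x) = (x - 1)^2 on R.
# Then h*(y) = y^2/4 + y, so (1.4) gives y = 2(x - 1), and (1.5) reduces to
# min |x - 1| - y*x, whose unique solution is x = 1 whenever |y| < 1 (true here).
xs, ys = dca(2.0 / 3.0,
             solve_D=lambda x: 2.0 * (x - 1.0),
             solve_P=lambda y: 1.0)
print(xs, ys)   # [0.666..., 1.0, 1.0] and [-0.666..., 0.0]
```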
To understand the performance of the above DCA schemes, let us consider the following example.
Example 1.3 Consider the function f(x) = g(x) − h(x) with g(x) = (x − 1)^2 and h(x) = |x − 1| for all x ∈ R. Here Y = X = R and we have g*(y) = sup{xy − g(x) | x ∈ R} = sup{xy − (x − 1)^2 | x ∈ R} = (1/4)y^2 + y. Hence ∂g*(y) = {(1/2)y + 1} for every y ∈ Y, while ∂h(x) = {−1} for x < 1, ∂h(x) = {1} for x > 1, and ∂h(x) = [−1, 1] for x = 1. Applying DCA Scheme 1.1, we construct two DCA sequences {x^k} and {y^k} with y^k ∈ ∂h(x^k) and x^{k+1} ∈ ∂g*(y^k) for k ∈ N. Starting from any x^0 > 1, we get y^0 = 1 and x^1 = 3/2; hence y^1 = 1 and, for all k ≥ 2, x^k = 3/2 and y^k = 1, so the sequences converge to x̄ = 3/2 and ȳ = 1. Starting from any x^0 < 1, we obtain x^k = 1/2 and y^k = −1 for all k ≥ 1; these sequences converge to x̄ = 1/2 and ȳ = −1.
Since f(x) = x^2 − x for x ≤ 1 and f(x) = x^2 − 3x + 2 for x ≥ 1, one finds that x̄ = 3/2 and x̂ = 1/2 are global minimizers of (P), and x̃ := 1 is a critical point of the problem.
Now start from the initial point x^0 = x̃ = 1 and select y^0 = 0 from ∂h(x^0) = [−1, 1]. Then x^1 ∈ ∂g*(y^0) = ∂g*(0) = {1}, so x^1 = 1. Choosing y^1 = 0 from ∂h(x^1) = [−1, 1] and continuing in this way, we obtain DCA sequences {x^k} and {y^k} converging to x̃ = 1 and ȳ = 0, respectively. It is worth noting that the limit point x̃ = 1 of {x^k} is a critical point of (P) which is neither a local minimizer nor a stationary point of (P).
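The three behaviours just described can be reproduced with a few lines of Python (an illustrative sketch, not part of the text): for the data of Example 1.3 one has ∂g*(y) = {y/2 + 1}, so the iteration y^k ∈ ∂h(x^k), x^{k+1} = y^k/2 + 1 is explicit; the subgradient chosen from ∂h(1) = [−1, 1] is passed in as a parameter because it is not unique.

```python
# DCA Scheme 1.1 specialized to Example 1.3: g(x) = (x - 1)^2, h(x) = |x - 1|.

def dca_example_1_3(x0, tie_break=0.0, steps=30):
    """Iterate y^k in dh(x^k), x^{k+1} = y^k / 2 + 1; tie_break is the subgradient
    used when x^k = 1, where dh(1) = [-1, 1]."""
    x, y = x0, tie_break
    for _ in range(steps):
        y = -1.0 if x < 1.0 else (1.0 if x > 1.0 else tie_break)
        x = 0.5 * y + 1.0
    return x, y

print(dca_example_1_3(5.0))    # (1.5, 1.0):  converges to the global minimizer 3/2
print(dca_example_1_3(0.3))    # (0.5, -1.0): converges to the global minimizer 1/2
print(dca_example_1_3(1.0))    # (1.0, 0.0):  stays at the critical point 1
```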
To ease the presentation of some related programs, we consider the following scheme.
Output: Finite or infinite sequences {x k } and {y k }.
Step 1 Choose x 0 ∈ domg Take ε > 0 Put k = 0.
Step 2. Calculate y^k by using (1.2) and find x^{k+1} ∈ argmin{g(x) − ⟨x, y^k⟩ | x ∈ X}. (1.6)
Step 3 If ||x k+1 −x k || ≤ε then stop, else go to Step 4.
Step 4 k := k+ 1 and return to Step 2.
General Convergence Theorem
We will recall the fundamental theorem on DCAs of Pham Dinh Tao and Le Thi Hoai An [77, Theorem 3], which provides a solid theoretical foundation for the practical application of these algorithms. To this end, we first recall the definitions of ρ-convex functions, of the modulus of convexity of a convex function, and of strongly convex functions.
Definition 1.7 Let ρ ≥ 0 and let C be a convex set in the space X. A function θ : C → R ∪ {+∞} is called ρ-convex if
θ(λx + (1 − λ)x′) ≤ λθ(x) + (1 − λ)θ(x′) − (λ(1 − λ)/2)ρ‖x − x′‖^2
for all numbers λ ∈ (0, 1) and vectors x, x′ ∈ C. This amounts to saying that the function θ(·) − (ρ/2)‖·‖^2 is convex on C.
Definition 1.8 The modulus of convexity of θ on C is given by ρ(θ, C) = sup{ρ ≥ 0 | θ − (ρ/2)‖·‖^2 is convex on C}. If C = X then we write ρ(θ) instead of ρ(θ, C). The function θ is called strongly convex on C if ρ(θ, C) > 0.
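For convex quadratic functions the modulus of convexity is explicit; the following display records this standard fact (assuming Q is a symmetric positive semidefinite matrix and E denotes the unit matrix), which is used when the DCA theory is specialized to quadratic programs later on.

```latex
% Modulus of convexity of theta(x) = (1/2) x^T Q x + q^T x with Q symmetric PSD:
% theta - (rho/2)||.||^2 has Hessian Q - rho*E, which is positive semidefinite
% exactly when rho <= lambda_1(Q).  Hence
\[
  \rho(\theta) \;=\; \sup\{\rho \ge 0 \mid Q - \rho E \succeq 0\} \;=\; \lambda_{1}(Q).
\]
```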
Consider the problem (P). If ρ(g) > 0 (resp., ρ(g*) > 0), let ρ_1 (resp., ρ_1^*) be a real number such that 0 ≤ ρ_1 < ρ(g) (resp., 0 ≤ ρ_1^* < ρ(g*)). If ρ(g) = 0 (resp., ρ(g*) = 0), let ρ_1 = 0 (resp., ρ_1^* = 0). If ρ(h) > 0 (resp., ρ(h*) > 0), let ρ_2 (resp., ρ_2^*) be a real number such that 0 ≤ ρ_2 < ρ(h) (resp., 0 ≤ ρ_2^* < ρ(h*)). If ρ(h) = 0 (resp., ρ(h*) = 0), let ρ_2 = 0 (resp., ρ_2^* = 0).
The convenient abbreviations dx^k := x^{k+1} − x^k and dy^k := y^{k+1} − y^k were adopted in [77].
Theorem 1.1 ([77, Theorem 3]) Let α := inf{f(x) = g(x) − h(x) | x ∈ R^n}. Assume that the iteration sequences {x^k} and {y^k} are generated by DCA Scheme 1.1. Then, the following properties are valid:
≤ (g − h)(x^k) − max{ ((ρ_1 + ρ_2)/2)‖dx^k‖^2, (ρ_1^*/2)‖dy^{k−1}‖^2 + (ρ_2/2)‖dx^k‖^2, (ρ_1^*/2)‖dy^{k−1}‖^2 + (ρ_2^*/2)‖dy^k‖^2 } hold for every k;
… ‖dx^{k+1}‖^2 + (ρ_2/2)‖dx^k‖^2, (ρ_1^*/2)‖dy^k‖^2 + (ρ_2/2)‖dx^k‖^2 } hold for every k;
(iii) If α is finite, then {(g − h)(x^k)} and {(h* − g*)(y^k)} are decreasing sequences that converge to the same limit β ≥ α. Furthermore,
(a) If ρ(g) + ρ(h) > 0 (resp., ρ(g*) + ρ(h*) > 0), then lim_{k→∞}(x^{k+1} − x^k) = 0 (resp., lim_{k→∞}(y^{k+1} − y^k) = 0);
(iv) If α is finite, and {x^k} and {y^k} are bounded, then for every cluster point x̄ of {x^k} (resp., ȳ of {y^k}), there is a cluster point ȳ of {y^k} (resp., x̄ of {x^k}) such that:
The estimates in the assertions (i) and (ii) of the above theorem can be slightly improved as shown in the next remark.
Remark 1.1 If ρ(h) > 0, then ρ_2 can be any real number in [0, ρ(h)). Since the sequences {x^k} and {y^k} are constructed independently of the constants ρ_1, ρ_1^*, ρ_2, ρ_2^*, assertion (i) of Theorem 1.1 implies that for every k ∈ N the corresponding inequality holds with this ρ_2:
… (ρ_2^*/2)‖dy^k‖^2 }. Passing the last inequality to the limit as ρ_2 → ρ(h), we get
Applying the same argument to the constants associated with the strongly convex functions among {g, h, g*, h*}, one obtains the following improved versions of the estimates in assertions (i) and (ii) of Theorem 1.1:
≤ (g − h)(x^k) − max{ ((ρ(g) + ρ(h))/2)‖dx^k‖^2, (ρ(g*)/2)‖dy^{k−1}‖^2 + (ρ(h)/2)‖dx^k‖^2, (ρ(g*)/2)‖dy^{k−1}‖^2 + (ρ(h*)/2)‖dy^k‖^2 },
… ‖dx^{k+1}‖^2 + (ρ(h)/2)‖dx^k‖^2, (ρ(g*)/2)‖dy^k‖^2 + (ρ(h)/2)‖dx^k‖^2 }.
The forthcoming example is designed as an illustration for Theorem 1.1.
Example 1.4 Consider the function f(x) = g(x) − h(x) from Example 1.1, where g(x) = |x − 1| and h(x) = (x − 1)^2 for all x ∈ R. Here Y = X = R and we have h*(y) = sup{xy − h(x) | x ∈ R} = sup{xy − (x − 1)^2 | x ∈ R} = (1/4)y^2 + y. Using DCA Scheme 1.2, we calculate DCA sequences {x^k} and {y^k} by solving, respectively, the convex programs (1.4) and (1.5) for k ∈ N. Choose ε = 0. First, select x^0 = 2/3. Since y^0 is a solution of (1.4) for k = 0, we get y^0 = −2/3. Solving (1.5) for k = 0 gives x^1 = 1; hence y^k = 0 for every k ≥ 1 and x^k = 1 for every k ≥ 2. The algorithm terminates after one step, at k = 1, yielding x̄ = x^2 = 1, which is the unique local solution of (P). The same happens for any initial point x^0 ∈ (1/2, 3/2). If x^0 = 1/2 or x^0 = 3/2, then the algorithm may stop at k = 0 with x̄ = x^1 = x^0, a stationary point of (P) which is not a local solution. If x^0 < 1/2 or x^0 > 3/2, then f(x^k) → −∞ as k → ∞; so {x^k} does not have any cluster point.
Convergence Rates
In Chapters 2 and 4, we will obtain several results on convergence rates of iterative sequences. Two types of linear convergence will be used: Q-linear convergence and R-linear convergence. Let us recall these notions.
Definition 1.9 (See, e.g., [70, p. 28] and [88, pp. 293–294]) One says that a sequence {x^k} ⊂ R^n converges Q-linearly to a vector x̄ ∈ R^n if there exists β ∈ (0, 1) such that ‖x^{k+1} − x̄‖ ≤ β‖x^k − x̄‖ for all k sufficiently large.
Clearly, if x^k ≠ x̄, then the relation ‖x^{k+1} − x̄‖ ≤ β‖x^k − x̄‖ in Definition 1.9 can be rewritten equivalently as ‖x^{k+1} − x̄‖ / ‖x^k − x̄‖ ≤ β. The word “Q”, which stands for “quotient”, comes from this context.
Definition 1.10 (See, e.g., [70, p. 30]) One says that a sequence {x^k} ⊂ R^n converges R-linearly to a vector x̄ ∈ R^n if there is a sequence of nonnegative scalars {μ_k} such that ‖x^k − x̄‖ ≤ μ_k for all k sufficiently large, and {μ_k} converges Q-linearly to 0.
If a sequence {x^k} converges Q-linearly to a vector x̄, then it also converges R-linearly to x̄. Indeed, choose β ∈ (0, 1) satisfying the condition in Definition 1.9 and set μ_k := β‖x^{k−1} − x̄‖ for all k ≥ 1. Then ‖x^k − x̄‖ ≤ μ_k for all k sufficiently large, and {μ_k} converges Q-linearly to 0 because μ_{k+1} ≤ βμ_k for all k large enough. R-linear convergence does not imply Q-linear convergence. To see this, consider a sequence of positive scalars {x_k} of the type given in [70, p. 30], for instance x_k = 1 + 2^{−k} if k is even and x_k = 1 if k is odd, and observe that {x_k} converges R-linearly to 1, while it does not converge Q-linearly to 1.
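A small numerical check (an illustration added here, with the concrete sequence chosen for convenience) shows both effects: the Q-linear quotient test fails because the error is exactly zero at every odd index, while the root test (1.7), discussed next, gives a value of about 1/2.

```python
# x_k = 1 + 2^{-k} for even k, x_k = 1 for odd k: R-linear but not Q-linear to 1.

def x(k):
    return 1.0 + 2.0 ** (-k) if k % 2 == 0 else 1.0

# Q-linear test: |x_{k+1} - 1| / |x_k - 1| must stay below some beta < 1,
# but the denominator vanishes at every odd k while the numerator does not.
for k in (9, 10):
    num, den = abs(x(k + 1) - 1.0), abs(x(k) - 1.0)
    print(k, num / den if den > 0 else "undefined (zero denominator)")

# Root test (1.7): limsup_k |x_k - 1|^{1/k} -- the even-index terms give exactly 1/2.
print(max(abs(x(k) - 1.0) ** (1.0 / k) for k in range(1, 60)))   # 0.5
```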
Sometimes, one says that a sequence {x^k} ⊂ R^n converges R-linearly to a vector x̄ ∈ R^n whenever limsup_{k→∞} ‖x^k − x̄‖^{1/k} < 1 (1.7)
(see, e.g., [92]) The word “R”, which stands for “root”, comes from this context.
The next proposition clarifies the equivalence between the characterization of R-linear convergence in (1.7) and the one given in Definition 1.10.
Proposition 1.5 A sequence {x^k} ⊂ R^n converges R-linearly to a vector x̄ ∈ R^n if and only if the strict inequality (1.7) holds.
Proof. Necessity: Suppose that {x^k} converges R-linearly to x̄. Then there is a sequence of nonnegative scalars {μ_k} such that ‖x^k − x̄‖ ≤ μ_k for all k sufficiently large and {μ_k} converges Q-linearly to 0. Hence we can find β ∈ (0, 1) and k_1 ∈ N such that μ_{k+1} ≤ βμ_k for all k ≥ k_1. Without loss of generality we may assume that μ_{k_1} > 0. For every k > k_1 one has ‖x^k − x̄‖ ≤ μ_k ≤ βμ_{k−1} ≤ ⋯ ≤ β^{k−k_1}μ_{k_1}. It follows that ‖x^k − x̄‖ ≤ (μ_{k_1}/β^{k_1})β^k for all k > k_1. Therefore, limsup_{k→∞} ‖x^k − x̄‖^{1/k} ≤ β limsup_{k→∞} (μ_{k_1}/β^{k_1})^{1/k} = β < 1, so (1.7) holds.
Sufficiency: Suppose that (1.7) is valid. Then there exist γ ∈ (0, 1) and k_2 ∈ N such that ‖x^k − x̄‖^{1/k} ≤ γ for all k ≥ k_2; hence ‖x^k − x̄‖ ≤ γ^k for all k ≥ k_2. Setting μ_k := γ^k for k ∈ N, we obtain a sequence {μ_k} with ‖x^k − x̄‖ ≤ μ_k for all k ≥ k_2. Since μ_{k+1} = γμ_k for every k and lim_{k→∞} μ_k = 0, the sequence {μ_k} converges Q-linearly to 0. Thus {x^k} converges R-linearly to x̄. □
Analysis of an Algorithm in Indefinite Quadratic Programming
Indefinite Quadratic Programs and DCAs
Indefinite quadratic programming (IQP) plays a crucial role in optimization theory and has diverse applications. Sequential quadratic programming methods, such as Wilson's method and Pang's method, transform nonlinear programming problems into a series of IQP problems. Applications of IQP in finance, agriculture, economics, production, marketing, and public policy have been reviewed by Gupta. Quadratic programming models in finance are discussed in detail by Cornuéjols et al. The image enhancement problem has also been formulated as a quadratic program, as demonstrated by Jen and Wang. Methodological and functional aspects of quadratic programming are reviewed by McCarl et al. Akoa has studied IQP in the training of support vector machines with nonpositive-semidefinite kernels by DC algorithms, while recent papers by Xu et al. and Xue et al. address similar problems in machine learning. Liu et al. have considered IQP models related to support vector machines with indefinite kernels, a topic of growing interest. For quadratic programming under quadratic constraints, we refer to Wiebking's work.
Numerous studies have been devoted to numerical methods for solving IQP problems; see, e.g., [17,19,78–80,82,101–103]. Note that most of the existing algorithms find only stationary points (i.e., Karush-Kuhn-Tucker points) or local minimizers; that is, they are local solution methods. Since the IQP problem is NP-hard (see [72] and [17]), finding global solutions remains a significant challenge.
We are interested in studying and implementing two methods to solve the IQP problem. Both rely on the general framework for solving DC programs developed by Pham Dinh Tao and Le Thi Hoai An; a recent advancement in this direction is the combination of DCA with interior point methods for solving large-scale nonconvex quadratic programs. Two DC decompositions of the objective function are considered: the projection DC decomposition and the proximal DC decomposition. They give rise, respectively, to two algorithms for solving the IQP problem, called the Projection DC decomposition algorithm (Algorithm A) and the Proximal DC decomposition algorithm (Algorithm B); detailed descriptions can be found in [57,82]. Among the key features of these algorithms are the following:
- The algorithm descriptions are simple;
- No line searches are required.
According to the DCA theory, any cluster point of a DCA sequence generated by specific algorithms is a KKT point of the IQP To confirm the existence of such cluster points, it is essential to establish the boundedness of the DCA sequence While DCA sequences are not always bounded, a conjecture suggests that if the IQP has global solutions, then DCA sequences generated by algorithms A and B must be bounded Recent work by Tuan has affirmed this conjecture for two-dimensional IQP cases To extend this result to the general case, Tuan utilized a local error bound for affine variational inequalities and specific properties of the KKT point set derived by Luo and Tseng The main finding indicates that if the IQP has a nonempty solution set, every DCA sequence produced by Algorithm A converges R-linearly to a KKT point.
Numerous numerical tests, which will be reported in Section 2.4, lead us to the following observations:
For both algorithms A and B, a positive decomposition parameter that is closer to the lower bound of the admissible parameter interval significantly enhances the convergence rate of the DCA sequences.
- Applied to the same problem with the same initial point, Algorithm B is more efficient than Algorithm A in terms of the number of computation steps and the execution time.
Our findings align with the recent work of Le Thi, Huynh, and Pham Dinh, who provided original proofs for significant convergence theorems related to DCA algorithms addressing optimization problems with subanalytic data Specifically, their Theorems 3.4, 3.5, and 4.2 demonstrate that any DCA sequence produced by Algorithm B converges R-linearly to a KKT point, provided the sequence is bounded However, the boundedness of DCA sequences cannot be established through the Lojasiewicz inequality, as indicated in their Theorem 2.1, along with related results on Kurdyka-Lojasiewicz properties Therefore, Theorem 2.2 and its proof represent new contributions to the analysis of existing solution algorithms in the realm of indefinite quadratic programming.
Consider the indefinite quadratic programming problem under linear constraints (called the IQP problem for brevity):
min{ f(x) := (1/2)x^T Qx + q^T x | Ax ≥ b }, (2.1)
where Q ∈ R^{n×n} and A ∈ R^{m×n} are given matrices, Q is symmetric, and q ∈ R^n and b ∈ R^m are arbitrarily given vectors. The constraint set of the problem is C := {x ∈ R^n | Ax ≥ b}. Since x^T Qx is an indefinite quadratic form, the objective function f(x) may be nonconvex; hence (2.1) is a nonconvex optimization problem.
Now we describe some standard notations that will be used later on. The unit matrix in R^{n×n} is denoted by E. The eigenvalues of a symmetric matrix M ∈ R^{n×n} are ordered so that λ_1(M) ≤ ⋯ ≤ λ_n(M), counting multiplicities. For an index set α ⊂ {1, …, m}, the matrix A_α consists of the rows A_i of A with i ∈ α, and the vector b_α is formed from the components b_i of b with i ∈ α. The pseudo-face of the convex set C associated with α is the set {x ∈ R^n | A_α x = b_α, A_ᾱ x > b_ᾱ}, where ᾱ := {1, …, m} \ α. The notation B(x, ε) stands for the open ball centered at x with radius ε > 0, and B̄(x, ε) denotes the corresponding closed ball. For vectors v^1, …, v^s in R^n, the closed convex cone generated by them is pos{v^1, …, v^s} := { Σ_{i=1}^{s} t_i v^i | t_i ≥ 0, i = 1, …, s }.
The metric projection of u ∈ R^n onto C is denoted by P_C(u); that is, P_C(u) belongs to C and ‖u − P_C(u)‖ = min_{x∈C} ‖u − x‖.
The tangent cone to C at x ∈ C is denoted by T_C(x), i.e., T_C(x) = {t(y − x) | t ≥ 0, y ∈ C} = {v ∈ R^n | A_α v ≥ 0}, where α = {i | A_i x = b_i}. The normal cone to C at x ∈ C is denoted by N_C(x), i.e., N_C(x) = {w ∈ R^n | ⟨w, y − x⟩ ≤ 0 for all y ∈ C}.
To solve the IQP problem by DCA, one decomposes f(x) as the difference of two convex linear-quadratic functions, f(x) = ϕ(x) − ψ(x), where ϕ(x) := (1/2)x^T Q_1 x + q^T x and ψ(x) := (1/2)x^T Q_2 x, (2.2) with Q_1 a symmetric positive definite matrix and Q_2 a symmetric positive semidefinite matrix such that Q = Q_1 − Q_2. The IQP problem then becomes the DC program min{g(x) − h(x) | x ∈ R^n}, where g(x) := ϕ(x) + δ_C(x), with δ_C the indicator function of the feasible set C, and h(x) := ψ(x). Starting from an initial point x^0 ∈ R^n, at each step k ≥ 0 one computes y^k = ∇h(x^k) = Q_2 x^k and then finds x^{k+1} as the unique solution of the strongly convex quadratic program
min{ (1/2)x^T Q_1 x + q^T x − x^T Q_2 x^k | x ∈ C }. (2.3)
The obtained sequence {x^k} is called the DCA sequence generated by the DC algorithm and the initial point x^0.
Definition 2.1 For x ∈ R^n, if there exists a multiplier λ ∈ R^m such that Qx + q − A^T λ = 0, Ax ≥ b, λ ≥ 0, λ^T(Ax − b) = 0, (2.4) then x is said to be a Karush-Kuhn-Tucker point (a KKT point) of the IQP problem.
A point x ∈ C is a KKT point of the IQP problem (2.1) if and only if ⟨∇f(x), v⟩ = (Qx + q)^T v ≥ 0 for every v ∈ T_C(x), or, equivalently, ⟨∇f(x), y − x⟩ ≥ 0 for every y ∈ C. Thus, x is a KKT point of (2.1) if and only if x solves the affine variational inequality x ∈ C, ⟨Qx + q, u − x⟩ ≥ 0 for all u ∈ C.
Denote the KKT point set (resp., the global solution set) of the IQP problem by C* (resp., S). It is well known (see, e.g., [50]) that S ⊂ C*.
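Numerically, membership in C* can be tested directly from (2.4): check primal feasibility, then look for a nonnegative multiplier supported on the active constraints. The sketch below assumes NumPy and SciPy are available and uses a nonnegative least-squares solve; the tolerance and the helper name is_kkt_point are illustrative choices, not part of any cited reference.

```python
import numpy as np
from scipy.optimize import nnls

def is_kkt_point(Q, q, A, b, x, tol=1e-8):
    """Check the KKT system (2.4) at x for min (1/2) x^T Q x + q^T x s.t. A x >= b."""
    r = A @ x - b
    if np.min(r) < -tol:                      # primal feasibility: A x >= b
        return False
    active = r <= tol                         # complementarity forces lambda_i = 0 off the active set
    if not np.any(active):
        return np.linalg.norm(Q @ x + q) <= tol   # interior point: need Q x + q = 0
    # Stationarity: find lambda_active >= 0 with A_active^T lambda = Q x + q.
    _, residual = nnls(A[active].T, Q @ x + q)
    return residual <= tol

# Small example: f(x) = -(x1^2 + x2^2) on the box 0 <= x <= 1 (written as A x >= b).
Q = -2.0 * np.eye(2)
q = np.zeros(2)
A = np.vstack([np.eye(2), -np.eye(2)])
b = np.array([0.0, 0.0, -1.0, -1.0])
print(is_kkt_point(Q, q, A, b, np.array([1.0, 1.0])))   # True: a vertex KKT point
print(is_kkt_point(Q, q, A, b, np.array([0.5, 0.5])))   # False
```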
Let us recall some properties of DCA sequences that follow from the fundamental theorem on DCAs (Theorem 1.1 and Remark 1.1) when applied to the IQP problem (2.1). Note that the modulus of convexity of the function g(x) = ϕ(x) + δ_C(x), where ϕ(x) = (1/2)x^T Q_1 x + q^T x, is at least λ_1(Q_1), while the modulus of convexity of the function h(x) = ψ(x), where ψ(x) = (1/2)x^T Q_2 x, equals λ_1(Q_2).
Theorem 2.1 (See [81, Theorem 3] and [82, Theorem 2.1]) Every DCA sequence {x^k} generated by the above DC algorithm and an initial point x^0 ∈ R^n has the following properties:
(i) f(x^{k+1}) ≤ f(x^k) − ((λ_1(Q_1) + λ_1(Q_2))/2)‖x^{k+1} − x^k‖^2 for every k ≥ 1;
(ii) {f(x^k)} converges to an upper bound f* for the optimal value of (2.1);
(iii) every cluster point x̄ of {x^k} is a KKT point of (2.1);
(iv) if inf_{x∈C} f(x) > −∞, then lim_{k→∞} ‖x^{k+1} − x^k‖ = 0.
Remark 2.1 By [81, Theorem 3], if x^0 ∈ C then the inequality in (i) holds for every k ≥ 0. To see this, it suffices to note that x^0 ∈ C = dom g, where g = ϕ + δ_C and dom g := {x | g(x) < +∞}.
The smallest eigenvalue λ_1(Q) and the largest eigenvalue λ_n(Q) of the matrix Q can be computed by standard numerical linear algebra routines. Using these values, the DC decomposition (2.2) can be done in the following two ways (a computational sketch is given after item (b)):
(a) Q_1 := ρE, Q_2 := ρE − Q, where ρ is a positive real value satisfying the condition ρ ≥ λ_n(Q);
(b) Q_1 := Q + ρE, Q_2 := ρE, where ρ is a positive real value satisfying the condition ρ > −λ_1(Q).
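The following sketch (assuming NumPy; the helper name dc_decompositions and the margin parameter are illustrative) computes λ_1(Q) and λ_n(Q) with a symmetric eigensolver and forms the matrices Q_1, Q_2 for the two decompositions (a) and (b).

```python
import numpy as np

def dc_decompositions(Q, margin=0.0):
    """Return (Q1, Q2) for decomposition (a) and for decomposition (b) of Q = Q1 - Q2.
    A positive margin pushes rho strictly inside the admissible interval."""
    eigenvalues = np.linalg.eigvalsh(Q)          # ordered: lambda_1(Q) <= ... <= lambda_n(Q)
    E = np.eye(Q.shape[0])

    rho_a = eigenvalues[-1] + margin             # (a) requires rho >= lambda_n(Q)
    Q1_a, Q2_a = rho_a * E, rho_a * E - Q        # Q2_a is positive semidefinite

    rho_b = max(-eigenvalues[0], 0.0) + max(margin, 1e-12)   # (b) requires rho > -lambda_1(Q)
    Q1_b, Q2_b = Q + rho_b * E, rho_b * E        # Q1_b is positive definite

    return (Q1_a, Q2_a), (Q1_b, Q2_b)

Q = np.array([[1.0, 2.0], [2.0, -3.0]])          # an indefinite symmetric matrix
(Q1a, Q2a), (Q1b, Q2b) = dc_decompositions(Q, margin=0.1)
print(np.linalg.eigvalsh(Q2a) >= -1e-12)         # [ True  True ]
print(np.linalg.eigvalsh(Q1b) > 0)               # [ True  True ]
```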
The number ρ is called the decomposition parameter The following algo- rithms appear on the basis of (a) and (b), respectively.
Projection DC decomposition algorithm (Algorithm A). Choose a number ρ > 0 satisfying ρ ≥ λ_n(Q) and an initial point x^0 ∈ R^n. For each k ≥ 0, the next iterate is x^{k+1} := P_C( x^k − (1/ρ)(Qx^k + q) ), which is the unique solution of (2.3) with Q_1 := ρE and Q_2 := ρE − Q. Setting y^k := Q_2 x^k, this subproblem can be rewritten as
min{ ‖x − (1/ρ)(y^k − q)‖^2 | x ∈ C }. (2.7)
The scheme of the algorithm with a stopping criterion is as follows; a computational sketch is given after the scheme. (To have an infinite DCA sequence, one has to choose ε = 0.)
Step 3. Calculate x^{k+1} by solving the convex program (2.7).
Step 4. If ‖x^{k+1} − x^k‖ ≤ ε then stop, else go to Step 5.
Step 5. Set k := k + 1 and go to Step 2.
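A compact Python sketch of Algorithm A is given below. It assumes NumPy/SciPy; the projection onto C = {x | Ax ≥ b} is itself a small convex QP and is delegated to scipy.optimize.minimize with the SLSQP method, which is adequate for tiny illustrative instances but not a production choice. The helper names are, of course, not taken from [57,82].

```python
import numpy as np
from scipy.optimize import minimize

def project_onto_C(z, A, b):
    """Euclidean projection of z onto C = {x : A x >= b} (a small convex QP)."""
    res = minimize(lambda x: 0.5 * np.sum((x - z) ** 2),
                   x0=z,
                   jac=lambda x: x - z,
                   constraints=[{"type": "ineq", "fun": lambda x: A @ x - b,
                                 "jac": lambda x: A}],
                   method="SLSQP")
    return res.x

def algorithm_A(Q, q, A, b, x0, rho=None, eps=1e-6, max_iter=500):
    """Projection DC decomposition algorithm: x_{k+1} = P_C(x_k - (Q x_k + q)/rho)."""
    if rho is None:
        rho = max(np.linalg.eigvalsh(Q)[-1], 1e-6)   # rho >= lambda_n(Q), kept positive
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        x_next = project_onto_C(x - (Q @ x + q) / rho, A, b)
        if np.linalg.norm(x_next - x) <= eps:
            return x_next
        x = x_next
    return x

# Tiny indefinite example on the box [0, 1]^2.
Q = np.array([[1.0, 0.0], [0.0, -2.0]])
q = np.array([-1.0, 0.5])
A = np.vstack([np.eye(2), -np.eye(2)])
b = np.array([0.0, 0.0, -1.0, -1.0])
print(algorithm_A(Q, q, A, b, x0=np.array([0.2, 0.2])))   # a KKT point of this instance
```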
Proximal DC decomposition algorithm (Algorithm B). Choose a number ρ > 0 satisfying ρ > −λ_1(Q) and an initial point x^0 ∈ R^n. For each k ≥ 0, compute the unique solution x^{k+1} of the strongly convex quadratic minimization problem
min{ ψ(x) := (1/2)x^T Q_1 x + q^T x − x^T Q_2 x^k | x ∈ C }, (2.9)
where Q_1 := Q + ρE and Q_2 := ρE. The scheme of the algorithm with a stopping criterion is as follows; a computational sketch is given after the scheme. (To have an infinite DCA sequence, one has to choose ε = 0.)
Step 2. Calculate x^{k+1} by solving the convex program (2.9).
Step 3. If ‖x^{k+1} − x^k‖ ≤ ε then stop, else go to Step 4.
Step 4. Set k := k + 1 and go to Step 2.
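For comparison, here is an analogous sketch of Algorithm B under the same assumptions (NumPy/SciPy, SLSQP for the strongly convex subproblem (2.9)); in a serious implementation one would use a dedicated QP solver, possibly warm-started with x^k.

```python
import numpy as np
from scipy.optimize import minimize

def algorithm_B(Q, q, A, b, x0, rho=None, eps=1e-6, max_iter=500):
    """Proximal DC decomposition algorithm: at each step solve the strongly convex
    QP (2.9): min (1/2) x^T (Q + rho E) x + (q - rho x_k)^T x  s.t.  A x >= b."""
    n = len(q)
    if rho is None:
        rho = max(-np.linalg.eigvalsh(Q)[0], 0.0) + 1e-3   # rho > -lambda_1(Q)
    M = Q + rho * np.eye(n)
    cons = [{"type": "ineq", "fun": lambda z: A @ z - b, "jac": lambda z: A}]
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        qk = q - rho * x
        res = minimize(lambda z: 0.5 * z @ M @ z + qk @ z,
                       x0=x, jac=lambda z: M @ z + qk,
                       constraints=cons, method="SLSQP")
        x_next = res.x
        if np.linalg.norm(x_next - x) <= eps:
            return x_next
        x = x_next
    return x

# Same tiny instance as in the Algorithm A sketch above.
Q = np.array([[1.0, 0.0], [0.0, -2.0]]); q = np.array([-1.0, 0.5])
A = np.vstack([np.eye(2), -np.eye(2)]); b = np.array([0.0, 0.0, -1.0, -1.0])
print(algorithm_B(Q, q, A, b, x0=np.array([0.2, 0.2])))   # approximately [1, 0], a KKT point
```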
Convergence and Convergence Rate of the Algorithm
Since the KKT point set C* is the solution set of an affine variational inequality, it is the union of finitely many polyhedral convex sets; in particular, C* has finitely many connected components. Since S ⊂ C*, the nonemptiness of S implies that C* is nonempty. For any nonempty subset M ⊂ R^n, the distance from a point x to M is defined by d(x, M) := inf{‖x − y‖ | y ∈ M}.
We will need two auxiliary results. The next lemma gives a local error bound for the distance d(x, C*) from a feasible point x ∈ C to C*.
Lemma 2.1 ([92, Lemma 2.1]; cf. [65, Lemma 3.1]) For any ρ > 0, if C* ≠ ∅, then there exist scalars ε > 0 and ℓ > 0 such that
d(x, C*) ≤ ℓ‖x − P_C(x − (1/ρ)(Qx + q))‖ (2.10)
for all x ∈ C with
‖x − P_C(x − (1/ρ)(Qx + q))‖ ≤ ε. (2.11)
Lemma 2.2 ([65, Lemma 3.1]; see also [92, Lemma 2.2]) Let C_1, C_2, …, C_r denote the connected components of C*. Then C* = C_1 ∪ ⋯ ∪ C_r, and the following properties are valid:
(a) each C i is the union of finitely many polyhedral convex sets;
(b) the sets C_i, i = 1, …, r, are properly separated from each other, that is, there exists δ > 0 such that if i ≠ j then d(x, C_j) ≥ δ for all x ∈ C_i;
(c) f is constant on each Ci.
Since x^{k+1} is the unique solution of (2.9), it is characterized by the condition ⟨∇ψ(x^{k+1}), x − x^{k+1}⟩ ≥ 0 for all x ∈ C, where ∇ψ(x^{k+1}) = Qx^{k+1} + q + ρx^{k+1} − ρx^k. Hence x^{k+1} is the unique solution of the strongly monotone affine variational inequality defined by the affine operator x ↦ (Q + ρE)x + q − ρx^k and the polyhedral convex set C. Therefore, applying Theorem 2.3 from [45, p. 9], we see that x^{k+1} is the unique fixed point of the map G_k(x) := P_C(x − μ(Mx + q^k)), where μ > 0 is arbitrarily chosen, M := Q + ρE, and q^k := q − ρx^k. In what follows, we choose μ = ρ^{−1}. Then
x^{k+1} = P_C( x^{k+1} − (1/ρ)(Mx^{k+1} + q^k) ). (2.12)
The convergence and the rate of convergence of Algorithm B, the Proximal DC decomposition algorithm, can be formulated as follows.
Theorem 2.2 If (2.1) has a solution, then for each x^0 ∈ R^n, the DCA sequence {x^k} constructed by Algorithm B converges R-linearly to a KKT point of (2.1); that is, there exists x̄ ∈ C* such that limsup_{k→∞} ‖x^k − x̄‖^{1/k} < 1.
Proof. Since (2.1) has a solution, C* ≠ ∅. Hence, by Lemma 2.1 there exist ℓ > 0 and ε > 0 such that (2.10) is fulfilled for any x satisfying (2.11). As inf_{x∈C} f(x) > −∞, assertion (iv) of Theorem 2.1 gives
lim_{k→∞} ‖x^{k+1} − x^k‖ = 0. (2.13)
Choose k_0 ∈ N so large that ‖x^{k+1} − x^k‖ < ε for all k ≥ k_0.
If it holds that
‖x^{k+1} − P_C(x^{k+1} − (1/ρ)(Qx^{k+1} + q))‖ ≤ ε for all k ≥ k_0, (2.14)
then by (2.10) one has
d(x^{k+1}, C*) ≤ ℓ‖x^{k+1} − P_C(x^{k+1} − (1/ρ)(Qx^{k+1} + q))‖ for all k ≥ k_0. (2.15)
To obtain (2.14), for any k ≥ k_0, we recall that
x^{k+1} = G_k(x^{k+1}) = P_C( x^{k+1} − (1/ρ)(Mx^{k+1} + q^k) ). (2.16)
Combining this with the nonexpansiveness of P_C(·) [45, Corollary 2.4, p. 10] yields
‖x^{k+1} − P_C(x^{k+1} − (1/ρ)(Qx^{k+1} + q))‖ = ‖P_C(x^{k+1} − (1/ρ)(Mx^{k+1} + q^k)) − P_C(x^{k+1} − (1/ρ)(Qx^{k+1} + q))‖ ≤ ‖(1/ρ)(Mx^{k+1} + q^k) − (1/ρ)(Qx^{k+1} + q)‖ = ‖x^{k+1} − x^k‖ < ε
for every k ≥ k_0. Hence (2.14) is valid and, in addition, we have ‖x^{k+1} − P_C(x^{k+1} − (1/ρ)(Qx^{k+1} + q))‖ ≤ ‖x^{k+1} − x^k‖.
From this and (2.15) it follows that
d(x^{k+1}, C*) ≤ ℓ‖x^{k+1} − x^k‖ for all k ≥ k_0. (2.17)
Since C* is closed and nonempty, for each k ∈ {0, 1, 2, …} we can find y^k ∈ C* such that d(x^k, C*) = ‖x^k − y^k‖. Then (2.17) implies that
‖x^{k+1} − y^{k+1}‖ ≤ ℓ‖x^{k+1} − x^k‖ for all k ≥ k_0. (2.18)
So, as a consequence of (2.13),
lim_{k→∞} ‖y^{k+1} − x^{k+1}‖ = 0. (2.19)
From (2.13) and (2.19) it follows that lim_{k→∞} ‖y^{k+1} − y^k‖ = 0. Hence, denoting by C_1, C_2, …, C_r the connected components of C* and using assertion (b) of Lemma 2.2, we can find an index i_0 ∈ {1, …, r} and a number k_1 ≥ k_0 such that y^k ∈ C_{i_0} for all k ≥ k_1. Consequently, by assertion (c) of Lemma 2.2, there is a constant c ∈ R such that
f(y^k) = c for all k ≥ k_1. (2.21)
Since (2.1) has a solution, by Theorem 2.1 we can find a real value f* such that lim_{k→∞} f(x^k) = f*.
By the classical Mean Value Theorem and the formula ∇f(x) = Qx + q, for every k there is z^k ∈ (x^k, y^k) := {(1 − t)x^k + ty^k | 0 < t < 1} such that f(y^k) − f(x^k) = ⟨Qz^k + q, y^k − x^k⟩. Since y^k is a KKT point, it holds that 0 ≤ ⟨Qy^k + q, x^k − y^k⟩. Adding this inequality and the preceding equality, we get
f(y^k) − f(x^k) ≤ ⟨Q(z^k − y^k), y^k − x^k⟩. (2.22)
On the one hand, from (2.21) and (2.22) it follows that c = f(y^k) ≤ f(x^k) + ‖Q‖‖y^k − x^k‖^2 for all k ≥ k_1. As lim_{k→∞} [f(x^k) + ‖Q‖‖y^k − x^k‖^2] = f* due to (2.19), this forces
c ≤ f*. (2.23)
On the other hand, since x^{k+1} = P_C(x^{k+1} − (1/ρ)(Mx^{k+1} + q^k)) by (2.16), the characterization of the metric projection onto a closed convex set [45, Theorem 2.3, p. 9] gives us
⟨Mx^{k+1} + q^k, y^{k+1} − x^{k+1}⟩ ≥ 0 for all k ∈ N.
From this and (2.18) we get, for all k ≥ k_0,
⟨My^{k+1} + q^k, x^{k+1} − y^{k+1}⟩ = ⟨Mx^{k+1} + q^k, x^{k+1} − y^{k+1}⟩ + ⟨M(y^{k+1} − x^{k+1}), x^{k+1} − y^{k+1}⟩ ≤ ‖M‖‖x^{k+1} − y^{k+1}‖^2 ≤ ℓ^2‖M‖‖x^{k+1} − x^k‖^2.
So, setting α = ℓ^2‖M‖, we have
⟨My^{k+1} + q^k, x^{k+1} − y^{k+1}⟩ ≤ α‖x^{k+1} − x^k‖^2. (2.24)
For each k ≥ k_1, since M = Q + ρE and q^k = q − ρx^k, invoking (2.24) and using (2.18) once more, we have
f(x^{k+1}) − c = f(x^{k+1}) − f(y^{k+1})
≤ α‖x^{k+1} − x^k‖^2 + (1/2)‖Q‖‖x^{k+1} − y^{k+1}‖^2 + ρ‖x^{k+1} − x^k‖‖x^{k+1} − y^{k+1}‖ + ρ‖x^{k+1} − y^{k+1}‖^2
≤ (α + (1/2)‖Q‖ℓ^2 + ρℓ(1 + ℓ))‖x^{k+1} − x^k‖^2.
Therefore, with β := α + (1/2)‖Q‖ℓ^2 + ρℓ(1 + ℓ), we get
f(x^{k+1}) ≤ c + β‖x^{k+1} − x^k‖^2. (2.25)
Letting k → ∞, from (2.25) we can deduce that f* = lim_{k→∞} f(x^{k+1}) ≤ c.
Combining the last expression with (2.23) yields f* = c. Therefore, by (2.25) and the first assertion of Theorem 2.1 we obtain
f(x^{k+1}) − f* ≤ β‖x^{k+1} − x^k‖^2 ≤ (2β/(λ_1(Q_1) + λ_1(Q_2)))(f(x^k) − f(x^{k+1})),
where Q_1 = Q + ρE and Q_2 = ρE. Putting γ := λ_1(Q_1) + λ_1(Q_2), from the condition ρ > −λ_1(Q) we get γ = (λ_1(Q) + ρ) + ρ > 0. Therefore,
f(x^{k+1}) − f* ≤ (2β/γ)[(f(x^k) − f*) − (f(x^{k+1}) − f*)]
for all k ≥ k_1. Setting t := 2β/γ and μ := (t/(1 + t))^{1/2} ∈ (0, 1), we obtain f(x^{k+1}) − f* ≤ μ^2(f(x^k) − f*) for all k ≥ k_1; hence there is a constant r_0 > 0 such that f(x^k) − f* ≤ r_0 μ^{2k} for all k > k_1. Consequently,
f(x^k) − f(x^{k+1}) ≤ (f(x^k) − f*) + (f(x^{k+1}) − f*) ≤ r_0 μ^{2k+2} + r_0 μ^{2k} = r_1 μ^{2k} for all k > k_1,
where r_1 := r_0(μ^2 + 1). Using the first assertion of Theorem 2.1 once more, we see that
‖x^{k+1} − x^k‖^2 ≤ (2/γ)(f(x^k) − f(x^{k+1})) ≤ (2r_1/γ)μ^{2k} for all k > k_1.
Hence ‖x^{k+1} − x^k‖ ≤ rμ^k for all k > k_1, where r := (2r_1/γ)^{1/2} and μ ∈ (0, 1). Let ε > 0 be given arbitrarily. For each positive integer p, we have
‖x^{k+p} − x^k‖ ≤ ‖x^{k+p} − x^{k+p−1}‖ + ⋯ + ‖x^{k+1} − x^k‖ ≤ r(μ^{k+p−1} + ⋯ + μ^k) ≤ (r/(1 − μ))μ^k.
Since (r/(1 − μ))μ^k < ε for all k sufficiently large, {x^k} is a Cauchy sequence; hence it converges to a point x̄ ∈ C. By the third assertion of Theorem 2.1, x̄ belongs to C*. Passing the inequality ‖x^{k+p} − x^k‖ ≤ (r/(1 − μ))μ^k to the limit as p → ∞, we get ‖x^k − x̄‖ ≤ (r/(1 − μ))μ^k for all k large enough. So ‖x^k − x̄‖^{1/k} ≤ (r/(1 − μ))^{1/k}μ for all k large enough. Therefore, limsup_{k→∞} ‖x^k − x̄‖^{1/k} ≤ μ < 1.
This proves that {x^k} converges R-linearly to a KKT point of (2.1). □
Remark 2.2 According to Theorem 2.2, one can find a constant C such that limsup_{k→∞} ‖x^k − x̄‖^{1/k} < C < 1. So, if the computation is terminated at step k, provided that k is sufficiently large, then one has ‖x^k − x̄‖^{1/k} < C, that is, ‖x^k − x̄‖ < C^k. Therefore, the computation error between the obtained approximate solution x^k and the exact limit point x̄ of the sequence {x^k} is smaller than the number C^k. Since C ∈ (0, 1), the error bound C^k tends to 0 as k → ∞.
Asymptotical Stability of the Algorithm
We will prove that DCA sequences generated by Algorithm B converge to a locally unique solution of (2.1) if the initial points are taken from a suitably-chosen neighborhood of it.
Adopting the viewpoint of discrete dynamical systems, consider an iteration algorithm that produces a unique point x^{k+1} from the previously defined point x^k, k ∈ N. Following Leong and Goh [59, Definition 2], we define the stability of a KKT point with respect to such an algorithm as follows.
Definition 2.2 The KKT point ¯x of (2.1) is:
(i) stable w.r.t. the iteration algorithm if, for any ε > 0, there exists δ > 0 such that whenever x^0 ∈ B(x̄, δ), the DCA sequence {x^k} generated by the algorithm and the initial point x^0 satisfies x^k ∈ B(x̄, ε) for all k ≥ 0;
(ii) attractive if there exists δ > 0 such that whenever x 0 ∈ B(¯x, δ), the DCA sequence generated by the iteration algorithm and the initial point x 0 has the property lim k→∞x k = ¯x;
(iii) asymptotically stable w.r.t. the iteration algorithm if it is stable and attractive w.r.t. that algorithm.
In an optimization problem defined as min{g(x) | x ∈ Ω}, where g: R n → R is a real-valued function and Ω is a subset of R n, a point ¯x ∈ Ω is considered a locally unique solution if there exists an ε > 0 such that g(x) is greater than g(¯x) for all x within the intersection of Ω and the neighborhood B(¯x, ε), excluding the point ¯x itself The following two lemmas will present some established facts related to this concept.
Lemma 2.3 (See, e.g., [50, Theorem 3.8]) If x̄ ∈ C is a locally unique solution of (2.1), then there exist μ > 0 and η > 0 such that
f(x) − f(x̄) ≥ η‖x − x̄‖^2 for every x ∈ C ∩ B(x̄, μ). (2.26)
Lemma 2.4 (See, e.g., [17, Proof of Lemma 4] and [57, Lemma 1]) If the KKT point set C ∗ contains a segment [u, x], then the restriction of f on that segment is a constant function.
The main result of this section can be formulated as follows.
Theorem 2.3 Consider Algorithm B and require additionally that \(\rho > \|Q\|\). Suppose that \(\bar x\) is a locally unique solution of problem (2.1). Then, for any \(\gamma > 0\) there exists \(\delta > 0\) such that if \(x^0 \in C \cap B(\bar x, \delta)\) and if \(\{x^k\}\) is the DCA sequence generated by Algorithm B and the initial point \(x^0\), then
(a) \(x^k \in C \cap B(\bar x, \gamma)\) for every \(k \ge 0\);
(b) \(\lim_{k\to\infty} x^k = \bar x\).
In other words, x¯ is asymptotically stable w.r.t Algorithm B.
Proof. Suppose that \(\rho > \|Q\|\) and \(\bar x\) is a locally unique solution of (2.1). By Lemma 2.3, we can choose \(\mu > 0\) and \(\eta > 0\) satisfying (2.26). Given \(\gamma > 0\), replacing \(\gamma\) by a smaller positive number if necessary, we may assume that \(\gamma \in (0, \mu)\) and \(\gamma < \mu(1 - \rho^{-1}\|Q\|)\). Since \(f(x) - f(\bar x) > 0\) for all \(x \in (C \cap B(\bar x, \gamma)) \setminus \{\bar x\}\) by (2.26), the continuity of \(f\) guarantees the existence of \(\delta \in (0, \gamma)\) such that
\[ f(x) - f(\bar x) < \eta\gamma^2 \quad \text{for every } x \in C \cap B(\bar x, \delta). \quad (2.27) \]
Let us show that, with this \(\delta > 0\), assertion (a) holds for every DCA sequence generated by Algorithm B. Fix any \(x^0 \in C \cap B(\bar x, \delta)\). Since \(\delta < \gamma\), the inclusion \(x^k \in C \cap B(\bar x, \gamma)\) holds for \(k = 0\). To argue by induction, suppose that it holds for some \(k \ge 0\). Since \(\bar x\) is a locally unique solution of (2.1), it is a KKT point of that problem, i.e.,
\[ \langle Q\bar x + q, x - \bar x\rangle \ge 0 \quad \forall x \in C. \quad (2.28) \]
We claim that
\[ \bar x = P_C\Big(\bar x - \frac{1}{\rho}(Q\bar x + q)\Big). \quad (2.29) \]
Indeed, by the characterization of the metric projection [45, Theorem 2.3, p. 9], (2.29) is valid if and only if
\[ \Big\langle \bar x - \frac{1}{\rho}(Q\bar x + q) - \bar x,\; x - \bar x\Big\rangle \le 0 \quad \forall x \in C. \]
The latter is equivalent to (2.28). Using (2.12), (2.29), and the nonexpansiveness of the metric projection [45, Corollary 2.4, p. 10], we have
\[ \|x^{k+1} - \bar x\| = \Big\| P_C\Big(x^{k+1} - \frac{1}{\rho}(M x^{k+1} + q^k)\Big) - P_C\Big(\bar x - \frac{1}{\rho}(Q\bar x + q)\Big) \Big\|, \]
whence
\[ \|x^{k+1} - \bar x\| \le \big(1 - \rho^{-1}\|Q\|\big)^{-1}\|x^k - \bar x\| \le \big(1 - \rho^{-1}\|Q\|\big)^{-1}\gamma < \mu, \]
the strict inequality resulting from the property \(\gamma < \mu(1 - \rho^{-1}\|Q\|)\). Consequently, \(x^{k+1} \in C \cap B(\bar x, \mu)\). Applying (2.26) to \(x^{k+1}\) and using the inequality \(f(x^k) \ge f(x^{k+1})\) for all \(k \ge 0\), we obtain
\[ \|x^{k+1} - \bar x\|^2 \le \frac{1}{\eta}\big(f(x^{k+1}) - f(\bar x)\big) \]
\[ \le \frac{1}{\eta}\big(f(x^{0}) - f(\bar x)\big). \]
Hence \(\|x^{k+1} - \bar x\| \le \big(\tfrac{1}{\eta}\big)^{1/2}\big(f(x^0) - f(\bar x)\big)^{1/2}\). Since \(x^0 \in C \cap B(\bar x, \delta)\), combining this with (2.27) we obtain \(\|x^{k+1} - \bar x\| < \gamma\), which means that \(x^{k+1} \in C \cap B(\bar x, \gamma)\). Thus, we have proved that \(x^k \in C \cap B(\bar x, \gamma)\) for every \(k \ge 0\).
We now prove the attractiveness. By the first part of the proof, for any \(\gamma > 0\) there exists \(\delta = \delta(\gamma) > 0\) such that assertion (a) holds whenever \(x^0 \in C \cap B(\bar x, \delta)\); moreover, we may assume that \(\gamma \in (0, \mu)\) and \(\delta \in (0, \gamma)\). We claim that, after shrinking \(\gamma\) if necessary (the validity of (a) is preserved), assertion (b) also holds for every \(x^0 \in C \cap B(\bar x, \delta)\). Suppose the claim is false. Then one can find sequences \(\gamma_j \downarrow 0\) and \(\delta_j := \delta(\gamma_j) \downarrow 0\) such that the stability assertion holds for each pair \((\delta_j, \gamma_j)\) and, for each \(j\), there is an initial point \(x^{0,j} \in C \cap B(\bar x, \delta_j)\) whose DCA sequence \(\{x^{k,j}\}_k\) does not converge to \(\bar x\). Selecting a suitable subsequence of \(\{x^{k,j}\}_k\), we obtain a limit point \(\tilde x^j \in C \cap \bar B(\bar x, \gamma_j)\), i.e., \(\|\tilde x^j - \bar x\| \le \gamma_j\), (2.30) with \(\tilde x^j \ne \bar x\); by Theorem 2.1, \(\tilde x^j \in C^*\) for all \(j\). Clearly, \(\tilde x^j \to \bar x\) as \(j \to \infty\).
For each \(j\), one can find an integer \(k(j) \ge 1\) such that \(\gamma_{j+k(j)} < \|\tilde x^j - \bar x\|\). Then, by (2.30), one has \(\|\tilde x^{j+k(j)} - \bar x\| < \|\tilde x^j - \bar x\|\).
Let \(z^1 = \tilde x^1\) and \(z^{p+1} = \tilde x^{p+k(p)}\) for \(p = 1, 2, \ldots\). Then \(\{z^p\}\) is a subsequence of \(\{\tilde x^j\}\) with \(z^p \ne z^{p'}\) whenever \(p \ne p'\). Hence, passing to a subsequence, we may assume that \(\tilde x^j \ne \tilde x^\ell\) whenever \(j \ne \ell\). Since the number of pseudo-faces of \(C\) is finite, there exists an index set \(\alpha \subset \{1, \ldots, m\}\) such that the pseudo-face
\[ F_\alpha := \{x \in \mathbb{R}^n \mid A_\alpha x = b_\alpha,\ A_{\bar\alpha} x > b_{\bar\alpha}\} \]
contains infinitely many members of the sequence \(\{\tilde x^j\}\). Without loss of generality, we may assume that the whole sequence \(\{\tilde x^j\}\) is contained in \(F_\alpha\). By [50, Lemma 4.1], \(C^* \cap F_\alpha\) is a convex set. Hence, by Lemma 2.4, the restriction of \(f\) to \(C^* \cap F_\alpha\) is a constant function. It follows that \(f(\tilde x^j) = f(\bar x)\) for all \(j\). Since \(\tilde x^j \ne \bar x\) for every \(j\), this contradicts (2.26). The claim is proved, and the theorem follows. □
To illustrate the asymptotic stability of Algorithm B, let us consider the following example.
Example 2.2 (see [50, Example 11.3, p. 207]) Consider problem (2.1) with \(n = 2\), \(m = 2\), \(Q = \begin{pmatrix} 1 & 0\\ 0 & -1\end{pmatrix}\), and \(q = (-1, 0)\), so that the objective function is \(f(x) = \frac{1}{2}(x_1^2 - x_2^2) - x_1\); the polyhedral constraint set \(C\), described by two linear inequalities, is the one given in the cited example. Since \(\lambda_1 = -1\) and \(\lambda_2 = 1\) are the eigenvalues of \(Q\), one can choose \(\rho = 2\). Using (2.4), one obtains the KKT point set \(C^* = \{(1,0),\ (\tfrac{4}{3}, \tfrac{2}{3}),\ (\tfrac{4}{3}, -\tfrac{2}{3})\}\). For this problem, one has \(S = \mathrm{loc}(P) = \{(\tfrac{4}{3}, \tfrac{2}{3}),\ (\tfrac{4}{3}, -\tfrac{2}{3})\}\). One selects an initial point, say \(x^0 = (\tfrac{3}{2}, \tfrac{1}{2})\), and chooses \(\bar x = (\tfrac{4}{3}, \tfrac{2}{3}) \in S\). Let the tolerance \(\varepsilon > 0\) be small enough and put \(\delta = \sqrt{2}\,\|x^0 - \bar x\|\).
The initial point \(x^0\) belongs to \(C \cap B(\bar x, \delta)\). The DCA sequence \(\{x^k\}\) generated by Algorithm B and \(x^0\) satisfies \(\|x^k - \bar x\| \le \varepsilon\) for each \(k \in \{0, \ldots, 27\}\); hence \(x^k \in C \cap B(\bar x, \gamma)\) with \(\gamma = \varepsilon\). Similar results are obtained by choosing \(x^0 = (\tfrac{3}{2}, -\tfrac{1}{5})\) and \(\bar x = (\tfrac{4}{3}, -\tfrac{2}{3})\).
Figure 2.1 : The DCA sequence generated by Algorithm B and x 0 = (1.5, 0.5)
Table 2.3: Asymptotic stability of Algorithm B (columns: \(k\), \(x^k\), \(f(x^k)\), \(\delta(\gamma)\) for each of the two runs).
Further Analysis
This section analyzes the impact of the decomposition parameter \(\rho\) on the convergence rates of Algorithms A and B and compares the effectiveness of the two algorithms. The algorithms were implemented in Visual C++ 2010 and tested on an Intel Core i7 processor with 4GB RAM; the solver CPLEX 11.2 was used for the linear and convex quadratic subproblems.
Recall that, for Algorithm A, the parameter \(\rho > 0\) has to satisfy the inequality \(\rho \ge \lambda_n(Q)\). For Algorithm B, \(\rho > 0\) must satisfy the strict inequality \(\rho > -\lambda_1(Q)\).
We applied Algorithms A and B to test problems of the form (2.1) in the dimensions \(n = 10, 20, 40, 60, 80\). The values \(\beta_i \in [0,10]\), \(i = 1, \ldots, n\), were randomly generated, and two distinct types of constraint sets were examined.
The constraint sets of the first type are the solution sets of linear inequality systems \(Ax \ge b\) with \(A \in \mathbb{R}^{m\times n}\) and \(b \in \mathbb{R}^m\). For each \(n \in \{10, 20, 40, 60, 80\}\), we generate a symmetric matrix \(Q \in \mathbb{R}^{n\times n}\) and a vector \(q \in \mathbb{R}^n\) whose components lie in \([0,10]\), and an initial point \(x^0 \in \mathbb{R}^n\) with components randomly chosen in \([0,5]\). We then run Algorithm A with the smallest admissible decomposition parameter, namely \(\rho = \lambda_n(Q)\) if \(\lambda_n(Q) > 0\) and \(\rho = 0.1\) otherwise; similarly, Algorithm B is run with \(\rho = -\lambda_1(Q) + 0.1\) if \(\lambda_1(Q) < 0\) and \(\rho = 0.1\) otherwise. The computation is terminated as soon as the stopping criterion \(\|x^{k+1} - x^k\| \le 10^{-6}\) is met or 1000 steps have been performed. After each test, \(\rho\) is multiplied by 1.5 and the algorithm is rerun.
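The following Python sketch illustrates this experimental protocol for Algorithm B. It is not the authors' Visual C++/CPLEX implementation: it assumes that Algorithm B takes the standard DCA form suggested by the decomposition with Hessians \(Q_1 = Q + \rho E\) and \(Q_2 = \rho E\) used earlier in this chapter (each step solves the strongly convex subproblem \(\min\{\tfrac12 x^\top(Q+\rho E)x + \langle q - \rho x^k, x\rangle : Ax \ge b\}\)), and it uses SciPy instead of CPLEX. All helper names and the random data are illustrative.

```python
import numpy as np
from scipy.optimize import minimize

def dca_b_step(Q, q, rho, A, b, x_k):
    """One DCA step (assumed form of Algorithm B): minimize the strongly convex
    subproblem (1/2) x^T (Q + rho*I) x + <q - rho*x_k, x> over {x : A x >= b}."""
    n = len(q)
    Q1 = Q + rho * np.eye(n)
    obj = lambda x: 0.5 * x @ Q1 @ x + (q - rho * x_k) @ x
    grad = lambda x: Q1 @ x + (q - rho * x_k)
    cons = [{"type": "ineq", "fun": lambda x: A @ x - b, "jac": lambda x: A}]
    return minimize(obj, x_k, jac=grad, constraints=cons, method="SLSQP").x

def run_dca_b(Q, q, rho, A, b, x0, tol=1e-6, max_steps=1000):
    x = x0.copy()
    for step in range(1, max_steps + 1):
        x_new = dca_b_step(Q, q, rho, A, b, x)
        if np.linalg.norm(x_new - x) <= tol:          # stopping criterion of the tests
            return x_new, step
        x = x_new
    return x, max_steps

rng = np.random.default_rng(0)
n, m = 10, 15
M = rng.uniform(0, 10, (n, n)); Q = 0.5 * (M + M.T)  # random symmetric, possibly indefinite
q = rng.uniform(0, 10, n)
A = rng.uniform(-1, 1, (m, n))
x0 = rng.uniform(0, 5, n)                             # random initial point in [0, 5]^n
b = A @ x0 - rng.uniform(0, 1, m)                     # makes x0 feasible for A x >= b
lam1 = np.linalg.eigvalsh(Q)[0]                       # smallest eigenvalue of Q
rho = -lam1 + 0.1 if lam1 < 0 else 0.1                # smallest rho tested for Algorithm B
for trial in range(3):                                # rho is multiplied by 1.5 after each run
    x_star, steps = run_dca_b(Q, q, rho, A, b, x0)
    print(f"rho = {rho:8.3f}   steps = {steps}")
    rho *= 1.5
```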
Due to space limitations, we only present the test results for \(n = 10, 40, 80\).
Table 2.4 presents the results of applying Algorithm A and Algorithm B to the same problem with identical initial conditions The second rows of sub-tables a) and b) indicate the smallest decomposition parameters for each algorithm, while the third and fourth rows show tests with decomposition parameters that are 1.5 times larger than the previous ones The first column lists the test ordinal numbers, the second column shows the number of iterations, the third reports the running times, and the fourth contains the decomposition parameters Notably, only 11 records are displayed, as the 12th record indicates that Algorithm A requires over 1000 steps to complete the computation.
The contents of Tables 2.5–2.9 are similar to those of Table 2.4.
With any n belonging to the set {10,20,40,60,80}, a careful analysis of these Tables allows us to observe that:
• For both algorithms, if ρ increases, then the running time, as well as the number of computation steps, increases;
•For any row of the sub-tables a) and b) with the same ordinal number, the number of steps required by Algorithm B is much smaller than that required by Algorithm A.
• For any row of the sub-tables a) and b) with the same ordinal number, the running time of Algorithm B is much smaller than that of Algorithm A.
Thus, in terms of the number of computation steps and the execution time,Algorithm B is much more efficient than Algorithm A when the algorithms are applied to the same problem.
Table 2.4: The test results for \(n = 10\) with the 1st type constraint (sub-tables a) and b); columns: No., Step, Time, \(\rho\)).
Table 2.5: The test results for \(n = 10\) with the 2nd type constraint (columns as in Table 2.4).
Table 2.6: The test results for \(n = 40\) with the 1st type constraint (columns as in Table 2.4).
Table 2.7: The test results for \(n = 40\) with the 2nd type constraint (columns as in Table 2.4).
Table 2.8: The test results for \(n = 80\) with the 1st type constraint (columns as in Table 2.4).
Table 2.9: The test results for \(n = 80\) with the 2nd type constraint (columns as in Table 2.4).
Conclusions
We have established two properties of Algorithm B for the IQP problem:
- Every DCA sequence generated by Algorithm B is bounded and, moreover, converges R-linearly to a KKT point of the problem in question.
- Algorithm B is asymptotically stable whenever the initial point is sufficiently near a locally unique solution of the problem and the DCA decomposition parameter satisfies a mild additional condition.
We have carried out many numerical experiments, which demonstrate that:
- The convergence rate of DCA sequences is significantly affected by the decomposition parameter: the larger the parameter, the longer the execution time. Hence, it is advisable to select the smallest admissible decomposition parameter.
- Algorithm B is more efficient than Algorithm A on randomly generated data sets.
Qualitative Properties of the Minimum Sum-of-Squares Clustering Problems
Clustering is a crucial process in data mining that facilitates automated data analysis It involves grouping a subset of data where the elements within each cluster share similarities.
Clustering problems can be approached using various criteria, including Euclidean distance, L1-distance, and the square of Euclidean distance Among these methods, the Minimum Sum-of-Squares Clustering (MSSC) criterion is particularly popular and widely utilized in clustering analysis.
The minimum sum-of-squares clustering (MSSC) problem requires partitioning a given finite data set into a prescribed number of clusters so that the sum of the squared Euclidean distances from each data point to the centroid of its cluster is minimized.
The importance of the MSSC problem was recognized by researchers long ago, and many algorithms have been developed to solve it.
Since the problem is NP-hard, the existing algorithms can, in general, yield only local solutions. Various techniques have been employed to obtain better data partitions; in particular, a method proposed in [71] uses Difference-of-Convex-functions Algorithms (DCA) to identify good starting points, and DCA has also been applied directly to the MSSC problem in [7, 52, 60].
This chapter establishes a series of basic qualitative properties of the problem. First, we show the equivalence between its mixed integer programming formulation and the unconstrained nonsmooth nonconvex optimization formulation given in [71]. Then we prove that the MSSC problem always has a global solution and that, under a mild condition, the global solution set is finite and the components of each global solution can be computed by an explicit formula.
This chapter aims to characterize local solutions of the MSSC problem by establishing necessary and sufficient conditions for a system of centroids to qualify as a nontrivial local solution, drawing on the principles of DC programming and previous research The insights gained from these characterizations could enhance the understanding and refinement of existing algorithms focused on local solutions Additionally, we evaluate the performance of the k-means algorithm, a fundamental method for addressing the MSSC problem, through a carefully constructed example.
This chapter aims to analyze how small changes in the data set affect the optimal value, global solution set, and local solution set of the MSSC problem We will establish three key stability properties: the optimal value function is locally Lipschitz, the global solution map is locally upper Lipschitz, and the local solution map exhibits the Aubin property, assuming the original data points are pairwise distinct.
In the n-dimensional Euclidean space R^n, let A = {a1, , am} denote a finite set of data points to be grouped The objective is to partition this set into k disjoint subsets, known as clusters, where k is a positive integer and k ≤ m, while optimizing a specific clustering criterion.
If one associates to each cluster \(A_j\) a center (or centroid) \(x^j \in \mathbb{R}^n\), then the following well-known variance or SSQ (Sum-of-Squares) clustering criterion (see, e.g., [15, p. 266]) is used:
\[ \psi(x, \alpha) := \frac{1}{m}\sum_{i=1}^{m}\sum_{j=1}^{k} \alpha_{ij}\,\|a^i - x^j\|^2 \;\longrightarrow\; \min, \]
where \(\alpha_{ij} = 1\) if \(a^i \in A_j\) and \(\alpha_{ij} = 0\) otherwise. Thus, the above partitioning problem can be formulated as the constrained optimization problem
\[ \min\Big\{\psi(x, \alpha) \;\Big|\; x \in \mathbb{R}^{nk},\ \alpha = (\alpha_{ij}) \in \mathbb{R}^{m\times k},\ \alpha_{ij} \in \{0,1\},\ \sum_{j=1}^{k}\alpha_{ij} = 1,\ i = 1, \ldots, m,\ j = 1, \ldots, k\Big\}, \quad (3.1) \]
where the centroid system \(x = (x^1, \ldots, x^k)\) and the incidence matrix \(\alpha = (\alpha_{ij})\) are to be found.
Since (3.1) is a difficult mixed integer programming problem, one usually considers instead (see, e.g., [71, p. 344]) the following unconstrained nonsmooth nonconvex optimization problem:
\[ \min\Big\{ f(x) := \frac{1}{m}\sum_{i=1}^{m}\,\min_{j\in J}\,\|a^i - x^j\|^2 \;\Big|\; x = (x^1, \ldots, x^k) \in \mathbb{R}^{nk} \Big\}. \quad (3.2) \]
The minimum sum-of-squares clustering problem (the MSSC problem, for brevity) refers to both models (3.1) and (3.2). The equivalence between these two minimization problems will be clarified below; note that their decision variables lie in different Euclidean spaces.
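As a concrete reading of formulation (3.2), here is a minimal Python sketch evaluating the nonsmooth objective \(f\) for a given centroid system; the array layout and names are illustrative, not part of the original development.

```python
import numpy as np

def mssc_objective(X, A):
    """f(x) = (1/m) * sum_i min_j ||a_i - x_j||^2, with centroids X (k x n)
    and data A (m x n), following formulation (3.2)."""
    d = ((A[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)   # d[i, j] = ||a_i - x_j||^2
    return d.min(axis=1).mean()

# toy data set (it reappears in Example 3.1 below): a1=(0,0), a2=(1,0), a3=(0,1), k = 2
A = np.array([[0., 0.], [1., 0.], [0., 1.]])
X = np.array([[0., 0.5], [1., 0.]])      # a centroid system attaining f = 1/6 on this data set
print(mssc_objective(X, A))              # prints 0.1666...
```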
Basic Properties of the MSSC Problem
Given a vector \(\bar x = (\bar x^1, \ldots, \bar x^k) \in \mathbb{R}^{nk}\), we inductively construct \(k\) subsets \(A_1, \ldots, A_k\) of \(A\) in the following way. Put \(A_0 = \emptyset\) and, for \(j = 1, \ldots, k\),
\[ A_j := \Big\{ a^i \in A \;\Big|\; \|a^i - \bar x^j\| = \min_{q\in J}\|a^i - \bar x^q\|,\ a^i \notin A_1 \cup \cdots \cup A_{j-1} \Big\}. \quad (3.3) \]
Thus, a data point \(a^i\) is put into the cluster \(A_j\) if \(\bar x^j\) is a nearest centroid to \(a^i\) and \(a^i\) has not already been assigned to one of the clusters \(A_1, \ldots, A_{j-1}\). We will call the family \(\{A_1, \ldots, A_k\}\) the natural clustering associated with \(\bar x\).
Definition 3.1 Let \(\bar x = (\bar x^1, \ldots, \bar x^k) \in \mathbb{R}^{nk}\). We say that the component \(\bar x^j\) of \(\bar x\) is attractive with respect to the data set \(A\) if the set
\[ A[\bar x^j] := \Big\{ a^i \in A \;\Big|\; \|a^i - \bar x^j\| = \min_{q\in J}\|a^i - \bar x^q\| \Big\} \]
is nonempty. The latter set is called the attraction set of \(\bar x^j\).
Clearly, the cluster \(A_j\) in (3.3) can be represented as \(A_j = A[\bar x^j] \setminus (A_1 \cup \cdots \cup A_{j-1})\) for every \(j \in J\).
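For computations, Definition 3.1 and rule (3.3) translate directly into code. The following minimal Python sketch (array layout and names are illustrative) returns the attraction sets and the natural clustering of a centroid system; ties are resolved in favor of the smallest cluster index, as the inductive construction prescribes.

```python
import numpy as np

def attraction_sets(X, A):
    """Indices of A[x_j] = {i : ||a_i - x_j|| = min_q ||a_i - x_q||} for each j."""
    d = np.linalg.norm(A[:, None, :] - X[None, :, :], axis=2)
    dmin = d.min(axis=1, keepdims=True)
    return [np.where(np.isclose(d[:, j], dmin[:, 0]))[0] for j in range(X.shape[0])]

def natural_clustering(X, A):
    """Clusters A_1, ..., A_k following (3.3): a_i joins A_j if x_j is a nearest
    centroid and a_i was not already assigned to A_1, ..., A_{j-1}."""
    assigned = np.zeros(A.shape[0], dtype=bool)
    clusters = []
    for S in attraction_sets(X, A):
        members = [i for i in S if not assigned[i]]
        assigned[members] = True
        clusters.append(members)
    return clusters

A = np.array([[0., 0.], [1., 0.], [0., 1.]])
print(natural_clustering(np.array([[0., 0.], [1., 0.]]), A))   # [[0, 2], [1]]
```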
Proposition 3.1 If \((\bar x, \bar\alpha)\) is a solution of (3.1), then \(\bar x\) is a solution of (3.2). Conversely, if \(\bar x\) is a solution of (3.2), then the natural clustering defined by (3.3) yields an incidence matrix \(\bar\alpha\) such that \((\bar x, \bar\alpha)\) is a solution of (3.1).
Proof. First, suppose that \((\bar x, \bar\alpha)\) is a solution of the optimization problem (3.1). Since \(\psi(\bar x, \bar\alpha) \le \psi(\bar x, \alpha)\) for every \(\alpha = (\alpha_{ij}) \in \mathbb{R}^{m\times k}\) with \(\alpha_{ij} \in \{0,1\}\) and \(\sum_{j=1}^{k}\alpha_{ij} = 1\) for all \(i \in I\), one must have
\[ \psi(\bar x, \bar\alpha) = \frac{1}{m}\sum_{i=1}^{m}\,\min_{j\in J}\|a^i - \bar x^j\|^2 = f(\bar x). \]
If \(\bar x\) were not a solution of (3.2), there would exist a point \(\tilde x = (\tilde x^1, \ldots, \tilde x^k) \in \mathbb{R}^{nk}\) with \(f(\tilde x) < f(\bar x)\). Let \(\{A_1, \ldots, A_k\}\) be the natural clustering associated with \(\tilde x\), and define \(\tilde\alpha = (\tilde\alpha_{ij}) \in \mathbb{R}^{m\times k}\) by \(\tilde\alpha_{ij} = 1\) if \(a^i \in A_j\) and \(\tilde\alpha_{ij} = 0\) otherwise. By the definition of the natural clustering and the choice of \(\tilde\alpha\), one has \(\psi(\tilde x, \tilde\alpha) = f(\tilde x)\). Consequently, \(\psi(\bar x, \bar\alpha) = f(\bar x) > f(\tilde x) = \psi(\tilde x, \tilde\alpha)\), which contradicts the assumption that \((\bar x, \bar\alpha)\) is a solution of (3.1).
Now, suppose that \(\bar x\) is a solution of (3.2). Let \(\{A_1, \ldots, A_k\}\) be the natural clustering associated with \(\bar x\) and define \(\bar\alpha = (\bar\alpha_{ij})\) by \(\bar\alpha_{ij} = 1\) if \(a^i \in A_j\) and \(\bar\alpha_{ij} = 0\) otherwise. Then \(\psi(\bar x, \bar\alpha) = f(\bar x)\). If there existed a feasible point \((x, \alpha)\) of (3.1) with \(\psi(x, \alpha) < \psi(\bar x, \bar\alpha)\), then we would obtain \(f(x) \le \psi(x, \alpha) < \psi(\bar x, \bar\alpha) = f(\bar x)\), contradicting the global optimality of \(\bar x\) for (3.2). The proof is complete. □
Proposition 3.2 If a 1 , , a m are pairwise distinct points and {A 1 , , A k } is the natural clustering associated with a global solution x¯ of (3.2), then A j is nonempty for every j ∈ J.
Proof. Suppose, on the contrary, that there exists an index \(j_0 \in J\) with \(A_{j_0} = \emptyset\). Since \(k \le m\), one can then find another index \(j_1 \in J\) such that \(A_{j_1}\) contains at least two distinct points. Choose \(a^{i_1} \in A_{j_1}\) with \(a^{i_1} \ne \bar x^{j_1}\) and define \(\tilde x\) by \(\tilde x^j = \bar x^j\) for all \(j \in J \setminus \{j_0\}\) and \(\tilde x^{j_0} = a^{i_1}\). Then
\[ f(\tilde x) - f(\bar x) \le -\frac{1}{m}\,\|a^{i_1} - \bar x^{j_1}\|^2 < 0. \]
This is impossible because \(\bar x\) is a global solution of (3.2). □
Remark 3.1 In practice, some data points may coincide. Naturally, if \(a^{i_1} = a^{i_2}\) with \(i_1 \ne i_2\), then \(a^{i_1}\) and \(a^{i_2}\) must belong to the same cluster. Procedure (3.3) guarantees the fulfillment of this natural requirement. By grouping identical data points and choosing from each group a unique representative, we obtain a new data set with pairwise distinct data points. Thus, there is no loss of generality in assuming that \(a^1, \ldots, a^m\) are pairwise distinct.
Theorem 3.1 Both problems (3.1) and (3.2) have solutions. If \(a^1, \ldots, a^m\) are pairwise distinct, then the solution sets are finite. Moreover, in that case, if \(\bar x = (\bar x^1, \ldots, \bar x^k) \in \mathbb{R}^{nk}\) is a global solution of (3.2), then the attraction set \(A[\bar x^j]\) is nonempty for every \(j \in J\) and one has
\[ \bar x^j = \frac{1}{|I(j)|}\sum_{i\in I(j)} a^i, \quad (3.4) \]
where \(I(j) := \{i \in I \mid a^i \in A[\bar x^j]\}\) and \(|\Omega|\) denotes the number of elements of a finite set \(\Omega\).
Proof. To prove that (3.2) has a solution (by the second assertion of Proposition 3.1, the solvability of (3.1) then follows), observe that the objective function of (3.2), being the minimum of finitely many continuous functions, is continuous on \(\mathbb{R}^{nk}\). If \(k = 1\), then
\[ f(x^1) = \frac{1}{m}\sum_{i=1}^{m}\|a^i - x^1\|^2. \]
This smooth, strongly convex function attains its unique global minimum on \(\mathbb{R}^n\) at the point \(\bar x^1 = a^0\), where
\[ a^0 := \frac{1}{m}\sum_{i\in I} a^i \quad (3.5) \]
is the barycenter of the data set \(A\) (see, e.g., [50, pp. 24–25] for more details).
To prove the solvability of (3.2) for \(k \ge 2\), put \(\rho := \max\{\|a^i - a^0\| \mid i \in I\}\), where \(a^0\) is given by (3.5), and let \(\bar B(a^0, 2\rho) \subset \mathbb{R}^n\) denote the closed ball centered at \(a^0\) with radius \(2\rho\). Consider the auxiliary optimization problem
\[ \min\big\{ f(x) \;\big|\; x = (x^1, \ldots, x^k) \in \mathbb{R}^{nk},\ x^j \in \bar B(a^0, 2\rho)\ \forall j \in J \big\}. \quad (3.6) \]
By the Weierstrass theorem, problem (3.6) has a solution \(\bar x = (\bar x^1, \ldots, \bar x^k)\), and \(\|\bar x^j - a^0\| \le 2\rho\) for all \(j \in J\). Take any \(x = (x^1, \ldots, x^k) \in \mathbb{R}^{nk}\). If \(\|x^j - a^0\| \le 2\rho\) for all \(j \in J\), then \(f(\bar x) \le f(x)\). Otherwise, let \(J_1 := \{j \in J \mid \|x^j - a^0\| > 2\rho\}\) and define \(\tilde x = (\tilde x^1, \ldots, \tilde x^k) \in \mathbb{R}^{nk}\) by \(\tilde x^j = x^j\) for \(j \in J \setminus J_1\) and \(\tilde x^j = a^0\) for \(j \in J_1\). For any \(i \in I\), one has \(\|a^i - \tilde x^j\| = \|a^i - a^0\| \le \rho < \|a^i - x^j\|\) for \(j \in J_1\), and \(\|a^i - \tilde x^j\| = \|a^i - x^j\|\) for \(j \in J \setminus J_1\); hence \(f(\tilde x) \le f(x)\). Since \(f(\bar x) \le f(\tilde x)\), it follows that \(f(\bar x) \le f(x)\); thus \(\bar x\) is a solution of (3.2). Furthermore, if \(a^1, \ldots, a^m\) are pairwise distinct and \(\bar x = (\bar x^1, \ldots, \bar x^k)\) is a global solution of (3.2), then, by Proposition 3.2, the natural clustering \(\{A_1, \ldots, A_k\}\) associated with \(\bar x\) satisfies \(A_j \ne \emptyset\) for all \(j \in J\).
Since \(A_j \ne \emptyset\), one has \(|I(j)| \ge 1\) for every \(j \in J\); hence the right-hand side of (3.4) is well defined. Fix any \(j \in J\). Since \(\|a^i - \bar x^j\| > \min_{q\in J}\|a^i - \bar x^q\|\) for every \(i \notin I(j)\), there exists \(\varepsilon > 0\) such that, whenever \(x^j \in B(\bar x^j, \varepsilon)\), one still has \(\|a^i - x^j\| > \min_{q\in J}\|a^i - \bar x^q\|\) for all \(i \notin I(j)\). Define \(\tilde x\) by \(\tilde x^q := \bar x^q\) for all \(q \in J \setminus \{j\}\) and \(\tilde x^j := x^j\). Splitting the sums defining \(f(\bar x)\) and \(f(\tilde x)\) over \(I(j)\) and \(I \setminus I(j)\) and using the inequality \(f(\bar x) \le f(\tilde x)\), one obtains \(\varphi(\bar x^j) \le \varphi(x^j)\) for every \(x^j \in B(\bar x^j, \varepsilon)\), where
\[ \varphi(x^j) := \frac{1}{m}\sum_{i\in I(j)}\|a^i - x^j\|^2. \]
Thus \(\varphi\) attains a local minimum at \(\bar x^j\). By the Fermat Rule, \(\nabla\varphi(\bar x^j) = 0\), or equivalently
\(\sum_{i\in I(j)}(a^i - \bar x^j) = 0\). This equality implies (3.4). Since there are only finitely many nonempty subsets \(\Omega \subset I\), the set \(B\) of the vectors \(b^\Omega\) defined by
\[ b^\Omega = \frac{1}{|\Omega|}\sum_{i\in\Omega} a^i \]
is finite; here \(b^\Omega\) is the barycenter of the subsystem \(\{a^i \in A \mid i \in \Omega\}\) of \(A\). By (3.4), every component of a global solution \(\bar x = (\bar x^1, \ldots, \bar x^k)\) of (3.2) belongs to \(B\); hence the solution set of (3.2) is finite whenever \(a^1, \ldots, a^m\) are pairwise distinct. Finally, by Proposition 3.1, if \((\bar x, \bar\alpha)\) is a solution of (3.1), then \(\bar x\) is a solution of (3.2). Since, in addition, \(\bar\alpha = (\bar\alpha_{ij}) \in \mathbb{R}^{m\times k}\) with \(\bar\alpha_{ij} \in \{0,1\}\) and \(\sum_{j=1}^{k}\bar\alpha_{ij} = 1\) for all \(i \in I\), it follows that the solution set of (3.1) is also finite. □
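Theorem 3.1 shows that, for pairwise distinct data points, every component of a global solution is the barycenter of some nonempty subset of \(A\). Consequently, for very small instances a global solution can be found by enumerating all assignments of the \(m\) points to \(k\) nonempty clusters and comparing the resulting barycenter systems. The sketch below is only meant to illustrate this finiteness argument (it is exponential in \(m\) and not a practical method); names and data are illustrative.

```python
import numpy as np
from itertools import product

def brute_force_global(A, k):
    """Enumerate assignments of the m points to k clusters (discarding those with an
    empty cluster), form the barycenter of each cluster, and return the centroid
    system with the smallest value of f from (3.2)."""
    m = A.shape[0]
    best_val, best_X = np.inf, None
    for labels in product(range(k), repeat=m):
        if len(set(labels)) < k:              # skip assignments with an empty cluster
            continue
        labels = np.array(labels)
        X = np.vstack([A[labels == j].mean(axis=0) for j in range(k)])
        d = ((A[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
        val = d.min(axis=1).mean()
        if val < best_val:
            best_val, best_X = val, X
    return best_val, best_X

A = np.array([[0., 0.], [1., 0.], [0., 1.]])
print(brute_force_global(A, 2))   # value 1/6 with centroids (1/2, 0) and (0, 1)
```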
Proposition 3.3 If \(\bar x = (\bar x^1, \ldots, \bar x^k) \in \mathbb{R}^{nk}\) is a global solution of (3.2), then the components of \(\bar x\) are pairwise distinct, i.e., \(\bar x^{j_1} \ne \bar x^{j_2}\) whenever \(j_1 \ne j_2\).
Proof. Suppose, on the contrary, that there exist distinct indexes \(j_1, j_2 \in J\) with \(\bar x^{j_1} = \bar x^{j_2}\). Then the components of \(\bar x\) form at most \(k - 1\) distinct points; since \(k - 1 < m\), there must exist an index \(j_0 \in J\) with \(|A[\bar x^{j_0}]| \ge 2\). Choose a data point \(a^{i_0} \in A[\bar x^{j_0}]\) with \(a^{i_0} \ne \bar x^{j_0}\) and define \(\tilde x = (\tilde x^1, \ldots, \tilde x^k)\) by \(\tilde x^j = \bar x^j\) for all \(j \in J \setminus \{j_2\}\) and \(\tilde x^{j_2} = a^{i_0}\). Then \(f(\tilde x) - f(\bar x) < 0\), which contradicts the fact that \(\bar x\) is a global solution of (3.2). □
Remark 3.2 If the points \(a^1, \ldots, a^m\) are not pairwise distinct, then the conclusions of Theorem 3.1 do not hold in general. Indeed, let \(A = \{a^1, a^2\} \subset \mathbb{R}^2\) with \(a^1 = a^2\). For \(k := 2\), let \(\bar x = (\bar x^1, \bar x^2)\) with \(\bar x^1 = a^1\) and \(\bar x^2 \in \mathbb{R}^2\) chosen arbitrarily. Since \(f(\bar x) = 0\), the point \(\bar x\) is a global solution of (3.2); so the problem has an unbounded solution set. Similarly, consider a data set \(A = \{a^1, \ldots, a^4\} \subset \mathbb{R}^2\) with \(a^1 = a^2\), \(a^3 = a^4\), and \(a^2 \ne a^3\). For \(k := 3\), let \(\bar x = (\bar x^1, \bar x^2, \bar x^3)\) with \(\bar x^1 = a^1\), \(\bar x^2 = a^3\), and \(\bar x^3 \in \mathbb{R}^2\) chosen arbitrarily. Since \(f(\bar x) = 0\), the point \(\bar x\) is a global solution of (3.2); hence the solution set of (3.2) is unbounded. Moreover, if \(\bar x^3 \notin \{\bar x^1, \bar x^2\}\), then formula (3.4) cannot be applied to \(\bar x^3\), because the index set \(I(3) = \{i \in I \mid a^i \in A[\bar x^3]\} = \{i \in I \mid \|a^i - \bar x^3\| = \min_{q\in J}\|a^i - \bar x^q\|\}\) is empty.
Formula (3.4) is effective for computing certain components of any given local solution of (3.2) The precise statement of this result is as follows.
Theorem 3.2 If x¯= (¯x 1 , ,x¯ k ) ∈ R nk is a local solution of (3.2), then (3.4) is valid for all j ∈ J whose index set I(j) is nonempty, i.e., the component ¯ x j of x¯ is attractive w.r.t the data set A.
The k-means Algorithm
Despite its possible ineffectiveness, the k-means clustering algorithm (see, e.g., [1, pp. 89–90], [39], [43, pp. 263–266], and [66]) is one of the most popular solution methods for (3.2). The convergence of this algorithm was proved in [86].
The algorithm proceeds as follows. First, one selects \(k\) initial centroids \(x^1, \ldots, x^k\) in \(\mathbb{R}^n\). Then one constructs the clusters \(A_1, \ldots, A_k\) from the data set \(A\) by the natural clustering rule (3.3), putting \(A_0 = \emptyset\) and treating \(x^j\) as the centroid of \(A_j\). After the clusters have been formed, each centroid \(x^j\) with \(A_j \ne \emptyset\) is updated by the rule
\[ x^j \longleftarrow \tilde x^j := \frac{1}{|I(A_j)|}\sum_{i\in I(A_j)} a^i, \qquad I(A_j) := \{i \in I \mid a^i \in A_j\}, \quad (3.9) \]
while the centroids of the empty clusters are kept unchanged. The assignment and update steps are repeated until the centroid system \(\{x^1, \ldots, x^k\}\) stabilizes, i.e., until \(\tilde x^j = x^j\) for all \(j \in J\) with \(A_j \ne \emptyset\).
Input: The data set A= {a 1 , , a m } and a constant ε ≥ 0 (tolerance). Output: The set of k centroids {x 1 , , x k }.
Step 1 Select initial centroids x j ∈ R n for all j ∈ J.
Step 2. Compute \(\alpha_i = \min\{\|a^i - x^j\| \mid j \in J\}\) for all \(i \in I\).
Step 3. Using the values \(\alpha_1, \ldots, \alpha_m\), divide \(A\) into the clusters \(A_1, \ldots, A_k\) by the natural clustering rule (3.3).
Step 4. Update the centroids \(x^j\) satisfying \(A_j \ne \emptyset\) by the rule (3.9), keeping the other centroids unchanged.
Step 5. Check the convergence condition: if \(\|\tilde x^j - x^j\| \le \varepsilon\) for all \(j \in J\) with \(A_j \ne \emptyset\), then stop; otherwise, go to Step 2.
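The scheme of Steps 1–5 can be rendered compactly as follows. This is only a sketch under the conventions above: ties in the assignment are broken by the smallest cluster index, as in the natural clustering (3.3), and the centroids of empty clusters are kept unchanged, as in Step 4.

```python
import numpy as np

def k_means(A, X0, eps=0.0, max_iter=100):
    """k-means as in Steps 1-5: assign points by the natural clustering, update the
    centroids of nonempty clusters by rule (3.9), stop when centroids move by <= eps."""
    X = X0.astype(float).copy()
    k = X.shape[0]
    for _ in range(max_iter):
        d = np.linalg.norm(A[:, None, :] - X[None, :, :], axis=2)
        labels = d.argmin(axis=1)                 # nearest centroid, lowest index on ties
        X_new = X.copy()
        for j in range(k):
            members = A[labels == j]
            if len(members) > 0:                  # empty clusters keep their centroid
                X_new[j] = members.mean(axis=0)
        moved = np.linalg.norm(X_new - X, axis=1)
        X = X_new
        if np.all(moved <= eps):
            break
    return X, labels

# Example 3.1(a) below: starting centroids x1 = a1, x2 = a2
A = np.array([[0., 0.], [1., 0.], [0., 1.]])
X, labels = k_means(A, np.array([[0., 0.], [1., 0.]]))
print(X)          # [[0.  0.5] [1.  0. ]]
print(labels)     # [0 1 0]
```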
The following example is designed to show how the algorithm is performed in practice.
Example 3.1 Choose m = 3, n = 2, and k = 2 Let A = {a 1 , a 2 , a 3 }, where a 1 = (0,0), a 2 = (1,0), a 3 = (0,1) Apply the k-means algorithm to solve the problem (3.2) with the tolerance ε = 0.
(a) With the starting centroids x 1 = a 1 , x 2 = a 2 , one obtains the clusters
\(A_1 = A[x^1] = \{a^1, a^3\}\) and \(A_2 = A[x^2] = \{a^2\}\). The updated centroids are \(x^1 = (0, \tfrac12)\), \(x^2 = a^2\). Then the new clusters \(A_1\) and \(A_2\) coincide with the old ones; thus \(\|\tilde x^j - x^j\| = 0\) for all \(j \in J\) with \(A_j \ne \emptyset\), and the computation terminates. For \(x^1 = (0, \tfrac12)\), \(x^2 = a^2\), one has \(f(x) = \tfrac16\).
(b) Starting with the points \(x^1 = (\tfrac14, \tfrac34)\) and \(x^2 = (2,3)\), one gets the clusters \(A_1 = A[x^1] = \{a^1, a^2, a^3\}\) and \(A_2 = A[x^2] = \emptyset\). The algorithm gives the centroid system \(x^1 = (\tfrac13, \tfrac13)\), \(x^2 = (2,3)\), and \(f(x) = \tfrac49\).
(c) Starting with \(x^1 = (0,1)\) and \(x^2 = (0,0)\), the algorithm leads to \(A_1 = A[x^1] = \{a^3\}\), \(A_2 = A[x^2] = \{a^1, a^2\}\), \(x^1 = (0,1)\), and \(x^2 = (\tfrac12, 0)\). The corresponding value of the objective function is \(f(x) = \tfrac16\).
(d) Starting with \(x^1 = (0,0)\) and \(x^2 = (\tfrac12, \tfrac12)\), the algorithm yields \(A_1 = A[x^1] = \{a^1\}\), \(A_2 = A[x^2] = \{a^2, a^3\}\), \(x^1 = (0,0)\), and \(x^2 = (\tfrac12, \tfrac12)\). The corresponding value of the objective function is \(f(x) = \tfrac13\).
(e) Choosing \(x^1 = \ldots\) and \(x^2 = (\ldots, 0)\) as the initial centroids, one obtains the results \(A_1 = A[x^1] = \{a^1, a^2, a^3\}\), \(A_2 = A[x^2] = \emptyset\), \(x^1 = (\tfrac13, \tfrac13)\), \(x^2 = (1 + \ldots, 0)\), and \(f(x) = \tfrac49\).
The current understanding of the MSSC problem and the k-means clustering algorithm does not allow us to determine if the five centroid systems from Example 3.1 represent a global optimal solution for clustering While it is confirmed that the centroid systems in items (a) and (c) are global optimal solutions, it remains uncertain whether the centroid systems in items (b), (d), and (e) are local optimal solutions to the problem.
The theoretical results in Section 3.2 and the two forthcoming ones allow us to clarify the following issues related to the MSSC problem in Example3.1:
- The structure of the global solution set (see Example 3.2 below);
- The structure of the local solution set (see Example 3.3);
- The performance of the k-means algorithm (see Example 3.4).
The analysis will show that the centroid systems in (a) and (c) are global optimal solutions, the centroid systems in (b) and (d) are local-nonglobal optimal solutions, and the centroid system in (e) is not a local solution at all, even though the k-means algorithm converges to it and its objective function value coincides with that of the centroid system in (b).
Characterizations of the Local Solutions
To characterize the local solutions of (3.2), we will follow the approach of Ordin and Bagirov [71] and rely on a well-known necessary optimality condition in DC programming. For any vector \(x = (x^1, \ldots, x^k) \in \mathbb{R}^{nk}\), one has
\[ f(x) = \frac{1}{m}\sum_{i=1}^{m}\,\min_{j\in J}\|a^i - x^j\|^2. \quad (3.11) \]
Hence, the objective function \(f\) of (3.2) can be expressed [71, p. 345] as the difference of two convex functions,
\[ f(x) = f_1(x) - f_2(x), \quad (3.12) \]
where
\[ f_1(x) := \frac{1}{m}\sum_{i=1}^{m}\sum_{j\in J}\|a^i - x^j\|^2 \quad (3.13) \]
and
\[ f_2(x) := \frac{1}{m}\sum_{i=1}^{m}\,\max_{j\in J}\sum_{q\in J\setminus\{j\}}\|a^i - x^q\|^2. \quad (3.14) \]
Here \(f_1\) is a differentiable convex linear-quadratic function, while \(f_2\) is a nonsmooth convex function; both are defined on the whole space \(\mathbb{R}^{nk}\). The subdifferentials of \(f_1\) and \(f_2\) can be computed as follows. First,
\[ \partial f_1(x) = \{\nabla f_1(x)\} = \{2(x^1 - a^0, \ldots, x^k - a^0)\}, \]
where, as before, \(a^0 = b^A\) is the barycenter of the system \(\{a^1, \ldots, a^m\}\). Set
\[ \varphi_i(x) = \max_{j\in J} h_{i,j}(x) \quad (3.15) \]
with \(h_{i,j}(x) := \sum_{q\in J\setminus\{j\}}\|a^i - x^q\|^2\), and denote by
\[ J_i(x) := \{j \in J \mid h_{i,j}(x) = \varphi_i(x)\} \quad (3.16) \]
the index set on which the maximum in (3.15) is attained; note that \(f_2(x) = \frac{1}{m}\sum_{i\in I}\varphi_i(x)\). Then
\[ J_i(x) = \{j \in J \mid a^i \in A[x^j]\}. \quad (3.17) \]
Indeed, from the formula of \(h_{i,j}(x)\) it follows that
\[ h_{i,j}(x) = \sum_{q\in J}\|a^i - x^q\|^2 - \|a^i - x^j\|^2. \]
Therefore, by (3.15) we have
\[ \varphi_i(x) = \max_{j\in J}\Big[\sum_{q\in J}\|a^i - x^q\|^2 - \|a^i - x^j\|^2\Big] = \sum_{q\in J}\|a^i - x^q\|^2 - \min_{j\in J}\|a^i - x^j\|^2. \]
Thus, the maximum in (3.15) is attained exactly at those \(j\) for which the minimum \(\min_{j\in J}\|a^i - x^j\|^2\) is achieved. So, by (3.16),
\[ J_i(x) = \Big\{j \in J \;\Big|\; \|a^i - x^j\| = \min_{q\in J}\|a^i - x^q\|\Big\}, \]
which is exactly (3.17).
Using the subdifferential formula for the maximum function [20, Proposition 2.3.12] and recalling that, for convex functions, the Clarke generalized gradient coincides with the subdifferential of convex analysis, we obtain
\[ \partial\varphi_i(x) = \mathrm{co}\{\nabla h_{i,j}(x) \mid j \in J_i(x)\} = \mathrm{co}\big\{2(\tilde x^j - \tilde a^{i,j}) \mid j \in J_i(x)\big\}, \quad (3.18) \]
where
\[ \tilde x^j = (x^1, \ldots, x^{j-1}, 0_{\mathbb{R}^n}, x^{j+1}, \ldots, x^k) \quad (3.19) \]
and
\[ \tilde a^{i,j} = (a^i, \ldots, a^i, 0_{\mathbb{R}^n}, a^i, \ldots, a^i), \quad \text{with } 0_{\mathbb{R}^n} \text{ in the } j\text{-th position}. \quad (3.20) \]
By the Moreau–Rockafellar theorem [84, Theorem 23.8], one has
\[ \partial f_2(x) = \frac{1}{m}\sum_{i\in I}\partial\varphi_i(x). \quad (3.21) \]
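Before turning to the optimality analysis, the decomposition (3.12)–(3.14) and the gradient formula for \(f_1\) can be checked numerically. The following sketch is purely illustrative (random data, a finite-difference test); it is not part of the original development.

```python
import numpy as np

def f(X, A):
    d = ((A[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)   # d[i, j] = ||a_i - x_j||^2
    return d.min(axis=1).mean()

def f1(X, A):
    # (1/m) sum_i sum_j ||a_i - x_j||^2  -- the smooth convex part (3.13)
    d = ((A[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    return d.sum(axis=1).mean()

def f2(X, A):
    # (1/m) sum_i max_j sum_{q != j} ||a_i - x_q||^2  -- the nonsmooth convex part (3.14)
    d = ((A[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    return (d.sum(axis=1, keepdims=True) - d).max(axis=1).mean()

rng = np.random.default_rng(1)
A = rng.normal(size=(7, 3))           # m = 7 data points in R^3
X = rng.normal(size=(2, 3))           # k = 2 centroids
print(np.isclose(f(X, A), f1(X, A) - f2(X, A)))            # True:  f = f1 - f2

# finite-difference check of  grad f1(x) = 2 (x^1 - a^0, ..., x^k - a^0)
a0 = A.mean(axis=0)
eps, num_grad = 1e-6, np.zeros_like(X)
for j in range(X.shape[0]):
    for t in range(X.shape[1]):
        E = np.zeros_like(X); E[j, t] = eps
        num_grad[j, t] = (f1(X + E, A) - f1(X - E, A)) / (2 * eps)
print(np.allclose(num_grad, 2 * (X - a0), atol=1e-5))      # True
```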
Now, let \(x = (x^1, \ldots, x^k) \in \mathbb{R}^{nk}\) be a local solution of (3.2). By the necessary optimality condition in DC programming, which goes back to the optimality principles established by Dem'yanov et al. in quasidifferential calculus, one has
\[ \partial f_2(x) \subset \partial f_1(x). \quad (3.22) \]
Since \(\partial f_1(x)\) is a singleton, \(\partial f_2(x)\) must be a singleton too. This happens if and only if \(\partial\varphi_i(x)\) is a singleton for every \(i \in I\). By (3.18), if \(|J_i(x)| = 1\), then \(\partial\varphi_i(x)\) is a singleton. If \(|J_i(x)| > 1\), choose two elements \(j_1 < j_2\) of \(J_i(x)\). Since \(\partial\varphi_i(x)\) is a singleton, (3.18) forces \(\tilde x^{j_1} - \tilde a^{i,j_1} = \tilde x^{j_2} - \tilde a^{i,j_2}\). By (3.19) and (3.20), this happens if and only if \(x^{j_1} = x^{j_2} = a^i\). To proceed with the analysis, we need the following additional condition on the local solution \(x\):
(C1) The components of \(x\) are pairwise distinct, i.e., \(x^{j_1} \ne x^{j_2}\) whenever \(j_1 \ne j_2\).
Definition 3.2 A local solution x = (x 1 , , x k ) of (3.2) that satisfies (C1) is called a nontrivial local solution.
Remark 3.5 Proposition 3.3 shows that every global solution of (3.2) is a nontrivial local solution.
The following theorem gives a refined formulation of some fundamental facts stated in [71, p. 346]. In view of (3.17), its first claim says that, if \(x\) is a nontrivial local solution, then every data point \(a^i \in A\) has a unique nearest component \(x^j\) of \(x\); equivalently, \(a^i\) belongs to exactly one attraction set \(A[x^j]\).
Theorem 3.3 (Necessary conditions for nontrivial local optimality) Suppose that \(x = (x^1, \ldots, x^k)\) is a nontrivial local solution of (3.2). Then \(|J_i(x)| = 1\) for every \(i \in I\). Moreover, for every \(j \in J\) such that the attraction set \(A[x^j]\) of \(x^j\) is nonempty, one has
\[ x^j = \frac{1}{|I(j)|}\sum_{i\in I(j)} a^i, \quad (3.23) \]
where \(I(j) = \{i \in I \mid a^i \in A[x^j]\}\). For every \(j \in J\) with \(A[x^j] = \emptyset\), one has
\[ x^j \notin A[x], \quad (3.24) \]
where \(A[x]\) is the union of the balls \(\bar B(a^p, \|a^p - x^q\|)\) with \(p \in I\), \(q \in J\) satisfying \(p \in I(q)\).
Proof. Let \(x = (x^1, \ldots, x^k)\) be a nontrivial local solution of (3.2). First, \(|J_i(x)| = 1\) for every \(i \in I\): if \(|J_i(x)| > 1\) for some \(i\), then, as shown before Definition 3.2, there would exist indexes \(j_1 \ne j_2\) in \(J_i(x)\) with \(x^{j_1} = x^{j_2} = a^i\), contradicting condition (C1). Thus \(J_i(x)\) is a singleton, say \(J_i(x) = \{j(i)\}\), for each \(i \in I\).
For each \(i \in I\), observe by (3.15) that
\[ h_{i,j}(x) < h_{i,j(i)}(x) = \varphi_i(x) \quad \forall j \in J \setminus \{j(i)\}. \]
Hence, by the continuity of the functions \(h_{i,j}\), there exists an open neighborhood \(U_i\) of \(x\) such that
\[ h_{i,j}(y) < h_{i,j(i)}(y) \quad \forall j \in J \setminus \{j(i)\},\ \forall y \in U_i. \quad (3.25) \]
So \(\varphi_i(\cdot)\) is continuously differentiable on \(U_i\). Put \(U = \bigcap_{i\in I} U_i\). From (3.14) and (3.25) one can deduce that
\[ f_2(y) = \frac{1}{m}\sum_{i\in I} h_{i,j(i)}(y) \quad \forall y \in U. \]
Therefore, \(f_2\) is continuously differentiable on \(U\). Moreover, the formulas (3.18)–(3.20) yield
\[ \nabla f_2(y) = \frac{2}{m}\sum_{i\in I}\big(\tilde y^{\,j(i)} - \tilde a^{i,j(i)}\big) \quad \forall y \in U, \quad (3.26) \]
where \(\tilde y^{\,j(i)} = (y^1, \ldots, y^{j(i)-1}, 0_{\mathbb{R}^n}, y^{j(i)+1}, \ldots, y^k)\) and \(\tilde a^{i,j(i)} = (a^i, \ldots, a^i, 0_{\mathbb{R}^n}, a^i, \ldots, a^i)\), with \(0_{\mathbb{R}^n}\) in the \(j(i)\)-th position.
Substituting \(y = x\) into (3.26) and combining the result with (3.22), which now reads \(\nabla f_2(x) = \nabla f_1(x)\), we obtain
\[ \frac{2}{m}\sum_{i\in I}\big(\tilde x^{\,j(i)} - \tilde a^{i,j(i)}\big) = 2(x^1 - a^0, \ldots, x^k - a^0). \quad (3.27) \]
Now, fix an index \(j \in J\) with \(A[x^j] \ne \emptyset\) and compare the \(j\)-th components of both sides of (3.27).
If \(j(i) = j\), the \(j\)-th component of the vector \(\tilde x^{\,j(i)} - \tilde a^{i,j(i)} \in \mathbb{R}^{nk}\) is \(0_{\mathbb{R}^n}\); if \(j(i) \ne j\), that component equals \(x^j - a^i\). Therefore, (3.27) gives
\[ \frac{1}{m}\sum_{i\in I\setminus I(j)}(x^j - a^i) = x^j - a^0. \]
Since \(m a^0 = a^1 + \cdots + a^m\), this yields \(\sum_{i\in I(j)} a^i = |I(j)|\,x^j\). Thus, formula (3.23) is valid for any \(j \in J\) satisfying \(A[x^j] \ne \emptyset\).
Finally, let \(j_0 \in J\) be such that \(A[x^{j_0}] = \emptyset\), and suppose on the contrary that \(x^{j_0} \in \bar B(a^p, \|a^p - x^q\|)\) for some \(p \in I\) and \(q \in J\) with \(p \in I(q)\). If \(\|a^p - x^{j_0}\| = \|a^p - x^q\|\), then \(J_p(x)\) contains both \(q\) and \(j_0\), contradicting the first claim of the theorem. If \(\|a^p - x^{j_0}\| < \|a^p - x^q\|\), then \(p \notin I(q)\), again a contradiction. Hence (3.24) holds, and the proof is complete. □
It turns out that the necessary optimality conditions stated in the preceding theorem are also sufficient. Together with Theorem 3.3, the next result therefore provides a complete characterization of the nontrivial local solutions of (3.2).
Theorem 3.4 (Sufficient conditions for nontrivial local optimality) Suppose that a vector \(x = (x^1, \ldots, x^k) \in \mathbb{R}^{nk}\) satisfies condition (C1) and \(|J_i(x)| = 1\) for every \(i \in I\). If (3.23) is valid for any \(j \in J\) with \(A[x^j] \ne \emptyset\) and (3.24) is fulfilled for any \(j \in J\) with \(A[x^j] = \emptyset\), then \(x\) is a nontrivial local solution of (3.2).
Proof. Let \(x = (x^1, \ldots, x^k) \in \mathbb{R}^{nk}\) be such that (C1) holds, \(J_i(x) = \{j(i)\}\) for every \(i \in I\), (3.23) is valid for any \(j \in J\) with \(A[x^j] \ne \emptyset\), and (3.24) is satisfied for any \(j \in J\) with \(A[x^j] = \emptyset\). Then, for all \(i \in I\) and \(j' \in J \setminus \{j(i)\}\), one has \(\|a^i - x^{j(i)}\| < \|a^i - x^{j'}\|\).
Hence there exists \(\varepsilon > 0\) such that
\[ \|a^i - \tilde x^{j(i)}\| < \|a^i - \tilde x^{j'}\| \quad \forall i \in I,\ \forall j' \in J \setminus \{j(i)\} \quad (3.28) \]
whenever \(\tilde x = (\tilde x^1, \ldots, \tilde x^k) \in \mathbb{R}^{nk}\) satisfies \(\|\tilde x^q - x^q\| < \varepsilon\) for all \(q \in J\). Using the compactness of \(A[x]\) and (3.24), and taking a smaller \(\varepsilon\) if necessary, we may also assume that
\[ \tilde x^j \notin A[\tilde x] \quad \forall j \in J \text{ with } A[x^j] = \emptyset \quad (3.29) \]
whenever \(\|\tilde x^q - x^q\| < \varepsilon\) for all \(q \in J\); here \(A[\tilde x]\) denotes the union of the balls \(\bar B(a^p, \|a^p - \tilde x^q\|)\) with \(p \in I\), \(q \in J\), \(p \in I(q)\), where \(I(q) = \{i \in I \mid a^i \in A[x^q]\}\).
Fix an arbitrary vector \(\tilde x = (\tilde x^1, \ldots, \tilde x^k) \in \mathbb{R}^{nk}\) with \(\|\tilde x^q - x^q\| < \varepsilon\) for all \(q \in J\). Then, by (3.28) and (3.29), \(J_i(\tilde x) = \{j(i)\}\); so \(\min_{j\in J}\|a^i - \tilde x^j\|^2 = \|a^i - \tilde x^{j(i)}\|^2\). Therefore,
\[ f(\tilde x) = \frac{1}{m}\sum_{i\in I}\|a^i - \tilde x^{j(i)}\|^2 = \frac{1}{m}\sum_{\substack{j\in J\\ A[x^j]\ne\emptyset}}\ \sum_{i\in I(j)}\|a^i - \tilde x^{j}\|^2 \ \ge\ \frac{1}{m}\sum_{\substack{j\in J\\ A[x^j]\ne\emptyset}}\ \sum_{i\in I(j)}\|a^i - x^{j}\|^2 = f(x), \]
where the inequality is valid because (3.23) obviously yields
\[ \sum_{i\in I(j)}\|a^i - x^j\|^2 \le \sum_{i\in I(j)}\|a^i - \tilde x^j\|^2 \]
for every \(j \in J\) such that the attraction set \(A[x^j]\) of \(x^j\) is nonempty. (Note that \(x^j\) is the barycenter of \(A[x^j]\).)
The local optimality of \(x = (x^1, \ldots, x^k)\) has thus been proved. Hence, \(x\) is a nontrivial local solution of (3.2). □
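Theorems 3.3 and 3.4 together yield a directly checkable characterization. The following sketch (tolerances and names are illustrative assumptions) tests, for a centroid system with pairwise distinct components: (i) the uniqueness of the nearest centroid for every data point, (ii) the barycenter condition (3.23) for the attractive components, and (iii) condition (3.24) for the remaining components.

```python
import numpy as np

def is_nontrivial_local_solution(X, A, tol=1e-9):
    """Check the characterization of Theorems 3.3-3.4 for a centroid system X
    (k x n array with pairwise distinct rows) and a data set A (m x n array)."""
    m, k = A.shape[0], X.shape[0]
    d = np.linalg.norm(A[:, None, :] - X[None, :, :], axis=2)   # d[i, j] = ||a_i - x_j||
    dmin = d.min(axis=1)
    # (i) |J_i(x)| = 1: every data point has a unique nearest centroid
    if any(np.sum(np.isclose(d[i], dmin[i], atol=tol)) != 1 for i in range(m)):
        return False
    nearest = d.argmin(axis=1)
    for j in range(k):
        I_j = np.where(nearest == j)[0]
        if len(I_j) > 0:
            # (ii) attractive component: x_j is the barycenter of A[x_j]      (3.23)
            if not np.allclose(X[j], A[I_j].mean(axis=0), atol=1e-8):
                return False
        else:
            # (iii) non-attractive component: x_j lies outside every ball
            #       B(a_p, ||a_p - x_q||) with p in I(q)                      (3.24)
            if np.any(np.linalg.norm(A - X[j], axis=1) <= dmin + tol):
                return False
    return True

A = np.array([[0., 0.], [1., 0.], [0., 1.]])
print(is_nontrivial_local_solution(np.array([[0., .5], [1., 0.]]), A))  # True  (a global solution)
print(is_nontrivial_local_solution(np.array([[0., 0.], [.5, .5]]), A))  # True  (local, not global)
print(is_nontrivial_local_solution(np.array([[0., 0.], [1., 1.]]), A))  # False (a^2 has two nearest centroids)
```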
Example 3.2 (A local solution need not be a global solution) Consider the clustering problem described in Example 3.1. Here we have \(I = \{1,2,3\}\) and \(J = \{1,2\}\). According to Theorem 3.1, problem (3.2) has a global solution. If \(x = (x^1, x^2)\) is a global solution, then the attraction set \(A[x^j]\) is nonempty for every \(j \in J\). By Remark 3.5, \(x\) is a nontrivial local solution; hence, by Theorem 3.3, the attraction sets \(A[x^1]\) and \(A[x^2]\) are disjoint, each component \(x^j\) is the barycenter of its attraction set by formula (3.23), and \(A = A[x^1] \cup A[x^2]\). Since \(A[x^j] \subset A = \{a^1, a^2, a^3\}\) and the components of each global solution can be permuted (see Remark 3.4), we conclude that the global solution set of our problem is contained, up to permutation of components, in the set \(\{\bar x, \hat x, \tilde x\}\), where \(\bar x := \big((0,0), (\tfrac12, \tfrac12)\big)\), \(\hat x := \big((\tfrac12, 0), (0,1)\big)\), and \(\tilde x := \big((0, \tfrac12), (1,0)\big)\).
Since \(f(\bar x) = \tfrac13\) and \(f(\hat x) = f(\tilde x) = \tfrac16\), both \(\hat x\) and \(\tilde x\) are global solutions of our problem. By Theorem 3.4, \(\bar x\) is a local solution. However, \(\bar x\) does not belong to the global solution set; it is a local-nonglobal solution.
Stability Properties
This section focuses on demonstrating the local Lipschitz property of the optimal value function, the local upper Lipschitz property of the global solution map, and the local Lipschitz-like property of the local solution map associated with (3.2).
Henceforth, in connection with problem (3.2), the data set \(A = \{a^1, \ldots, a^m\}\) is subject to change; we represent it by the vector \(a = (a^1, \ldots, a^m) \in \mathbb{R}^{nm}\). The optimal value of (3.2) is denoted by \(v(a)\), i.e.,
\[ v(a) = \min\big\{ f(x) \;\big|\; x = (x^1, \ldots, x^k) \in \mathbb{R}^{nk} \big\}, \]
and the global solution set of (3.2) is denoted by \(F(a)\).
Let us abbreviate the local solution set of (3.2) to F1(a) Note that the inclusion F(a) ⊂ F 1 (a) is valid, and it may be strict.
Definition 3.3 A family {I(j) | j ∈ J} of pairwise distinct, nonempty sub- sets of I is said to be a partition of I if [ j∈J
From now on, let ¯a = (¯a 1 , ,¯a m ) ∈ R nm be a fixed vector with the property that a¯ 1 , ,a¯ m are pairwise distinct.
Theorem 3.5 (Local Lipschitz property of the optimal value function) The optimal value function \(v : \mathbb{R}^{nm} \to \mathbb{R}\) is locally Lipschitz at \(\bar a\), i.e., there exist \(L_0 > 0\) and \(\delta_0 > 0\) such that
\[ |v(a) - v(a')| \le L_0\|a - a'\| \]
for all \(a\) and \(a'\) satisfying \(\|a - \bar a\| < \delta_0\) and \(\|a' - \bar a\| < \delta_0\).
Proof. Denote by \(\Omega\) the set of all partitions of \(I\). Every element \(\omega\) of \(\Omega\) is a family \(\{I_\omega(j) \mid j \in J\}\) of pairwise distinct, nonempty subsets of \(I\) with \(\bigcup_{j\in J} I_\omega(j) = I\). To each pair \((\omega, a)\), where \(a = (a^1, \ldots, a^m) \in \mathbb{R}^{nm}\) and \(\omega \in \Omega\), we associate the vector \(x_\omega(a) = (x^1_\omega(a), \ldots, x^k_\omega(a)) \in \mathbb{R}^{nk}\) with
\[ x^j_\omega(a) = \frac{1}{|I_\omega(j)|}\sum_{i\in I_\omega(j)} a^i. \quad (3.32) \]
By Theorem 3.1, problem (3.2) with the data set \(\bar a\) has finitely many global solutions; so \(F(\bar a)\) is nonempty and finite. For each \(\bar x = (\bar x^1, \ldots, \bar x^k) \in F(\bar a)\) there exists \(\omega \in \Omega\) such that \(\bar x^j = x^j_\omega(\bar a)\) for all \(j \in J\). Let \(\Omega_1 = \{\omega_1, \ldots, \omega_r\}\) denote the set of those elements of \(\Omega\) that correspond in this way to global solutions. Moreover, one has
\[ f(x_{\omega_1}(\bar a), \bar a) < f(x_\omega(\bar a), \bar a) \quad \forall \omega \in \Omega \setminus \Omega_1, \]
where
\[ f(x, a) := \frac{1}{m}\sum_{i=1}^{m}\,\min_{j\in J}\|a^i - x^j\|^2. \quad (3.34) \]
For every pair (i, j) in the set I×J, the rule (x, a) 7→ ka i −x j k 2 establishes a polynomial function on R nk ×R nm This function is locally Lipschitz within its domain, which allows us to conclude that the function f(x, a) described in (3.34) is also locally Lipschitz on R nk ×R nm, as supported by [20, Prop 2.3.6 and 2.3.12].
Now, observe that for any ω ∈ Ω and j ∈ J, the vector function x j ω (.) in (3.32), which maps R nm to R n , is continuously differentiable In particular, it is locally Lipschitz on R nm
For every ω in the set Ω, we can conclude that the function g ω (a) = f(x ω (a), a) is locally Lipschitz on R nm By rewriting the inequality g ω 1 (¯a) < g ω (¯a) for all ω not in Ω 1, and utilizing the continuity of the functions g ω (.), we can identify a positive number δ 0 such that g ω 1 (a) < g ω (a) holds for all ω not in Ω 1, given that the distance between a and ¯a is less than δ 0 Since the points ¯a 1, , ¯a m are distinct, we can assume without loss of generality that a 1, , a m are also distinct for any vector a = (a 1, , a m) where the distance to ¯a is less than δ 0.
Consider a vector \( a = (a_1, \ldots, a_m) \) where \( ka - \bar{a} < \delta_0 \) According to equation (3.35), it follows that \( f(x_{\omega_1}(a), a) < f(x_{\omega}(a), a) \) for all \( \omega \in \Omega \setminus \Omega_1 \) Since \( f(., a) \) serves as the objective function for (3.2), this indicates that the set \( \{ x_{\omega}(a) | \omega \in \Omega \setminus \Omega_1 \} \) lacks any global solutions to the problem Theorem 3.1 confirms that the global solution set \( F(a) \) for (3.2) is included within this set.
Thus \(F(a) \subset \{x_\omega(a) \mid \omega \in \Omega_1\} = \{x_{\omega_1}(a), \ldots, x_{\omega_r}(a)\}\). Since \(F(a)\) is nonempty, \(v(a) = \min\{f(x, a) \mid x \in F(a)\}\); hence
\[ v(a) = \min_{\ell=1,\ldots,r} g_{\omega_\ell}(a) \quad \text{whenever } \|a - \bar a\| < \delta_0. \]
As the functions \(g_\omega\), \(\omega \in \Omega\), are locally Lipschitz on \(\mathbb{R}^{nm}\), standard properties of minimum functions imply that \(v\) is locally Lipschitz at \(\bar a\). □
Theorem 3.6 (Local upper Lipschitz property of the global solution map)The global solution map F : R nm ⇒ R nk is locally upper Lipschitz at ¯a, i.e., there exist L >0 and δ > 0 such that
\[ F(a) \subset F(\bar a) + L\|a - \bar a\|\,\bar B_{\mathbb{R}^{nk}} \quad (3.38) \]
for all \(a\) satisfying \(\|a - \bar a\| < \delta\). Here
\[ \bar B_{\mathbb{R}^{nk}} := \Big\{ x = (x^1, \ldots, x^k) \in \mathbb{R}^{nk} \;\Big|\; \sum_{j\in J}\|x^j\| \le 1 \Big\} \]
denotes the closed unit ball of the product space \(\mathbb{R}^{nk}\), which is equipped with the sum norm \(\|x\| = \sum_{j\in J}\|x^j\|\).
Proof. Let \(\Omega\), \(\Omega_1 = \{\omega_1, \ldots, \omega_r\}\), the vector functions \(x_\omega(a) = (x^1_\omega(a), \ldots, x^k_\omega(a))\), and the number \(\delta_0\) be constructed as in the proof of the preceding theorem. For each \(\omega \in \Omega\), the function \(x_\omega(\cdot)\) is continuously differentiable; hence there exist \(L_\omega > 0\) and \(\delta_\omega > 0\) such that
\[ \|x_\omega(a) - x_\omega(\tilde a)\| \le L_\omega\|a - \tilde a\| \quad (3.39) \]
for all \(a\) and \(\tilde a\) satisfying \(\|a - \bar a\| < \delta_\omega\) and \(\|\tilde a - \bar a\| < \delta_\omega\). Put \(L := \max\{L_{\omega_\ell} \mid \ell = 1, \ldots, r\}\) and \(\delta := \min\{\delta_0, \delta_{\omega_1}, \ldots, \delta_{\omega_r}\}\).
Then, for every \(a\) satisfying \(\|a - \bar a\| < \delta\), by (3.36) and (3.39) one has
\[ F(a) \subset \{x_{\omega_\ell}(a) \mid \ell = 1, \ldots, r\} \subset \{x_{\omega_\ell}(\bar a) \mid \ell = 1, \ldots, r\} + L\|a - \bar a\|\,\bar B_{\mathbb{R}^{nk}} = F(\bar a) + L\|a - \bar a\|\,\bar B_{\mathbb{R}^{nk}}. \]
Hence, inclusion (3.38) is valid for every \(a\) satisfying \(\|a - \bar a\| < \delta\). □
Theorem 3.7 (Aubin property of the local solution map) Let \(\bar x = (\bar x^1, \ldots, \bar x^k)\) be an element of \(F_1(\bar a)\) satisfying condition (C1), that is, \(\bar x^{j_1} \ne \bar x^{j_2}\) whenever \(j_1 \ne j_2\). Then the local solution map \(F_1 : \mathbb{R}^{nm} \rightrightarrows \mathbb{R}^{nk}\) has the Aubin property at \((\bar a, \bar x)\), i.e., there exist \(L_1 > 0\), \(\varepsilon > 0\), and \(\delta_1 > 0\) such that
\[ F_1(a) \cap B(\bar x, \varepsilon) \subset F_1(\tilde a) + L_1\|a - \tilde a\|\,\bar B_{\mathbb{R}^{nk}} \quad (3.40) \]
for all \(a\) and \(\tilde a\) satisfying \(\|a - \bar a\| < \delta_1\) and \(\|\tilde a - \bar a\| < \delta_1\).
Proof Suppose that ¯x = (¯x 1 , ,x¯ k ) ∈ F 1 (¯a) and ¯x j 1 6= ¯x j 2 for all j 1 , j 2 ∈ J with j2 6= j1 Denote by J1 the set of the indexes j ∈ J such that ¯x j is attractive w.r.t the data set {¯a 1 , ,¯a m } Put J2 = J\J1 For every j ∈ J1, by Theorem 3.3 one has k¯a i −x¯ j k< k¯a i −x¯ q k (∀i ∈ I(j), ∀q ∈ J \ {j}) (3.41)
In addition, the following holds: ¯ x j = 1
X i∈I(j) ¯ a i , (3.42) where I(j) = {i ∈ I | ¯a i ∈ A[¯x j ]} For every j ∈ J 2 , by Theorem 3.3 one has k¯x q −a¯ p k < k¯x j −a¯ p k (∀q ∈ J 1 , ∀p ∈ I(q)) (3.43) Let ε0 > 0 be such that k¯x j 1 −x¯ j 2 k > ε0 for all j1, j2 ∈ J with j2 6= j1.
By (3.41) and (3.43), there exist \(\delta_0 > 0\) and \(\varepsilon \in (0, \varepsilon_0)\) such that the inequalities
\[ \|a^i - x^j\| < \|a^i - x^q\| \quad (\forall j \in J_1,\ \forall i \in I(j),\ \forall q \in J\setminus\{j\}) \quad (3.44) \]
and
\[ \|x^q - a^p\| < \|x^j - a^p\| \quad (\forall j \in J_2,\ \forall q \in J_1,\ \forall p \in I(q)) \quad (3.45) \]
hold whenever \(a = (a^1, \ldots, a^m)\) and \(x = (x^1, \ldots, x^k)\) satisfy \(\|a - \bar a\| < \delta_0\) and \(\|x - \bar x\| < 2k\varepsilon\). Moreover, since \(\bar x^{j_1} \ne \bar x^{j_2}\) for all \(j_1 \ne j_2\) in \(J\), by taking a smaller \(\varepsilon\) if necessary we may also ensure that \(x^{j_1} \ne x^{j_2}\) for all \(j_1 \ne j_2\) whenever \(\|x - \bar x\| < 2k\varepsilon\).
For every j ∈ J1 and a = (a 1 , , a m ) ∈ R nm , define x j (a) = 1
By comparing equations (3.46) and (3.42), we find that \( x_j(\bar{a}) = \bar{x}_j \) for all \( j \in J_1 \) Utilizing the continuity of the vector functions \( x_j(.) \) for \( j \in J_1 \), we can assume that the difference \( \| x_j(e_a) - \bar{x}_j \| < \epsilon \) holds for all \( j \in J_1 \), where \( e_a = (e_{a1}, , e_{am}) \in \mathbb{R}^{nm} \) and satisfies \( \| e_a - a \bar{k} \| < \delta_0 \) If needed, a smaller \( \delta_0 > 0 \) can be chosen.
The continuously differentiable vector functions \( x_j(.) \) for \( j \in J_1 \) ensure the existence of a constant \( L_1 > 0 \), satisfying the inequality \( \| x_j(a) - x_j(ea) \| \leq L_1 \| a - ea \| \) for any points \( a \) and \( ea \) where \( \| a - \bar{a} \| < \delta_0 \) and \( \| ea - a \bar{a} \| < \delta_0 \) By selecting a smaller \( \delta_1 \in (0, \delta_0) \) such that \( 2L_1 \delta_1 < \epsilon \), we can maintain the required precision in our calculations.
With the chosen constants L 1 > 0, ε > 0, and δ 1 > 0, let us show that the inclusion (3.40) is fulfilled for all a and ea satisfying ka − ¯ak < δ 1 and kea−ak¯ < δ 1
Let \( \eta \) and \( \epsilon \) be defined such that \( k - \bar{a} < \delta_1 \) and \( k_e - \bar{a} < \delta_1 \) Consider an arbitrary element \( x = (x_1, \ldots, x_k) \) from the intersection of the set \( F_1(a) \) and the ball \( B(\bar{x}, \epsilon) \) For all \( j \in J_1 \), define \( x_{e_j} = x_j(e_a) \), where \( x_j(a) \) is specified by equation (3.46) For indices \( j \in J_2 \), maintain \( x_{e_j} = x_j \).
Claim 1 The vector xe = (xe 1 , ,xe k ) belongs to F1(ea).
The inequalities ka − ¯ak < δ 1 and kx − xk¯ < ε indicate that properties (3.44) and (3.45) are satisfied Consequently, for every j in J 1, the attraction set A[x j] is defined as {a i | i ∈ I(j)} Given that I(j) is not empty for each j in J 1 and x belongs to F 1 (a), Theorem 3.3 confirms that x j equals 1.
Comparing (3.48) with (3.46) yields x j = x j (a) for all j ∈ J 1 By (3.45) we see that, for every j ∈ J 2 , the attraction set A[x j ] is empty Moreover, one has x j ∈ A[x]/ (∀j ∈ J 2 ) (3.49) where A[x] is the union of the balls ¯B(a p ,ka p − x q k) with p ∈ I, q ∈ J satisfying p ∈ I(q).
For each j ∈ J 1 , using (3.47) we have kx j (ea)−x¯ j k ≤ kx j (ea)−x j (a)k+ kx j (a)−x¯ j k
Besides, for each j ∈ J 2 , we have kx j (ea)−x¯ j k = kx j −x¯ j k < ε Therefore, kxe−xk¯ = X j∈J 1 kx j (ea)−x¯ j k+X j∈J 2 kx j −x¯ j k< 2kε.
The inequality kea−ak¯ < δ 1 ensures that the properties (3.44) and (3.45) are maintained, with ea and x(ea) representing the variables a and x, respectively This leads to the conclusion that for all j in J 1 and i in I(j), and for all q in J excluding j, the condition kea i −xe j k < kea i −xe q k holds (3.50) Additionally, for all j in J 2, q in J 1, and p in I(q), the inequality kxe q −ea p k < kxe j −ea p k is satisfied (3.51).
So, similar to the above case of x, for every j ∈ J 1 , the attraction set A[xe j ] is {ea i | i ∈ I(j)} Recall that I(j) 6= ∅ for each j ∈ J 1 and xe j was given by xe j = x j (ea) = 1
Conclusions
The minimum sum-of-squares clustering problem consistently yields a global solution, and under certain mild conditions, this solution set is finite Each global solution's components can be calculated using an explicit formula By introducing the concept of a nontrivial local solution, we have established the necessary and sufficient conditions for a system of centroids to qualify as a nontrivial local solution.
We have demonstrated the local Lipschitz property of the optimal value function, the local upper Lipschitz property of the global solution map, and the local Lipschitz-like property of the local solution map in the MSSC problem These comprehensive characterizations of nontrivial local solutions enhance our understanding of the k-means algorithm's performance.
Some Incremental Algorithms for the Clustering Problem
Solution methods for the minimum sum-of-squares clustering (MSSC) prob- lem will be analyzed and developed in this chapter.
This article introduces enhancements to the incremental algorithms developed by Ordin and Bagirov, leveraging the Difference-of-Convex functions Algorithms (DCAs) and the qualitative properties of the MSSC problem outlined in Chapter 3 We detail the new algorithms' characteristics, such as finite convergence, overall convergence, and convergence rates Additionally, we present the outcomes of numerical tests conducted on various real-world databases to demonstrate the effectiveness of these improved algorithms.
The present chapter is written on the basis of paper No 3 and paper No 4 in the List of author’s related papers (see p 112).
There are many algorithms to solve the MSSC problem (see, e.g., [6,7,9,12,
The problem is classified as NP-hard when the number of data features or clusters is included as input, which explains why existing algorithms can typically only provide local solutions.
The k-means clustering algorithm (see Section 3.3 and see also, e.g., [1],
[39], [43], and [66]) is the best known solution method for the MSSC problem.
To improve its effectiveness, the global k-means, modified global k-means, and fast global k-means clustering algorithms have been proposed in [6,12,33,49,
The quality of computational results is significantly influenced by the choice of starting points, making it essential to identify optimal initial values The Difference-of-Convex-functions Algorithms (DCA), previously utilized for the MSSC problem, can effectively assist in this search for suitable starting points.
Incremental clustering algorithms are defined by their ability to progressively increase the number of clusters Research indicates that these algorithms are particularly effective for managing large datasets, as evidenced by various numerical results.
Ordin and Bagirov recently introduced an incremental clustering algorithm that utilizes control parameters to identify effective starting points for the k-means algorithm This builds on earlier work by Bagirov, who proposed a different incremental clustering method based on DC programming and DCA.
We will propose several improvements of the just mentioned incremental al- gorithms to solve the MSSC problem in (3.2).
Incremental clustering algorithms, as discussed in sources [7, 44, 71], begin by calculating the centroid of the entire dataset They then proceed to optimally add a new centroid at each stage, continuing this process until they identify k centroids for the specified problem (3.2).
This article focuses on the analysis and enhancement of the incremental heuristic clustering algorithm by Ordin and Bagirov, as well as Bagirov's incremental DC clustering algorithm Through the construction of specific MSSC problems with small datasets, we will demonstrate the functionality of these algorithms Notably, the second algorithm may not terminate due to its precise stopping criterion To address this, we will introduce one modified version of the incremental heuristic clustering algorithm and three modified versions of the incremental DC clustering algorithm.
This section is devoted to the incremental heuristic algorithm of Ordin andBagirov [71, pp 349–353] and some properties of the algorithm.
Let \(\ell\) be an index with \(1 \le \ell \le k - 1\) and let \(\bar x = (\bar x^1, \ldots, \bar x^\ell)\) be an approximate solution of (3.2) with \(k\) replaced by \(\ell\), i.e., \(\bar x\) solves approximately the problem
\[ \min\Big\{ f_\ell(x) := \frac{1}{m}\sum_{i=1}^{m}\,\min_{j=1,\ldots,\ell}\|a^i - x^j\|^2 \;\Big|\; x = (x^1, \ldots, x^\ell) \in \mathbb{R}^{n\ell} \Big\}. \quad (4.1) \]
Applying the natural clustering procedure (3.3) to the centroid system \(\bar x^1, \ldots, \bar x^\ell\), one divides \(A\) into \(\ell\) clusters with the centers \(\bar x^1, \ldots, \bar x^\ell\). For every \(i \in I\), put
\[ d_\ell(a^i) = \min\big\{\|\bar x^1 - a^i\|^2, \ldots, \|\bar x^\ell - a^i\|^2\big\}. \quad (4.2) \]
The formula \(g(y) = f_{\ell+1}(\bar x^1, \ldots, \bar x^\ell, y)\), where, in accordance with (4.1),
\[ f_{\ell+1}(x) = \frac{1}{m}\sum_{i=1}^{m}\,\min_{j=1,\ldots,\ell+1}\|a^i - x^j\|^2 \quad \forall x = (x^1, \ldots, x^\ell, x^{\ell+1}) \in \mathbb{R}^{n(\ell+1)}, \]
defines our auxiliary cluster function \(g : \mathbb{R}^n \to \mathbb{R}\). From (4.2) it follows that
\[ g(y) = \frac{1}{m}\sum_{i=1}^{m}\,\min\big\{ d_\ell(a^i),\ \|y - a^i\|^2 \big\}. \quad (4.3) \]
The problem
\[ \min\big\{ g(y) \;\big|\; y \in \mathbb{R}^n \big\} \quad (4.4) \]
is called the auxiliary clustering problem. For each \(i \in I\), one has
\[ \min\big\{d_\ell(a^i), \|y - a^i\|^2\big\} = d_\ell(a^i) + \|y - a^i\|^2 - \max\big\{d_\ell(a^i), \|y - a^i\|^2\big\}. \]
So, the objective function of (4.4) can be represented as \(g(y) = g_1(y) - g_2(y)\), where
\[ g_1(y) = \frac{1}{m}\sum_{i=1}^{m}\|y - a^i\|^2 \quad (4.5) \]
is a smooth convex function and
\[ g_2(y) = \frac{1}{m}\sum_{i=1}^{m}\max\big\{d_\ell(a^i), \|y - a^i\|^2\big\} \quad (4.6) \]
is a nonsmooth convex function. Consider the open set
\[ Y_1 := \big\{ y \in \mathbb{R}^n \;\big|\; \exists i \in I \text{ with } \|y - a^i\|^2 < d_\ell(a^i) \big\}, \quad (4.7) \]
which is a finite union of open balls centered at certain points \(a^i\) (\(i \in I\)), and put \(Y_2 := \mathbb{R}^n \setminus Y_1\). All the points \(\bar x^1, \ldots, \bar x^\ell\) belong to \(Y_2\). Since \(\ell < k \le m\) and the data points \(a^1, \ldots, a^m\) are pairwise distinct, at least one index \(i \in I\) satisfies \(d_\ell(a^i) > 0\), i.e., not all data points coincide with points from the set \(\{\bar x^1, \ldots, \bar x^\ell\}\); hence \(Y_1\) is nonempty. From (4.5) and (4.6) it follows that \(g(y) < \frac{1}{m}\sum_{i=1}^{m} d_\ell(a^i) = f_\ell(\bar x)\) for every \(y \in Y_1\).
Therefore, any iteration process for solving (4.4) should start with a point y 0 ∈ Y 1
To find an approximate solution of (3.2) with \(k\) replaced by \(\ell + 1\), i.e., of the problem
\[ \min\Big\{ f_{\ell+1}(x) := \frac{1}{m}\sum_{i=1}^{m}\,\min_{j=1,\ldots,\ell+1}\|a^i - x^j\|^2 \;\Big|\; x \in \mathbb{R}^{n(\ell+1)} \Big\}, \quad (4.8) \]
we can use the following procedure [71, pp. 349–351]. Fixing any \(y \in Y_1\), one divides the data set \(A\) into the two disjoint subsets
\[ A_1(y) := \big\{ a^i \in A \;\big|\; \|y - a^i\|^2 < d_\ell(a^i) \big\} \quad \text{and} \quad A_2(y) := A \setminus A_1(y). \quad (4.9) \]
Clearly, \(A_1(y)\) consists of all the data points standing closer to \(y\) than to their cluster centers. Since \(y \in Y_1\), the set \(A_1(y)\) is nonempty. Note that
\[ g(y) = \frac{1}{m}\Big[\sum_{a^i\in A_1(y)}\|y - a^i\|^2 + \sum_{a^i\in A_2(y)} d_\ell(a^i)\Big]. \quad (4.10) \]
Putting \(z_{\ell+1}(y) = f_\ell(\bar x) - g(y)\), we see that \(z_{\ell+1}(y)\) is the decrease of the minimum sum-of-squares clustering criterion obtained when the current centroid system \(\bar x^1, \ldots, \bar x^\ell\) is enlarged by the additional center \(y\); the condition \(z_{\ell+1}(y) > 0\) means that this enlargement strictly improves the criterion. From the formula
\[ f_\ell(\bar x) = \frac{1}{m}\sum_{a^i\in A} d_\ell(a^i) \]
and (4.10), one has the representation
\[ z_{\ell+1}(y) = \frac{1}{m}\sum_{a^i\in A_1(y)}\big( d_\ell(a^i) - \|y - a^i\|^2 \big), \]
which can be rewritten as
\[ z_{\ell+1}(y) = \frac{1}{m}\sum_{a^i\in A}\max\big\{0,\ d_\ell(a^i) - \|y - a^i\|^2\big\}. \quad (4.11) \]
The subsequent operations deal with the data points belonging to \(Y_1\). It is easy to check that \(a \in A \cap Y_1\) if and only if \(a \in A\) and \(a \notin \{\bar x^1, \ldots, \bar x^\ell\}\). For each point \(y = a\) with \(a \in A \cap Y_1\), the value \(z_{\ell+1}(a)\) is computed by (4.11). Then one finds
\[ z^1_{\max} := \max\big\{ z_{\ell+1}(a) \;\big|\; a \in A \cap Y_1 \big\}. \quad (4.12) \]
The choice of effective starting points for solving (4.8) is influenced by two parameters, γ 1 and γ 2, both ranging from 0 to 1 The significance of these parameters will be discussed later The selection process for these parameters is based on the computational experience gained from applying the algorithm, leading the authors of [71] to describe their approach as heuristic.
Using \(\gamma_1\), one forms the set
\[ \bar A_1 := \big\{ a \in A \cap Y_1 \;\big|\; z_{\ell+1}(a) \ge \gamma_1\, z^1_{\max} \big\}. \quad (4.13) \]
For \(\gamma_1 = 0\), the set \(\bar A_1\) consists of all the data points belonging to \(Y_1\); for \(\gamma_1 = 1\), \(\bar A_1\) contains only the points yielding the largest decrease \(z^1_{\max}\). Thus, \(\gamma_1\) expresses the tolerance in selecting suitable points from \(A \cap Y_1\). For each \(a \in \bar A_1\), one finds the set \(A_1(a)\) and computes its barycenter \(c(a)\). Then \(a\) is replaced by \(c(a)\), since the latter represents the set \(A_1(a)\) better. Note that \(c(a) \in Y_1\), because \(g(c(a)) \le g(a) < f_\ell(\bar x)\). The set of all these barycenters is denoted by
\[ \bar A_2 := \big\{ c(a) \;\big|\; a \in \bar A_1 \big\}. \quad (4.14) \]
For each \(c \in \bar A_2\), one computes the value \(z_{\ell+1}(c)\) by using (4.11). Then we find
\[ z^2_{\max} := \max\big\{ z_{\ell+1}(c) \;\big|\; c \in \bar A_2 \big\}. \quad (4.15) \]
Clearly, z max 2 is the largest decrease among the values f `+1 (¯x 1 , ,x¯ ` , c), where c ∈ A¯ 2 , in comparison with the value f ` (¯x).
Using \(\gamma_2\), one forms the set
\[ \bar A_3 := \big\{ c \in \bar A_2 \;\big|\; z_{\ell+1}(c) \ge \gamma_2\, z^2_{\max} \big\}. \quad (4.16) \]
For \(\gamma_2 = 0\), one has \(\bar A_3 = \bar A_2\); for \(\gamma_2 = 1\), \(\bar A_3\) contains only the barycenters \(c \in \bar A_2\) yielding the largest decrease of the objective function \(g(y) = f_{\ell+1}(\bar x^1, \ldots, \bar x^\ell, y)\) of (4.4). The choice \(\gamma_1 = 0\), \(\gamma_2 = 1\) corresponds to the selection of a 'good' starting point in the modified global k-means algorithm proposed by Bagirov (see [71, p. 315]). Thus, \(\gamma_2\) expresses the tolerance in selecting suitable points from \(\bar A_2\).
The set
\[ \Omega := \big\{ (\bar x^1, \ldots, \bar x^\ell, c) \;\big|\; c \in \bar A_3 \big\} \quad (4.17) \]
contains the 'good' starting points for solving (4.8).
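The starting-point selection described by (4.2) and (4.11)–(4.17) can be rendered as the following sketch. It is a simplified, illustrative implementation: tie handling, tolerances, and helper names are assumptions, and the strict inequalities are implemented with exact comparisons.

```python
import numpy as np

def starting_points(A, Xbar, gamma1, gamma2):
    """Candidate starting centroid systems for problem (4.8), following (4.12)-(4.17):
    A is the m x n data set, Xbar the current l x n centroid system."""
    # d_l(a_i) = min_j ||xbar_j - a_i||^2                                        (4.2)
    d_l = ((A[:, None, :] - Xbar[None, :, :]) ** 2).sum(axis=2).min(axis=1)
    m = A.shape[0]

    def z(y):
        # decrease z_{l+1}(y) = (1/m) * sum_i max(0, d_l(a_i) - ||y - a_i||^2)   (4.11)
        return np.maximum(0.0, d_l - ((A - y) ** 2).sum(axis=1)).mean()

    # candidate data points A ∩ Y_1: points a with d_l(a) > 0 (not current centroids)
    Y1 = [i for i in range(m) if d_l[i] > 0]
    z1 = {i: z(A[i]) for i in Y1}
    z1max = max(z1.values())
    A1 = [i for i in Y1 if z1[i] >= gamma1 * z1max]                              # (4.13)

    # replace each a in A1 by the barycenter of A_1(a) = {a_i : ||a - a_i||^2 < d_l(a_i)}
    A2 = []
    for i in A1:
        mask = ((A - A[i]) ** 2).sum(axis=1) < d_l
        A2.append(A[mask].mean(axis=0))                                          # (4.14)
    z2 = np.array([z(c) for c in A2])
    A3 = [A2[t] for t in range(len(A2)) if z2[t] >= gamma2 * z2.max()]           # (4.16)

    # Omega: each candidate is the current system augmented by one new centroid  (4.17)
    return [np.vstack([Xbar, c]) for c in A3]
```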
4.2.2 Version 1 of Ordin-Bagirov’s algorithm