In this chapter, we present the foundational concepts of DCAs (Difference-of-Convex functions Algorithms), originally developed by Pham Dinh Tao and Le Thi Hoai An, and we define two types of linear convergence rates for vector sequences. It is well known that DCAs have a key role in nonconvex programming and many areas of applications [55]. For more details, we refer to [77, 79] and the references therein.

Basic Definitions and Some Properties
By N we denote the set of natural numbers, i.e., N = {0, 1, 2, ...}. Consider the n-dimensional Euclidean vector space X = R^n equipped with the canonical inner product
⟨x, u⟩ := ∑_{i=1}^{n} x_i u_i
for vectors x = (x_1, ..., x_n) and u = (u_1, ..., u_n). Although vectors in R^n are written as rows of real numbers in the text, they are understood as columns of real numbers in matrix computations. The transpose of a matrix A ∈ R^{m×n} is denoted by Aᵀ; thus ⟨x, u⟩ = xᵀu.
The norm in X is given by ‖x‖ = ⟨x, x⟩^{1/2}. Then the dual space Y of X can be identified with X.
A function θ : X → R ∪ {−∞, +∞} is said to be proper if it does not take the value −∞ and is not identically equal to +∞, i.e., there exists at least one x ∈ X such that θ(x) ∈ R.
The effective domain of θ is defined by domθ := {x ∈ X : θ(x) < +∞}.
Let Γ₀(X) be the set of all lower semicontinuous, proper, convex functions on X. The Fenchel conjugate function g* of a function g ∈ Γ₀(X) is defined by
g*(y) = sup{ ⟨x, y⟩ − g(x) | x ∈ X } for all y ∈ Y.
Note that g* : Y → R ∪ {+∞} is also a lower semicontinuous, proper, convex function [38, Proposition 3, p. 174]. From the definition it follows that
g(x) + g*(y) ≥ ⟨x, y⟩ for all x ∈ X and y ∈ Y.
Denote by g** the conjugate function of g*, i.e., g**(x) = sup{ ⟨x, y⟩ − g*(y) | y ∈ Y }. Since g ∈ Γ₀(X), one has g**(x) = g(x) for all x ∈ X by the Fenchel–Moreau theorem [38, Theorem 1, p. 175]. This fact is the basis for various duality theorems in convex programming and DC programming.
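To make the conjugate operation concrete, the following small numerical sketch (not part of the original text) approximates g* and g** on a grid for g(x) = (x − 1)² and checks that g** recovers g, in line with the Fenchel–Moreau theorem; the closed form g*(y) = (1/4)y² + y also appears in Example 1.3 below. Grid ranges and tolerances are illustrative assumptions.

```python
import numpy as np

# Grid approximation of the Fenchel conjugate and biconjugate of g(x) = (x - 1)^2.
# Closed form (cf. Example 1.3): g*(y) = y^2/4 + y.
xs = np.linspace(-10.0, 10.0, 4001)            # discretized primal space "X"
ys = np.linspace(-5.0, 5.0, 11)                # a few dual points in "Y"

def g(x):
    return (x - 1.0) ** 2

# g*(y) = sup_x { <x, y> - g(x) }, approximated by a maximum over the grid
g_star = np.array([np.max(y * xs - g(xs)) for y in ys])
print(np.allclose(g_star, ys ** 2 / 4 + ys, atol=1e-3))       # True

# g**(x) = sup_y { <x, y> - g*(y) }; since g is in Gamma_0(X), g** = g
ys_fine = np.linspace(-40.0, 40.0, 8001)
g_star_fine = ys_fine ** 2 / 4 + ys_fine
for x in (0.0, 0.5, 2.0):
    g_bistar = np.max(x * ys_fine - g_star_fine)
    print(abs(g_bistar - g(x)) < 1e-3)                        # True
```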
Definition 1.1 The subdifferential of a convex function ϕ : R^n → R ∪ {+∞} at u ∈ dom ϕ is the set
∂ϕ(u) := { v ∈ R^n | ⟨v, x − u⟩ ≤ ϕ(x) − ϕ(u) for all x ∈ R^n }. (1.1)
If x ∉ dom ϕ, then one puts ∂ϕ(x) = ∅.
Clearly, the subdifferential ∂ϕ(u) in (1.1) is a closed, convex set. The Fermat Rule for convex optimization problems asserts that x̄ ∈ R^n is a solution of the minimization problem min{ ϕ(x) | x ∈ R^n } if and only if 0 ∈ ∂ϕ(x̄).
We now recall some useful properties of the Fenchel conjugate functions. The proofs of the next two propositions can be found in [77].
Proposition 1.1 The inclusion x ∈ ∂g*(y) is equivalent to the equality g(x) + g*(y) = ⟨x, y⟩.
Proposition 1.2 The inclusions y ∈ ∂g(x) and x ∈ ∂g*(y) are equivalent.
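As a quick sanity check of Propositions 1.1 and 1.2, the following sketch (illustrative only, not from the original text) uses the smooth function g(x) = (x − 1)², for which ∂g(x) = {2(x − 1)}, g*(y) = (1/4)y² + y, and ∂g*(y) = {y/2 + 1} are available in closed form.

```python
import numpy as np

def g(x):       return (x - 1.0) ** 2          # g(x) = (x - 1)^2
def g_star(y):  return y ** 2 / 4 + y          # its Fenchel conjugate

rng = np.random.default_rng(0)
for x in rng.uniform(-3.0, 3.0, size=5):
    y = 2.0 * (x - 1.0)                        # the unique element of dg(x)
    # Proposition 1.2: y in dg(x)  <=>  x in dg*(y), and dg*(y) = {y/2 + 1}
    print(np.isclose(x, y / 2 + 1))            # True
    # Proposition 1.1: x in dg*(y) <=>  g(x) + g*(y) = <x, y>
    print(np.isclose(g(x) + g_star(y), x * y)) # True
```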
In the sequel, we use the convention (+∞)−(+∞)=+∞.
Definition 1.2 The optimization problem
inf{ f(x) := g(x) − h(x) | x ∈ X }, (P)
where g and h are functions belonging to Γ₀(X), is called a DC program. The functions g and h are called DC components of f.
Definition 1.3 For any g, h ∈ Γ₀(X), the DC program
inf{ h*(y) − g*(y) | y ∈ Y }, (D)
is called the dual problem of (P).
Proposition 1.3 (Toland’s Duality Theorem; see [79]) The DC programs (P) and (D) have the same optimal value.
Definition 1.4 One says that ¯x ∈ R n is a local solution of (P) if the value f(¯x) = g(¯x) − h(¯x) is finite (i.e., ¯x ∈ domg ∩ domh) and there exists a neighborhood U of ¯x such that g(¯x)−h(¯x) ≤ g(x)−h(x) ∀x ∈ U.
If we can choose U = R n , then ¯x is called a (global) solution of (P).
The set of the solutions (resp., the local solutions) of (P) is denoted by sol(P) (resp., by loc(P)).
Proposition 1.4 (First-order optimality condition; see [77]) If x¯ is a local solution of (P), then ∂h(¯x) ⊂ ∂g(¯x).
Definition 1.5 A point x̄ ∈ R^n satisfying ∂h(x̄) ⊂ ∂g(x̄) is called a stationary point of (P).
The forthcoming example, which is similar to Example 1.1 in [93], shows that a stationary point need not be a local solution.
Example 1.1 Consider the DC program (P) with f(x) = g(x) − h(x), where g(x) = |x − 1| and h(x) = (x − 1)² for all x ∈ R. For x̄ := 1/2, we have ∂g(x̄) = ∂h(x̄) = {−1}. Since ∂h(x̄) ⊂ ∂g(x̄), x̄ is a stationary point of (P). But x̄ is not a local solution of (P), because f(x) = x − x² for all x ≤ 1.
Definition 1.6 A vector x̄ ∈ R^n is said to be a critical point of (P) if ∂g(x̄) ∩ ∂h(x̄) ≠ ∅.
If ∂h(x̄) ≠ ∅ and x̄ is a stationary point of (P), then x̄ is a critical point of (P). The reverse implication does not hold in general. The following example is similar to Example 1.2 in [93].
Example 1.2 Consider the DC program (P) with f(x) = g(x) − h(x), where g(x) = (x − 1/2)² and h(x) = |x − 1| for all x ∈ R. For x̄ := 1, we have ∂g(x̄) = {1} and ∂h(x̄) = [−1, 1]. Hence ∂g(x̄) ∩ ∂h(x̄) ≠ ∅, so x̄ is a critical point of (P). But x̄ is not a stationary point of (P), because ∂h(x̄) is not a subset of ∂g(x̄).
If the set ∂h(x̄) is a singleton, then h is Gâteaux differentiable at x̄ and ∂h(x̄) = {∇_G h(x̄)}, where ∇_G h(x̄) denotes the Gâteaux derivative of h at x̄; conversely, if h is Gâteaux differentiable at x̄, then ∂h(x̄) is a singleton and ∂h(x̄) = {∇_G h(x̄)}. In this case, the condition ∂g(x̄) ∩ ∂h(x̄) ≠ ∅ is equivalent to the inclusion ∂h(x̄) ⊂ ∂g(x̄). So, if h is Gâteaux differentiable at x̄, then x̄ is a critical point of (P) if and only if it is a stationary point.
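The distinction between critical and stationary points in Examples 1.1 and 1.2 can also be checked mechanically. The sketch below (not from the original text) encodes the one-dimensional subdifferentials as closed intervals and tests the inclusion ∂h(x̄) ⊂ ∂g(x̄) and the intersection ∂g(x̄) ∩ ∂h(x̄) ≠ ∅; the closed-form subdifferentials are hard-coded assumptions matching the two examples.

```python
# Subdifferentials encoded as closed intervals [lo, hi].
def subdiff_abs(x, c=1.0):         # subdifferential of |x - c|
    if x < c:  return (-1.0, -1.0)
    if x > c:  return (1.0, 1.0)
    return (-1.0, 1.0)

def subdiff_quad(x, c=1.0):        # subdifferential of (x - c)^2, i.e. {2(x - c)}
    v = 2.0 * (x - c)
    return (v, v)

def is_subset(a, b):               # interval a contained in interval b
    return b[0] <= a[0] and a[1] <= b[1]

def intersects(a, b):              # intervals a and b have a common point
    return max(a[0], b[0]) <= min(a[1], b[1])

# Example 1.1: g = |x - 1|, h = (x - 1)^2, x_bar = 1/2.
dg, dh = subdiff_abs(0.5), subdiff_quad(0.5)
print(is_subset(dh, dg), intersects(dg, dh))   # True True   -> stationary (hence critical)

# Example 1.2: g = (x - 1/2)^2, h = |x - 1|, x_bar = 1.
dg, dh = subdiff_quad(1.0, c=0.5), subdiff_abs(1.0)
print(is_subset(dh, dg), intersects(dg, dh))   # False True  -> critical, not stationary
```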
DCA Schemes
The idea of DCAs is to decompose the DC program (P) into two sequences of convex programs (P_k) and (D_k), k ∈ N, which approximate (P) and its dual (D), respectively. One constructs two sequences {x^k} and {y^k} of solutions of the problems (P_k) and (D_k), respectively, in such a way that the following properties hold for every k ∈ N:
(i) The sequences {(g−h)(x k )} and {(h ∗ −g ∗ )(y k )} are decreasing;
(ii) Any cluster point ¯x (resp ¯y) of {x k } (resp., of {y k }) is a critical point of (P) (resp., of (D)).
Following Tuan [93], we can formulate and analyze the general DC algo- rithm of [77] as follows.
DCA Scheme 1.1
Step 1. Choose an initial point x⁰. Put k = 0.
Step 2. Find y^k and x^{k+1} satisfying
y^k ∈ ∂h(x^k), (1.2)
x^{k+1} ∈ ∂g*(y^k). (1.3)
Step 3. k ← k + 1 and return to Step 2.
For each k ≥ 0, we have thus constructed a pair (x^k, y^k) satisfying (1.2) and (1.3). Thanks to Proposition 1.2, the inclusion (1.2) can be rewritten equivalently as x^k ∈ ∂h*(y^k). Consequently, the condition (1.2) is equivalent to the requirement that y^k is a solution of the problem
min{ h*(y) − [ g*(y^{k−1}) + ⟨x^k, y − y^{k−1}⟩ ] | y ∈ Y }, (D_k)
where y^{k−1} ∈ dom g* is the vector defined at the previous step k − 1.
The inclusion x^k ∈ ∂g*(y^{k−1}) means that
g*(y) − g*(y^{k−1}) ≥ ⟨x^k, y − y^{k−1}⟩ for all y ∈ Y.
Thus the affine function y ↦ g*(y^{k−1}) + ⟨x^k, y − y^{k−1}⟩ is a lower approximation of g*(y). Replacing g*(y) by this lower approximation in the objective function of (D), we obtain the auxiliary problem (D_k).
Since (D k ) is a convex program, solving (D k ) is much easier than solving the DC program (D) Recall that y k is a solution of (D k ).
Similarly, at each step k + 1, the DC program (P) is replaced by the problem
min{ g(x) − [ h(x^k) + ⟨x − x^k, y^k⟩ ] | x ∈ X }, (P_k)
where x^k ∈ dom h* has been defined at step k.
Since (P k ) is a convex program, solving (P k ) is much easier than solving the original DC program (P) As x k+1 satisfies (1.3), it is a solution of (P k ).
The objective function of (D_k) is a convex upper approximation of the objective function of (D), and both functions have the same value at y^{k−1}. Omitting additive constants in the objective function of (D_k), we can rewrite that problem in the equivalent form
min{ h*(y) − ⟨x^k, y⟩ | y ∈ Y }. (1.4)
Similarly, the objective function of (P_k) is a convex upper approximation of the objective function of (P), and both functions have the same value at x^k. Omitting additive constants in the objective function of (P_k), we can rewrite that problem in the equivalent form
min{ g(x) − ⟨x, y^k⟩ | x ∈ X }. (1.5)
If x^k is a critical point of (P), i.e., ∂g(x^k) ∩ ∂h(x^k) ≠ ∅, then DCA may produce a sequence {(x^ℓ, y^ℓ)} with x^ℓ = x^k for all ℓ ≥ k. Indeed, since there exists a point x̄ ∈ ∂g(x^k) ∩ ∂h(x^k), to satisfy (1.2) we can choose y^k = x̄. Next, by Proposition 1.2, the inclusion (1.3) is equivalent to y^k ∈ ∂g(x^{k+1}). So, if we choose x^{k+1} = x^k, then (1.3) is fulfilled, because y^k = x̄ ∈ ∂g(x^k).
Thus, DCA can lead us to critical points, but it has no mechanism for escaping from them. When a critical point is not a local minimizer, more refined tools of variational analysis are needed to find a descent direction.
The following observations can be found in Tuan [93]:
• The DCA is a decomposition procedure which decomposes the solution of the pair of optimization problems (P) and (D) into the parallel solution of the sequence of convex minimization problems (P k ) and (D k ), k ∈ N;
• The DCA does not include any specific technique for solving the convex problems (P k ) and (D k ) Such techniques should be imported from convex programming;
• The performance of DCA depends greatly on a concrete decomposition of the objective function into DC components;
Although the DCA is a deterministic optimization method, it can produce different sequences {x^k} and {y^k} depending on the initial point x⁰ and on how the solutions y^k of (D_k) and x^{k+1} of (P_k) are selected at each iteration k, since (D_k) or (P_k) may have more than one solution.
The above analysis allows us to formulate a simplified version of DCA, which includes a termination procedure, as follows.
DCA Scheme 1.2
Output: Finite or infinite sequences {x^k} and {y^k}.
Step 1. Choose x⁰ ∈ dom g. Take ε ≥ 0. Put k = 0.
Step 2. Calculate y^k by solving the convex program (1.4). Then calculate x^{k+1} by solving the convex program (1.5).
Step 3 If ||x k+1 −x k || ≤ε then stop, else go to Step 4.
Step 4 k := k+ 1 and return to Step 2.
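A minimal sketch of DCA Scheme 1.2 in code, assuming the two convex subproblems (1.4) and (1.5) are supplied as callables with closed-form solutions; it is instantiated with the data of Example 1.3 below (g(x) = (x − 1)², h(x) = |x − 1|), for which the dual step returns a subgradient of h and the primal step is x^{k+1} = y^k/2 + 1. Function names and the tolerance are illustrative assumptions.

```python
import numpy as np

def dca(x0, solve_dual, solve_primal, eps=1e-10, max_iter=100):
    """Simplified DCA: alternate the dual step (1.4) and the primal step (1.5)
    until ||x_{k+1} - x_k|| <= eps."""
    xs, ys = [float(x0)], []
    for _ in range(max_iter):
        y = solve_dual(xs[-1])            # y_k solves  min h*(y) - <x_k, y>
        x_next = solve_primal(y)          # x_{k+1} solves  min g(x) - <x, y_k>
        ys.append(y)
        xs.append(x_next)
        if np.linalg.norm(x_next - xs[-2]) <= eps:
            break
    return xs, ys

# Data of Example 1.3: g(x) = (x - 1)^2, h(x) = |x - 1| on X = R.
solve_dual   = lambda x: float(np.sign(x - 1.0))   # a subgradient of h at x
solve_primal = lambda y: y / 2.0 + 1.0             # argmin of (x - 1)^2 - x * y

xs, ys = dca(2.0, solve_dual, solve_primal)
print(xs[-1], ys[-1])    # 1.5 1.0   (the global minimizer x_bar = 3/2)
xs, ys = dca(0.0, solve_dual, solve_primal)
print(xs[-1], ys[-1])    # 0.5 -1.0  (the other global minimizer)
```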
To understand the performance of the above DCA schemes, let us consider the following example.
Example 1.3 Consider the function f(x) = g(x) − h(x) with g(x) = (x − 1)² and h(x) = |x − 1| for all x ∈ R. Here Y = X = R and we have
g*(y) = sup{ xy − g(x) | x ∈ R } = sup{ xy − (x − 1)² | x ∈ R } = (1/4)y² + y.
Hence ∂g*(y) = { y/2 + 1 } for every y ∈ Y. For the function h, one has ∂h(x) = {−1} for x < 1, ∂h(x) = {1} for x > 1, and ∂h(x) = [−1, 1] for x = 1. Using DCA Scheme 1.1, we construct DCA sequences {x^k} and {y^k} satisfying y^k ∈ ∂h(x^k) and x^{k+1} ∈ ∂g*(y^k) for k ∈ N. Starting with any x⁰ > 1, we get y⁰ = 1 and x¹ = 3/2; hence y^k = 1 and x^k = 3/2 for all k ≥ 1, so the sequences converge to x̄ = 3/2 and ȳ = 1. Starting with any x⁰ < 1, we obtain DCA sequences with y^k = −1 and x^k = 1/2 for all k ≥ 1, which converge to x̄ = 1/2 and ȳ = −1.
Since f(x) = x² − x for x ≤ 1 and f(x) = x² − 3x + 2 for x ≥ 1, one finds that x̄ = 3/2 and x̂ = 1/2 are global minimizers of (P), and x̃ := 1 is a critical point of the problem which is neither a stationary point nor a local solution.
Now, starting with x⁰ = x̃ = 1 and choosing y⁰ = 0 ∈ ∂h(x⁰) = [−1, 1], we have x¹ ∈ ∂g*(y⁰) = ∂g*(0) = {1}, i.e., x¹ = 1. Again we can choose y¹ = 0 ∈ ∂h(x¹) = [−1, 1], and so on. This yields DCA sequences {x^k} and {y^k} converging to x̃ = 1 and ȳ = 0, respectively. Note that the limit point x̃ of the sequence {x^k} is a critical point of (P) which is neither a local minimizer nor a stationary point of (P).
To ease the presentation of some related programs, we consider the follow- ing scheme.
Output: Finite or infinite sequences {x k } and {y k }.
Step 1 Choose x 0 ∈ domg Take ε > 0 Put k = 0.
Step 2. Calculate y^k by using (1.2) and find
x^{k+1} ∈ argmin{ g(x) − ⟨x, y^k⟩ | x ∈ X }. (1.6)
Step 3 If ||x k+1 −x k || ≤ε then stop, else go to Step 4.
Step 4 k := k+ 1 and return to Step 2.
General Convergence Theorem
We will now recall the fundamental theorem on DCAs of Pham Dinh Tao and Le Thi Hoai An [77, Theorem 3], which provides a solid theoretical basis for the practical application of these algorithms. To state it, we need to recall the notions of ρ-convex function, modulus of convexity of a convex function, and strongly convex function.
Definition 1.7 Let ρ ≥ 0 and let C be a convex set in the space X. A function θ : C → R ∪ {+∞} is called ρ-convex if
θ(λx + (1 − λ)x′) ≤ λθ(x) + (1 − λ)θ(x′) − (λ(1 − λ)/2) ρ‖x − x′‖²
for all numbers λ ∈ (0, 1) and vectors x, x′ ∈ C. This amounts to saying that the function θ(·) − (ρ/2)‖·‖² is convex on C.
Definition 1.8 The modulus of convexity of θ on C is given by
ρ(θ, C) = sup{ ρ ≥ 0 | θ − (ρ/2)‖·‖² is convex on C }.
If C = X, then we write ρ(θ) instead of ρ(θ, C). The function θ is called strongly convex on C if ρ(θ, C) > 0.
Consider the problem (P). If ρ(g) > 0 (resp., ρ(g*) > 0), let ρ₁ (resp., ρ₁*) be a real number such that 0 ≤ ρ₁ < ρ(g) (resp., 0 ≤ ρ₁* < ρ(g*)). If ρ(g) = 0 (resp., ρ(g*) = 0), let ρ₁ = 0 (resp., ρ₁* = 0). If ρ(h) > 0 (resp., ρ(h*) > 0), let ρ₂ (resp., ρ₂*) be a real number such that 0 ≤ ρ₂ < ρ(h) (resp., 0 ≤ ρ₂* < ρ(h*)). If ρ(h) = 0 (resp., ρ(h*) = 0), let ρ₂ = 0 (resp., ρ₂* = 0).
The convenient abbreviations dx k := x k+1 −x k and dy k := y k+1 −y k were adopted in [77].
Theorem 1.1 ([77, Theorem 3]) Let α := inf{ f(x) = g(x) − h(x) | x ∈ R^n }. Assume that the iteration sequences {x^k} and {y^k} are generated by DCA Scheme 1.1. Then, the following properties are valid:
(i) the inequalities
(g − h)(x^{k+1}) ≤ (h* − g*)(y^k) ≤ (g − h)(x^k) − max{ ((ρ₁ + ρ₂)/2)‖dx^k‖², (ρ₁*/2)‖dy^{k−1}‖² + (ρ₂/2)‖dx^k‖², (ρ₁*/2)‖dy^{k−1}‖² + (ρ₂*/2)‖dy^k‖² }
hold for every k;
(ii) the inequalities
(h* − g*)(y^{k+1}) ≤ (g − h)(x^{k+1}) ≤ (h* − g*)(y^k) − max{ (ρ₁/2)‖dx^{k+1}‖² + (ρ₂/2)‖dx^k‖², (ρ₁*/2)‖dy^k‖² + (ρ₂/2)‖dx^k‖² }
hold for every k;
(iii) If α is finite, then {(g − h)(x^k)} and {(h* − g*)(y^k)} are decreasing sequences that converge to the same limit β ≥ α. Furthermore,
(a) if ρ(g) + ρ(h) > 0 (resp., ρ(g*) + ρ(h*) > 0), then lim_{k→∞} (x^{k+1} − x^k) = 0 (resp., lim_{k→∞} (y^{k+1} − y^k) = 0);
(iv) If α is finite, and {x^k} and {y^k} are bounded, then for every cluster point x̄ of {x^k} (resp., ȳ of {y^k}), there is a cluster point ȳ of {y^k} (resp., x̄ of {x^k}) such that ȳ ∈ ∂g(x̄) ∩ ∂h(x̄); in particular, x̄ is a critical point of (P) and ȳ is a critical point of (D).
The estimates in the assertions (i) and (ii) of the above theorem can be slightly improved as shown in the next remark.
Remark 1.1 Suppose that ρ(h) > 0, so that ρ₂ can be any real number in [0, ρ(h)). The sequences {x^k} and {y^k} are constructed independently of the constants ρ₁, ρ₁*, ρ₂, ρ₂*; hence assertion (i) of Theorem 1.1 holds, for every k ∈ N, for each admissible choice of these constants. Passing the corresponding inequality to the limit as ρ₂ → ρ(h), one obtains the same estimate with ρ₂ replaced by ρ(h).
Treating in the same way all the constants associated with the strongly convex functions among {g, h, g*, h*}, one can show that the estimates in assertions (i) and (ii) of Theorem 1.1 remain valid when ρ₁, ρ₂, ρ₁*, ρ₂* are replaced by ρ(g), ρ(h), ρ(g*), ρ(h*), respectively. The forthcoming example is designed as an illustration of Theorem 1.1.
Example 1.4 Consider the function f(x) = g(x) − h(x) from Example 1.1, where g(x) = |x − 1| and h(x) = (x − 1)² for all x ∈ R. Here Y = X = R and we have
h*(y) = sup{ xy − h(x) | x ∈ R } = sup{ xy − (x − 1)² | x ∈ R } = (1/4)y² + y.
Using DCA Scheme 1.2, we calculate DCA sequences {x^k} and {y^k} by solving, respectively, the convex programs (1.4) and (1.5) for k ∈ N. Choose ε = 0. First, select x⁰ = 2/3. Since y⁰ is a solution of (1.4) for k = 0, we get y⁰ = −2/3. Solving (1.5) for k = 0 gives x¹ = 1; then y^k = 0 for all k ≥ 1 and x^k = 1 for all k ≥ 2. The stopping criterion is satisfied at k = 1, and the algorithm yields the point x̄ = x² = 1, which is the unique local solution of (P). The same conclusion holds for any initial point x⁰ ∈ (1/2, 3/2). If x⁰ = 1/2 or x⁰ = 3/2, then one can take x¹ = x⁰, so the algorithm stops at k = 0 and yields x̄ = x¹ = x⁰, which is a stationary point of (P) that is not a local solution. If x⁰ < 1/2 or x⁰ > 3/2, then f(x^k) → −∞ as k → ∞, so {x^k} does not have any cluster point.
Convergence Rates
In Chapters 2 and 4, we will demonstrate various results regarding the convergence rates of iterative sequences, focusing on two specific types of linear convergence: Q-linear convergence and R-linear convergence These concepts will be revisited in the subsequent sections.
Definition 1.9 (See, e.g., [70, p. 28] and [88, pp. 293–294]) One says that a sequence {x^k} ⊂ R^n converges Q-linearly to a vector x̄ ∈ R^n if there exists β ∈ (0, 1) such that ‖x^{k+1} − x̄‖ ≤ β‖x^k − x̄‖ for all k sufficiently large.
Clearly, if x^k ≠ x̄, then the relation ‖x^{k+1} − x̄‖ ≤ β‖x^k − x̄‖ in Definition 1.9 can be rewritten equivalently as ‖x^{k+1} − x̄‖ / ‖x^k − x̄‖ ≤ β. The word “Q”, which stands for “quotient”, comes from this context.
Definition 1.10 (See, e.g., [70, p. 30]) One says that a sequence {x^k} ⊂ R^n converges R-linearly to a vector x̄ ∈ R^n if there is a sequence of nonnegative scalars {μ_k} such that ‖x^k − x̄‖ ≤ μ_k for all k sufficiently large, and {μ_k} converges Q-linearly to 0.
If a sequence {x^k} converges Q-linearly to x̄, then it also converges R-linearly to x̄. Indeed, choose β ∈ (0, 1) satisfying the requirement of Definition 1.9 and put μ_k := β‖x^{k−1} − x̄‖ for all k ≥ 1. Then ‖x^k − x̄‖ ≤ μ_k for all k sufficiently large, and {μ_k} converges Q-linearly to 0 because μ_{k+1} ≤ βμ_k for all large k. However, R-linear convergence does not imply Q-linear convergence. For instance, consider the sequence of positive scalars {x^k} with x^k = 1 + 2^{−k} if k is even and x^k = 1 if k is odd, and observe that {x^k} converges R-linearly to 1, while it does not converge Q-linearly to 1.
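The following sketch (illustrative; the sequence x_k = 1 + 2^{−k} for even k and x_k = 1 for odd k is an assumed concrete instance of the example above) checks numerically that such a sequence fails the quotient test of Definition 1.9 while satisfying the bound of Definition 1.10 and the root test (1.7).

```python
import numpy as np

k = np.arange(1, 21)
x = np.where(k % 2 == 0, 1.0 + 2.0 ** (-k), 1.0)    # assumed sequence
err = np.abs(x - 1.0)

# Q-linear test fails: err jumps from 0 (odd k) to a positive value (even k).
print(any(err[i] == 0.0 and err[i + 1] > 0.0 for i in range(len(err) - 1)))   # True

# R-linear bound: err_k <= mu_k with mu_k = 2^{-k}, and {mu_k} -> 0 Q-linearly.
mu = 2.0 ** (-k)
print(np.all(err <= mu))                                                      # True

# Root test (1.7): the k-th roots of the nonzero errors stay at 0.5 < 1.
print(np.max(err[err > 0] ** (1.0 / k[err > 0])))                             # 0.5
```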
Sometimes, one says that a sequence {x^k} ⊂ R^n converges R-linearly to a vector x̄ ∈ R^n whenever
lim sup_{k→∞} ‖x^k − x̄‖^{1/k} < 1 (1.7)
(see, e.g., [92]). The word “R”, which stands for “root”, comes from this context.
The next proposition clarifies the equivalence between the characterization of R-linear convergence in (1.7) and the definition given in Definition 1.10.
Proposition 1.5 A sequence {x k } ⊂ R n converges R-linearly to a vector ¯ x ∈ R n if and only if the strict inequality (1.7) holds.
Proof. Necessity: Suppose that {x^k} converges R-linearly to x̄. Then there exists a sequence of nonnegative scalars {μ_k} such that ‖x^k − x̄‖ ≤ μ_k for all k sufficiently large and {μ_k} converges Q-linearly to 0. Hence we can find β ∈ (0, 1) and an integer k₁ such that μ_{k+1} ≤ βμ_k for all k ≥ k₁. Assuming μ_{k₁} > 0, for any k > k₁ we have ‖x^k − x̄‖ ≤ μ_k ≤ βμ_{k−1} ≤ ... ≤ β^{k−k₁} μ_{k₁}.
It follows that ‖x^k − x̄‖ ≤ (μ_{k₁} β^{−k₁}) β^k for all k > k₁. Therefore,
lim sup_{k→∞} ‖x^k − x̄‖^{1/k} ≤ β lim sup_{k→∞} (μ_{k₁} β^{−k₁})^{1/k} = β < 1.
Sufficiency: Assume that (1.7) holds. Then there exist γ ∈ (0, 1) and a natural number k₂ such that ‖x^k − x̄‖ ≤ γ^k for all k ≥ k₂. Define μ_k := γ^k. Then ‖x^k − x̄‖ ≤ μ_k for k ≥ k₂, μ_{k+1} = γμ_k for all k, and lim_{k→∞} μ_k = 0, so {μ_k} converges Q-linearly to 0. Thus {x^k} converges R-linearly to x̄. □
Analysis of an Algorithm in Indefinite Quadratic Programming
Indefinite Quadratic Programs and DCAs
Indefinite quadratic programming (IQP) plays a crucial role in optimization theory and has diverse applications across various fields Sequential quadratic programming methods, such as Wilson’s and Pang’s methods, simplify nonlinear programming problems into a series of IQPs Gupta reviews IQP applications in finance, agriculture, economics, production, marketing, and public policy, while Cornu´ejols et al focus on quadratic programming models specifically in finance Additionally, the image enhancement problem can be formulated as a quadratic programming issue, as demonstrated by Jen and Wang Akoa explores IQP in training support vector machines with nonpositive-semi-definite kernels, and recent studies by Xu et al and Xue et al have addressed similar machine learning challenges Liu et al have also investigated IQP related to support vector machines with indefinite kernels, which is gaining traction in the machine learning community For applications involving quadratic constraints, Wiebking's research is recommended.
Numerous studies have explored numerical methods for solving indefinite quadratic programming (IQP) problems; see, e.g., [17, 19, 78–80, 82, 101–103]. It is worth noting that most of the existing algorithms find only stationary points (i.e., the Karush-Kuhn-Tucker points) or local minimizers, so these methods are predominantly local solution methods. Since the IQP is NP-hard (see [72] and [17]), finding global solutions remains a significant challenge.
We are interested in studying and implementing two methods for solving the IQP which are based on the general framework for solving DC programs developed by Pham Dinh Tao and Le Thi Hoai An, in which DCA can also be combined with interior point techniques for solving large-scale problems. Two different DC decompositions of the objective function, the projection DC decomposition and the proximal DC decomposition, give rise to two algorithms for the IQP: the Projection DC decomposition algorithm (Algorithm A) and the Proximal DC decomposition algorithm (Algorithm B); see [57, 82] for details. These algorithms have the following notable features:
- The algorithm descriptions are simple;
- No line searches are required.
The DCA theory confirms that any cluster point of a DCA sequence, produced by specific algorithms, qualifies as a KKT point of the IQP To ensure the existence of these cluster points, it is crucial to establish the boundedness of the DCA sequence While DCA sequences are not inherently bounded, a conjecture suggests that if the IQP possesses global solutions, then all DCA sequences generated by algorithms A and B should be bounded Recent work by Tuan has affirmed this conjecture for two-dimensional IQPs To extend this result to the general case, Tuan employed a local error bound for affine variational inequalities, along with various properties of the KKT point set of the IQP identified by Luo and Tseng The key finding indicates that if the IQP has a nonempty solution set, every DCA sequence generated by Algorithm A converges R-linearly to a KKT point.
Numerous numerical tests, which will be reported in Section 2.4, lead us to the following observations:
In both algorithms A and B, a positive decomposition parameter that is closer to the lower bound of the admissible parameter interval leads to a higher convergence rate of the DCA sequences.
- Applied to the same problem with the same initial point, Algorithm B is more efficient than Algorithm A in terms of the number of computation steps and the execution time.
Our findings align with the recent work of Le Thi, Huynh, and Pham Dinh, which presents original proofs of significant convergence theorems for DCA algorithms addressing optimization problems with subanalytic data Specifically, Theorems 3.4, 3.5, and 4.2 from their study indicate that any DCA sequence produced by Algorithm B converges R-linearly to a KKT point, provided the sequence is bounded However, the boundedness of DCA sequences cannot be established through the Lojasiewicz inequality, as noted in their Theorem 2.1, along with related insights on Kurdyka-Lojasiewicz properties Thus, Theorem 2.2 and its proof represent novel contributions to the analysis of existing solution algorithms in the realm of indefinite quadratic programming.
Consider the indefinite quadratic programming problem under linear constraints (called the IQP for brevity):
min{ f(x) := (1/2) xᵀQx + qᵀx | Ax ≥ b }, (2.1)
where Q ∈ R^{n×n} and A ∈ R^{m×n} are given matrices, Q is symmetric, and q ∈ R^n and b ∈ R^m are arbitrarily given vectors. The constraint set of the problem is C := { x ∈ R^n | Ax ≥ b }.
Since xᵀQx is an indefinite quadratic form, the objective function f(x) may be nonconvex; hence (2.1) is a nonconvex optimization problem.
Now we describe some standard notations that will be used later on. The unit matrix in R^{n×n} is denoted by E. The eigenvalues of a symmetric matrix M ∈ R^{n×n} are arranged in nondecreasing order, λ₁(M) ≤ ... ≤ λₙ(M), counting multiplicities. For a subset α ⊂ {1, ..., m}, A_α denotes the matrix formed by the rows A_i, i ∈ α, of A, and b_α denotes the vector formed by the components b_i, i ∈ α, of b. The pseudo-face of the convex set C corresponding to α is the set { x ∈ R^n | A_α x = b_α, A_ᾱ x > b_ᾱ }, where ᾱ := {1, ..., m} \ α. By B(x, ε) and B̄(x, ε) we denote, respectively, the open and the closed ball centered at x with radius ε. For vectors v¹, ..., v^s ∈ R^n, pos{v¹, ..., v^s} denotes the closed convex cone generated by them, i.e., pos{v¹, ..., v^s} = { t₁v¹ + ... + t_s v^s | t_i ≥ 0, i = 1, ..., s }.
The metric projection of u ∈ R^n onto C is denoted by P_C(u); that is, P_C(u) belongs to C and ‖u − P_C(u)‖ = min_{x∈C} ‖u − x‖.
The tangent cone to C at x ∈ C is denoted by T_C(x), i.e.,
T_C(x) = { t(y − x) | t ≥ 0, y ∈ C } = { v ∈ R^n | A_α v ≥ 0 },
where α = { i | A_i x = b_i }. The normal cone to C at x ∈ C is denoted by N_C(x), i.e., N_C(x) = { w ∈ R^n | ⟨w, v⟩ ≤ 0 for all v ∈ T_C(x) }.
To solve the IQP by a sequence of strongly convex quadratic programs, one represents the objective function as the difference of two convex linear-quadratic functions,
f(x) = ϕ(x) − ψ(x), where ϕ(x) := (1/2)xᵀQ₁x + qᵀx, ψ(x) := (1/2)xᵀQ₂x, (2.2)
with Q₁ symmetric positive definite, Q₂ symmetric positive semidefinite, and Q = Q₁ − Q₂. Then (2.1) can be rewritten as the DC program
min{ g(x) − h(x) | x ∈ R^n },
where g(x) := ϕ(x) + δ_C(x), h(x) := ψ(x), and δ_C(x) is the indicator function of C, i.e., δ_C(x) = 0 if x ∈ C and δ_C(x) = +∞ otherwise. Starting from an initial point x⁰ ∈ R^n, at each step k ≥ 0 one computes y^k = ∇h(x^k) = Q₂x^k and then finds x^{k+1} as the unique solution of the strongly convex quadratic minimization problem
min{ (1/2)xᵀQ₁x + qᵀx − xᵀQ₂x^k | x ∈ C }. (2.3)
The obtained sequence {x^k} is called the DCA sequence generated by the DC algorithm and the initial point x⁰.
Definition 2.1 For x ∈ R^n, if there exists a multiplier λ ∈ R^m such that
Qx + q − Aᵀλ = 0, Ax ≥ b, λ ≥ 0, λᵀ(Ax − b) = 0, (2.4)
then x is said to be a Karush-Kuhn-Tucker point (a KKT point) of the IQP.
Thus, x is a KKT point of the IQP if and only if x ∈ C and ⟨∇f(x), v⟩ = (Qx + q)ᵀv ≥ 0 for every v ∈ T_C(x), which is equivalent to saying that ⟨∇f(x), y − x⟩ ≥ 0 for every y ∈ C. Therefore, x is a KKT point of the IQP if and only if x is a solution of the affine variational inequality
x ∈ C, ⟨Qx + q, u − x⟩ ≥ 0 for all u ∈ C. (2.6)
Denote the KKT point set (resp., the global solution set) of the IQP by C* (resp., S). It is well known (see, e.g., [50]) that S ⊂ C*.
Let us recall some properties of DCA sequences that follow from the fundamental theorem on DCAs (Theorem 1.1 and Remark 1.1) applied to the IQP (2.1). Note that the modulus of convexity of the function g(x) = ϕ(x) + δ_C(x), with ϕ(x) = (1/2)xᵀQ₁x + qᵀx, is at least λ₁(Q₁), while the modulus of convexity of the function h(x) = ψ(x) = (1/2)xᵀQ₂x equals λ₁(Q₂).
Theorem 2.1 (See [81, Theorem 3] and [82, Theorem 2.1]) Every DCA sequence {x^k} generated by the above DC algorithm and an initial point x⁰ ∈ R^n has the following properties:
(i) f(x^{k+1}) ≤ f(x^k) − ((λ₁(Q₁) + λ₁(Q₂))/2) ‖x^{k+1} − x^k‖² for every k ≥ 1;
(ii) {f(x^k)} converges to an upper bound f* for the optimal value of (2.1);
(iii) every cluster point x̄ of {x^k} is a KKT point of (2.1);
(iv) if inf_{x∈C} f(x) > −∞, then lim_{k→∞} ‖x^{k+1} − x^k‖ = 0.
Remark 2.1 By [81, Theorem 3], if x⁰ ∈ C then the inequality in (i) holds for every k ≥ 0. To see this, it suffices to note that x⁰ ∈ C = dom g, where g = ϕ + δ_C and dom g := { x | g(x) < +∞ }.
The smallest eigenvalue λ₁(Q) and the largest eigenvalue λₙ(Q) of the symmetric matrix Q can be computed efficiently by standard numerical linear algebra routines. The DC decomposition (2.2) can be chosen, for instance, in one of the following two ways:
(a) Q₁ := ρE, Q₂ := ρE − Q, where ρ is a positive real number satisfying the condition ρ ≥ λₙ(Q);
(b) Q₁ := Q + ρE, Q₂ := ρE, where ρ is a positive real number satisfying the condition ρ > −λ₁(Q).
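The two decompositions can be formed directly from the spectrum of Q. The sketch below (with an assumed 2×2 indefinite matrix Q) computes λ₁(Q) and λₙ(Q) with a standard symmetric eigenvalue routine and verifies that both choices (a) and (b) give Q = Q₁ − Q₂ with Q₁ positive definite and Q₂ positive semidefinite.

```python
import numpy as np

Q = np.array([[2.0, 0.0],
              [0.0, -3.0]])                  # assumed indefinite symmetric matrix
lam = np.linalg.eigvalsh(Q)                  # eigenvalues in nondecreasing order
lam1, lamn = lam[0], lam[-1]                 # lambda_1(Q) = -3, lambda_n(Q) = 2
E = np.eye(Q.shape[0])

# (a) projection decomposition: rho >= lambda_n(Q), rho > 0
rho_a = lamn
Q1_a, Q2_a = rho_a * E, rho_a * E - Q

# (b) proximal decomposition: rho > -lambda_1(Q)
rho_b = -lam1 + 1.0
Q1_b, Q2_b = Q + rho_b * E, rho_b * E

for Q1, Q2 in ((Q1_a, Q2_a), (Q1_b, Q2_b)):
    assert np.allclose(Q1 - Q2, Q)
    assert np.linalg.eigvalsh(Q1)[0] > 0     # Q1 positive definite
    assert np.linalg.eigvalsh(Q2)[0] >= 0    # Q2 positive semidefinite
```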
The number ρ is called the decomposition parameter The following algo- rithms appear on the basis of (a) and (b), respectively.
The Projection DC decomposition algorithm (Algorithm A). Choose a number ρ > 0 satisfying ρ ≥ λₙ(Q) and an initial point x⁰ ∈ R^n; here Q₁ := ρE and Q₂ := ρE − Q. For each k ≥ 0, put y^k := Q₂x^k = (ρE − Q)x^k and compute x^{k+1} as the unique solution of (2.3), which in this case is given by
x^{k+1} := P_C( x^k − (1/ρ)(Qx^k + q) ),
i.e., x^{k+1} solves the convex program
min{ ‖x − (1/ρ)(y^k − q)‖² | x ∈ C }. (2.7)
The scheme of the algorithm with a stopping criterion is as follows (To have an infinite DCA sequence, one has to choose ε = 0.)
Step 1. Choose x⁰ ∈ R^n and ε ≥ 0. Put k = 0.
Step 2. Compute y^k = (ρE − Q)x^k.
Step 3. Calculate x^{k+1} by solving the convex program (2.7).
Step 4 If kx k+1 −x k k ≤ ε then stop, else go to Step 5.
Step 5 Set k := k+ 1 and go to Step 2.
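A minimal sketch of Algorithm A for the special case where C is a box {x : l ≤ x ≤ u} (a particular instance of {x : Ax ≥ b}), so that the metric projection P_C reduces to a componentwise clip; the test matrix, bounds, and tolerance are assumptions made for illustration.

```python
import numpy as np

def algorithm_A(Q, q, lower, upper, x0, eps=1e-8, max_iter=10_000):
    """Projection DC decomposition algorithm for min 1/2 x^T Q x + q^T x
    over the box C = {x : lower <= x <= upper} (so P_C is a clip)."""
    rho = max(np.linalg.eigvalsh(Q)[-1], 1e-12)   # rho >= lambda_n(Q) and rho > 0
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        # x_{k+1} = P_C(x_k - (Q x_k + q)/rho), cf. the formula above
        x_new = np.clip(x - (Q @ x + q) / rho, lower, upper)
        if np.linalg.norm(x_new - x) <= eps:
            return x_new
        x = x_new
    return x

# Assumed test data with an indefinite Q.
Q = np.array([[2.0, 0.0], [0.0, -1.0]])
q = np.array([0.0, 0.0])
x_bar = algorithm_A(Q, q, lower=-1.0, upper=1.0, x0=np.array([0.3, 0.1]))
print(x_bar)      # converges to the KKT point (0, 1) for this starting point
```

For a general polyhedral set C, the projection in (2.7) is itself a strongly convex quadratic program and would be delegated to a QP solver.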
The Proximal DC decomposition algorithm (Algorithm B). Choose a number ρ > 0 satisfying ρ > −λ₁(Q) and an initial point x⁰ ∈ R^n; here Q₁ := Q + ρE and Q₂ := ρE. For each k ≥ 0, compute x^{k+1} as the unique solution of the strongly convex quadratic minimization problem
min{ ψ_k(x) := (1/2)xᵀQ₁x + qᵀx − xᵀQ₂x^k | x ∈ C }. (2.9)
The scheme of the algorithm with a stopping criterion is as follows. (To have an infinite DCA sequence, one has to choose ε = 0.)
Step 1. Choose x⁰ ∈ R^n and ε ≥ 0. Put k = 0.
Step 2. Calculate x^{k+1} by solving the convex program (2.9).
Step 3 If kx k+1 −x k k ≤ ε then stop, else go to Step 4.
Step 4 Set k := k+ 1 and go to Step 2.
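A minimal sketch of Algorithm B in the same box-constrained setting. The inner strongly convex subproblem (2.9) is solved here by iterating the map G_k(x) = P_C(x^k − (Qx + q)/ρ) from (2.12), which is a contraction when ρ > ‖Q‖ (the stronger parameter choice also used in Theorem 2.3 below); the data and tolerances are assumptions.

```python
import numpy as np

def algorithm_B(Q, q, lower, upper, x0, eps=1e-8, inner_tol=1e-12, max_iter=10_000):
    """Proximal DC decomposition algorithm on a box C.  Each outer step solves
    the subproblem (2.9) as the fixed point of G_k(x) = P_C(x_k - (Qx + q)/rho),
    which is a contraction because rho > ||Q|| (hence also rho > -lambda_1(Q))."""
    rho = 1.1 * np.linalg.norm(Q, 2) + 1e-6
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        z = x.copy()
        for _ in range(1000):                              # inner fixed-point iteration
            z_new = np.clip(x - (Q @ z + q) / rho, lower, upper)
            if np.linalg.norm(z_new - z) <= inner_tol:
                break
            z = z_new
        if np.linalg.norm(z_new - x) <= eps:
            return z_new
        x = z_new
    return x

Q = np.array([[2.0, 0.0], [0.0, -1.0]])                    # assumed test data, as before
q = np.array([0.0, 0.0])
print(algorithm_B(Q, q, -1.0, 1.0, np.array([0.3, 0.1])))  # the KKT point (0, 1) here
```

When ρ is only slightly larger than −λ₁(Q), the map G_k need not be a contraction, and the subproblem (2.9) should instead be handed to a quadratic programming solver.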
Convergence and Convergence Rate of the Algorithm
As described in Section 2.1, the KKT point set C* coincides with the solution set of the affine variational inequality (2.6); hence it is the union of finitely many polyhedral convex sets and, in particular, has finitely many connected components. If the solution set S of (2.1) is nonempty, then C* is nonempty. For any subset M ⊂ R^n and any x ∈ R^n, the distance from x to M is defined by d(x, M) := inf{ ‖x − y‖ | y ∈ M }.
We will need two auxiliary results. The next lemma gives a local error bound for the distance d(x, C*) from a feasible point x ∈ C to C*.
Lemma 2.1 ([92, Lemma 2.1]; cf. [65, Lemma 3.1]) For any ρ > 0, if C* ≠ ∅, then there exist scalars ε > 0 and ℓ > 0 such that
d(x, C*) ≤ ℓ ‖x − P_C( x − (1/ρ)(Qx + q) )‖ (2.10)
for all x ∈ C with
‖x − P_C( x − (1/ρ)(Qx + q) )‖ ≤ ε. (2.11)
Lemma 2.2 ([65, Lemma 3.1]; see also [92, Lemma 2.2]) Let C₁, C₂, ..., C_r denote the connected components of C*. Then we have C* = C₁ ∪ ... ∪ C_r, and the following properties are valid:
(a) each C i is the union of finitely many polyhedral convex sets;
(b) the sets C_i, i = 1, ..., r, are properly separated from each other, that is, there exists δ > 0 such that if i ≠ j, then d(x, C_j) ≥ δ for all x ∈ C_i;
(c) f is constant on each Ci.
Since x^{k+1} is the unique solution of (2.9), it is the unique solution of the variational inequality
⟨∇ψ_k(x^{k+1}), x − x^{k+1}⟩ ≥ 0 for all x ∈ C,
where ∇ψ_k(x^{k+1}) = Qx^{k+1} + q + ρx^{k+1} − ρx^k. In other words, x^{k+1} is the unique solution of the strongly monotone affine variational inequality defined by the affine operator x ↦ (Q + ρE)x + q − ρx^k and the polyhedral convex set C. Therefore, applying Theorem 2.3 from [45, p. 9], we see that x^{k+1} is the unique fixed point of the map G_k(x) := P_C( x − μ(Mx + q_k) ), where μ > 0 is arbitrarily chosen, M := Q + ρE, and q_k := q − ρx^k. In what follows, we choose μ = ρ^{−1}. Then
x^{k+1} = P_C( x^{k+1} − (1/ρ)(Mx^{k+1} + q_k) ). (2.12)
The convergence and the rate of convergence of Algorithm B, the Proximal
DC decomposition algorithm, can be formulated as follows.
Theorem 2.2 If (2.1) has a solution, then for each x⁰ ∈ R^n, the DCA sequence {x^k} constructed by Algorithm B converges R-linearly to a KKT point of (2.1), that is, there exists x̄ ∈ C* such that lim sup_{k→∞} ‖x^k − x̄‖^{1/k} < 1.
Proof. Since (2.1) has a solution, C* ≠ ∅. Hence, by Lemma 2.1 there exist ℓ > 0 and ε > 0 such that (2.10) is fulfilled for any x satisfying (2.11). As inf_{x∈C} f(x) > −∞, assertion (iv) of Theorem 2.1 gives
lim_{k→∞} ‖x^{k+1} − x^k‖ = 0. (2.13)
Choose k₀ ∈ N so large that ‖x^{k+1} − x^k‖ < ε for all k ≥ k₀.
If it holds that
‖x^{k+1} − P_C( x^{k+1} − (1/ρ)(Qx^{k+1} + q) )‖ ≤ ε for all k ≥ k₀, (2.14)
then by (2.10) one has
d(x^{k+1}, C*) ≤ ℓ ‖x^{k+1} − P_C( x^{k+1} − (1/ρ)(Qx^{k+1} + q) )‖ for all k ≥ k₀. (2.15)
To obtain (2.14), for any k ≥ k₀ we recall that
x^{k+1} = G_k(x^{k+1}) = P_C( x^{k+1} − (1/ρ)(Mx^{k+1} + q_k) ). (2.16)
Since ( x^{k+1} − (1/ρ)(Mx^{k+1} + q_k) ) − ( x^{k+1} − (1/ρ)(Qx^{k+1} + q) ) = x^k − x^{k+1}, combining (2.16) with the nonexpansiveness of P_C(·) [45, Corollary 2.4, p. 10] yields
‖x^{k+1} − P_C( x^{k+1} − (1/ρ)(Qx^{k+1} + q) )‖ ≤ ‖x^{k+1} − x^k‖ < ε.
Hence (2.14) is valid and, in addition, we have ‖x^{k+1} − P_C( x^{k+1} − (1/ρ)(Qx^{k+1} + q) )‖ ≤ ‖x^{k+1} − x^k‖. From this and (2.15) it follows that
d(x^{k+1}, C*) ≤ ℓ ‖x^{k+1} − x^k‖ for all k ≥ k₀. (2.17)
Since C* is closed and nonempty, for each k ∈ {0, 1, 2, ...} we can find y^k ∈ C* such that d(x^k, C*) = ‖x^k − y^k‖. Then (2.17) implies that
‖x^{k+1} − y^{k+1}‖ ≤ ℓ ‖x^{k+1} − x^k‖ for all k ≥ k₀. (2.18)
So, as a consequence of (2.13),
lim_{k→∞} ‖y^{k+1} − x^{k+1}‖ = 0. (2.19)
Since ‖y^{k+1} − y^k‖ ≤ ‖y^{k+1} − x^{k+1}‖ + ‖x^{k+1} − x^k‖ + ‖x^k − y^k‖, it follows that lim_{k→∞} ‖y^{k+1} − y^k‖ = 0. Considering the connected components C₁, C₂, ..., C_r of C* and applying Lemma 2.2 together with the last limit, we find an index i₀ and an integer k₁ ≥ k₀ such that y^k ∈ C_{i₀} for all k ≥ k₁. Hence, by assertion (c) of Lemma 2.2, we have
f(y^k) = c for all k ≥ k₁, (2.21)
where c ∈ R is a constant.
Since (2.1) has a solution, by Theorem 2.1 we can find a real value f* such that lim_{k→∞} f(x^k) = f*.
By the classical Mean Value Theorem and the formula ∇f(x) = Qx + q, for every k there is z^k ∈ (x^k, y^k) := { (1 − t)x^k + t y^k | 0 < t < 1 } such that f(y^k) − f(x^k) = ⟨Qz^k + q, y^k − x^k⟩.
Since y^k is a KKT point, it holds that 0 ≤ ⟨Qy^k + q, x^k − y^k⟩. Adding this inequality and the preceding equality, we get
f(y^k) − f(x^k) ≤ ⟨Q(z^k − y^k), y^k − x^k⟩ ≤ ‖Q‖ ‖y^k − x^k‖². (2.22)
On one hand, from (2.21) and (2.22) it follows that c = f(y^k) ≤ f(x^k) + ‖Q‖ ‖y^k − x^k‖² for all k ≥ k₁.
As lim_{k→∞} [ f(x^k) + ‖Q‖ ‖y^k − x^k‖² ] = f* due to (2.19), this forces
c ≤ f*. (2.23)
On the other hand, since x^{k+1} = P_C( x^{k+1} − (1/ρ)(Mx^{k+1} + q_k) ) by (2.16), the characterization of the metric projection onto a closed convex set [45, Theorem 2.3, p. 9] gives us
⟨Mx^{k+1} + q_k, y^{k+1} − x^{k+1}⟩ ≥ 0 for all k ∈ N.
From this and (2.18) we get
⟨My^{k+1} + q_k, x^{k+1} − y^{k+1}⟩ = ⟨Mx^{k+1} + q_k, x^{k+1} − y^{k+1}⟩ + ⟨M(y^{k+1} − x^{k+1}), x^{k+1} − y^{k+1}⟩ ≤ ‖M‖ ‖x^{k+1} − y^{k+1}‖² ≤ ℓ²‖M‖ ‖x^{k+1} − x^k‖²
for all k ≥ k₀. So, setting α := ℓ²‖M‖, we have
⟨My^{k+1} + q_k, x^{k+1} − y^{k+1}⟩ ≤ α‖x^{k+1} − x^k‖². (2.24)
For each k ≥ k₁, since M = Q + ρE and q_k = q − ρx^k, invoking (2.24) and using (2.18) once more, we have
f(x^{k+1}) − c = f(x^{k+1}) − f(y^{k+1})
≤ α‖x^{k+1} − x^k‖² + (1/2)‖Q‖ ‖x^{k+1} − y^{k+1}‖² + ρ‖x^{k+1} − x^k‖ ‖x^{k+1} − y^{k+1}‖ + ρ‖x^{k+1} − y^{k+1}‖²
≤ ( α + (1/2)‖Q‖ℓ² + ρℓ(1 + ℓ) ) ‖x^{k+1} − x^k‖².
Therefore, with β := α + (1/2)‖Q‖ℓ² + ρℓ(1 + ℓ), we get
f(x^{k+1}) ≤ c + β‖x^{k+1} − x^k‖². (2.25)
Letting k → ∞, from (2.25) we can deduce that f* = lim_{k→∞} f(x^{k+1}) ≤ c.
Combining the last expression with (2.23) yields f* = c. Therefore, by (2.25) and the first assertion of Theorem 2.1 we obtain
f(x^{k+1}) − f* ≤ β‖x^{k+1} − x^k‖² ≤ (2β/(λ₁(Q₁) + λ₁(Q₂))) ( f(x^k) − f(x^{k+1}) ),
where Q₁ = Q + ρE and Q₂ = ρE. Putting γ := λ₁(Q₁) + λ₁(Q₂), from the condition ρ > −λ₁(Q) we get γ = (λ₁(Q) + ρ) + ρ > 0. Therefore,
f(x^{k+1}) − f* ≤ (2β/γ)[ (f(x^k) − f*) − (f(x^{k+1}) − f*) ],
whence f(x^{k+1}) − f* ≤ μ²(f(x^k) − f*) with μ := (2β/(2β + γ))^{1/2} ∈ (0, 1). Hence there is a constant r₀ ≥ 0 such that f(x^k) − f* ≤ r₀μ^{2k} for all k > k₁. Consequently,
f(x^k) − f(x^{k+1}) = (f(x^k) − f*) − (f(x^{k+1}) − f*) ≤ r₀μ^{2k+2} + r₀μ^{2k} = r₁μ^{2k} for all k > k₁,
where r₁ := r₀(μ² + 1). Using the first assertion of Theorem 2.1 once more, we see that
‖x^{k+1} − x^k‖² ≤ (2/γ)( f(x^k) − f(x^{k+1}) ) ≤ (2r₁/γ) μ^{2k} for all k > k₁.
It follows that ‖x^{k+1} − x^k‖ ≤ rμ^k for all k > k₁, where r := (2r₁/γ)^{1/2} and μ ∈ (0, 1). Let ε > 0 be given arbitrarily. For each positive integer p, we have
‖x^{k+p} − x^k‖ ≤ ‖x^{k+p} − x^{k+p−1}‖ + ... + ‖x^{k+1} − x^k‖ ≤ r( μ^{k+p−1} + ... + μ^k ) ≤ (r/(1 − μ)) μ^k.
Since (r/(1 − μ)) μ^k < ε for all k sufficiently large, {x^k} is a Cauchy sequence; hence it converges to a point x̄ ∈ C. By the third assertion of Theorem 2.1, x̄ ∈ C*. Moreover, passing the inequality ‖x^{k+p} − x^k‖ ≤ (r/(1 − μ)) μ^k to the limit as p → ∞, we get
‖x^k − x̄‖ ≤ (r/(1 − μ)) μ^k for all k large enough. So, ‖x^k − x̄‖^{1/k} ≤ (r/(1 − μ))^{1/k} μ for all k large enough. Therefore,
lim sup_{k→∞} ‖x^k − x̄‖^{1/k} ≤ μ < 1.
This proves that {x^k} converges R-linearly to a KKT point of (2.1). □
Remark 2.2 According to Theorem 2.2, one can find a constant C ∈ (0, 1) such that lim sup_{k→∞} ‖x^k − x̄‖^{1/k} < C < 1. So, if the computation is terminated at step k, provided that k is sufficiently large, then one has ‖x^k − x̄‖^{1/k} < C, that is, ‖x^k − x̄‖ < C^k. Therefore, the computation error between the obtained approximate solution x^k and the exact limit point x̄ of the sequence {x^k} is smaller than C^k. Since C ∈ (0, 1), the error bound C^k tends to 0 as k → ∞.
Asymptotical Stability of the Algorithm
We will prove that DCA sequences generated by Algorithm B converge to a locally unique solution of (2.1) if the initial points are taken from a suitably-chosen neighborhood of it.
In the context of discrete dynamical systems, we can define a stability concept relevant to iteration algorithms that generate a unique point \( x_{k+1} \) based on the previously defined iteration point \( x_k \), where \( k \) belongs to the set of non-negative integers According to Leong and Goh, the notion of asymptotic stability for a KKT point can be articulated as follows.
Definition 2.2 The KKT point ¯x of (2.1) is:
(i) stable with respect to the iteration algorithm if, for any ε > 0, there exists δ > 0 such that whenever x⁰ ∈ B(x̄, δ), the DCA sequence {x^k} generated by the iteration algorithm and the initial point x⁰ satisfies x^k ∈ B(x̄, ε) for all k ≥ 0;
(ii) attractive if there exists δ > 0 such that whenever x 0 ∈ B(¯x, δ), the DCA sequence generated by the iteration algorithm and the initial point x 0 has the property lim k→∞x k = ¯x;
(iii) asymptotically stable w.r.t the iteration algorithm if it is stable and attractive w.r.t to that algorithm.
In an optimization problem defined as min{g(x) | x ∈ Ω}, where g : R n → R and Ω ⊂ R n represent a real function and an arbitrary subset respectively, a point ¯x ∈ Ω is considered a locally unique solution if there exists a radius ε > 0 such that g(x) is greater than g(¯x) for all x in the intersection of Ω and the ball B(¯x, ε), excluding the point ¯x itself The following two lemmas illustrate some established concepts related to this definition.
Lemma 2.3 (See, e.g., [50, Theorem 3.8]) If x¯ ∈ C is a locally unique solu- tion of (2.1), then there exist à > 0 and η > 0 such that f(x)−f(¯x) ≥ηkx−xk¯ 2 for every x ∈ C ∩B(¯x, à) (2.26)
Lemma 2.4 (See, e.g., [17, Proof of Lemma 4] and [57, Lemma 1]) If the KKT point set C ∗ contains a segment [u, x], then the restriction of f on that segment is a constant function.
The main result of this section can be formulated as follows.
Theorem 2.3 Consider Algorithm B and require additionally that ρ > kQk. Suppose x¯ is a locally unique solution of problem (2.1) In that case, for any γ > 0 there exists δ > 0 such that if x 0 ∈ C ∩B(¯x, δ) and if {x k } is the DCA sequence generated by Algorithm B and the initial point x 0 , then
In other words, x¯ is asymptotically stable w.r.t Algorithm B.
In this proof, we assume that ρ > kQk and that ¯x is a locally unique solution of equation (2.1) According to Lemma 2.3, we can choose constants à > 0 and η > 0 to satisfy condition (2.26) For any γ > 0, we can adjust γ to ensure it falls within the range (0, à) and is less than à(1−ρ −1 kQk) Given that f(x)−f(¯x) > 0 for all x in the intersection of C and the ball B(¯x, γ) excluding ¯x, as stated in (2.26), the continuity of the function f guarantees the existence of a δ within the interval (0, à).
The stability of DCA sequences generated by Algorithm B is confirmed for a chosen number δ > 0 By fixing any x₀ within the intersection of set C and the ball B(¯x, δ), and noting that δ < γ, we establish that for k = 0, xₖ is also within the intersection of C and B(¯x, γ) To continue this proof by induction, we assume that this inclusion holds for some k ≥ 0 Given that ¯x is a locally unique solution of the problem described in equation (2.1), it qualifies as a KKT point for that problem.
Indeed, by the characterization of the metric projection [45, Theorem 2.3, p 9], (2.29) is valid if and only if ¯ x− 1 ρ(Q¯x+q) −x¯
The latter is equivalent to (2.28) Using (2.12), (2.29), and the nonexpan- siveness of the metric projection [45, Corollary 2.4, p 10], we have kx k+1 −xk¯ = P C x k+1 − 1 ρ(M x k+1 +q k ) −P C x¯− 1 ρ(Q¯x+q)
We establish that the distance between consecutive iterations satisfies the inequality \( kx_{k+1} - x_k \leq \overline{(1 - \frac{1}{\rho_k Q_k})^{-1}} kx_k - x_k \leq \overline{(1 - \frac{1}{\rho_k Q_k})^{-1}} \gamma < \alpha \), with the strict inequality resulting from the condition \( \gamma < \alpha(1 - \rho_k^{-1} Q_k) \) Consequently, we find that \( x_{k+1} \) lies within the intersection of the set \( C \) and the ball \( B(\overline{x}, \alpha) \) By applying the previously mentioned inequality \( f(x_k) \geq f(x_{k+1}) \) for all \( k \geq 0 \), we derive that \( kx_{k+1} - x_k \overline{2} \leq \frac{1}{\eta}(f(x_{k+1}) - f(\overline{x})) \).
≤ 1 η f(x 0 )−f(¯x) Hence, kx k+1 −xk ≤¯ 1 η 1/2 f(x 0 )−f(¯x) 1/2 Since x 0 ∈ C ∩B(¯x, δ), combining this with (2.27) we obtain kx k+1 −xk¯ < γ which means that x k+1 ∈ C ∩B(¯x, γ) Thus, we have proved that x k ∈ C ∩B(¯x, γ) for every k ≥0.
To demonstrate the attractiveness of DCA sequences generated by Algorithm B, we establish that for any γ > 0, there exists a corresponding δ = δ(γ) > 0 If the initial point x₀ is within the set C and the ball B(¯x, δ), the property in (a) holds for the DCA sequence {xₖ} Assuming γ ∈ (0, ∞) and δ ∈ (0, γ), we can further assert that if x₀ is within C and B(¯x, δ), then property (b) is also valid If this were not true, we would find sequences γₗ → 0⁺ and δₗ → 0⁺ such that the stability assertion holds for the pairs (δₗ, γₗ) Additionally, for each j, there exists an x₀,j in C ∩ B(¯x, δₗ) where the generated DCA sequence {xₖ,j} does not converge to ¯x By selecting a subsequence of {xₖ,j} that converges to a point xₑ,j in C ∩ B(¯x, γₗ), where xₑ,j ≠ ¯x, we apply Theorem 2.1 to confirm xₑ,j ∈ C* for all j Notably, as j approaches infinity, xₑ,j converges to ¯x.
For each j, one can find an integer k(j) ≥ 1 such that γ j+k(j) < kxe j − xk.¯ Then, by (2.30) one has kxe j+k(j) −xk¯ < kxe j −xk.¯
Let z₁ = xe₁ and define zₚ₊₁ as exₚ₊k(p) for p = 1, 2, This establishes that the sequence {zₚ} is a subsequence of {xeⱼ}, with zₚ ≠ zₚ' for p ≠ p' By selecting an appropriate subsequence, we can ensure that xeⱼ ≠ xeₗ whenever j ≠ ℓ Given that the number of pseudo-faces of C is finite, there exists an index set α ⊂ {1, , m} such that the pseudo-face is defined accordingly.
The set F α is defined as {x ∈ R n | A α x = b α , A α ¯ x > b α ¯ }, which contains an infinite number of elements from the sequence {ex j } We can assume that the entire sequence {xe j } is included in F α According to Lemma 4.1, the intersection of C ∗ and F α forms a convex set Therefore, as stated in Lemma 2.4, the function f restricted to C ∗ ∩ F α is constant Consequently, from equation (2.31), we can conclude that f(xe j ) = f(¯x) for all j However, since xe j is not equal to ¯x for any j, this leads to a contradiction with equation (2.26), thereby proving our claim.
To illustrate asymptotical stability of Algorithmthe B, let us consider the following example.
Example 2.2 (see [50, Example 11.3, p 207]) Consider problem (2.1) with n = 2, m = 2, Q "
! Here, one has the objective function f(x) = 1 2 (x 2 1 −x 2 2 )−x 1 over the set
Since λ 1 = −1 and λ 2 = 1 are eigenvalues of Q, one can choose ρ = 2. Using (2.4), one obtains the KKT point set C ∗ = n(1,0), ( 4 3 , 2 3 ), ( 4 3 ,− 2 3 )o. For this problem, one has S=loc(P)= ( 4 3 , 2 3 ), ( 4 3 ,− 2
3) One selects initial point, say, x 0 = ( 3 2 , 1 2 ), and chooses ¯x = ( 4 3 , 2 3 ) ∈ S Let the tolerance ε > 0 be small enough and put δ = kx 0 −xk¯ √ 2
The initial point \( x_0 \) is within the intersection of set \( C \) and the ball \( B(\bar{x}, \delta) \) The DCA sequence \( \{x_k\} \) is generated by Algorithm B starting from \( x_0 \) For each \( k \) in the range \( \{0, \ldots, 27\} \), the condition \( kx_k - x_k \leq \bar{\epsilon} \) holds, ensuring that \( x_k \) remains within \( C \cap B(\bar{x}, \gamma) \), where \( \gamma = \epsilon \) (refer to Table 2.3 a and Figure 2.1) Similar outcomes are observed when selecting \( x_0 = (3/2, -1/5) \) and \( \bar{x} = (4/3, -2/3) \) (see Table 2.3 b).
Figure 2.1 : The DCA sequence generated by Algorithm B and x 0 = (1.5, 0.5)
Table 2.3: Asymptotical stability of Algorithm B k x k f (x k ) δ(γ) x k f (x k ) δ(γ)
Further Analysis
This section examines the impact of the decomposition parameter ρ on the convergence rates of Algorithms A and B, while also comparing the effectiveness of Algorithm B against Algorithm A Both algorithms were implemented using Visual C++ 2010 and tested on an Intel Core i7 processor with 4GB of RAM The CPLEX 11.2 solver was utilized to address linear and convex quadratic problems.
Recall that, for Algorithm A, the parameter ρ > 0 has to satisfy the in- equality ρ ≥ λ n (Q) For Algorithm B, ρ > 0 must satisfy the strict inequality ρ > −λ 1 (Q).
We applied algorithms A and B to address test problems of the form (2.1) across various dimensions: n = 10, 20, 40, 60, and 80 For each case, the parameters β i were randomly generated within the range of [0,10], and we evaluated two distinct types of constraint sets.
The solution sets of the linear inequality system Ax ≥ b can be represented using a chosen matrix A and vector b For dimensions n ∈ {10,20,40,60,80}, we randomly generate a symmetric matrix Q and a vector q, ensuring all components lie within [0,10] The initial point x₀ is also randomly generated with components in [0,5] We test Algorithm A using the smallest decomposition parameter ρ, set to λₙ(Q) if positive, or 0.1 otherwise Similarly, Algorithm B is tested with ρ = -λ₁(Q) + 0.1 if λ₁(Q) is negative, or 0.1 otherwise The algorithms stop when the convergence criterion kxₖ₊₁ - xₖk ≤ 10⁻⁶ is met, with a maximum of 1000 steps After testing each algorithm with a given ρ, we increase ρ by 1.5 times and rerun the algorithm.
Due to the space limitation, we only present the test results for n 10, 40, 80.
Table 2.4 displays the results of Algorithm A and Algorithm B applied to the same problem with identical initial conditions The second rows of sub-tables a) and b) outline the smallest decomposition parameters for each algorithm, while subsequent rows indicate parameters that are 1.5 times larger than those in the previous row The first column in the sub-tables lists the ordinal numbers of the tests, the second column shows the number of iterations, the third column reports the running times, and the fourth column details the decomposition parameters Notably, only 11 records are presented, as the 12th record indicates that Algorithm A requires over 1000 steps to complete the computation.
The contents of Tables 2.5–2.9 are similar to those of Table 2.4.
With any n belonging to the set {10,20,40,60,80}, a careful analysis of these Tables allows us to observe that:
• For both algorithms, if ρ increases, then the running time, as well as the number of computation steps, increases;
•For any row of the sub-tables a) and b) with the same ordinal number, the number of steps required by Algorithm B is much smaller than that required by Algorithm A.
• For any row of the sub-tables a) and b) with the same ordinal number, the running time of Algorithm B is much smaller than that of Algorithm A.
Thus, in terms of the number of computation steps and the execution time,Algorithm B is much more efficient than Algorithm A when the algorithms are applied to the same problem.
Table 2.4: The test results for n = 10 with the 1st type constraint
No Step Time ρ No Step Time ρ
Table 2.5: The test results for n = 10 with the 2nd type constraint
No Step Time ρ No Step Time ρ
Table 2.6: The test results for n = 40 with the 1st type constraint
No Step Time ρ No Step Time ρ
Table 2.7: The test results for n = 40 with the 2nd type constraint
No Step Time ρ No Step Time ρ
Table 2.8: The test results for n = 80 with the 1st type constraint
No Step Time ρ No Step Time ρ
Table 2.9: The test results for n = 80 with the 2nd type constraint
No Step Time ρ No Step Time ρ
Conclusions
We have established two properties of Algorithm B for the IQP problem:
- Every DCA sequence generated by the Algorithm B must be bounded and, moreover, it converges R-linearly to a KKT point of the problem in question.
Algorithm B demonstrates asymptotic stability when the initial point is sufficiently close to a locally unique solution of the problem, contingent upon the DCA decomposition parameter meeting a specific mild assumption.
We have carried many numerical experiments which demonstrate that:
The decomposition parameter significantly affects the convergence rate of DCA sequences, with higher values leading to increased execution times Consequently, it is advisable to select the smallest feasible decomposition parameter to optimize performance.
- Algorithm B is more efficient than Algorithm A upon randomly generated data sets.
Qualitative Properties of the Minimum Sum-of-Squares
Clustering Problems
Clustering plays a crucial role in data mining, serving as a powerful tool for automated data analysis A cluster consists of a subset of the dataset, where the elements within each cluster share similarities in specific aspects.
Clustering problems can be categorized based on various criteria, including Euclidean distance, L1-distance, and the square of the Euclidean distance Among these, the Minimum Sum-of-Squares Clustering (MSSC) criterion is widely recognized and frequently utilized in clustering applications.
The Minimum Sum of Squared Clusters (MSSC) problem involves partitioning a finite dataset into a specified number of clusters The objective is to minimize the total squared Euclidean distances from each data point to its respective cluster centroid.
The importance of the MSSC problem was noticed by researchers long time ago and they have developed many algorithms to solve it (see, e.g.,
The NP-hard nature of the problem limits existing algorithms to achieving only local solutions To enhance data partitioning and discover better outcomes, various techniques have been proposed Notably, a method introduced in reference [71] focuses on identifying effective starting points using Difference-of-Convex-functions Algorithms (DCA), which have also been utilized in addressing the MSSC problem as detailed in references [7, 52, 60].
This chapter aims to establish fundamental properties of the discussed problem, starting with the clarification of the equivalence between the mixed integer programming formulation and the unconstrained nonsmooth nonconvex optimization formulation, as outlined in [71] Furthermore, we demonstrate that the MSSC problem consistently yields a global solution; under certain mild conditions, the set of global solutions is finite, and each global solution's components can be determined through an explicit formula.
This chapter aims to characterize the local solutions of the Minimum Sum of Squared Clusters (MSSC) problem By utilizing the necessary optimality condition in DC programming and introducing the concept of a nontrivial local solution, we derive necessary conditions for a system of centroids to qualify as such Notably, we demonstrate that these conditions are also sufficient, enhancing our understanding of local solutions Given that existing algorithms for the MSSC problem concentrate on local solutions, our findings could refine these algorithms Additionally, we analyze the performance of the k-means algorithm as a fundamental method for solving the MSSC problem through a constructed example.
This chapter aims to analyze how small changes in the data set affect the optimal value, global solution set, and local solution set of the MSSC problem We will establish three key stability properties: the optimal value function is locally Lipschitz, the global solution map is locally upper Lipschitz, and the local solution map exhibits the Aubin property, given that the original data points are pairwise distinct.
In the n-dimensional Euclidean space R n, let A = {a 1 , , a m } represent a finite set of data points The goal is to partition this set into k disjoint subsets, known as clusters, where k is a positive integer not exceeding m This partitioning aims to optimize a specific clustering criterion.
If one associates to each cluster A j a center (or centroid), denoted by x j ∈ R n , then the following well-known variance or SSQ (Sum-of-Squares) clustering criterion (see, e.g., [15, p 266]) ψ(x, α) := 1 m m
X k j=1 α ij ka i −x j k 2 −→ min, where α ij = 1 if a i ∈ A j and α ij = 0 otherwise, is used Thus, the above par- titioning problem can be formulated as the constrained optimization problem minnψ(x, α) | x ∈ R nk , α = (α ij ) ∈ R m×k , α ij ∈ {0,1}, k
X j=1 α ij = 1, i= 1, , m, j = 1, , ko, (3.1) where the centroid system x = (x 1 , , x k ) and the incident matrix α = (α ij ) are to be found.
Since (3.1) is a difficult mixed integer programming problem, instead of it one usually considers (see, e.g., [71, p 344]) the nextunconstrained nonsmooth nonconvex optimization problem minnf(x) := 1 m m
The minimum sum-of-squares clustering problem (MSSC problem) is represented by models (3.1) and (3.2) It is essential to clarify the equivalence between these minimization problems, as their decision variables reside in different Euclidean spaces For simplicity, we will denote these variables accordingly.
Basic Properties of the MSSC Problem
Given a vector ¯x = (¯x 1 , ,x¯ k ) ∈ R nk , we inductively construct k subsets
A 1 , , A k of A in the following way Put A 0 = ∅ and
In the clustering algorithm, a data point \( a_i \) is assigned to cluster \( A_j \) if the distance \( \| a_i - \bar{x}_j \| \) is the smallest compared to distances \( \| a_i - \bar{x}_q \| \) for all clusters \( q \in J \) This condition ensures that \( a_i \) is exclusively categorized into the cluster with the nearest centroid, and it does not belong to any other cluster \( A_p \).
1 ≤ p ≤ j − 1 We will call this family {A 1 , , A k } the natural clustering associated with x.¯
Definition 3.1 Let ¯x = (¯x 1 , ,x¯ k ) ∈ R nk We say that the component ¯x j of ¯x is attractive with respect to the data set A if the set
A[¯x j ] := na i ∈ A | ka i −x¯ j k= min q∈J ka i −x¯ q ko is nonempty The latter is called the attraction set of ¯x j
Clearly, the cluster A j in (3.3) can be represented as follows:
Proposition 3.1 If (¯x,α)¯ is a solution of (3.1), thenx¯is a solution of (3.2). Conversely, if x¯ is a solution of (3.2), then the natural clustering defined by (3.3) yields an incident matrix α¯ such that (¯x,α)¯ is a solution of (3.1).
Proof First, suppose that (¯x,α) is a solution of the optimization prob-¯ lem (3.1) As ψ(¯x,α)¯ ≤ψ(¯x, α) for every α = (α ij ) ∈ R m×k with α ij ∈ {0,1},
Pk j=1α ij = 1 for all i ∈ I and j ∈ J, one must have k
If ¯x does not satisfy the equation (3.2), it is possible to find a point ˜x = (˜x 1, , ˜x k) in R nk such that f(˜x) is less than f(¯x) The natural clustering related to ˜x is represented by the sets {A 1, , A k} For any indices i and j, we define ˜α ij as 1 if a i belongs to A j, and 0 if it does not This leads to the matrix ˜α = (˜α ij) in R m×k According to the natural clustering definition and the selected ˜α, we find that ψ(˜x,α) equals ˜ f(˜x) Consequently, we observe that ψ(¯x,α) equals ¯ f(¯x) and is greater than f(˜x), which contradicts the assumption that (¯x,α) is a solution to (3.1).
If ¯x is a solution to equation (3.2), then the natural clustering associated with ¯x is represented as {A 1 , , A k } The matrix ¯α is defined such that ¯αij equals 1 if a i belongs to A j and 0 otherwise It follows that ψ(¯x, α) equals ¯f(¯x) If there exists a feasible point (x, α) in (3.1) where ψ(x, α) is less than ψ(¯x, α), we can analyze the natural clustering {A˜ 1 , , A˜ k } linked to x, defining ˜α similarly This leads to the conclusion that f(x) is less than or equal to ψ(x, α), resulting in f(x) ≤ ψ(x, α) < ψ(¯x, α) = ¯f(¯x), which contradicts the assertion that ¯x is globally optimal for (3.2) Thus, the proof is established.
Proposition 3.2 If a 1 , , a m are pairwise distinct points and {A 1 , , A k } is the natural clustering associated with a global solution x¯ of (3.2), then A j is nonempty for every j ∈ J.
If there exists an index \( j_0 \in J \) such that \( A_{j_0} = \emptyset \), the assumption \( k \leq m \) leads to the conclusion that there must be another index \( j_1 \in J \) with \( A_{j_1} \) containing at least two distinct points By selecting an element \( a_{i_1} \in A_{j_1} \) such that \( a_{i_1} \neq \bar{x}_{j_1} \) and defining \( \tilde{x}_j = \bar{x}_j \) for \( j \in J \setminus \{j_0\} \) and \( \tilde{x}_{j_0} = a_{i_1} \), it can be demonstrated that \( f(\tilde{x}) - f(\bar{x}) \leq -\frac{1}{mk} \| a_{i_1} - \bar{x}_{j_1} \|^2 < 0 \).
This is impossible because ¯x is a global solution of (3.2) 2
Remark 3.1 In practical measures, some data points can coincide Natu- rally, if a i 1 = a i 2 , i 1 6= i 2 , then a i 1 and a i 2 must belong to the same cluster. Procedure (3.3) guarantees the fulfillment of this natural requirement By grouping identical data points and choosing from each group a unique rep- resentative, we obtain a new data set having pairwise distinct data points. Thus, there is no loss of generality in assuming that a 1 , , a m are pairwise distinct points.
Theorem 3.1 Both problems (3.1), (3.2) have solutions If a 1 , , a m are pairwise distinct points, then the solution sets are finite Moreover, in that case, if x¯ = (¯x 1 , ,x¯ k ) ∈ R nk is a global solution of (3.2), then the attraction set A[¯x j ] is nonempty for every j ∈ J and one has ¯ x j = 1
X i∈I(j) a i , (3.4) where I(j) := {i ∈ I |a i ∈ A[¯x j ]} with |Ω| denoting the number of elements of Ω.
To demonstrate the existence of a solution for equation (3.2), we rely on the second assertion of Proposition 3.1 The objective function in (3.2) is continuous on R^nk, as it represents the minimum of a finite number of continuous functions In the case where k equals 1, the function f can be expressed in a simplified form as f(x1) = 1/m * m.
X i=1 ka i −x 1 k 2 This smooth, strongly convex function attains its unique global minimum on R n at the point ¯x 1 = a 0 , where a 0 := 1 m
X i∈I a i (3.5) is the barycenter of the data set A (see, e.g., [50, pp 24–25] for more details).
To demonstrate the existence of a solution for equation (3.2) when k ≥ 2, we define ρ as the maximum of the differences between k times the elements of set I and a0, as specified in equation (3.5) We then introduce the closed ball ¯B(a0, 2ρ) in R^n, which is centered at a0 with a radius of 2ρ The next step involves considering the optimization problem, which seeks to minimize the function f(x) where x is represented as a vector (x1, , xk) in R^(nk), and each component xj is constrained within the closed ball B(a0, 2ρ) for all j in the index set Jo.
According to the Weierstrass theorem, the equation (3.6) has a solution ¯x = (¯x 1, , ¯x k) where each ¯x j meets the condition k¯x j − a 0 k ≤ 2ρ for all j in J For any point x = (x 1, , x k) in R nk, it follows that f(¯x) ≤ f(x) if kx j − a 0 k ≤ 2ρ for all j in J If there is at least one index j in J such that kx j − a 0 k > 2ρ, we define J 1 as the set of these indices and create a new vector ˜x = (˜x 1, , ˜x k) where ˜x j = x j for j in J\J 1 and ˜x j = a 0 for j in J 1 For any i in I, it is evident that ka i − ˜x j k = ka i − a 0 k ≤ ρ < ka i − x j k for every j in J 1, and ka i − ˜x j k = ka i − x j k for j in J \J 1 Thus, we have f(˜x) ≤ f(x), and since f(¯x) ≤ f(˜x), it follows that f(¯x) ≤ f(x), confirming that ¯x is a solution to (3.2) Furthermore, assuming that a 1, , a m are distinct points and ¯x = (¯x 1, , ¯x k) is a global solution of (3.2), the natural clustering associated with ¯x, denoted as {A 1, , A k}, ensures that A j is non-empty for all j in J, as stated in Proposition 3.2.
In the context of our analysis, for each j ∈ J, we establish that the set A_j is non-empty, leading to the conclusion that |I(j)| is at least 1 This ensures that the right-hand side of equation (3.4) is well-defined for every j in J By fixing a specific j, we note that for all i not in I(j), the distance ka_i - x̄_jk is greater than the minimum distance for any q in J Consequently, we can find a small ε > 0 such that this inequality holds for any x_j within the ball B(x̄_j, ε) By defining ˜x with components from x̄ and x_j, we can apply the inequality f(x̄) ≤ f(˜x) to conclude that f(x̄) equals a specific value.
The expression \( \sum_{i \in I(j)} \| k_i - x_j \|^2 \) indicates that for every \( x_j \in B(\bar{x}_j, \varepsilon) \), the inequality \( \phi(\bar{x}_j) \leq \phi(x_j) \) holds true This implies that \( \phi \) reaches its local minimum at \( \bar{x}_j \) According to the Fermat Rule, the gradient \( \nabla \phi(\bar{x}_j) = 0 \) results in the summation \( \sum_{i \in I(j)} \).
(a i −x¯ j ) = 0 This equality implies (3.4) Since there are only finitely many nonempty subsets Ω ⊂ I, the set B of vectors bΩ defined by formula b Ω = 1
The finite sum X i∈Ω a i indicates that the barycenter b Ω of the subsystem {a i ∈ A | i ∈ Ω} is relevant to the analysis Each component of the global solution ¯x = (¯x 1 , ,x¯ k ) for equation (3.2) must belong to set B, confirming that the solution set of (3.2) is finite when a 1 , , a m are distinct points Additionally, according to Proposition 3.1, if (¯x,α) solves (3.1), then ¯¯ x also solves (3.2) It is essential that the matrix ¯α = ( ¯α ij ) ∈ R m×k meets the criteria ¯α ij ∈ {0,1}.
X j=1 ¯ α ij = 1 for all i ∈ I, j ∈ J, it follows that the solution set of (3.1) is also finite 2
Proposition 3.3 If x¯= (¯x 1 , ,x¯ k ) ∈ R nk is a global solution of (3.2), then the components of x¯ are pairwise distinct, i.e., x¯ j 1 6= ¯x j 2 whenever j 2 6= j 1
In the proof, we assume there are distinct indexes \( j_1 \) and \( j_2 \) in set \( J \) such that \( \bar{x}_{j_1} = \bar{x}_{j_2} \) Given that \( k \leq m \), it follows that \( k-1 < n \), indicating the existence of an index \( j_0 \) in \( J \) where \( |A[\bar{x}_{j_0}]| \geq 2 \) Consequently, we can identify a data point \( a_{i_0} \) in \( A[\bar{x}_{j_0}] \) that is different from \( \bar{x}_{j_0} \) By defining \( \tilde{x} = (\tilde{x}_1, \ldots, \tilde{x}_k) \) such that \( \tilde{x}_j = \bar{x}_j \) for all \( j \in J \setminus \{j_2\} \) and \( \tilde{x}_{j_2} = a_{i_0} \), we derive that \( f(\tilde{x}) - f(\bar{x}) \leq -\frac{1}{m} \| a_{i_0} - \bar{x}_{j_0} \|^2 < 0 \) This conclusion contradicts the premise that \( \bar{x} \) is a global solution to the problem, thereby proving the initial assumption is false.
Remark 3.2 If the points a 1 , , a m are not pairwise distinct, then the con- clusions of Theorem 3.1 do not hold in general Indeed, let A = {a 1 , a 2 } ⊂ R 2 with a 1 = a 2 For k := 2, let ¯x = (¯x 1 ,x¯ 2 ) with ¯x 1 = a 1 and ¯x 2 ∈ R 2 being chosen arbitrarily Since f(¯x) = 0, we can conclude that ¯x is a global solution of (3.2) So, the problem has an unbounded solution set Similarly, for a data set A = {a 1 , , a 4 } ⊂ R 2 with a 1 = a 2 , a 3 = a 4 , and a 2 6= a 3 For k := 3, let ¯ x = (¯x 1 ,x¯ 2 ,x¯ 3 ) with ¯x 1 = a 1 , ¯x 2 = a 3 , and ¯x 3 ∈ R 2 being chosen arbitrarily.
The equation f(¯x) = 0 indicates that ¯x is a global solution to (3.2), demonstrating that the solution set of (3.2) is unbounded Additionally, if ¯x 3 is not part of {/ x¯ 1 ,x¯ 2 }, then formula (3.4) cannot be applied to ¯x 3, as the index set I(3) = {i ∈ I | a i ∈ A[¯x 3 ]} becomes empty, meaning there are no indices i for which the distance to ¯x 3 is minimized.
Formula (3.4) is effective for computing certain components of any given local solution of (3.2) The precise statement of this result is as follows.
Theorem 3.2 If x¯= (¯x 1 , ,x¯ k ) ∈ R nk is a local solution of (3.2), then (3.4) is valid for all j ∈ J whose index set I(j) is nonempty, i.e., the component ¯ x j of x¯ is attractive w.r.t the data set A.
The k-means Algorithm
Despite its ineffectiveness, the k-means clustering algorithm (see, e.g., [1, pp 89–90], [39], [43, pp 263–266], and [66]) is one of the most popular solution methods for (3.2) The convergence of this algorithms was proven in [86].
To begin the clustering process, select k initial centroids \( x_1, \ldots, x_k \) in \( \mathbb{R}^n \) Next, iteratively construct k subsets \( A_1, \ldots, A_k \) from the data set A, starting with \( A_0 = \emptyset \) and applying the clustering rule This process results in the natural clustering corresponding to the selected centroids After forming the clusters, update each centroid \( x_j \) for non-empty subsets \( A_j \) using the formula \( x_j \leftarrow x_{e_j} := \frac{1}{|A_j|} \sum_{x \in A_j} x \).
The algorithm operates by iteratively updating the centroid system {x₁, , xₖ} until stability is achieved, meaning that for all j ∈ J with Aₗ ≠ ∅, the condition xₑⱼ = xⱼ holds The computation process is defined by the equation X i∈I (A j ) a i, where I(A j ) represents the set of indices i in I for which a i belongs to A j, while ensuring that xⱼ remains unchanged otherwise.
Input: The data set A= {a 1 , , a m } and a constant ε ≥ 0 (tolerance). Output: The set of k centroids {x 1 , , x k }.
Step 1 Select initial centroids x j ∈ R n for all j ∈ J.
Step 2 Compute α i = min{ka i −x j k | j ∈ J} for all i ∈ I.
Step 4 Update the centroids x j satisfying A j 6= ∅ by the rule (3.9), keeping other centroids unchanged.
Step 5 Check the convergence condition: If kxe j −x j k ≤ ε for all j ∈ J with
A j 6= ∅ then stop, else go to Step 2.
The following example is designed to show how the algorithm is performed in practice.
Example 3.1 Choose m = 3, n = 2, and k = 2 Let A = {a 1 , a 2 , a 3 }, where a 1 = (0,0), a 2 = (1,0), a 3 = (0,1) Apply the k-means algorithm to solve the problem (3.2) with the tolerance ε = 0.
(a) With the starting centroids x 1 = a 1 , x 2 = a 2 , one obtains the clusters
A 1 = A[x 1 ] = {a 1 , a 3 } and A 2 = A[x 2 ] = {a 2 } The updated centroids are x 1 = (0, 1 2 ), x 2 = a 2 Then, the new clusters A 1 and A 2 coincide with the old ones Thus, kxe j −x j k = 0 for all j ∈ J with A j 6= ∅ So, the computation terminates For x 1 = (0, 1 2 ), x 2 = a 2 , one has f(x) = 1 6
(b) Starting with the points x 1 = ( 1 4 , 3 4 ) and x 2 = (2,3), one gets the clusters A 1 = A[x 1 ] = {a 1 , a 2 , a 3 } and A 2 = A[x 2 ] = ∅ The algorithm gives the centroid system x 1 = ( 1 3 , 1 3 ), x 2 = (2,3), and f(x) = 1 3
(c) Starting with x 1 = (0,1) and x 2 = (0,0), by the algorithm we are led to A 1 = A[x 1 ] = {a 3 }, A 2 = A[x 2 ] = {a 1 , a 2 }, x 1 = (0,1), and x 2 = ( 1 2 ,0). The corresponding value of objective function is f(x) = 1 6
(d) Starting with x 1 = (0,0) and x 2 = ( 1 2 , 1 2 ), by the algorithm one gets the results A 1 = A[x 1 ] = {a 1 }, A 2 = A[x 2 ] = {a 2 , a 3 }, x 1 = (0,0), and x 2 = ( 1 2 , 1 2 ). The corresponding value of objective function is f(x) = 4 9
3 ,0) as the initial centroids, one obtains the results A 1 = A[x 1 ] = {a 1 , a 2 , a 3 }, A 2 = A[x 2 ] = ∅, x 1 = ( 1 3 , 1 3 ), x 2 = (1 +
The relationship between the MSSC problem and the k-means clustering algorithm raises uncertainty regarding the optimality of the five centroid systems derived from items (a)–(e) in Example 3.1 While it is confirmed that the centroid systems in (a) and (c) represent global optimal solutions, the status of the centroid systems in (b), (d), and (e) as local optimal solutions remains indeterminate.
The theoretical results in Section 3.2 and the two forthcoming ones allow us to clarify the following issues related to the MSSC problem in Example3.1:
- The structure of the global solution set (see Example 3.2 below);
- The structure of the local solution set (see Example 3.3);
- The performance of the k-means algorithm (see Example 3.4).
The analysis reveals that the centroid systems in (a) and (c) represent global optimal solutions, whereas those in (b) and (d) are classified as local, non-global optimal solutions Additionally, the centroid system in (e) does not qualify as a local solution, even though the k-means algorithm converges to it and its objective function value matches that of the centroid system in (d).
Characterizations of the Local Solutions
To analyze the local solution set of the problem outlined in (3.2), we will adopt the approach of Ordin and Bagirov, focusing on a recognized optimality condition in DC programming For any vector x = (x1, , xk) within the space Rnk, the function is defined as f(x) = 1/m.
Hence, the objective function f of (3.2) can be expressed [71, p 345] as the difference of two convex functions f(x) =f 1 (x)−f 2 (x), (3.12) where f 1 (x) := 1 m
The function f1 is identified as a convex linear-quadratic function, which is also differentiable In contrast, f2 is a nonsmooth convex function formed by the sum of several nonsmooth convex functions, applicable across the entire space R^nk The subdifferentials for both f1(x) and f2(x) can be calculated accordingly.
= {2(x 1 −a 0 , , x k −a 0 )} where, as before, a 0 = b A is the barycenter of the system {a 1 , , a m } Set ϕ i (x) = max j∈J h i,j (x) (3.15) with h i,j (x) := X q∈J\{j} ka i −x q k 2 and
J i (x) = j ∈ J | a i ∈ A[x j ] (3.17) Proof From the formula of h i,j (x) it follows that h i,j (x) = X q∈J ka i −x q k 2 − ka i −x j k 2
Therefore, by (3.15) we have ϕ i (x) = max j∈J h X q∈J ka i −x q k 2 − ka i −x j k 2 i
Thus, the maximum in (3.15) is attained when the minimum min j∈J ka i −x j k 2 is achieved So, by (3.16),
J i (x) =nj ∈ J | ka i −x j k= min q∈J ka i −x q ko.
Utilizing the subdifferential formula for the maximum function, as outlined in Proposition 2.3.12, we note that the Clarke generalized gradient aligns with the subdifferential of convex analysis when applied to convex functions.
∂ϕ i (x) = co{∇h i,j (x)|j ∈ J i (x)} = co 2 xe j −ea i,j | j ∈ J i (x) , (3.18) where xe j = x 1 , , x j−1 ,0 R n , x j+1 , , x k (3.19) and ea i,j = a i , , a i , 0 R n j−th position|{z}
By the Moreau-Rockafellar theorem [84, Theorem 23.8], one has
Let x = (x1, , xk) ∈ Rnk represent a local solution of equation (3.2) According to the necessary optimality condition in DC programming, which is derived from the optimality criteria established by Dem’yanov et al in quasidifferential calculus, we can conclude certain properties about this solution.
Since ∂f 1 (x) is a singleton, ∂f 2 (x) must be a singleton too This happens if and only if ∂ϕi(x) is a singleton for every i ∈ I By (4.50), if |J i (x)| = 1, then
In the scenario where |J i (x)| exceeds 1, we can choose two elements, j1 and j2, from J i (x) with the condition that j1 is less than j2 Given that ∂ϕ i (x) is a singleton, it follows from equation (4.50) that xe j1 − ea i,j1 equals xe j2 − ea i,j2 By applying equations (3.19) and (3.20), we can conclude that this equality holds true if and only if x j1 equals x j2, which is equal to a i To advance our analysis, we must establish a specific condition on the local solution x.
(C1) The components of x are pairwise distinct, i.e., x j 1 6= x j 2 whenever j 2 6= j 1
Definition 3.2 A local solution x = (x 1 , , x k ) of (3.2) that satisfies (C1) is called a nontrivial local solution.
Remark 3.5 Proposition 3.3 shows that every global solution of (3.2) is a nontrivial local solution.
The essential principles outlined in [71, pp 346] are further refined here According to (3.17), the initial statement of the subsequent theorem indicates that if x represents a nontrivial local solution, then for every data point a_i within set A, there exists a unique component x_j of x such that a_i is included in A[x_j].
Theorem 3.3 (Necessary conditions for nontrivial local optimality) Suppose that x = (x 1 , , x k ) is a nontrivial local solution of (3.2) Then, for any i ∈ I, |J i (x)| = 1 Moreover, for every j ∈ J such that the attraction set A[x j ] of x j is nonempty, one has x j = 1
X i∈I(j) a i , (3.23) where I(j) ={i ∈ I | a i ∈ A[x j ]} For any j ∈ J with A[x j ] = ∅, one has x j ∈ A[x],/ (3.24) where A[x] is the union of the balls B(a¯ p ,ka p − x q k) with p ∈ I, q ∈ J satisfying p∈ I(q).
In the proof, we consider a nontrivial local solution \( x = (x_1, \ldots, x_k) \) of equation (3.2) For any index \( i \) in set \( I \), it is established that the Jacobian \( |J_i(x)| \) must equal 1 If \( |J_i(x)| \) were greater than 1, it would imply the existence of indices \( j_1 \) and \( j_2 \) within \( J_i(x) \) such that \( x_{j_1} = x_{j_2} = a_i \), which contradicts the assumption of the nontriviality of the local solution \( x \) Consequently, we define \( J_i(x) = \{j(i)\} \) for \( i \in I \), indicating that \( j(i) \) is the unique element of \( J_i(x) \).
For each i ∈ I, observe by (3.15) that hi,j(x) < hi,j(i)(x) = ϕi(x) ∀j ∈ J \ {j(i)}.
Hence, by the continuity of the functions h i,j (x), there exists an open neigh- borhood U i of x such that h i,j (y) < h i,j(i) (y) ∀j ∈ J \ {j(i)}, ∀y ∈ U i
So, ϕ i (.) is continuously differentiable on U i Put U = \ i∈I
U i From (3.14) and (3.25) one can deduce that f 2 (y) = 1 m
Therefore, f 2 (y) is continuously differentiable function on U Moreover, the formulas (4.50)–(3.20) yield
(ye j(i) −ea i,j(i) ) ∀y ∈ U, (3.26) where ye j(i) = y 1 , , y j(i)−1 ,0 R n , y j(i)+1 , , y k and ea i,j(i) = a i , , a i , 0 R n
Substitutingy = xinto (3.26) and combining the result with (3.22), we obtain
Now, fix an index j ∈ J with A[x j ] 6= ∅ and transform the left-hand side of (3.27) as follows:
If j(i) equals j, the j-th component of the vector xe j(i) − ea i,j(i) in R nk is zero Conversely, if j(i) does not equal j, the j-th component of the vector becomes x j − a i This leads us to the conclusion presented in equation (3.27).
Since ma 0 = a 1 +ã ã ã+a m , this yields X i∈I(j) a i = |I(j)|x j Thus, formula (3.23) is valid for any j ∈ J satisfying A[x j ] 6= ∅.
For any index \( j \) in set \( J \) where \( A[x_j] \) is empty, we can derive a contradiction Assume there exists an index \( j_0 \) in \( J \) such that \( A[x_{j_0}] \) is empty, and for some indices \( p \) in \( I \) and \( q \) in \( J \), \( p \) belongs to \( I(q) \) and \( x_{j_0} \) is within the ball defined by \( a_p \) and the distance to \( x_q \) If the distance from \( a_p \) to \( x_{j_0} \) equals that to \( x_q \), then the set \( J_p(x) \) would include both \( q \) and \( j_0 \), which contradicts the theorem's first claim Conversely, if the distance from \( a_p \) to \( x_{j_0} \) is less than that to \( x_q \), then \( p \) cannot belong to \( I(q) \), leading to another contradiction.
The optimality condition outlined in the theorem is sufficient, and when combined with Theorem 3.3, it offers a comprehensive characterization of the nontrivial local solutions for equation (3.2).
Theorem 3.4 (Sufficient conditions for nontrivial local optimality) Suppose that a vector x = (x 1 , , x k ) ∈ R nk satisfies condition (C1) and |J i (x)| = 1 for every i ∈ I If (3.23) is valid for any j ∈ J with A[x j ] 6= ∅ and (3.24) is fulfilled for any j ∈ J with A[x j ] = ∅, then x is a nontrivial local solution of (3.2).
Proof Let x = (x 1 , , x k ) ∈ R nk be such that (C1) holds, J i (x) = {j(i)} for every i ∈ I, (3.23) is valid for any j ∈ J with A[x j ] 6= ∅, and (3.24) is satisfied for any j ∈ J with A[x j ] = ∅ Then, for all i ∈ I and j 0 ∈ J \ {j(i)}, one has ka i −x j(i) k < ka i −x j 0 k.
There exists a positive ε and a specific index q within set J such that the distance between the vectors a_i and x_{e_j(i)} is less than the distance between a_i and x_{e_{j'}} for all i in I and j' in J excluding j(i) This condition holds true whenever the vector x_e = (x_{e1}, , x_{ek}) in R^{nk} satisfies the criterion that the distance between x_e_q and x_q is less than ε for all q in J Additionally, based on the compactness of the set A[x], by reducing ε if necessary, we can ensure that x_{e_j} belongs to the set A[/ x]e whenever the vector x_e meets the aforementioned distance condition Here, A[x] is defined as the union of the balls ¯e B(a_p, ||a_p - x_e_q||) for p in I and q in J, where p is part of the index set I(q) that includes all i in I such that a_i belongs to A[x_q].
Fix an arbitrary vector xe = (xe 1 , ,xe k ) ∈ R nk with the property that kxe q − x q k< ε for all q ∈ J Then, by (3.28) and (3.29), J i (x) =e {j(i)} So, minj∈J ka i −xe j k 2 = ka i −xe j(i) k 2 Therefore, one has f(x) =e 1 m
X i∈I (j) ka i −x j k 2 = f(x), where the inequality is valid because (3.23) obviously yields
X i∈I(j) ka i −x j k 2 ≤ X i∈I(j) ka i −xe j k 2 for every j ∈ J such that the attraction set A[x j ] of x j is nonempty (Note that x j is the barycenter of A[x j ].)
The local optimality of x = (x 1 , , x k ) has been proved Hence, x is a nontrivial local solution of (3.2) 2
Example 3.2 (A local solution need not be a global solution) Consider the clustering problem described in Example 3.1 Here, we have I = {1,2,3}and
According to Theorem 3.1, the problem (3.2) possesses a global solution for the set J = {1,2} If x = (x₁, x₂) ∈ R²×² is identified as a global solution, then the attraction set A[x₁] is nonempty for every j ∈ J Remark 3.5 confirms that x represents a nontrivial local solution, and by applying Theorem 3.3, we establish that the attraction sets A[x₁] and A[x₂] are disjoint The barycenters of these sets can be calculated using formula (3.23) It follows that A = A[x₁] ∪ A[x₂], and since A[x₁] and A[x₂] are subsets of A = {a₁, a₂, a₃}, we can conclude that the global solution set of our problem is included in the set defined as nx¯ := (1.
In our analysis, we find that f(¯x) equals 1/3, while f(ˆx) and f(x) both equal e^(1/6), indicating that ˆx and xe are global solutions to the problem According to Theorem 3.4, we can conclude that ¯x is a local solution However, it is important to note that ¯x does not belong to the set of global solutions, classifying it as a local-nonglobal solution of the problem.
Stability Properties
This section focuses on demonstrating the local Lipschitz property of the optimal value function, the local upper Lipschitz property of the global solution map, and the Lipschitz-like characteristics of the local solution map associated with equation (3.2).
In the context of problem (3.2), let the data set A = {a₁, , aₘ} be variable, represented as a = (a₁, , aₘ) in Rⁿᵐ The optimal value of the problem, denoted as v(a), can be expressed as v(a) = min{f(x) | x = (x₁, , xₖ) ∈ Rⁿᵏ} Additionally, the global solution set for this problem is represented by F(a).
Let us abbreviate the local solution set of (3.2) to F1(a) Note that the inclusion F(a) ⊂ F 1 (a) is valid, and it may be strict.
Definition 3.3 A family {I(j) | j ∈ J} of pairwise distinct, nonempty sub- sets of I is said to be a partition of I if [ j∈J
From now on, let ¯a = (¯a 1 , ,¯a m ) ∈ R nm be a fixed vector with the property that a¯ 1 , ,a¯ m are pairwise distinct.
Theorem 3.5 (Local Lipschitz property of the optimal value function) The optimal value function v : R nm → R is locally Lipschitz at ¯a, i.e., there exist
|v(a)−v(a 0 )| ≤ L 0 ka−a 0 k for all a and a 0 satisfying ka−ak¯ < δ 0 and ka 0 −¯ak< δ 0
Proof Denote by Ω the set of all the partitions of I Every element ω of Ω is a family {I ω (j) | j ∈ J} of pairwise distinct, nonempty subsets of I with [ j∈J
I ω (j) = I We associate to each pair (ω, a), where a = (a 1 , , a m ) ∈ R nm and ω ∈ Ω, a vector x ω (a) = (x 1 ω (a), , x k ω (a)) ∈ R nk with x j ω (a) = 1
According to Theorem 3.1, the problem defined in equation (3.2) has a finite number of global solutions, indicating that F(¯a) is both nonempty and finite For each solution ¯x = (¯x 1 , ,x¯ k ) within F(¯a), there exists an element ω in Ω such that ¯x j equals x j ω (¯a) for all j in J The set Ω 1 = {ω 1 , , ω r } consists of elements from Ω that correspond to these global solutions Furthermore, it is established that for any ω in Ω not included in Ω 1, the inequality f(x ω 1 (¯a), a)¯ is less than f(x ω (¯a), ¯a) holds true, where f(x, a) is defined as 1/m.
For each pair (i, j) in the set I×J, the rule (x, a) 7→ ka i − x j k 2 defines a polynomial function on R nk × R nm This function is locally Lipschitz within its domain, which allows us to conclude, based on [20, Prop 2.3.6 and 2.3.12], that the function f(x, a) described in (3.34) is also locally Lipschitz on R nk × R nm.
Now, observe that for any ω ∈ Ω and j ∈ J, the vector function x j ω (.) in (3.32), which maps R nm to R n , is continuously differentiable In particular, it is locally Lipschitz on R nm
For every ω in the set Ω, we can conclude that the function g ω (a) := f(x ω (a), a) is locally Lipschitz on R nm By rewriting the inequality g ω 1 (¯a) < g ω (¯a) for all ω in Ω\Ω 1 and utilizing the continuity of the functions g ω (.), we can establish the existence of a positive number δ 0 such that g ω 1 (a) < g ω (a) for all ω in Ω\Ω 1, provided that the distance ka−ak¯ is less than δ 0 Given that the points ¯a 1, , ¯a m are distinct, we can assume, without loss of generality, that the points a 1, , a m are also distinct for any a = (a 1, , a m) where ka−ak¯ is less than δ 0.
Consider a vector \( a = (a_1, , a_m) \) that meets the condition \( ka - \bar{a} < \delta_0 \) According to equation (3.35), it follows that \( f(x_{\omega_1}(a), a) < f(x_{\omega}(a), a) \) for all \( \omega \in \Omega \setminus \Omega_1 \) Since \( f(., a) \) represents the objective function of equation (3.2), this indicates that the set \( \{x_{\omega}(a) | \omega \in \Omega \setminus \Omega_1\} \) does not include any global solutions to the problem By applying Theorem 3.1, we can conclude that the global solution set \( F(a) \) of equation (3.2) is contained within a specific set.
The set F(a) is defined as a subset of {x ω (a) | ω ∈ Ω 1 }, which includes elements {x ω 1 (a), , x ω r (a)} Given that F(a) is not empty, we can express v(a) as the minimum of f(x, a) over all x in F(a), leading to v(a) = min{f(x ω ` (a), a) | ` = 1, , r} Consequently, we establish that v(a) equals the minimum of g ω ` (a) for all ` in the specified range, under the condition that ka−¯ak is less than δ 0 It is important to note that the functions g ω, where ω belongs to Ω, are locally Lipschitz on R nm Therefore, by applying relevant propositions from the literature to the minimum function, we conclude that v is locally Lipschitz at the point ¯ a.
Theorem 3.6 (Local upper Lipschitz property of the global solution map)The global solution map F : R nm ⇒ R nk is locally upper Lipschitz at ¯a, i.e., there exist L >0 and δ > 0 such that
F(a) ⊂ F(¯a) + Lka−¯akB¯ R nk (3.38) for all a satisfying ka−ak¯ < δ Here
B¯ R nk := nx = (x 1 , , x k ) ∈ R nk | X j∈J kx j k ≤ 1o denotes the closed unit ball of the product space R nk , which is equipped with the sum norm kxk = X j∈J kx j k.
In the proof, let Ω and Ω 1 be defined as {ω 1, , ω r}, with the vector function x ω (a) represented as (x 1 ω (a), , x k ω (a)) in R nk The construction of δ 0 follows the methodology outlined in the previous theorem For each ω in Ω, the function x ω (.) is continuously differentiable, allowing us to establish the existence of constants L ω > 0 and δ ω > 0 This leads to the inequality kx ω (a)−xω(ea)k ≤ Lωka−eak when the conditions ka−ak¯ < δ ω and kea−¯ak < δ ω are met.
Then, for every a satisfying ka−¯ak < δ, by (3.36) and (3.39) one has
= F(¯a) +Lka−ak¯ B¯ R nk Hence, inclusion (3.38) is valid for every a satisfying ka−ak¯ < δ 2
Theorem 3.7 (Aubin property of the local solution map) Let x¯ = (¯x 1 , ,x¯ k ) be an element of F1(¯a) satisfying condition (C1), that is, x¯ j 1 6= ¯x j 2 whenever j2 6= j1 Then, the local solution map F1 : R nm ⇒ R nk has the Aubin property at (¯a,x), i.e., there exist¯ L 1 > 0, ε > 0, and δ 1 > 0 such that
F 1 (a)∩B(¯x, ε) ⊂ F 1 (ea) +L 1 ka−eakB¯ R nk (3.40) for all a and ea satisfying ka−¯ak < δ 1 and kea−¯ak< δ 1
Proof Suppose that ¯x = (¯x 1 , ,x¯ k ) ∈ F 1 (¯a) and ¯x j 1 6= ¯x j 2 for all j 1 , j 2 ∈ J with j2 6= j1 Denote by J1 the set of the indexes j ∈ J such that ¯x j is attractive w.r.t the data set {¯a 1 , ,¯a m } Put J2 = J\J1 For every j ∈ J1, by Theorem 3.3 one has k¯a i −x¯ j k< k¯a i −x¯ q k (∀i ∈ I(j), ∀q ∈ J \ {j}) (3.41)
In addition, the following holds: ¯ x j = 1
X i∈I(j) ¯ a i , (3.42) where I(j) = {i ∈ I | ¯a i ∈ A[¯x j ]} For every j ∈ J 2 , by Theorem 3.3 one has k¯x q −a¯ p k < k¯x j −a¯ p k (∀q ∈ J 1 , ∀p ∈ I(q)) (3.43) Let ε0 > 0 be such that k¯x j 1 −x¯ j 2 k > ε0 for all j1, j2 ∈ J with j2 6= j1.
By (3.41) and (3.43), there exist δ 0 > 0 and ε ∈ 0,ε 0
In the given context, it is established that for all \( a \) in \( R^{nm} \) and \( x \) in \( R^{nk} \), certain inequalities must hold: \( \|ka_i - x_j\| < \|ka_i - x_q\| \) for all \( j \in J_1 \), \( i \in I(j) \), and \( q \in J \setminus \{j\} \), as well as \( \|x_q - a_p\| < \|x_j - a_p\| \) for all \( j \in J_2 \), \( q \in J_1 \), and \( p \in I(q) \) Additionally, the conditions \( \|a - \bar{a}\| < \delta_0 \) and \( \|x - \bar{x}\| < 2k\epsilon \) are essential Given that \( \bar{x}_{j_1} \neq \bar{x}_{j_2} \) for all distinct \( j_1, j_2 \in J \), by potentially reducing \( \epsilon > 0 \), it follows that any \( x \) satisfying \( \|x - \bar{x}\| < 2k\epsilon \) will ensure \( x_{j_1} \neq x_{j_2} \) for all \( j_1, j_2 \in J \) where \( j_2 \neq j_1 \).
For every j ∈ J1 and a = (a 1 , , a m ) ∈ R nm , define x j (a) = 1
Comparing equations (3.46) and (3.42) shows that \( x_j(\bar{a}) = \bar{x}_j \) for every \( j \in J_1 \) Due to the continuity of the vector functions \( x_j(.) \) for \( j \in J_1 \), we can assume that the difference \( \| x_j(e_a) - \bar{x}_j \| < \epsilon \) holds for all \( j \in J_1 \) Here, \( e_a = (e_{a1}, , e_{am}) \in \mathbb{R}^{nm} \) is chosen such that \( \| e_a - a \bar{k} \| < \delta_0 \), where a smaller \( \delta_0 > 0 \) can be selected if needed.
The vector functions \( x_j(.) \) for \( j \in J_1 \) are continuously differentiable, ensuring the existence of a constant \( L_1 > 0 \) This leads to the inequality \( \| x_j(a) - x_j(ea) \| \leq L_1 \| a - ea \| \) for any points \( a \) and \( ea \) where \( \| a - \bar{a} \| < \delta_0 \) and \( \| ea - \bar{a} \| < \delta_0 \) If necessary, a smaller \( \delta_0 > 0 \) can be chosen Additionally, selecting \( \delta_1 \in (0, \delta_0) \) such that \( 2L_1 \delta_1 < \epsilon \) further refines the conditions.
With the chosen constants L 1 > 0, ε > 0, and δ 1 > 0, let us show that the inclusion (3.40) is fulfilled for all a and ea satisfying ka − ¯ak < δ 1 and kea−ak¯ < δ 1
Let \( \eta \) and \( \epsilon \) be defined such that \( k - \bar{a} < \delta_1 \) and \( k_e - \bar{a} < \delta_1 \) Choose an arbitrary element \( x = (x_1, \ldots, x_k) \) from the intersection of the set \( F_1(a) \) and the ball \( B(\bar{x}, \epsilon) \) For all \( j \in J_1 \), define \( x_{e_j} = x_j(e_a) \), where \( x_j(a) \) is specified by equation (3.46) For any \( j \in J_2 \), simply set \( x_{e_j} = x_j \).
Claim 1 The vector xe = (xe 1 , ,xe k ) belongs to F1(ea).
The inequalities ka − ¯ak < δ 1 and kx − xk¯ < ε indicate that properties (3.44) and (3.45) hold true From property (3.44), it can be concluded that for every element j in set J1, the attraction set A[xj] consists of elements {ai | i ∈ I(j)} Given that I(j) is not empty for each j in J1 and x is in F1(a), Theorem 3.3 confirms that xj equals 1.
Comparing (3.48) with (3.46) yields x j = x j (a) for all j ∈ J 1 By (3.45) we see that, for every j ∈ J 2 , the attraction set A[x j ] is empty Moreover, one has x j ∈ A[x]/ (∀j ∈ J 2 ) (3.49) where A[x] is the union of the balls ¯B(a p ,ka p − x q k) with p ∈ I, q ∈ J satisfying p ∈ I(q).
For each j ∈ J 1 , using (3.47) we have kx j (ea)−x¯ j k ≤ kx j (ea)−x j (a)k+ kx j (a)−x¯ j k
Besides, for each j ∈ J 2 , we have kx j (ea)−x¯ j k = kx j −x¯ j k < ε Therefore, kxe−xk¯ = X j∈J 1 kx j (ea)−x¯ j k+X j∈J 2 kx j −x¯ j k< 2kε.
The inequality kea−ak¯ < δ 1 ensures that the properties (3.44) and (3.45) are maintained, where ea and x(ea) represent the variables a and x, respectively This leads to the conclusion that for all j in J 1, i in I(j), and q in J excluding j, the condition kea i −xe j k < kea i −xe q k (3.50) holds Additionally, for all j in J 2, q in J 1, and p in I(q), the inequality kxe q −ea p k < kxe j −ea p k (3.51) is satisfied.
So, similar to the above case of x, for every j ∈ J 1 , the attraction set A[xe j ] is {ea i | i ∈ I(j)} Recall that I(j) 6= ∅ for each j ∈ J 1 and xe j was given by xe j = x j (ea) = 1
Conclusions
The minimum sum-of-squares clustering problem consistently yields a global solution, and under certain mild conditions, this solution set is finite Each global solution's components can be calculated using an explicit formula By introducing the concept of a nontrivial local solution, we have established the necessary and sufficient conditions for a set of centroids to qualify as a nontrivial local solution.
We have demonstrated the local Lipschitz property of the optimal value function and the local upper Lipschitz property of the global solution map in the MSSC problem Additionally, we established a local Lipschitz-like property for the local solution map These comprehensive characterizations of nontrivial local solutions enhance our understanding of the k-means algorithm's performance.
Some Incremental Algorithms for the Clustering Problem
Solution methods for the minimum sum-of-squares clustering (MSSC) prob- lem will be analyzed and developed in this chapter.
This article proposes enhancements to the incremental algorithms developed by Ordin and Bagirov, as well as Bagirov's own work, utilizing the Difference-of-Convex functions Algorithms (DCAs) within DC programming and the qualitative properties of the MSSC problem We present the key features of the new algorithms, focusing on their finite convergence, overall convergence, and rate of convergence Additionally, we provide results from numerical tests conducted on various real-world databases to demonstrate the effectiveness of these algorithms.
The present chapter is written on the basis of paper No 3 and paper No 4 in the List of author’s related papers (see p 112).
There are many algorithms to solve the MSSC problem (see, e.g., [6,7,9,12,
The problem is classified as NP-hard when the number of data features or clusters is included as input, which explains why existing algorithms can typically provide only local solutions.
The k-means clustering algorithm (see Section 3.3 and see also, e.g., [1],
[39], [43], and [66]) is the best known solution method for the MSSC problem.
To improve its effectiveness, the global k-means, modified global k-means, and fast global k-means clustering algorithms have been proposed in [6,12,33,49,
The quality of computational results is heavily influenced by the choice of starting points, making it essential to identify effective ones The Difference-of-Convex-functions Algorithms (DCA), previously utilized for the MSSC problem, can serve this purpose effectively.
Incremental clustering algorithms increase the number of clusters gradually, making them highly efficient for managing large datasets, as supported by various numerical results.
Ordin and Bagirov recently introduced an incremental clustering algorithm that utilizes control parameters to identify effective starting points for the k-means algorithm Additionally, in a previous work, Bagirov proposed another incremental clustering method grounded in DC programming and DCA.
We will propose several improvements of the just mentioned incremental al- gorithms to solve the MSSC problem in (3.2).
Incremental clustering algorithms, as discussed in references [7, 44, 71], begin by calculating the centroid of the entire dataset and then strategically add one new centroid at each step This iterative process continues until the desired number of k centroids is achieved for the specified problem (3.2).
This article focuses on the analysis and enhancement of the incremental heuristic clustering algorithm by Ordin and Bagirov, as well as Bagirov's incremental DC clustering algorithm Through the construction of specific MSSC problems using small datasets, we demonstrate the functionality of these algorithms Notably, we discover that the second algorithm may not terminate due to its precise stopping criterion To address this issue, we propose a modified version of the incremental heuristic clustering algorithm and introduce three modifications for the incremental DC clustering algorithm.
This section is devoted to the incremental heuristic algorithm of Ordin andBagirov [71, pp 349–353] and some properties of the algorithm.
Let ` be an index with 1 ≤ ` ≤ k − 1 and let ¯x = (¯x 1 , ,x¯ ` ) be an approximate solution of (3.2), where k is replaced by ` So, ¯x = (¯x 1 , ,x¯ ` ) solves approximately the problem minnf ` (x) := 1 m m
Applying the natural clustering procedure described in (3.3) to the centroid system x¯ 1 , ,x¯ ` , one divides A into ` clusters with the centers ¯x 1 , ,x¯ ` For every i ∈ I, put d ` (a i ) = min k¯x 1 −a i k 2 , ,k¯x ` −a i k 2 (4.2) The formula g(y) =f`+1(¯x 1 , ,x¯ ` , y) where, in accordance with (4.1), f `+1 (x) = 1 m m
X i=1 min j=1, ,`+1ka i −x j k 2 ∀x = (x 1 , , x ` , x `+1 ) ∈ R n(`+1) , defines our auxiliary cluster function g : R n →R From (4.2) it follows that g(y) = 1 m m
The problem min g(y) | y ∈ R n (4.4) is called the auxiliary clustering problem For each i ∈ I, one has min d ` (a i ), ky−a i k 2 = d ` (a i ) +ky −a i k 2 −max d ` (a i ),ky−a i k 2
So, the objective function of (4.4) can be represented as g(y) =g 1 (y)−g 2 (y), where g 1 (y) = 1 m m
X i=1 ky −a i k 2 (4.5) is a smooth convex function and g 2 (y) = 1 m m
X i=1 max d ` (a i ),ky −a i k 2 (4.6) is a nonsmooth convex function Consider the open set
B a i , d ` (a i ) = y ∈ R n | ∃i ∈ I with ky −a i k 2 < d ` (a i ) , (4.7) which is the finite union of certain open balls with the centers a i (i ∈ I), and put
All points \( \bar{x}_1, \ldots, \bar{x}_\ell \) are contained within \( Y_2 \) Given that \( \ell < k \leq m \) and the data points \( a_1, \ldots, a_m \) are distinct, there must be at least one index \( i \in I \) such that \( d_\ell(a_i) > 0 \) This indicates that not all data points can coincide with the set \( \bar{x}_1, \ldots, \bar{x}_\ell \), confirming that \( Y_1 \) is not empty According to equations (4.5) and (4.6), we can conclude that \( g(y) < \frac{1}{m} m \).
Therefore, any iteration process for solving (4.4) should start with a point y 0 ∈ Y 1
To find an approximate solution of (3.2) where k is replaced by `+ 1, i.e., the problem minnf `+1 (x) := 1 m m
(4.8) we can use the following procedure [71, pp 349–351] Fixing any y ∈ Y 1 , one divides the data set A into two disjoint subsets
Clearly, A 1 (y) consists of all the data points standing closer to y than to their cluster centers Since y ∈ Y 1 , the set A 1 (y) is nonempty Note that g(y) = 1 m
The equation z `+1 (y) = f ` (¯x) − g(y) indicates that f ` (¯x) is derived from the current centroid system x¯ 1, , x¯ `, while g(y) represents the addition of a new center y, forming `+1 centers A positive value of z `+1 (y) > 0 signifies a reduction in the minimum sum-of-squares clustering criterion when transitioning from the existing centroid system to the new configuration This relationship underscores the effectiveness of incorporating an additional center to enhance clustering performance.
X a i ∈A d`(a i ) and (4.10), one has the representation z `+1 (y) = 1 m
X a i ∈A 1 (y) d ` (a i )− ky−a i k 2 , which can be rewritten as z `+1 (y) = 1 m
Subsequent operations are heavily influenced by the data points in Y1 It can be demonstrated that an element a is part of the intersection A∩Y1 if and only if it belongs to A and is not included in the set {¯x 1, , x¯ `} For each point y = a within A∩Y1, the value z `+1 (a) is calculated using equation (4.11) Finally, the maximum value z max 1 is determined as the highest z `+1 (a) for all a in A∩Y1, as expressed in equation (4.12).
The selection of effective starting points for solving equation (4.8) is determined by two parameters, γ 1 and γ 2, both ranging from 0 to 1 The significance of each parameter will be discussed further The authors of [71] describe their algorithm as heuristic, as the choice of these parameters can be informed by computational experience from applying the algorithm.
Using γ 1 , one can find the set
When γ 1 = 0, the set ¯A 1 is defined as A ∩ Y 1, encompassing all data points within Y 1 Conversely, when γ 1 = 1, ¯A 1 is limited to the data points that result in the maximum decrease, z max 1 This indicates that γ 1 serves as a tolerance level for selecting suitable points from A ∩ Y 1 For each point a in ¯A 1, the corresponding set A1(a) is identified, and its barycenter, denoted as c(a), is calculated The point a is then replaced by c(a), as it better represents the set A1(a) Given that g(c(a)) ≤ g(a) < f`(¯x), it follows that c(a) is included in Y 1.
For each c ∈ A¯ 2 , one computes the value z `+1 (c) by using (4.11) Then, we find z max 2 := max z `+1 (c) | c ∈ A¯ 2 (4.15)
Clearly, z max 2 is the largest decrease among the values f `+1 (¯x 1 , ,x¯ ` , c), where c ∈ A¯ 2 , in comparison with the value f ` (¯x).
When γ 2 = 0, it follows that ¯A 3 is equivalent to ¯A 2 In contrast, when γ 2 = 1, ¯A 3 includes barycenters c ∈ A¯ 2 that yield the most significant reduction in the objective function g(y) = f `+1 (¯x 1, , x¯ `, y) as outlined in equation (4.4) This approach, as referenced in [71, p 315], aligns with the identification of an optimal starting point in the modified global k-means algorithm proposed by Bargirov in [6] Therefore, γ 2 serves as a measure of tolerance in the selection of suitable points from A¯ 2.
Ω := (¯x 1 , ,x¯ ` , c) | c ∈ A¯ 3 (4.17) contains the ‘good’ starting points to solve (4.8).
4.2.2 Version 1 of Ordin-Bagirov’s algorithm