```{=latex}
\input{results/latex/family_vs_family_matrix.tex}
```

# Appendix B: Theoretical Foundations and Proofs

## B.1 Algorithm Derivation

### B.1.1 The Direction Combination Problem

Consider the fundamental problem of combining multiple optimization directions. Given:

- Gradient direction: $-\nabla f(\mathbf{x})$, providing guaranteed descent
- Quasi-Newton direction: $\mathbf{d}_{\text{QN}}$, offering potential superlinear convergence

We seek a principled method to combine these directions that:

1. Guarantees descent from any starting point
2. Smoothly interpolates between the directions
3. Requires no additional hyperparameters
4. Maintains computational efficiency

### B.1.2 Geometric Formulation

We formulate direction combination as a boundary value problem in parametric space. Consider a parametric curve $\mathbf{d}: [0,1] \rightarrow \mathbb{R}^n$ satisfying:

1. **Initial position**: $\mathbf{d}(0) = \mathbf{0}$
2. **Initial tangent**: $\mathbf{d}'(0) = -\nabla f(\mathbf{x})$ (ensures descent)
3. **Terminal position**: $\mathbf{d}(1) = \mathbf{d}_{\text{L-BFGS}}$

The minimal polynomial satisfying these constraints is quadratic:
$$\mathbf{d}(t) = \mathbf{a}t^2 + \mathbf{b}t + \mathbf{c}$$

Applying the boundary conditions:

- From condition 1: $\mathbf{c} = \mathbf{0}$
- From condition 2: $\mathbf{b} = -\nabla f(\mathbf{x})$
- From condition 3: $\mathbf{a} + \mathbf{b} = \mathbf{d}_{\text{L-BFGS}}$

Therefore: $\mathbf{a} = \mathbf{d}_{\text{L-BFGS}} + \nabla f(\mathbf{x})$

This yields the canonical QQN path:
$$\mathbf{d}(t) = t(1-t)(-\nabla f) + t^2 \mathbf{d}_{\text{L-BFGS}}$$

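The boundary conditions above are easy to verify numerically. The following sketch (illustrative only: NumPy, random stand-in vectors for $\nabla f(\mathbf{x})$ and $\mathbf{d}_{\text{L-BFGS}}$, and the helper name `qqn_path` is ours) constructs the path and checks all three conditions, using a forward difference for the initial tangent.

```python
import numpy as np

def qqn_path(t, grad, d_lbfgs):
    """Quadratic QQN path: d(t) = t(1-t)(-grad) + t^2 * d_lbfgs."""
    return t * (1 - t) * (-grad) + t**2 * d_lbfgs

rng = np.random.default_rng(0)
grad = rng.standard_normal(5)     # stand-in for grad f(x)
d_lbfgs = rng.standard_normal(5)  # stand-in for the L-BFGS direction

# Condition 1: d(0) = 0, and condition 3: d(1) = d_LBFGS
assert np.allclose(qqn_path(0.0, grad, d_lbfgs), 0.0)
assert np.allclose(qqn_path(1.0, grad, d_lbfgs), d_lbfgs)

# Condition 2: initial tangent d'(0) = -grad, via forward difference
h = 1e-7
tangent = (qqn_path(h, grad, d_lbfgs) - qqn_path(0.0, grad, d_lbfgs)) / h
assert np.allclose(tangent, -grad, atol=1e-5)
```
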
## B.2 Convergence Analysis

### B.2.1 Universal Descent Property

**Lemma B.1** (Universal Descent): For any direction $\mathbf{d}_{\text{L-BFGS}} \in \mathbb{R}^n$, the QQN path satisfies:
$$\mathbf{d}'(0) = -\nabla f(\mathbf{x})$$

*Proof*: Direct differentiation of $\mathbf{d}(t) = t(1-t)(-\nabla f) + t^2 \mathbf{d}_{\text{L-BFGS}}$ gives:
$$\mathbf{d}'(t) = (1-2t)(-\nabla f) + 2t\mathbf{d}_{\text{L-BFGS}}$$

Evaluating at $t=0$: $\mathbf{d}'(0) = -\nabla f(\mathbf{x})$. $\square$

**Theorem B.1** (Descent Property): For any $\mathbf{d}_{\text{L-BFGS}}$, there exists $\bar{t} > 0$ such that $\phi(t) = f(\mathbf{x} + \mathbf{d}(t))$ satisfies $\phi(t) < \phi(0)$ for all $t \in (0, \bar{t}]$.

*Proof*: Since $\mathbf{d}'(0) = -\nabla f(\mathbf{x})$:
$$\phi'(0) = \nabla f(\mathbf{x})^T(-\nabla f(\mathbf{x})) = -\|\nabla f(\mathbf{x})\|^2 < 0$$

By continuity of $\phi'$ (assuming $f$ is continuously differentiable), there exists $\bar{t} > 0$ such that $\phi'(t) < 0$ for all $t \in (0, \bar{t}]$. By the fundamental theorem of calculus:
$$\phi(t) - \phi(0) = \int_0^t \phi'(s) ds < 0$$

for all $t \in (0, \bar{t}]$. $\square$
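
As a concrete illustration of Theorem B.1, the sketch below (our construction, not the paper's code: NumPy, the Rosenbrock function, and a deliberately uphill `d_bad` standing in for a poor L-BFGS direction) shows that the path still descends for small $t$.

```python
import numpy as np

def f(p):  # Rosenbrock, a standard nonconvex test function
    x, y = p
    return (1 - x)**2 + 100 * (y - x**2)**2

def grad_f(p):
    x, y = p
    return np.array([-2 * (1 - x) - 400 * x * (y - x**2), 200 * (y - x**2)])

x0 = np.array([0.0, 0.0])
g = grad_f(x0)                   # = [-2, 0]
d_bad = np.array([-10.0, 0.0])   # deliberately an ascent direction: g @ d_bad > 0
assert g @ d_bad > 0

def phi(t):
    d = t * (1 - t) * (-g) + t**2 * d_bad
    return f(x0 + d)

# Theorem B.1: descent for small t despite the bad quasi-Newton direction
assert phi(0.01) < phi(0.0)
```
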

### B.2.2 Global Convergence Analysis

**Theorem B.2** (Global Convergence): Under standard assumptions:

1. $f: \mathbb{R}^n \rightarrow \mathbb{R}$ is continuously differentiable
2. $f$ is bounded below: $f(\mathbf{x}) \geq f_{\text{inf}} > -\infty$
3. $\nabla f$ is Lipschitz continuous with constant $L > 0$
4. The univariate optimization finds a point satisfying the Armijo condition

QQN generates iterates satisfying:
$$\liminf_{k \to \infty} \|\nabla f(\mathbf{x}_k)\| = 0$$

*Proof*: We establish convergence through a descent lemma approach.

**Step 1: Monotonic Decrease**

By Theorem B.1, each iteration produces $f(\mathbf{x}_{k+1}) < f(\mathbf{x}_k)$ whenever $\nabla f(\mathbf{x}_k) \neq \mathbf{0}$.

**Step 2: Sufficient Decrease**

Define $\phi_k(t) = f(\mathbf{x}_k + \mathbf{d}_k(t))$. Since $\phi_k'(0) = -\|\nabla f(\mathbf{x}_k)\|^2 < 0$, by the Armijo condition there exist $c_1 \in (0, 1)$ and $\bar{t} > 0$ such that:
$$\phi_k(t) \leq \phi_k(0) + c_1 t \phi_k'(0) = f(\mathbf{x}_k) - c_1 t \|\nabla f(\mathbf{x}_k)\|^2$$

for all $t \in (0, \bar{t}]$.

**Step 3: Quantifying Decrease**

Using the descent lemma with Lipschitz constant $L$:
$$f(\mathbf{x}_{k+1}) \leq f(\mathbf{x}_k) + \nabla f(\mathbf{x}_k)^T \mathbf{d}_k(t_k^*) + \frac{L}{2}\|\mathbf{d}_k(t_k^*)\|^2$$

For the quadratic path with $t_k^* \in (0, \bar{t}]$:
$$\|\mathbf{d}_k(t)\|^2 = \|t(1-t)(-\nabla f(\mathbf{x}_k)) + t^2\mathbf{d}_{\text{L-BFGS}}\|^2 \leq 2t^2(1-t)^2\|\nabla f(\mathbf{x}_k)\|^2 + 2t^4\|\mathbf{d}_{\text{L-BFGS}}\|^2$$

For small $t$, the gradient term dominates, giving:
$$f(\mathbf{x}_k) - f(\mathbf{x}_{k+1}) \geq c\|\nabla f(\mathbf{x}_k)\|^2$$

for some $c > 0$ independent of $k$.

**Step 4: Summability**

Since $f$ is bounded below and decreases monotonically:
$$\sum_{k=0}^{\infty} [f(\mathbf{x}_k) - f(\mathbf{x}_{k+1})] = f(\mathbf{x}_0) - \lim_{k \to \infty} f(\mathbf{x}_k) < \infty$$

Combined with Step 3:
$$\sum_{k=0}^{\infty} \|\nabla f(\mathbf{x}_k)\|^2 < \infty$$

**Step 5: Conclusion**

The summability of $\|\nabla f(\mathbf{x}_k)\|^2$ implies $\liminf_{k \to \infty} \|\nabla f(\mathbf{x}_k)\| = 0$. $\square$
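
The descent mechanism behind Theorem B.2 can be illustrated on a toy problem. In the sketch below (ours, not the paper's implementation) the L-BFGS direction is replaced by the steepest-descent fallback $-\nabla f$, so the path reduces to $\mathbf{d}(t) = -t\nabla f$, and a simple ternary search stands in for the univariate optimizer; the gradient norm is driven toward zero as the theorem predicts.

```python
import numpy as np

A = np.diag([1.0, 10.0])        # simple convex quadratic f(x) = 0.5 x^T A x
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

def argmin_1d(phi, lo=0.0, hi=1.0, iters=100):
    """Ternary search for the minimizer of a unimodal phi on [lo, hi]."""
    for _ in range(iters):
        m1, m2 = lo + (hi - lo) / 3, hi - (hi - lo) / 3
        if phi(m1) < phi(m2):
            hi = m2
        else:
            lo = m1
    return 0.5 * (lo + hi)

x = np.array([1.0, 1.0])
for _ in range(300):
    g = grad(x)
    d_fallback = -g             # placeholder for d_LBFGS; path becomes d(t) = -t g
    path = lambda t: t * (1 - t) * (-g) + t**2 * d_fallback
    t_star = argmin_1d(lambda t: f(x + path(t)))
    x = x + path(t_star)

assert np.linalg.norm(grad(x)) < 1e-6   # gradient norm driven toward zero
```
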

### B.2.3 Local Superlinear Convergence

**Theorem B.3** (Local Superlinear Convergence): Let $\mathbf{x}^*$ be a local minimum with $\nabla f(\mathbf{x}^*) = \mathbf{0}$ and $\nabla^2 f(\mathbf{x}^*) = H^* \succ 0$. Assume:

1. $\nabla^2 f$ is Lipschitz continuous in a neighborhood of $\mathbf{x}^*$
2. The L-BFGS approximation satisfies the Dennis-Moré condition:

$$\lim_{k \to \infty} \frac{\|(\mathbf{H}_k - (H^*)^{-1})(\mathbf{x}_{k+1} - \mathbf{x}_k)\|}{\|\mathbf{x}_{k+1} - \mathbf{x}_k\|} = 0$$

Then QQN converges superlinearly: $\|\mathbf{x}_{k+1} - \mathbf{x}^*\| = o(\|\mathbf{x}_k - \mathbf{x}^*\|)$.

*Proof*: We analyze the behavior near the optimum.

**Step 1: Neighborhood Properties**

By continuity of $\nabla^2 f$, there exists a neighborhood $\mathcal{N}$ of $\mathbf{x}^*$ and constants $0 < \mu \leq L$ such that:
$$\mu \mathbf{I} \preceq \nabla^2 f(\mathbf{x}) \preceq L \mathbf{I}, \quad \forall \mathbf{x} \in \mathcal{N}$$

**Step 2: Optimal Parameter Analysis**

Define $\phi(t) = f(\mathbf{x}_k + \mathbf{d}(t))$ where $\mathbf{d}(t) = t(1-t)(-\nabla f(\mathbf{x}_k)) + t^2\mathbf{d}_{\text{L-BFGS}}$.
The first derivative is:
$$\phi'(t) = \nabla f(\mathbf{x}_k + \mathbf{d}(t))^T[(1-2t)(-\nabla f(\mathbf{x}_k)) + 2t\mathbf{d}_{\text{L-BFGS}}]$$

The second derivative is:
$$\phi''(t) = [(1-2t)(-\nabla f(\mathbf{x}_k)) + 2t\mathbf{d}_{\text{L-BFGS}}]^T \nabla^2 f(\mathbf{x}_k + \mathbf{d}(t))[(1-2t)(-\nabla f(\mathbf{x}_k)) + 2t\mathbf{d}_{\text{L-BFGS}}]$$

$$+ \nabla f(\mathbf{x}_k + \mathbf{d}(t))^T[-2(-\nabla f(\mathbf{x}_k)) + 2\mathbf{d}_{\text{L-BFGS}}]$$

At $t = 1$:
$$\phi'(1) = \nabla f(\mathbf{x}_k + \mathbf{d}_{\text{L-BFGS}})^T \mathbf{d}_{\text{L-BFGS}}$$

Using Taylor expansion:
$$\nabla f(\mathbf{x}_k + \mathbf{d}_{\text{L-BFGS}}) = \nabla f(\mathbf{x}_k) + \nabla^2 f(\mathbf{x}_k)\mathbf{d}_{\text{L-BFGS}} + O(\|\mathbf{d}_{\text{L-BFGS}}\|^2)$$

Since $\mathbf{d}_{\text{L-BFGS}} = -\mathbf{H}_k\nabla f(\mathbf{x}_k)$:
$$\nabla f(\mathbf{x}_k + \mathbf{d}_{\text{L-BFGS}}) = [\mathbf{I} - \nabla^2 f(\mathbf{x}_k)\mathbf{H}_k]\nabla f(\mathbf{x}_k) + O(\|\nabla f(\mathbf{x}_k)\|^2)$$

By the Dennis-Moré condition, as $k \to \infty$:
$$\|\mathbf{I} - \nabla^2 f(\mathbf{x}_k)\mathbf{H}_k\| \to 0$$

Therefore:
$$\phi'(1) = o(\|\nabla f(\mathbf{x}_k)\|^2)$$

**Step 3: Optimal Parameter Convergence**

Since $\phi'(0) = -\|\nabla f(\mathbf{x}_k)\|^2 < 0$ and $\phi'(1) = o(\|\nabla f(\mathbf{x}_k)\|^2)$, by the intermediate value theorem and the fact that $\phi$ is strongly convex near $t = 1$ (due to the positive definite Hessian), the minimizer satisfies:
$$t_k^* = 1 + o(1)$$

**Step 4: Convergence Rate**

With $t_k^* = 1 + o(1)$:
$$\mathbf{x}_{k+1} = \mathbf{x}_k + \mathbf{d}(t_k^*) = \mathbf{x}_k + (1 + o(1))\mathbf{d}_{\text{L-BFGS}} + o(\|\mathbf{d}_{\text{L-BFGS}}\|)$$

$$= \mathbf{x}_k - \mathbf{H}_k\nabla f(\mathbf{x}_k) + o(\|\nabla f(\mathbf{x}_k)\|)$$

By standard quasi-Newton theory with the Dennis-Moré condition:

$$\|\mathbf{x}_{k+1} - \mathbf{x}^*\| = o(\|\mathbf{x}_k - \mathbf{x}^*\|)$$

establishing superlinear convergence. $\square$
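
The conclusion $t_k^* = 1 + o(1)$ of Step 3 can be observed directly when the quasi-Newton direction is exact. In the sketch below (ours: a strongly convex quadratic, an exact Newton step standing in for $\mathbf{d}_{\text{L-BFGS}}$, and a grid search standing in for the univariate optimizer), the 1D minimizer lands at $t^* = 1$ and the resulting step reaches the minimizer.

```python
import numpy as np

A = np.diag([1.0, 2.0])          # strongly convex quadratic, minimizer x* = 0
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

x = np.array([1.0, 1.0])
g = grad(x)
d_newton = -np.linalg.solve(A, g)        # exact Newton step, stand-in for d_LBFGS

ts = np.linspace(0.0, 1.0, 10001)
phis = [f(x + t * (1 - t) * (-g) + t**2 * d_newton) for t in ts]
t_star = ts[int(np.argmin(phis))]

# With an exact (quasi-)Newton direction, the 1D minimizer sits at t* = 1
assert t_star > 0.999
assert np.linalg.norm(x + t_star * (1 - t_star) * (-g) + t_star**2 * d_newton) < 1e-9
```
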

## B.3 Robustness Analysis

### B.3.1 Graceful Degradation

**Theorem B.4** (Graceful Degradation): Let $\theta_k$ be the angle between $-\nabla f(\mathbf{x}_k)$ and $\mathbf{d}_{\text{L-BFGS}}$. If $\theta_k > \pi/2$ (obtuse angle), then the optimal parameter satisfies $t^* \in [0, 1/2]$, ensuring gradient-dominated steps.

*Proof*: When $\theta_k > \pi/2$, we have $\nabla f(\mathbf{x}_k)^T \mathbf{d}_{\text{L-BFGS}} > 0$.

The derivative of our objective along the path is:
$$\frac{d}{dt}f(\mathbf{x}_k + \mathbf{d}(t)) = \nabla f(\mathbf{x}_k + \mathbf{d}(t))^T \mathbf{d}'(t)$$

Since $\mathbf{d}'(t) = (1-2t)(-\nabla f(\mathbf{x}_k)) + 2t\mathbf{d}_{\text{L-BFGS}}$, at $t = 1/2$ the gradient term vanishes:
$$\mathbf{d}'(1/2) = \mathbf{d}_{\text{L-BFGS}}$$

For small steps from $\mathbf{x}_k$:
$$\nabla f(\mathbf{x}_k + \mathbf{d}(1/2)) \approx \nabla f(\mathbf{x}_k)$$

Therefore:
$$\left.\frac{d}{dt}f(\mathbf{x}_k + \mathbf{d}(t))\right|_{t=1/2} \approx \nabla f(\mathbf{x}_k)^T\mathbf{d}_{\text{L-BFGS}} > 0$$

by the obtuse-angle assumption.

This implies the objective is increasing at $t = 1/2$, so the univariate optimization will find $t^* \leq 1/2$, giving:

$$\mathbf{x}_{k+1} \approx \mathbf{x}_k + t^*(1-t^*)(-\nabla f(\mathbf{x}_k))$$

Since $t^* \leq 1/2$, we have $t^*(1-t^*) \geq t^*/2$, ensuring a gradient-dominated step. $\square$
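
When the quasi-Newton direction points uphill, the 1D minimizer does retreat into the gradient-dominated regime. A minimal sketch (ours: $f(\mathbf{x}) = \|\mathbf{x}\|^2/2$ with a deliberately reversed direction and a grid search over $t$) reproduces $t^* \leq 1/2$; for this particular instance the minimizer works out to $t^* = 1/4$.

```python
import numpy as np

f = lambda x: 0.5 * x @ x        # f(x) = ||x||^2 / 2, so grad f = x
x = np.array([1.0, 0.0])
g = x.copy()
d_bad = np.array([1.0, 0.0])     # points uphill: angle with -g is pi (obtuse)
assert (-g) @ d_bad < 0          # obtuse angle between -grad and d_LBFGS

ts = np.linspace(0.0, 1.0, 100001)
phis = [f(x + t * (1 - t) * (-g) + t**2 * d_bad) for t in ts]
t_star = ts[int(np.argmin(phis))]

# The 1D minimizer stays in the gradient-dominated regime t* <= 1/2
assert t_star <= 0.5
assert abs(t_star - 0.25) < 1e-3   # closed-form minimizer of (1 - t + 2 t^2)^2 / 2
```
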

### B.3.2 Stability Under Numerical Errors

**Theorem B.5** (Numerical Stability): Let $\tilde{\mathbf{d}}_{\text{L-BFGS}} = \mathbf{d}_{\text{L-BFGS}} + \boldsymbol{\epsilon}$ where $\boldsymbol{\epsilon}$ represents numerical errors with $\|\boldsymbol{\epsilon}\| \leq \delta$. The perturbed QQN path:
$$\tilde{\mathbf{d}}(t) = t(1-t)(-\nabla f) + t^2 \tilde{\mathbf{d}}_{\text{L-BFGS}}$$

satisfies:
$$\|\tilde{\mathbf{d}}(t) - \mathbf{d}(t)\| \leq t^2\delta$$

*Proof*: Direct computation:

$$\|\tilde{\mathbf{d}}(t) - \mathbf{d}(t)\| = \|t^2(\tilde{\mathbf{d}}_{\text{L-BFGS}} - \mathbf{d}_{\text{L-BFGS}})\| = t^2\|\boldsymbol{\epsilon}\| \leq t^2\delta$$

For small $t$ (near the initial descent phase), the error is $O(t^2\delta)$, providing quadratic error suppression. $\square$
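
Theorem B.5's bound actually holds with equality along the path, which a short check confirms (sketch, ours: NumPy with a synthetic perturbation standing in for $\boldsymbol{\epsilon}$).

```python
import numpy as np

rng = np.random.default_rng(1)
g = rng.standard_normal(4)
d = rng.standard_normal(4)
eps = 1e-8 * rng.standard_normal(4)   # synthetic numerical error in d_LBFGS

path = lambda t, dd: t * (1 - t) * (-g) + t**2 * dd

for t in (0.1, 0.5, 1.0):
    # Perturbation of the path equals t^2 * ||eps||, matching the bound exactly
    diff = np.linalg.norm(path(t, d + eps) - path(t, d))
    assert abs(diff - t**2 * np.linalg.norm(eps)) < 1e-12
```
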

## B.4 Computational Complexity

**Theorem B.6** (Computational Complexity): Each QQN iteration requires:

- $O(n)$ operations for path construction
- $O(mn)$ operations for L-BFGS direction computation
- $O(k)$ function evaluations for univariate optimization

where $n$ is the dimension, $m$ is the L-BFGS memory size, and $k$ is typically small (3-10).

*Proof*:

1. **Path construction**: Computing $\mathbf{d}(t) = t(1-t)(-\nabla f) + t^2 \mathbf{d}_{\text{L-BFGS}}$ requires $O(n)$ operations for vector arithmetic.
2. **L-BFGS direction**: The two-loop recursion requires $O(mn)$ operations to compute $\mathbf{H}_k\nabla f(\mathbf{x}_k)$.
3. **Line search**: Each function evaluation along the path requires $O(n)$ operations to compute $\mathbf{x}_k + \mathbf{d}(t)$, plus the cost of evaluating $f$.

Total complexity per iteration: $O(mn + kn) + k \cdot \text{cost}(f)$. $\square$
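
For reference, the $O(mn)$ direction computation in item 2 is the standard L-BFGS two-loop recursion. The sketch below (a textbook version, ours, not the paper's code) implements it and checks the secant property $\mathbf{H}\mathbf{y} = \mathbf{s}$, which holds exactly after a single update regardless of the initial scaling.

```python
import numpy as np

def lbfgs_two_loop(g, s_list, y_list):
    """Textbook L-BFGS two-loop recursion: returns H_k @ g in O(m n) time."""
    rhos = [1.0 / (y @ s) for s, y in zip(s_list, y_list)]
    q = g.astype(float).copy()
    alphas = []
    for s, y, rho in reversed(list(zip(s_list, y_list, rhos))):
        a = rho * (s @ q)
        alphas.append(a)
        q = q - a * y
    # Common initial scaling H_0 = gamma * I
    gamma = (s_list[-1] @ y_list[-1]) / (y_list[-1] @ y_list[-1])
    z = gamma * q
    for (s, y, rho), a in zip(zip(s_list, y_list, rhos), reversed(alphas)):
        b = rho * (y @ z)
        z = z + s * (a - b)
    return z

rng = np.random.default_rng(2)
s = rng.standard_normal(6)
y = rng.standard_normal(6)
if y @ s < 0:                    # keep the curvature pair valid: s^T y > 0
    y = -y

# Secant sanity check: with one stored pair, H y = s
assert np.allclose(lbfgs_two_loop(y, [s], [y]), s)
```
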

## B.5 Extensions and Variants

### B.5.1 Gradient Scaling

The basic QQN formulation can be enhanced with gradient scaling:
$$\mathbf{d}(t) = t(1-t)\alpha(-\nabla f) + t^2 \mathbf{d}_{\text{L-BFGS}}$$

where $\alpha > 0$ is a scaling factor.

**Proposition B.1** (Scaling Invariance): The set of points reachable by the QQN path is invariant to the choice of $\alpha$. Only the parametrization changes.

*Proof*: Consider the mapping $s = \beta(t)$ where $\beta$ is chosen such that:
$$t(1-t)\alpha(-\nabla f) + t^2 \mathbf{d}_{\text{L-BFGS}} = s(1-s)(-\nabla f) + s^2 \mathbf{d}_{\text{L-BFGS}}$$

This gives a bijection between parametrizations, showing that any point reachable with one $\alpha$ is reachable with another. $\square$

### B.5.2 Cubic Extension with Momentum

Incorporating momentum leads to cubic interpolation:
$$\mathbf{d}(t) = t(1-t)(1-2t)\mathbf{m} + t(1-t)\alpha(-\nabla f) + t^2 \mathbf{d}_{\text{L-BFGS}}$$

where $\mathbf{m}$ is the momentum vector.

This satisfies:

- $\mathbf{d}(0) = \mathbf{0}$
- $\mathbf{d}'(0) = \alpha(-\nabla f) + \mathbf{m}$
- $\mathbf{d}(1) = \mathbf{d}_{\text{L-BFGS}}$
- $\mathbf{d}''(0) = -6\mathbf{m} - 2\alpha(-\nabla f) + 2\mathbf{d}_{\text{L-BFGS}}$

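The first three boundary conditions of the cubic variant can be verified numerically (sketch, ours: random stand-in vectors and a forward difference for the tangent).

```python
import numpy as np

rng = np.random.default_rng(3)
g = rng.standard_normal(4)       # stand-in for grad f
d = rng.standard_normal(4)       # stand-in for d_LBFGS
m = rng.standard_normal(4)       # momentum vector
alpha = 0.7

def cubic_path(t):
    return t*(1 - t)*(1 - 2*t)*m + t*(1 - t)*alpha*(-g) + t**2 * d

# d(0) = 0 and d(1) = d_LBFGS
assert np.allclose(cubic_path(0.0), 0.0)
assert np.allclose(cubic_path(1.0), d)

# d'(0) = alpha * (-grad) + m, checked by forward difference
h = 1e-7
tangent = (cubic_path(h) - cubic_path(0.0)) / h
assert np.allclose(tangent, alpha * (-g) + m, atol=1e-5)
```
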
**Theorem B.7** (Cubic Convergence Properties): The cubic variant maintains all convergence guarantees of the quadratic version while potentially improving the convergence constant through momentum acceleration.

### B.5.3 Trust Region Integration

QQN naturally extends to trust regions by constraining the univariate search:
$$t^* = \arg\min_{t: \|\mathbf{d}(t)\| \leq \Delta} f(\mathbf{x} + \mathbf{d}(t))$$

where $\Delta$ is the trust region radius.

**Proposition B.2** (Trust Region Feasibility): For any $\Delta > 0$, there exists $t_{\max} > 0$ such that $\|\mathbf{d}(t)\| \leq \Delta$ for all $t \in [0, t_{\max}]$.

*Proof*: Since $\mathbf{d}(0) = \mathbf{0}$ and $\mathbf{d}$ is continuous, the set $\{t : \|\mathbf{d}(t)\| \leq \Delta\}$ contains an interval $[0, t_{\max}]$ for some $t_{\max} > 0$. $\square$
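
Proposition B.2 suggests a simple way to locate a feasible prefix of the path in practice. The sketch below (ours: a grid scan, with `Delta` chosen arbitrarily for illustration) finds the last grid point of the initial feasible interval.

```python
import numpy as np

rng = np.random.default_rng(4)
g = rng.standard_normal(3)
d = rng.standard_normal(3)
path = lambda t: t * (1 - t) * (-g) + t**2 * d

Delta = 0.1 * np.linalg.norm(g)  # illustrative trust-region radius

ts = np.linspace(0.0, 1.0, 10001)
inside = np.array([np.linalg.norm(path(t)) <= Delta for t in ts])
first_out = int(np.argmin(inside)) if not inside.all() else len(ts)
t_max = ts[first_out - 1]        # last grid point before leaving the region

assert t_max > 0.0               # a nontrivial feasible prefix exists (Proposition B.2)
assert np.linalg.norm(path(t_max)) <= Delta
```
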

## B.6 Comparison with Related Methods

### B.6.1 Relationship to Trust Region Methods

Trust region methods solve:
$$\min_{\mathbf{s}} \mathbf{g}^T\mathbf{s} + \frac{1}{2}\mathbf{s}^T\mathbf{B}\mathbf{s} \quad \text{s.t.} \quad \|\mathbf{s}\| \leq \Delta$$

QQN can be viewed as solving a related but different problem:
$$\min_{t \geq 0} f(\mathbf{x} + \mathbf{d}(t))$$

where $\mathbf{d}(t)$ is the quadratic path.

**Key differences**:

- Trust region: solves a constrained quadratic subproblem, then accepts or rejects the step
- QQN: direct 1D optimization along the quadratic path
- Trust region: requires trust region radius management
- QQN: parameter-free, automatic adaptation

### B.6.2 Relationship to Line Search Methods

Traditional line search methods optimize:
$$\min_{\alpha > 0} f(\mathbf{x} + \alpha \mathbf{d})$$

QQN generalizes this by optimizing along a parametric path:
$$\min_{t \geq 0} f(\mathbf{x} + \mathbf{d}(t))$$

The key insight is that the direction itself changes with the parameter, providing additional flexibility.
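
One special case makes the generalization explicit: substituting $\mathbf{d}_{\text{L-BFGS}} = -\nabla f$ collapses the quadratic path to the straight ray $-t\nabla f$, recovering ordinary steepest-descent line search. A quick check (sketch, ours):

```python
import numpy as np

rng = np.random.default_rng(5)
g = rng.standard_normal(4)       # stand-in gradient

# d(t) = t(1-t)(-g) + t^2(-g) = -t g: the path degenerates to a line
for t in np.linspace(0.0, 1.0, 11):
    assert np.allclose(t * (1 - t) * (-g) + t**2 * (-g), -t * g)
```
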

### B.6.3 Relationship to Hybrid Methods

Previous hybrid approaches typically use discrete switching:
$$\mathbf{d} = \begin{cases}
\mathbf{d}_{\text{gradient}} & \text{if condition A} \\
\mathbf{d}_{\text{quasi-Newton}} & \text{if condition B}
\end{cases}$$

QQN provides continuous interpolation, eliminating discontinuities and the need for switching logic.