Commit 4dd2c05 ("wip")
1 parent b422925

7 files changed: +1306 −283 lines changed

.gitignore

Lines changed: 1 addition & 0 deletions

    @@ -9,3 +9,4 @@ data/
     *.aux
     *.out
     *.blg
    +endorsement.md

papers/intro/appendix.md

Lines changed: 317 additions & 1 deletion

```{=latex}
\input{results/latex/family_vs_family_matrix.tex}
```
# Appendix B: Theoretical Foundations and Proofs

## B.1 Algorithm Derivation

### B.1.1 The Direction Combination Problem

Consider the fundamental problem of combining multiple optimization directions. Given:

- Gradient direction: $-\nabla f(\mathbf{x})$ providing guaranteed descent
- Quasi-Newton direction: $\mathbf{d}_{\text{QN}}$ offering potential superlinear convergence

We seek a principled method to combine these directions that:

1. Guarantees descent from any starting point
2. Smoothly interpolates between the directions
3. Requires no additional hyperparameters
4. Maintains computational efficiency

### B.1.2 Geometric Formulation

We formulate direction combination as a boundary value problem in parametric space. Consider a parametric curve $\mathbf{d}: [0,1] \rightarrow \mathbb{R}^n$ satisfying:

1. **Initial position**: $\mathbf{d}(0) = \mathbf{0}$
2. **Initial tangent**: $\mathbf{d}'(0) = -\nabla f(\mathbf{x})$ (ensures descent)
3. **Terminal position**: $\mathbf{d}(1) = \mathbf{d}_{\text{L-BFGS}}$

The minimal polynomial satisfying these constraints is quadratic:

$$\mathbf{d}(t) = \mathbf{a}t^2 + \mathbf{b}t + \mathbf{c}$$

Applying boundary conditions:

- From condition 1: $\mathbf{c} = \mathbf{0}$
- From condition 2: $\mathbf{b} = -\nabla f(\mathbf{x})$
- From condition 3: $\mathbf{a} + \mathbf{b} = \mathbf{d}_{\text{L-BFGS}}$

Therefore: $\mathbf{a} = \mathbf{d}_{\text{L-BFGS}} + \nabla f(\mathbf{x})$

This yields the canonical QQN path:

$$\mathbf{d}(t) = t(1-t)(-\nabla f) + t^2 \mathbf{d}_{\text{L-BFGS}}$$
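The three boundary conditions are easy to verify numerically. A minimal sketch (the function name `qqn_path` is illustrative, not taken from any released implementation):

```python
import numpy as np

def qqn_path(t, grad, d_lbfgs):
    """Quadratic QQN path: d(t) = t(1-t)(-grad) + t^2 * d_lbfgs."""
    return t * (1.0 - t) * (-grad) + t**2 * d_lbfgs

rng = np.random.default_rng(0)
grad = rng.standard_normal(5)
d_lbfgs = rng.standard_normal(5)

# Boundary positions: d(0) = 0 and d(1) = d_lbfgs.
assert np.allclose(qqn_path(0.0, grad, d_lbfgs), 0.0)
assert np.allclose(qqn_path(1.0, grad, d_lbfgs), d_lbfgs)

# Initial tangent d'(0) = -grad, checked by forward difference.
h = 1e-7
tangent = (qqn_path(h, grad, d_lbfgs) - qqn_path(0.0, grad, d_lbfgs)) / h
assert np.allclose(tangent, -grad, atol=1e-5)
```

The same check works for any pair of directions, since the derivation never used properties of L-BFGS beyond its value at $t = 1$.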
## B.2 Convergence Analysis

### B.2.1 Universal Descent Property

**Lemma B.1** (Universal Descent): For any direction $\mathbf{d}_{\text{L-BFGS}} \in \mathbb{R}^n$, the QQN path satisfies:

$$\mathbf{d}'(0) = -\nabla f(\mathbf{x})$$

*Proof*: Direct differentiation of $\mathbf{d}(t) = t(1-t)(-\nabla f) + t^2 \mathbf{d}_{\text{L-BFGS}}$ gives:

$$\mathbf{d}'(t) = (1-2t)(-\nabla f) + 2t\mathbf{d}_{\text{L-BFGS}}$$

Evaluating at $t=0$: $\mathbf{d}'(0) = -\nabla f(\mathbf{x})$. $\square$

**Theorem B.1** (Descent Property): For any $\mathbf{d}_{\text{L-BFGS}}$, there exists $\bar{t} > 0$ such that $\phi(t) = f(\mathbf{x} + \mathbf{d}(t))$ satisfies $\phi(t) < \phi(0)$ for all $t \in (0, \bar{t}]$.

*Proof*: Since $\mathbf{d}'(0) = -\nabla f(\mathbf{x})$:

$$\phi'(0) = \nabla f(\mathbf{x})^T(-\nabla f(\mathbf{x})) = -\|\nabla f(\mathbf{x})\|^2 < 0$$

By continuity of $\phi'$ (assuming $f$ is continuously differentiable), there exists $\bar{t} > 0$ such that $\phi'(t) < 0$ for all $t \in (0, \bar{t}]$. By the fundamental theorem of calculus:

$$\phi(t) - \phi(0) = \int_0^t \phi'(s)\, ds < 0$$

for all $t \in (0, \bar{t}]$. $\square$
66+
67+
### B.2.2 Global Convergence Analysis
68+
69+
**Theorem B.2** (Global Convergence): Under standard assumptions:
70+
71+
1. $f: \mathbb{R}^n \rightarrow \mathbb{R}$ is continuously differentiable
72+
2. $f$ is bounded below: $f(\mathbf{x}) \geq f_{\text{inf}} > -\infty$
73+
3. $\nabla f$ is Lipschitz continuous with constant $L > 0$
74+
4. The univariate optimization finds a point satisfying the Armijo condition
75+
76+
QQN generates iterates satisfying:
77+
$$\liminf_{k \to \infty} \|\nabla f(\mathbf{x}_k)\| = 0$$
78+
79+
*Proof*: We establish convergence through a descent lemma approach.
80+
81+
**Step 1: Monotonic Decrease**
82+
83+
By Theorem B.1, each iteration produces $f(\mathbf{x}_{k+1}) < f(\mathbf{x}_k)$ whenever $\nabla f(\mathbf{x}_k) \neq \mathbf{0}$.
84+
85+
**Step 2: Sufficient Decrease**
86+
87+
Define $\phi_k(t) = f(\mathbf{x}_k + \mathbf{d}_k(t))$. Since $\phi_k'(0) = -\|\nabla f(\mathbf{x}_k)\|^2 < 0$, by the Armijo condition, there exists $c_1 \in (0, 1)$ and $\bar{t} > 0$ such that:
88+
$$\phi_k(t) \leq \phi_k(0) + c_1 t \phi_k'(0) = f(\mathbf{x}_k) - c_1 t \|\nabla f(\mathbf{x}_k)\|^2$$
89+
90+
for all $t \in (0, \bar{t}]$.
91+
92+
**Step 3: Quantifying Decrease**
93+
94+
Using the descent lemma with Lipschitz constant $L$:
95+
$$f(\mathbf{x}_{k+1}) \leq f(\mathbf{x}_k) + \nabla f(\mathbf{x}_k)^T \mathbf{d}_k(t_k^*) + \frac{L}{2}\|\mathbf{d}_k(t_k^*)\|^2$$
96+
97+
For the quadratic path with $t_k^* \in (0, \bar{t}]$:
98+
$$\|\mathbf{d}_k(t)\|^2 = \|t(1-t)(-\nabla f(\mathbf{x}_k)) + t^2\mathbf{d}_{\text{L-BFGS}}\|^2$$
99+
100+
$$\leq 2t^2(1-t)^2\|\nabla f(\mathbf{x}_k)\|^2 + 2t^4\|\mathbf{d}_{\text{L-BFGS}}\|^2$$
101+
102+
For small $t$, the gradient term dominates, giving:
103+
$$f(\mathbf{x}_k) - f(\mathbf{x}_{k+1}) \geq c\|\nabla f(\mathbf{x}_k)\|^2$$
104+
105+
for some $c > 0$ independent of $k$.
106+
107+
**Step 4: Summability**
108+
109+
Since $f$ is bounded below and decreases monotonically:
110+
$$\sum_{k=0}^{\infty} [f(\mathbf{x}_k) - f(\mathbf{x}_{k+1})] = f(\mathbf{x}_0) - \lim_{k \to \infty} f(\mathbf{x}_k) < \infty$$
111+
112+
Combined with Step 3:
113+
$$\sum_{k=0}^{\infty} \|\nabla f(\mathbf{x}_k)\|^2 < \infty$$
114+
115+
**Step 5: Conclusion**
116+
The summability of $\|\nabla f(\mathbf{x}_k)\|^2$ implies $\liminf_{k \to \infty} \|\nabla f(\mathbf{x}_k)\| = 0$. $\square$
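The proof structure mirrors a simple implementation: minimize $\phi_k$ by backtracking from $t = 1$ until the Armijo condition of assumption 4 holds. A sketch on a two-dimensional convex quadratic, using the exact Newton direction as a stand-in for the L-BFGS direction (all names illustrative):

```python
import numpy as np

# Test problem: f(x) = 0.5 x^T A x with SPD A, so grad f(x) = A x.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

def qqn_step(x, c1=1e-4, shrink=0.5):
    g = grad(x)
    d_qn = -np.linalg.solve(A, g)        # stand-in for the L-BFGS direction
    path = lambda t: t * (1 - t) * (-g) + t**2 * d_qn
    # Backtrack from t = 1 until the Armijo condition
    # phi(t) <= phi(0) - c1 * t * ||g||^2 holds (assumption 4).
    t = 1.0
    while f(x + path(t)) > f(x) - c1 * t * (g @ g):
        t *= shrink
    return x + path(t)

x = np.array([2.0, -1.5])
for _ in range(5):
    x = qqn_step(x)
print(np.linalg.norm(grad(x)))  # gradient norm driven to (near) zero
```

On this quadratic the full step at $t = 1$ already passes the Armijo test, so the iterates jump essentially to the minimizer; the backtracking loop only matters for less benign directions.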
### B.2.3 Local Superlinear Convergence

**Theorem B.3** (Local Superlinear Convergence): Let $\mathbf{x}^*$ be a local minimum with $\nabla f(\mathbf{x}^*) = \mathbf{0}$ and $\nabla^2 f(\mathbf{x}^*) = H^* \succ 0$. Assume:

1. $\nabla^2 f$ is Lipschitz continuous in a neighborhood of $\mathbf{x}^*$
2. The L-BFGS approximation satisfies the Dennis-Moré condition:

$$\lim_{k \to \infty} \frac{\|(\mathbf{H}_k - (H^*)^{-1})(\mathbf{x}_{k+1} - \mathbf{x}_k)\|}{\|\mathbf{x}_{k+1} - \mathbf{x}_k\|} = 0$$

Then QQN converges superlinearly: $\|\mathbf{x}_{k+1} - \mathbf{x}^*\| = o(\|\mathbf{x}_k - \mathbf{x}^*\|)$.

*Proof*: We analyze the behavior near the optimum.

**Step 1: Neighborhood Properties**

By continuity of $\nabla^2 f$, there exists a neighborhood $\mathcal{N}$ of $\mathbf{x}^*$ and constants $0 < \mu \leq L$ such that:

$$\mu \mathbf{I} \preceq \nabla^2 f(\mathbf{x}) \preceq L \mathbf{I}, \quad \forall \mathbf{x} \in \mathcal{N}$$

**Step 2: Optimal Parameter Analysis**

Define $\phi(t) = f(\mathbf{x}_k + \mathbf{d}(t))$ where $\mathbf{d}(t) = t(1-t)(-\nabla f(\mathbf{x}_k)) + t^2\mathbf{d}_{\text{L-BFGS}}$. The first derivative is:

$$\phi'(t) = \nabla f(\mathbf{x}_k + \mathbf{d}(t))^T[(1-2t)(-\nabla f(\mathbf{x}_k)) + 2t\mathbf{d}_{\text{L-BFGS}}]$$

The second derivative is:

$$\phi''(t) = [(1-2t)(-\nabla f(\mathbf{x}_k)) + 2t\mathbf{d}_{\text{L-BFGS}}]^T \nabla^2 f(\mathbf{x}_k + \mathbf{d}(t))[(1-2t)(-\nabla f(\mathbf{x}_k)) + 2t\mathbf{d}_{\text{L-BFGS}}]$$

$$+ \nabla f(\mathbf{x}_k + \mathbf{d}(t))^T[-2(-\nabla f(\mathbf{x}_k)) + 2\mathbf{d}_{\text{L-BFGS}}]$$

At $t = 1$ the path tangent is $\mathbf{d}'(1) = \nabla f(\mathbf{x}_k) + 2\mathbf{d}_{\text{L-BFGS}}$, so:

$$\phi'(1) = \nabla f(\mathbf{x}_k + \mathbf{d}_{\text{L-BFGS}})^T[\nabla f(\mathbf{x}_k) + 2\mathbf{d}_{\text{L-BFGS}}]$$

Using Taylor expansion:

$$\nabla f(\mathbf{x}_k + \mathbf{d}_{\text{L-BFGS}}) = \nabla f(\mathbf{x}_k) + \nabla^2 f(\mathbf{x}_k)\mathbf{d}_{\text{L-BFGS}} + O(\|\mathbf{d}_{\text{L-BFGS}}\|^2)$$

Since $\mathbf{d}_{\text{L-BFGS}} = -\mathbf{H}_k\nabla f(\mathbf{x}_k)$:

$$\nabla f(\mathbf{x}_k + \mathbf{d}_{\text{L-BFGS}}) = [\mathbf{I} - \nabla^2 f(\mathbf{x}_k)\mathbf{H}_k]\nabla f(\mathbf{x}_k) + O(\|\nabla f(\mathbf{x}_k)\|^2)$$

By the Dennis-Moré condition, the mismatch along the step direction vanishes as $k \to \infty$:

$$\|[\mathbf{I} - \nabla^2 f(\mathbf{x}_k)\mathbf{H}_k]\nabla f(\mathbf{x}_k)\| = o(\|\nabla f(\mathbf{x}_k)\|)$$

Since $\mathbf{d}'(1) = O(\|\nabla f(\mathbf{x}_k)\|)$, it follows that:

$$\phi'(1) = o(\|\nabla f(\mathbf{x}_k)\|^2)$$

**Step 3: Optimal Parameter Convergence**

Since $\phi'(0) = -\|\nabla f(\mathbf{x}_k)\|^2 < 0$ and $\phi'(1) = o(\|\nabla f(\mathbf{x}_k)\|^2)$, by the intermediate value theorem and the fact that $\phi$ is strongly convex near $t = 1$ (due to the positive definite Hessian), the minimizer satisfies:

$$t_k^* = 1 + o(1)$$

**Step 4: Convergence Rate**

With $t_k^* = 1 + o(1)$:

$$\mathbf{x}_{k+1} = \mathbf{x}_k + \mathbf{d}(t_k^*) = \mathbf{x}_k + (1 + o(1))\mathbf{d}_{\text{L-BFGS}} + o(\|\mathbf{d}_{\text{L-BFGS}}\|)$$

$$= \mathbf{x}_k - \mathbf{H}_k\nabla f(\mathbf{x}_k) + o(\|\nabla f(\mathbf{x}_k)\|)$$

By standard quasi-Newton theory with the Dennis-Moré condition:

$$\|\mathbf{x}_{k+1} - \mathbf{x}^*\| = o(\|\mathbf{x}_k - \mathbf{x}^*\|)$$

establishing superlinear convergence. $\square$
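The $t_k^* \to 1$ behavior of Step 3 can be observed directly. In the sketch below the exact Newton direction stands in for an L-BFGS direction that has fully converged (the ideal case of the Dennis-Moré condition); on a quadratic the path then passes through the minimizer exactly at $t = 1$:

```python
import numpy as np

A = np.array([[3.0, 1.0], [1.0, 2.0]])
f = lambda x: 0.5 * x @ A @ x

x = np.array([0.3, -0.2])           # a point near the optimum x* = 0
g = A @ x
d_qn = -np.linalg.solve(A, g)       # exact Newton step: the ideal H_k
phi = lambda t: f(x + t * (1 - t) * (-g) + t**2 * d_qn)

# Locate the univariate minimizer on a fine grid.
ts = np.linspace(0.0, 1.5, 1501)
t_star = ts[np.argmin([phi(t) for t in ts])]
print(t_star)  # ~1.0: the full quasi-Newton step is accepted
```

A grid scan is used only to keep the sketch dependency-free; any 1D solver finds the same minimizer.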
## B.3 Robustness Analysis

### B.3.1 Graceful Degradation

**Theorem B.4** (Graceful Degradation): Let $\theta_k$ be the angle between $-\nabla f(\mathbf{x}_k)$ and $\mathbf{d}_{\text{L-BFGS}}$. If $\theta_k > \pi/2$ (obtuse angle), then the optimal parameter satisfies $t^* \in [0, 1/2]$, ensuring gradient-dominated steps.

*Proof*: When $\theta_k > \pi/2$, we have $\nabla f(\mathbf{x}_k)^T \mathbf{d}_{\text{L-BFGS}} > 0$.

The derivative of the objective along the path is:

$$\frac{d}{dt}f(\mathbf{x}_k + \mathbf{d}(t)) = \nabla f(\mathbf{x}_k + \mathbf{d}(t))^T \mathbf{d}'(t)$$

At $t = 1/2$ the gradient term of $\mathbf{d}'(t) = (1-2t)(-\nabla f(\mathbf{x}_k)) + 2t\mathbf{d}_{\text{L-BFGS}}$ vanishes, leaving:

$$\mathbf{d}'(1/2) = \mathbf{d}_{\text{L-BFGS}}$$

For small steps from $\mathbf{x}_k$:

$$\nabla f(\mathbf{x}_k + \mathbf{d}(1/2)) \approx \nabla f(\mathbf{x}_k)$$

Therefore:

$$\left.\frac{d}{dt}f(\mathbf{x}_k + \mathbf{d}(t))\right|_{t=1/2} \approx \nabla f(\mathbf{x}_k)^T\mathbf{d}_{\text{L-BFGS}} > 0$$

by the obtuse-angle assumption. The objective is therefore increasing at $t = 1/2$, so the univariate optimization finds $t^* \leq 1/2$, giving:

$$\mathbf{x}_{k+1} \approx \mathbf{x}_k + t^*(1-t^*)(-\nabla f(\mathbf{x}_k)) + (t^*)^2\mathbf{d}_{\text{L-BFGS}}$$

Since $t^* \leq 1/2$, the gradient coefficient satisfies $t^*(1-t^*) \geq (t^*)^2$, so the step weights the gradient direction at least as heavily as the quasi-Newton direction. $\square$
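A small numerical illustration of Theorem B.4, with a deliberately bad quasi-Newton direction (set equal to $+\nabla f$, so its angle with $-\nabla f$ is $\pi$):

```python
import numpy as np

f = lambda x: 0.5 * x @ x
x = np.array([1.0, 2.0])
g = x                                # gradient of 0.5 ||x||^2
d_bad = g                            # ascent direction: angle with -g is pi
phi = lambda t: f(x + t * (1 - t) * (-g) + t**2 * d_bad)

ts = np.linspace(0.0, 1.0, 1001)
t_star = ts[np.argmin([phi(t) for t in ts])]
print(t_star)  # ~0.25, safely inside [0, 1/2]
```

Here the path collapses to $(2t^2 - t)\,\mathbf{g}$, whose coefficient is minimized at $t = 1/4$, so the gradient term dominates exactly as the theorem predicts.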
### B.3.2 Stability Under Numerical Errors

**Theorem B.5** (Numerical Stability): Let $\tilde{\mathbf{d}}_{\text{L-BFGS}} = \mathbf{d}_{\text{L-BFGS}} + \boldsymbol{\epsilon}$, where $\boldsymbol{\epsilon}$ represents numerical error with $\|\boldsymbol{\epsilon}\| \leq \delta$. The perturbed QQN path:

$$\tilde{\mathbf{d}}(t) = t(1-t)(-\nabla f) + t^2 \tilde{\mathbf{d}}_{\text{L-BFGS}}$$

satisfies:

$$\|\tilde{\mathbf{d}}(t) - \mathbf{d}(t)\| \leq t^2\delta$$

*Proof*: Direct computation:

$$\|\tilde{\mathbf{d}}(t) - \mathbf{d}(t)\| = \|t^2(\tilde{\mathbf{d}}_{\text{L-BFGS}} - \mathbf{d}_{\text{L-BFGS}})\| = t^2\|\boldsymbol{\epsilon}\| \leq t^2\delta$$

For small $t$ (the initial descent phase), the error is $O(t^2\delta)$, providing quadratic error suppression. $\square$
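The bound is tight and can be checked in a few lines (vectors here are arbitrary test data):

```python
import numpy as np

rng = np.random.default_rng(2)
g, d = rng.standard_normal(6), rng.standard_normal(6)
eps = 1e-3 * rng.standard_normal(6)      # perturbation; delta = ||eps||
path = lambda t, dd: t * (1 - t) * (-g) + t**2 * dd

for t in (0.1, 0.5, 1.0):
    err = np.linalg.norm(path(t, d + eps) - path(t, d))
    # Theorem B.5 bound, with a hair of slack for floating-point rounding:
    assert err <= t**2 * np.linalg.norm(eps) + 1e-9
```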
## B.4 Computational Complexity

**Theorem B.6** (Computational Complexity): Each QQN iteration requires:

- $O(n)$ operations for path construction
- $O(mn)$ operations for the L-BFGS direction computation
- $O(k)$ function evaluations for the univariate optimization

where $n$ is the dimension, $m$ is the L-BFGS memory size, and $k$ is typically small (3-10).

*Proof*:

1. **Path construction**: Computing $\mathbf{d}(t) = t(1-t)(-\nabla f) + t^2 \mathbf{d}_{\text{L-BFGS}}$ requires $O(n)$ operations for vector arithmetic.
2. **L-BFGS direction**: The two-loop recursion requires $O(mn)$ operations to compute $\mathbf{H}_k\nabla f(\mathbf{x}_k)$.
3. **Line search**: Each function evaluation along the path requires $O(n)$ operations to form $\mathbf{x}_k + \mathbf{d}(t)$, plus the cost of evaluating $f$.

Total complexity per iteration: $O(mn + kn) + k \cdot \text{cost}(f)$. $\square$
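The $O(mn)$ bound in item 2 comes from the standard two-loop recursion; a sketch following Nocedal and Wright's formulation (function and variable names are illustrative):

```python
import numpy as np

def lbfgs_direction(grad, s_list, y_list):
    """Two-loop recursion: returns -H_k @ grad in O(m n).

    s_list, y_list hold the last m curvature pairs
    s_i = x_{i+1} - x_i and y_i = grad_{i+1} - grad_i, oldest first.
    """
    q = grad.astype(float).copy()
    alphas = []  # collected newest-to-oldest
    for s, y in zip(reversed(s_list), reversed(y_list)):
        rho = 1.0 / (y @ s)
        a = rho * (s @ q)
        alphas.append(a)
        q -= a * y
    if s_list:  # initial Hessian scaling gamma_k = s^T y / y^T y
        s, y = s_list[-1], y_list[-1]
        q *= (s @ y) / (y @ y)
    for (s, y), a in zip(zip(s_list, y_list), reversed(alphas)):
        rho = 1.0 / (y @ s)
        b = rho * (y @ q)
        q += (a - b) * s
    return -q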
## B.5 Extensions and Variants

### B.5.1 Gradient Scaling

The basic QQN formulation can be enhanced with gradient scaling:

$$\mathbf{d}(t) = t(1-t)\alpha(-\nabla f) + t^2 \mathbf{d}_{\text{L-BFGS}}$$

where $\alpha > 0$ is a scaling factor.

**Proposition B.1** (Scaling Invariance): For every $\alpha > 0$, the scaled path preserves the defining boundary behavior: $\mathbf{d}(0) = \mathbf{0}$, $\mathbf{d}(1) = \mathbf{d}_{\text{L-BFGS}}$, and $\mathbf{d}'(0) = \alpha(-\nabla f)$, which remains a descent direction. The path stays in $\text{span}\{\nabla f, \mathbf{d}_{\text{L-BFGS}}\}$; $\alpha$ changes only how strongly the path initially follows the gradient.

*Proof*: The endpoint values follow since $t(1-t)$ vanishes at $t = 0$ and $t = 1$. Differentiating at $t = 0$ gives $\mathbf{d}'(0) = \alpha(-\nabla f)$, and $\nabla f^T \mathbf{d}'(0) = -\alpha\|\nabla f\|^2 < 0$. Every point on the path is a linear combination of $-\nabla f$ and $\mathbf{d}_{\text{L-BFGS}}$. $\square$
### B.5.2 Cubic Extension with Momentum

Incorporating momentum leads to cubic interpolation:

$$\mathbf{d}(t) = t(1-t)(1-2t)\mathbf{m} + t(1-t)\alpha(-\nabla f) + t^2 \mathbf{d}_{\text{L-BFGS}}$$

where $\mathbf{m}$ is the momentum vector.

This satisfies:

- $\mathbf{d}(0) = \mathbf{0}$
- $\mathbf{d}'(0) = \alpha(-\nabla f) + \mathbf{m}$
- $\mathbf{d}(1) = \mathbf{d}_{\text{L-BFGS}}$
- $\mathbf{d}''(0) = -6\mathbf{m} - 2\alpha(-\nabla f) + 2\mathbf{d}_{\text{L-BFGS}}$
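These values can be verified by finite differences. Note the sign of the gradient term in $\mathbf{d}''(0)$: since the second derivative of $t(1-t)$ is $-2$, the correct value is $-6\mathbf{m} - 2\alpha(-\nabla f) + 2\mathbf{d}_{\text{L-BFGS}}$. A check with arbitrary test vectors:

```python
import numpy as np

rng = np.random.default_rng(1)
m, g, d = (rng.standard_normal(4) for _ in range(3))  # momentum, gradient, L-BFGS
alpha = 0.7

def d_cubic(t):
    return t * (1 - t) * (1 - 2 * t) * m + t * (1 - t) * alpha * (-g) + t**2 * d

h = 1e-5
d1 = (d_cubic(h) - d_cubic(-h)) / (2 * h)                   # central diff ~ d'(0)
d2 = (d_cubic(h) - 2 * d_cubic(0.0) + d_cubic(-h)) / h**2   # central diff ~ d''(0)

assert np.allclose(d_cubic(0.0), 0.0)
assert np.allclose(d_cubic(1.0), d)
assert np.allclose(d1, alpha * (-g) + m, atol=1e-6)
assert np.allclose(d2, -6 * m - 2 * alpha * (-g) + 2 * d, atol=1e-4)
```

Central differences are exact for cubics up to rounding, so tight tolerances suffice here.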
**Theorem B.7** (Cubic Convergence Properties): The cubic variant maintains all convergence guarantees of the quadratic version while potentially improving the convergence constant through momentum acceleration.

### B.5.3 Trust Region Integration

QQN naturally extends to trust regions by constraining the univariate search:

$$t^* = \arg\min_{t: \|\mathbf{d}(t)\| \leq \Delta} f(\mathbf{x} + \mathbf{d}(t))$$

where $\Delta$ is the trust region radius.

**Proposition B.2** (Trust Region Feasibility): For any $\Delta > 0$, there exists $t_{\max} > 0$ such that $\|\mathbf{d}(t)\| \leq \Delta$ for all $t \in [0, t_{\max}]$.

*Proof*: Since $\mathbf{d}(0) = \mathbf{0}$ and $\mathbf{d}$ is continuous, by continuity of $\|\mathbf{d}(\cdot)\|$ at $0$ the set $\{t : \|\mathbf{d}(t)\| \leq \Delta\}$ contains an interval $[0, t_{\max}]$ for some $t_{\max} > 0$. $\square$
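A feasible $t_{\max}$ can be bracketed numerically. A bisection sketch (names illustrative; it keeps the lower bracket feasible at all times, so the return value is conservative even though $\|\mathbf{d}(t)\|$ need not be monotone in $t$):

```python
import numpy as np

def feasible_t_max(path, delta, lo=0.0, hi=1.0, iters=50):
    """Bisect toward the first t where ||path(t)|| crosses delta.

    lo stays feasible throughout, so the return value always satisfies
    the trust-region constraint; returns hi if the whole interval fits.
    """
    if np.linalg.norm(path(hi)) <= delta:
        return hi
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if np.linalg.norm(path(mid)) <= delta:
            lo = mid
        else:
            hi = mid
    return lo

g = np.array([3.0, 4.0])
d_qn = np.array([10.0, 0.0])
path = lambda t: t * (1 - t) * (-g) + t**2 * d_qn
t_max = feasible_t_max(path, delta=1.0)
assert np.linalg.norm(path(t_max)) <= 1.0
```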
## B.6 Comparison with Related Methods

### B.6.1 Relationship to Trust Region Methods

Trust region methods solve:

$$\min_{\mathbf{s}} \mathbf{g}^T\mathbf{s} + \frac{1}{2}\mathbf{s}^T\mathbf{B}\mathbf{s} \quad \text{s.t.} \quad \|\mathbf{s}\| \leq \Delta$$

QQN can be viewed as solving a related but different problem:

$$\min_{t \geq 0} f(\mathbf{x} + \mathbf{d}(t))$$

where $\mathbf{d}(t)$ is the quadratic path.

**Key differences**:

- Trust region: Solves a constrained quadratic subproblem each iteration
- QQN: Direct 1D optimization along the quadratic path
- Trust region: Requires trust-region radius management
- QQN: Parameter-free, automatic adaptation

### B.6.2 Relationship to Line Search Methods

Traditional line search methods optimize:

$$\min_{\alpha > 0} f(\mathbf{x} + \alpha \mathbf{d})$$

QQN generalizes this by optimizing along a parametric path:

$$\min_{t \geq 0} f(\mathbf{x} + \mathbf{d}(t))$$

The key insight is that the direction itself changes with the parameter, providing additional flexibility.

### B.6.3 Relationship to Hybrid Methods

Previous hybrid approaches typically use discrete switching:

$$\mathbf{d} = \begin{cases}
\mathbf{d}_{\text{gradient}} & \text{if condition A} \\
\mathbf{d}_{\text{quasi-Newton}} & \text{if condition B}
\end{cases}$$

QQN provides continuous interpolation, eliminating discontinuities and the need for switching logic.
