[FEAT] Add sparse non-negative OLS and WLS via QP for MinTraceSparse #319

Open · wants to merge 4 commits into base: main
Conversation

@christophertitchen (Contributor) commented Dec 19, 2024

Introduction

This PR improves the current sparse non-negative reconciliation in MinTraceSparse by expanding upon the heuristic from my previous PR, #284, to raise the quality of the non-negative reconciled forecasts for large numbers of time series. It converts the OLS and WLS estimators into quadratic programming (QP) problems with non-negativity constraints, analogous to the approach in MinTrace, but crucially avoids evaluating the dense $S^{T}W^{-1}S$, thereby maintaining the sparsity of the problem.

The implementation in this PR is robust and performant, allowing users dealing with large numbers of time series—which is extremely common in industrial contexts—to reliably generate high quality non-negative OLS and WLS reconciled forecasts in a fraction of the time.

It adds a new QP solver, Clarabel.rs, as a dependency to the package, which uses an interior point algorithm and is now distributed as the standard solver in CVXPY. The new behaviour is triggered by default upon instantiation of MinTraceSparse when nonnegative=True is passed, but can be toggled off by setting qp=False, which reverts to the method detailed in #284.

I will add GLS estimators, mint_cov and mint_shrink, in a future PR as we can exploit the symmetry of $W$ by making it an upper-triangular matrix, maintaining a reasonable degree of sparsity and decreasing the time taken to invert it.
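As a rough illustration of the saving from exploiting the symmetry of $W$, the sketch below stores only the upper triangle of a small made-up symmetric matrix with scipy. The convention of passing only the upper triangle of the quadratic cost matrix to the solver is an assumption about the eventual GLS implementation, not code from this PR.

```python
import numpy as np
import scipy.sparse as sp

# Made-up symmetric W (e.g. a covariance estimate).
W = sp.csr_matrix(np.array([[2.0, 0.5, 0.0],
                            [0.5, 3.0, 0.1],
                            [0.0, 0.1, 1.0]]))

# Keep only the upper triangle: each off-diagonal entry is stored
# once instead of twice, reducing the nnz of the problem data.
W_ut = sp.triu(W).tocsr()

print(W.nnz, W_ut.nnz)  # 7 nonzeros in W, 5 in its upper triangle
```

The denser $W$ becomes, the closer the saving approaches a factor of two on the off-diagonal entries.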

Background

$\begin{split} \underset{x \in \mathbb{R}^{n}}{\text{minimise}} \quad & \frac{1}{2} x^{T}W^{-1}x - \left( W^{-1}\hat{y}_{h} \right)^{T}x \end{split}$

$\begin{split} \text{subject to} \quad & \begin{pmatrix} I_{n_{a}} & -A \\ 0_{n_{b}, n_{a}} & -I_{n_{b}} \end{pmatrix}x + s = 0 \\ \\ & s \in \mathcal{Z}_{n_{a}} \times \mathbb{R}^{n_{b}}_{+} \end{split}$

where $A$ is the "aggregation matrix", which describes the relationship between the bottom level (leaf node) time series and the aggregated (parent node) time series, $W^{-1}$ is the precision matrix for the problem, $\hat{y}_{h}$ are the base forecasts for a particular horizon step, $h$, and $s$ is a slack variable constrained to the Cartesian product of two convex cones: $\mathcal{Z}$, a zero cone to model the equality constraints, and $\mathbb{R}_{+}$, a non-negative cone to model the inequality constraints.

This formulation is the most efficient as $\left(I_{n_{a}} \quad -A\right)$ has the set of all reconciled forecasts in its null space, so the graph structure is encoded in just $n_{a}$ equality constraints, with a remaining $n_{b}$ inequality constraints to constrain $x$ to the convex, non-negative feasible region of the solution space.
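The null-space property can be checked directly with scipy on a toy two-level hierarchy (the matrices below are illustrative, not taken from the package):

```python
import numpy as np
import scipy.sparse as sp

# Toy hierarchy: one aggregate, two bottom-level series (total = b1 + b2).
A = sp.csr_matrix(np.array([[1.0, 1.0]]))     # aggregation matrix, (n_a, n_b)
n_a, n_b = A.shape
S = sp.vstack([A, sp.identity(n_b)]).tocsr()  # summing matrix, (n, n_b)

# Equality block (I_{n_a}  -A): every coherent vector x = S b lies in its
# null space, so coherence costs only n_a equality constraints.
C = sp.hstack([sp.identity(n_a), -A]).tocsr()

b = np.array([3.0, 4.0])   # arbitrary bottom-level values
x = S @ b                  # a coherent point, [7, 3, 4]
print(C @ x)               # zero vector: x satisfies the equality constraints
```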

We try this approach for each $h$, and if the solver solves the problem within the default tolerance of $1\mathrm{e}{-8}$, the primal solutions are post-processed by clipping to $[0, \infty)$ and aggregating with $S$.

An alternative way of framing the problem, given in the Wickramasuriya, Turlach, and Hyndman paper, which still maintains the sparsity, is shown below.

$\begin{split} \underset{x \in \mathbb{R}^{n_{b}}, \; z \in \mathbb{R}^{n}}{\text{minimise}} \quad & \frac{1}{2} \begin{pmatrix} x \\ z \end{pmatrix}^{T} \begin{pmatrix} 0_{n_{b}, n_{b}} & 0_{n_{b}, n} \\ 0_{n, n_{b}} & W^{-1} \end{pmatrix} \begin{pmatrix} x \\ z \end{pmatrix} \end{split}$

$\begin{split} \text{subject to} \quad & Sx - \hat{y}_{h} = z \\ & \begin{pmatrix} S & -I_{n} \\ -I_{n_{b}} & 0_{n_{b}, n} \end{pmatrix} \begin{pmatrix} x \\ z \end{pmatrix} + s = \begin{pmatrix} \hat{y}_{h} \\ 0_{n_{b}} \end{pmatrix} \\ & s \in \mathcal{Z}_{n} \times \mathbb{R}^{n_{b}}_{+} \end{split}$

where $S$ is the "summing matrix", and $z$ is an auxiliary variable.

This increases the number of decision variables from $n$ to $n + n_{b}$ and the number of nonzeros (nnz) of the linear constraints matrix by $2n_{b}$ relative to the first approach, so it is a tad slower in most cases, at least those with a diagonal $W$.
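The nnz difference between the two formulations can be verified on a small example hierarchy (the hierarchy is made up; both constraint matrices follow the block structures given above):

```python
import numpy as np
import scipy.sparse as sp

# Small 2-level hierarchy: one total, two mid-level nodes, four bottom series.
A = sp.csr_matrix(np.array([[1.0, 1.0, 1.0, 1.0],
                            [1.0, 1.0, 0.0, 0.0],
                            [0.0, 0.0, 1.0, 1.0]]))
n_a, n_b = A.shape
n = n_a + n_b
S = sp.vstack([A, sp.identity(n_b)]).tocsr()

# First formulation: decision variable x in R^n.
M1 = sp.bmat([[sp.identity(n_a), -A],
              [None, -sp.identity(n_b)]]).tocsr()

# Second formulation: decision variable (x, z) in R^{n_b + n}.
M2 = sp.bmat([[S, -sp.identity(n)],
              [-sp.identity(n_b), None]]).tocsr()

print(M1.nnz, M2.nnz)  # M2 carries exactly 2 * n_b more nonzeros than M1
```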

Benchmarks

As expected, this implementation provides the same MAE and RMSE as MinTrace, but is significantly faster. The performance of the heuristic solution in #284 is surprisingly good for OLS and WLS with structural scaling, with MAE extremely close to the QP approaches in a fraction of the time, but with a noticeably worse RMSE at the larger scale of M5 $\left( 30\,490 \times 42\,840 \right)$. However, the heuristic performs poorly for WLS with variance scaling in these benchmarks, even for Tourism $\left( 304 \times 555 \right)$, and should be investigated further. If this is also the case when nonnegative is False, one idea going forward would be for wls_var to use the QP approach without the non-negativity constraints, i.e. just $\left(I_{n_{a}} \quad -A\right)x + s = 0$ with $s \in \mathcal{Z}_{n_{a}}$, to ensure coherence.

The benchmarks show that the default should be qp=True to provide the highest quality solutions.
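As a cheap dense sanity check for the OLS case ($W = I$), the non-negative reconciliation QP reduces to a non-negative least squares fit of the base forecasts onto $S$. The sketch below uses made-up forecasts and scipy's nnls rather than either QP solver, so it is only practical for small hierarchies:

```python
import numpy as np
from scipy.optimize import nnls

# Toy hierarchy: total = b1 + b2, with deliberately incoherent base
# forecasts containing a negative bottom-level value.
S = np.array([[1.0, 1.0],
              [1.0, 0.0],
              [0.0, 1.0]])
y_hat = np.array([5.0, 6.0, -2.0])

# For W = I, the QP is argmin ||S b - y_hat||_2 subject to b >= 0.
b, _ = nnls(S, y_hat)
y_tilde = S @ b  # coherent, non-negative reconciled forecasts
print(b, y_tilde)
```

Here the negative base forecast drives its bottom-level series to the boundary ($b_2 = 0$) and the remaining series absorb the adjustment.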

Wall Time

| Dataset | Method | MinTrace | MinTraceSparse (#284) | MinTraceSparse (#314) |
| --- | --- | --- | --- | --- |
| M5 | ols | $\textcolor{red}{\gt 2\, hr}$ | $857\, ms$ | $19\, s$ |
| M5 | wls_struct | $\textcolor{red}{\gt 2\, hr}$ | $175\, ms$ | $17.9\, s$ |
| M5 | wls_var | $\textcolor{red}{\gt 2\, hr}$ | $\textcolor{red}{105\, s}$ | $29.8\, s$ |
| Tourism | ols | $147\, ms$ | $21.2\, ms$ | $104\, ms$ |
| Tourism | wls_struct | $112\, ms$ | $7.81\, ms$ | $90\, ms$ |
| Tourism | wls_var | $109\, ms$ | $\textcolor{red}{258\, ms}$ | $103\, ms$ |

MASE

| Dataset | Method | MinTrace | MinTraceSparse (#284) | MinTraceSparse (#314) |
| --- | --- | --- | --- | --- |
| M5 | ols | ? | $1.00$ ² | $0.99$ ² |
| M5 | wls_struct | ? | $1.00$ ² | $0.98$ ² |
| M5 | wls_var | ? | $1.00$ ² | $\textcolor{green}{0.12}$ ² |
| Tourism | ols | $1.00$ ¹ | $1.01$ ¹ | $1.00$ ¹ |
| Tourism | wls_struct | $1.00$ ¹ | $1.01$ ¹ | $1.00$ ¹ |
| Tourism | wls_var | $1.00$ ¹ | $1.00$ ¹ | $0.99$ ¹ |

¹ scaled by the MAE of the out-of-sample MinTrace solution
² scaled by the MAE of the out-of-sample heuristic solution, as MinTrace did not finish

RMSSE

| Dataset | Method | MinTrace | MinTraceSparse (#284) | MinTraceSparse (#314) |
| --- | --- | --- | --- | --- |
| M5 | ols | ? | $1.00$ ² | $\textcolor{green}{0.89}$ ² |
| M5 | wls_struct | ? | $1.00$ ² | $0.98$ ² |
| M5 | wls_var | ? | $1.00$ ² | $\textcolor{green}{0.13}$ ² |
| Tourism | ols | $1.00$ ¹ | $\textcolor{red}{1.08}$ ¹ | $1.00$ ¹ |
| Tourism | wls_struct | $1.00$ ¹ | $1.00$ ¹ | $1.00$ ¹ |
| Tourism | wls_var | $1.00$ ¹ | $1.00$ ¹ | $0.99$ ¹ |

¹ scaled by the RMSE of the out-of-sample MinTrace solution
² scaled by the RMSE of the out-of-sample heuristic solution, as MinTrace did not finish

@elephaint (Contributor) commented Dec 19, 2024

Thanks this is great work!

Some minor comments so far:

  • You can fix the failing linting check by clearing the notebook and making sure the notebooks are cleaned too. See contributing.md in the main repo folder.
  • I'm not too happy about adding two more solvers as dependencies, but if it makes sense performance-wise we could go for it. I do think I'd rather keep one QP solver instead of two, if possible. So we may want to get rid of quadprog then.
  • I think it's important to have a test showing the behavior of the old method compared to the two new methods, both in terms of accuracy and speed. For example, a test that reproduces the table you show above for M5 and Tourism. If we add a lot of additional complexity and dependencies, we just want to make sure we have validated the benefits in accuracy and efficiency.
  • Maybe we should opt for choosing the method of preference over doing everything, i.e. choose 'heuristic', 'qp_osqp', 'qp_clarabel', or 'combined' as a string choice in the instantiation of MinTraceSparse. The default is now 'combined', i.e. it performs a lot of work by doing everything, right? I think the default should depend on how clear the benefit is based on the tests. E.g. if there is a very minor benefit in speed and accuracy, keep 'heuristic' as the default, and the new solvers are not added as dependencies but something users should install themselves when calling these options.
  • How do we know $x_{0}$ is a good warm start? Do we arrive at the same solution, albeit with more iterations, if we start from a different (random) initialization?

@christophertitchen (Contributor, Author) commented


Hi, Olivier! I hope you're well.

The linting issue is strange, thanks for pointing that out! I thought I ran nbdev_clean before staging, but I'll double check my environment before my next commit.

Yup, I also thought that three is close to excessive. The good news is that they're both light on dependencies, as both depend on numpy and scipy, which don't conflict with ours. OSQP has a third dependency called qdldl.

Would you rather scale it down to one sparse QP solver? It's straightforward to express the secondary approach in a problem format that OSQP will accept as the only change is reformulating the linear constraints, or vice versa for Clarabel. In my experience, the trade-off between the two in this context is that OSQP is slightly faster than Clarabel but slightly less reliable, particularly for the second approach unless you set higher accuracy, hence the incorporation of the two solvers.

Unfortunately, we have to keep quadprog for now until MinTraceSparse has all of the functionality of MinTrace, as neither OSQP nor Clarabel support dense matrices.

My design decision for the PR just comes down to my workflow in my job and what I think is most valuable in a large business context. When evaluating tens or hundreds of millions of forecasts, it's important to have a process that's robust and performant. At that scale, it's inconsequential whether the QP solvers struggled with a particular reconciliation problem, or which, if any, of the two approaches were successful, as long as the best estimate coherent non-negative forecasts are returned for a given $\hat{y}_{h}$ and method, which is ultimately the output that's useful in a business context. Hence the fallback to the secondary approach with Clarabel to try to obtain a high quality solution if the primary approach with OSQP is unsuccessful, along with the important use of the heuristic solution as a baseline and a safeguard, as both are fast to compute.

I understand if you think this methodology isn't suitable for most users of the package. If you'd like, I can easily simplify it and scale it way back to something like MinTrace, where it attempts to solve the problem (we can just use the primary approach) and raises an exception if it's unsuccessful, rather than warning the user and returning the best estimate, like the heuristic solution, for them to use and evaluate regardless.

I find the QP solutions almost always improve upon the heuristic solution if successful, which you can measure by using the heuristic solution as a baseline and calculating metrics like the MSSE or RMSSE. I can put something together that quantitatively shows the differences between the QP approaches and the heuristic approach and measures the alignment of the two approaches with the dense approach in MinTrace at different tolerances.

This is purely anecdotal, but using $x_{0}$ as a warm start helps to either decrease the number of iterations to arrive at the same solution, or make a problem that is supposedly dual infeasible be solved, at least with reduced accuracy, within the maximum number of iterations. In the first case, the overhead to generate $x_{0}$ isn't usually worth the reduced number of iterations in my opinion, but given that the latter case can also occur, along with the use of $x_{0}$ as a baseline and a safeguard, it's something that I find useful.

Let me know what you think and I can do the tests and make the changes. 😄

@elephaint (Contributor) commented

> Hi, Olivier! I hope you're well.

Thanks for the elaborate reply, hope you're well too. Bit of holidays, so late reply from my side.

> The linting issue is strange, thanks for pointing that out! I thought I ran nbdev_clean before staging, but I'll double check my environment before my next commit.

In my IDE (VSCode) I need to make sure the notebooks are cleared, then saved, and then cleaned. That usually fixes it.

> [...]

I'm fine with adding dependencies, and the proposed workflow, if we have a test demonstrating the clear benefit over the scaled down / simple solution. In my opinion, the result of that test/comparison decides what the default should be. Little benefit? Keep default at 'heuristic' or 'osqp', or whatever we think is the best default. Large benefit? Keep default at 'combined', which is the workflow you've suggested at this moment. The benefit should be well balanced between additional computational cost vs accuracy. That's why the test for multiple datasets, including runtime per solution and accuracy per solution is necessary. Then we can make that decision. Wdyt?

@christophertitchen (Contributor, Author) commented Jan 3, 2025

I definitely agree on the testing approach, but honestly that is a hell of a lot of testing for this feature. I should've just kept it simple! 😅

I'm going to remove all of my code for the warm start, "secondary approach", and solution evaluation, and just solve the "primary approach" using Clarabel, triggered when qp is True. It can be aligned with the analogous functionality in MinTrace by raising an exception if the solver fails to find a primal solution within the prescribed tolerance or if it signals primal or dual infeasibility.

It should then be straightforward to add GLS at a later date, as it's going to be impractical to warm start this way when we use the covariance matrix or Schäfer-Strimmer shrinkage estimator as $W$. Plus, I have a hunch that the change in sparsity in this case may actually make the "secondary approach" relatively more performant, in which case I can simply swap it out.

@elephaint (Contributor) commented

> I definitely agree on the testing approach, but honestly that is a hell of a lot of testing for this feature. I should've just kept it simple! 😅

Ok, is it really that much? I was thinking of something like the below, and just evaluating everything at an eps of 1e-5 or 1e-6. That's 24 experiments, for which I'd assume the runtime of the Sparse variants is super fast, so in terms of time only the dense variant is time-consuming, no? I think the work you did is great and I don't want you to have to spend hours testing.

[image: proposed experiment grid]
