`GeneralizedLinearRegressor(scale_predictors=True)` raises for highly variable column variances #872

mlondschien · 2024-10-30T14:40:18Z

import numpy as np
import tabmat
from glum import GeneralizedLinearRegressor

n = 5_000
p = 1_000

rng = np.random.default_rng(0)
means = rng.exponential(10, p) ** 2
stds = rng.exponential(10, p) ** 2

X = rng.uniform(size=(n, p)) * stds + means
means = X.mean(axis=0)

beta = rng.standard_normal(p)
y = ((X - means) @ beta + rng.standard_normal(n)) > 0

matrix = tabmat.DenseMatrix(X)
standardized_matrix, _, _ = matrix.standardize(np.ones(n), True, True)
sandwich = standardized_matrix.sandwich(np.ones(n))
print(np.abs(sandwich - sandwich.T).max())

GeneralizedLinearRegressor(
    alpha=0, l1_ratio=0, solver="irls-cd", family="binomial", scale_predictors=True
).fit(matrix, y)

prints

1024.0

and then raises

/Users/mlondschien/mambaforge/envs/ivmodels/lib/python3.12/site-packages/glum/_solvers.py:771: RuntimeWarning: invalid value encountered in multiply
  P1wd_1 = linalg.norm(data.P1 * (state.coef + d)[data.intercept_offset :], ord=1)
Traceback (most recent call last):
  File "/Users/mlondschien/code/icu-benchmarks/tmp2.py", line 27, in <module>
    ).fit(matrix, y)
      ^^^^^^^^^^^^^^
  File "/Users/mlondschien/mambaforge/envs/ivmodels/lib/python3.12/site-packages/glum/_glm.py", line 3246, in fit
    coef = self._solve(
           ^^^^^^^^^^^^
  File "/Users/mlondschien/mambaforge/envs/ivmodels/lib/python3.12/site-packages/glum/_glm.py", line 1113, in _solve
    coef, self.n_iter_, self._n_cycles, self.diagnostics_ = _irls_solver(
                                                            ^^^^^^^^^^^^^
  File "/Users/mlondschien/mambaforge/envs/ivmodels/lib/python3.12/site-packages/glum/_solvers.py", line 326, in _irls_solver
    ) = line_search(state, data, d)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/mlondschien/mambaforge/envs/ivmodels/lib/python3.12/site-packages/glum/_solvers.py", line 36, in inner_fct
    out = fct(*args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^
  File "/Users/mlondschien/mambaforge/envs/ivmodels/lib/python3.12/site-packages/glum/_solvers.py", line 771, in line_search
    P1wd_1 = linalg.norm(data.P1 * (state.coef + d)[data.intercept_offset :], ord=1)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/mlondschien/mambaforge/envs/ivmodels/lib/python3.12/site-packages/scipy/linalg/_misc.py", line 146, in norm
    a = np.asarray_chkfinite(a)
        ^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/mlondschien/mambaforge/envs/ivmodels/lib/python3.12/site-packages/numpy/lib/_function_base_impl.py", line 656, in asarray_chkfinite
    raise ValueError(
ValueError: array must not contain infs or NaNs

To my understanding, build_hessian_delta returns a non-symmetric and non-positive definite matrix. This gets fed into enet_coordinate_descent_gram, which tries to find
$\min_w w^T Q w + q^T w + | w |_1$, where Q is our non-positive definite matrix. This is undefined and the resulting output d = enet_coordinate_descent_gram(...) has some +- inf entries.

The text was updated successfully, but these errors were encountered:

mlondschien · 2024-10-31T08:52:08Z

Turns out that matrix.standardize returns col_stds = np.zeros(p).
This does not make sense if weights.sum() != 1 and should then maybe be something like

    def _get_col_stds(self, weights: np.ndarray, col_means: np.ndarray) -> np.ndarray:
        """Get standard deviations of columns."""
        sqrt_arg = transpose_square_dot_weights(self._array, weights) - col_means**2 / weights.sum()
        # Minor floating point errors above can result in a very slightly
        # negative sqrt_arg (e.g. -5e-16). We just set those values equal to
        # zero.
        sqrt_arg[sqrt_arg < 0] = 0
        return np.sqrt(sqrt_arg / weights.sum())

?

If replacing above with

matrix = tabmat.DenseMatrix(X)
standardized_matrix, means, stds = matrix.standardize(np.ones(n) / n, True, True)
sandwich = standardized_matrix.sandwich(np.ones(n) / n)

print(np.abs(sandwich - sandwich.T).max())
print(scipy.linalg.eigvals(sandwich).real.min())

prints

1.4901161193847656e-08
0.307223562513362

Still, the following GLM.fit() call raises.

mlondschien linked a pull request Oct 31, 2024 that will close this issue

Fix glum issue-872 Quantco/tabmat#408

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`GeneralizedLinearRegressor(scale_predictors=True)` raises for highly variable column variances #872

`GeneralizedLinearRegressor(scale_predictors=True)` raises for highly variable column variances #872

mlondschien commented Oct 30, 2024

mlondschien commented Oct 31, 2024

GeneralizedLinearRegressor(scale_predictors=True) raises for highly variable column variances #872

GeneralizedLinearRegressor(scale_predictors=True) raises for highly variable column variances #872

Comments

mlondschien commented Oct 30, 2024

mlondschien commented Oct 31, 2024

`GeneralizedLinearRegressor(scale_predictors=True)` raises for highly variable column variances #872

`GeneralizedLinearRegressor(scale_predictors=True)` raises for highly variable column variances #872