Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GeneralizedLinearRegressor(scale_predictors=True) raises for highly variable column variances #872

Open
mlondschien opened this issue Oct 30, 2024 · 1 comment · May be fixed by Quantco/tabmat#408

Comments

@mlondschien
Copy link
Contributor

import numpy as np
import tabmat
from glum import GeneralizedLinearRegressor

n = 5_000
p = 1_000

rng = np.random.default_rng(0)
means = rng.exponential(10, p) ** 2
stds = rng.exponential(10, p) ** 2

X = rng.uniform(size=(n, p)) * stds + means
means = X.mean(axis=0)

beta = rng.standard_normal(p)
y = ((X - means) @ beta + rng.standard_normal(n)) > 0

matrix = tabmat.DenseMatrix(X)
standardized_matrix, _, _ = matrix.standardize(np.ones(n), True, True)
sandwich = standardized_matrix.sandwich(np.ones(n))
print(np.abs(sandwich - sandwich.T).max())

GeneralizedLinearRegressor(
    alpha=0, l1_ratio=0, solver="irls-cd", family="binomial", scale_predictors=True
).fit(matrix, y)

prints

1024.0

and then raises

/Users/mlondschien/mambaforge/envs/ivmodels/lib/python3.12/site-packages/glum/_solvers.py:771: RuntimeWarning: invalid value encountered in multiply
  P1wd_1 = linalg.norm(data.P1 * (state.coef + d)[data.intercept_offset :], ord=1)
Traceback (most recent call last):
  File "/Users/mlondschien/code/icu-benchmarks/tmp2.py", line 27, in <module>
    ).fit(matrix, y)
      ^^^^^^^^^^^^^^
  File "/Users/mlondschien/mambaforge/envs/ivmodels/lib/python3.12/site-packages/glum/_glm.py", line 3246, in fit
    coef = self._solve(
           ^^^^^^^^^^^^
  File "/Users/mlondschien/mambaforge/envs/ivmodels/lib/python3.12/site-packages/glum/_glm.py", line 1113, in _solve
    coef, self.n_iter_, self._n_cycles, self.diagnostics_ = _irls_solver(
                                                            ^^^^^^^^^^^^^
  File "/Users/mlondschien/mambaforge/envs/ivmodels/lib/python3.12/site-packages/glum/_solvers.py", line 326, in _irls_solver
    ) = line_search(state, data, d)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/mlondschien/mambaforge/envs/ivmodels/lib/python3.12/site-packages/glum/_solvers.py", line 36, in inner_fct
    out = fct(*args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^
  File "/Users/mlondschien/mambaforge/envs/ivmodels/lib/python3.12/site-packages/glum/_solvers.py", line 771, in line_search
    P1wd_1 = linalg.norm(data.P1 * (state.coef + d)[data.intercept_offset :], ord=1)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/mlondschien/mambaforge/envs/ivmodels/lib/python3.12/site-packages/scipy/linalg/_misc.py", line 146, in norm
    a = np.asarray_chkfinite(a)
        ^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/mlondschien/mambaforge/envs/ivmodels/lib/python3.12/site-packages/numpy/lib/_function_base_impl.py", line 656, in asarray_chkfinite
    raise ValueError(
ValueError: array must not contain infs or NaNs

To my understanding, build_hessian_delta returns a non-symmetric and non-positive definite matrix. This gets fed into enet_coordinate_descent_gram, which tries to find
$\min_w w^T Q w + q^T w + | w |_1$, where Q is our non-positive definite matrix. This is undefined and the resulting output d = enet_coordinate_descent_gram(...) has some +- inf entries.

@mlondschien
Copy link
Contributor Author

Turns out that matrix.standardize returns col_stds = np.zeros(p).
This does not make sense if weights.sum() != 1 and should then maybe be something like

    def _get_col_stds(self, weights: np.ndarray, col_means: np.ndarray) -> np.ndarray:
        """Get standard deviations of columns."""
        sqrt_arg = transpose_square_dot_weights(self._array, weights) - col_means**2 / weights.sum()
        # Minor floating point errors above can result in a very slightly
        # negative sqrt_arg (e.g. -5e-16). We just set those values equal to
        # zero.
        sqrt_arg[sqrt_arg < 0] = 0
        return np.sqrt(sqrt_arg / weights.sum())

?

If replacing above with

matrix = tabmat.DenseMatrix(X)
standardized_matrix, means, stds = matrix.standardize(np.ones(n) / n, True, True)
sandwich = standardized_matrix.sandwich(np.ones(n) / n)

print(np.abs(sandwich - sandwich.T).max())
print(scipy.linalg.eigvals(sandwich).real.min())

prints

1.4901161193847656e-08
0.307223562513362

Still, the following GLM.fit() call raises.

@mlondschien mlondschien linked a pull request Oct 31, 2024 that will close this issue
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant