GH-16583: Access GLM Variance-Covariance Matrix with vcov #16586

Closed
manh4wk wants to merge 11 commits into h2oai:master from manh4wk:gh-16583_vcov

manh4wk commented Mar 6, 2025

Made the variance-covariance matrix for GLMs part of the model_output results so that it is accessible from Python and R. The matrix is rearranged in h2o-algos/src/main/java/hex/schemas/GLMModelV3.java so that the Intercept is both the first row and the first column, similar to how the GLM coefficient results are reordered in the same area of the code.

This matrix is now accessible with the glm_model_object.vcov() function in Python and with h2o.vcov(glm_model_object) in R.

This change fixes #16583
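
For illustration, the intercept reordering described above can be sketched in plain Python. This is not the PR's Java code: the function name `move_intercept_first` and the toy matrix are hypothetical, and the only assumption is that the intercept occupies the last row and column of the input.

```python
def move_intercept_first(vcov):
    """Return a copy of a square matrix with its last row/column moved to index 0."""
    n = len(vcov)
    # New index 0 maps to old index n-1; new index k (k >= 1) maps to old index k-1.
    order = [n - 1] + list(range(n - 1))
    return [[vcov[i][j] for j in order] for i in order]


# Toy symmetric matrix (made-up values): coefficients x1, x2, then the intercept.
vcov_intercept_last = [
    [0.04, 0.01, 0.02],
    [0.01, 0.09, 0.03],
    [0.02, 0.03, 0.25],
]
reordered = move_intercept_first(vcov_intercept_last)
for row in reordered:
    print(row)
```

Applying the same permutation to rows and columns keeps the matrix symmetric and keeps each variance on the diagonal next to its own name.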

Author

manh4wk commented Mar 18, 2025

@tomasfryda Do you know if there's anything else I can do right now to see if this passes all the tests, etc.? I saw you mentioned in another thread the team is pretty busy at the moment.

@tomasfryda
Contributor

@manh4wk I don't think so, but I'm no longer active in h2o-3 development, so it'd be better to ask @valenad1 or @maurever. Personally, I would like to thank you for taking the time to contribute to open source, but I have no idea when or if it will get merged.

Author

manh4wk commented Dec 29, 2025

Hi @maurever or @valenad1, can one of you take a look at this? Having the GLM's variance-covariance matrix available would let us do things like run a Wald test on two levels of a categorical variable to see whether they are statistically different or should be combined into a single category.
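
As a rough sketch of that use case, a Wald test for the equality of two coefficients needs only the coefficient vector and the variance-covariance matrix. The function name `wald_test_equal`, the level names, and all numbers below are made up for illustration and do not come from a fitted model or from the h2o API.

```python
import math


def wald_test_equal(coefs, vcov, names, a, b):
    """Wald test of H0: coef[a] == coef[b]; returns (statistic, p_value)."""
    i, j = names.index(a), names.index(b)
    diff = coefs[i] - coefs[j]
    # Var(b_i - b_j) = Var(b_i) + Var(b_j) - 2 * Cov(b_i, b_j)
    var_diff = vcov[i][i] + vcov[j][j] - 2.0 * vcov[i][j]
    stat = diff * diff / var_diff
    # Under H0 the statistic is asymptotically chi-square with 1 degree of
    # freedom, whose upper tail probability is erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Hypothetical values, intercept first to match the reordered matrix.
names = ["Intercept", "region.north", "region.south"]
coefs = [1.20, 0.50, 0.62]
vcov = [
    [0.010, 0.002, 0.002],
    [0.002, 0.020, 0.005],
    [0.002, 0.005, 0.030],
]
stat, p = wald_test_equal(coefs, vcov, names, "region.north", "region.south")
print(f"W = {stat:.3f}, p = {p:.3f}")
```

The covariance term is what makes the full matrix necessary: the standard errors alone give only the diagonal entries.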

Contributor

tomasfryda left a comment

I think it's a good idea to expose the variance-covariance matrix, but that probably depends on @valenad1's decision.
If he agrees, I would suggest fixing the R tests and making sure the column names and row names match. Currently the column names are always lowercased, while the row names aren't. IIRC this may be caused by TwoDimTableV3, so I would consider choosing a different data structure (e.g. H2OFrame).
For example, Intercept vs. intercept:
[Screenshot, 2026-01-02: Intercept vs. intercept in the row and column names]

Note that I didn't do a complete review; I only looked at the R part of the PR.

manualYear <- mFV@model$coefficients_table$year

# compare values from model and obtained manually
for (ind in c(1:length(manuelPValues)))
Contributor

manuelPValues doesn't seem to be defined anywhere. Also, I would recommend using seq_along(x) instead of 1:length(x) (when x is empty, the latter produces c(1, 0)).

Author

Thanks, I updated the R test, that one wasn't working. I also switched to using seq_along as you recommended.

Author

I do think the test would run faster if the data were first pulled into a local data frame and compared there, if that's allowed. Right now it goes through the matrix cell by cell, which took a little while when I was running it.

Contributor

> if that's allowed

It's allowed but I don't think it's necessary.

Comment on lines +64 to +66
doTest("GLM: make sure error is generated when a gbm model calls glm functions", testGBMvcov)
doTest("GLM: make sure error is generated when compute_p_values=FALSE", testGLMvcovcomputePValueFALSE)
doTest("GLM: test variance-covariance values", testGLMPValZValStdError)
Contributor

I would prefer something like:

doSuite("GLM: VCOV support", makeSuite(testGBMvcov, testGLMvcovcomputePValueFALSE, testGLMPValZValStdError))

Author

Thanks, I changed it to doSuite.

Contributor

There is no test that checks whether the implementation actually works. The current tests only check that an error is thrown when vcov is used where it is unsupported.

Author

The last R test was supposed to, but it looks like I must have pushed the wrong thing. It should now check that the covariance matrix returned by the function matches the one obtained by fetching the H2O frame by its reference name.

I thought a little bit about adding in a test that the actual values themselves were calculated correctly, but since the vcov matrix was already in the code, and the p-values all rely on it already, I assume that's already tested somewhere. I'm just making it accessible. I did do testing to make sure they matched what I got with statsmodels in Python though. It looks like they're calculated based on the Observed Information Matrix instead of the Expected Information Matrix, so it was a little tricky.
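
For context, the distinction between the two matrices can be written out with the standard textbook definitions (general notation, not taken from the H2O source). With log-likelihood $\ell(\beta)$ and fitted coefficients $\hat\beta$:

$$
J(\hat\beta) = -\left.\frac{\partial^2 \ell(\beta)}{\partial\beta\,\partial\beta^\top}\right|_{\beta=\hat\beta},
\qquad
I(\beta) = \mathbb{E}\left[-\frac{\partial^2 \ell(\beta)}{\partial\beta\,\partial\beta^\top}\right],
\qquad
\widehat{\mathrm{Var}}(\hat\beta) = J(\hat\beta)^{-1} \ \text{or} \ I(\hat\beta)^{-1}.
$$

Here $J$ is the observed information and $I$ the expected (Fisher) information. For a GLM with a canonical link the two coincide, so the choice only changes the variance-covariance matrix for non-canonical links.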

Contributor

> The last R test was supposed to, but it looks like I must have pushed the wrong thing.

No worries!

> I thought a little bit about adding in a test that the actual values themselves were calculated correctly, but since the vcov matrix was already in the code, and the p-values all rely on it already, I assume that's already tested somewhere. I'm just making it accessible. I did do testing to make sure they matched what I got with statsmodels in Python though. It looks like they're calculated based on the Observed Information Matrix instead of the Expected Information Matrix, so it was a little tricky.

Thank you for being so vigilant! Unfortunately, the person who implemented this no longer works at H2O, so I can't easily ask (and I'm not familiar with this piece of code), but I hope we have the test somewhere.

Author

manh4wk commented Feb 22, 2026

Sorry it took so long @tomasfryda. I got tied up with work for a bit, and then I had some silly scoping issues that took me too long to figure out.

I made the suggestions you recommended. The matrix is now returned as an H2OFrame, with the first column being the names of the variables. The casing of the variable names now matches between the columns and the rows, too. I think we get more precision behind the scenes with the numbers in the H2OFrame, which I also like.

Contributor

tomasfryda left a comment

Everything appears to be working correctly but there are still some minor necessary changes. Thank you!

Comment on lines +301 to +339
double [][] vcov;
vcov = impl.vcov();
double [][] vcov_reordered = new double[vcov.length][vcov.length];

long[] vcov_indices = new long[vcov.length];
for (int i = 0; i < vcov.length; ++i)
    vcov_indices[i] = i;

// move intercept from last row and column to first row and column to match reordering done with coefficients
vcov_reordered[0][0] = vcov[vcov.length - 1][vcov.length - 1];
for(int i = 1; i < vcov.length; ++i) {
    vcov_reordered[0][i] = vcov[vcov.length-1][i-1];
}
for(int i = 1; i < vcov.length; ++i) {
    vcov_reordered[i][0] = vcov[i-1][vcov.length-1];
}
for(int i = 1; i < vcov.length; ++i) {
    for(int j = 1; j < vcov.length; ++j) {
        vcov_reordered[i][j] = vcov[i - 1][j - 1];
    }
}

String [] vcov_colnames = ArrayUtils.append(new String[]{"Names"},Arrays.copyOf(ns,ns.length));
Vec [] vec_arr = new Vec[vcov.length+1]; // one extra vec for column names
Vec.VectorGroup group = new Vec.VectorGroup();
Key<Vec> vec_key = group.addVec();

// load vcov info into vec_arr
vec_arr[0] = Vec.makeVec(vcov_indices, ns, vec_key);
for(int i = 0; i < vcov.length; ++i) {
    vec_key = group.addVec();
    vec_arr[i+1] = Vec.makeVec(vcov_reordered[i], vec_key);
}

Key<Frame> frameKey = Key.make();
Frame frame = new Frame(frameKey, vcov_colnames, vec_arr);
DKV.put(frameKey, frame);
vcov_table = new KeyV3.FrameKeyV3();
vcov_table.fillFromImpl(frameKey);
Contributor

Could you deterministically derive the frameKey from the model_id (e.g. model_id + "vcov_frame") and wrap the frame creation in a condition like if (DKV.getGet(frameKey) == null) {}, so we don't create the frame every time?

Since the frame is in the DKV, this would otherwise lead to a memory leak.

Author

I tried using model_id, but I got the error below about being in a static context. I did find that the model id is saved in the _training_metrics though, and can be accessed like this. Do you know if there's a better approach, or is this good enough?
Key<Frame> frameKey = Key.make(impl._training_metrics.model()._key.toString() + "_vcov_frame");

h2o-3\h2o-algos\src\main\java\hex\schemas\GLMModelV3.java:302: error: non-static variable model_id cannot be referenced from a static context
Key frameKey = Key.make(model_id + "_vcov_frame");
^

Comment thread h2o-py/h2o/model/model_base.py Outdated

def vcov(self):
    """
    Return an H2OFrame with the variance-covariance matrix for a GLM (requires ``compute_p_values=True``).
Contributor

Suggested change
- Return an H2OFrame with the variance-covariance matrix for a GLM (requires ``compute_p_values=True``).
+ :returns: Return an H2OFrame with the variance-covariance matrix for a GLM (requires ``compute_p_values=True``).

Author

manh4wk commented Mar 25, 2026

GitHub sent me an email reminder about this yesterday, so I pushed my latest working version. Like I said in one of the comments above, I got a static context error trying to reference the model_id, so I grabbed it from the only place I could find it.

I was looking at some of the other models, and the ones that have a FrameKeyV3 object just instantiate it in the schema and set it in the model file itself (e.g., the CoxPHModelV3.java schema file and the CoxPHModel.java model file).

I think this might be the sounder approach, but then I think I would also need to add to the remove_impl, writeAll_impl, and readAll_impl functions to keep things clean?

Contributor

maurever commented Apr 2, 2026

Hi @manh4wk. Thank you for your contribution. We currently don't have the capacity to assist you in completing this task. We decided that this change is not critical for improving our codebase, so we have closed this PR.

maurever closed this Apr 2, 2026

Linked issue: Expose GLM Variance-Covariance Matrix