
Way to generate confidence interval for predictions? #756

Open

enriquecardenas24 opened this issue Jan 30, 2024 · 6 comments

@enriquecardenas24

Looking at glum, I don't currently see a way to generate confidence intervals for predictions on an input dataset, but I wanted to check here. I mean something similar to statsmodels' ability to generate confidence intervals (as well as standard errors) for predictions on a given dataset via its PredictionResults class. Is there an equivalent way to do this in glum?

[image attachment]
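For reference, the statsmodels workflow I have in mind looks roughly like this (a minimal sketch with made-up data, not my actual model):

import numpy as np
import pandas as pd
import statsmodels.api as sm

# Made-up data purely for illustration.
rng = np.random.default_rng(0)
X = pd.DataFrame({"x1": rng.normal(size=100), "x2": rng.normal(size=100)})
y = rng.poisson(np.exp(0.5 + 0.3 * X["x1"]))

# Fit a GLM, then ask for confidence intervals on the predicted means.
res = sm.GLM(y, sm.add_constant(X), family=sm.families.Poisson()).fit()
pred = res.get_prediction(sm.add_constant(X))  # returns a PredictionResults object
print(pred.summary_frame(alpha=0.05))  # mean, mean_se, mean_ci_lower, mean_ci_upper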

@MatthiasSchmidtblaicherQC
Contributor

MatthiasSchmidtblaicherQC commented Jan 30, 2024

There is currently no out-of-the-box way to generate confidence intervals for means. But you should be able to do this in a couple of lines if you have estimated a model and the covariance matrix, along the lines of the statsmodels code that you linked to (especially var_pred_mean = (exog * np.dot(covb, exog.T).T).sum(1)). Note that, in a model without identity link, you will need to apply the delta method to have confidence intervals on the level of the original variable (the statsmodels code appears to be only for identity link / linear models).
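Roughly, something along these lines (just a sketch; `exog`, `covb`, and `params` are placeholders for your design matrix, coefficient covariance matrix, and coefficients, not glum attributes, and I assume a log link for the delta-method part):

import numpy as np
import scipy.stats

# Sketch only: exog is the n x p design matrix (including intercept),
# covb the p x p covariance matrix of the coefficients, params the coefficients.
eta = exog @ params                            # linear predictor
var_eta = (exog * (exog @ covb)).sum(axis=1)   # Var(x' beta) for each row
se_eta = np.sqrt(var_eta)
q = scipy.stats.norm.ppf(1 - 0.05 / 2)

# Identity link: the interval on eta is already the interval on the mean.
lower_eta, upper_eta = eta - q * se_eta, eta + q * se_eta

# Log link, option 1: transform the interval endpoints through the inverse link.
lower_mu, upper_mu = np.exp(lower_eta), np.exp(upper_eta)

# Log link, option 2 (delta method): se(mu) ~ |dmu/deta| * se(eta) = exp(eta) * se(eta).
mu = np.exp(eta)
se_mu = mu * se_eta
lower_mu_delta, upper_mu_delta = mu - q * se_mu, mu + q * se_mu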

@enriquecardenas24
Author

I think I've got a draft solution that strips the statsmodels code down. I believe I'll only be working with identity-link models, but I'll keep working on it if that turns out not to be the case. I wanted to post my solution here for future reference:

# Get prediction errors and CIs for each prediction (test).
import numpy as np
import pandas as pd
import scipy.stats
from patsy import dmatrix

# `model`, `in_data`, `cov_params`, `alpha`, and `dt_dedup` are defined earlier in my script.

# Create design matrix using model design info and dataset.
coef_table = model.coef_table()
exog_cols = coef_table.index[1:]  # remove intercept idx
exog = pd.DataFrame(in_data, columns=exog_cols)
formula = ' + '.join(exog_cols)
exog = np.atleast_2d(np.asarray(dmatrix(formula, exog)))

# Calculate values used to generate errors and CIs.
covb = np.array(cov_params)
var_pred_mean = (exog * np.dot(covb, exog.T).T).sum(1)
params = coef_table['coef']
predicted_mean = np.dot(exog, params)  # + offset + exposure

# Calculate errors and confidence intervals.
se = np.sqrt(var_pred_mean)
q = scipy.stats.norm.ppf(1 - alpha / 2.0)
margin = q * se
lower, upper = predicted_mean - margin, predicted_mean + margin

# Record errors and confidence intervals.
new_vals = pd.DataFrame([se, lower, upper]).T
dt_dedup[['error', 'Low_C_I', 'High_C_I']] = new_vals

@enriquecardenas24
Author

Not sure if the above code is entirely correct, actually. It's a good starting point, but the statsmodels method is a bit more complicated than I originally thought.

Are there any plans to add this methodology to glum soon? Perhaps in v3? Regardless, I'll continue to work on the problem and hopefully post here when done.

@MatthiasSchmidtblaicherQC
Contributor

MatthiasSchmidtblaicherQC commented Feb 26, 2024

I see scope for this feature in glum. Before deciding whether to implement it, we would first need to address some questions though:

  1. How could confidence intervals for predictions be incorporated into the API without distracting the standard user who just wants point predictions?
  2. Are confidence intervals really what one wants statistically for predictions? Many use cases require prediction intervals.
  3. These confidence intervals can only be computed for unregularised GLMs. If we really care about prediction intervals, then a model-agnostic method such as conformal prediction would be better (a rough sketch follows below). See also this library.

Adding the feature does not need to be coupled to v3 release because it would not be a breaking change.
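For point 3, a bare-bones split conformal sketch (model-agnostic, so it also works for regularized models; none of these names are glum API):

import numpy as np

def split_conformal_interval(model, X_calib, y_calib, X_new, alpha=0.1):
    # model: any fitted regressor with a .predict method, trained on a separate split.
    # Nonconformity scores on the held-out calibration set.
    scores = np.abs(np.asarray(y_calib) - np.asarray(model.predict(X_calib)))
    n = len(scores)
    # Finite-sample corrected quantile of the scores.
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    q = np.quantile(scores, level, method="higher")
    preds = np.asarray(model.predict(X_new))
    return preds - q, preds + q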

@stanmart
Collaborator

Agreed. While implementing confidence intervals for the predicted mean should be relatively straightforward, I'm not sure how useful they would be for the reasons @MatthiasSchmidtblaicherQC mentioned. For prediction tasks, people are usually more interested in

  • penalized models, for which we could not compute these confidence intervals
  • prediction intervals, but those are kind of tricky to compute for GLMs, and don't always make much sense (e.g. for binary outcomes)

OTOH, if users would find this feature useful despite these limitations, then it seems like a good addition for some future release.

Also, @enriquecardenas24, thanks so much for posting the code snippets! Those should be super useful for others looking for this feature. Please keep the thread updated if you make more progress.

@enriquecardenas24
Author

Here's something I've been working on, based on statsmodels v0.13.5.

It's missing a few features present in the statsmodels generalized_linear_model.py source code for the get_prediction function and its subfunctions (Prediction results class 1 and class 2), but this works for my purposes (may edit later if I'm able to find the supporting glum functions). Looking at these links, the functions are a bit different in the current v0.15, but I believe v0.13.5 should be more or less the same.

There are some helper submodules used in my original version, which I can post if needed, but the basic process is as follows:

import numpy as np
import pandas as pd
import glum
from scipy import stats
from patsy import dmatrix

def get_pred_github(
        model: glum.GeneralizedLinearRegressor,
        input_dataset: pd.DataFrame,
        alpha: float = 0.05,
):
    # Get exog data.
    # A bit of magic is done here to get an array version of the data with dummy columns.
    exog_data, pred_data = get_exog_pred_data(input_dataset)
    # Can expand on this process if necessary, but the essence of this is:
    # nrows = num rows in input_dataset; ncols = num rows in model.coef_table() minus 1 (intercept).
    # This is done for categorical variables that may not have all data present.
    # Ex: If input_dataset is a subset of the entire dataset, then for a given categorical field in the dataset,
    # tabmat.from_pandas(df = input_dataset) doesn't capture all categories if any are missing in input_dataset.
    # pred_data is the same as exog_data, except the reference column for the base level is included.

    # Record predictions.
    wts, offset = input_dataset['Weight'], input_dataset['Offset']
    preds = model.predict(pred_data, wts, offset)

    # Get prediction errors and CIs for each prediction.
    # Create design matrix using model design info and dataset.
    cols = model.coef_table().index[1:]    # ignore intercept idx
    exog = pd.DataFrame(exog_data, columns=cols)
    formula = ' + '.join(cols)
    exog = np.atleast_2d(np.asarray(dmatrix(formula, exog)))

    # Get covariance parameters, predicted means, variances of predicted means.
    covb = model.covariance_matrix_  # make sure model feature names are set before doing this
    predicted_mean = np.log(np.array(preds))    # back to the linear-predictor scale (log link)
    var_pred_mean = (exog * np.dot(covb, exog.T).T).sum(1)

    # Other parameters from statsmodels to be appended.
    # exposure = offset = row_labels = None
    # transform = True
    # pred_kwds = {'exposure': exposure, 'offset': offset, 'linear': True}

    # Get standard errors and confidence intervals.
    # se = self.se_obs if obs else self.se_mean  # obs=False always, in my case
    se = np.sqrt(var_pred_mean)
    q = stats.norm.ppf(1 - alpha / 2.)
    lower = predicted_mean - q * se
    upper = predicted_mean + q * se
    cis = np.column_stack((lower, upper))
    ci_mean = np.exp(cis)   # for log link; need a way to do self.link.inverse(cis)

    summary_frame = pd.DataFrame()
    summary_frame['mean_se'] = se
    summary_frame['mean_ci_lower'] = ci_mean[:, 0]
    summary_frame['mean_ci_upper'] = ci_mean[:, 1]

    return summary_frame
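
For what it's worth, I call it roughly like this (illustrative):

summary = get_pred_github(model, input_dataset, alpha=0.05)
# summary has one row per prediction, with columns mean_se, mean_ci_lower, mean_ci_upper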

Here's a snapshot of the exog_data array, which contains all of the data in the input_dataset with dummy columns for categorical levels:
[image: snapshot of the exog_data array with dummy columns]

The first few columns are levels of the categorical field "Year," hence the 0s. Only in pred_data is the base level of the Year filled with 1s.
