Way to generate confidence interval for predictions? #756
Comments
There is currently no out-of-the-box way to generate confidence intervals for means. But you should be able to do this in a couple of lines if you have estimated a model and the covariance matrix, along the lines of the statsmodels code that you linked to (especially |
Think I've got a draft for a solution that strips some of the statsmodels code down. I believe I'll only be working with models that have identity link, but I'll continue working if I find otherwise. Wanted to post my solution here for future reference:
import numpy as np
import pandas as pd
import scipy.stats
from patsy import dmatrix
# Assumed to be defined already: model (fitted glum model), in_data (input rows),
# cov_params (covariance matrix of the coefficients), alpha (e.g. 0.05), dt_dedup (output frame).
# Get prediction errors and CIs for each prediction (test).
# Create design matrix using model design info and dataset.
coef_table = model.coef_table()
exog_cols = coef_table.index[1:] # remove intercept idx
exog = pd.DataFrame(in_data, columns = exog_cols)
formula = ' + '.join(exog_cols)
exog = np.atleast_2d(np.asarray(dmatrix(formula, exog))) # dmatrix adds the intercept column back
# Calculate values used to generate errors and CIs.
covb = np.array(cov_params)
var_pred_mean = (exog * np.dot(covb, exog.T).T).sum(1) # row-wise x' * covb * x
params = coef_table['coef']
predicted_mean = np.dot(exog, params) #+ offset + exposure
# Calculate errors and confidence intervals.
se = np.sqrt(var_pred_mean)
dist_args = ()
q = scipy.stats.norm.ppf(1 - alpha / 2.0, *dist_args)
margin = q * se
lower, upper = predicted_mean - margin, predicted_mean + margin
# Record errors and confidence intervals.
new_vals = pd.DataFrame([se, lower, upper]).T
dt_dedup[['error', 'Low_C_I', 'High_C_I']] = new_vals
Not sure if the above code is entirely correct, actually. It's a good starting point, but the statsmodels method is a bit more complicated than I originally thought. Are there any plans to add this methodology to glum soon? Perhaps in v3? Regardless, I'll continue to work on the problem and hopefully post here when done. |
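For reference, the core of the calculation in the draft above can be written without glum or patsy. The following is a minimal sketch, assuming an identity link, a design matrix `X` whose first column is the intercept, and a coefficient covariance matrix `cov_beta` that is already available. The function name `mean_confidence_interval` and the toy numbers are made up for illustration, and `statistics.NormalDist` stands in for `scipy.stats.norm.ppf`.

```python
from statistics import NormalDist

import numpy as np


def mean_confidence_interval(X, beta, cov_beta, alpha=0.05):
    """Wald interval for the linear predictor X @ beta (identity link)."""
    X = np.atleast_2d(np.asarray(X, dtype=float))
    beta = np.asarray(beta, dtype=float)
    cov_beta = np.asarray(cov_beta, dtype=float)
    mean = X @ beta
    # Row-wise quadratic form x_i' * cov_beta * x_i; avoids building an n x n matrix.
    var = np.einsum("ij,jk,ik->i", X, cov_beta, X)
    se = np.sqrt(var)
    # Two-sided standard normal quantile, e.g. about 1.96 for alpha = 0.05.
    q = NormalDist().inv_cdf(1 - alpha / 2)
    return mean, se, mean - q * se, mean + q * se


# Toy inputs, invented for illustration; the first column of X is the intercept.
X = np.array([[1.0, 0.0], [1.0, 2.0]])
beta = np.array([1.0, 0.5])
cov_beta = np.array([[0.04, 0.0], [0.0, 0.01]])
mean, se, lower, upper = mean_confidence_interval(X, beta, cov_beta)
```

The `einsum` expression computes the same quantity as `(exog * np.dot(covb, exog.T).T).sum(1)` in the draft. This interval only reflects uncertainty in the estimated coefficients, so it is an interval for the predicted mean, not a prediction interval for new observations.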
I see scope for this feature in glum. Before deciding whether to implement it, we would first need to address some questions though:
Adding the feature does not need to be coupled to v3 release because it would not be a breaking change. |
Agreed. While implementing confidence intervals for the predicted mean should be relatively straightforward, I'm not sure how useful they would be for the reasons @MatthiasSchmidtblaicherQC mentioned. For prediction tasks, people are usually more interested in
OTOH, if users would find this feature useful despite these limitations, then it seems like a good addition for some future release. Also, @enriquecardenas24, thanks so much for posting the code snippets! Those should be super useful for others looking for this feature. Please keep the thread updated if you make more progress. |
Here's something I've been working on, based on statsmodels v0.13.5. It's missing a few features present in the statsmodels There are submodules used in the original version which I can post, but the basic process is as follows:
import glum
import numpy as np
import pandas as pd
from patsy import dmatrix
from scipy import stats

def get_pred_github(
    model: glum.GeneralizedLinearRegressor,
    input_dataset: pd.DataFrame,
    alpha: float = 0.05,
):
    # Get exog data.
    # A bit of magic is done here to get an array version of the data with dummy columns.
    # (get_exog_pred_data is a helper from the original version; it is not shown here.)
    exog_data, pred_data = get_exog_pred_data(input_dataset)
    # Can expand on this process if necessary, but the essence of this is:
    # nrows = num rows in input_dataset; ncols = num rows in model.coef_table() minus 1 (intercept).
    # This is done for categorical variables that may not have all data present.
    # Ex: If input_dataset is a subset of the entire dataset, then for a given categorical field in the dataset,
    # tabmat.from_pandas(df = input_dataset) doesn't capture all categories if any are missing in input_dataset.
    # pred_data is the same as exog data, except the reference for the control is included.

    # Record predictions.
    wts, offset = input_dataset['Weight'], input_dataset['Offset']
    preds = model.predict(pred_data, wts, offset)

    # Get prediction errors and CIs for each prediction.
    # Create design matrix using model design info and dataset.
    cols = model.coef_table().index[1:] # ignore intercept idx
    exog = pd.DataFrame(exog_data, columns = cols)
    formula = ' + '.join(cols) # was exog_cols, which is not defined in this function
    exog = np.atleast_2d(np.asarray(dmatrix(formula, exog)))

    # Get covariance parameters, predicted means, variances of predicted means.
    covb = model.covariance_matrix_ # make sure model feature names are set before doing this
    predicted_mean = np.log(np.array(preds)) # for log link
    var_pred_mean = (exog * np.dot(covb, exog.T).T).sum(1)

    # Other parameters from statsmodels to be appended.
    #exposure=offset=row_labels=None
    #transform=True
    #pred_kwds = {'exposure': exposure, 'offset': offset, 'linear': True}

    # Get standard errors and confidence intervals.
    #se = self.se_obs if obs else self.se_mean # obs=False always, in my case
    se = np.sqrt(var_pred_mean)
    dist_args = ()
    q = stats.norm.ppf(1 - alpha / 2., *dist_args)
    lower = predicted_mean - q * se
    upper = predicted_mean + q * se
    cis = np.column_stack((lower, upper))
    ci_mean = np.exp(cis) # for log link; need way to do self.link.inverse(cis)

    summary_frame = pd.DataFrame()
    summary_frame['mean_se'] = se
    summary_frame['mean_ci_lower'] = ci_mean[:, 0]
    summary_frame['mean_ci_upper'] = ci_mean[:, 1]
    return summary_frame
Here's a snapshot of the The first few columns are levels of the categorical field "Year," hence the 0s. Only in |
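The log-link step in the function above (build the interval on the link scale, then back-transform) can be isolated from the glum and patsy details. This is a minimal sketch under stated assumptions: `X` already contains the intercept column, `cov_beta` is the coefficient covariance matrix, any offset is treated as a fixed constant and left out, and the inverse link is passed in as a plain callable. The name `link_scale_interval` and the toy numbers are hypothetical, not part of glum or statsmodels.

```python
from statistics import NormalDist

import numpy as np


def link_scale_interval(X, beta, cov_beta, inverse_link=np.exp, alpha=0.05):
    """Interval for the mean: Wald interval on the link scale, mapped back."""
    X = np.atleast_2d(np.asarray(X, dtype=float))
    beta = np.asarray(beta, dtype=float)
    cov_beta = np.asarray(cov_beta, dtype=float)
    eta = X @ beta  # linear predictor (link scale)
    # Standard error of the linear predictor, row by row.
    se_eta = np.sqrt(np.einsum("ij,jk,ik->i", X, cov_beta, X))
    q = NormalDist().inv_cdf(1 - alpha / 2)
    # Assumes the inverse link is monotone increasing, so endpoints keep their order.
    mean = inverse_link(eta)
    lower = inverse_link(eta - q * se_eta)
    upper = inverse_link(eta + q * se_eta)
    return mean, se_eta, lower, upper


# Toy inputs, invented for illustration (log link, intercept in the first column).
X = np.array([[1.0, 0.0], [1.0, 1.0]])
beta = np.array([0.0, np.log(2.0)])
cov_beta = np.array([[0.01, 0.0], [0.0, 0.01]])
mean, se_eta, lower, upper = link_scale_interval(X, beta, cov_beta)
```

With a log link the resulting interval is asymmetric around the mean on the response scale but stays strictly positive. Note that the standard error here, like `mean_se` in the function above, is on the link scale; a response-scale standard error would need an extra delta-method step.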
Looking at glum, I don't currently see a way to generate confidence intervals for predictions on an input dataset, although I want to check here. By this, I mean something similar to the statsmodels ability to generate confidence intervals (as well as error values) for predictions on a given dataset by way of the PredictionResults class. Is there any equivalent way to do this in glum?