make ordinal regression great again #5066

drbenvincent · 2021-10-09T14:49:04Z

drbenvincent
Oct 9, 2021
Collaborator

I have been exploring ordinal regression modelling, and I've found that it's not as simple as it could/should be from a modeller's perspective.

Probit regression

If we want to do probit regression (which assumes a normally distributed latent variable) then we have a PyMC implementation of DBDA by Kruschke here. This turns out to require some jumping through hoops, see...

@as_op(itypes=[tt.fvector, tt.fscalar, tt.fscalar], otypes=[tt.fvector])
def outcome_probabilities(theta, mu, sigma):
    out = np.empty(nYlevels, dtype=np.float32)
    n = norm(loc=mu, scale=sigma)       
    out[0] = n.cdf(theta[0])        
    out[1] = np.max([0, n.cdf(theta[1]) - n.cdf(theta[0])])
    out[2] = np.max([0, n.cdf(theta[2]) - n.cdf(theta[1])])
    out[3] = np.max([0, n.cdf(theta[3]) - n.cdf(theta[2])])
    out[4] = np.max([0, n.cdf(theta[4]) - n.cdf(theta[3])])
    out[5] = np.max([0, n.cdf(theta[5]) - n.cdf(theta[4])])
    out[6] = 1 - n.cdf(theta[5])
    return out

with pm.Model() as ordinal_model_single:    
    theta = pm.Normal('theta', mu=thresh, tau=np.repeat(.5**2, len(thresh)),
                      shape=len(thresh), observed=thresh_obs)
    mu = pm.Normal('mu', mu=nYlevels/2.0, tau=1.0/(nYlevels**2))
    sigma = pm.Uniform('sigma', nYlevels/1000.0, nYlevels*10.0)   
    pr = outcome_probabilities(theta, mu, sigma)
    y = pm.Categorical('y', pr, observed=df.Y.cat.codes.values)

There are further examples in that notebook about how you can use this approach in a regression context with metric predictors. This is fine, but you are stuck with an API/model which is not very clean. The modeller has to jump through a lot of hoops.

So this might make you think how we could use the OrderedLogistic distribution with Logit models instead...

Ordered Logit

Current parameterisation

Going Logit (rather than Probit) is promising, because we can define simple models like this (taken from the docstring of OrderedLogistic...

with pm.Model() as model:
    cutpoints = pm.Normal("cutpoints", mu=[-1,1], sigma=10, shape=2,
                          transform=pm.distributions.transforms.ordered)
    y_ = pm.OrderedLogistic("y", cutpoints=cutpoints, eta=x, observed=y)
    tr = pm.sample(1000)

However I was hoping to run ordinal regression in a regression context where I could estimate regression coefficients for predictor variables. However, it seems that parameterisation of OrderedLogistic is not set up for this. For example, as far as I can tell, it seems to most closely match the parameterisation of eq15.8 from Regression and Other Stories by Gelman, Hill & Vehtari. That is, for an example with 3 categories (2 cut points), the model specification is

So it is basically estimating cutpoints $c_{1.5}, c_{2.5}$ on the scale of the data (x) and sigma.

A better(?) parameterisation

This might be very useful in some contexts, but it is not very useful in a GLM/regresion context. In this case it would be nicer if there as an alternative parameterisation (see eq 15.9 from Regression and Other Stories)

Note that the first cutpoint is now constrained to be zero and the of the Logistic is constrained to be 1, and we have additional regression coefficients alpha and beta.

Questions

Is there an easy way to use the OrderedLogistic distribution in a GLM context?
If not, could we consider changing it's parameterisation to make it more suitable for regression contexts - being able to directly estimate regression coefficients would be very useful.
Or is there a relatively simple way to build a custom distribution function along the lines done in the DBDA ordinal regression translation notebook.

drbenvincent · 2021-10-10T09:43:01Z

drbenvincent
Oct 10, 2021
Collaborator Author

I now see that there is now an OrderedProbit distribution as well, which isn't in the docs yet. So I guess any discussion can encompass OrderedProbit and OrderedLogistic

0 replies

aseyboldt · 2021-10-11T07:46:42Z

aseyboldt
Oct 11, 2021
Maintainer

Maybe I am missing something, but couldn't you get the second parametrization by just fixing the first cutpoint to zero?
I am not entirely sure why you would want to do that though.

Is there something wrong with this?:

with pm.Model():
    sd = pm.HalfNormal("group_sd")
    raw = pm.ZeroSumNormal("group_raw", dims="group")
    group_effect = pm.Deterministic("group_effect", sd * raw, dims="group")

    age_effect = pm.Normal("age_effect")

    # Add intercept if we constrain the sum of cutpoints to 0, or set the first cutpoint to 0.
    mu = (
        group_effect[group_idx]
        + age_effect * data["age"]
    )

    # I don't like this prior. We should maybe investigate distributions
    # on ordered sets that optionally sum to zero or start at zero?
    cutpoints = pm.Normal("cutpoints", mu=[-1,1], sigma=10, shape=2,
                          transform=pm.distributions.transforms.ordered)

    pm.OrderedLogistic("z", eta=mu, cutpoints=cutpoints, observed=data["observation"])

You need to make sure though that you only add an intercept to the model if the cutpoints have only n-1 degrees of freedom.

So maybe something like this?

with pm.Model():
    intercept = pm.Normal("intercept", sigma=10)

    sd = pm.HalfNormal("group_sd")
    raw = pm.ZeroSumNormal("group_raw", dims="group")
    group_effect = pm.Deterministic("group_effect", sd * raw, dims="group")

    age_effect = pm.Normal("age_effect")

    # Add intercept if we constrain the sum of cutpoints to 0, or set the first cutpoint to 0.
    mu = (
        intercept
        + group_effect[group_idx]
        + age_effect * data["age"]
    )

    cutpoints_raw = at.concatenate([
        np.zeros(1),
        pm.Dirichlet("cutpoints_raw_upper", a=np.ones(n_observation_groups - 1))
    ])

    # If I don't misunderstand something larger sigma should indicate that we are very sure about
    # our predictions based on mu?
    sigma = pm.HalfNormal("sigma")

    pm.OrderedLogistic("z", eta=sigma * mu, cutpoints=sigma * cutpoints_raw, observed=data["observation"])

1 reply

drbenvincent Oct 26, 2021
Collaborator Author

Thanks @aseyboldt
I notice that the new (unreleased) pm.OrderedProbit has an explicit sigma parameter, which is nice.

I'm just double checking... multiplying the eta and cutpoints by sigma here basically do that job. Such that if you manually set sigma=1.7 then the Logit would closely approximate the Probit distribution?

drbenvincent · 2021-10-24T18:30:14Z

drbenvincent
Oct 24, 2021
Collaborator Author

Thanks @aseyboldt
I finally had time to look at this again. So far, good progress on simple models. I am working through 1 group, 2 groups, then metric predictor examples, which is the same approach as the Kruschke chapter on this topic.

My only problem so far is constraining the cutpoints:

SamplingError: Initial evaluation of model at starting point failed!
Starting values:
{'eta': array(0.), 'cutpoints_raw_upper_stickbreaking__': array([0., 0., 0.]), 'sigma_log__': array(-0.22579135)}

Initial evaluation results:
eta                                   -0.92
cutpoints_raw_upper_stickbreaking__   -2.37
sigma_log__                           -0.77
y_obs                                  -inf
Name: Log-probability of test_point, dtype: float64

If you have time, I've got a notebook on this here https://github.com/drbenvincent/ordinal-regression/blob/main/ordinal%20logistic%20regression.ipynb. It's Model 3 which involves the constraint on cutpoints which ends up failing.

2 replies

aseyboldt Oct 24, 2021
Maintainer

Should be np.zeros(0)

drbenvincent Oct 25, 2021
Collaborator Author

Hmm. np.zeros(1) sounds right if we want to constrain the first cutpoint to 1. But tried np.zeros(0) and no luck.

Looking again, it looks like it's a bad initial value problem. Setting start values works!

with model:
    trace = pm.sample(return_inferencedata=True, start={"cutpoints_raw_upper": np.array([0.1, 0.2, 0.3])})

drbenvincent · 2021-11-14T18:03:58Z

drbenvincent
Nov 14, 2021
Collaborator Author

Ok. So we've made progress in that the issue is not about parameterisation of OrderedLogit or OrderedProbit. So in one way this discussion could be done.

That said, as @aseyboldt mentioned, some consideration to the priors could be given. I've worked up a notebook using the very latest v4 code, with OrderedProbit. Full example notebook here.

The basic issue seems to be how you deal with too many degrees of freedom. We have K levels of the data, but the model has K+1 degrees of freedom, theta_1,... , theta_{K-1}, mu, sigma.

The method that Krushcke uses is to constrain the lower and upper outpoints. A PyMC implementation is here but this is old and clunky and doesn't utilise OrderedProbit which is new. By constraining the first and last outpoints, the model should now have K-1 degrees of freedom.

initial = np.arange(K-1-2)+2.5
print(initial)

with pm.Model() as model:
    cutpoints = at.concatenate([
        np.ones(1)*1.5,
        pm.Uniform("cutpoints_unknown", lower=1+0.5, upper=K-0.5, shape=K-1-2), #  transform=pm.distributions.transforms.ordered),
        np.ones(1)*(K-0.5)
    ])
    mu = pm.Normal('mu', mu=K/2, sigma=K)
    sigma = pm.HalfNormal("sigma", 1)
    pm.OrderedProbit("y_obs", cutpoints=cutpoints, eta=mu, sigma=sigma, observed=y)
    trace = pm.sample(start={"cutpoints_unknown": initial})

az.plot_trace(trace, var_names=["cutpoints_unknown", "mu", "sigma"]);
plt.tight_layout()

The posteriors are sensible in terms of their values, but we have major divergence issues

I tried the approach outlined by @aseyboldt, which is to fix the first outpoint to zero, and use a Dirichlet distribution for the unknown outpoints. I think this then has a total of K-1 degrees of freedom because the Dirichlet will sum to 1?

initial = np.linspace(0.1, 0.9, K-2)
print(initial)

with pm.Model() as model:
    cutpoints = at.concatenate([
        np.zeros(1),
        pm.Dirichlet("cutpoints_unknown", a=np.ones(K - 1 - 1))
    ])
    mu = pm.Normal('mu', mu=K/2, sigma=K)
    sigma = pm.HalfNormal("sigma", 1)
    pm.OrderedProbit("y_obs", cutpoints=cutpoints, eta=mu, sigma=sigma, observed=y)
    trace = pm.sample(start={"cutpoints_unknown": initial})
    
# pm.model_to_graphviz(model)

az.plot_trace(trace, var_names=["cutpoints_unknown", "mu", "sigma"]);
plt.tight_layout()

Again, OK inferences in terms of numerical values, but still crazy divergence issues.

I'd be grateful for pointers:

am I on the right path?
are the models actually fine, and it's just a v4 issue?
Do the unknown outpoint parameters need to be ordered? I think they do, but adding transform=pm.distributions.transforms.ordered to the Uniform in the first model doesn't resolve divergences. ~~I've not tried this, but if ordering would help with the Dirichlet, then maybe could calculate cumulative sum of these variables. Can Aesara do that?~~

If I've left out any important details, see here for full notebook example.

EDIT: Wrapping the Dirichlet variables in aesara.tensor.extra_ops.cumsum results in zero divergences

with pm.Model() as model:
    cutpoints = at.concatenate([
        np.zeros(1),
        at.extra_ops.cumsum(pm.Dirichlet("cutpoints_unknown", a=np.ones(K - 1 - 1)))
    ])
    mu = pm.Normal('mu', mu=K/2, sigma=K)
    sigma = pm.HalfNormal("sigma", 1)
    pm.OrderedProbit("y_obs", cutpoints=cutpoints, eta=mu, sigma=sigma, observed=y)
    trace = pm.sample(start={"cutpoints_unknown": initial})

az.plot_trace(trace, var_names=["cutpoints_unknown", "mu", "sigma"]);
plt.tight_layout()

But the posteriors over unknown outpoints are suspect

1 reply

drbenvincent Nov 20, 2021
Collaborator Author

Finally got a handle on this!

The last version above actually works fine. By ordering the Dirichlet distributed unknown outpoints we have resolved the issue with divergences, but the problem was that the posteriors looks bad (very high overlap).

But this is a non-problem when we remember about the thresholds being correlated with each other... see DBDA by Kruschke for more info. Looking at marginal posterior distributions for the outpoints is misleading.

drbenvincent · 2021-11-20T23:27:58Z

drbenvincent
Nov 20, 2021
Collaborator Author

Created an issue, proposing a new distribution which would be very useful here #5215

0 replies

jlevy44 · 2021-12-07T03:35:22Z

jlevy44
Dec 7, 2021

https://journals.sagepub.com/doi/full/10.1177/2515245918823199

0 replies

jlevy44 · 2021-12-07T03:35:53Z

jlevy44
Dec 7, 2021

https://mc-stan.org/docs/2_23/stan-users-guide/ordered-logistic-section.html

0 replies

cmgoold · 2022-09-12T12:54:09Z

cmgoold
Sep 12, 2022

@drbenvincent I just posted to the Discourse forums about the Kruschke-style ordered probit model here before actually seeing this thread and your latest notebook. The approaches are very similar.

0 replies

alexjonesphd · 2022-11-28T17:18:40Z

alexjonesphd
Nov 28, 2022

Hi everyone - joining this thread late after struggling hard with ordinal data!

I've been working through the examples above as well as @drbenvincent's great notebooks to try and get a handle on things, but I am still very confused. I'm looking to use the same approach Kruschke uses in DBDA2 to estimate the latent mean and scale of an ordinal set of data to do some inferences on that. I've been using OrderedProbit and have been trying to use the same approach Kruschke uses by pinning the left and right extreme cutpoints, and estimating the mean, scale, and remaining cutpoints. Like the notebooks, I've been using the example dataset from DBDA2, OrdinalProbitData-1grp-1.csv, which is stated in the book as having a latent mean of 1 and scale of 2.5.

I've been using the following parameterisation (forgive hard-coding of values) which results in relatively sensible estimates (but not of the mean), but with severe divergences, and requires starting values to not crash:

df = pd.read_csv('OrdinalProbitData-1grp-1.csv')

with pm.Model() as model2:

    eta = pm.Normal('eta',  mu=3, sigma=2)
    sigma = pm.HalfNormal('sigma', 1)

    cutpoints = at.concatenate([
            [0.5],
            pm.Uniform('cut_unknown', lower=1, upper=5, shape=4),
            [5.5]
    ])

    pm.OrderedProbit('y' cutpoints=cutpoints, eta=eta, sigma=sigma,
                     compute_p=False, observed=df.Y.values-1)

    trace = pm.sample(tune=5000, initvals={'cut_unknown': [1.5, 2.5, 3.5, 4.5]})

az.plot_trace(trace, figsize=(15, 10))

`
Divergences aside, these look reasonable for the cutpoints and sigma, but not really for the mean (eta). The posterior predictive also looks very reasonable here.

Using the constrainedUniform implementation, the divergences disappear as @drbenvincent states, but now the estimates of the mean and scale are very different from 1 and 2.5, and the cutpoints are no longer on the scale of the responses:

def constrainedUniform(N, min=0, max=1):
    return pm.Deterministic('theta',
                            at.concatenate([
                                np.ones(1)*min,
                                at.extra_ops.cumsum(pm.Dirichlet("theta_unknown", 
                                                                 a=np.ones(N - 2))) * (min+(max-min))
                            ]), dims='cutpoints')

coords = {'cutpoints': np.arange(6)}

with pm.Model(coords=coords) as model2:

    eta = pm.Normal("eta", mu=3, sigma=2)
    sigma = pm.HalfNormal('sigma', 1)

    cutpoints = constrainedUniform(7)

    pm.OrderedProbit("y", cutpoints=cutpoints, eta=eta, sigma=sigma,
                     compute_p=False, observed=df.Y.values-1)

    trace = pm.sample(tune=5000)

az.plot_trace(trace)

I'm probably missing something fundamental here about OrderedProbit, but I'm not sure why having the cutpoints on their cumulative probability scale helps so much, or why the eta and sigma parameters never seem to quite reflect the values stated in DBDA2.

Any pointers are massively appreciated from this struggling Bayesian 🙏🏻

2 replies

drbenvincent Nov 28, 2022
Collaborator Author

Hi @alexjonesphd. Just a quick reply for the moment... The cutoff parameters have to be ordered. The uniform won't give you that, so that's one red flag I can see with a quick inspection.

alexjonesphd Nov 28, 2022

Will have a play with that, of course! Thank you.

alexjonesphd · 2022-11-29T13:57:58Z

alexjonesphd
Nov 29, 2022

Getting a bit of a handle on this and realised something of a basic mistake on my part. I'm still a bit vague on the use of the constrainedUniform distribution above but its clear the cutpoints are sensible. The reason the mean and scale are different from Kruschke's example is I think because OrderedProbit requires the data to start from 0 and not 1. So it makes sense that the underlying normal distribution would have different parameters, even if it gives rise to the same pattern of responses (e.g., same count of zeros as there are ones in the observed data, etc).

I guess for most purposes this doesn't matter too much. If the goal is to compare latent distributions to one another the relative difference is important, but I can imagine there are maybe some situations were it would be important to estimate the latent parameters on the precisely observed units of the data, which OrderedProbit won't allow. Am I right in thinking this or am I still missing something fundamental?

2 replies

drbenvincent Dec 14, 2022
Collaborator Author

You should be able to transform the constrained uniform to the scale you want. That's what the min=0, max=1 kwargs are about. I can't guarantee that I've done that right (I didn't test it), but that should allow you to scale from 0-1 to whatever range you want. Let me know if you have any luck with that.

alexjonesphd Dec 14, 2022

Ah - I will try and work with that and see what I get, I hadn't tried to play with those! I responded over on the Discourse too - probably safer for confused people like me to keep discussion there as suggested. Thank you!

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

make ordinal regression great again #5066

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 10 comments 8 replies

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

Select a reply

make ordinal regression great again #5066

drbenvincent Oct 9, 2021 Collaborator

Probit regression

Ordered Logit

Current parameterisation

A better(?) parameterisation

Questions

Replies: 10 comments · 8 replies

drbenvincent Oct 10, 2021 Collaborator Author

aseyboldt Oct 11, 2021 Maintainer

drbenvincent Oct 26, 2021 Collaborator Author

drbenvincent Oct 24, 2021 Collaborator Author

aseyboldt Oct 24, 2021 Maintainer

drbenvincent Oct 25, 2021 Collaborator Author

drbenvincent Nov 14, 2021 Collaborator Author

drbenvincent Nov 20, 2021 Collaborator Author

drbenvincent Nov 20, 2021 Collaborator Author

jlevy44 Dec 7, 2021

jlevy44 Dec 7, 2021

cmgoold Sep 12, 2022

alexjonesphd Nov 28, 2022

drbenvincent Nov 28, 2022 Collaborator Author

alexjonesphd Nov 28, 2022

alexjonesphd Nov 29, 2022

drbenvincent Dec 14, 2022 Collaborator Author

alexjonesphd Dec 14, 2022

drbenvincent
Oct 9, 2021
Collaborator

Replies: 10 comments 8 replies

drbenvincent
Oct 10, 2021
Collaborator Author

aseyboldt
Oct 11, 2021
Maintainer

drbenvincent Oct 26, 2021
Collaborator Author

drbenvincent
Oct 24, 2021
Collaborator Author

aseyboldt Oct 24, 2021
Maintainer

drbenvincent Oct 25, 2021
Collaborator Author

drbenvincent
Nov 14, 2021
Collaborator Author

drbenvincent Nov 20, 2021
Collaborator Author

drbenvincent
Nov 20, 2021
Collaborator Author

jlevy44
Dec 7, 2021

jlevy44
Dec 7, 2021

cmgoold
Sep 12, 2022

alexjonesphd
Nov 28, 2022

drbenvincent Nov 28, 2022
Collaborator Author

alexjonesphd
Nov 29, 2022

drbenvincent Dec 14, 2022
Collaborator Author