Inconsistent Estimator? #49
Comments
Thanks for bringing this to my attention, I'll look into it.
I replied with this comment. I believe the issue is simply not using enough draws to simulate the normal distributions with Halton draws. As I wrote in the comment, the default of 50 may just not be enough. I used 200 and the bias disappeared. I'm not sure what the "best" threshold is. Models with more parameters probably need more draws, so there may be some rule of thumb that scales with the number of parameters. I could easily increase the default to perhaps 200 or so just to make sure this issue isn't there for models with smaller numbers of parameters. But the package already posts a warning when there are many random parameters and suggests using Sobol draws.
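For concreteness, here is a minimal sketch of what bumping the draws looks like in a call to logitr() (it reuses the package's yogurt example that appears later in this thread; only the numDraws setting is the point here):
library(logitr)
set.seed(456)
# Same yogurt specification as later in this thread, but with more Halton
# draws than the default of 50.
m <- logitr(
  data = yogurt,
  outcome = 'choice',
  obsID = 'obsID',
  pars = c('price', 'feat', 'brand'),
  randPars = c(feat = 'n', brand = 'n'),
  numDraws = 200
)
summary(m)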
More updates on the Stack Exchange post. Even with higher numbers of draws there is still some small bias. I suspect the Halton draws may be the culprit. Still digging into the issue. For sure though, I am likely going to modify how the draws are done by default. I like how Stata's
Hi @jhelvy, as you preferred, I am continuing the discussion here instead of following up in the Stack Exchange thread. Regarding your last comments in the Stack Exchange thread:
I have attached my R script and the resulting dataset comparing the performance of both. If you need any additional information, please let me know.
@JediKnight5 I appreciate you looking into this and adding your code here. I think this will be a lot easier to manage in this issue, and once we figure this out we can update the SO post and link back to this conversation.

Based on your Stata code, it looks like you exported the simulated data from the R code to csv files and then read those into Stata for each model estimation? I think that's fine, I just wanted to make sure I'm understanding how the Stata simulations are being done. I'm actually thinking a slightly different way of organizing this is to write a separate R script to generate the data and then save all of the data as one large dataframe, keeping track of things like the seed and iteration. If you save it as a .parquet file (

I'm still thinking the consistent bias we're seeing is due to 1) the draws, 2) the optimizer, or 3) a combination of both. That you're getting the same results with a different algorithm doesn't necessarily rule it out, because it could also be due to differences in convergence criteria. Basically, regardless of the algorithm used, it may still be stopping prematurely, and we may need to modify the convergence criteria. I believe all of those controls can be passed in a call to

As for the draws, another idea is to try to export the draws being used themselves and then use those same draws in the other packages. I'm not sure if this can be done in the other packages, but in
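A sketch of that data-organization idea (not the scripts actually used in this thread; the simulator below is just a stand-in, and the note about a user-supplied draw matrix via logitr's standardDraws argument is my assumption):
library(arrow)  # for write_parquet()

# Stand-in simulator: replace with the real choice-data generator.
make_one_dataset <- function(iter, seed) {
  set.seed(seed)
  data.frame(
    iteration = iter,
    seed      = seed,
    x1        = rnorm(30),
    x2        = rnorm(30),
    choice    = rbinom(30, 1, 0.5)
  )
}

# One large dataframe that tracks seed and iteration, written once as parquet.
all_data <- do.call(rbind, Map(make_one_dataset, iter = 1:100, seed = 1000 + 1:100))
write_parquet(all_data, "sim_data.parquet")

# The same idea would apply to the draws: generate one matrix of standard
# normal draws, save it, and pass it to each package that accepts fixed draws
# (e.g., logitr's standardDraws argument, if I have that right).
draws <- matrix(rnorm(200 * 2), ncol = 2)
saveRDS(draws, "standard_draws.rds")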
@jhelvy: Unfortunately, Stata cannot read in parquet files, but I collected the data I generated for testing the different estimators in the following parquet file: LINK IS EXPIRED

I like your idea of keeping the draws of the standard normal distribution as a fixed input and comparing the estimators' performance accordingly. I think

The following simple piece of analysis comparing Halton draws from different packages to pseudo-random number draws doesn't suggest that there is an issue with the
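For readers following along, here is a sketch of that kind of draw comparison (a reconstruction, not the original analysis, and it assumes the randtoolbox package's halton() function):
library(randtoolbox)  # halton()

n_draws <- 1000
halton_u <- halton(n_draws, dim = 2)              # uniform Halton points, 2 dimensions
halton_z <- qnorm(halton_u)                       # inverse-CDF transform to N(0, 1)
prng_z   <- matrix(rnorm(n_draws * 2), ncol = 2)  # plain pseudo-random normals

# Compare the first two moments; both should be close to 0 and 1.
round(rbind(
  halton_mean = colMeans(halton_z),
  prng_mean   = colMeans(prng_z),
  halton_sd   = apply(halton_z, 2, sd),
  prng_sd     = apply(prng_z, 2, sd)
), 3)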
All great progress on this! I think doing as much as we can to hold inputs fixed (simulated data, draws, etc.) and just vary the model settings will lead us down the path that gets to the bottom of these differences. One other thought I had was to also keep track of the log-likelihood achieved at the solution for each simulated dataset. Ideally these should all be converging to the same solution, but if they're not we should be able to see that by comparing the log-likelihoods. That is, if we find that the log-likelihood is generally better with Stata's
I had a look at the code of

My understanding is that
Regarding

Disclaimer: I have spent a considerable amount of time looking at both code bases, but please excuse me if I have overlooked something given the number of lines and the nesting of the code.
I really appreciate the time you've spent on this. I know this is super tedious and time-consuming, but it's really helpful. I'm quite swamped at the moment with a lot of other responsibilities, so this would take me a lot longer to get to without your help.

Yes, you are correctly understanding how I have set up the creation of the

But taking draws for them may be an issue. It may be better / more appropriate to only take normal draws for the random coefficients and just bind on a matrix of zeros for the fixed parameters (see the sketch below). In fact, this could very well be part of the problem. Just changing how this is done in the

As for the dimensions, I am only taking
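A toy illustration of that zeros-for-fixed-parameters idea (my own sketch, not logitr's actual internals): only the random coefficients get standard normal draws, and the column of zeros means the fixed coefficient is simply repeated across draws.
set.seed(3)
R_draws <- 200

draws_random <- matrix(rnorm(R_draws * 2), ncol = 2)  # draws for x2, x3 (random)
draws_fixed  <- matrix(0, nrow = R_draws, ncol = 1)   # zeros for x1 (fixed)
draws <- cbind(draws_fixed, draws_random)
colnames(draws) <- c('x1', 'x2', 'x3')

# Coefficients per draw: mean + sd * draw; sd = 0 keeps the fixed parameter constant.
means <- c(x1 = 1, x2 = 1, x3 = 1)
sds   <- c(x1 = 0, x2 = 1, x3 = 1)
beta_r <- sweep(sweep(draws, 2, sds, `*`), 2, means, `+`)
head(beta_r)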
I actually cannot see how the procedure you describe, drawing standard normal draws for fixed coefficients and then replacing them with repeated means, should adversely affect the estimation.

After having another look at your code, I realized that I struggle to understand the purpose of the

Given the ineffective
Okay, I think we may be getting down to something here. First, the

That page only shows the general idea for simple logit, but the idea should hopefully be clear. For mixed logit, part of the partial derivatives involves multiplying by the standard normal draws. So effectively you end up with a big matrix of constants that you then use when computing the gradients inside the optimization search. This is one of the big ways to speed up computations.

However, this may be where there is an issue. I'll have to go back to how I computed the gradients, but I believe I set this up so that I don't have different draws across individuals. This may not matter in a case without a panel structure, but in your case you do have the panel structure (denoted with the

One quick check on this would be to simulate a dataset without the panel structure to see if the bias goes away. If it does, then that very well may be where the issue lies.
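To make the role of the fixed draw matrix concrete, here is a stripped-down sketch (mine, not logitr's code) of how the same standard normal draws enter the simulated probabilities, and hence the gradient pieces that can be pre-computed:
set.seed(1)
R_draws <- 200
draws <- rnorm(R_draws)        # fixed standard normal draws for one random coefficient

# Toy choice set: 3 alternatives, one random attribute x.
x <- c(0.2, -1.0, 0.5)
mu <- 1; sigma <- 1            # mean and sd of the random coefficient

beta_r <- mu + sigma * draws   # one coefficient value per draw
V <- outer(beta_r, x)          # R_draws x 3 matrix of utilities
P <- exp(V) / rowSums(exp(V))  # logit probabilities, draw by draw
p_sim <- colMeans(P)           # simulated choice probabilities (average over draws)
p_sim

# Because `draws` never changes during the optimization, terms involving it can
# be computed once up front, which is the speed-up described above.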
Btw, I built this all based on Train's book. You can see in section 6.6 of chapter 6 where he describes the estimation process. In section 6.7 on panel data he writes:
And under equation 6.3 he writes:
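The panel treatment Train describes there can be sketched as a toy R illustration (mine, based on the text, not logitr's code): with a panel, the product of the logit probabilities over a person's choice situations is taken inside the average over draws, rather than simulating each situation separately.
set.seed(2)
R_draws <- 200
beta_r <- 1 + 1 * rnorm(R_draws)   # coefficient draws: mean 1, sd 1

# One person, two choice situations, two alternatives each (chosen listed first).
x_chosen <- c(0.4, -0.2)   # attribute of the chosen alternative in each situation
x_other  <- c(-0.1, 0.3)   # attribute of the non-chosen alternative

logit_prob <- function(b, xc, xo) exp(b * xc) / (exp(b * xc) + exp(b * xo))

# Cross-sectional treatment: average over draws separately for each situation.
p_cs <- sapply(seq_along(x_chosen), function(t) {
  mean(logit_prob(beta_r, x_chosen[t], x_other[t]))
})
prod(p_cs)

# Panel treatment: take the product over situations first, then average over draws.
mean(logit_prob(beta_r, x_chosen[1], x_other[1]) *
     logit_prob(beta_r, x_chosen[2], x_other[2]))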
Alright, I just ran another simulation where I removed the panel structure from the data. It was 10000 people, each with only one choice observation. The same results show up: a slight bias on the SD term. I'm still rather puzzled about why this outcome remains so consistent, but I don't think it's the panel structure.
Thanks for this reference, this was very helpful and made your estimation strategy clear for the simple case. Moreover, I got a better understanding of your code. I also ran a simulation with
I will double check on that by running another simulation, but I think there are two takeaways:
Thanks for this further explanation and reference. My simulation actually consists of this simplest specification, where the random coefficient
Interesting. How many draws were you using in the above simulation? So the fact that you're getting similar levels of bias in

The only thing left I can think to do is re-run your simulations with the three different packages and keep the log-likelihood values. We should be able to see whether we are getting the same solutions across the runs for each package. If
First of all, sorry for the delayed response. Unfortunately, I have been super busy in recent days. I used 100 draws in the simulation above. To double check the finding about the estimator's behavior on a cross-sectional vs. a panel dataset, I did the following: I used the data set posted in the parquet file above and replaced the

Results using panel data structure:
psn refers to pseudo-random numbers; ha refers to the Halton sequence; hm refers to the Hammersley sequence.

This displays which estimator achieves the maximum likelihood in the 100 iterations:
Results using cross-sectional data structure:
This displays which estimator achieves the maximum likelihood in the 100 iterations:
Conclusion:
Something else which I realized when looking at the density plots was that

This is a file with all results:
This is fantastic stuff - thanks so much for keeping up the progress on this! This is becoming quite the effort to figure out where the bias is coming from. Based on these latest results, I'm thinking we may have a convergence issue. When you say "This displays which estimator achieves the maximum likelihood in the 100 iterations", what specifically do you mean? For example, on the cross-sectional data you get these results:
> table(res_cs$max_ll)
cmmixlogit_ha_ll cmmixlogit_hm_ll logitr_ha_ll logitr_so_ll mlogit_ha_ll
21 46 6 7 20
Does that mean that for the logitr with Halton draws case, only 7 out of the 100 iterations reached the maximum log-likelihood value? How do you determine what the maximum log-likelihood at the global solution is? Is it based on exit code statuses? Also you say:
So it sounds like

If you're finding that the likelihood by

You can modify the tolerances for logitr using the
What if you set all the tolerances tighter, to like

Maybe the draws aren't the problem. If we're pooling together a handful of results that didn't fully converge with ones that did, then it's kind of an apples and oranges comparison. I feel like this very well may be happening, especially because the variance is so much wider in some plots, like the mean_x2 one for logitr. Maybe that wider variance is just from a greater number of the models stopping the gradient descent too early?
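As a concrete sketch of tightening everything (this assumes logitr's options list is forwarded to the nloptr optimizer, so the control names below follow nloptr's conventions; it reuses the yogurt example from later in this thread):
library(logitr)
set.seed(456)
m_tight <- logitr(
  data = yogurt,
  outcome = 'choice',
  obsID = 'obsID',
  panelID = 'id',
  pars = c('price', 'feat', 'brand'),
  randPars = c(feat = 'n', brand = 'n'),
  numMultiStarts = 10,
  options = list(        # assumed pass-through to nloptr
    xtol_rel = 1e-10,
    xtol_abs = 1e-10,
    ftol_rel = 1e-10,
    ftol_abs = 1e-10,
    maxeval  = 2000
  )
)
summary(m_tight)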
Apologies for any confusion! What I meant is that if you compare the log likelihood of the six different estimators over all 100 cross-sectional data iterations, then in 21 out of 100 iterations the cmmixlogit estimator with Halton draws achieves the highest log likelihood,
which can be reproduced given the results RData file in my last post.
I agree with your statement referring to

In other words, this suggests that there is no convergence issue for the cross-sectional dataset. All estimators, with the exception of

The opposite is clearly the case for the panel dataset structure. As mentioned above, the performance and therefore the likelihood distribution of
Do you agree with this mapping and is there something equivalent of
Given the

My preliminary conclusion would be:
BTW, thanks for the ongoing support! Much appreciated!
Okay, I was totally misunderstanding your LL results. That makes much more sense. So for sure

No doubt there's definitely something different going on between the cross-sectional vs. panel structure. It's probably okay to use the same data but classify them differently for this experiment. The panel structure requires a different log-likelihood calculation, so by adding in the

With these results now, I have some ideas for where things may be going wrong:
One reason I think the bias is so much worse with panel structure data is that the solutions seem far less stable across multistart runs. For example:
library(logitr)
set.seed(456)
mxl_pref <- logitr(
data = yogurt,
outcome = 'choice',
obsID = 'obsID',
# panelID = 'id',
pars = c('price', 'feat', 'brand'),
randPars = c(feat = 'n', brand = 'n'),
numMultiStarts = 10
)
summary(mxl_pref)
The multistart summary looks like this:
Summary Of Multistart Runs:
Log Likelihood Iterations Exit Status
1 -2641.243 38 3
2 -2641.504 29 3
3 -2641.246 26 3
4 -2641.502 31 3
5 -2642.052 39 3
6 -2641.371 28 3
7 -2642.046 49 3
8 -2641.504 28 3
9 -2641.243 29 3
10 -2641.505 28 3
But if I put the panel structure back in:
library(logitr)
set.seed(456)
mxl_pref <- logitr(
data = yogurt,
outcome = 'choice',
obsID = 'obsID',
panelID = 'id',
pars = c('price', 'feat', 'brand'),
randPars = c(feat = 'n', brand = 'n'),
numMultiStarts = 10
)
summary(mxl_pref)
Now the multistart summary looks like this:
Summary Of Multistart Runs:
Log Likelihood Iterations Exit Status
1 -1266.550 34 3
2 -1300.751 64 3
3 -1260.216 35 3
4 -1261.216 43 3
5 -1269.066 40 3
6 -1239.294 56 3
7 -1343.221 59 3
8 -1260.006 55 3
9 -1273.143 52 3
10 -1304.384 59 3
It's bouncing all over the place. So one reason we're seeing
Okay, I'm a bit out of ideas. I spent the day implementing a version (this branch) where I take different sets of
Okay, I've gone back and attempted to simulate my own experiment from scratch just to make sure that I'm not missing something. I believe everything you had before is correct, but perhaps we've done something differently, because in my simulations I'm no longer really getting a bias at all. In this experiment, I have two random parameters (x2 and x3) and I'm using 200 Halton draws. Across 100 runs, here's what I get for mean estimates:
beta_x1 1.003579 1.000353
mean_x2 0.9819966 1.002053
mean_x3 0.9709964 0.9959077
sd_x2 1.028509 1.015882
sd_x3 1.005578 0.9973298
The two columns are cross-sectional vs. panel. Here I use a single multistart for the cross-sectional model and 10 multistarts for the panel model.

I'm not really sure what to make of this. Maybe we're just simulating the data differently? I wrote my code to simulate the data efficiently, so it gets generated within each run of the simulation, and I simulated the data in a way such that there is a panel structure (the coefficients are held constant within individuals but vary across individuals).

Here is my code if you want to run it and check the results:
library(logitr)
library(tidyverse)
options(dplyr.width = Inf)
set.seed(4321)
# Define functions
sim_choice_data <- function(n, t, j, beta_x1, mean_x2, sd_x2, mean_x3, sd_x3) {
  # Individual-specific betas for x2 and x3 (constant within each person, varying across people)
beta_x2 <- rnorm(n, mean_x2, sd_x2)
beta_x3 <- rnorm(n, mean_x3, sd_x3)
# Generate the X data matrix
X <- as.data.frame(matrix(rnorm(n*t*j*3), ncol = 3))
names(X) <- c('x1', 'x2', 'x3')
X$id <- rep(seq(n), each = t*j)
X$alt_id <- rep(seq(j), t*n)
X$q_id <- rep(rep(seq(t), each = j), n)
X$obs_id <- rep(seq(n*t), each = j)
# Compute utility by respondent
X_resp <- split(X[c('x1', 'x2', 'x3')], X$id)
utility <- list()
for (i in 1:n) {
beta <- c(beta_x1, beta_x2[i], beta_x3[i])
utility[[i]] <- as.matrix(X_resp[[i]]) %*% beta
}
V <- unlist(utility)
# Compute probabilities of choosing each alternative
expV <- exp(V)
sumExpV <- rowsum(expV, group = X$obs_id, reorder = FALSE)
reps <- table(X$obs_id)
p <- expV / sumExpV[rep(seq_along(reps), reps),]
# Simulate choices according to probabilities
p_obs <- split(p, X$obs_id)
choices <- list()
for (i in seq_len(length(p_obs))) {
choice <- rep(0, reps[i])
choice[sample(seq(reps[i]), 1, prob = p_obs[[i]])] <- 1
choices[[i]] <- choice
}
X$choice <- unlist(choices)
return(as.data.frame(X))
}
estimate_mixed_logit <- function(
n, t, j, beta_x1, mean_x2, sd_x2, mean_x3, sd_x3, numDraws
) {
# Simulate choice data
data <- sim_choice_data(n, t, j, beta_x1, mean_x2, sd_x2, mean_x3, sd_x3)
# Cross sectional - just ignoring panel structure
suppressMessages(
m1 <- logitr(
data = data,
outcome = 'choice',
obsID = 'obs_id',
# panelID = 'id',
pars = c('x1', 'x2', 'x3'),
randPars = c(x2 = 'n', x3 = 'n'),
numDraws = numDraws,
numMultiStarts = 1
))
# Now include panelID
suppressMessages(
m2 <- logitr(
data = data,
outcome = 'choice',
obsID = 'obs_id',
panelID = 'id',
pars = c('x1', 'x2', 'x3'),
randPars = c(x2 = 'n', x3 = 'n'),
numDraws = numDraws,
numMultiStarts = 10
))
return(
list(
coef = cbind(coef(m1), coef(m2)),
ll = cbind(logLik(m1), logLik(m2))
)
)
}
R <- 100
results <- list()
numDraws <- 200
for (r in 1:R) {
results[[r]] <- estimate_mixed_logit(
n = 1000,
t = 10,
j = 3,
beta_x1 = 1,
mean_x2 = 1,
sd_x2 = 1,
mean_x3 = 1,
sd_x3 = 1,
numDraws = numDraws
)
}
To view the means across all runs I used this:
cat('beta_x1', apply(do.call(rbind, map(results, \(x) x$coef[1,])), 2, mean))
cat('mean_x2', apply(do.call(rbind, map(results, \(x) x$coef[2,])), 2, mean))
cat('mean_x3', apply(do.call(rbind, map(results, \(x) x$coef[3,])), 2, mean))
cat('sd_x2', apply(do.call(rbind, map(results, \(x) abs(x$coef[4,]))), 2, mean))
cat('sd_x3', apply(do.call(rbind, map(results, \(x) abs(x$coef[5,]))), 2, mean))
And I also made a plot of all the parameters across the 100 runs:
coefs <- map_df(
results, \(x) {
temp <- data.frame(x$coef)
temp$var <- row.names(temp)
row.names(temp) <- NULL
temp
}
)
coefs$run <- rep(seq(R), each = 5)
names(coefs)[1:2] <- c('cross_section', 'panel')
coefs |>
pivot_longer(
names_to = 'method',
values_to = 'estimate',
cols = cross_section:panel
) |>
mutate(estimate = abs(estimate)) |>
ggplot() +
geom_density(aes(x = estimate, color = method)) +
facet_wrap(vars(var), nrow = 1) +
theme_bw()
You can see how accounting for the panel structure matters here, as the cross-sectional model that ignores the structure has negative biases. And you can also see the slight positive bias in the
I just fixed the error and updated my comment above.
Hi @JediKnight5, I've been super busy and haven't had any time to look back at this. Just curious if you could re-run your experiments with two random variables instead of just one, as I did in my latest post here. I just wanted to see if you get a bias on the mean and/or sd parameters of both or only one of the random variables. If it's only there on one but not both, then I'd be fairly confident this whole issue comes down to the specifics of how the standard normals are drawn. I sure hope that's the issue, as that would be a lot easier to fix.
Hi @jhelvy, sorry, I have been very busy in the last two weeks. I'll definitely reply to your most recent post, hopefully in the next couple of days.
Hi @jhelvy, my apologies again, I have been very busy in the last two weeks. Regarding this post:
I think your simulation code is fine; I couldn't spot anything that seemed off to me. However, I think your bias estimates are somewhat less comparable to mine above, since you were using 200 draws while I was using just 100 draws, and as we know from this analysis taken from the SO post, a larger number of draws decreases the bias.
These are the bias results of my simulation, which I ran for both 100 and 200 draws to make it most comparable:
100 draws, cross-sectional
100 draws, panel
200 draws, cross-sectional
200 draws, panel
I think we can again see the beneficial impact of increasing the number of draws. However, the bias in the

This is the code I used to generate a parquet file with all data:
This is the code I used for the estimation in the panel case with 200 draws (the code for the cross-sectional dataset differs slightly in the options used):
I used version
I had another look at the generation of the Halton sequence in the

On a different note, I am still wondering why Stata's
Thus, I take back this statement of mine from above and agree with your statement
Just for clarification on this statement:
This is not true. In the panel case the

Sorry for the long reply. I hope I have addressed all open questions.
Thanks again for a detailed look into all of this. Based on everything we've now looked into, it seems there are a few points that I think we agree on:
On point 1, I know

On points 2 & 3, I think a general conclusion from this investigation is that we probably should be using more draws, and right now the default number used in

On point 4, I remain puzzled as to why the

With all of that said, I think my proposal to update the number of draws to be based on the number of random parameters is an appropriate change to address this issue (a possible rule of thumb is sketched below), and once I've done that I think I can close out this issue (and reply to the original SO post). Do you agree? Anything else you'd recommend?
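One hypothetical way such a rule of thumb could look (my sketch only, not the rule that was actually adopted in the package):
# Scale the default number of Halton draws with the number of random parameters,
# keeping the old default of 50 as a floor. Purely illustrative.
default_num_draws <- function(n_rand_pars, base = 50, per_par = 50) {
  max(base, per_par * n_rand_pars)
}

default_num_draws(1)  # 50
default_num_draws(3)  # 150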
I agree with all points.
I agree. Actually, this is rather a selling point for
I agree. I think we should err on the safe side and take more draws. I like your proposed idea of making the number of draws a function of the number of random parameters to be estimated.
I agree. Did you have a chance to test my suggestion of switching the base of the Halton draws from 2 to 3? I am happy to test it myself, but I am not sure which
Indeed, I think just a deep dive of the
I think my perspective is just slightly different, in the sense that I would argue the number of draws should be sufficiently high to obtain reliable estimates. In other words, the difference between the estimates and their true values decreases as the number of draws increases, yielding a consistent estimator, but a certain threshold for the number of draws should be exceeded to ensure that this difference is substantially small for every random parameter. I think this view is in line with Cameron & Trivedi (Chapter 15.7.1) and Train (Chapter 10.5.1) in terms of asymptotics. For finite samples / numbers of draws, this minimum threshold seems to matter to prevent large biases.
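For reference, the asymptotic result in Train (Ch. 10.5) that this argument leans on can be stated roughly as follows (my paraphrase of the textbook condition, with N observations and R draws):
% The maximum simulated likelihood estimator is consistent as both N and R grow,
% and asymptotically equivalent to maximum likelihood when R rises faster than sqrt(N):
\hat{\theta}_{MSL} \xrightarrow{\;p\;} \theta_0
  \quad \text{as } N \to \infty,\ R \to \infty,
\qquad
\sqrt{N}\left(\hat{\theta}_{MSL} - \theta_0\right) \xrightarrow{\;d\;} \mathcal{N}(0,\Omega)
  \quad \text{if } R/\sqrt{N} \to \infty .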
I agree. Thanks for your continued support and interest in this issue! Finally, I think it might be worthwhile to restrict the distribution of the lambda parameter in the WTP space to be log-normal, as discussed in the SO post.
Yes, I actually wrote my own version of the Halton sequence to test this, and no matter what I used as the first prime I got similar results, so I don't think it's the prime. I'm going to keep the dependency on the
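For anyone who wants to repeat that check, here is a minimal Halton generator of the kind described (a sketch, not the author's actual implementation): the radical-inverse function for a given prime base, pushed through qnorm() to get standard normal draws.
halton_seq <- function(n, base = 2) {
  sapply(seq_len(n), function(i) {
    f <- 1
    r <- 0
    while (i > 0) {
      f <- f / base
      r <- r + f * (i %% base)
      i <- i %/% base
    }
    r
  })
}

# Switching the first prime from 2 to 3 just changes `base`; the resulting
# normal draws look similar in both cases.
z2 <- qnorm(halton_seq(200, base = 2))
z3 <- qnorm(halton_seq(200, base = 3))
round(c(mean_z2 = mean(z2), mean_z3 = mean(z3), sd_z2 = sd(z2), sd_z3 = sd(z3)), 3)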
Good point, I had forgotten about this. But I might instead just restrict it so it can't be normal, but it could be log-normal or censored normal, both of which force positivity. In practice, anything but making the scale parameter fixed tends to have a much harder time converging, and that's actually the original motivation for why I included a multi-start option in the package: WTP space models tend to diverge more often. I'll make an updated version that includes this change as well as the change to how many draws are used, then I'll close this out and update the original SO post.
Hi, I am using your package frequently to estimate WTP models and saw this post today: https://stats.stackexchange.com/questions/624305/mlogit-logitr-packages-fail-to-recover-true-estimates-of-mixed-logit-random-co
Can you please comment on that?