
Bug(?): lognorm distribution with negative loc parameter #33

Open
Buedenbender opened this issue Mar 13, 2023 · 2 comments
Buedenbender commented Mar 13, 2023

Thank you for this extremely helpful package, which I found via a Medium post and a recommendation from a colleague.
Since discovering distfit I have been eager to try the parametric approach to fitting PDFs.
Below is an example with mock-up data.

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from distfit import distfit

np.random.seed(4)

# Simulate a normal sample truncated at 80, plus a small second mode around 90
x_sim = np.random.normal(loc=47.55, scale=13.8, size=10000)
x_sim = np.append([*filter(lambda x: x <= 80, x_sim)], np.random.normal(loc=90, scale=10, size=50))
x_sim = np.array([*filter(lambda x: x >= 0, x_sim)])

x = x_sim

dfit = distfit('parametric', todf=True, distr=["lognorm"])
dfit.fit_transform(x)
dfit.bootstrap(x, n_boots=100)

fig, ax = plt.subplots(1, 3, figsize=(20, 8))
sns.histplot(x, ax=ax[0])
dfit.plot("PDF", n_top=3, fontsize=11, ax=ax[1])
dfit.plot("CDF", n_top=3, fontsize=11, ax=ax[2])
plt.show()

(screenshot: histogram, fitted PDF, and fitted CDF panels)

I was kind of surprised by the negative location parameter (of about $-822$) for the lognorm distribution. I might misunderstand what the loc parameter means here?
Also, I was not quite able to reproduce with mock data the plots I obtained from my actual data.
For the true data I often got PDFs of the form below (despite the histogram sometimes following a nearly perfect bell shape). Unfortunately I cannot provide the data.
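For what it's worth (an editorial sketch, parameter values hypothetical): in scipy's parameterization, loc simply shifts the whole distribution along the x-axis, so a fitted negative loc means the support of the fitted lognormal starts below zero rather than at zero:

```python
import scipy.stats as st

# Hypothetical shape/scale values; only loc echoes the fitted -822
dist = st.lognorm(s=0.3, loc=-822.0, scale=870.0)
print(dist.support())  # → (-822.0, inf): the support is shifted left by loc
```

So a negative loc is not an error per se; it just means the optimizer found a left-shifted lognormal that fits the data best.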

import matplotlib.pyplot as plt

# Hand-drawn sketch of the degenerate PDF shape observed on the real data
fig, ax = plt.subplots(1, 1)
ax.vlines(3000, ymax=0.05, ymin=0, color="red", linestyle="--")
ax.vlines(0, ymax=0.05, ymin=0, color="red", linestyle="--")
ax.set_ylim((0, 0.05))
ax.vlines(5, ymin=0, ymax=0.008, linewidth=3, color="black")
ax.hlines(xmin=5, xmax=350, y=0, linewidth=7, color="black")
plt.show()

(screenshot: sketch of the corner-shaped PDF)

Minor Points

  • In the legend of the PDF plot the best-fitting distribution is capitalized. For consistency (e.g., with plot_summary() or with dfit.plot("CDF")) it might be advisable to keep it lowercase.
erdogant commented Mar 14, 2023

Thank you for the feedback! I agree. I will lowercase the distribution names.

Furthermore, I have been looking into your issue. For many of the distributions, including lognorm, distfit uses scipy under the hood, so the loc/scale parameters are best described in the scipy documentation.

For the lognormal distribution, the mean and standard deviation of the underlying normal correspond to log(scale) and the shape parameter, respectively.
For demonstration:

import numpy as np
import scipy.stats as st
from distfit import distfit

loc = 5
scale = 10
sample_dist = st.lognorm.rvs(3, loc=loc, scale=np.exp(scale), size=10000)
dfit = distfit('parametric', todf=True, distr=["lognorm"])
dfit.fit_transform(sample_dist)

print('Estimated loc: %g, input loc: %g' % (dfit.model['loc'], loc))
print('Estimated mu or scale: %g, input scale: %g' % (np.log(dfit.model['scale']), scale))

[distfit] >INFO> fit
[distfit] >INFO> transform
[distfit] >INFO> [lognorm] [0.36 sec] [RSS: 1.76437e-10] [loc=5.069 scale=22043.122]
[distfit] >INFO> Compute confidence intervals [parametric]
Estimated loc: 5.06934, input loc: 5
Estimated mu or scale: 10.0008, input scale: 10

The loc/scale are nicely estimated.
If I now do the same for your case, but first without the filters, the estimated mu seems pretty close.

import numpy as np
from distfit import distfit

mu = 13.8
loc = 47.55
x_sim = np.random.normal(loc=loc, scale=np.exp(mu), size=10000)
# x_sim = np.append([*filter(lambda x: x <= 80, x_sim)], np.random.normal(loc=90, scale=10, size=50))
# x_sim = np.array([*filter(lambda x: x >= 0, x_sim)])

dfit = distfit('parametric', todf=True, distr=["lognorm"])
dfit.fit_transform(x_sim)
dfit.bootstrap(x_sim, n_boots=1)

print('Estimated mu or scale: %g, input scale: %g' % (np.log(dfit.model['scale']), mu))

Estimated mu or scale: 17.3597, input scale: 13.8

Check out this thread on Stack Overflow.

Buedenbender commented Mar 16, 2023

I will read more into the resources you provided. Regarding the manually simulated second plot (where the PDF basically looks like a corner and the bars of the underlying histogram are no longer visible), I now understand why it looks this way: the upper confidence-interval limit explodes. For example, say the empirical values in my distribution range over $[0, 1000]$. After running distfit with the popular distributions and then executing the bootstrap test, the upper 95% confidence-interval boundary (e.g., for the Pareto distribution) is estimated at $CI_{Upper} = 500{,}000$, making the plot and the upper CI limit impossible to interpret.
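An editorial illustration of this effect (shape/scale values hypothetical, not from the thread): for a heavy-tailed Pareto fit, extreme quantiles can sit orders of magnitude above the bulk of the data, which is exactly what stretches the plotted CI off the chart:

```python
import scipy.stats as st

# Hypothetical Pareto parameters; shape b < 1 implies an extremely heavy tail
b, scale = 0.8, 100
median = st.pareto.ppf(0.5, b, scale=scale)
q975 = st.pareto.ppf(0.975, b, scale=scale)
print(median, q975)  # the 97.5% quantile dwarfs the median by ~40x
```

So an enormous upper CI is a symptom of a heavy-tailed candidate distribution being selected, not necessarily of a bug in the bootstrap itself.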

So I made sure to reread the information you provided. Thank you very much for clarifying the relation between the mean/SD and log(scale) and the shape parameter. Still, as far as I understand it, negative values should not be possible under the distribution, since the log of a negative number results in a complex number with an imaginary component.
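A clarifying sketch on this last point (editorial, parameter values hypothetical): in scipy's three-parameter form, the log is taken of (x - loc) / scale rather than of x itself, so with a negative loc the density is real and finite even at negative x, and no complex numbers arise:

```python
import scipy.stats as st

# Hypothetical shape/scale; loc echoes the fitted negative value
s, loc, scale = 0.5, -822.0, 870.0
# At x = -5, the argument of the log is (-5 - loc) / scale ≈ 0.94 > 0
pdf_at_neg_x = st.lognorm.pdf(-5.0, s, loc=loc, scale=scale)
print(pdf_at_neg_x)  # a real, positive density
```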
