
Regression Error at boundaries, is normalization on output required? #158

mchinen opened this issue Dec 18, 2019 · 2 comments

mchinen commented Dec 18, 2019

When training with svm-train -s 4 -t 2 -n .6 -c .4 <myfile>, I find that the predictions are heavily compressed. For example, myfile has labels in the 1 to 5 range, with a significant portion in the 4 to 5 range, yet the highest predicted value on the training set is below 4.0. There also seem to be fewer predictions than expected in the 1.0 to 2.0 range.

I've played with NU_SVR and EPSILON_SVR and the other parameters and haven't found a good solution to this. My train file is attached below. Even when normalizing the labels to 0-1 I get the same behavior, where the highest predicted value is .72.

First, I'd like to know whether I'm doing something incorrectly. Second, if this is a correct model, why is it so compressed? I would like the predictions to reach closer to the boundaries of the training labels. I understand that we would expect some compression toward the mean in regression, but this seems like more than it should be. Should I normalize the predicted output to match the input label distribution?

Unnormalized:
mysvmtrainfile.txt
Normalized:
normsvmtrain.txt
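For reference, the label normalization mentioned above can be sketched as below. The scale_labels helper and its argument format are mine, not part of libsvm; the bundled svm-scale tool's -y y_lower y_upper option does something similar from the command line.

```python
# Min-max scale the labels of LIBSVM-format records ("label idx:val idx:val ...")
# to [0, 1], leaving the feature part of each line untouched.
def scale_labels(lines):
    rows = [line.split(None, 1) for line in lines if line.strip()]
    ys = [float(y) for y, _ in rows]
    lo, hi = min(ys), max(ys)
    return ["%g %s" % ((float(y) - lo) / (hi - lo), feats.rstrip())
            for y, feats in rows]

print(scale_labels(["5 1:0.2", "1 1:0.4", "3 1:0.6"]))
```

Note that predictions from a model trained on the scaled file live on the scaled range and have to be mapped back with y * (hi - lo) + lo.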

@mchinen mchinen changed the title Regression Error at boundaries, is normalization required? Regression Error at boundaries, is normalization on output required? Dec 18, 2019

cjlin1 (Owner) commented Dec 20, 2019 via email

mchinen (Author) commented Dec 20, 2019

Thanks so much, that does seem to be the issue. Before reading your PDF I hadn't realized the importance of searching over the parameters, and had simply reused our last model's parameters. I modified grid.py to do a search and found better parameters, which were wildly different. I found I also needed to tune the nu parameter.
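The modification described above can be sketched as follows. By default grid.py sweeps only C and gamma; the cross_val_mse stand-in below is hypothetical, and in a real run it would invoke svm-train -s 4 -t 2 -v 5 with each candidate (C, gamma, nu) and parse the reported cross-validation error.

```python
import itertools

# Hypothetical stand-in for the cross-validation step: in practice this
# would run svm-train with -v 5 and the candidate parameters, then parse
# the cross-validation mean squared error it reports.
def cross_val_mse(c, g, nu):
    # Dummy bowl-shaped surface so the sketch is runnable on its own.
    return (c - 8) ** 2 + (g - 0.5) ** 2 + (nu - 0.5) ** 2

# Log-scale grids for C and gamma (the ranges grid.py uses by default),
# plus an added linear sweep over nu.
c_grid = [2.0 ** e for e in range(-5, 16, 2)]
g_grid = [2.0 ** e for e in range(-15, 4, 2)]
nu_grid = [0.3, 0.4, 0.5, 0.6, 0.7]

best = min(itertools.product(c_grid, g_grid, nu_grid),
           key=lambda p: cross_val_mse(*p))
print(best)  # the (C, gamma, nu) triple with the lowest cross-validation error
```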

However, my problem was confounded by another issue, which I also resolved:

  • When I used the command-line svm-predict myinput.txt mymodel.txt binary, I got the predictions I expected.
  • When I called svm_predict() after svm_load_model(mymodel.txt), I got different, incorrect predictions, because I was zero-indexing the .index field. Once I resolved that, things worked as expected.
