Is there any possibility that an ordinary supervised model performs better than an outlier algorithm in this task? #12
Comments
[Sorry for my bad English skills] As far as I know, OC-SVM performs outlier detection without known anomalies. Even though the ODDS data tells us which samples are abnormal, in a real problem we usually do not know which samples are abnormal. If we do not know the abnormal data, SVM and RF cannot even be used. If the exact labels of the anomalies are given, the high performance of SVM looks reasonable to me.
Thanks for your great answer! Another confusion is why these supervised algorithms (like SVM and RF) perform well even on a highly unbalanced dataset (e.g., the Annthyroid dataset in ODDS, containing 7.42% positive samples). Intuitively speaking, an ordinary classification model may classify all samples as the majority class (negative samples) and fail to detect the anomalies, but again the empirical results indicate that my opinion may be wrong.
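The intuition above is exactly why plain accuracy is a misleading metric here. A minimal sketch (the counts below mirror the 7.42% Annthyroid ratio but are otherwise illustrative assumptions) shows that a degenerate "everything is normal" model still looks excellent on accuracy while detecting nothing:

```python
# Illustration: on a dataset with ~7.42% positives, a model that predicts
# "normal" for everything still scores ~92.6% accuracy, which is why
# accuracy alone says little about anomaly detection quality.

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def recall(y_true, y_pred, positive=1):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    pos = sum(1 for t in y_true if t == positive)
    return tp / pos if pos else 0.0

# 10,000 samples, 742 anomalies (~7.42%, as in Annthyroid)
y_true = [1] * 742 + [0] * 9258
y_pred = [0] * 10000  # degenerate model: everything is "normal"

print(accuracy(y_true, y_pred))  # 0.9258 — looks great
print(recall(y_true, y_pred))    # 0.0    — detects nothing
```

This is why ranking metrics such as ROC-AUC or precision@n are usually preferred for unbalanced detection tasks.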
Actually, I tried most of the datasets in ODDS (http://odds.cs.stonybrook.edu/annthyroid-dataset/) and uploaded the results to my GitHub repository (https://github.com/jmq19950824/Anomaly-Detection/blob/master/ODDs.ipynb). The results indicate that even a binary classification algorithm (SVM here) can solve the anomaly detection task well. Can anyone explain this result?
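For readers who want to reproduce the kind of comparison described above without downloading the ODDS files, here is a hedged sketch on a synthetic imbalanced dataset (the dataset, parameters, and any resulting scores are my assumptions, not the notebook's results). The key difference it makes explicit: the supervised model trains on labels, the detectors do not.

```python
# Sketch: supervised RF (uses labels) vs. unsupervised detectors (no labels),
# all scored with ROC-AUC on a held-out set. Synthetic data, illustrative only.
from sklearn.datasets import make_classification
from sklearn.ensemble import IsolationForest, RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import OneClassSVM

X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.93], random_state=0)  # ~7% positives
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Supervised: sees the labels at training time.
rf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
auc_rf = roc_auc_score(y_te, rf.predict_proba(X_te)[:, 1])

# Unsupervised: trained with no labels at all.
iso = IsolationForest(random_state=0).fit(X_tr)
auc_iso = roc_auc_score(y_te, -iso.score_samples(X_te))  # lower score = more anomalous

ocsvm = OneClassSVM(nu=0.07).fit(X_tr)
auc_oc = roc_auc_score(y_te, -ocsvm.score_samples(X_te))

print(f"RF (supervised):  {auc_rf:.3f}")
print(f"IsolationForest:  {auc_iso:.3f}")
print(f"OC-SVM:           {auc_oc:.3f}")
```

Because the supervised model gets the very information the detectors must do without, a gap in its favor is expected rather than weird.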
@yzhao062, great answer, thanks a lot. I notice that there is a sentence: "Supervised outlier detection is a (difficult) special case of the classification problem. The main characteristic of this problem is that the labels are extremely unbalanced in terms of relative presence. Since anomalies are far less common than normal points, it is possible for off-the-shelf classifiers to predict all test points as normal points and still achieve excellent accuracy." I tried some supervised models (like Random Forest) on some extremely unbalanced datasets like the Credit Card Fraud Detection (CCFD) dataset on Kaggle (https://www.kaggle.com/mlg-ulb/creditcardfraud), where the positive samples make up only 0.172% of the whole dataset (i.e., extremely unbalanced).
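One common mitigation against the "predict everything as normal" failure mode the quote warns about (my suggestion, not something the quote prescribes) is to reweight the classes so misses on the rare class become expensive. A minimal sketch with scikit-learn's `class_weight` parameter, again on synthetic data rather than CCFD:

```python
# Sketch: class reweighting so a classifier cannot "win" by predicting
# all-normal on an extremely unbalanced dataset. Synthetic data; the
# imbalance ratio here (1%) is an illustrative assumption, not CCFD's 0.172%.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=20000, n_features=10,
                           weights=[0.99], flip_y=0, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=1)

# "balanced" scales each class's weight inversely to its frequency, so a
# missed anomaly costs as much as many misclassified normal points.
rf = RandomForestClassifier(class_weight="balanced", random_state=1)
rf.fit(X_tr, y_tr)
print("recall on positives:", recall_score(y_te, rf.predict(X_te)))
```

Reporting recall (or precision@n / ROC-AUC) on the positive class, rather than overall accuracy, is what makes the comparison meaningful at this imbalance level.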
I have tried some outlier detection datasets (ODDS) from this website, such as the Annthyroid dataset (http://odds.cs.stonybrook.edu/annthyroid-dataset/).
However, when I compare some ordinary supervised models (e.g., SVM and Random Forest), the results indicate that SVM and RF perform much better than anomaly detection algorithms like OC-SVM and Isolation Forest.
I wonder what the reason for these weird results is, because theoretically the outlier detection algorithms should perform better on the outlier detection task. Could anyone help me figure out this problem? Thanks!