[Bug]: Inconsistent pixel metrics between validation and testing on the same dataset #2301

Open
jordanvaneetveldt opened this issue Sep 9, 2024 · 8 comments

@jordanvaneetveldt

Describe the bug

I am encountering an inconsistency in the reported metrics between the validation and testing phases, despite using the same dataset (val_split_mode=ValSplitMode.SAME_AS_TEST). Specifically, the pixel_xxx metrics values vary significantly between the validation and testing stages, even though both phases are using the same data. Interestingly, the image_xxx metrics remain consistent across both phases.

I am using an EfficientAD model with a batch size of 1 for both training and validation/testing. Could there be a reason why the pixel_xxx metrics differ between validation and testing, despite the consistent dataset and batch size?

Dataset

MVTec

Model

Other (please specify in the field below)

Steps to reproduce the behavior

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

from anomalib.data import MVTec
from anomalib.engine import Engine
from anomalib.models import EfficientAd
from anomalib.data.utils.split import ValSplitMode


model = EfficientAd()
data = MVTec(
    train_batch_size=1,
    eval_batch_size=1,
    val_split_mode=ValSplitMode.SAME_AS_TEST,
)

pixel_metrics = {
    "F0_5Score": {
        "class_path": "torchmetrics.FBetaScore",
        "init_args": {"task": "binary", "beta": 0.5},
    },
    "Recall": {
        "class_path": "torchmetrics.Recall",
        "init_args": {"task": "binary"},
    },
    "Precision": {
        "class_path": "torchmetrics.Precision",
        "init_args": {"task": "binary"},
    },
}
engine = Engine(
    pixel_metrics=pixel_metrics,
    threshold="ManualThreshold",
    accelerator="auto",  # \<"cpu", "gpu", "tpu", "ipu", "hpu", "auto">,
    devices=1,
    logger=None,
    max_epochs=1,
)

engine.fit(datamodule=data, model=model)
print("Training Done")

engine.validate(datamodule=data, model=model)
print("Validation Done")

engine.test(datamodule=data, model=model)
print("Testing Done")

OS information

OS information:

  • OS: Ubuntu 20.04.6 LTS
  • Python version: 3.10.14
  • Anomalib version: 1.2.0.dev0
  • PyTorch version: 2.4.0+cu121
  • CUDA/cuDNN version: 12.1
  • GPU models and configuration: 1x GeForce RTX 3090
  • Any other relevant information: torchmetrics: 1.4.1

Expected behavior

The metrics from the validation and testing phases should be consistent, since the same dataset is used for both.

Screenshots

(screenshot omitted; the validation and test metric tables are shown in the Logs section below)

Pip/GitHub

GitHub

What version/branch did you use?

1.2.0.dev0

Configuration YAML

API

Logs

Calculate Validation Dataset Quantiles: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 83/83 [00:00<00:00, 84.32it/s]
Epoch 0: 100%|| 209/209 [00:15<00:00, 13.81it/s, train_st_step=7.730, train_ae_step=0.771, train_stae_step=0.0552, train_loss_step=8.560, pixel_F0_5Score=0.236, pixel_Precision=0.212, pixel_Recall=0.440, train_st_ep
`Trainer.fit` stopped: `max_epochs=1` reached.
Training Done
F1Score class exists for backwards compatibility. It will be removed in v1.1. Please use BinaryF1Score from torchmetrics instead
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [1]
Calculate Validation Dataset Quantiles: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 83/83 [00:00<00:00, 85.53it/s]
Validation DataLoader 0: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 83/83 [00:01<00:00, 68.01it/s]
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃      Validate metric      ┃       DataLoader 0        ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│        image_AUROC        │    0.9642857313156128     │
│       image_F1Score       │    0.6451612710952759     │
│      pixel_F0_5Score      │    0.23632192611694336    │
│      pixel_Precision      │    0.21182596683502197    │
│       pixel_Recall        │    0.4397238790988922     │
└───────────────────────────┴───────────────────────────┘
Validation Done
F1Score class exists for backwards compatibility. It will be removed in v1.1. Please use BinaryF1Score from torchmetrics instead
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [1]
Testing DataLoader 0: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 83/83 [00:16<00:00,  4.99it/s]
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃        Test metric        ┃       DataLoader 0        ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│        image_AUROC        │    0.9642857313156128     │
│       image_F1Score       │    0.6451612710952759     │
│      pixel_F0_5Score      │    0.3458831310272217     │
│      pixel_Precision      │    0.9593373537063599     │
│       pixel_Recall        │    0.09721758216619492    │
└───────────────────────────┴───────────────────────────┘
Testing Done

@jordanvaneetveldt
Author

After debugging, I believe normalization is applied only during testing but not during validation. Could that be the case? If so, it would explain the inconsistency. Setting normalization=NormalizationMethod.NONE resolves the inconsistency, but then the reported metrics become inaccurate, as the anomaly map values are no longer constrained between 0 and 1, and TorchMetrics automatically applies a sigmoid function to normalize these values.

Additionally, if normalization is only applied during testing, wouldn't the F1AdaptiveThreshold be computed on non-normalized anomaly maps during validation and then applied on normalized maps during testing?
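
For reference, a minimal sketch of the workaround mentioned above (it assumes the normalization argument of the anomalib 1.x Engine, which is what I set; the other arguments mirror my reproduction script):

from anomalib.engine import Engine
from anomalib.utils.normalization import NormalizationMethod

engine = Engine(
    normalization=NormalizationMethod.NONE,  # leave anomaly maps un-normalized
    devices=1,
    logger=None,
    max_epochs=1,
)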

@Samyarrahimi

The MVTec class includes a val_split_ratio argument, which defaults to 0.5. This parameter specifies the fraction of either the training or test images that will be set aside for validation. It seems that since you set the validation data as the test set and didn't specify a val_split_ratio, Anomalib used 50% of the test data for validation. This could explain the discrepancy between your validation and test set results.

Here is the source code:
https://github.com/openvinotoolkit/anomalib/blob/main/src/anomalib/data/image/mvtec.py

@jordanvaneetveldt
Author

When setting val_split_mode: same_as_test, the val_split_ratio is ignored, and both datasets contain the same images.
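
For what it's worth, the overlap can be checked directly. A quick sketch (assuming the datamodule exposes val_data and test_data datasets with a samples DataFrame, as in anomalib 1.x; attribute names may differ in other versions):

from anomalib.data import MVTec
from anomalib.data.utils.split import ValSplitMode

data = MVTec(val_split_mode=ValSplitMode.SAME_AS_TEST)
data.prepare_data()
data.setup()

# With SAME_AS_TEST, both splits should list exactly the same images.
print(len(data.val_data), len(data.test_data))
print(set(data.val_data.samples.image_path) == set(data.test_data.samples.image_path))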

I believe the discrepancy occurs because normalization is not applied during validation. When I set normalization=NormalizationMethod.NONE, I obtained the same results for both validation and testing. However, in this case, torchmetrics are computed using the sigmoid function, which leads to incorrect results.

The issue is not specific to EfficientAD, as I encountered the same problem with PatchCore.

@alexriedel1
Contributor

During validation, the normalization statistics (min/max) are computed but normalization itself is not applied. During testing, those previously computed statistics are applied to the results.
Normalization is applied before thresholding during validation.

So you shouldn't trust the validation results too much for precision and recall.

To use the manual threshold, you need to pass it in a dict-object structure just like you pass your metrics.

config = {
    "class_path": "ManualThreshold",
    "init_args": {"default_value": 0.7},
}
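
and then pass it to the Engine the same way (a sketch, assuming the Engine's threshold argument accepts the same class_path/init_args structure as the metrics):

from anomalib.engine import Engine

engine = Engine(
    threshold={
        "class_path": "ManualThreshold",
        "init_args": {"default_value": 0.7},
    },
)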

@jordanvaneetveldt
Author

Thank you for your answer! So does this mean that early stopping isn't feasible if the validation results are unreliable? For instance, I would like to monitor precision to keep the model with the fewest false positives.

@alexriedel1
Contributor

alexriedel1 commented Sep 13, 2024

During training it should not matter, because you validate each epoch without normalization, so all the non-normalized results are comparable with each other.
If you test after training, however, you will get slightly different results.
But testing on your validation data isn't recommended anyway, because you end up with contaminated test results that appear better than they actually are.

@jordanvaneetveldt
Author

Thank you for the clarification. However, it seems that normalization during validation might still have an effect on the metrics, potentially leading to incorrect values. Using the code below, I noticed that the metrics consistently drop to a constant value after the first epoch, and the model stops training early (at epoch 5, due to the patience parameter being set to 5).

from anomalib.data import MVTec
from anomalib.engine import Engine
from anomalib.models import EfficientAd
from anomalib.data.utils.split import ValSplitMode
from anomalib.utils.normalization import NormalizationMethod

from lightning.pytorch.callbacks import EarlyStopping
from anomalib.callbacks.checkpoint import ModelCheckpoint

model = EfficientAd()
data = MVTec(
    train_batch_size=1,
    val_split_mode=ValSplitMode.SAME_AS_TEST,
)

callbacks = [
    EarlyStopping(
        monitor="pixel_F1Score",
        mode="max",
        patience=5,
    ),
]

pixel_metrics = {
    "F1Score": {
        "class_path": "torchmetrics.F1Score",
        "init_args": {"task": "binary"},
    },
    "Recall": {
        "class_path": "torchmetrics.Recall",
        "init_args": {"task": "binary"},
    },
    "Precision": {
        "class_path": "torchmetrics.Precision",
        "init_args": {"task": "binary"},
    },
}
engine = Engine(
    callbacks=callbacks,
    pixel_metrics=pixel_metrics,
    accelerator="auto",  # \<"cpu", "gpu", "tpu", "ipu", "hpu", "auto">,
    devices=1,
    logger=None,
)

engine.fit(datamodule=data, model=model)
print("Training Done")

When you mention that normalization is applied before thresholding during validation, are you referring to the computation of the F1AdaptiveThreshold? During metric computation (F1Score, Precision, etc.), torchmetrics automatically applies a sigmoid to the data if it lies outside the [0, 1] range. However, if the F1AdaptiveThreshold is computed on the min-max-normalized data, this creates a mismatch, as the threshold will no longer align with the data after the sigmoid transformation.
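
To illustrate the sigmoid behaviour, a small standalone sketch with made-up scores (not anomalib code):

import torch
from torchmetrics.classification import BinaryPrecision

# Un-normalized anomaly scores: torchmetrics sees values outside [0, 1],
# treats them as logits, applies a sigmoid, and then thresholds at 0.5.
raw_scores = torch.tensor([0.2, 1.5, 7.3, 12.0])
targets = torch.tensor([0, 0, 1, 1])

metric = BinaryPrecision()
# All four sigmoid values exceed 0.5, so every pixel is counted as positive here.
print(metric(raw_scores, targets))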

@Samyarrahimi

During validation, the normalization statistics (min/max) are computed but normalization itself is not applied. During testing, those previously computed statistics are applied to the results. Normalization is applied before thresholding during validation.

@alexriedel1 @jordanvaneetveldt I'd appreciate it if you could help me understand this correctly.

This explanation feels ambiguous to me. Aren't the first and last sentences contradictory? I still don't fully understand the normalization and thresholding approach in the validation phase. If normalization is not applied (and only the normalization statistics are computed), how are classification labels determined during validation? Also, are we relying on the sigmoid applied in torchmetrics to calculate metrics like precision and recall, since the model's output is not in the [0, 1] range? (#2337)

I understand that normalization is applied in the testing phase using the minimum and maximum scores from validation. However, when it comes to the threshold, the code in min_max_normalization.py doesn’t seem to use any threshold calculated during validation (if such a threshold is calculated at all, assuming I am using F1AdaptiveThreshold).
