
Model Predictions Impacted by Batch & Batch Size During Evaluation #6

Open
lkurlandski opened this issue Nov 18, 2022 · 2 comments

lkurlandski commented Nov 18, 2022

The batch size and which examples are in each batch can impact how a model learns during training. However, during evaluation, the batch size and which examples are in each batch should only impact the speed at which data is processed, not the predictions of the model (as far as I am aware). I have found that the outputs of this model for a single example are impacted by properties of other examples in the batch. Evaluating an example individually (batch_size=1) can result in different predictions than if the example is included in a batch.
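
For context, here is a minimal sketch of the usual expectation (using a toy nn.Linear model, not MalConvGCT): per-example outputs of a stateless feed-forward model do not depend on how the examples are batched.

import torch
import torch.nn as nn

torch.manual_seed(0)
toy = nn.Linear(4, 2).eval()  # stand-in for any stateless feed-forward model

x = torch.randn(3, 4)
with torch.no_grad():
    batched = toy(x)                                        # all three examples in one batch
    single = torch.cat([toy(xi.unsqueeze(0)) for xi in x])  # one example at a time

print(torch.allclose(batched, single))  # True: batch composition does not change the outputs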

I believe this is a result of padding added in LowMemConv.LowMemConvBase.seq2fix, specifically, this line:

x_selected = torch.nn.utils.rnn.pad_sequence(chunk_list, batch_first=True)
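
As a quick illustration of why I suspect this call (a toy sketch, not the repository's code): pad_sequence pads every sequence in the batch to the length of the longest one, so the padded representation of a given chunk depends on the other chunks in the batch.

import torch

chunk = torch.ones(5)

# The same chunk padded alone vs. alongside a longer chunk from another example.
alone = torch.nn.utils.rnn.pad_sequence([chunk], batch_first=True)
with_longer = torch.nn.utils.rnn.pad_sequence([chunk, torch.ones(8)], batch_first=True)

print(alone.shape)        # torch.Size([1, 5])
print(with_longer.shape)  # torch.Size([2, 8]): the same chunk now carries three padded zeros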

Are these results to be expected? I don't remember reading about anything like this in the paper. If so, is there a recommended batch size to use when evaluating the model?

I included a minimal working example demonstrating this behavior below.

Tensors:
X_1.pt.txt
X_2.pt.txt

Environment:
environment.yml.txt

Example:

import torch
import torch.nn.functional as F

from MalConvGCT_nocat import MalConvGCT

device = torch.device("cpu")

# Load the pretrained model.
model = MalConvGCT(channels=256, window_size=256, stride=64)
state = torch.load("models/malconvGCT_nocat.checkpoint", map_location=device)
model.load_state_dict(state["model_state_dict"], strict=False)
model.to(device)
model.eval()

# Two tensors each with two examples.
X_1 = torch.load("X_1.pt.txt").to(device)
X_2 = torch.load("X_2.pt.txt").to(device)

# The second element in each tensor is identical.
print(f"Batches equal?: {torch.equal(X_1, X_2)}")
print(f"First element equal?: {torch.equal(X_1[0], X_2[0])}")
print(f"Second element equal?: {torch.equal(X_1[1], X_2[1])}")

Batches equal?: False
First element equal?: False
Second element equal?: True

# The model's confidence on the second example when determined individually,
# ie, run through the model with batch_size=1.
with torch.no_grad():
    conf_1_ind = F.softmax(model(X_1[1].unsqueeze(0))[0], dim=-1).data[:, 1][0].item()
    conf_2_ind = F.softmax(model(X_2[1].unsqueeze(0))[0], dim=-1).data[:, 1][0].item()
print(f"Confidence when evaluating individual: {(conf_1_ind, conf_2_ind)}")

Confidence when evaluating individual: (0.043793901801109314, 0.043793901801109314)

# The model's confidence on the second example when determined in the batches,
# ie, run through the model with batch_size=2. The confidence score of the
# second example differs, even though the element is the same in both scenarios.
with torch.no_grad():
    conf_1_batch = F.softmax(model(X_1)[0], dim=-1).data[:, 1][1].item()
    conf_2_batch = F.softmax(model(X_2)[0], dim=-1).data[:, 1][1].item()
print(f"Confidence when evaluating batch: {(conf_1_batch, conf_2_batch)}")

Confidence when evaluating batch: (0.06193083897233009, 0.043793901801109314)

@EdwardRaff

Are these results to be expected? I don't remember reading about anything like this in the paper. If so, is there a recommended batch size to use when evaluating the model?

The predictions should be the same regardless of batch size. So that is weird.

Are X_1 and X_2 real files / data, or made up test data? Are all 3 data points the same length without any padding?

Can you print out / look at model(X_*)[0].data[:,1][1] without any softmax step there?

We've got a heavy load of priorities going into the new year, but at a minimum I will help debug over GitHub and will try to find time to mess with this myself.

@lkurlandski (Author)

Thanks for the rapid response and any effort on your part!

X_1 and X_2 are derived from a real malicious file from the SOREL-20M dataset. Each element in X_1 and X_2 is a perturbed variant of the original sample, so X_1[0], X_1[1], X_2[0], and X_2[1] are all slightly perturbed malware tensors with a common ancestor. All have the same dimensions and differ from one another by at most 1024 elements.

All three data points are the same length and none of them have any padding.

print(X_1.shape)
print(X_2.shape)

torch.Size([2, 1971368])
torch.Size([2, 1971368])

The logits are also different:

print(model(X_1)[0].data[:,1][1])
print(model(X_2)[0].data[:,1][1])

tensor(-1.5864)
tensor(-1.8645)
