@ShawnXuan
Collaborator

Add NPU support for ArgWhereFunctor:

  • When the input tensor is on NPU, it is first moved to CPU to perform the argwhere operation;
  • The result is then moved back to NPU, so the output stays on the same device as the input;
  • This logic is enabled only when compiled with the WITH_NPU macro;
  • Behavior on other devices remains unchanged.

This change ensures that argwhere works correctly on NPU devices.
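
The fallback described above can be sketched in a few lines. This is only an illustration of the control flow, not the actual C++ functor: the `Tensor` class, the `argwhere_native` helper, and the module-level `WITH_NPU` flag (standing in for the compile-time macro) are all hypothetical names, and the toy tensor is 1-D only.

```python
from dataclasses import dataclass

# Hypothetical stand-in for the WITH_NPU compile-time macro.
WITH_NPU = True


@dataclass
class Tensor:
    """Toy 1-D tensor: a list of values plus a device tag (not OneFlow's Tensor)."""
    data: list
    device: str = "cpu"

    def to(self, device: str) -> "Tensor":
        # A real framework would copy memory between devices here.
        return Tensor(list(self.data), device)


def argwhere_native(x: Tensor) -> Tensor:
    """Return one [index] row per non-zero element, on the device x lives on."""
    rows = [[i] for i, v in enumerate(x.data) if v != 0]
    return Tensor(rows, x.device)


def argwhere(x: Tensor) -> Tensor:
    """Dispatch argwhere, routing NPU inputs through the CPU when enabled."""
    if WITH_NPU and x.device == "npu":
        # Assumed fallback path: compute on CPU, then move the result back.
        result = argwhere_native(x.to("cpu"))
        return result.to("npu")
    # All other devices take the unchanged path.
    return argwhere_native(x)


if __name__ == "__main__":
    out = argwhere(Tensor([0, 3, 0, 5], device="npu"))
    print(out.device, out.data)
```

Routing through the CPU costs two device transfers per call, so a sketch like this trades speed for correctness on NPU inputs while leaving every other device path untouched.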

@github-actions
Contributor

Speed stats:
GPU Name: NVIDIA GeForce RTX 3080 Ti 

❌ OneFlow resnet50 time: 43.7ms (= 4367.6ms / 100, input_shape=[16, 3, 224, 224])
PyTorch resnet50 time: 58.4ms (= 5836.6ms / 100, input_shape=[16, 3, 224, 224])
✔️ Relative speed: 1.34 (= 58.4ms / 43.7ms)

OneFlow resnet50 time: 26.6ms (= 2655.8ms / 100, input_shape=[8, 3, 224, 224])
PyTorch resnet50 time: 40.6ms (= 4060.2ms / 100, input_shape=[8, 3, 224, 224])
✔️ Relative speed: 1.53 (= 40.6ms / 26.6ms)

OneFlow resnet50 time: 19.1ms (= 3810.3ms / 200, input_shape=[4, 3, 224, 224])
PyTorch resnet50 time: 32.1ms (= 6429.3ms / 200, input_shape=[4, 3, 224, 224])
✔️ Relative speed: 1.69 (= 32.1ms / 19.1ms)

OneFlow resnet50 time: 17.2ms (= 3441.4ms / 200, input_shape=[2, 3, 224, 224])
PyTorch resnet50 time: 30.8ms (= 6167.6ms / 200, input_shape=[2, 3, 224, 224])
✔️ Relative speed: 1.79 (= 30.8ms / 17.2ms)

OneFlow resnet50 time: 17.6ms (= 3518.9ms / 200, input_shape=[1, 3, 224, 224])
PyTorch resnet50 time: 29.1ms (= 5813.8ms / 200, input_shape=[1, 3, 224, 224])
✔️ Relative speed: 1.65 (= 29.1ms / 17.6ms)

OneFlow swin dataloader time: 0.199s (= 39.850s / 200, num_workers=1)
PyTorch swin dataloader time: 0.127s (= 25.479s / 200, num_workers=1)
Relative speed: 0.639 (= 0.127s / 0.199s)

OneFlow swin dataloader time: 0.056s (= 11.169s / 200, num_workers=4)
PyTorch swin dataloader time: 0.033s (= 6.621s / 200, num_workers=4)
Relative speed: 0.593 (= 0.033s / 0.056s)

OneFlow swin dataloader time: 0.031s (= 6.132s / 200, num_workers=8)
PyTorch swin dataloader time: 0.016s (= 3.273s / 200, num_workers=8)
Relative speed: 0.534 (= 0.016s / 0.031s)

❌ OneFlow resnet50 time: 49.1ms (= 4912.8ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 64.5ms (= 6448.6ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.31 (= 64.5ms / 49.1ms)

OneFlow resnet50 time: 37.4ms (= 3744.6ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 47.6ms (= 4759.9ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.27 (= 47.6ms / 37.4ms)

OneFlow resnet50 time: 27.9ms (= 5577.5ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 40.5ms (= 8097.3ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.45 (= 40.5ms / 27.9ms)

OneFlow resnet50 time: 25.5ms (= 5091.4ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 40.5ms (= 8100.9ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.59 (= 40.5ms / 25.5ms)

OneFlow resnet50 time: 24.8ms (= 4964.6ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 36.3ms (= 7261.8ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.46 (= 36.3ms / 24.8ms)

@github-actions
Contributor

Speed stats:
GPU Name: NVIDIA GeForce RTX 3080 Ti 

❌ OneFlow resnet50 time: 43.7ms (= 4365.1ms / 100, input_shape=[16, 3, 224, 224])
PyTorch resnet50 time: 57.9ms (= 5794.4ms / 100, input_shape=[16, 3, 224, 224])
✔️ Relative speed: 1.33 (= 57.9ms / 43.7ms)

OneFlow resnet50 time: 26.4ms (= 2641.6ms / 100, input_shape=[8, 3, 224, 224])
PyTorch resnet50 time: 37.3ms (= 3731.3ms / 100, input_shape=[8, 3, 224, 224])
✔️ Relative speed: 1.41 (= 37.3ms / 26.4ms)

OneFlow resnet50 time: 18.5ms (= 3690.8ms / 200, input_shape=[4, 3, 224, 224])
PyTorch resnet50 time: 34.5ms (= 6907.4ms / 200, input_shape=[4, 3, 224, 224])
✔️ Relative speed: 1.87 (= 34.5ms / 18.5ms)

OneFlow resnet50 time: 17.9ms (= 3583.3ms / 200, input_shape=[2, 3, 224, 224])
PyTorch resnet50 time: 30.8ms (= 6158.0ms / 200, input_shape=[2, 3, 224, 224])
✔️ Relative speed: 1.72 (= 30.8ms / 17.9ms)

OneFlow resnet50 time: 16.7ms (= 3343.2ms / 200, input_shape=[1, 3, 224, 224])
PyTorch resnet50 time: 30.5ms (= 6093.0ms / 200, input_shape=[1, 3, 224, 224])
✔️ Relative speed: 1.82 (= 30.5ms / 16.7ms)

OneFlow swin dataloader time: 0.199s (= 39.815s / 200, num_workers=1)
PyTorch swin dataloader time: 0.127s (= 25.468s / 200, num_workers=1)
Relative speed: 0.640 (= 0.127s / 0.199s)

OneFlow swin dataloader time: 0.053s (= 10.651s / 200, num_workers=4)
PyTorch swin dataloader time: 0.033s (= 6.562s / 200, num_workers=4)
Relative speed: 0.616 (= 0.033s / 0.053s)

OneFlow swin dataloader time: 0.031s (= 6.107s / 200, num_workers=8)
PyTorch swin dataloader time: 0.017s (= 3.309s / 200, num_workers=8)
Relative speed: 0.542 (= 0.017s / 0.031s)

❌ OneFlow resnet50 time: 49.4ms (= 4942.4ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 65.9ms (= 6588.2ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.33 (= 65.9ms / 49.4ms)

OneFlow resnet50 time: 36.7ms (= 3665.7ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 45.7ms (= 4568.0ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.25 (= 45.7ms / 36.7ms)

OneFlow resnet50 time: 27.7ms (= 5548.4ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 40.1ms (= 8026.8ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.45 (= 40.1ms / 27.7ms)

OneFlow resnet50 time: 25.3ms (= 5058.3ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 38.6ms (= 7729.1ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.53 (= 38.6ms / 25.3ms)

OneFlow resnet50 time: 24.9ms (= 4987.8ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 36.0ms (= 7191.3ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.44 (= 36.0ms / 24.9ms)
