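Hello author, I also hit the deterministic-algorithm warning that stops the model from running, the same problem reported in the comments. I am training on four RTX 3090s with device set to 0,1,2,3. Specifying a single GPU immediately gives a CUDA error. I would appreciate your advice. #16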

@urban-drummer

Description

segment/train: weights=/media/dell/lhx/yolo/ASF-YOLO/yolov5l-seg.pt, cfg=/media/dell/lhx/yolo/ASF-YOLO/models/segment/asf-yolo.yaml, data=/media/dell/lhx/yolo/ASF-YOLO/data/bcc.yaml, hyp=/media/dell/lhx/yolo/ASF-YOLO/data/hyps/hyp.scratch-low.yaml, epochs=100, batch_size=8, imgsz=640, rect=False, resume=False, nosave=False, noval=False, noautoanchor=False, noplots=False, evolve=None, bucket=, cache=None, image_weights=False, device=0,1,2,3, multi_scale=False, single_cls=False, optimizer=SGD, sync_bn=False, workers=8, project=../runs_2/train-seg, name=improve, exist_ok=False, quad=False, cos_lr=False, label_smoothing=0.0, patience=100, freeze=[0], save_period=-1, seed=0, local_rank=-1, mask_ratio=4, no_overlap=False
YOLOv5  2024-5-30 Python-3.8.0 torch-2.3.1+cu121 CUDA:0 (NVIDIA GeForce RTX 3090, 24260MiB)
CUDA:1 (NVIDIA GeForce RTX 3090, 24260MiB)
CUDA:2 (NVIDIA GeForce RTX 3090, 24260MiB)
CUDA:3 (NVIDIA GeForce RTX 3090, 24260MiB)

hyperparameters: lr0=0.01, lrf=0.01, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=0.05, cls=0.5, cls_pw=1.0, obj=1.0, obj_pw=1.0, iou_t=0.2, anchor_t=4.0, fl_gamma=0.0, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.1, scale=0.5, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, mosaic=1.0, mixup=0.0, copy_paste=0.0
TensorBoard: Start with 'tensorboard --logdir ../runs_2/train-seg', view at http://localhost:6006/
Overriding model.yaml nc=80 with nc=1

                 from  n    params  module                                  arguments
  0                -1  1      7040  models.common.Conv                      [3, 64, 6, 2, 2]
  1                -1  1     73984  models.common.Conv                      [64, 128, 3, 2]
  2                -1  3    156928  models.common.C3                        [128, 128, 3]
  3                -1  1    295424  models.common.Conv                      [128, 256, 3, 2]
  4                -1  6   1118208  models.common.C3                        [256, 256, 6]
  5                -1  1   1180672  models.common.Conv                      [256, 512, 3, 2]
  6                -1  9   6433792  models.common.C3                        [512, 512, 9]
  7                -1  1   4720640  models.common.Conv                      [512, 1024, 3, 2]
  8                -1  3   9971712  models.common.C3                        [1024, 1024, 3]
  9                -1  1   2624512  models.common.SPPF                      [1024, 1024, 5]
 10                -1  1    525312  models.common.Conv                      [1024, 512, 1, 1]
 11                 4  1    132096  models.common.Conv                      [256, 512, 1, 1]
 12       [-1, 6, -2]  1         0  models.common.Zoom_cat                  [512]
 13                -1  3   3019776  models.common.C3                        [1536, 512, 3, False]
 14                -1  1    131584  models.common.Conv                      [512, 256, 1, 1]
 15                 2  1     33280  models.common.Conv                      [128, 256, 1, 1]
 16       [-1, 4, -2]  1         0  models.common.Zoom_cat                  [256]
 17                -1  3    756224  models.common.C3                        [768, 256, 3, False]
 18                -1  1    590336  models.common.Conv                      [256, 256, 3, 2]
 19          [-1, 14]  1         0  models.common.Concat                    [1]
 20                -1  3   2495488  models.common.C3                        [512, 512, 3, False]
 21                -1  1   2360320  models.common.Conv                      [512, 512, 3, 2]
 22          [-1, 10]  1         0  models.common.Concat                    [1]
 23                -1  3   9971712  models.common.C3                        [1024, 1024, 3, False]
 24         [4, 6, 8]  1    460544  models.common.ScalSeq                   [256]
 25          [17, -1]  1     12325  models.common.attention_model           [256]
 26      [-1, 20, 23]  1   1393558  models.yolo.Segment                     [1, [[10, 13, 16, 30, 33, 23], [30, 61, 62, 45, 59, 119], [116, 90, 156, 198, 373, 326]], 32, 256, [256, 512, 1024]]
asf-yolo summary: 407 layers, 48465467 parameters, 48465467 gradients, 155.4 GFLOPs

Transferred 602/671 items from /media/dell/lhx/yolo/ASF-YOLO/yolov5l-seg.pt
AMP: checks passed ✅
optimizer: SGD(lr=0.01) with parameter groups 110 weight(decay=0.0), 116 weight(decay=0.0005), 114 bias
WARNING ⚠️ DP not recommended, use torch.distributed.run for best DDP Multi-GPU results.
See Multi-GPU Tutorial at ultralytics/yolov5#475 to get started.
train: Scanning /media/dell/lhx/yolo/ASF-YOLO/datasets/BCC/labels/train.cache... 128 images, 0 backgrounds, 0 corrupt: 100%|██████████| 128/128 00:00
val: Scanning /media/dell/lhx/yolo/ASF-YOLO/datasets/BCC/labels/val.cache... 32 images, 0 backgrounds, 0 corrupt: 100%|██████████| 32/32 00:00

AutoAnchor: 4.36 anchors/target, 0.970 Best Possible Recall (BPR). Anchors are a poor fit to dataset ⚠️, attempting to improve...
AutoAnchor: WARNING ⚠️ Extremely small objects found: 47 of 1235 labels are <3 pixels in size
AutoAnchor: Running kmeans for 9 anchors on 1235 points...
AutoAnchor: Evolving anchors with Genetic Algorithm: fitness = 0.7403: 100%|██████████| 1000/1000 00:00
AutoAnchor: thr=0.25: 0.9571 best possible recall, 6.31 anchors past thr
AutoAnchor: n=9, img_size=640, metric_all=0.391/0.743-mean/best, past_thr=0.495-mean: 25,43, 88,52, 51,155, 92,121, 163,129, 116,183, 236,232, 160,418, 350,452
AutoAnchor: Done ⚠️ (original anchors better than new anchors, proceeding with original anchors)
Plotting labels to ../runs_2/train-seg/improve3/labels.jpg...
Image sizes 640 train, 640 val
Using 8 dataloader workers
Logging results to ../runs_2/train-seg/improve3
Starting training for 100 epochs...

  Epoch    GPU_mem   box_loss   seg_loss   obj_loss   cls_loss  Instances       Size

0%| | 0/16 00:03
Traceback (most recent call last):
  File "/media/dell/lhx/yolo/ASF-YOLO/segment/train.py", line 658, in <module>
    main(opt)
  File "/media/dell/lhx/yolo/ASF-YOLO/segment/train.py", line 554, in main
    train(opt.hyp, opt, device, callbacks)
  File "/media/dell/lhx/yolo/ASF-YOLO/segment/train.py", line 317, in train
    scaler.scale(loss).backward()
  File "/home/leihaoxiang/.conda/envs/yolo/lib/python3.8/site-packages/torch/_tensor.py", line 525, in backward
    torch.autograd.backward(
  File "/home/leihaoxiang/.conda/envs/yolo/lib/python3.8/site-packages/torch/autograd/__init__.py", line 267, in backward
    _engine_run_backward(
  File "/home/leihaoxiang/.conda/envs/yolo/lib/python3.8/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward
    return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: max_pool3d_with_indices_backward_cuda does not have a deterministic implementation, but you set 'torch.use_deterministic_algorithms(True)'. You can turn off determinism just for this operation, or you can use the 'warn_only=True' option, if that's acceptable for your application. You can also file an issue at https://github.com/pytorch/pytorch/issues to help us prioritize adding deterministic support for this operation.

Process finished with exit code 1
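
The RuntimeError in the log names two ways out: turn determinism off for the failing operation, or use `warn_only=True`. Below is a minimal sketch of the second option. It assumes the training code enables `torch.use_deterministic_algorithms(True)` during seeding (upstream YOLOv5 does this in its `init_seeds` helper when called with `deterministic=True`; whether this fork does the same, and where, is an assumption). The function name `relax_determinism` is hypothetical and not part of the repository.

```python
def relax_determinism() -> bool:
    """Keep deterministic mode on, but warn instead of raising for ops that
    lack a deterministic kernel (e.g. max_pool3d_with_indices_backward_cuda).

    Returns True if the setting was applied, False if PyTorch is not installed.
    """
    try:
        import torch  # guarded only so this sketch also runs without PyTorch
    except ImportError:
        return False

    # warn_only exists in PyTorch >= 1.11; the log reports torch 2.3.1.
    torch.use_deterministic_algorithms(True, warn_only=True)
    return torch.is_deterministic_algorithms_warn_only_enabled()


if __name__ == "__main__":
    print("warn-only determinism applied:", relax_determinism())
```

Such a call would need to run after any seeding code that sets strict determinism and before the first `backward()`. With this setting the 3D max-pool backward pass emits a warning rather than an error, so results may vary slightly between runs.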
