
data #3

Open

Ystartff opened this issue Sep 22, 2024 · 5 comments

@Ystartff

If I want to substitute a multimodal dataset other than the one used in the paper, where can I find your dataset-loading and preprocessing class?

@Ystartff
Author

I was also wondering whether your AMOS dataset was randomly selected as 500 CTs and 100 MRIs, and whether you could provide the corresponding dataset.

@zxg043017
Collaborator

zxg043017 commented Sep 23, 2024

Thank you for your attention to our work. You can use "monai_preprocess_data.py" to process your own dataset. You can access our preprocessed AMOS dataset from here. Thank you.

@Ystartff
Author

```
Traceback (most recent call last):
  File "main.py", line 222, in <module>
    main()
  File "main.py", line 135, in main
    setup(rank, world_size, args)
  File "main.py", line 28, in setup
    dist.init_process_group(
  File "/home/panxue/anaconda3/envs/CMC/lib/python3.8/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
    return func(*args, **kwargs)
  File "/home/panxue/anaconda3/envs/CMC/lib/python3.8/site-packages/torch/distributed/c10d_logger.py", line 89, in wrapper
    func_return = func(*args, **kwargs)
  File "/home/panxue/anaconda3/envs/CMC/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1305, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/home/panxue/anaconda3/envs/CMC/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 199, in _tcp_rendezvous_handler
    store = _create_c10d_store(result.hostname, result.port, rank, world_size, timeout, use_libuv)
  File "/home/panxue/anaconda3/envs/CMC/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 174, in _create_c10d_store
    return TCPStore(
torch.distributed.DistNetworkError: The client socket has timed out after 600s while trying to connect to (127.0.0.1, 12361).
```

I ran into this problem when using VIT3D.

@zxg043017
Collaborator

Hi, this error indicates that a process in your PyTorch distributed training setup tried to establish a connection to the specified address and port but was unable to do so within the timeout period (600 seconds). Please correct the launch command in your scripts when running VIT3D with multiple GPUs.

@Ystartff
Author

Ystartff commented Oct 5, 2024

```
(CMC) cln@user-NF5280M5:/mnt/data/cln/yyf/CMC-main$ CUDA_VISIBLE_DEVICES=0,1 python main.py --backbone 'VIT3D' --batch_size 4 --img_size 96
max_epochs => 500
val_every => 30
lr => 0.0001
weight_decay => 1e-05
model_type => vit_b_ori
batch_size => 4
img_size => 96
resume => 0
optim_lr => 0.001
optim_name => adamw
reg_weight => 1e-05
momentum => 0.99
checkpoint => ./test_250/2_epoch/model_final.pt
logdir => checkpoint/test
pretrain => ./pretrain_model/sam_vit_b_01ec64.pth
de_pretrain => ./pretrain_model/unet.pth
smooth_dr => 1e-06
smooth_nr => 0.0
rank => 0
test_mode => 0
backbone => VIT3D
workers => 2
dist => True
port => 22
gpu_ids => [0, 1]
local_rank => 1
multi_gpu => True
in_channels => 1
out_channels => 16
squared_dice => 1
lrschedule => warmup_cosine
warmup_epochs => 150
amp => 1
dropout_rate => 0.0
dropout_path_rate => 0.0
RandFlipd_prob => 0.2
RandRotate90d_prob => 0.2
RandScaleIntensityd_prob => 0.3
RandShiftIntensityd_prob => 0.1
consistency_type => kl
with_cons => without_cons
consistency => 1.0
consistency_rampup => 500.0
fusion_start_epoch => 450
device => cuda

Use VIT3D of SAM-3D pretrained weights
Use pretrained weights
[rank1]: Traceback (most recent call last):
[rank1]:   File "main.py", line 224, in <module>
[rank1]:     main()
[rank1]:   File "main.py", line 138, in main
[rank1]:     model = DDP(model, device_ids=[args.device])
[rank1]:   File "/home/cln/anaconda3/envs/CMC/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 798, in __init__
[rank1]:     _verify_param_shape_across_processes(self.process_group, parameters)
[rank1]:   File "/home/cln/anaconda3/envs/CMC/lib/python3.8/site-packages/torch/distributed/utils.py", line 269, in _verify_param_shape_across_processes
[rank1]:     return dist._verify_params_across_processes(process_group, tensors, logger)
[rank1]: torch.distributed.DistBackendError: [1] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0', but store->get('0') got error: Stop_waiting response is expected
[rank1]: Exception raised from doWait at ../torch/csrc/distributed/c10d/TCPStore.cpp:540 (most recent call first):
[rank1]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7facb9668897 in /home/cln/anaconda3/envs/CMC/lib/python3.8/site-packages/torch/lib/libc10.so)
[rank1]: frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, char const*) + 0x68 (0x7facb9618bee in /home/cln/anaconda3/envs/CMC/lib/python3.8/site-packages/torch/lib/libc10.so)
[rank1]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef<std::string>, std::chrono::duration<long, std::ratio<1l, 1000l> >) + 0x336 (0x7facf3180cf6 in /home/cln/anaconda3/envs/CMC/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
[rank1]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7facf3180f82 in /home/cln/anaconda3/envs/CMC/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
[rank1]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7facf3181fd1 in /home/cln/anaconda3/envs/CMC/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
[rank1]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7facf3136371 in /home/cln/anaconda3/envs/CMC/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
[rank1]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7facf3136371 in /home/cln/anaconda3/envs/CMC/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
[rank1]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7facf3136371 in /home/cln/anaconda3/envs/CMC/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
[rank1]: frame #8: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7facba9436d9 in /home/cln/anaconda3/envs/CMC/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
[rank1]: frame #9: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7facba94ab60 in /home/cln/anaconda3/envs/CMC/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
[rank1]: frame #10: c10d::ProcessGroupNCCL::allgather(std::vector<std::vector<at::Tensor, std::allocator<at::Tensor> >, std::allocator<std::vector<at::Tensor, std::allocator<at::Tensor> > > >&, std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::AllgatherOptions const&) + 0x857 (0x7facba95d287 in /home/cln/anaconda3/envs/CMC/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
[rank1]: frame #11: + 0x5ade1dd (0x7facf312a1dd in /home/cln/anaconda3/envs/CMC/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
[rank1]: frame #12: + 0x5ae7ea2 (0x7facf3133ea2 in /home/cln/anaconda3/envs/CMC/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
[rank1]: frame #13: + 0x5124446 (0x7facf2770446 in /home/cln/anaconda3/envs/CMC/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
[rank1]: frame #14: + 0x1acf4b8 (0x7facef11b4b8 in /home/cln/anaconda3/envs/CMC/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
[rank1]: frame #15: + 0x5aeff43 (0x7facf313bf43 in /home/cln/anaconda3/envs/CMC/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
[rank1]: frame #16: + 0x5afb73f (0x7facf314773f in /home/cln/anaconda3/envs/CMC/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
[rank1]: frame #17: c10d::verify_params_across_processes(c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, std::vector<at::Tensor, std::allocator<at::Tensor> > const&, std::optional<std::weak_ptr<c10d::Logger> > const&) + 0x26d (0x7facf31ae1ad in /home/cln/anaconda3/envs/CMC/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
[rank1]: frame #18: + 0xcdeaf1 (0x7fad05cd5af1 in /home/cln/anaconda3/envs/CMC/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
[rank1]: frame #19: + 0x47ce2f (0x7fad05473e2f in /home/cln/anaconda3/envs/CMC/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
[rank1]:
[rank1]: frame #29: python() [0x4f1bf3]
[rank1]: frame #38: python() [0x5a5bd1]
[rank1]: frame #39: python() [0x5a4bdf]
[rank1]: frame #40: python() [0x45c538]
[rank1]: frame #42: python() [0x44fe8f]
[rank1]: frame #44: __libc_start_main + 0xe7 (0x7fad07686c87 in /lib/x86_64-linux-gnu/libc.so.6)
[rank1]: frame #45: python() [0x579d3d]
[rank1]: . This may indicate a possible application crash on rank 0 or a network set up issue.
```

I changed the settings; am I using an incorrect run command? Could you please guide me? Here is my setup:

```python
def setup_seed(seed=42):
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    np.random.seed(seed)
    random.seed(seed)

def setup(rank, world_size, args):
    # initialize the process group
    dist.init_process_group(
        backend='nccl',
        init_method=f'tcp://211.69.243.27:{args.port}',
        world_size=world_size,
        rank=rank
    )

def init_seeds(seed=0, cuda_deterministic=True):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    if cuda_deterministic:  # slower, more reproducible
        cudnn.deterministic = True
        cudnn.benchmark = False
    else:  # faster, less reproducible
        cudnn.deterministic = False
        cudnn.benchmark = True

def main():
    import argparse
    parser = argparse.ArgumentParser(description='medical contest')
    parser.add_argument('--max_epochs', default=500, type=int)
    parser.add_argument('--val_every', default=30, type=int)
    parser.add_argument('--lr', default=1e-4, type=float)
    parser.add_argument('--weight_decay', default=1e-5, type=float)
    parser.add_argument('--model_type', type=str, default='vit_b_ori')
    parser.add_argument('--batch_size', default=1, type=int)  # 4
    parser.add_argument('--img_size', default=96, type=int)
    parser.add_argument('--resume', default=0, type=int, help='The path resume from checkpoint')
    parser.add_argument("--optim_lr", default=1e-3, type=float, help="optimization learning rate")
    parser.add_argument("--optim_name", default="adamw", type=str, help="optimization algorithm")
    parser.add_argument("--reg_weight", default=1e-5, type=float, help="regularization weight")
    parser.add_argument("--momentum", default=0.99, type=float, help="momentum")
    parser.add_argument("--checkpoint", default="./test_250/2_epoch/model_final.pt", type=str, help="start training from saved checkpoint")
    parser.add_argument("--logdir", default="checkpoint/test", type=str, help="directory to save the tensorboard logs")
    parser.add_argument('--pretrain', default=f"./pretrain_model/sam_vit_b_01ec64.pth", type=str)
    parser.add_argument('--de_pretrain', default=f"./pretrain_model/unet.pth", type=str)
    parser.add_argument("--smooth_dr", default=1e-6, type=float, help="constant added to dice denominator to avoid nan")
    parser.add_argument("--smooth_nr", default=0.0, type=float, help="constant added to dice numerator to avoid zero")
    parser.add_argument("--rank", default=0, type=int, help="node rank for distributed training")
    parser.add_argument("--test_mode", default=0, type=int, help="node rank for distributed training")
    parser.add_argument('--backbone', default='Foundation_model', help='backbone [Foundation_model or VIT3D]')
    parser.add_argument("--workers", default=2, type=int, help="number of workers")
    parser.add_argument('--dist', dest='dist', type=bool, default=True,
                        help='distributed training or not')
    parser.add_argument('--port', type=int, default=22)
    parser.add_argument('--gpu_ids', type=int, nargs='+', default=[0, 1])
    parser.add_argument("--local_rank", type=int, default=1)
    parser.add_argument('--multi_gpu', action='store_true', default=True)
    parser.add_argument("--in_channels", default=1, type=int, help="number of input channels")
    parser.add_argument("--out_channels", default=16, type=int, help="number of output channels")
    parser.add_argument("--squared_dice", default=1, type=int, help="squared_dice")
    parser.add_argument("--lrschedule", default="warmup_cosine", type=str, help="type of learning rate scheduler")
    parser.add_argument("--warmup_epochs", default=150, type=int, help="number of warmup epochs")
    parser.add_argument("--amp", default=1, type=int, help="use amp for training")
    parser.add_argument("--dropout_rate", default=0.0, type=float, help="dropout rate")
    parser.add_argument("--dropout_path_rate", default=0.0, type=float, help="drop path rate")
    parser.add_argument("--RandFlipd_prob", default=0.2, type=float, help="RandFlipd aug probability")
    parser.add_argument("--RandRotate90d_prob", default=0.2, type=float, help="RandRotate90d aug probability")
    parser.add_argument("--RandScaleIntensityd_prob", default=0.3, type=float,
                        help="RandScaleIntensityd aug probability")
    parser.add_argument("--RandShiftIntensityd_prob", default=0.1, type=float,
                        help="RandShiftIntensityd aug probability")
    parser.add_argument('--consistency_type', type=str,
                        default="kl", help='consistency_type')
    parser.add_argument('--with_cons', type=str,
                        default="without_cons", help='with or without consistency')
    parser.add_argument('--consistency', type=float,
                        default=1.0, help='consistency')
    parser.add_argument('--consistency_rampup', type=float,
                        default=500.0, help='consistency_rampup')
    parser.add_argument('--fusion_start_epoch', default=450, type=int)

    args = parser.parse_args()
    torch.set_float32_matmul_precision('high')
    init_seeds(2023 + args.rank)

    def build_model(args):
        sam_model = sam_model_registry3D[args.model_type](checkpoint=None).to(device)
        if args.multi_gpu:
            sam_model = DDP(sam_model, device_ids=[args.rank], output_device=args.rank)
        return sam_model

    device = torch.device("cuda") if torch.cuda.is_available() else torch.device('cpu')
    args.device = device
    if args.multi_gpu:
        os.environ["CUDA_VISIBLE_DEVICES"] = ','.join([str(i) for i in args.gpu_ids])
    torch.backends.cudnn.benchmark = True
    for k, v in vars(args).items():
        print(k, '=>', v)
    print('-----------------')
    args.NUM_CLASS = args.out_channels

    model = Semi_SM_model(img_size=args.img_size,
                          n_class=args.out_channels,
                          backbone=args.backbone
                          )

    model.to(device)

    # Load pre-trained weights
    if args.pretrain is not None:
        model.load_encoder_params(torch.load(args.pretrain, map_location='cpu'))
        model.load_decoder_params(torch.load(args.de_pretrain, map_location='cpu')['net'])
    if args.dist and args.multi_gpu:
        args.nodes = 1
        args.ngpus_per_node = len(args.gpu_ids)
        world_size = args.nodes * args.ngpus_per_node
        rank = args.local_rank
        setup(rank, world_size, args)
        model = DDP(model, device_ids=[args.device])
```
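As a point of comparison (my own sketch, not the repository's code): `DDP`'s `device_ids` argument expects the current process's local GPU index as an int, not a `torch.device` for the whole node, and each process in the group must be launched with its own distinct rank. The example below shows the wrapping pattern in a single process; the `gloo` backend, port `29521`, and the `Linear` layer are placeholders so it runs on CPU.

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def wrap_model(rank: int, world_size: int, port: int) -> DDP:
    # One training process per GPU: each process joins the same group
    # with its own rank before constructing DDP.
    dist.init_process_group(
        backend="gloo",  # "nccl" on GPU; gloo keeps this sketch CPU-runnable
        init_method=f"tcp://127.0.0.1:{port}",
        world_size=world_size,
        rank=rank,
    )
    model = torch.nn.Linear(4, 2)  # placeholder for the real model
    # On GPU this would be: model.to(rank); DDP(model, device_ids=[rank])
    return DDP(model)

if __name__ == "__main__":
    ddp_model = wrap_model(rank=0, world_size=1, port=29521)
    out = ddp_model(torch.randn(3, 4))
    print(out.shape)  # torch.Size([3, 2])
    dist.destroy_process_group()
```

With that pattern in mind, one thing to double-check in the setup above is that every process reaches the same rendezvous address with a reachable, unprivileged port, and that no two processes share a rank.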
