Training Detail for Pretrain #24

Open
EasonXiao-888 opened this issue Oct 6, 2023 · 6 comments

Comments

@EasonXiao-888

Hello, thanks for your fancy work. I want to confirm two things: is the pretrained model validated on the val set of the QVHighlight dataset, and is the ckpt selected by comparing [email protected]? Also, could you please share the log file for pretraining?

@QinghongLin
Collaborator

@EasonXiao-888 Yes, during pretraining I use the zero-shot QVHighlight results to monitor training.
The ckpt should be selected by mAP, which is more comprehensive than mAP avg.
For downstream tasks, however, I suggest trying different ckpts on different benchmarks (e.g., zero-shot) to find the best one.
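
The per-benchmark selection suggested above could be sketched as follows. This is only an illustration: the function name `select_checkpoint` and the nested `results` layout (checkpoint name, then benchmark name, then metric name) are hypothetical and not part of this repository.

```python
def select_checkpoint(results, benchmark, metric):
    """Return the checkpoint name with the highest `metric` on `benchmark`.

    `results` is a hypothetical layout:
    checkpoint name -> benchmark name -> metric name -> value.
    Checkpoints that do not report the benchmark/metric are skipped.
    """
    scored = {
        ckpt: per_bench[benchmark][metric]
        for ckpt, per_bench in results.items()
        if metric in per_bench.get(benchmark, {})
    }
    if not scored:
        raise ValueError(f"no checkpoint reports {metric!r} on {benchmark!r}")
    # Higher is assumed to be better for the chosen metric.
    return max(scored, key=scored.get)
```

Calling it once per benchmark gives one preferred checkpoint per downstream task rather than a single global choice.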

Sure, I can share the log with you, but it might take a few days to retrieve. Please send me an email if I do not respond in time.

@EasonXiao-888
Author

Okay, thanks a lot. I have one more question. When we use the "Curve" data for pretraining on an A100, training cannot start because of CPU memory problems. Have you encountered this problem?

@QinghongLin
Collaborator

I think this may be due to the cache option --use_cache, which tries to load the whole pretraining corpus into memory. Can you try removing it from your training script?
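
For contrast with loading the whole corpus up front, here is a minimal sketch of a lazily read JSONL dataset that keeps only byte offsets in memory. It is not the repository's dataset class: `LazyJsonlDataset` is a hypothetical name, and it assumes one JSON object per line. An object with `__len__` and `__getitem__` like this one should be usable as a map-style dataset with a PyTorch `DataLoader`, but that has not been checked against this codebase.

```python
import json
from array import array


class LazyJsonlDataset:
    """Hypothetical map-style dataset: stores line offsets, parses on demand."""

    def __init__(self, path):
        self.path = path
        # A compact array of unsigned 64-bit ints instead of a list of objects.
        self.offsets = array("Q")
        with open(path, "rb") as f:
            while True:
                pos = f.tell()
                line = f.readline()
                if not line:
                    break
                if line.strip():  # skip blank lines
                    self.offsets.append(pos)
        # Opened on first access so each worker process gets its own handle.
        self._file = None

    def __len__(self):
        return len(self.offsets)

    def __getitem__(self, idx):
        if self._file is None:
            self._file = open(self.path, "rb")
        self._file.seek(self.offsets[idx])
        return json.loads(self._file.readline())

    def close(self):
        if self._file is not None:
            self._file.close()
            self._file = None
```

The trade-off is one disk seek and one JSON parse per sample in exchange for memory that grows with the number of lines rather than with the size of the parsed records.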

@RobertLuo1

I encountered the same problem. I did not use the cache, and when I load the Curve data, num_workers can only be set to 0; otherwise the problem occurs. But with num_workers set to 0, the program is quite slow.

@QinghongLin
Collaborator

@EasonXiao-888 @RobertLuo1 Can you provide the detailed error output and the corresponding code line so I can understand the problem better? Thank you.

@Aarontncl

Same problem here. My program always gets stuck when loading the "curve_5_window.jsonl" file into the dataset. I used DDP and tried setting num_workers=0, but it still didn't work. What CPU hardware environment was used for the pretraining? It seems that the pretraining has very high CPU hardware requirements. Thank you. @QinghongLin
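
To get a rough idea of how much CPU memory a fully loaded JSONL file would need before starting a run, one option is to parse a small sample and extrapolate. The sketch below uses only the Python standard library; `estimate_load_memory` is a hypothetical helper, and the result is an extrapolation that assumes the sampled lines are representative of the whole file. Under DDP, each process that loads the file would need roughly this amount on its own.

```python
import json
import os
import tracemalloc


def estimate_load_memory(path, sample_lines=1000):
    """Estimate bytes of RAM needed to hold every parsed record of a JSONL file.

    Parses only the first `sample_lines` lines, measures the Python memory
    allocated while they are held, and scales by file size.
    """
    records = []
    sampled_bytes = 0
    tracemalloc.start()
    try:
        with open(path, "rb") as f:
            for _ in range(sample_lines):
                line = f.readline()
                if not line:
                    break
                sampled_bytes += len(line)
                if line.strip():
                    records.append(json.loads(line))
        # Currently allocated bytes while the sampled records are still alive.
        used, _peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    if sampled_bytes == 0:
        return 0
    return int(used * os.path.getsize(path) / sampled_bytes)
```

Comparing this number, multiplied by the number of processes per node, against the available RAM can indicate whether the hang is likely a memory issue.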

4 participants