
Loading the WuDao dataset blows up CPU memory #6

Open
hujunchao opened this issue Feb 15, 2023 · 4 comments

Comments

@hujunchao

I downloaded the full WuDao text dataset and followed the suggested workflow: first generate the pyarrow files, then use load data to load the full dataset into the cache for training. CPU memory blew up while loading the dataset. How should I solve this? Thanks!

@GGGGGGXY
Collaborator

Roughly how much memory do you have? Our whole pipeline can itself need a few tens of GB.

@hujunchao
Author

The machine has 500 GB of CPU memory, and it still crashed.

@GGGGGGXY
Collaborator

Try loading the dataset directly; that should work fine. If it still crashes with 500 GB, the most likely cause is inside the torch DataLoader: when it generates the traversal order over the samples, that index array blows up the memory.
For example, if you have 1 billion samples, the torch sampler will generate an array of 1 billion int64 indices for them, and it gets copied once per num_worker.

You can try loading the dataset directly and see whether memory still blows up. If it doesn't, it is worth looking in the direction I mentioned above.
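For a rough sense of the scale involved, here is a back-of-the-envelope sketch based only on the numbers in the comment above (1 billion samples, int64 indices, one copy per DataLoader worker); the num_workers value is an assumed example:

```python
# Rough estimate of the sampler index-array cost described above:
# 1e9 samples, one int64 index each, duplicated once per DataLoader worker.
num_samples = 1_000_000_000   # ~1 billion samples, as in the example above
bytes_per_index = 8           # int64
num_workers = 8               # assumed DataLoader num_workers, for illustration

per_copy_gb = num_samples * bytes_per_index / 1024**3
total_gb = per_copy_gb * num_workers
print(f"index array: {per_copy_gb:.1f} GB per copy, "
      f"~{total_gb:.1f} GB across {num_workers} workers")
# -> index array: 7.5 GB per copy, ~59.6 GB across 8 workers
```

If the indices end up materialized as a plain Python list of ints rather than a compact int64 buffer, each copy can be several times larger again, since every Python int object carries its own overhead.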

@hujunchao
Author

One more question: in torch DDP mode, each GPU by default gets its own copy of the dataset, so as the dataset grows, memory will eventually run out. For very large datasets, say TB-scale, how should the data loading side be implemented? Any suggestions?
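(Not an answer from the maintainers, just a sketch of one common approach.) One way to avoid every DDP rank holding the full dataset is to stream from sharded files with an IterableDataset, so each rank/worker pair only ever reads its own slice and no full-length index array is built. The file list, record format, and sharding scheme below are placeholder assumptions:

```python
import torch
from torch.utils.data import IterableDataset, DataLoader, get_worker_info

class ShardedTextDataset(IterableDataset):
    """Streams lines from a list of files; each (rank, worker) pair reads
    only its own subset, so no process materializes the whole dataset."""

    def __init__(self, files, rank, world_size):
        self.files = files            # e.g. the pre-generated shard files
        self.rank = rank              # DDP rank of this process
        self.world_size = world_size  # total number of DDP processes

    def __iter__(self):
        info = get_worker_info()
        num_workers = info.num_workers if info else 1
        worker_id = info.id if info else 0
        # Global shard id: split files first across DDP ranks,
        # then across DataLoader workers inside each rank.
        shard = self.rank * num_workers + worker_id
        num_shards = self.world_size * num_workers
        for i, path in enumerate(self.files):
            if i % num_shards != shard:
                continue
            with open(path, encoding="utf-8") as f:
                for line in f:
                    yield line.rstrip("\n")

# Hypothetical usage inside a DDP process:
# dataset = ShardedTextDataset(files, rank=torch.distributed.get_rank(),
#                              world_size=torch.distributed.get_world_size())
# loader = DataLoader(dataset, batch_size=32, num_workers=4)
```

Another option along the same lines is to keep the Arrow/pyarrow files memory-mapped on disk and only index into them on demand, so the OS page cache is shared across processes instead of each rank holding its own in-RAM copy.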
