I have a lot of compute and support open-source projects; please clean the data as thoroughly as possible #6
Comments
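1. The later a package is released, the more thoroughly its data has been cleaned. After the first phase (which runs to the end of this month, when the project will be exactly one month old and will have reached its initial goal of accumulating 1 TB of data), we will start re-cleaning and repackaging the historical data archives.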
Also, there is 900 GB of Chinese corpus data here: https://huggingface.co/datasets/oscar-corpus/OSCAR-2201
When collecting data, we will try to avoid including duplicates.
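A personal suggestion: the data currently uploaded to Hugging Face includes legal documents and even material from Xuexi Qiangguo (学习强国). Some of it contains private information. I personally think this part of the data should be kept private for now or de-identified; otherwise it could easily create liability.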
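This data is explicitly made public under national laws and regulations. Please point out the specific files that contain private information.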
May I ask where the claim that ChatGPT used 40T of data comes from?
OSCAR-2201 is a multilingual text corpus that totals 123 GB on Hugging Face. Does it really contain 900 GB of Chinese corpus data once decompressed?
Hi everyone, I am https://www.zhihu.com/question/570713548/answer/2845310510
I remember visiting 里屋 myself years ago, haha. I suggest the project set up a Discord; you can find me on Discord: https://discord.gg/bDSBUMeFpc