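There are several reasons:

1. We are building a corpus for pretraining. Code is a kind of text that has only emerged in the last few decades, and it offers advantages such as logically compact content and a uniform format, so it is an indispensable part of a pretraining corpus.
2. The other open-source code corpora currently available not only filter code repositories but also clean the character encodings, which leaves very little code containing Chinese comments. Ours is a Chinese corpus and needs to preserve Chinese text as far as possible, so when crawling code we applied special handling to code files in GBK and other encodings.
3. GitHub is not the only open-source code hosting site. We counted 8 other open-source code hosting sites, plus other scattered code, and this data is missing from the other code corpora.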
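The reply does not describe how the GBK files were actually handled. As a minimal sketch of one possible approach, assuming a simple "try UTF-8 first, then fall back to GBK" policy, the hypothetical helper `decode_source` below keeps files with Chinese comments instead of discarding them as invalid UTF-8. The function name, the fallback order, and the return shape are illustrative assumptions, not the project's pipeline.

```python
from typing import Optional, Sequence, Tuple


def decode_source(
    raw: bytes,
    encodings: Sequence[str] = ("utf-8", "gbk"),  # assumed fallback order
) -> Optional[Tuple[str, str]]:
    """Try each encoding in order; return (text, encoding) or None.

    Hypothetical helper: strict decoding is used so that a file is only
    accepted when every byte is valid in the candidate encoding.
    """
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    return None


# A source file with a Chinese comment, saved as GBK rather than UTF-8.
gbk_file = "# 中文注释\nprint(1)\n".encode("gbk")
result = decode_source(gbk_file)
if result is not None:
    text, used_encoding = result
    print(used_encoding)
    print(text)
```

Strict decoding is a deliberate choice here: a lenient mode such as `errors="ignore"` would silently drop the Chinese characters that point 2 is trying to preserve. A fallback like this can still mis-decode bytes from other legacy encodings as GBK, so a real pipeline would likely need an additional sanity check on the decoded text.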