Some questions about alma. #67
Comments
The data fields of OSCAR-2301 may be ["content", "warc_headers", "metadata"]. Try changing the fields in utils.py to these, and also change the key 'text' to 'content' where the records are traversed. If that doesn't work, extract 'content' from OSCAR-2301 into 'text' and change the fields to ['id', 'text', 'lang'], setting 'id' and 'lang' yourself.
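The fallback suggested above (copying the text field into 'text' and setting 'id' and 'lang' yourself) could look roughly like the sketch below. It is plain Python for illustration only: the helper name `to_text_record`, the list of candidate keys, and the sample record are assumptions, not code from utils.py.

```python
# Candidate names for the text field. Which one a given OSCAR dump
# actually uses is an assumption to verify against the loaded data.
TEXT_KEYS = ("text", "content", "raw_text")


def to_text_record(record, idx, lang, text_keys=TEXT_KEYS):
    """Return a record with only 'id', 'text' and 'lang' (hypothetical helper)."""
    for key in text_keys:
        if key in record:
            return {"id": idx, "text": record[key], "lang": lang}
    # Fail with a message that lists the fields that are really present.
    raise KeyError(
        f"none of {list(text_keys)} found; available fields: {sorted(record)}"
    )


# Made-up record shaped like the fields mentioned in the comment above.
sample = {"content": "Hello world", "warc_headers": {}, "metadata": {}}
normalized = to_text_record(sample, idx=0, lang="en")
print(normalized)
```

Raising an error that lists the available fields makes a mismatch between the expected and the actual column names visible immediately.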
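Yes, the OSCAR-2301 data format is just as you describe. After I changed utils.py to column_names_oscar = ["content", "warc_headers", "metadata"], I get KeyError: 'content'. Maybe column_names_oscar is not what should be changed? It is used in remove_columns, and I don't quite understand that logic.
No, after I changed lines 577 and 627 of utils.py to ["content", "warc_headers", "metadata"] and 'content', I did not get this error.
My utils.py may be different from yours. I pulled the main branch, and lines 577 and 627 do not match the places that need changing, so I changed lines 772 and 859 instead and got KeyError: 'content'. So the problem is probably not here, and changing to this format is probably not correct either, because column_names_oscar is the parameter used to remove extra columns, whereas content is what needs to be kept and should not be removed.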
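One way to reason about the remove_columns question is a simplified model of a map step that also drops columns. The sketch below is plain Python, not the datasets library and not the code in utils.py; `map_with_remove_columns` and `tokenize` are hypothetical names, and a whitespace split stands in for a real tokenizer. In this model the mapping function reads the original columns before they are dropped, so a KeyError comes from the function asking for a key the records do not have, not from the removal list itself.

```python
def map_with_remove_columns(examples, fn, remove_columns):
    """Simplified stand-in for a map call with a remove_columns argument."""
    out = []
    for example in examples:
        new_fields = fn(example)  # fn sees every original column
        # Drop the listed input columns, then add what fn produced.
        kept = {k: v for k, v in example.items() if k not in remove_columns}
        kept.update(new_fields)
        out.append(kept)
    return out


def tokenize(example, text_key="content"):
    # Whitespace split stands in for a real tokenizer.
    return {"input_ids": example[text_key].split()}


rows = [{"content": "a b c", "warc_headers": {}, "metadata": {}}]
result = map_with_remove_columns(
    rows, tokenize, ["content", "warc_headers", "metadata"]
)
print(result)  # [{'input_ids': ['a', 'b', 'c']}]

# If the records only carry 'raw_text', the lookup inside tokenize fails.
try:
    map_with_remove_columns([{"raw_text": "a b c"}], tokenize, [])
    missing_key = None
except KeyError as err:
    missing_key = err.args[0]
print(missing_key)  # content
```

Under these assumptions, the key the mapping function reads and the column list have to be checked together against the fields the loaded data actually contains.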
@yuanzhiyong1999 Did you manage to come up with a solution? I have the same problem when reproducing the monolingual training stage.
@yuanzhiyong1999 I came up with a simple solution: change Line 859 to ["raw_text"]. It will work fine since only
I want to reproduce this work. Currently, I am in the first stage (monolingual training).
My script is as follows:
OUTPUT_DIR=${1:-"./saves/llama-2-7b-oscar-ft"}
# random port between 30000 and 50000
port=$(( RANDOM % (50000 - 30000 + 1 ) + 30000 ))
accelerate launch --main_process_port ${port} --config_file configs/deepspeed_train_config.yaml \
    run_llmmt.py \
    --model_name_or_path /llms/Qwen2.5-3B-Instruct \
    --oscar_data_path oscar-corpus/OSCAR-2301 \
    --oscar_data_lang en,ru,cs,zh,is,de \
    --interleave_probs "0.17,0.22,0.14,0.19,0.08,0.2" \
    --streaming \
    --max_steps 600000 \
    --do_train \
    --low_cpu_mem_usage \
    --fp16 \
    --learning_rate 2e-5 \
    --weight_decay 0.01 \
    --gradient_accumulation_steps 4 \
    --lr_scheduler_type cosine \
    --warmup_ratio 0.01 \
    --ignore_pad_token_for_loss \
    --ignore_prompt_token_for_loss \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --save_strategy steps \
    --save_steps 2000 \
    --save_total_limit 1 \
    --logging_strategy steps \
    --logging_steps 1 \
    --output_dir ${OUTPUT_DIR} \
    --max_new_tokens 1024 \
    --max_source_length 1024 \
    --seed 42 \
    --overwrite_output_dir \
    --use_flash_attention_2 \
    --report_to wandb \
    --run_name ${OUTPUT_DIR}
Running the script above raises a KeyError.
So I modified utils.py, changing column_names_oscar = ["id", "meta", "text", "raw_text"] to column_names_oscar = [] (I inspected the loaded data and it has only one field: raw_text). However, a new problem then arises:
So what is the reason for this? I checked, and the OSCAR data was last updated a year ago, so the code should support the existing OSCAR data. Looking forward to your reply.