
Some questions about alma. #67

Open
yuanzhiyong1999 opened this issue Oct 30, 2024 · 6 comments
yuanzhiyong1999 commented Oct 30, 2024

I want to reproduce this work. Currently, I am in the first stage (monolingual training).
My script is as follows:

```shell
OUTPUT_DIR=${1:-"./saves/llama-2-7b-oscar-ft"}

# random port between 30000 and 50000
port=$(( RANDOM % (50000 - 30000 + 1) + 30000 ))

accelerate launch --main_process_port ${port} --config_file configs/deepspeed_train_config.yaml \
    run_llmmt.py \
    --model_name_or_path /llms/Qwen2.5-3B-Instruct \
    --oscar_data_path oscar-corpus/OSCAR-2301 \
    --oscar_data_lang en,ru,cs,zh,is,de \
    --interleave_probs "0.17,0.22,0.14,0.19,0.08,0.2" \
    --streaming \
    --max_steps 600000 \
    --do_train \
    --low_cpu_mem_usage \
    --fp16 \
    --learning_rate 2e-5 \
    --weight_decay 0.01 \
    --gradient_accumulation_steps 4 \
    --lr_scheduler_type cosine \
    --warmup_ratio 0.01 \
    --ignore_pad_token_for_loss \
    --ignore_prompt_token_for_loss \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --save_strategy steps \
    --save_steps 2000 \
    --save_total_limit 1 \
    --logging_strategy steps \
    --logging_steps 1 \
    --output_dir ${OUTPUT_DIR} \
    --max_new_tokens 1024 \
    --max_source_length 1024 \
    --seed 42 \
    --overwrite_output_dir \
    --use_flash_attention_2 \
    --report_to wandb \
    --run_name ${OUTPUT_DIR}
```

Running the above script fails with a KeyError:

[screenshot: KeyError traceback]

So I modified utils.py, changing column_names_oscar = ["id", "meta", "text", "raw_text"] to column_names_oscar = [] (I inspected the loaded data and there is only one field: raw_text). However, a new problem appears:

[screenshot: second error traceback]

What is the reason for this? As far as I can tell, the last OSCAR data update was a year ago, so the code should support the existing OSCAR data. Looking forward to your reply.
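For reference, the KeyError described above can be reproduced with a plain dict: if the tokenization code indexes a fixed key (e.g. 'text') but the loaded record only carries 'content' or 'raw_text', the lookup fails. The field names here are assumptions taken from this thread, not verified against the repository:

```python
# Minimal reproduction of the KeyError: a record whose only field is
# 'content' (or 'raw_text') raises KeyError when the code asks for 'text'.
record = {"content": "Hello world"}  # shape assumed from the OSCAR-2301 discussion

try:
    text = record["text"]  # what the original utils.py apparently expects
except KeyError as e:
    print(f"KeyError: {e}")  # KeyError: 'text'

# Falling back to the field that is actually present avoids the error.
text = record.get("content", record.get("raw_text"))
print(text)  # Hello world
```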

@risotoonero

The data fields of OSCAR-2301 may be [content, warc_headers, metadata]. Try changing the fields in utils.py to those fields, and also change the key 'text' to 'content' where the data is traversed. If that doesn't work, extract the 'content' field of OSCAR-2301 into a 'text' field and change the fields to ['id', 'text', 'lang'], with 'id' and 'lang' set by yourself.
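The second workaround suggested above (moving 'content' into a 'text' field) could be sketched as a plain mapping function; the field names are assumptions from this thread, and with Hugging Face datasets the same rename could be done via `dataset.map(...)` or `rename_column`:

```python
# Hypothetical sketch: copy an OSCAR-2301 record's 'content' field into the
# 'text' field the training code expects, dropping the original column.
def to_text_record(example, source_key="content"):
    example = dict(example)  # avoid mutating the caller's record
    example["text"] = example.pop(source_key)
    return example

record = {"content": "Hallo Welt", "warc_headers": {}, "metadata": {}}
print(to_text_record(record)["text"])  # Hallo Welt
```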

@yuanzhiyong1999 changed the title to "Some questions about alma" (originally in Chinese) Oct 30, 2024
@yuanzhiyong1999 (Author)

> The data fields of OSCAR-2301 may be [content, warc_headers, metadata]. Try changing the fields in utils.py to those fields, and also change the key 'text' to 'content' where the data is traversed.

Yes, the OSCAR-2301 data format is exactly as you describe. After I changed column_names_oscar = ["content", "warc_headers", "metadata"] in utils.py, I get KeyError: 'content'. Perhaps column_names_oscar is not what should be modified? It is passed to remove_columns, and I don't quite follow that logic.

@risotoonero

> Yes, the OSCAR-2301 data format is exactly as you describe. After I changed column_names_oscar = ["content", "warc_headers", "metadata"] in utils.py, I get KeyError: 'content'.

No. After I changed ["content", "warc_headers", "metadata"] and 'content' at lines 577 and 627 of my utils.py, I don't get that error.

@yuanzhiyong1999 (Author)

Perhaps my utils.py differs from yours; I pulled the main branch, and lines 577 and 627 don't correspond to the places you modified. I modified lines 772 and 859 instead and got KeyError: 'content'. So the problem probably isn't there, and changing it to that format is probably also incorrect: column_names_oscar is the argument used to drop extra columns, while content is the data that needs to be kept, so it should not be dropped.

@Bostoncake

@yuanzhiyong1999 Did you manage to come up with a solution? I have the same problem when reproducing the monolingual training stage.

@Bostoncake

@yuanzhiyong1999 I came up with a simple solution: change Line 859 to ["raw_text"]. It works because only the raw_text field remains after tokenizing the data; [content, warc_headers, metadata] describe the raw dataset as a whole, while remove_columns only operates on the actual data (train_raw_data['oscar']).
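The behavior described above can be illustrated with a toy stand-in for `datasets.map(..., remove_columns=...)` (the real library raises a similar error when a listed column is absent). The tokenizer and field names here are simplified assumptions, not the repository's actual code:

```python
# Sketch of why ["raw_text"] works: after the tokenize step each example
# carries 'raw_text' plus the new token fields, so remove_columns must name
# columns that still exist at that point, not the raw dataset's columns.
def tokenize(example):
    # stand-in tokenizer: the real code would add input_ids from a tokenizer
    return {**example, "input_ids": list(range(len(example["raw_text"])))}

def map_with_remove(examples, fn, remove_columns):
    out = []
    for ex in examples:
        ex = fn(ex)
        for col in remove_columns:
            if col not in ex:
                raise KeyError(col)  # absent column -> error, as in the thread
            del ex[col]
        out.append(ex)
    return out

data = [{"raw_text": "abc"}]
print(map_with_remove(data, tokenize, ["raw_text"]))  # [{'input_ids': [0, 1, 2]}]
# map_with_remove(data, tokenize, ["content"]) would raise KeyError: 'content'
```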
