
Some questions about alma. #67

Open
yuanzhiyong1999 opened this issue Oct 30, 2024 · 6 comments
yuanzhiyong1999 commented Oct 30, 2024

I want to reproduce this work. Currently, I am in the first stage (monolingual training).
My script is as follows:

```shell
OUTPUT_DIR=${1:-"./saves/llama-2-7b-oscar-ft"}

# random port between 30000 and 50000
port=$(( RANDOM % (50000 - 30000 + 1) + 30000 ))

accelerate launch --main_process_port ${port} --config_file configs/deepspeed_train_config.yaml \
    run_llmmt.py \
    --model_name_or_path /llms/Qwen2.5-3B-Instruct \
    --oscar_data_path oscar-corpus/OSCAR-2301 \
    --oscar_data_lang en,ru,cs,zh,is,de \
    --interleave_probs "0.17,0.22,0.14,0.19,0.08,0.2" \
    --streaming \
    --max_steps 600000 \
    --do_train \
    --low_cpu_mem_usage \
    --fp16 \
    --learning_rate 2e-5 \
    --weight_decay 0.01 \
    --gradient_accumulation_steps 4 \
    --lr_scheduler_type cosine \
    --warmup_ratio 0.01 \
    --ignore_pad_token_for_loss \
    --ignore_prompt_token_for_loss \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --save_strategy steps \
    --save_steps 2000 \
    --save_total_limit 1 \
    --logging_strategy steps \
    --logging_steps 1 \
    --output_dir ${OUTPUT_DIR} \
    --max_new_tokens 1024 \
    --max_source_length 1024 \
    --seed 42 \
    --overwrite_output_dir \
    --use_flash_attention_2 \
    --report_to wandb \
    --run_name ${OUTPUT_DIR}
```

Running the above script fails with a KeyError:

[screenshot: KeyError traceback]

So I modified utils.py, changing column_names_oscar = ["id", "meta", "text", "raw_text"] to column_names_oscar = [] (I inspected the loaded data and there is only one field: raw_text). However, a new problem appears:

[screenshot: second error traceback]

What is the reason for this? As far as I can tell, the last OSCAR data update was a year ago, so the code should support the existing OSCAR data. Looking forward to your reply.
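For reference, the KeyError described above can be reproduced with a plain dict: if the tokenization code indexes a fixed key (e.g. 'text') but the loaded record only carries 'content' or 'raw_text', the lookup fails. The field names here are assumptions taken from this thread, not verified against the repository:

```python
# Minimal reproduction of the KeyError: a record whose only field is
# 'content' (or 'raw_text') raises KeyError when the code asks for 'text'.
record = {"content": "Hello world"}  # shape assumed from the OSCAR-2301 discussion

try:
    text = record["text"]  # what the original utils.py apparently expects
except KeyError as e:
    print(f"KeyError: {e}")  # KeyError: 'text'

# Falling back to the field that is actually present avoids the error.
text = record.get("content", record.get("raw_text"))
print(text)  # Hello world
```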

@risotoonero

The data fields of OSCAR-2301 may be [content, warc_headers, metadata]. Try changing the fields in utils.py to those fields, and also change the key 'text' to 'content' where the data is traversed. If that doesn't work, extract the 'content' field of OSCAR-2301 into a 'text' field and change the fields to ['id', 'text', 'lang'], with 'id' and 'lang' set by yourself.
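The second workaround suggested above (moving 'content' into a 'text' field) could be sketched as a plain mapping function; the field names are assumptions from this thread, and with Hugging Face datasets the same rename could be done via `dataset.map(...)` or `rename_column`:

```python
# Hypothetical sketch: copy an OSCAR-2301 record's 'content' field into the
# 'text' field the training code expects, dropping the original column.
def to_text_record(example, source_key="content"):
    example = dict(example)  # avoid mutating the caller's record
    example["text"] = example.pop(source_key)
    return example

record = {"content": "Hallo Welt", "warc_headers": {}, "metadata": {}}
print(to_text_record(record)["text"])  # Hallo Welt
```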

@yuanzhiyong1999 changed the title to "Some questions about alma" (originally in Chinese) Oct 30, 2024
@yuanzhiyong1999 (Author)

> The data fields of OSCAR-2301 may be [content, warc_headers, metadata]. Try changing the fields in utils.py to those fields, and also change the key 'text' to 'content' where the data is traversed.

Yes, the OSCAR-2301 data format is exactly as you describe. After I changed column_names_oscar = ["content", "warc_headers", "metadata"] in utils.py, I get KeyError: 'content'. Perhaps column_names_oscar is not what should be modified? It is passed to remove_columns, and I don't quite follow that logic.

@risotoonero

> Yes, the OSCAR-2301 data format is exactly as you describe. After I changed column_names_oscar = ["content", "warc_headers", "metadata"] in utils.py, I get KeyError: 'content'.

No. After I changed ["content", "warc_headers", "metadata"] and 'content' at lines 577 and 627 of my utils.py, I don't get that error.

@yuanzhiyong1999 (Author)

Perhaps my utils.py differs from yours; I pulled the main branch, and lines 577 and 627 don't correspond to the places you modified. I modified lines 772 and 859 instead and got KeyError: 'content'. So the problem probably isn't there, and changing it to that format is probably also incorrect: column_names_oscar is the argument used to drop extra columns, while content is the data that needs to be kept, so it should not be dropped.

@Bostoncake

@yuanzhiyong1999 Did you manage to come up with a solution? I have the same problem when reproducing the monolingual training stage.

@Bostoncake

@yuanzhiyong1999 I came up with a simple solution: change Line 859 to ["raw_text"]. It works because only the raw_text field remains after tokenizing the data; [content, warc_headers, metadata] describe the raw dataset as a whole, while remove_columns only operates on the actual data (train_raw_data['oscar']).
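The behavior described above can be illustrated with a toy stand-in for `datasets.map(..., remove_columns=...)` (the real library raises a similar error when a listed column is absent). The tokenizer and field names here are simplified assumptions, not the repository's actual code:

```python
# Sketch of why ["raw_text"] works: after the tokenize step each example
# carries 'raw_text' plus the new token fields, so remove_columns must name
# columns that still exist at that point, not the raw dataset's columns.
def tokenize(example):
    # stand-in tokenizer: the real code would add input_ids from a tokenizer
    return {**example, "input_ids": list(range(len(example["raw_text"])))}

def map_with_remove(examples, fn, remove_columns):
    out = []
    for ex in examples:
        ex = fn(ex)
        for col in remove_columns:
            if col not in ex:
                raise KeyError(col)  # absent column -> error, as in the thread
            del ex[col]
        out.append(ex)
    return out

data = [{"raw_text": "abc"}]
print(map_with_remove(data, tokenize, ["raw_text"]))  # [{'input_ids': [0, 1, 2]}]
# map_with_remove(data, tokenize, ["content"]) would raise KeyError: 'content'
```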
