The pruned model does not match the target structured config #74

Open
dat-browny opened this issue Sep 21, 2024 · 0 comments

I used your sample scripts to prune princeton-nlp/Sheared-LLaMA-2.7B down to the 1.3B size. This is my pruning training config:

data_local: data_tmp/
data_remote: # If blank, files must be present in data_local
tokenizer_name: princeton-nlp/Sheared-LLaMA-2.7B
max_seq_len: 4096
global_seed: 17

# Run Name
run_name: Sheared-LLaMA-2.7B-Pruning

model:
  name: mosaic_llama2_1.3b
  path: models/Sheared-LLaMA-2.7B-composer/state_dict.pt
  init_device: "cpu" 
  tokenizer_name: ${tokenizer_name}
  d_model: 2560
  n_heads: 20
  n_layers: 32
  intermediate_size: 6912
  max_seq_len: ${max_seq_len}
  vocab_size: 32000
  init_std: 0.02
  attn_pdrop: 0.0
  resid_pdrop: 0.0
  emb_pdrop: 0.0
  attn_impl: flash
  rms_norm_eps: 1e-5
  l0_module: 
    start_sparsity: 0.0
    target_sparsity: 0.5
    pruning_modules: ["head", "head_layer", "mlp", "intermediate"]
    lagrangian_warmup_steps: 5ba 
    target_model:
      d_model: 2048
      n_layers: 24
      n_heads: 16 
      intermediate_size: 5504 
      vocab_size: 32000

# Tokenizer
tokenizer:
  type: hftokenizer
  args:
    tokenizer_name: ${tokenizer_name}
    max_seq_len: ${max_seq_len}

# Dataloaders
train_loader:
  name: text
  dataset:
    local: ${data_local}
    remote: ${data_remote}
    split: github
    shuffle: true
    tokenizer_name: ${tokenizer_name}
    max_seq_len: ${max_seq_len}
    shuffle_seed: ${global_seed}
    is_uint16: true
  drop_last: true
  num_workers: 8

eval_loader:
  name: text
  dataset:
    local: ${data_local}
    remote: ${data_remote}
    split: eval_merge 
    shuffle: false 
    tokenizer_name: ${tokenizer_name}
    max_seq_len: ${max_seq_len}
    shuffle_seed: ${global_seed}
    is_uint16: true
  drop_last: false
  num_workers: 8

# Optimization
scheduler:
  name: cosine_with_warmup
  t_warmup: 100ba
  alpha_f: 0.1

optimizer:
  name: decoupled_adamw
  lr: 1e-4
  betas:
  - 0.9
  - 0.95
  eps: 1.0e-08
  weight_decay: 0.0
  lag_lr: 1.0

algorithms:
  gradient_clipping:
    clipping_type: norm
    clipping_threshold: 1.0

max_duration: 800ba  
eval_interval: 200ba
eval_subset_num_batches: 100
global_train_batch_size: 8

# System
seed: ${global_seed}
device_eval_batch_size: 8
device_train_microbatch_size: 4
precision: amp_bf16

# FSDP
fsdp_config:
  sharding_strategy: FULL_SHARD
  mixed_precision: DEFAULT
  activation_checkpointing: true
  activation_cpu_offload: false
  verbose: false

# Logging
progress_bar: false
log_to_console: true
console_log_interval: 1ba

callbacks:
  speed_monitor:
    window_size: 10
  memory_monitor: {}
  lr_monitor: {}
  data_loading:
    dynamic: false
    update_type: doremi
    proportion: [0.67,0.045,0.045,0.02,0.045,0.025,0.15]
    set_names: [cc,github,book,stackexchange,wiki,arxiv,c4-rp]
    target_loss: [1.8712,0.6883,2.0325,1.5353,1.6297,1.3560,2.0328]


loggers:
  wandb: 
    project: LLM-Prune
    entity: 
    name: ${run_name}
    init_kwargs:
      mode: online
      dir: wandb_dir

# Checkpoint to local filesystem or remote object store
save_interval: 100ba 
save_folder: save_dir 
autoresume: false
python_log_level: DEBUG
save_overwrite: true
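For reference, this is the arithmetic I am assuming for the target shapes (my own sketch, not from the repo): the target_model block above should translate into the following HF LLaMA tensor shapes after conversion.

```python
# My own arithmetic (not from the repo): HF LLaMA tensor shapes implied by the
# target_model block above, i.e. what I expect the converted checkpoint to hold.
d_model, n_heads, n_layers = 2048, 16, 24
intermediate_size, vocab_size = 5504, 32000
head_dim = d_model // n_heads  # 128

expected_shapes = {
    "model.embed_tokens.weight": (vocab_size, d_model),                       # (32000, 2048)
    "model.layers.0.self_attn.q_proj.weight": (n_heads * head_dim, d_model),  # (2048, 2048)
    "model.layers.0.self_attn.k_proj.weight": (n_heads * head_dim, d_model),  # (2048, 2048)
    "model.layers.0.mlp.gate_proj.weight": (intermediate_size, d_model),      # (5504, 2048)
    "lm_head.weight": (vocab_size, d_model),                                  # (32000, 2048)
}
for name, shape in expected_shapes.items():
    print(name, shape)
```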

Note that I only use the github proportion here, to check whether the pipeline completes. After pruning, I convert the model with these scripts:

MODEL_PATH=save_dir/latest-rank0.pt
python3 -m llmshearing.utils.post_pruning_processing prune_and_save_model $MODEL_PATH
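Before converting, I dump the tensor shapes in the pruned composer checkpoint to check whether the hidden dimension was actually reduced. A rough sketch of what I run (it assumes the file is a plain torch pickle with the weights nested under state -> model; the exact layout may differ):

```python
# Rough diagnostic sketch (assumes the pruned checkpoint is a plain torch pickle;
# the exact key layout may differ between versions).
import torch

ckpt = torch.load("save_dir/pruned-latest-rank0.pt", map_location="cpu")
# Composer checkpoints usually nest the weights under state -> model.
state = ckpt.get("state", {}).get("model", ckpt) if isinstance(ckpt, dict) else ckpt

for name, value in state.items():
    if hasattr(value, "shape") and ("embed_tokens" in name or "layers.0." in name):
        print(name, tuple(value.shape))
```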

MODEL_PATH=save_dir/pruned-latest-rank0.pt
OUTPUT_PATH=save_dir/hf-latest_rank0
MODEL_CLASS=LlamaForCausalLM
HIDDEN_SIZE=2048
NUM_ATTENTION_HEADS=16
NUM_HIDDEN_LAYERS=24
INTERMEDIATE_SIZE=5504
MODEL_NAME=Sheared-Llama-1.3B

python3 -m llmshearing.utils.composer_to_hf save_composer_to_hf $MODEL_PATH $OUTPUT_PATH \
        model_class=${MODEL_CLASS} \
        hidden_size=${HIDDEN_SIZE} \
        num_attention_heads=${NUM_ATTENTION_HEADS} \
        num_hidden_layers=${NUM_HIDDEN_LAYERS} \
        intermediate_size=${INTERMEDIATE_SIZE} \
        num_key_value_heads=${NUM_ATTENTION_HEADS} \
        _name_or_path=${MODEL_NAME}

But somehow the output shapes after pruning do not match the target sizes:

### The attention Q, K, V shape mismatch
size mismatch for model.embed_tokens.weight: copying a param with shape torch.Size([32000, 2560]) from checkpoint, the shape in current model is torch.Size([32000, 2048]).
size mismatch for model.layers.0.self_attn.q_proj.weight: copying a param with shape torch.Size([2048, 2560]) from checkpoint, the shape in current model is torch.Size([2048, 2048]).
size mismatch for model.layers.0.self_attn.k_proj.weight: copying a param with shape torch.Size([2048, 2560]) from checkpoint, the shape in current model is torch.Size([2048, 2048]).
...

Is there a problem in the prune_params() method when the l0_module is applied?
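To narrow it down, I also tried listing whatever l0/mask-related state is left in the trainer checkpoint, to see whether the hidden-dimension mask was learned and saved at all. This is only a guess-based substring filter, not the repo's exact key names:

```python
# Hedged follow-up check: list any l0/mask-related entries left in the trainer
# checkpoint (the substrings below are guesses, not the repo's exact names).
import torch

ckpt = torch.load("save_dir/latest-rank0.pt", map_location="cpu")
state = ckpt.get("state", {}).get("model", ckpt) if isinstance(ckpt, dict) else ckpt

for name, value in state.items():
    if "l0" in name.lower() or "z_" in name or "mask" in name.lower():
        shape = tuple(value.shape) if hasattr(value, "shape") else type(value).__name__
        print(name, shape)
```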
