On the reproduction of the 8B-DPO model #588
I am not sure why there are lower MMLU / MATH flex scores in the screenshot you shared. We recently evaluated the model again and found the numbers to be consistent. Maybe it's the eval setup? We use https://github.com/allenai/olmes for evaluation (though we invoke it via an internal fork, using the following script: open-instruct/scripts/eval/oe-eval.sh, lines 173 to 190 at commit 5ba9f0b).
Here is what the prompts look like: [prompt screenshots]
Actually, yeah, it's a prompt issue. Few-shot prompts with MMLU give lower results (consistent with yours):
DPO (mmlu:cot:summarize)
DPO (mmlu:mc:tulu)
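To make the comparison above concrete, here is a minimal sketch of the two prompt styles. The templates below are illustrative assumptions only; they are not the actual OLMES `mmlu:mc:tulu` or `mmlu:cot:summarize` templates, which live in the olmes repository.

```python
# Sketch of two MMLU prompt styles: few-shot multiple-choice vs.
# chain-of-thought. Exact wording is an assumption, not the OLMES template.

def format_mc(question, choices):
    """Render a question with lettered answer choices."""
    lines = [f"Question: {question}"]
    for letter, choice in zip("ABCD", choices):
        lines.append(f"{letter}. {choice}")
    return "\n".join(lines)

def mc_fewshot_prompt(shots, question, choices):
    """Few-shot MC style: k solved examples, then the target question."""
    parts = [format_mc(q, c) + f"\nAnswer: {a}\n" for q, c, a in shots]
    parts.append(format_mc(question, choices) + "\nAnswer:")
    return "\n".join(parts)

def cot_prompt(question, choices):
    """CoT style: ask the model to reason before committing to a letter."""
    return (format_mc(question, choices)
            + "\nThink step by step, then answer with a single letter.")

shots = [("2 + 2 = ?", ["3", "4", "5", "6"], "B")]
print(mc_fewshot_prompt(shots, "3 * 3 = ?", ["6", "9", "12", "15"]))
print(cot_prompt("3 * 3 = ?", ["6", "9", "12", "15"]))
```

The practical point is that a DPO checkpoint can score quite differently under these two formats, so reproductions should match the evaluation template, not just the benchmark name.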
Thank you very much for your work. I am reproducing the 8B-DPO model, and I found a big difference between my reproduced results and the results in your paper. Could you please help me check whether my training script is correct?
This is the comparison between our model and the official model: [comparison screenshot]
Here is our training script; we used 4 machines with 32 GPUs in total.
Second, I noticed that in the training script you provided for the 8B model, GPUs × gradient_accumulation_steps = 8 × 16 = 128. But the effective batch size given in your paper is 32. Which should I follow?

This is the script you gave:https://github.com/allenai/open-instruct/blob/main/docs/tulu3.md
This is a screenshot from your paper: