On the reproduction of the 8B-DPO model #588

Open
tianbuwei opened this issue Feb 28, 2025 · 3 comments

Comments

@tianbuwei

tianbuwei commented Feb 28, 2025

Thank you very much for your work. I am trying to reproduce the 8B-DPO model, but my results differ substantially from those reported in your paper. Could you please check whether my training script is correct?

Here is a comparison between our model and the official model:
Image


Here is our training script. We used 4 machines with 32 GPUs in total:

# modify the following `MACHINE_RANK`, `MAIN_PROCESS_IP`,
# `NUM_MACHINES`, `NUM_PROCESSES`, `PER_DEVICE_TRAIN_BATCH_SIZE`,
# `GRADIENT_ACCUMULATION_STEPS` according to your setup
export NCCL_TIMEOUT=0
export NCCL_BLOCKING_WAIT=1
export NCCL_ASYNC_ERROR_HANDLING=1
export NCCL_DEBUG=INFO

MACHINE_RANK=0
MAIN_PROCESS_IP=10.0.26.202
NUM_MACHINES=4
NUM_PROCESSES=32
PER_DEVICE_TRAIN_BATCH_SIZE=1
GRADIENT_ACCUMULATION_STEPS=1

accelerate launch \
    --mixed_precision bf16 \
    --num_machines $NUM_MACHINES \
    --num_processes $NUM_PROCESSES \
    --machine_rank $MACHINE_RANK \
    --main_process_ip $MAIN_PROCESS_IP \
    --main_process_port 29400 \
    --use_deepspeed \
    --deepspeed_config_file configs/ds_configs/stage3_no_offloading_accelerate.conf \
    --deepspeed_multinode_launcher standard open_instruct/dpo_tune_cache.py \
    --model_name_or_path models/Llama-3.1-Tulu-3-8B-SFT \
    --tokenizer_name models/Llama-3.1-Tulu-3-8B-SFT \
    --use_flash_attn \
    --max_seq_length 2048 \
    --preprocessing_num_workers 16 \
    --per_device_train_batch_size $PER_DEVICE_TRAIN_BATCH_SIZE \
    --gradient_accumulation_steps $GRADIENT_ACCUMULATION_STEPS \
    --learning_rate 5e-07 \
    --lr_scheduler_type linear \
    --warmup_ratio 0.1 \
    --weight_decay 0.0 \
    --num_train_epochs 1 \
    --output_dir output/dpo_8b \
    --with_tracking \
    --report_to wandb \
    --logging_steps 1 \
    --model_revision main \
    --gradient_checkpointing \
    --dataset_mixer_list data/llama-3.1-tulu-3-8b-preference-mixture 1.0 \
    --use_slow_tokenizer \
    --use_lora False \
    --dpo_loss_type dpo_norm \
    --dpo_beta 5 \
    --checkpointing_steps epoch \
    --exp_name tulu3-8B-dpo > logs/tulu3-8b-dpo/tulu3-8B-GPU${MACHINE_RANK}.log 2>&1 & 
# For Ai2 internal members, this was the experiment URL: https://beaker.org/ex/01JCSAYYHQYF9QDQDCV6KJ53M9/

Second, I noticed that in the training script you provide for the 8B model, GPUs * gradient_accumulation_steps = 8 * 16 = 128, but the effective batch size given in your paper is 32. Which one should I follow?
This is the script you provide: https://github.com/allenai/open-instruct/blob/main/docs/tulu3.md
Image

This is a screenshot from your paper:

Image

@vwxyzjn
Collaborator

vwxyzjn commented Feb 28, 2025

Thanks for reporting this issue. I double-checked our internal log (the Beaker link) and can confirm that this was the command used for training. The paper should therefore have listed an effective batch size of 128 for the 8B model. Sorry about that!
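
As a quick sanity check on the batch-size arithmetic discussed above, here is a minimal sketch. It assumes the effective batch size is simply processes × per-device batch size × gradient accumulation steps, and that the reference script uses a per-device batch size of 1 (implied by the 8 * 16 = 128 figure quoted in the question). The function name is illustrative and not part of open-instruct.

```python
def effective_batch_size(num_processes: int,
                         per_device_batch_size: int,
                         grad_accum_steps: int) -> int:
    """Number of sequences that contribute to one optimizer step."""
    return num_processes * per_device_batch_size * grad_accum_steps


# Script posted in this issue: 32 processes, batch size 1, no accumulation.
posted = effective_batch_size(32, 1, 1)

# Reference setup as quoted in the question: 8 GPUs, 16 accumulation steps
# (per-device batch size of 1 is an assumption).
reference = effective_batch_size(8, 1, 16)

# Accumulation steps needed on 32 GPUs to match the reference value.
needed_accum = reference // (32 * 1)

print(posted, reference, needed_accum)  # 32 128 4
```

Under these assumptions, the posted 32-GPU run would need GRADIENT_ACCUMULATION_STEPS=4 to reach an effective batch size of 128.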

Image

@vwxyzjn
Collaborator

vwxyzjn commented Feb 28, 2025

I am not sure why the MMLU and MATH flex scores are lower in the picture you shared. We recently re-evaluated the model and found the numbers to be consistent.

Image

Maybe it's the eval setup? We use https://github.com/allenai/olmes for evaluation (though we run an internal fork via the following script):

python oe-eval-internal/oe_eval/launch.py \
--model "$MODEL_NAME" \
--beaker-workspace "ai2/tulu-3-results" \
--beaker-budget ai2/oe-adapt \
--task "$TASK" \
$MODEL_TYPE \
--batch-size "$BATCH_SIZE" \
--model-args "{\"model_path\":\"${MODEL_LOCATION}\", \"max_length\": ${MAX_LENGTH}}" \
--task-args "{ \"generation_kwargs\": { \"max_gen_toks\": ${MAX_LENGTH}, \"truncate_context\": false } }" \
${HF_UPLOAD_ARG} \
--gpus "$GPU_COUNT" \
--gantry-args '{"env-secret": "OPENAI_API_KEY=openai_api_key", "weka": "oe-adapt-default:/weka/oe-adapt-default", "env#132":"VLLM_ALLOW_LONG_MAX_MODEL_LEN=1"}' \
${REVISION_ARG} \
--cluster ai2/neptune-cirrascale,ai2/saturn-cirrascale,ai2/jupiter-cirrascale-2 \
--beaker-retries 2 \
--beaker-priority "$PRIORITY" \
--push-datalake \
--datalake-tags "$DATALAKE_ARGS"
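
The hand-escaped JSON in `--model-args` and `--task-args` is easy to get wrong when adapting the command. A minimal sketch of building the same two argument strings with `json.dumps` is shown below. The path and length values are hypothetical placeholders for `MODEL_LOCATION` and `MAX_LENGTH`, and this is not taken from the internal launch script.

```python
import json
import shlex

# Hypothetical placeholders for MODEL_LOCATION and MAX_LENGTH.
model_location = "/path/to/model"
max_length = 4096

# Same structure as the --model-args value in the command above.
model_args = json.dumps({"model_path": model_location, "max_length": max_length})

# Same structure as the --task-args value in the command above.
task_args = json.dumps({
    "generation_kwargs": {"max_gen_toks": max_length, "truncate_context": False}
})

# shlex.join quotes each argument so it can be pasted into a shell safely.
argv = ["--model-args", model_args, "--task-args", task_args]
print(shlex.join(argv))
```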

Here is what the prompts look like:

Image Image
Prompt:
user: Problem:
Find the domain of the expression  $\frac{\sqrt{x-2}}{\sqrt{5-x}}$.}

Solution:
assistant: The expressions inside each square root must be non-negative. Therefore, $x-2 \ge 0$, so $x\ge2$, and $5 - x \ge 0$, so $x \le 5$. Also, the denominator cannot be equal to zero, so $5-x>0$, which gives $x<5$. Therefore, the domain of the expression is $\boxed{[2,5)}$.
Final Answer: The final answer is $[2,5)$. I hope it is correct.
user: Problem:
If $\det \mathbf{A} = 2$ and $\det \mathbf{B} = 12,$ then find $\det (\mathbf{A} \mathbf{B}).$

Solution:
assistant: We have that $\det (\mathbf{A} \mathbf{B}) = (\det \mathbf{A})(\det \mathbf{B}) = (2)(12) = \boxed{24}.$
Final Answer: The final answer is $24$. I hope it is correct.
user: Problem:
Terrell usually lifts two 20-pound weights 12 times. If he uses two 15-pound weights instead, how many times must Terrell lift them in order to lift the same total weight?

Solution:
assistant: If Terrell lifts two 20-pound weights 12 times, he lifts a total of $2\cdot 12\cdot20=480$ pounds of weight.  If he lifts two 15-pound weights instead for $n$ times, he will lift a total of $2\cdot15\cdot n=30n$ pounds of weight.  Equating this to 480 pounds, we can solve for $n$:
\begin{align*}
30n&=480\\
\Rightarrow\qquad n&=480/30=\boxed{16}
\end{align*}
Final Answer: The final answer is $16$. I hope it is correct.
user: Problem:
If the system of equations

\begin{align*}
6x-4y&=a,\\
6y-9x &=b.
\end{align*}has a solution $(x, y)$ where $x$ and $y$ are both nonzero,
find $\frac{a}{b},$ assuming $b$ is nonzero.

Solution:
assistant: If we multiply the first equation by $-\frac{3}{2}$, we obtain

$$6y-9x=-\frac{3}{2}a.$$Since we also know that $6y-9x=b$, we have

$$-\frac{3}{2}a=b\Rightarrow\frac{a}{b}=\boxed{-\frac{2}{3}}.$$
Final Answer: The final answer is $-\frac{2}{3}$. I hope it is correct.
user: Problem:
Find $x$ such that $\lceil x \rceil + x = \dfrac{23}{7}$. Express $x$ as a common fraction.

Solution:
Completion:
To solve the equation \(\lceil x \rceil + x = \frac{23}{7}\), we need to consider the definition of the ceiling function \(\lceil x \rceil\), which is the smallest integer greater than or equal to \(x\).

Let's denote \(\lceil x \rceil\) as \(n\), where \(n\) is an integer. Therefore, the equation becomes:
\[ n + x = \frac{23}{7} \]

Since \(n = \lceil x \rceil\), \(n\) must be such that \(n - 1 < x \leq n\).

Given \(n + x = \frac{23}{7}\), we can express \(x\) as:
\[ x = \frac{23}{7} - n \]

We need to find an integer \(n\) such that \(n - 1 < \frac{23}{7} - n \leq n\).

First, calculate \(\frac{23}{7}\):
\[ \frac{23}{7} \approx 3.285714 \]

So, \(n\) must satisfy:
\[ n - 1 < 3.285714 - n \leq n \]
\[ 2n - 1 < 3.285714 \]
\[ 2n < 4.285714 \]
\[ n < 2.142857 \]

The integer values for \(n\) that satisfy this inequality are \(n = 2\) and \(n = 3\).

Let's check both:

For \(n = 2\):
\[ x = \frac{23}{7} - 2 = \frac{23 - 14}{7} = \frac{9}{7} \]

For \(n = 3\):
\[ x = \frac{23}{7} - 3 = \frac{23 - 21}{7} = \frac{2}{7} \]

Since \(x\) must be between \(n-1\) and \(n\) (inclusive), and \(\frac{9}{7}\) is closer to 3.285714 than \(\frac{2}{7}\), \(x = \frac{9}{7}\) is the correct solution.

Thus, \(x = \boxed{\frac{9}{7}}\).
Metrics:
{
  "exact_match": 0,
  "exact_match_flex": 1,
  "model_answer": "\\frac{9}{7}"
}
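
The metrics above show `exact_match` at 0 and `exact_match_flex` at 1 for the same completion. One plausible reading, sketched below, is that the strict metric expects the few-shot "Final Answer: The final answer is ..." sentence, while the flexible one also accepts the last `\boxed{...}` expression. This is an assumption made for illustration, not the OLMES implementation, and both helper functions are hypothetical.

```python
import re
from typing import Optional


def strict_answer(completion: str) -> Optional[str]:
    """Answer given in the few-shot 'The final answer is ... I hope it is correct' format."""
    m = re.search(r"The final answer is (.+?)\. I hope it is correct",
                  completion, re.DOTALL)
    return m.group(1).strip().strip("$") if m else None


def last_boxed(completion: str) -> Optional[str]:
    """Contents of the last boxed expression, keeping nested braces balanced."""
    marker = "\\boxed{"
    start = completion.rfind(marker)
    if start == -1:
        return None
    depth = 1
    out = []
    for ch in completion[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(out)
        out.append(ch)
    return None  # unbalanced braces


# Last line of the completion quoted above, and the answer string from the metrics.
completion = "Thus, \\(x = \\boxed{\\frac{9}{7}}\\)."
gold = "\\frac{9}{7}"

exact_match = int(strict_answer(completion) == gold)
exact_match_flex = int(gold in {strict_answer(completion), last_boxed(completion)})
print(exact_match, exact_match_flex)  # 0 1
```

Under this reading, a model that answers correctly but does not reproduce the few-shot closing sentence scores 0 on the strict metric and 1 on the flexible one.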

@vwxyzjn
Collaborator

vwxyzjn commented Feb 28, 2025

Actually, yes, it's a prompt issue. Few-shot prompts with MMLU give lower results (consistent with yours).

Image

DPO (mmlu:cot:summarize)

hf-02072025-allenai/Llama-3.1-Tulu-3-8B-DPO
Prompt:
user: The following are multiple choice questions about abstract algebra. Summarize your reasoning concisely, then conclude with 'Therefore, the answer is: X' where X is one of A, B, C, or D.

Question: Find the degree for the given field extension Q(sqrt(2), sqrt(3), sqrt(18)) over Q.
 A. 0
 B. 4
 C. 2
 D. 6
Completion:
To determine the degree of the field extension \( \mathbb{Q}(\sqrt{2}, \sqrt{3}, \sqrt{18}) \) over \( \mathbb{Q} \), we need to understand how these elements relate to each other and how they extend the base field \( \mathbb{Q} \).

First, simplify the elements in the extension:

- \( \sqrt{18} = \sqrt{9 \cdot 2} = \sqrt{9} \cdot \sqrt{2} = 3\sqrt{2} \).

So, the extension can be rewritten as \( \mathbb{Q}(\sqrt{2}, \sqrt{3}, 3\sqrt{2}) \).

Next, consider the minimal polynomial for each element over \( \mathbb{Q} \):

- The minimal polynomial of \( \sqrt{2} \) over \( \mathbb{Q} \) is \( x^2 - 2 \), which has degree 2.
- The minimal polynomial of \( \sqrt{3} \) over \( \mathbb{Q} \) is \( x^2 - 3 \), which also has degree 2.

Now, consider \( 3\sqrt{2} \). Notice that \( 3\sqrt{2} \) can be expressed in terms of \( \sqrt{2} \) and \( \sqrt{3} \), so we need to check its minimal polynomial. However, since \( 3\sqrt{2} \) is already expressible in terms of \( \sqrt{2} \) and \( \sqrt{3} \), it doesn't introduce a new irreducible polynomial over \( \mathbb{Q}(\sqrt{2}, \sqrt{3}) \).

The degree of the extension \( \mathbb{Q}(\sqrt{2}, \sqrt{3}) \) over \( \mathbb{Q} \) is the sum of the degrees of the minimal polynomials of \( \sqrt{2} \) and \( \sqrt{3} \), which is \( 2 + 2 = 4 \).

Since \( 3\sqrt{2} \) does not introduce a new irreducible polynomial, the degree of the extension \( \mathbb{Q}(\sqrt{2}, \sqrt{3}, \sqrt{18}) \) over \( \mathbb{Q} \) remains 4.

Therefore, the answer is: B
Metrics:
{
  "exact_match_simple": 1,
  "exact_match": 1,
  "num_tokens": 520,
  "answer_format_correct": 0
}

DPO (mmlu:mc:tulu)

hf-02072025-allenai/Llama-3.1-Tulu-3-8B-DPO
Prompt:
user: The following are multiple choice questions (with answers) about abstract algebra.

Question: Find all c in Z_3 such that Z_3[x]/(x^2 + c) is a field.
 A. 0
 B. 1
 C. 2
 D. 3
Answer:
assistant: B
user: Question: Statement 1 | If aH is an element of a factor group, then |aH| divides |a|. Statement 2 | If H and K are subgroups of G then HK is a subgroup of G.
 A. True, True
 B. False, False
 C. True, False
 D. False, True
Answer:
assistant: B
user: Question: Statement 1 | Every element of a group generates a cyclic subgroup of the group. Statement 2 | The symmetric group S_10 has 10 elements.
 A. True, True
 B. False, False
 C. True, False
 D. False, True
Answer:
assistant: C
user: Question: Statement 1| Every function from a finite set onto itself must be one to one. Statement 2 | Every subgroup of an abelian group is abelian.
 A. True, True
 B. False, False
 C. True, False
 D. False, True
Answer:
assistant: A
user: Question: Find the characteristic of the ring 2Z.
 A. 0
 B. 3
 C. 12
 D. 30
Answer:
assistant: A
user: Question: Find the degree for the given field extension Q(sqrt(2), sqrt(3), sqrt(18)) over Q.
 A. 0
 B. 4
 C. 2
 D. 6
Answer:
Completion:
Metrics:
{
  "predicted_index_raw": 1,
  "predicted_index_per_token": 1,
  "predicted_index_per_char": 1,
  "predicted_index_per_byte": 0,
  "predicted_index_uncond": null,
  "correct_choice": 1,
  "acc_raw": 1,
  "acc_per_token": 1,
  "acc_per_char": 1,
  "acc_per_byte": 0,
  "acc_uncond": null,
  "no_answer": 0,
  "sum_logits_corr": -11.745338439941406,
  "logits_per_token_corr": -11.745338439941406,
  "logits_per_char_corr": -11.745338439941406,
  "logits_per_byte_corr": 16.944941520878153
}
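
A note on reading these metrics: `logits_per_byte_corr` is positive while the other `*_corr` values are negative. The numbers are consistent with the per-byte field being reported in bits per byte, i.e. the negated sum of logits divided by ln 2 times the byte count. The sketch below checks this against the values above. The byte count of 1 is an assumption, suggested by the raw, per-token, and per-char values being identical.

```python
import math

# Values copied from the metrics above.
sum_logits_corr = -11.745338439941406
logits_per_byte_corr = 16.944941520878153

# Assumption: the scored continuation is counted as a single byte.
num_bytes = 1

# Convert a natural-log likelihood into bits per byte (lower is better).
bits_per_byte = -sum_logits_corr / (math.log(2) * num_bytes)

print(bits_per_byte)
print(math.isclose(bits_per_byte, logits_per_byte_corr, rel_tol=1e-6))
```

If that reading is right, the per-byte field has the opposite orientation to the others (lower is better), which may be worth keeping in mind when comparing `acc_per_byte` with the other accuracy variants.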
