Warmstart infrastructure switch #254

Merged
merged 93 commits on Sep 17, 2024
Changes from all commits (93 commits)
2e0fedf
refactor: introduced ResultItem to EvaluationResultBatch
le1nux Aug 29, 2024
fc9d7b5
fix: fixed max_length warning in tokenizer
le1nux Aug 29, 2024
3b7d74d
refactor: removed excessive print statements
le1nux Aug 29, 2024
cf598a0
refactor: added ResultItem to other components
le1nux Aug 29, 2024
c0bcad4
chore: removed more print statements
le1nux Aug 29, 2024
99bd571
refactor: removed unused parameter from IndexGenerator constructor
le1nux Aug 29, 2024
431766a
feat: added configs for demo
le1nux Aug 31, 2024
335c783
feat: added demo diagrams
le1nux Aug 31, 2024
1867232
feat: added tokenizer config
le1nux Aug 31, 2024
8d2f0b2
feat: added demo jupyter notebook
le1nux Aug 31, 2024
0dab792
feat: added img
le1nux Aug 31, 2024
3dd1f7b
refactor: more demo adaptations
le1nux Sep 2, 2024
6398d62
chore: added banner
le1nux Sep 4, 2024
70edf9d
chore: moved the diagrams to new tutorial
le1nux Sep 6, 2024
e6c5a59
feat: added notebooks disclaimer
le1nux Sep 6, 2024
1f83d68
feat: added tokenizer and training config
le1nux Sep 6, 2024
fce1daa
feat: added getting started jupyter notebook
le1nux Sep 6, 2024
fce1d5f
refactor: updated modalities demo
le1nux Sep 6, 2024
1899bfd
feat: added tokenizer configs for tutorial
le1nux Sep 6, 2024
cf2f7db
feat: added wandb_storage to gitignore
le1nux Sep 6, 2024
4fce366
chore: renamed tutorial folder
le1nux Sep 6, 2024
2f80136
chore: removed old debug print statements
le1nux Sep 6, 2024
9281405
chore: Merge branch 'main' into live_demo
le1nux Sep 6, 2024
42e7b1c
fix: removed the max_length tag in huggingface tokenizer. Setting it t…
le1nux Sep 6, 2024
5af079a
fix: fixed failing warmstart test
le1nux Sep 6, 2024
226719c
Update src/modalities/config/component_factory.py
le1nux Sep 8, 2024
7f3f8fa
feat: added optional rounding for metrics
le1nux Sep 8, 2024
61dba25
refactor: lr now logged with full precision
le1nux Sep 8, 2024
fe6e38a
feat: added evaluator logging
le1nux Sep 8, 2024
78fd763
refactor: added logging of number of parameters again
le1nux Sep 8, 2024
d44049a
chore: added gitkeep files
le1nux Sep 8, 2024
39db2bf
refactor: added huggingface dataset download to modalities_in_15_mins…
mali-git Sep 8, 2024
7462198
chore: minor corrections in README.md
flxst Sep 9, 2024
0b9a436
chore: merge sections usage and entry points in README.md
flxst Sep 9, 2024
cefb910
chore: change order of sections in README.md
flxst Sep 9, 2024
5ba0846
refactor: dataloaders are now never shuffled. Samplers do the shuffli…
le1nux Sep 9, 2024
a9812f3
feat: added more number conversion functions
le1nux Sep 9, 2024
dace200
Merge pull request #251 from Modalities/readme_updates
le1nux Sep 10, 2024
5d29535
refactor: moved activation checkpointing to FSDP model factory
le1nux Sep 10, 2024
576086e
refactor: refactored the instantiation model s.t. it separates traini…
le1nux Sep 10, 2024
6d550b6
feat: added train progress class
le1nux Sep 10, 2024
291f557
feat: introduced ActivationCheckpointedModel to allow for checkpointi…
le1nux Sep 11, 2024
bcc20f3
refactor: BatchProgressSubscriber now gets the number of train steps …
le1nux Sep 11, 2024
872d4a0
refactor: calling BatchProgress only Progress from now on
le1nux Sep 11, 2024
6061220
refactor: refactored warmstart functionality in __main__.py
le1nux Sep 11, 2024
939dd3f
refactor: implemented checkpointing based on TrainingProgress instead …
le1nux Sep 11, 2024
0ced8b4
feat: added further number conversion functions
le1nux Sep 11, 2024
b4ce789
feat: added pydantic type for FSDP wrapped model
le1nux Sep 11, 2024
bb91e9e
refactor: refactored the trainer to work with the new TrainingProgres…
le1nux Sep 11, 2024
1ad1fb0
refactor: introduced a clean separation of training and warmstart set…
le1nux Sep 11, 2024
a70c7c1
fix: fixed dataloader iteration (needed num batches not num steps)
le1nux Sep 11, 2024
ac1171f
fix: repaired number conversion tests
le1nux Sep 11, 2024
4fbcf19
fix: fixed bug FSDPCheckpointSaving._get_paths_to_delete and respecti…
le1nux Sep 11, 2024
a3f4f6d
fix: fixed failing test_checkpoint_strategy_k
le1nux Sep 11, 2024
ddae58a
refactor: improved the settings configuration
le1nux Sep 12, 2024
7db236d
feat: added NumberConversion get_num_tokens_from_packed_mem_map_datas…
le1nux Sep 12, 2024
d87bcc1
fix: fixed all failing unit tests
le1nux Sep 12, 2024
e8e5d76
refactor: refactored config lorem ipsum
le1nux Sep 12, 2024
3d9c0a1
fix: fixed two failing multi-gpu tests
le1nux Sep 12, 2024
a77f932
refactor: removed get_num_tokens_from_num_steps_callable from checkpo…
le1nux Sep 12, 2024
372f34a
fix: fixed configs for other multi-gpu tests
le1nux Sep 12, 2024
97a8ae7
refactor: removed NumberConversion function get_num_tokens_from_num_s…
le1nux Sep 12, 2024
1bebb61
feat: added test for activation checkpointing
le1nux Sep 13, 2024
24dfc75
feat: added debugger function for testing distributed, multi-gpu tests
le1nux Sep 13, 2024
f831d8a
chore: add debugpy dependency
flxst Sep 13, 2024
9f04f8a
fix: getting started example config
flxst Sep 13, 2024
00031df
feat: added missing number conversion tests
le1nux Sep 13, 2024
2544a21
chore: Merge branch 'warmstart_infrastructure_switch' of github.com:M…
le1nux Sep 13, 2024
d065456
feat: added NumberConversion get_num_steps_from_raw_dataset_index
le1nux Sep 14, 2024
e21f64e
feat: introduced get_raw_index in DatasetFactory
le1nux Sep 14, 2024
d105618
chore: minor print fix
le1nux Sep 14, 2024
77bdada
refactor: refactored the library usage example
le1nux Sep 14, 2024
cf638bf
refactor: adapted more configs to the new settings design
le1nux Sep 14, 2024
7f9bd09
refactor: reduced the coca_config_initialization.yaml
le1nux Sep 14, 2024
b783810
Merge pull request #239 from Modalities/live_demo
le1nux Sep 14, 2024
5b1e7e9
feat: added TrainingReportGenerator
le1nux Sep 15, 2024
77e0c10
refactor: adapted the modalities_in_15_mins config to latest changes
le1nux Sep 15, 2024
3ee086c
feat: added consistency checks for remaining steps
le1nux Sep 15, 2024
d5885bc
fix: fixed coca example config
le1nux Sep 15, 2024
522694d
feat: added information on missed out tokens percentages
le1nux Sep 15, 2024
55db9a2
feat: added warmstart tutorial
le1nux Sep 15, 2024
02b09f2
feat: updated components.md
le1nux Sep 15, 2024
d183fdd
chore: removed unnecessary math.ceil call
le1nux Sep 16, 2024
be0d424
chore: Merge branch 'main' into warmstart_infrastructure_switch
le1nux Sep 16, 2024
23bb463
chore: added short description for modalities in 15mins tutorial to R…
flxst Sep 16, 2024
e71f537
feat: added README to getting started tutorial
le1nux Sep 16, 2024
4756f39
chore: further shortened path explanations in jupyter notebook
le1nux Sep 16, 2024
c9c1ce4
Update README.md
le1nux Sep 16, 2024
658d0d0
refactor: renamed examples to tutorials
le1nux Sep 16, 2024
f70e7b1
chore: fixed typo in variable name
le1nux Sep 16, 2024
f7923f7
Update README.md
le1nux Sep 16, 2024
17b7a0c
refactor: consistent usage of progress_subscriber name
le1nux Sep 16, 2024
9a3ff8c
chore: minor config renaming
le1nux Sep 16, 2024
2 changes: 1 addition & 1 deletion .gitignore
@@ -160,5 +160,5 @@ pyenv*
noteboks/*

tests/tmp/*
*wandb_storage*
.coverage/*
wandb_storage/
182 changes: 90 additions & 92 deletions README.md

Large diffs are not rendered by default.

114 changes: 65 additions & 49 deletions config_files/training/config_example_coca.yaml
@@ -4,27 +4,53 @@ settings:
referencing_keys:
sample_key: input_ids
target_key: target_ids
training:
training_log_interval_in_steps: 2
checkpointing_interval_in_steps: 2
evaluation_interval_in_steps: 2
global_num_seen_tokens: 0
activation_checkpointing_modules: []
gradient_acc_steps: 1
local_train_micro_batch_size: 3
sequence_length: 256
prediction_key: logits
cuda_env:
local_rank: ${cuda_env:LOCAL_RANK}
global_rank: ${cuda_env:RANK}
world_size: ${cuda_env:WORLD_SIZE}
paths:
checkpointing_path: data/checkpoints

tokenizer:
component_key: tokenizer
variant_key: gpt2_tokenizer_fast
config:
tokenizer_file: data/tokenizer/tokenizer_gpt2.json
checkpoint_saving_path: data/checkpoints
train_dataset_path: ./data/lorem_ipsum.pbin
intervals:
training_log_interval_in_steps: 2
checkpointing_interval_in_steps: 2
evaluation_interval_in_steps: 2
consistency_enforcement:
enforce_tokens_per_step_consistency: true
enforce_last_step_logged: false
enforce_last_step_evaluated: false
enforce_last_step_checkpointed: false
step_profile:
gradient_accumulation_steps: 1
local_train_micro_batch_size: 1
sequence_length: 256
training_target:
num_target_tokens:
component_key: number_conversion
variant_key: num_tokens_from_num_steps
config:
num_steps: ${settings.training_target.num_target_steps}
num_ranks: ${settings.cuda_env.world_size}
local_micro_batch_size: ${settings.step_profile.local_train_micro_batch_size}
sequence_length: ${settings.step_profile.sequence_length}
gradient_accumulation_steps: ${settings.step_profile.gradient_accumulation_steps}
num_target_steps: # for the batch progress subscriber
component_key: number_conversion
variant_key: num_steps_from_num_samples
config:
num_ranks: ${settings.cuda_env.world_size}
local_micro_batch_size: ${settings.step_profile.local_train_micro_batch_size}
global_num_samples: ${settings.coca_example_settings.train_num_samples}
gradient_accumulation_steps: ${settings.step_profile.gradient_accumulation_steps}
training_progress:
global_num_seen_tokens: 0
num_seen_steps: 0
local_num_seen_batches: 0
last_step: -1
coca_example_settings:
train_num_samples: 64
val_num_samples: 32

collate_fn:
component_key: collate_fn
@@ -41,7 +67,7 @@ train_dataset:
component_key: dataset
variant_key: dummy_dataset
config:
num_samples: 64
num_samples: ${settings.coca_example_settings.train_num_samples}
sample_definition:
- sample_key: images
sample_shape: [3, 224, 224]
@@ -54,7 +80,7 @@ val_dataset:
component_key: dataset
variant_key: dummy_dataset
config:
num_samples: 32
num_samples: ${settings.coca_example_settings.val_num_samples}
sample_definition:
- sample_key: images
sample_shape: [3, 224, 224]
@@ -69,23 +95,26 @@ train_dataloader:
config:
num_workers: 2
pin_memory: true
shuffle: false
dataloader_tag: "train"
dataloader_tag: train
skip_num_batches: ${settings.training_progress.local_num_seen_batches}
dataset:
instance_key: train_dataset
pass_type: BY_REFERENCE
batch_sampler:
component_key: batch_sampler
variant_key: default
config:
batch_size: ${settings.training.local_train_micro_batch_size}
batch_size: ${settings.step_profile.local_train_micro_batch_size}
drop_last: true
sampler:
component_key: sampler
variant_key: distributed_sampler
config:
rank: ${settings.cuda_env.global_rank}
num_replicas: ${settings.cuda_env.world_size}
shuffle: true
drop_last: true
seed: 42
dataset:
instance_key: train_dataset
pass_type: BY_REFERENCE
@@ -99,23 +128,25 @@ val_dataloader:
config:
num_workers: 2
pin_memory: true
shuffle: false
dataloader_tag: "val"
dataloader_tag: val
dataset:
instance_key: val_dataset
pass_type: BY_REFERENCE
batch_sampler:
component_key: batch_sampler
variant_key: default
config:
batch_size: ${settings.training.local_train_micro_batch_size}
batch_size: ${settings.step_profile.local_train_micro_batch_size}
drop_last: true

sampler:
component_key: sampler
variant_key: distributed_sampler
config:
rank: ${settings.cuda_env.global_rank}
num_replicas: ${settings.cuda_env.world_size}
shuffle: false
drop_last: true
dataset:
instance_key: train_dataset
pass_type: BY_REFERENCE
@@ -140,22 +171,16 @@ checkpoint_saving:
component_key: checkpoint_saving_execution
variant_key: fsdp
config:
checkpoint_path: ${settings.paths.checkpointing_path}
checkpoint_path: ${settings.paths.checkpoint_saving_path}
global_rank: ${settings.cuda_env.global_rank}
experiment_id: ${settings.experiment_id}
get_num_tokens_from_num_steps_callable:
component_key: number_conversion
variant_key: num_tokens_from_num_steps_callable
config:
num_ranks: ${settings.cuda_env.world_size}
local_micro_batch_size: ${settings.training.local_train_micro_batch_size}
sequence_length: ${settings.training.sequence_length}

loss_fn:
component_key: loss
variant_key: clm_cross_entropy_loss
config:
target_key: ${settings.referencing_keys.target_key}
prediction_key: logits
prediction_key: ${settings.referencing_keys.prediction_key}

wrapped_model:
component_key: model
@@ -169,7 +194,7 @@ wrapped_model:
sharding_strategy: FULL_SHARD
block_names: [TransformerBlock, VisionTransformerBlock]

model:
model:
component_key: model
variant_key: model_initialized
config:
@@ -241,9 +266,10 @@ scheduler:
max_lr: 6e-4
div_factor: 10
final_div_factor: 1
total_steps: 64
total_steps: ${settings.training_target.num_target_steps}
pct_start: 0.01
anneal_strategy: cos
last_epoch: ${settings.training_progress.last_step}

optimizer:
component_key: optimizer
@@ -267,24 +293,14 @@ gradient_clipper:
pass_type: BY_REFERENCE
norm_type: P2_NORM


batch_progress_subscriber:
progress_subscriber:
component_key: progress_subscriber
variant_key: rich
config:
global_rank: ${settings.cuda_env.global_rank}
global_num_seen_steps:
component_key: number_conversion
variant_key: num_steps_from_num_tokens
config:
num_ranks: ${settings.cuda_env.world_size}
local_micro_batch_size: ${settings.training.local_train_micro_batch_size}
global_num_tokens: ${settings.training.global_num_seen_tokens}
sequence_length: ${settings.training.sequence_length}
gradient_acc_steps: ${settings.training.gradient_acc_steps}
train_dataloader:
instance_key: train_dataloader
pass_type: BY_REFERENCE
num_seen_steps: ${settings.training_progress.num_seen_steps}
num_target_steps: ${settings.training_target.num_target_steps}
train_dataloader_tag: ${train_dataloader.config.dataloader_tag}
eval_dataloaders:
instance_key: eval_dataloaders
pass_type: BY_REFERENCE
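
The new `settings.training_target` block wires two `number_conversion` components together: `num_target_steps` is derived from the number of training samples, and `num_target_tokens` from the number of target steps. For reference, below is a minimal sketch of the arithmetic these conversions imply, reconstructed only from the fields visible in this diff; the function names and the assumed world size of 2 are illustrative, not the library's actual API.

```python
# Illustrative sketch of the number conversions referenced in settings.training_target.
# Function names are placeholders; the real NumberConversion components live in modalities.

def num_steps_from_num_samples(global_num_samples: int, num_ranks: int,
                               local_micro_batch_size: int,
                               gradient_accumulation_steps: int) -> int:
    # Incomplete trailing batches are dropped (drop_last: true in the batch sampler).
    return global_num_samples // (num_ranks * local_micro_batch_size
                                  * gradient_accumulation_steps)

def num_tokens_from_num_steps(num_steps: int, num_ranks: int, local_micro_batch_size: int,
                              sequence_length: int, gradient_accumulation_steps: int) -> int:
    # One optimizer step consumes one micro batch per rank per gradient accumulation step.
    return (num_steps * num_ranks * local_micro_batch_size
            * sequence_length * gradient_accumulation_steps)

# With the values from this config (train_num_samples=64, micro batch size 1,
# sequence length 256, no gradient accumulation) and an assumed world_size of 2:
steps = num_steps_from_num_samples(64, 2, 1, 1)           # -> 32 target steps
tokens = num_tokens_from_num_steps(steps, 2, 1, 256, 1)   # -> 16384 target tokens
```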