Commit

Merge pull request #65 from David-Berghaus/Fix-typos
Fixed typos
le1nux committed Mar 4, 2024
2 parents 8ab29d0 + d192331 commit 419fc9e
Showing 2 changed files with 8 additions and 9 deletions.
8 changes: 4 additions & 4 deletions README.md
@@ -4,7 +4,7 @@

# Getting started
For training and evaluation a model, feel free to checkout [this](https://github.com/Modalities/modalities/blob/main/examples/getting_started/getting_started_example.md) getting started tutorial, in which we train a small, 60M-parameter GPT model on a tiny subset of the Redpajama V2 dataset.
-Also, see our WIki and API reference documentation: https://modalities.github.io/modalities/
+Also, see our Wiki and API reference documentation: https://modalities.github.io/modalities/

# Installation

@@ -20,7 +20,7 @@ then, install the repository via
pip install -e .
```

-If you want to contribute, have look at `CONTRIBUTING.md`.
+If you want to contribute, have a look at `CONTRIBUTING.md`.



@@ -57,7 +57,7 @@ Or, if you are a VsCode user, add this to your `launch.json`:

# Pydantic and ClassResolver

-The mechanismn introduced to instantiate classes via `type_hint` in the `config.yaml`, utilizes
+The mechanism introduced to instantiate classes via `type_hint` in the `config.yaml`, utilizes
1) Omegaconf to load the config yaml file
2) Pydantic for the validation of the config
3) ClassResolver to instantiate the correct, concrete class of a class hierarchy.
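
For readers of this hunk, the `type_hint` idea in the list above can be illustrated with a small plain-Python stand-in. This is only a sketch: it does not use Omegaconf, Pydantic, or ClassResolver, and the names `StepLR`, `ConstantLR`, `SCHEDULER_REGISTRY`, and `build_from_config` are hypothetical, chosen to echo the names in the README rather than to reproduce the repository's API.

```python
from dataclasses import dataclass


@dataclass
class StepLR:
    # Hypothetical scheduler settings, used only for illustration.
    step_size: int
    gamma: float = 0.1


@dataclass
class ConstantLR:
    factor: float = 1.0


# Hypothetical registry mapping a `type_hint` string to a concrete class.
SCHEDULER_REGISTRY = {"StepLR": StepLR, "ConstantLR": ConstantLR}


def build_from_config(component_config, registry, **extra_kwargs):
    """Look up the class named by `type_hint` and instantiate it.

    `component_config` is a plain dict shaped like
    {"type_hint": "<name>", "config": {...}}; `extra_kwargs` stands in
    for dependencies that are not part of the config file.
    """
    type_hint = component_config["type_hint"]
    if type_hint not in registry:
        raise KeyError(f"Unknown type_hint: {type_hint!r}")
    cls = registry[type_hint]
    return cls(**component_config.get("config", {}), **extra_kwargs)


scheduler = build_from_config(
    {"type_hint": "StepLR", "config": {"step_size": 10}},
    SCHEDULER_REGISTRY,
)
```

In the real mechanism, validation of the `config` block is handled by Pydantic models and the lookup by ClassResolver; the dictionary lookup here only mirrors the overall shape.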
@@ -117,7 +117,7 @@ class SchedulerConfig(BaseModel):
config: StepLRConfig | ConstantLRConfig
```

-To allow a user-friendly instantiation, all class resolvers are defined in the `ResolverRegistry` and `build_component_by_config` as convenience function is introduced. Dependecies can be passed-through with the `extra_kwargs` argument:
+To allow a user-friendly instantiation, all class resolvers are defined in the `ResolverRegistry` and `build_component_by_config` as convenience function is introduced. Dependencies can be passed-through with the `extra_kwargs` argument:
```python
resolvers = ResolverRegister(config=config)
optimizer = ... # our example dependency
9 changes: 4 additions & 5 deletions examples/getting_started/README.md
@@ -11,7 +11,7 @@ As a reference, this example has the following folder structure. Folders in <> w
├── example_config.yaml
├── data
│ ├── mem_map
-│ └<preprocessed dataset files>
+├── <preprocessed dataset files>
│ └── raw
│ ├── redpajama_v2_samples_512_test.jsonl
│ └── redpajama_v2_samples_512_train.jsonl
@@ -23,8 +23,7 @@ As a reference, this example has the following folder structure. Folders in <> w
```

## 1. Preprocessing
-A single line of the Redpajama V2 JSONL file has the structure denoted below. Since we are not interested in the meta data and quality signals for this minimal example, we consider the `raw_content` from each line without any filtering.
-for model training.
+A single line of the Redpajama V2 JSONL file has the structure denoted below. Since we are not interested in the meta data and quality signals for this minimal example, we consider the `raw_content` from each line without any filtering for model training.
```json
{
"raw_content":"Archivio Tag: 25 aprile\nSupermercati aperti 25 aprile 2019: centri commerciali e negozi a Roma, Milano, Napoli e Torino\nNell\u2019articolo odierno troverete tutte le informazioni utili su quali saranno i supermercati e le attivit\u00e0 commerciali che resteranno aperti in occasione...\nAuguri di Buon 25 Aprile 2017: frasi e pensieri originali sulla Festa della Liberazione",
@@ -49,9 +48,9 @@ modalities create_memmap_index --index_path data/mem_map/redpajama_v2_samples_51
modalities create_memmap_index --index_path data/mem_map/redpajama_v2_samples_512_test.idx \
data/raw/redpajama_v2_samples_512_test.jsonl
```
-In this step, we read the JSON file as a binary file, iterate over all characters und build up the sample index (char-wisestart and end position for each JSON sample)
+In this step, we read the JSON file as a binary file, iterate over all characters and build up the sample index (char-wise start and end position for each JSON sample)
as determined by the `\n` character positions. The sample index is stored in the specified `index_path`. Internally, the `create_memmap_index` command
-instantiates and calls the the [IndexGenerator](https://github.com/Modalities/modalities/blob/main/src/modalities/dataloader/create_index.py#L14).
+instantiates and calls the [IndexGenerator](https://github.com/Modalities/modalities/blob/main/src/modalities/dataloader/create_index.py#L14).

After having determined the index, we create the packed dataset as described below by leveraging the tokenizer, jsonl file and the created index.
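
The index-building step described in this hunk (read the JSONL file as bytes, record a start and end position per sample at each `\n`) can be sketched roughly as follows. This is an illustrative stand-in, not the `IndexGenerator` from the repository: the function names `build_line_index` and `read_sample`, the `(start, end)` tuple layout with an exclusive end, and the demo file contents are all assumptions.

```python
import json
import os
import tempfile


def build_line_index(jsonl_path):
    """Return (start, end) byte positions for every non-empty line.

    `end` is exclusive and points at the newline that terminates the line
    (or at the end of the file if the last line has no trailing newline).
    """
    with open(jsonl_path, "rb") as f:
        data = f.read()

    index = []
    start = 0
    for pos, byte in enumerate(data):
        if byte == ord("\n"):
            if pos > start:  # skip empty lines
                index.append((start, pos))
            start = pos + 1
    if start < len(data):  # last line without a trailing newline
        index.append((start, len(data)))
    return index


def read_sample(jsonl_path, span):
    """Read a single sample's raw bytes using its (start, end) span."""
    start, end = span
    with open(jsonl_path, "rb") as f:
        f.seek(start)
        return f.read(end - start)


# Demo on a throwaway two-line JSONL file (contents are made up).
with tempfile.NamedTemporaryFile("wb", suffix=".jsonl", delete=False) as tmp:
    tmp.write(b'{"raw_content": "first"}\n{"raw_content": "second"}\n')
    demo_path = tmp.name

demo_index = build_line_index(demo_path)
demo_docs = [
    json.loads(read_sample(demo_path, span))["raw_content"]
    for span in demo_index
]
os.remove(demo_path)
```

Storing byte positions rather than parsed samples is what makes later random access cheap: a sample can be fetched with a single `seek` and `read` instead of rescanning the file.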

