diff --git a/README.md b/README.md
index 4bcc0998..50df1de5 100644
--- a/README.md
+++ b/README.md
@@ -4,7 +4,7 @@
 # Getting started
 For training and evaluation a model, feel free to checkout [this](https://github.com/Modalities/modalities/blob/main/examples/getting_started/getting_started_example.md) getting started tutorial, in which we train a small, 60M-parameter GPT model on a tiny subset of the Redpajama V2 dataset.
-Documentation: https://modalities.github.io/modalities/
+Also, see our Wiki and API reference documentation: https://modalities.github.io/modalities/
 # Installation
@@ -20,7 +20,7 @@ then, install the repository via
 pip install -e .
 ```
-If you want to contribute, have look at `CONTRIBUTING.md`.
+If you want to contribute, have a look at `CONTRIBUTING.md`.
@@ -57,7 +57,7 @@ Or, if you are a VsCode user, add this to your `launch.json`:
 # Pydantic and ClassResolver
-The mechanismn introduced to instantiate classes via `type_hint` in the `config.yaml`, utilizes
+The mechanism introduced to instantiate classes via `type_hint` in the `config.yaml`, utilizes
 1) Omegaconf to load the config yaml file
 2) Pydantic for the validation of the config
 3) ClassResolver to instantiate the correct, concrete class of a class hierarchy.
@@ -117,7 +117,7 @@ class SchedulerConfig(BaseModel):
     config: StepLRConfig | ConstantLRConfig
 ```
-To allow a user-friendly instantiation, all class resolvers are defined in the `ResolverRegistry` and `build_component_by_config` as convenience function is introduced. Dependecies can be passed-through with the `extra_kwargs` argument:
+To allow a user-friendly instantiation, all class resolvers are defined in the `ResolverRegistry` and `build_component_by_config` as convenience function is introduced. Dependencies can be passed-through with the `extra_kwargs` argument:
 ```python
 resolvers = ResolverRegister(config=config)
 optimizer = ...  # our example dependency
diff --git a/examples/getting_started/README.md b/examples/getting_started/README.md
index 3d56aac8..c0b0339f 100644
--- a/examples/getting_started/README.md
+++ b/examples/getting_started/README.md
@@ -11,7 +11,7 @@ As a reference, this example has the following folder structure. Folders in <> w
     ├── example_config.yaml
     ├── data
     │   ├── mem_map
-    │   │   └
+    │   ├──
     │   └── raw
     │       ├── redpajama_v2_samples_512_test.jsonl
     │       └── redpajama_v2_samples_512_train.jsonl
@@ -23,8 +23,7 @@ As a reference, this example has the following folder structure. Folders in <> w
 ```
 ## 1. Preprocessing
-A single line of the Redpajama V2 JSONL file has the structure denoted below. Since we are not interested in the meta data and quality signals for this minimal example, we consider the `raw_content` from each line without any filtering.
-for model training.
+A single line of the Redpajama V2 JSONL file has the structure denoted below. Since we are not interested in the meta data and quality signals for this minimal example, we consider the `raw_content` from each line without any filtering for model training.
 ```json
 {
     "raw_content":"Archivio Tag: 25 aprile\nSupermercati aperti 25 aprile 2019: centri commerciali e negozi a Roma, Milano, Napoli e Torino\nNell\u2019articolo odierno troverete tutte le informazioni utili su quali saranno i supermercati e le attivit\u00e0 commerciali che resteranno aperti in occasione...\nAuguri di Buon 25 Aprile 2017: frasi e pensieri originali sulla Festa della Liberazione",
@@ -49,9 +48,9 @@ modalities create_memmap_index --index_path data/mem_map/redpajama_v2_samples_51
 modalities create_memmap_index --index_path data/mem_map/redpajama_v2_samples_512_test.idx \
                                data/raw/redpajama_v2_samples_512_test.jsonl
 ```
-In this step, we read the JSON file as a binary file, iterate over all characters und build up the sample index (char-wisestart and end position for each JSON sample)
+In this step, we read the JSON file as a binary file, iterate over all characters and build up the sample index (char-wise start and end position for each JSON sample)
 as determined by the `\n` character positions. The sample index is stored in the specified `index_path`. Internally, the `create_memmap_index` command
-instantiates and calls the the [IndexGenerator](https://github.com/Modalities/modalities/blob/main/src/modalities/dataloader/create_index.py#L14).
+instantiates and calls the [IndexGenerator](https://github.com/Modalities/modalities/blob/main/src/modalities/dataloader/create_index.py#L14).
 After having determined the index, we create the packed dataset as described below by leveraging the tokenizer, jsonl file and the created index.
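
For readers of the index-creation paragraph touched in the last hunk, the idea can be illustrated with a minimal sketch. This is not the `IndexGenerator` implementation; the helper names `build_line_index` and `read_sample` are hypothetical, and the on-disk format of the stored index is not covered here. The sketch only assumes what the README text states: the JSONL file is read as binary and sample boundaries are determined by `\n` positions.

```python
import json


def build_line_index(jsonl_path):
    """Hypothetical helper: return (start, end) byte offsets for each non-empty line.

    The file is opened in binary mode, so offsets are byte positions and
    `end` is exclusive (it points at the terminating newline, if any).
    """
    index = []
    offset = 0
    with open(jsonl_path, "rb") as f:
        # Iterating a binary file yields chunks split on b"\n".
        for raw_line in f:
            end = offset + len(raw_line.rstrip(b"\n"))
            if end > offset:  # skip empty lines
                index.append((offset, end))
            offset += len(raw_line)
    return index


def read_sample(jsonl_path, span):
    """Hypothetical helper: load one JSON sample using its (start, end) span."""
    start, end = span
    with open(jsonl_path, "rb") as f:
        f.seek(start)
        return json.loads(f.read(end - start).decode("utf-8"))
```

With such an index, a single sample can be fetched by seeking to its start offset instead of scanning the whole file, which is what makes the later memory-mapped access pattern cheap.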