Modalities · luzian-hahn · Mar 7, 2024 · Jan 22, 2024 · Jan 29, 2024 · Jan 30, 2024
diff --git a/benchmarks/dataloader/README.md b/benchmarks/dataloader/README.md
@@ -0,0 +1,77 @@
+# Benchmarking of Dataset Implementations
+
+## Motivation
+We want to include a storage-efficient, fast, and generic dataset implementation in this repository.
+Previous work and ideas were based on MegatronLM and its dataset implementation.
+
+Unfortunately, its usage is rather opaque and regularly causes unexpected side effects.
+Those problems are hard to trace, as we are not the original authors of the code.
+
+Therefore, we want to provide our own implementation, which offers all of the above-mentioned benefits.
+Most importantly, it should be at least as fast as MegatronLM's implementation.
+
+
+## Benchmark Overview
+
+We want to evaluate multiple aspects of the dataset implementations:
+* preparation speed - All datasets require some initial steps, such as tokenization and indexing.
+* initialization speed - The time needed to instantiate the respective `Dataset` object in code.
+* iteration speed - The time needed to access all elements of the respective dataset, once sequentially and once in random order.
+
+
+## Used Example Dataset
+
+The experiments were conducted on a small sample of OpenWebText (OWT). The data is provided in `.jsonl` format.
+The relevant content of each line is stored under the `"text"` key and is text-only.
+A dataset with X samples consists of the first X lines of the full OpenWebText data,
+ as it can be obtained from Hugging Face.
+
+
+## Experimental Setup
+
+We relied on the functions provided in `launch_benchmark.sh`. The measurements can be reproduced by calling, e.g.:
+
+```shell
+. launch_benchmark.sh
+
+INPUT_DIR=<path-to-your-example-dataset.jsonl>
+
+echo "MegatronLM:"
+measure_megatronLM_iteration
+echo "Modalities:"
+measure_modalities_iteration
+```
+
+> For launching the preparation of MegatronLM's dataset, refer to:
+> https://github.com/OpenGPTX/opengptx_data/tree/docs/modalities-vs-megatronlm-dl and look at the `launch_benchmark.sh`
+> script.
+
+#### Glossary
+
+* **preparation:** refers here to the task of turning raw data (e.g. JSONL-encoded text) into a binary file,
+  which can be loaded later for training.
+  For MegatronLM, this means tokenizing and packing everything according to its defined format.
+  For Modalities, it means indexing the raw data and packing it afterwards as token ids.
+* **initialization:** refers to the process of initializing a Python object
+  which represents the respective dataset (mostly represented via the `torch.utils.data.Dataset` interface).
+* **iteration:** refers to the process of iterating over the respective datasets - once sequentially and once shuffled.
+
+## Results
+
+
+| Evaluation Aspect    | Implementation |   Required Time    | # Samples in Data |
+|----------------------|----------------|:------------------:|-------------------|
+| preparation speed    | MegatronLM     | `0 min 16.965 sec` | `20000(OWT)`      |
+| preparation speed    | Modalities     | `0 min 13.904 sec` | `20000(OWT)`      |
+| preparation speed    | MegatronLM     | `2 min 11.856 sec` | `200000(OWT)`     |
+| preparation speed    | Modalities     | `0 min 38.738 sec` | `200000(OWT)`     |
+| initialization speed | MegatronLM     |    `19.3 msec`     | `20000(OWT)`      |
+| initialization speed | Modalities     |    `5.85 msec`     | `20000(OWT)`      |
+| initialization speed | MegatronLM     |     `180 msec`     | `200000(OWT)`     |
+| initialization speed | Modalities     |     `58 msec`      | `200000(OWT)`     |
+| iteration speed      | MegatronLM     |    `52.4 msec`     | `20000(OWT)`      |
+| iteration speed      | Modalities     |    `66.8 msec`     | `20000(OWT)`      |
+| iteration speed      | MegatronLM     |     `426 msec`     | `200000(OWT)`     |
+| iteration speed      | Modalities     |     `545 msec`     | `200000(OWT)`     |
+
+
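Reviewer note: the table is easier to compare as ratios. The following is a minimal sketch that only re-uses the timings from the results table above, converted to seconds. The dictionary layout and the helper name `relative_speed` are illustrative assumptions, not part of the repository.

```python
# Timings copied from the results table above, converted to seconds.
# Keys are (evaluation aspect, number of OWT samples).
timings = {
    ("preparation", 20_000): {"MegatronLM": 16.965, "Modalities": 13.904},
    ("preparation", 200_000): {"MegatronLM": 2 * 60 + 11.856, "Modalities": 38.738},
    ("initialization", 20_000): {"MegatronLM": 0.0193, "Modalities": 0.00585},
    ("initialization", 200_000): {"MegatronLM": 0.180, "Modalities": 0.058},
    ("iteration", 20_000): {"MegatronLM": 0.0524, "Modalities": 0.0668},
    ("iteration", 200_000): {"MegatronLM": 0.426, "Modalities": 0.545},
}


def relative_speed(megatron_sec: float, modalities_sec: float) -> float:
    """Hypothetical helper: a value above 1 means Modalities needed less time."""
    return megatron_sec / modalities_sec


ratios = {
    key: relative_speed(value["MegatronLM"], value["Modalities"])
    for key, value in timings.items()
}

for (aspect, num_samples), ratio in ratios.items():
    print(f"{aspect:<15} {num_samples:>7} samples: {ratio:.2f}x")
```

Read this way, preparation and initialization come out at roughly 1.2x to 3.4x in favor of Modalities, while the iteration ratio stays below 1 (about 0.78x) for both dataset sizes, i.e. MegatronLM iterates faster in these measurements.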
diff --git a/benchmarks/dataloader/launch_benchmark.sh b/benchmarks/dataloader/launch_benchmark.sh
@@ -0,0 +1,87 @@
+#!/bin/bash
+
+
+
+INPUT_DIR="/tmp/i-do-not-exist.jsonl"
+
+
+measure_modalities_preparation() {
+    time (
+        set -e
+        test -f "$INPUT_DIR"
+        rm -f "${INPUT_DIR/.jsonl/.idx}"
+        modalities create_memmap_index "$INPUT_DIR" &> /dev/null
+        echo "finished memmap index creation"
+        rm -f "${INPUT_DIR/.jsonl/.pbin}"
+        modalities create_packed_data "$INPUT_DIR" &> /dev/null
+        echo "finished memmap packing"
+    )
+}
+
+
+measure_modalities_initialization() {
+  input_file=${INPUT_DIR/.jsonl/.pbin}
+  python -m timeit -n 50 -r 5 -s "
+import sys, io
+null_device = io.StringIO()
+from modalities.dataloader.dataset import PackedMemMapDatasetMegatron
+from pathlib import Path
+p = Path(\"${input_file}\")
+  " -- "
+sys.stdout = null_device  # deactivate stdout to avoid getting spammed
+PackedMemMapDatasetMegatron(raw_data_path=p, block_size=1024, sample_key=\"sample\")
+sys.stdout = sys.__stdout__  # reactivate stdout for timeit
+"
+}
+
+measure_megatronLM_initialization() {
+  input_file="${INPUT_DIR/.jsonl/.megLM.bin_text_document}"
+  python -m timeit -n 50 -r 5 -s "
+import sys, io
+null_device = io.StringIO()
+from modalities.dataloader.open_gptx_dataset.mmap_dataset import MMapIndexedDataset
+p = \"${input_file}\"
+  " -- "
+sys.stdout = null_device  # deactivate stdout to avoid getting spammed
+MMapIndexedDataset(p)
+sys.stdout = sys.__stdout__  # reactivate stdout for timeit
+"
+}
+
+measure_modalities_iteration() {
+  input_file=${INPUT_DIR/.jsonl/.pbin}
+  python -m timeit -n 5 -r 3 -s "
+import random, sys, io
+null_device = io.StringIO()
+from modalities.dataloader.dataset import PackedMemMapDatasetMegatron
+from pathlib import Path
+p = Path(\"${input_file}\")
+sys.stdout = null_device  # deactivate stdout to avoid getting spammed
+dataset = PackedMemMapDatasetMegatron(raw_data_path=p, block_size=1024, sample_key=\"sample\")
+random_indices = random.sample(range(len(dataset)), len(dataset))
+sys.stdout = sys.__stdout__  # reactivate stdout for timeit
+  " -- "
+list(dataset)  # sequential access
+for i in random_indices:
+  dataset[i]
+"
+}
+
+
+measure_megatronLM_iteration() {
+  input_file="${INPUT_DIR/.jsonl/.megLM.bin_text_document}"
+  python -m timeit -n 5 -r 3 -s "
+import random, sys, io
+null_device = io.StringIO()
+from modalities.dataloader.open_gptx_dataset.mmap_dataset import MMapIndexedDataset
+p = \"${input_file}\"
+sys.stdout = null_device  # deactivate stdout to avoid getting spammed
+dataset = MMapIndexedDataset(p)
+random_indices = random.sample(range(len(dataset)), len(dataset))
+sys.stdout = sys.__stdout__  # reactivate stdout for timeit
+  " -- "
+list(dataset)  # sequential access
+for i in random_indices:
+  dataset[i]
+"
+}
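Reviewer note: the iteration benchmarks above call `python -m timeit -n 5 -r 3`, which runs the statement 5 times per repetition and reports the best of 3 repetitions. A rough sketch of the same measurement through the `timeit` module is shown below. It assumes only that the dataset supports `len()`, iteration and integer indexing; the function name `measure_iteration` and the list-based stand-in dataset are hypothetical and replace the real dataset classes so the snippet runs on its own.

```python
import random
import timeit


def measure_iteration(dataset, number: int = 5, repeat: int = 3) -> float:
    """Return the best per-loop time in seconds for one sequential plus one shuffled pass."""
    # Shuffled index order is prepared up front so it is not part of the timing,
    # mirroring the setup section (-s) of the shell functions.
    random_indices = random.sample(range(len(dataset)), len(dataset))

    def one_pass():
        list(dataset)  # sequential access
        for i in random_indices:  # random access
            dataset[i]

    # timeit.repeat returns the total time of `number` executions for each repetition.
    totals = timeit.repeat(one_pass, number=number, repeat=repeat)
    return min(totals) / number


# Stand-in for a real dataset object (assumption for illustration only).
stand_in_dataset = [list(range(16)) for _ in range(1_000)]
best = measure_iteration(stand_in_dataset)
print(f"5 loops, best of 3: {best * 1e3:.3f} msec per loop")
```

Calling the module directly avoids the nested shell quoting, at the cost of having to import the dataset classes in a separate script.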
diff --git a/src/modalities/__main__.py b/src/modalities/__main__.py
@@ -126,15 +126,30 @@ def entry_point_create_memmap_index(src_path, index_path):
     default=".text",
     help="jq pattern to extract the data from the json line.",
 )
-def entry_point_create_packed_data(src_path, dst_path, index_path, tokenizer_type, tokenizer_file, jq_pattern):
+@click.option(
+    "--num-cpus",
+    type=int,
+    show_default=True,
+    default=os.cpu_count(),
+    help="Specify the number of tokenization workers. Default is the number of available CPUs.",
+)
+def entry_point_create_packed_data(
+    src_path, dst_path, index_path, tokenizer_type, tokenizer_file, jq_pattern, num_cpus
+):
     # TODO: if we want to use alternative entrypoints together with the ResolverRegistry,
     #  we can currently not rely on the existing class resolver.
     #  This is based on its connection to the overall `AppConfig`.
     #  One would requires an object of it to instantiate the ResolverRegistry.
     #  This could get resolved by implementing on own ResolverRegistry for each entrypoint or adapting the existing
     #  ResolverRegistry to work dynamically with any type-hinted config object from config.py.
     tokenizer = tokenizer_type.value(tokenizer_file=str(tokenizer_file))
-    generator = PackedDataGenerator(src_path, index_path=index_path, tokenizer=tokenizer, jq_pattern=jq_pattern)
+    generator = PackedDataGenerator(
+        src_path,
+        index_path=index_path,
+        tokenizer=tokenizer,
+        jq_pattern=jq_pattern,
+        number_of_processes=num_cpus,
+    )
     generator.run(dst_path)
 
 

diff --git a/src/modalities/dataloader/create_index.py b/src/modalities/dataloader/create_index.py
@@ -6,16 +6,14 @@
 import warnings
 from pathlib import Path
 
-import numpy as np
 from tqdm import tqdm
 
 
-# TODO: benchmark against pyspark
 class IndexGenerator:
     def __init__(self, src_file: Path, chunksize: int = 4096, drop_faulty_entries: bool = False):
         """
         Reads in a JSON file as a binary file, iterates character by character und builds up
-        the sample index (char-wisestart and end position for each JSON sample) via "\n" character positions.
+        the sample index (char-wise start and end position for each JSON sample) via "\n" character positions.
 
         :param src_file: Path to a jsonl-file.
         :param chunksize: defines the size of byte chunks that are processed via a producer-consumer approach.
@@ -26,12 +24,11 @@ def __init__(self, src_file: Path, chunksize: int = 4096, drop_faulty_entries: b
         self.src_file = src_file
         self.chunksize = chunksize
         self.drop_faulty_entries = drop_faulty_entries
-        with self.src_file.open(mode="r", encoding="utf-8") as fin:
+        with self.src_file.open(mode="r") as fin:
             fin.seek(0, os.SEEK_END)
-            num_chars = fin.tell()
-        self.num_chunks = num_chars // self.chunksize
-        self.reminder = num_chars % self.chunksize
-        self._chunk_queue = queue.Queue()
+            self._total_num_chars = fin.tell()
+        self.num_chunks = self._total_num_chars // self.chunksize
+        self._queue_of_raw_lines = queue.Queue()
         self._index_map = []
         self._exception_buffer = []
 
@@ -51,49 +48,42 @@ def create_index(self, target_path_for_index_file: Path):
     def _indexer_thread(self):
         def queue_generator():
             while True:
-                chunk = self._chunk_queue.get()
-                if chunk is None:
+                line = self._queue_of_raw_lines.get()
+                if line is None:
                     break
-                yield chunk
+                yield line
 
-        def process_line(last_index: int, curr_index: int):
-            segment_len = curr_index - last_index
+        def parse_line_as_json(line_start_idx: int, line: str):
             try:  # check if line is a valid json
-                line = np.memmap(self.src_file, mode="r", offset=last_index, shape=(segment_len,)).view("S1").tolist()
-                line = [c.decode("utf8") for c in line]
-                line = "".join(line)
                 json.loads(line)
-                self._index_map.append((last_index, segment_len))
+                self._index_map.append((line_start_idx, len(line)))
             except Exception as low_level_err:
                 if self.drop_faulty_entries:
-                    warnings.warn(f"faulty line at {last_index}-{curr_index}, skipping...")
+                    warnings.warn(f'faulty line "{line}", skipping...')
                 else:
-                    warnings.warn(f"faulty line: {line=}")
-                    err = ValueError(f"faulty line at {last_index}-{curr_index}")
+                    err = ValueError(f'faulty line "{line}"')
                     err.__cause__ = low_level_err
                     self._exception_buffer.append(err)
 
         self._index_map = []
-        last_index = 0
-        for chunk_idx, chunk in tqdm(enumerate(queue_generator()), desc="Processed Chunks", total=self.num_chunks):
-            for char_index, c in enumerate(chunk):
-                curr_index = chunk_idx * self.chunksize + char_index
-                if c == ord("\n"):
-                    process_line(last_index, curr_index)
-                    last_index = curr_index + 1
-        # prevents automatically added "\n"-chars at the end of files getting interpreted as own sample
-        if curr_index >= last_index:
-            process_line(last_index, curr_index + 1)
+        for line_start_idx, line in tqdm(queue_generator(), desc="Processed Lines"):
+            if self._check_for_parallel_errors():
+                return
+            parse_line_as_json(line_start_idx, line)
 
     def _reader_thread(self):
-        with open(self.src_file, "rb") as fin:
+        with open(self.src_file, "r") as fin:
             while True:
-                chunk = fin.read(self.chunksize)
-                if self._exception_buffer:
-                    raise RuntimeError(
-                        "Exception found in exception buffer. Probably the indexer thread ran into an error..."
-                    )
-                if not chunk:
+                cursor = fin.tell()
+                line = fin.readline()
+                if self._check_for_parallel_errors():
+                    return
+                if fin.tell() == self._total_num_chars:
+                    self._queue_of_raw_lines.put((cursor, line))
                     break
-                self._chunk_queue.put(chunk)
-        self._chunk_queue.put(None)
+                line_without_newline_char = line[:-1]
+                self._queue_of_raw_lines.put((cursor, line_without_newline_char))
+        self._queue_of_raw_lines.put(None)
+
+    def _check_for_parallel_errors(self) -> bool:
+        return bool(self._exception_buffer)
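Reviewer note: the line-based indexing idea in this change can be illustrated in isolation. The sketch below is not the `IndexGenerator` implementation: it is single-threaded, the function name `build_line_index` is hypothetical, and it opens the file in binary mode so that the stored `(offset, length)` pairs are byte positions. That is a deliberate difference from the diff above, which reads in text mode, where `tell()` returns an opaque position and `len(line)` counts characters rather than bytes.

```python
import json
import tempfile
from pathlib import Path
from typing import List, Tuple


def build_line_index(src_file: Path) -> List[Tuple[int, int]]:
    """Collect (byte offset, byte length) for every valid JSON line in a jsonl file."""
    index = []
    with src_file.open("rb") as fin:
        while True:
            offset = fin.tell()
            raw = fin.readline()
            if not raw:  # empty bytes object signals end of file
                break
            payload = raw.rstrip(b"\n")
            if not payload:  # skip blank lines, e.g. a trailing newline
                continue
            json.loads(payload)  # raises ValueError if the line is not valid JSON
            index.append((offset, len(payload)))
    return index


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "sample.jsonl"
    path.write_bytes('{"text": "a"}\n{"text": "äö"}\n'.encode("utf-8"))
    index = build_line_index(path)
    data = path.read_bytes()
    # Each sample can be recovered by slicing the raw bytes with its index entry.
    samples = [json.loads(data[start:start + length]) for start, length in index]

print(index)
print(samples)
```

With byte offsets, a sample can later be sliced directly out of a memory-mapped file, which is worth keeping in mind when the input contains non-ASCII text.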