-
-
Notifications
You must be signed in to change notification settings - Fork 795
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add data streaming support through mosaic-streaming
#1525
Changes from 1 commit
de4aa62
58297d3
41141c1
976bc13
ba86339
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -31,6 +31,7 @@ art | |
fschat==0.2.36 | ||
gradio==3.50.2 | ||
tensorboard | ||
mosaicml-streaming | ||
|
||
mamba-ssm==1.2.0.post1 | ||
|
||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -51,6 +51,49 @@ | |
LOG = logging.getLogger("axolotl") | ||
|
||
|
||
def load_streaming_dataset(config_dataset): | ||
""" | ||
Load a streaming dataset from a remote storage. | ||
|
||
This function initializes a streaming dataset from a remote S3 bucket, | ||
wraps the data into a generator, and then converts it into a Hugging Face | ||
dataset for compatibility purposes. | ||
|
||
Parameters: | ||
- config_dataset (dict): Configuration dictionary that may contain settings necessary for initializing the dataset. | ||
|
||
Returns: | ||
- ds (datasets.Dataset): A Hugging Face dataset object that streams data from the specified remote location. | ||
""" | ||
# These imports are local due to the optionality of `mosaicml-streaming`. | ||
from streaming import StreamingDataset | ||
from datasets import Features, Value, Dataset | ||
from functools import partial | ||
|
||
# Initialize the `StreamingDataset` with configurations. | ||
streaming_dataset = StreamingDataset( | ||
local=None, remote=config_dataset.path, shuffle=True, batch_size=4 | ||
) | ||
|
||
# Define dataset features according to the axolotl structure. | ||
features = Features({"text": Value("string")}) | ||
|
||
# Shim between `StreamingDataset` and `Dataset`. | ||
def generator_from_streaming_dataset(streaming_dataset): | ||
yield from streaming_dataset | ||
|
||
# Create a Hugging Face dataset from the generator. | ||
# | ||
# This is necessary because downstream functions use a different interface | ||
# than `StreamingDataset` (e.g. the `features` attribute). | ||
ds = Dataset.from_generator( | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This becomes an There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Sorry for the delay here. No, that was something that I wanted to verify but it looks like it goes to def process and everything is evaluated eagerly. I started a draft like:
But I don't know whether that's a good idea. The Feel free to remove the "ready to merge" tag from this. |
||
generator=partial(generator_from_streaming_dataset, streaming_dataset), | ||
features=features, | ||
) | ||
|
||
return ds | ||
|
||
|
||
def prepare_dataset(cfg, tokenizer): | ||
prompters = [] | ||
if not cfg.pretraining_dataset: | ||
|
@@ -317,10 +360,13 @@ def for_d_in_datasets(dataset_configs): | |
) | ||
elif ds_from_cloud and remote_file_system: | ||
if remote_file_system.isdir(config_dataset.path): | ||
ds = load_from_disk( | ||
config_dataset.path, | ||
storage_options=storage_options, | ||
) | ||
if config_dataset.streaming: | ||
ds = load_streaming_dataset(config_dataset) | ||
else: | ||
ds = load_from_disk( | ||
config_dataset.path, | ||
storage_options=storage_options, | ||
) | ||
elif remote_file_system.isfile(config_dataset.path): | ||
ds_type = get_ds_type(config_dataset) | ||
ds = load_dataset( | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
question
:mosaicml-streaming
should be an optional dependency. Is this the right way of adding it in this capacity?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm okay with it being a required dependency. If it causes issues down the line we can make it optional then. Hoping to keep things simpler.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could this version be locked?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed by
976bc13
.