Implement dataset loading from Google Cloud Storage #191
Conversation
Let's rename this PR to something more specific, like "Implement dataset loading from Google Cloud Storage"
""" | ||
) -> tuple[tf.data.Dataset, tf.data.Dataset, tf.data.Dataset]: | ||
|
||
def pad_or_truncate_data(data, target_length, pad_value=0): |
Why is this function nested? If it doesn't need to capture local variables then let's move it to module level.
done
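For reference, a minimal module-level sketch of the un-nested helper (the padding semantics are assumed from the name and signature, not taken from the PR):

```python
import numpy as np


def pad_or_truncate_data(data, target_length, pad_value=0):
    """Pad `data` with `pad_value`, or truncate it, so it is exactly `target_length` long."""
    data = np.asarray(data)
    if len(data) >= target_length:
        return data[:target_length]
    # Pad along the first axis, preserving any trailing dimensions.
    pad_shape = (target_length - len(data),) + data.shape[1:]
    padding = np.full(pad_shape, pad_value, dtype=data.dtype)
    return np.concatenate([data, padding], axis=0)
```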
```python
    dataset = dataset.batch(batch_size)

    # Use fixed sizes for splitting the dataset
```
Can we do this split outside of the function so the caller has complete control over how it's split?
done
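Splitting outside the loader could look like this (the sizes and the stand-in dataset are illustrative; only `take`/`skip` are the real tf.data API):

```python
import tensorflow as tf

# Stand-in for whatever the loader returns; any tf.data.Dataset works here.
dataset = tf.data.Dataset.range(1000)

# The caller now owns the split instead of the loading function.
train_size, val_size = 800, 100
train_ds = dataset.take(train_size)
val_ds = dataset.skip(train_size).take(val_size)
test_ds = dataset.skip(train_size + val_size)
```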
```python
    # Function to process inputs and labels for each day
    # Modify process_day to handle no labels case
```
Can you remove these TODO notes before checking this in?
Also add brief docstrings to these functions.
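As an illustration, `process_day` could carry a docstring along these lines (the signature is hypothetical; the real arguments may differ):

```python
def process_day(day_data, labels=None):
    """Build the (inputs, labels) pair for a single day.

    Args:
        day_data: Per-day feature arrays produced by the pipeline.
        labels: Optional labels for the day; None when unlabeled.

    Returns:
        An (inputs, labels) tuple, or just the inputs when labels is None.
    """
    ...
```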
```python
    # Create the dataset using the data generator
    dataset = tf.data.Dataset.from_generator(
        data_generator,
```
For my understanding, why do we specify an output signature in this dataset but not in the above datasets?
For create_atmo_input_output_sequencee, the dataset is built from batches coming out of the data pipeline, so it doesn't need an explicit output signature: when the elements are already pre-processed and consistently structured, TensorFlow can infer their shapes and dtypes during pipeline execution. (The flood pipeline does the same.)
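For contrast, a generator-backed dataset does need the signature, because TensorFlow cannot trace a Python generator to infer element shapes and dtypes. A minimal sketch (the shapes below are illustrative, not the PR's actual dimensions):

```python
import numpy as np
import tensorflow as tf


def data_generator():
    # Illustrative generator yielding (inputs, labels) pairs.
    for _ in range(4):
        yield np.zeros((24, 8), np.float32), np.zeros((24,), np.float32)


dataset = tf.data.Dataset.from_generator(
    data_generator,
    output_signature=(
        tf.TensorSpec(shape=(24, 8), dtype=tf.float32),  # inputs
        tf.TensorSpec(shape=(24,), dtype=tf.float32),    # labels
    ),
)
```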
```python
    ]

    # Simulate spatial data blob
    mock_spatial_blob = MagicMock()
```
Let's move this into a util function to avoid code duplication. Something like:

```python
def mock_numpy_blob(array: np.ndarray) -> storage.Blob: ...
```
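A minimal sketch of that helper, assuming the loader reads arrays via `blob.download_as_bytes()` (the real access pattern and serialization may differ):

```python
import io
from unittest.mock import MagicMock

import numpy as np
from google.cloud import storage


def mock_numpy_blob(array: np.ndarray) -> storage.Blob:
    """Build a MagicMock standing in for a storage.Blob that serves `array` as .npy bytes."""
    buffer = io.BytesIO()
    np.save(buffer, array)
    blob = MagicMock(spec=storage.Blob)  # spec keeps the mock's attribute surface honest
    blob.download_as_bytes.return_value = buffer.getvalue()
    return blob
```

Each test can then call `mock_numpy_blob(np.zeros((4, 4)))` instead of rebuilding the MagicMock by hand.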