
Implement dataset loading from Google Cloud Storage #191

Merged: 10 commits merged into main on Dec 2, 2024
Conversation

SallyElHajjar (Collaborator)
No description provided.

@Katsutoshii (Collaborator) left a comment:

Let's rename this PR to something more specific, like "Implement dataset loading from Google Cloud Storage"

"""
) -> tuple[tf.data.Dataset, tf.data.Dataset, tf.data.Dataset]:

def pad_or_truncate_data(data, target_length, pad_value=0):
Collaborator:
Why is this function nested? If it doesn't need to capture local variables then let's move it to module level.

Collaborator (Author):
done
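For context, a minimal sketch of what the module-level version could look like (the padding axis and exact behavior are assumptions for illustration, not the PR's actual code):

import numpy as np

def pad_or_truncate_data(data: np.ndarray, target_length: int, pad_value: float = 0) -> np.ndarray:
    """Pads or truncates `data` along its first axis to exactly `target_length`."""
    current_length = data.shape[0]
    if current_length >= target_length:
        return data[:target_length]
    # Pad only the first axis; leave any remaining axes untouched.
    pad_width = [(0, target_length - current_length)] + [(0, 0)] * (data.ndim - 1)
    return np.pad(data, pad_width, constant_values=pad_value)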

dataset = dataset.batch(batch_size)

# Use fixed sizes for splitting the dataset
Collaborator:
Can we do this split outside of the function so the caller has complete control over how it's split?

Collaborator (Author):
done
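As an illustration of the caller-side split being requested (the batch counts below are placeholder values, not numbers from the PR):

train_batches, val_batches = 80, 10  # example sizes; the real values are the caller's choice
train_ds = dataset.take(train_batches)
val_ds = dataset.skip(train_batches).take(val_batches)
test_ds = dataset.skip(train_batches + val_batches)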



# Function to process inputs and labels for each day
# Modify process_day to handle no labels case
Collaborator:
Can you remove these TODO notes before checking this in?

Collaborator:
Also add brief docstrings to these functions.
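For example, the kind of brief docstring being asked for (the process_day name comes from the surrounding comments; the signature and body here are placeholders):

def process_day(inputs, labels=None):
    """Processes one day of inputs and optional labels into model-ready tensors."""
    ...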


# Create the dataset using the data generator
dataset = tf.data.Dataset.from_generator(
data_generator,
Collaborator:
For my understanding, why do we specify an output signature in this dataset but not in the above datasets?

Collaborator (Author):
I considered that create_atmo_input_output_sequencee generates batches from the data pipeline, so it may not explicitly require an output signature. TensorFlow can infer the shapes and types of the dataset elements during pipeline execution when they follow a consistent structure. When the data is already pre-processed and consistently structured, the framework does not need the explicit signature to determine how to handle the elements. (This was done similarly for flood.)
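For illustration, a sketch of the distinction being discussed (num_features, the generator, and the in-memory arrays are placeholders, not taken from the PR):

import tensorflow as tf

# A generator-backed dataset: TensorFlow cannot inspect a Python generator,
# so the element types and shapes must be declared up front.
dataset = tf.data.Dataset.from_generator(
    data_generator,  # placeholder generator
    output_signature=(
        tf.TensorSpec(shape=(None, num_features), dtype=tf.float32),  # inputs (placeholder shape)
        tf.TensorSpec(shape=(None,), dtype=tf.float32),               # labels (placeholder shape)
    ),
)

# By contrast, a dataset built directly from in-memory tensors infers its
# element spec from the data itself, so no explicit signature is needed.
other_dataset = tf.data.Dataset.from_tensor_slices((inputs_array, labels_array))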

@SallyElHajjar changed the title from "adding testing files" to "Implement dataset loading from Google Cloud Storage" on Nov 12, 2024
]

# Simulate spatial data blob
mock_spatial_blob = MagicMock()
Collaborator:
Let's move this into a util function to avoid code duplication. Something like:

def mock_numpy_blob(array: np.ndarray) -> storage.Blob: ...
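One possible shape for that helper (a sketch only: the use of download_as_bytes() and .npy serialization is an assumption about how the tests consume the blob, not taken from the PR):

import io
from unittest.mock import MagicMock

import numpy as np
from google.cloud import storage

def mock_numpy_blob(array: np.ndarray) -> storage.Blob:
    """Returns a mock GCS blob whose download_as_bytes() yields `array` serialized as .npy."""
    buffer = io.BytesIO()
    np.save(buffer, array)
    blob = MagicMock(spec=storage.Blob)
    blob.download_as_bytes.return_value = buffer.getvalue()
    return blob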

@SallyElHajjar merged commit 9d465a0 into main on Dec 2, 2024
4 checks passed