Implement dataset loading from Google Cloud Storage #191
Conversation
Let's rename this PR to something more specific, like "Implement dataset loading from Google Cloud Storage"
""" | ||
) -> tuple[tf.data.Dataset, tf.data.Dataset, tf.data.Dataset]: | ||
|
||
def pad_or_truncate_data(data, target_length, pad_value=0): |
Why is this function nested? If it doesn't need to capture local variables then let's move it to module level.
done
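For reference, a minimal module-level sketch of the un-nested helper (the padding semantics are assumed from the name and signature, not taken from the PR):

```python
import numpy as np


def pad_or_truncate_data(data, target_length, pad_value=0):
    """Pad `data` with `pad_value`, or truncate it, so it is exactly `target_length` long."""
    data = np.asarray(data)
    if len(data) >= target_length:
        return data[:target_length]
    # Pad along the first axis, preserving any trailing dimensions.
    pad_shape = (target_length - len(data),) + data.shape[1:]
    padding = np.full(pad_shape, pad_value, dtype=data.dtype)
    return np.concatenate([data, padding], axis=0)
```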
```python
    dataset = dataset.batch(batch_size)

    # Use fixed sizes for splitting the dataset
```
Can we do this split outside of the function so the caller has complete control over how it's split?
done
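Splitting outside the loader could look like this (the sizes and the stand-in dataset are illustrative; only `take`/`skip` are the real tf.data API):

```python
import tensorflow as tf

# Stand-in for whatever the loader returns; any tf.data.Dataset works here.
dataset = tf.data.Dataset.range(1000)

# The caller now owns the split instead of the loading function.
train_size, val_size = 800, 100
train_ds = dataset.take(train_size)
val_ds = dataset.skip(train_size).take(val_size)
test_ds = dataset.skip(train_size + val_size)
```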
```python
    # Function to process inputs and labels for each day
    # Modify process_day to handle no labels case
```
Can you remove these TODO notes before checking this in?
Also add brief docstrings to these functions.
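As an illustration, `process_day` could carry a docstring along these lines (the signature is hypothetical; the real arguments may differ):

```python
def process_day(day_data, labels=None):
    """Build the (inputs, labels) pair for a single day.

    Args:
        day_data: Per-day feature arrays produced by the pipeline.
        labels: Optional labels for the day; None when unlabeled.

    Returns:
        An (inputs, labels) tuple, or just the inputs when labels is None.
    """
    ...
```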
```python
    # Create the dataset using the data generator
    dataset = tf.data.Dataset.from_generator(
        data_generator,
```
For my understanding, why do we specify an output signature in this dataset but not in the above datasets?
For create_atmo_input_output_sequencee, the dataset is built from batches coming out of the data pipeline, so it doesn't need an explicit output signature: when the elements are already pre-processed and consistently structured, TensorFlow can infer their shapes and dtypes during pipeline execution. (The flood pipeline does the same.)
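For contrast, a generator-backed dataset does need the signature, because TensorFlow cannot trace a Python generator to infer element shapes and dtypes. A minimal sketch (the shapes below are illustrative, not the PR's actual dimensions):

```python
import numpy as np
import tensorflow as tf


def data_generator():
    # Illustrative generator yielding (inputs, labels) pairs.
    for _ in range(4):
        yield np.zeros((24, 8), np.float32), np.zeros((24,), np.float32)


dataset = tf.data.Dataset.from_generator(
    data_generator,
    output_signature=(
        tf.TensorSpec(shape=(24, 8), dtype=tf.float32),  # inputs
        tf.TensorSpec(shape=(24,), dtype=tf.float32),    # labels
    ),
)
```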
```python
    ]

    # Simulate spatial data blob
    mock_spatial_blob = MagicMock()
```
Let's move this into a util function to avoid code duplication. Something like:

```python
def mock_numpy_blob(array: np.ndarray) -> storage.Blob: ...
```
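A minimal sketch of that helper, assuming the loader reads arrays via `blob.download_as_bytes()` (the real access pattern and serialization may differ):

```python
import io
from unittest.mock import MagicMock

import numpy as np
from google.cloud import storage


def mock_numpy_blob(array: np.ndarray) -> storage.Blob:
    """Build a MagicMock standing in for a storage.Blob that serves `array` as .npy bytes."""
    buffer = io.BytesIO()
    np.save(buffer, array)
    blob = MagicMock(spec=storage.Blob)  # spec keeps the mock's attribute surface honest
    blob.download_as_bytes.return_value = buffer.getvalue()
    return blob
```

Each test can then call `mock_numpy_blob(np.zeros((4, 4)))` instead of rebuilding the MagicMock by hand.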