
Support dropout for training samples with min mean value below a configured threshold. #158

Open · wants to merge 3 commits into main
Conversation

Katsutoshii (Collaborator)

Support dropout for training samples with min mean value below a configured threshold.
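
A minimal sketch of what this kind of dropout amounts to (the function name, tensor layout, and default threshold are illustrative assumptions, not necessarily how the PR implements it):

```python
import torch

def keep_sample(labels: torch.Tensor, min_mean_threshold: float = 1e-4) -> bool:
    """Hypothetical filter: drop a training sample when its label mean falls below the threshold."""
    return labels.float().mean().item() >= min_mean_threshold
```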

@tatianawu (Contributor) commented Jul 29, 2024

I think the original idea that Walt proposed was adding a `percent_no_data` field to the metastore to actually calculate the % of each chunk that isn't semantically meaningful. I'm worried that filtering based on mean label will exclude chunks where we have small amounts of flooding that the model should learn on.

IIRC there's a binary mask for the NODATA cells from the DEM. It should just be a mean over this mask, and then we can add another filter in `load_dataset` to only include chunks that exceed some threshold of available data.
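
A rough sketch of that check (the helper names and boolean-mask layout below are assumptions for illustration, not existing pipeline code):

```python
import numpy as np

def fraction_no_data(nodata_mask: np.ndarray) -> float:
    """Fraction of cells in a chunk flagged NODATA, assuming True marks NODATA."""
    return float(nodata_mask.mean())

# Hypothetical load-time filter: keep chunks where at least half the cells have data.
def has_enough_data(nodata_mask: np.ndarray, max_no_data_fraction: float = 0.5) -> bool:
    return fraction_no_data(nodata_mask) <= max_no_data_fraction
```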

@Katsutoshii (Collaborator, Author)

> I think the original idea that Walt proposed was adding a `percent_no_data` field to the metastore to actually calculate the % of each chunk that isn't semantically meaningful. I'm worried that filtering based on mean label will exclude chunks where we have small amounts of flooding that the model should learn on.
>
> IIRC there's a binary mask for the NODATA cells from the DEM. It should just be a mean over this mask, and then we can add another filter in `load_dataset` to only include chunks that exceed some threshold of available data.

Is this data already available in the metastore? If not, does this mean-based approach still make sense to include as a stopgap? With a sufficiently low threshold we shouldn't have any issue with over-filtering.

@tatianawu (Contributor)

> > I think the original idea that Walt proposed was adding a `percent_no_data` field to the metastore to actually calculate the % of each chunk that isn't semantically meaningful. I'm worried that filtering based on mean label will exclude chunks where we have small amounts of flooding that the model should learn on.
> >
> > IIRC there's a binary mask for the NODATA cells from the DEM. It should just be a mean over this mask, and then we can add another filter in `load_dataset` to only include chunks that exceed some threshold of available data.
>
> Is this data already available in the metastore? If not, does this mean-based approach still make sense to include as a stopgap? With a sufficiently low threshold we shouldn't have any issue with over-filtering.

I'm not sure; let's wait for @waltaskew to chime in once he's back. But in general, I feel like it makes more sense to put this information in the metastore rather than computing a mean each time we load in a tensor. I'd rather avoid introducing temporary code, especially since this isn't really anything urgent.

@waltaskew (Contributor) commented Jul 30, 2024

I'd prefer not to leave folks with stopgaps for anything that isn't urgent (e.g., some stopgaps that allow them to make predictions are okay, since making predictions is a core thing they need to be able to do). USL will need to be able to implement these sorts of experiments on their own starting Thursday! So I think it's more valuable to leave them with good patterns and good documentation they can follow than to leave them with stopgaps that let us finish tasks before Thursday.

Some options:

  • put together a notebook which demonstrates some improvements from removing empty chunks
  • write documentation describing this as future work with some pointers on how to implement it
  • write the code we'd really want for this, which calculates percent no-data in the data pipeline, and re-run the pipeline to get those values into the metastore (this is a pretty easy task we could accomplish in a day, but I'd weigh the importance of running that experiment for them against making sure they're set up to run future experiments themselves; see the sketch after this list)
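
A rough sketch of that third option (the metastore field name, helper names, and `load_dataset`-side filter below are illustrative assumptions, not the real pipeline API):

```python
import numpy as np

# Pipeline side: computed once per chunk and written to the metastore alongside
# the chunk's other metadata (field name is hypothetical).
def chunk_metadata(nodata_mask: np.ndarray) -> dict:
    return {"percent_no_data": float(nodata_mask.mean())}

# Dataset side: filter against the precomputed field instead of re-reading the
# DEM mask every time a tensor is loaded (threshold value is a placeholder).
def filter_chunks(chunk_entries: list[dict], max_percent_no_data: float = 0.5) -> list[dict]:
    return [c for c in chunk_entries if c["percent_no_data"] <= max_percent_no_data]
```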

@tatianawu (Contributor)

> I'd prefer not to leave folks with stopgaps for anything that isn't urgent (e.g., some stopgaps that allow them to make predictions are okay, since making predictions is a core thing they need to be able to do). USL will need to be able to implement these sorts of experiments on their own starting Thursday! So I think it's more valuable to leave them with good patterns and good documentation they can follow than to leave them with stopgaps that let us finish tasks before Thursday.
>
> Some options:
>
>   • put together a notebook which demonstrates some improvements from removing empty chunks
>   • write documentation describing this as future work with some pointers on how to implement it
>   • write the code we'd really want for this, which calculates percent no-data in the data pipeline, and re-run the pipeline to get those values into the metastore (this is a pretty easy task we could accomplish in a day, but I'd weigh the importance of running that experiment for them against making sure they're set up to run future experiments themselves)

I briefly touched upon this idea when discussing iterating on the NYC prototype. We can add more detail to get USL folks set up better, but I think this might actually be a good initial exercise to get folks to work with multiple parts of the pipeline.

waltaskew removed their request for review on August 2, 2024.