
Support dropout for training samples with min mean value below a configured threshold. #158

Open · wants to merge 3 commits into main
Conversation

Katsutoshii (Collaborator)

Support dropout for training samples with min mean value below a configured threshold.
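
A minimal sketch of what this kind of dropout amounts to (the function name, tensor layout, and default threshold are illustrative assumptions, not necessarily how the PR implements it):

```python
import torch

def keep_sample(labels: torch.Tensor, min_mean_threshold: float = 1e-4) -> bool:
    """Hypothetical filter: drop a training sample when its label mean falls below the threshold."""
    return labels.float().mean().item() >= min_mean_threshold
```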

@tatianawu (Contributor) commented Jul 29, 2024

I think the original idea that Walt proposed was adding a `percent_no_data` field to the metastore to actually calculate the % of each chunk that isn't semantically meaningful. I'm worried that filtering based on mean label will exclude chunks where we have small amounts of flooding that the model should learn on.

IIRC there's a binary mask for the NODATA cells from the DEM. It should just be a mean over this mask, and then we can add another filter in `load_dataset` to only include chunks that exceed some threshold of available data.
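
A rough sketch of that check (the helper names and boolean-mask layout below are assumptions for illustration, not existing pipeline code):

```python
import numpy as np

def fraction_no_data(nodata_mask: np.ndarray) -> float:
    """Fraction of cells in a chunk flagged NODATA, assuming True marks NODATA."""
    return float(nodata_mask.mean())

# Hypothetical load-time filter: keep chunks where at least half the cells have data.
def has_enough_data(nodata_mask: np.ndarray, max_no_data_fraction: float = 0.5) -> bool:
    return fraction_no_data(nodata_mask) <= max_no_data_fraction
```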

@Katsutoshii (Collaborator, Author)

> I think the original idea that Walt proposed was adding a `percent_no_data` field to the metastore to actually calculate the % of each chunk that isn't semantically meaningful. I'm worried that filtering based on mean label will exclude chunks where we have small amounts of flooding that the model should learn on.
>
> IIRC there's a binary mask for the NODATA cells from the DEM. It should just be a mean over this mask, and then we can add another filter in `load_dataset` to only include chunks that exceed some threshold of available data.

Is this data already available in the metastore? If not, does this mean-based approach still make sense to include as a stopgap? With a sufficiently low threshold we shouldn't have any issue with over-filtering.

@tatianawu (Contributor)

> > I think the original idea that Walt proposed was adding a `percent_no_data` field to the metastore to actually calculate the % of each chunk that isn't semantically meaningful. I'm worried that filtering based on mean label will exclude chunks where we have small amounts of flooding that the model should learn on.
> >
> > IIRC there's a binary mask for the NODATA cells from the DEM. It should just be a mean over this mask, and then we can add another filter in `load_dataset` to only include chunks that exceed some threshold of available data.
>
> Is this data already available in the metastore? If not, does this mean-based approach still make sense to include as a stopgap? With a sufficiently low threshold we shouldn't have any issue with over-filtering.

I'm not sure; let's wait for @waltaskew to chime in once he's back. But in general, I feel like it makes more sense to put this information in the metastore rather than computing a mean each time we load in a tensor. I'd rather avoid introducing temporary code, especially since this isn't really anything urgent.

@waltaskew (Contributor) commented Jul 30, 2024

I'd prefer not to leave folks with stopgaps for anything that isn't urgent (e.g., some stopgaps that allow them to make predictions are okay, since making predictions is a core thing they need to be able to do). USL will need to be able to implement these sorts of experiments on their own starting Thursday! So I think it's more valuable to leave them with good patterns and good documentation they can follow than to leave them with stopgaps that let us finish tasks before Thursday.

Some options:

  • put together a notebook which demonstrates some improvements from removing empty chunks
  • write documentation describing this as future work with some pointers on how to implement it
  • write the code we'd really want for this, which calculates percent no-data in the data pipeline, and re-run the pipeline to get those values into the metastore (this is a pretty easy task we could accomplish in a day, but I'd weigh the importance of running that experiment for them against making sure they're set up to run future experiments themselves; see the sketch after this list)
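
A rough sketch of that third option (the metastore field name, helper names, and `load_dataset`-side filter below are illustrative assumptions, not the real pipeline API):

```python
import numpy as np

# Pipeline side: computed once per chunk and written to the metastore alongside
# the chunk's other metadata (field name is hypothetical).
def chunk_metadata(nodata_mask: np.ndarray) -> dict:
    return {"percent_no_data": float(nodata_mask.mean())}

# Dataset side: filter against the precomputed field instead of re-reading the
# DEM mask every time a tensor is loaded (threshold value is a placeholder).
def filter_chunks(chunk_entries: list[dict], max_percent_no_data: float = 0.5) -> list[dict]:
    return [c for c in chunk_entries if c["percent_no_data"] <= max_percent_no_data]
```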

@tatianawu (Contributor)

> I'd prefer not to leave folks with stopgaps for anything that isn't urgent (e.g., some stopgaps that allow them to make predictions are okay, since making predictions is a core thing they need to be able to do). USL will need to be able to implement these sorts of experiments on their own starting Thursday! So I think it's more valuable to leave them with good patterns and good documentation they can follow than to leave them with stopgaps that let us finish tasks before Thursday.
>
> Some options:
>
>   • put together a notebook which demonstrates some improvements from removing empty chunks
>   • write documentation describing this as future work with some pointers on how to implement it
>   • write the code we'd really want for this, which calculates percent no-data in the data pipeline, and re-run the pipeline to get those values into the metastore (this is a pretty easy task we could accomplish in a day, but I'd weigh the importance of running that experiment for them against making sure they're set up to run future experiments themselves)

I briefly touched upon this idea when discussing iterating on the NYC prototype. We can add more detail to get USL folks set up better, but I think this might actually be a good initial exercise to get folks to work with multiple parts of the pipeline.

waltaskew removed their request for review on August 2, 2024.