
DAS-2232 -small functions added to support the main solution in the t… #16

Merged
merged 9 commits into DAS-2208-SMAP-L3 from DAS-2232-1
Nov 1, 2024

Conversation

sudha-murthy
Collaborator

DAS-2232

Description

This PR updates hoss_config.json, pip_requirements.txt, and adds a set of small functions in a new file called coordinate_utilities.py. The aim of the ticket is to use the coordinate variables to create dimension scales for collections that do not have support for CF standards like dimension scales and grid mapping variables. This PR has just a few methods and supporting unit tests towards the final solution; it will be merged into the feature branch DAS-2208-SMAP-L3 until all the required changes for the ticket are complete.

DAS-2232

Local Test Steps

The unit tests should all pass.
In a local Harmony install:

  • fetch the DAS-2232-1 branch from the harmony-opendap-subsetter repo
  • cd harmony-opendap-subsetter
  • bin/build-image
  • bin/build-tests
  • bin/run-tests

PR Acceptance Checklist

  • Jira ticket acceptance criteria met.
  • CHANGELOG.md updated to include high level summary of PR changes.
  • docker/service_version.txt updated if publishing a release.
  • Tests added/updated and passing.
  • Documentation updated (if needed).

@autydp
Contributor

autydp commented Oct 20, 2024

This pull request should be made against the feature branch DAS-2208-SMAP-L3, not main (yet)

@sudha-murthy
Collaborator Author

This pull request should be made against the feature branch DAS-2208-SMAP-L3, not main (yet)

Right. Not sure how to switch it to the feature branch.

@owenlittlejohns
Member

@sudha-murthy you can switch the target branch by selecting to edit the pull request. It should be an option near the top of the page.

@sudha-murthy sudha-murthy changed the base branch from main to DAS-2208-SMAP-L3 October 21, 2024 03:13
override_variable = varinfo.get_variable(variable_name)

if override_variable is not None and (
override_variable.is_latitude() or override_variable.is_longitude()
Contributor

This might be ok in context, but as a stand-alone function it is a bit odd - taking in a variable_name, and then only working if that variable is either a latitude or longitude variable. In fact, the result is not really dependent upon the variable_name passed in, other than to use the path. It might be that it has the side effect of verifying the variable_name as a latitude or longitude, but that is because it calls upon the is_latitude and is_longitude VarInfo methods, which can be called directly. It raises the exception MissingCoordinateVariable if the variable passed in is not a latitude or longitude, but is a known variable, i.e., not intrinsically a missing coordinate use case.

It has the feel of a method perhaps taken as frequently used code, in which case the name could be check_coord_var_and_get_std_dimensions. Alternatively, if there is not the need to verify the variable is a coordinate, the code could be simplified to simply return standard dimensions within a shared path - a get_std_projected_dimension_reference. If the only use is in the following method, that would be my recommendation.

As is, the name could also be: get_std_projected_coordinate_dimensions, but I wonder about the need to check for is_latitude or is_longitude.

Collaborator Author

Will use your new suggested names.

I originally had is_latitude map to projected_y and is_longitude map to projected_x.
I removed that.
But we do need to validate that they are coordinates; we are using that path for the projected dimensions.
Maybe it does not have to be within this method.

Collaborator Author

Made the updates: removed "override" and renamed the methods, and removed the lat/lon check in the get_projected_dimension_names function.
commit - ca881d1

return projected_dimension_names


def get_override_projected_dimensions(
Contributor

I'm not sure if this is expected to be updated when we do use a configuration file, but in the meantime, perhaps get_std_projected_dimensions is a better name. Also, such a revised name suggests a method that is well defined and ready to merge, vs. getting the projected dimensions, possibly with transposed names, which is TBD.

Collaborator Author

ok

Collaborator Author

Used override because we are overriding the coordinates with projected dimensions.

Member

when we do use a configuration file

I'm going to sound like a broken record, but I do not think we should assume leaning on a configuration file is the best implementation. There are two relevant tenets I try to generally adhere to in our service code:

  1. Avoid hard-coding collection specific information into the code itself.
  2. Minimise the need for configuration file entries, to maintain a low barrier to entry for data providers to on-board collections to a service.

We have the information we need to determine dimension ordering while not needing configuration file-based information. And, if we write the code in a general way, this could open HOSS up to a large number of collections that are HDF-5 format but without proper 1-D dimension variables. That could be a massive win! But it would be stymied by data providers needing to add information to the configuration file. To date, no-one outside of DAS has added any configuration file information to any of our services in the last 3 years. That's pretty telling.

Member

And, yup, I acknowledge that this monologue is not strictly relevant to this PR.

return set(
variable
for variable in variables
if len(varinfo.get_variable(variable).dimensions) == 0
Contributor

Here is where we need the additional:

if (len(varinfo.get_variable(variable).dimensions) == 0
    or all([ for dimension in varinfo.get_variable(variable.dimensions) : 
                  varinfo.get_variable(dimension) not None and not [] )

(excuse the pidgin Python)

Collaborator Author

Maybe I will include it as a comment till we make the configuration change?

Member

I strongly prefer what we have now. Also, the function contents exactly matches the function name.

Also, the snippet supplied above is not correct Python code, so it's hard to know for sure what you are trying to achieve. Trying to decompose that snippet:

if (len(varinfo.get_variable(variable).dimensions) == 0
    or all([ for dimension in varinfo.get_variable(variable.dimensions) : varinfo.get_variable(dimension) not None and not [] )
  • The first bit still makes sense - if the variable in question doesn't have dimensions.
  • Then, I think you are trying to see the VarInfoFromDmr instance does not have any of the listed dimensions as variables.
  • The and not [] is a no-op. It will always evaluate to True, because it is being evaluated in isolation, and you are asking if an empty list is "falsy", which it is.

While I don't like this approach, I think what you are trying to suggest would be more like:

if (
    len(varinfo.get_variable(variable).dimensions) == 0
    or all(
        varinfo.get_variable(dimension) is None
        for dimension in varinfo.get_variable(variable).dimensions
    )
)

If this was to be augmented in such a way, I would recommend breaking this check out into its own function, because the set comprehension will become very hard to read.
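The point about the no-op can be verified directly; this tiny snippet (plain Python, independent of any HOSS code) shows that `not []` is always truthy and therefore contributes nothing to a conjunction:

```python
# An empty list is falsy, so `not []` is always True, and
# `x and not []` reduces to just `x`.
print(not [])           # True
print(True and not [])  # True
print(bool([]))         # False: the empty list itself is falsy
```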

@D-Auty D-Auty Oct 28, 2024

I can see splitting out the function to clarify the code and document the comprehension.
I'm less clear on forcing the upstream code to use this code without the additional check, and then add in the call to the new function in every usage. That additional check is now essential to ensure the case of OPeNDAP creating "empty" dimensions does not allow this check, by itself, to succeed. And of course, OPeNDAP's "empty" dimensions is pretty much always going to be the case.


Suggested change
if len(varinfo.get_variable(variable).dimensions) == 0
if (
    len(varinfo.get_variable(variable).dimensions) == 0
    or any_absent_dimension_variables(variable)
)
...
def any_absent_dimension_variables(variable: VarInfo.variable) -> bool:
    return any(
        varinfo.get_variable(dimension) is None
        for dimension in varinfo.get_variable(variable).dimensions
    )
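As a sanity check of the suggested shape, here is a self-contained sketch; the `FakeVariable` and `FakeVarInfo` classes are hypothetical stand-ins for the real `VarInfoFromDmr` API, not part of it:

```python
from typing import Optional


class FakeVariable:
    """Hypothetical stand-in for a VarInfoFromDmr variable record."""

    def __init__(self, dimensions):
        self.dimensions = dimensions


class FakeVarInfo:
    """Hypothetical stand-in: maps variable full paths to FakeVariable."""

    def __init__(self, variables):
        self._variables = variables

    def get_variable(self, name) -> Optional[FakeVariable]:
        return self._variables.get(name)


def any_absent_dimension_variables(varinfo, variable_name) -> bool:
    # True if any dimension listed for the variable has no matching
    # variable in the VarInfo instance (e.g. OPeNDAP "empty" dimensions).
    return any(
        varinfo.get_variable(dimension) is None
        for dimension in varinfo.get_variable(variable_name).dimensions
    )


varinfo = FakeVarInfo({
    '/science': FakeVariable(['/science_dim0', '/science_dim1']),
    '/gridded': FakeVariable(['/y', '/x']),
    '/y': FakeVariable([]),
    '/x': FakeVariable([]),
})

print(any_absent_dimension_variables(varinfo, '/science'))  # True: dims absent
print(any_absent_dimension_variables(varinfo, '/gridded'))  # False: /y, /x exist
```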

Collaborator Author

function updated and unit tests added - d972777

@sudha-murthy sudha-murthy requested a review from autydp October 21, 2024 15:02
Member

@owenlittlejohns owenlittlejohns left a comment

First up - nice work. There are a couple of things I really like in this PR:

  • The scope of each function in coordinate_utilities.py tends to be of a single level of abstraction, making the functions fairly clean.
  • There are a bunch of unit tests. I think there are still a number of subtests that should be added, but based on what you've already put in place, most of them should be really quick to add.

I've added a lot of comments. I think some of them are really small things, like wording or additional doc-string comments. Some of them are extra test cases in the unit tests (and I am more than happy to help if you have unit test questions!), and some of them are acknowledging that things will be done in the future (but maybe a "TODO" comment would be helpful).

I think the biggest things in the code itself were:

  • Potentially refactoring get_lat_lon_arrays to just get one array at a time, and reduce two blocks of near identical code.
  • Reworking the calculation for the dimension scale. Given the information you have, I think you don't need separate logical blocks for calculating an ascending versus a descending array. (The good news is that with your unit tests, you could drop in my suggestion and just see if it works with the test you already have)
  • The fact that if you are trying to find valid indices for an array and specify the fill values, the latitude/longitude checks won't be applied.
  • Accounting for longitude arrays that are 0 to 360 degrees east, not just -180 to 180.

Even these aren't massive things.

P.S. Did you want to close the old PR? It won't be deleted, so we'll still have access to comments.

Comment on lines 35 to 36
f'{override_variable.group_path}/projected_y',
f'{override_variable.group_path}/projected_x',
Member

This choice of dimension names is better than before, but it's still likely to cause problems down the road. This function currently relies on the assumption that coordinate variables for different grids will be in different groups, which is not guaranteed.

(I think we see this with some of the gridded ICESat-2 ATLAS collections, for example)

Collaborator Author

I guess I could just have group_path/coordinate_name_projected_x or projected_y.
The coordinate names would be unique: it would be latitude_projected_x or latitude_projected_y. It would not matter; it would be unique.

Member

One approach (that would make this function more complicated 😞) could be to:

  • (Already done) - Get the VariableFromDmr instance.
  • Get the references from the coordinates metadata attribute.
  • Find the latitude and longitude variables from those references and put them in a list. - Maybe reusing your get_coordinate_variables function.
  • Make a unique hash of those references.
  • Prepend that hash on to .../projected_x and .../projected_y (like you're already doing with the group path).

I took some inspiration from MaskFill. Try something like:

from hashlib import sha256

# Faking the coordinates list for now:
coordinates_list = ['/latitude', '/longitude']

coordinates_hash = sha256(' '.join(coordinates_list).encode('utf-8')).hexdigest()

projected_dimension_names = [
    f'{coordinates_hash}/projected_y',
    f'{coordinates_hash}/projected_x',
]

The string will look a bit ugly (e.g., 'a089b9ebff6935f6c8332710de2ee3b351bd47c1fb807b22765969617027e8d2'), but it will be unique and reproducible.
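The hashing snippet above is runnable as-is; a slightly more general sketch wraps it in a helper (the function name `get_coordinate_hash` is illustrative, not part of HOSS):

```python
from hashlib import sha256


def get_coordinate_hash(coordinates_list) -> str:
    """Deterministic 64-character hex prefix derived from coordinate references."""
    return sha256(' '.join(coordinates_list).encode('utf-8')).hexdigest()


coordinates_hash = get_coordinate_hash(['/latitude', '/longitude'])
projected_dimension_names = [
    f'{coordinates_hash}/projected_y',
    f'{coordinates_hash}/projected_x',
]

# The same coordinate references always yield the same prefix, so the
# dimension names are unique per coordinate pair and reproducible
# across invocations.
print(len(coordinates_hash))  # 64
```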


@sudha-murthy
Collaborator Author

@owenlittlejohns @autydp - I think I have updated based on all the comments except the ones related to 3D and the order of dimensions, which will be in the next PR.
I have not done assertRaisesRegex for the exception message. I can do that if it is needed.

Member

@owenlittlejohns owenlittlejohns left a comment

Thanks for addressing all the smaller comments and adding a few more unit tests in. I think this is pretty much there.

I added a couple of other comments, but there was only one thing that I'd definitely like to see changed (the if coordinate_fill to if coordinate_fill is not None tweak).

I did also try to think some more about an appropriate prefix for the projected 1-D dimension variable names. I've popped an idea down there which might be helpful (I hope!). I'm not sure if that needs to be addressed in this PR, but I think it will need to be taken care of before the completion of the overall feature.

Returns indices of a valid array without fill values
"""

if coordinate_fill:
Member

Thanks for reworking this so that the fill value check and the latitude/longitude range checks can both happen.

I think the coordinate.is_latitude() and coordinate.is_longitude() checks could benefit from some numpy magic, rather than looping individually through each element. I think what you could use is the element-wise-and, which can be either written as & or np.logical_and. You could do something like:

if coordinate_fill is not None:
    is_not_fill = ~np.isclose(coordinate_row_col, float(coordinate_fill))
else:
    # Creates an entire array of `True` values.
    is_not_fill = np.ones_like(coordinate_row_col, dtype=bool)

if coordinate.is_longitude():
    valid_indices = np.where(
        np.logical_and(
            is_not_fill,
            np.logical_and(
                coordinate_row_col >= -180.0,
                coordinate_row_col <= 360.0
            )
        )
    )
elif coordinate.is_latitude():
    valid_indices = np.where(
        np.logical_and(
            is_not_fill,
            np.logical_and(
                coordinate_row_col >= -90.0,
                coordinate_row_col <= 90.0
            )
        )
    )
else:
    valid_indices = np.empty((0, 0))

Note, in the snippet above, I've also changed the first check from if coordinate_fill to if coordinate_fill is not None. That's pretty important, as zero could be a valid fill value, but if coordinate_fill = 0, then this check will evaluate to False.

Ultimately, I think the conditions you have now are equivalent to this, just maybe not as efficient. So the only thing I'd definitely like to see changed is that first if coordinate_fill condition, so that it's not considering a fill value of 0 to be non-fill.
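A stand-alone sketch of that masking logic, using made-up sample data and restricted to the longitude branch for brevity (the helper name is illustrative, not the HOSS function):

```python
import numpy as np


def get_valid_longitude_indices(coordinate_row_col, coordinate_fill):
    """Indices of elements that are neither fill values nor out of range.

    Sketch only: assumes a 1-D longitude array spanning -180 to 360.
    """
    if coordinate_fill is not None:
        is_not_fill = ~np.isclose(coordinate_row_col, float(coordinate_fill))
    else:
        # Entire array of True values: nothing is treated as fill.
        is_not_fill = np.ones_like(coordinate_row_col, dtype=bool)

    # Element-wise AND of the fill mask and the longitude range check;
    # np.where returns a tuple, so take the first element for 1-D input.
    return np.where(
        np.logical_and(
            is_not_fill,
            np.logical_and(
                coordinate_row_col >= -180.0,
                coordinate_row_col <= 360.0,
            ),
        )
    )[0]


longitudes = np.array([-9999.0, -180.0, 0.0, 200.0, 400.0])
print(get_valid_longitude_indices(longitudes, -9999.0))  # [1 2 3]
# With a fill value of 0, a bare `if coordinate_fill` would wrongly skip
# the fill check; `is not None` correctly excludes index 2:
print(get_valid_longitude_indices(longitudes, 0.0))      # [1 3]
```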

return valid_indices


def get_fill_value_for_coordinate(
Member

@owenlittlejohns owenlittlejohns Oct 28, 2024

Python plays a little fast and loose with types sometimes. If you have an integer and you are trying to do something with it and a float, then Python tries to be clever and treat that integer as a float.

For example:

In [1]: integer_variable = 2
In [2]: type(integer_variable)
Out[2]: int
In [3]: integer_variable == 2.0
Out[3]: True

So, I think we're probably okay in the instance that the fill value metadata attribute is an integer. (But I totally acknowledge that there is somewhat sneaky stuff Python is doing in the background here)

)[0]
elif coordinate_name == 'longitude':
valid_indices = np.where(
(coordinate_row_col >= -180.0) & (coordinate_row_col <= 180.0)
Member

I just wanted to come back to this thread - I was wrong about the &. But I think I prefer the equivalent np.logical_and just because I'm generally too verbose.


Comment on lines 362 to 374
expected_row_col_sizes = (5, 10)
lat_1d_array = np.array([1, 2, 3, 4])
lon_1d_array = np.array([6, 7, 8, 9])
lat_mismatched_array = np.array([[1, 2, 3], [3, 4, 5]])
lon_mismatched_array = np.array([[6, 7], [8, 9], [10, 11]])
lat_empty_ndim_array = np.array(
0,
)
lon_empty_ndim_array = np.array(
0,
)
lat_empty_size_array = np.array([])
lon_empty_size_array = np.array([])
Member

I think I mentioned this somewhere else, but it's a lot easier for the reader if these things that relate to specific subtests were within the relevant with self.subTest(...) context manager. From a quick skim, it doesn't look like these items are used in multiple subtests, so moving them shouldn't break anything. (Note - subtests aren't always executed in the order that they are written - I think unittest orders things by alphabetising the test classes, methods and subtest descriptions in some way)

It's not a big deal, and not something to hold a PR up over, but it's a nice to have, which means I can look at a single subtest and see what's relevant to it in a single look.

Collaborator Author

@sudha-murthy sudha-murthy Oct 29, 2024

@owenlittlejohns on the earlier comment on the get_fill_value function -
If I don't do the (float) typecast when I get the fill_value, I get the following error, because the get_attribute_value method in varinfo returns a str value:

TypeError: The DType <class 'numpy.FloatAbstractDType'> could not be promoted by <class 'numpy.dtype[str]'>.
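The error is reproducible outside HOSS: comparing a NumPy float array against a string raises the same promotion TypeError, while casting the string first works as expected (the `'-9999.0'` fill-value string here is illustrative sample data):

```python
import numpy as np

coordinate = np.array([-9999.0, 10.0, 20.0])
fill_value_attribute = '-9999.0'  # attribute values arrive as str from varinfo

# Without the cast, NumPy cannot promote a float dtype with a str dtype:
try:
    np.isclose(coordinate, fill_value_attribute)
except TypeError:
    print('raises TypeError without the cast')

# Casting to float first makes the comparison element-wise as intended:
print(np.isclose(coordinate, float(fill_value_attribute)))  # [ True False False]
```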

Collaborator Author

@sudha-murthy sudha-murthy Oct 29, 2024

@owenlittlejohns - updated the get_valid_indices function with your way. Had to add [0] since np.where returns a tuple.
Also corrected the subtests and moved the test data near where they are used

73390fd

@D-Auty D-Auty left a comment

Looks good now. Thanks for the updates.

@sudha-murthy sudha-murthy merged commit 3fc06f0 into DAS-2208-SMAP-L3 Nov 1, 2024
3 checks passed
@sudha-murthy sudha-murthy deleted the DAS-2232-1 branch November 1, 2024 13:45
Contributor

@autydp autydp left a comment

Looks good now. Sorry for the delay. Please note there is some confusion in my user account for GitHub - I fumbled setting things up properly and now have 2 (maybe 3?) accounts. Please request reviews and make references to the d-auty account.

@sudha-murthy
Collaborator Author

Thanks David
