Skip to content

Initial item/file domain detectors #565

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

robertbartel
Copy link
Contributor

Adding initial set of implementations of classes that can scan a data item/file and detect the applicable DataDomain for its data.

Note that this branch builds off that of #564. As such, this PR should remain in draft status until #564 is merged.

class AbstractDomainDetector(ABC):
""" Abstraction for something that will automatically detect a :class:`DataDomain` for some data. """

def __init__(self, *args, **kwargs):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I recommend that we drop this in favor of subtypes defining their own initializer that ideally is the only initializer (no need to call super().__init__()). In doing so, subtypes can more freely specify their requirements and loosen the coupling between the super and subclass.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm ... I'm not really making use of it, but I had wanted a lazily initialized property for accessing the domain value, and for all the detector types to have this. My concern was if a detector needed to be used repeatedly (for the same data) that it'd be a performance hit to have to re-detect things over and over. Though at this point I'm not sure though how often, if ever, that scenario is actually going to come up.

Is that - complications from introducing a common, inherited property implementation - specifically what you are worried about here, or is there something more?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but I had wanted a lazily initialized property for accessing the domain value, and for all the detector types to have this. My concern was if a detector needed to be used repeatedly (for the same data) that it'd be a performance hit to have to re-detect things over and over.

Yeah, I like that functionality and in some if not all cases that is desirable. My thinking is that subclasses can handle caching / lazy loading in their detect implementation. It likely will result in a minor amount of code duplication or at least code that looks repetitive, however I think considering the tradeoffs of having to know the relationship between the parent class and the subclass or a few lines of repetitive code I would rather have as much of the behavior in one place. Another, at least to me, interesting philosophical question. What are your thoughts?

pass

@property
def data_domain(self) -> DataDomain:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I get that this is a nicety, but imo there is nothing stopping a subclass from implementing this behavior in their implementation of detect. This also places the burden on subtypes to update _data_domain when detect is called. I think the design can be simplified by dropping this property and the _data_domain class level instance variable. This could result in slightly more code, but the behavior of a subtype will be closer to its declaration.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I get that this is a nicety, but imo there is nothing stopping a subclass from implementing this behavior in their implementation of detect.

True. I think my motivation is that there is nothing making them either, nor making all of them do it in a consistent manner (e.g., if you want to re-detect versus return what was detected already, can you do that, and is it done the same way?). That seemed most easily solvable by doing some things in the superclass.

This also places the burden on subtypes to update _data_domain when detect is called.

Actually the reverse is true. Calling detect is intended not to impact _data_domain. This is why data_domain and reset are implemented: to control (and return) the _data_domain value (though, as I mentioned above, I think I'm not actually using them at the moment). But detect should run detection and return a new object each time.

I think the design can be simplified by dropping this property and the _data_domain class level instance variable. This could result in slightly more code, but the behavior of a subtype will be closer to its declaration.

Heh, we're in another thing where my style shows itself to not be especially Pythonic.

I'd intended to leverage the superclass to abstract a concern from the subclass implementations (though if we stick with that, I do need to at least go back and use the property in places). We could skip that, and just have the independent detect definitions. Subclasses would be more free to handle themselves, but I think we'd get slightly less predictable and consistent implementations. I'm a little worried that, given how we're using these in places like the universal detector, that inconsistency could eventually become a problem, even if it isn't right now.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Heh, we're in another thing where my style shows itself to not be especially Pythonic.

Haha, I don't think your style is not Pythonic. I just tend to have an aversion to inheritance outside of using it strictly for enforcing an interface. Especially, when you start sharing private properties between inheritance levels. So, for sure a style thing!

I'd intended to leverage the superclass to abstract a concern from the subclass implementations (though if we stick with that, I do need to at least go back and use the property in places). We could skip that, and just have the independent detect definitions. Subclasses would be more free to handle themselves, but I think we'd get slightly less predictable and consistent implementations. I'm a little worried that, given how we're using these in places like the universal detector, that inconsistency could eventually become a problem, even if it isn't right now.

First off, thanks for articulating your design so throughly! I agree that abstracting the "caching" of the detected DataDomain in a consistent way is valuable. I think we can achieve this behavior, in a similar fashion, without requiring the functionality be provided by the base class in a consistent way by using composition instead. This has the benefit of decoupling the "caching" behavior from the detectors and provides a separate pathway to add complexity to "caching" mechanisms if needed in the future without having to muck around with how each AbstractDomainDetector works. What that could look like in practice is:

class CachedDomainDetector:
  def __init__(self, detector: AbstractDomainDetector):
    self._detector = detector
    self._data_domain = None

  @property
  def data_domain(self) -> DataDomain:
     if self._data_domain is None:
       self._data_domain = self._detector.detect()
     return self._data_domain

  def reset(self):
    self._data_domain = None

Alternatively, it could be a subclass of AbstractDomainDetector. So, something like:

class CachedDomainDetector(AbstractDomainDetector):
  def __init__(self, detector: AbstractDomainDetector):
    self._detector = detector
    self._data_domain = None

  def detect(self, **kwargs) -> DataDomain:
     if self._data_domain is None:
       self._data_domain = self._detector.detect()
     return self._data_domain

  def reset(self):
    self._data_domain = None

In either case, we are able to move the base class prescribed behavior elsewhere and decouple the behavioral concepts. The trade off is now you have to know about whatever that caching wrapper type is, but I would argue that once you find it, its far easier to understand the implications of that abstraction rather than having to walk up the inheritance tree and see how each parent type handles the, in this case, _data_domain property.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went ahead and just took it out. I'm still not certain there wouldn't be complications moving to a composition like this that aren't worth the trouble for something as straightforward as a lazy property, but we didn't actually need it. And we can always add the composition in the future if we later decide to.

@robertbartel robertbartel force-pushed the f/dataset_domain_autodetect/1/item_detectors branch 2 times, most recently from 858e8b8 to 90b6450 Compare April 4, 2024 12:33
@robertbartel robertbartel force-pushed the f/dataset_domain_autodetect/1/item_detectors branch 2 times, most recently from 5624eb8 to 734a6c7 Compare April 4, 2024 18:42
@robertbartel robertbartel marked this pull request as ready for review April 4, 2024 18:43
@robertbartel robertbartel requested a review from aaraney April 9, 2024 19:37
@robertbartel
Copy link
Contributor Author

robertbartel commented Apr 9, 2024

@aaraney, I think all the changes we've discussed have been made. I'm a little puzzled by the test failures. I'm not seeing them locally, and I created a fresh 3.10 environment to test in just to be safe. I'm curious if you see these on your side. FWIW, I did clear the cache entries related to this PR, and it didn't seem to help.

Update
Tests are passing now. Best guess: the built packages were cached, and I hadn't updated the versions and dependency pinnings as needed, so old dependencies were being used.

@robertbartel robertbartel force-pushed the f/dataset_domain_autodetect/1/item_detectors branch from 7d162f5 to d38a0db Compare April 15, 2024 13:19
@robertbartel
Copy link
Contributor Author

@aaraney, @christophertubbs, is there anything remaining for this PR that needs to be changed before it can be approved?

Adding AbstractDomainDetector and ItemDomainDetector abstract classes
for data domain detection.
Adding UniversalItemDomainDetector subclass of ItemDataDomainDetector
that can leverage the known, registered subclasses of
ItemDataDomainDetector to perform domain detection for multiple formats.
Adding AorcCsvFileDomainDetector, GeoPackageHydrofabricDomainDetector,
and RealizationConfigDomainDetector.
Replace usage of RepeatableReader with ReadSeeker.
Replace use of RepeatableReader with ReadSeeker and change call to
removed reset() method to analogous seek() call.
And other things related, like init (needed for backing attribute) and
reset method.
Removing registry components from ItemDataDomainDetector and placing
within dedicated registry class; also removing __init_subclass__
registration process in favor of manual registration with a singleton
registry.
Update existing concrete implementations of ItemDataDomainDetector for
redesign of registration setup that uses ItemDataDomainDetectorRegistry.
@robertbartel robertbartel force-pushed the f/dataset_domain_autodetect/1/item_detectors branch from d38a0db to 105fd3e Compare May 6, 2024 14:39
Copy link
Member

@aaraney aaraney left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we are getting really close to being at a point to merge this. I left a few small comments that will likely turn into TODOs. Let's briefly continue the discussion, I don't want to hold this up much longer.

General type of detector that works with all supported formats by trying all registered, format-specific subtypes.
"""

def detect(self, **kwargs) -> DataDomain:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we improve the type hint experience here a little? Perhaps something like:

Suggested change
def detect(self, **kwargs) -> DataDomain:
def detect(
self,
*,
excluded_formats: Optional[Union[DataFormat, Set[DataFormat]]] = None,
suggested_formats: Optional[Union[DataFormat, List[DataFormat]]] = None,
sort_key: Optional[Callable[[Type[ItemDataDomainDetector]], T]] = None,
**kwargs
) -> DataDomain: ...



# Register the universal tracker type
ItemDataDomainDetectorRegistry.get_instance().register(UniversalItemDomainDetector)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My above point about setting ._instance in __init__ is moot, if this stays present. Im on the fence. Ill leave it to your discretion if you think the change is worth it.

""" Get :class:`DiscreteRestriction` defining applicable catchments (i.e., catchment) for the domain. """
return DiscreteRestriction(variable=StandardDatasetIndex.CATCHMENT_ID, values=[self._get_catchment_id()])

def detect(self, **kwargs) -> DataDomain:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this does not make use of any optional keyword arguments, I would recommend that we change the signature to the following to indicate that.

Suggested change
def detect(self, **kwargs) -> DataDomain:
def detect(self, **_) -> DataDomain:

cat_restriction = self._get_cat_restriction()
data = StringIO(self._item.decode(self._decode_format)) if isinstance(self._item, bytes) else self._item
dt_index = self.get_data_format().indices_to_fields()[StandardDatasetIndex.TIME]
# TODO: (later) perhaps do a little more about the header checking
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We will also want to handle / ignore any csv comment headers. So, for example lines that start with #.

# TODO: (later) perhaps do a little more about the header checking
df = pandas_read_csv(data, parse_dates=[0])
self._num_time_steps = df.shape[0]
date_range = TimeRange(begin=df.iloc[0][dt_index].to_pydatetime(), end=df.iloc[-1][dt_index].to_pydatetime())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pandas.DataFrame.iloc is going to return a pandas.Series and that is not exactly what we want. I suggest we use pandas.DataFrame.at instead.

Suggested change
date_range = TimeRange(begin=df.iloc[0][dt_index].to_pydatetime(), end=df.iloc[-1][dt_index].to_pydatetime())
date_range = TimeRange(begin=df.at[df.index[0], dt_index].to_pydatetime(), end=df.at[df.index[-1], dt_index].to_pydatetime())

If it was not possible to properly detect the domain.
"""
# TODO: (later) probably isn't necessary to treat separately, but don't have a good way to test yet
if isinstance(self._item, ReadSeeker):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Like your TODO says, I don't think we need this.

https://geopandas.org/en/stable/docs/reference/api/geopandas.read_file.html

filename
str, path object or file-like object
Either the absolute or relative path to the file or URL to be opened, or any object with a read() method (such as an open file or StringIO)

conus = True
else:
conus = False
pattern = re.compile('(vpu)([-_]?)(\d+)')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like some VPUs have a directional suffix:

nextgen_01.gpkg
nextgen_02.gpkg
nextgen_03N.gpkg
nextgen_03S.gpkg
nextgen_03W.gpkg
nextgen_04.gpkg
nextgen_05.gpkg
nextgen_06.gpkg
nextgen_07.gpkg
nextgen_08.gpkg
nextgen_09.gpkg
nextgen_10L.gpkg
nextgen_10U.gpkg
nextgen_11.gpkg
nextgen_12.gpkg
nextgen_13.gpkg
nextgen_14.gpkg
nextgen_15.gpkg
nextgen_16.gpkg
nextgen_17.gpkg
nextgen_18.gpkg

Copy link
Member

@aaraney aaraney left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm seeing that a lot of the comments I've made here have already been addressed in #566. Let's move this along and merge this.

@robertbartel robertbartel merged commit 1e90f6e into NOAA-OWP:master May 7, 2024
8 checks passed
@robertbartel robertbartel deleted the f/dataset_domain_autodetect/1/item_detectors branch May 7, 2024 14:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request maas MaaS Workstream
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants