Initial item/file domain detectors #565

robertbartel · 2024-04-02T14:33:52Z

Adding initial set of implementations of classes that can scan a data item/file and detect the applicable DataDomain for its data.

Note that this branch builds off that of #564. As such, this PR should remain in draft status until #564 is merged.

aaraney · 2024-04-02T20:23:13Z

python/lib/core/dmod/core/dataset.py

+class AbstractDomainDetector(ABC):
+    """ Abstraction for something that will automatically detect a :class:`DataDomain` for some data. """
+
+    def __init__(self, *args, **kwargs):


I recommend that we drop this in favor of subtypes defining their own initializer that ideally is the only initializer (no need to call super().__init__()). In doing so, subtypes can more freely specify their requirements and loosen the coupling between the super and subclass.

Hmm ... I'm not really making use of it, but I had wanted a lazily initialized property for accessing the domain value, and for all the detector types to have this. My concern was if a detector needed to be used repeatedly (for the same data) that it'd be a performance hit to have to re-detect things over and over. Though at this point I'm not sure though how often, if ever, that scenario is actually going to come up.

Is that - complications from introducing a common, inherited property implementation - specifically what you are worried about here, or is there something more?

but I had wanted a lazily initialized property for accessing the domain value, and for all the detector types to have this. My concern was if a detector needed to be used repeatedly (for the same data) that it'd be a performance hit to have to re-detect things over and over.

Yeah, I like that functionality and in some if not all cases that is desirable. My thinking is that subclasses can handle caching / lazy loading in their detect implementation. It likely will result in a minor amount of code duplication or at least code that looks repetitive, however I think considering the tradeoffs of having to know the relationship between the parent class and the subclass or a few lines of repetitive code I would rather have as much of the behavior in one place. Another, at least to me, interesting philosophical question. What are your thoughts?

aaraney · 2024-04-02T20:30:59Z

python/lib/core/dmod/core/dataset.py

+        pass
+
+    @property
+    def data_domain(self) -> DataDomain:


I get that this is a nicety, but imo there is nothing stopping a subclass from implementing this behavior in their implementation of detect. This also places the burden on subtypes to update _data_domain when detect is called. I think the design can be simplified by dropping this property and the _data_domain class level instance variable. This could result in slightly more code, but the behavior of a subtype will be closer to its declaration.

I get that this is a nicety, but imo there is nothing stopping a subclass from implementing this behavior in their implementation of detect.

True. I think my motivation is that there is nothing making them either, nor making all of them do it in a consistent manner (e.g., if you want to re-detect versus return what was detected already, can you do that, and is it done the same way?). That seemed most easily solvable by doing some things in the superclass.

This also places the burden on subtypes to update _data_domain when detect is called.

Actually the reverse is true. Calling detect is intended not to impact _data_domain. This is why data_domain and reset are implemented: to control (and return) the _data_domain value (though, as I mentioned above, I think I'm not actually using them at the moment). But detect should run detection and return a new object each time.

I think the design can be simplified by dropping this property and the _data_domain class level instance variable. This could result in slightly more code, but the behavior of a subtype will be closer to its declaration.

Heh, we're in another thing where my style shows itself to not be especially Pythonic.

I'd intended to leverage the superclass to abstract a concern from the subclass implementations (though if we stick with that, I do need to at least go back and use the property in places). We could skip that, and just have the independent detect definitions. Subclasses would be more free to handle themselves, but I think we'd get slightly less predictable and consistent implementations. I'm a little worried that, given how we're using these in places like the universal detector, that inconsistency could eventually become a problem, even if it isn't right now.

Heh, we're in another thing where my style shows itself to not be especially Pythonic.

Haha, I don't think your style is not Pythonic. I just tend to have an aversion to inheritance outside of using it strictly for enforcing an interface. Especially, when you start sharing private properties between inheritance levels. So, for sure a style thing!

I'd intended to leverage the superclass to abstract a concern from the subclass implementations (though if we stick with that, I do need to at least go back and use the property in places). We could skip that, and just have the independent detect definitions. Subclasses would be more free to handle themselves, but I think we'd get slightly less predictable and consistent implementations. I'm a little worried that, given how we're using these in places like the universal detector, that inconsistency could eventually become a problem, even if it isn't right now.

First off, thanks for articulating your design so throughly! I agree that abstracting the "caching" of the detected DataDomain in a consistent way is valuable. I think we can achieve this behavior, in a similar fashion, without requiring the functionality be provided by the base class in a consistent way by using composition instead. This has the benefit of decoupling the "caching" behavior from the detectors and provides a separate pathway to add complexity to "caching" mechanisms if needed in the future without having to muck around with how each AbstractDomainDetector works. What that could look like in practice is:

class CachedDomainDetector: def __init__(self, detector: AbstractDomainDetector): self._detector = detector self._data_domain = None @property def data_domain(self) -> DataDomain: if self._data_domain is None: self._data_domain = self._detector.detect() return self._data_domain def reset(self): self._data_domain = None

Alternatively, it could be a subclass of AbstractDomainDetector. So, something like:

class CachedDomainDetector(AbstractDomainDetector): def __init__(self, detector: AbstractDomainDetector): self._detector = detector self._data_domain = None def detect(self, **kwargs) -> DataDomain: if self._data_domain is None: self._data_domain = self._detector.detect() return self._data_domain def reset(self): self._data_domain = None

In either case, we are able to move the base class prescribed behavior elsewhere and decouple the behavioral concepts. The trade off is now you have to know about whatever that caching wrapper type is, but I would argue that once you find it, its far easier to understand the implications of that abstraction rather than having to walk up the inheritance tree and see how each parent type handles the, in this case, _data_domain property.

I went ahead and just took it out. I'm still not certain there wouldn't be complications moving to a composition like this that aren't worth the trouble for something as straightforward as a lazy property, but we didn't actually need it. And we can always add the composition in the future if we later decide to.

python/lib/core/dmod/core/dataset.py

robertbartel · 2024-04-09T19:50:35Z

@aaraney, I think all the changes we've discussed have been made. I'm a little puzzled by the test failures. I'm not seeing them locally, and I created a fresh 3.10 environment to test in just to be safe. I'm curious if you see these on your side. FWIW, I did clear the cache entries related to this PR, and it didn't seem to help.

Update
Tests are passing now. Best guess: the built packages were cached, and I hadn't updated the versions and dependency pinnings as needed, so old dependencies were being used.

robertbartel · 2024-04-26T17:45:26Z

@aaraney, @christophertubbs, is there anything remaining for this PR that needs to be changed before it can be approved?

Adding AbstractDomainDetector and ItemDomainDetector abstract classes for data domain detection.

Adding UniversalItemDomainDetector subclass of ItemDataDomainDetector that can leverage the known, registered subclasses of ItemDataDomainDetector to perform domain detection for multiple formats.

Adding AorcCsvFileDomainDetector, GeoPackageHydrofabricDomainDetector, and RealizationConfigDomainDetector.

Replace usage of RepeatableReader with ReadSeeker.

Replace use of RepeatableReader with ReadSeeker and change call to removed reset() method to analogous seek() call.

And other things related, like init (needed for backing attribute) and reset method.

Removing registry components from ItemDataDomainDetector and placing within dedicated registry class; also removing __init_subclass__ registration process in favor of manual registration with a singleton registry.

Update existing concrete implementations of ItemDataDomainDetector for redesign of registration setup that uses ItemDataDomainDetectorRegistry.

aaraney

I think we are getting really close to being at a point to merge this. I left a few small comments that will likely turn into TODOs. Let's briefly continue the discussion, I don't want to hold this up much longer.

python/lib/core/dmod/core/dataset.py

aaraney · 2024-05-06T18:32:53Z

python/lib/core/dmod/core/dataset.py

+    General type of detector that works with all supported formats by trying all registered, format-specific subtypes.
+    """
+
+    def detect(self, **kwargs) -> DataDomain:


Can we improve the type hint experience here a little? Perhaps something like:

Suggested change

def detect(self, **kwargs) -> DataDomain:

def detect(

self,

*,

excluded_formats: Optional[Union[DataFormat, Set[DataFormat]]] = None,

suggested_formats: Optional[Union[DataFormat, List[DataFormat]]] = None,

sort_key: Optional[Callable[[Type[ItemDataDomainDetector]], T]] = None,

**kwargs

) -> DataDomain: ...

aaraney · 2024-05-06T18:34:58Z

python/lib/core/dmod/core/dataset.py

+
+
+# Register the universal tracker type
+ItemDataDomainDetectorRegistry.get_instance().register(UniversalItemDomainDetector)


My above point about setting ._instance in __init__ is moot, if this stays present. Im on the fence. Ill leave it to your discretion if you think the change is worth it.

aaraney · 2024-05-06T18:46:20Z

python/lib/modeldata/dmod/modeldata/data/item_domain_detector.py

+        """ Get :class:`DiscreteRestriction` defining applicable catchments (i.e., catchment) for the domain. """
+        return DiscreteRestriction(variable=StandardDatasetIndex.CATCHMENT_ID, values=[self._get_catchment_id()])
+
+    def detect(self, **kwargs) -> DataDomain:


Since this does not make use of any optional keyword arguments, I would recommend that we change the signature to the following to indicate that.

Suggested change

def detect(self, **kwargs) -> DataDomain:

def detect(self, **_) -> DataDomain:

aaraney · 2024-05-06T18:48:59Z

python/lib/modeldata/dmod/modeldata/data/item_domain_detector.py

+        cat_restriction = self._get_cat_restriction()
+        data = StringIO(self._item.decode(self._decode_format)) if isinstance(self._item, bytes) else self._item
+        dt_index = self.get_data_format().indices_to_fields()[StandardDatasetIndex.TIME]
+        # TODO: (later) perhaps do a little more about the header checking


We will also want to handle / ignore any csv comment headers. So, for example lines that start with #.

aaraney · 2024-05-06T18:56:37Z

python/lib/modeldata/dmod/modeldata/data/item_domain_detector.py

+        # TODO: (later) perhaps do a little more about the header checking
+        df = pandas_read_csv(data, parse_dates=[0])
+        self._num_time_steps = df.shape[0]
+        date_range = TimeRange(begin=df.iloc[0][dt_index].to_pydatetime(), end=df.iloc[-1][dt_index].to_pydatetime())


pandas.DataFrame.iloc is going to return a pandas.Series and that is not exactly what we want. I suggest we use pandas.DataFrame.at instead.

Suggested change

date_range = TimeRange(begin=df.iloc[0][dt_index].to_pydatetime(), end=df.iloc[-1][dt_index].to_pydatetime())

date_range = TimeRange(begin=df.at[df.index[0], dt_index].to_pydatetime(), end=df.at[df.index[-1], dt_index].to_pydatetime())

aaraney · 2024-05-06T19:01:03Z

python/lib/modeldata/dmod/modeldata/data/item_domain_detector.py

+            If it was not possible to properly detect the domain.
+        """
+        # TODO: (later) probably isn't necessary to treat separately, but don't have a good way to test yet
+        if isinstance(self._item, ReadSeeker):


Like your TODO says, I don't think we need this.

https://geopandas.org/en/stable/docs/reference/api/geopandas.read_file.html

filename
str, path object or file-like object
Either the absolute or relative path to the file or URL to be opened, or any object with a read() method (such as an open file or StringIO)

aaraney · 2024-05-06T19:24:21Z

python/lib/modeldata/dmod/modeldata/data/item_domain_detector.py

+                conus = True
+            else:
+                conus = False
+                pattern = re.compile('(vpu)([-_]?)(\d+)')


It looks like some VPUs have a directional suffix:

nextgen_01.gpkg
nextgen_02.gpkg
nextgen_03N.gpkg
nextgen_03S.gpkg
nextgen_03W.gpkg
nextgen_04.gpkg
nextgen_05.gpkg
nextgen_06.gpkg
nextgen_07.gpkg
nextgen_08.gpkg
nextgen_09.gpkg
nextgen_10L.gpkg
nextgen_10U.gpkg
nextgen_11.gpkg
nextgen_12.gpkg
nextgen_13.gpkg
nextgen_14.gpkg
nextgen_15.gpkg
nextgen_16.gpkg
nextgen_17.gpkg
nextgen_18.gpkg

aaraney

I'm seeing that a lot of the comments I've made here have already been addressed in #566. Let's move this along and merge this.

robertbartel added enhancement New feature or request maas MaaS Workstream labels Apr 2, 2024

robertbartel requested review from christophertubbs and aaraney April 2, 2024 14:33

robertbartel mentioned this pull request Apr 2, 2024

Complex domain detection for collections of data items with CLI support #566

Merged

aaraney requested changes Apr 2, 2024

View reviewed changes

robertbartel force-pushed the f/dataset_domain_autodetect/1/item_detectors branch 2 times, most recently from 858e8b8 to 90b6450 Compare April 4, 2024 12:33

aaraney mentioned this pull request Apr 4, 2024

Prerequisites for supporting domain detection classes #564

Merged

robertbartel force-pushed the f/dataset_domain_autodetect/1/item_detectors branch 2 times, most recently from 5624eb8 to 734a6c7 Compare April 4, 2024 18:42

robertbartel marked this pull request as ready for review April 4, 2024 18:43

robertbartel requested a review from aaraney April 9, 2024 19:37

robertbartel force-pushed the f/dataset_domain_autodetect/1/item_detectors branch from 7d162f5 to d38a0db Compare April 15, 2024 13:19

robertbartel added 14 commits May 6, 2024 10:39

Add abstract domain detectors.

fb8f666

Adding AbstractDomainDetector and ItemDomainDetector abstract classes for data domain detection.

Add UniversalItemDomainDetector.

b50a2e5

Adding UniversalItemDomainDetector subclass of ItemDataDomainDetector that can leverage the known, registered subclasses of ItemDataDomainDetector to perform domain detection for multiple formats.

Add first format specific detector classes.

fbf0c9d

Adding AorcCsvFileDomainDetector, GeoPackageHydrofabricDomainDetector, and RealizationConfigDomainDetector.

Add tests for AORC CSV file domain detector.

6c556de

Add tests for geopackage file domain detector.

f84e849

Add tests for universal item domain detector.

7c04d5d

Add common function needed for several tests.

a4d4d8e

Adding test CSV forcing files.

da8a377

Switch to ReadSeeker in dataset.py.

d4a17c6

Replace usage of RepeatableReader with ReadSeeker.

Switch to ReadSeeker in item_domain_detector.py.

4994384

Replace use of RepeatableReader with ReadSeeker and change call to removed reset() method to analogous seek() call.

Remove lazy property from AbstractDomainDetector.

7eb6fad

And other things related, like init (needed for backing attribute) and reset method.

Optimizing re use in aorc csv domain detector.

576a687

Redesign to use ItemDataDomainDetectorRegistry.

5b1510d

Removing registry components from ItemDataDomainDetector and placing within dedicated registry class; also removing __init_subclass__ registration process in favor of manual registration with a singleton registry.

Update detector implementations for new registry.

5162628

Update existing concrete implementations of ItemDataDomainDetector for redesign of registration setup that uses ItemDataDomainDetectorRegistry.

robertbartel added 8 commits May 6, 2024 10:39

Fix AorcCsvFileDomainDetector tests for redesign.

968b045

Fix geopackage domain detector tests for redesign.

5e99138

Fix universal domain detector tests for redesign.

827086e

Bump core package version to 0.15.1.

e5545b1

Update modeldata dependency on core to ver 0.15.1.

bca8586

Bump modeldata package version to 0.11.0.

e4a1de5

Update client dependency on core to ver 0.15.1.

6c7b408

Bump client package version to 0.6.0.

105fd3e

robertbartel force-pushed the f/dataset_domain_autodetect/1/item_detectors branch from d38a0db to 105fd3e Compare May 6, 2024 14:39

aaraney reviewed May 6, 2024

View reviewed changes

Scrap private _is_item_file for domain detector.

389a127

aaraney approved these changes May 6, 2024

View reviewed changes

robertbartel merged commit 1e90f6e into NOAA-OWP:master May 7, 2024
8 checks passed

robertbartel deleted the f/dataset_domain_autodetect/1/item_detectors branch May 7, 2024 14:05

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Initial item/file domain detectors #565

Initial item/file domain detectors #565

robertbartel commented Apr 2, 2024

aaraney Apr 2, 2024

robertbartel Apr 4, 2024

aaraney Apr 4, 2024

aaraney Apr 2, 2024

robertbartel Apr 4, 2024

aaraney Apr 4, 2024

robertbartel Apr 4, 2024

robertbartel commented Apr 9, 2024 •

edited

Loading

robertbartel commented Apr 26, 2024

aaraney left a comment

aaraney May 6, 2024

aaraney May 6, 2024

aaraney May 6, 2024

aaraney May 6, 2024

aaraney May 6, 2024

aaraney May 6, 2024

aaraney May 6, 2024

aaraney left a comment

-    def detect(self, **kwargs) -> DataDomain:
+    def detect(
+        self,
+        *,
+        excluded_formats: Optional[Union[DataFormat, Set[DataFormat]]] = None,
+        suggested_formats: Optional[Union[DataFormat, List[DataFormat]]] = None,
+        sort_key: Optional[Callable[[Type[ItemDataDomainDetector]], T]] = None,
+        **kwargs
+    ) -> DataDomain: ...



		# Register the universal tracker type
		ItemDataDomainDetectorRegistry.get_instance().register(UniversalItemDomainDetector)

	def detect(self, **kwargs) -> DataDomain:
	def detect(self, **_) -> DataDomain:

	date_range = TimeRange(begin=df.iloc[0][dt_index].to_pydatetime(), end=df.iloc[-1][dt_index].to_pydatetime())
	date_range = TimeRange(begin=df.at[df.index[0], dt_index].to_pydatetime(), end=df.at[df.index[-1], dt_index].to_pydatetime())

Initial item/file domain detectors #565

Initial item/file domain detectors #565

Conversation

robertbartel commented Apr 2, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

robertbartel commented Apr 9, 2024 • edited Loading

robertbartel commented Apr 26, 2024

aaraney left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

aaraney left a comment

Choose a reason for hiding this comment

robertbartel commented Apr 9, 2024 •

edited

Loading