-
Notifications
You must be signed in to change notification settings - Fork 18
Initial item/file domain detectors #565
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Initial item/file domain detectors #565
Conversation
python/lib/core/dmod/core/dataset.py
Outdated
class AbstractDomainDetector(ABC): | ||
""" Abstraction for something that will automatically detect a :class:`DataDomain` for some data. """ | ||
|
||
def __init__(self, *args, **kwargs): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I recommend that we drop this in favor of subtypes defining their own initializer that ideally is the only initializer (no need to call super().__init__()
). In doing so, subtypes can more freely specify their requirements and loosen the coupling between the super and subclass.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm ... I'm not really making use of it, but I had wanted a lazily initialized property for accessing the domain value, and for all the detector types to have this. My concern was if a detector needed to be used repeatedly (for the same data) that it'd be a performance hit to have to re-detect things over and over. Though at this point I'm not sure though how often, if ever, that scenario is actually going to come up.
Is that - complications from introducing a common, inherited property implementation - specifically what you are worried about here, or is there something more?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but I had wanted a lazily initialized property for accessing the domain value, and for all the detector types to have this. My concern was if a detector needed to be used repeatedly (for the same data) that it'd be a performance hit to have to re-detect things over and over.
Yeah, I like that functionality and in some if not all cases that is desirable. My thinking is that subclasses can handle caching / lazy loading in their detect
implementation. It likely will result in a minor amount of code duplication or at least code that looks repetitive, however I think considering the tradeoffs of having to know the relationship between the parent class and the subclass or a few lines of repetitive code I would rather have as much of the behavior in one place. Another, at least to me, interesting philosophical question. What are your thoughts?
python/lib/core/dmod/core/dataset.py
Outdated
pass | ||
|
||
@property | ||
def data_domain(self) -> DataDomain: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I get that this is a nicety, but imo there is nothing stopping a subclass from implementing this behavior in their implementation of detect
. This also places the burden on subtypes to update _data_domain
when detect
is called. I think the design can be simplified by dropping this property and the _data_domain
class level instance variable. This could result in slightly more code, but the behavior of a subtype will be closer to its declaration.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I get that this is a nicety, but imo there is nothing stopping a subclass from implementing this behavior in their implementation of detect.
True. I think my motivation is that there is nothing making them either, nor making all of them do it in a consistent manner (e.g., if you want to re-detect versus return what was detected already, can you do that, and is it done the same way?). That seemed most easily solvable by doing some things in the superclass.
This also places the burden on subtypes to update
_data_domain
whendetect
is called.
Actually the reverse is true. Calling detect
is intended not to impact _data_domain
. This is why data_domain
and reset
are implemented: to control (and return) the _data_domain
value (though, as I mentioned above, I think I'm not actually using them at the moment). But detect
should run detection and return a new object each time.
I think the design can be simplified by dropping this property and the _data_domain class level instance variable. This could result in slightly more code, but the behavior of a subtype will be closer to its declaration.
Heh, we're in another thing where my style shows itself to not be especially Pythonic.
I'd intended to leverage the superclass to abstract a concern from the subclass implementations (though if we stick with that, I do need to at least go back and use the property in places). We could skip that, and just have the independent detect
definitions. Subclasses would be more free to handle themselves, but I think we'd get slightly less predictable and consistent implementations. I'm a little worried that, given how we're using these in places like the universal detector, that inconsistency could eventually become a problem, even if it isn't right now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Heh, we're in another thing where my style shows itself to not be especially Pythonic.
Haha, I don't think your style is not Pythonic. I just tend to have an aversion to inheritance outside of using it strictly for enforcing an interface. Especially, when you start sharing private properties between inheritance levels. So, for sure a style thing!
I'd intended to leverage the superclass to abstract a concern from the subclass implementations (though if we stick with that, I do need to at least go back and use the property in places). We could skip that, and just have the independent
detect
definitions. Subclasses would be more free to handle themselves, but I think we'd get slightly less predictable and consistent implementations. I'm a little worried that, given how we're using these in places like the universal detector, that inconsistency could eventually become a problem, even if it isn't right now.
First off, thanks for articulating your design so throughly! I agree that abstracting the "caching" of the detected DataDomain
in a consistent way is valuable. I think we can achieve this behavior, in a similar fashion, without requiring the functionality be provided by the base class in a consistent way by using composition instead. This has the benefit of decoupling the "caching" behavior from the detectors and provides a separate pathway to add complexity to "caching" mechanisms if needed in the future without having to muck around with how each AbstractDomainDetector
works. What that could look like in practice is:
class CachedDomainDetector:
def __init__(self, detector: AbstractDomainDetector):
self._detector = detector
self._data_domain = None
@property
def data_domain(self) -> DataDomain:
if self._data_domain is None:
self._data_domain = self._detector.detect()
return self._data_domain
def reset(self):
self._data_domain = None
Alternatively, it could be a subclass of AbstractDomainDetector
. So, something like:
class CachedDomainDetector(AbstractDomainDetector):
def __init__(self, detector: AbstractDomainDetector):
self._detector = detector
self._data_domain = None
def detect(self, **kwargs) -> DataDomain:
if self._data_domain is None:
self._data_domain = self._detector.detect()
return self._data_domain
def reset(self):
self._data_domain = None
In either case, we are able to move the base class prescribed behavior elsewhere and decouple the behavioral concepts. The trade off is now you have to know about whatever that caching wrapper type is, but I would argue that once you find it, its far easier to understand the implications of that abstraction rather than having to walk up the inheritance tree and see how each parent type handles the, in this case, _data_domain
property.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I went ahead and just took it out. I'm still not certain there wouldn't be complications moving to a composition like this that aren't worth the trouble for something as straightforward as a lazy property, but we didn't actually need it. And we can always add the composition in the future if we later decide to.
858e8b8
to
90b6450
Compare
5624eb8
to
734a6c7
Compare
@aaraney, I think all the changes we've discussed have been made. I'm a little puzzled by the test failures. I'm not seeing them locally, and I created a fresh 3.10 environment to test in just to be safe. I'm curious if you see these on your side. FWIW, I did clear the cache entries related to this PR, and it didn't seem to help. Update |
7d162f5
to
d38a0db
Compare
@aaraney, @christophertubbs, is there anything remaining for this PR that needs to be changed before it can be approved? |
Adding AbstractDomainDetector and ItemDomainDetector abstract classes for data domain detection.
Adding UniversalItemDomainDetector subclass of ItemDataDomainDetector that can leverage the known, registered subclasses of ItemDataDomainDetector to perform domain detection for multiple formats.
Adding AorcCsvFileDomainDetector, GeoPackageHydrofabricDomainDetector, and RealizationConfigDomainDetector.
Replace usage of RepeatableReader with ReadSeeker.
Replace use of RepeatableReader with ReadSeeker and change call to removed reset() method to analogous seek() call.
And other things related, like init (needed for backing attribute) and reset method.
Removing registry components from ItemDataDomainDetector and placing within dedicated registry class; also removing __init_subclass__ registration process in favor of manual registration with a singleton registry.
Update existing concrete implementations of ItemDataDomainDetector for redesign of registration setup that uses ItemDataDomainDetectorRegistry.
d38a0db
to
105fd3e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we are getting really close to being at a point to merge this. I left a few small comments that will likely turn into TODOs
. Let's briefly continue the discussion, I don't want to hold this up much longer.
General type of detector that works with all supported formats by trying all registered, format-specific subtypes. | ||
""" | ||
|
||
def detect(self, **kwargs) -> DataDomain: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we improve the type hint experience here a little? Perhaps something like:
def detect(self, **kwargs) -> DataDomain: | |
def detect( | |
self, | |
*, | |
excluded_formats: Optional[Union[DataFormat, Set[DataFormat]]] = None, | |
suggested_formats: Optional[Union[DataFormat, List[DataFormat]]] = None, | |
sort_key: Optional[Callable[[Type[ItemDataDomainDetector]], T]] = None, | |
**kwargs | |
) -> DataDomain: ... | |
|
||
|
||
# Register the universal tracker type | ||
ItemDataDomainDetectorRegistry.get_instance().register(UniversalItemDomainDetector) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My above point about setting ._instance
in __init__
is moot, if this stays present. Im on the fence. Ill leave it to your discretion if you think the change is worth it.
""" Get :class:`DiscreteRestriction` defining applicable catchments (i.e., catchment) for the domain. """ | ||
return DiscreteRestriction(variable=StandardDatasetIndex.CATCHMENT_ID, values=[self._get_catchment_id()]) | ||
|
||
def detect(self, **kwargs) -> DataDomain: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this does not make use of any optional keyword arguments, I would recommend that we change the signature to the following to indicate that.
def detect(self, **kwargs) -> DataDomain: | |
def detect(self, **_) -> DataDomain: |
cat_restriction = self._get_cat_restriction() | ||
data = StringIO(self._item.decode(self._decode_format)) if isinstance(self._item, bytes) else self._item | ||
dt_index = self.get_data_format().indices_to_fields()[StandardDatasetIndex.TIME] | ||
# TODO: (later) perhaps do a little more about the header checking |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We will also want to handle / ignore any csv comment headers. So, for example lines that start with #
.
# TODO: (later) perhaps do a little more about the header checking | ||
df = pandas_read_csv(data, parse_dates=[0]) | ||
self._num_time_steps = df.shape[0] | ||
date_range = TimeRange(begin=df.iloc[0][dt_index].to_pydatetime(), end=df.iloc[-1][dt_index].to_pydatetime()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pandas.DataFrame.iloc
is going to return a pandas.Series
and that is not exactly what we want. I suggest we use pandas.DataFrame.at
instead.
date_range = TimeRange(begin=df.iloc[0][dt_index].to_pydatetime(), end=df.iloc[-1][dt_index].to_pydatetime()) | |
date_range = TimeRange(begin=df.at[df.index[0], dt_index].to_pydatetime(), end=df.at[df.index[-1], dt_index].to_pydatetime()) |
If it was not possible to properly detect the domain. | ||
""" | ||
# TODO: (later) probably isn't necessary to treat separately, but don't have a good way to test yet | ||
if isinstance(self._item, ReadSeeker): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Like your TODO says, I don't think we need this.
https://geopandas.org/en/stable/docs/reference/api/geopandas.read_file.html
filename
str, path object or file-like object
Either the absolute or relative path to the file or URL to be opened, or any object with a read() method (such as an open file or StringIO)
conus = True | ||
else: | ||
conus = False | ||
pattern = re.compile('(vpu)([-_]?)(\d+)') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like some VPUs have a directional suffix:
nextgen_01.gpkg
nextgen_02.gpkg
nextgen_03N.gpkg
nextgen_03S.gpkg
nextgen_03W.gpkg
nextgen_04.gpkg
nextgen_05.gpkg
nextgen_06.gpkg
nextgen_07.gpkg
nextgen_08.gpkg
nextgen_09.gpkg
nextgen_10L.gpkg
nextgen_10U.gpkg
nextgen_11.gpkg
nextgen_12.gpkg
nextgen_13.gpkg
nextgen_14.gpkg
nextgen_15.gpkg
nextgen_16.gpkg
nextgen_17.gpkg
nextgen_18.gpkg
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm seeing that a lot of the comments I've made here have already been addressed in #566. Let's move this along and merge this.
Adding initial set of implementations of classes that can scan a data item/file and detect the applicable DataDomain for its data.
Note that this branch builds off that of #564. As such, this PR should remain in draft status until #564 is merged.