Update read_watersurfaces for version 2024 & read_ws_hab #192

cecileherr · 2024-12-11T15:19:20Z

Update read_watersurfaces

A few preliminary checks are in preprocessing > miscellaneous > watersurfaces.Rmd:
watersurfaces.zip

Here are the most important remarks and needed corrections for the new 2024 version:

- There is a wrong WVLC code (“d”) in de Kleiputten van Heist (Palingpot): not corrected, since this is the raw source.
- New variable wfd_type_alternative: added, and made optional since it is not present in all versions.
- Dropped variable hyla_code: made optional, since it is not present in all versions.
- All empty cells are imported as "" (empty string) instead of NA: converted to NA.
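A minimal sketch of the last point, using a made-up three-row table as a stand-in for the real layer. The values are illustrative only, and this is not necessarily the exact code in read_watersurfaces():

```r
library(dplyr)

# Toy stand-in for the attribute table of the raw source (made-up values)
ws_raw <- tibble(
  polygon_id = c("A1", "A2", "A3"),
  wfd_type = c("CFe", "", "CFe"),
  wfd_type_alternative = c("", "CFe", "")
)

# Convert empty strings to NA in every character column at once
ws <- ws_raw %>%
  mutate(across(where(is.character), ~ na_if(.x, "")))

ws
```

Because the conversion is written with across() + where(), it does not need to know which character columns a given version contains.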

Nice to try in the future (maybe we could add this as an issue?):
What is the best (quickest/most readable) way to handle the fact that there are different structures/different fields/different errors to correct in the different versions?
Should we use if then structures or use optional clauses (such as matches, any_of, across + where, ...)?
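To make the two options concrete, here is a small sketch with two toy tables. The version labels "old" and "new", the objectid column and the helper names are assumptions for illustration only; they are not the package's actual code:

```r
library(dplyr)

# Two toy "versions" of the source with different optional fields (made-up values)
ws_old <- tibble(objectid = 1L, polygon_id = "A1", wfd_type = "CFe", hyla_code = 1L)
ws_new <- tibble(objectid = 1L, polygon_id = "A1", wfd_type = "CFe",
                 wfd_type_alternative = "CFe")

# Option 1: explicit branching on the version (hypothetical labels)
select_explicit <- function(x, version) {
  if (version == "old") {
    select(x, "polygon_id", "wfd_type", "hyla_code")
  } else {
    select(x, "polygon_id", "wfd_type", "wfd_type_alternative")
  }
}

# Option 2: optional clause; any_of() silently skips columns that are absent
optional_cols <- c("wfd_type_alternative", "hyla_code")
select_general <- function(x) {
  select(x, "polygon_id", "wfd_type", any_of(optional_cols))
}

names(select_general(ws_old))
names(select_general(ws_new))
```

Option 2 needs no change when a future version adds or drops one of the optional columns, at the cost of not spelling out which version has which field.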

Update read_watersurfaces_hab to version 6

Nothing special to report, v6 has the exact same structure as v5


florisvdh

Great work, thanks so much @cecileherr ! You analyzed the new data source version really well before upgrading the function, resulting in ever more robust code 🚀

I already added some minor comments in the recent commits of branch watersurf_hab_v6 of n2khab-preprocessing, but I presume you are waiting for the current n2khab branch to become stable before opening that PR. Still, the latter is best merged first, then this one, as the n2khab one also uses the resulting watersurfaces_hab. (At least the n2khab release, i.e. merging to main, should take place after the preprocessing repo is all done and the new v6 is available at Zenodo.)

What is the best (quickest/most readable) way to handle the fact that there are different structures/different fields/different errors to correct in the different versions? Should we use if then structures or use optional clauses (such as matches, any_of, across + where, ...)?

I agree that generalizations to cope with version-specific columns provide the intended result, and are more readable, easier to maintain and more robust (future-proof). So that's really great 👌 (and probably you / we already did such things in the past ?).
Also IMHO it's OK to mix both approaches where that feels more suitable. What feels most logical and efficient will often be the easiest to understand and maintain.
One could say that, in the generalized approach, the code is less explicit about how the different data source versions differ. But documenting this is not the aim of the n2khab package, rather the package seeks to align things.
Good idea to make it an issue in order to apply this throughout the package! Please!
The main point of attention is to still get essentially the same results for the earlier versions (in the sense of 'no columns lost', 'no errors introduced', 'column names unchanged' etc.; ultimately we would implement unit tests to automate such checks). While reading the code I didn't notice anything that would break this.
As a sidenote, I think we should not be reluctant to implement smaller general improvements of the returned object for all data source versions when it's deemed better at some point, which might e.g. be the use of a factor where a character had been used before etc. True reproducibility should involve freezing the n2khab version anyway.
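The kind of automated check mentioned above ('no columns lost', 'column names unchanged') could look roughly like this. It is only a sketch: both tibbles are made-up stand-ins (a stored reference result and the result of the updated function for the same older version), and same_structure() is a hypothetical helper:

```r
library(dplyr)

# Hypothetical reference: what an earlier release returned for an older version
reference <- tibble(polygon_id = "A1", wfd_type = factor("CFe"), hyla_code = 1L)

# Hypothetical result of the updated function for that same version
updated <- tibble(polygon_id = "A1", wfd_type = factor("CFe"), hyla_code = 1L)

# TRUE when column names, column order and column classes are all unchanged
same_structure <- function(new, old) {
  identical(names(new), names(old)) &&
    identical(lapply(new, class), lapply(old, class))
}

same_structure(updated, reference)
```

In a real test suite this comparison would sit inside a unit test per supported data source version.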

there is a wrong WVLC code (“d”) in de Kleiputten van Heist (Palingpot): not corrected since it is the raw source

Well, for nasty bugs like this we could still mitigate the drawback by using the same approach as fix_geom, e.g. fix_extra with default FALSE, to still honour the default of just returning the 'raw contents' (even though we do always streamline column names and classes to get some basic uniformity between R objects). TRUE would solve it, which then also transfers to processed data sources where this setting is used. The fix_extra argument could be loosely defined as solving some nasty bugs.
What do you think?
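As a rough sketch of what such an argument could do: fix_known_bugs() and the wvlc column below are hypothetical names, and setting the wrong value to NA is only a placeholder, since the correct value is not given in this thread.

```r
library(dplyr)

# Toy table; 'wvlc' is a hypothetical column standing in for the field
# that carries the wrong value "d" (other values are made up)
ws_raw <- tibble(
  polygon_id = c("A1", "A2"),
  wvlc = c("a", "d")
)

# Hypothetical helper: with the default fix_extra = FALSE the raw contents
# are returned untouched; TRUE applies the known fixes
fix_known_bugs <- function(x, fix_extra = FALSE) {
  if (fix_extra) {
    # Placeholder fix: blank out the known wrong value
    x <- mutate(x, wvlc = na_if(wvlc, "d"))
  }
  x
}

fix_known_bugs(ws_raw)                   # raw contents
fix_known_bugs(ws_raw, fix_extra = TRUE) # known bug mitigated
```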

General notes:

thanks for replacing the outdated mutate_at() and mutate_if() verbs, using across()!
as a consequence there are quite some consecutive mutate() %>% mutate() %>% ... statements in the code. These are best replaced by a single mutate() statement; I believe (?) the across() cases are no exception to this. Did you try?
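A small sketch of the idea on a toy table (made-up values, not the package code): since mutate() evaluates its arguments in order, an across() step and later steps that build on its result can share one call.

```r
library(dplyr)

# Toy table with empty strings (made-up values)
ws <- tibble(
  polygon_id = c("A1", "A2"),
  wfd_type = c("CFe", ""),
  depth_class = c("0-2 m", "")
)

# Chained version: one mutate() per step
chained <- ws %>%
  mutate(across(where(is.character), ~ na_if(.x, ""))) %>%
  mutate(wfd_type = factor(wfd_type)) %>%
  mutate(depth_class = factor(depth_class))

# Single mutate(): later arguments already see the result of across()
single <- ws %>%
  mutate(
    across(where(is.character), ~ na_if(.x, "")),
    wfd_type = factor(wfd_type),
    depth_class = factor(depth_class)
  )

identical(chained, single)
```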

Further, R CMD check leaves the following NOTE (here):

Undefined global functions or variables:
  area_name connectivity connectivity_name depth_class name polygon_id
  wfd_code wfd_type wfd_type_certain wfd_type_name where

where() needs to be imported.
For other names, this is typical for select() statements and the like: R CMD check neglects tidyselect-specific syntax rules, so it concludes that e.g. area_name is a separate object that should have been defined before. Before, this was solved for R CMD check by instead passing x as .data$x for tidy-selection or as .data[["x"]] for data-masking (still supported, but deprecated; see #179).
- currently it is advised to just use "x" in tidy-selection expressions since {tidyselect} also supports variable names as strings instead of as names. (Advice in changelog.) So for new code we better already implement that.
- however several verbs like mutate() and filter() refer to the vector to work on (not the column name) and this is referred to as data-masking. This means that something like ... %>% mutate(wfd_type_alt_name = "wfd_type_alternative" %>% ...) would not work in replacing .data, and here you should use all_of("wfd_type_alternative") instead IIUC.
- I've not yet tested the above; this is just my current understanding.
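A short sketch of the difference, on a toy table with made-up values. Note that all_of() is itself a tidy-selection helper, so inside mutate() it only works when wrapped in across(); the .data pronoun remains available for data-masking:

```r
library(dplyr)

ws <- tibble(
  polygon_id = c("A1", "A2"),
  wfd_type_alternative = c("CFe", "Ai") # made-up values
)

# Tidy-selection (select, rename, relocate, ...): strings name columns
sel <- ws %>% select("wfd_type_alternative")

# Data-masking (mutate, filter, ...): a bare string is just a constant
const <- ws %>% mutate(copy = "wfd_type_alternative")

# Data-masking with the .data pronoun reaches the column itself
masked <- ws %>% mutate(copy = .data[["wfd_type_alternative"]])

# all_of() inside mutate() needs across(), which does tidy-selection
via_across <- ws %>%
  mutate(across(all_of("wfd_type_alternative"), ~ tolower(.x),
                .names = "{.col}_lower"))
```

With library(dplyr) loaded interactively the bare column name would of course also work in mutate(); the string and .data forms matter mainly for keeping R CMD check quiet inside a package.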

Minor note:

just for future commits: a blank line is needed between the commit message title and the commit message body. Currently the message body (with item lists) is part of the title, resulting in long titles with wrapped items.


cecileherr · 2024-12-23T13:12:50Z

there is a wrong WVLC code (“d”) in de Kleiputten van Heist (Palingpot): not corrected since it is the raw source

Well, for nasty bugs like this we could still mitigate the drawback by using the same approach as fix_geom, e.g. fix_extra with default FALSE, to still honour the default of just returning the 'raw contents' (even though we do always streamline column names and classes to get some basic uniformity between R objects). TRUE would solve it, which then also transfers to processed data sources where this setting is used. The fix_extra argument could be loosely defined as solving some nasty bugs.

What do you think?

Still pending: let's discuss this one live. I am a bit reluctant because of the added complexity.
(Since this also has consequences for n2khab-preprocessing, I will not create a pull request there yet.)

as a consequence there are quite some consecutive mutate() %>% mutate() %>% ... statements in the code. These are best replaced by a single mutate() statement; I believe (?) the across() cases are no exception to this. Did you try?

Done

Undefined global functions or variables:
area_name connectivity connectivity_name depth_class name polygon_id
wfd_code wfd_type wfd_type_certain wfd_type_name where

Corrected with the non-deprecated syntax, that is, passing "x" instead of .data$x in the tidy-selection cases.
I made three separate commits:

- for the new code in read_watersurfaces (which gave the note in R CMD check)
- for the rest in read_watersurfaces (commit 0756449), that is, all existing occurrences of arrange, select, rename, pivot, relocate in read_watersurfaces (because I do not like the idea of mixing old and new syntax within the same function)
- for the rest in read_watersurfaces_hab (commit 434a904)

Note that .data is still present in the mutate, filter... statements. If I understand correctly (???) .data can still be used for data-masking (or at least it was the case in 2023 r-lib/tidyselect#169 and using it does not trigger warnings)

If you prefer to correct all .data occurrences in the package at once (issue #179), you can revert the last 2 commits (without creating problems in R CMD check).

cecileherr added 3 commits December 11, 2024 15:29

read_watersurfaces: update doc for v 2024

6110cd1

read_watersurfaces: replace mutate_if _at by across

08b3655

read_watersurfaces: adapt foor v2024

b706a0b

- drop hyla_code
- add wfd_type_alternative
- add wfd_type_alt_name
- include water_level_management
- add CFe in list water types
- use na_if for empty cells ("")

cecileherr requested a review from florisvdh December 11, 2024 15:19

read_watersurfaces_hab: update to v6

24960cc

- that is with watersurfaces 2024
- incl. documentation

cecileherr changed the title ~~Update read_watersurfaces for version 2024~~ Update read_watersurfaces for version 2024 & read_ws_hab Dec 12, 2024

cecileherr mentioned this pull request Dec 12, 2024

Error using latest version of watersurfaces: version 'watersurfaces_2024' #190

Open

4 tasks

florisvdh referenced this pull request in inbo/n2khab-preprocessing Dec 13, 2024

generate watersurfaces_hab v6, based on

9f9ce86

- watersurfaces 2024
- habitatmap_stdized 2023_v1

florisvdh reviewed Dec 13, 2024


cecileherr added 8 commits December 23, 2024 12:02

read_habitatdata: add tidyselect where

d41f598

read_watersurfaces: group several mutate in one

546650f

read_watersurfaces: solve 'undefined global var' for tidyselect

7432109

read_watersurfaces: minor changes

413db83

read_watersurfaces: replace deprecated .data in tidy-selection

0756449

- replace the deprecated .data$xxx by "xxx" in all rename, select, arrange

read_watersurfaces: minor correction

880cd3f

read_watersurfaces_hab: replace mutate_at by mutate(across)

76da084

read_watersurfaces_hab: replace deprecated .data in tidy-selection

434a904

- replace the deprecated .data$xxx by "xxx" in all (rename, select, arrange,) relocate
