Fetch nhdv2 attributes from ScienceBase #82
Conversation
Duplicate variables: Thanks for finding these! I just looked at the SB metadata to see if they differ, and I think these are exact duplicates with different names. Here are my thoughts on which names to retain:
Cutting off at 1960 sounds good to me to capture ~20 years of historical information from our start date. That applies to all datasets with multiple years available.
I looked at the column you added in the VarsOfInterest table. The only variables that do not have zipped files listed are the mean monthly precip variables that I added, so that makes me think I may have specified column names incorrectly. I thought I used the names listed in their metadata file. Let me know if you need help getting the data for those variables.
@jds485, thanks for that offer! There are missing names in that column for the precip variables because I was lazy and didn't want to copy all the folder names (there are many!) 😄 I instead added special handling to define those item names in lines 23-37.
# Parse name(s) of unzipped files
file_name <- basename(out_file)
file_name_sans_ext <- substr(file_name, 1, nchar(file_name) - 4)
You could get the number of characters in the extension from a strsplit that takes the last element, but I do expect that data downloads would have 4-character extensions.
Thanks for this suggestion - like you said, it works for downloading our NHD attributes, but it'd be nice to make that file extension bit robust to various file types. I had originally thought to use str_split with pattern = "." and just take everything before the period to define the file name. But then I wasn't sure how often "." would be used within the file name itself (not just to demarcate the extension), so I went with this approach.
I was thinking something like:
ext_nchar <- str_split(file_name, pattern = '.')[[1]] %>% nchar() %>% last()
Oh, good idea to use last() like that! I also implemented this right after posting my first comment:
str_split(file_name, ".[[:alnum:]]+$")[[1]][1]
It should split the string on ".ext" (regardless of the length of the file extension) and then take the first part.
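A quick illustration of that split, using a hypothetical file name:

```r
library(stringr)

file_name <- "STATSGO_LAYER.csv"   # hypothetical example
str_split(file_name, ".[[:alnum:]]+$")[[1]][1]
#> [1] "STATSGO_LAYER"
```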
Sounds good. I didn't realize that you added the items in that column manually! I finished looking over the code and it looks good to me. Thanks for implementing the fetch-unzip-clip that we talked about last week.
Thanks, @jds485. As a recap, here's what I've implemented for datasets with multiple years based on discussion in this PR:
We may have to retain ET (over AET) and RUN (over RUN7100). When I download the linked data associated with AET and RUN7100, there are no column names that match those descriptors, so they must just be identical to ET and RUN?
Okay, sounds good!
Comments:
In addition to my inline comments, I read in all the downloaded files to skim how they each look. All are correctly read in (I did not get a chance to do further df checks such as NA, min, max - lmk if that is desired). The downloaded tables have the correct structure organized by COMID; I suggest testing them with the xwalk, and they seem ready for that. Also, most have readable column headers that match the file name (ex: ACC_PT for PT.csv), so they will be easy to keep track of as we work through them either in bulk or individually.
Some files are more complex (not following the ACC*, TOT*, CAT* structure) and therefore take longer to grasp without pulling out the original metadata file in SB (keeping an eye on this one: "1_fetch/out/NDAMS.csv").
Suggestions: grouping certain downloads.
Could we create a subfolder for these specific downloads?
folder name: STATSGO, for the following 3 csvs:
"1_fetch/out/STATSGO_HYDGRP.csv"
"1_fetch/out/STATSGO_TEXT.csv"
"1_fetch/out/STATSGO_LAYER.csv"
folder name: PPT, for the following 3 csvs:
"1_fetch/out/PPT_ACC.csv"
"1_fetch/out/PPT_TOT.csv"
"1_fetch/out/PPT_CAT.csv"
If easier, these could also be combined into 1 df.
1_fetch.R
Outdated
p1_vars_of_interest_downloaded_csvs,
p1_vars_of_interest %>%
  split(., .$sb_id) %>%
  lapply(., fetch_nhdv2_attributes_from_sb, save_dir = "1_fetch/out", comids = p1_nhdv2reaches_sf$COMID, delete_local_copies = TRUE) %>%
Oh, I always default to using function(x) in apply() because I thought it doesn't like it when there are multiple arguments passed to the given function. Nice to see it works!
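As a tiny illustration (the toy function below is hypothetical, not from this repo), lapply() forwards extra named arguments via ..., so the anonymous-function wrapper isn't required:

```r
# Toy function standing in for a multi-argument worker
scale_values <- function(x, factor = 1) x * factor

# Both calls are equivalent
lapply(list(1:3, 4:6), function(x) scale_values(x, factor = 10))
lapply(list(1:3, 4:6), scale_values, factor = 10)
```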
1_fetch.R
Outdated
p1_vars_of_interest %>%
  split(., .$sb_id) %>%
  lapply(., fetch_nhdv2_attributes_from_sb, save_dir = "1_fetch/out", comids = p1_nhdv2reaches_sf$COMID, delete_local_copies = TRUE) %>%
  do.call('c', .),
format = "file"
Can you describe what is happening here with a couple of comments? In particular, why are you using split()? (see comment suggestion above). And what does do.call('c', .) do?
Good call, thanks! I've added comments to briefly explain what each line is doing within the p1_vars_of_interest_downloaded_csvs target. In particular, fetch_nhdv2_attributes_from_sb returns the file output path for each unique sb_id, so do.call('c', .) concatenates those strings into one vector of class chr.
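For readers following along, here is roughly what that commented target could look like (a sketch assembled from the diff above; the tar_target() wrapper and exact comment wording are assumptions, not the PR's literal code):

```r
tar_target(
  p1_vars_of_interest_downloaded_csvs,
  p1_vars_of_interest %>%
    # split the VarsOfInterest table into a list of tables, one per ScienceBase item
    split(., .$sb_id) %>%
    # fetch/unzip/filter each item; each call returns the path of the csv it writes
    lapply(., fetch_nhdv2_attributes_from_sb,
           save_dir = "1_fetch/out",
           comids = p1_nhdv2reaches_sf$COMID,
           delete_local_copies = TRUE) %>%
    # concatenate the list of file paths into a single character vector
    do.call('c', .),
  format = "file"
)
```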
I agree that because these are thematically similar, it might be more satisfying if they were combined somehow, either in a separate folder or a single data frame. For the STATSGO tables, these csv's reflect the format of the input datasets that are contained on ScienceBase, and they at least follow the expected structure (columns for "ACC", "CAT", and "TOT"). The PPT files on ScienceBase are a little different because the ACC/CAT/TOT files actually represent separate sb_id's altogether. I initially thought about combining those PPT tables but opted not to because it would make the data download code more complex and thus harder to read (i.e., by adding special handling code), and because I wasn't sure it'd be necessary. In the data processing steps, I envision that we'll subset any of these tables to only include the columns containing "CAT" before applying various functions (e.g., sum, area-weighted mean), so the PPT_TOT and PPT_ACC csv's just wouldn't get passed through that data processing code. Would you agree with my characterization here (if it makes sense, that is!)? So I think I'm still leaning toward leaving the downloaded files as is to keep the code as simple as possible and to keep fetch tasks separate from other processing tasks we might need/want. Do you think that makes sense? Perhaps we can revisit in the data processing step?
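To make that envisioned processing step concrete, a hypothetical sketch (the file path is taken from the list above; the column-selection logic is an assumption, not code from this PR):

```r
library(dplyr)
library(readr)

# Keep COMID plus only the catchment-scale ("CAT") columns before summarizing
ppt_cat <- read_csv("1_fetch/out/PPT_CAT.csv") %>%
  select(COMID, contains("CAT"))
```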
Thanks for your review, @msleckman, and thank you for skimming all those downloaded files! If, after skimming the downloaded files, you have notes for processing steps we'll need to add or at least consider, it'd be useful to add those to issue #80. I think I've addressed all of your comments for the fetch step here, so let me know if we should add anything else or whether we can merge.
Based on Margaux's review (and looking forward to processing the downloaded Wiezorek datasets), I realized that it would make the code cleaner if I used dynamic branching instead of a multi-step approach. So, fyi @msleckman @jds485, I went ahead and made that change here. @msleckman, would you mind checking that this still builds locally for you?
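For context, a sketch of what a dynamically branched version of the fetch target could look like (the grouping helper and iteration settings shown here are assumptions; the PR's actual implementation may differ):

```r
list(
  # group the VarsOfInterest table so each dynamic branch handles one ScienceBase item
  tar_target(
    p1_vars_of_interest_grouped,
    p1_vars_of_interest %>%
      dplyr::group_by(sb_id) %>%
      targets::tar_group(),
    iteration = "group"
  ),
  # one branch per sb_id; each branch returns the path of the csv it writes
  tar_target(
    p1_vars_of_interest_downloaded_csvs,
    fetch_nhdv2_attributes_from_sb(
      p1_vars_of_interest_grouped,
      save_dir = "1_fetch/out",
      comids = p1_nhdv2reaches_sf$COMID,
      delete_local_copies = TRUE
    ),
    pattern = map(p1_vars_of_interest_grouped),
    format = "file"
  )
)
```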
@lekoenig I've run this locally with your changes and had no issues 😄. I like the dynamic branching approach a lot. Nice job incorporating that targets pattern!
I did feel like it took longer to download the files than last time (>40 min, although I did not time it precisely).
One idea/suggestion - is it possible to put a Sys.time() wrapper around the p1_vars_of_interest_downloaded_csvs target to print the time it takes for all of the vars-of-interest files to download? (I have no idea if that is feasible to do within 1_fetch.R.)
Looks good!
targets tracks the time it takes to build every target (and every branch) within the meta file. You can access this information with
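the targets metadata interface (presumably tar_meta(), which reads that meta file). A minimal sketch of pulling branch build times, assuming the standard targets API:

```r
library(targets)
library(dplyr)

# Build times (seconds) for the download target and each of its dynamic branches
tar_meta() %>%
  select(name, seconds) %>%
  filter(grepl("p1_vars_of_interest_downloaded_csvs", name)) %>%
  arrange(desc(seconds))
```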
# Parse name(s) of unzipped files
file_name <- basename(out_file)
file_name_sans_ext <- str_split(file_name, ".[[:alnum:]]+$")[[1]][1]
[randomly skimming github stuff and this caught my eye] - do you have a reason not to use the built-in tools::file_path_sans_ext() here? This str_split approach works for 1 file but not multiple files, so it needs to be modified in some way if you're serious about the plural option implied by "name(s)" 3 lines above.
Thanks for that suggestion, Alison! In short, I don't have a good reason not to use the built-in function you reference. My brain got going on regex, I guess 😄. We map over this function for individual files, so I don't expect that the str_split approach would cause issues, but it's good to simplify the code (and the plural option that was implied in the commented line above). I've edited this line to use tools::file_path_sans_ext() instead.
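For reference, the built-in handles a vector of names directly (hypothetical file names below):

```r
file_names <- c("PPT_CAT_CONUS.zip", "STATSGO_LAYER.csv")   # hypothetical examples
tools::file_path_sans_ext(file_names)
#> [1] "PPT_CAT_CONUS" "STATSGO_LAYER"
```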
This worked for me to ensure that PPT_CAT, PPT_ACC, PPT_TOT, and others were only being downloaded and unpacked once. @msleckman do you know if
Addresses #16 and #77.
The code here adds a target to download all of the data associated with our VarsOfInterest table. The function to download the data also filters the output to only include COMIDs specified by the user, and saves a csv of the filtered output. The function also contains an argument to choose between retaining or deleting the original downloaded zipped and unzipped files.
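Based on that description, the function's interface looks roughly like the sketch below (save_dir, comids, and delete_local_copies appear in the call shown in 1_fetch.R; the first argument's name and the body outline are assumptions, not the actual implementation):

```r
fetch_nhdv2_attributes_from_sb <- function(vars_item, save_dir, comids,
                                           delete_local_copies = TRUE) {
  # 1. Download and unzip the ScienceBase item(s) referenced by vars_item$sb_id
  # 2. Read the attribute table and keep only rows whose COMID is in `comids`
  # 3. Write the filtered table to a csv within save_dir
  # 4. If delete_local_copies is TRUE, remove the original zipped/unzipped downloads
  # 5. Return the path of the csv that was written (used as a targets file target)
}
```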
I'm tagging @msleckman for a code review and @jds485 for a more cursory review. A few questions I have on this PR:
We could add the NLCD years beyond 2011 to VarsOfInterest and download them in a similar manner. Would this be helpful for addressing Fetch NHD attributes: NLCD Land Cover 2001-2019 #56? I know we currently download NLCD data separately (in fact, 'Land Cover' variables have been filtered out of the target for p1_vars_of_interest), so I'm happy to leave that alone and fetch these data in a similar manner to NLCD2011. @msleckman, I'm interested in your thoughts on that since I'm not too familiar with the existing NLCD download functions.
There are some variables in our VarsOfInterest table that seem like they could be duplicates (i.e., slightly different GIS layer name but same sb id). I haven't looked at the data sources too closely, so I'd appreciate it if anyone else can clarify whether these are distinct variables. From my notes: