
Erroneous files when trying to bulk download NLCD files with sbtools in R #310

Open

msleckman opened this issue Aug 28, 2023 · 3 comments

@msleckman
Hello 👋

I am trying to bulk download all of the National Land Cover Database 2019 versions for the years 2001 to 2019 from its ScienceBase data release using sbtools in R.

When attempting to download all the NLCD files from that ScienceBase data release using item_file_download():

sbtools::item_file_download(sb_id = '63935b30d34e0de3a1efe082', dest_dir = '.', overwrite_file = TRUE)

The download appears to succeed: all of the expected file names show up in my destination directory. However, when I open the *_CONUS.txt files I want to use, they all turn out to be unreadable HTML files, as shown in this screenshot.

[screenshot: a downloaded *_CONUS.txt file containing HTML]

This doesn't happen when I download the files manually from ScienceBase.

Does anyone know why this is happening and what I need to do to pull the correct NLCD .txt files using sbtools?

@dblodgett-usgs
Collaborator

Oh wow -- that's interesting.

It's sending you to the cloud file downloader for some reason.

First thing is to double-check that you are on the latest sbtools. There have been changes in that part of the package fairly recently.
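A quick way to confirm what you have installed (a sketch, assuming sbtools was installed from CRAN; compare the printed version against the current CRAN release):

```r
# Print the currently installed sbtools version
packageVersion("sbtools")

# Update to the latest CRAN release if it is out of date
install.packages("sbtools")
```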

library(sbtools)

authenticate_sb()

f <- item_list_files(sb_id = '63935b30d34e0de3a1efe082', fetch_cloud_urls = FALSE)

fname <- f$fname

fname <- fname[grepl("txt", f$fname, ignore.case = TRUE)]

sbtools::item_file_download(sb_id = '63935b30d34e0de3a1efe082', names = fname,
                            destinations = fname,
                            dest_dir = '.', overwrite_file = TRUE)

Seems to be working for me. I am authenticated and I am on VPN.

Hopefully it's something simple? Let me know if you can get it going with an update and some fiddling; if not, I can dig deeper and try to reproduce here.

@msleckman
Author

msleckman commented Sep 1, 2023

Thanks @dblodgett-usgs! After updating the sbtools package, I am able to run that chunk of code and get the actual NLCD datasets in my local directory.

However, I am noticing that this download step doesn't work when it is wrapped in a target using the targets R package:

  # Read in all NLCD data from 2001-2019
  # Note: this takes a long time to download.
  tar_target(p1_NLCD_LC_path,
             {
               NLCD_dir_path <- '1_fetch/in/NLCDs_2001_2019'
               all_fls <- item_list_files(sb_id = '63935b30d34e0de3a1efe082', fetch_cloud_urls = FALSE)
               nlcd_files <- all_fls$fname[grepl('\\.TXT$', all_fls$fname, ignore.case = TRUE)]
               download_sb_file(sb_id = '63935b30d34e0de3a1efe082',
                                file_name = nlcd_files,
                                out_dir = NLCD_dir_path)
               NLCD_dir_path
             },
             format = "file"
  ),
  
  tar_target(
    p1_NLCD_LC_data,
    read_subset_LC_data(LC_data_folder_path = p1_NLCD_LC_path,
                        Comids_in_AOI_df = p1_nhdv2reaches_sf %>%
                          st_drop_geometry() %>%
                          select(COMID),
                        Comid_col = 'COMID',
                        NLCD_type = NULL)
  ),

Error:

> authenticate_sb()
> targets::tar_make(p1_NLCD_LC_data)
▶ start target p1_NLCD_LC_path
retrieving S3 URL
✖ error target p1_NLCD_LC_path
▶ end pipeline [9.225 seconds]
There were 13 warnings (use warnings() to see them)
Error:
! Error running targets::tar_make()
  Error messages: targets::tar_meta(fields = error, complete_only = TRUE)
  Debugging guide: https://books.ropensci.org/targets/debugging.html
  How to ask for help: https://books.ropensci.org/targets/help.html
  Last error: Error downloading 1_fetch/in/NLCDs_2001_2019/NLCD01_ACC_CONUS.TXT Original error: 
 Error in get_access_token(): no token found, must call authenticate_sb()

This error happens even after I successfully run authenticate_sb().

Might you know why this is happening? Does wrapping this download step in a targets object somehow make it trickier to authenticate with ScienceBase? These files are also large and come from an S3 bucket on ScienceBase, which may add a layer of security that targets cannot get through.

Details of `download_sb_file()` and `read_subset_LC_data()` below:
download_sb_file <- function(sb_id, file_name, out_dir){
  #'
  #' @description Function to download file from ScienceBase 
  #'
  #' @param sb_id string - the id of the science base item
  #' @param file_name string - the name of the file in the science base item to download
  #' @param out_dir string - the directory where you want the file downloaded to
  #'
  #' @return string the out_path

  # Conditional Statement to use function in the event user wants all files in the given sciencebase item
  if(is.null(file_name)){
    out_path <- out_dir
    sbtools::item_file_download(sb_id = sb_id,
                                dest_dir = out_path,
                                overwrite_file = TRUE)
  } else {
    out_path <- file.path(out_dir, file_name)
    sbtools::item_file_download(sb_id = sb_id,
                                names = file_name,
                                destinations = out_path,
                                overwrite_file = TRUE)
  }
  
  return(out_path)
}

read_subset_LC_data <- function(LC_data_folder_path,
                                Comids_in_AOI_df,
                                Comid_col,
                                NLCD_type = NULL){
  
  #' @description Read in and subset lc data after data is downloaded and unzipped
  #' @param LC_data_folder_path LC data folder path or vector of LC data folder paths - last subfolder often 'unzipped'
  #' @param Comids_in_AOI_df dataframe of all comid ids
  #' @param Comid_col str. key comid col in Xwalk table. e.g. "comid" | "COMID"
  #' @param NLCD_type str. Default NULL. Options are either CAT, ACC, or TOT. Use NULL if all three are selected
  #' @examples read_subset_LC_data(LC_data_folder_path = "1_fetch/out/LandCover_Data/ImperviousnessPct_2011/unzipped",
  #'  Comids_in_AOI_df = PRMSxWalk, Comid_col = 'comid_down')
  #' @examples read_subset_LC_data(LC_data_folder_path = c("1_fetch/out/LandCover_Data/ImperviousnessPct_2011/unzipped",
  #'  "1_fetch/out/LandCover_Data/Imperviousness100m_RipZone/unzipped"), Comids_in_AOI_df = PRMSxWalk, Comid_col = 'comid_down')

  # Function Vars 
  ## creating list for dfs before for loop
  all_data_subsetted <- list()
  
  # Loop through sub-folders, combine datasets, and subset through Join
  for(LC_data in LC_data_folder_path){

    # Check that downloaded data exists in folder 
    if(length(list.files(LC_data)) == 0){
      stop(paste0('No NLCD LC data for file ', LC_data,'. Please move the NLCD .txt file to this location.'))
    }
    
    LC_data_path <- unlist(LC_data)
    
    if(!is.null(NLCD_type)){
      files <- list.files(path = LC_data_path, pattern = glue('{NLCD_type}_CONUS\\.txt$'),
                          ignore.case = TRUE, full.names = TRUE)
    } else {
      files <- list.files(path = LC_data_path, pattern = '_CONUS\\.txt$',
                          ignore.case = TRUE, full.names = TRUE)
    }

    ## Read in and subset by COMIDs from the crosswalk
    data_list <- lapply(files, function(x){
      read_csv(x, show_col_types = FALSE) %>%
        right_join(Comids_in_AOI_df,
                   by = c('COMID' = Comid_col),
                   keep = FALSE)
      })

    ## Combine
    cbind_subsetted_df <- data_list %>%
      reduce(inner_join, by = 'COMID') ## possibly switch to full_join

    # using str_replace_all to standardize file paths from OS or Windows 
    LC_data <- str_replace_all(LC_data, '\\\\','/')
    
  ## Assign to list - note name of item in list is LC_data (e.g. all_data_subsetted$NLCD_LandCover_2011) 
    if(endsWith(LC_data, 'unzipped') | endsWith(LC_data, 'unzipped/')){
      name <- str_split(LC_data, pattern = '/', simplify = TRUE)
      name <- name[length(name) - 1]
    }
    else{
      name <- str_split(LC_data, pattern = '/', simplify = TRUE)
      name <-name[length(name)]
    }
    
    all_data_subsetted[[name]] <- cbind_subsetted_df
  }

  ## if only one NLCD table is loaded, return it as a data frame
  if(length(all_data_subsetted) == 1){
    all_data_subsetted <- all_data_subsetted[[1]]
  }

  return(all_data_subsetted)
}

@dblodgett-usgs
Collaborator

Specifically, this is happening because tar_make() runs your pipeline in a fresh, clean R process (via callr), so it doesn't see the token from your interactive authenticate_sb() call. I think it should work if you set callr_function=NULL in your call to tar_make().
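For example (a sketch of that workaround, using the p1_NLCD_LC_data target name from your pipeline above):

```r
library(targets)

# Authenticate interactively first, so the token lives in this session
sbtools::authenticate_sb()

# callr_function = NULL runs the pipeline in the current R process
# instead of a fresh callr session, so the ScienceBase token is visible
tar_make(p1_NLCD_LC_data, callr_function = NULL)
```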

Another way to handle it would be to cache your sciencebase credentials with keyring and pass your username through an environment variable or just in code.

?sbtools::authenticate_sb for the keyring details.
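One possible shape for that, assuming the password has already been cached with keyring and that a hypothetical SB_USER environment variable (e.g. set in .Renviron) holds your username; see ?sbtools::authenticate_sb for the actual keyring workflow:

```r
# SB_USER is a hypothetical environment variable holding the
# ScienceBase username; the cached keyring password supplies the rest
sbtools::authenticate_sb(username = Sys.getenv("SB_USER"))
```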
