make_archive_urls test for valid URL fails #11
Here's some reproducible code to test with the SOI dataset. It is currently returning https://urbaninstitute.github.io/nccs/catalogs/dd_unavailable.html for everything:

    library( dplyr )
    library( knitr )
    library( kableExtra )
    library( stringr )
    library( flextable )
    library( pander )

    GH.RAW <- "https://raw.githubusercontent.com/UrbanInstitute/nccs/main/catalogs/"
    d <- read.csv( paste0( GH.RAW, "AWS-NCCSDATA.csv" ) )
    source( paste0( GH.RAW, "build-catalog-functions.R" ) )

    series <- "soi"
    paths <- get_file_paths( series = "soi",
                             paths = d$Key,
                             tscope = "CHARITIES",
                             fscope = "PC" )
    profile_urls <- make_archive_urls( series = "soi", paths = paths )

    make_archive_urls <- function(series,
                                  paths){
      base_url = sprintf("https://urbaninstitute.github.io/nccs-legacy/dictionary/%s/%s_archive_html/",
                         series,
                         series)
      expr_dic = list("core" = "legacy/core/",
                      "bmf" = "legacy/bmf/",
                      "misc" = "legacy/misc/",
                      "soi" = "legacy/soi-micro/[0-9]{4}/")
      unavail_url <- "https://urbaninstitute.github.io/nccs/catalogs/dd_unavailable.html"
      matches <- gsub(expr_dic[[series]], "", paths)
      matches <- gsub("\\.csv", "", matches)
      archive_urls <- paste0(base_url, matches)
      archive_urls <- lapply(archive_urls,
                             function(x) if (RCurl::url.exists(x)) x else unavail_url)
      return(archive_urls)
    }
For the time being, I just commented out the validation line:

    # archive_urls <- lapply(archive_urls,
    #                        function(x) if (RCurl::url.exists(x)) x else unavail_url)

Worst case, the user gets a 404 instead of a "dictionary unavailable" message. I will look into an alternative URL validation function.
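One possible direction for that alternative check, offered as a rough sketch rather than a tested replacement: as far as I can tell, RCurl::url.exists() returns FALSE for any kind of failure, so a timeout looks the same as a missing page. The helper below swaps in the "unavailable" page only when the server gives a definite 404 and keeps the link otherwise. The names choose_archive_url, status_fn, and fake_status are made up for illustration and are not part of build-catalog-functions.R.

```r
# Hypothetical helper, not part of build-catalog-functions.R.
# status_fn: function(url) returning an HTTP status code; it may throw an
# error on timeouts or connection problems.
choose_archive_url <- function(url, status_fn, unavail_url) {
  status <- tryCatch(status_fn(url), error = function(e) NA_integer_)
  # Only a definite "not found" answer replaces the link;
  # errors and other status codes leave the original URL in place.
  if (!is.na(status) && status == 404) unavail_url else url
}

# Offline demonstration with a fake status function (no network needed).
fake_status <- function(url) {
  if (grepl("MISSING", url)) return(404L)
  if (grepl("SLOW", url)) stop("timeout")
  200L
}

unavail <- "https://urbaninstitute.github.io/nccs/catalogs/dd_unavailable.html"
urls <- c("https://example.org/OK",
          "https://example.org/MISSING",
          "https://example.org/SLOW")

checked <- vapply(urls, choose_archive_url, character(1),
                  status_fn = fake_status, unavail_url = unavail,
                  USE.NAMES = FALSE)
```

With real requests, status_fn could be something like function(u) httr::status_code(httr::HEAD(u, httr::timeout(10))), assuming the httr package is acceptable as a dependency. The trade-off is the one already described: an inconclusive check leaves a link that might 404, instead of dropping a link that might be valid.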
I saw your note that you could not replicate the behavior. Same here when I try with this same example:

    > x <- "https://urbaninstitute.github.io/nccs-legacy/dictionary/soi/soi_archive_html/SOI-MICRODATA-2002-501C3-CHARITIES-PC"
    > (RCurl::url.exists(x))
    [1] TRUE

It could have just been a slow server, or perhaps those pages are generated dynamically when requested so there is a delay. Whatever the case, there are many instances where the RCurl check fails when the URLs are actually valid. Unless we have a function we can trust, it's probably better not to remove the links when the test fails, because that results in the kind of file the user mentioned: none of the data dictionary buttons had associated URLs on the download page for the SOI Microdata files (all of the valid ones were dropped when the file was rendered). If the URL is added and does not actually exist, the user just gets a 404 message. That seems like the lesser of the two problems.
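If the failures really are transient (a slow server or delayed page generation, which is only a guess), retrying a few times before giving up might be enough. A minimal sketch, assuming a hypothetical wrapper exists_with_retry around whatever check is used; check_fn stands in for RCurl::url.exists, and the number of tries and the pause are arbitrary:

```r
# Hypothetical wrapper, not part of build-catalog-functions.R.
# check_fn: function(url) returning TRUE/FALSE, e.g. RCurl::url.exists.
exists_with_retry <- function(url, check_fn, tries = 3, pause = 0) {
  for (i in seq_len(tries)) {
    # Treat an error from the check as a failed attempt, not a crash.
    ok <- tryCatch(isTRUE(check_fn(url)), error = function(e) FALSE)
    if (ok) return(TRUE)
    if (i < tries) Sys.sleep(pause)  # wait before the next attempt
  }
  FALSE
}

# Offline demonstration: a fake check that fails twice, then succeeds.
calls <- 0
flaky_check <- function(url) {
  calls <<- calls + 1
  calls >= 3
}

result <- exists_with_retry("https://example.org/page", flaky_check)
```

In real use the call would look like exists_with_retry(x, RCurl::url.exists, tries = 3, pause = 2). Checking every catalog URL this way makes rendering slower, so it only pays off if the false negatives do go away on a second or third attempt.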
In the make_archive_urls() function within build-catalog-functions.R the test for valid URL is failing.
For example,
The URL works fine:
https://urbaninstitute.github.io/nccs-legacy/dictionary/soi/soi_archive_html/SOI-MICRODATA-2002-501C3-CHARITIES-PC
Any ideas?