I extracted some data from a Chinese pdf file.
The numbers in the columns are extracted as follows (for example): －１２２, ２９４５８, ９.
I copy-pasted the output of some cells. However, these characters are not the same as -122, 29458, 9, respectively.
Hence parse_number() produces NA in all of these cases.
Any suggestions regarding what I should do?
This is the pdf file in question: http://images.mofcom.gov.cn/fec/202211/20221118091910924.pdf
I extracted the data from page 49 (53rd page of the pdf file), using the following code:
library(tidyverse)
library(pdftools)

file <- tempfile()
url <- paste0("http://images.mofcom.gov.cn/fec/202211/20221118091910924.pdf")
download.file(url, file, headers = c("User-Agent" = "My Custom User Agent"))
pdf_data <- pdf_text(file)

replace_spaces_and_commas <- function(x) {
  str_replace_all(x, "[ ,]", "")
}

pdf <- pdf_data[53:71]
tab_pdf <- str_split(pdf, "\n")
for (i in 1:19) {
  assign(paste0("tab_pdf_", i), tab_pdf[[i]])
}

the_names <- c("country", "year_2013", "year_2014", "year_2015", "year_2016",
               "year_2017", "year_2018", "year_2019", "year_2020", "year_2021")
view(tab_pdf_1)

pdf_clean1 <- tab_pdf_1[14:60] %>%
  str_trim %>%
  str_replace_all(",", "") %>%
  str_split("\\s{2,}", simplify = TRUE) %>%
  data.frame(stringsAsFactors = FALSE) %>%
  setNames(the_names) %>%
  mutate_all(.funs = replace_spaces_and_commas) %>%
  filter(country != "")
I tried both as.numeric(pdf_clean1$year_2013) and parse_number(pdf_clean1$year_2013).
Both produced NAs, because each of the comparisons "９" == "9", "－１２２" == "-122" and "２９４５８" == "29458" evaluates to FALSE.
sessionInfo()
R version 4.3.1 (2023-06-16)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Ventura 13.4.1

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0

attached base packages:
[1] stats graphics grDevices utils datasets methods
[7] base

other attached packages:
 [1] countrycode_1.5.0 magrittr_2.0.3 pdftools_3.3.3
 [4] lubridate_1.9.2 forcats_1.0.0 stringr_1.5.0
 [7] dplyr_1.1.2 purrr_1.0.1 readr_2.1.4
[10] tidyr_1.3.0 tibble_3.2.1 ggplot2_3.4.2
[13] tidyverse_2.0.0

loaded via a namespace (and not attached):
 [1] gtable_0.3.3 compiler_4.3.1 qpdf_1.3.2
 [4] tidyselect_1.2.0 Rcpp_1.0.11 scales_1.2.1
 [7] R6_2.5.1 generics_0.1.3 knitr_1.42
[10] munsell_0.5.0 pillar_1.9.0 tzdb_0.4.0
[13] rlang_1.1.1 utf8_1.2.3 stringi_1.7.12
[16] xfun_0.39 timechange_0.2.0 cli_3.6.1
[19] withr_2.5.0 grid_4.3.1 rstudioapi_0.15.0
[22] hms_1.1.3 askpass_1.1 lifecycle_1.0.3
[25] vctrs_0.6.3 glue_1.6.2 fansi_1.0.4
[28] colorspace_2.1-0 tools_4.3.1 pkgconfig_2.0.3
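The NA results and the FALSE comparisons described above are consistent with the extracted digits being fullwidth forms (U+FF10 to U+FF19, plus U+FF0D for the minus sign) rather than ASCII. A minimal sketch of how this could be checked with base R follows; the string `x` is a hand-built stand-in for one extracted cell, not data taken from the PDF.

```r
# Hand-built stand-in for one extracted cell: fullwidth "-122" (an assumption,
# written with \u escapes so the example does not depend on file encoding)
x <- "\uFF0D\uFF11\uFF12\uFF12"

# Inspect the Unicode code points of the extracted string and of its ASCII look-alike
utf8ToInt(x)       # 65293 65297 65298 65298
utf8ToInt("-122")  # 45 49 50 50

# The strings look alike on screen but are different characters
x == "-122"        # FALSE
```

Code points in the range 65296 to 65305 (and 65293 for the minus sign) would point to fullwidth characters, which as.numeric() does not interpret as digits.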
I found a solution with the help of a Stack Overflow user and ChatGPT. Posting it here in case someone else has the same problem:
convert_fullwidth_to_numeric <- function(input_str) {
  utf8_codes <- utf8ToInt(input_str)
  # Handle the fullwidth minus sign (U+FF0D, code point 65293) separately
  utf8_codes <- ifelse(utf8_codes == 65293, 45, utf8_codes)
  # Shift fullwidth digits (code points 65296 to 65305) down to ASCII "0" to "9"
  converted_utf8_codes <- ifelse(utf8_codes >= 65296 & utf8_codes <= 65305,
                                 utf8_codes - 65248, utf8_codes)
  converted_chars <- intToUtf8(converted_utf8_codes)
  converted_numeric <- as.numeric(converted_chars)
  return(converted_numeric)
}

# Apply the function to specified columns (columns 2 to 10)
columns_to_transform <- 2:10 # Adjust column indices as needed
for (col in columns_to_transform) {
  for (row in 1:nrow(pdf_clean1)) {
    pdf_clean1[row, col] <- convert_fullwidth_to_numeric(pdf_clean1[row, col])
  }
}

# Cell-wise assignment into a character column keeps the column as character,
# so convert the cleaned columns to numeric in one final step
pdf_clean1[columns_to_transform] <- lapply(pdf_clean1[columns_to_transform], as.numeric)
https://stackoverflow.com/questions/76895064/number-as-character-cannot-be-converted-to-numeric-in-r
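For anyone who prefers a vectorized version, the same idea can probably be expressed with base R's chartr(), which translates characters one-to-one. This is only a sketch: `fullwidth_to_numeric` and the small data frame `df` are hypothetical names made up for illustration, and it assumes the cells contain nothing but fullwidth digits, the fullwidth minus sign and the fullwidth full stop.

```r
# Hypothetical helper (not from the original thread): map fullwidth digits,
# minus sign (U+FF0D) and full stop (U+FF0E) to ASCII, then convert to numeric
fullwidth_to_numeric <- function(x) {
  fullwidth <- "\uFF10\uFF11\uFF12\uFF13\uFF14\uFF15\uFF16\uFF17\uFF18\uFF19\uFF0D\uFF0E"
  ascii     <- "0123456789-."
  as.numeric(chartr(fullwidth, ascii, x))
}

# Made-up example data standing in for pdf_clean1
df <- data.frame(
  country   = c("A", "B"),
  year_2013 = c("\uFF0D\uFF11\uFF12\uFF12", "\uFF12\uFF19\uFF14\uFF15\uFF18"),
  stringsAsFactors = FALSE
)

# Convert every column except the first in one pass; columns become numeric
df[-1] <- lapply(df[-1], fullwidth_to_numeric)
df$year_2013  # -122 29458
```

Because chartr() works on whole character vectors, no row-by-row loop is needed, and the converted columns end up with type numeric rather than character.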