Encode unicode characters in UTF8 #1

jeroenbaas · 2021-03-24T20:39:24Z

Some titles contain special characters that were converted to ASCII escape sequences, such as:

Otolaryngology<U+0096>Head and Neck Surgery

It would be helpful if these could be encoded in UTF-8.
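For reference, tokens of this shape can be repaired after the fact. Below is a minimal Python sketch, not part of this repository (which is written in R). It assumes the `<U+XXXX>` notation shown above, and it assumes that code points in the C1 control range (U+0080 to U+009F) are really Windows-1252 bytes that were mislabelled; under that assumption 0x96 maps to an en dash. The name `decode_r_escapes` is hypothetical.

```python
import re

# Matches escape tokens such as "<U+0096>" (4 to 6 hex digits).
ESCAPE = re.compile(r"<U\+([0-9A-Fa-f]{4,6})>")


def decode_r_escapes(text: str) -> str:
    """Replace <U+XXXX> tokens with the characters they most likely stand for."""

    def repl(match: re.Match) -> str:
        code = int(match.group(1), 16)
        if 0x80 <= code <= 0x9F:
            # Assumption: a C1 control code here is a Windows-1252 byte.
            try:
                return bytes([code]).decode("cp1252")
            except UnicodeDecodeError:
                return match.group(0)  # undefined in cp1252, leave untouched
        if code > 0x10FFFF:
            return match.group(0)  # not a valid code point, leave untouched
        return chr(code)

    return ESCAPE.sub(repl, text)


fixed = decode_r_escapes("Otolaryngology<U+0096>Head and Neck Surgery")
print(fixed)
```

Tokens outside the C1 range are decoded directly as code points, so the same function would also cover non-Latin scripts if they are stored in the same notation.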

jeroenbaas · 2021-03-25T11:34:09Z

I've tried to replicate the issue above by forking the code and running Analysis_06_Extract-Journals.R; on a system with WSL/Ubuntu it produced an alljournals-.csv file with correct encodings, i.e.:
"Otolaryngology–Head and Neck Surgery","https://journals.sagepub.com/home/otoj","SAGE",2021-03-22
vs what is in the provided CSV alljournals-2021-02-05.csv:
"Otolaryngology<U+0096>Head and Neck Surgery","https://journals.sagepub.com/home/otoj","SAGE",NA

andreaspacher · 2021-03-26T19:17:42Z

Thanks, @jeroenbaas, for taking such a deep look.

Now that I have repeated the same procedure of scraping SAGE journals via Analysis_06_Extract-Journals.R, I had the same experience as you: it worked without any encoding errors (even when saving the list of SAGE journals with write.csv() and no explicit UTF-8 encoding setting).

I believe that I originally used a different computer for scraping the journals. Possibly the encoding settings differed there.
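If the machine's default encoding is indeed the cause, pinning the encoding on both write and read would make the result independent of the system. A hedged Python sketch of that idea follows (the project itself uses R's write.csv(); the column names, the file name, and the helper names `write_journals` and `read_journals` are all hypothetical):

```python
import csv
import os
import tempfile


def write_journals(path, rows):
    # Explicit encoding: do not rely on the platform default.
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["title", "url", "publisher"])  # hypothetical header
        writer.writerows(rows)


def read_journals(path):
    # Read back with the same explicit encoding.
    with open(path, encoding="utf-8", newline="") as f:
        return list(csv.reader(f))


rows = [[
    "Otolaryngology\u2013Head and Neck Surgery",
    "https://journals.sagepub.com/home/otoj",
    "SAGE",
]]
path = os.path.join(tempfile.mkdtemp(), "journals_example.csv")
write_journals(path, rows)
round_trip = read_journals(path)
```

With the encoding fixed on both sides, the en dash in the title survives the round trip regardless of the locale of the machine that runs the script.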

Anyway, in 6937419, I fixed some of the gravest encoding issues (merely ex post).

However, Chinese characters as well as Korean and Russian letters still need to be fixed.

But again, thank you, @jeroenbaas, for pointing out this issue. I really should take a better look at all the encoding-related aspects.

jeroenbaas · 2021-03-26T19:35:30Z

No worries. When I get a chance, I will also take a closer look at how the Scopus title list is used. From what I could quickly tell, it only serves as a count of journals per publisher. There is potentially much more value in the source list, and linking it to your final dataset (e.g. by carrying the Scopus Source ID) may unlock a lot of analytical power further downstream.
Disclosure: I'm employed by Elsevier (not on the Scopus team), in the Analytical / Data Services group, and as such spend a lot of quality time with this data.
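To illustrate what carrying the Source ID could look like, here is a small Python sketch of a left join on that key. Everything in it is a placeholder: the field names, the ID value, and the function `link_by_source_id` are hypothetical and do not reflect the actual layout of the Scopus source list or of this repository's data.

```python
def link_by_source_id(journals, scopus_sources):
    """Attach the matching source-list record to each journal (left join)."""
    by_id = {s["source_id"]: s for s in scopus_sources}
    linked = []
    for journal in journals:
        # Journals without a known ID keep scopus=None instead of being dropped.
        match = by_id.get(journal.get("source_id"))
        linked.append({**journal, "scopus": match})
    return linked


# Placeholder records; the ID and extra field are invented for illustration.
journals = [
    {"title": "Otolaryngology\u2013Head and Neck Surgery",
     "publisher": "SAGE", "source_id": "EXAMPLE-1"},
    {"title": "Journal without an ID", "publisher": "SAGE"},
]
scopus_sources = [{"source_id": "EXAMPLE-1", "extra_field": "placeholder"}]

linked = link_by_source_id(journals, scopus_sources)
```

Joining on a stable ID rather than on the title also sidesteps the encoding problem above, since titles would no longer need to match character for character.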
