Encode unicode characters in UTF8 #1

Open
jeroenbaas opened this issue Mar 24, 2021 · 3 comments

Comments

@jeroenbaas
Contributor

Some titles contain special characters that have been converted to ASCII escape sequences, such as:

Otolaryngology<U+0096>Head and Neck Surgery

It would be helpful if these could be encoded in UTF-8.
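
For illustration, here is a minimal R sketch of how such escapes could be turned back into UTF-8 characters after the fact. It assumes that the `<U+0096>` in the example is the Windows-1252 byte 0x96 (an en dash) that was read as a code point, which matches the correctly encoded title quoted elsewhere in this thread. `fix_u_escapes` is a hypothetical helper, not part of the repository, and it only handles four-digit escapes.

```r
# Hypothetical helper (not part of the repository): replace four-digit
# "<U+XXXX>" escape sequences in a character vector with real characters.
fix_u_escapes <- function(x) {
  pattern <- "<U\\+[0-9A-Fa-f]{4}>"
  decode_one <- function(s) {
    if (is.na(s)) return(NA_character_)
    repeat {
      m <- regexpr(pattern, s, perl = TRUE)
      if (m == -1) break
      token <- regmatches(s, m)  # e.g. "<U+0096>"
      code <- strtoi(substr(token, 4, nchar(token) - 1), base = 16L)
      ch <- NA_character_
      if (code >= 0x80 && code <= 0x9F) {
        # Assumption: code points in this range are really Windows-1252
        # bytes, so 0x96 becomes an en dash (U+2013).
        ch <- iconv(rawToChar(as.raw(code)), from = "CP1252", to = "UTF-8")
      }
      # Bytes undefined in Windows-1252, and all other code points,
      # are converted directly.
      if (is.na(ch)) ch <- intToUtf8(code)
      # Still undecodable (e.g. a lone surrogate): leave the rest as-is.
      if (is.na(ch)) break
      s <- sub(token, ch, s, fixed = TRUE)
    }
    s
  }
  vapply(x, decode_one, character(1), USE.NAMES = FALSE)
}

fix_u_escapes("Otolaryngology<U+0096>Head and Neck Surgery")
```

A repair like this only treats the symptom; it cannot tell which encoding the original scrape actually used, so the Windows-1252 mapping should be checked against a few known titles before applying it to the whole file.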

@jeroenbaas
Contributor Author

I tried to reproduce the issue above by forking the code and running Analysis_06_Extract-Journals.R. On a system with WSL/Ubuntu, it produced an alljournals-.csv file with correct encodings, i.e.:
"Otolaryngology–Head and Neck Surgery","https://journals.sagepub.com/home/otoj","SAGE",2021-03-22
versus what is in the provided CSV, alljournals-2021-02-05.csv:
"Otolaryngology<U+0096>Head and Neck Surgery","https://journals.sagepub.com/home/otoj","SAGE",NA

@andreaspacher
Owner

Thanks, @jeroenbaas, for taking such a deep look.

Now that I have repeated the same procedure of scraping SAGE journals via Analysis_06_Extract-Journals.R, I had the same experience as you: it worked without any encoding errors (even after saving the list of SAGE journals with write.csv(), without adding an explicit UTF-8 encoding setting).

I believe that I originally used a different computer for scraping the journals. Possibly the encoding settings differed there.
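
One way to make the export less dependent on a particular machine is to inspect the session locale and state the file encoding explicitly. The following is a minimal sketch, assuming a data frame shaped like the CSV rows quoted in this thread; the `journals` object and the temporary file are illustrative and not taken from Analysis_06_Extract-Journals.R.

```r
# Inspect the session's locale; a non-UTF-8 locale is one plausible
# (unconfirmed) source of "<U+XXXX>" escapes in exported text.
Sys.getlocale("LC_CTYPE")
l10n_info()[["UTF-8"]]  # TRUE when the session runs in a UTF-8 locale

# Illustrative data only, modelled on the row quoted in this thread.
journals <- data.frame(
  title = "Otolaryngology\u2013Head and Neck Surgery",
  url = "https://journals.sagepub.com/home/otoj",
  publisher = "SAGE",
  stringsAsFactors = FALSE
)

# Write with an explicit file encoding instead of relying on the default.
out <- tempfile(fileext = ".csv")
write.csv(journals, out, row.names = FALSE, fileEncoding = "UTF-8")

# Read back, marking the strings as UTF-8.
back <- read.csv(out, encoding = "UTF-8", stringsAsFactors = FALSE)
```

Whether the explicit `fileEncoding` alone avoids the escapes depends on the R version and platform, so comparing the locale output on both computers would be a sensible first check.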

Anyway, in 6937419, I fixed some of the gravest encoding issues (though only after the fact).

However, Chinese characters as well as Korean and Russian letters still need to be fixed.

But again, thank you, @jeroenbaas, for pointing out this issue. I really should take a better look at all the encoding-related aspects.

@jeroenbaas
Contributor Author

jeroenbaas commented Mar 26, 2021

No worries. When I get a chance, I will also take a closer look at how the Scopus title list is used. From what I could quickly tell, it only serves as a count of journals per publisher. There is potentially much more value in the source list, and linking it to your final dataset (e.g. by carrying the Scopus Source ID) may unlock a lot of analytical power further downstream.
Disclosure: I'm employed by Elsevier (not on the Scopus team), in the Analytical / Data Services group, and as such spend a lot of quality time with this data.

2 participants