Internal country name representation #40

Closed
koen-vg opened this issue Jul 22, 2021 · 9 comments
Labels
help wanted Extra attention is needed

Comments

@koen-vg
Collaborator

koen-vg commented Jul 22, 2021

For the whole code-base, we need to agree on a way to internally represent country names. For WP2, I have adopted ISO 3166 two-letter country codes. For OSM data extraction, country names are used (see also this note in #37). We should find a convention and stick to it.

Personally, I would argue that ISO 3166 country codes (https://www.iso.org/iso-3166-country-codes.html) are the way to go, at least for internal representation in code. In WP2, I have already had to patch the powerplantmatching tool to work with two-letter country codes, because the different databases being merged use different names for some countries. Country names depend on language, have short and long forms, and sometimes contain special characters (e.g. Côte d'Ivoire) which may or may not be converted to an ASCII equivalent depending on the data source. I therefore think we are setting ourselves up for trouble if we use full country names in code.
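To make the name problem concrete, here is a minimal sketch of what matching full names across data sources tends to require. The alias table and the helper names (`ascii_fold`, `name_to_alpha2`) are hypothetical and only illustrate the idea; a real table would have to be built from the spellings actually found in each source.

```python
import unicodedata


def ascii_fold(name: str) -> str:
    """Strip accents and case so spelling variants of a name compare equal."""
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold().strip()


# Hypothetical alias table: several spellings per country, one ISO code each.
NAME_VARIANTS = {
    "CI": ["Côte d'Ivoire", "Cote d'Ivoire", "Ivory Coast"],
    "TZ": ["Tanzania", "Tanzania, United Republic of"],
}

# Invert the table so that every folded spelling points at its code.
NAME_TO_ALPHA2 = {
    ascii_fold(variant): code
    for code, variants in NAME_VARIANTS.items()
    for variant in variants
}


def name_to_alpha2(name: str) -> str:
    """Return the two-letter code for a known spelling, else raise KeyError."""
    return NAME_TO_ALPHA2[ascii_fold(name)]
```

Every new spelling needs a new alias entry, whereas a two-letter code needs no such table once the data has been converted at the point of import.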

Of course, when it comes to presentation, we should use full country names. There is already the dictionary at https://github.com/pypsa-meets-africa/pypsa-africa/blob/main/scripts/iso_country_codes.py, and the python package pycountry also provides easy tools for working with country names.

The alternative of using full country names internally is of course also possible, but then we need to at the very least have a strict standard for which form of the names we use. Let's discuss!

As a side note, I think that using full names internally in PyPSA-Eur works, but even there it might have been easier to just go for the country codes. I will probably raise the issue with powerplantmatching at least, and see whether upstream is interested in using two-letter country codes instead of full names (at least internally within powerplantmatching).

@koen-vg
Collaborator Author

koen-vg commented Jul 22, 2021

We can use this issue to discuss, and maybe also to keep track of what needs to be done in the various work packages to implement whatever solution we come up with here.

@mnm-matin
Member

Full country names were used in early stages to assist development. Now that we are merging databases we can fully migrate to two-letter codes.

@mnm-matin
Member

mnm-matin commented Jul 22, 2021

relevant line 1

relevant line 2

Those two functions should be merged as mentioned in #35

@koen-vg
Collaborator Author

koen-vg commented Jul 22, 2021

Okay, sounds good! We should think about documenting this at some point...

If needed, there are some nice Python packages for working with countries. I have used these, for example, to get a dictionary with all African country codes and names:

import pycountry as pc
import pycountry_convert as pcc

african_countries = []
for country in pc.countries:
    try:
        if pcc.country_alpha2_to_continent_code(country.alpha_2) == 'AF':
            african_countries.append(country)
    except KeyError:
        # pycountry_convert has no continent for some codes; skip those.
        pass

african_countries_map = {c.alpha_2: c.name for c in african_countries}

The names we get from pycountry might also be a little more presentable than the current AFRICA_CC internal OSM names.

Also: yes, I get that Senegal and the Gambia might as well be one country for large-scale energy systems purposes, but could we avoid creating our own country code SNGM? I feel like this might get us in trouble in the future, and I would rather stick to ISO 3166 strictly.

@mnm-matin
Member

Thanks! SNGM definitely needs to be fixed. Unfortunately, we receive a single pbf file from Geofabrik for Senegal and the Gambia.

We could:

  • Split the data (based on coordinates) into Senegal and the Gambia. (We would first have to check whether this is feasible, since the transmission networks may be significantly interconnected.)
  • Use full country names as a standard (downsides mentioned above and I would add that it leads to reduced readability and significantly larger file sizes)
  • Treat it as an edge case with a 4-letter code and document it.
  • ???

I would suggest temporarily removing support for Senegal and the Gambia until the splitting of the data can be confirmed and implemented (in the data cleaning, not the extraction). If splitting is not possible, we can then explore the merits of the other options.

Although off-topic, this is probably a good example of the differences between Europe and Africa. As a result, both major and minor deviations from PyPSA-Eur might be necessary in the future.

@pz-max
Member

pz-max commented Jul 22, 2021

Hi guys,
I think the two-letter code should be our convention, for the reasons @koen-vg and @mnm-matin mentioned.
Providing a 2_letter_code_2_full_name function should keep things pretty readable afterwards (this function could be put in the iso_country_codes script).
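A minimal sketch of such a helper could look like the following. Python identifiers cannot start with a digit, so the function is named `two_letter_code_to_full_name` here; that name and the `ALPHA2_TO_NAME` table are illustrative assumptions, and in practice the dictionary in iso_country_codes.py (or a lookup through pycountry) would supply the names.

```python
# Illustrative excerpt only; a real table would cover every country in scope.
ALPHA2_TO_NAME = {
    "GM": "Gambia",
    "NG": "Nigeria",
    "SN": "Senegal",
    "ZA": "South Africa",
}


def two_letter_code_to_full_name(code, names=ALPHA2_TO_NAME):
    """Return the display name for an ISO 3166-1 alpha-2 code.

    Unknown codes are returned unchanged, so plots and logs still show
    something meaningful instead of failing.
    """
    return names.get(code.upper(), code)
```

Returning the code itself for unknown entries is a design choice for presentation code; a stricter variant could raise an error instead.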

@euronion
Collaborator

Strong endorsement of 2-letter country codes from my side.

That said, one has to check carefully whether each data source actually adheres to the codes. In my experience there are usually a handful of undocumented exceptions.
Some codes are also subject to dispute and may not be interpreted identically across sources, see e.g.: https://en.wikipedia.org/wiki/ISO_3166-1#Naming_and_disputes
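One way to run that check is to compare the codes found in a data source against the set of codes we accept. The sketch below is an assumption about how this could be done, not existing project code; `find_unknown_codes` is a hypothetical helper, and the accepted set would come from wherever the project keeps its country list.

```python
def find_unknown_codes(codes, valid_codes):
    """Return the sorted, de-duplicated codes that are not in valid_codes."""
    return sorted(set(codes) - set(valid_codes))


# Example: codes as they might appear in a source column.
source_codes = ["SN", "GM", "SNGM", "UK", "SN"]
accepted = {"SN", "GM", "GB"}

unknown = find_unknown_codes(source_codes, accepted)
```

Here `unknown` contains "SNGM" and "UK", which are exactly the kind of entries that need a documented mapping before the data is merged.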

As for merged countries:
I'd also say to avoid creating our own 2-letter country code. What you could do is name the combined region "SN-GM".
This makes it clear that:

  • This is a merged region (ISO-3166 CCs are without "-")
  • The regions are "SN" and "GM"

It can also be extended to larger regions.
In general I am with @koen-vg that this should be avoided where possible. But if it cannot easily be avoided, using a combined country code might be the second-best option.
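As a rough sketch of how the "SN-GM" convention could be handled in code (the function names are hypothetical, and the check only validates the two-letter shape of each part, not whether the code is actually assigned in ISO 3166):

```python
import re

# Shape check only: two uppercase ASCII letters.
_ALPHA2 = re.compile(r"[A-Z]{2}")


def _check(code):
    if not _ALPHA2.fullmatch(code):
        raise ValueError(f"not a two-letter code: {code!r}")
    return code


def merge_region_code(*codes):
    """Join alpha-2 codes into a merged-region label such as 'SN-GM'."""
    return "-".join(_check(code) for code in codes)


def split_region_code(region):
    """Return the member codes of a label; a plain code yields one item."""
    return [_check(code) for code in region.split("-")]


def is_merged_region(region):
    """True if the label combines more than one country."""
    return len(split_region_code(region)) > 1
```

Because ISO 3166 alpha-2 codes never contain "-", a plain code such as "ZA" and a merged label such as "SN-GM" can pass through the same functions without ambiguity.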

@davide-f
Member

davide-f commented Jul 31, 2021

From a quick look around, pycountry is also used by PyPSA-Eur, and it may be the way to go. PyPSA-Eur seems to use the two-letter standard; we could follow it if appropriate.
By the way, there are a lot of disputes over territories, and I believe we should avoid problems on that front.

This code may be what we are looking for to convert between the different codes (alpha-2, alpha-3, etc.):

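import numpy as np
import pycountry as pyc

def _get_country(target, **keys):
    # Exactly one lookup key is expected, e.g. alpha_2="ZA".
    assert len(keys) == 1
    try:
        return getattr(pyc.countries.get(**keys), target)
    except (KeyError, AttributeError):
        # Unknown code or missing attribute: return NaN instead of raising.
        return np.nan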

Example of use:
3-letter from 2-letter: _get_country('alpha_3', alpha_2="ZA")
2-letter from 3-letter: _get_country('alpha_2', alpha_3="ZAF")

@davide-f
Member

Should we add it in _helpers?

5 participants