This repository has been archived by the owner on Jan 22, 2020. It is now read-only.

Handle different primary keys? #2

Open
lapidus opened this issue Jan 18, 2019 · 6 comments

Comments

@lapidus

lapidus commented Jan 18, 2019

I need to experiment a bit more with the library, but I'm not sure it has the functionality to specify primary keys or to generate correct keys for a sparse dataframe.

Currently it assumes that each non-measure is a dimension for every measure:

x['concept_type'] != 'measure'
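To make the assumption concrete, here is a minimal sketch of what "every non-measure is a dimension for every measure" implies for file names. It is not frame2package's actual code: the `concepts` table, its column names, and the concept types are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical concepts table; names and concept types are assumed.
concepts = pd.DataFrame({
    "concept": ["geo", "gender", "year", "lex", "gdp"],
    "concept_type": ["entity_domain", "entity_domain", "time",
                     "measure", "measure"],
})

is_measure = concepts["concept_type"] == "measure"
measures = concepts.loc[is_measure, "concept"].tolist()
# Every non-measure concept is treated as a key column for every measure.
dimensions = concepts.loc[~is_measure, "concept"].tolist()

key = "--".join(dimensions)
filenames = [f"ddf--datapoints--{m}--by--{key}" for m in measures]
print(filenames)
```

Under this assumption both measures share the single key geo, gender, year, regardless of which cells are actually filled in.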

I am thinking of a scenario where you have a more sparse frame:

geo, year, gender, lex, gdp
swe, 2000,  ,   , 25444
swe, 2000,  , 88,
swe, 2000, m, 88,
nor, 1970, m, 88,

I am not sure if this would occur in the wild or if one would try to make different dataframes?

But in the above case the files expected would be something like:

  1. ddf--datapoints--gdp--by--geo--year
  2. ddf--datapoints--lex--by--geo--year
  3. ddf--datapoints--lex--by--geo--gender--year

lapidus changed the title from "How to specify primary keys?" to "Handle different primary keys?" on Jan 18, 2019

@miroli
Collaborator

miroli commented Jan 18, 2019

Could you clarify what the variable lex refers to in the example? Never mind, found the data.

@miroli
Collaborator

miroli commented Jan 18, 2019

@lapidus I think I understand the issue, but for the sake of clarity, could you very briefly specify expected vs actual output of the data in the example?

@lapidus
Author

lapidus commented Jan 18, 2019

Overall I think I need to experiment a bit more and understand when exactly we would export multiple indicators from one big dataframe vs having multiple dataframes that generate one indicator each.

But the scenario I described above would result in this actual output:

Primary key: geo-gender-year
ddf--datapoints--gdp--by--geo--gender--year
ddf--datapoints--lex--by--geo--gender--year
ddf--datapoints--lex--by--geo--gender--year

Where the preferred output is:

Primary key: varying depending on data source availability
ddf--datapoints--gdp--by--geo--year
ddf--datapoints--lex--by--geo--year
ddf--datapoints--lex--by--geo--gender--year
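
One way the preferred file names could be derived is to look, for each measure, at which dimension columns are actually filled in on the rows where that measure has a value. The sketch below is an illustration only, not part of frame2package; the function name `datapoint_files` and the explicit `dimensions` and `measures` lists are assumptions.

```python
import numpy as np
import pandas as pd

# The sparse sample frame from the first comment.
df = pd.DataFrame({
    "geo": ["swe", "swe", "swe", "nor"],
    "year": [2000, 2000, 2000, 1970],
    "gender": [None, None, "m", "m"],
    "lex": [np.nan, 88, 88, 88],
    "gdp": [25444, np.nan, np.nan, np.nan],
})

def datapoint_files(df, dimensions, measures):
    """Return one file name per (measure, set of filled-in dimensions)."""
    names = set()
    for measure in measures:
        rows = df[df[measure].notna()]
        # Each row of `present` flags which dimensions are filled in.
        for present in rows[dimensions].notna().to_numpy():
            key = [d for d, ok in zip(dimensions, present) if ok]
            names.add(f"ddf--datapoints--{measure}--by--" + "--".join(key))
    return sorted(names)

files = datapoint_files(df, ["geo", "gender", "year"], ["gdp", "lex"])
print("\n".join(files))
```

The design choice here is that an empty dimension cell means "not disaggregated by this dimension", which is itself an assumption about how the sparse frame should be read.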

@lapidus
Author

lapidus commented Jan 18, 2019

Maybe let's simply put this to the test with 3-4 different data sources and see if we can streamline further :)

For example these use cases — Produce a DDF from:

  • 3-4 indicators from SCB with different dimensionality
  • Same with Kolada
  • Same with "Daniel's big democracy file" (= one long file)
  • Other examples ...

I'll try some things from my side; I might submit issues or pull requests :)

@miroli
Collaborator

miroli commented Jan 18, 2019

I think there are two issues at play here.

1. Tidy data
I think it's reasonable to let frame2package assume the input data always adheres to the tidy data format. In this case, I believe the sample data fails to meet requirement no. 3, "Each type of observational unit forms a table," since GDP by its nature describes whole populations/countries, whereas life expectancy can refer to segments of the population. So these probably belong in two different tables?

2. Disaggregation levels
According to the DDF specs:

If you have different disaggregation levels, each level gets its own file. This is because the disaggregation dimensions are the (compound) primary key. With a different disaggregation, there's a different primary key and thus a different table.

which I believe is what you are referring to with the preferred output example? I will have to have a think about how to deal with this in an automated fashion.
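
As a rough idea of what an automated approach might look like, the sketch below splits one measure into a separate table per disaggregation level, keyed by the file name the table would be written to. This is not the fix that ended up in frame2package; the helper `split_by_disaggregation` is hypothetical, and it assumes an empty dimension cell means the row is not disaggregated by that dimension.

```python
import numpy as np
import pandas as pd

# The sparse sample frame from the first comment.
df = pd.DataFrame({
    "geo": ["swe", "swe", "swe", "nor"],
    "year": [2000, 2000, 2000, 1970],
    "gender": [None, None, "m", "m"],
    "lex": [np.nan, 88, 88, 88],
    "gdp": [25444, np.nan, np.nan, np.nan],
})

def split_by_disaggregation(df, dimensions, measure):
    """Map file name -> table, one table per set of filled-in dimensions."""
    rows = df[df[measure].notna()]
    present = rows[dimensions].notna().to_numpy()
    tables = {}
    for pattern in {tuple(flags) for flags in present}:
        key = [d for d, ok in zip(dimensions, pattern) if ok]
        # Select the rows whose filled-in dimensions match this pattern.
        mask = (present == np.array(pattern)).all(axis=1)
        name = f"ddf--datapoints--{measure}--by--" + "--".join(key)
        tables[name] = rows.loc[mask, key + [measure]].reset_index(drop=True)
    return tables

tables = split_by_disaggregation(df, ["geo", "gender", "year"], "lex")
for name, table in sorted(tables.items()):
    print(name)
    print(table)
```

Each resulting table contains only its own key columns plus the measure, so the compound primary key differs per table, in line with the quoted passage from the specs.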

Please let me know if I've misunderstood something. :)

@miroli
Collaborator

miroli commented Jan 23, 2019

I believe number 2 above and the original question in this issue were resolved with this commit. Please let me know if that is not the case.
