Skip to content

Commit

Permalink
Ready for publication
Browse files Browse the repository at this point in the history
  • Loading branch information
AJdeMooij committed Jun 20, 2024
1 parent 321563b commit bdd1d22
Show file tree
Hide file tree
Showing 14 changed files with 2,555 additions and 3 deletions.
2 changes: 1 addition & 1 deletion .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -159,4 +159,4 @@ cython_debug/
# be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
# and can be added to the global gitignore or merged into this file. For a more nuclear
# option (not recommended) you can uncomment the following to ignore the entire idea folder.
#.idea/
.idea/
64 changes: 64 additions & 0 deletions CITATION.cff
Original file line number Diff line number Diff line change
@@ -0,0 +1,64 @@
# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!

cff-version: 1.2.0
title: >-
GenSynthPop-Python: Generating a Spatially Explicit Synthetic
Population of Agents and Households from Aggregated Data
message: >-
If you use this software please cite as below
type: software
authors:
- given-names: Jan
family-names: Mooij
email: [email protected]
affiliation: >-
Intelligent Systems, Information and Computing
Sciences, Utrecht University, Princetonplein 5,
Utrecht, 3584 CC, The Netherlands
orcid: 'https://orcid.org/0000-0003-4129-6074'
name-particle: de
- given-names: Tabea
family-names: Sonnenschein
email: [email protected]
affiliation: >-
Department of Human Geography and Spatial Planning,
Utrecht University, Heidelberglaan 8, Utrecht, 3584
CS, The Netherlands
orcid: 'https://orcid.org/0000-0001-6592-9548'
- family-names: Pellegrino
given-names: Marco
- given-names: Dastani
family-names: Mehdi
affiliation: >-
Intelligent Systems, Information and Computing
Sciences, Utrecht University, Princetonplein 5,
Utrecht, 3584 CC, The Netherlands
- given-names: Dick
family-names: Ettema
affiliation: >-
Department of Human Geography and Spatial Planning,
Utrecht University, Heidelberglaan 8, Utrecht, 3584
CS, The Netherlands
- orcid: 'https://orcid.org/0000-0003-0648-7107'
given-names: Brian
family-names: Logan
affiliation: >-
Intelligent Systems, Information and Computing
Sciences, Utrecht University, Princetonplein 5,
Utrecht, 3584 CC, The Netherlands
- given-names: Judith A.
family-names: Verstegen
affiliation: >-
Department of Human Geography and Spatial Planning,
Utrecht University, Heidelberglaan 8, Utrecht, 3584
CS, The Netherlands
orcid: 'https://orcid.org/0000-0002-9082-4323'
identifiers:
- type: doi
value: 10.5281/zenodo.11474110
repository-code: >-
https://github.com/A-Practical-Agent-Programming-Language/GenSynthPop-Python
repository: >-
https://github.com/A-Practical-Agent-Programming-Language/Synthetic-Population-The-Hague-South-West
license: GPL-3.0
674 changes: 674 additions & 0 deletions LICENSE.txt

Large diffs are not rendered by default.

181 changes: 179 additions & 2 deletions README.md
Original file line number Diff line number Diff line change
@@ -1,3 +1,180 @@
This repository is a placeholder that will be populated soon.
[![DOI](https://zenodo.org/badge/810318631.svg)](https://zenodo.org/doi/10.5281/zenodo.11474109)

See https://www.researchsquare.com/article/rs-3405645 for details
# GenSynthPop

This repository contains the implementation of GenSynthPop,
a sample-free tool to construct Synthetic Populations and Households from mixed-aggregation contingency tables

The work in this repository is described in
*GenSynthPop: Generating a Spatially Explicit Synthetic Population of Agents and Households from Aggregated Data*[^1].

[^1]: Marco Pellegrino, Jan de Mooij, Tabea Sonnenschein et al. *GenSynthPop: Generating a Spatially Explicit Synthetic
Population of Agents and Households from Aggregated Data*, 09 October 2023, PREPRINT (Version 1) available at Research
Square [https://doi.org/10.21203/rs.3.rs-3405645/v1]

An R implementation of this library is available [here](https://github.com/TabeaSonnenschein/GenSynthPop)

A reference implementation is available
[here](https://github.com/A-Practical-Agent-Programming-Language/Synthetic-Population-The-Hague-South-West)

## Intuition

This library allows generating a synthetic population from contingency tables and marginal data one attribute at the
time. It does not assume the presence of a detailed sample. Refer to the
[reference implementation](https://github.com/A-Practical-Agent-Programming-Language/Synthetic-Population-The-Hague-South-West)
for details.

Generally when data is published there is a trade-off in accuracy between the joint distribution of attributes and the
spatial granularity. Detailed data may be available for very small regions, but only contain a few or even one
attribute.
In order to achieve the best spatial heterogeneity, one would prefer to use those data. However, in order to get an
accurate representation of the population as a whole, the joint distirbution of attributes is relevant.

This library allows combining both. It assumes for any given attribute, there is marginal data available at a high
spatial resolution and a contingency table at lower levels of spatial resolution, and allows combining the two.

With this library, each attribute is added to the population one at a time, conditioned (based on the contingency table)
on previously added attributes but constrained on spatial details (based on the high spatial resolution marginal data).

Note that this library assumes for each attribute, a contingency table is available or can be obtained that exhaustively
lists the possible categorical values the attribute can take. If this table is not available as is, data preparation is
required before this library can be used

## Generating a synthetic population

In the reference implementation, the highest spatial resolution data was available at the level of a neighborhood, so
that example will be maintained here. If only one level of spatial resolution is available, this library provides
less utility, but a column with a single value can be added to the source data to still use this library.

## Generate individuals

Start by instantiating a data frame with agent IDs located in each of the neighborhoods:

```python
agent_ids = list()
agent_neighborhoods = list()
agent_count = 0
for neighb_code, (neighb_total) in read_marginal_data(['population'], 'population').iterrows():
agent_ids += [f"SA{i + agent_count:06d}" for i in range(neighb_total.iloc[0])]
agent_neighborhoods += [neighb_code] * neighb_total.iloc[0]
agent_count += neighb_total.iloc[0]
df_synthetic_population = pd.DataFrame(data=dict(agent_id=agent_ids, neighb_code=agent_neighborhoods))
```

Next, add an attribute. For example age group. This is done through the
[Conditional Attribute Adder](gensynthpop/conditional_attribute_adder.py).

Pass in the synthetic population created in the first step, a contingency table that for each age group in the index
specifies the number of people in that age group in each neighborhood.

The `group_by` clause indicates by what column(s) the data is split into high spatial resolution.

```python3
from gensynthpop.conditional_attribute_adder import ConditionalAttributeAdder

df_age_group = pd.read_csv('age_group_marginal_data.csv')
df_synthetic_population = ConditionalAttributeAdder(
df_synthetic_population=df_synthetic_population,
df_contingency=df_age_group,
target_attribute='age_group',
group_by=['neighb_code']
).run()
```

Next, an attribute can be added conditioned on a previous attribute, provided a contingency table containing at least
that previous attribute is available.

Best results are achieved if the Iterative Proportional Fitting procedure
(e.g. [ipfn-python](https://github.com/AJdeMooij/ipfn/tree/bugfix/pandas-sort-frames)) is applied to the contingency
table first to fit the contingency table to the margins of the attributes already added.

The process is similar to before, but can now specify neighborhood-specific margins by calling `add_margins`. The
arguments work the same as with the
[ipfn-python](https://github.com/AJdeMooij/ipfn/tree/bugfix/pandas-sort-frames) library

```python3
from gensynthpop.conditional_attribute_adder import ConditionalAttributeAdder

df_margins_gender = pd.read_csv('marginal_gender_by_neighborhood.csv')
df_margins_age_group = synthetic_population_to_contingency(
df_synth_pop, ["neighb_code", "age_group"], True).reset_index()

df_gender_contingency = read_and_fit_gender_contingency_table(df_margins_age_group)

df = ConditionalAttributeAdder(
df_synthetic_population=df_synthetic_population,
df_contingency=df_gender_contingency,
target_attribute="gender",
group_by=["neighb_code"]
).add_margins(
margins=[df_margins_age_group, df_margins_gender],
margins_names=[["age_group"], ["gender"]]
).run()
``````

The process can be repeated for as many attributes as necessary.

## Generate households

After sufficient attributes are added to the individuals, they can be partitioned into households.

The first step is to determine the types of households (e.g., singles, couples without children, couples with 1 child,
couples with 2 children, etc...).

Next, each agent is assigned a *household position* as either an adult or (in the case of households with children)
child in one of those exact households. If a contingency table is available, that is great, but most likely, this
contingency table needs to be constructed. Once it is, add it as an individual-level attribute.

Next, the [Household Grouper](gensynthpop/household_grouper.py) can be used to partition the agents into households
based on their household position. The household grouper aims at placing each agent into a household and generates
households as required. It may switch positions of agents if necessary to fulfill the household constraints.

In order to create the households, three Pandas Series are required:

1) The typical gender distribution between adult partners (`male-female`, `male-male` and `female-female` as index)
2) The typical age disparity between partners of the same gender, where the index defines a range
as `<range_start>-<range_end>`
3) The typical age disparity between a mother and oldest child, with the range again as index.

Each series should specify the count or relative frequency of each group. If data is not available, a series can be
constructed
with just one record, e.g., `0-999 = 1`, but ideally, these series are constructed from available data for the region of
interest.

Each `HouseholdType` object takes the name of the household and the three distributions as an argument. Next, its
constituent members are specified by household type. A household can have two types of members, `adult` and `child`, and
for each type, the `household position` that was added as an individual level attribute earlier should be specified,
as well as the number of individuals of that type in the household and the household positions from which members can
be taken as a backup.

For example, a household with two parents (with household position `couple_with_1_child`)
and one child (with household position `child_of_couple_with_1_children`) may be specified as follows:

```python3
couple_with_1_child = HouseholdType(
'couple_with_1_child',
df_couple_gender_distribution,
df_couple_age_distribution,
df_parent_child_age_distribution
).add_members(
'child_of_couple_with_1_children', 'child', 1, []
).add_members('couple_with_1_child', 'adult', 2, ['couple_no_children', 'single'])
```

If a household has no children, the first call to `add_members` can be omitted. To create single households, the third
argument of the second call to `add_members` can additionally be changed to `1`.

Once all household types are specified, they can be added to the grouper, which can then create the households:

```python3
hh_grouper = HouseholdGrouper(df_synth_pop, ['neighb_code'], 'household_position')
hh_grouper.add_household_type(couple_with_1_child)
df_synthetic_population, df_synthetic_households = hh_grouper.run()
```

## Next steps

When the households have been created, additional attributes can still be added to the synthetic population of
individuals,
or the households can be decorated with additional attributes. The same `ConditionalAttributeAdder` can still be used
for either.
Empty file added gensynthpop/__init__.py
Empty file.
Loading

0 comments on commit bdd1d22

Please sign in to comment.