Merged
Changes from 5 commits
49 changes: 49 additions & 0 deletions statvar_imports/statistics_poland/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,49 @@
# Poland Demographics Dataset
## Overview
This dataset provides foundational demographic and socio-economic statistics for Poland, sourced directly from official national datasets.

## Data Source

**Source URL:**
https://stat.gov.pl/en/databases/


The data comes from Poland's official statistical authority and includes comprehensive demographic variables such as population counts, age distributions, and other census-related metrics.

## How To Download Input Data
To generate the input data, run the provided script `download_input_data.py`. The script creates a `poland_input` folder and writes `StatisticsPoland_input.csv` there, which serves as the input file. It requires `poland_data_sample/poland_raw.xlsx` to be present in order to identify the file structure.

Type of place: State.

Statvars: Demographics.

Years: 2003 to 2024.

## Processing Instructions
To process the Poland Census data and generate statistical variables, use the following commands from the "data" directory:

**Download input file**

```bash
python3 statvar_imports/statistics_poland/download_input_data.py
```

**For Test Data Run**
```bash
python3 tools/statvar_importer/stat_var_processor.py \
--input_data=statvar_imports/statistics_poland/test/StatisticsPoland_input.csv \
--pv_map=statvar_imports/statistics_poland/StatisticsPoland_pvmap.csv \
--output_path=statvar_imports/statistics_poland/test/StatisticsPoland_output \
--config_file=statvar_imports/statistics_poland/Statistics_Poland_metadata.csv \
--existing_statvar_mcf=gs://unresolved_mcf/scripts/statvar/stat_vars.mcf
```

**For Main Data Run**
```bash
python3 tools/statvar_importer/stat_var_processor.py \
--input_data=statvar_imports/statistics_poland/poland_input/StatisticsPoland_input.csv \
--pv_map=statvar_imports/statistics_poland/StatisticsPoland_pvmap.csv \
--output_path=statvar_imports/statistics_poland/poland_output/StatisticsPoland_output \
--config_file=statvar_imports/statistics_poland/Statistics_Poland_metadata.csv \
--existing_statvar_mcf=gs://unresolved_mcf/scripts/statvar/stat_vars.mcf
```
68 changes: 68 additions & 0 deletions statvar_imports/statistics_poland/StatisticsPoland_pvmap.csv
@@ -0,0 +1,68 @@
key,p1,v1,p2,v2,p3,v3,p4,v4,p5,v5
,,,,,,,,,,
Code,measuredProperty,count,populationType,Person,,,,,,
males,gender,Male,,,,,,,,
females,gender,Female,,,,,,,,
total,,,,,,,,,,
in urban areas,placeOfResidenceClassification,Urban,,,,,,,,
in rural areas,placeOfResidenceClassification,Rural,,,,,,,,

0-2,age,Years0To2,,,,,,,,
3-6,age,Years3To6,,,,,,,,
7-12,age,Years7To12,,,,,,,,
13-15,age,Years13To15,,,,,,,,
16-19,age,Years16To19,,,,,,,,
20-24,age,Years20To24,,,,,,,,
25-34,age,Years25To34,,,,,,,,
35-44,age,Years35To44,,,,,,,,
45-54,age,Years45To54,,,,,,,,
55-64,age,Years55To64,,,,,,,,
65 and more,age,Years65Onwards,,,,,,,,
,,,,,,,,,,
,,,,,,,,,,
POLAND,observationAbout,country/POL,#Header,observationAbout,,,,,,
DOLNOŚLĄSKIE,observationAbout,wikidataId/Q54150,#Header,observationAbout,,,,,,
KUJAWSKO-POMORSKIE,observationAbout,nuts/PL61,#Header,observationAbout,,,,,,
LUBELSKIE,observationAbout,wikidataId/Q54155,#Header,observationAbout,,,,,,
LUBUSKIE,observationAbout,wikidataId/Q54157,#Header,observationAbout,,,,,,
ŁÓDZKIE,observationAbout,nuts/PL71,#Header,observationAbout,,,,,,
MAŁOPOLSKIE,observationAbout,nuts/PL21,#Header,observationAbout,,,,,,
MAZOWIECKIE,observationAbout,wikidataId/Q54169,#Header,observationAbout,,,,,,
OPOLSKIE,observationAbout,wikidataId/Q54171,#Header,observationAbout,,,,,,
PODKARPACKIE,observationAbout,wikidataId/Q54175,#Header,observationAbout,,,,,,
PODLASKIE,observationAbout,wikidataId/Q54177,#Header,observationAbout,,,,,,
POMORSKIE,observationAbout,wikidataId/Q1288480,#Header,observationAbout,,,,,,
ŚLĄSKIE,observationAbout,wikidataId/Q54181,#Header,observationAbout,,,,,,
ŚWIĘTOKRZYSKIE,observationAbout,nuts/PL72,#Header,observationAbout,,,,,,
WARMIŃSKO-MAZURSKIE,observationAbout,wikidataId/Q54184,#Header,observationAbout,,,,,,
WIELKOPOLSKIE,observationAbout,wikidataId/Q54187,#Header,observationAbout,,,,,,
ZACHODNIOPOMORSKIE,observationAbout,wikidataId/Q54188,#Header,observationAbout,,,,,,
,,,,,,,,,,
2003,observationDate,2003,value,{Number},,,,,,
2004,observationDate,2004,value,{Number},,,,,,
2005,observationDate,2005,value,{Number},,,,,,
2006,observationDate,2006,value,{Number},,,,,,
2007,observationDate,2007,value,{Number},,,,,,
2008,observationDate,2008,value,{Number},,,,,,
2009,observationDate,2009,value,{Number},,,,,,
2010,observationDate,2010,value,{Number},,,,,,
2011,observationDate,2011,value,{Number},,,,,,
2012,observationDate,2012,value,{Number},,,,,,
2013,observationDate,2013,value,{Number},,,,,,
2014,observationDate,2014,value,{Number},,,,,,
2015,observationDate,2015,value,{Number},,,,,,
2016,observationDate,2016,value,{Number},,,,,,
2017,observationDate,2017,value,{Number},,,,,,
2018,observationDate,2018,value,{Number},,,,,,
2019,observationDate,2019,value,{Number},,,,,,
2020,observationDate,2020,value,{Number},,,,,,
2021,observationDate,2021,value,{Number},,,,,,
2022,observationDate,2022,value,{Number},,,,,,
2023,observationDate,2023,value,{Number},,,,,,
2024,observationDate,2024,value,{Number},,,,,,
2025,observationDate,2025,value,{Number},,,,,,
2026,observationDate,2026,value,{Number},,,,,,
2027,observationDate,2027,value,{Number},,,,,,
2028,observationDate,2028,value,{Number},,,,,,
2029,observationDate,2029,value,{Number},,,,,,
2030,observationDate,2030,value,{Number},,,,,,
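To illustrate how rows in this map can be read, here is a minimal sketch. It is not the logic of `stat_var_processor.py`; it only assumes that each row pairs a `key` with up to five property/value columns and that empty cells are skipped. The helper name `load_pvmap` and the inline sample are hypothetical.

```python
import csv
import io

# A few rows copied from the map above; the helper below is illustrative only.
SAMPLE = """key,p1,v1,p2,v2,p3,v3,p4,v4,p5,v5
Code,measuredProperty,count,populationType,Person,,,,,,
males,gender,Male,,,,,,,,
total,,,,,,,,,,
0-2,age,Years0To2,,,,,,,,
2003,observationDate,2003,value,{Number},,,,,,
"""


def load_pvmap(text):
    """Return {key: {property: value}} for every row, skipping empty cells."""
    pvmap = {}
    for row in csv.DictReader(io.StringIO(text)):
        pairs = {}
        for i in range(1, 6):
            prop, val = row.get(f"p{i}"), row.get(f"v{i}")
            if prop and val:
                pairs[prop] = val
        pvmap[row["key"]] = pairs
    return pvmap


pvmap = load_pvmap(SAMPLE)

# Combine the pairs contributed by several header cells of one data column.
merged = {**pvmap["Code"], **pvmap["0-2"], **pvmap["males"], **pvmap["2003"]}
print(merged)
```

In this reading, a key such as `total` contributes no property/value pairs, so a "total" column carries no extra constraint beyond the other header cells.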
13 changes: 13 additions & 0 deletions statvar_imports/statistics_poland/Statistics_Poland_metadata.csv
@@ -0,0 +1,13 @@
config,value
provenance_url,https://bdl.stat.gov.pl/bdl/dane/podgrup/tablica
output_columns,"observationDate,observationAbout,value,variableMeasured"
places_within,country/POL
#place_types,"AdministrativeArea,AdministrativeArea1,AdministrativeArea2,State"
#debug,1
#input_rows,100
#word_delimiter,''
#skip_rows,1
header_rows,5
mapped_columns,2
dc_api_root,https://api.datacommons.org
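As a reading aid, the following hedged sketch loads a config of this shape into a dictionary. It assumes that keys starting with `#` (such as `#debug`) are disabled settings to be ignored; that is an assumption about the convention used here, not confirmed importer behavior. `load_config` is a hypothetical helper.

```python
import csv
import io

# Subset of the config above, inlined so the sketch is self-contained.
CONFIG_TEXT = '''config,value
provenance_url,https://bdl.stat.gov.pl/bdl/dane/podgrup/tablica
output_columns,"observationDate,observationAbout,value,variableMeasured"
places_within,country/POL
#debug,1
header_rows,5
mapped_columns,2
'''


def load_config(text):
    """Parse 'config,value' rows, skipping the header and '#'-prefixed keys."""
    config = {}
    reader = csv.reader(io.StringIO(text))
    next(reader)  # skip the 'config,value' header row
    for row in reader:
        if len(row) < 2 or row[0].startswith("#"):
            continue
        config[row[0]] = row[1]
    return config


config = load_config(CONFIG_TEXT)
output_columns = config["output_columns"].split(",")
print(config["header_rows"], output_columns)
```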

82 changes: 82 additions & 0 deletions statvar_imports/statistics_poland/download_input_data.py
@@ -0,0 +1,82 @@
import pandas as pd
import os
import logging
from datetime import datetime

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(levelname)s: %(message)s'
)

# Configuration
INPUT_FILE = "statvar_imports/statistics_poland/poland_data_sample/poland_raw.xlsx"
OUTPUT_DIR = "statvar_imports/statistics_poland/poland_input"
OUTPUT_FILE = os.path.join(OUTPUT_DIR, "StatisticsPoland_input.csv")

# Target functional age groups
TARGET_AGES = [
    "0-2", "3-6", "7-12", "13-15", "16-19", "20-24",
    "25-34", "35-44", "45-54", "55-64", "65 i więcej"
]

def process_poland_pivot():
    if not os.path.exists(INPUT_FILE):
        logging.error(f"{INPUT_FILE} not found.")
        return

    logging.info(f"Starting generic processing. Saving to: {OUTPUT_FILE}")

    try:
        # 1. Load the 'DANE' sheet
        df = pd.read_excel(INPUT_FILE, sheet_name='DANE')
        df.columns = ['Code', 'Name', 'Age', 'Sex', 'Location', 'Year', 'Value', 'Unit', 'Attr']

        # 2. Generic Filtering
        df = df[df['Age'].isin(TARGET_AGES)]

        # DYNAMIC YEAR LOGIC
        current_year = datetime.now().year
        available_years = sorted([y for y in df['Year'].unique() if y <= current_year])
        df = df[df['Year'].isin(available_years)]

        # 3. Translation Logic
        translations = {
            'mężczyźni': 'males',
            'kobiety': 'females',
            'ogółem': 'total',
            'w miastach': 'in urban areas',
            'na wsi': 'in rural areas',
            'POLSKA': 'POLAND',
            '65 i więcej': '65 and more'
        }

        # Refactored repetitive replace calls into a loop
        for col in ['Sex', 'Location', 'Name', 'Age']:
            df[col] = df[col].replace(translations)

        # 4. Create the Pivot Table
        pivot_df = df.pivot_table(
            index=['Code', 'Name'],
            columns=['Age', 'Sex', 'Location', 'Year'],
            values='Value'
        )

        # 5. Format Geographic Codes (ensuring 7-digit padding)
        pivot_df.index = pivot_df.index.set_levels(
            pivot_df.index.levels[0].astype(str).str.zfill(7), level=0
        )

        # 6. Save result
        os.makedirs(OUTPUT_DIR, exist_ok=True)
        pivot_df.to_csv(OUTPUT_FILE, encoding='utf-8-sig')

        logging.info(f"SUCCESS: {OUTPUT_FILE} has been updated.")
        logging.info(f"Years Included: {available_years}")
        logging.info(f"Total Geographies Processed: {pivot_df.shape[0]}")

    except Exception as e:
        logging.error(f"Processing Error: {e}")

if __name__ == "__main__":
    process_poland_pivot()
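The pivot and CSV export steps in this script determine the shape of the input file. The sketch below uses made-up values for a hypothetical two-region sample, not the real spreadsheet. It suggests why the metadata file sets `header_rows` to 5 and `mapped_columns` to 2: pandas writes one header row per column level (Age, Sex, Location, Year) plus one row holding the index names (Code, Name).

```python
import pandas as pd

# Made-up long-format rows mimicking the renamed 'DANE' columns.
df = pd.DataFrame({
    "Code": [0, 0, 200000, 200000],
    "Name": ["POLAND", "POLAND", "DOLNOŚLĄSKIE", "DOLNOŚLĄSKIE"],
    "Age": ["0-2"] * 4,
    "Sex": ["males", "females", "males", "females"],
    "Location": ["total"] * 4,
    "Year": [2023] * 4,
    "Value": [10, 12, 3, 4],
})

pivot_df = df.pivot_table(
    index=["Code", "Name"],
    columns=["Age", "Sex", "Location", "Year"],
    values="Value",
)

# Pad the geographic codes to 7 digits, as the script does.
pivot_df.index = pivot_df.index.set_levels(
    pivot_df.index.levels[0].astype(str).str.zfill(7), level=0
)

# to_csv() without a path returns the CSV text instead of writing a file.
csv_lines = pivot_df.to_csv().splitlines()
header_rows = len(csv_lines) - len(pivot_df)
print(header_rows)
print(csv_lines[header_rows])  # first data row
```

Note that `pivot_table` aggregates with the mean by default, so duplicate (Code, Name, Age, Sex, Location, Year) combinations in the source would be averaged rather than reported as an error.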
26 changes: 26 additions & 0 deletions statvar_imports/statistics_poland/manifest.json
@@ -0,0 +1,26 @@
{
  "import_specifications": [
    {
      "import_name": "statistics_poland",
      "curator_emails": [
        "support@datacommons.org"
      ],
      "provenance_url": "https://stat.gov.pl/en/databases/",
      "provenance_description": "Population data for demographic variables such as population counts, age distributions, and other census-related metrics in Poland",
      "scripts": [
        "download_input_data.py",
        "../../tools/statvar_importer/stat_var_processor.py --input_data=poland_input/StatisticsPoland_input.csv --pv_map=StatisticsPoland_pvmap.csv --config_file=Statistics_Poland_metadata.csv --output_path=poland_output/StatisticsPoland_output"
      ],
      "source_files": [
        "poland_input/StatisticsPoland_input.csv"
      ],
      "import_inputs": [
        {
          "template_mcf": "poland_output/StatisticsPoland_output.tmcf",
          "cleaned_csv": "poland_output/StatisticsPoland_output.csv"
        }
      ],
      "cron_schedule": "0 0 1 1,4,7,10 *"
    }
  ]
}
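The `cron_schedule` value uses the standard five-field cron layout (minute, hour, day of month, month, day of week), so `0 0 1 1,4,7,10 *` reads as midnight on the first day of January, April, July, and October. The sketch below is a simplified, hypothetical matcher that handles only `*`, single numbers, and comma lists. It is not a full cron parser and says nothing about how the import scheduler evaluates the field or which time zone it uses.

```python
from datetime import datetime

CRON = "0 0 1 1,4,7,10 *"


def field_matches(field, value):
    """Match '*' or a comma-separated list of integers (no ranges or steps)."""
    return field == "*" or value in {int(part) for part in field.split(",")}


def cron_matches(expr, moment):
    """Check whether a datetime satisfies all five fields of the expression."""
    minute, hour, day, month, weekday = expr.split()
    # cron counts Sunday as 0; datetime.isoweekday() counts Sunday as 7.
    cron_weekday = moment.isoweekday() % 7
    return (field_matches(minute, moment.minute)
            and field_matches(hour, moment.hour)
            and field_matches(day, moment.day)
            and field_matches(month, moment.month)
            and field_matches(weekday, cron_weekday))


# Months of 2025 whose first day at 00:00 satisfies the schedule.
quarter_starts = [m for m in range(1, 13)
                  if cron_matches(CRON, datetime(2025, m, 1, 0, 0))]
print(quarter_starts)
```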
Binary file not shown.