Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
47 changes: 47 additions & 0 deletions statvar_imports/statistics_poland/README.md
Comment thread
abhishekjaisw marked this conversation as resolved.
Original file line number Diff line number Diff line change
@@ -0,0 +1,47 @@
# Poland Demographics Dataset
## Overview
This dataset provides foundational demographic and socio-economic statistics for Poland, sourced directly from official national datasets.

## Data Source

**Source URL:**
https://stat.gov.pl/en/databases/


The data comes from Poland's official statistical authority and includes comprehensive demographic variables such as population counts, age distributions, and other census-related metrics.

## How To Download Input Data
To download the data, you'll need to use the provided download script download_input_data.py. This script processes the statvar_imports/statistics_poland/poland_data_sample/poland_raw.xlsx file to generate StatisticsPoland_input.csv inside a new "poland_input" folder.

type of place: State.

statvars: Demographics

years: 2003 to 2024.
Comment thread
abhishekjaisw marked this conversation as resolved.
Comment thread
abhishekjaisw marked this conversation as resolved.

## Processing Instructions
To process the Poland Census data and generate statistical variables, use the following command from the "data" directory:

**Download input file**
```bash
python3 statvar_imports/statistics_poland/download_input_data.py
```
**For Test Data Run**
```bash
python3 tools/statvar_importer/stat_var_processor.py \
--input_data='statvar_imports/statistics_poland/test/StatisticsPoland_input.csv' \
--pv_map='statvar_imports/statistics_poland/StatisticsPoland_pvmap.csv' \
--output_path='statvar_imports/statistics_poland/test/StatisticsPoland_output' \
--config_file='statvar_imports/statistics_poland/Statistics_Poland_metadata.csv' \
--existing_statvar_mcf=gs://unresolved_mcf/scripts/statvar/stat_vars.mcf
```
**For Main data run**
```bash
python3 tools/statvar_importer/stat_var_processor.py \
--input_data='statvar_imports/statistics_poland/poland_input/StatisticsPoland_input.csv' \
--pv_map='statvar_imports/statistics_poland/StatisticsPoland_pvmap.csv' \
--output_path='statvar_imports/statistics_poland/poland_output/StatisticsPoland_output' \
--config_file='statvar_imports/statistics_poland/Statistics_Poland_metadata.csv' \
--existing_statvar_mcf=gs://unresolved_mcf/scripts/statvar/stat_vars.mcf
```
Comment thread
abhishekjaisw marked this conversation as resolved.

68 changes: 68 additions & 0 deletions statvar_imports/statistics_poland/StatisticsPoland_pvmap.csv
Original file line number Diff line number Diff line change
@@ -0,0 +1,68 @@
key,p1,v1,p2,v2,p3,v3,p4,v4,p5,v5
,,,,,,,,,,
Code,measuredProperty,count,populationType,Person,,,,,,
males,gender,Male,,,,,,,,
females,gender,Female,,,,,,,,
total,,,,,,,,,,
Comment thread
abhishekjaisw marked this conversation as resolved.
in urban areas,placeOfResidenceClassification,Urban,,,,,,,,
in rural areas,placeOfResidenceClassification,Rural,,,,,,,,

0-2,age,Years0To2,,,,,,,,
3-6,age,Years3To6,,,,,,,,
7-12,age,Years7To12,,,,,,,,
13-15,age,Years13To15,,,,,,,,
16-19,age,Years16To19,,,,,,,,
20-24,age,Years20To24,,,,,,,,
25-34,age,Years25To34,,,,,,,,
35-44,age,Years35To44,,,,,,,,
45-54,age,Years45To54,,,,,,,,
55-64,age,Years55To64,,,,,,,,
65 and more,age,Years65Onwards,,,,,,,,
,,,,,,,,,,
,,,,,,,,,,
POLAND,observationAbout,country/POL,#Header,observationAbout,,,,,,
DOLNOŚLĄSKIE,observationAbout,nuts/PL51,#Header,observationAbout,,,,,,
KUJAWSKO-POMORSKIE,observationAbout,nuts/PL61,#Header,observationAbout,,,,,,
LUBELSKIE,observationAbout,nuts/PL81,#Header,observationAbout,,,,,,
LUBUSKIE,observationAbout,nuts/PL43,#Header,observationAbout,,,,,,
ŁÓDZKIE,observationAbout,nuts/PL71,#Header,observationAbout,,,,,,
MAŁOPOLSKIE,observationAbout,nuts/PL21,#Header,observationAbout,,,,,,
MAZOWIECKIE,observationAbout,nuts/PL9,#Header,observationAbout,,,,,,
OPOLSKIE,observationAbout,nuts/PL52,#Header,observationAbout,,,,,,
PODKARPACKIE,observationAbout,nuts/PL82,#Header,observationAbout,,,,,,
PODLASKIE,observationAbout,nuts/PL84,#Header,observationAbout,,,,,,
POMORSKIE,observationAbout,nuts/PL63,#Header,observationAbout,,,,,,
ŚLĄSKIE,observationAbout,nuts/PL22,#Header,observationAbout,,,,,,
ŚWIĘTOKRZYSKIE,observationAbout,nuts/PL72,#Header,observationAbout,,,,,,
WARMIŃSKO-MAZURSKIE,observationAbout,nuts/PL62,#Header,observationAbout,,,,,,
WIELKOPOLSKIE,observationAbout,nuts/PL41,#Header,observationAbout,,,,,,
ZACHODNIOPOMORSKIE,observationAbout,nuts/PL42,#Header,observationAbout,,,,,,
,,,,,,,,,,
2003,observationDate,2003,value,{Number},,,,,,
2004,observationDate,2004,value,{Number},,,,,,
2005,observationDate,2005,value,{Number},,,,,,
2006,observationDate,2006,value,{Number},,,,,,
2007,observationDate,2007,value,{Number},,,,,,
2008,observationDate,2008,value,{Number},,,,,,
2009,observationDate,2009,value,{Number},,,,,,
2010,observationDate,2010,value,{Number},,,,,,
2011,observationDate,2011,value,{Number},,,,,,
2012,observationDate,2012,value,{Number},,,,,,
2013,observationDate,2013,value,{Number},,,,,,
2014,observationDate,2014,value,{Number},,,,,,
2015,observationDate,2015,value,{Number},,,,,,
2016,observationDate,2016,value,{Number},,,,,,
2017,observationDate,2017,value,{Number},,,,,,
2018,observationDate,2018,value,{Number},,,,,,
2019,observationDate,2019,value,{Number},,,,,,
2020,observationDate,2020,value,{Number},,,,,,
2021,observationDate,2021,value,{Number},,,,,,
2022,observationDate,2022,value,{Number},,,,,,
2023,observationDate,2023,value,{Number},,,,,,
2024,observationDate,2024,value,{Number},,,,,,
2025,observationDate,2025,value,{Number},,,,,,
2026,observationDate,2026,value,{Number},,,,,,
2027,observationDate,2027,value,{Number},,,,,,
2028,observationDate,2028,value,{Number},,,,,,
2029,observationDate,2029,value,{Number},,,,,,
2030,observationDate,2030,value,{Number},,,,,,
14 changes: 14 additions & 0 deletions statvar_imports/statistics_poland/Statistics_Poland_metadata.csv
Original file line number Diff line number Diff line change
@@ -0,0 +1,14 @@
config,value
provenance_url,https://bdl.stat.gov.pl/bdl/dane/podgrup/tablica
output_columns,"observationDate,observationAbout,value,variableMeasured"
places_within,country/POL
#place_types,"AdministrativeArea,AdministrativeArea1,AdministrativeArea2,State"
#debug,1
#input_rows,100
#word_delimiter,''
#skip_rows,1
header_rows,5
mapped_columns,2
dc_api_root,https://api.datacommons.org


82 changes: 82 additions & 0 deletions statvar_imports/statistics_poland/download_input_data.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,82 @@
import pandas as pd
Comment thread
abhishekjaisw marked this conversation as resolved.
import os
import logging
from datetime import datetime

# Configure logging
logging.basicConfig(
level=logging.INFO,
format='%(levelname)s: %(message)s'
)

# Configuration
INPUT_FILE = "statvar_imports/statistics_poland/poland_data_sample/poland_raw.xlsx"
Comment thread
abhishekjaisw marked this conversation as resolved.
Comment thread
abhishekjaisw marked this conversation as resolved.
OUTPUT_DIR = "statvar_imports/statistics_poland/poland_input"
OUTPUT_FILE = os.path.join(OUTPUT_DIR, "StatisticsPoland_input.csv")

# Target functional age groups
TARGET_AGES = [
"0-2", "3-6", "7-12", "13-15", "16-19", "20-24",
"25-34", "35-44", "45-54", "55-64", "65 i więcej"
]

def process_poland_pivot():
if not os.path.exists(INPUT_FILE):
logging.error(f"{INPUT_FILE} not found.")
return

logging.info(f"Starting generic processing. Saving to: {OUTPUT_FILE}")

try:
# 1. Load the 'DANE' sheet
df = pd.read_excel(INPUT_FILE, sheet_name='DANE')
df.columns = ['Code', 'Name', 'Age', 'Sex', 'Location', 'Year', 'Value', 'Unit', 'Attr']

# 2. Generic Filtering
df = df[df['Age'].isin(TARGET_AGES)]

# DYNAMIC YEAR LOGIC
current_year = datetime.now().year
available_years = sorted([y for y in df['Year'].unique() if y <= current_year])
df = df[df['Year'].isin(available_years)]
Comment thread
abhishekjaisw marked this conversation as resolved.

# 3. Translation Logic
translations = {
'mężczyźni': 'males',
'kobiety': 'females',
'ogółem': 'total',
'w miastach': 'in urban areas',
'na wsi': 'in rural areas',
'POLSKA': 'POLAND',
'65 i więcej': '65 and more'
}

# Refactored repetitive replace calls into a loop
for col in ['Sex', 'Location', 'Name', 'Age']:
df[col] = df[col].replace(translations)

# 4. Create the Pivot Table
pivot_df = df.pivot_table(
index=['Code', 'Name'],
columns=['Age', 'Sex', 'Location', 'Year'],
values='Value'
)

# 5. Format Geographic Codes (ensuring 7-digit padding)
pivot_df.index = pivot_df.index.set_levels(
pivot_df.index.levels[0].astype(str).str.zfill(7), level=0
)

# 6. Save result
os.makedirs(OUTPUT_DIR, exist_ok=True)
pivot_df.to_csv(OUTPUT_FILE, encoding='utf-8')

logging.info(f"SUCCESS: {OUTPUT_FILE} has been updated.")
logging.info(f"Years Included: {available_years}")
logging.info(f"Total Geographies Processed: {pivot_df.shape[0]}")

except Exception as e:
logging.error(f"Processing Error: {e}")
Comment thread
abhishekjaisw marked this conversation as resolved.
Comment thread
abhishekjaisw marked this conversation as resolved.
Comment thread
abhishekjaisw marked this conversation as resolved.

if __name__ == "__main__":
process_poland_pivot()
26 changes: 26 additions & 0 deletions statvar_imports/statistics_poland/manifest.json
Original file line number Diff line number Diff line change
@@ -0,0 +1,26 @@
{
"import_specifications": [
{
"import_name": "statistics_poland",
"curator_emails": [
"support@datacommons.org"
],
"provenance_url": "https://bdl.stat.gov.pl/bdl/dane/podgrup/tablica",
"provenance_description": "Population data for demographic variables such as population counts, age distributions, and other census-related metrics in Poland",
"scripts": [
"download_input_data.py",
"../../tools/statvar_importer/stat_var_processor.py --input_data=poland_input/StatisticsPoland_input.csv --pv_map=StatisticsPoland_pvmap.csv --config_file=Statistics_Poland_metadata.csv --output_path=poland_output/StatisticsPoland_output"
],
"source_files": [
"poland_input/StatisticsPoland_input.csv"
],
"import_inputs": [
{
"template_mcf": "poland_output/StatisticsPoland_output.tmcf",
"cleaned_csv": "poland_output/StatisticsPoland_output.csv"
}
],
"cron_schedule": "0 0 1 1,4,7,10 *"
}
]
}
Binary file not shown.
Loading
Loading