Population Dynamics Foundation Model (PDFM) Embeddings are condensed vector representations designed to encapsulate the complex, multidimensional interactions among human behaviors, environmental factors, and local contexts at specific locations. These embeddings capture patterns in aggregated data such as search trends, busyness trends, and environmental conditions (maps, air quality, temperature), providing a rich, location-specific snapshot of how populations engage with their surroundings. Because the underlying data is aggregated over space and time, the embeddings preserve privacy while enabling nuanced spatial analysis and prediction for applications ranging from public health to socioeconomic modeling.
PDFM Embeddings are generated using a Graph Neural Network (GNN) model, trained on a rich set of features:
- Aggregated Search Trends: Regional interests and concerns reflected in search data.
- Aggregated Maps Data: Geospatial and contextual data about locations.
- Aggregated Busyness: Activity levels in specific areas, indicating density and frequency of human presence.
- Aggregated Weather and Air Quality: Climate-related metrics, including temperature and air quality.
These features are aggregated at the postal code and county levels to generate localized, context-aware embeddings that preserve privacy.
Embeddings are available for all counties and ZIP codes within the contiguous United States. For additional coverage, please reach out to [email protected].
For more information on PDFM Embeddings, please see our paper on arXiv.
PDFM Embeddings can be applied to a wide range of geospatial prediction tasks, similar to census and socioeconomic statistics. Example use cases include:
- Population Health Outcomes: Predicting health statistics like disease prevalence or population health risks.
- Socioeconomic Factors: Modeling economic indicators and living conditions.
- Retail: Identifying promising locations for new stores, market expansion, and demand forecasting.
- Marketing and Sales: Characterizing high-performance regions and identifying similar areas to optimize marketing and sales efforts.
By incorporating spatial relationships and diverse feature types, these embeddings serve as a powerful tool for geospatial predictions.
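The "identifying similar areas" use case can be sketched as a nearest-neighbor lookup over embedding vectors. This is a minimal illustration, assuming cosine similarity is a suitable measure (the source does not prescribe one). The region IDs, the 4-dimensional toy vectors, and the `most_similar` helper are all hypothetical stand-ins for real embeddings.

```python
import numpy as np


def most_similar(embeddings: np.ndarray, ids: list, query_id: str, k: int = 3) -> list:
    """Return the k region IDs closest to query_id by cosine similarity."""
    # L2-normalize each row so a dot product equals cosine similarity.
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)
    query = unit[ids.index(query_id)]
    sims = unit @ query
    # Sort from most to least similar and drop the query region itself.
    order = np.argsort(-sims)
    return [ids[i] for i in order if ids[i] != query_id][:k]


# Toy vectors standing in for real embeddings (hypothetical IDs and values).
ids = ["county_a", "county_b", "county_c", "county_d"]
embeddings = np.array([
    [1.0, 0.0, 0.0, 0.0],
    [0.9, 0.1, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [-1.0, 0.0, 0.0, 0.0],
])
neighbors = most_similar(embeddings, ids, "county_a", k=2)
print(neighbors)
```

With real data, `embeddings` would hold one row per county or ZIP code and `ids` the matching place identifiers.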
Access to Population Dynamics Embeddings is subject to Google’s Terms of Service. Users can download the embeddings and associated files after completing the intake form.
To use Population Dynamics Embeddings, prepare ground truth data (e.g., target variable for prediction tasks like asthma prevalence) at the postal code or county level.
- Prepare Existing Model-Based Ground Truth: Use the embeddings as geospatial covariates to enhance an existing model.
- Train an Adapter Model: Improve an existing model by integrating the embeddings.
- Choose a Prediction Model: Any model, such as GBDT, MLP, or linear, can be used for predictions.
- Use Embeddings for Prediction: Use PDFM Embeddings as input features, alongside other contextual data, to improve prediction accuracy.
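The steps above can be sketched end to end: join embeddings to ground truth by place ID, fit a model, and predict for places without labels. This is a minimal sketch under stated assumptions, not the project's own pipeline. It uses a closed-form ridge regression as the linear option. The place IDs, the 16-dimensional vectors, and the synthetic target are invented for illustration, and `fit_ridge` is a hypothetical helper.

```python
import numpy as np


def fit_ridge(X: np.ndarray, y: np.ndarray, alpha: float = 1.0):
    """Closed-form ridge regression with an unpenalized intercept."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)
    b = y_mean - x_mean @ w
    return w, b


rng = np.random.default_rng(0)

# Synthetic stand-ins: one embedding vector per (hypothetical) place ID.
n_places, dim = 200, 16
place_ids = [f"place_{i}" for i in range(n_places)]
emb = {pid: rng.normal(size=dim) for pid in place_ids}

# Synthetic ground truth, available for only the first 150 places.
true_w = rng.normal(size=dim)
labels = {pid: float(emb[pid] @ true_w + 0.1 * rng.normal())
          for pid in place_ids[:150]}

# Join embeddings and labels on place ID, then fit.
train_ids = [pid for pid in place_ids if pid in labels]
X_train = np.stack([emb[pid] for pid in train_ids])
y_train = np.array([labels[pid] for pid in train_ids])
w, b = fit_ridge(X_train, y_train, alpha=1.0)

# Predict for places that have an embedding but no label.
test_ids = [pid for pid in place_ids if pid not in labels]
X_test = np.stack([emb[pid] for pid in test_ids])
y_pred = X_test @ w + b
print(y_pred.shape)
```

Other contextual features can be concatenated to the embedding columns before fitting, and the ridge step can be swapped for a GBDT or MLP without changing the join logic.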
Explore our demo notebooks to understand various use cases of PDFM Embeddings. The code provided is available under the Apache 2.0 license.
- Nowcasting Colab: Here the model uses past and partial present-day data for a target variable at county level to predict outcomes for remaining counties.
- Superresolution and Imputation Colab: Here we use the embeddings to help train a model at the county level on a target variable, then predict at the ZIP code level. This notebook also demonstrates imputation by training on 20% of ZIP codes and predicting for the remaining 80%.
- Forecasting with TimesFM Colab: In this experimental use case, we incorporate TimesFM (a univariate forecasting model) to perform spatiotemporal forecasting. The embeddings are used to adjust for errors in the forecasts and improve their accuracy.
- Nighttime Lights Prediction with Earth Engine Colab: This notebook illustrates how Earth Engine data, such as nighttime lights, can also be predicted from the embeddings. By referencing Earth Engine data, the model enhances geospatial understanding and demonstrates applications for environmental and socioeconomic forecasting.
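The idea of using embeddings to adjust forecast errors can be illustrated with a residual-correction sketch. This is not the notebook's implementation and does not call TimesFM. A naive last-value forecast stands in for the base forecaster, and the synthetic series assume that each county's drift depends linearly on its embedding. All names and values below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n_counties, dim, n_steps = 300, 8, 12

# Synthetic embeddings and series (assumption: drift is linear in the embedding).
E = rng.normal(size=(n_counties, dim))
drift = 0.5 * (E @ rng.normal(size=dim))
t = np.arange(n_steps + 1)
series = 10.0 + drift[:, None] * t[None, :] + 0.1 * rng.normal(size=(n_counties, n_steps + 1))
observed, future = series[:, :n_steps], series[:, n_steps]

# Backtest: forecast the last observed step from the one before it and
# record the error of the base forecaster (naive last-value stand-in).
backtest_forecast = observed[:, -2]
backtest_error = observed[:, -1] - backtest_forecast

# Fit a linear map from embedding (plus intercept) to the backtest error.
A = np.hstack([E, np.ones((n_counties, 1))])
coef, *_ = np.linalg.lstsq(A, backtest_error, rcond=None)

# Forecast the unseen step, then add the predicted error as a correction.
base_forecast = observed[:, -1]
corrected_forecast = base_forecast + A @ coef

mae_base = float(np.mean(np.abs(future - base_forecast)))
mae_corrected = float(np.mean(np.abs(future - corrected_forecast)))
print(f"base MAE={mae_base:.3f}, corrected MAE={mae_corrected:.3f}")
```

The correction helps here only because the synthetic errors were built to correlate with the embeddings. On real data the gain has to be checked against held-out periods.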
The following benchmark files contain ground truth data used to evaluate Population Dynamics Embeddings. They can be used alongside the embeddings to reproduce our results and assess performance across various geospatial and temporal prediction tasks.
- Interpolation, Superresolution, and Extrapolation: The conus27 file is a versatile dataset that supports tasks involving interpolation (filling gaps), superresolution (predicting at finer spatial scales), and extrapolation (projecting data over large missing regions). This file includes detailed columns for location information (place, county, state, latitude, longitude) and key population health indicators, along with geographic features such as tree cover, elevation, and nighttime lights.
- Forecasting: The model's capabilities in temporal forecasting are illustrated with two datasets:
  - county_unemployment.csv: Contains monthly county-level unemployment data from 1990 to 2024, enabling users to track employment trends over time.
  - zcta_poverty.csv: Contains annual poverty estimates at the ZIP Code Tabulation Area (ZCTA) level from 2011 to 2022, providing insight into socioeconomic changes at finer spatial scales.
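The difference between interpolation and extrapolation comes down to how the held-out rows are chosen. The sketch below is an illustration under assumptions and may not match the exact splits used for the published results. It contrasts a random hold-out (gaps scattered among training rows) with a group hold-out of entire states (a large contiguous missing region). The `state` values are toy data, and both helper functions are hypothetical.

```python
import numpy as np


def interpolation_split(n_rows: int, test_fraction: float, rng) -> np.ndarray:
    """Random hold-out: test rows are scattered among the training rows."""
    n_test = int(round(test_fraction * n_rows))
    mask = np.zeros(n_rows, dtype=bool)
    mask[rng.permutation(n_rows)[:n_test]] = True
    return mask


def extrapolation_split(states: np.ndarray, held_out_states) -> np.ndarray:
    """Group hold-out: every row in the held-out states becomes a test row."""
    return np.isin(states, list(held_out_states))


# Toy "state" column standing in for the one in the benchmark file.
states = np.array(["CA", "CA", "NV", "NV", "OR", "OR", "WA", "WA"])
rng = np.random.default_rng(0)

interp_test = interpolation_split(len(states), 0.25, rng)
extrap_test = extrapolation_split(states, {"OR", "WA"})

print("interpolation test rows:", np.flatnonzero(interp_test))
print("extrapolation test rows:", np.flatnonzero(extrap_test))
```

Each boolean mask selects the test rows. Its negation selects the training rows, so the same model-fitting code can be evaluated under either regime.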
All ground truth data included in the benchmarks are gathered from publicly available sources, through the Data Commons and Google Earth Engine APIs. Here's a list of the original sources:
- Data Commons:
  - Health variables: CDC PLACES 2022
  - Unemployment: bls.gov
  - Poverty: census.gov
- Google Earth Engine:
  - ZCTA and County boundaries: TIGER/2010/ZCTA5, TIGER/2016/Counties
  - Tree cover: ESA/WorldCover/v100
  - Night lights: NOAA/VIIRS/DNB/ANNUAL_V22
  - Elevation: USGS/SRTMGL1_003
We release Population Dynamics Foundation Model Embeddings under the Creative Commons Attribution 4.0 International (CC BY 4.0) license. You are free to share and adapt this data, but please cite our work if you incorporate these embeddings in your research or applications.
@article{agarwal2024pdfm,
title={General Geospatial Inference with a Population Dynamics Foundation Model},
author={Mohit Agarwal and Mimi Sun and Chaitanya Kamath and Arbaaz Muslim and Prithul Sarker and Joydeep Paul and Hector Yee and Marcin Sieniek and Kim Jablonski and Yael Mayer and David Fork and Sheila de Guia and Jamie McPike and Adam Boulanger and Tomer Shekel and David Schottlander and Yao Xiao and Manjit Chakravarthy Manukonda and Yun Liu and Neslihan Bulut and Sami Abu-el-haija and Arno Eigenwillig and Parth Kothari and Bryan Perozzi and Monica Bharel and Von Nguyen and Luke Barrington and Niv Efron and Yossi Matias and Greg Corrado and Krish Eswaran and Shruthi Prabhakara and Shravya Shetty and Gautam Prasad},
journal={arXiv preprint arXiv:2411.07207},
year={2024}
}
For questions, please email [email protected].