Digital Phenotyping Foundation Model (DPFM)

Foundation model for lifelog time series — representation learning, behavioral prediction, multi-modal alignment, and clinical outcome prediction.

Executive Summary

📚 Knowledge Base (765 sources, 3 NotebookLM notebooks)

This project is backed by a systematically curated knowledge base spanning 482 papers and 765 total sources across three NotebookLM notebooks.

Notebook	Sources	Focus
Allostasis Theory	424	Theoretical foundation — allostatic regulation, interoception, autonomic control, brain-body interaction
Lifelog AI/FM Research	322	Computational methods — time-series FM, wearable FM, digital phenotyping, multimodal health AI, SSL, clinical prediction, lifestyle-omics, edge AI, missingness handling, smart ring
Predictive Coding	19	Cognitive measurement — prediction error, precision weighting, EMA methodology

The 482-paper literature survey covers 11 categories (A-K) from top venues: NeurIPS, ICML, ICLR, Nature Medicine, Lancet Digital Health, npj Digital Medicine, IEEE JBHI, IMWUT/UbiComp. See literature/references.bib and the 30-query NLM synthesis for the full SOTA analysis and Top 10 research questions.

🧬 Why Lifelog Data Matters: The Allostasis Framework

Allostasis — "stability through change" — is the body's continuous process of predicting metabolic needs and mobilizing resources before they are needed. Unlike homeostasis (reactive correction), allostasis is predictive regulation: the brain constantly generates forecasts about the body's upcoming energy demands and adjusts physiology proactively.

A smartwatch captures this allostatic regulation in real time:

Watch Signal	What It Reflects
Heart rate & HRV	Autonomic regulation, cardiac allostatic control
Sleep architecture	Restorative prediction, metabolic recovery cycles
Stress score	Sympathetic-parasympathetic balance
Activity/Steps	Energy expenditure and behavioral regulation
SpO2	Respiratory-metabolic coupling
Body composition	Long-term energy balance outcomes

When allostatic regulation works well, the body efficiently adapts — HR recovers quickly after stress, sleep architecture is resilient, circadian rhythms are stable. When it breaks down (allostatic overload), metabolic dysregulation accumulates, leading to hypertension, diabetes, cardiovascular disease, and other chronic conditions.

This gives us a principled scientific basis: lifelog time series are not arbitrary sensor streams — they are a continuous readout of the body's allostatic regulation. A foundation model trained on this data learns representations of how well (or poorly) an individual's regulatory system operates.

🧠 Behavioral Prediction: Predictive Coding Framework

Building on allostasis theory, we introduce the Predictive Coding EMA (Ecological Momentary Assessment) module — the first Galaxy Watch-based system for measuring cognitive prediction abilities in daily life.

Core Hypothesis

Allostatic regulation and cognitive prediction share fundamental mechanisms — both are predictive processes that minimize future uncertainty. If the body's allostatic system breaks down, does cognitive prediction also become less precise?

Predictive Coding Microtasks (Galaxy Watch 7 Ultra)

Task	Input	Duration	Measures
Trajectory Prediction	Rotating Bezel	~45s	Spatial prediction error, precision weighting
Temporal Prediction	Haptic + Tap	~40s	Temporal accuracy, rhythm internalization
Sequence Prediction	Bezel	~35s	Statistical learning, volatility tracking
Sensorimotor Tracking	Bezel	~35s	Online prediction, sensorimotor integration
Oddball Detection	Tap + Haptic	~25s	Deviance detection (behavioral MMN)

EMA Protocol: 16 hourly sessions/day (08:00-23:00), ~30-60 seconds each

Theoretical Basis: the brain as a Bayesian prediction machine (Friston, 2005; Clark, 2013)
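The session count in the EMA protocol follows from counting both endpoints of the 08:00-23:00 window. A minimal sketch of that schedule, assuming one session starts exactly on each hour (the function name `ema_schedule` and the example date are illustrative, not part of the watch app):

```python
from datetime import date, datetime, time, timedelta

def ema_schedule(day: date, first_hour: int = 8, last_hour: int = 23) -> list[datetime]:
    """Return one session start per hour, first_hour through last_hour inclusive."""
    start = datetime.combine(day, time(hour=first_hour))
    return [start + timedelta(hours=k) for k in range(last_hour - first_hour + 1)]

sessions = ema_schedule(date(2026, 1, 5))  # hypothetical study day
print(len(sessions), sessions[0].strftime("%H:%M"), sessions[-1].strftime("%H:%M"))
# prints: 16 08:00 23:00
```

A deployed protocol would likely add jitter or snooze handling around each slot; that is left out here.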

Predictive Coding ↔ Allostasis Integration

Level	Predictive Coding	Allostatic Regulation
Temporal Scale	Milliseconds-seconds	Minutes-hours
Prediction Target	Sensory input	Metabolic demand
Error Signal	Prediction error (PE)	Allostatic load
Update Mechanism	Precision weighting	Autonomic adjustment
Pathology	Aberrant precision	Allostatic overload

🔬 Research: AI on Lifelog + Behavioral Data

Core Problem

Given continuous lifelog time series + behavioral prediction data from wearable devices:

Learn general-purpose representations that capture allostatic regulation patterns
Measure cognitive prediction abilities through ecological momentary assessment
Align these representations with genetic predisposition (multi-omics) and clinical state
Predict clinical outcomes and quantify how lifestyle + cognitive factors affect disease risk

Four-Stage Pipeline

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
 Stage 1: PRETRAIN — Self-supervised lifelog foundation model
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

 24h watch data → 15-min patches (96 tokens/day) → Temporal Transformer
 Objective: Masked Patch Modeling (reconstruct masked time segments)
 Output: z_lifelog — per-subject, per-day representation

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
 Stage 2: BEHAVIORAL — Predictive coding assessment  🆕
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

 EMA tasks → Prediction Error metrics → z_behavioral representation
 Constructs: PE magnitude, precision weighting, learning rate, temporal accuracy
 Integration: z_behavioral ↔ z_lifelog circadian correlation

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
 Stage 3: ALIGN — Cross-modal contrastive alignment
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

 z_lifelog    ←─┐
 z_behavioral ←─┼── InfoNCE contrastive loss ──→ Shared representation space
 z_omics      ←─┤
 z_clinical   ←─┘

 Question: How does daily regulation (lifelog) + cognitive prediction (behavioral)
           relate to genetic predisposition (omics) and health state (clinical)?

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
 Stage 4: PREDICT — Clinical outcome & future risk
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

 [z_lifelog ⊕ z_behavioral ⊕ z_omics ⊕ z_clinical] → Gated Fusion → Risk Prediction
 Targets: HTN | T2DM | ASCVD | Dementia | Depression | Insomnia | Dyslipidemia | Obesity
 + Lifestyle + Cognitive intervention effect quantification (Causal Forest)
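The Stage 1 tokenization and objective can be sketched without any ML framework. The example below splits one day of minute-level samples into 15-minute patches (1,440 / 15 = 96 tokens), hides a random subset, and scores a reconstruction only on the hidden patches. The mask ratio of 0.4, the synthetic heart-rate signal, and all function names are assumptions for illustration; the actual model is a Temporal Transformer, which is not reproduced here.

```python
import random

PATCH_MINUTES = 15  # from the pipeline: 15-min patches, 96 per day

def to_patches(series, patch_len=PATCH_MINUTES):
    """Split one day of minute-level samples into non-overlapping patches."""
    if len(series) % patch_len:
        raise ValueError("series length must be a multiple of patch_len")
    return [series[i:i + patch_len] for i in range(0, len(series), patch_len)]

def choose_masked(n_patches, mask_ratio=0.4, seed=0):
    """Pick which patch indices to hide (mask_ratio is an assumed value)."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(n_patches), round(n_patches * mask_ratio)))

def masked_mse(pred, target, masked_idx):
    """Mean squared error over masked patches only."""
    sq = [(p - t) ** 2 for i in masked_idx for p, t in zip(pred[i], target[i])]
    return sum(sq) / len(sq)

# Synthetic heart-rate channel: 1,440 one-minute samples (hypothetical data).
heart_rate = [60.0 + (m % 60) / 6 for m in range(24 * 60)]
patches = to_patches(heart_rate)
masked_idx = choose_masked(len(patches))

# Trivial baseline "model": predict the daily mean for every sample.
day_mean = sum(heart_rate) / len(heart_rate)
baseline = [[day_mean] * PATCH_MINUTES for _ in patches]

print(len(patches), len(masked_idx))             # 96 38
print(masked_mse(patches, patches, masked_idx))  # 0.0 for a perfect reconstruction
print(masked_mse(baseline, patches, masked_idx)) # positive: the error a model should beat
```

Scoring only the masked positions is what makes the task self-supervised: the model must infer hidden segments from the visible context rather than copy its input.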

Research Questions

#	Question	Method
RQ1	Can a foundation model learn meaningful representations of allostatic regulation?	Self-supervised pretraining + probing tasks
RQ2	Are cognitive prediction abilities correlated with allostatic regulation patterns?	z_lifelog ↔ z_behavioral circadian correlation
RQ3	How do lifelog + behavioral representations relate to genetic risk profiles?	Contrastive alignment (multimodal)
RQ4	Can aligned representations predict clinical phenotype trajectories?	Downstream finetuning on longitudinal outcomes
RQ5	Which omics markers are sensitive to lifestyle + cognitive changes?	Longitudinal marker analysis + causal inference
RQ6	Can we quantify the effect of behavioral intervention per individual?	Causal forest, what-if simulation
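The contrastive alignment named in RQ3 and Stage 3 uses an InfoNCE loss. A minimal, dependency-free sketch of the one-directional form is given below: for each subject, the matching embedding from the other modality is the positive and every other subject's embedding is a negative. The temperature of 0.1, the cosine similarity, and the toy one-hot embeddings are assumptions for illustration, not the project's configuration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss; anchors[i] and positives[i] belong to the same subject."""
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        top = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denom = top + math.log(sum(math.exp(l - top) for l in logits))
        losses.append(log_denom - logits[i])
    return sum(losses) / len(losses)

# Toy embeddings for three subjects (hypothetical values).
z_lifelog = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
z_omics = [row[:] for row in z_lifelog]       # perfectly aligned pairs
z_shuffled = z_omics[1:] + z_omics[:1]        # every pair mismatched

print(info_nce(z_lifelog, z_omics) < info_nce(z_lifelog, z_shuffled))  # True
```

A symmetric variant averages this loss with the same computation run from the other modality's side; with four modalities, one option is to sum the loss over modality pairs.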

📊 Data Modalities

Modality	Source	Dimensionality	Sampling
Lifelog	Samsung Health Watch	10+ channels	Minute-resolution, continuous
Behavioral	Galaxy Watch 7 Ultra EMA	5 tasks × 16 sessions/day	Hourly microtasks
Genomics	WGS + PGS	287+ East Asian polygenic scores	One-time
Proteomics	Olink Explore HT	~5,400 protein markers (NPX)	Periodic
Microbiome	16S rRNA	OTU abundance profiles	Periodic
Clinical	SMC Health Checkup	InBody, BP, CGM, blood chemistry	Periodic

Cohort: n=1,250, longitudinal follow-up (Samsung Medical Center, 2025–2028)

Behavioral Data Schema (Predictive Coding EMA)

trajectory_prediction:
  prediction_error: {unit: degrees, range: [0, 180]}
  response_time: {unit: ms, range: [500, 3000]}
  adjustment_count: {description: "uncertainty proxy"}
  learning_rate: {description: "Rescorla-Wagner alpha"}

temporal_prediction:
  temporal_error: {unit: ms, description: "produced - expected interval"}
  tap_variability: {unit: ms, description: "temporal precision"}
  tempo_condition: {values: [fast_500ms, medium_750ms, slow_1000ms]}

derived_metrics:
  precision_weighting: {description: "inverse variance of PE"}
  circadian_pe_variation: {description: "PE variation across day"}
  
context_integration:
  heart_rate_at_task: {unit: bpm}
  stress_score_pre_task: {range: [0, 100]}
  hours_since_wake: {unit: hours}
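Three of the schema fields above are defined by short formulas, sketched here with Python's standard library. `learning_rate` is the alpha in the Rescorla-Wagner update V <- V + alpha * (outcome - V), and `precision_weighting` is the inverse variance of prediction errors, as the schema states. The definition of `circadian_pe_variation` as the population standard deviation of hourly mean errors is an assumption, as are the function names and sample values.

```python
from statistics import mean, pstdev, pvariance

def rescorla_wagner(outcomes, alpha, v0=0.0):
    """Apply V <- V + alpha * (outcome - V); return the per-trial prediction errors."""
    value, errors = v0, []
    for outcome in outcomes:
        pe = outcome - value
        errors.append(pe)
        value += alpha * pe
    return errors

def precision_weighting(prediction_errors):
    """Inverse variance of prediction errors (the schema's stated definition)."""
    var = pvariance(prediction_errors)
    return float("inf") if var == 0 else 1.0 / var

def circadian_pe_variation(pe_by_hour):
    """Spread of hourly mean PE across the day (assumed: population std dev)."""
    return pstdev(mean(errs) for errs in pe_by_hour.values())

# Errors shrink as the running estimate converges on a constant outcome.
print(rescorla_wagner([1.0, 1.0, 1.0], alpha=0.5))    # [1.0, 0.5, 0.25]

# Hypothetical trajectory-task errors in degrees: variance 25 gives precision 0.04.
print(precision_weighting([5.0, 15.0, 5.0, 15.0]))    # 0.04

# Hypothetical errors keyed by hour of day.
print(circadian_pe_variation({8: [10.0, 14.0], 14: [20.0, 24.0], 22: [30.0, 34.0]}))
```

Estimating alpha from real task data means fitting this update to observed responses, for example by minimizing squared error over a grid of alpha values; that step is not shown.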

🎯 Target Clinical Phenotypes

Category	Target	Key Measurements	Behavioral Hypothesis
Primary	Hypertension	Office BP, Ambulatory BP	Impaired cardiovascular prediction
Primary	Type 2 Diabetes	FBS, HbA1c, CGM TIR, HOMA-IR	Metabolic prediction dysregulation
Primary	ASCVD	ASCVD risk estimator	Vascular allostatic overload
Secondary	Dementia risk	Memory functional Q, Emocog	Cognitive prediction decline
Secondary	Depression/Anxiety	Mental Health Q	Aberrant precision weighting
Secondary	Insomnia	Watch sleep + PSQI	Circadian prediction disruption
Secondary	Dyslipidemia	TC, TG, HDL, LDL	Lipid regulatory prediction
Secondary	Obesity	BMI, BIA, WHR, VFA	Energy balance prediction

🏗️ Project Structure

digital-phenotyping-fm/
├── literature/               # Paper study, NotebookLM knowledge bases
│   ├── notes/               # Literature review notes  
│   ├── reviews/             # Review summaries
│   ├── references.bib       # Bibliography
│   ├── seeds/              # 🆕 Predictive coding core papers
│   └── notebooks.md        # 🆕 NotebookLM KB index
├── src/dpfm/               # Core ML library
│   ├── data/               # Data processors (lifelog, omics, clinical)
│   ├── behavioral/         # 🆕 Predictive coding EMA module
│   │   ├── analysis/       # PE metrics, time series analysis
│   │   ├── tasks/          # EMA task engine (watch integration)
│   │   └── visualization/  # Behavioral data visualization
│   ├── models/             # Foundation model, alignment, predictors
│   ├── training/           # Lightning modules (pretrain, align, finetune)
│   └── evaluation/         # Metrics and benchmarks
├── watch-app/              # 🆕 Galaxy Watch 7 Ultra EMA app
│   ├── app/src/           # Kotlin + Jetpack Compose for Wear OS
│   └── gradle/            # Android build configuration
├── configs/                # Hydra YAML configs
│   ├── default.yaml       # Main configuration
│   └── behavioral/        # 🆕 EMA protocol configs
├── data/                   # Data (gitignored except schemas/)
│   └── schemas/           # Data dictionary (tracked)
├── experiments/            # Per-experiment directories
├── notebooks/              # Jupyter (exploration, analysis, figures)
│   ├── analysis/          # Primary analysis notebooks
│   ├── exploration/       # Exploratory data analysis
│   ├── behavioral/        # 🆕 Predictive coding analysis
│   └── figures/           # Publication figures
├── reports/                # Presentations, progress, papers
├── scripts/                # CLI entry points
└── tests/                 # Unit tests

🚀 Setup & Installation

Core Package

# Clone and install
git clone https://github.com/Transconnectome/digital-phenotyping-fm.git
cd digital-phenotyping-fm
pip install -e ".[dev]"

# Optional: Install behavioral analysis module
pip install -e ".[behavioral]"

# Optional: Install omics processing dependencies
pip install -e ".[omics]"

Galaxy Watch App Development

# Install Android SDK and JDK 17
# Build watch app
cd watch-app
./gradlew build

NotebookLM Knowledge Bases (765 sources across 3 notebooks)

Notebook	ID	Sources	Focus
Allostasis Theory	`1846219f-a072-4544-9721-65a6aa89904f`	424	Brain-body regulation, interoception, autonomic control
Lifelog AI/FM Research	`ebbba35c-09c6-4e2d-8e13-103c1b3a3676`	322	11 categories, including TS-FM, wearable FM, digital phenotyping, multimodal health, SSL, clinical prediction, lifestyle-omics, edge AI, missingness, smart ring
Predictive Coding	`b4946642-4c70-4d14-9758-82573eead20a`	19	Prediction error, precision weighting, EMA methodology

Query using CLI or MCP:

# Allostasis framework
nlm notebook query 1846219f-a072-4544-9721-65a6aa89904f "allostasis wearable lifelog"

# Lifelog AI/FM research (482 papers indexed)
nlm notebook query ebbba35c-09c6-4e2d-8e13-103c1b3a3676 "wearable foundation model health prediction"

# Predictive coding theory
nlm notebook query b4946642-4c70-4d14-9758-82573eead20a "prediction error precision weighting EMA"

🧪 Usage Examples

1. Lifelog Foundation Model Training

dpfm-train configs/pretrain.yaml

2. Behavioral Analysis

# Analyze EMA session data
dpfm-behavioral-analyze --data-path ./data/behavioral/sessions.json

# Plot circadian prediction error patterns
python -m dpfm.behavioral.visualization --plot-type circadian

3. Multi-Modal Alignment

dpfm-align configs/align.yaml

4. Clinical Prediction

dpfm-predict configs/predict.yaml
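The gated fusion step in Stage 4 can take several forms. One simple variant is sketched below, assuming all modality embeddings share a dimension: each modality gets a scalar sigmoid gate, the fused vector is the gate-weighted average, and a logistic head maps it to a risk score. Everything here (function names, weights, 2-d embeddings) is hypothetical and is not the implementation behind `dpfm-predict`.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_fusion(embeddings, gate_w, gate_b):
    """Fuse same-dimension modality embeddings with one scalar sigmoid gate each.

    embeddings: dict of modality name -> vector; absent modalities are simply omitted.
    """
    gates = {
        name: sigmoid(sum(w * x for w, x in zip(gate_w[name], z)) + gate_b[name])
        for name, z in embeddings.items()
    }
    total = sum(gates.values())
    dim = len(next(iter(embeddings.values())))
    fused = [
        sum(gates[n] * embeddings[n][d] for n in embeddings) / total
        for d in range(dim)
    ]
    return fused, gates

def risk_score(fused, head_w, head_b=0.0):
    """Logistic head mapping the fused vector to a probability-like score."""
    return sigmoid(sum(w * x for w, x in zip(head_w, fused)) + head_b)

# Hypothetical 2-d embeddings; omics is missing for this subject-day.
z = {"lifelog": [1.0, 0.0], "behavioral": [0.0, 1.0], "clinical": [1.0, 1.0]}
gate_w = {name: [0.0, 0.0] for name in z}  # untrained gates: all open to 0.5
gate_b = {name: 0.0 for name in z}

fused, gates = gated_fusion(z, gate_w, gate_b)
print(fused)                                 # equal gates reduce to the plain mean
print(risk_score(fused, head_w=[0.0, 0.0]))  # 0.5 with an all-zero head
```

Normalizing by the sum of gates lets the same code handle subject-days where a periodic modality such as omics is unavailable, since missing entries drop out of the average.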

🔬 Scientific Foundations & Knowledge Bases

Three Pillars (765 sources)

1. Allostasis — 424-source KB

The body's predictive regulation system. Lifelog time series = continuous readout of allostatic regulation. Foundation model learns representations of regulatory quality.

Sterling (2012), McEwen (1998), Kleckner+ (2017), Barrett (2017)

2. Lifelog AI/FM — 322-source KB

A 482-paper systematic survey across 11 categories. Reported SOTA: behavioral representations outperform raw sensor inputs (ICML 2025), with LSM-2, ECGFounder, and SleepFM among the leading models. A 30-query NLM synthesis produced the Top 10 research questions and 3 actionable research designs.

Top 5 open FM: PaPaGei, Pulse-PPG, NormWear, ECG-FM, Step2Heart
See literature/notes/nlm_query_synthesis.md for full synthesis

3. Predictive Coding — 19-source KB

Brain as Bayesian prediction machine. EMA microtasks measure prediction error and precision weighting in daily life. Links cognitive prediction to health regulation.

Friston (2005), Clark (2013), Rao & Ballard (1999), Shiffman+ (2008)

📈 Timeline & Development

Period	Phase	Focus
2026 Q1-Q2	Foundation	Data infrastructure, EMA app deployment, lifelog FM pretrain
2026 Q3-Q4	Behavioral	EMA data collection, predictive coding analysis, z_behavioral
2027	Integration	Cross-modal alignment, lifestyle↔omics validation, longitudinal analysis
2028	Prediction	Clinical outcome models, intervention quantification, PoC API

🤝 Collaboration

Samsung Medical Center × Samsung MX Health (2025.10–2028.06)

Principal Investigator: SNU Connectome Lab
Cohort: n=1,250 participants
Data: Lifelog + Behavioral + Multi-omics + Clinical longitudinal

To our knowledge, DPFM is the first framework to integrate allostatic regulation theory with predictive coding assessment in a lifelog foundation model. Combining the two lets the project study, within a single cohort, how physiological regulation, cognitive prediction, and health outcomes relate to one another.
