feat: implement strict temporal train/test split to prevent data leakage #100
Merged
gelluisaac merged 25 commits into Traqora:main on Mar 27, 2026
Conversation
…eights
- Add astroml/tracking/mlflow_tracker.py: thin MLflowTracker wrapper that gracefully degrades to a no-op when mlflow is not installed or disabled
- Wire MLflowTracker into train.py (GCN): logs hyperparams, per-epoch train/val loss and accuracy curves, final test metrics, and the model artifact
- Wire MLflowTracker into DeepSVDDTrainer: logs per-epoch train_loss, val_loss, svdd_radius, and evaluation metrics (ROC-AUC, AUC-PR, F1, precision, recall, accuracy); logs the best model checkpoint as an artifact
- Add an mlflow config block to configs/config.yaml with tracking_uri, experiment_name, run_name, and log_model_weights toggles
- Add mlflow>=2.10.0 to requirements.txt
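The graceful-degradation pattern described above can be sketched as follows. This is a minimal illustration, not the actual MLflowTracker code: the class name comes from the PR, but the constructor arguments and method set are assumptions.

```python
class MLflowTracker:
    """Sketch of a tracker that becomes a no-op when mlflow is
    unavailable or tracking is disabled (illustrative, not the real API)."""

    def __init__(self, enabled=True, tracking_uri=None, experiment_name=None):
        self.active = False
        if not enabled:
            return
        try:
            import mlflow
        except ImportError:
            # mlflow not installed: every logging method below is a no-op.
            return
        self._mlflow = mlflow
        if tracking_uri:
            mlflow.set_tracking_uri(tracking_uri)
        if experiment_name:
            mlflow.set_experiment(experiment_name)
        self.active = True

    def log_params(self, params):
        if self.active:
            self._mlflow.log_params(params)

    def log_metric(self, key, value, step=None):
        if self.active:
            self._mlflow.log_metric(key, value, step=step)
```

Because every method checks `self.active`, training code can call the tracker unconditionally and runs identically whether or not mlflow is present.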
- Add astroml/training/temporal_split.py with:
- temporal_graph_split(): splits Edge sequences by time (cutoff or ratio
mode), sorts by timestamp, hard-raises LeakageError on any overlap
- validate_graph_split(): standalone overlap validator for GraphSplitResult
- TemporalSplitter class: config-driven wrapper supporting both DataFrame
and graph-edge inputs with automatic validation
- Re-exports temporal_train_test_split/validate_temporal_split from
astroml.validation.leakage for a single import surface
- Update astroml/training/__init__.py to export new symbols
- Add temporal_split config block to configs/training/default.yaml
(enabled: false by default, opt-in via Hydra override)
- Wire apply_temporal_masks() into train.py: when temporal_split.enabled
is true, replaces dataset's random masks with temporally-ordered
train/val/test masks (uses node_timestamps attr if present, else node
index as ingestion-order proxy)
- Add tests/test_temporal_split.py with 20+ cases covering ratio mode,
cutoff mode, shuffled input, custom time attr, empty edges, overlap
detection, and the no-leakage guarantee property
…transaction prediction
Adds a training objective that predicts whether two accounts will transact
in the next N ledgers — a self-supervised task requiring no manual labels.
New files:
- astroml/models/link_prediction.py: LinkPredictor model combining a
GCNEncoder with a dot-product or MLP decoder; exposes encode(), decode(),
decode_all(), loss() (BCE over pos/neg edges), and a convenience forward()
- astroml/tasks/link_prediction_task.py: LinkPredictionTask orchestrator:
- build_splits(): enumerates (context, future) window pairs keyed by
ledger sequence; context strictly precedes future with no overlap;
supports context_ledgers cap to restrict the lookback window
- sample_negatives(): samples random non-edges at a configurable ratio
- train_step(): one gradient update given a LedgerSplit + node features
- evaluate(): returns ROC-AUC and average precision on a held-out split
- LedgerSplit dataclass: holds context/future edge sets + node_index
with to_edge_index() helper for building PyG edge_index tensors
- astroml/tasks/__init__.py: package init
- tests/test_link_prediction.py: 20+ unit tests covering LedgerSplit,
sample_negative_edges, build_splits (temporal ordering, future window
bounds, context_ledgers, empty input, error guards), sample_negatives,
and LinkPredictor decoder + loss behaviour, including a loss-decreases test
Bug fixed during implementation: context_ledgers restriction now guards
against negative index when fewer context ledgers exist than requested.
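The dot-product decoder and BCE objective over positive and sampled negative edges can be illustrated with a small NumPy sketch. The real LinkPredictor encodes node features with a GCNEncoder; here a precomputed embedding matrix `z` stands in for the encoder output, and the function names are illustrative.

```python
import numpy as np


def decode(z, edge_index):
    # Dot-product score for each (src, dst) pair in a (2, E) edge_index.
    src, dst = edge_index
    return np.sum(z[src] * z[dst], axis=1)


def bce_loss(z, pos_edges, neg_edges):
    # Binary cross-entropy: observed edges labeled 1, sampled non-edges 0.
    logits = np.concatenate([decode(z, pos_edges), decode(z, neg_edges)])
    labels = np.concatenate([np.ones(pos_edges.shape[1]),
                             np.zeros(neg_edges.shape[1])])
    probs = 1.0 / (1.0 + np.exp(-logits))
    eps = 1e-12  # guard against log(0)
    return -np.mean(labels * np.log(probs + eps)
                    + (1 - labels) * np.log(1 - probs + eps))
```

Driving this loss down pushes embeddings of accounts that transact together closer (higher dot product) and non-edges apart, which is the self-supervised signal the task exploits.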
…f-loop cycles in the transaction graph
…nsactions into discrete time-windowed graph snapshots for training
feat: integrate MLflow tracking for loss curves, ROC-AUC, and model w…
Implement single-account frequency computation and consistency tests
…k-prediction feat: self-supervised link prediction training objective for account …
Add contributing guide and ignore build artifacts
…ustom assets during graph construction
…edges fix: decompose path payments into per-hop edges and guard against sel…
…yping feat: add multi-asset edge typing to classify XLM, stablecoins, and c…
Feat/standard 2 layer gcn
…shot-slicer-v2 feat: add DB-backed iter_db_snapshots utility to slice normalized tra…
[ML] Temporal Train/Test Split Logic
Summary
- Add astroml/training/temporal_split.py with two core functions and a config-driven class:
  - temporal_graph_split(): splits a sequence of graph Edge objects into strict temporal train/test partitions (both cutoff and ratio modes), sorts by timestamp, and hard-raises LeakageError if any temporal overlap is detected.
  - validate_graph_split(): standalone post-hoc validator for a GraphSplitResult that checks for train-max >= test-min overlap.
  - TemporalSplitter class: thin config-driven wrapper that dispatches to either the DataFrame splitter (re-exported from astroml.validation.leakage) or the new graph-edge splitter, with automatic validation on every split.
- Update astroml/training/__init__.py to export the new public symbols.
- Add a temporal_split: configuration block to configs/training/default.yaml (enabled: false by default; opt-in via Hydra override with no breaking changes to existing runs).
- Wire apply_temporal_masks() into train.py: when training.temporal_split.enabled: true, replace the dataset's existing random train/val/test masks with temporally-ordered masks, using a node_timestamps attribute if present on the data object and otherwise falling back to node index as an ingestion-order proxy.
- Add tests/test_temporal_split.py with 20+ test cases.
What is enforced
- LeakageError raised if train_max >= test_min
- UserWarning emitted when either partition is empty (degenerate cutoff)
- time_attr / time_col params let callers use any field name
- temporal_split.enabled: false (default): existing random-mask training is unchanged
Test plan
- python -m pytest tests/test_temporal_split.py -v: all 20+ cases pass
- python -m pytest tests/test_leakage.py -v: existing leakage tests still pass
- python train.py training.temporal_split.enabled=true: logs show "Temporal masks applied: train=N val=M test=K"
- python train.py training.temporal_split.enabled=true training.temporal_split.train_ratio=0.7: mask sizes reflect the override
- Manually construct an overlapping GraphSplitResult and confirm validate_graph_split raises LeakageError
- Call TemporalSplitter.split_edges and assert max(train timestamps) < min(test timestamps)
Closes #76
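The validator behaviour exercised by the manual checks above can be sketched in a few lines. This is an illustrative stand-in, assuming a split result that exposes train/test timestamp sequences; the real validate_graph_split operates on a GraphSplitResult and may differ in signature.

```python
import warnings


class LeakageError(ValueError):
    """Temporal overlap between train and test partitions."""


def validate_split(train_times, test_times):
    # Degenerate cutoff: one side of the split received no samples.
    if not train_times or not test_times:
        warnings.warn("degenerate cutoff: one partition is empty", UserWarning)
        return
    # The single invariant the PR enforces: train_max < test_min.
    if max(train_times) >= min(test_times):
        raise LeakageError(
            f"train max {max(train_times)} >= test min {min(test_times)}"
        )
```

A clean split passes silently, any overlap raises, and an empty partition warns rather than raises, matching the "What is enforced" list.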