feat: implement strict temporal train/test split to prevent data leakage #100

Merged
gelluisaac merged 25 commits into Traqora:main from akintewe:feat/temporal-train-test-split on Mar 27, 2026

Conversation

@akintewe
Contributor

[ML] Temporal Train/Test Split Logic

Summary

  • Added astroml/training/temporal_split.py with two core functions and a config-driven class:
    • temporal_graph_split() — splits a sequence of graph Edge objects into strict temporal train/test partitions (both cutoff and ratio modes), sorts by timestamp, and hard-raises LeakageError if any temporal overlap is detected.
    • validate_graph_split() — standalone post-hoc validator for a GraphSplitResult that checks for train-max ≥ test-min overlap.
    • TemporalSplitter class — thin config-driven wrapper that dispatches to either the DataFrame splitter (re-exported from astroml.validation.leakage) or the new graph-edge splitter, with automatic validation on every split.
  • Updated astroml/training/__init__.py to export the new public symbols.
  • Added a temporal_split: configuration block to configs/training/default.yaml (enabled: false by default — opt-in via Hydra override with no breaking changes to existing runs).
  • Wired apply_temporal_masks() into train.py: when training.temporal_split.enabled: true, it replaces the dataset's existing random train/val/test masks with temporally ordered masks. It uses the node_timestamps attribute if present on the data object; otherwise it falls back to node index as an ingestion-order proxy.
  • Added tests/test_temporal_split.py with 20+ test cases.
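
The splitting behaviour described above (sort by timestamp, split by ratio or cutoff, raise on overlap) can be illustrated with a minimal sketch. This is not the code in astroml/training/temporal_split.py: the Edge dataclass, the split_edges_by_time name, the tuple return value, and the choice that the cutoff is exclusive for the training side are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple


class LeakageError(ValueError):
    """Raised when training data is not strictly earlier than test data."""


@dataclass(frozen=True)
class Edge:
    # Hypothetical stand-in for the project's Edge type.
    src: str
    dst: str
    timestamp: float


def split_edges_by_time(
    edges: Sequence[Edge],
    train_ratio: float = 0.8,
    cutoff: Optional[float] = None,
    time_attr: str = "timestamp",
) -> Tuple[List[Edge], List[Edge]]:
    """Sort edges by time, then split by cutoff (if given) or by ratio."""
    # Always re-sort, so a shuffled input cannot change the outcome.
    ordered = sorted(edges, key=lambda e: getattr(e, time_attr))

    if cutoff is not None:
        # Assumption: edges strictly before the cutoff go to train.
        split_idx = sum(1 for e in ordered if getattr(e, time_attr) < cutoff)
    else:
        if not 0.0 < train_ratio < 1.0:
            raise ValueError("train_ratio must be in (0, 1)")
        split_idx = int(len(ordered) * train_ratio)

    train, test = ordered[:split_idx], ordered[split_idx:]

    # Strict check: a tie at the boundary counts as overlap.
    if train and test:
        train_max = getattr(train[-1], time_attr)
        test_min = getattr(test[0], time_attr)
        if train_max >= test_min:
            raise LeakageError(
                f"train max {train_max} >= test min {test_min}"
            )
    return train, test


# Usage: a shuffled edge list still yields a strictly ordered split.
shuffled = [Edge("a", "b", t) for t in (5, 1, 3, 2, 4, 6)]
train_edges, test_edges = split_edges_by_time(shuffled, train_ratio=0.5)
```

One consequence of the strict train_max >= test_min rule in this sketch is that, in ratio mode, two edges sharing a timestamp exactly at the split boundary trigger LeakageError rather than being silently separated.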

What is enforced

| Guarantee | How |
|---|---|
| No future data in training | Edges/rows sorted by timestamp; split index computed on sorted sequence |
| Hard overlap detection | LeakageError raised if train_max >= test_min |
| Shuffled input handled | Input always re-sorted before splitting, so a shuffle cannot cause leakage |
| Empty partition warnings | UserWarning emitted when either partition is empty (degenerate cutoff) |
| Custom timestamp attributes | time_attr / time_col params let callers use any field name |
| Backward compatible | temporal_split.enabled: false (default); existing random-mask training is unchanged |
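
The overlap and empty-partition guarantees can be sketched as a standalone check. The SplitTimestamps container and check_split function below are hypothetical simplifications (the real GraphSplitResult and validate_graph_split are not shown in this PR description); only the rule train_max >= test_min and the use of UserWarning come from the table above.

```python
import warnings
from dataclasses import dataclass
from typing import List


class LeakageError(ValueError):
    """Raised when the train and test partitions overlap in time."""


@dataclass
class SplitTimestamps:
    # Hypothetical, simplified stand-in for GraphSplitResult:
    # it carries only the timestamps of each partition.
    train: List[float]
    test: List[float]


def check_split(split: SplitTimestamps) -> None:
    """Warn on an empty partition; raise if train is not strictly earlier."""
    if not split.train or not split.test:
        # Nothing to compare; a degenerate cutoff is the usual cause.
        warnings.warn("temporal split produced an empty partition", UserWarning)
        return

    train_max = max(split.train)
    test_min = min(split.test)
    if train_max >= test_min:
        raise LeakageError(
            f"temporal overlap: train max {train_max} >= test min {test_min}"
        )


# Usage: a clean split passes silently.
check_split(SplitTimestamps(train=[1.0, 2.0], test=[3.0, 4.0]))
```

Returning after the warning, instead of raising, mirrors the table's distinction between a degenerate split (a warning) and real leakage (a hard error).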

Test plan

  • python -m pytest tests/test_temporal_split.py -v — all 20+ cases pass
  • python -m pytest tests/test_leakage.py -v — existing leakage tests still pass
  • python train.py training.temporal_split.enabled=true — logs show "Temporal masks applied: train=N val=M test=K"
  • python train.py training.temporal_split.enabled=true training.temporal_split.train_ratio=0.7 — mask sizes reflect the override
  • Manually construct an overlapping GraphSplitResult and confirm validate_graph_split raises LeakageError
  • Pass a shuffled edge list to TemporalSplitter.split_edges and assert max(train timestamps) < min(test timestamps)
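
To reason about the mask-size checks in the test plan, here is a rough sketch of how temporally ordered train/val/test masks might be built. It is not the apply_temporal_masks() implementation: the function name temporal_masks, the val_ratio parameter, and the use of plain Python lists instead of tensors are assumptions. Only the fallback from node_timestamps to node index is taken from the PR description.

```python
from typing import List, Optional, Sequence, Tuple


def temporal_masks(
    num_nodes: int,
    node_timestamps: Optional[Sequence[float]] = None,
    train_ratio: float = 0.7,
    val_ratio: float = 0.15,  # assumed parameter, not named in the PR
) -> Tuple[List[bool], List[bool], List[bool]]:
    """Assign the oldest nodes to train, the next to val, the newest to test."""
    if node_timestamps is None:
        # Fallback: treat node index as an ingestion-order proxy.
        keys = list(range(num_nodes))
    else:
        keys = list(node_timestamps)
        if len(keys) != num_nodes:
            raise ValueError("node_timestamps must have one entry per node")

    # Node ids ordered from oldest to newest (sorted() is stable).
    order = sorted(range(num_nodes), key=lambda i: keys[i])

    n_train = int(num_nodes * train_ratio)
    n_val = int(num_nodes * val_ratio)

    train = [False] * num_nodes
    val = [False] * num_nodes
    test = [False] * num_nodes
    for rank, node in enumerate(order):
        if rank < n_train:
            train[node] = True
        elif rank < n_train + n_val:
            val[node] = True
        else:
            test[node] = True
    return train, val, test


# Usage: mask sizes follow the ratios, regardless of node ordering.
train_mask, val_mask, test_mask = temporal_masks(
    10, node_timestamps=[9, 8, 7, 6, 5, 4, 3, 2, 1, 0],
    train_ratio=0.5, val_ratio=0.2,
)
print(f"train={sum(train_mask)} val={sum(val_mask)} test={sum(test_mask)}")
```

This sketch does not handle nodes that share a timestamp across a mask boundary; a stricter version would need a tie rule comparable to the edge-level LeakageError check.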

Closes #76

Ekenesamuel8 and others added 25 commits March 24, 2026 14:34
…eights

- Add astroml/tracking/mlflow_tracker.py: thin MLflowTracker wrapper that
  gracefully degrades to no-op when mlflow is not installed or disabled
- Wire MLflowTracker into train.py (GCN): logs hyperparams, per-epoch
  train/val loss and accuracy curves, final test metrics, and model artifact
- Wire MLflowTracker into DeepSVDDTrainer: logs per-epoch train_loss,
  val_loss, svdd_radius, and evaluation metrics (ROC-AUC, AUC-PR, F1,
  precision, recall, accuracy); logs best model checkpoint as artifact
- Add mlflow config block to configs/config.yaml with tracking_uri,
  experiment_name, run_name, and log_model_weights toggles
- Add mlflow>=2.10.0 to requirements.txt
- Add astroml/training/temporal_split.py with:
  - temporal_graph_split(): splits Edge sequences by time (cutoff or ratio
    mode), sorts by timestamp, hard-raises LeakageError on any overlap
  - validate_graph_split(): standalone overlap validator for GraphSplitResult
  - TemporalSplitter class: config-driven wrapper supporting both DataFrame
    and graph-edge inputs with automatic validation
  - Re-exports temporal_train_test_split/validate_temporal_split from
    astroml.validation.leakage for a single import surface
- Update astroml/training/__init__.py to export new symbols
- Add temporal_split config block to configs/training/default.yaml
  (enabled: false by default, opt-in via Hydra override)
- Wire apply_temporal_masks() into train.py: when temporal_split.enabled
  is true, replaces dataset's random masks with temporally-ordered
  train/val/test masks (uses node_timestamps attr if present, else node
  index as ingestion-order proxy)
- Add tests/test_temporal_split.py with 20+ cases covering ratio mode,
  cutoff mode, shuffled input, custom time attr, empty edges, overlap
  detection, and the no-leakage guarantee property
…transaction prediction

Adds a training objective that predicts whether two accounts will transact
in the next N ledgers — a self-supervised task requiring no manual labels.

New files:
- astroml/models/link_prediction.py: LinkPredictor model combining a
  GCNEncoder with a dot-product or MLP decoder; exposes encode(), decode(),
  decode_all(), loss() (BCE over pos/neg edges), and a convenience forward()
- astroml/tasks/link_prediction_task.py: LinkPredictionTask orchestrator:
  - build_splits(): enumerates (context, future) window pairs keyed by
    ledger sequence; context strictly precedes future with no overlap;
    supports context_ledgers cap to restrict the lookback window
  - sample_negatives(): samples random non-edges at a configurable ratio
  - train_step(): one gradient update given a LedgerSplit + node features
  - evaluate(): returns ROC-AUC and average precision on a held-out split
  - LedgerSplit dataclass: holds context/future edge sets + node_index
    with to_edge_index() helper for building PyG edge_index tensors
- astroml/tasks/__init__.py: package init
- tests/test_link_prediction.py: 20+ unit tests covering LedgerSplit,
  sample_negative_edges, build_splits (temporal ordering, future window
  bounds, context_ledgers, empty input, error guards), sample_negatives,
  and LinkPredictor decoder + loss behaviour including loss-decreases test

Bug fixed during implementation: context_ledgers restriction now guards
against negative index when fewer context ledgers exist than requested.
…nsactions into discrete time-windowed graph snapshots for training
feat: integrate MLflow tracking for loss curves, ROC-AUC, and model w…
Implement single-account frequency computation and consistency tests
…k-prediction

feat: self-supervised link prediction training objective for account …
Add contributing guide and ignore build artifacts
…edges

fix: decompose path payments into per-hop edges and guard against sel…
…yping

feat: add multi-asset edge typing to classify XLM, stablecoins, and c…
…shot-slicer-v2

feat: add DB-backed iter_db_snapshots utility to slice normalized tra…
@gelluisaac gelluisaac merged commit e4a06c5 into Traqora:main Mar 27, 2026
1 of 2 checks passed

Development

Successfully merging this pull request may close these issues.

[ML] Temporal Train/Test Split Logic

5 participants