feat: implement strict temporal train/test split to prevent data leakage #100
Merged
gelluisaac merged 25 commits into Traqora:main on Mar 27, 2026
Conversation
…eights
- Add astroml/tracking/mlflow_tracker.py: thin MLflowTracker wrapper that gracefully degrades to a no-op when mlflow is not installed or disabled
- Wire MLflowTracker into train.py (GCN): logs hyperparams, per-epoch train/val loss and accuracy curves, final test metrics, and the model artifact
- Wire MLflowTracker into DeepSVDDTrainer: logs per-epoch train_loss, val_loss, svdd_radius, and evaluation metrics (ROC-AUC, AUC-PR, F1, precision, recall, accuracy); logs the best model checkpoint as an artifact
- Add an mlflow config block to configs/config.yaml with tracking_uri, experiment_name, run_name, and log_model_weights toggles
- Add mlflow>=2.10.0 to requirements.txt
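The graceful-degradation pattern described above can be sketched as follows. This is a minimal illustration, not the actual MLflowTracker code: the class name comes from the PR, but the constructor arguments and method set are assumptions.

```python
class MLflowTracker:
    """Sketch of a tracker that becomes a no-op when mlflow is
    unavailable or tracking is disabled (illustrative, not the real API)."""

    def __init__(self, enabled=True, tracking_uri=None, experiment_name=None):
        self.active = False
        if not enabled:
            return
        try:
            import mlflow
        except ImportError:
            # mlflow not installed: every logging method below is a no-op.
            return
        self._mlflow = mlflow
        if tracking_uri:
            mlflow.set_tracking_uri(tracking_uri)
        if experiment_name:
            mlflow.set_experiment(experiment_name)
        self.active = True

    def log_params(self, params):
        if self.active:
            self._mlflow.log_params(params)

    def log_metric(self, key, value, step=None):
        if self.active:
            self._mlflow.log_metric(key, value, step=step)
```

Because every method checks `self.active`, training code can call the tracker unconditionally and runs identically whether or not mlflow is present.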
- Add astroml/training/temporal_split.py with:
- temporal_graph_split(): splits Edge sequences by time (cutoff or ratio
mode), sorts by timestamp, hard-raises LeakageError on any overlap
- validate_graph_split(): standalone overlap validator for GraphSplitResult
- TemporalSplitter class: config-driven wrapper supporting both DataFrame
and graph-edge inputs with automatic validation
- Re-exports temporal_train_test_split/validate_temporal_split from
astroml.validation.leakage for a single import surface
- Update astroml/training/__init__.py to export new symbols
- Add temporal_split config block to configs/training/default.yaml
(enabled: false by default, opt-in via Hydra override)
- Wire apply_temporal_masks() into train.py: when temporal_split.enabled
is true, replaces dataset's random masks with temporally-ordered
train/val/test masks (uses node_timestamps attr if present, else node
index as ingestion-order proxy)
- Add tests/test_temporal_split.py with 20+ cases covering ratio mode,
cutoff mode, shuffled input, custom time attr, empty edges, overlap
detection, and the no-leakage guarantee property
…transaction prediction
Adds a training objective that predicts whether two accounts will transact
in the next N ledgers — a self-supervised task requiring no manual labels.
New files:
- astroml/models/link_prediction.py: LinkPredictor model combining a
GCNEncoder with a dot-product or MLP decoder; exposes encode(), decode(),
decode_all(), loss() (BCE over pos/neg edges), and a convenience forward()
- astroml/tasks/link_prediction_task.py: LinkPredictionTask orchestrator:
- build_splits(): enumerates (context, future) window pairs keyed by
ledger sequence; context strictly precedes future with no overlap;
supports context_ledgers cap to restrict the lookback window
- sample_negatives(): samples random non-edges at a configurable ratio
- train_step(): one gradient update given a LedgerSplit + node features
- evaluate(): returns ROC-AUC and average precision on a held-out split
- LedgerSplit dataclass: holds context/future edge sets + node_index
with to_edge_index() helper for building PyG edge_index tensors
- astroml/tasks/__init__.py: package init
- tests/test_link_prediction.py: 20+ unit tests covering LedgerSplit,
sample_negative_edges, build_splits (temporal ordering, future window
bounds, context_ledgers, empty input, error guards), sample_negatives,
and LinkPredictor decoder + loss behaviour, including a loss-decreases test
Bug fixed during implementation: context_ledgers restriction now guards
against negative index when fewer context ledgers exist than requested.
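The dot-product decoder and BCE objective over positive and sampled negative edges can be illustrated with a small NumPy sketch. The real LinkPredictor encodes node features with a GCNEncoder; here a precomputed embedding matrix `z` stands in for the encoder output, and the function names are illustrative.

```python
import numpy as np


def decode(z, edge_index):
    # Dot-product score for each (src, dst) pair in a (2, E) edge_index.
    src, dst = edge_index
    return np.sum(z[src] * z[dst], axis=1)


def bce_loss(z, pos_edges, neg_edges):
    # Binary cross-entropy: observed edges labeled 1, sampled non-edges 0.
    logits = np.concatenate([decode(z, pos_edges), decode(z, neg_edges)])
    labels = np.concatenate([np.ones(pos_edges.shape[1]),
                             np.zeros(neg_edges.shape[1])])
    probs = 1.0 / (1.0 + np.exp(-logits))
    eps = 1e-12  # guard against log(0)
    return -np.mean(labels * np.log(probs + eps)
                    + (1 - labels) * np.log(1 - probs + eps))
```

Driving this loss down pushes embeddings of accounts that transact together closer (higher dot product) and non-edges apart, which is the self-supervised signal the task exploits.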
…f-loop cycles in the transaction graph
…nsactions into discrete time-windowed graph snapshots for training
feat: integrate MLflow tracking for loss curves, ROC-AUC, and model w…
Implement single-account frequency computation and consistency tests
…k-prediction feat: self-supervised link prediction training objective for account …
Add contributing guide and ignore build artifacts
…ustom assets during graph construction
…edges fix: decompose path payments into per-hop edges and guard against sel…
…yping feat: add multi-asset edge typing to classify XLM, stablecoins, and c…
Feat/standard 2 layer gcn
…shot-slicer-v2 feat: add DB-backed iter_db_snapshots utility to slice normalized tra…
[ML] Temporal Train/Test Split Logic
Summary
- Add astroml/training/temporal_split.py with two core functions and a config-driven class:
  - temporal_graph_split(): splits a sequence of graph Edge objects into strict temporal train/test partitions (both cutoff and ratio modes), sorts by timestamp, and hard-raises LeakageError if any temporal overlap is detected.
  - validate_graph_split(): standalone post-hoc validator for a GraphSplitResult that checks for train-max >= test-min overlap.
  - TemporalSplitter class: thin config-driven wrapper that dispatches to either the DataFrame splitter (re-exported from astroml.validation.leakage) or the new graph-edge splitter, with automatic validation on every split.
- Update astroml/training/__init__.py to export the new public symbols.
- Add a temporal_split: configuration block to configs/training/default.yaml (enabled: false by default; opt-in via Hydra override with no breaking changes to existing runs).
- Wire apply_temporal_masks() into train.py: when training.temporal_split.enabled: true, replace the dataset's existing random train/val/test masks with temporally-ordered masks, using a node_timestamps attribute if present on the data object and otherwise falling back to node index as an ingestion-order proxy.
- Add tests/test_temporal_split.py with 20+ test cases.
What is enforced
- LeakageError raised if train_max >= test_min
- UserWarning emitted when either partition is empty (degenerate cutoff)
- time_attr / time_col params let callers use any field name
- temporal_split.enabled: false (default): existing random-mask training is unchanged
Test plan
- python -m pytest tests/test_temporal_split.py -v: all 20+ cases pass
- python -m pytest tests/test_leakage.py -v: existing leakage tests still pass
- python train.py training.temporal_split.enabled=true: logs show "Temporal masks applied: train=N val=M test=K"
- python train.py training.temporal_split.enabled=true training.temporal_split.train_ratio=0.7: mask sizes reflect the override
- Manually construct an overlapping GraphSplitResult and confirm validate_graph_split raises LeakageError
- Call TemporalSplitter.split_edges and assert max(train timestamps) < min(test timestamps)
Closes #76
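The validator behaviour exercised by the manual checks above can be sketched in a few lines. This is an illustrative stand-in, assuming a split result that exposes train/test timestamp sequences; the real validate_graph_split operates on a GraphSplitResult and may differ in signature.

```python
import warnings


class LeakageError(ValueError):
    """Temporal overlap between train and test partitions."""


def validate_split(train_times, test_times):
    # Degenerate cutoff: one side of the split received no samples.
    if not train_times or not test_times:
        warnings.warn("degenerate cutoff: one partition is empty", UserWarning)
        return
    # The single invariant the PR enforces: train_max < test_min.
    if max(train_times) >= min(test_times):
        raise LeakageError(
            f"train max {max(train_times)} >= test min {min(test_times)}"
        )
```

A clean split passes silently, any overlap raises, and an empty partition warns rather than raises, matching the "What is enforced" list.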