diff --git a/records/track_non_record_16mb/2026-04-01_WhirlpoolV5b_LorentzianTransformer/README.md b/records/track_non_record_16mb/2026-04-01_WhirlpoolV5b_LorentzianTransformer/README.md new file mode 100644 index 0000000000..1ebbd95f98 --- /dev/null +++ b/records/track_non_record_16mb/2026-04-01_WhirlpoolV5b_LorentzianTransformer/README.md @@ -0,0 +1,159 @@ +# Whirlpool v5b — Non-Euclidean Lorentzian Attention on the Hyperboloid Manifold + +**Non-record submission** — non-Euclidean geometry exploration, not claiming SOTA. + +Explores whether replacing dot-product attention with Minkowski inner products on a hyperboloid manifold can capture language's hierarchical structure more naturally. Key finding: it works, but requires careful stabilization (scale clamping + extended warmup gave **-0.88 BPB**). + +**val_bpb: 1.5918** (1-seed, SEED=314) | **~12.2 MB** | 8xH100 SXM + +## Results (8xH100 80GB SXM, PyTorch 2.9.1+cu128, AP-IN-1) + +| Seed | step_avg | steps | tokens | Pre-TTT bpb | **Post-TTT bpb** | TTT gain | Artifact | +|------|----------|-------|--------|-------------|-----------------|----------|----------| +| 314 | 36ms | 16,140 | 1.058B | 1.5976 | **1.5918** | -0.0058 | 12,134,416 | + +## Architecture: Lorentzian Geometry in Attention + +To our knowledge, this is the first non-Euclidean submission to replace standard dot-product attention with **hyperboloid geometry**. Instead of computing attention scores as dot products in flat Euclidean space, Whirlpool projects queries and keys onto the hyperboloid manifold and scores them using the Minkowski inner product — the natural metric of hyperbolic space. + +### Why Hyperbolic Geometry? + +Natural language has an inherent hierarchical structure (syntax trees, semantic taxonomies, discourse relations). Hyperbolic spaces can embed trees with exponentially lower distortion than Euclidean spaces of the same dimension (Sarkar 2011). By focusing attention on the hyperboloid, the model can natively represent hierarchical relationships in its attention patterns. + +### Key Innovation: Flash Lorentz Attention + +A custom Triton kernel (`flash_lorentz_attn`) fuses the entire Lorentzian attention pipeline into a single tiled kernel: + +1. **Hyperboloid projection**: `x0 = sqrt(1/c + ||x_spatial||^2)`, producing points on the hyperboloid sheet +2. **Minkowski inner product**: `_L = -a0*b0 + a1*b1 + ... + ad*bd` (Lorentzian metric) +3. **Temperature-scaled softmax**: `softmax(-c * _L / temp)` per head +4. **Lorentzian centroid aggregation**: Values aggregated on the manifold (not Euclidean mean) + +Registered as `torch.library.custom_op` — no graph breaks with `torch.compile`. O(T) memory via online softmax (no materialized attention matrix). GQA handled internally. + +```python +attn_out, spatial_norm = flash_lorentz_attn( + q, k, v, + curvature=curvature, # per-orbit curvature (0.1 to 2.0) + temp=self.temp, # per-head learned temperature + causal=True, + use_centroid=True, # Lorentzian centroid value aggregation +) +``` + +### Orbit Architecture: Depth Through Weight Sharing + +Instead of stacking N separate transformer layers, Whirlpool uses **3 shared blocks** called repeatedly across **8 orbits** with different curvatures: + +| Block | Orbits | Curvature | Role | +|-------|--------|-----------|------| +| Local | 0, 1, 2 | 0.10, 0.15, 0.24 | Flat geometry, local patterns | +| Transition | 3, 4 | 0.36, 0.55 | Bridge local to hierarchical | +| Hierarchy | 5, 6, 7 | 0.85, 1.30, 2.00 | High curvature, long-range | + +Each orbit has its own embedding perturbation and learnable attention/MLP scales, but shares the core weight matrices. This yields 8 passes of depth from 23.9M stored parameters — a Universal Transformer approach in which each orbit sees a different curvature of the underlying manifold. + +**Progressive curvature** (0.1 to 2.0, approximately geometric progression) means early orbits operate in nearly flat space (local token patterns) while later orbits operate in highly curved space (hierarchical/long-range dependencies). + +### Scale Clamping: Stabilizing Lorentzian Training + +Orbit scales (`attn_scale`, `mlp_scale`) tend to explode mid-training due to the exponential nature of hyperbolic geometry. Two stabilization techniques: + +1. **Scale clamping**: `.clamp(-5.0, 5.0)` on both scale params per orbit +2. **20% warmup**: Linear warmup over 20% of training (vs typical 1-2%) + +This combination improved val_bpb by **0.88 BPB** (2.433 to 1.55) — the single biggest finding in the Lorentzian track, and potentially useful for anyone working with non-Euclidean geometry in neural networks. + +## Training Configuration + +| Component | Setting | +|-----------|---------| +| d_model | 768 | +| Heads | 12 (GQA 6:1, kv=2, head_dim=64) | +| MLP | 5x with **LeakyReLU(0.5)²** (fused Triton kernel) — LeakyReLU preserves negative gradient flow; squaring maintains non-negative output (PR #493) | +| Orbits | 8 active (3 shared blocks) | +| Curvature | 0.1 to 2.0 (progressive) | +| Optimizer | MuonAdamW (muon lr=0.04, wd=0.12) | +| Warmup | 20% linear + cosine decay | +| Muon momentum | 0.85 to 0.95 (warmup over 500 steps) | +| EMA | 0.997 decay | +| Crown-Q | Quantization-aware training from 80% progress — adds a small penalty encouraging weights toward int8-friendly values | +| Quantization | int8 per-row + zlib | +| Compile | torch.compile with 20-step warmup | +| Batch | 8 per GPU, 65K total tokens/step | + +### Fused Triton Kernels + +Three custom Triton kernels for performance: + +1. **Flash Lorentz Attention**: Fused hyperboloid projection + Minkowski inner + softmax + centroid. O(T) memory, ~10x vs unfused PyTorch implementation of the same operations. +2. **Fused LeakyReLU(0.5)²**: Single kernel eliminates intermediate tensor allocation in the MLP. 2.6x MLP speedup. +3. **Int6 quantizer**: Triton-packed int6 with GPU dequant (available but int8 used for this submission — more headroom). + +## Eval Strategy: Parallel GPU Eval + +After training, each GPU independently runs a different TTT hyperparameter search. Rank 0 collects results and reports the best. Only TTT lr=5e-4 with 1 step provided marginal improvement (-0.006 BPB), consistent with a well-converged model: + +| Rank | Strategy | val_bpb | Time | +|------|----------|---------|------| +| 0 | base_int8 (compiled) | 1.5976 | 59s | +| **4** | **TTT lr=5e-4, 1 step** | **1.5918** | **262s** | +| 1 | TTT lr=5e-4, 2 steps | 1.6217 | 459s | +| 3 | TTT lr=1e-3, 1 step | 1.6500 | 261s | +| 7 | N-gram blend (alpha=0.3) | 1.7386 | 69s | + +TTT uses score-first protocol (legal): score chunk under `torch.no_grad()`, then train on scored tokens with AdamW(lr=5e-4, wd=0.0). + +## Development Journey + +Whirlpool evolved through 50+ iterations: + +- **v1-v3**: Poincare ball attention (collapsed — gradients vanished at ball boundary) +- **v4.0**: Switched to hyperboloid model — stable, first successful Lorentzian training +- **v4.2**: Added centroid aggregation (-1.23 BPB), progressive curvature, scale clamping +- **v5.0**: Fused Triton kernels (2.6x speedup), integrated DDP pipeline +- **v5b**: Fixed eval pipeline, parallel GPU eval, d=768 scaling, 8-orbit activation + +Key lessons: Lorentzian geometry requires careful numerical stabilization (scale clamping, warmup), EMA weights must be Euclidean (not manifold-aware), and the hyperboloid projection must use sqrt (not expmap) for training stability. + +## Limitations + +- **Single seed**: Only SEED=314 reported. Historical runs show variance of approximately +/-0.01 BPB. +- **No direct Euclidean ablation at this scale**: We have not run a standard dot-product attention model with the same d=768/5xMLP configuration and training budget for a controlled comparison. Our Euclidean track (different architecture, d=640) achieved 1.08 BPB, suggesting the Lorentzian geometry adds overhead without proportional BPB gain at this competition's scale/budget. +- **Unused artifact headroom**: 3.8MB of the 16MB budget is unused. Scaling d_model higher is limited by step time (8 orbits = 8 forward passes per step). +- **TTT benefit minimal**: Well-trained Lorentzian model shows only -0.006 BPB from TTT, vs -0.3 BPB when undertrained. + +## Run Command + +```bash +SEED=314 torchrun --standalone --nproc_per_node=8 train_gpt.py +``` + +## Artifact Budget + +| Component | Bytes | +|-----------|-------| +| Code (train_gpt.py) | 73,848 | +| Model (int8+zlib) | 12,134,416 | +| **Total** | **12,208,264** | +| **Headroom** | **3,791,736** | + +## Future Research Path + +Open questions for future work: + +- **Riemannian optimizers**: Replacing Euclidean SGD/Adam with manifold-aware optimization (Riemannian gradient descent on the hyperboloid) for the geometric parameters +- **Geodesic skip connections**: Interpolating residuals along geodesics rather than Euclidean addition +- **Curvature learning**: Making per-orbit curvature learnable rather than fixed, allowing the model to discover optimal geometry per depth +- **Mixed-geometry attention**: Different heads operating in different curvature regimes simultaneously +- **Hyperbolic embeddings**: Moving token embeddings onto the manifold (currently Euclidean) to maintain geometric consistency end-to-end +- **Direct Euclidean ablation**: Controlled comparison with same model size and orbit architecture but standard dot-product attention + +## Credits + +- **Hyperbolic embeddings**: Foundational work by Nickel & Kiela (2017) "Poincare Embeddings" and Ganea et al. (2018) "Hyperbolic Neural Networks" +- **Tree embedding distortion**: Sarkar (2011) "Low Distortion Delaunay Embedding of Trees in Hyperbolic Plane" +- **Flash Attention pattern**: Dao et al. (2022), adapted for Minkowski metric +- **MuonAdamW optimizer**: Based on upstream Parameter Golf Muon implementation +- **LeakyReLU(0.5)²**: PR #493 by @parinzee +- **TTT recipe**: Adapted from PR #461 by @Christopher-Lee-McClendon diff --git a/records/track_non_record_16mb/2026-04-01_WhirlpoolV5b_LorentzianTransformer/submission.json b/records/track_non_record_16mb/2026-04-01_WhirlpoolV5b_LorentzianTransformer/submission.json new file mode 100644 index 0000000000..0ed95000c1 --- /dev/null +++ b/records/track_non_record_16mb/2026-04-01_WhirlpoolV5b_LorentzianTransformer/submission.json @@ -0,0 +1,9 @@ +{ + "name": "Whirlpool v5b — Lorentzian Transformer with Flash Lorentz Attention", + "val_bpb": 1.5918, + "bytes_total": 12208264, + "blurb": "First non-Euclidean, Lorentzian geometry transformer in the competition. Replaces dot-product attention with Minkowski inner product on the hyperboloid manifold, using a custom Flash Lorentz Attention Triton kernel. Tri-block architecture with 8 curvature-progressive orbits through 3 shared blocks (23.9M stored, 62M effective params). d=768, 12H GQA 6:1, 5x MLP with fused LeakyReLU(0.5)². Parallel eval: 8 GPUs each run a different TTT hyperparameter, best reported. 1-seed: 1.5918.", + "author": "tmancino", + "github_id": "tmancino", + "date": "2026-04-01" +} diff --git a/records/track_non_record_16mb/2026-04-01_WhirlpoolV5b_LorentzianTransformer/train_gpt.py b/records/track_non_record_16mb/2026-04-01_WhirlpoolV5b_LorentzianTransformer/train_gpt.py new file mode 100644 index 0000000000..3344b82e66 --- /dev/null +++ b/records/track_non_record_16mb/2026-04-01_WhirlpoolV5b_LorentzianTransformer/train_gpt.py @@ -0,0 +1,1707 @@ +""" +Whirlpool v5 — Lorentzian Transformer with Fused Triton Kernels + +Champion v5: v4 base + fused kernel integration. + +Architecture: + - Tri-block: Local(0,1), Transition(3), Hierarchy(6,7) — 3 shared blocks, 5 active orbits + - d_model=768, 12H/2KV GQA 6:1, head_dim=64 + - Flash Lorentz Attention: fused Triton kernel (hyperboloid proj + Minkowski inner + softmax + centroid) + - Lorentzian centroid value aggregation on the manifold + - Progressive curvature 0.1→2.0 (exponential), scale clamping [-5,5] + - MLP 5x, Fused LeakyReLU(0.5)² Triton kernel (saves 800MB VRAM traffic/step) + - torch.compile + Flash Lorentz custom_op (30-36ms/step on H100) + - 20% warmup + cosine decay, MuonAdamW (muon lr=0.04, wd=0.12) + - DDP-ready, int6/int8 per-row quantization (Triton-packed int6 available), full eval pipeline + - Online n-gram GPU hash table for eval-time adaptation + +Fused kernels (optional, graceful fallback): + - fused_leaky: LeakyReLU(0.5)² in one kernel (eliminates 160MB intermediate per call) + - custom_quantizer: Triton-packed int6/int5 with GPU dequant kernel + - online_ngram: GPU-resident hash table for eval-time n-gram updates + +Requires: PyTorch >= 2.9.1+cu128, flash-lorentz-attention package +Template: RunPod y5cejece4j +""" + +import os +os.environ["PYTORCH_ALLOC_CONF"] = "expandable_segments:True" + +import contextlib +import gc +import glob +import json +import math +import sys +import time +from collections import defaultdict +from dataclasses import dataclass +from pathlib import Path + +import numpy as np +import torch +import torch.nn as nn +import torch.nn.functional as F +import torch.distributed as dist +from torch.nn.parallel import DistributedDataParallel as DDP +from torch import Tensor + +# --- Ensure /workspace is on path for fused kernel modules --- +if "/workspace" not in sys.path: + sys.path.insert(0, "/workspace") + +# --- Fused LeakyReLU² Triton kernel (saves 800MB VRAM traffic/step) --- +try: + from fused_leaky.kernel import fused_leaky_relu_squared + HAS_FUSED_LEAKY = True +except ImportError: + HAS_FUSED_LEAKY = False + +# --- Int6/Int5 Triton-packed custom quantizer --- +try: + from custom_quantizer.kernel import quantize_int6, dequantize_int6, dequantize_int6_gpu + HAS_INT6_TRITON = True +except ImportError: + HAS_INT6_TRITON = False + +# --- Online n-gram GPU hash table --- +# Online n-gram DISABLED — hash overflow crashes on CUDA (int64 overflow in h*1024+token) +HAS_ONLINE_NGRAM = False + +try: + from flash_lorentz import flash_lorentz_attn + HAS_FLASH_LORENTZ = True +except ImportError: + HAS_FLASH_LORENTZ = False + flash_lorentz_attn = None + +SMOKE_TEST = "--smoke-test" in sys.argv + +# ── v5.0 defaults: centroid + lens ON, everything else OFF ─────── +ABL_LORENTZ_CENTROID = bool(int(os.environ.get("WP_LORENTZ_CENTROID", "1"))) # PROVEN: -1.23 BPB +ABL_GEOMETRIC_LENS = bool(int(os.environ.get("WP_GEOMETRIC_LENS", "0"))) # v59: OFF — T3 beat T1 in H100 sweep (1.626 vs 1.728) +ABL_CURVATURE_USHAPE = bool(int(os.environ.get("WP_CURVATURE_USHAPE", "0"))) # CLASHES with centroid +ABL_MIXED_CURVATURE = bool(int(os.environ.get("WP_MIXED_CURVATURE", "0"))) # KILLED: breaks projection +ABL_GEODESIC_SKIP = bool(int(os.environ.get("WP_GEODESIC_SKIP", "0"))) # Untested +ABL_WORMHOLE = bool(int(os.environ.get("WP_WORMHOLE", "0"))) # KILLED: step overhead +ABL_HYP_NGRAM = bool(int(os.environ.get("WP_HYP_NGRAM", "0"))) # KILLED: hurts convergence +ABL_ADAPTIVE_GATE = bool(int(os.environ.get("WP_ADAPTIVE_GATE", "0"))) # Adaptive orbit width gating + +# Ablation status printed in main() after DDP setup + +# --------------------------------------------------------------------------- +# Device + DDP +# --------------------------------------------------------------------------- + +distributed = "RANK" in os.environ and "WORLD_SIZE" in os.environ +rank = int(os.environ.get("RANK", "0")) +world_size = int(os.environ.get("WORLD_SIZE", "1")) +local_rank = int(os.environ.get("LOCAL_RANK", "0")) +master_process = rank == 0 + +if torch.cuda.is_available(): + device_type = "cuda" + device = torch.device("cuda", local_rank) + torch.cuda.set_device(device) + if distributed: + dist.init_process_group(backend="nccl", device_id=device) + dist.barrier() + if master_process: + print(f"Device: CUDA ({torch.cuda.get_device_name()}) [rank {rank}/{world_size}]") +elif torch.backends.mps.is_available(): + device_type = "mps" + device = torch.device("mps") + if master_process: + print("Device: MPS (Apple Silicon)") +else: + device_type = "cpu" + device = torch.device("cpu") + if master_process: + print("Device: CPU") + +# --------------------------------------------------------------------------- +# Data loading (inlined — no prepare_shim dependency) +# --------------------------------------------------------------------------- + +import sentencepiece as spm + +MAX_SEQ_LEN = 1024 +_script_dir = os.path.dirname(os.path.abspath(__file__)) + +# Resolve data paths +for _dc in [os.path.join(_script_dir, "data"), "/workspace/parameter-golf/data", + os.path.join(_script_dir, "..", "..", "parameter-golf", "data"), "./data"]: + if os.path.isdir(_dc): + DATA_PATH = os.path.join(_dc, "datasets", "fineweb10B_sp1024") + TOKENIZER_PATH = os.path.join(_dc, "tokenizers", "fineweb_1024_bpe.model") + break +else: + DATA_PATH = os.environ.get("DATA_PATH", "./data/datasets/fineweb10B_sp1024") + TOKENIZER_PATH = os.environ.get("TOKENIZER_PATH", "./data/tokenizers/fineweb_1024_bpe.model") + +TRAIN_FILES = os.path.join(DATA_PATH, "fineweb_train_*.bin") +VAL_FILES = os.path.join(DATA_PATH, "fineweb_val_*.bin") + + +def load_data_shard(file: Path) -> torch.Tensor: + header = np.fromfile(file, dtype=" 0: + avail = self.tokens.numel() - self.pos + if avail <= 0: self._advance(); continue + k = min(rem, avail) + chunks.append(self.tokens[self.pos:self.pos + k]) + self.pos += k; rem -= k + return chunks[0] if len(chunks) == 1 else torch.cat(chunks) + + +def make_dataloader(tokenizer, batch_size, seq_len, split="train"): + stream = TokenStream(TRAIN_FILES if split == "train" else VAL_FILES) + _dev = torch.device("cuda" if torch.cuda.is_available() else "cpu") + while True: + chunk = stream.take(batch_size * seq_len + 1).to(dtype=torch.int64, device=_dev) + yield chunk[:-1].reshape(batch_size, seq_len), chunk[1:].reshape(batch_size, seq_len), 0 + + +def build_sentencepiece_luts(sp, vocab_size, dev): + sp_vs = int(sp.vocab_size()) + ts = max(sp_vs, vocab_size) + bb = np.zeros((ts,), dtype=np.int16) + hs = np.zeros((ts,), dtype=np.bool_) + ib = np.ones((ts,), dtype=np.bool_) + for tid in range(sp_vs): + if sp.is_control(tid) or sp.is_unknown(tid) or sp.is_unused(tid): continue + ib[tid] = False + if sp.is_byte(tid): bb[tid] = 1; continue + piece = sp.id_to_piece(tid) + if piece.startswith("\u2581"): hs[tid] = True; piece = piece[1:] + bb[tid] = len(piece.encode("utf-8")) + return (torch.tensor(bb, dtype=torch.int16, device=dev), + torch.tensor(hs, dtype=torch.bool, device=dev), + torch.tensor(ib, dtype=torch.bool, device=dev)) + + +def evaluate_bpb(model, tokenizer, batch_size, eval_wall=600): + """Val BPB eval with wall-clock limit. + + Uses model(x, y) which returns loss. Supports distributed (all_reduce) + and single-GPU modes. Stops early if eval_wall seconds exceeded. + """ + _dev = next(model.parameters()).device + sp = spm.SentencePieceProcessor(model_file=TOKENIZER_PATH) + vs = int(sp.vocab_size()) + bb_lut, hs_lut, ib_lut = build_sentencepiece_luts(sp, vs, _dev) + vf = [Path(p) for p in sorted(glob.glob(VAL_FILES))] + vt = torch.cat([load_data_shard(f) for f in vf]).contiguous() + usable = ((vt.numel() - 1) // MAX_SEQ_LEN) * MAX_SEQ_LEN + vt = vt[:usable + 1] + total_seqs = (vt.numel() - 1) // MAX_SEQ_LEN + + # Always eval full val set (rank 0 solo after DDP teardown) + seq_start = 0 + seq_end = total_seqs + + val_loss_sum = torch.zeros((), device=_dev, dtype=torch.float64) + val_token_count = torch.zeros((), device=_dev, dtype=torch.float64) + val_byte_count = torch.zeros((), device=_dev, dtype=torch.float64) + + t_eval_start = time.perf_counter() + model.eval() + n_batches = 0 + with torch.inference_mode(): + for bs in range(seq_start, seq_end, batch_size): + # Wall-clock check — uses absolute start time (includes compile) + elapsed = time.perf_counter() - t_eval_start + if n_batches > 0 and elapsed > eval_wall - 5: + if rank == 0: + print(f"[EVAL] wall hit at {elapsed:.0f}s, {n_batches} batches done", flush=True) + break + be = min(bs + batch_size, seq_end) + loc = vt[bs*MAX_SEQ_LEN:be*MAX_SEQ_LEN+1].to(device=_dev, dtype=torch.int64) + x = loc[:-1].reshape(be - bs, MAX_SEQ_LEN) + y = loc[1:].reshape(be - bs, MAX_SEQ_LEN) + with torch.autocast(device_type="cuda", dtype=torch.bfloat16, enabled=True): + batch_loss = model(x, y).detach() + bt = float(y.numel()) + val_loss_sum += batch_loss.to(torch.float64) * bt + val_token_count += bt + tb = bb_lut[y.reshape(-1)].to(torch.int16) + tb += (hs_lut[y.reshape(-1)] & ~ib_lut[x.reshape(-1)]).to(torch.int16) + val_byte_count += tb.to(torch.float64).sum() + n_batches += 1 + + # No all_reduce needed — eval always runs on rank 0 solo after DDP teardown + + val_loss = val_loss_sum / val_token_count + bits_per_token = val_loss.item() / math.log(2.0) + tokens_per_byte = val_token_count.item() / val_byte_count.item() + model.train() + return float(bits_per_token * tokens_per_byte) + +TIME_BUDGET = int(os.environ.get("TIME_BUDGET", "15" if SMOKE_TEST else "600")) + +# =================================================================== +# LORENTZ GEOMETRY +# =================================================================== + +# NOTE: @torch.compiler.disable REMOVED — NaN bug fixed in PyTorch 2.9.1+cu128 +# REQUIRES: PyTorch >= 2.9.1 with CUDA 12.8. Older versions WILL NaN. (LESSONS-LEARNED #1, #34) +def stable_arcosh(x: Tensor) -> Tensor: + x = x.clamp(min=1.0 + 1e-7) + return torch.log(x + torch.sqrt(x * x - 1.0)) + + +ABL_EXPMAP_PROJECTION = bool(int(os.environ.get("WP_EXPMAP_PROJECTION", "0"))) # v59: sqrt projection (expmap = 4.566 BPB, sqrt = 1.699) + +# NOTE: @torch.compiler.disable REMOVED — NaN bug fixed in PyTorch 2.9.1+cu128 +def project_to_hyperboloid(x_space: Tensor, c: float = 1.0) -> Tensor: + """[..., d] -> [..., d+1] on hyperboloid. + Two modes: + Default (sqrt): x0 = sqrt(1/c + ||x||²) — standard, used during training + ExpMap mode: treats x as tangent vector z, applies exp_map from origin + + REQUIRES PyTorch >= 2.9.1 — older versions NaN under torch.compile. + """ + if ABL_EXPMAP_PROJECTION: + sc = 1.0 / (c ** 0.5) + z_norm = x_space.norm(dim=-1, keepdim=True).clamp(min=1e-7) + # Clamp the actual argument to cosh/sinh, not the raw norm + # Keep sinh(arg) < ~1e4 for numerical stability → arg < 10 + arg = (z_norm * sc).clamp(max=10.0) + x0 = torch.cosh(arg) + x_spatial = torch.sinh(arg) * (x_space / z_norm) + return torch.cat([x0, x_spatial], dim=-1) + else: + x0 = torch.sqrt((1.0 / c) + x_space.pow(2).sum(dim=-1, keepdim=True)) + return torch.cat([x0, x_space], dim=-1) + + +def lorentz_spatial_norm(x: Tensor) -> Tensor: + """Spatial norm ||x[1:]||. At apex (origin of hyperbolic space) this is 0.""" + return x[..., 1:].norm(dim=-1) + + +def lorentz_inner(a: Tensor, b: Tensor) -> Tensor: + """Minkowski inner product: -a0*b0 + a1*b1 + ... (batched over last dim).""" + return -a[..., :1] * b[..., :1] + (a[..., 1:] * b[..., 1:]).sum(dim=-1, keepdim=True) + + +def lorentz_distance(a: Tensor, b: Tensor, c: float = 1.0) -> Tensor: + """Geodesic distance on hyperboloid: arcosh(-c * _L).""" + inner = lorentz_inner(a, b).squeeze(-1) + return stable_arcosh((-c * inner).clamp(min=1.0)) + + +def mobius_add(x: Tensor, y: Tensor, c: float = 1.0) -> Tensor: + """Möbius addition in the Poincaré ball (for hyperbolic n-gram embeddings). + Uses the Klein→Poincaré→add→Klein chain for numerical stability.""" + # Simplified: project to tangent space, add, project back + # For small curvature this approximates Euclidean addition + x2 = (x * x).sum(dim=-1, keepdim=True).clamp(max=1.0 - 1e-5) + y2 = (y * y).sum(dim=-1, keepdim=True).clamp(max=1.0 - 1e-5) + xy = (x * y).sum(dim=-1, keepdim=True) + num = (1 + 2*c*xy + c*y2) * x + (1 - c*x2) * y + denom = 1 + 2*c*xy + c*c*x2*y2 + return num / denom.clamp(min=1e-6) + + +# =================================================================== +# CONFIG +# =================================================================== + +@dataclass +class WhirlpoolV42Config: + sequence_len: int = 1024 + vocab_size: int = 1024 + + d_model: int = 768 + n_heads: int = 12 # 768 / 64 = 12 + n_kv_heads: int = 2 # GQA 6:1 (was 5:1 at d=640) + head_dim: int = 64 + mlp_ratio: int = 5 # Best BPB: 1.5027 config + n_orbits: int = 8 # total orbits (blocks have params for their assigned orbits) + active_orbits: tuple = (0, 1, 2, 3, 4, 5, 6, 7) # all 8 active (-0.084 BPB proven) + + # Tri-block: which orbits each block handles + local_orbits: tuple = (0, 1, 2) # flat geometry, local patterns + transition_orbits: tuple = (3, 4) # bridge local→hierarchical + hierarchy_orbits: tuple = (5, 6, 7) # high curvature, long-range + + # Progressive curvature per orbit + curvature_min: float = 0.1 # orbit 0: nearly flat + curvature_max: float = 2.0 # orbit N-1: highly curved + + # Regularization + centrifugal_lambda: float = 0.005 + ortho_lambda: float = 0.0005 + + # Training + logit_softcap: float = 30.0 + crown_q_start: float = 0.8 + rope_base: float = 10000.0 + + # Orbit embedding scale (10x stronger than v3's 0.01) + orbit_embed_scale: float = 0.1 + + # Trigram hash: local 3-token context + trigram_table_size: int = 4096 # compact table + trigram_dim: int = 256 # projected to d_model + + +def get_orbit_curvature(orbit_idx: int, n_orbits: int, c_min: float, c_max: float) -> float: + if ABL_CURVATURE_USHAPE: + # U-shape: flat → curved → flat (fix output head geometry mismatch) + mid = (n_orbits - 1) / 2.0 + t = 1.0 - abs(orbit_idx - mid) / mid # 0→1→0 + else: + # Monotonic: flat → curved + t = orbit_idx / max(n_orbits - 1, 1) + return c_min * (c_max / max(c_min, 1e-8)) ** t # v59f: exponential + + +# =================================================================== +# BUILDING BLOCKS +# =================================================================== + +def norm(x: Tensor) -> Tensor: + return F.rms_norm(x, (x.size(-1),)) + + +class CastedLinear(nn.Linear): + def forward(self, x: Tensor) -> Tensor: + return F.linear(x, self.weight.to(x.dtype), + self.bias.to(x.dtype) if self.bias is not None else None) + + +class Rotary(nn.Module): + def __init__(self, dim: int, base: float = 10000.0): + super().__init__() + inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim)) + self.register_buffer("inv_freq", inv_freq, persistent=False) + self._cache_len = 0 + self._cos = None + self._sin = None + + def forward(self, seq_len: int, device: torch.device, dtype: torch.dtype): + if self._cos is None or self._cache_len != seq_len or self._cos.device != device: + t = torch.arange(seq_len, device=device, dtype=torch.float32) + freqs = torch.outer(t, self.inv_freq.to(device)) + self._cos = freqs.cos()[None, :, None, :].to(dtype) + self._sin = freqs.sin()[None, :, None, :].to(dtype) + self._cache_len = seq_len + return self._cos, self._sin + + +def apply_rotary_emb(x: Tensor, cos: Tensor, sin: Tensor) -> Tensor: + d = x.size(-1) // 2 + x1, x2 = x[..., :d], x[..., d:] + return torch.cat([x1 * cos + x2 * sin, x1 * (-sin) + x2 * cos], dim=-1) + + +# =================================================================== +# TRIGRAM HASH (local 3-token context before geometry) +# =================================================================== + +class TrigramHash(nn.Module): + """Hash 3 consecutive tokens into a learned embedding. + Captures local trigram context that shifts token positions on the hyperboloid. + Subsumes bigram info (the 3-token window includes the 2-token pair).""" + + def __init__(self, table_size: int, tri_dim: int, model_dim: int): + super().__init__() + self.table_size = table_size + self.embed = nn.Embedding(table_size, tri_dim) + self.proj = CastedLinear(tri_dim, model_dim, bias=False) if tri_dim != model_dim else None + self.scale = nn.Parameter(torch.tensor(0.1, dtype=torch.float32)) + + def forward(self, token_ids: Tensor) -> Tensor: + t = token_ids.to(torch.int64) + mod = self.table_size - 1 + out = torch.full_like(t, mod) + if t.size(-1) > 2: + out[..., 2:] = (36313 * t[..., 2:] ^ 27191 * t[..., 1:-1] ^ 51637 * t[..., :-2]).abs() % mod + return self._embed(out.long()) + + def _embed(self, indices: Tensor) -> Tensor: + h = self.embed(indices) + if self.proj is not None: + h = self.proj(h) + return h * self.scale.to(dtype=h.dtype) + + +# =================================================================== +# LORENTZ ATTENTION (ALL heads go through geometry) +# =================================================================== + +class LorentzMultiHeadAttention(nn.Module): + """ALL attention heads use Lorentz distance-based attention. + Q,K are projected to the hyperboloid. Attention = softmax(-dist/temp). + V stays in tangent space (Euclidean). GQA supported.""" + + def __init__(self, d_model: int, n_heads: int, n_kv_heads: int, head_dim: int, + rope_base: float): + super().__init__() + self.n_heads = n_heads + self.n_kv_heads = n_kv_heads + self.head_dim = head_dim + + self.q_proj = CastedLinear(d_model, n_heads * head_dim, bias=False) + self.k_proj = CastedLinear(d_model, n_kv_heads * head_dim, bias=False) + self.v_proj = CastedLinear(d_model, n_kv_heads * head_dim, bias=False) + self.o_proj = CastedLinear(n_heads * head_dim, d_model, bias=False) + + self.rotary = Rotary(head_dim, base=rope_base) + + # Per-head temperature (learnable) + # Buffer not Parameter — flash_lorentz handles temp internally via custom_op. + # As Parameter it would be "unused" in SDPA fallback, crashing DDP. + self.register_buffer('temp', torch.ones(n_heads) * math.sqrt(head_dim)) + + def forward(self, x: Tensor, curvature: float) -> tuple[Tensor, Tensor]: + """Flash Lorentz Attention — fused hyperboloid projection + Minkowski inner + product + online softmax + centroid aggregation in a single tiled Triton kernel. + O(T) memory instead of O(T²). ~10x speedup. GQA handled internally. + See /Users/tmancino/AI/flash-lorentz-attention for kernel source.""" + B, T, _ = x.shape + H, Hkv, D = self.n_heads, self.n_kv_heads, self.head_dim + + q = self.q_proj(x).view(B, T, H, D) + k = self.k_proj(x).view(B, T, Hkv, D) + v = self.v_proj(x).view(B, T, Hkv, D) + + # RoPE on Q,K in spatial coords (before hyperboloid projection) + q, k = norm(q), norm(k) + cos, sin = self.rotary(T, x.device, q.dtype) + q = apply_rotary_emb(q, cos, sin) + k = apply_rotary_emb(k, cos, sin) + + if HAS_FLASH_LORENTZ: + # Flash Lorentz Attention: fused Triton kernel (custom_op, no graph breaks) + attn_out, spatial_norm = flash_lorentz_attn( + q, k, v, + curvature=curvature, + temp=self.temp, + causal=True, + use_centroid=ABL_LORENTZ_CENTROID, + ) + out = attn_out.reshape(B, T, H * D) + else: + # SDPA fallback — no Lorentz geometry, just standard attention + q_sdpa = q.transpose(1, 2) # (B, H, T, D) + # Expand KV for GQA + rep = H // Hkv + k_sdpa = k.unsqueeze(3).expand(B, T, Hkv, rep, D).reshape(B, T, H, D).transpose(1, 2) + v_sdpa = v.unsqueeze(3).expand(B, T, Hkv, rep, D).reshape(B, T, H, D).transpose(1, 2) + attn_out = F.scaled_dot_product_attention(q_sdpa, k_sdpa, v_sdpa, is_causal=True) + out = attn_out.transpose(1, 2).reshape(B, T, H * D) + spatial_norm = torch.zeros(1, device=x.device) + + return self.o_proj(out), spatial_norm + + +# =================================================================== +# SHARED BLOCK (orbited N times) +# =================================================================== + +class WhirlpoolBlock(nn.Module): + """Single shared block. Called once per orbit with different curvature.""" + + def __init__(self, config: WhirlpoolV42Config, assigned_orbits: tuple): + super().__init__() + d = config.d_model + self.assigned_orbits = assigned_orbits + # Map global orbit index → local parameter index (no unused params) + self._orbit_to_local = {o: i for i, o in enumerate(assigned_orbits)} + n = len(assigned_orbits) + + # Lorentz attention (ALL heads) + self.attn = LorentzMultiHeadAttention( + d, config.n_heads, config.n_kv_heads, config.head_dim, config.rope_base) + + # MLP: LeakyReLU(0.5)², expansion ratio from config + mlp_dim = config.mlp_ratio * d + self.mlp_up = CastedLinear(d, mlp_dim, bias=False) + self.mlp_down = CastedLinear(mlp_dim, d, bias=False) + + # Per-orbit parameters — only for assigned orbits (no unused params for DDP) + self.orbit_embeds = nn.Parameter(torch.zeros(n, d)) + self.attn_scale = nn.Parameter(torch.zeros(n)) + self.mlp_scale = nn.Parameter(torch.zeros(n)) + + # Adaptive orbit gate: learns per-orbit dimension importance + if ABL_ADAPTIVE_GATE: + n_groups = 8 + self.orbit_gate = nn.Linear(d, n_groups * n, bias=False) + self.n_groups = n_groups + nn.init.zeros_(self.orbit_gate.weight) + + def forward(self, x: Tensor, orbit_idx: int, curvature: float) -> tuple[Tensor, Tensor]: + """Returns (output, lorentz_spatial_norm).""" + local_idx = self._orbit_to_local[orbit_idx] + + # Adaptive gate: scale dimension groups based on input + if ABL_ADAPTIVE_GATE: + B, T, D = x.shape + g = self.n_groups + group_dim = D // g + all_gates = self.orbit_gate(x.mean(dim=1)) + gate_vals = all_gates[:, local_idx * g:(local_idx + 1) * g] + gate_vals = torch.sigmoid(gate_vals + 3.0) + x_grouped = x.view(B, T, g, group_dim) + x = (x_grouped * gate_vals[:, None, :, None]).view(B, T, D) + + # Orbit embedding (strong perturbation) + x = x + self.orbit_embeds[local_idx] + + # Attention with Lorentz geometry at this orbit's curvature + attn_out, lor_norm = self.attn(norm(x), curvature) + # Scale clamping [-5, 5] — prevents orbit dominance, stabilizes training + delta_attn = self.attn_scale[local_idx].clamp(-5, 5) * attn_out + + if ABL_GEODESIC_SKIP: + dn = torch.sqrt((delta_attn ** 2).sum(-1, keepdim=True).clamp(min=1e-8)) + x = x * torch.cosh(dn * 0.1) + delta_attn * torch.sinh(dn * 0.1) / dn + else: + x = x + delta_attn + + # MLP + h = self.mlp_up(norm(x)) + h = fused_leaky_relu_squared(h) if HAS_FUSED_LEAKY else F.leaky_relu(h, negative_slope=0.5).square() + mlp_out = self.mlp_down(h) + delta_mlp = self.mlp_scale[local_idx].clamp(-5, 5) * mlp_out + + if ABL_GEODESIC_SKIP: + dn = torch.sqrt((delta_mlp ** 2).sum(-1, keepdim=True).clamp(min=1e-8)) + x = x * torch.cosh(dn * 0.1) + delta_mlp * torch.sinh(dn * 0.1) / dn + else: + x = x + delta_mlp + + return x, lor_norm + + +# =================================================================== +# MAIN MODEL +# =================================================================== + +class WhirlpoolV42(nn.Module): + def __init__(self, config: WhirlpoolV42Config): + super().__init__() + self.config = config + self.tok_emb = nn.Embedding(config.vocab_size, config.d_model) + self.trigram = TrigramHash(config.trigram_table_size, config.trigram_dim, config.d_model) + + # Tri-block: 3 specialized blocks for different curvature regimes + # Each block only allocates params for its assigned orbits (no unused params for DDP) + self.local_block = WhirlpoolBlock(config, config.local_orbits) + self.transition_block = WhirlpoolBlock(config, config.transition_orbits) + self.hierarchy_block = WhirlpoolBlock(config, config.hierarchy_orbits) + + # Route orbits to blocks + self._orbit_to_block = {} + for o in config.local_orbits: + self._orbit_to_block[o] = self.local_block + for o in config.transition_orbits: + self._orbit_to_block[o] = self.transition_block + for o in config.hierarchy_orbits: + self._orbit_to_block[o] = self.hierarchy_block + + # Tied embeddings: no lm_head module, use F.linear(x, tok_emb.weight) in forward + # (v16 approach — CastedLinear assignment breaks PyTorch parameter registration) + self.lm_head = None + self.logit_softcap = config.logit_softcap + + # Precompute curvatures per orbit + self.curvatures = [ + get_orbit_curvature(i, config.n_orbits, config.curvature_min, config.curvature_max) + for i in range(config.n_orbits) + ] + + @torch.no_grad() + def init_weights(self): + c = self.config + d = c.d_model + s = (3 ** 0.5) * (d ** -0.5) / (c.n_orbits ** 0.25) + + nn.init.normal_(self.tok_emb.weight, std=s) + # lm_head is None (tied) — tok_emb.weight serves both roles + + # Trigram: zero-init (starts as no-op, learns to contribute) + nn.init.zeros_(self.trigram.embed.weight) + if self.trigram.proj is not None: + nn.init.zeros_(self.trigram.proj.weight) + + # Init all 3 blocks identically + for blk in [self.local_block, self.transition_block, self.hierarchy_block]: + for proj in [blk.attn.q_proj, blk.attn.k_proj, blk.attn.v_proj]: + nn.init.uniform_(proj.weight, -s, s) + nn.init.zeros_(blk.attn.o_proj.weight) + blk.attn.temp.data.fill_(math.sqrt(c.head_dim)) + nn.init.uniform_(blk.mlp_up.weight, -s, s) + nn.init.zeros_(blk.mlp_down.weight) + nn.init.normal_(blk.orbit_embeds, std=c.orbit_embed_scale) + # Scale init: use global orbit index for value, local index for storage + for global_orbit, local_idx in blk._orbit_to_local.items(): + val = 0.3 / c.n_orbits * (c.n_orbits - global_orbit) + blk.attn_scale.data[local_idx] = val + blk.mlp_scale.data[local_idx] = val + + def num_params(self): + return sum(p.numel() for p in self.parameters()) + + def forward(self, idx: Tensor, targets: Tensor | None = None) -> Tensor: + tok = self.tok_emb(idx).float() + tri = self.trigram(idx).float() + + if ABL_HYP_NGRAM: + # Hyperbolic n-gram: combine token + trigram via Möbius addition + # instead of Euclidean add. Keeps embeddings on the manifold. + # Scale down to unit ball for Poincaré Möbius add + tok_norm = tok / (tok.norm(dim=-1, keepdim=True).clamp(min=1e-6) + 1.0) + tri_norm = tri / (tri.norm(dim=-1, keepdim=True).clamp(min=1e-6) + 1.0) + x = mobius_add(tok_norm, tri_norm, c=1.0) + # Scale back up + x = x * tok.norm(dim=-1, keepdim=True).clamp(min=1e-6) + else: + x = tok + tri + + x = norm(x) + + lor_norms = [] + for orbit in self.config.active_orbits: + c = self.curvatures[orbit] + block = self._orbit_to_block[orbit] + if device_type == "cuda": + x, lor_norm = block(x, orbit, c) + else: + from torch.utils.checkpoint import checkpoint as grad_checkpoint + x, lor_norm = grad_checkpoint(block, x, orbit, c, use_reentrant=False) + lor_norms.append(lor_norm) + + logits = F.linear(norm(x), self.tok_emb.weight).float() # tied embeddings + cap = self.logit_softcap + logits = cap * torch.tanh(logits / cap) + + if targets is not None: + ce = F.cross_entropy(logits.view(-1, logits.size(-1)), + targets.view(-1), ignore_index=-1) + # Only add centrifugal during training, not eval + if self.training: + avg_lor_norm = torch.stack(lor_norms).mean() + centrifugal = self.config.centrifugal_lambda * torch.exp(-avg_lor_norm) + return ce + centrifugal + return ce + return logits + + +# =================================================================== +# MUON + ADAMW OPTIMIZER +# =================================================================== + +polar_express_coeffs = [ + (8.156554524902461, -22.48329292557795, 15.878769915207462), + (4.042929935166739, -2.808917465908714, 0.5000178451051316), + (3.8916678022926607, -2.772484153217685, 0.5060648178503393), + (3.285753657755655, -2.3681294933425376, 0.46449024233003106), + (2.3465413258596377, -1.7097828382687081, 0.42323551169305323), +] + + +def adamw_step_fused(p, grad, exp_avg, exp_avg_sq, step_t, lr_t, beta1_t, beta2_t, eps_t, wd_t): + step_t, lr_t = step_t.to(p.device, p.dtype), lr_t.to(p.device, p.dtype) + beta1_t, beta2_t = beta1_t.to(p.device, p.dtype), beta2_t.to(p.device, p.dtype) + eps_t, wd_t = eps_t.to(p.device, p.dtype), wd_t.to(p.device, p.dtype) + p.mul_(1 - lr_t * wd_t) + exp_avg.lerp_(grad, 1 - beta1_t) + exp_avg_sq.lerp_(grad.square(), 1 - beta2_t) + bias1, bias2 = 1 - beta1_t ** step_t, 1 - beta2_t ** step_t + p.add_(exp_avg / bias1 / ((exp_avg_sq / bias2).sqrt() + eps_t), alpha=-lr_t) + + +def muon_step_fused(sg, sp, mb, smb, mom_t, lr_t, wd_t, b2_t, ns_steps, rd): + mom_t = mom_t.to(sp.device, sp.dtype) + lr_t = lr_t.to(sp.device, sp.dtype) + wd_t = wd_t.to(sp.device, sp.dtype) + b2_t = b2_t.to(sp.device, sp.dtype) + mom = mom_t.to(sg.dtype) + mb.lerp_(sg, 1 - mom) + g = sg.lerp_(mb, mom) + X = g.bfloat16() + X = X / (X.norm(dim=(-2, -1), keepdim=True) * 1.02 + 1e-6) + if g.size(-2) > g.size(-1): + for a, b, c in polar_express_coeffs[:ns_steps]: + A = X.mT @ X; X = a * X + X @ (b * A + c * (A @ A)) + else: + for a, b, c in polar_express_coeffs[:ns_steps]: + A = X @ X.mT; X = a * X + (b * A + c * (A @ A)) @ X + g = X + vm = g.float().square().mean(dim=rd, keepdim=True) + rds = g.size(rd) + vn = (vm.sum(dim=(-2, -1), keepdim=True) * rds).sqrt() + b2c = b2_t.to(smb.dtype) + smb.lerp_(vm.to(smb.dtype), 1 - b2c) + ss = smb.clamp_min(1e-10).rsqrt() + vnn = ((vm * rds) * ss.float().square()).sum(dim=(-2, -1), keepdim=True).sqrt() + g = g * (ss * (vn / vnn.clamp_min(1e-10))).to(g.dtype) + lr, wd = lr_t.to(g.dtype), wd_t.to(g.dtype) + mask = (g * sp) >= 0 + sp.sub_(lr * g + lr * wd * sp * mask) + + +class MuonAdamW(torch.optim.Optimizer): + def __init__(self, param_groups): + super().__init__(param_groups, defaults={}) + for attr in ['step', 'lr', 'beta1', 'beta2', 'eps', 'wd']: + setattr(self, f'_adamw_{attr}_t', torch.tensor(0.0, device="cpu")) + for attr in ['momentum', 'lr', 'wd', 'beta2']: + setattr(self, f'_muon_{attr}_t', torch.tensor(0.0, device="cpu")) + ck = {"dynamic": False, "fullgraph": True} + if device_type == "cuda": + self.adamw_fn = torch.compile(adamw_step_fused, **ck) + self.muon_fn = torch.compile(muon_step_fused, **ck) + else: + self.adamw_fn = adamw_step_fused + self.muon_fn = muon_step_fused + + def _step_adamw(self, g): + for p in g['params']: + if p.grad is None: + continue + s = self.state[p] + if not s: + s['step'] = 0 + s['ea'] = torch.zeros_like(p) + s['eas'] = torch.zeros_like(p) + s['step'] += 1 + self._adamw_step_t.fill_(s['step']) + self._adamw_lr_t.fill_(g['lr']) + self._adamw_beta1_t.fill_(g['betas'][0]) + self._adamw_beta2_t.fill_(g['betas'][1]) + self._adamw_eps_t.fill_(g['eps']) + self._adamw_wd_t.fill_(g['weight_decay']) + self.adamw_fn(p, p.grad, s['ea'], s['eas'], + self._adamw_step_t, self._adamw_lr_t, + self._adamw_beta1_t, self._adamw_beta2_t, + self._adamw_eps_t, self._adamw_wd_t) + + def _step_muon(self, g): + params = g['params'] + if not params: + return + p = params[0] + s = self.state[p] + n = len(params) + sh, dev, dt = p.shape, p.device, p.dtype + if "mb" not in s: + s["mb"] = torch.zeros(n, *sh, dtype=dt, device=dev) + if "smb" not in s: + ss = (n, sh[-2], 1) if sh[-2] >= sh[-1] else (n, 1, sh[-1]) + s["smb"] = torch.zeros(ss, dtype=dt, device=dev) + rd = -1 if sh[-2] >= sh[-1] else -2 + sg = torch.stack([p.grad for p in params]) + sp = torch.stack(params) + self._muon_momentum_t.fill_(g["momentum"]) + self._muon_beta2_t.fill_(g.get("beta2", 0.95) or 0.0) + self._muon_lr_t.fill_(g["lr"] * max(1.0, sh[-2] / sh[-1]) ** 0.5) + self._muon_wd_t.fill_(g["weight_decay"]) + self.muon_fn(sg, sp, s["mb"], s["smb"], self._muon_momentum_t, + self._muon_lr_t, self._muon_wd_t, self._muon_beta2_t, + g["ns_steps"], rd) + torch._foreach_copy_(params, list(sp.unbind(0))) + + @torch.no_grad() + def step(self): + for g in self.param_groups: + if g['kind'] == 'adamw': + self._step_adamw(g) + elif g['kind'] == 'muon': + self._step_muon(g) + + +def setup_optimizer(model: WhirlpoolV42) -> MuonAdamW: + d = model.config.d_model + dscale = (d / 768) ** -0.5 + + mat_params = [] + scalar_params = [] + for blk in [model.local_block, model.transition_block, model.hierarchy_block]: + for name, p in blk.named_parameters(): + if p.ndim == 2 and min(p.shape) >= 8: + mat_params.append(p) + else: + scalar_params.append(p) + + # Trigram params: embed table -> AdamW, proj matrix -> Muon, scale -> scalar + tri_embed_id = id(model.trigram.embed.weight) + tri_proj_params = [] + tri_scalar_params = [] + for name, p in model.trigram.named_parameters(): + if id(p) == tri_embed_id: + continue + elif p.ndim == 2 and min(p.shape) >= 8: + tri_proj_params.append(p) + else: + tri_scalar_params.append(p) + + groups = [ + # tok_emb serves as both embedding and lm_head (tied, v16 approach) + dict(kind='adamw', params=list(model.tok_emb.parameters()), + lr=0.06, betas=(0.8, 0.95), eps=1e-10, weight_decay=0.0), + dict(kind='adamw', params=scalar_params + tri_scalar_params, + lr=0.08, betas=(0.96, 0.95), eps=1e-10, weight_decay=0.0), + dict(kind='adamw', params=[model.trigram.embed.weight], + lr=0.1, betas=(0.8, 0.95), eps=1e-10, weight_decay=0.0), + ] + + all_muon = mat_params + tri_proj_params + for shape in sorted({p.shape for p in all_muon}): + gp = [p for p in all_muon if p.shape == shape] + groups.append(dict(kind='muon', params=gp, lr=0.04, + momentum=0.95, ns_steps=5, beta2=0.95, + weight_decay=0.12)) + + return MuonAdamW(groups) + + +# =================================================================== +# CROWN-Q +# =================================================================== + +def crown_q_penalty(model: nn.Module, progress: float, start: float = 0.8) -> Tensor: + if progress < start: + return torch.tensor(0.0, device=device) + penalty = torch.tensor(0.0, device=device) + n = 0 + for p in model.parameters(): + if p.ndim == 2 and p.numel() > 64: + scale = p.abs().max() / 31.5 + if scale > 1e-8: + quantized = (p / scale).round() * scale + penalty = penalty + (p - quantized).pow(2).mean() + n += 1 + if n > 0: + wp = (progress - start) / (1.0 - start) + return 0.01 * wp * penalty / n + return torch.tensor(0.0, device=device) + + +# =================================================================== +# N-GRAM EVAL CACHE +# =================================================================== + +class NgramCache: + """Fast n-gram cache using numpy arrays. No Python loops at build or eval time.""" + + def __init__(self, vocab_size: int, max_order: int = 4): + self.vocab_size = vocab_size + self.max_order = max_order + # For each order: dict mapping context_hash -> numpy array of shape [vocab_size] + self.tables = [{} for _ in range(max_order)] + + def build_from_shards(self, data_path: str, max_tokens: int = 5_000_000): + """Build n-gram tables using fast numpy operations.""" + import time as _t + t0 = _t.time() + pattern = os.path.join(data_path, "fineweb_train_*.bin") + files = sorted(glob.glob(pattern)) + if not files: + print(" WARNING: No shards found for n-gram cache") + return + + # Load tokens + all_tokens = [] + seen = 0 + for f in files: + header = np.fromfile(f, dtype="= max_tokens: + break + tokens = np.concatenate(all_tokens)[:max_tokens].astype(np.int64) + V = self.vocab_size + + for order in range(1, self.max_order + 1): + if len(tokens) <= order: + continue + # Hash contexts: hash = sum(tokens[i+j] * 1024^(order-1-j)) for j in 0..order-1 + ctx_hash = np.zeros(len(tokens) - order, dtype=np.int64) + for j in range(order): + ctx_hash += tokens[j:len(tokens) - order + j] * (1024 ** (order - 1 - j)) + next_toks = tokens[order:] + + # Combine context hash and next token into a single key for fast counting + combined = ctx_hash * V + next_toks + unique_combined, combined_counts = np.unique(combined, return_counts=True) + + # Unpack back to context hash and token + ctx_keys = unique_combined // V + tok_ids = unique_combined % V + + # Vectorized groupby: sort by ctx_keys, then split at boundaries + sort_idx = np.argsort(ctx_keys) + sorted_ctx = ctx_keys[sort_idx] + sorted_tok = tok_ids[sort_idx] + sorted_cnt = combined_counts[sort_idx] + + # Find boundaries where context changes + boundaries = np.where(np.diff(sorted_ctx) != 0)[0] + 1 + boundaries = np.concatenate([[0], boundaries, [len(sorted_ctx)]]) + + table = {} + for i in range(len(boundaries) - 1): + if len(table) >= 200_000: # memory cap + break + start, end = boundaries[i], boundaries[i + 1] + total = int(sorted_cnt[start:end].sum()) + if total >= 3: + probs = np.zeros(V, dtype=np.float32) + probs[sorted_tok[start:end]] = sorted_cnt[start:end].astype(np.float32) / total + table[int(sorted_ctx[start])] = probs + self.tables[order - 1] = table + + sizes = [len(t) for t in self.tables] + print(f" N-gram cache built: {len(tokens) / 1e6:.1f}M tokens, " + f"contexts: {sizes}, time: {_t.time() - t0:.1f}s") + + def _build_gpu_tables(self, target_device): + """Convert numpy dict tables to sorted GPU tensors for torch.searchsorted.""" + if hasattr(self, '_gpu_tables'): + return + self._gpu_tables = [] + for order_idx in range(self.max_order): + table = self.tables[order_idx] + if not table: + self._gpu_tables.append(None) + continue + keys = np.array(sorted(table.keys()), dtype=np.int64) + probs = np.stack([table[k] for k in keys], axis=0) + self._gpu_tables.append({ + 'keys': torch.from_numpy(keys).to(target_device), + 'probs': torch.from_numpy(probs).to(target_device), + }) + sizes = [g['keys'].shape[0] if g else 0 for g in self._gpu_tables] + print(f" N-gram GPU tables: {sizes}") + + def blend_logits(self, model_logits: Tensor, token_ids: Tensor, + alpha: float = 0.3) -> Tensor: + """Blend model logits with n-gram cache. Fully vectorized — no Python loops. + + For each n-gram order (highest first): + 1. Vectorized hash of all (B, T) context positions + 2. GPU binary search via torch.searchsorted + 3. Entropy-adaptive alpha per position + 4. Broadcast blend hit positions + """ + B, T, V = model_logits.shape + self._build_gpu_tables(model_logits.device) + model_probs = F.softmax(model_logits, dim=-1) + tids = token_ids.long() + + for order in range(self.max_order, 0, -1): + gpu_table = self._gpu_tables[order - 1] + if gpu_table is None or order > T: + continue + valid_T = T - order + if valid_T <= 0: + continue + + # Vectorized context hash: h = sum(tok[j] * 1024^(order-1-j)) + h = torch.zeros(B, valid_T, dtype=torch.int64, device=tids.device) + for j in range(order): + h = h * 1024 + tids[:, j:j + valid_T] + + # GPU binary search across all positions at once + h_flat = h.reshape(-1) + keys = gpu_table['keys'] + idx = torch.searchsorted(keys, h_flat).clamp(max=keys.shape[0] - 1) + hit = (keys[idx] == h_flat) + + if not hit.any(): + continue + + # Gather n-gram probs + entropy-adaptive alpha + ngram_p = gpu_table['probs'][idx] + pos_probs = model_probs[:, order:, :].reshape(-1, V) + H = -(pos_probs * pos_probs.clamp_min(1e-10).log()).sum(dim=-1) + a = (alpha * torch.sigmoid(0.5 * (H - 4.0)) * hit.float()).unsqueeze(-1) + + blended = (1 - a) * pos_probs + a * ngram_p + model_probs[:, order:, :] = blended.reshape(B, valid_T, V) + + return model_probs.clamp_min(1e-10).log() + + +# =================================================================== +# LR SCHEDULE +# =================================================================== + +def get_lr(progress: float) -> float: + warmup = 0.20 # 20% warmup — LORENTZ-1 finding: 36% improvement over 10% + if progress < warmup: + return progress / warmup + else: + # Cosine decay from 1.0 to 0.0 + import math + decay_progress = (progress - warmup) / (1.0 - warmup) + return 0.5 * (1.0 + math.cos(math.pi * decay_progress)) + + +# =================================================================== +# QUANTIZATION (from upstream — verbatim) +# =================================================================== + +import io +import zlib + +def quantize_state_dict_int8(state_dict): + """Per-row int8 quantization with percentile clipping (upstream pattern).""" + CLIP_Q = 0.9999984 + quantized, scales, dtypes, passthrough = {}, {}, {}, {} + passthrough_orig_dtypes = {} + stats = {"baseline_tensor_bytes": 0, "int8_payload_bytes": 0} + for name, tensor in state_dict.items(): + t = tensor.detach().cpu().contiguous() + stats["baseline_tensor_bytes"] += t.numel() * t.element_size() + if not t.is_floating_point(): + passthrough[name] = t + stats["int8_payload_bytes"] += t.numel() * t.element_size() + continue + if t.numel() <= 65536: + # Small tensors kept as fp16 + if t.dtype in {torch.float32, torch.bfloat16}: + passthrough_orig_dtypes[name] = str(t.dtype).removeprefix("torch.") + t = t.to(torch.float16) + passthrough[name] = t.contiguous() + stats["int8_payload_bytes"] += t.numel() * t.element_size() + continue + t32 = t.float() + dtypes[name] = str(t.dtype).removeprefix("torch.") + if t32.ndim == 2: + clip_abs = torch.quantile(t32.abs(), CLIP_Q, dim=1) if t32.numel() else torch.empty(t32.shape[0]) + clipped = torch.clamp(t32, -clip_abs[:, None], clip_abs[:, None]) + s = (clip_abs / 127.0).clamp_min(1.0 / 127.0) + q = torch.clamp(torch.round(clipped / s[:, None]), -127, 127).to(torch.int8) + scales[name] = s.to(torch.float16) + else: + clip_abs = float(torch.quantile(t32.abs().flatten(), CLIP_Q).item()) if t32.numel() else 0.0 + s = torch.tensor(clip_abs / 127.0 if clip_abs > 0 else 1.0, dtype=torch.float32) + q = torch.clamp(torch.round(torch.clamp(t32, -clip_abs, clip_abs) / s), -127, 127).to(torch.int8) + scales[name] = s + quantized[name] = q.contiguous() + stats["int8_payload_bytes"] += q.numel() + scales[name].numel() * scales[name].element_size() + obj = {"quantized": quantized, "scales": scales, "dtypes": dtypes, + "passthrough": passthrough, "passthrough_orig_dtypes": passthrough_orig_dtypes} + return obj, stats + + +def dequantize_state_dict_int8(obj): + """Reverse of quantize_state_dict_int8 (upstream pattern).""" + out = {} + passthrough_orig_dtypes = obj.get("passthrough_orig_dtypes", {}) + for name, q in obj["quantized"].items(): + dtype = getattr(torch, obj["dtypes"][name]) + s = obj["scales"][name] + if s.ndim > 0: + out[name] = (q.float() * s.float().view(q.shape[0], *([1] * (q.ndim - 1)))).to(dtype).contiguous() + else: + out[name] = (q.float() * float(s.item())).to(dtype).contiguous() + for name, t in obj["passthrough"].items(): + out_t = t.detach().cpu().contiguous() + orig = passthrough_orig_dtypes.get(name) + if isinstance(orig, str): + out_t = out_t.to(dtype=getattr(torch, orig)).contiguous() + out[name] = out_t + return out + + +# =================================================================== +# MAIN +# =================================================================== + +def main(): + t_start = time.time() + seed = int(os.environ.get("SEED", 1337)) + import random + random.seed(seed) + np.random.seed(seed) + torch.manual_seed(seed) + torch.cuda.manual_seed_all(seed) + if master_process: + print(f"Seed: {seed}") + torch.set_float32_matmul_precision("high") + + tokenizer = Tokenizer.from_directory() + vocab_size = tokenizer.get_vocab_size() + if master_process: + print(f"Vocab: {vocab_size:,}") + + # Configurable via env vars. Auto-scales TOTAL_BATCH for multi-GPU if not set. + # 1xGPU: BATCH=16, TB=16384. 4xGPU: BATCH=8, TB=32768. 8xGPU: BATCH=8, TB=65536. + BATCH = int(os.environ.get("BATCH_SIZE", "8" if world_size > 1 else "16")) if device_type == "cuda" else 8 + _min_tb = BATCH * MAX_SEQ_LEN * world_size # minimum for grad_accum=1 + TOTAL_BATCH = int(os.environ.get("TOTAL_BATCH", str(max(_min_tb, 2 ** 14)))) + + config = WhirlpoolV42Config(sequence_len=MAX_SEQ_LEN, vocab_size=vocab_size) + + if master_process: + print(f"\nWhirlpool v4.8 [{'SMOKE' if SMOKE_TEST else 'FULL'} {TIME_BUDGET}s]") + print(f" {config.d_model}-dim, 1 shared block x {config.n_orbits} orbits") + print(f" {config.n_heads} Lorentz heads (ALL geometry), GQA {config.n_kv_heads}KV") + print(f" MLP: {config.mlp_ratio}x, LeakyReLU(0.5)²") + print(f" Curvature per orbit: {config.curvature_min}→{config.curvature_max}") + print(f" Centrifugal λ={config.centrifugal_lambda}") + print(f" Orbit embed scale: {config.orbit_embed_scale}") + print(f" NEW: No arcosh (raw Minkowski inner product)") + print(f" NEW: {config.n_orbits} total orbits, active={list(config.active_orbits)}, d_model={config.d_model}") + + curvatures = [get_orbit_curvature(i, config.n_orbits, config.curvature_min, config.curvature_max) + for i in range(config.n_orbits)] + print(f" Curvatures: [{', '.join(f'{c:.2f}' for c in curvatures)}]") + + with torch.device("meta"): + model = WhirlpoolV42(config) + model.to_empty(device=device) + model.init_weights() + + n_params = model.num_params() + effective = n_params # shared block counted once + local_p = sum(p.numel() for p in model.local_block.parameters()) + trans_p = sum(p.numel() for p in model.transition_block.parameters()) + hier_p = sum(p.numel() for p in model.hierarchy_block.parameters()) + # Effective: each block reused for its assigned orbits + effective_with_orbits = (n_params - local_p - trans_p - hier_p + + local_p * len(config.local_orbits) + + trans_p * len(config.transition_orbits) + + hier_p * len(config.hierarchy_orbits)) + + print(f" Stored params: {n_params / 1e6:.1f}M") + print(f" Effective params: {effective_with_orbits / 1e6:.1f}M (block x {config.n_orbits} orbits)") + print(f" Est int6 size: {n_params * 6 / 8 / 1e6:.1f}MB") + print(f" 16MB headroom: {16.0 - n_params * 6 / 8 / 1e6:.1f}MB") + + optimizer = setup_optimizer(model) + + # H6: EMA weight averaging (τ=0.997) — proven free improvement from v16 + EMA_DECAY = 0.997 + ema_shadow = {name: p.data.clone() for name, p in model.named_parameters() if p.requires_grad} + + raw_model = model + + if master_process: + print("Optimizer: MuonAdamW") + print(f"EMA: τ={EMA_DECAY}, tracking {len(ema_shadow)} params") + + # torch.compile BEFORE DDP — backward has @torch.compiler.disable to prevent + # tracing into Triton kernels. Forward is opaque via custom_op. + _pt_version = tuple(int(x) for x in torch.__version__.split('+')[0].split('.')[:2]) + if _pt_version < (2, 9): + if master_process: + print(f"WARNING: PyTorch {torch.__version__} < 2.9 — torch.compile DISABLED (NaN risk)") + elif device_type == "cuda": + model = torch.compile(model) + if master_process: + print(f"torch.compile: ENABLED (PyTorch {torch.__version__}, flash backward compiler-disabled)") + else: + if master_process: + print("torch.compile: DISABLED (non-CUDA device)") + + # DDP wrap AFTER compile + if distributed: + model = DDP(model, device_ids=[local_rank], broadcast_buffers=False) + if master_process: + print(f"DDP: {world_size} GPUs") + + if master_process: + print(f"Fused kernels: leaky_relu²={'ON' if HAS_FUSED_LEAKY else 'OFF'} " + f"int6_triton={'ON' if HAS_INT6_TRITON else 'OFF'} " + f"online_ngram={'ON' if HAS_ONLINE_NGRAM else 'OFF'}") + + tpf = BATCH * MAX_SEQ_LEN + # Scale grad_accum by world_size for DDP + assert TOTAL_BATCH % (tpf * world_size) == 0, \ + f"TOTAL_BATCH={TOTAL_BATCH} not divisible by BATCH*SEQ*WORLD={tpf}*{world_size}={tpf*world_size}" + grad_accum = TOTAL_BATCH // (tpf * world_size) + train_loader = make_dataloader(tokenizer, BATCH, MAX_SEQ_LEN, "train") + x, y, epoch = next(train_loader) + + if device_type == "cuda": + autocast_ctx = torch.amp.autocast(device_type="cuda", dtype=torch.bfloat16) + else: + autocast_ctx = contextlib.nullcontext() + + print(f"Budget: {TIME_BUDGET}s | Batch: {BATCH} | Accum: {grad_accum}") + print() + + # --------------------------------------------------------------- + # Training + # --------------------------------------------------------------- + + def sync(): + if device_type == "cuda": + torch.cuda.synchronize() + + # Compiler warmup: 20 steps to prime torch.compile, then restore state. + # Saves 5-10s of compilation from the actual training budget. + if device_type == "cuda" and distributed: + if master_process: + print("Compiler warmup (20 steps)...", flush=True) + _saved_weights = {n: p.data.clone() for n, p in raw_model.named_parameters()} + _saved_opt = optimizer.state_dict() + for _wu in range(20): + with autocast_ctx: + _wl = model(x, y) + (_wl / grad_accum).backward() + if _wu % grad_accum == grad_accum - 1: + optimizer.step() + optimizer.zero_grad(set_to_none=True) + x, y, epoch = next(train_loader) + optimizer.zero_grad(set_to_none=True) + # Restore original weights and optimizer + with torch.no_grad(): + for n, p in raw_model.named_parameters(): + p.data.copy_(_saved_weights[n]) + optimizer.load_state_dict(_saved_opt) + del _saved_weights, _saved_opt + sync() + if master_process: + print("Compiler warmup done", flush=True) + + t_train = time.time() + smooth, wall, step = 0, 0, 0 + + while True: + sync() + t0 = time.time() + + for micro in range(grad_accum): + # Only sync gradients on last micro-step (saves NCCL bandwidth) + if distributed: + model.require_backward_grad_sync = (micro == grad_accum - 1) + with autocast_ctx: + loss = model(x, y) + + progress = min(wall / TIME_BUDGET, 1.0) + cq = crown_q_penalty(raw_model, progress, config.crown_q_start) + total_loss = loss + cq if isinstance(cq, Tensor) and cq.requires_grad else loss + + (total_loss / grad_accum).backward() + x, y, epoch = next(train_loader) + + progress = min(wall / TIME_BUDGET, 1.0) + lr = get_lr(progress) + + for g in optimizer.param_groups: + if "initial_lr" not in g: + g["initial_lr"] = g["lr"] + g["lr"] = g["initial_lr"] * lr + # Muon momentum warmup: 0.85 → 0.95 over 500 steps (upstream pattern) + if g.get("kind") == "muon": + g["momentum"] = min(0.95, 0.85 + 0.1 * step / 500) + + torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0, error_if_nonfinite=True) + optimizer.step() + optimizer.zero_grad(set_to_none=True) + + # H6: EMA update (use raw_model to avoid DDP wrapper) + with torch.no_grad(): + for name, p in raw_model.named_parameters(): + if name in ema_shadow: + ema_shadow[name].lerp_(p.data, 1.0 - EMA_DECAY) + + tlf = loss.detach().item() + if tlf > 100: + print(f"\nFAIL step {step} loss={tlf:.1f}") + sys.exit(1) + + sync() + dt = time.time() - t0 + if step > 10: + wall += dt + + b = 0.9 + smooth = b * smooth + (1 - b) * tlf + debiased = smooth / (1 - b ** (step + 1)) + rem = max(0, TIME_BUDGET - wall) + + if master_process: + print(f"\rstep {step:04d} ({100 * progress:.0f}%) loss:{debiased:.4f} " + f"lr:{lr:.2f} dt:{dt * 1000:.0f}ms tok/s:{TOTAL_BATCH / dt:.0f} " + f"rem:{rem:.0f}s ", end="", flush=True) + + # Diagnostics every 100 steps + if step % 100 == 0 and step > 0 and master_process: + for name, blk in [("local", raw_model.local_block), ("transition", raw_model.transition_block), ("hierarchy", raw_model.hierarchy_block)]: + ascale = blk.attn_scale.data.cpu().tolist() + mscale = blk.mlp_scale.data.cpu().tolist() + print(f"\n {name}: attn=[{', '.join(f'{s:.3f}' for s in ascale)}] mlp=[{', '.join(f'{s:.3f}' for s in mscale)}]") + + if step == 0: + gc.collect() + gc.freeze() + gc.disable() + step += 1 + + # Synchronized exit: rank 0 decides, broadcasts to all ranks. + # This ensures all ranks exit the loop on the same step, + # preventing DDP forward deadlocks from mismatched iteration counts. + should_stop = step > 10 and wall >= TIME_BUDGET + if distributed: + stop_tensor = torch.tensor(int(should_stop), device=device) + dist.all_reduce(stop_tensor, op=dist.ReduceOp.MAX) + should_stop = bool(stop_tensor.item()) + if should_stop: + break + + torch.cuda.synchronize() + + if master_process: + print(flush=True) + total_tok = step * TOTAL_BATCH + + # =================================================================== + # POST-TRAINING: Parallel eval across GPUs. + # Each rank runs a different eval strategy independently (no DDP/NCCL). + # Rank 0 does EMA + quantize + save, writes artifact to disk. + # All ranks load artifact, then each runs its assigned eval. + # Results written to files, rank 0 reads all and reports best. + # =================================================================== + import copy + + def log(msg): + """Log to stdout (rank 0 only) and post_training.log file.""" + if master_process: + print(f"\n[POST] {msg}", flush=True) + with open(f"post_training_r{rank}.log", "a") as _f: + _f.write(f"[{time.strftime('%H:%M:%S')}] [R{rank}] {msg}\n") + + EVAL_BATCH = 128 + EVAL_WALL = int(os.environ.get("EVAL_WALL", "600")) + + log(f"training done: {step} steps, {total_tok/1e6:.1f}M tokens") + + # ---- EMA + quantize + save (rank 0 only, others wait) ---- + if master_process: + for name, p in raw_model.named_parameters(): + if name in ema_shadow: + p.data.copy_(ema_shadow[name]) + log(f"EMA: swapped {len(ema_shadow)} params") + + torch.save(raw_model.state_dict(), "final_model.pt") + code_bytes = os.path.getsize(os.path.abspath(__file__)) + log(f"checkpoint saved") + + quant_obj, quant_stats = quantize_state_dict_int8(raw_model.state_dict()) + quant_buf = io.BytesIO() + torch.save(quant_obj, quant_buf) + quant_blob = zlib.compress(quant_buf.getvalue(), level=9) + with open("final_model.int8.ptz", "wb") as f: + f.write(quant_blob) + quant_file_bytes = os.path.getsize("final_model.int8.ptz") + log(f"artifact: {quant_file_bytes} bytes, headroom: {16_000_000 - quant_file_bytes - code_bytes}") + + # Signal artifact is ready + with open("/workspace/.artifact_ready", "w") as f: + f.write("ready") + + # Non-master ranks: wait for artifact (timeout 120s) + if distributed and not master_process: + _t_wait = time.perf_counter() + while not os.path.exists("/workspace/.artifact_ready"): + if time.perf_counter() - _t_wait > 120: + log("TIMEOUT waiting for artifact — exiting") + return + time.sleep(0.5) + + # ---- All ranks: load dequantized model ---- + with open("final_model.int8.ptz", "rb") as f: + quant_blob_disk = f.read() + quant_state = torch.load(io.BytesIO(zlib.decompress(quant_blob_disk)), map_location="cpu", weights_only=False) + raw_model.load_state_dict(dequantize_state_dict_int8(quant_state), strict=True) + log("model reloaded from artifact") + + # ---- Parallel eval: each rank runs a different strategy ---- + # Assign eval strategies to ranks (cycle if more ranks than strategies) + # Ablation results (60s smoke, 2xH100 India DC): + # base_int8: 2.4138 (22s) — official roundtrip + # ttt lr5e-4 2s: 2.0571 (248s) — BEST (-0.357) + # ttt lr1e-3 2s: 2.0611 (249s) — close 2nd + # ttt lr1e-3 1s: 2.0819 (142s) — fast + # ttt lr5e-4 1s: 2.1142 (140s) — proven + # ttt curv-adpt: 2.2057 (142s) — DROP (worse than flat) + # ttt lr2e-4 1s: 2.2155 (140s) — DROP (weak) + # sliding w64: ~2.43 (slow) — DROP (hurts) + eval_strategies = [ + ("base_int8", lambda m: evaluate_bpb(m, tokenizer, EVAL_BATCH, eval_wall=EVAL_WALL)), + ("ttt_lr5e-4_2s", lambda m: eval_ttt_scorefirst(copy.deepcopy(m), 16, lr=0.0005, ttt_steps=2, eval_wall=EVAL_WALL)), + ("ttt_lr1e-3_2s", lambda m: eval_ttt_scorefirst(copy.deepcopy(m), 16, lr=0.001, ttt_steps=2, eval_wall=EVAL_WALL)), + ("ttt_lr1e-3_1s", lambda m: eval_ttt_scorefirst(copy.deepcopy(m), 16, lr=0.001, ttt_steps=1, eval_wall=EVAL_WALL)), + ("ttt_lr5e-4_1s", lambda m: eval_ttt_scorefirst(copy.deepcopy(m), 16, lr=0.0005, ttt_steps=1, eval_wall=EVAL_WALL)), + ("ttt_lr7e-4_2s", lambda m: eval_ttt_scorefirst(copy.deepcopy(m), 16, lr=0.0007, ttt_steps=2, eval_wall=EVAL_WALL)), + ("ttt_lr5e-4_3s", lambda m: eval_ttt_scorefirst(copy.deepcopy(m), 16, lr=0.0005, ttt_steps=3, eval_wall=EVAL_WALL)), + ("ngram_blend", lambda m: eval_ngram_blend(m, batch_size=EVAL_BATCH, alpha=0.3, eval_wall=EVAL_WALL)), + ] + my_strategy = eval_strategies[rank % len(eval_strategies)] + strategy_name, strategy_fn = my_strategy + + log(f"running eval: {strategy_name}") + torch.cuda.synchronize() + t_eval = time.perf_counter() + + # Compile for inference (each rank compiles independently) + eval_model = torch.compile(raw_model) if strategy_name.startswith("base") else raw_model + if strategy_name.startswith("base"): + eval_model.eval() + with torch.inference_mode(), torch.autocast(device_type="cuda", dtype=torch.bfloat16): + _ = eval_model(torch.randint(0, 1024, (1, MAX_SEQ_LEN), device=device)) + torch.cuda.synchronize() + log("compile warmup done") + + my_bpb = strategy_fn(eval_model if strategy_name.startswith("base") else raw_model) + eval_time = time.perf_counter() - t_eval + log(f"{strategy_name} val_bpb:{my_bpb:.4f} time:{eval_time:.0f}s") + + # Write result to file + result_file = f"/workspace/.eval_result_r{rank}" + with open(result_file, "w") as f: + f.write(f"{strategy_name}\t{my_bpb:.8f}\t{eval_time:.1f}\n") + + # Signal this rank is done + with open(f"/workspace/.eval_done_r{rank}", "w") as f: + f.write("done") + + # ---- Rank 0: collect results and report best ---- + if master_process: + # Wait for all ranks to finish + for r in range(world_size): + while not os.path.exists(f"/workspace/.eval_done_r{r}"): + time.sleep(2) + + # Read all results + all_results = [] + for r in range(world_size): + rf = f"/workspace/.eval_result_r{r}" + if os.path.exists(rf): + parts = open(rf).read().strip().split("\t") + name, bpb, t = parts[0], float(parts[1]), float(parts[2]) + all_results.append((name, bpb, t)) + log(f" rank {r}: {name} = {bpb:.4f} ({t:.0f}s)") + + best_name, best_bpb, _ = min(all_results, key=lambda x: x[1]) + print(f"final_int8_zlib_roundtrip val_bpb:{best_bpb:.4f}", flush=True) + print(f"final_int8_zlib_roundtrip_exact val_bpb:{best_bpb:.8f}", flush=True) + print(f"val_bpb:{best_bpb:.4f}", flush=True) + log(f"BEST: {best_name} val_bpb:{best_bpb:.4f}") + log(f"DONE — clean exit") + + # Clean up signal files + for r in range(world_size): + for f in [f"/workspace/.eval_result_r{r}", f"/workspace/.eval_done_r{r}"]: + if os.path.exists(f): os.remove(f) + if os.path.exists("/workspace/.artifact_ready"): + os.remove("/workspace/.artifact_ready") + else: + # Non-master: wait for rank 0 to finish collecting (timeout 60s) + _t_wait = time.perf_counter() + while os.path.exists(f"/workspace/.eval_done_r{rank}"): + if time.perf_counter() - _t_wait > 60: + break + time.sleep(2) + + +def eval_sliding_window(model, batch_size, seq_len=MAX_SEQ_LEN, stride=64, eval_wall=600): + """Sliding window eval — overlapping windows improve BPB by ~0.02-0.05.""" + _dev = next(model.parameters()).device + sp = spm.SentencePieceProcessor(model_file=TOKENIZER_PATH) + vs = int(sp.vocab_size()) + bb, hs, ib = build_sentencepiece_luts(sp, vs, _dev) + vt = torch.cat([load_data_shard(Path(p)) for p in sorted(glob.glob(VAL_FILES))]).contiguous() + total_tokens = vt.numel() - 1 + + window_starts = [ws for ws in range(0, total_tokens, stride) + if min(ws + seq_len, total_tokens) - ws >= 1] + + loss_sum, tok_count, byte_count = 0.0, 0.0, 0.0 + t0 = time.perf_counter() + + model.eval() + with torch.inference_mode(): + for bi in range(0, len(window_starts), batch_size): + if time.perf_counter() - t0 > eval_wall - 5: + print(f"[EVAL-SW] wall hit at {time.perf_counter()-t0:.0f}s", flush=True) + break + batch_ws = window_starts[bi:bi+batch_size] + bsz = len(batch_ws) + x_batch = torch.zeros(bsz, seq_len, dtype=torch.int64, device=_dev) + y_batch = torch.zeros(bsz, seq_len, dtype=torch.int64, device=_dev) + wlens = [] + for i, ws in enumerate(batch_ws): + end = min(ws + seq_len, total_tokens) + wlen = end - ws + wlens.append(wlen) + chunk = vt[ws:end+1].to(dtype=torch.int64, device=_dev) + x_batch[i, :wlen] = chunk[:-1] + y_batch[i, :wlen] = chunk[1:] + + with torch.autocast(device_type="cuda", dtype=torch.bfloat16, enabled=True): + logits = model(x_batch).float() + nll = F.cross_entropy(logits.view(-1, logits.size(-1)), y_batch.view(-1), + reduction='none').view(bsz, seq_len) + for i, ws in enumerate(batch_ws): + wlen = wlens[i] + s = 0 if ws == 0 else max(wlen - stride, 0) + loss_sum += nll[i, s:wlen].to(torch.float64).sum().item() + tok_count += float(wlen - s) + tgt = y_batch[i, s:wlen] + prev = x_batch[i, s:wlen] + tb = bb[tgt].to(torch.float64) + tb += (hs[tgt] & ~ib[prev]).to(torch.float64) + byte_count += tb.sum().item() + + return (loss_sum / tok_count / math.log(2.0)) * (tok_count / byte_count) + + +def eval_ttt_scorefirst(model, batch_size, seq_len=MAX_SEQ_LEN, + lr=0.0005, ttt_steps=1, eval_wall=600, + curvature_adaptive=False): + """LEGAL score-first TTT: score chunk FIRST, THEN train on scored tokens. + Adapts model to eval distribution — ~0.05-0.1 BPB improvement. + curvature_adaptive: per-block LR scaling (high curvature blocks get lower LR).""" + _dev = next(model.parameters()).device + sp = spm.SentencePieceProcessor(model_file=TOKENIZER_PATH) + vs = int(sp.vocab_size()) + bb, hs, ib = build_sentencepiece_luts(sp, vs, _dev) + vt = torch.cat([load_data_shard(Path(p)) for p in sorted(glob.glob(VAL_FILES))]).contiguous() + usable = ((vt.numel() - 1) // seq_len) * seq_len + vt = vt[:usable + 1] + total_seqs = (vt.numel() - 1) // seq_len + + if curvature_adaptive and hasattr(model, 'local_block'): + param_groups = [ + {"params": [p for p in model.local_block.parameters() if p.requires_grad], "lr": lr * 2.0}, + {"params": [p for p in model.transition_block.parameters() if p.requires_grad], "lr": lr * 1.0}, + {"params": [p for p in model.hierarchy_block.parameters() if p.requires_grad], "lr": lr * 0.3}, + {"params": list(model.tok_emb.parameters()), "lr": lr * 0.5}, + ] + param_groups = [g for g in param_groups if g["params"]] + else: + param_groups = [{"params": [p for p in model.parameters() if p.requires_grad], "lr": lr}] + optimizer = torch.optim.AdamW(param_groups, weight_decay=0.0) + loss_sum, tok_count, byte_count = 0.0, 0.0, 0.0 + t0 = time.perf_counter() + + for bs in range(0, total_seqs, batch_size): + if time.perf_counter() - t0 > eval_wall - 5: + print(f"[EVAL-TTT] wall hit at {time.perf_counter()-t0:.0f}s", flush=True) + break + be = min(bs + batch_size, total_seqs) + loc = vt[bs*seq_len:be*seq_len+1].to(dtype=torch.int64, device=_dev) + n = be - bs + x = loc[:-1].reshape(n, seq_len) + y = loc[1:].reshape(n, seq_len) + + # STEP 1: SCORE (grades FINAL) + # Use no_grad (not inference_mode) — inference_mode tensors can't be + # used in step 2's backward pass (rotary cache etc.) + model.eval() + with torch.no_grad(): + with torch.autocast(device_type="cuda", dtype=torch.bfloat16): + logits = model(x).float() + nll = F.cross_entropy(logits.view(-1, logits.size(-1)), y.view(-1), + reduction='none').view(n, seq_len) + loss_sum += nll.sum().item() + tok_count += float(y.numel()) + prev = x.reshape(-1) + tgt = y.reshape(-1) + tb = bb[tgt].to(torch.float64) + tb += (hs[tgt] & ~ib[prev]).to(torch.float64) + byte_count += tb.sum().item() + + # STEP 2: TRAIN on scored tokens (adapts to eval distribution) + model.train() + for _ in range(ttt_steps): + optimizer.zero_grad() + with torch.autocast(device_type="cuda", dtype=torch.bfloat16): + ttt_loss = model(x, y) + ttt_loss.backward() + torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0, error_if_nonfinite=True) + optimizer.step() + + model.eval() + return (loss_sum / tok_count / math.log(2.0)) * (tok_count / byte_count) + + +def eval_ngram_blend(model, batch_size=128, alpha=0.3, eval_wall=600): + """Base eval with n-gram cache blending. Builds cache from train shards, + then blends n-gram probs with model logits during eval.""" + _dev = next(model.parameters()).device + sp = spm.SentencePieceProcessor(model_file=TOKENIZER_PATH) + vs = int(sp.vocab_size()) + bb, hs, ib = build_sentencepiece_luts(sp, vs, _dev) + + # Build n-gram cache from training data + cache = NgramCache(vocab_size=vs, max_order=4) + cache.build_from_shards(DATA_PATH, max_tokens=5_000_000) + + vt = torch.cat([load_data_shard(Path(p)) for p in sorted(glob.glob(VAL_FILES))]).contiguous() + usable = ((vt.numel() - 1) // MAX_SEQ_LEN) * MAX_SEQ_LEN + vt = vt[:usable + 1] + total_seqs = (vt.numel() - 1) // MAX_SEQ_LEN + + loss_sum, tok_count, byte_count = 0.0, 0.0, 0.0 + t0 = time.perf_counter() + model.eval() + with torch.inference_mode(): + for bs in range(0, total_seqs, batch_size): + if time.perf_counter() - t0 > eval_wall - 5: + break + be = min(bs + batch_size, total_seqs) + loc = vt[bs*MAX_SEQ_LEN:be*MAX_SEQ_LEN+1].to(device=_dev, dtype=torch.int64) + x = loc[:-1].reshape(be - bs, MAX_SEQ_LEN) + y = loc[1:].reshape(be - bs, MAX_SEQ_LEN) + with torch.autocast(device_type="cuda", dtype=torch.bfloat16, enabled=True): + logits = model(x).float() + # Blend with n-gram cache → log-probs + blended_lp = cache.blend_logits(logits, x, alpha=alpha) + nll = F.nll_loss(blended_lp.view(-1, blended_lp.size(-1)), y.view(-1), + reduction='none').view(be - bs, MAX_SEQ_LEN) + bt = float(y.numel()) + loss_sum += nll.sum().item() + tok_count += bt + tb = bb[y.reshape(-1)].to(torch.int16) + tb += (hs[y.reshape(-1)] & ~ib[x.reshape(-1)]).to(torch.int16) + byte_count += tb.to(torch.float64).sum().item() + + return (loss_sum / tok_count / math.log(2.0)) * (tok_count / byte_count) + + +if __name__ == "__main__": + main() diff --git a/records/track_non_record_16mb/2026-04-01_WhirlpoolV5b_LorentzianTransformer/train_seed314.log b/records/track_non_record_16mb/2026-04-01_WhirlpoolV5b_LorentzianTransformer/train_seed314.log new file mode 100644 index 0000000000..228732acb0 --- /dev/null +++ b/records/track_non_record_16mb/2026-04-01_WhirlpoolV5b_LorentzianTransformer/train_seed314.log @@ -0,0 +1,1082 @@ +W0401 22:38:27.771000 1402 torch/distributed/run.py:803] +W0401 22:38:27.771000 1402 torch/distributed/run.py:803] ***************************************** +W0401 22:38:27.771000 1402 torch/distributed/run.py:803] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. +W0401 22:38:27.771000 1402 torch/distributed/run.py:803] ***************************************** +Device: CUDA (NVIDIA H100 80GB HBM3) [rank 0/8] +Seed: 314 +Vocab: 1,024 + +Whirlpool v4.8 [FULL 600s] + 768-dim, 1 shared block x 8 orbits + 12 Lorentz heads (ALL geometry), GQA 2KV + MLP: 5x, LeakyReLU(0.5)² + Curvature per orbit: 0.1→2.0 + Centrifugal λ=0.005 + Orbit embed scale: 0.1 + NEW: No arcosh (raw Minkowski inner product) + NEW: 8 total orbits, active=[0, 1, 2, 3, 4, 5, 6, 7], d_model=768 + Curvatures: [0.10, 0.15, 0.24, 0.36, 0.55, 0.85, 1.30, 2.00] + Stored params: 23.9M + Effective params: 60.2M (block x 8 orbits) + Est int6 size: 17.9MB + 16MB headroom: -1.9MB + Stored params: 23.9M + Effective params: 60.2M (block x 8 orbits) + Est int6 size: 17.9MB + 16MB headroom: -1.9MB + Stored params: 23.9M + Effective params: 60.2M (block x 8 orbits) + Est int6 size: 17.9MB + 16MB headroom: -1.9MB + Stored params: 23.9M + Effective params: 60.2M (block x 8 orbits) + Est int6 size: 17.9MB + 16MB headroom: -1.9MB + Stored params: 23.9M + Effective params: 60.2M (block x 8 orbits) + Est int6 size: 17.9MB + 16MB headroom: -1.9MB + Stored params: 23.9M + Effective params: 60.2M (block x 8 orbits) + Est int6 size: 17.9MB + 16MB headroom: -1.9MB + Stored params: 23.9M + Effective params: 60.2M (block x 8 orbits) + Est int6 size: 17.9MB + 16MB headroom: -1.9MB + Stored params: 23.9M + Effective params: 60.2M (block x 8 orbits) + Est int6 size: 17.9MB + 16MB headroom: -1.9MB +Optimizer: MuonAdamW +EMA: τ=0.997, tracking 31 params +torch.compile: ENABLED (PyTorch 2.9.1+cu128, flash backward compiler-disabled) +DDP: 8 GPUs +Fused kernels: leaky_relu²=ON int6_triton=ON online_ngram=OFF +Budget: 600s | Batch: 8 | Accum: 1Budget: 600s | Batch: 8 | Accum: 1 + + + +Budget: 600s | Batch: 8 | Accum: 1Budget: 600s | Batch: 8 | Accum: 1Budget: 600s | Batch: 8 | Accum: 1Budget: 600s | Batch: 8 | Accum: 1 + + + + + + + +Budget: 600s | Batch: 8 | Accum: 1 + +Budget: 600s | Batch: 8 | Accum: 1 + +Compiler warmup (20 steps)... +Compiler warmup done + step 0000 (0%) loss:21.6566 lr:0.00 dt:129ms tok/s:509193 rem:600s step 0001 (0%) loss:21.6648 lr:0.00 dt:36ms tok/s:1839226 rem:600s step 0002 (0%) loss:21.6339 lr:0.00 dt:32ms tok/s:2065292 rem:600s step 0003 (0%) loss:21.6246 lr:0.00 dt:32ms tok/s:2036781 rem:600s step 0004 (0%) loss:21.6206 lr:0.00 dt:31ms tok/s:2083119 rem:600s step 0005 (0%) loss:21.6069 lr:0.00 dt:30ms tok/s:2163201 rem:600s step 0006 (0%) loss:21.6155 lr:0.00 dt:30ms tok/s:2182870 rem:600s step 0007 (0%) loss:21.6250 lr:0.00 dt:33ms tok/s:2016150 rem:600s step 0008 (0%) loss:21.6185 lr:0.00 dt:30ms tok/s:2175700 rem:600s step 0009 (0%) loss:21.6212 lr:0.00 dt:30ms tok/s:2171231 rem:600s step 0010 (0%) loss:21.6198 lr:0.00 dt:31ms tok/s:2085568 rem:600s step 0011 (0%) loss:21.6149 lr:0.00 dt:31ms tok/s:2139044 rem:600s step 0012 (0%) loss:21.6184 lr:0.00 dt:33ms tok/s:1977553 rem:600s step 0013 (0%) loss:21.6177 lr:0.00 dt:31ms tok/s:2123757 rem:600s step 0014 (0%) loss:21.6095 lr:0.00 dt:31ms tok/s:2121021 rem:600s step 0015 (0%) loss:21.6142 lr:0.00 dt:31ms tok/s:2110922 rem:600s step 0016 (0%) loss:21.5971 lr:0.00 dt:31ms tok/s:2088674 rem:600s step 0017 (0%) loss:21.5971 lr:0.00 dt:31ms tok/s:2086645 rem:600s step 0018 (0%) loss:21.5912 lr:0.00 dt:31ms tok/s:2085679 rem:600s step 0019 (0%) loss:21.5726 lr:0.00 dt:32ms tok/s:2070347 rem:600s step 0020 (0%) loss:21.5537 lr:0.00 dt:33ms tok/s:1964087 rem:600s step 0021 (0%) loss:21.5210 lr:0.00 dt:35ms tok/s:1884222 rem:600s step 0022 (0%) loss:21.5032 lr:0.00 dt:34ms tok/s:1940680 rem:600s step 0023 (0%) loss:21.4849 lr:0.00 dt:32ms tok/s:2022009 rem:600s step 0024 (0%) loss:21.4578 lr:0.00 dt:33ms tok/s:1970536 rem:600s step 0025 (0%) loss:21.4286 lr:0.00 dt:33ms tok/s:1996397 rem:600s step 0026 (0%) loss:21.3976 lr:0.00 dt:33ms tok/s:1989188 rem:599s step 0027 (0%) loss:21.3646 lr:0.00 dt:33ms tok/s:1986413 rem:599s step 0028 (0%) loss:21.3205 lr:0.00 dt:33ms tok/s:1978236 rem:599s step 0029 (0%) loss:21.2675 lr:0.00 dt:34ms tok/s:1946534 rem:599s step 0030 (0%) loss:21.2182 lr:0.01 dt:34ms tok/s:1921457 rem:599s step 0031 (0%) loss:21.1626 lr:0.01 dt:34ms tok/s:1931543 rem:599s step 0032 (0%) loss:21.1008 lr:0.01 dt:34ms tok/s:1935719 rem:599s step 0033 (0%) loss:21.0247 lr:0.01 dt:35ms tok/s:1852439 rem:599s step 0034 (0%) loss:20.9529 lr:0.01 dt:35ms tok/s:1869180 rem:599s step 0035 (0%) loss:20.8693 lr:0.01 dt:36ms tok/s:1829056 rem:599s step 0036 (0%) loss:20.7950 lr:0.01 dt:35ms tok/s:1847794 rem:599s step 0037 (0%) loss:20.7130 lr:0.01 dt:36ms tok/s:1830933 rem:599s step 0038 (0%) loss:20.6277 lr:0.01 dt:36ms tok/s:1815214 rem:599s step 0039 (0%) loss:20.5309 lr:0.01 dt:36ms tok/s:1818192 rem:599s step 0040 (0%) loss:20.4385 lr:0.01 dt:36ms tok/s:1827148 rem:599s step 0041 (0%) loss:20.3356 lr:0.01 dt:37ms tok/s:1783972 rem:599s step 0042 (0%) loss:20.2069 lr:0.01 dt:37ms tok/s:1760582 rem:599s step 0043 (0%) loss:20.0877 lr:0.01 dt:37ms tok/s:1762648 rem:599s step 0044 (0%) loss:19.9679 lr:0.01 dt:36ms tok/s:1816558 rem:599s step 0045 (0%) loss:19.8404 lr:0.01 dt:35ms tok/s:1847087 rem:599s step 0046 (0%) loss:19.7107 lr:0.01 dt:36ms tok/s:1829190 rem:599s step 0047 (0%) loss:19.5684 lr:0.01 dt:37ms tok/s:1748098 rem:599s step 0048 (0%) loss:19.4114 lr:0.01 dt:37ms tok/s:1773909 rem:599s step 0049 (0%) loss:19.2474 lr:0.01 dt:36ms tok/s:1800247 rem:599s step 0050 (0%) loss:19.0836 lr:0.01 dt:38ms tok/s:1710993 rem:599s step 0051 (0%) loss:18.9191 lr:0.01 dt:36ms tok/s:1837419 rem:599s step 0052 (0%) loss:18.7442 lr:0.01 dt:38ms tok/s:1733677 rem:599s step 0053 (0%) loss:18.5612 lr:0.01 dt:36ms tok/s:1800259 rem:599s step 0054 (0%) loss:18.3510 lr:0.01 dt:37ms tok/s:1767374 rem:598s step 0055 (0%) loss:18.1392 lr:0.01 dt:36ms tok/s:1805924 rem:598s step 0056 (0%) loss:17.9302 lr:0.01 dt:37ms tok/s:1787835 rem:598s step 0057 (0%) loss:17.7162 lr:0.01 dt:42ms tok/s:1570973 rem:598s step 0058 (0%) loss:17.4822 lr:0.01 dt:38ms tok/s:1715499 rem:598s step 0059 (0%) loss:17.2485 lr:0.01 dt:37ms tok/s:1751038 rem:598s step 0060 (0%) loss:17.0067 lr:0.01 dt:36ms tok/s:1824455 rem:598s step 0061 (0%) loss:16.7478 lr:0.01 dt:36ms tok/s:1799552 rem:598s step 0062 (0%) loss:16.4815 lr:0.01 dt:36ms tok/s:1808990 rem:598s step 0063 (0%) loss:16.2093 lr:0.02 dt:36ms tok/s:1819733 rem:598s step 0064 (0%) loss:15.9217 lr:0.02 dt:37ms tok/s:1767363 rem:598s step 0065 (0%) loss:15.6464 lr:0.02 dt:35ms tok/s:1882403 rem:598s step 0066 (0%) loss:15.3699 lr:0.02 dt:35ms tok/s:1851217 rem:598s step 0067 (0%) loss:15.0777 lr:0.02 dt:37ms tok/s:1775387 rem:598s step 0068 (0%) loss:14.7796 lr:0.02 dt:36ms tok/s:1819504 rem:598s step 0069 (0%) loss:14.4679 lr:0.02 dt:36ms tok/s:1841395 rem:598s step 0070 (0%) loss:14.1539 lr:0.02 dt:36ms tok/s:1809098 rem:598s step 0071 (0%) loss:13.8230 lr:0.02 dt:36ms tok/s:1823160 rem:598s step 0072 (0%) loss:13.4981 lr:0.02 dt:35ms tok/s:1892721 rem:598s step 0073 (0%) loss:13.1724 lr:0.02 dt:34ms tok/s:1905513 rem:598s step 0074 (0%) loss:12.8418 lr:0.02 dt:36ms tok/s:1799351 rem:598s step 0075 (0%) loss:12.4984 lr:0.02 dt:36ms tok/s:1844422 rem:598s step 0076 (0%) loss:12.1566 lr:0.02 dt:38ms tok/s:1727238 rem:598s step 0077 (0%) loss:11.8375 lr:0.02 dt:34ms tok/s:1913592 rem:598s step 0078 (0%) loss:11.4967 lr:0.02 dt:35ms tok/s:1863465 rem:598s step 0079 (0%) loss:11.1469 lr:0.02 dt:34ms tok/s:1900283 rem:598s step 0080 (0%) loss:10.8304 lr:0.02 dt:35ms tok/s:1887561 rem:598s step 0081 (0%) loss:10.4979 lr:0.02 dt:35ms tok/s:1861131 rem:598s step 0082 (0%) loss:10.1625 lr:0.02 dt:35ms tok/s:1883164 rem:597s step 0083 (0%) loss:9.8419 lr:0.02 dt:36ms tok/s:1814603 rem:597s step 0084 (0%) loss:9.5225 lr:0.02 dt:35ms tok/s:1873921 rem:597s step 0085 (0%) loss:9.2295 lr:0.02 dt:35ms tok/s:1863478 rem:597s step 0086 (0%) loss:8.9397 lr:0.02 dt:35ms tok/s:1862783 rem:597s step 0087 (0%) loss:8.6665 lr:0.02 dt:37ms tok/s:1757576 rem:597s step 0088 (0%) loss:8.4065 lr:0.02 dt:37ms tok/s:1783023 rem:597s step 0089 (0%) loss:8.1687 lr:0.02 dt:36ms tok/s:1824007 rem:597s step 0090 (0%) loss:7.9412 lr:0.02 dt:35ms tok/s:1852926 rem:597s step 0091 (0%) loss:7.7365 lr:0.02 dt:36ms tok/s:1817206 rem:597s step 0092 (0%) loss:7.5534 lr:0.02 dt:35ms tok/s:1869168 rem:597s step 0093 (0%) loss:7.3883 lr:0.02 dt:37ms tok/s:1785374 rem:597s step 0094 (0%) loss:7.2286 lr:0.02 dt:36ms tok/s:1818854 rem:597s step 0095 (0%) loss:7.0865 lr:0.02 dt:36ms tok/s:1824431 rem:597s step 0096 (0%) loss:6.9554 lr:0.02 dt:35ms tok/s:1857271 rem:597s step 0097 (1%) loss:6.8399 lr:0.03 dt:35ms tok/s:1866451 rem:597s step 0098 (1%) loss:6.7363 lr:0.03 dt:36ms tok/s:1842419 rem:597s step 0099 (1%) loss:6.6383 lr:0.03 dt:35ms tok/s:1854664 rem:597s step 0100 (1%) loss:6.5517 lr:0.03 dt:36ms tok/s:1841679 rem:597s + local: attn=[0.370, 0.333, 0.296] mlp=[0.374, 0.337, 0.300] + + transition: attn=[0.258, 0.221] mlp=[0.262, 0.225] + + hierarchy: attn=[0.184, 0.146, 0.109] mlp=[0.188, 0.150, 0.113] + step 0101 (1%) loss:6.4712 lr:0.03 dt:36ms tok/s:1802088 rem:597s step 0102 (1%) loss:6.3955 lr:0.03 dt:35ms tok/s:1887794 rem:597s step 0103 (1%) loss:6.3268 lr:0.03 dt:35ms tok/s:1888339 rem:597s step 0104 (1%) loss:6.2674 lr:0.03 dt:35ms tok/s:1897267 rem:597s step 0105 (1%) loss:6.2041 lr:0.03 dt:35ms tok/s:1895004 rem:597s step 0106 (1%) loss:6.1469 lr:0.03 dt:35ms tok/s:1893411 rem:597s step 0107 (1%) loss:6.0992 lr:0.03 dt:35ms tok/s:1876620 rem:597s step 0108 (1%) loss:6.0582 lr:0.03 dt:35ms tok/s:1872823 rem:597s step 0109 (1%) loss:6.0176 lr:0.03 dt:35ms tok/s:1885618 rem:597s step 0110 (1%) loss:5.9765 lr:0.03 dt:35ms tok/s:1896966 rem:596s step 0111 (1%) loss:5.9351 lr:0.03 dt:35ms tok/s:1877286 rem:596s step 0112 (1%) loss:5.9029 lr:0.03 dt:35ms tok/s:1877645 rem:596s step 0113 (1%) loss:5.8727 lr:0.03 dt:35ms tok/s:1872848 rem:596s step 0114 (1%) loss:5.8478 lr:0.03 dt:35ms tok/s:1869485 rem:596s step 0115 (1%) loss:5.8259 lr:0.03 dt:35ms tok/s:1891796 rem:596s step 0116 (1%) loss:5.8057 lr:0.03 dt:35ms tok/s:1864375 rem:596s step 0117 (1%) loss:5.7894 lr:0.03 dt:35ms tok/s:1871598 rem:596s step 0118 (1%) loss:5.7650 lr:0.03 dt:35ms tok/s:1858765 rem:596s step 0119 (1%) loss:5.7433 lr:0.03 dt:35ms tok/s:1863213 rem:596s step 0120 (1%) loss:5.7314 lr:0.03 dt:35ms tok/s:1848764 rem:596s step 0121 (1%) loss:5.7381 lr:0.03 dt:35ms tok/s:1862682 rem:596s step 0122 (1%) loss:5.7211 lr:0.03 dt:39ms tok/s:1664484 rem:596s step 0123 (1%) loss:5.7014 lr:0.03 dt:36ms tok/s:1805805 rem:596s step 0124 (1%) loss:5.6844 lr:0.03 dt:35ms tok/s:1860476 rem:596s step 0125 (1%) loss:5.6697 lr:0.03 dt:35ms tok/s:1868418 rem:596s step 0126 (1%) loss:5.6457 lr:0.03 dt:35ms tok/s:1858074 rem:596s step 0127 (1%) loss:5.6206 lr:0.03 dt:35ms tok/s:1851653 rem:596s step 0128 (1%) loss:5.6206 lr:0.03 dt:36ms tok/s:1821131 rem:596s step 0129 (1%) loss:5.6164 lr:0.03 dt:36ms tok/s:1808300 rem:596s step 0130 (1%) loss:5.6072 lr:0.03 dt:36ms tok/s:1817278 rem:596s step 0131 (1%) loss:5.6221 lr:0.04 dt:36ms tok/s:1819082 rem:596s step 0132 (1%) loss:5.6016 lr:0.04 dt:36ms tok/s:1814424 rem:596s step 0133 (1%) loss:5.5934 lr:0.04 dt:36ms tok/s:1825594 rem:596s step 0134 (1%) loss:5.5836 lr:0.04 dt:36ms tok/s:1821204 rem:596s step 0135 (1%) loss:5.5667 lr:0.04 dt:36ms tok/s:1815418 rem:596s step 0136 (1%) loss:5.5682 lr:0.04 dt:36ms tok/s:1819564 rem:596s step 0137 (1%) loss:5.5607 lr:0.04 dt:36ms tok/s:1819359 rem:596s step 0138 (1%) loss:5.5487 lr:0.04 dt:36ms tok/s:1816378 rem:595s step 0139 (1%) loss:5.5389 lr:0.04 dt:36ms tok/s:1822556 rem:595s step 0140 (1%) loss:5.5140 lr:0.04 dt:36ms tok/s:1821686 rem:595s step 0141 (1%) loss:5.5154 lr:0.04 dt:36ms tok/s:1834256 rem:595s step 0142 (1%) loss:5.5129 lr:0.04 dt:36ms tok/s:1822399 rem:595s step 0143 (1%) loss:5.5099 lr:0.04 dt:36ms tok/s:1817723 rem:595s step 0144 (1%) loss:5.5048 lr:0.04 dt:36ms tok/s:1824928 rem:595s step 0145 (1%) loss:5.5006 lr:0.04 dt:36ms tok/s:1813215 rem:595s step 0146 (1%) loss:5.5009 lr:0.04 dt:36ms tok/s:1809669 rem:595s step 0147 (1%) loss:5.4950 lr:0.04 dt:36ms tok/s:1810003 rem:595s step 0148 (1%) loss:5.4926 lr:0.04 dt:36ms tok/s:1815670 rem:595s step 0149 (1%) loss:5.4849 lr:0.04 dt:36ms tok/s:1825376 rem:595s step 0150 (1%) loss:5.4833 lr:0.04 dt:36ms tok/s:1809181 rem:595s step 0151 (1%) loss:5.4772 lr:0.04 dt:36ms tok/s:1821795 rem:595s step 0152 (1%) loss:5.4758 lr:0.04 dt:36ms tok/s:1812390 rem:595s step 0153 (1%) loss:5.4670 lr:0.04 dt:36ms tok/s:1838242 rem:595s step 0154 (1%) loss:5.4556 lr:0.04 dt:36ms tok/s:1808669 rem:595s step 0155 (1%) loss:5.4543 lr:0.04 dt:36ms tok/s:1817483 rem:595s step 0156 (1%) loss:5.4540 lr:0.04 dt:36ms tok/s:1818288 rem:595s step 0157 (1%) loss:5.4674 lr:0.04 dt:36ms tok/s:1808015 rem:595s step 0158 (1%) loss:5.4593 lr:0.04 dt:36ms tok/s:1811172 rem:595s step 0159 (1%) loss:5.4528 lr:0.04 dt:36ms tok/s:1805272 rem:595s step 0160 (1%) loss:5.4376 lr:0.04 dt:36ms tok/s:1821095 rem:595s step 0161 (1%) loss:5.4454 lr:0.04 dt:36ms tok/s:1818601 rem:595s step 0162 (1%) loss:5.4558 lr:0.04 dt:36ms tok/s:1814076 rem:595s step 0163 (1%) loss:5.4587 lr:0.04 dt:36ms tok/s:1820564 rem:595s step 0164 (1%) loss:5.4540 lr:0.05 dt:36ms tok/s:1821107 rem:595s step 0165 (1%) loss:5.4466 lr:0.05 dt:36ms tok/s:1799092 rem:595s step 0166 (1%) loss:5.4436 lr:0.05 dt:37ms tok/s:1780067 rem:594s step 0167 (1%) loss:5.4405 lr:0.05 dt:35ms tok/s:1870376 rem:594s step 0168 (1%) loss:5.4395 lr:0.05 dt:36ms tok/s:1828971 rem:594s step 0169 (1%) loss:5.4279 lr:0.05 dt:36ms tok/s:1826407 rem:594s step 0170 (1%) loss:5.4213 lr:0.05 dt:35ms tok/s:1855428 rem:594s step 0171 (1%) loss:5.4276 lr:0.05 dt:35ms tok/s:1850419 rem:594s step 0172 (1%) loss:5.4252 lr:0.05 dt:35ms tok/s:1861698 rem:594s step 0173 (1%) loss:5.4232 lr:0.05 dt:35ms tok/s:1853814 rem:594s step 0174 (1%) loss:5.4336 lr:0.05 dt:36ms tok/s:1844026 rem:594s step 0175 (1%) loss:5.4339 lr:0.05 dt:36ms tok/s:1836143 rem:594s step 0176 (1%) loss:5.4266 lr:0.05 dt:36ms tok/s:1826213 rem:594s step 0177 (1%) loss:5.4233 lr:0.05 dt:36ms tok/s:1837923 rem:594s step 0178 (1%) loss:5.4128 lr:0.05 dt:36ms tok/s:1837751 rem:594s step 0179 (1%) loss:5.4008 lr:0.05 dt:36ms tok/s:1831177 rem:594s step 0180 (1%) loss:5.3836 lr:0.05 dt:36ms tok/s:1831494 rem:594s step 0181 (1%) loss:5.3759 lr:0.05 dt:36ms tok/s:1840976 rem:594s step 0182 (1%) loss:5.3832 lr:0.05 dt:36ms tok/s:1841074 rem:594s step 0183 (1%) loss:5.3867 lr:0.05 dt:36ms tok/s:1836867 rem:594s step 0184 (1%) loss:5.3789 lr:0.05 dt:36ms tok/s:1840322 rem:594s step 0185 (1%) loss:5.3722 lr:0.05 dt:35ms tok/s:1847583 rem:594s step 0186 (1%) loss:5.3674 lr:0.05 dt:35ms tok/s:1846380 rem:594s step 0187 (1%) loss:5.3614 lr:0.05 dt:36ms tok/s:1838587 rem:594s step 0188 (1%) loss:5.3534 lr:0.05 dt:35ms tok/s:1848926 rem:594s step 0189 (1%) loss:5.3435 lr:0.05 dt:36ms tok/s:1844967 rem:594s step 0190 (1%) loss:5.3346 lr:0.05 dt:36ms tok/s:1828898 rem:594s step 0191 (1%) loss:5.3280 lr:0.05 dt:36ms tok/s:1823040 rem:594s step 0192 (1%) loss:5.3165 lr:0.05 dt:36ms tok/s:1816942 rem:594s step 0193 (1%) loss:5.3189 lr:0.05 dt:36ms tok/s:1813885 rem:594s step 0194 (1%) loss:5.3205 lr:0.05 dt:36ms tok/s:1816186 rem:593s step 0195 (1%) loss:5.3116 lr:0.05 dt:36ms tok/s:1817495 rem:593s step 0196 (1%) loss:5.3046 lr:0.05 dt:36ms tok/s:1809622 rem:593s step 0197 (1%) loss:5.3016 lr:0.06 dt:36ms tok/s:1817062 rem:593s step 0198 (1%) loss:5.2952 lr:0.06 dt:36ms tok/s:1818830 rem:593s step 0199 (1%) loss:5.2840 lr:0.06 dt:36ms tok/s:1809193 rem:593s step 0200 (1%) loss:5.2650 lr:0.06 dt:36ms tok/s:1818709 rem:593s + local: attn=[0.564, 0.522, 0.480] mlp=[0.664, 0.624, 0.585] + + transition: attn=[0.440, 0.399] mlp=[0.548, 0.508] + + hierarchy: attn=[0.354, 0.314, 0.273] mlp=[0.470, 0.431, 0.392] + step 0201 (1%) loss:5.2437 lr:0.06 dt:36ms tok/s:1809157 rem:593s step 0202 (1%) loss:5.2307 lr:0.06 dt:36ms tok/s:1813837 rem:593s step 0203 (1%) loss:5.2100 lr:0.06 dt:36ms tok/s:1808407 rem:593s step 0204 (1%) loss:5.1981 lr:0.06 dt:39ms tok/s:1692285 rem:593s step 0205 (1%) loss:5.1931 lr:0.06 dt:36ms tok/s:1813574 rem:593s step 0206 (1%) loss:5.1787 lr:0.06 dt:36ms tok/s:1807313 rem:593s step 0207 (1%) loss:5.1798 lr:0.06 dt:36ms tok/s:1821904 rem:593s step 0208 (1%) loss:5.1756 lr:0.06 dt:36ms tok/s:1822471 rem:593s step 0209 (1%) loss:5.1613 lr:0.06 dt:37ms tok/s:1783624 rem:593s step 0210 (1%) loss:5.1419 lr:0.06 dt:37ms tok/s:1768454 rem:593s step 0211 (1%) loss:5.1162 lr:0.06 dt:36ms tok/s:1821180 rem:593s step 0212 (1%) loss:5.0923 lr:0.06 dt:36ms tok/s:1818649 rem:593s step 0213 (1%) loss:5.0746 lr:0.06 dt:36ms tok/s:1814771 rem:593s step 0214 (1%) loss:5.0595 lr:0.06 dt:36ms tok/s:1813992 rem:593s step 0215 (1%) loss:5.0454 lr:0.06 dt:36ms tok/s:1818842 rem:593s step 0216 (1%) loss:5.0551 lr:0.06 dt:36ms tok/s:1810694 rem:593s step 0217 (1%) loss:5.0563 lr:0.06 dt:36ms tok/s:1818529 rem:593s step 0218 (1%) loss:5.0394 lr:0.06 dt:36ms tok/s:1813957 rem:593s step 0219 (1%) loss:5.0073 lr:0.06 dt:36ms tok/s:1820263 rem:593s step 0220 (1%) loss:4.9853 lr:0.06 dt:36ms tok/s:1808633 rem:593s step 0221 (1%) loss:4.9848 lr:0.06 dt:36ms tok/s:1819480 rem:592s step 0222 (1%) loss:4.9685 lr:0.06 dt:36ms tok/s:1816462 rem:592s step 0223 (1%) loss:4.9782 lr:0.06 dt:36ms tok/s:1819889 rem:592s step 0224 (1%) loss:4.9518 lr:0.06 dt:36ms tok/s:1811446 rem:592s step 0225 (1%) loss:4.9298 lr:0.06 dt:36ms tok/s:1814843 rem:592s step 0226 (1%) loss:4.9157 lr:0.06 dt:36ms tok/s:1813239 rem:592s step 0227 (1%) loss:4.8998 lr:0.06 dt:36ms tok/s:1810432 rem:592s step 0228 (1%) loss:4.8899 lr:0.06 dt:36ms tok/s:1809741 rem:592s step 0229 (1%) loss:4.8797 lr:0.06 dt:36ms tok/s:1810921 rem:592s step 0230 (1%) loss:4.9097 lr:0.06 dt:37ms tok/s:1787475 rem:592s step 0231 (1%) loss:4.8492 lr:0.07 dt:37ms tok/s:1774837 rem:592s step 0232 (1%) loss:4.8217 lr:0.07 dt:36ms tok/s:1814100 rem:592s step 0233 (1%) loss:4.7906 lr:0.07 dt:36ms tok/s:1813777 rem:592s step 0234 (1%) loss:4.7617 lr:0.07 dt:36ms tok/s:1802927 rem:592s step 0235 (1%) loss:4.7565 lr:0.07 dt:36ms tok/s:1817976 rem:592s step 0236 (1%) loss:4.7709 lr:0.07 dt:36ms tok/s:1813801 rem:592s step 0237 (1%) loss:4.7540 lr:0.07 dt:36ms tok/s:1818433 rem:592s step 0238 (1%) loss:4.7569 lr:0.07 dt:36ms tok/s:1818120 rem:592s step 0239 (1%) loss:4.7510 lr:0.07 dt:36ms tok/s:1805568 rem:592s step 0240 (1%) loss:4.7415 lr:0.07 dt:36ms tok/s:1817543 rem:592s step 0241 (1%) loss:4.7351 lr:0.07 dt:36ms tok/s:1815886 rem:592s step 0242 (1%) loss:4.7281 lr:0.07 dt:36ms tok/s:1819215 rem:592s step 0243 (1%) loss:4.7237 lr:0.07 dt:37ms tok/s:1788440 rem:592s step 0244 (1%) loss:4.7060 lr:0.07 dt:36ms tok/s:1836413 rem:592s step 0245 (1%) loss:4.7071 lr:0.07 dt:36ms tok/s:1834146 rem:592s step 0246 (1%) loss:4.6936 lr:0.07 dt:35ms tok/s:1849274 rem:592s step 0247 (1%) loss:4.6837 lr:0.07 dt:37ms tok/s:1793657 rem:592s step 0248 (1%) loss:4.6798 lr:0.07 dt:36ms tok/s:1803471 rem:592s step 0249 (1%) loss:4.6753 lr:0.07 dt:36ms tok/s:1809217 rem:591s step 0250 (1%) loss:4.6821 lr:0.07 dt:36ms tok/s:1806138 rem:591s step 0251 (1%) loss:4.6824 lr:0.07 dt:36ms tok/s:1802892 rem:591s step 0252 (1%) loss:4.6842 lr:0.07 dt:36ms tok/s:1828326 rem:591s step 0253 (1%) loss:4.6760 lr:0.07 dt:34ms tok/s:1904893 rem:591s step 0254 (1%) loss:4.6725 lr:0.07 dt:34ms tok/s:1931421 rem:591s step 0255 (1%) loss:4.6616 lr:0.07 dt:34ms tok/s:1928359 rem:591s step 0256 (1%) loss:4.6594 lr:0.07 dt:31ms tok/s:2115975 rem:591s step 0257 (1%) loss:4.6585 lr:0.07 dt:33ms tok/s:1988756 rem:591s step 0258 (1%) loss:4.6510 lr:0.07 dt:32ms tok/s:2050134 rem:591s step 0259 (1%) loss:4.6430 lr:0.07 dt:32ms tok/s:2029009 rem:591s step 0260 (1%) loss:4.6298 lr:0.07 dt:36ms tok/s:1828959 rem:591s step 0261 (1%) loss:4.6273 lr:0.07 dt:31ms tok/s:2099026 rem:591s step 0262 (1%) loss:4.6273 lr:0.07 dt:32ms tok/s:2030028 rem:591s step 0263 (1%) loss:4.6136 lr:0.07 dt:31ms tok/s:2088770 rem:591s step 0264 (1%) loss:4.6149 lr:0.07 dt:31ms tok/s:2128724 rem:591s step 0265 (2%) loss:4.6201 lr:0.08 dt:31ms tok/s:2110371 rem:591s step 0266 (2%) loss:4.6168 lr:0.08 dt:31ms tok/s:2111781 rem:591s step 0267 (2%) loss:4.6260 lr:0.08 dt:31ms tok/s:2096496 rem:591s step 0268 (2%) loss:4.6123 lr:0.08 dt:31ms tok/s:2088976 rem:591s step 0269 (2%) loss:4.6207 lr:0.08 dt:32ms tok/s:2054286 rem:591s step 0270 (2%) loss:4.6285 lr:0.08 dt:32ms tok/s:2059119 rem:591s step 0271 (2%) loss:4.6307 lr:0.08 dt:32ms tok/s:2060554 rem:591s step 0272 (2%) loss:4.6350 lr:0.08 dt:38ms tok/s:1741806 rem:591s step 0273 (2%) loss:4.6315 lr:0.08 dt:36ms tok/s:1836327 rem:591s step 0274 (2%) loss:4.6220 lr:0.08 dt:32ms tok/s:2060971 rem:591s step 0275 (2%) loss:4.6121 lr:0.08 dt:32ms tok/s:2043991 rem:591s step 0276 (2%) loss:4.6074 lr:0.08 dt:37ms tok/s:1795250 rem:591s step 0277 (2%) loss:4.5907 lr:0.08 dt:32ms tok/s:2031453 rem:591s step 0278 (2%) loss:4.5948 lr:0.08 dt:33ms tok/s:1974769 rem:591s step 0279 (2%) loss:4.6074 lr:0.08 dt:34ms tok/s:1902032 rem:590s step 0280 (2%) loss:4.5926 lr:0.08 dt:36ms tok/s:1833558 rem:590s step 0281 (2%) loss:4.5852 lr:0.08 dt:33ms tok/s:2014938 rem:590s step 0282 (2%) loss:4.5861 lr:0.08 dt:32ms tok/s:2026092 rem:590s step 0283 (2%) loss:4.5958 lr:0.08 dt:33ms tok/s:1979533 rem:590s step 0284 (2%) loss:4.6060 lr:0.08 dt:33ms tok/s:1992490 rem:590s step 0285 (2%) loss:4.6075 lr:0.08 dt:35ms tok/s:1888131 rem:590s step 0286 (2%) loss:4.5950 lr:0.08 dt:36ms tok/s:1812928 rem:590s step 0287 (2%) loss:4.5847 lr:0.08 dt:33ms tok/s:1982559 rem:590s step 0288 (2%) loss:4.5773 lr:0.08 dt:33ms tok/s:1967109 rem:590s step 0289 (2%) loss:4.5763 lr:0.08 dt:33ms tok/s:1983718 rem:590s step 0290 (2%) loss:4.5760 lr:0.08 dt:33ms tok/s:1959327 rem:590s step 0291 (2%) loss:4.5795 lr:0.08 dt:34ms tok/s:1927615 rem:590s step 0292 (2%) loss:4.5584 lr:0.08 dt:36ms tok/s:1840211 rem:590s step 0293 (2%) loss:4.5638 lr:0.08 dt:36ms tok/s:1804608 rem:590s step 0294 (2%) loss:4.5660 lr:0.08 dt:34ms tok/s:1933731 rem:590s step 0295 (2%) loss:4.5593 lr:0.08 dt:34ms tok/s:1927331 rem:590s step 0296 (2%) loss:4.5582 lr:0.08 dt:34ms tok/s:1918788 rem:590s step 0297 (2%) loss:4.5471 lr:0.08 dt:34ms tok/s:1923313 rem:590s step 0298 (2%) loss:4.5496 lr:0.08 dt:38ms tok/s:1738369 rem:590s step 0299 (2%) loss:4.5563 lr:0.08 dt:35ms tok/s:1895958 rem:590s step 0300 (2%) loss:4.5676 lr:0.08 dt:33ms tok/s:1969928 rem:590s + local: attn=[0.830, 0.771, 0.708] mlp=[1.017, 0.978, 0.931] + + transition: attn=[0.681, 0.621] mlp=[0.903, 0.854] + + hierarchy: attn=[0.617, 0.578, 0.532] mlp=[0.815, 0.768, 0.724] + step 0301 (2%) loss:4.5745 lr:0.09 dt:34ms tok/s:1942545 rem:590s step 0302 (2%) loss:4.5788 lr:0.09 dt:34ms tok/s:1948563 rem:590s step 0303 (2%) loss:4.5729 lr:0.09 dt:34ms tok/s:1953659 rem:590s step 0304 (2%) loss:4.5986 lr:0.09 dt:35ms tok/s:1893320 rem:590s step 0305 (2%) loss:4.6590 lr:0.09 dt:38ms tok/s:1720880 rem:590s step 0306 (2%) loss:4.6800 lr:0.09 dt:34ms tok/s:1920692 rem:590s step 0307 (2%) loss:4.6809 lr:0.09 dt:34ms tok/s:1932236 rem:590s step 0308 (2%) loss:4.7142 lr:0.09 dt:33ms tok/s:1965744 rem:589s step 0309 (2%) loss:4.6980 lr:0.09 dt:34ms tok/s:1948563 rem:589s step 0310 (2%) loss:4.6811 lr:0.09 dt:34ms tok/s:1945639 rem:589s step 0311 (2%) loss:4.6792 lr:0.09 dt:36ms tok/s:1816318 rem:589s step 0312 (2%) loss:4.6818 lr:0.09 dt:33ms tok/s:1970267 rem:589s step 0313 (2%) loss:4.6757 lr:0.09 dt:33ms tok/s:1963428 rem:589s step 0314 (2%) loss:4.6803 lr:0.09 dt:33ms tok/s:1963764 rem:589s step 0315 (2%) loss:4.6728 lr:0.09 dt:33ms tok/s:1976600 rem:589s step 0316 (2%) loss:4.6590 lr:0.09 dt:33ms tok/s:1977525 rem:589s step 0317 (2%) loss:4.6580 lr:0.09 dt:37ms tok/s:1748754 rem:589s step 0318 (2%) loss:4.6572 lr:0.09 dt:34ms tok/s:1900914 rem:589s step 0319 (2%) loss:4.6231 lr:0.09 dt:32ms tok/s:2020433 rem:589s step 0320 (2%) loss:4.5894 lr:0.09 dt:33ms tok/s:1992302 rem:589s step 0321 (2%) loss:4.5366 lr:0.09 dt:37ms tok/s:1793610 rem:589s step 0322 (2%) loss:4.5151 lr:0.09 dt:33ms tok/s:2009503 rem:589s step 0323 (2%) loss:4.5742 lr:0.09 dt:33ms tok/s:1988958 rem:589s step 0324 (2%) loss:4.5618 lr:0.09 dt:33ms tok/s:1995252 rem:589s step 0325 (2%) loss:4.5818 lr:0.09 dt:32ms tok/s:2027856 rem:589s step 0326 (2%) loss:4.5998 lr:0.09 dt:32ms tok/s:2041486 rem:589s step 0327 (2%) loss:4.6175 lr:0.09 dt:32ms tok/s:2076211 rem:589s step 0328 (2%) loss:4.6338 lr:0.09 dt:32ms tok/s:2067000 rem:589s step 0329 (2%) loss:4.6478 lr:0.09 dt:33ms tok/s:1994557 rem:589s step 0330 (2%) loss:4.6649 lr:0.09 dt:32ms tok/s:2066285 rem:589s step 0331 (2%) loss:4.6705 lr:0.09 dt:32ms tok/s:2078708 rem:589s step 0332 (2%) loss:4.6589 lr:0.09 dt:31ms tok/s:2081683 rem:589s step 0333 (2%) loss:4.6130 lr:0.09 dt:32ms tok/s:2073815 rem:589s step 0334 (2%) loss:4.5835 lr:0.09 dt:32ms tok/s:2055254 rem:589s step 0335 (2%) loss:4.6044 lr:0.09 dt:32ms tok/s:2047874 rem:589s step 0336 (2%) loss:4.6090 lr:0.09 dt:32ms tok/s:2055992 rem:589s step 0337 (2%) loss:4.6334 lr:0.10 dt:32ms tok/s:2045299 rem:589s step 0338 (2%) loss:4.6394 lr:0.10 dt:32ms tok/s:2046547 rem:589s step 0339 (2%) loss:4.6473 lr:0.10 dt:32ms tok/s:2022812 rem:588s step 0340 (2%) loss:4.6501 lr:0.10 dt:32ms tok/s:2027243 rem:588s step 0341 (2%) loss:4.6441 lr:0.10 dt:32ms tok/s:2021131 rem:588s step 0342 (2%) loss:4.6477 lr:0.10 dt:32ms tok/s:2027482 rem:588s step 0343 (2%) loss:4.6543 lr:0.10 dt:32ms tok/s:2026212 rem:588s step 0344 (2%) loss:4.6473 lr:0.10 dt:32ms tok/s:2026227 rem:588s step 0345 (2%) loss:4.6384 lr:0.10 dt:32ms tok/s:2025212 rem:588s step 0346 (2%) loss:4.6227 lr:0.10 dt:32ms tok/s:2031948 rem:588s step 0347 (2%) loss:4.6151 lr:0.10 dt:33ms tok/s:1998414 rem:588s step 0348 (2%) loss:4.6091 lr:0.10 dt:36ms tok/s:1800766 rem:588s step 0349 (2%) loss:4.6161 lr:0.10 dt:32ms tok/s:2054256 rem:588s step 0350 (2%) loss:4.6308 lr:0.10 dt:32ms tok/s:2052737 rem:588s step 0351 (2%) loss:4.6127 lr:0.10 dt:33ms tok/s:2009856 rem:588s step 0352 (2%) loss:4.6193 lr:0.10 dt:32ms tok/s:2019379 rem:588s step 0353 (2%) loss:4.6267 lr:0.10 dt:32ms tok/s:2021458 rem:588s step 0354 (2%) loss:4.6242 lr:0.10 dt:34ms tok/s:1943864 rem:588s step 0355 (2%) loss:4.6133 lr:0.10 dt:33ms tok/s:1970479 rem:588s step 0356 (2%) loss:4.5973 lr:0.10 dt:33ms tok/s:1997804 rem:588s step 0357 (2%) loss:4.5900 lr:0.10 dt:33ms tok/s:1989001 rem:588s step 0358 (2%) loss:4.5920 lr:0.10 dt:33ms tok/s:1995846 rem:588s step 0359 (2%) loss:4.5891 lr:0.10 dt:33ms tok/s:1996701 rem:588s step 0360 (2%) loss:4.6012 lr:0.10 dt:33ms tok/s:1968941 rem:588s step 0361 (2%) loss:4.5943 lr:0.10 dt:33ms tok/s:1980189 rem:588s step 0362 (2%) loss:4.5837 lr:0.10 dt:33ms tok/s:1967912 rem:588s step 0363 (2%) loss:4.5819 lr:0.10 dt:33ms tok/s:1971638 rem:588s step 0364 (2%) loss:4.5751 lr:0.10 dt:33ms tok/s:1966068 rem:588s step 0365 (2%) loss:4.5737 lr:0.10 dt:34ms tok/s:1946465 rem:588s step 0366 (2%) loss:4.5729 lr:0.10 dt:33ms tok/s:1968194 rem:588s step 0367 (2%) loss:4.5639 lr:0.10 dt:33ms tok/s:1963077 rem:588s step 0368 (2%) loss:4.5585 lr:0.10 dt:34ms tok/s:1950222 rem:588s step 0369 (2%) loss:4.5523 lr:0.10 dt:34ms tok/s:1901611 rem:587s step 0370 (2%) loss:4.5417 lr:0.10 dt:34ms tok/s:1951884 rem:587s step 0371 (2%) loss:4.5541 lr:0.10 dt:34ms tok/s:1932534 rem:587s step 0372 (2%) loss:4.5207 lr:0.10 dt:34ms tok/s:1924781 rem:587s step 0373 (2%) loss:4.5010 lr:0.11 dt:34ms tok/s:1921659 rem:587s step 0374 (2%) loss:4.4975 lr:0.11 dt:34ms tok/s:1928778 rem:587s step 0375 (2%) loss:4.4992 lr:0.11 dt:34ms tok/s:1926062 rem:587s step 0376 (2%) loss:4.5084 lr:0.11 dt:35ms tok/s:1874381 rem:587s step 0377 (2%) loss:4.5140 lr:0.11 dt:34ms tok/s:1919807 rem:587s step 0378 (2%) loss:4.5203 lr:0.11 dt:35ms tok/s:1876517 rem:587s step 0379 (2%) loss:4.5376 lr:0.11 dt:35ms tok/s:1889689 rem:587s step 0380 (2%) loss:4.5184 lr:0.11 dt:35ms tok/s:1887548 rem:587s step 0381 (2%) loss:4.5135 lr:0.11 dt:34ms tok/s:1901046 rem:587s step 0382 (2%) loss:4.5152 lr:0.11 dt:35ms tok/s:1891652 rem:587s step 0383 (2%) loss:4.5189 lr:0.11 dt:35ms tok/s:1889390 rem:587s step 0384 (2%) loss:4.5374 lr:0.11 dt:35ms tok/s:1897450 rem:587s step 0385 (2%) loss:4.5683 lr:0.11 dt:35ms tok/s:1887029 rem:587s step 0386 (2%) loss:4.5768 lr:0.11 dt:35ms tok/s:1879726 rem:587s step 0387 (2%) loss:4.5755 lr:0.11 dt:35ms tok/s:1871739 rem:587s step 0388 (2%) loss:4.5946 lr:0.11 dt:35ms tok/s:1864552 rem:587s step 0389 (2%) loss:4.6005 lr:0.11 dt:35ms tok/s:1883345 rem:587s step 0390 (2%) loss:4.5925 lr:0.11 dt:35ms tok/s:1868164 rem:587s step 0391 (2%) loss:4.5949 lr:0.11 dt:35ms tok/s:1877466 rem:587s step 0392 (2%) loss:4.5957 lr:0.11 dt:47ms tok/s:1407508 rem:587s step 0393 (2%) loss:4.5898 lr:0.11 dt:35ms tok/s:1870401 rem:587s step 0394 (2%) loss:4.5894 lr:0.11 dt:33ms tok/s:1964269 rem:587s step 0395 (2%) loss:4.5768 lr:0.11 dt:33ms tok/s:1962685 rem:587s step 0396 (2%) loss:4.5797 lr:0.11 dt:34ms tok/s:1928291 rem:587s step 0397 (2%) loss:4.5966 lr:0.11 dt:34ms tok/s:1942737 rem:587s step 0398 (2%) loss:4.5965 lr:0.11 dt:34ms tok/s:1913379 rem:586s step 0399 (2%) loss:4.5818 lr:0.11 dt:34ms tok/s:1910400 rem:586s step 0400 (2%) loss:4.5622 lr:0.11 dt:34ms tok/s:1927372 rem:586s + local: attn=[0.829, 0.729, 0.617] mlp=[1.064, 1.023, 0.938] + + transition: attn=[0.789, 0.695] mlp=[0.975, 0.882] + + hierarchy: attn=[0.750, 0.716, 0.668] mlp=[0.875, 0.795, 0.729] + step 0401 (2%) loss:4.5390 lr:0.11 dt:34ms tok/s:1926696 rem:586s step 0402 (2%) loss:4.5396 lr:0.11 dt:35ms tok/s:1888235 rem:586s step 0403 (2%) loss:4.5737 lr:0.11 dt:35ms tok/s:1887911 rem:586s step 0404 (2%) loss:4.5710 lr:0.11 dt:35ms tok/s:1889624 rem:586s step 0405 (2%) loss:4.5725 lr:0.11 dt:35ms tok/s:1895409 rem:586s step 0406 (2%) loss:4.5632 lr:0.11 dt:35ms tok/s:1882983 rem:586s step 0407 (2%) loss:4.5584 lr:0.12 dt:35ms tok/s:1892160 rem:586s step 0408 (2%) loss:4.5581 lr:0.12 dt:35ms tok/s:1870809 rem:586s step 0409 (2%) loss:4.5419 lr:0.12 dt:35ms tok/s:1872810 rem:586s step 0410 (2%) loss:4.5284 lr:0.12 dt:35ms tok/s:1853689 rem:586s step 0411 (2%) loss:4.5154 lr:0.12 dt:36ms tok/s:1802431 rem:586s step 0412 (2%) loss:4.5309 lr:0.12 dt:36ms tok/s:1805141 rem:586s step 0413 (2%) loss:4.5425 lr:0.12 dt:36ms tok/s:1827986 rem:586s step 0414 (2%) loss:4.5464 lr:0.12 dt:36ms tok/s:1825910 rem:586s step 0415 (2%) loss:4.5378 lr:0.12 dt:37ms tok/s:1778293 rem:586s step 0416 (2%) loss:4.5289 lr:0.12 dt:36ms tok/s:1825473 rem:586s step 0417 (2%) loss:4.5167 lr:0.12 dt:36ms tok/s:1816750 rem:586s step 0418 (2%) loss:4.5034 lr:0.12 dt:36ms tok/s:1820287 rem:586s step 0419 (2%) loss:4.5063 lr:0.12 dt:36ms tok/s:1820854 rem:586s step 0420 (2%) loss:4.5150 lr:0.12 dt:36ms tok/s:1821626 rem:586s step 0421 (2%) loss:4.4971 lr:0.12 dt:36ms tok/s:1828193 rem:586s step 0422 (2%) loss:4.4873 lr:0.12 dt:36ms tok/s:1823209 rem:586s step 0423 (2%) loss:4.5042 lr:0.12 dt:36ms tok/s:1828193 rem:586s step 0424 (2%) loss:4.5213 lr:0.12 dt:36ms tok/s:1827354 rem:586s step 0425 (2%) loss:4.5309 lr:0.12 dt:36ms tok/s:1831311 rem:586s step 0426 (2%) loss:4.5265 lr:0.12 dt:36ms tok/s:1826953 rem:585s step 0427 (2%) loss:4.5224 lr:0.12 dt:36ms tok/s:1824843 rem:585s step 0428 (2%) loss:4.5292 lr:0.12 dt:36ms tok/s:1830067 rem:585s step 0429 (2%) loss:4.5353 lr:0.12 dt:36ms tok/s:1823257 rem:585s step 0430 (2%) loss:4.5395 lr:0.12 dt:36ms tok/s:1820552 rem:585s step 0431 (2%) loss:4.5340 lr:0.12 dt:36ms tok/s:1819865 rem:585s step 0432 (2%) loss:4.5253 lr:0.12 dt:36ms tok/s:1825994 rem:585s step 0433 (2%) loss:4.5109 lr:0.12 dt:36ms tok/s:1817976 rem:585s step 0434 (2%) loss:4.5135 lr:0.12 dt:36ms tok/s:1830494 rem:585s step 0435 (2%) loss:4.4810 lr:0.12 dt:36ms tok/s:1825691 rem:585s step 0436 (2%) loss:4.4963 lr:0.12 dt:36ms tok/s:1827318 rem:585s step 0437 (2%) loss:4.5096 lr:0.12 dt:34ms tok/s:1933024 rem:585s step 0438 (2%) loss:4.5279 lr:0.12 dt:31ms tok/s:2082266 rem:585s step 0439 (2%) loss:4.5217 lr:0.12 dt:31ms tok/s:2101015 rem:585s step 0440 (2%) loss:4.5220 lr:0.12 dt:31ms tok/s:2100325 rem:585s step 0441 (3%) loss:4.5172 lr:0.13 dt:32ms tok/s:2074644 rem:585s step 0442 (3%) loss:4.5138 lr:0.13 dt:31ms tok/s:2099892 rem:585s step 0443 (3%) loss:4.5095 lr:0.13 dt:31ms tok/s:2085632 rem:585s step 0444 (3%) loss:4.4950 lr:0.13 dt:32ms tok/s:2080186 rem:585s step 0445 (3%) loss:4.4902 lr:0.13 dt:32ms tok/s:2048194 rem:585s step 0446 (3%) loss:4.4812 lr:0.13 dt:32ms tok/s:2070066 rem:585s step 0447 (3%) loss:4.4833 lr:0.13 dt:32ms tok/s:2054041 rem:585s step 0448 (3%) loss:4.4821 lr:0.13 dt:32ms tok/s:2040334 rem:585s step 0449 (3%) loss:4.4670 lr:0.13 dt:32ms tok/s:2042396 rem:585s step 0450 (3%) loss:4.4765 lr:0.13 dt:32ms tok/s:2040228 rem:585s step 0451 (3%) loss:4.4845 lr:0.13 dt:36ms tok/s:1804513 rem:585s step 0452 (3%) loss:4.4949 lr:0.13 dt:33ms tok/s:2002010 rem:585s step 0453 (3%) loss:4.4966 lr:0.13 dt:32ms tok/s:2023869 rem:585s step 0454 (3%) loss:4.5048 lr:0.13 dt:32ms tok/s:2031438 rem:585s step 0455 (3%) loss:4.4952 lr:0.13 dt:32ms tok/s:2033737 rem:585s step 0456 (3%) loss:4.4905 lr:0.13 dt:32ms tok/s:2033512 rem:584s step 0457 (3%) loss:4.4880 lr:0.13 dt:32ms tok/s:2033286 rem:584s step 0458 (3%) loss:4.5075 lr:0.13 dt:32ms tok/s:2039789 rem:584s step 0459 (3%) loss:4.5058 lr:0.13 dt:32ms tok/s:2038791 rem:584s step 0460 (3%) loss:4.5000 lr:0.13 dt:32ms tok/s:2024973 rem:584s step 0461 (3%) loss:4.4819 lr:0.13 dt:32ms tok/s:2027751 rem:584s step 0462 (3%) loss:4.4752 lr:0.13 dt:33ms tok/s:2010355 rem:584s step 0463 (3%) loss:4.4698 lr:0.13 dt:33ms tok/s:2015470 rem:584s step 0464 (3%) loss:4.4499 lr:0.13 dt:33ms tok/s:2008974 rem:584s step 0465 (3%) loss:4.4468 lr:0.13 dt:33ms tok/s:1999985 rem:584s step 0466 (3%) loss:4.4268 lr:0.13 dt:33ms tok/s:1990138 rem:584s step 0467 (3%) loss:4.4168 lr:0.13 dt:33ms tok/s:1983733 rem:584s step 0468 (3%) loss:4.4161 lr:0.13 dt:33ms tok/s:1970380 rem:584s step 0469 (3%) loss:4.4283 lr:0.13 dt:34ms tok/s:1912261 rem:584s step 0470 (3%) loss:4.4327 lr:0.13 dt:33ms tok/s:1972331 rem:584s step 0471 (3%) loss:4.4250 lr:0.13 dt:33ms tok/s:1974896 rem:584s step 0472 (3%) loss:4.4280 lr:0.13 dt:33ms tok/s:1960641 rem:584s step 0473 (3%) loss:4.4283 lr:0.13 dt:33ms tok/s:1965238 rem:584s step 0474 (3%) loss:4.4285 lr:0.13 dt:33ms tok/s:1970931 rem:584s step 0475 (3%) loss:4.4326 lr:0.13 dt:33ms tok/s:1969322 rem:584s step 0476 (3%) loss:4.4245 lr:0.13 dt:33ms tok/s:1975237 rem:584s step 0477 (3%) loss:4.4160 lr:0.13 dt:33ms tok/s:1974556 rem:584s step 0478 (3%) loss:4.4245 lr:0.14 dt:33ms tok/s:1987606 rem:584s step 0479 (3%) loss:4.4325 lr:0.14 dt:33ms tok/s:1963919 rem:584s step 0480 (3%) loss:4.4403 lr:0.14 dt:33ms tok/s:1969265 rem:584s step 0481 (3%) loss:4.4422 lr:0.14 dt:33ms tok/s:1978322 rem:584s step 0482 (3%) loss:4.4375 lr:0.14 dt:34ms tok/s:1901111 rem:584s step 0483 (3%) loss:4.4415 lr:0.14 dt:33ms tok/s:1966124 rem:584s step 0484 (3%) loss:4.4409 lr:0.14 dt:33ms tok/s:1969350 rem:584s step 0485 (3%) loss:4.4464 lr:0.14 dt:33ms tok/s:1979404 rem:584s step 0486 (3%) loss:4.4448 lr:0.14 dt:33ms tok/s:1970409 rem:583s step 0487 (3%) loss:4.4692 lr:0.14 dt:34ms tok/s:1924390 rem:583s step 0488 (3%) loss:4.4683 lr:0.14 dt:33ms tok/s:1972218 rem:583s step 0489 (3%) loss:4.4721 lr:0.14 dt:34ms tok/s:1945253 rem:583s step 0490 (3%) loss:4.4633 lr:0.14 dt:33ms tok/s:1963133 rem:583s step 0491 (3%) loss:4.4635 lr:0.14 dt:34ms tok/s:1926291 rem:583s step 0492 (3%) loss:4.4552 lr:0.14 dt:36ms tok/s:1838451 rem:583s step 0493 (3%) loss:4.4531 lr:0.14 dt:34ms tok/s:1950166 rem:583s step 0494 (3%) loss:4.4486 lr:0.14 dt:33ms tok/s:1959285 rem:583s step 0495 (3%) loss:4.4491 lr:0.14 dt:34ms tok/s:1944978 rem:583s step 0496 (3%) loss:4.4366 lr:0.14 dt:34ms tok/s:1945501 rem:583s step 0497 (3%) loss:4.4234 lr:0.14 dt:33ms tok/s:1959802 rem:583s step 0498 (3%) loss:4.4241 lr:0.14 dt:33ms tok/s:1968616 rem:583s step 0499 (3%) loss:4.4178 lr:0.14 dt:33ms tok/s:1967912 rem:583s step 0500 (3%) loss:4.4129 lr:0.14 dt:34ms tok/s:1934670 rem:583s + local: attn=[0.865, 0.743, 0.560] mlp=[1.034, 0.990, 0.848] + + transition: attn=[0.892, 0.743] mlp=[0.941, 0.786] + + hierarchy: attn=[0.796, 0.744, 0.680] mlp=[0.701, 0.552, 0.437] + step 0501 (3%) loss:4.4096 lr:0.14 dt:34ms tok/s:1950817 rem:583s step 0502 (3%) loss:4.4038 lr:0.14 dt:34ms tok/s:1945184 rem:583s step 0503 (3%) loss:4.4044 lr:0.14 dt:34ms tok/s:1926831 rem:583s step 0504 (3%) loss:4.4236 lr:0.14 dt:34ms tok/s:1929442 rem:583s step 0505 (3%) loss:4.4197 lr:0.14 dt:34ms tok/s:1943520 rem:583s step 0506 (3%) loss:4.4178 lr:0.14 dt:34ms tok/s:1921457 rem:583s step 0507 (3%) loss:4.4117 lr:0.14 dt:34ms tok/s:1922156 rem:583s step 0508 (3%) loss:4.4169 lr:0.14 dt:34ms tok/s:1919552 rem:583s step 0509 (3%) loss:4.3941 lr:0.14 dt:34ms tok/s:1920558 rem:583s step 0510 (3%) loss:4.3890 lr:0.14 dt:34ms tok/s:1919807 rem:583s step 0511 (3%) loss:4.3985 lr:0.14 dt:34ms tok/s:1914418 rem:583s step 0512 (3%) loss:4.4055 lr:0.14 dt:34ms tok/s:1927940 rem:583s step 0513 (3%) loss:4.4063 lr:0.14 dt:35ms tok/s:1862973 rem:583s step 0514 (3%) loss:4.3948 lr:0.15 dt:34ms tok/s:1932494 rem:583s step 0515 (3%) loss:4.3901 lr:0.15 dt:34ms tok/s:1921457 rem:583s step 0516 (3%) loss:4.3827 lr:0.15 dt:37ms tok/s:1761552 rem:582s step 0517 (3%) loss:4.3786 lr:0.15 dt:35ms tok/s:1848080 rem:582s step 0518 (3%) loss:4.3704 lr:0.15 dt:34ms tok/s:1950526 rem:582s step 0519 (3%) loss:4.3833 lr:0.15 dt:34ms tok/s:1940406 rem:582s step 0520 (3%) loss:4.3806 lr:0.15 dt:34ms tok/s:1935964 rem:582s step 0521 (3%) loss:4.3793 lr:0.15 dt:34ms tok/s:1940981 rem:582s step 0522 (3%) loss:4.3816 lr:0.15 dt:34ms tok/s:1940858 rem:582s step 0523 (3%) loss:4.3791 lr:0.15 dt:34ms tok/s:1926804 rem:582s step 0524 (3%) loss:4.3928 lr:0.15 dt:34ms tok/s:1937151 rem:582s step 0525 (3%) loss:4.4058 lr:0.15 dt:34ms tok/s:1938531 rem:582s step 0526 (3%) loss:4.4121 lr:0.15 dt:33ms tok/s:1958085 rem:582s step 0527 (3%) loss:4.3979 lr:0.15 dt:34ms tok/s:1939790 rem:582s step 0528 (3%) loss:4.3996 lr:0.15 dt:34ms tok/s:1949890 rem:582s step 0529 (3%) loss:4.4113 lr:0.15 dt:34ms tok/s:1938394 rem:582s step 0530 (3%) loss:4.4146 lr:0.15 dt:36ms tok/s:1816306 rem:582s step 0531 (3%) loss:4.4140 lr:0.15 dt:33ms tok/s:1958015 rem:582s step 0532 (3%) loss:4.4147 lr:0.15 dt:34ms tok/s:1924942 rem:582s step 0533 (3%) loss:4.4191 lr:0.15 dt:38ms tok/s:1740295 rem:582s step 0534 (3%) loss:4.4193 lr:0.15 dt:34ms tok/s:1949281 rem:582s step 0535 (3%) loss:4.4323 lr:0.15 dt:33ms tok/s:1956649 rem:582s step 0536 (3%) loss:4.4280 lr:0.15 dt:34ms tok/s:1955925 rem:582s step 0537 (3%) loss:4.4226 lr:0.15 dt:34ms tok/s:1950554 rem:582s step 0538 (3%) loss:4.4165 lr:0.15 dt:34ms tok/s:1936101 rem:582s step 0539 (3%) loss:4.4103 lr:0.15 dt:34ms tok/s:1926561 rem:582s step 0540 (3%) loss:4.4020 lr:0.15 dt:34ms tok/s:1920544 rem:582s step 0541 (3%) loss:4.4014 lr:0.15 dt:34ms tok/s:1930458 rem:582s step 0542 (3%) loss:4.4006 lr:0.15 dt:34ms tok/s:1929293 rem:582s step 0543 (3%) loss:4.3990 lr:0.15 dt:34ms tok/s:1925967 rem:582s step 0544 (3%) loss:4.3968 lr:0.15 dt:34ms tok/s:1929631 rem:582s step 0545 (3%) loss:4.4173 lr:0.15 dt:34ms tok/s:1927520 rem:581s step 0546 (3%) loss:4.4159 lr:0.15 dt:34ms tok/s:1909869 rem:581s step 0547 (3%) loss:4.4185 lr:0.15 dt:34ms tok/s:1928291 rem:581s step 0548 (3%) loss:4.4096 lr:0.15 dt:34ms tok/s:1913632 rem:581s step 0549 (3%) loss:4.4118 lr:0.16 dt:34ms tok/s:1903547 rem:581s step 0550 (3%) loss:4.4075 lr:0.16 dt:34ms tok/s:1904167 rem:581s step 0551 (3%) loss:4.3996 lr:0.16 dt:38ms tok/s:1732726 rem:581s step 0552 (3%) loss:4.4190 lr:0.16 dt:36ms tok/s:1816006 rem:581s step 0553 (3%) loss:4.4250 lr:0.16 dt:34ms tok/s:1919016 rem:581s step 0554 (3%) loss:4.4152 lr:0.16 dt:34ms tok/s:1935106 rem:581s step 0555 (3%) loss:4.4139 lr:0.16 dt:34ms tok/s:1932711 rem:581s step 0556 (3%) loss:4.4138 lr:0.16 dt:34ms tok/s:1915672 rem:581s step 0557 (3%) loss:4.4151 lr:0.16 dt:34ms tok/s:1904352 rem:581s step 0558 (3%) loss:4.4116 lr:0.16 dt:34ms tok/s:1907285 rem:581s step 0559 (3%) loss:4.4073 lr:0.16 dt:34ms tok/s:1905698 rem:581s step 0560 (3%) loss:4.4047 lr:0.16 dt:35ms tok/s:1859897 rem:581s step 0561 (3%) loss:4.3981 lr:0.16 dt:35ms tok/s:1890781 rem:581s step 0562 (3%) loss:4.3949 lr:0.16 dt:35ms tok/s:1871305 rem:581s step 0563 (3%) loss:4.3942 lr:0.16 dt:35ms tok/s:1873397 rem:581s step 0564 (3%) loss:4.3785 lr:0.16 dt:35ms tok/s:1888481 rem:581s step 0565 (3%) loss:4.3784 lr:0.16 dt:35ms tok/s:1885463 rem:581s step 0566 (3%) loss:4.3855 lr:0.16 dt:35ms tok/s:1878325 rem:581s step 0567 (3%) loss:4.3924 lr:0.16 dt:35ms tok/s:1875672 rem:581s step 0568 (3%) loss:4.3946 lr:0.16 dt:35ms tok/s:1868812 rem:581s step 0569 (3%) loss:4.4006 lr:0.16 dt:36ms tok/s:1821361 rem:581s step 0570 (3%) loss:4.3902 lr:0.16 dt:38ms tok/s:1735669 rem:581s step 0571 (3%) loss:4.3735 lr:0.16 dt:35ms tok/s:1874521 rem:581s step 0572 (3%) loss:4.3574 lr:0.16 dt:34ms tok/s:1906095 rem:581s step 0573 (3%) loss:4.3637 lr:0.16 dt:34ms tok/s:1909179 rem:581s step 0574 (3%) loss:4.3666 lr:0.16 dt:34ms tok/s:1911968 rem:580s step 0575 (3%) loss:4.3612 lr:0.16 dt:34ms tok/s:1900783 rem:580s step 0576 (3%) loss:4.3728 lr:0.16 dt:34ms tok/s:1917196 rem:580s step 0577 (3%) loss:4.3646 lr:0.16 dt:35ms tok/s:1899377 rem:580s step 0578 (3%) loss:4.3404 lr:0.16 dt:34ms tok/s:1911131 rem:580s step 0579 (3%) loss:4.3513 lr:0.16 dt:35ms tok/s:1889637 rem:580s step 0580 (3%) loss:4.3658 lr:0.16 dt:34ms tok/s:1901861 rem:580s step 0581 (3%) loss:4.3705 lr:0.16 dt:34ms tok/s:1906967 rem:580s step 0582 (3%) loss:4.3720 lr:0.16 dt:36ms tok/s:1811136 rem:580s step 0583 (3%) loss:4.3825 lr:0.17 dt:34ms tok/s:1905751 rem:580s step 0584 (3%) loss:4.3910 lr:0.17 dt:34ms tok/s:1911702 rem:580s step 0585 (3%) loss:4.3758 lr:0.17 dt:34ms tok/s:1912234 rem:580s step 0586 (3%) loss:4.3710 lr:0.17 dt:35ms tok/s:1897162 rem:580s step 0587 (3%) loss:4.3800 lr:0.17 dt:34ms tok/s:1910812 rem:580s step 0588 (3%) loss:4.3779 lr:0.17 dt:34ms tok/s:1902098 rem:580s step 0589 (3%) loss:4.3709 lr:0.17 dt:35ms tok/s:1898066 rem:580s step 0590 (3%) loss:4.3650 lr:0.17 dt:34ms tok/s:1911144 rem:580s step 0591 (3%) loss:4.3712 lr:0.17 dt:34ms tok/s:1914618 rem:580s step 0592 (3%) loss:4.3794 lr:0.17 dt:35ms tok/s:1864198 rem:580s step 0593 (3%) loss:4.3906 lr:0.17 dt:36ms tok/s:1813885 rem:580s step 0594 (3%) loss:4.3906 lr:0.17 dt:34ms tok/s:1913698 rem:580s step 0595 (3%) loss:4.3852 lr:0.17 dt:35ms tok/s:1882158 rem:580s step 0596 (3%) loss:4.3917 lr:0.17 dt:34ms tok/s:1901164 rem:580s step 0597 (3%) loss:4.3942 lr:0.17 dt:34ms tok/s:1912607 rem:580s step 0598 (3%) loss:4.4188 lr:0.17 dt:35ms tok/s:1897660 rem:580s step 0599 (3%) loss:4.4166 lr:0.17 dt:34ms tok/s:1911941 rem:580s step 0600 (3%) loss:4.4270 lr:0.17 dt:35ms tok/s:1882816 rem:580s + local: attn=[0.856, 0.724, 0.472] mlp=[0.961, 0.898, 0.642] + + transition: attn=[1.041, 0.822] mlp=[0.876, 0.670] + + hierarchy: attn=[0.893, 0.788, 0.691] mlp=[0.650, 0.474, 0.327] + step 0601 (3%) loss:4.4287 lr:0.17 dt:35ms tok/s:1872976 rem:580s step 0602 (3%) loss:4.4221 lr:0.17 dt:35ms tok/s:1886809 rem:579s step 0603 (3%) loss:4.4187 lr:0.17 dt:35ms tok/s:1882506 rem:579s step 0604 (3%) loss:4.4232 lr:0.17 dt:35ms tok/s:1878826 rem:579s step 0605 (3%) loss:4.4266 lr:0.17 dt:35ms tok/s:1882596 rem:579s step 0606 (3%) loss:4.4124 lr:0.17 dt:34ms tok/s:1913685 rem:579s step 0607 (3%) loss:4.3979 lr:0.17 dt:35ms tok/s:1880767 rem:579s step 0608 (3%) loss:4.3943 lr:0.17 dt:35ms tok/s:1880549 rem:579s step 0609 (3%) loss:4.4215 lr:0.17 dt:35ms tok/s:1880086 rem:579s step 0610 (3%) loss:4.4394 lr:0.17 dt:35ms tok/s:1879751 rem:579s step 0611 (3%) loss:4.4363 lr:0.17 dt:35ms tok/s:1893907 rem:579s step 0612 (3%) loss:4.4385 lr:0.17 dt:35ms tok/s:1874444 rem:579s step 0613 (3%) loss:4.4551 lr:0.17 dt:35ms tok/s:1878711 rem:579s step 0614 (3%) loss:4.4475 lr:0.17 dt:35ms tok/s:1880549 rem:579s step 0615 (3%) loss:4.4549 lr:0.17 dt:35ms tok/s:1878646 rem:579s step 0616 (3%) loss:4.4485 lr:0.17 dt:35ms tok/s:1884739 rem:579s step 0617 (3%) loss:4.4466 lr:0.17 dt:35ms tok/s:1877479 rem:579s step 0618 (4%) loss:4.4477 lr:0.18 dt:35ms tok/s:1897254 rem:579s step 0619 (4%) loss:4.4442 lr:0.18 dt:34ms tok/s:1905408 rem:579s step 0620 (4%) loss:4.4349 lr:0.18 dt:34ms tok/s:1931109 rem:579s step 0621 (4%) loss:4.4286 lr:0.18 dt:34ms tok/s:1930282 rem:579s step 0622 (4%) loss:4.4208 lr:0.18 dt:35ms tok/s:1877633 rem:579s step 0623 (4%) loss:4.4068 lr:0.18 dt:34ms tok/s:1933799 rem:579s step 0624 (4%) loss:4.4088 lr:0.18 dt:34ms tok/s:1927750 rem:579s step 0625 (4%) loss:4.4124 lr:0.18 dt:34ms tok/s:1924902 rem:579s step 0626 (4%) loss:4.4082 lr:0.18 dt:34ms tok/s:1933282 rem:579s step 0627 (4%) loss:4.4148 lr:0.18 dt:34ms tok/s:1911091 rem:579s step 0628 (4%) loss:4.4123 lr:0.18 dt:34ms tok/s:1909590 rem:579s step 0629 (4%) loss:4.4126 lr:0.18 dt:34ms tok/s:1915112 rem:579s step 0630 (4%) loss:4.4119 lr:0.18 dt:34ms tok/s:1907616 rem:579s step 0631 (4%) loss:4.4055 lr:0.18 dt:34ms tok/s:1917316 rem:578s step 0632 (4%) loss:4.3918 lr:0.18 dt:34ms tok/s:1916434 rem:578s step 0633 (4%) loss:4.3758 lr:0.18 dt:35ms tok/s:1882377 rem:578s step 0634 (4%) loss:4.3549 lr:0.18 dt:34ms tok/s:1915739 rem:578s step 0635 (4%) loss:4.3387 lr:0.18 dt:34ms tok/s:1907801 rem:578s step 0636 (4%) loss:4.3314 lr:0.18 dt:34ms tok/s:1904391 rem:578s step 0637 (4%) loss:4.3276 lr:0.18 dt:34ms tok/s:1900651 rem:578s step 0638 (4%) loss:4.3268 lr:0.18 dt:34ms tok/s:1905104 rem:578s step 0639 (4%) loss:4.3282 lr:0.18 dt:35ms tok/s:1895278 rem:578s step 0640 (4%) loss:4.3087 lr:0.18 dt:34ms tok/s:1906703 rem:578s step 0641 (4%) loss:4.3109 lr:0.18 dt:34ms tok/s:1900954 rem:578s step 0642 (4%) loss:4.3246 lr:0.18 dt:34ms tok/s:1908424 rem:578s step 0643 (4%) loss:4.3199 lr:0.18 dt:35ms tok/s:1890326 rem:578s step 0644 (4%) loss:4.3127 lr:0.18 dt:34ms tok/s:1908888 rem:578s step 0645 (4%) loss:4.3078 lr:0.18 dt:34ms tok/s:1904866 rem:578s step 0646 (4%) loss:4.3155 lr:0.18 dt:34ms tok/s:1918668 rem:578s step 0647 (4%) loss:4.3197 lr:0.18 dt:34ms tok/s:1911981 rem:578s step 0648 (4%) loss:4.3224 lr:0.18 dt:34ms tok/s:1902941 rem:578s step 0649 (4%) loss:4.3431 lr:0.18 dt:34ms tok/s:1911995 rem:578s step 0650 (4%) loss:4.3379 lr:0.18 dt:34ms tok/s:1906729 rem:578s step 0651 (4%) loss:4.3507 lr:0.18 dt:34ms tok/s:1904721 rem:578s step 0652 (4%) loss:4.3514 lr:0.18 dt:34ms tok/s:1910281 rem:578s step 0653 (4%) loss:4.3484 lr:0.19 dt:34ms tok/s:1914832 rem:578s step 0654 (4%) loss:4.3417 lr:0.19 dt:34ms tok/s:1908702 rem:578s step 0655 (4%) loss:4.3305 lr:0.19 dt:35ms tok/s:1899377 rem:578s step 0656 (4%) loss:4.3291 lr:0.19 dt:34ms tok/s:1911862 rem:578s step 0657 (4%) loss:4.3365 lr:0.19 dt:34ms tok/s:1907100 rem:578s step 0658 (4%) loss:4.3353 lr:0.19 dt:34ms tok/s:1905368 rem:578s step 0659 (4%) loss:4.3197 lr:0.19 dt:35ms tok/s:1891444 rem:578s step 0660 (4%) loss:4.3187 lr:0.19 dt:34ms tok/s:1911170 rem:577s step 0661 (4%) loss:4.3365 lr:0.19 dt:35ms tok/s:1898446 rem:577s step 0662 (4%) loss:4.3632 lr:0.19 dt:34ms tok/s:1904470 rem:577s step 0663 (4%) loss:4.3656 lr:0.19 dt:35ms tok/s:1895880 rem:577s step 0664 (4%) loss:4.3608 lr:0.19 dt:38ms tok/s:1729238 rem:577s step 0665 (4%) loss:4.3624 lr:0.19 dt:35ms tok/s:1890482 rem:577s step 0666 (4%) loss:4.3488 lr:0.19 dt:35ms tok/s:1866616 rem:577s step 0667 (4%) loss:4.3541 lr:0.19 dt:34ms tok/s:1926170 rem:577s step 0668 (4%) loss:4.3485 lr:0.19 dt:34ms tok/s:1933119 rem:577s step 0669 (4%) loss:4.3553 lr:0.19 dt:34ms tok/s:1933867 rem:577s step 0670 (4%) loss:4.3515 lr:0.19 dt:34ms tok/s:1937971 rem:577s step 0671 (4%) loss:4.3478 lr:0.19 dt:33ms tok/s:1960389 rem:577s step 0672 (4%) loss:4.3394 lr:0.19 dt:34ms tok/s:1947320 rem:577s step 0673 (4%) loss:4.3436 lr:0.19 dt:34ms tok/s:1947155 rem:577s step 0674 (4%) loss:4.3530 lr:0.19 dt:34ms tok/s:1912859 rem:577s step 0675 (4%) loss:4.3581 lr:0.19 dt:34ms tok/s:1940132 rem:577s step 0676 (4%) loss:4.3520 lr:0.19 dt:34ms tok/s:1951219 rem:577s step 0677 (4%) loss:4.3528 lr:0.19 dt:34ms tok/s:1942696 rem:577s step 0678 (4%) loss:4.3596 lr:0.19 dt:34ms tok/s:1939352 rem:577s step 0679 (4%) loss:4.3546 lr:0.19 dt:34ms tok/s:1945061 rem:577s step 0680 (4%) loss:4.3584 lr:0.19 dt:34ms tok/s:1948273 rem:577s step 0681 (4%) loss:4.3632 lr:0.19 dt:34ms tok/s:1941118 rem:577s step 0682 (4%) loss:4.3549 lr:0.19 dt:34ms tok/s:1906610 rem:577s step 0683 (4%) loss:4.3552 lr:0.19 dt:34ms tok/s:1919471 rem:577s step 0684 (4%) loss:4.3556 lr:0.19 dt:34ms tok/s:1917463 rem:577s step 0685 (4%) loss:4.3556 lr:0.19 dt:34ms tok/s:1913685 rem:577s step 0686 (4%) loss:4.3493 lr:0.19 dt:34ms tok/s:1926183 rem:577s step 0687 (4%) loss:4.3314 lr:0.19 dt:34ms tok/s:1912540 rem:577s step 0688 (4%) loss:4.3397 lr:0.20 dt:34ms tok/s:1914565 rem:577s step 0689 (4%) loss:4.3499 lr:0.20 dt:34ms tok/s:1928846 rem:577s step 0690 (4%) loss:4.3516 lr:0.20 dt:34ms tok/s:1921712 rem:576s step 0691 (4%) loss:4.3419 lr:0.20 dt:34ms tok/s:1911675 rem:576s step 0692 (4%) loss:4.3313 lr:0.20 dt:34ms tok/s:1907801 rem:576s step 0693 (4%) loss:4.3280 lr:0.20 dt:34ms tok/s:1923986 rem:576s step 0694 (4%) loss:4.3319 lr:0.20 dt:34ms tok/s:1929618 rem:576s step 0695 (4%) loss:4.3378 lr:0.20 dt:34ms tok/s:1924619 rem:576s step 0696 (4%) loss:4.3309 lr:0.20 dt:34ms tok/s:1917383 rem:576s step 0697 (4%) loss:4.3042 lr:0.20 dt:34ms tok/s:1930146 rem:576s step 0698 (4%) loss:4.2987 lr:0.20 dt:34ms tok/s:1922089 rem:576s step 0699 (4%) loss:4.3075 lr:0.20 dt:34ms tok/s:1921175 rem:576s step 0700 (4%) loss:4.3275 lr:0.20 dt:34ms tok/s:1918761 rem:576s + local: attn=[0.707, 0.775, 0.534] mlp=[0.910, 0.761, 0.315] + + transition: attn=[1.129, 0.854] mlp=[0.772, 0.529] + + hierarchy: attn=[0.922, 0.768, 0.647] mlp=[0.513, 0.360, 0.227] + step 0701 (4%) loss:4.3405 lr:0.20 dt:34ms tok/s:1919485 rem:576s step 0702 (4%) loss:4.3434 lr:0.20 dt:34ms tok/s:1921242 rem:576s step 0703 (4%) loss:4.3704 lr:0.20 dt:34ms tok/s:1930729 rem:576s step 0704 (4%) loss:4.3815 lr:0.20 dt:34ms tok/s:1926197 rem:576s step 0705 (4%) loss:4.3861 lr:0.20 dt:34ms tok/s:1924336 rem:576s step 0706 (4%) loss:4.3946 lr:0.20 dt:34ms tok/s:1916514 rem:576s step 0707 (4%) loss:4.4030 lr:0.20 dt:34ms tok/s:1925023 rem:576s step 0708 (4%) loss:4.4035 lr:0.20 dt:34ms tok/s:1925738 rem:576s step 0709 (4%) loss:4.4052 lr:0.20 dt:34ms tok/s:1917409 rem:576s step 0710 (4%) loss:4.4013 lr:0.20 dt:34ms tok/s:1917196 rem:576s step 0711 (4%) loss:4.3947 lr:0.20 dt:34ms tok/s:1920477 rem:576s step 0712 (4%) loss:4.3989 lr:0.20 dt:34ms tok/s:1923245 rem:576s step 0713 (4%) loss:4.4079 lr:0.20 dt:34ms tok/s:1922438 rem:576s step 0714 (4%) loss:4.4027 lr:0.20 dt:34ms tok/s:1907867 rem:576s step 0715 (4%) loss:4.3915 lr:0.20 dt:34ms tok/s:1913659 rem:576s step 0716 (4%) loss:4.3876 lr:0.20 dt:37ms tok/s:1790549 rem:576s step 0717 (4%) loss:4.4605 lr:0.20 dt:34ms tok/s:1913086 rem:576s step 0718 (4%) loss:4.4427 lr:0.20 dt:34ms tok/s:1921457 rem:576s step 0719 (4%) loss:4.4285 lr:0.20 dt:35ms tok/s:1867263 rem:575s step 0720 (4%) loss:4.4317 lr:0.20 dt:34ms tok/s:1920263 rem:575s step 0721 (4%) loss:4.4239 lr:0.20 dt:36ms tok/s:1829422 rem:575s step 0722 (4%) loss:4.4328 lr:0.20 dt:34ms tok/s:1921981 rem:575s step 0723 (4%) loss:4.4356 lr:0.21 dt:35ms tok/s:1860136 rem:575s step 0724 (4%) loss:4.4294 lr:0.21 dt:35ms tok/s:1853326 rem:575s step 0725 (4%) loss:4.4236 lr:0.21 dt:34ms tok/s:1933187 rem:575s step 0726 (4%) loss:4.4286 lr:0.21 dt:34ms tok/s:1921578 rem:575s step 0727 (4%) loss:4.4192 lr:0.21 dt:34ms tok/s:1917516 rem:575s step 0728 (4%) loss:4.4488 lr:0.21 dt:34ms tok/s:1911729 rem:575s step 0729 (4%) loss:4.4393 lr:0.21 dt:34ms tok/s:1911290 rem:575s step 0730 (4%) loss:4.4440 lr:0.21 dt:34ms tok/s:1923339 rem:575s step 0731 (4%) loss:4.4454 lr:0.21 dt:34ms tok/s:1944496 rem:575s step 0732 (4%) loss:4.4368 lr:0.21 dt:34ms tok/s:1904576 rem:575s step 0733 (4%) loss:4.4267 lr:0.21 dt:34ms tok/s:1916728 rem:575s step 0734 (4%) loss:4.4259 lr:0.21 dt:34ms tok/s:1917864 rem:575s step 0735 (4%) loss:4.4111 lr:0.21 dt:36ms tok/s:1838525 rem:575s step 0736 (4%) loss:4.3781 lr:0.21 dt:34ms tok/s:1916941 rem:575s step 0737 (4%) loss:4.3947 lr:0.21 dt:34ms tok/s:1920558 rem:575s step 0738 (4%) loss:4.4027 lr:0.21 dt:34ms tok/s:1913565 rem:575s step 0739 (4%) loss:4.4120 lr:0.21 dt:34ms tok/s:1916287 rem:575s step 0740 (4%) loss:4.4042 lr:0.21 dt:34ms tok/s:1921699 rem:575s step 0741 (4%) loss:4.4016 lr:0.21 dt:34ms tok/s:1919404 rem:575s step 0742 (4%) loss:4.4040 lr:0.21 dt:34ms tok/s:1923057 rem:575s step 0743 (4%) loss:4.4367 lr:0.21 dt:34ms tok/s:1918319 rem:575s step 0744 (4%) loss:4.4301 lr:0.21 dt:34ms tok/s:1925064 rem:575s step 0745 (4%) loss:4.4300 lr:0.21 dt:34ms tok/s:1912833 rem:575s step 0746 (4%) loss:4.4172 lr:0.21 dt:34ms tok/s:1911808 rem:575s step 0747 (4%) loss:4.4274 lr:0.21 dt:34ms tok/s:1919498 rem:575s step 0748 (4%) loss:4.4131 lr:0.21 dt:34ms tok/s:1920450 rem:574s step 0749 (4%) loss:4.4175 lr:0.21 dt:34ms tok/s:1925347 rem:574s step 0750 (4%) loss:4.4252 lr:0.21 dt:35ms tok/s:1883603 rem:574s step 0751 (4%) loss:4.4298 lr:0.21 dt:34ms tok/s:1919351 rem:574s step 0752 (4%) loss:4.4311 lr:0.21 dt:34ms tok/s:1927318 rem:574s step 0753 (4%) loss:4.4472 lr:0.21 dt:34ms tok/s:1920209 rem:574s step 0754 (4%) loss:4.4436 lr:0.21 dt:34ms tok/s:1920370 rem:574s step 0755 (4%) loss:4.4411 lr:0.21 dt:34ms tok/s:1920450 rem:574s step 0756 (4%) loss:4.4363 lr:0.21 dt:34ms tok/s:1918735 rem:574s step 0757 (4%) loss:4.4235 lr:0.21 dt:34ms tok/s:1921135 rem:574s step 0758 (4%) loss:4.4233 lr:0.22 dt:34ms tok/s:1915792 rem:574s step 0759 (4%) loss:4.4170 lr:0.22 dt:34ms tok/s:1911210 rem:574s step 0760 (4%) loss:4.4058 lr:0.22 dt:36ms tok/s:1795919 rem:574s step 0761 (4%) loss:4.4117 lr:0.22 dt:34ms tok/s:1914298 rem:574s step 0762 (4%) loss:4.4284 lr:0.22 dt:34ms tok/s:1922788 rem:574s step 0763 (4%) loss:4.4263 lr:0.22 dt:34ms tok/s:1916113 rem:574s step 0764 (4%) loss:4.4257 lr:0.22 dt:34ms tok/s:1924551 rem:574s step 0765 (4%) loss:4.4327 lr:0.22 dt:34ms tok/s:1914312 rem:574s step 0766 (4%) loss:4.4322 lr:0.22 dt:34ms tok/s:1920504 rem:574s step 0767 (4%) loss:4.4228 lr:0.22 dt:34ms tok/s:1918360 rem:574s step 0768 (4%) loss:4.4112 lr:0.22 dt:34ms tok/s:1919726 rem:574s step 0769 (4%) loss:4.4156 lr:0.22 dt:34ms tok/s:1917022 rem:574s step 0770 (4%) loss:4.4257 lr:0.22 dt:34ms tok/s:1921565 rem:574s step 0771 (4%) loss:4.4202 lr:0.22 dt:34ms tok/s:1910918 rem:574s step 0772 (4%) loss:4.4244 lr:0.22 dt:34ms tok/s:1919150 rem:574s step 0773 (4%) loss:4.4217 lr:0.22 dt:34ms tok/s:1908570 rem:574s step 0774 (4%) loss:4.4084 lr:0.22 dt:34ms tok/s:1916100 rem:574s step 0775 (4%) loss:4.4098 lr:0.22 dt:34ms tok/s:1920491 rem:574s step 0776 (4%) loss:4.4068 lr:0.22 dt:34ms tok/s:1919029 rem:574s step 0777 (4%) loss:4.4063 lr:0.22 dt:35ms tok/s:1879366 rem:573s step 0778 (4%) loss:4.4121 lr:0.22 dt:34ms tok/s:1919699 rem:573s step 0779 (4%) loss:4.4373 lr:0.22 dt:34ms tok/s:1916300 rem:573s step 0780 (4%) loss:4.4207 lr:0.22 dt:34ms tok/s:1911476 rem:573s step 0781 (4%) loss:4.3956 lr:0.22 dt:34ms tok/s:1919311 rem:573s step 0782 (4%) loss:4.3594 lr:0.22 dt:34ms tok/s:1918092 rem:573s step 0783 (4%) loss:4.3568 lr:0.22 dt:34ms tok/s:1913033 rem:573s step 0784 (4%) loss:4.3640 lr:0.22 dt:34ms tok/s:1909232 rem:573s step 0785 (4%) loss:4.3706 lr:0.22 dt:34ms tok/s:1903125 rem:573s step 0786 (4%) loss:4.3751 lr:0.22 dt:36ms tok/s:1812976 rem:573s step 0787 (4%) loss:4.3755 lr:0.22 dt:36ms tok/s:1844187 rem:573s step 0788 (4%) loss:4.3833 lr:0.22 dt:34ms tok/s:1913685 rem:573s step 0789 (4%) loss:4.3838 lr:0.22 dt:34ms tok/s:1915899 rem:573s step 0790 (4%) loss:4.3676 lr:0.22 dt:34ms tok/s:1906941 rem:573s step 0791 (4%) loss:4.3684 lr:0.22 dt:34ms tok/s:1915459 rem:573s step 0792 (4%) loss:4.3711 lr:0.22 dt:34ms tok/s:1923326 rem:573s step 0793 (5%) loss:4.3758 lr:0.23 dt:34ms tok/s:1920209 rem:573s step 0794 (5%) loss:4.3638 lr:0.23 dt:34ms tok/s:1924444 rem:573s step 0795 (5%) loss:4.3613 lr:0.23 dt:34ms tok/s:1914725 rem:573s step 0796 (5%) loss:4.3699 lr:0.23 dt:34ms tok/s:1924201 rem:573s step 0797 (5%) loss:4.3732 lr:0.23 dt:34ms tok/s:1925225 rem:573s step 0798 (5%) loss:4.3700 lr:0.23 dt:34ms tok/s:1911995 rem:573s step 0799 (5%) loss:4.3671 lr:0.23 dt:34ms tok/s:1903943 rem:573s step 0800 (5%) loss:4.3546 lr:0.23 dt:34ms tok/s:1927453 rem:573s + local: attn=[0.550, 0.620, 0.413] mlp=[0.949, 0.588, -0.152] + + transition: attn=[1.023, 0.813] mlp=[0.677, 0.396] + + hierarchy: attn=[0.950, 0.784, 0.658] mlp=[0.429, 0.290, 0.167] + step 0801 (5%) loss:4.3644 lr:0.23 dt:36ms tok/s:1834072 rem:573s step 0802 (5%) loss:4.3718 lr:0.23 dt:34ms tok/s:1916060 rem:573s step 0803 (5%) loss:4.3678 lr:0.23 dt:34ms tok/s:1932562 rem:573s step 0804 (5%) loss:4.3748 lr:0.23 dt:35ms tok/s:1872657 rem:573s step 0805 (5%) loss:4.3840 lr:0.23 dt:35ms tok/s:1893816 rem:573s step 0806 (5%) loss:4.3782 lr:0.23 dt:34ms tok/s:1930960 rem:572s step 0807 (5%) loss:4.3915 lr:0.23 dt:34ms tok/s:1928535 rem:572s step 0808 (5%) loss:4.3955 lr:0.23 dt:34ms tok/s:1934915 rem:572s step 0809 (5%) loss:4.3964 lr:0.23 dt:34ms tok/s:1926521 rem:572s step 0810 (5%) loss:4.4068 lr:0.23 dt:34ms tok/s:1917891 rem:572s step 0811 (5%) loss:4.3822 lr:0.23 dt:34ms tok/s:1920544 rem:572s step 0812 (5%) loss:4.3829 lr:0.23 dt:34ms tok/s:1909816 rem:572s step 0813 (5%) loss:4.3889 lr:0.23 dt:34ms tok/s:1916928 rem:572s step 0814 (5%) loss:4.3803 lr:0.23 dt:34ms tok/s:1911742 rem:572s step 0815 (5%) loss:4.3667 lr:0.23 dt:36ms tok/s:1817435 rem:572s step 0816 (5%) loss:4.3977 lr:0.23 dt:38ms tok/s:1724984 rem:572s step 0817 (5%) loss:4.4001 lr:0.23 dt:36ms tok/s:1798198 rem:572s step 0818 (5%) loss:4.3910 lr:0.23 dt:34ms tok/s:1940488 rem:572s step 0819 (5%) loss:4.3846 lr:0.23 dt:34ms tok/s:1940214 rem:572s step 0820 (5%) loss:4.3932 lr:0.23 dt:34ms tok/s:1941612 rem:572s step 0821 (5%) loss:4.4025 lr:0.23 dt:34ms tok/s:1940776 rem:572s step 0822 (5%) loss:4.3989 lr:0.23 dt:34ms tok/s:1941845 rem:572s step 0823 (5%) loss:4.3807 lr:0.23 dt:34ms tok/s:1933949 rem:572s step 0824 (5%) loss:4.3493 lr:0.23 dt:34ms tok/s:1905104 rem:572s step 0825 (5%) loss:4.3184 lr:0.23 dt:34ms tok/s:1914312 rem:572s step 0826 (5%) loss:4.3221 lr:0.23 dt:34ms tok/s:1906134 rem:572s step 0827 (5%) loss:4.3340 lr:0.23 dt:34ms tok/s:1911769 rem:572s step 0828 (5%) loss:4.3584 lr:0.24 dt:34ms tok/s:1909763 rem:572s step 0829 (5%) loss:4.3754 lr:0.24 dt:34ms tok/s:1915712 rem:572s step 0830 (5%) loss:4.4112 lr:0.24 dt:34ms tok/s:1909232 rem:572s step 0831 (5%) loss:4.4157 lr:0.24 dt:34ms tok/s:1908583 rem:572s step 0832 (5%) loss:4.4091 lr:0.24 dt:34ms tok/s:1916233 rem:572s step 0833 (5%) loss:4.4048 lr:0.24 dt:34ms tok/s:1925495 rem:572s step 0834 (5%) loss:4.4117 lr:0.24 dt:34ms tok/s:1921417 rem:572s step 0835 (5%) loss:4.4295 lr:0.24 dt:34ms tok/s:1901190 rem:571s step 0836 (5%) loss:4.4715 lr:0.24 dt:35ms tok/s:1852389 rem:571s step 0837 (5%) loss:4.4876 lr:0.24 dt:35ms tok/s:1888235 rem:571s step 0838 (5%) loss:4.4801 lr:0.24 dt:35ms tok/s:1882313 rem:571s step 0839 (5%) loss:4.4762 lr:0.24 dt:35ms tok/s:1884959 rem:571s step 0840 (5%) loss:4.4795 lr:0.24 dt:35ms tok/s:1889572 rem:571s step 0841 (5%) loss:4.5012 lr:0.24 dt:35ms tok/s:1887612 rem:571s step 0842 (5%) loss:4.5071 lr:0.24 dt:35ms tok/s:1886978 rem:571s step 0843 (5%) loss:4.4934 lr:0.24 dt:35ms tok/s:1888559 rem:571s step 0844 (5%) loss:4.4848 lr:0.24 dt:35ms tok/s:1889572 rem:571s step 0845 (5%) loss:4.4773 lr:0.24 dt:35ms tok/s:1884856 rem:571s step 0846 (5%) loss:4.4622 lr:0.24 dt:35ms tok/s:1877543 rem:571s step 0847 (5%) loss:4.4358 lr:0.24 dt:35ms tok/s:1891470 rem:571s step 0848 (5%) loss:4.4242 lr:0.24 dt:40ms tok/s:1655243 rem:571s step 0849 (5%) loss:4.4108 lr:0.24 dt:35ms tok/s:1891405 rem:571s step 0850 (5%) loss:4.3871 lr:0.24 dt:34ms tok/s:1939762 rem:571s step 0851 (5%) loss:4.3898 lr:0.24 dt:34ms tok/s:1911955 rem:571s step 0852 (5%) loss:4.3718 lr:0.24 dt:34ms tok/s:1914152 rem:571s step 0853 (5%) loss:4.3507 lr:0.24 dt:34ms tok/s:1918346 rem:571s step 0854 (5%) loss:4.3169 lr:0.24 dt:34ms tok/s:1921618 rem:571s step 0855 (5%) loss:4.3367 lr:0.24 dt:34ms tok/s:1906888 rem:571s step 0856 (5%) loss:4.3431 lr:0.24 dt:34ms tok/s:1910998 rem:571s step 0857 (5%) loss:4.3493 lr:0.24 dt:34ms tok/s:1916353 rem:571s step 0858 (5%) loss:4.3416 lr:0.24 dt:34ms tok/s:1916086 rem:571s step 0859 (5%) loss:4.3397 lr:0.24 dt:34ms tok/s:1916247 rem:571s step 0860 (5%) loss:4.3510 lr:0.24 dt:34ms tok/s:1907761 rem:571s step 0861 (5%) loss:4.3640 lr:0.24 dt:34ms tok/s:1916313 rem:571s step 0862 (5%) loss:4.3610 lr:0.25 dt:34ms tok/s:1920853 rem:571s step 0863 (5%) loss:4.3752 lr:0.25 dt:34ms tok/s:1921901 rem:571s step 0864 (5%) loss:4.3645 lr:0.25 dt:34ms tok/s:1917383 rem:570s step 0865 (5%) loss:4.3780 lr:0.25 dt:34ms tok/s:1913698 rem:570s step 0866 (5%) loss:4.3848 lr:0.25 dt:34ms tok/s:1910772 rem:570s step 0867 (5%) loss:4.3849 lr:0.25 dt:34ms tok/s:1906465 rem:570s step 0868 (5%) loss:4.3883 lr:0.25 dt:34ms tok/s:1907285 rem:570s step 0869 (5%) loss:4.3914 lr:0.25 dt:34ms tok/s:1920947 rem:570s step 0870 (5%) loss:4.3828 lr:0.25 dt:34ms tok/s:1918306 rem:570s step 0871 (5%) loss:4.3779 lr:0.25 dt:34ms tok/s:1912274 rem:570s step 0872 (5%) loss:4.3831 lr:0.25 dt:34ms tok/s:1915752 rem:570s step 0873 (5%) loss:4.3743 lr:0.25 dt:34ms tok/s:1918641 rem:570s step 0874 (5%) loss:4.3781 lr:0.25 dt:34ms tok/s:1918145 rem:570s step 0875 (5%) loss:4.3856 lr:0.25 dt:34ms tok/s:1909378 rem:570s step 0876 (5%) loss:4.3561 lr:0.25 dt:34ms tok/s:1903257 rem:570s step 0877 (5%) loss:4.3543 lr:0.25 dt:35ms tok/s:1895579 rem:570s step 0878 (5%) loss:4.3594 lr:0.25 dt:35ms tok/s:1878364 rem:570s step 0879 (5%) loss:4.3656 lr:0.25 dt:34ms tok/s:1900651 rem:570s step 0880 (5%) loss:4.3454 lr:0.25 dt:34ms tok/s:1902664 rem:570s step 0881 (5%) loss:4.3620 lr:0.25 dt:35ms tok/s:1892707 rem:570s step 0882 (5%) loss:4.3681 lr:0.25 dt:35ms tok/s:1898918 rem:570s step 0883 (5%) loss:4.3704 lr:0.25 dt:34ms tok/s:1901204 rem:570s step 0884 (5%) loss:4.3631 lr:0.25 dt:34ms tok/s:1901993 rem:570s step 0885 (5%) loss:4.3568 lr:0.25 dt:34ms tok/s:1905236 rem:570s step 0886 (5%) loss:4.3476 lr:0.25 dt:34ms tok/s:1905698 rem:570s step 0887 (5%) loss:4.3386 lr:0.25 dt:34ms tok/s:1909869 rem:570s step 0888 (5%) loss:4.3352 lr:0.25 dt:34ms tok/s:1910705 rem:570s step 0889 (5%) loss:4.3549 lr:0.25 dt:34ms tok/s:1900533 rem:570s step 0890 (5%) loss:4.3333 lr:0.25 dt:34ms tok/s:1904312 rem:570s step 0891 (5%) loss:4.3126 lr:0.25 dt:34ms tok/s:1910931 rem:570s step 0892 (5%) loss:4.3210 lr:0.25 dt:34ms tok/s:1902467 rem:570s step 0893 (5%) loss:4.3289 lr:0.25 dt:35ms tok/s:1854264 rem:569s step 0894 (5%) loss:4.3287 lr:0.25 dt:34ms tok/s:1909113 rem:569s step 0895 (5%) loss:4.3341 lr:0.25 dt:34ms tok/s:1902230 rem:569s step 0896 (5%) loss:4.3381 lr:0.25 dt:34ms tok/s:1903982 rem:569s step 0897 (5%) loss:4.3398 lr:0.26 dt:34ms tok/s:1902664 rem:569s step 0898 (5%) loss:4.3457 lr:0.26 dt:34ms tok/s:1904879 rem:569s step 0899 (5%) loss:4.3517 lr:0.26 dt:35ms tok/s:1898774 rem:569s step 0900 (5%) loss:4.3513 lr:0.26 dt:34ms tok/s:1902928 rem:569s + local: attn=[0.415, 0.601, 0.488] mlp=[0.856, 0.550, -0.389] + + transition: attn=[1.012, 0.819] mlp=[0.548, 0.252] + + hierarchy: attn=[1.077, 0.931, 0.810] mlp=[0.449, 0.252, 0.110] + step 0901 (5%) loss:4.3535 lr:0.26 dt:35ms tok/s:1893255 rem:569s step 0902 (5%) loss:4.3415 lr:0.26 dt:34ms tok/s:1909299 rem:569s step 0903 (5%) loss:4.3465 lr:0.26 dt:38ms tok/s:1720083 rem:569s step 0904 (5%) loss:4.3250 lr:0.26 dt:34ms tok/s:1918078 rem:569s step 0905 (5%) loss:4.3217 lr:0.26 dt:34ms tok/s:1931028 rem:569s step 0906 (5%) loss:4.3296 lr:0.26 dt:34ms tok/s:1921242 rem:569s step 0907 (5%) loss:4.3363 lr:0.26 dt:34ms tok/s:1918949 rem:569s step 0908 (5%) loss:4.3375 lr:0.26 dt:34ms tok/s:1912234 rem:569s step 0909 (5%) loss:4.3353 lr:0.26 dt:34ms tok/s:1909246 rem:569s step 0910 (5%) loss:4.3323 lr:0.26 dt:34ms tok/s:1908848 rem:569s step 0911 (5%) loss:4.3294 lr:0.26 dt:34ms tok/s:1908291 rem:569s step 0912 (5%) loss:4.3373 lr:0.26 dt:34ms tok/s:1918976 rem:569s step 0913 (5%) loss:4.3344 lr:0.26 dt:35ms tok/s:1892681 rem:569s step 0914 (5%) loss:4.3333 lr:0.26 dt:34ms tok/s:1910002 rem:569s step 0915 (5%) loss:4.3246 lr:0.26 dt:34ms tok/s:1908835 rem:569s step 0916 (5%) loss:4.3158 lr:0.26 dt:34ms tok/s:1903547 rem:569s step 0917 (5%) loss:4.3035 lr:0.26 dt:35ms tok/s:1898027 rem:569s step 0918 (5%) loss:4.3112 lr:0.26 dt:35ms tok/s:1874010 rem:569s step 0919 (5%) loss:4.3163 lr:0.26 dt:34ms tok/s:1902151 rem:569s step 0920 (5%) loss:4.3547 lr:0.26 dt:34ms tok/s:1916126 rem:569s step 0921 (5%) loss:4.3637 lr:0.26 dt:34ms tok/s:1915726 rem:569s step 0922 (5%) loss:4.3672 lr:0.26 dt:34ms tok/s:1911184 rem:568s step 0923 (5%) loss:4.3670 lr:0.26 dt:34ms tok/s:1911636 rem:568s step 0924 (5%) loss:4.3610 lr:0.26 dt:34ms tok/s:1907338 rem:568s step 0925 (5%) loss:4.3481 lr:0.26 dt:34ms tok/s:1912700 rem:568s step 0926 (5%) loss:4.3471 lr:0.26 dt:34ms tok/s:1902928 rem:568s step 0927 (5%) loss:4.3673 lr:0.26 dt:35ms tok/s:1895945 rem:568s step 0928 (5%) loss:4.3841 lr:0.26 dt:34ms tok/s:1907947 rem:568s step 0929 (5%) loss:4.3778 lr:0.26 dt:34ms tok/s:1906518 rem:568s step 0930 (5%) loss:4.3863 lr:0.26 dt:35ms tok/s:1891041 rem:568s step 0931 (5%) loss:4.3813 lr:0.26 dt:35ms tok/s:1895945 rem:568s step 0932 (5%) loss:4.3672 lr:0.27 dt:34ms tok/s:1909020 rem:568s step 0933 (5%) loss:4.3585 lr:0.27 dt:34ms tok/s:1902177 rem:568s step 0934 (5%) loss:4.3610 lr:0.27 dt:34ms tok/s:1906372 rem:568s step 0935 (5%) loss:4.3392 lr:0.27 dt:34ms tok/s:1904510 rem:568s step 0936 (5%) loss:4.3393 lr:0.27 dt:34ms tok/s:1908649 rem:568s step 0937 (5%) loss:4.3350 lr:0.27 dt:34ms tok/s:1901769 rem:568s step 0938 (5%) loss:4.3397 lr:0.27 dt:34ms tok/s:1906187 rem:568s step 0939 (5%) loss:4.3308 lr:0.27 dt:34ms tok/s:1920893 rem:568s step 0940 (5%) loss:4.3323 lr:0.27 dt:34ms tok/s:1908888 rem:568s step 0941 (5%) loss:4.3362 lr:0.27 dt:35ms tok/s:1894560 rem:568s step 0942 (5%) loss:4.3285 lr:0.27 dt:34ms tok/s:1910559 rem:568s step 0943 (5%) loss:4.3262 lr:0.27 dt:34ms tok/s:1906187 rem:568s step 0944 (5%) loss:4.3330 lr:0.27 dt:34ms tok/s:1912474 rem:568s step 0945 (5%) loss:4.3272 lr:0.27 dt:35ms tok/s:1894038 rem:568s step 0946 (5%) loss:4.3262 lr:0.27 dt:34ms tok/s:1907841 rem:568s step 0947 (5%) loss:4.3236 lr:0.27 dt:34ms tok/s:1909060 rem:568s step 0948 (5%) loss:4.3178 lr:0.27 dt:34ms tok/s:1918333 rem:568s step 0949 (5%) loss:4.3223 lr:0.27 dt:34ms tok/s:1915405 rem:568s step 0950 (5%) loss:4.3311 lr:0.27 dt:35ms tok/s:1875033 rem:568s step 0951 (5%) loss:4.3271 lr:0.27 dt:34ms tok/s:1902888 rem:567s step 0952 (5%) loss:4.3408 lr:0.27 dt:34ms tok/s:1906504 rem:567s step 0953 (5%) loss:4.3409 lr:0.27 dt:34ms tok/s:1911237 rem:567s step 0954 (5%) loss:4.3313 lr:0.27 dt:34ms tok/s:1907219 rem:567s step 0955 (5%) loss:4.3419 lr:0.27 dt:34ms tok/s:1902032 rem:567s step 0956 (5%) loss:4.3470 lr:0.27 dt:35ms tok/s:1878749 rem:567s step 0957 (5%) loss:4.3480 lr:0.27 dt:34ms tok/s:1905645 rem:567s step 0958 (5%) loss:4.3431 lr:0.27 dt:34ms tok/s:1909909 rem:567s step 0959 (5%) loss:4.3484 lr:0.27 dt:34ms tok/s:1915285 rem:567s step 0960 (5%) loss:4.3418 lr:0.27 dt:35ms tok/s:1894012 rem:567s step 0961 (5%) loss:4.3316 lr:0.27 dt:34ms tok/s:1905738 rem:567s step 0962 (5%) loss:4.3268 lr:0.27 dt:35ms tok/s:1894782 rem:567s step 0963 (5%) loss:4.3224 lr:0.27 dt:35ms tok/s:1897529 rem:567s step 0964 (5%) loss:4.3241 lr:0.27 dt:34ms tok/s:1918855 rem:567s step 0965 (5%) loss:4.3222 lr:0.27 dt:36ms tok/s:1835468 rem:567s step 0966 (5%) loss:4.3248 lr:0.27 dt:34ms tok/s:1906769 rem:567s step 0967 (6%) loss:4.3479 lr:0.28 dt:34ms tok/s:1899837 rem:567s step 0968 (6%) loss:4.3415 lr:0.28 dt:35ms tok/s:1897555 rem:567s step 0969 (6%) loss:4.3287 lr:0.28 dt:35ms tok/s:1895592 rem:567s step 0970 (6%) loss:4.3319 lr:0.28 dt:34ms tok/s:1911503 rem:567s step 0971 (6%) loss:4.3324 lr:0.28 dt:42ms tok/s:1565703 rem:567s step 0972 (6%) loss:4.3259 lr:0.28 dt:33ms tok/s:1959606 rem:567s step 0973 (6%) loss:4.3279 lr:0.28 dt:33ms tok/s:1969844 rem:567s step 0974 (6%) loss:4.3256 lr:0.28 dt:33ms tok/s:1958420 rem:567s step 0975 (6%) loss:4.3199 lr:0.28 dt:34ms tok/s:1926777 rem:567s step 0976 (6%) loss:4.3244 lr:0.28 dt:34ms tok/s:1945363 rem:567s step 0977 (6%) loss:4.3754 lr:0.28 dt:34ms tok/s:1943218 rem:567s step 0978 (6%) loss:4.3756 lr:0.28 dt:34ms tok/s:1952189 rem:567s step 0979 (6%) loss:4.3680 lr:0.28 dt:34ms tok/s:1934711 rem:567s step 0980 (6%) loss:4.3672 lr:0.28 dt:34ms tok/s:1939516 rem:566s step 0981 (6%) loss:4.3670 lr:0.28 dt:34ms tok/s:1947445 rem:566s step 0982 (6%) loss:4.3403 lr:0.28 dt:34ms tok/s:1908768 rem:566s step 0983 (6%) loss:4.3361 lr:0.28 dt:34ms tok/s:1920987 rem:566s step 0984 (6%) loss:4.3476 lr:0.28 dt:34ms tok/s:1929591 rem:566s step 0985 (6%) loss:4.3460 lr:0.28 dt:34ms tok/s:1922626 rem:566s step 0986 (6%) loss:4.3411 lr:0.28 dt:34ms tok/s:1903389 rem:566s step 0987 (6%) loss:4.3383 lr:0.28 dt:34ms tok/s:1919713 rem:566s step 0988 (6%) loss:4.3346 lr:0.28 dt:34ms tok/s:1916982 rem:566s step 0989 (6%) loss:4.3296 lr:0.28 dt:34ms tok/s:1912620 rem:566s step 0990 (6%) loss:4.3235 lr:0.28 dt:34ms tok/s:1906981 rem:566s step 0991 (6%) loss:4.3298 lr:0.28 dt:34ms tok/s:1921847 rem:566s step 0992 (6%) loss:4.3366 lr:0.28 dt:34ms tok/s:1918949 rem:566s step 0993 (6%) loss:4.3308 lr:0.28 dt:34ms tok/s:1901072 rem:566s step 0994 (6%) loss:4.3302 lr:0.28 dt:34ms tok/s:1907814 rem:566s step 0995 (6%) loss:4.3206 lr:0.28 dt:34ms tok/s:1908212 rem:566s step 0996 (6%) loss:4.3218 lr:0.28 dt:45ms tok/s:1444597 rem:566s step 0997 (6%) loss:4.3044 lr:0.28 dt:33ms tok/s:1987476 rem:566s step 0998 (6%) loss:4.2947 lr:0.28 dt:38ms tok/s:1733895 rem:566s step 0999 (6%) loss:4.2902 lr:0.28 dt:35ms tok/s:1885411 rem:566s step 1000 (6%) loss:4.2894 lr:0.28 dt:34ms tok/s:1955842 rem:566s + local: attn=[0.347, 0.505, 0.494] mlp=[0.686, 0.442, -0.373] + + transition: attn=[1.158, 0.886] mlp=[0.408, 0.118] + + hierarchy: attn=[1.244, 1.201, 1.106] mlp=[0.537, 0.206, 0.041] + step 1001 (6%) loss:4.2859 lr:0.29 dt:33ms tok/s:1981173 rem:566s step 1002 (6%) loss:4.2814 lr:0.29 dt:34ms tok/s:1900454 rem:566s step 1003 (6%) loss:4.2923 lr:0.29 dt:33ms tok/s:1966180 rem:566s step 1004 (6%) loss:4.2853 lr:0.29 dt:34ms tok/s:1948784 rem:566s step 1005 (6%) loss:4.2802 lr:0.29 dt:34ms tok/s:1941091 rem:566s step 1006 (6%) loss:4.2881 lr:0.29 dt:34ms tok/s:1946107 rem:566s step 1007 (6%) loss:4.2874 lr:0.29 dt:34ms tok/s:1927156 rem:566s step 1008 (6%) loss:4.2805 lr:0.29 dt:34ms tok/s:1936619 rem:566s step 1009 (6%) loss:4.2772 lr:0.29 dt:34ms tok/s:1936360 rem:565s step 1010 (6%) loss:4.2910 lr:0.29 dt:34ms tok/s:1931625 rem:565s step 1011 (6%) loss:4.2994 lr:0.29 dt:34ms tok/s:1929821 rem:565s step 1012 (6%) loss:4.2987 lr:0.29 dt:34ms tok/s:1934724 rem:565s step 1013 (6%) loss:4.2987 lr:0.29 dt:34ms tok/s:1908994 rem:565s step 1014 (6%) loss:4.3022 lr:0.29 dt:34ms tok/s:1935733 rem:565s step 1015 (6%) loss:4.3074 lr:0.29 dt:34ms tok/s:1935351 rem:565s step 1016 (6%) loss:4.2745 lr:0.29 dt:35ms tok/s:1860010 rem:565s step 1017 (6%) loss:4.2605 lr:0.29 dt:35ms tok/s:1885230 rem:565s step 1018 (6%) loss:4.2592 lr:0.29 dt:35ms tok/s:1884843 rem:565s step 1019 (6%) loss:4.2686 lr:0.29 dt:34ms tok/s:1903916 rem:565s step 1020 (6%) loss:4.2763 lr:0.29 dt:34ms tok/s:1921457 rem:565s step 1021 (6%) loss:4.2761 lr:0.29 dt:34ms tok/s:1907536 rem:565s step 1022 (6%) loss:4.2720 lr:0.29 dt:34ms tok/s:1932154 rem:565s step 1023 (6%) loss:4.2698 lr:0.29 dt:34ms tok/s:1920880 rem:565s step 1024 (6%) loss:4.2646 lr:0.29 dt:35ms tok/s:1896194 rem:565s step 1025 (6%) loss:4.2739 lr:0.29 dt:34ms tok/s:1941077 rem:565s step 1026 (6%) loss:4.3097 lr:0.29 dt:34ms tok/s:1900612 rem:565s step 1027 (6%) loss:4.3134 lr:0.29 dt:34ms tok/s:1919807 rem:565s step 1028 (6%) loss:4.3079 lr:0.29 dt:34ms tok/s:1920128 rem:565s step 1029 (6%) loss:4.3019 lr:0.29 dt:34ms tok/s:1908238 rem:565s step 1030 (6%) loss:4.3148 lr:0.29 dt:34ms tok/s:1927304 rem:565s step 1031 (6%) loss:4.3072 lr:0.29 dt:35ms tok/s:1895135 rem:565s step 1032 (6%) loss:4.3091 lr:0.29 dt:34ms tok/s:1914072 rem:565s step 1033 (6%) loss:4.3222 lr:0.29 dt:35ms tok/s:1898472 rem:565s step 1034 (6%) loss:4.3216 lr:0.29 dt:35ms tok/s:1899391 rem:565s step 1035 (6%) loss:4.3261 lr:0.29 dt:34ms tok/s:1925037 rem:565s step 1036 (6%) loss:4.3227 lr:0.29 dt:34ms tok/s:1940433 rem:565s step 1037 (6%) loss:4.3162 lr:0.30 dt:34ms tok/s:1931923 rem:565s step 1038 (6%) loss:4.3162 lr:0.30 dt:34ms tok/s:1923716 rem:565s step 1039 (6%) loss:4.3119 lr:0.30 dt:34ms tok/s:1919002 rem:564s step 1040 (6%) loss:4.3222 lr:0.30 dt:35ms tok/s:1896259 rem:564s step 1041 (6%) loss:4.3182 lr:0.30 dt:34ms tok/s:1913765 rem:564s step 1042 (6%) loss:4.3186 lr:0.30 dt:34ms tok/s:1934302 rem:564s step 1043 (6%) loss:4.3138 lr:0.30 dt:34ms tok/s:1940118 rem:564s step 1044 (6%) loss:4.3049 lr:0.30 dt:34ms tok/s:1935501 rem:564s step 1045 (6%) loss:4.3007 lr:0.30 dt:34ms tok/s:1943204 rem:564s step 1046 (6%) loss:4.2826 lr:0.30 dt:34ms tok/s:1924026 rem:564s step 1047 (6%) loss:4.2794 lr:0.30 dt:34ms tok/s:1944043 rem:564s step 1048 (6%) loss:4.2737 lr:0.30 dt:34ms tok/s:1940885 rem:564s step 1049 (6%) loss:4.2795 lr:0.30 dt:34ms tok/s:1926615 rem:564s step 1050 (6%) loss:4.2663 lr:0.30 dt:34ms tok/s:1928697 rem:564s step 1051 (6%) loss:4.2647 lr:0.30 dt:34ms tok/s:1918654 rem:564s step 1052 (6%) loss:4.2657 lr:0.30 dt:34ms tok/s:1927602 rem:564s step 1053 (6%) loss:4.2729 lr:0.30 dt:34ms tok/s:1924592 rem:564s step 1054 (6%) loss:4.2808 lr:0.30 dt:34ms tok/s:1928237 rem:564s step 1055 (6%) loss:4.2847 lr:0.30 dt:34ms tok/s:1920585 rem:564s step 1056 (6%) loss:4.2922 lr:0.30 dt:34ms tok/s:1930051 rem:564s step 1057 (6%) loss:4.2892 lr:0.30 dt:34ms tok/s:1909803 rem:564s step 1058 (6%) loss:4.2922 lr:0.30 dt:35ms tok/s:1895069 rem:564s step 1059 (6%) loss:4.2984 lr:0.30 dt:34ms tok/s:1923205 rem:564s step 1060 (6%) loss:4.2924 lr:0.30 dt:34ms tok/s:1917503 rem:564s step 1061 (6%) loss:4.2776 lr:0.30 dt:34ms tok/s:1928575 rem:564s step 1062 (6%) loss:4.2726 lr:0.30 dt:38ms tok/s:1713041 rem:564s step 1063 (6%) loss:4.2661 lr:0.30 dt:34ms tok/s:1943685 rem:564s step 1064 (6%) loss:4.2646 lr:0.30 dt:34ms tok/s:1946521 rem:564s step 1065 (6%) loss:4.2677 lr:0.30 dt:34ms tok/s:1946893 rem:564s step 1066 (6%) loss:4.2862 lr:0.30 dt:34ms tok/s:1941859 rem:564s step 1067 (6%) loss:4.3153 lr:0.30 dt:34ms tok/s:1930417 rem:564s step 1068 (6%) loss:4.3069 lr:0.30 dt:34ms tok/s:1943465 rem:563s step 1069 (6%) loss:4.3658 lr:0.30 dt:34ms tok/s:1941900 rem:563s step 1070 (6%) loss:4.3705 lr:0.30 dt:34ms tok/s:1930349 rem:563s step 1071 (6%) loss:4.3752 lr:0.30 dt:34ms tok/s:1939311 rem:563s step 1072 (6%) loss:4.3647 lr:0.31 dt:34ms tok/s:1913525 rem:563s step 1073 (6%) loss:4.3531 lr:0.31 dt:34ms tok/s:1918293 rem:563s step 1074 (6%) loss:4.3361 lr:0.31 dt:34ms tok/s:1911888 rem:563s step 1075 (6%) loss:4.3279 lr:0.31 dt:34ms tok/s:1910214 rem:563s step 1076 (6%) loss:4.3178 lr:0.31 dt:34ms tok/s:1910559 rem:563s step 1077 (6%) loss:4.3143 lr:0.31 dt:34ms tok/s:1916901 rem:563s step 1078 (6%) loss:4.3244 lr:0.31 dt:34ms tok/s:1913659 rem:563s step 1079 (6%) loss:4.3396 lr:0.31 dt:34ms tok/s:1909524 rem:563s step 1080 (6%) loss:4.3504 lr:0.31 dt:34ms tok/s:1914432 rem:563s step 1081 (6%) loss:4.3400 lr:0.31 dt:34ms tok/s:1909431 rem:563s step 1082 (6%) loss:4.3272 lr:0.31 dt:34ms tok/s:1905262 rem:563s step 1083 (6%) loss:4.3403 lr:0.31 dt:34ms tok/s:1918012 rem:563s step 1084 (6%) loss:4.3466 lr:0.31 dt:34ms tok/s:1918614 rem:563s step 1085 (6%) loss:4.3724 lr:0.31 dt:34ms tok/s:1920316 rem:563s step 1086 (6%) loss:4.3716 lr:0.31 dt:34ms tok/s:1910466 rem:563s step 1087 (6%) loss:4.3680 lr:0.31 dt:34ms tok/s:1903639 rem:563s step 1088 (6%) loss:4.3815 lr:0.31 dt:34ms tok/s:1909272 rem:563s step 1089 (6%) loss:4.3716 lr:0.31 dt:34ms tok/s:1923326 rem:563s step 1090 (6%) loss:4.3739 lr:0.31 dt:34ms tok/s:1923568 rem:563s step 1091 (6%) loss:4.3814 lr:0.31 dt:36ms tok/s:1833962 rem:563s step 1092 (6%) loss:4.3682 lr:0.31 dt:35ms tok/s:1894090 rem:563s step 1093 (6%) loss:4.3798 lr:0.31 dt:34ms tok/s:1912487 rem:563s step 1094 (6%) loss:4.3737 lr:0.31 dt:34ms tok/s:1911396 rem:563s step 1095 (6%) loss:4.3561 lr:0.31 dt:34ms tok/s:1918909 rem:563s step 1096 (6%) loss:4.3594 lr:0.31 dt:34ms tok/s:1912194 rem:563s step 1097 (6%) loss:4.3483 lr:0.31 dt:34ms tok/s:1925279 rem:562s step 1098 (6%) loss:4.3290 lr:0.31 dt:34ms tok/s:1913192 rem:562s step 1099 (6%) loss:4.3194 lr:0.31 dt:34ms tok/s:1911503 rem:562s step 1100 (6%) loss:4.3110 lr:0.31 dt:34ms tok/s:1914352 rem:562s + local: attn=[0.254, 0.475, 0.541] mlp=[0.590, 0.380, -0.287] + + transition: attn=[1.200, 0.937] mlp=[0.356, 0.150] + + hierarchy: attn=[1.144, 1.326, 1.327] mlp=[0.641, 0.109, -0.076] + step 1101 (6%) loss:4.3146 lr:0.31 dt:34ms tok/s:1913392 rem:562s step 1102 (6%) loss:4.3141 lr:0.31 dt:34ms tok/s:1912447 rem:562s step 1103 (6%) loss:4.3211 lr:0.31 dt:34ms tok/s:1909100 rem:562s step 1104 (6%) loss:4.3207 lr:0.31 dt:34ms tok/s:1904483 rem:562s step 1105 (6%) loss:4.3301 lr:0.31 dt:34ms tok/s:1912633 rem:562s step 1106 (6%) loss:4.3234 lr:0.31 dt:34ms tok/s:1905381 rem:562s step 1107 (6%) loss:4.3159 lr:0.32 dt:34ms tok/s:1911689 rem:562s step 1108 (6%) loss:4.3129 lr:0.32 dt:34ms tok/s:1905157 rem:562s step 1109 (6%) loss:4.3109 lr:0.32 dt:34ms tok/s:1909896 rem:562s step 1110 (6%) loss:4.3005 lr:0.32 dt:36ms tok/s:1842728 rem:562s step 1111 (6%) loss:4.2750 lr:0.32 dt:35ms tok/s:1873129 rem:562s step 1112 (6%) loss:4.2534 lr:0.32 dt:36ms tok/s:1837247 rem:562s step 1113 (6%) loss:4.2364 lr:0.32 dt:34ms tok/s:1907497 rem:562s step 1114 (6%) loss:4.2094 lr:0.32 dt:34ms tok/s:1912434 rem:562s step 1115 (6%) loss:4.1834 lr:0.32 dt:35ms tok/s:1896285 rem:562s step 1116 (6%) loss:4.1559 lr:0.32 dt:34ms tok/s:1904127 rem:562s step 1117 (6%) loss:4.1967 lr:0.32 dt:34ms tok/s:1906650 rem:562s step 1118 (6%) loss:4.2286 lr:0.32 dt:34ms tok/s:1903587 rem:562s step 1119 (6%) loss:4.2590 lr:0.32 dt:34ms tok/s:1905540 rem:562s step 1120 (6%) loss:4.2725 lr:0.32 dt:34ms tok/s:1911038 rem:562s step 1121 (6%) loss:4.2751 lr:0.32 dt:34ms tok/s:1901374 rem:562s step 1122 (6%) loss:4.2986 lr:0.32 dt:35ms tok/s:1894521 rem:562s step 1123 (6%) loss:4.3072 lr:0.32 dt:34ms tok/s:1908291 rem:562s step 1124 (6%) loss:4.3149 lr:0.32 dt:34ms tok/s:1908411 rem:562s step 1125 (6%) loss:4.3214 lr:0.32 dt:34ms tok/s:1900402 rem:562s step 1126 (6%) loss:4.3286 lr:0.32 dt:35ms tok/s:1898000 rem:561s step 1127 (6%) loss:4.3368 lr:0.32 dt:34ms tok/s:1914445 rem:561s step 1128 (6%) loss:4.3295 lr:0.32 dt:35ms tok/s:1898577 rem:561s step 1129 (6%) loss:4.3363 lr:0.32 dt:34ms tok/s:1919874 rem:561s step 1130 (6%) loss:4.3406 lr:0.32 dt:34ms tok/s:1914045 rem:561s step 1131 (6%) loss:4.3457 lr:0.32 dt:34ms tok/s:1926777 rem:561s step 1132 (6%) loss:4.3445 lr:0.32 dt:34ms tok/s:1917316 rem:561s step 1133 (6%) loss:4.3521 lr:0.32 dt:34ms tok/s:1902862 rem:561s step 1134 (6%) loss:4.3523 lr:0.32 dt:33ms tok/s:1988138 rem:561s step 1135 (6%) loss:4.3291 lr:0.32 dt:34ms tok/s:1943658 rem:561s step 1136 (6%) loss:4.3235 lr:0.32 dt:34ms tok/s:1935692 rem:561s step 1137 (6%) loss:4.3335 lr:0.32 dt:34ms tok/s:1937288 rem:561s step 1138 (6%) loss:4.3396 lr:0.32 dt:34ms tok/s:1910360 rem:561s step 1139 (6%) loss:4.3444 lr:0.32 dt:34ms tok/s:1928616 rem:561s step 1140 (6%) loss:4.3458 lr:0.32 dt:34ms tok/s:1932100 rem:561s step 1141 (6%) loss:4.3390 lr:0.32 dt:34ms tok/s:1931720 rem:561s step 1142 (7%) loss:4.3458 lr:0.33 dt:34ms tok/s:1924740 rem:561s step 1143 (7%) loss:4.3445 lr:0.33 dt:34ms tok/s:1905474 rem:561s step 1144 (7%) loss:4.3444 lr:0.33 dt:34ms tok/s:1917476 rem:561s step 1145 (7%) loss:4.3380 lr:0.33 dt:34ms tok/s:1924430 rem:561s step 1146 (7%) loss:4.3350 lr:0.33 dt:34ms tok/s:1917597 rem:561s step 1147 (7%) loss:4.3459 lr:0.33 dt:34ms tok/s:1912926 rem:561s step 1148 (7%) loss:4.3574 lr:0.33 dt:34ms tok/s:1921390 rem:561s step 1149 (7%) loss:4.3541 lr:0.33 dt:34ms tok/s:1920786 rem:561s step 1150 (7%) loss:4.3478 lr:0.33 dt:34ms tok/s:1921390 rem:561s step 1151 (7%) loss:4.3290 lr:0.33 dt:34ms tok/s:1904998 rem:561s step 1152 (7%) loss:4.3030 lr:0.33 dt:34ms tok/s:1915993 rem:561s step 1153 (7%) loss:4.2966 lr:0.33 dt:34ms tok/s:1917543 rem:561s step 1154 (7%) loss:4.2936 lr:0.33 dt:34ms tok/s:1902756 rem:561s step 1155 (7%) loss:4.3006 lr:0.33 dt:34ms tok/s:1908755 rem:560s step 1156 (7%) loss:4.2992 lr:0.33 dt:34ms tok/s:1922196 rem:560s step 1157 (7%) loss:4.3046 lr:0.33 dt:34ms tok/s:1907219 rem:560s step 1158 (7%) loss:4.3088 lr:0.33 dt:37ms tok/s:1767874 rem:560s step 1159 (7%) loss:4.3123 lr:0.33 dt:36ms tok/s:1798421 rem:560s step 1160 (7%) loss:4.3142 lr:0.33 dt:34ms tok/s:1941160 rem:560s step 1161 (7%) loss:4.3161 lr:0.33 dt:34ms tok/s:1942682 rem:560s step 1162 (7%) loss:4.3138 lr:0.33 dt:34ms tok/s:1938586 rem:560s step 1163 (7%) loss:4.3192 lr:0.33 dt:34ms tok/s:1925387 rem:560s step 1164 (7%) loss:4.3150 lr:0.33 dt:34ms tok/s:1930716 rem:560s step 1165 (7%) loss:4.3134 lr:0.33 dt:34ms tok/s:1902151 rem:560s step 1166 (7%) loss:4.3141 lr:0.33 dt:34ms tok/s:1919311 rem:560s step 1167 (7%) loss:4.3203 lr:0.33 dt:34ms tok/s:1926062 rem:560s step 1168 (7%) loss:4.3108 lr:0.33 dt:34ms tok/s:1925954 rem:560s step 1169 (7%) loss:4.3201 lr:0.33 dt:34ms tok/s:1909644 rem:560s step 1170 (7%) loss:4.3208 lr:0.33 dt:34ms tok/s:1913565 rem:560s step 1171 (7%) loss:4.3174 lr:0.33 dt:34ms tok/s:1912766 rem:560s step 1172 (7%) loss:4.3100 lr:0.33 dt:34ms tok/s:1927994 rem:560s step 1173 (7%) loss:4.2857 lr:0.33 dt:34ms tok/s:1922936 rem:560s step 1174 (7%) loss:4.2693 lr:0.33 dt:34ms tok/s:1919177 rem:560s step 1175 (7%) loss:4.2674 lr:0.33 dt:34ms tok/s:1914338 rem:560s step 1176 (7%) loss:4.2522 lr:0.33 dt:34ms tok/s:1914192 rem:560s step 1177 (7%) loss:4.2563 lr:0.34 dt:34ms tok/s:1915005 rem:560s step 1178 (7%) loss:4.2584 lr:0.34 dt:34ms tok/s:1920880 rem:560s step 1179 (7%) loss:4.2623 lr:0.34 dt:34ms tok/s:1913445 rem:560s step 1180 (7%) loss:4.2696 lr:0.34 dt:34ms tok/s:1903863 rem:560s step 1181 (7%) loss:4.2682 lr:0.34 dt:34ms tok/s:1913672 rem:560s step 1182 (7%) loss:4.2792 lr:0.34 dt:34ms tok/s:1920075 rem:560s step 1183 (7%) loss:4.2711 lr:0.34 dt:34ms tok/s:1903363 rem:560s step 1184 (7%) loss:4.2709 lr:0.34 dt:34ms tok/s:1906729 rem:559s step 1185 (7%) loss:4.2661 lr:0.34 dt:34ms tok/s:1918761 rem:559s step 1186 (7%) loss:4.2657 lr:0.34 dt:34ms tok/s:1908901 rem:559s step 1187 (7%) loss:4.2776 lr:0.34 dt:34ms tok/s:1907272 rem:559s step 1188 (7%) loss:4.2813 lr:0.34 dt:35ms tok/s:1897149 rem:559s step 1189 (7%) loss:4.2851 lr:0.34 dt:34ms tok/s:1919673 rem:559s step 1190 (7%) loss:4.2947 lr:0.34 dt:34ms tok/s:1907483 rem:559s step 1191 (7%) loss:4.2919 lr:0.34 dt:35ms tok/s:1892668 rem:559s step 1192 (7%) loss:4.2912 lr:0.34 dt:35ms tok/s:1897686 rem:559s step 1193 (7%) loss:4.2759 lr:0.34 dt:34ms tok/s:1906346 rem:559s step 1194 (7%) loss:4.2747 lr:0.34 dt:34ms tok/s:1908291 rem:559s step 1195 (7%) loss:4.2706 lr:0.34 dt:34ms tok/s:1904259 rem:559s step 1196 (7%) loss:4.2985 lr:0.34 dt:34ms tok/s:1907748 rem:559s step 1197 (7%) loss:4.2965 lr:0.34 dt:34ms tok/s:1912673 rem:559s step 1198 (7%) loss:4.2921 lr:0.34 dt:35ms tok/s:1892981 rem:559s step 1199 (7%) loss:4.2879 lr:0.34 dt:35ms tok/s:1899181 rem:559s step 1200 (7%) loss:4.2875 lr:0.34 dt:34ms tok/s:1904972 rem:559s + local: attn=[0.184, 0.406, 0.470] mlp=[0.501, 0.260, -0.229] + + transition: attn=[1.269, 0.892] mlp=[0.235, 0.170] + + hierarchy: attn=[1.150, 1.582, 1.657] mlp=[0.783, -0.039, -0.228] + step 1201 (7%) loss:4.2980 lr:0.34 dt:34ms tok/s:1915085 rem:559s step 1202 (7%) loss:4.2955 lr:0.34 dt:34ms tok/s:1904061 rem:559s step 1203 (7%) loss:4.2917 lr:0.34 dt:34ms tok/s:1903798 rem:559s step 1204 (7%) loss:4.2688 lr:0.34 dt:34ms tok/s:1911383 rem:559s step 1205 (7%) loss:4.2617 lr:0.34 dt:36ms tok/s:1802939 rem:559s step 1206 (7%) loss:4.2805 lr:0.34 dt:34ms tok/s:1916020 rem:559s step 1207 (7%) loss:4.2877 lr:0.34 dt:34ms tok/s:1936687 rem:559s step 1208 (7%) loss:4.2915 lr:0.34 dt:34ms tok/s:1912021 rem:559s step 1209 (7%) loss:4.2581 lr:0.34 dt:34ms tok/s:1911184 rem:559s step 1210 (7%) loss:4.2750 lr:0.34 dt:34ms tok/s:1904483 rem:559s step 1211 (7%) loss:4.2882 lr:0.34 dt:34ms tok/s:1902954 rem:559s step 1212 (7%) loss:4.2928 lr:0.35 dt:34ms tok/s:1909272 rem:559s step 1213 (7%) loss:4.2953 lr:0.35 dt:34ms tok/s:1920182 rem:559s step 1214 (7%) loss:4.3042 lr:0.35 dt:34ms tok/s:1917704 rem:558s step 1215 (7%) loss:4.3014 lr:0.35 dt:34ms tok/s:1912620 rem:558s step 1216 (7%) loss:4.2916 lr:0.35 dt:34ms tok/s:1905223 rem:558s step 1217 (7%) loss:4.2775 lr:0.35 dt:34ms tok/s:1910267 rem:558s step 1218 (7%) loss:4.2627 lr:0.35 dt:34ms tok/s:1903732 rem:558s step 1219 (7%) loss:4.2695 lr:0.35 dt:35ms tok/s:1898971 rem:558s step 1220 (7%) loss:4.2703 lr:0.35 dt:34ms tok/s:1904272 rem:558s step 1221 (7%) loss:4.2685 lr:0.35 dt:34ms tok/s:1901901 rem:558s step 1222 (7%) loss:4.2701 lr:0.35 dt:34ms tok/s:1900218 rem:558s step 1223 (7%) loss:4.2790 lr:0.35 dt:34ms tok/s:1904233 rem:558s step 1224 (7%) loss:4.2817 lr:0.35 dt:34ms tok/s:1910347 rem:558s step 1225 (7%) loss:4.2787 lr:0.35 dt:34ms tok/s:1907695 rem:558s step 1226 (7%) loss:4.2808 lr:0.35 dt:34ms tok/s:1915966 rem:558s step 1227 (7%) loss:4.2779 lr:0.35 dt:34ms tok/s:1909086 rem:558s step 1228 (7%) loss:4.2864 lr:0.35 dt:34ms tok/s:1918788 rem:558s step 1229 (7%) loss:4.2942 lr:0.35 dt:34ms tok/s:1902506 rem:558s step 1230 (7%) loss:4.2916 lr:0.35 dt:35ms tok/s:1856806 rem:558s step 1231 (7%) loss:4.2885 lr:0.35 dt:34ms tok/s:1908265 rem:558s step 1232 (7%) loss:4.2883 lr:0.35 dt:37ms tok/s:1785838 rem:558s step 1233 (7%) loss:4.2800 lr:0.35 dt:35ms tok/s:1850818 rem:558s step 1234 (7%) loss:4.2783 lr:0.35 dt:34ms tok/s:1930770 rem:558s step 1235 (7%) loss:4.2796 lr:0.35 dt:34ms tok/s:1905408 rem:558s step 1236 (7%) loss:4.2823 lr:0.35 dt:34ms tok/s:1905091 rem:558s step 1237 (7%) loss:4.2821 lr:0.35 dt:34ms tok/s:1899627 rem:558s step 1238 (7%) loss:4.2479 lr:0.35 dt:34ms tok/s:1906240 rem:558s step 1239 (7%) loss:4.2328 lr:0.35 dt:34ms tok/s:1906081 rem:558s step 1240 (7%) loss:4.2554 lr:0.35 dt:34ms tok/s:1906372 rem:558s step 1241 (7%) loss:4.2491 lr:0.35 dt:34ms tok/s:1915672 rem:558s step 1242 (7%) loss:4.2526 lr:0.35 dt:34ms tok/s:1905051 rem:558s step 1243 (7%) loss:4.2470 lr:0.35 dt:35ms tok/s:1888533 rem:557s step 1244 (7%) loss:4.2369 lr:0.35 dt:35ms tok/s:1898512 rem:557s step 1245 (7%) loss:4.2135 lr:0.35 dt:34ms tok/s:1912221 rem:557s step 1246 (7%) loss:4.1980 lr:0.36 dt:35ms tok/s:1895501 rem:557s step 1247 (7%) loss:4.1884 lr:0.36 dt:34ms tok/s:1911383 rem:557s step 1248 (7%) loss:4.2013 lr:0.36 dt:34ms tok/s:1914512 rem:557s step 1249 (7%) loss:4.2202 lr:0.36 dt:34ms tok/s:1909989 rem:557s step 1250 (7%) loss:4.2253 lr:0.36 dt:35ms tok/s:1899325 rem:557s step 1251 (7%) loss:4.2285 lr:0.36 dt:34ms tok/s:1900717 rem:557s step 1252 (7%) loss:4.2271 lr:0.36 dt:34ms tok/s:1902901 rem:557s step 1253 (7%) loss:4.2327 lr:0.36 dt:34ms tok/s:1900901 rem:557s step 1254 (7%) loss:4.2214 lr:0.36 dt:34ms tok/s:1900875 rem:557s step 1255 (7%) loss:4.2109 lr:0.36 dt:35ms tok/s:1893920 rem:557s step 1256 (7%) loss:4.1824 lr:0.36 dt:35ms tok/s:1894038 rem:557s step 1257 (7%) loss:4.1345 lr:0.36 dt:35ms tok/s:1895762 rem:557s step 1258 (7%) loss:4.0952 lr:0.36 dt:37ms tok/s:1764776 rem:557s step 1259 (7%) loss:4.0449 lr:0.36 dt:35ms tok/s:1896063 rem:557s step 1260 (7%) loss:3.9995 lr:0.36 dt:34ms tok/s:1919337 rem:557s step 1261 (7%) loss:4.0609 lr:0.36 dt:34ms tok/s:1909060 rem:557s step 1262 (7%) loss:4.1049 lr:0.36 dt:34ms tok/s:1905500 rem:557s step 1263 (7%) loss:4.1370 lr:0.36 dt:34ms tok/s:1904193 rem:557s step 1264 (7%) loss:4.1453 lr:0.36 dt:34ms tok/s:1914378 rem:557s step 1265 (7%) loss:4.1693 lr:0.36 dt:34ms tok/s:1903534 rem:557s step 1266 (7%) loss:4.1890 lr:0.36 dt:34ms tok/s:1912740 rem:557s step 1267 (7%) loss:4.2075 lr:0.36 dt:34ms tok/s:1932698 rem:557s step 1268 (7%) loss:4.2211 lr:0.36 dt:34ms tok/s:1909352 rem:557s step 1269 (7%) loss:4.2378 lr:0.36 dt:34ms tok/s:1901611 rem:557s step 1270 (7%) loss:4.2512 lr:0.36 dt:34ms tok/s:1931815 rem:557s step 1271 (7%) loss:4.2544 lr:0.36 dt:34ms tok/s:1939051 rem:557s step 1272 (7%) loss:4.2636 lr:0.36 dt:34ms tok/s:1933731 rem:556s step 1273 (7%) loss:4.2693 lr:0.36 dt:34ms tok/s:1939571 rem:556s step 1274 (7%) loss:4.2734 lr:0.36 dt:34ms tok/s:1934330 rem:556s step 1275 (7%) loss:4.2740 lr:0.36 dt:34ms tok/s:1941077 rem:556s step 1276 (7%) loss:4.2779 lr:0.36 dt:34ms tok/s:1937998 rem:556s step 1277 (7%) loss:4.2831 lr:0.36 dt:34ms tok/s:1938230 rem:556s step 1278 (7%) loss:4.2814 lr:0.36 dt:34ms tok/s:1939571 rem:556s step 1279 (7%) loss:4.2795 lr:0.36 dt:34ms tok/s:1939927 rem:556s step 1280 (7%) loss:4.2842 lr:0.36 dt:35ms tok/s:1867364 rem:556s step 1281 (7%) loss:4.2851 lr:0.37 dt:34ms tok/s:1921054 rem:556s step 1282 (7%) loss:4.2859 lr:0.37 dt:34ms tok/s:1915926 rem:556s step 1283 (7%) loss:4.2747 lr:0.37 dt:34ms tok/s:1922949 rem:556s step 1284 (7%) loss:4.2664 lr:0.37 dt:34ms tok/s:1920799 rem:556s step 1285 (7%) loss:4.2404 lr:0.37 dt:34ms tok/s:1909962 rem:556s step 1286 (7%) loss:4.2557 lr:0.37 dt:36ms tok/s:1806138 rem:556s step 1287 (7%) loss:4.2700 lr:0.37 dt:34ms tok/s:1928941 rem:556s step 1288 (7%) loss:4.2687 lr:0.37 dt:34ms tok/s:1941790 rem:556s step 1289 (7%) loss:4.2687 lr:0.37 dt:34ms tok/s:1921404 rem:556s step 1290 (7%) loss:4.2837 lr:0.37 dt:34ms tok/s:1912207 rem:556s step 1291 (7%) loss:4.2920 lr:0.37 dt:35ms tok/s:1866160 rem:556s step 1292 (7%) loss:4.3003 lr:0.37 dt:34ms tok/s:1922290 rem:556s step 1293 (7%) loss:4.3028 lr:0.37 dt:34ms tok/s:1924915 rem:556s step 1294 (7%) loss:4.2991 lr:0.37 dt:34ms tok/s:1927764 rem:556s step 1295 (7%) loss:4.3107 lr:0.37 dt:34ms tok/s:1915712 rem:556s step 1296 (7%) loss:4.3028 lr:0.37 dt:34ms tok/s:1929780 rem:556s step 1297 (7%) loss:4.3040 lr:0.37 dt:34ms tok/s:1919579 rem:556s step 1298 (7%) loss:4.3018 lr:0.37 dt:34ms tok/s:1918520 rem:556s step 1299 (7%) loss:4.3036 lr:0.37 dt:34ms tok/s:1921390 rem:556s step 1300 (7%) loss:4.3007 lr:0.37 dt:34ms tok/s:1922438 rem:556s + local: attn=[0.183, 0.367, 0.464] mlp=[0.444, 0.176, -0.167] + + transition: attn=[1.334, 1.001] mlp=[0.105, 0.214] + + hierarchy: attn=[1.113, 1.726, 1.928] mlp=[0.840, -0.113, -0.206] + step 1301 (7%) loss:4.3157 lr:0.37 dt:34ms tok/s:1914925 rem:555s step 1302 (7%) loss:4.3099 lr:0.37 dt:34ms tok/s:1920907 rem:555s step 1303 (7%) loss:4.2995 lr:0.37 dt:34ms tok/s:1908517 rem:555s step 1304 (7%) loss:4.2867 lr:0.37 dt:34ms tok/s:1918547 rem:555s step 1305 (7%) loss:4.2595 lr:0.37 dt:34ms tok/s:1923986 rem:555s step 1306 (7%) loss:4.2232 lr:0.37 dt:34ms tok/s:1920759 rem:555s step 1307 (7%) loss:4.2378 lr:0.37 dt:34ms tok/s:1903758 rem:555s step 1308 (7%) loss:4.2441 lr:0.37 dt:41ms tok/s:1603656 rem:555s step 1309 (7%) loss:4.2432 lr:0.37 dt:34ms tok/s:1938900 rem:555s step 1310 (7%) loss:4.2406 lr:0.37 dt:31ms tok/s:2115031 rem:555s step 1311 (7%) loss:4.2424 lr:0.37 dt:31ms tok/s:2116204 rem:555s step 1312 (7%) loss:4.2398 lr:0.37 dt:31ms tok/s:2105860 rem:555s step 1313 (7%) loss:4.2443 lr:0.37 dt:31ms tok/s:2081714 rem:555s step 1314 (7%) loss:4.2613 lr:0.37 dt:31ms tok/s:2090279 rem:555s step 1315 (7%) loss:4.2702 lr:0.37 dt:32ms tok/s:2052890 rem:555s step 1316 (7%) loss:4.2824 lr:0.37 dt:32ms tok/s:2059103 rem:555s step 1317 (8%) loss:4.2948 lr:0.38 dt:33ms tok/s:2015796 rem:555s step 1318 (8%) loss:4.2962 lr:0.38 dt:32ms tok/s:2024376 rem:555s step 1319 (8%) loss:4.3024 lr:0.38 dt:32ms tok/s:2027093 rem:555s step 1320 (8%) loss:4.2948 lr:0.38 dt:33ms tok/s:1979832 rem:555s step 1321 (8%) loss:4.3017 lr:0.38 dt:33ms tok/s:1994789 rem:555s step 1322 (8%) loss:4.3081 lr:0.38 dt:33ms tok/s:2002958 rem:555s step 1323 (8%) loss:4.2915 lr:0.38 dt:33ms tok/s:1979390 rem:555s step 1324 (8%) loss:4.4407 lr:0.38 dt:33ms tok/s:1975563 rem:555s step 1325 (8%) loss:4.4414 lr:0.38 dt:33ms tok/s:1969449 rem:555s step 1326 (8%) loss:4.4378 lr:0.38 dt:34ms tok/s:1953423 rem:555s step 1327 (8%) loss:4.4243 lr:0.38 dt:33ms tok/s:1981845 rem:555s step 1328 (8%) loss:4.4147 lr:0.38 dt:33ms tok/s:1958866 rem:555s step 1329 (8%) loss:4.3962 lr:0.38 dt:34ms tok/s:1954451 rem:555s step 1330 (8%) loss:4.3726 lr:0.38 dt:34ms tok/s:1955369 rem:555s step 1331 (8%) loss:4.3570 lr:0.38 dt:33ms tok/s:1960361 rem:554s step 1332 (8%) loss:4.3480 lr:0.38 dt:34ms tok/s:1937821 rem:554s step 1333 (8%) loss:4.3306 lr:0.38 dt:34ms tok/s:1943438 rem:554s step 1334 (8%) loss:4.3232 lr:0.38 dt:34ms tok/s:1940324 rem:554s step 1335 (8%) loss:4.3238 lr:0.38 dt:34ms tok/s:1932168 rem:554s step 1336 (8%) loss:4.3160 lr:0.38 dt:34ms tok/s:1932222 rem:554s step 1337 (8%) loss:4.2868 lr:0.38 dt:34ms tok/s:1944043 rem:554s step 1338 (8%) loss:4.2801 lr:0.38 dt:34ms tok/s:1937452 rem:554s step 1339 (8%) loss:4.2759 lr:0.38 dt:34ms tok/s:1910506 rem:554s step 1340 (8%) loss:4.2733 lr:0.38 dt:34ms tok/s:1914098 rem:554s step 1341 (8%) loss:4.2848 lr:0.38 dt:34ms tok/s:1918052 rem:554s step 1342 (8%) loss:4.2807 lr:0.38 dt:34ms tok/s:1940803 rem:554s step 1343 (8%) loss:4.3071 lr:0.38 dt:34ms tok/s:1911915 rem:554s step 1344 (8%) loss:4.3101 lr:0.38 dt:34ms tok/s:1919404 rem:554s step 1345 (8%) loss:4.2849 lr:0.38 dt:34ms tok/s:1915472 rem:554s step 1346 (8%) loss:4.2791 lr:0.38 dt:34ms tok/s:1905051 rem:554s step 1347 (8%) loss:4.2962 lr:0.38 dt:34ms tok/s:1908503 rem:554s step 1348 (8%) loss:4.3140 lr:0.38 dt:34ms tok/s:1901374 rem:554s step 1349 (8%) loss:4.3083 lr:0.38 dt:34ms tok/s:1913845 rem:554s step 1350 (8%) loss:4.3190 lr:0.38 dt:34ms tok/s:1904721 rem:554s step 1351 (8%) loss:4.3206 lr:0.38 dt:34ms tok/s:1917731 rem:554s step 1352 (8%) loss:4.3239 lr:0.39 dt:34ms tok/s:1912793 rem:554s step 1353 (8%) loss:4.3105 lr:0.39 dt:34ms tok/s:1916527 rem:554s step 1354 (8%) loss:4.3030 lr:0.39 dt:34ms tok/s:1918467 rem:554s step 1355 (8%) loss:4.3090 lr:0.39 dt:34ms tok/s:1909577 rem:554s step 1356 (8%) loss:4.3040 lr:0.39 dt:34ms tok/s:1917262 rem:554s step 1357 (8%) loss:4.3029 lr:0.39 dt:34ms tok/s:1926413 rem:554s step 1358 (8%) loss:4.3049 lr:0.39 dt:34ms tok/s:1910772 rem:554s step 1359 (8%) loss:4.3028 lr:0.39 dt:34ms tok/s:1912567 rem:554s step 1360 (8%) loss:4.2998 lr:0.39 dt:34ms tok/s:1918012 rem:553s step 1361 (8%) loss:4.3079 lr:0.39 dt:34ms tok/s:1920571 rem:553s step 1362 (8%) loss:4.3076 lr:0.39 dt:34ms tok/s:1922586 rem:553s step 1363 (8%) loss:4.3014 lr:0.39 dt:34ms tok/s:1910108 rem:553s step 1364 (8%) loss:4.2925 lr:0.39 dt:34ms tok/s:1905183 rem:553s step 1365 (8%) loss:4.2870 lr:0.39 dt:34ms tok/s:1920048 rem:553s step 1366 (8%) loss:4.2754 lr:0.39 dt:34ms tok/s:1905619 rem:553s step 1367 (8%) loss:4.2774 lr:0.39 dt:34ms tok/s:1926399 rem:553s step 1368 (8%) loss:4.2754 lr:0.39 dt:34ms tok/s:1925279 rem:553s step 1369 (8%) loss:4.2738 lr:0.39 dt:34ms tok/s:1921686 rem:553s step 1370 (8%) loss:4.2739 lr:0.39 dt:34ms tok/s:1916554 rem:553s step 1371 (8%) loss:4.2760 lr:0.39 dt:34ms tok/s:1916220 rem:553s step 1372 (8%) loss:4.2743 lr:0.39 dt:34ms tok/s:1914445 rem:553s step 1373 (8%) loss:4.2788 lr:0.39 dt:34ms tok/s:1915379 rem:553s step 1374 (8%) loss:4.2816 lr:0.39 dt:34ms tok/s:1920746 rem:553s step 1375 (8%) loss:4.2787 lr:0.39 dt:35ms tok/s:1865362 rem:553s step 1376 (8%) loss:4.2636 lr:0.39 dt:34ms tok/s:1918587 rem:553s step 1377 (8%) loss:4.2546 lr:0.39 dt:34ms tok/s:1908106 rem:553s step 1378 (8%) loss:4.2628 lr:0.39 dt:34ms tok/s:1910320 rem:553s step 1379 (8%) loss:4.2742 lr:0.39 dt:34ms tok/s:1905593 rem:553s step 1380 (8%) loss:4.2703 lr:0.39 dt:34ms tok/s:1918828 rem:553s step 1381 (8%) loss:4.2628 lr:0.39 dt:40ms tok/s:1641229 rem:553s step 1382 (8%) loss:4.2579 lr:0.39 dt:41ms tok/s:1593107 rem:553s step 1383 (8%) loss:4.2747 lr:0.39 dt:33ms tok/s:2003615 rem:553s step 1384 (8%) loss:4.2813 lr:0.39 dt:32ms tok/s:2030297 rem:553s step 1385 (8%) loss:4.2798 lr:0.39 dt:33ms tok/s:1966827 rem:553s step 1386 (8%) loss:4.3043 lr:0.39 dt:33ms tok/s:2010194 rem:553s step 1387 (8%) loss:4.3098 lr:0.40 dt:32ms tok/s:2064438 rem:553s step 1388 (8%) loss:4.3050 lr:0.40 dt:31ms tok/s:2114462 rem:553s step 1389 (8%) loss:4.2917 lr:0.40 dt:30ms tok/s:2155787 rem:553s step 1390 (8%) loss:4.2809 lr:0.40 dt:30ms tok/s:2164615 rem:552s step 1391 (8%) loss:4.2694 lr:0.40 dt:30ms tok/s:2161398 rem:552s step 1392 (8%) loss:4.2742 lr:0.40 dt:31ms tok/s:2082519 rem:552s step 1393 (8%) loss:4.2692 lr:0.40 dt:30ms tok/s:2162061 rem:552s step 1394 (8%) loss:4.2535 lr:0.40 dt:30ms tok/s:2162180 rem:552s step 1395 (8%) loss:4.2476 lr:0.40 dt:30ms tok/s:2160378 rem:552s step 1396 (8%) loss:4.2572 lr:0.40 dt:32ms tok/s:2032940 rem:552s step 1397 (8%) loss:4.2596 lr:0.40 dt:30ms tok/s:2154266 rem:552s step 1398 (8%) loss:4.2613 lr:0.40 dt:31ms tok/s:2139744 rem:552s step 1399 (8%) loss:4.2571 lr:0.40 dt:31ms tok/s:2129351 rem:552s step 1400 (8%) loss:4.2398 lr:0.40 dt:31ms tok/s:2126024 rem:552s + local: attn=[0.147, 0.422, 0.519] mlp=[0.370, 0.167, -0.161] + + transition: attn=[1.557, 1.108] mlp=[0.091, 0.223] + + hierarchy: attn=[1.167, 1.884, 2.210] mlp=[0.899, -0.191, -0.152] + step 1401 (8%) loss:4.2156 lr:0.40 dt:31ms tok/s:2114933 rem:552s step 1402 (8%) loss:4.2326 lr:0.40 dt:31ms tok/s:2109998 rem:552s step 1403 (8%) loss:4.2415 lr:0.40 dt:31ms tok/s:2116008 rem:552s step 1404 (8%) loss:4.2535 lr:0.40 dt:32ms tok/s:2078048 rem:552s step 1405 (8%) loss:4.2580 lr:0.40 dt:32ms tok/s:2077310 rem:552s step 1406 (8%) loss:4.2581 lr:0.40 dt:32ms tok/s:2066767 rem:552s step 1407 (8%) loss:4.2575 lr:0.40 dt:32ms tok/s:2051894 rem:552s step 1408 (8%) loss:4.2545 lr:0.40 dt:33ms tok/s:1997427 rem:552s step 1409 (8%) loss:4.2427 lr:0.40 dt:33ms tok/s:2016313 rem:552s step 1410 (8%) loss:4.2437 lr:0.40 dt:32ms tok/s:2025689 rem:552s step 1411 (8%) loss:4.2553 lr:0.40 dt:32ms tok/s:2034008 rem:552s step 1412 (8%) loss:4.2553 lr:0.40 dt:32ms tok/s:2017497 rem:552s step 1413 (8%) loss:4.2454 lr:0.40 dt:33ms tok/s:1992880 rem:552s step 1414 (8%) loss:4.2481 lr:0.40 dt:33ms tok/s:1987117 rem:552s step 1415 (8%) loss:4.2474 lr:0.40 dt:33ms tok/s:1996643 rem:552s step 1416 (8%) loss:4.2656 lr:0.40 dt:33ms tok/s:1991710 rem:552s step 1417 (8%) loss:4.2589 lr:0.40 dt:33ms tok/s:1994601 rem:552s step 1418 (8%) loss:4.2581 lr:0.40 dt:33ms tok/s:1988742 rem:552s step 1419 (8%) loss:4.2800 lr:0.40 dt:33ms tok/s:1978094 rem:552s step 1420 (8%) loss:4.3018 lr:0.40 dt:33ms tok/s:1972699 rem:552s step 1421 (8%) loss:4.2989 lr:0.40 dt:33ms tok/s:1975208 rem:551s step 1422 (8%) loss:4.2749 lr:0.40 dt:33ms tok/s:1976529 rem:551s step 1423 (8%) loss:4.2735 lr:0.40 dt:33ms tok/s:1978578 rem:551s step 1424 (8%) loss:4.2859 lr:0.40 dt:33ms tok/s:1981530 rem:551s step 1425 (8%) loss:4.2966 lr:0.41 dt:33ms tok/s:1971822 rem:551s step 1426 (8%) loss:4.3069 lr:0.41 dt:33ms tok/s:1961382 rem:551s step 1427 (8%) loss:4.3213 lr:0.41 dt:33ms tok/s:1971553 rem:551s step 1428 (8%) loss:4.3258 lr:0.41 dt:33ms tok/s:1964592 rem:551s step 1429 (8%) loss:4.3282 lr:0.41 dt:40ms tok/s:1656899 rem:551s step 1430 (8%) loss:4.3178 lr:0.41 dt:34ms tok/s:1941379 rem:551s step 1431 (8%) loss:4.3130 lr:0.41 dt:34ms tok/s:1952702 rem:551s step 1432 (8%) loss:4.3022 lr:0.41 dt:33ms tok/s:1964157 rem:551s step 1433 (8%) loss:4.2973 lr:0.41 dt:33ms tok/s:1961480 rem:551s step 1434 (8%) loss:4.2841 lr:0.41 dt:33ms tok/s:1969138 rem:551s step 1435 (8%) loss:4.2857 lr:0.41 dt:34ms tok/s:1949973 rem:551s step 1436 (8%) loss:4.2873 lr:0.41 dt:34ms tok/s:1946465 rem:551s step 1437 (8%) loss:4.2726 lr:0.41 dt:34ms tok/s:1935705 rem:551s step 1438 (8%) loss:4.2718 lr:0.41 dt:34ms tok/s:1926332 rem:551s step 1439 (8%) loss:4.2692 lr:0.41 dt:34ms tok/s:1940995 rem:551s step 1440 (8%) loss:4.2582 lr:0.41 dt:34ms tok/s:1947693 rem:551s step 1441 (8%) loss:4.2419 lr:0.41 dt:38ms tok/s:1723394 rem:551s step 1442 (8%) loss:4.2208 lr:0.41 dt:37ms tok/s:1782341 rem:551s step 1443 (8%) loss:4.1959 lr:0.41 dt:33ms tok/s:1965505 rem:551s step 1444 (8%) loss:4.1946 lr:0.41 dt:34ms tok/s:1949185 rem:551s step 1445 (8%) loss:4.2135 lr:0.41 dt:33ms tok/s:1960095 rem:551s step 1446 (8%) loss:4.2488 lr:0.41 dt:34ms tok/s:1952716 rem:551s step 1447 (8%) loss:4.3120 lr:0.41 dt:34ms tok/s:1930200 rem:551s step 1448 (8%) loss:4.3200 lr:0.41 dt:34ms tok/s:1931883 rem:551s step 1449 (8%) loss:4.3344 lr:0.41 dt:34ms tok/s:1934629 rem:551s step 1450 (8%) loss:4.3346 lr:0.41 dt:34ms tok/s:1922613 rem:550s step 1451 (8%) loss:4.3237 lr:0.41 dt:34ms tok/s:1900717 rem:550s step 1452 (8%) loss:4.3080 lr:0.41 dt:34ms tok/s:1916380 rem:550s step 1453 (8%) loss:4.3011 lr:0.41 dt:35ms tok/s:1876197 rem:550s step 1454 (8%) loss:4.2926 lr:0.41 dt:34ms tok/s:1904510 rem:550s step 1455 (8%) loss:4.3061 lr:0.41 dt:35ms tok/s:1870745 rem:550s step 1456 (8%) loss:4.2902 lr:0.41 dt:35ms tok/s:1888183 rem:550s step 1457 (8%) loss:4.2811 lr:0.41 dt:37ms tok/s:1793844 rem:550s step 1458 (8%) loss:4.2820 lr:0.41 dt:35ms tok/s:1876159 rem:550s step 1459 (8%) loss:4.2675 lr:0.41 dt:35ms tok/s:1887068 rem:550s step 1460 (8%) loss:4.2692 lr:0.42 dt:35ms tok/s:1881450 rem:550s step 1461 (8%) loss:4.2608 lr:0.42 dt:35ms tok/s:1881153 rem:550s step 1462 (8%) loss:4.2532 lr:0.42 dt:35ms tok/s:1883409 rem:550s step 1463 (8%) loss:4.2484 lr:0.42 dt:35ms tok/s:1887392 rem:550s step 1464 (8%) loss:4.2953 lr:0.42 dt:35ms tok/s:1859595 rem:550s step 1465 (8%) loss:4.3384 lr:0.42 dt:35ms tok/s:1853376 rem:550s step 1466 (8%) loss:4.3642 lr:0.42 dt:36ms tok/s:1837849 rem:550s step 1467 (8%) loss:4.3779 lr:0.42 dt:36ms tok/s:1846032 rem:550s step 1468 (8%) loss:4.4040 lr:0.42 dt:35ms tok/s:1854689 rem:550s step 1469 (8%) loss:4.4224 lr:0.42 dt:35ms tok/s:1855090 rem:550s step 1470 (8%) loss:4.4314 lr:0.42 dt:35ms tok/s:1848565 rem:550s step 1471 (8%) loss:4.4382 lr:0.42 dt:35ms tok/s:1853251 rem:550s step 1472 (8%) loss:4.4464 lr:0.42 dt:35ms tok/s:1850319 rem:550s step 1473 (8%) loss:4.4271 lr:0.42 dt:36ms tok/s:1838451 rem:550s step 1474 (8%) loss:4.4000 lr:0.42 dt:35ms tok/s:1853326 rem:550s step 1475 (8%) loss:4.3751 lr:0.42 dt:35ms tok/s:1852127 rem:550s step 1476 (8%) loss:4.3700 lr:0.42 dt:35ms tok/s:1849996 rem:550s step 1477 (8%) loss:4.3574 lr:0.42 dt:35ms tok/s:1852514 rem:550s step 1478 (8%) loss:4.3566 lr:0.42 dt:37ms tok/s:1790187 rem:550s step 1479 (8%) loss:4.3641 lr:0.42 dt:34ms tok/s:1916714 rem:549s step 1480 (8%) loss:4.3480 lr:0.42 dt:34ms tok/s:1911715 rem:549s step 1481 (8%) loss:4.3455 lr:0.42 dt:35ms tok/s:1889767 rem:549s step 1482 (8%) loss:4.3219 lr:0.42 dt:34ms tok/s:1900034 rem:549s step 1483 (8%) loss:4.3112 lr:0.42 dt:35ms tok/s:1869549 rem:549s step 1484 (8%) loss:4.2969 lr:0.42 dt:35ms tok/s:1862708 rem:549s step 1485 (8%) loss:4.2869 lr:0.42 dt:35ms tok/s:1876197 rem:549s step 1486 (8%) loss:4.2788 lr:0.42 dt:35ms tok/s:1850544 rem:549s step 1487 (8%) loss:4.2735 lr:0.42 dt:36ms tok/s:1842074 rem:549s step 1488 (8%) loss:4.2696 lr:0.42 dt:35ms tok/s:1848366 rem:549s step 1489 (8%) loss:4.2691 lr:0.42 dt:35ms tok/s:1847198 rem:549s step 1490 (8%) loss:4.2808 lr:0.42 dt:35ms tok/s:1853789 rem:549s step 1491 (8%) loss:4.2736 lr:0.42 dt:35ms tok/s:1850332 rem:549s step 1492 (8%) loss:4.2801 lr:0.42 dt:35ms tok/s:1849112 rem:549s step 1493 (8%) loss:4.2742 lr:0.42 dt:39ms tok/s:1695479 rem:549s step 1494 (9%) loss:4.2664 lr:0.43 dt:36ms tok/s:1834072 rem:549s step 1495 (9%) loss:4.2609 lr:0.43 dt:35ms tok/s:1887042 rem:549s step 1496 (9%) loss:4.2609 lr:0.43 dt:34ms tok/s:1899876 rem:549s step 1497 (9%) loss:4.2603 lr:0.43 dt:35ms tok/s:1865907 rem:549s step 1498 (9%) loss:4.2565 lr:0.43 dt:35ms tok/s:1858916 rem:549s step 1499 (9%) loss:4.2504 lr:0.43 dt:35ms tok/s:1866502 rem:549s step 1500 (9%) loss:4.2478 lr:0.43 dt:35ms tok/s:1868926 rem:549s + local: attn=[0.126, 0.422, 0.456] mlp=[0.335, 0.121, -0.158] + + transition: attn=[1.576, 1.123] mlp=[0.066, 0.219] + + hierarchy: attn=[1.216, 2.013, 2.486] mlp=[0.893, -0.373, -0.059] + step 1501 (9%) loss:4.2501 lr:0.43 dt:35ms tok/s:1868075 rem:549s step 1502 (9%) loss:4.2361 lr:0.43 dt:35ms tok/s:1858011 rem:549s step 1503 (9%) loss:4.2396 lr:0.43 dt:35ms tok/s:1847174 rem:549s step 1504 (9%) loss:4.2341 lr:0.43 dt:40ms tok/s:1657129 rem:549s step 1505 (9%) loss:4.2310 lr:0.43 dt:35ms tok/s:1865856 rem:549s step 1506 (9%) loss:4.2337 lr:0.43 dt:35ms tok/s:1849996 rem:549s step 1507 (9%) loss:4.2358 lr:0.43 dt:35ms tok/s:1892786 rem:548s step 1508 (9%) loss:4.2373 lr:0.43 dt:35ms tok/s:1874726 rem:548s step 1509 (9%) loss:4.2356 lr:0.43 dt:39ms tok/s:1693610 rem:548s step 1510 (9%) loss:4.2387 lr:0.43 dt:36ms tok/s:1843074 rem:548s step 1511 (9%) loss:4.2400 lr:0.43 dt:34ms tok/s:1913512 rem:548s step 1512 (9%) loss:4.2562 lr:0.43 dt:34ms tok/s:1915619 rem:548s step 1513 (9%) loss:4.2549 lr:0.43 dt:35ms tok/s:1897280 rem:548s step 1514 (9%) loss:4.2554 lr:0.43 dt:35ms tok/s:1885567 rem:548s step 1515 (9%) loss:4.2504 lr:0.43 dt:35ms tok/s:1891405 rem:548s step 1516 (9%) loss:4.2511 lr:0.43 dt:35ms tok/s:1882558 rem:548s step 1517 (9%) loss:4.2417 lr:0.43 dt:36ms tok/s:1813885 rem:548s step 1518 (9%) loss:4.2445 lr:0.43 dt:35ms tok/s:1864413 rem:548s step 1519 (9%) loss:4.2374 lr:0.43 dt:41ms tok/s:1597719 rem:548s step 1520 (9%) loss:4.2486 lr:0.43 dt:35ms tok/s:1894338 rem:548s step 1521 (9%) loss:4.2393 lr:0.43 dt:34ms tok/s:1911808 rem:548s step 1522 (9%) loss:4.2327 lr:0.43 dt:35ms tok/s:1862821 rem:548s step 1523 (9%) loss:4.2240 lr:0.43 dt:35ms tok/s:1868380 rem:548s step 1524 (9%) loss:4.2216 lr:0.43 dt:35ms tok/s:1881372 rem:548s step 1525 (9%) loss:4.2400 lr:0.43 dt:35ms tok/s:1879970 rem:548s step 1526 (9%) loss:4.2275 lr:0.43 dt:35ms tok/s:1879777 rem:548s step 1527 (9%) loss:4.2214 lr:0.43 dt:35ms tok/s:1876248 rem:548s step 1528 (9%) loss:4.2053 lr:0.44 dt:35ms tok/s:1877004 rem:548s step 1529 (9%) loss:4.2028 lr:0.44 dt:36ms tok/s:1802348 rem:548s step 1530 (9%) loss:4.2000 lr:0.44 dt:35ms tok/s:1868227 rem:548s step 1531 (9%) loss:4.2272 lr:0.44 dt:35ms tok/s:1872057 rem:548s step 1532 (9%) loss:4.2443 lr:0.44 dt:35ms tok/s:1882532 rem:548s step 1533 (9%) loss:4.2425 lr:0.44 dt:35ms tok/s:1879880 rem:548s step 1534 (9%) loss:4.2271 lr:0.44 dt:35ms tok/s:1869409 rem:548s step 1535 (9%) loss:4.2245 lr:0.44 dt:35ms tok/s:1877197 rem:547s step 1536 (9%) loss:4.2367 lr:0.44 dt:35ms tok/s:1874419 rem:547s step 1537 (9%) loss:4.2254 lr:0.44 dt:35ms tok/s:1877466 rem:547s step 1538 (9%) loss:4.2293 lr:0.44 dt:35ms tok/s:1875097 rem:547s step 1539 (9%) loss:4.2263 lr:0.44 dt:35ms tok/s:1885424 rem:547s step 1540 (9%) loss:4.2249 lr:0.44 dt:35ms tok/s:1881398 rem:547s step 1541 (9%) loss:4.2707 lr:0.44 dt:35ms tok/s:1878582 rem:547s step 1542 (9%) loss:4.2859 lr:0.44 dt:35ms tok/s:1877261 rem:547s step 1543 (9%) loss:4.2757 lr:0.44 dt:36ms tok/s:1836953 rem:547s step 1544 (9%) loss:4.2728 lr:0.44 dt:35ms tok/s:1867402 rem:547s step 1545 (9%) loss:4.2592 lr:0.44 dt:35ms tok/s:1882571 rem:547s step 1546 (9%) loss:4.2589 lr:0.44 dt:35ms tok/s:1879790 rem:547s step 1547 (9%) loss:4.2556 lr:0.44 dt:35ms tok/s:1881643 rem:547s step 1548 (9%) loss:4.2432 lr:0.44 dt:35ms tok/s:1886809 rem:547s step 1549 (9%) loss:4.2309 lr:0.44 dt:35ms tok/s:1871726 rem:547s step 1550 (9%) loss:4.2231 lr:0.44 dt:35ms tok/s:1883538 rem:547s step 1551 (9%) loss:4.2226 lr:0.44 dt:35ms tok/s:1875493 rem:547s step 1552 (9%) loss:4.2106 lr:0.44 dt:35ms tok/s:1884778 rem:547s step 1553 (9%) loss:4.2083 lr:0.44 dt:36ms tok/s:1817687 rem:547s step 1554 (9%) loss:4.2082 lr:0.44 dt:35ms tok/s:1887211 rem:547s step 1555 (9%) loss:4.2059 lr:0.44 dt:35ms tok/s:1877427 rem:547s step 1556 (9%) loss:4.2056 lr:0.44 dt:35ms tok/s:1875596 rem:547s step 1557 (9%) loss:4.2026 lr:0.44 dt:35ms tok/s:1891262 rem:547s step 1558 (9%) loss:4.2016 lr:0.44 dt:35ms tok/s:1879597 rem:547s step 1559 (9%) loss:4.1841 lr:0.44 dt:35ms tok/s:1884377 rem:547s step 1560 (9%) loss:4.1741 lr:0.44 dt:35ms tok/s:1881269 rem:547s step 1561 (9%) loss:4.1777 lr:0.44 dt:35ms tok/s:1886563 rem:547s step 1562 (9%) loss:4.1662 lr:0.45 dt:35ms tok/s:1873997 rem:547s step 1563 (9%) loss:4.1732 lr:0.45 dt:35ms tok/s:1875967 rem:547s step 1564 (9%) loss:4.1889 lr:0.45 dt:35ms tok/s:1878467 rem:546s step 1565 (9%) loss:4.2226 lr:0.45 dt:35ms tok/s:1879096 rem:546s step 1566 (9%) loss:4.2195 lr:0.45 dt:35ms tok/s:1871688 rem:546s step 1567 (9%) loss:4.2028 lr:0.45 dt:35ms tok/s:1855603 rem:546s step 1568 (9%) loss:4.1839 lr:0.45 dt:35ms tok/s:1876415 rem:546s step 1569 (9%) loss:4.1619 lr:0.45 dt:35ms tok/s:1875967 rem:546s step 1570 (9%) loss:4.1258 lr:0.45 dt:35ms tok/s:1861698 rem:546s step 1571 (9%) loss:4.0921 lr:0.45 dt:35ms tok/s:1879417 rem:546s step 1572 (9%) loss:4.0664 lr:0.45 dt:35ms tok/s:1872567 rem:546s step 1573 (9%) loss:4.0398 lr:0.45 dt:35ms tok/s:1878454 rem:546s step 1574 (9%) loss:4.0511 lr:0.45 dt:35ms tok/s:1875736 rem:546s step 1575 (9%) loss:4.0709 lr:0.45 dt:35ms tok/s:1877902 rem:546s step 1576 (9%) loss:4.0955 lr:0.45 dt:35ms tok/s:1885696 rem:546s step 1577 (9%) loss:4.1195 lr:0.45 dt:35ms tok/s:1879301 rem:546s step 1578 (9%) loss:4.1346 lr:0.45 dt:35ms tok/s:1876543 rem:546s step 1579 (9%) loss:4.1408 lr:0.45 dt:35ms tok/s:1874726 rem:546s step 1580 (9%) loss:4.1514 lr:0.45 dt:35ms tok/s:1872810 rem:546s step 1581 (9%) loss:4.1549 lr:0.45 dt:35ms tok/s:1884093 rem:546s step 1582 (9%) loss:4.1699 lr:0.45 dt:34ms tok/s:1900481 rem:546s step 1583 (9%) loss:4.1820 lr:0.45 dt:34ms tok/s:1899850 rem:546s step 1584 (9%) loss:4.1914 lr:0.45 dt:34ms tok/s:1904259 rem:546s step 1585 (9%) loss:4.1998 lr:0.45 dt:35ms tok/s:1879199 rem:546s step 1586 (9%) loss:4.1993 lr:0.45 dt:35ms tok/s:1882970 rem:546s step 1587 (9%) loss:4.1958 lr:0.45 dt:34ms tok/s:1905170 rem:546s step 1588 (9%) loss:4.2039 lr:0.45 dt:34ms tok/s:1905091 rem:546s step 1589 (9%) loss:4.2033 lr:0.45 dt:35ms tok/s:1886783 rem:546s step 1590 (9%) loss:4.1840 lr:0.45 dt:35ms tok/s:1868875 rem:546s step 1591 (9%) loss:4.1883 lr:0.45 dt:35ms tok/s:1867161 rem:546s step 1592 (9%) loss:4.1899 lr:0.45 dt:36ms tok/s:1817435 rem:546s step 1593 (9%) loss:4.1888 lr:0.45 dt:35ms tok/s:1879147 rem:545s step 1594 (9%) loss:4.1734 lr:0.45 dt:35ms tok/s:1876812 rem:545s step 1595 (9%) loss:4.1614 lr:0.45 dt:35ms tok/s:1868316 rem:545s step 1596 (9%) loss:4.1609 lr:0.45 dt:35ms tok/s:1865717 rem:545s step 1597 (9%) loss:4.1613 lr:0.46 dt:35ms tok/s:1870719 rem:545s step 1598 (9%) loss:4.1715 lr:0.46 dt:35ms tok/s:1862077 rem:545s step 1599 (9%) loss:4.1720 lr:0.46 dt:35ms tok/s:1865831 rem:545s step 1600 (9%) loss:4.1770 lr:0.46 dt:36ms tok/s:1839830 rem:545s + local: attn=[0.120, 0.435, 0.551] mlp=[0.293, 0.173, -0.124] + + transition: attn=[1.917, 1.241] mlp=[0.079, 0.210] + + hierarchy: attn=[1.283, 2.236, 2.916] mlp=[0.974, -0.448, 0.004] + step 1601 (9%) loss:4.1798 lr:0.46 dt:36ms tok/s:1832544 rem:545s step 1602 (9%) loss:4.1758 lr:0.46 dt:36ms tok/s:1827828 rem:545s step 1603 (9%) loss:4.1649 lr:0.46 dt:36ms tok/s:1830567 rem:545s step 1604 (9%) loss:4.1587 lr:0.46 dt:38ms tok/s:1731067 rem:545s step 1605 (9%) loss:4.1560 lr:0.46 dt:38ms tok/s:1740990 rem:545s step 1606 (9%) loss:4.1538 lr:0.46 dt:35ms tok/s:1879880 rem:545s step 1607 (9%) loss:4.1460 lr:0.46 dt:35ms tok/s:1877389 rem:545s step 1608 (9%) loss:4.1615 lr:0.46 dt:35ms tok/s:1882596 rem:545s step 1609 (9%) loss:4.1686 lr:0.46 dt:36ms tok/s:1833497 rem:545s step 1610 (9%) loss:4.1760 lr:0.46 dt:35ms tok/s:1869422 rem:545s step 1611 (9%) loss:4.1776 lr:0.46 dt:35ms tok/s:1868761 rem:545s step 1612 (9%) loss:4.1763 lr:0.46 dt:35ms tok/s:1859759 rem:545s step 1613 (9%) loss:4.1811 lr:0.46 dt:35ms tok/s:1864552 rem:545s step 1614 (9%) loss:4.1679 lr:0.46 dt:35ms tok/s:1857748 rem:545s step 1615 (9%) loss:4.1619 lr:0.46 dt:35ms tok/s:1864932 rem:545s step 1616 (9%) loss:4.1679 lr:0.46 dt:45ms tok/s:1469244 rem:545s step 1617 (9%) loss:4.1579 lr:0.46 dt:34ms tok/s:1954284 rem:545s step 1618 (9%) loss:4.1919 lr:0.46 dt:34ms tok/s:1938244 rem:545s step 1619 (9%) loss:4.1782 lr:0.46 dt:34ms tok/s:1917236 rem:545s step 1620 (9%) loss:4.1888 lr:0.46 dt:34ms tok/s:1909564 rem:545s step 1621 (9%) loss:4.1806 lr:0.46 dt:34ms tok/s:1925994 rem:544s step 1622 (9%) loss:4.2327 lr:0.46 dt:34ms tok/s:1915552 rem:544s step 1623 (9%) loss:4.2184 lr:0.46 dt:34ms tok/s:1911410 rem:544s step 1624 (9%) loss:4.2100 lr:0.46 dt:35ms tok/s:1879777 rem:544s step 1625 (9%) loss:4.2005 lr:0.46 dt:34ms tok/s:1903692 rem:544s step 1626 (9%) loss:4.1959 lr:0.46 dt:34ms tok/s:1908013 rem:544s step 1627 (9%) loss:4.1928 lr:0.46 dt:35ms tok/s:1881733 rem:544s step 1628 (9%) loss:4.1845 lr:0.46 dt:35ms tok/s:1881553 rem:544s step 1629 (9%) loss:4.1732 lr:0.46 dt:35ms tok/s:1872631 rem:544s step 1630 (9%) loss:4.1793 lr:0.46 dt:34ms tok/s:1908994 rem:544s step 1631 (9%) loss:4.1717 lr:0.47 dt:35ms tok/s:1885696 rem:544s step 1632 (9%) loss:4.1608 lr:0.47 dt:35ms tok/s:1888326 rem:544s step 1633 (9%) loss:4.1709 lr:0.47 dt:35ms tok/s:1889390 rem:544s step 1634 (9%) loss:4.1715 lr:0.47 dt:35ms tok/s:1878351 rem:544s step 1635 (9%) loss:4.1619 lr:0.47 dt:38ms tok/s:1743817 rem:544s step 1636 (9%) loss:4.1592 lr:0.47 dt:35ms tok/s:1883345 rem:544s step 1637 (9%) loss:4.1502 lr:0.47 dt:35ms tok/s:1885774 rem:544s step 1638 (9%) loss:4.1398 lr:0.47 dt:35ms tok/s:1848354 rem:544s step 1639 (9%) loss:4.1625 lr:0.47 dt:35ms tok/s:1855378 rem:544s step 1640 (9%) loss:4.1665 lr:0.47 dt:35ms tok/s:1853414 rem:544s step 1641 (9%) loss:4.1645 lr:0.47 dt:35ms tok/s:1868329 rem:544s step 1642 (9%) loss:4.1665 lr:0.47 dt:35ms tok/s:1859507 rem:544s step 1643 (9%) loss:4.1681 lr:0.47 dt:35ms tok/s:1859079 rem:544s step 1644 (9%) loss:4.1571 lr:0.47 dt:35ms tok/s:1862607 rem:544s step 1645 (9%) loss:4.1537 lr:0.47 dt:35ms tok/s:1854139 rem:544s step 1646 (9%) loss:4.1529 lr:0.47 dt:35ms tok/s:1861598 rem:544s step 1647 (9%) loss:4.1542 lr:0.47 dt:36ms tok/s:1831787 rem:544s step 1648 (9%) loss:4.1592 lr:0.47 dt:36ms tok/s:1812569 rem:544s step 1649 (9%) loss:4.1433 lr:0.47 dt:36ms tok/s:1822459 rem:543s step 1650 (9%) loss:4.1489 lr:0.47 dt:36ms tok/s:1821988 rem:543s step 1651 (9%) loss:4.1556 lr:0.47 dt:36ms tok/s:1830420 rem:543s step 1652 (9%) loss:4.1483 lr:0.47 dt:36ms tok/s:1825558 rem:543s step 1653 (9%) loss:4.1402 lr:0.47 dt:36ms tok/s:1824649 rem:543s step 1654 (9%) loss:4.1299 lr:0.47 dt:36ms tok/s:1817579 rem:543s step 1655 (9%) loss:4.1127 lr:0.47 dt:36ms tok/s:1819962 rem:543s step 1656 (9%) loss:4.1126 lr:0.47 dt:36ms tok/s:1813897 rem:543s step 1657 (9%) loss:4.1086 lr:0.47 dt:36ms tok/s:1823197 rem:543s step 1658 (9%) loss:4.1070 lr:0.47 dt:40ms tok/s:1639819 rem:543s step 1659 (9%) loss:4.0868 lr:0.47 dt:36ms tok/s:1843358 rem:543s step 1660 (9%) loss:4.0952 lr:0.47 dt:35ms tok/s:1862556 rem:543s step 1661 (9%) loss:4.1009 lr:0.47 dt:36ms tok/s:1807040 rem:543s step 1662 (9%) loss:4.1008 lr:0.47 dt:37ms tok/s:1795344 rem:543s step 1663 (9%) loss:4.0992 lr:0.47 dt:36ms tok/s:1817014 rem:543s step 1664 (10%) loss:4.1117 lr:0.48 dt:36ms tok/s:1826128 rem:543s step 1665 (10%) loss:4.1042 lr:0.48 dt:36ms tok/s:1823681 rem:543s step 1666 (10%) loss:4.1056 lr:0.48 dt:36ms tok/s:1823173 rem:543s step 1667 (10%) loss:4.1108 lr:0.48 dt:37ms tok/s:1787754 rem:543s step 1668 (10%) loss:4.0971 lr:0.48 dt:36ms tok/s:1823947 rem:543s step 1669 (10%) loss:4.1037 lr:0.48 dt:36ms tok/s:1818565 rem:543s step 1670 (10%) loss:4.1082 lr:0.48 dt:41ms tok/s:1596494 rem:543s step 1671 (10%) loss:4.1018 lr:0.48 dt:37ms tok/s:1751205 rem:543s step 1672 (10%) loss:4.1140 lr:0.48 dt:36ms tok/s:1826261 rem:543s step 1673 (10%) loss:4.1228 lr:0.48 dt:36ms tok/s:1809586 rem:543s step 1674 (10%) loss:4.1292 lr:0.48 dt:36ms tok/s:1815442 rem:543s step 1675 (10%) loss:4.1370 lr:0.48 dt:36ms tok/s:1817266 rem:543s step 1676 (10%) loss:4.1318 lr:0.48 dt:36ms tok/s:1821952 rem:543s step 1677 (10%) loss:4.1303 lr:0.48 dt:36ms tok/s:1822169 rem:542s step 1678 (10%) loss:4.1170 lr:0.48 dt:36ms tok/s:1826589 rem:542s step 1679 (10%) loss:4.1060 lr:0.48 dt:36ms tok/s:1820359 rem:542s step 1680 (10%) loss:4.1155 lr:0.48 dt:37ms tok/s:1788149 rem:542s step 1681 (10%) loss:4.0942 lr:0.48 dt:36ms tok/s:1820046 rem:542s step 1682 (10%) loss:4.0848 lr:0.48 dt:36ms tok/s:1815670 rem:542s step 1683 (10%) loss:4.0684 lr:0.48 dt:36ms tok/s:1810730 rem:542s step 1684 (10%) loss:4.0879 lr:0.48 dt:36ms tok/s:1813705 rem:542s step 1685 (10%) loss:4.0952 lr:0.48 dt:36ms tok/s:1814819 rem:542s step 1686 (10%) loss:4.0910 lr:0.48 dt:36ms tok/s:1817855 rem:542s step 1687 (10%) loss:4.0810 lr:0.48 dt:36ms tok/s:1809300 rem:542s step 1688 (10%) loss:4.0948 lr:0.48 dt:36ms tok/s:1811673 rem:542s step 1689 (10%) loss:4.0875 lr:0.48 dt:36ms tok/s:1809538 rem:542s step 1690 (10%) loss:4.0818 lr:0.48 dt:36ms tok/s:1818312 rem:542s step 1691 (10%) loss:4.0830 lr:0.48 dt:36ms tok/s:1808740 rem:542s step 1692 (10%) loss:4.0934 lr:0.48 dt:36ms tok/s:1815706 rem:542s step 1693 (10%) loss:4.0924 lr:0.48 dt:36ms tok/s:1812832 rem:542s step 1694 (10%) loss:4.1058 lr:0.48 dt:36ms tok/s:1806221 rem:542s step 1695 (10%) loss:4.1084 lr:0.48 dt:36ms tok/s:1817387 rem:542s step 1696 (10%) loss:4.1044 lr:0.48 dt:36ms tok/s:1815874 rem:542s step 1697 (10%) loss:4.1006 lr:0.49 dt:36ms tok/s:1814268 rem:542s step 1698 (10%) loss:4.0972 lr:0.49 dt:36ms tok/s:1813274 rem:542s step 1699 (10%) loss:4.0886 lr:0.49 dt:36ms tok/s:1818469 rem:542s step 1700 (10%) loss:4.0847 lr:0.49 dt:36ms tok/s:1812270 rem:542s + local: attn=[0.111, 0.484, 0.637] mlp=[0.262, 0.160, -0.167] + + transition: attn=[2.504, 1.296] mlp=[0.043, 0.243] + + hierarchy: attn=[1.594, 2.378, 3.281] mlp=[1.098, -0.350, 0.112] + step 1701 (10%) loss:4.0840 lr:0.49 dt:36ms tok/s:1814124 rem:542s step 1702 (10%) loss:4.0681 lr:0.49 dt:36ms tok/s:1812880 rem:542s step 1703 (10%) loss:4.0667 lr:0.49 dt:36ms tok/s:1803329 rem:542s step 1704 (10%) loss:4.0622 lr:0.49 dt:36ms tok/s:1816054 rem:542s step 1705 (10%) loss:4.0605 lr:0.49 dt:36ms tok/s:1811279 rem:541s step 1706 (10%) loss:4.0614 lr:0.49 dt:36ms tok/s:1810218 rem:541s step 1707 (10%) loss:4.0615 lr:0.49 dt:36ms tok/s:1815874 rem:541s step 1708 (10%) loss:4.0670 lr:0.49 dt:36ms tok/s:1808217 rem:541s step 1709 (10%) loss:4.0585 lr:0.49 dt:36ms tok/s:1812067 rem:541s step 1710 (10%) loss:4.0651 lr:0.49 dt:36ms tok/s:1801285 rem:541s step 1711 (10%) loss:4.0663 lr:0.49 dt:39ms tok/s:1667948 rem:541s step 1712 (10%) loss:4.0689 lr:0.49 dt:37ms tok/s:1786627 rem:541s step 1713 (10%) loss:4.0589 lr:0.49 dt:36ms tok/s:1836094 rem:541s step 1714 (10%) loss:4.0691 lr:0.49 dt:36ms tok/s:1834244 rem:541s step 1715 (10%) loss:4.0633 lr:0.49 dt:35ms tok/s:1859193 rem:541s step 1716 (10%) loss:4.0543 lr:0.49 dt:35ms tok/s:1860967 rem:541s step 1717 (10%) loss:4.0664 lr:0.49 dt:35ms tok/s:1860199 rem:541s step 1718 (10%) loss:4.0734 lr:0.49 dt:35ms tok/s:1867047 rem:541s step 1719 (10%) loss:4.0625 lr:0.49 dt:35ms tok/s:1862089 rem:541s step 1720 (10%) loss:4.0631 lr:0.49 dt:35ms tok/s:1866591 rem:541s step 1721 (10%) loss:4.0509 lr:0.49 dt:36ms tok/s:1840064 rem:541s step 1722 (10%) loss:4.0540 lr:0.49 dt:35ms tok/s:1861976 rem:541s step 1723 (10%) loss:4.0457 lr:0.49 dt:36ms tok/s:1844657 rem:541s step 1724 (10%) loss:4.0356 lr:0.49 dt:36ms tok/s:1826213 rem:541s step 1725 (10%) loss:4.0272 lr:0.49 dt:36ms tok/s:1824443 rem:541s step 1726 (10%) loss:4.0284 lr:0.49 dt:36ms tok/s:1822520 rem:541s step 1727 (10%) loss:4.0317 lr:0.49 dt:37ms tok/s:1774986 rem:541s step 1728 (10%) loss:4.0266 lr:0.49 dt:36ms tok/s:1815071 rem:541s step 1729 (10%) loss:4.0236 lr:0.49 dt:36ms tok/s:1826310 rem:541s step 1730 (10%) loss:4.0119 lr:0.49 dt:36ms tok/s:1823475 rem:541s step 1731 (10%) loss:4.0079 lr:0.50 dt:36ms tok/s:1819805 rem:541s step 1732 (10%) loss:4.0208 lr:0.50 dt:36ms tok/s:1819444 rem:540s step 1733 (10%) loss:4.0326 lr:0.50 dt:36ms tok/s:1797187 rem:540s step 1734 (10%) loss:4.0246 lr:0.50 dt:36ms tok/s:1832519 rem:540s step 1735 (10%) loss:4.0245 lr:0.50 dt:36ms tok/s:1817988 rem:540s step 1736 (10%) loss:4.0259 lr:0.50 dt:36ms tok/s:1807694 rem:540s step 1737 (10%) loss:4.0326 lr:0.50 dt:36ms tok/s:1811936 rem:540s step 1738 (10%) loss:4.0423 lr:0.50 dt:36ms tok/s:1815142 rem:540s step 1739 (10%) loss:4.0507 lr:0.50 dt:36ms tok/s:1821529 rem:540s step 1740 (10%) loss:4.0628 lr:0.50 dt:36ms tok/s:1813454 rem:540s step 1741 (10%) loss:4.0777 lr:0.50 dt:36ms tok/s:1814675 rem:540s step 1742 (10%) loss:4.0765 lr:0.50 dt:36ms tok/s:1816522 rem:540s step 1743 (10%) loss:4.0729 lr:0.50 dt:36ms tok/s:1808729 rem:540s step 1744 (10%) loss:4.0698 lr:0.50 dt:36ms tok/s:1817819 rem:540s step 1745 (10%) loss:4.0517 lr:0.50 dt:36ms tok/s:1816138 rem:540s step 1746 (10%) loss:4.0491 lr:0.50 dt:36ms tok/s:1815226 rem:540s step 1747 (10%) loss:4.0820 lr:0.50 dt:36ms tok/s:1810194 rem:540s step 1748 (10%) loss:4.0890 lr:0.50 dt:36ms tok/s:1811184 rem:540s step 1749 (10%) loss:4.0907 lr:0.50 dt:36ms tok/s:1816954 rem:540s step 1750 (10%) loss:4.1110 lr:0.50 dt:36ms tok/s:1808062 rem:540s step 1751 (10%) loss:4.0923 lr:0.50 dt:36ms tok/s:1811625 rem:540s step 1752 (10%) loss:4.0971 lr:0.50 dt:36ms tok/s:1814627 rem:540s step 1753 (10%) loss:4.0938 lr:0.50 dt:36ms tok/s:1815238 rem:540s step 1754 (10%) loss:4.0968 lr:0.50 dt:36ms tok/s:1802549 rem:540s step 1755 (10%) loss:4.0972 lr:0.50 dt:36ms tok/s:1796576 rem:540s step 1756 (10%) loss:4.0873 lr:0.50 dt:37ms tok/s:1771828 rem:540s step 1757 (10%) loss:4.0876 lr:0.50 dt:36ms tok/s:1820878 rem:540s step 1758 (10%) loss:4.0720 lr:0.50 dt:36ms tok/s:1820130 rem:540s step 1759 (10%) loss:4.0683 lr:0.50 dt:36ms tok/s:1816990 rem:540s step 1760 (10%) loss:4.0556 lr:0.50 dt:36ms tok/s:1818589 rem:539s step 1761 (10%) loss:4.0550 lr:0.50 dt:36ms tok/s:1822206 rem:539s step 1762 (10%) loss:4.0544 lr:0.50 dt:36ms tok/s:1820070 rem:539s step 1763 (10%) loss:4.0563 lr:0.50 dt:36ms tok/s:1823088 rem:539s step 1764 (10%) loss:4.0449 lr:0.51 dt:36ms tok/s:1814232 rem:539s step 1765 (10%) loss:4.0415 lr:0.51 dt:36ms tok/s:1816474 rem:539s step 1766 (10%) loss:4.0370 lr:0.51 dt:36ms tok/s:1808729 rem:539s step 1767 (10%) loss:4.0376 lr:0.51 dt:43ms tok/s:1516593 rem:539s step 1768 (10%) loss:4.0431 lr:0.51 dt:36ms tok/s:1840939 rem:539s step 1769 (10%) loss:4.0451 lr:0.51 dt:35ms tok/s:1848528 rem:539s step 1770 (10%) loss:4.0337 lr:0.51 dt:36ms tok/s:1836805 rem:539s step 1771 (10%) loss:4.0332 lr:0.51 dt:35ms tok/s:1860338 rem:539s step 1772 (10%) loss:4.0408 lr:0.51 dt:36ms tok/s:1833987 rem:539s step 1773 (10%) loss:4.0316 lr:0.51 dt:36ms tok/s:1827974 rem:539s step 1774 (10%) loss:4.0112 lr:0.51 dt:36ms tok/s:1838292 rem:539s step 1775 (10%) loss:4.0129 lr:0.51 dt:35ms tok/s:1858665 rem:539s step 1776 (10%) loss:4.0195 lr:0.51 dt:35ms tok/s:1881836 rem:539s step 1777 (10%) loss:4.0325 lr:0.51 dt:35ms tok/s:1872874 rem:539s step 1778 (10%) loss:4.0267 lr:0.51 dt:35ms tok/s:1863402 rem:539s step 1779 (10%) loss:4.0200 lr:0.51 dt:36ms tok/s:1834856 rem:539s step 1780 (10%) loss:4.0084 lr:0.51 dt:35ms tok/s:1858363 rem:539s step 1781 (10%) loss:4.0074 lr:0.51 dt:36ms tok/s:1815334 rem:539s step 1782 (10%) loss:4.0134 lr:0.51 dt:35ms tok/s:1860363 rem:539s step 1783 (10%) loss:4.0025 lr:0.51 dt:36ms tok/s:1842506 rem:539s step 1784 (10%) loss:4.0152 lr:0.51 dt:36ms tok/s:1802076 rem:539s step 1785 (10%) loss:4.0258 lr:0.51 dt:35ms tok/s:1860816 rem:539s step 1786 (10%) loss:4.0054 lr:0.51 dt:35ms tok/s:1850606 rem:539s step 1787 (10%) loss:3.9965 lr:0.51 dt:35ms tok/s:1853701 rem:539s step 1788 (10%) loss:3.9967 lr:0.51 dt:35ms tok/s:1859947 rem:538s step 1789 (10%) loss:4.0034 lr:0.51 dt:35ms tok/s:1847583 rem:538s step 1790 (10%) loss:4.0007 lr:0.51 dt:36ms tok/s:1813047 rem:538s step 1791 (10%) loss:4.0106 lr:0.51 dt:35ms tok/s:1858551 rem:538s step 1792 (10%) loss:4.0091 lr:0.51 dt:36ms tok/s:1832361 rem:538s step 1793 (10%) loss:3.9970 lr:0.51 dt:36ms tok/s:1815035 rem:538s step 1794 (10%) loss:4.0012 lr:0.51 dt:37ms tok/s:1755724 rem:538s step 1795 (10%) loss:3.9854 lr:0.51 dt:36ms tok/s:1824734 rem:538s step 1796 (10%) loss:3.9847 lr:0.51 dt:36ms tok/s:1833766 rem:538s step 1797 (10%) loss:3.9823 lr:0.52 dt:36ms tok/s:1838919 rem:538s step 1798 (10%) loss:3.9767 lr:0.52 dt:36ms tok/s:1840902 rem:538s step 1799 (10%) loss:3.9940 lr:0.52 dt:36ms tok/s:1831945 rem:538s step 1800 (10%) loss:3.9984 lr:0.52 dt:35ms tok/s:1861056 rem:538s + local: attn=[0.111, 0.471, 0.639] mlp=[0.263, 0.167, -0.172] + + transition: attn=[2.989, 1.214] mlp=[0.036, 0.278] + + hierarchy: attn=[2.004, 2.658, 3.759] mlp=[1.300, -0.259, 0.131] + step 1801 (10%) loss:4.0015 lr:0.52 dt:35ms tok/s:1864350 rem:538s step 1802 (10%) loss:3.9946 lr:0.52 dt:35ms tok/s:1859142 rem:538s step 1803 (10%) loss:3.9954 lr:0.52 dt:35ms tok/s:1874969 rem:538s step 1804 (10%) loss:4.0021 lr:0.52 dt:35ms tok/s:1874905 rem:538s step 1805 (10%) loss:3.9950 lr:0.52 dt:35ms tok/s:1862556 rem:538s step 1806 (10%) loss:3.9861 lr:0.52 dt:35ms tok/s:1862821 rem:538s step 1807 (10%) loss:3.9894 lr:0.52 dt:35ms tok/s:1869485 rem:538s step 1808 (10%) loss:3.9826 lr:0.52 dt:35ms tok/s:1875327 rem:538s step 1809 (10%) loss:3.9827 lr:0.52 dt:35ms tok/s:1873269 rem:538s step 1810 (10%) loss:3.9687 lr:0.52 dt:35ms tok/s:1870058 rem:538s step 1811 (10%) loss:3.9898 lr:0.52 dt:46ms tok/s:1418710 rem:538s step 1812 (10%) loss:3.9949 lr:0.52 dt:34ms tok/s:1953271 rem:538s step 1813 (10%) loss:4.0000 lr:0.52 dt:34ms tok/s:1934384 rem:538s step 1814 (10%) loss:3.9975 lr:0.52 dt:35ms tok/s:1887988 rem:538s step 1815 (10%) loss:3.9895 lr:0.52 dt:34ms tok/s:1933758 rem:538s step 1816 (10%) loss:3.9950 lr:0.52 dt:34ms tok/s:1904945 rem:537s step 1817 (10%) loss:3.9974 lr:0.52 dt:35ms tok/s:1893464 rem:537s step 1818 (10%) loss:3.9844 lr:0.52 dt:35ms tok/s:1891392 rem:537s step 1819 (10%) loss:3.9567 lr:0.52 dt:34ms tok/s:1916073 rem:537s step 1820 (10%) loss:3.9248 lr:0.52 dt:34ms tok/s:1900559 rem:537s step 1821 (10%) loss:3.9556 lr:0.52 dt:34ms tok/s:1900743 rem:537s step 1822 (10%) loss:3.9775 lr:0.52 dt:35ms tok/s:1893738 rem:537s step 1823 (10%) loss:3.9897 lr:0.52 dt:35ms tok/s:1862165 rem:537s step 1824 (10%) loss:4.0358 lr:0.52 dt:35ms tok/s:1863288 rem:537s step 1825 (10%) loss:4.0353 lr:0.52 dt:41ms tok/s:1599066 rem:537s step 1826 (10%) loss:4.0474 lr:0.52 dt:35ms tok/s:1878929 rem:537s step 1827 (10%) loss:4.0599 lr:0.52 dt:35ms tok/s:1858464 rem:537s step 1828 (10%) loss:4.0479 lr:0.52 dt:35ms tok/s:1877992 rem:537s step 1829 (10%) loss:4.0470 lr:0.52 dt:35ms tok/s:1883719 rem:537s step 1830 (10%) loss:4.0367 lr:0.52 dt:35ms tok/s:1879648 rem:537s step 1831 (11%) loss:4.0266 lr:0.53 dt:35ms tok/s:1880278 rem:537s step 1832 (11%) loss:4.0228 lr:0.53 dt:35ms tok/s:1877351 rem:537s step 1833 (11%) loss:4.0216 lr:0.53 dt:35ms tok/s:1880561 rem:537s step 1834 (11%) loss:4.0144 lr:0.53 dt:35ms tok/s:1866173 rem:537s step 1835 (11%) loss:4.0065 lr:0.53 dt:35ms tok/s:1870401 rem:537s step 1836 (11%) loss:4.0026 lr:0.53 dt:35ms tok/s:1879931 rem:537s step 1837 (11%) loss:3.9912 lr:0.53 dt:35ms tok/s:1865881 rem:537s step 1838 (11%) loss:3.9763 lr:0.53 dt:35ms tok/s:1859167 rem:537s step 1839 (11%) loss:3.9784 lr:0.53 dt:35ms tok/s:1874278 rem:537s step 1840 (11%) loss:4.0102 lr:0.53 dt:35ms tok/s:1861371 rem:537s step 1841 (11%) loss:4.0393 lr:0.53 dt:35ms tok/s:1867720 rem:537s step 1842 (11%) loss:4.0145 lr:0.53 dt:35ms tok/s:1865489 rem:537s step 1843 (11%) loss:3.9787 lr:0.53 dt:35ms tok/s:1859419 rem:537s step 1844 (11%) loss:3.9473 lr:0.53 dt:35ms tok/s:1866198 rem:536s step 1845 (11%) loss:3.9068 lr:0.53 dt:36ms tok/s:1825946 rem:536s step 1846 (11%) loss:3.8714 lr:0.53 dt:37ms tok/s:1765637 rem:536s step 1847 (11%) loss:3.8526 lr:0.53 dt:43ms tok/s:1527346 rem:536s step 1848 (11%) loss:3.8275 lr:0.53 dt:35ms tok/s:1873882 rem:536s step 1849 (11%) loss:3.8603 lr:0.53 dt:35ms tok/s:1876697 rem:536s step 1850 (11%) loss:3.8834 lr:0.53 dt:35ms tok/s:1882945 rem:536s step 1851 (11%) loss:3.8981 lr:0.53 dt:35ms tok/s:1882468 rem:536s step 1852 (11%) loss:3.9195 lr:0.53 dt:35ms tok/s:1883590 rem:536s step 1853 (11%) loss:3.9249 lr:0.53 dt:35ms tok/s:1880999 rem:536s step 1854 (11%) loss:3.9474 lr:0.53 dt:35ms tok/s:1885075 rem:536s step 1855 (11%) loss:3.9605 lr:0.53 dt:35ms tok/s:1876851 rem:536s step 1856 (11%) loss:3.9553 lr:0.53 dt:35ms tok/s:1870554 rem:536s step 1857 (11%) loss:3.9599 lr:0.53 dt:35ms tok/s:1862733 rem:536s step 1858 (11%) loss:3.9735 lr:0.53 dt:35ms tok/s:1869969 rem:536s step 1859 (11%) loss:3.9795 lr:0.53 dt:35ms tok/s:1869523 rem:536s step 1860 (11%) loss:3.9864 lr:0.53 dt:35ms tok/s:1877838 rem:536s step 1861 (11%) loss:3.9909 lr:0.53 dt:40ms tok/s:1649403 rem:536s step 1862 (11%) loss:3.9899 lr:0.53 dt:35ms tok/s:1875109 rem:536s step 1863 (11%) loss:3.9825 lr:0.53 dt:35ms tok/s:1899181 rem:536s step 1864 (11%) loss:3.9813 lr:0.53 dt:35ms tok/s:1897581 rem:536s step 1865 (11%) loss:3.9811 lr:0.54 dt:35ms tok/s:1859645 rem:536s step 1866 (11%) loss:4.0024 lr:0.54 dt:35ms tok/s:1874815 rem:536s step 1867 (11%) loss:4.0064 lr:0.54 dt:35ms tok/s:1876210 rem:536s step 1868 (11%) loss:4.0052 lr:0.54 dt:35ms tok/s:1864843 rem:536s step 1869 (11%) loss:4.0025 lr:0.54 dt:35ms tok/s:1867618 rem:536s step 1870 (11%) loss:3.9989 lr:0.54 dt:35ms tok/s:1874163 rem:536s step 1871 (11%) loss:4.0124 lr:0.54 dt:35ms tok/s:1871127 rem:536s step 1872 (11%) loss:4.0161 lr:0.54 dt:35ms tok/s:1852215 rem:536s step 1873 (11%) loss:4.0242 lr:0.54 dt:35ms tok/s:1874764 rem:535s step 1874 (11%) loss:4.0126 lr:0.54 dt:35ms tok/s:1867441 rem:535s step 1875 (11%) loss:4.0034 lr:0.54 dt:35ms tok/s:1871293 rem:535s step 1876 (11%) loss:4.0097 lr:0.54 dt:35ms tok/s:1872669 rem:535s step 1877 (11%) loss:4.0167 lr:0.54 dt:35ms tok/s:1860476 rem:535s step 1878 (11%) loss:4.0156 lr:0.54 dt:35ms tok/s:1869002 rem:535s step 1879 (11%) loss:4.0062 lr:0.54 dt:35ms tok/s:1865552 rem:535s step 1880 (11%) loss:4.0007 lr:0.54 dt:35ms tok/s:1860048 rem:535s step 1881 (11%) loss:4.0117 lr:0.54 dt:35ms tok/s:1862266 rem:535s step 1882 (11%) loss:4.0061 lr:0.54 dt:35ms tok/s:1867517 rem:535s step 1883 (11%) loss:3.9922 lr:0.54 dt:35ms tok/s:1864401 rem:535s step 1884 (11%) loss:3.9907 lr:0.54 dt:35ms tok/s:1874393 rem:535s step 1885 (11%) loss:3.9805 lr:0.54 dt:35ms tok/s:1864502 rem:535s step 1886 (11%) loss:3.9625 lr:0.54 dt:35ms tok/s:1870058 rem:535s step 1887 (11%) loss:3.9600 lr:0.54 dt:35ms tok/s:1871191 rem:535s step 1888 (11%) loss:3.9667 lr:0.54 dt:35ms tok/s:1864704 rem:535s step 1889 (11%) loss:3.9574 lr:0.54 dt:35ms tok/s:1857509 rem:535s step 1890 (11%) loss:3.9488 lr:0.54 dt:36ms tok/s:1835383 rem:535s step 1891 (11%) loss:3.9598 lr:0.54 dt:35ms tok/s:1846665 rem:535s step 1892 (11%) loss:3.9654 lr:0.54 dt:36ms tok/s:1843235 rem:535s step 1893 (11%) loss:3.9627 lr:0.54 dt:35ms tok/s:1847732 rem:535s step 1894 (11%) loss:3.9610 lr:0.54 dt:36ms tok/s:1837210 rem:535s step 1895 (11%) loss:3.9476 lr:0.54 dt:35ms tok/s:1849772 rem:535s step 1896 (11%) loss:3.9440 lr:0.54 dt:36ms tok/s:1811745 rem:535s step 1897 (11%) loss:3.9481 lr:0.54 dt:36ms tok/s:1843556 rem:535s step 1898 (11%) loss:3.9390 lr:0.54 dt:36ms tok/s:1833717 rem:535s step 1899 (11%) loss:3.9283 lr:0.55 dt:37ms tok/s:1790316 rem:535s step 1900 (11%) loss:3.9329 lr:0.55 dt:36ms tok/s:1799940 rem:535s + local: attn=[0.098, 0.481, 0.660] mlp=[0.251, 0.172, -0.186] + + transition: attn=[2.984, 1.136] mlp=[0.039, 0.232] + + hierarchy: attn=[2.263, 3.146, 4.297] mlp=[1.463, -0.284, 0.107] + step 1901 (11%) loss:3.9277 lr:0.55 dt:36ms tok/s:1809050 rem:534s step 1902 (11%) loss:3.9239 lr:0.55 dt:36ms tok/s:1814100 rem:534s step 1903 (11%) loss:3.9166 lr:0.55 dt:36ms tok/s:1806423 rem:534s step 1904 (11%) loss:3.9394 lr:0.55 dt:36ms tok/s:1809598 rem:534s step 1905 (11%) loss:3.9514 lr:0.55 dt:36ms tok/s:1804383 rem:534s step 1906 (11%) loss:3.9611 lr:0.55 dt:36ms tok/s:1810039 rem:534s step 1907 (11%) loss:3.9472 lr:0.55 dt:36ms tok/s:1825182 rem:534s step 1908 (11%) loss:3.9329 lr:0.55 dt:36ms tok/s:1805201 rem:534s step 1909 (11%) loss:3.9257 lr:0.55 dt:36ms tok/s:1818324 rem:534s step 1910 (11%) loss:3.9312 lr:0.55 dt:36ms tok/s:1812605 rem:534s step 1911 (11%) loss:3.9249 lr:0.55 dt:36ms tok/s:1820576 rem:534s step 1912 (11%) loss:3.9321 lr:0.55 dt:36ms tok/s:1816066 rem:534s step 1913 (11%) loss:3.9162 lr:0.55 dt:36ms tok/s:1820130 rem:534s step 1914 (11%) loss:3.9155 lr:0.55 dt:36ms tok/s:1805165 rem:534s step 1915 (11%) loss:3.9027 lr:0.55 dt:36ms tok/s:1813681 rem:534s step 1916 (11%) loss:3.9018 lr:0.55 dt:36ms tok/s:1808181 rem:534s step 1917 (11%) loss:3.8887 lr:0.55 dt:36ms tok/s:1806660 rem:534s step 1918 (11%) loss:3.9788 lr:0.55 dt:36ms tok/s:1810027 rem:534s step 1919 (11%) loss:3.9754 lr:0.55 dt:36ms tok/s:1806268 rem:534s step 1920 (11%) loss:3.9552 lr:0.55 dt:36ms tok/s:1803010 rem:534s step 1921 (11%) loss:3.9487 lr:0.55 dt:36ms tok/s:1808812 rem:534s step 1922 (11%) loss:3.9418 lr:0.55 dt:36ms tok/s:1811984 rem:534s step 1923 (11%) loss:3.9290 lr:0.55 dt:36ms tok/s:1802561 rem:534s step 1924 (11%) loss:3.9254 lr:0.55 dt:36ms tok/s:1804715 rem:534s step 1925 (11%) loss:3.9308 lr:0.55 dt:36ms tok/s:1826261 rem:534s step 1926 (11%) loss:3.9414 lr:0.55 dt:36ms tok/s:1823100 rem:534s step 1927 (11%) loss:3.9269 lr:0.55 dt:36ms tok/s:1813885 rem:534s step 1928 (11%) loss:3.9222 lr:0.55 dt:36ms tok/s:1810885 rem:533s step 1929 (11%) loss:3.8984 lr:0.55 dt:42ms tok/s:1562277 rem:533s step 1930 (11%) loss:3.8904 lr:0.55 dt:38ms tok/s:1729782 rem:533s step 1931 (11%) loss:3.8909 lr:0.55 dt:36ms tok/s:1824746 rem:533s step 1932 (11%) loss:3.8808 lr:0.56 dt:36ms tok/s:1827755 rem:533s step 1933 (11%) loss:3.8811 lr:0.56 dt:36ms tok/s:1826067 rem:533s step 1934 (11%) loss:3.8629 lr:0.56 dt:36ms tok/s:1818866 rem:533s step 1935 (11%) loss:3.8669 lr:0.56 dt:36ms tok/s:1819697 rem:533s step 1936 (11%) loss:3.8840 lr:0.56 dt:37ms tok/s:1768113 rem:533s step 1937 (11%) loss:3.8810 lr:0.56 dt:36ms tok/s:1818914 rem:533s step 1938 (11%) loss:3.8719 lr:0.56 dt:36ms tok/s:1814447 rem:533s step 1939 (11%) loss:3.8662 lr:0.56 dt:36ms tok/s:1818962 rem:533s step 1940 (11%) loss:3.8657 lr:0.56 dt:36ms tok/s:1829689 rem:533s step 1941 (11%) loss:3.8819 lr:0.56 dt:36ms tok/s:1814939 rem:533s step 1942 (11%) loss:3.9004 lr:0.56 dt:36ms tok/s:1821457 rem:533s step 1943 (11%) loss:3.9030 lr:0.56 dt:36ms tok/s:1820721 rem:533s step 1944 (11%) loss:3.8980 lr:0.56 dt:36ms tok/s:1831506 rem:533s step 1945 (11%) loss:3.8916 lr:0.56 dt:36ms tok/s:1826249 rem:533s step 1946 (11%) loss:3.8698 lr:0.56 dt:36ms tok/s:1809824 rem:533s step 1947 (11%) loss:3.8454 lr:0.56 dt:37ms tok/s:1749077 rem:533s step 1948 (11%) loss:3.8401 lr:0.56 dt:37ms tok/s:1768158 rem:533s step 1949 (11%) loss:3.8734 lr:0.56 dt:36ms tok/s:1799328 rem:533s step 1950 (11%) loss:3.8965 lr:0.56 dt:36ms tok/s:1820914 rem:533s step 1951 (11%) loss:3.9174 lr:0.56 dt:36ms tok/s:1800919 rem:533s step 1952 (11%) loss:3.9123 lr:0.56 dt:36ms tok/s:1807920 rem:533s step 1953 (11%) loss:3.9198 lr:0.56 dt:36ms tok/s:1816594 rem:533s step 1954 (11%) loss:3.9210 lr:0.56 dt:37ms tok/s:1759230 rem:533s step 1955 (11%) loss:3.9217 lr:0.56 dt:36ms tok/s:1826140 rem:533s step 1956 (11%) loss:3.9165 lr:0.56 dt:36ms tok/s:1803081 rem:532s step 1957 (11%) loss:3.9108 lr:0.56 dt:36ms tok/s:1815274 rem:532s step 1958 (11%) loss:3.9223 lr:0.56 dt:36ms tok/s:1814519 rem:532s step 1959 (11%) loss:3.9520 lr:0.56 dt:36ms tok/s:1834048 rem:532s step 1960 (11%) loss:4.0058 lr:0.56 dt:36ms tok/s:1840655 rem:532s step 1961 (11%) loss:4.0040 lr:0.56 dt:36ms tok/s:1825788 rem:532s step 1962 (11%) loss:4.0372 lr:0.56 dt:35ms tok/s:1860199 rem:532s step 1963 (11%) loss:4.0757 lr:0.56 dt:35ms tok/s:1848876 rem:532s step 1964 (11%) loss:4.0527 lr:0.56 dt:36ms tok/s:1821940 rem:532s step 1965 (11%) loss:4.0385 lr:0.57 dt:35ms tok/s:1877056 rem:532s step 1966 (11%) loss:4.0297 lr:0.57 dt:35ms tok/s:1883887 rem:532s step 1967 (11%) loss:4.0165 lr:0.57 dt:35ms tok/s:1882493 rem:532s step 1968 (11%) loss:3.9874 lr:0.57 dt:35ms tok/s:1851815 rem:532s step 1969 (11%) loss:3.9807 lr:0.57 dt:35ms tok/s:1860048 rem:532s step 1970 (11%) loss:3.9492 lr:0.57 dt:36ms tok/s:1821445 rem:532s step 1971 (11%) loss:3.9054 lr:0.57 dt:35ms tok/s:1868786 rem:532s step 1972 (11%) loss:3.8993 lr:0.57 dt:35ms tok/s:1880999 rem:532s step 1973 (11%) loss:3.8907 lr:0.57 dt:35ms tok/s:1847248 rem:532s step 1974 (11%) loss:3.9988 lr:0.57 dt:35ms tok/s:1875352 rem:532s step 1975 (11%) loss:3.9848 lr:0.57 dt:35ms tok/s:1876274 rem:532s step 1976 (11%) loss:3.9697 lr:0.57 dt:34ms tok/s:1900533 rem:532s step 1977 (11%) loss:3.9613 lr:0.57 dt:35ms tok/s:1890404 rem:532s step 1978 (11%) loss:3.9484 lr:0.57 dt:35ms tok/s:1873921 rem:532s step 1979 (11%) loss:3.9330 lr:0.57 dt:35ms tok/s:1871242 rem:532s step 1980 (11%) loss:3.9221 lr:0.57 dt:35ms tok/s:1869104 rem:532s step 1981 (11%) loss:3.8989 lr:0.57 dt:35ms tok/s:1876248 rem:532s step 1982 (11%) loss:3.8840 lr:0.57 dt:35ms tok/s:1875506 rem:532s step 1983 (11%) loss:3.8804 lr:0.57 dt:35ms tok/s:1882596 rem:532s step 1984 (11%) loss:3.8983 lr:0.57 dt:35ms tok/s:1875711 rem:531s step 1985 (11%) loss:3.8869 lr:0.57 dt:35ms tok/s:1877607 rem:531s step 1986 (11%) loss:3.8835 lr:0.57 dt:35ms tok/s:1863705 rem:531s step 1987 (11%) loss:3.9026 lr:0.57 dt:35ms tok/s:1867415 rem:531s step 1988 (11%) loss:3.9032 lr:0.57 dt:35ms tok/s:1864565 rem:531s step 1989 (11%) loss:3.9027 lr:0.57 dt:35ms tok/s:1868227 rem:531s step 1990 (11%) loss:3.9067 lr:0.57 dt:37ms tok/s:1788603 rem:531s step 1991 (11%) loss:3.9004 lr:0.57 dt:35ms tok/s:1853451 rem:531s step 1992 (11%) loss:3.9050 lr:0.57 dt:35ms tok/s:1861471 rem:531s step 1993 (11%) loss:3.9072 lr:0.57 dt:35ms tok/s:1860829 rem:531s step 1994 (11%) loss:3.9000 lr:0.57 dt:35ms tok/s:1864603 rem:531s step 1995 (11%) loss:3.8885 lr:0.57 dt:36ms tok/s:1844224 rem:531s step 1996 (11%) loss:3.9105 lr:0.57 dt:35ms tok/s:1847881 rem:531s step 1997 (11%) loss:3.9142 lr:0.57 dt:36ms tok/s:1840138 rem:531s step 1998 (11%) loss:3.9097 lr:0.57 dt:36ms tok/s:1844645 rem:531s step 1999 (12%) loss:3.8932 lr:0.58 dt:36ms tok/s:1838919 rem:531s step 2000 (12%) loss:3.9028 lr:0.58 dt:36ms tok/s:1835321 rem:531s + local: attn=[0.087, 0.488, 0.687] mlp=[0.264, 0.167, -0.194] + + transition: attn=[2.930, 1.027] mlp=[0.015, 0.244] + + hierarchy: attn=[2.480, 3.845, 4.861] mlp=[1.538, -0.377, 0.001] + step 2001 (12%) loss:3.9147 lr:0.58 dt:36ms tok/s:1837653 rem:531s step 2002 (12%) loss:3.9120 lr:0.58 dt:35ms tok/s:1846504 rem:531s step 2003 (12%) loss:3.9060 lr:0.58 dt:41ms tok/s:1603825 rem:531s step 2004 (12%) loss:3.9066 lr:0.58 dt:36ms tok/s:1810814 rem:531s step 2005 (12%) loss:3.8900 lr:0.58 dt:35ms tok/s:1852414 rem:531s step 2006 (12%) loss:3.8795 lr:0.58 dt:35ms tok/s:1849100 rem:531s step 2007 (12%) loss:3.8601 lr:0.58 dt:35ms tok/s:1859897 rem:531s step 2008 (12%) loss:3.8668 lr:0.58 dt:36ms tok/s:1836204 rem:531s step 2009 (12%) loss:3.8598 lr:0.58 dt:36ms tok/s:1833265 rem:531s step 2010 (12%) loss:3.8548 lr:0.58 dt:35ms tok/s:1855302 rem:531s step 2011 (12%) loss:3.8482 lr:0.58 dt:35ms tok/s:1855152 rem:531s step 2012 (12%) loss:3.8352 lr:0.58 dt:35ms tok/s:1854251 rem:530s step 2013 (12%) loss:3.8331 lr:0.58 dt:35ms tok/s:1853551 rem:530s step 2014 (12%) loss:3.8232 lr:0.58 dt:35ms tok/s:1852689 rem:530s step 2015 (12%) loss:3.8335 lr:0.58 dt:42ms tok/s:1547527 rem:530s step 2016 (12%) loss:3.8406 lr:0.58 dt:39ms tok/s:1671133 rem:530s step 2017 (12%) loss:3.8380 lr:0.58 dt:35ms tok/s:1899377 rem:530s step 2018 (12%) loss:3.8295 lr:0.58 dt:34ms tok/s:1908358 rem:530s step 2019 (12%) loss:3.8435 lr:0.58 dt:34ms tok/s:1903534 rem:530s step 2020 (12%) loss:3.8400 lr:0.58 dt:34ms tok/s:1908503 rem:530s step 2021 (12%) loss:3.8314 lr:0.58 dt:34ms tok/s:1901033 rem:530s step 2022 (12%) loss:3.8162 lr:0.58 dt:35ms tok/s:1872567 rem:530s step 2023 (12%) loss:3.8234 lr:0.58 dt:34ms tok/s:1909060 rem:530s step 2024 (12%) loss:3.8304 lr:0.58 dt:35ms tok/s:1895814 rem:530s step 2025 (12%) loss:3.8428 lr:0.58 dt:35ms tok/s:1879032 rem:530s step 2026 (12%) loss:3.8389 lr:0.58 dt:35ms tok/s:1888754 rem:530s step 2027 (12%) loss:3.8436 lr:0.58 dt:35ms tok/s:1874943 rem:530s step 2028 (12%) loss:3.8348 lr:0.58 dt:35ms tok/s:1883887 rem:530s step 2029 (12%) loss:3.8116 lr:0.58 dt:35ms tok/s:1884390 rem:530s step 2030 (12%) loss:3.8006 lr:0.58 dt:35ms tok/s:1877030 rem:530s step 2031 (12%) loss:3.7864 lr:0.58 dt:35ms tok/s:1889689 rem:530s step 2032 (12%) loss:3.7967 lr:0.58 dt:35ms tok/s:1887016 rem:530s step 2033 (12%) loss:3.7989 lr:0.59 dt:35ms tok/s:1871140 rem:530s step 2034 (12%) loss:3.7957 lr:0.59 dt:35ms tok/s:1880793 rem:530s step 2035 (12%) loss:3.8038 lr:0.59 dt:40ms tok/s:1620742 rem:530s step 2036 (12%) loss:3.8018 lr:0.59 dt:35ms tok/s:1857258 rem:530s step 2037 (12%) loss:3.8031 lr:0.59 dt:35ms tok/s:1868050 rem:530s step 2038 (12%) loss:3.8014 lr:0.59 dt:34ms tok/s:1909391 rem:530s step 2039 (12%) loss:3.8006 lr:0.59 dt:34ms tok/s:1911848 rem:530s step 2040 (12%) loss:3.8021 lr:0.59 dt:34ms tok/s:1909551 rem:530s step 2041 (12%) loss:3.7979 lr:0.59 dt:34ms tok/s:1909405 rem:529s step 2042 (12%) loss:3.7956 lr:0.59 dt:34ms tok/s:1917075 rem:529s step 2043 (12%) loss:3.7868 lr:0.59 dt:34ms tok/s:1913232 rem:529s step 2044 (12%) loss:3.7870 lr:0.59 dt:35ms tok/s:1895252 rem:529s step 2045 (12%) loss:3.7775 lr:0.59 dt:34ms tok/s:1899850 rem:529s step 2046 (12%) loss:3.7849 lr:0.59 dt:35ms tok/s:1894886 rem:529s step 2047 (12%) loss:3.7814 lr:0.59 dt:34ms tok/s:1904972 rem:529s step 2048 (12%) loss:3.7859 lr:0.59 dt:35ms tok/s:1898263 rem:529s step 2049 (12%) loss:3.8033 lr:0.59 dt:35ms tok/s:1890963 rem:529s step 2050 (12%) loss:3.8070 lr:0.59 dt:35ms tok/s:1890404 rem:529s step 2051 (12%) loss:3.7930 lr:0.59 dt:35ms tok/s:1885334 rem:529s step 2052 (12%) loss:3.9093 lr:0.59 dt:35ms tok/s:1899509 rem:529s step 2053 (12%) loss:3.9463 lr:0.59 dt:35ms tok/s:1872733 rem:529s step 2054 (12%) loss:3.9130 lr:0.59 dt:35ms tok/s:1871114 rem:529s step 2055 (12%) loss:3.8597 lr:0.59 dt:35ms tok/s:1879944 rem:529s step 2056 (12%) loss:3.8137 lr:0.59 dt:35ms tok/s:1879340 rem:529s step 2057 (12%) loss:3.8266 lr:0.59 dt:35ms tok/s:1872555 rem:529s step 2058 (12%) loss:3.8485 lr:0.59 dt:35ms tok/s:1869117 rem:529s step 2059 (12%) loss:3.8547 lr:0.59 dt:35ms tok/s:1870923 rem:529s step 2060 (12%) loss:3.8300 lr:0.59 dt:35ms tok/s:1865096 rem:529s step 2061 (12%) loss:3.8525 lr:0.59 dt:35ms tok/s:1864767 rem:529s step 2062 (12%) loss:3.8684 lr:0.59 dt:35ms tok/s:1879224 rem:529s step 2063 (12%) loss:3.8476 lr:0.59 dt:35ms tok/s:1873818 rem:529s step 2064 (12%) loss:3.8299 lr:0.59 dt:35ms tok/s:1874253 rem:529s step 2065 (12%) loss:3.8099 lr:0.59 dt:37ms tok/s:1749032 rem:529s step 2066 (12%) loss:3.8208 lr:0.59 dt:36ms tok/s:1816246 rem:529s step 2067 (12%) loss:3.8237 lr:0.60 dt:35ms tok/s:1880047 rem:529s step 2068 (12%) loss:3.8251 lr:0.60 dt:35ms tok/s:1879379 rem:529s step 2069 (12%) loss:3.8246 lr:0.60 dt:35ms tok/s:1881282 rem:528s step 2070 (12%) loss:3.8388 lr:0.60 dt:35ms tok/s:1889936 rem:528s step 2071 (12%) loss:3.8357 lr:0.60 dt:35ms tok/s:1884972 rem:528s step 2072 (12%) loss:3.8298 lr:0.60 dt:35ms tok/s:1882004 rem:528s step 2073 (12%) loss:3.8113 lr:0.60 dt:35ms tok/s:1883087 rem:528s step 2074 (12%) loss:3.8071 lr:0.60 dt:35ms tok/s:1883900 rem:528s step 2075 (12%) loss:3.8040 lr:0.60 dt:35ms tok/s:1865324 rem:528s step 2076 (12%) loss:3.8078 lr:0.60 dt:35ms tok/s:1870860 rem:528s step 2077 (12%) loss:3.7974 lr:0.60 dt:35ms tok/s:1866730 rem:528s step 2078 (12%) loss:3.8017 lr:0.60 dt:35ms tok/s:1871739 rem:528s step 2079 (12%) loss:3.8204 lr:0.60 dt:35ms tok/s:1861119 rem:528s step 2080 (12%) loss:3.8241 lr:0.60 dt:35ms tok/s:1852614 rem:528s step 2081 (12%) loss:3.8320 lr:0.60 dt:35ms tok/s:1846367 rem:528s step 2082 (12%) loss:3.8275 lr:0.60 dt:36ms tok/s:1842926 rem:528s step 2083 (12%) loss:3.8156 lr:0.60 dt:35ms tok/s:1849909 rem:528s step 2084 (12%) loss:3.8143 lr:0.60 dt:35ms tok/s:1850506 rem:528s step 2085 (12%) loss:3.8199 lr:0.60 dt:35ms tok/s:1849622 rem:528s step 2086 (12%) loss:3.8431 lr:0.60 dt:35ms tok/s:1851055 rem:528s step 2087 (12%) loss:3.8390 lr:0.60 dt:35ms tok/s:1852776 rem:528s step 2088 (12%) loss:3.8249 lr:0.60 dt:35ms tok/s:1856129 rem:528s step 2089 (12%) loss:3.8044 lr:0.60 dt:36ms tok/s:1826067 rem:528s step 2090 (12%) loss:3.7962 lr:0.60 dt:36ms tok/s:1813239 rem:528s step 2091 (12%) loss:3.8082 lr:0.60 dt:36ms tok/s:1819408 rem:528s step 2092 (12%) loss:3.7903 lr:0.60 dt:36ms tok/s:1824952 rem:528s step 2093 (12%) loss:3.7962 lr:0.60 dt:36ms tok/s:1820106 rem:528s step 2094 (12%) loss:3.8381 lr:0.60 dt:36ms tok/s:1816822 rem:528s step 2095 (12%) loss:3.8428 lr:0.60 dt:36ms tok/s:1816078 rem:528s step 2096 (12%) loss:3.8177 lr:0.60 dt:36ms tok/s:1828326 rem:528s step 2097 (12%) loss:3.8087 lr:0.60 dt:36ms tok/s:1818733 rem:527s step 2098 (12%) loss:3.7943 lr:0.60 dt:36ms tok/s:1822230 rem:527s step 2099 (12%) loss:3.8041 lr:0.60 dt:36ms tok/s:1812724 rem:527s step 2100 (12%) loss:3.7643 lr:0.60 dt:36ms tok/s:1814112 rem:527s + local: attn=[0.089, 0.489, 0.665] mlp=[0.249, 0.179, -0.171] + + transition: attn=[2.934, 0.950] mlp=[0.023, 0.221] + + hierarchy: attn=[2.703, 4.539, 5.425] mlp=[1.557, -0.274, 0.205] + step 2101 (12%) loss:3.7694 lr:0.61 dt:36ms tok/s:1814819 rem:527s step 2102 (12%) loss:3.7589 lr:0.61 dt:36ms tok/s:1813753 rem:527s step 2103 (12%) loss:3.7435 lr:0.61 dt:36ms tok/s:1816858 rem:527s step 2104 (12%) loss:3.7530 lr:0.61 dt:36ms tok/s:1818950 rem:527s step 2105 (12%) loss:3.7588 lr:0.61 dt:36ms tok/s:1812772 rem:527s step 2106 (12%) loss:3.7586 lr:0.61 dt:36ms tok/s:1816498 rem:527s step 2107 (12%) loss:3.7492 lr:0.61 dt:36ms tok/s:1821204 rem:527s step 2108 (12%) loss:3.7528 lr:0.61 dt:36ms tok/s:1818384 rem:527s step 2109 (12%) loss:3.7504 lr:0.61 dt:36ms tok/s:1820383 rem:527s step 2110 (12%) loss:3.7530 lr:0.61 dt:36ms tok/s:1816282 rem:527s step 2111 (12%) loss:3.7575 lr:0.61 dt:36ms tok/s:1812916 rem:527s step 2112 (12%) loss:3.7479 lr:0.61 dt:36ms tok/s:1795801 rem:527s step 2113 (12%) loss:3.7361 lr:0.61 dt:36ms tok/s:1815682 rem:527s step 2114 (12%) loss:3.7396 lr:0.61 dt:36ms tok/s:1816354 rem:527s step 2115 (12%) loss:3.7447 lr:0.61 dt:36ms tok/s:1821276 rem:527s step 2116 (12%) loss:3.7452 lr:0.61 dt:36ms tok/s:1821831 rem:527s step 2117 (12%) loss:3.7478 lr:0.61 dt:36ms tok/s:1821011 rem:527s step 2118 (12%) loss:3.7494 lr:0.61 dt:36ms tok/s:1816342 rem:527s step 2119 (12%) loss:3.7643 lr:0.61 dt:36ms tok/s:1819167 rem:527s step 2120 (12%) loss:3.7583 lr:0.61 dt:36ms tok/s:1820552 rem:527s step 2121 (12%) loss:3.7529 lr:0.61 dt:36ms tok/s:1819660 rem:527s step 2122 (12%) loss:3.7394 lr:0.61 dt:36ms tok/s:1818830 rem:527s step 2123 (12%) loss:3.7377 lr:0.61 dt:36ms tok/s:1819853 rem:527s step 2124 (12%) loss:3.7348 lr:0.61 dt:36ms tok/s:1820468 rem:527s step 2125 (12%) loss:3.7434 lr:0.61 dt:36ms tok/s:1814052 rem:526s step 2126 (12%) loss:3.7417 lr:0.61 dt:36ms tok/s:1818372 rem:526s step 2127 (12%) loss:3.7642 lr:0.61 dt:36ms tok/s:1811267 rem:526s step 2128 (12%) loss:3.7693 lr:0.61 dt:36ms tok/s:1814340 rem:526s step 2129 (12%) loss:3.7818 lr:0.61 dt:36ms tok/s:1816570 rem:526s step 2130 (12%) loss:3.7935 lr:0.61 dt:39ms tok/s:1694789 rem:526s step 2131 (12%) loss:3.8067 lr:0.61 dt:36ms tok/s:1829385 rem:526s step 2132 (12%) loss:3.8015 lr:0.61 dt:36ms tok/s:1826565 rem:526s step 2133 (12%) loss:3.8050 lr:0.61 dt:36ms tok/s:1829482 rem:526s step 2134 (12%) loss:3.8226 lr:0.62 dt:36ms tok/s:1826929 rem:526s step 2135 (12%) loss:3.8287 lr:0.62 dt:36ms tok/s:1827755 rem:526s step 2136 (12%) loss:3.8425 lr:0.62 dt:35ms tok/s:1854939 rem:526s step 2137 (12%) loss:3.8303 lr:0.62 dt:35ms tok/s:1858275 rem:526s step 2138 (12%) loss:3.8079 lr:0.62 dt:35ms tok/s:1858954 rem:526s step 2139 (12%) loss:3.8023 lr:0.62 dt:35ms tok/s:1855140 rem:526s step 2140 (12%) loss:3.8051 lr:0.62 dt:35ms tok/s:1851953 rem:526s step 2141 (12%) loss:3.8063 lr:0.62 dt:35ms tok/s:1848590 rem:526s step 2142 (12%) loss:3.7934 lr:0.62 dt:36ms tok/s:1829227 rem:526s step 2143 (12%) loss:3.7863 lr:0.62 dt:35ms tok/s:1874189 rem:526s step 2144 (12%) loss:3.7811 lr:0.62 dt:35ms tok/s:1874240 rem:526s step 2145 (12%) loss:3.7818 lr:0.62 dt:35ms tok/s:1874087 rem:526s step 2146 (12%) loss:3.7804 lr:0.62 dt:35ms tok/s:1865653 rem:526s step 2147 (12%) loss:3.7774 lr:0.62 dt:35ms tok/s:1874918 rem:526s step 2148 (12%) loss:3.7600 lr:0.62 dt:35ms tok/s:1874777 rem:526s step 2149 (12%) loss:3.7310 lr:0.62 dt:35ms tok/s:1863187 rem:526s step 2150 (12%) loss:3.7415 lr:0.62 dt:35ms tok/s:1876274 rem:526s step 2151 (12%) loss:3.7513 lr:0.62 dt:35ms tok/s:1873512 rem:526s step 2152 (12%) loss:3.7428 lr:0.62 dt:35ms tok/s:1873972 rem:526s step 2153 (12%) loss:3.7588 lr:0.62 dt:35ms tok/s:1874585 rem:525s step 2154 (12%) loss:3.7449 lr:0.62 dt:35ms tok/s:1875583 rem:525s step 2155 (12%) loss:3.7383 lr:0.62 dt:35ms tok/s:1865489 rem:525s step 2156 (12%) loss:3.7321 lr:0.62 dt:35ms tok/s:1871853 rem:525s step 2157 (12%) loss:3.7144 lr:0.62 dt:35ms tok/s:1873422 rem:525s step 2158 (12%) loss:3.7048 lr:0.62 dt:35ms tok/s:1876108 rem:525s step 2159 (12%) loss:3.7040 lr:0.62 dt:35ms tok/s:1874304 rem:525s step 2160 (12%) loss:3.7159 lr:0.62 dt:35ms tok/s:1866933 rem:525s step 2161 (12%) loss:3.7238 lr:0.62 dt:35ms tok/s:1879738 rem:525s step 2162 (12%) loss:3.7245 lr:0.62 dt:35ms tok/s:1868316 rem:525s step 2163 (12%) loss:3.7171 lr:0.62 dt:35ms tok/s:1867390 rem:525s step 2164 (12%) loss:3.7125 lr:0.62 dt:35ms tok/s:1866540 rem:525s step 2165 (12%) loss:3.7043 lr:0.62 dt:35ms tok/s:1868088 rem:525s step 2166 (12%) loss:3.6993 lr:0.62 dt:35ms tok/s:1866413 rem:525s step 2167 (12%) loss:3.6912 lr:0.62 dt:35ms tok/s:1873665 rem:525s step 2168 (12%) loss:3.7034 lr:0.62 dt:35ms tok/s:1863718 rem:525s step 2169 (13%) loss:3.7019 lr:0.63 dt:35ms tok/s:1873882 rem:525s step 2170 (13%) loss:3.7055 lr:0.63 dt:35ms tok/s:1848006 rem:525s step 2171 (13%) loss:3.7154 lr:0.63 dt:35ms tok/s:1848043 rem:525s step 2172 (13%) loss:3.7088 lr:0.63 dt:35ms tok/s:1848515 rem:525s step 2173 (13%) loss:3.6865 lr:0.63 dt:35ms tok/s:1851291 rem:525s step 2174 (13%) loss:3.6856 lr:0.63 dt:36ms tok/s:1844175 rem:525s step 2175 (13%) loss:3.6732 lr:0.63 dt:36ms tok/s:1842580 rem:525s step 2176 (13%) loss:3.6783 lr:0.63 dt:35ms tok/s:1847323 rem:525s step 2177 (13%) loss:3.7065 lr:0.63 dt:36ms tok/s:1844917 rem:525s step 2178 (13%) loss:3.7175 lr:0.63 dt:36ms tok/s:1819865 rem:525s step 2179 (13%) loss:3.7088 lr:0.63 dt:36ms tok/s:1822000 rem:525s step 2180 (13%) loss:3.7099 lr:0.63 dt:36ms tok/s:1816606 rem:525s step 2181 (13%) loss:3.7272 lr:0.63 dt:36ms tok/s:1813992 rem:525s step 2182 (13%) loss:3.7440 lr:0.63 dt:36ms tok/s:1815598 rem:524s step 2183 (13%) loss:3.7581 lr:0.63 dt:36ms tok/s:1815466 rem:524s step 2184 (13%) loss:3.7518 lr:0.63 dt:36ms tok/s:1818974 rem:524s step 2185 (13%) loss:3.7557 lr:0.63 dt:36ms tok/s:1822471 rem:524s step 2186 (13%) loss:3.7713 lr:0.63 dt:36ms tok/s:1813370 rem:524s step 2187 (13%) loss:3.7478 lr:0.63 dt:37ms tok/s:1762693 rem:524s step 2188 (13%) loss:3.7557 lr:0.63 dt:36ms tok/s:1810372 rem:524s step 2189 (13%) loss:3.7446 lr:0.63 dt:37ms tok/s:1758375 rem:524s step 2190 (13%) loss:3.7351 lr:0.63 dt:36ms tok/s:1821071 rem:524s step 2191 (13%) loss:3.7302 lr:0.63 dt:36ms tok/s:1836278 rem:524s step 2192 (13%) loss:3.7287 lr:0.63 dt:35ms tok/s:1846752 rem:524s step 2193 (13%) loss:3.7278 lr:0.63 dt:44ms tok/s:1474390 rem:524s step 2194 (13%) loss:3.7237 lr:0.63 dt:36ms tok/s:1812581 rem:524s step 2195 (13%) loss:3.7311 lr:0.63 dt:35ms tok/s:1878813 rem:524s step 2196 (13%) loss:3.7156 lr:0.63 dt:35ms tok/s:1879816 rem:524s step 2197 (13%) loss:3.7155 lr:0.63 dt:35ms tok/s:1875352 rem:524s step 2198 (13%) loss:3.7305 lr:0.63 dt:35ms tok/s:1879764 rem:524s step 2199 (13%) loss:3.7267 lr:0.63 dt:35ms tok/s:1850656 rem:524s step 2200 (13%) loss:3.7229 lr:0.63 dt:35ms tok/s:1876056 rem:524s + local: attn=[0.077, 0.478, 0.656] mlp=[0.235, 0.182, -0.198] + + transition: attn=[2.779, 0.893] mlp=[0.020, 0.203] + + hierarchy: attn=[2.727, 5.586, 5.572] mlp=[1.436, -0.149, 0.388] + step 2201 (13%) loss:3.7302 lr:0.63 dt:35ms tok/s:1870529 rem:524s step 2202 (13%) loss:3.7195 lr:0.64 dt:35ms tok/s:1860098 rem:524s step 2203 (13%) loss:3.7231 lr:0.64 dt:35ms tok/s:1864767 rem:524s step 2204 (13%) loss:3.7235 lr:0.64 dt:37ms tok/s:1794922 rem:524s step 2205 (13%) loss:3.7217 lr:0.64 dt:35ms tok/s:1879160 rem:524s step 2206 (13%) loss:3.7085 lr:0.64 dt:35ms tok/s:1886563 rem:524s step 2207 (13%) loss:3.7161 lr:0.64 dt:35ms tok/s:1891223 rem:524s step 2208 (13%) loss:3.6974 lr:0.64 dt:35ms tok/s:1885877 rem:524s step 2209 (13%) loss:3.6967 lr:0.64 dt:35ms tok/s:1891080 rem:524s step 2210 (13%) loss:3.6954 lr:0.64 dt:35ms tok/s:1863743 rem:523s step 2211 (13%) loss:3.6987 lr:0.64 dt:35ms tok/s:1849373 rem:523s step 2212 (13%) loss:3.7032 lr:0.64 dt:35ms tok/s:1856656 rem:523s step 2213 (13%) loss:3.6902 lr:0.64 dt:35ms tok/s:1864312 rem:523s step 2214 (13%) loss:3.6871 lr:0.64 dt:35ms tok/s:1867885 rem:523s step 2215 (13%) loss:3.6766 lr:0.64 dt:35ms tok/s:1894821 rem:523s step 2216 (13%) loss:3.6739 lr:0.64 dt:35ms tok/s:1875045 rem:523s step 2217 (13%) loss:3.6670 lr:0.64 dt:35ms tok/s:1865261 rem:523s step 2218 (13%) loss:3.6925 lr:0.64 dt:35ms tok/s:1868240 rem:523s step 2219 (13%) loss:3.6937 lr:0.64 dt:35ms tok/s:1861459 rem:523s step 2220 (13%) loss:3.6882 lr:0.64 dt:35ms tok/s:1858577 rem:523s step 2221 (13%) loss:3.7155 lr:0.64 dt:35ms tok/s:1864110 rem:523s step 2222 (13%) loss:3.7596 lr:0.64 dt:35ms tok/s:1870121 rem:523s step 2223 (13%) loss:3.7501 lr:0.64 dt:36ms tok/s:1844014 rem:523s step 2224 (13%) loss:3.7331 lr:0.64 dt:36ms tok/s:1839731 rem:523s step 2225 (13%) loss:3.7291 lr:0.64 dt:36ms tok/s:1842123 rem:523s step 2226 (13%) loss:3.7248 lr:0.64 dt:36ms tok/s:1837321 rem:523s step 2227 (13%) loss:3.7233 lr:0.64 dt:36ms tok/s:1828545 rem:523s step 2228 (13%) loss:3.7131 lr:0.64 dt:36ms tok/s:1840236 rem:523s step 2229 (13%) loss:3.7110 lr:0.64 dt:36ms tok/s:1836609 rem:523s step 2230 (13%) loss:3.7005 lr:0.64 dt:36ms tok/s:1839768 rem:523s step 2231 (13%) loss:3.6824 lr:0.64 dt:36ms tok/s:1835150 rem:523s step 2232 (13%) loss:3.6826 lr:0.64 dt:36ms tok/s:1842555 rem:523s step 2233 (13%) loss:3.7008 lr:0.64 dt:36ms tok/s:1830896 rem:523s step 2234 (13%) loss:3.7048 lr:0.64 dt:36ms tok/s:1838488 rem:523s step 2235 (13%) loss:3.7211 lr:0.64 dt:36ms tok/s:1806482 rem:523s step 2236 (13%) loss:3.7122 lr:0.65 dt:36ms tok/s:1819058 rem:523s step 2237 (13%) loss:3.7041 lr:0.65 dt:36ms tok/s:1816294 rem:523s step 2238 (13%) loss:3.6865 lr:0.65 dt:36ms tok/s:1825400 rem:522s step 2239 (13%) loss:3.6680 lr:0.65 dt:36ms tok/s:1829653 rem:522s step 2240 (13%) loss:3.6807 lr:0.65 dt:36ms tok/s:1835554 rem:522s step 2241 (13%) loss:3.6887 lr:0.65 dt:36ms tok/s:1831372 rem:522s step 2242 (13%) loss:3.6831 lr:0.65 dt:36ms tok/s:1842543 rem:522s step 2243 (13%) loss:3.6737 lr:0.65 dt:36ms tok/s:1812557 rem:522s step 2244 (13%) loss:3.6783 lr:0.65 dt:36ms tok/s:1839830 rem:522s step 2245 (13%) loss:3.7149 lr:0.65 dt:36ms tok/s:1813992 rem:522s step 2246 (13%) loss:3.7160 lr:0.65 dt:36ms tok/s:1826917 rem:522s step 2247 (13%) loss:3.6977 lr:0.65 dt:36ms tok/s:1834231 rem:522s step 2248 (13%) loss:3.6840 lr:0.65 dt:39ms tok/s:1678500 rem:522s step 2249 (13%) loss:3.6797 lr:0.65 dt:35ms tok/s:1854852 rem:522s step 2250 (13%) loss:3.6895 lr:0.65 dt:35ms tok/s:1857346 rem:522s step 2251 (13%) loss:3.7219 lr:0.65 dt:36ms tok/s:1828485 rem:522s step 2252 (13%) loss:3.7167 lr:0.65 dt:36ms tok/s:1829044 rem:522s step 2253 (13%) loss:3.7110 lr:0.65 dt:36ms tok/s:1836597 rem:522s step 2254 (13%) loss:3.7220 lr:0.65 dt:36ms tok/s:1839374 rem:522s step 2255 (13%) loss:3.7174 lr:0.65 dt:36ms tok/s:1838956 rem:522s step 2256 (13%) loss:3.7006 lr:0.65 dt:36ms tok/s:1817927 rem:522s step 2257 (13%) loss:3.7019 lr:0.65 dt:36ms tok/s:1838451 rem:522s step 2258 (13%) loss:3.6988 lr:0.65 dt:36ms tok/s:1829568 rem:522s step 2259 (13%) loss:3.7143 lr:0.65 dt:36ms tok/s:1833876 rem:522s step 2260 (13%) loss:3.7145 lr:0.65 dt:36ms tok/s:1840051 rem:522s step 2261 (13%) loss:3.7143 lr:0.65 dt:36ms tok/s:1835824 rem:522s step 2262 (13%) loss:3.7083 lr:0.65 dt:36ms tok/s:1834868 rem:522s step 2263 (13%) loss:3.7065 lr:0.65 dt:36ms tok/s:1833791 rem:522s step 2264 (13%) loss:3.6983 lr:0.65 dt:36ms tok/s:1831360 rem:522s step 2265 (13%) loss:3.6775 lr:0.65 dt:36ms tok/s:1836670 rem:522s step 2266 (13%) loss:3.6659 lr:0.65 dt:36ms tok/s:1837272 rem:521s step 2267 (13%) loss:3.6728 lr:0.65 dt:36ms tok/s:1840446 rem:521s step 2268 (13%) loss:3.6606 lr:0.65 dt:36ms tok/s:1830067 rem:521s step 2269 (13%) loss:3.6662 lr:0.66 dt:36ms tok/s:1840865 rem:521s step 2270 (13%) loss:3.6654 lr:0.66 dt:36ms tok/s:1834158 rem:521s step 2271 (13%) loss:3.6885 lr:0.66 dt:36ms tok/s:1837775 rem:521s step 2272 (13%) loss:3.6838 lr:0.66 dt:36ms tok/s:1837689 rem:521s step 2273 (13%) loss:3.6997 lr:0.66 dt:36ms tok/s:1834464 rem:521s step 2274 (13%) loss:3.7139 lr:0.66 dt:36ms tok/s:1838796 rem:521s step 2275 (13%) loss:3.7410 lr:0.66 dt:36ms tok/s:1828606 rem:521s step 2276 (13%) loss:3.7570 lr:0.66 dt:36ms tok/s:1835125 rem:521s step 2277 (13%) loss:3.7423 lr:0.66 dt:36ms tok/s:1830859 rem:521s step 2278 (13%) loss:3.7338 lr:0.66 dt:36ms tok/s:1833950 rem:521s step 2279 (13%) loss:3.7374 lr:0.66 dt:36ms tok/s:1831347 rem:521s step 2280 (13%) loss:3.7334 lr:0.66 dt:36ms tok/s:1836560 rem:521s step 2281 (13%) loss:3.7354 lr:0.66 dt:36ms tok/s:1837898 rem:521s step 2282 (13%) loss:3.7207 lr:0.66 dt:36ms tok/s:1834562 rem:521s step 2283 (13%) loss:3.7086 lr:0.66 dt:36ms tok/s:1827123 rem:521s step 2284 (13%) loss:3.7114 lr:0.66 dt:36ms tok/s:1830031 rem:521s step 2285 (13%) loss:3.7256 lr:0.66 dt:36ms tok/s:1834807 rem:521s step 2286 (13%) loss:3.7116 lr:0.66 dt:36ms tok/s:1835468 rem:521s step 2287 (13%) loss:3.6998 lr:0.66 dt:36ms tok/s:1824322 rem:521s step 2288 (13%) loss:3.6982 lr:0.66 dt:36ms tok/s:1821988 rem:521s step 2289 (13%) loss:3.6842 lr:0.66 dt:36ms tok/s:1837837 rem:521s step 2290 (13%) loss:3.6622 lr:0.66 dt:36ms tok/s:1836572 rem:521s step 2291 (13%) loss:3.6542 lr:0.66 dt:36ms tok/s:1837358 rem:521s step 2292 (13%) loss:3.6298 lr:0.66 dt:36ms tok/s:1829568 rem:521s step 2293 (13%) loss:3.6415 lr:0.66 dt:36ms tok/s:1830567 rem:521s step 2294 (13%) loss:3.6437 lr:0.66 dt:36ms tok/s:1835468 rem:520s step 2295 (13%) loss:3.6496 lr:0.66 dt:36ms tok/s:1835677 rem:520s step 2296 (13%) loss:3.6499 lr:0.66 dt:36ms tok/s:1842160 rem:520s step 2297 (13%) loss:3.6514 lr:0.66 dt:36ms tok/s:1836339 rem:520s step 2298 (13%) loss:3.6583 lr:0.66 dt:36ms tok/s:1826638 rem:520s step 2299 (13%) loss:3.6547 lr:0.66 dt:36ms tok/s:1829385 rem:520s step 2300 (13%) loss:3.6567 lr:0.66 dt:36ms tok/s:1840520 rem:520s + local: attn=[0.088, 0.480, 0.570] mlp=[0.222, 0.162, -0.199] + + transition: attn=[2.511, 0.783] mlp=[0.002, 0.189] + + hierarchy: attn=[2.559, 5.858, 5.606] mlp=[1.282, -0.281, 0.213] + step 2301 (13%) loss:3.6665 lr:0.66 dt:36ms tok/s:1834954 rem:520s step 2302 (13%) loss:3.6590 lr:0.66 dt:36ms tok/s:1836179 rem:520s step 2303 (13%) loss:3.6750 lr:0.67 dt:36ms tok/s:1831201 rem:520s step 2304 (13%) loss:3.6744 lr:0.67 dt:36ms tok/s:1838415 rem:520s step 2305 (13%) loss:3.6398 lr:0.67 dt:36ms tok/s:1822737 rem:520s step 2306 (13%) loss:3.6404 lr:0.67 dt:38ms tok/s:1739734 rem:520s step 2307 (13%) loss:3.6476 lr:0.67 dt:36ms tok/s:1831421 rem:520s step 2308 (13%) loss:3.7114 lr:0.67 dt:36ms tok/s:1838968 rem:520s step 2309 (13%) loss:3.6882 lr:0.67 dt:36ms tok/s:1840211 rem:520s step 2310 (13%) loss:3.6863 lr:0.67 dt:35ms tok/s:1846615 rem:520s step 2311 (13%) loss:3.6742 lr:0.67 dt:36ms tok/s:1838820 rem:520s step 2312 (13%) loss:3.6818 lr:0.67 dt:36ms tok/s:1836339 rem:520s step 2313 (13%) loss:3.6972 lr:0.67 dt:36ms tok/s:1830774 rem:520s step 2314 (13%) loss:3.6834 lr:0.67 dt:36ms tok/s:1830847 rem:520s step 2315 (13%) loss:3.7047 lr:0.67 dt:36ms tok/s:1833901 rem:520s step 2316 (13%) loss:3.7005 lr:0.67 dt:36ms tok/s:1830737 rem:520s step 2317 (13%) loss:3.6890 lr:0.67 dt:36ms tok/s:1836179 rem:520s step 2318 (13%) loss:3.6864 lr:0.67 dt:36ms tok/s:1833693 rem:520s step 2319 (13%) loss:3.6543 lr:0.67 dt:39ms tok/s:1697553 rem:520s step 2320 (13%) loss:3.6282 lr:0.67 dt:35ms tok/s:1857835 rem:520s step 2321 (13%) loss:3.6361 lr:0.67 dt:36ms tok/s:1840014 rem:519s step 2322 (13%) loss:3.6519 lr:0.67 dt:36ms tok/s:1833338 rem:519s step 2323 (13%) loss:3.6433 lr:0.67 dt:36ms tok/s:1828302 rem:519s step 2324 (13%) loss:3.6271 lr:0.67 dt:36ms tok/s:1837137 rem:519s step 2325 (13%) loss:3.6356 lr:0.67 dt:36ms tok/s:1836216 rem:519s step 2326 (13%) loss:3.6382 lr:0.67 dt:36ms tok/s:1830969 rem:519s step 2327 (13%) loss:3.6453 lr:0.67 dt:36ms tok/s:1833069 rem:519s step 2328 (13%) loss:3.6504 lr:0.67 dt:36ms tok/s:1823826 rem:519s step 2329 (13%) loss:3.6547 lr:0.67 dt:36ms tok/s:1836057 rem:519s step 2330 (13%) loss:3.6601 lr:0.67 dt:36ms tok/s:1838083 rem:519s step 2331 (13%) loss:3.6618 lr:0.67 dt:36ms tok/s:1842395 rem:519s step 2332 (13%) loss:3.6531 lr:0.67 dt:36ms tok/s:1839694 rem:519s step 2333 (13%) loss:3.6649 lr:0.67 dt:36ms tok/s:1838390 rem:519s step 2334 (13%) loss:3.6668 lr:0.67 dt:36ms tok/s:1829884 rem:519s step 2335 (13%) loss:3.6602 lr:0.67 dt:36ms tok/s:1829836 rem:519s step 2336 (14%) loss:3.6709 lr:0.68 dt:36ms tok/s:1818830 rem:519s step 2337 (14%) loss:3.6614 lr:0.68 dt:36ms tok/s:1834648 rem:519s step 2338 (14%) loss:3.6677 lr:0.68 dt:36ms tok/s:1831823 rem:519s step 2339 (14%) loss:3.6560 lr:0.68 dt:36ms tok/s:1820408 rem:519s step 2340 (14%) loss:3.6783 lr:0.68 dt:36ms tok/s:1829142 rem:519s step 2341 (14%) loss:3.7013 lr:0.68 dt:36ms tok/s:1839460 rem:519s step 2342 (14%) loss:3.6964 lr:0.68 dt:36ms tok/s:1832849 rem:519s step 2343 (14%) loss:3.6789 lr:0.68 dt:36ms tok/s:1825606 rem:519s step 2344 (14%) loss:3.7067 lr:0.68 dt:37ms tok/s:1785153 rem:519s step 2345 (14%) loss:3.7173 lr:0.68 dt:36ms tok/s:1831872 rem:519s step 2346 (14%) loss:3.6910 lr:0.68 dt:36ms tok/s:1829142 rem:519s step 2347 (14%) loss:3.6869 lr:0.68 dt:36ms tok/s:1814232 rem:519s step 2348 (14%) loss:3.6880 lr:0.68 dt:36ms tok/s:1815946 rem:519s step 2349 (14%) loss:3.6850 lr:0.68 dt:36ms tok/s:1816054 rem:518s step 2350 (14%) loss:3.6777 lr:0.68 dt:36ms tok/s:1833057 rem:518s step 2351 (14%) loss:3.6834 lr:0.68 dt:36ms tok/s:1839066 rem:518s step 2352 (14%) loss:3.6939 lr:0.68 dt:36ms tok/s:1832752 rem:518s step 2353 (14%) loss:3.6838 lr:0.68 dt:36ms tok/s:1833020 rem:518s step 2354 (14%) loss:3.6791 lr:0.68 dt:36ms tok/s:1838206 rem:518s step 2355 (14%) loss:3.6756 lr:0.68 dt:36ms tok/s:1840335 rem:518s step 2356 (14%) loss:3.6659 lr:0.68 dt:36ms tok/s:1838525 rem:518s step 2357 (14%) loss:3.6796 lr:0.68 dt:36ms tok/s:1826893 rem:518s step 2358 (14%) loss:3.6748 lr:0.68 dt:36ms tok/s:1832703 rem:518s step 2359 (14%) loss:3.6727 lr:0.68 dt:36ms tok/s:1807563 rem:518s step 2360 (14%) loss:3.6596 lr:0.68 dt:36ms tok/s:1812461 rem:518s step 2361 (14%) loss:3.6416 lr:0.68 dt:36ms tok/s:1816654 rem:518s step 2362 (14%) loss:3.6445 lr:0.68 dt:49ms tok/s:1343043 rem:518s step 2363 (14%) loss:3.6574 lr:0.68 dt:41ms tok/s:1617852 rem:518s step 2364 (14%) loss:3.6637 lr:0.68 dt:39ms tok/s:1667594 rem:518s step 2365 (14%) loss:3.6607 lr:0.68 dt:38ms tok/s:1734661 rem:518s step 2366 (14%) loss:3.6383 lr:0.68 dt:38ms tok/s:1733786 rem:518s step 2367 (14%) loss:3.6547 lr:0.68 dt:37ms tok/s:1793060 rem:518s step 2368 (14%) loss:3.6469 lr:0.68 dt:37ms tok/s:1776707 rem:518s step 2369 (14%) loss:3.6642 lr:0.69 dt:37ms tok/s:1772342 rem:518s step 2370 (14%) loss:3.6549 lr:0.69 dt:37ms tok/s:1776650 rem:518s step 2371 (14%) loss:3.6453 lr:0.69 dt:37ms tok/s:1783856 rem:518s step 2372 (14%) loss:3.6394 lr:0.69 dt:37ms tok/s:1780205 rem:518s step 2373 (14%) loss:3.6040 lr:0.69 dt:37ms tok/s:1779928 rem:518s step 2374 (14%) loss:3.6123 lr:0.69 dt:37ms tok/s:1780909 rem:518s step 2375 (14%) loss:3.6247 lr:0.69 dt:37ms tok/s:1776719 rem:518s step 2376 (14%) loss:3.6434 lr:0.69 dt:37ms tok/s:1783173 rem:517s step 2377 (14%) loss:3.6565 lr:0.69 dt:37ms tok/s:1772320 rem:517s step 2378 (14%) loss:3.6465 lr:0.69 dt:37ms tok/s:1776386 rem:517s step 2379 (14%) loss:3.6582 lr:0.69 dt:37ms tok/s:1773028 rem:517s step 2380 (14%) loss:3.6648 lr:0.69 dt:37ms tok/s:1770379 rem:517s step 2381 (14%) loss:3.6589 lr:0.69 dt:37ms tok/s:1783092 rem:517s step 2382 (14%) loss:3.6610 lr:0.69 dt:37ms tok/s:1784574 rem:517s step 2383 (14%) loss:3.6535 lr:0.69 dt:37ms tok/s:1770927 rem:517s step 2384 (14%) loss:3.6579 lr:0.69 dt:37ms tok/s:1787092 rem:517s step 2385 (14%) loss:3.6708 lr:0.69 dt:37ms tok/s:1776891 rem:517s step 2386 (14%) loss:3.6701 lr:0.69 dt:37ms tok/s:1777707 rem:517s step 2387 (14%) loss:3.6698 lr:0.69 dt:37ms tok/s:1778788 rem:517s step 2388 (14%) loss:3.6585 lr:0.69 dt:37ms tok/s:1776248 rem:517s step 2389 (14%) loss:3.6661 lr:0.69 dt:37ms tok/s:1784910 rem:517s step 2390 (14%) loss:3.6679 lr:0.69 dt:37ms tok/s:1774745 rem:517s step 2391 (14%) loss:3.6521 lr:0.69 dt:37ms tok/s:1772994 rem:517s step 2392 (14%) loss:3.6585 lr:0.69 dt:37ms tok/s:1774482 rem:517s step 2393 (14%) loss:3.6500 lr:0.69 dt:34ms tok/s:1930756 rem:517s step 2394 (14%) loss:3.6558 lr:0.69 dt:33ms tok/s:1977397 rem:517s step 2395 (14%) loss:3.6469 lr:0.69 dt:33ms tok/s:2013182 rem:517s step 2396 (14%) loss:3.6595 lr:0.69 dt:33ms tok/s:1996310 rem:517s step 2397 (14%) loss:3.6491 lr:0.69 dt:32ms tok/s:2017186 rem:517s step 2398 (14%) loss:3.6228 lr:0.69 dt:32ms tok/s:2020032 rem:517s step 2399 (14%) loss:3.6178 lr:0.69 dt:33ms tok/s:2005925 rem:517s step 2400 (14%) loss:3.6219 lr:0.69 dt:33ms tok/s:2003118 rem:517s + local: attn=[0.073, 0.478, 0.562] mlp=[0.197, 0.161, -0.174] + + transition: attn=[2.215, 0.693] mlp=[0.027, 0.181] + + hierarchy: attn=[2.479, 5.921, 5.614] mlp=[1.083, -0.530, -0.034] + step 2401 (14%) loss:3.6382 lr:0.69 dt:33ms tok/s:1996498 rem:517s step 2402 (14%) loss:3.6447 lr:0.69 dt:33ms tok/s:1976004 rem:517s step 2403 (14%) loss:3.6322 lr:0.70 dt:33ms tok/s:1965590 rem:517s step 2404 (14%) loss:3.6136 lr:0.70 dt:34ms tok/s:1955410 rem:517s step 2405 (14%) loss:3.6147 lr:0.70 dt:33ms tok/s:1960823 rem:516s step 2406 (14%) loss:3.6234 lr:0.70 dt:33ms tok/s:1958657 rem:516s step 2407 (14%) loss:3.6372 lr:0.70 dt:34ms tok/s:1952230 rem:516s step 2408 (14%) loss:3.6350 lr:0.70 dt:33ms tok/s:1956440 rem:516s step 2409 (14%) loss:3.6163 lr:0.70 dt:34ms tok/s:1948121 rem:516s step 2410 (14%) loss:3.6269 lr:0.70 dt:34ms tok/s:1925428 rem:516s step 2411 (14%) loss:3.6351 lr:0.70 dt:34ms tok/s:1909352 rem:516s step 2412 (14%) loss:3.6289 lr:0.70 dt:34ms tok/s:1926723 rem:516s step 2413 (14%) loss:3.6426 lr:0.70 dt:34ms tok/s:1922062 rem:516s step 2414 (14%) loss:3.6357 lr:0.70 dt:34ms tok/s:1920196 rem:516s step 2415 (14%) loss:3.6640 lr:0.70 dt:34ms tok/s:1906769 rem:516s step 2416 (14%) loss:3.6711 lr:0.70 dt:34ms tok/s:1907258 rem:516s step 2417 (14%) loss:3.6812 lr:0.70 dt:34ms tok/s:1903732 rem:516s step 2418 (14%) loss:3.6914 lr:0.70 dt:35ms tok/s:1891652 rem:516s step 2419 (14%) loss:3.6909 lr:0.70 dt:34ms tok/s:1901993 rem:516s step 2420 (14%) loss:3.6935 lr:0.70 dt:34ms tok/s:1903877 rem:516s step 2421 (14%) loss:3.6958 lr:0.70 dt:34ms tok/s:1904272 rem:516s step 2422 (14%) loss:3.6777 lr:0.70 dt:42ms tok/s:1575295 rem:516s step 2423 (14%) loss:3.6809 lr:0.70 dt:34ms tok/s:1903231 rem:516s step 2424 (14%) loss:3.7158 lr:0.70 dt:34ms tok/s:1904365 rem:516s step 2425 (14%) loss:3.7507 lr:0.70 dt:35ms tok/s:1892681 rem:516s step 2426 (14%) loss:3.7142 lr:0.70 dt:35ms tok/s:1892121 rem:516s step 2427 (14%) loss:3.6813 lr:0.70 dt:35ms tok/s:1891314 rem:516s step 2428 (14%) loss:3.6860 lr:0.70 dt:35ms tok/s:1893920 rem:516s step 2429 (14%) loss:3.6903 lr:0.70 dt:34ms tok/s:1908146 rem:516s step 2430 (14%) loss:3.6905 lr:0.70 dt:34ms tok/s:1910812 rem:516s step 2431 (14%) loss:3.6887 lr:0.70 dt:34ms tok/s:1904365 rem:516s step 2432 (14%) loss:3.6656 lr:0.70 dt:35ms tok/s:1898971 rem:516s step 2433 (14%) loss:3.6602 lr:0.70 dt:34ms tok/s:1901033 rem:516s step 2434 (14%) loss:3.6400 lr:0.70 dt:34ms tok/s:1900757 rem:515s step 2435 (14%) loss:3.6461 lr:0.70 dt:35ms tok/s:1883990 rem:515s step 2436 (14%) loss:3.6270 lr:0.70 dt:34ms tok/s:1907431 rem:515s step 2437 (14%) loss:3.6309 lr:0.70 dt:34ms tok/s:1902796 rem:515s step 2438 (14%) loss:3.6445 lr:0.71 dt:35ms tok/s:1893542 rem:515s step 2439 (14%) loss:3.6303 lr:0.71 dt:35ms tok/s:1884261 rem:515s step 2440 (14%) loss:3.6343 lr:0.71 dt:35ms tok/s:1888079 rem:515s step 2441 (14%) loss:3.6201 lr:0.71 dt:35ms tok/s:1882416 rem:515s step 2442 (14%) loss:3.6112 lr:0.71 dt:35ms tok/s:1887379 rem:515s step 2443 (14%) loss:3.6110 lr:0.71 dt:37ms tok/s:1786325 rem:515s step 2444 (14%) loss:3.6184 lr:0.71 dt:35ms tok/s:1884985 rem:515s step 2445 (14%) loss:3.6218 lr:0.71 dt:35ms tok/s:1860615 rem:515s step 2446 (14%) loss:3.6206 lr:0.71 dt:35ms tok/s:1859922 rem:515s step 2447 (14%) loss:3.6090 lr:0.71 dt:35ms tok/s:1863276 rem:515s step 2448 (14%) loss:3.5915 lr:0.71 dt:35ms tok/s:1856205 rem:515s step 2449 (14%) loss:3.6229 lr:0.71 dt:35ms tok/s:1860061 rem:515s step 2450 (14%) loss:3.6038 lr:0.71 dt:35ms tok/s:1866185 rem:515s step 2451 (14%) loss:3.5991 lr:0.71 dt:36ms tok/s:1834403 rem:515s step 2452 (14%) loss:3.6046 lr:0.71 dt:36ms tok/s:1823620 rem:515s step 2453 (14%) loss:3.6175 lr:0.71 dt:36ms tok/s:1822532 rem:515s step 2454 (14%) loss:3.6207 lr:0.71 dt:36ms tok/s:1819588 rem:515s step 2455 (14%) loss:3.6348 lr:0.71 dt:36ms tok/s:1820287 rem:515s step 2456 (14%) loss:3.6314 lr:0.71 dt:36ms tok/s:1820588 rem:515s step 2457 (14%) loss:3.6270 lr:0.71 dt:36ms tok/s:1810015 rem:515s step 2458 (14%) loss:3.6217 lr:0.71 dt:36ms tok/s:1819648 rem:515s step 2459 (14%) loss:3.6271 lr:0.71 dt:36ms tok/s:1800318 rem:515s step 2460 (14%) loss:3.6306 lr:0.71 dt:36ms tok/s:1819191 rem:515s step 2461 (14%) loss:3.6432 lr:0.71 dt:36ms tok/s:1825946 rem:515s step 2462 (14%) loss:3.6397 lr:0.71 dt:36ms tok/s:1819480 rem:514s step 2463 (14%) loss:3.6333 lr:0.71 dt:37ms tok/s:1792850 rem:514s step 2464 (14%) loss:3.6263 lr:0.71 dt:36ms tok/s:1802254 rem:514s step 2465 (14%) loss:3.6041 lr:0.71 dt:37ms tok/s:1774390 rem:514s step 2466 (14%) loss:3.5887 lr:0.71 dt:37ms tok/s:1757475 rem:514s step 2467 (14%) loss:3.5936 lr:0.71 dt:37ms tok/s:1778121 rem:514s step 2468 (14%) loss:3.5850 lr:0.71 dt:37ms tok/s:1777523 rem:514s step 2469 (14%) loss:3.5842 lr:0.71 dt:36ms tok/s:1802762 rem:514s step 2470 (14%) loss:3.5869 lr:0.71 dt:36ms tok/s:1816426 rem:514s step 2471 (14%) loss:3.5842 lr:0.72 dt:36ms tok/s:1813406 rem:514s step 2472 (14%) loss:3.5820 lr:0.72 dt:36ms tok/s:1823052 rem:514s step 2473 (14%) loss:3.5818 lr:0.72 dt:36ms tok/s:1820130 rem:514s step 2474 (14%) loss:3.5948 lr:0.72 dt:36ms tok/s:1825109 rem:514s step 2475 (14%) loss:3.5998 lr:0.72 dt:36ms tok/s:1824080 rem:514s step 2476 (14%) loss:3.6000 lr:0.72 dt:36ms tok/s:1820938 rem:514s step 2477 (14%) loss:3.6009 lr:0.72 dt:36ms tok/s:1825109 rem:514s step 2478 (14%) loss:3.6048 lr:0.72 dt:36ms tok/s:1819082 rem:514s step 2479 (14%) loss:3.5933 lr:0.72 dt:36ms tok/s:1803791 rem:514s step 2480 (14%) loss:3.5768 lr:0.72 dt:36ms tok/s:1814819 rem:514s step 2481 (14%) loss:3.5683 lr:0.72 dt:37ms tok/s:1764810 rem:514s step 2482 (14%) loss:3.5805 lr:0.72 dt:37ms tok/s:1750569 rem:514s step 2483 (14%) loss:3.5687 lr:0.72 dt:36ms tok/s:1805106 rem:514s step 2484 (14%) loss:3.5667 lr:0.72 dt:36ms tok/s:1816714 rem:514s step 2485 (14%) loss:3.5573 lr:0.72 dt:36ms tok/s:1816894 rem:514s step 2486 (14%) loss:3.5936 lr:0.72 dt:36ms tok/s:1815071 rem:514s step 2487 (14%) loss:3.6168 lr:0.72 dt:36ms tok/s:1807301 rem:514s step 2488 (14%) loss:3.6264 lr:0.72 dt:36ms tok/s:1813633 rem:514s step 2489 (14%) loss:3.6233 lr:0.72 dt:36ms tok/s:1815334 rem:513s step 2490 (14%) loss:3.6079 lr:0.72 dt:37ms tok/s:1774505 rem:513s step 2491 (14%) loss:3.6017 lr:0.72 dt:36ms tok/s:1799375 rem:513s step 2492 (14%) loss:3.6087 lr:0.72 dt:36ms tok/s:1808407 rem:513s step 2493 (14%) loss:3.6100 lr:0.72 dt:36ms tok/s:1813693 rem:513s step 2494 (14%) loss:3.6144 lr:0.72 dt:36ms tok/s:1815670 rem:513s step 2495 (14%) loss:3.6195 lr:0.72 dt:38ms tok/s:1723329 rem:513s step 2496 (14%) loss:3.6046 lr:0.72 dt:36ms tok/s:1806197 rem:513s step 2497 (14%) loss:3.6107 lr:0.72 dt:36ms tok/s:1811816 rem:513s step 2498 (14%) loss:3.5999 lr:0.72 dt:36ms tok/s:1809550 rem:513s step 2499 (14%) loss:3.6057 lr:0.72 dt:36ms tok/s:1807492 rem:513s step 2500 (14%) loss:3.6198 lr:0.72 dt:36ms tok/s:1806755 rem:513s + local: attn=[0.082, 0.445, 0.503] mlp=[0.188, 0.148, -0.149] + + transition: attn=[2.005, 0.673] mlp=[-0.001, 0.164] + + hierarchy: attn=[2.210, 5.935, 5.616] mlp=[0.975, -0.103, 0.137] + step 2501 (14%) loss:3.6185 lr:0.72 dt:36ms tok/s:1813693 rem:513s step 2502 (14%) loss:3.6089 lr:0.72 dt:36ms tok/s:1811410 rem:513s step 2503 (14%) loss:3.6100 lr:0.72 dt:36ms tok/s:1811960 rem:513s step 2504 (15%) loss:3.6197 lr:0.73 dt:36ms tok/s:1804892 rem:513s step 2505 (15%) loss:3.6108 lr:0.73 dt:36ms tok/s:1815142 rem:513s step 2506 (15%) loss:3.6319 lr:0.73 dt:36ms tok/s:1815178 rem:513s step 2507 (15%) loss:3.6408 lr:0.73 dt:36ms tok/s:1817098 rem:513s step 2508 (15%) loss:3.6398 lr:0.73 dt:36ms tok/s:1805367 rem:513s step 2509 (15%) loss:3.6366 lr:0.73 dt:36ms tok/s:1808455 rem:513s step 2510 (15%) loss:3.6426 lr:0.73 dt:36ms tok/s:1821940 rem:513s step 2511 (15%) loss:3.6493 lr:0.73 dt:36ms tok/s:1806185 rem:513s step 2512 (15%) loss:3.6328 lr:0.73 dt:36ms tok/s:1811255 rem:513s step 2513 (15%) loss:3.6331 lr:0.73 dt:36ms tok/s:1814376 rem:513s step 2514 (15%) loss:3.6275 lr:0.73 dt:36ms tok/s:1815346 rem:513s step 2515 (15%) loss:3.6162 lr:0.73 dt:36ms tok/s:1814328 rem:513s step 2516 (15%) loss:3.6040 lr:0.73 dt:36ms tok/s:1805924 rem:513s step 2517 (15%) loss:3.5915 lr:0.73 dt:36ms tok/s:1805414 rem:512s step 2518 (15%) loss:3.6071 lr:0.73 dt:36ms tok/s:1816438 rem:512s step 2519 (15%) loss:3.6172 lr:0.73 dt:36ms tok/s:1819371 rem:512s step 2520 (15%) loss:3.6148 lr:0.73 dt:36ms tok/s:1806351 rem:512s step 2521 (15%) loss:3.6209 lr:0.73 dt:36ms tok/s:1810814 rem:512s step 2522 (15%) loss:3.6145 lr:0.73 dt:36ms tok/s:1805319 rem:512s step 2523 (15%) loss:3.6147 lr:0.73 dt:36ms tok/s:1811649 rem:512s step 2524 (15%) loss:3.5946 lr:0.73 dt:36ms tok/s:1806078 rem:512s step 2525 (15%) loss:3.5857 lr:0.73 dt:37ms tok/s:1776076 rem:512s step 2526 (15%) loss:3.6160 lr:0.73 dt:37ms tok/s:1755231 rem:512s step 2527 (15%) loss:3.6229 lr:0.73 dt:36ms tok/s:1817831 rem:512s step 2528 (15%) loss:3.6222 lr:0.73 dt:36ms tok/s:1811136 rem:512s step 2529 (15%) loss:3.5760 lr:0.73 dt:36ms tok/s:1817952 rem:512s step 2530 (15%) loss:3.5658 lr:0.73 dt:36ms tok/s:1815430 rem:512s step 2531 (15%) loss:3.5591 lr:0.73 dt:36ms tok/s:1809788 rem:512s step 2532 (15%) loss:3.5636 lr:0.73 dt:36ms tok/s:1819913 rem:512s step 2533 (15%) loss:3.5765 lr:0.73 dt:36ms tok/s:1812605 rem:512s step 2534 (15%) loss:3.5773 lr:0.73 dt:36ms tok/s:1814891 rem:512s step 2535 (15%) loss:3.5953 lr:0.73 dt:36ms tok/s:1816366 rem:512s step 2536 (15%) loss:3.5859 lr:0.73 dt:36ms tok/s:1815586 rem:512s step 2537 (15%) loss:3.5923 lr:0.74 dt:36ms tok/s:1817927 rem:512s step 2538 (15%) loss:3.5785 lr:0.74 dt:36ms tok/s:1811578 rem:512s step 2539 (15%) loss:3.5838 lr:0.74 dt:36ms tok/s:1811518 rem:512s step 2540 (15%) loss:3.5685 lr:0.74 dt:36ms tok/s:1810110 rem:512s step 2541 (15%) loss:3.5760 lr:0.74 dt:36ms tok/s:1817351 rem:512s step 2542 (15%) loss:3.5725 lr:0.74 dt:36ms tok/s:1812832 rem:512s step 2543 (15%) loss:3.5520 lr:0.74 dt:36ms tok/s:1814172 rem:512s step 2544 (15%) loss:3.5529 lr:0.74 dt:36ms tok/s:1819492 rem:512s step 2545 (15%) loss:3.5317 lr:0.74 dt:36ms tok/s:1811960 rem:511s step 2546 (15%) loss:3.5234 lr:0.74 dt:36ms tok/s:1815742 rem:511s step 2547 (15%) loss:3.5286 lr:0.74 dt:36ms tok/s:1809479 rem:511s step 2548 (15%) loss:3.5395 lr:0.74 dt:36ms tok/s:1826480 rem:511s step 2549 (15%) loss:3.5245 lr:0.74 dt:36ms tok/s:1816210 rem:511s step 2550 (15%) loss:3.5344 lr:0.74 dt:36ms tok/s:1819962 rem:511s step 2551 (15%) loss:3.5462 lr:0.74 dt:36ms tok/s:1822919 rem:511s step 2552 (15%) loss:3.5501 lr:0.74 dt:36ms tok/s:1812258 rem:511s step 2553 (15%) loss:3.5513 lr:0.74 dt:36ms tok/s:1814436 rem:511s step 2554 (15%) loss:3.5568 lr:0.74 dt:36ms tok/s:1819299 rem:511s step 2555 (15%) loss:3.5586 lr:0.74 dt:36ms tok/s:1820034 rem:511s step 2556 (15%) loss:3.5419 lr:0.74 dt:36ms tok/s:1820251 rem:511s step 2557 (15%) loss:3.5424 lr:0.74 dt:36ms tok/s:1819131 rem:511s step 2558 (15%) loss:3.5619 lr:0.74 dt:37ms tok/s:1759534 rem:511s step 2559 (15%) loss:3.5543 lr:0.74 dt:37ms tok/s:1757576 rem:511s step 2560 (15%) loss:3.5504 lr:0.74 dt:36ms tok/s:1817855 rem:511s step 2561 (15%) loss:3.5465 lr:0.74 dt:36ms tok/s:1818000 rem:511s step 2562 (15%) loss:3.5519 lr:0.74 dt:36ms tok/s:1821433 rem:511s step 2563 (15%) loss:3.5744 lr:0.74 dt:36ms tok/s:1815982 rem:511s step 2564 (15%) loss:3.5689 lr:0.74 dt:36ms tok/s:1814352 rem:511s step 2565 (15%) loss:3.5729 lr:0.74 dt:36ms tok/s:1816882 rem:511s step 2566 (15%) loss:3.5721 lr:0.74 dt:38ms tok/s:1719158 rem:511s step 2567 (15%) loss:3.5597 lr:0.74 dt:36ms tok/s:1815958 rem:511s step 2568 (15%) loss:3.5597 lr:0.74 dt:36ms tok/s:1810611 rem:511s step 2569 (15%) loss:3.5795 lr:0.74 dt:36ms tok/s:1812533 rem:511s step 2570 (15%) loss:3.5733 lr:0.75 dt:36ms tok/s:1812533 rem:511s step 2571 (15%) loss:3.6164 lr:0.75 dt:36ms tok/s:1806791 rem:511s step 2572 (15%) loss:3.5935 lr:0.75 dt:36ms tok/s:1816498 rem:510s step 2573 (15%) loss:3.6175 lr:0.75 dt:36ms tok/s:1813191 rem:510s step 2574 (15%) loss:3.6150 lr:0.75 dt:36ms tok/s:1816870 rem:510s step 2575 (15%) loss:3.6197 lr:0.75 dt:37ms tok/s:1765025 rem:510s step 2576 (15%) loss:3.6069 lr:0.75 dt:36ms tok/s:1813837 rem:510s step 2577 (15%) loss:3.6065 lr:0.75 dt:36ms tok/s:1805248 rem:510s step 2578 (15%) loss:3.6045 lr:0.75 dt:36ms tok/s:1820793 rem:510s step 2579 (15%) loss:3.5949 lr:0.75 dt:36ms tok/s:1805722 rem:510s step 2580 (15%) loss:3.6117 lr:0.75 dt:36ms tok/s:1806541 rem:510s step 2581 (15%) loss:3.5970 lr:0.75 dt:36ms tok/s:1813263 rem:510s step 2582 (15%) loss:3.5709 lr:0.75 dt:36ms tok/s:1812258 rem:510s step 2583 (15%) loss:3.5403 lr:0.75 dt:36ms tok/s:1820504 rem:510s step 2584 (15%) loss:3.5199 lr:0.75 dt:36ms tok/s:1816054 rem:510s step 2585 (15%) loss:3.5258 lr:0.75 dt:36ms tok/s:1814076 rem:510s step 2586 (15%) loss:3.5472 lr:0.75 dt:37ms tok/s:1769433 rem:510s step 2587 (15%) loss:3.5521 lr:0.75 dt:38ms tok/s:1742833 rem:510s step 2588 (15%) loss:3.5717 lr:0.75 dt:36ms tok/s:1833412 rem:510s step 2589 (15%) loss:3.5668 lr:0.75 dt:36ms tok/s:1837751 rem:510s step 2590 (15%) loss:3.5695 lr:0.75 dt:36ms tok/s:1818830 rem:510s step 2591 (15%) loss:3.5656 lr:0.75 dt:36ms tok/s:1819239 rem:510s step 2592 (15%) loss:3.5824 lr:0.75 dt:36ms tok/s:1839916 rem:510s step 2593 (15%) loss:3.5905 lr:0.75 dt:37ms tok/s:1756375 rem:510s step 2594 (15%) loss:3.5879 lr:0.75 dt:36ms tok/s:1820058 rem:510s step 2595 (15%) loss:3.5987 lr:0.75 dt:35ms tok/s:1851965 rem:510s step 2596 (15%) loss:3.6149 lr:0.75 dt:35ms tok/s:1862657 rem:510s step 2597 (15%) loss:3.6184 lr:0.75 dt:35ms tok/s:1855090 rem:510s step 2598 (15%) loss:3.6187 lr:0.75 dt:35ms tok/s:1856531 rem:510s step 2599 (15%) loss:3.5936 lr:0.75 dt:36ms tok/s:1843433 rem:510s step 2600 (15%) loss:3.6176 lr:0.75 dt:35ms tok/s:1853664 rem:509s + local: attn=[0.077, 0.443, 0.477] mlp=[0.177, 0.155, -0.158] + + transition: attn=[1.805, 0.624] mlp=[-0.010, 0.144] + + hierarchy: attn=[2.040, 5.938, 5.616] mlp=[0.800, -0.354, -0.050] + step 2601 (15%) loss:3.6530 lr:0.75 dt:35ms tok/s:1857220 rem:509s step 2602 (15%) loss:3.6407 lr:0.75 dt:35ms tok/s:1855616 rem:509s step 2603 (15%) loss:3.5869 lr:0.75 dt:35ms tok/s:1871955 rem:509s step 2604 (15%) loss:3.6137 lr:0.76 dt:35ms tok/s:1866401 rem:509s step 2605 (15%) loss:3.6167 lr:0.76 dt:35ms tok/s:1862392 rem:509s step 2606 (15%) loss:3.6233 lr:0.76 dt:35ms tok/s:1855716 rem:509s step 2607 (15%) loss:3.6359 lr:0.76 dt:35ms tok/s:1852314 rem:509s step 2608 (15%) loss:3.6390 lr:0.76 dt:35ms tok/s:1879507 rem:509s step 2609 (15%) loss:3.6337 lr:0.76 dt:35ms tok/s:1868024 rem:509s step 2610 (15%) loss:3.6388 lr:0.76 dt:36ms tok/s:1830238 rem:509s step 2611 (15%) loss:3.6406 lr:0.76 dt:36ms tok/s:1843074 rem:509s step 2612 (15%) loss:3.6342 lr:0.76 dt:36ms tok/s:1840544 rem:509s step 2613 (15%) loss:3.6385 lr:0.76 dt:36ms tok/s:1835346 rem:509s step 2614 (15%) loss:3.6350 lr:0.76 dt:36ms tok/s:1835162 rem:509s step 2615 (15%) loss:3.6380 lr:0.76 dt:36ms tok/s:1830262 rem:509s step 2616 (15%) loss:3.8144 lr:0.76 dt:36ms tok/s:1840791 rem:509s step 2617 (15%) loss:3.7746 lr:0.76 dt:36ms tok/s:1837481 rem:509s step 2618 (15%) loss:3.7878 lr:0.76 dt:36ms tok/s:1838046 rem:509s step 2619 (15%) loss:3.8192 lr:0.76 dt:36ms tok/s:1830104 rem:509s step 2620 (15%) loss:3.8098 lr:0.76 dt:36ms tok/s:1834219 rem:509s step 2621 (15%) loss:3.7956 lr:0.76 dt:36ms tok/s:1836707 rem:509s step 2622 (15%) loss:3.7590 lr:0.76 dt:36ms tok/s:1839214 rem:509s step 2623 (15%) loss:3.7380 lr:0.76 dt:36ms tok/s:1838734 rem:509s step 2624 (15%) loss:3.7118 lr:0.76 dt:36ms tok/s:1840483 rem:509s step 2625 (15%) loss:3.6939 lr:0.76 dt:36ms tok/s:1838341 rem:509s step 2626 (15%) loss:3.6738 lr:0.76 dt:37ms tok/s:1794617 rem:509s step 2627 (15%) loss:3.6617 lr:0.76 dt:36ms tok/s:1839608 rem:509s step 2628 (15%) loss:3.6724 lr:0.76 dt:36ms tok/s:1827196 rem:508s step 2629 (15%) loss:3.6741 lr:0.76 dt:36ms tok/s:1819239 rem:508s step 2630 (15%) loss:3.6696 lr:0.76 dt:36ms tok/s:1828205 rem:508s step 2631 (15%) loss:3.6849 lr:0.76 dt:36ms tok/s:1836670 rem:508s step 2632 (15%) loss:3.7022 lr:0.76 dt:36ms tok/s:1828703 rem:508s step 2633 (15%) loss:3.6887 lr:0.76 dt:36ms tok/s:1836756 rem:508s step 2634 (15%) loss:3.6828 lr:0.76 dt:36ms tok/s:1831299 rem:508s step 2635 (15%) loss:3.6693 lr:0.76 dt:36ms tok/s:1837345 rem:508s step 2636 (15%) loss:3.6510 lr:0.76 dt:36ms tok/s:1833094 rem:508s step 2637 (15%) loss:3.6392 lr:0.77 dt:36ms tok/s:1837874 rem:508s step 2638 (15%) loss:3.6238 lr:0.77 dt:36ms tok/s:1829069 rem:508s step 2639 (15%) loss:3.6310 lr:0.77 dt:36ms tok/s:1838206 rem:508s step 2640 (15%) loss:3.6223 lr:0.77 dt:36ms tok/s:1828813 rem:508s step 2641 (15%) loss:3.6318 lr:0.77 dt:36ms tok/s:1837567 rem:508s step 2642 (15%) loss:3.6003 lr:0.77 dt:36ms tok/s:1834770 rem:508s step 2643 (15%) loss:3.5982 lr:0.77 dt:36ms tok/s:1839313 rem:508s step 2644 (15%) loss:3.6045 lr:0.77 dt:36ms tok/s:1837186 rem:508s step 2645 (15%) loss:3.5823 lr:0.77 dt:36ms tok/s:1834966 rem:508s step 2646 (15%) loss:3.5664 lr:0.77 dt:36ms tok/s:1836400 rem:508s step 2647 (15%) loss:3.5698 lr:0.77 dt:36ms tok/s:1830433 rem:508s step 2648 (15%) loss:3.5793 lr:0.77 dt:36ms tok/s:1835983 rem:508s step 2649 (15%) loss:3.5925 lr:0.77 dt:36ms tok/s:1839534 rem:508s step 2650 (15%) loss:3.6081 lr:0.77 dt:36ms tok/s:1834978 rem:508s step 2651 (15%) loss:3.6074 lr:0.77 dt:36ms tok/s:1835566 rem:508s step 2652 (15%) loss:3.6166 lr:0.77 dt:36ms tok/s:1829860 rem:508s step 2653 (15%) loss:3.6002 lr:0.77 dt:35ms tok/s:1856217 rem:508s step 2654 (15%) loss:3.5864 lr:0.77 dt:35ms tok/s:1858577 rem:508s step 2655 (15%) loss:3.5849 lr:0.77 dt:35ms tok/s:1857660 rem:508s step 2656 (15%) loss:3.5890 lr:0.77 dt:36ms tok/s:1842469 rem:507s step 2657 (15%) loss:3.5798 lr:0.77 dt:36ms tok/s:1834991 rem:507s step 2658 (15%) loss:3.5771 lr:0.77 dt:36ms tok/s:1838058 rem:507s step 2659 (15%) loss:3.5560 lr:0.77 dt:36ms tok/s:1844707 rem:507s step 2660 (15%) loss:3.5508 lr:0.77 dt:36ms tok/s:1843742 rem:507s step 2661 (15%) loss:3.5607 lr:0.77 dt:36ms tok/s:1845896 rem:507s step 2662 (15%) loss:3.5404 lr:0.77 dt:36ms tok/s:1837812 rem:507s step 2663 (15%) loss:3.5384 lr:0.77 dt:36ms tok/s:1838538 rem:507s step 2664 (15%) loss:3.5507 lr:0.77 dt:36ms tok/s:1839473 rem:507s step 2665 (15%) loss:3.5486 lr:0.77 dt:36ms tok/s:1836597 rem:507s step 2666 (15%) loss:3.5572 lr:0.77 dt:36ms tok/s:1831933 rem:507s step 2667 (15%) loss:3.5485 lr:0.77 dt:36ms tok/s:1837812 rem:507s step 2668 (15%) loss:3.5281 lr:0.77 dt:35ms tok/s:1846417 rem:507s step 2669 (15%) loss:3.5708 lr:0.77 dt:35ms tok/s:1847546 rem:507s step 2670 (15%) loss:3.5720 lr:0.77 dt:36ms tok/s:1841765 rem:507s step 2671 (16%) loss:3.5649 lr:0.78 dt:36ms tok/s:1837026 rem:507s step 2672 (16%) loss:3.5633 lr:0.78 dt:36ms tok/s:1817855 rem:507s step 2673 (16%) loss:3.5666 lr:0.78 dt:35ms tok/s:1846851 rem:507s step 2674 (16%) loss:3.5654 lr:0.78 dt:35ms tok/s:1851005 rem:507s step 2675 (16%) loss:3.5593 lr:0.78 dt:35ms tok/s:1847261 rem:507s step 2676 (16%) loss:3.5744 lr:0.78 dt:35ms tok/s:1850257 rem:507s step 2677 (16%) loss:3.5695 lr:0.78 dt:36ms tok/s:1831872 rem:507s step 2678 (16%) loss:3.5585 lr:0.78 dt:36ms tok/s:1834831 rem:507s step 2679 (16%) loss:3.5625 lr:0.78 dt:36ms tok/s:1836204 rem:507s step 2680 (16%) loss:3.5730 lr:0.78 dt:36ms tok/s:1840396 rem:507s step 2681 (16%) loss:3.5926 lr:0.78 dt:36ms tok/s:1839362 rem:507s step 2682 (16%) loss:3.5884 lr:0.78 dt:36ms tok/s:1843408 rem:507s step 2683 (16%) loss:3.5870 lr:0.78 dt:36ms tok/s:1844843 rem:507s step 2684 (16%) loss:3.5988 lr:0.78 dt:36ms tok/s:1835407 rem:506s step 2685 (16%) loss:3.6066 lr:0.78 dt:36ms tok/s:1838574 rem:506s step 2686 (16%) loss:3.6055 lr:0.78 dt:37ms tok/s:1775147 rem:506s step 2687 (16%) loss:3.6193 lr:0.78 dt:36ms tok/s:1836535 rem:506s step 2688 (16%) loss:3.6015 lr:0.78 dt:36ms tok/s:1832739 rem:506s step 2689 (16%) loss:3.5781 lr:0.78 dt:36ms tok/s:1840865 rem:506s step 2690 (16%) loss:3.5534 lr:0.78 dt:35ms tok/s:1846752 rem:506s step 2691 (16%) loss:3.5879 lr:0.78 dt:36ms tok/s:1844373 rem:506s step 2692 (16%) loss:3.5810 lr:0.78 dt:36ms tok/s:1820504 rem:506s step 2693 (16%) loss:3.5844 lr:0.78 dt:36ms tok/s:1823088 rem:506s step 2694 (16%) loss:3.5990 lr:0.78 dt:36ms tok/s:1836891 rem:506s step 2695 (16%) loss:3.6168 lr:0.78 dt:36ms tok/s:1842815 rem:506s step 2696 (16%) loss:3.6190 lr:0.78 dt:36ms tok/s:1842543 rem:506s step 2697 (16%) loss:3.6306 lr:0.78 dt:36ms tok/s:1841235 rem:506s step 2698 (16%) loss:3.6098 lr:0.78 dt:36ms tok/s:1845611 rem:506s step 2699 (16%) loss:3.5954 lr:0.78 dt:36ms tok/s:1840655 rem:506s step 2700 (16%) loss:3.6060 lr:0.78 dt:35ms tok/s:1848006 rem:506s + local: attn=[0.068, 0.415, 0.448] mlp=[0.164, 0.134, -0.143] + + transition: attn=[1.742, 0.575] mlp=[0.005, 0.147] + + hierarchy: attn=[1.914, 5.939, 5.616] mlp=[0.768, -0.027, 0.287] + step 2701 (16%) loss:3.6177 lr:0.78 dt:36ms tok/s:1838242 rem:506s step 2702 (16%) loss:3.6185 lr:0.78 dt:36ms tok/s:1845115 rem:506s step 2703 (16%) loss:3.6081 lr:0.78 dt:36ms tok/s:1843667 rem:506s step 2704 (16%) loss:3.6304 lr:0.78 dt:36ms tok/s:1842728 rem:506s step 2705 (16%) loss:3.6356 lr:0.79 dt:36ms tok/s:1841198 rem:506s step 2706 (16%) loss:3.6294 lr:0.79 dt:36ms tok/s:1839842 rem:506s step 2707 (16%) loss:3.6362 lr:0.79 dt:36ms tok/s:1840766 rem:506s step 2708 (16%) loss:3.6445 lr:0.79 dt:36ms tok/s:1844744 rem:506s step 2709 (16%) loss:3.6486 lr:0.79 dt:35ms tok/s:1846169 rem:506s step 2710 (16%) loss:3.6552 lr:0.79 dt:36ms tok/s:1842555 rem:506s step 2711 (16%) loss:3.6685 lr:0.79 dt:36ms tok/s:1839386 rem:506s step 2712 (16%) loss:3.6716 lr:0.79 dt:36ms tok/s:1840680 rem:505s step 2713 (16%) loss:3.6518 lr:0.79 dt:35ms tok/s:1846851 rem:505s step 2714 (16%) loss:3.6419 lr:0.79 dt:36ms tok/s:1841629 rem:505s step 2715 (16%) loss:3.6414 lr:0.79 dt:36ms tok/s:1837382 rem:505s step 2716 (16%) loss:3.6314 lr:0.79 dt:36ms tok/s:1836683 rem:505s step 2717 (16%) loss:3.6213 lr:0.79 dt:36ms tok/s:1833607 rem:505s step 2718 (16%) loss:3.6080 lr:0.79 dt:36ms tok/s:1844632 rem:505s step 2719 (16%) loss:3.6039 lr:0.79 dt:36ms tok/s:1838390 rem:505s step 2720 (16%) loss:3.6115 lr:0.79 dt:36ms tok/s:1836449 rem:505s step 2721 (16%) loss:3.6036 lr:0.79 dt:36ms tok/s:1839583 rem:505s step 2722 (16%) loss:3.5926 lr:0.79 dt:36ms tok/s:1839879 rem:505s step 2723 (16%) loss:3.5964 lr:0.79 dt:36ms tok/s:1840051 rem:505s step 2724 (16%) loss:3.5973 lr:0.79 dt:36ms tok/s:1839510 rem:505s step 2725 (16%) loss:3.5835 lr:0.79 dt:36ms tok/s:1839805 rem:505s step 2726 (16%) loss:3.5719 lr:0.79 dt:36ms tok/s:1832507 rem:505s step 2727 (16%) loss:3.5982 lr:0.79 dt:36ms tok/s:1841037 rem:505s step 2728 (16%) loss:3.5976 lr:0.79 dt:36ms tok/s:1839226 rem:505s step 2729 (16%) loss:3.5830 lr:0.79 dt:35ms tok/s:1846392 rem:505s step 2730 (16%) loss:3.5776 lr:0.79 dt:36ms tok/s:1841222 rem:505s step 2731 (16%) loss:3.5924 lr:0.79 dt:36ms tok/s:1846070 rem:505s step 2732 (16%) loss:3.6000 lr:0.79 dt:36ms tok/s:1838292 rem:505s step 2733 (16%) loss:3.6044 lr:0.79 dt:36ms tok/s:1837542 rem:505s step 2734 (16%) loss:3.6273 lr:0.79 dt:35ms tok/s:1846987 rem:505s step 2735 (16%) loss:3.6157 lr:0.79 dt:36ms tok/s:1841963 rem:505s step 2736 (16%) loss:3.6168 lr:0.79 dt:36ms tok/s:1834770 rem:505s step 2737 (16%) loss:3.6353 lr:0.79 dt:36ms tok/s:1839559 rem:505s step 2738 (16%) loss:3.6352 lr:0.80 dt:36ms tok/s:1834635 rem:505s step 2739 (16%) loss:3.6385 lr:0.80 dt:36ms tok/s:1824843 rem:505s step 2740 (16%) loss:3.6356 lr:0.80 dt:36ms tok/s:1834635 rem:504s step 2741 (16%) loss:3.6419 lr:0.80 dt:36ms tok/s:1834439 rem:504s step 2742 (16%) loss:3.6275 lr:0.80 dt:36ms tok/s:1841136 rem:504s step 2743 (16%) loss:3.6197 lr:0.80 dt:36ms tok/s:1839510 rem:504s step 2744 (16%) loss:3.6101 lr:0.80 dt:36ms tok/s:1836597 rem:504s step 2745 (16%) loss:3.5969 lr:0.80 dt:36ms tok/s:1835763 rem:504s step 2746 (16%) loss:3.5823 lr:0.80 dt:36ms tok/s:1833106 rem:504s step 2747 (16%) loss:3.5444 lr:0.80 dt:36ms tok/s:1829860 rem:504s step 2748 (16%) loss:3.5495 lr:0.80 dt:36ms tok/s:1835517 rem:504s step 2749 (16%) loss:3.5582 lr:0.80 dt:36ms tok/s:1833082 rem:504s step 2750 (16%) loss:3.5587 lr:0.80 dt:36ms tok/s:1833656 rem:504s step 2751 (16%) loss:3.5595 lr:0.80 dt:36ms tok/s:1830628 rem:504s step 2752 (16%) loss:3.5608 lr:0.80 dt:36ms tok/s:1827864 rem:504s step 2753 (16%) loss:3.5472 lr:0.80 dt:36ms tok/s:1839940 rem:504s step 2754 (16%) loss:3.5566 lr:0.80 dt:37ms tok/s:1787068 rem:504s step 2755 (16%) loss:3.5443 lr:0.80 dt:36ms tok/s:1803259 rem:504s step 2756 (16%) loss:3.5534 lr:0.80 dt:36ms tok/s:1833779 rem:504s step 2757 (16%) loss:3.5556 lr:0.80 dt:36ms tok/s:1829653 rem:504s step 2758 (16%) loss:3.5568 lr:0.80 dt:36ms tok/s:1831543 rem:504s step 2759 (16%) loss:3.5346 lr:0.80 dt:43ms tok/s:1538054 rem:504s step 2760 (16%) loss:3.5277 lr:0.80 dt:37ms tok/s:1764889 rem:504s step 2761 (16%) loss:3.5280 lr:0.80 dt:34ms tok/s:1915672 rem:504s step 2762 (16%) loss:3.5280 lr:0.80 dt:34ms tok/s:1919552 rem:504s step 2763 (16%) loss:3.5270 lr:0.80 dt:34ms tok/s:1904127 rem:504s step 2764 (16%) loss:3.5275 lr:0.80 dt:35ms tok/s:1894299 rem:504s step 2765 (16%) loss:3.5405 lr:0.80 dt:35ms tok/s:1879057 rem:504s step 2766 (16%) loss:3.5322 lr:0.80 dt:36ms tok/s:1826092 rem:504s step 2767 (16%) loss:3.5577 lr:0.80 dt:35ms tok/s:1891574 rem:504s step 2768 (16%) loss:3.5565 lr:0.80 dt:35ms tok/s:1892851 rem:503s step 2769 (16%) loss:3.5643 lr:0.80 dt:35ms tok/s:1889780 rem:503s step 2770 (16%) loss:3.5777 lr:0.80 dt:35ms tok/s:1890833 rem:503s step 2771 (16%) loss:3.5645 lr:0.80 dt:35ms tok/s:1888209 rem:503s step 2772 (16%) loss:3.5683 lr:0.81 dt:35ms tok/s:1880150 rem:503s step 2773 (16%) loss:3.5837 lr:0.81 dt:35ms tok/s:1874099 rem:503s step 2774 (16%) loss:3.5836 lr:0.81 dt:35ms tok/s:1863680 rem:503s step 2775 (16%) loss:3.6001 lr:0.81 dt:35ms tok/s:1891132 rem:503s step 2776 (16%) loss:3.5885 lr:0.81 dt:35ms tok/s:1881772 rem:503s step 2777 (16%) loss:3.5931 lr:0.81 dt:35ms tok/s:1867935 rem:503s step 2778 (16%) loss:3.5771 lr:0.81 dt:35ms tok/s:1861560 rem:503s step 2779 (16%) loss:3.5761 lr:0.81 dt:35ms tok/s:1863983 rem:503s step 2780 (16%) loss:3.5907 lr:0.81 dt:35ms tok/s:1869180 rem:503s step 2781 (16%) loss:3.6445 lr:0.81 dt:35ms tok/s:1874138 rem:503s step 2782 (16%) loss:3.6366 lr:0.81 dt:35ms tok/s:1869028 rem:503s step 2783 (16%) loss:3.6422 lr:0.81 dt:35ms tok/s:1871726 rem:503s step 2784 (16%) loss:3.6457 lr:0.81 dt:35ms tok/s:1867974 rem:503s step 2785 (16%) loss:3.6270 lr:0.81 dt:36ms tok/s:1837444 rem:503s step 2786 (16%) loss:3.5937 lr:0.81 dt:35ms tok/s:1846429 rem:503s step 2787 (16%) loss:3.5810 lr:0.81 dt:35ms tok/s:1847099 rem:503s step 2788 (16%) loss:3.5974 lr:0.81 dt:35ms tok/s:1851017 rem:503s step 2789 (16%) loss:3.5948 lr:0.81 dt:36ms tok/s:1845660 rem:503s step 2790 (16%) loss:3.5887 lr:0.81 dt:35ms tok/s:1850182 rem:503s step 2791 (16%) loss:3.5778 lr:0.81 dt:36ms tok/s:1842988 rem:503s step 2792 (16%) loss:3.5918 lr:0.81 dt:36ms tok/s:1841592 rem:503s step 2793 (16%) loss:3.6059 lr:0.81 dt:36ms tok/s:1799139 rem:503s step 2794 (16%) loss:3.6090 lr:0.81 dt:36ms tok/s:1827646 rem:503s step 2795 (16%) loss:3.6039 lr:0.81 dt:36ms tok/s:1845202 rem:503s step 2796 (16%) loss:3.6065 lr:0.81 dt:36ms tok/s:1827245 rem:503s step 2797 (16%) loss:3.6029 lr:0.81 dt:36ms tok/s:1828180 rem:502s step 2798 (16%) loss:3.6149 lr:0.81 dt:35ms tok/s:1854727 rem:502s step 2799 (16%) loss:3.6196 lr:0.81 dt:35ms tok/s:1852002 rem:502s step 2800 (16%) loss:3.6252 lr:0.81 dt:36ms tok/s:1837923 rem:502s + local: attn=[0.072, 0.430, 0.428] mlp=[0.162, 0.139, -0.145] + + transition: attn=[1.620, 0.553] mlp=[0.019, 0.135] + + hierarchy: attn=[1.817, 5.939, 5.616] mlp=[0.695, -0.440, -0.148] + step 2801 (16%) loss:3.6239 lr:0.81 dt:35ms tok/s:1848378 rem:502s step 2802 (16%) loss:3.5851 lr:0.81 dt:36ms tok/s:1839288 rem:502s step 2803 (16%) loss:3.5864 lr:0.81 dt:35ms tok/s:1848254 rem:502s step 2804 (16%) loss:3.5884 lr:0.81 dt:35ms tok/s:1848267 rem:502s step 2805 (16%) loss:3.5715 lr:0.81 dt:36ms tok/s:1840852 rem:502s step 2806 (16%) loss:3.5741 lr:0.82 dt:35ms tok/s:1848615 rem:502s step 2807 (16%) loss:3.5780 lr:0.82 dt:36ms tok/s:1836253 rem:502s step 2808 (16%) loss:3.5846 lr:0.82 dt:37ms tok/s:1759669 rem:502s step 2809 (16%) loss:3.5710 lr:0.82 dt:39ms tok/s:1700135 rem:502s step 2810 (16%) loss:3.5472 lr:0.82 dt:36ms tok/s:1823608 rem:502s step 2811 (16%) loss:3.5569 lr:0.82 dt:35ms tok/s:1852514 rem:502s step 2812 (16%) loss:3.5581 lr:0.82 dt:36ms tok/s:1830469 rem:502s step 2813 (16%) loss:3.5555 lr:0.82 dt:35ms tok/s:1870210 rem:502s step 2814 (16%) loss:3.5729 lr:0.82 dt:35ms tok/s:1868913 rem:502s step 2815 (16%) loss:3.5763 lr:0.82 dt:35ms tok/s:1881784 rem:502s step 2816 (16%) loss:3.5697 lr:0.82 dt:35ms tok/s:1885812 rem:502s step 2817 (16%) loss:3.5660 lr:0.82 dt:36ms tok/s:1843618 rem:502s step 2818 (16%) loss:3.5625 lr:0.82 dt:35ms tok/s:1849784 rem:502s step 2819 (16%) loss:3.5586 lr:0.82 dt:35ms tok/s:1848664 rem:502s step 2820 (16%) loss:3.5419 lr:0.82 dt:35ms tok/s:1851242 rem:502s step 2821 (16%) loss:3.5489 lr:0.82 dt:36ms tok/s:1845896 rem:502s step 2822 (16%) loss:3.5541 lr:0.82 dt:36ms tok/s:1845363 rem:502s step 2823 (16%) loss:3.5531 lr:0.82 dt:35ms tok/s:1847459 rem:502s step 2824 (16%) loss:3.5417 lr:0.82 dt:35ms tok/s:1849846 rem:502s step 2825 (16%) loss:3.5500 lr:0.82 dt:36ms tok/s:1844447 rem:501s step 2826 (16%) loss:3.5550 lr:0.82 dt:35ms tok/s:1850083 rem:501s step 2827 (16%) loss:3.5484 lr:0.82 dt:36ms tok/s:1840384 rem:501s step 2828 (16%) loss:3.5360 lr:0.82 dt:36ms tok/s:1813849 rem:501s step 2829 (16%) loss:3.5262 lr:0.82 dt:36ms tok/s:1812485 rem:501s step 2830 (16%) loss:3.5270 lr:0.82 dt:36ms tok/s:1813526 rem:501s step 2831 (16%) loss:3.5298 lr:0.82 dt:36ms tok/s:1821143 rem:501s step 2832 (16%) loss:3.5319 lr:0.82 dt:36ms tok/s:1809741 rem:501s step 2833 (16%) loss:3.5217 lr:0.82 dt:36ms tok/s:1814447 rem:501s step 2834 (16%) loss:3.5234 lr:0.82 dt:36ms tok/s:1820999 rem:501s step 2835 (16%) loss:3.5239 lr:0.82 dt:36ms tok/s:1797763 rem:501s step 2836 (16%) loss:3.5314 lr:0.82 dt:36ms tok/s:1814196 rem:501s step 2837 (16%) loss:3.5187 lr:0.82 dt:36ms tok/s:1806245 rem:501s step 2838 (16%) loss:3.5347 lr:0.82 dt:36ms tok/s:1799481 rem:501s step 2839 (16%) loss:3.5349 lr:0.82 dt:36ms tok/s:1824589 rem:501s step 2840 (17%) loss:3.5208 lr:0.83 dt:36ms tok/s:1821011 rem:501s step 2841 (17%) loss:3.5447 lr:0.83 dt:36ms tok/s:1819311 rem:501s step 2842 (17%) loss:3.5498 lr:0.83 dt:36ms tok/s:1821312 rem:501s step 2843 (17%) loss:3.5388 lr:0.83 dt:36ms tok/s:1821252 rem:501s step 2844 (17%) loss:3.5159 lr:0.83 dt:36ms tok/s:1823463 rem:501s step 2845 (17%) loss:3.5196 lr:0.83 dt:36ms tok/s:1820745 rem:501s step 2846 (17%) loss:3.5436 lr:0.83 dt:36ms tok/s:1820793 rem:501s step 2847 (17%) loss:3.5697 lr:0.83 dt:36ms tok/s:1821216 rem:501s step 2848 (17%) loss:3.5692 lr:0.83 dt:36ms tok/s:1819119 rem:501s step 2849 (17%) loss:3.5717 lr:0.83 dt:36ms tok/s:1821192 rem:501s step 2850 (17%) loss:3.5768 lr:0.83 dt:36ms tok/s:1800789 rem:501s step 2851 (17%) loss:3.5819 lr:0.83 dt:36ms tok/s:1803815 rem:501s step 2852 (17%) loss:3.5881 lr:0.83 dt:42ms tok/s:1559812 rem:500s step 2853 (17%) loss:3.5784 lr:0.83 dt:38ms tok/s:1722152 rem:500s step 2854 (17%) loss:3.5817 lr:0.83 dt:36ms tok/s:1828825 rem:500s step 2855 (17%) loss:3.5821 lr:0.83 dt:36ms tok/s:1824891 rem:500s step 2856 (17%) loss:3.5726 lr:0.83 dt:36ms tok/s:1825134 rem:500s step 2857 (17%) loss:3.5747 lr:0.83 dt:35ms tok/s:1847261 rem:500s step 2858 (17%) loss:3.5688 lr:0.83 dt:35ms tok/s:1864325 rem:500s step 2859 (17%) loss:3.5631 lr:0.83 dt:35ms tok/s:1852789 rem:500s step 2860 (17%) loss:3.5741 lr:0.83 dt:36ms tok/s:1838267 rem:500s step 2861 (17%) loss:3.5815 lr:0.83 dt:36ms tok/s:1817783 rem:500s step 2862 (17%) loss:3.5747 lr:0.83 dt:35ms tok/s:1852015 rem:500s step 2863 (17%) loss:3.5625 lr:0.83 dt:35ms tok/s:1846566 rem:500s step 2864 (17%) loss:3.5496 lr:0.83 dt:36ms tok/s:1828156 rem:500s step 2865 (17%) loss:3.5413 lr:0.83 dt:36ms tok/s:1842963 rem:500s step 2866 (17%) loss:3.5540 lr:0.83 dt:36ms tok/s:1844125 rem:500s step 2867 (17%) loss:3.5526 lr:0.83 dt:35ms tok/s:1864426 rem:500s step 2868 (17%) loss:3.5511 lr:0.83 dt:35ms tok/s:1882841 rem:500s step 2869 (17%) loss:3.5499 lr:0.83 dt:35ms tok/s:1856894 rem:500s step 2870 (17%) loss:3.5616 lr:0.83 dt:37ms tok/s:1773749 rem:500s step 2871 (17%) loss:3.5502 lr:0.83 dt:36ms tok/s:1843729 rem:500s step 2872 (17%) loss:3.5469 lr:0.83 dt:35ms tok/s:1860942 rem:500s step 2873 (17%) loss:3.5446 lr:0.84 dt:35ms tok/s:1883099 rem:500s step 2874 (17%) loss:3.5572 lr:0.84 dt:35ms tok/s:1879494 rem:500s step 2875 (17%) loss:3.5479 lr:0.84 dt:35ms tok/s:1877081 rem:500s step 2876 (17%) loss:3.5396 lr:0.84 dt:35ms tok/s:1865324 rem:500s step 2877 (17%) loss:3.5422 lr:0.84 dt:36ms tok/s:1797081 rem:500s step 2878 (17%) loss:3.5430 lr:0.84 dt:35ms tok/s:1865831 rem:500s step 2879 (17%) loss:3.5319 lr:0.84 dt:35ms tok/s:1859331 rem:500s step 2880 (17%) loss:3.5385 lr:0.84 dt:35ms tok/s:1855453 rem:499s step 2881 (17%) loss:3.5439 lr:0.84 dt:35ms tok/s:1846814 rem:499s step 2882 (17%) loss:3.5449 lr:0.84 dt:35ms tok/s:1856706 rem:499s step 2883 (17%) loss:3.5472 lr:0.84 dt:35ms tok/s:1878800 rem:499s step 2884 (17%) loss:3.5497 lr:0.84 dt:35ms tok/s:1870427 rem:499s step 2885 (17%) loss:3.5558 lr:0.84 dt:35ms tok/s:1871344 rem:499s step 2886 (17%) loss:3.5593 lr:0.84 dt:35ms tok/s:1875327 rem:499s step 2887 (17%) loss:3.7401 lr:0.84 dt:35ms tok/s:1848950 rem:499s step 2888 (17%) loss:3.8778 lr:0.84 dt:35ms tok/s:1883435 rem:499s step 2889 (17%) loss:3.9301 lr:0.84 dt:34ms tok/s:1906954 rem:499s step 2890 (17%) loss:3.9616 lr:0.84 dt:34ms tok/s:1929794 rem:499s step 2891 (17%) loss:3.9768 lr:0.84 dt:34ms tok/s:1905460 rem:499s step 2892 (17%) loss:3.9690 lr:0.84 dt:35ms tok/s:1894455 rem:499s step 2893 (17%) loss:3.9545 lr:0.84 dt:34ms tok/s:1923595 rem:499s step 2894 (17%) loss:3.9215 lr:0.84 dt:34ms tok/s:1903152 rem:499s step 2895 (17%) loss:3.8941 lr:0.84 dt:34ms tok/s:1901164 rem:499s step 2896 (17%) loss:3.8646 lr:0.84 dt:35ms tok/s:1891431 rem:499s step 2897 (17%) loss:3.8627 lr:0.84 dt:35ms tok/s:1890014 rem:499s step 2898 (17%) loss:3.8539 lr:0.84 dt:35ms tok/s:1887250 rem:499s step 2899 (17%) loss:3.8225 lr:0.84 dt:35ms tok/s:1896769 rem:499s step 2900 (17%) loss:3.8056 lr:0.84 dt:35ms tok/s:1869816 rem:499s + local: attn=[0.070, 0.401, 0.423] mlp=[0.159, 0.144, -0.131] + + transition: attn=[1.526, 0.583] mlp=[-0.001, 0.127] + + hierarchy: attn=[1.787, 5.939, 5.616] mlp=[0.658, -0.058, 0.285] + step 2901 (17%) loss:3.8039 lr:0.84 dt:35ms tok/s:1894207 rem:499s step 2902 (17%) loss:3.7871 lr:0.84 dt:35ms tok/s:1898105 rem:499s step 2903 (17%) loss:3.7765 lr:0.84 dt:35ms tok/s:1869930 rem:499s step 2904 (17%) loss:3.7630 lr:0.84 dt:35ms tok/s:1867567 rem:499s step 2905 (17%) loss:3.7409 lr:0.84 dt:35ms tok/s:1874074 rem:499s step 2906 (17%) loss:3.7288 lr:0.84 dt:36ms tok/s:1835432 rem:499s step 2907 (17%) loss:3.7121 lr:0.85 dt:35ms tok/s:1867859 rem:499s step 2908 (17%) loss:3.6940 lr:0.85 dt:35ms tok/s:1854952 rem:499s step 2909 (17%) loss:3.6719 lr:0.85 dt:36ms tok/s:1840840 rem:498s step 2910 (17%) loss:3.6537 lr:0.85 dt:36ms tok/s:1841987 rem:498s step 2911 (17%) loss:3.6453 lr:0.85 dt:36ms tok/s:1836327 rem:498s step 2912 (17%) loss:3.6543 lr:0.85 dt:36ms tok/s:1842296 rem:498s step 2913 (17%) loss:3.6370 lr:0.85 dt:35ms tok/s:1851728 rem:498s step 2914 (17%) loss:3.6409 lr:0.85 dt:36ms tok/s:1845351 rem:498s step 2915 (17%) loss:3.6425 lr:0.85 dt:36ms tok/s:1843062 rem:498s step 2916 (17%) loss:3.6232 lr:0.85 dt:35ms tok/s:1849013 rem:498s step 2917 (17%) loss:3.6203 lr:0.85 dt:36ms tok/s:1820697 rem:498s step 2918 (17%) loss:3.6199 lr:0.85 dt:36ms tok/s:1820142 rem:498s step 2919 (17%) loss:3.6046 lr:0.85 dt:36ms tok/s:1822520 rem:498s step 2920 (17%) loss:3.6071 lr:0.85 dt:36ms tok/s:1822097 rem:498s step 2921 (17%) loss:3.6144 lr:0.85 dt:36ms tok/s:1824879 rem:498s step 2922 (17%) loss:3.6132 lr:0.85 dt:36ms tok/s:1803259 rem:498s step 2923 (17%) loss:3.6109 lr:0.85 dt:36ms tok/s:1825231 rem:498s step 2924 (17%) loss:3.5937 lr:0.85 dt:36ms tok/s:1827609 rem:498s step 2925 (17%) loss:3.5928 lr:0.85 dt:36ms tok/s:1816090 rem:498s step 2926 (17%) loss:3.5858 lr:0.85 dt:36ms tok/s:1822169 rem:498s step 2927 (17%) loss:3.5607 lr:0.85 dt:36ms tok/s:1820806 rem:498s step 2928 (17%) loss:3.5486 lr:0.85 dt:36ms tok/s:1826322 rem:498s step 2929 (17%) loss:3.5399 lr:0.85 dt:36ms tok/s:1816342 rem:498s step 2930 (17%) loss:3.5324 lr:0.85 dt:36ms tok/s:1818481 rem:498s step 2931 (17%) loss:3.5454 lr:0.85 dt:36ms tok/s:1818216 rem:498s step 2932 (17%) loss:3.5373 lr:0.85 dt:36ms tok/s:1822882 rem:498s step 2933 (17%) loss:3.5469 lr:0.85 dt:36ms tok/s:1825364 rem:498s step 2934 (17%) loss:3.5553 lr:0.85 dt:36ms tok/s:1823378 rem:498s step 2935 (17%) loss:3.5608 lr:0.85 dt:36ms tok/s:1815430 rem:498s step 2936 (17%) loss:3.5628 lr:0.85 dt:36ms tok/s:1815394 rem:498s step 2937 (17%) loss:3.5709 lr:0.85 dt:36ms tok/s:1813837 rem:497s step 2938 (17%) loss:3.5711 lr:0.85 dt:36ms tok/s:1816246 rem:497s step 2939 (17%) loss:3.5848 lr:0.85 dt:36ms tok/s:1821686 rem:497s step 2940 (17%) loss:3.6125 lr:0.85 dt:36ms tok/s:1816174 rem:497s step 2941 (17%) loss:3.6389 lr:0.86 dt:36ms tok/s:1796024 rem:497s step 2942 (17%) loss:3.6387 lr:0.86 dt:36ms tok/s:1818757 rem:497s step 2943 (17%) loss:3.6390 lr:0.86 dt:36ms tok/s:1823233 rem:497s step 2944 (17%) loss:3.6388 lr:0.86 dt:36ms tok/s:1815982 rem:497s step 2945 (17%) loss:3.6285 lr:0.86 dt:36ms tok/s:1815298 rem:497s step 2946 (17%) loss:3.6425 lr:0.86 dt:36ms tok/s:1816654 rem:497s step 2947 (17%) loss:3.6454 lr:0.86 dt:38ms tok/s:1705420 rem:497s step 2948 (17%) loss:3.6347 lr:0.86 dt:35ms tok/s:1846156 rem:497s step 2949 (17%) loss:3.6268 lr:0.86 dt:36ms tok/s:1836916 rem:497s step 2950 (17%) loss:3.6061 lr:0.86 dt:36ms tok/s:1815922 rem:497s step 2951 (17%) loss:3.5989 lr:0.86 dt:36ms tok/s:1820974 rem:497s step 2952 (17%) loss:3.6056 lr:0.86 dt:36ms tok/s:1827184 rem:497s step 2953 (17%) loss:3.6630 lr:0.86 dt:36ms tok/s:1804691 rem:497s step 2954 (17%) loss:3.6456 lr:0.86 dt:36ms tok/s:1805355 rem:497s step 2955 (17%) loss:3.6542 lr:0.86 dt:36ms tok/s:1809324 rem:497s step 2956 (17%) loss:3.6393 lr:0.86 dt:36ms tok/s:1814280 rem:497s step 2957 (17%) loss:3.6483 lr:0.86 dt:36ms tok/s:1812844 rem:497s step 2958 (17%) loss:3.6503 lr:0.86 dt:36ms tok/s:1814891 rem:497s step 2959 (17%) loss:3.6325 lr:0.86 dt:37ms tok/s:1789011 rem:497s step 2960 (17%) loss:3.6133 lr:0.86 dt:36ms tok/s:1821047 rem:497s step 2961 (17%) loss:3.5687 lr:0.86 dt:36ms tok/s:1814627 rem:497s step 2962 (17%) loss:3.5358 lr:0.86 dt:36ms tok/s:1810480 rem:497s step 2963 (17%) loss:3.5570 lr:0.86 dt:36ms tok/s:1809872 rem:497s step 2964 (17%) loss:3.5681 lr:0.86 dt:36ms tok/s:1810814 rem:497s step 2965 (17%) loss:3.5661 lr:0.86 dt:36ms tok/s:1827063 rem:496s step 2966 (17%) loss:3.5693 lr:0.86 dt:36ms tok/s:1819877 rem:496s step 2967 (17%) loss:3.5817 lr:0.86 dt:36ms tok/s:1820347 rem:496s step 2968 (17%) loss:3.5711 lr:0.86 dt:36ms tok/s:1827743 rem:496s step 2969 (17%) loss:3.5635 lr:0.86 dt:36ms tok/s:1816054 rem:496s step 2970 (17%) loss:3.5628 lr:0.86 dt:36ms tok/s:1818565 rem:496s step 2971 (17%) loss:3.5582 lr:0.86 dt:36ms tok/s:1827087 rem:496s step 2972 (17%) loss:3.5808 lr:0.86 dt:36ms tok/s:1825546 rem:496s step 2973 (17%) loss:3.5894 lr:0.86 dt:36ms tok/s:1823584 rem:496s step 2974 (17%) loss:3.5994 lr:0.87 dt:36ms tok/s:1811219 rem:496s step 2975 (17%) loss:3.6106 lr:0.87 dt:36ms tok/s:1824649 rem:496s step 2976 (17%) loss:3.6149 lr:0.87 dt:36ms tok/s:1826164 rem:496s step 2977 (17%) loss:3.6013 lr:0.87 dt:36ms tok/s:1814855 rem:496s step 2978 (17%) loss:3.5996 lr:0.87 dt:36ms tok/s:1817134 rem:496s step 2979 (17%) loss:3.6039 lr:0.87 dt:36ms tok/s:1821143 rem:496s step 2980 (17%) loss:3.6078 lr:0.87 dt:36ms tok/s:1811136 rem:496s step 2981 (17%) loss:3.5912 lr:0.87 dt:36ms tok/s:1822375 rem:496s step 2982 (17%) loss:3.6008 lr:0.87 dt:36ms tok/s:1821445 rem:496s step 2983 (17%) loss:3.6150 lr:0.87 dt:36ms tok/s:1821300 rem:496s step 2984 (17%) loss:3.5973 lr:0.87 dt:36ms tok/s:1824552 rem:496s step 2985 (17%) loss:3.6050 lr:0.87 dt:36ms tok/s:1822616 rem:496s step 2986 (17%) loss:3.5997 lr:0.87 dt:36ms tok/s:1799929 rem:496s step 2987 (17%) loss:3.6012 lr:0.87 dt:36ms tok/s:1827002 rem:496s step 2988 (17%) loss:3.5876 lr:0.87 dt:36ms tok/s:1819299 rem:496s step 2989 (17%) loss:3.5803 lr:0.87 dt:36ms tok/s:1817819 rem:496s step 2990 (17%) loss:3.6034 lr:0.87 dt:36ms tok/s:1815071 rem:496s step 2991 (17%) loss:3.5978 lr:0.87 dt:36ms tok/s:1820637 rem:496s step 2992 (17%) loss:3.6036 lr:0.87 dt:36ms tok/s:1817519 rem:495s step 2993 (17%) loss:3.5744 lr:0.87 dt:36ms tok/s:1815838 rem:495s step 2994 (17%) loss:3.5810 lr:0.87 dt:36ms tok/s:1817603 rem:495s step 2995 (17%) loss:3.5655 lr:0.87 dt:36ms tok/s:1815982 rem:495s step 2996 (17%) loss:3.5254 lr:0.87 dt:36ms tok/s:1812701 rem:495s step 2997 (17%) loss:3.5030 lr:0.87 dt:36ms tok/s:1820034 rem:495s step 2998 (17%) loss:3.5147 lr:0.87 dt:36ms tok/s:1818024 rem:495s step 2999 (17%) loss:3.5344 lr:0.87 dt:36ms tok/s:1818950 rem:495s step 3000 (17%) loss:3.5319 lr:0.87 dt:36ms tok/s:1822218 rem:495s + local: attn=[0.074, 0.388, 0.427] mlp=[0.162, 0.135, -0.145] + + transition: attn=[1.483, 0.509] mlp=[0.004, 0.119] + + hierarchy: attn=[1.740, 5.939, 5.616] mlp=[0.589, -0.302, 0.086] + step 3001 (17%) loss:3.5327 lr:0.87 dt:36ms tok/s:1817471 rem:495s step 3002 (17%) loss:3.5428 lr:0.87 dt:36ms tok/s:1818794 rem:495s step 3003 (17%) loss:3.5548 lr:0.87 dt:37ms tok/s:1765093 rem:495s step 3004 (17%) loss:3.5540 lr:0.87 dt:36ms tok/s:1796106 rem:495s step 3005 (17%) loss:3.5602 lr:0.87 dt:36ms tok/s:1819733 rem:495s step 3006 (17%) loss:3.5814 lr:0.87 dt:36ms tok/s:1823173 rem:495s step 3007 (18%) loss:3.5974 lr:0.88 dt:36ms tok/s:1816942 rem:495s step 3008 (18%) loss:3.6145 lr:0.88 dt:36ms tok/s:1821819 rem:495s step 3009 (18%) loss:3.6166 lr:0.88 dt:36ms tok/s:1819974 rem:495s step 3010 (18%) loss:3.6215 lr:0.88 dt:36ms tok/s:1817831 rem:495s step 3011 (18%) loss:3.6164 lr:0.88 dt:36ms tok/s:1813705 rem:495s step 3012 (18%) loss:3.6136 lr:0.88 dt:36ms tok/s:1816954 rem:495s step 3013 (18%) loss:3.6336 lr:0.88 dt:36ms tok/s:1815442 rem:495s step 3014 (18%) loss:3.6370 lr:0.88 dt:36ms tok/s:1818409 rem:495s step 3015 (18%) loss:3.6317 lr:0.88 dt:36ms tok/s:1814627 rem:495s step 3016 (18%) loss:3.6271 lr:0.88 dt:36ms tok/s:1820347 rem:495s step 3017 (18%) loss:3.6224 lr:0.88 dt:36ms tok/s:1818336 rem:495s step 3018 (18%) loss:3.6061 lr:0.88 dt:36ms tok/s:1806731 rem:495s step 3019 (18%) loss:3.6043 lr:0.88 dt:36ms tok/s:1803980 rem:495s step 3020 (18%) loss:3.6059 lr:0.88 dt:36ms tok/s:1814939 rem:494s step 3021 (18%) loss:3.6081 lr:0.88 dt:36ms tok/s:1820697 rem:494s step 3022 (18%) loss:3.6096 lr:0.88 dt:36ms tok/s:1820371 rem:494s step 3023 (18%) loss:3.6024 lr:0.88 dt:36ms tok/s:1820938 rem:494s step 3024 (18%) loss:3.6172 lr:0.88 dt:36ms tok/s:1817711 rem:494s step 3025 (18%) loss:3.6361 lr:0.88 dt:36ms tok/s:1815358 rem:494s step 3026 (18%) loss:3.6176 lr:0.88 dt:36ms tok/s:1815094 rem:494s step 3027 (18%) loss:3.5784 lr:0.88 dt:36ms tok/s:1818000 rem:494s step 3028 (18%) loss:3.5796 lr:0.88 dt:37ms tok/s:1759174 rem:494s step 3029 (18%) loss:3.5691 lr:0.88 dt:37ms tok/s:1765694 rem:494s step 3030 (18%) loss:3.5559 lr:0.88 dt:37ms tok/s:1793879 rem:494s step 3031 (18%) loss:3.5532 lr:0.88 dt:36ms tok/s:1813729 rem:494s step 3032 (18%) loss:3.5456 lr:0.88 dt:36ms tok/s:1814987 rem:494s step 3033 (18%) loss:3.5495 lr:0.88 dt:36ms tok/s:1812509 rem:494s step 3034 (18%) loss:3.5483 lr:0.88 dt:36ms tok/s:1813693 rem:494s step 3035 (18%) loss:3.5505 lr:0.88 dt:36ms tok/s:1815035 rem:494s step 3036 (18%) loss:3.5594 lr:0.88 dt:36ms tok/s:1811005 rem:494s step 3037 (18%) loss:3.5523 lr:0.88 dt:36ms tok/s:1814507 rem:494s step 3038 (18%) loss:3.5406 lr:0.88 dt:36ms tok/s:1811745 rem:494s step 3039 (18%) loss:3.5492 lr:0.88 dt:36ms tok/s:1807004 rem:494s step 3040 (18%) loss:3.5538 lr:0.89 dt:38ms tok/s:1722930 rem:494s step 3041 (18%) loss:3.5559 lr:0.89 dt:36ms tok/s:1806945 rem:494s step 3042 (18%) loss:3.5518 lr:0.89 dt:36ms tok/s:1815658 rem:494s step 3043 (18%) loss:3.5657 lr:0.89 dt:36ms tok/s:1816102 rem:494s step 3044 (18%) loss:3.5615 lr:0.89 dt:36ms tok/s:1818529 rem:494s step 3045 (18%) loss:3.5722 lr:0.89 dt:36ms tok/s:1817915 rem:494s step 3046 (18%) loss:3.6180 lr:0.89 dt:36ms tok/s:1808574 rem:494s step 3047 (18%) loss:3.6351 lr:0.89 dt:36ms tok/s:1804810 rem:494s step 3048 (18%) loss:3.6359 lr:0.89 dt:36ms tok/s:1803117 rem:493s step 3049 (18%) loss:3.6191 lr:0.89 dt:36ms tok/s:1806565 rem:493s step 3050 (18%) loss:3.6243 lr:0.89 dt:36ms tok/s:1799010 rem:493s step 3051 (18%) loss:3.6224 lr:0.89 dt:36ms tok/s:1816294 rem:493s step 3052 (18%) loss:3.6063 lr:0.89 dt:36ms tok/s:1806387 rem:493s step 3053 (18%) loss:3.6085 lr:0.89 dt:36ms tok/s:1805936 rem:493s step 3054 (18%) loss:3.6088 lr:0.89 dt:36ms tok/s:1816546 rem:493s step 3055 (18%) loss:3.6104 lr:0.89 dt:36ms tok/s:1819672 rem:493s step 3056 (18%) loss:3.6018 lr:0.89 dt:36ms tok/s:1818360 rem:493s step 3057 (18%) loss:3.5933 lr:0.89 dt:36ms tok/s:1805983 rem:493s step 3058 (18%) loss:3.5953 lr:0.89 dt:36ms tok/s:1824274 rem:493s step 3059 (18%) loss:3.5958 lr:0.89 dt:36ms tok/s:1818914 rem:493s step 3060 (18%) loss:3.5931 lr:0.89 dt:36ms tok/s:1813813 rem:493s step 3061 (18%) loss:3.5814 lr:0.89 dt:36ms tok/s:1804525 rem:493s step 3062 (18%) loss:3.5870 lr:0.89 dt:36ms tok/s:1826358 rem:493s step 3063 (18%) loss:3.5831 lr:0.89 dt:36ms tok/s:1828862 rem:493s step 3064 (18%) loss:3.5987 lr:0.89 dt:36ms tok/s:1833216 rem:493s step 3065 (18%) loss:3.5859 lr:0.89 dt:33ms tok/s:1995933 rem:493s step 3066 (18%) loss:3.5970 lr:0.89 dt:35ms tok/s:1886680 rem:493s step 3067 (18%) loss:3.5973 lr:0.89 dt:34ms tok/s:1905606 rem:493s step 3068 (18%) loss:3.5953 lr:0.89 dt:34ms tok/s:1901743 rem:493s step 3069 (18%) loss:3.5913 lr:0.89 dt:35ms tok/s:1899338 rem:493s step 3070 (18%) loss:3.5906 lr:0.89 dt:34ms tok/s:1902928 rem:493s step 3071 (18%) loss:3.5843 lr:0.89 dt:35ms tok/s:1887742 rem:493s step 3072 (18%) loss:3.5919 lr:0.89 dt:35ms tok/s:1889260 rem:493s step 3073 (18%) loss:3.5904 lr:0.89 dt:35ms tok/s:1893894 rem:493s step 3074 (18%) loss:3.5991 lr:0.90 dt:35ms tok/s:1875685 rem:493s step 3075 (18%) loss:3.5955 lr:0.90 dt:35ms tok/s:1894795 rem:493s step 3076 (18%) loss:3.5941 lr:0.90 dt:35ms tok/s:1883474 rem:492s step 3077 (18%) loss:3.6002 lr:0.90 dt:35ms tok/s:1894508 rem:492s step 3078 (18%) loss:3.6085 lr:0.90 dt:35ms tok/s:1893725 rem:492s step 3079 (18%) loss:3.6213 lr:0.90 dt:35ms tok/s:1869282 rem:492s step 3080 (18%) loss:3.6232 lr:0.90 dt:35ms tok/s:1861661 rem:492s step 3081 (18%) loss:3.6128 lr:0.90 dt:35ms tok/s:1873933 rem:492s step 3082 (18%) loss:3.5976 lr:0.90 dt:35ms tok/s:1867897 rem:492s step 3083 (18%) loss:3.5944 lr:0.90 dt:35ms tok/s:1861371 rem:492s step 3084 (18%) loss:3.5876 lr:0.90 dt:35ms tok/s:1852876 rem:492s step 3085 (18%) loss:3.5680 lr:0.90 dt:35ms tok/s:1854376 rem:492s step 3086 (18%) loss:3.5794 lr:0.90 dt:35ms tok/s:1855015 rem:492s step 3087 (18%) loss:3.6063 lr:0.90 dt:37ms tok/s:1785142 rem:492s step 3088 (18%) loss:3.5866 lr:0.90 dt:36ms tok/s:1822931 rem:492s step 3089 (18%) loss:3.5839 lr:0.90 dt:36ms tok/s:1832947 rem:492s step 3090 (18%) loss:3.5834 lr:0.90 dt:36ms tok/s:1837653 rem:492s step 3091 (18%) loss:3.5975 lr:0.90 dt:36ms tok/s:1842172 rem:492s step 3092 (18%) loss:3.6088 lr:0.90 dt:36ms tok/s:1841580 rem:492s step 3093 (18%) loss:3.6105 lr:0.90 dt:36ms tok/s:1831811 rem:492s step 3094 (18%) loss:3.5928 lr:0.90 dt:36ms tok/s:1842024 rem:492s step 3095 (18%) loss:3.5449 lr:0.90 dt:36ms tok/s:1838341 rem:492s step 3096 (18%) loss:3.5464 lr:0.90 dt:36ms tok/s:1835873 rem:492s step 3097 (18%) loss:3.5664 lr:0.90 dt:36ms tok/s:1839349 rem:492s step 3098 (18%) loss:3.5751 lr:0.90 dt:36ms tok/s:1832666 rem:492s step 3099 (18%) loss:3.5814 lr:0.90 dt:36ms tok/s:1833448 rem:492s step 3100 (18%) loss:3.5666 lr:0.90 dt:38ms tok/s:1722433 rem:492s + local: attn=[0.067, 0.390, 0.400] mlp=[0.161, 0.126, -0.129] + + transition: attn=[1.458, 0.521] mlp=[0.016, 0.125] + + hierarchy: attn=[1.713, 5.939, 5.616] mlp=[0.634, -0.222, 0.025] + step 3101 (18%) loss:3.5764 lr:0.90 dt:36ms tok/s:1841272 rem:492s step 3102 (18%) loss:3.5850 lr:0.90 dt:35ms tok/s:1848602 rem:492s step 3103 (18%) loss:3.5876 lr:0.90 dt:36ms tok/s:1826820 rem:492s step 3104 (18%) loss:3.5900 lr:0.90 dt:36ms tok/s:1835285 rem:491s step 3105 (18%) loss:3.6141 lr:0.90 dt:36ms tok/s:1838722 rem:491s step 3106 (18%) loss:3.6052 lr:0.90 dt:36ms tok/s:1829215 rem:491s step 3107 (18%) loss:3.6159 lr:0.90 dt:36ms tok/s:1833045 rem:491s step 3108 (18%) loss:3.6237 lr:0.91 dt:36ms tok/s:1845041 rem:491s step 3109 (18%) loss:3.6285 lr:0.91 dt:36ms tok/s:1834476 rem:491s step 3110 (18%) loss:3.6112 lr:0.91 dt:36ms tok/s:1835664 rem:491s step 3111 (18%) loss:3.6094 lr:0.91 dt:36ms tok/s:1836118 rem:491s step 3112 (18%) loss:3.6024 lr:0.91 dt:36ms tok/s:1840187 rem:491s step 3113 (18%) loss:3.5797 lr:0.91 dt:36ms tok/s:1843828 rem:491s step 3114 (18%) loss:3.5767 lr:0.91 dt:36ms tok/s:1840569 rem:491s step 3115 (18%) loss:3.5725 lr:0.91 dt:36ms tok/s:1828412 rem:491s step 3116 (18%) loss:3.5892 lr:0.91 dt:36ms tok/s:1838353 rem:491s step 3117 (18%) loss:3.5931 lr:0.91 dt:36ms tok/s:1832678 rem:491s step 3118 (18%) loss:3.5840 lr:0.91 dt:36ms tok/s:1839103 rem:491s step 3119 (18%) loss:3.6411 lr:0.91 dt:36ms tok/s:1839116 rem:491s step 3120 (18%) loss:3.6688 lr:0.91 dt:36ms tok/s:1838169 rem:491s step 3121 (18%) loss:3.6669 lr:0.91 dt:36ms tok/s:1835468 rem:491s step 3122 (18%) loss:3.6572 lr:0.91 dt:36ms tok/s:1841161 rem:491s step 3123 (18%) loss:3.6436 lr:0.91 dt:36ms tok/s:1844014 rem:491s step 3124 (18%) loss:3.6349 lr:0.91 dt:36ms tok/s:1837345 rem:491s step 3125 (18%) loss:3.6224 lr:0.91 dt:36ms tok/s:1837911 rem:491s step 3126 (18%) loss:3.6148 lr:0.91 dt:36ms tok/s:1845066 rem:491s step 3127 (18%) loss:3.6063 lr:0.91 dt:36ms tok/s:1831323 rem:491s step 3128 (18%) loss:3.6133 lr:0.91 dt:36ms tok/s:1835125 rem:491s step 3129 (18%) loss:3.6019 lr:0.91 dt:36ms tok/s:1838230 rem:491s step 3130 (18%) loss:3.5999 lr:0.91 dt:36ms tok/s:1839485 rem:491s step 3131 (18%) loss:3.5885 lr:0.91 dt:36ms tok/s:1837333 rem:491s step 3132 (18%) loss:3.6188 lr:0.91 dt:36ms tok/s:1836707 rem:490s step 3133 (18%) loss:3.6229 lr:0.91 dt:36ms tok/s:1844162 rem:490s step 3134 (18%) loss:3.6258 lr:0.91 dt:36ms tok/s:1837567 rem:490s step 3135 (18%) loss:3.6298 lr:0.91 dt:36ms tok/s:1818433 rem:490s step 3136 (18%) loss:3.6252 lr:0.91 dt:36ms tok/s:1824819 rem:490s step 3137 (18%) loss:3.6092 lr:0.91 dt:40ms tok/s:1649620 rem:490s step 3138 (18%) loss:3.5825 lr:0.91 dt:36ms tok/s:1833509 rem:490s step 3139 (18%) loss:3.5739 lr:0.91 dt:36ms tok/s:1833485 rem:490s step 3140 (18%) loss:3.5717 lr:0.91 dt:36ms tok/s:1831189 rem:490s step 3141 (18%) loss:3.5622 lr:0.92 dt:36ms tok/s:1841346 rem:490s step 3142 (18%) loss:3.5742 lr:0.92 dt:36ms tok/s:1825934 rem:490s step 3143 (18%) loss:3.5633 lr:0.92 dt:36ms tok/s:1837272 rem:490s step 3144 (18%) loss:3.5638 lr:0.92 dt:36ms tok/s:1835187 rem:490s step 3145 (18%) loss:3.5613 lr:0.92 dt:36ms tok/s:1829263 rem:490s step 3146 (18%) loss:3.5606 lr:0.92 dt:36ms tok/s:1832361 rem:490s step 3147 (18%) loss:3.5549 lr:0.92 dt:36ms tok/s:1834452 rem:490s step 3148 (18%) loss:3.5462 lr:0.92 dt:36ms tok/s:1823535 rem:490s step 3149 (18%) loss:3.5546 lr:0.92 dt:36ms tok/s:1831030 rem:490s step 3150 (18%) loss:3.5625 lr:0.92 dt:36ms tok/s:1825437 rem:490s step 3151 (18%) loss:3.5688 lr:0.92 dt:36ms tok/s:1835138 rem:490s step 3152 (18%) loss:3.5683 lr:0.92 dt:36ms tok/s:1831384 rem:490s step 3153 (18%) loss:3.5575 lr:0.92 dt:36ms tok/s:1822242 rem:490s step 3154 (18%) loss:3.5457 lr:0.92 dt:36ms tok/s:1831274 rem:490s step 3155 (18%) loss:3.5398 lr:0.92 dt:36ms tok/s:1834244 rem:490s step 3156 (18%) loss:3.5348 lr:0.92 dt:36ms tok/s:1831958 rem:490s step 3157 (18%) loss:3.5248 lr:0.92 dt:36ms tok/s:1833473 rem:490s step 3158 (18%) loss:3.5320 lr:0.92 dt:36ms tok/s:1833143 rem:490s step 3159 (18%) loss:3.5232 lr:0.92 dt:41ms tok/s:1597217 rem:490s step 3160 (18%) loss:3.5142 lr:0.92 dt:36ms tok/s:1806933 rem:489s step 3161 (18%) loss:3.5177 lr:0.92 dt:35ms tok/s:1898826 rem:489s step 3162 (18%) loss:3.5236 lr:0.92 dt:35ms tok/s:1882609 rem:489s step 3163 (18%) loss:3.5221 lr:0.92 dt:35ms tok/s:1883719 rem:489s step 3164 (18%) loss:3.5176 lr:0.92 dt:35ms tok/s:1873946 rem:489s step 3165 (18%) loss:3.4900 lr:0.92 dt:35ms tok/s:1874560 rem:489s step 3166 (18%) loss:3.4537 lr:0.92 dt:35ms tok/s:1877312 rem:489s step 3167 (18%) loss:3.4538 lr:0.92 dt:35ms tok/s:1882158 rem:489s step 3168 (18%) loss:3.4505 lr:0.92 dt:35ms tok/s:1855415 rem:489s step 3169 (18%) loss:3.4258 lr:0.92 dt:35ms tok/s:1856668 rem:489s step 3170 (18%) loss:3.4055 lr:0.92 dt:35ms tok/s:1854439 rem:489s step 3171 (18%) loss:3.4489 lr:0.92 dt:36ms tok/s:1839226 rem:489s step 3172 (18%) loss:3.4778 lr:0.92 dt:36ms tok/s:1830384 rem:489s step 3173 (18%) loss:3.4991 lr:0.92 dt:35ms tok/s:1850519 rem:489s step 3174 (18%) loss:3.5098 lr:0.92 dt:35ms tok/s:1851878 rem:489s step 3175 (19%) loss:3.5181 lr:0.93 dt:35ms tok/s:1847993 rem:489s step 3176 (19%) loss:3.5305 lr:0.93 dt:38ms tok/s:1740328 rem:489s step 3177 (19%) loss:3.5413 lr:0.93 dt:36ms tok/s:1808562 rem:489s step 3178 (19%) loss:3.5425 lr:0.93 dt:35ms tok/s:1854539 rem:489s step 3179 (19%) loss:3.5506 lr:0.93 dt:35ms tok/s:1854076 rem:489s step 3180 (19%) loss:3.5656 lr:0.93 dt:35ms tok/s:1862127 rem:489s step 3181 (19%) loss:3.5653 lr:0.93 dt:35ms tok/s:1870121 rem:489s step 3182 (19%) loss:3.5741 lr:0.93 dt:35ms tok/s:1863200 rem:489s step 3183 (19%) loss:3.5736 lr:0.93 dt:35ms tok/s:1859671 rem:489s step 3184 (19%) loss:3.5892 lr:0.93 dt:35ms tok/s:1864514 rem:489s step 3185 (19%) loss:3.5912 lr:0.93 dt:35ms tok/s:1856568 rem:489s step 3186 (19%) loss:3.6059 lr:0.93 dt:35ms tok/s:1857810 rem:489s step 3187 (19%) loss:3.6167 lr:0.93 dt:35ms tok/s:1864628 rem:489s step 3188 (19%) loss:3.6224 lr:0.93 dt:36ms tok/s:1835089 rem:488s step 3189 (19%) loss:3.6143 lr:0.93 dt:36ms tok/s:1818192 rem:488s step 3190 (19%) loss:3.6179 lr:0.93 dt:37ms tok/s:1750112 rem:488s step 3191 (19%) loss:3.6157 lr:0.93 dt:35ms tok/s:1851653 rem:488s step 3192 (19%) loss:3.6109 lr:0.93 dt:35ms tok/s:1853164 rem:488s step 3193 (19%) loss:3.6232 lr:0.93 dt:36ms tok/s:1838648 rem:488s step 3194 (19%) loss:3.6208 lr:0.93 dt:36ms tok/s:1833326 rem:488s step 3195 (19%) loss:3.6225 lr:0.93 dt:36ms tok/s:1816222 rem:488s step 3196 (19%) loss:3.6232 lr:0.93 dt:36ms tok/s:1819022 rem:488s step 3197 (19%) loss:3.6098 lr:0.93 dt:36ms tok/s:1821312 rem:488s step 3198 (19%) loss:3.6060 lr:0.93 dt:36ms tok/s:1823354 rem:488s step 3199 (19%) loss:3.6044 lr:0.93 dt:36ms tok/s:1818312 rem:488s step 3200 (19%) loss:3.6012 lr:0.93 dt:36ms tok/s:1824383 rem:488s + local: attn=[0.068, 0.393, 0.373] mlp=[0.149, 0.124, -0.142] + + transition: attn=[1.414, 0.481] mlp=[-0.009, 0.114] + + hierarchy: attn=[1.620, 5.939, 5.616] mlp=[0.561, -0.044, 0.191] + step 3201 (19%) loss:3.5865 lr:0.93 dt:36ms tok/s:1822677 rem:488s step 3202 (19%) loss:3.5995 lr:0.93 dt:36ms tok/s:1822194 rem:488s step 3203 (19%) loss:3.6347 lr:0.93 dt:36ms tok/s:1816654 rem:488s step 3204 (19%) loss:3.6281 lr:0.93 dt:36ms tok/s:1818336 rem:488s step 3205 (19%) loss:3.6232 lr:0.93 dt:36ms tok/s:1814136 rem:488s step 3206 (19%) loss:3.6265 lr:0.93 dt:36ms tok/s:1821843 rem:488s step 3207 (19%) loss:3.6255 lr:0.93 dt:36ms tok/s:1826371 rem:488s step 3208 (19%) loss:3.6283 lr:0.94 dt:36ms tok/s:1821590 rem:488s step 3209 (19%) loss:3.6101 lr:0.94 dt:36ms tok/s:1817531 rem:488s step 3210 (19%) loss:3.6332 lr:0.94 dt:36ms tok/s:1822109 rem:488s step 3211 (19%) loss:3.6590 lr:0.94 dt:36ms tok/s:1816090 rem:488s step 3212 (19%) loss:3.6531 lr:0.94 dt:37ms tok/s:1751797 rem:488s step 3213 (19%) loss:3.6414 lr:0.94 dt:36ms tok/s:1797351 rem:488s step 3214 (19%) loss:3.6293 lr:0.94 dt:40ms tok/s:1645828 rem:488s step 3215 (19%) loss:3.6162 lr:0.94 dt:36ms tok/s:1836094 rem:488s step 3216 (19%) loss:3.5999 lr:0.94 dt:35ms tok/s:1849174 rem:487s step 3217 (19%) loss:3.5874 lr:0.94 dt:36ms tok/s:1835542 rem:487s step 3218 (19%) loss:3.5835 lr:0.94 dt:36ms tok/s:1830628 rem:487s step 3219 (19%) loss:3.5828 lr:0.94 dt:36ms tok/s:1815382 rem:487s step 3220 (19%) loss:3.5902 lr:0.94 dt:36ms tok/s:1822218 rem:487s step 3221 (19%) loss:3.5776 lr:0.94 dt:36ms tok/s:1817218 rem:487s step 3222 (19%) loss:3.5630 lr:0.94 dt:36ms tok/s:1825509 rem:487s step 3223 (19%) loss:3.5492 lr:0.94 dt:36ms tok/s:1824576 rem:487s step 3224 (19%) loss:3.5698 lr:0.94 dt:36ms tok/s:1817411 rem:487s step 3225 (19%) loss:3.5688 lr:0.94 dt:36ms tok/s:1817050 rem:487s step 3226 (19%) loss:3.5564 lr:0.94 dt:36ms tok/s:1818481 rem:487s step 3227 (19%) loss:3.5499 lr:0.94 dt:36ms tok/s:1815478 rem:487s step 3228 (19%) loss:3.5534 lr:0.94 dt:36ms tok/s:1820408 rem:487s step 3229 (19%) loss:3.5571 lr:0.94 dt:36ms tok/s:1828132 rem:487s step 3230 (19%) loss:3.5414 lr:0.94 dt:36ms tok/s:1819986 rem:487s step 3231 (19%) loss:3.5433 lr:0.94 dt:36ms tok/s:1822725 rem:487s step 3232 (19%) loss:3.5249 lr:0.94 dt:36ms tok/s:1814244 rem:487s step 3233 (19%) loss:3.5178 lr:0.94 dt:36ms tok/s:1821264 rem:487s step 3234 (19%) loss:3.5182 lr:0.94 dt:36ms tok/s:1821288 rem:487s step 3235 (19%) loss:3.5108 lr:0.94 dt:37ms tok/s:1773990 rem:487s step 3236 (19%) loss:3.5113 lr:0.94 dt:37ms tok/s:1761417 rem:487s step 3237 (19%) loss:3.5215 lr:0.94 dt:36ms tok/s:1814567 rem:487s step 3238 (19%) loss:3.5204 lr:0.94 dt:36ms tok/s:1812509 rem:487s step 3239 (19%) loss:3.5172 lr:0.94 dt:36ms tok/s:1813693 rem:487s step 3240 (19%) loss:3.5116 lr:0.94 dt:36ms tok/s:1801592 rem:487s step 3241 (19%) loss:3.4973 lr:0.94 dt:36ms tok/s:1805248 rem:487s step 3242 (19%) loss:3.5025 lr:0.95 dt:36ms tok/s:1814987 rem:487s step 3243 (19%) loss:3.5175 lr:0.95 dt:36ms tok/s:1805141 rem:486s step 3244 (19%) loss:3.5054 lr:0.95 dt:36ms tok/s:1815514 rem:486s step 3245 (19%) loss:3.5001 lr:0.95 dt:36ms tok/s:1814939 rem:486s step 3246 (19%) loss:3.5084 lr:0.95 dt:36ms tok/s:1802821 rem:486s step 3247 (19%) loss:3.5148 lr:0.95 dt:36ms tok/s:1811005 rem:486s step 3248 (19%) loss:3.5191 lr:0.95 dt:36ms tok/s:1811757 rem:486s step 3249 (19%) loss:3.5504 lr:0.95 dt:36ms tok/s:1813107 rem:486s step 3250 (19%) loss:3.5497 lr:0.95 dt:36ms tok/s:1806458 rem:486s step 3251 (19%) loss:3.5557 lr:0.95 dt:36ms tok/s:1806292 rem:486s step 3252 (19%) loss:3.5755 lr:0.95 dt:36ms tok/s:1809300 rem:486s step 3253 (19%) loss:3.6477 lr:0.95 dt:36ms tok/s:1809753 rem:486s step 3254 (19%) loss:3.6797 lr:0.95 dt:36ms tok/s:1809491 rem:486s step 3255 (19%) loss:3.6920 lr:0.95 dt:36ms tok/s:1810611 rem:486s step 3256 (19%) loss:3.7013 lr:0.95 dt:39ms tok/s:1687942 rem:486s step 3257 (19%) loss:3.6876 lr:0.95 dt:36ms tok/s:1819010 rem:486s step 3258 (19%) loss:3.6826 lr:0.95 dt:36ms tok/s:1830652 rem:486s step 3259 (19%) loss:3.6718 lr:0.95 dt:36ms tok/s:1811482 rem:486s step 3260 (19%) loss:3.6550 lr:0.95 dt:36ms tok/s:1814903 rem:486s step 3261 (19%) loss:3.6481 lr:0.95 dt:36ms tok/s:1839214 rem:486s step 3262 (19%) loss:3.6393 lr:0.95 dt:36ms tok/s:1834036 rem:486s step 3263 (19%) loss:3.6318 lr:0.95 dt:35ms tok/s:1861093 rem:486s step 3264 (19%) loss:3.6238 lr:0.95 dt:35ms tok/s:1854702 rem:486s step 3265 (19%) loss:3.6206 lr:0.95 dt:35ms tok/s:1856355 rem:486s step 3266 (19%) loss:3.6121 lr:0.95 dt:35ms tok/s:1859796 rem:486s step 3267 (19%) loss:3.6176 lr:0.95 dt:35ms tok/s:1856405 rem:486s step 3268 (19%) loss:3.6228 lr:0.95 dt:35ms tok/s:1851890 rem:486s step 3269 (19%) loss:3.6140 lr:0.95 dt:36ms tok/s:1837886 rem:486s step 3270 (19%) loss:3.6197 lr:0.95 dt:35ms tok/s:1860640 rem:486s step 3271 (19%) loss:3.6046 lr:0.95 dt:35ms tok/s:1861446 rem:485s step 3272 (19%) loss:3.6352 lr:0.95 dt:35ms tok/s:1850257 rem:485s step 3273 (19%) loss:3.6348 lr:0.95 dt:36ms tok/s:1826820 rem:485s step 3274 (19%) loss:3.6352 lr:0.95 dt:36ms tok/s:1839128 rem:485s step 3275 (19%) loss:3.6264 lr:0.96 dt:36ms tok/s:1840741 rem:485s step 3276 (19%) loss:3.6151 lr:0.96 dt:36ms tok/s:1835432 rem:485s step 3277 (19%) loss:3.6166 lr:0.96 dt:36ms tok/s:1832642 rem:485s step 3278 (19%) loss:3.6250 lr:0.96 dt:36ms tok/s:1836756 rem:485s step 3279 (19%) loss:3.6140 lr:0.96 dt:36ms tok/s:1825522 rem:485s step 3280 (19%) loss:3.6024 lr:0.96 dt:36ms tok/s:1824492 rem:485s step 3281 (19%) loss:3.5942 lr:0.96 dt:36ms tok/s:1840630 rem:485s step 3282 (19%) loss:3.6000 lr:0.96 dt:36ms tok/s:1838267 rem:485s step 3283 (19%) loss:3.6063 lr:0.96 dt:36ms tok/s:1842283 rem:485s step 3284 (19%) loss:3.5945 lr:0.96 dt:36ms tok/s:1839571 rem:485s step 3285 (19%) loss:3.6204 lr:0.96 dt:36ms tok/s:1830104 rem:485s step 3286 (19%) loss:3.6062 lr:0.96 dt:36ms tok/s:1835996 rem:485s step 3287 (19%) loss:3.5988 lr:0.96 dt:36ms tok/s:1837358 rem:485s step 3288 (19%) loss:3.5863 lr:0.96 dt:36ms tok/s:1833289 rem:485s step 3289 (19%) loss:3.5584 lr:0.96 dt:36ms tok/s:1833179 rem:485s step 3290 (19%) loss:3.5671 lr:0.96 dt:36ms tok/s:1839263 rem:485s step 3291 (19%) loss:3.5616 lr:0.96 dt:36ms tok/s:1836388 rem:485s step 3292 (19%) loss:3.5822 lr:0.96 dt:36ms tok/s:1834721 rem:485s step 3293 (19%) loss:3.5931 lr:0.96 dt:36ms tok/s:1839362 rem:485s step 3294 (19%) loss:3.5876 lr:0.96 dt:36ms tok/s:1830603 rem:485s step 3295 (19%) loss:3.5803 lr:0.96 dt:36ms tok/s:1839030 rem:485s step 3296 (19%) loss:3.5797 lr:0.96 dt:35ms tok/s:1861925 rem:485s step 3297 (19%) loss:3.5857 lr:0.96 dt:36ms tok/s:1835260 rem:485s step 3298 (19%) loss:3.6014 lr:0.96 dt:36ms tok/s:1809657 rem:485s step 3299 (19%) loss:3.5967 lr:0.96 dt:36ms tok/s:1828922 rem:484s step 3300 (19%) loss:3.5711 lr:0.96 dt:36ms tok/s:1821735 rem:484s + local: attn=[0.062, 0.377, 0.361] mlp=[0.142, 0.120, -0.117] + + transition: attn=[1.336, 0.482] mlp=[0.004, 0.107] + + hierarchy: attn=[1.588, 5.939, 5.616] mlp=[0.556, -0.059, 0.093] + step 3301 (19%) loss:3.5728 lr:0.96 dt:36ms tok/s:1802336 rem:484s step 3302 (19%) loss:3.5714 lr:0.96 dt:37ms tok/s:1762953 rem:484s step 3303 (19%) loss:3.5637 lr:0.96 dt:37ms tok/s:1793715 rem:484s step 3304 (19%) loss:3.5585 lr:0.96 dt:36ms tok/s:1810516 rem:484s step 3305 (19%) loss:3.5581 lr:0.96 dt:43ms tok/s:1522195 rem:484s step 3306 (19%) loss:3.5531 lr:0.96 dt:36ms tok/s:1800896 rem:484s step 3307 (19%) loss:3.5518 lr:0.96 dt:36ms tok/s:1822846 rem:484s step 3308 (19%) loss:3.5507 lr:0.97 dt:36ms tok/s:1838833 rem:484s step 3309 (19%) loss:3.5509 lr:0.97 dt:36ms tok/s:1835640 rem:484s step 3310 (19%) loss:3.5525 lr:0.97 dt:37ms tok/s:1784782 rem:484s step 3311 (19%) loss:3.5395 lr:0.97 dt:35ms tok/s:1863352 rem:484s step 3312 (19%) loss:3.5494 lr:0.97 dt:35ms tok/s:1857609 rem:484s step 3313 (19%) loss:3.5477 lr:0.97 dt:35ms tok/s:1859369 rem:484s step 3314 (19%) loss:3.5887 lr:0.97 dt:35ms tok/s:1851117 rem:484s step 3315 (19%) loss:3.5767 lr:0.97 dt:35ms tok/s:1856894 rem:484s step 3316 (19%) loss:3.5751 lr:0.97 dt:35ms tok/s:1862165 rem:484s step 3317 (19%) loss:3.5917 lr:0.97 dt:35ms tok/s:1869142 rem:484s step 3318 (19%) loss:3.5801 lr:0.97 dt:35ms tok/s:1849722 rem:484s step 3319 (19%) loss:3.5903 lr:0.97 dt:36ms tok/s:1842642 rem:484s step 3320 (19%) loss:3.5924 lr:0.97 dt:35ms tok/s:1847770 rem:484s step 3321 (19%) loss:3.5870 lr:0.97 dt:35ms tok/s:1850631 rem:484s step 3322 (19%) loss:3.5755 lr:0.97 dt:35ms tok/s:1852514 rem:484s step 3323 (19%) loss:3.5686 lr:0.97 dt:36ms tok/s:1844311 rem:484s step 3324 (19%) loss:3.5521 lr:0.97 dt:35ms tok/s:1846392 rem:484s step 3325 (19%) loss:3.5477 lr:0.97 dt:35ms tok/s:1849112 rem:484s step 3326 (19%) loss:3.5373 lr:0.97 dt:36ms tok/s:1820299 rem:484s step 3327 (19%) loss:3.5602 lr:0.97 dt:36ms tok/s:1830798 rem:483s step 3328 (19%) loss:3.5434 lr:0.97 dt:35ms tok/s:1853351 rem:483s step 3329 (19%) loss:3.5269 lr:0.97 dt:35ms tok/s:1855002 rem:483s step 3330 (19%) loss:3.5206 lr:0.97 dt:35ms tok/s:1854777 rem:483s step 3331 (19%) loss:3.5051 lr:0.97 dt:35ms tok/s:1849784 rem:483s step 3332 (19%) loss:3.5166 lr:0.97 dt:35ms tok/s:1860804 rem:483s step 3333 (19%) loss:3.5164 lr:0.97 dt:35ms tok/s:1858087 rem:483s step 3334 (19%) loss:3.5312 lr:0.97 dt:35ms tok/s:1848677 rem:483s step 3335 (19%) loss:3.5372 lr:0.97 dt:36ms tok/s:1840187 rem:483s step 3336 (19%) loss:3.5120 lr:0.97 dt:36ms tok/s:1844076 rem:483s step 3337 (19%) loss:3.5266 lr:0.97 dt:36ms tok/s:1839817 rem:483s step 3338 (19%) loss:3.5299 lr:0.97 dt:35ms tok/s:1846714 rem:483s step 3339 (19%) loss:3.5409 lr:0.97 dt:35ms tok/s:1860577 rem:483s step 3340 (19%) loss:3.5343 lr:0.97 dt:36ms tok/s:1830823 rem:483s step 3341 (19%) loss:3.5337 lr:0.97 dt:35ms tok/s:1848540 rem:483s step 3342 (20%) loss:3.5377 lr:0.98 dt:35ms tok/s:1851030 rem:483s step 3343 (20%) loss:3.5337 lr:0.98 dt:35ms tok/s:1853489 rem:483s step 3344 (20%) loss:3.5554 lr:0.98 dt:35ms tok/s:1852427 rem:483s step 3345 (20%) loss:3.5387 lr:0.98 dt:35ms tok/s:1846442 rem:483s step 3346 (20%) loss:3.5185 lr:0.98 dt:35ms tok/s:1852789 rem:483s step 3347 (20%) loss:3.4773 lr:0.98 dt:35ms tok/s:1854677 rem:483s step 3348 (20%) loss:3.4279 lr:0.98 dt:35ms tok/s:1849809 rem:483s step 3349 (20%) loss:3.4500 lr:0.98 dt:35ms tok/s:1856142 rem:483s step 3350 (20%) loss:3.4745 lr:0.98 dt:35ms tok/s:1851142 rem:483s step 3351 (20%) loss:3.5001 lr:0.98 dt:36ms tok/s:1842740 rem:483s step 3352 (20%) loss:3.5263 lr:0.98 dt:36ms tok/s:1843272 rem:483s step 3353 (20%) loss:3.5322 lr:0.98 dt:36ms tok/s:1839017 rem:483s step 3354 (20%) loss:3.5432 lr:0.98 dt:35ms tok/s:1854714 rem:483s step 3355 (20%) loss:3.5665 lr:0.98 dt:35ms tok/s:1848988 rem:482s step 3356 (20%) loss:3.5710 lr:0.98 dt:36ms tok/s:1840187 rem:482s step 3357 (20%) loss:3.5570 lr:0.98 dt:35ms tok/s:1857384 rem:482s step 3358 (20%) loss:3.5542 lr:0.98 dt:35ms tok/s:1856292 rem:482s step 3359 (20%) loss:3.5724 lr:0.98 dt:36ms tok/s:1803424 rem:482s step 3360 (20%) loss:3.5734 lr:0.98 dt:35ms tok/s:1846479 rem:482s step 3361 (20%) loss:3.5679 lr:0.98 dt:35ms tok/s:1852714 rem:482s step 3362 (20%) loss:3.5719 lr:0.98 dt:35ms tok/s:1846429 rem:482s step 3363 (20%) loss:3.5641 lr:0.98 dt:35ms tok/s:1855503 rem:482s step 3364 (20%) loss:3.5881 lr:0.98 dt:37ms tok/s:1777822 rem:482s step 3365 (20%) loss:3.5900 lr:0.98 dt:42ms tok/s:1542291 rem:482s step 3366 (20%) loss:3.5830 lr:0.98 dt:35ms tok/s:1868151 rem:482s step 3367 (20%) loss:3.5740 lr:0.98 dt:35ms tok/s:1899207 rem:482s step 3368 (20%) loss:3.6130 lr:0.98 dt:35ms tok/s:1898512 rem:482s step 3369 (20%) loss:3.6142 lr:0.98 dt:35ms tok/s:1876377 rem:482s step 3370 (20%) loss:3.6141 lr:0.98 dt:35ms tok/s:1878672 rem:482s step 3371 (20%) loss:3.6220 lr:0.98 dt:35ms tok/s:1877812 rem:482s step 3372 (20%) loss:3.6197 lr:0.98 dt:35ms tok/s:1873359 rem:482s step 3373 (20%) loss:3.6184 lr:0.98 dt:35ms tok/s:1878274 rem:482s step 3374 (20%) loss:3.6119 lr:0.98 dt:35ms tok/s:1880523 rem:482s step 3375 (20%) loss:3.6086 lr:0.98 dt:35ms tok/s:1875634 rem:482s step 3376 (20%) loss:3.6072 lr:0.99 dt:35ms tok/s:1877992 rem:482s step 3377 (20%) loss:3.6150 lr:0.99 dt:35ms tok/s:1871942 rem:482s step 3378 (20%) loss:3.5995 lr:0.99 dt:35ms tok/s:1865552 rem:482s step 3379 (20%) loss:3.6196 lr:0.99 dt:35ms tok/s:1890911 rem:482s step 3380 (20%) loss:3.6149 lr:0.99 dt:35ms tok/s:1877504 rem:482s step 3381 (20%) loss:3.6188 lr:0.99 dt:36ms tok/s:1801816 rem:482s step 3382 (20%) loss:3.6000 lr:0.99 dt:35ms tok/s:1891900 rem:482s step 3383 (20%) loss:3.5736 lr:0.99 dt:35ms tok/s:1895553 rem:481s step 3384 (20%) loss:3.5606 lr:0.99 dt:35ms tok/s:1894795 rem:481s step 3385 (20%) loss:3.5618 lr:0.99 dt:35ms tok/s:1897149 rem:481s step 3386 (20%) loss:3.5720 lr:0.99 dt:35ms tok/s:1875327 rem:481s step 3387 (20%) loss:3.5706 lr:0.99 dt:35ms tok/s:1880459 rem:481s step 3388 (20%) loss:3.5760 lr:0.99 dt:35ms tok/s:1877479 rem:481s step 3389 (20%) loss:3.5722 lr:0.99 dt:35ms tok/s:1882248 rem:481s step 3390 (20%) loss:3.5612 lr:0.99 dt:35ms tok/s:1875429 rem:481s step 3391 (20%) loss:3.5699 lr:0.99 dt:35ms tok/s:1883654 rem:481s step 3392 (20%) loss:3.5705 lr:0.99 dt:35ms tok/s:1875557 rem:481s step 3393 (20%) loss:3.5892 lr:0.99 dt:35ms tok/s:1882700 rem:481s step 3394 (20%) loss:3.5992 lr:0.99 dt:35ms tok/s:1858614 rem:481s step 3395 (20%) loss:3.6002 lr:0.99 dt:35ms tok/s:1865312 rem:481s step 3396 (20%) loss:3.5983 lr:0.99 dt:35ms tok/s:1859721 rem:481s step 3397 (20%) loss:3.6081 lr:0.99 dt:35ms tok/s:1866312 rem:481s step 3398 (20%) loss:3.6261 lr:0.99 dt:35ms tok/s:1869117 rem:481s step 3399 (20%) loss:3.6220 lr:0.99 dt:35ms tok/s:1866109 rem:481s step 3400 (20%) loss:3.6145 lr:0.99 dt:35ms tok/s:1860728 rem:481s + local: attn=[0.057, 0.359, 0.347] mlp=[0.150, 0.114, -0.116] + + transition: attn=[1.296, 0.463] mlp=[-0.003, 0.105] + + hierarchy: attn=[1.557, 5.939, 5.616] mlp=[0.517, -0.391, -0.258] + step 3401 (20%) loss:3.6155 lr:0.99 dt:35ms tok/s:1853864 rem:481s step 3402 (20%) loss:3.6144 lr:0.99 dt:35ms tok/s:1863326 rem:481s step 3403 (20%) loss:3.5834 lr:0.99 dt:35ms tok/s:1858765 rem:481s step 3404 (20%) loss:3.5796 lr:0.99 dt:35ms tok/s:1866604 rem:481s step 3405 (20%) loss:3.5883 lr:0.99 dt:35ms tok/s:1853176 rem:481s step 3406 (20%) loss:3.5721 lr:0.99 dt:35ms tok/s:1860879 rem:481s step 3407 (20%) loss:3.5680 lr:0.99 dt:35ms tok/s:1861320 rem:481s step 3408 (20%) loss:3.5762 lr:0.99 dt:35ms tok/s:1859281 rem:481s step 3409 (20%) loss:3.5860 lr:0.99 dt:35ms tok/s:1862607 rem:481s step 3410 (20%) loss:3.5842 lr:1.00 dt:36ms tok/s:1841679 rem:481s step 3411 (20%) loss:3.5851 lr:1.00 dt:35ms tok/s:1855941 rem:481s step 3412 (20%) loss:3.5919 lr:1.00 dt:35ms tok/s:1864160 rem:480s step 3413 (20%) loss:3.5879 lr:1.00 dt:35ms tok/s:1854101 rem:480s step 3414 (20%) loss:3.6075 lr:1.00 dt:35ms tok/s:1854602 rem:480s step 3415 (20%) loss:3.6492 lr:1.00 dt:35ms tok/s:1857697 rem:480s step 3416 (20%) loss:3.6365 lr:1.00 dt:35ms tok/s:1865020 rem:480s step 3417 (20%) loss:3.6228 lr:1.00 dt:36ms tok/s:1838882 rem:480s step 3418 (20%) loss:3.6215 lr:1.00 dt:43ms tok/s:1529343 rem:480s step 3419 (20%) loss:3.6152 lr:1.00 dt:37ms tok/s:1790362 rem:480s step 3420 (20%) loss:3.6058 lr:1.00 dt:35ms tok/s:1886693 rem:480s step 3421 (20%) loss:3.6120 lr:1.00 dt:35ms tok/s:1889689 rem:480s step 3422 (20%) loss:3.6009 lr:1.00 dt:35ms tok/s:1889247 rem:480s step 3423 (20%) loss:3.6023 lr:1.00 dt:35ms tok/s:1875250 rem:480s step 3424 (20%) loss:3.5948 lr:1.00 dt:35ms tok/s:1878698 rem:480s step 3425 (20%) loss:3.5885 lr:1.00 dt:35ms tok/s:1856317 rem:480s step 3426 (20%) loss:3.5881 lr:1.00 dt:35ms tok/s:1881179 rem:480s step 3427 (20%) loss:3.6027 lr:1.00 dt:37ms tok/s:1776110 rem:480s step 3428 (20%) loss:3.6055 lr:1.00 dt:35ms tok/s:1864034 rem:480s step 3429 (20%) loss:3.5933 lr:1.00 dt:34ms tok/s:1922532 rem:480s step 3430 (20%) loss:3.5801 lr:1.00 dt:35ms tok/s:1894207 rem:480s step 3431 (20%) loss:3.6039 lr:1.00 dt:35ms tok/s:1876825 rem:480s step 3432 (20%) loss:3.6000 lr:1.00 dt:35ms tok/s:1888962 rem:480s step 3433 (20%) loss:3.6008 lr:1.00 dt:35ms tok/s:1860061 rem:480s step 3434 (20%) loss:3.5862 lr:1.00 dt:35ms tok/s:1895226 rem:480s step 3435 (20%) loss:3.5814 lr:1.00 dt:35ms tok/s:1891457 rem:480s step 3436 (20%) loss:3.5802 lr:1.00 dt:35ms tok/s:1894939 rem:480s step 3437 (20%) loss:3.5926 lr:1.00 dt:35ms tok/s:1884881 rem:480s step 3438 (20%) loss:3.6205 lr:1.00 dt:35ms tok/s:1893633 rem:480s step 3439 (20%) loss:3.6605 lr:1.00 dt:35ms tok/s:1863630 rem:480s step 3440 (20%) loss:3.6647 lr:1.00 dt:35ms tok/s:1875314 rem:479s step 3441 (20%) loss:3.6578 lr:1.00 dt:35ms tok/s:1868443 rem:479s step 3442 (20%) loss:3.6466 lr:1.00 dt:35ms tok/s:1877530 rem:479s step 3443 (20%) loss:3.6532 lr:1.00 dt:35ms tok/s:1862985 rem:479s step 3444 (20%) loss:3.6377 lr:1.00 dt:35ms tok/s:1860930 rem:479s step 3445 (20%) loss:3.6286 lr:1.00 dt:35ms tok/s:1865312 rem:479s step 3446 (20%) loss:3.6222 lr:1.00 dt:35ms tok/s:1862405 rem:479s step 3447 (20%) loss:3.6149 lr:1.00 dt:35ms tok/s:1869829 rem:479s step 3448 (20%) loss:3.6272 lr:1.00 dt:35ms tok/s:1868316 rem:479s step 3449 (20%) loss:3.6516 lr:1.00 dt:35ms tok/s:1856079 rem:479s step 3450 (20%) loss:3.6289 lr:1.00 dt:35ms tok/s:1867935 rem:479s step 3451 (20%) loss:3.6350 lr:1.00 dt:35ms tok/s:1860388 rem:479s step 3452 (20%) loss:3.6289 lr:1.00 dt:35ms tok/s:1861724 rem:479s step 3453 (20%) loss:3.6278 lr:1.00 dt:35ms tok/s:1859721 rem:479s step 3454 (20%) loss:3.6159 lr:1.00 dt:35ms tok/s:1862544 rem:479s step 3455 (20%) loss:3.6101 lr:1.00 dt:35ms tok/s:1855315 rem:479s step 3456 (20%) loss:3.6195 lr:1.00 dt:35ms tok/s:1857158 rem:479s step 3457 (20%) loss:3.5998 lr:1.00 dt:35ms tok/s:1860338 rem:479s step 3458 (20%) loss:3.5956 lr:1.00 dt:35ms tok/s:1861396 rem:479s step 3459 (20%) loss:3.5943 lr:1.00 dt:35ms tok/s:1857384 rem:479s step 3460 (20%) loss:3.5844 lr:1.00 dt:35ms tok/s:1861875 rem:479s step 3461 (20%) loss:3.5906 lr:1.00 dt:36ms tok/s:1840187 rem:479s step 3462 (20%) loss:3.5730 lr:1.00 dt:36ms tok/s:1837174 rem:479s step 3463 (20%) loss:3.5643 lr:1.00 dt:36ms tok/s:1834880 rem:479s step 3464 (20%) loss:3.5775 lr:1.00 dt:38ms tok/s:1708430 rem:479s step 3465 (20%) loss:3.5810 lr:1.00 dt:36ms tok/s:1810993 rem:479s step 3466 (20%) loss:3.5725 lr:1.00 dt:38ms tok/s:1725406 rem:479s step 3467 (20%) loss:3.5695 lr:1.00 dt:35ms tok/s:1866870 rem:479s step 3468 (20%) loss:3.5713 lr:1.00 dt:35ms tok/s:1878518 rem:478s step 3469 (20%) loss:3.5762 lr:1.00 dt:36ms tok/s:1845623 rem:478s step 3470 (20%) loss:3.5756 lr:1.00 dt:35ms tok/s:1861875 rem:478s step 3471 (20%) loss:3.5740 lr:1.00 dt:35ms tok/s:1855015 rem:478s step 3472 (20%) loss:3.5877 lr:1.00 dt:35ms tok/s:1859243 rem:478s step 3473 (20%) loss:3.5902 lr:1.00 dt:35ms tok/s:1846107 rem:478s step 3474 (20%) loss:3.5684 lr:1.00 dt:37ms tok/s:1755758 rem:478s step 3475 (20%) loss:3.5822 lr:1.00 dt:35ms tok/s:1854326 rem:478s step 3476 (20%) loss:3.5587 lr:1.00 dt:35ms tok/s:1865261 rem:478s step 3477 (20%) loss:3.5436 lr:1.00 dt:36ms tok/s:1844360 rem:478s step 3478 (20%) loss:3.5421 lr:1.00 dt:35ms tok/s:1881102 rem:478s step 3479 (20%) loss:3.5309 lr:1.00 dt:35ms tok/s:1875608 rem:478s step 3480 (20%) loss:3.5454 lr:1.00 dt:35ms tok/s:1878492 rem:478s step 3481 (20%) loss:3.5420 lr:1.00 dt:35ms tok/s:1869994 rem:478s step 3482 (20%) loss:3.5315 lr:1.00 dt:35ms tok/s:1880201 rem:478s step 3483 (20%) loss:3.5330 lr:1.00 dt:35ms tok/s:1874738 rem:478s step 3484 (20%) loss:3.5313 lr:1.00 dt:35ms tok/s:1881501 rem:478s step 3485 (20%) loss:3.5168 lr:1.00 dt:35ms tok/s:1869485 rem:478s step 3486 (20%) loss:3.5159 lr:1.00 dt:35ms tok/s:1882055 rem:478s step 3487 (20%) loss:3.5046 lr:1.00 dt:35ms tok/s:1864350 rem:478s step 3488 (20%) loss:3.5171 lr:1.00 dt:35ms tok/s:1875288 rem:478s step 3489 (20%) loss:3.5256 lr:1.00 dt:35ms tok/s:1879803 rem:478s step 3490 (20%) loss:3.5355 lr:1.00 dt:35ms tok/s:1881913 rem:478s step 3491 (20%) loss:3.5172 lr:1.00 dt:35ms tok/s:1875391 rem:478s step 3492 (20%) loss:3.5142 lr:1.00 dt:35ms tok/s:1864186 rem:478s step 3493 (20%) loss:3.5231 lr:1.00 dt:35ms tok/s:1856368 rem:478s step 3494 (20%) loss:3.5280 lr:1.00 dt:35ms tok/s:1856794 rem:478s step 3495 (20%) loss:3.5369 lr:1.00 dt:35ms tok/s:1860501 rem:478s step 3496 (20%) loss:3.5389 lr:1.00 dt:35ms tok/s:1854176 rem:478s step 3497 (20%) loss:3.5479 lr:1.00 dt:35ms tok/s:1850083 rem:477s step 3498 (20%) loss:3.5610 lr:1.00 dt:35ms tok/s:1859444 rem:477s step 3499 (20%) loss:3.5623 lr:1.00 dt:35ms tok/s:1859293 rem:477s step 3500 (20%) loss:3.5692 lr:1.00 dt:35ms tok/s:1859281 rem:477s + local: attn=[0.061, 0.386, 0.365] mlp=[0.138, 0.100, -0.082] + + transition: attn=[1.269, 0.454] mlp=[-0.013, 0.113] + + hierarchy: attn=[1.536, 5.939, 5.616] mlp=[0.489, -0.047, 0.199] + step 3501 (20%) loss:3.5600 lr:1.00 dt:35ms tok/s:1855503 rem:477s step 3502 (20%) loss:3.5531 lr:1.00 dt:35ms tok/s:1849560 rem:477s step 3503 (20%) loss:3.5512 lr:1.00 dt:35ms tok/s:1849834 rem:477s step 3504 (20%) loss:3.5537 lr:1.00 dt:36ms tok/s:1830689 rem:477s step 3505 (20%) loss:3.5418 lr:1.00 dt:36ms tok/s:1829823 rem:477s step 3506 (20%) loss:3.5136 lr:1.00 dt:36ms tok/s:1832752 rem:477s step 3507 (20%) loss:3.5360 lr:1.00 dt:36ms tok/s:1826650 rem:477s step 3508 (20%) loss:3.5399 lr:1.00 dt:36ms tok/s:1834782 rem:477s step 3509 (20%) loss:3.5183 lr:1.00 dt:36ms tok/s:1821723 rem:477s step 3510 (20%) loss:3.5095 lr:1.00 dt:36ms tok/s:1835566 rem:477s step 3511 (20%) loss:3.5241 lr:1.00 dt:42ms tok/s:1568777 rem:477s step 3512 (21%) loss:3.5073 lr:1.00 dt:35ms tok/s:1851404 rem:477s step 3513 (21%) loss:3.4779 lr:1.00 dt:35ms tok/s:1848329 rem:477s step 3514 (21%) loss:3.4902 lr:1.00 dt:36ms tok/s:1845165 rem:477s step 3515 (21%) loss:3.4990 lr:1.00 dt:35ms tok/s:1874138 rem:477s step 3516 (21%) loss:3.4905 lr:1.00 dt:35ms tok/s:1872593 rem:477s step 3517 (21%) loss:3.4900 lr:1.00 dt:39ms tok/s:1693557 rem:477s step 3518 (21%) loss:3.4909 lr:1.00 dt:40ms tok/s:1636139 rem:477s step 3519 (21%) loss:3.4867 lr:1.00 dt:34ms tok/s:1927156 rem:477s step 3520 (21%) loss:3.4896 lr:1.00 dt:34ms tok/s:1901335 rem:477s step 3521 (21%) loss:3.4937 lr:1.00 dt:35ms tok/s:1883216 rem:477s step 3522 (21%) loss:3.5023 lr:1.00 dt:35ms tok/s:1884455 rem:477s step 3523 (21%) loss:3.5177 lr:1.00 dt:35ms tok/s:1891652 rem:477s step 3524 (21%) loss:3.5558 lr:1.00 dt:35ms tok/s:1894351 rem:477s step 3525 (21%) loss:3.5567 lr:1.00 dt:35ms tok/s:1892460 rem:476s step 3526 (21%) loss:3.5397 lr:1.00 dt:35ms tok/s:1886848 rem:476s step 3527 (21%) loss:3.5509 lr:1.00 dt:35ms tok/s:1892590 rem:476s step 3528 (21%) loss:3.5864 lr:1.00 dt:35ms tok/s:1890755 rem:476s step 3529 (21%) loss:3.5741 lr:1.00 dt:35ms tok/s:1897765 rem:476s step 3530 (21%) loss:3.5688 lr:1.00 dt:35ms tok/s:1893007 rem:476s step 3531 (21%) loss:3.5672 lr:1.00 dt:35ms tok/s:1886343 rem:476s step 3532 (21%) loss:3.5580 lr:1.00 dt:34ms tok/s:1905381 rem:476s step 3533 (21%) loss:3.5711 lr:1.00 dt:35ms tok/s:1890703 rem:476s step 3534 (21%) loss:3.6039 lr:1.00 dt:35ms tok/s:1897398 rem:476s step 3535 (21%) loss:3.5892 lr:1.00 dt:35ms tok/s:1888962 rem:476s step 3536 (21%) loss:3.5905 lr:1.00 dt:35ms tok/s:1891731 rem:476s step 3537 (21%) loss:3.5800 lr:1.00 dt:35ms tok/s:1894207 rem:476s step 3538 (21%) loss:3.5767 lr:1.00 dt:35ms tok/s:1882738 rem:476s step 3539 (21%) loss:3.5619 lr:1.00 dt:35ms tok/s:1861774 rem:476s step 3540 (21%) loss:3.5433 lr:1.00 dt:35ms tok/s:1863794 rem:476s step 3541 (21%) loss:3.5355 lr:1.00 dt:35ms tok/s:1859771 rem:476s step 3542 (21%) loss:3.5449 lr:1.00 dt:35ms tok/s:1862216 rem:476s step 3543 (21%) loss:3.5333 lr:1.00 dt:35ms tok/s:1889065 rem:476s step 3544 (21%) loss:3.5456 lr:1.00 dt:35ms tok/s:1883951 rem:476s step 3545 (21%) loss:3.5601 lr:1.00 dt:35ms tok/s:1885011 rem:476s step 3546 (21%) loss:3.5499 lr:1.00 dt:35ms tok/s:1886900 rem:476s step 3547 (21%) loss:3.5301 lr:1.00 dt:35ms tok/s:1890326 rem:476s step 3548 (21%) loss:3.5431 lr:1.00 dt:35ms tok/s:1872542 rem:476s step 3549 (21%) loss:3.5481 lr:1.00 dt:35ms tok/s:1864085 rem:476s step 3550 (21%) loss:3.5343 lr:1.00 dt:35ms tok/s:1864363 rem:476s step 3551 (21%) loss:3.5028 lr:1.00 dt:35ms tok/s:1863288 rem:476s step 3552 (21%) loss:3.4969 lr:1.00 dt:35ms tok/s:1868520 rem:476s step 3553 (21%) loss:3.5185 lr:1.00 dt:35ms tok/s:1867250 rem:475s step 3554 (21%) loss:3.5185 lr:1.00 dt:35ms tok/s:1870439 rem:475s step 3555 (21%) loss:3.5335 lr:1.00 dt:35ms tok/s:1872044 rem:475s step 3556 (21%) loss:3.5374 lr:1.00 dt:35ms tok/s:1865729 rem:475s step 3557 (21%) loss:3.5413 lr:1.00 dt:35ms tok/s:1848552 rem:475s step 3558 (21%) loss:3.5707 lr:1.00 dt:38ms tok/s:1702441 rem:475s step 3559 (21%) loss:3.5713 lr:1.00 dt:36ms tok/s:1844447 rem:475s step 3560 (21%) loss:3.5762 lr:1.00 dt:35ms tok/s:1852514 rem:475s step 3561 (21%) loss:3.5633 lr:1.00 dt:36ms tok/s:1821348 rem:475s step 3562 (21%) loss:3.5673 lr:1.00 dt:36ms tok/s:1822616 rem:475s step 3563 (21%) loss:3.5611 lr:1.00 dt:36ms tok/s:1804987 rem:475s step 3564 (21%) loss:3.5606 lr:1.00 dt:36ms tok/s:1811840 rem:475s step 3565 (21%) loss:3.5665 lr:1.00 dt:36ms tok/s:1818505 rem:475s step 3566 (21%) loss:3.5651 lr:1.00 dt:36ms tok/s:1803235 rem:475s step 3567 (21%) loss:3.5524 lr:1.00 dt:36ms tok/s:1803117 rem:475s step 3568 (21%) loss:3.5446 lr:1.00 dt:36ms tok/s:1804703 rem:475s step 3569 (21%) loss:3.5289 lr:1.00 dt:36ms tok/s:1799210 rem:475s step 3570 (21%) loss:3.5217 lr:1.00 dt:36ms tok/s:1802904 rem:475s step 3571 (21%) loss:3.5105 lr:1.00 dt:36ms tok/s:1804217 rem:475s step 3572 (21%) loss:3.5329 lr:1.00 dt:37ms tok/s:1777661 rem:475s step 3573 (21%) loss:3.5603 lr:1.00 dt:36ms tok/s:1808146 rem:475s step 3574 (21%) loss:3.5727 lr:1.00 dt:36ms tok/s:1797868 rem:475s step 3575 (21%) loss:3.5600 lr:1.00 dt:37ms tok/s:1757689 rem:475s step 3576 (21%) loss:3.5336 lr:1.00 dt:37ms tok/s:1759207 rem:475s step 3577 (21%) loss:3.4948 lr:1.00 dt:36ms tok/s:1799575 rem:475s step 3578 (21%) loss:3.4868 lr:1.00 dt:36ms tok/s:1825594 rem:475s step 3579 (21%) loss:3.4683 lr:1.00 dt:36ms tok/s:1830323 rem:475s step 3580 (21%) loss:3.4718 lr:1.00 dt:35ms tok/s:1853001 rem:475s step 3581 (21%) loss:3.4784 lr:1.00 dt:35ms tok/s:1852477 rem:474s step 3582 (21%) loss:3.4870 lr:1.00 dt:35ms tok/s:1849821 rem:474s step 3583 (21%) loss:3.4808 lr:1.00 dt:35ms tok/s:1848926 rem:474s step 3584 (21%) loss:3.4830 lr:1.00 dt:36ms tok/s:1832434 rem:474s step 3585 (21%) loss:3.4697 lr:1.00 dt:36ms tok/s:1823765 rem:474s step 3586 (21%) loss:3.5009 lr:1.00 dt:36ms tok/s:1824455 rem:474s step 3587 (21%) loss:3.4879 lr:1.00 dt:36ms tok/s:1818926 rem:474s step 3588 (21%) loss:3.4996 lr:1.00 dt:36ms tok/s:1812450 rem:474s step 3589 (21%) loss:3.5066 lr:1.00 dt:37ms tok/s:1775548 rem:474s step 3590 (21%) loss:3.5111 lr:1.00 dt:38ms tok/s:1735921 rem:474s step 3591 (21%) loss:3.4965 lr:1.00 dt:37ms tok/s:1747687 rem:474s step 3592 (21%) loss:3.4862 lr:1.00 dt:36ms tok/s:1843012 rem:474s step 3593 (21%) loss:3.4592 lr:1.00 dt:36ms tok/s:1841383 rem:474s step 3594 (21%) loss:3.4742 lr:1.00 dt:35ms tok/s:1847571 rem:474s step 3595 (21%) loss:3.4966 lr:1.00 dt:36ms tok/s:1832141 rem:474s step 3596 (21%) loss:3.5093 lr:1.00 dt:36ms tok/s:1826674 rem:474s step 3597 (21%) loss:3.5139 lr:1.00 dt:36ms tok/s:1831177 rem:474s step 3598 (21%) loss:3.5393 lr:1.00 dt:36ms tok/s:1831360 rem:474s step 3599 (21%) loss:3.5388 lr:1.00 dt:36ms tok/s:1830213 rem:474s step 3600 (21%) loss:3.5342 lr:1.00 dt:36ms tok/s:1835983 rem:474s + local: attn=[0.058, 0.390, 0.355] mlp=[0.152, 0.125, -0.111] + + transition: attn=[1.219, 0.454] mlp=[-0.020, 0.110] + + hierarchy: attn=[1.521, 5.939, 5.616] mlp=[0.480, -0.169, 0.054] + step 3601 (21%) loss:3.5503 lr:1.00 dt:36ms tok/s:1834562 rem:474s step 3602 (21%) loss:3.5561 lr:1.00 dt:36ms tok/s:1837616 rem:474s step 3603 (21%) loss:3.5538 lr:1.00 dt:36ms tok/s:1836621 rem:474s step 3604 (21%) loss:3.5802 lr:1.00 dt:36ms tok/s:1832788 rem:474s step 3605 (21%) loss:3.5802 lr:1.00 dt:36ms tok/s:1826019 rem:474s step 3606 (21%) loss:3.5741 lr:1.00 dt:36ms tok/s:1840261 rem:474s step 3607 (21%) loss:3.5301 lr:1.00 dt:36ms tok/s:1832031 rem:474s step 3608 (21%) loss:3.5375 lr:1.00 dt:35ms tok/s:1855340 rem:474s step 3609 (21%) loss:3.5339 lr:1.00 dt:35ms tok/s:1852065 rem:473s step 3610 (21%) loss:3.5745 lr:1.00 dt:36ms tok/s:1842321 rem:473s step 3611 (21%) loss:3.5651 lr:1.00 dt:35ms tok/s:1854176 rem:473s step 3612 (21%) loss:3.5561 lr:1.00 dt:35ms tok/s:1852976 rem:473s step 3613 (21%) loss:3.5324 lr:1.00 dt:35ms tok/s:1853301 rem:473s step 3614 (21%) loss:3.5184 lr:1.00 dt:35ms tok/s:1847757 rem:473s step 3615 (21%) loss:3.5085 lr:1.00 dt:36ms tok/s:1832361 rem:473s step 3616 (21%) loss:3.5113 lr:1.00 dt:36ms tok/s:1807337 rem:473s step 3617 (21%) loss:3.5193 lr:1.00 dt:36ms tok/s:1803684 rem:473s step 3618 (21%) loss:3.5167 lr:1.00 dt:36ms tok/s:1811530 rem:473s step 3619 (21%) loss:3.5170 lr:1.00 dt:36ms tok/s:1805983 rem:473s step 3620 (21%) loss:3.5107 lr:1.00 dt:36ms tok/s:1811446 rem:473s step 3621 (21%) loss:3.5260 lr:1.00 dt:36ms tok/s:1808253 rem:473s step 3622 (21%) loss:3.5138 lr:1.00 dt:36ms tok/s:1806862 rem:473s step 3623 (21%) loss:3.5275 lr:1.00 dt:36ms tok/s:1803720 rem:473s step 3624 (21%) loss:3.5296 lr:1.00 dt:37ms tok/s:1763677 rem:473s step 3625 (21%) loss:3.5312 lr:1.00 dt:36ms tok/s:1799116 rem:473s step 3626 (21%) loss:3.5338 lr:1.00 dt:36ms tok/s:1805509 rem:473s step 3627 (21%) loss:3.5213 lr:1.00 dt:36ms tok/s:1808455 rem:473s step 3628 (21%) loss:3.5137 lr:1.00 dt:36ms tok/s:1813239 rem:473s step 3629 (21%) loss:3.5223 lr:1.00 dt:36ms tok/s:1808169 rem:473s step 3630 (21%) loss:3.5325 lr:1.00 dt:36ms tok/s:1810993 rem:473s step 3631 (21%) loss:3.5354 lr:1.00 dt:36ms tok/s:1813167 rem:473s step 3632 (21%) loss:3.5331 lr:1.00 dt:36ms tok/s:1803436 rem:473s step 3633 (21%) loss:3.5539 lr:1.00 dt:36ms tok/s:1799952 rem:473s step 3634 (21%) loss:3.5578 lr:1.00 dt:36ms tok/s:1804490 rem:473s step 3635 (21%) loss:3.5608 lr:1.00 dt:36ms tok/s:1814819 rem:473s step 3636 (21%) loss:3.5422 lr:1.00 dt:37ms tok/s:1790456 rem:473s step 3637 (21%) loss:3.5300 lr:1.00 dt:36ms tok/s:1807658 rem:472s step 3638 (21%) loss:3.5214 lr:1.00 dt:36ms tok/s:1811804 rem:472s step 3639 (21%) loss:3.4971 lr:1.00 dt:36ms tok/s:1813741 rem:472s step 3640 (21%) loss:3.5023 lr:1.00 dt:36ms tok/s:1804229 rem:472s step 3641 (21%) loss:3.5068 lr:1.00 dt:36ms tok/s:1812390 rem:472s step 3642 (21%) loss:3.5044 lr:1.00 dt:36ms tok/s:1811506 rem:472s step 3643 (21%) loss:3.5009 lr:1.00 dt:36ms tok/s:1810933 rem:472s step 3644 (21%) loss:3.4813 lr:1.00 dt:36ms tok/s:1806779 rem:472s step 3645 (21%) loss:3.4736 lr:1.00 dt:37ms tok/s:1757689 rem:472s step 3646 (21%) loss:3.4861 lr:1.00 dt:36ms tok/s:1802407 rem:472s step 3647 (21%) loss:3.4702 lr:1.00 dt:37ms tok/s:1786070 rem:472s step 3648 (21%) loss:3.4793 lr:1.00 dt:36ms tok/s:1821107 rem:472s step 3649 (21%) loss:3.4778 lr:1.00 dt:36ms tok/s:1818854 rem:472s step 3650 (21%) loss:3.4730 lr:1.00 dt:36ms tok/s:1820263 rem:472s step 3651 (21%) loss:3.4683 lr:1.00 dt:36ms tok/s:1812031 rem:472s step 3652 (21%) loss:3.4752 lr:1.00 dt:36ms tok/s:1819119 rem:472s step 3653 (21%) loss:3.4642 lr:1.00 dt:36ms tok/s:1806743 rem:472s step 3654 (21%) loss:3.4487 lr:1.00 dt:38ms tok/s:1712304 rem:472s step 3655 (21%) loss:3.4432 lr:1.00 dt:41ms tok/s:1606581 rem:472s step 3656 (21%) loss:3.4495 lr:1.00 dt:37ms tok/s:1786708 rem:472s step 3657 (21%) loss:3.4537 lr:1.00 dt:36ms tok/s:1837886 rem:472s step 3658 (21%) loss:3.4562 lr:1.00 dt:35ms tok/s:1875173 rem:472s step 3659 (21%) loss:3.4654 lr:1.00 dt:35ms tok/s:1887263 rem:472s step 3660 (21%) loss:3.4515 lr:1.00 dt:35ms tok/s:1867923 rem:472s step 3661 (21%) loss:3.4564 lr:1.00 dt:35ms tok/s:1862026 rem:472s step 3662 (21%) loss:3.4580 lr:1.00 dt:35ms tok/s:1864704 rem:472s step 3663 (21%) loss:3.4565 lr:1.00 dt:35ms tok/s:1871433 rem:472s step 3664 (21%) loss:3.4659 lr:1.00 dt:37ms tok/s:1773623 rem:471s step 3665 (21%) loss:3.4801 lr:1.00 dt:37ms tok/s:1755298 rem:471s step 3666 (21%) loss:3.4999 lr:1.00 dt:35ms tok/s:1880549 rem:471s step 3667 (21%) loss:3.5153 lr:1.00 dt:35ms tok/s:1880780 rem:471s step 3668 (21%) loss:3.5009 lr:1.00 dt:35ms tok/s:1854489 rem:471s step 3669 (21%) loss:3.4965 lr:1.00 dt:35ms tok/s:1883525 rem:471s step 3670 (21%) loss:3.4837 lr:1.00 dt:35ms tok/s:1886110 rem:471s step 3671 (21%) loss:3.4959 lr:1.00 dt:35ms tok/s:1884830 rem:471s step 3672 (21%) loss:3.4910 lr:1.00 dt:35ms tok/s:1877274 rem:471s step 3673 (21%) loss:3.4929 lr:1.00 dt:35ms tok/s:1870172 rem:471s step 3674 (21%) loss:3.4744 lr:1.00 dt:35ms tok/s:1877774 rem:471s step 3675 (21%) loss:3.4669 lr:1.00 dt:35ms tok/s:1876133 rem:471s step 3676 (21%) loss:3.4770 lr:1.00 dt:35ms tok/s:1877697 rem:471s step 3677 (21%) loss:3.4828 lr:1.00 dt:35ms tok/s:1873640 rem:471s step 3678 (21%) loss:3.4782 lr:1.00 dt:35ms tok/s:1875084 rem:471s step 3679 (22%) loss:3.4793 lr:1.00 dt:35ms tok/s:1878839 rem:471s step 3680 (22%) loss:3.5175 lr:1.00 dt:35ms tok/s:1849224 rem:471s step 3681 (22%) loss:3.4825 lr:1.00 dt:37ms tok/s:1776064 rem:471s step 3682 (22%) loss:3.4812 lr:1.00 dt:36ms tok/s:1808288 rem:471s step 3683 (22%) loss:3.5091 lr:1.00 dt:35ms tok/s:1891678 rem:471s step 3684 (22%) loss:3.5229 lr:1.00 dt:35ms tok/s:1854064 rem:471s step 3685 (22%) loss:3.5367 lr:1.00 dt:35ms tok/s:1880484 rem:471s step 3686 (22%) loss:3.5452 lr:1.00 dt:35ms tok/s:1879610 rem:471s step 3687 (22%) loss:3.5517 lr:1.00 dt:35ms tok/s:1881617 rem:471s step 3688 (22%) loss:3.5576 lr:1.00 dt:35ms tok/s:1862771 rem:471s step 3689 (22%) loss:3.5465 lr:1.00 dt:35ms tok/s:1858175 rem:471s step 3690 (22%) loss:3.5577 lr:1.00 dt:35ms tok/s:1865995 rem:471s step 3691 (22%) loss:3.5459 lr:1.00 dt:35ms tok/s:1861396 rem:471s step 3692 (22%) loss:3.5349 lr:1.00 dt:35ms tok/s:1862518 rem:471s step 3693 (22%) loss:3.5277 lr:1.00 dt:35ms tok/s:1858815 rem:470s step 3694 (22%) loss:3.5464 lr:1.00 dt:36ms tok/s:1827318 rem:470s step 3695 (22%) loss:3.5415 lr:1.00 dt:36ms tok/s:1834366 rem:470s step 3696 (22%) loss:3.5646 lr:1.00 dt:35ms tok/s:1848851 rem:470s step 3697 (22%) loss:3.5846 lr:1.00 dt:36ms tok/s:1796212 rem:470s step 3698 (22%) loss:3.5808 lr:1.00 dt:35ms tok/s:1860879 rem:470s step 3699 (22%) loss:3.5798 lr:1.00 dt:35ms tok/s:1856643 rem:470s step 3700 (22%) loss:3.5649 lr:1.00 dt:37ms tok/s:1792090 rem:470s + local: attn=[0.068, 0.357, 0.337] mlp=[0.148, 0.100, -0.095] + + transition: attn=[1.173, 0.478] mlp=[0.005, 0.113] + + hierarchy: attn=[1.526, 5.939, 5.616] mlp=[0.470, -0.210, -0.108] + step 3701 (22%) loss:3.5582 lr:1.00 dt:36ms tok/s:1811637 rem:470s step 3702 (22%) loss:3.5381 lr:1.00 dt:36ms tok/s:1817723 rem:470s step 3703 (22%) loss:3.5221 lr:1.00 dt:36ms tok/s:1816918 rem:470s step 3704 (22%) loss:3.5213 lr:1.00 dt:36ms tok/s:1811601 rem:470s step 3705 (22%) loss:3.5139 lr:1.00 dt:36ms tok/s:1815142 rem:470s step 3706 (22%) loss:3.4980 lr:1.00 dt:36ms tok/s:1797927 rem:470s step 3707 (22%) loss:3.5068 lr:1.00 dt:36ms tok/s:1816330 rem:470s step 3708 (22%) loss:3.4957 lr:1.00 dt:36ms tok/s:1816270 rem:470s step 3709 (22%) loss:3.5148 lr:1.00 dt:36ms tok/s:1821035 rem:470s step 3710 (22%) loss:3.5220 lr:1.00 dt:36ms tok/s:1823281 rem:470s step 3711 (22%) loss:3.5048 lr:1.00 dt:36ms tok/s:1820335 rem:470s step 3712 (22%) loss:3.5010 lr:1.00 dt:37ms tok/s:1769297 rem:470s step 3713 (22%) loss:3.4900 lr:1.00 dt:37ms tok/s:1771280 rem:470s step 3714 (22%) loss:3.4973 lr:1.00 dt:37ms tok/s:1770333 rem:470s step 3715 (22%) loss:3.4584 lr:1.00 dt:37ms tok/s:1769843 rem:470s step 3716 (22%) loss:3.4115 lr:1.00 dt:37ms tok/s:1772240 rem:470s step 3717 (22%) loss:3.4236 lr:1.00 dt:37ms tok/s:1760762 rem:470s step 3718 (22%) loss:3.4433 lr:1.00 dt:37ms tok/s:1773349 rem:470s step 3719 (22%) loss:3.4564 lr:1.00 dt:37ms tok/s:1771783 rem:470s step 3720 (22%) loss:3.4724 lr:1.00 dt:37ms tok/s:1764051 rem:469s step 3721 (22%) loss:3.4605 lr:1.00 dt:37ms tok/s:1774665 rem:469s step 3722 (22%) loss:3.4724 lr:1.00 dt:37ms tok/s:1769900 rem:469s step 3723 (22%) loss:3.4781 lr:1.00 dt:37ms tok/s:1773429 rem:469s step 3724 (22%) loss:3.4792 lr:1.00 dt:37ms tok/s:1774356 rem:469s step 3725 (22%) loss:3.4988 lr:1.00 dt:37ms tok/s:1767022 rem:469s step 3726 (22%) loss:3.5070 lr:1.00 dt:37ms tok/s:1774826 rem:469s step 3727 (22%) loss:3.5493 lr:1.00 dt:37ms tok/s:1772297 rem:469s step 3728 (22%) loss:3.5568 lr:1.00 dt:37ms tok/s:1770197 rem:469s step 3729 (22%) loss:3.5639 lr:1.00 dt:39ms tok/s:1660894 rem:469s step 3730 (22%) loss:3.5661 lr:1.00 dt:37ms tok/s:1775720 rem:469s step 3731 (22%) loss:3.5720 lr:1.00 dt:37ms tok/s:1788277 rem:469s step 3732 (22%) loss:3.5656 lr:1.00 dt:37ms tok/s:1784713 rem:469s step 3733 (22%) loss:3.5586 lr:1.00 dt:36ms tok/s:1813538 rem:469s step 3734 (22%) loss:3.5432 lr:1.00 dt:36ms tok/s:1831884 rem:469s step 3735 (22%) loss:3.5588 lr:1.00 dt:36ms tok/s:1834991 rem:469s step 3736 (22%) loss:3.5467 lr:1.00 dt:36ms tok/s:1833864 rem:469s step 3737 (22%) loss:3.5329 lr:1.00 dt:36ms tok/s:1834341 rem:469s step 3738 (22%) loss:3.5166 lr:1.00 dt:36ms tok/s:1833253 rem:469s step 3739 (22%) loss:3.4611 lr:1.00 dt:35ms tok/s:1854539 rem:469s step 3740 (22%) loss:3.4746 lr:1.00 dt:35ms tok/s:1856781 rem:469s step 3741 (22%) loss:3.4874 lr:1.00 dt:35ms tok/s:1854476 rem:469s step 3742 (22%) loss:3.4875 lr:1.00 dt:36ms tok/s:1825049 rem:469s step 3743 (22%) loss:3.4732 lr:1.00 dt:36ms tok/s:1845289 rem:469s step 3744 (22%) loss:3.4836 lr:1.00 dt:35ms tok/s:1858024 rem:469s step 3745 (22%) loss:3.4944 lr:1.00 dt:35ms tok/s:1852339 rem:469s step 3746 (22%) loss:3.4921 lr:1.00 dt:35ms tok/s:1847496 rem:469s step 3747 (22%) loss:3.4929 lr:1.00 dt:35ms tok/s:1853914 rem:469s step 3748 (22%) loss:3.5143 lr:1.00 dt:35ms tok/s:1877620 rem:468s step 3749 (22%) loss:3.5001 lr:1.00 dt:35ms tok/s:1872886 rem:468s step 3750 (22%) loss:3.4898 lr:1.00 dt:35ms tok/s:1882403 rem:468s step 3751 (22%) loss:3.4985 lr:1.00 dt:35ms tok/s:1875097 rem:468s step 3752 (22%) loss:3.4932 lr:1.00 dt:35ms tok/s:1876646 rem:468s step 3753 (22%) loss:3.4931 lr:1.00 dt:35ms tok/s:1894939 rem:468s step 3754 (22%) loss:3.4840 lr:1.00 dt:35ms tok/s:1888689 rem:468s step 3755 (22%) loss:3.4040 lr:1.00 dt:35ms tok/s:1883370 rem:468s step 3756 (22%) loss:3.3320 lr:1.00 dt:35ms tok/s:1894834 rem:468s step 3757 (22%) loss:3.3773 lr:1.00 dt:35ms tok/s:1895344 rem:468s step 3758 (22%) loss:3.4216 lr:1.00 dt:35ms tok/s:1898171 rem:468s step 3759 (22%) loss:3.4579 lr:1.00 dt:35ms tok/s:1898905 rem:468s step 3760 (22%) loss:3.4444 lr:1.00 dt:34ms tok/s:1906465 rem:468s step 3761 (22%) loss:3.4378 lr:1.00 dt:35ms tok/s:1897738 rem:468s step 3762 (22%) loss:3.4577 lr:1.00 dt:35ms tok/s:1894717 rem:468s step 3763 (22%) loss:3.4676 lr:1.00 dt:35ms tok/s:1883913 rem:468s step 3764 (22%) loss:3.4700 lr:1.00 dt:35ms tok/s:1893346 rem:468s step 3765 (22%) loss:3.4747 lr:1.00 dt:35ms tok/s:1892590 rem:468s step 3766 (22%) loss:3.4839 lr:1.00 dt:35ms tok/s:1871318 rem:468s step 3767 (22%) loss:3.4792 lr:1.00 dt:35ms tok/s:1875928 rem:468s step 3768 (22%) loss:3.4808 lr:1.00 dt:35ms tok/s:1869066 rem:468s step 3769 (22%) loss:3.4711 lr:1.00 dt:35ms tok/s:1865552 rem:468s step 3770 (22%) loss:3.4793 lr:1.00 dt:35ms tok/s:1870860 rem:468s step 3771 (22%) loss:3.4782 lr:1.00 dt:35ms tok/s:1893438 rem:468s step 3772 (22%) loss:3.4946 lr:1.00 dt:35ms tok/s:1893712 rem:468s step 3773 (22%) loss:3.4971 lr:1.00 dt:35ms tok/s:1873984 rem:468s step 3774 (22%) loss:3.4788 lr:1.00 dt:35ms tok/s:1873716 rem:468s step 3775 (22%) loss:3.4810 lr:1.00 dt:38ms tok/s:1715371 rem:468s step 3776 (22%) loss:3.4967 lr:1.00 dt:39ms tok/s:1688408 rem:467s step 3777 (22%) loss:3.4936 lr:1.00 dt:35ms tok/s:1895448 rem:467s step 3778 (22%) loss:3.4961 lr:1.00 dt:35ms tok/s:1849523 rem:467s step 3779 (22%) loss:3.5242 lr:1.00 dt:35ms tok/s:1883087 rem:467s step 3780 (22%) loss:3.5314 lr:1.00 dt:35ms tok/s:1865489 rem:467s step 3781 (22%) loss:3.5074 lr:1.00 dt:35ms tok/s:1882983 rem:467s step 3782 (22%) loss:3.4989 lr:1.00 dt:35ms tok/s:1890534 rem:467s step 3783 (22%) loss:3.5154 lr:1.00 dt:35ms tok/s:1854339 rem:467s step 3784 (22%) loss:3.5186 lr:1.00 dt:36ms tok/s:1841050 rem:467s step 3785 (22%) loss:3.5064 lr:1.00 dt:35ms tok/s:1860350 rem:467s step 3786 (22%) loss:3.5133 lr:1.00 dt:35ms tok/s:1866489 rem:467s step 3787 (22%) loss:3.5430 lr:1.00 dt:36ms tok/s:1819468 rem:467s step 3788 (22%) loss:3.5482 lr:1.00 dt:35ms tok/s:1860892 rem:467s step 3789 (22%) loss:3.5432 lr:1.00 dt:35ms tok/s:1871968 rem:467s step 3790 (22%) loss:3.5423 lr:1.00 dt:35ms tok/s:1879726 rem:467s step 3791 (22%) loss:3.5326 lr:1.00 dt:35ms tok/s:1886978 rem:467s step 3792 (22%) loss:3.5339 lr:1.00 dt:35ms tok/s:1881939 rem:467s step 3793 (22%) loss:3.5382 lr:1.00 dt:35ms tok/s:1860501 rem:467s step 3794 (22%) loss:3.5549 lr:1.00 dt:35ms tok/s:1885049 rem:467s step 3795 (22%) loss:3.5853 lr:1.00 dt:35ms tok/s:1851778 rem:467s step 3796 (22%) loss:3.6059 lr:1.00 dt:36ms tok/s:1818601 rem:467s step 3797 (22%) loss:3.6265 lr:1.00 dt:35ms tok/s:1876056 rem:467s step 3798 (22%) loss:3.6194 lr:1.00 dt:35ms tok/s:1888079 rem:467s step 3799 (22%) loss:3.6066 lr:1.00 dt:35ms tok/s:1898066 rem:467s step 3800 (22%) loss:3.5884 lr:1.00 dt:35ms tok/s:1892291 rem:467s + local: attn=[0.056, 0.333, 0.305] mlp=[0.140, 0.104, -0.117] + + transition: attn=[1.194, 0.427] mlp=[-0.004, 0.107] + + hierarchy: attn=[1.470, 5.939, 5.616] mlp=[0.491, -0.035, 0.068] + step 3801 (22%) loss:3.5235 lr:1.00 dt:35ms tok/s:1847298 rem:467s step 3802 (22%) loss:3.4977 lr:1.00 dt:35ms tok/s:1895030 rem:467s step 3803 (22%) loss:3.4598 lr:1.00 dt:35ms tok/s:1881141 rem:467s step 3804 (22%) loss:3.4570 lr:1.00 dt:34ms tok/s:1899955 rem:467s step 3805 (22%) loss:3.4716 lr:1.00 dt:35ms tok/s:1899456 rem:466s step 3806 (22%) loss:3.4945 lr:1.00 dt:35ms tok/s:1886602 rem:466s step 3807 (22%) loss:3.4972 lr:1.00 dt:34ms tok/s:1902124 rem:466s step 3808 (22%) loss:3.5306 lr:1.00 dt:35ms tok/s:1873282 rem:466s step 3809 (22%) loss:3.5166 lr:1.00 dt:35ms tok/s:1886149 rem:466s step 3810 (22%) loss:3.5093 lr:1.00 dt:35ms tok/s:1891835 rem:466s step 3811 (22%) loss:3.5166 lr:1.00 dt:35ms tok/s:1872555 rem:466s step 3812 (22%) loss:3.5261 lr:1.00 dt:35ms tok/s:1896835 rem:466s step 3813 (22%) loss:3.5307 lr:1.00 dt:34ms tok/s:1910267 rem:466s step 3814 (22%) loss:3.5470 lr:1.00 dt:34ms tok/s:1901243 rem:466s step 3815 (22%) loss:3.5587 lr:1.00 dt:35ms tok/s:1890300 rem:466s step 3816 (22%) loss:3.5480 lr:1.00 dt:35ms tok/s:1892838 rem:466s step 3817 (22%) loss:3.5472 lr:1.00 dt:34ms tok/s:1912048 rem:466s step 3818 (22%) loss:3.5222 lr:1.00 dt:34ms tok/s:1913925 rem:466s step 3819 (22%) loss:3.4959 lr:1.00 dt:35ms tok/s:1889702 rem:466s step 3820 (22%) loss:3.5045 lr:1.00 dt:34ms tok/s:1918052 rem:466s step 3821 (22%) loss:3.5363 lr:1.00 dt:35ms tok/s:1897149 rem:466s step 3822 (22%) loss:3.5372 lr:1.00 dt:35ms tok/s:1879636 rem:466s step 3823 (22%) loss:3.5256 lr:1.00 dt:35ms tok/s:1896063 rem:466s step 3824 (22%) loss:3.5435 lr:1.00 dt:35ms tok/s:1883990 rem:466s step 3825 (22%) loss:3.5579 lr:1.00 dt:34ms tok/s:1903336 rem:466s step 3826 (22%) loss:3.5768 lr:1.00 dt:35ms tok/s:1898459 rem:466s step 3827 (22%) loss:3.5710 lr:1.00 dt:35ms tok/s:1859683 rem:466s step 3828 (22%) loss:3.5955 lr:1.00 dt:35ms tok/s:1883228 rem:466s step 3829 (22%) loss:3.6125 lr:1.00 dt:35ms tok/s:1851603 rem:466s step 3830 (22%) loss:3.6049 lr:1.00 dt:35ms tok/s:1867022 rem:466s step 3831 (22%) loss:3.6080 lr:1.00 dt:35ms tok/s:1879546 rem:466s step 3832 (22%) loss:3.6202 lr:1.00 dt:35ms tok/s:1872733 rem:466s step 3833 (22%) loss:3.6196 lr:1.00 dt:35ms tok/s:1852789 rem:466s step 3834 (22%) loss:3.6181 lr:1.00 dt:35ms tok/s:1858526 rem:465s step 3835 (22%) loss:3.6251 lr:1.00 dt:35ms tok/s:1853926 rem:465s step 3836 (22%) loss:3.6101 lr:1.00 dt:35ms tok/s:1847621 rem:465s step 3837 (22%) loss:3.6007 lr:1.00 dt:35ms tok/s:1848764 rem:465s step 3838 (22%) loss:3.5901 lr:1.00 dt:35ms tok/s:1853264 rem:465s step 3839 (22%) loss:3.5907 lr:1.00 dt:35ms tok/s:1853751 rem:465s step 3840 (22%) loss:3.5836 lr:1.00 dt:35ms tok/s:1854652 rem:465s step 3841 (22%) loss:3.5983 lr:1.00 dt:36ms tok/s:1843816 rem:465s step 3842 (22%) loss:3.5953 lr:1.00 dt:36ms tok/s:1845648 rem:465s step 3843 (22%) loss:3.5957 lr:1.00 dt:36ms tok/s:1840914 rem:465s step 3844 (22%) loss:3.6067 lr:1.00 dt:35ms tok/s:1849448 rem:465s step 3845 (22%) loss:3.6063 lr:1.00 dt:36ms tok/s:1821288 rem:465s step 3846 (22%) loss:3.5952 lr:1.00 dt:36ms tok/s:1819347 rem:465s step 3847 (22%) loss:3.5934 lr:1.00 dt:36ms tok/s:1822931 rem:465s step 3848 (22%) loss:3.5962 lr:1.00 dt:36ms tok/s:1808348 rem:465s step 3849 (23%) loss:3.5934 lr:1.00 dt:38ms tok/s:1735307 rem:465s step 3850 (23%) loss:3.5798 lr:1.00 dt:37ms tok/s:1783960 rem:465s step 3851 (23%) loss:3.5636 lr:1.00 dt:36ms tok/s:1815023 rem:465s step 3852 (23%) loss:3.5530 lr:1.00 dt:36ms tok/s:1811363 rem:465s step 3853 (23%) loss:3.5421 lr:1.00 dt:36ms tok/s:1825267 rem:465s step 3854 (23%) loss:3.5382 lr:1.00 dt:36ms tok/s:1825946 rem:465s step 3855 (23%) loss:3.5381 lr:1.00 dt:36ms tok/s:1823862 rem:465s step 3856 (23%) loss:3.5318 lr:1.00 dt:36ms tok/s:1819203 rem:465s step 3857 (23%) loss:3.5382 lr:1.00 dt:36ms tok/s:1810063 rem:465s step 3858 (23%) loss:3.5309 lr:1.00 dt:36ms tok/s:1815958 rem:465s step 3859 (23%) loss:3.5107 lr:1.00 dt:36ms tok/s:1808395 rem:465s step 3860 (23%) loss:3.5414 lr:1.00 dt:36ms tok/s:1810074 rem:465s step 3861 (23%) loss:3.5398 lr:1.00 dt:36ms tok/s:1810671 rem:464s step 3862 (23%) loss:3.5270 lr:1.00 dt:38ms tok/s:1712507 rem:464s step 3863 (23%) loss:3.5430 lr:1.00 dt:44ms tok/s:1501573 rem:464s step 3864 (23%) loss:3.5037 lr:1.00 dt:36ms tok/s:1841740 rem:464s step 3865 (23%) loss:3.4679 lr:1.00 dt:36ms tok/s:1823995 rem:464s step 3866 (23%) loss:3.4615 lr:1.00 dt:36ms tok/s:1831116 rem:464s step 3867 (23%) loss:3.4858 lr:1.00 dt:36ms tok/s:1840828 rem:464s step 3868 (23%) loss:3.4842 lr:1.00 dt:36ms tok/s:1831726 rem:464s step 3869 (23%) loss:3.4867 lr:1.00 dt:36ms tok/s:1829702 rem:464s step 3870 (23%) loss:3.4637 lr:1.00 dt:36ms tok/s:1804253 rem:464s step 3871 (23%) loss:3.4481 lr:1.00 dt:36ms tok/s:1806553 rem:464s step 3872 (23%) loss:3.4110 lr:1.00 dt:36ms tok/s:1814459 rem:464s step 3873 (23%) loss:3.3746 lr:1.00 dt:36ms tok/s:1814651 rem:464s step 3874 (23%) loss:3.3431 lr:1.00 dt:36ms tok/s:1812318 rem:464s step 3875 (23%) loss:3.3038 lr:1.00 dt:36ms tok/s:1807123 rem:464s step 3876 (23%) loss:3.2819 lr:1.00 dt:36ms tok/s:1799693 rem:464s step 3877 (23%) loss:3.2676 lr:1.00 dt:36ms tok/s:1822762 rem:464s step 3878 (23%) loss:3.3353 lr:1.00 dt:36ms tok/s:1813980 rem:464s step 3879 (23%) loss:3.3813 lr:1.00 dt:40ms tok/s:1658329 rem:464s step 3880 (23%) loss:3.3928 lr:1.00 dt:36ms tok/s:1837640 rem:464s step 3881 (23%) loss:3.3832 lr:1.00 dt:36ms tok/s:1839904 rem:464s step 3882 (23%) loss:3.4010 lr:1.00 dt:37ms tok/s:1760965 rem:464s step 3883 (23%) loss:3.4239 lr:1.00 dt:35ms tok/s:1875161 rem:464s step 3884 (23%) loss:3.4476 lr:1.00 dt:35ms tok/s:1894756 rem:464s step 3885 (23%) loss:3.4597 lr:1.00 dt:35ms tok/s:1896651 rem:464s step 3886 (23%) loss:3.5859 lr:1.00 dt:35ms tok/s:1889416 rem:464s step 3887 (23%) loss:3.5898 lr:1.00 dt:35ms tok/s:1872287 rem:464s step 3888 (23%) loss:3.5939 lr:1.00 dt:35ms tok/s:1873537 rem:464s step 3889 (23%) loss:3.5591 lr:1.00 dt:35ms tok/s:1870249 rem:463s step 3890 (23%) loss:3.5670 lr:1.00 dt:35ms tok/s:1873754 rem:463s step 3891 (23%) loss:3.5634 lr:1.00 dt:35ms tok/s:1871764 rem:463s step 3892 (23%) loss:3.5873 lr:1.00 dt:35ms tok/s:1862152 rem:463s step 3893 (23%) loss:3.5958 lr:1.00 dt:36ms tok/s:1845574 rem:463s step 3894 (23%) loss:3.5897 lr:1.00 dt:35ms tok/s:1850432 rem:463s step 3895 (23%) loss:3.5887 lr:1.00 dt:35ms tok/s:1852040 rem:463s step 3896 (23%) loss:3.6199 lr:1.00 dt:36ms tok/s:1817819 rem:463s step 3897 (23%) loss:3.6260 lr:1.00 dt:36ms tok/s:1823402 rem:463s step 3898 (23%) loss:3.6168 lr:1.00 dt:35ms tok/s:1850257 rem:463s step 3899 (23%) loss:3.6513 lr:1.00 dt:36ms tok/s:1843519 rem:463s step 3900 (23%) loss:3.6970 lr:1.00 dt:35ms tok/s:1852177 rem:463s + local: attn=[0.063, 0.379, 0.346] mlp=[0.154, 0.113, -0.115] + + transition: attn=[1.132, 0.440] mlp=[0.012, 0.119] + + hierarchy: attn=[1.521, 5.939, 5.616] mlp=[0.465, -0.320, -0.008] + step 3901 (23%) loss:3.6849 lr:1.00 dt:36ms tok/s:1841469 rem:463s step 3902 (23%) loss:3.6836 lr:1.00 dt:35ms tok/s:1849274 rem:463s step 3903 (23%) loss:3.6823 lr:1.00 dt:35ms tok/s:1847720 rem:463s step 3904 (23%) loss:3.6789 lr:1.00 dt:36ms tok/s:1818878 rem:463s step 3905 (23%) loss:3.6646 lr:1.00 dt:36ms tok/s:1823535 rem:463s step 3906 (23%) loss:3.6554 lr:1.00 dt:36ms tok/s:1813741 rem:463s step 3907 (23%) loss:3.6263 lr:1.00 dt:36ms tok/s:1819901 rem:463s step 3908 (23%) loss:3.6240 lr:1.00 dt:36ms tok/s:1817242 rem:463s step 3909 (23%) loss:3.6606 lr:1.00 dt:36ms tok/s:1820613 rem:463s step 3910 (23%) loss:3.6586 lr:1.00 dt:36ms tok/s:1815958 rem:463s step 3911 (23%) loss:3.6620 lr:1.00 dt:36ms tok/s:1813167 rem:463s step 3912 (23%) loss:3.6424 lr:1.00 dt:36ms tok/s:1816834 rem:463s step 3913 (23%) loss:3.6071 lr:1.00 dt:36ms tok/s:1819179 rem:463s step 3914 (23%) loss:3.5857 lr:1.00 dt:42ms tok/s:1574411 rem:463s step 3915 (23%) loss:3.5733 lr:1.00 dt:36ms tok/s:1831116 rem:463s step 3916 (23%) loss:3.5514 lr:1.00 dt:36ms tok/s:1839153 rem:463s step 3917 (23%) loss:3.5667 lr:1.00 dt:36ms tok/s:1837640 rem:462s step 3918 (23%) loss:3.5540 lr:1.00 dt:36ms tok/s:1833974 rem:462s step 3919 (23%) loss:3.5312 lr:1.00 dt:36ms tok/s:1842481 rem:462s step 3920 (23%) loss:3.5227 lr:1.00 dt:35ms tok/s:1861371 rem:462s step 3921 (23%) loss:3.5266 lr:1.00 dt:35ms tok/s:1883022 rem:462s step 3922 (23%) loss:3.5323 lr:1.00 dt:35ms tok/s:1885864 rem:462s step 3923 (23%) loss:3.5324 lr:1.00 dt:35ms tok/s:1875122 rem:462s step 3924 (23%) loss:3.5171 lr:1.00 dt:35ms tok/s:1871573 rem:462s step 3925 (23%) loss:3.5227 lr:1.00 dt:35ms tok/s:1869612 rem:462s step 3926 (23%) loss:3.5018 lr:1.00 dt:35ms tok/s:1862809 rem:462s step 3927 (23%) loss:3.4764 lr:1.00 dt:35ms tok/s:1866794 rem:462s step 3928 (23%) loss:3.4878 lr:1.00 dt:36ms tok/s:1845561 rem:462s step 3929 (23%) loss:3.5038 lr:1.00 dt:36ms tok/s:1821035 rem:462s step 3930 (23%) loss:3.5468 lr:1.00 dt:36ms tok/s:1825206 rem:462s step 3931 (23%) loss:3.5504 lr:1.00 dt:36ms tok/s:1824019 rem:462s step 3932 (23%) loss:3.5427 lr:1.00 dt:36ms tok/s:1825643 rem:462s step 3933 (23%) loss:3.5563 lr:1.00 dt:36ms tok/s:1820576 rem:462s step 3934 (23%) loss:3.5635 lr:1.00 dt:36ms tok/s:1819227 rem:462s step 3935 (23%) loss:3.5719 lr:1.00 dt:36ms tok/s:1819299 rem:462s step 3936 (23%) loss:3.5752 lr:1.00 dt:36ms tok/s:1824407 rem:462s step 3937 (23%) loss:3.5699 lr:1.00 dt:36ms tok/s:1821529 rem:462s step 3938 (23%) loss:3.5678 lr:1.00 dt:36ms tok/s:1819769 rem:462s step 3939 (23%) loss:3.5608 lr:1.00 dt:36ms tok/s:1820227 rem:462s step 3940 (23%) loss:3.5730 lr:1.00 dt:36ms tok/s:1823548 rem:462s step 3941 (23%) loss:3.5729 lr:1.00 dt:36ms tok/s:1815586 rem:462s step 3942 (23%) loss:3.5751 lr:1.00 dt:36ms tok/s:1824371 rem:462s step 3943 (23%) loss:3.5741 lr:1.00 dt:36ms tok/s:1804964 rem:462s step 3944 (23%) loss:3.5508 lr:1.00 dt:36ms tok/s:1824346 rem:462s step 3945 (23%) loss:3.5520 lr:1.00 dt:36ms tok/s:1812139 rem:461s step 3946 (23%) loss:3.5457 lr:1.00 dt:36ms tok/s:1820010 rem:461s step 3947 (23%) loss:3.5311 lr:1.00 dt:36ms tok/s:1828132 rem:461s step 3948 (23%) loss:3.5231 lr:1.00 dt:36ms tok/s:1823656 rem:461s step 3949 (23%) loss:3.5217 lr:1.00 dt:36ms tok/s:1825897 rem:461s step 3950 (23%) loss:3.5256 lr:1.00 dt:36ms tok/s:1822073 rem:461s step 3951 (23%) loss:3.5284 lr:1.00 dt:36ms tok/s:1819251 rem:461s step 3952 (23%) loss:3.5437 lr:1.00 dt:36ms tok/s:1824661 rem:461s step 3953 (23%) loss:3.5283 lr:1.00 dt:36ms tok/s:1817903 rem:461s step 3954 (23%) loss:3.5219 lr:1.00 dt:37ms tok/s:1778811 rem:461s step 3955 (23%) loss:3.5247 lr:1.00 dt:36ms tok/s:1819901 rem:461s step 3956 (23%) loss:3.5251 lr:1.00 dt:36ms tok/s:1809395 rem:461s step 3957 (23%) loss:3.5157 lr:1.00 dt:36ms tok/s:1821976 rem:461s step 3958 (23%) loss:3.5020 lr:1.00 dt:36ms tok/s:1818144 rem:461s step 3959 (23%) loss:3.5156 lr:1.00 dt:36ms tok/s:1811840 rem:461s step 3960 (23%) loss:3.5272 lr:1.00 dt:36ms tok/s:1821952 rem:461s step 3961 (23%) loss:3.5331 lr:1.00 dt:36ms tok/s:1824552 rem:461s step 3962 (23%) loss:3.5455 lr:1.00 dt:36ms tok/s:1819769 rem:461s step 3963 (23%) loss:3.5345 lr:1.00 dt:36ms tok/s:1818493 rem:461s step 3964 (23%) loss:3.5612 lr:1.00 dt:36ms tok/s:1820781 rem:461s step 3965 (23%) loss:3.5437 lr:1.00 dt:36ms tok/s:1820649 rem:461s step 3966 (23%) loss:3.5316 lr:1.00 dt:36ms tok/s:1816930 rem:461s step 3967 (23%) loss:3.5397 lr:1.00 dt:36ms tok/s:1822025 rem:461s step 3968 (23%) loss:3.5393 lr:1.00 dt:36ms tok/s:1818673 rem:461s step 3969 (23%) loss:3.5280 lr:1.00 dt:36ms tok/s:1815526 rem:461s step 3970 (23%) loss:3.5244 lr:1.00 dt:36ms tok/s:1814639 rem:461s step 3971 (23%) loss:3.5105 lr:1.00 dt:36ms tok/s:1824855 rem:461s step 3972 (23%) loss:3.4928 lr:1.00 dt:36ms tok/s:1818938 rem:461s step 3973 (23%) loss:3.4866 lr:1.00 dt:36ms tok/s:1821855 rem:460s step 3974 (23%) loss:3.4794 lr:1.00 dt:36ms tok/s:1812629 rem:460s step 3975 (23%) loss:3.4871 lr:1.00 dt:36ms tok/s:1812127 rem:460s step 3976 (23%) loss:3.4987 lr:1.00 dt:36ms tok/s:1818300 rem:460s step 3977 (23%) loss:3.4784 lr:1.00 dt:36ms tok/s:1816150 rem:460s step 3978 (23%) loss:3.4788 lr:1.00 dt:36ms tok/s:1815946 rem:460s step 3979 (23%) loss:3.4645 lr:1.00 dt:36ms tok/s:1809098 rem:460s step 3980 (23%) loss:3.4692 lr:1.00 dt:36ms tok/s:1809896 rem:460s step 3981 (23%) loss:3.4469 lr:1.00 dt:36ms tok/s:1818204 rem:460s step 3982 (23%) loss:3.3987 lr:1.00 dt:36ms tok/s:1815850 rem:460s step 3983 (23%) loss:3.4030 lr:1.00 dt:36ms tok/s:1814927 rem:460s step 3984 (23%) loss:3.4094 lr:1.00 dt:36ms tok/s:1815694 rem:460s step 3985 (23%) loss:3.4064 lr:1.00 dt:36ms tok/s:1808288 rem:460s step 3986 (23%) loss:3.4127 lr:1.00 dt:36ms tok/s:1818541 rem:460s step 3987 (23%) loss:3.4202 lr:1.00 dt:36ms tok/s:1819480 rem:460s step 3988 (23%) loss:3.4316 lr:1.00 dt:36ms tok/s:1816630 rem:460s step 3989 (23%) loss:3.4529 lr:1.00 dt:36ms tok/s:1805521 rem:460s step 3990 (23%) loss:3.4718 lr:1.00 dt:37ms tok/s:1780401 rem:460s step 3991 (23%) loss:3.4772 lr:1.00 dt:37ms tok/s:1794676 rem:460s step 3992 (23%) loss:3.4866 lr:1.00 dt:36ms tok/s:1816462 rem:460s step 3993 (23%) loss:3.5013 lr:1.00 dt:36ms tok/s:1821783 rem:460s step 3994 (23%) loss:3.4820 lr:1.00 dt:36ms tok/s:1819733 rem:460s step 3995 (23%) loss:3.4903 lr:1.00 dt:36ms tok/s:1813621 rem:460s step 3996 (23%) loss:3.5057 lr:1.00 dt:36ms tok/s:1812366 rem:460s step 3997 (23%) loss:3.4845 lr:1.00 dt:36ms tok/s:1813346 rem:460s step 3998 (23%) loss:3.4953 lr:1.00 dt:36ms tok/s:1809717 rem:460s step 3999 (23%) loss:3.4897 lr:1.00 dt:36ms tok/s:1822459 rem:460s step 4000 (23%) loss:3.4456 lr:1.00 dt:36ms tok/s:1812007 rem:459s + local: attn=[0.057, 0.385, 0.348] mlp=[0.158, 0.116, -0.109] + + transition: attn=[1.171, 0.424] mlp=[-0.015, 0.111] + + hierarchy: attn=[1.562, 5.939, 5.616] mlp=[0.506, -0.527, -0.187] + step 4001 (23%) loss:3.4456 lr:1.00 dt:36ms tok/s:1813669 rem:459s step 4002 (23%) loss:3.4283 lr:1.00 dt:36ms tok/s:1819757 rem:459s step 4003 (23%) loss:3.4143 lr:1.00 dt:36ms tok/s:1807195 rem:459s step 4004 (23%) loss:3.4132 lr:1.00 dt:36ms tok/s:1816870 rem:459s step 4005 (23%) loss:3.4326 lr:1.00 dt:36ms tok/s:1813107 rem:459s step 4006 (23%) loss:3.4742 lr:1.00 dt:36ms tok/s:1812103 rem:459s step 4007 (23%) loss:3.5122 lr:1.00 dt:36ms tok/s:1807349 rem:459s step 4008 (23%) loss:3.5385 lr:1.00 dt:36ms tok/s:1814436 rem:459s step 4009 (23%) loss:3.5432 lr:1.00 dt:36ms tok/s:1805805 rem:459s step 4010 (23%) loss:3.5700 lr:1.00 dt:36ms tok/s:1816390 rem:459s step 4011 (23%) loss:3.5443 lr:1.00 dt:36ms tok/s:1806185 rem:459s step 4012 (23%) loss:3.5412 lr:1.00 dt:36ms tok/s:1810516 rem:459s step 4013 (23%) loss:3.5486 lr:1.00 dt:36ms tok/s:1810802 rem:459s step 4014 (23%) loss:3.5245 lr:1.00 dt:36ms tok/s:1817290 rem:459s step 4015 (24%) loss:3.4943 lr:1.00 dt:36ms tok/s:1812330 rem:459s step 4016 (24%) loss:3.5507 lr:1.00 dt:36ms tok/s:1811112 rem:459s step 4017 (24%) loss:3.5619 lr:1.00 dt:36ms tok/s:1813909 rem:459s step 4018 (24%) loss:3.5919 lr:1.00 dt:36ms tok/s:1810194 rem:459s step 4019 (24%) loss:3.6013 lr:1.00 dt:37ms tok/s:1776053 rem:459s step 4020 (24%) loss:3.5986 lr:1.00 dt:37ms tok/s:1789989 rem:459s step 4021 (24%) loss:3.5755 lr:1.00 dt:37ms tok/s:1790491 rem:459s step 4022 (24%) loss:3.5768 lr:1.00 dt:37ms tok/s:1793388 rem:459s step 4023 (24%) loss:3.5928 lr:1.00 dt:37ms tok/s:1793692 rem:459s step 4024 (24%) loss:3.6084 lr:1.00 dt:37ms tok/s:1791436 rem:459s step 4025 (24%) loss:3.6120 lr:1.00 dt:37ms tok/s:1792850 rem:459s step 4026 (24%) loss:3.6049 lr:1.00 dt:36ms tok/s:1798504 rem:459s step 4027 (24%) loss:3.5946 lr:1.00 dt:37ms tok/s:1766499 rem:459s step 4028 (24%) loss:3.5971 lr:1.00 dt:37ms tok/s:1766988 rem:458s step 4029 (24%) loss:3.5864 lr:1.00 dt:37ms tok/s:1791389 rem:458s step 4030 (24%) loss:3.6047 lr:1.00 dt:36ms tok/s:1800400 rem:458s step 4031 (24%) loss:3.6186 lr:1.00 dt:37ms tok/s:1793563 rem:458s step 4032 (24%) loss:3.6242 lr:0.99 dt:36ms tok/s:1796024 rem:458s step 4033 (24%) loss:3.6188 lr:0.99 dt:37ms tok/s:1793142 rem:458s step 4034 (24%) loss:3.5980 lr:0.99 dt:37ms tok/s:1790187 rem:458s step 4035 (24%) loss:3.5702 lr:0.99 dt:37ms tok/s:1790164 rem:458s step 4036 (24%) loss:3.5692 lr:0.99 dt:36ms tok/s:1796423 rem:458s step 4037 (24%) loss:3.5660 lr:0.99 dt:36ms tok/s:1796341 rem:458s step 4038 (24%) loss:3.5687 lr:0.99 dt:36ms tok/s:1796928 rem:458s step 4039 (24%) loss:3.5494 lr:0.99 dt:38ms tok/s:1744094 rem:458s step 4040 (24%) loss:3.5410 lr:0.99 dt:36ms tok/s:1799080 rem:458s step 4041 (24%) loss:3.5308 lr:0.99 dt:36ms tok/s:1802206 rem:458s step 4042 (24%) loss:3.5127 lr:0.99 dt:37ms tok/s:1793622 rem:458s step 4043 (24%) loss:3.5233 lr:0.99 dt:37ms tok/s:1774860 rem:458s step 4044 (24%) loss:3.5300 lr:0.99 dt:37ms tok/s:1789232 rem:458s step 4045 (24%) loss:3.5291 lr:0.99 dt:37ms tok/s:1792090 rem:458s step 4046 (24%) loss:3.5340 lr:0.99 dt:37ms tok/s:1783161 rem:458s step 4047 (24%) loss:3.5576 lr:0.99 dt:37ms tok/s:1792733 rem:458s step 4048 (24%) loss:3.5621 lr:0.99 dt:37ms tok/s:1779767 rem:458s step 4049 (24%) loss:3.5983 lr:0.99 dt:36ms tok/s:1820118 rem:458s step 4050 (24%) loss:3.5887 lr:0.99 dt:36ms tok/s:1820468 rem:458s step 4051 (24%) loss:3.5766 lr:0.99 dt:36ms tok/s:1816150 rem:458s step 4052 (24%) loss:3.5703 lr:0.99 dt:36ms tok/s:1818012 rem:458s step 4053 (24%) loss:3.5611 lr:0.99 dt:36ms tok/s:1807682 rem:458s step 4054 (24%) loss:3.5573 lr:0.99 dt:36ms tok/s:1811984 rem:458s step 4055 (24%) loss:3.5464 lr:0.99 dt:36ms tok/s:1809181 rem:457s step 4056 (24%) loss:3.5387 lr:0.99 dt:36ms tok/s:1809431 rem:457s step 4057 (24%) loss:3.5367 lr:0.99 dt:36ms tok/s:1813992 rem:457s step 4058 (24%) loss:3.5284 lr:0.99 dt:36ms tok/s:1818709 rem:457s step 4059 (24%) loss:3.5229 lr:0.99 dt:36ms tok/s:1805805 rem:457s step 4060 (24%) loss:3.5282 lr:0.99 dt:37ms tok/s:1786139 rem:457s step 4061 (24%) loss:3.5214 lr:0.99 dt:36ms tok/s:1814891 rem:457s step 4062 (24%) loss:3.5040 lr:0.99 dt:36ms tok/s:1809896 rem:457s step 4063 (24%) loss:3.5074 lr:0.99 dt:36ms tok/s:1813131 rem:457s step 4064 (24%) loss:3.4987 lr:0.99 dt:36ms tok/s:1810766 rem:457s step 4065 (24%) loss:3.4915 lr:0.99 dt:43ms tok/s:1531592 rem:457s step 4066 (24%) loss:3.4822 lr:0.99 dt:35ms tok/s:1859671 rem:457s step 4067 (24%) loss:3.4667 lr:0.99 dt:35ms tok/s:1899587 rem:457s step 4068 (24%) loss:3.4840 lr:0.99 dt:34ms tok/s:1903943 rem:457s step 4069 (24%) loss:3.4800 lr:0.99 dt:35ms tok/s:1880523 rem:457s step 4070 (24%) loss:3.4812 lr:0.99 dt:35ms tok/s:1884894 rem:457s step 4071 (24%) loss:3.4686 lr:0.99 dt:35ms tok/s:1877171 rem:457s step 4072 (24%) loss:3.4218 lr:0.99 dt:35ms tok/s:1872542 rem:457s step 4073 (24%) loss:3.4561 lr:0.99 dt:35ms tok/s:1881939 rem:457s step 4074 (24%) loss:3.4641 lr:0.99 dt:35ms tok/s:1878634 rem:457s step 4075 (24%) loss:3.4626 lr:0.99 dt:35ms tok/s:1883448 rem:457s step 4076 (24%) loss:3.4582 lr:0.99 dt:36ms tok/s:1838193 rem:457s step 4077 (24%) loss:3.4542 lr:0.99 dt:35ms tok/s:1881141 rem:457s step 4078 (24%) loss:3.4475 lr:0.99 dt:35ms tok/s:1877376 rem:457s step 4079 (24%) loss:3.4628 lr:0.99 dt:35ms tok/s:1864375 rem:457s step 4080 (24%) loss:3.4486 lr:0.99 dt:35ms tok/s:1869829 rem:457s step 4081 (24%) loss:3.4352 lr:0.99 dt:35ms tok/s:1868456 rem:457s step 4082 (24%) loss:3.4371 lr:0.99 dt:35ms tok/s:1865603 rem:457s step 4083 (24%) loss:3.4572 lr:0.99 dt:36ms tok/s:1843519 rem:456s step 4084 (24%) loss:3.4743 lr:0.99 dt:36ms tok/s:1824661 rem:456s step 4085 (24%) loss:3.4897 lr:0.99 dt:36ms tok/s:1825243 rem:456s step 4086 (24%) loss:3.4884 lr:0.99 dt:36ms tok/s:1819564 rem:456s step 4087 (24%) loss:3.4991 lr:0.99 dt:36ms tok/s:1821566 rem:456s step 4088 (24%) loss:3.5063 lr:0.99 dt:36ms tok/s:1822616 rem:456s step 4089 (24%) loss:3.4906 lr:0.99 dt:36ms tok/s:1818529 rem:456s step 4090 (24%) loss:3.4884 lr:0.99 dt:36ms tok/s:1811422 rem:456s step 4091 (24%) loss:3.5054 lr:0.99 dt:36ms tok/s:1813789 rem:456s step 4092 (24%) loss:3.5002 lr:0.99 dt:36ms tok/s:1838353 rem:456s step 4093 (24%) loss:3.4972 lr:0.99 dt:36ms tok/s:1839977 rem:456s step 4094 (24%) loss:3.4819 lr:0.99 dt:36ms tok/s:1835321 rem:456s step 4095 (24%) loss:3.4949 lr:0.99 dt:36ms tok/s:1820757 rem:456s step 4096 (24%) loss:3.5086 lr:0.99 dt:35ms tok/s:1866680 rem:456s step 4097 (24%) loss:3.4907 lr:0.99 dt:35ms tok/s:1862026 rem:456s step 4098 (24%) loss:3.4678 lr:0.99 dt:35ms tok/s:1862279 rem:456s step 4099 (24%) loss:3.4442 lr:0.99 dt:35ms tok/s:1868697 rem:456s step 4100 (24%) loss:3.4287 lr:0.99 dt:35ms tok/s:1862279 rem:456s + local: attn=[0.061, 0.444, 0.393] mlp=[0.183, 0.115, -0.121] + + transition: attn=[1.279, 0.494] mlp=[-0.018, 0.134] + + hierarchy: attn=[1.677, 5.939, 5.616] mlp=[0.595, -0.802, -0.550] + step 4101 (24%) loss:3.4480 lr:0.99 dt:35ms tok/s:1862178 rem:456s step 4102 (24%) loss:3.4674 lr:0.99 dt:36ms tok/s:1845016 rem:456s step 4103 (24%) loss:3.4650 lr:0.99 dt:35ms tok/s:1855716 rem:456s step 4104 (24%) loss:3.4671 lr:0.99 dt:35ms tok/s:1876325 rem:456s step 4105 (24%) loss:3.4847 lr:0.99 dt:35ms tok/s:1877607 rem:456s step 4106 (24%) loss:3.5070 lr:0.99 dt:35ms tok/s:1874572 rem:456s step 4107 (24%) loss:3.5419 lr:0.99 dt:35ms tok/s:1879070 rem:456s step 4108 (24%) loss:3.5451 lr:0.99 dt:34ms tok/s:1902796 rem:456s step 4109 (24%) loss:3.5303 lr:0.99 dt:35ms tok/s:1886537 rem:456s step 4110 (24%) loss:3.5383 lr:0.99 dt:35ms tok/s:1898682 rem:456s step 4111 (24%) loss:3.5421 lr:0.99 dt:35ms tok/s:1892043 rem:456s step 4112 (24%) loss:3.5168 lr:0.99 dt:35ms tok/s:1889728 rem:455s step 4113 (24%) loss:3.5291 lr:0.99 dt:35ms tok/s:1885593 rem:455s step 4114 (24%) loss:3.5418 lr:0.99 dt:35ms tok/s:1891691 rem:455s step 4115 (24%) loss:3.5520 lr:0.99 dt:35ms tok/s:1890079 rem:455s step 4116 (24%) loss:3.5380 lr:0.99 dt:34ms tok/s:1904167 rem:455s step 4117 (24%) loss:3.5411 lr:0.99 dt:34ms tok/s:1908689 rem:455s step 4118 (24%) loss:3.5193 lr:0.99 dt:34ms tok/s:1902559 rem:455s step 4119 (24%) loss:3.5167 lr:0.99 dt:34ms tok/s:1908490 rem:455s step 4120 (24%) loss:3.5144 lr:0.99 dt:34ms tok/s:1911649 rem:455s step 4121 (24%) loss:3.5205 lr:0.99 dt:35ms tok/s:1898630 rem:455s step 4122 (24%) loss:3.5254 lr:0.99 dt:35ms tok/s:1860854 rem:455s step 4123 (24%) loss:3.5265 lr:0.99 dt:34ms tok/s:1906253 rem:455s step 4124 (24%) loss:3.5144 lr:0.99 dt:34ms tok/s:1905962 rem:455s step 4125 (24%) loss:3.5049 lr:0.99 dt:35ms tok/s:1884688 rem:455s step 4126 (24%) loss:3.4941 lr:0.99 dt:34ms tok/s:1926021 rem:455s step 4127 (24%) loss:3.4685 lr:0.99 dt:34ms tok/s:1937561 rem:455s step 4128 (24%) loss:3.4536 lr:0.99 dt:34ms tok/s:1940680 rem:455s step 4129 (24%) loss:3.4245 lr:0.99 dt:34ms tok/s:1925536 rem:455s step 4130 (24%) loss:3.4067 lr:0.99 dt:34ms tok/s:1932997 rem:455s step 4131 (24%) loss:3.4246 lr:0.99 dt:34ms tok/s:1933704 rem:455s step 4132 (24%) loss:3.4245 lr:0.99 dt:35ms tok/s:1884985 rem:455s step 4133 (24%) loss:3.4362 lr:0.99 dt:34ms tok/s:1931665 rem:455s step 4134 (24%) loss:3.4560 lr:0.99 dt:34ms tok/s:1924767 rem:455s step 4135 (24%) loss:3.4610 lr:0.99 dt:34ms tok/s:1936360 rem:455s step 4136 (24%) loss:3.4774 lr:0.99 dt:34ms tok/s:1930404 rem:455s step 4137 (24%) loss:3.4655 lr:0.99 dt:34ms tok/s:1927547 rem:455s step 4138 (24%) loss:3.4749 lr:0.99 dt:34ms tok/s:1931910 rem:455s step 4139 (24%) loss:3.4455 lr:0.99 dt:34ms tok/s:1922721 rem:455s step 4140 (24%) loss:3.4374 lr:0.99 dt:34ms tok/s:1915579 rem:455s step 4141 (24%) loss:3.4567 lr:0.99 dt:34ms tok/s:1928805 rem:454s step 4142 (24%) loss:3.4702 lr:0.99 dt:34ms tok/s:1929794 rem:454s step 4143 (24%) loss:3.4704 lr:0.99 dt:34ms tok/s:1928819 rem:454s step 4144 (24%) loss:3.5080 lr:0.99 dt:34ms tok/s:1933527 rem:454s step 4145 (24%) loss:3.5154 lr:0.99 dt:34ms tok/s:1934425 rem:454s step 4146 (24%) loss:3.5022 lr:0.99 dt:34ms tok/s:1931665 rem:454s step 4147 (24%) loss:3.5095 lr:0.99 dt:34ms tok/s:1942614 rem:454s step 4148 (24%) loss:3.5081 lr:0.99 dt:34ms tok/s:1933255 rem:454s step 4149 (24%) loss:3.4862 lr:0.99 dt:34ms tok/s:1930648 rem:454s step 4150 (24%) loss:3.4859 lr:0.99 dt:36ms tok/s:1844484 rem:454s step 4151 (24%) loss:3.4855 lr:0.99 dt:34ms tok/s:1925657 rem:454s step 4152 (24%) loss:3.4977 lr:0.99 dt:34ms tok/s:1950595 rem:454s step 4153 (24%) loss:3.5054 lr:0.99 dt:34ms tok/s:1945061 rem:454s step 4154 (24%) loss:3.5074 lr:0.99 dt:34ms tok/s:1949931 rem:454s step 4155 (24%) loss:3.4973 lr:0.99 dt:34ms tok/s:1948784 rem:454s step 4156 (24%) loss:3.4898 lr:0.99 dt:34ms tok/s:1941804 rem:454s step 4157 (24%) loss:3.5062 lr:0.99 dt:34ms tok/s:1944744 rem:454s step 4158 (24%) loss:3.4850 lr:0.99 dt:34ms tok/s:1941941 rem:454s step 4159 (24%) loss:3.4659 lr:0.99 dt:34ms tok/s:1942051 rem:454s step 4160 (24%) loss:3.4261 lr:0.99 dt:34ms tok/s:1943740 rem:454s step 4161 (24%) loss:3.4479 lr:0.99 dt:34ms tok/s:1953868 rem:454s step 4162 (24%) loss:3.4518 lr:0.99 dt:34ms tok/s:1944909 rem:454s step 4163 (24%) loss:3.4689 lr:0.99 dt:34ms tok/s:1944744 rem:454s step 4164 (24%) loss:3.4703 lr:0.99 dt:34ms tok/s:1950180 rem:454s step 4165 (24%) loss:3.4726 lr:0.99 dt:34ms tok/s:1937343 rem:454s step 4166 (24%) loss:3.4684 lr:0.99 dt:34ms tok/s:1945198 rem:454s step 4167 (24%) loss:3.4631 lr:0.99 dt:34ms tok/s:1940844 rem:454s step 4168 (24%) loss:3.4585 lr:0.99 dt:34ms tok/s:1950623 rem:454s step 4169 (24%) loss:3.4497 lr:0.99 dt:34ms tok/s:1946066 rem:454s step 4170 (24%) loss:3.4641 lr:0.99 dt:34ms tok/s:1947127 rem:453s step 4171 (24%) loss:3.4778 lr:0.99 dt:34ms tok/s:1946424 rem:453s step 4172 (24%) loss:3.4877 lr:0.99 dt:34ms tok/s:1937179 rem:453s step 4173 (24%) loss:3.4921 lr:0.99 dt:34ms tok/s:1943548 rem:453s step 4174 (24%) loss:3.4610 lr:0.99 dt:34ms tok/s:1949876 rem:453s step 4175 (24%) loss:3.4808 lr:0.99 dt:34ms tok/s:1948991 rem:453s step 4176 (24%) loss:3.4784 lr:0.99 dt:34ms tok/s:1945253 rem:453s step 4177 (24%) loss:3.4902 lr:0.99 dt:34ms tok/s:1942257 rem:453s step 4178 (24%) loss:3.4837 lr:0.99 dt:33ms tok/s:1965295 rem:453s step 4179 (24%) loss:3.4817 lr:0.99 dt:33ms tok/s:1960347 rem:453s step 4180 (24%) loss:3.4808 lr:0.99 dt:34ms tok/s:1954771 rem:453s step 4181 (24%) loss:3.4792 lr:0.99 dt:33ms tok/s:1958866 rem:453s step 4182 (24%) loss:3.4703 lr:0.99 dt:33ms tok/s:1959760 rem:453s step 4183 (24%) loss:3.4726 lr:0.99 dt:33ms tok/s:1961270 rem:453s step 4184 (24%) loss:3.4650 lr:0.99 dt:33ms tok/s:1963582 rem:453s step 4185 (24%) loss:3.4966 lr:0.99 dt:33ms tok/s:1964157 rem:453s step 4186 (25%) loss:3.4908 lr:0.99 dt:33ms tok/s:1959690 rem:453s step 4187 (25%) loss:3.4936 lr:0.99 dt:33ms tok/s:1961284 rem:453s step 4188 (25%) loss:3.4898 lr:0.99 dt:33ms tok/s:1958308 rem:453s step 4189 (25%) loss:3.4776 lr:0.99 dt:33ms tok/s:1959914 rem:453s step 4190 (25%) loss:3.4736 lr:0.99 dt:33ms tok/s:1969533 rem:453s step 4191 (25%) loss:3.4722 lr:0.99 dt:33ms tok/s:1960948 rem:453s step 4192 (25%) loss:3.4648 lr:0.99 dt:34ms tok/s:1948950 rem:453s step 4193 (25%) loss:3.4583 lr:0.99 dt:33ms tok/s:1958141 rem:453s step 4194 (25%) loss:3.4449 lr:0.99 dt:33ms tok/s:1957402 rem:453s step 4195 (25%) loss:3.4367 lr:0.99 dt:33ms tok/s:1961214 rem:453s step 4196 (25%) loss:3.4371 lr:0.99 dt:33ms tok/s:1960669 rem:453s step 4197 (25%) loss:3.4271 lr:0.99 dt:33ms tok/s:1958894 rem:453s step 4198 (25%) loss:3.4172 lr:0.99 dt:33ms tok/s:1963526 rem:453s step 4199 (25%) loss:3.4206 lr:0.99 dt:33ms tok/s:1959271 rem:453s step 4200 (25%) loss:3.4111 lr:0.99 dt:33ms tok/s:1962923 rem:452s + local: attn=[0.057, 0.472, 0.445] mlp=[0.188, 0.120, -0.124] + + transition: attn=[1.517, 0.536] mlp=[-0.041, 0.154] + + hierarchy: attn=[1.881, 5.939, 5.616] mlp=[0.688, -0.863, -1.132] + step 4201 (25%) loss:3.3958 lr:0.99 dt:34ms tok/s:1955257 rem:452s step 4202 (25%) loss:3.4067 lr:0.99 dt:33ms tok/s:1960207 rem:452s step 4203 (25%) loss:3.4062 lr:0.99 dt:34ms tok/s:1948521 rem:452s step 4204 (25%) loss:3.4130 lr:0.99 dt:33ms tok/s:1962376 rem:452s step 4205 (25%) loss:3.4266 lr:0.99 dt:34ms tok/s:1950249 rem:452s step 4206 (25%) loss:3.4226 lr:0.99 dt:34ms tok/s:1946865 rem:452s step 4207 (25%) loss:3.4213 lr:0.99 dt:34ms tok/s:1951828 rem:452s step 4208 (25%) loss:3.4499 lr:0.99 dt:34ms tok/s:1948162 rem:452s step 4209 (25%) loss:3.4403 lr:0.99 dt:34ms tok/s:1950083 rem:452s step 4210 (25%) loss:3.4419 lr:0.99 dt:34ms tok/s:1950291 rem:452s step 4211 (25%) loss:3.4332 lr:0.99 dt:34ms tok/s:1951939 rem:452s step 4212 (25%) loss:3.3969 lr:0.99 dt:34ms tok/s:1936837 rem:452s step 4213 (25%) loss:3.3943 lr:0.99 dt:34ms tok/s:1954604 rem:452s step 4214 (25%) loss:3.4463 lr:0.99 dt:34ms tok/s:1952091 rem:452s step 4215 (25%) loss:3.4493 lr:0.99 dt:34ms tok/s:1954868 rem:452s step 4216 (25%) loss:3.4571 lr:0.99 dt:34ms tok/s:1949240 rem:452s step 4217 (25%) loss:3.4558 lr:0.99 dt:34ms tok/s:1949115 rem:452s step 4218 (25%) loss:3.4636 lr:0.99 dt:34ms tok/s:1948563 rem:452s step 4219 (25%) loss:3.4546 lr:0.99 dt:33ms tok/s:1956412 rem:452s step 4220 (25%) loss:3.4496 lr:0.99 dt:34ms tok/s:1950042 rem:452s step 4221 (25%) loss:3.4422 lr:0.99 dt:34ms tok/s:1951870 rem:452s step 4222 (25%) loss:3.4129 lr:0.99 dt:34ms tok/s:1948093 rem:452s step 4223 (25%) loss:3.4055 lr:0.99 dt:34ms tok/s:1954951 rem:452s step 4224 (25%) loss:3.3702 lr:0.99 dt:34ms tok/s:1944703 rem:452s step 4225 (25%) loss:3.3727 lr:0.99 dt:34ms tok/s:1951108 rem:452s step 4226 (25%) loss:3.3884 lr:0.99 dt:34ms tok/s:1950319 rem:452s step 4227 (25%) loss:3.4036 lr:0.99 dt:34ms tok/s:1936974 rem:452s step 4228 (25%) loss:3.4053 lr:0.99 dt:34ms tok/s:1955633 rem:452s step 4229 (25%) loss:3.4962 lr:0.99 dt:34ms tok/s:1953271 rem:452s step 4230 (25%) loss:3.5099 lr:0.99 dt:34ms tok/s:1946562 rem:451s step 4231 (25%) loss:3.5108 lr:0.99 dt:34ms tok/s:1953923 rem:451s step 4232 (25%) loss:3.5036 lr:0.99 dt:34ms tok/s:1945680 rem:451s step 4233 (25%) loss:3.4880 lr:0.99 dt:33ms tok/s:1957653 rem:451s step 4234 (25%) loss:3.5130 lr:0.99 dt:34ms tok/s:1948770 rem:451s step 4235 (25%) loss:3.5091 lr:0.99 dt:34ms tok/s:1946865 rem:451s step 4236 (25%) loss:3.4988 lr:0.99 dt:33ms tok/s:1960934 rem:451s step 4237 (25%) loss:3.5854 lr:0.99 dt:34ms tok/s:1949503 rem:451s step 4238 (25%) loss:3.5876 lr:0.99 dt:34ms tok/s:1949530 rem:451s step 4239 (25%) loss:3.5615 lr:0.99 dt:34ms tok/s:1947224 rem:451s step 4240 (25%) loss:3.5522 lr:0.99 dt:34ms tok/s:1950457 rem:451s step 4241 (25%) loss:3.5580 lr:0.99 dt:34ms tok/s:1952369 rem:451s step 4242 (25%) loss:3.5399 lr:0.99 dt:34ms tok/s:1945515 rem:451s step 4243 (25%) loss:3.5400 lr:0.99 dt:34ms tok/s:1950153 rem:451s step 4244 (25%) loss:3.5247 lr:0.99 dt:34ms tok/s:1947817 rem:451s step 4245 (25%) loss:3.5331 lr:0.99 dt:34ms tok/s:1951662 rem:451s step 4246 (25%) loss:3.5365 lr:0.99 dt:34ms tok/s:1952938 rem:451s step 4247 (25%) loss:3.5401 lr:0.99 dt:34ms tok/s:1947831 rem:451s step 4248 (25%) loss:3.5170 lr:0.99 dt:34ms tok/s:1947320 rem:451s step 4249 (25%) loss:3.4898 lr:0.99 dt:34ms tok/s:1946245 rem:451s step 4250 (25%) loss:3.4766 lr:0.99 dt:34ms tok/s:1948659 rem:451s step 4251 (25%) loss:3.4752 lr:0.99 dt:34ms tok/s:1954507 rem:451s step 4252 (25%) loss:3.4725 lr:0.99 dt:34ms tok/s:1948494 rem:451s step 4253 (25%) loss:3.4633 lr:0.99 dt:34ms tok/s:1947734 rem:451s step 4254 (25%) loss:3.4589 lr:0.99 dt:34ms tok/s:1947845 rem:451s step 4255 (25%) loss:3.4589 lr:0.99 dt:34ms tok/s:1946672 rem:451s step 4256 (25%) loss:3.4750 lr:0.99 dt:34ms tok/s:1944428 rem:451s step 4257 (25%) loss:3.4831 lr:0.99 dt:33ms tok/s:1957262 rem:451s step 4258 (25%) loss:3.4689 lr:0.99 dt:34ms tok/s:1952119 rem:451s step 4259 (25%) loss:3.4831 lr:0.99 dt:34ms tok/s:1953465 rem:451s step 4260 (25%) loss:3.4641 lr:0.99 dt:34ms tok/s:1954910 rem:450s step 4261 (25%) loss:3.4548 lr:0.99 dt:34ms tok/s:1954062 rem:450s step 4262 (25%) loss:3.4667 lr:0.99 dt:34ms tok/s:1945240 rem:450s step 4263 (25%) loss:3.4936 lr:0.99 dt:34ms tok/s:1944703 rem:450s step 4264 (25%) loss:3.4843 lr:0.99 dt:34ms tok/s:1946217 rem:450s step 4265 (25%) loss:3.4765 lr:0.99 dt:35ms tok/s:1863453 rem:450s step 4266 (25%) loss:3.4677 lr:0.99 dt:34ms tok/s:1923689 rem:450s step 4267 (25%) loss:3.4522 lr:0.99 dt:34ms tok/s:1953104 rem:450s step 4268 (25%) loss:3.4370 lr:0.99 dt:34ms tok/s:1955536 rem:450s step 4269 (25%) loss:3.4376 lr:0.99 dt:34ms tok/s:1912207 rem:450s step 4270 (25%) loss:3.4684 lr:0.99 dt:34ms tok/s:1945501 rem:450s step 4271 (25%) loss:3.4729 lr:0.99 dt:34ms tok/s:1951329 rem:450s step 4272 (25%) loss:3.4863 lr:0.99 dt:33ms tok/s:1956719 rem:450s step 4273 (25%) loss:3.4970 lr:0.99 dt:34ms tok/s:1953049 rem:450s step 4274 (25%) loss:3.4905 lr:0.99 dt:34ms tok/s:1944964 rem:450s step 4275 (25%) loss:3.4991 lr:0.99 dt:34ms tok/s:1954938 rem:450s step 4276 (25%) loss:3.4994 lr:0.99 dt:36ms tok/s:1843717 rem:450s step 4277 (25%) loss:3.4954 lr:0.99 dt:34ms tok/s:1947914 rem:450s step 4278 (25%) loss:3.4974 lr:0.99 dt:33ms tok/s:1967095 rem:450s step 4279 (25%) loss:3.4968 lr:0.99 dt:33ms tok/s:1961508 rem:450s step 4280 (25%) loss:3.4679 lr:0.99 dt:33ms tok/s:1959076 rem:450s step 4281 (25%) loss:3.4584 lr:0.99 dt:34ms tok/s:1949710 rem:450s step 4282 (25%) loss:3.4600 lr:0.99 dt:34ms tok/s:1937424 rem:450s step 4283 (25%) loss:3.4761 lr:0.99 dt:33ms tok/s:1969533 rem:450s step 4284 (25%) loss:3.4786 lr:0.99 dt:33ms tok/s:1960040 rem:450s step 4285 (25%) loss:3.4926 lr:0.99 dt:33ms tok/s:1966602 rem:450s step 4286 (25%) loss:3.5141 lr:0.99 dt:33ms tok/s:1958838 rem:450s step 4287 (25%) loss:3.4916 lr:0.99 dt:33ms tok/s:1968856 rem:450s step 4288 (25%) loss:3.5037 lr:0.99 dt:33ms tok/s:1968757 rem:450s step 4289 (25%) loss:3.5198 lr:0.99 dt:33ms tok/s:1960627 rem:449s step 4290 (25%) loss:3.4981 lr:0.99 dt:33ms tok/s:1961480 rem:449s step 4291 (25%) loss:3.4809 lr:0.99 dt:33ms tok/s:1957220 rem:449s step 4292 (25%) loss:3.5252 lr:0.99 dt:33ms tok/s:1971384 rem:449s step 4293 (25%) loss:3.5282 lr:0.99 dt:33ms tok/s:1968461 rem:449s step 4294 (25%) loss:3.5426 lr:0.99 dt:33ms tok/s:1981316 rem:449s step 4295 (25%) loss:3.5434 lr:0.99 dt:33ms tok/s:1969011 rem:449s step 4296 (25%) loss:3.5500 lr:0.99 dt:33ms tok/s:1972317 rem:449s step 4297 (25%) loss:3.5395 lr:0.99 dt:33ms tok/s:1965913 rem:449s step 4298 (25%) loss:3.5182 lr:0.99 dt:33ms tok/s:1965224 rem:449s step 4299 (25%) loss:3.5079 lr:0.99 dt:33ms tok/s:1965969 rem:449s step 4300 (25%) loss:3.5115 lr:0.99 dt:33ms tok/s:1966518 rem:449s + local: attn=[0.066, 0.494, 0.463] mlp=[0.199, 0.116, -0.133] + + transition: attn=[1.559, 0.562] mlp=[-0.037, 0.164] + + hierarchy: attn=[1.944, 5.939, 5.616] mlp=[0.733, -0.750, -1.487] + step 4301 (25%) loss:3.5219 lr:0.99 dt:33ms tok/s:1967714 rem:449s step 4302 (25%) loss:3.5230 lr:0.99 dt:33ms tok/s:1966363 rem:449s step 4303 (25%) loss:3.5162 lr:0.99 dt:33ms tok/s:1965941 rem:449s step 4304 (25%) loss:3.4779 lr:0.99 dt:33ms tok/s:1967489 rem:449s step 4305 (25%) loss:3.4543 lr:0.99 dt:33ms tok/s:1966180 rem:449s step 4306 (25%) loss:3.4877 lr:0.99 dt:33ms tok/s:1964101 rem:449s step 4307 (25%) loss:3.5093 lr:0.99 dt:41ms tok/s:1579549 rem:449s step 4308 (25%) loss:3.5218 lr:0.99 dt:33ms tok/s:1994297 rem:449s step 4309 (25%) loss:3.5373 lr:0.99 dt:33ms tok/s:1996672 rem:449s step 4310 (25%) loss:3.5570 lr:0.99 dt:33ms tok/s:1980060 rem:449s step 4311 (25%) loss:3.5642 lr:0.99 dt:34ms tok/s:1955188 rem:449s step 4312 (25%) loss:3.5408 lr:0.99 dt:33ms tok/s:1974754 rem:449s step 4313 (25%) loss:3.5531 lr:0.99 dt:33ms tok/s:1984649 rem:449s step 4314 (25%) loss:3.5446 lr:0.99 dt:33ms tok/s:1963315 rem:449s step 4315 (25%) loss:3.5384 lr:0.99 dt:33ms tok/s:1975904 rem:449s step 4316 (25%) loss:3.5310 lr:0.99 dt:33ms tok/s:1966082 rem:449s step 4317 (25%) loss:3.5277 lr:0.99 dt:33ms tok/s:1961354 rem:449s step 4318 (25%) loss:3.5338 lr:0.99 dt:33ms tok/s:1974996 rem:449s step 4319 (25%) loss:3.5187 lr:0.99 dt:33ms tok/s:1963919 rem:448s step 4320 (25%) loss:3.5109 lr:0.99 dt:33ms tok/s:1964073 rem:448s step 4321 (25%) loss:3.4754 lr:0.99 dt:33ms tok/s:1975535 rem:448s step 4322 (25%) loss:3.4720 lr:0.99 dt:34ms tok/s:1953729 rem:448s step 4323 (25%) loss:3.4796 lr:0.99 dt:34ms tok/s:1950263 rem:448s step 4324 (25%) loss:3.4778 lr:0.99 dt:33ms tok/s:1961998 rem:448s step 4325 (25%) loss:3.4690 lr:0.99 dt:34ms tok/s:1941160 rem:448s step 4326 (25%) loss:3.4662 lr:0.99 dt:34ms tok/s:1922532 rem:448s step 4327 (25%) loss:3.4604 lr:0.99 dt:34ms tok/s:1934180 rem:448s step 4328 (25%) loss:3.4610 lr:0.99 dt:34ms tok/s:1924322 rem:448s step 4329 (25%) loss:3.4710 lr:0.99 dt:34ms tok/s:1914938 rem:448s step 4330 (25%) loss:3.4908 lr:0.99 dt:34ms tok/s:1913259 rem:448s step 4331 (25%) loss:3.4846 lr:0.99 dt:34ms tok/s:1909511 rem:448s step 4332 (25%) loss:3.4735 lr:0.99 dt:34ms tok/s:1906928 rem:448s step 4333 (25%) loss:3.4568 lr:0.99 dt:34ms tok/s:1909816 rem:448s step 4334 (25%) loss:3.4522 lr:0.99 dt:34ms tok/s:1909445 rem:448s step 4335 (25%) loss:3.4512 lr:0.99 dt:34ms tok/s:1901703 rem:448s step 4336 (25%) loss:3.4354 lr:0.99 dt:35ms tok/s:1889689 rem:448s step 4337 (25%) loss:3.4286 lr:0.99 dt:35ms tok/s:1892512 rem:448s step 4338 (25%) loss:3.4181 lr:0.99 dt:35ms tok/s:1884946 rem:448s step 4339 (25%) loss:3.4181 lr:0.99 dt:35ms tok/s:1877222 rem:448s step 4340 (25%) loss:3.4143 lr:0.99 dt:35ms tok/s:1885476 rem:448s step 4341 (25%) loss:3.4212 lr:0.99 dt:35ms tok/s:1875788 rem:448s step 4342 (25%) loss:3.4379 lr:0.99 dt:35ms tok/s:1875442 rem:448s step 4343 (25%) loss:3.4528 lr:0.99 dt:35ms tok/s:1856781 rem:448s step 4344 (25%) loss:3.4441 lr:0.99 dt:35ms tok/s:1867923 rem:448s step 4345 (25%) loss:3.4541 lr:0.99 dt:35ms tok/s:1860312 rem:448s step 4346 (25%) loss:3.4595 lr:0.99 dt:35ms tok/s:1853626 rem:448s step 4347 (25%) loss:3.4573 lr:0.99 dt:35ms tok/s:1854952 rem:448s step 4348 (25%) loss:3.5030 lr:0.99 dt:35ms tok/s:1859243 rem:447s step 4349 (25%) loss:3.4963 lr:0.99 dt:35ms tok/s:1862064 rem:447s step 4350 (25%) loss:3.4902 lr:0.99 dt:35ms tok/s:1861850 rem:447s step 4351 (25%) loss:3.4869 lr:0.99 dt:35ms tok/s:1861270 rem:447s step 4352 (25%) loss:3.4784 lr:0.99 dt:36ms tok/s:1828776 rem:447s step 4353 (25%) loss:3.4815 lr:0.99 dt:35ms tok/s:1854414 rem:447s step 4354 (25%) loss:3.4814 lr:0.99 dt:35ms tok/s:1854539 rem:447s step 4355 (25%) loss:3.5134 lr:0.99 dt:35ms tok/s:1860665 rem:447s step 4356 (25%) loss:3.5232 lr:0.99 dt:35ms tok/s:1852090 rem:447s step 4357 (25%) loss:3.5069 lr:0.99 dt:36ms tok/s:1838697 rem:447s step 4358 (25%) loss:3.5193 lr:0.99 dt:35ms tok/s:1853126 rem:447s step 4359 (25%) loss:3.5311 lr:0.99 dt:35ms tok/s:1848863 rem:447s step 4360 (25%) loss:3.5208 lr:0.99 dt:36ms tok/s:1844261 rem:447s step 4361 (25%) loss:3.5296 lr:0.99 dt:37ms tok/s:1771611 rem:447s step 4362 (25%) loss:3.5142 lr:0.99 dt:40ms tok/s:1640719 rem:447s step 4363 (26%) loss:3.5341 lr:0.99 dt:36ms tok/s:1807599 rem:447s step 4364 (26%) loss:3.5435 lr:0.99 dt:34ms tok/s:1907589 rem:447s step 4365 (26%) loss:3.5343 lr:0.99 dt:35ms tok/s:1890651 rem:447s step 4366 (26%) loss:3.5410 lr:0.99 dt:35ms tok/s:1899102 rem:447s step 4367 (26%) loss:3.5441 lr:0.99 dt:35ms tok/s:1897647 rem:447s step 4368 (26%) loss:3.5404 lr:0.99 dt:35ms tok/s:1898682 rem:447s step 4369 (26%) loss:3.5411 lr:0.99 dt:37ms tok/s:1783647 rem:447s step 4370 (26%) loss:3.5653 lr:0.99 dt:34ms tok/s:1900901 rem:447s step 4371 (26%) loss:3.5965 lr:0.99 dt:35ms tok/s:1887198 rem:447s step 4372 (26%) loss:3.5925 lr:0.99 dt:34ms tok/s:1900467 rem:447s step 4373 (26%) loss:3.6017 lr:0.99 dt:35ms tok/s:1869498 rem:447s step 4374 (26%) loss:3.5951 lr:0.99 dt:35ms tok/s:1858212 rem:447s step 4375 (26%) loss:3.6042 lr:0.99 dt:35ms tok/s:1859230 rem:447s step 4376 (26%) loss:3.5944 lr:0.99 dt:35ms tok/s:1863124 rem:447s step 4377 (26%) loss:3.5822 lr:0.99 dt:35ms tok/s:1848975 rem:446s step 4378 (26%) loss:3.5649 lr:0.99 dt:36ms tok/s:1816234 rem:446s step 4379 (26%) loss:3.5578 lr:0.99 dt:35ms tok/s:1853876 rem:446s step 4380 (26%) loss:3.5719 lr:0.99 dt:35ms tok/s:1849909 rem:446s step 4381 (26%) loss:3.5511 lr:0.99 dt:35ms tok/s:1857208 rem:446s step 4382 (26%) loss:3.5315 lr:0.99 dt:35ms tok/s:1857446 rem:446s step 4383 (26%) loss:3.5317 lr:0.99 dt:35ms tok/s:1861257 rem:446s step 4384 (26%) loss:3.5270 lr:0.99 dt:35ms tok/s:1864679 rem:446s step 4385 (26%) loss:3.5283 lr:0.99 dt:35ms tok/s:1876184 rem:446s step 4386 (26%) loss:3.5098 lr:0.99 dt:35ms tok/s:1874700 rem:446s step 4387 (26%) loss:3.5114 lr:0.99 dt:35ms tok/s:1875224 rem:446s step 4388 (26%) loss:3.5114 lr:0.99 dt:35ms tok/s:1854376 rem:446s step 4389 (26%) loss:3.5171 lr:0.99 dt:35ms tok/s:1850357 rem:446s step 4390 (26%) loss:3.5003 lr:0.99 dt:35ms tok/s:1847633 rem:446s step 4391 (26%) loss:3.5050 lr:0.99 dt:36ms tok/s:1832972 rem:446s step 4392 (26%) loss:3.5082 lr:0.99 dt:36ms tok/s:1840630 rem:446s step 4393 (26%) loss:3.5208 lr:0.99 dt:36ms tok/s:1831262 rem:446s step 4394 (26%) loss:3.5091 lr:0.99 dt:36ms tok/s:1835334 rem:446s step 4395 (26%) loss:3.5287 lr:0.99 dt:36ms tok/s:1802868 rem:446s step 4396 (26%) loss:3.5317 lr:0.99 dt:37ms tok/s:1788638 rem:446s step 4397 (26%) loss:3.5149 lr:0.99 dt:37ms tok/s:1789150 rem:446s step 4398 (26%) loss:3.4880 lr:0.99 dt:36ms tok/s:1809586 rem:446s step 4399 (26%) loss:3.5013 lr:0.99 dt:36ms tok/s:1804490 rem:446s step 4400 (26%) loss:3.5175 lr:0.99 dt:36ms tok/s:1809777 rem:446s + local: attn=[0.060, 0.589, 0.520] mlp=[0.221, 0.149, -0.138] + + transition: attn=[1.775, 0.644] mlp=[-0.058, 0.183] + + hierarchy: attn=[2.065, 5.939, 5.616] mlp=[0.798, -0.824, -1.941] + step 4401 (26%) loss:3.5391 lr:0.99 dt:36ms tok/s:1818084 rem:446s step 4402 (26%) loss:3.5376 lr:0.99 dt:36ms tok/s:1810850 rem:446s step 4403 (26%) loss:3.5167 lr:0.99 dt:36ms tok/s:1812342 rem:446s step 4404 (26%) loss:3.4939 lr:0.99 dt:36ms tok/s:1817363 rem:446s step 4405 (26%) loss:3.4690 lr:0.99 dt:36ms tok/s:1809586 rem:445s step 4406 (26%) loss:3.4492 lr:0.99 dt:36ms tok/s:1806399 rem:445s step 4407 (26%) loss:3.4153 lr:0.99 dt:36ms tok/s:1813609 rem:445s step 4408 (26%) loss:3.3607 lr:0.99 dt:36ms tok/s:1806328 rem:445s step 4409 (26%) loss:3.3261 lr:0.99 dt:36ms tok/s:1815430 rem:445s step 4410 (26%) loss:3.3499 lr:0.99 dt:36ms tok/s:1803542 rem:445s step 4411 (26%) loss:3.3677 lr:0.99 dt:36ms tok/s:1809765 rem:445s step 4412 (26%) loss:3.3708 lr:0.99 dt:36ms tok/s:1804205 rem:445s step 4413 (26%) loss:3.3717 lr:0.99 dt:36ms tok/s:1809169 rem:445s step 4414 (26%) loss:3.3833 lr:0.99 dt:36ms tok/s:1808027 rem:445s step 4415 (26%) loss:3.3982 lr:0.99 dt:36ms tok/s:1805130 rem:445s step 4416 (26%) loss:3.4240 lr:0.99 dt:36ms tok/s:1833289 rem:445s step 4417 (26%) loss:3.4515 lr:0.99 dt:36ms tok/s:1832006 rem:445s step 4418 (26%) loss:3.4656 lr:0.99 dt:36ms tok/s:1825764 rem:445s step 4419 (26%) loss:3.4766 lr:0.99 dt:36ms tok/s:1835689 rem:445s step 4420 (26%) loss:3.4837 lr:0.99 dt:36ms tok/s:1829324 rem:445s step 4421 (26%) loss:3.5036 lr:0.99 dt:36ms tok/s:1831860 rem:445s step 4422 (26%) loss:3.5074 lr:0.99 dt:36ms tok/s:1840606 rem:445s step 4423 (26%) loss:3.5080 lr:0.99 dt:36ms tok/s:1826152 rem:445s step 4424 (26%) loss:3.5216 lr:0.99 dt:36ms tok/s:1822834 rem:445s step 4425 (26%) loss:3.5331 lr:0.99 dt:36ms tok/s:1824540 rem:445s step 4426 (26%) loss:3.5481 lr:0.99 dt:36ms tok/s:1824116 rem:445s step 4427 (26%) loss:3.5450 lr:0.99 dt:36ms tok/s:1826286 rem:445s step 4428 (26%) loss:3.8728 lr:0.99 dt:36ms tok/s:1828241 rem:445s step 4429 (26%) loss:3.8409 lr:0.99 dt:36ms tok/s:1826104 rem:445s step 4430 (26%) loss:3.8213 lr:0.99 dt:36ms tok/s:1833094 rem:445s step 4431 (26%) loss:3.7929 lr:0.99 dt:36ms tok/s:1833828 rem:445s step 4432 (26%) loss:3.7574 lr:0.99 dt:36ms tok/s:1827998 rem:444s step 4433 (26%) loss:3.7196 lr:0.99 dt:36ms tok/s:1824019 rem:444s step 4434 (26%) loss:3.6975 lr:0.99 dt:36ms tok/s:1826783 rem:444s step 4435 (26%) loss:3.6616 lr:0.99 dt:36ms tok/s:1829763 rem:444s step 4436 (26%) loss:3.6589 lr:0.99 dt:36ms tok/s:1834464 rem:444s step 4437 (26%) loss:3.6527 lr:0.99 dt:36ms tok/s:1828971 rem:444s step 4438 (26%) loss:3.6273 lr:0.99 dt:36ms tok/s:1834929 rem:444s step 4439 (26%) loss:3.6095 lr:0.99 dt:36ms tok/s:1840532 rem:444s step 4440 (26%) loss:3.6024 lr:0.99 dt:36ms tok/s:1832202 rem:444s step 4441 (26%) loss:3.6006 lr:0.99 dt:36ms tok/s:1829714 rem:444s step 4442 (26%) loss:3.5998 lr:0.99 dt:36ms tok/s:1824346 rem:444s step 4443 (26%) loss:3.5929 lr:0.99 dt:36ms tok/s:1826553 rem:444s step 4444 (26%) loss:3.5898 lr:0.99 dt:36ms tok/s:1829275 rem:444s step 4445 (26%) loss:3.5845 lr:0.99 dt:36ms tok/s:1835395 rem:444s step 4446 (26%) loss:3.5665 lr:0.99 dt:36ms tok/s:1831994 rem:444s step 4447 (26%) loss:3.5561 lr:0.99 dt:36ms tok/s:1828922 rem:444s step 4448 (26%) loss:3.5390 lr:0.99 dt:36ms tok/s:1825582 rem:444s step 4449 (26%) loss:3.5279 lr:0.99 dt:36ms tok/s:1830823 rem:444s step 4450 (26%) loss:3.5007 lr:0.99 dt:36ms tok/s:1830055 rem:444s step 4451 (26%) loss:3.5188 lr:0.99 dt:36ms tok/s:1824589 rem:444s step 4452 (26%) loss:3.5136 lr:0.99 dt:36ms tok/s:1823257 rem:444s step 4453 (26%) loss:3.5185 lr:0.99 dt:36ms tok/s:1834158 rem:444s step 4454 (26%) loss:3.5252 lr:0.99 dt:36ms tok/s:1836989 rem:444s step 4455 (26%) loss:3.5169 lr:0.99 dt:36ms tok/s:1835652 rem:444s step 4456 (26%) loss:3.5185 lr:0.99 dt:36ms tok/s:1838181 rem:444s step 4457 (26%) loss:3.5534 lr:0.99 dt:36ms tok/s:1836842 rem:444s step 4458 (26%) loss:3.5565 lr:0.99 dt:36ms tok/s:1832324 rem:444s step 4459 (26%) loss:3.5467 lr:0.99 dt:36ms tok/s:1821216 rem:444s step 4460 (26%) loss:3.5388 lr:0.99 dt:36ms tok/s:1830079 rem:443s step 4461 (26%) loss:3.5266 lr:0.99 dt:36ms tok/s:1838820 rem:443s step 4462 (26%) loss:3.5035 lr:0.99 dt:36ms tok/s:1835272 rem:443s step 4463 (26%) loss:3.4789 lr:0.99 dt:36ms tok/s:1836008 rem:443s step 4464 (26%) loss:3.4642 lr:0.99 dt:36ms tok/s:1833020 rem:443s step 4465 (26%) loss:3.4611 lr:0.99 dt:36ms tok/s:1832422 rem:443s step 4466 (26%) loss:3.4625 lr:0.99 dt:36ms tok/s:1839559 rem:443s step 4467 (26%) loss:3.4710 lr:0.99 dt:36ms tok/s:1836977 rem:443s step 4468 (26%) loss:3.4680 lr:0.99 dt:36ms tok/s:1831628 rem:443s step 4469 (26%) loss:3.4573 lr:0.99 dt:36ms tok/s:1833876 rem:443s step 4470 (26%) loss:3.4515 lr:0.99 dt:36ms tok/s:1827281 rem:443s step 4471 (26%) loss:3.4180 lr:0.99 dt:36ms tok/s:1837972 rem:443s step 4472 (26%) loss:3.4157 lr:0.99 dt:36ms tok/s:1840729 rem:443s step 4473 (26%) loss:3.4420 lr:0.99 dt:36ms tok/s:1820058 rem:443s step 4474 (26%) loss:3.4461 lr:0.99 dt:36ms tok/s:1824988 rem:443s step 4475 (26%) loss:3.4356 lr:0.99 dt:36ms tok/s:1830908 rem:443s step 4476 (26%) loss:3.4220 lr:0.99 dt:37ms tok/s:1749199 rem:443s step 4477 (26%) loss:3.4224 lr:0.99 dt:35ms tok/s:1849373 rem:443s step 4478 (26%) loss:3.4410 lr:0.99 dt:36ms tok/s:1845685 rem:443s step 4479 (26%) loss:3.4583 lr:0.99 dt:35ms tok/s:1858087 rem:443s step 4480 (26%) loss:3.4781 lr:0.99 dt:36ms tok/s:1841309 rem:443s step 4481 (26%) loss:3.4737 lr:0.99 dt:36ms tok/s:1842926 rem:443s step 4482 (26%) loss:3.4756 lr:0.99 dt:35ms tok/s:1849610 rem:443s step 4483 (26%) loss:3.4655 lr:0.99 dt:35ms tok/s:1854551 rem:443s step 4484 (26%) loss:3.4386 lr:0.99 dt:35ms tok/s:1850594 rem:443s step 4485 (26%) loss:3.4424 lr:0.99 dt:36ms tok/s:1830201 rem:443s step 4486 (26%) loss:3.4380 lr:0.99 dt:36ms tok/s:1845884 rem:443s step 4487 (26%) loss:3.4328 lr:0.99 dt:35ms tok/s:1851092 rem:443s step 4488 (26%) loss:3.4180 lr:0.99 dt:35ms tok/s:1849075 rem:442s step 4489 (26%) loss:3.3795 lr:0.99 dt:35ms tok/s:1848142 rem:442s step 4490 (26%) loss:3.3679 lr:0.98 dt:36ms tok/s:1844868 rem:442s step 4491 (26%) loss:3.3799 lr:0.98 dt:35ms tok/s:1850295 rem:442s step 4492 (26%) loss:3.3845 lr:0.98 dt:36ms tok/s:1844682 rem:442s step 4493 (26%) loss:3.3939 lr:0.98 dt:35ms tok/s:1853264 rem:442s step 4494 (26%) loss:3.4050 lr:0.98 dt:35ms tok/s:1850469 rem:442s step 4495 (26%) loss:3.4137 lr:0.98 dt:35ms tok/s:1858011 rem:442s step 4496 (26%) loss:3.4048 lr:0.98 dt:35ms tok/s:1850955 rem:442s step 4497 (26%) loss:3.4050 lr:0.98 dt:35ms tok/s:1853551 rem:442s step 4498 (26%) loss:3.4133 lr:0.98 dt:35ms tok/s:1852926 rem:442s step 4499 (26%) loss:3.4028 lr:0.98 dt:35ms tok/s:1857672 rem:442s step 4500 (26%) loss:3.4182 lr:0.98 dt:36ms tok/s:1841506 rem:442s + local: attn=[0.058, 0.579, 0.541] mlp=[0.216, 0.135, -0.116] + + transition: attn=[1.848, 0.650] mlp=[-0.033, 0.192] + + hierarchy: attn=[2.225, 5.939, 5.616] mlp=[0.909, -0.937, -2.278] + step 4501 (26%) loss:3.4247 lr:0.98 dt:36ms tok/s:1843556 rem:442s step 4502 (26%) loss:3.4241 lr:0.98 dt:36ms tok/s:1844335 rem:442s step 4503 (26%) loss:3.4415 lr:0.98 dt:35ms tok/s:1846714 rem:442s step 4504 (26%) loss:3.4566 lr:0.98 dt:35ms tok/s:1850369 rem:442s step 4505 (26%) loss:3.4727 lr:0.98 dt:35ms tok/s:1849361 rem:442s step 4506 (26%) loss:3.4764 lr:0.98 dt:35ms tok/s:1847658 rem:442s step 4507 (26%) loss:3.4754 lr:0.98 dt:36ms tok/s:1842654 rem:442s step 4508 (26%) loss:3.4515 lr:0.98 dt:35ms tok/s:1849921 rem:442s step 4509 (26%) loss:3.4386 lr:0.98 dt:35ms tok/s:1855566 rem:442s step 4510 (26%) loss:3.4200 lr:0.98 dt:35ms tok/s:1850319 rem:442s step 4511 (26%) loss:3.4072 lr:0.98 dt:35ms tok/s:1851341 rem:442s step 4512 (26%) loss:3.3917 lr:0.98 dt:35ms tok/s:1853776 rem:442s step 4513 (26%) loss:3.3889 lr:0.98 dt:35ms tok/s:1847273 rem:442s step 4514 (26%) loss:3.3752 lr:0.98 dt:35ms tok/s:1849224 rem:442s step 4515 (26%) loss:3.3993 lr:0.98 dt:35ms tok/s:1850083 rem:442s step 4516 (26%) loss:3.4623 lr:0.98 dt:35ms tok/s:1846566 rem:441s step 4517 (26%) loss:3.5182 lr:0.98 dt:35ms tok/s:1848329 rem:441s step 4518 (26%) loss:3.5485 lr:0.98 dt:35ms tok/s:1846355 rem:441s step 4519 (26%) loss:3.5689 lr:0.98 dt:35ms tok/s:1848229 rem:441s step 4520 (26%) loss:3.5975 lr:0.98 dt:35ms tok/s:1849510 rem:441s step 4521 (26%) loss:3.6101 lr:0.98 dt:35ms tok/s:1850158 rem:441s step 4522 (26%) loss:3.6189 lr:0.98 dt:35ms tok/s:1856556 rem:441s step 4523 (26%) loss:3.6198 lr:0.98 dt:35ms tok/s:1851753 rem:441s step 4524 (26%) loss:3.6311 lr:0.98 dt:35ms tok/s:1855866 rem:441s step 4525 (26%) loss:3.6479 lr:0.98 dt:35ms tok/s:1848453 rem:441s step 4526 (26%) loss:3.7397 lr:0.98 dt:35ms tok/s:1848279 rem:441s step 4527 (26%) loss:3.7680 lr:0.98 dt:35ms tok/s:1852602 rem:441s step 4528 (26%) loss:3.7547 lr:0.98 dt:35ms tok/s:1849535 rem:441s step 4529 (26%) loss:3.7547 lr:0.98 dt:35ms tok/s:1848142 rem:441s step 4530 (26%) loss:3.7596 lr:0.98 dt:35ms tok/s:1847583 rem:441s step 4531 (26%) loss:3.7354 lr:0.98 dt:35ms tok/s:1853164 rem:441s step 4532 (27%) loss:3.7205 lr:0.98 dt:35ms tok/s:1850494 rem:441s step 4533 (27%) loss:3.7228 lr:0.98 dt:35ms tok/s:1846479 rem:441s step 4534 (27%) loss:3.7100 lr:0.98 dt:35ms tok/s:1848615 rem:441s step 4535 (27%) loss:3.6915 lr:0.98 dt:36ms tok/s:1844385 rem:441s step 4536 (27%) loss:3.6761 lr:0.98 dt:35ms tok/s:1848291 rem:441s step 4537 (27%) loss:3.6602 lr:0.98 dt:35ms tok/s:1846491 rem:441s step 4538 (27%) loss:3.6445 lr:0.98 dt:35ms tok/s:1851304 rem:441s step 4539 (27%) loss:3.6344 lr:0.98 dt:36ms tok/s:1844657 rem:441s step 4540 (27%) loss:3.6169 lr:0.98 dt:35ms tok/s:1848267 rem:441s step 4541 (27%) loss:3.6121 lr:0.98 dt:35ms tok/s:1849075 rem:441s step 4542 (27%) loss:3.6133 lr:0.98 dt:36ms tok/s:1844769 rem:441s step 4543 (27%) loss:3.6021 lr:0.98 dt:36ms tok/s:1806316 rem:441s step 4544 (27%) loss:3.5968 lr:0.98 dt:36ms tok/s:1842345 rem:441s step 4545 (27%) loss:3.5736 lr:0.98 dt:35ms tok/s:1847881 rem:440s step 4546 (27%) loss:3.5735 lr:0.98 dt:35ms tok/s:1851528 rem:440s step 4547 (27%) loss:3.5740 lr:0.98 dt:36ms tok/s:1845264 rem:440s step 4548 (27%) loss:3.5858 lr:0.98 dt:36ms tok/s:1840606 rem:440s step 4549 (27%) loss:3.5681 lr:0.98 dt:36ms tok/s:1843247 rem:440s step 4550 (27%) loss:3.5601 lr:0.98 dt:36ms tok/s:1843606 rem:440s step 4551 (27%) loss:3.5365 lr:0.98 dt:36ms tok/s:1845029 rem:440s step 4552 (27%) loss:3.5259 lr:0.98 dt:35ms tok/s:1848242 rem:440s step 4553 (27%) loss:3.5069 lr:0.98 dt:37ms tok/s:1774173 rem:440s step 4554 (27%) loss:3.5053 lr:0.98 dt:36ms tok/s:1843297 rem:440s step 4555 (27%) loss:3.4959 lr:0.98 dt:35ms tok/s:1855252 rem:440s step 4556 (27%) loss:3.4776 lr:0.98 dt:35ms tok/s:1847621 rem:440s step 4557 (27%) loss:3.4918 lr:0.98 dt:36ms tok/s:1845375 rem:440s step 4558 (27%) loss:3.4988 lr:0.98 dt:35ms tok/s:1854927 rem:440s step 4559 (27%) loss:3.5121 lr:0.98 dt:35ms tok/s:1853689 rem:440s step 4560 (27%) loss:3.4986 lr:0.98 dt:35ms tok/s:1852764 rem:440s step 4561 (27%) loss:3.4873 lr:0.98 dt:35ms tok/s:1855566 rem:440s step 4562 (27%) loss:3.4919 lr:0.98 dt:35ms tok/s:1854539 rem:440s step 4563 (27%) loss:3.4836 lr:0.98 dt:35ms tok/s:1850120 rem:440s step 4564 (27%) loss:3.4729 lr:0.98 dt:36ms tok/s:1841543 rem:440s step 4565 (27%) loss:3.4725 lr:0.98 dt:35ms tok/s:1847521 rem:440s step 4566 (27%) loss:3.4763 lr:0.98 dt:36ms tok/s:1844533 rem:440s step 4567 (27%) loss:3.4829 lr:0.98 dt:36ms tok/s:1841173 rem:440s step 4568 (27%) loss:3.4699 lr:0.98 dt:35ms tok/s:1848490 rem:440s step 4569 (27%) loss:3.4458 lr:0.98 dt:37ms tok/s:1782791 rem:440s step 4570 (27%) loss:3.4369 lr:0.98 dt:35ms tok/s:1848080 rem:440s step 4571 (27%) loss:3.4348 lr:0.98 dt:35ms tok/s:1856405 rem:440s step 4572 (27%) loss:3.4345 lr:0.98 dt:35ms tok/s:1853764 rem:440s step 4573 (27%) loss:3.4011 lr:0.98 dt:35ms tok/s:1848366 rem:439s step 4574 (27%) loss:3.4026 lr:0.98 dt:35ms tok/s:1849100 rem:439s step 4575 (27%) loss:3.3997 lr:0.98 dt:36ms tok/s:1845958 rem:439s step 4576 (27%) loss:3.4275 lr:0.98 dt:35ms tok/s:1851728 rem:439s step 4577 (27%) loss:3.4327 lr:0.98 dt:36ms tok/s:1842370 rem:439s step 4578 (27%) loss:3.4480 lr:0.98 dt:35ms tok/s:1847981 rem:439s step 4579 (27%) loss:3.4256 lr:0.98 dt:36ms tok/s:1844992 rem:439s step 4580 (27%) loss:3.4262 lr:0.98 dt:36ms tok/s:1836486 rem:439s step 4581 (27%) loss:3.4118 lr:0.98 dt:35ms tok/s:1849336 rem:439s step 4582 (27%) loss:3.3931 lr:0.98 dt:35ms tok/s:1847670 rem:439s step 4583 (27%) loss:3.3875 lr:0.98 dt:35ms tok/s:1851778 rem:439s step 4584 (27%) loss:3.4095 lr:0.98 dt:35ms tok/s:1853889 rem:439s step 4585 (27%) loss:3.4246 lr:0.98 dt:35ms tok/s:1852726 rem:439s step 4586 (27%) loss:3.4292 lr:0.98 dt:35ms tok/s:1850145 rem:439s step 4587 (27%) loss:3.4370 lr:0.98 dt:35ms tok/s:1857647 rem:439s step 4588 (27%) loss:3.4241 lr:0.98 dt:35ms tok/s:1851441 rem:439s step 4589 (27%) loss:3.4319 lr:0.98 dt:35ms tok/s:1846702 rem:439s step 4590 (27%) loss:3.4625 lr:0.98 dt:60ms tok/s:1092828 rem:439s step 4591 (27%) loss:3.4620 lr:0.98 dt:33ms tok/s:1981145 rem:439s step 4592 (27%) loss:3.4768 lr:0.98 dt:33ms tok/s:2011267 rem:439s step 4593 (27%) loss:3.4947 lr:0.98 dt:32ms tok/s:2020611 rem:439s step 4594 (27%) loss:3.5118 lr:0.98 dt:32ms tok/s:2038730 rem:439s step 4595 (27%) loss:3.5067 lr:0.98 dt:33ms tok/s:2005427 rem:439s step 4596 (27%) loss:3.4879 lr:0.98 dt:34ms tok/s:1918212 rem:439s step 4597 (27%) loss:3.4775 lr:0.98 dt:36ms tok/s:1838747 rem:439s step 4598 (27%) loss:3.4679 lr:0.98 dt:33ms tok/s:1984019 rem:439s step 4599 (27%) loss:3.4770 lr:0.98 dt:33ms tok/s:1999621 rem:439s step 4600 (27%) loss:3.4960 lr:0.98 dt:33ms tok/s:1993834 rem:439s + local: attn=[0.060, 0.611, 0.565] mlp=[0.229, 0.161, -0.161] + + transition: attn=[1.917, 0.677] mlp=[-0.052, 0.206] + + hierarchy: attn=[2.254, 5.939, 5.616] mlp=[0.948, -1.156, -2.591] + step 4601 (27%) loss:3.5159 lr:0.98 dt:33ms tok/s:1964284 rem:438s step 4602 (27%) loss:3.5225 lr:0.98 dt:34ms tok/s:1955299 rem:438s step 4603 (27%) loss:3.4860 lr:0.98 dt:33ms tok/s:1961452 rem:438s step 4604 (27%) loss:3.4726 lr:0.98 dt:33ms tok/s:1963919 rem:438s step 4605 (27%) loss:3.4688 lr:0.98 dt:33ms tok/s:1961466 rem:438s step 4606 (27%) loss:3.4422 lr:0.98 dt:34ms tok/s:1955953 rem:438s step 4607 (27%) loss:3.4576 lr:0.98 dt:33ms tok/s:1957722 rem:438s step 4608 (27%) loss:3.4663 lr:0.98 dt:34ms tok/s:1951080 rem:438s step 4609 (27%) loss:3.4701 lr:0.98 dt:34ms tok/s:1950319 rem:438s step 4610 (27%) loss:3.4857 lr:0.98 dt:34ms tok/s:1944923 rem:438s step 4611 (27%) loss:3.5229 lr:0.98 dt:34ms tok/s:1947638 rem:438s step 4612 (27%) loss:3.5448 lr:0.98 dt:34ms tok/s:1933731 rem:438s step 4613 (27%) loss:3.5451 lr:0.98 dt:34ms tok/s:1934833 rem:438s step 4614 (27%) loss:3.5498 lr:0.98 dt:34ms tok/s:1937971 rem:438s step 4615 (27%) loss:3.5454 lr:0.98 dt:34ms tok/s:1919766 rem:438s step 4616 (27%) loss:3.5280 lr:0.98 dt:34ms tok/s:1903363 rem:438s step 4617 (27%) loss:3.5305 lr:0.98 dt:35ms tok/s:1898066 rem:438s step 4618 (27%) loss:3.5133 lr:0.98 dt:35ms tok/s:1897922 rem:438s step 4619 (27%) loss:3.4976 lr:0.98 dt:34ms tok/s:1900205 rem:438s step 4620 (27%) loss:3.4849 lr:0.98 dt:35ms tok/s:1897070 rem:438s step 4621 (27%) loss:3.5107 lr:0.98 dt:35ms tok/s:1871675 rem:438s step 4622 (27%) loss:3.5523 lr:0.98 dt:34ms tok/s:1904193 rem:438s step 4623 (27%) loss:3.5628 lr:0.98 dt:35ms tok/s:1874969 rem:438s step 4624 (27%) loss:3.5630 lr:0.98 dt:35ms tok/s:1894690 rem:438s step 4625 (27%) loss:3.5800 lr:0.98 dt:35ms tok/s:1894077 rem:438s step 4626 (27%) loss:3.5606 lr:0.98 dt:34ms tok/s:1900454 rem:438s step 4627 (27%) loss:3.5572 lr:0.98 dt:34ms tok/s:1902901 rem:438s step 4628 (27%) loss:3.5887 lr:0.98 dt:35ms tok/s:1899233 rem:438s step 4629 (27%) loss:3.5792 lr:0.98 dt:34ms tok/s:1902612 rem:438s step 4630 (27%) loss:3.5821 lr:0.98 dt:35ms tok/s:1894286 rem:437s step 4631 (27%) loss:3.5581 lr:0.98 dt:35ms tok/s:1874738 rem:437s step 4632 (27%) loss:3.5540 lr:0.98 dt:35ms tok/s:1877043 rem:437s step 4633 (27%) loss:3.5361 lr:0.98 dt:35ms tok/s:1867149 rem:437s step 4634 (27%) loss:3.5321 lr:0.98 dt:35ms tok/s:1875263 rem:437s step 4635 (27%) loss:3.5306 lr:0.98 dt:35ms tok/s:1876184 rem:437s step 4636 (27%) loss:3.5330 lr:0.98 dt:35ms tok/s:1876953 rem:437s step 4637 (27%) loss:3.5043 lr:0.98 dt:35ms tok/s:1863440 rem:437s step 4638 (27%) loss:3.4790 lr:0.98 dt:35ms tok/s:1877197 rem:437s step 4639 (27%) loss:3.4471 lr:0.98 dt:35ms tok/s:1882158 rem:437s step 4640 (27%) loss:3.4524 lr:0.98 dt:35ms tok/s:1869765 rem:437s step 4641 (27%) loss:3.4458 lr:0.98 dt:35ms tok/s:1881965 rem:437s step 4642 (27%) loss:3.4598 lr:0.98 dt:35ms tok/s:1882609 rem:437s step 4643 (27%) loss:3.4613 lr:0.98 dt:35ms tok/s:1877120 rem:437s step 4644 (27%) loss:3.4682 lr:0.98 dt:35ms tok/s:1883796 rem:437s step 4645 (27%) loss:3.4614 lr:0.98 dt:35ms tok/s:1875967 rem:437s step 4646 (27%) loss:3.4004 lr:0.98 dt:35ms tok/s:1882841 rem:437s step 4647 (27%) loss:3.3651 lr:0.98 dt:35ms tok/s:1883577 rem:437s step 4648 (27%) loss:3.4096 lr:0.98 dt:35ms tok/s:1880098 rem:437s step 4649 (27%) loss:3.4177 lr:0.98 dt:35ms tok/s:1882171 rem:437s step 4650 (27%) loss:3.4334 lr:0.98 dt:35ms tok/s:1880716 rem:437s step 4651 (27%) loss:3.4459 lr:0.98 dt:35ms tok/s:1886615 rem:437s step 4652 (27%) loss:3.4654 lr:0.98 dt:35ms tok/s:1880446 rem:437s step 4653 (27%) loss:3.4750 lr:0.98 dt:35ms tok/s:1870350 rem:437s step 4654 (27%) loss:3.4754 lr:0.98 dt:37ms tok/s:1755960 rem:437s step 4655 (27%) loss:3.4768 lr:0.98 dt:34ms tok/s:1916100 rem:437s step 4656 (27%) loss:3.4597 lr:0.98 dt:35ms tok/s:1896809 rem:437s step 4657 (27%) loss:3.4631 lr:0.98 dt:35ms tok/s:1888339 rem:437s step 4658 (27%) loss:3.4522 lr:0.98 dt:35ms tok/s:1884558 rem:437s step 4659 (27%) loss:3.4525 lr:0.98 dt:35ms tok/s:1877145 rem:436s step 4660 (27%) loss:3.4556 lr:0.98 dt:35ms tok/s:1866553 rem:436s step 4661 (27%) loss:3.4648 lr:0.98 dt:35ms tok/s:1870847 rem:436s step 4662 (27%) loss:3.4689 lr:0.98 dt:35ms tok/s:1868431 rem:436s step 4663 (27%) loss:3.4414 lr:0.98 dt:36ms tok/s:1808003 rem:436s step 4664 (27%) loss:3.4443 lr:0.98 dt:35ms tok/s:1850369 rem:436s step 4665 (27%) loss:3.4550 lr:0.98 dt:36ms tok/s:1811984 rem:436s step 4666 (27%) loss:3.4523 lr:0.98 dt:36ms tok/s:1798339 rem:436s step 4667 (27%) loss:3.4756 lr:0.98 dt:35ms tok/s:1856668 rem:436s step 4668 (27%) loss:3.4830 lr:0.98 dt:35ms tok/s:1859356 rem:436s step 4669 (27%) loss:3.4766 lr:0.98 dt:35ms tok/s:1851229 rem:436s step 4670 (27%) loss:3.4757 lr:0.98 dt:35ms tok/s:1856380 rem:436s step 4671 (27%) loss:3.4778 lr:0.98 dt:36ms tok/s:1833436 rem:436s step 4672 (27%) loss:3.4735 lr:0.98 dt:36ms tok/s:1828667 rem:436s step 4673 (27%) loss:3.4806 lr:0.98 dt:36ms tok/s:1821336 rem:436s step 4674 (27%) loss:3.4780 lr:0.98 dt:36ms tok/s:1807420 rem:436s step 4675 (27%) loss:3.4710 lr:0.98 dt:37ms tok/s:1794500 rem:436s step 4676 (27%) loss:3.4824 lr:0.98 dt:36ms tok/s:1796130 rem:436s step 4677 (27%) loss:3.4777 lr:0.98 dt:36ms tok/s:1809050 rem:436s step 4678 (27%) loss:3.4803 lr:0.98 dt:35ms tok/s:1859180 rem:436s step 4679 (27%) loss:3.4774 lr:0.98 dt:35ms tok/s:1856643 rem:436s step 4680 (27%) loss:3.4857 lr:0.98 dt:35ms tok/s:1861093 rem:436s step 4681 (27%) loss:3.5106 lr:0.98 dt:36ms tok/s:1842074 rem:436s step 4682 (27%) loss:3.5643 lr:0.98 dt:36ms tok/s:1817387 rem:436s step 4683 (27%) loss:3.5639 lr:0.98 dt:36ms tok/s:1819516 rem:436s step 4684 (27%) loss:3.5640 lr:0.98 dt:36ms tok/s:1815730 rem:436s step 4685 (27%) loss:3.5705 lr:0.98 dt:36ms tok/s:1822665 rem:436s step 4686 (27%) loss:3.5766 lr:0.98 dt:36ms tok/s:1816630 rem:436s step 4687 (27%) loss:3.5657 lr:0.98 dt:36ms tok/s:1827889 rem:435s step 4688 (27%) loss:3.5459 lr:0.98 dt:36ms tok/s:1809622 rem:435s step 4689 (27%) loss:3.5312 lr:0.98 dt:36ms tok/s:1816690 rem:435s step 4690 (27%) loss:3.5255 lr:0.98 dt:36ms tok/s:1818721 rem:435s step 4691 (27%) loss:3.5132 lr:0.98 dt:36ms tok/s:1818505 rem:435s step 4692 (27%) loss:3.5221 lr:0.98 dt:36ms tok/s:1817170 rem:435s step 4693 (27%) loss:3.5495 lr:0.98 dt:36ms tok/s:1814304 rem:435s step 4694 (27%) loss:3.5459 lr:0.98 dt:36ms tok/s:1823414 rem:435s step 4695 (27%) loss:3.5349 lr:0.98 dt:36ms tok/s:1818012 rem:435s step 4696 (27%) loss:3.5289 lr:0.98 dt:39ms tok/s:1682054 rem:435s step 4697 (27%) loss:3.5285 lr:0.98 dt:41ms tok/s:1597329 rem:435s step 4698 (27%) loss:3.5408 lr:0.98 dt:36ms tok/s:1833448 rem:435s step 4699 (27%) loss:3.5178 lr:0.98 dt:37ms tok/s:1790094 rem:435s step 4700 (27%) loss:3.5142 lr:0.98 dt:36ms tok/s:1811888 rem:435s + local: attn=[0.067, 0.590, 0.554] mlp=[0.232, 0.142, -0.164] + + transition: attn=[1.928, 0.715] mlp=[-0.055, 0.210] + + hierarchy: attn=[2.274, 5.939, 5.616] mlp=[0.926, -1.237, -2.881] + step 4701 (27%) loss:3.5073 lr:0.98 dt:36ms tok/s:1816906 rem:435s step 4702 (28%) loss:3.5102 lr:0.98 dt:36ms tok/s:1811792 rem:435s step 4703 (28%) loss:3.5060 lr:0.98 dt:36ms tok/s:1807717 rem:435s step 4704 (28%) loss:3.4980 lr:0.98 dt:37ms tok/s:1793985 rem:435s step 4705 (28%) loss:3.4991 lr:0.98 dt:37ms tok/s:1773349 rem:435s step 4706 (28%) loss:3.4833 lr:0.98 dt:37ms tok/s:1781347 rem:435s step 4707 (28%) loss:3.4830 lr:0.98 dt:36ms tok/s:1818156 rem:435s step 4708 (28%) loss:3.4691 lr:0.98 dt:37ms tok/s:1785397 rem:435s step 4709 (28%) loss:3.4793 lr:0.98 dt:36ms tok/s:1799340 rem:435s step 4710 (28%) loss:3.4817 lr:0.98 dt:36ms tok/s:1806719 rem:435s step 4711 (28%) loss:3.4772 lr:0.98 dt:36ms tok/s:1797140 rem:435s step 4712 (28%) loss:3.4708 lr:0.98 dt:36ms tok/s:1802159 rem:435s step 4713 (28%) loss:3.4737 lr:0.98 dt:37ms tok/s:1778558 rem:435s step 4714 (28%) loss:3.4409 lr:0.98 dt:39ms tok/s:1695751 rem:434s step 4715 (28%) loss:3.4071 lr:0.98 dt:38ms tok/s:1715671 rem:434s step 4716 (28%) loss:3.3864 lr:0.98 dt:37ms tok/s:1768249 rem:434s step 4717 (28%) loss:3.4155 lr:0.98 dt:36ms tok/s:1811900 rem:434s step 4718 (28%) loss:3.4228 lr:0.98 dt:36ms tok/s:1826953 rem:434s step 4719 (28%) loss:3.4319 lr:0.98 dt:36ms tok/s:1833632 rem:434s step 4720 (28%) loss:3.4496 lr:0.98 dt:36ms tok/s:1820154 rem:434s step 4721 (28%) loss:3.4702 lr:0.98 dt:36ms tok/s:1812964 rem:434s step 4722 (28%) loss:3.4881 lr:0.98 dt:35ms tok/s:1847099 rem:434s step 4723 (28%) loss:3.4825 lr:0.98 dt:35ms tok/s:1853601 rem:434s step 4724 (28%) loss:3.4798 lr:0.98 dt:35ms tok/s:1858036 rem:434s step 4725 (28%) loss:3.5152 lr:0.98 dt:39ms tok/s:1681869 rem:434s step 4726 (28%) loss:3.5083 lr:0.98 dt:34ms tok/s:1910121 rem:434s step 4727 (28%) loss:3.5077 lr:0.98 dt:34ms tok/s:1932358 rem:434s step 4728 (28%) loss:3.4945 lr:0.98 dt:35ms tok/s:1874892 rem:434s step 4729 (28%) loss:3.5054 lr:0.98 dt:34ms tok/s:1906319 rem:434s step 4730 (28%) loss:3.5013 lr:0.98 dt:34ms tok/s:1910838 rem:434s step 4731 (28%) loss:3.4905 lr:0.98 dt:34ms tok/s:1907735 rem:434s step 4732 (28%) loss:3.4790 lr:0.98 dt:35ms tok/s:1884261 rem:434s step 4733 (28%) loss:3.5596 lr:0.98 dt:34ms tok/s:1908199 rem:434s step 4734 (28%) loss:3.5398 lr:0.98 dt:35ms tok/s:1860262 rem:434s step 4735 (28%) loss:3.5132 lr:0.98 dt:35ms tok/s:1896233 rem:434s step 4736 (28%) loss:3.5290 lr:0.98 dt:35ms tok/s:1892890 rem:434s step 4737 (28%) loss:3.5202 lr:0.98 dt:34ms tok/s:1916688 rem:434s step 4738 (28%) loss:3.5256 lr:0.98 dt:34ms tok/s:1923689 rem:434s step 4739 (28%) loss:3.5118 lr:0.98 dt:34ms tok/s:1928210 rem:434s step 4740 (28%) loss:3.5292 lr:0.98 dt:34ms tok/s:1923218 rem:434s step 4741 (28%) loss:3.5323 lr:0.98 dt:34ms tok/s:1917169 rem:434s step 4742 (28%) loss:3.5352 lr:0.98 dt:34ms tok/s:1921551 rem:434s step 4743 (28%) loss:3.5291 lr:0.98 dt:34ms tok/s:1921578 rem:433s step 4744 (28%) loss:3.5056 lr:0.98 dt:34ms tok/s:1923461 rem:433s step 4745 (28%) loss:3.5092 lr:0.98 dt:34ms tok/s:1924902 rem:433s step 4746 (28%) loss:3.5144 lr:0.98 dt:34ms tok/s:1925751 rem:433s step 4747 (28%) loss:3.5084 lr:0.98 dt:34ms tok/s:1926453 rem:433s step 4748 (28%) loss:3.5089 lr:0.98 dt:34ms tok/s:1919592 rem:433s step 4749 (28%) loss:3.5138 lr:0.98 dt:34ms tok/s:1912367 rem:433s step 4750 (28%) loss:3.5034 lr:0.98 dt:34ms tok/s:1920209 rem:433s step 4751 (28%) loss:3.4960 lr:0.98 dt:38ms tok/s:1735658 rem:433s step 4752 (28%) loss:3.4870 lr:0.98 dt:34ms tok/s:1936401 rem:433s step 4753 (28%) loss:3.4743 lr:0.98 dt:34ms tok/s:1935842 rem:433s step 4754 (28%) loss:3.4709 lr:0.98 dt:34ms tok/s:1927277 rem:433s step 4755 (28%) loss:3.4798 lr:0.98 dt:34ms tok/s:1905342 rem:433s step 4756 (28%) loss:3.5012 lr:0.98 dt:34ms tok/s:1917383 rem:433s step 4757 (28%) loss:3.5071 lr:0.98 dt:34ms tok/s:1913459 rem:433s step 4758 (28%) loss:3.5743 lr:0.98 dt:34ms tok/s:1907007 rem:433s step 4759 (28%) loss:3.5679 lr:0.98 dt:34ms tok/s:1908729 rem:433s step 4760 (28%) loss:3.5652 lr:0.98 dt:34ms tok/s:1905923 rem:433s step 4761 (28%) loss:3.5465 lr:0.98 dt:35ms tok/s:1883461 rem:433s step 4762 (28%) loss:3.5605 lr:0.98 dt:35ms tok/s:1899561 rem:433s step 4763 (28%) loss:3.5663 lr:0.98 dt:34ms tok/s:1906729 rem:433s step 4764 (28%) loss:3.5545 lr:0.98 dt:35ms tok/s:1896900 rem:433s step 4765 (28%) loss:3.5305 lr:0.98 dt:34ms tok/s:1902164 rem:433s step 4766 (28%) loss:3.5217 lr:0.98 dt:34ms tok/s:1900218 rem:433s step 4767 (28%) loss:3.5217 lr:0.98 dt:35ms tok/s:1898420 rem:433s step 4768 (28%) loss:3.5262 lr:0.98 dt:35ms tok/s:1899522 rem:433s step 4769 (28%) loss:3.5181 lr:0.98 dt:35ms tok/s:1885140 rem:433s step 4770 (28%) loss:3.5192 lr:0.98 dt:35ms tok/s:1889975 rem:433s step 4771 (28%) loss:3.5141 lr:0.98 dt:35ms tok/s:1880394 rem:433s step 4772 (28%) loss:3.4928 lr:0.98 dt:35ms tok/s:1884080 rem:432s step 4773 (28%) loss:3.4979 lr:0.98 dt:35ms tok/s:1877941 rem:432s step 4774 (28%) loss:3.4937 lr:0.98 dt:35ms tok/s:1888533 rem:432s step 4775 (28%) loss:3.4784 lr:0.98 dt:35ms tok/s:1877248 rem:432s step 4776 (28%) loss:3.4861 lr:0.98 dt:36ms tok/s:1799823 rem:432s step 4777 (28%) loss:3.4990 lr:0.98 dt:35ms tok/s:1876095 rem:432s step 4778 (28%) loss:3.4921 lr:0.98 dt:35ms tok/s:1891145 rem:432s step 4779 (28%) loss:3.4810 lr:0.98 dt:35ms tok/s:1883706 rem:432s step 4780 (28%) loss:3.4812 lr:0.98 dt:35ms tok/s:1866413 rem:432s step 4781 (28%) loss:3.4928 lr:0.98 dt:35ms tok/s:1855540 rem:432s step 4782 (28%) loss:3.4896 lr:0.98 dt:35ms tok/s:1883603 rem:432s step 4783 (28%) loss:3.4802 lr:0.98 dt:35ms tok/s:1856969 rem:432s step 4784 (28%) loss:3.4755 lr:0.98 dt:36ms tok/s:1843853 rem:432s step 4785 (28%) loss:3.4545 lr:0.98 dt:35ms tok/s:1858652 rem:432s step 4786 (28%) loss:3.4452 lr:0.98 dt:35ms tok/s:1865147 rem:432s step 4787 (28%) loss:3.4476 lr:0.98 dt:35ms tok/s:1862316 rem:432s step 4788 (28%) loss:3.4455 lr:0.98 dt:36ms tok/s:1838193 rem:432s step 4789 (28%) loss:3.4470 lr:0.98 dt:35ms tok/s:1856969 rem:432s step 4790 (28%) loss:3.4623 lr:0.98 dt:35ms tok/s:1861661 rem:432s step 4791 (28%) loss:3.4907 lr:0.98 dt:36ms tok/s:1828497 rem:432s step 4792 (28%) loss:3.4923 lr:0.98 dt:36ms tok/s:1830786 rem:432s step 4793 (28%) loss:3.4875 lr:0.98 dt:35ms tok/s:1846342 rem:432s step 4794 (28%) loss:3.5011 lr:0.98 dt:39ms tok/s:1661807 rem:432s step 4795 (28%) loss:3.4972 lr:0.98 dt:35ms tok/s:1854576 rem:432s step 4796 (28%) loss:3.5028 lr:0.98 dt:35ms tok/s:1879379 rem:432s step 4797 (28%) loss:3.4951 lr:0.98 dt:35ms tok/s:1871140 rem:432s step 4798 (28%) loss:3.4917 lr:0.98 dt:35ms tok/s:1860955 rem:432s step 4799 (28%) loss:3.4767 lr:0.98 dt:35ms tok/s:1853476 rem:432s step 4800 (28%) loss:3.4659 lr:0.98 dt:35ms tok/s:1855966 rem:431s + local: attn=[0.066, 0.623, 0.561] mlp=[0.243, 0.144, -0.117] + + transition: attn=[1.971, 0.726] mlp=[-0.055, 0.209] + + hierarchy: attn=[2.373, 5.939, 5.616] mlp=[1.011, -1.509, -3.342] + step 4801 (28%) loss:3.4582 lr:0.97 dt:35ms tok/s:1854727 rem:431s step 4802 (28%) loss:3.4661 lr:0.97 dt:35ms tok/s:1858464 rem:431s step 4803 (28%) loss:3.4532 lr:0.97 dt:35ms tok/s:1856242 rem:431s step 4804 (28%) loss:3.4518 lr:0.97 dt:35ms tok/s:1856255 rem:431s step 4805 (28%) loss:3.4455 lr:0.97 dt:35ms tok/s:1864666 rem:431s step 4806 (28%) loss:3.4740 lr:0.97 dt:35ms tok/s:1855603 rem:431s step 4807 (28%) loss:3.4698 lr:0.97 dt:35ms tok/s:1857271 rem:431s step 4808 (28%) loss:3.4803 lr:0.97 dt:35ms tok/s:1857861 rem:431s step 4809 (28%) loss:3.4986 lr:0.97 dt:36ms tok/s:1831323 rem:431s step 4810 (28%) loss:3.4923 lr:0.97 dt:37ms tok/s:1774677 rem:431s step 4811 (28%) loss:3.4806 lr:0.97 dt:36ms tok/s:1834244 rem:431s step 4812 (28%) loss:3.4769 lr:0.97 dt:36ms tok/s:1835419 rem:431s step 4813 (28%) loss:3.4769 lr:0.97 dt:36ms tok/s:1841876 rem:431s step 4814 (28%) loss:3.4746 lr:0.97 dt:36ms tok/s:1841851 rem:431s step 4815 (28%) loss:3.4606 lr:0.97 dt:36ms tok/s:1839362 rem:431s step 4816 (28%) loss:3.4769 lr:0.97 dt:36ms tok/s:1843890 rem:431s step 4817 (28%) loss:3.4756 lr:0.97 dt:36ms tok/s:1836818 rem:431s step 4818 (28%) loss:3.4688 lr:0.97 dt:36ms tok/s:1828655 rem:431s step 4819 (28%) loss:3.4738 lr:0.97 dt:36ms tok/s:1838894 rem:431s step 4820 (28%) loss:3.4593 lr:0.97 dt:36ms tok/s:1836376 rem:431s step 4821 (28%) loss:3.4508 lr:0.97 dt:36ms tok/s:1834513 rem:431s step 4822 (28%) loss:3.4331 lr:0.97 dt:36ms tok/s:1838808 rem:431s step 4823 (28%) loss:3.4284 lr:0.97 dt:36ms tok/s:1831701 rem:431s step 4824 (28%) loss:3.4430 lr:0.97 dt:36ms tok/s:1832263 rem:431s step 4825 (28%) loss:3.4556 lr:0.97 dt:36ms tok/s:1838095 rem:431s step 4826 (28%) loss:3.4538 lr:0.97 dt:37ms tok/s:1781717 rem:431s step 4827 (28%) loss:3.4535 lr:0.97 dt:36ms tok/s:1833705 rem:431s step 4828 (28%) loss:3.4514 lr:0.97 dt:36ms tok/s:1837825 rem:430s step 4829 (28%) loss:3.4416 lr:0.97 dt:36ms tok/s:1835603 rem:430s step 4830 (28%) loss:3.4436 lr:0.97 dt:36ms tok/s:1841321 rem:430s step 4831 (28%) loss:3.4430 lr:0.97 dt:36ms tok/s:1825594 rem:430s step 4832 (28%) loss:3.4440 lr:0.97 dt:36ms tok/s:1835309 rem:430s step 4833 (28%) loss:3.4327 lr:0.97 dt:36ms tok/s:1835934 rem:430s step 4834 (28%) loss:3.4175 lr:0.97 dt:36ms tok/s:1826189 rem:430s step 4835 (28%) loss:3.4217 lr:0.97 dt:36ms tok/s:1831091 rem:430s step 4836 (28%) loss:3.4240 lr:0.97 dt:36ms tok/s:1836413 rem:430s step 4837 (28%) loss:3.4241 lr:0.97 dt:36ms tok/s:1835113 rem:430s step 4838 (28%) loss:3.4411 lr:0.97 dt:36ms tok/s:1828156 rem:430s step 4839 (28%) loss:3.4521 lr:0.97 dt:36ms tok/s:1837309 rem:430s step 4840 (28%) loss:3.4560 lr:0.97 dt:36ms tok/s:1834195 rem:430s step 4841 (28%) loss:3.4575 lr:0.97 dt:36ms tok/s:1842012 rem:430s step 4842 (28%) loss:3.4914 lr:0.97 dt:36ms tok/s:1839842 rem:430s step 4843 (28%) loss:3.4902 lr:0.97 dt:36ms tok/s:1808122 rem:430s step 4844 (28%) loss:3.4888 lr:0.97 dt:36ms tok/s:1802348 rem:430s step 4845 (28%) loss:3.4917 lr:0.97 dt:36ms tok/s:1828728 rem:430s step 4846 (28%) loss:3.4923 lr:0.97 dt:36ms tok/s:1839940 rem:430s step 4847 (28%) loss:3.4794 lr:0.97 dt:35ms tok/s:1861308 rem:430s step 4848 (28%) loss:3.4739 lr:0.97 dt:35ms tok/s:1850731 rem:430s step 4849 (28%) loss:3.4584 lr:0.97 dt:35ms tok/s:1847248 rem:430s step 4850 (28%) loss:3.4216 lr:0.97 dt:35ms tok/s:1862216 rem:430s step 4851 (28%) loss:3.4332 lr:0.97 dt:36ms tok/s:1830445 rem:430s step 4852 (28%) loss:3.4092 lr:0.97 dt:36ms tok/s:1835799 rem:430s step 4853 (28%) loss:3.3952 lr:0.97 dt:36ms tok/s:1839694 rem:430s step 4854 (28%) loss:3.2865 lr:0.97 dt:36ms tok/s:1832495 rem:430s step 4855 (28%) loss:3.3330 lr:0.97 dt:36ms tok/s:1828618 rem:430s step 4856 (28%) loss:3.3569 lr:0.97 dt:36ms tok/s:1827670 rem:429s step 4857 (28%) loss:3.3813 lr:0.97 dt:36ms tok/s:1837800 rem:429s step 4858 (28%) loss:3.4219 lr:0.97 dt:36ms tok/s:1831116 rem:429s step 4859 (28%) loss:3.4379 lr:0.97 dt:36ms tok/s:1830823 rem:429s step 4860 (28%) loss:3.5048 lr:0.97 dt:36ms tok/s:1834562 rem:429s step 4861 (28%) loss:3.5106 lr:0.97 dt:36ms tok/s:1839559 rem:429s step 4862 (28%) loss:3.5394 lr:0.97 dt:36ms tok/s:1836081 rem:429s step 4863 (28%) loss:3.5274 lr:0.97 dt:36ms tok/s:1836744 rem:429s step 4864 (28%) loss:3.5330 lr:0.97 dt:36ms tok/s:1833705 rem:429s step 4865 (28%) loss:3.5208 lr:0.97 dt:36ms tok/s:1830908 rem:429s step 4866 (28%) loss:3.5231 lr:0.97 dt:42ms tok/s:1574347 rem:429s step 4867 (28%) loss:3.5315 lr:0.97 dt:35ms tok/s:1847919 rem:429s step 4868 (28%) loss:3.5255 lr:0.97 dt:34ms tok/s:1904048 rem:429s step 4869 (28%) loss:3.5020 lr:0.97 dt:35ms tok/s:1883861 rem:429s step 4870 (28%) loss:3.4744 lr:0.97 dt:35ms tok/s:1881179 rem:429s step 4871 (29%) loss:3.4661 lr:0.97 dt:35ms tok/s:1871025 rem:429s step 4872 (29%) loss:3.4656 lr:0.97 dt:35ms tok/s:1881514 rem:429s step 4873 (29%) loss:3.4570 lr:0.97 dt:35ms tok/s:1879032 rem:429s step 4874 (29%) loss:3.4522 lr:0.97 dt:35ms tok/s:1862948 rem:429s step 4875 (29%) loss:3.4522 lr:0.97 dt:35ms tok/s:1864932 rem:429s step 4876 (29%) loss:3.4553 lr:0.97 dt:35ms tok/s:1859369 rem:429s step 4877 (29%) loss:3.4669 lr:0.97 dt:35ms tok/s:1858815 rem:429s step 4878 (29%) loss:3.4808 lr:0.97 dt:36ms tok/s:1840914 rem:429s step 4879 (29%) loss:3.4906 lr:0.97 dt:36ms tok/s:1833289 rem:429s step 4880 (29%) loss:3.4721 lr:0.97 dt:36ms tok/s:1832849 rem:429s step 4881 (29%) loss:3.4601 lr:0.97 dt:36ms tok/s:1829044 rem:429s step 4882 (29%) loss:3.4436 lr:0.97 dt:36ms tok/s:1833289 rem:429s step 4883 (29%) loss:3.4512 lr:0.97 dt:42ms tok/s:1558309 rem:429s step 4884 (29%) loss:3.4710 lr:0.97 dt:35ms tok/s:1859557 rem:428s step 4885 (29%) loss:3.4599 lr:0.97 dt:35ms tok/s:1858489 rem:428s step 4886 (29%) loss:3.4523 lr:0.97 dt:35ms tok/s:1873525 rem:428s step 4887 (29%) loss:3.4634 lr:0.97 dt:35ms tok/s:1883874 rem:428s step 4888 (29%) loss:3.4558 lr:0.97 dt:35ms tok/s:1882120 rem:428s step 4889 (29%) loss:3.4529 lr:0.97 dt:35ms tok/s:1876992 rem:428s step 4890 (29%) loss:3.4462 lr:0.97 dt:35ms tok/s:1864919 rem:428s step 4891 (29%) loss:3.4501 lr:0.97 dt:35ms tok/s:1862682 rem:428s step 4892 (29%) loss:3.4553 lr:0.97 dt:35ms tok/s:1850581 rem:428s step 4893 (29%) loss:3.4387 lr:0.97 dt:36ms tok/s:1843383 rem:428s step 4894 (29%) loss:3.4536 lr:0.97 dt:35ms tok/s:1855916 rem:428s step 4895 (29%) loss:3.4398 lr:0.97 dt:36ms tok/s:1835444 rem:428s step 4896 (29%) loss:3.4216 lr:0.97 dt:36ms tok/s:1838144 rem:428s step 4897 (29%) loss:3.4337 lr:0.97 dt:36ms tok/s:1840507 rem:428s step 4898 (29%) loss:3.4318 lr:0.97 dt:36ms tok/s:1832348 rem:428s step 4899 (29%) loss:3.4364 lr:0.97 dt:36ms tok/s:1832654 rem:428s step 4900 (29%) loss:3.4484 lr:0.97 dt:36ms tok/s:1828886 rem:428s + local: attn=[0.061, 0.664, 0.582] mlp=[0.258, 0.171, -0.150] + + transition: attn=[1.933, 0.675] mlp=[-0.062, 0.217] + + hierarchy: attn=[2.377, 5.939, 5.616] mlp=[1.015, -1.563, -3.602] + step 4901 (29%) loss:3.4576 lr:0.97 dt:36ms tok/s:1838304 rem:428s step 4902 (29%) loss:3.4465 lr:0.97 dt:36ms tok/s:1843259 rem:428s step 4903 (29%) loss:3.4337 lr:0.97 dt:36ms tok/s:1842778 rem:428s step 4904 (29%) loss:3.4391 lr:0.97 dt:36ms tok/s:1835456 rem:428s step 4905 (29%) loss:3.4346 lr:0.97 dt:36ms tok/s:1828302 rem:428s step 4906 (29%) loss:3.4626 lr:0.97 dt:36ms tok/s:1832312 rem:428s step 4907 (29%) loss:3.4634 lr:0.97 dt:36ms tok/s:1830445 rem:428s step 4908 (29%) loss:3.4659 lr:0.97 dt:36ms tok/s:1833289 rem:428s step 4909 (29%) loss:3.4589 lr:0.97 dt:36ms tok/s:1833522 rem:428s step 4910 (29%) loss:3.4401 lr:0.97 dt:36ms tok/s:1834770 rem:428s step 4911 (29%) loss:3.4436 lr:0.97 dt:36ms tok/s:1838230 rem:428s step 4912 (29%) loss:3.4802 lr:0.97 dt:36ms tok/s:1833253 rem:427s step 4913 (29%) loss:3.4884 lr:0.97 dt:36ms tok/s:1831347 rem:427s step 4914 (29%) loss:3.4753 lr:0.97 dt:36ms tok/s:1833705 rem:427s step 4915 (29%) loss:3.4617 lr:0.97 dt:36ms tok/s:1828545 rem:427s step 4916 (29%) loss:3.4461 lr:0.97 dt:36ms tok/s:1836032 rem:427s step 4917 (29%) loss:3.4481 lr:0.97 dt:36ms tok/s:1836768 rem:427s step 4918 (29%) loss:3.4488 lr:0.97 dt:36ms tok/s:1836916 rem:427s step 4919 (29%) loss:3.4563 lr:0.97 dt:36ms tok/s:1839965 rem:427s step 4920 (29%) loss:3.4255 lr:0.97 dt:36ms tok/s:1830591 rem:427s step 4921 (29%) loss:3.4266 lr:0.97 dt:36ms tok/s:1839276 rem:427s step 4922 (29%) loss:3.4466 lr:0.97 dt:36ms tok/s:1833228 rem:427s step 4923 (29%) loss:3.4595 lr:0.97 dt:36ms tok/s:1838132 rem:427s step 4924 (29%) loss:3.4761 lr:0.97 dt:36ms tok/s:1830725 rem:427s step 4925 (29%) loss:3.4733 lr:0.97 dt:36ms tok/s:1829324 rem:427s step 4926 (29%) loss:3.4721 lr:0.97 dt:36ms tok/s:1841827 rem:427s step 4927 (29%) loss:3.4732 lr:0.97 dt:36ms tok/s:1839891 rem:427s step 4928 (29%) loss:3.5005 lr:0.97 dt:36ms tok/s:1839066 rem:427s step 4929 (29%) loss:3.5014 lr:0.97 dt:36ms tok/s:1832910 rem:427s step 4930 (29%) loss:3.5001 lr:0.97 dt:36ms tok/s:1834648 rem:427s step 4931 (29%) loss:3.4821 lr:0.97 dt:36ms tok/s:1811769 rem:427s step 4932 (29%) loss:3.4799 lr:0.97 dt:36ms tok/s:1838316 rem:427s step 4933 (29%) loss:3.4767 lr:0.97 dt:36ms tok/s:1831738 rem:427s step 4934 (29%) loss:3.4597 lr:0.97 dt:36ms tok/s:1835824 rem:427s step 4935 (29%) loss:3.4482 lr:0.97 dt:36ms tok/s:1832849 rem:427s step 4936 (29%) loss:3.4510 lr:0.97 dt:37ms tok/s:1772914 rem:427s step 4937 (29%) loss:3.4421 lr:0.97 dt:36ms tok/s:1836609 rem:427s step 4938 (29%) loss:3.4528 lr:0.97 dt:36ms tok/s:1811769 rem:427s step 4939 (29%) loss:3.4617 lr:0.97 dt:37ms tok/s:1788161 rem:427s step 4940 (29%) loss:3.4569 lr:0.97 dt:36ms tok/s:1808907 rem:426s step 4941 (29%) loss:3.4606 lr:0.97 dt:36ms tok/s:1812402 rem:426s step 4942 (29%) loss:3.4687 lr:0.97 dt:36ms tok/s:1808931 rem:426s step 4943 (29%) loss:3.4595 lr:0.97 dt:36ms tok/s:1807254 rem:426s step 4944 (29%) loss:3.4484 lr:0.97 dt:36ms tok/s:1799186 rem:426s step 4945 (29%) loss:3.4402 lr:0.97 dt:36ms tok/s:1823850 rem:426s step 4946 (29%) loss:3.4369 lr:0.97 dt:36ms tok/s:1816510 rem:426s step 4947 (29%) loss:3.4420 lr:0.97 dt:36ms tok/s:1812450 rem:426s step 4948 (29%) loss:3.4246 lr:0.97 dt:36ms tok/s:1806209 rem:426s step 4949 (29%) loss:3.4372 lr:0.97 dt:36ms tok/s:1802324 rem:426s step 4950 (29%) loss:3.4528 lr:0.97 dt:36ms tok/s:1815298 rem:426s step 4951 (29%) loss:3.4776 lr:0.97 dt:36ms tok/s:1808146 rem:426s step 4952 (29%) loss:3.5387 lr:0.97 dt:36ms tok/s:1814603 rem:426s step 4953 (29%) loss:3.5102 lr:0.97 dt:36ms tok/s:1818505 rem:426s step 4954 (29%) loss:3.4899 lr:0.97 dt:36ms tok/s:1807397 rem:426s step 4955 (29%) loss:3.4746 lr:0.97 dt:36ms tok/s:1808657 rem:426s step 4956 (29%) loss:3.4788 lr:0.97 dt:36ms tok/s:1807765 rem:426s step 4957 (29%) loss:3.4580 lr:0.97 dt:36ms tok/s:1807729 rem:426s step 4958 (29%) loss:3.4569 lr:0.97 dt:36ms tok/s:1814735 rem:426s step 4959 (29%) loss:3.4660 lr:0.97 dt:36ms tok/s:1811267 rem:426s step 4960 (29%) loss:3.4656 lr:0.97 dt:36ms tok/s:1814531 rem:426s step 4961 (29%) loss:3.4850 lr:0.97 dt:36ms tok/s:1811996 rem:426s step 4962 (29%) loss:3.4941 lr:0.97 dt:36ms tok/s:1804549 rem:426s step 4963 (29%) loss:3.4889 lr:0.97 dt:36ms tok/s:1829154 rem:426s step 4964 (29%) loss:3.4813 lr:0.97 dt:36ms tok/s:1833216 rem:426s step 4965 (29%) loss:3.4834 lr:0.97 dt:36ms tok/s:1828339 rem:426s step 4966 (29%) loss:3.5000 lr:0.97 dt:36ms tok/s:1832593 rem:426s step 4967 (29%) loss:3.4959 lr:0.97 dt:36ms tok/s:1834672 rem:426s step 4968 (29%) loss:3.4975 lr:0.97 dt:36ms tok/s:1826104 rem:425s step 4969 (29%) loss:3.4774 lr:0.97 dt:36ms tok/s:1836621 rem:425s step 4970 (29%) loss:3.4623 lr:0.97 dt:36ms tok/s:1827731 rem:425s step 4971 (29%) loss:3.5019 lr:0.97 dt:36ms tok/s:1835677 rem:425s step 4972 (29%) loss:3.4834 lr:0.97 dt:36ms tok/s:1841247 rem:425s step 4973 (29%) loss:3.4765 lr:0.97 dt:36ms tok/s:1834819 rem:425s step 4974 (29%) loss:3.4733 lr:0.97 dt:36ms tok/s:1833240 rem:425s step 4975 (29%) loss:3.4806 lr:0.97 dt:36ms tok/s:1833754 rem:425s step 4976 (29%) loss:3.5023 lr:0.97 dt:36ms tok/s:1838587 rem:425s step 4977 (29%) loss:3.5114 lr:0.97 dt:36ms tok/s:1837800 rem:425s step 4978 (29%) loss:3.4998 lr:0.97 dt:36ms tok/s:1824903 rem:425s step 4979 (29%) loss:3.5227 lr:0.97 dt:36ms tok/s:1801155 rem:425s step 4980 (29%) loss:3.5251 lr:0.97 dt:36ms tok/s:1830872 rem:425s step 4981 (29%) loss:3.5284 lr:0.97 dt:36ms tok/s:1830006 rem:425s step 4982 (29%) loss:3.5243 lr:0.97 dt:36ms tok/s:1828071 rem:425s step 4983 (29%) loss:3.5251 lr:0.97 dt:36ms tok/s:1834084 rem:425s step 4984 (29%) loss:3.5126 lr:0.97 dt:36ms tok/s:1831213 rem:425s step 4985 (29%) loss:3.5053 lr:0.97 dt:36ms tok/s:1829300 rem:425s step 4986 (29%) loss:3.5175 lr:0.97 dt:36ms tok/s:1828667 rem:425s step 4987 (29%) loss:3.5092 lr:0.97 dt:36ms tok/s:1835395 rem:425s step 4988 (29%) loss:3.5030 lr:0.97 dt:36ms tok/s:1834905 rem:425s step 4989 (29%) loss:3.4932 lr:0.97 dt:36ms tok/s:1834525 rem:425s step 4990 (29%) loss:3.4587 lr:0.97 dt:36ms tok/s:1830701 rem:425s step 4991 (29%) loss:3.4620 lr:0.97 dt:36ms tok/s:1836327 rem:425s step 4992 (29%) loss:3.4516 lr:0.97 dt:37ms tok/s:1776868 rem:425s step 4993 (29%) loss:3.4735 lr:0.97 dt:36ms tok/s:1844335 rem:425s step 4994 (29%) loss:3.4540 lr:0.97 dt:35ms tok/s:1852152 rem:425s step 4995 (29%) loss:3.4599 lr:0.97 dt:35ms tok/s:1853789 rem:425s step 4996 (29%) loss:3.4238 lr:0.97 dt:35ms tok/s:1861976 rem:424s step 4997 (29%) loss:3.3678 lr:0.97 dt:35ms tok/s:1854264 rem:424s step 4998 (29%) loss:3.3071 lr:0.97 dt:36ms tok/s:1842123 rem:424s step 4999 (29%) loss:3.3219 lr:0.97 dt:36ms tok/s:1839313 rem:424s step 5000 (29%) loss:3.3260 lr:0.97 dt:35ms tok/s:1850569 rem:424s + local: attn=[0.058, 0.658, 0.590] mlp=[0.260, 0.153, -0.149] + + transition: attn=[1.936, 0.697] mlp=[-0.060, 0.222] + + hierarchy: attn=[2.369, 5.939, 5.616] mlp=[1.004, -1.599, -3.749] + step 5001 (29%) loss:3.3740 lr:0.97 dt:35ms tok/s:1849859 rem:424s step 5002 (29%) loss:3.3985 lr:0.97 dt:35ms tok/s:1851491 rem:424s step 5003 (29%) loss:3.4018 lr:0.97 dt:35ms tok/s:1854551 rem:424s step 5004 (29%) loss:3.4185 lr:0.97 dt:36ms tok/s:1844670 rem:424s step 5005 (29%) loss:3.4313 lr:0.97 dt:36ms tok/s:1839103 rem:424s step 5006 (29%) loss:3.4349 lr:0.97 dt:36ms tok/s:1843198 rem:424s step 5007 (29%) loss:3.4368 lr:0.97 dt:36ms tok/s:1843198 rem:424s step 5008 (29%) loss:3.4288 lr:0.97 dt:36ms tok/s:1842185 rem:424s step 5009 (29%) loss:3.4281 lr:0.97 dt:36ms tok/s:1840729 rem:424s step 5010 (29%) loss:3.4300 lr:0.97 dt:36ms tok/s:1844447 rem:424s step 5011 (29%) loss:3.4497 lr:0.97 dt:35ms tok/s:1846752 rem:424s step 5012 (29%) loss:3.4323 lr:0.97 dt:35ms tok/s:1847968 rem:424s step 5013 (29%) loss:3.4360 lr:0.97 dt:36ms tok/s:1841864 rem:424s step 5014 (29%) loss:3.4485 lr:0.97 dt:35ms tok/s:1851803 rem:424s step 5015 (29%) loss:3.4418 lr:0.97 dt:35ms tok/s:1857534 rem:424s step 5016 (29%) loss:3.4565 lr:0.97 dt:39ms tok/s:1673839 rem:424s step 5017 (29%) loss:3.4622 lr:0.97 dt:36ms tok/s:1832397 rem:424s step 5018 (29%) loss:3.4341 lr:0.97 dt:35ms tok/s:1856305 rem:424s step 5019 (29%) loss:3.4524 lr:0.97 dt:35ms tok/s:1863049 rem:424s step 5020 (29%) loss:3.4571 lr:0.97 dt:35ms tok/s:1864097 rem:424s step 5021 (29%) loss:3.4492 lr:0.97 dt:35ms tok/s:1852190 rem:424s step 5022 (29%) loss:3.4532 lr:0.97 dt:35ms tok/s:1852040 rem:424s step 5023 (29%) loss:3.5102 lr:0.97 dt:35ms tok/s:1867390 rem:424s step 5024 (29%) loss:3.5625 lr:0.97 dt:35ms tok/s:1861345 rem:423s step 5025 (29%) loss:3.5659 lr:0.97 dt:36ms tok/s:1843692 rem:423s step 5026 (29%) loss:3.5641 lr:0.97 dt:35ms tok/s:1872593 rem:423s step 5027 (29%) loss:3.5470 lr:0.97 dt:35ms tok/s:1866794 rem:423s step 5028 (29%) loss:3.5241 lr:0.97 dt:35ms tok/s:1862026 rem:423s step 5029 (29%) loss:3.4925 lr:0.97 dt:35ms tok/s:1868964 rem:423s step 5030 (29%) loss:3.4615 lr:0.97 dt:35ms tok/s:1865033 rem:423s step 5031 (29%) loss:3.4450 lr:0.97 dt:35ms tok/s:1862581 rem:423s step 5032 (29%) loss:3.4256 lr:0.97 dt:35ms tok/s:1857107 rem:423s step 5033 (29%) loss:3.4315 lr:0.97 dt:35ms tok/s:1860589 rem:423s step 5034 (29%) loss:3.4560 lr:0.97 dt:35ms tok/s:1860413 rem:423s step 5035 (29%) loss:3.4622 lr:0.97 dt:36ms tok/s:1831396 rem:423s step 5036 (29%) loss:3.4690 lr:0.97 dt:39ms tok/s:1686875 rem:423s step 5037 (29%) loss:3.4650 lr:0.97 dt:35ms tok/s:1871038 rem:423s step 5038 (29%) loss:3.4439 lr:0.97 dt:34ms tok/s:1908821 rem:423s step 5039 (30%) loss:3.4534 lr:0.97 dt:35ms tok/s:1892747 rem:423s step 5040 (30%) loss:3.4485 lr:0.97 dt:35ms tok/s:1880008 rem:423s step 5041 (30%) loss:3.4651 lr:0.97 dt:35ms tok/s:1893216 rem:423s step 5042 (30%) loss:3.5035 lr:0.97 dt:35ms tok/s:1879597 rem:423s step 5043 (30%) loss:3.4935 lr:0.97 dt:35ms tok/s:1869867 rem:423s step 5044 (30%) loss:3.5130 lr:0.97 dt:35ms tok/s:1866350 rem:423s step 5045 (30%) loss:3.5408 lr:0.97 dt:35ms tok/s:1875928 rem:423s step 5046 (30%) loss:3.5431 lr:0.97 dt:35ms tok/s:1868507 rem:423s step 5047 (30%) loss:3.5335 lr:0.97 dt:35ms tok/s:1867656 rem:423s step 5048 (30%) loss:3.5231 lr:0.97 dt:35ms tok/s:1866629 rem:423s step 5049 (30%) loss:3.5353 lr:0.97 dt:42ms tok/s:1551536 rem:423s step 5050 (30%) loss:3.5257 lr:0.97 dt:35ms tok/s:1867644 rem:423s step 5051 (30%) loss:3.5266 lr:0.97 dt:34ms tok/s:1920907 rem:423s step 5052 (30%) loss:3.5355 lr:0.97 dt:34ms tok/s:1916674 rem:422s step 5053 (30%) loss:3.5173 lr:0.96 dt:34ms tok/s:1913539 rem:422s step 5054 (30%) loss:3.4892 lr:0.96 dt:34ms tok/s:1916447 rem:422s step 5055 (30%) loss:3.4945 lr:0.96 dt:34ms tok/s:1923407 rem:422s step 5056 (30%) loss:3.4690 lr:0.96 dt:34ms tok/s:1914378 rem:422s step 5057 (30%) loss:3.4124 lr:0.96 dt:34ms tok/s:1926912 rem:422s step 5058 (30%) loss:3.4245 lr:0.96 dt:34ms tok/s:1910971 rem:422s step 5059 (30%) loss:3.4172 lr:0.96 dt:34ms tok/s:1920249 rem:422s step 5060 (30%) loss:3.4427 lr:0.96 dt:34ms tok/s:1915032 rem:422s step 5061 (30%) loss:3.4454 lr:0.96 dt:34ms tok/s:1918306 rem:422s step 5062 (30%) loss:3.4434 lr:0.96 dt:34ms tok/s:1919364 rem:422s step 5063 (30%) loss:3.4503 lr:0.96 dt:34ms tok/s:1921780 rem:422s step 5064 (30%) loss:3.4503 lr:0.96 dt:36ms tok/s:1804786 rem:422s step 5065 (30%) loss:3.4553 lr:0.96 dt:34ms tok/s:1921914 rem:422s step 5066 (30%) loss:3.4420 lr:0.96 dt:34ms tok/s:1925603 rem:422s step 5067 (30%) loss:3.4372 lr:0.96 dt:34ms tok/s:1924201 rem:422s step 5068 (30%) loss:3.4332 lr:0.96 dt:34ms tok/s:1905328 rem:422s step 5069 (30%) loss:3.4368 lr:0.96 dt:34ms tok/s:1915539 rem:422s step 5070 (30%) loss:3.4448 lr:0.96 dt:34ms tok/s:1916808 rem:422s step 5071 (30%) loss:3.4574 lr:0.96 dt:34ms tok/s:1917904 rem:422s step 5072 (30%) loss:3.4531 lr:0.96 dt:34ms tok/s:1916380 rem:422s step 5073 (30%) loss:3.4345 lr:0.96 dt:34ms tok/s:1919954 rem:422s step 5074 (30%) loss:3.4119 lr:0.96 dt:35ms tok/s:1882674 rem:422s step 5075 (30%) loss:3.3980 lr:0.96 dt:34ms tok/s:1902098 rem:422s step 5076 (30%) loss:3.3706 lr:0.96 dt:34ms tok/s:1911609 rem:422s step 5077 (30%) loss:3.3447 lr:0.96 dt:34ms tok/s:1902493 rem:422s step 5078 (30%) loss:3.3147 lr:0.96 dt:35ms tok/s:1894521 rem:422s step 5079 (30%) loss:3.2708 lr:0.96 dt:35ms tok/s:1897293 rem:422s step 5080 (30%) loss:3.2529 lr:0.96 dt:35ms tok/s:1891171 rem:422s step 5081 (30%) loss:3.2251 lr:0.96 dt:35ms tok/s:1896154 rem:421s step 5082 (30%) loss:3.2087 lr:0.96 dt:34ms tok/s:1900914 rem:421s step 5083 (30%) loss:3.1855 lr:0.96 dt:34ms tok/s:1901388 rem:421s step 5084 (30%) loss:3.1690 lr:0.96 dt:35ms tok/s:1894586 rem:421s step 5085 (30%) loss:3.2112 lr:0.96 dt:35ms tok/s:1874918 rem:421s step 5086 (30%) loss:3.2673 lr:0.96 dt:35ms tok/s:1891952 rem:421s step 5087 (30%) loss:3.3484 lr:0.96 dt:35ms tok/s:1897358 rem:421s step 5088 (30%) loss:3.3850 lr:0.96 dt:35ms tok/s:1884649 rem:421s step 5089 (30%) loss:3.4070 lr:0.96 dt:35ms tok/s:1895017 rem:421s step 5090 (30%) loss:3.4141 lr:0.96 dt:38ms tok/s:1739601 rem:421s step 5091 (30%) loss:3.4251 lr:0.96 dt:35ms tok/s:1885722 rem:421s step 5092 (30%) loss:3.4313 lr:0.96 dt:35ms tok/s:1872325 rem:421s step 5093 (30%) loss:3.4445 lr:0.96 dt:35ms tok/s:1877479 rem:421s step 5094 (30%) loss:3.4539 lr:0.96 dt:35ms tok/s:1891041 rem:421s step 5095 (30%) loss:3.4800 lr:0.96 dt:35ms tok/s:1893777 rem:421s step 5096 (30%) loss:3.5062 lr:0.96 dt:35ms tok/s:1880780 rem:421s step 5097 (30%) loss:3.5198 lr:0.96 dt:35ms tok/s:1888235 rem:421s step 5098 (30%) loss:3.5109 lr:0.96 dt:35ms tok/s:1889988 rem:421s step 5099 (30%) loss:3.5140 lr:0.96 dt:35ms tok/s:1895396 rem:421s step 5100 (30%) loss:3.5227 lr:0.96 dt:43ms tok/s:1515472 rem:421s + local: attn=[0.063, 0.656, 0.629] mlp=[0.284, 0.133, -0.163] + + transition: attn=[1.922, 0.818] mlp=[-0.086, 0.216] + + hierarchy: attn=[2.383, 5.939, 5.616] mlp=[1.085, -1.633, -3.832] + step 5101 (30%) loss:3.5081 lr:0.96 dt:38ms tok/s:1743961 rem:421s step 5102 (30%) loss:3.5151 lr:0.96 dt:34ms tok/s:1925414 rem:421s step 5103 (30%) loss:3.5291 lr:0.96 dt:34ms tok/s:1935501 rem:421s step 5104 (30%) loss:3.5039 lr:0.96 dt:35ms tok/s:1893620 rem:421s step 5105 (30%) loss:3.5205 lr:0.96 dt:34ms tok/s:1926561 rem:421s step 5106 (30%) loss:3.5200 lr:0.96 dt:34ms tok/s:1911636 rem:421s step 5107 (30%) loss:3.4993 lr:0.96 dt:34ms tok/s:1910984 rem:421s step 5108 (30%) loss:3.5187 lr:0.96 dt:34ms tok/s:1912820 rem:421s step 5109 (30%) loss:3.5254 lr:0.96 dt:34ms tok/s:1916541 rem:421s step 5110 (30%) loss:3.5073 lr:0.96 dt:34ms tok/s:1919404 rem:420s step 5111 (30%) loss:3.5192 lr:0.96 dt:35ms tok/s:1882120 rem:420s step 5112 (30%) loss:3.5000 lr:0.96 dt:35ms tok/s:1880652 rem:420s step 5113 (30%) loss:3.5230 lr:0.96 dt:35ms tok/s:1878467 rem:420s step 5114 (30%) loss:3.5193 lr:0.96 dt:35ms tok/s:1880266 rem:420s step 5115 (30%) loss:3.5158 lr:0.96 dt:35ms tok/s:1894886 rem:420s step 5116 (30%) loss:3.5132 lr:0.96 dt:35ms tok/s:1885101 rem:420s step 5117 (30%) loss:3.5045 lr:0.96 dt:35ms tok/s:1879790 rem:420s step 5118 (30%) loss:3.5007 lr:0.96 dt:35ms tok/s:1880459 rem:420s step 5119 (30%) loss:3.5122 lr:0.96 dt:35ms tok/s:1875468 rem:420s step 5120 (30%) loss:3.5052 lr:0.96 dt:35ms tok/s:1865033 rem:420s step 5121 (30%) loss:3.5067 lr:0.96 dt:35ms tok/s:1869969 rem:420s step 5122 (30%) loss:3.5114 lr:0.96 dt:35ms tok/s:1847931 rem:420s step 5123 (30%) loss:3.5057 lr:0.96 dt:36ms tok/s:1843865 rem:420s step 5124 (30%) loss:3.5319 lr:0.96 dt:36ms tok/s:1845884 rem:420s step 5125 (30%) loss:3.5242 lr:0.96 dt:36ms tok/s:1842703 rem:420s step 5126 (30%) loss:3.4835 lr:0.96 dt:35ms tok/s:1852215 rem:420s step 5127 (30%) loss:3.4598 lr:0.96 dt:35ms tok/s:1850444 rem:420s step 5128 (30%) loss:3.4430 lr:0.96 dt:36ms tok/s:1845797 rem:420s step 5129 (30%) loss:3.4636 lr:0.96 dt:36ms tok/s:1846020 rem:420s step 5130 (30%) loss:3.4681 lr:0.96 dt:36ms tok/s:1844831 rem:420s step 5131 (30%) loss:3.4675 lr:0.96 dt:36ms tok/s:1836916 rem:420s step 5132 (30%) loss:3.4720 lr:0.96 dt:35ms tok/s:1852614 rem:420s step 5133 (30%) loss:3.5325 lr:0.96 dt:35ms tok/s:1852240 rem:420s step 5134 (30%) loss:3.5481 lr:0.96 dt:35ms tok/s:1847223 rem:420s step 5135 (30%) loss:3.5445 lr:0.96 dt:36ms tok/s:1828497 rem:420s step 5136 (30%) loss:3.5280 lr:0.96 dt:36ms tok/s:1810718 rem:420s step 5137 (30%) loss:3.5174 lr:0.96 dt:36ms tok/s:1820492 rem:420s step 5138 (30%) loss:3.5196 lr:0.96 dt:36ms tok/s:1825134 rem:419s step 5139 (30%) loss:3.5036 lr:0.96 dt:36ms tok/s:1822943 rem:419s step 5140 (30%) loss:3.4787 lr:0.96 dt:36ms tok/s:1820106 rem:419s step 5141 (30%) loss:3.4758 lr:0.96 dt:36ms tok/s:1824140 rem:419s step 5142 (30%) loss:3.4956 lr:0.96 dt:36ms tok/s:1821047 rem:419s step 5143 (30%) loss:3.5066 lr:0.96 dt:36ms tok/s:1818168 rem:419s step 5144 (30%) loss:3.4920 lr:0.96 dt:36ms tok/s:1825691 rem:419s step 5145 (30%) loss:3.4867 lr:0.96 dt:36ms tok/s:1817747 rem:419s step 5146 (30%) loss:3.4958 lr:0.96 dt:36ms tok/s:1818986 rem:419s step 5147 (30%) loss:3.5097 lr:0.96 dt:36ms tok/s:1817266 rem:419s step 5148 (30%) loss:3.5011 lr:0.96 dt:38ms tok/s:1733688 rem:419s step 5149 (30%) loss:3.4980 lr:0.96 dt:36ms tok/s:1823596 rem:419s step 5150 (30%) loss:3.4945 lr:0.96 dt:36ms tok/s:1821674 rem:419s step 5151 (30%) loss:3.4866 lr:0.96 dt:36ms tok/s:1815142 rem:419s step 5152 (30%) loss:3.4860 lr:0.96 dt:36ms tok/s:1817915 rem:419s step 5153 (30%) loss:3.4603 lr:0.96 dt:37ms tok/s:1778926 rem:419s step 5154 (30%) loss:3.4505 lr:0.96 dt:36ms tok/s:1822230 rem:419s step 5155 (30%) loss:3.4466 lr:0.96 dt:36ms tok/s:1816438 rem:419s step 5156 (30%) loss:3.4609 lr:0.96 dt:36ms tok/s:1816402 rem:419s step 5157 (30%) loss:3.4592 lr:0.96 dt:36ms tok/s:1814951 rem:419s step 5158 (30%) loss:3.4477 lr:0.96 dt:36ms tok/s:1811076 rem:419s step 5159 (30%) loss:3.4285 lr:0.96 dt:36ms tok/s:1805058 rem:419s step 5160 (30%) loss:3.4292 lr:0.96 dt:36ms tok/s:1812689 rem:419s step 5161 (30%) loss:3.4364 lr:0.96 dt:36ms tok/s:1797222 rem:419s step 5162 (30%) loss:3.4313 lr:0.96 dt:36ms tok/s:1814711 rem:419s step 5163 (30%) loss:3.4274 lr:0.96 dt:36ms tok/s:1823898 rem:419s step 5164 (30%) loss:3.4415 lr:0.96 dt:36ms tok/s:1816690 rem:419s step 5165 (30%) loss:3.4356 lr:0.96 dt:36ms tok/s:1816426 rem:419s step 5166 (30%) loss:3.4213 lr:0.96 dt:36ms tok/s:1818529 rem:418s step 5167 (30%) loss:3.4111 lr:0.96 dt:36ms tok/s:1814855 rem:418s step 5168 (30%) loss:3.4159 lr:0.96 dt:36ms tok/s:1817158 rem:418s step 5169 (30%) loss:3.4082 lr:0.96 dt:36ms tok/s:1818336 rem:418s step 5170 (30%) loss:3.4204 lr:0.96 dt:36ms tok/s:1821892 rem:418s step 5171 (30%) loss:3.4194 lr:0.96 dt:36ms tok/s:1802419 rem:418s step 5172 (30%) loss:3.4172 lr:0.96 dt:36ms tok/s:1821071 rem:418s step 5173 (30%) loss:3.4020 lr:0.96 dt:36ms tok/s:1811661 rem:418s step 5174 (30%) loss:3.4085 lr:0.96 dt:36ms tok/s:1817940 rem:418s step 5175 (30%) loss:3.3955 lr:0.96 dt:36ms tok/s:1818469 rem:418s step 5176 (30%) loss:3.3876 lr:0.96 dt:36ms tok/s:1822061 rem:418s step 5177 (30%) loss:3.4138 lr:0.96 dt:36ms tok/s:1815946 rem:418s step 5178 (30%) loss:3.4172 lr:0.96 dt:36ms tok/s:1812258 rem:418s step 5179 (30%) loss:3.4270 lr:0.96 dt:36ms tok/s:1797904 rem:418s step 5180 (30%) loss:3.4164 lr:0.96 dt:37ms tok/s:1787417 rem:418s step 5181 (30%) loss:3.3983 lr:0.96 dt:37ms tok/s:1788580 rem:418s step 5182 (30%) loss:3.3950 lr:0.96 dt:36ms tok/s:1817110 rem:418s step 5183 (30%) loss:3.4434 lr:0.96 dt:36ms tok/s:1807611 rem:418s step 5184 (30%) loss:3.4436 lr:0.96 dt:36ms tok/s:1813358 rem:418s step 5185 (30%) loss:3.4428 lr:0.96 dt:36ms tok/s:1819600 rem:418s step 5186 (30%) loss:3.4360 lr:0.96 dt:36ms tok/s:1822314 rem:418s step 5187 (30%) loss:3.4338 lr:0.96 dt:36ms tok/s:1814352 rem:418s step 5188 (30%) loss:3.4145 lr:0.96 dt:36ms tok/s:1809371 rem:418s step 5189 (30%) loss:3.4203 lr:0.96 dt:36ms tok/s:1811219 rem:418s step 5190 (30%) loss:3.4114 lr:0.96 dt:36ms tok/s:1816414 rem:418s step 5191 (30%) loss:3.4407 lr:0.96 dt:36ms tok/s:1814028 rem:418s step 5192 (30%) loss:3.4384 lr:0.96 dt:36ms tok/s:1820661 rem:418s step 5193 (30%) loss:3.4444 lr:0.96 dt:36ms tok/s:1818300 rem:417s step 5194 (30%) loss:3.4373 lr:0.96 dt:36ms tok/s:1814615 rem:417s step 5195 (30%) loss:3.4254 lr:0.96 dt:36ms tok/s:1816774 rem:417s step 5196 (30%) loss:3.4128 lr:0.96 dt:36ms tok/s:1827111 rem:417s step 5197 (30%) loss:3.4218 lr:0.96 dt:39ms tok/s:1687672 rem:417s step 5198 (30%) loss:3.4428 lr:0.96 dt:38ms tok/s:1722109 rem:417s step 5199 (30%) loss:3.4383 lr:0.96 dt:36ms tok/s:1827621 rem:417s step 5200 (30%) loss:3.4594 lr:0.96 dt:36ms tok/s:1820673 rem:417s + local: attn=[0.063, 0.672, 0.631] mlp=[0.256, 0.166, -0.153] + + transition: attn=[2.064, 0.744] mlp=[-0.062, 0.224] + + hierarchy: attn=[2.424, 5.939, 5.616] mlp=[1.037, -1.578, -3.861] + step 5201 (30%) loss:3.4580 lr:0.96 dt:36ms tok/s:1820022 rem:417s step 5202 (30%) loss:3.4612 lr:0.96 dt:36ms tok/s:1819420 rem:417s step 5203 (30%) loss:3.4583 lr:0.96 dt:36ms tok/s:1815982 rem:417s step 5204 (30%) loss:3.4556 lr:0.96 dt:36ms tok/s:1815994 rem:417s step 5205 (30%) loss:3.4569 lr:0.96 dt:36ms tok/s:1819359 rem:417s step 5206 (30%) loss:3.4482 lr:0.96 dt:36ms tok/s:1825837 rem:417s step 5207 (30%) loss:3.4313 lr:0.96 dt:36ms tok/s:1822157 rem:417s step 5208 (31%) loss:3.4249 lr:0.96 dt:36ms tok/s:1819395 rem:417s step 5209 (31%) loss:3.4277 lr:0.96 dt:36ms tok/s:1813885 rem:417s step 5210 (31%) loss:3.4254 lr:0.96 dt:37ms tok/s:1790001 rem:417s step 5211 (31%) loss:3.4314 lr:0.96 dt:36ms tok/s:1812461 rem:417s step 5212 (31%) loss:3.4153 lr:0.96 dt:36ms tok/s:1809300 rem:417s step 5213 (31%) loss:3.3996 lr:0.96 dt:36ms tok/s:1814675 rem:417s step 5214 (31%) loss:3.4193 lr:0.96 dt:36ms tok/s:1813586 rem:417s step 5215 (31%) loss:3.4382 lr:0.96 dt:36ms tok/s:1808633 rem:417s step 5216 (31%) loss:3.4446 lr:0.96 dt:36ms tok/s:1804573 rem:417s step 5217 (31%) loss:3.4432 lr:0.96 dt:37ms tok/s:1763598 rem:417s step 5218 (31%) loss:3.4390 lr:0.96 dt:36ms tok/s:1816342 rem:417s step 5219 (31%) loss:3.4414 lr:0.96 dt:36ms tok/s:1820420 rem:417s step 5220 (31%) loss:3.4274 lr:0.96 dt:43ms tok/s:1511140 rem:417s step 5221 (31%) loss:3.4284 lr:0.96 dt:35ms tok/s:1874751 rem:416s step 5222 (31%) loss:3.4450 lr:0.96 dt:35ms tok/s:1866718 rem:416s step 5223 (31%) loss:3.4491 lr:0.96 dt:34ms tok/s:1905632 rem:416s step 5224 (31%) loss:3.4600 lr:0.96 dt:37ms tok/s:1752523 rem:416s step 5225 (31%) loss:3.4651 lr:0.96 dt:35ms tok/s:1895448 rem:416s step 5226 (31%) loss:3.4658 lr:0.96 dt:34ms tok/s:1955271 rem:416s step 5227 (31%) loss:3.4625 lr:0.96 dt:34ms tok/s:1949074 rem:416s step 5228 (31%) loss:3.4745 lr:0.96 dt:34ms tok/s:1952272 rem:416s step 5229 (31%) loss:3.4513 lr:0.96 dt:34ms tok/s:1947790 rem:416s step 5230 (31%) loss:3.4507 lr:0.96 dt:34ms tok/s:1952771 rem:416s step 5231 (31%) loss:3.4336 lr:0.96 dt:34ms tok/s:1940858 rem:416s step 5232 (31%) loss:3.4339 lr:0.96 dt:34ms tok/s:1947320 rem:416s step 5233 (31%) loss:3.4599 lr:0.96 dt:34ms tok/s:1939448 rem:416s step 5234 (31%) loss:3.4401 lr:0.96 dt:34ms tok/s:1932304 rem:416s step 5235 (31%) loss:3.4342 lr:0.96 dt:34ms tok/s:1941091 rem:416s step 5236 (31%) loss:3.4414 lr:0.96 dt:34ms tok/s:1944111 rem:416s step 5237 (31%) loss:3.4344 lr:0.96 dt:34ms tok/s:1919498 rem:416s step 5238 (31%) loss:3.4100 lr:0.96 dt:34ms tok/s:1911769 rem:416s step 5239 (31%) loss:3.4099 lr:0.96 dt:34ms tok/s:1912713 rem:416s step 5240 (31%) loss:3.4274 lr:0.96 dt:34ms tok/s:1920934 rem:416s step 5241 (31%) loss:3.4302 lr:0.96 dt:34ms tok/s:1914685 rem:416s step 5242 (31%) loss:3.4215 lr:0.96 dt:34ms tok/s:1915646 rem:416s step 5243 (31%) loss:3.4096 lr:0.96 dt:34ms tok/s:1917717 rem:416s step 5244 (31%) loss:3.4117 lr:0.96 dt:34ms tok/s:1921363 rem:416s step 5245 (31%) loss:3.3943 lr:0.96 dt:34ms tok/s:1914138 rem:416s step 5246 (31%) loss:3.4060 lr:0.96 dt:34ms tok/s:1918681 rem:416s step 5247 (31%) loss:3.4010 lr:0.96 dt:34ms tok/s:1922411 rem:416s step 5248 (31%) loss:3.4042 lr:0.96 dt:34ms tok/s:1913179 rem:416s step 5249 (31%) loss:3.3919 lr:0.96 dt:34ms tok/s:1904853 rem:416s step 5250 (31%) loss:3.3890 lr:0.96 dt:35ms tok/s:1899049 rem:415s step 5251 (31%) loss:3.3864 lr:0.96 dt:34ms tok/s:1911091 rem:415s step 5252 (31%) loss:3.4046 lr:0.96 dt:35ms tok/s:1893946 rem:415s step 5253 (31%) loss:3.4318 lr:0.96 dt:35ms tok/s:1898853 rem:415s step 5254 (31%) loss:3.4468 lr:0.96 dt:36ms tok/s:1801179 rem:415s step 5255 (31%) loss:3.4480 lr:0.96 dt:35ms tok/s:1896494 rem:415s step 5256 (31%) loss:3.4369 lr:0.96 dt:34ms tok/s:1924484 rem:415s step 5257 (31%) loss:3.4411 lr:0.96 dt:34ms tok/s:1923784 rem:415s step 5258 (31%) loss:3.4585 lr:0.96 dt:34ms tok/s:1921444 rem:415s step 5259 (31%) loss:3.4538 lr:0.96 dt:34ms tok/s:1902559 rem:415s step 5260 (31%) loss:3.4509 lr:0.96 dt:34ms tok/s:1900218 rem:415s step 5261 (31%) loss:3.4400 lr:0.96 dt:34ms tok/s:1901138 rem:415s step 5262 (31%) loss:3.4482 lr:0.96 dt:34ms tok/s:1908066 rem:415s step 5263 (31%) loss:3.4537 lr:0.96 dt:35ms tok/s:1898210 rem:415s step 5264 (31%) loss:3.4395 lr:0.96 dt:35ms tok/s:1897961 rem:415s step 5265 (31%) loss:3.4417 lr:0.96 dt:35ms tok/s:1896992 rem:415s step 5266 (31%) loss:3.4279 lr:0.96 dt:35ms tok/s:1894926 rem:415s step 5267 (31%) loss:3.4281 lr:0.96 dt:34ms tok/s:1904299 rem:415s step 5268 (31%) loss:3.4352 lr:0.96 dt:34ms tok/s:1902901 rem:415s step 5269 (31%) loss:3.4292 lr:0.96 dt:35ms tok/s:1896364 rem:415s step 5270 (31%) loss:3.4102 lr:0.96 dt:34ms tok/s:1902217 rem:415s step 5271 (31%) loss:3.4009 lr:0.96 dt:34ms tok/s:1904563 rem:415s step 5272 (31%) loss:3.4024 lr:0.96 dt:34ms tok/s:1902559 rem:415s step 5273 (31%) loss:3.3949 lr:0.96 dt:34ms tok/s:1901374 rem:415s step 5274 (31%) loss:3.3932 lr:0.96 dt:35ms tok/s:1895043 rem:415s step 5275 (31%) loss:3.3983 lr:0.95 dt:35ms tok/s:1892942 rem:415s step 5276 (31%) loss:3.4170 lr:0.95 dt:35ms tok/s:1892877 rem:415s step 5277 (31%) loss:3.4184 lr:0.95 dt:35ms tok/s:1893138 rem:415s step 5278 (31%) loss:3.4035 lr:0.95 dt:35ms tok/s:1897503 rem:415s step 5279 (31%) loss:3.3958 lr:0.95 dt:34ms tok/s:1907550 rem:414s step 5280 (31%) loss:3.3938 lr:0.95 dt:35ms tok/s:1895278 rem:414s step 5281 (31%) loss:3.3686 lr:0.95 dt:34ms tok/s:1901190 rem:414s step 5282 (31%) loss:3.3308 lr:0.95 dt:34ms tok/s:1905249 rem:414s step 5283 (31%) loss:3.3345 lr:0.95 dt:34ms tok/s:1901980 rem:414s step 5284 (31%) loss:3.3519 lr:0.95 dt:34ms tok/s:1901730 rem:414s step 5285 (31%) loss:3.3776 lr:0.95 dt:35ms tok/s:1896442 rem:414s step 5286 (31%) loss:3.3991 lr:0.95 dt:35ms tok/s:1891028 rem:414s step 5287 (31%) loss:3.3998 lr:0.95 dt:35ms tok/s:1894638 rem:414s step 5288 (31%) loss:3.4033 lr:0.95 dt:35ms tok/s:1893046 rem:414s step 5289 (31%) loss:3.4157 lr:0.95 dt:35ms tok/s:1899325 rem:414s step 5290 (31%) loss:3.4183 lr:0.95 dt:35ms tok/s:1896037 rem:414s step 5291 (31%) loss:3.4051 lr:0.95 dt:36ms tok/s:1825691 rem:414s step 5292 (31%) loss:3.3937 lr:0.95 dt:35ms tok/s:1886770 rem:414s step 5293 (31%) loss:3.3992 lr:0.95 dt:35ms tok/s:1890066 rem:414s step 5294 (31%) loss:3.4126 lr:0.95 dt:35ms tok/s:1891718 rem:414s step 5295 (31%) loss:3.4214 lr:0.95 dt:40ms tok/s:1651275 rem:414s step 5296 (31%) loss:3.4368 lr:0.95 dt:34ms tok/s:1917597 rem:414s step 5297 (31%) loss:3.4276 lr:0.95 dt:34ms tok/s:1907391 rem:414s step 5298 (31%) loss:3.4260 lr:0.95 dt:34ms tok/s:1916901 rem:414s step 5299 (31%) loss:3.4346 lr:0.95 dt:34ms tok/s:1916207 rem:414s step 5300 (31%) loss:3.4334 lr:0.95 dt:34ms tok/s:1935733 rem:414s + local: attn=[0.052, 0.649, 0.632] mlp=[0.261, 0.136, -0.163] + + transition: attn=[2.086, 0.746] mlp=[-0.086, 0.224] + + hierarchy: attn=[2.415, 5.939, 5.616] mlp=[1.058, -1.549, -3.923] + step 5301 (31%) loss:3.4433 lr:0.95 dt:34ms tok/s:1929076 rem:414s step 5302 (31%) loss:3.4106 lr:0.95 dt:34ms tok/s:1930946 rem:414s step 5303 (31%) loss:3.3928 lr:0.95 dt:34ms tok/s:1933432 rem:414s step 5304 (31%) loss:3.3981 lr:0.95 dt:34ms tok/s:1904959 rem:414s step 5305 (31%) loss:3.3935 lr:0.95 dt:35ms tok/s:1895919 rem:414s step 5306 (31%) loss:3.4165 lr:0.95 dt:34ms tok/s:1900257 rem:414s step 5307 (31%) loss:3.4160 lr:0.95 dt:35ms tok/s:1898289 rem:414s step 5308 (31%) loss:3.4270 lr:0.95 dt:34ms tok/s:1910068 rem:413s step 5309 (31%) loss:3.4372 lr:0.95 dt:34ms tok/s:1907113 rem:413s step 5310 (31%) loss:3.4350 lr:0.95 dt:34ms tok/s:1902032 rem:413s step 5311 (31%) loss:3.4035 lr:0.95 dt:34ms tok/s:1910241 rem:413s step 5312 (31%) loss:3.4027 lr:0.95 dt:41ms tok/s:1610147 rem:413s step 5313 (31%) loss:3.4289 lr:0.95 dt:34ms tok/s:1929252 rem:413s step 5314 (31%) loss:3.4654 lr:0.95 dt:33ms tok/s:1960529 rem:413s step 5315 (31%) loss:3.4627 lr:0.95 dt:33ms tok/s:1966124 rem:413s step 5316 (31%) loss:3.4618 lr:0.95 dt:34ms tok/s:1934098 rem:413s step 5317 (31%) loss:3.4569 lr:0.95 dt:34ms tok/s:1938285 rem:413s step 5318 (31%) loss:3.4585 lr:0.95 dt:34ms tok/s:1936878 rem:413s step 5319 (31%) loss:3.4453 lr:0.95 dt:34ms tok/s:1934112 rem:413s step 5320 (31%) loss:3.4382 lr:0.95 dt:34ms tok/s:1920638 rem:413s step 5321 (31%) loss:3.4521 lr:0.95 dt:34ms tok/s:1922600 rem:413s step 5322 (31%) loss:3.4466 lr:0.95 dt:34ms tok/s:1924269 rem:413s step 5323 (31%) loss:3.4469 lr:0.95 dt:34ms tok/s:1930919 rem:413s step 5324 (31%) loss:3.4735 lr:0.95 dt:34ms tok/s:1924511 rem:413s step 5325 (31%) loss:3.4542 lr:0.95 dt:34ms tok/s:1915993 rem:413s step 5326 (31%) loss:3.4695 lr:0.95 dt:34ms tok/s:1914578 rem:413s step 5327 (31%) loss:3.4846 lr:0.95 dt:34ms tok/s:1914872 rem:413s step 5328 (31%) loss:3.4834 lr:0.95 dt:34ms tok/s:1902217 rem:413s step 5329 (31%) loss:3.4760 lr:0.95 dt:34ms tok/s:1901440 rem:413s step 5330 (31%) loss:3.4627 lr:0.95 dt:34ms tok/s:1918038 rem:413s step 5331 (31%) loss:3.4328 lr:0.95 dt:34ms tok/s:1904549 rem:413s step 5332 (31%) loss:3.4429 lr:0.95 dt:35ms tok/s:1898918 rem:413s step 5333 (31%) loss:3.4351 lr:0.95 dt:34ms tok/s:1910520 rem:413s step 5334 (31%) loss:3.4308 lr:0.95 dt:34ms tok/s:1909352 rem:413s step 5335 (31%) loss:3.4319 lr:0.95 dt:35ms tok/s:1896272 rem:413s step 5336 (31%) loss:3.4244 lr:0.95 dt:34ms tok/s:1905064 rem:413s step 5337 (31%) loss:3.4255 lr:0.95 dt:34ms tok/s:1906399 rem:412s step 5338 (31%) loss:3.4525 lr:0.95 dt:34ms tok/s:1902559 rem:412s step 5339 (31%) loss:3.4553 lr:0.95 dt:34ms tok/s:1913272 rem:412s step 5340 (31%) loss:3.4519 lr:0.95 dt:34ms tok/s:1909975 rem:412s step 5341 (31%) loss:3.4413 lr:0.95 dt:34ms tok/s:1902177 rem:412s step 5342 (31%) loss:3.4295 lr:0.95 dt:34ms tok/s:1916300 rem:412s step 5343 (31%) loss:3.4118 lr:0.95 dt:34ms tok/s:1901611 rem:412s step 5344 (31%) loss:3.4089 lr:0.95 dt:35ms tok/s:1848901 rem:412s step 5345 (31%) loss:3.4203 lr:0.95 dt:35ms tok/s:1897424 rem:412s step 5346 (31%) loss:3.4357 lr:0.95 dt:35ms tok/s:1893712 rem:412s step 5347 (31%) loss:3.4359 lr:0.95 dt:34ms tok/s:1911064 rem:412s step 5348 (31%) loss:3.4619 lr:0.95 dt:34ms tok/s:1904563 rem:412s step 5349 (31%) loss:3.4698 lr:0.95 dt:34ms tok/s:1905408 rem:412s step 5350 (31%) loss:3.4826 lr:0.95 dt:34ms tok/s:1904338 rem:412s step 5351 (31%) loss:3.4706 lr:0.95 dt:34ms tok/s:1900165 rem:412s step 5352 (31%) loss:3.4524 lr:0.95 dt:34ms tok/s:1906743 rem:412s step 5353 (31%) loss:3.4372 lr:0.95 dt:34ms tok/s:1902677 rem:412s step 5354 (31%) loss:3.4349 lr:0.95 dt:34ms tok/s:1899850 rem:412s step 5355 (31%) loss:3.4400 lr:0.95 dt:34ms tok/s:1908583 rem:412s step 5356 (31%) loss:3.4366 lr:0.95 dt:34ms tok/s:1913552 rem:412s step 5357 (31%) loss:3.4196 lr:0.95 dt:35ms tok/s:1897987 rem:412s step 5358 (31%) loss:3.4240 lr:0.95 dt:34ms tok/s:1904523 rem:412s step 5359 (31%) loss:3.4333 lr:0.95 dt:35ms tok/s:1887172 rem:412s step 5360 (31%) loss:3.5089 lr:0.95 dt:34ms tok/s:1903218 rem:412s step 5361 (31%) loss:3.5047 lr:0.95 dt:34ms tok/s:1910573 rem:412s step 5362 (31%) loss:3.5105 lr:0.95 dt:34ms tok/s:1908980 rem:412s step 5363 (31%) loss:3.4730 lr:0.95 dt:34ms tok/s:1903468 rem:412s step 5364 (31%) loss:3.4695 lr:0.95 dt:34ms tok/s:1905143 rem:412s step 5365 (31%) loss:3.4634 lr:0.95 dt:34ms tok/s:1912740 rem:412s step 5366 (31%) loss:3.4353 lr:0.95 dt:34ms tok/s:1907417 rem:411s step 5367 (31%) loss:3.4383 lr:0.95 dt:35ms tok/s:1897948 rem:411s step 5368 (31%) loss:3.4405 lr:0.95 dt:35ms tok/s:1897306 rem:411s step 5369 (31%) loss:3.4488 lr:0.95 dt:34ms tok/s:1901782 rem:411s step 5370 (31%) loss:3.4643 lr:0.95 dt:34ms tok/s:1906597 rem:411s step 5371 (31%) loss:3.4443 lr:0.95 dt:35ms tok/s:1886680 rem:411s step 5372 (31%) loss:3.4208 lr:0.95 dt:35ms tok/s:1894455 rem:411s step 5373 (31%) loss:3.4055 lr:0.95 dt:35ms tok/s:1881540 rem:411s step 5374 (31%) loss:3.4162 lr:0.95 dt:34ms tok/s:1902664 rem:411s step 5375 (31%) loss:3.4347 lr:0.95 dt:35ms tok/s:1898485 rem:411s step 5376 (31%) loss:3.4301 lr:0.95 dt:34ms tok/s:1902440 rem:411s step 5377 (31%) loss:3.4391 lr:0.95 dt:34ms tok/s:1908278 rem:411s step 5378 (31%) loss:3.4330 lr:0.95 dt:34ms tok/s:1905566 rem:411s step 5379 (31%) loss:3.4150 lr:0.95 dt:34ms tok/s:1902480 rem:411s step 5380 (31%) loss:3.3982 lr:0.95 dt:35ms tok/s:1897241 rem:411s step 5381 (32%) loss:3.4017 lr:0.95 dt:34ms tok/s:1904127 rem:411s step 5382 (32%) loss:3.3902 lr:0.95 dt:34ms tok/s:1900402 rem:411s step 5383 (32%) loss:3.3820 lr:0.95 dt:35ms tok/s:1899443 rem:411s step 5384 (32%) loss:3.3749 lr:0.95 dt:34ms tok/s:1906703 rem:411s step 5385 (32%) loss:3.3818 lr:0.95 dt:34ms tok/s:1901164 rem:411s step 5386 (32%) loss:3.3946 lr:0.95 dt:34ms tok/s:1904325 rem:411s step 5387 (32%) loss:3.3941 lr:0.95 dt:35ms tok/s:1899207 rem:411s step 5388 (32%) loss:3.4057 lr:0.95 dt:34ms tok/s:1903916 rem:411s step 5389 (32%) loss:3.3927 lr:0.95 dt:35ms tok/s:1892760 rem:411s step 5390 (32%) loss:3.3908 lr:0.95 dt:34ms tok/s:1900967 rem:411s step 5391 (32%) loss:3.3974 lr:0.95 dt:34ms tok/s:1907179 rem:411s step 5392 (32%) loss:3.3994 lr:0.95 dt:35ms tok/s:1899194 rem:411s step 5393 (32%) loss:3.3895 lr:0.95 dt:34ms tok/s:1901138 rem:411s step 5394 (32%) loss:3.3798 lr:0.95 dt:34ms tok/s:1901164 rem:411s step 5395 (32%) loss:3.3720 lr:0.95 dt:35ms tok/s:1895880 rem:410s step 5396 (32%) loss:3.3591 lr:0.95 dt:34ms tok/s:1900678 rem:410s step 5397 (32%) loss:3.3699 lr:0.95 dt:34ms tok/s:1904787 rem:410s step 5398 (32%) loss:3.3706 lr:0.95 dt:36ms tok/s:1820769 rem:410s step 5399 (32%) loss:3.3819 lr:0.95 dt:35ms tok/s:1857133 rem:410s step 5400 (32%) loss:3.3846 lr:0.95 dt:34ms tok/s:1918801 rem:410s + local: attn=[0.063, 0.646, 0.647] mlp=[0.274, 0.170, -0.142] + + transition: attn=[2.045, 0.740] mlp=[-0.088, 0.225] + + hierarchy: attn=[2.440, 5.939, 5.616] mlp=[1.010, -1.535, -4.009] + step 5401 (32%) loss:3.3733 lr:0.95 dt:41ms tok/s:1594410 rem:410s step 5402 (32%) loss:3.4012 lr:0.95 dt:35ms tok/s:1870299 rem:410s step 5403 (32%) loss:3.4016 lr:0.95 dt:34ms tok/s:1944868 rem:410s step 5404 (32%) loss:3.4015 lr:0.95 dt:34ms tok/s:1929631 rem:410s step 5405 (32%) loss:3.3920 lr:0.95 dt:34ms tok/s:1922828 rem:410s step 5406 (32%) loss:3.3701 lr:0.95 dt:34ms tok/s:1926494 rem:410s step 5407 (32%) loss:3.3706 lr:0.95 dt:34ms tok/s:1927899 rem:410s step 5408 (32%) loss:3.3606 lr:0.95 dt:34ms tok/s:1933690 rem:410s step 5409 (32%) loss:3.3564 lr:0.95 dt:34ms tok/s:1909086 rem:410s step 5410 (32%) loss:3.3391 lr:0.95 dt:35ms tok/s:1899587 rem:410s step 5411 (32%) loss:3.3287 lr:0.95 dt:34ms tok/s:1903152 rem:410s step 5412 (32%) loss:3.3070 lr:0.95 dt:34ms tok/s:1905777 rem:410s step 5413 (32%) loss:3.3149 lr:0.95 dt:34ms tok/s:1906941 rem:410s step 5414 (32%) loss:3.3368 lr:0.95 dt:34ms tok/s:1906491 rem:410s step 5415 (32%) loss:3.3700 lr:0.95 dt:35ms tok/s:1895932 rem:410s step 5416 (32%) loss:3.3707 lr:0.95 dt:34ms tok/s:1903811 rem:410s step 5417 (32%) loss:3.3870 lr:0.95 dt:34ms tok/s:1909630 rem:410s step 5418 (32%) loss:3.3891 lr:0.95 dt:35ms tok/s:1896979 rem:410s step 5419 (32%) loss:3.4022 lr:0.95 dt:34ms tok/s:1901914 rem:410s step 5420 (32%) loss:3.4055 lr:0.95 dt:34ms tok/s:1908583 rem:410s step 5421 (32%) loss:3.4246 lr:0.95 dt:34ms tok/s:1901125 rem:410s step 5422 (32%) loss:3.4232 lr:0.95 dt:34ms tok/s:1908079 rem:410s step 5423 (32%) loss:3.4147 lr:0.95 dt:35ms tok/s:1898184 rem:410s step 5424 (32%) loss:3.4076 lr:0.95 dt:34ms tok/s:1903283 rem:409s step 5425 (32%) loss:3.4003 lr:0.95 dt:34ms tok/s:1902862 rem:409s step 5426 (32%) loss:3.4024 lr:0.95 dt:34ms tok/s:1903811 rem:409s step 5427 (32%) loss:3.4144 lr:0.95 dt:34ms tok/s:1907258 rem:409s step 5428 (32%) loss:3.4282 lr:0.95 dt:35ms tok/s:1897227 rem:409s step 5429 (32%) loss:3.4395 lr:0.95 dt:35ms tok/s:1894978 rem:409s step 5430 (32%) loss:3.4610 lr:0.95 dt:35ms tok/s:1894247 rem:409s step 5431 (32%) loss:3.4661 lr:0.95 dt:34ms tok/s:1908729 rem:409s step 5432 (32%) loss:3.4639 lr:0.95 dt:34ms tok/s:1907536 rem:409s step 5433 (32%) loss:3.4535 lr:0.95 dt:34ms tok/s:1912261 rem:409s step 5434 (32%) loss:3.4466 lr:0.95 dt:35ms tok/s:1889195 rem:409s step 5435 (32%) loss:3.4613 lr:0.95 dt:35ms tok/s:1887159 rem:409s step 5436 (32%) loss:3.4513 lr:0.95 dt:35ms tok/s:1886421 rem:409s step 5437 (32%) loss:3.4306 lr:0.95 dt:35ms tok/s:1896835 rem:409s step 5438 (32%) loss:3.4256 lr:0.95 dt:35ms tok/s:1884933 rem:409s step 5439 (32%) loss:3.4282 lr:0.95 dt:35ms tok/s:1877556 rem:409s step 5440 (32%) loss:3.4274 lr:0.95 dt:35ms tok/s:1869345 rem:409s step 5441 (32%) loss:3.4109 lr:0.95 dt:35ms tok/s:1870058 rem:409s step 5442 (32%) loss:3.4097 lr:0.95 dt:35ms tok/s:1870376 rem:409s step 5443 (32%) loss:3.4137 lr:0.95 dt:35ms tok/s:1865261 rem:409s step 5444 (32%) loss:3.4025 lr:0.95 dt:35ms tok/s:1862872 rem:409s step 5445 (32%) loss:3.4044 lr:0.95 dt:35ms tok/s:1863630 rem:409s step 5446 (32%) loss:3.4164 lr:0.95 dt:35ms tok/s:1873461 rem:409s step 5447 (32%) loss:3.4244 lr:0.95 dt:35ms tok/s:1854176 rem:409s step 5448 (32%) loss:3.4174 lr:0.95 dt:36ms tok/s:1814280 rem:409s step 5449 (32%) loss:3.4371 lr:0.95 dt:36ms tok/s:1809896 rem:409s step 5450 (32%) loss:3.4425 lr:0.95 dt:35ms tok/s:1876415 rem:409s step 5451 (32%) loss:3.4535 lr:0.95 dt:34ms tok/s:1909272 rem:409s step 5452 (32%) loss:3.4580 lr:0.95 dt:35ms tok/s:1895344 rem:408s step 5453 (32%) loss:3.4670 lr:0.95 dt:35ms tok/s:1875352 rem:408s step 5454 (32%) loss:3.4610 lr:0.95 dt:35ms tok/s:1870376 rem:408s step 5455 (32%) loss:3.4403 lr:0.95 dt:35ms tok/s:1871140 rem:408s step 5456 (32%) loss:3.4464 lr:0.95 dt:35ms tok/s:1871522 rem:408s step 5457 (32%) loss:3.4444 lr:0.95 dt:35ms tok/s:1875954 rem:408s step 5458 (32%) loss:3.4272 lr:0.95 dt:35ms tok/s:1864262 rem:408s step 5459 (32%) loss:3.4367 lr:0.95 dt:35ms tok/s:1868126 rem:408s step 5460 (32%) loss:3.4250 lr:0.95 dt:35ms tok/s:1852939 rem:408s step 5461 (32%) loss:3.4206 lr:0.95 dt:35ms tok/s:1853339 rem:408s step 5462 (32%) loss:3.4227 lr:0.95 dt:35ms tok/s:1853489 rem:408s step 5463 (32%) loss:3.4293 lr:0.95 dt:35ms tok/s:1850556 rem:408s step 5464 (32%) loss:3.4999 lr:0.95 dt:35ms tok/s:1850531 rem:408s step 5465 (32%) loss:3.5295 lr:0.95 dt:35ms tok/s:1848975 rem:408s step 5466 (32%) loss:3.5298 lr:0.95 dt:35ms tok/s:1855015 rem:408s step 5467 (32%) loss:3.5275 lr:0.95 dt:35ms tok/s:1857835 rem:408s step 5468 (32%) loss:3.5297 lr:0.95 dt:36ms tok/s:1842444 rem:408s step 5469 (32%) loss:3.4971 lr:0.95 dt:35ms tok/s:1853489 rem:408s step 5470 (32%) loss:3.4881 lr:0.95 dt:37ms tok/s:1756016 rem:408s step 5471 (32%) loss:3.4683 lr:0.95 dt:36ms tok/s:1837763 rem:408s step 5472 (32%) loss:3.4618 lr:0.95 dt:35ms tok/s:1852477 rem:408s step 5473 (32%) loss:3.4677 lr:0.95 dt:35ms tok/s:1847732 rem:408s step 5474 (32%) loss:3.4614 lr:0.95 dt:35ms tok/s:1849423 rem:408s step 5475 (32%) loss:3.4648 lr:0.95 dt:36ms tok/s:1844360 rem:408s step 5476 (32%) loss:3.4817 lr:0.95 dt:36ms tok/s:1844249 rem:408s step 5477 (32%) loss:3.4681 lr:0.94 dt:35ms tok/s:1846206 rem:408s step 5478 (32%) loss:3.4641 lr:0.94 dt:35ms tok/s:1854576 rem:408s step 5479 (32%) loss:3.4644 lr:0.94 dt:36ms tok/s:1837714 rem:408s step 5480 (32%) loss:3.4655 lr:0.94 dt:35ms tok/s:1851616 rem:408s step 5481 (32%) loss:3.4769 lr:0.94 dt:35ms tok/s:1874278 rem:407s step 5482 (32%) loss:3.4714 lr:0.94 dt:35ms tok/s:1849884 rem:407s step 5483 (32%) loss:3.4604 lr:0.94 dt:35ms tok/s:1876261 rem:407s step 5484 (32%) loss:3.4585 lr:0.94 dt:35ms tok/s:1867859 rem:407s step 5485 (32%) loss:3.4523 lr:0.94 dt:35ms tok/s:1867339 rem:407s step 5486 (32%) loss:3.4634 lr:0.94 dt:35ms tok/s:1859180 rem:407s step 5487 (32%) loss:3.4582 lr:0.94 dt:36ms tok/s:1818252 rem:407s step 5488 (32%) loss:3.4706 lr:0.94 dt:35ms tok/s:1849846 rem:407s step 5489 (32%) loss:3.4639 lr:0.94 dt:35ms tok/s:1869193 rem:407s step 5490 (32%) loss:3.4531 lr:0.94 dt:35ms tok/s:1860589 rem:407s step 5491 (32%) loss:3.4285 lr:0.94 dt:35ms tok/s:1871496 rem:407s step 5492 (32%) loss:3.4132 lr:0.94 dt:41ms tok/s:1592341 rem:407s step 5493 (32%) loss:3.4082 lr:0.94 dt:38ms tok/s:1743297 rem:407s step 5494 (32%) loss:3.3954 lr:0.94 dt:34ms tok/s:1919887 rem:407s step 5495 (32%) loss:3.3928 lr:0.94 dt:34ms tok/s:1914765 rem:407s step 5496 (32%) loss:3.4022 lr:0.94 dt:34ms tok/s:1900428 rem:407s step 5497 (32%) loss:3.4043 lr:0.94 dt:35ms tok/s:1896141 rem:407s step 5498 (32%) loss:3.4120 lr:0.94 dt:34ms tok/s:1901822 rem:407s step 5499 (32%) loss:3.4157 lr:0.94 dt:34ms tok/s:1905500 rem:407s step 5500 (32%) loss:3.4229 lr:0.94 dt:35ms tok/s:1873435 rem:407s + local: attn=[0.070, 0.679, 0.639] mlp=[0.278, 0.175, -0.146] + + transition: attn=[2.130, 0.734] mlp=[-0.087, 0.224] + + hierarchy: attn=[2.518, 5.939, 5.616] mlp=[1.027, -1.506, -4.102] + step 5501 (32%) loss:3.4067 lr:0.94 dt:35ms tok/s:1888235 rem:407s step 5502 (32%) loss:3.4302 lr:0.94 dt:35ms tok/s:1883719 rem:407s step 5503 (32%) loss:3.4363 lr:0.94 dt:35ms tok/s:1894338 rem:407s step 5504 (32%) loss:3.4537 lr:0.94 dt:35ms tok/s:1890885 rem:407s step 5505 (32%) loss:3.4347 lr:0.94 dt:35ms tok/s:1873065 rem:407s step 5506 (32%) loss:3.4269 lr:0.94 dt:35ms tok/s:1868240 rem:407s step 5507 (32%) loss:3.4275 lr:0.94 dt:35ms tok/s:1872223 rem:407s step 5508 (32%) loss:3.4387 lr:0.94 dt:36ms tok/s:1831591 rem:407s step 5509 (32%) loss:3.4434 lr:0.94 dt:36ms tok/s:1845561 rem:406s step 5510 (32%) loss:3.4297 lr:0.94 dt:35ms tok/s:1853376 rem:406s step 5511 (32%) loss:3.4074 lr:0.94 dt:35ms tok/s:1846119 rem:406s step 5512 (32%) loss:3.4090 lr:0.94 dt:36ms tok/s:1840002 rem:406s step 5513 (32%) loss:3.4078 lr:0.94 dt:36ms tok/s:1842543 rem:406s step 5514 (32%) loss:3.4190 lr:0.94 dt:36ms tok/s:1824940 rem:406s step 5515 (32%) loss:3.4336 lr:0.94 dt:36ms tok/s:1816510 rem:406s step 5516 (32%) loss:3.4446 lr:0.94 dt:36ms tok/s:1821964 rem:406s step 5517 (32%) loss:3.4461 lr:0.94 dt:36ms tok/s:1817711 rem:406s step 5518 (32%) loss:3.4584 lr:0.94 dt:36ms tok/s:1817399 rem:406s step 5519 (32%) loss:3.4511 lr:0.94 dt:36ms tok/s:1825097 rem:406s step 5520 (32%) loss:3.4705 lr:0.94 dt:36ms tok/s:1815538 rem:406s step 5521 (32%) loss:3.4765 lr:0.94 dt:36ms tok/s:1815742 rem:406s step 5522 (32%) loss:3.4851 lr:0.94 dt:36ms tok/s:1817483 rem:406s step 5523 (32%) loss:3.4841 lr:0.94 dt:36ms tok/s:1817627 rem:406s step 5524 (32%) loss:3.4782 lr:0.94 dt:36ms tok/s:1820046 rem:406s step 5525 (32%) loss:3.4758 lr:0.94 dt:36ms tok/s:1826237 rem:406s step 5526 (32%) loss:3.4794 lr:0.94 dt:36ms tok/s:1819179 rem:406s step 5527 (32%) loss:3.4883 lr:0.94 dt:36ms tok/s:1801297 rem:406s step 5528 (32%) loss:3.4798 lr:0.94 dt:36ms tok/s:1832703 rem:406s step 5529 (32%) loss:3.4753 lr:0.94 dt:34ms tok/s:1901322 rem:406s step 5530 (32%) loss:3.4638 lr:0.94 dt:34ms tok/s:1919659 rem:406s step 5531 (32%) loss:3.4516 lr:0.94 dt:33ms tok/s:1967306 rem:406s step 5532 (32%) loss:3.4529 lr:0.94 dt:33ms tok/s:1967729 rem:406s step 5533 (32%) loss:3.4535 lr:0.94 dt:33ms tok/s:1963554 rem:406s step 5534 (32%) loss:3.4268 lr:0.94 dt:33ms tok/s:1967179 rem:406s step 5535 (32%) loss:3.4205 lr:0.94 dt:33ms tok/s:1957764 rem:406s step 5536 (32%) loss:3.4088 lr:0.94 dt:33ms tok/s:1982088 rem:406s step 5537 (32%) loss:3.4220 lr:0.94 dt:33ms tok/s:1973677 rem:405s step 5538 (32%) loss:3.4322 lr:0.94 dt:33ms tok/s:1983776 rem:405s step 5539 (32%) loss:3.4332 lr:0.94 dt:34ms tok/s:1953201 rem:405s step 5540 (32%) loss:3.4392 lr:0.94 dt:33ms tok/s:1970070 rem:405s step 5541 (32%) loss:3.4619 lr:0.94 dt:34ms tok/s:1952147 rem:405s step 5542 (32%) loss:3.4814 lr:0.94 dt:34ms tok/s:1945956 rem:405s step 5543 (32%) loss:3.4770 lr:0.94 dt:34ms tok/s:1946672 rem:405s step 5544 (32%) loss:3.4675 lr:0.94 dt:34ms tok/s:1932344 rem:405s step 5545 (32%) loss:3.4394 lr:0.94 dt:34ms tok/s:1933976 rem:405s step 5546 (32%) loss:3.4098 lr:0.94 dt:34ms tok/s:1941379 rem:405s step 5547 (32%) loss:3.3723 lr:0.94 dt:34ms tok/s:1919512 rem:405s step 5548 (32%) loss:3.3295 lr:0.94 dt:34ms tok/s:1927102 rem:405s step 5549 (32%) loss:3.3367 lr:0.94 dt:34ms tok/s:1926075 rem:405s step 5550 (32%) loss:3.3403 lr:0.94 dt:34ms tok/s:1920652 rem:405s step 5551 (32%) loss:3.3479 lr:0.94 dt:34ms tok/s:1924875 rem:405s step 5552 (32%) loss:3.3348 lr:0.94 dt:34ms tok/s:1900389 rem:405s step 5553 (33%) loss:3.3101 lr:0.94 dt:34ms tok/s:1902862 rem:405s step 5554 (33%) loss:3.3280 lr:0.94 dt:34ms tok/s:1909140 rem:405s step 5555 (33%) loss:3.3398 lr:0.94 dt:35ms tok/s:1899522 rem:405s step 5556 (33%) loss:3.3661 lr:0.94 dt:35ms tok/s:1882932 rem:405s step 5557 (33%) loss:3.3643 lr:0.94 dt:35ms tok/s:1889923 rem:405s step 5558 (33%) loss:3.3374 lr:0.94 dt:35ms tok/s:1884881 rem:405s step 5559 (33%) loss:3.3343 lr:0.94 dt:35ms tok/s:1887548 rem:405s step 5560 (33%) loss:3.3416 lr:0.94 dt:35ms tok/s:1893294 rem:405s step 5561 (33%) loss:3.3468 lr:0.94 dt:35ms tok/s:1895265 rem:405s step 5562 (33%) loss:3.3496 lr:0.94 dt:35ms tok/s:1872261 rem:405s step 5563 (33%) loss:3.3691 lr:0.94 dt:35ms tok/s:1871331 rem:405s step 5564 (33%) loss:3.3656 lr:0.94 dt:35ms tok/s:1886991 rem:405s step 5565 (33%) loss:3.3587 lr:0.94 dt:35ms tok/s:1879751 rem:405s step 5566 (33%) loss:3.3654 lr:0.94 dt:35ms tok/s:1878659 rem:405s step 5567 (33%) loss:3.3916 lr:0.94 dt:35ms tok/s:1877851 rem:404s step 5568 (33%) loss:3.3985 lr:0.94 dt:35ms tok/s:1878685 rem:404s step 5569 (33%) loss:3.3840 lr:0.94 dt:35ms tok/s:1888131 rem:404s step 5570 (33%) loss:3.3926 lr:0.94 dt:35ms tok/s:1883551 rem:404s step 5571 (33%) loss:3.3783 lr:0.94 dt:35ms tok/s:1865109 rem:404s step 5572 (33%) loss:3.3633 lr:0.94 dt:35ms tok/s:1864907 rem:404s step 5573 (33%) loss:3.3914 lr:0.94 dt:35ms tok/s:1861245 rem:404s step 5574 (33%) loss:3.4153 lr:0.94 dt:35ms tok/s:1866832 rem:404s step 5575 (33%) loss:3.3985 lr:0.94 dt:35ms tok/s:1860715 rem:404s step 5576 (33%) loss:3.3909 lr:0.94 dt:35ms tok/s:1860904 rem:404s step 5577 (33%) loss:3.3850 lr:0.94 dt:35ms tok/s:1852814 rem:404s step 5578 (33%) loss:3.3888 lr:0.94 dt:35ms tok/s:1853451 rem:404s step 5579 (33%) loss:3.4074 lr:0.94 dt:36ms tok/s:1835285 rem:404s step 5580 (33%) loss:3.4180 lr:0.94 dt:36ms tok/s:1838353 rem:404s step 5581 (33%) loss:3.4242 lr:0.94 dt:35ms tok/s:1854814 rem:404s step 5582 (33%) loss:3.4111 lr:0.94 dt:36ms tok/s:1836732 rem:404s step 5583 (33%) loss:3.3906 lr:0.94 dt:36ms tok/s:1843111 rem:404s step 5584 (33%) loss:3.4076 lr:0.94 dt:36ms tok/s:1833999 rem:404s step 5585 (33%) loss:3.4078 lr:0.94 dt:36ms tok/s:1831982 rem:404s step 5586 (33%) loss:3.4170 lr:0.94 dt:36ms tok/s:1845375 rem:404s step 5587 (33%) loss:3.4141 lr:0.94 dt:36ms tok/s:1842827 rem:404s step 5588 (33%) loss:3.4148 lr:0.94 dt:36ms tok/s:1843667 rem:404s step 5589 (33%) loss:3.3991 lr:0.94 dt:36ms tok/s:1841432 rem:404s step 5590 (33%) loss:3.3860 lr:0.94 dt:36ms tok/s:1808467 rem:404s step 5591 (33%) loss:3.4150 lr:0.94 dt:36ms tok/s:1833803 rem:404s step 5592 (33%) loss:3.4073 lr:0.94 dt:36ms tok/s:1825437 rem:404s step 5593 (33%) loss:3.3948 lr:0.94 dt:36ms tok/s:1811637 rem:404s step 5594 (33%) loss:3.3971 lr:0.94 dt:36ms tok/s:1845797 rem:404s step 5595 (33%) loss:3.3933 lr:0.94 dt:36ms tok/s:1844063 rem:403s step 5596 (33%) loss:3.4150 lr:0.94 dt:36ms tok/s:1844150 rem:403s step 5597 (33%) loss:3.4156 lr:0.94 dt:36ms tok/s:1839140 rem:403s step 5598 (33%) loss:3.4134 lr:0.94 dt:36ms tok/s:1840951 rem:403s step 5599 (33%) loss:3.4069 lr:0.94 dt:36ms tok/s:1833412 rem:403s step 5600 (33%) loss:3.3877 lr:0.94 dt:36ms tok/s:1831250 rem:403s + local: attn=[0.056, 0.686, 0.638] mlp=[0.280, 0.166, -0.187] + + transition: attn=[2.113, 0.742] mlp=[-0.097, 0.226] + + hierarchy: attn=[2.461, 5.939, 5.616] mlp=[1.041, -1.472, -4.160] + step 5601 (33%) loss:3.4035 lr:0.94 dt:36ms tok/s:1836155 rem:403s step 5602 (33%) loss:3.3994 lr:0.94 dt:36ms tok/s:1835419 rem:403s step 5603 (33%) loss:3.4104 lr:0.94 dt:36ms tok/s:1834672 rem:403s step 5604 (33%) loss:3.4088 lr:0.94 dt:36ms tok/s:1841198 rem:403s step 5605 (33%) loss:3.4160 lr:0.94 dt:37ms tok/s:1789803 rem:403s step 5606 (33%) loss:3.4227 lr:0.94 dt:36ms tok/s:1827609 rem:403s step 5607 (33%) loss:3.4306 lr:0.94 dt:36ms tok/s:1836216 rem:403s step 5608 (33%) loss:3.4141 lr:0.94 dt:36ms tok/s:1837714 rem:403s step 5609 (33%) loss:3.4160 lr:0.94 dt:36ms tok/s:1837247 rem:403s step 5610 (33%) loss:3.3993 lr:0.94 dt:36ms tok/s:1838845 rem:403s step 5611 (33%) loss:3.3916 lr:0.94 dt:36ms tok/s:1838193 rem:403s step 5612 (33%) loss:3.4015 lr:0.94 dt:36ms tok/s:1838021 rem:403s step 5613 (33%) loss:3.3996 lr:0.94 dt:36ms tok/s:1832507 rem:403s step 5614 (33%) loss:3.3833 lr:0.94 dt:36ms tok/s:1830530 rem:403s step 5615 (33%) loss:3.3743 lr:0.94 dt:36ms tok/s:1841444 rem:403s step 5616 (33%) loss:3.3643 lr:0.94 dt:36ms tok/s:1834403 rem:403s step 5617 (33%) loss:3.3766 lr:0.94 dt:36ms tok/s:1835101 rem:403s step 5618 (33%) loss:3.3646 lr:0.94 dt:36ms tok/s:1821711 rem:403s step 5619 (33%) loss:3.3474 lr:0.94 dt:36ms tok/s:1836241 rem:403s step 5620 (33%) loss:3.3560 lr:0.94 dt:36ms tok/s:1834023 rem:403s step 5621 (33%) loss:3.3430 lr:0.94 dt:38ms tok/s:1721473 rem:403s step 5622 (33%) loss:3.3512 lr:0.94 dt:45ms tok/s:1450558 rem:403s step 5623 (33%) loss:3.3764 lr:0.94 dt:38ms tok/s:1735034 rem:402s step 5624 (33%) loss:3.3849 lr:0.94 dt:35ms tok/s:1893138 rem:402s step 5625 (33%) loss:3.3994 lr:0.94 dt:35ms tok/s:1893007 rem:402s step 5626 (33%) loss:3.4080 lr:0.94 dt:34ms tok/s:1900573 rem:402s step 5627 (33%) loss:3.4023 lr:0.94 dt:35ms tok/s:1855328 rem:402s step 5628 (33%) loss:3.4066 lr:0.94 dt:35ms tok/s:1846938 rem:402s step 5629 (33%) loss:3.4193 lr:0.94 dt:36ms tok/s:1835983 rem:402s step 5630 (33%) loss:3.4249 lr:0.94 dt:36ms tok/s:1820769 rem:402s step 5631 (33%) loss:3.4220 lr:0.94 dt:36ms tok/s:1837272 rem:402s step 5632 (33%) loss:3.3984 lr:0.94 dt:35ms tok/s:1866680 rem:402s step 5633 (33%) loss:3.3866 lr:0.94 dt:35ms tok/s:1853539 rem:402s step 5634 (33%) loss:3.3991 lr:0.94 dt:36ms tok/s:1833094 rem:402s step 5635 (33%) loss:3.3745 lr:0.94 dt:36ms tok/s:1827852 rem:402s step 5636 (33%) loss:3.3852 lr:0.94 dt:36ms tok/s:1846070 rem:402s step 5637 (33%) loss:3.3758 lr:0.94 dt:36ms tok/s:1845710 rem:402s step 5638 (33%) loss:3.3873 lr:0.94 dt:36ms tok/s:1828290 rem:402s step 5639 (33%) loss:3.3896 lr:0.94 dt:37ms tok/s:1784122 rem:402s step 5640 (33%) loss:3.3760 lr:0.94 dt:37ms tok/s:1794149 rem:402s step 5641 (33%) loss:3.4058 lr:0.94 dt:36ms tok/s:1809741 rem:402s step 5642 (33%) loss:3.4185 lr:0.94 dt:36ms tok/s:1827609 rem:402s step 5643 (33%) loss:3.4384 lr:0.94 dt:37ms tok/s:1765750 rem:402s step 5644 (33%) loss:3.4274 lr:0.94 dt:35ms tok/s:1873359 rem:402s step 5645 (33%) loss:3.4146 lr:0.94 dt:35ms tok/s:1869231 rem:402s step 5646 (33%) loss:3.4107 lr:0.94 dt:36ms tok/s:1830884 rem:402s step 5647 (33%) loss:3.4234 lr:0.94 dt:36ms tok/s:1823777 rem:402s step 5648 (33%) loss:3.4226 lr:0.94 dt:36ms tok/s:1842234 rem:402s step 5649 (33%) loss:3.3940 lr:0.94 dt:35ms tok/s:1851516 rem:402s step 5650 (33%) loss:3.3815 lr:0.94 dt:35ms tok/s:1872733 rem:402s step 5651 (33%) loss:3.3839 lr:0.94 dt:35ms tok/s:1867111 rem:401s step 5652 (33%) loss:3.3824 lr:0.94 dt:35ms tok/s:1891067 rem:401s step 5653 (33%) loss:3.3916 lr:0.94 dt:35ms tok/s:1870325 rem:401s step 5654 (33%) loss:3.3858 lr:0.94 dt:35ms tok/s:1854852 rem:401s step 5655 (33%) loss:3.3947 lr:0.94 dt:35ms tok/s:1890378 rem:401s step 5656 (33%) loss:3.4019 lr:0.94 dt:35ms tok/s:1864995 rem:401s step 5657 (33%) loss:3.3928 lr:0.94 dt:36ms tok/s:1834746 rem:401s step 5658 (33%) loss:3.3868 lr:0.94 dt:34ms tok/s:1901664 rem:401s step 5659 (33%) loss:3.3943 lr:0.93 dt:35ms tok/s:1883900 rem:401s step 5660 (33%) loss:3.4293 lr:0.93 dt:35ms tok/s:1865071 rem:401s step 5661 (33%) loss:3.4334 lr:0.93 dt:35ms tok/s:1879160 rem:401s step 5662 (33%) loss:3.4221 lr:0.93 dt:35ms tok/s:1880510 rem:401s step 5663 (33%) loss:3.4064 lr:0.93 dt:35ms tok/s:1890872 rem:401s step 5664 (33%) loss:3.4275 lr:0.93 dt:35ms tok/s:1898958 rem:401s step 5665 (33%) loss:3.4265 lr:0.93 dt:35ms tok/s:1888079 rem:401s step 5666 (33%) loss:3.4189 lr:0.93 dt:35ms tok/s:1896115 rem:401s step 5667 (33%) loss:3.4178 lr:0.93 dt:35ms tok/s:1893255 rem:401s step 5668 (33%) loss:3.4149 lr:0.93 dt:35ms tok/s:1872504 rem:401s step 5669 (33%) loss:3.4333 lr:0.93 dt:35ms tok/s:1877056 rem:401s step 5670 (33%) loss:3.4533 lr:0.93 dt:35ms tok/s:1881862 rem:401s step 5671 (33%) loss:3.4548 lr:0.93 dt:35ms tok/s:1869473 rem:401s step 5672 (33%) loss:3.4567 lr:0.93 dt:36ms tok/s:1828144 rem:401s step 5673 (33%) loss:3.4558 lr:0.93 dt:35ms tok/s:1878852 rem:401s step 5674 (33%) loss:3.4430 lr:0.93 dt:35ms tok/s:1877748 rem:401s step 5675 (33%) loss:3.4312 lr:0.93 dt:35ms tok/s:1876377 rem:401s step 5676 (33%) loss:3.4266 lr:0.93 dt:35ms tok/s:1880819 rem:401s step 5677 (33%) loss:3.4374 lr:0.93 dt:35ms tok/s:1884920 rem:401s step 5678 (33%) loss:3.4329 lr:0.93 dt:35ms tok/s:1878993 rem:401s step 5679 (33%) loss:3.4137 lr:0.93 dt:35ms tok/s:1853826 rem:400s step 5680 (33%) loss:3.4178 lr:0.93 dt:35ms tok/s:1851366 rem:400s step 5681 (33%) loss:3.4172 lr:0.93 dt:35ms tok/s:1857007 rem:400s step 5682 (33%) loss:3.4284 lr:0.93 dt:35ms tok/s:1852864 rem:400s step 5683 (33%) loss:3.4263 lr:0.93 dt:35ms tok/s:1853701 rem:400s step 5684 (33%) loss:3.4430 lr:0.93 dt:35ms tok/s:1858979 rem:400s step 5685 (33%) loss:3.4668 lr:0.93 dt:36ms tok/s:1835272 rem:400s step 5686 (33%) loss:3.4676 lr:0.93 dt:36ms tok/s:1824903 rem:400s step 5687 (33%) loss:3.4350 lr:0.93 dt:36ms tok/s:1813861 rem:400s step 5688 (33%) loss:3.4312 lr:0.93 dt:36ms tok/s:1820106 rem:400s step 5689 (33%) loss:3.4009 lr:0.93 dt:36ms tok/s:1822894 rem:400s step 5690 (33%) loss:3.4536 lr:0.93 dt:36ms tok/s:1824855 rem:400s step 5691 (33%) loss:3.4518 lr:0.93 dt:36ms tok/s:1820516 rem:400s step 5692 (33%) loss:3.4587 lr:0.93 dt:36ms tok/s:1818312 rem:400s step 5693 (33%) loss:3.4436 lr:0.93 dt:36ms tok/s:1826346 rem:400s step 5694 (33%) loss:3.4522 lr:0.93 dt:36ms tok/s:1811792 rem:400s step 5695 (33%) loss:3.4279 lr:0.93 dt:36ms tok/s:1819263 rem:400s step 5696 (33%) loss:3.4289 lr:0.93 dt:36ms tok/s:1827937 rem:400s step 5697 (33%) loss:3.4657 lr:0.93 dt:36ms tok/s:1823027 rem:400s step 5698 (33%) loss:3.5104 lr:0.93 dt:36ms tok/s:1820806 rem:400s step 5699 (33%) loss:3.5113 lr:0.93 dt:36ms tok/s:1814951 rem:400s step 5700 (33%) loss:3.4841 lr:0.93 dt:41ms tok/s:1588375 rem:400s + local: attn=[0.058, 0.643, 0.641] mlp=[0.274, 0.153, -0.142] + + transition: attn=[2.267, 0.780] mlp=[-0.088, 0.258] + + hierarchy: attn=[2.546, 5.939, 5.616] mlp=[1.039, -1.364, -4.193] + step 5701 (33%) loss:3.4375 lr:0.93 dt:36ms tok/s:1833828 rem:400s step 5702 (33%) loss:3.4029 lr:0.93 dt:36ms tok/s:1832715 rem:400s step 5703 (33%) loss:3.3698 lr:0.93 dt:36ms tok/s:1842568 rem:400s step 5704 (33%) loss:3.3315 lr:0.93 dt:36ms tok/s:1840458 rem:400s step 5705 (33%) loss:3.2831 lr:0.93 dt:36ms tok/s:1835223 rem:400s step 5706 (33%) loss:3.3156 lr:0.93 dt:36ms tok/s:1845016 rem:400s step 5707 (33%) loss:3.3379 lr:0.93 dt:35ms tok/s:1869180 rem:399s step 5708 (33%) loss:3.3503 lr:0.93 dt:35ms tok/s:1874010 rem:399s step 5709 (33%) loss:3.3630 lr:0.93 dt:35ms tok/s:1871293 rem:399s step 5710 (33%) loss:3.3687 lr:0.93 dt:36ms tok/s:1845338 rem:399s step 5711 (33%) loss:3.3902 lr:0.93 dt:36ms tok/s:1833289 rem:399s step 5712 (33%) loss:3.4153 lr:0.93 dt:36ms tok/s:1827391 rem:399s step 5713 (33%) loss:3.4211 lr:0.93 dt:36ms tok/s:1817014 rem:399s step 5714 (33%) loss:3.4090 lr:0.93 dt:36ms tok/s:1820842 rem:399s step 5715 (33%) loss:3.4160 lr:0.93 dt:36ms tok/s:1825679 rem:399s step 5716 (33%) loss:3.4125 lr:0.93 dt:36ms tok/s:1824504 rem:399s step 5717 (33%) loss:3.4138 lr:0.93 dt:36ms tok/s:1819022 rem:399s step 5718 (33%) loss:3.3918 lr:0.93 dt:36ms tok/s:1819889 rem:399s step 5719 (33%) loss:3.3628 lr:0.93 dt:36ms tok/s:1824395 rem:399s step 5720 (33%) loss:3.3422 lr:0.93 dt:36ms tok/s:1816774 rem:399s step 5721 (33%) loss:3.3354 lr:0.93 dt:36ms tok/s:1824249 rem:399s step 5722 (34%) loss:3.3238 lr:0.93 dt:36ms tok/s:1832654 rem:399s step 5723 (34%) loss:3.3133 lr:0.93 dt:36ms tok/s:1825376 rem:399s step 5724 (34%) loss:3.3327 lr:0.93 dt:36ms tok/s:1826310 rem:399s step 5725 (34%) loss:3.3449 lr:0.93 dt:36ms tok/s:1821626 rem:399s step 5726 (34%) loss:3.3293 lr:0.93 dt:36ms tok/s:1831103 rem:399s step 5727 (34%) loss:3.3698 lr:0.93 dt:36ms tok/s:1820335 rem:399s step 5728 (34%) loss:3.3830 lr:0.93 dt:36ms tok/s:1832361 rem:399s step 5729 (34%) loss:3.3773 lr:0.93 dt:36ms tok/s:1832825 rem:399s step 5730 (34%) loss:3.3786 lr:0.93 dt:36ms tok/s:1819456 rem:399s step 5731 (34%) loss:3.3713 lr:0.93 dt:36ms tok/s:1831347 rem:399s step 5732 (34%) loss:3.3912 lr:0.93 dt:36ms tok/s:1825049 rem:399s step 5733 (34%) loss:3.3987 lr:0.93 dt:36ms tok/s:1819950 rem:399s step 5734 (34%) loss:3.4339 lr:0.93 dt:36ms tok/s:1823777 rem:399s step 5735 (34%) loss:3.4274 lr:0.93 dt:36ms tok/s:1823983 rem:398s step 5736 (34%) loss:3.4633 lr:0.93 dt:36ms tok/s:1825109 rem:398s step 5737 (34%) loss:3.4981 lr:0.93 dt:36ms tok/s:1827087 rem:398s step 5738 (34%) loss:3.5058 lr:0.93 dt:36ms tok/s:1818348 rem:398s step 5739 (34%) loss:3.4806 lr:0.93 dt:36ms tok/s:1826334 rem:398s step 5740 (34%) loss:3.4750 lr:0.93 dt:36ms tok/s:1827439 rem:398s step 5741 (34%) loss:3.4961 lr:0.93 dt:36ms tok/s:1822375 rem:398s step 5742 (34%) loss:3.4934 lr:0.93 dt:36ms tok/s:1824019 rem:398s step 5743 (34%) loss:3.4903 lr:0.93 dt:36ms tok/s:1827974 rem:398s step 5744 (34%) loss:3.4793 lr:0.93 dt:36ms tok/s:1819311 rem:398s step 5745 (34%) loss:3.4613 lr:0.93 dt:36ms tok/s:1811136 rem:398s step 5746 (34%) loss:3.4684 lr:0.93 dt:37ms tok/s:1776558 rem:398s step 5747 (34%) loss:3.4624 lr:0.93 dt:36ms tok/s:1814459 rem:398s step 5748 (34%) loss:3.4502 lr:0.93 dt:36ms tok/s:1819913 rem:398s step 5749 (34%) loss:3.4524 lr:0.93 dt:36ms tok/s:1820914 rem:398s step 5750 (34%) loss:3.4558 lr:0.93 dt:36ms tok/s:1825752 rem:398s step 5751 (34%) loss:3.4420 lr:0.93 dt:36ms tok/s:1820118 rem:398s step 5752 (34%) loss:3.4426 lr:0.93 dt:36ms tok/s:1814627 rem:398s step 5753 (34%) loss:3.4402 lr:0.93 dt:36ms tok/s:1817855 rem:398s step 5754 (34%) loss:3.4203 lr:0.93 dt:36ms tok/s:1817795 rem:398s step 5755 (34%) loss:3.4129 lr:0.93 dt:36ms tok/s:1821131 rem:398s step 5756 (34%) loss:3.3900 lr:0.93 dt:36ms tok/s:1826310 rem:398s step 5757 (34%) loss:3.3882 lr:0.93 dt:36ms tok/s:1822302 rem:398s step 5758 (34%) loss:3.4200 lr:0.93 dt:36ms tok/s:1819901 rem:398s step 5759 (34%) loss:3.4282 lr:0.93 dt:36ms tok/s:1826407 rem:398s step 5760 (34%) loss:3.4685 lr:0.93 dt:37ms tok/s:1761248 rem:398s step 5761 (34%) loss:3.4682 lr:0.93 dt:37ms tok/s:1755769 rem:398s step 5762 (34%) loss:3.4543 lr:0.93 dt:37ms tok/s:1773051 rem:398s step 5763 (34%) loss:3.4354 lr:0.93 dt:37ms tok/s:1777569 rem:397s step 5764 (34%) loss:3.4417 lr:0.93 dt:37ms tok/s:1774253 rem:397s step 5765 (34%) loss:3.4289 lr:0.93 dt:37ms tok/s:1771337 rem:397s step 5766 (34%) loss:3.4303 lr:0.93 dt:37ms tok/s:1779341 rem:397s step 5767 (34%) loss:3.4050 lr:0.93 dt:37ms tok/s:1776477 rem:397s step 5768 (34%) loss:3.3996 lr:0.93 dt:37ms tok/s:1776432 rem:397s step 5769 (34%) loss:3.4027 lr:0.93 dt:37ms tok/s:1770710 rem:397s step 5770 (34%) loss:3.4087 lr:0.93 dt:37ms tok/s:1773589 rem:397s step 5771 (34%) loss:3.3816 lr:0.93 dt:37ms tok/s:1769160 rem:397s step 5772 (34%) loss:3.3930 lr:0.93 dt:37ms tok/s:1773566 rem:397s step 5773 (34%) loss:3.4052 lr:0.93 dt:37ms tok/s:1777304 rem:397s step 5774 (34%) loss:3.3931 lr:0.93 dt:37ms tok/s:1771372 rem:397s step 5775 (34%) loss:3.3943 lr:0.93 dt:37ms tok/s:1782630 rem:397s step 5776 (34%) loss:3.4067 lr:0.93 dt:36ms tok/s:1826759 rem:397s step 5777 (34%) loss:3.4437 lr:0.93 dt:36ms tok/s:1821107 rem:397s step 5778 (34%) loss:3.4430 lr:0.93 dt:36ms tok/s:1815898 rem:397s step 5779 (34%) loss:3.4117 lr:0.93 dt:36ms tok/s:1814316 rem:397s step 5780 (34%) loss:3.4108 lr:0.93 dt:36ms tok/s:1817903 rem:397s step 5781 (34%) loss:3.3974 lr:0.93 dt:36ms tok/s:1821023 rem:397s step 5782 (34%) loss:3.3873 lr:0.93 dt:36ms tok/s:1827354 rem:397s step 5783 (34%) loss:3.4067 lr:0.93 dt:36ms tok/s:1822943 rem:397s step 5784 (34%) loss:3.4048 lr:0.93 dt:36ms tok/s:1810039 rem:397s step 5785 (34%) loss:3.4050 lr:0.93 dt:36ms tok/s:1816978 rem:397s step 5786 (34%) loss:3.4003 lr:0.93 dt:36ms tok/s:1813645 rem:397s step 5787 (34%) loss:3.4035 lr:0.93 dt:36ms tok/s:1812294 rem:397s step 5788 (34%) loss:3.4069 lr:0.93 dt:36ms tok/s:1816030 rem:397s step 5789 (34%) loss:3.4336 lr:0.93 dt:36ms tok/s:1811661 rem:397s step 5790 (34%) loss:3.4404 lr:0.93 dt:36ms tok/s:1819155 rem:396s step 5791 (34%) loss:3.4317 lr:0.93 dt:36ms tok/s:1815622 rem:396s step 5792 (34%) loss:3.4482 lr:0.93 dt:36ms tok/s:1817098 rem:396s step 5793 (34%) loss:3.4327 lr:0.93 dt:36ms tok/s:1815694 rem:396s step 5794 (34%) loss:3.4373 lr:0.93 dt:36ms tok/s:1816906 rem:396s step 5795 (34%) loss:3.4183 lr:0.93 dt:37ms tok/s:1767829 rem:396s step 5796 (34%) loss:3.4104 lr:0.93 dt:36ms tok/s:1817242 rem:396s step 5797 (34%) loss:3.4096 lr:0.93 dt:36ms tok/s:1815718 rem:396s step 5798 (34%) loss:3.4038 lr:0.93 dt:36ms tok/s:1818252 rem:396s step 5799 (34%) loss:3.3987 lr:0.93 dt:36ms tok/s:1816138 rem:396s step 5800 (34%) loss:3.3817 lr:0.93 dt:36ms tok/s:1812677 rem:396s + local: attn=[0.065, 0.707, 0.673] mlp=[0.298, 0.161, -0.142] + + transition: attn=[2.286, 0.781] mlp=[-0.111, 0.246] + + hierarchy: attn=[2.598, 5.939, 5.616] mlp=[1.030, -1.400, -4.214] + step 5801 (34%) loss:3.3694 lr:0.93 dt:36ms tok/s:1818649 rem:396s step 5802 (34%) loss:3.3673 lr:0.93 dt:36ms tok/s:1816678 rem:396s step 5803 (34%) loss:3.3738 lr:0.93 dt:36ms tok/s:1810194 rem:396s step 5804 (34%) loss:3.3769 lr:0.93 dt:36ms tok/s:1813849 rem:396s step 5805 (34%) loss:3.3684 lr:0.93 dt:36ms tok/s:1815023 rem:396s step 5806 (34%) loss:3.3542 lr:0.93 dt:36ms tok/s:1817747 rem:396s step 5807 (34%) loss:3.3470 lr:0.93 dt:36ms tok/s:1814340 rem:396s step 5808 (34%) loss:3.3320 lr:0.93 dt:36ms tok/s:1815622 rem:396s step 5809 (34%) loss:3.3345 lr:0.93 dt:36ms tok/s:1816990 rem:396s step 5810 (34%) loss:3.3296 lr:0.93 dt:36ms tok/s:1811972 rem:396s step 5811 (34%) loss:3.3512 lr:0.93 dt:36ms tok/s:1809431 rem:396s step 5812 (34%) loss:3.3624 lr:0.93 dt:36ms tok/s:1812199 rem:396s step 5813 (34%) loss:3.3915 lr:0.93 dt:36ms tok/s:1813980 rem:396s step 5814 (34%) loss:3.3854 lr:0.93 dt:36ms tok/s:1810039 rem:396s step 5815 (34%) loss:3.3955 lr:0.93 dt:36ms tok/s:1816126 rem:396s step 5816 (34%) loss:3.3635 lr:0.93 dt:36ms tok/s:1811697 rem:396s step 5817 (34%) loss:3.3513 lr:0.93 dt:36ms tok/s:1813466 rem:396s step 5818 (34%) loss:3.3541 lr:0.93 dt:36ms tok/s:1815430 rem:395s step 5819 (34%) loss:3.3598 lr:0.93 dt:36ms tok/s:1812306 rem:395s step 5820 (34%) loss:3.3635 lr:0.93 dt:36ms tok/s:1806197 rem:395s step 5821 (34%) loss:3.3661 lr:0.93 dt:36ms tok/s:1810408 rem:395s step 5822 (34%) loss:3.3632 lr:0.93 dt:36ms tok/s:1816150 rem:395s step 5823 (34%) loss:3.3582 lr:0.93 dt:36ms tok/s:1817447 rem:395s step 5824 (34%) loss:3.3580 lr:0.93 dt:36ms tok/s:1821831 rem:395s step 5825 (34%) loss:3.3591 lr:0.93 dt:36ms tok/s:1816810 rem:395s step 5826 (34%) loss:3.3606 lr:0.92 dt:36ms tok/s:1813047 rem:395s step 5827 (34%) loss:3.3496 lr:0.92 dt:36ms tok/s:1804004 rem:395s step 5828 (34%) loss:3.3562 lr:0.92 dt:37ms tok/s:1769274 rem:395s step 5829 (34%) loss:3.3566 lr:0.92 dt:36ms tok/s:1815059 rem:395s step 5830 (34%) loss:3.3374 lr:0.92 dt:36ms tok/s:1819179 rem:395s step 5831 (34%) loss:3.3183 lr:0.92 dt:36ms tok/s:1817170 rem:395s step 5832 (34%) loss:3.2800 lr:0.92 dt:36ms tok/s:1820950 rem:395s step 5833 (34%) loss:3.2421 lr:0.92 dt:36ms tok/s:1812772 rem:395s step 5834 (34%) loss:3.2833 lr:0.92 dt:36ms tok/s:1816654 rem:395s step 5835 (34%) loss:3.3641 lr:0.92 dt:36ms tok/s:1813574 rem:395s step 5836 (34%) loss:3.3807 lr:0.92 dt:36ms tok/s:1813298 rem:395s step 5837 (34%) loss:3.3996 lr:0.92 dt:36ms tok/s:1810885 rem:395s step 5838 (34%) loss:3.4045 lr:0.92 dt:36ms tok/s:1812593 rem:395s step 5839 (34%) loss:3.3980 lr:0.92 dt:36ms tok/s:1815454 rem:395s step 5840 (34%) loss:3.3996 lr:0.92 dt:36ms tok/s:1818264 rem:395s step 5841 (34%) loss:3.4067 lr:0.92 dt:36ms tok/s:1814459 rem:395s step 5842 (34%) loss:3.4288 lr:0.92 dt:36ms tok/s:1818264 rem:395s step 5843 (34%) loss:3.4264 lr:0.92 dt:36ms tok/s:1814220 rem:395s step 5844 (34%) loss:3.4183 lr:0.92 dt:36ms tok/s:1808788 rem:395s step 5845 (34%) loss:3.4287 lr:0.92 dt:36ms tok/s:1815418 rem:394s step 5846 (34%) loss:3.4294 lr:0.92 dt:36ms tok/s:1821759 rem:394s step 5847 (34%) loss:3.4173 lr:0.92 dt:36ms tok/s:1812366 rem:394s step 5848 (34%) loss:3.4128 lr:0.92 dt:36ms tok/s:1809324 rem:394s step 5849 (34%) loss:3.4582 lr:0.92 dt:36ms tok/s:1838697 rem:394s step 5850 (34%) loss:3.4601 lr:0.92 dt:36ms tok/s:1811709 rem:394s step 5851 (34%) loss:3.4564 lr:0.92 dt:36ms tok/s:1813777 rem:394s step 5852 (34%) loss:3.4959 lr:0.92 dt:36ms tok/s:1813215 rem:394s step 5853 (34%) loss:3.4553 lr:0.92 dt:36ms tok/s:1814376 rem:394s step 5854 (34%) loss:3.4203 lr:0.92 dt:36ms tok/s:1809121 rem:394s step 5855 (34%) loss:3.3997 lr:0.92 dt:36ms tok/s:1808205 rem:394s step 5856 (34%) loss:3.3995 lr:0.92 dt:36ms tok/s:1815059 rem:394s step 5857 (34%) loss:3.4119 lr:0.92 dt:36ms tok/s:1808443 rem:394s step 5858 (34%) loss:3.4067 lr:0.92 dt:36ms tok/s:1815394 rem:394s step 5859 (34%) loss:3.3859 lr:0.92 dt:36ms tok/s:1809133 rem:394s step 5860 (34%) loss:3.4007 lr:0.92 dt:36ms tok/s:1815442 rem:394s step 5861 (34%) loss:3.3867 lr:0.92 dt:36ms tok/s:1804454 rem:394s step 5862 (34%) loss:3.3855 lr:0.92 dt:36ms tok/s:1812366 rem:394s step 5863 (34%) loss:3.3802 lr:0.92 dt:36ms tok/s:1804537 rem:394s step 5864 (34%) loss:3.3849 lr:0.92 dt:36ms tok/s:1801143 rem:394s step 5865 (34%) loss:3.3866 lr:0.92 dt:37ms tok/s:1777442 rem:394s step 5866 (34%) loss:3.3875 lr:0.92 dt:36ms tok/s:1817122 rem:394s step 5867 (34%) loss:3.3925 lr:0.92 dt:36ms tok/s:1814495 rem:394s step 5868 (34%) loss:3.3730 lr:0.92 dt:36ms tok/s:1814747 rem:394s step 5869 (34%) loss:3.3728 lr:0.92 dt:36ms tok/s:1815694 rem:394s step 5870 (34%) loss:3.3770 lr:0.92 dt:36ms tok/s:1814280 rem:394s step 5871 (34%) loss:3.3795 lr:0.92 dt:36ms tok/s:1810003 rem:394s step 5872 (34%) loss:3.3880 lr:0.92 dt:36ms tok/s:1808312 rem:394s step 5873 (34%) loss:3.4099 lr:0.92 dt:36ms tok/s:1813167 rem:393s step 5874 (34%) loss:3.4207 lr:0.92 dt:36ms tok/s:1818481 rem:393s step 5875 (34%) loss:3.4290 lr:0.92 dt:36ms tok/s:1816114 rem:393s step 5876 (34%) loss:3.4064 lr:0.92 dt:36ms tok/s:1814292 rem:393s step 5877 (34%) loss:3.3630 lr:0.92 dt:36ms tok/s:1811804 rem:393s step 5878 (34%) loss:3.3948 lr:0.92 dt:36ms tok/s:1814471 rem:393s step 5879 (34%) loss:3.4121 lr:0.92 dt:36ms tok/s:1806814 rem:393s step 5880 (34%) loss:3.4270 lr:0.92 dt:36ms tok/s:1819179 rem:393s step 5881 (34%) loss:3.4339 lr:0.92 dt:36ms tok/s:1827245 rem:393s step 5882 (34%) loss:3.4420 lr:0.92 dt:36ms tok/s:1817290 rem:393s step 5883 (34%) loss:3.4532 lr:0.92 dt:36ms tok/s:1809896 rem:393s step 5884 (34%) loss:3.4554 lr:0.92 dt:36ms tok/s:1814963 rem:393s step 5885 (34%) loss:3.4670 lr:0.92 dt:36ms tok/s:1815394 rem:393s step 5886 (34%) loss:3.4662 lr:0.92 dt:36ms tok/s:1816594 rem:393s step 5887 (34%) loss:3.4707 lr:0.92 dt:36ms tok/s:1818360 rem:393s step 5888 (35%) loss:3.4681 lr:0.92 dt:36ms tok/s:1812378 rem:393s step 5889 (35%) loss:3.4552 lr:0.92 dt:36ms tok/s:1806755 rem:393s step 5890 (35%) loss:3.4530 lr:0.92 dt:36ms tok/s:1809634 rem:393s step 5891 (35%) loss:3.4373 lr:0.92 dt:36ms tok/s:1806328 rem:393s step 5892 (35%) loss:3.4286 lr:0.92 dt:37ms tok/s:1789803 rem:393s step 5893 (35%) loss:3.4288 lr:0.92 dt:36ms tok/s:1809240 rem:393s step 5894 (35%) loss:3.4276 lr:0.92 dt:36ms tok/s:1818962 rem:393s step 5895 (35%) loss:3.4260 lr:0.92 dt:36ms tok/s:1810372 rem:393s step 5896 (35%) loss:3.4393 lr:0.92 dt:36ms tok/s:1809300 rem:393s step 5897 (35%) loss:3.4428 lr:0.92 dt:36ms tok/s:1814543 rem:393s step 5898 (35%) loss:3.4547 lr:0.92 dt:36ms tok/s:1808360 rem:393s step 5899 (35%) loss:3.4448 lr:0.92 dt:36ms tok/s:1808848 rem:393s step 5900 (35%) loss:3.4475 lr:0.92 dt:36ms tok/s:1812426 rem:393s + local: attn=[0.061, 0.707, 0.669] mlp=[0.309, 0.180, -0.166] + + transition: attn=[2.240, 0.799] mlp=[-0.105, 0.239] + + hierarchy: attn=[2.621, 5.939, 5.616] mlp=[1.048, -1.281, -4.245] + step 5901 (35%) loss:3.4641 lr:0.92 dt:36ms tok/s:1820432 rem:392s step 5902 (35%) loss:3.4760 lr:0.92 dt:36ms tok/s:1813586 rem:392s step 5903 (35%) loss:3.4747 lr:0.92 dt:44ms tok/s:1505933 rem:392s step 5904 (35%) loss:3.4658 lr:0.92 dt:37ms tok/s:1767079 rem:392s step 5905 (35%) loss:3.4704 lr:0.92 dt:34ms tok/s:1903969 rem:392s step 5906 (35%) loss:3.4754 lr:0.92 dt:34ms tok/s:1904167 rem:392s step 5907 (35%) loss:3.4846 lr:0.92 dt:35ms tok/s:1879738 rem:392s step 5908 (35%) loss:3.5086 lr:0.92 dt:35ms tok/s:1861724 rem:392s step 5909 (35%) loss:3.5141 lr:0.92 dt:35ms tok/s:1874444 rem:392s step 5910 (35%) loss:3.5577 lr:0.92 dt:35ms tok/s:1860930 rem:392s step 5911 (35%) loss:3.5461 lr:0.92 dt:35ms tok/s:1876761 rem:392s step 5912 (35%) loss:3.5338 lr:0.92 dt:35ms tok/s:1867555 rem:392s step 5913 (35%) loss:3.5244 lr:0.92 dt:35ms tok/s:1864236 rem:392s step 5914 (35%) loss:3.5219 lr:0.92 dt:36ms tok/s:1843779 rem:392s step 5915 (35%) loss:3.5236 lr:0.92 dt:36ms tok/s:1817254 rem:392s step 5916 (35%) loss:3.5177 lr:0.92 dt:36ms tok/s:1826152 rem:392s step 5917 (35%) loss:3.5029 lr:0.92 dt:36ms tok/s:1823632 rem:392s step 5918 (35%) loss:3.5080 lr:0.92 dt:36ms tok/s:1820709 rem:392s step 5919 (35%) loss:3.5087 lr:0.92 dt:36ms tok/s:1821264 rem:392s step 5920 (35%) loss:3.4883 lr:0.92 dt:36ms tok/s:1824576 rem:392s step 5921 (35%) loss:3.4872 lr:0.92 dt:36ms tok/s:1830152 rem:392s step 5922 (35%) loss:3.5017 lr:0.92 dt:36ms tok/s:1828217 rem:392s step 5923 (35%) loss:3.5068 lr:0.92 dt:36ms tok/s:1823487 rem:392s step 5924 (35%) loss:3.4935 lr:0.92 dt:36ms tok/s:1820721 rem:392s step 5925 (35%) loss:3.4948 lr:0.92 dt:36ms tok/s:1815946 rem:392s step 5926 (35%) loss:3.4851 lr:0.92 dt:36ms tok/s:1823173 rem:392s step 5927 (35%) loss:3.4825 lr:0.92 dt:36ms tok/s:1822049 rem:392s step 5928 (35%) loss:3.4750 lr:0.92 dt:36ms tok/s:1818661 rem:391s step 5929 (35%) loss:3.4666 lr:0.92 dt:36ms tok/s:1820697 rem:391s step 5930 (35%) loss:3.4663 lr:0.92 dt:36ms tok/s:1821469 rem:391s step 5931 (35%) loss:3.4724 lr:0.92 dt:36ms tok/s:1822592 rem:391s step 5932 (35%) loss:3.4663 lr:0.92 dt:36ms tok/s:1820745 rem:391s step 5933 (35%) loss:3.4481 lr:0.92 dt:36ms tok/s:1826468 rem:391s step 5934 (35%) loss:3.4338 lr:0.92 dt:36ms tok/s:1831994 rem:391s step 5935 (35%) loss:3.4273 lr:0.92 dt:36ms tok/s:1819853 rem:391s step 5936 (35%) loss:3.4181 lr:0.92 dt:36ms tok/s:1818228 rem:391s step 5937 (35%) loss:3.4139 lr:0.92 dt:36ms tok/s:1823378 rem:391s step 5938 (35%) loss:3.4003 lr:0.92 dt:36ms tok/s:1820335 rem:391s step 5939 (35%) loss:3.3856 lr:0.92 dt:36ms tok/s:1824358 rem:391s step 5940 (35%) loss:3.4019 lr:0.92 dt:36ms tok/s:1825267 rem:391s step 5941 (35%) loss:3.4033 lr:0.92 dt:36ms tok/s:1829215 rem:391s step 5942 (35%) loss:3.4093 lr:0.92 dt:36ms tok/s:1831469 rem:391s step 5943 (35%) loss:3.4149 lr:0.92 dt:36ms tok/s:1824310 rem:391s step 5944 (35%) loss:3.4051 lr:0.92 dt:37ms tok/s:1756195 rem:391s step 5945 (35%) loss:3.4232 lr:0.92 dt:36ms tok/s:1800259 rem:391s step 5946 (35%) loss:3.4233 lr:0.92 dt:37ms tok/s:1791739 rem:391s step 5947 (35%) loss:3.4298 lr:0.92 dt:36ms tok/s:1821723 rem:391s step 5948 (35%) loss:3.4327 lr:0.92 dt:36ms tok/s:1822943 rem:391s step 5949 (35%) loss:3.4396 lr:0.92 dt:36ms tok/s:1816654 rem:391s step 5950 (35%) loss:3.4479 lr:0.92 dt:36ms tok/s:1822508 rem:391s step 5951 (35%) loss:3.4548 lr:0.92 dt:36ms tok/s:1826710 rem:391s step 5952 (35%) loss:3.4043 lr:0.92 dt:36ms tok/s:1818240 rem:391s step 5953 (35%) loss:3.3943 lr:0.92 dt:36ms tok/s:1814436 rem:391s step 5954 (35%) loss:3.4068 lr:0.92 dt:36ms tok/s:1803093 rem:391s step 5955 (35%) loss:3.4097 lr:0.92 dt:36ms tok/s:1817002 rem:391s step 5956 (35%) loss:3.4225 lr:0.92 dt:36ms tok/s:1818276 rem:390s step 5957 (35%) loss:3.4200 lr:0.92 dt:36ms tok/s:1818348 rem:390s step 5958 (35%) loss:3.4137 lr:0.92 dt:36ms tok/s:1825425 rem:390s step 5959 (35%) loss:3.4035 lr:0.92 dt:36ms tok/s:1820263 rem:390s step 5960 (35%) loss:3.3997 lr:0.92 dt:36ms tok/s:1816954 rem:390s step 5961 (35%) loss:3.4032 lr:0.92 dt:36ms tok/s:1821204 rem:390s step 5962 (35%) loss:3.4375 lr:0.92 dt:36ms tok/s:1818842 rem:390s step 5963 (35%) loss:3.4638 lr:0.92 dt:36ms tok/s:1817303 rem:390s step 5964 (35%) loss:3.4240 lr:0.92 dt:36ms tok/s:1820516 rem:390s step 5965 (35%) loss:3.4327 lr:0.92 dt:36ms tok/s:1823173 rem:390s step 5966 (35%) loss:3.4277 lr:0.92 dt:36ms tok/s:1817976 rem:390s step 5967 (35%) loss:3.4393 lr:0.92 dt:36ms tok/s:1820444 rem:390s step 5968 (35%) loss:3.4238 lr:0.92 dt:36ms tok/s:1816234 rem:390s step 5969 (35%) loss:3.4116 lr:0.92 dt:36ms tok/s:1820191 rem:390s step 5970 (35%) loss:3.4225 lr:0.92 dt:36ms tok/s:1813681 rem:390s step 5971 (35%) loss:3.4352 lr:0.92 dt:36ms tok/s:1810957 rem:390s step 5972 (35%) loss:3.4172 lr:0.92 dt:36ms tok/s:1823233 rem:390s step 5973 (35%) loss:3.4322 lr:0.92 dt:36ms tok/s:1823148 rem:390s step 5974 (35%) loss:3.4312 lr:0.92 dt:36ms tok/s:1809038 rem:390s step 5975 (35%) loss:3.4264 lr:0.92 dt:36ms tok/s:1824056 rem:390s step 5976 (35%) loss:3.4297 lr:0.92 dt:36ms tok/s:1820444 rem:390s step 5977 (35%) loss:3.4373 lr:0.92 dt:36ms tok/s:1828643 rem:390s step 5978 (35%) loss:3.4311 lr:0.92 dt:36ms tok/s:1824576 rem:390s step 5979 (35%) loss:3.3937 lr:0.92 dt:36ms tok/s:1816246 rem:390s step 5980 (35%) loss:3.3961 lr:0.92 dt:36ms tok/s:1815622 rem:390s step 5981 (35%) loss:3.3883 lr:0.92 dt:36ms tok/s:1825861 rem:390s step 5982 (35%) loss:3.3866 lr:0.91 dt:36ms tok/s:1817351 rem:390s step 5983 (35%) loss:3.3802 lr:0.91 dt:36ms tok/s:1821892 rem:390s step 5984 (35%) loss:3.3943 lr:0.91 dt:36ms tok/s:1810599 rem:389s step 5985 (35%) loss:3.3938 lr:0.91 dt:36ms tok/s:1819408 rem:389s step 5986 (35%) loss:3.3771 lr:0.91 dt:36ms tok/s:1819239 rem:389s step 5987 (35%) loss:3.3462 lr:0.91 dt:36ms tok/s:1814831 rem:389s step 5988 (35%) loss:3.3036 lr:0.91 dt:36ms tok/s:1824407 rem:389s step 5989 (35%) loss:3.3146 lr:0.91 dt:36ms tok/s:1825679 rem:389s step 5990 (35%) loss:3.3493 lr:0.91 dt:36ms tok/s:1819215 rem:389s step 5991 (35%) loss:3.3695 lr:0.91 dt:37ms tok/s:1778362 rem:389s step 5992 (35%) loss:3.3718 lr:0.91 dt:36ms tok/s:1822290 rem:389s step 5993 (35%) loss:3.3677 lr:0.91 dt:36ms tok/s:1815394 rem:389s step 5994 (35%) loss:3.3770 lr:0.91 dt:36ms tok/s:1825946 rem:389s step 5995 (35%) loss:3.3959 lr:0.91 dt:36ms tok/s:1820552 rem:389s step 5996 (35%) loss:3.4318 lr:0.91 dt:36ms tok/s:1825522 rem:389s step 5997 (35%) loss:3.4312 lr:0.91 dt:36ms tok/s:1827731 rem:389s step 5998 (35%) loss:3.4362 lr:0.91 dt:36ms tok/s:1824310 rem:389s step 5999 (35%) loss:3.4416 lr:0.91 dt:36ms tok/s:1818757 rem:389s step 6000 (35%) loss:3.4485 lr:0.91 dt:36ms tok/s:1823015 rem:389s + local: attn=[0.068, 0.702, 0.706] mlp=[0.317, 0.168, -0.150] + + transition: attn=[2.224, 0.794] mlp=[-0.098, 0.261] + + hierarchy: attn=[2.582, 5.939, 5.616] mlp=[1.072, -1.251, -4.268] + step 6001 (35%) loss:3.4494 lr:0.91 dt:36ms tok/s:1811649 rem:389s step 6002 (35%) loss:3.4658 lr:0.91 dt:36ms tok/s:1819998 rem:389s step 6003 (35%) loss:3.4607 lr:0.91 dt:36ms tok/s:1829032 rem:389s step 6004 (35%) loss:3.4488 lr:0.91 dt:36ms tok/s:1826929 rem:389s step 6005 (35%) loss:3.4492 lr:0.91 dt:38ms tok/s:1730860 rem:389s step 6006 (35%) loss:3.4484 lr:0.91 dt:36ms tok/s:1822616 rem:389s step 6007 (35%) loss:3.4635 lr:0.91 dt:36ms tok/s:1812832 rem:389s step 6008 (35%) loss:3.4471 lr:0.91 dt:36ms tok/s:1823027 rem:389s step 6009 (35%) loss:3.4046 lr:0.91 dt:36ms tok/s:1821554 rem:389s step 6010 (35%) loss:3.3866 lr:0.91 dt:36ms tok/s:1823463 rem:389s step 6011 (35%) loss:3.3994 lr:0.91 dt:36ms tok/s:1828156 rem:389s step 6012 (35%) loss:3.3947 lr:0.91 dt:36ms tok/s:1812856 rem:388s step 6013 (35%) loss:3.3899 lr:0.91 dt:37ms tok/s:1788615 rem:388s step 6014 (35%) loss:3.4052 lr:0.91 dt:36ms tok/s:1822157 rem:388s step 6015 (35%) loss:3.4166 lr:0.91 dt:36ms tok/s:1823245 rem:388s step 6016 (35%) loss:3.4157 lr:0.91 dt:36ms tok/s:1822919 rem:388s step 6017 (35%) loss:3.4131 lr:0.91 dt:36ms tok/s:1823693 rem:388s step 6018 (35%) loss:3.4317 lr:0.91 dt:36ms tok/s:1814316 rem:388s step 6019 (35%) loss:3.4234 lr:0.91 dt:36ms tok/s:1821843 rem:388s step 6020 (35%) loss:3.4227 lr:0.91 dt:36ms tok/s:1816678 rem:388s step 6021 (35%) loss:3.4167 lr:0.91 dt:36ms tok/s:1813992 rem:388s step 6022 (35%) loss:3.4296 lr:0.91 dt:36ms tok/s:1822085 rem:388s step 6023 (35%) loss:3.4418 lr:0.91 dt:36ms tok/s:1812246 rem:388s step 6024 (35%) loss:3.4415 lr:0.91 dt:36ms tok/s:1806791 rem:388s step 6025 (35%) loss:3.4346 lr:0.91 dt:36ms tok/s:1818902 rem:388s step 6026 (35%) loss:3.4299 lr:0.91 dt:37ms tok/s:1777902 rem:388s step 6027 (35%) loss:3.4006 lr:0.91 dt:36ms tok/s:1800506 rem:388s step 6028 (35%) loss:3.3380 lr:0.91 dt:36ms tok/s:1811005 rem:388s step 6029 (35%) loss:3.2757 lr:0.91 dt:36ms tok/s:1809908 rem:388s step 6030 (35%) loss:3.2792 lr:0.91 dt:36ms tok/s:1809121 rem:388s step 6031 (35%) loss:3.3077 lr:0.91 dt:36ms tok/s:1818517 rem:388s step 6032 (35%) loss:3.3171 lr:0.91 dt:36ms tok/s:1825412 rem:388s step 6033 (35%) loss:3.3374 lr:0.91 dt:36ms tok/s:1810957 rem:388s step 6034 (35%) loss:3.3426 lr:0.91 dt:37ms tok/s:1789162 rem:388s step 6035 (35%) loss:3.3411 lr:0.91 dt:37ms tok/s:1767613 rem:388s step 6036 (35%) loss:3.3506 lr:0.91 dt:36ms tok/s:1810861 rem:388s step 6037 (35%) loss:3.3386 lr:0.91 dt:36ms tok/s:1812569 rem:388s step 6038 (35%) loss:3.3325 lr:0.91 dt:36ms tok/s:1821167 rem:388s step 6039 (35%) loss:3.3419 lr:0.91 dt:36ms tok/s:1820480 rem:387s step 6040 (35%) loss:3.3694 lr:0.91 dt:36ms tok/s:1815562 rem:387s step 6041 (35%) loss:3.3812 lr:0.91 dt:36ms tok/s:1821336 rem:387s step 6042 (35%) loss:3.3865 lr:0.91 dt:36ms tok/s:1813621 rem:387s step 6043 (35%) loss:3.4092 lr:0.91 dt:36ms tok/s:1817375 rem:387s step 6044 (35%) loss:3.3960 lr:0.91 dt:36ms tok/s:1818601 rem:387s step 6045 (35%) loss:3.3934 lr:0.91 dt:36ms tok/s:1812784 rem:387s step 6046 (35%) loss:3.4057 lr:0.91 dt:36ms tok/s:1810313 rem:387s step 6047 (35%) loss:3.4076 lr:0.91 dt:36ms tok/s:1806245 rem:387s step 6048 (35%) loss:3.3972 lr:0.91 dt:36ms tok/s:1810408 rem:387s step 6049 (35%) loss:3.3726 lr:0.91 dt:36ms tok/s:1818589 rem:387s step 6050 (35%) loss:3.3710 lr:0.91 dt:36ms tok/s:1813657 rem:387s step 6051 (35%) loss:3.3929 lr:0.91 dt:36ms tok/s:1807979 rem:387s step 6052 (35%) loss:3.4047 lr:0.91 dt:36ms tok/s:1823233 rem:387s step 6053 (35%) loss:3.3929 lr:0.91 dt:36ms tok/s:1810468 rem:387s step 6054 (36%) loss:3.4719 lr:0.91 dt:36ms tok/s:1814004 rem:387s step 6055 (36%) loss:3.5549 lr:0.91 dt:36ms tok/s:1812390 rem:387s step 6056 (36%) loss:3.6112 lr:0.91 dt:36ms tok/s:1814699 rem:387s step 6057 (36%) loss:3.6656 lr:0.91 dt:36ms tok/s:1815502 rem:387s step 6058 (36%) loss:3.6986 lr:0.91 dt:36ms tok/s:1809240 rem:387s step 6059 (36%) loss:3.7227 lr:0.91 dt:36ms tok/s:1814615 rem:387s step 6060 (36%) loss:3.7382 lr:0.91 dt:36ms tok/s:1814376 rem:387s step 6061 (36%) loss:3.7412 lr:0.91 dt:36ms tok/s:1811506 rem:387s step 6062 (36%) loss:3.7311 lr:0.91 dt:36ms tok/s:1813310 rem:387s step 6063 (36%) loss:3.7182 lr:0.91 dt:36ms tok/s:1818782 rem:387s step 6064 (36%) loss:3.7027 lr:0.91 dt:36ms tok/s:1822508 rem:387s step 6065 (36%) loss:3.6671 lr:0.91 dt:36ms tok/s:1815298 rem:387s step 6066 (36%) loss:3.6533 lr:0.91 dt:36ms tok/s:1809908 rem:387s step 6067 (36%) loss:3.6095 lr:0.91 dt:36ms tok/s:1805675 rem:386s step 6068 (36%) loss:3.6308 lr:0.91 dt:36ms tok/s:1812318 rem:386s step 6069 (36%) loss:3.6238 lr:0.91 dt:36ms tok/s:1819974 rem:386s step 6070 (36%) loss:3.6054 lr:0.91 dt:36ms tok/s:1810313 rem:386s step 6071 (36%) loss:3.5948 lr:0.91 dt:36ms tok/s:1806102 rem:386s step 6072 (36%) loss:3.5773 lr:0.91 dt:37ms tok/s:1793786 rem:386s step 6073 (36%) loss:3.5488 lr:0.91 dt:36ms tok/s:1818782 rem:386s step 6074 (36%) loss:3.5212 lr:0.91 dt:36ms tok/s:1819793 rem:386s step 6075 (36%) loss:3.5121 lr:0.91 dt:36ms tok/s:1812258 rem:386s step 6076 (36%) loss:3.5090 lr:0.91 dt:36ms tok/s:1812772 rem:386s step 6077 (36%) loss:3.5073 lr:0.91 dt:36ms tok/s:1818084 rem:386s step 6078 (36%) loss:3.5134 lr:0.91 dt:36ms tok/s:1818986 rem:386s step 6079 (36%) loss:3.5077 lr:0.91 dt:36ms tok/s:1813992 rem:386s step 6080 (36%) loss:3.6400 lr:0.91 dt:36ms tok/s:1809217 rem:386s step 6081 (36%) loss:3.6925 lr:0.91 dt:36ms tok/s:1808752 rem:386s step 6082 (36%) loss:3.6815 lr:0.91 dt:36ms tok/s:1808550 rem:386s step 6083 (36%) loss:3.6720 lr:0.91 dt:36ms tok/s:1813023 rem:386s step 6084 (36%) loss:3.6524 lr:0.91 dt:36ms tok/s:1822677 rem:386s step 6085 (36%) loss:3.6364 lr:0.91 dt:36ms tok/s:1811936 rem:386s step 6086 (36%) loss:3.5992 lr:0.91 dt:36ms tok/s:1813645 rem:386s step 6087 (36%) loss:3.5986 lr:0.91 dt:37ms tok/s:1783844 rem:386s step 6088 (36%) loss:3.6009 lr:0.91 dt:36ms tok/s:1820275 rem:386s step 6089 (36%) loss:3.5573 lr:0.91 dt:37ms tok/s:1793423 rem:386s step 6090 (36%) loss:3.5523 lr:0.91 dt:36ms tok/s:1819034 rem:386s step 6091 (36%) loss:3.5582 lr:0.91 dt:36ms tok/s:1808990 rem:386s step 6092 (36%) loss:3.5372 lr:0.91 dt:36ms tok/s:1823378 rem:386s step 6093 (36%) loss:3.5322 lr:0.91 dt:36ms tok/s:1821288 rem:386s step 6094 (36%) loss:3.5234 lr:0.91 dt:43ms tok/s:1515122 rem:385s step 6095 (36%) loss:3.4947 lr:0.91 dt:36ms tok/s:1813358 rem:385s step 6096 (36%) loss:3.4588 lr:0.91 dt:36ms tok/s:1815898 rem:385s step 6097 (36%) loss:3.4588 lr:0.91 dt:36ms tok/s:1831055 rem:385s step 6098 (36%) loss:3.4533 lr:0.91 dt:36ms tok/s:1838304 rem:385s step 6099 (36%) loss:3.4370 lr:0.91 dt:35ms tok/s:1865438 rem:385s step 6100 (36%) loss:3.4061 lr:0.91 dt:35ms tok/s:1877864 rem:385s + local: attn=[0.074, 0.715, 0.675] mlp=[0.321, 0.178, -0.167] + + transition: attn=[2.241, 0.796] mlp=[-0.109, 0.281] + + hierarchy: attn=[2.656, 5.939, 5.616] mlp=[1.072, -1.174, -4.264] + step 6101 (36%) loss:3.3768 lr:0.91 dt:36ms tok/s:1845537 rem:385s step 6102 (36%) loss:3.3093 lr:0.91 dt:35ms tok/s:1865261 rem:385s step 6103 (36%) loss:3.2610 lr:0.91 dt:35ms tok/s:1883383 rem:385s step 6104 (36%) loss:3.1966 lr:0.91 dt:35ms tok/s:1877569 rem:385s step 6105 (36%) loss:3.1849 lr:0.91 dt:35ms tok/s:1887392 rem:385s step 6106 (36%) loss:3.2238 lr:0.91 dt:35ms tok/s:1877299 rem:385s step 6107 (36%) loss:3.2471 lr:0.91 dt:35ms tok/s:1887146 rem:385s step 6108 (36%) loss:3.2732 lr:0.91 dt:35ms tok/s:1883874 rem:385s step 6109 (36%) loss:3.2999 lr:0.91 dt:35ms tok/s:1883048 rem:385s step 6110 (36%) loss:3.3212 lr:0.91 dt:35ms tok/s:1883577 rem:385s step 6111 (36%) loss:3.3394 lr:0.91 dt:35ms tok/s:1883048 rem:385s step 6112 (36%) loss:3.3652 lr:0.91 dt:35ms tok/s:1869485 rem:385s step 6113 (36%) loss:3.3872 lr:0.91 dt:35ms tok/s:1869727 rem:385s step 6114 (36%) loss:3.4015 lr:0.91 dt:35ms tok/s:1868164 rem:385s step 6115 (36%) loss:3.4122 lr:0.91 dt:36ms tok/s:1832031 rem:385s step 6116 (36%) loss:3.4303 lr:0.91 dt:35ms tok/s:1869371 rem:385s step 6117 (36%) loss:3.4424 lr:0.91 dt:35ms tok/s:1873576 rem:385s step 6118 (36%) loss:3.4391 lr:0.91 dt:35ms tok/s:1876069 rem:385s step 6119 (36%) loss:3.4138 lr:0.91 dt:35ms tok/s:1853564 rem:385s step 6120 (36%) loss:3.4030 lr:0.91 dt:35ms tok/s:1855190 rem:385s step 6121 (36%) loss:3.4185 lr:0.91 dt:35ms tok/s:1847732 rem:385s step 6122 (36%) loss:3.4126 lr:0.91 dt:35ms tok/s:1852851 rem:385s step 6123 (36%) loss:3.4044 lr:0.91 dt:35ms tok/s:1857735 rem:384s step 6124 (36%) loss:3.4209 lr:0.91 dt:35ms tok/s:1848863 rem:384s step 6125 (36%) loss:3.4060 lr:0.91 dt:35ms tok/s:1846466 rem:384s step 6126 (36%) loss:3.3710 lr:0.91 dt:35ms tok/s:1850693 rem:384s step 6127 (36%) loss:3.3214 lr:0.91 dt:35ms tok/s:1854977 rem:384s step 6128 (36%) loss:3.2779 lr:0.91 dt:36ms tok/s:1832043 rem:384s step 6129 (36%) loss:3.2186 lr:0.91 dt:35ms tok/s:1850008 rem:384s step 6130 (36%) loss:3.1556 lr:0.91 dt:36ms tok/s:1844744 rem:384s step 6131 (36%) loss:3.1680 lr:0.90 dt:35ms tok/s:1851391 rem:384s step 6132 (36%) loss:3.2423 lr:0.90 dt:35ms tok/s:1851404 rem:384s step 6133 (36%) loss:3.2718 lr:0.90 dt:35ms tok/s:1856255 rem:384s step 6134 (36%) loss:3.3036 lr:0.90 dt:35ms tok/s:1853214 rem:384s step 6135 (36%) loss:3.3355 lr:0.90 dt:36ms tok/s:1841518 rem:384s step 6136 (36%) loss:3.3580 lr:0.90 dt:35ms tok/s:1847794 rem:384s step 6137 (36%) loss:3.3691 lr:0.90 dt:36ms tok/s:1832605 rem:384s step 6138 (36%) loss:3.3672 lr:0.90 dt:36ms tok/s:1821928 rem:384s step 6139 (36%) loss:3.3805 lr:0.90 dt:36ms tok/s:1818012 rem:384s step 6140 (36%) loss:3.3825 lr:0.90 dt:36ms tok/s:1819805 rem:384s step 6141 (36%) loss:3.3929 lr:0.90 dt:36ms tok/s:1819793 rem:384s step 6142 (36%) loss:3.4249 lr:0.90 dt:36ms tok/s:1818072 rem:384s step 6143 (36%) loss:3.5160 lr:0.90 dt:36ms tok/s:1822822 rem:384s step 6144 (36%) loss:3.5646 lr:0.90 dt:36ms tok/s:1812354 rem:384s step 6145 (36%) loss:3.5585 lr:0.90 dt:36ms tok/s:1817339 rem:384s step 6146 (36%) loss:3.5536 lr:0.90 dt:36ms tok/s:1823947 rem:384s step 6147 (36%) loss:3.5142 lr:0.90 dt:36ms tok/s:1824722 rem:384s step 6148 (36%) loss:3.4778 lr:0.90 dt:36ms tok/s:1828521 rem:384s step 6149 (36%) loss:3.4813 lr:0.90 dt:36ms tok/s:1823898 rem:384s step 6150 (36%) loss:3.4693 lr:0.90 dt:36ms tok/s:1816246 rem:384s step 6151 (36%) loss:3.4239 lr:0.90 dt:36ms tok/s:1825012 rem:383s step 6152 (36%) loss:3.4352 lr:0.90 dt:36ms tok/s:1823366 rem:383s step 6153 (36%) loss:3.4398 lr:0.90 dt:37ms tok/s:1793856 rem:383s step 6154 (36%) loss:3.4534 lr:0.90 dt:36ms tok/s:1798904 rem:383s step 6155 (36%) loss:3.4582 lr:0.90 dt:36ms tok/s:1822302 rem:383s step 6156 (36%) loss:3.4562 lr:0.90 dt:36ms tok/s:1822713 rem:383s step 6157 (36%) loss:3.4334 lr:0.90 dt:36ms tok/s:1825376 rem:383s step 6158 (36%) loss:3.4370 lr:0.90 dt:36ms tok/s:1823124 rem:383s step 6159 (36%) loss:3.4357 lr:0.90 dt:36ms tok/s:1832654 rem:383s step 6160 (36%) loss:3.4436 lr:0.90 dt:36ms tok/s:1825994 rem:383s step 6161 (36%) loss:3.4436 lr:0.90 dt:36ms tok/s:1824928 rem:383s step 6162 (36%) loss:3.4323 lr:0.90 dt:36ms tok/s:1817819 rem:383s step 6163 (36%) loss:3.4239 lr:0.90 dt:36ms tok/s:1820323 rem:383s step 6164 (36%) loss:3.4268 lr:0.90 dt:36ms tok/s:1822665 rem:383s step 6165 (36%) loss:3.4263 lr:0.90 dt:36ms tok/s:1827476 rem:383s step 6166 (36%) loss:3.4402 lr:0.90 dt:36ms tok/s:1806921 rem:383s step 6167 (36%) loss:3.4326 lr:0.90 dt:36ms tok/s:1829665 rem:383s step 6168 (36%) loss:3.4157 lr:0.90 dt:36ms tok/s:1821192 rem:383s step 6169 (36%) loss:3.4205 lr:0.90 dt:36ms tok/s:1821868 rem:383s step 6170 (36%) loss:3.4463 lr:0.90 dt:36ms tok/s:1825703 rem:383s step 6171 (36%) loss:3.4631 lr:0.90 dt:36ms tok/s:1821759 rem:383s step 6172 (36%) loss:3.4738 lr:0.90 dt:36ms tok/s:1822025 rem:383s step 6173 (36%) loss:3.4734 lr:0.90 dt:36ms tok/s:1817735 rem:383s step 6174 (36%) loss:3.4717 lr:0.90 dt:36ms tok/s:1813442 rem:383s step 6175 (36%) loss:3.4732 lr:0.90 dt:36ms tok/s:1796130 rem:383s step 6176 (36%) loss:3.4622 lr:0.90 dt:36ms tok/s:1816702 rem:383s step 6177 (36%) loss:3.4543 lr:0.90 dt:36ms tok/s:1817447 rem:383s step 6178 (36%) loss:3.4466 lr:0.90 dt:36ms tok/s:1822629 rem:383s step 6179 (36%) loss:3.4317 lr:0.90 dt:36ms tok/s:1802620 rem:382s step 6180 (36%) loss:3.4475 lr:0.90 dt:36ms tok/s:1812533 rem:382s step 6181 (36%) loss:3.4506 lr:0.90 dt:36ms tok/s:1824334 rem:382s step 6182 (36%) loss:3.4266 lr:0.90 dt:36ms tok/s:1824831 rem:382s step 6183 (36%) loss:3.3839 lr:0.90 dt:36ms tok/s:1820408 rem:382s step 6184 (36%) loss:3.3410 lr:0.90 dt:36ms tok/s:1817531 rem:382s step 6185 (36%) loss:3.3438 lr:0.90 dt:36ms tok/s:1815586 rem:382s step 6186 (36%) loss:3.3609 lr:0.90 dt:36ms tok/s:1813681 rem:382s step 6187 (36%) loss:3.3710 lr:0.90 dt:36ms tok/s:1815490 rem:382s step 6188 (36%) loss:3.3819 lr:0.90 dt:36ms tok/s:1816582 rem:382s step 6189 (36%) loss:3.3927 lr:0.90 dt:36ms tok/s:1811888 rem:382s step 6190 (36%) loss:3.4063 lr:0.90 dt:36ms tok/s:1810349 rem:382s step 6191 (36%) loss:3.4050 lr:0.90 dt:36ms tok/s:1813167 rem:382s step 6192 (36%) loss:3.4043 lr:0.90 dt:36ms tok/s:1810480 rem:382s step 6193 (36%) loss:3.3940 lr:0.90 dt:36ms tok/s:1812354 rem:382s step 6194 (36%) loss:3.4086 lr:0.90 dt:36ms tok/s:1815166 rem:382s step 6195 (36%) loss:3.4209 lr:0.90 dt:36ms tok/s:1820335 rem:382s step 6196 (36%) loss:3.4109 lr:0.90 dt:36ms tok/s:1813693 rem:382s step 6197 (36%) loss:3.4177 lr:0.90 dt:37ms tok/s:1786337 rem:382s step 6198 (36%) loss:3.4101 lr:0.90 dt:37ms tok/s:1794863 rem:382s step 6199 (36%) loss:3.4166 lr:0.90 dt:36ms tok/s:1806161 rem:382s step 6200 (36%) loss:3.4201 lr:0.90 dt:36ms tok/s:1815586 rem:382s + local: attn=[0.058, 0.712, 0.719] mlp=[0.316, 0.170, -0.181] + + transition: attn=[2.316, 0.804] mlp=[-0.104, 0.269] + + hierarchy: attn=[2.636, 5.939, 5.616] mlp=[1.070, -1.155, -4.206] + step 6201 (36%) loss:3.4128 lr:0.90 dt:36ms tok/s:1817976 rem:382s step 6202 (36%) loss:3.4172 lr:0.90 dt:36ms tok/s:1812748 rem:382s step 6203 (36%) loss:3.4317 lr:0.90 dt:36ms tok/s:1814675 rem:382s step 6204 (36%) loss:3.4193 lr:0.90 dt:36ms tok/s:1816642 rem:382s step 6205 (36%) loss:3.4263 lr:0.90 dt:36ms tok/s:1803306 rem:382s step 6206 (36%) loss:3.4405 lr:0.90 dt:36ms tok/s:1803353 rem:381s step 6207 (36%) loss:3.4398 lr:0.90 dt:36ms tok/s:1804655 rem:381s step 6208 (36%) loss:3.4494 lr:0.90 dt:36ms tok/s:1816354 rem:381s step 6209 (36%) loss:3.4491 lr:0.90 dt:36ms tok/s:1810993 rem:381s step 6210 (36%) loss:3.4675 lr:0.90 dt:36ms tok/s:1815526 rem:381s step 6211 (36%) loss:3.4896 lr:0.90 dt:36ms tok/s:1813945 rem:381s step 6212 (36%) loss:3.4941 lr:0.90 dt:36ms tok/s:1812820 rem:381s step 6213 (36%) loss:3.4772 lr:0.90 dt:36ms tok/s:1807634 rem:381s step 6214 (36%) loss:3.4693 lr:0.90 dt:36ms tok/s:1809443 rem:381s step 6215 (36%) loss:3.4482 lr:0.90 dt:36ms tok/s:1804194 rem:381s step 6216 (36%) loss:3.4193 lr:0.90 dt:36ms tok/s:1814052 rem:381s step 6217 (36%) loss:3.4360 lr:0.90 dt:36ms tok/s:1811852 rem:381s step 6218 (36%) loss:3.4407 lr:0.90 dt:36ms tok/s:1812832 rem:381s step 6219 (36%) loss:3.4445 lr:0.90 dt:36ms tok/s:1816042 rem:381s step 6220 (36%) loss:3.4468 lr:0.90 dt:36ms tok/s:1802194 rem:381s step 6221 (37%) loss:3.4305 lr:0.90 dt:36ms tok/s:1817927 rem:381s step 6222 (37%) loss:3.4285 lr:0.90 dt:36ms tok/s:1812987 rem:381s step 6223 (37%) loss:3.4202 lr:0.90 dt:36ms tok/s:1813191 rem:381s step 6224 (37%) loss:3.4204 lr:0.90 dt:36ms tok/s:1810623 rem:381s step 6225 (37%) loss:3.4043 lr:0.90 dt:36ms tok/s:1804892 rem:381s step 6226 (37%) loss:3.4088 lr:0.90 dt:36ms tok/s:1817471 rem:381s step 6227 (37%) loss:3.4141 lr:0.90 dt:36ms tok/s:1812043 rem:381s step 6228 (37%) loss:3.4244 lr:0.90 dt:36ms tok/s:1813394 rem:381s step 6229 (37%) loss:3.4297 lr:0.90 dt:36ms tok/s:1811637 rem:381s step 6230 (37%) loss:3.4390 lr:0.90 dt:36ms tok/s:1808812 rem:381s step 6231 (37%) loss:3.4339 lr:0.90 dt:36ms tok/s:1811148 rem:381s step 6232 (37%) loss:3.4284 lr:0.90 dt:36ms tok/s:1807908 rem:381s step 6233 (37%) loss:3.4285 lr:0.90 dt:36ms tok/s:1796952 rem:381s step 6234 (37%) loss:3.4516 lr:0.90 dt:36ms tok/s:1820396 rem:380s step 6235 (37%) loss:3.4729 lr:0.90 dt:36ms tok/s:1796142 rem:380s step 6236 (37%) loss:3.4657 lr:0.90 dt:36ms tok/s:1812940 rem:380s step 6237 (37%) loss:3.4537 lr:0.90 dt:36ms tok/s:1814975 rem:380s step 6238 (37%) loss:3.4740 lr:0.90 dt:36ms tok/s:1810289 rem:380s step 6239 (37%) loss:3.4777 lr:0.90 dt:36ms tok/s:1817170 rem:380s step 6240 (37%) loss:3.4898 lr:0.90 dt:36ms tok/s:1813274 rem:380s step 6241 (37%) loss:3.4730 lr:0.90 dt:36ms tok/s:1820323 rem:380s step 6242 (37%) loss:3.4494 lr:0.90 dt:36ms tok/s:1815718 rem:380s step 6243 (37%) loss:3.4426 lr:0.90 dt:36ms tok/s:1809431 rem:380s step 6244 (37%) loss:3.4205 lr:0.90 dt:36ms tok/s:1814603 rem:380s step 6245 (37%) loss:3.4168 lr:0.90 dt:36ms tok/s:1806767 rem:380s step 6246 (37%) loss:3.4064 lr:0.90 dt:36ms tok/s:1810325 rem:380s step 6247 (37%) loss:3.4043 lr:0.90 dt:36ms tok/s:1804987 rem:380s step 6248 (37%) loss:3.3981 lr:0.90 dt:36ms tok/s:1808181 rem:380s step 6249 (37%) loss:3.3671 lr:0.90 dt:36ms tok/s:1806161 rem:380s step 6250 (37%) loss:3.3792 lr:0.90 dt:36ms tok/s:1809991 rem:380s step 6251 (37%) loss:3.3645 lr:0.90 dt:36ms tok/s:1817315 rem:380s step 6252 (37%) loss:3.3613 lr:0.90 dt:36ms tok/s:1811625 rem:380s step 6253 (37%) loss:3.3642 lr:0.90 dt:36ms tok/s:1816042 rem:380s step 6254 (37%) loss:3.3762 lr:0.90 dt:36ms tok/s:1816210 rem:380s step 6255 (37%) loss:3.3852 lr:0.90 dt:36ms tok/s:1812844 rem:380s step 6256 (37%) loss:3.3806 lr:0.90 dt:36ms tok/s:1804833 rem:380s step 6257 (37%) loss:3.3839 lr:0.90 dt:36ms tok/s:1805734 rem:380s step 6258 (37%) loss:3.3770 lr:0.90 dt:36ms tok/s:1805628 rem:380s step 6259 (37%) loss:3.3708 lr:0.90 dt:36ms tok/s:1810718 rem:380s step 6260 (37%) loss:3.3841 lr:0.90 dt:36ms tok/s:1814076 rem:380s step 6261 (37%) loss:3.3873 lr:0.90 dt:36ms tok/s:1811100 rem:380s step 6262 (37%) loss:3.3979 lr:0.90 dt:36ms tok/s:1811673 rem:379s step 6263 (37%) loss:3.4054 lr:0.90 dt:36ms tok/s:1814867 rem:379s step 6264 (37%) loss:3.4069 lr:0.90 dt:36ms tok/s:1798421 rem:379s step 6265 (37%) loss:3.4065 lr:0.90 dt:36ms tok/s:1808824 rem:379s step 6266 (37%) loss:3.4104 lr:0.90 dt:36ms tok/s:1812438 rem:379s step 6267 (37%) loss:3.3795 lr:0.90 dt:36ms tok/s:1815898 rem:379s step 6268 (37%) loss:3.3819 lr:0.90 dt:36ms tok/s:1818854 rem:379s step 6269 (37%) loss:3.3851 lr:0.90 dt:36ms tok/s:1817543 rem:379s step 6270 (37%) loss:3.3916 lr:0.90 dt:36ms tok/s:1820781 rem:379s step 6271 (37%) loss:3.3768 lr:0.90 dt:36ms tok/s:1814699 rem:379s step 6272 (37%) loss:3.3838 lr:0.89 dt:36ms tok/s:1813466 rem:379s step 6273 (37%) loss:3.3915 lr:0.89 dt:36ms tok/s:1813466 rem:379s step 6274 (37%) loss:3.3807 lr:0.89 dt:36ms tok/s:1808110 rem:379s step 6275 (37%) loss:3.3655 lr:0.89 dt:36ms tok/s:1818348 rem:379s step 6276 (37%) loss:3.3775 lr:0.89 dt:36ms tok/s:1814076 rem:379s step 6277 (37%) loss:3.3928 lr:0.89 dt:36ms tok/s:1805533 rem:379s step 6278 (37%) loss:3.3943 lr:0.89 dt:36ms tok/s:1804620 rem:379s step 6279 (37%) loss:3.4032 lr:0.89 dt:36ms tok/s:1809240 rem:379s step 6280 (37%) loss:3.4102 lr:0.89 dt:36ms tok/s:1798374 rem:379s step 6281 (37%) loss:3.4136 lr:0.89 dt:36ms tok/s:1806446 rem:379s step 6282 (37%) loss:3.4114 lr:0.89 dt:36ms tok/s:1815358 rem:379s step 6283 (37%) loss:3.3886 lr:0.89 dt:36ms tok/s:1809526 rem:379s step 6284 (37%) loss:3.4059 lr:0.89 dt:36ms tok/s:1817411 rem:379s step 6285 (37%) loss:3.4284 lr:0.89 dt:36ms tok/s:1815802 rem:379s step 6286 (37%) loss:3.4743 lr:0.89 dt:36ms tok/s:1815346 rem:379s step 6287 (37%) loss:3.4801 lr:0.89 dt:36ms tok/s:1815574 rem:379s step 6288 (37%) loss:3.4666 lr:0.89 dt:36ms tok/s:1812342 rem:379s step 6289 (37%) loss:3.4564 lr:0.89 dt:36ms tok/s:1806791 rem:378s step 6290 (37%) loss:3.4554 lr:0.89 dt:36ms tok/s:1817399 rem:378s step 6291 (37%) loss:3.4343 lr:0.89 dt:36ms tok/s:1822206 rem:378s step 6292 (37%) loss:3.3967 lr:0.89 dt:36ms tok/s:1810182 rem:378s step 6293 (37%) loss:3.4010 lr:0.89 dt:36ms tok/s:1812820 rem:378s step 6294 (37%) loss:3.4076 lr:0.89 dt:36ms tok/s:1824758 rem:378s step 6295 (37%) loss:3.4004 lr:0.89 dt:36ms tok/s:1817699 rem:378s step 6296 (37%) loss:3.3791 lr:0.89 dt:36ms tok/s:1811518 rem:378s step 6297 (37%) loss:3.3683 lr:0.89 dt:36ms tok/s:1803318 rem:378s step 6298 (37%) loss:3.3508 lr:0.89 dt:36ms tok/s:1802584 rem:378s step 6299 (37%) loss:3.3633 lr:0.89 dt:36ms tok/s:1814232 rem:378s step 6300 (37%) loss:3.3589 lr:0.89 dt:36ms tok/s:1813598 rem:378s + local: attn=[0.065, 0.735, 0.689] mlp=[0.328, 0.182, -0.172] + + transition: attn=[2.369, 0.773] mlp=[-0.103, 0.290] + + hierarchy: attn=[2.630, 5.939, 5.616] mlp=[1.094, -1.102, -4.206] + step 6301 (37%) loss:3.3769 lr:0.89 dt:37ms tok/s:1783543 rem:378s step 6302 (37%) loss:3.3857 lr:0.89 dt:36ms tok/s:1823015 rem:378s step 6303 (37%) loss:3.3606 lr:0.89 dt:36ms tok/s:1817411 rem:378s step 6304 (37%) loss:3.3742 lr:0.89 dt:36ms tok/s:1818854 rem:378s step 6305 (37%) loss:3.3771 lr:0.89 dt:36ms tok/s:1815634 rem:378s step 6306 (37%) loss:3.3745 lr:0.89 dt:36ms tok/s:1821783 rem:378s step 6307 (37%) loss:3.3690 lr:0.89 dt:36ms tok/s:1820034 rem:378s step 6308 (37%) loss:3.3669 lr:0.89 dt:37ms tok/s:1791004 rem:378s step 6309 (37%) loss:3.3652 lr:0.89 dt:37ms tok/s:1762015 rem:378s step 6310 (37%) loss:3.3714 lr:0.89 dt:36ms tok/s:1799528 rem:378s step 6311 (37%) loss:3.3869 lr:0.89 dt:37ms tok/s:1791821 rem:378s step 6312 (37%) loss:3.4022 lr:0.89 dt:37ms tok/s:1795191 rem:378s step 6313 (37%) loss:3.3870 lr:0.89 dt:37ms tok/s:1790234 rem:378s step 6314 (37%) loss:3.3971 lr:0.89 dt:36ms tok/s:1807646 rem:378s step 6315 (37%) loss:3.4031 lr:0.89 dt:37ms tok/s:1792791 rem:378s step 6316 (37%) loss:3.4176 lr:0.89 dt:37ms tok/s:1782005 rem:378s step 6317 (37%) loss:3.4371 lr:0.89 dt:37ms tok/s:1793832 rem:377s step 6318 (37%) loss:3.4273 lr:0.89 dt:37ms tok/s:1790689 rem:377s step 6319 (37%) loss:3.4062 lr:0.89 dt:36ms tok/s:1803448 rem:377s step 6320 (37%) loss:3.3880 lr:0.89 dt:36ms tok/s:1798315 rem:377s step 6321 (37%) loss:3.4105 lr:0.89 dt:37ms tok/s:1795156 rem:377s step 6322 (37%) loss:3.4169 lr:0.89 dt:37ms tok/s:1793856 rem:377s step 6323 (37%) loss:3.4124 lr:0.89 dt:36ms tok/s:1798245 rem:377s step 6324 (37%) loss:3.4094 lr:0.89 dt:37ms tok/s:1790502 rem:377s step 6325 (37%) loss:3.3946 lr:0.89 dt:37ms tok/s:1790117 rem:377s step 6326 (37%) loss:3.3726 lr:0.89 dt:36ms tok/s:1797763 rem:377s step 6327 (37%) loss:3.3924 lr:0.89 dt:37ms tok/s:1788720 rem:377s step 6328 (37%) loss:3.3881 lr:0.89 dt:37ms tok/s:1790666 rem:377s step 6329 (37%) loss:3.4025 lr:0.89 dt:37ms tok/s:1792511 rem:377s step 6330 (37%) loss:3.3943 lr:0.89 dt:37ms tok/s:1792253 rem:377s step 6331 (37%) loss:3.3632 lr:0.89 dt:37ms tok/s:1785455 rem:377s step 6332 (37%) loss:3.2952 lr:0.89 dt:37ms tok/s:1789698 rem:377s step 6333 (37%) loss:3.2855 lr:0.89 dt:37ms tok/s:1794922 rem:377s step 6334 (37%) loss:3.3276 lr:0.89 dt:37ms tok/s:1793481 rem:377s step 6335 (37%) loss:3.3666 lr:0.89 dt:37ms tok/s:1785722 rem:377s step 6336 (37%) loss:3.3826 lr:0.89 dt:36ms tok/s:1801250 rem:377s step 6337 (37%) loss:3.4034 lr:0.89 dt:37ms tok/s:1787243 rem:377s step 6338 (37%) loss:3.4148 lr:0.89 dt:37ms tok/s:1794219 rem:377s step 6339 (37%) loss:3.4017 lr:0.89 dt:37ms tok/s:1794184 rem:377s step 6340 (37%) loss:3.3733 lr:0.89 dt:37ms tok/s:1790561 rem:377s step 6341 (37%) loss:3.3908 lr:0.89 dt:36ms tok/s:1796130 rem:377s step 6342 (37%) loss:3.3998 lr:0.89 dt:37ms tok/s:1791716 rem:377s step 6343 (37%) loss:3.3645 lr:0.89 dt:37ms tok/s:1780770 rem:377s step 6344 (37%) loss:3.3242 lr:0.89 dt:37ms tok/s:1793247 rem:376s step 6345 (37%) loss:3.2650 lr:0.89 dt:37ms tok/s:1792195 rem:376s step 6346 (37%) loss:3.2108 lr:0.89 dt:37ms tok/s:1795379 rem:376s step 6347 (37%) loss:3.1862 lr:0.89 dt:37ms tok/s:1792885 rem:376s step 6348 (37%) loss:3.2242 lr:0.89 dt:37ms tok/s:1782630 rem:376s step 6349 (37%) loss:3.2359 lr:0.89 dt:37ms tok/s:1786429 rem:376s step 6350 (37%) loss:3.2669 lr:0.89 dt:37ms tok/s:1790712 rem:376s step 6351 (37%) loss:3.2695 lr:0.89 dt:37ms tok/s:1765093 rem:376s step 6352 (37%) loss:3.3046 lr:0.89 dt:37ms tok/s:1781774 rem:376s step 6353 (37%) loss:3.3213 lr:0.89 dt:37ms tok/s:1767692 rem:376s step 6354 (37%) loss:3.3301 lr:0.89 dt:37ms tok/s:1793364 rem:376s step 6355 (37%) loss:3.3414 lr:0.89 dt:36ms tok/s:1796447 rem:376s step 6356 (37%) loss:3.3537 lr:0.89 dt:37ms tok/s:1793083 rem:376s step 6357 (37%) loss:3.3767 lr:0.89 dt:36ms tok/s:1798351 rem:376s step 6358 (37%) loss:3.3801 lr:0.89 dt:36ms tok/s:1798280 rem:376s step 6359 (37%) loss:3.3775 lr:0.89 dt:36ms tok/s:1796388 rem:376s step 6360 (37%) loss:3.3859 lr:0.89 dt:37ms tok/s:1791237 rem:376s step 6361 (37%) loss:3.3942 lr:0.89 dt:37ms tok/s:1790001 rem:376s step 6362 (37%) loss:3.3905 lr:0.89 dt:37ms tok/s:1788080 rem:376s step 6363 (37%) loss:3.3964 lr:0.89 dt:39ms tok/s:1674706 rem:376s step 6364 (37%) loss:3.4243 lr:0.89 dt:37ms tok/s:1794805 rem:376s step 6365 (37%) loss:3.4356 lr:0.89 dt:36ms tok/s:1830189 rem:376s step 6366 (37%) loss:3.4422 lr:0.89 dt:36ms tok/s:1838501 rem:376s step 6367 (37%) loss:3.4373 lr:0.89 dt:36ms tok/s:1835530 rem:376s step 6368 (37%) loss:3.4327 lr:0.89 dt:36ms tok/s:1832837 rem:376s step 6369 (37%) loss:3.4377 lr:0.89 dt:36ms tok/s:1837468 rem:376s step 6370 (37%) loss:3.4307 lr:0.89 dt:36ms tok/s:1835150 rem:376s step 6371 (37%) loss:3.4303 lr:0.89 dt:36ms tok/s:1839399 rem:376s step 6372 (37%) loss:3.4182 lr:0.89 dt:36ms tok/s:1835468 rem:375s step 6373 (37%) loss:3.4196 lr:0.89 dt:36ms tok/s:1836965 rem:375s step 6374 (37%) loss:3.4192 lr:0.89 dt:36ms tok/s:1833265 rem:375s step 6375 (37%) loss:3.4075 lr:0.89 dt:36ms tok/s:1839103 rem:375s step 6376 (37%) loss:3.4067 lr:0.89 dt:37ms tok/s:1792335 rem:375s step 6377 (37%) loss:3.3769 lr:0.89 dt:36ms tok/s:1831884 rem:375s step 6378 (37%) loss:3.3695 lr:0.89 dt:36ms tok/s:1830652 rem:375s step 6379 (37%) loss:3.3568 lr:0.89 dt:36ms tok/s:1835321 rem:375s step 6380 (37%) loss:3.3786 lr:0.89 dt:36ms tok/s:1830908 rem:375s step 6381 (37%) loss:3.3851 lr:0.89 dt:36ms tok/s:1834991 rem:375s step 6382 (37%) loss:3.3617 lr:0.89 dt:36ms tok/s:1832422 rem:375s step 6383 (37%) loss:3.3480 lr:0.89 dt:36ms tok/s:1831958 rem:375s step 6384 (37%) loss:3.3698 lr:0.89 dt:36ms tok/s:1833289 rem:375s step 6385 (37%) loss:3.3756 lr:0.89 dt:36ms tok/s:1838820 rem:375s step 6386 (38%) loss:3.3721 lr:0.89 dt:36ms tok/s:1838304 rem:375s step 6387 (38%) loss:3.3692 lr:0.89 dt:36ms tok/s:1838230 rem:375s step 6388 (38%) loss:3.3730 lr:0.89 dt:36ms tok/s:1833118 rem:375s step 6389 (38%) loss:3.3777 lr:0.89 dt:36ms tok/s:1832605 rem:375s step 6390 (38%) loss:3.3768 lr:0.89 dt:36ms tok/s:1839510 rem:375s step 6391 (38%) loss:3.3865 lr:0.89 dt:36ms tok/s:1834158 rem:375s step 6392 (38%) loss:3.3852 lr:0.89 dt:36ms tok/s:1837923 rem:375s step 6393 (38%) loss:3.3966 lr:0.89 dt:36ms tok/s:1830591 rem:375s step 6394 (38%) loss:3.4086 lr:0.89 dt:36ms tok/s:1828862 rem:375s step 6395 (38%) loss:3.4157 lr:0.89 dt:36ms tok/s:1837517 rem:375s step 6396 (38%) loss:3.4132 lr:0.89 dt:36ms tok/s:1842111 rem:375s step 6397 (38%) loss:3.4050 lr:0.89 dt:36ms tok/s:1840877 rem:375s step 6398 (38%) loss:3.3889 lr:0.89 dt:36ms tok/s:1834880 rem:375s step 6399 (38%) loss:3.3829 lr:0.89 dt:36ms tok/s:1831274 rem:374s step 6400 (38%) loss:3.3723 lr:0.89 dt:36ms tok/s:1836793 rem:374s + local: attn=[0.072, 0.755, 0.720] mlp=[0.338, 0.174, -0.168] + + transition: attn=[2.420, 0.805] mlp=[-0.124, 0.295] + + hierarchy: attn=[2.714, 5.939, 5.616] mlp=[1.140, -1.094, -4.209] + step 6401 (38%) loss:3.3773 lr:0.89 dt:36ms tok/s:1834746 rem:374s step 6402 (38%) loss:3.3635 lr:0.89 dt:36ms tok/s:1843433 rem:374s step 6403 (38%) loss:3.3789 lr:0.89 dt:36ms tok/s:1817050 rem:374s step 6404 (38%) loss:3.3735 lr:0.89 dt:36ms tok/s:1827293 rem:374s step 6405 (38%) loss:3.3658 lr:0.89 dt:36ms tok/s:1823802 rem:374s step 6406 (38%) loss:3.3739 lr:0.89 dt:36ms tok/s:1839965 rem:374s step 6407 (38%) loss:3.3806 lr:0.88 dt:36ms tok/s:1812856 rem:374s step 6408 (38%) loss:3.3780 lr:0.88 dt:36ms tok/s:1804182 rem:374s step 6409 (38%) loss:3.3922 lr:0.88 dt:36ms tok/s:1811076 rem:374s step 6410 (38%) loss:3.4018 lr:0.88 dt:36ms tok/s:1810551 rem:374s step 6411 (38%) loss:3.4054 lr:0.88 dt:36ms tok/s:1813454 rem:374s step 6412 (38%) loss:3.4185 lr:0.88 dt:36ms tok/s:1808062 rem:374s step 6413 (38%) loss:3.4144 lr:0.88 dt:36ms tok/s:1807611 rem:374s step 6414 (38%) loss:3.4313 lr:0.88 dt:36ms tok/s:1806197 rem:374s step 6415 (38%) loss:3.4467 lr:0.88 dt:36ms tok/s:1813849 rem:374s step 6416 (38%) loss:3.4551 lr:0.88 dt:36ms tok/s:1815958 rem:374s step 6417 (38%) loss:3.4532 lr:0.88 dt:36ms tok/s:1814459 rem:374s step 6418 (38%) loss:3.4364 lr:0.88 dt:36ms tok/s:1812928 rem:374s step 6419 (38%) loss:3.4359 lr:0.88 dt:37ms tok/s:1780263 rem:374s step 6420 (38%) loss:3.4350 lr:0.88 dt:36ms tok/s:1812318 rem:374s step 6421 (38%) loss:3.4325 lr:0.88 dt:36ms tok/s:1809955 rem:374s step 6422 (38%) loss:3.4428 lr:0.88 dt:36ms tok/s:1812677 rem:374s step 6423 (38%) loss:3.4281 lr:0.88 dt:36ms tok/s:1814196 rem:374s step 6424 (38%) loss:3.4042 lr:0.88 dt:36ms tok/s:1810671 rem:374s step 6425 (38%) loss:3.3901 lr:0.88 dt:36ms tok/s:1810277 rem:374s step 6426 (38%) loss:3.3944 lr:0.88 dt:36ms tok/s:1812318 rem:374s step 6427 (38%) loss:3.3994 lr:0.88 dt:36ms tok/s:1802407 rem:373s step 6428 (38%) loss:3.3962 lr:0.88 dt:36ms tok/s:1808110 rem:373s step 6429 (38%) loss:3.4025 lr:0.88 dt:36ms tok/s:1821686 rem:373s step 6430 (38%) loss:3.3898 lr:0.88 dt:36ms tok/s:1820613 rem:373s step 6431 (38%) loss:3.3960 lr:0.88 dt:36ms tok/s:1820010 rem:373s step 6432 (38%) loss:3.3804 lr:0.88 dt:36ms tok/s:1816534 rem:373s step 6433 (38%) loss:3.3796 lr:0.88 dt:36ms tok/s:1811900 rem:373s step 6434 (38%) loss:3.3947 lr:0.88 dt:36ms tok/s:1812844 rem:373s step 6435 (38%) loss:3.4152 lr:0.88 dt:37ms tok/s:1764696 rem:373s step 6436 (38%) loss:3.4289 lr:0.88 dt:36ms tok/s:1819612 rem:373s step 6437 (38%) loss:3.4312 lr:0.88 dt:36ms tok/s:1804395 rem:373s step 6438 (38%) loss:3.4192 lr:0.88 dt:36ms tok/s:1814507 rem:373s step 6439 (38%) loss:3.4372 lr:0.88 dt:36ms tok/s:1816738 rem:373s step 6440 (38%) loss:3.4174 lr:0.88 dt:36ms tok/s:1817266 rem:373s step 6441 (38%) loss:3.3889 lr:0.88 dt:36ms tok/s:1813215 rem:373s step 6442 (38%) loss:3.3832 lr:0.88 dt:36ms tok/s:1818072 rem:373s step 6443 (38%) loss:3.3833 lr:0.88 dt:36ms tok/s:1813298 rem:373s step 6444 (38%) loss:3.3787 lr:0.88 dt:36ms tok/s:1817062 rem:373s step 6445 (38%) loss:3.3642 lr:0.88 dt:36ms tok/s:1821011 rem:373s step 6446 (38%) loss:3.3759 lr:0.88 dt:36ms tok/s:1811578 rem:373s step 6447 (38%) loss:3.3756 lr:0.88 dt:36ms tok/s:1814687 rem:373s step 6448 (38%) loss:3.3935 lr:0.88 dt:36ms tok/s:1814076 rem:373s step 6449 (38%) loss:3.4055 lr:0.88 dt:36ms tok/s:1817783 rem:373s step 6450 (38%) loss:3.4136 lr:0.88 dt:36ms tok/s:1817663 rem:373s step 6451 (38%) loss:3.4167 lr:0.88 dt:36ms tok/s:1805130 rem:373s step 6452 (38%) loss:3.4110 lr:0.88 dt:36ms tok/s:1816378 rem:373s step 6453 (38%) loss:3.4139 lr:0.88 dt:36ms tok/s:1822169 rem:373s step 6454 (38%) loss:3.4027 lr:0.88 dt:36ms tok/s:1805841 rem:373s step 6455 (38%) loss:3.3940 lr:0.88 dt:36ms tok/s:1817915 rem:372s step 6456 (38%) loss:3.4005 lr:0.88 dt:36ms tok/s:1814268 rem:372s step 6457 (38%) loss:3.3987 lr:0.88 dt:36ms tok/s:1810015 rem:372s step 6458 (38%) loss:3.3958 lr:0.88 dt:36ms tok/s:1812629 rem:372s step 6459 (38%) loss:3.3931 lr:0.88 dt:36ms tok/s:1813035 rem:372s step 6460 (38%) loss:3.4008 lr:0.88 dt:36ms tok/s:1809133 rem:372s step 6461 (38%) loss:3.4024 lr:0.88 dt:36ms tok/s:1808086 rem:372s step 6462 (38%) loss:3.4058 lr:0.88 dt:36ms tok/s:1815082 rem:372s step 6463 (38%) loss:3.4026 lr:0.88 dt:37ms tok/s:1791950 rem:372s step 6464 (38%) loss:3.4238 lr:0.88 dt:37ms tok/s:1792405 rem:372s step 6465 (38%) loss:3.4178 lr:0.88 dt:37ms tok/s:1787115 rem:372s step 6466 (38%) loss:3.3879 lr:0.88 dt:36ms tok/s:1799269 rem:372s step 6467 (38%) loss:3.3970 lr:0.88 dt:37ms tok/s:1792721 rem:372s step 6468 (38%) loss:3.3992 lr:0.88 dt:36ms tok/s:1797433 rem:372s step 6469 (38%) loss:3.4050 lr:0.88 dt:37ms tok/s:1784632 rem:372s step 6470 (38%) loss:3.4103 lr:0.88 dt:37ms tok/s:1790491 rem:372s step 6471 (38%) loss:3.4072 lr:0.88 dt:37ms tok/s:1790048 rem:372s step 6472 (38%) loss:3.4047 lr:0.88 dt:37ms tok/s:1791541 rem:372s step 6473 (38%) loss:3.4136 lr:0.88 dt:37ms tok/s:1793704 rem:372s step 6474 (38%) loss:3.4241 lr:0.88 dt:36ms tok/s:1796881 rem:372s step 6475 (38%) loss:3.4112 lr:0.88 dt:37ms tok/s:1789698 rem:372s step 6476 (38%) loss:3.3962 lr:0.88 dt:37ms tok/s:1788592 rem:372s step 6477 (38%) loss:3.3495 lr:0.88 dt:36ms tok/s:1798963 rem:372s step 6478 (38%) loss:3.3320 lr:0.88 dt:36ms tok/s:1801757 rem:372s step 6479 (38%) loss:3.3605 lr:0.88 dt:36ms tok/s:1796494 rem:372s step 6480 (38%) loss:3.3903 lr:0.88 dt:37ms tok/s:1794430 rem:372s step 6481 (38%) loss:3.4019 lr:0.88 dt:37ms tok/s:1780401 rem:372s step 6482 (38%) loss:3.4059 lr:0.88 dt:36ms tok/s:1809157 rem:371s step 6483 (38%) loss:3.4125 lr:0.88 dt:35ms tok/s:1863478 rem:371s step 6484 (38%) loss:3.4233 lr:0.88 dt:35ms tok/s:1851379 rem:371s step 6485 (38%) loss:3.4259 lr:0.88 dt:36ms tok/s:1835236 rem:371s step 6486 (38%) loss:3.4395 lr:0.88 dt:36ms tok/s:1839596 rem:371s step 6487 (38%) loss:3.4327 lr:0.88 dt:36ms tok/s:1841296 rem:371s step 6488 (38%) loss:3.4521 lr:0.88 dt:36ms tok/s:1844348 rem:371s step 6489 (38%) loss:3.4385 lr:0.88 dt:36ms tok/s:1841247 rem:371s step 6490 (38%) loss:3.4286 lr:0.88 dt:36ms tok/s:1834942 rem:371s step 6491 (38%) loss:3.4328 lr:0.88 dt:36ms tok/s:1835236 rem:371s step 6492 (38%) loss:3.4339 lr:0.88 dt:36ms tok/s:1838402 rem:371s step 6493 (38%) loss:3.4272 lr:0.88 dt:36ms tok/s:1836474 rem:371s step 6494 (38%) loss:3.4143 lr:0.88 dt:36ms tok/s:1837309 rem:371s step 6495 (38%) loss:3.3805 lr:0.88 dt:36ms tok/s:1844843 rem:371s step 6496 (38%) loss:3.3697 lr:0.88 dt:36ms tok/s:1839030 rem:371s step 6497 (38%) loss:3.3781 lr:0.88 dt:36ms tok/s:1833583 rem:371s step 6498 (38%) loss:3.3813 lr:0.88 dt:36ms tok/s:1841543 rem:371s step 6499 (38%) loss:3.3838 lr:0.88 dt:36ms tok/s:1834244 rem:371s step 6500 (38%) loss:3.3822 lr:0.88 dt:36ms tok/s:1838427 rem:371s + local: attn=[0.064, 0.744, 0.725] mlp=[0.355, 0.190, -0.182] + + transition: attn=[2.459, 0.837] mlp=[-0.126, 0.294] + + hierarchy: attn=[2.812, 5.939, 5.616] mlp=[1.129, -1.050, -4.034] + step 6501 (38%) loss:3.3860 lr:0.88 dt:36ms tok/s:1836069 rem:371s step 6502 (38%) loss:3.3938 lr:0.88 dt:36ms tok/s:1840310 rem:371s step 6503 (38%) loss:3.3965 lr:0.88 dt:36ms tok/s:1837726 rem:371s step 6504 (38%) loss:3.3988 lr:0.88 dt:36ms tok/s:1843222 rem:371s step 6505 (38%) loss:3.4205 lr:0.88 dt:36ms tok/s:1839842 rem:371s step 6506 (38%) loss:3.4103 lr:0.88 dt:36ms tok/s:1840298 rem:371s step 6507 (38%) loss:3.4182 lr:0.88 dt:36ms tok/s:1835738 rem:371s step 6508 (38%) loss:3.3963 lr:0.88 dt:36ms tok/s:1839731 rem:371s step 6509 (38%) loss:3.4148 lr:0.88 dt:36ms tok/s:1836008 rem:371s step 6510 (38%) loss:3.4013 lr:0.88 dt:36ms tok/s:1838538 rem:370s step 6511 (38%) loss:3.3956 lr:0.88 dt:36ms tok/s:1842370 rem:370s step 6512 (38%) loss:3.3827 lr:0.88 dt:36ms tok/s:1840298 rem:370s step 6513 (38%) loss:3.3885 lr:0.88 dt:36ms tok/s:1841395 rem:370s step 6514 (38%) loss:3.3534 lr:0.88 dt:36ms tok/s:1843865 rem:370s step 6515 (38%) loss:3.3550 lr:0.88 dt:36ms tok/s:1836425 rem:370s step 6516 (38%) loss:3.3634 lr:0.88 dt:36ms tok/s:1837677 rem:370s step 6517 (38%) loss:3.3566 lr:0.88 dt:36ms tok/s:1839682 rem:370s step 6518 (38%) loss:3.3573 lr:0.88 dt:36ms tok/s:1836953 rem:370s step 6519 (38%) loss:3.3849 lr:0.88 dt:36ms tok/s:1832202 rem:370s step 6520 (38%) loss:3.3909 lr:0.88 dt:36ms tok/s:1844979 rem:370s step 6521 (38%) loss:3.3551 lr:0.88 dt:36ms tok/s:1834929 rem:370s step 6522 (38%) loss:3.3332 lr:0.88 dt:36ms tok/s:1836891 rem:370s step 6523 (38%) loss:3.3467 lr:0.88 dt:36ms tok/s:1828180 rem:370s step 6524 (38%) loss:3.3654 lr:0.88 dt:36ms tok/s:1837112 rem:370s step 6525 (38%) loss:3.3678 lr:0.88 dt:36ms tok/s:1840852 rem:370s step 6526 (38%) loss:3.3692 lr:0.88 dt:35ms tok/s:1854677 rem:370s step 6527 (38%) loss:3.3702 lr:0.88 dt:35ms tok/s:1849274 rem:370s step 6528 (38%) loss:3.3614 lr:0.88 dt:36ms tok/s:1842679 rem:370s step 6529 (38%) loss:3.3708 lr:0.88 dt:36ms tok/s:1834782 rem:370s step 6530 (38%) loss:3.3682 lr:0.88 dt:36ms tok/s:1842790 rem:370s step 6531 (38%) loss:3.3772 lr:0.88 dt:35ms tok/s:1848727 rem:370s step 6532 (38%) loss:3.3623 lr:0.88 dt:36ms tok/s:1842580 rem:370s step 6533 (38%) loss:3.3600 lr:0.88 dt:36ms tok/s:1834415 rem:370s step 6534 (38%) loss:3.3648 lr:0.88 dt:35ms tok/s:1846305 rem:370s step 6535 (38%) loss:3.3786 lr:0.88 dt:36ms tok/s:1834195 rem:370s step 6536 (38%) loss:3.3811 lr:0.88 dt:36ms tok/s:1842345 rem:370s step 6537 (38%) loss:3.3876 lr:0.87 dt:36ms tok/s:1834072 rem:370s step 6538 (38%) loss:3.3960 lr:0.87 dt:36ms tok/s:1841777 rem:369s step 6539 (38%) loss:3.3949 lr:0.87 dt:36ms tok/s:1839423 rem:369s step 6540 (38%) loss:3.4035 lr:0.87 dt:36ms tok/s:1843074 rem:369s step 6541 (38%) loss:3.4099 lr:0.87 dt:36ms tok/s:1839904 rem:369s step 6542 (38%) loss:3.4065 lr:0.87 dt:36ms tok/s:1806660 rem:369s step 6543 (38%) loss:3.4062 lr:0.87 dt:36ms tok/s:1826541 rem:369s step 6544 (38%) loss:3.3881 lr:0.87 dt:40ms tok/s:1655552 rem:369s step 6545 (38%) loss:3.3894 lr:0.87 dt:39ms tok/s:1683795 rem:369s step 6546 (38%) loss:3.3919 lr:0.87 dt:36ms tok/s:1827391 rem:369s step 6547 (38%) loss:3.3702 lr:0.87 dt:36ms tok/s:1829641 rem:369s step 6548 (38%) loss:3.3575 lr:0.87 dt:35ms tok/s:1858225 rem:369s step 6549 (38%) loss:3.3661 lr:0.87 dt:35ms tok/s:1856067 rem:369s step 6550 (38%) loss:3.3637 lr:0.87 dt:36ms tok/s:1823378 rem:369s step 6551 (38%) loss:3.3663 lr:0.87 dt:36ms tok/s:1815886 rem:369s step 6552 (38%) loss:3.3567 lr:0.87 dt:35ms tok/s:1860539 rem:369s step 6553 (39%) loss:3.3331 lr:0.87 dt:36ms tok/s:1820926 rem:369s step 6554 (39%) loss:3.3252 lr:0.87 dt:36ms tok/s:1828156 rem:369s step 6555 (39%) loss:3.3367 lr:0.87 dt:35ms tok/s:1860287 rem:369s step 6556 (39%) loss:3.3346 lr:0.87 dt:35ms tok/s:1854326 rem:369s step 6557 (39%) loss:3.3273 lr:0.87 dt:35ms tok/s:1848963 rem:369s step 6558 (39%) loss:3.3227 lr:0.87 dt:36ms tok/s:1845648 rem:369s step 6559 (39%) loss:3.3341 lr:0.87 dt:35ms tok/s:1866553 rem:369s step 6560 (39%) loss:3.3384 lr:0.87 dt:36ms tok/s:1843408 rem:369s step 6561 (39%) loss:3.3323 lr:0.87 dt:35ms tok/s:1866008 rem:369s step 6562 (39%) loss:3.3283 lr:0.87 dt:36ms tok/s:1835187 rem:369s step 6563 (39%) loss:3.3608 lr:0.87 dt:35ms tok/s:1852040 rem:369s step 6564 (39%) loss:3.3592 lr:0.87 dt:36ms tok/s:1818902 rem:369s step 6565 (39%) loss:3.3999 lr:0.87 dt:35ms tok/s:1848441 rem:369s step 6566 (39%) loss:3.4105 lr:0.87 dt:35ms tok/s:1859671 rem:368s step 6567 (39%) loss:3.4347 lr:0.87 dt:35ms tok/s:1863263 rem:368s step 6568 (39%) loss:3.4395 lr:0.87 dt:35ms tok/s:1852227 rem:368s step 6569 (39%) loss:3.4422 lr:0.87 dt:35ms tok/s:1860879 rem:368s step 6570 (39%) loss:3.4385 lr:0.87 dt:36ms tok/s:1826856 rem:368s step 6571 (39%) loss:3.4262 lr:0.87 dt:35ms tok/s:1846714 rem:368s step 6572 (39%) loss:3.4300 lr:0.87 dt:36ms tok/s:1830408 rem:368s step 6573 (39%) loss:3.4091 lr:0.87 dt:35ms tok/s:1847596 rem:368s step 6574 (39%) loss:3.4704 lr:0.87 dt:36ms tok/s:1832495 rem:368s step 6575 (39%) loss:3.4569 lr:0.87 dt:36ms tok/s:1841457 rem:368s step 6576 (39%) loss:3.4379 lr:0.87 dt:35ms tok/s:1859847 rem:368s step 6577 (39%) loss:3.4406 lr:0.87 dt:36ms tok/s:1845400 rem:368s step 6578 (39%) loss:3.4365 lr:0.87 dt:35ms tok/s:1855540 rem:368s step 6579 (39%) loss:3.4228 lr:0.87 dt:36ms tok/s:1820432 rem:368s step 6580 (39%) loss:3.4644 lr:0.87 dt:37ms tok/s:1784331 rem:368s step 6581 (39%) loss:3.4778 lr:0.87 dt:36ms tok/s:1809657 rem:368s step 6582 (39%) loss:3.4658 lr:0.87 dt:35ms tok/s:1877184 rem:368s step 6583 (39%) loss:3.4492 lr:0.87 dt:35ms tok/s:1862632 rem:368s step 6584 (39%) loss:3.4401 lr:0.87 dt:35ms tok/s:1857798 rem:368s step 6585 (39%) loss:3.4273 lr:0.87 dt:35ms tok/s:1866109 rem:368s step 6586 (39%) loss:3.4326 lr:0.87 dt:35ms tok/s:1868697 rem:368s step 6587 (39%) loss:3.4262 lr:0.87 dt:35ms tok/s:1878313 rem:368s step 6588 (39%) loss:3.4212 lr:0.87 dt:35ms tok/s:1885994 rem:368s step 6589 (39%) loss:3.4289 lr:0.87 dt:35ms tok/s:1880922 rem:368s step 6590 (39%) loss:3.4316 lr:0.87 dt:35ms tok/s:1873895 rem:368s step 6591 (39%) loss:3.4193 lr:0.87 dt:35ms tok/s:1874879 rem:368s step 6592 (39%) loss:3.4013 lr:0.87 dt:35ms tok/s:1878390 rem:368s step 6593 (39%) loss:3.3748 lr:0.87 dt:35ms tok/s:1871331 rem:368s step 6594 (39%) loss:3.3630 lr:0.87 dt:35ms tok/s:1873537 rem:368s step 6595 (39%) loss:3.3664 lr:0.87 dt:35ms tok/s:1871624 rem:367s step 6596 (39%) loss:3.3820 lr:0.87 dt:35ms tok/s:1870503 rem:367s step 6597 (39%) loss:3.3876 lr:0.87 dt:35ms tok/s:1870121 rem:367s step 6598 (39%) loss:3.3805 lr:0.87 dt:36ms tok/s:1823027 rem:367s step 6599 (39%) loss:3.4158 lr:0.87 dt:35ms tok/s:1869282 rem:367s step 6600 (39%) loss:3.4502 lr:0.87 dt:35ms tok/s:1869765 rem:367s + local: attn=[0.073, 0.771, 0.768] mlp=[0.359, 0.191, -0.167] + + transition: attn=[2.528, 0.822] mlp=[-0.111, 0.310] + + hierarchy: attn=[2.818, 5.939, 5.616] mlp=[1.105, -1.015, -4.030] + step 6601 (39%) loss:3.4866 lr:0.87 dt:35ms tok/s:1874879 rem:367s step 6602 (39%) loss:3.4822 lr:0.87 dt:35ms tok/s:1894364 rem:367s step 6603 (39%) loss:3.4849 lr:0.87 dt:35ms tok/s:1879276 rem:367s step 6604 (39%) loss:3.4840 lr:0.87 dt:35ms tok/s:1872555 rem:367s step 6605 (39%) loss:3.4614 lr:0.87 dt:35ms tok/s:1873576 rem:367s step 6606 (39%) loss:3.4564 lr:0.87 dt:35ms tok/s:1872695 rem:367s step 6607 (39%) loss:3.4461 lr:0.87 dt:35ms tok/s:1872274 rem:367s step 6608 (39%) loss:3.4600 lr:0.87 dt:35ms tok/s:1869345 rem:367s step 6609 (39%) loss:3.4471 lr:0.87 dt:35ms tok/s:1879019 rem:367s step 6610 (39%) loss:3.4441 lr:0.87 dt:35ms tok/s:1871101 rem:367s step 6611 (39%) loss:3.4255 lr:0.87 dt:35ms tok/s:1864869 rem:367s step 6612 (39%) loss:3.4329 lr:0.87 dt:35ms tok/s:1870541 rem:367s step 6613 (39%) loss:3.4121 lr:0.87 dt:36ms tok/s:1845499 rem:367s step 6614 (39%) loss:3.4272 lr:0.87 dt:36ms tok/s:1842901 rem:367s step 6615 (39%) loss:3.4196 lr:0.87 dt:35ms tok/s:1851803 rem:367s step 6616 (39%) loss:3.4287 lr:0.87 dt:35ms tok/s:1848814 rem:367s step 6617 (39%) loss:3.4339 lr:0.87 dt:35ms tok/s:1848888 rem:367s step 6618 (39%) loss:3.4094 lr:0.87 dt:35ms tok/s:1850108 rem:367s step 6619 (39%) loss:3.3611 lr:0.87 dt:35ms tok/s:1852102 rem:367s step 6620 (39%) loss:3.3206 lr:0.87 dt:36ms tok/s:1844892 rem:367s step 6621 (39%) loss:3.2758 lr:0.87 dt:36ms tok/s:1844224 rem:367s step 6622 (39%) loss:3.2323 lr:0.87 dt:36ms tok/s:1826528 rem:367s step 6623 (39%) loss:3.1855 lr:0.87 dt:36ms tok/s:1826735 rem:366s step 6624 (39%) loss:3.1342 lr:0.87 dt:36ms tok/s:1824044 rem:366s step 6625 (39%) loss:3.0875 lr:0.87 dt:36ms tok/s:1823027 rem:366s step 6626 (39%) loss:3.0461 lr:0.87 dt:36ms tok/s:1825691 rem:366s step 6627 (39%) loss:2.9902 lr:0.87 dt:36ms tok/s:1820516 rem:366s step 6628 (39%) loss:2.9645 lr:0.87 dt:36ms tok/s:1818661 rem:366s step 6629 (39%) loss:2.9312 lr:0.87 dt:36ms tok/s:1820444 rem:366s step 6630 (39%) loss:2.9044 lr:0.87 dt:39ms tok/s:1664928 rem:366s step 6631 (39%) loss:2.8623 lr:0.87 dt:39ms tok/s:1675523 rem:366s step 6632 (39%) loss:2.8305 lr:0.87 dt:35ms tok/s:1881231 rem:366s step 6633 (39%) loss:2.8230 lr:0.87 dt:35ms tok/s:1868951 rem:366s step 6634 (39%) loss:2.8275 lr:0.87 dt:35ms tok/s:1865337 rem:366s step 6635 (39%) loss:2.8632 lr:0.87 dt:35ms tok/s:1862783 rem:366s step 6636 (39%) loss:2.8793 lr:0.87 dt:35ms tok/s:1865932 rem:366s step 6637 (39%) loss:2.8660 lr:0.87 dt:35ms tok/s:1868075 rem:366s step 6638 (39%) loss:2.9420 lr:0.87 dt:35ms tok/s:1846367 rem:366s step 6639 (39%) loss:3.0241 lr:0.87 dt:36ms tok/s:1830104 rem:366s step 6640 (39%) loss:3.0940 lr:0.87 dt:36ms tok/s:1814987 rem:366s step 6641 (39%) loss:3.1360 lr:0.87 dt:36ms tok/s:1820914 rem:366s step 6642 (39%) loss:3.1641 lr:0.87 dt:36ms tok/s:1828290 rem:366s step 6643 (39%) loss:3.1945 lr:0.87 dt:36ms tok/s:1816174 rem:366s step 6644 (39%) loss:3.2289 lr:0.87 dt:36ms tok/s:1821180 rem:366s step 6645 (39%) loss:3.2565 lr:0.87 dt:36ms tok/s:1818012 rem:366s step 6646 (39%) loss:3.2858 lr:0.87 dt:36ms tok/s:1821952 rem:366s step 6647 (39%) loss:3.3271 lr:0.87 dt:36ms tok/s:1820275 rem:366s step 6648 (39%) loss:3.3451 lr:0.87 dt:36ms tok/s:1825546 rem:366s step 6649 (39%) loss:3.3551 lr:0.87 dt:36ms tok/s:1820709 rem:366s step 6650 (39%) loss:3.3469 lr:0.87 dt:36ms tok/s:1820890 rem:366s step 6651 (39%) loss:3.3483 lr:0.87 dt:36ms tok/s:1823511 rem:365s step 6652 (39%) loss:3.3595 lr:0.87 dt:36ms tok/s:1823802 rem:365s step 6653 (39%) loss:3.3843 lr:0.87 dt:36ms tok/s:1825073 rem:365s step 6654 (39%) loss:3.4021 lr:0.87 dt:36ms tok/s:1820468 rem:365s step 6655 (39%) loss:3.4118 lr:0.87 dt:36ms tok/s:1826431 rem:365s step 6656 (39%) loss:3.4165 lr:0.87 dt:36ms tok/s:1818144 rem:365s step 6657 (39%) loss:3.4171 lr:0.87 dt:36ms tok/s:1817459 rem:365s step 6658 (39%) loss:3.3904 lr:0.87 dt:36ms tok/s:1821035 rem:365s step 6659 (39%) loss:3.4022 lr:0.87 dt:36ms tok/s:1820757 rem:365s step 6660 (39%) loss:3.4040 lr:0.87 dt:36ms tok/s:1821735 rem:365s step 6661 (39%) loss:3.4211 lr:0.87 dt:36ms tok/s:1818360 rem:365s step 6662 (39%) loss:3.4151 lr:0.87 dt:36ms tok/s:1824322 rem:365s step 6663 (39%) loss:3.4183 lr:0.87 dt:36ms tok/s:1821324 rem:365s step 6664 (39%) loss:3.4060 lr:0.87 dt:36ms tok/s:1822073 rem:365s step 6665 (39%) loss:3.4135 lr:0.86 dt:36ms tok/s:1815382 rem:365s step 6666 (39%) loss:3.4232 lr:0.86 dt:36ms tok/s:1814795 rem:365s step 6667 (39%) loss:3.4308 lr:0.86 dt:36ms tok/s:1797716 rem:365s step 6668 (39%) loss:3.4229 lr:0.86 dt:36ms tok/s:1823124 rem:365s step 6669 (39%) loss:3.4086 lr:0.86 dt:37ms tok/s:1779134 rem:365s step 6670 (39%) loss:3.4308 lr:0.86 dt:36ms tok/s:1826358 rem:365s step 6671 (39%) loss:3.4436 lr:0.86 dt:36ms tok/s:1820528 rem:365s step 6672 (39%) loss:3.4404 lr:0.86 dt:36ms tok/s:1814436 rem:365s step 6673 (39%) loss:3.4339 lr:0.86 dt:36ms tok/s:1820215 rem:365s step 6674 (39%) loss:3.4362 lr:0.86 dt:36ms tok/s:1825946 rem:365s step 6675 (39%) loss:3.4424 lr:0.86 dt:36ms tok/s:1824649 rem:365s step 6676 (39%) loss:3.4417 lr:0.86 dt:36ms tok/s:1822278 rem:365s step 6677 (39%) loss:3.4295 lr:0.86 dt:36ms tok/s:1827986 rem:365s step 6678 (39%) loss:3.4274 lr:0.86 dt:36ms tok/s:1824516 rem:365s step 6679 (39%) loss:3.4158 lr:0.86 dt:36ms tok/s:1815214 rem:364s step 6680 (39%) loss:3.4221 lr:0.86 dt:36ms tok/s:1815274 rem:364s step 6681 (39%) loss:3.4210 lr:0.86 dt:36ms tok/s:1808253 rem:364s step 6682 (39%) loss:3.4198 lr:0.86 dt:36ms tok/s:1824249 rem:364s step 6683 (39%) loss:3.4364 lr:0.86 dt:36ms tok/s:1809753 rem:364s step 6684 (39%) loss:3.4385 lr:0.86 dt:36ms tok/s:1824286 rem:364s step 6685 (39%) loss:3.4495 lr:0.86 dt:36ms tok/s:1819745 rem:364s step 6686 (39%) loss:3.4421 lr:0.86 dt:36ms tok/s:1817507 rem:364s step 6687 (39%) loss:3.4467 lr:0.86 dt:36ms tok/s:1822290 rem:364s step 6688 (39%) loss:3.4472 lr:0.86 dt:36ms tok/s:1820359 rem:364s step 6689 (39%) loss:3.4621 lr:0.86 dt:36ms tok/s:1816618 rem:364s step 6690 (39%) loss:3.4549 lr:0.86 dt:36ms tok/s:1822604 rem:364s step 6691 (39%) loss:3.4667 lr:0.86 dt:36ms tok/s:1820685 rem:364s step 6692 (39%) loss:3.4437 lr:0.86 dt:36ms tok/s:1813406 rem:364s step 6693 (39%) loss:3.4369 lr:0.86 dt:36ms tok/s:1822653 rem:364s step 6694 (39%) loss:3.4232 lr:0.86 dt:36ms tok/s:1820130 rem:364s step 6695 (39%) loss:3.4089 lr:0.86 dt:36ms tok/s:1820299 rem:364s step 6696 (39%) loss:3.4115 lr:0.86 dt:36ms tok/s:1822073 rem:364s step 6697 (39%) loss:3.4126 lr:0.86 dt:36ms tok/s:1818457 rem:364s step 6698 (39%) loss:3.4089 lr:0.86 dt:36ms tok/s:1811673 rem:364s step 6699 (39%) loss:3.4235 lr:0.86 dt:36ms tok/s:1806150 rem:364s step 6700 (39%) loss:3.4200 lr:0.86 dt:36ms tok/s:1815826 rem:364s + local: attn=[0.068, 0.799, 0.738] mlp=[0.374, 0.193, -0.162] + + transition: attn=[2.566, 0.854] mlp=[-0.112, 0.294] + + hierarchy: attn=[2.838, 5.939, 5.616] mlp=[1.110, -1.008, -4.103] + step 6701 (39%) loss:3.4029 lr:0.86 dt:36ms tok/s:1815298 rem:364s step 6702 (39%) loss:3.4128 lr:0.86 dt:36ms tok/s:1818541 rem:364s step 6703 (39%) loss:3.4035 lr:0.86 dt:36ms tok/s:1818721 rem:364s step 6704 (39%) loss:3.4093 lr:0.86 dt:36ms tok/s:1816630 rem:364s step 6705 (39%) loss:3.4048 lr:0.86 dt:36ms tok/s:1814292 rem:364s step 6706 (39%) loss:3.4152 lr:0.86 dt:36ms tok/s:1816918 rem:363s step 6707 (39%) loss:3.4135 lr:0.86 dt:36ms tok/s:1819131 rem:363s step 6708 (39%) loss:3.4354 lr:0.86 dt:36ms tok/s:1810086 rem:363s step 6709 (39%) loss:3.4453 lr:0.86 dt:36ms tok/s:1810504 rem:363s step 6710 (39%) loss:3.4338 lr:0.86 dt:42ms tok/s:1575864 rem:363s step 6711 (39%) loss:3.4116 lr:0.86 dt:36ms tok/s:1841888 rem:363s step 6712 (39%) loss:3.3953 lr:0.86 dt:36ms tok/s:1843717 rem:363s step 6713 (39%) loss:3.3924 lr:0.86 dt:35ms tok/s:1864299 rem:363s step 6714 (39%) loss:3.3828 lr:0.86 dt:35ms tok/s:1870745 rem:363s step 6715 (39%) loss:3.3824 lr:0.86 dt:35ms tok/s:1865084 rem:363s step 6716 (39%) loss:3.4007 lr:0.86 dt:35ms tok/s:1874662 rem:363s step 6717 (39%) loss:3.3932 lr:0.86 dt:35ms tok/s:1873576 rem:363s step 6718 (39%) loss:3.3917 lr:0.86 dt:35ms tok/s:1874419 rem:363s step 6719 (39%) loss:3.3835 lr:0.86 dt:36ms tok/s:1844038 rem:363s step 6720 (39%) loss:3.3742 lr:0.86 dt:35ms tok/s:1858413 rem:363s step 6721 (40%) loss:3.3907 lr:0.86 dt:35ms tok/s:1852489 rem:363s step 6722 (40%) loss:3.3870 lr:0.86 dt:35ms tok/s:1858350 rem:363s step 6723 (40%) loss:3.3929 lr:0.86 dt:35ms tok/s:1854789 rem:363s step 6724 (40%) loss:3.4044 lr:0.86 dt:35ms tok/s:1851803 rem:363s step 6725 (40%) loss:3.4022 lr:0.86 dt:35ms tok/s:1855591 rem:363s step 6726 (40%) loss:3.3828 lr:0.86 dt:35ms tok/s:1853751 rem:363s step 6727 (40%) loss:3.3651 lr:0.86 dt:35ms tok/s:1852876 rem:363s step 6728 (40%) loss:3.3472 lr:0.86 dt:35ms tok/s:1852339 rem:363s step 6729 (40%) loss:3.3532 lr:0.86 dt:36ms tok/s:1831347 rem:363s step 6730 (40%) loss:3.3612 lr:0.86 dt:36ms tok/s:1839842 rem:363s step 6731 (40%) loss:3.3606 lr:0.86 dt:35ms tok/s:1849212 rem:363s step 6732 (40%) loss:3.3845 lr:0.86 dt:35ms tok/s:1857133 rem:363s step 6733 (40%) loss:3.4447 lr:0.86 dt:35ms tok/s:1850158 rem:363s step 6734 (40%) loss:3.4490 lr:0.86 dt:35ms tok/s:1858388 rem:362s step 6735 (40%) loss:3.4407 lr:0.86 dt:36ms tok/s:1828728 rem:362s step 6736 (40%) loss:3.4334 lr:0.86 dt:36ms tok/s:1819371 rem:362s step 6737 (40%) loss:3.4313 lr:0.86 dt:36ms tok/s:1826625 rem:362s step 6738 (40%) loss:3.4264 lr:0.86 dt:36ms tok/s:1817843 rem:362s step 6739 (40%) loss:3.4170 lr:0.86 dt:36ms tok/s:1817771 rem:362s step 6740 (40%) loss:3.4110 lr:0.86 dt:36ms tok/s:1814855 rem:362s step 6741 (40%) loss:3.3979 lr:0.86 dt:36ms tok/s:1819239 rem:362s step 6742 (40%) loss:3.3917 lr:0.86 dt:36ms tok/s:1817807 rem:362s step 6743 (40%) loss:3.3641 lr:0.86 dt:36ms tok/s:1821035 rem:362s step 6744 (40%) loss:3.3387 lr:0.86 dt:42ms tok/s:1553834 rem:362s step 6745 (40%) loss:3.3168 lr:0.86 dt:41ms tok/s:1581349 rem:362s step 6746 (40%) loss:3.3265 lr:0.86 dt:33ms tok/s:1962516 rem:362s step 6747 (40%) loss:3.3334 lr:0.86 dt:33ms tok/s:1987318 rem:362s step 6748 (40%) loss:3.3482 lr:0.86 dt:33ms tok/s:1972204 rem:362s step 6749 (40%) loss:3.3515 lr:0.86 dt:33ms tok/s:1979804 rem:362s step 6750 (40%) loss:3.3091 lr:0.86 dt:34ms tok/s:1956259 rem:362s step 6751 (40%) loss:3.3100 lr:0.86 dt:34ms tok/s:1947527 rem:362s step 6752 (40%) loss:3.3185 lr:0.86 dt:34ms tok/s:1956176 rem:362s step 6753 (40%) loss:3.3246 lr:0.86 dt:34ms tok/s:1951593 rem:362s step 6754 (40%) loss:3.3269 lr:0.86 dt:34ms tok/s:1944001 rem:362s step 6755 (40%) loss:3.3312 lr:0.86 dt:34ms tok/s:1943850 rem:362s step 6756 (40%) loss:3.3342 lr:0.86 dt:34ms tok/s:1946245 rem:362s step 6757 (40%) loss:3.3354 lr:0.86 dt:34ms tok/s:1947762 rem:362s step 6758 (40%) loss:3.3349 lr:0.86 dt:34ms tok/s:1946562 rem:362s step 6759 (40%) loss:3.3386 lr:0.86 dt:34ms tok/s:1941461 rem:362s step 6760 (40%) loss:3.3253 lr:0.86 dt:34ms tok/s:1946727 rem:362s step 6761 (40%) loss:3.3289 lr:0.86 dt:34ms tok/s:1938435 rem:362s step 6762 (40%) loss:3.3271 lr:0.86 dt:34ms tok/s:1941749 rem:362s step 6763 (40%) loss:3.3108 lr:0.86 dt:34ms tok/s:1936756 rem:361s step 6764 (40%) loss:3.3578 lr:0.86 dt:34ms tok/s:1934670 rem:361s step 6765 (40%) loss:3.3654 lr:0.86 dt:34ms tok/s:1942586 rem:361s step 6766 (40%) loss:3.3607 lr:0.86 dt:34ms tok/s:1928535 rem:361s step 6767 (40%) loss:3.3639 lr:0.86 dt:34ms tok/s:1931407 rem:361s step 6768 (40%) loss:3.3781 lr:0.86 dt:34ms tok/s:1936442 rem:361s step 6769 (40%) loss:3.3651 lr:0.86 dt:34ms tok/s:1903389 rem:361s step 6770 (40%) loss:3.3653 lr:0.86 dt:34ms tok/s:1901164 rem:361s step 6771 (40%) loss:3.3757 lr:0.86 dt:34ms tok/s:1914125 rem:361s step 6772 (40%) loss:3.3664 lr:0.86 dt:34ms tok/s:1906518 rem:361s step 6773 (40%) loss:3.3406 lr:0.86 dt:34ms tok/s:1912194 rem:361s step 6774 (40%) loss:3.3410 lr:0.86 dt:34ms tok/s:1912114 rem:361s step 6775 (40%) loss:3.3386 lr:0.86 dt:35ms tok/s:1898472 rem:361s step 6776 (40%) loss:3.3390 lr:0.86 dt:35ms tok/s:1881475 rem:361s step 6777 (40%) loss:3.3299 lr:0.86 dt:34ms tok/s:1902019 rem:361s step 6778 (40%) loss:3.3332 lr:0.86 dt:34ms tok/s:1911529 rem:361s step 6779 (40%) loss:3.3539 lr:0.86 dt:35ms tok/s:1887172 rem:361s step 6780 (40%) loss:3.3551 lr:0.86 dt:35ms tok/s:1880420 rem:361s step 6781 (40%) loss:3.3548 lr:0.86 dt:35ms tok/s:1890417 rem:361s step 6782 (40%) loss:3.3434 lr:0.86 dt:35ms tok/s:1889767 rem:361s step 6783 (40%) loss:3.3738 lr:0.86 dt:35ms tok/s:1897489 rem:361s step 6784 (40%) loss:3.3573 lr:0.86 dt:35ms tok/s:1894338 rem:361s step 6785 (40%) loss:3.3335 lr:0.86 dt:35ms tok/s:1885528 rem:361s step 6786 (40%) loss:3.3333 lr:0.86 dt:35ms tok/s:1892408 rem:361s step 6787 (40%) loss:3.3191 lr:0.86 dt:35ms tok/s:1866223 rem:361s step 6788 (40%) loss:3.3312 lr:0.86 dt:35ms tok/s:1865970 rem:361s step 6789 (40%) loss:3.3437 lr:0.85 dt:35ms tok/s:1865033 rem:361s step 6790 (40%) loss:3.3687 lr:0.85 dt:35ms tok/s:1861484 rem:361s step 6791 (40%) loss:3.3811 lr:0.85 dt:35ms tok/s:1870338 rem:361s step 6792 (40%) loss:3.3779 lr:0.85 dt:35ms tok/s:1863339 rem:360s step 6793 (40%) loss:3.3588 lr:0.85 dt:35ms tok/s:1872593 rem:360s step 6794 (40%) loss:3.3569 lr:0.85 dt:35ms tok/s:1865945 rem:360s step 6795 (40%) loss:3.3533 lr:0.85 dt:35ms tok/s:1868748 rem:360s step 6796 (40%) loss:3.3595 lr:0.85 dt:35ms tok/s:1874892 rem:360s step 6797 (40%) loss:3.3349 lr:0.85 dt:35ms tok/s:1873486 rem:360s step 6798 (40%) loss:3.3609 lr:0.85 dt:36ms tok/s:1836548 rem:360s step 6799 (40%) loss:3.3670 lr:0.85 dt:36ms tok/s:1804229 rem:360s step 6800 (40%) loss:3.3837 lr:0.85 dt:35ms tok/s:1850606 rem:360s + local: attn=[0.057, 0.783, 0.754] mlp=[0.375, 0.176, -0.173] + + transition: attn=[2.554, 0.892] mlp=[-0.122, 0.288] + + hierarchy: attn=[2.850, 5.939, 5.616] mlp=[1.113, -0.951, -3.960] + step 6801 (40%) loss:3.3655 lr:0.85 dt:36ms tok/s:1841025 rem:360s step 6802 (40%) loss:3.3711 lr:0.85 dt:36ms tok/s:1846070 rem:360s step 6803 (40%) loss:3.3762 lr:0.85 dt:36ms tok/s:1845276 rem:360s step 6804 (40%) loss:3.3802 lr:0.85 dt:36ms tok/s:1835346 rem:360s step 6805 (40%) loss:3.3551 lr:0.85 dt:36ms tok/s:1832972 rem:360s step 6806 (40%) loss:3.3701 lr:0.85 dt:36ms tok/s:1838956 rem:360s step 6807 (40%) loss:3.3684 lr:0.85 dt:36ms tok/s:1837788 rem:360s step 6808 (40%) loss:3.3767 lr:0.85 dt:36ms tok/s:1829957 rem:360s step 6809 (40%) loss:3.3702 lr:0.85 dt:36ms tok/s:1830384 rem:360s step 6810 (40%) loss:3.3705 lr:0.85 dt:36ms tok/s:1834525 rem:360s step 6811 (40%) loss:3.3619 lr:0.85 dt:36ms tok/s:1836045 rem:360s step 6812 (40%) loss:3.3812 lr:0.85 dt:36ms tok/s:1833253 rem:360s step 6813 (40%) loss:3.3829 lr:0.85 dt:36ms tok/s:1832862 rem:360s step 6814 (40%) loss:3.3827 lr:0.85 dt:36ms tok/s:1842703 rem:360s step 6815 (40%) loss:3.3476 lr:0.85 dt:36ms tok/s:1837333 rem:360s step 6816 (40%) loss:3.3552 lr:0.85 dt:36ms tok/s:1842259 rem:360s step 6817 (40%) loss:3.3258 lr:0.85 dt:36ms tok/s:1836744 rem:360s step 6818 (40%) loss:3.3163 lr:0.85 dt:36ms tok/s:1831164 rem:360s step 6819 (40%) loss:3.3354 lr:0.85 dt:36ms tok/s:1837554 rem:360s step 6820 (40%) loss:3.3373 lr:0.85 dt:36ms tok/s:1833387 rem:359s step 6821 (40%) loss:3.3366 lr:0.85 dt:36ms tok/s:1839411 rem:359s step 6822 (40%) loss:3.3217 lr:0.85 dt:36ms tok/s:1834611 rem:359s step 6823 (40%) loss:3.3404 lr:0.85 dt:36ms tok/s:1842321 rem:359s step 6824 (40%) loss:3.3478 lr:0.85 dt:36ms tok/s:1843729 rem:359s step 6825 (40%) loss:3.3446 lr:0.85 dt:36ms tok/s:1845933 rem:359s step 6826 (40%) loss:3.3377 lr:0.85 dt:36ms tok/s:1843989 rem:359s step 6827 (40%) loss:3.3430 lr:0.85 dt:36ms tok/s:1840138 rem:359s step 6828 (40%) loss:3.3649 lr:0.85 dt:35ms tok/s:1846119 rem:359s step 6829 (40%) loss:3.3503 lr:0.85 dt:36ms tok/s:1834256 rem:359s step 6830 (40%) loss:3.3355 lr:0.85 dt:36ms tok/s:1834329 rem:359s step 6831 (40%) loss:3.3254 lr:0.85 dt:36ms tok/s:1829215 rem:359s step 6832 (40%) loss:3.3536 lr:0.85 dt:36ms tok/s:1832006 rem:359s step 6833 (40%) loss:3.3471 lr:0.85 dt:36ms tok/s:1834109 rem:359s step 6834 (40%) loss:3.3527 lr:0.85 dt:36ms tok/s:1832996 rem:359s step 6835 (40%) loss:3.3591 lr:0.85 dt:36ms tok/s:1830043 rem:359s step 6836 (40%) loss:3.3292 lr:0.85 dt:36ms tok/s:1839953 rem:359s step 6837 (40%) loss:3.3013 lr:0.85 dt:35ms tok/s:1846690 rem:359s step 6838 (40%) loss:3.3211 lr:0.85 dt:36ms tok/s:1826650 rem:359s step 6839 (40%) loss:3.2855 lr:0.85 dt:35ms tok/s:1849348 rem:359s step 6840 (40%) loss:3.3044 lr:0.85 dt:36ms tok/s:1843074 rem:359s step 6841 (40%) loss:3.2895 lr:0.85 dt:36ms tok/s:1836989 rem:359s step 6842 (40%) loss:3.3046 lr:0.85 dt:35ms tok/s:1849348 rem:359s step 6843 (40%) loss:3.2948 lr:0.85 dt:36ms tok/s:1837481 rem:359s step 6844 (40%) loss:3.3129 lr:0.85 dt:36ms tok/s:1832886 rem:359s step 6845 (40%) loss:3.3194 lr:0.85 dt:36ms tok/s:1834195 rem:359s step 6846 (40%) loss:3.3269 lr:0.85 dt:36ms tok/s:1832923 rem:359s step 6847 (40%) loss:3.3257 lr:0.85 dt:36ms tok/s:1836633 rem:359s step 6848 (40%) loss:3.3170 lr:0.85 dt:36ms tok/s:1830384 rem:358s step 6849 (40%) loss:3.3020 lr:0.85 dt:36ms tok/s:1832984 rem:358s step 6850 (40%) loss:3.3112 lr:0.85 dt:36ms tok/s:1840162 rem:358s step 6851 (40%) loss:3.2609 lr:0.85 dt:36ms tok/s:1845636 rem:358s step 6852 (40%) loss:3.2032 lr:0.85 dt:35ms tok/s:1847248 rem:358s step 6853 (40%) loss:3.1420 lr:0.85 dt:35ms tok/s:1849174 rem:358s step 6854 (40%) loss:3.0907 lr:0.85 dt:36ms tok/s:1836928 rem:358s step 6855 (40%) loss:3.0388 lr:0.85 dt:36ms tok/s:1834991 rem:358s step 6856 (40%) loss:2.9986 lr:0.85 dt:36ms tok/s:1796588 rem:358s step 6857 (40%) loss:3.0632 lr:0.85 dt:36ms tok/s:1799976 rem:358s step 6858 (40%) loss:3.1127 lr:0.85 dt:36ms tok/s:1836216 rem:358s step 6859 (40%) loss:3.1479 lr:0.85 dt:36ms tok/s:1828813 rem:358s step 6860 (40%) loss:3.1745 lr:0.85 dt:36ms tok/s:1830165 rem:358s step 6861 (40%) loss:3.1946 lr:0.85 dt:36ms tok/s:1840138 rem:358s step 6862 (40%) loss:3.1956 lr:0.85 dt:36ms tok/s:1821880 rem:358s step 6863 (40%) loss:3.2213 lr:0.85 dt:36ms tok/s:1834244 rem:358s step 6864 (40%) loss:3.2540 lr:0.85 dt:36ms tok/s:1836045 rem:358s step 6865 (40%) loss:3.3017 lr:0.85 dt:36ms tok/s:1829702 rem:358s step 6866 (40%) loss:3.3060 lr:0.85 dt:36ms tok/s:1843841 rem:358s step 6867 (40%) loss:3.3032 lr:0.85 dt:36ms tok/s:1830408 rem:358s step 6868 (40%) loss:3.3211 lr:0.85 dt:36ms tok/s:1829336 rem:358s step 6869 (40%) loss:3.3344 lr:0.85 dt:36ms tok/s:1826322 rem:358s step 6870 (40%) loss:3.3387 lr:0.85 dt:36ms tok/s:1806529 rem:358s step 6871 (40%) loss:3.3421 lr:0.85 dt:36ms tok/s:1838734 rem:358s step 6872 (40%) loss:3.3207 lr:0.85 dt:36ms tok/s:1829069 rem:358s step 6873 (40%) loss:3.3195 lr:0.85 dt:36ms tok/s:1835983 rem:358s step 6874 (40%) loss:3.3001 lr:0.85 dt:36ms tok/s:1816666 rem:358s step 6875 (40%) loss:3.3035 lr:0.85 dt:36ms tok/s:1832471 rem:358s step 6876 (40%) loss:3.3096 lr:0.85 dt:36ms tok/s:1819299 rem:357s step 6877 (40%) loss:3.3116 lr:0.85 dt:37ms tok/s:1792943 rem:357s step 6878 (40%) loss:3.3217 lr:0.85 dt:36ms tok/s:1833705 rem:357s step 6879 (40%) loss:3.3286 lr:0.85 dt:36ms tok/s:1839177 rem:357s step 6880 (40%) loss:3.3230 lr:0.85 dt:36ms tok/s:1814016 rem:357s step 6881 (40%) loss:3.3307 lr:0.85 dt:36ms tok/s:1829324 rem:357s step 6882 (40%) loss:3.3479 lr:0.85 dt:36ms tok/s:1831116 rem:357s step 6883 (40%) loss:3.3558 lr:0.85 dt:36ms tok/s:1830055 rem:357s step 6884 (40%) loss:3.3344 lr:0.85 dt:36ms tok/s:1806411 rem:357s step 6885 (40%) loss:3.3135 lr:0.85 dt:36ms tok/s:1813490 rem:357s step 6886 (40%) loss:3.3063 lr:0.85 dt:36ms tok/s:1816810 rem:357s step 6887 (40%) loss:3.3027 lr:0.85 dt:36ms tok/s:1817831 rem:357s step 6888 (40%) loss:3.3094 lr:0.85 dt:36ms tok/s:1811936 rem:357s step 6889 (40%) loss:3.3221 lr:0.85 dt:36ms tok/s:1805331 rem:357s step 6890 (40%) loss:3.3142 lr:0.85 dt:36ms tok/s:1814771 rem:357s step 6891 (41%) loss:3.3163 lr:0.85 dt:36ms tok/s:1813957 rem:357s step 6892 (41%) loss:3.3228 lr:0.85 dt:37ms tok/s:1770105 rem:357s step 6893 (41%) loss:3.3238 lr:0.85 dt:36ms tok/s:1805272 rem:357s step 6894 (41%) loss:3.3181 lr:0.85 dt:36ms tok/s:1824807 rem:357s step 6895 (41%) loss:3.3131 lr:0.85 dt:36ms tok/s:1826565 rem:357s step 6896 (41%) loss:3.3245 lr:0.85 dt:36ms tok/s:1830994 rem:357s step 6897 (41%) loss:3.3341 lr:0.85 dt:36ms tok/s:1830945 rem:357s step 6898 (41%) loss:3.3642 lr:0.85 dt:36ms tok/s:1833020 rem:357s step 6899 (41%) loss:3.3563 lr:0.85 dt:36ms tok/s:1831555 rem:357s step 6900 (41%) loss:3.3390 lr:0.85 dt:36ms tok/s:1835101 rem:357s + local: attn=[0.072, 0.779, 0.765] mlp=[0.378, 0.207, -0.199] + + transition: attn=[2.559, 0.857] mlp=[-0.137, 0.318] + + hierarchy: attn=[2.845, 5.939, 5.616] mlp=[1.173, -0.989, -3.880] + step 6901 (41%) loss:3.3398 lr:0.85 dt:36ms tok/s:1833828 rem:357s step 6902 (41%) loss:3.3400 lr:0.85 dt:36ms tok/s:1828351 rem:357s step 6903 (41%) loss:3.3204 lr:0.85 dt:36ms tok/s:1837739 rem:357s step 6904 (41%) loss:3.3215 lr:0.85 dt:36ms tok/s:1832165 rem:356s step 6905 (41%) loss:3.3296 lr:0.85 dt:36ms tok/s:1840027 rem:356s step 6906 (41%) loss:3.3331 lr:0.85 dt:36ms tok/s:1831945 rem:356s step 6907 (41%) loss:3.3283 lr:0.85 dt:36ms tok/s:1824794 rem:356s step 6908 (41%) loss:3.3240 lr:0.85 dt:36ms tok/s:1831018 rem:356s step 6909 (41%) loss:3.3194 lr:0.84 dt:36ms tok/s:1832715 rem:356s step 6910 (41%) loss:3.3250 lr:0.84 dt:36ms tok/s:1829519 rem:356s step 6911 (41%) loss:3.3289 lr:0.84 dt:36ms tok/s:1833791 rem:356s step 6912 (41%) loss:3.3402 lr:0.84 dt:36ms tok/s:1834023 rem:356s step 6913 (41%) loss:3.3325 lr:0.84 dt:36ms tok/s:1831909 rem:356s step 6914 (41%) loss:3.3712 lr:0.84 dt:36ms tok/s:1834623 rem:356s step 6915 (41%) loss:3.3593 lr:0.84 dt:36ms tok/s:1833509 rem:356s step 6916 (41%) loss:3.3497 lr:0.84 dt:36ms tok/s:1827913 rem:356s step 6917 (41%) loss:3.3508 lr:0.84 dt:36ms tok/s:1824165 rem:356s step 6918 (41%) loss:3.4062 lr:0.84 dt:36ms tok/s:1834770 rem:356s step 6919 (41%) loss:3.3843 lr:0.84 dt:36ms tok/s:1838353 rem:356s step 6920 (41%) loss:3.3840 lr:0.84 dt:36ms tok/s:1839953 rem:356s step 6921 (41%) loss:3.3762 lr:0.84 dt:36ms tok/s:1830933 rem:356s step 6922 (41%) loss:3.3691 lr:0.84 dt:36ms tok/s:1834084 rem:356s step 6923 (41%) loss:3.3772 lr:0.84 dt:36ms tok/s:1832837 rem:356s step 6924 (41%) loss:3.3601 lr:0.84 dt:36ms tok/s:1833595 rem:356s step 6925 (41%) loss:3.3426 lr:0.84 dt:36ms tok/s:1836057 rem:356s step 6926 (41%) loss:3.3288 lr:0.84 dt:36ms tok/s:1836057 rem:356s step 6927 (41%) loss:3.3298 lr:0.84 dt:36ms tok/s:1834476 rem:356s step 6928 (41%) loss:3.3007 lr:0.84 dt:36ms tok/s:1827427 rem:356s step 6929 (41%) loss:3.2910 lr:0.84 dt:36ms tok/s:1832165 rem:356s step 6930 (41%) loss:3.2841 lr:0.84 dt:36ms tok/s:1836278 rem:356s step 6931 (41%) loss:3.2955 lr:0.84 dt:36ms tok/s:1826019 rem:356s step 6932 (41%) loss:3.2989 lr:0.84 dt:36ms tok/s:1830908 rem:355s step 6933 (41%) loss:3.2986 lr:0.84 dt:36ms tok/s:1827330 rem:355s step 6934 (41%) loss:3.2914 lr:0.84 dt:36ms tok/s:1829872 rem:355s step 6935 (41%) loss:3.2878 lr:0.84 dt:36ms tok/s:1833436 rem:355s step 6936 (41%) loss:3.2901 lr:0.84 dt:36ms tok/s:1821445 rem:355s step 6937 (41%) loss:3.2911 lr:0.84 dt:36ms tok/s:1829568 rem:355s step 6938 (41%) loss:3.2929 lr:0.84 dt:36ms tok/s:1832972 rem:355s step 6939 (41%) loss:3.2909 lr:0.84 dt:36ms tok/s:1840939 rem:355s step 6940 (41%) loss:3.2977 lr:0.84 dt:36ms tok/s:1830250 rem:355s step 6941 (41%) loss:3.3050 lr:0.84 dt:36ms tok/s:1838882 rem:355s step 6942 (41%) loss:3.3127 lr:0.84 dt:36ms tok/s:1830920 rem:355s step 6943 (41%) loss:3.2987 lr:0.84 dt:36ms tok/s:1831030 rem:355s step 6944 (41%) loss:3.2998 lr:0.84 dt:36ms tok/s:1823560 rem:355s step 6945 (41%) loss:3.2999 lr:0.84 dt:36ms tok/s:1829726 rem:355s step 6946 (41%) loss:3.3161 lr:0.84 dt:36ms tok/s:1836511 rem:355s step 6947 (41%) loss:3.2945 lr:0.84 dt:36ms tok/s:1839288 rem:355s step 6948 (41%) loss:3.2871 lr:0.84 dt:36ms tok/s:1831152 rem:355s step 6949 (41%) loss:3.3377 lr:0.84 dt:36ms tok/s:1832605 rem:355s step 6950 (41%) loss:3.3338 lr:0.84 dt:36ms tok/s:1832226 rem:355s step 6951 (41%) loss:3.3381 lr:0.84 dt:36ms tok/s:1828424 rem:355s step 6952 (41%) loss:3.3353 lr:0.84 dt:36ms tok/s:1829093 rem:355s step 6953 (41%) loss:3.3174 lr:0.84 dt:36ms tok/s:1827318 rem:355s step 6954 (41%) loss:3.3173 lr:0.84 dt:36ms tok/s:1833901 rem:355s step 6955 (41%) loss:3.2922 lr:0.84 dt:36ms tok/s:1831689 rem:355s step 6956 (41%) loss:3.2823 lr:0.84 dt:36ms tok/s:1838328 rem:355s step 6957 (41%) loss:3.2913 lr:0.84 dt:36ms tok/s:1833302 rem:355s step 6958 (41%) loss:3.3073 lr:0.84 dt:36ms tok/s:1835983 rem:355s step 6959 (41%) loss:3.3222 lr:0.84 dt:36ms tok/s:1836633 rem:355s step 6960 (41%) loss:3.3376 lr:0.84 dt:36ms tok/s:1839276 rem:354s step 6961 (41%) loss:3.3311 lr:0.84 dt:36ms tok/s:1837260 rem:354s step 6962 (41%) loss:3.3430 lr:0.84 dt:36ms tok/s:1836916 rem:354s step 6963 (41%) loss:3.3602 lr:0.84 dt:36ms tok/s:1828387 rem:354s step 6964 (41%) loss:3.3603 lr:0.84 dt:35ms tok/s:1857986 rem:354s step 6965 (41%) loss:3.3664 lr:0.84 dt:36ms tok/s:1836940 rem:354s step 6966 (41%) loss:3.3635 lr:0.84 dt:36ms tok/s:1834684 rem:354s step 6967 (41%) loss:3.3642 lr:0.84 dt:36ms tok/s:1832373 rem:354s step 6968 (41%) loss:3.3559 lr:0.84 dt:36ms tok/s:1825594 rem:354s step 6969 (41%) loss:3.3610 lr:0.84 dt:36ms tok/s:1824443 rem:354s step 6970 (41%) loss:3.3834 lr:0.84 dt:36ms tok/s:1832739 rem:354s step 6971 (41%) loss:3.3826 lr:0.84 dt:36ms tok/s:1827743 rem:354s step 6972 (41%) loss:3.3673 lr:0.84 dt:36ms tok/s:1833167 rem:354s step 6973 (41%) loss:3.3707 lr:0.84 dt:36ms tok/s:1824371 rem:354s step 6974 (41%) loss:3.3656 lr:0.84 dt:36ms tok/s:1832458 rem:354s step 6975 (41%) loss:3.3718 lr:0.84 dt:36ms tok/s:1829933 rem:354s step 6976 (41%) loss:3.3687 lr:0.84 dt:36ms tok/s:1827403 rem:354s step 6977 (41%) loss:3.3607 lr:0.84 dt:36ms tok/s:1832947 rem:354s step 6978 (41%) loss:3.3626 lr:0.84 dt:36ms tok/s:1829933 rem:354s step 6979 (41%) loss:3.3571 lr:0.84 dt:36ms tok/s:1833607 rem:354s step 6980 (41%) loss:3.3256 lr:0.84 dt:36ms tok/s:1834991 rem:354s step 6981 (41%) loss:3.2805 lr:0.84 dt:36ms tok/s:1831384 rem:354s step 6982 (41%) loss:3.2750 lr:0.84 dt:36ms tok/s:1825982 rem:354s step 6983 (41%) loss:3.2949 lr:0.84 dt:36ms tok/s:1837382 rem:354s step 6984 (41%) loss:3.2802 lr:0.84 dt:36ms tok/s:1833901 rem:354s step 6985 (41%) loss:3.2684 lr:0.84 dt:36ms tok/s:1825255 rem:354s step 6986 (41%) loss:3.2677 lr:0.84 dt:36ms tok/s:1828035 rem:354s step 6987 (41%) loss:3.2637 lr:0.84 dt:36ms tok/s:1829142 rem:354s step 6988 (41%) loss:3.2578 lr:0.84 dt:36ms tok/s:1832947 rem:353s step 6989 (41%) loss:3.2457 lr:0.84 dt:36ms tok/s:1839559 rem:353s step 6990 (41%) loss:3.2524 lr:0.84 dt:36ms tok/s:1833485 rem:353s step 6991 (41%) loss:3.2508 lr:0.84 dt:36ms tok/s:1819817 rem:353s step 6992 (41%) loss:3.2611 lr:0.84 dt:36ms tok/s:1826395 rem:353s step 6993 (41%) loss:3.2562 lr:0.84 dt:36ms tok/s:1823027 rem:353s step 6994 (41%) loss:3.2631 lr:0.84 dt:36ms tok/s:1831555 rem:353s step 6995 (41%) loss:3.2735 lr:0.84 dt:36ms tok/s:1832141 rem:353s step 6996 (41%) loss:3.2736 lr:0.84 dt:36ms tok/s:1834929 rem:353s step 6997 (41%) loss:3.2747 lr:0.84 dt:36ms tok/s:1838058 rem:353s step 6998 (41%) loss:3.2884 lr:0.84 dt:36ms tok/s:1835027 rem:353s step 6999 (41%) loss:3.2963 lr:0.84 dt:36ms tok/s:1828339 rem:353s step 7000 (41%) loss:3.2999 lr:0.84 dt:36ms tok/s:1826128 rem:353s + local: attn=[0.073, 0.806, 0.791] mlp=[0.405, 0.203, -0.189] + + transition: attn=[2.576, 0.876] mlp=[-0.145, 0.331] + + hierarchy: attn=[2.919, 5.939, 5.616] mlp=[1.127, -0.940, -3.632] + step 7001 (41%) loss:3.3232 lr:0.84 dt:36ms tok/s:1832275 rem:353s step 7002 (41%) loss:3.3214 lr:0.84 dt:36ms tok/s:1836376 rem:353s step 7003 (41%) loss:3.3239 lr:0.84 dt:36ms tok/s:1831360 rem:353s step 7004 (41%) loss:3.3356 lr:0.84 dt:36ms tok/s:1826820 rem:353s step 7005 (41%) loss:3.3247 lr:0.84 dt:36ms tok/s:1833876 rem:353s step 7006 (41%) loss:3.3282 lr:0.84 dt:36ms tok/s:1835664 rem:353s step 7007 (41%) loss:3.3069 lr:0.84 dt:36ms tok/s:1837653 rem:353s step 7008 (41%) loss:3.3029 lr:0.84 dt:36ms tok/s:1840261 rem:353s step 7009 (41%) loss:3.2860 lr:0.84 dt:36ms tok/s:1828922 rem:353s step 7010 (41%) loss:3.2709 lr:0.84 dt:36ms tok/s:1829020 rem:353s step 7011 (41%) loss:3.2749 lr:0.84 dt:36ms tok/s:1835468 rem:353s step 7012 (41%) loss:3.2560 lr:0.84 dt:36ms tok/s:1823656 rem:353s step 7013 (41%) loss:3.2227 lr:0.84 dt:36ms tok/s:1830335 rem:353s step 7014 (41%) loss:3.2223 lr:0.84 dt:36ms tok/s:1829897 rem:353s step 7015 (41%) loss:3.2376 lr:0.84 dt:36ms tok/s:1828983 rem:353s step 7016 (41%) loss:3.2362 lr:0.84 dt:36ms tok/s:1827731 rem:352s step 7017 (41%) loss:3.2451 lr:0.84 dt:36ms tok/s:1828205 rem:352s step 7018 (41%) loss:3.2586 lr:0.84 dt:36ms tok/s:1831323 rem:352s step 7019 (41%) loss:3.2920 lr:0.84 dt:36ms tok/s:1833399 rem:352s step 7020 (41%) loss:3.3260 lr:0.84 dt:36ms tok/s:1822629 rem:352s step 7021 (41%) loss:3.3330 lr:0.84 dt:36ms tok/s:1828229 rem:352s step 7022 (41%) loss:3.3328 lr:0.84 dt:36ms tok/s:1841358 rem:352s step 7023 (41%) loss:3.3237 lr:0.84 dt:36ms tok/s:1837247 rem:352s step 7024 (41%) loss:3.3308 lr:0.84 dt:36ms tok/s:1833326 rem:352s step 7025 (41%) loss:3.3212 lr:0.84 dt:36ms tok/s:1835456 rem:352s step 7026 (41%) loss:3.3259 lr:0.83 dt:36ms tok/s:1833436 rem:352s step 7027 (41%) loss:3.3206 lr:0.83 dt:36ms tok/s:1830067 rem:352s step 7028 (41%) loss:3.3307 lr:0.83 dt:36ms tok/s:1830652 rem:352s step 7029 (41%) loss:3.3296 lr:0.83 dt:36ms tok/s:1836327 rem:352s step 7030 (41%) loss:3.3506 lr:0.83 dt:36ms tok/s:1835285 rem:352s step 7031 (41%) loss:3.3675 lr:0.83 dt:36ms tok/s:1841370 rem:352s step 7032 (41%) loss:3.3632 lr:0.83 dt:36ms tok/s:1836498 rem:352s step 7033 (41%) loss:3.3683 lr:0.83 dt:36ms tok/s:1829336 rem:352s step 7034 (41%) loss:3.3480 lr:0.83 dt:36ms tok/s:1832214 rem:352s step 7035 (41%) loss:3.3497 lr:0.83 dt:36ms tok/s:1828253 rem:352s step 7036 (41%) loss:3.3180 lr:0.83 dt:36ms tok/s:1826541 rem:352s step 7037 (41%) loss:3.3141 lr:0.83 dt:36ms tok/s:1835297 rem:352s step 7038 (41%) loss:3.3332 lr:0.83 dt:36ms tok/s:1835223 rem:352s step 7039 (41%) loss:3.3290 lr:0.83 dt:36ms tok/s:1831982 rem:352s step 7040 (41%) loss:3.3263 lr:0.83 dt:36ms tok/s:1829349 rem:352s step 7041 (41%) loss:3.3407 lr:0.83 dt:36ms tok/s:1836793 rem:352s step 7042 (41%) loss:3.3438 lr:0.83 dt:36ms tok/s:1835407 rem:352s step 7043 (41%) loss:3.3363 lr:0.83 dt:36ms tok/s:1833387 rem:352s step 7044 (41%) loss:3.3155 lr:0.83 dt:35ms tok/s:1860426 rem:351s step 7045 (41%) loss:3.3089 lr:0.83 dt:36ms tok/s:1842333 rem:351s step 7046 (41%) loss:3.3228 lr:0.83 dt:36ms tok/s:1830786 rem:351s step 7047 (41%) loss:3.3266 lr:0.83 dt:36ms tok/s:1827160 rem:351s step 7048 (41%) loss:3.3091 lr:0.83 dt:36ms tok/s:1828728 rem:351s step 7049 (41%) loss:3.2874 lr:0.83 dt:36ms tok/s:1824286 rem:351s step 7050 (41%) loss:3.2942 lr:0.83 dt:36ms tok/s:1832813 rem:351s step 7051 (41%) loss:3.2977 lr:0.83 dt:36ms tok/s:1831774 rem:351s step 7052 (41%) loss:3.2968 lr:0.83 dt:36ms tok/s:1826104 rem:351s step 7053 (41%) loss:3.2858 lr:0.83 dt:36ms tok/s:1835248 rem:351s step 7054 (41%) loss:3.3020 lr:0.83 dt:36ms tok/s:1832923 rem:351s step 7055 (41%) loss:3.3176 lr:0.83 dt:36ms tok/s:1832923 rem:351s step 7056 (41%) loss:3.3100 lr:0.83 dt:36ms tok/s:1835346 rem:351s step 7057 (41%) loss:3.3163 lr:0.83 dt:36ms tok/s:1825982 rem:351s step 7058 (41%) loss:3.3337 lr:0.83 dt:36ms tok/s:1829726 rem:351s step 7059 (42%) loss:3.3229 lr:0.83 dt:36ms tok/s:1830811 rem:351s step 7060 (42%) loss:3.3298 lr:0.83 dt:36ms tok/s:1832898 rem:351s step 7061 (42%) loss:3.3157 lr:0.83 dt:36ms tok/s:1831945 rem:351s step 7062 (42%) loss:3.3162 lr:0.83 dt:36ms tok/s:1836830 rem:351s step 7063 (42%) loss:3.3019 lr:0.83 dt:36ms tok/s:1837911 rem:351s step 7064 (42%) loss:3.3098 lr:0.83 dt:36ms tok/s:1834929 rem:351s step 7065 (42%) loss:3.3210 lr:0.83 dt:36ms tok/s:1834219 rem:351s step 7066 (42%) loss:3.3197 lr:0.83 dt:36ms tok/s:1838033 rem:351s step 7067 (42%) loss:3.3331 lr:0.83 dt:36ms tok/s:1825667 rem:351s step 7068 (42%) loss:3.3491 lr:0.83 dt:36ms tok/s:1833864 rem:351s step 7069 (42%) loss:3.3456 lr:0.83 dt:36ms tok/s:1831848 rem:351s step 7070 (42%) loss:3.3146 lr:0.83 dt:36ms tok/s:1824770 rem:351s step 7071 (42%) loss:3.2946 lr:0.83 dt:36ms tok/s:1836425 rem:351s step 7072 (42%) loss:3.3010 lr:0.83 dt:36ms tok/s:1824249 rem:350s step 7073 (42%) loss:3.2887 lr:0.83 dt:36ms tok/s:1824213 rem:350s step 7074 (42%) loss:3.2940 lr:0.83 dt:36ms tok/s:1825097 rem:350s step 7075 (42%) loss:3.2766 lr:0.83 dt:36ms tok/s:1826492 rem:350s step 7076 (42%) loss:3.2822 lr:0.83 dt:36ms tok/s:1825776 rem:350s step 7077 (42%) loss:3.2867 lr:0.83 dt:36ms tok/s:1828703 rem:350s step 7078 (42%) loss:3.2936 lr:0.83 dt:36ms tok/s:1830274 rem:350s step 7079 (42%) loss:3.3105 lr:0.83 dt:36ms tok/s:1826395 rem:350s step 7080 (42%) loss:3.3232 lr:0.83 dt:36ms tok/s:1826541 rem:350s step 7081 (42%) loss:3.3177 lr:0.83 dt:36ms tok/s:1834599 rem:350s step 7082 (42%) loss:3.3333 lr:0.83 dt:36ms tok/s:1829641 rem:350s step 7083 (42%) loss:3.3329 lr:0.83 dt:36ms tok/s:1830811 rem:350s step 7084 (42%) loss:3.3341 lr:0.83 dt:36ms tok/s:1827269 rem:350s step 7085 (42%) loss:3.3388 lr:0.83 dt:36ms tok/s:1826225 rem:350s step 7086 (42%) loss:3.3296 lr:0.83 dt:36ms tok/s:1827889 rem:350s step 7087 (42%) loss:3.3253 lr:0.83 dt:36ms tok/s:1839916 rem:350s step 7088 (42%) loss:3.3181 lr:0.83 dt:36ms tok/s:1832813 rem:350s step 7089 (42%) loss:3.3205 lr:0.83 dt:36ms tok/s:1832874 rem:350s step 7090 (42%) loss:3.2971 lr:0.83 dt:36ms tok/s:1836008 rem:350s step 7091 (42%) loss:3.2896 lr:0.83 dt:36ms tok/s:1836560 rem:350s step 7092 (42%) loss:3.2898 lr:0.83 dt:36ms tok/s:1813466 rem:350s step 7093 (42%) loss:3.2934 lr:0.83 dt:35ms tok/s:1851204 rem:350s step 7094 (42%) loss:3.3087 lr:0.83 dt:35ms tok/s:1846603 rem:350s step 7095 (42%) loss:3.3163 lr:0.83 dt:35ms tok/s:1857722 rem:350s step 7096 (42%) loss:3.3235 lr:0.83 dt:35ms tok/s:1852876 rem:350s step 7097 (42%) loss:3.3149 lr:0.83 dt:35ms tok/s:1854051 rem:350s step 7098 (42%) loss:3.3030 lr:0.83 dt:35ms tok/s:1857446 rem:350s step 7099 (42%) loss:3.2914 lr:0.83 dt:36ms tok/s:1843297 rem:350s step 7100 (42%) loss:3.2872 lr:0.83 dt:35ms tok/s:1854139 rem:349s + local: attn=[0.069, 0.807, 0.766] mlp=[0.402, 0.195, -0.202] + + transition: attn=[2.669, 0.876] mlp=[-0.118, 0.330] + + hierarchy: attn=[2.868, 5.939, 5.616] mlp=[1.116, -0.918, -3.584] + step 7101 (42%) loss:3.2710 lr:0.83 dt:35ms tok/s:1852701 rem:349s step 7102 (42%) loss:3.2494 lr:0.83 dt:35ms tok/s:1856355 rem:349s step 7103 (42%) loss:3.2120 lr:0.83 dt:35ms tok/s:1852090 rem:349s step 7104 (42%) loss:3.1800 lr:0.83 dt:35ms tok/s:1848267 rem:349s step 7105 (42%) loss:3.1435 lr:0.83 dt:35ms tok/s:1852402 rem:349s step 7106 (42%) loss:3.1087 lr:0.83 dt:35ms tok/s:1853926 rem:349s step 7107 (42%) loss:3.0875 lr:0.83 dt:35ms tok/s:1861434 rem:349s step 7108 (42%) loss:3.0593 lr:0.83 dt:35ms tok/s:1848304 rem:349s step 7109 (42%) loss:3.0440 lr:0.83 dt:35ms tok/s:1850257 rem:349s step 7110 (42%) loss:3.0903 lr:0.83 dt:35ms tok/s:1849784 rem:349s step 7111 (42%) loss:3.1201 lr:0.83 dt:36ms tok/s:1844137 rem:349s step 7112 (42%) loss:3.1385 lr:0.83 dt:35ms tok/s:1859822 rem:349s step 7113 (42%) loss:3.1829 lr:0.83 dt:35ms tok/s:1853339 rem:349s step 7114 (42%) loss:3.2114 lr:0.83 dt:35ms tok/s:1852614 rem:349s step 7115 (42%) loss:3.2204 lr:0.83 dt:35ms tok/s:1856142 rem:349s step 7116 (42%) loss:3.2526 lr:0.83 dt:35ms tok/s:1848677 rem:349s step 7117 (42%) loss:3.2719 lr:0.83 dt:35ms tok/s:1846218 rem:349s step 7118 (42%) loss:3.2823 lr:0.83 dt:35ms tok/s:1847273 rem:349s step 7119 (42%) loss:3.3165 lr:0.83 dt:35ms tok/s:1853626 rem:349s step 7120 (42%) loss:3.3187 lr:0.83 dt:35ms tok/s:1851030 rem:349s step 7121 (42%) loss:3.3438 lr:0.83 dt:35ms tok/s:1849859 rem:349s step 7122 (42%) loss:3.3545 lr:0.83 dt:36ms tok/s:1799811 rem:349s step 7123 (42%) loss:3.3636 lr:0.83 dt:37ms tok/s:1767090 rem:349s step 7124 (42%) loss:3.3617 lr:0.83 dt:35ms tok/s:1873601 rem:349s step 7125 (42%) loss:3.3466 lr:0.83 dt:35ms tok/s:1876159 rem:349s step 7126 (42%) loss:3.3556 lr:0.83 dt:35ms tok/s:1863377 rem:349s step 7127 (42%) loss:3.3449 lr:0.83 dt:35ms tok/s:1868672 rem:349s step 7128 (42%) loss:3.3548 lr:0.83 dt:35ms tok/s:1877966 rem:348s step 7129 (42%) loss:3.3552 lr:0.83 dt:35ms tok/s:1871318 rem:348s step 7130 (42%) loss:3.3625 lr:0.83 dt:35ms tok/s:1863402 rem:348s step 7131 (42%) loss:3.4131 lr:0.83 dt:35ms tok/s:1874189 rem:348s step 7132 (42%) loss:3.4235 lr:0.83 dt:35ms tok/s:1849933 rem:348s step 7133 (42%) loss:3.4428 lr:0.83 dt:35ms tok/s:1850382 rem:348s step 7134 (42%) loss:3.4264 lr:0.83 dt:35ms tok/s:1852589 rem:348s step 7135 (42%) loss:3.4273 lr:0.83 dt:36ms tok/s:1844348 rem:348s step 7136 (42%) loss:3.4217 lr:0.83 dt:35ms tok/s:1858526 rem:348s step 7137 (42%) loss:3.4206 lr:0.83 dt:35ms tok/s:1856280 rem:348s step 7138 (42%) loss:3.4119 lr:0.83 dt:35ms tok/s:1849174 rem:348s step 7139 (42%) loss:3.4034 lr:0.83 dt:36ms tok/s:1840138 rem:348s step 7140 (42%) loss:3.3950 lr:0.82 dt:36ms tok/s:1840064 rem:348s step 7141 (42%) loss:3.4079 lr:0.82 dt:35ms tok/s:1852277 rem:348s step 7142 (42%) loss:3.4397 lr:0.82 dt:36ms tok/s:1845190 rem:348s step 7143 (42%) loss:3.4235 lr:0.82 dt:35ms tok/s:1853289 rem:348s step 7144 (42%) loss:3.4459 lr:0.82 dt:35ms tok/s:1846652 rem:348s step 7145 (42%) loss:3.4470 lr:0.82 dt:35ms tok/s:1851778 rem:348s step 7146 (42%) loss:3.4492 lr:0.82 dt:35ms tok/s:1849224 rem:348s step 7147 (42%) loss:3.4373 lr:0.82 dt:35ms tok/s:1857409 rem:348s step 7148 (42%) loss:3.4249 lr:0.82 dt:35ms tok/s:1850942 rem:348s step 7149 (42%) loss:3.4148 lr:0.82 dt:35ms tok/s:1851354 rem:348s step 7150 (42%) loss:3.4351 lr:0.82 dt:36ms tok/s:1845574 rem:348s step 7151 (42%) loss:3.4430 lr:0.82 dt:35ms tok/s:1851528 rem:348s step 7152 (42%) loss:3.4047 lr:0.82 dt:36ms tok/s:1841444 rem:348s step 7153 (42%) loss:3.4057 lr:0.82 dt:35ms tok/s:1848428 rem:348s step 7154 (42%) loss:3.3867 lr:0.82 dt:35ms tok/s:1849112 rem:348s step 7155 (42%) loss:3.3861 lr:0.82 dt:37ms tok/s:1790549 rem:348s step 7156 (42%) loss:3.3837 lr:0.82 dt:36ms tok/s:1846082 rem:347s step 7157 (42%) loss:3.3670 lr:0.82 dt:35ms tok/s:1850444 rem:347s step 7158 (42%) loss:3.3722 lr:0.82 dt:35ms tok/s:1850432 rem:347s step 7159 (42%) loss:3.3682 lr:0.82 dt:35ms tok/s:1846417 rem:347s step 7160 (42%) loss:3.3774 lr:0.82 dt:35ms tok/s:1850556 rem:347s step 7161 (42%) loss:3.3672 lr:0.82 dt:35ms tok/s:1852589 rem:347s step 7162 (42%) loss:3.3496 lr:0.82 dt:35ms tok/s:1850830 rem:347s step 7163 (42%) loss:3.3193 lr:0.82 dt:35ms tok/s:1853939 rem:347s step 7164 (42%) loss:3.2779 lr:0.82 dt:35ms tok/s:1847012 rem:347s step 7165 (42%) loss:3.2986 lr:0.82 dt:35ms tok/s:1849037 rem:347s step 7166 (42%) loss:3.2918 lr:0.82 dt:35ms tok/s:1849909 rem:347s step 7167 (42%) loss:3.2695 lr:0.82 dt:35ms tok/s:1851005 rem:347s step 7168 (42%) loss:3.2651 lr:0.82 dt:35ms tok/s:1853351 rem:347s step 7169 (42%) loss:3.2598 lr:0.82 dt:35ms tok/s:1848006 rem:347s step 7170 (42%) loss:3.2755 lr:0.82 dt:35ms tok/s:1853976 rem:347s step 7171 (42%) loss:3.2912 lr:0.82 dt:35ms tok/s:1857057 rem:347s step 7172 (42%) loss:3.3078 lr:0.82 dt:35ms tok/s:1848838 rem:347s step 7173 (42%) loss:3.2919 lr:0.82 dt:35ms tok/s:1853389 rem:347s step 7174 (42%) loss:3.2798 lr:0.82 dt:36ms tok/s:1843383 rem:347s step 7175 (42%) loss:3.2970 lr:0.82 dt:35ms tok/s:1852527 rem:347s step 7176 (42%) loss:3.3043 lr:0.82 dt:35ms tok/s:1852939 rem:347s step 7177 (42%) loss:3.3024 lr:0.82 dt:35ms tok/s:1846839 rem:347s step 7178 (42%) loss:3.2901 lr:0.82 dt:36ms tok/s:1842209 rem:347s step 7179 (42%) loss:3.2753 lr:0.82 dt:36ms tok/s:1843470 rem:347s step 7180 (42%) loss:3.2692 lr:0.82 dt:36ms tok/s:1831701 rem:347s step 7181 (42%) loss:3.2760 lr:0.82 dt:35ms tok/s:1849635 rem:347s step 7182 (42%) loss:3.2766 lr:0.82 dt:36ms tok/s:1845859 rem:347s step 7183 (42%) loss:3.2696 lr:0.82 dt:35ms tok/s:1846491 rem:347s step 7184 (42%) loss:3.2793 lr:0.82 dt:35ms tok/s:1852177 rem:346s step 7185 (42%) loss:3.2987 lr:0.82 dt:35ms tok/s:1849324 rem:346s step 7186 (42%) loss:3.3178 lr:0.82 dt:36ms tok/s:1840791 rem:346s step 7187 (42%) loss:3.3240 lr:0.82 dt:35ms tok/s:1854501 rem:346s step 7188 (42%) loss:3.3264 lr:0.82 dt:35ms tok/s:1853451 rem:346s step 7189 (42%) loss:3.3247 lr:0.82 dt:35ms tok/s:1858526 rem:346s step 7190 (42%) loss:3.3093 lr:0.82 dt:36ms tok/s:1845747 rem:346s step 7191 (42%) loss:3.3176 lr:0.82 dt:36ms tok/s:1808717 rem:346s step 7192 (42%) loss:3.3260 lr:0.82 dt:35ms tok/s:1851354 rem:346s step 7193 (42%) loss:3.3201 lr:0.82 dt:35ms tok/s:1849958 rem:346s step 7194 (42%) loss:3.3245 lr:0.82 dt:35ms tok/s:1849485 rem:346s step 7195 (42%) loss:3.2963 lr:0.82 dt:36ms tok/s:1841037 rem:346s step 7196 (42%) loss:3.2985 lr:0.82 dt:35ms tok/s:1846367 rem:346s step 7197 (42%) loss:3.3095 lr:0.82 dt:35ms tok/s:1849747 rem:346s step 7198 (42%) loss:3.3099 lr:0.82 dt:36ms tok/s:1840483 rem:346s step 7199 (42%) loss:3.3102 lr:0.82 dt:35ms tok/s:1847993 rem:346s step 7200 (42%) loss:3.3029 lr:0.82 dt:35ms tok/s:1850108 rem:346s + local: attn=[0.083, 0.791, 0.798] mlp=[0.410, 0.198, -0.174] + + transition: attn=[2.650, 0.898] mlp=[-0.126, 0.339] + + hierarchy: attn=[2.916, 5.939, 5.616] mlp=[1.092, -0.936, -3.435] + step 7201 (42%) loss:3.3189 lr:0.82 dt:35ms tok/s:1850793 rem:346s step 7202 (42%) loss:3.3147 lr:0.82 dt:35ms tok/s:1856995 rem:346s step 7203 (42%) loss:3.3214 lr:0.82 dt:35ms tok/s:1848714 rem:346s step 7204 (42%) loss:3.3231 lr:0.82 dt:36ms tok/s:1845016 rem:346s step 7205 (42%) loss:3.3182 lr:0.82 dt:35ms tok/s:1851603 rem:346s step 7206 (42%) loss:3.3222 lr:0.82 dt:35ms tok/s:1853851 rem:346s step 7207 (42%) loss:3.3310 lr:0.82 dt:35ms tok/s:1853876 rem:346s step 7208 (42%) loss:3.3450 lr:0.82 dt:35ms tok/s:1853414 rem:346s step 7209 (42%) loss:3.3271 lr:0.82 dt:35ms tok/s:1855916 rem:346s step 7210 (42%) loss:3.3201 lr:0.82 dt:35ms tok/s:1855065 rem:346s step 7211 (42%) loss:3.3342 lr:0.82 dt:35ms tok/s:1850344 rem:346s step 7212 (42%) loss:3.3163 lr:0.82 dt:36ms tok/s:1841777 rem:345s step 7213 (42%) loss:3.3258 lr:0.82 dt:36ms tok/s:1845078 rem:345s step 7214 (42%) loss:3.3434 lr:0.82 dt:35ms tok/s:1847844 rem:345s step 7215 (42%) loss:3.3408 lr:0.82 dt:35ms tok/s:1848540 rem:345s step 7216 (42%) loss:3.3578 lr:0.82 dt:35ms tok/s:1847434 rem:345s step 7217 (42%) loss:3.3386 lr:0.82 dt:35ms tok/s:1857572 rem:345s step 7218 (42%) loss:3.3099 lr:0.82 dt:35ms tok/s:1854914 rem:345s step 7219 (42%) loss:3.2872 lr:0.82 dt:35ms tok/s:1853439 rem:345s step 7220 (42%) loss:3.2969 lr:0.82 dt:37ms tok/s:1789546 rem:345s step 7221 (42%) loss:3.2982 lr:0.82 dt:35ms tok/s:1862455 rem:345s step 7222 (42%) loss:3.2817 lr:0.82 dt:35ms tok/s:1860086 rem:345s step 7223 (42%) loss:3.2957 lr:0.82 dt:36ms tok/s:1845128 rem:345s step 7224 (42%) loss:3.2765 lr:0.82 dt:35ms tok/s:1847782 rem:345s step 7225 (42%) loss:3.2569 lr:0.82 dt:35ms tok/s:1853126 rem:345s step 7226 (42%) loss:3.2283 lr:0.82 dt:36ms tok/s:1843321 rem:345s step 7227 (42%) loss:3.2246 lr:0.82 dt:35ms tok/s:1859079 rem:345s step 7228 (43%) loss:3.2255 lr:0.82 dt:35ms tok/s:1855227 rem:345s step 7229 (43%) loss:3.2410 lr:0.82 dt:35ms tok/s:1853076 rem:345s step 7230 (43%) loss:3.2453 lr:0.82 dt:35ms tok/s:1855566 rem:345s step 7231 (43%) loss:3.2483 lr:0.82 dt:35ms tok/s:1846938 rem:345s step 7232 (43%) loss:3.2479 lr:0.82 dt:36ms tok/s:1840667 rem:345s step 7233 (43%) loss:3.2597 lr:0.82 dt:35ms tok/s:1846764 rem:345s step 7234 (43%) loss:3.2723 lr:0.82 dt:36ms tok/s:1841321 rem:345s step 7235 (43%) loss:3.2971 lr:0.82 dt:36ms tok/s:1844447 rem:345s step 7236 (43%) loss:3.2992 lr:0.82 dt:35ms tok/s:1847881 rem:345s step 7237 (43%) loss:3.3143 lr:0.82 dt:35ms tok/s:1850419 rem:345s step 7238 (43%) loss:3.3141 lr:0.82 dt:35ms tok/s:1854727 rem:345s step 7239 (43%) loss:3.3287 lr:0.82 dt:35ms tok/s:1846541 rem:345s step 7240 (43%) loss:3.3316 lr:0.82 dt:35ms tok/s:1855778 rem:345s step 7241 (43%) loss:3.3401 lr:0.82 dt:35ms tok/s:1849361 rem:344s step 7242 (43%) loss:3.3254 lr:0.82 dt:36ms tok/s:1845214 rem:344s step 7243 (43%) loss:3.3252 lr:0.82 dt:35ms tok/s:1846243 rem:344s step 7244 (43%) loss:3.3221 lr:0.82 dt:35ms tok/s:1853339 rem:344s step 7245 (43%) loss:3.3117 lr:0.82 dt:36ms tok/s:1844162 rem:344s step 7246 (43%) loss:3.3193 lr:0.82 dt:35ms tok/s:1853626 rem:344s step 7247 (43%) loss:3.3265 lr:0.82 dt:35ms tok/s:1847546 rem:344s step 7248 (43%) loss:3.3368 lr:0.82 dt:35ms tok/s:1851803 rem:344s step 7249 (43%) loss:3.3445 lr:0.82 dt:35ms tok/s:1848838 rem:344s step 7250 (43%) loss:3.3333 lr:0.82 dt:35ms tok/s:1857258 rem:344s step 7251 (43%) loss:3.3199 lr:0.82 dt:35ms tok/s:1857647 rem:344s step 7252 (43%) loss:3.3012 lr:0.81 dt:35ms tok/s:1849697 rem:344s step 7253 (43%) loss:3.2853 lr:0.81 dt:35ms tok/s:1852265 rem:344s step 7254 (43%) loss:3.2956 lr:0.81 dt:36ms tok/s:1842432 rem:344s step 7255 (43%) loss:3.2861 lr:0.81 dt:35ms tok/s:1848068 rem:344s step 7256 (43%) loss:3.2833 lr:0.81 dt:35ms tok/s:1847112 rem:344s step 7257 (43%) loss:3.2683 lr:0.81 dt:35ms tok/s:1846268 rem:344s step 7258 (43%) loss:3.2616 lr:0.81 dt:36ms tok/s:1842209 rem:344s step 7259 (43%) loss:3.2749 lr:0.81 dt:36ms tok/s:1845425 rem:344s step 7260 (43%) loss:3.2645 lr:0.81 dt:35ms tok/s:1849224 rem:344s step 7261 (43%) loss:3.2605 lr:0.81 dt:35ms tok/s:1855678 rem:344s step 7262 (43%) loss:3.2890 lr:0.81 dt:35ms tok/s:1850618 rem:344s step 7263 (43%) loss:3.2748 lr:0.81 dt:35ms tok/s:1853626 rem:344s step 7264 (43%) loss:3.2390 lr:0.81 dt:35ms tok/s:1852265 rem:344s step 7265 (43%) loss:3.2470 lr:0.81 dt:35ms tok/s:1852352 rem:344s step 7266 (43%) loss:3.2719 lr:0.81 dt:35ms tok/s:1852364 rem:344s step 7267 (43%) loss:3.2502 lr:0.81 dt:35ms tok/s:1847944 rem:344s step 7268 (43%) loss:3.2498 lr:0.81 dt:35ms tok/s:1846677 rem:344s step 7269 (43%) loss:3.2690 lr:0.81 dt:35ms tok/s:1852801 rem:343s step 7270 (43%) loss:3.2607 lr:0.81 dt:36ms tok/s:1843297 rem:343s step 7271 (43%) loss:3.2687 lr:0.81 dt:35ms tok/s:1847857 rem:343s step 7272 (43%) loss:3.2621 lr:0.81 dt:35ms tok/s:1849187 rem:343s step 7273 (43%) loss:3.2673 lr:0.81 dt:35ms tok/s:1851641 rem:343s step 7274 (43%) loss:3.2560 lr:0.81 dt:35ms tok/s:1846987 rem:343s step 7275 (43%) loss:3.2632 lr:0.81 dt:35ms tok/s:1849821 rem:343s step 7276 (43%) loss:3.2479 lr:0.81 dt:35ms tok/s:1855716 rem:343s step 7277 (43%) loss:3.2789 lr:0.81 dt:35ms tok/s:1849996 rem:343s step 7278 (43%) loss:3.2920 lr:0.81 dt:35ms tok/s:1850606 rem:343s step 7279 (43%) loss:3.2803 lr:0.81 dt:35ms tok/s:1852489 rem:343s step 7280 (43%) loss:3.3004 lr:0.81 dt:36ms tok/s:1844236 rem:343s step 7281 (43%) loss:3.3215 lr:0.81 dt:35ms tok/s:1853076 rem:343s step 7282 (43%) loss:3.3148 lr:0.81 dt:35ms tok/s:1847881 rem:343s step 7283 (43%) loss:3.3128 lr:0.81 dt:35ms tok/s:1846404 rem:343s step 7284 (43%) loss:3.2994 lr:0.81 dt:36ms tok/s:1845400 rem:343s step 7285 (43%) loss:3.3040 lr:0.81 dt:35ms tok/s:1855653 rem:343s step 7286 (43%) loss:3.3388 lr:0.81 dt:35ms tok/s:1853301 rem:343s step 7287 (43%) loss:3.3552 lr:0.81 dt:35ms tok/s:1847993 rem:343s step 7288 (43%) loss:3.3337 lr:0.81 dt:35ms tok/s:1852502 rem:343s step 7289 (43%) loss:3.2515 lr:0.81 dt:36ms tok/s:1843766 rem:343s step 7290 (43%) loss:3.2666 lr:0.81 dt:35ms tok/s:1856129 rem:343s step 7291 (43%) loss:3.2722 lr:0.81 dt:35ms tok/s:1852302 rem:343s step 7292 (43%) loss:3.2769 lr:0.81 dt:35ms tok/s:1854551 rem:343s step 7293 (43%) loss:3.3001 lr:0.81 dt:35ms tok/s:1847434 rem:343s step 7294 (43%) loss:3.3081 lr:0.81 dt:36ms tok/s:1846057 rem:343s step 7295 (43%) loss:3.2996 lr:0.81 dt:35ms tok/s:1852090 rem:343s step 7296 (43%) loss:3.2859 lr:0.81 dt:35ms tok/s:1852227 rem:343s step 7297 (43%) loss:3.3270 lr:0.81 dt:35ms tok/s:1852002 rem:342s step 7298 (43%) loss:3.3343 lr:0.81 dt:35ms tok/s:1851715 rem:342s step 7299 (43%) loss:3.3237 lr:0.81 dt:35ms tok/s:1848279 rem:342s step 7300 (43%) loss:3.3063 lr:0.81 dt:35ms tok/s:1850095 rem:342s + local: attn=[0.065, 0.796, 0.824] mlp=[0.416, 0.204, -0.174] + + transition: attn=[2.587, 0.952] mlp=[-0.153, 0.338] + + hierarchy: attn=[2.855, 5.939, 5.616] mlp=[1.090, -0.934, -3.326] + step 7301 (43%) loss:3.2955 lr:0.81 dt:35ms tok/s:1848453 rem:342s step 7302 (43%) loss:3.2924 lr:0.81 dt:37ms tok/s:1764832 rem:342s step 7303 (43%) loss:3.2858 lr:0.81 dt:35ms tok/s:1852427 rem:342s step 7304 (43%) loss:3.2787 lr:0.81 dt:36ms tok/s:1843433 rem:342s step 7305 (43%) loss:3.2966 lr:0.81 dt:35ms tok/s:1854401 rem:342s step 7306 (43%) loss:3.3178 lr:0.81 dt:35ms tok/s:1875173 rem:342s step 7307 (43%) loss:3.3057 lr:0.81 dt:35ms tok/s:1874457 rem:342s step 7308 (43%) loss:3.3069 lr:0.81 dt:35ms tok/s:1881849 rem:342s step 7309 (43%) loss:3.3007 lr:0.81 dt:35ms tok/s:1880266 rem:342s step 7310 (43%) loss:3.2833 lr:0.81 dt:35ms tok/s:1878005 rem:342s step 7311 (43%) loss:3.2895 lr:0.81 dt:35ms tok/s:1882055 rem:342s step 7312 (43%) loss:3.2977 lr:0.81 dt:35ms tok/s:1872874 rem:342s step 7313 (43%) loss:3.3019 lr:0.81 dt:35ms tok/s:1877197 rem:342s step 7314 (43%) loss:3.2949 lr:0.81 dt:35ms tok/s:1869943 rem:342s step 7315 (43%) loss:3.2821 lr:0.81 dt:35ms tok/s:1878467 rem:342s step 7316 (43%) loss:3.2856 lr:0.81 dt:35ms tok/s:1864704 rem:342s step 7317 (43%) loss:3.2723 lr:0.81 dt:35ms tok/s:1868151 rem:342s step 7318 (43%) loss:3.2723 lr:0.81 dt:35ms tok/s:1868545 rem:342s step 7319 (43%) loss:3.2558 lr:0.81 dt:35ms tok/s:1875903 rem:342s step 7320 (43%) loss:3.2557 lr:0.81 dt:35ms tok/s:1868418 rem:342s step 7321 (43%) loss:3.2243 lr:0.81 dt:35ms tok/s:1895252 rem:342s step 7322 (43%) loss:3.2264 lr:0.81 dt:34ms tok/s:1901506 rem:342s step 7323 (43%) loss:3.2374 lr:0.81 dt:35ms tok/s:1894495 rem:342s step 7324 (43%) loss:3.2433 lr:0.81 dt:35ms tok/s:1867441 rem:342s step 7325 (43%) loss:3.2324 lr:0.81 dt:35ms tok/s:1881823 rem:342s step 7326 (43%) loss:3.2261 lr:0.81 dt:35ms tok/s:1883254 rem:341s step 7327 (43%) loss:3.2305 lr:0.81 dt:35ms tok/s:1878595 rem:341s step 7328 (43%) loss:3.2369 lr:0.81 dt:35ms tok/s:1892616 rem:341s step 7329 (43%) loss:3.2349 lr:0.81 dt:35ms tok/s:1881372 rem:341s step 7330 (43%) loss:3.2448 lr:0.81 dt:35ms tok/s:1878287 rem:341s step 7331 (43%) loss:3.2932 lr:0.81 dt:35ms tok/s:1876095 rem:341s step 7332 (43%) loss:3.2985 lr:0.81 dt:35ms tok/s:1879340 rem:341s step 7333 (43%) loss:3.3062 lr:0.81 dt:35ms tok/s:1873550 rem:341s step 7334 (43%) loss:3.3168 lr:0.81 dt:35ms tok/s:1875762 rem:341s step 7335 (43%) loss:3.3065 lr:0.81 dt:35ms tok/s:1871993 rem:341s step 7336 (43%) loss:3.2978 lr:0.81 dt:35ms tok/s:1857220 rem:341s step 7337 (43%) loss:3.2878 lr:0.81 dt:35ms tok/s:1873065 rem:341s step 7338 (43%) loss:3.2672 lr:0.81 dt:35ms tok/s:1868278 rem:341s step 7339 (43%) loss:3.2679 lr:0.81 dt:35ms tok/s:1861534 rem:341s step 7340 (43%) loss:3.2726 lr:0.81 dt:35ms tok/s:1867580 rem:341s step 7341 (43%) loss:3.2798 lr:0.81 dt:36ms tok/s:1836633 rem:341s step 7342 (43%) loss:3.2892 lr:0.81 dt:36ms tok/s:1830177 rem:341s step 7343 (43%) loss:3.2792 lr:0.81 dt:36ms tok/s:1822774 rem:341s step 7344 (43%) loss:3.2863 lr:0.81 dt:36ms tok/s:1819540 rem:341s step 7345 (43%) loss:3.2783 lr:0.81 dt:36ms tok/s:1823862 rem:341s step 7346 (43%) loss:3.2526 lr:0.81 dt:36ms tok/s:1825691 rem:341s step 7347 (43%) loss:3.2401 lr:0.81 dt:36ms tok/s:1831787 rem:341s step 7348 (43%) loss:3.2443 lr:0.81 dt:36ms tok/s:1825134 rem:341s step 7349 (43%) loss:3.2551 lr:0.81 dt:36ms tok/s:1820552 rem:341s step 7350 (43%) loss:3.2576 lr:0.81 dt:36ms tok/s:1820999 rem:341s step 7351 (43%) loss:3.2633 lr:0.81 dt:36ms tok/s:1796705 rem:341s step 7352 (43%) loss:3.2534 lr:0.81 dt:36ms tok/s:1819046 rem:341s step 7353 (43%) loss:3.2648 lr:0.81 dt:36ms tok/s:1808336 rem:341s step 7354 (43%) loss:3.2642 lr:0.81 dt:36ms tok/s:1810921 rem:340s step 7355 (43%) loss:3.2380 lr:0.81 dt:36ms tok/s:1809598 rem:340s step 7356 (43%) loss:3.2134 lr:0.81 dt:36ms tok/s:1821348 rem:340s step 7357 (43%) loss:3.1945 lr:0.81 dt:36ms tok/s:1817603 rem:340s step 7358 (43%) loss:3.2232 lr:0.81 dt:36ms tok/s:1814615 rem:340s step 7359 (43%) loss:3.2336 lr:0.81 dt:36ms tok/s:1804276 rem:340s step 7360 (43%) loss:3.2376 lr:0.81 dt:36ms tok/s:1812390 rem:340s step 7361 (43%) loss:3.2494 lr:0.81 dt:36ms tok/s:1810945 rem:340s step 7362 (43%) loss:3.2502 lr:0.80 dt:36ms tok/s:1817976 rem:340s step 7363 (43%) loss:3.2448 lr:0.80 dt:36ms tok/s:1803815 rem:340s step 7364 (43%) loss:3.2317 lr:0.80 dt:36ms tok/s:1805829 rem:340s step 7365 (43%) loss:3.2247 lr:0.80 dt:36ms tok/s:1808574 rem:340s step 7366 (43%) loss:3.2581 lr:0.80 dt:36ms tok/s:1806969 rem:340s step 7367 (43%) loss:3.2749 lr:0.80 dt:36ms tok/s:1812163 rem:340s step 7368 (43%) loss:3.2725 lr:0.80 dt:36ms tok/s:1813873 rem:340s step 7369 (43%) loss:3.2927 lr:0.80 dt:36ms tok/s:1811912 rem:340s step 7370 (43%) loss:3.2934 lr:0.80 dt:36ms tok/s:1814639 rem:340s step 7371 (43%) loss:3.3268 lr:0.80 dt:36ms tok/s:1821433 rem:340s step 7372 (43%) loss:3.3590 lr:0.80 dt:36ms tok/s:1814076 rem:340s step 7373 (43%) loss:3.3874 lr:0.80 dt:36ms tok/s:1803247 rem:340s step 7374 (43%) loss:3.3827 lr:0.80 dt:36ms tok/s:1800990 rem:340s step 7375 (43%) loss:3.3782 lr:0.80 dt:36ms tok/s:1816882 rem:340s step 7376 (43%) loss:3.3885 lr:0.80 dt:36ms tok/s:1818493 rem:340s step 7377 (43%) loss:3.3624 lr:0.80 dt:36ms tok/s:1819636 rem:340s step 7378 (43%) loss:3.3582 lr:0.80 dt:36ms tok/s:1811148 rem:340s step 7379 (43%) loss:3.3473 lr:0.80 dt:36ms tok/s:1815178 rem:340s step 7380 (43%) loss:3.3179 lr:0.80 dt:36ms tok/s:1802714 rem:340s step 7381 (43%) loss:3.2977 lr:0.80 dt:36ms tok/s:1812414 rem:339s step 7382 (43%) loss:3.3133 lr:0.80 dt:36ms tok/s:1807373 rem:339s step 7383 (43%) loss:3.3051 lr:0.80 dt:36ms tok/s:1820058 rem:339s step 7384 (43%) loss:3.2976 lr:0.80 dt:36ms tok/s:1810611 rem:339s step 7385 (43%) loss:3.3168 lr:0.80 dt:36ms tok/s:1820914 rem:339s step 7386 (43%) loss:3.3223 lr:0.80 dt:36ms tok/s:1816930 rem:339s step 7387 (43%) loss:3.3283 lr:0.80 dt:36ms tok/s:1823294 rem:339s step 7388 (43%) loss:3.3186 lr:0.80 dt:36ms tok/s:1813765 rem:339s step 7389 (43%) loss:3.3297 lr:0.80 dt:36ms tok/s:1821288 rem:339s step 7390 (43%) loss:3.3189 lr:0.80 dt:36ms tok/s:1819925 rem:339s step 7391 (43%) loss:3.3019 lr:0.80 dt:37ms tok/s:1762490 rem:339s step 7392 (43%) loss:3.3042 lr:0.80 dt:37ms tok/s:1778040 rem:339s step 7393 (43%) loss:3.2946 lr:0.80 dt:36ms tok/s:1824322 rem:339s step 7394 (43%) loss:3.3047 lr:0.80 dt:36ms tok/s:1817074 rem:339s step 7395 (43%) loss:3.2977 lr:0.80 dt:36ms tok/s:1819624 rem:339s step 7396 (44%) loss:3.2890 lr:0.80 dt:37ms tok/s:1772571 rem:339s step 7397 (44%) loss:3.3079 lr:0.80 dt:36ms tok/s:1822846 rem:339s step 7398 (44%) loss:3.3148 lr:0.80 dt:36ms tok/s:1818938 rem:339s step 7399 (44%) loss:3.3192 lr:0.80 dt:36ms tok/s:1819733 rem:339s step 7400 (44%) loss:3.3021 lr:0.80 dt:36ms tok/s:1818974 rem:339s + local: attn=[0.068, 0.811, 0.786] mlp=[0.429, 0.202, -0.175] + + transition: attn=[2.711, 0.867] mlp=[-0.142, 0.356] + + hierarchy: attn=[2.930, 5.939, 5.616] mlp=[1.110, -0.912, -3.322] + step 7401 (44%) loss:3.2835 lr:0.80 dt:36ms tok/s:1823185 rem:339s step 7402 (44%) loss:3.2942 lr:0.80 dt:36ms tok/s:1816690 rem:339s step 7403 (44%) loss:3.2815 lr:0.80 dt:36ms tok/s:1821686 rem:339s step 7404 (44%) loss:3.2780 lr:0.80 dt:37ms tok/s:1787243 rem:339s step 7405 (44%) loss:3.2843 lr:0.80 dt:36ms tok/s:1819480 rem:339s step 7406 (44%) loss:3.2845 lr:0.80 dt:36ms tok/s:1823366 rem:339s step 7407 (44%) loss:3.3076 lr:0.80 dt:36ms tok/s:1829093 rem:339s step 7408 (44%) loss:3.3108 lr:0.80 dt:36ms tok/s:1825061 rem:339s step 7409 (44%) loss:3.2885 lr:0.80 dt:36ms tok/s:1818938 rem:338s step 7410 (44%) loss:3.2815 lr:0.80 dt:36ms tok/s:1819781 rem:338s step 7411 (44%) loss:3.2765 lr:0.80 dt:36ms tok/s:1820456 rem:338s step 7412 (44%) loss:3.2888 lr:0.80 dt:36ms tok/s:1814699 rem:338s step 7413 (44%) loss:3.3221 lr:0.80 dt:36ms tok/s:1813813 rem:338s step 7414 (44%) loss:3.3157 lr:0.80 dt:36ms tok/s:1816402 rem:338s step 7415 (44%) loss:3.3185 lr:0.80 dt:36ms tok/s:1810015 rem:338s step 7416 (44%) loss:3.3239 lr:0.80 dt:36ms tok/s:1811029 rem:338s step 7417 (44%) loss:3.3305 lr:0.80 dt:36ms tok/s:1817290 rem:338s step 7418 (44%) loss:3.3151 lr:0.80 dt:36ms tok/s:1811601 rem:338s step 7419 (44%) loss:3.2969 lr:0.80 dt:36ms tok/s:1819082 rem:338s step 7420 (44%) loss:3.2670 lr:0.80 dt:36ms tok/s:1805960 rem:338s step 7421 (44%) loss:3.2796 lr:0.80 dt:36ms tok/s:1807979 rem:338s step 7422 (44%) loss:3.2761 lr:0.80 dt:36ms tok/s:1824564 rem:338s step 7423 (44%) loss:3.2999 lr:0.80 dt:36ms tok/s:1833999 rem:338s step 7424 (44%) loss:3.2887 lr:0.80 dt:36ms tok/s:1836229 rem:338s step 7425 (44%) loss:3.2780 lr:0.80 dt:36ms tok/s:1841148 rem:338s step 7426 (44%) loss:3.2869 lr:0.80 dt:36ms tok/s:1832568 rem:338s step 7427 (44%) loss:3.2862 lr:0.80 dt:36ms tok/s:1839177 rem:338s step 7428 (44%) loss:3.2692 lr:0.80 dt:36ms tok/s:1836486 rem:338s step 7429 (44%) loss:3.2533 lr:0.80 dt:36ms tok/s:1841802 rem:338s step 7430 (44%) loss:3.2523 lr:0.80 dt:36ms tok/s:1837739 rem:338s step 7431 (44%) loss:3.2338 lr:0.80 dt:36ms tok/s:1833742 rem:338s step 7432 (44%) loss:3.2189 lr:0.80 dt:36ms tok/s:1839239 rem:338s step 7433 (44%) loss:3.2480 lr:0.80 dt:36ms tok/s:1842778 rem:338s step 7434 (44%) loss:3.2501 lr:0.80 dt:36ms tok/s:1840433 rem:338s step 7435 (44%) loss:3.2350 lr:0.80 dt:36ms tok/s:1841284 rem:338s step 7436 (44%) loss:3.2127 lr:0.80 dt:36ms tok/s:1836253 rem:338s step 7437 (44%) loss:3.1806 lr:0.80 dt:36ms tok/s:1834868 rem:337s step 7438 (44%) loss:3.1504 lr:0.80 dt:36ms tok/s:1837468 rem:337s step 7439 (44%) loss:3.1251 lr:0.80 dt:36ms tok/s:1831030 rem:337s step 7440 (44%) loss:3.0901 lr:0.80 dt:36ms tok/s:1831372 rem:337s step 7441 (44%) loss:3.0657 lr:0.80 dt:36ms tok/s:1836192 rem:337s step 7442 (44%) loss:3.0276 lr:0.80 dt:36ms tok/s:1826747 rem:337s step 7443 (44%) loss:2.9916 lr:0.80 dt:36ms tok/s:1833693 rem:337s step 7444 (44%) loss:3.0364 lr:0.80 dt:36ms tok/s:1833974 rem:337s step 7445 (44%) loss:3.0902 lr:0.80 dt:36ms tok/s:1843828 rem:337s step 7446 (44%) loss:3.1228 lr:0.80 dt:36ms tok/s:1837603 rem:337s step 7447 (44%) loss:3.1510 lr:0.80 dt:36ms tok/s:1841654 rem:337s step 7448 (44%) loss:3.1775 lr:0.80 dt:36ms tok/s:1839583 rem:337s step 7449 (44%) loss:3.2059 lr:0.80 dt:36ms tok/s:1838587 rem:337s step 7450 (44%) loss:3.2251 lr:0.80 dt:36ms tok/s:1842271 rem:337s step 7451 (44%) loss:3.2437 lr:0.80 dt:36ms tok/s:1834586 rem:337s step 7452 (44%) loss:3.2414 lr:0.80 dt:36ms tok/s:1843395 rem:337s step 7453 (44%) loss:3.2323 lr:0.80 dt:36ms tok/s:1836903 rem:337s step 7454 (44%) loss:3.2482 lr:0.80 dt:36ms tok/s:1834354 rem:337s step 7455 (44%) loss:3.2915 lr:0.80 dt:36ms tok/s:1835591 rem:337s step 7456 (44%) loss:3.3302 lr:0.80 dt:36ms tok/s:1837530 rem:337s step 7457 (44%) loss:3.3266 lr:0.80 dt:36ms tok/s:1838144 rem:337s step 7458 (44%) loss:3.3257 lr:0.80 dt:36ms tok/s:1835554 rem:337s step 7459 (44%) loss:3.3160 lr:0.80 dt:36ms tok/s:1817014 rem:337s step 7460 (44%) loss:3.3014 lr:0.80 dt:42ms tok/s:1570336 rem:337s step 7461 (44%) loss:3.3030 lr:0.80 dt:36ms tok/s:1805485 rem:337s step 7462 (44%) loss:3.2953 lr:0.80 dt:36ms tok/s:1818649 rem:337s step 7463 (44%) loss:3.3147 lr:0.80 dt:36ms tok/s:1837210 rem:337s step 7464 (44%) loss:3.2960 lr:0.80 dt:36ms tok/s:1828497 rem:337s step 7465 (44%) loss:3.3127 lr:0.80 dt:40ms tok/s:1645552 rem:336s step 7466 (44%) loss:3.3169 lr:0.80 dt:36ms tok/s:1829288 rem:336s step 7467 (44%) loss:3.3420 lr:0.80 dt:35ms tok/s:1861219 rem:336s step 7468 (44%) loss:3.3364 lr:0.79 dt:36ms tok/s:1839386 rem:336s step 7469 (44%) loss:3.3304 lr:0.79 dt:36ms tok/s:1819721 rem:336s step 7470 (44%) loss:3.3450 lr:0.79 dt:35ms tok/s:1848975 rem:336s step 7471 (44%) loss:3.3321 lr:0.79 dt:35ms tok/s:1866477 rem:336s step 7472 (44%) loss:3.3273 lr:0.79 dt:36ms tok/s:1831421 rem:336s step 7473 (44%) loss:3.3221 lr:0.79 dt:36ms tok/s:1842629 rem:336s step 7474 (44%) loss:3.3182 lr:0.79 dt:36ms tok/s:1841888 rem:336s step 7475 (44%) loss:3.3315 lr:0.79 dt:36ms tok/s:1839214 rem:336s step 7476 (44%) loss:3.3289 lr:0.79 dt:36ms tok/s:1826541 rem:336s step 7477 (44%) loss:3.3161 lr:0.79 dt:37ms tok/s:1781821 rem:336s step 7478 (44%) loss:3.2989 lr:0.79 dt:36ms tok/s:1809765 rem:336s step 7479 (44%) loss:3.3205 lr:0.79 dt:36ms tok/s:1807504 rem:336s step 7480 (44%) loss:3.3536 lr:0.79 dt:36ms tok/s:1818192 rem:336s step 7481 (44%) loss:3.4060 lr:0.79 dt:36ms tok/s:1818252 rem:336s step 7482 (44%) loss:3.3964 lr:0.79 dt:36ms tok/s:1826189 rem:336s step 7483 (44%) loss:3.4074 lr:0.79 dt:36ms tok/s:1833717 rem:336s step 7484 (44%) loss:3.4070 lr:0.79 dt:35ms tok/s:1847534 rem:336s step 7485 (44%) loss:3.3897 lr:0.79 dt:36ms tok/s:1801356 rem:336s step 7486 (44%) loss:3.3798 lr:0.79 dt:36ms tok/s:1831238 rem:336s step 7487 (44%) loss:3.3561 lr:0.79 dt:36ms tok/s:1816354 rem:336s step 7488 (44%) loss:3.3619 lr:0.79 dt:36ms tok/s:1809657 rem:336s step 7489 (44%) loss:3.3632 lr:0.79 dt:35ms tok/s:1851528 rem:336s step 7490 (44%) loss:3.3613 lr:0.79 dt:35ms tok/s:1858979 rem:336s step 7491 (44%) loss:3.3648 lr:0.79 dt:35ms tok/s:1860363 rem:336s step 7492 (44%) loss:3.3727 lr:0.79 dt:35ms tok/s:1857321 rem:336s step 7493 (44%) loss:3.3856 lr:0.79 dt:35ms tok/s:1877363 rem:335s step 7494 (44%) loss:3.3713 lr:0.79 dt:35ms tok/s:1868951 rem:335s step 7495 (44%) loss:3.3633 lr:0.79 dt:35ms tok/s:1847757 rem:335s step 7496 (44%) loss:3.3382 lr:0.79 dt:35ms tok/s:1896285 rem:335s step 7497 (44%) loss:3.3441 lr:0.79 dt:35ms tok/s:1893946 rem:335s step 7498 (44%) loss:3.3546 lr:0.79 dt:35ms tok/s:1888832 rem:335s step 7499 (44%) loss:3.3520 lr:0.79 dt:35ms tok/s:1878929 rem:335s step 7500 (44%) loss:3.3379 lr:0.79 dt:35ms tok/s:1884494 rem:335s + local: attn=[0.067, 0.831, 0.802] mlp=[0.445, 0.211, -0.186] + + transition: attn=[2.685, 0.927] mlp=[-0.143, 0.343] + + hierarchy: attn=[2.951, 5.939, 5.616] mlp=[1.119, -0.962, -3.337] + step 7501 (44%) loss:3.3239 lr:0.79 dt:35ms tok/s:1898892 rem:335s step 7502 (44%) loss:3.3314 lr:0.79 dt:35ms tok/s:1899010 rem:335s step 7503 (44%) loss:3.3318 lr:0.79 dt:35ms tok/s:1899535 rem:335s step 7504 (44%) loss:3.3530 lr:0.79 dt:35ms tok/s:1897280 rem:335s step 7505 (44%) loss:3.3636 lr:0.79 dt:34ms tok/s:1902256 rem:335s step 7506 (44%) loss:3.3665 lr:0.79 dt:34ms tok/s:1903086 rem:335s step 7507 (44%) loss:3.3579 lr:0.79 dt:34ms tok/s:1900047 rem:335s step 7508 (44%) loss:3.3581 lr:0.79 dt:35ms tok/s:1875084 rem:335s step 7509 (44%) loss:3.3515 lr:0.79 dt:35ms tok/s:1881475 rem:335s step 7510 (44%) loss:3.3474 lr:0.79 dt:35ms tok/s:1889377 rem:335s step 7511 (44%) loss:3.3541 lr:0.79 dt:35ms tok/s:1874547 rem:335s step 7512 (44%) loss:3.3444 lr:0.79 dt:35ms tok/s:1857735 rem:335s step 7513 (44%) loss:3.3427 lr:0.79 dt:35ms tok/s:1857082 rem:335s step 7514 (44%) loss:3.3464 lr:0.79 dt:35ms tok/s:1861749 rem:335s step 7515 (44%) loss:3.3453 lr:0.79 dt:35ms tok/s:1855866 rem:335s step 7516 (44%) loss:3.3592 lr:0.79 dt:35ms tok/s:1885230 rem:335s step 7517 (44%) loss:3.3672 lr:0.79 dt:35ms tok/s:1872210 rem:335s step 7518 (44%) loss:3.3578 lr:0.79 dt:35ms tok/s:1871165 rem:335s step 7519 (44%) loss:3.3419 lr:0.79 dt:35ms tok/s:1875519 rem:335s step 7520 (44%) loss:3.3455 lr:0.79 dt:35ms tok/s:1873525 rem:335s step 7521 (44%) loss:3.3565 lr:0.79 dt:35ms tok/s:1873486 rem:334s step 7522 (44%) loss:3.3385 lr:0.79 dt:35ms tok/s:1867479 rem:334s step 7523 (44%) loss:3.3464 lr:0.79 dt:35ms tok/s:1880922 rem:334s step 7524 (44%) loss:3.3319 lr:0.79 dt:35ms tok/s:1879777 rem:334s step 7525 (44%) loss:3.3277 lr:0.79 dt:35ms tok/s:1875775 rem:334s step 7526 (44%) loss:3.3253 lr:0.79 dt:35ms tok/s:1871318 rem:334s step 7527 (44%) loss:3.3329 lr:0.79 dt:35ms tok/s:1876620 rem:334s step 7528 (44%) loss:3.3197 lr:0.79 dt:35ms tok/s:1878223 rem:334s step 7529 (44%) loss:3.3109 lr:0.79 dt:35ms tok/s:1867402 rem:334s step 7530 (44%) loss:3.3034 lr:0.79 dt:35ms tok/s:1848130 rem:334s step 7531 (44%) loss:3.2958 lr:0.79 dt:35ms tok/s:1860375 rem:334s step 7532 (44%) loss:3.3025 lr:0.79 dt:35ms tok/s:1850145 rem:334s step 7533 (44%) loss:3.2957 lr:0.79 dt:35ms tok/s:1853489 rem:334s step 7534 (44%) loss:3.3031 lr:0.79 dt:35ms tok/s:1854226 rem:334s step 7535 (44%) loss:3.3149 lr:0.79 dt:36ms tok/s:1831945 rem:334s step 7536 (44%) loss:3.3134 lr:0.79 dt:35ms tok/s:1847807 rem:334s step 7537 (44%) loss:3.2970 lr:0.79 dt:36ms tok/s:1803365 rem:334s step 7538 (44%) loss:3.2838 lr:0.79 dt:35ms tok/s:1861068 rem:334s step 7539 (44%) loss:3.2866 lr:0.79 dt:35ms tok/s:1852177 rem:334s step 7540 (44%) loss:3.2897 lr:0.79 dt:35ms tok/s:1859985 rem:334s step 7541 (44%) loss:3.2673 lr:0.79 dt:35ms tok/s:1855954 rem:334s step 7542 (44%) loss:3.2672 lr:0.79 dt:35ms tok/s:1849958 rem:334s step 7543 (44%) loss:3.2682 lr:0.79 dt:36ms tok/s:1836069 rem:334s step 7544 (44%) loss:3.2555 lr:0.79 dt:36ms tok/s:1808693 rem:334s step 7545 (44%) loss:3.2565 lr:0.79 dt:36ms tok/s:1815682 rem:334s step 7546 (44%) loss:3.2602 lr:0.79 dt:36ms tok/s:1820142 rem:334s step 7547 (44%) loss:3.2645 lr:0.79 dt:36ms tok/s:1809574 rem:334s step 7548 (44%) loss:3.2540 lr:0.79 dt:36ms tok/s:1819311 rem:334s step 7549 (44%) loss:3.2551 lr:0.79 dt:36ms tok/s:1819251 rem:334s step 7550 (44%) loss:3.2657 lr:0.79 dt:36ms tok/s:1821180 rem:333s step 7551 (44%) loss:3.2566 lr:0.79 dt:36ms tok/s:1822520 rem:333s step 7552 (44%) loss:3.2587 lr:0.79 dt:36ms tok/s:1829020 rem:333s step 7553 (44%) loss:3.2676 lr:0.79 dt:36ms tok/s:1814160 rem:333s step 7554 (44%) loss:3.2638 lr:0.79 dt:36ms tok/s:1816678 rem:333s step 7555 (44%) loss:3.2665 lr:0.79 dt:36ms tok/s:1821023 rem:333s step 7556 (44%) loss:3.2826 lr:0.79 dt:36ms tok/s:1815946 rem:333s step 7557 (44%) loss:3.2880 lr:0.79 dt:36ms tok/s:1818288 rem:333s step 7558 (44%) loss:3.2799 lr:0.79 dt:36ms tok/s:1812258 rem:333s step 7559 (44%) loss:3.2691 lr:0.79 dt:36ms tok/s:1807694 rem:333s step 7560 (44%) loss:3.2783 lr:0.79 dt:36ms tok/s:1822713 rem:333s step 7561 (44%) loss:3.2784 lr:0.79 dt:36ms tok/s:1822834 rem:333s step 7562 (44%) loss:3.2828 lr:0.79 dt:36ms tok/s:1825473 rem:333s step 7563 (44%) loss:3.2873 lr:0.79 dt:36ms tok/s:1823511 rem:333s step 7564 (45%) loss:3.2983 lr:0.79 dt:36ms tok/s:1822774 rem:333s step 7565 (45%) loss:3.3035 lr:0.79 dt:36ms tok/s:1829227 rem:333s step 7566 (45%) loss:3.2907 lr:0.79 dt:36ms tok/s:1818180 rem:333s step 7567 (45%) loss:3.2812 lr:0.79 dt:36ms tok/s:1815502 rem:333s step 7568 (45%) loss:3.2945 lr:0.79 dt:37ms tok/s:1783115 rem:333s step 7569 (45%) loss:3.2912 lr:0.79 dt:36ms tok/s:1814807 rem:333s step 7570 (45%) loss:3.3004 lr:0.79 dt:36ms tok/s:1817771 rem:333s step 7571 (45%) loss:3.2939 lr:0.79 dt:36ms tok/s:1821493 rem:333s step 7572 (45%) loss:3.2921 lr:0.79 dt:36ms tok/s:1821505 rem:333s step 7573 (45%) loss:3.2891 lr:0.79 dt:36ms tok/s:1825606 rem:333s step 7574 (45%) loss:3.2861 lr:0.78 dt:36ms tok/s:1830872 rem:333s step 7575 (45%) loss:3.2809 lr:0.78 dt:36ms tok/s:1824044 rem:333s step 7576 (45%) loss:3.2878 lr:0.78 dt:36ms tok/s:1819335 rem:333s step 7577 (45%) loss:3.2831 lr:0.78 dt:36ms tok/s:1818541 rem:332s step 7578 (45%) loss:3.2709 lr:0.78 dt:36ms tok/s:1814807 rem:332s step 7579 (45%) loss:3.2709 lr:0.78 dt:36ms tok/s:1795813 rem:332s step 7580 (45%) loss:3.2731 lr:0.78 dt:36ms tok/s:1810110 rem:332s step 7581 (45%) loss:3.2714 lr:0.78 dt:36ms tok/s:1815994 rem:332s step 7582 (45%) loss:3.2823 lr:0.78 dt:36ms tok/s:1822266 rem:332s step 7583 (45%) loss:3.2917 lr:0.78 dt:36ms tok/s:1824528 rem:332s step 7584 (45%) loss:3.2985 lr:0.78 dt:36ms tok/s:1822121 rem:332s step 7585 (45%) loss:3.3087 lr:0.78 dt:36ms tok/s:1821698 rem:332s step 7586 (45%) loss:3.2957 lr:0.78 dt:36ms tok/s:1811996 rem:332s step 7587 (45%) loss:3.2687 lr:0.78 dt:36ms tok/s:1822387 rem:332s step 7588 (45%) loss:3.2607 lr:0.78 dt:36ms tok/s:1821916 rem:332s step 7589 (45%) loss:3.2646 lr:0.78 dt:36ms tok/s:1814184 rem:332s step 7590 (45%) loss:3.2849 lr:0.78 dt:36ms tok/s:1809371 rem:332s step 7591 (45%) loss:3.2927 lr:0.78 dt:36ms tok/s:1820167 rem:332s step 7592 (45%) loss:3.3045 lr:0.78 dt:36ms tok/s:1826395 rem:332s step 7593 (45%) loss:3.3155 lr:0.78 dt:36ms tok/s:1821421 rem:332s step 7594 (45%) loss:3.3282 lr:0.78 dt:36ms tok/s:1821409 rem:332s step 7595 (45%) loss:3.3299 lr:0.78 dt:37ms tok/s:1788138 rem:332s step 7596 (45%) loss:3.3348 lr:0.78 dt:36ms tok/s:1826346 rem:332s step 7597 (45%) loss:3.3444 lr:0.78 dt:36ms tok/s:1818818 rem:332s step 7598 (45%) loss:3.3427 lr:0.78 dt:36ms tok/s:1824722 rem:332s step 7599 (45%) loss:3.3258 lr:0.78 dt:36ms tok/s:1824140 rem:332s step 7600 (45%) loss:3.3014 lr:0.78 dt:36ms tok/s:1815634 rem:332s + local: attn=[0.071, 0.812, 0.802] mlp=[0.434, 0.194, -0.178] + + transition: attn=[2.702, 0.913] mlp=[-0.128, 0.365] + + hierarchy: attn=[2.962, 5.939, 5.616] mlp=[1.151, -0.900, -3.291] + step 7601 (45%) loss:3.2899 lr:0.78 dt:36ms tok/s:1817952 rem:332s step 7602 (45%) loss:3.2790 lr:0.78 dt:36ms tok/s:1820516 rem:332s step 7603 (45%) loss:3.2868 lr:0.78 dt:36ms tok/s:1823971 rem:332s step 7604 (45%) loss:3.2963 lr:0.78 dt:36ms tok/s:1825740 rem:332s step 7605 (45%) loss:3.3081 lr:0.78 dt:36ms tok/s:1820516 rem:331s step 7606 (45%) loss:3.3024 lr:0.78 dt:36ms tok/s:1818818 rem:331s step 7607 (45%) loss:3.3073 lr:0.78 dt:36ms tok/s:1816906 rem:331s step 7608 (45%) loss:3.3130 lr:0.78 dt:36ms tok/s:1825509 rem:331s step 7609 (45%) loss:3.3099 lr:0.78 dt:36ms tok/s:1821119 rem:331s step 7610 (45%) loss:3.3249 lr:0.78 dt:36ms tok/s:1821686 rem:331s step 7611 (45%) loss:3.3153 lr:0.78 dt:36ms tok/s:1822616 rem:331s step 7612 (45%) loss:3.3099 lr:0.78 dt:36ms tok/s:1824903 rem:331s step 7613 (45%) loss:3.3154 lr:0.78 dt:36ms tok/s:1814388 rem:331s step 7614 (45%) loss:3.3159 lr:0.78 dt:36ms tok/s:1817447 rem:331s step 7615 (45%) loss:3.3154 lr:0.78 dt:36ms tok/s:1818312 rem:331s step 7616 (45%) loss:3.3182 lr:0.78 dt:36ms tok/s:1824831 rem:331s step 7617 (45%) loss:3.3247 lr:0.78 dt:37ms tok/s:1789511 rem:331s step 7618 (45%) loss:3.3314 lr:0.78 dt:36ms tok/s:1816894 rem:331s step 7619 (45%) loss:3.3352 lr:0.78 dt:36ms tok/s:1823124 rem:331s step 7620 (45%) loss:3.3350 lr:0.78 dt:36ms tok/s:1822955 rem:331s step 7621 (45%) loss:3.3260 lr:0.78 dt:36ms tok/s:1820420 rem:331s step 7622 (45%) loss:3.3317 lr:0.78 dt:36ms tok/s:1808003 rem:331s step 7623 (45%) loss:3.3324 lr:0.78 dt:36ms tok/s:1828582 rem:331s step 7624 (45%) loss:3.3472 lr:0.78 dt:36ms tok/s:1816930 rem:331s step 7625 (45%) loss:3.3478 lr:0.78 dt:36ms tok/s:1818288 rem:331s step 7626 (45%) loss:3.3494 lr:0.78 dt:36ms tok/s:1813442 rem:331s step 7627 (45%) loss:3.3607 lr:0.78 dt:36ms tok/s:1799363 rem:331s step 7628 (45%) loss:3.3424 lr:0.78 dt:36ms tok/s:1823753 rem:331s step 7629 (45%) loss:3.3253 lr:0.78 dt:36ms tok/s:1822097 rem:331s step 7630 (45%) loss:3.3167 lr:0.78 dt:36ms tok/s:1818926 rem:331s step 7631 (45%) loss:3.3139 lr:0.78 dt:36ms tok/s:1826067 rem:331s step 7632 (45%) loss:3.3202 lr:0.78 dt:36ms tok/s:1825679 rem:331s step 7633 (45%) loss:3.3019 lr:0.78 dt:36ms tok/s:1810218 rem:330s step 7634 (45%) loss:3.2882 lr:0.78 dt:36ms tok/s:1818769 rem:330s step 7635 (45%) loss:3.2887 lr:0.78 dt:36ms tok/s:1820564 rem:330s step 7636 (45%) loss:3.2955 lr:0.78 dt:36ms tok/s:1821831 rem:330s step 7637 (45%) loss:3.2952 lr:0.78 dt:36ms tok/s:1821493 rem:330s step 7638 (45%) loss:3.2896 lr:0.78 dt:36ms tok/s:1815874 rem:330s step 7639 (45%) loss:3.2655 lr:0.78 dt:46ms tok/s:1418578 rem:330s step 7640 (45%) loss:3.2751 lr:0.78 dt:36ms tok/s:1828971 rem:330s step 7641 (45%) loss:3.2533 lr:0.78 dt:34ms tok/s:1918400 rem:330s step 7642 (45%) loss:3.2500 lr:0.78 dt:34ms tok/s:1921148 rem:330s step 7643 (45%) loss:3.2661 lr:0.78 dt:34ms tok/s:1925158 rem:330s step 7644 (45%) loss:3.2823 lr:0.78 dt:34ms tok/s:1905896 rem:330s step 7645 (45%) loss:3.2757 lr:0.78 dt:34ms tok/s:1906941 rem:330s step 7646 (45%) loss:3.2601 lr:0.78 dt:34ms tok/s:1911729 rem:330s step 7647 (45%) loss:3.2712 lr:0.78 dt:35ms tok/s:1860766 rem:330s step 7648 (45%) loss:3.2874 lr:0.78 dt:35ms tok/s:1864641 rem:330s step 7649 (45%) loss:3.2916 lr:0.78 dt:35ms tok/s:1867161 rem:330s step 7650 (45%) loss:3.2862 lr:0.78 dt:35ms tok/s:1861484 rem:330s step 7651 (45%) loss:3.2801 lr:0.78 dt:35ms tok/s:1854276 rem:330s step 7652 (45%) loss:3.2912 lr:0.78 dt:35ms tok/s:1866959 rem:330s step 7653 (45%) loss:3.2930 lr:0.78 dt:35ms tok/s:1869791 rem:330s step 7654 (45%) loss:3.3096 lr:0.78 dt:35ms tok/s:1848838 rem:330s step 7655 (45%) loss:3.3051 lr:0.78 dt:37ms tok/s:1782583 rem:330s step 7656 (45%) loss:3.3175 lr:0.78 dt:36ms tok/s:1818192 rem:330s step 7657 (45%) loss:3.3150 lr:0.78 dt:36ms tok/s:1818048 rem:330s step 7658 (45%) loss:3.3080 lr:0.78 dt:36ms tok/s:1820432 rem:330s step 7659 (45%) loss:3.3216 lr:0.78 dt:36ms tok/s:1823136 rem:330s step 7660 (45%) loss:3.3303 lr:0.78 dt:36ms tok/s:1820359 rem:330s step 7661 (45%) loss:3.3301 lr:0.78 dt:36ms tok/s:1829580 rem:329s step 7662 (45%) loss:3.3367 lr:0.78 dt:36ms tok/s:1820902 rem:329s step 7663 (45%) loss:3.3322 lr:0.78 dt:36ms tok/s:1819287 rem:329s step 7664 (45%) loss:3.3213 lr:0.78 dt:36ms tok/s:1820745 rem:329s step 7665 (45%) loss:3.3072 lr:0.78 dt:36ms tok/s:1801368 rem:329s step 7666 (45%) loss:3.3167 lr:0.78 dt:37ms tok/s:1756678 rem:329s step 7667 (45%) loss:3.3355 lr:0.78 dt:36ms tok/s:1799988 rem:329s step 7668 (45%) loss:3.3381 lr:0.78 dt:36ms tok/s:1804253 rem:329s step 7669 (45%) loss:3.3434 lr:0.78 dt:36ms tok/s:1815394 rem:329s step 7670 (45%) loss:3.3520 lr:0.78 dt:36ms tok/s:1802750 rem:329s step 7671 (45%) loss:3.3510 lr:0.78 dt:36ms tok/s:1822496 rem:329s step 7672 (45%) loss:3.3492 lr:0.78 dt:36ms tok/s:1818661 rem:329s step 7673 (45%) loss:3.3617 lr:0.78 dt:36ms tok/s:1814879 rem:329s step 7674 (45%) loss:3.3689 lr:0.78 dt:36ms tok/s:1814124 rem:329s step 7675 (45%) loss:3.3685 lr:0.78 dt:36ms tok/s:1820082 rem:329s step 7676 (45%) loss:3.3816 lr:0.77 dt:36ms tok/s:1813298 rem:329s step 7677 (45%) loss:3.3855 lr:0.77 dt:36ms tok/s:1821071 rem:329s step 7678 (45%) loss:3.3595 lr:0.77 dt:37ms tok/s:1787870 rem:329s step 7679 (45%) loss:3.3668 lr:0.77 dt:36ms tok/s:1824940 rem:329s step 7680 (45%) loss:3.3653 lr:0.77 dt:36ms tok/s:1817435 rem:329s step 7681 (45%) loss:3.3870 lr:0.77 dt:36ms tok/s:1824576 rem:329s step 7682 (45%) loss:3.4072 lr:0.77 dt:36ms tok/s:1816438 rem:329s step 7683 (45%) loss:3.3837 lr:0.77 dt:36ms tok/s:1815142 rem:329s step 7684 (45%) loss:3.3768 lr:0.77 dt:36ms tok/s:1823644 rem:329s step 7685 (45%) loss:3.3629 lr:0.77 dt:36ms tok/s:1822290 rem:329s step 7686 (45%) loss:3.3251 lr:0.77 dt:36ms tok/s:1822423 rem:329s step 7687 (45%) loss:3.2833 lr:0.77 dt:36ms tok/s:1825958 rem:329s step 7688 (45%) loss:3.2535 lr:0.77 dt:36ms tok/s:1820227 rem:328s step 7689 (45%) loss:3.2239 lr:0.77 dt:36ms tok/s:1816066 rem:328s step 7690 (45%) loss:3.2593 lr:0.77 dt:36ms tok/s:1816258 rem:328s step 7691 (45%) loss:3.2595 lr:0.77 dt:36ms tok/s:1823475 rem:328s step 7692 (45%) loss:3.2766 lr:0.77 dt:36ms tok/s:1823076 rem:328s step 7693 (45%) loss:3.2822 lr:0.77 dt:36ms tok/s:1822580 rem:328s step 7694 (45%) loss:3.2839 lr:0.77 dt:36ms tok/s:1822762 rem:328s step 7695 (45%) loss:3.2929 lr:0.77 dt:36ms tok/s:1821855 rem:328s step 7696 (45%) loss:3.3040 lr:0.77 dt:36ms tok/s:1814951 rem:328s step 7697 (45%) loss:3.3016 lr:0.77 dt:36ms tok/s:1809443 rem:328s step 7698 (45%) loss:3.3073 lr:0.77 dt:36ms tok/s:1816666 rem:328s step 7699 (45%) loss:3.3211 lr:0.77 dt:36ms tok/s:1811542 rem:328s step 7700 (45%) loss:3.3243 lr:0.77 dt:36ms tok/s:1810742 rem:328s + local: attn=[0.080, 0.799, 0.771] mlp=[0.450, 0.221, -0.182] + + transition: attn=[2.648, 0.929] mlp=[-0.148, 0.394] + + hierarchy: attn=[2.900, 5.939, 5.616] mlp=[1.093, -0.902, -3.188] + step 7701 (45%) loss:3.3285 lr:0.77 dt:36ms tok/s:1814016 rem:328s step 7702 (45%) loss:3.3346 lr:0.77 dt:36ms tok/s:1814471 rem:328s step 7703 (45%) loss:3.3330 lr:0.77 dt:36ms tok/s:1823040 rem:328s step 7704 (45%) loss:3.3310 lr:0.77 dt:36ms tok/s:1808574 rem:328s step 7705 (45%) loss:3.3423 lr:0.77 dt:36ms tok/s:1816246 rem:328s step 7706 (45%) loss:3.3418 lr:0.77 dt:36ms tok/s:1817976 rem:328s step 7707 (45%) loss:3.3470 lr:0.77 dt:36ms tok/s:1818324 rem:328s step 7708 (45%) loss:3.3429 lr:0.77 dt:36ms tok/s:1819516 rem:328s step 7709 (45%) loss:3.3547 lr:0.77 dt:36ms tok/s:1817194 rem:328s step 7710 (45%) loss:3.3359 lr:0.77 dt:36ms tok/s:1814531 rem:328s step 7711 (45%) loss:3.3249 lr:0.77 dt:36ms tok/s:1820914 rem:328s step 7712 (45%) loss:3.3457 lr:0.77 dt:36ms tok/s:1803081 rem:328s step 7713 (45%) loss:3.3517 lr:0.77 dt:36ms tok/s:1812964 rem:328s step 7714 (45%) loss:3.3420 lr:0.77 dt:36ms tok/s:1814639 rem:328s step 7715 (45%) loss:3.3439 lr:0.77 dt:36ms tok/s:1814663 rem:328s step 7716 (45%) loss:3.3406 lr:0.77 dt:36ms tok/s:1807290 rem:327s step 7717 (45%) loss:3.3350 lr:0.77 dt:36ms tok/s:1818806 rem:327s step 7718 (45%) loss:3.3343 lr:0.77 dt:36ms tok/s:1813837 rem:327s step 7719 (45%) loss:3.3211 lr:0.77 dt:36ms tok/s:1800601 rem:327s step 7720 (45%) loss:3.3222 lr:0.77 dt:36ms tok/s:1810134 rem:327s step 7721 (45%) loss:3.3172 lr:0.77 dt:36ms tok/s:1814280 rem:327s step 7722 (45%) loss:3.3207 lr:0.77 dt:38ms tok/s:1744736 rem:327s step 7723 (45%) loss:3.2990 lr:0.77 dt:36ms tok/s:1805782 rem:327s step 7724 (45%) loss:3.2935 lr:0.77 dt:36ms tok/s:1813921 rem:327s step 7725 (45%) loss:3.3023 lr:0.77 dt:36ms tok/s:1820902 rem:327s step 7726 (45%) loss:3.3251 lr:0.77 dt:36ms tok/s:1811100 rem:327s step 7727 (45%) loss:3.3324 lr:0.77 dt:36ms tok/s:1816558 rem:327s step 7728 (45%) loss:3.3326 lr:0.77 dt:36ms tok/s:1800530 rem:327s step 7729 (45%) loss:3.3468 lr:0.77 dt:36ms tok/s:1797739 rem:327s step 7730 (45%) loss:3.3532 lr:0.77 dt:36ms tok/s:1811184 rem:327s step 7731 (46%) loss:3.3504 lr:0.77 dt:36ms tok/s:1800518 rem:327s step 7732 (46%) loss:3.3334 lr:0.77 dt:36ms tok/s:1817134 rem:327s step 7733 (46%) loss:3.3170 lr:0.77 dt:36ms tok/s:1799198 rem:327s step 7734 (46%) loss:3.2989 lr:0.77 dt:36ms tok/s:1816630 rem:327s step 7735 (46%) loss:3.2883 lr:0.77 dt:36ms tok/s:1814855 rem:327s step 7736 (46%) loss:3.2781 lr:0.77 dt:36ms tok/s:1818204 rem:327s step 7737 (46%) loss:3.2864 lr:0.77 dt:36ms tok/s:1818024 rem:327s step 7738 (46%) loss:3.2955 lr:0.77 dt:36ms tok/s:1809479 rem:327s step 7739 (46%) loss:3.3010 lr:0.77 dt:36ms tok/s:1820480 rem:327s step 7740 (46%) loss:3.3124 lr:0.77 dt:36ms tok/s:1814292 rem:327s step 7741 (46%) loss:3.3079 lr:0.77 dt:36ms tok/s:1814627 rem:327s step 7742 (46%) loss:3.3031 lr:0.77 dt:36ms tok/s:1813274 rem:327s step 7743 (46%) loss:3.2988 lr:0.77 dt:36ms tok/s:1809086 rem:327s step 7744 (46%) loss:3.3042 lr:0.77 dt:36ms tok/s:1815826 rem:326s step 7745 (46%) loss:3.3212 lr:0.77 dt:36ms tok/s:1814675 rem:326s step 7746 (46%) loss:3.3231 lr:0.77 dt:36ms tok/s:1810086 rem:326s step 7747 (46%) loss:3.3045 lr:0.77 dt:36ms tok/s:1818769 rem:326s step 7748 (46%) loss:3.3178 lr:0.77 dt:36ms tok/s:1816414 rem:326s step 7749 (46%) loss:3.2978 lr:0.77 dt:36ms tok/s:1825134 rem:326s step 7750 (46%) loss:3.2954 lr:0.77 dt:36ms tok/s:1809622 rem:326s step 7751 (46%) loss:3.2821 lr:0.77 dt:36ms tok/s:1818709 rem:326s step 7752 (46%) loss:3.2564 lr:0.77 dt:36ms tok/s:1825315 rem:326s step 7753 (46%) loss:3.2671 lr:0.77 dt:36ms tok/s:1820781 rem:326s step 7754 (46%) loss:3.3323 lr:0.77 dt:36ms tok/s:1820480 rem:326s step 7755 (46%) loss:3.3424 lr:0.77 dt:36ms tok/s:1813621 rem:326s step 7756 (46%) loss:3.3356 lr:0.77 dt:36ms tok/s:1820721 rem:326s step 7757 (46%) loss:3.3435 lr:0.77 dt:36ms tok/s:1820974 rem:326s step 7758 (46%) loss:3.3321 lr:0.77 dt:36ms tok/s:1816330 rem:326s step 7759 (46%) loss:3.3328 lr:0.77 dt:36ms tok/s:1811637 rem:326s step 7760 (46%) loss:3.3269 lr:0.77 dt:36ms tok/s:1803483 rem:326s step 7761 (46%) loss:3.3388 lr:0.77 dt:36ms tok/s:1808979 rem:326s step 7762 (46%) loss:3.3422 lr:0.77 dt:36ms tok/s:1805414 rem:326s step 7763 (46%) loss:3.3417 lr:0.77 dt:36ms tok/s:1810850 rem:326s step 7764 (46%) loss:3.3107 lr:0.77 dt:36ms tok/s:1814867 rem:326s step 7765 (46%) loss:3.2963 lr:0.77 dt:36ms tok/s:1815826 rem:326s step 7766 (46%) loss:3.2695 lr:0.77 dt:36ms tok/s:1814088 rem:326s step 7767 (46%) loss:3.2511 lr:0.77 dt:36ms tok/s:1811876 rem:326s step 7768 (46%) loss:3.2651 lr:0.77 dt:36ms tok/s:1814292 rem:326s step 7769 (46%) loss:3.2646 lr:0.77 dt:36ms tok/s:1804655 rem:326s step 7770 (46%) loss:3.2297 lr:0.77 dt:36ms tok/s:1814951 rem:326s step 7771 (46%) loss:3.3184 lr:0.77 dt:36ms tok/s:1813861 rem:325s step 7772 (46%) loss:3.3710 lr:0.77 dt:36ms tok/s:1813813 rem:325s step 7773 (46%) loss:3.4168 lr:0.77 dt:36ms tok/s:1804940 rem:325s step 7774 (46%) loss:3.4423 lr:0.77 dt:36ms tok/s:1812438 rem:325s step 7775 (46%) loss:3.4216 lr:0.77 dt:36ms tok/s:1801864 rem:325s step 7776 (46%) loss:3.3934 lr:0.77 dt:36ms tok/s:1807848 rem:325s step 7777 (46%) loss:3.3698 lr:0.76 dt:36ms tok/s:1834097 rem:325s step 7778 (46%) loss:3.3700 lr:0.76 dt:36ms tok/s:1834782 rem:325s step 7779 (46%) loss:3.3687 lr:0.76 dt:36ms tok/s:1831726 rem:325s step 7780 (46%) loss:3.3640 lr:0.76 dt:36ms tok/s:1829251 rem:325s step 7781 (46%) loss:3.3571 lr:0.76 dt:36ms tok/s:1833546 rem:325s step 7782 (46%) loss:3.3386 lr:0.76 dt:36ms tok/s:1832690 rem:325s step 7783 (46%) loss:3.3326 lr:0.76 dt:36ms tok/s:1839214 rem:325s step 7784 (46%) loss:3.3248 lr:0.76 dt:36ms tok/s:1834231 rem:325s step 7785 (46%) loss:3.3177 lr:0.76 dt:36ms tok/s:1830713 rem:325s step 7786 (46%) loss:3.3182 lr:0.76 dt:36ms tok/s:1833742 rem:325s step 7787 (46%) loss:3.3095 lr:0.76 dt:36ms tok/s:1834758 rem:325s step 7788 (46%) loss:3.2975 lr:0.76 dt:36ms tok/s:1841654 rem:325s step 7789 (46%) loss:3.3163 lr:0.76 dt:36ms tok/s:1838611 rem:325s step 7790 (46%) loss:3.3163 lr:0.76 dt:35ms tok/s:1854226 rem:325s step 7791 (46%) loss:3.3025 lr:0.76 dt:36ms tok/s:1831982 rem:325s step 7792 (46%) loss:3.3043 lr:0.76 dt:36ms tok/s:1838697 rem:325s step 7793 (46%) loss:3.3122 lr:0.76 dt:35ms tok/s:1849062 rem:325s step 7794 (46%) loss:3.3072 lr:0.76 dt:36ms tok/s:1833558 rem:325s step 7795 (46%) loss:3.3014 lr:0.76 dt:36ms tok/s:1838796 rem:325s step 7796 (46%) loss:3.3000 lr:0.76 dt:35ms tok/s:1847521 rem:325s step 7797 (46%) loss:3.2884 lr:0.76 dt:36ms tok/s:1840729 rem:325s step 7798 (46%) loss:3.2859 lr:0.76 dt:36ms tok/s:1828399 rem:325s step 7799 (46%) loss:3.2816 lr:0.76 dt:36ms tok/s:1833375 rem:324s step 7800 (46%) loss:3.2726 lr:0.76 dt:36ms tok/s:1833277 rem:324s + local: attn=[0.068, 0.816, 0.785] mlp=[0.447, 0.202, -0.192] + + transition: attn=[2.689, 0.906] mlp=[-0.140, 0.350] + + hierarchy: attn=[2.937, 5.939, 5.616] mlp=[1.097, -0.893, -3.243] + step 7801 (46%) loss:3.2625 lr:0.76 dt:36ms tok/s:1841839 rem:324s step 7802 (46%) loss:3.2750 lr:0.76 dt:36ms tok/s:1838919 rem:324s step 7803 (46%) loss:3.2919 lr:0.76 dt:36ms tok/s:1840138 rem:324s step 7804 (46%) loss:3.2919 lr:0.76 dt:36ms tok/s:1826468 rem:324s step 7805 (46%) loss:3.3365 lr:0.76 dt:36ms tok/s:1832629 rem:324s step 7806 (46%) loss:3.3347 lr:0.76 dt:36ms tok/s:1835309 rem:324s step 7807 (46%) loss:3.3295 lr:0.76 dt:36ms tok/s:1837591 rem:324s step 7808 (46%) loss:3.3233 lr:0.76 dt:36ms tok/s:1833595 rem:324s step 7809 (46%) loss:3.3319 lr:0.76 dt:36ms tok/s:1833497 rem:324s step 7810 (46%) loss:3.3255 lr:0.76 dt:36ms tok/s:1836241 rem:324s step 7811 (46%) loss:3.3159 lr:0.76 dt:36ms tok/s:1832972 rem:324s step 7812 (46%) loss:3.3202 lr:0.76 dt:36ms tok/s:1832947 rem:324s step 7813 (46%) loss:3.3217 lr:0.76 dt:41ms tok/s:1586624 rem:324s step 7814 (46%) loss:3.3368 lr:0.76 dt:36ms tok/s:1826152 rem:324s step 7815 (46%) loss:3.3395 lr:0.76 dt:35ms tok/s:1857936 rem:324s step 7816 (46%) loss:3.3402 lr:0.76 dt:35ms tok/s:1865907 rem:324s step 7817 (46%) loss:3.3370 lr:0.76 dt:35ms tok/s:1868481 rem:324s step 7818 (46%) loss:3.3376 lr:0.76 dt:35ms tok/s:1860476 rem:324s step 7819 (46%) loss:3.3242 lr:0.76 dt:36ms tok/s:1838255 rem:324s step 7820 (46%) loss:3.3247 lr:0.76 dt:35ms tok/s:1850693 rem:324s step 7821 (46%) loss:3.3253 lr:0.76 dt:36ms tok/s:1844137 rem:324s step 7822 (46%) loss:3.3170 lr:0.76 dt:36ms tok/s:1845264 rem:324s step 7823 (46%) loss:3.3103 lr:0.76 dt:35ms tok/s:1851267 rem:324s step 7824 (46%) loss:3.3008 lr:0.76 dt:36ms tok/s:1824165 rem:324s step 7825 (46%) loss:3.2797 lr:0.76 dt:35ms tok/s:1846144 rem:324s step 7826 (46%) loss:3.3051 lr:0.76 dt:36ms tok/s:1838894 rem:324s step 7827 (46%) loss:3.2783 lr:0.76 dt:36ms tok/s:1831848 rem:323s step 7828 (46%) loss:3.2678 lr:0.76 dt:35ms tok/s:1860161 rem:323s step 7829 (46%) loss:3.2710 lr:0.76 dt:35ms tok/s:1861421 rem:323s step 7830 (46%) loss:3.2753 lr:0.76 dt:35ms tok/s:1850270 rem:323s step 7831 (46%) loss:3.2554 lr:0.76 dt:35ms tok/s:1859029 rem:323s step 7832 (46%) loss:3.2660 lr:0.76 dt:35ms tok/s:1854839 rem:323s step 7833 (46%) loss:3.2621 lr:0.76 dt:35ms tok/s:1850033 rem:323s step 7834 (46%) loss:3.2696 lr:0.76 dt:35ms tok/s:1863137 rem:323s step 7835 (46%) loss:3.2669 lr:0.76 dt:35ms tok/s:1873959 rem:323s step 7836 (46%) loss:3.2851 lr:0.76 dt:35ms tok/s:1870350 rem:323s step 7837 (46%) loss:3.2781 lr:0.76 dt:35ms tok/s:1870389 rem:323s step 7838 (46%) loss:3.2657 lr:0.76 dt:35ms tok/s:1870579 rem:323s step 7839 (46%) loss:3.2590 lr:0.76 dt:35ms tok/s:1854014 rem:323s step 7840 (46%) loss:3.2708 lr:0.76 dt:35ms tok/s:1847211 rem:323s step 7841 (46%) loss:3.2745 lr:0.76 dt:35ms tok/s:1853026 rem:323s step 7842 (46%) loss:3.2623 lr:0.76 dt:36ms tok/s:1840593 rem:323s step 7843 (46%) loss:3.2632 lr:0.76 dt:36ms tok/s:1841925 rem:323s step 7844 (46%) loss:3.2640 lr:0.76 dt:36ms tok/s:1841284 rem:323s step 7845 (46%) loss:3.2760 lr:0.76 dt:35ms tok/s:1852065 rem:323s step 7846 (46%) loss:3.2853 lr:0.76 dt:36ms tok/s:1845276 rem:323s step 7847 (46%) loss:3.2791 lr:0.76 dt:35ms tok/s:1852427 rem:323s step 7848 (46%) loss:3.2739 lr:0.76 dt:36ms tok/s:1844583 rem:323s step 7849 (46%) loss:3.2602 lr:0.76 dt:36ms tok/s:1825837 rem:323s step 7850 (46%) loss:3.2615 lr:0.76 dt:36ms tok/s:1830908 rem:323s step 7851 (46%) loss:3.2604 lr:0.76 dt:36ms tok/s:1844769 rem:323s step 7852 (46%) loss:3.2726 lr:0.76 dt:36ms tok/s:1834415 rem:323s step 7853 (46%) loss:3.2746 lr:0.76 dt:36ms tok/s:1837235 rem:323s step 7854 (46%) loss:3.2795 lr:0.76 dt:36ms tok/s:1840729 rem:323s step 7855 (46%) loss:3.2973 lr:0.76 dt:36ms tok/s:1842012 rem:323s step 7856 (46%) loss:3.3025 lr:0.76 dt:35ms tok/s:1848590 rem:322s step 7857 (46%) loss:3.2837 lr:0.76 dt:36ms tok/s:1839879 rem:322s step 7858 (46%) loss:3.2999 lr:0.76 dt:36ms tok/s:1842827 rem:322s step 7859 (46%) loss:3.3186 lr:0.76 dt:36ms tok/s:1838120 rem:322s step 7860 (46%) loss:3.3302 lr:0.76 dt:36ms tok/s:1841235 rem:322s step 7861 (46%) loss:3.3084 lr:0.76 dt:36ms tok/s:1833497 rem:322s step 7862 (46%) loss:3.2958 lr:0.76 dt:36ms tok/s:1835946 rem:322s step 7863 (46%) loss:3.2727 lr:0.76 dt:36ms tok/s:1840692 rem:322s step 7864 (46%) loss:3.2410 lr:0.76 dt:36ms tok/s:1838255 rem:322s step 7865 (46%) loss:3.2402 lr:0.76 dt:36ms tok/s:1839510 rem:322s step 7866 (46%) loss:3.2480 lr:0.76 dt:36ms tok/s:1838636 rem:322s step 7867 (46%) loss:3.2321 lr:0.76 dt:64ms tok/s:1028050 rem:322s step 7868 (46%) loss:3.2132 lr:0.76 dt:32ms tok/s:2044538 rem:322s step 7869 (46%) loss:3.2135 lr:0.76 dt:32ms tok/s:2068758 rem:322s step 7870 (46%) loss:3.2256 lr:0.76 dt:32ms tok/s:2070737 rem:322s step 7871 (46%) loss:3.2493 lr:0.76 dt:32ms tok/s:2053090 rem:322s step 7872 (46%) loss:3.2611 lr:0.76 dt:32ms tok/s:2062734 rem:322s step 7873 (46%) loss:3.2784 lr:0.76 dt:32ms tok/s:2046136 rem:322s step 7874 (46%) loss:3.2818 lr:0.76 dt:32ms tok/s:2023378 rem:322s step 7875 (46%) loss:3.2991 lr:0.76 dt:32ms tok/s:2028619 rem:322s step 7876 (46%) loss:3.2885 lr:0.76 dt:33ms tok/s:2008064 rem:322s step 7877 (46%) loss:3.2892 lr:0.75 dt:33ms tok/s:2005135 rem:322s step 7878 (46%) loss:3.2985 lr:0.75 dt:33ms tok/s:2005881 rem:322s step 7879 (46%) loss:3.3804 lr:0.75 dt:33ms tok/s:2004579 rem:322s step 7880 (46%) loss:3.3833 lr:0.75 dt:33ms tok/s:2002287 rem:322s step 7881 (46%) loss:3.3850 lr:0.75 dt:33ms tok/s:2006057 rem:322s step 7882 (46%) loss:3.3808 lr:0.75 dt:33ms tok/s:2012917 rem:322s step 7883 (46%) loss:3.3643 lr:0.75 dt:33ms tok/s:2005003 rem:322s step 7884 (46%) loss:3.3795 lr:0.75 dt:33ms tok/s:2004784 rem:321s step 7885 (46%) loss:3.3893 lr:0.75 dt:33ms tok/s:2003337 rem:321s step 7886 (46%) loss:3.3814 lr:0.75 dt:33ms tok/s:1997587 rem:321s step 7887 (46%) loss:3.3719 lr:0.75 dt:33ms tok/s:1997717 rem:321s step 7888 (46%) loss:3.3656 lr:0.75 dt:33ms tok/s:2003147 rem:321s step 7889 (46%) loss:3.3550 lr:0.75 dt:33ms tok/s:1986040 rem:321s step 7890 (46%) loss:3.3608 lr:0.75 dt:33ms tok/s:1981973 rem:321s step 7891 (46%) loss:3.3645 lr:0.75 dt:33ms tok/s:1980046 rem:321s step 7892 (46%) loss:3.3565 lr:0.75 dt:34ms tok/s:1943300 rem:321s step 7893 (46%) loss:3.3456 lr:0.75 dt:33ms tok/s:1980560 rem:321s step 7894 (46%) loss:3.3244 lr:0.75 dt:33ms tok/s:1967574 rem:321s step 7895 (46%) loss:3.3135 lr:0.75 dt:34ms tok/s:1950111 rem:321s step 7896 (46%) loss:3.3091 lr:0.75 dt:33ms tok/s:1999403 rem:321s step 7897 (46%) loss:3.3070 lr:0.75 dt:33ms tok/s:1979618 rem:321s step 7898 (46%) loss:3.3133 lr:0.75 dt:33ms tok/s:1997369 rem:321s step 7899 (46%) loss:3.2785 lr:0.75 dt:33ms tok/s:1983175 rem:321s step 7900 (47%) loss:3.2267 lr:0.75 dt:33ms tok/s:1995238 rem:321s + local: attn=[0.068, 0.767, 0.776] mlp=[0.447, 0.197, -0.176] + + transition: attn=[2.865, 0.922] mlp=[-0.148, 0.390] + + hierarchy: attn=[2.946, 5.939, 5.616] mlp=[1.098, -0.899, -3.205] + step 7901 (47%) loss:3.1945 lr:0.75 dt:33ms tok/s:1987318 rem:321s step 7902 (47%) loss:3.1522 lr:0.75 dt:33ms tok/s:1987491 rem:321s step 7903 (47%) loss:3.1799 lr:0.75 dt:33ms tok/s:1985237 rem:321s step 7904 (47%) loss:3.2112 lr:0.75 dt:33ms tok/s:1975109 rem:321s step 7905 (47%) loss:3.2299 lr:0.75 dt:33ms tok/s:1971200 rem:321s step 7906 (47%) loss:3.2473 lr:0.75 dt:33ms tok/s:1973167 rem:321s step 7907 (47%) loss:3.2574 lr:0.75 dt:33ms tok/s:1967264 rem:321s step 7908 (47%) loss:3.2657 lr:0.75 dt:33ms tok/s:1975279 rem:321s step 7909 (47%) loss:3.2722 lr:0.75 dt:34ms tok/s:1944620 rem:321s step 7910 (47%) loss:3.2783 lr:0.75 dt:33ms tok/s:1956984 rem:321s step 7911 (47%) loss:3.2931 lr:0.75 dt:34ms tok/s:1955396 rem:321s step 7912 (47%) loss:3.2838 lr:0.75 dt:34ms tok/s:1955577 rem:321s step 7913 (47%) loss:3.3027 lr:0.75 dt:33ms tok/s:1958294 rem:321s step 7914 (47%) loss:3.2981 lr:0.75 dt:34ms tok/s:1955355 rem:321s step 7915 (47%) loss:3.2836 lr:0.75 dt:34ms tok/s:1946286 rem:320s step 7916 (47%) loss:3.2817 lr:0.75 dt:33ms tok/s:1956844 rem:320s step 7917 (47%) loss:3.2829 lr:0.75 dt:34ms tok/s:1947431 rem:320s step 7918 (47%) loss:3.2580 lr:0.75 dt:34ms tok/s:1944744 rem:320s step 7919 (47%) loss:3.2590 lr:0.75 dt:34ms tok/s:1943671 rem:320s step 7920 (47%) loss:3.2636 lr:0.75 dt:34ms tok/s:1933024 rem:320s step 7921 (47%) loss:3.2629 lr:0.75 dt:34ms tok/s:1928629 rem:320s step 7922 (47%) loss:3.2595 lr:0.75 dt:34ms tok/s:1940064 rem:320s step 7923 (47%) loss:3.2655 lr:0.75 dt:34ms tok/s:1931028 rem:320s step 7924 (47%) loss:3.2635 lr:0.75 dt:34ms tok/s:1938955 rem:320s step 7925 (47%) loss:3.2566 lr:0.75 dt:34ms tok/s:1936605 rem:320s step 7926 (47%) loss:3.2605 lr:0.75 dt:34ms tok/s:1937261 rem:320s step 7927 (47%) loss:3.2636 lr:0.75 dt:34ms tok/s:1934248 rem:320s step 7928 (47%) loss:3.2745 lr:0.75 dt:34ms tok/s:1941557 rem:320s step 7929 (47%) loss:3.2702 lr:0.75 dt:34ms tok/s:1910188 rem:320s step 7930 (47%) loss:3.2806 lr:0.75 dt:34ms tok/s:1935351 rem:320s step 7931 (47%) loss:3.2687 lr:0.75 dt:34ms tok/s:1931855 rem:320s step 7932 (47%) loss:3.2739 lr:0.75 dt:34ms tok/s:1939831 rem:320s step 7933 (47%) loss:3.2724 lr:0.75 dt:34ms tok/s:1939694 rem:320s step 7934 (47%) loss:3.2837 lr:0.75 dt:34ms tok/s:1923407 rem:320s step 7935 (47%) loss:3.2934 lr:0.75 dt:34ms tok/s:1932820 rem:320s step 7936 (47%) loss:3.2954 lr:0.75 dt:34ms tok/s:1938271 rem:320s step 7937 (47%) loss:3.2858 lr:0.75 dt:34ms tok/s:1930675 rem:320s step 7938 (47%) loss:3.2862 lr:0.75 dt:34ms tok/s:1924013 rem:320s step 7939 (47%) loss:3.2807 lr:0.75 dt:34ms tok/s:1940214 rem:320s step 7940 (47%) loss:3.2779 lr:0.75 dt:34ms tok/s:1917343 rem:320s step 7941 (47%) loss:3.2759 lr:0.75 dt:34ms tok/s:1917704 rem:320s step 7942 (47%) loss:3.2738 lr:0.75 dt:34ms tok/s:1915445 rem:320s step 7943 (47%) loss:3.2733 lr:0.75 dt:34ms tok/s:1906597 rem:320s step 7944 (47%) loss:3.2795 lr:0.75 dt:34ms tok/s:1940570 rem:319s step 7945 (47%) loss:3.2768 lr:0.75 dt:34ms tok/s:1929997 rem:319s step 7946 (47%) loss:3.2713 lr:0.75 dt:36ms tok/s:1805639 rem:319s step 7947 (47%) loss:3.2705 lr:0.75 dt:34ms tok/s:1909644 rem:319s step 7948 (47%) loss:3.2697 lr:0.75 dt:34ms tok/s:1938627 rem:319s step 7949 (47%) loss:3.2565 lr:0.75 dt:34ms tok/s:1933921 rem:319s step 7950 (47%) loss:3.2380 lr:0.75 dt:34ms tok/s:1935269 rem:319s step 7951 (47%) loss:3.2393 lr:0.75 dt:34ms tok/s:1920558 rem:319s step 7952 (47%) loss:3.2461 lr:0.75 dt:34ms tok/s:1932738 rem:319s step 7953 (47%) loss:3.2293 lr:0.75 dt:34ms tok/s:1940762 rem:319s step 7954 (47%) loss:3.2488 lr:0.75 dt:34ms tok/s:1908742 rem:319s step 7955 (47%) loss:3.2666 lr:0.75 dt:34ms tok/s:1928359 rem:319s step 7956 (47%) loss:3.2523 lr:0.75 dt:34ms tok/s:1920598 rem:319s step 7957 (47%) loss:3.2747 lr:0.75 dt:34ms tok/s:1918815 rem:319s step 7958 (47%) loss:3.2851 lr:0.75 dt:34ms tok/s:1922008 rem:319s step 7959 (47%) loss:3.2887 lr:0.75 dt:34ms tok/s:1918105 rem:319s step 7960 (47%) loss:3.2823 lr:0.75 dt:34ms tok/s:1934888 rem:319s step 7961 (47%) loss:3.2801 lr:0.75 dt:34ms tok/s:1904061 rem:319s step 7962 (47%) loss:3.2785 lr:0.75 dt:34ms tok/s:1924457 rem:319s step 7963 (47%) loss:3.2760 lr:0.75 dt:34ms tok/s:1922882 rem:319s step 7964 (47%) loss:3.2753 lr:0.75 dt:34ms tok/s:1927075 rem:319s step 7965 (47%) loss:3.2860 lr:0.75 dt:34ms tok/s:1929049 rem:319s step 7966 (47%) loss:3.2852 lr:0.75 dt:34ms tok/s:1913872 rem:319s step 7967 (47%) loss:3.2902 lr:0.75 dt:34ms tok/s:1918976 rem:319s step 7968 (47%) loss:3.3051 lr:0.75 dt:38ms tok/s:1716796 rem:319s step 7969 (47%) loss:3.3006 lr:0.75 dt:34ms tok/s:1926615 rem:319s step 7970 (47%) loss:3.2839 lr:0.75 dt:34ms tok/s:1935147 rem:319s step 7971 (47%) loss:3.2743 lr:0.75 dt:34ms tok/s:1920088 rem:319s step 7972 (47%) loss:3.2759 lr:0.75 dt:34ms tok/s:1928684 rem:319s step 7973 (47%) loss:3.2837 lr:0.75 dt:34ms tok/s:1916353 rem:318s step 7974 (47%) loss:3.2876 lr:0.75 dt:34ms tok/s:1927534 rem:318s step 7975 (47%) loss:3.2520 lr:0.75 dt:34ms tok/s:1922990 rem:318s step 7976 (47%) loss:3.2616 lr:0.75 dt:34ms tok/s:1922196 rem:318s step 7977 (47%) loss:3.2953 lr:0.75 dt:34ms tok/s:1922680 rem:318s step 7978 (47%) loss:3.2978 lr:0.75 dt:34ms tok/s:1927818 rem:318s step 7979 (47%) loss:3.3051 lr:0.75 dt:34ms tok/s:1916754 rem:318s step 7980 (47%) loss:3.3120 lr:0.75 dt:34ms tok/s:1927331 rem:318s step 7981 (47%) loss:3.2981 lr:0.75 dt:34ms tok/s:1916474 rem:318s step 7982 (47%) loss:3.3222 lr:0.74 dt:34ms tok/s:1917236 rem:318s step 7983 (47%) loss:3.3307 lr:0.74 dt:34ms tok/s:1923891 rem:318s step 7984 (47%) loss:3.3322 lr:0.74 dt:34ms tok/s:1918266 rem:318s step 7985 (47%) loss:3.3284 lr:0.74 dt:34ms tok/s:1918962 rem:318s step 7986 (47%) loss:3.3311 lr:0.74 dt:34ms tok/s:1923245 rem:318s step 7987 (47%) loss:3.3389 lr:0.74 dt:34ms tok/s:1918614 rem:318s step 7988 (47%) loss:3.3357 lr:0.74 dt:34ms tok/s:1921189 rem:318s step 7989 (47%) loss:3.3285 lr:0.74 dt:35ms tok/s:1882455 rem:318s step 7990 (47%) loss:3.3009 lr:0.74 dt:34ms tok/s:1927953 rem:318s step 7991 (47%) loss:3.2971 lr:0.74 dt:34ms tok/s:1927142 rem:318s step 7992 (47%) loss:3.2809 lr:0.74 dt:34ms tok/s:1921457 rem:318s step 7993 (47%) loss:3.2856 lr:0.74 dt:34ms tok/s:1927602 rem:318s step 7994 (47%) loss:3.2748 lr:0.74 dt:34ms tok/s:1917463 rem:318s step 7995 (47%) loss:3.3012 lr:0.74 dt:34ms tok/s:1920410 rem:318s step 7996 (47%) loss:3.3065 lr:0.74 dt:34ms tok/s:1921632 rem:318s step 7997 (47%) loss:3.2983 lr:0.74 dt:34ms tok/s:1934888 rem:318s step 7998 (47%) loss:3.3171 lr:0.74 dt:34ms tok/s:1919860 rem:318s step 7999 (47%) loss:3.3226 lr:0.74 dt:34ms tok/s:1915899 rem:318s step 8000 (47%) loss:3.3125 lr:0.74 dt:34ms tok/s:1922855 rem:318s + local: attn=[0.063, 0.864, 0.786] mlp=[0.458, 0.212, -0.189] + + transition: attn=[2.654, 0.877] mlp=[-0.145, 0.385] + + hierarchy: attn=[2.909, 5.939, 5.616] mlp=[1.141, -0.943, -3.177] + step 8001 (47%) loss:3.3049 lr:0.74 dt:34ms tok/s:1921014 rem:318s step 8002 (47%) loss:3.3050 lr:0.74 dt:34ms tok/s:1921645 rem:318s step 8003 (47%) loss:3.3083 lr:0.74 dt:34ms tok/s:1923972 rem:317s step 8004 (47%) loss:3.3096 lr:0.74 dt:34ms tok/s:1926669 rem:317s step 8005 (47%) loss:3.3249 lr:0.74 dt:34ms tok/s:1931475 rem:317s step 8006 (47%) loss:3.3321 lr:0.74 dt:34ms tok/s:1926305 rem:317s step 8007 (47%) loss:3.3314 lr:0.74 dt:34ms tok/s:1922909 rem:317s step 8008 (47%) loss:3.3317 lr:0.74 dt:34ms tok/s:1922600 rem:317s step 8009 (47%) loss:3.3366 lr:0.74 dt:34ms tok/s:1913352 rem:317s step 8010 (47%) loss:3.3306 lr:0.74 dt:36ms tok/s:1797422 rem:317s step 8011 (47%) loss:3.3359 lr:0.74 dt:34ms tok/s:1920424 rem:317s step 8012 (47%) loss:3.3283 lr:0.74 dt:34ms tok/s:1907841 rem:317s step 8013 (47%) loss:3.3273 lr:0.74 dt:35ms tok/s:1898354 rem:317s step 8014 (47%) loss:3.3162 lr:0.74 dt:38ms tok/s:1704437 rem:317s step 8015 (47%) loss:3.3049 lr:0.74 dt:34ms tok/s:1931801 rem:317s step 8016 (47%) loss:3.3059 lr:0.74 dt:34ms tok/s:1913192 rem:317s step 8017 (47%) loss:3.2997 lr:0.74 dt:34ms tok/s:1919150 rem:317s step 8018 (47%) loss:3.2851 lr:0.74 dt:34ms tok/s:1919136 rem:317s step 8019 (47%) loss:3.2685 lr:0.74 dt:34ms tok/s:1910719 rem:317s step 8020 (47%) loss:3.2654 lr:0.74 dt:34ms tok/s:1920276 rem:317s step 8021 (47%) loss:3.2400 lr:0.74 dt:34ms tok/s:1912607 rem:317s step 8022 (47%) loss:3.2043 lr:0.74 dt:35ms tok/s:1899128 rem:317s step 8023 (47%) loss:3.2127 lr:0.74 dt:34ms tok/s:1907563 rem:317s step 8024 (47%) loss:3.2376 lr:0.74 dt:34ms tok/s:1900441 rem:317s step 8025 (47%) loss:3.2492 lr:0.74 dt:35ms tok/s:1895749 rem:317s step 8026 (47%) loss:3.2679 lr:0.74 dt:35ms tok/s:1893268 rem:317s step 8027 (47%) loss:3.2815 lr:0.74 dt:35ms tok/s:1885282 rem:317s step 8028 (47%) loss:3.2933 lr:0.74 dt:35ms tok/s:1881295 rem:317s step 8029 (47%) loss:3.3027 lr:0.74 dt:35ms tok/s:1881424 rem:317s step 8030 (47%) loss:3.3016 lr:0.74 dt:35ms tok/s:1876684 rem:317s step 8031 (47%) loss:3.3022 lr:0.74 dt:35ms tok/s:1895187 rem:316s step 8032 (47%) loss:3.2981 lr:0.74 dt:35ms tok/s:1862316 rem:316s step 8033 (47%) loss:3.2971 lr:0.74 dt:35ms tok/s:1860338 rem:316s step 8034 (47%) loss:3.2990 lr:0.74 dt:35ms tok/s:1866528 rem:316s step 8035 (47%) loss:3.2978 lr:0.74 dt:35ms tok/s:1868405 rem:316s step 8036 (47%) loss:3.2983 lr:0.74 dt:35ms tok/s:1857898 rem:316s step 8037 (47%) loss:3.3042 lr:0.74 dt:35ms tok/s:1861081 rem:316s step 8038 (47%) loss:3.3049 lr:0.74 dt:35ms tok/s:1857346 rem:316s step 8039 (47%) loss:3.3017 lr:0.74 dt:35ms tok/s:1860564 rem:316s step 8040 (47%) loss:3.3015 lr:0.74 dt:35ms tok/s:1861106 rem:316s step 8041 (47%) loss:3.2904 lr:0.74 dt:35ms tok/s:1860350 rem:316s step 8042 (47%) loss:3.3078 lr:0.74 dt:35ms tok/s:1859293 rem:316s step 8043 (47%) loss:3.3026 lr:0.74 dt:35ms tok/s:1864097 rem:316s step 8044 (47%) loss:3.3144 lr:0.74 dt:36ms tok/s:1834831 rem:316s step 8045 (47%) loss:3.3282 lr:0.74 dt:35ms tok/s:1869307 rem:316s step 8046 (47%) loss:3.3211 lr:0.74 dt:35ms tok/s:1860136 rem:316s step 8047 (47%) loss:3.3322 lr:0.74 dt:36ms tok/s:1840433 rem:316s step 8048 (47%) loss:3.3307 lr:0.74 dt:35ms tok/s:1858087 rem:316s step 8049 (47%) loss:3.3384 lr:0.74 dt:35ms tok/s:1854101 rem:316s step 8050 (47%) loss:3.4567 lr:0.74 dt:35ms tok/s:1858426 rem:316s step 8051 (47%) loss:3.4585 lr:0.74 dt:35ms tok/s:1864995 rem:316s step 8052 (47%) loss:3.4437 lr:0.74 dt:36ms tok/s:1843074 rem:316s step 8053 (47%) loss:3.4351 lr:0.74 dt:35ms tok/s:1859369 rem:316s step 8054 (47%) loss:3.4355 lr:0.74 dt:35ms tok/s:1865248 rem:316s step 8055 (47%) loss:3.4099 lr:0.74 dt:35ms tok/s:1861648 rem:316s step 8056 (47%) loss:3.3909 lr:0.74 dt:35ms tok/s:1858514 rem:316s step 8057 (47%) loss:3.3779 lr:0.74 dt:35ms tok/s:1858677 rem:316s step 8058 (47%) loss:3.3644 lr:0.74 dt:35ms tok/s:1865299 rem:316s step 8059 (47%) loss:3.3505 lr:0.74 dt:35ms tok/s:1860438 rem:316s step 8060 (47%) loss:3.3069 lr:0.74 dt:35ms tok/s:1863946 rem:315s step 8061 (47%) loss:3.2794 lr:0.74 dt:35ms tok/s:1854476 rem:315s step 8062 (47%) loss:3.2856 lr:0.74 dt:36ms tok/s:1828630 rem:315s step 8063 (47%) loss:3.2909 lr:0.74 dt:35ms tok/s:1857170 rem:315s step 8064 (47%) loss:3.2842 lr:0.74 dt:35ms tok/s:1859457 rem:315s step 8065 (47%) loss:3.2988 lr:0.74 dt:35ms tok/s:1852939 rem:315s step 8066 (47%) loss:3.3213 lr:0.74 dt:38ms tok/s:1746376 rem:315s step 8067 (47%) loss:3.3280 lr:0.74 dt:35ms tok/s:1871152 rem:315s step 8068 (47%) loss:3.3285 lr:0.74 dt:35ms tok/s:1883396 rem:315s step 8069 (47%) loss:3.3357 lr:0.74 dt:35ms tok/s:1860955 rem:315s step 8070 (47%) loss:3.3372 lr:0.74 dt:35ms tok/s:1861534 rem:315s step 8071 (47%) loss:3.3492 lr:0.74 dt:36ms tok/s:1845797 rem:315s step 8072 (47%) loss:3.3481 lr:0.74 dt:35ms tok/s:1864477 rem:315s step 8073 (47%) loss:3.3541 lr:0.74 dt:35ms tok/s:1861560 rem:315s step 8074 (47%) loss:3.3514 lr:0.74 dt:35ms tok/s:1870987 rem:315s step 8075 (48%) loss:3.3491 lr:0.74 dt:35ms tok/s:1855353 rem:315s step 8076 (48%) loss:3.3412 lr:0.74 dt:35ms tok/s:1868139 rem:315s step 8077 (48%) loss:3.3388 lr:0.74 dt:35ms tok/s:1866882 rem:315s step 8078 (48%) loss:3.3343 lr:0.74 dt:35ms tok/s:1863946 rem:315s step 8079 (48%) loss:3.3148 lr:0.74 dt:35ms tok/s:1861799 rem:315s step 8080 (48%) loss:3.3121 lr:0.74 dt:36ms tok/s:1841173 rem:315s step 8081 (48%) loss:3.3243 lr:0.74 dt:36ms tok/s:1836425 rem:315s step 8082 (48%) loss:3.3307 lr:0.73 dt:35ms tok/s:1850731 rem:315s step 8083 (48%) loss:3.3204 lr:0.73 dt:36ms tok/s:1840014 rem:315s step 8084 (48%) loss:3.3041 lr:0.73 dt:36ms tok/s:1839682 rem:315s step 8085 (48%) loss:3.3021 lr:0.73 dt:35ms tok/s:1846094 rem:315s step 8086 (48%) loss:3.2825 lr:0.73 dt:36ms tok/s:1843618 rem:315s step 8087 (48%) loss:3.2763 lr:0.73 dt:36ms tok/s:1833253 rem:315s step 8088 (48%) loss:3.2928 lr:0.73 dt:36ms tok/s:1836106 rem:314s step 8089 (48%) loss:3.2835 lr:0.73 dt:36ms tok/s:1832678 rem:314s step 8090 (48%) loss:3.2753 lr:0.73 dt:36ms tok/s:1836584 rem:314s step 8091 (48%) loss:3.2720 lr:0.73 dt:36ms tok/s:1830445 rem:314s step 8092 (48%) loss:3.2811 lr:0.73 dt:36ms tok/s:1830884 rem:314s step 8093 (48%) loss:3.3383 lr:0.73 dt:36ms tok/s:1843816 rem:314s step 8094 (48%) loss:3.3442 lr:0.73 dt:35ms tok/s:1846194 rem:314s step 8095 (48%) loss:3.3369 lr:0.73 dt:36ms tok/s:1836155 rem:314s step 8096 (48%) loss:3.3319 lr:0.73 dt:36ms tok/s:1838870 rem:314s step 8097 (48%) loss:3.3066 lr:0.73 dt:36ms tok/s:1835395 rem:314s step 8098 (48%) loss:3.2841 lr:0.73 dt:36ms tok/s:1835162 rem:314s step 8099 (48%) loss:3.2839 lr:0.73 dt:36ms tok/s:1832849 rem:314s step 8100 (48%) loss:3.2875 lr:0.73 dt:36ms tok/s:1836179 rem:314s + local: attn=[0.078, 0.822, 0.770] mlp=[0.462, 0.213, -0.227] + + transition: attn=[2.711, 0.894] mlp=[-0.157, 0.404] + + hierarchy: attn=[2.985, 5.939, 5.616] mlp=[1.151, -0.933, -3.149] + step 8101 (48%) loss:3.2960 lr:0.73 dt:36ms tok/s:1841469 rem:314s step 8102 (48%) loss:3.3043 lr:0.73 dt:36ms tok/s:1835407 rem:314s step 8103 (48%) loss:3.2960 lr:0.73 dt:36ms tok/s:1836008 rem:314s step 8104 (48%) loss:3.2907 lr:0.73 dt:36ms tok/s:1840248 rem:314s step 8105 (48%) loss:3.2736 lr:0.73 dt:36ms tok/s:1839399 rem:314s step 8106 (48%) loss:3.2858 lr:0.73 dt:36ms tok/s:1839165 rem:314s step 8107 (48%) loss:3.2839 lr:0.73 dt:36ms tok/s:1836253 rem:314s step 8108 (48%) loss:3.2684 lr:0.73 dt:36ms tok/s:1835873 rem:314s step 8109 (48%) loss:3.2436 lr:0.73 dt:36ms tok/s:1833913 rem:314s step 8110 (48%) loss:3.2144 lr:0.73 dt:36ms tok/s:1834293 rem:314s step 8111 (48%) loss:3.2207 lr:0.73 dt:36ms tok/s:1833338 rem:314s step 8112 (48%) loss:3.2220 lr:0.73 dt:36ms tok/s:1834305 rem:314s step 8113 (48%) loss:3.2177 lr:0.73 dt:36ms tok/s:1836057 rem:314s step 8114 (48%) loss:3.2144 lr:0.73 dt:36ms tok/s:1839534 rem:314s step 8115 (48%) loss:3.2327 lr:0.73 dt:36ms tok/s:1839719 rem:314s step 8116 (48%) loss:3.2360 lr:0.73 dt:36ms tok/s:1837677 rem:313s step 8117 (48%) loss:3.2447 lr:0.73 dt:36ms tok/s:1833302 rem:313s step 8118 (48%) loss:3.2516 lr:0.73 dt:36ms tok/s:1830676 rem:313s step 8119 (48%) loss:3.2517 lr:0.73 dt:36ms tok/s:1827828 rem:313s step 8120 (48%) loss:3.2702 lr:0.73 dt:36ms tok/s:1835285 rem:313s step 8121 (48%) loss:3.2978 lr:0.73 dt:36ms tok/s:1834305 rem:313s step 8122 (48%) loss:3.2930 lr:0.73 dt:36ms tok/s:1835799 rem:313s step 8123 (48%) loss:3.3018 lr:0.73 dt:36ms tok/s:1845772 rem:313s step 8124 (48%) loss:3.2988 lr:0.73 dt:35ms tok/s:1846925 rem:313s step 8125 (48%) loss:3.2982 lr:0.73 dt:36ms tok/s:1839066 rem:313s step 8126 (48%) loss:3.2996 lr:0.73 dt:35ms tok/s:1849261 rem:313s step 8127 (48%) loss:3.2744 lr:0.73 dt:36ms tok/s:1836486 rem:313s step 8128 (48%) loss:3.2299 lr:0.73 dt:36ms tok/s:1834317 rem:313s step 8129 (48%) loss:3.2374 lr:0.73 dt:36ms tok/s:1836683 rem:313s step 8130 (48%) loss:3.2327 lr:0.73 dt:36ms tok/s:1838083 rem:313s step 8131 (48%) loss:3.2402 lr:0.73 dt:35ms tok/s:1847261 rem:313s step 8132 (48%) loss:3.2520 lr:0.73 dt:36ms tok/s:1844868 rem:313s step 8133 (48%) loss:3.2624 lr:0.73 dt:36ms tok/s:1836658 rem:313s step 8134 (48%) loss:3.2674 lr:0.73 dt:36ms tok/s:1831408 rem:313s step 8135 (48%) loss:3.2641 lr:0.73 dt:36ms tok/s:1839817 rem:313s step 8136 (48%) loss:3.2672 lr:0.73 dt:36ms tok/s:1842456 rem:313s step 8137 (48%) loss:3.2462 lr:0.73 dt:36ms tok/s:1841963 rem:313s step 8138 (48%) loss:3.2686 lr:0.73 dt:36ms tok/s:1841037 rem:313s step 8139 (48%) loss:3.2688 lr:0.73 dt:36ms tok/s:1838427 rem:313s step 8140 (48%) loss:3.2757 lr:0.73 dt:36ms tok/s:1840717 rem:313s step 8141 (48%) loss:3.2500 lr:0.73 dt:36ms tok/s:1839891 rem:313s step 8142 (48%) loss:3.2389 lr:0.73 dt:36ms tok/s:1843779 rem:313s step 8143 (48%) loss:3.2319 lr:0.73 dt:36ms tok/s:1841740 rem:313s step 8144 (48%) loss:3.2291 lr:0.73 dt:36ms tok/s:1840113 rem:312s step 8145 (48%) loss:3.2282 lr:0.73 dt:35ms tok/s:1847683 rem:312s step 8146 (48%) loss:3.2321 lr:0.73 dt:36ms tok/s:1839817 rem:312s step 8147 (48%) loss:3.2628 lr:0.73 dt:36ms tok/s:1834256 rem:312s step 8148 (48%) loss:3.2513 lr:0.73 dt:37ms tok/s:1765161 rem:312s step 8149 (48%) loss:3.2451 lr:0.73 dt:35ms tok/s:1854101 rem:312s step 8150 (48%) loss:3.2506 lr:0.73 dt:35ms tok/s:1880111 rem:312s step 8151 (48%) loss:3.2486 lr:0.73 dt:35ms tok/s:1884326 rem:312s step 8152 (48%) loss:3.2654 lr:0.73 dt:35ms tok/s:1862884 rem:312s step 8153 (48%) loss:3.2662 lr:0.73 dt:36ms tok/s:1815274 rem:312s step 8154 (48%) loss:3.2680 lr:0.73 dt:36ms tok/s:1840778 rem:312s step 8155 (48%) loss:3.2480 lr:0.73 dt:35ms tok/s:1858036 rem:312s step 8156 (48%) loss:3.2438 lr:0.73 dt:35ms tok/s:1861509 rem:312s step 8157 (48%) loss:3.2569 lr:0.73 dt:35ms tok/s:1864059 rem:312s step 8158 (48%) loss:3.2549 lr:0.73 dt:36ms tok/s:1841716 rem:312s step 8159 (48%) loss:3.3262 lr:0.73 dt:36ms tok/s:1834170 rem:312s step 8160 (48%) loss:3.3136 lr:0.73 dt:36ms tok/s:1842283 rem:312s step 8161 (48%) loss:3.3353 lr:0.73 dt:35ms tok/s:1873052 rem:312s step 8162 (48%) loss:3.3389 lr:0.73 dt:43ms tok/s:1535391 rem:312s step 8163 (48%) loss:3.3229 lr:0.73 dt:35ms tok/s:1868291 rem:312s step 8164 (48%) loss:3.3185 lr:0.73 dt:35ms tok/s:1865869 rem:312s step 8165 (48%) loss:3.3025 lr:0.73 dt:34ms tok/s:1908755 rem:312s step 8166 (48%) loss:3.2943 lr:0.73 dt:35ms tok/s:1895710 rem:312s step 8167 (48%) loss:3.3159 lr:0.73 dt:35ms tok/s:1881076 rem:312s step 8168 (48%) loss:3.3167 lr:0.73 dt:35ms tok/s:1886032 rem:312s step 8169 (48%) loss:3.3263 lr:0.73 dt:35ms tok/s:1891054 rem:312s step 8170 (48%) loss:3.3306 lr:0.73 dt:35ms tok/s:1872312 rem:312s step 8171 (48%) loss:3.3181 lr:0.73 dt:35ms tok/s:1873754 rem:312s step 8172 (48%) loss:3.3156 lr:0.73 dt:35ms tok/s:1872720 rem:311s step 8173 (48%) loss:3.3087 lr:0.73 dt:35ms tok/s:1879931 rem:311s step 8174 (48%) loss:3.3055 lr:0.73 dt:35ms tok/s:1879880 rem:311s step 8175 (48%) loss:3.3334 lr:0.73 dt:35ms tok/s:1867415 rem:311s step 8176 (48%) loss:3.3305 lr:0.73 dt:35ms tok/s:1879700 rem:311s step 8177 (48%) loss:3.3271 lr:0.73 dt:35ms tok/s:1876658 rem:311s step 8178 (48%) loss:3.2992 lr:0.73 dt:35ms tok/s:1873512 rem:311s step 8179 (48%) loss:3.3049 lr:0.72 dt:35ms tok/s:1878492 rem:311s step 8180 (48%) loss:3.3146 lr:0.72 dt:35ms tok/s:1852726 rem:311s step 8181 (48%) loss:3.2889 lr:0.72 dt:36ms tok/s:1843606 rem:311s step 8182 (48%) loss:3.2784 lr:0.72 dt:36ms tok/s:1845276 rem:311s step 8183 (48%) loss:3.2712 lr:0.72 dt:36ms tok/s:1843581 rem:311s step 8184 (48%) loss:3.2804 lr:0.72 dt:36ms tok/s:1833779 rem:311s step 8185 (48%) loss:3.2933 lr:0.72 dt:36ms tok/s:1819938 rem:311s step 8186 (48%) loss:3.2986 lr:0.72 dt:35ms tok/s:1846318 rem:311s step 8187 (48%) loss:3.2947 lr:0.72 dt:36ms tok/s:1830274 rem:311s step 8188 (48%) loss:3.3154 lr:0.72 dt:35ms tok/s:1846305 rem:311s step 8189 (48%) loss:3.3206 lr:0.72 dt:35ms tok/s:1855440 rem:311s step 8190 (48%) loss:3.3491 lr:0.72 dt:35ms tok/s:1862746 rem:311s step 8191 (48%) loss:3.3428 lr:0.72 dt:36ms tok/s:1833094 rem:311s step 8192 (48%) loss:3.3693 lr:0.72 dt:35ms tok/s:1851341 rem:311s step 8193 (48%) loss:3.3669 lr:0.72 dt:35ms tok/s:1848316 rem:311s step 8194 (48%) loss:3.3578 lr:0.72 dt:35ms tok/s:1855854 rem:311s step 8195 (48%) loss:3.3395 lr:0.72 dt:36ms tok/s:1830457 rem:311s step 8196 (48%) loss:3.3157 lr:0.72 dt:35ms tok/s:1852202 rem:311s step 8197 (48%) loss:3.3092 lr:0.72 dt:35ms tok/s:1848602 rem:311s step 8198 (48%) loss:3.3070 lr:0.72 dt:36ms tok/s:1839830 rem:311s step 8199 (48%) loss:3.3249 lr:0.72 dt:35ms tok/s:1847310 rem:311s step 8200 (48%) loss:3.3210 lr:0.72 dt:35ms tok/s:1847186 rem:311s + local: attn=[0.088, 0.799, 0.791] mlp=[0.476, 0.226, -0.194] + + transition: attn=[2.796, 0.922] mlp=[-0.148, 0.395] + + hierarchy: attn=[2.919, 5.939, 5.616] mlp=[1.110, -0.912, -3.108] + step 8201 (48%) loss:3.3026 lr:0.72 dt:35ms tok/s:1847782 rem:310s step 8202 (48%) loss:3.3132 lr:0.72 dt:35ms tok/s:1848267 rem:310s step 8203 (48%) loss:3.2921 lr:0.72 dt:36ms tok/s:1840655 rem:310s step 8204 (48%) loss:3.2811 lr:0.72 dt:36ms tok/s:1844385 rem:310s step 8205 (48%) loss:3.2870 lr:0.72 dt:35ms tok/s:1850419 rem:310s step 8206 (48%) loss:3.2875 lr:0.72 dt:35ms tok/s:1852065 rem:310s step 8207 (48%) loss:3.2804 lr:0.72 dt:35ms tok/s:1853326 rem:310s step 8208 (48%) loss:3.2746 lr:0.72 dt:35ms tok/s:1848602 rem:310s step 8209 (48%) loss:3.2693 lr:0.72 dt:36ms tok/s:1838058 rem:310s step 8210 (48%) loss:3.2695 lr:0.72 dt:35ms tok/s:1852452 rem:310s step 8211 (48%) loss:3.2455 lr:0.72 dt:35ms tok/s:1852851 rem:310s step 8212 (48%) loss:3.2590 lr:0.72 dt:36ms tok/s:1823330 rem:310s step 8213 (48%) loss:3.2661 lr:0.72 dt:36ms tok/s:1800400 rem:310s step 8214 (48%) loss:3.2728 lr:0.72 dt:36ms tok/s:1818192 rem:310s step 8215 (48%) loss:3.2694 lr:0.72 dt:38ms tok/s:1742612 rem:310s step 8216 (48%) loss:3.2654 lr:0.72 dt:36ms tok/s:1808336 rem:310s step 8217 (48%) loss:3.2700 lr:0.72 dt:36ms tok/s:1810182 rem:310s step 8218 (48%) loss:3.2672 lr:0.72 dt:36ms tok/s:1816042 rem:310s step 8219 (48%) loss:3.2586 lr:0.72 dt:36ms tok/s:1815826 rem:310s step 8220 (48%) loss:3.2465 lr:0.72 dt:36ms tok/s:1807313 rem:310s step 8221 (48%) loss:3.2674 lr:0.72 dt:36ms tok/s:1807527 rem:310s step 8222 (48%) loss:3.2582 lr:0.72 dt:36ms tok/s:1818180 rem:310s step 8223 (48%) loss:3.2701 lr:0.72 dt:36ms tok/s:1817831 rem:310s step 8224 (48%) loss:3.2782 lr:0.72 dt:36ms tok/s:1814675 rem:310s step 8225 (48%) loss:3.2931 lr:0.72 dt:36ms tok/s:1822508 rem:310s step 8226 (48%) loss:3.2990 lr:0.72 dt:36ms tok/s:1824903 rem:310s step 8227 (48%) loss:3.2968 lr:0.72 dt:36ms tok/s:1818794 rem:310s step 8228 (48%) loss:3.2876 lr:0.72 dt:36ms tok/s:1796787 rem:309s step 8229 (48%) loss:3.2940 lr:0.72 dt:36ms tok/s:1818264 rem:309s step 8230 (48%) loss:3.2801 lr:0.72 dt:36ms tok/s:1822459 rem:309s step 8231 (48%) loss:3.2683 lr:0.72 dt:36ms tok/s:1811029 rem:309s step 8232 (48%) loss:3.2642 lr:0.72 dt:36ms tok/s:1818012 rem:309s step 8233 (48%) loss:3.3451 lr:0.72 dt:36ms tok/s:1827160 rem:309s step 8234 (48%) loss:3.3760 lr:0.72 dt:36ms tok/s:1827391 rem:309s step 8235 (48%) loss:3.4042 lr:0.72 dt:36ms tok/s:1821892 rem:309s step 8236 (48%) loss:3.4083 lr:0.72 dt:36ms tok/s:1813574 rem:309s step 8237 (48%) loss:3.3905 lr:0.72 dt:37ms tok/s:1794652 rem:309s step 8238 (48%) loss:3.3788 lr:0.72 dt:36ms tok/s:1825231 rem:309s step 8239 (48%) loss:3.3806 lr:0.72 dt:36ms tok/s:1818445 rem:309s step 8240 (48%) loss:3.3473 lr:0.72 dt:36ms tok/s:1831164 rem:309s step 8241 (48%) loss:3.3979 lr:0.72 dt:36ms tok/s:1829154 rem:309s step 8242 (48%) loss:3.3869 lr:0.72 dt:38ms tok/s:1721322 rem:309s step 8243 (49%) loss:3.3900 lr:0.72 dt:36ms tok/s:1824976 rem:309s step 8244 (49%) loss:3.3693 lr:0.72 dt:36ms tok/s:1818072 rem:309s step 8245 (49%) loss:3.3527 lr:0.72 dt:36ms tok/s:1817723 rem:309s step 8246 (49%) loss:3.3444 lr:0.72 dt:36ms tok/s:1820504 rem:309s step 8247 (49%) loss:3.3322 lr:0.72 dt:36ms tok/s:1816510 rem:309s step 8248 (49%) loss:3.3328 lr:0.72 dt:36ms tok/s:1821240 rem:309s step 8249 (49%) loss:3.3077 lr:0.72 dt:36ms tok/s:1816786 rem:309s step 8250 (49%) loss:3.3016 lr:0.72 dt:36ms tok/s:1824322 rem:309s step 8251 (49%) loss:3.2877 lr:0.72 dt:36ms tok/s:1820070 rem:309s step 8252 (49%) loss:3.3056 lr:0.72 dt:37ms tok/s:1785385 rem:309s step 8253 (49%) loss:3.3151 lr:0.72 dt:36ms tok/s:1810969 rem:309s step 8254 (49%) loss:3.3100 lr:0.72 dt:36ms tok/s:1818000 rem:309s step 8255 (49%) loss:3.3199 lr:0.72 dt:37ms tok/s:1781105 rem:309s step 8256 (49%) loss:3.3522 lr:0.72 dt:36ms tok/s:1813215 rem:308s step 8257 (49%) loss:3.3549 lr:0.72 dt:36ms tok/s:1813873 rem:308s step 8258 (49%) loss:3.3612 lr:0.72 dt:36ms tok/s:1818433 rem:308s step 8259 (49%) loss:3.3649 lr:0.72 dt:36ms tok/s:1814519 rem:308s step 8260 (49%) loss:3.3485 lr:0.72 dt:36ms tok/s:1820552 rem:308s step 8261 (49%) loss:3.3337 lr:0.72 dt:36ms tok/s:1813071 rem:308s step 8262 (49%) loss:3.3060 lr:0.72 dt:36ms tok/s:1812689 rem:308s step 8263 (49%) loss:3.3077 lr:0.72 dt:36ms tok/s:1817687 rem:308s step 8264 (49%) loss:3.2957 lr:0.72 dt:36ms tok/s:1814004 rem:308s step 8265 (49%) loss:3.3218 lr:0.72 dt:36ms tok/s:1817615 rem:308s step 8266 (49%) loss:3.3185 lr:0.72 dt:36ms tok/s:1809455 rem:308s step 8267 (49%) loss:3.3336 lr:0.72 dt:36ms tok/s:1813119 rem:308s step 8268 (49%) loss:3.3378 lr:0.72 dt:36ms tok/s:1815190 rem:308s step 8269 (49%) loss:3.3240 lr:0.72 dt:36ms tok/s:1819444 rem:308s step 8270 (49%) loss:3.3131 lr:0.72 dt:36ms tok/s:1822580 rem:308s step 8271 (49%) loss:3.3165 lr:0.72 dt:36ms tok/s:1821023 rem:308s step 8272 (49%) loss:3.3074 lr:0.72 dt:36ms tok/s:1820492 rem:308s step 8273 (49%) loss:3.3127 lr:0.71 dt:36ms tok/s:1812629 rem:308s step 8274 (49%) loss:3.3131 lr:0.71 dt:36ms tok/s:1817014 rem:308s step 8275 (49%) loss:3.2907 lr:0.71 dt:36ms tok/s:1812270 rem:308s step 8276 (49%) loss:3.2791 lr:0.71 dt:36ms tok/s:1816234 rem:308s step 8277 (49%) loss:3.2913 lr:0.71 dt:36ms tok/s:1811243 rem:308s step 8278 (49%) loss:3.2780 lr:0.71 dt:36ms tok/s:1811542 rem:308s step 8279 (49%) loss:3.2863 lr:0.71 dt:36ms tok/s:1815298 rem:308s step 8280 (49%) loss:3.2808 lr:0.71 dt:36ms tok/s:1801568 rem:308s step 8281 (49%) loss:3.2757 lr:0.71 dt:36ms tok/s:1815586 rem:308s step 8282 (49%) loss:3.2741 lr:0.71 dt:36ms tok/s:1819733 rem:308s step 8283 (49%) loss:3.2644 lr:0.71 dt:36ms tok/s:1821783 rem:308s step 8284 (49%) loss:3.2606 lr:0.71 dt:36ms tok/s:1818457 rem:307s step 8285 (49%) loss:3.2677 lr:0.71 dt:36ms tok/s:1819287 rem:307s step 8286 (49%) loss:3.2794 lr:0.71 dt:37ms tok/s:1770231 rem:307s step 8287 (49%) loss:3.2764 lr:0.71 dt:37ms tok/s:1769991 rem:307s step 8288 (49%) loss:3.2699 lr:0.71 dt:36ms tok/s:1801403 rem:307s step 8289 (49%) loss:3.2673 lr:0.71 dt:36ms tok/s:1814903 rem:307s step 8290 (49%) loss:3.2803 lr:0.71 dt:36ms tok/s:1819311 rem:307s step 8291 (49%) loss:3.2838 lr:0.71 dt:36ms tok/s:1809741 rem:307s step 8292 (49%) loss:3.2673 lr:0.71 dt:36ms tok/s:1819974 rem:307s step 8293 (49%) loss:3.2636 lr:0.71 dt:36ms tok/s:1816078 rem:307s step 8294 (49%) loss:3.2596 lr:0.71 dt:36ms tok/s:1813933 rem:307s step 8295 (49%) loss:3.2588 lr:0.71 dt:36ms tok/s:1807195 rem:307s step 8296 (49%) loss:3.2481 lr:0.71 dt:36ms tok/s:1812987 rem:307s step 8297 (49%) loss:3.2406 lr:0.71 dt:36ms tok/s:1809967 rem:307s step 8298 (49%) loss:3.2230 lr:0.71 dt:36ms tok/s:1804987 rem:307s step 8299 (49%) loss:3.2208 lr:0.71 dt:36ms tok/s:1809300 rem:307s step 8300 (49%) loss:3.2131 lr:0.71 dt:36ms tok/s:1812940 rem:307s + local: attn=[0.078, 0.819, 0.810] mlp=[0.498, 0.222, -0.202] + + transition: attn=[2.750, 0.920] mlp=[-0.163, 0.404] + + hierarchy: attn=[3.012, 5.939, 5.616] mlp=[1.132, -0.966, -3.117] + step 8301 (49%) loss:3.2338 lr:0.71 dt:36ms tok/s:1816762 rem:307s step 8302 (49%) loss:3.2378 lr:0.71 dt:36ms tok/s:1813298 rem:307s step 8303 (49%) loss:3.2194 lr:0.71 dt:36ms tok/s:1821192 rem:307s step 8304 (49%) loss:3.2372 lr:0.71 dt:36ms tok/s:1815035 rem:307s step 8305 (49%) loss:3.2724 lr:0.71 dt:36ms tok/s:1814148 rem:307s step 8306 (49%) loss:3.2779 lr:0.71 dt:36ms tok/s:1812820 rem:307s step 8307 (49%) loss:3.2710 lr:0.71 dt:36ms tok/s:1809955 rem:307s step 8308 (49%) loss:3.2728 lr:0.71 dt:36ms tok/s:1809419 rem:307s step 8309 (49%) loss:3.2768 lr:0.71 dt:36ms tok/s:1807741 rem:307s step 8310 (49%) loss:3.2558 lr:0.71 dt:36ms tok/s:1811172 rem:307s step 8311 (49%) loss:3.2505 lr:0.71 dt:36ms tok/s:1807397 rem:306s step 8312 (49%) loss:3.2543 lr:0.71 dt:36ms tok/s:1817122 rem:306s step 8313 (49%) loss:3.2622 lr:0.71 dt:36ms tok/s:1810921 rem:306s step 8314 (49%) loss:3.2630 lr:0.71 dt:36ms tok/s:1809240 rem:306s step 8315 (49%) loss:3.2538 lr:0.71 dt:36ms tok/s:1812127 rem:306s step 8316 (49%) loss:3.2743 lr:0.71 dt:36ms tok/s:1815682 rem:306s step 8317 (49%) loss:3.3107 lr:0.71 dt:36ms tok/s:1810229 rem:306s step 8318 (49%) loss:3.3357 lr:0.71 dt:36ms tok/s:1805699 rem:306s step 8319 (49%) loss:3.3329 lr:0.71 dt:37ms tok/s:1790409 rem:306s step 8320 (49%) loss:3.3288 lr:0.71 dt:36ms tok/s:1810206 rem:306s step 8321 (49%) loss:3.3239 lr:0.71 dt:36ms tok/s:1812533 rem:306s step 8322 (49%) loss:3.3178 lr:0.71 dt:36ms tok/s:1807955 rem:306s step 8323 (49%) loss:3.3123 lr:0.71 dt:36ms tok/s:1806221 rem:306s step 8324 (49%) loss:3.3231 lr:0.71 dt:36ms tok/s:1802927 rem:306s step 8325 (49%) loss:3.3053 lr:0.71 dt:36ms tok/s:1810098 rem:306s step 8326 (49%) loss:3.3097 lr:0.71 dt:36ms tok/s:1818384 rem:306s step 8327 (49%) loss:3.3468 lr:0.71 dt:36ms tok/s:1812450 rem:306s step 8328 (49%) loss:3.4398 lr:0.71 dt:36ms tok/s:1802596 rem:306s step 8329 (49%) loss:3.4320 lr:0.71 dt:36ms tok/s:1805877 rem:306s step 8330 (49%) loss:3.4218 lr:0.71 dt:36ms tok/s:1798268 rem:306s step 8331 (49%) loss:3.4188 lr:0.71 dt:36ms tok/s:1808824 rem:306s step 8332 (49%) loss:3.3939 lr:0.71 dt:36ms tok/s:1817266 rem:306s step 8333 (49%) loss:3.4139 lr:0.71 dt:36ms tok/s:1811482 rem:306s step 8334 (49%) loss:3.4151 lr:0.71 dt:36ms tok/s:1811757 rem:306s step 8335 (49%) loss:3.3815 lr:0.71 dt:36ms tok/s:1810372 rem:306s step 8336 (49%) loss:3.3799 lr:0.71 dt:36ms tok/s:1805983 rem:306s step 8337 (49%) loss:3.3656 lr:0.71 dt:36ms tok/s:1812533 rem:306s step 8338 (49%) loss:3.3637 lr:0.71 dt:36ms tok/s:1813442 rem:306s step 8339 (49%) loss:3.3553 lr:0.71 dt:36ms tok/s:1815742 rem:305s step 8340 (49%) loss:3.3249 lr:0.71 dt:36ms tok/s:1808812 rem:305s step 8341 (49%) loss:3.3347 lr:0.71 dt:36ms tok/s:1808027 rem:305s step 8342 (49%) loss:3.3348 lr:0.71 dt:36ms tok/s:1817675 rem:305s step 8343 (49%) loss:3.3047 lr:0.71 dt:36ms tok/s:1821590 rem:305s step 8344 (49%) loss:3.2938 lr:0.71 dt:36ms tok/s:1816558 rem:305s step 8345 (49%) loss:3.3113 lr:0.71 dt:36ms tok/s:1811697 rem:305s step 8346 (49%) loss:3.3553 lr:0.71 dt:36ms tok/s:1814052 rem:305s step 8347 (49%) loss:3.3976 lr:0.71 dt:36ms tok/s:1818264 rem:305s step 8348 (49%) loss:3.3932 lr:0.71 dt:36ms tok/s:1815370 rem:305s step 8349 (49%) loss:3.3917 lr:0.71 dt:36ms tok/s:1809717 rem:305s step 8350 (49%) loss:3.3734 lr:0.71 dt:36ms tok/s:1817158 rem:305s step 8351 (49%) loss:3.3645 lr:0.71 dt:36ms tok/s:1816042 rem:305s step 8352 (49%) loss:3.3458 lr:0.71 dt:36ms tok/s:1815634 rem:305s step 8353 (49%) loss:3.3410 lr:0.71 dt:41ms tok/s:1589386 rem:305s step 8354 (49%) loss:3.3153 lr:0.71 dt:36ms tok/s:1815502 rem:305s step 8355 (49%) loss:3.2854 lr:0.71 dt:36ms tok/s:1816126 rem:305s step 8356 (49%) loss:3.2857 lr:0.71 dt:36ms tok/s:1812378 rem:305s step 8357 (49%) loss:3.2848 lr:0.71 dt:36ms tok/s:1818048 rem:305s step 8358 (49%) loss:3.2752 lr:0.71 dt:48ms tok/s:1367674 rem:305s step 8359 (49%) loss:3.2758 lr:0.71 dt:35ms tok/s:1875839 rem:305s step 8360 (49%) loss:3.2944 lr:0.71 dt:34ms tok/s:1929943 rem:305s step 8361 (49%) loss:3.2767 lr:0.71 dt:34ms tok/s:1908411 rem:305s step 8362 (49%) loss:3.2717 lr:0.71 dt:35ms tok/s:1892056 rem:305s step 8363 (49%) loss:3.2625 lr:0.71 dt:34ms tok/s:1918654 rem:305s step 8364 (49%) loss:3.2668 lr:0.71 dt:35ms tok/s:1888533 rem:305s step 8365 (49%) loss:3.2656 lr:0.71 dt:35ms tok/s:1880381 rem:305s step 8366 (49%) loss:3.2820 lr:0.70 dt:34ms tok/s:1901190 rem:305s step 8367 (49%) loss:3.2925 lr:0.70 dt:35ms tok/s:1867974 rem:304s step 8368 (49%) loss:3.2831 lr:0.70 dt:35ms tok/s:1852427 rem:304s step 8369 (49%) loss:3.2790 lr:0.70 dt:35ms tok/s:1872338 rem:304s step 8370 (49%) loss:3.3092 lr:0.70 dt:35ms tok/s:1878171 rem:304s step 8371 (49%) loss:3.3179 lr:0.70 dt:35ms tok/s:1882351 rem:304s step 8372 (49%) loss:3.3116 lr:0.70 dt:35ms tok/s:1851042 rem:304s step 8373 (49%) loss:3.3053 lr:0.70 dt:35ms tok/s:1854126 rem:304s step 8374 (49%) loss:3.3260 lr:0.70 dt:35ms tok/s:1859696 rem:304s step 8375 (49%) loss:3.3139 lr:0.70 dt:35ms tok/s:1864679 rem:304s step 8376 (49%) loss:3.3040 lr:0.70 dt:35ms tok/s:1861976 rem:304s step 8377 (49%) loss:3.3245 lr:0.70 dt:35ms tok/s:1860892 rem:304s step 8378 (49%) loss:3.3159 lr:0.70 dt:35ms tok/s:1860073 rem:304s step 8379 (49%) loss:3.3056 lr:0.70 dt:35ms tok/s:1859042 rem:304s step 8380 (49%) loss:3.2895 lr:0.70 dt:49ms tok/s:1345330 rem:304s step 8381 (49%) loss:3.3047 lr:0.70 dt:35ms tok/s:1894025 rem:304s step 8382 (49%) loss:3.3029 lr:0.70 dt:35ms tok/s:1895448 rem:304s step 8383 (49%) loss:3.3002 lr:0.70 dt:34ms tok/s:1911928 rem:304s step 8384 (49%) loss:3.2897 lr:0.70 dt:34ms tok/s:1917169 rem:304s step 8385 (49%) loss:3.2917 lr:0.70 dt:34ms tok/s:1919780 rem:304s step 8386 (49%) loss:3.3018 lr:0.70 dt:34ms tok/s:1910400 rem:304s step 8387 (49%) loss:3.3144 lr:0.70 dt:36ms tok/s:1823124 rem:304s step 8388 (49%) loss:3.3208 lr:0.70 dt:37ms tok/s:1751395 rem:304s step 8389 (49%) loss:3.3208 lr:0.70 dt:34ms tok/s:1908291 rem:304s step 8390 (49%) loss:3.3114 lr:0.70 dt:35ms tok/s:1893568 rem:304s step 8391 (49%) loss:3.3413 lr:0.70 dt:34ms tok/s:1917891 rem:304s step 8392 (49%) loss:3.3499 lr:0.70 dt:37ms tok/s:1789546 rem:304s step 8393 (49%) loss:3.3484 lr:0.70 dt:34ms tok/s:1900809 rem:304s step 8394 (49%) loss:3.3317 lr:0.70 dt:34ms tok/s:1937302 rem:304s step 8395 (49%) loss:3.3224 lr:0.70 dt:34ms tok/s:1916113 rem:303s step 8396 (49%) loss:3.3149 lr:0.70 dt:34ms tok/s:1907007 rem:303s step 8397 (49%) loss:3.3038 lr:0.70 dt:34ms tok/s:1905791 rem:303s step 8398 (49%) loss:3.3096 lr:0.70 dt:34ms tok/s:1906068 rem:303s step 8399 (49%) loss:3.3011 lr:0.70 dt:34ms tok/s:1907907 rem:303s step 8400 (49%) loss:3.2685 lr:0.70 dt:35ms tok/s:1892395 rem:303s + local: attn=[0.067, 0.826, 0.833] mlp=[0.482, 0.219, -0.201] + + transition: attn=[2.782, 0.919] mlp=[-0.167, 0.387] + + hierarchy: attn=[2.978, 5.939, 5.616] mlp=[1.138, -0.940, -3.151] + step 8401 (49%) loss:3.2629 lr:0.70 dt:34ms tok/s:1904589 rem:303s step 8402 (49%) loss:3.2657 lr:0.70 dt:35ms tok/s:1897778 rem:303s step 8403 (49%) loss:3.2627 lr:0.70 dt:34ms tok/s:1910427 rem:303s step 8404 (49%) loss:3.2812 lr:0.70 dt:34ms tok/s:1902546 rem:303s step 8405 (49%) loss:3.2901 lr:0.70 dt:34ms tok/s:1903231 rem:303s step 8406 (49%) loss:3.2825 lr:0.70 dt:36ms tok/s:1829689 rem:303s step 8407 (49%) loss:3.2874 lr:0.70 dt:35ms tok/s:1876287 rem:303s step 8408 (49%) loss:3.2936 lr:0.70 dt:35ms tok/s:1891405 rem:303s step 8409 (49%) loss:3.2880 lr:0.70 dt:35ms tok/s:1894508 rem:303s step 8410 (50%) loss:3.2787 lr:0.70 dt:34ms tok/s:1910732 rem:303s step 8411 (50%) loss:3.2652 lr:0.70 dt:35ms tok/s:1896259 rem:303s step 8412 (50%) loss:3.2613 lr:0.70 dt:35ms tok/s:1888494 rem:303s step 8413 (50%) loss:3.2411 lr:0.70 dt:35ms tok/s:1884003 rem:303s step 8414 (50%) loss:3.2575 lr:0.70 dt:35ms tok/s:1881256 rem:303s step 8415 (50%) loss:3.2717 lr:0.70 dt:35ms tok/s:1882932 rem:303s step 8416 (50%) loss:3.2626 lr:0.70 dt:35ms tok/s:1873665 rem:303s step 8417 (50%) loss:3.2687 lr:0.70 dt:35ms tok/s:1899194 rem:303s step 8418 (50%) loss:3.2562 lr:0.70 dt:35ms tok/s:1881630 rem:303s step 8419 (50%) loss:3.2450 lr:0.70 dt:35ms tok/s:1886719 rem:303s step 8420 (50%) loss:3.2636 lr:0.70 dt:35ms tok/s:1883977 rem:303s step 8421 (50%) loss:3.2794 lr:0.70 dt:35ms tok/s:1882016 rem:303s step 8422 (50%) loss:3.3021 lr:0.70 dt:35ms tok/s:1884597 rem:303s step 8423 (50%) loss:3.2923 lr:0.70 dt:35ms tok/s:1876274 rem:303s step 8424 (50%) loss:3.3007 lr:0.70 dt:35ms tok/s:1874981 rem:302s step 8425 (50%) loss:3.2947 lr:0.70 dt:34ms tok/s:1907470 rem:302s step 8426 (50%) loss:3.3075 lr:0.70 dt:34ms tok/s:1908464 rem:302s step 8427 (50%) loss:3.3162 lr:0.70 dt:34ms tok/s:1905302 rem:302s step 8428 (50%) loss:3.3282 lr:0.70 dt:35ms tok/s:1876607 rem:302s step 8429 (50%) loss:3.3163 lr:0.70 dt:34ms tok/s:1910493 rem:302s step 8430 (50%) loss:3.3279 lr:0.70 dt:34ms tok/s:1914912 rem:302s step 8431 (50%) loss:3.3349 lr:0.70 dt:35ms tok/s:1890001 rem:302s step 8432 (50%) loss:3.3267 lr:0.70 dt:34ms tok/s:1913325 rem:302s step 8433 (50%) loss:3.3227 lr:0.70 dt:34ms tok/s:1921645 rem:302s step 8434 (50%) loss:3.3182 lr:0.70 dt:34ms tok/s:1914658 rem:302s step 8435 (50%) loss:3.3177 lr:0.70 dt:34ms tok/s:1915699 rem:302s step 8436 (50%) loss:3.3115 lr:0.70 dt:34ms tok/s:1920397 rem:302s step 8437 (50%) loss:3.3055 lr:0.70 dt:34ms tok/s:1917022 rem:302s step 8438 (50%) loss:3.3020 lr:0.70 dt:34ms tok/s:1913898 rem:302s step 8439 (50%) loss:3.2996 lr:0.70 dt:34ms tok/s:1914498 rem:302s step 8440 (50%) loss:3.2999 lr:0.70 dt:34ms tok/s:1911968 rem:302s step 8441 (50%) loss:3.2980 lr:0.70 dt:34ms tok/s:1915419 rem:302s step 8442 (50%) loss:3.2737 lr:0.70 dt:34ms tok/s:1918641 rem:302s step 8443 (50%) loss:3.2764 lr:0.70 dt:35ms tok/s:1896769 rem:302s step 8444 (50%) loss:3.2742 lr:0.70 dt:35ms tok/s:1893203 rem:302s step 8445 (50%) loss:3.2637 lr:0.70 dt:34ms tok/s:1903257 rem:302s step 8446 (50%) loss:3.2516 lr:0.70 dt:35ms tok/s:1894351 rem:302s step 8447 (50%) loss:3.2527 lr:0.70 dt:35ms tok/s:1887301 rem:302s step 8448 (50%) loss:3.2565 lr:0.70 dt:34ms tok/s:1900165 rem:302s step 8449 (50%) loss:3.2663 lr:0.70 dt:34ms tok/s:1901243 rem:302s step 8450 (50%) loss:3.2651 lr:0.70 dt:35ms tok/s:1894090 rem:302s step 8451 (50%) loss:3.2617 lr:0.70 dt:35ms tok/s:1896966 rem:302s step 8452 (50%) loss:3.2557 lr:0.70 dt:34ms tok/s:1901243 rem:302s step 8453 (50%) loss:3.2606 lr:0.70 dt:35ms tok/s:1893712 rem:301s step 8454 (50%) loss:3.2574 lr:0.70 dt:35ms tok/s:1866591 rem:301s step 8455 (50%) loss:3.2477 lr:0.70 dt:35ms tok/s:1867364 rem:301s step 8456 (50%) loss:3.2598 lr:0.70 dt:35ms tok/s:1873716 rem:301s step 8457 (50%) loss:3.2556 lr:0.70 dt:36ms tok/s:1798492 rem:301s step 8458 (50%) loss:3.2561 lr:0.70 dt:35ms tok/s:1884894 rem:301s step 8459 (50%) loss:3.2630 lr:0.70 dt:35ms tok/s:1874253 rem:301s step 8460 (50%) loss:3.2642 lr:0.70 dt:35ms tok/s:1882442 rem:301s step 8461 (50%) loss:3.2457 lr:0.70 dt:35ms tok/s:1880368 rem:301s step 8462 (50%) loss:3.2465 lr:0.69 dt:35ms tok/s:1878878 rem:301s step 8463 (50%) loss:3.2514 lr:0.69 dt:35ms tok/s:1873474 rem:301s step 8464 (50%) loss:3.2727 lr:0.69 dt:35ms tok/s:1871445 rem:301s step 8465 (50%) loss:3.2739 lr:0.69 dt:35ms tok/s:1865641 rem:301s step 8466 (50%) loss:3.2675 lr:0.69 dt:35ms tok/s:1871688 rem:301s step 8467 (50%) loss:3.2713 lr:0.69 dt:35ms tok/s:1876223 rem:301s step 8468 (50%) loss:3.2468 lr:0.69 dt:35ms tok/s:1862897 rem:301s step 8469 (50%) loss:3.2439 lr:0.69 dt:35ms tok/s:1869765 rem:301s step 8470 (50%) loss:3.2479 lr:0.69 dt:35ms tok/s:1873678 rem:301s step 8471 (50%) loss:3.2462 lr:0.69 dt:35ms tok/s:1856894 rem:301s step 8472 (50%) loss:3.2577 lr:0.69 dt:35ms tok/s:1853314 rem:301s step 8473 (50%) loss:3.2750 lr:0.69 dt:35ms tok/s:1870159 rem:301s step 8474 (50%) loss:3.2784 lr:0.69 dt:35ms tok/s:1867618 rem:301s step 8475 (50%) loss:3.2657 lr:0.69 dt:35ms tok/s:1872223 rem:301s step 8476 (50%) loss:3.2479 lr:0.69 dt:35ms tok/s:1848988 rem:301s step 8477 (50%) loss:3.2450 lr:0.69 dt:35ms tok/s:1849137 rem:301s step 8478 (50%) loss:3.2413 lr:0.69 dt:35ms tok/s:1851192 rem:301s step 8479 (50%) loss:3.2641 lr:0.69 dt:36ms tok/s:1840766 rem:301s step 8480 (50%) loss:3.2664 lr:0.69 dt:36ms tok/s:1842629 rem:301s step 8481 (50%) loss:3.2590 lr:0.69 dt:35ms tok/s:1851154 rem:300s step 8482 (50%) loss:3.2821 lr:0.69 dt:36ms tok/s:1836216 rem:300s step 8483 (50%) loss:3.2504 lr:0.69 dt:35ms tok/s:1848938 rem:300s step 8484 (50%) loss:3.2605 lr:0.69 dt:35ms tok/s:1848540 rem:300s step 8485 (50%) loss:3.2535 lr:0.69 dt:36ms tok/s:1824516 rem:300s step 8486 (50%) loss:3.2867 lr:0.69 dt:36ms tok/s:1816786 rem:300s step 8487 (50%) loss:3.2891 lr:0.69 dt:36ms tok/s:1822919 rem:300s step 8488 (50%) loss:3.2939 lr:0.69 dt:36ms tok/s:1813035 rem:300s step 8489 (50%) loss:3.2759 lr:0.69 dt:36ms tok/s:1817411 rem:300s step 8490 (50%) loss:3.2692 lr:0.69 dt:38ms tok/s:1723437 rem:300s step 8491 (50%) loss:3.2493 lr:0.69 dt:38ms tok/s:1743508 rem:300s step 8492 (50%) loss:3.2194 lr:0.69 dt:37ms tok/s:1762173 rem:300s step 8493 (50%) loss:3.2476 lr:0.69 dt:37ms tok/s:1766193 rem:300s step 8494 (50%) loss:3.2679 lr:0.69 dt:37ms tok/s:1772662 rem:300s step 8495 (50%) loss:3.2667 lr:0.69 dt:37ms tok/s:1773223 rem:300s step 8496 (50%) loss:3.2640 lr:0.69 dt:37ms tok/s:1757453 rem:300s step 8497 (50%) loss:3.2793 lr:0.69 dt:37ms tok/s:1767147 rem:300s step 8498 (50%) loss:3.2868 lr:0.69 dt:37ms tok/s:1763994 rem:300s step 8499 (50%) loss:3.3030 lr:0.69 dt:37ms tok/s:1765070 rem:300s step 8500 (50%) loss:3.2950 lr:0.69 dt:37ms tok/s:1754256 rem:300s + local: attn=[0.081, 0.820, 0.817] mlp=[0.486, 0.234, -0.202] + + transition: attn=[2.754, 0.895] mlp=[-0.172, 0.429] + + hierarchy: attn=[2.966, 5.939, 5.616] mlp=[1.131, -1.001, -3.106] + step 8501 (50%) loss:3.2986 lr:0.69 dt:37ms tok/s:1766624 rem:300s step 8502 (50%) loss:3.2966 lr:0.69 dt:37ms tok/s:1765750 rem:300s step 8503 (50%) loss:3.3056 lr:0.69 dt:40ms tok/s:1619053 rem:300s step 8504 (50%) loss:3.3128 lr:0.69 dt:38ms tok/s:1726067 rem:300s step 8505 (50%) loss:3.3264 lr:0.69 dt:36ms tok/s:1821904 rem:300s step 8506 (50%) loss:3.3126 lr:0.69 dt:36ms tok/s:1825182 rem:300s step 8507 (50%) loss:3.3093 lr:0.69 dt:36ms tok/s:1825364 rem:300s step 8508 (50%) loss:3.3136 lr:0.69 dt:36ms tok/s:1816090 rem:299s step 8509 (50%) loss:3.3226 lr:0.69 dt:36ms tok/s:1813873 rem:299s step 8510 (50%) loss:3.3202 lr:0.69 dt:36ms tok/s:1821928 rem:299s step 8511 (50%) loss:3.3490 lr:0.69 dt:36ms tok/s:1814196 rem:299s step 8512 (50%) loss:3.3382 lr:0.69 dt:36ms tok/s:1824480 rem:299s step 8513 (50%) loss:3.3251 lr:0.69 dt:36ms tok/s:1826832 rem:299s step 8514 (50%) loss:3.3182 lr:0.69 dt:36ms tok/s:1825049 rem:299s step 8515 (50%) loss:3.3162 lr:0.69 dt:36ms tok/s:1813801 rem:299s step 8516 (50%) loss:3.3197 lr:0.69 dt:36ms tok/s:1821578 rem:299s step 8517 (50%) loss:3.3109 lr:0.69 dt:36ms tok/s:1823717 rem:299s step 8518 (50%) loss:3.3020 lr:0.69 dt:36ms tok/s:1826261 rem:299s step 8519 (50%) loss:3.2710 lr:0.69 dt:36ms tok/s:1826856 rem:299s step 8520 (50%) loss:3.2712 lr:0.69 dt:36ms tok/s:1815047 rem:299s step 8521 (50%) loss:3.2857 lr:0.69 dt:36ms tok/s:1813753 rem:299s step 8522 (50%) loss:3.2901 lr:0.69 dt:36ms tok/s:1815802 rem:299s step 8523 (50%) loss:3.2897 lr:0.69 dt:36ms tok/s:1817759 rem:299s step 8524 (50%) loss:3.2713 lr:0.69 dt:40ms tok/s:1623940 rem:299s step 8525 (50%) loss:3.2747 lr:0.69 dt:36ms tok/s:1813382 rem:299s step 8526 (50%) loss:3.2648 lr:0.69 dt:37ms tok/s:1795344 rem:299s step 8527 (50%) loss:3.2469 lr:0.69 dt:36ms tok/s:1822725 rem:299s step 8528 (50%) loss:3.2423 lr:0.69 dt:36ms tok/s:1816582 rem:299s step 8529 (50%) loss:3.2610 lr:0.69 dt:36ms tok/s:1811661 rem:299s step 8530 (50%) loss:3.2433 lr:0.69 dt:36ms tok/s:1811948 rem:299s step 8531 (50%) loss:3.2645 lr:0.69 dt:36ms tok/s:1822858 rem:299s step 8532 (50%) loss:3.2503 lr:0.69 dt:36ms tok/s:1805651 rem:299s step 8533 (50%) loss:3.2577 lr:0.69 dt:36ms tok/s:1810194 rem:299s step 8534 (50%) loss:3.2261 lr:0.69 dt:36ms tok/s:1820721 rem:299s step 8535 (50%) loss:3.2346 lr:0.69 dt:36ms tok/s:1819323 rem:299s step 8536 (50%) loss:3.2333 lr:0.69 dt:36ms tok/s:1821735 rem:298s step 8537 (50%) loss:3.2429 lr:0.69 dt:36ms tok/s:1819058 rem:298s step 8538 (50%) loss:3.2511 lr:0.69 dt:36ms tok/s:1808955 rem:298s step 8539 (50%) loss:3.2491 lr:0.69 dt:36ms tok/s:1811936 rem:298s step 8540 (50%) loss:3.2552 lr:0.69 dt:36ms tok/s:1814843 rem:298s step 8541 (50%) loss:3.2549 lr:0.69 dt:36ms tok/s:1818264 rem:298s step 8542 (50%) loss:3.2689 lr:0.69 dt:36ms tok/s:1807206 rem:298s step 8543 (50%) loss:3.2513 lr:0.69 dt:36ms tok/s:1814531 rem:298s step 8544 (50%) loss:3.2317 lr:0.69 dt:36ms tok/s:1818697 rem:298s step 8545 (50%) loss:3.2403 lr:0.69 dt:36ms tok/s:1815226 rem:298s step 8546 (50%) loss:3.2511 lr:0.69 dt:36ms tok/s:1810885 rem:298s step 8547 (50%) loss:3.2447 lr:0.69 dt:36ms tok/s:1818384 rem:298s step 8548 (50%) loss:3.2399 lr:0.69 dt:36ms tok/s:1811303 rem:298s step 8549 (50%) loss:3.2518 lr:0.69 dt:36ms tok/s:1818312 rem:298s step 8550 (50%) loss:3.2570 lr:0.69 dt:36ms tok/s:1798539 rem:298s step 8551 (50%) loss:3.2469 lr:0.69 dt:36ms tok/s:1817579 rem:298s step 8552 (50%) loss:3.2490 lr:0.69 dt:36ms tok/s:1814711 rem:298s step 8553 (50%) loss:3.2416 lr:0.68 dt:36ms tok/s:1801958 rem:298s step 8554 (50%) loss:3.2581 lr:0.68 dt:36ms tok/s:1802573 rem:298s step 8555 (50%) loss:3.2386 lr:0.68 dt:36ms tok/s:1809467 rem:298s step 8556 (50%) loss:3.2334 lr:0.68 dt:36ms tok/s:1817735 rem:298s step 8557 (50%) loss:3.2305 lr:0.68 dt:36ms tok/s:1809598 rem:298s step 8558 (50%) loss:3.2242 lr:0.68 dt:38ms tok/s:1725287 rem:298s step 8559 (50%) loss:3.2065 lr:0.68 dt:36ms tok/s:1808919 rem:298s step 8560 (50%) loss:3.2068 lr:0.68 dt:37ms tok/s:1782121 rem:298s step 8561 (50%) loss:3.2118 lr:0.68 dt:37ms tok/s:1758352 rem:298s step 8562 (50%) loss:3.2128 lr:0.68 dt:36ms tok/s:1811219 rem:298s step 8563 (50%) loss:3.2055 lr:0.68 dt:36ms tok/s:1808181 rem:297s step 8564 (50%) loss:3.2003 lr:0.68 dt:36ms tok/s:1810754 rem:297s step 8565 (50%) loss:3.1922 lr:0.68 dt:36ms tok/s:1814507 rem:297s step 8566 (50%) loss:3.1962 lr:0.68 dt:36ms tok/s:1814232 rem:297s step 8567 (50%) loss:3.2033 lr:0.68 dt:37ms tok/s:1792838 rem:297s step 8568 (50%) loss:3.2036 lr:0.68 dt:36ms tok/s:1816738 rem:297s step 8569 (50%) loss:3.1970 lr:0.68 dt:36ms tok/s:1813849 rem:297s step 8570 (50%) loss:3.2019 lr:0.68 dt:36ms tok/s:1808776 rem:297s step 8571 (50%) loss:3.2195 lr:0.68 dt:36ms tok/s:1808062 rem:297s step 8572 (50%) loss:3.2151 lr:0.68 dt:36ms tok/s:1812748 rem:297s step 8573 (50%) loss:3.1940 lr:0.68 dt:36ms tok/s:1803081 rem:297s step 8574 (50%) loss:3.2144 lr:0.68 dt:36ms tok/s:1816258 rem:297s step 8575 (50%) loss:3.1933 lr:0.68 dt:36ms tok/s:1803341 rem:297s step 8576 (50%) loss:3.1633 lr:0.68 dt:36ms tok/s:1807278 rem:297s step 8577 (50%) loss:3.1907 lr:0.68 dt:36ms tok/s:1813191 rem:297s step 8578 (51%) loss:3.2037 lr:0.68 dt:36ms tok/s:1819745 rem:297s step 8579 (51%) loss:3.2199 lr:0.68 dt:36ms tok/s:1806470 rem:297s step 8580 (51%) loss:3.2346 lr:0.68 dt:36ms tok/s:1807016 rem:297s step 8581 (51%) loss:3.2334 lr:0.68 dt:36ms tok/s:1811076 rem:297s step 8582 (51%) loss:3.2386 lr:0.68 dt:36ms tok/s:1814999 rem:297s step 8583 (51%) loss:3.2501 lr:0.68 dt:36ms tok/s:1815682 rem:297s step 8584 (51%) loss:3.2651 lr:0.68 dt:36ms tok/s:1817110 rem:297s step 8585 (51%) loss:3.2682 lr:0.68 dt:36ms tok/s:1809383 rem:297s step 8586 (51%) loss:3.2681 lr:0.68 dt:36ms tok/s:1815130 rem:297s step 8587 (51%) loss:3.2509 lr:0.68 dt:36ms tok/s:1810241 rem:297s step 8588 (51%) loss:3.2670 lr:0.68 dt:36ms tok/s:1812438 rem:297s step 8589 (51%) loss:3.2517 lr:0.68 dt:36ms tok/s:1808586 rem:297s step 8590 (51%) loss:3.2518 lr:0.68 dt:36ms tok/s:1810504 rem:297s step 8591 (51%) loss:3.2448 lr:0.68 dt:36ms tok/s:1810694 rem:296s step 8592 (51%) loss:3.2351 lr:0.68 dt:36ms tok/s:1816210 rem:296s step 8593 (51%) loss:3.2354 lr:0.68 dt:36ms tok/s:1817351 rem:296s step 8594 (51%) loss:3.2483 lr:0.68 dt:36ms tok/s:1807218 rem:296s step 8595 (51%) loss:3.2514 lr:0.68 dt:36ms tok/s:1813035 rem:296s step 8596 (51%) loss:3.2525 lr:0.68 dt:36ms tok/s:1822351 rem:296s step 8597 (51%) loss:3.2441 lr:0.68 dt:36ms tok/s:1812521 rem:296s step 8598 (51%) loss:3.2538 lr:0.68 dt:36ms tok/s:1813657 rem:296s step 8599 (51%) loss:3.2696 lr:0.68 dt:36ms tok/s:1807004 rem:296s step 8600 (51%) loss:3.2654 lr:0.68 dt:36ms tok/s:1815742 rem:296s + local: attn=[0.086, 0.776, 0.836] mlp=[0.506, 0.221, -0.194] + + transition: attn=[2.807, 0.925] mlp=[-0.177, 0.433] + + hierarchy: attn=[2.893, 5.939, 5.616] mlp=[1.147, -0.914, -3.065] + step 8601 (51%) loss:3.2517 lr:0.68 dt:36ms tok/s:1816510 rem:296s step 8602 (51%) loss:3.2539 lr:0.68 dt:37ms tok/s:1784134 rem:296s step 8603 (51%) loss:3.2515 lr:0.68 dt:36ms tok/s:1806886 rem:296s step 8604 (51%) loss:3.2384 lr:0.68 dt:37ms tok/s:1788894 rem:296s step 8605 (51%) loss:3.2479 lr:0.68 dt:36ms tok/s:1815190 rem:296s step 8606 (51%) loss:3.2452 lr:0.68 dt:36ms tok/s:1809884 rem:296s step 8607 (51%) loss:3.2313 lr:0.68 dt:36ms tok/s:1816462 rem:296s step 8608 (51%) loss:3.2256 lr:0.68 dt:36ms tok/s:1808871 rem:296s step 8609 (51%) loss:3.2297 lr:0.68 dt:36ms tok/s:1811100 rem:296s step 8610 (51%) loss:3.2227 lr:0.68 dt:36ms tok/s:1816162 rem:296s step 8611 (51%) loss:3.2040 lr:0.68 dt:36ms tok/s:1809967 rem:296s step 8612 (51%) loss:3.2089 lr:0.68 dt:36ms tok/s:1811005 rem:296s step 8613 (51%) loss:3.2021 lr:0.68 dt:36ms tok/s:1802738 rem:296s step 8614 (51%) loss:3.2097 lr:0.68 dt:36ms tok/s:1817230 rem:296s step 8615 (51%) loss:3.2265 lr:0.68 dt:36ms tok/s:1815538 rem:296s step 8616 (51%) loss:3.2513 lr:0.68 dt:36ms tok/s:1813741 rem:296s step 8617 (51%) loss:3.2567 lr:0.68 dt:36ms tok/s:1812438 rem:296s step 8618 (51%) loss:3.2575 lr:0.68 dt:36ms tok/s:1810253 rem:296s step 8619 (51%) loss:3.2451 lr:0.68 dt:36ms tok/s:1816438 rem:295s step 8620 (51%) loss:3.1923 lr:0.68 dt:36ms tok/s:1814447 rem:295s step 8621 (51%) loss:3.1182 lr:0.68 dt:36ms tok/s:1810718 rem:295s step 8622 (51%) loss:3.0977 lr:0.68 dt:36ms tok/s:1809931 rem:295s step 8623 (51%) loss:3.2698 lr:0.68 dt:36ms tok/s:1814016 rem:295s step 8624 (51%) loss:3.3229 lr:0.68 dt:36ms tok/s:1813310 rem:295s step 8625 (51%) loss:3.3357 lr:0.68 dt:36ms tok/s:1813909 rem:295s step 8626 (51%) loss:3.3538 lr:0.68 dt:37ms tok/s:1778915 rem:295s step 8627 (51%) loss:3.3599 lr:0.68 dt:36ms tok/s:1815370 rem:295s step 8628 (51%) loss:3.3505 lr:0.68 dt:36ms tok/s:1810850 rem:295s step 8629 (51%) loss:3.3335 lr:0.68 dt:36ms tok/s:1810086 rem:295s step 8630 (51%) loss:3.3390 lr:0.68 dt:36ms tok/s:1814304 rem:295s step 8631 (51%) loss:3.3405 lr:0.68 dt:36ms tok/s:1808895 rem:295s step 8632 (51%) loss:3.3302 lr:0.68 dt:36ms tok/s:1814436 rem:295s step 8633 (51%) loss:3.3292 lr:0.68 dt:36ms tok/s:1818457 rem:295s step 8634 (51%) loss:3.3147 lr:0.68 dt:36ms tok/s:1809407 rem:295s step 8635 (51%) loss:3.3050 lr:0.68 dt:36ms tok/s:1811148 rem:295s step 8636 (51%) loss:3.2894 lr:0.68 dt:36ms tok/s:1815934 rem:295s step 8637 (51%) loss:3.2688 lr:0.68 dt:36ms tok/s:1814711 rem:295s step 8638 (51%) loss:3.2638 lr:0.68 dt:36ms tok/s:1813191 rem:295s step 8639 (51%) loss:3.2521 lr:0.68 dt:36ms tok/s:1806850 rem:295s step 8640 (51%) loss:3.2563 lr:0.68 dt:36ms tok/s:1809622 rem:295s step 8641 (51%) loss:3.2521 lr:0.68 dt:36ms tok/s:1813789 rem:295s step 8642 (51%) loss:3.2605 lr:0.68 dt:36ms tok/s:1813179 rem:295s step 8643 (51%) loss:3.2586 lr:0.68 dt:36ms tok/s:1818300 rem:295s step 8644 (51%) loss:3.2599 lr:0.67 dt:36ms tok/s:1818084 rem:295s step 8645 (51%) loss:3.2664 lr:0.67 dt:36ms tok/s:1812820 rem:295s step 8646 (51%) loss:3.2770 lr:0.67 dt:36ms tok/s:1819383 rem:294s step 8647 (51%) loss:3.2779 lr:0.67 dt:36ms tok/s:1820552 rem:294s step 8648 (51%) loss:3.2534 lr:0.67 dt:36ms tok/s:1814591 rem:294s step 8649 (51%) loss:3.2544 lr:0.67 dt:36ms tok/s:1810766 rem:294s step 8650 (51%) loss:3.2580 lr:0.67 dt:36ms tok/s:1816762 rem:294s step 8651 (51%) loss:3.2514 lr:0.67 dt:36ms tok/s:1806838 rem:294s step 8652 (51%) loss:3.2527 lr:0.67 dt:36ms tok/s:1815418 rem:294s step 8653 (51%) loss:3.2557 lr:0.67 dt:36ms tok/s:1809455 rem:294s step 8654 (51%) loss:3.2477 lr:0.67 dt:36ms tok/s:1813550 rem:294s step 8655 (51%) loss:3.2549 lr:0.67 dt:37ms tok/s:1783995 rem:294s step 8656 (51%) loss:3.2583 lr:0.67 dt:38ms tok/s:1714034 rem:294s step 8657 (51%) loss:3.2717 lr:0.67 dt:36ms tok/s:1803613 rem:294s step 8658 (51%) loss:3.2708 lr:0.67 dt:36ms tok/s:1811578 rem:294s step 8659 (51%) loss:3.2745 lr:0.67 dt:36ms tok/s:1801769 rem:294s step 8660 (51%) loss:3.2587 lr:0.67 dt:36ms tok/s:1807646 rem:294s step 8661 (51%) loss:3.2432 lr:0.67 dt:36ms tok/s:1809526 rem:294s step 8662 (51%) loss:3.2311 lr:0.67 dt:36ms tok/s:1811422 rem:294s step 8663 (51%) loss:3.2352 lr:0.67 dt:36ms tok/s:1809765 rem:294s step 8664 (51%) loss:3.2365 lr:0.67 dt:36ms tok/s:1813969 rem:294s step 8665 (51%) loss:3.2412 lr:0.67 dt:36ms tok/s:1811040 rem:294s step 8666 (51%) loss:3.2161 lr:0.67 dt:36ms tok/s:1801238 rem:294s step 8667 (51%) loss:3.2126 lr:0.67 dt:37ms tok/s:1781717 rem:294s step 8668 (51%) loss:3.1994 lr:0.67 dt:36ms tok/s:1814076 rem:294s step 8669 (51%) loss:3.1904 lr:0.67 dt:36ms tok/s:1812940 rem:294s step 8670 (51%) loss:3.2015 lr:0.67 dt:36ms tok/s:1810349 rem:294s step 8671 (51%) loss:3.2023 lr:0.67 dt:36ms tok/s:1819299 rem:294s step 8672 (51%) loss:3.1848 lr:0.67 dt:36ms tok/s:1815646 rem:294s step 8673 (51%) loss:3.1756 lr:0.67 dt:36ms tok/s:1811613 rem:294s step 8674 (51%) loss:3.1892 lr:0.67 dt:36ms tok/s:1820275 rem:293s step 8675 (51%) loss:3.1945 lr:0.67 dt:36ms tok/s:1812282 rem:293s step 8676 (51%) loss:3.1632 lr:0.67 dt:36ms tok/s:1815742 rem:293s step 8677 (51%) loss:3.1909 lr:0.67 dt:36ms tok/s:1818048 rem:293s step 8678 (51%) loss:3.2252 lr:0.67 dt:36ms tok/s:1814831 rem:293s step 8679 (51%) loss:3.2203 lr:0.67 dt:36ms tok/s:1812557 rem:293s step 8680 (51%) loss:3.2430 lr:0.67 dt:36ms tok/s:1811351 rem:293s step 8681 (51%) loss:3.2377 lr:0.67 dt:36ms tok/s:1813693 rem:293s step 8682 (51%) loss:3.2215 lr:0.67 dt:36ms tok/s:1813669 rem:293s step 8683 (51%) loss:3.2183 lr:0.67 dt:36ms tok/s:1812820 rem:293s step 8684 (51%) loss:3.2116 lr:0.67 dt:36ms tok/s:1813897 rem:293s step 8685 (51%) loss:3.2014 lr:0.67 dt:36ms tok/s:1816498 rem:293s step 8686 (51%) loss:3.2135 lr:0.67 dt:36ms tok/s:1810349 rem:293s step 8687 (51%) loss:3.2109 lr:0.67 dt:36ms tok/s:1818048 rem:293s step 8688 (51%) loss:3.2332 lr:0.67 dt:36ms tok/s:1819239 rem:293s step 8689 (51%) loss:3.2195 lr:0.67 dt:36ms tok/s:1813143 rem:293s step 8690 (51%) loss:3.1881 lr:0.67 dt:36ms tok/s:1811506 rem:293s step 8691 (51%) loss:3.1628 lr:0.67 dt:36ms tok/s:1812629 rem:293s step 8692 (51%) loss:3.1678 lr:0.67 dt:36ms tok/s:1819191 rem:293s step 8693 (51%) loss:3.1911 lr:0.67 dt:36ms tok/s:1817098 rem:293s step 8694 (51%) loss:3.1938 lr:0.67 dt:36ms tok/s:1814747 rem:293s step 8695 (51%) loss:3.2018 lr:0.67 dt:36ms tok/s:1808193 rem:293s step 8696 (51%) loss:3.2566 lr:0.67 dt:36ms tok/s:1819335 rem:293s step 8697 (51%) loss:3.2639 lr:0.67 dt:36ms tok/s:1814471 rem:293s step 8698 (51%) loss:3.2526 lr:0.67 dt:36ms tok/s:1796564 rem:293s step 8699 (51%) loss:3.2465 lr:0.67 dt:36ms tok/s:1798668 rem:293s step 8700 (51%) loss:3.2615 lr:0.67 dt:36ms tok/s:1814112 rem:293s + local: attn=[0.080, 0.801, 0.841] mlp=[0.527, 0.211, -0.221] + + transition: attn=[2.877, 0.937] mlp=[-0.146, 0.413] + + hierarchy: attn=[3.054, 5.939, 5.616] mlp=[1.162, -0.960, -2.968] + step 8701 (51%) loss:3.2656 lr:0.67 dt:37ms tok/s:1769331 rem:293s step 8702 (51%) loss:3.2678 lr:0.67 dt:36ms tok/s:1808383 rem:292s step 8703 (51%) loss:3.3081 lr:0.67 dt:36ms tok/s:1814028 rem:292s step 8704 (51%) loss:3.3183 lr:0.67 dt:36ms tok/s:1804430 rem:292s step 8705 (51%) loss:3.3043 lr:0.67 dt:36ms tok/s:1812115 rem:292s step 8706 (51%) loss:3.3026 lr:0.67 dt:37ms tok/s:1767510 rem:292s step 8707 (51%) loss:3.2978 lr:0.67 dt:37ms tok/s:1787336 rem:292s step 8708 (51%) loss:3.2950 lr:0.67 dt:36ms tok/s:1815994 rem:292s step 8709 (51%) loss:3.2927 lr:0.67 dt:36ms tok/s:1814903 rem:292s step 8710 (51%) loss:3.2856 lr:0.67 dt:36ms tok/s:1816030 rem:292s step 8711 (51%) loss:3.2707 lr:0.67 dt:36ms tok/s:1814567 rem:292s step 8712 (51%) loss:3.2629 lr:0.67 dt:36ms tok/s:1809050 rem:292s step 8713 (51%) loss:3.2590 lr:0.67 dt:36ms tok/s:1807373 rem:292s step 8714 (51%) loss:3.2363 lr:0.67 dt:36ms tok/s:1815406 rem:292s step 8715 (51%) loss:3.2345 lr:0.67 dt:36ms tok/s:1810146 rem:292s step 8716 (51%) loss:3.2419 lr:0.67 dt:36ms tok/s:1811912 rem:292s step 8717 (51%) loss:3.2358 lr:0.67 dt:37ms tok/s:1787522 rem:292s step 8718 (51%) loss:3.2178 lr:0.67 dt:36ms tok/s:1807183 rem:292s step 8719 (51%) loss:3.1877 lr:0.67 dt:36ms tok/s:1809145 rem:292s step 8720 (51%) loss:3.1539 lr:0.67 dt:36ms tok/s:1806304 rem:292s step 8721 (51%) loss:3.1469 lr:0.67 dt:36ms tok/s:1816018 rem:292s step 8722 (51%) loss:3.1655 lr:0.67 dt:37ms tok/s:1795402 rem:292s step 8723 (51%) loss:3.1650 lr:0.67 dt:36ms tok/s:1810122 rem:292s step 8724 (51%) loss:3.1834 lr:0.67 dt:36ms tok/s:1807527 rem:292s step 8725 (51%) loss:3.1913 lr:0.67 dt:36ms tok/s:1814076 rem:292s step 8726 (51%) loss:3.2037 lr:0.67 dt:36ms tok/s:1809479 rem:292s step 8727 (51%) loss:3.2186 lr:0.67 dt:36ms tok/s:1813657 rem:292s step 8728 (51%) loss:3.2124 lr:0.67 dt:36ms tok/s:1817567 rem:292s step 8729 (51%) loss:3.1923 lr:0.67 dt:36ms tok/s:1810826 rem:291s step 8730 (51%) loss:3.2003 lr:0.67 dt:36ms tok/s:1817278 rem:291s step 8731 (51%) loss:3.2117 lr:0.67 dt:36ms tok/s:1815346 rem:291s step 8732 (51%) loss:3.2210 lr:0.67 dt:36ms tok/s:1817807 rem:291s step 8733 (51%) loss:3.2212 lr:0.66 dt:36ms tok/s:1809538 rem:291s step 8734 (51%) loss:3.2237 lr:0.66 dt:36ms tok/s:1811649 rem:291s step 8735 (51%) loss:3.2014 lr:0.66 dt:36ms tok/s:1816174 rem:291s step 8736 (51%) loss:3.1971 lr:0.66 dt:36ms tok/s:1809288 rem:291s step 8737 (51%) loss:3.1919 lr:0.66 dt:36ms tok/s:1806268 rem:291s step 8738 (51%) loss:3.2033 lr:0.66 dt:36ms tok/s:1818481 rem:291s step 8739 (51%) loss:3.2021 lr:0.66 dt:36ms tok/s:1810396 rem:291s step 8740 (51%) loss:3.2176 lr:0.66 dt:36ms tok/s:1817579 rem:291s step 8741 (51%) loss:3.2074 lr:0.66 dt:36ms tok/s:1813753 rem:291s step 8742 (51%) loss:3.1864 lr:0.66 dt:36ms tok/s:1811184 rem:291s step 8743 (51%) loss:3.1841 lr:0.66 dt:36ms tok/s:1816294 rem:291s step 8744 (52%) loss:3.1840 lr:0.66 dt:36ms tok/s:1811984 rem:291s step 8745 (52%) loss:3.1882 lr:0.66 dt:36ms tok/s:1814304 rem:291s step 8746 (52%) loss:3.1912 lr:0.66 dt:36ms tok/s:1815298 rem:291s step 8747 (52%) loss:3.1785 lr:0.66 dt:36ms tok/s:1815886 rem:291s step 8748 (52%) loss:3.1788 lr:0.66 dt:36ms tok/s:1807183 rem:291s step 8749 (52%) loss:3.1585 lr:0.66 dt:36ms tok/s:1811625 rem:291s step 8750 (52%) loss:3.1641 lr:0.66 dt:36ms tok/s:1818264 rem:291s step 8751 (52%) loss:3.1962 lr:0.66 dt:36ms tok/s:1815646 rem:291s step 8752 (52%) loss:3.1906 lr:0.66 dt:36ms tok/s:1814975 rem:291s step 8753 (52%) loss:3.1912 lr:0.66 dt:36ms tok/s:1819913 rem:291s step 8754 (52%) loss:3.1979 lr:0.66 dt:36ms tok/s:1818806 rem:291s step 8755 (52%) loss:3.1808 lr:0.66 dt:36ms tok/s:1812724 rem:291s step 8756 (52%) loss:3.1554 lr:0.66 dt:36ms tok/s:1813921 rem:291s step 8757 (52%) loss:3.1550 lr:0.66 dt:36ms tok/s:1811948 rem:290s step 8758 (52%) loss:3.1559 lr:0.66 dt:36ms tok/s:1814987 rem:290s step 8759 (52%) loss:3.1704 lr:0.66 dt:36ms tok/s:1815754 rem:290s step 8760 (52%) loss:3.1747 lr:0.66 dt:36ms tok/s:1815742 rem:290s step 8761 (52%) loss:3.1938 lr:0.66 dt:36ms tok/s:1810635 rem:290s step 8762 (52%) loss:3.2068 lr:0.66 dt:36ms tok/s:1808419 rem:290s step 8763 (52%) loss:3.2617 lr:0.66 dt:38ms tok/s:1737139 rem:290s step 8764 (52%) loss:3.2691 lr:0.66 dt:36ms tok/s:1822991 rem:290s step 8765 (52%) loss:3.2631 lr:0.66 dt:36ms tok/s:1817230 rem:290s step 8766 (52%) loss:3.2413 lr:0.66 dt:37ms tok/s:1775273 rem:290s step 8767 (52%) loss:3.2354 lr:0.66 dt:36ms tok/s:1797046 rem:290s step 8768 (52%) loss:3.2153 lr:0.66 dt:36ms tok/s:1817879 rem:290s step 8769 (52%) loss:3.2140 lr:0.66 dt:36ms tok/s:1816702 rem:290s step 8770 (52%) loss:3.2210 lr:0.66 dt:36ms tok/s:1809229 rem:290s step 8771 (52%) loss:3.2270 lr:0.66 dt:36ms tok/s:1817651 rem:290s step 8772 (52%) loss:3.2271 lr:0.66 dt:36ms tok/s:1808336 rem:290s step 8773 (52%) loss:3.1964 lr:0.66 dt:37ms tok/s:1794348 rem:290s step 8774 (52%) loss:3.2010 lr:0.66 dt:36ms tok/s:1819179 rem:290s step 8775 (52%) loss:3.2117 lr:0.66 dt:36ms tok/s:1796952 rem:290s step 8776 (52%) loss:3.2146 lr:0.66 dt:36ms tok/s:1814915 rem:290s step 8777 (52%) loss:3.2254 lr:0.66 dt:36ms tok/s:1813609 rem:290s step 8778 (52%) loss:3.2193 lr:0.66 dt:36ms tok/s:1814244 rem:290s step 8779 (52%) loss:3.2206 lr:0.66 dt:36ms tok/s:1809657 rem:290s step 8780 (52%) loss:3.1954 lr:0.66 dt:36ms tok/s:1808705 rem:290s step 8781 (52%) loss:3.1826 lr:0.66 dt:36ms tok/s:1802407 rem:290s step 8782 (52%) loss:3.1798 lr:0.66 dt:36ms tok/s:1819769 rem:290s step 8783 (52%) loss:3.1913 lr:0.66 dt:36ms tok/s:1818854 rem:290s step 8784 (52%) loss:3.1903 lr:0.66 dt:36ms tok/s:1811458 rem:289s step 8785 (52%) loss:3.1730 lr:0.66 dt:36ms tok/s:1814819 rem:289s step 8786 (52%) loss:3.1725 lr:0.66 dt:36ms tok/s:1812079 rem:289s step 8787 (52%) loss:3.1405 lr:0.66 dt:40ms tok/s:1618290 rem:289s step 8788 (52%) loss:3.1348 lr:0.66 dt:41ms tok/s:1610874 rem:289s step 8789 (52%) loss:3.1455 lr:0.66 dt:35ms tok/s:1867149 rem:289s step 8790 (52%) loss:3.1408 lr:0.66 dt:35ms tok/s:1882390 rem:289s step 8791 (52%) loss:3.1584 lr:0.66 dt:35ms tok/s:1858828 rem:289s step 8792 (52%) loss:3.1533 lr:0.66 dt:36ms tok/s:1843259 rem:289s step 8793 (52%) loss:3.1703 lr:0.66 dt:35ms tok/s:1877594 rem:289s step 8794 (52%) loss:3.1869 lr:0.66 dt:35ms tok/s:1862405 rem:289s step 8795 (52%) loss:3.2037 lr:0.66 dt:35ms tok/s:1857785 rem:289s step 8796 (52%) loss:3.1792 lr:0.66 dt:35ms tok/s:1854039 rem:289s step 8797 (52%) loss:3.1764 lr:0.66 dt:36ms tok/s:1819829 rem:289s step 8798 (52%) loss:3.1694 lr:0.66 dt:36ms tok/s:1832104 rem:289s step 8799 (52%) loss:3.1917 lr:0.66 dt:35ms tok/s:1877876 rem:289s step 8800 (52%) loss:3.1961 lr:0.66 dt:35ms tok/s:1879314 rem:289s + local: attn=[0.076, 0.839, 0.840] mlp=[0.513, 0.214, -0.215] + + transition: attn=[2.766, 0.948] mlp=[-0.179, 0.437] + + hierarchy: attn=[3.018, 5.939, 5.616] mlp=[1.128, -0.934, -3.005] + step 8801 (52%) loss:3.1957 lr:0.66 dt:35ms tok/s:1871127 rem:289s step 8802 (52%) loss:3.1838 lr:0.66 dt:35ms tok/s:1880343 rem:289s step 8803 (52%) loss:3.1802 lr:0.66 dt:35ms tok/s:1860741 rem:289s step 8804 (52%) loss:3.1732 lr:0.66 dt:35ms tok/s:1891444 rem:289s step 8805 (52%) loss:3.1674 lr:0.66 dt:35ms tok/s:1862354 rem:289s step 8806 (52%) loss:3.1579 lr:0.66 dt:35ms tok/s:1862695 rem:289s step 8807 (52%) loss:3.1716 lr:0.66 dt:35ms tok/s:1858753 rem:289s step 8808 (52%) loss:3.1834 lr:0.66 dt:36ms tok/s:1840667 rem:289s step 8809 (52%) loss:3.1749 lr:0.66 dt:40ms tok/s:1638822 rem:289s step 8810 (52%) loss:3.1731 lr:0.66 dt:34ms tok/s:1923676 rem:289s step 8811 (52%) loss:3.1849 lr:0.66 dt:34ms tok/s:1920464 rem:289s step 8812 (52%) loss:3.1808 lr:0.66 dt:34ms tok/s:1922613 rem:288s step 8813 (52%) loss:3.2014 lr:0.66 dt:34ms tok/s:1928954 rem:288s step 8814 (52%) loss:3.2304 lr:0.66 dt:34ms tok/s:1909657 rem:288s step 8815 (52%) loss:3.2449 lr:0.66 dt:34ms tok/s:1908715 rem:288s step 8816 (52%) loss:3.2494 lr:0.66 dt:34ms tok/s:1903428 rem:288s step 8817 (52%) loss:3.2667 lr:0.66 dt:34ms tok/s:1904853 rem:288s step 8818 (52%) loss:3.2617 lr:0.66 dt:35ms tok/s:1898092 rem:288s step 8819 (52%) loss:3.2641 lr:0.66 dt:34ms tok/s:1903020 rem:288s step 8820 (52%) loss:3.2610 lr:0.66 dt:34ms tok/s:1911941 rem:288s step 8821 (52%) loss:3.2507 lr:0.66 dt:34ms tok/s:1908040 rem:288s step 8822 (52%) loss:3.2465 lr:0.66 dt:34ms tok/s:1902138 rem:288s step 8823 (52%) loss:3.2372 lr:0.65 dt:34ms tok/s:1912474 rem:288s step 8824 (52%) loss:3.2116 lr:0.65 dt:35ms tok/s:1896782 rem:288s step 8825 (52%) loss:3.2135 lr:0.65 dt:34ms tok/s:1900914 rem:288s step 8826 (52%) loss:3.2204 lr:0.65 dt:35ms tok/s:1891145 rem:288s step 8827 (52%) loss:3.1997 lr:0.65 dt:35ms tok/s:1881514 rem:288s step 8828 (52%) loss:3.1831 lr:0.65 dt:35ms tok/s:1896259 rem:288s step 8829 (52%) loss:3.1587 lr:0.65 dt:35ms tok/s:1888248 rem:288s step 8830 (52%) loss:3.1724 lr:0.65 dt:35ms tok/s:1869422 rem:288s step 8831 (52%) loss:3.1787 lr:0.65 dt:35ms tok/s:1872185 rem:288s step 8832 (52%) loss:3.1638 lr:0.65 dt:35ms tok/s:1873844 rem:288s step 8833 (52%) loss:3.1385 lr:0.65 dt:35ms tok/s:1880793 rem:288s step 8834 (52%) loss:3.1448 lr:0.65 dt:35ms tok/s:1871942 rem:288s step 8835 (52%) loss:3.1448 lr:0.65 dt:36ms tok/s:1841420 rem:288s step 8836 (52%) loss:3.1420 lr:0.65 dt:35ms tok/s:1871637 rem:288s step 8837 (52%) loss:3.1295 lr:0.65 dt:35ms tok/s:1878017 rem:288s step 8838 (52%) loss:3.1406 lr:0.65 dt:35ms tok/s:1875967 rem:288s step 8839 (52%) loss:3.1343 lr:0.65 dt:35ms tok/s:1877710 rem:288s step 8840 (52%) loss:3.1518 lr:0.65 dt:35ms tok/s:1878711 rem:288s step 8841 (52%) loss:3.1685 lr:0.65 dt:35ms tok/s:1883680 rem:287s step 8842 (52%) loss:3.1612 lr:0.65 dt:35ms tok/s:1853026 rem:287s step 8843 (52%) loss:3.1576 lr:0.65 dt:35ms tok/s:1863604 rem:287s step 8844 (52%) loss:3.1748 lr:0.65 dt:35ms tok/s:1851990 rem:287s step 8845 (52%) loss:3.1740 lr:0.65 dt:35ms tok/s:1858049 rem:287s step 8846 (52%) loss:3.1864 lr:0.65 dt:35ms tok/s:1856681 rem:287s step 8847 (52%) loss:3.1917 lr:0.65 dt:35ms tok/s:1855979 rem:287s step 8848 (52%) loss:3.2201 lr:0.65 dt:35ms tok/s:1854214 rem:287s step 8849 (52%) loss:3.2178 lr:0.65 dt:35ms tok/s:1854126 rem:287s step 8850 (52%) loss:3.2123 lr:0.65 dt:35ms tok/s:1852627 rem:287s step 8851 (52%) loss:3.1882 lr:0.65 dt:35ms tok/s:1856067 rem:287s step 8852 (52%) loss:3.1645 lr:0.65 dt:35ms tok/s:1856907 rem:287s step 8853 (52%) loss:3.1271 lr:0.65 dt:35ms tok/s:1848291 rem:287s step 8854 (52%) loss:3.1328 lr:0.65 dt:35ms tok/s:1854351 rem:287s step 8855 (52%) loss:3.1465 lr:0.65 dt:35ms tok/s:1850556 rem:287s step 8856 (52%) loss:3.1518 lr:0.65 dt:35ms tok/s:1851441 rem:287s step 8857 (52%) loss:3.1643 lr:0.65 dt:35ms tok/s:1860325 rem:287s step 8858 (52%) loss:3.1759 lr:0.65 dt:35ms tok/s:1855879 rem:287s step 8859 (52%) loss:3.1704 lr:0.65 dt:35ms tok/s:1849747 rem:287s step 8860 (52%) loss:3.1827 lr:0.65 dt:36ms tok/s:1835358 rem:287s step 8861 (52%) loss:3.1965 lr:0.65 dt:36ms tok/s:1835419 rem:287s step 8862 (52%) loss:3.1983 lr:0.65 dt:36ms tok/s:1839300 rem:287s step 8863 (52%) loss:3.1947 lr:0.65 dt:36ms tok/s:1837260 rem:287s step 8864 (52%) loss:3.1759 lr:0.65 dt:36ms tok/s:1831286 rem:287s step 8865 (52%) loss:3.1809 lr:0.65 dt:36ms tok/s:1839706 rem:287s step 8866 (52%) loss:3.1990 lr:0.65 dt:36ms tok/s:1837333 rem:287s step 8867 (52%) loss:3.1975 lr:0.65 dt:36ms tok/s:1838156 rem:287s step 8868 (52%) loss:3.1752 lr:0.65 dt:36ms tok/s:1834439 rem:287s step 8869 (52%) loss:3.1522 lr:0.65 dt:36ms tok/s:1840766 rem:286s step 8870 (52%) loss:3.1520 lr:0.65 dt:36ms tok/s:1835873 rem:286s step 8871 (52%) loss:3.1702 lr:0.65 dt:36ms tok/s:1840273 rem:286s step 8872 (52%) loss:3.1842 lr:0.65 dt:36ms tok/s:1832959 rem:286s step 8873 (52%) loss:3.1738 lr:0.65 dt:36ms tok/s:1841950 rem:286s step 8874 (52%) loss:3.1616 lr:0.65 dt:35ms tok/s:1850083 rem:286s step 8875 (52%) loss:3.1513 lr:0.65 dt:35ms tok/s:1846764 rem:286s step 8876 (52%) loss:3.1768 lr:0.65 dt:35ms tok/s:1856631 rem:286s step 8877 (52%) loss:3.1780 lr:0.65 dt:35ms tok/s:1852115 rem:286s step 8878 (52%) loss:3.2401 lr:0.65 dt:36ms tok/s:1834048 rem:286s step 8879 (52%) loss:3.2223 lr:0.65 dt:36ms tok/s:1829957 rem:286s step 8880 (52%) loss:3.2172 lr:0.65 dt:36ms tok/s:1842407 rem:286s step 8881 (52%) loss:3.1989 lr:0.65 dt:38ms tok/s:1743275 rem:286s step 8882 (52%) loss:3.1980 lr:0.65 dt:35ms tok/s:1859155 rem:286s step 8883 (52%) loss:3.1851 lr:0.65 dt:36ms tok/s:1838759 rem:286s step 8884 (52%) loss:3.1824 lr:0.65 dt:35ms tok/s:1854889 rem:286s step 8885 (52%) loss:3.1759 lr:0.65 dt:36ms tok/s:1830506 rem:286s step 8886 (52%) loss:3.1885 lr:0.65 dt:35ms tok/s:1849261 rem:286s step 8887 (52%) loss:3.1886 lr:0.65 dt:35ms tok/s:1853214 rem:286s step 8888 (52%) loss:3.1874 lr:0.65 dt:35ms tok/s:1855252 rem:286s step 8889 (52%) loss:3.1800 lr:0.65 dt:35ms tok/s:1852277 rem:286s step 8890 (52%) loss:3.1674 lr:0.65 dt:36ms tok/s:1819227 rem:286s step 8891 (52%) loss:3.1721 lr:0.65 dt:36ms tok/s:1824165 rem:286s step 8892 (52%) loss:3.1499 lr:0.65 dt:36ms tok/s:1834305 rem:286s step 8893 (52%) loss:3.1524 lr:0.65 dt:35ms tok/s:1853576 rem:286s step 8894 (52%) loss:3.1493 lr:0.65 dt:35ms tok/s:1855979 rem:286s step 8895 (52%) loss:3.1640 lr:0.65 dt:35ms tok/s:1854602 rem:286s step 8896 (52%) loss:3.1681 lr:0.65 dt:36ms tok/s:1830189 rem:286s step 8897 (52%) loss:3.1632 lr:0.65 dt:36ms tok/s:1817435 rem:286s step 8898 (52%) loss:3.1579 lr:0.65 dt:36ms tok/s:1816666 rem:285s step 8899 (52%) loss:3.1463 lr:0.65 dt:36ms tok/s:1826164 rem:285s step 8900 (52%) loss:3.1467 lr:0.65 dt:36ms tok/s:1814196 rem:285s + local: attn=[0.079, 0.775, 0.833] mlp=[0.531, 0.221, -0.214] + + transition: attn=[2.735, 0.939] mlp=[-0.190, 0.453] + + hierarchy: attn=[2.949, 5.939, 5.616] mlp=[1.193, -0.954, -3.002] + step 8901 (52%) loss:3.1687 lr:0.65 dt:36ms tok/s:1817735 rem:285s step 8902 (52%) loss:3.1921 lr:0.65 dt:36ms tok/s:1821542 rem:285s step 8903 (52%) loss:3.1980 lr:0.65 dt:36ms tok/s:1822049 rem:285s step 8904 (52%) loss:3.1953 lr:0.65 dt:36ms tok/s:1819167 rem:285s step 8905 (52%) loss:3.1904 lr:0.65 dt:36ms tok/s:1816666 rem:285s step 8906 (52%) loss:3.1618 lr:0.65 dt:36ms tok/s:1813933 rem:285s step 8907 (52%) loss:3.1531 lr:0.65 dt:36ms tok/s:1821373 rem:285s step 8908 (52%) loss:3.1413 lr:0.65 dt:36ms tok/s:1820781 rem:285s step 8909 (52%) loss:3.1488 lr:0.65 dt:36ms tok/s:1825461 rem:285s step 8910 (52%) loss:3.1584 lr:0.65 dt:36ms tok/s:1817531 rem:285s step 8911 (52%) loss:3.1642 lr:0.65 dt:36ms tok/s:1820769 rem:285s step 8912 (53%) loss:3.1503 lr:0.65 dt:36ms tok/s:1819383 rem:285s step 8913 (53%) loss:3.1568 lr:0.65 dt:36ms tok/s:1817062 rem:285s step 8914 (53%) loss:3.1681 lr:0.64 dt:36ms tok/s:1824237 rem:285s step 8915 (53%) loss:3.1635 lr:0.64 dt:36ms tok/s:1817086 rem:285s step 8916 (53%) loss:3.1580 lr:0.64 dt:36ms tok/s:1817795 rem:285s step 8917 (53%) loss:3.1697 lr:0.64 dt:36ms tok/s:1818036 rem:285s step 8918 (53%) loss:3.1779 lr:0.64 dt:36ms tok/s:1820118 rem:285s step 8919 (53%) loss:3.1787 lr:0.64 dt:36ms tok/s:1822737 rem:285s step 8920 (53%) loss:3.1730 lr:0.64 dt:37ms tok/s:1769194 rem:285s step 8921 (53%) loss:3.1650 lr:0.64 dt:36ms tok/s:1812593 rem:285s step 8922 (53%) loss:3.1551 lr:0.64 dt:36ms tok/s:1818589 rem:285s step 8923 (53%) loss:3.1563 lr:0.64 dt:36ms tok/s:1824467 rem:285s step 8924 (53%) loss:3.1830 lr:0.64 dt:36ms tok/s:1828691 rem:285s step 8925 (53%) loss:3.1881 lr:0.64 dt:36ms tok/s:1826346 rem:284s step 8926 (53%) loss:3.1790 lr:0.64 dt:36ms tok/s:1816618 rem:284s step 8927 (53%) loss:3.1981 lr:0.64 dt:36ms tok/s:1825679 rem:284s step 8928 (53%) loss:3.2067 lr:0.64 dt:36ms tok/s:1819528 rem:284s step 8929 (53%) loss:3.2350 lr:0.64 dt:36ms tok/s:1823717 rem:284s step 8930 (53%) loss:3.2213 lr:0.64 dt:36ms tok/s:1815286 rem:284s step 8931 (53%) loss:3.2419 lr:0.64 dt:36ms tok/s:1822484 rem:284s step 8932 (53%) loss:3.2553 lr:0.64 dt:36ms tok/s:1818132 rem:284s step 8933 (53%) loss:3.2478 lr:0.64 dt:36ms tok/s:1815886 rem:284s step 8934 (53%) loss:3.2306 lr:0.64 dt:36ms tok/s:1820902 rem:284s step 8935 (53%) loss:3.2094 lr:0.64 dt:36ms tok/s:1826929 rem:284s step 8936 (53%) loss:3.1934 lr:0.64 dt:36ms tok/s:1822169 rem:284s step 8937 (53%) loss:3.1858 lr:0.64 dt:36ms tok/s:1828825 rem:284s step 8938 (53%) loss:3.1964 lr:0.64 dt:36ms tok/s:1822701 rem:284s step 8939 (53%) loss:3.2079 lr:0.64 dt:36ms tok/s:1814999 rem:284s step 8940 (53%) loss:3.2144 lr:0.64 dt:36ms tok/s:1827305 rem:284s step 8941 (53%) loss:3.2280 lr:0.64 dt:36ms tok/s:1819805 rem:284s step 8942 (53%) loss:3.2290 lr:0.64 dt:36ms tok/s:1826104 rem:284s step 8943 (53%) loss:3.2271 lr:0.64 dt:36ms tok/s:1817483 rem:284s step 8944 (53%) loss:3.2441 lr:0.64 dt:36ms tok/s:1819107 rem:284s step 8945 (53%) loss:3.2278 lr:0.64 dt:36ms tok/s:1817723 rem:284s step 8946 (53%) loss:3.2286 lr:0.64 dt:36ms tok/s:1819371 rem:284s step 8947 (53%) loss:3.2212 lr:0.64 dt:36ms tok/s:1815490 rem:284s step 8948 (53%) loss:3.2280 lr:0.64 dt:36ms tok/s:1823959 rem:284s step 8949 (53%) loss:3.2233 lr:0.64 dt:36ms tok/s:1820854 rem:284s step 8950 (53%) loss:3.2067 lr:0.64 dt:36ms tok/s:1809622 rem:284s step 8951 (53%) loss:3.1877 lr:0.64 dt:36ms tok/s:1820866 rem:284s step 8952 (53%) loss:3.1801 lr:0.64 dt:36ms tok/s:1827196 rem:284s step 8953 (53%) loss:3.1539 lr:0.64 dt:36ms tok/s:1826601 rem:283s step 8954 (53%) loss:3.1552 lr:0.64 dt:36ms tok/s:1825412 rem:283s step 8955 (53%) loss:3.1486 lr:0.64 dt:36ms tok/s:1813837 rem:283s step 8956 (53%) loss:3.1324 lr:0.64 dt:36ms tok/s:1821493 rem:283s step 8957 (53%) loss:3.1254 lr:0.64 dt:36ms tok/s:1823378 rem:283s step 8958 (53%) loss:3.1362 lr:0.64 dt:36ms tok/s:1819576 rem:283s step 8959 (53%) loss:3.1342 lr:0.64 dt:36ms tok/s:1819660 rem:283s step 8960 (53%) loss:3.1188 lr:0.64 dt:36ms tok/s:1811781 rem:283s step 8961 (53%) loss:3.1348 lr:0.64 dt:36ms tok/s:1815274 rem:283s step 8962 (53%) loss:3.1413 lr:0.64 dt:36ms tok/s:1815130 rem:283s step 8963 (53%) loss:3.1359 lr:0.64 dt:36ms tok/s:1820010 rem:283s step 8964 (53%) loss:3.1414 lr:0.64 dt:36ms tok/s:1818986 rem:283s step 8965 (53%) loss:3.1453 lr:0.64 dt:36ms tok/s:1821445 rem:283s step 8966 (53%) loss:3.1626 lr:0.64 dt:36ms tok/s:1812964 rem:283s step 8967 (53%) loss:3.1689 lr:0.64 dt:36ms tok/s:1819492 rem:283s step 8968 (53%) loss:3.1738 lr:0.64 dt:36ms tok/s:1818384 rem:283s step 8969 (53%) loss:3.1911 lr:0.64 dt:36ms tok/s:1817026 rem:283s step 8970 (53%) loss:3.1964 lr:0.64 dt:36ms tok/s:1814543 rem:283s step 8971 (53%) loss:3.2025 lr:0.64 dt:36ms tok/s:1819299 rem:283s step 8972 (53%) loss:3.1969 lr:0.64 dt:37ms tok/s:1793739 rem:283s step 8973 (53%) loss:3.1859 lr:0.64 dt:36ms tok/s:1818685 rem:283s step 8974 (53%) loss:3.1880 lr:0.64 dt:36ms tok/s:1807242 rem:283s step 8975 (53%) loss:3.1642 lr:0.64 dt:36ms tok/s:1814268 rem:283s step 8976 (53%) loss:3.1429 lr:0.64 dt:36ms tok/s:1795508 rem:283s step 8977 (53%) loss:3.1559 lr:0.64 dt:36ms tok/s:1806933 rem:283s step 8978 (53%) loss:3.1785 lr:0.64 dt:36ms tok/s:1812175 rem:283s step 8979 (53%) loss:3.2041 lr:0.64 dt:36ms tok/s:1813191 rem:283s step 8980 (53%) loss:3.2074 lr:0.64 dt:36ms tok/s:1807741 rem:283s step 8981 (53%) loss:3.2019 lr:0.64 dt:36ms tok/s:1814663 rem:282s step 8982 (53%) loss:3.2069 lr:0.64 dt:36ms tok/s:1818505 rem:282s step 8983 (53%) loss:3.2410 lr:0.64 dt:36ms tok/s:1819131 rem:282s step 8984 (53%) loss:3.2447 lr:0.64 dt:36ms tok/s:1814004 rem:282s step 8985 (53%) loss:3.2336 lr:0.64 dt:36ms tok/s:1813502 rem:282s step 8986 (53%) loss:3.2258 lr:0.64 dt:36ms tok/s:1817339 rem:282s step 8987 (53%) loss:3.2235 lr:0.64 dt:36ms tok/s:1812593 rem:282s step 8988 (53%) loss:3.2243 lr:0.64 dt:36ms tok/s:1814495 rem:282s step 8989 (53%) loss:3.2476 lr:0.64 dt:37ms tok/s:1792569 rem:282s step 8990 (53%) loss:3.2515 lr:0.64 dt:36ms tok/s:1810182 rem:282s step 8991 (53%) loss:3.2450 lr:0.64 dt:36ms tok/s:1817098 rem:282s step 8992 (53%) loss:3.2343 lr:0.64 dt:36ms tok/s:1817327 rem:282s step 8993 (53%) loss:3.2269 lr:0.64 dt:36ms tok/s:1813598 rem:282s step 8994 (53%) loss:3.2315 lr:0.64 dt:36ms tok/s:1799257 rem:282s step 8995 (53%) loss:3.2420 lr:0.64 dt:36ms tok/s:1816858 rem:282s step 8996 (53%) loss:3.2456 lr:0.64 dt:36ms tok/s:1814292 rem:282s step 8997 (53%) loss:3.2573 lr:0.64 dt:36ms tok/s:1811017 rem:282s step 8998 (53%) loss:3.2350 lr:0.64 dt:36ms tok/s:1806245 rem:282s step 8999 (53%) loss:3.2368 lr:0.64 dt:36ms tok/s:1808610 rem:282s step 9000 (53%) loss:3.2502 lr:0.64 dt:36ms tok/s:1813609 rem:282s + local: attn=[0.073, 0.799, 0.816] mlp=[0.527, 0.218, -0.219] + + transition: attn=[2.795, 0.934] mlp=[-0.174, 0.451] + + hierarchy: attn=[3.021, 5.939, 5.616] mlp=[1.210, -0.929, -3.012] + step 9001 (53%) loss:3.2408 lr:0.64 dt:36ms tok/s:1815418 rem:282s step 9002 (53%) loss:3.2357 lr:0.63 dt:36ms tok/s:1817759 rem:282s step 9003 (53%) loss:3.2570 lr:0.63 dt:36ms tok/s:1813191 rem:282s step 9004 (53%) loss:3.2656 lr:0.63 dt:36ms tok/s:1805995 rem:282s step 9005 (53%) loss:3.2592 lr:0.63 dt:36ms tok/s:1816666 rem:282s step 9006 (53%) loss:3.2523 lr:0.63 dt:36ms tok/s:1812868 rem:282s step 9007 (53%) loss:3.2690 lr:0.63 dt:36ms tok/s:1810945 rem:282s step 9008 (53%) loss:3.2515 lr:0.63 dt:36ms tok/s:1803791 rem:281s step 9009 (53%) loss:3.2655 lr:0.63 dt:36ms tok/s:1810396 rem:281s step 9010 (53%) loss:3.2665 lr:0.63 dt:36ms tok/s:1817579 rem:281s step 9011 (53%) loss:3.2902 lr:0.63 dt:37ms tok/s:1794558 rem:281s step 9012 (53%) loss:3.2850 lr:0.63 dt:36ms tok/s:1812497 rem:281s step 9013 (53%) loss:3.2668 lr:0.63 dt:36ms tok/s:1797528 rem:281s step 9014 (53%) loss:3.2614 lr:0.63 dt:36ms tok/s:1813167 rem:281s step 9015 (53%) loss:3.2803 lr:0.63 dt:36ms tok/s:1815047 rem:281s step 9016 (53%) loss:3.2965 lr:0.63 dt:36ms tok/s:1811673 rem:281s step 9017 (53%) loss:3.3031 lr:0.63 dt:36ms tok/s:1811231 rem:281s step 9018 (53%) loss:3.2984 lr:0.63 dt:36ms tok/s:1808907 rem:281s step 9019 (53%) loss:3.3110 lr:0.63 dt:36ms tok/s:1814531 rem:281s step 9020 (53%) loss:3.3022 lr:0.63 dt:36ms tok/s:1815226 rem:281s step 9021 (53%) loss:3.3093 lr:0.63 dt:36ms tok/s:1820552 rem:281s step 9022 (53%) loss:3.3146 lr:0.63 dt:36ms tok/s:1809145 rem:281s step 9023 (53%) loss:3.3045 lr:0.63 dt:36ms tok/s:1812940 rem:281s step 9024 (53%) loss:3.2931 lr:0.63 dt:36ms tok/s:1817002 rem:281s step 9025 (53%) loss:3.2863 lr:0.63 dt:36ms tok/s:1814100 rem:281s step 9026 (53%) loss:3.2718 lr:0.63 dt:36ms tok/s:1812007 rem:281s step 9027 (53%) loss:3.2781 lr:0.63 dt:36ms tok/s:1816654 rem:281s step 9028 (53%) loss:3.2967 lr:0.63 dt:36ms tok/s:1811661 rem:281s step 9029 (53%) loss:3.3018 lr:0.63 dt:36ms tok/s:1818661 rem:281s step 9030 (53%) loss:3.2975 lr:0.63 dt:36ms tok/s:1819167 rem:281s step 9031 (53%) loss:3.2855 lr:0.63 dt:36ms tok/s:1811828 rem:281s step 9032 (53%) loss:3.2764 lr:0.63 dt:36ms tok/s:1819504 rem:281s step 9033 (53%) loss:3.2838 lr:0.63 dt:36ms tok/s:1812892 rem:281s step 9034 (53%) loss:3.2767 lr:0.63 dt:36ms tok/s:1819889 rem:281s step 9035 (53%) loss:3.2766 lr:0.63 dt:36ms tok/s:1813717 rem:281s step 9036 (53%) loss:3.2856 lr:0.63 dt:36ms tok/s:1822991 rem:280s step 9037 (53%) loss:3.2899 lr:0.63 dt:37ms tok/s:1757588 rem:280s step 9038 (53%) loss:3.3035 lr:0.63 dt:36ms tok/s:1801958 rem:280s step 9039 (53%) loss:3.3077 lr:0.63 dt:36ms tok/s:1796987 rem:280s step 9040 (53%) loss:3.3152 lr:0.63 dt:37ms tok/s:1794547 rem:280s step 9041 (53%) loss:3.3020 lr:0.63 dt:36ms tok/s:1801887 rem:280s step 9042 (53%) loss:3.2945 lr:0.63 dt:36ms tok/s:1823040 rem:280s step 9043 (53%) loss:3.2774 lr:0.63 dt:36ms tok/s:1823233 rem:280s step 9044 (53%) loss:3.2880 lr:0.63 dt:36ms tok/s:1820757 rem:280s step 9045 (53%) loss:3.2815 lr:0.63 dt:36ms tok/s:1822882 rem:280s step 9046 (53%) loss:3.2561 lr:0.63 dt:36ms tok/s:1818096 rem:280s step 9047 (53%) loss:3.2461 lr:0.63 dt:36ms tok/s:1821904 rem:280s step 9048 (53%) loss:3.2318 lr:0.63 dt:36ms tok/s:1818493 rem:280s step 9049 (53%) loss:3.2413 lr:0.63 dt:37ms tok/s:1789395 rem:280s step 9050 (53%) loss:3.2271 lr:0.63 dt:36ms tok/s:1821831 rem:280s step 9051 (53%) loss:3.2084 lr:0.63 dt:36ms tok/s:1818060 rem:280s step 9052 (53%) loss:3.2075 lr:0.63 dt:36ms tok/s:1824419 rem:280s step 9053 (53%) loss:3.2116 lr:0.63 dt:36ms tok/s:1822713 rem:280s step 9054 (53%) loss:3.2028 lr:0.63 dt:36ms tok/s:1820757 rem:280s step 9055 (53%) loss:3.2256 lr:0.63 dt:36ms tok/s:1825158 rem:280s step 9056 (53%) loss:3.2350 lr:0.63 dt:36ms tok/s:1829312 rem:280s step 9057 (53%) loss:3.2421 lr:0.63 dt:36ms tok/s:1822447 rem:280s step 9058 (53%) loss:3.2494 lr:0.63 dt:36ms tok/s:1820167 rem:280s step 9059 (53%) loss:3.2546 lr:0.63 dt:36ms tok/s:1812665 rem:280s step 9060 (53%) loss:3.2564 lr:0.63 dt:36ms tok/s:1815394 rem:280s step 9061 (53%) loss:3.2468 lr:0.63 dt:36ms tok/s:1825219 rem:280s step 9062 (53%) loss:3.2540 lr:0.63 dt:36ms tok/s:1820203 rem:280s step 9063 (53%) loss:3.2505 lr:0.63 dt:36ms tok/s:1823548 rem:280s step 9064 (53%) loss:3.2604 lr:0.63 dt:36ms tok/s:1817867 rem:279s step 9065 (53%) loss:3.2691 lr:0.63 dt:36ms tok/s:1820588 rem:279s step 9066 (53%) loss:3.2761 lr:0.63 dt:36ms tok/s:1816846 rem:279s step 9067 (53%) loss:3.2658 lr:0.63 dt:36ms tok/s:1826735 rem:279s step 9068 (53%) loss:3.2511 lr:0.63 dt:36ms tok/s:1819624 rem:279s step 9069 (53%) loss:3.2360 lr:0.63 dt:36ms tok/s:1820818 rem:279s step 9070 (53%) loss:3.2450 lr:0.63 dt:36ms tok/s:1821083 rem:279s step 9071 (53%) loss:3.2421 lr:0.63 dt:36ms tok/s:1813777 rem:279s step 9072 (53%) loss:3.2460 lr:0.63 dt:36ms tok/s:1815790 rem:279s step 9073 (53%) loss:3.2205 lr:0.63 dt:36ms tok/s:1821928 rem:279s step 9074 (53%) loss:3.2106 lr:0.63 dt:36ms tok/s:1814316 rem:279s step 9075 (53%) loss:3.2132 lr:0.63 dt:36ms tok/s:1826844 rem:279s step 9076 (53%) loss:3.2356 lr:0.63 dt:36ms tok/s:1818830 rem:279s step 9077 (53%) loss:3.2429 lr:0.63 dt:36ms tok/s:1821167 rem:279s step 9078 (53%) loss:3.2319 lr:0.63 dt:36ms tok/s:1819697 rem:279s step 9079 (54%) loss:3.2161 lr:0.63 dt:36ms tok/s:1816282 rem:279s step 9080 (54%) loss:3.2253 lr:0.63 dt:36ms tok/s:1818144 rem:279s step 9081 (54%) loss:3.2200 lr:0.63 dt:36ms tok/s:1827196 rem:279s step 9082 (54%) loss:3.2213 lr:0.63 dt:36ms tok/s:1806553 rem:279s step 9083 (54%) loss:3.2228 lr:0.63 dt:36ms tok/s:1816738 rem:279s step 9084 (54%) loss:3.2144 lr:0.63 dt:36ms tok/s:1813765 rem:279s step 9085 (54%) loss:3.2290 lr:0.63 dt:36ms tok/s:1816702 rem:279s step 9086 (54%) loss:3.2446 lr:0.63 dt:36ms tok/s:1818072 rem:279s step 9087 (54%) loss:3.2550 lr:0.63 dt:36ms tok/s:1813155 rem:279s step 9088 (54%) loss:3.2509 lr:0.63 dt:37ms tok/s:1790269 rem:279s step 9089 (54%) loss:3.2302 lr:0.63 dt:36ms tok/s:1814016 rem:279s step 9090 (54%) loss:3.2207 lr:0.62 dt:36ms tok/s:1817879 rem:279s step 9091 (54%) loss:3.2279 lr:0.62 dt:36ms tok/s:1821602 rem:279s step 9092 (54%) loss:3.2335 lr:0.62 dt:36ms tok/s:1816078 rem:278s step 9093 (54%) loss:3.2303 lr:0.62 dt:36ms tok/s:1809431 rem:278s step 9094 (54%) loss:3.2293 lr:0.62 dt:36ms tok/s:1818709 rem:278s step 9095 (54%) loss:3.2223 lr:0.62 dt:36ms tok/s:1817146 rem:278s step 9096 (54%) loss:3.2331 lr:0.62 dt:36ms tok/s:1819805 rem:278s step 9097 (54%) loss:3.2261 lr:0.62 dt:36ms tok/s:1808979 rem:278s step 9098 (54%) loss:3.2414 lr:0.62 dt:36ms tok/s:1814388 rem:278s step 9099 (54%) loss:3.2389 lr:0.62 dt:36ms tok/s:1803898 rem:278s step 9100 (54%) loss:3.2535 lr:0.62 dt:36ms tok/s:1814567 rem:278s + local: attn=[0.077, 0.815, 0.806] mlp=[0.547, 0.229, -0.217] + + transition: attn=[2.831, 0.969] mlp=[-0.180, 0.445] + + hierarchy: attn=[2.993, 5.939, 5.616] mlp=[1.202, -0.917, -3.020] + step 9101 (54%) loss:3.2652 lr:0.62 dt:36ms tok/s:1816834 rem:278s step 9102 (54%) loss:3.2673 lr:0.62 dt:36ms tok/s:1815142 rem:278s step 9103 (54%) loss:3.2617 lr:0.62 dt:36ms tok/s:1814663 rem:278s step 9104 (54%) loss:3.2633 lr:0.62 dt:36ms tok/s:1812999 rem:278s step 9105 (54%) loss:3.2842 lr:0.62 dt:36ms tok/s:1813179 rem:278s step 9106 (54%) loss:3.2727 lr:0.62 dt:36ms tok/s:1809205 rem:278s step 9107 (54%) loss:3.2672 lr:0.62 dt:36ms tok/s:1800483 rem:278s step 9108 (54%) loss:3.2825 lr:0.62 dt:36ms tok/s:1819432 rem:278s step 9109 (54%) loss:3.2743 lr:0.62 dt:36ms tok/s:1821868 rem:278s step 9110 (54%) loss:3.2576 lr:0.62 dt:36ms tok/s:1809383 rem:278s step 9111 (54%) loss:3.2483 lr:0.62 dt:36ms tok/s:1824794 rem:278s step 9112 (54%) loss:3.2504 lr:0.62 dt:36ms tok/s:1815310 rem:278s step 9113 (54%) loss:3.2451 lr:0.62 dt:36ms tok/s:1810456 rem:278s step 9114 (54%) loss:3.2474 lr:0.62 dt:36ms tok/s:1820516 rem:278s step 9115 (54%) loss:3.2390 lr:0.62 dt:36ms tok/s:1809312 rem:278s step 9116 (54%) loss:3.2367 lr:0.62 dt:36ms tok/s:1811601 rem:278s step 9117 (54%) loss:3.2302 lr:0.62 dt:36ms tok/s:1814843 rem:278s step 9118 (54%) loss:3.2176 lr:0.62 dt:36ms tok/s:1815178 rem:278s step 9119 (54%) loss:3.2255 lr:0.62 dt:36ms tok/s:1815946 rem:277s step 9120 (54%) loss:3.2272 lr:0.62 dt:36ms tok/s:1814531 rem:277s step 9121 (54%) loss:3.2381 lr:0.62 dt:36ms tok/s:1809098 rem:277s step 9122 (54%) loss:3.2669 lr:0.62 dt:36ms tok/s:1822508 rem:277s step 9123 (54%) loss:3.2626 lr:0.62 dt:36ms tok/s:1822520 rem:277s step 9124 (54%) loss:3.2547 lr:0.62 dt:36ms tok/s:1817146 rem:277s step 9125 (54%) loss:3.2622 lr:0.62 dt:36ms tok/s:1812724 rem:277s step 9126 (54%) loss:3.2544 lr:0.62 dt:36ms tok/s:1817735 rem:277s step 9127 (54%) loss:3.2618 lr:0.62 dt:36ms tok/s:1817831 rem:277s step 9128 (54%) loss:3.2586 lr:0.62 dt:36ms tok/s:1814639 rem:277s step 9129 (54%) loss:3.2472 lr:0.62 dt:36ms tok/s:1810790 rem:277s step 9130 (54%) loss:3.2542 lr:0.62 dt:36ms tok/s:1817423 rem:277s step 9131 (54%) loss:3.2478 lr:0.62 dt:36ms tok/s:1813215 rem:277s step 9132 (54%) loss:3.2424 lr:0.62 dt:36ms tok/s:1810516 rem:277s step 9133 (54%) loss:3.2119 lr:0.62 dt:36ms tok/s:1812569 rem:277s step 9134 (54%) loss:3.2071 lr:0.62 dt:36ms tok/s:1816882 rem:277s step 9135 (54%) loss:3.2093 lr:0.62 dt:36ms tok/s:1801108 rem:277s step 9136 (54%) loss:3.1959 lr:0.62 dt:36ms tok/s:1811554 rem:277s step 9137 (54%) loss:3.2001 lr:0.62 dt:36ms tok/s:1815946 rem:277s step 9138 (54%) loss:3.2050 lr:0.62 dt:36ms tok/s:1815035 rem:277s step 9139 (54%) loss:3.2158 lr:0.62 dt:36ms tok/s:1812784 rem:277s step 9140 (54%) loss:3.2226 lr:0.62 dt:36ms tok/s:1818685 rem:277s step 9141 (54%) loss:3.2331 lr:0.62 dt:36ms tok/s:1817795 rem:277s step 9142 (54%) loss:3.2373 lr:0.62 dt:36ms tok/s:1812175 rem:277s step 9143 (54%) loss:3.2336 lr:0.62 dt:36ms tok/s:1816762 rem:277s step 9144 (54%) loss:3.2531 lr:0.62 dt:36ms tok/s:1819408 rem:277s step 9145 (54%) loss:3.2598 lr:0.62 dt:36ms tok/s:1825291 rem:277s step 9146 (54%) loss:3.2876 lr:0.62 dt:36ms tok/s:1812378 rem:277s step 9147 (54%) loss:3.2876 lr:0.62 dt:36ms tok/s:1815238 rem:276s step 9148 (54%) loss:3.2930 lr:0.62 dt:36ms tok/s:1817423 rem:276s step 9149 (54%) loss:3.2868 lr:0.62 dt:36ms tok/s:1816018 rem:276s step 9150 (54%) loss:3.2682 lr:0.62 dt:36ms tok/s:1811876 rem:276s step 9151 (54%) loss:3.2639 lr:0.62 dt:36ms tok/s:1813657 rem:276s step 9152 (54%) loss:3.2577 lr:0.62 dt:36ms tok/s:1817471 rem:276s step 9153 (54%) loss:3.2443 lr:0.62 dt:36ms tok/s:1817783 rem:276s step 9154 (54%) loss:3.2515 lr:0.62 dt:37ms tok/s:1791436 rem:276s step 9155 (54%) loss:3.2460 lr:0.62 dt:36ms tok/s:1815226 rem:276s step 9156 (54%) loss:3.2499 lr:0.62 dt:36ms tok/s:1811984 rem:276s step 9157 (54%) loss:3.2499 lr:0.62 dt:36ms tok/s:1810432 rem:276s step 9158 (54%) loss:3.2420 lr:0.62 dt:36ms tok/s:1805995 rem:276s step 9159 (54%) loss:3.2398 lr:0.62 dt:36ms tok/s:1813849 rem:276s step 9160 (54%) loss:3.2329 lr:0.62 dt:36ms tok/s:1812844 rem:276s step 9161 (54%) loss:3.2369 lr:0.62 dt:36ms tok/s:1811363 rem:276s step 9162 (54%) loss:3.2265 lr:0.62 dt:36ms tok/s:1814687 rem:276s step 9163 (54%) loss:3.1970 lr:0.62 dt:36ms tok/s:1813609 rem:276s step 9164 (54%) loss:3.1644 lr:0.62 dt:36ms tok/s:1810921 rem:276s step 9165 (54%) loss:3.1331 lr:0.62 dt:36ms tok/s:1809205 rem:276s step 9166 (54%) loss:3.0987 lr:0.62 dt:36ms tok/s:1813598 rem:276s step 9167 (54%) loss:3.0908 lr:0.62 dt:36ms tok/s:1813526 rem:276s step 9168 (54%) loss:3.1109 lr:0.62 dt:36ms tok/s:1815754 rem:276s step 9169 (54%) loss:3.1149 lr:0.62 dt:36ms tok/s:1815262 rem:276s step 9170 (54%) loss:3.1068 lr:0.62 dt:36ms tok/s:1814016 rem:276s step 9171 (54%) loss:3.1171 lr:0.62 dt:36ms tok/s:1818830 rem:276s step 9172 (54%) loss:3.1317 lr:0.62 dt:36ms tok/s:1814196 rem:276s step 9173 (54%) loss:3.1236 lr:0.62 dt:36ms tok/s:1817964 rem:276s step 9174 (54%) loss:3.1324 lr:0.62 dt:36ms tok/s:1810611 rem:276s step 9175 (54%) loss:3.1030 lr:0.62 dt:36ms tok/s:1808050 rem:275s step 9176 (54%) loss:3.1221 lr:0.62 dt:36ms tok/s:1817062 rem:275s step 9177 (54%) loss:3.1257 lr:0.61 dt:36ms tok/s:1814376 rem:275s step 9178 (54%) loss:3.1259 lr:0.61 dt:36ms tok/s:1813490 rem:275s step 9179 (54%) loss:3.1352 lr:0.61 dt:36ms tok/s:1812844 rem:275s step 9180 (54%) loss:3.1710 lr:0.61 dt:36ms tok/s:1812402 rem:275s step 9181 (54%) loss:3.1695 lr:0.61 dt:36ms tok/s:1809562 rem:275s step 9182 (54%) loss:3.1617 lr:0.61 dt:36ms tok/s:1822580 rem:275s step 9183 (54%) loss:3.1718 lr:0.61 dt:36ms tok/s:1814471 rem:275s step 9184 (54%) loss:3.1540 lr:0.61 dt:36ms tok/s:1804158 rem:275s step 9185 (54%) loss:3.1566 lr:0.61 dt:36ms tok/s:1823040 rem:275s step 9186 (54%) loss:3.1544 lr:0.61 dt:36ms tok/s:1811900 rem:275s step 9187 (54%) loss:3.1564 lr:0.61 dt:37ms tok/s:1762738 rem:275s step 9188 (54%) loss:3.1626 lr:0.61 dt:36ms tok/s:1810468 rem:275s step 9189 (54%) loss:3.1663 lr:0.61 dt:36ms tok/s:1819624 rem:275s step 9190 (54%) loss:3.1475 lr:0.61 dt:36ms tok/s:1823233 rem:275s step 9191 (54%) loss:3.1430 lr:0.61 dt:36ms tok/s:1813765 rem:275s step 9192 (54%) loss:3.1546 lr:0.61 dt:36ms tok/s:1812987 rem:275s step 9193 (54%) loss:3.1379 lr:0.61 dt:36ms tok/s:1818457 rem:275s step 9194 (54%) loss:3.1532 lr:0.61 dt:36ms tok/s:1816198 rem:275s step 9195 (54%) loss:3.1615 lr:0.61 dt:36ms tok/s:1816822 rem:275s step 9196 (54%) loss:3.1709 lr:0.61 dt:36ms tok/s:1817194 rem:275s step 9197 (54%) loss:3.1810 lr:0.61 dt:36ms tok/s:1814951 rem:275s step 9198 (54%) loss:3.1730 lr:0.61 dt:36ms tok/s:1804217 rem:275s step 9199 (54%) loss:3.1723 lr:0.61 dt:36ms tok/s:1816522 rem:275s step 9200 (54%) loss:3.1588 lr:0.61 dt:36ms tok/s:1820854 rem:275s + local: attn=[0.088, 0.804, 0.840] mlp=[0.544, 0.239, -0.218] + + transition: attn=[2.843, 0.940] mlp=[-0.178, 0.474] + + hierarchy: attn=[2.988, 5.939, 5.616] mlp=[1.223, -0.908, -2.984] + step 9201 (54%) loss:3.1562 lr:0.61 dt:36ms tok/s:1811100 rem:275s step 9202 (54%) loss:3.1508 lr:0.61 dt:36ms tok/s:1818709 rem:274s step 9203 (54%) loss:3.1695 lr:0.61 dt:36ms tok/s:1819227 rem:274s step 9204 (54%) loss:3.1735 lr:0.61 dt:36ms tok/s:1805521 rem:274s step 9205 (54%) loss:3.1891 lr:0.61 dt:36ms tok/s:1814711 rem:274s step 9206 (54%) loss:3.2115 lr:0.61 dt:37ms tok/s:1794570 rem:274s step 9207 (54%) loss:3.2030 lr:0.61 dt:44ms tok/s:1482053 rem:274s step 9208 (54%) loss:3.2003 lr:0.61 dt:36ms tok/s:1824722 rem:274s step 9209 (54%) loss:3.2055 lr:0.61 dt:35ms tok/s:1858878 rem:274s step 9210 (54%) loss:3.2203 lr:0.61 dt:35ms tok/s:1870478 rem:274s step 9211 (54%) loss:3.2169 lr:0.61 dt:35ms tok/s:1881875 rem:274s step 9212 (54%) loss:3.2252 lr:0.61 dt:35ms tok/s:1874138 rem:274s step 9213 (54%) loss:3.2213 lr:0.61 dt:35ms tok/s:1874291 rem:274s step 9214 (54%) loss:3.2229 lr:0.61 dt:35ms tok/s:1869752 rem:274s step 9215 (54%) loss:3.2200 lr:0.61 dt:35ms tok/s:1877633 rem:274s step 9216 (54%) loss:3.1747 lr:0.61 dt:35ms tok/s:1855841 rem:274s step 9217 (54%) loss:3.1343 lr:0.61 dt:35ms tok/s:1849647 rem:274s step 9218 (54%) loss:3.1514 lr:0.61 dt:35ms tok/s:1868850 rem:274s step 9219 (54%) loss:3.1607 lr:0.61 dt:35ms tok/s:1864666 rem:274s step 9220 (54%) loss:3.1757 lr:0.61 dt:35ms tok/s:1880613 rem:274s step 9221 (54%) loss:3.1719 lr:0.61 dt:35ms tok/s:1875660 rem:274s step 9222 (54%) loss:3.1794 lr:0.61 dt:36ms tok/s:1845004 rem:274s step 9223 (54%) loss:3.1837 lr:0.61 dt:35ms tok/s:1872300 rem:274s step 9224 (54%) loss:3.1881 lr:0.61 dt:35ms tok/s:1879854 rem:274s step 9225 (54%) loss:3.1818 lr:0.61 dt:35ms tok/s:1862821 rem:274s step 9226 (54%) loss:3.1996 lr:0.61 dt:35ms tok/s:1847844 rem:274s step 9227 (54%) loss:3.2140 lr:0.61 dt:36ms tok/s:1844608 rem:274s step 9228 (54%) loss:3.2204 lr:0.61 dt:35ms tok/s:1860199 rem:274s step 9229 (54%) loss:3.1950 lr:0.61 dt:35ms tok/s:1854301 rem:274s step 9230 (54%) loss:3.2016 lr:0.61 dt:35ms tok/s:1846640 rem:273s step 9231 (54%) loss:3.1994 lr:0.61 dt:35ms tok/s:1854101 rem:273s step 9232 (54%) loss:3.1990 lr:0.61 dt:35ms tok/s:1854089 rem:273s step 9233 (54%) loss:3.2117 lr:0.61 dt:35ms tok/s:1856957 rem:273s step 9234 (54%) loss:3.2094 lr:0.61 dt:35ms tok/s:1858514 rem:273s step 9235 (54%) loss:3.1838 lr:0.61 dt:35ms tok/s:1860010 rem:273s step 9236 (54%) loss:3.1896 lr:0.61 dt:36ms tok/s:1829592 rem:273s step 9237 (54%) loss:3.1857 lr:0.61 dt:36ms tok/s:1820335 rem:273s step 9238 (54%) loss:3.2043 lr:0.61 dt:36ms tok/s:1823354 rem:273s step 9239 (54%) loss:3.1948 lr:0.61 dt:36ms tok/s:1818529 rem:273s step 9240 (54%) loss:3.2013 lr:0.61 dt:36ms tok/s:1826601 rem:273s step 9241 (54%) loss:3.2091 lr:0.61 dt:36ms tok/s:1822508 rem:273s step 9242 (54%) loss:3.2187 lr:0.61 dt:36ms tok/s:1818216 rem:273s step 9243 (54%) loss:3.2167 lr:0.61 dt:36ms tok/s:1825740 rem:273s step 9244 (54%) loss:3.2302 lr:0.61 dt:36ms tok/s:1818036 rem:273s step 9245 (55%) loss:3.2157 lr:0.61 dt:36ms tok/s:1816654 rem:273s step 9246 (55%) loss:3.1852 lr:0.61 dt:36ms tok/s:1819010 rem:273s step 9247 (55%) loss:3.1708 lr:0.61 dt:36ms tok/s:1812414 rem:273s step 9248 (55%) loss:3.1821 lr:0.61 dt:36ms tok/s:1816570 rem:273s step 9249 (55%) loss:3.1379 lr:0.61 dt:36ms tok/s:1814675 rem:273s step 9250 (55%) loss:3.1053 lr:0.61 dt:36ms tok/s:1816750 rem:273s step 9251 (55%) loss:3.0508 lr:0.61 dt:36ms tok/s:1807420 rem:273s step 9252 (55%) loss:3.0227 lr:0.61 dt:36ms tok/s:1808217 rem:273s step 9253 (55%) loss:3.0527 lr:0.61 dt:36ms tok/s:1828302 rem:273s step 9254 (55%) loss:3.0722 lr:0.61 dt:36ms tok/s:1822677 rem:273s step 9255 (55%) loss:3.1118 lr:0.61 dt:36ms tok/s:1819468 rem:273s step 9256 (55%) loss:3.1105 lr:0.61 dt:36ms tok/s:1821011 rem:273s step 9257 (55%) loss:3.1055 lr:0.61 dt:36ms tok/s:1807111 rem:273s step 9258 (55%) loss:3.1135 lr:0.61 dt:36ms tok/s:1818168 rem:272s step 9259 (55%) loss:3.1506 lr:0.61 dt:36ms tok/s:1826820 rem:272s step 9260 (55%) loss:3.2165 lr:0.61 dt:36ms tok/s:1818986 rem:272s step 9261 (55%) loss:3.2865 lr:0.61 dt:36ms tok/s:1815466 rem:272s step 9262 (55%) loss:3.3084 lr:0.61 dt:36ms tok/s:1818565 rem:272s step 9263 (55%) loss:3.2973 lr:0.61 dt:36ms tok/s:1820468 rem:272s step 9264 (55%) loss:3.2944 lr:0.60 dt:36ms tok/s:1813394 rem:272s step 9265 (55%) loss:3.2874 lr:0.60 dt:36ms tok/s:1817843 rem:272s step 9266 (55%) loss:3.2851 lr:0.60 dt:36ms tok/s:1818589 rem:272s step 9267 (55%) loss:3.2727 lr:0.60 dt:36ms tok/s:1818721 rem:272s step 9268 (55%) loss:3.2831 lr:0.60 dt:36ms tok/s:1814543 rem:272s step 9269 (55%) loss:3.2971 lr:0.60 dt:36ms tok/s:1814999 rem:272s step 9270 (55%) loss:3.3451 lr:0.60 dt:36ms tok/s:1822616 rem:272s step 9271 (55%) loss:3.4475 lr:0.60 dt:36ms tok/s:1824516 rem:272s step 9272 (55%) loss:3.5201 lr:0.60 dt:36ms tok/s:1827184 rem:272s step 9273 (55%) loss:3.5235 lr:0.60 dt:36ms tok/s:1818433 rem:272s step 9274 (55%) loss:3.4938 lr:0.60 dt:36ms tok/s:1824794 rem:272s step 9275 (55%) loss:3.4791 lr:0.60 dt:36ms tok/s:1821421 rem:272s step 9276 (55%) loss:3.4633 lr:0.60 dt:36ms tok/s:1819203 rem:272s step 9277 (55%) loss:3.4459 lr:0.60 dt:36ms tok/s:1815310 rem:272s step 9278 (55%) loss:3.4367 lr:0.60 dt:36ms tok/s:1823124 rem:272s step 9279 (55%) loss:3.4204 lr:0.60 dt:36ms tok/s:1817591 rem:272s step 9280 (55%) loss:3.3944 lr:0.60 dt:36ms tok/s:1807658 rem:272s step 9281 (55%) loss:3.3755 lr:0.60 dt:36ms tok/s:1820661 rem:272s step 9282 (55%) loss:3.3644 lr:0.60 dt:36ms tok/s:1817158 rem:272s step 9283 (55%) loss:3.3464 lr:0.60 dt:36ms tok/s:1816558 rem:272s step 9284 (55%) loss:3.3188 lr:0.60 dt:36ms tok/s:1819022 rem:272s step 9285 (55%) loss:3.2831 lr:0.60 dt:36ms tok/s:1823475 rem:272s step 9286 (55%) loss:3.2831 lr:0.60 dt:36ms tok/s:1821590 rem:271s step 9287 (55%) loss:3.2650 lr:0.60 dt:36ms tok/s:1811912 rem:271s step 9288 (55%) loss:3.2740 lr:0.60 dt:36ms tok/s:1819239 rem:271s step 9289 (55%) loss:3.2900 lr:0.60 dt:36ms tok/s:1824249 rem:271s step 9290 (55%) loss:3.2987 lr:0.60 dt:36ms tok/s:1816750 rem:271s step 9291 (55%) loss:3.3007 lr:0.60 dt:36ms tok/s:1819347 rem:271s step 9292 (55%) loss:3.3167 lr:0.60 dt:36ms tok/s:1826019 rem:271s step 9293 (55%) loss:3.3369 lr:0.60 dt:36ms tok/s:1819793 rem:271s step 9294 (55%) loss:3.3349 lr:0.60 dt:36ms tok/s:1818433 rem:271s step 9295 (55%) loss:3.3369 lr:0.60 dt:36ms tok/s:1824213 rem:271s step 9296 (55%) loss:3.3227 lr:0.60 dt:36ms tok/s:1822447 rem:271s step 9297 (55%) loss:3.3176 lr:0.60 dt:36ms tok/s:1804490 rem:271s step 9298 (55%) loss:3.2993 lr:0.60 dt:36ms tok/s:1816390 rem:271s step 9299 (55%) loss:3.2784 lr:0.60 dt:36ms tok/s:1819191 rem:271s step 9300 (55%) loss:3.2414 lr:0.60 dt:36ms tok/s:1828278 rem:271s + local: attn=[0.083, 0.836, 0.845] mlp=[0.566, 0.251, -0.242] + + transition: attn=[2.876, 0.974] mlp=[-0.185, 0.473] + + hierarchy: attn=[3.058, 5.939, 5.616] mlp=[1.197, -0.967, -3.004] + step 9301 (55%) loss:3.2420 lr:0.60 dt:36ms tok/s:1821095 rem:271s step 9302 (55%) loss:3.2524 lr:0.60 dt:36ms tok/s:1819010 rem:271s step 9303 (55%) loss:3.2422 lr:0.60 dt:36ms tok/s:1818012 rem:271s step 9304 (55%) loss:3.2624 lr:0.60 dt:36ms tok/s:1815538 rem:271s step 9305 (55%) loss:3.2237 lr:0.60 dt:36ms tok/s:1810563 rem:271s step 9306 (55%) loss:3.1561 lr:0.60 dt:36ms tok/s:1817038 rem:271s step 9307 (55%) loss:3.1350 lr:0.60 dt:36ms tok/s:1803069 rem:271s step 9308 (55%) loss:3.1392 lr:0.60 dt:36ms tok/s:1818541 rem:271s step 9309 (55%) loss:3.1463 lr:0.60 dt:36ms tok/s:1814495 rem:271s step 9310 (55%) loss:3.1245 lr:0.60 dt:36ms tok/s:1816450 rem:271s step 9311 (55%) loss:3.1262 lr:0.60 dt:36ms tok/s:1814855 rem:271s step 9312 (55%) loss:3.1378 lr:0.60 dt:37ms tok/s:1753081 rem:271s step 9313 (55%) loss:3.1566 lr:0.60 dt:42ms tok/s:1546987 rem:270s step 9314 (55%) loss:3.1696 lr:0.60 dt:39ms tok/s:1668191 rem:270s step 9315 (55%) loss:3.1629 lr:0.60 dt:36ms tok/s:1799988 rem:270s step 9316 (55%) loss:3.1561 lr:0.60 dt:38ms tok/s:1743485 rem:270s step 9317 (55%) loss:3.1427 lr:0.60 dt:37ms tok/s:1790981 rem:270s step 9318 (55%) loss:3.1043 lr:0.60 dt:35ms tok/s:1854739 rem:270s step 9319 (55%) loss:3.0776 lr:0.60 dt:36ms tok/s:1829897 rem:270s step 9320 (55%) loss:3.0926 lr:0.60 dt:36ms tok/s:1823499 rem:270s step 9321 (55%) loss:3.1321 lr:0.60 dt:36ms tok/s:1803140 rem:270s step 9322 (55%) loss:3.1594 lr:0.60 dt:36ms tok/s:1800907 rem:270s step 9323 (55%) loss:3.1914 lr:0.60 dt:36ms tok/s:1818469 rem:270s step 9324 (55%) loss:3.1914 lr:0.60 dt:36ms tok/s:1842098 rem:270s step 9325 (55%) loss:3.1721 lr:0.60 dt:36ms tok/s:1824358 rem:270s step 9326 (55%) loss:3.1944 lr:0.60 dt:36ms tok/s:1833461 rem:270s step 9327 (55%) loss:3.1902 lr:0.60 dt:36ms tok/s:1844410 rem:270s step 9328 (55%) loss:3.1939 lr:0.60 dt:36ms tok/s:1840162 rem:270s step 9329 (55%) loss:3.2013 lr:0.60 dt:36ms tok/s:1837628 rem:270s step 9330 (55%) loss:3.1882 lr:0.60 dt:36ms tok/s:1845921 rem:270s step 9331 (55%) loss:3.1917 lr:0.60 dt:38ms tok/s:1747598 rem:270s step 9332 (55%) loss:3.2168 lr:0.60 dt:36ms tok/s:1828460 rem:270s step 9333 (55%) loss:3.2088 lr:0.60 dt:37ms tok/s:1782167 rem:270s step 9334 (55%) loss:3.2081 lr:0.60 dt:37ms tok/s:1785084 rem:270s step 9335 (55%) loss:3.2142 lr:0.60 dt:36ms tok/s:1801816 rem:270s step 9336 (55%) loss:3.2187 lr:0.60 dt:36ms tok/s:1838476 rem:270s step 9337 (55%) loss:3.2326 lr:0.60 dt:36ms tok/s:1812976 rem:270s step 9338 (55%) loss:3.2550 lr:0.60 dt:37ms tok/s:1792359 rem:270s step 9339 (55%) loss:3.2540 lr:0.60 dt:37ms tok/s:1782491 rem:270s step 9340 (55%) loss:3.2572 lr:0.60 dt:36ms tok/s:1822242 rem:270s step 9341 (55%) loss:3.2597 lr:0.60 dt:37ms tok/s:1789325 rem:269s step 9342 (55%) loss:3.2664 lr:0.60 dt:36ms tok/s:1803969 rem:269s step 9343 (55%) loss:3.2662 lr:0.60 dt:36ms tok/s:1820323 rem:269s step 9344 (55%) loss:3.2832 lr:0.60 dt:37ms tok/s:1786639 rem:269s step 9345 (55%) loss:3.2898 lr:0.60 dt:35ms tok/s:1847621 rem:269s step 9346 (55%) loss:3.2851 lr:0.60 dt:36ms tok/s:1824916 rem:269s step 9347 (55%) loss:3.2848 lr:0.60 dt:36ms tok/s:1815886 rem:269s step 9348 (55%) loss:3.2692 lr:0.60 dt:36ms tok/s:1832324 rem:269s step 9349 (55%) loss:3.2648 lr:0.60 dt:36ms tok/s:1822169 rem:269s step 9350 (55%) loss:3.2567 lr:0.59 dt:36ms tok/s:1839965 rem:269s step 9351 (55%) loss:3.2296 lr:0.59 dt:35ms tok/s:1850905 rem:269s step 9352 (55%) loss:3.1780 lr:0.59 dt:35ms tok/s:1865805 rem:269s step 9353 (55%) loss:3.1219 lr:0.59 dt:35ms tok/s:1860841 rem:269s step 9354 (55%) loss:3.0702 lr:0.59 dt:36ms tok/s:1837149 rem:269s step 9355 (55%) loss:3.0673 lr:0.59 dt:35ms tok/s:1861812 rem:269s step 9356 (55%) loss:3.0847 lr:0.59 dt:35ms tok/s:1854264 rem:269s step 9357 (55%) loss:3.0907 lr:0.59 dt:36ms tok/s:1842135 rem:269s step 9358 (55%) loss:3.1371 lr:0.59 dt:35ms tok/s:1882364 rem:269s step 9359 (55%) loss:3.1641 lr:0.59 dt:35ms tok/s:1880484 rem:269s step 9360 (55%) loss:3.1820 lr:0.59 dt:35ms tok/s:1884894 rem:269s step 9361 (55%) loss:3.1934 lr:0.59 dt:35ms tok/s:1875736 rem:269s step 9362 (55%) loss:3.2040 lr:0.59 dt:35ms tok/s:1881591 rem:269s step 9363 (55%) loss:3.2091 lr:0.59 dt:35ms tok/s:1881733 rem:269s step 9364 (55%) loss:3.2219 lr:0.59 dt:35ms tok/s:1881759 rem:269s step 9365 (55%) loss:3.2073 lr:0.59 dt:35ms tok/s:1878659 rem:269s step 9366 (55%) loss:3.1937 lr:0.59 dt:35ms tok/s:1882712 rem:269s step 9367 (55%) loss:3.2029 lr:0.59 dt:35ms tok/s:1882803 rem:269s step 9368 (55%) loss:3.1890 lr:0.59 dt:35ms tok/s:1883357 rem:269s step 9369 (55%) loss:3.1502 lr:0.59 dt:35ms tok/s:1882171 rem:268s step 9370 (55%) loss:3.1242 lr:0.59 dt:35ms tok/s:1881733 rem:268s step 9371 (55%) loss:3.1624 lr:0.59 dt:35ms tok/s:1877774 rem:268s step 9372 (55%) loss:3.1740 lr:0.59 dt:35ms tok/s:1885101 rem:268s step 9373 (55%) loss:3.1733 lr:0.59 dt:35ms tok/s:1893881 rem:268s step 9374 (55%) loss:3.1732 lr:0.59 dt:35ms tok/s:1894273 rem:268s step 9375 (55%) loss:3.1907 lr:0.59 dt:34ms tok/s:1902980 rem:268s step 9376 (55%) loss:3.2038 lr:0.59 dt:34ms tok/s:1901230 rem:268s step 9377 (55%) loss:3.2197 lr:0.59 dt:34ms tok/s:1901098 rem:268s step 9378 (55%) loss:3.2182 lr:0.59 dt:34ms tok/s:1902677 rem:268s step 9379 (55%) loss:3.2079 lr:0.59 dt:35ms tok/s:1898748 rem:268s step 9380 (55%) loss:3.1960 lr:0.59 dt:35ms tok/s:1890599 rem:268s step 9381 (55%) loss:3.1831 lr:0.59 dt:34ms tok/s:1906914 rem:268s step 9382 (55%) loss:3.1899 lr:0.59 dt:34ms tok/s:1900599 rem:268s step 9383 (55%) loss:3.1904 lr:0.59 dt:34ms tok/s:1901782 rem:268s step 9384 (55%) loss:3.2021 lr:0.59 dt:35ms tok/s:1886330 rem:268s step 9385 (55%) loss:3.2096 lr:0.59 dt:35ms tok/s:1886201 rem:268s step 9386 (55%) loss:3.2071 lr:0.59 dt:35ms tok/s:1895501 rem:268s step 9387 (55%) loss:3.1929 lr:0.59 dt:35ms tok/s:1889182 rem:268s step 9388 (55%) loss:3.2139 lr:0.59 dt:34ms tok/s:1901098 rem:268s step 9389 (55%) loss:3.2053 lr:0.59 dt:35ms tok/s:1893477 rem:268s step 9390 (55%) loss:3.1509 lr:0.59 dt:35ms tok/s:1886511 rem:268s step 9391 (55%) loss:3.1244 lr:0.59 dt:35ms tok/s:1894664 rem:268s step 9392 (55%) loss:3.1291 lr:0.59 dt:35ms tok/s:1884791 rem:268s step 9393 (55%) loss:3.1652 lr:0.59 dt:36ms tok/s:1839288 rem:268s step 9394 (55%) loss:3.2377 lr:0.59 dt:34ms tok/s:1901348 rem:268s step 9395 (55%) loss:3.2290 lr:0.59 dt:35ms tok/s:1891106 rem:268s step 9396 (55%) loss:3.2175 lr:0.59 dt:35ms tok/s:1861522 rem:268s step 9397 (55%) loss:3.1942 lr:0.59 dt:35ms tok/s:1868253 rem:268s step 9398 (55%) loss:3.1816 lr:0.59 dt:35ms tok/s:1859859 rem:267s step 9399 (55%) loss:3.2000 lr:0.59 dt:35ms tok/s:1878865 rem:267s step 9400 (55%) loss:3.2106 lr:0.59 dt:35ms tok/s:1875391 rem:267s + local: attn=[0.075, 0.896, 0.869] mlp=[0.588, 0.254, -0.219] + + transition: attn=[2.900, 0.955] mlp=[-0.186, 0.507] + + hierarchy: attn=[3.092, 5.939, 5.616] mlp=[1.238, -0.931, -2.981] + step 9401 (55%) loss:3.1857 lr:0.59 dt:35ms tok/s:1852889 rem:267s step 9402 (55%) loss:3.1689 lr:0.59 dt:35ms tok/s:1858112 rem:267s step 9403 (55%) loss:3.1655 lr:0.59 dt:35ms tok/s:1854989 rem:267s step 9404 (55%) loss:3.1628 lr:0.59 dt:35ms tok/s:1848627 rem:267s step 9405 (55%) loss:3.1498 lr:0.59 dt:36ms tok/s:1843470 rem:267s step 9406 (55%) loss:3.1624 lr:0.59 dt:35ms tok/s:1855215 rem:267s step 9407 (55%) loss:3.1570 lr:0.59 dt:35ms tok/s:1856844 rem:267s step 9408 (55%) loss:3.1517 lr:0.59 dt:36ms tok/s:1831518 rem:267s step 9409 (55%) loss:3.1307 lr:0.59 dt:36ms tok/s:1819745 rem:267s step 9410 (55%) loss:3.1404 lr:0.59 dt:36ms tok/s:1810158 rem:267s step 9411 (55%) loss:3.1500 lr:0.59 dt:36ms tok/s:1820697 rem:267s step 9412 (55%) loss:3.1722 lr:0.59 dt:36ms tok/s:1813657 rem:267s step 9413 (56%) loss:3.1794 lr:0.59 dt:36ms tok/s:1818637 rem:267s step 9414 (56%) loss:3.1846 lr:0.59 dt:36ms tok/s:1813945 rem:267s step 9415 (56%) loss:3.1968 lr:0.59 dt:36ms tok/s:1815190 rem:267s step 9416 (56%) loss:3.2011 lr:0.59 dt:36ms tok/s:1809133 rem:267s step 9417 (56%) loss:3.2102 lr:0.59 dt:36ms tok/s:1819432 rem:267s step 9418 (56%) loss:3.2097 lr:0.59 dt:36ms tok/s:1827184 rem:267s step 9419 (56%) loss:3.2068 lr:0.59 dt:37ms tok/s:1793095 rem:267s step 9420 (56%) loss:3.2106 lr:0.59 dt:36ms tok/s:1797915 rem:267s step 9421 (56%) loss:3.2180 lr:0.59 dt:36ms tok/s:1821228 rem:267s step 9422 (56%) loss:3.2072 lr:0.59 dt:36ms tok/s:1818060 rem:267s step 9423 (56%) loss:3.2011 lr:0.59 dt:36ms tok/s:1818541 rem:267s step 9424 (56%) loss:3.1893 lr:0.59 dt:36ms tok/s:1816162 rem:267s step 9425 (56%) loss:3.1833 lr:0.59 dt:36ms tok/s:1814447 rem:267s step 9426 (56%) loss:3.1918 lr:0.59 dt:36ms tok/s:1817615 rem:266s step 9427 (56%) loss:3.2210 lr:0.59 dt:36ms tok/s:1806696 rem:266s step 9428 (56%) loss:3.2129 lr:0.59 dt:36ms tok/s:1817735 rem:266s step 9429 (56%) loss:3.1972 lr:0.59 dt:36ms tok/s:1823475 rem:266s step 9430 (56%) loss:3.2083 lr:0.59 dt:36ms tok/s:1819853 rem:266s step 9431 (56%) loss:3.2020 lr:0.59 dt:36ms tok/s:1811184 rem:266s step 9432 (56%) loss:3.2041 lr:0.59 dt:36ms tok/s:1810289 rem:266s step 9433 (56%) loss:3.2044 lr:0.59 dt:36ms tok/s:1809252 rem:266s step 9434 (56%) loss:3.2175 lr:0.59 dt:36ms tok/s:1817687 rem:266s step 9435 (56%) loss:3.2202 lr:0.59 dt:36ms tok/s:1820564 rem:266s step 9436 (56%) loss:3.2291 lr:0.59 dt:36ms tok/s:1818842 rem:266s step 9437 (56%) loss:3.2286 lr:0.59 dt:36ms tok/s:1813394 rem:266s step 9438 (56%) loss:3.2208 lr:0.58 dt:36ms tok/s:1815214 rem:266s step 9439 (56%) loss:3.2299 lr:0.58 dt:36ms tok/s:1823318 rem:266s step 9440 (56%) loss:3.2416 lr:0.58 dt:36ms tok/s:1818264 rem:266s step 9441 (56%) loss:3.2383 lr:0.58 dt:36ms tok/s:1823681 rem:266s step 9442 (56%) loss:3.2380 lr:0.58 dt:36ms tok/s:1811972 rem:266s step 9443 (56%) loss:3.2265 lr:0.58 dt:36ms tok/s:1809181 rem:266s step 9444 (56%) loss:3.2317 lr:0.58 dt:36ms tok/s:1809407 rem:266s step 9445 (56%) loss:3.2326 lr:0.58 dt:36ms tok/s:1813047 rem:266s step 9446 (56%) loss:3.2279 lr:0.58 dt:36ms tok/s:1805521 rem:266s step 9447 (56%) loss:3.2295 lr:0.58 dt:36ms tok/s:1813095 rem:266s step 9448 (56%) loss:3.2245 lr:0.58 dt:36ms tok/s:1811745 rem:266s step 9449 (56%) loss:3.2139 lr:0.58 dt:36ms tok/s:1806292 rem:266s step 9450 (56%) loss:3.2240 lr:0.58 dt:36ms tok/s:1806969 rem:266s step 9451 (56%) loss:3.2147 lr:0.58 dt:37ms tok/s:1778397 rem:266s step 9452 (56%) loss:3.2039 lr:0.58 dt:36ms tok/s:1812390 rem:266s step 9453 (56%) loss:3.2066 lr:0.58 dt:36ms tok/s:1814076 rem:266s step 9454 (56%) loss:3.2061 lr:0.58 dt:36ms tok/s:1815874 rem:265s step 9455 (56%) loss:3.2122 lr:0.58 dt:37ms tok/s:1793786 rem:265s step 9456 (56%) loss:3.2162 lr:0.58 dt:36ms tok/s:1815226 rem:265s step 9457 (56%) loss:3.2159 lr:0.58 dt:36ms tok/s:1811912 rem:265s step 9458 (56%) loss:3.2482 lr:0.58 dt:36ms tok/s:1813179 rem:265s step 9459 (56%) loss:3.2552 lr:0.58 dt:36ms tok/s:1803554 rem:265s step 9460 (56%) loss:3.2531 lr:0.58 dt:36ms tok/s:1814136 rem:265s step 9461 (56%) loss:3.2575 lr:0.58 dt:36ms tok/s:1810027 rem:265s step 9462 (56%) loss:3.2558 lr:0.58 dt:36ms tok/s:1811446 rem:265s step 9463 (56%) loss:3.2484 lr:0.58 dt:36ms tok/s:1815598 rem:265s step 9464 (56%) loss:3.2466 lr:0.58 dt:36ms tok/s:1806090 rem:265s step 9465 (56%) loss:3.2338 lr:0.58 dt:36ms tok/s:1812366 rem:265s step 9466 (56%) loss:3.2221 lr:0.58 dt:36ms tok/s:1813849 rem:265s step 9467 (56%) loss:3.2278 lr:0.58 dt:36ms tok/s:1814699 rem:265s step 9468 (56%) loss:3.2307 lr:0.58 dt:36ms tok/s:1805070 rem:265s step 9469 (56%) loss:3.2251 lr:0.58 dt:36ms tok/s:1808514 rem:265s step 9470 (56%) loss:3.2321 lr:0.58 dt:36ms tok/s:1809360 rem:265s step 9471 (56%) loss:3.2462 lr:0.58 dt:36ms tok/s:1807183 rem:265s step 9472 (56%) loss:3.2589 lr:0.58 dt:37ms tok/s:1792698 rem:265s step 9473 (56%) loss:3.2545 lr:0.58 dt:36ms tok/s:1814316 rem:265s step 9474 (56%) loss:3.2478 lr:0.58 dt:36ms tok/s:1801250 rem:265s step 9475 (56%) loss:3.2586 lr:0.58 dt:36ms tok/s:1807373 rem:265s step 9476 (56%) loss:3.2493 lr:0.58 dt:36ms tok/s:1807920 rem:265s step 9477 (56%) loss:3.2428 lr:0.58 dt:36ms tok/s:1812665 rem:265s step 9478 (56%) loss:3.2186 lr:0.58 dt:36ms tok/s:1815106 rem:265s step 9479 (56%) loss:3.2333 lr:0.58 dt:36ms tok/s:1813334 rem:265s step 9480 (56%) loss:3.2131 lr:0.58 dt:36ms tok/s:1812354 rem:265s step 9481 (56%) loss:3.1967 lr:0.58 dt:36ms tok/s:1809002 rem:264s step 9482 (56%) loss:3.1968 lr:0.58 dt:37ms tok/s:1788219 rem:264s step 9483 (56%) loss:3.2070 lr:0.58 dt:36ms tok/s:1807955 rem:264s step 9484 (56%) loss:3.2134 lr:0.58 dt:36ms tok/s:1821035 rem:264s step 9485 (56%) loss:3.2075 lr:0.58 dt:36ms tok/s:1808598 rem:264s step 9486 (56%) loss:3.2012 lr:0.58 dt:36ms tok/s:1820649 rem:264s step 9487 (56%) loss:3.1782 lr:0.58 dt:36ms tok/s:1822786 rem:264s step 9488 (56%) loss:3.2012 lr:0.58 dt:35ms tok/s:1854351 rem:264s step 9489 (56%) loss:3.2091 lr:0.58 dt:35ms tok/s:1852215 rem:264s step 9490 (56%) loss:3.2048 lr:0.58 dt:36ms tok/s:1843136 rem:264s step 9491 (56%) loss:3.1968 lr:0.58 dt:35ms tok/s:1848863 rem:264s step 9492 (56%) loss:3.2072 lr:0.58 dt:35ms tok/s:1858074 rem:264s step 9493 (56%) loss:3.2085 lr:0.58 dt:35ms tok/s:1855616 rem:264s step 9494 (56%) loss:3.2235 lr:0.58 dt:36ms tok/s:1829860 rem:264s step 9495 (56%) loss:3.2227 lr:0.58 dt:35ms tok/s:1857220 rem:264s step 9496 (56%) loss:3.2188 lr:0.58 dt:35ms tok/s:1853176 rem:264s step 9497 (56%) loss:3.2119 lr:0.58 dt:35ms tok/s:1855929 rem:264s step 9498 (56%) loss:3.1996 lr:0.58 dt:35ms tok/s:1856556 rem:264s step 9499 (56%) loss:3.2020 lr:0.58 dt:35ms tok/s:1864059 rem:264s step 9500 (56%) loss:3.2035 lr:0.58 dt:35ms tok/s:1860363 rem:264s + local: attn=[0.083, 0.860, 0.854] mlp=[0.608, 0.243, -0.220] + + transition: attn=[2.938, 0.980] mlp=[-0.192, 0.481] + + hierarchy: attn=[3.148, 5.939, 5.616] mlp=[1.286, -0.942, -3.049] + step 9501 (56%) loss:3.2035 lr:0.58 dt:35ms tok/s:1847683 rem:264s step 9502 (56%) loss:3.1890 lr:0.58 dt:35ms tok/s:1861951 rem:264s step 9503 (56%) loss:3.1955 lr:0.58 dt:35ms tok/s:1857873 rem:264s step 9504 (56%) loss:3.1929 lr:0.58 dt:35ms tok/s:1861661 rem:264s step 9505 (56%) loss:3.1787 lr:0.58 dt:35ms tok/s:1852989 rem:264s step 9506 (56%) loss:3.1674 lr:0.58 dt:35ms tok/s:1858941 rem:264s step 9507 (56%) loss:3.1758 lr:0.58 dt:35ms tok/s:1854652 rem:264s step 9508 (56%) loss:3.2307 lr:0.58 dt:35ms tok/s:1857597 rem:264s step 9509 (56%) loss:3.2729 lr:0.58 dt:36ms tok/s:1811840 rem:263s step 9510 (56%) loss:3.2908 lr:0.58 dt:35ms tok/s:1848677 rem:263s step 9511 (56%) loss:3.2720 lr:0.58 dt:36ms tok/s:1825497 rem:263s step 9512 (56%) loss:3.2717 lr:0.58 dt:35ms tok/s:1859557 rem:263s step 9513 (56%) loss:3.2921 lr:0.58 dt:35ms tok/s:1858363 rem:263s step 9514 (56%) loss:3.2850 lr:0.58 dt:35ms tok/s:1858727 rem:263s step 9515 (56%) loss:3.2826 lr:0.58 dt:35ms tok/s:1850805 rem:263s step 9516 (56%) loss:3.2807 lr:0.58 dt:35ms tok/s:1865793 rem:263s step 9517 (56%) loss:3.2819 lr:0.58 dt:35ms tok/s:1850419 rem:263s step 9518 (56%) loss:3.2696 lr:0.58 dt:35ms tok/s:1857007 rem:263s step 9519 (56%) loss:3.2702 lr:0.58 dt:35ms tok/s:1855052 rem:263s step 9520 (56%) loss:3.2582 lr:0.58 dt:35ms tok/s:1859356 rem:263s step 9521 (56%) loss:3.2571 lr:0.58 dt:35ms tok/s:1851753 rem:263s step 9522 (56%) loss:3.2543 lr:0.58 dt:35ms tok/s:1863061 rem:263s step 9523 (56%) loss:3.2601 lr:0.58 dt:35ms tok/s:1856205 rem:263s step 9524 (56%) loss:3.2520 lr:0.57 dt:35ms tok/s:1855703 rem:263s step 9525 (56%) loss:3.2395 lr:0.57 dt:35ms tok/s:1861913 rem:263s step 9526 (56%) loss:3.2234 lr:0.57 dt:35ms tok/s:1856355 rem:263s step 9527 (56%) loss:3.2116 lr:0.57 dt:35ms tok/s:1859218 rem:263s step 9528 (56%) loss:3.2172 lr:0.57 dt:36ms tok/s:1845797 rem:263s step 9529 (56%) loss:3.2130 lr:0.57 dt:35ms tok/s:1858878 rem:263s step 9530 (56%) loss:3.2075 lr:0.57 dt:36ms tok/s:1822145 rem:263s step 9531 (56%) loss:3.2137 lr:0.57 dt:35ms tok/s:1854364 rem:263s step 9532 (56%) loss:3.2157 lr:0.57 dt:35ms tok/s:1858589 rem:263s step 9533 (56%) loss:3.2064 lr:0.57 dt:36ms tok/s:1832886 rem:263s step 9534 (56%) loss:3.1996 lr:0.57 dt:35ms tok/s:1859809 rem:263s step 9535 (56%) loss:3.2045 lr:0.57 dt:35ms tok/s:1852489 rem:263s step 9536 (56%) loss:3.1914 lr:0.57 dt:35ms tok/s:1850145 rem:263s step 9537 (56%) loss:3.1984 lr:0.57 dt:35ms tok/s:1851441 rem:263s step 9538 (56%) loss:3.2113 lr:0.57 dt:37ms tok/s:1778685 rem:262s step 9539 (56%) loss:3.2156 lr:0.57 dt:35ms tok/s:1855328 rem:262s step 9540 (56%) loss:3.2106 lr:0.57 dt:35ms tok/s:1858991 rem:262s step 9541 (56%) loss:3.2098 lr:0.57 dt:35ms tok/s:1855578 rem:262s step 9542 (56%) loss:3.2029 lr:0.57 dt:35ms tok/s:1850942 rem:262s step 9543 (56%) loss:3.2047 lr:0.57 dt:35ms tok/s:1856054 rem:262s step 9544 (56%) loss:3.2209 lr:0.57 dt:35ms tok/s:1857534 rem:262s step 9545 (56%) loss:3.2141 lr:0.57 dt:35ms tok/s:1848105 rem:262s step 9546 (56%) loss:3.2208 lr:0.57 dt:35ms tok/s:1855002 rem:262s step 9547 (56%) loss:3.2157 lr:0.57 dt:35ms tok/s:1855591 rem:262s step 9548 (56%) loss:3.2112 lr:0.57 dt:35ms tok/s:1858250 rem:262s step 9549 (56%) loss:3.2101 lr:0.57 dt:35ms tok/s:1854189 rem:262s step 9550 (56%) loss:3.2068 lr:0.57 dt:35ms tok/s:1852764 rem:262s step 9551 (56%) loss:3.2090 lr:0.57 dt:36ms tok/s:1836498 rem:262s step 9552 (56%) loss:3.2179 lr:0.57 dt:35ms tok/s:1852377 rem:262s step 9553 (56%) loss:3.2138 lr:0.57 dt:36ms tok/s:1842901 rem:262s step 9554 (56%) loss:3.2424 lr:0.57 dt:35ms tok/s:1856418 rem:262s step 9555 (56%) loss:3.2697 lr:0.57 dt:35ms tok/s:1861951 rem:262s step 9556 (56%) loss:3.2255 lr:0.57 dt:35ms tok/s:1863137 rem:262s step 9557 (56%) loss:3.1974 lr:0.57 dt:35ms tok/s:1853614 rem:262s step 9558 (56%) loss:3.1727 lr:0.57 dt:35ms tok/s:1848490 rem:262s step 9559 (56%) loss:3.1700 lr:0.57 dt:35ms tok/s:1855816 rem:262s step 9560 (56%) loss:3.2059 lr:0.57 dt:36ms tok/s:1845747 rem:262s step 9561 (56%) loss:3.1964 lr:0.57 dt:35ms tok/s:1850021 rem:262s step 9562 (56%) loss:3.1817 lr:0.57 dt:35ms tok/s:1855365 rem:262s step 9563 (56%) loss:3.2038 lr:0.57 dt:35ms tok/s:1856844 rem:262s step 9564 (56%) loss:3.2093 lr:0.57 dt:35ms tok/s:1857396 rem:262s step 9565 (56%) loss:3.2049 lr:0.57 dt:35ms tok/s:1863819 rem:262s step 9566 (56%) loss:3.1922 lr:0.57 dt:35ms tok/s:1863819 rem:261s step 9567 (56%) loss:3.1888 lr:0.57 dt:35ms tok/s:1861282 rem:261s step 9568 (56%) loss:3.2107 lr:0.57 dt:35ms tok/s:1858413 rem:261s step 9569 (56%) loss:3.2193 lr:0.57 dt:35ms tok/s:1855653 rem:261s step 9570 (56%) loss:3.1885 lr:0.57 dt:36ms tok/s:1834525 rem:261s step 9571 (56%) loss:3.1886 lr:0.57 dt:36ms tok/s:1838267 rem:261s step 9572 (56%) loss:3.1792 lr:0.57 dt:36ms tok/s:1838120 rem:261s step 9573 (56%) loss:3.1799 lr:0.57 dt:36ms tok/s:1838083 rem:261s step 9574 (56%) loss:3.1788 lr:0.57 dt:36ms tok/s:1839153 rem:261s step 9575 (56%) loss:3.1756 lr:0.57 dt:36ms tok/s:1839645 rem:261s step 9576 (56%) loss:3.1958 lr:0.57 dt:35ms tok/s:1849324 rem:261s step 9577 (56%) loss:3.2031 lr:0.57 dt:36ms tok/s:1835897 rem:261s step 9578 (56%) loss:3.2185 lr:0.57 dt:36ms tok/s:1838021 rem:261s step 9579 (56%) loss:3.2354 lr:0.57 dt:37ms tok/s:1780839 rem:261s step 9580 (56%) loss:3.2279 lr:0.57 dt:39ms tok/s:1683744 rem:261s step 9581 (57%) loss:3.2197 lr:0.57 dt:36ms tok/s:1812115 rem:261s step 9582 (57%) loss:3.2186 lr:0.57 dt:35ms tok/s:1878184 rem:261s step 9583 (57%) loss:3.2086 lr:0.57 dt:35ms tok/s:1856581 rem:261s step 9584 (57%) loss:3.2050 lr:0.57 dt:36ms tok/s:1803531 rem:261s step 9585 (57%) loss:3.2039 lr:0.57 dt:37ms tok/s:1780113 rem:261s step 9586 (57%) loss:3.2021 lr:0.57 dt:36ms tok/s:1808015 rem:261s step 9587 (57%) loss:3.1985 lr:0.57 dt:36ms tok/s:1840310 rem:261s step 9588 (57%) loss:3.1875 lr:0.57 dt:36ms tok/s:1842481 rem:261s step 9589 (57%) loss:3.1853 lr:0.57 dt:36ms tok/s:1837493 rem:261s step 9590 (57%) loss:3.1772 lr:0.57 dt:36ms tok/s:1842444 rem:261s step 9591 (57%) loss:3.1879 lr:0.57 dt:36ms tok/s:1837935 rem:261s step 9592 (57%) loss:3.1918 lr:0.57 dt:36ms tok/s:1832556 rem:261s step 9593 (57%) loss:3.1671 lr:0.57 dt:36ms tok/s:1840470 rem:261s step 9594 (57%) loss:3.1747 lr:0.57 dt:36ms tok/s:1837640 rem:260s step 9595 (57%) loss:3.1693 lr:0.57 dt:36ms tok/s:1838599 rem:260s step 9596 (57%) loss:3.1766 lr:0.57 dt:36ms tok/s:1839940 rem:260s step 9597 (57%) loss:3.1933 lr:0.57 dt:36ms tok/s:1834146 rem:260s step 9598 (57%) loss:3.1787 lr:0.57 dt:36ms tok/s:1840791 rem:260s step 9599 (57%) loss:3.1833 lr:0.57 dt:36ms tok/s:1825376 rem:260s step 9600 (57%) loss:3.1697 lr:0.57 dt:36ms tok/s:1836646 rem:260s + local: attn=[0.084, 0.843, 0.885] mlp=[0.605, 0.253, -0.250] + + transition: attn=[2.950, 0.992] mlp=[-0.208, 0.504] + + hierarchy: attn=[3.197, 5.939, 5.616] mlp=[1.295, -0.956, -3.064] + step 9601 (57%) loss:3.1678 lr:0.57 dt:36ms tok/s:1828509 rem:260s step 9602 (57%) loss:3.1478 lr:0.57 dt:36ms tok/s:1833094 rem:260s step 9603 (57%) loss:3.1728 lr:0.57 dt:36ms tok/s:1837382 rem:260s step 9604 (57%) loss:3.1801 lr:0.57 dt:37ms tok/s:1785710 rem:260s step 9605 (57%) loss:3.1838 lr:0.57 dt:36ms tok/s:1821747 rem:260s step 9606 (57%) loss:3.1873 lr:0.57 dt:42ms tok/s:1555505 rem:260s step 9607 (57%) loss:3.1893 lr:0.57 dt:36ms tok/s:1838943 rem:260s step 9608 (57%) loss:3.1900 lr:0.57 dt:35ms tok/s:1864160 rem:260s step 9609 (57%) loss:3.1831 lr:0.57 dt:36ms tok/s:1841580 rem:260s step 9610 (57%) loss:3.1453 lr:0.57 dt:35ms tok/s:1868837 rem:260s step 9611 (57%) loss:3.1227 lr:0.56 dt:35ms tok/s:1864729 rem:260s step 9612 (57%) loss:3.1168 lr:0.56 dt:35ms tok/s:1848142 rem:260s step 9613 (57%) loss:3.1105 lr:0.56 dt:35ms tok/s:1856706 rem:260s step 9614 (57%) loss:3.1066 lr:0.56 dt:35ms tok/s:1861560 rem:260s step 9615 (57%) loss:3.0777 lr:0.56 dt:35ms tok/s:1859608 rem:260s step 9616 (57%) loss:3.0754 lr:0.56 dt:35ms tok/s:1862771 rem:260s step 9617 (57%) loss:3.1011 lr:0.56 dt:35ms tok/s:1860589 rem:260s step 9618 (57%) loss:3.1274 lr:0.56 dt:36ms tok/s:1845375 rem:260s step 9619 (57%) loss:3.1419 lr:0.56 dt:35ms tok/s:1858300 rem:260s step 9620 (57%) loss:3.1364 lr:0.56 dt:35ms tok/s:1854952 rem:260s step 9621 (57%) loss:3.1272 lr:0.56 dt:35ms tok/s:1858639 rem:260s step 9622 (57%) loss:3.1745 lr:0.56 dt:35ms tok/s:1865704 rem:259s step 9623 (57%) loss:3.1861 lr:0.56 dt:35ms tok/s:1846429 rem:259s step 9624 (57%) loss:3.2011 lr:0.56 dt:35ms tok/s:1856819 rem:259s step 9625 (57%) loss:3.1957 lr:0.56 dt:35ms tok/s:1856129 rem:259s step 9626 (57%) loss:3.2104 lr:0.56 dt:35ms tok/s:1848988 rem:259s step 9627 (57%) loss:3.2144 lr:0.56 dt:35ms tok/s:1848565 rem:259s step 9628 (57%) loss:3.1994 lr:0.56 dt:35ms tok/s:1850606 rem:259s step 9629 (57%) loss:3.1654 lr:0.56 dt:35ms tok/s:1849460 rem:259s step 9630 (57%) loss:3.1879 lr:0.56 dt:35ms tok/s:1861232 rem:259s step 9631 (57%) loss:3.2189 lr:0.56 dt:35ms tok/s:1861119 rem:259s step 9632 (57%) loss:3.2033 lr:0.56 dt:35ms tok/s:1852090 rem:259s step 9633 (57%) loss:3.2030 lr:0.56 dt:36ms tok/s:1832031 rem:259s step 9634 (57%) loss:3.2090 lr:0.56 dt:36ms tok/s:1844880 rem:259s step 9635 (57%) loss:3.2059 lr:0.56 dt:36ms tok/s:1837702 rem:259s step 9636 (57%) loss:3.1843 lr:0.56 dt:35ms tok/s:1847658 rem:259s step 9637 (57%) loss:3.1981 lr:0.56 dt:36ms tok/s:1845214 rem:259s step 9638 (57%) loss:3.1883 lr:0.56 dt:35ms tok/s:1856982 rem:259s step 9639 (57%) loss:3.1806 lr:0.56 dt:35ms tok/s:1853489 rem:259s step 9640 (57%) loss:3.1767 lr:0.56 dt:35ms tok/s:1856154 rem:259s step 9641 (57%) loss:3.1775 lr:0.56 dt:35ms tok/s:1859281 rem:259s step 9642 (57%) loss:3.1767 lr:0.56 dt:35ms tok/s:1852826 rem:259s step 9643 (57%) loss:3.1720 lr:0.56 dt:35ms tok/s:1850170 rem:259s step 9644 (57%) loss:3.1846 lr:0.56 dt:35ms tok/s:1855841 rem:259s step 9645 (57%) loss:3.2141 lr:0.56 dt:36ms tok/s:1823354 rem:259s step 9646 (57%) loss:3.2370 lr:0.56 dt:35ms tok/s:1856292 rem:259s step 9647 (57%) loss:3.2390 lr:0.56 dt:35ms tok/s:1851890 rem:259s step 9648 (57%) loss:3.2574 lr:0.56 dt:35ms tok/s:1858200 rem:259s step 9649 (57%) loss:3.2535 lr:0.56 dt:35ms tok/s:1849373 rem:259s step 9650 (57%) loss:3.2365 lr:0.56 dt:35ms tok/s:1849684 rem:258s step 9651 (57%) loss:3.2478 lr:0.56 dt:35ms tok/s:1862089 rem:258s step 9652 (57%) loss:3.2478 lr:0.56 dt:39ms tok/s:1669924 rem:258s step 9653 (57%) loss:3.2522 lr:0.56 dt:35ms tok/s:1860993 rem:258s step 9654 (57%) loss:3.2542 lr:0.56 dt:35ms tok/s:1849672 rem:258s step 9655 (57%) loss:3.2394 lr:0.56 dt:42ms tok/s:1554133 rem:258s step 9656 (57%) loss:3.2446 lr:0.56 dt:39ms tok/s:1689778 rem:258s step 9657 (57%) loss:3.2371 lr:0.56 dt:34ms tok/s:1903666 rem:258s step 9658 (57%) loss:3.2420 lr:0.56 dt:34ms tok/s:1919753 rem:258s step 9659 (57%) loss:3.2359 lr:0.56 dt:34ms tok/s:1923811 rem:258s step 9660 (57%) loss:3.2359 lr:0.56 dt:34ms tok/s:1915125 rem:258s step 9661 (57%) loss:3.2160 lr:0.56 dt:35ms tok/s:1873295 rem:258s step 9662 (57%) loss:3.2139 lr:0.56 dt:35ms tok/s:1884145 rem:258s step 9663 (57%) loss:3.2101 lr:0.56 dt:35ms tok/s:1887496 rem:258s step 9664 (57%) loss:3.2022 lr:0.56 dt:35ms tok/s:1884636 rem:258s step 9665 (57%) loss:3.2261 lr:0.56 dt:35ms tok/s:1882132 rem:258s step 9666 (57%) loss:3.2174 lr:0.56 dt:35ms tok/s:1874866 rem:258s step 9667 (57%) loss:3.1911 lr:0.56 dt:35ms tok/s:1866578 rem:258s step 9668 (57%) loss:3.1861 lr:0.56 dt:35ms tok/s:1884507 rem:258s step 9669 (57%) loss:3.1922 lr:0.56 dt:35ms tok/s:1873091 rem:258s step 9670 (57%) loss:3.2116 lr:0.56 dt:35ms tok/s:1879803 rem:258s step 9671 (57%) loss:3.2231 lr:0.56 dt:35ms tok/s:1876453 rem:258s step 9672 (57%) loss:3.2262 lr:0.56 dt:35ms tok/s:1874496 rem:258s step 9673 (57%) loss:3.2294 lr:0.56 dt:35ms tok/s:1879854 rem:258s step 9674 (57%) loss:3.2172 lr:0.56 dt:35ms tok/s:1875455 rem:258s step 9675 (57%) loss:3.1973 lr:0.56 dt:35ms tok/s:1877928 rem:258s step 9676 (57%) loss:3.1951 lr:0.56 dt:36ms tok/s:1843556 rem:258s step 9677 (57%) loss:3.1921 lr:0.56 dt:35ms tok/s:1881244 rem:258s step 9678 (57%) loss:3.1995 lr:0.56 dt:35ms tok/s:1873831 rem:257s step 9679 (57%) loss:3.2036 lr:0.56 dt:35ms tok/s:1872159 rem:257s step 9680 (57%) loss:3.1938 lr:0.56 dt:35ms tok/s:1861534 rem:257s step 9681 (57%) loss:3.1971 lr:0.56 dt:35ms tok/s:1896364 rem:257s step 9682 (57%) loss:3.1968 lr:0.56 dt:34ms tok/s:1902559 rem:257s step 9683 (57%) loss:3.2037 lr:0.56 dt:35ms tok/s:1881540 rem:257s step 9684 (57%) loss:3.1664 lr:0.56 dt:35ms tok/s:1870618 rem:257s step 9685 (57%) loss:3.1655 lr:0.56 dt:37ms tok/s:1767033 rem:257s step 9686 (57%) loss:3.1944 lr:0.56 dt:35ms tok/s:1894939 rem:257s step 9687 (57%) loss:3.2024 lr:0.56 dt:35ms tok/s:1887794 rem:257s step 9688 (57%) loss:3.2284 lr:0.56 dt:35ms tok/s:1876466 rem:257s step 9689 (57%) loss:3.2259 lr:0.56 dt:35ms tok/s:1857760 rem:257s step 9690 (57%) loss:3.2033 lr:0.56 dt:35ms tok/s:1854714 rem:257s step 9691 (57%) loss:3.1928 lr:0.56 dt:35ms tok/s:1860955 rem:257s step 9692 (57%) loss:3.1634 lr:0.56 dt:36ms tok/s:1828168 rem:257s step 9693 (57%) loss:3.1764 lr:0.56 dt:36ms tok/s:1837542 rem:257s step 9694 (57%) loss:3.1858 lr:0.56 dt:36ms tok/s:1828922 rem:257s step 9695 (57%) loss:3.1964 lr:0.56 dt:36ms tok/s:1835468 rem:257s step 9696 (57%) loss:3.1867 lr:0.56 dt:36ms tok/s:1832067 rem:257s step 9697 (57%) loss:3.1884 lr:0.56 dt:36ms tok/s:1829629 rem:257s step 9698 (57%) loss:3.1928 lr:0.55 dt:36ms tok/s:1837849 rem:257s step 9699 (57%) loss:3.1993 lr:0.55 dt:36ms tok/s:1826698 rem:257s step 9700 (57%) loss:3.1983 lr:0.55 dt:35ms tok/s:1854739 rem:257s + local: attn=[0.088, 0.873, 0.903] mlp=[0.627, 0.258, -0.245] + + transition: attn=[2.999, 0.996] mlp=[-0.203, 0.516] + + hierarchy: attn=[3.188, 5.939, 5.616] mlp=[1.306, -0.984, -3.074] + step 9701 (57%) loss:3.2101 lr:0.55 dt:35ms tok/s:1853839 rem:257s step 9702 (57%) loss:3.2131 lr:0.55 dt:35ms tok/s:1853101 rem:257s step 9703 (57%) loss:3.2109 lr:0.55 dt:35ms tok/s:1850618 rem:257s step 9704 (57%) loss:3.2089 lr:0.55 dt:36ms tok/s:1839534 rem:257s step 9705 (57%) loss:3.2038 lr:0.55 dt:35ms tok/s:1855415 rem:257s step 9706 (57%) loss:3.1994 lr:0.55 dt:36ms tok/s:1830847 rem:257s step 9707 (57%) loss:3.1838 lr:0.55 dt:36ms tok/s:1832862 rem:256s step 9708 (57%) loss:3.1813 lr:0.55 dt:36ms tok/s:1828764 rem:256s step 9709 (57%) loss:3.1858 lr:0.55 dt:36ms tok/s:1838513 rem:256s step 9710 (57%) loss:3.2006 lr:0.55 dt:36ms tok/s:1833791 rem:256s step 9711 (57%) loss:3.1824 lr:0.55 dt:36ms tok/s:1833179 rem:256s step 9712 (57%) loss:3.1838 lr:0.55 dt:36ms tok/s:1829665 rem:256s step 9713 (57%) loss:3.1701 lr:0.55 dt:36ms tok/s:1841901 rem:256s step 9714 (57%) loss:3.1574 lr:0.55 dt:36ms tok/s:1842679 rem:256s step 9715 (57%) loss:3.1587 lr:0.55 dt:36ms tok/s:1841148 rem:256s step 9716 (57%) loss:3.1596 lr:0.55 dt:36ms tok/s:1833473 rem:256s step 9717 (57%) loss:3.1465 lr:0.55 dt:37ms tok/s:1793996 rem:256s step 9718 (57%) loss:3.1439 lr:0.55 dt:36ms tok/s:1819817 rem:256s step 9719 (57%) loss:3.1414 lr:0.55 dt:36ms tok/s:1838820 rem:256s step 9720 (57%) loss:3.1484 lr:0.55 dt:37ms tok/s:1754279 rem:256s step 9721 (57%) loss:3.1305 lr:0.55 dt:36ms tok/s:1812868 rem:256s step 9722 (57%) loss:3.1252 lr:0.55 dt:36ms tok/s:1831823 rem:256s step 9723 (57%) loss:3.1233 lr:0.55 dt:36ms tok/s:1844942 rem:256s step 9724 (57%) loss:3.1141 lr:0.55 dt:36ms tok/s:1832104 rem:256s step 9725 (57%) loss:3.1141 lr:0.55 dt:36ms tok/s:1829215 rem:256s step 9726 (57%) loss:3.1156 lr:0.55 dt:36ms tok/s:1831286 rem:256s step 9727 (57%) loss:3.0994 lr:0.55 dt:36ms tok/s:1834819 rem:256s step 9728 (57%) loss:3.1276 lr:0.55 dt:36ms tok/s:1834084 rem:256s step 9729 (57%) loss:3.1391 lr:0.55 dt:36ms tok/s:1832080 rem:256s step 9730 (57%) loss:3.1521 lr:0.55 dt:36ms tok/s:1830798 rem:256s step 9731 (57%) loss:3.1667 lr:0.55 dt:36ms tok/s:1837382 rem:256s step 9732 (57%) loss:3.1685 lr:0.55 dt:36ms tok/s:1837395 rem:256s step 9733 (57%) loss:3.1959 lr:0.55 dt:36ms tok/s:1832923 rem:256s step 9734 (57%) loss:3.1960 lr:0.55 dt:36ms tok/s:1836867 rem:255s step 9735 (57%) loss:3.1881 lr:0.55 dt:36ms tok/s:1834966 rem:255s step 9736 (57%) loss:3.1879 lr:0.55 dt:36ms tok/s:1843742 rem:255s step 9737 (57%) loss:3.1856 lr:0.55 dt:36ms tok/s:1837714 rem:255s step 9738 (57%) loss:3.1791 lr:0.55 dt:36ms tok/s:1835640 rem:255s step 9739 (57%) loss:3.2240 lr:0.55 dt:36ms tok/s:1834219 rem:255s step 9740 (57%) loss:3.2514 lr:0.55 dt:36ms tok/s:1830957 rem:255s step 9741 (57%) loss:3.2582 lr:0.55 dt:36ms tok/s:1835125 rem:255s step 9742 (57%) loss:3.2692 lr:0.55 dt:36ms tok/s:1829884 rem:255s step 9743 (57%) loss:3.2647 lr:0.55 dt:36ms tok/s:1830104 rem:255s step 9744 (57%) loss:3.2428 lr:0.55 dt:36ms tok/s:1836057 rem:255s step 9745 (57%) loss:3.2234 lr:0.55 dt:36ms tok/s:1834023 rem:255s step 9746 (57%) loss:3.2217 lr:0.55 dt:36ms tok/s:1837714 rem:255s step 9747 (57%) loss:3.2482 lr:0.55 dt:36ms tok/s:1839485 rem:255s step 9748 (57%) loss:3.2341 lr:0.55 dt:36ms tok/s:1820926 rem:255s step 9749 (58%) loss:3.2204 lr:0.55 dt:36ms tok/s:1830213 rem:255s step 9750 (58%) loss:3.2249 lr:0.55 dt:36ms tok/s:1830786 rem:255s step 9751 (58%) loss:3.2202 lr:0.55 dt:36ms tok/s:1837714 rem:255s step 9752 (58%) loss:3.2112 lr:0.55 dt:36ms tok/s:1827293 rem:255s step 9753 (58%) loss:3.2118 lr:0.55 dt:36ms tok/s:1832165 rem:255s step 9754 (58%) loss:3.1865 lr:0.55 dt:36ms tok/s:1826893 rem:255s step 9755 (58%) loss:3.1787 lr:0.55 dt:36ms tok/s:1840051 rem:255s step 9756 (58%) loss:3.1678 lr:0.55 dt:36ms tok/s:1834072 rem:255s step 9757 (58%) loss:3.1759 lr:0.55 dt:36ms tok/s:1836143 rem:255s step 9758 (58%) loss:3.1891 lr:0.55 dt:36ms tok/s:1841457 rem:255s step 9759 (58%) loss:3.1885 lr:0.55 dt:36ms tok/s:1841481 rem:255s step 9760 (58%) loss:3.1827 lr:0.55 dt:36ms tok/s:1836278 rem:255s step 9761 (58%) loss:3.1765 lr:0.55 dt:36ms tok/s:1834170 rem:255s step 9762 (58%) loss:3.1537 lr:0.55 dt:36ms tok/s:1838648 rem:254s step 9763 (58%) loss:3.1558 lr:0.55 dt:36ms tok/s:1841161 rem:254s step 9764 (58%) loss:3.1797 lr:0.55 dt:36ms tok/s:1834917 rem:254s step 9765 (58%) loss:3.1894 lr:0.55 dt:36ms tok/s:1841494 rem:254s step 9766 (58%) loss:3.1816 lr:0.55 dt:36ms tok/s:1839079 rem:254s step 9767 (58%) loss:3.1875 lr:0.55 dt:36ms tok/s:1835076 rem:254s step 9768 (58%) loss:3.1780 lr:0.55 dt:36ms tok/s:1831323 rem:254s step 9769 (58%) loss:3.1616 lr:0.55 dt:36ms tok/s:1826929 rem:254s step 9770 (58%) loss:3.1815 lr:0.55 dt:36ms tok/s:1837051 rem:254s step 9771 (58%) loss:3.2200 lr:0.55 dt:36ms tok/s:1836977 rem:254s step 9772 (58%) loss:3.2018 lr:0.55 dt:36ms tok/s:1839374 rem:254s step 9773 (58%) loss:3.2367 lr:0.55 dt:36ms tok/s:1841296 rem:254s step 9774 (58%) loss:3.2141 lr:0.55 dt:36ms tok/s:1818806 rem:254s step 9775 (58%) loss:3.1938 lr:0.55 dt:36ms tok/s:1838255 rem:254s step 9776 (58%) loss:3.1915 lr:0.55 dt:36ms tok/s:1823669 rem:254s step 9777 (58%) loss:3.1863 lr:0.55 dt:36ms tok/s:1831201 rem:254s step 9778 (58%) loss:3.1871 lr:0.55 dt:36ms tok/s:1824819 rem:254s step 9779 (58%) loss:3.1887 lr:0.55 dt:36ms tok/s:1834942 rem:254s step 9780 (58%) loss:3.2018 lr:0.55 dt:36ms tok/s:1832959 rem:254s step 9781 (58%) loss:3.3424 lr:0.55 dt:36ms tok/s:1838525 rem:254s step 9782 (58%) loss:3.4055 lr:0.55 dt:36ms tok/s:1811697 rem:254s step 9783 (58%) loss:3.3728 lr:0.55 dt:36ms tok/s:1836265 rem:254s step 9784 (58%) loss:3.3373 lr:0.54 dt:36ms tok/s:1832238 rem:254s step 9785 (58%) loss:3.2688 lr:0.54 dt:36ms tok/s:1828253 rem:254s step 9786 (58%) loss:3.2528 lr:0.54 dt:36ms tok/s:1834048 rem:254s step 9787 (58%) loss:3.2386 lr:0.54 dt:36ms tok/s:1806933 rem:254s step 9788 (58%) loss:3.2371 lr:0.54 dt:36ms tok/s:1808824 rem:254s step 9789 (58%) loss:3.2355 lr:0.54 dt:36ms tok/s:1838365 rem:254s step 9790 (58%) loss:3.2360 lr:0.54 dt:36ms tok/s:1805426 rem:253s step 9791 (58%) loss:3.2276 lr:0.54 dt:36ms tok/s:1830225 rem:253s step 9792 (58%) loss:3.2600 lr:0.54 dt:36ms tok/s:1833571 rem:253s step 9793 (58%) loss:3.2737 lr:0.54 dt:36ms tok/s:1832935 rem:253s step 9794 (58%) loss:3.2794 lr:0.54 dt:36ms tok/s:1840729 rem:253s step 9795 (58%) loss:3.2921 lr:0.54 dt:36ms tok/s:1832104 rem:253s step 9796 (58%) loss:3.2670 lr:0.54 dt:36ms tok/s:1831860 rem:253s step 9797 (58%) loss:3.2355 lr:0.54 dt:36ms tok/s:1838488 rem:253s step 9798 (58%) loss:3.2340 lr:0.54 dt:36ms tok/s:1844459 rem:253s step 9799 (58%) loss:3.2399 lr:0.54 dt:36ms tok/s:1836646 rem:253s step 9800 (58%) loss:3.2303 lr:0.54 dt:36ms tok/s:1840298 rem:253s + local: attn=[0.084, 0.837, 0.858] mlp=[0.643, 0.269, -0.248] + + transition: attn=[2.965, 0.996] mlp=[-0.218, 0.540] + + hierarchy: attn=[3.208, 5.939, 5.616] mlp=[1.309, -1.024, -3.112] + step 9801 (58%) loss:3.2246 lr:0.54 dt:36ms tok/s:1829409 rem:253s step 9802 (58%) loss:3.2390 lr:0.54 dt:36ms tok/s:1815946 rem:253s step 9803 (58%) loss:3.2380 lr:0.54 dt:36ms tok/s:1828326 rem:253s step 9804 (58%) loss:3.2296 lr:0.54 dt:36ms tok/s:1839399 rem:253s step 9805 (58%) loss:3.2430 lr:0.54 dt:36ms tok/s:1831506 rem:253s step 9806 (58%) loss:3.2555 lr:0.54 dt:36ms tok/s:1837923 rem:253s step 9807 (58%) loss:3.2591 lr:0.54 dt:36ms tok/s:1838710 rem:253s step 9808 (58%) loss:3.2664 lr:0.54 dt:36ms tok/s:1839608 rem:253s step 9809 (58%) loss:3.2473 lr:0.54 dt:36ms tok/s:1832788 rem:253s step 9810 (58%) loss:3.2239 lr:0.54 dt:36ms tok/s:1805651 rem:253s step 9811 (58%) loss:3.1880 lr:0.54 dt:36ms tok/s:1832776 rem:253s step 9812 (58%) loss:3.1796 lr:0.54 dt:36ms tok/s:1836977 rem:253s step 9813 (58%) loss:3.1849 lr:0.54 dt:36ms tok/s:1838218 rem:253s step 9814 (58%) loss:3.1821 lr:0.54 dt:38ms tok/s:1720202 rem:253s step 9815 (58%) loss:3.1944 lr:0.54 dt:36ms tok/s:1836327 rem:253s step 9816 (58%) loss:3.1905 lr:0.54 dt:36ms tok/s:1828776 rem:253s step 9817 (58%) loss:3.1930 lr:0.54 dt:36ms tok/s:1835885 rem:253s step 9818 (58%) loss:3.1907 lr:0.54 dt:36ms tok/s:1830481 rem:252s step 9819 (58%) loss:3.1994 lr:0.54 dt:36ms tok/s:1830774 rem:252s step 9820 (58%) loss:3.1947 lr:0.54 dt:36ms tok/s:1837100 rem:252s step 9821 (58%) loss:3.1755 lr:0.54 dt:36ms tok/s:1839140 rem:252s step 9822 (58%) loss:3.1711 lr:0.54 dt:36ms tok/s:1835701 rem:252s step 9823 (58%) loss:3.1659 lr:0.54 dt:36ms tok/s:1835922 rem:252s step 9824 (58%) loss:3.1554 lr:0.54 dt:36ms tok/s:1830433 rem:252s step 9825 (58%) loss:3.1655 lr:0.54 dt:36ms tok/s:1840902 rem:252s step 9826 (58%) loss:3.1777 lr:0.54 dt:36ms tok/s:1833314 rem:252s step 9827 (58%) loss:3.1767 lr:0.54 dt:36ms tok/s:1835530 rem:252s step 9828 (58%) loss:3.1581 lr:0.54 dt:36ms tok/s:1837370 rem:252s step 9829 (58%) loss:3.1756 lr:0.54 dt:36ms tok/s:1832776 rem:252s step 9830 (58%) loss:3.1592 lr:0.54 dt:36ms tok/s:1838931 rem:252s step 9831 (58%) loss:3.1524 lr:0.54 dt:36ms tok/s:1825922 rem:252s step 9832 (58%) loss:3.1584 lr:0.54 dt:36ms tok/s:1836633 rem:252s step 9833 (58%) loss:3.1538 lr:0.54 dt:36ms tok/s:1837886 rem:252s step 9834 (58%) loss:3.1444 lr:0.54 dt:36ms tok/s:1831360 rem:252s step 9835 (58%) loss:3.1371 lr:0.54 dt:36ms tok/s:1835812 rem:252s step 9836 (58%) loss:3.1442 lr:0.54 dt:36ms tok/s:1813334 rem:252s step 9837 (58%) loss:3.1463 lr:0.54 dt:36ms tok/s:1835603 rem:252s step 9838 (58%) loss:3.1503 lr:0.54 dt:36ms tok/s:1828606 rem:252s step 9839 (58%) loss:3.1408 lr:0.54 dt:36ms tok/s:1833803 rem:252s step 9840 (58%) loss:3.1501 lr:0.54 dt:36ms tok/s:1831457 rem:252s step 9841 (58%) loss:3.1490 lr:0.54 dt:36ms tok/s:1839731 rem:252s step 9842 (58%) loss:3.1503 lr:0.54 dt:36ms tok/s:1835750 rem:252s step 9843 (58%) loss:3.1572 lr:0.54 dt:36ms tok/s:1806541 rem:252s step 9844 (58%) loss:3.1744 lr:0.54 dt:36ms tok/s:1834305 rem:252s step 9845 (58%) loss:3.1706 lr:0.54 dt:36ms tok/s:1838697 rem:252s step 9846 (58%) loss:3.1578 lr:0.54 dt:36ms tok/s:1842605 rem:251s step 9847 (58%) loss:3.1533 lr:0.54 dt:36ms tok/s:1839694 rem:251s step 9848 (58%) loss:3.1573 lr:0.54 dt:36ms tok/s:1834942 rem:251s step 9849 (58%) loss:3.1429 lr:0.54 dt:36ms tok/s:1835395 rem:251s step 9850 (58%) loss:3.1526 lr:0.54 dt:36ms tok/s:1830555 rem:251s step 9851 (58%) loss:3.1492 lr:0.54 dt:36ms tok/s:1840556 rem:251s step 9852 (58%) loss:3.1387 lr:0.54 dt:36ms tok/s:1835419 rem:251s step 9853 (58%) loss:3.1312 lr:0.54 dt:36ms tok/s:1836290 rem:251s step 9854 (58%) loss:3.1602 lr:0.54 dt:37ms tok/s:1758735 rem:251s step 9855 (58%) loss:3.1747 lr:0.54 dt:36ms tok/s:1829665 rem:251s step 9856 (58%) loss:3.1674 lr:0.54 dt:36ms tok/s:1838697 rem:251s step 9857 (58%) loss:3.1598 lr:0.54 dt:37ms tok/s:1791016 rem:251s step 9858 (58%) loss:3.1788 lr:0.54 dt:37ms tok/s:1784945 rem:251s step 9859 (58%) loss:3.1972 lr:0.54 dt:36ms tok/s:1810647 rem:251s step 9860 (58%) loss:3.1998 lr:0.54 dt:36ms tok/s:1813514 rem:251s step 9861 (58%) loss:3.1860 lr:0.54 dt:36ms tok/s:1837260 rem:251s step 9862 (58%) loss:3.1840 lr:0.54 dt:36ms tok/s:1836768 rem:251s step 9863 (58%) loss:3.1696 lr:0.54 dt:36ms tok/s:1831433 rem:251s step 9864 (58%) loss:3.2061 lr:0.54 dt:36ms tok/s:1832080 rem:251s step 9865 (58%) loss:3.1988 lr:0.54 dt:36ms tok/s:1844546 rem:251s step 9866 (58%) loss:3.2042 lr:0.54 dt:36ms tok/s:1838636 rem:251s step 9867 (58%) loss:3.2014 lr:0.54 dt:36ms tok/s:1831323 rem:251s step 9868 (58%) loss:3.1805 lr:0.54 dt:36ms tok/s:1797927 rem:251s step 9869 (58%) loss:3.1539 lr:0.53 dt:36ms tok/s:1834770 rem:251s step 9870 (58%) loss:3.1772 lr:0.53 dt:36ms tok/s:1836437 rem:251s step 9871 (58%) loss:3.1828 lr:0.53 dt:36ms tok/s:1835701 rem:251s step 9872 (58%) loss:3.1606 lr:0.53 dt:36ms tok/s:1839694 rem:251s step 9873 (58%) loss:3.1580 lr:0.53 dt:36ms tok/s:1837603 rem:251s step 9874 (58%) loss:3.1683 lr:0.53 dt:36ms tok/s:1829665 rem:250s step 9875 (58%) loss:3.1640 lr:0.53 dt:36ms tok/s:1842666 rem:250s step 9876 (58%) loss:3.1651 lr:0.53 dt:36ms tok/s:1831018 rem:250s step 9877 (58%) loss:3.1604 lr:0.53 dt:36ms tok/s:1833179 rem:250s step 9878 (58%) loss:3.1348 lr:0.53 dt:36ms tok/s:1827658 rem:250s step 9879 (58%) loss:3.1300 lr:0.53 dt:36ms tok/s:1840828 rem:250s step 9880 (58%) loss:3.1334 lr:0.53 dt:36ms tok/s:1840088 rem:250s step 9881 (58%) loss:3.1314 lr:0.53 dt:36ms tok/s:1842876 rem:250s step 9882 (58%) loss:3.1447 lr:0.53 dt:36ms tok/s:1824116 rem:250s step 9883 (58%) loss:3.1317 lr:0.53 dt:36ms tok/s:1833632 rem:250s step 9884 (58%) loss:3.1206 lr:0.53 dt:36ms tok/s:1838328 rem:250s step 9885 (58%) loss:3.1435 lr:0.53 dt:36ms tok/s:1830372 rem:250s step 9886 (58%) loss:3.1454 lr:0.53 dt:36ms tok/s:1840138 rem:250s step 9887 (58%) loss:3.1457 lr:0.53 dt:36ms tok/s:1836008 rem:250s step 9888 (58%) loss:3.1719 lr:0.53 dt:36ms tok/s:1826698 rem:250s step 9889 (58%) loss:3.1700 lr:0.53 dt:36ms tok/s:1828789 rem:250s step 9890 (58%) loss:3.1756 lr:0.53 dt:36ms tok/s:1832984 rem:250s step 9891 (58%) loss:3.1587 lr:0.53 dt:36ms tok/s:1835456 rem:250s step 9892 (58%) loss:3.1482 lr:0.53 dt:36ms tok/s:1829531 rem:250s step 9893 (58%) loss:3.1421 lr:0.53 dt:36ms tok/s:1833522 rem:250s step 9894 (58%) loss:3.1415 lr:0.53 dt:36ms tok/s:1833571 rem:250s step 9895 (58%) loss:3.1478 lr:0.53 dt:38ms tok/s:1730795 rem:250s step 9896 (58%) loss:3.1380 lr:0.53 dt:36ms tok/s:1834525 rem:250s step 9897 (58%) loss:3.1227 lr:0.53 dt:36ms tok/s:1831067 rem:250s step 9898 (58%) loss:3.1009 lr:0.53 dt:36ms tok/s:1838009 rem:250s step 9899 (58%) loss:3.1143 lr:0.53 dt:36ms tok/s:1830128 rem:250s step 9900 (58%) loss:3.1274 lr:0.53 dt:36ms tok/s:1836118 rem:250s + local: attn=[0.094, 0.828, 0.877] mlp=[0.651, 0.268, -0.231] + + transition: attn=[3.033, 1.019] mlp=[-0.219, 0.584] + + hierarchy: attn=[3.237, 5.939, 5.616] mlp=[1.351, -1.008, -3.102] + step 9901 (58%) loss:3.1383 lr:0.53 dt:36ms tok/s:1835652 rem:250s step 9902 (58%) loss:3.1500 lr:0.53 dt:36ms tok/s:1834880 rem:249s step 9903 (58%) loss:3.1458 lr:0.53 dt:36ms tok/s:1836879 rem:249s step 9904 (58%) loss:3.1464 lr:0.53 dt:36ms tok/s:1838070 rem:249s step 9905 (58%) loss:3.1526 lr:0.53 dt:36ms tok/s:1826929 rem:249s step 9906 (58%) loss:3.1456 lr:0.53 dt:35ms tok/s:1860552 rem:249s step 9907 (58%) loss:3.1416 lr:0.53 dt:35ms tok/s:1860350 rem:249s step 9908 (58%) loss:3.1577 lr:0.53 dt:36ms tok/s:1841580 rem:249s step 9909 (58%) loss:3.1764 lr:0.53 dt:36ms tok/s:1834537 rem:249s step 9910 (58%) loss:3.1694 lr:0.53 dt:36ms tok/s:1828740 rem:249s step 9911 (58%) loss:3.1590 lr:0.53 dt:36ms tok/s:1834452 rem:249s step 9912 (58%) loss:3.1622 lr:0.53 dt:36ms tok/s:1833314 rem:249s step 9913 (58%) loss:3.1671 lr:0.53 dt:36ms tok/s:1835089 rem:249s step 9914 (58%) loss:3.1623 lr:0.53 dt:36ms tok/s:1825194 rem:249s step 9915 (58%) loss:3.1725 lr:0.53 dt:36ms tok/s:1819612 rem:249s step 9916 (58%) loss:3.1682 lr:0.53 dt:36ms tok/s:1833620 rem:249s step 9917 (59%) loss:3.1546 lr:0.53 dt:36ms tok/s:1833766 rem:249s step 9918 (59%) loss:3.2602 lr:0.53 dt:36ms tok/s:1830945 rem:249s step 9919 (59%) loss:3.3805 lr:0.53 dt:36ms tok/s:1827160 rem:249s step 9920 (59%) loss:3.4513 lr:0.53 dt:36ms tok/s:1830506 rem:249s step 9921 (59%) loss:3.4424 lr:0.53 dt:36ms tok/s:1825073 rem:249s step 9922 (59%) loss:3.4273 lr:0.53 dt:36ms tok/s:1826625 rem:249s step 9923 (59%) loss:3.3991 lr:0.53 dt:36ms tok/s:1830555 rem:249s step 9924 (59%) loss:3.3668 lr:0.53 dt:36ms tok/s:1826492 rem:249s step 9925 (59%) loss:3.3215 lr:0.53 dt:36ms tok/s:1834023 rem:249s step 9926 (59%) loss:3.2944 lr:0.53 dt:36ms tok/s:1835358 rem:249s step 9927 (59%) loss:3.2815 lr:0.53 dt:36ms tok/s:1827901 rem:249s step 9928 (59%) loss:3.2568 lr:0.53 dt:36ms tok/s:1832471 rem:249s step 9929 (59%) loss:3.2446 lr:0.53 dt:36ms tok/s:1841074 rem:249s step 9930 (59%) loss:3.2272 lr:0.53 dt:36ms tok/s:1834782 rem:248s step 9931 (59%) loss:3.2213 lr:0.53 dt:36ms tok/s:1834807 rem:248s step 9932 (59%) loss:3.1972 lr:0.53 dt:36ms tok/s:1826395 rem:248s step 9933 (59%) loss:3.1673 lr:0.53 dt:39ms tok/s:1666088 rem:248s step 9934 (59%) loss:3.1553 lr:0.53 dt:37ms tok/s:1768352 rem:248s step 9935 (59%) loss:3.1550 lr:0.53 dt:35ms tok/s:1876274 rem:248s step 9936 (59%) loss:3.1322 lr:0.53 dt:35ms tok/s:1875058 rem:248s step 9937 (59%) loss:3.1045 lr:0.53 dt:35ms tok/s:1881810 rem:248s step 9938 (59%) loss:3.1196 lr:0.53 dt:35ms tok/s:1873525 rem:248s step 9939 (59%) loss:3.1499 lr:0.53 dt:35ms tok/s:1869345 rem:248s step 9940 (59%) loss:3.1439 lr:0.53 dt:35ms tok/s:1876710 rem:248s step 9941 (59%) loss:3.1612 lr:0.53 dt:35ms tok/s:1869994 rem:248s step 9942 (59%) loss:3.1570 lr:0.53 dt:35ms tok/s:1870949 rem:248s step 9943 (59%) loss:3.1390 lr:0.53 dt:35ms tok/s:1879250 rem:248s step 9944 (59%) loss:3.1404 lr:0.53 dt:35ms tok/s:1850830 rem:248s step 9945 (59%) loss:3.1242 lr:0.53 dt:35ms tok/s:1849909 rem:248s step 9946 (59%) loss:3.1194 lr:0.53 dt:35ms tok/s:1859130 rem:248s step 9947 (59%) loss:3.1064 lr:0.53 dt:35ms tok/s:1852327 rem:248s step 9948 (59%) loss:3.1088 lr:0.53 dt:35ms tok/s:1857949 rem:248s step 9949 (59%) loss:3.1133 lr:0.53 dt:35ms tok/s:1856029 rem:248s step 9950 (59%) loss:3.1285 lr:0.53 dt:35ms tok/s:1847509 rem:248s step 9951 (59%) loss:3.1301 lr:0.53 dt:35ms tok/s:1856342 rem:248s step 9952 (59%) loss:3.1207 lr:0.53 dt:35ms tok/s:1852240 rem:248s step 9953 (59%) loss:3.1327 lr:0.53 dt:35ms tok/s:1881656 rem:248s step 9954 (59%) loss:3.1420 lr:0.53 dt:35ms tok/s:1855490 rem:248s step 9955 (59%) loss:3.1226 lr:0.52 dt:35ms tok/s:1850768 rem:248s step 9956 (59%) loss:3.1255 lr:0.52 dt:35ms tok/s:1862531 rem:248s step 9957 (59%) loss:3.1187 lr:0.52 dt:35ms tok/s:1850643 rem:248s step 9958 (59%) loss:3.1234 lr:0.52 dt:35ms tok/s:1853651 rem:247s step 9959 (59%) loss:3.1115 lr:0.52 dt:35ms tok/s:1852052 rem:247s step 9960 (59%) loss:3.1313 lr:0.52 dt:35ms tok/s:1855553 rem:247s step 9961 (59%) loss:3.1306 lr:0.52 dt:35ms tok/s:1856581 rem:247s step 9962 (59%) loss:3.1263 lr:0.52 dt:35ms tok/s:1857559 rem:247s step 9963 (59%) loss:3.1286 lr:0.52 dt:35ms tok/s:1856756 rem:247s step 9964 (59%) loss:3.1123 lr:0.52 dt:35ms tok/s:1854964 rem:247s step 9965 (59%) loss:3.1104 lr:0.52 dt:35ms tok/s:1857672 rem:247s step 9966 (59%) loss:3.1157 lr:0.52 dt:35ms tok/s:1855453 rem:247s step 9967 (59%) loss:3.1293 lr:0.52 dt:35ms tok/s:1851503 rem:247s step 9968 (59%) loss:3.1310 lr:0.52 dt:35ms tok/s:1857233 rem:247s step 9969 (59%) loss:3.1483 lr:0.52 dt:35ms tok/s:1854514 rem:247s step 9970 (59%) loss:3.1555 lr:0.52 dt:35ms tok/s:1858564 rem:247s step 9971 (59%) loss:3.1639 lr:0.52 dt:35ms tok/s:1848068 rem:247s step 9972 (59%) loss:3.1435 lr:0.52 dt:35ms tok/s:1851179 rem:247s step 9973 (59%) loss:3.1395 lr:0.52 dt:35ms tok/s:1856543 rem:247s step 9974 (59%) loss:3.1337 lr:0.52 dt:35ms tok/s:1849075 rem:247s step 9975 (59%) loss:3.1260 lr:0.52 dt:35ms tok/s:1851865 rem:247s step 9976 (59%) loss:3.1188 lr:0.52 dt:35ms tok/s:1859130 rem:247s step 9977 (59%) loss:3.1221 lr:0.52 dt:35ms tok/s:1860980 rem:247s step 9978 (59%) loss:3.1417 lr:0.52 dt:35ms tok/s:1853639 rem:247s step 9979 (59%) loss:3.1462 lr:0.52 dt:35ms tok/s:1855428 rem:247s step 9980 (59%) loss:3.1529 lr:0.52 dt:35ms tok/s:1850668 rem:247s step 9981 (59%) loss:3.1536 lr:0.52 dt:35ms tok/s:1858212 rem:247s step 9982 (59%) loss:3.1419 lr:0.52 dt:35ms tok/s:1858464 rem:247s step 9983 (59%) loss:3.1545 lr:0.52 dt:35ms tok/s:1856631 rem:247s step 9984 (59%) loss:3.1594 lr:0.52 dt:35ms tok/s:1857810 rem:247s step 9985 (59%) loss:3.1469 lr:0.52 dt:35ms tok/s:1861547 rem:247s step 9986 (59%) loss:3.1371 lr:0.52 dt:35ms tok/s:1857785 rem:247s step 9987 (59%) loss:3.1154 lr:0.52 dt:35ms tok/s:1857208 rem:246s step 9988 (59%) loss:3.1142 lr:0.52 dt:35ms tok/s:1852027 rem:246s step 9989 (59%) loss:3.1260 lr:0.52 dt:35ms tok/s:1858464 rem:246s step 9990 (59%) loss:3.1248 lr:0.52 dt:35ms tok/s:1853576 rem:246s step 9991 (59%) loss:3.1149 lr:0.52 dt:36ms tok/s:1841050 rem:246s step 9992 (59%) loss:3.1045 lr:0.52 dt:36ms tok/s:1845983 rem:246s step 9993 (59%) loss:3.0929 lr:0.52 dt:36ms tok/s:1834084 rem:246s step 9994 (59%) loss:3.0777 lr:0.52 dt:36ms tok/s:1831616 rem:246s step 9995 (59%) loss:3.0543 lr:0.52 dt:36ms tok/s:1836548 rem:246s step 9996 (59%) loss:3.0600 lr:0.52 dt:36ms tok/s:1829604 rem:246s step 9997 (59%) loss:3.0642 lr:0.52 dt:36ms tok/s:1836498 rem:246s step 9998 (59%) loss:3.0419 lr:0.52 dt:36ms tok/s:1840939 rem:246s step 9999 (59%) loss:3.0251 lr:0.52 dt:36ms tok/s:1806363 rem:246s step 10000 (59%) loss:3.0324 lr:0.52 dt:36ms tok/s:1832849 rem:246s + local: attn=[0.083, 0.872, 0.876] mlp=[0.663, 0.258, -0.255] + + transition: attn=[3.092, 1.000] mlp=[-0.243, 0.577] + + hierarchy: attn=[3.246, 5.939, 5.616] mlp=[1.396, -1.037, -3.155] + step 10001 (59%) loss:3.0595 lr:0.52 dt:36ms tok/s:1842061 rem:246s step 10002 (59%) loss:3.0702 lr:0.52 dt:36ms tok/s:1828971 rem:246s step 10003 (59%) loss:3.0509 lr:0.52 dt:36ms tok/s:1830798 rem:246s step 10004 (59%) loss:3.0676 lr:0.52 dt:44ms tok/s:1504770 rem:246s step 10005 (59%) loss:3.0815 lr:0.52 dt:35ms tok/s:1897450 rem:246s step 10006 (59%) loss:3.0905 lr:0.52 dt:35ms tok/s:1892838 rem:246s step 10007 (59%) loss:3.1181 lr:0.52 dt:35ms tok/s:1875775 rem:246s step 10008 (59%) loss:3.1233 lr:0.52 dt:35ms tok/s:1879353 rem:246s step 10009 (59%) loss:3.1084 lr:0.52 dt:35ms tok/s:1882351 rem:246s step 10010 (59%) loss:3.1219 lr:0.52 dt:35ms tok/s:1855666 rem:246s step 10011 (59%) loss:3.1376 lr:0.52 dt:35ms tok/s:1875416 rem:246s step 10012 (59%) loss:3.1216 lr:0.52 dt:38ms tok/s:1738556 rem:246s step 10013 (59%) loss:3.1344 lr:0.52 dt:35ms tok/s:1861812 rem:246s step 10014 (59%) loss:3.1438 lr:0.52 dt:34ms tok/s:1922048 rem:246s step 10015 (59%) loss:3.1333 lr:0.52 dt:34ms tok/s:1904602 rem:245s step 10016 (59%) loss:3.1509 lr:0.52 dt:34ms tok/s:1904576 rem:245s step 10017 (59%) loss:3.1392 lr:0.52 dt:35ms tok/s:1899233 rem:245s step 10018 (59%) loss:3.1083 lr:0.52 dt:35ms tok/s:1898695 rem:245s step 10019 (59%) loss:3.1059 lr:0.52 dt:35ms tok/s:1883203 rem:245s step 10020 (59%) loss:3.1077 lr:0.52 dt:40ms tok/s:1639486 rem:245s step 10021 (59%) loss:3.1037 lr:0.52 dt:37ms tok/s:1778788 rem:245s step 10022 (59%) loss:3.1012 lr:0.52 dt:34ms tok/s:1905870 rem:245s step 10023 (59%) loss:3.0962 lr:0.52 dt:34ms tok/s:1905910 rem:245s step 10024 (59%) loss:3.1144 lr:0.52 dt:34ms tok/s:1909577 rem:245s step 10025 (59%) loss:3.1164 lr:0.52 dt:35ms tok/s:1885851 rem:245s step 10026 (59%) loss:3.1360 lr:0.52 dt:35ms tok/s:1864755 rem:245s step 10027 (59%) loss:3.1519 lr:0.52 dt:35ms tok/s:1877633 rem:245s step 10028 (59%) loss:3.1617 lr:0.52 dt:35ms tok/s:1896822 rem:245s step 10029 (59%) loss:3.1538 lr:0.52 dt:34ms tok/s:1903033 rem:245s step 10030 (59%) loss:3.1422 lr:0.52 dt:35ms tok/s:1890469 rem:245s step 10031 (59%) loss:3.1595 lr:0.52 dt:35ms tok/s:1891965 rem:245s step 10032 (59%) loss:3.1532 lr:0.52 dt:35ms tok/s:1896154 rem:245s step 10033 (59%) loss:3.1646 lr:0.52 dt:35ms tok/s:1880935 rem:245s step 10034 (59%) loss:3.1456 lr:0.52 dt:35ms tok/s:1884016 rem:245s step 10035 (59%) loss:3.1353 lr:0.52 dt:35ms tok/s:1875442 rem:245s step 10036 (59%) loss:3.1388 lr:0.52 dt:35ms tok/s:1878094 rem:245s step 10037 (59%) loss:3.1474 lr:0.52 dt:35ms tok/s:1874240 rem:245s step 10038 (59%) loss:3.1396 lr:0.52 dt:35ms tok/s:1875877 rem:245s step 10039 (59%) loss:3.1348 lr:0.52 dt:35ms tok/s:1860464 rem:245s step 10040 (59%) loss:3.1447 lr:0.52 dt:35ms tok/s:1871764 rem:245s step 10041 (59%) loss:3.1368 lr:0.52 dt:35ms tok/s:1871509 rem:245s step 10042 (59%) loss:3.1275 lr:0.51 dt:35ms tok/s:1873256 rem:245s step 10043 (59%) loss:3.1331 lr:0.51 dt:35ms tok/s:1871140 rem:244s step 10044 (59%) loss:3.1161 lr:0.51 dt:35ms tok/s:1854977 rem:244s step 10045 (59%) loss:3.1143 lr:0.51 dt:35ms tok/s:1856443 rem:244s step 10046 (59%) loss:3.1219 lr:0.51 dt:35ms tok/s:1846839 rem:244s step 10047 (59%) loss:3.1410 lr:0.51 dt:36ms tok/s:1844150 rem:244s step 10048 (59%) loss:3.1473 lr:0.51 dt:35ms tok/s:1855966 rem:244s step 10049 (59%) loss:3.1328 lr:0.51 dt:36ms tok/s:1802478 rem:244s step 10050 (59%) loss:3.1367 lr:0.51 dt:35ms tok/s:1850793 rem:244s step 10051 (59%) loss:3.1428 lr:0.51 dt:35ms tok/s:1855453 rem:244s step 10052 (59%) loss:3.1295 lr:0.51 dt:35ms tok/s:1861925 rem:244s step 10053 (59%) loss:3.1059 lr:0.51 dt:35ms tok/s:1847422 rem:244s step 10054 (59%) loss:3.0934 lr:0.51 dt:35ms tok/s:1851142 rem:244s step 10055 (59%) loss:3.1128 lr:0.51 dt:35ms tok/s:1853476 rem:244s step 10056 (59%) loss:3.0918 lr:0.51 dt:36ms tok/s:1845921 rem:244s step 10057 (59%) loss:3.0587 lr:0.51 dt:36ms tok/s:1841802 rem:244s step 10058 (59%) loss:3.0298 lr:0.51 dt:35ms tok/s:1846739 rem:244s step 10059 (59%) loss:3.0145 lr:0.51 dt:36ms tok/s:1838882 rem:244s step 10060 (59%) loss:3.0262 lr:0.51 dt:36ms tok/s:1843086 rem:244s step 10061 (59%) loss:3.0445 lr:0.51 dt:35ms tok/s:1848640 rem:244s step 10062 (59%) loss:3.0514 lr:0.51 dt:36ms tok/s:1843519 rem:244s step 10063 (59%) loss:3.0749 lr:0.51 dt:35ms tok/s:1847919 rem:244s step 10064 (59%) loss:3.0887 lr:0.51 dt:36ms tok/s:1837395 rem:244s step 10065 (59%) loss:3.1139 lr:0.51 dt:36ms tok/s:1842395 rem:244s step 10066 (59%) loss:3.1532 lr:0.51 dt:36ms tok/s:1819420 rem:244s step 10067 (59%) loss:3.1519 lr:0.51 dt:48ms tok/s:1374664 rem:244s step 10068 (59%) loss:3.1423 lr:0.51 dt:35ms tok/s:1865641 rem:244s step 10069 (59%) loss:3.1457 lr:0.51 dt:34ms tok/s:1940488 rem:244s step 10070 (59%) loss:3.1628 lr:0.51 dt:34ms tok/s:1920370 rem:244s step 10071 (59%) loss:3.1665 lr:0.51 dt:34ms tok/s:1922707 rem:243s step 10072 (59%) loss:3.1644 lr:0.51 dt:34ms tok/s:1905883 rem:243s step 10073 (59%) loss:3.1464 lr:0.51 dt:35ms tok/s:1897778 rem:243s step 10074 (59%) loss:3.1598 lr:0.51 dt:35ms tok/s:1898971 rem:243s step 10075 (59%) loss:3.1563 lr:0.51 dt:34ms tok/s:1901598 rem:243s step 10076 (59%) loss:3.1782 lr:0.51 dt:34ms tok/s:1901348 rem:243s step 10077 (59%) loss:3.1836 lr:0.51 dt:34ms tok/s:1909259 rem:243s step 10078 (59%) loss:3.1735 lr:0.51 dt:34ms tok/s:1901598 rem:243s step 10079 (59%) loss:3.1431 lr:0.51 dt:35ms tok/s:1890079 rem:243s step 10080 (59%) loss:3.1127 lr:0.51 dt:35ms tok/s:1894926 rem:243s step 10081 (59%) loss:3.1110 lr:0.51 dt:34ms tok/s:1902296 rem:243s step 10082 (59%) loss:3.1201 lr:0.51 dt:35ms tok/s:1885528 rem:243s step 10083 (59%) loss:3.1278 lr:0.51 dt:35ms tok/s:1895252 rem:243s step 10084 (59%) loss:3.1391 lr:0.51 dt:34ms tok/s:1904035 rem:243s step 10085 (59%) loss:3.1387 lr:0.51 dt:35ms tok/s:1888300 rem:243s step 10086 (59%) loss:3.1412 lr:0.51 dt:35ms tok/s:1864616 rem:243s step 10087 (60%) loss:3.1460 lr:0.51 dt:35ms tok/s:1867009 rem:243s step 10088 (60%) loss:3.1300 lr:0.51 dt:35ms tok/s:1847732 rem:243s step 10089 (60%) loss:3.1250 lr:0.51 dt:35ms tok/s:1870567 rem:243s step 10090 (60%) loss:3.1219 lr:0.51 dt:35ms tok/s:1878788 rem:243s step 10091 (60%) loss:3.1354 lr:0.51 dt:35ms tok/s:1868570 rem:243s step 10092 (60%) loss:3.1056 lr:0.51 dt:36ms tok/s:1836768 rem:243s step 10093 (60%) loss:3.0553 lr:0.51 dt:35ms tok/s:1851242 rem:243s step 10094 (60%) loss:3.0159 lr:0.51 dt:35ms tok/s:1852552 rem:243s step 10095 (60%) loss:2.9796 lr:0.51 dt:36ms tok/s:1843593 rem:243s step 10096 (60%) loss:2.9735 lr:0.51 dt:36ms tok/s:1842246 rem:243s step 10097 (60%) loss:2.9970 lr:0.51 dt:36ms tok/s:1818818 rem:243s step 10098 (60%) loss:3.0227 lr:0.51 dt:35ms tok/s:1852851 rem:243s step 10099 (60%) loss:3.0319 lr:0.51 dt:35ms tok/s:1855628 rem:243s step 10100 (60%) loss:3.0304 lr:0.51 dt:35ms tok/s:1850444 rem:242s + local: attn=[0.092, 0.860, 0.915] mlp=[0.691, 0.267, -0.254] + + transition: attn=[3.060, 1.026] mlp=[-0.233, 0.590] + + hierarchy: attn=[3.285, 5.939, 5.616] mlp=[1.392, -1.003, -3.101] + step 10101 (60%) loss:3.0432 lr:0.51 dt:36ms tok/s:1824322 rem:242s step 10102 (60%) loss:3.0447 lr:0.51 dt:36ms tok/s:1845462 rem:242s step 10103 (60%) loss:3.0482 lr:0.51 dt:35ms tok/s:1848851 rem:242s step 10104 (60%) loss:3.0609 lr:0.51 dt:35ms tok/s:1854089 rem:242s step 10105 (60%) loss:3.0622 lr:0.51 dt:35ms tok/s:1851341 rem:242s step 10106 (60%) loss:3.0719 lr:0.51 dt:35ms tok/s:1852776 rem:242s step 10107 (60%) loss:3.0855 lr:0.51 dt:35ms tok/s:1846144 rem:242s step 10108 (60%) loss:3.0986 lr:0.51 dt:36ms tok/s:1835983 rem:242s step 10109 (60%) loss:3.0980 lr:0.51 dt:42ms tok/s:1556844 rem:242s step 10110 (60%) loss:3.0952 lr:0.51 dt:36ms tok/s:1818818 rem:242s step 10111 (60%) loss:3.0970 lr:0.51 dt:35ms tok/s:1890612 rem:242s step 10112 (60%) loss:3.1042 lr:0.51 dt:35ms tok/s:1896521 rem:242s step 10113 (60%) loss:3.0998 lr:0.51 dt:35ms tok/s:1886952 rem:242s step 10114 (60%) loss:3.1220 lr:0.51 dt:35ms tok/s:1887146 rem:242s step 10115 (60%) loss:3.1238 lr:0.51 dt:35ms tok/s:1853301 rem:242s step 10116 (60%) loss:3.1228 lr:0.51 dt:35ms tok/s:1893946 rem:242s step 10117 (60%) loss:3.1416 lr:0.51 dt:35ms tok/s:1887833 rem:242s step 10118 (60%) loss:3.1334 lr:0.51 dt:35ms tok/s:1875532 rem:242s step 10119 (60%) loss:3.1288 lr:0.51 dt:35ms tok/s:1886511 rem:242s step 10120 (60%) loss:3.1074 lr:0.51 dt:35ms tok/s:1881089 rem:242s step 10121 (60%) loss:3.0934 lr:0.51 dt:35ms tok/s:1884003 rem:242s step 10122 (60%) loss:3.0708 lr:0.51 dt:35ms tok/s:1875852 rem:242s step 10123 (60%) loss:3.0682 lr:0.51 dt:35ms tok/s:1863857 rem:242s step 10124 (60%) loss:3.0894 lr:0.51 dt:35ms tok/s:1879045 rem:242s step 10125 (60%) loss:3.0951 lr:0.51 dt:35ms tok/s:1868951 rem:242s step 10126 (60%) loss:3.1029 lr:0.51 dt:35ms tok/s:1868151 rem:242s step 10127 (60%) loss:3.1074 lr:0.51 dt:36ms tok/s:1842988 rem:242s step 10128 (60%) loss:3.1188 lr:0.50 dt:35ms tok/s:1871280 rem:241s step 10129 (60%) loss:3.1208 lr:0.50 dt:36ms tok/s:1826092 rem:241s step 10130 (60%) loss:3.1219 lr:0.50 dt:35ms tok/s:1850469 rem:241s step 10131 (60%) loss:3.1181 lr:0.50 dt:37ms tok/s:1793868 rem:241s step 10132 (60%) loss:3.1129 lr:0.50 dt:36ms tok/s:1802525 rem:241s step 10133 (60%) loss:3.1159 lr:0.50 dt:35ms tok/s:1857961 rem:241s step 10134 (60%) loss:3.1217 lr:0.50 dt:35ms tok/s:1854301 rem:241s step 10135 (60%) loss:3.1493 lr:0.50 dt:35ms tok/s:1847124 rem:241s step 10136 (60%) loss:3.1703 lr:0.50 dt:36ms tok/s:1823475 rem:241s step 10137 (60%) loss:3.1769 lr:0.50 dt:36ms tok/s:1818769 rem:241s step 10138 (60%) loss:3.1807 lr:0.50 dt:36ms tok/s:1830518 rem:241s step 10139 (60%) loss:3.1765 lr:0.50 dt:36ms tok/s:1823402 rem:241s step 10140 (60%) loss:3.1430 lr:0.50 dt:36ms tok/s:1819841 rem:241s step 10141 (60%) loss:3.1380 lr:0.50 dt:36ms tok/s:1820613 rem:241s step 10142 (60%) loss:3.1362 lr:0.50 dt:36ms tok/s:1804336 rem:241s step 10143 (60%) loss:3.1230 lr:0.50 dt:36ms tok/s:1809955 rem:241s step 10144 (60%) loss:3.1307 lr:0.50 dt:41ms tok/s:1600528 rem:241s step 10145 (60%) loss:3.1566 lr:0.50 dt:36ms tok/s:1836265 rem:241s step 10146 (60%) loss:3.1672 lr:0.50 dt:35ms tok/s:1861837 rem:241s step 10147 (60%) loss:3.1475 lr:0.50 dt:35ms tok/s:1861018 rem:241s step 10148 (60%) loss:3.0932 lr:0.50 dt:35ms tok/s:1862064 rem:241s step 10149 (60%) loss:3.0283 lr:0.50 dt:35ms tok/s:1861232 rem:241s step 10150 (60%) loss:3.0252 lr:0.50 dt:36ms tok/s:1837161 rem:241s step 10151 (60%) loss:3.0494 lr:0.50 dt:36ms tok/s:1816078 rem:241s step 10152 (60%) loss:3.0583 lr:0.50 dt:36ms tok/s:1812641 rem:241s step 10153 (60%) loss:3.0736 lr:0.50 dt:36ms tok/s:1811625 rem:241s step 10154 (60%) loss:3.0736 lr:0.50 dt:36ms tok/s:1816978 rem:241s step 10155 (60%) loss:3.0751 lr:0.50 dt:36ms tok/s:1816306 rem:241s step 10156 (60%) loss:3.0543 lr:0.50 dt:36ms tok/s:1816030 rem:240s step 10157 (60%) loss:3.0456 lr:0.50 dt:36ms tok/s:1818565 rem:240s step 10158 (60%) loss:3.0644 lr:0.50 dt:36ms tok/s:1809872 rem:240s step 10159 (60%) loss:3.0843 lr:0.50 dt:36ms tok/s:1820504 rem:240s step 10160 (60%) loss:3.1044 lr:0.50 dt:36ms tok/s:1821336 rem:240s step 10161 (60%) loss:3.1122 lr:0.50 dt:36ms tok/s:1825655 rem:240s step 10162 (60%) loss:3.1378 lr:0.50 dt:36ms tok/s:1821566 rem:240s step 10163 (60%) loss:3.1410 lr:0.50 dt:36ms tok/s:1821880 rem:240s step 10164 (60%) loss:3.1349 lr:0.50 dt:36ms tok/s:1812892 rem:240s step 10165 (60%) loss:3.1080 lr:0.50 dt:36ms tok/s:1815154 rem:240s step 10166 (60%) loss:3.1107 lr:0.50 dt:36ms tok/s:1805130 rem:240s step 10167 (60%) loss:3.1342 lr:0.50 dt:36ms tok/s:1804632 rem:240s step 10168 (60%) loss:3.1552 lr:0.50 dt:36ms tok/s:1807444 rem:240s step 10169 (60%) loss:3.1639 lr:0.50 dt:36ms tok/s:1806363 rem:240s step 10170 (60%) loss:3.1576 lr:0.50 dt:36ms tok/s:1816690 rem:240s step 10171 (60%) loss:3.1374 lr:0.50 dt:36ms tok/s:1808312 rem:240s step 10172 (60%) loss:3.1283 lr:0.50 dt:36ms tok/s:1812772 rem:240s step 10173 (60%) loss:3.1249 lr:0.50 dt:36ms tok/s:1816474 rem:240s step 10174 (60%) loss:3.1444 lr:0.50 dt:36ms tok/s:1810587 rem:240s step 10175 (60%) loss:3.1356 lr:0.50 dt:36ms tok/s:1815286 rem:240s step 10176 (60%) loss:3.1290 lr:0.50 dt:36ms tok/s:1809312 rem:240s step 10177 (60%) loss:3.1195 lr:0.50 dt:36ms tok/s:1817651 rem:240s step 10178 (60%) loss:3.1166 lr:0.50 dt:36ms tok/s:1810468 rem:240s step 10179 (60%) loss:3.1287 lr:0.50 dt:36ms tok/s:1807016 rem:240s step 10180 (60%) loss:3.1115 lr:0.50 dt:36ms tok/s:1802124 rem:240s step 10181 (60%) loss:3.1079 lr:0.50 dt:36ms tok/s:1812605 rem:240s step 10182 (60%) loss:3.0971 lr:0.50 dt:36ms tok/s:1809514 rem:240s step 10183 (60%) loss:3.0871 lr:0.50 dt:36ms tok/s:1817339 rem:239s step 10184 (60%) loss:3.1079 lr:0.50 dt:36ms tok/s:1818216 rem:239s step 10185 (60%) loss:3.0895 lr:0.50 dt:36ms tok/s:1806553 rem:239s step 10186 (60%) loss:3.1291 lr:0.50 dt:37ms tok/s:1795016 rem:239s step 10187 (60%) loss:3.1389 lr:0.50 dt:36ms tok/s:1817711 rem:239s step 10188 (60%) loss:3.1234 lr:0.50 dt:36ms tok/s:1803294 rem:239s step 10189 (60%) loss:3.1320 lr:0.50 dt:36ms tok/s:1816522 rem:239s step 10190 (60%) loss:3.1398 lr:0.50 dt:36ms tok/s:1803128 rem:239s step 10191 (60%) loss:3.1316 lr:0.50 dt:36ms tok/s:1805307 rem:239s step 10192 (60%) loss:3.1206 lr:0.50 dt:36ms tok/s:1809550 rem:239s step 10193 (60%) loss:3.1247 lr:0.50 dt:36ms tok/s:1810611 rem:239s step 10194 (60%) loss:3.1276 lr:0.50 dt:36ms tok/s:1800471 rem:239s step 10195 (60%) loss:3.0988 lr:0.50 dt:36ms tok/s:1809467 rem:239s step 10196 (60%) loss:3.1056 lr:0.50 dt:36ms tok/s:1809038 rem:239s step 10197 (60%) loss:3.1134 lr:0.50 dt:36ms tok/s:1804727 rem:239s step 10198 (60%) loss:3.1122 lr:0.50 dt:36ms tok/s:1797210 rem:239s step 10199 (60%) loss:3.1126 lr:0.50 dt:37ms tok/s:1771509 rem:239s step 10200 (60%) loss:3.1060 lr:0.50 dt:36ms tok/s:1814759 rem:239s + local: attn=[0.097, 0.862, 0.908] mlp=[0.695, 0.267, -0.248] + + transition: attn=[3.119, 1.034] mlp=[-0.227, 0.603] + + hierarchy: attn=[3.329, 5.939, 5.616] mlp=[1.416, -0.991, -3.190] + step 10201 (60%) loss:3.0857 lr:0.50 dt:36ms tok/s:1811279 rem:239s step 10202 (60%) loss:3.0780 lr:0.50 dt:37ms tok/s:1784921 rem:239s step 10203 (60%) loss:3.0829 lr:0.50 dt:37ms tok/s:1795426 rem:239s step 10204 (60%) loss:3.0687 lr:0.50 dt:37ms tok/s:1792674 rem:239s step 10205 (60%) loss:3.0715 lr:0.50 dt:36ms tok/s:1808086 rem:239s step 10206 (60%) loss:3.0803 lr:0.50 dt:36ms tok/s:1805675 rem:239s step 10207 (60%) loss:3.1002 lr:0.50 dt:36ms tok/s:1813598 rem:239s step 10208 (60%) loss:3.1111 lr:0.50 dt:36ms tok/s:1811064 rem:239s step 10209 (60%) loss:3.0919 lr:0.50 dt:36ms tok/s:1809121 rem:239s step 10210 (60%) loss:3.0993 lr:0.50 dt:37ms tok/s:1789628 rem:239s step 10211 (60%) loss:3.0906 lr:0.50 dt:36ms tok/s:1812987 rem:238s step 10212 (60%) loss:3.0887 lr:0.50 dt:36ms tok/s:1809586 rem:238s step 10213 (60%) loss:3.0663 lr:0.49 dt:37ms tok/s:1775376 rem:238s step 10214 (60%) loss:3.0588 lr:0.49 dt:36ms tok/s:1813669 rem:238s step 10215 (60%) loss:3.0481 lr:0.49 dt:36ms tok/s:1817651 rem:238s step 10216 (60%) loss:3.0683 lr:0.49 dt:36ms tok/s:1807420 rem:238s step 10217 (60%) loss:3.0654 lr:0.49 dt:36ms tok/s:1812234 rem:238s step 10218 (60%) loss:3.0649 lr:0.49 dt:36ms tok/s:1811351 rem:238s step 10219 (60%) loss:3.0579 lr:0.49 dt:36ms tok/s:1814112 rem:238s step 10220 (60%) loss:3.0538 lr:0.49 dt:36ms tok/s:1817278 rem:238s step 10221 (60%) loss:3.0383 lr:0.49 dt:36ms tok/s:1811804 rem:238s step 10222 (60%) loss:3.0215 lr:0.49 dt:36ms tok/s:1817435 rem:238s step 10223 (60%) loss:3.0113 lr:0.49 dt:36ms tok/s:1807967 rem:238s step 10224 (60%) loss:3.0299 lr:0.49 dt:36ms tok/s:1820588 rem:238s step 10225 (60%) loss:3.0358 lr:0.49 dt:36ms tok/s:1807717 rem:238s step 10226 (60%) loss:3.0455 lr:0.49 dt:36ms tok/s:1811231 rem:238s step 10227 (60%) loss:3.0421 lr:0.49 dt:36ms tok/s:1814316 rem:238s step 10228 (60%) loss:3.0555 lr:0.49 dt:36ms tok/s:1814100 rem:238s step 10229 (60%) loss:3.0598 lr:0.49 dt:38ms tok/s:1740824 rem:238s step 10230 (60%) loss:3.0681 lr:0.49 dt:36ms tok/s:1834991 rem:238s step 10231 (60%) loss:3.0675 lr:0.49 dt:35ms tok/s:1856932 rem:238s step 10232 (60%) loss:3.1009 lr:0.49 dt:36ms tok/s:1839953 rem:238s step 10233 (60%) loss:3.1207 lr:0.49 dt:36ms tok/s:1834623 rem:238s step 10234 (60%) loss:3.1289 lr:0.49 dt:36ms tok/s:1840150 rem:238s step 10235 (60%) loss:3.1249 lr:0.49 dt:36ms tok/s:1834403 rem:238s step 10236 (60%) loss:3.1301 lr:0.49 dt:36ms tok/s:1809467 rem:238s step 10237 (60%) loss:3.1158 lr:0.49 dt:42ms tok/s:1567202 rem:238s step 10238 (60%) loss:3.1323 lr:0.49 dt:36ms tok/s:1799846 rem:237s step 10239 (60%) loss:3.1435 lr:0.49 dt:36ms tok/s:1834415 rem:237s step 10240 (60%) loss:3.1575 lr:0.49 dt:35ms tok/s:1855603 rem:237s step 10241 (60%) loss:3.1632 lr:0.49 dt:35ms tok/s:1851778 rem:237s step 10242 (60%) loss:3.1441 lr:0.49 dt:35ms tok/s:1869740 rem:237s step 10243 (60%) loss:3.1363 lr:0.49 dt:35ms tok/s:1853539 rem:237s step 10244 (60%) loss:3.1358 lr:0.49 dt:35ms tok/s:1848888 rem:237s step 10245 (60%) loss:3.1339 lr:0.49 dt:36ms tok/s:1821083 rem:237s step 10246 (60%) loss:3.1417 lr:0.49 dt:35ms tok/s:1861358 rem:237s step 10247 (60%) loss:3.1236 lr:0.49 dt:35ms tok/s:1871280 rem:237s step 10248 (60%) loss:3.1131 lr:0.49 dt:36ms tok/s:1837861 rem:237s step 10249 (60%) loss:3.1199 lr:0.49 dt:36ms tok/s:1827937 rem:237s step 10250 (60%) loss:3.1224 lr:0.49 dt:35ms tok/s:1850843 rem:237s step 10251 (60%) loss:3.1163 lr:0.49 dt:36ms tok/s:1842185 rem:237s step 10252 (60%) loss:3.1110 lr:0.49 dt:35ms tok/s:1848204 rem:237s step 10253 (60%) loss:3.1083 lr:0.49 dt:36ms tok/s:1804987 rem:237s step 10254 (61%) loss:3.1084 lr:0.49 dt:36ms tok/s:1812318 rem:237s step 10255 (61%) loss:3.1027 lr:0.49 dt:37ms tok/s:1791611 rem:237s step 10256 (61%) loss:3.0855 lr:0.49 dt:36ms tok/s:1804999 rem:237s step 10257 (61%) loss:3.1358 lr:0.49 dt:37ms tok/s:1790654 rem:237s step 10258 (61%) loss:3.1312 lr:0.49 dt:36ms tok/s:1809967 rem:237s step 10259 (61%) loss:3.1879 lr:0.49 dt:36ms tok/s:1825158 rem:237s step 10260 (61%) loss:3.2054 lr:0.49 dt:35ms tok/s:1862480 rem:237s step 10261 (61%) loss:3.2015 lr:0.49 dt:35ms tok/s:1858841 rem:237s step 10262 (61%) loss:3.1842 lr:0.49 dt:37ms tok/s:1787080 rem:237s step 10263 (61%) loss:3.1865 lr:0.49 dt:36ms tok/s:1826990 rem:237s step 10264 (61%) loss:3.1856 lr:0.49 dt:36ms tok/s:1800707 rem:237s step 10265 (61%) loss:3.1737 lr:0.49 dt:36ms tok/s:1837874 rem:237s step 10266 (61%) loss:3.1634 lr:0.49 dt:36ms tok/s:1838820 rem:236s step 10267 (61%) loss:3.1307 lr:0.49 dt:36ms tok/s:1817242 rem:236s step 10268 (61%) loss:3.1215 lr:0.49 dt:36ms tok/s:1819251 rem:236s step 10269 (61%) loss:3.1183 lr:0.49 dt:37ms tok/s:1794067 rem:236s step 10270 (61%) loss:3.1257 lr:0.49 dt:36ms tok/s:1829434 rem:236s step 10271 (61%) loss:3.1004 lr:0.49 dt:36ms tok/s:1827245 rem:236s step 10272 (61%) loss:3.1141 lr:0.49 dt:36ms tok/s:1819359 rem:236s step 10273 (61%) loss:3.1273 lr:0.49 dt:36ms tok/s:1842000 rem:236s step 10274 (61%) loss:3.1341 lr:0.49 dt:35ms tok/s:1857434 rem:236s step 10275 (61%) loss:3.1407 lr:0.49 dt:36ms tok/s:1842740 rem:236s step 10276 (61%) loss:3.1382 lr:0.49 dt:36ms tok/s:1820588 rem:236s step 10277 (61%) loss:3.1266 lr:0.49 dt:36ms tok/s:1839497 rem:236s step 10278 (61%) loss:3.1329 lr:0.49 dt:35ms tok/s:1880291 rem:236s step 10279 (61%) loss:3.1174 lr:0.49 dt:35ms tok/s:1876748 rem:236s step 10280 (61%) loss:3.1121 lr:0.49 dt:36ms tok/s:1828886 rem:236s step 10281 (61%) loss:3.1000 lr:0.49 dt:36ms tok/s:1833656 rem:236s step 10282 (61%) loss:3.0954 lr:0.49 dt:35ms tok/s:1885761 rem:236s step 10283 (61%) loss:3.0936 lr:0.49 dt:35ms tok/s:1890820 rem:236s step 10284 (61%) loss:3.0986 lr:0.49 dt:35ms tok/s:1899522 rem:236s step 10285 (61%) loss:3.1017 lr:0.49 dt:35ms tok/s:1873652 rem:236s step 10286 (61%) loss:3.1028 lr:0.49 dt:35ms tok/s:1856593 rem:236s step 10287 (61%) loss:3.1172 lr:0.49 dt:35ms tok/s:1860388 rem:236s step 10288 (61%) loss:3.1036 lr:0.49 dt:35ms tok/s:1861913 rem:236s step 10289 (61%) loss:3.0960 lr:0.49 dt:36ms tok/s:1843729 rem:236s step 10290 (61%) loss:3.0943 lr:0.49 dt:36ms tok/s:1820094 rem:236s step 10291 (61%) loss:3.0906 lr:0.49 dt:42ms tok/s:1561469 rem:236s step 10292 (61%) loss:3.1076 lr:0.49 dt:37ms tok/s:1789849 rem:236s step 10293 (61%) loss:3.1252 lr:0.49 dt:34ms tok/s:1904589 rem:236s step 10294 (61%) loss:3.1137 lr:0.49 dt:35ms tok/s:1880729 rem:235s step 10295 (61%) loss:3.1120 lr:0.49 dt:35ms tok/s:1868380 rem:235s step 10296 (61%) loss:3.1063 lr:0.49 dt:35ms tok/s:1870758 rem:235s step 10297 (61%) loss:3.0827 lr:0.49 dt:35ms tok/s:1868177 rem:235s step 10298 (61%) loss:3.0645 lr:0.48 dt:35ms tok/s:1865438 rem:235s step 10299 (61%) loss:3.0495 lr:0.48 dt:35ms tok/s:1846615 rem:235s step 10300 (61%) loss:3.0542 lr:0.48 dt:36ms tok/s:1814711 rem:235s + local: attn=[0.092, 0.883, 0.940] mlp=[0.706, 0.283, -0.275] + + transition: attn=[3.086, 1.008] mlp=[-0.234, 0.630] + + hierarchy: attn=[3.275, 5.939, 5.616] mlp=[1.439, -1.079, -3.259] + step 10301 (61%) loss:3.0305 lr:0.48 dt:36ms tok/s:1812175 rem:235s step 10302 (61%) loss:3.0335 lr:0.48 dt:36ms tok/s:1825473 rem:235s step 10303 (61%) loss:3.0284 lr:0.48 dt:36ms tok/s:1810814 rem:235s step 10304 (61%) loss:3.0269 lr:0.48 dt:36ms tok/s:1820999 rem:235s step 10305 (61%) loss:3.0253 lr:0.48 dt:36ms tok/s:1820818 rem:235s step 10306 (61%) loss:3.0178 lr:0.48 dt:36ms tok/s:1815850 rem:235s step 10307 (61%) loss:3.0354 lr:0.48 dt:36ms tok/s:1820974 rem:235s step 10308 (61%) loss:3.0354 lr:0.48 dt:36ms tok/s:1824903 rem:235s step 10309 (61%) loss:3.0369 lr:0.48 dt:36ms tok/s:1823983 rem:235s step 10310 (61%) loss:3.0319 lr:0.48 dt:36ms tok/s:1822496 rem:235s step 10311 (61%) loss:3.0362 lr:0.48 dt:36ms tok/s:1826176 rem:235s step 10312 (61%) loss:3.0412 lr:0.48 dt:36ms tok/s:1829982 rem:235s step 10313 (61%) loss:3.0655 lr:0.48 dt:36ms tok/s:1825655 rem:235s step 10314 (61%) loss:3.0755 lr:0.48 dt:36ms tok/s:1819058 rem:235s step 10315 (61%) loss:3.0816 lr:0.48 dt:36ms tok/s:1806755 rem:235s step 10316 (61%) loss:3.0677 lr:0.48 dt:36ms tok/s:1826152 rem:235s step 10317 (61%) loss:3.0546 lr:0.48 dt:36ms tok/s:1826565 rem:235s step 10318 (61%) loss:3.0396 lr:0.48 dt:36ms tok/s:1825800 rem:235s step 10319 (61%) loss:3.0397 lr:0.48 dt:36ms tok/s:1819588 rem:235s step 10320 (61%) loss:3.0225 lr:0.48 dt:36ms tok/s:1815718 rem:235s step 10321 (61%) loss:3.0323 lr:0.48 dt:36ms tok/s:1816966 rem:235s step 10322 (61%) loss:3.0263 lr:0.48 dt:36ms tok/s:1823269 rem:234s step 10323 (61%) loss:3.0180 lr:0.48 dt:36ms tok/s:1817495 rem:234s step 10324 (61%) loss:3.0149 lr:0.48 dt:36ms tok/s:1829470 rem:234s step 10325 (61%) loss:3.0370 lr:0.48 dt:36ms tok/s:1819564 rem:234s step 10326 (61%) loss:3.0373 lr:0.48 dt:36ms tok/s:1813921 rem:234s step 10327 (61%) loss:3.0407 lr:0.48 dt:36ms tok/s:1814364 rem:234s step 10328 (61%) loss:3.0381 lr:0.48 dt:36ms tok/s:1808407 rem:234s step 10329 (61%) loss:3.0150 lr:0.48 dt:36ms tok/s:1815778 rem:234s step 10330 (61%) loss:3.0242 lr:0.48 dt:36ms tok/s:1813251 rem:234s step 10331 (61%) loss:3.0408 lr:0.48 dt:36ms tok/s:1817651 rem:234s step 10332 (61%) loss:3.0462 lr:0.48 dt:36ms tok/s:1816138 rem:234s step 10333 (61%) loss:3.0506 lr:0.48 dt:36ms tok/s:1811279 rem:234s step 10334 (61%) loss:3.0471 lr:0.48 dt:36ms tok/s:1819950 rem:234s step 10335 (61%) loss:3.0494 lr:0.48 dt:36ms tok/s:1823475 rem:234s step 10336 (61%) loss:3.0730 lr:0.48 dt:36ms tok/s:1819950 rem:234s step 10337 (61%) loss:3.0601 lr:0.48 dt:36ms tok/s:1812127 rem:234s step 10338 (61%) loss:3.0563 lr:0.48 dt:36ms tok/s:1820974 rem:234s step 10339 (61%) loss:3.0641 lr:0.48 dt:36ms tok/s:1817038 rem:234s step 10340 (61%) loss:3.0525 lr:0.48 dt:36ms tok/s:1814495 rem:234s step 10341 (61%) loss:3.0572 lr:0.48 dt:37ms tok/s:1769524 rem:234s step 10342 (61%) loss:3.0666 lr:0.48 dt:36ms tok/s:1822447 rem:234s step 10343 (61%) loss:3.0706 lr:0.48 dt:36ms tok/s:1820191 rem:234s step 10344 (61%) loss:3.0743 lr:0.48 dt:36ms tok/s:1824419 rem:234s step 10345 (61%) loss:3.0739 lr:0.48 dt:36ms tok/s:1821095 rem:234s step 10346 (61%) loss:3.0854 lr:0.48 dt:36ms tok/s:1820769 rem:234s step 10347 (61%) loss:3.0962 lr:0.48 dt:36ms tok/s:1820191 rem:234s step 10348 (61%) loss:3.1049 lr:0.48 dt:36ms tok/s:1811136 rem:234s step 10349 (61%) loss:3.0836 lr:0.48 dt:36ms tok/s:1822375 rem:234s step 10350 (61%) loss:3.0884 lr:0.48 dt:36ms tok/s:1803211 rem:233s step 10351 (61%) loss:3.0829 lr:0.48 dt:36ms tok/s:1802584 rem:233s step 10352 (61%) loss:3.0854 lr:0.48 dt:36ms tok/s:1806862 rem:233s step 10353 (61%) loss:3.1078 lr:0.48 dt:36ms tok/s:1816798 rem:233s step 10354 (61%) loss:3.1056 lr:0.48 dt:36ms tok/s:1805616 rem:233s step 10355 (61%) loss:3.1072 lr:0.48 dt:36ms tok/s:1813598 rem:233s step 10356 (61%) loss:3.1008 lr:0.48 dt:36ms tok/s:1810587 rem:233s step 10357 (61%) loss:3.1095 lr:0.48 dt:36ms tok/s:1815562 rem:233s step 10358 (61%) loss:3.0920 lr:0.48 dt:36ms tok/s:1809371 rem:233s step 10359 (61%) loss:3.0811 lr:0.48 dt:36ms tok/s:1813562 rem:233s step 10360 (61%) loss:3.0858 lr:0.48 dt:36ms tok/s:1812581 rem:233s step 10361 (61%) loss:3.0911 lr:0.48 dt:36ms tok/s:1802443 rem:233s step 10362 (61%) loss:3.0914 lr:0.48 dt:36ms tok/s:1810742 rem:233s step 10363 (61%) loss:3.0929 lr:0.48 dt:38ms tok/s:1718846 rem:233s step 10364 (61%) loss:3.0786 lr:0.48 dt:36ms tok/s:1823185 rem:233s step 10365 (61%) loss:3.0527 lr:0.48 dt:35ms tok/s:1851129 rem:233s step 10366 (61%) loss:3.0557 lr:0.48 dt:35ms tok/s:1850307 rem:233s step 10367 (61%) loss:3.0678 lr:0.48 dt:36ms tok/s:1838378 rem:233s step 10368 (61%) loss:3.0737 lr:0.48 dt:36ms tok/s:1831152 rem:233s step 10369 (61%) loss:3.0568 lr:0.48 dt:36ms tok/s:1840717 rem:233s step 10370 (61%) loss:3.0353 lr:0.48 dt:36ms tok/s:1832568 rem:233s step 10371 (61%) loss:3.0279 lr:0.48 dt:36ms tok/s:1835076 rem:233s step 10372 (61%) loss:3.0578 lr:0.48 dt:36ms tok/s:1845128 rem:233s step 10373 (61%) loss:3.0777 lr:0.48 dt:36ms tok/s:1834048 rem:233s step 10374 (61%) loss:3.0767 lr:0.48 dt:36ms tok/s:1832984 rem:233s step 10375 (61%) loss:3.0854 lr:0.48 dt:36ms tok/s:1830031 rem:233s step 10376 (61%) loss:3.0815 lr:0.48 dt:36ms tok/s:1838083 rem:233s step 10377 (61%) loss:3.0738 lr:0.48 dt:36ms tok/s:1834476 rem:233s step 10378 (61%) loss:3.1050 lr:0.48 dt:36ms tok/s:1829385 rem:232s step 10379 (61%) loss:3.1167 lr:0.48 dt:36ms tok/s:1835579 rem:232s step 10380 (61%) loss:3.1134 lr:0.48 dt:36ms tok/s:1830579 rem:232s step 10381 (61%) loss:3.0999 lr:0.48 dt:36ms tok/s:1826516 rem:232s step 10382 (61%) loss:3.0985 lr:0.48 dt:36ms tok/s:1834562 rem:232s step 10383 (61%) loss:3.1038 lr:0.47 dt:36ms tok/s:1831872 rem:232s step 10384 (61%) loss:3.1146 lr:0.47 dt:37ms tok/s:1793364 rem:232s step 10385 (61%) loss:3.0931 lr:0.47 dt:36ms tok/s:1825255 rem:232s step 10386 (61%) loss:3.0840 lr:0.47 dt:36ms tok/s:1834868 rem:232s step 10387 (61%) loss:3.1037 lr:0.47 dt:36ms tok/s:1828776 rem:232s step 10388 (61%) loss:3.0907 lr:0.47 dt:36ms tok/s:1812593 rem:232s step 10389 (61%) loss:3.1120 lr:0.47 dt:36ms tok/s:1816534 rem:232s step 10390 (61%) loss:3.1514 lr:0.47 dt:37ms tok/s:1794254 rem:232s step 10391 (61%) loss:3.1433 lr:0.47 dt:36ms tok/s:1810420 rem:232s step 10392 (61%) loss:3.1416 lr:0.47 dt:36ms tok/s:1805568 rem:232s step 10393 (61%) loss:3.1153 lr:0.47 dt:36ms tok/s:1814280 rem:232s step 10394 (61%) loss:3.0929 lr:0.47 dt:36ms tok/s:1813789 rem:232s step 10395 (61%) loss:3.0876 lr:0.47 dt:36ms tok/s:1808169 rem:232s step 10396 (61%) loss:3.0812 lr:0.47 dt:36ms tok/s:1812294 rem:232s step 10397 (61%) loss:3.0644 lr:0.47 dt:36ms tok/s:1811029 rem:232s step 10398 (61%) loss:3.0775 lr:0.47 dt:36ms tok/s:1817266 rem:232s step 10399 (61%) loss:3.0668 lr:0.47 dt:36ms tok/s:1811398 rem:232s step 10400 (61%) loss:3.0563 lr:0.47 dt:36ms tok/s:1812342 rem:232s + local: attn=[0.089, 0.870, 0.903] mlp=[0.729, 0.271, -0.286] + + transition: attn=[3.060, 1.036] mlp=[-0.235, 0.638] + + hierarchy: attn=[3.326, 5.939, 5.616] mlp=[1.468, -1.117, -3.280] + step 10401 (61%) loss:3.0544 lr:0.47 dt:36ms tok/s:1805805 rem:232s step 10402 (61%) loss:3.0557 lr:0.47 dt:36ms tok/s:1808027 rem:232s step 10403 (61%) loss:3.0645 lr:0.47 dt:37ms tok/s:1785049 rem:232s step 10404 (61%) loss:3.0720 lr:0.47 dt:37ms tok/s:1786139 rem:232s step 10405 (61%) loss:3.0658 lr:0.47 dt:36ms tok/s:1807361 rem:231s step 10406 (61%) loss:3.0668 lr:0.47 dt:36ms tok/s:1809407 rem:231s step 10407 (61%) loss:3.0649 lr:0.47 dt:36ms tok/s:1810253 rem:231s step 10408 (61%) loss:3.0759 lr:0.47 dt:36ms tok/s:1809979 rem:231s step 10409 (61%) loss:3.0962 lr:0.47 dt:36ms tok/s:1806304 rem:231s step 10410 (61%) loss:3.0977 lr:0.47 dt:36ms tok/s:1809205 rem:231s step 10411 (61%) loss:3.0919 lr:0.47 dt:36ms tok/s:1812306 rem:231s step 10412 (61%) loss:3.0579 lr:0.47 dt:36ms tok/s:1811291 rem:231s step 10413 (61%) loss:3.0445 lr:0.47 dt:36ms tok/s:1813394 rem:231s step 10414 (61%) loss:3.0472 lr:0.47 dt:36ms tok/s:1804430 rem:231s step 10415 (61%) loss:3.0781 lr:0.47 dt:36ms tok/s:1802573 rem:231s step 10416 (61%) loss:3.0728 lr:0.47 dt:37ms tok/s:1782260 rem:231s step 10417 (61%) loss:3.0747 lr:0.47 dt:36ms tok/s:1802963 rem:231s step 10418 (61%) loss:3.0803 lr:0.47 dt:36ms tok/s:1804620 rem:231s step 10419 (61%) loss:3.0877 lr:0.47 dt:36ms tok/s:1805426 rem:231s step 10420 (62%) loss:3.0885 lr:0.47 dt:36ms tok/s:1805082 rem:231s step 10421 (62%) loss:3.0909 lr:0.47 dt:36ms tok/s:1809395 rem:231s step 10422 (62%) loss:3.0996 lr:0.47 dt:36ms tok/s:1815430 rem:231s step 10423 (62%) loss:3.0885 lr:0.47 dt:36ms tok/s:1815011 rem:231s step 10424 (62%) loss:3.0931 lr:0.47 dt:36ms tok/s:1808027 rem:231s step 10425 (62%) loss:3.0973 lr:0.47 dt:36ms tok/s:1807717 rem:231s step 10426 (62%) loss:3.0886 lr:0.47 dt:36ms tok/s:1804419 rem:231s step 10427 (62%) loss:3.0822 lr:0.47 dt:37ms tok/s:1772743 rem:231s step 10428 (62%) loss:3.0759 lr:0.47 dt:36ms tok/s:1804348 rem:231s step 10429 (62%) loss:3.0767 lr:0.47 dt:36ms tok/s:1807515 rem:231s step 10430 (62%) loss:3.0710 lr:0.47 dt:36ms tok/s:1802254 rem:231s step 10431 (62%) loss:3.0711 lr:0.47 dt:36ms tok/s:1812569 rem:231s step 10432 (62%) loss:3.0707 lr:0.47 dt:36ms tok/s:1802916 rem:231s step 10433 (62%) loss:3.0910 lr:0.47 dt:36ms tok/s:1808764 rem:230s step 10434 (62%) loss:3.0876 lr:0.47 dt:36ms tok/s:1806375 rem:230s step 10435 (62%) loss:3.0640 lr:0.47 dt:36ms tok/s:1811410 rem:230s step 10436 (62%) loss:3.0833 lr:0.47 dt:36ms tok/s:1818505 rem:230s step 10437 (62%) loss:3.0881 lr:0.47 dt:36ms tok/s:1820251 rem:230s step 10438 (62%) loss:3.0955 lr:0.47 dt:36ms tok/s:1805094 rem:230s step 10439 (62%) loss:3.0995 lr:0.47 dt:36ms tok/s:1807539 rem:230s step 10440 (62%) loss:3.1739 lr:0.47 dt:36ms tok/s:1808836 rem:230s step 10441 (62%) loss:3.1492 lr:0.47 dt:36ms tok/s:1807076 rem:230s step 10442 (62%) loss:3.1444 lr:0.47 dt:36ms tok/s:1813454 rem:230s step 10443 (62%) loss:3.1315 lr:0.47 dt:36ms tok/s:1807682 rem:230s step 10444 (62%) loss:3.1261 lr:0.47 dt:36ms tok/s:1805722 rem:230s step 10445 (62%) loss:3.1427 lr:0.47 dt:36ms tok/s:1808169 rem:230s step 10446 (62%) loss:3.1570 lr:0.47 dt:36ms tok/s:1819323 rem:230s step 10447 (62%) loss:3.1620 lr:0.47 dt:36ms tok/s:1808538 rem:230s step 10448 (62%) loss:3.1618 lr:0.47 dt:36ms tok/s:1816846 rem:230s step 10449 (62%) loss:3.1544 lr:0.47 dt:36ms tok/s:1812103 rem:230s step 10450 (62%) loss:3.1485 lr:0.47 dt:36ms tok/s:1815298 rem:230s step 10451 (62%) loss:3.1458 lr:0.47 dt:36ms tok/s:1813717 rem:230s step 10452 (62%) loss:3.1195 lr:0.47 dt:36ms tok/s:1804608 rem:230s step 10453 (62%) loss:3.0945 lr:0.47 dt:36ms tok/s:1805379 rem:230s step 10454 (62%) loss:3.0738 lr:0.47 dt:36ms tok/s:1814855 rem:230s step 10455 (62%) loss:3.0349 lr:0.47 dt:36ms tok/s:1808027 rem:230s step 10456 (62%) loss:3.0112 lr:0.47 dt:36ms tok/s:1811267 rem:230s step 10457 (62%) loss:3.0136 lr:0.47 dt:36ms tok/s:1816822 rem:230s step 10458 (62%) loss:3.0952 lr:0.47 dt:36ms tok/s:1811303 rem:230s step 10459 (62%) loss:3.1163 lr:0.47 dt:36ms tok/s:1804975 rem:230s step 10460 (62%) loss:3.1148 lr:0.47 dt:36ms tok/s:1807076 rem:230s step 10461 (62%) loss:3.1277 lr:0.47 dt:36ms tok/s:1805473 rem:229s step 10462 (62%) loss:3.1395 lr:0.47 dt:36ms tok/s:1808931 rem:229s step 10463 (62%) loss:3.1330 lr:0.47 dt:36ms tok/s:1816558 rem:229s step 10464 (62%) loss:3.1413 lr:0.47 dt:36ms tok/s:1805094 rem:229s step 10465 (62%) loss:3.1451 lr:0.47 dt:36ms tok/s:1805770 rem:229s step 10466 (62%) loss:3.1472 lr:0.47 dt:36ms tok/s:1810396 rem:229s step 10467 (62%) loss:3.1504 lr:0.46 dt:36ms tok/s:1814016 rem:229s step 10468 (62%) loss:3.1439 lr:0.46 dt:36ms tok/s:1820263 rem:229s step 10469 (62%) loss:3.1296 lr:0.46 dt:36ms tok/s:1806351 rem:229s step 10470 (62%) loss:3.1413 lr:0.46 dt:36ms tok/s:1838882 rem:229s step 10471 (62%) loss:3.1308 lr:0.46 dt:36ms tok/s:1826322 rem:229s step 10472 (62%) loss:3.1216 lr:0.46 dt:36ms tok/s:1830481 rem:229s step 10473 (62%) loss:3.1217 lr:0.46 dt:36ms tok/s:1833118 rem:229s step 10474 (62%) loss:3.1108 lr:0.46 dt:36ms tok/s:1831311 rem:229s step 10475 (62%) loss:3.1176 lr:0.46 dt:36ms tok/s:1829117 rem:229s step 10476 (62%) loss:3.1589 lr:0.46 dt:35ms tok/s:1852439 rem:229s step 10477 (62%) loss:3.1601 lr:0.46 dt:35ms tok/s:1855891 rem:229s step 10478 (62%) loss:3.1806 lr:0.46 dt:35ms tok/s:1854001 rem:229s step 10479 (62%) loss:3.1807 lr:0.46 dt:35ms tok/s:1847993 rem:229s step 10480 (62%) loss:3.1722 lr:0.46 dt:36ms tok/s:1838242 rem:229s step 10481 (62%) loss:3.1914 lr:0.46 dt:35ms tok/s:1848615 rem:229s step 10482 (62%) loss:3.2076 lr:0.46 dt:35ms tok/s:1849996 rem:229s step 10483 (62%) loss:3.2372 lr:0.46 dt:35ms tok/s:1850631 rem:229s step 10484 (62%) loss:3.2227 lr:0.46 dt:35ms tok/s:1854114 rem:229s step 10485 (62%) loss:3.2405 lr:0.46 dt:35ms tok/s:1847745 rem:229s step 10486 (62%) loss:3.2538 lr:0.46 dt:35ms tok/s:1858577 rem:229s step 10487 (62%) loss:3.2650 lr:0.46 dt:35ms tok/s:1858878 rem:229s step 10488 (62%) loss:3.2606 lr:0.46 dt:36ms tok/s:1843420 rem:229s step 10489 (62%) loss:3.2348 lr:0.46 dt:35ms tok/s:1849958 rem:228s step 10490 (62%) loss:3.2152 lr:0.46 dt:35ms tok/s:1857886 rem:228s step 10491 (62%) loss:3.2151 lr:0.46 dt:35ms tok/s:1849796 rem:228s step 10492 (62%) loss:3.1800 lr:0.46 dt:35ms tok/s:1852427 rem:228s step 10493 (62%) loss:3.1746 lr:0.46 dt:35ms tok/s:1854977 rem:228s step 10494 (62%) loss:3.1588 lr:0.46 dt:36ms tok/s:1842901 rem:228s step 10495 (62%) loss:3.1478 lr:0.46 dt:36ms tok/s:1835750 rem:228s step 10496 (62%) loss:3.1403 lr:0.46 dt:36ms tok/s:1845574 rem:228s step 10497 (62%) loss:3.1398 lr:0.46 dt:35ms tok/s:1858815 rem:228s step 10498 (62%) loss:3.1312 lr:0.46 dt:35ms tok/s:1848416 rem:228s step 10499 (62%) loss:3.1586 lr:0.46 dt:35ms tok/s:1848180 rem:228s step 10500 (62%) loss:3.1871 lr:0.46 dt:35ms tok/s:1849398 rem:228s + local: attn=[0.101, 0.910, 0.951] mlp=[0.737, 0.305, -0.286] + + transition: attn=[3.077, 1.078] mlp=[-0.252, 0.640] + + hierarchy: attn=[3.333, 5.939, 5.616] mlp=[1.477, -1.080, -3.354] + step 10501 (62%) loss:3.1762 lr:0.46 dt:36ms tok/s:1844719 rem:228s step 10502 (62%) loss:3.1715 lr:0.46 dt:35ms tok/s:1849324 rem:228s step 10503 (62%) loss:3.1798 lr:0.46 dt:35ms tok/s:1857032 rem:228s step 10504 (62%) loss:3.1811 lr:0.46 dt:35ms tok/s:1850606 rem:228s step 10505 (62%) loss:3.1693 lr:0.46 dt:42ms tok/s:1547875 rem:228s step 10506 (62%) loss:3.1625 lr:0.46 dt:35ms tok/s:1882197 rem:228s step 10507 (62%) loss:3.1690 lr:0.46 dt:34ms tok/s:1909683 rem:228s step 10508 (62%) loss:3.1750 lr:0.46 dt:35ms tok/s:1879610 rem:228s step 10509 (62%) loss:3.1809 lr:0.46 dt:36ms tok/s:1846082 rem:228s step 10510 (62%) loss:3.1527 lr:0.46 dt:35ms tok/s:1885748 rem:228s step 10511 (62%) loss:3.1504 lr:0.46 dt:35ms tok/s:1881746 rem:228s step 10512 (62%) loss:3.1383 lr:0.46 dt:35ms tok/s:1865552 rem:228s step 10513 (62%) loss:3.1591 lr:0.46 dt:35ms tok/s:1871394 rem:228s step 10514 (62%) loss:3.1616 lr:0.46 dt:35ms tok/s:1877569 rem:228s step 10515 (62%) loss:3.1302 lr:0.46 dt:35ms tok/s:1883422 rem:228s step 10516 (62%) loss:3.1241 lr:0.46 dt:35ms tok/s:1870618 rem:228s step 10517 (62%) loss:3.1051 lr:0.46 dt:35ms tok/s:1875711 rem:227s step 10518 (62%) loss:3.1110 lr:0.46 dt:35ms tok/s:1870058 rem:227s step 10519 (62%) loss:3.1103 lr:0.46 dt:35ms tok/s:1875250 rem:227s step 10520 (62%) loss:3.1076 lr:0.46 dt:35ms tok/s:1884184 rem:227s step 10521 (62%) loss:3.1312 lr:0.46 dt:35ms tok/s:1876646 rem:227s step 10522 (62%) loss:3.1180 lr:0.46 dt:35ms tok/s:1860741 rem:227s step 10523 (62%) loss:3.1196 lr:0.46 dt:35ms tok/s:1861056 rem:227s step 10524 (62%) loss:3.1136 lr:0.46 dt:35ms tok/s:1858112 rem:227s step 10525 (62%) loss:3.1030 lr:0.46 dt:35ms tok/s:1858941 rem:227s step 10526 (62%) loss:3.1007 lr:0.46 dt:35ms tok/s:1861459 rem:227s step 10527 (62%) loss:3.1067 lr:0.46 dt:35ms tok/s:1856017 rem:227s step 10528 (62%) loss:3.1044 lr:0.46 dt:35ms tok/s:1846590 rem:227s step 10529 (62%) loss:3.1115 lr:0.46 dt:35ms tok/s:1855841 rem:227s step 10530 (62%) loss:3.1426 lr:0.46 dt:35ms tok/s:1857785 rem:227s step 10531 (62%) loss:3.1541 lr:0.46 dt:35ms tok/s:1862443 rem:227s step 10532 (62%) loss:3.1535 lr:0.46 dt:35ms tok/s:1846144 rem:227s step 10533 (62%) loss:3.1247 lr:0.46 dt:35ms tok/s:1853226 rem:227s step 10534 (62%) loss:3.0774 lr:0.46 dt:36ms tok/s:1805805 rem:227s step 10535 (62%) loss:3.1040 lr:0.46 dt:35ms tok/s:1861862 rem:227s step 10536 (62%) loss:3.1237 lr:0.46 dt:35ms tok/s:1861988 rem:227s step 10537 (62%) loss:3.1319 lr:0.46 dt:35ms tok/s:1857107 rem:227s step 10538 (62%) loss:3.1349 lr:0.46 dt:35ms tok/s:1855490 rem:227s step 10539 (62%) loss:3.1463 lr:0.46 dt:35ms tok/s:1853414 rem:227s step 10540 (62%) loss:3.1767 lr:0.46 dt:35ms tok/s:1857773 rem:227s step 10541 (62%) loss:3.1730 lr:0.46 dt:35ms tok/s:1863150 rem:227s step 10542 (62%) loss:3.1741 lr:0.46 dt:35ms tok/s:1859872 rem:227s step 10543 (62%) loss:3.1635 lr:0.46 dt:35ms tok/s:1857785 rem:227s step 10544 (62%) loss:3.1429 lr:0.46 dt:36ms tok/s:1832324 rem:227s step 10545 (62%) loss:3.1317 lr:0.46 dt:36ms tok/s:1829154 rem:226s step 10546 (62%) loss:3.1426 lr:0.46 dt:36ms tok/s:1830347 rem:226s step 10547 (62%) loss:3.1282 lr:0.46 dt:36ms tok/s:1836413 rem:226s step 10548 (62%) loss:3.1514 lr:0.46 dt:35ms tok/s:1847583 rem:226s step 10549 (62%) loss:3.1575 lr:0.46 dt:36ms tok/s:1830774 rem:226s step 10550 (62%) loss:3.1564 lr:0.46 dt:36ms tok/s:1836253 rem:226s step 10551 (62%) loss:3.1100 lr:0.46 dt:36ms tok/s:1837849 rem:226s step 10552 (62%) loss:3.1084 lr:0.46 dt:36ms tok/s:1835824 rem:226s step 10553 (62%) loss:3.1197 lr:0.46 dt:36ms tok/s:1838574 rem:226s step 10554 (62%) loss:3.1301 lr:0.45 dt:36ms tok/s:1832874 rem:226s step 10555 (62%) loss:3.1342 lr:0.45 dt:36ms tok/s:1829458 rem:226s step 10556 (62%) loss:3.1369 lr:0.45 dt:36ms tok/s:1837591 rem:226s step 10557 (62%) loss:3.1421 lr:0.45 dt:36ms tok/s:1835444 rem:226s step 10558 (62%) loss:3.1508 lr:0.45 dt:36ms tok/s:1835309 rem:226s step 10559 (62%) loss:3.1548 lr:0.45 dt:36ms tok/s:1832141 rem:226s step 10560 (62%) loss:3.1513 lr:0.45 dt:36ms tok/s:1838501 rem:226s step 10561 (62%) loss:3.1399 lr:0.45 dt:36ms tok/s:1834697 rem:226s step 10562 (62%) loss:3.1400 lr:0.45 dt:36ms tok/s:1831225 rem:226s step 10563 (62%) loss:3.1249 lr:0.45 dt:36ms tok/s:1802620 rem:226s step 10564 (62%) loss:3.1229 lr:0.45 dt:36ms tok/s:1810659 rem:226s step 10565 (62%) loss:3.1269 lr:0.45 dt:36ms tok/s:1825303 rem:226s step 10566 (62%) loss:3.1261 lr:0.45 dt:36ms tok/s:1810921 rem:226s step 10567 (62%) loss:3.0936 lr:0.45 dt:36ms tok/s:1818577 rem:226s step 10568 (62%) loss:3.0493 lr:0.45 dt:36ms tok/s:1812629 rem:226s step 10569 (62%) loss:3.0616 lr:0.45 dt:36ms tok/s:1818312 rem:226s step 10570 (62%) loss:3.0754 lr:0.45 dt:36ms tok/s:1812222 rem:226s step 10571 (62%) loss:3.1174 lr:0.45 dt:36ms tok/s:1810814 rem:226s step 10572 (62%) loss:3.1404 lr:0.45 dt:36ms tok/s:1817891 rem:226s step 10573 (62%) loss:3.1216 lr:0.45 dt:36ms tok/s:1815310 rem:225s step 10574 (62%) loss:3.1488 lr:0.45 dt:36ms tok/s:1814615 rem:225s step 10575 (62%) loss:3.1396 lr:0.45 dt:36ms tok/s:1819371 rem:225s step 10576 (62%) loss:3.1459 lr:0.45 dt:36ms tok/s:1812306 rem:225s step 10577 (62%) loss:3.1309 lr:0.45 dt:36ms tok/s:1800271 rem:225s step 10578 (62%) loss:3.1463 lr:0.45 dt:36ms tok/s:1815538 rem:225s step 10579 (62%) loss:3.1301 lr:0.45 dt:36ms tok/s:1804087 rem:225s step 10580 (62%) loss:3.1232 lr:0.45 dt:36ms tok/s:1807575 rem:225s step 10581 (62%) loss:3.1052 lr:0.45 dt:36ms tok/s:1814867 rem:225s step 10582 (62%) loss:3.0836 lr:0.45 dt:36ms tok/s:1798410 rem:225s step 10583 (62%) loss:3.0979 lr:0.45 dt:36ms tok/s:1810826 rem:225s step 10584 (62%) loss:3.0981 lr:0.45 dt:35ms tok/s:1859344 rem:225s step 10585 (62%) loss:3.0920 lr:0.45 dt:36ms tok/s:1832361 rem:225s step 10586 (62%) loss:3.1115 lr:0.45 dt:36ms tok/s:1836449 rem:225s step 10587 (62%) loss:3.1260 lr:0.45 dt:36ms tok/s:1821457 rem:225s step 10588 (63%) loss:3.1485 lr:0.45 dt:36ms tok/s:1818276 rem:225s step 10589 (63%) loss:3.1312 lr:0.45 dt:36ms tok/s:1816774 rem:225s step 10590 (63%) loss:3.1045 lr:0.45 dt:36ms tok/s:1819624 rem:225s step 10591 (63%) loss:3.0985 lr:0.45 dt:36ms tok/s:1817122 rem:225s step 10592 (63%) loss:3.1044 lr:0.45 dt:36ms tok/s:1814975 rem:225s step 10593 (63%) loss:3.1164 lr:0.45 dt:36ms tok/s:1812844 rem:225s step 10594 (63%) loss:3.1194 lr:0.45 dt:36ms tok/s:1818300 rem:225s step 10595 (63%) loss:3.1313 lr:0.45 dt:36ms tok/s:1817783 rem:225s step 10596 (63%) loss:3.1524 lr:0.45 dt:37ms tok/s:1770641 rem:225s step 10597 (63%) loss:3.1428 lr:0.45 dt:36ms tok/s:1827330 rem:225s step 10598 (63%) loss:3.1215 lr:0.45 dt:36ms tok/s:1835468 rem:225s step 10599 (63%) loss:3.0831 lr:0.45 dt:36ms tok/s:1835922 rem:225s step 10600 (63%) loss:3.0927 lr:0.45 dt:36ms tok/s:1818818 rem:225s + local: attn=[0.085, 0.908, 0.936] mlp=[0.753, 0.273, -0.275] + + transition: attn=[3.164, 1.059] mlp=[-0.250, 0.672] + + hierarchy: attn=[3.394, 5.939, 5.616] mlp=[1.524, -1.121, -3.383] + step 10601 (63%) loss:3.0998 lr:0.45 dt:35ms tok/s:1850295 rem:224s step 10602 (63%) loss:3.1095 lr:0.45 dt:35ms tok/s:1856104 rem:224s step 10603 (63%) loss:3.1046 lr:0.45 dt:35ms tok/s:1855728 rem:224s step 10604 (63%) loss:3.1112 lr:0.45 dt:35ms tok/s:1859255 rem:224s step 10605 (63%) loss:3.1024 lr:0.45 dt:35ms tok/s:1854189 rem:224s step 10606 (63%) loss:3.0911 lr:0.45 dt:35ms tok/s:1861345 rem:224s step 10607 (63%) loss:3.0905 lr:0.45 dt:35ms tok/s:1848267 rem:224s step 10608 (63%) loss:3.0699 lr:0.45 dt:35ms tok/s:1848378 rem:224s step 10609 (63%) loss:3.0788 lr:0.45 dt:35ms tok/s:1852352 rem:224s step 10610 (63%) loss:3.0733 lr:0.45 dt:35ms tok/s:1860678 rem:224s step 10611 (63%) loss:3.0517 lr:0.45 dt:35ms tok/s:1856493 rem:224s step 10612 (63%) loss:3.0642 lr:0.45 dt:35ms tok/s:1859683 rem:224s step 10613 (63%) loss:3.0664 lr:0.45 dt:35ms tok/s:1848826 rem:224s step 10614 (63%) loss:3.0792 lr:0.45 dt:35ms tok/s:1852726 rem:224s step 10615 (63%) loss:3.0732 lr:0.45 dt:35ms tok/s:1860652 rem:224s step 10616 (63%) loss:3.0882 lr:0.45 dt:35ms tok/s:1862531 rem:224s step 10617 (63%) loss:3.1132 lr:0.45 dt:35ms tok/s:1854539 rem:224s step 10618 (63%) loss:3.1163 lr:0.45 dt:35ms tok/s:1857861 rem:224s step 10619 (63%) loss:3.1132 lr:0.45 dt:35ms tok/s:1856769 rem:224s step 10620 (63%) loss:3.0980 lr:0.45 dt:35ms tok/s:1860048 rem:224s step 10621 (63%) loss:3.1220 lr:0.45 dt:35ms tok/s:1857384 rem:224s step 10622 (63%) loss:3.1191 lr:0.45 dt:35ms tok/s:1862102 rem:224s step 10623 (63%) loss:3.1184 lr:0.45 dt:35ms tok/s:1867276 rem:224s step 10624 (63%) loss:3.1128 lr:0.45 dt:35ms tok/s:1858074 rem:224s step 10625 (63%) loss:3.1398 lr:0.45 dt:35ms tok/s:1866844 rem:224s step 10626 (63%) loss:3.1353 lr:0.45 dt:35ms tok/s:1864692 rem:224s step 10627 (63%) loss:3.1487 lr:0.45 dt:35ms tok/s:1858363 rem:224s step 10628 (63%) loss:3.1638 lr:0.45 dt:35ms tok/s:1854514 rem:224s step 10629 (63%) loss:3.1625 lr:0.45 dt:35ms tok/s:1862607 rem:223s step 10630 (63%) loss:3.1578 lr:0.45 dt:35ms tok/s:1849149 rem:223s step 10631 (63%) loss:3.2009 lr:0.45 dt:36ms tok/s:1829580 rem:223s step 10632 (63%) loss:3.1926 lr:0.45 dt:35ms tok/s:1857346 rem:223s step 10633 (63%) loss:3.1979 lr:0.45 dt:35ms tok/s:1862241 rem:223s step 10634 (63%) loss:3.1885 lr:0.45 dt:36ms tok/s:1821059 rem:223s step 10635 (63%) loss:3.1977 lr:0.45 dt:36ms tok/s:1837468 rem:223s step 10636 (63%) loss:3.2005 lr:0.45 dt:36ms tok/s:1827524 rem:223s step 10637 (63%) loss:3.1986 lr:0.45 dt:36ms tok/s:1828290 rem:223s step 10638 (63%) loss:3.1995 lr:0.45 dt:36ms tok/s:1834452 rem:223s step 10639 (63%) loss:3.2156 lr:0.45 dt:36ms tok/s:1830006 rem:223s step 10640 (63%) loss:3.1969 lr:0.44 dt:36ms tok/s:1828655 rem:223s step 10641 (63%) loss:3.1859 lr:0.44 dt:36ms tok/s:1831384 rem:223s step 10642 (63%) loss:3.1874 lr:0.44 dt:36ms tok/s:1819636 rem:223s step 10643 (63%) loss:3.1754 lr:0.44 dt:36ms tok/s:1830884 rem:223s step 10644 (63%) loss:3.1827 lr:0.44 dt:36ms tok/s:1834831 rem:223s step 10645 (63%) loss:3.1752 lr:0.44 dt:35ms tok/s:1856606 rem:223s step 10646 (63%) loss:3.1703 lr:0.44 dt:35ms tok/s:1858652 rem:223s step 10647 (63%) loss:3.1550 lr:0.44 dt:35ms tok/s:1865413 rem:223s step 10648 (63%) loss:3.1577 lr:0.44 dt:35ms tok/s:1849622 rem:223s step 10649 (63%) loss:3.1502 lr:0.44 dt:35ms tok/s:1859017 rem:223s step 10650 (63%) loss:3.1490 lr:0.44 dt:35ms tok/s:1849436 rem:223s step 10651 (63%) loss:3.1282 lr:0.44 dt:35ms tok/s:1854551 rem:223s step 10652 (63%) loss:3.1159 lr:0.44 dt:36ms tok/s:1836118 rem:223s step 10653 (63%) loss:3.1084 lr:0.44 dt:36ms tok/s:1837812 rem:223s step 10654 (63%) loss:3.1840 lr:0.44 dt:36ms tok/s:1838661 rem:223s step 10655 (63%) loss:3.2921 lr:0.44 dt:36ms tok/s:1831030 rem:223s step 10656 (63%) loss:3.3818 lr:0.44 dt:36ms tok/s:1832471 rem:223s step 10657 (63%) loss:3.3869 lr:0.44 dt:36ms tok/s:1834452 rem:222s step 10658 (63%) loss:3.3769 lr:0.44 dt:36ms tok/s:1826953 rem:222s step 10659 (63%) loss:3.3636 lr:0.44 dt:36ms tok/s:1833167 rem:222s step 10660 (63%) loss:3.3463 lr:0.44 dt:36ms tok/s:1837935 rem:222s step 10661 (63%) loss:3.3253 lr:0.44 dt:36ms tok/s:1828764 rem:222s step 10662 (63%) loss:3.3074 lr:0.44 dt:36ms tok/s:1840495 rem:222s step 10663 (63%) loss:3.2948 lr:0.44 dt:36ms tok/s:1837591 rem:222s step 10664 (63%) loss:3.2724 lr:0.44 dt:36ms tok/s:1830518 rem:222s step 10665 (63%) loss:3.2604 lr:0.44 dt:36ms tok/s:1833889 rem:222s step 10666 (63%) loss:3.2581 lr:0.44 dt:36ms tok/s:1841185 rem:222s step 10667 (63%) loss:3.2519 lr:0.44 dt:36ms tok/s:1832263 rem:222s step 10668 (63%) loss:3.2273 lr:0.44 dt:36ms tok/s:1837284 rem:222s step 10669 (63%) loss:3.2099 lr:0.44 dt:36ms tok/s:1835297 rem:222s step 10670 (63%) loss:3.1767 lr:0.44 dt:36ms tok/s:1835174 rem:222s step 10671 (63%) loss:3.1666 lr:0.44 dt:36ms tok/s:1836511 rem:222s step 10672 (63%) loss:3.1433 lr:0.44 dt:37ms tok/s:1776673 rem:222s step 10673 (63%) loss:3.1618 lr:0.44 dt:37ms tok/s:1768875 rem:222s step 10674 (63%) loss:3.1499 lr:0.44 dt:36ms tok/s:1836695 rem:222s step 10675 (63%) loss:3.1464 lr:0.44 dt:36ms tok/s:1827160 rem:222s step 10676 (63%) loss:3.1607 lr:0.44 dt:36ms tok/s:1835015 rem:222s step 10677 (63%) loss:3.1742 lr:0.44 dt:36ms tok/s:1835468 rem:222s step 10678 (63%) loss:3.1577 lr:0.44 dt:36ms tok/s:1804111 rem:222s step 10679 (63%) loss:3.1269 lr:0.44 dt:36ms tok/s:1832153 rem:222s step 10680 (63%) loss:3.1088 lr:0.44 dt:36ms tok/s:1827877 rem:222s step 10681 (63%) loss:3.1328 lr:0.44 dt:36ms tok/s:1833302 rem:222s step 10682 (63%) loss:3.1234 lr:0.44 dt:36ms tok/s:1827743 rem:222s step 10683 (63%) loss:3.1242 lr:0.44 dt:36ms tok/s:1830360 rem:222s step 10684 (63%) loss:3.1287 lr:0.44 dt:36ms tok/s:1828229 rem:222s step 10685 (63%) loss:3.1026 lr:0.44 dt:36ms tok/s:1835517 rem:221s step 10686 (63%) loss:3.0971 lr:0.44 dt:36ms tok/s:1834439 rem:221s step 10687 (63%) loss:3.0998 lr:0.44 dt:36ms tok/s:1810563 rem:221s step 10688 (63%) loss:3.0842 lr:0.44 dt:36ms tok/s:1824649 rem:221s step 10689 (63%) loss:3.0954 lr:0.44 dt:36ms tok/s:1828472 rem:221s step 10690 (63%) loss:3.1151 lr:0.44 dt:36ms tok/s:1841469 rem:221s step 10691 (63%) loss:3.1231 lr:0.44 dt:36ms tok/s:1840298 rem:221s step 10692 (63%) loss:3.1139 lr:0.44 dt:36ms tok/s:1840470 rem:221s step 10693 (63%) loss:3.1241 lr:0.44 dt:36ms tok/s:1838292 rem:221s step 10694 (63%) loss:3.1447 lr:0.44 dt:36ms tok/s:1830920 rem:221s step 10695 (63%) loss:3.1090 lr:0.44 dt:36ms tok/s:1831774 rem:221s step 10696 (63%) loss:3.1035 lr:0.44 dt:36ms tok/s:1818505 rem:221s step 10697 (63%) loss:3.0967 lr:0.44 dt:36ms tok/s:1842024 rem:221s step 10698 (63%) loss:3.0993 lr:0.44 dt:36ms tok/s:1832312 rem:221s step 10699 (63%) loss:3.0975 lr:0.44 dt:36ms tok/s:1833155 rem:221s step 10700 (63%) loss:3.0938 lr:0.44 dt:36ms tok/s:1838550 rem:221s + local: attn=[0.094, 0.907, 0.933] mlp=[0.756, 0.285, -0.275] + + transition: attn=[3.167, 1.049] mlp=[-0.270, 0.697] + + hierarchy: attn=[3.358, 5.939, 5.616] mlp=[1.543, -1.181, -3.422] + step 10701 (63%) loss:3.0905 lr:0.44 dt:36ms tok/s:1836057 rem:221s step 10702 (63%) loss:3.0824 lr:0.44 dt:36ms tok/s:1827099 rem:221s step 10703 (63%) loss:3.0973 lr:0.44 dt:36ms tok/s:1825570 rem:221s step 10704 (63%) loss:3.0921 lr:0.44 dt:36ms tok/s:1829689 rem:221s step 10705 (63%) loss:3.0871 lr:0.44 dt:36ms tok/s:1807266 rem:221s step 10706 (63%) loss:3.0842 lr:0.44 dt:36ms tok/s:1832471 rem:221s step 10707 (63%) loss:3.0955 lr:0.44 dt:36ms tok/s:1840667 rem:221s step 10708 (63%) loss:3.0869 lr:0.44 dt:36ms tok/s:1834513 rem:221s step 10709 (63%) loss:3.0775 lr:0.44 dt:36ms tok/s:1827135 rem:221s step 10710 (63%) loss:3.0863 lr:0.44 dt:36ms tok/s:1837174 rem:221s step 10711 (63%) loss:3.0767 lr:0.44 dt:36ms tok/s:1837591 rem:221s step 10712 (63%) loss:3.0722 lr:0.44 dt:36ms tok/s:1837100 rem:221s step 10713 (63%) loss:3.0758 lr:0.44 dt:36ms tok/s:1828558 rem:220s step 10714 (63%) loss:3.1009 lr:0.44 dt:36ms tok/s:1826249 rem:220s step 10715 (63%) loss:3.1187 lr:0.44 dt:36ms tok/s:1836069 rem:220s step 10716 (63%) loss:3.1094 lr:0.44 dt:36ms tok/s:1834880 rem:220s step 10717 (63%) loss:3.1014 lr:0.44 dt:36ms tok/s:1826225 rem:220s step 10718 (63%) loss:3.0994 lr:0.44 dt:36ms tok/s:1835750 rem:220s step 10719 (63%) loss:3.1116 lr:0.44 dt:36ms tok/s:1835236 rem:220s step 10720 (63%) loss:3.1153 lr:0.44 dt:36ms tok/s:1829373 rem:220s step 10721 (63%) loss:3.1045 lr:0.44 dt:36ms tok/s:1835027 rem:220s step 10722 (63%) loss:3.0991 lr:0.44 dt:36ms tok/s:1838624 rem:220s step 10723 (63%) loss:3.0944 lr:0.44 dt:36ms tok/s:1834390 rem:220s step 10724 (63%) loss:3.0909 lr:0.44 dt:36ms tok/s:1843371 rem:220s step 10725 (63%) loss:3.0792 lr:0.44 dt:36ms tok/s:1835505 rem:220s step 10726 (63%) loss:3.0656 lr:0.43 dt:36ms tok/s:1840027 rem:220s step 10727 (63%) loss:3.0634 lr:0.43 dt:36ms tok/s:1844249 rem:220s step 10728 (63%) loss:3.0499 lr:0.43 dt:36ms tok/s:1837075 rem:220s step 10729 (63%) loss:3.0402 lr:0.43 dt:36ms tok/s:1808776 rem:220s step 10730 (63%) loss:3.0389 lr:0.43 dt:36ms tok/s:1843717 rem:220s step 10731 (63%) loss:3.0376 lr:0.43 dt:36ms tok/s:1836057 rem:220s step 10732 (63%) loss:3.0380 lr:0.43 dt:36ms tok/s:1808919 rem:220s step 10733 (63%) loss:3.0308 lr:0.43 dt:36ms tok/s:1830872 rem:220s step 10734 (63%) loss:3.0306 lr:0.43 dt:36ms tok/s:1818637 rem:220s step 10735 (63%) loss:3.0333 lr:0.43 dt:36ms tok/s:1832141 rem:220s step 10736 (63%) loss:3.0294 lr:0.43 dt:36ms tok/s:1837997 rem:220s step 10737 (63%) loss:3.0221 lr:0.43 dt:36ms tok/s:1836437 rem:220s step 10738 (63%) loss:3.0195 lr:0.43 dt:36ms tok/s:1833583 rem:220s step 10739 (63%) loss:3.0173 lr:0.43 dt:36ms tok/s:1812115 rem:220s step 10740 (63%) loss:3.0157 lr:0.43 dt:36ms tok/s:1820613 rem:220s step 10741 (63%) loss:3.0212 lr:0.43 dt:36ms tok/s:1813119 rem:219s step 10742 (63%) loss:3.0118 lr:0.43 dt:36ms tok/s:1826249 rem:219s step 10743 (63%) loss:3.0037 lr:0.43 dt:36ms tok/s:1835321 rem:219s step 10744 (63%) loss:2.9937 lr:0.43 dt:36ms tok/s:1827257 rem:219s step 10745 (63%) loss:2.9986 lr:0.43 dt:36ms tok/s:1827014 rem:219s step 10746 (63%) loss:3.0062 lr:0.43 dt:36ms tok/s:1838169 rem:219s step 10747 (63%) loss:3.0148 lr:0.43 dt:36ms tok/s:1839128 rem:219s step 10748 (63%) loss:3.0319 lr:0.43 dt:36ms tok/s:1830457 rem:219s step 10749 (63%) loss:3.0493 lr:0.43 dt:36ms tok/s:1834599 rem:219s step 10750 (63%) loss:3.0566 lr:0.43 dt:36ms tok/s:1834048 rem:219s step 10751 (63%) loss:3.0575 lr:0.43 dt:36ms tok/s:1833424 rem:219s step 10752 (63%) loss:3.0536 lr:0.43 dt:36ms tok/s:1829884 rem:219s step 10753 (63%) loss:3.0501 lr:0.43 dt:36ms tok/s:1835677 rem:219s step 10754 (63%) loss:3.0395 lr:0.43 dt:36ms tok/s:1835481 rem:219s step 10755 (63%) loss:3.0567 lr:0.43 dt:36ms tok/s:1839805 rem:219s step 10756 (64%) loss:3.0424 lr:0.43 dt:36ms tok/s:1796693 rem:219s step 10757 (64%) loss:3.0506 lr:0.43 dt:36ms tok/s:1836498 rem:219s step 10758 (64%) loss:3.0455 lr:0.43 dt:36ms tok/s:1841740 rem:219s step 10759 (64%) loss:3.0644 lr:0.43 dt:36ms tok/s:1837677 rem:219s step 10760 (64%) loss:3.0792 lr:0.43 dt:36ms tok/s:1836560 rem:219s step 10761 (64%) loss:3.0916 lr:0.43 dt:36ms tok/s:1838402 rem:219s step 10762 (64%) loss:3.1029 lr:0.43 dt:36ms tok/s:1834611 rem:219s step 10763 (64%) loss:3.1067 lr:0.43 dt:36ms tok/s:1835763 rem:219s step 10764 (64%) loss:3.1030 lr:0.43 dt:36ms tok/s:1840014 rem:219s step 10765 (64%) loss:3.0803 lr:0.43 dt:36ms tok/s:1836854 rem:219s step 10766 (64%) loss:3.0567 lr:0.43 dt:36ms tok/s:1837051 rem:219s step 10767 (64%) loss:3.0601 lr:0.43 dt:36ms tok/s:1807254 rem:219s step 10768 (64%) loss:3.0414 lr:0.43 dt:36ms tok/s:1835395 rem:219s step 10769 (64%) loss:3.0559 lr:0.43 dt:36ms tok/s:1837161 rem:218s step 10770 (64%) loss:3.0487 lr:0.43 dt:36ms tok/s:1835370 rem:218s step 10771 (64%) loss:3.0523 lr:0.43 dt:36ms tok/s:1834464 rem:218s step 10772 (64%) loss:3.0644 lr:0.43 dt:36ms tok/s:1831225 rem:218s step 10773 (64%) loss:3.0673 lr:0.43 dt:36ms tok/s:1833632 rem:218s step 10774 (64%) loss:3.0768 lr:0.43 dt:36ms tok/s:1837751 rem:218s step 10775 (64%) loss:3.0750 lr:0.43 dt:37ms tok/s:1769433 rem:218s step 10776 (64%) loss:3.0753 lr:0.43 dt:36ms tok/s:1835652 rem:218s step 10777 (64%) loss:3.0780 lr:0.43 dt:36ms tok/s:1839460 rem:218s step 10778 (64%) loss:3.0852 lr:0.43 dt:36ms tok/s:1843383 rem:218s step 10779 (64%) loss:3.0958 lr:0.43 dt:36ms tok/s:1829349 rem:218s step 10780 (64%) loss:3.0966 lr:0.43 dt:36ms tok/s:1828205 rem:218s step 10781 (64%) loss:3.0951 lr:0.43 dt:36ms tok/s:1834256 rem:218s step 10782 (64%) loss:3.0806 lr:0.43 dt:36ms tok/s:1840285 rem:218s step 10783 (64%) loss:3.0866 lr:0.43 dt:36ms tok/s:1838636 rem:218s step 10784 (64%) loss:3.0841 lr:0.43 dt:36ms tok/s:1835640 rem:218s step 10785 (64%) loss:3.0827 lr:0.43 dt:36ms tok/s:1835959 rem:218s step 10786 (64%) loss:3.0907 lr:0.43 dt:36ms tok/s:1838501 rem:218s step 10787 (64%) loss:3.0942 lr:0.43 dt:36ms tok/s:1845326 rem:218s step 10788 (64%) loss:3.0965 lr:0.43 dt:36ms tok/s:1841111 rem:218s step 10789 (64%) loss:3.0861 lr:0.43 dt:36ms tok/s:1842753 rem:218s step 10790 (64%) loss:3.0935 lr:0.43 dt:36ms tok/s:1839017 rem:218s step 10791 (64%) loss:3.0875 lr:0.43 dt:36ms tok/s:1833668 rem:218s step 10792 (64%) loss:3.1042 lr:0.43 dt:36ms tok/s:1840273 rem:218s step 10793 (64%) loss:3.1018 lr:0.43 dt:36ms tok/s:1829288 rem:218s step 10794 (64%) loss:3.1117 lr:0.43 dt:36ms tok/s:1832055 rem:218s step 10795 (64%) loss:3.1041 lr:0.43 dt:36ms tok/s:1837272 rem:218s step 10796 (64%) loss:3.1075 lr:0.43 dt:36ms tok/s:1832043 rem:218s step 10797 (64%) loss:3.1000 lr:0.43 dt:36ms tok/s:1836854 rem:217s step 10798 (64%) loss:3.1189 lr:0.43 dt:36ms tok/s:1832813 rem:217s step 10799 (64%) loss:3.1203 lr:0.43 dt:36ms tok/s:1837997 rem:217s step 10800 (64%) loss:3.1014 lr:0.43 dt:36ms tok/s:1836683 rem:217s + local: attn=[0.097, 0.888, 0.941] mlp=[0.765, 0.294, -0.269] + + transition: attn=[3.149, 1.075] mlp=[-0.262, 0.676] + + hierarchy: attn=[3.359, 5.939, 5.616] mlp=[1.543, -1.188, -3.509] + step 10801 (64%) loss:3.1177 lr:0.43 dt:36ms tok/s:1841851 rem:217s step 10802 (64%) loss:3.1482 lr:0.43 dt:36ms tok/s:1842111 rem:217s step 10803 (64%) loss:3.1457 lr:0.43 dt:36ms tok/s:1833999 rem:217s step 10804 (64%) loss:3.1578 lr:0.43 dt:36ms tok/s:1839362 rem:217s step 10805 (64%) loss:3.1591 lr:0.43 dt:36ms tok/s:1832813 rem:217s step 10806 (64%) loss:3.1660 lr:0.43 dt:36ms tok/s:1842246 rem:217s step 10807 (64%) loss:3.1583 lr:0.43 dt:36ms tok/s:1836670 rem:217s step 10808 (64%) loss:3.1501 lr:0.43 dt:36ms tok/s:1839571 rem:217s step 10809 (64%) loss:3.1447 lr:0.43 dt:36ms tok/s:1836265 rem:217s step 10810 (64%) loss:3.1354 lr:0.43 dt:36ms tok/s:1832373 rem:217s step 10811 (64%) loss:3.1031 lr:0.43 dt:36ms tok/s:1836597 rem:217s step 10812 (64%) loss:3.0970 lr:0.42 dt:36ms tok/s:1833938 rem:217s step 10813 (64%) loss:3.0944 lr:0.42 dt:36ms tok/s:1841703 rem:217s step 10814 (64%) loss:3.1002 lr:0.42 dt:36ms tok/s:1835848 rem:217s step 10815 (64%) loss:3.0868 lr:0.42 dt:36ms tok/s:1831762 rem:217s step 10816 (64%) loss:3.1221 lr:0.42 dt:36ms tok/s:1835150 rem:217s step 10817 (64%) loss:3.1325 lr:0.42 dt:36ms tok/s:1844558 rem:217s step 10818 (64%) loss:3.1366 lr:0.42 dt:36ms tok/s:1842086 rem:217s step 10819 (64%) loss:3.1384 lr:0.42 dt:36ms tok/s:1795520 rem:217s step 10820 (64%) loss:3.1146 lr:0.42 dt:35ms tok/s:1856957 rem:217s step 10821 (64%) loss:3.1235 lr:0.42 dt:36ms tok/s:1838304 rem:217s step 10822 (64%) loss:3.1116 lr:0.42 dt:36ms tok/s:1841037 rem:217s step 10823 (64%) loss:3.1167 lr:0.42 dt:36ms tok/s:1842876 rem:217s step 10824 (64%) loss:3.1407 lr:0.42 dt:36ms tok/s:1839694 rem:217s step 10825 (64%) loss:3.1634 lr:0.42 dt:36ms tok/s:1841679 rem:216s step 10826 (64%) loss:3.1805 lr:0.42 dt:36ms tok/s:1840409 rem:216s step 10827 (64%) loss:3.2100 lr:0.42 dt:36ms tok/s:1837026 rem:216s step 10828 (64%) loss:3.2137 lr:0.42 dt:35ms tok/s:1850469 rem:216s step 10829 (64%) loss:3.2115 lr:0.42 dt:36ms tok/s:1841309 rem:216s step 10830 (64%) loss:3.2177 lr:0.42 dt:35ms tok/s:1846714 rem:216s step 10831 (64%) loss:3.2214 lr:0.42 dt:36ms tok/s:1841000 rem:216s step 10832 (64%) loss:3.2187 lr:0.42 dt:36ms tok/s:1835971 rem:216s step 10833 (64%) loss:3.2154 lr:0.42 dt:36ms tok/s:1831286 rem:216s step 10834 (64%) loss:3.1869 lr:0.42 dt:36ms tok/s:1835579 rem:216s step 10835 (64%) loss:3.1725 lr:0.42 dt:36ms tok/s:1843148 rem:216s step 10836 (64%) loss:3.1683 lr:0.42 dt:36ms tok/s:1831006 rem:216s step 10837 (64%) loss:3.1603 lr:0.42 dt:36ms tok/s:1837984 rem:216s step 10838 (64%) loss:3.1333 lr:0.42 dt:36ms tok/s:1830628 rem:216s step 10839 (64%) loss:3.1274 lr:0.42 dt:36ms tok/s:1835971 rem:216s step 10840 (64%) loss:3.1128 lr:0.42 dt:36ms tok/s:1832165 rem:216s step 10841 (64%) loss:3.1165 lr:0.42 dt:36ms tok/s:1837775 rem:216s step 10842 (64%) loss:3.1118 lr:0.42 dt:36ms tok/s:1833326 rem:216s step 10843 (64%) loss:3.0780 lr:0.42 dt:36ms tok/s:1813095 rem:216s step 10844 (64%) loss:3.0746 lr:0.42 dt:36ms tok/s:1833424 rem:216s step 10845 (64%) loss:3.0822 lr:0.42 dt:36ms tok/s:1827172 rem:216s step 10846 (64%) loss:3.0945 lr:0.42 dt:36ms tok/s:1829519 rem:216s step 10847 (64%) loss:3.1130 lr:0.42 dt:36ms tok/s:1829616 rem:216s step 10848 (64%) loss:3.1121 lr:0.42 dt:36ms tok/s:1837689 rem:216s step 10849 (64%) loss:3.0888 lr:0.42 dt:36ms tok/s:1836879 rem:216s step 10850 (64%) loss:3.0892 lr:0.42 dt:36ms tok/s:1828874 rem:216s step 10851 (64%) loss:3.0902 lr:0.42 dt:36ms tok/s:1827257 rem:216s step 10852 (64%) loss:3.0675 lr:0.42 dt:36ms tok/s:1839411 rem:216s step 10853 (64%) loss:3.0714 lr:0.42 dt:36ms tok/s:1838242 rem:215s step 10854 (64%) loss:3.0752 lr:0.42 dt:36ms tok/s:1838365 rem:215s step 10855 (64%) loss:3.1020 lr:0.42 dt:36ms tok/s:1839977 rem:215s step 10856 (64%) loss:3.1095 lr:0.42 dt:36ms tok/s:1840507 rem:215s step 10857 (64%) loss:3.1324 lr:0.42 dt:36ms tok/s:1828545 rem:215s step 10858 (64%) loss:3.1222 lr:0.42 dt:36ms tok/s:1832471 rem:215s step 10859 (64%) loss:3.1180 lr:0.42 dt:36ms tok/s:1818300 rem:215s step 10860 (64%) loss:3.1177 lr:0.42 dt:36ms tok/s:1834684 rem:215s step 10861 (64%) loss:3.1321 lr:0.42 dt:36ms tok/s:1831555 rem:215s step 10862 (64%) loss:3.1405 lr:0.42 dt:36ms tok/s:1836008 rem:215s step 10863 (64%) loss:3.1494 lr:0.42 dt:36ms tok/s:1824201 rem:215s step 10864 (64%) loss:3.1481 lr:0.42 dt:36ms tok/s:1827852 rem:215s step 10865 (64%) loss:3.1413 lr:0.42 dt:36ms tok/s:1840187 rem:215s step 10866 (64%) loss:3.1159 lr:0.42 dt:36ms tok/s:1823693 rem:215s step 10867 (64%) loss:3.0737 lr:0.42 dt:36ms tok/s:1839300 rem:215s step 10868 (64%) loss:3.0521 lr:0.42 dt:36ms tok/s:1832519 rem:215s step 10869 (64%) loss:3.0546 lr:0.42 dt:36ms tok/s:1834893 rem:215s step 10870 (64%) loss:3.0517 lr:0.42 dt:36ms tok/s:1828168 rem:215s step 10871 (64%) loss:3.0592 lr:0.42 dt:36ms tok/s:1834770 rem:215s step 10872 (64%) loss:3.0618 lr:0.42 dt:36ms tok/s:1830652 rem:215s step 10873 (64%) loss:3.0759 lr:0.42 dt:36ms tok/s:1835468 rem:215s step 10874 (64%) loss:3.0740 lr:0.42 dt:36ms tok/s:1832251 rem:215s step 10875 (64%) loss:3.0810 lr:0.42 dt:36ms tok/s:1838353 rem:215s step 10876 (64%) loss:3.0806 lr:0.42 dt:36ms tok/s:1836045 rem:215s step 10877 (64%) loss:3.0708 lr:0.42 dt:36ms tok/s:1834133 rem:215s step 10878 (64%) loss:3.0559 lr:0.42 dt:36ms tok/s:1844608 rem:215s step 10879 (64%) loss:3.0658 lr:0.42 dt:36ms tok/s:1840778 rem:215s step 10880 (64%) loss:3.0587 lr:0.42 dt:36ms tok/s:1833314 rem:215s step 10881 (64%) loss:3.0528 lr:0.42 dt:41ms tok/s:1603750 rem:214s step 10882 (64%) loss:3.0587 lr:0.42 dt:36ms tok/s:1832752 rem:214s step 10883 (64%) loss:3.0322 lr:0.42 dt:35ms tok/s:1858577 rem:214s step 10884 (64%) loss:2.9483 lr:0.42 dt:35ms tok/s:1855866 rem:214s step 10885 (64%) loss:2.9743 lr:0.42 dt:35ms tok/s:1856468 rem:214s step 10886 (64%) loss:2.9717 lr:0.42 dt:35ms tok/s:1857170 rem:214s step 10887 (64%) loss:2.9943 lr:0.42 dt:36ms tok/s:1840704 rem:214s step 10888 (64%) loss:3.0039 lr:0.42 dt:35ms tok/s:1857823 rem:214s step 10889 (64%) loss:3.0000 lr:0.42 dt:35ms tok/s:1858325 rem:214s step 10890 (64%) loss:3.0098 lr:0.42 dt:35ms tok/s:1857647 rem:214s step 10891 (64%) loss:3.0228 lr:0.42 dt:35ms tok/s:1854164 rem:214s step 10892 (64%) loss:3.0325 lr:0.42 dt:35ms tok/s:1854764 rem:214s step 10893 (64%) loss:3.0438 lr:0.42 dt:35ms tok/s:1860652 rem:214s step 10894 (64%) loss:3.0334 lr:0.42 dt:35ms tok/s:1858024 rem:214s step 10895 (64%) loss:3.0285 lr:0.42 dt:36ms tok/s:1842753 rem:214s step 10896 (64%) loss:3.0299 lr:0.42 dt:36ms tok/s:1839042 rem:214s step 10897 (64%) loss:3.0443 lr:0.42 dt:35ms tok/s:1862102 rem:214s step 10898 (64%) loss:3.0373 lr:0.42 dt:36ms tok/s:1829056 rem:214s step 10899 (64%) loss:3.0300 lr:0.41 dt:35ms tok/s:1853089 rem:214s step 10900 (64%) loss:3.0240 lr:0.41 dt:35ms tok/s:1847434 rem:214s + local: attn=[0.097, 0.894, 0.944] mlp=[0.780, 0.286, -0.279] + + transition: attn=[3.200, 1.084] mlp=[-0.270, 0.735] + + hierarchy: attn=[3.366, 5.939, 5.616] mlp=[1.591, -1.212, -3.545] + step 10901 (64%) loss:3.0330 lr:0.41 dt:35ms tok/s:1856593 rem:214s step 10902 (64%) loss:3.0354 lr:0.41 dt:36ms tok/s:1844657 rem:214s step 10903 (64%) loss:3.0492 lr:0.41 dt:36ms tok/s:1831323 rem:214s step 10904 (64%) loss:3.0505 lr:0.41 dt:36ms tok/s:1836425 rem:214s step 10905 (64%) loss:3.0519 lr:0.41 dt:36ms tok/s:1835187 rem:214s step 10906 (64%) loss:3.0549 lr:0.41 dt:36ms tok/s:1833338 rem:214s step 10907 (64%) loss:3.0498 lr:0.41 dt:36ms tok/s:1832617 rem:214s step 10908 (64%) loss:3.0536 lr:0.41 dt:36ms tok/s:1835064 rem:214s step 10909 (64%) loss:3.0549 lr:0.41 dt:36ms tok/s:1839916 rem:213s step 10910 (64%) loss:3.0784 lr:0.41 dt:36ms tok/s:1840791 rem:213s step 10911 (64%) loss:3.0714 lr:0.41 dt:35ms tok/s:1848826 rem:213s step 10912 (64%) loss:3.0692 lr:0.41 dt:36ms tok/s:1835260 rem:213s step 10913 (64%) loss:3.0706 lr:0.41 dt:36ms tok/s:1836155 rem:213s step 10914 (64%) loss:3.0808 lr:0.41 dt:36ms tok/s:1832397 rem:213s step 10915 (64%) loss:3.0891 lr:0.41 dt:36ms tok/s:1829945 rem:213s step 10916 (64%) loss:3.0915 lr:0.41 dt:36ms tok/s:1830884 rem:213s step 10917 (64%) loss:3.0890 lr:0.41 dt:36ms tok/s:1832739 rem:213s step 10918 (64%) loss:3.0929 lr:0.41 dt:36ms tok/s:1839263 rem:213s step 10919 (64%) loss:3.0997 lr:0.41 dt:36ms tok/s:1843865 rem:213s step 10920 (64%) loss:3.1032 lr:0.41 dt:36ms tok/s:1825012 rem:213s step 10921 (64%) loss:3.1034 lr:0.41 dt:36ms tok/s:1828922 rem:213s step 10922 (64%) loss:3.0929 lr:0.41 dt:36ms tok/s:1834782 rem:213s step 10923 (64%) loss:3.0997 lr:0.41 dt:36ms tok/s:1838722 rem:213s step 10924 (65%) loss:3.1090 lr:0.41 dt:36ms tok/s:1840655 rem:213s step 10925 (65%) loss:3.1060 lr:0.41 dt:36ms tok/s:1835603 rem:213s step 10926 (65%) loss:3.1058 lr:0.41 dt:36ms tok/s:1842753 rem:213s step 10927 (65%) loss:3.0982 lr:0.41 dt:36ms tok/s:1830457 rem:213s step 10928 (65%) loss:3.0954 lr:0.41 dt:36ms tok/s:1839497 rem:213s step 10929 (65%) loss:3.1036 lr:0.41 dt:36ms tok/s:1837800 rem:213s step 10930 (65%) loss:3.1101 lr:0.41 dt:36ms tok/s:1842666 rem:213s step 10931 (65%) loss:3.0960 lr:0.41 dt:36ms tok/s:1841963 rem:213s step 10932 (65%) loss:3.0805 lr:0.41 dt:36ms tok/s:1844162 rem:213s step 10933 (65%) loss:3.0852 lr:0.41 dt:36ms tok/s:1835775 rem:213s step 10934 (65%) loss:3.0776 lr:0.41 dt:36ms tok/s:1832373 rem:213s step 10935 (65%) loss:3.0758 lr:0.41 dt:36ms tok/s:1842172 rem:213s step 10936 (65%) loss:3.0829 lr:0.41 dt:36ms tok/s:1835346 rem:213s step 10937 (65%) loss:3.0837 lr:0.41 dt:36ms tok/s:1841913 rem:212s step 10938 (65%) loss:3.0710 lr:0.41 dt:36ms tok/s:1839005 rem:212s step 10939 (65%) loss:3.0622 lr:0.41 dt:37ms tok/s:1790642 rem:212s step 10940 (65%) loss:3.0427 lr:0.41 dt:36ms tok/s:1838206 rem:212s step 10941 (65%) loss:3.0511 lr:0.41 dt:36ms tok/s:1840754 rem:212s step 10942 (65%) loss:3.1738 lr:0.41 dt:36ms tok/s:1838771 rem:212s step 10943 (65%) loss:3.2910 lr:0.41 dt:36ms tok/s:1845698 rem:212s step 10944 (65%) loss:3.3851 lr:0.41 dt:36ms tok/s:1834807 rem:212s step 10945 (65%) loss:3.5043 lr:0.41 dt:36ms tok/s:1796611 rem:212s step 10946 (65%) loss:3.4966 lr:0.41 dt:36ms tok/s:1822592 rem:212s step 10947 (65%) loss:3.4884 lr:0.41 dt:38ms tok/s:1733414 rem:212s step 10948 (65%) loss:3.4572 lr:0.41 dt:36ms tok/s:1836867 rem:212s step 10949 (65%) loss:3.4515 lr:0.41 dt:36ms tok/s:1810814 rem:212s step 10950 (65%) loss:3.4468 lr:0.41 dt:36ms tok/s:1813286 rem:212s step 10951 (65%) loss:3.3933 lr:0.41 dt:36ms tok/s:1816954 rem:212s step 10952 (65%) loss:3.3192 lr:0.41 dt:36ms tok/s:1819251 rem:212s step 10953 (65%) loss:3.3006 lr:0.41 dt:36ms tok/s:1818397 rem:212s step 10954 (65%) loss:3.2661 lr:0.41 dt:36ms tok/s:1821192 rem:212s step 10955 (65%) loss:3.2540 lr:0.41 dt:37ms tok/s:1791961 rem:212s step 10956 (65%) loss:3.2338 lr:0.41 dt:36ms tok/s:1813621 rem:212s step 10957 (65%) loss:3.2273 lr:0.41 dt:36ms tok/s:1816606 rem:212s step 10958 (65%) loss:3.2125 lr:0.41 dt:36ms tok/s:1812976 rem:212s step 10959 (65%) loss:3.2053 lr:0.41 dt:36ms tok/s:1802384 rem:212s step 10960 (65%) loss:3.1888 lr:0.41 dt:36ms tok/s:1812772 rem:212s step 10961 (65%) loss:3.1818 lr:0.41 dt:36ms tok/s:1815262 rem:212s step 10962 (65%) loss:3.1566 lr:0.41 dt:36ms tok/s:1818661 rem:212s step 10963 (65%) loss:3.1434 lr:0.41 dt:36ms tok/s:1798704 rem:212s step 10964 (65%) loss:3.1511 lr:0.41 dt:36ms tok/s:1813286 rem:212s step 10965 (65%) loss:3.1371 lr:0.41 dt:36ms tok/s:1808871 rem:211s step 10966 (65%) loss:3.1204 lr:0.41 dt:36ms tok/s:1801568 rem:211s step 10967 (65%) loss:3.1279 lr:0.41 dt:37ms tok/s:1758352 rem:211s step 10968 (65%) loss:3.1281 lr:0.41 dt:37ms tok/s:1787533 rem:211s step 10969 (65%) loss:3.1233 lr:0.41 dt:37ms tok/s:1778673 rem:211s step 10970 (65%) loss:3.0884 lr:0.41 dt:36ms tok/s:1809669 rem:211s step 10971 (65%) loss:3.0957 lr:0.41 dt:41ms tok/s:1585572 rem:211s step 10972 (65%) loss:3.0876 lr:0.41 dt:36ms tok/s:1838845 rem:211s step 10973 (65%) loss:3.0763 lr:0.41 dt:35ms tok/s:1863604 rem:211s step 10974 (65%) loss:3.0782 lr:0.41 dt:35ms tok/s:1863933 rem:211s step 10975 (65%) loss:3.0810 lr:0.41 dt:36ms tok/s:1833558 rem:211s step 10976 (65%) loss:3.0838 lr:0.41 dt:36ms tok/s:1839374 rem:211s step 10977 (65%) loss:3.0764 lr:0.41 dt:36ms tok/s:1837431 rem:211s step 10978 (65%) loss:3.0906 lr:0.41 dt:36ms tok/s:1836879 rem:211s step 10979 (65%) loss:3.0906 lr:0.41 dt:36ms tok/s:1833962 rem:211s step 10980 (65%) loss:3.0755 lr:0.41 dt:36ms tok/s:1829811 rem:211s step 10981 (65%) loss:3.0824 lr:0.41 dt:36ms tok/s:1836118 rem:211s step 10982 (65%) loss:3.0891 lr:0.41 dt:36ms tok/s:1826201 rem:211s step 10983 (65%) loss:3.0810 lr:0.41 dt:36ms tok/s:1837763 rem:211s step 10984 (65%) loss:3.0882 lr:0.41 dt:36ms tok/s:1806185 rem:211s step 10985 (65%) loss:3.0638 lr:0.40 dt:36ms tok/s:1840458 rem:211s step 10986 (65%) loss:3.0617 lr:0.40 dt:36ms tok/s:1832617 rem:211s step 10987 (65%) loss:3.0658 lr:0.40 dt:36ms tok/s:1845314 rem:211s step 10988 (65%) loss:3.0841 lr:0.40 dt:36ms tok/s:1840470 rem:211s step 10989 (65%) loss:3.0815 lr:0.40 dt:36ms tok/s:1840556 rem:211s step 10990 (65%) loss:3.0887 lr:0.40 dt:36ms tok/s:1839670 rem:211s step 10991 (65%) loss:3.0904 lr:0.40 dt:36ms tok/s:1835003 rem:211s step 10992 (65%) loss:3.0827 lr:0.40 dt:36ms tok/s:1829957 rem:210s step 10993 (65%) loss:3.0951 lr:0.40 dt:36ms tok/s:1842839 rem:210s step 10994 (65%) loss:3.0948 lr:0.40 dt:36ms tok/s:1840224 rem:210s step 10995 (65%) loss:3.0913 lr:0.40 dt:36ms tok/s:1838451 rem:210s step 10996 (65%) loss:3.1044 lr:0.40 dt:36ms tok/s:1803341 rem:210s step 10997 (65%) loss:3.1135 lr:0.40 dt:36ms tok/s:1837849 rem:210s step 10998 (65%) loss:3.1041 lr:0.40 dt:36ms tok/s:1842407 rem:210s step 10999 (65%) loss:3.0852 lr:0.40 dt:36ms tok/s:1843655 rem:210s step 11000 (65%) loss:3.0825 lr:0.40 dt:36ms tok/s:1841432 rem:210s + local: attn=[0.101, 0.906, 0.946] mlp=[0.776, 0.289, -0.308] + + transition: attn=[3.252, 1.087] mlp=[-0.285, 0.732] + + hierarchy: attn=[3.398, 5.939, 5.616] mlp=[1.594, -1.242, -3.623] + step 11001 (65%) loss:3.0902 lr:0.40 dt:36ms tok/s:1841444 rem:210s step 11002 (65%) loss:3.1032 lr:0.40 dt:36ms tok/s:1839263 rem:210s step 11003 (65%) loss:3.1029 lr:0.40 dt:36ms tok/s:1839362 rem:210s step 11004 (65%) loss:3.1151 lr:0.40 dt:36ms tok/s:1836756 rem:210s step 11005 (65%) loss:3.1147 lr:0.40 dt:36ms tok/s:1804999 rem:210s step 11006 (65%) loss:3.1135 lr:0.40 dt:35ms tok/s:1846528 rem:210s step 11007 (65%) loss:3.0858 lr:0.40 dt:36ms tok/s:1838402 rem:210s step 11008 (65%) loss:3.0861 lr:0.40 dt:36ms tok/s:1841629 rem:210s step 11009 (65%) loss:3.0764 lr:0.40 dt:36ms tok/s:1836854 rem:210s step 11010 (65%) loss:3.0859 lr:0.40 dt:36ms tok/s:1844967 rem:210s step 11011 (65%) loss:3.1217 lr:0.40 dt:36ms tok/s:1841259 rem:210s step 11012 (65%) loss:3.1049 lr:0.40 dt:36ms tok/s:1835689 rem:210s step 11013 (65%) loss:3.0981 lr:0.40 dt:36ms tok/s:1841987 rem:210s step 11014 (65%) loss:3.1062 lr:0.40 dt:36ms tok/s:1830859 rem:210s step 11015 (65%) loss:3.0860 lr:0.40 dt:36ms tok/s:1835187 rem:210s step 11016 (65%) loss:3.0810 lr:0.40 dt:35ms tok/s:1848068 rem:210s step 11017 (65%) loss:3.0730 lr:0.40 dt:36ms tok/s:1842444 rem:210s step 11018 (65%) loss:3.0641 lr:0.40 dt:36ms tok/s:1839965 rem:210s step 11019 (65%) loss:3.0569 lr:0.40 dt:36ms tok/s:1830091 rem:210s step 11020 (65%) loss:3.0543 lr:0.40 dt:36ms tok/s:1829982 rem:209s step 11021 (65%) loss:3.0674 lr:0.40 dt:36ms tok/s:1837739 rem:209s step 11022 (65%) loss:3.0648 lr:0.40 dt:36ms tok/s:1838316 rem:209s step 11023 (65%) loss:3.0688 lr:0.40 dt:36ms tok/s:1838156 rem:209s step 11024 (65%) loss:3.0622 lr:0.40 dt:36ms tok/s:1836732 rem:209s step 11025 (65%) loss:3.0462 lr:0.40 dt:35ms tok/s:1853326 rem:209s step 11026 (65%) loss:3.0617 lr:0.40 dt:35ms tok/s:1857120 rem:209s step 11027 (65%) loss:3.0547 lr:0.40 dt:35ms tok/s:1861245 rem:209s step 11028 (65%) loss:3.0599 lr:0.40 dt:35ms tok/s:1867250 rem:209s step 11029 (65%) loss:3.0628 lr:0.40 dt:35ms tok/s:1853251 rem:209s step 11030 (65%) loss:3.0748 lr:0.40 dt:35ms tok/s:1857471 rem:209s step 11031 (65%) loss:3.0562 lr:0.40 dt:35ms tok/s:1852839 rem:209s step 11032 (65%) loss:3.0696 lr:0.40 dt:35ms tok/s:1861333 rem:209s step 11033 (65%) loss:3.0654 lr:0.40 dt:35ms tok/s:1860237 rem:209s step 11034 (65%) loss:3.0726 lr:0.40 dt:35ms tok/s:1853476 rem:209s step 11035 (65%) loss:3.0660 lr:0.40 dt:35ms tok/s:1850905 rem:209s step 11036 (65%) loss:3.0741 lr:0.40 dt:36ms tok/s:1844001 rem:209s step 11037 (65%) loss:3.0791 lr:0.40 dt:35ms tok/s:1853526 rem:209s step 11038 (65%) loss:3.0745 lr:0.40 dt:35ms tok/s:1858200 rem:209s step 11039 (65%) loss:3.0620 lr:0.40 dt:35ms tok/s:1858175 rem:209s step 11040 (65%) loss:3.0642 lr:0.40 dt:35ms tok/s:1862228 rem:209s step 11041 (65%) loss:3.0865 lr:0.40 dt:35ms tok/s:1863920 rem:209s step 11042 (65%) loss:3.0791 lr:0.40 dt:35ms tok/s:1856756 rem:209s step 11043 (65%) loss:3.0736 lr:0.40 dt:35ms tok/s:1858175 rem:209s step 11044 (65%) loss:3.0745 lr:0.40 dt:35ms tok/s:1856104 rem:209s step 11045 (65%) loss:3.0599 lr:0.40 dt:35ms tok/s:1855866 rem:209s step 11046 (65%) loss:3.0601 lr:0.40 dt:35ms tok/s:1854114 rem:209s step 11047 (65%) loss:3.0722 lr:0.40 dt:35ms tok/s:1856393 rem:209s step 11048 (65%) loss:3.0741 lr:0.40 dt:36ms tok/s:1844719 rem:209s step 11049 (65%) loss:3.0587 lr:0.40 dt:35ms tok/s:1856217 rem:208s step 11050 (65%) loss:3.0668 lr:0.40 dt:35ms tok/s:1852514 rem:208s step 11051 (65%) loss:3.0676 lr:0.40 dt:35ms tok/s:1856957 rem:208s step 11052 (65%) loss:3.0570 lr:0.40 dt:35ms tok/s:1862720 rem:208s step 11053 (65%) loss:3.0503 lr:0.40 dt:35ms tok/s:1864274 rem:208s step 11054 (65%) loss:3.0667 lr:0.40 dt:35ms tok/s:1860942 rem:208s step 11055 (65%) loss:3.0701 lr:0.40 dt:35ms tok/s:1851666 rem:208s step 11056 (65%) loss:3.0717 lr:0.40 dt:36ms tok/s:1843507 rem:208s step 11057 (65%) loss:3.0657 lr:0.40 dt:35ms tok/s:1859910 rem:208s step 11058 (65%) loss:3.0552 lr:0.40 dt:35ms tok/s:1850432 rem:208s step 11059 (65%) loss:3.0681 lr:0.40 dt:36ms tok/s:1843383 rem:208s step 11060 (65%) loss:3.0725 lr:0.40 dt:35ms tok/s:1852976 rem:208s step 11061 (65%) loss:3.0754 lr:0.40 dt:35ms tok/s:1862430 rem:208s step 11062 (65%) loss:3.0590 lr:0.40 dt:35ms tok/s:1858237 rem:208s step 11063 (65%) loss:3.0533 lr:0.40 dt:35ms tok/s:1859193 rem:208s step 11064 (65%) loss:3.0389 lr:0.40 dt:35ms tok/s:1852639 rem:208s step 11065 (65%) loss:3.0195 lr:0.40 dt:35ms tok/s:1857346 rem:208s step 11066 (65%) loss:3.0132 lr:0.40 dt:35ms tok/s:1854426 rem:208s step 11067 (65%) loss:3.0318 lr:0.40 dt:35ms tok/s:1854351 rem:208s step 11068 (65%) loss:3.0285 lr:0.40 dt:35ms tok/s:1868570 rem:208s step 11069 (65%) loss:3.0171 lr:0.40 dt:35ms tok/s:1863983 rem:208s step 11070 (65%) loss:3.0414 lr:0.40 dt:35ms tok/s:1858237 rem:208s step 11071 (65%) loss:3.0589 lr:0.40 dt:35ms tok/s:1849112 rem:208s step 11072 (65%) loss:3.0596 lr:0.40 dt:35ms tok/s:1856794 rem:208s step 11073 (65%) loss:3.0433 lr:0.39 dt:35ms tok/s:1864717 rem:208s step 11074 (65%) loss:3.0453 lr:0.39 dt:35ms tok/s:1861383 rem:208s step 11075 (65%) loss:3.0271 lr:0.39 dt:35ms tok/s:1851316 rem:208s step 11076 (65%) loss:3.0525 lr:0.39 dt:36ms tok/s:1840138 rem:208s step 11077 (65%) loss:3.0589 lr:0.39 dt:36ms tok/s:1841346 rem:207s step 11078 (65%) loss:3.0641 lr:0.39 dt:36ms tok/s:1841173 rem:207s step 11079 (65%) loss:3.0587 lr:0.39 dt:37ms tok/s:1754984 rem:207s step 11080 (65%) loss:3.0593 lr:0.39 dt:36ms tok/s:1842234 rem:207s step 11081 (65%) loss:3.0607 lr:0.39 dt:36ms tok/s:1840298 rem:207s step 11082 (65%) loss:3.0667 lr:0.39 dt:36ms tok/s:1842234 rem:207s step 11083 (65%) loss:3.0640 lr:0.39 dt:36ms tok/s:1824988 rem:207s step 11084 (65%) loss:3.0575 lr:0.39 dt:36ms tok/s:1845921 rem:207s step 11085 (65%) loss:3.0411 lr:0.39 dt:36ms tok/s:1834195 rem:207s step 11086 (65%) loss:3.0107 lr:0.39 dt:35ms tok/s:1847434 rem:207s step 11087 (65%) loss:2.9808 lr:0.39 dt:36ms tok/s:1839940 rem:207s step 11088 (65%) loss:2.9964 lr:0.39 dt:36ms tok/s:1845029 rem:207s step 11089 (65%) loss:3.0074 lr:0.39 dt:36ms tok/s:1839497 rem:207s step 11090 (65%) loss:3.0384 lr:0.39 dt:36ms tok/s:1836474 rem:207s step 11091 (65%) loss:3.0269 lr:0.39 dt:36ms tok/s:1841395 rem:207s step 11092 (66%) loss:3.0355 lr:0.39 dt:36ms tok/s:1800011 rem:207s step 11093 (66%) loss:3.0377 lr:0.39 dt:35ms tok/s:1846404 rem:207s step 11094 (66%) loss:3.0270 lr:0.39 dt:36ms tok/s:1841814 rem:207s step 11095 (66%) loss:3.0279 lr:0.39 dt:36ms tok/s:1836118 rem:207s step 11096 (66%) loss:3.0090 lr:0.39 dt:36ms tok/s:1836621 rem:207s step 11097 (66%) loss:3.0158 lr:0.39 dt:36ms tok/s:1842568 rem:207s step 11098 (66%) loss:3.0525 lr:0.39 dt:36ms tok/s:1842629 rem:207s step 11099 (66%) loss:3.0646 lr:0.39 dt:36ms tok/s:1834746 rem:207s step 11100 (66%) loss:3.0576 lr:0.39 dt:35ms tok/s:1847447 rem:207s + local: attn=[0.104, 0.902, 0.939] mlp=[0.819, 0.314, -0.287] + + transition: attn=[3.242, 1.050] mlp=[-0.286, 0.740] + + hierarchy: attn=[3.349, 5.939, 5.616] mlp=[1.643, -1.251, -3.688] + step 11101 (66%) loss:3.0971 lr:0.39 dt:36ms tok/s:1830908 rem:207s step 11102 (66%) loss:3.1992 lr:0.39 dt:37ms tok/s:1770881 rem:207s step 11103 (66%) loss:3.2532 lr:0.39 dt:36ms tok/s:1840187 rem:207s step 11104 (66%) loss:3.2196 lr:0.39 dt:36ms tok/s:1836302 rem:207s step 11105 (66%) loss:3.1568 lr:0.39 dt:36ms tok/s:1839116 rem:206s step 11106 (66%) loss:3.1647 lr:0.39 dt:36ms tok/s:1838488 rem:206s step 11107 (66%) loss:3.1653 lr:0.39 dt:36ms tok/s:1826334 rem:206s step 11108 (66%) loss:3.1723 lr:0.39 dt:36ms tok/s:1837911 rem:206s step 11109 (66%) loss:3.1631 lr:0.39 dt:36ms tok/s:1844595 rem:206s step 11110 (66%) loss:3.1515 lr:0.39 dt:35ms tok/s:1847509 rem:206s step 11111 (66%) loss:3.1486 lr:0.39 dt:36ms tok/s:1842000 rem:206s step 11112 (66%) loss:3.1167 lr:0.39 dt:36ms tok/s:1843247 rem:206s step 11113 (66%) loss:3.1135 lr:0.39 dt:36ms tok/s:1832996 rem:206s step 11114 (66%) loss:3.1118 lr:0.39 dt:36ms tok/s:1836376 rem:206s step 11115 (66%) loss:3.1213 lr:0.39 dt:36ms tok/s:1837984 rem:206s step 11116 (66%) loss:3.1103 lr:0.39 dt:36ms tok/s:1841531 rem:206s step 11117 (66%) loss:3.1131 lr:0.39 dt:36ms tok/s:1837431 rem:206s step 11118 (66%) loss:3.0623 lr:0.39 dt:36ms tok/s:1831750 rem:206s step 11119 (66%) loss:3.0213 lr:0.39 dt:36ms tok/s:1836204 rem:206s step 11120 (66%) loss:3.0268 lr:0.39 dt:36ms tok/s:1835714 rem:206s step 11121 (66%) loss:3.0485 lr:0.39 dt:38ms tok/s:1742170 rem:206s step 11122 (66%) loss:3.0703 lr:0.39 dt:36ms tok/s:1844311 rem:206s step 11123 (66%) loss:3.0756 lr:0.39 dt:36ms tok/s:1836621 rem:206s step 11124 (66%) loss:3.0716 lr:0.39 dt:36ms tok/s:1835775 rem:206s step 11125 (66%) loss:3.0922 lr:0.39 dt:36ms tok/s:1843148 rem:206s step 11126 (66%) loss:3.0898 lr:0.39 dt:36ms tok/s:1845388 rem:206s step 11127 (66%) loss:3.0949 lr:0.39 dt:36ms tok/s:1836216 rem:206s step 11128 (66%) loss:3.0932 lr:0.39 dt:35ms tok/s:1850220 rem:206s step 11129 (66%) loss:3.1090 lr:0.39 dt:36ms tok/s:1838697 rem:206s step 11130 (66%) loss:3.1143 lr:0.39 dt:36ms tok/s:1832287 rem:206s step 11131 (66%) loss:3.1177 lr:0.39 dt:36ms tok/s:1839817 rem:206s step 11132 (66%) loss:3.1144 lr:0.39 dt:36ms tok/s:1844769 rem:206s step 11133 (66%) loss:3.0976 lr:0.39 dt:36ms tok/s:1839583 rem:205s step 11134 (66%) loss:3.1013 lr:0.39 dt:36ms tok/s:1840976 rem:205s step 11135 (66%) loss:3.0882 lr:0.39 dt:36ms tok/s:1838304 rem:205s step 11136 (66%) loss:3.0577 lr:0.39 dt:36ms tok/s:1841592 rem:205s step 11137 (66%) loss:3.0372 lr:0.39 dt:36ms tok/s:1833999 rem:205s step 11138 (66%) loss:3.0452 lr:0.39 dt:36ms tok/s:1812426 rem:205s step 11139 (66%) loss:3.0651 lr:0.39 dt:36ms tok/s:1818240 rem:205s step 11140 (66%) loss:3.0737 lr:0.39 dt:36ms tok/s:1845698 rem:205s step 11141 (66%) loss:3.0755 lr:0.39 dt:36ms tok/s:1841185 rem:205s step 11142 (66%) loss:3.0754 lr:0.39 dt:36ms tok/s:1843408 rem:205s step 11143 (66%) loss:3.0725 lr:0.39 dt:36ms tok/s:1824807 rem:205s step 11144 (66%) loss:3.0707 lr:0.39 dt:36ms tok/s:1843667 rem:205s step 11145 (66%) loss:3.0707 lr:0.39 dt:36ms tok/s:1845289 rem:205s step 11146 (66%) loss:3.0820 lr:0.39 dt:36ms tok/s:1800212 rem:205s step 11147 (66%) loss:3.0753 lr:0.39 dt:62ms tok/s:1063884 rem:205s step 11148 (66%) loss:3.0470 lr:0.39 dt:33ms tok/s:2006760 rem:205s step 11149 (66%) loss:3.0527 lr:0.39 dt:32ms tok/s:2043854 rem:205s step 11150 (66%) loss:3.0689 lr:0.39 dt:32ms tok/s:2069630 rem:205s step 11151 (66%) loss:3.0683 lr:0.39 dt:32ms tok/s:2071174 rem:205s step 11152 (66%) loss:3.0644 lr:0.39 dt:32ms tok/s:2053964 rem:205s step 11153 (66%) loss:3.0705 lr:0.39 dt:38ms tok/s:1714397 rem:205s step 11154 (66%) loss:3.0432 lr:0.39 dt:34ms tok/s:1919927 rem:205s step 11155 (66%) loss:3.0383 lr:0.39 dt:32ms tok/s:2043292 rem:205s step 11156 (66%) loss:3.0502 lr:0.39 dt:32ms tok/s:2030567 rem:205s step 11157 (66%) loss:3.0491 lr:0.39 dt:33ms tok/s:1999679 rem:205s step 11158 (66%) loss:3.0503 lr:0.39 dt:32ms tok/s:2042533 rem:205s step 11159 (66%) loss:3.0447 lr:0.39 dt:33ms tok/s:2000844 rem:205s step 11160 (66%) loss:3.0263 lr:0.39 dt:32ms tok/s:2023303 rem:205s step 11161 (66%) loss:3.0112 lr:0.38 dt:33ms tok/s:2008358 rem:204s step 11162 (66%) loss:3.0087 lr:0.38 dt:33ms tok/s:1996658 rem:204s step 11163 (66%) loss:2.9814 lr:0.38 dt:33ms tok/s:2011665 rem:204s step 11164 (66%) loss:2.9643 lr:0.38 dt:33ms tok/s:1987606 rem:204s step 11165 (66%) loss:2.9846 lr:0.38 dt:33ms tok/s:1982574 rem:204s step 11166 (66%) loss:3.0046 lr:0.38 dt:33ms tok/s:1969336 rem:204s step 11167 (66%) loss:3.0194 lr:0.38 dt:33ms tok/s:1977254 rem:204s step 11168 (66%) loss:3.0146 lr:0.38 dt:33ms tok/s:1993790 rem:204s step 11169 (66%) loss:3.0086 lr:0.38 dt:33ms tok/s:2000043 rem:204s step 11170 (66%) loss:3.0055 lr:0.38 dt:33ms tok/s:1987218 rem:204s step 11171 (66%) loss:2.9876 lr:0.38 dt:33ms tok/s:1964677 rem:204s step 11172 (66%) loss:2.9205 lr:0.38 dt:34ms tok/s:1949005 rem:204s step 11173 (66%) loss:2.8639 lr:0.38 dt:33ms tok/s:1959998 rem:204s step 11174 (66%) loss:2.8492 lr:0.38 dt:34ms tok/s:1953937 rem:204s step 11175 (66%) loss:2.8704 lr:0.38 dt:33ms tok/s:1988555 rem:204s step 11176 (66%) loss:2.8787 lr:0.38 dt:33ms tok/s:1994499 rem:204s step 11177 (66%) loss:2.9055 lr:0.38 dt:33ms tok/s:1976899 rem:204s step 11178 (66%) loss:2.9187 lr:0.38 dt:33ms tok/s:1963877 rem:204s step 11179 (66%) loss:2.9311 lr:0.38 dt:34ms tok/s:1944950 rem:204s step 11180 (66%) loss:2.9437 lr:0.38 dt:34ms tok/s:1955160 rem:204s step 11181 (66%) loss:2.9502 lr:0.38 dt:33ms tok/s:1964185 rem:204s step 11182 (66%) loss:2.9631 lr:0.38 dt:33ms tok/s:1957694 rem:204s step 11183 (66%) loss:2.9504 lr:0.38 dt:34ms tok/s:1951232 rem:204s step 11184 (66%) loss:2.9591 lr:0.38 dt:33ms tok/s:1963596 rem:204s step 11185 (66%) loss:2.9607 lr:0.38 dt:33ms tok/s:1971370 rem:204s step 11186 (66%) loss:2.9779 lr:0.38 dt:33ms tok/s:1956816 rem:204s step 11187 (66%) loss:2.9897 lr:0.38 dt:34ms tok/s:1945556 rem:204s step 11188 (66%) loss:2.9971 lr:0.38 dt:33ms tok/s:1969703 rem:204s step 11189 (66%) loss:2.9804 lr:0.38 dt:33ms tok/s:1968574 rem:204s step 11190 (66%) loss:2.9764 lr:0.38 dt:34ms tok/s:1951440 rem:204s step 11191 (66%) loss:2.9750 lr:0.38 dt:33ms tok/s:1985366 rem:204s step 11192 (66%) loss:2.9853 lr:0.38 dt:33ms tok/s:1980888 rem:203s step 11193 (66%) loss:2.9896 lr:0.38 dt:33ms tok/s:1978678 rem:203s step 11194 (66%) loss:2.9637 lr:0.38 dt:33ms tok/s:1978706 rem:203s step 11195 (66%) loss:2.9686 lr:0.38 dt:33ms tok/s:1986973 rem:203s step 11196 (66%) loss:2.9749 lr:0.38 dt:33ms tok/s:1983046 rem:203s step 11197 (66%) loss:2.9720 lr:0.38 dt:33ms tok/s:1979932 rem:203s step 11198 (66%) loss:2.9712 lr:0.38 dt:33ms tok/s:1987764 rem:203s step 11199 (66%) loss:2.9853 lr:0.38 dt:33ms tok/s:1989721 rem:203s step 11200 (66%) loss:3.0015 lr:0.38 dt:33ms tok/s:1995585 rem:203s + local: attn=[0.110, 0.895, 0.968] mlp=[0.827, 0.305, -0.295] + + transition: attn=[3.212, 1.107] mlp=[-0.277, 0.766] + + hierarchy: attn=[3.400, 5.939, 5.616] mlp=[1.663, -1.295, -3.672] + step 11201 (66%) loss:3.0127 lr:0.38 dt:33ms tok/s:1993053 rem:203s step 11202 (66%) loss:3.0183 lr:0.38 dt:33ms tok/s:1997340 rem:203s step 11203 (66%) loss:3.0228 lr:0.38 dt:33ms tok/s:1968630 rem:203s step 11204 (66%) loss:3.0167 lr:0.38 dt:32ms tok/s:2043975 rem:203s step 11205 (66%) loss:3.0123 lr:0.38 dt:33ms tok/s:1989793 rem:203s step 11206 (66%) loss:3.0155 lr:0.38 dt:33ms tok/s:2000480 rem:203s step 11207 (66%) loss:3.0204 lr:0.38 dt:33ms tok/s:1992851 rem:203s step 11208 (66%) loss:3.0089 lr:0.38 dt:33ms tok/s:1997006 rem:203s step 11209 (66%) loss:3.0091 lr:0.38 dt:33ms tok/s:1996064 rem:203s step 11210 (66%) loss:3.0259 lr:0.38 dt:33ms tok/s:1995890 rem:203s step 11211 (66%) loss:3.0185 lr:0.38 dt:33ms tok/s:1967405 rem:203s step 11212 (66%) loss:3.0158 lr:0.38 dt:33ms tok/s:1971808 rem:203s step 11213 (66%) loss:3.0153 lr:0.38 dt:33ms tok/s:1962390 rem:203s step 11214 (66%) loss:3.0032 lr:0.38 dt:33ms tok/s:1966265 rem:203s step 11215 (66%) loss:2.9984 lr:0.38 dt:34ms tok/s:1955424 rem:203s step 11216 (66%) loss:3.0029 lr:0.38 dt:33ms tok/s:1975663 rem:203s step 11217 (66%) loss:3.0055 lr:0.38 dt:33ms tok/s:1964438 rem:203s step 11218 (66%) loss:3.0060 lr:0.38 dt:33ms tok/s:1966194 rem:203s step 11219 (66%) loss:3.0072 lr:0.38 dt:34ms tok/s:1944854 rem:203s step 11220 (66%) loss:2.9960 lr:0.38 dt:34ms tok/s:1925320 rem:203s step 11221 (66%) loss:3.0008 lr:0.38 dt:34ms tok/s:1954771 rem:203s step 11222 (66%) loss:3.0186 lr:0.38 dt:34ms tok/s:1951523 rem:202s step 11223 (66%) loss:3.0019 lr:0.38 dt:33ms tok/s:1956343 rem:202s step 11224 (66%) loss:3.0152 lr:0.38 dt:34ms tok/s:1955243 rem:202s step 11225 (66%) loss:3.0098 lr:0.38 dt:34ms tok/s:1948894 rem:202s step 11226 (66%) loss:3.0139 lr:0.38 dt:33ms tok/s:1963442 rem:202s step 11227 (66%) loss:3.0129 lr:0.38 dt:33ms tok/s:1959788 rem:202s step 11228 (66%) loss:3.0062 lr:0.38 dt:33ms tok/s:1966321 rem:202s step 11229 (66%) loss:3.0141 lr:0.38 dt:33ms tok/s:1968405 rem:202s step 11230 (66%) loss:3.0085 lr:0.38 dt:33ms tok/s:1960291 rem:202s step 11231 (66%) loss:3.0108 lr:0.38 dt:33ms tok/s:1964606 rem:202s step 11232 (66%) loss:3.0113 lr:0.38 dt:35ms tok/s:1872746 rem:202s step 11233 (66%) loss:3.0291 lr:0.38 dt:33ms tok/s:1959732 rem:202s step 11234 (66%) loss:3.0560 lr:0.38 dt:33ms tok/s:1958601 rem:202s step 11235 (66%) loss:3.0518 lr:0.38 dt:33ms tok/s:1966602 rem:202s step 11236 (66%) loss:3.0763 lr:0.38 dt:33ms tok/s:1965674 rem:202s step 11237 (66%) loss:3.0821 lr:0.38 dt:33ms tok/s:1964831 rem:202s step 11238 (66%) loss:3.0686 lr:0.38 dt:33ms tok/s:1960543 rem:202s step 11239 (66%) loss:3.0792 lr:0.38 dt:33ms tok/s:1962755 rem:202s step 11240 (66%) loss:3.0855 lr:0.38 dt:33ms tok/s:1961900 rem:202s step 11241 (66%) loss:3.0732 lr:0.38 dt:34ms tok/s:1955494 rem:202s step 11242 (66%) loss:3.0720 lr:0.38 dt:34ms tok/s:1954812 rem:202s step 11243 (66%) loss:3.0932 lr:0.38 dt:34ms tok/s:1948176 rem:202s step 11244 (66%) loss:3.0734 lr:0.38 dt:33ms tok/s:1960906 rem:202s step 11245 (66%) loss:3.0814 lr:0.38 dt:33ms tok/s:1961382 rem:202s step 11246 (66%) loss:3.0768 lr:0.38 dt:33ms tok/s:1958029 rem:202s step 11247 (66%) loss:3.0828 lr:0.38 dt:38ms tok/s:1745367 rem:202s step 11248 (66%) loss:3.0954 lr:0.38 dt:34ms tok/s:1937698 rem:202s step 11249 (66%) loss:3.0960 lr:0.38 dt:34ms tok/s:1940036 rem:202s step 11250 (66%) loss:3.0802 lr:0.38 dt:34ms tok/s:1941009 rem:202s step 11251 (66%) loss:3.0836 lr:0.38 dt:34ms tok/s:1931855 rem:201s step 11252 (66%) loss:3.0731 lr:0.38 dt:34ms tok/s:1937179 rem:201s step 11253 (66%) loss:3.0659 lr:0.38 dt:34ms tok/s:1947914 rem:201s step 11254 (66%) loss:3.0386 lr:0.38 dt:34ms tok/s:1944882 rem:201s step 11255 (66%) loss:3.0442 lr:0.38 dt:34ms tok/s:1946658 rem:201s step 11256 (66%) loss:3.0472 lr:0.37 dt:34ms tok/s:1936046 rem:201s step 11257 (66%) loss:3.0521 lr:0.37 dt:34ms tok/s:1949406 rem:201s step 11258 (66%) loss:3.0382 lr:0.37 dt:34ms tok/s:1929266 rem:201s step 11259 (66%) loss:3.0270 lr:0.37 dt:34ms tok/s:1941681 rem:201s step 11260 (66%) loss:3.0152 lr:0.37 dt:34ms tok/s:1943232 rem:201s step 11261 (66%) loss:3.0086 lr:0.37 dt:34ms tok/s:1946025 rem:201s step 11262 (66%) loss:3.0042 lr:0.37 dt:34ms tok/s:1933921 rem:201s step 11263 (66%) loss:3.0149 lr:0.37 dt:34ms tok/s:1928751 rem:201s step 11264 (66%) loss:3.0262 lr:0.37 dt:34ms tok/s:1935310 rem:201s step 11265 (66%) loss:3.0376 lr:0.37 dt:34ms tok/s:1931407 rem:201s step 11266 (66%) loss:3.0303 lr:0.37 dt:34ms tok/s:1907351 rem:201s step 11267 (67%) loss:3.0229 lr:0.37 dt:34ms tok/s:1917436 rem:201s step 11268 (67%) loss:3.0254 lr:0.37 dt:34ms tok/s:1917516 rem:201s step 11269 (67%) loss:3.0399 lr:0.37 dt:34ms tok/s:1926899 rem:201s step 11270 (67%) loss:3.0462 lr:0.37 dt:35ms tok/s:1898590 rem:201s step 11271 (67%) loss:3.0448 lr:0.37 dt:34ms tok/s:1903231 rem:201s step 11272 (67%) loss:3.0481 lr:0.37 dt:34ms tok/s:1905487 rem:201s step 11273 (67%) loss:3.0447 lr:0.37 dt:34ms tok/s:1901874 rem:201s step 11274 (67%) loss:3.0507 lr:0.37 dt:35ms tok/s:1887444 rem:201s step 11275 (67%) loss:3.0535 lr:0.37 dt:35ms tok/s:1890378 rem:201s step 11276 (67%) loss:3.0450 lr:0.37 dt:35ms tok/s:1896324 rem:201s step 11277 (67%) loss:3.0514 lr:0.37 dt:35ms tok/s:1890950 rem:201s step 11278 (67%) loss:3.0579 lr:0.37 dt:35ms tok/s:1894168 rem:201s step 11279 (67%) loss:3.0289 lr:0.37 dt:35ms tok/s:1871764 rem:201s step 11280 (67%) loss:2.9976 lr:0.37 dt:35ms tok/s:1870389 rem:201s step 11281 (67%) loss:3.0249 lr:0.37 dt:35ms tok/s:1869752 rem:200s step 11282 (67%) loss:3.0377 lr:0.37 dt:37ms tok/s:1780170 rem:200s step 11283 (67%) loss:3.0376 lr:0.37 dt:35ms tok/s:1869562 rem:200s step 11284 (67%) loss:3.0260 lr:0.37 dt:35ms tok/s:1870618 rem:200s step 11285 (67%) loss:3.0273 lr:0.37 dt:35ms tok/s:1880446 rem:200s step 11286 (67%) loss:3.0159 lr:0.37 dt:34ms tok/s:1919954 rem:200s step 11287 (67%) loss:3.0169 lr:0.37 dt:34ms tok/s:1936646 rem:200s step 11288 (67%) loss:3.0253 lr:0.37 dt:34ms tok/s:1931978 rem:200s step 11289 (67%) loss:3.0345 lr:0.37 dt:37ms tok/s:1788138 rem:200s step 11290 (67%) loss:3.0272 lr:0.37 dt:36ms tok/s:1840963 rem:200s step 11291 (67%) loss:3.0365 lr:0.37 dt:34ms tok/s:1907047 rem:200s step 11292 (67%) loss:3.0394 lr:0.37 dt:34ms tok/s:1908159 rem:200s step 11293 (67%) loss:3.0393 lr:0.37 dt:36ms tok/s:1841506 rem:200s step 11294 (67%) loss:3.0345 lr:0.37 dt:35ms tok/s:1870554 rem:200s step 11295 (67%) loss:3.0446 lr:0.37 dt:35ms tok/s:1895749 rem:200s step 11296 (67%) loss:3.0629 lr:0.37 dt:35ms tok/s:1896573 rem:200s step 11297 (67%) loss:3.0701 lr:0.37 dt:34ms tok/s:1907126 rem:200s step 11298 (67%) loss:3.0493 lr:0.37 dt:35ms tok/s:1899338 rem:200s step 11299 (67%) loss:3.0435 lr:0.37 dt:35ms tok/s:1898590 rem:200s step 11300 (67%) loss:3.0547 lr:0.37 dt:35ms tok/s:1885657 rem:200s + local: attn=[0.106, 0.882, 0.966] mlp=[0.844, 0.310, -0.306] + + transition: attn=[3.277, 1.078] mlp=[-0.298, 0.794] + + hierarchy: attn=[3.384, 5.939, 5.616] mlp=[1.719, -1.374, -3.742] + step 11301 (67%) loss:3.0491 lr:0.37 dt:35ms tok/s:1883887 rem:200s step 11302 (67%) loss:3.0544 lr:0.37 dt:35ms tok/s:1890872 rem:200s step 11303 (67%) loss:3.0507 lr:0.37 dt:35ms tok/s:1897227 rem:200s step 11304 (67%) loss:3.0584 lr:0.37 dt:35ms tok/s:1886887 rem:200s step 11305 (67%) loss:3.0491 lr:0.37 dt:35ms tok/s:1894090 rem:200s step 11306 (67%) loss:3.0348 lr:0.37 dt:35ms tok/s:1881501 rem:200s step 11307 (67%) loss:3.0323 lr:0.37 dt:35ms tok/s:1866882 rem:200s step 11308 (67%) loss:3.0393 lr:0.37 dt:35ms tok/s:1870159 rem:200s step 11309 (67%) loss:3.0532 lr:0.37 dt:35ms tok/s:1877197 rem:199s step 11310 (67%) loss:3.0543 lr:0.37 dt:35ms tok/s:1858263 rem:199s step 11311 (67%) loss:3.0728 lr:0.37 dt:35ms tok/s:1861661 rem:199s step 11312 (67%) loss:3.0589 lr:0.37 dt:35ms tok/s:1875416 rem:199s step 11313 (67%) loss:3.0571 lr:0.37 dt:35ms tok/s:1857961 rem:199s step 11314 (67%) loss:3.0481 lr:0.37 dt:35ms tok/s:1860552 rem:199s step 11315 (67%) loss:3.0448 lr:0.37 dt:35ms tok/s:1846516 rem:199s step 11316 (67%) loss:3.0598 lr:0.37 dt:36ms tok/s:1801616 rem:199s step 11317 (67%) loss:3.0528 lr:0.37 dt:35ms tok/s:1846925 rem:199s step 11318 (67%) loss:3.0483 lr:0.37 dt:36ms tok/s:1835162 rem:199s step 11319 (67%) loss:3.0359 lr:0.37 dt:36ms tok/s:1832141 rem:199s step 11320 (67%) loss:3.0223 lr:0.37 dt:36ms tok/s:1837284 rem:199s step 11321 (67%) loss:3.0413 lr:0.37 dt:36ms tok/s:1841703 rem:199s step 11322 (67%) loss:3.0378 lr:0.37 dt:36ms tok/s:1841888 rem:199s step 11323 (67%) loss:3.0458 lr:0.37 dt:35ms tok/s:1849734 rem:199s step 11324 (67%) loss:3.0511 lr:0.37 dt:36ms tok/s:1844360 rem:199s step 11325 (67%) loss:3.0596 lr:0.37 dt:36ms tok/s:1843754 rem:199s step 11326 (67%) loss:3.0639 lr:0.37 dt:36ms tok/s:1841111 rem:199s step 11327 (67%) loss:3.0568 lr:0.37 dt:36ms tok/s:1841111 rem:199s step 11328 (67%) loss:3.0546 lr:0.37 dt:36ms tok/s:1837235 rem:199s step 11329 (67%) loss:3.0346 lr:0.37 dt:36ms tok/s:1843742 rem:199s step 11330 (67%) loss:3.0546 lr:0.37 dt:36ms tok/s:1834158 rem:199s step 11331 (67%) loss:3.0580 lr:0.37 dt:36ms tok/s:1839805 rem:199s step 11332 (67%) loss:3.0580 lr:0.37 dt:35ms tok/s:1846429 rem:199s step 11333 (67%) loss:3.0716 lr:0.37 dt:36ms tok/s:1832703 rem:199s step 11334 (67%) loss:3.0747 lr:0.37 dt:36ms tok/s:1828193 rem:199s step 11335 (67%) loss:3.0756 lr:0.37 dt:36ms tok/s:1842098 rem:199s step 11336 (67%) loss:3.0762 lr:0.37 dt:36ms tok/s:1838611 rem:199s step 11337 (67%) loss:3.0730 lr:0.37 dt:36ms tok/s:1832006 rem:198s step 11338 (67%) loss:3.0815 lr:0.37 dt:36ms tok/s:1832409 rem:198s step 11339 (67%) loss:3.0870 lr:0.37 dt:36ms tok/s:1832923 rem:198s step 11340 (67%) loss:3.0912 lr:0.37 dt:36ms tok/s:1833216 rem:198s step 11341 (67%) loss:3.0885 lr:0.37 dt:36ms tok/s:1823608 rem:198s step 11342 (67%) loss:3.0773 lr:0.37 dt:36ms tok/s:1831482 rem:198s step 11343 (67%) loss:3.0777 lr:0.37 dt:36ms tok/s:1834023 rem:198s step 11344 (67%) loss:3.0694 lr:0.37 dt:36ms tok/s:1836069 rem:198s step 11345 (67%) loss:3.0586 lr:0.37 dt:36ms tok/s:1835554 rem:198s step 11346 (67%) loss:3.0684 lr:0.36 dt:36ms tok/s:1841432 rem:198s step 11347 (67%) loss:3.0453 lr:0.36 dt:36ms tok/s:1832788 rem:198s step 11348 (67%) loss:3.0461 lr:0.36 dt:36ms tok/s:1837456 rem:198s step 11349 (67%) loss:3.0100 lr:0.36 dt:36ms tok/s:1838193 rem:198s step 11350 (67%) loss:2.9586 lr:0.36 dt:36ms tok/s:1829202 rem:198s step 11351 (67%) loss:2.9571 lr:0.36 dt:36ms tok/s:1839768 rem:198s step 11352 (67%) loss:2.9685 lr:0.36 dt:36ms tok/s:1836413 rem:198s step 11353 (67%) loss:2.9709 lr:0.36 dt:36ms tok/s:1831335 rem:198s step 11354 (67%) loss:2.9748 lr:0.36 dt:36ms tok/s:1830225 rem:198s step 11355 (67%) loss:2.9745 lr:0.36 dt:36ms tok/s:1834452 rem:198s step 11356 (67%) loss:2.9871 lr:0.36 dt:36ms tok/s:1837517 rem:198s step 11357 (67%) loss:3.0174 lr:0.36 dt:36ms tok/s:1827281 rem:198s step 11358 (67%) loss:3.0290 lr:0.36 dt:36ms tok/s:1840507 rem:198s step 11359 (67%) loss:3.0324 lr:0.36 dt:36ms tok/s:1838365 rem:198s step 11360 (67%) loss:3.0489 lr:0.36 dt:36ms tok/s:1838710 rem:198s step 11361 (67%) loss:3.0597 lr:0.36 dt:36ms tok/s:1839706 rem:198s step 11362 (67%) loss:3.0461 lr:0.36 dt:36ms tok/s:1842629 rem:198s step 11363 (67%) loss:3.0579 lr:0.36 dt:36ms tok/s:1841272 rem:198s step 11364 (67%) loss:3.0229 lr:0.36 dt:36ms tok/s:1839682 rem:198s step 11365 (67%) loss:2.9770 lr:0.36 dt:36ms tok/s:1840655 rem:197s step 11366 (67%) loss:2.9642 lr:0.36 dt:36ms tok/s:1838193 rem:197s step 11367 (67%) loss:2.9780 lr:0.36 dt:36ms tok/s:1841037 rem:197s step 11368 (67%) loss:2.9846 lr:0.36 dt:36ms tok/s:1836327 rem:197s step 11369 (67%) loss:2.9933 lr:0.36 dt:36ms tok/s:1830945 rem:197s step 11370 (67%) loss:3.0550 lr:0.36 dt:36ms tok/s:1834684 rem:197s step 11371 (67%) loss:3.0650 lr:0.36 dt:36ms tok/s:1835910 rem:197s step 11372 (67%) loss:3.0649 lr:0.36 dt:36ms tok/s:1836253 rem:197s step 11373 (67%) loss:3.0717 lr:0.36 dt:36ms tok/s:1841235 rem:197s step 11374 (67%) loss:3.0721 lr:0.36 dt:36ms tok/s:1836253 rem:197s step 11375 (67%) loss:3.0710 lr:0.36 dt:36ms tok/s:1831067 rem:197s step 11376 (67%) loss:3.0963 lr:0.36 dt:36ms tok/s:1834562 rem:197s step 11377 (67%) loss:3.1112 lr:0.36 dt:36ms tok/s:1834464 rem:197s step 11378 (67%) loss:3.0932 lr:0.36 dt:37ms tok/s:1767499 rem:197s step 11379 (67%) loss:3.0814 lr:0.36 dt:37ms tok/s:1795004 rem:197s step 11380 (67%) loss:3.1025 lr:0.36 dt:35ms tok/s:1873908 rem:197s step 11381 (67%) loss:3.1096 lr:0.36 dt:35ms tok/s:1883216 rem:197s step 11382 (67%) loss:3.1149 lr:0.36 dt:35ms tok/s:1858350 rem:197s step 11383 (67%) loss:3.1088 lr:0.36 dt:36ms tok/s:1837493 rem:197s step 11384 (67%) loss:3.1103 lr:0.36 dt:36ms tok/s:1826383 rem:197s step 11385 (67%) loss:3.0644 lr:0.36 dt:35ms tok/s:1858338 rem:197s step 11386 (67%) loss:3.0770 lr:0.36 dt:38ms tok/s:1742049 rem:197s step 11387 (67%) loss:3.0803 lr:0.36 dt:36ms tok/s:1837395 rem:197s step 11388 (67%) loss:3.0772 lr:0.36 dt:36ms tok/s:1835321 rem:197s step 11389 (67%) loss:3.1173 lr:0.36 dt:36ms tok/s:1829787 rem:197s step 11390 (67%) loss:3.2031 lr:0.36 dt:36ms tok/s:1835248 rem:197s step 11391 (67%) loss:3.1988 lr:0.36 dt:36ms tok/s:1821976 rem:197s step 11392 (67%) loss:3.2028 lr:0.36 dt:36ms tok/s:1830750 rem:197s step 11393 (67%) loss:3.1973 lr:0.36 dt:36ms tok/s:1828837 rem:196s step 11394 (67%) loss:3.1978 lr:0.36 dt:42ms tok/s:1567131 rem:196s step 11395 (67%) loss:3.1898 lr:0.36 dt:36ms tok/s:1821336 rem:196s step 11396 (67%) loss:3.2191 lr:0.36 dt:35ms tok/s:1864135 rem:196s step 11397 (67%) loss:3.2103 lr:0.36 dt:35ms tok/s:1854401 rem:196s step 11398 (67%) loss:3.1977 lr:0.36 dt:36ms tok/s:1844818 rem:196s step 11399 (67%) loss:3.1762 lr:0.36 dt:35ms tok/s:1859910 rem:196s step 11400 (67%) loss:3.1522 lr:0.36 dt:35ms tok/s:1864173 rem:196s + local: attn=[0.108, 0.922, 0.969] mlp=[0.865, 0.320, -0.308] + + transition: attn=[3.267, 1.106] mlp=[-0.312, 0.804] + + hierarchy: attn=[3.380, 5.939, 5.616] mlp=[1.717, -1.377, -3.789] + step 11401 (67%) loss:3.1491 lr:0.36 dt:35ms tok/s:1870007 rem:196s step 11402 (67%) loss:3.1328 lr:0.36 dt:35ms tok/s:1881566 rem:196s step 11403 (67%) loss:3.1062 lr:0.36 dt:35ms tok/s:1872618 rem:196s step 11404 (67%) loss:3.0973 lr:0.36 dt:35ms tok/s:1877941 rem:196s step 11405 (67%) loss:3.0993 lr:0.36 dt:35ms tok/s:1872708 rem:196s step 11406 (67%) loss:3.0935 lr:0.36 dt:35ms tok/s:1892147 rem:196s step 11407 (67%) loss:3.1193 lr:0.36 dt:35ms tok/s:1868735 rem:196s step 11408 (67%) loss:3.1061 lr:0.36 dt:35ms tok/s:1876172 rem:196s step 11409 (67%) loss:3.0671 lr:0.36 dt:35ms tok/s:1880253 rem:196s step 11410 (67%) loss:3.0979 lr:0.36 dt:35ms tok/s:1871433 rem:196s step 11411 (67%) loss:3.1177 lr:0.36 dt:35ms tok/s:1876389 rem:196s step 11412 (67%) loss:3.1017 lr:0.36 dt:35ms tok/s:1869714 rem:196s step 11413 (67%) loss:3.1564 lr:0.36 dt:35ms tok/s:1875058 rem:196s step 11414 (67%) loss:3.1813 lr:0.36 dt:35ms tok/s:1874317 rem:196s step 11415 (67%) loss:3.1719 lr:0.36 dt:35ms tok/s:1870949 rem:196s step 11416 (67%) loss:3.1615 lr:0.36 dt:35ms tok/s:1875045 rem:196s step 11417 (67%) loss:3.1533 lr:0.36 dt:35ms tok/s:1878544 rem:196s step 11418 (67%) loss:3.1379 lr:0.36 dt:35ms tok/s:1870910 rem:196s step 11419 (67%) loss:3.1524 lr:0.36 dt:35ms tok/s:1880240 rem:196s step 11420 (67%) loss:3.1419 lr:0.36 dt:35ms tok/s:1867491 rem:196s step 11421 (67%) loss:3.1345 lr:0.36 dt:35ms tok/s:1869651 rem:196s step 11422 (67%) loss:3.1167 lr:0.36 dt:35ms tok/s:1870020 rem:195s step 11423 (67%) loss:3.1026 lr:0.36 dt:35ms tok/s:1846764 rem:195s step 11424 (67%) loss:3.0850 lr:0.36 dt:35ms tok/s:1884881 rem:195s step 11425 (67%) loss:3.0607 lr:0.36 dt:35ms tok/s:1858778 rem:195s step 11426 (67%) loss:3.0558 lr:0.36 dt:35ms tok/s:1859142 rem:195s step 11427 (67%) loss:3.0578 lr:0.36 dt:35ms tok/s:1862089 rem:195s step 11428 (67%) loss:3.0501 lr:0.36 dt:35ms tok/s:1856443 rem:195s step 11429 (67%) loss:3.0589 lr:0.36 dt:35ms tok/s:1860413 rem:195s step 11430 (67%) loss:3.0696 lr:0.36 dt:35ms tok/s:1851179 rem:195s step 11431 (67%) loss:3.0713 lr:0.36 dt:35ms tok/s:1859532 rem:195s step 11432 (67%) loss:3.0492 lr:0.36 dt:36ms tok/s:1845239 rem:195s step 11433 (67%) loss:3.0332 lr:0.36 dt:35ms tok/s:1856593 rem:195s step 11434 (67%) loss:3.0173 lr:0.36 dt:35ms tok/s:1861787 rem:195s step 11435 (67%) loss:3.0191 lr:0.36 dt:36ms tok/s:1832141 rem:195s step 11436 (67%) loss:3.0162 lr:0.35 dt:36ms tok/s:1829349 rem:195s step 11437 (68%) loss:2.9946 lr:0.35 dt:36ms tok/s:1842963 rem:195s step 11438 (68%) loss:2.9469 lr:0.35 dt:36ms tok/s:1834660 rem:195s step 11439 (68%) loss:2.9632 lr:0.35 dt:36ms tok/s:1826747 rem:195s step 11440 (68%) loss:2.9842 lr:0.35 dt:36ms tok/s:1827634 rem:195s step 11441 (68%) loss:3.0112 lr:0.35 dt:40ms tok/s:1623030 rem:195s step 11442 (68%) loss:3.0187 lr:0.35 dt:34ms tok/s:1918627 rem:195s step 11443 (68%) loss:2.9974 lr:0.35 dt:34ms tok/s:1920222 rem:195s step 11444 (68%) loss:2.9871 lr:0.35 dt:34ms tok/s:1903020 rem:195s step 11445 (68%) loss:3.0115 lr:0.35 dt:34ms tok/s:1907536 rem:195s step 11446 (68%) loss:3.0116 lr:0.35 dt:34ms tok/s:1907073 rem:195s step 11447 (68%) loss:3.0280 lr:0.35 dt:35ms tok/s:1898708 rem:195s step 11448 (68%) loss:3.0442 lr:0.35 dt:34ms tok/s:1901835 rem:195s step 11449 (68%) loss:3.0581 lr:0.35 dt:35ms tok/s:1894756 rem:195s step 11450 (68%) loss:3.0742 lr:0.35 dt:34ms tok/s:1900415 rem:194s step 11451 (68%) loss:3.0631 lr:0.35 dt:34ms tok/s:1916754 rem:194s step 11452 (68%) loss:3.0565 lr:0.35 dt:34ms tok/s:1914658 rem:194s step 11453 (68%) loss:3.0439 lr:0.35 dt:34ms tok/s:1904444 rem:194s step 11454 (68%) loss:3.0140 lr:0.35 dt:34ms tok/s:1910042 rem:194s step 11455 (68%) loss:2.9758 lr:0.35 dt:34ms tok/s:1909259 rem:194s step 11456 (68%) loss:2.9442 lr:0.35 dt:34ms tok/s:1918132 rem:194s step 11457 (68%) loss:2.9144 lr:0.35 dt:34ms tok/s:1905738 rem:194s step 11458 (68%) loss:2.9458 lr:0.35 dt:34ms tok/s:1911742 rem:194s step 11459 (68%) loss:2.9639 lr:0.35 dt:34ms tok/s:1901914 rem:194s step 11460 (68%) loss:3.0044 lr:0.35 dt:34ms tok/s:1903415 rem:194s step 11461 (68%) loss:2.9922 lr:0.35 dt:34ms tok/s:1907457 rem:194s step 11462 (68%) loss:2.9809 lr:0.35 dt:34ms tok/s:1909352 rem:194s step 11463 (68%) loss:3.0013 lr:0.35 dt:34ms tok/s:1903943 rem:194s step 11464 (68%) loss:3.0215 lr:0.35 dt:35ms tok/s:1865122 rem:194s step 11465 (68%) loss:3.0350 lr:0.35 dt:35ms tok/s:1890508 rem:194s step 11466 (68%) loss:3.0431 lr:0.35 dt:35ms tok/s:1877068 rem:194s step 11467 (68%) loss:3.0696 lr:0.35 dt:35ms tok/s:1885192 rem:194s step 11468 (68%) loss:3.0680 lr:0.35 dt:35ms tok/s:1878402 rem:194s step 11469 (68%) loss:3.0727 lr:0.35 dt:36ms tok/s:1824189 rem:194s step 11470 (68%) loss:3.0684 lr:0.35 dt:35ms tok/s:1880639 rem:194s step 11471 (68%) loss:3.0591 lr:0.35 dt:35ms tok/s:1875327 rem:194s step 11472 (68%) loss:3.0603 lr:0.35 dt:35ms tok/s:1883177 rem:194s step 11473 (68%) loss:3.0632 lr:0.35 dt:35ms tok/s:1867770 rem:194s step 11474 (68%) loss:3.0735 lr:0.35 dt:35ms tok/s:1858589 rem:194s step 11475 (68%) loss:3.0603 lr:0.35 dt:35ms tok/s:1861219 rem:194s step 11476 (68%) loss:3.0552 lr:0.35 dt:35ms tok/s:1855816 rem:194s step 11477 (68%) loss:3.0579 lr:0.35 dt:35ms tok/s:1865514 rem:194s step 11478 (68%) loss:3.0466 lr:0.35 dt:35ms tok/s:1856857 rem:194s step 11479 (68%) loss:3.0287 lr:0.35 dt:35ms tok/s:1856305 rem:193s step 11480 (68%) loss:3.0243 lr:0.35 dt:35ms tok/s:1871790 rem:193s step 11481 (68%) loss:3.0301 lr:0.35 dt:35ms tok/s:1889923 rem:193s step 11482 (68%) loss:3.0267 lr:0.35 dt:35ms tok/s:1877351 rem:193s step 11483 (68%) loss:3.0341 lr:0.35 dt:35ms tok/s:1877902 rem:193s step 11484 (68%) loss:3.0341 lr:0.35 dt:35ms tok/s:1884933 rem:193s step 11485 (68%) loss:3.0335 lr:0.35 dt:35ms tok/s:1880227 rem:193s step 11486 (68%) loss:3.0307 lr:0.35 dt:35ms tok/s:1879867 rem:193s step 11487 (68%) loss:3.0287 lr:0.35 dt:35ms tok/s:1883667 rem:193s step 11488 (68%) loss:3.0367 lr:0.35 dt:35ms tok/s:1872363 rem:193s step 11489 (68%) loss:3.0352 lr:0.35 dt:35ms tok/s:1884416 rem:193s step 11490 (68%) loss:3.0530 lr:0.35 dt:35ms tok/s:1877107 rem:193s step 11491 (68%) loss:3.0723 lr:0.35 dt:35ms tok/s:1878518 rem:193s step 11492 (68%) loss:3.0779 lr:0.35 dt:35ms tok/s:1881810 rem:193s step 11493 (68%) loss:3.0730 lr:0.35 dt:36ms tok/s:1842950 rem:193s step 11494 (68%) loss:3.0695 lr:0.35 dt:36ms tok/s:1816810 rem:193s step 11495 (68%) loss:3.0572 lr:0.35 dt:36ms tok/s:1844657 rem:193s step 11496 (68%) loss:3.0633 lr:0.35 dt:35ms tok/s:1854589 rem:193s step 11497 (68%) loss:3.0820 lr:0.35 dt:35ms tok/s:1849373 rem:193s step 11498 (68%) loss:3.0526 lr:0.35 dt:35ms tok/s:1853189 rem:193s step 11499 (68%) loss:3.0232 lr:0.35 dt:35ms tok/s:1850432 rem:193s step 11500 (68%) loss:3.0262 lr:0.35 dt:35ms tok/s:1855453 rem:193s + local: attn=[0.120, 0.912, 0.970] mlp=[0.887, 0.329, -0.339] + + transition: attn=[3.273, 1.120] mlp=[-0.314, 0.842] + + hierarchy: attn=[3.436, 5.939, 5.616] mlp=[1.769, -1.429, -3.835] + step 11501 (68%) loss:3.0345 lr:0.35 dt:35ms tok/s:1847248 rem:193s step 11502 (68%) loss:3.0291 lr:0.35 dt:36ms tok/s:1829154 rem:193s step 11503 (68%) loss:3.0158 lr:0.35 dt:36ms tok/s:1819793 rem:193s step 11504 (68%) loss:3.0164 lr:0.35 dt:36ms tok/s:1822133 rem:193s step 11505 (68%) loss:2.9998 lr:0.35 dt:36ms tok/s:1820227 rem:193s step 11506 (68%) loss:2.9852 lr:0.35 dt:36ms tok/s:1817651 rem:193s step 11507 (68%) loss:2.9769 lr:0.35 dt:36ms tok/s:1819624 rem:192s step 11508 (68%) loss:2.9455 lr:0.35 dt:36ms tok/s:1818721 rem:192s step 11509 (68%) loss:2.9444 lr:0.35 dt:36ms tok/s:1825582 rem:192s step 11510 (68%) loss:2.9553 lr:0.35 dt:36ms tok/s:1828485 rem:192s step 11511 (68%) loss:2.9580 lr:0.35 dt:36ms tok/s:1812067 rem:192s step 11512 (68%) loss:2.9798 lr:0.35 dt:36ms tok/s:1824661 rem:192s step 11513 (68%) loss:2.9875 lr:0.35 dt:36ms tok/s:1823342 rem:192s step 11514 (68%) loss:3.0257 lr:0.35 dt:36ms tok/s:1828399 rem:192s step 11515 (68%) loss:3.0190 lr:0.35 dt:36ms tok/s:1821686 rem:192s step 11516 (68%) loss:3.0274 lr:0.35 dt:36ms tok/s:1818240 rem:192s step 11517 (68%) loss:3.0371 lr:0.35 dt:36ms tok/s:1824080 rem:192s step 11518 (68%) loss:3.0786 lr:0.35 dt:36ms tok/s:1825303 rem:192s step 11519 (68%) loss:3.0848 lr:0.35 dt:36ms tok/s:1810098 rem:192s step 11520 (68%) loss:3.0835 lr:0.35 dt:36ms tok/s:1824080 rem:192s step 11521 (68%) loss:3.0911 lr:0.35 dt:36ms tok/s:1823088 rem:192s step 11522 (68%) loss:3.0975 lr:0.35 dt:36ms tok/s:1821747 rem:192s step 11523 (68%) loss:3.1046 lr:0.35 dt:36ms tok/s:1819962 rem:192s step 11524 (68%) loss:3.0989 lr:0.35 dt:36ms tok/s:1821373 rem:192s step 11525 (68%) loss:3.0797 lr:0.35 dt:36ms tok/s:1814687 rem:192s step 11526 (68%) loss:3.0770 lr:0.35 dt:36ms tok/s:1822725 rem:192s step 11527 (68%) loss:3.0699 lr:0.34 dt:36ms tok/s:1822520 rem:192s step 11528 (68%) loss:3.1170 lr:0.34 dt:37ms tok/s:1760593 rem:192s step 11529 (68%) loss:3.1278 lr:0.34 dt:36ms tok/s:1818000 rem:192s step 11530 (68%) loss:3.1336 lr:0.34 dt:36ms tok/s:1818553 rem:192s step 11531 (68%) loss:3.1223 lr:0.34 dt:36ms tok/s:1809193 rem:192s step 11532 (68%) loss:3.1006 lr:0.34 dt:36ms tok/s:1815634 rem:192s step 11533 (68%) loss:3.0882 lr:0.34 dt:36ms tok/s:1813885 rem:192s step 11534 (68%) loss:3.1033 lr:0.34 dt:36ms tok/s:1817459 rem:192s step 11535 (68%) loss:3.0919 lr:0.34 dt:36ms tok/s:1811196 rem:191s step 11536 (68%) loss:3.0994 lr:0.34 dt:36ms tok/s:1826893 rem:191s step 11537 (68%) loss:3.0779 lr:0.34 dt:36ms tok/s:1818505 rem:191s step 11538 (68%) loss:3.0699 lr:0.34 dt:36ms tok/s:1816870 rem:191s step 11539 (68%) loss:3.0468 lr:0.34 dt:36ms tok/s:1807682 rem:191s step 11540 (68%) loss:3.0325 lr:0.34 dt:36ms tok/s:1810945 rem:191s step 11541 (68%) loss:3.0246 lr:0.34 dt:36ms tok/s:1806375 rem:191s step 11542 (68%) loss:3.0122 lr:0.34 dt:36ms tok/s:1805248 rem:191s step 11543 (68%) loss:3.0139 lr:0.34 dt:36ms tok/s:1809705 rem:191s step 11544 (68%) loss:3.0187 lr:0.34 dt:36ms tok/s:1809777 rem:191s step 11545 (68%) loss:3.0247 lr:0.34 dt:36ms tok/s:1809336 rem:191s step 11546 (68%) loss:3.0190 lr:0.34 dt:36ms tok/s:1810241 rem:191s step 11547 (68%) loss:3.0253 lr:0.34 dt:36ms tok/s:1818818 rem:191s step 11548 (68%) loss:3.0291 lr:0.34 dt:36ms tok/s:1809419 rem:191s step 11549 (68%) loss:3.0354 lr:0.34 dt:36ms tok/s:1812055 rem:191s step 11550 (68%) loss:3.0367 lr:0.34 dt:36ms tok/s:1806316 rem:191s step 11551 (68%) loss:3.0305 lr:0.34 dt:36ms tok/s:1813729 rem:191s step 11552 (68%) loss:3.0035 lr:0.34 dt:36ms tok/s:1815730 rem:191s step 11553 (68%) loss:3.0085 lr:0.34 dt:36ms tok/s:1809169 rem:191s step 11554 (68%) loss:3.0010 lr:0.34 dt:36ms tok/s:1812043 rem:191s step 11555 (68%) loss:3.0014 lr:0.34 dt:36ms tok/s:1812557 rem:191s step 11556 (68%) loss:3.0052 lr:0.34 dt:36ms tok/s:1817074 rem:191s step 11557 (68%) loss:3.0003 lr:0.34 dt:36ms tok/s:1812928 rem:191s step 11558 (68%) loss:3.0022 lr:0.34 dt:36ms tok/s:1812270 rem:191s step 11559 (68%) loss:2.9845 lr:0.34 dt:36ms tok/s:1808098 rem:191s step 11560 (68%) loss:3.2467 lr:0.34 dt:36ms tok/s:1808348 rem:191s step 11561 (68%) loss:3.2346 lr:0.34 dt:36ms tok/s:1805367 rem:191s step 11562 (68%) loss:3.2299 lr:0.34 dt:36ms tok/s:1805165 rem:191s step 11563 (68%) loss:3.2050 lr:0.34 dt:36ms tok/s:1813310 rem:190s step 11564 (68%) loss:3.1904 lr:0.34 dt:36ms tok/s:1809324 rem:190s step 11565 (68%) loss:3.1608 lr:0.34 dt:36ms tok/s:1809395 rem:190s step 11566 (68%) loss:3.0960 lr:0.34 dt:36ms tok/s:1811530 rem:190s step 11567 (68%) loss:3.0853 lr:0.34 dt:36ms tok/s:1811781 rem:190s step 11568 (68%) loss:3.0794 lr:0.34 dt:36ms tok/s:1813586 rem:190s step 11569 (68%) loss:3.0607 lr:0.34 dt:36ms tok/s:1811996 rem:190s step 11570 (68%) loss:3.0623 lr:0.34 dt:36ms tok/s:1806423 rem:190s step 11571 (68%) loss:3.0593 lr:0.34 dt:36ms tok/s:1807860 rem:190s step 11572 (68%) loss:3.1093 lr:0.34 dt:36ms tok/s:1821216 rem:190s step 11573 (68%) loss:3.0867 lr:0.34 dt:36ms tok/s:1816750 rem:190s step 11574 (68%) loss:3.0767 lr:0.34 dt:36ms tok/s:1817170 rem:190s step 11575 (68%) loss:3.0541 lr:0.34 dt:36ms tok/s:1818060 rem:190s step 11576 (68%) loss:3.0439 lr:0.34 dt:37ms tok/s:1777350 rem:190s step 11577 (68%) loss:3.0340 lr:0.34 dt:36ms tok/s:1807836 rem:190s step 11578 (68%) loss:3.0144 lr:0.34 dt:36ms tok/s:1809360 rem:190s step 11579 (68%) loss:2.9917 lr:0.34 dt:37ms tok/s:1768796 rem:190s step 11580 (68%) loss:2.9975 lr:0.34 dt:38ms tok/s:1736283 rem:190s step 11581 (68%) loss:3.0049 lr:0.34 dt:37ms tok/s:1790082 rem:190s step 11582 (68%) loss:3.0413 lr:0.34 dt:37ms tok/s:1793926 rem:190s step 11583 (68%) loss:3.0248 lr:0.34 dt:37ms tok/s:1788289 rem:190s step 11584 (68%) loss:3.0081 lr:0.34 dt:36ms tok/s:1819408 rem:190s step 11585 (68%) loss:3.0261 lr:0.34 dt:36ms tok/s:1815154 rem:190s step 11586 (68%) loss:3.0334 lr:0.34 dt:36ms tok/s:1811816 rem:190s step 11587 (68%) loss:3.0446 lr:0.34 dt:36ms tok/s:1818529 rem:190s step 11588 (68%) loss:3.0350 lr:0.34 dt:36ms tok/s:1814040 rem:190s step 11589 (68%) loss:3.0211 lr:0.34 dt:36ms tok/s:1816462 rem:190s step 11590 (68%) loss:3.0054 lr:0.34 dt:36ms tok/s:1797927 rem:189s step 11591 (68%) loss:2.9910 lr:0.34 dt:36ms tok/s:1812940 rem:189s step 11592 (68%) loss:2.9990 lr:0.34 dt:36ms tok/s:1823027 rem:189s step 11593 (68%) loss:3.0164 lr:0.34 dt:36ms tok/s:1809133 rem:189s step 11594 (68%) loss:3.0167 lr:0.34 dt:36ms tok/s:1809300 rem:189s step 11595 (68%) loss:3.0156 lr:0.34 dt:36ms tok/s:1811578 rem:189s step 11596 (68%) loss:3.0273 lr:0.34 dt:36ms tok/s:1811136 rem:189s step 11597 (68%) loss:3.0277 lr:0.34 dt:36ms tok/s:1816402 rem:189s step 11598 (68%) loss:3.0438 lr:0.34 dt:37ms tok/s:1782237 rem:189s step 11599 (68%) loss:3.0409 lr:0.34 dt:36ms tok/s:1815023 rem:189s step 11600 (68%) loss:3.0363 lr:0.34 dt:36ms tok/s:1819660 rem:189s + local: attn=[0.116, 0.939, 0.974] mlp=[0.904, 0.339, -0.319] + + transition: attn=[3.358, 1.139] mlp=[-0.326, 0.853] + + hierarchy: attn=[3.388, 5.939, 5.616] mlp=[1.793, -1.427, -3.929] + step 11601 (68%) loss:3.0265 lr:0.34 dt:36ms tok/s:1819010 rem:189s step 11602 (68%) loss:3.0337 lr:0.34 dt:36ms tok/s:1827111 rem:189s step 11603 (68%) loss:3.0219 lr:0.34 dt:36ms tok/s:1809908 rem:189s step 11604 (68%) loss:3.0218 lr:0.34 dt:36ms tok/s:1810981 rem:189s step 11605 (69%) loss:3.0167 lr:0.34 dt:36ms tok/s:1815898 rem:189s step 11606 (69%) loss:3.0473 lr:0.34 dt:36ms tok/s:1819938 rem:189s step 11607 (69%) loss:3.0474 lr:0.34 dt:36ms tok/s:1817134 rem:189s step 11608 (69%) loss:3.0266 lr:0.34 dt:36ms tok/s:1811697 rem:189s step 11609 (69%) loss:3.0162 lr:0.34 dt:36ms tok/s:1816078 rem:189s step 11610 (69%) loss:3.0215 lr:0.34 dt:36ms tok/s:1809860 rem:189s step 11611 (69%) loss:3.0208 lr:0.34 dt:36ms tok/s:1821324 rem:189s step 11612 (69%) loss:3.0255 lr:0.34 dt:36ms tok/s:1810361 rem:189s step 11613 (69%) loss:3.0202 lr:0.34 dt:36ms tok/s:1806981 rem:189s step 11614 (69%) loss:3.0054 lr:0.34 dt:36ms tok/s:1809038 rem:189s step 11615 (69%) loss:3.0019 lr:0.34 dt:36ms tok/s:1810158 rem:189s step 11616 (69%) loss:3.0186 lr:0.33 dt:36ms tok/s:1813251 rem:189s step 11617 (69%) loss:3.0350 lr:0.33 dt:36ms tok/s:1810611 rem:189s step 11618 (69%) loss:3.0504 lr:0.33 dt:36ms tok/s:1818926 rem:188s step 11619 (69%) loss:3.0455 lr:0.33 dt:36ms tok/s:1816342 rem:188s step 11620 (69%) loss:3.0390 lr:0.33 dt:36ms tok/s:1820408 rem:188s step 11621 (69%) loss:3.0168 lr:0.33 dt:38ms tok/s:1730272 rem:188s step 11622 (69%) loss:3.0244 lr:0.33 dt:36ms tok/s:1816186 rem:188s step 11623 (69%) loss:3.0310 lr:0.33 dt:36ms tok/s:1811387 rem:188s step 11624 (69%) loss:3.0274 lr:0.33 dt:36ms tok/s:1809252 rem:188s step 11625 (69%) loss:3.0090 lr:0.33 dt:36ms tok/s:1809193 rem:188s step 11626 (69%) loss:3.0062 lr:0.33 dt:36ms tok/s:1813717 rem:188s step 11627 (69%) loss:3.0252 lr:0.33 dt:36ms tok/s:1814759 rem:188s step 11628 (69%) loss:3.0367 lr:0.33 dt:36ms tok/s:1817014 rem:188s step 11629 (69%) loss:3.0424 lr:0.33 dt:36ms tok/s:1816114 rem:188s step 11630 (69%) loss:3.0407 lr:0.33 dt:36ms tok/s:1818649 rem:188s step 11631 (69%) loss:3.0417 lr:0.33 dt:36ms tok/s:1820962 rem:188s step 11632 (69%) loss:3.0491 lr:0.33 dt:36ms tok/s:1835775 rem:188s step 11633 (69%) loss:3.0519 lr:0.33 dt:36ms tok/s:1835493 rem:188s step 11634 (69%) loss:3.0522 lr:0.33 dt:36ms tok/s:1841099 rem:188s step 11635 (69%) loss:3.0510 lr:0.33 dt:36ms tok/s:1834586 rem:188s step 11636 (69%) loss:3.0571 lr:0.33 dt:36ms tok/s:1834060 rem:188s step 11637 (69%) loss:3.0506 lr:0.33 dt:36ms tok/s:1835125 rem:188s step 11638 (69%) loss:3.0356 lr:0.33 dt:36ms tok/s:1837456 rem:188s step 11639 (69%) loss:3.0409 lr:0.33 dt:36ms tok/s:1835358 rem:188s step 11640 (69%) loss:3.0175 lr:0.33 dt:36ms tok/s:1841617 rem:188s step 11641 (69%) loss:3.0138 lr:0.33 dt:36ms tok/s:1830518 rem:188s step 11642 (69%) loss:3.0155 lr:0.33 dt:36ms tok/s:1832544 rem:188s step 11643 (69%) loss:3.0252 lr:0.33 dt:36ms tok/s:1841457 rem:188s step 11644 (69%) loss:3.0315 lr:0.33 dt:36ms tok/s:1827670 rem:188s step 11645 (69%) loss:3.0446 lr:0.33 dt:35ms tok/s:1890703 rem:188s step 11646 (69%) loss:3.0473 lr:0.33 dt:34ms tok/s:1910068 rem:187s step 11647 (69%) loss:3.0524 lr:0.33 dt:34ms tok/s:1904193 rem:187s step 11648 (69%) loss:3.0675 lr:0.33 dt:34ms tok/s:1899745 rem:187s step 11649 (69%) loss:3.0705 lr:0.33 dt:35ms tok/s:1884184 rem:187s step 11650 (69%) loss:3.0639 lr:0.33 dt:35ms tok/s:1872210 rem:187s step 11651 (69%) loss:3.0645 lr:0.33 dt:35ms tok/s:1873997 rem:187s step 11652 (69%) loss:3.0680 lr:0.33 dt:35ms tok/s:1881669 rem:187s step 11653 (69%) loss:3.0457 lr:0.33 dt:35ms tok/s:1898748 rem:187s step 11654 (69%) loss:3.0505 lr:0.33 dt:35ms tok/s:1893072 rem:187s step 11655 (69%) loss:3.0388 lr:0.33 dt:35ms tok/s:1888546 rem:187s step 11656 (69%) loss:3.0576 lr:0.33 dt:46ms tok/s:1428657 rem:187s step 11657 (69%) loss:3.0663 lr:0.33 dt:34ms tok/s:1908066 rem:187s step 11658 (69%) loss:3.1024 lr:0.33 dt:34ms tok/s:1907206 rem:187s step 11659 (69%) loss:3.1123 lr:0.33 dt:34ms tok/s:1935555 rem:187s step 11660 (69%) loss:3.1076 lr:0.33 dt:34ms tok/s:1939721 rem:187s step 11661 (69%) loss:3.0855 lr:0.33 dt:34ms tok/s:1943218 rem:187s step 11662 (69%) loss:3.0682 lr:0.33 dt:34ms tok/s:1938012 rem:187s step 11663 (69%) loss:3.0641 lr:0.33 dt:34ms tok/s:1940721 rem:187s step 11664 (69%) loss:3.0575 lr:0.33 dt:34ms tok/s:1937670 rem:187s step 11665 (69%) loss:3.0503 lr:0.33 dt:34ms tok/s:1939311 rem:187s step 11666 (69%) loss:3.0530 lr:0.33 dt:34ms tok/s:1936646 rem:187s step 11667 (69%) loss:3.0599 lr:0.33 dt:34ms tok/s:1943177 rem:187s step 11668 (69%) loss:3.0505 lr:0.33 dt:34ms tok/s:1940461 rem:187s step 11669 (69%) loss:3.0682 lr:0.33 dt:34ms tok/s:1939037 rem:187s step 11670 (69%) loss:3.0759 lr:0.33 dt:34ms tok/s:1923043 rem:187s step 11671 (69%) loss:3.0753 lr:0.33 dt:34ms tok/s:1933296 rem:187s step 11672 (69%) loss:3.0380 lr:0.33 dt:34ms tok/s:1927885 rem:187s step 11673 (69%) loss:3.0359 lr:0.33 dt:33ms tok/s:1984334 rem:187s step 11674 (69%) loss:3.0369 lr:0.33 dt:34ms tok/s:1912221 rem:187s step 11675 (69%) loss:3.0212 lr:0.33 dt:34ms tok/s:1934534 rem:186s step 11676 (69%) loss:3.0387 lr:0.33 dt:33ms tok/s:1959076 rem:186s step 11677 (69%) loss:3.0415 lr:0.33 dt:33ms tok/s:1961466 rem:186s step 11678 (69%) loss:3.0629 lr:0.33 dt:36ms tok/s:1843569 rem:186s step 11679 (69%) loss:3.0568 lr:0.33 dt:33ms tok/s:1956580 rem:186s step 11680 (69%) loss:3.0485 lr:0.33 dt:34ms tok/s:1955591 rem:186s step 11681 (69%) loss:3.0635 lr:0.33 dt:33ms tok/s:1959774 rem:186s step 11682 (69%) loss:3.0605 lr:0.33 dt:34ms tok/s:1947872 rem:186s step 11683 (69%) loss:3.0551 lr:0.33 dt:34ms tok/s:1946493 rem:186s step 11684 (69%) loss:3.0340 lr:0.33 dt:34ms tok/s:1941804 rem:186s step 11685 (69%) loss:3.0231 lr:0.33 dt:34ms tok/s:1942422 rem:186s step 11686 (69%) loss:2.9942 lr:0.33 dt:34ms tok/s:1923824 rem:186s step 11687 (69%) loss:2.9701 lr:0.33 dt:34ms tok/s:1911941 rem:186s step 11688 (69%) loss:2.9840 lr:0.33 dt:34ms tok/s:1925064 rem:186s step 11689 (69%) loss:2.9814 lr:0.33 dt:34ms tok/s:1917945 rem:186s step 11690 (69%) loss:3.0203 lr:0.33 dt:34ms tok/s:1921108 rem:186s step 11691 (69%) loss:3.0333 lr:0.33 dt:34ms tok/s:1926440 rem:186s step 11692 (69%) loss:3.0354 lr:0.33 dt:35ms tok/s:1878492 rem:186s step 11693 (69%) loss:3.0258 lr:0.33 dt:35ms tok/s:1881643 rem:186s step 11694 (69%) loss:3.0204 lr:0.33 dt:35ms tok/s:1889598 rem:186s step 11695 (69%) loss:3.0118 lr:0.33 dt:35ms tok/s:1893125 rem:186s step 11696 (69%) loss:3.0122 lr:0.33 dt:35ms tok/s:1888014 rem:186s step 11697 (69%) loss:2.9932 lr:0.33 dt:35ms tok/s:1891783 rem:186s step 11698 (69%) loss:2.9511 lr:0.33 dt:35ms tok/s:1896220 rem:186s step 11699 (69%) loss:2.9384 lr:0.33 dt:35ms tok/s:1871815 rem:186s step 11700 (69%) loss:2.9114 lr:0.33 dt:36ms tok/s:1817507 rem:186s + local: attn=[0.132, 0.939, 0.991] mlp=[0.915, 0.345, -0.334] + + transition: attn=[3.381, 1.132] mlp=[-0.334, 0.864] + + hierarchy: attn=[3.429, 5.939, 5.616] mlp=[1.839, -1.474, -3.933] + step 11701 (69%) loss:2.9311 lr:0.33 dt:35ms tok/s:1869943 rem:186s step 11702 (69%) loss:2.9242 lr:0.33 dt:35ms tok/s:1869994 rem:186s step 11703 (69%) loss:2.8943 lr:0.33 dt:35ms tok/s:1870592 rem:186s step 11704 (69%) loss:2.8179 lr:0.33 dt:35ms tok/s:1859331 rem:185s step 11705 (69%) loss:2.7861 lr:0.33 dt:35ms tok/s:1866185 rem:185s step 11706 (69%) loss:2.7742 lr:0.33 dt:35ms tok/s:1857296 rem:185s step 11707 (69%) loss:2.7657 lr:0.33 dt:35ms tok/s:1850245 rem:185s step 11708 (69%) loss:2.8060 lr:0.32 dt:35ms tok/s:1854514 rem:185s step 11709 (69%) loss:2.8114 lr:0.32 dt:36ms tok/s:1831982 rem:185s step 11710 (69%) loss:2.8325 lr:0.32 dt:36ms tok/s:1836805 rem:185s step 11711 (69%) loss:2.8449 lr:0.32 dt:35ms tok/s:1846355 rem:185s step 11712 (69%) loss:2.8614 lr:0.32 dt:36ms tok/s:1840532 rem:185s step 11713 (69%) loss:2.8900 lr:0.32 dt:35ms tok/s:1851503 rem:185s step 11714 (69%) loss:2.9153 lr:0.32 dt:35ms tok/s:1847323 rem:185s step 11715 (69%) loss:2.9301 lr:0.32 dt:35ms tok/s:1847484 rem:185s step 11716 (69%) loss:2.9539 lr:0.32 dt:36ms tok/s:1842555 rem:185s step 11717 (69%) loss:2.9512 lr:0.32 dt:35ms tok/s:1848453 rem:185s step 11718 (69%) loss:2.9715 lr:0.32 dt:35ms tok/s:1847149 rem:185s step 11719 (69%) loss:2.9849 lr:0.32 dt:35ms tok/s:1846132 rem:185s step 11720 (69%) loss:2.9814 lr:0.32 dt:36ms tok/s:1844732 rem:185s step 11721 (69%) loss:2.9947 lr:0.32 dt:36ms tok/s:1842580 rem:185s step 11722 (69%) loss:2.9697 lr:0.32 dt:36ms tok/s:1838341 rem:185s step 11723 (69%) loss:2.9819 lr:0.32 dt:35ms tok/s:1852165 rem:185s step 11724 (69%) loss:2.9876 lr:0.32 dt:36ms tok/s:1824298 rem:185s step 11725 (69%) loss:3.0286 lr:0.32 dt:36ms tok/s:1836229 rem:185s step 11726 (69%) loss:3.0610 lr:0.32 dt:36ms tok/s:1834244 rem:185s step 11727 (69%) loss:3.0552 lr:0.32 dt:35ms tok/s:1851878 rem:185s step 11728 (69%) loss:3.0614 lr:0.32 dt:36ms tok/s:1838021 rem:185s step 11729 (69%) loss:3.0616 lr:0.32 dt:36ms tok/s:1828424 rem:185s step 11730 (69%) loss:3.0595 lr:0.32 dt:35ms tok/s:1849013 rem:185s step 11731 (69%) loss:3.0631 lr:0.32 dt:36ms tok/s:1842444 rem:185s step 11732 (69%) loss:3.0681 lr:0.32 dt:36ms tok/s:1842679 rem:184s step 11733 (69%) loss:3.0570 lr:0.32 dt:36ms tok/s:1841074 rem:184s step 11734 (69%) loss:3.0475 lr:0.32 dt:35ms tok/s:1846901 rem:184s step 11735 (69%) loss:3.0271 lr:0.32 dt:36ms tok/s:1840778 rem:184s step 11736 (69%) loss:3.0282 lr:0.32 dt:36ms tok/s:1844694 rem:184s step 11737 (69%) loss:3.0331 lr:0.32 dt:36ms tok/s:1843618 rem:184s step 11738 (69%) loss:3.0225 lr:0.32 dt:36ms tok/s:1844212 rem:184s step 11739 (69%) loss:3.0256 lr:0.32 dt:36ms tok/s:1836768 rem:184s step 11740 (69%) loss:3.0251 lr:0.32 dt:36ms tok/s:1844125 rem:184s step 11741 (69%) loss:3.0217 lr:0.32 dt:36ms tok/s:1838599 rem:184s step 11742 (69%) loss:3.0138 lr:0.32 dt:36ms tok/s:1835542 rem:184s step 11743 (69%) loss:3.0433 lr:0.32 dt:36ms tok/s:1832727 rem:184s step 11744 (69%) loss:3.0481 lr:0.32 dt:36ms tok/s:1841851 rem:184s step 11745 (69%) loss:3.0394 lr:0.32 dt:36ms tok/s:1839128 rem:184s step 11746 (69%) loss:3.0305 lr:0.32 dt:36ms tok/s:1842370 rem:184s step 11747 (69%) loss:3.0027 lr:0.32 dt:36ms tok/s:1841383 rem:184s step 11748 (69%) loss:3.0023 lr:0.32 dt:36ms tok/s:1839411 rem:184s step 11749 (69%) loss:3.0234 lr:0.32 dt:36ms tok/s:1827305 rem:184s step 11750 (69%) loss:3.0280 lr:0.32 dt:36ms tok/s:1840051 rem:184s step 11751 (69%) loss:3.0300 lr:0.32 dt:36ms tok/s:1834072 rem:184s step 11752 (69%) loss:3.0319 lr:0.32 dt:36ms tok/s:1833130 rem:184s step 11753 (69%) loss:3.0294 lr:0.32 dt:36ms tok/s:1840187 rem:184s step 11754 (69%) loss:3.0458 lr:0.32 dt:35ms tok/s:1846380 rem:184s step 11755 (69%) loss:3.0499 lr:0.32 dt:36ms tok/s:1840322 rem:184s step 11756 (69%) loss:3.0468 lr:0.32 dt:36ms tok/s:1837493 rem:184s step 11757 (69%) loss:3.0669 lr:0.32 dt:36ms tok/s:1835444 rem:184s step 11758 (69%) loss:3.0592 lr:0.32 dt:36ms tok/s:1844818 rem:184s step 11759 (69%) loss:3.0543 lr:0.32 dt:36ms tok/s:1837247 rem:184s step 11760 (69%) loss:3.0554 lr:0.32 dt:36ms tok/s:1838820 rem:183s step 11761 (69%) loss:3.0397 lr:0.32 dt:36ms tok/s:1843841 rem:183s step 11762 (69%) loss:3.0209 lr:0.32 dt:36ms tok/s:1831152 rem:183s step 11763 (69%) loss:3.0121 lr:0.32 dt:36ms tok/s:1835468 rem:183s step 11764 (69%) loss:3.0453 lr:0.32 dt:36ms tok/s:1840606 rem:183s step 11765 (69%) loss:3.0499 lr:0.32 dt:36ms tok/s:1806019 rem:183s step 11766 (69%) loss:3.0757 lr:0.32 dt:37ms tok/s:1776661 rem:183s step 11767 (69%) loss:3.0814 lr:0.32 dt:37ms tok/s:1782965 rem:183s step 11768 (69%) loss:3.0607 lr:0.32 dt:37ms tok/s:1784446 rem:183s step 11769 (69%) loss:3.0521 lr:0.32 dt:36ms tok/s:1809955 rem:183s step 11770 (69%) loss:3.0520 lr:0.32 dt:36ms tok/s:1832263 rem:183s step 11771 (69%) loss:3.0310 lr:0.32 dt:36ms tok/s:1837321 rem:183s step 11772 (69%) loss:3.0438 lr:0.32 dt:36ms tok/s:1839805 rem:183s step 11773 (69%) loss:3.0395 lr:0.32 dt:36ms tok/s:1837554 rem:183s step 11774 (69%) loss:3.0343 lr:0.32 dt:38ms tok/s:1730849 rem:183s step 11775 (70%) loss:3.0348 lr:0.32 dt:36ms tok/s:1834611 rem:183s step 11776 (70%) loss:3.0432 lr:0.32 dt:36ms tok/s:1839633 rem:183s step 11777 (70%) loss:3.0488 lr:0.32 dt:36ms tok/s:1840421 rem:183s step 11778 (70%) loss:3.0486 lr:0.32 dt:36ms tok/s:1839448 rem:183s step 11779 (70%) loss:3.0289 lr:0.32 dt:36ms tok/s:1841173 rem:183s step 11780 (70%) loss:3.0244 lr:0.32 dt:36ms tok/s:1830603 rem:183s step 11781 (70%) loss:3.0318 lr:0.32 dt:36ms tok/s:1833913 rem:183s step 11782 (70%) loss:3.0417 lr:0.32 dt:36ms tok/s:1830116 rem:183s step 11783 (70%) loss:3.0489 lr:0.32 dt:36ms tok/s:1836241 rem:183s step 11784 (70%) loss:3.0566 lr:0.32 dt:36ms tok/s:1839977 rem:183s step 11785 (70%) loss:3.1168 lr:0.32 dt:36ms tok/s:1842679 rem:183s step 11786 (70%) loss:3.1508 lr:0.32 dt:36ms tok/s:1839977 rem:183s step 11787 (70%) loss:3.1711 lr:0.32 dt:36ms tok/s:1838033 rem:183s step 11788 (70%) loss:3.1238 lr:0.32 dt:36ms tok/s:1838968 rem:182s step 11789 (70%) loss:3.0967 lr:0.32 dt:36ms tok/s:1845227 rem:182s step 11790 (70%) loss:3.0936 lr:0.32 dt:36ms tok/s:1844175 rem:182s step 11791 (70%) loss:3.1067 lr:0.32 dt:36ms tok/s:1826104 rem:182s step 11792 (70%) loss:3.0976 lr:0.32 dt:36ms tok/s:1837382 rem:182s step 11793 (70%) loss:3.0874 lr:0.32 dt:36ms tok/s:1843383 rem:182s step 11794 (70%) loss:3.0721 lr:0.32 dt:36ms tok/s:1837960 rem:182s step 11795 (70%) loss:3.0746 lr:0.32 dt:36ms tok/s:1840470 rem:182s step 11796 (70%) loss:3.0767 lr:0.32 dt:36ms tok/s:1833033 rem:182s step 11797 (70%) loss:3.0633 lr:0.32 dt:36ms tok/s:1813430 rem:182s step 11798 (70%) loss:3.0447 lr:0.32 dt:37ms tok/s:1754951 rem:182s step 11799 (70%) loss:3.0410 lr:0.32 dt:36ms tok/s:1816102 rem:182s step 11800 (70%) loss:3.0549 lr:0.31 dt:36ms tok/s:1816954 rem:182s + local: attn=[0.128, 0.920, 0.995] mlp=[0.932, 0.359, -0.330] + + transition: attn=[3.297, 1.133] mlp=[-0.342, 0.944] + + hierarchy: attn=[3.478, 5.939, 5.616] mlp=[1.866, -1.545, -3.967] + step 11801 (70%) loss:3.0469 lr:0.31 dt:36ms tok/s:1819119 rem:182s step 11802 (70%) loss:3.0442 lr:0.31 dt:36ms tok/s:1818637 rem:182s step 11803 (70%) loss:3.0527 lr:0.31 dt:36ms tok/s:1812222 rem:182s step 11804 (70%) loss:3.0635 lr:0.31 dt:36ms tok/s:1823923 rem:182s step 11805 (70%) loss:3.0657 lr:0.31 dt:36ms tok/s:1812796 rem:182s step 11806 (70%) loss:3.0654 lr:0.31 dt:36ms tok/s:1820420 rem:182s step 11807 (70%) loss:3.0489 lr:0.31 dt:36ms tok/s:1816462 rem:182s step 11808 (70%) loss:3.0600 lr:0.31 dt:36ms tok/s:1814651 rem:182s step 11809 (70%) loss:3.0390 lr:0.31 dt:36ms tok/s:1816666 rem:182s step 11810 (70%) loss:3.0360 lr:0.31 dt:36ms tok/s:1817807 rem:182s step 11811 (70%) loss:3.0356 lr:0.31 dt:36ms tok/s:1813992 rem:182s step 11812 (70%) loss:3.0310 lr:0.31 dt:36ms tok/s:1820191 rem:182s step 11813 (70%) loss:3.0316 lr:0.31 dt:36ms tok/s:1810289 rem:182s step 11814 (70%) loss:3.0166 lr:0.31 dt:36ms tok/s:1797892 rem:182s step 11815 (70%) loss:3.0257 lr:0.31 dt:36ms tok/s:1801108 rem:182s step 11816 (70%) loss:3.0111 lr:0.31 dt:36ms tok/s:1819817 rem:181s step 11817 (70%) loss:3.0010 lr:0.31 dt:36ms tok/s:1817110 rem:181s step 11818 (70%) loss:2.9890 lr:0.31 dt:36ms tok/s:1815730 rem:181s step 11819 (70%) loss:2.9789 lr:0.31 dt:36ms tok/s:1814831 rem:181s step 11820 (70%) loss:2.9522 lr:0.31 dt:36ms tok/s:1808015 rem:181s step 11821 (70%) loss:2.9061 lr:0.31 dt:36ms tok/s:1815526 rem:181s step 11822 (70%) loss:2.9049 lr:0.31 dt:36ms tok/s:1815706 rem:181s step 11823 (70%) loss:2.9288 lr:0.31 dt:36ms tok/s:1809634 rem:181s step 11824 (70%) loss:2.9351 lr:0.31 dt:36ms tok/s:1812031 rem:181s step 11825 (70%) loss:2.9356 lr:0.31 dt:36ms tok/s:1820890 rem:181s step 11826 (70%) loss:2.9676 lr:0.31 dt:38ms tok/s:1727825 rem:181s step 11827 (70%) loss:3.0054 lr:0.31 dt:36ms tok/s:1814795 rem:181s step 11828 (70%) loss:3.0160 lr:0.31 dt:36ms tok/s:1819227 rem:181s step 11829 (70%) loss:3.0455 lr:0.31 dt:36ms tok/s:1806518 rem:181s step 11830 (70%) loss:3.0543 lr:0.31 dt:37ms tok/s:1784331 rem:181s step 11831 (70%) loss:3.0617 lr:0.31 dt:36ms tok/s:1807515 rem:181s step 11832 (70%) loss:3.0750 lr:0.31 dt:36ms tok/s:1809312 rem:181s step 11833 (70%) loss:3.0677 lr:0.31 dt:36ms tok/s:1815586 rem:181s step 11834 (70%) loss:3.0815 lr:0.31 dt:36ms tok/s:1805924 rem:181s step 11835 (70%) loss:3.0825 lr:0.31 dt:36ms tok/s:1810635 rem:181s step 11836 (70%) loss:3.0742 lr:0.31 dt:36ms tok/s:1809133 rem:181s step 11837 (70%) loss:3.0537 lr:0.31 dt:36ms tok/s:1810372 rem:181s step 11838 (70%) loss:3.0541 lr:0.31 dt:36ms tok/s:1814951 rem:181s step 11839 (70%) loss:3.0612 lr:0.31 dt:36ms tok/s:1811458 rem:181s step 11840 (70%) loss:3.0783 lr:0.31 dt:36ms tok/s:1813885 rem:181s step 11841 (70%) loss:3.0864 lr:0.31 dt:36ms tok/s:1812306 rem:181s step 11842 (70%) loss:3.0769 lr:0.31 dt:36ms tok/s:1808859 rem:181s step 11843 (70%) loss:3.0755 lr:0.31 dt:36ms tok/s:1804821 rem:180s step 11844 (70%) loss:3.0764 lr:0.31 dt:36ms tok/s:1807741 rem:180s step 11845 (70%) loss:3.0324 lr:0.31 dt:36ms tok/s:1808110 rem:180s step 11846 (70%) loss:2.9799 lr:0.31 dt:36ms tok/s:1815898 rem:180s step 11847 (70%) loss:2.9648 lr:0.31 dt:36ms tok/s:1813430 rem:180s step 11848 (70%) loss:2.9802 lr:0.31 dt:37ms tok/s:1770961 rem:180s step 11849 (70%) loss:3.0051 lr:0.31 dt:36ms tok/s:1814855 rem:180s step 11850 (70%) loss:2.9870 lr:0.31 dt:36ms tok/s:1811589 rem:180s step 11851 (70%) loss:2.9875 lr:0.31 dt:36ms tok/s:1812091 rem:180s step 11852 (70%) loss:2.9902 lr:0.31 dt:36ms tok/s:1814256 rem:180s step 11853 (70%) loss:3.0005 lr:0.31 dt:36ms tok/s:1803057 rem:180s step 11854 (70%) loss:2.9984 lr:0.31 dt:36ms tok/s:1813227 rem:180s step 11855 (70%) loss:2.9986 lr:0.31 dt:36ms tok/s:1810003 rem:180s step 11856 (70%) loss:2.9847 lr:0.31 dt:36ms tok/s:1813394 rem:180s step 11857 (70%) loss:2.9985 lr:0.31 dt:36ms tok/s:1808134 rem:180s step 11858 (70%) loss:2.9987 lr:0.31 dt:36ms tok/s:1810492 rem:180s step 11859 (70%) loss:3.0099 lr:0.31 dt:36ms tok/s:1811996 rem:180s step 11860 (70%) loss:3.0102 lr:0.31 dt:37ms tok/s:1785977 rem:180s step 11861 (70%) loss:3.0145 lr:0.31 dt:36ms tok/s:1811446 rem:180s step 11862 (70%) loss:2.9911 lr:0.31 dt:36ms tok/s:1811542 rem:180s step 11863 (70%) loss:2.9800 lr:0.31 dt:36ms tok/s:1811243 rem:180s step 11864 (70%) loss:2.9996 lr:0.31 dt:36ms tok/s:1808122 rem:180s step 11865 (70%) loss:3.0071 lr:0.31 dt:36ms tok/s:1809157 rem:180s step 11866 (70%) loss:3.0066 lr:0.31 dt:36ms tok/s:1817399 rem:180s step 11867 (70%) loss:3.0167 lr:0.31 dt:36ms tok/s:1813466 rem:180s step 11868 (70%) loss:3.0249 lr:0.31 dt:36ms tok/s:1810993 rem:180s step 11869 (70%) loss:3.0048 lr:0.31 dt:36ms tok/s:1815082 rem:180s step 11870 (70%) loss:2.9963 lr:0.31 dt:36ms tok/s:1812820 rem:180s step 11871 (70%) loss:2.9858 lr:0.31 dt:36ms tok/s:1812007 rem:179s step 11872 (70%) loss:2.9786 lr:0.31 dt:36ms tok/s:1812461 rem:179s step 11873 (70%) loss:2.9791 lr:0.31 dt:36ms tok/s:1808967 rem:179s step 11874 (70%) loss:2.9733 lr:0.31 dt:36ms tok/s:1805165 rem:179s step 11875 (70%) loss:2.9783 lr:0.31 dt:36ms tok/s:1813633 rem:179s step 11876 (70%) loss:3.0009 lr:0.31 dt:36ms tok/s:1810861 rem:179s step 11877 (70%) loss:3.0180 lr:0.31 dt:36ms tok/s:1808610 rem:179s step 11878 (70%) loss:3.0286 lr:0.31 dt:36ms tok/s:1817831 rem:179s step 11879 (70%) loss:3.0353 lr:0.31 dt:36ms tok/s:1810229 rem:179s step 11880 (70%) loss:3.0376 lr:0.31 dt:36ms tok/s:1809788 rem:179s step 11881 (70%) loss:3.0484 lr:0.31 dt:36ms tok/s:1807872 rem:179s step 11882 (70%) loss:3.0452 lr:0.31 dt:36ms tok/s:1812342 rem:179s step 11883 (70%) loss:3.0485 lr:0.31 dt:36ms tok/s:1816174 rem:179s step 11884 (70%) loss:3.0340 lr:0.31 dt:36ms tok/s:1810635 rem:179s step 11885 (70%) loss:3.0399 lr:0.31 dt:36ms tok/s:1811542 rem:179s step 11886 (70%) loss:3.0195 lr:0.31 dt:36ms tok/s:1820468 rem:179s step 11887 (70%) loss:3.0072 lr:0.31 dt:36ms tok/s:1808336 rem:179s step 11888 (70%) loss:3.0084 lr:0.31 dt:36ms tok/s:1813466 rem:179s step 11889 (70%) loss:2.9992 lr:0.31 dt:36ms tok/s:1807183 rem:179s step 11890 (70%) loss:3.0116 lr:0.31 dt:36ms tok/s:1810706 rem:179s step 11891 (70%) loss:2.9926 lr:0.30 dt:36ms tok/s:1804916 rem:179s step 11892 (70%) loss:2.9787 lr:0.30 dt:36ms tok/s:1806221 rem:179s step 11893 (70%) loss:2.9951 lr:0.30 dt:36ms tok/s:1812808 rem:179s step 11894 (70%) loss:3.0076 lr:0.30 dt:37ms tok/s:1784898 rem:179s step 11895 (70%) loss:3.0203 lr:0.30 dt:36ms tok/s:1816150 rem:179s step 11896 (70%) loss:3.0182 lr:0.30 dt:36ms tok/s:1797410 rem:179s step 11897 (70%) loss:3.0113 lr:0.30 dt:36ms tok/s:1810408 rem:179s step 11898 (70%) loss:2.9918 lr:0.30 dt:36ms tok/s:1800742 rem:178s step 11899 (70%) loss:2.9097 lr:0.30 dt:36ms tok/s:1813681 rem:178s step 11900 (70%) loss:2.8596 lr:0.30 dt:36ms tok/s:1810063 rem:178s + local: attn=[0.124, 0.937, 1.012] mlp=[0.955, 0.351, -0.341] + + transition: attn=[3.404, 1.138] mlp=[-0.352, 0.942] + + hierarchy: attn=[3.441, 5.939, 5.616] mlp=[1.918, -1.522, -4.044] + step 11901 (70%) loss:2.8513 lr:0.30 dt:36ms tok/s:1816138 rem:178s step 11902 (70%) loss:2.8698 lr:0.30 dt:36ms tok/s:1814855 rem:178s step 11903 (70%) loss:2.8871 lr:0.30 dt:36ms tok/s:1810850 rem:178s step 11904 (70%) loss:2.9041 lr:0.30 dt:36ms tok/s:1817170 rem:178s step 11905 (70%) loss:2.9155 lr:0.30 dt:38ms tok/s:1732300 rem:178s step 11906 (70%) loss:2.9338 lr:0.30 dt:36ms tok/s:1818324 rem:178s step 11907 (70%) loss:2.9513 lr:0.30 dt:36ms tok/s:1814004 rem:178s step 11908 (70%) loss:2.9625 lr:0.30 dt:36ms tok/s:1815250 rem:178s step 11909 (70%) loss:2.9872 lr:0.30 dt:36ms tok/s:1812079 rem:178s step 11910 (70%) loss:3.0170 lr:0.30 dt:36ms tok/s:1807908 rem:178s step 11911 (70%) loss:3.0543 lr:0.30 dt:36ms tok/s:1809455 rem:178s step 11912 (70%) loss:3.0656 lr:0.30 dt:36ms tok/s:1817843 rem:178s step 11913 (70%) loss:3.0617 lr:0.30 dt:36ms tok/s:1813466 rem:178s step 11914 (70%) loss:3.0634 lr:0.30 dt:36ms tok/s:1815466 rem:178s step 11915 (70%) loss:3.0633 lr:0.30 dt:36ms tok/s:1810492 rem:178s step 11916 (70%) loss:3.0579 lr:0.30 dt:36ms tok/s:1810778 rem:178s step 11917 (70%) loss:3.0480 lr:0.30 dt:36ms tok/s:1813933 rem:178s step 11918 (70%) loss:3.0567 lr:0.30 dt:37ms tok/s:1786940 rem:178s step 11919 (70%) loss:3.0534 lr:0.30 dt:37ms tok/s:1785316 rem:178s step 11920 (70%) loss:3.0520 lr:0.30 dt:36ms tok/s:1813550 rem:178s step 11921 (70%) loss:3.0423 lr:0.30 dt:36ms tok/s:1810253 rem:178s step 11922 (70%) loss:3.0472 lr:0.30 dt:36ms tok/s:1807670 rem:178s step 11923 (70%) loss:3.0240 lr:0.30 dt:36ms tok/s:1813298 rem:178s step 11924 (70%) loss:2.9588 lr:0.30 dt:36ms tok/s:1808990 rem:178s step 11925 (70%) loss:2.9573 lr:0.30 dt:36ms tok/s:1806909 rem:178s step 11926 (70%) loss:2.9581 lr:0.30 dt:36ms tok/s:1813921 rem:177s step 11927 (70%) loss:2.9506 lr:0.30 dt:36ms tok/s:1815946 rem:177s step 11928 (70%) loss:2.9523 lr:0.30 dt:36ms tok/s:1816522 rem:177s step 11929 (70%) loss:2.9763 lr:0.30 dt:36ms tok/s:1815826 rem:177s step 11930 (70%) loss:2.9698 lr:0.30 dt:36ms tok/s:1820215 rem:177s step 11931 (70%) loss:2.9869 lr:0.30 dt:36ms tok/s:1813574 rem:177s step 11932 (70%) loss:2.9867 lr:0.30 dt:36ms tok/s:1810683 rem:177s step 11933 (70%) loss:2.9852 lr:0.30 dt:36ms tok/s:1813969 rem:177s step 11934 (70%) loss:2.9916 lr:0.30 dt:36ms tok/s:1810039 rem:177s step 11935 (70%) loss:2.9796 lr:0.30 dt:36ms tok/s:1816138 rem:177s step 11936 (70%) loss:2.9748 lr:0.30 dt:36ms tok/s:1808955 rem:177s step 11937 (70%) loss:2.9673 lr:0.30 dt:36ms tok/s:1819275 rem:177s step 11938 (70%) loss:2.9540 lr:0.30 dt:36ms tok/s:1814591 rem:177s step 11939 (70%) loss:2.9655 lr:0.30 dt:36ms tok/s:1803519 rem:177s step 11940 (70%) loss:3.0172 lr:0.30 dt:36ms tok/s:1812497 rem:177s step 11941 (71%) loss:3.0876 lr:0.30 dt:36ms tok/s:1814675 rem:177s step 11942 (71%) loss:3.0992 lr:0.30 dt:36ms tok/s:1809348 rem:177s step 11943 (71%) loss:3.0908 lr:0.30 dt:36ms tok/s:1806993 rem:177s step 11944 (71%) loss:3.0889 lr:0.30 dt:36ms tok/s:1806933 rem:177s step 11945 (71%) loss:3.0879 lr:0.30 dt:36ms tok/s:1807408 rem:177s step 11946 (71%) loss:3.0726 lr:0.30 dt:38ms tok/s:1711323 rem:177s step 11947 (71%) loss:3.0660 lr:0.30 dt:36ms tok/s:1840778 rem:177s step 11948 (71%) loss:3.0730 lr:0.30 dt:36ms tok/s:1801439 rem:177s step 11949 (71%) loss:3.0621 lr:0.30 dt:36ms tok/s:1817122 rem:177s step 11950 (71%) loss:3.0580 lr:0.30 dt:36ms tok/s:1814879 rem:177s step 11951 (71%) loss:3.0506 lr:0.30 dt:36ms tok/s:1810301 rem:177s step 11952 (71%) loss:3.0513 lr:0.30 dt:36ms tok/s:1811255 rem:177s step 11953 (71%) loss:3.0488 lr:0.30 dt:36ms tok/s:1810790 rem:177s step 11954 (71%) loss:3.0530 lr:0.30 dt:36ms tok/s:1816246 rem:176s step 11955 (71%) loss:3.0436 lr:0.30 dt:36ms tok/s:1812270 rem:176s step 11956 (71%) loss:3.0599 lr:0.30 dt:36ms tok/s:1815838 rem:176s step 11957 (71%) loss:3.0576 lr:0.30 dt:36ms tok/s:1811673 rem:176s step 11958 (71%) loss:3.0681 lr:0.30 dt:36ms tok/s:1813191 rem:176s step 11959 (71%) loss:3.0627 lr:0.30 dt:36ms tok/s:1819010 rem:176s step 11960 (71%) loss:3.0477 lr:0.30 dt:36ms tok/s:1811685 rem:176s step 11961 (71%) loss:3.0370 lr:0.30 dt:36ms tok/s:1817723 rem:176s step 11962 (71%) loss:3.0320 lr:0.30 dt:37ms tok/s:1767510 rem:176s step 11963 (71%) loss:3.0257 lr:0.30 dt:36ms tok/s:1807908 rem:176s step 11964 (71%) loss:3.0238 lr:0.30 dt:36ms tok/s:1803389 rem:176s step 11965 (71%) loss:3.0221 lr:0.30 dt:36ms tok/s:1809121 rem:176s step 11966 (71%) loss:3.0265 lr:0.30 dt:36ms tok/s:1814591 rem:176s step 11967 (71%) loss:3.0080 lr:0.30 dt:36ms tok/s:1812354 rem:176s step 11968 (71%) loss:3.0171 lr:0.30 dt:36ms tok/s:1804525 rem:176s step 11969 (71%) loss:3.0164 lr:0.30 dt:36ms tok/s:1804513 rem:176s step 11970 (71%) loss:3.0173 lr:0.30 dt:36ms tok/s:1812294 rem:176s step 11971 (71%) loss:3.0240 lr:0.30 dt:36ms tok/s:1812450 rem:176s step 11972 (71%) loss:3.0204 lr:0.30 dt:36ms tok/s:1804999 rem:176s step 11973 (71%) loss:3.0157 lr:0.30 dt:36ms tok/s:1811363 rem:176s step 11974 (71%) loss:3.0647 lr:0.30 dt:36ms tok/s:1806114 rem:176s step 11975 (71%) loss:3.0575 lr:0.30 dt:36ms tok/s:1808812 rem:176s step 11976 (71%) loss:3.0457 lr:0.30 dt:36ms tok/s:1810063 rem:176s step 11977 (71%) loss:3.0456 lr:0.30 dt:36ms tok/s:1813011 rem:176s step 11978 (71%) loss:3.0323 lr:0.30 dt:36ms tok/s:1826710 rem:176s step 11979 (71%) loss:3.0513 lr:0.30 dt:36ms tok/s:1805082 rem:176s step 11980 (71%) loss:3.0601 lr:0.30 dt:36ms tok/s:1809300 rem:176s step 11981 (71%) loss:3.0508 lr:0.30 dt:36ms tok/s:1840211 rem:175s step 11982 (71%) loss:3.0425 lr:0.30 dt:36ms tok/s:1839054 rem:175s step 11983 (71%) loss:2.9772 lr:0.29 dt:36ms tok/s:1832165 rem:175s step 11984 (71%) loss:2.9583 lr:0.29 dt:36ms tok/s:1830811 rem:175s step 11985 (71%) loss:2.9798 lr:0.29 dt:36ms tok/s:1833204 rem:175s step 11986 (71%) loss:2.9609 lr:0.29 dt:36ms tok/s:1835346 rem:175s step 11987 (71%) loss:2.9698 lr:0.29 dt:36ms tok/s:1823753 rem:175s step 11988 (71%) loss:2.9762 lr:0.29 dt:36ms tok/s:1824831 rem:175s step 11989 (71%) loss:2.9886 lr:0.29 dt:36ms tok/s:1833509 rem:175s step 11990 (71%) loss:3.0010 lr:0.29 dt:36ms tok/s:1829811 rem:175s step 11991 (71%) loss:2.9841 lr:0.29 dt:36ms tok/s:1833693 rem:175s step 11992 (71%) loss:2.9586 lr:0.29 dt:36ms tok/s:1830152 rem:175s step 11993 (71%) loss:2.9459 lr:0.29 dt:36ms tok/s:1831665 rem:175s step 11994 (71%) loss:2.9391 lr:0.29 dt:36ms tok/s:1824395 rem:175s step 11995 (71%) loss:2.9354 lr:0.29 dt:36ms tok/s:1832605 rem:175s step 11996 (71%) loss:2.9225 lr:0.29 dt:36ms tok/s:1833644 rem:175s step 11997 (71%) loss:2.9328 lr:0.29 dt:36ms tok/s:1838046 rem:175s step 11998 (71%) loss:2.9516 lr:0.29 dt:36ms tok/s:1842592 rem:175s step 11999 (71%) loss:2.9767 lr:0.29 dt:36ms tok/s:1832080 rem:175s step 12000 (71%) loss:2.9955 lr:0.29 dt:36ms tok/s:1824310 rem:175s + local: attn=[0.126, 0.975, 1.026] mlp=[0.967, 0.357, -0.352] + + transition: attn=[3.335, 1.152] mlp=[-0.358, 0.959] + + hierarchy: attn=[3.500, 5.939, 5.616] mlp=[1.959, -1.553, -4.125] + step 12001 (71%) loss:2.9991 lr:0.29 dt:36ms tok/s:1832849 rem:175s step 12002 (71%) loss:3.0118 lr:0.29 dt:36ms tok/s:1828691 rem:175s step 12003 (71%) loss:3.0175 lr:0.29 dt:36ms tok/s:1832568 rem:175s step 12004 (71%) loss:3.0187 lr:0.29 dt:36ms tok/s:1836609 rem:175s step 12005 (71%) loss:3.0230 lr:0.29 dt:36ms tok/s:1829555 rem:175s step 12006 (71%) loss:3.0050 lr:0.29 dt:36ms tok/s:1815598 rem:175s step 12007 (71%) loss:3.0102 lr:0.29 dt:37ms tok/s:1755915 rem:175s step 12008 (71%) loss:3.0001 lr:0.29 dt:36ms tok/s:1834403 rem:175s step 12009 (71%) loss:3.0161 lr:0.29 dt:36ms tok/s:1829921 rem:174s step 12010 (71%) loss:3.0118 lr:0.29 dt:36ms tok/s:1831933 rem:174s step 12011 (71%) loss:3.0236 lr:0.29 dt:36ms tok/s:1837628 rem:174s step 12012 (71%) loss:3.0253 lr:0.29 dt:36ms tok/s:1828849 rem:174s step 12013 (71%) loss:3.0133 lr:0.29 dt:36ms tok/s:1832238 rem:174s step 12014 (71%) loss:2.9902 lr:0.29 dt:36ms tok/s:1833118 rem:174s step 12015 (71%) loss:2.9605 lr:0.29 dt:36ms tok/s:1834917 rem:174s step 12016 (71%) loss:2.9117 lr:0.29 dt:36ms tok/s:1836818 rem:174s step 12017 (71%) loss:2.8623 lr:0.29 dt:36ms tok/s:1830079 rem:174s step 12018 (71%) loss:2.8149 lr:0.29 dt:36ms tok/s:1824661 rem:174s step 12019 (71%) loss:2.7654 lr:0.29 dt:36ms tok/s:1822508 rem:174s step 12020 (71%) loss:2.6440 lr:0.29 dt:36ms tok/s:1829251 rem:174s step 12021 (71%) loss:2.6931 lr:0.29 dt:36ms tok/s:1834341 rem:174s step 12022 (71%) loss:2.7289 lr:0.29 dt:36ms tok/s:1834770 rem:174s step 12023 (71%) loss:2.7634 lr:0.29 dt:36ms tok/s:1836560 rem:174s step 12024 (71%) loss:2.8170 lr:0.29 dt:36ms tok/s:1826662 rem:174s step 12025 (71%) loss:2.8556 lr:0.29 dt:36ms tok/s:1831152 rem:174s step 12026 (71%) loss:2.8717 lr:0.29 dt:36ms tok/s:1837210 rem:174s step 12027 (71%) loss:2.9062 lr:0.29 dt:36ms tok/s:1834133 rem:174s step 12028 (71%) loss:2.9147 lr:0.29 dt:36ms tok/s:1828764 rem:174s step 12029 (71%) loss:2.9310 lr:0.29 dt:36ms tok/s:1828959 rem:174s step 12030 (71%) loss:2.9252 lr:0.29 dt:36ms tok/s:1833901 rem:174s step 12031 (71%) loss:2.9579 lr:0.29 dt:36ms tok/s:1826978 rem:174s step 12032 (71%) loss:2.9880 lr:0.29 dt:35ms tok/s:1855077 rem:174s step 12033 (71%) loss:2.9855 lr:0.29 dt:36ms tok/s:1845512 rem:174s step 12034 (71%) loss:3.0028 lr:0.29 dt:36ms tok/s:1832996 rem:174s step 12035 (71%) loss:2.9950 lr:0.29 dt:36ms tok/s:1841000 rem:174s step 12036 (71%) loss:2.9984 lr:0.29 dt:37ms tok/s:1784284 rem:174s step 12037 (71%) loss:3.0068 lr:0.29 dt:36ms tok/s:1813538 rem:173s step 12038 (71%) loss:3.0194 lr:0.29 dt:36ms tok/s:1830213 rem:173s step 12039 (71%) loss:3.0231 lr:0.29 dt:36ms tok/s:1831201 rem:173s step 12040 (71%) loss:3.0165 lr:0.29 dt:36ms tok/s:1831921 rem:173s step 12041 (71%) loss:3.0156 lr:0.29 dt:36ms tok/s:1831628 rem:173s step 12042 (71%) loss:3.0177 lr:0.29 dt:36ms tok/s:1841074 rem:173s step 12043 (71%) loss:3.0166 lr:0.29 dt:36ms tok/s:1821855 rem:173s step 12044 (71%) loss:3.0053 lr:0.29 dt:36ms tok/s:1826092 rem:173s step 12045 (71%) loss:2.9947 lr:0.29 dt:36ms tok/s:1824831 rem:173s step 12046 (71%) loss:2.9813 lr:0.29 dt:36ms tok/s:1825073 rem:173s step 12047 (71%) loss:3.0038 lr:0.29 dt:36ms tok/s:1831262 rem:173s step 12048 (71%) loss:3.0074 lr:0.29 dt:36ms tok/s:1830262 rem:173s step 12049 (71%) loss:2.9972 lr:0.29 dt:36ms tok/s:1834415 rem:173s step 12050 (71%) loss:2.9956 lr:0.29 dt:36ms tok/s:1833424 rem:173s step 12051 (71%) loss:2.9845 lr:0.29 dt:36ms tok/s:1829422 rem:173s step 12052 (71%) loss:2.9810 lr:0.29 dt:36ms tok/s:1825594 rem:173s step 12053 (71%) loss:2.9810 lr:0.29 dt:36ms tok/s:1833987 rem:173s step 12054 (71%) loss:2.9656 lr:0.29 dt:36ms tok/s:1831518 rem:173s step 12055 (71%) loss:2.9677 lr:0.29 dt:36ms tok/s:1827342 rem:173s step 12056 (71%) loss:2.9611 lr:0.29 dt:36ms tok/s:1828801 rem:173s step 12057 (71%) loss:2.9672 lr:0.29 dt:36ms tok/s:1828849 rem:173s step 12058 (71%) loss:2.9891 lr:0.29 dt:36ms tok/s:1829750 rem:173s step 12059 (71%) loss:2.9816 lr:0.29 dt:36ms tok/s:1830262 rem:173s step 12060 (71%) loss:2.9736 lr:0.29 dt:36ms tok/s:1830506 rem:173s step 12061 (71%) loss:2.9740 lr:0.29 dt:36ms tok/s:1828436 rem:173s step 12062 (71%) loss:2.9665 lr:0.29 dt:36ms tok/s:1826043 rem:173s step 12063 (71%) loss:2.9732 lr:0.29 dt:36ms tok/s:1826941 rem:173s step 12064 (71%) loss:2.9815 lr:0.29 dt:36ms tok/s:1828375 rem:173s step 12065 (71%) loss:2.9832 lr:0.29 dt:36ms tok/s:1832153 rem:172s step 12066 (71%) loss:2.9962 lr:0.29 dt:36ms tok/s:1839263 rem:172s step 12067 (71%) loss:3.0162 lr:0.29 dt:36ms tok/s:1837751 rem:172s step 12068 (71%) loss:3.0026 lr:0.29 dt:36ms tok/s:1833155 rem:172s step 12069 (71%) loss:3.0003 lr:0.29 dt:36ms tok/s:1835040 rem:172s step 12070 (71%) loss:3.0011 lr:0.29 dt:36ms tok/s:1831103 rem:172s step 12071 (71%) loss:2.9872 lr:0.29 dt:36ms tok/s:1830933 rem:172s step 12072 (71%) loss:2.9860 lr:0.29 dt:36ms tok/s:1824395 rem:172s step 12073 (71%) loss:3.0616 lr:0.29 dt:36ms tok/s:1829823 rem:172s step 12074 (71%) loss:3.0910 lr:0.29 dt:36ms tok/s:1834917 rem:172s step 12075 (71%) loss:3.0885 lr:0.29 dt:41ms tok/s:1618147 rem:172s step 12076 (71%) loss:3.0857 lr:0.29 dt:38ms tok/s:1744747 rem:172s step 12077 (71%) loss:3.0678 lr:0.28 dt:35ms tok/s:1851815 rem:172s step 12078 (71%) loss:3.0839 lr:0.28 dt:35ms tok/s:1846851 rem:172s step 12079 (71%) loss:3.1212 lr:0.28 dt:36ms tok/s:1844992 rem:172s step 12080 (71%) loss:3.1445 lr:0.28 dt:36ms tok/s:1845784 rem:172s step 12081 (71%) loss:3.1365 lr:0.28 dt:36ms tok/s:1815706 rem:172s step 12082 (71%) loss:3.1234 lr:0.28 dt:35ms tok/s:1854902 rem:172s step 12083 (71%) loss:3.1349 lr:0.28 dt:35ms tok/s:1853364 rem:172s step 12084 (71%) loss:3.1282 lr:0.28 dt:35ms tok/s:1850992 rem:172s step 12085 (71%) loss:3.1116 lr:0.28 dt:35ms tok/s:1855265 rem:172s step 12086 (71%) loss:3.1018 lr:0.28 dt:36ms tok/s:1845214 rem:172s step 12087 (71%) loss:3.0862 lr:0.28 dt:36ms tok/s:1824540 rem:172s step 12088 (71%) loss:3.0981 lr:0.28 dt:39ms tok/s:1685478 rem:172s step 12089 (71%) loss:3.0874 lr:0.28 dt:36ms tok/s:1835358 rem:172s step 12090 (71%) loss:3.0627 lr:0.28 dt:35ms tok/s:1850344 rem:172s step 12091 (71%) loss:3.0513 lr:0.28 dt:35ms tok/s:1861623 rem:172s step 12092 (71%) loss:3.0484 lr:0.28 dt:35ms tok/s:1861711 rem:172s step 12093 (71%) loss:3.0395 lr:0.28 dt:36ms tok/s:1805556 rem:171s step 12094 (71%) loss:3.0158 lr:0.28 dt:36ms tok/s:1836646 rem:171s step 12095 (71%) loss:3.0035 lr:0.28 dt:35ms tok/s:1867707 rem:171s step 12096 (71%) loss:3.0264 lr:0.28 dt:35ms tok/s:1859444 rem:171s step 12097 (71%) loss:3.0164 lr:0.28 dt:35ms tok/s:1869880 rem:171s step 12098 (71%) loss:2.9889 lr:0.28 dt:35ms tok/s:1884649 rem:171s step 12099 (71%) loss:2.9544 lr:0.28 dt:35ms tok/s:1873359 rem:171s step 12100 (71%) loss:2.9246 lr:0.28 dt:35ms tok/s:1855465 rem:171s + local: attn=[0.138, 0.944, 1.022] mlp=[0.976, 0.376, -0.342] + + transition: attn=[3.399, 1.157] mlp=[-0.373, 0.989] + + hierarchy: attn=[3.532, 5.939, 5.616] mlp=[1.994, -1.621, -4.220] + step 12101 (71%) loss:2.8671 lr:0.28 dt:36ms tok/s:1825461 rem:171s step 12102 (71%) loss:2.8543 lr:0.28 dt:35ms tok/s:1852352 rem:171s step 12103 (71%) loss:2.8548 lr:0.28 dt:37ms tok/s:1748965 rem:171s step 12104 (71%) loss:2.8872 lr:0.28 dt:35ms tok/s:1893490 rem:171s step 12105 (71%) loss:2.9009 lr:0.28 dt:35ms tok/s:1855378 rem:171s step 12106 (71%) loss:2.9061 lr:0.28 dt:35ms tok/s:1874151 rem:171s step 12107 (71%) loss:2.9248 lr:0.28 dt:35ms tok/s:1862140 rem:171s step 12108 (72%) loss:2.9295 lr:0.28 dt:35ms tok/s:1875621 rem:171s step 12109 (72%) loss:2.9334 lr:0.28 dt:35ms tok/s:1870325 rem:171s step 12110 (72%) loss:3.0137 lr:0.28 dt:35ms tok/s:1867720 rem:171s step 12111 (72%) loss:3.0268 lr:0.28 dt:36ms tok/s:1830274 rem:171s step 12112 (72%) loss:3.0324 lr:0.28 dt:35ms tok/s:1883564 rem:171s step 12113 (72%) loss:3.0338 lr:0.28 dt:35ms tok/s:1896311 rem:171s step 12114 (72%) loss:3.0328 lr:0.28 dt:36ms tok/s:1831604 rem:171s step 12115 (72%) loss:3.0479 lr:0.28 dt:35ms tok/s:1864363 rem:171s step 12116 (72%) loss:3.0673 lr:0.28 dt:35ms tok/s:1894181 rem:171s step 12117 (72%) loss:3.0482 lr:0.28 dt:34ms tok/s:1901651 rem:171s step 12118 (72%) loss:3.0403 lr:0.28 dt:34ms tok/s:1899758 rem:171s step 12119 (72%) loss:3.0432 lr:0.28 dt:35ms tok/s:1895788 rem:171s step 12120 (72%) loss:3.0492 lr:0.28 dt:35ms tok/s:1893425 rem:171s step 12121 (72%) loss:3.0417 lr:0.28 dt:34ms tok/s:1902348 rem:170s step 12122 (72%) loss:3.0488 lr:0.28 dt:34ms tok/s:1902322 rem:170s step 12123 (72%) loss:3.0504 lr:0.28 dt:35ms tok/s:1892512 rem:170s step 12124 (72%) loss:3.0578 lr:0.28 dt:34ms tok/s:1902691 rem:170s step 12125 (72%) loss:3.0621 lr:0.28 dt:35ms tok/s:1893099 rem:170s step 12126 (72%) loss:3.0626 lr:0.28 dt:35ms tok/s:1888559 rem:170s step 12127 (72%) loss:3.0546 lr:0.28 dt:35ms tok/s:1881797 rem:170s step 12128 (72%) loss:3.0522 lr:0.28 dt:35ms tok/s:1877774 rem:170s step 12129 (72%) loss:3.0471 lr:0.28 dt:35ms tok/s:1888170 rem:170s step 12130 (72%) loss:3.0371 lr:0.28 dt:35ms tok/s:1868456 rem:170s step 12131 (72%) loss:3.0211 lr:0.28 dt:35ms tok/s:1870987 rem:170s step 12132 (72%) loss:3.0139 lr:0.28 dt:35ms tok/s:1875288 rem:170s step 12133 (72%) loss:3.0258 lr:0.28 dt:36ms tok/s:1842271 rem:170s step 12134 (72%) loss:3.0329 lr:0.28 dt:35ms tok/s:1871305 rem:170s step 12135 (72%) loss:3.0146 lr:0.28 dt:35ms tok/s:1848192 rem:170s step 12136 (72%) loss:3.0029 lr:0.28 dt:35ms tok/s:1898013 rem:170s step 12137 (72%) loss:2.9958 lr:0.28 dt:34ms tok/s:1907576 rem:170s step 12138 (72%) loss:2.9918 lr:0.28 dt:35ms tok/s:1898630 rem:170s step 12139 (72%) loss:2.9948 lr:0.28 dt:35ms tok/s:1893568 rem:170s step 12140 (72%) loss:2.9766 lr:0.28 dt:35ms tok/s:1881617 rem:170s step 12141 (72%) loss:2.9715 lr:0.28 dt:35ms tok/s:1879109 rem:170s step 12142 (72%) loss:2.9655 lr:0.28 dt:35ms tok/s:1889988 rem:170s step 12143 (72%) loss:2.9762 lr:0.28 dt:35ms tok/s:1877748 rem:170s step 12144 (72%) loss:2.9976 lr:0.28 dt:35ms tok/s:1883900 rem:170s step 12145 (72%) loss:3.0161 lr:0.28 dt:35ms tok/s:1872593 rem:170s step 12146 (72%) loss:3.0331 lr:0.28 dt:35ms tok/s:1860312 rem:170s step 12147 (72%) loss:3.0352 lr:0.28 dt:35ms tok/s:1854489 rem:170s step 12148 (72%) loss:3.0353 lr:0.28 dt:35ms tok/s:1863541 rem:170s step 12149 (72%) loss:3.0649 lr:0.28 dt:35ms tok/s:1857522 rem:170s step 12150 (72%) loss:3.0604 lr:0.28 dt:35ms tok/s:1859847 rem:169s step 12151 (72%) loss:3.0645 lr:0.28 dt:36ms tok/s:1829154 rem:169s step 12152 (72%) loss:3.0399 lr:0.28 dt:35ms tok/s:1850556 rem:169s step 12153 (72%) loss:3.0372 lr:0.28 dt:35ms tok/s:1863427 rem:169s step 12154 (72%) loss:3.0554 lr:0.28 dt:35ms tok/s:1864831 rem:169s step 12155 (72%) loss:3.0496 lr:0.28 dt:35ms tok/s:1861522 rem:169s step 12156 (72%) loss:3.0353 lr:0.28 dt:35ms tok/s:1867352 rem:169s step 12157 (72%) loss:3.0435 lr:0.28 dt:35ms tok/s:1854827 rem:169s step 12158 (72%) loss:3.0463 lr:0.28 dt:35ms tok/s:1863427 rem:169s step 12159 (72%) loss:3.0436 lr:0.28 dt:35ms tok/s:1860602 rem:169s step 12160 (72%) loss:3.0348 lr:0.28 dt:35ms tok/s:1862998 rem:169s step 12161 (72%) loss:3.0269 lr:0.28 dt:35ms tok/s:1875634 rem:169s step 12162 (72%) loss:3.0217 lr:0.28 dt:35ms tok/s:1864034 rem:169s step 12163 (72%) loss:3.0218 lr:0.28 dt:35ms tok/s:1867656 rem:169s step 12164 (72%) loss:3.0014 lr:0.28 dt:35ms tok/s:1856556 rem:169s step 12165 (72%) loss:3.0037 lr:0.28 dt:35ms tok/s:1850743 rem:169s step 12166 (72%) loss:2.9949 lr:0.28 dt:35ms tok/s:1855165 rem:169s step 12167 (72%) loss:3.0040 lr:0.28 dt:35ms tok/s:1855766 rem:169s step 12168 (72%) loss:2.9913 lr:0.28 dt:35ms tok/s:1847633 rem:169s step 12169 (72%) loss:2.9824 lr:0.28 dt:36ms tok/s:1805426 rem:169s step 12170 (72%) loss:2.9858 lr:0.28 dt:35ms tok/s:1852602 rem:169s step 12171 (72%) loss:2.9871 lr:0.28 dt:35ms tok/s:1854189 rem:169s step 12172 (72%) loss:2.9868 lr:0.28 dt:35ms tok/s:1852127 rem:169s step 12173 (72%) loss:2.9839 lr:0.28 dt:35ms tok/s:1851454 rem:169s step 12174 (72%) loss:2.9710 lr:0.27 dt:35ms tok/s:1859809 rem:169s step 12175 (72%) loss:2.9821 lr:0.27 dt:35ms tok/s:1856656 rem:169s step 12176 (72%) loss:2.9764 lr:0.27 dt:35ms tok/s:1864047 rem:169s step 12177 (72%) loss:2.9592 lr:0.27 dt:35ms tok/s:1859973 rem:169s step 12178 (72%) loss:2.9554 lr:0.27 dt:35ms tok/s:1851080 rem:168s step 12179 (72%) loss:2.9611 lr:0.27 dt:35ms tok/s:1850382 rem:168s step 12180 (72%) loss:2.9550 lr:0.27 dt:36ms tok/s:1824032 rem:168s step 12181 (72%) loss:2.9544 lr:0.27 dt:36ms tok/s:1839830 rem:168s step 12182 (72%) loss:2.9598 lr:0.27 dt:36ms tok/s:1826613 rem:168s step 12183 (72%) loss:2.9385 lr:0.27 dt:36ms tok/s:1840815 rem:168s step 12184 (72%) loss:2.9517 lr:0.27 dt:102ms tok/s:642684 rem:168s step 12185 (72%) loss:2.9833 lr:0.27 dt:33ms tok/s:2006350 rem:168s step 12186 (72%) loss:2.9857 lr:0.27 dt:32ms tok/s:2030073 rem:168s step 12187 (72%) loss:3.0031 lr:0.27 dt:32ms tok/s:2044508 rem:168s step 12188 (72%) loss:2.9925 lr:0.27 dt:31ms tok/s:2092044 rem:168s step 12189 (72%) loss:2.9989 lr:0.27 dt:31ms tok/s:2130853 rem:168s step 12190 (72%) loss:3.0043 lr:0.27 dt:31ms tok/s:2104296 rem:168s step 12191 (72%) loss:2.9940 lr:0.27 dt:31ms tok/s:2106005 rem:168s step 12192 (72%) loss:2.9989 lr:0.27 dt:31ms tok/s:2100309 rem:168s step 12193 (72%) loss:2.9884 lr:0.27 dt:31ms tok/s:2088278 rem:168s step 12194 (72%) loss:2.9995 lr:0.27 dt:31ms tok/s:2083213 rem:168s step 12195 (72%) loss:2.9849 lr:0.27 dt:32ms tok/s:2070799 rem:168s step 12196 (72%) loss:2.9926 lr:0.27 dt:32ms tok/s:2033076 rem:168s step 12197 (72%) loss:2.9944 lr:0.27 dt:32ms tok/s:2034354 rem:168s step 12198 (72%) loss:2.9909 lr:0.27 dt:33ms tok/s:1999956 rem:168s step 12199 (72%) loss:3.0011 lr:0.27 dt:33ms tok/s:1984091 rem:168s step 12200 (72%) loss:3.0079 lr:0.27 dt:33ms tok/s:2013935 rem:168s + local: attn=[0.132, 0.961, 1.031] mlp=[1.002, 0.381, -0.370] + + transition: attn=[3.400, 1.151] mlp=[-0.381, 1.026] + + hierarchy: attn=[3.478, 5.939, 5.616] mlp=[2.043, -1.687, -4.307] + step 12201 (72%) loss:3.0093 lr:0.27 dt:33ms tok/s:1979390 rem:168s step 12202 (72%) loss:3.0073 lr:0.27 dt:33ms tok/s:1966054 rem:168s step 12203 (72%) loss:3.0068 lr:0.27 dt:33ms tok/s:1978364 rem:168s step 12204 (72%) loss:3.0303 lr:0.27 dt:34ms tok/s:1948356 rem:168s step 12205 (72%) loss:3.0261 lr:0.27 dt:34ms tok/s:1948107 rem:168s step 12206 (72%) loss:3.0213 lr:0.27 dt:34ms tok/s:1928562 rem:167s step 12207 (72%) loss:3.0234 lr:0.27 dt:34ms tok/s:1936087 rem:167s step 12208 (72%) loss:3.0696 lr:0.27 dt:34ms tok/s:1937506 rem:167s step 12209 (72%) loss:3.0686 lr:0.27 dt:34ms tok/s:1932304 rem:167s step 12210 (72%) loss:3.0618 lr:0.27 dt:34ms tok/s:1941996 rem:167s step 12211 (72%) loss:3.0445 lr:0.27 dt:34ms tok/s:1939078 rem:167s step 12212 (72%) loss:3.0473 lr:0.27 dt:34ms tok/s:1940091 rem:167s step 12213 (72%) loss:3.0673 lr:0.27 dt:34ms tok/s:1932263 rem:167s step 12214 (72%) loss:3.0458 lr:0.27 dt:34ms tok/s:1922263 rem:167s step 12215 (72%) loss:3.0371 lr:0.27 dt:34ms tok/s:1915472 rem:167s step 12216 (72%) loss:3.0520 lr:0.27 dt:42ms tok/s:1561904 rem:167s step 12217 (72%) loss:3.0299 lr:0.27 dt:34ms tok/s:1928900 rem:167s step 12218 (72%) loss:3.0118 lr:0.27 dt:33ms tok/s:1971794 rem:167s step 12219 (72%) loss:2.9949 lr:0.27 dt:32ms tok/s:2018860 rem:167s step 12220 (72%) loss:2.9997 lr:0.27 dt:33ms tok/s:2000494 rem:167s step 12221 (72%) loss:2.9926 lr:0.27 dt:33ms tok/s:1974839 rem:167s step 12222 (72%) loss:2.9896 lr:0.27 dt:33ms tok/s:1978521 rem:167s step 12223 (72%) loss:3.0153 lr:0.27 dt:33ms tok/s:1980959 rem:167s step 12224 (72%) loss:3.0068 lr:0.27 dt:34ms tok/s:1946741 rem:167s step 12225 (72%) loss:3.0057 lr:0.27 dt:34ms tok/s:1917262 rem:167s step 12226 (72%) loss:3.0528 lr:0.27 dt:34ms tok/s:1953951 rem:167s step 12227 (72%) loss:3.0665 lr:0.27 dt:34ms tok/s:1952216 rem:167s step 12228 (72%) loss:3.0441 lr:0.27 dt:34ms tok/s:1951343 rem:167s step 12229 (72%) loss:3.0525 lr:0.27 dt:34ms tok/s:1942037 rem:167s step 12230 (72%) loss:3.0464 lr:0.27 dt:36ms tok/s:1811661 rem:167s step 12231 (72%) loss:3.0475 lr:0.27 dt:34ms tok/s:1934779 rem:167s step 12232 (72%) loss:3.0258 lr:0.27 dt:34ms tok/s:1931815 rem:167s step 12233 (72%) loss:3.0187 lr:0.27 dt:34ms tok/s:1931190 rem:167s step 12234 (72%) loss:2.9983 lr:0.27 dt:45ms tok/s:1455852 rem:167s step 12235 (72%) loss:2.9865 lr:0.27 dt:33ms tok/s:2014673 rem:167s step 12236 (72%) loss:2.9800 lr:0.27 dt:33ms tok/s:1996571 rem:166s step 12237 (72%) loss:2.9964 lr:0.27 dt:33ms tok/s:2006701 rem:166s step 12238 (72%) loss:3.0045 lr:0.27 dt:33ms tok/s:1982045 rem:166s step 12239 (72%) loss:3.0123 lr:0.27 dt:33ms tok/s:1989346 rem:166s step 12240 (72%) loss:3.0036 lr:0.27 dt:53ms tok/s:1247047 rem:166s step 12241 (72%) loss:2.9954 lr:0.27 dt:32ms tok/s:2030492 rem:166s step 12242 (72%) loss:2.9892 lr:0.27 dt:31ms tok/s:2095873 rem:166s step 12243 (72%) loss:2.9900 lr:0.27 dt:31ms tok/s:2123872 rem:166s step 12244 (72%) loss:2.9977 lr:0.27 dt:31ms tok/s:2118307 rem:166s step 12245 (72%) loss:2.9920 lr:0.27 dt:31ms tok/s:2103105 rem:166s step 12246 (72%) loss:2.9775 lr:0.27 dt:31ms tok/s:2089881 rem:166s step 12247 (72%) loss:3.0235 lr:0.27 dt:34ms tok/s:1942435 rem:166s step 12248 (72%) loss:3.0797 lr:0.27 dt:32ms tok/s:2055854 rem:166s step 12249 (72%) loss:3.0724 lr:0.27 dt:32ms tok/s:2052614 rem:166s step 12250 (72%) loss:3.0664 lr:0.27 dt:34ms tok/s:1910520 rem:166s step 12251 (72%) loss:3.0605 lr:0.27 dt:32ms tok/s:2059350 rem:166s step 12252 (72%) loss:3.0963 lr:0.27 dt:32ms tok/s:2049798 rem:166s step 12253 (72%) loss:3.0681 lr:0.27 dt:31ms tok/s:2088341 rem:166s step 12254 (72%) loss:3.0459 lr:0.27 dt:32ms tok/s:2041107 rem:166s step 12255 (72%) loss:3.0366 lr:0.27 dt:32ms tok/s:2060631 rem:166s step 12256 (72%) loss:3.0371 lr:0.27 dt:32ms tok/s:2019186 rem:166s step 12257 (72%) loss:3.0465 lr:0.27 dt:33ms tok/s:2015574 rem:166s step 12258 (72%) loss:3.0525 lr:0.27 dt:32ms tok/s:2024108 rem:166s step 12259 (72%) loss:3.0453 lr:0.27 dt:33ms tok/s:1983318 rem:166s step 12260 (72%) loss:3.1357 lr:0.27 dt:33ms tok/s:1982245 rem:166s step 12261 (72%) loss:3.1221 lr:0.27 dt:33ms tok/s:1986485 rem:166s step 12262 (72%) loss:3.1140 lr:0.27 dt:33ms tok/s:1982245 rem:166s step 12263 (72%) loss:3.1123 lr:0.27 dt:33ms tok/s:1967123 rem:166s step 12264 (72%) loss:3.1004 lr:0.27 dt:34ms tok/s:1953395 rem:166s step 12265 (72%) loss:3.0836 lr:0.27 dt:34ms tok/s:1946300 rem:166s step 12266 (72%) loss:3.0757 lr:0.27 dt:33ms tok/s:1958769 rem:165s step 12267 (72%) loss:3.0680 lr:0.27 dt:33ms tok/s:1957220 rem:165s step 12268 (72%) loss:3.0541 lr:0.27 dt:34ms tok/s:1947596 rem:165s step 12269 (72%) loss:3.0507 lr:0.27 dt:34ms tok/s:1954006 rem:165s step 12270 (72%) loss:3.0402 lr:0.27 dt:34ms tok/s:1951315 rem:165s step 12271 (72%) loss:3.0237 lr:0.27 dt:34ms tok/s:1944125 rem:165s step 12272 (72%) loss:3.0164 lr:0.27 dt:34ms tok/s:1939571 rem:165s step 12273 (72%) loss:3.0159 lr:0.27 dt:34ms tok/s:1923326 rem:165s step 12274 (72%) loss:3.0154 lr:0.26 dt:35ms tok/s:1898826 rem:165s step 12275 (72%) loss:3.0184 lr:0.26 dt:34ms tok/s:1902888 rem:165s step 12276 (72%) loss:3.0117 lr:0.26 dt:34ms tok/s:1900993 rem:165s step 12277 (72%) loss:3.0052 lr:0.26 dt:34ms tok/s:1904602 rem:165s step 12278 (72%) loss:3.0126 lr:0.26 dt:34ms tok/s:1908305 rem:165s step 12279 (72%) loss:3.0090 lr:0.26 dt:37ms tok/s:1792288 rem:165s step 12280 (72%) loss:3.0077 lr:0.26 dt:35ms tok/s:1897280 rem:165s step 12281 (73%) loss:3.0092 lr:0.26 dt:35ms tok/s:1898826 rem:165s step 12282 (73%) loss:3.0022 lr:0.26 dt:34ms tok/s:1903758 rem:165s step 12283 (73%) loss:3.0100 lr:0.26 dt:35ms tok/s:1884067 rem:165s step 12284 (73%) loss:3.0126 lr:0.26 dt:35ms tok/s:1872465 rem:165s step 12285 (73%) loss:3.0248 lr:0.26 dt:35ms tok/s:1866109 rem:165s step 12286 (73%) loss:3.0214 lr:0.26 dt:35ms tok/s:1866046 rem:165s step 12287 (73%) loss:3.0189 lr:0.26 dt:35ms tok/s:1887781 rem:165s step 12288 (73%) loss:3.0135 lr:0.26 dt:35ms tok/s:1880806 rem:165s step 12289 (73%) loss:3.0018 lr:0.26 dt:35ms tok/s:1865489 rem:165s step 12290 (73%) loss:3.0003 lr:0.26 dt:35ms tok/s:1870159 rem:165s step 12291 (73%) loss:3.0141 lr:0.26 dt:36ms tok/s:1844323 rem:165s step 12292 (73%) loss:3.0122 lr:0.26 dt:36ms tok/s:1824528 rem:165s step 12293 (73%) loss:3.0019 lr:0.26 dt:36ms tok/s:1822689 rem:165s step 12294 (73%) loss:2.9875 lr:0.26 dt:37ms tok/s:1762919 rem:165s step 12295 (73%) loss:2.9862 lr:0.26 dt:36ms tok/s:1819022 rem:164s step 12296 (73%) loss:2.9814 lr:0.26 dt:36ms tok/s:1821614 rem:164s step 12297 (73%) loss:2.9774 lr:0.26 dt:36ms tok/s:1816354 rem:164s step 12298 (73%) loss:2.9739 lr:0.26 dt:36ms tok/s:1819046 rem:164s step 12299 (73%) loss:2.9564 lr:0.26 dt:36ms tok/s:1818697 rem:164s step 12300 (73%) loss:2.9692 lr:0.26 dt:36ms tok/s:1821904 rem:164s + local: attn=[0.131, 0.960, 1.046] mlp=[1.035, 0.391, -0.383] + + transition: attn=[3.434, 1.157] mlp=[-0.381, 1.037] + + hierarchy: attn=[3.484, 5.939, 5.616] mlp=[2.055, -1.702, -4.405] + step 12301 (73%) loss:2.9670 lr:0.26 dt:36ms tok/s:1819733 rem:164s step 12302 (73%) loss:2.9525 lr:0.26 dt:36ms tok/s:1826092 rem:164s step 12303 (73%) loss:2.9531 lr:0.26 dt:36ms tok/s:1818673 rem:164s step 12304 (73%) loss:2.9584 lr:0.26 dt:36ms tok/s:1825982 rem:164s step 12305 (73%) loss:2.9648 lr:0.26 dt:36ms tok/s:1820191 rem:164s step 12306 (73%) loss:2.9498 lr:0.26 dt:36ms tok/s:1825958 rem:164s step 12307 (73%) loss:2.9649 lr:0.26 dt:36ms tok/s:1816762 rem:164s step 12308 (73%) loss:2.9698 lr:0.26 dt:36ms tok/s:1815334 rem:164s step 12309 (73%) loss:2.9700 lr:0.26 dt:36ms tok/s:1817327 rem:164s step 12310 (73%) loss:2.9614 lr:0.26 dt:36ms tok/s:1815310 rem:164s step 12311 (73%) loss:2.9593 lr:0.26 dt:36ms tok/s:1803519 rem:164s step 12312 (73%) loss:2.9556 lr:0.26 dt:36ms tok/s:1817507 rem:164s step 12313 (73%) loss:2.9617 lr:0.26 dt:36ms tok/s:1820287 rem:164s step 12314 (73%) loss:2.9645 lr:0.26 dt:36ms tok/s:1797586 rem:164s step 12315 (73%) loss:2.9682 lr:0.26 dt:36ms tok/s:1806553 rem:164s step 12316 (73%) loss:2.9523 lr:0.26 dt:36ms tok/s:1812892 rem:164s step 12317 (73%) loss:2.9777 lr:0.26 dt:36ms tok/s:1812677 rem:164s step 12318 (73%) loss:2.9590 lr:0.26 dt:36ms tok/s:1808455 rem:164s step 12319 (73%) loss:2.9576 lr:0.26 dt:36ms tok/s:1810706 rem:164s step 12320 (73%) loss:2.9434 lr:0.26 dt:36ms tok/s:1833192 rem:164s step 12321 (73%) loss:2.9560 lr:0.26 dt:36ms tok/s:1839030 rem:164s step 12322 (73%) loss:2.9481 lr:0.26 dt:36ms tok/s:1817483 rem:163s step 12323 (73%) loss:2.9334 lr:0.26 dt:36ms tok/s:1811792 rem:163s step 12324 (73%) loss:2.9145 lr:0.26 dt:36ms tok/s:1808324 rem:163s step 12325 (73%) loss:2.9119 lr:0.26 dt:36ms tok/s:1811996 rem:163s step 12326 (73%) loss:2.9043 lr:0.26 dt:36ms tok/s:1814867 rem:163s step 12327 (73%) loss:2.8851 lr:0.26 dt:37ms tok/s:1787254 rem:163s step 12328 (73%) loss:2.8455 lr:0.26 dt:36ms tok/s:1812712 rem:163s step 12329 (73%) loss:2.8329 lr:0.26 dt:36ms tok/s:1812007 rem:163s step 12330 (73%) loss:2.8505 lr:0.26 dt:36ms tok/s:1806957 rem:163s step 12331 (73%) loss:2.8778 lr:0.26 dt:36ms tok/s:1813502 rem:163s step 12332 (73%) loss:2.8901 lr:0.26 dt:36ms tok/s:1815874 rem:163s step 12333 (73%) loss:2.8964 lr:0.26 dt:36ms tok/s:1818096 rem:163s step 12334 (73%) loss:2.8901 lr:0.26 dt:36ms tok/s:1817663 rem:163s step 12335 (73%) loss:2.8949 lr:0.26 dt:36ms tok/s:1820914 rem:163s step 12336 (73%) loss:2.9151 lr:0.26 dt:36ms tok/s:1811398 rem:163s step 12337 (73%) loss:2.9238 lr:0.26 dt:36ms tok/s:1816582 rem:163s step 12338 (73%) loss:2.9200 lr:0.26 dt:36ms tok/s:1809110 rem:163s step 12339 (73%) loss:2.9091 lr:0.26 dt:36ms tok/s:1809086 rem:163s step 12340 (73%) loss:2.8753 lr:0.26 dt:36ms tok/s:1812091 rem:163s step 12341 (73%) loss:2.8838 lr:0.26 dt:36ms tok/s:1807622 rem:163s step 12342 (73%) loss:2.8870 lr:0.26 dt:36ms tok/s:1813969 rem:163s step 12343 (73%) loss:2.8853 lr:0.26 dt:36ms tok/s:1814867 rem:163s step 12344 (73%) loss:2.8847 lr:0.26 dt:36ms tok/s:1823160 rem:163s step 12345 (73%) loss:2.9240 lr:0.26 dt:36ms tok/s:1808598 rem:163s step 12346 (73%) loss:2.9193 lr:0.26 dt:36ms tok/s:1819323 rem:163s step 12347 (73%) loss:2.9094 lr:0.26 dt:36ms tok/s:1813681 rem:163s step 12348 (73%) loss:2.9437 lr:0.26 dt:36ms tok/s:1809503 rem:163s step 12349 (73%) loss:2.9151 lr:0.26 dt:36ms tok/s:1805331 rem:163s step 12350 (73%) loss:2.9169 lr:0.26 dt:36ms tok/s:1818312 rem:162s step 12351 (73%) loss:2.9189 lr:0.26 dt:36ms tok/s:1812390 rem:162s step 12352 (73%) loss:2.9276 lr:0.26 dt:36ms tok/s:1812736 rem:162s step 12353 (73%) loss:2.9355 lr:0.26 dt:36ms tok/s:1809217 rem:162s step 12354 (73%) loss:2.9579 lr:0.26 dt:36ms tok/s:1811398 rem:162s step 12355 (73%) loss:2.9699 lr:0.26 dt:36ms tok/s:1814663 rem:162s step 12356 (73%) loss:2.9766 lr:0.26 dt:36ms tok/s:1814759 rem:162s step 12357 (73%) loss:2.9756 lr:0.26 dt:36ms tok/s:1809157 rem:162s step 12358 (73%) loss:2.9525 lr:0.26 dt:36ms tok/s:1815346 rem:162s step 12359 (73%) loss:2.9495 lr:0.26 dt:36ms tok/s:1814244 rem:162s step 12360 (73%) loss:2.9384 lr:0.26 dt:36ms tok/s:1815622 rem:162s step 12361 (73%) loss:2.9524 lr:0.26 dt:36ms tok/s:1815082 rem:162s step 12362 (73%) loss:2.9817 lr:0.26 dt:36ms tok/s:1818421 rem:162s step 12363 (73%) loss:2.9683 lr:0.26 dt:36ms tok/s:1819829 rem:162s step 12364 (73%) loss:2.9714 lr:0.26 dt:36ms tok/s:1808015 rem:162s step 12365 (73%) loss:2.9615 lr:0.26 dt:36ms tok/s:1815574 rem:162s step 12366 (73%) loss:2.9822 lr:0.26 dt:36ms tok/s:1817927 rem:162s step 12367 (73%) loss:2.9773 lr:0.26 dt:36ms tok/s:1810790 rem:162s step 12368 (73%) loss:2.9801 lr:0.26 dt:36ms tok/s:1811064 rem:162s step 12369 (73%) loss:2.9694 lr:0.26 dt:37ms tok/s:1785084 rem:162s step 12370 (73%) loss:2.9740 lr:0.26 dt:36ms tok/s:1812270 rem:162s step 12371 (73%) loss:2.9747 lr:0.25 dt:37ms tok/s:1783416 rem:162s step 12372 (73%) loss:2.9763 lr:0.25 dt:37ms tok/s:1780482 rem:162s step 12373 (73%) loss:2.9814 lr:0.25 dt:37ms tok/s:1789640 rem:162s step 12374 (73%) loss:2.9870 lr:0.25 dt:37ms tok/s:1789430 rem:162s step 12375 (73%) loss:2.9879 lr:0.25 dt:37ms tok/s:1789896 rem:162s step 12376 (73%) loss:2.9866 lr:0.25 dt:37ms tok/s:1788557 rem:162s step 12377 (73%) loss:2.9704 lr:0.25 dt:37ms tok/s:1791027 rem:161s step 12378 (73%) loss:2.9644 lr:0.25 dt:37ms tok/s:1789500 rem:161s step 12379 (73%) loss:2.9582 lr:0.25 dt:37ms tok/s:1781520 rem:161s step 12380 (73%) loss:2.9512 lr:0.25 dt:37ms tok/s:1791366 rem:161s step 12381 (73%) loss:2.9465 lr:0.25 dt:37ms tok/s:1793762 rem:161s step 12382 (73%) loss:2.9492 lr:0.25 dt:37ms tok/s:1793481 rem:161s step 12383 (73%) loss:2.9515 lr:0.25 dt:37ms tok/s:1782005 rem:161s step 12384 (73%) loss:2.9639 lr:0.25 dt:36ms tok/s:1797457 rem:161s step 12385 (73%) loss:2.9591 lr:0.25 dt:37ms tok/s:1785617 rem:161s step 12386 (73%) loss:2.9612 lr:0.25 dt:37ms tok/s:1788498 rem:161s step 12387 (73%) loss:2.9672 lr:0.25 dt:37ms tok/s:1790257 rem:161s step 12388 (73%) loss:2.9729 lr:0.25 dt:37ms tok/s:1784307 rem:161s step 12389 (73%) loss:2.9674 lr:0.25 dt:37ms tok/s:1780182 rem:161s step 12390 (73%) loss:2.9554 lr:0.25 dt:38ms tok/s:1733699 rem:161s step 12391 (73%) loss:2.9562 lr:0.25 dt:36ms tok/s:1814603 rem:161s step 12392 (73%) loss:2.9695 lr:0.25 dt:36ms tok/s:1810408 rem:161s step 12393 (73%) loss:2.9537 lr:0.25 dt:36ms tok/s:1809252 rem:161s step 12394 (73%) loss:2.9323 lr:0.25 dt:36ms tok/s:1809955 rem:161s step 12395 (73%) loss:2.9359 lr:0.25 dt:36ms tok/s:1809050 rem:161s step 12396 (73%) loss:2.9123 lr:0.25 dt:36ms tok/s:1814639 rem:161s step 12397 (73%) loss:2.8974 lr:0.25 dt:36ms tok/s:1817194 rem:161s step 12398 (73%) loss:2.8998 lr:0.25 dt:36ms tok/s:1821348 rem:161s step 12399 (73%) loss:2.9072 lr:0.25 dt:36ms tok/s:1814172 rem:161s step 12400 (73%) loss:2.9212 lr:0.25 dt:36ms tok/s:1819564 rem:161s + local: attn=[0.137, 0.973, 1.060] mlp=[1.036, 0.401, -0.385] + + transition: attn=[3.404, 1.165] mlp=[-0.413, 1.077] + + hierarchy: attn=[3.500, 5.939, 5.616] mlp=[2.117, -1.764, -4.439] + step 12401 (73%) loss:2.9340 lr:0.25 dt:36ms tok/s:1810599 rem:161s step 12402 (73%) loss:2.9215 lr:0.25 dt:36ms tok/s:1812605 rem:161s step 12403 (73%) loss:2.9321 lr:0.25 dt:36ms tok/s:1818661 rem:161s step 12404 (73%) loss:2.9081 lr:0.25 dt:36ms tok/s:1811100 rem:161s step 12405 (73%) loss:2.9099 lr:0.25 dt:36ms tok/s:1809610 rem:160s step 12406 (73%) loss:2.9358 lr:0.25 dt:36ms tok/s:1815586 rem:160s step 12407 (73%) loss:2.9350 lr:0.25 dt:36ms tok/s:1809276 rem:160s step 12408 (73%) loss:2.9311 lr:0.25 dt:36ms tok/s:1815190 rem:160s step 12409 (73%) loss:2.9315 lr:0.25 dt:36ms tok/s:1809169 rem:160s step 12410 (73%) loss:2.9311 lr:0.25 dt:36ms tok/s:1808241 rem:160s step 12411 (73%) loss:2.9460 lr:0.25 dt:37ms tok/s:1793282 rem:160s step 12412 (73%) loss:2.9316 lr:0.25 dt:36ms tok/s:1806126 rem:160s step 12413 (73%) loss:2.9279 lr:0.25 dt:36ms tok/s:1811315 rem:160s step 12414 (73%) loss:2.9374 lr:0.25 dt:36ms tok/s:1814807 rem:160s step 12415 (73%) loss:2.9376 lr:0.25 dt:36ms tok/s:1811267 rem:160s step 12416 (73%) loss:2.9471 lr:0.25 dt:36ms tok/s:1803720 rem:160s step 12417 (73%) loss:2.9564 lr:0.25 dt:36ms tok/s:1806981 rem:160s step 12418 (73%) loss:2.9364 lr:0.25 dt:36ms tok/s:1808895 rem:160s step 12419 (73%) loss:2.9493 lr:0.25 dt:36ms tok/s:1813933 rem:160s step 12420 (73%) loss:2.9550 lr:0.25 dt:36ms tok/s:1805153 rem:160s step 12421 (73%) loss:2.9517 lr:0.25 dt:36ms tok/s:1821071 rem:160s step 12422 (73%) loss:2.9444 lr:0.25 dt:36ms tok/s:1825606 rem:160s step 12423 (73%) loss:2.9429 lr:0.25 dt:36ms tok/s:1810277 rem:160s step 12424 (73%) loss:2.9482 lr:0.25 dt:36ms tok/s:1814028 rem:160s step 12425 (73%) loss:2.9436 lr:0.25 dt:36ms tok/s:1808574 rem:160s step 12426 (73%) loss:2.9563 lr:0.25 dt:36ms tok/s:1809681 rem:160s step 12427 (73%) loss:2.9694 lr:0.25 dt:36ms tok/s:1806814 rem:160s step 12428 (73%) loss:2.9711 lr:0.25 dt:36ms tok/s:1804087 rem:160s step 12429 (73%) loss:2.9535 lr:0.25 dt:36ms tok/s:1812880 rem:160s step 12430 (73%) loss:2.9520 lr:0.25 dt:36ms tok/s:1808265 rem:160s step 12431 (73%) loss:2.9366 lr:0.25 dt:36ms tok/s:1808324 rem:160s step 12432 (73%) loss:2.9300 lr:0.25 dt:36ms tok/s:1818709 rem:160s step 12433 (73%) loss:2.9501 lr:0.25 dt:36ms tok/s:1812617 rem:159s step 12434 (73%) loss:2.9543 lr:0.25 dt:36ms tok/s:1809419 rem:159s step 12435 (73%) loss:2.9446 lr:0.25 dt:36ms tok/s:1812653 rem:159s step 12436 (73%) loss:2.9535 lr:0.25 dt:36ms tok/s:1808491 rem:159s step 12437 (73%) loss:2.9530 lr:0.25 dt:36ms tok/s:1806648 rem:159s step 12438 (73%) loss:2.9496 lr:0.25 dt:37ms tok/s:1777546 rem:159s step 12439 (73%) loss:2.9535 lr:0.25 dt:38ms tok/s:1729194 rem:159s step 12440 (73%) loss:2.9580 lr:0.25 dt:35ms tok/s:1850133 rem:159s step 12441 (73%) loss:2.9585 lr:0.25 dt:35ms tok/s:1897201 rem:159s step 12442 (73%) loss:2.9705 lr:0.25 dt:35ms tok/s:1877158 rem:159s step 12443 (73%) loss:2.9942 lr:0.25 dt:35ms tok/s:1876684 rem:159s step 12444 (73%) loss:3.0079 lr:0.25 dt:35ms tok/s:1857547 rem:159s step 12445 (73%) loss:3.0120 lr:0.25 dt:38ms tok/s:1734224 rem:159s step 12446 (73%) loss:3.0132 lr:0.25 dt:35ms tok/s:1880870 rem:159s step 12447 (74%) loss:2.9821 lr:0.25 dt:35ms tok/s:1879636 rem:159s step 12448 (74%) loss:2.9892 lr:0.25 dt:35ms tok/s:1868126 rem:159s step 12449 (74%) loss:2.9719 lr:0.25 dt:35ms tok/s:1862178 rem:159s step 12450 (74%) loss:2.9665 lr:0.25 dt:35ms tok/s:1857396 rem:159s step 12451 (74%) loss:2.9804 lr:0.25 dt:35ms tok/s:1858740 rem:159s step 12452 (74%) loss:2.9511 lr:0.25 dt:35ms tok/s:1854139 rem:159s step 12453 (74%) loss:2.9662 lr:0.25 dt:35ms tok/s:1860652 rem:159s step 12454 (74%) loss:2.9589 lr:0.25 dt:37ms tok/s:1782306 rem:159s step 12455 (74%) loss:2.9469 lr:0.25 dt:43ms tok/s:1521891 rem:159s step 12456 (74%) loss:2.9639 lr:0.25 dt:43ms tok/s:1527745 rem:159s step 12457 (74%) loss:2.9729 lr:0.25 dt:35ms tok/s:1887301 rem:159s step 12458 (74%) loss:2.9723 lr:0.25 dt:34ms tok/s:1949364 rem:159s step 12459 (74%) loss:3.0039 lr:0.25 dt:34ms tok/s:1944056 rem:159s step 12460 (74%) loss:3.0574 lr:0.25 dt:34ms tok/s:1923514 rem:158s step 12461 (74%) loss:3.1323 lr:0.25 dt:34ms tok/s:1919136 rem:158s step 12462 (74%) loss:3.1218 lr:0.25 dt:34ms tok/s:1924848 rem:158s step 12463 (74%) loss:3.1252 lr:0.25 dt:34ms tok/s:1920370 rem:158s step 12464 (74%) loss:3.1348 lr:0.25 dt:34ms tok/s:1917102 rem:158s step 12465 (74%) loss:3.1097 lr:0.25 dt:34ms tok/s:1925576 rem:158s step 12466 (74%) loss:3.0960 lr:0.25 dt:34ms tok/s:1920383 rem:158s step 12467 (74%) loss:3.0967 lr:0.25 dt:34ms tok/s:1922479 rem:158s step 12468 (74%) loss:3.1003 lr:0.25 dt:34ms tok/s:1922196 rem:158s step 12469 (74%) loss:3.0899 lr:0.24 dt:34ms tok/s:1915299 rem:158s step 12470 (74%) loss:3.0782 lr:0.24 dt:35ms tok/s:1896586 rem:158s step 12471 (74%) loss:3.0721 lr:0.24 dt:34ms tok/s:1901335 rem:158s step 12472 (74%) loss:3.0555 lr:0.24 dt:35ms tok/s:1894325 rem:158s step 12473 (74%) loss:3.0239 lr:0.24 dt:34ms tok/s:1899876 rem:158s step 12474 (74%) loss:3.0028 lr:0.24 dt:34ms tok/s:1905460 rem:158s step 12475 (74%) loss:3.0332 lr:0.24 dt:38ms tok/s:1745046 rem:158s step 12476 (74%) loss:3.0339 lr:0.24 dt:35ms tok/s:1897083 rem:158s step 12477 (74%) loss:3.0149 lr:0.24 dt:35ms tok/s:1894952 rem:158s step 12478 (74%) loss:2.9887 lr:0.24 dt:34ms tok/s:1899824 rem:158s step 12479 (74%) loss:2.9663 lr:0.24 dt:34ms tok/s:1900428 rem:158s step 12480 (74%) loss:2.9738 lr:0.24 dt:35ms tok/s:1896979 rem:158s step 12481 (74%) loss:2.9633 lr:0.24 dt:35ms tok/s:1891444 rem:158s step 12482 (74%) loss:2.9676 lr:0.24 dt:34ms tok/s:1902335 rem:158s step 12483 (74%) loss:2.9877 lr:0.24 dt:35ms tok/s:1892004 rem:158s step 12484 (74%) loss:3.0022 lr:0.24 dt:34ms tok/s:1912660 rem:158s step 12485 (74%) loss:3.0034 lr:0.24 dt:35ms tok/s:1888988 rem:158s step 12486 (74%) loss:2.9846 lr:0.24 dt:35ms tok/s:1899233 rem:158s step 12487 (74%) loss:2.9991 lr:0.24 dt:35ms tok/s:1883009 rem:158s step 12488 (74%) loss:2.9924 lr:0.24 dt:35ms tok/s:1878415 rem:158s step 12489 (74%) loss:2.9740 lr:0.24 dt:35ms tok/s:1882210 rem:157s step 12490 (74%) loss:2.9858 lr:0.24 dt:35ms tok/s:1864135 rem:157s step 12491 (74%) loss:2.9710 lr:0.24 dt:35ms tok/s:1863377 rem:157s step 12492 (74%) loss:2.9533 lr:0.24 dt:35ms tok/s:1866895 rem:157s step 12493 (74%) loss:2.9670 lr:0.24 dt:35ms tok/s:1864654 rem:157s step 12494 (74%) loss:2.9672 lr:0.24 dt:35ms tok/s:1875980 rem:157s step 12495 (74%) loss:2.9736 lr:0.24 dt:35ms tok/s:1856781 rem:157s step 12496 (74%) loss:2.9787 lr:0.24 dt:35ms tok/s:1848428 rem:157s step 12497 (74%) loss:2.9774 lr:0.24 dt:35ms tok/s:1848006 rem:157s step 12498 (74%) loss:2.9624 lr:0.24 dt:35ms tok/s:1848130 rem:157s step 12499 (74%) loss:2.9695 lr:0.24 dt:36ms tok/s:1845425 rem:157s step 12500 (74%) loss:2.9741 lr:0.24 dt:35ms tok/s:1850718 rem:157s + local: attn=[0.130, 0.985, 1.063] mlp=[1.066, 0.411, -0.413] + + transition: attn=[3.457, 1.189] mlp=[-0.408, 1.128] + + hierarchy: attn=[3.515, 5.939, 5.616] mlp=[2.186, -1.866, -4.515] + step 12501 (74%) loss:2.9677 lr:0.24 dt:36ms tok/s:1845549 rem:157s step 12502 (74%) loss:2.9338 lr:0.24 dt:36ms tok/s:1845722 rem:157s step 12503 (74%) loss:2.9180 lr:0.24 dt:35ms tok/s:1846231 rem:157s step 12504 (74%) loss:2.9463 lr:0.24 dt:36ms tok/s:1844162 rem:157s step 12505 (74%) loss:2.9567 lr:0.24 dt:36ms tok/s:1843284 rem:157s step 12506 (74%) loss:2.9825 lr:0.24 dt:35ms tok/s:1846826 rem:157s step 12507 (74%) loss:2.9808 lr:0.24 dt:36ms tok/s:1820118 rem:157s step 12508 (74%) loss:2.9621 lr:0.24 dt:36ms tok/s:1811040 rem:157s step 12509 (74%) loss:2.9493 lr:0.24 dt:36ms tok/s:1811339 rem:157s step 12510 (74%) loss:2.9477 lr:0.24 dt:36ms tok/s:1805663 rem:157s step 12511 (74%) loss:2.9401 lr:0.24 dt:36ms tok/s:1810361 rem:157s step 12512 (74%) loss:2.9202 lr:0.24 dt:37ms tok/s:1757419 rem:157s step 12513 (74%) loss:2.9194 lr:0.24 dt:36ms tok/s:1819950 rem:157s step 12514 (74%) loss:2.9086 lr:0.24 dt:36ms tok/s:1797539 rem:157s step 12515 (74%) loss:2.9312 lr:0.24 dt:36ms tok/s:1819781 rem:157s step 12516 (74%) loss:2.9537 lr:0.24 dt:36ms tok/s:1810993 rem:157s step 12517 (74%) loss:2.9586 lr:0.24 dt:36ms tok/s:1812461 rem:156s step 12518 (74%) loss:2.9593 lr:0.24 dt:36ms tok/s:1813957 rem:156s step 12519 (74%) loss:2.9396 lr:0.24 dt:36ms tok/s:1810790 rem:156s step 12520 (74%) loss:2.9465 lr:0.24 dt:36ms tok/s:1812115 rem:156s step 12521 (74%) loss:2.9348 lr:0.24 dt:37ms tok/s:1778800 rem:156s step 12522 (74%) loss:2.9274 lr:0.24 dt:36ms tok/s:1816126 rem:156s step 12523 (74%) loss:2.9148 lr:0.24 dt:36ms tok/s:1798362 rem:156s step 12524 (74%) loss:2.9391 lr:0.24 dt:36ms tok/s:1814975 rem:156s step 12525 (74%) loss:2.9376 lr:0.24 dt:36ms tok/s:1815922 rem:156s step 12526 (74%) loss:2.9393 lr:0.24 dt:36ms tok/s:1816030 rem:156s step 12527 (74%) loss:2.9166 lr:0.24 dt:36ms tok/s:1812007 rem:156s step 12528 (74%) loss:2.9383 lr:0.24 dt:36ms tok/s:1811601 rem:156s step 12529 (74%) loss:2.9501 lr:0.24 dt:36ms tok/s:1806292 rem:156s step 12530 (74%) loss:2.9456 lr:0.24 dt:36ms tok/s:1814064 rem:156s step 12531 (74%) loss:2.9505 lr:0.24 dt:36ms tok/s:1807539 rem:156s step 12532 (74%) loss:2.9591 lr:0.24 dt:36ms tok/s:1814711 rem:156s step 12533 (74%) loss:2.9634 lr:0.24 dt:36ms tok/s:1817399 rem:156s step 12534 (74%) loss:2.9761 lr:0.24 dt:36ms tok/s:1810718 rem:156s step 12535 (74%) loss:2.9809 lr:0.24 dt:36ms tok/s:1810277 rem:156s step 12536 (74%) loss:2.9784 lr:0.24 dt:36ms tok/s:1821747 rem:156s step 12537 (74%) loss:2.9821 lr:0.24 dt:36ms tok/s:1817519 rem:156s step 12538 (74%) loss:2.9762 lr:0.24 dt:36ms tok/s:1815334 rem:156s step 12539 (74%) loss:2.9798 lr:0.24 dt:36ms tok/s:1809252 rem:156s step 12540 (74%) loss:2.9643 lr:0.24 dt:37ms tok/s:1793341 rem:156s step 12541 (74%) loss:2.9683 lr:0.24 dt:37ms tok/s:1782757 rem:156s step 12542 (74%) loss:2.9405 lr:0.24 dt:36ms tok/s:1804964 rem:156s step 12543 (74%) loss:2.9347 lr:0.24 dt:36ms tok/s:1813753 rem:156s step 12544 (74%) loss:2.9336 lr:0.24 dt:36ms tok/s:1831457 rem:156s step 12545 (74%) loss:2.9309 lr:0.24 dt:36ms tok/s:1840101 rem:155s step 12546 (74%) loss:2.9258 lr:0.24 dt:35ms tok/s:1860413 rem:155s step 12547 (74%) loss:2.9268 lr:0.24 dt:35ms tok/s:1862039 rem:155s step 12548 (74%) loss:2.9157 lr:0.24 dt:36ms tok/s:1836584 rem:155s step 12549 (74%) loss:2.9201 lr:0.24 dt:36ms tok/s:1841000 rem:155s step 12550 (74%) loss:2.9133 lr:0.24 dt:36ms tok/s:1830555 rem:155s step 12551 (74%) loss:2.9009 lr:0.24 dt:36ms tok/s:1831140 rem:155s step 12552 (74%) loss:2.9203 lr:0.24 dt:36ms tok/s:1837567 rem:155s step 12553 (74%) loss:2.9322 lr:0.24 dt:36ms tok/s:1844311 rem:155s step 12554 (74%) loss:2.9239 lr:0.24 dt:36ms tok/s:1840520 rem:155s step 12555 (74%) loss:2.9278 lr:0.24 dt:36ms tok/s:1837431 rem:155s step 12556 (74%) loss:2.9292 lr:0.24 dt:36ms tok/s:1842580 rem:155s step 12557 (74%) loss:2.9461 lr:0.24 dt:36ms tok/s:1843507 rem:155s step 12558 (74%) loss:2.9565 lr:0.24 dt:36ms tok/s:1838316 rem:155s step 12559 (74%) loss:2.9807 lr:0.24 dt:36ms tok/s:1834942 rem:155s step 12560 (74%) loss:2.9768 lr:0.24 dt:36ms tok/s:1839731 rem:155s step 12561 (74%) loss:2.9603 lr:0.24 dt:36ms tok/s:1843012 rem:155s step 12562 (74%) loss:2.9882 lr:0.24 dt:36ms tok/s:1844954 rem:155s step 12563 (74%) loss:2.9640 lr:0.24 dt:36ms tok/s:1835959 rem:155s step 12564 (74%) loss:2.9505 lr:0.24 dt:36ms tok/s:1836278 rem:155s step 12565 (74%) loss:2.9533 lr:0.24 dt:36ms tok/s:1829470 rem:155s step 12566 (74%) loss:2.9334 lr:0.24 dt:36ms tok/s:1844162 rem:155s step 12567 (74%) loss:2.9325 lr:0.24 dt:35ms tok/s:1847124 rem:155s step 12568 (74%) loss:2.9161 lr:0.24 dt:36ms tok/s:1840606 rem:155s step 12569 (74%) loss:2.9140 lr:0.24 dt:36ms tok/s:1830006 rem:155s step 12570 (74%) loss:2.9068 lr:0.23 dt:36ms tok/s:1838820 rem:155s step 12571 (74%) loss:2.9081 lr:0.23 dt:36ms tok/s:1838033 rem:155s step 12572 (74%) loss:2.9196 lr:0.23 dt:36ms tok/s:1835199 rem:155s step 12573 (74%) loss:2.9252 lr:0.23 dt:36ms tok/s:1832678 rem:154s step 12574 (74%) loss:2.9178 lr:0.23 dt:36ms tok/s:1842679 rem:154s step 12575 (74%) loss:2.9200 lr:0.23 dt:36ms tok/s:1832129 rem:154s step 12576 (74%) loss:2.9216 lr:0.23 dt:36ms tok/s:1833485 rem:154s step 12577 (74%) loss:2.9270 lr:0.23 dt:36ms tok/s:1839670 rem:154s step 12578 (74%) loss:2.9239 lr:0.23 dt:36ms tok/s:1835652 rem:154s step 12579 (74%) loss:2.9258 lr:0.23 dt:36ms tok/s:1831213 rem:154s step 12580 (74%) loss:2.9071 lr:0.23 dt:36ms tok/s:1824516 rem:154s step 12581 (74%) loss:2.8879 lr:0.23 dt:36ms tok/s:1833840 rem:154s step 12582 (74%) loss:2.8864 lr:0.23 dt:36ms tok/s:1833008 rem:154s step 12583 (74%) loss:2.8929 lr:0.23 dt:37ms tok/s:1793528 rem:154s step 12584 (74%) loss:2.8983 lr:0.23 dt:36ms tok/s:1802289 rem:154s step 12585 (74%) loss:2.8849 lr:0.23 dt:36ms tok/s:1824746 rem:154s step 12586 (74%) loss:2.8271 lr:0.23 dt:36ms tok/s:1830091 rem:154s step 12587 (74%) loss:2.8274 lr:0.23 dt:36ms tok/s:1809264 rem:154s step 12588 (74%) loss:2.8299 lr:0.23 dt:36ms tok/s:1837640 rem:154s step 12589 (74%) loss:2.8606 lr:0.23 dt:36ms tok/s:1836621 rem:154s step 12590 (74%) loss:2.8738 lr:0.23 dt:36ms tok/s:1837567 rem:154s step 12591 (74%) loss:2.8772 lr:0.23 dt:36ms tok/s:1838661 rem:154s step 12592 (74%) loss:2.8586 lr:0.23 dt:36ms tok/s:1837284 rem:154s step 12593 (74%) loss:2.8772 lr:0.23 dt:36ms tok/s:1831921 rem:154s step 12594 (74%) loss:2.8750 lr:0.23 dt:36ms tok/s:1842172 rem:154s step 12595 (74%) loss:2.8778 lr:0.23 dt:36ms tok/s:1840384 rem:154s step 12596 (74%) loss:2.8800 lr:0.23 dt:36ms tok/s:1839325 rem:154s step 12597 (74%) loss:2.8990 lr:0.23 dt:36ms tok/s:1836192 rem:154s step 12598 (74%) loss:2.8905 lr:0.23 dt:36ms tok/s:1832471 rem:154s step 12599 (74%) loss:2.9007 lr:0.23 dt:36ms tok/s:1824431 rem:154s step 12600 (74%) loss:2.9011 lr:0.23 dt:36ms tok/s:1807646 rem:154s + local: attn=[0.152, 0.981, 1.081] mlp=[1.067, 0.406, -0.387] + + transition: attn=[3.517, 1.192] mlp=[-0.447, 1.165] + + hierarchy: attn=[3.565, 5.939, 5.616] mlp=[2.246, -1.897, -4.551] + step 12601 (74%) loss:2.9099 lr:0.23 dt:36ms tok/s:1816354 rem:153s step 12602 (74%) loss:2.9417 lr:0.23 dt:36ms tok/s:1818216 rem:153s step 12603 (74%) loss:2.9327 lr:0.23 dt:36ms tok/s:1829239 rem:153s step 12604 (74%) loss:2.9608 lr:0.23 dt:36ms tok/s:1832067 rem:153s step 12605 (74%) loss:2.9547 lr:0.23 dt:36ms tok/s:1831250 rem:153s step 12606 (74%) loss:2.9492 lr:0.23 dt:38ms tok/s:1725385 rem:153s step 12607 (74%) loss:2.9487 lr:0.23 dt:36ms tok/s:1828351 rem:153s step 12608 (74%) loss:2.9343 lr:0.23 dt:36ms tok/s:1820721 rem:153s step 12609 (74%) loss:2.9218 lr:0.23 dt:36ms tok/s:1829008 rem:153s step 12610 (74%) loss:2.9169 lr:0.23 dt:36ms tok/s:1838501 rem:153s step 12611 (74%) loss:2.9185 lr:0.23 dt:36ms tok/s:1836805 rem:153s step 12612 (74%) loss:2.9285 lr:0.23 dt:36ms tok/s:1834966 rem:153s step 12613 (74%) loss:2.9200 lr:0.23 dt:36ms tok/s:1839743 rem:153s step 12614 (74%) loss:2.9100 lr:0.23 dt:36ms tok/s:1843111 rem:153s step 12615 (74%) loss:2.9114 lr:0.23 dt:36ms tok/s:1830079 rem:153s step 12616 (75%) loss:2.9163 lr:0.23 dt:36ms tok/s:1811243 rem:153s step 12617 (75%) loss:2.9080 lr:0.23 dt:36ms tok/s:1837739 rem:153s step 12618 (75%) loss:2.9198 lr:0.23 dt:36ms tok/s:1840581 rem:153s step 12619 (75%) loss:2.9180 lr:0.23 dt:36ms tok/s:1836867 rem:153s step 12620 (75%) loss:2.9212 lr:0.23 dt:36ms tok/s:1833485 rem:153s step 12621 (75%) loss:3.0738 lr:0.23 dt:36ms tok/s:1838402 rem:153s step 12622 (75%) loss:3.0656 lr:0.23 dt:36ms tok/s:1833033 rem:153s step 12623 (75%) loss:3.0627 lr:0.23 dt:36ms tok/s:1839928 rem:153s step 12624 (75%) loss:3.0555 lr:0.23 dt:36ms tok/s:1839916 rem:153s step 12625 (75%) loss:3.0785 lr:0.23 dt:36ms tok/s:1840261 rem:153s step 12626 (75%) loss:3.0770 lr:0.23 dt:36ms tok/s:1836903 rem:153s step 12627 (75%) loss:3.0542 lr:0.23 dt:36ms tok/s:1826638 rem:153s step 12628 (75%) loss:3.0198 lr:0.23 dt:36ms tok/s:1831445 rem:153s step 12629 (75%) loss:3.0131 lr:0.23 dt:36ms tok/s:1831640 rem:152s step 12630 (75%) loss:3.0080 lr:0.23 dt:36ms tok/s:1837444 rem:152s step 12631 (75%) loss:3.0095 lr:0.23 dt:36ms tok/s:1837923 rem:152s step 12632 (75%) loss:2.9967 lr:0.23 dt:36ms tok/s:1842222 rem:152s step 12633 (75%) loss:3.0070 lr:0.23 dt:36ms tok/s:1829349 rem:152s step 12634 (75%) loss:3.0079 lr:0.23 dt:36ms tok/s:1841247 rem:152s step 12635 (75%) loss:2.9933 lr:0.23 dt:36ms tok/s:1841851 rem:152s step 12636 (75%) loss:2.9730 lr:0.23 dt:36ms tok/s:1821035 rem:152s step 12637 (75%) loss:2.9584 lr:0.23 dt:36ms tok/s:1835419 rem:152s step 12638 (75%) loss:2.9538 lr:0.23 dt:36ms tok/s:1810384 rem:152s step 12639 (75%) loss:2.9346 lr:0.23 dt:36ms tok/s:1816486 rem:152s step 12640 (75%) loss:2.9626 lr:0.23 dt:36ms tok/s:1809217 rem:152s step 12641 (75%) loss:2.9738 lr:0.23 dt:36ms tok/s:1803898 rem:152s step 12642 (75%) loss:2.9773 lr:0.23 dt:36ms tok/s:1812342 rem:152s step 12643 (75%) loss:2.9715 lr:0.23 dt:36ms tok/s:1818782 rem:152s step 12644 (75%) loss:2.9609 lr:0.23 dt:36ms tok/s:1813777 rem:152s step 12645 (75%) loss:2.9520 lr:0.23 dt:36ms tok/s:1815274 rem:152s step 12646 (75%) loss:2.9274 lr:0.23 dt:36ms tok/s:1807515 rem:152s step 12647 (75%) loss:2.9223 lr:0.23 dt:36ms tok/s:1811876 rem:152s step 12648 (75%) loss:2.9138 lr:0.23 dt:36ms tok/s:1814651 rem:152s step 12649 (75%) loss:2.9110 lr:0.23 dt:36ms tok/s:1814627 rem:152s step 12650 (75%) loss:2.9176 lr:0.23 dt:36ms tok/s:1811410 rem:152s step 12651 (75%) loss:2.9172 lr:0.23 dt:36ms tok/s:1811972 rem:152s step 12652 (75%) loss:2.9163 lr:0.23 dt:36ms tok/s:1809896 rem:152s step 12653 (75%) loss:2.9178 lr:0.23 dt:36ms tok/s:1814364 rem:152s step 12654 (75%) loss:2.9196 lr:0.23 dt:36ms tok/s:1800353 rem:152s step 12655 (75%) loss:2.9272 lr:0.23 dt:36ms tok/s:1805355 rem:152s step 12656 (75%) loss:2.9275 lr:0.23 dt:36ms tok/s:1809252 rem:152s step 12657 (75%) loss:2.9302 lr:0.23 dt:36ms tok/s:1805877 rem:151s step 12658 (75%) loss:2.9202 lr:0.23 dt:36ms tok/s:1812629 rem:151s step 12659 (75%) loss:2.9210 lr:0.23 dt:36ms tok/s:1815358 rem:151s step 12660 (75%) loss:2.9119 lr:0.23 dt:36ms tok/s:1814412 rem:151s step 12661 (75%) loss:2.8939 lr:0.23 dt:36ms tok/s:1813980 rem:151s step 12662 (75%) loss:2.8946 lr:0.23 dt:36ms tok/s:1813107 rem:151s step 12663 (75%) loss:2.8748 lr:0.23 dt:36ms tok/s:1810146 rem:151s step 12664 (75%) loss:2.8437 lr:0.23 dt:36ms tok/s:1800754 rem:151s step 12665 (75%) loss:2.8203 lr:0.23 dt:36ms tok/s:1810218 rem:151s step 12666 (75%) loss:2.7904 lr:0.23 dt:36ms tok/s:1806446 rem:151s step 12667 (75%) loss:2.7537 lr:0.23 dt:36ms tok/s:1807349 rem:151s step 12668 (75%) loss:2.7349 lr:0.23 dt:36ms tok/s:1812701 rem:151s step 12669 (75%) loss:2.7128 lr:0.23 dt:36ms tok/s:1813167 rem:151s step 12670 (75%) loss:2.6830 lr:0.23 dt:36ms tok/s:1804099 rem:151s step 12671 (75%) loss:2.6447 lr:0.22 dt:36ms tok/s:1812641 rem:151s step 12672 (75%) loss:2.6081 lr:0.22 dt:36ms tok/s:1816630 rem:151s step 12673 (75%) loss:2.5799 lr:0.22 dt:36ms tok/s:1812450 rem:151s step 12674 (75%) loss:2.5751 lr:0.22 dt:36ms tok/s:1812976 rem:151s step 12675 (75%) loss:2.5890 lr:0.22 dt:36ms tok/s:1812653 rem:151s step 12676 (75%) loss:2.5927 lr:0.22 dt:37ms tok/s:1777419 rem:151s step 12677 (75%) loss:2.5757 lr:0.22 dt:36ms tok/s:1811494 rem:151s step 12678 (75%) loss:2.5533 lr:0.22 dt:36ms tok/s:1811709 rem:151s step 12679 (75%) loss:2.5188 lr:0.22 dt:36ms tok/s:1807206 rem:151s step 12680 (75%) loss:2.4900 lr:0.22 dt:36ms tok/s:1808241 rem:151s step 12681 (75%) loss:2.4683 lr:0.22 dt:36ms tok/s:1812354 rem:151s step 12682 (75%) loss:2.4609 lr:0.22 dt:36ms tok/s:1803235 rem:151s step 12683 (75%) loss:2.5264 lr:0.22 dt:36ms tok/s:1814627 rem:151s step 12684 (75%) loss:2.5980 lr:0.22 dt:36ms tok/s:1812952 rem:150s step 12685 (75%) loss:2.6616 lr:0.22 dt:36ms tok/s:1805450 rem:150s step 12686 (75%) loss:2.7059 lr:0.22 dt:36ms tok/s:1817699 rem:150s step 12687 (75%) loss:2.7368 lr:0.22 dt:36ms tok/s:1810396 rem:150s step 12688 (75%) loss:2.7599 lr:0.22 dt:36ms tok/s:1811601 rem:150s step 12689 (75%) loss:2.7836 lr:0.22 dt:36ms tok/s:1815418 rem:150s step 12690 (75%) loss:2.8184 lr:0.22 dt:36ms tok/s:1813538 rem:150s step 12691 (75%) loss:2.8527 lr:0.22 dt:36ms tok/s:1803649 rem:150s step 12692 (75%) loss:2.8672 lr:0.22 dt:36ms tok/s:1815790 rem:150s step 12693 (75%) loss:2.8716 lr:0.22 dt:36ms tok/s:1811984 rem:150s step 12694 (75%) loss:2.8818 lr:0.22 dt:36ms tok/s:1816594 rem:150s step 12695 (75%) loss:2.8871 lr:0.22 dt:36ms tok/s:1816150 rem:150s step 12696 (75%) loss:2.8864 lr:0.22 dt:36ms tok/s:1815562 rem:150s step 12697 (75%) loss:2.8891 lr:0.22 dt:36ms tok/s:1805663 rem:150s step 12698 (75%) loss:2.8794 lr:0.22 dt:36ms tok/s:1808693 rem:150s step 12699 (75%) loss:2.8712 lr:0.22 dt:36ms tok/s:1816930 rem:150s step 12700 (75%) loss:2.8719 lr:0.22 dt:36ms tok/s:1814052 rem:150s + local: attn=[0.135, 0.966, 1.079] mlp=[1.105, 0.433, -0.402] + + transition: attn=[3.437, 1.221] mlp=[-0.470, 1.178] + + hierarchy: attn=[3.590, 5.939, 5.616] mlp=[2.308, -1.998, -4.708] + step 12701 (75%) loss:2.8906 lr:0.22 dt:36ms tok/s:1816006 rem:150s step 12702 (75%) loss:2.8999 lr:0.22 dt:36ms tok/s:1817423 rem:150s step 12703 (75%) loss:2.8956 lr:0.22 dt:36ms tok/s:1810647 rem:150s step 12704 (75%) loss:2.8991 lr:0.22 dt:36ms tok/s:1811410 rem:150s step 12705 (75%) loss:2.9010 lr:0.22 dt:36ms tok/s:1806019 rem:150s step 12706 (75%) loss:2.8858 lr:0.22 dt:36ms tok/s:1811661 rem:150s step 12707 (75%) loss:2.9011 lr:0.22 dt:36ms tok/s:1812999 rem:150s step 12708 (75%) loss:2.9073 lr:0.22 dt:36ms tok/s:1810122 rem:150s step 12709 (75%) loss:2.9005 lr:0.22 dt:36ms tok/s:1814459 rem:150s step 12710 (75%) loss:2.8969 lr:0.22 dt:36ms tok/s:1810039 rem:150s step 12711 (75%) loss:2.9070 lr:0.22 dt:36ms tok/s:1814843 rem:150s step 12712 (75%) loss:2.9195 lr:0.22 dt:36ms tok/s:1810396 rem:149s step 12713 (75%) loss:2.9493 lr:0.22 dt:36ms tok/s:1816726 rem:149s step 12714 (75%) loss:2.9398 lr:0.22 dt:36ms tok/s:1809360 rem:149s step 12715 (75%) loss:2.9447 lr:0.22 dt:36ms tok/s:1815694 rem:149s step 12716 (75%) loss:2.9405 lr:0.22 dt:36ms tok/s:1809872 rem:149s step 12717 (75%) loss:2.9367 lr:0.22 dt:36ms tok/s:1813586 rem:149s step 12718 (75%) loss:2.9544 lr:0.22 dt:36ms tok/s:1817615 rem:149s step 12719 (75%) loss:2.9732 lr:0.22 dt:36ms tok/s:1811327 rem:149s step 12720 (75%) loss:2.9634 lr:0.22 dt:36ms tok/s:1809074 rem:149s step 12721 (75%) loss:2.9642 lr:0.22 dt:36ms tok/s:1809503 rem:149s step 12722 (75%) loss:2.9727 lr:0.22 dt:37ms tok/s:1785176 rem:149s step 12723 (75%) loss:2.9556 lr:0.22 dt:36ms tok/s:1808740 rem:149s step 12724 (75%) loss:2.9519 lr:0.22 dt:36ms tok/s:1813418 rem:149s step 12725 (75%) loss:2.9548 lr:0.22 dt:36ms tok/s:1810504 rem:149s step 12726 (75%) loss:2.9525 lr:0.22 dt:36ms tok/s:1807908 rem:149s step 12727 (75%) loss:2.9632 lr:0.22 dt:36ms tok/s:1803270 rem:149s step 12728 (75%) loss:2.9465 lr:0.22 dt:36ms tok/s:1810110 rem:149s step 12729 (75%) loss:2.9342 lr:0.22 dt:36ms tok/s:1809038 rem:149s step 12730 (75%) loss:2.9373 lr:0.22 dt:36ms tok/s:1812282 rem:149s step 12731 (75%) loss:2.9270 lr:0.22 dt:36ms tok/s:1805936 rem:149s step 12732 (75%) loss:2.9326 lr:0.22 dt:36ms tok/s:1812067 rem:149s step 12733 (75%) loss:2.9521 lr:0.22 dt:36ms tok/s:1815059 rem:149s step 12734 (75%) loss:2.9419 lr:0.22 dt:36ms tok/s:1811219 rem:149s step 12735 (75%) loss:2.9277 lr:0.22 dt:36ms tok/s:1805924 rem:149s step 12736 (75%) loss:2.9216 lr:0.22 dt:36ms tok/s:1808217 rem:149s step 12737 (75%) loss:2.9314 lr:0.22 dt:36ms tok/s:1810575 rem:149s step 12738 (75%) loss:2.9068 lr:0.22 dt:36ms tok/s:1810277 rem:149s step 12739 (75%) loss:2.9135 lr:0.22 dt:36ms tok/s:1817062 rem:148s step 12740 (75%) loss:2.9100 lr:0.22 dt:36ms tok/s:1810850 rem:148s step 12741 (75%) loss:2.9185 lr:0.22 dt:36ms tok/s:1812545 rem:148s step 12742 (75%) loss:2.9271 lr:0.22 dt:36ms tok/s:1816414 rem:148s step 12743 (75%) loss:2.9238 lr:0.22 dt:36ms tok/s:1818998 rem:148s step 12744 (75%) loss:2.9003 lr:0.22 dt:36ms tok/s:1814172 rem:148s step 12745 (75%) loss:2.9112 lr:0.22 dt:36ms tok/s:1815514 rem:148s step 12746 (75%) loss:2.9227 lr:0.22 dt:36ms tok/s:1812211 rem:148s step 12747 (75%) loss:2.9057 lr:0.22 dt:36ms tok/s:1824455 rem:148s step 12748 (75%) loss:2.8708 lr:0.22 dt:36ms tok/s:1827779 rem:148s step 12749 (75%) loss:2.8705 lr:0.22 dt:36ms tok/s:1829056 rem:148s step 12750 (75%) loss:2.8797 lr:0.22 dt:36ms tok/s:1833326 rem:148s step 12751 (75%) loss:2.8684 lr:0.22 dt:36ms tok/s:1841938 rem:148s step 12752 (75%) loss:2.8599 lr:0.22 dt:36ms tok/s:1836118 rem:148s step 12753 (75%) loss:2.8851 lr:0.22 dt:36ms tok/s:1838378 rem:148s step 12754 (75%) loss:2.8551 lr:0.22 dt:36ms tok/s:1833595 rem:148s step 12755 (75%) loss:2.8375 lr:0.22 dt:36ms tok/s:1840064 rem:148s step 12756 (75%) loss:2.8478 lr:0.22 dt:36ms tok/s:1827962 rem:148s step 12757 (75%) loss:2.8610 lr:0.22 dt:36ms tok/s:1842086 rem:148s step 12758 (75%) loss:2.8599 lr:0.22 dt:36ms tok/s:1843692 rem:148s step 12759 (75%) loss:2.8658 lr:0.22 dt:36ms tok/s:1839103 rem:148s step 12760 (75%) loss:2.8771 lr:0.22 dt:36ms tok/s:1837235 rem:148s step 12761 (75%) loss:2.8948 lr:0.22 dt:36ms tok/s:1836339 rem:148s step 12762 (75%) loss:2.8587 lr:0.22 dt:36ms tok/s:1831945 rem:148s step 12763 (75%) loss:2.8403 lr:0.22 dt:36ms tok/s:1834770 rem:148s step 12764 (75%) loss:2.8706 lr:0.22 dt:36ms tok/s:1830640 rem:148s step 12765 (75%) loss:2.8926 lr:0.22 dt:36ms tok/s:1834231 rem:148s step 12766 (75%) loss:2.9007 lr:0.22 dt:36ms tok/s:1834721 rem:148s step 12767 (75%) loss:2.8988 lr:0.22 dt:36ms tok/s:1843025 rem:147s step 12768 (75%) loss:2.8922 lr:0.22 dt:36ms tok/s:1831738 rem:147s step 12769 (75%) loss:2.8863 lr:0.22 dt:36ms tok/s:1842617 rem:147s step 12770 (75%) loss:2.8793 lr:0.22 dt:36ms tok/s:1822145 rem:147s step 12771 (75%) loss:2.8694 lr:0.22 dt:36ms tok/s:1840458 rem:147s step 12772 (75%) loss:2.8733 lr:0.22 dt:36ms tok/s:1839140 rem:147s step 12773 (75%) loss:2.8718 lr:0.21 dt:36ms tok/s:1842988 rem:147s step 12774 (75%) loss:2.8780 lr:0.21 dt:36ms tok/s:1837124 rem:147s step 12775 (75%) loss:2.9294 lr:0.21 dt:36ms tok/s:1829823 rem:147s step 12776 (75%) loss:2.9257 lr:0.21 dt:36ms tok/s:1840963 rem:147s step 12777 (75%) loss:2.9057 lr:0.21 dt:36ms tok/s:1842864 rem:147s step 12778 (75%) loss:2.9197 lr:0.21 dt:36ms tok/s:1837456 rem:147s step 12779 (75%) loss:2.9225 lr:0.21 dt:36ms tok/s:1833803 rem:147s step 12780 (75%) loss:2.9037 lr:0.21 dt:36ms tok/s:1829775 rem:147s step 12781 (75%) loss:2.9034 lr:0.21 dt:36ms tok/s:1835701 rem:147s step 12782 (76%) loss:2.9082 lr:0.21 dt:36ms tok/s:1831835 rem:147s step 12783 (76%) loss:2.9089 lr:0.21 dt:36ms tok/s:1831042 rem:147s step 12784 (76%) loss:2.9155 lr:0.21 dt:36ms tok/s:1840322 rem:147s step 12785 (76%) loss:2.8997 lr:0.21 dt:36ms tok/s:1835566 rem:147s step 12786 (76%) loss:2.8965 lr:0.21 dt:36ms tok/s:1842469 rem:147s step 12787 (76%) loss:2.8887 lr:0.21 dt:36ms tok/s:1838796 rem:147s step 12788 (76%) loss:2.8790 lr:0.21 dt:36ms tok/s:1839460 rem:147s step 12789 (76%) loss:2.8845 lr:0.21 dt:36ms tok/s:1839645 rem:147s step 12790 (76%) loss:2.8813 lr:0.21 dt:36ms tok/s:1843259 rem:147s step 12791 (76%) loss:2.8871 lr:0.21 dt:36ms tok/s:1840828 rem:147s step 12792 (76%) loss:2.8877 lr:0.21 dt:36ms tok/s:1844967 rem:147s step 12793 (76%) loss:2.8926 lr:0.21 dt:36ms tok/s:1838341 rem:147s step 12794 (76%) loss:2.9092 lr:0.21 dt:36ms tok/s:1834831 rem:147s step 12795 (76%) loss:2.9201 lr:0.21 dt:36ms tok/s:1839399 rem:146s step 12796 (76%) loss:2.9182 lr:0.21 dt:36ms tok/s:1832483 rem:146s step 12797 (76%) loss:2.9100 lr:0.21 dt:36ms tok/s:1835591 rem:146s step 12798 (76%) loss:2.9050 lr:0.21 dt:36ms tok/s:1828193 rem:146s step 12799 (76%) loss:2.9046 lr:0.21 dt:36ms tok/s:1836818 rem:146s step 12800 (76%) loss:2.9076 lr:0.21 dt:36ms tok/s:1834170 rem:146s + local: attn=[0.147, 0.987, 1.099] mlp=[1.112, 0.434, -0.403] + + transition: attn=[3.576, 1.192] mlp=[-0.462, 1.231] + + hierarchy: attn=[3.537, 5.939, 5.616] mlp=[2.375, -2.020, -4.786] + step 12801 (76%) loss:2.9055 lr:0.21 dt:36ms tok/s:1838181 rem:146s step 12802 (76%) loss:2.8867 lr:0.21 dt:36ms tok/s:1840162 rem:146s step 12803 (76%) loss:2.8882 lr:0.21 dt:36ms tok/s:1836376 rem:146s step 12804 (76%) loss:2.8974 lr:0.21 dt:36ms tok/s:1838488 rem:146s step 12805 (76%) loss:2.8930 lr:0.21 dt:36ms tok/s:1824346 rem:146s step 12806 (76%) loss:2.8955 lr:0.21 dt:36ms tok/s:1839990 rem:146s step 12807 (76%) loss:2.9174 lr:0.21 dt:36ms tok/s:1842271 rem:146s step 12808 (76%) loss:2.9165 lr:0.21 dt:36ms tok/s:1841494 rem:146s step 12809 (76%) loss:2.9335 lr:0.21 dt:36ms tok/s:1833681 rem:146s step 12810 (76%) loss:2.9452 lr:0.21 dt:36ms tok/s:1829982 rem:146s step 12811 (76%) loss:2.9426 lr:0.21 dt:36ms tok/s:1836045 rem:146s step 12812 (76%) loss:2.9270 lr:0.21 dt:36ms tok/s:1837714 rem:146s step 12813 (76%) loss:2.9303 lr:0.21 dt:36ms tok/s:1831530 rem:146s step 12814 (76%) loss:2.9408 lr:0.21 dt:36ms tok/s:1834905 rem:146s step 12815 (76%) loss:2.9419 lr:0.21 dt:36ms tok/s:1827427 rem:146s step 12816 (76%) loss:2.9357 lr:0.21 dt:36ms tok/s:1832886 rem:146s step 12817 (76%) loss:2.9360 lr:0.21 dt:36ms tok/s:1832690 rem:146s step 12818 (76%) loss:2.9368 lr:0.21 dt:36ms tok/s:1830664 rem:146s step 12819 (76%) loss:2.8921 lr:0.21 dt:36ms tok/s:1839128 rem:146s step 12820 (76%) loss:2.8448 lr:0.21 dt:36ms tok/s:1827281 rem:146s step 12821 (76%) loss:2.8158 lr:0.21 dt:36ms tok/s:1835272 rem:146s step 12822 (76%) loss:2.8086 lr:0.21 dt:36ms tok/s:1831579 rem:146s step 12823 (76%) loss:2.7978 lr:0.21 dt:36ms tok/s:1834023 rem:145s step 12824 (76%) loss:2.8054 lr:0.21 dt:36ms tok/s:1832483 rem:145s step 12825 (76%) loss:2.8234 lr:0.21 dt:36ms tok/s:1836732 rem:145s step 12826 (76%) loss:2.8606 lr:0.21 dt:36ms tok/s:1835836 rem:145s step 12827 (76%) loss:2.9107 lr:0.21 dt:36ms tok/s:1838169 rem:145s step 12828 (76%) loss:2.9129 lr:0.21 dt:36ms tok/s:1837505 rem:145s step 12829 (76%) loss:2.9244 lr:0.21 dt:36ms tok/s:1831396 rem:145s step 12830 (76%) loss:2.9102 lr:0.21 dt:36ms tok/s:1842691 rem:145s step 12831 (76%) loss:2.8774 lr:0.21 dt:36ms tok/s:1835922 rem:145s step 12832 (76%) loss:2.8812 lr:0.21 dt:36ms tok/s:1832348 rem:145s step 12833 (76%) loss:2.8768 lr:0.21 dt:36ms tok/s:1836241 rem:145s step 12834 (76%) loss:2.8663 lr:0.21 dt:36ms tok/s:1836891 rem:145s step 12835 (76%) loss:2.8738 lr:0.21 dt:36ms tok/s:1833143 rem:145s step 12836 (76%) loss:2.8582 lr:0.21 dt:36ms tok/s:1834011 rem:145s step 12837 (76%) loss:2.8350 lr:0.21 dt:36ms tok/s:1829263 rem:145s step 12838 (76%) loss:2.8421 lr:0.21 dt:36ms tok/s:1828035 rem:145s step 12839 (76%) loss:2.8472 lr:0.21 dt:36ms tok/s:1839965 rem:145s step 12840 (76%) loss:2.8416 lr:0.21 dt:36ms tok/s:1838304 rem:145s step 12841 (76%) loss:2.8692 lr:0.21 dt:36ms tok/s:1837825 rem:145s step 12842 (76%) loss:2.8685 lr:0.21 dt:36ms tok/s:1834439 rem:145s step 12843 (76%) loss:2.8557 lr:0.21 dt:38ms tok/s:1711771 rem:145s step 12844 (76%) loss:2.8573 lr:0.21 dt:36ms tok/s:1832935 rem:145s step 12845 (76%) loss:2.8698 lr:0.21 dt:36ms tok/s:1834195 rem:145s step 12846 (76%) loss:2.8808 lr:0.21 dt:36ms tok/s:1833938 rem:145s step 12847 (76%) loss:2.8794 lr:0.21 dt:35ms tok/s:1860275 rem:145s step 12848 (76%) loss:2.8535 lr:0.21 dt:35ms tok/s:1858551 rem:145s step 12849 (76%) loss:2.8323 lr:0.21 dt:35ms tok/s:1857710 rem:145s step 12850 (76%) loss:2.8409 lr:0.21 dt:36ms tok/s:1845822 rem:145s step 12851 (76%) loss:2.8284 lr:0.21 dt:36ms tok/s:1834574 rem:144s step 12852 (76%) loss:2.8249 lr:0.21 dt:36ms tok/s:1832055 rem:144s step 12853 (76%) loss:2.8395 lr:0.21 dt:35ms tok/s:1855804 rem:144s step 12854 (76%) loss:2.8490 lr:0.21 dt:35ms tok/s:1854727 rem:144s step 12855 (76%) loss:2.8540 lr:0.21 dt:36ms tok/s:1831823 rem:144s step 12856 (76%) loss:2.8479 lr:0.21 dt:36ms tok/s:1829555 rem:144s step 12857 (76%) loss:2.8475 lr:0.21 dt:36ms tok/s:1808372 rem:144s step 12858 (76%) loss:2.8483 lr:0.21 dt:36ms tok/s:1800577 rem:144s step 12859 (76%) loss:2.8448 lr:0.21 dt:36ms tok/s:1814436 rem:144s step 12860 (76%) loss:2.8378 lr:0.21 dt:36ms tok/s:1825958 rem:144s step 12861 (76%) loss:2.8356 lr:0.21 dt:36ms tok/s:1830347 rem:144s step 12862 (76%) loss:2.8401 lr:0.21 dt:36ms tok/s:1826917 rem:144s step 12863 (76%) loss:2.8398 lr:0.21 dt:36ms tok/s:1828764 rem:144s step 12864 (76%) loss:2.8691 lr:0.21 dt:36ms tok/s:1837591 rem:144s step 12865 (76%) loss:2.8649 lr:0.21 dt:36ms tok/s:1830311 rem:144s step 12866 (76%) loss:2.8726 lr:0.21 dt:36ms tok/s:1826055 rem:144s step 12867 (76%) loss:2.8723 lr:0.21 dt:36ms tok/s:1832862 rem:144s step 12868 (76%) loss:2.8779 lr:0.21 dt:36ms tok/s:1836192 rem:144s step 12869 (76%) loss:2.8880 lr:0.21 dt:36ms tok/s:1823693 rem:144s step 12870 (76%) loss:2.8837 lr:0.21 dt:36ms tok/s:1832874 rem:144s step 12871 (76%) loss:2.8794 lr:0.21 dt:36ms tok/s:1833424 rem:144s step 12872 (76%) loss:2.8834 lr:0.21 dt:36ms tok/s:1828083 rem:144s step 12873 (76%) loss:2.9023 lr:0.21 dt:36ms tok/s:1830737 rem:144s step 12874 (76%) loss:2.9057 lr:0.21 dt:36ms tok/s:1827293 rem:144s step 12875 (76%) loss:2.8966 lr:0.21 dt:36ms tok/s:1828947 rem:144s step 12876 (76%) loss:2.9108 lr:0.21 dt:36ms tok/s:1835922 rem:144s step 12877 (76%) loss:2.9143 lr:0.21 dt:36ms tok/s:1832862 rem:144s step 12878 (76%) loss:2.8858 lr:0.20 dt:37ms tok/s:1774849 rem:144s step 12879 (76%) loss:2.8828 lr:0.20 dt:36ms tok/s:1845128 rem:143s step 12880 (76%) loss:2.8825 lr:0.20 dt:35ms tok/s:1850145 rem:143s step 12881 (76%) loss:2.8890 lr:0.20 dt:35ms tok/s:1854602 rem:143s step 12882 (76%) loss:2.8728 lr:0.20 dt:35ms tok/s:1855603 rem:143s step 12883 (76%) loss:2.8900 lr:0.20 dt:36ms tok/s:1845091 rem:143s step 12884 (76%) loss:2.8987 lr:0.20 dt:36ms tok/s:1845772 rem:143s step 12885 (76%) loss:2.9040 lr:0.20 dt:35ms tok/s:1850718 rem:143s step 12886 (76%) loss:2.9082 lr:0.20 dt:35ms tok/s:1856581 rem:143s step 12887 (76%) loss:2.8864 lr:0.20 dt:35ms tok/s:1846442 rem:143s step 12888 (76%) loss:2.8933 lr:0.20 dt:36ms tok/s:1845314 rem:143s step 12889 (76%) loss:2.8812 lr:0.20 dt:35ms tok/s:1847410 rem:143s step 12890 (76%) loss:2.8838 lr:0.20 dt:35ms tok/s:1846727 rem:143s step 12891 (76%) loss:2.8748 lr:0.20 dt:35ms tok/s:1847397 rem:143s step 12892 (76%) loss:2.8717 lr:0.20 dt:36ms tok/s:1844261 rem:143s step 12893 (76%) loss:2.8661 lr:0.20 dt:35ms tok/s:1856054 rem:143s step 12894 (76%) loss:2.8833 lr:0.20 dt:35ms tok/s:1851977 rem:143s step 12895 (76%) loss:2.8754 lr:0.20 dt:35ms tok/s:1854301 rem:143s step 12896 (76%) loss:2.8945 lr:0.20 dt:35ms tok/s:1854364 rem:143s step 12897 (76%) loss:2.9007 lr:0.20 dt:35ms tok/s:1846677 rem:143s step 12898 (76%) loss:2.9022 lr:0.20 dt:35ms tok/s:1852864 rem:143s step 12899 (76%) loss:2.9018 lr:0.20 dt:35ms tok/s:1856355 rem:143s step 12900 (76%) loss:2.9029 lr:0.20 dt:35ms tok/s:1849398 rem:143s + local: attn=[0.157, 0.990, 1.112] mlp=[1.134, 0.442, -0.428] + + transition: attn=[3.571, 1.223] mlp=[-0.490, 1.289] + + hierarchy: attn=[3.550, 5.939, 5.616] mlp=[2.453, -2.086, -4.862] + step 12901 (76%) loss:2.9066 lr:0.20 dt:35ms tok/s:1856894 rem:143s step 12902 (76%) loss:2.9161 lr:0.20 dt:35ms tok/s:1851254 rem:143s step 12903 (76%) loss:2.9206 lr:0.20 dt:35ms tok/s:1850394 rem:143s step 12904 (76%) loss:2.9101 lr:0.20 dt:35ms tok/s:1846516 rem:143s step 12905 (76%) loss:2.9213 lr:0.20 dt:35ms tok/s:1854689 rem:143s step 12906 (76%) loss:2.9087 lr:0.20 dt:35ms tok/s:1849983 rem:143s step 12907 (76%) loss:2.8989 lr:0.20 dt:35ms tok/s:1860199 rem:142s step 12908 (76%) loss:2.9005 lr:0.20 dt:35ms tok/s:1851329 rem:142s step 12909 (76%) loss:2.8789 lr:0.20 dt:36ms tok/s:1828935 rem:142s step 12910 (76%) loss:2.8582 lr:0.20 dt:34ms tok/s:1903508 rem:142s step 12911 (76%) loss:2.8401 lr:0.20 dt:34ms tok/s:1910201 rem:142s step 12912 (76%) loss:2.8257 lr:0.20 dt:34ms tok/s:1918319 rem:142s step 12913 (76%) loss:2.8371 lr:0.20 dt:34ms tok/s:1908848 rem:142s step 12914 (76%) loss:2.8508 lr:0.20 dt:34ms tok/s:1920316 rem:142s step 12915 (76%) loss:2.8780 lr:0.20 dt:34ms tok/s:1914978 rem:142s step 12916 (76%) loss:2.9109 lr:0.20 dt:34ms tok/s:1916527 rem:142s step 12917 (76%) loss:2.9333 lr:0.20 dt:34ms tok/s:1919217 rem:142s step 12918 (76%) loss:2.9280 lr:0.20 dt:34ms tok/s:1920920 rem:142s step 12919 (76%) loss:2.9327 lr:0.20 dt:34ms tok/s:1916594 rem:142s step 12920 (76%) loss:2.9067 lr:0.20 dt:34ms tok/s:1900651 rem:142s step 12921 (76%) loss:2.9301 lr:0.20 dt:34ms tok/s:1905962 rem:142s step 12922 (76%) loss:2.9362 lr:0.20 dt:34ms tok/s:1902124 rem:142s step 12923 (76%) loss:2.9366 lr:0.20 dt:35ms tok/s:1887885 rem:142s step 12924 (76%) loss:2.9543 lr:0.20 dt:36ms tok/s:1829129 rem:142s step 12925 (76%) loss:2.9606 lr:0.20 dt:37ms tok/s:1769171 rem:142s step 12926 (76%) loss:2.9525 lr:0.20 dt:35ms tok/s:1854689 rem:142s step 12927 (76%) loss:2.9448 lr:0.20 dt:33ms tok/s:1965646 rem:142s step 12928 (76%) loss:2.9306 lr:0.20 dt:34ms tok/s:1941502 rem:142s step 12929 (76%) loss:2.9318 lr:0.20 dt:33ms tok/s:1961914 rem:142s step 12930 (76%) loss:2.9099 lr:0.20 dt:33ms tok/s:1965604 rem:142s step 12931 (76%) loss:2.9047 lr:0.20 dt:33ms tok/s:1971808 rem:142s step 12932 (76%) loss:2.9081 lr:0.20 dt:33ms tok/s:1961802 rem:142s step 12933 (76%) loss:2.8827 lr:0.20 dt:33ms tok/s:1961788 rem:142s step 12934 (76%) loss:2.8809 lr:0.20 dt:33ms tok/s:1971652 rem:142s step 12935 (76%) loss:2.8799 lr:0.20 dt:33ms tok/s:1959844 rem:142s step 12936 (76%) loss:2.8809 lr:0.20 dt:34ms tok/s:1952258 rem:142s step 12937 (76%) loss:2.8798 lr:0.20 dt:33ms tok/s:1969872 rem:141s step 12938 (76%) loss:2.8621 lr:0.20 dt:33ms tok/s:1965154 rem:141s step 12939 (76%) loss:2.8750 lr:0.20 dt:34ms tok/s:1947196 rem:141s step 12940 (76%) loss:2.8883 lr:0.20 dt:34ms tok/s:1951357 rem:141s step 12941 (76%) loss:2.8740 lr:0.20 dt:34ms tok/s:1950263 rem:141s step 12942 (76%) loss:2.8686 lr:0.20 dt:33ms tok/s:1957248 rem:141s step 12943 (76%) loss:2.8738 lr:0.20 dt:34ms tok/s:1950194 rem:141s step 12944 (76%) loss:2.8971 lr:0.20 dt:33ms tok/s:1958643 rem:141s step 12945 (76%) loss:2.8998 lr:0.20 dt:34ms tok/s:1953368 rem:141s step 12946 (76%) loss:2.9030 lr:0.20 dt:34ms tok/s:1938641 rem:141s step 12947 (76%) loss:2.8909 lr:0.20 dt:34ms tok/s:1928548 rem:141s step 12948 (76%) loss:2.8853 lr:0.20 dt:34ms tok/s:1929306 rem:141s step 12949 (76%) loss:2.8667 lr:0.20 dt:34ms tok/s:1933908 rem:141s step 12950 (76%) loss:2.8668 lr:0.20 dt:34ms tok/s:1933404 rem:141s step 12951 (76%) loss:2.8693 lr:0.20 dt:34ms tok/s:1931190 rem:141s step 12952 (76%) loss:2.8684 lr:0.20 dt:34ms tok/s:1917851 rem:141s step 12953 (77%) loss:2.8809 lr:0.20 dt:34ms tok/s:1928846 rem:141s step 12954 (77%) loss:2.9066 lr:0.20 dt:34ms tok/s:1936769 rem:141s step 12955 (77%) loss:2.8942 lr:0.20 dt:34ms tok/s:1920209 rem:141s step 12956 (77%) loss:2.8956 lr:0.20 dt:34ms tok/s:1917302 rem:141s step 12957 (77%) loss:2.9008 lr:0.20 dt:34ms tok/s:1924242 rem:141s step 12958 (77%) loss:2.8935 lr:0.20 dt:34ms tok/s:1920531 rem:141s step 12959 (77%) loss:2.8987 lr:0.20 dt:35ms tok/s:1887327 rem:141s step 12960 (77%) loss:2.8906 lr:0.20 dt:34ms tok/s:1917396 rem:141s step 12961 (77%) loss:2.8792 lr:0.20 dt:34ms tok/s:1911263 rem:141s step 12962 (77%) loss:2.8788 lr:0.20 dt:34ms tok/s:1921350 rem:141s step 12963 (77%) loss:2.8570 lr:0.20 dt:34ms tok/s:1908464 rem:141s step 12964 (77%) loss:2.8314 lr:0.20 dt:34ms tok/s:1915392 rem:141s step 12965 (77%) loss:2.8312 lr:0.20 dt:34ms tok/s:1913365 rem:141s step 12966 (77%) loss:2.8368 lr:0.20 dt:34ms tok/s:1922304 rem:140s step 12967 (77%) loss:2.8163 lr:0.20 dt:34ms tok/s:1917624 rem:140s step 12968 (77%) loss:2.8114 lr:0.20 dt:34ms tok/s:1915138 rem:140s step 12969 (77%) loss:2.8237 lr:0.20 dt:34ms tok/s:1916768 rem:140s step 12970 (77%) loss:2.8405 lr:0.20 dt:34ms tok/s:1924134 rem:140s step 12971 (77%) loss:2.8302 lr:0.20 dt:34ms tok/s:1904272 rem:140s step 12972 (77%) loss:2.8294 lr:0.20 dt:34ms tok/s:1914045 rem:140s step 12973 (77%) loss:2.8426 lr:0.20 dt:34ms tok/s:1905328 rem:140s step 12974 (77%) loss:2.8704 lr:0.20 dt:34ms tok/s:1909537 rem:140s step 12975 (77%) loss:2.8767 lr:0.20 dt:34ms tok/s:1900415 rem:140s step 12976 (77%) loss:2.8830 lr:0.20 dt:34ms tok/s:1907960 rem:140s step 12977 (77%) loss:2.8823 lr:0.20 dt:34ms tok/s:1910493 rem:140s step 12978 (77%) loss:2.8715 lr:0.20 dt:34ms tok/s:1905223 rem:140s step 12979 (77%) loss:2.8602 lr:0.20 dt:34ms tok/s:1906557 rem:140s step 12980 (77%) loss:2.8814 lr:0.20 dt:34ms tok/s:1907682 rem:140s step 12981 (77%) loss:2.8939 lr:0.20 dt:34ms tok/s:1904800 rem:140s step 12982 (77%) loss:2.8823 lr:0.20 dt:34ms tok/s:1903099 rem:140s step 12983 (77%) loss:2.9016 lr:0.20 dt:35ms tok/s:1897765 rem:140s step 12984 (77%) loss:2.9611 lr:0.20 dt:34ms tok/s:1906518 rem:140s step 12985 (77%) loss:2.9538 lr:0.20 dt:34ms tok/s:1911981 rem:140s step 12986 (77%) loss:2.9612 lr:0.20 dt:34ms tok/s:1915365 rem:140s step 12987 (77%) loss:2.9452 lr:0.20 dt:34ms tok/s:1911596 rem:140s step 12988 (77%) loss:2.9485 lr:0.20 dt:34ms tok/s:1906994 rem:140s step 12989 (77%) loss:2.9423 lr:0.19 dt:34ms tok/s:1907682 rem:140s step 12990 (77%) loss:2.9110 lr:0.19 dt:34ms tok/s:1916193 rem:140s step 12991 (77%) loss:2.9078 lr:0.19 dt:34ms tok/s:1903600 rem:140s step 12992 (77%) loss:2.9060 lr:0.19 dt:34ms tok/s:1908146 rem:140s step 12993 (77%) loss:2.9306 lr:0.19 dt:35ms tok/s:1897804 rem:140s step 12994 (77%) loss:2.9150 lr:0.19 dt:35ms tok/s:1867390 rem:140s step 12995 (77%) loss:2.8886 lr:0.19 dt:40ms tok/s:1637192 rem:139s step 12996 (77%) loss:2.8892 lr:0.19 dt:34ms tok/s:1942353 rem:139s step 12997 (77%) loss:2.9130 lr:0.19 dt:35ms tok/s:1890612 rem:139s step 12998 (77%) loss:2.9261 lr:0.19 dt:34ms tok/s:1924578 rem:139s step 12999 (77%) loss:2.9207 lr:0.19 dt:34ms tok/s:1922237 rem:139s step 13000 (77%) loss:2.9158 lr:0.19 dt:34ms tok/s:1914845 rem:139s + local: attn=[0.165, 0.999, 1.128] mlp=[1.152, 0.442, -0.434] + + transition: attn=[3.616, 1.218] mlp=[-0.528, 1.289] + + hierarchy: attn=[3.560, 5.939, 5.616] mlp=[2.458, -2.133, -4.999] + step 13001 (77%) loss:2.9063 lr:0.19 dt:34ms tok/s:1907748 rem:139s step 13002 (77%) loss:2.9012 lr:0.19 dt:35ms tok/s:1891483 rem:139s step 13003 (77%) loss:2.9222 lr:0.19 dt:35ms tok/s:1891965 rem:139s step 13004 (77%) loss:2.9064 lr:0.19 dt:35ms tok/s:1898551 rem:139s step 13005 (77%) loss:2.8859 lr:0.19 dt:35ms tok/s:1896037 rem:139s step 13006 (77%) loss:2.8976 lr:0.19 dt:35ms tok/s:1896521 rem:139s step 13007 (77%) loss:2.8951 lr:0.19 dt:35ms tok/s:1895030 rem:139s step 13008 (77%) loss:2.9084 lr:0.19 dt:35ms tok/s:1880459 rem:139s step 13009 (77%) loss:2.9508 lr:0.19 dt:34ms tok/s:1912354 rem:139s step 13010 (77%) loss:2.9558 lr:0.19 dt:34ms tok/s:1917878 rem:139s step 13011 (77%) loss:2.9276 lr:0.19 dt:34ms tok/s:1905711 rem:139s step 13012 (77%) loss:2.9227 lr:0.19 dt:35ms tok/s:1865603 rem:139s step 13013 (77%) loss:2.9315 lr:0.19 dt:36ms tok/s:1827087 rem:139s step 13014 (77%) loss:2.9332 lr:0.19 dt:36ms tok/s:1837431 rem:139s step 13015 (77%) loss:2.9248 lr:0.19 dt:35ms tok/s:1874981 rem:139s step 13016 (77%) loss:2.9171 lr:0.19 dt:34ms tok/s:1901085 rem:139s step 13017 (77%) loss:2.9113 lr:0.19 dt:36ms tok/s:1817615 rem:139s step 13018 (77%) loss:2.9001 lr:0.19 dt:35ms tok/s:1860627 rem:139s step 13019 (77%) loss:2.9012 lr:0.19 dt:34ms tok/s:1910015 rem:139s step 13020 (77%) loss:2.9008 lr:0.19 dt:35ms tok/s:1859696 rem:139s step 13021 (77%) loss:2.9104 lr:0.19 dt:36ms tok/s:1799104 rem:139s step 13022 (77%) loss:2.8766 lr:0.19 dt:35ms tok/s:1872057 rem:139s step 13023 (77%) loss:2.8625 lr:0.19 dt:35ms tok/s:1886485 rem:139s step 13024 (77%) loss:2.8638 lr:0.19 dt:35ms tok/s:1891965 rem:138s step 13025 (77%) loss:2.8593 lr:0.19 dt:35ms tok/s:1871700 rem:138s step 13026 (77%) loss:2.8623 lr:0.19 dt:34ms tok/s:1905685 rem:138s step 13027 (77%) loss:2.8525 lr:0.19 dt:35ms tok/s:1895801 rem:138s step 13028 (77%) loss:2.8540 lr:0.19 dt:35ms tok/s:1885127 rem:138s step 13029 (77%) loss:2.8630 lr:0.19 dt:35ms tok/s:1896992 rem:138s step 13030 (77%) loss:2.8532 lr:0.19 dt:35ms tok/s:1887094 rem:138s step 13031 (77%) loss:2.8391 lr:0.19 dt:35ms tok/s:1888715 rem:138s step 13032 (77%) loss:2.8290 lr:0.19 dt:34ms tok/s:1915886 rem:138s step 13033 (77%) loss:2.8131 lr:0.19 dt:34ms tok/s:1909113 rem:138s step 13034 (77%) loss:2.7868 lr:0.19 dt:35ms tok/s:1865160 rem:138s step 13035 (77%) loss:2.7863 lr:0.19 dt:34ms tok/s:1904853 rem:138s step 13036 (77%) loss:2.8032 lr:0.19 dt:35ms tok/s:1896678 rem:138s step 13037 (77%) loss:2.8017 lr:0.19 dt:34ms tok/s:1905289 rem:138s step 13038 (77%) loss:2.8053 lr:0.19 dt:34ms tok/s:1921054 rem:138s step 13039 (77%) loss:2.8587 lr:0.19 dt:34ms tok/s:1925374 rem:138s step 13040 (77%) loss:2.8724 lr:0.19 dt:34ms tok/s:1918105 rem:138s step 13041 (77%) loss:2.8737 lr:0.19 dt:35ms tok/s:1893477 rem:138s step 13042 (77%) loss:2.8549 lr:0.19 dt:34ms tok/s:1918520 rem:138s step 13043 (77%) loss:2.8496 lr:0.19 dt:35ms tok/s:1874393 rem:138s step 13044 (77%) loss:2.8647 lr:0.19 dt:34ms tok/s:1908119 rem:138s step 13045 (77%) loss:2.8660 lr:0.19 dt:34ms tok/s:1912740 rem:138s step 13046 (77%) loss:2.8588 lr:0.19 dt:34ms tok/s:1904497 rem:138s step 13047 (77%) loss:2.8571 lr:0.19 dt:35ms tok/s:1883874 rem:138s step 13048 (77%) loss:2.8527 lr:0.19 dt:35ms tok/s:1883913 rem:138s step 13049 (77%) loss:2.8484 lr:0.19 dt:35ms tok/s:1876594 rem:138s step 13050 (77%) loss:2.8388 lr:0.19 dt:36ms tok/s:1838021 rem:138s step 13051 (77%) loss:2.8199 lr:0.19 dt:35ms tok/s:1879327 rem:138s step 13052 (77%) loss:2.8190 lr:0.19 dt:35ms tok/s:1883267 rem:138s step 13053 (77%) loss:2.8291 lr:0.19 dt:35ms tok/s:1882829 rem:137s step 13054 (77%) loss:2.7919 lr:0.19 dt:35ms tok/s:1871038 rem:137s step 13055 (77%) loss:3.1051 lr:0.19 dt:35ms tok/s:1858489 rem:137s step 13056 (77%) loss:3.0851 lr:0.19 dt:35ms tok/s:1867225 rem:137s step 13057 (77%) loss:3.0589 lr:0.19 dt:35ms tok/s:1857873 rem:137s step 13058 (77%) loss:3.0454 lr:0.19 dt:35ms tok/s:1869663 rem:137s step 13059 (77%) loss:3.0313 lr:0.19 dt:35ms tok/s:1858288 rem:137s step 13060 (77%) loss:3.0359 lr:0.19 dt:35ms tok/s:1855465 rem:137s step 13061 (77%) loss:3.0327 lr:0.19 dt:35ms tok/s:1868354 rem:137s step 13062 (77%) loss:3.0144 lr:0.19 dt:35ms tok/s:1867390 rem:137s step 13063 (77%) loss:2.9896 lr:0.19 dt:35ms tok/s:1869269 rem:137s step 13064 (77%) loss:3.0000 lr:0.19 dt:35ms tok/s:1865957 rem:137s step 13065 (77%) loss:2.9849 lr:0.19 dt:36ms tok/s:1844645 rem:137s step 13066 (77%) loss:2.9676 lr:0.19 dt:36ms tok/s:1842975 rem:137s step 13067 (77%) loss:2.9584 lr:0.19 dt:35ms tok/s:1847881 rem:137s step 13068 (77%) loss:2.9337 lr:0.19 dt:36ms tok/s:1842407 rem:137s step 13069 (77%) loss:2.9104 lr:0.19 dt:36ms tok/s:1843940 rem:137s step 13070 (77%) loss:2.9025 lr:0.19 dt:36ms tok/s:1842185 rem:137s step 13071 (77%) loss:2.8935 lr:0.19 dt:36ms tok/s:1840285 rem:137s step 13072 (77%) loss:2.8839 lr:0.19 dt:36ms tok/s:1832544 rem:137s step 13073 (77%) loss:2.8788 lr:0.19 dt:36ms tok/s:1838316 rem:137s step 13074 (77%) loss:2.8805 lr:0.19 dt:36ms tok/s:1838599 rem:137s step 13075 (77%) loss:2.8695 lr:0.19 dt:36ms tok/s:1840754 rem:137s step 13076 (77%) loss:2.8726 lr:0.19 dt:35ms tok/s:1847496 rem:137s step 13077 (77%) loss:2.8756 lr:0.19 dt:36ms tok/s:1841296 rem:137s step 13078 (77%) loss:2.8676 lr:0.19 dt:35ms tok/s:1847782 rem:137s step 13079 (77%) loss:2.8539 lr:0.19 dt:36ms tok/s:1835162 rem:137s step 13080 (77%) loss:2.8365 lr:0.19 dt:36ms tok/s:1840396 rem:137s step 13081 (77%) loss:2.8443 lr:0.19 dt:35ms tok/s:1850245 rem:136s step 13082 (77%) loss:2.8526 lr:0.19 dt:36ms tok/s:1838599 rem:136s step 13083 (77%) loss:2.8499 lr:0.19 dt:36ms tok/s:1817867 rem:136s step 13084 (77%) loss:2.8406 lr:0.19 dt:36ms tok/s:1840717 rem:136s step 13085 (77%) loss:2.8455 lr:0.19 dt:36ms tok/s:1842172 rem:136s step 13086 (77%) loss:2.8445 lr:0.19 dt:36ms tok/s:1841087 rem:136s step 13087 (77%) loss:2.8525 lr:0.19 dt:36ms tok/s:1835775 rem:136s step 13088 (77%) loss:2.8508 lr:0.19 dt:36ms tok/s:1840741 rem:136s step 13089 (77%) loss:2.8533 lr:0.19 dt:36ms tok/s:1835873 rem:136s step 13090 (77%) loss:2.8665 lr:0.19 dt:36ms tok/s:1835738 rem:136s step 13091 (77%) loss:2.8972 lr:0.19 dt:36ms tok/s:1840396 rem:136s step 13092 (77%) loss:2.8904 lr:0.19 dt:36ms tok/s:1835799 rem:136s step 13093 (77%) loss:2.8976 lr:0.19 dt:36ms tok/s:1835873 rem:136s step 13094 (77%) loss:2.8929 lr:0.19 dt:36ms tok/s:1836965 rem:136s step 13095 (77%) loss:2.8662 lr:0.19 dt:36ms tok/s:1837812 rem:136s step 13096 (77%) loss:2.8587 lr:0.19 dt:36ms tok/s:1834599 rem:136s step 13097 (77%) loss:2.8572 lr:0.19 dt:36ms tok/s:1841592 rem:136s step 13098 (77%) loss:2.8399 lr:0.19 dt:35ms tok/s:1847944 rem:136s step 13099 (77%) loss:2.8244 lr:0.19 dt:36ms tok/s:1846045 rem:136s step 13100 (77%) loss:2.8079 lr:0.18 dt:35ms tok/s:1848465 rem:136s + local: attn=[0.158, 1.009, 1.113] mlp=[1.178, 0.475, -0.455] + + transition: attn=[3.613, 1.222] mlp=[-0.528, 1.336] + + hierarchy: attn=[3.572, 5.939, 5.616] mlp=[2.527, -2.187, -5.010] + step 13101 (77%) loss:2.8232 lr:0.18 dt:36ms tok/s:1821312 rem:136s step 13102 (77%) loss:2.8283 lr:0.18 dt:36ms tok/s:1824056 rem:136s step 13103 (77%) loss:2.8285 lr:0.18 dt:36ms tok/s:1819444 rem:136s step 13104 (77%) loss:2.8605 lr:0.18 dt:36ms tok/s:1818336 rem:136s step 13105 (77%) loss:2.8465 lr:0.18 dt:36ms tok/s:1803318 rem:136s step 13106 (77%) loss:2.8424 lr:0.18 dt:36ms tok/s:1815370 rem:136s step 13107 (77%) loss:2.8594 lr:0.18 dt:36ms tok/s:1810337 rem:136s step 13108 (77%) loss:2.8792 lr:0.18 dt:36ms tok/s:1812748 rem:136s step 13109 (77%) loss:2.8975 lr:0.18 dt:36ms tok/s:1804892 rem:135s step 13110 (77%) loss:2.8964 lr:0.18 dt:36ms tok/s:1804040 rem:135s step 13111 (77%) loss:2.8846 lr:0.18 dt:36ms tok/s:1808074 rem:135s step 13112 (77%) loss:2.8975 lr:0.18 dt:36ms tok/s:1811900 rem:135s step 13113 (77%) loss:2.8810 lr:0.18 dt:36ms tok/s:1818902 rem:135s step 13114 (77%) loss:2.8818 lr:0.18 dt:36ms tok/s:1815526 rem:135s step 13115 (77%) loss:2.8836 lr:0.18 dt:36ms tok/s:1808348 rem:135s step 13116 (77%) loss:2.8796 lr:0.18 dt:36ms tok/s:1802124 rem:135s step 13117 (77%) loss:2.8766 lr:0.18 dt:36ms tok/s:1819203 rem:135s step 13118 (77%) loss:2.8516 lr:0.18 dt:36ms tok/s:1808098 rem:135s step 13119 (77%) loss:2.8250 lr:0.18 dt:36ms tok/s:1811721 rem:135s step 13120 (77%) loss:2.8148 lr:0.18 dt:36ms tok/s:1813179 rem:135s step 13121 (77%) loss:2.8183 lr:0.18 dt:36ms tok/s:1819323 rem:135s step 13122 (77%) loss:2.8509 lr:0.18 dt:36ms tok/s:1807991 rem:135s step 13123 (77%) loss:2.8855 lr:0.18 dt:36ms tok/s:1815466 rem:135s step 13124 (78%) loss:2.9100 lr:0.18 dt:36ms tok/s:1809562 rem:135s step 13125 (78%) loss:2.9242 lr:0.18 dt:39ms tok/s:1685282 rem:135s step 13126 (78%) loss:2.9313 lr:0.18 dt:36ms tok/s:1797222 rem:135s step 13127 (78%) loss:2.9258 lr:0.18 dt:36ms tok/s:1795778 rem:135s step 13128 (78%) loss:2.9231 lr:0.18 dt:36ms tok/s:1809967 rem:135s step 13129 (78%) loss:2.9162 lr:0.18 dt:36ms tok/s:1814735 rem:135s step 13130 (78%) loss:2.8995 lr:0.18 dt:36ms tok/s:1807801 rem:135s step 13131 (78%) loss:2.8802 lr:0.18 dt:37ms tok/s:1786046 rem:135s step 13132 (78%) loss:2.8604 lr:0.18 dt:37ms tok/s:1789465 rem:135s step 13133 (78%) loss:2.8602 lr:0.18 dt:37ms tok/s:1792744 rem:135s step 13134 (78%) loss:2.8591 lr:0.18 dt:37ms tok/s:1795426 rem:135s step 13135 (78%) loss:2.8702 lr:0.18 dt:37ms tok/s:1791366 rem:135s step 13136 (78%) loss:2.8669 lr:0.18 dt:37ms tok/s:1782710 rem:134s step 13137 (78%) loss:2.8712 lr:0.18 dt:36ms tok/s:1805805 rem:134s step 13138 (78%) loss:2.8730 lr:0.18 dt:36ms tok/s:1807682 rem:134s step 13139 (78%) loss:2.8757 lr:0.18 dt:36ms tok/s:1820420 rem:134s step 13140 (78%) loss:2.8722 lr:0.18 dt:36ms tok/s:1811697 rem:134s step 13141 (78%) loss:2.8643 lr:0.18 dt:36ms tok/s:1815790 rem:134s step 13142 (78%) loss:2.8581 lr:0.18 dt:36ms tok/s:1818998 rem:134s step 13143 (78%) loss:2.8475 lr:0.18 dt:36ms tok/s:1813107 rem:134s step 13144 (78%) loss:2.8459 lr:0.18 dt:36ms tok/s:1810802 rem:134s step 13145 (78%) loss:2.8369 lr:0.18 dt:36ms tok/s:1809586 rem:134s step 13146 (78%) loss:2.8492 lr:0.18 dt:36ms tok/s:1805343 rem:134s step 13147 (78%) loss:2.8453 lr:0.18 dt:37ms tok/s:1774700 rem:134s step 13148 (78%) loss:2.8463 lr:0.18 dt:36ms tok/s:1809086 rem:134s step 13149 (78%) loss:2.8377 lr:0.18 dt:36ms tok/s:1810802 rem:134s step 13150 (78%) loss:2.8466 lr:0.18 dt:36ms tok/s:1811542 rem:134s step 13151 (78%) loss:2.8540 lr:0.18 dt:36ms tok/s:1808383 rem:134s step 13152 (78%) loss:2.8439 lr:0.18 dt:36ms tok/s:1816966 rem:134s step 13153 (78%) loss:2.8536 lr:0.18 dt:36ms tok/s:1813837 rem:134s step 13154 (78%) loss:2.8553 lr:0.18 dt:36ms tok/s:1806945 rem:134s step 13155 (78%) loss:2.8562 lr:0.18 dt:36ms tok/s:1813071 rem:134s step 13156 (78%) loss:2.8544 lr:0.18 dt:36ms tok/s:1809110 rem:134s step 13157 (78%) loss:2.8145 lr:0.18 dt:36ms tok/s:1812533 rem:134s step 13158 (78%) loss:2.8288 lr:0.18 dt:36ms tok/s:1814016 rem:134s step 13159 (78%) loss:2.8232 lr:0.18 dt:36ms tok/s:1819757 rem:134s step 13160 (78%) loss:2.8161 lr:0.18 dt:36ms tok/s:1810074 rem:134s step 13161 (78%) loss:2.8399 lr:0.18 dt:36ms tok/s:1803909 rem:134s step 13162 (78%) loss:2.8398 lr:0.18 dt:36ms tok/s:1803921 rem:134s step 13163 (78%) loss:2.8480 lr:0.18 dt:36ms tok/s:1806114 rem:134s step 13164 (78%) loss:2.8318 lr:0.18 dt:36ms tok/s:1811410 rem:133s step 13165 (78%) loss:2.8650 lr:0.18 dt:36ms tok/s:1804359 rem:133s step 13166 (78%) loss:2.8550 lr:0.18 dt:36ms tok/s:1810337 rem:133s step 13167 (78%) loss:2.8200 lr:0.18 dt:36ms tok/s:1813861 rem:133s step 13168 (78%) loss:2.8181 lr:0.18 dt:36ms tok/s:1813586 rem:133s step 13169 (78%) loss:2.8282 lr:0.18 dt:36ms tok/s:1812282 rem:133s step 13170 (78%) loss:2.8357 lr:0.18 dt:36ms tok/s:1816390 rem:133s step 13171 (78%) loss:2.7875 lr:0.18 dt:36ms tok/s:1811637 rem:133s step 13172 (78%) loss:2.8026 lr:0.18 dt:36ms tok/s:1808276 rem:133s step 13173 (78%) loss:2.8122 lr:0.18 dt:36ms tok/s:1814843 rem:133s step 13174 (78%) loss:2.8259 lr:0.18 dt:36ms tok/s:1817146 rem:133s step 13175 (78%) loss:2.8255 lr:0.18 dt:36ms tok/s:1814975 rem:133s step 13176 (78%) loss:2.8403 lr:0.18 dt:36ms tok/s:1809288 rem:133s step 13177 (78%) loss:2.8560 lr:0.18 dt:36ms tok/s:1814855 rem:133s step 13178 (78%) loss:2.8268 lr:0.18 dt:36ms tok/s:1816150 rem:133s step 13179 (78%) loss:2.8360 lr:0.18 dt:36ms tok/s:1806636 rem:133s step 13180 (78%) loss:2.8404 lr:0.18 dt:36ms tok/s:1804964 rem:133s step 13181 (78%) loss:2.8496 lr:0.18 dt:36ms tok/s:1805628 rem:133s step 13182 (78%) loss:2.8402 lr:0.18 dt:36ms tok/s:1800683 rem:133s step 13183 (78%) loss:2.8320 lr:0.18 dt:36ms tok/s:1806826 rem:133s step 13184 (78%) loss:2.8478 lr:0.18 dt:36ms tok/s:1814783 rem:133s step 13185 (78%) loss:2.8447 lr:0.18 dt:36ms tok/s:1813777 rem:133s step 13186 (78%) loss:2.8424 lr:0.18 dt:36ms tok/s:1815382 rem:133s step 13187 (78%) loss:2.8493 lr:0.18 dt:36ms tok/s:1810051 rem:133s step 13188 (78%) loss:2.8634 lr:0.18 dt:36ms tok/s:1807753 rem:133s step 13189 (78%) loss:2.8571 lr:0.18 dt:36ms tok/s:1811172 rem:133s step 13190 (78%) loss:2.8533 lr:0.18 dt:36ms tok/s:1808027 rem:133s step 13191 (78%) loss:2.8520 lr:0.18 dt:40ms tok/s:1644026 rem:133s step 13192 (78%) loss:2.8552 lr:0.18 dt:36ms tok/s:1816426 rem:132s step 13193 (78%) loss:2.8757 lr:0.18 dt:36ms tok/s:1835199 rem:132s step 13194 (78%) loss:2.8878 lr:0.18 dt:36ms tok/s:1818661 rem:132s step 13195 (78%) loss:2.8800 lr:0.18 dt:36ms tok/s:1813083 rem:132s step 13196 (78%) loss:2.8750 lr:0.18 dt:36ms tok/s:1816282 rem:132s step 13197 (78%) loss:2.8704 lr:0.18 dt:36ms tok/s:1811876 rem:132s step 13198 (78%) loss:2.8768 lr:0.18 dt:36ms tok/s:1814867 rem:132s step 13199 (78%) loss:2.8961 lr:0.18 dt:36ms tok/s:1810575 rem:132s step 13200 (78%) loss:2.8995 lr:0.18 dt:36ms tok/s:1820383 rem:132s + local: attn=[0.158, 1.024, 1.119] mlp=[1.191, 0.475, -0.468] + + transition: attn=[3.632, 1.225] mlp=[-0.549, 1.406] + + hierarchy: attn=[3.568, 5.939, 5.616] mlp=[2.599, -2.246, -5.028] + step 13201 (78%) loss:2.9109 lr:0.18 dt:36ms tok/s:1810361 rem:132s step 13202 (78%) loss:2.9049 lr:0.18 dt:36ms tok/s:1810850 rem:132s step 13203 (78%) loss:2.9101 lr:0.18 dt:36ms tok/s:1809169 rem:132s step 13204 (78%) loss:2.8851 lr:0.18 dt:36ms tok/s:1806553 rem:132s step 13205 (78%) loss:2.8779 lr:0.18 dt:36ms tok/s:1810063 rem:132s step 13206 (78%) loss:2.8955 lr:0.18 dt:36ms tok/s:1806660 rem:132s step 13207 (78%) loss:2.8945 lr:0.18 dt:36ms tok/s:1804881 rem:132s step 13208 (78%) loss:2.8973 lr:0.18 dt:36ms tok/s:1810742 rem:132s step 13209 (78%) loss:2.8725 lr:0.17 dt:36ms tok/s:1804928 rem:132s step 13210 (78%) loss:2.8785 lr:0.17 dt:36ms tok/s:1808039 rem:132s step 13211 (78%) loss:2.8713 lr:0.17 dt:36ms tok/s:1808407 rem:132s step 13212 (78%) loss:2.8740 lr:0.17 dt:36ms tok/s:1818601 rem:132s step 13213 (78%) loss:2.8876 lr:0.17 dt:37ms tok/s:1773372 rem:132s step 13214 (78%) loss:2.8813 lr:0.17 dt:36ms tok/s:1808514 rem:132s step 13215 (78%) loss:2.8707 lr:0.17 dt:36ms tok/s:1807717 rem:132s step 13216 (78%) loss:2.8677 lr:0.17 dt:36ms tok/s:1813969 rem:132s step 13217 (78%) loss:2.8749 lr:0.17 dt:36ms tok/s:1815035 rem:132s step 13218 (78%) loss:2.8798 lr:0.17 dt:36ms tok/s:1809205 rem:132s step 13219 (78%) loss:2.8975 lr:0.17 dt:36ms tok/s:1811757 rem:131s step 13220 (78%) loss:2.8969 lr:0.17 dt:36ms tok/s:1805865 rem:131s step 13221 (78%) loss:2.8973 lr:0.17 dt:36ms tok/s:1805865 rem:131s step 13222 (78%) loss:2.9051 lr:0.17 dt:36ms tok/s:1817567 rem:131s step 13223 (78%) loss:2.9008 lr:0.17 dt:36ms tok/s:1812760 rem:131s step 13224 (78%) loss:2.9165 lr:0.17 dt:36ms tok/s:1812414 rem:131s step 13225 (78%) loss:2.9145 lr:0.17 dt:36ms tok/s:1813251 rem:131s step 13226 (78%) loss:2.9118 lr:0.17 dt:36ms tok/s:1820866 rem:131s step 13227 (78%) loss:2.9072 lr:0.17 dt:36ms tok/s:1806387 rem:131s step 13228 (78%) loss:2.8636 lr:0.17 dt:36ms tok/s:1812844 rem:131s step 13229 (78%) loss:2.8404 lr:0.17 dt:36ms tok/s:1809943 rem:131s step 13230 (78%) loss:2.8073 lr:0.17 dt:36ms tok/s:1813095 rem:131s step 13231 (78%) loss:2.7785 lr:0.17 dt:36ms tok/s:1815742 rem:131s step 13232 (78%) loss:2.7549 lr:0.17 dt:36ms tok/s:1817339 rem:131s step 13233 (78%) loss:2.7527 lr:0.17 dt:36ms tok/s:1820432 rem:131s step 13234 (78%) loss:2.7633 lr:0.17 dt:36ms tok/s:1815454 rem:131s step 13235 (78%) loss:2.7760 lr:0.17 dt:36ms tok/s:1809110 rem:131s step 13236 (78%) loss:2.7894 lr:0.17 dt:36ms tok/s:1808455 rem:131s step 13237 (78%) loss:2.7994 lr:0.17 dt:36ms tok/s:1800094 rem:131s step 13238 (78%) loss:2.8136 lr:0.17 dt:36ms tok/s:1813394 rem:131s step 13239 (78%) loss:2.8098 lr:0.17 dt:36ms tok/s:1811494 rem:131s step 13240 (78%) loss:2.8115 lr:0.17 dt:36ms tok/s:1805983 rem:131s step 13241 (78%) loss:2.8172 lr:0.17 dt:36ms tok/s:1817603 rem:131s step 13242 (78%) loss:2.8321 lr:0.17 dt:36ms tok/s:1813286 rem:131s step 13243 (78%) loss:2.8351 lr:0.17 dt:36ms tok/s:1805201 rem:131s step 13244 (78%) loss:2.8433 lr:0.17 dt:36ms tok/s:1809276 rem:131s step 13245 (78%) loss:2.8641 lr:0.17 dt:36ms tok/s:1807111 rem:131s step 13246 (78%) loss:2.8995 lr:0.17 dt:36ms tok/s:1817663 rem:131s step 13247 (78%) loss:2.9147 lr:0.17 dt:36ms tok/s:1813346 rem:130s step 13248 (78%) loss:2.9052 lr:0.17 dt:36ms tok/s:1812880 rem:130s step 13249 (78%) loss:2.8946 lr:0.17 dt:36ms tok/s:1818132 rem:130s step 13250 (78%) loss:2.9103 lr:0.17 dt:36ms tok/s:1813490 rem:130s step 13251 (78%) loss:2.9066 lr:0.17 dt:36ms tok/s:1805236 rem:130s step 13252 (78%) loss:2.9085 lr:0.17 dt:36ms tok/s:1807551 rem:130s step 13253 (78%) loss:2.9130 lr:0.17 dt:38ms tok/s:1723934 rem:130s step 13254 (78%) loss:2.9357 lr:0.17 dt:36ms tok/s:1824455 rem:130s step 13255 (78%) loss:2.9193 lr:0.17 dt:36ms tok/s:1823185 rem:130s step 13256 (78%) loss:2.9261 lr:0.17 dt:36ms tok/s:1831921 rem:130s step 13257 (78%) loss:2.9274 lr:0.17 dt:36ms tok/s:1838845 rem:130s step 13258 (78%) loss:2.9098 lr:0.17 dt:36ms tok/s:1829568 rem:130s step 13259 (78%) loss:2.8800 lr:0.17 dt:36ms tok/s:1828874 rem:130s step 13260 (78%) loss:2.8886 lr:0.17 dt:36ms tok/s:1830908 rem:130s step 13261 (78%) loss:2.8913 lr:0.17 dt:36ms tok/s:1835971 rem:130s step 13262 (78%) loss:2.9251 lr:0.17 dt:36ms tok/s:1832153 rem:130s step 13263 (78%) loss:2.9008 lr:0.17 dt:36ms tok/s:1824564 rem:130s step 13264 (78%) loss:2.9066 lr:0.17 dt:36ms tok/s:1833094 rem:130s step 13265 (78%) loss:2.8917 lr:0.17 dt:36ms tok/s:1833668 rem:130s step 13266 (78%) loss:2.8989 lr:0.17 dt:36ms tok/s:1827816 rem:130s step 13267 (78%) loss:2.9016 lr:0.17 dt:36ms tok/s:1836081 rem:130s step 13268 (78%) loss:2.9174 lr:0.17 dt:36ms tok/s:1835554 rem:130s step 13269 (78%) loss:2.9204 lr:0.17 dt:36ms tok/s:1837984 rem:130s step 13270 (78%) loss:2.9162 lr:0.17 dt:36ms tok/s:1835138 rem:130s step 13271 (78%) loss:2.9270 lr:0.17 dt:36ms tok/s:1830762 rem:130s step 13272 (78%) loss:2.9264 lr:0.17 dt:36ms tok/s:1829604 rem:130s step 13273 (78%) loss:2.9313 lr:0.17 dt:36ms tok/s:1837567 rem:130s step 13274 (78%) loss:2.9176 lr:0.17 dt:36ms tok/s:1833558 rem:130s step 13275 (78%) loss:2.9194 lr:0.17 dt:36ms tok/s:1842111 rem:129s step 13276 (78%) loss:2.9104 lr:0.17 dt:36ms tok/s:1830530 rem:129s step 13277 (78%) loss:2.9159 lr:0.17 dt:36ms tok/s:1832532 rem:129s step 13278 (78%) loss:2.9249 lr:0.17 dt:36ms tok/s:1825049 rem:129s step 13279 (78%) loss:2.9131 lr:0.17 dt:36ms tok/s:1826468 rem:129s step 13280 (78%) loss:2.9146 lr:0.17 dt:36ms tok/s:1820613 rem:129s step 13281 (78%) loss:2.9226 lr:0.17 dt:36ms tok/s:1824201 rem:129s step 13282 (78%) loss:2.9172 lr:0.17 dt:36ms tok/s:1828290 rem:129s step 13283 (78%) loss:2.8970 lr:0.17 dt:41ms tok/s:1614052 rem:129s step 13284 (78%) loss:2.8913 lr:0.17 dt:35ms tok/s:1890248 rem:129s step 13285 (78%) loss:2.8904 lr:0.17 dt:35ms tok/s:1859344 rem:129s step 13286 (78%) loss:2.8786 lr:0.17 dt:35ms tok/s:1878390 rem:129s step 13287 (78%) loss:2.8798 lr:0.17 dt:35ms tok/s:1879391 rem:129s step 13288 (78%) loss:2.8657 lr:0.17 dt:35ms tok/s:1872759 rem:129s step 13289 (78%) loss:2.8711 lr:0.17 dt:35ms tok/s:1878993 rem:129s step 13290 (79%) loss:2.8359 lr:0.17 dt:35ms tok/s:1877325 rem:129s step 13291 (79%) loss:2.8289 lr:0.17 dt:35ms tok/s:1866502 rem:129s step 13292 (79%) loss:2.8222 lr:0.17 dt:35ms tok/s:1870961 rem:129s step 13293 (79%) loss:2.8274 lr:0.17 dt:35ms tok/s:1882584 rem:129s step 13294 (79%) loss:2.8316 lr:0.17 dt:35ms tok/s:1869930 rem:129s step 13295 (79%) loss:2.8328 lr:0.17 dt:35ms tok/s:1868202 rem:129s step 13296 (79%) loss:2.8540 lr:0.17 dt:35ms tok/s:1897896 rem:129s step 13297 (79%) loss:2.8583 lr:0.17 dt:35ms tok/s:1893020 rem:129s step 13298 (79%) loss:2.8670 lr:0.17 dt:34ms tok/s:1904286 rem:129s step 13299 (79%) loss:2.8858 lr:0.17 dt:35ms tok/s:1855854 rem:129s step 13300 (79%) loss:2.8863 lr:0.17 dt:35ms tok/s:1877530 rem:129s + local: attn=[0.161, 1.014, 1.139] mlp=[1.215, 0.483, -0.468] + + transition: attn=[3.644, 1.238] mlp=[-0.570, 1.455] + + hierarchy: attn=[3.562, 5.939, 5.616] mlp=[2.668, -2.325, -5.024] + step 13301 (79%) loss:2.8853 lr:0.17 dt:35ms tok/s:1878261 rem:129s step 13302 (79%) loss:2.8761 lr:0.17 dt:35ms tok/s:1885450 rem:129s step 13303 (79%) loss:2.8788 lr:0.17 dt:35ms tok/s:1859708 rem:128s step 13304 (79%) loss:2.8722 lr:0.17 dt:35ms tok/s:1859230 rem:128s step 13305 (79%) loss:2.8690 lr:0.17 dt:35ms tok/s:1875186 rem:128s step 13306 (79%) loss:2.8568 lr:0.17 dt:35ms tok/s:1877761 rem:128s step 13307 (79%) loss:2.8556 lr:0.17 dt:36ms tok/s:1836756 rem:128s step 13308 (79%) loss:2.8540 lr:0.17 dt:35ms tok/s:1854677 rem:128s step 13309 (79%) loss:2.8536 lr:0.17 dt:35ms tok/s:1862897 rem:128s step 13310 (79%) loss:2.8584 lr:0.17 dt:35ms tok/s:1849286 rem:128s step 13311 (79%) loss:2.8359 lr:0.17 dt:35ms tok/s:1849485 rem:128s step 13312 (79%) loss:2.8470 lr:0.17 dt:35ms tok/s:1853701 rem:128s step 13313 (79%) loss:2.8504 lr:0.17 dt:35ms tok/s:1863667 rem:128s step 13314 (79%) loss:2.8500 lr:0.17 dt:35ms tok/s:1850868 rem:128s step 13315 (79%) loss:2.8381 lr:0.17 dt:35ms tok/s:1857346 rem:128s step 13316 (79%) loss:2.8439 lr:0.17 dt:35ms tok/s:1871777 rem:128s step 13317 (79%) loss:2.8430 lr:0.17 dt:35ms tok/s:1866401 rem:128s step 13318 (79%) loss:2.8483 lr:0.17 dt:35ms tok/s:1855465 rem:128s step 13319 (79%) loss:2.8574 lr:0.17 dt:35ms tok/s:1862443 rem:128s step 13320 (79%) loss:2.8654 lr:0.17 dt:35ms tok/s:1865223 rem:128s step 13321 (79%) loss:2.8655 lr:0.17 dt:35ms tok/s:1861345 rem:128s step 13322 (79%) loss:2.8423 lr:0.17 dt:35ms tok/s:1860262 rem:128s step 13323 (79%) loss:2.8574 lr:0.16 dt:35ms tok/s:1858124 rem:128s step 13324 (79%) loss:2.8514 lr:0.16 dt:41ms tok/s:1581003 rem:128s step 13325 (79%) loss:2.8663 lr:0.16 dt:35ms tok/s:1893464 rem:128s step 13326 (79%) loss:2.8806 lr:0.16 dt:35ms tok/s:1875685 rem:128s step 13327 (79%) loss:2.9003 lr:0.16 dt:35ms tok/s:1896102 rem:128s step 13328 (79%) loss:2.9006 lr:0.16 dt:35ms tok/s:1875864 rem:128s step 13329 (79%) loss:2.9049 lr:0.16 dt:35ms tok/s:1873397 rem:128s step 13330 (79%) loss:2.9193 lr:0.16 dt:35ms tok/s:1855440 rem:128s step 13331 (79%) loss:2.9072 lr:0.16 dt:35ms tok/s:1852951 rem:127s step 13332 (79%) loss:2.9020 lr:0.16 dt:35ms tok/s:1862746 rem:127s step 13333 (79%) loss:2.8985 lr:0.16 dt:35ms tok/s:1860514 rem:127s step 13334 (79%) loss:2.8916 lr:0.16 dt:35ms tok/s:1852102 rem:127s step 13335 (79%) loss:2.9020 lr:0.16 dt:35ms tok/s:1856957 rem:127s step 13336 (79%) loss:2.8948 lr:0.16 dt:35ms tok/s:1857760 rem:127s step 13337 (79%) loss:2.8895 lr:0.16 dt:35ms tok/s:1859155 rem:127s step 13338 (79%) loss:2.8698 lr:0.16 dt:35ms tok/s:1862241 rem:127s step 13339 (79%) loss:2.8645 lr:0.16 dt:35ms tok/s:1859243 rem:127s step 13340 (79%) loss:2.8798 lr:0.16 dt:35ms tok/s:1856054 rem:127s step 13341 (79%) loss:2.8721 lr:0.16 dt:35ms tok/s:1854289 rem:127s step 13342 (79%) loss:2.8694 lr:0.16 dt:35ms tok/s:1855027 rem:127s step 13343 (79%) loss:2.8660 lr:0.16 dt:35ms tok/s:1859847 rem:127s step 13344 (79%) loss:2.8681 lr:0.16 dt:35ms tok/s:1862493 rem:127s step 13345 (79%) loss:2.8603 lr:0.16 dt:35ms tok/s:1860514 rem:127s step 13346 (79%) loss:2.8614 lr:0.16 dt:35ms tok/s:1855365 rem:127s step 13347 (79%) loss:2.8544 lr:0.16 dt:35ms tok/s:1859193 rem:127s step 13348 (79%) loss:2.8651 lr:0.16 dt:35ms tok/s:1852289 rem:127s step 13349 (79%) loss:2.8740 lr:0.16 dt:35ms tok/s:1857898 rem:127s step 13350 (79%) loss:2.8661 lr:0.16 dt:35ms tok/s:1857082 rem:127s step 13351 (79%) loss:2.8602 lr:0.16 dt:35ms tok/s:1855641 rem:127s step 13352 (79%) loss:2.8523 lr:0.16 dt:35ms tok/s:1856518 rem:127s step 13353 (79%) loss:2.8727 lr:0.16 dt:35ms tok/s:1861169 rem:127s step 13354 (79%) loss:2.8972 lr:0.16 dt:35ms tok/s:1857183 rem:127s step 13355 (79%) loss:2.9069 lr:0.16 dt:35ms tok/s:1855052 rem:127s step 13356 (79%) loss:2.8922 lr:0.16 dt:35ms tok/s:1858325 rem:127s step 13357 (79%) loss:2.8871 lr:0.16 dt:35ms tok/s:1848465 rem:127s step 13358 (79%) loss:2.8900 lr:0.16 dt:35ms tok/s:1851379 rem:127s step 13359 (79%) loss:2.8825 lr:0.16 dt:35ms tok/s:1853751 rem:127s step 13360 (79%) loss:2.8762 lr:0.16 dt:35ms tok/s:1851429 rem:126s step 13361 (79%) loss:2.8600 lr:0.16 dt:35ms tok/s:1851404 rem:126s step 13362 (79%) loss:2.8703 lr:0.16 dt:35ms tok/s:1880034 rem:126s step 13363 (79%) loss:2.8504 lr:0.16 dt:35ms tok/s:1879173 rem:126s step 13364 (79%) loss:2.8468 lr:0.16 dt:35ms tok/s:1870478 rem:126s step 13365 (79%) loss:2.8415 lr:0.16 dt:35ms tok/s:1859419 rem:126s step 13366 (79%) loss:2.8500 lr:0.16 dt:35ms tok/s:1858011 rem:126s step 13367 (79%) loss:2.8377 lr:0.16 dt:35ms tok/s:1861812 rem:126s step 13368 (79%) loss:2.8653 lr:0.16 dt:35ms tok/s:1860124 rem:126s step 13369 (79%) loss:2.8827 lr:0.16 dt:35ms tok/s:1857095 rem:126s step 13370 (79%) loss:2.8980 lr:0.16 dt:36ms tok/s:1805414 rem:126s step 13371 (79%) loss:2.9039 lr:0.16 dt:36ms tok/s:1820564 rem:126s step 13372 (79%) loss:2.8935 lr:0.16 dt:36ms tok/s:1836462 rem:126s step 13373 (79%) loss:2.8896 lr:0.16 dt:35ms tok/s:1864249 rem:126s step 13374 (79%) loss:2.8691 lr:0.16 dt:35ms tok/s:1860300 rem:126s step 13375 (79%) loss:2.8624 lr:0.16 dt:35ms tok/s:1857861 rem:126s step 13376 (79%) loss:2.8719 lr:0.16 dt:35ms tok/s:1860703 rem:126s step 13377 (79%) loss:2.8709 lr:0.16 dt:35ms tok/s:1860829 rem:126s step 13378 (79%) loss:2.8679 lr:0.16 dt:35ms tok/s:1858212 rem:126s step 13379 (79%) loss:2.8582 lr:0.16 dt:35ms tok/s:1851104 rem:126s step 13380 (79%) loss:2.8517 lr:0.16 dt:36ms tok/s:1834733 rem:126s step 13381 (79%) loss:2.8395 lr:0.16 dt:35ms tok/s:1861673 rem:126s step 13382 (79%) loss:2.8403 lr:0.16 dt:35ms tok/s:1863415 rem:126s step 13383 (79%) loss:2.8541 lr:0.16 dt:35ms tok/s:1856405 rem:126s step 13384 (79%) loss:2.8547 lr:0.16 dt:35ms tok/s:1854814 rem:126s step 13385 (79%) loss:2.8598 lr:0.16 dt:35ms tok/s:1857609 rem:126s step 13386 (79%) loss:2.8822 lr:0.16 dt:35ms tok/s:1865425 rem:126s step 13387 (79%) loss:2.8757 lr:0.16 dt:35ms tok/s:1856844 rem:126s step 13388 (79%) loss:2.8402 lr:0.16 dt:36ms tok/s:1813969 rem:125s step 13389 (79%) loss:2.8518 lr:0.16 dt:36ms tok/s:1817819 rem:125s step 13390 (79%) loss:2.8445 lr:0.16 dt:35ms tok/s:1863996 rem:125s step 13391 (79%) loss:2.8532 lr:0.16 dt:35ms tok/s:1859004 rem:125s step 13392 (79%) loss:2.8499 lr:0.16 dt:35ms tok/s:1861988 rem:125s step 13393 (79%) loss:2.8540 lr:0.16 dt:36ms tok/s:1812473 rem:125s step 13394 (79%) loss:2.8593 lr:0.16 dt:36ms tok/s:1808586 rem:125s step 13395 (79%) loss:2.8603 lr:0.16 dt:36ms tok/s:1834868 rem:125s step 13396 (79%) loss:2.8693 lr:0.16 dt:36ms tok/s:1831628 rem:125s step 13397 (79%) loss:2.8573 lr:0.16 dt:36ms tok/s:1831958 rem:125s step 13398 (79%) loss:2.8201 lr:0.16 dt:36ms tok/s:1834966 rem:125s step 13399 (79%) loss:2.7987 lr:0.16 dt:36ms tok/s:1827184 rem:125s step 13400 (79%) loss:2.8167 lr:0.16 dt:36ms tok/s:1828910 rem:125s + local: attn=[0.171, 1.021, 1.152] mlp=[1.236, 0.487, -0.486] + + transition: attn=[3.715, 1.240] mlp=[-0.577, 1.499] + + hierarchy: attn=[3.568, 5.939, 5.616] mlp=[2.723, -2.415, -5.026] + step 13401 (79%) loss:2.8301 lr:0.16 dt:36ms tok/s:1823451 rem:125s step 13402 (79%) loss:2.8284 lr:0.16 dt:36ms tok/s:1842518 rem:125s step 13403 (79%) loss:2.8076 lr:0.16 dt:36ms tok/s:1840261 rem:125s step 13404 (79%) loss:2.8188 lr:0.16 dt:36ms tok/s:1845276 rem:125s step 13405 (79%) loss:2.8246 lr:0.16 dt:36ms tok/s:1845784 rem:125s step 13406 (79%) loss:2.8305 lr:0.16 dt:36ms tok/s:1839337 rem:125s step 13407 (79%) loss:2.8298 lr:0.16 dt:35ms tok/s:1853626 rem:125s step 13408 (79%) loss:2.8269 lr:0.16 dt:35ms tok/s:1852077 rem:125s step 13409 (79%) loss:2.8832 lr:0.16 dt:36ms tok/s:1843853 rem:125s step 13410 (79%) loss:2.8865 lr:0.16 dt:35ms tok/s:1849647 rem:125s step 13411 (79%) loss:2.8915 lr:0.16 dt:36ms tok/s:1843074 rem:125s step 13412 (79%) loss:2.9318 lr:0.16 dt:35ms tok/s:1849697 rem:125s step 13413 (79%) loss:2.9073 lr:0.16 dt:35ms tok/s:1846442 rem:125s step 13414 (79%) loss:2.8873 lr:0.16 dt:36ms tok/s:1845029 rem:125s step 13415 (79%) loss:2.8795 lr:0.16 dt:162ms tok/s:403648 rem:124s step 13416 (79%) loss:2.8812 lr:0.16 dt:38ms tok/s:1703475 rem:124s step 13417 (79%) loss:2.8773 lr:0.16 dt:39ms tok/s:1682198 rem:124s step 13418 (79%) loss:2.8817 lr:0.16 dt:39ms tok/s:1672383 rem:124s step 13419 (79%) loss:2.8913 lr:0.16 dt:40ms tok/s:1651494 rem:124s step 13420 (79%) loss:2.8849 lr:0.16 dt:40ms tok/s:1642612 rem:124s step 13421 (79%) loss:2.8766 lr:0.16 dt:151ms tok/s:433032 rem:124s step 13422 (79%) loss:2.8683 lr:0.16 dt:40ms tok/s:1624574 rem:124s step 13423 (79%) loss:2.8678 lr:0.16 dt:41ms tok/s:1601069 rem:124s step 13424 (79%) loss:2.8418 lr:0.16 dt:41ms tok/s:1582159 rem:124s step 13425 (79%) loss:2.8195 lr:0.16 dt:42ms tok/s:1573689 rem:124s step 13426 (79%) loss:2.8023 lr:0.16 dt:42ms tok/s:1560804 rem:124s step 13427 (79%) loss:2.8038 lr:0.16 dt:42ms tok/s:1571773 rem:124s step 13428 (79%) loss:2.8169 lr:0.16 dt:41ms tok/s:1581185 rem:124s step 13429 (79%) loss:2.8189 lr:0.16 dt:41ms tok/s:1598508 rem:124s step 13430 (79%) loss:2.8366 lr:0.16 dt:41ms tok/s:1594151 rem:124s step 13431 (79%) loss:2.8129 lr:0.16 dt:41ms tok/s:1589377 rem:124s step 13432 (79%) loss:2.8223 lr:0.15 dt:41ms tok/s:1592295 rem:124s step 13433 (79%) loss:2.7764 lr:0.15 dt:41ms tok/s:1591484 rem:124s step 13434 (79%) loss:2.7351 lr:0.15 dt:41ms tok/s:1590535 rem:124s step 13435 (79%) loss:2.7538 lr:0.15 dt:41ms tok/s:1585993 rem:123s step 13436 (79%) loss:2.7692 lr:0.15 dt:46ms tok/s:1426751 rem:123s step 13437 (79%) loss:2.7743 lr:0.15 dt:41ms tok/s:1579949 rem:123s step 13438 (79%) loss:2.7816 lr:0.15 dt:41ms tok/s:1582869 rem:123s step 13439 (79%) loss:2.7997 lr:0.15 dt:42ms tok/s:1576180 rem:123s step 13440 (79%) loss:2.7996 lr:0.15 dt:42ms tok/s:1576406 rem:123s step 13441 (79%) loss:2.8115 lr:0.15 dt:43ms tok/s:1520536 rem:123s step 13442 (79%) loss:2.8185 lr:0.15 dt:42ms tok/s:1560467 rem:123s step 13443 (79%) loss:2.8242 lr:0.15 dt:42ms tok/s:1558998 rem:123s step 13444 (79%) loss:2.8089 lr:0.15 dt:42ms tok/s:1557408 rem:123s step 13445 (79%) loss:2.8533 lr:0.15 dt:42ms tok/s:1554212 rem:123s step 13446 (79%) loss:2.8805 lr:0.15 dt:42ms tok/s:1549918 rem:123s step 13447 (80%) loss:2.8819 lr:0.15 dt:42ms tok/s:1555126 rem:123s step 13448 (80%) loss:2.9133 lr:0.15 dt:42ms tok/s:1554229 rem:123s step 13449 (80%) loss:2.9137 lr:0.15 dt:42ms tok/s:1543114 rem:123s step 13450 (80%) loss:2.9061 lr:0.15 dt:42ms tok/s:1553018 rem:123s step 13451 (80%) loss:2.8947 lr:0.15 dt:43ms tok/s:1541711 rem:123s step 13452 (80%) loss:2.8915 lr:0.15 dt:42ms tok/s:1553957 rem:123s step 13453 (80%) loss:2.9154 lr:0.15 dt:42ms tok/s:1554730 rem:123s step 13454 (80%) loss:2.9463 lr:0.15 dt:42ms tok/s:1548399 rem:123s step 13455 (80%) loss:2.9587 lr:0.15 dt:42ms tok/s:1544735 rem:123s step 13456 (80%) loss:2.9405 lr:0.15 dt:42ms tok/s:1550836 rem:123s step 13457 (80%) loss:2.8931 lr:0.15 dt:42ms tok/s:1553316 rem:123s step 13458 (80%) loss:2.8920 lr:0.15 dt:42ms tok/s:1548128 rem:122s step 13459 (80%) loss:2.8901 lr:0.15 dt:43ms tok/s:1541962 rem:122s step 13460 (80%) loss:2.8810 lr:0.15 dt:42ms tok/s:1552561 rem:122s step 13461 (80%) loss:2.8862 lr:0.15 dt:42ms tok/s:1549883 rem:122s step 13462 (80%) loss:2.8847 lr:0.15 dt:42ms tok/s:1548181 rem:122s step 13463 (80%) loss:2.8807 lr:0.15 dt:42ms tok/s:1548800 rem:122s step 13464 (80%) loss:2.8712 lr:0.15 dt:42ms tok/s:1549839 rem:122s step 13465 (80%) loss:2.8621 lr:0.15 dt:42ms tok/s:1548189 rem:122s step 13466 (80%) loss:2.8711 lr:0.15 dt:42ms tok/s:1548896 rem:122s step 13467 (80%) loss:2.8725 lr:0.15 dt:42ms tok/s:1548826 rem:122s step 13468 (80%) loss:2.8770 lr:0.15 dt:42ms tok/s:1545847 rem:122s step 13469 (80%) loss:2.8941 lr:0.15 dt:42ms tok/s:1546473 rem:122s step 13470 (80%) loss:2.8874 lr:0.15 dt:42ms tok/s:1553799 rem:122s step 13471 (80%) loss:2.8919 lr:0.15 dt:42ms tok/s:1553676 rem:122s step 13472 (80%) loss:2.8868 lr:0.15 dt:42ms tok/s:1552386 rem:122s step 13473 (80%) loss:2.8859 lr:0.15 dt:42ms tok/s:1552553 rem:122s step 13474 (80%) loss:2.8915 lr:0.15 dt:42ms tok/s:1548486 rem:122s step 13475 (80%) loss:2.8969 lr:0.15 dt:48ms tok/s:1379203 rem:122s step 13476 (80%) loss:2.8776 lr:0.15 dt:42ms tok/s:1547649 rem:122s step 13477 (80%) loss:2.8474 lr:0.15 dt:42ms tok/s:1546247 rem:122s step 13478 (80%) loss:2.8389 lr:0.15 dt:42ms tok/s:1547753 rem:122s step 13479 (80%) loss:2.8348 lr:0.15 dt:42ms tok/s:1546247 rem:122s step 13480 (80%) loss:2.8366 lr:0.15 dt:42ms tok/s:1544527 rem:122s step 13481 (80%) loss:2.8339 lr:0.15 dt:42ms tok/s:1548172 rem:122s step 13482 (80%) loss:2.8444 lr:0.15 dt:42ms tok/s:1545439 rem:121s step 13483 (80%) loss:2.8563 lr:0.15 dt:42ms tok/s:1553983 rem:121s step 13484 (80%) loss:2.8652 lr:0.15 dt:42ms tok/s:1550600 rem:121s step 13485 (80%) loss:2.8473 lr:0.15 dt:42ms tok/s:1549717 rem:121s step 13486 (80%) loss:2.8634 lr:0.15 dt:42ms tok/s:1550880 rem:121s step 13487 (80%) loss:2.8597 lr:0.15 dt:42ms tok/s:1545934 rem:121s step 13488 (80%) loss:2.8626 lr:0.15 dt:42ms tok/s:1547222 rem:121s step 13489 (80%) loss:2.8672 lr:0.15 dt:42ms tok/s:1547283 rem:121s step 13490 (80%) loss:2.8748 lr:0.15 dt:42ms tok/s:1546308 rem:121s step 13491 (80%) loss:2.8870 lr:0.15 dt:43ms tok/s:1539811 rem:121s step 13492 (80%) loss:2.8776 lr:0.15 dt:42ms tok/s:1543451 rem:121s step 13493 (80%) loss:2.8703 lr:0.15 dt:43ms tok/s:1541089 rem:121s step 13494 (80%) loss:2.8705 lr:0.15 dt:42ms tok/s:1542525 rem:121s step 13495 (80%) loss:2.8788 lr:0.15 dt:43ms tok/s:1541219 rem:121s step 13496 (80%) loss:2.8842 lr:0.15 dt:42ms tok/s:1544544 rem:121s step 13497 (80%) loss:2.8748 lr:0.15 dt:42ms tok/s:1544831 rem:121s step 13498 (80%) loss:2.8726 lr:0.15 dt:42ms tok/s:1546430 rem:121s step 13499 (80%) loss:2.8833 lr:0.15 dt:42ms tok/s:1545447 rem:121s step 13500 (80%) loss:2.9119 lr:0.15 dt:42ms tok/s:1544579 rem:121s + local: attn=[0.171, 1.044, 1.182] mlp=[1.260, 0.498, -0.497] + + transition: attn=[3.757, 1.227] mlp=[-0.607, 1.536] + + hierarchy: attn=[3.555, 5.939, 5.616] mlp=[2.775, -2.547, -5.055] + step 13501 (80%) loss:2.8751 lr:0.15 dt:42ms tok/s:1544883 rem:121s step 13502 (80%) loss:2.8660 lr:0.15 dt:43ms tok/s:1541807 rem:121s step 13503 (80%) loss:2.8664 lr:0.15 dt:42ms tok/s:1545274 rem:121s step 13504 (80%) loss:2.8549 lr:0.15 dt:43ms tok/s:1540856 rem:121s step 13505 (80%) loss:2.8496 lr:0.15 dt:43ms tok/s:1538562 rem:120s step 13506 (80%) loss:2.8477 lr:0.15 dt:43ms tok/s:1534148 rem:120s step 13507 (80%) loss:2.8489 lr:0.15 dt:43ms tok/s:1536901 rem:120s step 13508 (80%) loss:2.8505 lr:0.15 dt:43ms tok/s:1536515 rem:120s step 13509 (80%) loss:2.8439 lr:0.15 dt:43ms tok/s:1536481 rem:120s step 13510 (80%) loss:2.8303 lr:0.15 dt:43ms tok/s:1536566 rem:120s step 13511 (80%) loss:2.8376 lr:0.15 dt:45ms tok/s:1461743 rem:120s step 13512 (80%) loss:2.8563 lr:0.15 dt:43ms tok/s:1534919 rem:120s step 13513 (80%) loss:2.8533 lr:0.15 dt:43ms tok/s:1537383 rem:120s step 13514 (80%) loss:2.8741 lr:0.15 dt:43ms tok/s:1536008 rem:120s step 13515 (80%) loss:2.8780 lr:0.15 dt:43ms tok/s:1536326 rem:120s step 13516 (80%) loss:2.8948 lr:0.15 dt:43ms tok/s:1539044 rem:120s step 13517 (80%) loss:2.8719 lr:0.15 dt:43ms tok/s:1535897 rem:120s step 13518 (80%) loss:2.8751 lr:0.15 dt:158ms tok/s:414649 rem:120s step 13519 (80%) loss:2.8760 lr:0.15 dt:42ms tok/s:1575493 rem:120s step 13520 (80%) loss:2.8692 lr:0.15 dt:42ms tok/s:1551492 rem:120s step 13521 (80%) loss:2.8759 lr:0.15 dt:43ms tok/s:1538174 rem:120s step 13522 (80%) loss:2.8804 lr:0.15 dt:42ms tok/s:1556773 rem:120s step 13523 (80%) loss:2.8602 lr:0.15 dt:42ms tok/s:1562543 rem:120s step 13524 (80%) loss:2.8722 lr:0.15 dt:42ms tok/s:1577392 rem:120s step 13525 (80%) loss:2.8633 lr:0.15 dt:42ms tok/s:1573230 rem:120s step 13526 (80%) loss:2.8512 lr:0.15 dt:42ms tok/s:1577600 rem:119s step 13527 (80%) loss:2.8557 lr:0.15 dt:42ms tok/s:1571791 rem:119s step 13528 (80%) loss:2.8659 lr:0.15 dt:42ms tok/s:1573608 rem:119s step 13529 (80%) loss:2.8449 lr:0.15 dt:42ms tok/s:1573338 rem:119s step 13530 (80%) loss:2.8536 lr:0.14 dt:42ms tok/s:1571719 rem:119s step 13531 (80%) loss:2.8267 lr:0.14 dt:42ms tok/s:1566684 rem:119s step 13532 (80%) loss:2.8260 lr:0.14 dt:42ms tok/s:1568929 rem:119s step 13533 (80%) loss:2.8189 lr:0.14 dt:42ms tok/s:1562605 rem:119s step 13534 (80%) loss:2.8085 lr:0.14 dt:42ms tok/s:1560290 rem:119s step 13535 (80%) loss:2.7823 lr:0.14 dt:42ms tok/s:1559334 rem:119s step 13536 (80%) loss:2.7809 lr:0.14 dt:42ms tok/s:1557038 rem:119s step 13537 (80%) loss:2.7725 lr:0.14 dt:42ms tok/s:1549236 rem:119s step 13538 (80%) loss:2.7553 lr:0.14 dt:42ms tok/s:1547300 rem:119s step 13539 (80%) loss:2.7519 lr:0.14 dt:44ms tok/s:1479158 rem:119s step 13540 (80%) loss:2.7477 lr:0.14 dt:42ms tok/s:1553211 rem:119s step 13541 (80%) loss:2.7596 lr:0.14 dt:42ms tok/s:1544067 rem:119s step 13542 (80%) loss:2.7495 lr:0.14 dt:42ms tok/s:1542230 rem:119s step 13543 (80%) loss:2.7583 lr:0.14 dt:45ms tok/s:1466774 rem:119s step 13544 (80%) loss:2.7531 lr:0.14 dt:42ms tok/s:1545847 rem:119s step 13545 (80%) loss:2.7550 lr:0.14 dt:42ms tok/s:1543937 rem:119s step 13546 (80%) loss:2.7666 lr:0.14 dt:42ms tok/s:1542568 rem:119s step 13547 (80%) loss:2.7706 lr:0.14 dt:158ms tok/s:413801 rem:118s step 13548 (80%) loss:2.7937 lr:0.14 dt:161ms tok/s:407403 rem:118s step 13549 (80%) loss:2.8064 lr:0.14 dt:42ms tok/s:1550267 rem:118s step 13550 (80%) loss:2.7854 lr:0.14 dt:43ms tok/s:1521083 rem:118s step 13551 (80%) loss:2.8201 lr:0.14 dt:43ms tok/s:1518671 rem:118s step 13552 (80%) loss:2.8830 lr:0.14 dt:42ms tok/s:1558026 rem:118s step 13553 (80%) loss:2.9149 lr:0.14 dt:42ms tok/s:1575123 rem:118s step 13554 (80%) loss:2.9391 lr:0.14 dt:41ms tok/s:1579241 rem:118s step 13555 (80%) loss:2.9384 lr:0.14 dt:42ms tok/s:1576985 rem:118s step 13556 (80%) loss:2.9333 lr:0.14 dt:42ms tok/s:1572681 rem:118s step 13557 (80%) loss:2.8965 lr:0.14 dt:41ms tok/s:1585234 rem:118s step 13558 (80%) loss:2.8971 lr:0.14 dt:41ms tok/s:1579313 rem:118s step 13559 (80%) loss:2.8828 lr:0.14 dt:154ms tok/s:425092 rem:118s step 13560 (80%) loss:2.9849 lr:0.14 dt:41ms tok/s:1584302 rem:118s step 13561 (80%) loss:2.9834 lr:0.14 dt:42ms tok/s:1558671 rem:118s step 13562 (80%) loss:2.9758 lr:0.14 dt:42ms tok/s:1554986 rem:118s step 13563 (80%) loss:2.9623 lr:0.14 dt:42ms tok/s:1555936 rem:118s step 13564 (80%) loss:2.9627 lr:0.14 dt:42ms tok/s:1573167 rem:118s step 13565 (80%) loss:2.9690 lr:0.14 dt:42ms tok/s:1577229 rem:117s step 13566 (80%) loss:2.9558 lr:0.14 dt:42ms tok/s:1578126 rem:117s step 13567 (80%) loss:2.9336 lr:0.14 dt:41ms tok/s:1581230 rem:117s step 13568 (80%) loss:2.9321 lr:0.14 dt:41ms tok/s:1582787 rem:117s step 13569 (80%) loss:2.9261 lr:0.14 dt:41ms tok/s:1587009 rem:117s step 13570 (80%) loss:2.9240 lr:0.14 dt:159ms tok/s:412574 rem:117s step 13571 (80%) loss:2.9194 lr:0.14 dt:41ms tok/s:1587137 rem:117s step 13572 (80%) loss:2.9041 lr:0.14 dt:42ms tok/s:1562756 rem:117s step 13573 (80%) loss:2.8949 lr:0.14 dt:42ms tok/s:1557770 rem:117s step 13574 (80%) loss:2.9002 lr:0.14 dt:42ms tok/s:1558998 rem:117s step 13575 (80%) loss:2.9031 lr:0.14 dt:44ms tok/s:1504441 rem:117s step 13576 (81%) loss:2.9071 lr:0.14 dt:42ms tok/s:1573635 rem:117s step 13577 (81%) loss:2.9124 lr:0.14 dt:42ms tok/s:1558070 rem:117s step 13578 (81%) loss:2.9300 lr:0.14 dt:42ms tok/s:1576551 rem:117s step 13579 (81%) loss:2.9257 lr:0.14 dt:42ms tok/s:1569323 rem:117s step 13580 (81%) loss:2.9104 lr:0.14 dt:42ms tok/s:1562632 rem:117s step 13581 (81%) loss:2.9196 lr:0.14 dt:41ms tok/s:1581194 rem:117s step 13582 (81%) loss:2.9180 lr:0.14 dt:42ms tok/s:1557364 rem:117s step 13583 (81%) loss:2.9302 lr:0.14 dt:42ms tok/s:1572600 rem:117s step 13584 (81%) loss:2.9201 lr:0.14 dt:42ms tok/s:1565614 rem:117s step 13585 (81%) loss:2.9297 lr:0.14 dt:42ms tok/s:1566890 rem:117s step 13586 (81%) loss:2.9639 lr:0.14 dt:42ms tok/s:1562650 rem:117s step 13587 (81%) loss:3.0425 lr:0.14 dt:42ms tok/s:1561070 rem:116s step 13588 (81%) loss:3.0336 lr:0.14 dt:42ms tok/s:1560414 rem:116s step 13589 (81%) loss:3.0368 lr:0.14 dt:42ms tok/s:1544067 rem:116s step 13590 (81%) loss:3.0313 lr:0.14 dt:42ms tok/s:1554810 rem:116s step 13591 (81%) loss:3.0048 lr:0.14 dt:46ms tok/s:1426144 rem:116s step 13592 (81%) loss:2.9989 lr:0.14 dt:42ms tok/s:1553527 rem:116s step 13593 (81%) loss:2.9637 lr:0.14 dt:42ms tok/s:1543989 rem:116s step 13594 (81%) loss:2.9428 lr:0.14 dt:43ms tok/s:1539173 rem:116s step 13595 (81%) loss:2.9537 lr:0.14 dt:43ms tok/s:1539975 rem:116s step 13596 (81%) loss:2.9747 lr:0.14 dt:43ms tok/s:1537116 rem:116s step 13597 (81%) loss:2.9975 lr:0.14 dt:43ms tok/s:1530876 rem:116s step 13598 (81%) loss:2.9717 lr:0.14 dt:43ms tok/s:1532463 rem:116s step 13599 (81%) loss:3.1126 lr:0.14 dt:43ms tok/s:1531328 rem:116s step 13600 (81%) loss:3.2333 lr:0.14 dt:43ms tok/s:1531677 rem:116s + local: attn=[0.171, 1.046, 1.182] mlp=[1.277, 0.522, -0.483] + + transition: attn=[3.794, 1.245] mlp=[-0.612, 1.606] + + hierarchy: attn=[3.561, 5.939, 5.616] mlp=[2.826, -2.610, -5.147] + step 13601 (81%) loss:3.2756 lr:0.14 dt:43ms tok/s:1531814 rem:116s step 13602 (81%) loss:3.2319 lr:0.14 dt:44ms tok/s:1473734 rem:116s step 13603 (81%) loss:3.2016 lr:0.14 dt:43ms tok/s:1525633 rem:116s step 13604 (81%) loss:3.1668 lr:0.14 dt:42ms tok/s:1546830 rem:116s step 13605 (81%) loss:3.1399 lr:0.14 dt:42ms tok/s:1543313 rem:116s step 13606 (81%) loss:3.1170 lr:0.14 dt:43ms tok/s:1540191 rem:116s step 13607 (81%) loss:3.0717 lr:0.14 dt:43ms tok/s:1540277 rem:116s step 13608 (81%) loss:3.0577 lr:0.14 dt:43ms tok/s:1539018 rem:116s step 13609 (81%) loss:3.0356 lr:0.14 dt:43ms tok/s:1524153 rem:116s step 13610 (81%) loss:3.0095 lr:0.14 dt:43ms tok/s:1532147 rem:115s step 13611 (81%) loss:2.9903 lr:0.14 dt:43ms tok/s:1529079 rem:115s step 13612 (81%) loss:2.9733 lr:0.14 dt:43ms tok/s:1528135 rem:115s step 13613 (81%) loss:2.9636 lr:0.14 dt:43ms tok/s:1527473 rem:115s step 13614 (81%) loss:2.9314 lr:0.14 dt:43ms tok/s:1527515 rem:115s step 13615 (81%) loss:2.9003 lr:0.14 dt:43ms tok/s:1527609 rem:115s step 13616 (81%) loss:2.8972 lr:0.14 dt:43ms tok/s:1526133 rem:115s step 13617 (81%) loss:2.8824 lr:0.14 dt:43ms tok/s:1525591 rem:115s step 13618 (81%) loss:2.8618 lr:0.14 dt:43ms tok/s:1524009 rem:115s step 13619 (81%) loss:2.8447 lr:0.14 dt:43ms tok/s:1524220 rem:115s step 13620 (81%) loss:2.8317 lr:0.14 dt:43ms tok/s:1524635 rem:115s step 13621 (81%) loss:2.8400 lr:0.14 dt:43ms tok/s:1525565 rem:115s step 13622 (81%) loss:2.8292 lr:0.14 dt:43ms tok/s:1524609 rem:115s step 13623 (81%) loss:2.8292 lr:0.13 dt:43ms tok/s:1523215 rem:115s step 13624 (81%) loss:2.8101 lr:0.13 dt:43ms tok/s:1526650 rem:115s step 13625 (81%) loss:2.8263 lr:0.13 dt:43ms tok/s:1527396 rem:115s step 13626 (81%) loss:2.8363 lr:0.13 dt:43ms tok/s:1525599 rem:115s step 13627 (81%) loss:2.8441 lr:0.13 dt:43ms tok/s:1525794 rem:115s step 13628 (81%) loss:2.8430 lr:0.13 dt:43ms tok/s:1522642 rem:115s step 13629 (81%) loss:2.8521 lr:0.13 dt:43ms tok/s:1524423 rem:115s step 13630 (81%) loss:2.8592 lr:0.13 dt:46ms tok/s:1420940 rem:115s step 13631 (81%) loss:2.8557 lr:0.13 dt:43ms tok/s:1527032 rem:115s step 13632 (81%) loss:2.8570 lr:0.13 dt:43ms tok/s:1511863 rem:115s step 13633 (81%) loss:2.8522 lr:0.13 dt:43ms tok/s:1526353 rem:114s step 13634 (81%) loss:2.8513 lr:0.13 dt:161ms tok/s:406556 rem:114s step 13635 (81%) loss:2.8605 lr:0.13 dt:41ms tok/s:1581349 rem:114s step 13636 (81%) loss:2.8508 lr:0.13 dt:42ms tok/s:1560547 rem:114s step 13637 (81%) loss:2.8529 lr:0.13 dt:42ms tok/s:1558671 rem:114s step 13638 (81%) loss:2.8372 lr:0.13 dt:42ms tok/s:1561496 rem:114s step 13639 (81%) loss:2.8342 lr:0.13 dt:42ms tok/s:1575647 rem:114s step 13640 (81%) loss:2.8220 lr:0.13 dt:41ms tok/s:1582815 rem:114s step 13641 (81%) loss:2.8195 lr:0.13 dt:41ms tok/s:1589119 rem:114s step 13642 (81%) loss:2.8241 lr:0.13 dt:41ms tok/s:1594651 rem:114s step 13643 (81%) loss:2.8231 lr:0.13 dt:41ms tok/s:1585672 rem:114s step 13644 (81%) loss:2.8224 lr:0.13 dt:41ms tok/s:1585206 rem:114s step 13645 (81%) loss:2.7857 lr:0.13 dt:41ms tok/s:1581649 rem:114s step 13646 (81%) loss:2.7997 lr:0.13 dt:42ms tok/s:1575241 rem:114s step 13647 (81%) loss:2.8029 lr:0.13 dt:43ms tok/s:1521302 rem:114s step 13648 (81%) loss:2.8054 lr:0.13 dt:42ms tok/s:1559485 rem:114s step 13649 (81%) loss:2.8116 lr:0.13 dt:42ms tok/s:1566756 rem:114s step 13650 (81%) loss:2.8026 lr:0.13 dt:42ms tok/s:1565587 rem:114s step 13651 (81%) loss:2.8142 lr:0.13 dt:42ms tok/s:1559679 rem:114s step 13652 (81%) loss:2.8257 lr:0.13 dt:42ms tok/s:1551046 rem:114s step 13653 (81%) loss:2.8368 lr:0.13 dt:42ms tok/s:1553781 rem:114s step 13654 (81%) loss:2.8520 lr:0.13 dt:42ms tok/s:1555311 rem:113s step 13655 (81%) loss:2.8487 lr:0.13 dt:42ms tok/s:1547475 rem:113s step 13656 (81%) loss:2.8370 lr:0.13 dt:43ms tok/s:1522776 rem:113s step 13657 (81%) loss:2.8326 lr:0.13 dt:42ms tok/s:1548346 rem:113s step 13658 (81%) loss:2.8322 lr:0.13 dt:42ms tok/s:1547823 rem:113s step 13659 (81%) loss:2.8434 lr:0.13 dt:42ms tok/s:1543989 rem:113s step 13660 (81%) loss:2.8483 lr:0.13 dt:43ms tok/s:1534294 rem:113s step 13661 (81%) loss:2.8460 lr:0.13 dt:43ms tok/s:1540562 rem:113s step 13662 (81%) loss:2.8367 lr:0.13 dt:43ms tok/s:1541305 rem:113s step 13663 (81%) loss:2.8241 lr:0.13 dt:43ms tok/s:1540303 rem:113s step 13664 (81%) loss:2.8285 lr:0.13 dt:43ms tok/s:1541150 rem:113s step 13665 (81%) loss:2.8274 lr:0.13 dt:42ms tok/s:1542585 rem:113s step 13666 (81%) loss:2.8420 lr:0.13 dt:43ms tok/s:1540053 rem:113s step 13667 (81%) loss:2.8489 lr:0.13 dt:43ms tok/s:1537323 rem:113s step 13668 (81%) loss:2.8416 lr:0.13 dt:43ms tok/s:1534593 rem:113s step 13669 (81%) loss:2.8121 lr:0.13 dt:43ms tok/s:1531362 rem:113s step 13670 (81%) loss:2.7973 lr:0.13 dt:42ms tok/s:1543140 rem:113s step 13671 (81%) loss:2.7911 lr:0.13 dt:43ms tok/s:1541539 rem:113s step 13672 (81%) loss:2.7835 lr:0.13 dt:46ms tok/s:1429549 rem:113s step 13673 (81%) loss:2.7577 lr:0.13 dt:43ms tok/s:1531669 rem:113s step 13674 (81%) loss:2.7156 lr:0.13 dt:43ms tok/s:1526201 rem:113s step 13675 (81%) loss:2.6868 lr:0.13 dt:43ms tok/s:1522911 rem:113s step 13676 (81%) loss:2.6756 lr:0.13 dt:43ms tok/s:1527863 rem:113s step 13677 (81%) loss:2.6829 lr:0.13 dt:43ms tok/s:1523713 rem:113s step 13678 (81%) loss:2.7054 lr:0.13 dt:43ms tok/s:1522414 rem:112s step 13679 (81%) loss:2.7188 lr:0.13 dt:43ms tok/s:1514404 rem:112s step 13680 (81%) loss:2.7349 lr:0.13 dt:43ms tok/s:1516601 rem:112s step 13681 (81%) loss:2.7468 lr:0.13 dt:43ms tok/s:1520788 rem:112s step 13682 (81%) loss:2.7686 lr:0.13 dt:43ms tok/s:1520847 rem:112s step 13683 (81%) loss:2.7693 lr:0.13 dt:43ms tok/s:1519292 rem:112s step 13684 (81%) loss:2.7747 lr:0.13 dt:43ms tok/s:1517681 rem:112s step 13685 (81%) loss:2.7628 lr:0.13 dt:44ms tok/s:1494690 rem:112s step 13686 (81%) loss:2.7609 lr:0.13 dt:43ms tok/s:1534551 rem:112s step 13687 (81%) loss:2.7696 lr:0.13 dt:43ms tok/s:1539320 rem:112s step 13688 (81%) loss:2.7754 lr:0.13 dt:43ms tok/s:1529036 rem:112s step 13689 (81%) loss:2.8128 lr:0.13 dt:43ms tok/s:1528560 rem:112s step 13690 (81%) loss:2.7960 lr:0.13 dt:43ms tok/s:1531379 rem:112s step 13691 (81%) loss:2.7697 lr:0.13 dt:43ms tok/s:1532916 rem:112s step 13692 (81%) loss:2.7909 lr:0.13 dt:43ms tok/s:1526921 rem:112s step 13693 (81%) loss:2.8006 lr:0.13 dt:43ms tok/s:1532028 rem:112s step 13694 (81%) loss:2.8001 lr:0.13 dt:43ms tok/s:1532241 rem:112s step 13695 (81%) loss:2.8177 lr:0.13 dt:44ms tok/s:1500917 rem:112s step 13696 (81%) loss:2.8023 lr:0.13 dt:43ms tok/s:1531592 rem:112s step 13697 (81%) loss:2.8028 lr:0.13 dt:43ms tok/s:1529002 rem:112s step 13698 (81%) loss:2.8187 lr:0.13 dt:43ms tok/s:1531942 rem:112s step 13699 (81%) loss:2.8291 lr:0.13 dt:43ms tok/s:1528977 rem:112s step 13700 (81%) loss:2.8171 lr:0.13 dt:43ms tok/s:1531038 rem:112s + local: attn=[0.179, 1.058, 1.180] mlp=[1.309, 0.540, -0.481] + + transition: attn=[3.785, 1.275] mlp=[-0.660, 1.661] + + hierarchy: attn=[3.577, 5.939, 5.616] mlp=[2.914, -2.715, -5.173] + step 13701 (81%) loss:2.8170 lr:0.13 dt:43ms tok/s:1534371 rem:111s step 13702 (81%) loss:2.8179 lr:0.13 dt:43ms tok/s:1531635 rem:111s step 13703 (81%) loss:2.8188 lr:0.13 dt:43ms tok/s:1531951 rem:111s step 13704 (81%) loss:2.8115 lr:0.13 dt:43ms tok/s:1529377 rem:111s step 13705 (81%) loss:2.8166 lr:0.13 dt:43ms tok/s:1527380 rem:111s step 13706 (81%) loss:2.8130 lr:0.13 dt:43ms tok/s:1524829 rem:111s step 13707 (81%) loss:2.8075 lr:0.13 dt:43ms tok/s:1522642 rem:111s step 13708 (81%) loss:2.8055 lr:0.13 dt:43ms tok/s:1528943 rem:111s step 13709 (81%) loss:2.8066 lr:0.13 dt:43ms tok/s:1528560 rem:111s step 13710 (81%) loss:2.8147 lr:0.13 dt:43ms tok/s:1527507 rem:111s step 13711 (81%) loss:2.7999 lr:0.13 dt:43ms tok/s:1524373 rem:111s step 13712 (81%) loss:2.7832 lr:0.13 dt:43ms tok/s:1527456 rem:111s step 13713 (82%) loss:2.7954 lr:0.13 dt:43ms tok/s:1529002 rem:111s step 13714 (82%) loss:2.7922 lr:0.13 dt:43ms tok/s:1526006 rem:111s step 13715 (82%) loss:2.7843 lr:0.13 dt:43ms tok/s:1523612 rem:111s step 13716 (82%) loss:2.7908 lr:0.13 dt:43ms tok/s:1524660 rem:111s step 13717 (82%) loss:2.8010 lr:0.13 dt:43ms tok/s:1520023 rem:111s step 13718 (82%) loss:2.7996 lr:0.13 dt:43ms tok/s:1527447 rem:111s step 13719 (82%) loss:2.7923 lr:0.13 dt:43ms tok/s:1511215 rem:111s step 13720 (82%) loss:2.7979 lr:0.13 dt:43ms tok/s:1512928 rem:111s step 13721 (82%) loss:2.8062 lr:0.13 dt:43ms tok/s:1514445 rem:111s step 13722 (82%) loss:2.8304 lr:0.13 dt:43ms tok/s:1514337 rem:111s step 13723 (82%) loss:2.8198 lr:0.13 dt:43ms tok/s:1515381 rem:111s step 13724 (82%) loss:2.8523 lr:0.13 dt:45ms tok/s:1466039 rem:110s step 13725 (82%) loss:2.9402 lr:0.13 dt:43ms tok/s:1529691 rem:110s step 13726 (82%) loss:3.0171 lr:0.13 dt:43ms tok/s:1533643 rem:110s step 13727 (82%) loss:3.0282 lr:0.12 dt:43ms tok/s:1532531 rem:110s step 13728 (82%) loss:3.0050 lr:0.12 dt:43ms tok/s:1530305 rem:110s step 13729 (82%) loss:2.9753 lr:0.12 dt:43ms tok/s:1524525 rem:110s step 13730 (82%) loss:2.9276 lr:0.12 dt:43ms tok/s:1534045 rem:110s step 13731 (82%) loss:2.9427 lr:0.12 dt:43ms tok/s:1531456 rem:110s step 13732 (82%) loss:2.9175 lr:0.12 dt:43ms tok/s:1535536 rem:110s step 13733 (82%) loss:2.9180 lr:0.12 dt:44ms tok/s:1504482 rem:110s step 13734 (82%) loss:2.9054 lr:0.12 dt:43ms tok/s:1532156 rem:110s step 13735 (82%) loss:2.9066 lr:0.12 dt:43ms tok/s:1531677 rem:110s step 13736 (82%) loss:2.9071 lr:0.12 dt:43ms tok/s:1527346 rem:110s step 13737 (82%) loss:2.8998 lr:0.12 dt:43ms tok/s:1533446 rem:110s step 13738 (82%) loss:2.8625 lr:0.12 dt:43ms tok/s:1528951 rem:110s step 13739 (82%) loss:2.8675 lr:0.12 dt:43ms tok/s:1531549 rem:110s step 13740 (82%) loss:2.8784 lr:0.12 dt:43ms tok/s:1530066 rem:110s step 13741 (82%) loss:2.8811 lr:0.12 dt:43ms tok/s:1530475 rem:110s step 13742 (82%) loss:2.8650 lr:0.12 dt:43ms tok/s:1526082 rem:110s step 13743 (82%) loss:2.8606 lr:0.12 dt:43ms tok/s:1528271 rem:110s step 13744 (82%) loss:2.8499 lr:0.12 dt:43ms tok/s:1524508 rem:110s step 13745 (82%) loss:2.8386 lr:0.12 dt:43ms tok/s:1524026 rem:110s step 13746 (82%) loss:2.8226 lr:0.12 dt:43ms tok/s:1529130 rem:110s step 13747 (82%) loss:2.8149 lr:0.12 dt:48ms tok/s:1367096 rem:109s step 13748 (82%) loss:2.8193 lr:0.12 dt:43ms tok/s:1539139 rem:109s step 13749 (82%) loss:2.8294 lr:0.12 dt:43ms tok/s:1530859 rem:109s step 13750 (82%) loss:2.8255 lr:0.12 dt:43ms tok/s:1534354 rem:109s step 13751 (82%) loss:2.8167 lr:0.12 dt:43ms tok/s:1532489 rem:109s step 13752 (82%) loss:2.8176 lr:0.12 dt:43ms tok/s:1533566 rem:109s step 13753 (82%) loss:2.8266 lr:0.12 dt:43ms tok/s:1532660 rem:109s step 13754 (82%) loss:2.9350 lr:0.12 dt:43ms tok/s:1530168 rem:109s step 13755 (82%) loss:3.0145 lr:0.12 dt:43ms tok/s:1532198 rem:109s step 13756 (82%) loss:3.0135 lr:0.12 dt:43ms tok/s:1531567 rem:109s step 13757 (82%) loss:3.0120 lr:0.12 dt:43ms tok/s:1531934 rem:109s step 13758 (82%) loss:2.9917 lr:0.12 dt:43ms tok/s:1527167 rem:109s step 13759 (82%) loss:2.9886 lr:0.12 dt:43ms tok/s:1526472 rem:109s step 13760 (82%) loss:2.9806 lr:0.12 dt:43ms tok/s:1526726 rem:109s step 13761 (82%) loss:2.9636 lr:0.12 dt:43ms tok/s:1525075 rem:109s step 13762 (82%) loss:2.9522 lr:0.12 dt:43ms tok/s:1526777 rem:109s step 13763 (82%) loss:2.9294 lr:0.12 dt:43ms tok/s:1524077 rem:109s step 13764 (82%) loss:2.9267 lr:0.12 dt:43ms tok/s:1525134 rem:109s step 13765 (82%) loss:2.9252 lr:0.12 dt:43ms tok/s:1525684 rem:109s step 13766 (82%) loss:2.9250 lr:0.12 dt:43ms tok/s:1528892 rem:109s step 13767 (82%) loss:2.9196 lr:0.12 dt:43ms tok/s:1526574 rem:109s step 13768 (82%) loss:2.9123 lr:0.12 dt:43ms tok/s:1527269 rem:109s step 13769 (82%) loss:2.9378 lr:0.12 dt:43ms tok/s:1528263 rem:109s step 13770 (82%) loss:2.9549 lr:0.12 dt:43ms tok/s:1527380 rem:109s step 13771 (82%) loss:2.9587 lr:0.12 dt:43ms tok/s:1530484 rem:108s step 13772 (82%) loss:2.9595 lr:0.12 dt:43ms tok/s:1531089 rem:108s step 13773 (82%) loss:2.9466 lr:0.12 dt:43ms tok/s:1525032 rem:108s step 13774 (82%) loss:2.9336 lr:0.12 dt:43ms tok/s:1527091 rem:108s step 13775 (82%) loss:2.9247 lr:0.12 dt:43ms tok/s:1529785 rem:108s step 13776 (82%) loss:2.9140 lr:0.12 dt:43ms tok/s:1527413 rem:108s step 13777 (82%) loss:2.9085 lr:0.12 dt:43ms tok/s:1527634 rem:108s step 13778 (82%) loss:2.9229 lr:0.12 dt:43ms tok/s:1526336 rem:108s step 13779 (82%) loss:2.9324 lr:0.12 dt:43ms tok/s:1524263 rem:108s step 13780 (82%) loss:2.9224 lr:0.12 dt:43ms tok/s:1522389 rem:108s step 13781 (82%) loss:2.9006 lr:0.12 dt:43ms tok/s:1522962 rem:108s step 13782 (82%) loss:2.8761 lr:0.12 dt:43ms tok/s:1521470 rem:108s step 13783 (82%) loss:2.8639 lr:0.12 dt:43ms tok/s:1523013 rem:108s step 13784 (82%) loss:2.8499 lr:0.12 dt:43ms tok/s:1516760 rem:108s step 13785 (82%) loss:2.8642 lr:0.12 dt:43ms tok/s:1525692 rem:108s step 13786 (82%) loss:2.8766 lr:0.12 dt:43ms tok/s:1523899 rem:108s step 13787 (82%) loss:2.8462 lr:0.12 dt:43ms tok/s:1522642 rem:108s step 13788 (82%) loss:2.8472 lr:0.12 dt:43ms tok/s:1524187 rem:108s step 13789 (82%) loss:2.8245 lr:0.12 dt:43ms tok/s:1517581 rem:108s step 13790 (82%) loss:2.8284 lr:0.12 dt:43ms tok/s:1521765 rem:108s step 13791 (82%) loss:2.8249 lr:0.12 dt:43ms tok/s:1523072 rem:108s step 13792 (82%) loss:2.8018 lr:0.12 dt:43ms tok/s:1521335 rem:108s step 13793 (82%) loss:2.7775 lr:0.12 dt:43ms tok/s:1526226 rem:108s step 13794 (82%) loss:2.7760 lr:0.12 dt:43ms tok/s:1517924 rem:107s step 13795 (82%) loss:2.7894 lr:0.12 dt:43ms tok/s:1520797 rem:107s step 13796 (82%) loss:2.7703 lr:0.12 dt:43ms tok/s:1523604 rem:107s step 13797 (82%) loss:2.7772 lr:0.12 dt:43ms tok/s:1519342 rem:107s step 13798 (82%) loss:2.7625 lr:0.12 dt:43ms tok/s:1520763 rem:107s step 13799 (82%) loss:2.7701 lr:0.12 dt:43ms tok/s:1516141 rem:107s step 13800 (82%) loss:2.7599 lr:0.12 dt:43ms tok/s:1521588 rem:107s + local: attn=[0.177, 1.063, 1.211] mlp=[1.325, 0.543, -0.508] + + transition: attn=[3.835, 1.249] mlp=[-0.671, 1.723] + + hierarchy: attn=[3.574, 5.939, 5.616] mlp=[2.969, -2.787, -5.178] + step 13801 (82%) loss:2.7812 lr:0.12 dt:43ms tok/s:1524398 rem:107s step 13802 (82%) loss:2.7985 lr:0.12 dt:43ms tok/s:1518772 rem:107s step 13803 (82%) loss:2.7894 lr:0.12 dt:43ms tok/s:1516124 rem:107s step 13804 (82%) loss:2.7774 lr:0.12 dt:43ms tok/s:1514571 rem:107s step 13805 (82%) loss:2.7782 lr:0.12 dt:43ms tok/s:1515422 rem:107s step 13806 (82%) loss:2.7760 lr:0.12 dt:43ms tok/s:1512762 rem:107s step 13807 (82%) loss:2.7581 lr:0.12 dt:44ms tok/s:1502344 rem:107s step 13808 (82%) loss:2.7204 lr:0.12 dt:43ms tok/s:1515656 rem:107s step 13809 (82%) loss:2.6959 lr:0.12 dt:43ms tok/s:1515355 rem:107s step 13810 (82%) loss:2.6429 lr:0.12 dt:43ms tok/s:1516317 rem:107s step 13811 (82%) loss:2.5682 lr:0.12 dt:43ms tok/s:1511340 rem:107s step 13812 (82%) loss:2.6041 lr:0.12 dt:43ms tok/s:1512296 rem:107s step 13813 (82%) loss:2.6348 lr:0.12 dt:44ms tok/s:1492531 rem:107s step 13814 (82%) loss:2.6597 lr:0.12 dt:46ms tok/s:1430352 rem:107s step 13815 (82%) loss:2.6689 lr:0.12 dt:44ms tok/s:1498536 rem:107s step 13816 (82%) loss:2.6869 lr:0.12 dt:44ms tok/s:1502574 rem:107s step 13817 (82%) loss:2.7000 lr:0.12 dt:43ms tok/s:1515030 rem:106s step 13818 (82%) loss:2.7020 lr:0.12 dt:43ms tok/s:1511564 rem:106s step 13819 (82%) loss:2.7088 lr:0.12 dt:43ms tok/s:1512138 rem:106s step 13820 (82%) loss:2.7216 lr:0.12 dt:43ms tok/s:1515322 rem:106s step 13821 (82%) loss:2.7299 lr:0.12 dt:43ms tok/s:1507304 rem:106s step 13822 (82%) loss:2.7356 lr:0.12 dt:44ms tok/s:1499549 rem:106s step 13823 (82%) loss:2.7453 lr:0.12 dt:43ms tok/s:1517279 rem:106s step 13824 (82%) loss:2.7400 lr:0.12 dt:43ms tok/s:1516844 rem:106s step 13825 (82%) loss:2.7640 lr:0.12 dt:44ms tok/s:1492272 rem:106s step 13826 (82%) loss:2.7722 lr:0.12 dt:44ms tok/s:1489159 rem:106s step 13827 (82%) loss:2.7833 lr:0.12 dt:44ms tok/s:1503042 rem:106s step 13828 (82%) loss:2.7724 lr:0.12 dt:43ms tok/s:1519964 rem:106s step 13829 (82%) loss:2.7076 lr:0.12 dt:43ms tok/s:1517011 rem:106s step 13830 (82%) loss:2.6731 lr:0.12 dt:44ms tok/s:1491082 rem:106s step 13831 (82%) loss:2.6524 lr:0.12 dt:44ms tok/s:1501942 rem:106s step 13832 (82%) loss:2.6634 lr:0.12 dt:44ms tok/s:1499427 rem:106s step 13833 (82%) loss:2.6530 lr:0.12 dt:43ms tok/s:1512828 rem:106s step 13834 (82%) loss:2.6328 lr:0.12 dt:44ms tok/s:1499369 rem:106s step 13835 (82%) loss:2.6320 lr:0.12 dt:45ms tok/s:1447411 rem:106s step 13836 (82%) loss:2.6521 lr:0.11 dt:46ms tok/s:1436099 rem:106s step 13837 (82%) loss:2.6619 lr:0.11 dt:44ms tok/s:1474857 rem:106s step 13838 (82%) loss:2.6796 lr:0.11 dt:44ms tok/s:1496464 rem:106s step 13839 (82%) loss:2.7029 lr:0.11 dt:44ms tok/s:1501409 rem:106s step 13840 (82%) loss:2.7129 lr:0.11 dt:45ms tok/s:1469267 rem:105s step 13841 (82%) loss:2.7360 lr:0.11 dt:45ms tok/s:1458393 rem:105s step 13842 (82%) loss:2.7415 lr:0.11 dt:44ms tok/s:1476735 rem:105s step 13843 (82%) loss:2.7775 lr:0.11 dt:45ms tok/s:1455836 rem:105s step 13844 (82%) loss:2.7727 lr:0.11 dt:44ms tok/s:1478434 rem:105s step 13845 (82%) loss:2.7870 lr:0.11 dt:46ms tok/s:1434727 rem:105s step 13846 (82%) loss:2.7904 lr:0.11 dt:45ms tok/s:1471950 rem:105s step 13847 (82%) loss:2.7700 lr:0.11 dt:45ms tok/s:1454465 rem:105s step 13848 (82%) loss:2.7490 lr:0.11 dt:44ms tok/s:1476727 rem:105s step 13849 (82%) loss:2.7397 lr:0.11 dt:44ms tok/s:1501319 rem:105s step 13850 (82%) loss:2.7436 lr:0.11 dt:46ms tok/s:1428516 rem:105s step 13851 (82%) loss:2.7462 lr:0.11 dt:44ms tok/s:1504902 rem:105s step 13852 (83%) loss:2.7537 lr:0.11 dt:43ms tok/s:1514754 rem:105s step 13853 (83%) loss:2.7601 lr:0.11 dt:43ms tok/s:1524516 rem:105s step 13854 (83%) loss:2.7755 lr:0.11 dt:43ms tok/s:1521546 rem:105s step 13855 (83%) loss:2.7481 lr:0.11 dt:44ms tok/s:1473892 rem:105s step 13856 (83%) loss:2.7621 lr:0.11 dt:43ms tok/s:1527728 rem:105s step 13857 (83%) loss:2.7545 lr:0.11 dt:43ms tok/s:1527931 rem:105s step 13858 (83%) loss:2.7287 lr:0.11 dt:43ms tok/s:1523452 rem:105s step 13859 (83%) loss:2.7164 lr:0.11 dt:43ms tok/s:1524127 rem:105s step 13860 (83%) loss:2.7383 lr:0.11 dt:43ms tok/s:1524018 rem:105s step 13861 (83%) loss:2.7462 lr:0.11 dt:43ms tok/s:1528611 rem:105s step 13862 (83%) loss:2.7439 lr:0.11 dt:43ms tok/s:1529487 rem:104s step 13863 (83%) loss:2.7382 lr:0.11 dt:43ms tok/s:1527422 rem:104s step 13864 (83%) loss:2.7128 lr:0.11 dt:43ms tok/s:1525075 rem:104s step 13865 (83%) loss:2.7010 lr:0.11 dt:43ms tok/s:1527040 rem:104s step 13866 (83%) loss:2.7153 lr:0.11 dt:43ms tok/s:1509779 rem:104s step 13867 (83%) loss:2.7243 lr:0.11 dt:43ms tok/s:1526175 rem:104s step 13868 (83%) loss:2.7287 lr:0.11 dt:43ms tok/s:1527965 rem:104s step 13869 (83%) loss:2.7324 lr:0.11 dt:43ms tok/s:1524728 rem:104s step 13870 (83%) loss:2.7501 lr:0.11 dt:43ms tok/s:1526701 rem:104s step 13871 (83%) loss:2.7461 lr:0.11 dt:43ms tok/s:1528892 rem:104s step 13872 (83%) loss:2.7523 lr:0.11 dt:43ms tok/s:1526057 rem:104s step 13873 (83%) loss:2.7766 lr:0.11 dt:43ms tok/s:1525947 rem:104s step 13874 (83%) loss:2.7868 lr:0.11 dt:43ms tok/s:1512412 rem:104s step 13875 (83%) loss:2.7806 lr:0.11 dt:43ms tok/s:1515038 rem:104s step 13876 (83%) loss:2.8070 lr:0.11 dt:43ms tok/s:1513620 rem:104s step 13877 (83%) loss:2.8345 lr:0.11 dt:43ms tok/s:1513862 rem:104s step 13878 (83%) loss:2.8222 lr:0.11 dt:43ms tok/s:1514946 rem:104s step 13879 (83%) loss:2.8241 lr:0.11 dt:43ms tok/s:1514162 rem:104s step 13880 (83%) loss:2.8102 lr:0.11 dt:43ms tok/s:1518033 rem:104s step 13881 (83%) loss:2.8238 lr:0.11 dt:43ms tok/s:1528628 rem:104s step 13882 (83%) loss:2.8300 lr:0.11 dt:43ms tok/s:1527532 rem:104s step 13883 (83%) loss:2.8263 lr:0.11 dt:43ms tok/s:1530586 rem:104s step 13884 (83%) loss:2.8217 lr:0.11 dt:43ms tok/s:1530160 rem:104s step 13885 (83%) loss:2.8278 lr:0.11 dt:43ms tok/s:1527880 rem:104s step 13886 (83%) loss:2.8351 lr:0.11 dt:43ms tok/s:1526947 rem:103s step 13887 (83%) loss:2.8044 lr:0.11 dt:43ms tok/s:1524660 rem:103s step 13888 (83%) loss:2.7642 lr:0.11 dt:43ms tok/s:1531387 rem:103s step 13889 (83%) loss:2.6963 lr:0.11 dt:43ms tok/s:1528331 rem:103s step 13890 (83%) loss:2.6303 lr:0.11 dt:43ms tok/s:1526353 rem:103s step 13891 (83%) loss:2.6451 lr:0.11 dt:43ms tok/s:1529470 rem:103s step 13892 (83%) loss:2.6618 lr:0.11 dt:43ms tok/s:1529674 rem:103s step 13893 (83%) loss:2.6591 lr:0.11 dt:43ms tok/s:1520090 rem:103s step 13894 (83%) loss:2.6913 lr:0.11 dt:43ms tok/s:1530143 rem:103s step 13895 (83%) loss:2.7386 lr:0.11 dt:43ms tok/s:1531259 rem:103s step 13896 (83%) loss:2.7785 lr:0.11 dt:43ms tok/s:1529496 rem:103s step 13897 (83%) loss:2.8033 lr:0.11 dt:43ms tok/s:1526430 rem:103s step 13898 (83%) loss:2.7968 lr:0.11 dt:43ms tok/s:1532010 rem:103s step 13899 (83%) loss:2.7803 lr:0.11 dt:43ms tok/s:1527804 rem:103s step 13900 (83%) loss:2.7925 lr:0.11 dt:43ms tok/s:1527846 rem:103s + local: attn=[0.191, 1.077, 1.218] mlp=[1.337, 0.573, -0.521] + + transition: attn=[3.826, 1.252] mlp=[-0.699, 1.756] + + hierarchy: attn=[3.567, 5.939, 5.616] mlp=[3.022, -2.869, -5.179] + step 13901 (83%) loss:2.7927 lr:0.11 dt:43ms tok/s:1522793 rem:103s step 13902 (83%) loss:2.7635 lr:0.11 dt:43ms tok/s:1528212 rem:103s step 13903 (83%) loss:2.8023 lr:0.11 dt:44ms tok/s:1499353 rem:103s step 13904 (83%) loss:2.8303 lr:0.11 dt:43ms tok/s:1510177 rem:103s step 13905 (83%) loss:2.8218 lr:0.11 dt:43ms tok/s:1523106 rem:103s step 13906 (83%) loss:2.8071 lr:0.11 dt:43ms tok/s:1525117 rem:103s step 13907 (83%) loss:2.7757 lr:0.11 dt:43ms tok/s:1520418 rem:103s step 13908 (83%) loss:2.7833 lr:0.11 dt:43ms tok/s:1522372 rem:103s step 13909 (83%) loss:2.7995 lr:0.11 dt:43ms tok/s:1518923 rem:102s step 13910 (83%) loss:2.8076 lr:0.11 dt:43ms tok/s:1521866 rem:102s step 13911 (83%) loss:2.8110 lr:0.11 dt:43ms tok/s:1521672 rem:102s step 13912 (83%) loss:2.8322 lr:0.11 dt:43ms tok/s:1522819 rem:102s step 13913 (83%) loss:2.8471 lr:0.11 dt:43ms tok/s:1522852 rem:102s step 13914 (83%) loss:2.8442 lr:0.11 dt:43ms tok/s:1518285 rem:102s step 13915 (83%) loss:2.8345 lr:0.11 dt:43ms tok/s:1522448 rem:102s step 13916 (83%) loss:2.8410 lr:0.11 dt:43ms tok/s:1524567 rem:102s step 13917 (83%) loss:2.8279 lr:0.11 dt:43ms tok/s:1524026 rem:102s step 13918 (83%) loss:2.8289 lr:0.11 dt:44ms tok/s:1485240 rem:102s step 13919 (83%) loss:2.8195 lr:0.11 dt:43ms tok/s:1525963 rem:102s step 13920 (83%) loss:2.8397 lr:0.11 dt:43ms tok/s:1532754 rem:102s step 13921 (83%) loss:2.8318 lr:0.11 dt:43ms tok/s:1533190 rem:102s step 13922 (83%) loss:2.8220 lr:0.11 dt:43ms tok/s:1528161 rem:102s step 13923 (83%) loss:2.8282 lr:0.11 dt:43ms tok/s:1528382 rem:102s step 13924 (83%) loss:2.8424 lr:0.11 dt:43ms tok/s:1526675 rem:102s step 13925 (83%) loss:2.8509 lr:0.11 dt:43ms tok/s:1528977 rem:102s step 13926 (83%) loss:2.8474 lr:0.11 dt:43ms tok/s:1528535 rem:102s step 13927 (83%) loss:2.8359 lr:0.11 dt:43ms tok/s:1527829 rem:102s step 13928 (83%) loss:2.8265 lr:0.11 dt:44ms tok/s:1489450 rem:102s step 13929 (83%) loss:2.8170 lr:0.11 dt:43ms tok/s:1529981 rem:102s step 13930 (83%) loss:2.8166 lr:0.11 dt:43ms tok/s:1527040 rem:102s step 13931 (83%) loss:2.8007 lr:0.11 dt:43ms tok/s:1525938 rem:102s step 13932 (83%) loss:2.8069 lr:0.11 dt:43ms tok/s:1528475 rem:101s step 13933 (83%) loss:2.8083 lr:0.11 dt:43ms tok/s:1529172 rem:101s step 13934 (83%) loss:2.8259 lr:0.11 dt:43ms tok/s:1531131 rem:101s step 13935 (83%) loss:2.8179 lr:0.11 dt:44ms tok/s:1505949 rem:101s step 13936 (83%) loss:2.8162 lr:0.11 dt:43ms tok/s:1516283 rem:101s step 13937 (83%) loss:2.7962 lr:0.11 dt:43ms tok/s:1526913 rem:101s step 13938 (83%) loss:2.8005 lr:0.11 dt:43ms tok/s:1509431 rem:101s step 13939 (83%) loss:2.7924 lr:0.11 dt:43ms tok/s:1530415 rem:101s step 13940 (83%) loss:2.8238 lr:0.11 dt:43ms tok/s:1527880 rem:101s step 13941 (83%) loss:2.8177 lr:0.11 dt:43ms tok/s:1529326 rem:101s step 13942 (83%) loss:2.8211 lr:0.11 dt:43ms tok/s:1527269 rem:101s step 13943 (83%) loss:2.7975 lr:0.11 dt:43ms tok/s:1525803 rem:101s step 13944 (83%) loss:2.7972 lr:0.11 dt:43ms tok/s:1527762 rem:101s step 13945 (83%) loss:2.8057 lr:0.11 dt:43ms tok/s:1526582 rem:101s step 13946 (83%) loss:2.8362 lr:0.11 dt:43ms tok/s:1526048 rem:101s step 13947 (83%) loss:2.8248 lr:0.11 dt:43ms tok/s:1523426 rem:101s step 13948 (83%) loss:2.8153 lr:0.11 dt:43ms tok/s:1527685 rem:101s step 13949 (83%) loss:2.8236 lr:0.10 dt:43ms tok/s:1509920 rem:101s step 13950 (83%) loss:2.8202 lr:0.10 dt:43ms tok/s:1511797 rem:101s step 13951 (83%) loss:2.8230 lr:0.10 dt:43ms tok/s:1523409 rem:101s step 13952 (83%) loss:2.8141 lr:0.10 dt:43ms tok/s:1525608 rem:101s step 13953 (83%) loss:2.8213 lr:0.10 dt:43ms tok/s:1527015 rem:101s step 13954 (83%) loss:2.8041 lr:0.10 dt:43ms tok/s:1530347 rem:101s step 13955 (83%) loss:2.7936 lr:0.10 dt:43ms tok/s:1524592 rem:100s step 13956 (83%) loss:2.8111 lr:0.10 dt:43ms tok/s:1521580 rem:100s step 13957 (83%) loss:2.8130 lr:0.10 dt:43ms tok/s:1513320 rem:100s step 13958 (83%) loss:2.8020 lr:0.10 dt:43ms tok/s:1512229 rem:100s step 13959 (83%) loss:2.7978 lr:0.10 dt:43ms tok/s:1515472 rem:100s step 13960 (83%) loss:2.8012 lr:0.10 dt:43ms tok/s:1514637 rem:100s step 13961 (83%) loss:2.7803 lr:0.10 dt:43ms tok/s:1515364 rem:100s step 13962 (83%) loss:2.8101 lr:0.10 dt:43ms tok/s:1512637 rem:100s step 13963 (83%) loss:2.8118 lr:0.10 dt:43ms tok/s:1518386 rem:100s step 13964 (83%) loss:2.8146 lr:0.10 dt:43ms tok/s:1515882 rem:100s step 13965 (83%) loss:2.8126 lr:0.10 dt:43ms tok/s:1516936 rem:100s step 13966 (83%) loss:2.8049 lr:0.10 dt:43ms tok/s:1517958 rem:100s step 13967 (83%) loss:2.8029 lr:0.10 dt:43ms tok/s:1516083 rem:100s step 13968 (83%) loss:2.8119 lr:0.10 dt:43ms tok/s:1515798 rem:100s step 13969 (83%) loss:2.8024 lr:0.10 dt:43ms tok/s:1512895 rem:100s step 13970 (83%) loss:2.8012 lr:0.10 dt:43ms tok/s:1516894 rem:100s step 13971 (83%) loss:2.7948 lr:0.10 dt:43ms tok/s:1518285 rem:100s step 13972 (83%) loss:2.7931 lr:0.10 dt:47ms tok/s:1396893 rem:100s step 13973 (83%) loss:2.8136 lr:0.10 dt:43ms tok/s:1526760 rem:100s step 13974 (83%) loss:2.8391 lr:0.10 dt:43ms tok/s:1535991 rem:100s step 13975 (83%) loss:2.8465 lr:0.10 dt:43ms tok/s:1538708 rem:100s step 13976 (83%) loss:2.8396 lr:0.10 dt:43ms tok/s:1533147 rem:100s step 13977 (83%) loss:2.8344 lr:0.10 dt:43ms tok/s:1522507 rem:100s step 13978 (83%) loss:2.8204 lr:0.10 dt:43ms tok/s:1527481 rem:100s step 13979 (83%) loss:2.8155 lr:0.10 dt:43ms tok/s:1532173 rem:99s step 13980 (83%) loss:2.8149 lr:0.10 dt:43ms tok/s:1530986 rem:99s step 13981 (83%) loss:2.8140 lr:0.10 dt:44ms tok/s:1495039 rem:99s step 13982 (83%) loss:2.8034 lr:0.10 dt:43ms tok/s:1533369 rem:99s step 13983 (83%) loss:2.7866 lr:0.10 dt:43ms tok/s:1531004 rem:99s step 13984 (83%) loss:2.8360 lr:0.10 dt:43ms tok/s:1534354 rem:99s step 13985 (83%) loss:2.8764 lr:0.10 dt:43ms tok/s:1531626 rem:99s step 13986 (83%) loss:2.9038 lr:0.10 dt:43ms tok/s:1530194 rem:99s step 13987 (83%) loss:2.8933 lr:0.10 dt:43ms tok/s:1532121 rem:99s step 13988 (83%) loss:2.9068 lr:0.10 dt:43ms tok/s:1533472 rem:99s step 13989 (83%) loss:2.9037 lr:0.10 dt:43ms tok/s:1531515 rem:99s step 13990 (83%) loss:2.9024 lr:0.10 dt:43ms tok/s:1530944 rem:99s step 13991 (84%) loss:2.8985 lr:0.10 dt:43ms tok/s:1534045 rem:99s step 13992 (84%) loss:2.8856 lr:0.10 dt:43ms tok/s:1532514 rem:99s step 13993 (84%) loss:2.8689 lr:0.10 dt:43ms tok/s:1531899 rem:99s step 13994 (84%) loss:2.8630 lr:0.10 dt:43ms tok/s:1529547 rem:99s step 13995 (84%) loss:2.8483 lr:0.10 dt:43ms tok/s:1523730 rem:99s step 13996 (84%) loss:2.8324 lr:0.10 dt:43ms tok/s:1532685 rem:99s step 13997 (84%) loss:2.8242 lr:0.10 dt:43ms tok/s:1535382 rem:99s step 13998 (84%) loss:2.8284 lr:0.10 dt:43ms tok/s:1532933 rem:99s step 13999 (84%) loss:2.8199 lr:0.10 dt:43ms tok/s:1531729 rem:99s step 14000 (84%) loss:2.8282 lr:0.10 dt:43ms tok/s:1530782 rem:99s + local: attn=[0.180, 1.079, 1.228] mlp=[1.364, 0.575, -0.540] + + transition: attn=[3.874, 1.260] mlp=[-0.724, 1.808] + + hierarchy: attn=[3.580, 5.939, 5.616] mlp=[3.079, -2.953, -5.180] + step 14001 (84%) loss:2.8025 lr:0.10 dt:43ms tok/s:1530202 rem:99s step 14002 (84%) loss:2.8122 lr:0.10 dt:43ms tok/s:1532446 rem:98s step 14003 (84%) loss:2.8059 lr:0.10 dt:43ms tok/s:1532822 rem:98s step 14004 (84%) loss:2.7950 lr:0.10 dt:43ms tok/s:1532344 rem:98s step 14005 (84%) loss:2.8200 lr:0.10 dt:43ms tok/s:1529598 rem:98s step 14006 (84%) loss:2.8148 lr:0.10 dt:43ms tok/s:1531328 rem:98s step 14007 (84%) loss:2.8259 lr:0.10 dt:43ms tok/s:1530083 rem:98s step 14008 (84%) loss:2.8139 lr:0.10 dt:43ms tok/s:1532625 rem:98s step 14009 (84%) loss:2.8060 lr:0.10 dt:43ms tok/s:1534585 rem:98s step 14010 (84%) loss:2.7886 lr:0.10 dt:43ms tok/s:1514963 rem:98s step 14011 (84%) loss:2.7739 lr:0.10 dt:43ms tok/s:1523401 rem:98s step 14012 (84%) loss:2.7885 lr:0.10 dt:43ms tok/s:1523511 rem:98s step 14013 (84%) loss:2.7905 lr:0.10 dt:43ms tok/s:1521470 rem:98s step 14014 (84%) loss:2.8064 lr:0.10 dt:43ms tok/s:1523241 rem:98s step 14015 (84%) loss:2.7921 lr:0.10 dt:43ms tok/s:1520090 rem:98s step 14016 (84%) loss:2.7979 lr:0.10 dt:43ms tok/s:1518981 rem:98s step 14017 (84%) loss:2.8243 lr:0.10 dt:43ms tok/s:1518000 rem:98s step 14018 (84%) loss:2.8163 lr:0.10 dt:43ms tok/s:1520149 rem:98s step 14019 (84%) loss:2.8118 lr:0.10 dt:43ms tok/s:1516191 rem:98s step 14020 (84%) loss:2.8124 lr:0.10 dt:43ms tok/s:1523046 rem:98s step 14021 (84%) loss:2.8072 lr:0.10 dt:43ms tok/s:1524254 rem:98s step 14022 (84%) loss:2.8091 lr:0.10 dt:43ms tok/s:1523832 rem:98s step 14023 (84%) loss:2.7599 lr:0.10 dt:43ms tok/s:1517363 rem:98s step 14024 (84%) loss:2.7767 lr:0.10 dt:43ms tok/s:1523063 rem:98s step 14025 (84%) loss:2.8237 lr:0.10 dt:43ms tok/s:1521428 rem:97s step 14026 (84%) loss:2.8697 lr:0.10 dt:43ms tok/s:1518553 rem:97s step 14027 (84%) loss:2.9043 lr:0.10 dt:43ms tok/s:1521925 rem:97s step 14028 (84%) loss:2.9266 lr:0.10 dt:43ms tok/s:1518772 rem:97s step 14029 (84%) loss:2.9509 lr:0.10 dt:44ms tok/s:1502336 rem:97s step 14030 (84%) loss:2.9601 lr:0.10 dt:43ms tok/s:1524356 rem:97s step 14031 (84%) loss:2.9777 lr:0.10 dt:43ms tok/s:1519888 rem:97s step 14032 (84%) loss:2.9844 lr:0.10 dt:43ms tok/s:1519729 rem:97s step 14033 (84%) loss:2.9874 lr:0.10 dt:43ms tok/s:1517665 rem:97s step 14034 (84%) loss:2.9915 lr:0.10 dt:43ms tok/s:1522869 rem:97s step 14035 (84%) loss:2.9937 lr:0.10 dt:43ms tok/s:1519359 rem:97s step 14036 (84%) loss:3.0013 lr:0.10 dt:43ms tok/s:1524745 rem:97s step 14037 (84%) loss:3.0043 lr:0.10 dt:43ms tok/s:1520250 rem:97s step 14038 (84%) loss:2.9829 lr:0.10 dt:43ms tok/s:1522684 rem:97s step 14039 (84%) loss:2.9605 lr:0.10 dt:43ms tok/s:1522405 rem:97s step 14040 (84%) loss:2.9298 lr:0.10 dt:43ms tok/s:1522414 rem:97s step 14041 (84%) loss:2.9163 lr:0.10 dt:43ms tok/s:1520090 rem:97s step 14042 (84%) loss:2.9110 lr:0.10 dt:43ms tok/s:1520519 rem:97s step 14043 (84%) loss:2.9199 lr:0.10 dt:43ms tok/s:1519704 rem:97s step 14044 (84%) loss:2.9054 lr:0.10 dt:43ms tok/s:1521159 rem:97s step 14045 (84%) loss:2.9054 lr:0.10 dt:43ms tok/s:1522110 rem:97s step 14046 (84%) loss:2.8926 lr:0.10 dt:47ms tok/s:1397249 rem:97s step 14047 (84%) loss:2.8862 lr:0.10 dt:43ms tok/s:1508594 rem:97s step 14048 (84%) loss:2.8918 lr:0.10 dt:43ms tok/s:1526082 rem:96s step 14049 (84%) loss:2.8695 lr:0.10 dt:43ms tok/s:1526430 rem:96s step 14050 (84%) loss:2.8842 lr:0.10 dt:43ms tok/s:1526565 rem:96s step 14051 (84%) loss:2.9012 lr:0.10 dt:43ms tok/s:1521731 rem:96s step 14052 (84%) loss:2.8834 lr:0.10 dt:43ms tok/s:1519872 rem:96s step 14053 (84%) loss:2.8701 lr:0.10 dt:43ms tok/s:1520670 rem:96s step 14054 (84%) loss:2.8837 lr:0.10 dt:43ms tok/s:1523502 rem:96s step 14055 (84%) loss:2.8825 lr:0.10 dt:43ms tok/s:1519485 rem:96s step 14056 (84%) loss:2.8765 lr:0.10 dt:43ms tok/s:1520982 rem:96s step 14057 (84%) loss:2.8835 lr:0.10 dt:43ms tok/s:1522819 rem:96s step 14058 (84%) loss:2.8923 lr:0.10 dt:43ms tok/s:1522701 rem:96s step 14059 (84%) loss:2.8876 lr:0.10 dt:43ms tok/s:1523165 rem:96s step 14060 (84%) loss:2.8779 lr:0.10 dt:43ms tok/s:1520553 rem:96s step 14061 (84%) loss:2.8625 lr:0.10 dt:44ms tok/s:1503371 rem:96s step 14062 (84%) loss:2.8584 lr:0.10 dt:43ms tok/s:1520132 rem:96s step 14063 (84%) loss:2.8600 lr:0.10 dt:43ms tok/s:1520267 rem:96s step 14064 (84%) loss:2.8751 lr:0.10 dt:44ms tok/s:1482101 rem:96s step 14065 (84%) loss:2.8689 lr:0.10 dt:43ms tok/s:1509845 rem:96s step 14066 (84%) loss:2.8631 lr:0.10 dt:43ms tok/s:1527244 rem:96s step 14067 (84%) loss:2.8606 lr:0.09 dt:43ms tok/s:1526786 rem:96s step 14068 (84%) loss:2.8616 lr:0.09 dt:43ms tok/s:1522405 rem:96s step 14069 (84%) loss:2.8438 lr:0.09 dt:43ms tok/s:1522987 rem:96s step 14070 (84%) loss:2.8398 lr:0.09 dt:43ms tok/s:1520931 rem:96s step 14071 (84%) loss:2.8332 lr:0.09 dt:43ms tok/s:1521883 rem:95s step 14072 (84%) loss:2.8335 lr:0.09 dt:43ms tok/s:1525142 rem:95s step 14073 (84%) loss:2.8249 lr:0.09 dt:43ms tok/s:1519855 rem:95s step 14074 (84%) loss:2.8231 lr:0.09 dt:43ms tok/s:1524745 rem:95s step 14075 (84%) loss:2.8454 lr:0.09 dt:43ms tok/s:1518260 rem:95s step 14076 (84%) loss:2.8345 lr:0.09 dt:43ms tok/s:1522220 rem:95s step 14077 (84%) loss:2.8305 lr:0.09 dt:43ms tok/s:1518881 rem:95s step 14078 (84%) loss:2.8256 lr:0.09 dt:43ms tok/s:1523283 rem:95s step 14079 (84%) loss:2.8370 lr:0.09 dt:43ms tok/s:1520931 rem:95s step 14080 (84%) loss:2.8249 lr:0.09 dt:43ms tok/s:1518939 rem:95s step 14081 (84%) loss:2.8288 lr:0.09 dt:43ms tok/s:1516275 rem:95s step 14082 (84%) loss:2.8195 lr:0.09 dt:45ms tok/s:1443262 rem:95s step 14083 (84%) loss:2.8178 lr:0.09 dt:43ms tok/s:1525819 rem:95s step 14084 (84%) loss:2.8269 lr:0.09 dt:43ms tok/s:1533412 rem:95s step 14085 (84%) loss:2.8167 lr:0.09 dt:43ms tok/s:1523705 rem:95s step 14086 (84%) loss:2.7997 lr:0.09 dt:43ms tok/s:1530449 rem:95s step 14087 (84%) loss:2.7966 lr:0.09 dt:43ms tok/s:1532258 rem:95s step 14088 (84%) loss:2.7393 lr:0.09 dt:43ms tok/s:1532728 rem:95s step 14089 (84%) loss:2.6930 lr:0.09 dt:43ms tok/s:1532651 rem:95s step 14090 (84%) loss:2.6868 lr:0.09 dt:43ms tok/s:1522928 rem:95s step 14091 (84%) loss:2.6995 lr:0.09 dt:43ms tok/s:1531908 rem:95s step 14092 (84%) loss:2.7017 lr:0.09 dt:43ms tok/s:1530850 rem:95s step 14093 (84%) loss:2.7078 lr:0.09 dt:43ms tok/s:1532856 rem:95s step 14094 (84%) loss:2.7238 lr:0.09 dt:43ms tok/s:1530876 rem:95s step 14095 (84%) loss:2.7406 lr:0.09 dt:43ms tok/s:1534174 rem:94s step 14096 (84%) loss:2.7483 lr:0.09 dt:43ms tok/s:1535159 rem:94s step 14097 (84%) loss:2.7436 lr:0.09 dt:43ms tok/s:1531771 rem:94s step 14098 (84%) loss:2.7395 lr:0.09 dt:43ms tok/s:1536240 rem:94s step 14099 (84%) loss:2.7417 lr:0.09 dt:43ms tok/s:1534945 rem:94s step 14100 (84%) loss:2.7552 lr:0.09 dt:43ms tok/s:1522127 rem:94s + local: attn=[0.187, 1.087, 1.229] mlp=[1.383, 0.586, -0.548] + + transition: attn=[3.872, 1.267] mlp=[-0.749, 1.881] + + hierarchy: attn=[3.562, 5.939, 5.616] mlp=[3.142, -3.047, -5.180] + step 14101 (84%) loss:2.7517 lr:0.09 dt:42ms tok/s:1548215 rem:94s step 14102 (84%) loss:2.7533 lr:0.09 dt:42ms tok/s:1545925 rem:94s step 14103 (84%) loss:2.7643 lr:0.09 dt:43ms tok/s:1540977 rem:94s step 14104 (84%) loss:2.8000 lr:0.09 dt:43ms tok/s:1537177 rem:94s step 14105 (84%) loss:2.7730 lr:0.09 dt:43ms tok/s:1535082 rem:94s step 14106 (84%) loss:2.7281 lr:0.09 dt:43ms tok/s:1530790 rem:94s step 14107 (84%) loss:2.7233 lr:0.09 dt:43ms tok/s:1532737 rem:94s step 14108 (84%) loss:2.7383 lr:0.09 dt:43ms tok/s:1533506 rem:94s step 14109 (84%) loss:2.7371 lr:0.09 dt:43ms tok/s:1531754 rem:94s step 14110 (84%) loss:2.7522 lr:0.09 dt:43ms tok/s:1535228 rem:94s step 14111 (84%) loss:2.7574 lr:0.09 dt:43ms tok/s:1536232 rem:94s step 14112 (84%) loss:2.7946 lr:0.09 dt:43ms tok/s:1536403 rem:94s step 14113 (84%) loss:2.8097 lr:0.09 dt:43ms tok/s:1533267 rem:94s step 14114 (84%) loss:2.8022 lr:0.09 dt:43ms tok/s:1533908 rem:94s step 14115 (84%) loss:2.7972 lr:0.09 dt:43ms tok/s:1538467 rem:94s step 14116 (84%) loss:2.7986 lr:0.09 dt:43ms tok/s:1537477 rem:94s step 14117 (84%) loss:2.7798 lr:0.09 dt:43ms tok/s:1534996 rem:94s step 14118 (84%) loss:2.7744 lr:0.09 dt:43ms tok/s:1535365 rem:93s step 14119 (84%) loss:2.7875 lr:0.09 dt:43ms tok/s:1535923 rem:93s step 14120 (84%) loss:2.7970 lr:0.09 dt:43ms tok/s:1535579 rem:93s step 14121 (84%) loss:2.8015 lr:0.09 dt:43ms tok/s:1533515 rem:93s step 14122 (84%) loss:2.7918 lr:0.09 dt:43ms tok/s:1533352 rem:93s step 14123 (84%) loss:2.8121 lr:0.09 dt:43ms tok/s:1532292 rem:93s step 14124 (84%) loss:2.8434 lr:0.09 dt:44ms tok/s:1492191 rem:93s step 14125 (84%) loss:2.8722 lr:0.09 dt:43ms tok/s:1535708 rem:93s step 14126 (84%) loss:2.8589 lr:0.09 dt:43ms tok/s:1532489 rem:93s step 14127 (84%) loss:2.8431 lr:0.09 dt:43ms tok/s:1533634 rem:93s step 14128 (84%) loss:2.8311 lr:0.09 dt:44ms tok/s:1496586 rem:93s step 14129 (84%) loss:2.8342 lr:0.09 dt:43ms tok/s:1529368 rem:93s step 14130 (84%) loss:2.8307 lr:0.09 dt:43ms tok/s:1525870 rem:93s step 14131 (85%) loss:2.8279 lr:0.09 dt:43ms tok/s:1529274 rem:93s step 14132 (85%) loss:2.8198 lr:0.09 dt:43ms tok/s:1531080 rem:93s step 14133 (85%) loss:2.8050 lr:0.09 dt:45ms tok/s:1471809 rem:93s step 14134 (85%) loss:2.8028 lr:0.09 dt:43ms tok/s:1530177 rem:93s step 14135 (85%) loss:2.7868 lr:0.09 dt:43ms tok/s:1528637 rem:93s step 14136 (85%) loss:2.7887 lr:0.09 dt:43ms tok/s:1527422 rem:93s step 14137 (85%) loss:2.8007 lr:0.09 dt:43ms tok/s:1525278 rem:93s step 14138 (85%) loss:2.7981 lr:0.09 dt:43ms tok/s:1529070 rem:93s step 14139 (85%) loss:2.7945 lr:0.09 dt:43ms tok/s:1531038 rem:93s step 14140 (85%) loss:2.8106 lr:0.09 dt:43ms tok/s:1527965 rem:93s step 14141 (85%) loss:2.7953 lr:0.09 dt:43ms tok/s:1528722 rem:92s step 14142 (85%) loss:2.7902 lr:0.09 dt:43ms tok/s:1522810 rem:92s step 14143 (85%) loss:2.7651 lr:0.09 dt:43ms tok/s:1532976 rem:92s step 14144 (85%) loss:2.7592 lr:0.09 dt:43ms tok/s:1526455 rem:92s step 14145 (85%) loss:2.7578 lr:0.09 dt:43ms tok/s:1528059 rem:92s step 14146 (85%) loss:2.7378 lr:0.09 dt:43ms tok/s:1528781 rem:92s step 14147 (85%) loss:2.6999 lr:0.09 dt:43ms tok/s:1534020 rem:92s step 14148 (85%) loss:2.6547 lr:0.09 dt:43ms tok/s:1527015 rem:92s step 14149 (85%) loss:2.6611 lr:0.09 dt:43ms tok/s:1524094 rem:92s step 14150 (85%) loss:2.6894 lr:0.09 dt:43ms tok/s:1527210 rem:92s step 14151 (85%) loss:2.6973 lr:0.09 dt:43ms tok/s:1509422 rem:92s step 14152 (85%) loss:2.7026 lr:0.09 dt:43ms tok/s:1526362 rem:92s step 14153 (85%) loss:2.6897 lr:0.09 dt:43ms tok/s:1524296 rem:92s step 14154 (85%) loss:2.6900 lr:0.09 dt:43ms tok/s:1511614 rem:92s step 14155 (85%) loss:2.7008 lr:0.09 dt:43ms tok/s:1514813 rem:92s step 14156 (85%) loss:2.7206 lr:0.09 dt:43ms tok/s:1525430 rem:92s step 14157 (85%) loss:2.7394 lr:0.09 dt:43ms tok/s:1528178 rem:92s step 14158 (85%) loss:2.7309 lr:0.09 dt:43ms tok/s:1527346 rem:92s step 14159 (85%) loss:2.7335 lr:0.09 dt:43ms tok/s:1527447 rem:92s step 14160 (85%) loss:2.7234 lr:0.09 dt:43ms tok/s:1524423 rem:92s step 14161 (85%) loss:2.7259 lr:0.09 dt:43ms tok/s:1525506 rem:92s step 14162 (85%) loss:2.7120 lr:0.09 dt:43ms tok/s:1529385 rem:92s step 14163 (85%) loss:2.7250 lr:0.09 dt:43ms tok/s:1529811 rem:92s step 14164 (85%) loss:2.7391 lr:0.09 dt:43ms tok/s:1527711 rem:92s step 14165 (85%) loss:2.7530 lr:0.09 dt:43ms tok/s:1523139 rem:91s step 14166 (85%) loss:2.7663 lr:0.09 dt:43ms tok/s:1527778 rem:91s step 14167 (85%) loss:2.7667 lr:0.09 dt:44ms tok/s:1473323 rem:91s step 14168 (85%) loss:2.7706 lr:0.09 dt:43ms tok/s:1511547 rem:91s step 14169 (85%) loss:2.7815 lr:0.09 dt:43ms tok/s:1540295 rem:91s step 14170 (85%) loss:2.7855 lr:0.09 dt:43ms tok/s:1537555 rem:91s step 14171 (85%) loss:2.7957 lr:0.09 dt:43ms tok/s:1536489 rem:91s step 14172 (85%) loss:2.8001 lr:0.09 dt:43ms tok/s:1537323 rem:91s step 14173 (85%) loss:2.8167 lr:0.09 dt:43ms tok/s:1537400 rem:91s step 14174 (85%) loss:2.8188 lr:0.09 dt:43ms tok/s:1537099 rem:91s step 14175 (85%) loss:2.8054 lr:0.09 dt:43ms tok/s:1530944 rem:91s step 14176 (85%) loss:2.7973 lr:0.09 dt:43ms tok/s:1538700 rem:91s step 14177 (85%) loss:2.7833 lr:0.09 dt:43ms tok/s:1535176 rem:91s step 14178 (85%) loss:2.7854 lr:0.09 dt:43ms tok/s:1535983 rem:91s step 14179 (85%) loss:2.7865 lr:0.09 dt:43ms tok/s:1536129 rem:91s step 14180 (85%) loss:2.7928 lr:0.09 dt:43ms tok/s:1532156 rem:91s step 14181 (85%) loss:2.7625 lr:0.09 dt:43ms tok/s:1535888 rem:91s step 14182 (85%) loss:2.7690 lr:0.09 dt:43ms tok/s:1535639 rem:91s step 14183 (85%) loss:2.7701 lr:0.09 dt:43ms tok/s:1534850 rem:91s step 14184 (85%) loss:2.7475 lr:0.09 dt:43ms tok/s:1529300 rem:91s step 14185 (85%) loss:2.7449 lr:0.09 dt:43ms tok/s:1532036 rem:91s step 14186 (85%) loss:2.7595 lr:0.09 dt:43ms tok/s:1531908 rem:91s step 14187 (85%) loss:2.7511 lr:0.09 dt:43ms tok/s:1533019 rem:91s step 14188 (85%) loss:2.7714 lr:0.09 dt:43ms tok/s:1538321 rem:90s step 14189 (85%) loss:2.7748 lr:0.09 dt:43ms tok/s:1534294 rem:90s step 14190 (85%) loss:2.7773 lr:0.09 dt:43ms tok/s:1535305 rem:90s step 14191 (85%) loss:2.7800 lr:0.08 dt:43ms tok/s:1536154 rem:90s step 14192 (85%) loss:2.7805 lr:0.08 dt:43ms tok/s:1535348 rem:90s step 14193 (85%) loss:2.7756 lr:0.08 dt:43ms tok/s:1534276 rem:90s step 14194 (85%) loss:2.7623 lr:0.08 dt:43ms tok/s:1534396 rem:90s step 14195 (85%) loss:2.7640 lr:0.08 dt:43ms tok/s:1530407 rem:90s step 14196 (85%) loss:2.7572 lr:0.08 dt:43ms tok/s:1530842 rem:90s step 14197 (85%) loss:2.7533 lr:0.08 dt:43ms tok/s:1534011 rem:90s step 14198 (85%) loss:2.7661 lr:0.08 dt:44ms tok/s:1506329 rem:90s step 14199 (85%) loss:2.7783 lr:0.08 dt:43ms tok/s:1534139 rem:90s step 14200 (85%) loss:2.7774 lr:0.08 dt:43ms tok/s:1531285 rem:90s + local: attn=[0.200, 1.087, 1.245] mlp=[1.400, 0.595, -0.550] + + transition: attn=[3.912, 1.257] mlp=[-0.771, 1.945] + + hierarchy: attn=[3.537, 5.939, 5.616] mlp=[3.216, -3.135, -5.180] + step 14201 (85%) loss:2.8000 lr:0.08 dt:43ms tok/s:1530024 rem:90s step 14202 (85%) loss:2.7878 lr:0.08 dt:43ms tok/s:1533600 rem:90s step 14203 (85%) loss:2.7877 lr:0.08 dt:43ms tok/s:1531507 rem:90s step 14204 (85%) loss:2.7891 lr:0.08 dt:43ms tok/s:1532053 rem:90s step 14205 (85%) loss:2.8220 lr:0.08 dt:43ms tok/s:1530969 rem:90s step 14206 (85%) loss:2.8312 lr:0.08 dt:43ms tok/s:1530398 rem:90s step 14207 (85%) loss:2.8480 lr:0.08 dt:43ms tok/s:1531592 rem:90s step 14208 (85%) loss:2.8349 lr:0.08 dt:43ms tok/s:1529896 rem:90s step 14209 (85%) loss:2.8399 lr:0.08 dt:43ms tok/s:1532984 rem:90s step 14210 (85%) loss:2.8376 lr:0.08 dt:43ms tok/s:1527660 rem:90s step 14211 (85%) loss:2.8410 lr:0.08 dt:43ms tok/s:1532087 rem:89s step 14212 (85%) loss:2.8345 lr:0.08 dt:43ms tok/s:1532139 rem:89s step 14213 (85%) loss:2.8247 lr:0.08 dt:43ms tok/s:1533335 rem:89s step 14214 (85%) loss:2.8385 lr:0.08 dt:43ms tok/s:1530279 rem:89s step 14215 (85%) loss:2.8167 lr:0.08 dt:43ms tok/s:1529845 rem:89s step 14216 (85%) loss:2.8143 lr:0.08 dt:43ms tok/s:1515447 rem:89s step 14217 (85%) loss:2.8038 lr:0.08 dt:43ms tok/s:1524423 rem:89s step 14218 (85%) loss:2.8117 lr:0.08 dt:43ms tok/s:1527507 rem:89s step 14219 (85%) loss:2.8123 lr:0.08 dt:43ms tok/s:1525362 rem:89s step 14220 (85%) loss:2.8078 lr:0.08 dt:43ms tok/s:1521933 rem:89s step 14221 (85%) loss:2.7934 lr:0.08 dt:43ms tok/s:1523671 rem:89s step 14222 (85%) loss:2.7957 lr:0.08 dt:44ms tok/s:1477235 rem:89s step 14223 (85%) loss:2.7946 lr:0.08 dt:43ms tok/s:1520889 rem:89s step 14224 (85%) loss:2.7940 lr:0.08 dt:43ms tok/s:1519216 rem:89s step 14225 (85%) loss:2.7830 lr:0.08 dt:43ms tok/s:1519300 rem:89s step 14226 (85%) loss:2.7910 lr:0.08 dt:43ms tok/s:1522473 rem:89s step 14227 (85%) loss:2.7880 lr:0.08 dt:45ms tok/s:1447145 rem:89s step 14228 (85%) loss:2.7852 lr:0.08 dt:43ms tok/s:1523469 rem:89s step 14229 (85%) loss:2.7739 lr:0.08 dt:43ms tok/s:1529351 rem:89s step 14230 (85%) loss:2.7646 lr:0.08 dt:43ms tok/s:1531985 rem:89s step 14231 (85%) loss:2.7699 lr:0.08 dt:43ms tok/s:1527566 rem:89s step 14232 (85%) loss:2.7795 lr:0.08 dt:71ms tok/s:924143 rem:89s step 14233 (85%) loss:2.7811 lr:0.08 dt:42ms tok/s:1564633 rem:89s step 14234 (85%) loss:2.7625 lr:0.08 dt:43ms tok/s:1520393 rem:88s step 14235 (85%) loss:2.7641 lr:0.08 dt:42ms tok/s:1560521 rem:88s step 14236 (85%) loss:2.7644 lr:0.08 dt:42ms tok/s:1562152 rem:88s step 14237 (85%) loss:2.7637 lr:0.08 dt:42ms tok/s:1567363 rem:88s step 14238 (85%) loss:2.7581 lr:0.08 dt:42ms tok/s:1565177 rem:88s step 14239 (85%) loss:2.7612 lr:0.08 dt:42ms tok/s:1564491 rem:88s step 14240 (85%) loss:2.7539 lr:0.08 dt:42ms tok/s:1571207 rem:88s step 14241 (85%) loss:2.7529 lr:0.08 dt:42ms tok/s:1568839 rem:88s step 14242 (85%) loss:2.7356 lr:0.08 dt:42ms tok/s:1565658 rem:88s step 14243 (85%) loss:2.7572 lr:0.08 dt:42ms tok/s:1551983 rem:88s step 14244 (85%) loss:2.7469 lr:0.08 dt:42ms tok/s:1557708 rem:88s step 14245 (85%) loss:2.7321 lr:0.08 dt:42ms tok/s:1563583 rem:88s step 14246 (85%) loss:2.7361 lr:0.08 dt:42ms tok/s:1562739 rem:88s step 14247 (85%) loss:2.7316 lr:0.08 dt:42ms tok/s:1553711 rem:88s step 14248 (85%) loss:2.7411 lr:0.08 dt:42ms tok/s:1549769 rem:88s step 14249 (85%) loss:2.7562 lr:0.08 dt:42ms tok/s:1549271 rem:88s step 14250 (85%) loss:2.7671 lr:0.08 dt:42ms tok/s:1549472 rem:88s step 14251 (85%) loss:2.7593 lr:0.08 dt:42ms tok/s:1548224 rem:88s step 14252 (85%) loss:2.7587 lr:0.08 dt:42ms tok/s:1546856 rem:88s step 14253 (85%) loss:2.7646 lr:0.08 dt:42ms tok/s:1548320 rem:88s step 14254 (85%) loss:2.7546 lr:0.08 dt:42ms tok/s:1548495 rem:88s step 14255 (85%) loss:2.7569 lr:0.08 dt:42ms tok/s:1545508 rem:88s step 14256 (85%) loss:2.7738 lr:0.08 dt:42ms tok/s:1542966 rem:88s step 14257 (85%) loss:2.7725 lr:0.08 dt:42ms tok/s:1542404 rem:88s step 14258 (85%) loss:2.7686 lr:0.08 dt:43ms tok/s:1541392 rem:87s step 14259 (85%) loss:2.7608 lr:0.08 dt:42ms tok/s:1542577 rem:87s step 14260 (85%) loss:2.7877 lr:0.08 dt:49ms tok/s:1337605 rem:87s step 14261 (85%) loss:2.7850 lr:0.08 dt:42ms tok/s:1552597 rem:87s step 14262 (85%) loss:2.7882 lr:0.08 dt:42ms tok/s:1557135 rem:87s step 14263 (85%) loss:2.7826 lr:0.08 dt:42ms tok/s:1553834 rem:87s step 14264 (85%) loss:2.7501 lr:0.08 dt:42ms tok/s:1550145 rem:87s step 14265 (85%) loss:2.7232 lr:0.08 dt:44ms tok/s:1500180 rem:87s step 14266 (85%) loss:2.7334 lr:0.08 dt:42ms tok/s:1553983 rem:87s step 14267 (85%) loss:2.7283 lr:0.08 dt:42ms tok/s:1545508 rem:87s step 14268 (85%) loss:2.7216 lr:0.08 dt:42ms tok/s:1546439 rem:87s step 14269 (85%) loss:2.7298 lr:0.08 dt:42ms tok/s:1549324 rem:87s step 14270 (86%) loss:2.7388 lr:0.08 dt:42ms tok/s:1546473 rem:87s step 14271 (86%) loss:2.7477 lr:0.08 dt:43ms tok/s:1541184 rem:87s step 14272 (86%) loss:2.7649 lr:0.08 dt:42ms tok/s:1546212 rem:87s step 14273 (86%) loss:2.7960 lr:0.08 dt:43ms tok/s:1539604 rem:87s step 14274 (86%) loss:2.7907 lr:0.08 dt:43ms tok/s:1529096 rem:87s step 14275 (86%) loss:2.7871 lr:0.08 dt:42ms tok/s:1546308 rem:87s step 14276 (86%) loss:2.7895 lr:0.08 dt:43ms tok/s:1533823 rem:87s step 14277 (86%) loss:2.8018 lr:0.08 dt:43ms tok/s:1531123 rem:87s step 14278 (86%) loss:2.7974 lr:0.08 dt:43ms tok/s:1528195 rem:87s step 14279 (86%) loss:2.7943 lr:0.08 dt:43ms tok/s:1532378 rem:87s step 14280 (86%) loss:2.7844 lr:0.08 dt:43ms tok/s:1531097 rem:87s step 14281 (86%) loss:2.7944 lr:0.08 dt:43ms tok/s:1530211 rem:86s step 14282 (86%) loss:2.7929 lr:0.08 dt:43ms tok/s:1533480 rem:86s step 14283 (86%) loss:2.7916 lr:0.08 dt:43ms tok/s:1531336 rem:86s step 14284 (86%) loss:2.8008 lr:0.08 dt:43ms tok/s:1530816 rem:86s step 14285 (86%) loss:2.7987 lr:0.08 dt:43ms tok/s:1529436 rem:86s step 14286 (86%) loss:2.7958 lr:0.08 dt:43ms tok/s:1530739 rem:86s step 14287 (86%) loss:2.8034 lr:0.08 dt:43ms tok/s:1531413 rem:86s step 14288 (86%) loss:2.8089 lr:0.08 dt:44ms tok/s:1492231 rem:86s step 14289 (86%) loss:2.8064 lr:0.08 dt:43ms tok/s:1516325 rem:86s step 14290 (86%) loss:2.7900 lr:0.08 dt:43ms tok/s:1510335 rem:86s step 14291 (86%) loss:2.7881 lr:0.08 dt:43ms tok/s:1528339 rem:86s step 14292 (86%) loss:2.7938 lr:0.08 dt:43ms tok/s:1529028 rem:86s step 14293 (86%) loss:2.8031 lr:0.08 dt:43ms tok/s:1524990 rem:86s step 14294 (86%) loss:2.8138 lr:0.08 dt:43ms tok/s:1527363 rem:86s step 14295 (86%) loss:2.8092 lr:0.08 dt:43ms tok/s:1525489 rem:86s step 14296 (86%) loss:2.7975 lr:0.08 dt:43ms tok/s:1529598 rem:86s step 14297 (86%) loss:2.7880 lr:0.08 dt:43ms tok/s:1531021 rem:86s step 14298 (86%) loss:2.7743 lr:0.08 dt:43ms tok/s:1527626 rem:86s step 14299 (86%) loss:2.7627 lr:0.08 dt:43ms tok/s:1532232 rem:86s step 14300 (86%) loss:2.7666 lr:0.08 dt:43ms tok/s:1530918 rem:86s + local: attn=[0.205, 1.101, 1.242] mlp=[1.419, 0.612, -0.556] + + transition: attn=[3.899, 1.250] mlp=[-0.808, 2.002] + + hierarchy: attn=[3.538, 5.939, 5.616] mlp=[3.305, -3.239, -5.180] + step 14301 (86%) loss:2.7566 lr:0.08 dt:43ms tok/s:1528832 rem:86s step 14302 (86%) loss:2.7414 lr:0.08 dt:43ms tok/s:1524795 rem:86s step 14303 (86%) loss:2.7098 lr:0.08 dt:43ms tok/s:1525151 rem:86s step 14304 (86%) loss:2.6979 lr:0.08 dt:43ms tok/s:1524077 rem:85s step 14305 (86%) loss:2.7001 lr:0.08 dt:43ms tok/s:1528645 rem:85s step 14306 (86%) loss:2.7147 lr:0.08 dt:43ms tok/s:1518755 rem:85s step 14307 (86%) loss:2.7158 lr:0.08 dt:43ms tok/s:1514521 rem:85s step 14308 (86%) loss:2.7371 lr:0.08 dt:46ms tok/s:1431059 rem:85s step 14309 (86%) loss:2.7605 lr:0.08 dt:43ms tok/s:1520948 rem:85s step 14310 (86%) loss:2.7365 lr:0.08 dt:43ms tok/s:1520082 rem:85s step 14311 (86%) loss:2.7530 lr:0.08 dt:43ms tok/s:1519359 rem:85s step 14312 (86%) loss:2.7679 lr:0.08 dt:43ms tok/s:1522928 rem:85s step 14313 (86%) loss:2.7887 lr:0.08 dt:43ms tok/s:1517640 rem:85s step 14314 (86%) loss:2.7935 lr:0.08 dt:43ms tok/s:1521150 rem:85s step 14315 (86%) loss:2.7910 lr:0.08 dt:43ms tok/s:1521807 rem:85s step 14316 (86%) loss:2.8148 lr:0.08 dt:43ms tok/s:1521773 rem:85s step 14317 (86%) loss:2.8527 lr:0.08 dt:43ms tok/s:1522549 rem:85s step 14318 (86%) loss:2.8266 lr:0.08 dt:43ms tok/s:1521495 rem:85s step 14319 (86%) loss:2.8117 lr:0.08 dt:43ms tok/s:1520157 rem:85s step 14320 (86%) loss:2.8262 lr:0.08 dt:43ms tok/s:1525168 rem:85s step 14321 (86%) loss:2.8203 lr:0.08 dt:43ms tok/s:1510592 rem:85s step 14322 (86%) loss:2.8233 lr:0.07 dt:43ms tok/s:1518646 rem:85s step 14323 (86%) loss:2.8240 lr:0.07 dt:43ms tok/s:1519326 rem:85s step 14324 (86%) loss:2.8114 lr:0.07 dt:43ms tok/s:1518319 rem:85s step 14325 (86%) loss:2.8025 lr:0.07 dt:43ms tok/s:1521605 rem:85s step 14326 (86%) loss:2.7816 lr:0.07 dt:43ms tok/s:1516250 rem:85s step 14327 (86%) loss:2.7662 lr:0.07 dt:43ms tok/s:1520536 rem:84s step 14328 (86%) loss:2.7579 lr:0.07 dt:43ms tok/s:1522684 rem:84s step 14329 (86%) loss:2.7607 lr:0.07 dt:43ms tok/s:1520595 rem:84s step 14330 (86%) loss:2.7576 lr:0.07 dt:43ms tok/s:1516300 rem:84s step 14331 (86%) loss:2.7682 lr:0.07 dt:43ms tok/s:1520511 rem:84s step 14332 (86%) loss:2.7643 lr:0.07 dt:43ms tok/s:1521992 rem:84s step 14333 (86%) loss:2.7693 lr:0.07 dt:43ms tok/s:1520965 rem:84s step 14334 (86%) loss:2.7370 lr:0.07 dt:43ms tok/s:1521807 rem:84s step 14335 (86%) loss:2.7626 lr:0.07 dt:43ms tok/s:1516886 rem:84s step 14336 (86%) loss:2.7729 lr:0.07 dt:43ms tok/s:1522954 rem:84s step 14337 (86%) loss:2.7591 lr:0.07 dt:43ms tok/s:1517380 rem:84s step 14338 (86%) loss:2.7854 lr:0.07 dt:44ms tok/s:1480018 rem:84s step 14339 (86%) loss:2.7819 lr:0.07 dt:43ms tok/s:1518210 rem:84s step 14340 (86%) loss:2.7678 lr:0.07 dt:43ms tok/s:1532019 rem:84s step 14341 (86%) loss:2.7595 lr:0.07 dt:43ms tok/s:1535408 rem:84s step 14342 (86%) loss:2.7636 lr:0.07 dt:43ms tok/s:1533438 rem:84s step 14343 (86%) loss:2.7659 lr:0.07 dt:43ms tok/s:1533344 rem:84s step 14344 (86%) loss:2.8443 lr:0.07 dt:43ms tok/s:1533729 rem:84s step 14345 (86%) loss:2.8658 lr:0.07 dt:43ms tok/s:1531404 rem:84s step 14346 (86%) loss:2.8635 lr:0.07 dt:43ms tok/s:1531925 rem:84s step 14347 (86%) loss:2.8630 lr:0.07 dt:43ms tok/s:1527999 rem:84s step 14348 (86%) loss:2.8427 lr:0.07 dt:43ms tok/s:1530833 rem:84s step 14349 (86%) loss:2.8298 lr:0.07 dt:43ms tok/s:1529096 rem:84s step 14350 (86%) loss:2.8185 lr:0.07 dt:43ms tok/s:1530927 rem:84s step 14351 (86%) loss:2.7975 lr:0.07 dt:43ms tok/s:1533677 rem:83s step 14352 (86%) loss:2.7501 lr:0.07 dt:43ms tok/s:1520040 rem:83s step 14353 (86%) loss:2.7459 lr:0.07 dt:43ms tok/s:1532685 rem:83s step 14354 (86%) loss:2.7382 lr:0.07 dt:43ms tok/s:1525913 rem:83s step 14355 (86%) loss:2.7567 lr:0.07 dt:43ms tok/s:1534842 rem:83s step 14356 (86%) loss:2.7703 lr:0.07 dt:44ms tok/s:1504655 rem:83s step 14357 (86%) loss:2.7805 lr:0.07 dt:43ms tok/s:1537822 rem:83s step 14358 (86%) loss:2.7699 lr:0.07 dt:43ms tok/s:1534611 rem:83s step 14359 (86%) loss:2.7543 lr:0.07 dt:43ms tok/s:1537305 rem:83s step 14360 (86%) loss:2.7400 lr:0.07 dt:43ms tok/s:1536369 rem:83s step 14361 (86%) loss:2.7531 lr:0.07 dt:43ms tok/s:1532882 rem:83s step 14362 (86%) loss:2.7575 lr:0.07 dt:43ms tok/s:1530816 rem:83s step 14363 (86%) loss:2.7628 lr:0.07 dt:43ms tok/s:1532497 rem:83s step 14364 (86%) loss:2.7685 lr:0.07 dt:43ms tok/s:1533951 rem:83s step 14365 (86%) loss:2.7527 lr:0.07 dt:43ms tok/s:1531370 rem:83s step 14366 (86%) loss:2.7415 lr:0.07 dt:43ms tok/s:1532950 rem:83s step 14367 (86%) loss:2.6775 lr:0.07 dt:43ms tok/s:1533600 rem:83s step 14368 (86%) loss:2.6799 lr:0.07 dt:43ms tok/s:1509539 rem:83s step 14369 (86%) loss:2.7229 lr:0.07 dt:43ms tok/s:1536644 rem:83s step 14370 (86%) loss:2.7417 lr:0.07 dt:43ms tok/s:1530978 rem:83s step 14371 (86%) loss:2.7550 lr:0.07 dt:43ms tok/s:1535811 rem:83s step 14372 (86%) loss:2.7721 lr:0.07 dt:44ms tok/s:1505116 rem:83s step 14373 (86%) loss:2.7864 lr:0.07 dt:43ms tok/s:1533891 rem:83s step 14374 (86%) loss:2.7774 lr:0.07 dt:43ms tok/s:1529368 rem:82s step 14375 (86%) loss:2.7660 lr:0.07 dt:43ms tok/s:1533789 rem:82s step 14376 (86%) loss:2.7694 lr:0.07 dt:43ms tok/s:1535777 rem:82s step 14377 (86%) loss:2.7739 lr:0.07 dt:43ms tok/s:1534611 rem:82s step 14378 (86%) loss:2.7840 lr:0.07 dt:43ms tok/s:1534062 rem:82s step 14379 (86%) loss:2.7975 lr:0.07 dt:43ms tok/s:1530782 rem:82s step 14380 (86%) loss:2.7993 lr:0.07 dt:43ms tok/s:1519166 rem:82s step 14381 (86%) loss:2.8149 lr:0.07 dt:42ms tok/s:1543036 rem:82s step 14382 (86%) loss:2.8136 lr:0.07 dt:43ms tok/s:1537142 rem:82s step 14383 (86%) loss:2.8197 lr:0.07 dt:43ms tok/s:1539846 rem:82s step 14384 (86%) loss:2.8132 lr:0.07 dt:43ms tok/s:1535228 rem:82s step 14385 (86%) loss:2.8039 lr:0.07 dt:43ms tok/s:1538166 rem:82s step 14386 (86%) loss:2.8095 lr:0.07 dt:43ms tok/s:1539156 rem:82s step 14387 (86%) loss:2.8028 lr:0.07 dt:43ms tok/s:1539837 rem:82s step 14388 (86%) loss:2.7986 lr:0.07 dt:43ms tok/s:1534422 rem:82s step 14389 (86%) loss:2.7863 lr:0.07 dt:43ms tok/s:1537813 rem:82s step 14390 (86%) loss:2.8046 lr:0.07 dt:43ms tok/s:1536523 rem:82s step 14391 (86%) loss:2.8083 lr:0.07 dt:43ms tok/s:1535442 rem:82s step 14392 (86%) loss:2.8056 lr:0.07 dt:43ms tok/s:1535168 rem:82s step 14393 (86%) loss:2.8132 lr:0.07 dt:43ms tok/s:1539484 rem:82s step 14394 (86%) loss:2.8115 lr:0.07 dt:43ms tok/s:1532019 rem:82s step 14395 (86%) loss:2.7860 lr:0.07 dt:43ms tok/s:1531686 rem:82s step 14396 (86%) loss:2.7922 lr:0.07 dt:43ms tok/s:1533292 rem:82s step 14397 (86%) loss:2.7881 lr:0.07 dt:43ms tok/s:1531302 rem:81s step 14398 (86%) loss:2.8017 lr:0.07 dt:43ms tok/s:1533557 rem:81s step 14399 (86%) loss:2.8077 lr:0.07 dt:43ms tok/s:1534328 rem:81s step 14400 (86%) loss:2.8100 lr:0.07 dt:43ms tok/s:1526006 rem:81s + local: attn=[0.202, 1.103, 1.258] mlp=[1.440, 0.619, -0.579] + + transition: attn=[3.898, 1.258] mlp=[-0.854, 2.047] + + hierarchy: attn=[3.534, 5.939, 5.616] mlp=[3.341, -3.334, -5.180] + step 14401 (86%) loss:2.8010 lr:0.07 dt:43ms tok/s:1528365 rem:81s step 14402 (86%) loss:2.7944 lr:0.07 dt:43ms tok/s:1535502 rem:81s step 14403 (86%) loss:2.7967 lr:0.07 dt:43ms tok/s:1530458 rem:81s step 14404 (86%) loss:2.8045 lr:0.07 dt:43ms tok/s:1526735 rem:81s step 14405 (86%) loss:2.8012 lr:0.07 dt:43ms tok/s:1526285 rem:81s step 14406 (86%) loss:2.8159 lr:0.07 dt:43ms tok/s:1507428 rem:81s step 14407 (86%) loss:2.8242 lr:0.07 dt:43ms tok/s:1530305 rem:81s step 14408 (86%) loss:2.7881 lr:0.07 dt:43ms tok/s:1520948 rem:81s step 14409 (86%) loss:2.7454 lr:0.07 dt:43ms tok/s:1520536 rem:81s step 14410 (87%) loss:2.7463 lr:0.07 dt:43ms tok/s:1519678 rem:81s step 14411 (87%) loss:2.7246 lr:0.07 dt:43ms tok/s:1520242 rem:81s step 14412 (87%) loss:2.7404 lr:0.07 dt:43ms tok/s:1520006 rem:81s step 14413 (87%) loss:2.7330 lr:0.07 dt:43ms tok/s:1519586 rem:81s step 14414 (87%) loss:2.7347 lr:0.07 dt:43ms tok/s:1523249 rem:81s step 14415 (87%) loss:2.7391 lr:0.07 dt:43ms tok/s:1520233 rem:81s step 14416 (87%) loss:2.7584 lr:0.07 dt:43ms tok/s:1519880 rem:81s step 14417 (87%) loss:2.7645 lr:0.07 dt:43ms tok/s:1517078 rem:81s step 14418 (87%) loss:2.7676 lr:0.07 dt:43ms tok/s:1522313 rem:81s step 14419 (87%) loss:2.7806 lr:0.07 dt:43ms tok/s:1521007 rem:81s step 14420 (87%) loss:2.7782 lr:0.07 dt:43ms tok/s:1519603 rem:81s step 14421 (87%) loss:2.7978 lr:0.07 dt:43ms tok/s:1509555 rem:80s step 14422 (87%) loss:2.7803 lr:0.07 dt:43ms tok/s:1518545 rem:80s step 14423 (87%) loss:2.7775 lr:0.07 dt:43ms tok/s:1522043 rem:80s step 14424 (87%) loss:2.7790 lr:0.07 dt:43ms tok/s:1522616 rem:80s step 14425 (87%) loss:2.7732 lr:0.07 dt:43ms tok/s:1520704 rem:80s step 14426 (87%) loss:2.7744 lr:0.07 dt:43ms tok/s:1524863 rem:80s step 14427 (87%) loss:2.7676 lr:0.07 dt:43ms tok/s:1524542 rem:80s step 14428 (87%) loss:2.7664 lr:0.07 dt:43ms tok/s:1524018 rem:80s step 14429 (87%) loss:2.7688 lr:0.07 dt:43ms tok/s:1521537 rem:80s step 14430 (87%) loss:2.7754 lr:0.07 dt:43ms tok/s:1519846 rem:80s step 14431 (87%) loss:2.7812 lr:0.07 dt:43ms tok/s:1519418 rem:80s step 14432 (87%) loss:2.8016 lr:0.07 dt:43ms tok/s:1522726 rem:80s step 14433 (87%) loss:2.7820 lr:0.07 dt:43ms tok/s:1523418 rem:80s step 14434 (87%) loss:2.7773 lr:0.07 dt:43ms tok/s:1514387 rem:80s step 14435 (87%) loss:2.7912 lr:0.07 dt:43ms tok/s:1523469 rem:80s step 14436 (87%) loss:2.7967 lr:0.07 dt:43ms tok/s:1522642 rem:80s step 14437 (87%) loss:2.7980 lr:0.07 dt:43ms tok/s:1521504 rem:80s step 14438 (87%) loss:2.7980 lr:0.07 dt:43ms tok/s:1523519 rem:80s step 14439 (87%) loss:2.7935 lr:0.07 dt:43ms tok/s:1522287 rem:80s step 14440 (87%) loss:2.8061 lr:0.07 dt:43ms tok/s:1514537 rem:80s step 14441 (87%) loss:2.8272 lr:0.07 dt:43ms tok/s:1521032 rem:80s step 14442 (87%) loss:2.8179 lr:0.07 dt:43ms tok/s:1521268 rem:80s step 14443 (87%) loss:2.8346 lr:0.07 dt:43ms tok/s:1520334 rem:80s step 14444 (87%) loss:2.8380 lr:0.07 dt:43ms tok/s:1520225 rem:79s step 14445 (87%) loss:2.8323 lr:0.07 dt:43ms tok/s:1520544 rem:79s step 14446 (87%) loss:2.8208 lr:0.07 dt:43ms tok/s:1528110 rem:79s step 14447 (87%) loss:2.8087 lr:0.07 dt:44ms tok/s:1502566 rem:79s step 14448 (87%) loss:2.7811 lr:0.07 dt:43ms tok/s:1519015 rem:79s step 14449 (87%) loss:2.7962 lr:0.07 dt:43ms tok/s:1529802 rem:79s step 14450 (87%) loss:2.7905 lr:0.07 dt:43ms tok/s:1533686 rem:79s step 14451 (87%) loss:2.8019 lr:0.07 dt:43ms tok/s:1536798 rem:79s step 14452 (87%) loss:2.7834 lr:0.07 dt:43ms tok/s:1535888 rem:79s step 14453 (87%) loss:2.7395 lr:0.07 dt:43ms tok/s:1532856 rem:79s step 14454 (87%) loss:2.7080 lr:0.07 dt:43ms tok/s:1534516 rem:79s step 14455 (87%) loss:2.7289 lr:0.07 dt:43ms tok/s:1535914 rem:79s step 14456 (87%) loss:2.7442 lr:0.07 dt:44ms tok/s:1503108 rem:79s step 14457 (87%) loss:2.7436 lr:0.07 dt:43ms tok/s:1538880 rem:79s step 14458 (87%) loss:2.7217 lr:0.07 dt:43ms tok/s:1532831 rem:79s step 14459 (87%) loss:2.7298 lr:0.07 dt:43ms tok/s:1538398 rem:79s step 14460 (87%) loss:2.7213 lr:0.07 dt:43ms tok/s:1538674 rem:79s step 14461 (87%) loss:2.6837 lr:0.07 dt:43ms tok/s:1536094 rem:79s step 14462 (87%) loss:2.6715 lr:0.06 dt:43ms tok/s:1537538 rem:79s step 14463 (87%) loss:2.6759 lr:0.06 dt:43ms tok/s:1521731 rem:79s step 14464 (87%) loss:2.7099 lr:0.06 dt:43ms tok/s:1537701 rem:79s step 14465 (87%) loss:2.7203 lr:0.06 dt:43ms tok/s:1533771 rem:79s step 14466 (87%) loss:2.7257 lr:0.06 dt:43ms tok/s:1536850 rem:79s step 14467 (87%) loss:2.7477 lr:0.06 dt:43ms tok/s:1535871 rem:78s step 14468 (87%) loss:2.7597 lr:0.06 dt:43ms tok/s:1539803 rem:78s step 14469 (87%) loss:2.7647 lr:0.06 dt:43ms tok/s:1537220 rem:78s step 14470 (87%) loss:2.7698 lr:0.06 dt:43ms tok/s:1536901 rem:78s step 14471 (87%) loss:2.7529 lr:0.06 dt:43ms tok/s:1538166 rem:78s step 14472 (87%) loss:2.7550 lr:0.06 dt:43ms tok/s:1531848 rem:78s step 14473 (87%) loss:2.7505 lr:0.06 dt:43ms tok/s:1534157 rem:78s step 14474 (87%) loss:2.7535 lr:0.06 dt:45ms tok/s:1448288 rem:78s step 14475 (87%) loss:2.7580 lr:0.06 dt:43ms tok/s:1527668 rem:78s step 14476 (87%) loss:2.7466 lr:0.06 dt:42ms tok/s:1556130 rem:78s step 14477 (87%) loss:2.7531 lr:0.06 dt:42ms tok/s:1550993 rem:78s step 14478 (87%) loss:2.7653 lr:0.06 dt:42ms tok/s:1549481 rem:78s step 14479 (87%) loss:2.7712 lr:0.06 dt:43ms tok/s:1541322 rem:78s step 14480 (87%) loss:2.7676 lr:0.06 dt:43ms tok/s:1537529 rem:78s step 14481 (87%) loss:2.7498 lr:0.06 dt:43ms tok/s:1538863 rem:78s step 14482 (87%) loss:2.7468 lr:0.06 dt:43ms tok/s:1535880 rem:78s step 14483 (87%) loss:2.7944 lr:0.06 dt:43ms tok/s:1534825 rem:78s step 14484 (87%) loss:2.7912 lr:0.06 dt:43ms tok/s:1533010 rem:78s step 14485 (87%) loss:2.7900 lr:0.06 dt:48ms tok/s:1368696 rem:78s step 14486 (87%) loss:2.7981 lr:0.06 dt:43ms tok/s:1537503 rem:78s step 14487 (87%) loss:2.8046 lr:0.06 dt:43ms tok/s:1540312 rem:78s step 14488 (87%) loss:2.7809 lr:0.06 dt:43ms tok/s:1538347 rem:78s step 14489 (87%) loss:2.7677 lr:0.06 dt:43ms tok/s:1540157 rem:78s step 14490 (87%) loss:2.7654 lr:0.06 dt:43ms tok/s:1535554 rem:78s step 14491 (87%) loss:2.7697 lr:0.06 dt:43ms tok/s:1540217 rem:77s step 14492 (87%) loss:2.7593 lr:0.06 dt:43ms tok/s:1533472 rem:77s step 14493 (87%) loss:2.7503 lr:0.06 dt:43ms tok/s:1540692 rem:77s step 14494 (87%) loss:2.7597 lr:0.06 dt:43ms tok/s:1540329 rem:77s step 14495 (87%) loss:2.7723 lr:0.06 dt:43ms tok/s:1536575 rem:77s step 14496 (87%) loss:2.7841 lr:0.06 dt:42ms tok/s:1542378 rem:77s step 14497 (87%) loss:2.7853 lr:0.06 dt:43ms tok/s:1539311 rem:77s step 14498 (87%) loss:2.7737 lr:0.06 dt:43ms tok/s:1535485 rem:77s step 14499 (87%) loss:2.7926 lr:0.06 dt:43ms tok/s:1538613 rem:77s step 14500 (87%) loss:2.7801 lr:0.06 dt:43ms tok/s:1541245 rem:77s + local: attn=[0.211, 1.116, 1.270] mlp=[1.455, 0.627, -0.583] + + transition: attn=[3.920, 1.252] mlp=[-0.875, 2.141] + + hierarchy: attn=[3.528, 5.939, 5.616] mlp=[3.393, -3.447, -5.180] + step 14501 (87%) loss:2.7834 lr:0.06 dt:43ms tok/s:1538071 rem:77s step 14502 (87%) loss:2.7940 lr:0.06 dt:43ms tok/s:1536223 rem:77s step 14503 (87%) loss:2.8025 lr:0.06 dt:42ms tok/s:1542525 rem:77s step 14504 (87%) loss:2.7959 lr:0.06 dt:43ms tok/s:1538639 rem:77s step 14505 (87%) loss:2.8044 lr:0.06 dt:43ms tok/s:1537434 rem:77s step 14506 (87%) loss:2.8092 lr:0.06 dt:43ms tok/s:1534259 rem:77s step 14507 (87%) loss:2.8034 lr:0.06 dt:43ms tok/s:1538105 rem:77s step 14508 (87%) loss:2.7843 lr:0.06 dt:43ms tok/s:1541374 rem:77s step 14509 (87%) loss:2.8051 lr:0.06 dt:42ms tok/s:1542395 rem:77s step 14510 (87%) loss:2.7942 lr:0.06 dt:42ms tok/s:1543712 rem:77s step 14511 (87%) loss:2.7869 lr:0.06 dt:43ms tok/s:1537667 rem:77s step 14512 (87%) loss:2.7783 lr:0.06 dt:43ms tok/s:1541616 rem:77s step 14513 (87%) loss:2.7740 lr:0.06 dt:43ms tok/s:1537899 rem:77s step 14514 (87%) loss:2.7788 lr:0.06 dt:43ms tok/s:1541322 rem:76s step 14515 (87%) loss:2.7791 lr:0.06 dt:43ms tok/s:1537348 rem:76s step 14516 (87%) loss:2.7950 lr:0.06 dt:43ms tok/s:1522971 rem:76s step 14517 (87%) loss:2.7988 lr:0.06 dt:44ms tok/s:1501786 rem:76s step 14518 (87%) loss:2.7826 lr:0.06 dt:43ms tok/s:1519578 rem:76s step 14519 (87%) loss:2.7797 lr:0.06 dt:43ms tok/s:1520595 rem:76s step 14520 (87%) loss:2.7872 lr:0.06 dt:43ms tok/s:1518981 rem:76s step 14521 (87%) loss:2.8079 lr:0.06 dt:45ms tok/s:1455721 rem:76s step 14522 (87%) loss:2.8068 lr:0.06 dt:43ms tok/s:1527711 rem:76s step 14523 (87%) loss:2.8159 lr:0.06 dt:43ms tok/s:1529351 rem:76s step 14524 (87%) loss:2.7942 lr:0.06 dt:43ms tok/s:1532933 rem:76s step 14525 (87%) loss:2.7888 lr:0.06 dt:43ms tok/s:1532617 rem:76s step 14526 (87%) loss:2.7770 lr:0.06 dt:43ms tok/s:1528977 rem:76s step 14527 (87%) loss:2.7685 lr:0.06 dt:43ms tok/s:1528662 rem:76s step 14528 (87%) loss:2.7656 lr:0.06 dt:43ms tok/s:1529700 rem:76s step 14529 (87%) loss:2.7726 lr:0.06 dt:43ms tok/s:1524296 rem:76s step 14530 (87%) loss:2.7535 lr:0.06 dt:43ms tok/s:1523536 rem:76s step 14531 (87%) loss:2.7562 lr:0.06 dt:43ms tok/s:1525506 rem:76s step 14532 (87%) loss:2.7498 lr:0.06 dt:43ms tok/s:1526887 rem:76s step 14533 (87%) loss:2.7647 lr:0.06 dt:44ms tok/s:1472723 rem:76s step 14534 (87%) loss:2.7470 lr:0.06 dt:43ms tok/s:1529334 rem:76s step 14535 (87%) loss:2.7503 lr:0.06 dt:43ms tok/s:1537847 rem:76s step 14536 (87%) loss:2.7476 lr:0.06 dt:43ms tok/s:1540770 rem:76s step 14537 (87%) loss:2.7491 lr:0.06 dt:43ms tok/s:1533078 rem:75s step 14538 (87%) loss:2.7274 lr:0.06 dt:43ms tok/s:1533900 rem:75s step 14539 (87%) loss:2.7241 lr:0.06 dt:43ms tok/s:1539665 rem:75s step 14540 (87%) loss:2.7329 lr:0.06 dt:43ms tok/s:1541763 rem:75s step 14541 (87%) loss:2.7473 lr:0.06 dt:43ms tok/s:1540416 rem:75s step 14542 (87%) loss:2.7551 lr:0.06 dt:43ms tok/s:1538484 rem:75s step 14543 (87%) loss:2.7574 lr:0.06 dt:42ms tok/s:1542118 rem:75s step 14544 (87%) loss:2.7442 lr:0.06 dt:43ms tok/s:1541435 rem:75s step 14545 (87%) loss:2.7340 lr:0.06 dt:43ms tok/s:1537073 rem:75s step 14546 (87%) loss:2.7380 lr:0.06 dt:42ms tok/s:1542248 rem:75s step 14547 (87%) loss:2.7488 lr:0.06 dt:43ms tok/s:1541072 rem:75s step 14548 (87%) loss:2.7546 lr:0.06 dt:42ms tok/s:1543339 rem:75s step 14549 (87%) loss:2.7562 lr:0.06 dt:43ms tok/s:1538562 rem:75s step 14550 (88%) loss:2.7586 lr:0.06 dt:42ms tok/s:1544136 rem:75s step 14551 (88%) loss:2.7502 lr:0.06 dt:43ms tok/s:1541461 rem:75s step 14552 (88%) loss:2.7433 lr:0.06 dt:47ms tok/s:1386990 rem:75s step 14553 (88%) loss:2.7530 lr:0.06 dt:43ms tok/s:1530449 rem:75s step 14554 (88%) loss:2.7588 lr:0.06 dt:42ms tok/s:1544891 rem:75s step 14555 (88%) loss:2.7436 lr:0.06 dt:45ms tok/s:1446147 rem:75s step 14556 (88%) loss:2.7492 lr:0.06 dt:43ms tok/s:1540321 rem:75s step 14557 (88%) loss:2.7371 lr:0.06 dt:43ms tok/s:1539872 rem:75s step 14558 (88%) loss:2.7311 lr:0.06 dt:43ms tok/s:1537649 rem:75s step 14559 (88%) loss:2.6813 lr:0.06 dt:43ms tok/s:1541962 rem:75s step 14560 (88%) loss:2.6170 lr:0.06 dt:43ms tok/s:1534533 rem:75s step 14561 (88%) loss:2.6115 lr:0.06 dt:43ms tok/s:1514504 rem:74s step 14562 (88%) loss:2.6122 lr:0.06 dt:43ms tok/s:1533036 rem:74s step 14563 (88%) loss:2.6235 lr:0.06 dt:43ms tok/s:1531874 rem:74s step 14564 (88%) loss:2.6476 lr:0.06 dt:43ms tok/s:1535331 rem:74s step 14565 (88%) loss:2.6681 lr:0.06 dt:43ms tok/s:1531021 rem:74s step 14566 (88%) loss:2.6920 lr:0.06 dt:43ms tok/s:1534611 rem:74s step 14567 (88%) loss:2.7130 lr:0.06 dt:43ms tok/s:1532497 rem:74s step 14568 (88%) loss:2.7114 lr:0.06 dt:43ms tok/s:1537847 rem:74s step 14569 (88%) loss:2.7110 lr:0.06 dt:43ms tok/s:1534028 rem:74s step 14570 (88%) loss:2.7007 lr:0.06 dt:43ms tok/s:1538562 rem:74s step 14571 (88%) loss:2.7044 lr:0.06 dt:43ms tok/s:1535631 rem:74s step 14572 (88%) loss:2.7019 lr:0.06 dt:43ms tok/s:1529462 rem:74s step 14573 (88%) loss:2.7147 lr:0.06 dt:43ms tok/s:1531336 rem:74s step 14574 (88%) loss:2.7268 lr:0.06 dt:43ms tok/s:1532762 rem:74s step 14575 (88%) loss:2.7421 lr:0.06 dt:43ms tok/s:1522566 rem:74s step 14576 (88%) loss:2.7515 lr:0.06 dt:43ms tok/s:1523443 rem:74s step 14577 (88%) loss:2.7462 lr:0.06 dt:43ms tok/s:1522852 rem:74s step 14578 (88%) loss:2.7341 lr:0.06 dt:43ms tok/s:1533309 rem:74s step 14579 (88%) loss:2.7205 lr:0.06 dt:43ms tok/s:1535948 rem:74s step 14580 (88%) loss:2.7292 lr:0.06 dt:43ms tok/s:1531063 rem:74s step 14581 (88%) loss:2.7399 lr:0.06 dt:43ms tok/s:1532139 rem:74s step 14582 (88%) loss:2.7481 lr:0.06 dt:43ms tok/s:1537933 rem:74s step 14583 (88%) loss:2.7371 lr:0.06 dt:43ms tok/s:1537125 rem:74s step 14584 (88%) loss:2.7201 lr:0.06 dt:43ms tok/s:1534388 rem:73s step 14585 (88%) loss:2.7097 lr:0.06 dt:43ms tok/s:1529428 rem:73s step 14586 (88%) loss:2.7144 lr:0.06 dt:43ms tok/s:1533258 rem:73s step 14587 (88%) loss:2.7185 lr:0.06 dt:43ms tok/s:1534816 rem:73s step 14588 (88%) loss:2.7177 lr:0.06 dt:43ms tok/s:1534414 rem:73s step 14589 (88%) loss:2.7196 lr:0.06 dt:43ms tok/s:1518293 rem:73s step 14590 (88%) loss:2.7094 lr:0.06 dt:43ms tok/s:1534234 rem:73s step 14591 (88%) loss:2.6956 lr:0.06 dt:43ms tok/s:1527159 rem:73s step 14592 (88%) loss:2.6927 lr:0.06 dt:45ms tok/s:1456014 rem:73s step 14593 (88%) loss:2.6981 lr:0.06 dt:45ms tok/s:1470030 rem:73s step 14594 (88%) loss:2.7419 lr:0.06 dt:43ms tok/s:1520233 rem:73s step 14595 (88%) loss:2.7505 lr:0.06 dt:43ms tok/s:1539018 rem:73s step 14596 (88%) loss:2.7556 lr:0.06 dt:43ms tok/s:1530467 rem:73s step 14597 (88%) loss:2.7303 lr:0.06 dt:43ms tok/s:1527133 rem:73s step 14598 (88%) loss:2.7302 lr:0.06 dt:43ms tok/s:1526870 rem:73s step 14599 (88%) loss:2.7401 lr:0.06 dt:43ms tok/s:1516986 rem:73s step 14600 (88%) loss:2.7417 lr:0.06 dt:43ms tok/s:1515974 rem:73s + local: attn=[0.219, 1.126, 1.282] mlp=[1.464, 0.653, -0.571] + + transition: attn=[3.958, 1.253] mlp=[-0.910, 2.188] + + hierarchy: attn=[3.526, 5.939, 5.616] mlp=[3.456, -3.510, -5.180] + step 14601 (88%) loss:2.7373 lr:0.06 dt:43ms tok/s:1522987 rem:73s step 14602 (88%) loss:2.7406 lr:0.06 dt:43ms tok/s:1511572 rem:73s step 14603 (88%) loss:2.7142 lr:0.06 dt:43ms tok/s:1514062 rem:73s step 14604 (88%) loss:2.7127 lr:0.06 dt:43ms tok/s:1520931 rem:73s step 14605 (88%) loss:2.7271 lr:0.06 dt:43ms tok/s:1517816 rem:73s step 14606 (88%) loss:2.7177 lr:0.06 dt:43ms tok/s:1516124 rem:73s step 14607 (88%) loss:2.7073 lr:0.06 dt:45ms tok/s:1452774 rem:72s step 14608 (88%) loss:2.7092 lr:0.06 dt:43ms tok/s:1517170 rem:72s step 14609 (88%) loss:2.7188 lr:0.06 dt:44ms tok/s:1504985 rem:72s step 14610 (88%) loss:2.7076 lr:0.06 dt:43ms tok/s:1510933 rem:72s step 14611 (88%) loss:2.6995 lr:0.06 dt:44ms tok/s:1497638 rem:72s step 14612 (88%) loss:2.6993 lr:0.05 dt:43ms tok/s:1522633 rem:72s step 14613 (88%) loss:2.7336 lr:0.05 dt:43ms tok/s:1507750 rem:72s step 14614 (88%) loss:2.7421 lr:0.05 dt:44ms tok/s:1495177 rem:72s step 14615 (88%) loss:2.7387 lr:0.05 dt:43ms tok/s:1510094 rem:72s step 14616 (88%) loss:2.7407 lr:0.05 dt:44ms tok/s:1496569 rem:72s step 14617 (88%) loss:2.7369 lr:0.05 dt:43ms tok/s:1507394 rem:72s step 14618 (88%) loss:2.7348 lr:0.05 dt:43ms tok/s:1515239 rem:72s step 14619 (88%) loss:2.7308 lr:0.05 dt:44ms tok/s:1499623 rem:72s step 14620 (88%) loss:2.7302 lr:0.05 dt:43ms tok/s:1517715 rem:72s step 14621 (88%) loss:2.7382 lr:0.05 dt:44ms tok/s:1499844 rem:72s step 14622 (88%) loss:2.7319 lr:0.05 dt:44ms tok/s:1496007 rem:72s step 14623 (88%) loss:2.7414 lr:0.05 dt:43ms tok/s:1528977 rem:72s step 14624 (88%) loss:2.7527 lr:0.05 dt:43ms tok/s:1525582 rem:72s step 14625 (88%) loss:2.7510 lr:0.05 dt:43ms tok/s:1510393 rem:72s step 14626 (88%) loss:2.7290 lr:0.05 dt:43ms tok/s:1518000 rem:72s step 14627 (88%) loss:2.7069 lr:0.05 dt:43ms tok/s:1521083 rem:72s step 14628 (88%) loss:2.6748 lr:0.05 dt:43ms tok/s:1520999 rem:72s step 14629 (88%) loss:2.6583 lr:0.05 dt:43ms tok/s:1518914 rem:72s step 14630 (88%) loss:2.6309 lr:0.05 dt:43ms tok/s:1523570 rem:71s step 14631 (88%) loss:2.6067 lr:0.05 dt:43ms tok/s:1521209 rem:71s step 14632 (88%) loss:2.5901 lr:0.05 dt:43ms tok/s:1523013 rem:71s step 14633 (88%) loss:2.5648 lr:0.05 dt:43ms tok/s:1522355 rem:71s step 14634 (88%) loss:2.6015 lr:0.05 dt:43ms tok/s:1523452 rem:71s step 14635 (88%) loss:2.6336 lr:0.05 dt:43ms tok/s:1522203 rem:71s step 14636 (88%) loss:2.6572 lr:0.05 dt:43ms tok/s:1519620 rem:71s step 14637 (88%) loss:2.6548 lr:0.05 dt:43ms tok/s:1525159 rem:71s step 14638 (88%) loss:2.6699 lr:0.05 dt:43ms tok/s:1525532 rem:71s step 14639 (88%) loss:2.7157 lr:0.05 dt:43ms tok/s:1525472 rem:71s step 14640 (88%) loss:2.7103 lr:0.05 dt:43ms tok/s:1525049 rem:71s step 14641 (88%) loss:2.7095 lr:0.05 dt:43ms tok/s:1523198 rem:71s step 14642 (88%) loss:2.7098 lr:0.05 dt:43ms tok/s:1519258 rem:71s step 14643 (88%) loss:2.7272 lr:0.05 dt:43ms tok/s:1520847 rem:71s step 14644 (88%) loss:2.7413 lr:0.05 dt:43ms tok/s:1516911 rem:71s step 14645 (88%) loss:2.7310 lr:0.05 dt:43ms tok/s:1516191 rem:71s step 14646 (88%) loss:2.7523 lr:0.05 dt:43ms tok/s:1522304 rem:71s step 14647 (88%) loss:2.7551 lr:0.05 dt:43ms tok/s:1524339 rem:71s step 14648 (88%) loss:2.7591 lr:0.05 dt:43ms tok/s:1522060 rem:71s step 14649 (88%) loss:2.7469 lr:0.05 dt:43ms tok/s:1516543 rem:71s step 14650 (88%) loss:2.7383 lr:0.05 dt:50ms tok/s:1318796 rem:71s step 14651 (88%) loss:2.7483 lr:0.05 dt:43ms tok/s:1537598 rem:71s step 14652 (88%) loss:2.7609 lr:0.05 dt:43ms tok/s:1522523 rem:71s step 14653 (88%) loss:2.8206 lr:0.05 dt:47ms tok/s:1385341 rem:70s step 14654 (88%) loss:2.8076 lr:0.05 dt:42ms tok/s:1555716 rem:70s step 14655 (88%) loss:2.8120 lr:0.05 dt:42ms tok/s:1552991 rem:70s step 14656 (88%) loss:2.8085 lr:0.05 dt:42ms tok/s:1549909 rem:70s step 14657 (88%) loss:2.7929 lr:0.05 dt:42ms tok/s:1545821 rem:70s step 14658 (88%) loss:2.7892 lr:0.05 dt:42ms tok/s:1542179 rem:70s step 14659 (88%) loss:2.7767 lr:0.05 dt:43ms tok/s:1535339 rem:70s step 14660 (88%) loss:2.7790 lr:0.05 dt:43ms tok/s:1532352 rem:70s step 14661 (88%) loss:2.7529 lr:0.05 dt:43ms tok/s:1531396 rem:70s step 14662 (88%) loss:2.7322 lr:0.05 dt:43ms tok/s:1529249 rem:70s step 14663 (88%) loss:2.7203 lr:0.05 dt:43ms tok/s:1527694 rem:70s step 14664 (88%) loss:2.7238 lr:0.05 dt:43ms tok/s:1524220 rem:70s step 14665 (88%) loss:2.7115 lr:0.05 dt:43ms tok/s:1527337 rem:70s step 14666 (88%) loss:2.6886 lr:0.05 dt:43ms tok/s:1527532 rem:70s step 14667 (88%) loss:2.6898 lr:0.05 dt:43ms tok/s:1525396 rem:70s step 14668 (88%) loss:2.7066 lr:0.05 dt:43ms tok/s:1522785 rem:70s step 14669 (88%) loss:2.7133 lr:0.05 dt:43ms tok/s:1525709 rem:70s step 14670 (88%) loss:2.6950 lr:0.05 dt:43ms tok/s:1523612 rem:70s step 14671 (88%) loss:2.6921 lr:0.05 dt:43ms tok/s:1524804 rem:70s step 14672 (88%) loss:2.6871 lr:0.05 dt:43ms tok/s:1528305 rem:70s step 14673 (88%) loss:2.7060 lr:0.05 dt:43ms tok/s:1522895 rem:70s step 14674 (88%) loss:2.6966 lr:0.05 dt:43ms tok/s:1529113 rem:70s step 14675 (88%) loss:2.7022 lr:0.05 dt:43ms tok/s:1526201 rem:70s step 14676 (88%) loss:2.7054 lr:0.05 dt:43ms tok/s:1526540 rem:70s step 14677 (88%) loss:2.7309 lr:0.05 dt:43ms tok/s:1522245 rem:69s step 14678 (88%) loss:2.7292 lr:0.05 dt:43ms tok/s:1529019 rem:69s step 14679 (88%) loss:2.7332 lr:0.05 dt:43ms tok/s:1522599 rem:69s step 14680 (88%) loss:2.7353 lr:0.05 dt:43ms tok/s:1523950 rem:69s step 14681 (88%) loss:2.7119 lr:0.05 dt:43ms tok/s:1525997 rem:69s step 14682 (88%) loss:2.7266 lr:0.05 dt:43ms tok/s:1526192 rem:69s step 14683 (88%) loss:2.7431 lr:0.05 dt:43ms tok/s:1525218 rem:69s step 14684 (88%) loss:2.7390 lr:0.05 dt:43ms tok/s:1521849 rem:69s step 14685 (88%) loss:2.7539 lr:0.05 dt:43ms tok/s:1524237 rem:69s step 14686 (88%) loss:2.7360 lr:0.05 dt:43ms tok/s:1526463 rem:69s step 14687 (88%) loss:2.7409 lr:0.05 dt:43ms tok/s:1529691 rem:69s step 14688 (88%) loss:2.7392 lr:0.05 dt:43ms tok/s:1528917 rem:69s step 14689 (89%) loss:2.7429 lr:0.05 dt:43ms tok/s:1527303 rem:69s step 14690 (89%) loss:2.7376 lr:0.05 dt:43ms tok/s:1523300 rem:69s step 14691 (89%) loss:2.7506 lr:0.05 dt:43ms tok/s:1528416 rem:69s step 14692 (89%) loss:2.8015 lr:0.05 dt:44ms tok/s:1487861 rem:69s step 14693 (89%) loss:2.7969 lr:0.05 dt:43ms tok/s:1527532 rem:69s step 14694 (89%) loss:2.7994 lr:0.05 dt:43ms tok/s:1525879 rem:69s step 14695 (89%) loss:2.8216 lr:0.05 dt:43ms tok/s:1525218 rem:69s step 14696 (89%) loss:2.8113 lr:0.05 dt:43ms tok/s:1525650 rem:69s step 14697 (89%) loss:2.7989 lr:0.05 dt:43ms tok/s:1520586 rem:69s step 14698 (89%) loss:2.8276 lr:0.05 dt:44ms tok/s:1503964 rem:69s step 14699 (89%) loss:2.8096 lr:0.05 dt:43ms tok/s:1519662 rem:69s step 14700 (89%) loss:2.7987 lr:0.05 dt:43ms tok/s:1520040 rem:68s + local: attn=[0.223, 1.132, 1.290] mlp=[1.483, 0.661, -0.596] + + transition: attn=[3.973, 1.253] mlp=[-0.931, 2.228] + + hierarchy: attn=[3.509, 5.939, 5.616] mlp=[3.491, -3.582, -5.180] + step 14701 (89%) loss:2.7934 lr:0.05 dt:43ms tok/s:1517924 rem:68s step 14702 (89%) loss:2.8216 lr:0.05 dt:43ms tok/s:1520174 rem:68s step 14703 (89%) loss:2.8053 lr:0.05 dt:43ms tok/s:1520738 rem:68s step 14704 (89%) loss:2.8053 lr:0.05 dt:43ms tok/s:1521874 rem:68s step 14705 (89%) loss:2.8160 lr:0.05 dt:43ms tok/s:1522093 rem:68s step 14706 (89%) loss:2.8126 lr:0.05 dt:43ms tok/s:1518042 rem:68s step 14707 (89%) loss:2.7980 lr:0.05 dt:43ms tok/s:1522271 rem:68s step 14708 (89%) loss:2.7738 lr:0.05 dt:43ms tok/s:1521874 rem:68s step 14709 (89%) loss:2.7561 lr:0.05 dt:43ms tok/s:1522599 rem:68s step 14710 (89%) loss:2.7567 lr:0.05 dt:43ms tok/s:1522540 rem:68s step 14711 (89%) loss:2.7602 lr:0.05 dt:43ms tok/s:1520040 rem:68s step 14712 (89%) loss:2.7829 lr:0.05 dt:43ms tok/s:1517363 rem:68s step 14713 (89%) loss:2.7729 lr:0.05 dt:43ms tok/s:1520460 rem:68s step 14714 (89%) loss:2.7717 lr:0.05 dt:43ms tok/s:1523038 rem:68s step 14715 (89%) loss:2.7699 lr:0.05 dt:43ms tok/s:1518428 rem:68s step 14716 (89%) loss:2.7675 lr:0.05 dt:43ms tok/s:1516718 rem:68s step 14717 (89%) loss:2.7728 lr:0.05 dt:43ms tok/s:1520881 rem:68s step 14718 (89%) loss:2.7702 lr:0.05 dt:43ms tok/s:1521950 rem:68s step 14719 (89%) loss:2.7605 lr:0.05 dt:44ms tok/s:1505100 rem:68s step 14720 (89%) loss:2.7490 lr:0.05 dt:43ms tok/s:1523916 rem:68s step 14721 (89%) loss:2.7224 lr:0.05 dt:43ms tok/s:1519208 rem:68s step 14722 (89%) loss:2.7145 lr:0.05 dt:43ms tok/s:1518587 rem:68s step 14723 (89%) loss:2.7452 lr:0.05 dt:43ms tok/s:1521377 rem:67s step 14724 (89%) loss:2.7458 lr:0.05 dt:43ms tok/s:1521740 rem:67s step 14725 (89%) loss:2.7408 lr:0.05 dt:43ms tok/s:1519578 rem:67s step 14726 (89%) loss:2.7381 lr:0.05 dt:43ms tok/s:1522574 rem:67s step 14727 (89%) loss:2.7323 lr:0.05 dt:43ms tok/s:1524533 rem:67s step 14728 (89%) loss:2.7193 lr:0.05 dt:43ms tok/s:1519292 rem:67s step 14729 (89%) loss:2.7056 lr:0.05 dt:43ms tok/s:1518671 rem:67s step 14730 (89%) loss:2.7099 lr:0.05 dt:43ms tok/s:1524060 rem:67s step 14731 (89%) loss:2.7313 lr:0.05 dt:43ms tok/s:1517179 rem:67s step 14732 (89%) loss:2.7258 lr:0.05 dt:43ms tok/s:1518025 rem:67s step 14733 (89%) loss:2.7136 lr:0.05 dt:43ms tok/s:1524812 rem:67s step 14734 (89%) loss:2.7113 lr:0.05 dt:43ms tok/s:1522515 rem:67s step 14735 (89%) loss:2.7167 lr:0.05 dt:43ms tok/s:1522110 rem:67s step 14736 (89%) loss:2.7043 lr:0.05 dt:43ms tok/s:1521984 rem:67s step 14737 (89%) loss:2.6796 lr:0.05 dt:43ms tok/s:1511672 rem:67s step 14738 (89%) loss:2.6635 lr:0.05 dt:43ms tok/s:1523882 rem:67s step 14739 (89%) loss:2.6648 lr:0.05 dt:43ms tok/s:1518746 rem:67s step 14740 (89%) loss:2.6620 lr:0.05 dt:43ms tok/s:1521091 rem:67s step 14741 (89%) loss:2.6741 lr:0.05 dt:43ms tok/s:1524559 rem:67s step 14742 (89%) loss:2.6889 lr:0.05 dt:43ms tok/s:1519519 rem:67s step 14743 (89%) loss:2.6775 lr:0.05 dt:43ms tok/s:1519939 rem:67s step 14744 (89%) loss:2.6728 lr:0.05 dt:44ms tok/s:1499664 rem:67s step 14745 (89%) loss:2.6723 lr:0.05 dt:43ms tok/s:1509663 rem:67s step 14746 (89%) loss:2.6863 lr:0.05 dt:43ms tok/s:1520544 rem:66s step 14747 (89%) loss:2.6926 lr:0.05 dt:43ms tok/s:1516074 rem:66s step 14748 (89%) loss:2.6913 lr:0.05 dt:43ms tok/s:1521394 rem:66s step 14749 (89%) loss:2.6849 lr:0.05 dt:43ms tok/s:1520536 rem:66s step 14750 (89%) loss:2.6842 lr:0.05 dt:43ms tok/s:1521824 rem:66s step 14751 (89%) loss:2.6888 lr:0.05 dt:43ms tok/s:1513495 rem:66s step 14752 (89%) loss:2.6703 lr:0.05 dt:43ms tok/s:1521226 rem:66s step 14753 (89%) loss:2.6394 lr:0.05 dt:43ms tok/s:1517656 rem:66s step 14754 (89%) loss:2.6054 lr:0.05 dt:43ms tok/s:1514178 rem:66s step 14755 (89%) loss:2.5925 lr:0.05 dt:45ms tok/s:1465625 rem:66s step 14756 (89%) loss:2.6067 lr:0.05 dt:43ms tok/s:1521352 rem:66s step 14757 (89%) loss:2.6440 lr:0.05 dt:43ms tok/s:1522591 rem:66s step 14758 (89%) loss:2.6591 lr:0.05 dt:43ms tok/s:1526438 rem:66s step 14759 (89%) loss:2.6930 lr:0.05 dt:43ms tok/s:1526277 rem:66s step 14760 (89%) loss:2.6919 lr:0.05 dt:46ms tok/s:1432708 rem:66s step 14761 (89%) loss:2.6935 lr:0.05 dt:43ms tok/s:1532702 rem:66s step 14762 (89%) loss:2.6968 lr:0.05 dt:43ms tok/s:1531771 rem:66s step 14763 (89%) loss:2.7089 lr:0.05 dt:43ms tok/s:1531464 rem:66s step 14764 (89%) loss:2.7090 lr:0.05 dt:43ms tok/s:1533224 rem:66s step 14765 (89%) loss:2.7121 lr:0.05 dt:43ms tok/s:1533326 rem:66s step 14766 (89%) loss:2.7521 lr:0.05 dt:43ms tok/s:1531788 rem:66s step 14767 (89%) loss:2.7446 lr:0.05 dt:49ms tok/s:1348677 rem:66s step 14768 (89%) loss:2.7324 lr:0.05 dt:42ms tok/s:1548634 rem:66s step 14769 (89%) loss:2.6942 lr:0.05 dt:43ms tok/s:1539380 rem:65s step 14770 (89%) loss:2.6793 lr:0.05 dt:43ms tok/s:1539079 rem:65s step 14771 (89%) loss:2.6716 lr:0.05 dt:43ms tok/s:1540942 rem:65s step 14772 (89%) loss:2.6708 lr:0.05 dt:43ms tok/s:1540295 rem:65s step 14773 (89%) loss:2.6612 lr:0.05 dt:43ms tok/s:1536910 rem:65s step 14774 (89%) loss:2.6621 lr:0.04 dt:43ms tok/s:1540675 rem:65s step 14775 (89%) loss:2.6689 lr:0.04 dt:43ms tok/s:1541245 rem:65s step 14776 (89%) loss:2.6633 lr:0.04 dt:43ms tok/s:1540657 rem:65s step 14777 (89%) loss:2.6872 lr:0.04 dt:43ms tok/s:1540804 rem:65s step 14778 (89%) loss:2.6973 lr:0.04 dt:43ms tok/s:1530160 rem:65s step 14779 (89%) loss:2.7087 lr:0.04 dt:43ms tok/s:1539415 rem:65s step 14780 (89%) loss:2.7115 lr:0.04 dt:43ms tok/s:1536798 rem:65s step 14781 (89%) loss:2.7020 lr:0.04 dt:43ms tok/s:1539130 rem:65s step 14782 (89%) loss:2.7025 lr:0.04 dt:43ms tok/s:1522389 rem:65s step 14783 (89%) loss:2.7651 lr:0.04 dt:43ms tok/s:1538657 rem:65s step 14784 (89%) loss:2.7838 lr:0.04 dt:43ms tok/s:1539820 rem:65s step 14785 (89%) loss:2.7842 lr:0.04 dt:43ms tok/s:1538992 rem:65s step 14786 (89%) loss:2.7771 lr:0.04 dt:48ms tok/s:1357616 rem:65s step 14787 (89%) loss:2.7643 lr:0.04 dt:42ms tok/s:1558972 rem:65s step 14788 (89%) loss:2.7689 lr:0.04 dt:42ms tok/s:1550626 rem:65s step 14789 (89%) loss:2.7648 lr:0.04 dt:42ms tok/s:1550329 rem:65s step 14790 (89%) loss:2.7383 lr:0.04 dt:42ms tok/s:1546586 rem:65s step 14791 (89%) loss:2.7400 lr:0.04 dt:45ms tok/s:1462918 rem:65s step 14792 (89%) loss:2.7323 lr:0.04 dt:42ms tok/s:1545995 rem:64s step 14793 (89%) loss:2.7345 lr:0.04 dt:43ms tok/s:1541634 rem:64s step 14794 (89%) loss:2.7427 lr:0.04 dt:43ms tok/s:1537822 rem:64s step 14795 (89%) loss:2.7510 lr:0.04 dt:43ms tok/s:1538338 rem:64s step 14796 (89%) loss:2.7414 lr:0.04 dt:43ms tok/s:1540666 rem:64s step 14797 (89%) loss:2.7313 lr:0.04 dt:43ms tok/s:1537391 rem:64s step 14798 (89%) loss:2.7286 lr:0.04 dt:43ms tok/s:1538200 rem:64s step 14799 (89%) loss:2.7262 lr:0.04 dt:43ms tok/s:1536652 rem:64s step 14800 (89%) loss:2.7257 lr:0.04 dt:43ms tok/s:1541400 rem:64s + local: attn=[0.219, 1.140, 1.306] mlp=[1.493, 0.668, -0.596] + + transition: attn=[3.982, 1.249] mlp=[-0.960, 2.292] + + hierarchy: attn=[3.494, 5.939, 5.616] mlp=[3.530, -3.646, -5.180] + step 14801 (89%) loss:2.7283 lr:0.04 dt:43ms tok/s:1539751 rem:64s step 14802 (89%) loss:2.7173 lr:0.04 dt:43ms tok/s:1533763 rem:64s step 14803 (89%) loss:2.7437 lr:0.04 dt:43ms tok/s:1537529 rem:64s step 14804 (89%) loss:2.7573 lr:0.04 dt:43ms tok/s:1536927 rem:64s step 14805 (89%) loss:2.7481 lr:0.04 dt:43ms tok/s:1533053 rem:64s step 14806 (89%) loss:2.7465 lr:0.04 dt:43ms tok/s:1533994 rem:64s step 14807 (89%) loss:2.7414 lr:0.04 dt:43ms tok/s:1535193 rem:64s step 14808 (89%) loss:2.7231 lr:0.04 dt:43ms tok/s:1533557 rem:64s step 14809 (89%) loss:2.7246 lr:0.04 dt:43ms tok/s:1534833 rem:64s step 14810 (89%) loss:2.7736 lr:0.04 dt:43ms tok/s:1518662 rem:64s step 14811 (89%) loss:2.7555 lr:0.04 dt:43ms tok/s:1531140 rem:64s step 14812 (89%) loss:2.7735 lr:0.04 dt:43ms tok/s:1527354 rem:64s step 14813 (89%) loss:2.7437 lr:0.04 dt:43ms tok/s:1525049 rem:64s step 14814 (89%) loss:2.7451 lr:0.04 dt:43ms tok/s:1527532 rem:64s step 14815 (89%) loss:2.7264 lr:0.04 dt:43ms tok/s:1528467 rem:64s step 14816 (89%) loss:2.7254 lr:0.04 dt:43ms tok/s:1529385 rem:63s step 14817 (89%) loss:2.7496 lr:0.04 dt:43ms tok/s:1529326 rem:63s step 14818 (89%) loss:2.7549 lr:0.04 dt:43ms tok/s:1530441 rem:63s step 14819 (89%) loss:2.7347 lr:0.04 dt:43ms tok/s:1529326 rem:63s step 14820 (89%) loss:2.7337 lr:0.04 dt:43ms tok/s:1526870 rem:63s step 14821 (89%) loss:2.7192 lr:0.04 dt:43ms tok/s:1528050 rem:63s step 14822 (89%) loss:2.7213 lr:0.04 dt:43ms tok/s:1529947 rem:63s step 14823 (89%) loss:2.7292 lr:0.04 dt:43ms tok/s:1529266 rem:63s step 14824 (89%) loss:2.7267 lr:0.04 dt:43ms tok/s:1528739 rem:63s step 14825 (89%) loss:2.7413 lr:0.04 dt:43ms tok/s:1529181 rem:63s step 14826 (89%) loss:2.7653 lr:0.04 dt:46ms tok/s:1435184 rem:63s step 14827 (89%) loss:2.7541 lr:0.04 dt:43ms tok/s:1535940 rem:63s step 14828 (89%) loss:2.7403 lr:0.04 dt:43ms tok/s:1517941 rem:63s step 14829 (90%) loss:2.7261 lr:0.04 dt:43ms tok/s:1532395 rem:63s step 14830 (90%) loss:2.7565 lr:0.04 dt:43ms tok/s:1536214 rem:63s step 14831 (90%) loss:2.7446 lr:0.04 dt:43ms tok/s:1534139 rem:63s step 14832 (90%) loss:2.7570 lr:0.04 dt:43ms tok/s:1533121 rem:63s step 14833 (90%) loss:2.7600 lr:0.04 dt:43ms tok/s:1533053 rem:63s step 14834 (90%) loss:2.7468 lr:0.04 dt:43ms tok/s:1536043 rem:63s step 14835 (90%) loss:2.7110 lr:0.04 dt:43ms tok/s:1531097 rem:63s step 14836 (90%) loss:2.6829 lr:0.04 dt:43ms tok/s:1532967 rem:63s step 14837 (90%) loss:2.6734 lr:0.04 dt:43ms tok/s:1533438 rem:63s step 14838 (90%) loss:2.6832 lr:0.04 dt:43ms tok/s:1534071 rem:63s step 14839 (90%) loss:2.6714 lr:0.04 dt:43ms tok/s:1528909 rem:62s step 14840 (90%) loss:2.6723 lr:0.04 dt:43ms tok/s:1528764 rem:62s step 14841 (90%) loss:2.6784 lr:0.04 dt:43ms tok/s:1530611 rem:62s step 14842 (90%) loss:2.7016 lr:0.04 dt:43ms tok/s:1528543 rem:62s step 14843 (90%) loss:2.7163 lr:0.04 dt:43ms tok/s:1527524 rem:62s step 14844 (90%) loss:2.7079 lr:0.04 dt:43ms tok/s:1527592 rem:62s step 14845 (90%) loss:2.6946 lr:0.04 dt:43ms tok/s:1525075 rem:62s step 14846 (90%) loss:2.7106 lr:0.04 dt:43ms tok/s:1527821 rem:62s step 14847 (90%) loss:2.6954 lr:0.04 dt:43ms tok/s:1528220 rem:62s step 14848 (90%) loss:2.6887 lr:0.04 dt:43ms tok/s:1527405 rem:62s step 14849 (90%) loss:2.6830 lr:0.04 dt:43ms tok/s:1528127 rem:62s step 14850 (90%) loss:2.6781 lr:0.04 dt:43ms tok/s:1528994 rem:62s step 14851 (90%) loss:2.6791 lr:0.04 dt:43ms tok/s:1526862 rem:62s step 14852 (90%) loss:2.6448 lr:0.04 dt:43ms tok/s:1523426 rem:62s step 14853 (90%) loss:2.6393 lr:0.04 dt:43ms tok/s:1514971 rem:62s step 14854 (90%) loss:2.6412 lr:0.04 dt:43ms tok/s:1525616 rem:62s step 14855 (90%) loss:2.6489 lr:0.04 dt:43ms tok/s:1528186 rem:62s step 14856 (90%) loss:2.6393 lr:0.04 dt:43ms tok/s:1525726 rem:62s step 14857 (90%) loss:2.6515 lr:0.04 dt:43ms tok/s:1526099 rem:62s step 14858 (90%) loss:2.6516 lr:0.04 dt:43ms tok/s:1528866 rem:62s step 14859 (90%) loss:2.6642 lr:0.04 dt:43ms tok/s:1527609 rem:62s step 14860 (90%) loss:2.6708 lr:0.04 dt:43ms tok/s:1529053 rem:62s step 14861 (90%) loss:2.6852 lr:0.04 dt:43ms tok/s:1528203 rem:62s step 14862 (90%) loss:2.7159 lr:0.04 dt:43ms tok/s:1524026 rem:61s step 14863 (90%) loss:2.7276 lr:0.04 dt:43ms tok/s:1524161 rem:61s step 14864 (90%) loss:2.7333 lr:0.04 dt:43ms tok/s:1530143 rem:61s step 14865 (90%) loss:2.7208 lr:0.04 dt:43ms tok/s:1524068 rem:61s step 14866 (90%) loss:2.7143 lr:0.04 dt:43ms tok/s:1527252 rem:61s step 14867 (90%) loss:2.7123 lr:0.04 dt:43ms tok/s:1525955 rem:61s step 14868 (90%) loss:2.7266 lr:0.04 dt:43ms tok/s:1522895 rem:61s step 14869 (90%) loss:2.7458 lr:0.04 dt:43ms tok/s:1522169 rem:61s step 14870 (90%) loss:2.7632 lr:0.04 dt:43ms tok/s:1529164 rem:61s step 14871 (90%) loss:2.7517 lr:0.04 dt:43ms tok/s:1529896 rem:61s step 14872 (90%) loss:2.7441 lr:0.04 dt:43ms tok/s:1528994 rem:61s step 14873 (90%) loss:2.7293 lr:0.04 dt:43ms tok/s:1529470 rem:61s step 14874 (90%) loss:2.7147 lr:0.04 dt:43ms tok/s:1529053 rem:61s step 14875 (90%) loss:2.6997 lr:0.04 dt:43ms tok/s:1526769 rem:61s step 14876 (90%) loss:2.7019 lr:0.04 dt:43ms tok/s:1523578 rem:61s step 14877 (90%) loss:2.7240 lr:0.04 dt:43ms tok/s:1512787 rem:61s step 14878 (90%) loss:2.7211 lr:0.04 dt:43ms tok/s:1527583 rem:61s step 14879 (90%) loss:2.7138 lr:0.04 dt:43ms tok/s:1524778 rem:61s step 14880 (90%) loss:2.7202 lr:0.04 dt:43ms tok/s:1526014 rem:61s step 14881 (90%) loss:2.7234 lr:0.04 dt:43ms tok/s:1525913 rem:61s step 14882 (90%) loss:2.7018 lr:0.04 dt:43ms tok/s:1527473 rem:61s step 14883 (90%) loss:2.7435 lr:0.04 dt:43ms tok/s:1529113 rem:61s step 14884 (90%) loss:2.7991 lr:0.04 dt:43ms tok/s:1520637 rem:61s step 14885 (90%) loss:2.8541 lr:0.04 dt:43ms tok/s:1529257 rem:61s step 14886 (90%) loss:2.8371 lr:0.04 dt:43ms tok/s:1526328 rem:60s step 14887 (90%) loss:2.8001 lr:0.04 dt:43ms tok/s:1527965 rem:60s step 14888 (90%) loss:2.7548 lr:0.04 dt:43ms tok/s:1526277 rem:60s step 14889 (90%) loss:2.7000 lr:0.04 dt:43ms tok/s:1527210 rem:60s step 14890 (90%) loss:2.6989 lr:0.04 dt:43ms tok/s:1506783 rem:60s step 14891 (90%) loss:2.6895 lr:0.04 dt:43ms tok/s:1529155 rem:60s step 14892 (90%) loss:2.6899 lr:0.04 dt:43ms tok/s:1527982 rem:60s step 14893 (90%) loss:2.6836 lr:0.04 dt:43ms tok/s:1528535 rem:60s step 14894 (90%) loss:2.7137 lr:0.04 dt:43ms tok/s:1527711 rem:60s step 14895 (90%) loss:2.7168 lr:0.04 dt:43ms tok/s:1524060 rem:60s step 14896 (90%) loss:2.7085 lr:0.04 dt:43ms tok/s:1528849 rem:60s step 14897 (90%) loss:2.6928 lr:0.04 dt:43ms tok/s:1529436 rem:60s step 14898 (90%) loss:2.7029 lr:0.04 dt:43ms tok/s:1527549 rem:60s step 14899 (90%) loss:2.6872 lr:0.04 dt:43ms tok/s:1524871 rem:60s step 14900 (90%) loss:2.6632 lr:0.04 dt:43ms tok/s:1527855 rem:60s + local: attn=[0.216, 1.131, 1.310] mlp=[1.507, 0.676, -0.608] + + transition: attn=[3.985, 1.248] mlp=[-0.997, 2.333] + + hierarchy: attn=[3.490, 5.939, 5.616] mlp=[3.560, -3.725, -5.180] + step 14901 (90%) loss:2.6517 lr:0.04 dt:43ms tok/s:1530620 rem:60s step 14902 (90%) loss:2.6435 lr:0.04 dt:43ms tok/s:1528492 rem:60s step 14903 (90%) loss:2.6463 lr:0.04 dt:44ms tok/s:1503684 rem:60s step 14904 (90%) loss:2.6350 lr:0.04 dt:43ms tok/s:1526557 rem:60s step 14905 (90%) loss:2.6331 lr:0.04 dt:43ms tok/s:1528586 rem:60s step 14906 (90%) loss:2.6683 lr:0.04 dt:43ms tok/s:1517849 rem:60s step 14907 (90%) loss:2.6733 lr:0.04 dt:43ms tok/s:1533019 rem:60s step 14908 (90%) loss:2.6634 lr:0.04 dt:44ms tok/s:1484102 rem:60s step 14909 (90%) loss:2.6925 lr:0.04 dt:43ms tok/s:1528994 rem:59s step 14910 (90%) loss:2.6877 lr:0.04 dt:43ms tok/s:1519057 rem:59s step 14911 (90%) loss:2.6876 lr:0.04 dt:43ms tok/s:1516283 rem:59s step 14912 (90%) loss:2.6861 lr:0.04 dt:43ms tok/s:1518428 rem:59s step 14913 (90%) loss:2.6836 lr:0.04 dt:43ms tok/s:1513745 rem:59s step 14914 (90%) loss:2.6909 lr:0.04 dt:43ms tok/s:1513878 rem:59s step 14915 (90%) loss:2.6791 lr:0.04 dt:47ms tok/s:1384894 rem:59s step 14916 (90%) loss:2.6932 lr:0.04 dt:43ms tok/s:1527244 rem:59s step 14917 (90%) loss:2.6865 lr:0.04 dt:43ms tok/s:1520418 rem:59s step 14918 (90%) loss:2.6836 lr:0.04 dt:43ms tok/s:1524466 rem:59s step 14919 (90%) loss:2.6751 lr:0.04 dt:43ms tok/s:1520712 rem:59s step 14920 (90%) loss:2.6753 lr:0.04 dt:43ms tok/s:1529360 rem:59s step 14921 (90%) loss:2.6862 lr:0.04 dt:43ms tok/s:1516175 rem:59s step 14922 (90%) loss:2.6898 lr:0.04 dt:43ms tok/s:1511140 rem:59s step 14923 (90%) loss:2.6951 lr:0.04 dt:43ms tok/s:1514704 rem:59s step 14924 (90%) loss:2.6929 lr:0.04 dt:43ms tok/s:1511780 rem:59s step 14925 (90%) loss:2.6687 lr:0.04 dt:43ms tok/s:1518797 rem:59s step 14926 (90%) loss:2.6501 lr:0.04 dt:43ms tok/s:1518134 rem:59s step 14927 (90%) loss:2.6121 lr:0.04 dt:43ms tok/s:1520191 rem:59s step 14928 (90%) loss:2.5595 lr:0.04 dt:43ms tok/s:1515757 rem:59s step 14929 (90%) loss:2.5509 lr:0.04 dt:43ms tok/s:1516292 rem:59s step 14930 (90%) loss:2.5342 lr:0.04 dt:43ms tok/s:1521420 rem:59s step 14931 (90%) loss:2.5484 lr:0.04 dt:43ms tok/s:1516677 rem:59s step 14932 (90%) loss:2.5568 lr:0.04 dt:43ms tok/s:1516660 rem:58s step 14933 (90%) loss:2.5963 lr:0.04 dt:43ms tok/s:1519258 rem:58s step 14934 (90%) loss:2.6197 lr:0.04 dt:43ms tok/s:1516041 rem:58s step 14935 (90%) loss:2.6188 lr:0.04 dt:43ms tok/s:1515924 rem:58s step 14936 (90%) loss:2.6197 lr:0.04 dt:45ms tok/s:1464368 rem:58s step 14937 (90%) loss:2.6523 lr:0.04 dt:44ms tok/s:1497075 rem:58s step 14938 (90%) loss:2.6441 lr:0.04 dt:43ms tok/s:1516459 rem:58s step 14939 (90%) loss:2.6461 lr:0.04 dt:43ms tok/s:1523384 rem:58s step 14940 (90%) loss:2.6469 lr:0.04 dt:43ms tok/s:1523975 rem:58s step 14941 (90%) loss:2.6527 lr:0.04 dt:43ms tok/s:1518402 rem:58s step 14942 (90%) loss:2.6605 lr:0.04 dt:43ms tok/s:1524719 rem:58s step 14943 (90%) loss:2.6586 lr:0.04 dt:43ms tok/s:1518562 rem:58s step 14944 (90%) loss:2.6737 lr:0.04 dt:43ms tok/s:1520199 rem:58s step 14945 (90%) loss:2.6802 lr:0.04 dt:43ms tok/s:1521024 rem:58s step 14946 (90%) loss:2.6840 lr:0.04 dt:43ms tok/s:1521024 rem:58s step 14947 (90%) loss:2.6854 lr:0.04 dt:43ms tok/s:1519410 rem:58s step 14948 (90%) loss:2.6716 lr:0.04 dt:44ms tok/s:1482261 rem:58s step 14949 (90%) loss:2.6656 lr:0.04 dt:43ms tok/s:1521571 rem:58s step 14950 (90%) loss:2.6773 lr:0.04 dt:43ms tok/s:1535442 rem:58s step 14951 (90%) loss:2.6830 lr:0.04 dt:47ms tok/s:1406500 rem:58s step 14952 (90%) loss:2.6887 lr:0.04 dt:43ms tok/s:1539441 rem:58s step 14953 (90%) loss:2.6812 lr:0.04 dt:43ms tok/s:1529274 rem:58s step 14954 (90%) loss:2.6765 lr:0.04 dt:43ms tok/s:1527303 rem:58s step 14955 (90%) loss:2.6733 lr:0.04 dt:43ms tok/s:1528475 rem:57s step 14956 (90%) loss:2.6948 lr:0.03 dt:44ms tok/s:1490120 rem:57s step 14957 (90%) loss:2.7038 lr:0.03 dt:43ms tok/s:1527660 rem:57s step 14958 (90%) loss:2.7141 lr:0.03 dt:43ms tok/s:1537735 rem:57s step 14959 (90%) loss:2.7354 lr:0.03 dt:43ms tok/s:1536180 rem:57s step 14960 (90%) loss:2.7455 lr:0.03 dt:43ms tok/s:1539234 rem:57s step 14961 (90%) loss:2.7474 lr:0.03 dt:43ms tok/s:1534371 rem:57s step 14962 (90%) loss:2.7490 lr:0.03 dt:43ms tok/s:1532241 rem:57s step 14963 (90%) loss:2.7304 lr:0.03 dt:43ms tok/s:1537091 rem:57s step 14964 (90%) loss:2.7299 lr:0.03 dt:43ms tok/s:1527380 rem:57s step 14965 (90%) loss:2.7342 lr:0.03 dt:43ms tok/s:1529640 rem:57s step 14966 (90%) loss:2.7217 lr:0.03 dt:43ms tok/s:1525828 rem:57s step 14967 (90%) loss:2.6914 lr:0.03 dt:43ms tok/s:1526896 rem:57s step 14968 (91%) loss:2.7228 lr:0.03 dt:43ms tok/s:1515097 rem:57s step 14969 (91%) loss:2.7302 lr:0.03 dt:43ms tok/s:1529087 rem:57s step 14970 (91%) loss:2.7356 lr:0.03 dt:44ms tok/s:1497940 rem:57s step 14971 (91%) loss:2.7266 lr:0.03 dt:43ms tok/s:1514037 rem:57s step 14972 (91%) loss:2.7061 lr:0.03 dt:43ms tok/s:1524762 rem:57s step 14973 (91%) loss:2.6936 lr:0.03 dt:43ms tok/s:1522456 rem:57s step 14974 (91%) loss:2.7030 lr:0.03 dt:43ms tok/s:1521714 rem:57s step 14975 (91%) loss:2.7137 lr:0.03 dt:43ms tok/s:1519502 rem:57s step 14976 (91%) loss:2.7041 lr:0.03 dt:43ms tok/s:1520746 rem:57s step 14977 (91%) loss:2.7105 lr:0.03 dt:43ms tok/s:1521066 rem:57s step 14978 (91%) loss:2.7160 lr:0.03 dt:43ms tok/s:1521554 rem:56s step 14979 (91%) loss:2.7109 lr:0.03 dt:43ms tok/s:1520755 rem:56s step 14980 (91%) loss:2.6996 lr:0.03 dt:43ms tok/s:1518386 rem:56s step 14981 (91%) loss:2.7028 lr:0.03 dt:43ms tok/s:1521672 rem:56s step 14982 (91%) loss:2.7090 lr:0.03 dt:43ms tok/s:1522802 rem:56s step 14983 (91%) loss:2.7197 lr:0.03 dt:43ms tok/s:1520704 rem:56s step 14984 (91%) loss:2.7241 lr:0.03 dt:43ms tok/s:1522515 rem:56s step 14985 (91%) loss:2.7261 lr:0.03 dt:46ms tok/s:1426670 rem:56s step 14986 (91%) loss:2.7076 lr:0.03 dt:43ms tok/s:1526260 rem:56s step 14987 (91%) loss:2.7043 lr:0.03 dt:43ms tok/s:1527006 rem:56s step 14988 (91%) loss:2.6862 lr:0.03 dt:43ms tok/s:1525997 rem:56s step 14989 (91%) loss:2.6676 lr:0.03 dt:43ms tok/s:1525947 rem:56s step 14990 (91%) loss:2.6885 lr:0.03 dt:43ms tok/s:1509580 rem:56s step 14991 (91%) loss:2.7151 lr:0.03 dt:43ms tok/s:1528339 rem:56s step 14992 (91%) loss:2.7070 lr:0.03 dt:48ms tok/s:1378580 rem:56s step 14993 (91%) loss:2.7110 lr:0.03 dt:43ms tok/s:1538183 rem:56s step 14994 (91%) loss:2.7027 lr:0.03 dt:43ms tok/s:1525549 rem:56s step 14995 (91%) loss:2.6953 lr:0.03 dt:43ms tok/s:1518998 rem:56s step 14996 (91%) loss:2.6948 lr:0.03 dt:43ms tok/s:1519888 rem:56s step 14997 (91%) loss:2.6939 lr:0.03 dt:43ms tok/s:1521108 rem:56s step 14998 (91%) loss:2.7111 lr:0.03 dt:43ms tok/s:1525159 rem:56s step 14999 (91%) loss:2.7348 lr:0.03 dt:46ms tok/s:1431261 rem:56s step 15000 (91%) loss:2.7464 lr:0.03 dt:43ms tok/s:1533566 rem:56s + local: attn=[0.221, 1.145, 1.315] mlp=[1.512, 0.690, -0.622] + + transition: attn=[3.997, 1.241] mlp=[-1.017, 2.384] + + hierarchy: attn=[3.477, 5.939, 5.616] mlp=[3.598, -3.797, -5.180] + step 15001 (91%) loss:2.7513 lr:0.03 dt:42ms tok/s:1544822 rem:55s step 15002 (91%) loss:2.7213 lr:0.03 dt:43ms tok/s:1537701 rem:55s step 15003 (91%) loss:2.6958 lr:0.03 dt:43ms tok/s:1536206 rem:55s step 15004 (91%) loss:2.6669 lr:0.03 dt:43ms tok/s:1537976 rem:55s step 15005 (91%) loss:2.6676 lr:0.03 dt:43ms tok/s:1539432 rem:55s step 15006 (91%) loss:2.6872 lr:0.03 dt:43ms tok/s:1540536 rem:55s step 15007 (91%) loss:2.6991 lr:0.03 dt:43ms tok/s:1538467 rem:55s step 15008 (91%) loss:2.6789 lr:0.03 dt:43ms tok/s:1537632 rem:55s step 15009 (91%) loss:2.6792 lr:0.03 dt:43ms tok/s:1540424 rem:55s step 15010 (91%) loss:2.6707 lr:0.03 dt:43ms tok/s:1540709 rem:55s step 15011 (91%) loss:2.7276 lr:0.03 dt:43ms tok/s:1539087 rem:55s step 15012 (91%) loss:2.7535 lr:0.03 dt:43ms tok/s:1539501 rem:55s step 15013 (91%) loss:2.8081 lr:0.03 dt:43ms tok/s:1534893 rem:55s step 15014 (91%) loss:2.8075 lr:0.03 dt:43ms tok/s:1533994 rem:55s step 15015 (91%) loss:2.8020 lr:0.03 dt:43ms tok/s:1531993 rem:55s step 15016 (91%) loss:2.7798 lr:0.03 dt:43ms tok/s:1535133 rem:55s step 15017 (91%) loss:2.7668 lr:0.03 dt:43ms tok/s:1534816 rem:55s step 15018 (91%) loss:2.7665 lr:0.03 dt:43ms tok/s:1535185 rem:55s step 15019 (91%) loss:2.7553 lr:0.03 dt:43ms tok/s:1532942 rem:55s step 15020 (91%) loss:2.7528 lr:0.03 dt:43ms tok/s:1531652 rem:55s step 15021 (91%) loss:2.7400 lr:0.03 dt:43ms tok/s:1527371 rem:55s step 15022 (91%) loss:2.7387 lr:0.03 dt:43ms tok/s:1528433 rem:55s step 15023 (91%) loss:2.7339 lr:0.03 dt:43ms tok/s:1527643 rem:55s step 15024 (91%) loss:2.7238 lr:0.03 dt:43ms tok/s:1529436 rem:55s step 15025 (91%) loss:2.7089 lr:0.03 dt:43ms tok/s:1532600 rem:54s step 15026 (91%) loss:2.7026 lr:0.03 dt:43ms tok/s:1527524 rem:54s step 15027 (91%) loss:2.6976 lr:0.03 dt:43ms tok/s:1527872 rem:54s step 15028 (91%) loss:2.7051 lr:0.03 dt:43ms tok/s:1519032 rem:54s step 15029 (91%) loss:2.7004 lr:0.03 dt:43ms tok/s:1539699 rem:54s step 15030 (91%) loss:2.6822 lr:0.03 dt:43ms tok/s:1533908 rem:54s step 15031 (91%) loss:2.6876 lr:0.03 dt:43ms tok/s:1534773 rem:54s step 15032 (91%) loss:2.6842 lr:0.03 dt:43ms tok/s:1533729 rem:54s step 15033 (91%) loss:2.6698 lr:0.03 dt:43ms tok/s:1529717 rem:54s step 15034 (91%) loss:2.6852 lr:0.03 dt:43ms tok/s:1527999 rem:54s step 15035 (91%) loss:2.6948 lr:0.03 dt:43ms tok/s:1529036 rem:54s step 15036 (91%) loss:2.6823 lr:0.03 dt:43ms tok/s:1526811 rem:54s step 15037 (91%) loss:2.6927 lr:0.03 dt:43ms tok/s:1524677 rem:54s step 15038 (91%) loss:2.6889 lr:0.03 dt:43ms tok/s:1517279 rem:54s step 15039 (91%) loss:2.6877 lr:0.03 dt:43ms tok/s:1530100 rem:54s step 15040 (91%) loss:2.7264 lr:0.03 dt:43ms tok/s:1524491 rem:54s step 15041 (91%) loss:2.8100 lr:0.03 dt:43ms tok/s:1528917 rem:54s step 15042 (91%) loss:2.8024 lr:0.03 dt:47ms tok/s:1404796 rem:54s step 15043 (91%) loss:2.7842 lr:0.03 dt:43ms tok/s:1532959 rem:54s step 15044 (91%) loss:2.7767 lr:0.03 dt:43ms tok/s:1533797 rem:54s step 15045 (91%) loss:2.8165 lr:0.03 dt:43ms tok/s:1533378 rem:54s step 15046 (91%) loss:2.8579 lr:0.03 dt:43ms tok/s:1532156 rem:54s step 15047 (91%) loss:2.8345 lr:0.03 dt:43ms tok/s:1533181 rem:54s step 15048 (91%) loss:2.8451 lr:0.03 dt:43ms tok/s:1533207 rem:53s step 15049 (91%) loss:2.8293 lr:0.03 dt:43ms tok/s:1513662 rem:53s step 15050 (91%) loss:2.8182 lr:0.03 dt:43ms tok/s:1531456 rem:53s step 15051 (91%) loss:2.7966 lr:0.03 dt:43ms tok/s:1533096 rem:53s step 15052 (91%) loss:2.7887 lr:0.03 dt:43ms tok/s:1534808 rem:53s step 15053 (91%) loss:2.7713 lr:0.03 dt:43ms tok/s:1519695 rem:53s step 15054 (91%) loss:2.7915 lr:0.03 dt:43ms tok/s:1520755 rem:53s step 15055 (91%) loss:2.7759 lr:0.03 dt:43ms tok/s:1534782 rem:53s step 15056 (91%) loss:2.7649 lr:0.03 dt:43ms tok/s:1535768 rem:53s step 15057 (91%) loss:2.7321 lr:0.03 dt:43ms tok/s:1534217 rem:53s step 15058 (91%) loss:2.7173 lr:0.03 dt:43ms tok/s:1532600 rem:53s step 15059 (91%) loss:2.6961 lr:0.03 dt:43ms tok/s:1520334 rem:53s step 15060 (91%) loss:2.6911 lr:0.03 dt:43ms tok/s:1529283 rem:53s step 15061 (91%) loss:2.6916 lr:0.03 dt:43ms tok/s:1527167 rem:53s step 15062 (91%) loss:2.6816 lr:0.03 dt:43ms tok/s:1526718 rem:53s step 15063 (91%) loss:2.6789 lr:0.03 dt:43ms tok/s:1530910 rem:53s step 15064 (91%) loss:2.6659 lr:0.03 dt:43ms tok/s:1533515 rem:53s step 15065 (91%) loss:2.6741 lr:0.03 dt:43ms tok/s:1534508 rem:53s step 15066 (91%) loss:2.7217 lr:0.03 dt:43ms tok/s:1533917 rem:53s step 15067 (91%) loss:2.7228 lr:0.03 dt:43ms tok/s:1530228 rem:53s step 15068 (91%) loss:2.7072 lr:0.03 dt:43ms tok/s:1534568 rem:53s step 15069 (91%) loss:2.7022 lr:0.03 dt:43ms tok/s:1534028 rem:53s step 15070 (91%) loss:2.7109 lr:0.03 dt:43ms tok/s:1532386 rem:53s step 15071 (91%) loss:2.7010 lr:0.03 dt:43ms tok/s:1532634 rem:52s step 15072 (91%) loss:2.6714 lr:0.03 dt:43ms tok/s:1529564 rem:52s step 15073 (91%) loss:2.6558 lr:0.03 dt:43ms tok/s:1521672 rem:52s step 15074 (91%) loss:2.6656 lr:0.03 dt:43ms tok/s:1526319 rem:52s step 15075 (91%) loss:2.6503 lr:0.03 dt:43ms tok/s:1534105 rem:52s step 15076 (91%) loss:2.6424 lr:0.03 dt:43ms tok/s:1532446 rem:52s step 15077 (91%) loss:2.6434 lr:0.03 dt:43ms tok/s:1532523 rem:52s step 15078 (91%) loss:2.6504 lr:0.03 dt:48ms tok/s:1375738 rem:52s step 15079 (91%) loss:2.6424 lr:0.03 dt:42ms tok/s:1545526 rem:52s step 15080 (91%) loss:2.6485 lr:0.03 dt:42ms tok/s:1553430 rem:52s step 15081 (91%) loss:2.6336 lr:0.03 dt:42ms tok/s:1546865 rem:52s step 15082 (91%) loss:2.6202 lr:0.03 dt:42ms tok/s:1549140 rem:52s step 15083 (91%) loss:2.6167 lr:0.03 dt:47ms tok/s:1407897 rem:52s step 15084 (91%) loss:2.6112 lr:0.03 dt:42ms tok/s:1548503 rem:52s step 15085 (91%) loss:2.5955 lr:0.03 dt:42ms tok/s:1543027 rem:52s step 15086 (91%) loss:2.6437 lr:0.03 dt:43ms tok/s:1540226 rem:52s step 15087 (91%) loss:2.6421 lr:0.03 dt:43ms tok/s:1513262 rem:52s step 15088 (91%) loss:2.6362 lr:0.03 dt:43ms tok/s:1531413 rem:52s step 15089 (91%) loss:2.6375 lr:0.03 dt:43ms tok/s:1530058 rem:52s step 15090 (91%) loss:2.6352 lr:0.03 dt:43ms tok/s:1531456 rem:52s step 15091 (91%) loss:2.6355 lr:0.03 dt:43ms tok/s:1534756 rem:52s step 15092 (91%) loss:2.6415 lr:0.03 dt:43ms tok/s:1533532 rem:52s step 15093 (91%) loss:2.6574 lr:0.03 dt:43ms tok/s:1530611 rem:52s step 15094 (91%) loss:2.6500 lr:0.03 dt:43ms tok/s:1533002 rem:52s step 15095 (91%) loss:2.6543 lr:0.03 dt:43ms tok/s:1532617 rem:51s step 15096 (91%) loss:2.6584 lr:0.03 dt:43ms tok/s:1533917 rem:51s step 15097 (91%) loss:2.6518 lr:0.03 dt:43ms tok/s:1534225 rem:51s step 15098 (91%) loss:2.6416 lr:0.03 dt:43ms tok/s:1530952 rem:51s step 15099 (91%) loss:2.6281 lr:0.03 dt:43ms tok/s:1509431 rem:51s step 15100 (91%) loss:2.6285 lr:0.03 dt:43ms tok/s:1526803 rem:51s + local: attn=[0.224, 1.146, 1.318] mlp=[1.521, 0.698, -0.627] + + transition: attn=[4.004, 1.237] mlp=[-1.040, 2.413] + + hierarchy: attn=[3.464, 5.939, 5.616] mlp=[3.624, -3.856, -5.180] + step 15101 (91%) loss:2.6333 lr:0.03 dt:49ms tok/s:1336026 rem:51s step 15102 (91%) loss:2.6180 lr:0.03 dt:43ms tok/s:1536386 rem:51s step 15103 (91%) loss:2.6194 lr:0.03 dt:43ms tok/s:1525354 rem:51s step 15104 (91%) loss:2.6210 lr:0.03 dt:43ms tok/s:1527626 rem:51s step 15105 (91%) loss:2.6087 lr:0.03 dt:50ms tok/s:1322819 rem:51s step 15106 (91%) loss:2.6420 lr:0.03 dt:43ms tok/s:1540562 rem:51s step 15107 (92%) loss:2.6045 lr:0.03 dt:42ms tok/s:1546865 rem:51s step 15108 (92%) loss:2.5820 lr:0.03 dt:42ms tok/s:1550818 rem:51s step 15109 (92%) loss:2.6051 lr:0.03 dt:42ms tok/s:1551230 rem:51s step 15110 (92%) loss:2.6240 lr:0.03 dt:42ms tok/s:1554414 rem:51s step 15111 (92%) loss:2.6329 lr:0.03 dt:43ms tok/s:1540649 rem:51s step 15112 (92%) loss:2.6383 lr:0.03 dt:42ms tok/s:1546665 rem:51s step 15113 (92%) loss:2.6240 lr:0.03 dt:42ms tok/s:1548120 rem:51s step 15114 (92%) loss:2.6106 lr:0.03 dt:42ms tok/s:1543460 rem:51s step 15115 (92%) loss:2.6077 lr:0.03 dt:42ms tok/s:1543339 rem:51s step 15116 (92%) loss:2.5933 lr:0.03 dt:43ms tok/s:1535983 rem:51s step 15117 (92%) loss:2.5979 lr:0.03 dt:43ms tok/s:1534842 rem:51s step 15118 (92%) loss:2.6012 lr:0.03 dt:43ms tok/s:1539225 rem:50s step 15119 (92%) loss:2.5898 lr:0.03 dt:43ms tok/s:1540951 rem:50s step 15120 (92%) loss:2.6026 lr:0.03 dt:43ms tok/s:1537254 rem:50s step 15121 (92%) loss:2.6109 lr:0.03 dt:44ms tok/s:1495055 rem:50s step 15122 (92%) loss:2.6222 lr:0.03 dt:43ms tok/s:1528858 rem:50s step 15123 (92%) loss:2.6169 lr:0.03 dt:44ms tok/s:1479469 rem:50s step 15124 (92%) loss:2.6156 lr:0.03 dt:43ms tok/s:1534731 rem:50s step 15125 (92%) loss:2.6221 lr:0.03 dt:43ms tok/s:1536309 rem:50s step 15126 (92%) loss:2.6306 lr:0.03 dt:43ms tok/s:1536678 rem:50s step 15127 (92%) loss:2.6351 lr:0.03 dt:43ms tok/s:1529402 rem:50s step 15128 (92%) loss:2.6375 lr:0.03 dt:43ms tok/s:1538743 rem:50s step 15129 (92%) loss:2.6412 lr:0.03 dt:42ms tok/s:1542404 rem:50s step 15130 (92%) loss:2.6507 lr:0.03 dt:43ms tok/s:1535862 rem:50s step 15131 (92%) loss:2.6560 lr:0.03 dt:43ms tok/s:1539217 rem:50s step 15132 (92%) loss:2.6508 lr:0.03 dt:42ms tok/s:1543053 rem:50s step 15133 (92%) loss:2.6615 lr:0.03 dt:43ms tok/s:1539001 rem:50s step 15134 (92%) loss:2.6694 lr:0.03 dt:43ms tok/s:1542023 rem:50s step 15135 (92%) loss:2.6533 lr:0.03 dt:43ms tok/s:1536481 rem:50s step 15136 (92%) loss:2.6378 lr:0.03 dt:43ms tok/s:1537994 rem:50s step 15137 (92%) loss:2.6528 lr:0.03 dt:43ms tok/s:1537796 rem:50s step 15138 (92%) loss:2.6544 lr:0.03 dt:43ms tok/s:1535425 rem:50s step 15139 (92%) loss:2.6639 lr:0.03 dt:43ms tok/s:1537409 rem:50s step 15140 (92%) loss:2.6944 lr:0.03 dt:43ms tok/s:1538200 rem:50s step 15141 (92%) loss:2.7156 lr:0.03 dt:43ms tok/s:1533660 rem:49s step 15142 (92%) loss:2.7082 lr:0.03 dt:43ms tok/s:1532497 rem:49s step 15143 (92%) loss:2.7077 lr:0.03 dt:43ms tok/s:1530492 rem:49s step 15144 (92%) loss:2.7111 lr:0.03 dt:43ms tok/s:1532882 rem:49s step 15145 (92%) loss:2.7165 lr:0.03 dt:43ms tok/s:1522532 rem:49s step 15146 (92%) loss:2.7210 lr:0.03 dt:43ms tok/s:1521942 rem:49s step 15147 (92%) loss:2.7193 lr:0.03 dt:43ms tok/s:1522136 rem:49s step 15148 (92%) loss:2.6929 lr:0.03 dt:43ms tok/s:1523452 rem:49s step 15149 (92%) loss:2.6484 lr:0.03 dt:43ms tok/s:1521723 rem:49s step 15150 (92%) loss:2.6361 lr:0.03 dt:43ms tok/s:1523038 rem:49s step 15151 (92%) loss:2.6394 lr:0.03 dt:43ms tok/s:1520452 rem:49s step 15152 (92%) loss:2.6429 lr:0.03 dt:43ms tok/s:1521445 rem:49s step 15153 (92%) loss:2.6514 lr:0.03 dt:43ms tok/s:1519494 rem:49s step 15154 (92%) loss:2.6487 lr:0.03 dt:43ms tok/s:1520485 rem:49s step 15155 (92%) loss:2.6297 lr:0.03 dt:43ms tok/s:1523046 rem:49s step 15156 (92%) loss:2.6153 lr:0.03 dt:43ms tok/s:1523055 rem:49s step 15157 (92%) loss:2.5772 lr:0.03 dt:43ms tok/s:1523317 rem:49s step 15158 (92%) loss:2.5540 lr:0.03 dt:43ms tok/s:1523789 rem:49s step 15159 (92%) loss:2.5678 lr:0.03 dt:43ms tok/s:1523046 rem:49s step 15160 (92%) loss:2.5850 lr:0.03 dt:43ms tok/s:1521184 rem:49s step 15161 (92%) loss:2.5980 lr:0.03 dt:43ms tok/s:1522692 rem:49s step 15162 (92%) loss:2.5961 lr:0.03 dt:43ms tok/s:1521201 rem:49s step 15163 (92%) loss:2.5970 lr:0.03 dt:43ms tok/s:1520999 rem:49s step 15164 (92%) loss:2.6055 lr:0.03 dt:43ms tok/s:1523401 rem:48s step 15165 (92%) loss:2.6171 lr:0.02 dt:43ms tok/s:1525447 rem:48s step 15166 (92%) loss:2.6228 lr:0.02 dt:43ms tok/s:1523266 rem:48s step 15167 (92%) loss:2.6321 lr:0.02 dt:43ms tok/s:1519855 rem:48s step 15168 (92%) loss:2.6318 lr:0.02 dt:43ms tok/s:1523274 rem:48s step 15169 (92%) loss:2.6276 lr:0.02 dt:43ms tok/s:1526353 rem:48s step 15170 (92%) loss:2.6301 lr:0.02 dt:43ms tok/s:1522566 rem:48s step 15171 (92%) loss:2.6399 lr:0.02 dt:43ms tok/s:1521201 rem:48s step 15172 (92%) loss:2.6470 lr:0.02 dt:43ms tok/s:1526048 rem:48s step 15173 (92%) loss:2.6459 lr:0.02 dt:43ms tok/s:1522085 rem:48s step 15174 (92%) loss:2.6400 lr:0.02 dt:43ms tok/s:1522582 rem:48s step 15175 (92%) loss:2.6610 lr:0.02 dt:43ms tok/s:1522768 rem:48s step 15176 (92%) loss:2.6675 lr:0.02 dt:43ms tok/s:1524956 rem:48s step 15177 (92%) loss:2.6707 lr:0.02 dt:43ms tok/s:1521159 rem:48s step 15178 (92%) loss:2.6632 lr:0.02 dt:47ms tok/s:1380998 rem:48s step 15179 (92%) loss:2.7002 lr:0.02 dt:43ms tok/s:1529045 rem:48s step 15180 (92%) loss:2.7084 lr:0.02 dt:44ms tok/s:1493358 rem:48s step 15181 (92%) loss:2.7107 lr:0.02 dt:43ms tok/s:1531686 rem:48s step 15182 (92%) loss:2.6983 lr:0.02 dt:43ms tok/s:1533087 rem:48s step 15183 (92%) loss:2.7080 lr:0.02 dt:43ms tok/s:1521226 rem:48s step 15184 (92%) loss:2.7046 lr:0.02 dt:43ms tok/s:1521824 rem:48s step 15185 (92%) loss:2.7002 lr:0.02 dt:43ms tok/s:1518805 rem:48s step 15186 (92%) loss:2.6940 lr:0.02 dt:43ms tok/s:1525599 rem:48s step 15187 (92%) loss:2.6932 lr:0.02 dt:43ms tok/s:1510526 rem:48s step 15188 (92%) loss:2.6901 lr:0.02 dt:43ms tok/s:1525354 rem:47s step 15189 (92%) loss:2.6930 lr:0.02 dt:43ms tok/s:1522186 rem:47s step 15190 (92%) loss:2.6727 lr:0.02 dt:43ms tok/s:1527838 rem:47s step 15191 (92%) loss:2.6659 lr:0.02 dt:43ms tok/s:1523004 rem:47s step 15192 (92%) loss:2.6621 lr:0.02 dt:43ms tok/s:1523519 rem:47s step 15193 (92%) loss:2.6524 lr:0.02 dt:43ms tok/s:1521142 rem:47s step 15194 (92%) loss:2.6245 lr:0.02 dt:43ms tok/s:1521799 rem:47s step 15195 (92%) loss:2.6384 lr:0.02 dt:43ms tok/s:1519762 rem:47s step 15196 (92%) loss:2.6192 lr:0.02 dt:43ms tok/s:1520788 rem:47s step 15197 (92%) loss:2.6182 lr:0.02 dt:44ms tok/s:1505100 rem:47s step 15198 (92%) loss:2.6143 lr:0.02 dt:43ms tok/s:1507932 rem:47s step 15199 (92%) loss:2.6332 lr:0.02 dt:43ms tok/s:1522490 rem:47s step 15200 (92%) loss:2.6314 lr:0.02 dt:43ms tok/s:1521815 rem:47s + local: attn=[0.225, 1.145, 1.322] mlp=[1.525, 0.705, -0.626] + + transition: attn=[4.015, 1.224] mlp=[-1.050, 2.460] + + hierarchy: attn=[3.461, 5.939, 5.616] mlp=[3.655, -3.919, -5.180] + step 15201 (92%) loss:2.6507 lr:0.02 dt:43ms tok/s:1522034 rem:47s step 15202 (92%) loss:2.6797 lr:0.02 dt:43ms tok/s:1523967 rem:47s step 15203 (92%) loss:2.6835 lr:0.02 dt:43ms tok/s:1522954 rem:47s step 15204 (92%) loss:2.6848 lr:0.02 dt:43ms tok/s:1520166 rem:47s step 15205 (92%) loss:2.6761 lr:0.02 dt:43ms tok/s:1523773 rem:47s step 15206 (92%) loss:2.6769 lr:0.02 dt:43ms tok/s:1523950 rem:47s step 15207 (92%) loss:2.6790 lr:0.02 dt:43ms tok/s:1521352 rem:47s step 15208 (92%) loss:2.6818 lr:0.02 dt:43ms tok/s:1518755 rem:47s step 15209 (92%) loss:2.6826 lr:0.02 dt:43ms tok/s:1518654 rem:47s step 15210 (92%) loss:2.6705 lr:0.02 dt:43ms tok/s:1522077 rem:47s step 15211 (92%) loss:2.6663 lr:0.02 dt:43ms tok/s:1519452 rem:46s step 15212 (92%) loss:2.6771 lr:0.02 dt:43ms tok/s:1516953 rem:46s step 15213 (92%) loss:2.6764 lr:0.02 dt:43ms tok/s:1527931 rem:46s step 15214 (92%) loss:2.6739 lr:0.02 dt:43ms tok/s:1525049 rem:46s step 15215 (92%) loss:2.6519 lr:0.02 dt:43ms tok/s:1523671 rem:46s step 15216 (92%) loss:2.6451 lr:0.02 dt:43ms tok/s:1521975 rem:46s step 15217 (92%) loss:2.6408 lr:0.02 dt:43ms tok/s:1525168 rem:46s step 15218 (92%) loss:2.6519 lr:0.02 dt:43ms tok/s:1526328 rem:46s step 15219 (92%) loss:2.6433 lr:0.02 dt:44ms tok/s:1505817 rem:46s step 15220 (92%) loss:2.6447 lr:0.02 dt:43ms tok/s:1522667 rem:46s step 15221 (92%) loss:2.6638 lr:0.02 dt:43ms tok/s:1520569 rem:46s step 15222 (92%) loss:2.6568 lr:0.02 dt:43ms tok/s:1520527 rem:46s step 15223 (92%) loss:2.6533 lr:0.02 dt:43ms tok/s:1518713 rem:46s step 15224 (92%) loss:2.6397 lr:0.02 dt:43ms tok/s:1522532 rem:46s step 15225 (92%) loss:2.6463 lr:0.02 dt:43ms tok/s:1518335 rem:46s step 15226 (92%) loss:2.6597 lr:0.02 dt:43ms tok/s:1522819 rem:46s step 15227 (92%) loss:2.6811 lr:0.02 dt:43ms tok/s:1519410 rem:46s step 15228 (92%) loss:2.6830 lr:0.02 dt:43ms tok/s:1523283 rem:46s step 15229 (92%) loss:2.6772 lr:0.02 dt:43ms tok/s:1527363 rem:46s step 15230 (92%) loss:2.6847 lr:0.02 dt:43ms tok/s:1521773 rem:46s step 15231 (92%) loss:2.6830 lr:0.02 dt:43ms tok/s:1519477 rem:46s step 15232 (92%) loss:2.6740 lr:0.02 dt:43ms tok/s:1521041 rem:46s step 15233 (92%) loss:2.6718 lr:0.02 dt:43ms tok/s:1521908 rem:46s step 15234 (92%) loss:2.6740 lr:0.02 dt:43ms tok/s:1525320 rem:45s step 15235 (92%) loss:2.6806 lr:0.02 dt:43ms tok/s:1521917 rem:45s step 15236 (92%) loss:2.6858 lr:0.02 dt:43ms tok/s:1523097 rem:45s step 15237 (92%) loss:2.6770 lr:0.02 dt:43ms tok/s:1522751 rem:45s step 15238 (92%) loss:2.6650 lr:0.02 dt:43ms tok/s:1525930 rem:45s step 15239 (92%) loss:2.6664 lr:0.02 dt:43ms tok/s:1523713 rem:45s step 15240 (92%) loss:2.6660 lr:0.02 dt:43ms tok/s:1520410 rem:45s step 15241 (92%) loss:2.6666 lr:0.02 dt:43ms tok/s:1518662 rem:45s step 15242 (92%) loss:2.6658 lr:0.02 dt:43ms tok/s:1525803 rem:45s step 15243 (92%) loss:2.6649 lr:0.02 dt:43ms tok/s:1520957 rem:45s step 15244 (92%) loss:2.6699 lr:0.02 dt:43ms tok/s:1516618 rem:45s step 15245 (92%) loss:2.6958 lr:0.02 dt:43ms tok/s:1520300 rem:45s step 15246 (92%) loss:2.7070 lr:0.02 dt:43ms tok/s:1519981 rem:45s step 15247 (93%) loss:2.6963 lr:0.02 dt:43ms tok/s:1521335 rem:45s step 15248 (93%) loss:2.6998 lr:0.02 dt:43ms tok/s:1522211 rem:45s step 15249 (93%) loss:2.6956 lr:0.02 dt:43ms tok/s:1525803 rem:45s step 15250 (93%) loss:2.7045 lr:0.02 dt:43ms tok/s:1522161 rem:45s step 15251 (93%) loss:2.6963 lr:0.02 dt:43ms tok/s:1521049 rem:45s step 15252 (93%) loss:2.7045 lr:0.02 dt:43ms tok/s:1520889 rem:45s step 15253 (93%) loss:2.7182 lr:0.02 dt:43ms tok/s:1525269 rem:45s step 15254 (93%) loss:2.7374 lr:0.02 dt:43ms tok/s:1522937 rem:45s step 15255 (93%) loss:2.7426 lr:0.02 dt:43ms tok/s:1520452 rem:45s step 15256 (93%) loss:2.7361 lr:0.02 dt:43ms tok/s:1524677 rem:45s step 15257 (93%) loss:2.7232 lr:0.02 dt:43ms tok/s:1523815 rem:44s step 15258 (93%) loss:2.7168 lr:0.02 dt:43ms tok/s:1522085 rem:44s step 15259 (93%) loss:2.6940 lr:0.02 dt:43ms tok/s:1523435 rem:44s step 15260 (93%) loss:2.6918 lr:0.02 dt:43ms tok/s:1524263 rem:44s step 15261 (93%) loss:2.6854 lr:0.02 dt:43ms tok/s:1522684 rem:44s step 15262 (93%) loss:2.6992 lr:0.02 dt:43ms tok/s:1522372 rem:44s step 15263 (93%) loss:2.7031 lr:0.02 dt:43ms tok/s:1518746 rem:44s step 15264 (93%) loss:2.6870 lr:0.02 dt:43ms tok/s:1522599 rem:44s step 15265 (93%) loss:2.6767 lr:0.02 dt:43ms tok/s:1522296 rem:44s step 15266 (93%) loss:2.6575 lr:0.02 dt:43ms tok/s:1519998 rem:44s step 15267 (93%) loss:2.6538 lr:0.02 dt:43ms tok/s:1521234 rem:44s step 15268 (93%) loss:2.6474 lr:0.02 dt:43ms tok/s:1520906 rem:44s step 15269 (93%) loss:2.6623 lr:0.02 dt:43ms tok/s:1520359 rem:44s step 15270 (93%) loss:2.6635 lr:0.02 dt:43ms tok/s:1511780 rem:44s step 15271 (93%) loss:2.6941 lr:0.02 dt:43ms tok/s:1508172 rem:44s step 15272 (93%) loss:2.7202 lr:0.02 dt:43ms tok/s:1523274 rem:44s step 15273 (93%) loss:2.7081 lr:0.02 dt:43ms tok/s:1519762 rem:44s step 15274 (93%) loss:2.7074 lr:0.02 dt:43ms tok/s:1523671 rem:44s step 15275 (93%) loss:2.6980 lr:0.02 dt:48ms tok/s:1354799 rem:44s step 15276 (93%) loss:2.7093 lr:0.02 dt:43ms tok/s:1532224 rem:44s step 15277 (93%) loss:2.7254 lr:0.02 dt:43ms tok/s:1526879 rem:44s step 15278 (93%) loss:2.7369 lr:0.02 dt:43ms tok/s:1515447 rem:44s step 15279 (93%) loss:2.7243 lr:0.02 dt:43ms tok/s:1524482 rem:44s step 15280 (93%) loss:2.7067 lr:0.02 dt:43ms tok/s:1527498 rem:43s step 15281 (93%) loss:2.6942 lr:0.02 dt:43ms tok/s:1523984 rem:43s step 15282 (93%) loss:2.7141 lr:0.02 dt:43ms tok/s:1518897 rem:43s step 15283 (93%) loss:2.7109 lr:0.02 dt:43ms tok/s:1524280 rem:43s step 15284 (93%) loss:2.6905 lr:0.02 dt:43ms tok/s:1521638 rem:43s step 15285 (93%) loss:2.6900 lr:0.02 dt:43ms tok/s:1515915 rem:43s step 15286 (93%) loss:2.6931 lr:0.02 dt:43ms tok/s:1516133 rem:43s step 15287 (93%) loss:2.7273 lr:0.02 dt:44ms tok/s:1483549 rem:43s step 15288 (93%) loss:2.7544 lr:0.02 dt:43ms tok/s:1516492 rem:43s step 15289 (93%) loss:2.7641 lr:0.02 dt:43ms tok/s:1516225 rem:43s step 15290 (93%) loss:2.7605 lr:0.02 dt:43ms tok/s:1519183 rem:43s step 15291 (93%) loss:2.7424 lr:0.02 dt:43ms tok/s:1517765 rem:43s step 15292 (93%) loss:2.7338 lr:0.02 dt:44ms tok/s:1496504 rem:43s step 15293 (93%) loss:2.7109 lr:0.02 dt:43ms tok/s:1515798 rem:43s step 15294 (93%) loss:2.7107 lr:0.02 dt:43ms tok/s:1516066 rem:43s step 15295 (93%) loss:2.6970 lr:0.02 dt:43ms tok/s:1518671 rem:43s step 15296 (93%) loss:2.7020 lr:0.02 dt:43ms tok/s:1515899 rem:43s step 15297 (93%) loss:2.7053 lr:0.02 dt:43ms tok/s:1515498 rem:43s step 15298 (93%) loss:2.7070 lr:0.02 dt:43ms tok/s:1520141 rem:43s step 15299 (93%) loss:2.7106 lr:0.02 dt:43ms tok/s:1514387 rem:43s step 15300 (93%) loss:2.6978 lr:0.02 dt:43ms tok/s:1516727 rem:43s + local: attn=[0.233, 1.147, 1.330] mlp=[1.530, 0.709, -0.638] + + transition: attn=[4.019, 1.217] mlp=[-1.068, 2.500] + + hierarchy: attn=[3.445, 5.939, 5.616] mlp=[3.680, -3.974, -5.180] + step 15301 (93%) loss:2.7028 lr:0.02 dt:43ms tok/s:1515213 rem:43s step 15302 (93%) loss:2.6888 lr:0.02 dt:43ms tok/s:1513045 rem:43s step 15303 (93%) loss:2.6860 lr:0.02 dt:43ms tok/s:1517078 rem:42s step 15304 (93%) loss:2.6834 lr:0.02 dt:43ms tok/s:1519065 rem:42s step 15305 (93%) loss:2.7006 lr:0.02 dt:43ms tok/s:1519796 rem:42s step 15306 (93%) loss:2.7088 lr:0.02 dt:43ms tok/s:1518847 rem:42s step 15307 (93%) loss:2.6987 lr:0.02 dt:43ms tok/s:1514788 rem:42s step 15308 (93%) loss:2.6984 lr:0.02 dt:43ms tok/s:1513137 rem:42s step 15309 (93%) loss:2.6799 lr:0.02 dt:43ms tok/s:1519645 rem:42s step 15310 (93%) loss:2.6680 lr:0.02 dt:43ms tok/s:1516894 rem:42s step 15311 (93%) loss:2.6638 lr:0.02 dt:43ms tok/s:1511905 rem:42s step 15312 (93%) loss:2.6669 lr:0.02 dt:43ms tok/s:1518369 rem:42s step 15313 (93%) loss:2.6557 lr:0.02 dt:43ms tok/s:1515406 rem:42s step 15314 (93%) loss:2.6623 lr:0.02 dt:43ms tok/s:1517514 rem:42s step 15315 (93%) loss:2.6523 lr:0.02 dt:43ms tok/s:1512287 rem:42s step 15316 (93%) loss:2.6524 lr:0.02 dt:43ms tok/s:1515991 rem:42s step 15317 (93%) loss:2.6745 lr:0.02 dt:43ms tok/s:1515815 rem:42s step 15318 (93%) loss:2.7017 lr:0.02 dt:43ms tok/s:1511323 rem:42s step 15319 (93%) loss:2.7291 lr:0.02 dt:43ms tok/s:1518310 rem:42s step 15320 (93%) loss:2.7159 lr:0.02 dt:43ms tok/s:1515581 rem:42s step 15321 (93%) loss:2.7026 lr:0.02 dt:43ms tok/s:1514621 rem:42s step 15322 (93%) loss:2.7144 lr:0.02 dt:43ms tok/s:1516150 rem:42s step 15323 (93%) loss:2.6920 lr:0.02 dt:43ms tok/s:1520856 rem:42s step 15324 (93%) loss:2.6927 lr:0.02 dt:43ms tok/s:1515539 rem:42s step 15325 (93%) loss:2.7085 lr:0.02 dt:43ms tok/s:1512845 rem:42s step 15326 (93%) loss:2.7006 lr:0.02 dt:43ms tok/s:1517623 rem:42s step 15327 (93%) loss:2.6714 lr:0.02 dt:43ms tok/s:1514370 rem:41s step 15328 (93%) loss:2.6689 lr:0.02 dt:43ms tok/s:1513012 rem:41s step 15329 (93%) loss:2.6791 lr:0.02 dt:43ms tok/s:1516836 rem:41s step 15330 (93%) loss:2.6760 lr:0.02 dt:43ms tok/s:1515305 rem:41s step 15331 (93%) loss:2.6896 lr:0.02 dt:43ms tok/s:1520090 rem:41s step 15332 (93%) loss:2.6901 lr:0.02 dt:43ms tok/s:1517790 rem:41s step 15333 (93%) loss:2.6920 lr:0.02 dt:43ms tok/s:1509638 rem:41s step 15334 (93%) loss:2.6868 lr:0.02 dt:43ms tok/s:1514779 rem:41s step 15335 (93%) loss:2.6979 lr:0.02 dt:43ms tok/s:1516342 rem:41s step 15336 (93%) loss:2.6936 lr:0.02 dt:44ms tok/s:1497434 rem:41s step 15337 (93%) loss:2.6964 lr:0.02 dt:43ms tok/s:1520645 rem:41s step 15338 (93%) loss:2.6983 lr:0.02 dt:43ms tok/s:1521260 rem:41s step 15339 (93%) loss:2.7059 lr:0.02 dt:43ms tok/s:1521613 rem:41s step 15340 (93%) loss:2.6778 lr:0.02 dt:43ms tok/s:1521714 rem:41s step 15341 (93%) loss:2.6735 lr:0.02 dt:43ms tok/s:1506948 rem:41s step 15342 (93%) loss:2.6791 lr:0.02 dt:43ms tok/s:1524973 rem:41s step 15343 (93%) loss:2.6573 lr:0.02 dt:43ms tok/s:1517916 rem:41s step 15344 (93%) loss:2.6665 lr:0.02 dt:43ms tok/s:1522110 rem:41s step 15345 (93%) loss:2.6696 lr:0.02 dt:43ms tok/s:1520384 rem:41s step 15346 (93%) loss:2.6602 lr:0.02 dt:43ms tok/s:1520729 rem:41s step 15347 (93%) loss:2.6673 lr:0.02 dt:43ms tok/s:1519611 rem:41s step 15348 (93%) loss:2.6606 lr:0.02 dt:43ms tok/s:1523004 rem:41s step 15349 (93%) loss:2.6438 lr:0.02 dt:43ms tok/s:1527329 rem:41s step 15350 (93%) loss:2.6365 lr:0.02 dt:43ms tok/s:1527184 rem:40s step 15351 (93%) loss:2.6095 lr:0.02 dt:43ms tok/s:1522633 rem:40s step 15352 (93%) loss:2.5839 lr:0.02 dt:43ms tok/s:1521975 rem:40s step 15353 (93%) loss:2.5771 lr:0.02 dt:43ms tok/s:1522658 rem:40s step 15354 (93%) loss:2.5887 lr:0.02 dt:43ms tok/s:1523215 rem:40s step 15355 (93%) loss:2.5905 lr:0.02 dt:43ms tok/s:1525108 rem:40s step 15356 (93%) loss:2.5881 lr:0.02 dt:43ms tok/s:1522768 rem:40s step 15357 (93%) loss:2.5951 lr:0.02 dt:43ms tok/s:1520132 rem:40s step 15358 (93%) loss:2.6007 lr:0.02 dt:44ms tok/s:1490912 rem:40s step 15359 (93%) loss:2.6106 lr:0.02 dt:46ms tok/s:1435574 rem:40s step 15360 (93%) loss:2.6234 lr:0.02 dt:43ms tok/s:1540657 rem:40s step 15361 (93%) loss:2.6284 lr:0.02 dt:43ms tok/s:1535108 rem:40s step 15362 (93%) loss:2.6231 lr:0.02 dt:43ms tok/s:1529623 rem:40s step 15363 (93%) loss:2.6223 lr:0.02 dt:43ms tok/s:1528628 rem:40s step 15364 (93%) loss:2.6467 lr:0.02 dt:43ms tok/s:1530816 rem:40s step 15365 (93%) loss:2.6550 lr:0.02 dt:43ms tok/s:1514971 rem:40s step 15366 (93%) loss:2.6555 lr:0.02 dt:43ms tok/s:1511606 rem:40s step 15367 (93%) loss:2.6573 lr:0.02 dt:43ms tok/s:1527838 rem:40s step 15368 (93%) loss:2.6566 lr:0.02 dt:44ms tok/s:1482285 rem:40s step 15369 (93%) loss:2.6741 lr:0.02 dt:44ms tok/s:1502262 rem:40s step 15370 (93%) loss:2.6660 lr:0.02 dt:43ms tok/s:1526006 rem:40s step 15371 (93%) loss:2.6497 lr:0.02 dt:43ms tok/s:1526201 rem:40s step 15372 (93%) loss:2.6425 lr:0.02 dt:43ms tok/s:1522077 rem:40s step 15373 (93%) loss:2.6434 lr:0.02 dt:44ms tok/s:1492434 rem:39s step 15374 (93%) loss:2.6362 lr:0.02 dt:46ms tok/s:1430650 rem:39s step 15375 (93%) loss:2.6007 lr:0.02 dt:44ms tok/s:1500229 rem:39s step 15376 (93%) loss:2.5831 lr:0.02 dt:44ms tok/s:1498274 rem:39s step 15377 (93%) loss:2.5923 lr:0.02 dt:44ms tok/s:1502040 rem:39s step 15378 (93%) loss:2.6204 lr:0.02 dt:43ms tok/s:1507419 rem:39s step 15379 (93%) loss:2.6316 lr:0.02 dt:44ms tok/s:1477417 rem:39s step 15380 (93%) loss:2.6403 lr:0.02 dt:43ms tok/s:1515264 rem:39s step 15381 (93%) loss:2.6503 lr:0.02 dt:44ms tok/s:1505108 rem:39s step 15382 (93%) loss:2.6591 lr:0.02 dt:43ms tok/s:1510243 rem:39s step 15383 (93%) loss:2.6675 lr:0.02 dt:44ms tok/s:1503149 rem:39s step 15384 (93%) loss:2.6636 lr:0.02 dt:43ms tok/s:1514420 rem:39s step 15385 (94%) loss:2.6521 lr:0.02 dt:43ms tok/s:1510434 rem:39s step 15386 (94%) loss:2.6619 lr:0.02 dt:43ms tok/s:1513462 rem:39s step 15387 (94%) loss:2.6486 lr:0.02 dt:44ms tok/s:1502615 rem:39s step 15388 (94%) loss:2.6415 lr:0.02 dt:43ms tok/s:1519494 rem:39s step 15389 (94%) loss:2.6213 lr:0.02 dt:43ms tok/s:1529223 rem:39s step 15390 (94%) loss:2.6244 lr:0.02 dt:43ms tok/s:1522228 rem:39s step 15391 (94%) loss:2.6266 lr:0.02 dt:43ms tok/s:1509961 rem:39s step 15392 (94%) loss:2.6159 lr:0.02 dt:43ms tok/s:1526921 rem:39s step 15393 (94%) loss:2.6139 lr:0.02 dt:43ms tok/s:1525659 rem:39s step 15394 (94%) loss:2.6165 lr:0.02 dt:43ms tok/s:1529079 rem:39s step 15395 (94%) loss:2.6324 lr:0.02 dt:43ms tok/s:1531575 rem:39s step 15396 (94%) loss:2.6195 lr:0.02 dt:43ms tok/s:1528059 rem:38s step 15397 (94%) loss:2.6260 lr:0.02 dt:43ms tok/s:1531029 rem:38s step 15398 (94%) loss:2.6163 lr:0.02 dt:43ms tok/s:1527269 rem:38s step 15399 (94%) loss:2.5793 lr:0.02 dt:43ms tok/s:1529708 rem:38s step 15400 (94%) loss:2.5493 lr:0.02 dt:43ms tok/s:1530518 rem:38s + local: attn=[0.234, 1.150, 1.330] mlp=[1.533, 0.717, -0.636] + + transition: attn=[4.019, 1.210] mlp=[-1.087, 2.535] + + hierarchy: attn=[3.441, 5.939, 5.616] mlp=[3.699, -4.019, -5.180] + step 15401 (94%) loss:2.5214 lr:0.02 dt:43ms tok/s:1520881 rem:38s step 15402 (94%) loss:2.5271 lr:0.02 dt:43ms tok/s:1530313 rem:38s step 15403 (94%) loss:2.5396 lr:0.02 dt:43ms tok/s:1532292 rem:38s step 15404 (94%) loss:2.5369 lr:0.02 dt:43ms tok/s:1529326 rem:38s step 15405 (94%) loss:2.5453 lr:0.02 dt:43ms tok/s:1528824 rem:38s step 15406 (94%) loss:2.5858 lr:0.02 dt:43ms tok/s:1527787 rem:38s step 15407 (94%) loss:2.5988 lr:0.02 dt:43ms tok/s:1526489 rem:38s step 15408 (94%) loss:2.6113 lr:0.02 dt:43ms tok/s:1529011 rem:38s step 15409 (94%) loss:2.6282 lr:0.02 dt:43ms tok/s:1529972 rem:38s step 15410 (94%) loss:2.6308 lr:0.02 dt:43ms tok/s:1527931 rem:38s step 15411 (94%) loss:2.6380 lr:0.02 dt:43ms tok/s:1528943 rem:38s step 15412 (94%) loss:2.6653 lr:0.02 dt:43ms tok/s:1529274 rem:38s step 15413 (94%) loss:2.6476 lr:0.02 dt:43ms tok/s:1524389 rem:38s step 15414 (94%) loss:2.6592 lr:0.02 dt:45ms tok/s:1468585 rem:38s step 15415 (94%) loss:2.6641 lr:0.02 dt:43ms tok/s:1537908 rem:38s step 15416 (94%) loss:2.6422 lr:0.02 dt:43ms tok/s:1539191 rem:38s step 15417 (94%) loss:2.6332 lr:0.02 dt:43ms tok/s:1532121 rem:38s step 15418 (94%) loss:2.6131 lr:0.02 dt:43ms tok/s:1535048 rem:38s step 15419 (94%) loss:2.6040 lr:0.02 dt:43ms tok/s:1521512 rem:37s step 15420 (94%) loss:2.6030 lr:0.01 dt:43ms tok/s:1536043 rem:37s step 15421 (94%) loss:2.6006 lr:0.01 dt:43ms tok/s:1522768 rem:37s step 15422 (94%) loss:2.5866 lr:0.01 dt:43ms tok/s:1510933 rem:37s step 15423 (94%) loss:2.6149 lr:0.01 dt:45ms tok/s:1455698 rem:37s step 15424 (94%) loss:2.6082 lr:0.01 dt:43ms tok/s:1530697 rem:37s step 15425 (94%) loss:2.5898 lr:0.01 dt:43ms tok/s:1537477 rem:37s step 15426 (94%) loss:2.5944 lr:0.01 dt:43ms tok/s:1538545 rem:37s step 15427 (94%) loss:2.6106 lr:0.01 dt:43ms tok/s:1536352 rem:37s step 15428 (94%) loss:2.6301 lr:0.01 dt:43ms tok/s:1534354 rem:37s step 15429 (94%) loss:2.6137 lr:0.01 dt:43ms tok/s:1533138 rem:37s step 15430 (94%) loss:2.6085 lr:0.01 dt:43ms tok/s:1534696 rem:37s step 15431 (94%) loss:2.6377 lr:0.01 dt:43ms tok/s:1536575 rem:37s step 15432 (94%) loss:2.6246 lr:0.01 dt:43ms tok/s:1536094 rem:37s step 15433 (94%) loss:2.6069 lr:0.01 dt:43ms tok/s:1512287 rem:37s step 15434 (94%) loss:2.6087 lr:0.01 dt:44ms tok/s:1485594 rem:37s step 15435 (94%) loss:2.6225 lr:0.01 dt:43ms tok/s:1532429 rem:37s step 15436 (94%) loss:2.6093 lr:0.01 dt:44ms tok/s:1495438 rem:37s step 15437 (94%) loss:2.6149 lr:0.01 dt:44ms tok/s:1491316 rem:37s step 15438 (94%) loss:2.6168 lr:0.01 dt:44ms tok/s:1474975 rem:37s step 15439 (94%) loss:2.6141 lr:0.01 dt:43ms tok/s:1540295 rem:37s step 15440 (94%) loss:2.6062 lr:0.01 dt:43ms tok/s:1539691 rem:37s step 15441 (94%) loss:2.6318 lr:0.01 dt:43ms tok/s:1541452 rem:37s step 15442 (94%) loss:2.6172 lr:0.01 dt:43ms tok/s:1532890 rem:36s step 15443 (94%) loss:2.6313 lr:0.01 dt:43ms tok/s:1535339 rem:36s step 15444 (94%) loss:2.6503 lr:0.01 dt:42ms tok/s:1551125 rem:36s step 15445 (94%) loss:2.6529 lr:0.01 dt:42ms tok/s:1549830 rem:36s step 15446 (94%) loss:2.6484 lr:0.01 dt:42ms tok/s:1543755 rem:36s step 15447 (94%) loss:2.6492 lr:0.01 dt:42ms tok/s:1545291 rem:36s step 15448 (94%) loss:2.6786 lr:0.01 dt:43ms tok/s:1541893 rem:36s step 15449 (94%) loss:2.7292 lr:0.01 dt:43ms tok/s:1541936 rem:36s step 15450 (94%) loss:2.7090 lr:0.01 dt:43ms tok/s:1540640 rem:36s step 15451 (94%) loss:2.7145 lr:0.01 dt:43ms tok/s:1540813 rem:36s step 15452 (94%) loss:2.7080 lr:0.01 dt:42ms tok/s:1543824 rem:36s step 15453 (94%) loss:2.6935 lr:0.01 dt:43ms tok/s:1538080 rem:36s step 15454 (94%) loss:2.6867 lr:0.01 dt:42ms tok/s:1544371 rem:36s step 15455 (94%) loss:2.6905 lr:0.01 dt:43ms tok/s:1539596 rem:36s step 15456 (94%) loss:2.6838 lr:0.01 dt:42ms tok/s:1542109 rem:36s step 15457 (94%) loss:2.6869 lr:0.01 dt:43ms tok/s:1533557 rem:36s step 15458 (94%) loss:2.6732 lr:0.01 dt:43ms tok/s:1538743 rem:36s step 15459 (94%) loss:2.6873 lr:0.01 dt:43ms tok/s:1536429 rem:36s step 15460 (94%) loss:2.7027 lr:0.01 dt:44ms tok/s:1500008 rem:36s step 15461 (94%) loss:2.6997 lr:0.01 dt:43ms tok/s:1537383 rem:36s step 15462 (94%) loss:2.6918 lr:0.01 dt:43ms tok/s:1538140 rem:36s step 15463 (94%) loss:2.6832 lr:0.01 dt:43ms tok/s:1537520 rem:36s step 15464 (94%) loss:2.6673 lr:0.01 dt:43ms tok/s:1535365 rem:36s step 15465 (94%) loss:2.6637 lr:0.01 dt:43ms tok/s:1535734 rem:36s step 15466 (94%) loss:2.6552 lr:0.01 dt:43ms tok/s:1535983 rem:35s step 15467 (94%) loss:2.6518 lr:0.01 dt:43ms tok/s:1537099 rem:35s step 15468 (94%) loss:2.6723 lr:0.01 dt:43ms tok/s:1536755 rem:35s step 15469 (94%) loss:2.6827 lr:0.01 dt:43ms tok/s:1533232 rem:35s step 15470 (94%) loss:2.6831 lr:0.01 dt:43ms tok/s:1533378 rem:35s step 15471 (94%) loss:2.6771 lr:0.01 dt:43ms tok/s:1536712 rem:35s step 15472 (94%) loss:2.6827 lr:0.01 dt:43ms tok/s:1534576 rem:35s step 15473 (94%) loss:2.6767 lr:0.01 dt:43ms tok/s:1535099 rem:35s step 15474 (94%) loss:2.6740 lr:0.01 dt:43ms tok/s:1530969 rem:35s step 15475 (94%) loss:2.6828 lr:0.01 dt:43ms tok/s:1538140 rem:35s step 15476 (94%) loss:2.6895 lr:0.01 dt:48ms tok/s:1377454 rem:35s step 15477 (94%) loss:2.6912 lr:0.01 dt:43ms tok/s:1536120 rem:35s step 15478 (94%) loss:2.6623 lr:0.01 dt:43ms tok/s:1531806 rem:35s step 15479 (94%) loss:2.6500 lr:0.01 dt:43ms tok/s:1536129 rem:35s step 15480 (94%) loss:2.6538 lr:0.01 dt:43ms tok/s:1532395 rem:35s step 15481 (94%) loss:2.6621 lr:0.01 dt:43ms tok/s:1535236 rem:35s step 15482 (94%) loss:2.6438 lr:0.01 dt:43ms tok/s:1536403 rem:35s step 15483 (94%) loss:2.6288 lr:0.01 dt:44ms tok/s:1493536 rem:35s step 15484 (94%) loss:2.6238 lr:0.01 dt:43ms tok/s:1530714 rem:35s step 15485 (94%) loss:2.6264 lr:0.01 dt:43ms tok/s:1536970 rem:35s step 15486 (94%) loss:2.6462 lr:0.01 dt:43ms tok/s:1538321 rem:35s step 15487 (94%) loss:2.6314 lr:0.01 dt:43ms tok/s:1537125 rem:35s step 15488 (94%) loss:2.6217 lr:0.01 dt:43ms tok/s:1532617 rem:35s step 15489 (94%) loss:2.6134 lr:0.01 dt:43ms tok/s:1537443 rem:34s step 15490 (94%) loss:2.6337 lr:0.01 dt:43ms tok/s:1533292 rem:34s step 15491 (94%) loss:2.6381 lr:0.01 dt:43ms tok/s:1534765 rem:34s step 15492 (94%) loss:2.6221 lr:0.01 dt:43ms tok/s:1535965 rem:34s step 15493 (94%) loss:2.6227 lr:0.01 dt:43ms tok/s:1540908 rem:34s step 15494 (94%) loss:2.6278 lr:0.01 dt:43ms tok/s:1536816 rem:34s step 15495 (94%) loss:2.6291 lr:0.01 dt:43ms tok/s:1538235 rem:34s step 15496 (94%) loss:2.6312 lr:0.01 dt:43ms tok/s:1538105 rem:34s step 15497 (94%) loss:2.6449 lr:0.01 dt:43ms tok/s:1538657 rem:34s step 15498 (94%) loss:2.6566 lr:0.01 dt:43ms tok/s:1536129 rem:34s step 15499 (94%) loss:2.6555 lr:0.01 dt:43ms tok/s:1535039 rem:34s step 15500 (94%) loss:2.6549 lr:0.01 dt:43ms tok/s:1537228 rem:34s + local: attn=[0.234, 1.160, 1.331] mlp=[1.535, 0.720, -0.640] + + transition: attn=[4.024, 1.213] mlp=[-1.097, 2.558] + + hierarchy: attn=[3.431, 5.939, 5.616] mlp=[3.711, -4.040, -5.180] + step 15501 (94%) loss:2.6599 lr:0.01 dt:43ms tok/s:1527957 rem:34s step 15502 (94%) loss:2.6668 lr:0.01 dt:43ms tok/s:1534568 rem:34s step 15503 (94%) loss:2.6665 lr:0.01 dt:43ms tok/s:1535159 rem:34s step 15504 (94%) loss:2.6627 lr:0.01 dt:43ms tok/s:1537847 rem:34s step 15505 (94%) loss:2.6712 lr:0.01 dt:43ms tok/s:1534893 rem:34s step 15506 (94%) loss:2.6832 lr:0.01 dt:43ms tok/s:1534080 rem:34s step 15507 (94%) loss:2.6741 lr:0.01 dt:43ms tok/s:1536343 rem:34s step 15508 (94%) loss:2.6811 lr:0.01 dt:44ms tok/s:1478991 rem:34s step 15509 (94%) loss:2.7137 lr:0.01 dt:43ms tok/s:1527371 rem:34s step 15510 (94%) loss:2.7023 lr:0.01 dt:43ms tok/s:1530024 rem:34s step 15511 (94%) loss:2.7013 lr:0.01 dt:43ms tok/s:1528968 rem:34s step 15512 (94%) loss:2.6867 lr:0.01 dt:43ms tok/s:1528237 rem:34s step 15513 (94%) loss:2.6819 lr:0.01 dt:43ms tok/s:1530509 rem:33s step 15514 (94%) loss:2.6700 lr:0.01 dt:43ms tok/s:1530484 rem:33s step 15515 (94%) loss:2.6412 lr:0.01 dt:43ms tok/s:1529708 rem:33s step 15516 (94%) loss:2.5991 lr:0.01 dt:43ms tok/s:1530262 rem:33s step 15517 (94%) loss:2.5910 lr:0.01 dt:47ms tok/s:1388216 rem:33s step 15518 (94%) loss:2.6310 lr:0.01 dt:43ms tok/s:1535519 rem:33s step 15519 (94%) loss:2.6946 lr:0.01 dt:43ms tok/s:1526260 rem:33s step 15520 (94%) loss:2.6982 lr:0.01 dt:43ms tok/s:1525574 rem:33s step 15521 (94%) loss:2.6871 lr:0.01 dt:43ms tok/s:1522684 rem:33s step 15522 (94%) loss:2.6790 lr:0.01 dt:43ms tok/s:1518369 rem:33s step 15523 (94%) loss:2.6869 lr:0.01 dt:43ms tok/s:1526167 rem:33s step 15524 (94%) loss:2.6644 lr:0.01 dt:43ms tok/s:1525142 rem:33s step 15525 (95%) loss:2.6641 lr:0.01 dt:43ms tok/s:1524009 rem:33s step 15526 (95%) loss:2.6563 lr:0.01 dt:43ms tok/s:1523773 rem:33s step 15527 (95%) loss:2.6486 lr:0.01 dt:43ms tok/s:1521201 rem:33s step 15528 (95%) loss:2.6505 lr:0.01 dt:44ms tok/s:1500106 rem:33s step 15529 (95%) loss:2.6571 lr:0.01 dt:43ms tok/s:1525168 rem:33s step 15530 (95%) loss:2.6623 lr:0.01 dt:43ms tok/s:1526497 rem:33s step 15531 (95%) loss:2.6608 lr:0.01 dt:43ms tok/s:1526235 rem:33s step 15532 (95%) loss:2.6422 lr:0.01 dt:43ms tok/s:1525498 rem:33s step 15533 (95%) loss:2.6444 lr:0.01 dt:43ms tok/s:1521462 rem:33s step 15534 (95%) loss:2.6373 lr:0.01 dt:43ms tok/s:1526396 rem:33s step 15535 (95%) loss:2.6194 lr:0.01 dt:43ms tok/s:1524136 rem:33s step 15536 (95%) loss:2.6156 lr:0.01 dt:43ms tok/s:1524229 rem:32s step 15537 (95%) loss:2.6482 lr:0.01 dt:43ms tok/s:1517581 rem:32s step 15538 (95%) loss:2.6515 lr:0.01 dt:43ms tok/s:1523621 rem:32s step 15539 (95%) loss:2.6508 lr:0.01 dt:43ms tok/s:1524736 rem:32s step 15540 (95%) loss:2.6431 lr:0.01 dt:44ms tok/s:1501737 rem:32s step 15541 (95%) loss:2.6676 lr:0.01 dt:44ms tok/s:1505158 rem:32s step 15542 (95%) loss:2.6611 lr:0.01 dt:43ms tok/s:1524280 rem:32s step 15543 (95%) loss:2.6430 lr:0.01 dt:43ms tok/s:1528101 rem:32s step 15544 (95%) loss:2.6251 lr:0.01 dt:43ms tok/s:1512737 rem:32s step 15545 (95%) loss:2.6574 lr:0.01 dt:44ms tok/s:1504219 rem:32s step 15546 (95%) loss:2.6692 lr:0.01 dt:43ms tok/s:1520124 rem:32s step 15547 (95%) loss:2.6822 lr:0.01 dt:43ms tok/s:1518235 rem:32s step 15548 (95%) loss:2.6780 lr:0.01 dt:43ms tok/s:1525676 rem:32s step 15549 (95%) loss:2.6615 lr:0.01 dt:43ms tok/s:1521032 rem:32s step 15550 (95%) loss:2.6583 lr:0.01 dt:50ms tok/s:1306907 rem:32s step 15551 (95%) loss:2.6559 lr:0.01 dt:43ms tok/s:1514153 rem:32s step 15552 (95%) loss:2.6682 lr:0.01 dt:42ms tok/s:1557453 rem:32s step 15553 (95%) loss:2.6753 lr:0.01 dt:42ms tok/s:1542975 rem:32s step 15554 (95%) loss:2.6532 lr:0.01 dt:43ms tok/s:1525345 rem:32s step 15555 (95%) loss:2.6659 lr:0.01 dt:43ms tok/s:1538105 rem:32s step 15556 (95%) loss:2.6841 lr:0.01 dt:42ms tok/s:1547231 rem:32s step 15557 (95%) loss:2.6797 lr:0.01 dt:42ms tok/s:1549219 rem:32s step 15558 (95%) loss:2.6753 lr:0.01 dt:42ms tok/s:1544961 rem:32s step 15559 (95%) loss:2.6745 lr:0.01 dt:43ms tok/s:1541711 rem:31s step 15560 (95%) loss:2.6581 lr:0.01 dt:43ms tok/s:1540631 rem:31s step 15561 (95%) loss:2.6354 lr:0.01 dt:43ms tok/s:1540157 rem:31s step 15562 (95%) loss:2.5987 lr:0.01 dt:42ms tok/s:1542603 rem:31s step 15563 (95%) loss:2.5915 lr:0.01 dt:42ms tok/s:1542049 rem:31s step 15564 (95%) loss:2.5928 lr:0.01 dt:43ms tok/s:1539406 rem:31s step 15565 (95%) loss:2.5848 lr:0.01 dt:43ms tok/s:1539441 rem:31s step 15566 (95%) loss:2.5778 lr:0.01 dt:43ms tok/s:1536781 rem:31s step 15567 (95%) loss:2.6465 lr:0.01 dt:44ms tok/s:1480162 rem:31s step 15568 (95%) loss:2.6375 lr:0.01 dt:42ms tok/s:1542308 rem:31s step 15569 (95%) loss:2.6369 lr:0.01 dt:43ms tok/s:1538269 rem:31s step 15570 (95%) loss:2.6359 lr:0.01 dt:43ms tok/s:1530756 rem:31s step 15571 (95%) loss:2.6241 lr:0.01 dt:43ms tok/s:1530254 rem:31s step 15572 (95%) loss:2.6283 lr:0.01 dt:43ms tok/s:1535734 rem:31s step 15573 (95%) loss:2.6067 lr:0.01 dt:43ms tok/s:1534679 rem:31s step 15574 (95%) loss:2.6052 lr:0.01 dt:43ms tok/s:1531754 rem:31s step 15575 (95%) loss:2.6059 lr:0.01 dt:43ms tok/s:1535099 rem:31s step 15576 (95%) loss:2.6324 lr:0.01 dt:43ms tok/s:1533960 rem:31s step 15577 (95%) loss:2.6243 lr:0.01 dt:43ms tok/s:1534165 rem:31s step 15578 (95%) loss:2.6156 lr:0.01 dt:46ms tok/s:1421836 rem:31s step 15579 (95%) loss:2.6159 lr:0.01 dt:43ms tok/s:1537159 rem:31s step 15580 (95%) loss:2.6078 lr:0.01 dt:43ms tok/s:1537305 rem:31s step 15581 (95%) loss:2.6154 lr:0.01 dt:43ms tok/s:1516108 rem:31s step 15582 (95%) loss:2.6168 lr:0.01 dt:43ms tok/s:1521900 rem:30s step 15583 (95%) loss:2.6065 lr:0.01 dt:43ms tok/s:1525015 rem:30s step 15584 (95%) loss:2.6016 lr:0.01 dt:43ms tok/s:1522928 rem:30s step 15585 (95%) loss:2.5930 lr:0.01 dt:43ms tok/s:1522616 rem:30s step 15586 (95%) loss:2.5877 lr:0.01 dt:43ms tok/s:1525667 rem:30s step 15587 (95%) loss:2.6002 lr:0.01 dt:43ms tok/s:1521445 rem:30s step 15588 (95%) loss:2.6408 lr:0.01 dt:43ms tok/s:1518436 rem:30s step 15589 (95%) loss:2.6372 lr:0.01 dt:43ms tok/s:1517296 rem:30s step 15590 (95%) loss:2.6291 lr:0.01 dt:43ms tok/s:1522363 rem:30s step 15591 (95%) loss:2.6207 lr:0.01 dt:43ms tok/s:1522077 rem:30s step 15592 (95%) loss:2.6092 lr:0.01 dt:43ms tok/s:1519477 rem:30s step 15593 (95%) loss:2.6104 lr:0.01 dt:43ms tok/s:1523882 rem:30s step 15594 (95%) loss:2.6168 lr:0.01 dt:43ms tok/s:1524195 rem:30s step 15595 (95%) loss:2.6184 lr:0.01 dt:43ms tok/s:1524491 rem:30s step 15596 (95%) loss:2.6439 lr:0.01 dt:43ms tok/s:1525371 rem:30s step 15597 (95%) loss:2.6369 lr:0.01 dt:44ms tok/s:1500524 rem:30s step 15598 (95%) loss:2.6364 lr:0.01 dt:43ms tok/s:1523747 rem:30s step 15599 (95%) loss:2.6198 lr:0.01 dt:43ms tok/s:1522785 rem:30s step 15600 (95%) loss:2.6076 lr:0.01 dt:43ms tok/s:1522034 rem:30s + local: attn=[0.229, 1.161, 1.334] mlp=[1.538, 0.722, -0.649] + + transition: attn=[4.021, 1.206] mlp=[-1.108, 2.574] + + hierarchy: attn=[3.424, 5.939, 5.616] mlp=[3.721, -4.066, -5.180] + step 15601 (95%) loss:2.5930 lr:0.01 dt:43ms tok/s:1520637 rem:30s step 15602 (95%) loss:2.6301 lr:0.01 dt:43ms tok/s:1523823 rem:30s step 15603 (95%) loss:2.6332 lr:0.01 dt:43ms tok/s:1524575 rem:30s step 15604 (95%) loss:2.6583 lr:0.01 dt:43ms tok/s:1525980 rem:30s step 15605 (95%) loss:2.6777 lr:0.01 dt:43ms tok/s:1523587 rem:29s step 15606 (95%) loss:2.6715 lr:0.01 dt:43ms tok/s:1521335 rem:29s step 15607 (95%) loss:2.6577 lr:0.01 dt:43ms tok/s:1523249 rem:29s step 15608 (95%) loss:2.6440 lr:0.01 dt:43ms tok/s:1519317 rem:29s step 15609 (95%) loss:2.6298 lr:0.01 dt:43ms tok/s:1526218 rem:29s step 15610 (95%) loss:2.6097 lr:0.01 dt:43ms tok/s:1518411 rem:29s step 15611 (95%) loss:2.6009 lr:0.01 dt:43ms tok/s:1510517 rem:29s step 15612 (95%) loss:2.5978 lr:0.01 dt:43ms tok/s:1523916 rem:29s step 15613 (95%) loss:2.6544 lr:0.01 dt:43ms tok/s:1524491 rem:29s step 15614 (95%) loss:2.6501 lr:0.01 dt:43ms tok/s:1520914 rem:29s step 15615 (95%) loss:2.6424 lr:0.01 dt:43ms tok/s:1520031 rem:29s step 15616 (95%) loss:2.6440 lr:0.01 dt:43ms tok/s:1520586 rem:29s step 15617 (95%) loss:2.6363 lr:0.01 dt:43ms tok/s:1527346 rem:29s step 15618 (95%) loss:2.6470 lr:0.01 dt:43ms tok/s:1522254 rem:29s step 15619 (95%) loss:2.6470 lr:0.01 dt:43ms tok/s:1522591 rem:29s step 15620 (95%) loss:2.6456 lr:0.01 dt:43ms tok/s:1524026 rem:29s step 15621 (95%) loss:2.6378 lr:0.01 dt:43ms tok/s:1518553 rem:29s step 15622 (95%) loss:2.6344 lr:0.01 dt:43ms tok/s:1521436 rem:29s step 15623 (95%) loss:2.6295 lr:0.01 dt:43ms tok/s:1521478 rem:29s step 15624 (95%) loss:2.6377 lr:0.01 dt:43ms tok/s:1518629 rem:29s step 15625 (95%) loss:2.6458 lr:0.01 dt:43ms tok/s:1521858 rem:29s step 15626 (95%) loss:2.6478 lr:0.01 dt:43ms tok/s:1518210 rem:29s step 15627 (95%) loss:2.6385 lr:0.01 dt:43ms tok/s:1521058 rem:29s step 15628 (95%) loss:2.6580 lr:0.01 dt:43ms tok/s:1515949 rem:29s step 15629 (95%) loss:2.6539 lr:0.01 dt:43ms tok/s:1531660 rem:28s step 15630 (95%) loss:2.6467 lr:0.01 dt:43ms tok/s:1527133 rem:28s step 15631 (95%) loss:2.6391 lr:0.01 dt:43ms tok/s:1528084 rem:28s step 15632 (95%) loss:2.6242 lr:0.01 dt:43ms tok/s:1521647 rem:28s step 15633 (95%) loss:2.6330 lr:0.01 dt:43ms tok/s:1522304 rem:28s step 15634 (95%) loss:2.6231 lr:0.01 dt:43ms tok/s:1521563 rem:28s step 15635 (95%) loss:2.5854 lr:0.01 dt:43ms tok/s:1516518 rem:28s step 15636 (95%) loss:2.5990 lr:0.01 dt:43ms tok/s:1518067 rem:28s step 15637 (95%) loss:2.5836 lr:0.01 dt:43ms tok/s:1515548 rem:28s step 15638 (95%) loss:2.5926 lr:0.01 dt:43ms tok/s:1518948 rem:28s step 15639 (95%) loss:2.5922 lr:0.01 dt:43ms tok/s:1516267 rem:28s step 15640 (95%) loss:2.5866 lr:0.01 dt:43ms tok/s:1514879 rem:28s step 15641 (95%) loss:2.6015 lr:0.01 dt:43ms tok/s:1515798 rem:28s step 15642 (95%) loss:2.6081 lr:0.01 dt:43ms tok/s:1513987 rem:28s step 15643 (95%) loss:2.6176 lr:0.01 dt:43ms tok/s:1517497 rem:28s step 15644 (95%) loss:2.6348 lr:0.01 dt:43ms tok/s:1517438 rem:28s step 15645 (95%) loss:2.6570 lr:0.01 dt:43ms tok/s:1517288 rem:28s step 15646 (95%) loss:2.6640 lr:0.01 dt:43ms tok/s:1516819 rem:28s step 15647 (95%) loss:2.6669 lr:0.01 dt:43ms tok/s:1517196 rem:28s step 15648 (95%) loss:2.6468 lr:0.01 dt:43ms tok/s:1518184 rem:28s step 15649 (95%) loss:2.6579 lr:0.01 dt:44ms tok/s:1477616 rem:28s step 15650 (95%) loss:2.6542 lr:0.01 dt:43ms tok/s:1509356 rem:28s step 15651 (95%) loss:2.6472 lr:0.01 dt:43ms tok/s:1514788 rem:28s step 15652 (95%) loss:2.6344 lr:0.01 dt:43ms tok/s:1518235 rem:27s step 15653 (95%) loss:2.6307 lr:0.01 dt:43ms tok/s:1516300 rem:27s step 15654 (95%) loss:2.6181 lr:0.01 dt:43ms tok/s:1517623 rem:27s step 15655 (95%) loss:2.6146 lr:0.01 dt:43ms tok/s:1518629 rem:27s step 15656 (95%) loss:2.6267 lr:0.01 dt:43ms tok/s:1520107 rem:27s step 15657 (95%) loss:2.6446 lr:0.01 dt:43ms tok/s:1515907 rem:27s step 15658 (95%) loss:2.6573 lr:0.01 dt:43ms tok/s:1518453 rem:27s step 15659 (95%) loss:2.6612 lr:0.01 dt:43ms tok/s:1512504 rem:27s step 15660 (95%) loss:2.6543 lr:0.01 dt:43ms tok/s:1514470 rem:27s step 15661 (95%) loss:2.6712 lr:0.01 dt:43ms tok/s:1517489 rem:27s step 15662 (95%) loss:2.6589 lr:0.01 dt:43ms tok/s:1516543 rem:27s step 15663 (95%) loss:2.6647 lr:0.01 dt:43ms tok/s:1514587 rem:27s step 15664 (96%) loss:2.6739 lr:0.01 dt:43ms tok/s:1510708 rem:27s step 15665 (96%) loss:2.6726 lr:0.01 dt:43ms tok/s:1508818 rem:27s step 15666 (96%) loss:2.6637 lr:0.01 dt:43ms tok/s:1515614 rem:27s step 15667 (96%) loss:2.6309 lr:0.01 dt:43ms tok/s:1516961 rem:27s step 15668 (96%) loss:2.6203 lr:0.01 dt:43ms tok/s:1514829 rem:27s step 15669 (96%) loss:2.6247 lr:0.01 dt:43ms tok/s:1513870 rem:27s step 15670 (96%) loss:2.6220 lr:0.01 dt:43ms tok/s:1516618 rem:27s step 15671 (96%) loss:2.6135 lr:0.01 dt:43ms tok/s:1524246 rem:27s step 15672 (96%) loss:2.6207 lr:0.01 dt:44ms tok/s:1499435 rem:27s step 15673 (96%) loss:2.6249 lr:0.01 dt:43ms tok/s:1511930 rem:27s step 15674 (96%) loss:2.6302 lr:0.01 dt:43ms tok/s:1510459 rem:27s step 15675 (96%) loss:2.6356 lr:0.01 dt:43ms tok/s:1507163 rem:26s step 15676 (96%) loss:2.6451 lr:0.01 dt:43ms tok/s:1513753 rem:26s step 15677 (96%) loss:2.6581 lr:0.01 dt:43ms tok/s:1512487 rem:26s step 15678 (96%) loss:2.6598 lr:0.01 dt:43ms tok/s:1511248 rem:26s step 15679 (96%) loss:2.6572 lr:0.01 dt:43ms tok/s:1513870 rem:26s step 15680 (96%) loss:2.6556 lr:0.01 dt:43ms tok/s:1508727 rem:26s step 15681 (96%) loss:2.6466 lr:0.01 dt:43ms tok/s:1509563 rem:26s step 15682 (96%) loss:2.6354 lr:0.01 dt:43ms tok/s:1512662 rem:26s step 15683 (96%) loss:2.6388 lr:0.01 dt:43ms tok/s:1510908 rem:26s step 15684 (96%) loss:2.6351 lr:0.01 dt:43ms tok/s:1511747 rem:26s step 15685 (96%) loss:2.6253 lr:0.01 dt:43ms tok/s:1509995 rem:26s step 15686 (96%) loss:2.6247 lr:0.01 dt:43ms tok/s:1510194 rem:26s step 15687 (96%) loss:2.6336 lr:0.01 dt:43ms tok/s:1514554 rem:26s step 15688 (96%) loss:2.6562 lr:0.01 dt:43ms tok/s:1510617 rem:26s step 15689 (96%) loss:2.6649 lr:0.01 dt:43ms tok/s:1508246 rem:26s step 15690 (96%) loss:2.6416 lr:0.01 dt:43ms tok/s:1514980 rem:26s step 15691 (96%) loss:2.6054 lr:0.01 dt:44ms tok/s:1493837 rem:26s step 15692 (96%) loss:2.5814 lr:0.01 dt:44ms tok/s:1496586 rem:26s step 15693 (96%) loss:2.5875 lr:0.01 dt:43ms tok/s:1507932 rem:26s step 15694 (96%) loss:2.6047 lr:0.01 dt:43ms tok/s:1512254 rem:26s step 15695 (96%) loss:2.5976 lr:0.01 dt:43ms tok/s:1512312 rem:26s step 15696 (96%) loss:2.6029 lr:0.01 dt:43ms tok/s:1512987 rem:26s step 15697 (96%) loss:2.6224 lr:0.01 dt:43ms tok/s:1511007 rem:26s step 15698 (96%) loss:2.6274 lr:0.01 dt:43ms tok/s:1508776 rem:25s step 15699 (96%) loss:2.6179 lr:0.01 dt:43ms tok/s:1512987 rem:25s step 15700 (96%) loss:2.6256 lr:0.01 dt:43ms tok/s:1509381 rem:25s + local: attn=[0.232, 1.160, 1.335] mlp=[1.539, 0.723, -0.650] + + transition: attn=[4.025, 1.207] mlp=[-1.115, 2.592] + + hierarchy: attn=[3.420, 5.939, 5.616] mlp=[3.730, -4.091, -5.180] + step 15701 (96%) loss:2.6349 lr:0.01 dt:44ms tok/s:1506057 rem:25s step 15702 (96%) loss:2.6288 lr:0.01 dt:46ms tok/s:1420492 rem:25s step 15703 (96%) loss:2.6354 lr:0.01 dt:43ms tok/s:1516576 rem:25s step 15704 (96%) loss:2.6306 lr:0.01 dt:43ms tok/s:1512021 rem:25s step 15705 (96%) loss:2.6324 lr:0.01 dt:46ms tok/s:1416422 rem:25s step 15706 (96%) loss:2.6365 lr:0.01 dt:43ms tok/s:1515598 rem:25s step 15707 (96%) loss:2.6421 lr:0.01 dt:43ms tok/s:1521731 rem:25s step 15708 (96%) loss:2.6533 lr:0.01 dt:45ms tok/s:1471045 rem:25s step 15709 (96%) loss:2.6372 lr:0.01 dt:43ms tok/s:1514178 rem:25s step 15710 (96%) loss:2.6398 lr:0.01 dt:43ms tok/s:1518101 rem:25s step 15711 (96%) loss:2.6404 lr:0.01 dt:44ms tok/s:1497385 rem:25s step 15712 (96%) loss:2.6595 lr:0.01 dt:42ms tok/s:1542317 rem:25s step 15713 (96%) loss:2.6573 lr:0.01 dt:43ms tok/s:1538949 rem:25s step 15714 (96%) loss:2.6530 lr:0.01 dt:43ms tok/s:1524795 rem:25s step 15715 (96%) loss:2.6433 lr:0.01 dt:43ms tok/s:1539087 rem:25s step 15716 (96%) loss:2.6513 lr:0.01 dt:43ms tok/s:1538028 rem:25s step 15717 (96%) loss:2.6469 lr:0.01 dt:43ms tok/s:1535828 rem:25s step 15718 (96%) loss:2.6506 lr:0.01 dt:43ms tok/s:1535245 rem:25s step 15719 (96%) loss:2.6664 lr:0.01 dt:43ms tok/s:1526938 rem:25s step 15720 (96%) loss:2.6775 lr:0.01 dt:43ms tok/s:1539553 rem:25s step 15721 (96%) loss:2.6468 lr:0.01 dt:43ms tok/s:1539820 rem:24s step 15722 (96%) loss:2.6335 lr:0.01 dt:43ms tok/s:1538424 rem:24s step 15723 (96%) loss:2.6021 lr:0.01 dt:43ms tok/s:1533711 rem:24s step 15724 (96%) loss:2.6040 lr:0.01 dt:46ms tok/s:1427781 rem:24s step 15725 (96%) loss:2.6145 lr:0.01 dt:42ms tok/s:1543174 rem:24s step 15726 (96%) loss:2.6141 lr:0.01 dt:43ms tok/s:1531276 rem:24s step 15727 (96%) loss:2.6243 lr:0.01 dt:43ms tok/s:1541193 rem:24s step 15728 (96%) loss:2.6175 lr:0.01 dt:42ms tok/s:1547997 rem:24s step 15729 (96%) loss:2.6109 lr:0.01 dt:42ms tok/s:1542196 rem:24s step 15730 (96%) loss:2.6167 lr:0.01 dt:42ms tok/s:1543876 rem:24s step 15731 (96%) loss:2.6221 lr:0.01 dt:42ms tok/s:1543192 rem:24s step 15732 (96%) loss:2.6211 lr:0.01 dt:42ms tok/s:1545691 rem:24s step 15733 (96%) loss:2.6284 lr:0.01 dt:42ms tok/s:1546508 rem:24s step 15734 (96%) loss:2.6339 lr:0.01 dt:43ms tok/s:1537048 rem:24s step 15735 (96%) loss:2.6218 lr:0.01 dt:43ms tok/s:1534191 rem:24s step 15736 (96%) loss:2.6359 lr:0.01 dt:43ms tok/s:1537177 rem:24s step 15737 (96%) loss:2.6355 lr:0.01 dt:43ms tok/s:1538183 rem:24s step 15738 (96%) loss:2.6358 lr:0.01 dt:43ms tok/s:1536730 rem:24s step 15739 (96%) loss:2.6280 lr:0.01 dt:43ms tok/s:1512987 rem:24s step 15740 (96%) loss:2.6244 lr:0.01 dt:43ms tok/s:1539579 rem:24s step 15741 (96%) loss:2.5905 lr:0.01 dt:43ms tok/s:1536884 rem:24s step 15742 (96%) loss:2.5707 lr:0.01 dt:43ms tok/s:1540519 rem:24s step 15743 (96%) loss:2.5535 lr:0.01 dt:43ms tok/s:1534628 rem:24s step 15744 (96%) loss:2.5709 lr:0.01 dt:43ms tok/s:1532506 rem:23s step 15745 (96%) loss:2.5922 lr:0.01 dt:43ms tok/s:1530952 rem:23s step 15746 (96%) loss:2.6009 lr:0.01 dt:43ms tok/s:1536738 rem:23s step 15747 (96%) loss:2.6029 lr:0.01 dt:43ms tok/s:1532386 rem:23s step 15748 (96%) loss:2.5948 lr:0.01 dt:43ms tok/s:1532284 rem:23s step 15749 (96%) loss:2.6206 lr:0.01 dt:43ms tok/s:1531208 rem:23s step 15750 (96%) loss:2.6264 lr:0.01 dt:43ms tok/s:1533703 rem:23s step 15751 (96%) loss:2.6370 lr:0.01 dt:43ms tok/s:1535820 rem:23s step 15752 (96%) loss:2.6491 lr:0.01 dt:43ms tok/s:1535331 rem:23s step 15753 (96%) loss:2.6670 lr:0.01 dt:43ms tok/s:1516710 rem:23s step 15754 (96%) loss:2.6844 lr:0.01 dt:43ms tok/s:1536498 rem:23s step 15755 (96%) loss:2.6851 lr:0.01 dt:43ms tok/s:1535228 rem:23s step 15756 (96%) loss:2.6618 lr:0.01 dt:43ms tok/s:1534653 rem:23s step 15757 (96%) loss:2.6600 lr:0.01 dt:43ms tok/s:1533078 rem:23s step 15758 (96%) loss:2.6650 lr:0.01 dt:43ms tok/s:1532925 rem:23s step 15759 (96%) loss:2.6851 lr:0.01 dt:43ms tok/s:1531268 rem:23s step 15760 (96%) loss:2.6815 lr:0.01 dt:43ms tok/s:1532916 rem:23s step 15761 (96%) loss:2.6844 lr:0.01 dt:43ms tok/s:1534868 rem:23s step 15762 (96%) loss:2.6846 lr:0.01 dt:43ms tok/s:1530560 rem:23s step 15763 (96%) loss:2.6915 lr:0.01 dt:43ms tok/s:1530756 rem:23s step 15764 (96%) loss:2.6819 lr:0.01 dt:43ms tok/s:1534988 rem:23s step 15765 (96%) loss:2.6606 lr:0.01 dt:43ms tok/s:1534902 rem:23s step 15766 (96%) loss:2.6436 lr:0.01 dt:43ms tok/s:1527965 rem:23s step 15767 (96%) loss:2.6554 lr:0.01 dt:43ms tok/s:1529538 rem:23s step 15768 (96%) loss:2.6528 lr:0.01 dt:43ms tok/s:1533010 rem:22s step 15769 (96%) loss:2.6776 lr:0.01 dt:43ms tok/s:1533489 rem:22s step 15770 (96%) loss:2.6792 lr:0.01 dt:43ms tok/s:1534354 rem:22s step 15771 (96%) loss:2.6771 lr:0.01 dt:43ms tok/s:1532147 rem:22s step 15772 (96%) loss:2.6614 lr:0.01 dt:43ms tok/s:1534551 rem:22s step 15773 (96%) loss:2.6509 lr:0.01 dt:43ms tok/s:1525413 rem:22s step 15774 (96%) loss:2.6413 lr:0.01 dt:43ms tok/s:1520687 rem:22s step 15775 (96%) loss:2.6593 lr:0.01 dt:43ms tok/s:1533968 rem:22s step 15776 (96%) loss:2.6492 lr:0.01 dt:43ms tok/s:1527210 rem:22s step 15777 (96%) loss:2.6439 lr:0.01 dt:43ms tok/s:1533361 rem:22s step 15778 (96%) loss:2.6329 lr:0.01 dt:43ms tok/s:1531319 rem:22s step 15779 (96%) loss:2.6385 lr:0.01 dt:43ms tok/s:1530611 rem:22s step 15780 (96%) loss:2.6515 lr:0.01 dt:43ms tok/s:1516442 rem:22s step 15781 (96%) loss:2.6480 lr:0.01 dt:43ms tok/s:1522136 rem:22s step 15782 (96%) loss:2.6617 lr:0.01 dt:43ms tok/s:1536180 rem:22s step 15783 (96%) loss:2.6505 lr:0.01 dt:44ms tok/s:1485353 rem:22s step 15784 (96%) loss:2.6482 lr:0.01 dt:43ms tok/s:1523587 rem:22s step 15785 (96%) loss:2.6495 lr:0.01 dt:43ms tok/s:1526150 rem:22s step 15786 (96%) loss:2.6274 lr:0.01 dt:43ms tok/s:1525947 rem:22s step 15787 (96%) loss:2.6319 lr:0.01 dt:43ms tok/s:1520973 rem:22s step 15788 (96%) loss:2.6231 lr:0.01 dt:43ms tok/s:1524373 rem:22s step 15789 (96%) loss:2.5943 lr:0.00 dt:43ms tok/s:1523756 rem:22s step 15790 (96%) loss:2.6109 lr:0.00 dt:43ms tok/s:1523460 rem:22s step 15791 (96%) loss:2.6226 lr:0.00 dt:43ms tok/s:1523637 rem:21s step 15792 (96%) loss:2.6252 lr:0.00 dt:43ms tok/s:1520326 rem:21s step 15793 (96%) loss:2.6328 lr:0.00 dt:43ms tok/s:1525235 rem:21s step 15794 (96%) loss:2.6523 lr:0.00 dt:43ms tok/s:1522397 rem:21s step 15795 (96%) loss:2.6574 lr:0.00 dt:44ms tok/s:1473663 rem:21s step 15796 (96%) loss:2.6795 lr:0.00 dt:43ms tok/s:1529564 rem:21s step 15797 (96%) loss:2.6932 lr:0.00 dt:43ms tok/s:1523798 rem:21s step 15798 (96%) loss:2.6622 lr:0.00 dt:43ms tok/s:1522389 rem:21s step 15799 (96%) loss:2.6330 lr:0.00 dt:43ms tok/s:1525642 rem:21s step 15800 (96%) loss:2.6368 lr:0.00 dt:43ms tok/s:1521175 rem:21s + local: attn=[0.234, 1.159, 1.336] mlp=[1.540, 0.724, -0.652] + + transition: attn=[4.025, 1.203] mlp=[-1.122, 2.604] + + hierarchy: attn=[3.418, 5.939, 5.616] mlp=[3.737, -4.106, -5.180] + step 15801 (96%) loss:2.6551 lr:0.00 dt:43ms tok/s:1525091 rem:21s step 15802 (96%) loss:2.6621 lr:0.00 dt:43ms tok/s:1526167 rem:21s step 15803 (96%) loss:2.6621 lr:0.00 dt:43ms tok/s:1519670 rem:21s step 15804 (97%) loss:2.6772 lr:0.00 dt:43ms tok/s:1520612 rem:21s step 15805 (97%) loss:2.6929 lr:0.00 dt:43ms tok/s:1524432 rem:21s step 15806 (97%) loss:2.6999 lr:0.00 dt:43ms tok/s:1520460 rem:21s step 15807 (97%) loss:2.7062 lr:0.00 dt:43ms tok/s:1520015 rem:21s step 15808 (97%) loss:2.6918 lr:0.00 dt:43ms tok/s:1524111 rem:21s step 15809 (97%) loss:2.6981 lr:0.00 dt:43ms tok/s:1521681 rem:21s step 15810 (97%) loss:2.6854 lr:0.00 dt:43ms tok/s:1523967 rem:21s step 15811 (97%) loss:2.6820 lr:0.00 dt:43ms tok/s:1519897 rem:21s step 15812 (97%) loss:2.6943 lr:0.00 dt:43ms tok/s:1516777 rem:21s step 15813 (97%) loss:2.6924 lr:0.00 dt:43ms tok/s:1519645 rem:21s step 15814 (97%) loss:2.6876 lr:0.00 dt:43ms tok/s:1524195 rem:20s step 15815 (97%) loss:2.6849 lr:0.00 dt:43ms tok/s:1517321 rem:20s step 15816 (97%) loss:2.6842 lr:0.00 dt:43ms tok/s:1517883 rem:20s step 15817 (97%) loss:2.6873 lr:0.00 dt:43ms tok/s:1519334 rem:20s step 15818 (97%) loss:2.6938 lr:0.00 dt:43ms tok/s:1521445 rem:20s step 15819 (97%) loss:2.6873 lr:0.00 dt:43ms tok/s:1520864 rem:20s step 15820 (97%) loss:2.7067 lr:0.00 dt:43ms tok/s:1517832 rem:20s step 15821 (97%) loss:2.7240 lr:0.00 dt:43ms tok/s:1518067 rem:20s step 15822 (97%) loss:2.6853 lr:0.00 dt:43ms tok/s:1517983 rem:20s step 15823 (97%) loss:2.6754 lr:0.00 dt:43ms tok/s:1523266 rem:20s step 15824 (97%) loss:2.6728 lr:0.00 dt:43ms tok/s:1522448 rem:20s step 15825 (97%) loss:2.6691 lr:0.00 dt:43ms tok/s:1520940 rem:20s step 15826 (97%) loss:2.6548 lr:0.00 dt:43ms tok/s:1519015 rem:20s step 15827 (97%) loss:2.6316 lr:0.00 dt:43ms tok/s:1524770 rem:20s step 15828 (97%) loss:2.6320 lr:0.00 dt:43ms tok/s:1513770 rem:20s step 15829 (97%) loss:2.6564 lr:0.00 dt:43ms tok/s:1518436 rem:20s step 15830 (97%) loss:2.6932 lr:0.00 dt:43ms tok/s:1518881 rem:20s step 15831 (97%) loss:2.7129 lr:0.00 dt:43ms tok/s:1517187 rem:20s step 15832 (97%) loss:2.7005 lr:0.00 dt:43ms tok/s:1515038 rem:20s step 15833 (97%) loss:2.6990 lr:0.00 dt:43ms tok/s:1518872 rem:20s step 15834 (97%) loss:2.7006 lr:0.00 dt:43ms tok/s:1507568 rem:20s step 15835 (97%) loss:2.6983 lr:0.00 dt:44ms tok/s:1492336 rem:20s step 15836 (97%) loss:2.6906 lr:0.00 dt:43ms tok/s:1519586 rem:20s step 15837 (97%) loss:2.6828 lr:0.00 dt:43ms tok/s:1519762 rem:19s step 15838 (97%) loss:2.6417 lr:0.00 dt:43ms tok/s:1518310 rem:19s step 15839 (97%) loss:2.6342 lr:0.00 dt:43ms tok/s:1510625 rem:19s step 15840 (97%) loss:2.6268 lr:0.00 dt:43ms tok/s:1519544 rem:19s step 15841 (97%) loss:2.6212 lr:0.00 dt:43ms tok/s:1521504 rem:19s step 15842 (97%) loss:2.6388 lr:0.00 dt:43ms tok/s:1522608 rem:19s step 15843 (97%) loss:2.6299 lr:0.00 dt:43ms tok/s:1519158 rem:19s step 15844 (97%) loss:2.6143 lr:0.00 dt:43ms tok/s:1522523 rem:19s step 15845 (97%) loss:2.6230 lr:0.00 dt:43ms tok/s:1524542 rem:19s step 15846 (97%) loss:2.6347 lr:0.00 dt:43ms tok/s:1520864 rem:19s step 15847 (97%) loss:2.6261 lr:0.00 dt:43ms tok/s:1522077 rem:19s step 15848 (97%) loss:2.6260 lr:0.00 dt:43ms tok/s:1521655 rem:19s step 15849 (97%) loss:2.6371 lr:0.00 dt:45ms tok/s:1446391 rem:19s step 15850 (97%) loss:2.6499 lr:0.00 dt:43ms tok/s:1525879 rem:19s step 15851 (97%) loss:2.6469 lr:0.00 dt:43ms tok/s:1516610 rem:19s step 15852 (97%) loss:2.6394 lr:0.00 dt:43ms tok/s:1526828 rem:19s step 15853 (97%) loss:2.6200 lr:0.00 dt:43ms tok/s:1519426 rem:19s step 15854 (97%) loss:2.6223 lr:0.00 dt:43ms tok/s:1511955 rem:19s step 15855 (97%) loss:2.6043 lr:0.00 dt:43ms tok/s:1510136 rem:19s step 15856 (97%) loss:2.6156 lr:0.00 dt:43ms tok/s:1512587 rem:19s step 15857 (97%) loss:2.6089 lr:0.00 dt:43ms tok/s:1513828 rem:19s step 15858 (97%) loss:2.5984 lr:0.00 dt:43ms tok/s:1516367 rem:19s step 15859 (97%) loss:2.6119 lr:0.00 dt:43ms tok/s:1507023 rem:19s step 15860 (97%) loss:2.6067 lr:0.00 dt:44ms tok/s:1504754 rem:18s step 15861 (97%) loss:2.5961 lr:0.00 dt:43ms tok/s:1506700 rem:18s step 15862 (97%) loss:2.5845 lr:0.00 dt:43ms tok/s:1507403 rem:18s step 15863 (97%) loss:2.5969 lr:0.00 dt:44ms tok/s:1506337 rem:18s step 15864 (97%) loss:2.6063 lr:0.00 dt:44ms tok/s:1505290 rem:18s step 15865 (97%) loss:2.6157 lr:0.00 dt:44ms tok/s:1499190 rem:18s step 15866 (97%) loss:2.6162 lr:0.00 dt:43ms tok/s:1508950 rem:18s step 15867 (97%) loss:2.6242 lr:0.00 dt:44ms tok/s:1500474 rem:18s step 15868 (97%) loss:2.6128 lr:0.00 dt:44ms tok/s:1501196 rem:18s step 15869 (97%) loss:2.5952 lr:0.00 dt:44ms tok/s:1501876 rem:18s step 15870 (97%) loss:2.5818 lr:0.00 dt:44ms tok/s:1496472 rem:18s step 15871 (97%) loss:2.5893 lr:0.00 dt:44ms tok/s:1499083 rem:18s step 15872 (97%) loss:2.5746 lr:0.00 dt:44ms tok/s:1494511 rem:18s step 15873 (97%) loss:2.5822 lr:0.00 dt:44ms tok/s:1498234 rem:18s step 15874 (97%) loss:2.5707 lr:0.00 dt:44ms tok/s:1496415 rem:18s step 15875 (97%) loss:2.5943 lr:0.00 dt:44ms tok/s:1488756 rem:18s step 15876 (97%) loss:2.6031 lr:0.00 dt:44ms tok/s:1503010 rem:18s step 15877 (97%) loss:2.6063 lr:0.00 dt:44ms tok/s:1504688 rem:18s step 15878 (97%) loss:2.5994 lr:0.00 dt:44ms tok/s:1498038 rem:18s step 15879 (97%) loss:2.6021 lr:0.00 dt:43ms tok/s:1507031 rem:18s step 15880 (97%) loss:2.6135 lr:0.00 dt:44ms tok/s:1505100 rem:18s step 15881 (97%) loss:2.6233 lr:0.00 dt:44ms tok/s:1500343 rem:18s step 15882 (97%) loss:2.6194 lr:0.00 dt:44ms tok/s:1505380 rem:18s step 15883 (97%) loss:2.6211 lr:0.00 dt:44ms tok/s:1499468 rem:17s step 15884 (97%) loss:2.6167 lr:0.00 dt:44ms tok/s:1497817 rem:17s step 15885 (97%) loss:2.5791 lr:0.00 dt:44ms tok/s:1500090 rem:17s step 15886 (97%) loss:2.5992 lr:0.00 dt:44ms tok/s:1504639 rem:17s step 15887 (97%) loss:2.6098 lr:0.00 dt:44ms tok/s:1499656 rem:17s step 15888 (97%) loss:2.5540 lr:0.00 dt:44ms tok/s:1501540 rem:17s step 15889 (97%) loss:2.5570 lr:0.00 dt:47ms tok/s:1380373 rem:17s step 15890 (97%) loss:2.5500 lr:0.00 dt:43ms tok/s:1507394 rem:17s step 15891 (97%) loss:2.5661 lr:0.00 dt:43ms tok/s:1514229 rem:17s step 15892 (97%) loss:2.5966 lr:0.00 dt:43ms tok/s:1507105 rem:17s step 15893 (97%) loss:2.5915 lr:0.00 dt:43ms tok/s:1513770 rem:17s step 15894 (97%) loss:2.6170 lr:0.00 dt:43ms tok/s:1508660 rem:17s step 15895 (97%) loss:2.6170 lr:0.00 dt:44ms tok/s:1495788 rem:17s step 15896 (97%) loss:2.6253 lr:0.00 dt:44ms tok/s:1503626 rem:17s step 15897 (97%) loss:2.6212 lr:0.00 dt:44ms tok/s:1503454 rem:17s step 15898 (97%) loss:2.6161 lr:0.00 dt:47ms tok/s:1385550 rem:17s step 15899 (97%) loss:2.6206 lr:0.00 dt:43ms tok/s:1513603 rem:17s step 15900 (97%) loss:2.6217 lr:0.00 dt:44ms tok/s:1505999 rem:17s + local: attn=[0.233, 1.161, 1.339] mlp=[1.541, 0.726, -0.653] + + transition: attn=[4.024, 1.202] mlp=[-1.127, 2.609] + + hierarchy: attn=[3.416, 5.939, 5.616] mlp=[3.739, -4.117, -5.180] + step 15901 (97%) loss:2.6168 lr:0.00 dt:43ms tok/s:1507312 rem:17s step 15902 (97%) loss:2.6224 lr:0.00 dt:44ms tok/s:1505149 rem:17s step 15903 (97%) loss:2.6328 lr:0.00 dt:44ms tok/s:1502796 rem:17s step 15904 (97%) loss:2.6445 lr:0.00 dt:44ms tok/s:1505974 rem:17s step 15905 (97%) loss:2.6616 lr:0.00 dt:43ms tok/s:1507378 rem:17s step 15906 (97%) loss:2.6686 lr:0.00 dt:44ms tok/s:1498650 rem:16s step 15907 (97%) loss:2.6671 lr:0.00 dt:44ms tok/s:1497034 rem:16s step 15908 (97%) loss:2.6626 lr:0.00 dt:44ms tok/s:1501310 rem:16s step 15909 (97%) loss:2.6286 lr:0.00 dt:44ms tok/s:1490265 rem:16s step 15910 (97%) loss:2.6069 lr:0.00 dt:44ms tok/s:1499026 rem:16s step 15911 (97%) loss:2.5801 lr:0.00 dt:44ms tok/s:1501474 rem:16s step 15912 (97%) loss:2.5931 lr:0.00 dt:44ms tok/s:1501901 rem:16s step 15913 (97%) loss:2.6107 lr:0.00 dt:44ms tok/s:1499819 rem:16s step 15914 (97%) loss:2.6100 lr:0.00 dt:44ms tok/s:1502090 rem:16s step 15915 (97%) loss:2.6202 lr:0.00 dt:44ms tok/s:1500057 rem:16s step 15916 (97%) loss:2.6175 lr:0.00 dt:44ms tok/s:1505949 rem:16s step 15917 (97%) loss:2.6409 lr:0.00 dt:43ms tok/s:1507791 rem:16s step 15918 (97%) loss:2.6595 lr:0.00 dt:44ms tok/s:1505050 rem:16s step 15919 (97%) loss:2.6632 lr:0.00 dt:44ms tok/s:1504079 rem:16s step 15920 (97%) loss:2.6521 lr:0.00 dt:43ms tok/s:1508048 rem:16s step 15921 (97%) loss:2.6359 lr:0.00 dt:43ms tok/s:1507841 rem:16s step 15922 (97%) loss:2.6251 lr:0.00 dt:44ms tok/s:1503478 rem:16s step 15923 (97%) loss:2.6250 lr:0.00 dt:43ms tok/s:1506585 rem:16s step 15924 (97%) loss:2.6209 lr:0.00 dt:44ms tok/s:1503922 rem:16s step 15925 (97%) loss:2.6324 lr:0.00 dt:44ms tok/s:1505487 rem:16s step 15926 (97%) loss:2.6283 lr:0.00 dt:43ms tok/s:1507626 rem:16s step 15927 (97%) loss:2.6356 lr:0.00 dt:44ms tok/s:1483181 rem:16s step 15928 (97%) loss:2.6296 lr:0.00 dt:43ms tok/s:1510277 rem:16s step 15929 (97%) loss:2.6228 lr:0.00 dt:46ms tok/s:1434240 rem:15s step 15930 (97%) loss:2.6274 lr:0.00 dt:43ms tok/s:1520124 rem:15s step 15931 (97%) loss:2.6531 lr:0.00 dt:43ms tok/s:1521361 rem:15s step 15932 (97%) loss:2.6316 lr:0.00 dt:43ms tok/s:1516074 rem:15s step 15933 (97%) loss:2.6359 lr:0.00 dt:44ms tok/s:1502090 rem:15s step 15934 (97%) loss:2.6415 lr:0.00 dt:44ms tok/s:1505611 rem:15s step 15935 (97%) loss:2.6398 lr:0.00 dt:44ms tok/s:1506535 rem:15s step 15936 (97%) loss:2.6449 lr:0.00 dt:44ms tok/s:1503437 rem:15s step 15937 (97%) loss:2.6553 lr:0.00 dt:43ms tok/s:1509149 rem:15s step 15938 (97%) loss:2.6556 lr:0.00 dt:44ms tok/s:1505842 rem:15s step 15939 (97%) loss:2.6459 lr:0.00 dt:43ms tok/s:1506758 rem:15s step 15940 (97%) loss:2.6615 lr:0.00 dt:43ms tok/s:1524668 rem:15s step 15941 (97%) loss:2.6488 lr:0.00 dt:44ms tok/s:1504029 rem:15s step 15942 (98%) loss:2.6558 lr:0.00 dt:44ms tok/s:1505413 rem:15s step 15943 (98%) loss:2.6523 lr:0.00 dt:44ms tok/s:1505553 rem:15s step 15944 (98%) loss:2.6495 lr:0.00 dt:44ms tok/s:1504770 rem:15s step 15945 (98%) loss:2.6507 lr:0.00 dt:43ms tok/s:1508561 rem:15s step 15946 (98%) loss:2.6588 lr:0.00 dt:44ms tok/s:1501179 rem:15s step 15947 (98%) loss:2.6636 lr:0.00 dt:44ms tok/s:1500040 rem:15s step 15948 (98%) loss:2.6668 lr:0.00 dt:44ms tok/s:1490176 rem:15s step 15949 (98%) loss:2.6556 lr:0.00 dt:44ms tok/s:1496789 rem:15s step 15950 (98%) loss:2.6578 lr:0.00 dt:44ms tok/s:1494316 rem:15s step 15951 (98%) loss:2.6659 lr:0.00 dt:44ms tok/s:1499599 rem:15s step 15952 (98%) loss:2.6611 lr:0.00 dt:45ms tok/s:1464609 rem:14s step 15953 (98%) loss:2.6710 lr:0.00 dt:44ms tok/s:1503782 rem:14s step 15954 (98%) loss:2.6723 lr:0.00 dt:44ms tok/s:1500311 rem:14s step 15955 (98%) loss:2.6788 lr:0.00 dt:44ms tok/s:1501064 rem:14s step 15956 (98%) loss:2.6833 lr:0.00 dt:44ms tok/s:1501852 rem:14s step 15957 (98%) loss:2.6838 lr:0.00 dt:44ms tok/s:1504564 rem:14s step 15958 (98%) loss:2.6848 lr:0.00 dt:44ms tok/s:1501155 rem:14s step 15959 (98%) loss:2.6726 lr:0.00 dt:44ms tok/s:1501015 rem:14s step 15960 (98%) loss:2.6740 lr:0.00 dt:44ms tok/s:1499705 rem:14s step 15961 (98%) loss:2.6803 lr:0.00 dt:44ms tok/s:1502188 rem:14s step 15962 (98%) loss:2.6661 lr:0.00 dt:44ms tok/s:1499124 rem:14s step 15963 (98%) loss:2.6558 lr:0.00 dt:44ms tok/s:1498675 rem:14s step 15964 (98%) loss:2.6475 lr:0.00 dt:43ms tok/s:1507609 rem:14s step 15965 (98%) loss:2.6409 lr:0.00 dt:44ms tok/s:1504235 rem:14s step 15966 (98%) loss:2.6283 lr:0.00 dt:44ms tok/s:1502656 rem:14s step 15967 (98%) loss:2.6247 lr:0.00 dt:46ms tok/s:1439031 rem:14s step 15968 (98%) loss:2.6343 lr:0.00 dt:43ms tok/s:1514087 rem:14s step 15969 (98%) loss:2.6491 lr:0.00 dt:43ms tok/s:1524001 rem:14s step 15970 (98%) loss:2.6478 lr:0.00 dt:43ms tok/s:1515222 rem:14s step 15971 (98%) loss:2.6430 lr:0.00 dt:43ms tok/s:1515673 rem:14s step 15972 (98%) loss:2.6335 lr:0.00 dt:43ms tok/s:1511398 rem:14s step 15973 (98%) loss:2.6495 lr:0.00 dt:43ms tok/s:1512520 rem:14s step 15974 (98%) loss:2.6633 lr:0.00 dt:43ms tok/s:1511148 rem:14s step 15975 (98%) loss:2.6456 lr:0.00 dt:43ms tok/s:1511880 rem:13s step 15976 (98%) loss:2.6591 lr:0.00 dt:43ms tok/s:1514562 rem:13s step 15977 (98%) loss:2.6629 lr:0.00 dt:43ms tok/s:1507204 rem:13s step 15978 (98%) loss:2.6417 lr:0.00 dt:44ms tok/s:1492385 rem:13s step 15979 (98%) loss:2.6184 lr:0.00 dt:44ms tok/s:1504688 rem:13s step 15980 (98%) loss:2.6199 lr:0.00 dt:44ms tok/s:1504713 rem:13s step 15981 (98%) loss:2.6288 lr:0.00 dt:43ms tok/s:1508462 rem:13s step 15982 (98%) loss:2.6450 lr:0.00 dt:43ms tok/s:1510285 rem:13s step 15983 (98%) loss:2.6301 lr:0.00 dt:43ms tok/s:1506676 rem:13s step 15984 (98%) loss:2.6227 lr:0.00 dt:44ms tok/s:1505496 rem:13s step 15985 (98%) loss:2.6270 lr:0.00 dt:44ms tok/s:1502098 rem:13s step 15986 (98%) loss:2.6369 lr:0.00 dt:44ms tok/s:1506519 rem:13s step 15987 (98%) loss:2.6204 lr:0.00 dt:44ms tok/s:1503100 rem:13s step 15988 (98%) loss:2.6152 lr:0.00 dt:43ms tok/s:1507758 rem:13s step 15989 (98%) loss:2.5958 lr:0.00 dt:44ms tok/s:1503577 rem:13s step 15990 (98%) loss:2.6008 lr:0.00 dt:44ms tok/s:1498830 rem:13s step 15991 (98%) loss:2.6160 lr:0.00 dt:44ms tok/s:1488651 rem:13s step 15992 (98%) loss:2.6287 lr:0.00 dt:44ms tok/s:1504202 rem:13s step 15993 (98%) loss:2.6378 lr:0.00 dt:44ms tok/s:1505834 rem:13s step 15994 (98%) loss:2.6320 lr:0.00 dt:43ms tok/s:1509149 rem:13s step 15995 (98%) loss:2.6147 lr:0.00 dt:44ms tok/s:1498078 rem:13s step 15996 (98%) loss:2.6108 lr:0.00 dt:44ms tok/s:1502040 rem:13s step 15997 (98%) loss:2.6218 lr:0.00 dt:44ms tok/s:1501745 rem:13s step 15998 (98%) loss:2.5906 lr:0.00 dt:44ms tok/s:1505867 rem:12s step 15999 (98%) loss:2.5918 lr:0.00 dt:43ms tok/s:1506849 rem:12s step 16000 (98%) loss:2.6098 lr:0.00 dt:44ms tok/s:1506560 rem:12s + local: attn=[0.235, 1.161, 1.337] mlp=[1.542, 0.727, -0.653] + + transition: attn=[4.026, 1.201] mlp=[-1.129, 2.614] + + hierarchy: attn=[3.414, 5.939, 5.616] mlp=[3.741, -4.125, -5.180] + step 16001 (98%) loss:2.6156 lr:0.00 dt:44ms tok/s:1506477 rem:12s step 16002 (98%) loss:2.6324 lr:0.00 dt:43ms tok/s:1506808 rem:12s step 16003 (98%) loss:2.7177 lr:0.00 dt:44ms tok/s:1503462 rem:12s step 16004 (98%) loss:2.7087 lr:0.00 dt:43ms tok/s:1508197 rem:12s step 16005 (98%) loss:2.7066 lr:0.00 dt:44ms tok/s:1506073 rem:12s step 16006 (98%) loss:2.7119 lr:0.00 dt:43ms tok/s:1507932 rem:12s step 16007 (98%) loss:2.6978 lr:0.00 dt:43ms tok/s:1507576 rem:12s step 16008 (98%) loss:2.6892 lr:0.00 dt:52ms tok/s:1271423 rem:12s step 16009 (98%) loss:2.6831 lr:0.00 dt:46ms tok/s:1431783 rem:12s step 16010 (98%) loss:2.6867 lr:0.00 dt:45ms tok/s:1445303 rem:12s step 16011 (98%) loss:2.6836 lr:0.00 dt:45ms tok/s:1455767 rem:12s step 16012 (98%) loss:2.6782 lr:0.00 dt:45ms tok/s:1466626 rem:12s step 16013 (98%) loss:2.6601 lr:0.00 dt:45ms tok/s:1449984 rem:12s step 16014 (98%) loss:2.6477 lr:0.00 dt:45ms tok/s:1442338 rem:12s step 16015 (98%) loss:2.6329 lr:0.00 dt:45ms tok/s:1443952 rem:12s step 16016 (98%) loss:2.6267 lr:0.00 dt:46ms tok/s:1438263 rem:12s step 16017 (98%) loss:2.6371 lr:0.00 dt:46ms tok/s:1427811 rem:12s step 16018 (98%) loss:2.6386 lr:0.00 dt:45ms tok/s:1442164 rem:12s step 16019 (98%) loss:2.6349 lr:0.00 dt:45ms tok/s:1440751 rem:12s step 16020 (98%) loss:2.6328 lr:0.00 dt:44ms tok/s:1503519 rem:11s step 16021 (98%) loss:2.6204 lr:0.00 dt:43ms tok/s:1522903 rem:11s step 16022 (98%) loss:2.6191 lr:0.00 dt:43ms tok/s:1537598 rem:11s step 16023 (98%) loss:2.6068 lr:0.00 dt:42ms tok/s:1543235 rem:11s step 16024 (98%) loss:2.5996 lr:0.00 dt:42ms tok/s:1546926 rem:11s step 16025 (98%) loss:2.5815 lr:0.00 dt:42ms tok/s:1545960 rem:11s step 16026 (98%) loss:2.5866 lr:0.00 dt:42ms tok/s:1553685 rem:11s step 16027 (98%) loss:2.5863 lr:0.00 dt:43ms tok/s:1533994 rem:11s step 16028 (98%) loss:2.5764 lr:0.00 dt:43ms tok/s:1537280 rem:11s step 16029 (98%) loss:2.5660 lr:0.00 dt:43ms tok/s:1531729 rem:11s step 16030 (98%) loss:2.5520 lr:0.00 dt:42ms tok/s:1550713 rem:11s step 16031 (98%) loss:2.5461 lr:0.00 dt:42ms tok/s:1555188 rem:11s step 16032 (98%) loss:2.5500 lr:0.00 dt:43ms tok/s:1537460 rem:11s step 16033 (98%) loss:2.5575 lr:0.00 dt:43ms tok/s:1536154 rem:11s step 16034 (98%) loss:2.5696 lr:0.00 dt:43ms tok/s:1533600 rem:11s step 16035 (98%) loss:2.5760 lr:0.00 dt:43ms tok/s:1534011 rem:11s step 16036 (98%) loss:2.5870 lr:0.00 dt:42ms tok/s:1552360 rem:11s step 16037 (98%) loss:2.5902 lr:0.00 dt:43ms tok/s:1538260 rem:11s step 16038 (98%) loss:2.5852 lr:0.00 dt:43ms tok/s:1532916 rem:11s step 16039 (98%) loss:2.5793 lr:0.00 dt:43ms tok/s:1534148 rem:11s step 16040 (98%) loss:2.6037 lr:0.00 dt:43ms tok/s:1530509 rem:11s step 16041 (98%) loss:2.5916 lr:0.00 dt:43ms tok/s:1526701 rem:11s step 16042 (98%) loss:2.5975 lr:0.00 dt:43ms tok/s:1532822 rem:11s step 16043 (98%) loss:2.6068 lr:0.00 dt:43ms tok/s:1529649 rem:11s step 16044 (98%) loss:2.6205 lr:0.00 dt:43ms tok/s:1524931 rem:10s step 16045 (98%) loss:2.6357 lr:0.00 dt:42ms tok/s:1543018 rem:10s step 16046 (98%) loss:2.6423 lr:0.00 dt:43ms tok/s:1539208 rem:10s step 16047 (98%) loss:2.6498 lr:0.00 dt:42ms tok/s:1542473 rem:10s step 16048 (98%) loss:2.6659 lr:0.00 dt:42ms tok/s:1543512 rem:10s step 16049 (98%) loss:2.6670 lr:0.00 dt:42ms tok/s:1545030 rem:10s step 16050 (98%) loss:2.6578 lr:0.00 dt:42ms tok/s:1542620 rem:10s step 16051 (98%) loss:2.6437 lr:0.00 dt:42ms tok/s:1545195 rem:10s step 16052 (98%) loss:2.6027 lr:0.00 dt:43ms tok/s:1539234 rem:10s step 16053 (98%) loss:2.6052 lr:0.00 dt:43ms tok/s:1536446 rem:10s step 16054 (98%) loss:2.6201 lr:0.00 dt:43ms tok/s:1541089 rem:10s step 16055 (98%) loss:2.6399 lr:0.00 dt:43ms tok/s:1536970 rem:10s step 16056 (98%) loss:2.6641 lr:0.00 dt:43ms tok/s:1536790 rem:10s step 16057 (98%) loss:2.6671 lr:0.00 dt:43ms tok/s:1538458 rem:10s step 16058 (98%) loss:2.6729 lr:0.00 dt:43ms tok/s:1538639 rem:10s step 16059 (98%) loss:2.6681 lr:0.00 dt:43ms tok/s:1539527 rem:10s step 16060 (98%) loss:2.6600 lr:0.00 dt:43ms tok/s:1537142 rem:10s step 16061 (98%) loss:2.6713 lr:0.00 dt:43ms tok/s:1538992 rem:10s step 16062 (98%) loss:2.6430 lr:0.00 dt:43ms tok/s:1525091 rem:10s step 16063 (98%) loss:2.6454 lr:0.00 dt:43ms tok/s:1514120 rem:10s step 16064 (98%) loss:2.6319 lr:0.00 dt:43ms tok/s:1521605 rem:10s step 16065 (98%) loss:2.6139 lr:0.00 dt:43ms tok/s:1536369 rem:10s step 16066 (98%) loss:2.6139 lr:0.00 dt:46ms tok/s:1415459 rem:10s step 16067 (98%) loss:2.6068 lr:0.00 dt:43ms tok/s:1532950 rem:9s step 16068 (98%) loss:2.6176 lr:0.00 dt:43ms tok/s:1524212 rem:9s step 16069 (98%) loss:2.6116 lr:0.00 dt:43ms tok/s:1522439 rem:9s step 16070 (98%) loss:2.6037 lr:0.00 dt:43ms tok/s:1535605 rem:9s step 16071 (98%) loss:2.6299 lr:0.00 dt:43ms tok/s:1530339 rem:9s step 16072 (98%) loss:2.6123 lr:0.00 dt:43ms tok/s:1529751 rem:9s step 16073 (98%) loss:2.6520 lr:0.00 dt:43ms tok/s:1527745 rem:9s step 16074 (98%) loss:2.6862 lr:0.00 dt:43ms tok/s:1527099 rem:9s step 16075 (98%) loss:2.6512 lr:0.00 dt:43ms tok/s:1528509 rem:9s step 16076 (98%) loss:2.6436 lr:0.00 dt:43ms tok/s:1525963 rem:9s step 16077 (98%) loss:2.6397 lr:0.00 dt:43ms tok/s:1527940 rem:9s step 16078 (98%) loss:2.6325 lr:0.00 dt:43ms tok/s:1509166 rem:9s step 16079 (98%) loss:2.6314 lr:0.00 dt:43ms tok/s:1527787 rem:9s step 16080 (99%) loss:2.6396 lr:0.00 dt:43ms tok/s:1526455 rem:9s step 16081 (99%) loss:2.6398 lr:0.00 dt:43ms tok/s:1524643 rem:9s step 16082 (99%) loss:2.6570 lr:0.00 dt:43ms tok/s:1527770 rem:9s step 16083 (99%) loss:2.6626 lr:0.00 dt:43ms tok/s:1525701 rem:9s step 16084 (99%) loss:2.6630 lr:0.00 dt:47ms tok/s:1384587 rem:9s step 16085 (99%) loss:2.6548 lr:0.00 dt:43ms tok/s:1532711 rem:9s step 16086 (99%) loss:2.6514 lr:0.00 dt:43ms tok/s:1523004 rem:9s step 16087 (99%) loss:2.6169 lr:0.00 dt:43ms tok/s:1531780 rem:9s step 16088 (99%) loss:2.6234 lr:0.00 dt:43ms tok/s:1534619 rem:9s step 16089 (99%) loss:2.6265 lr:0.00 dt:43ms tok/s:1527626 rem:9s step 16090 (99%) loss:2.6309 lr:0.00 dt:43ms tok/s:1529351 rem:8s step 16091 (99%) loss:2.6364 lr:0.00 dt:43ms tok/s:1529998 rem:8s step 16092 (99%) loss:2.6254 lr:0.00 dt:43ms tok/s:1519519 rem:8s step 16093 (99%) loss:2.6181 lr:0.00 dt:42ms tok/s:1552167 rem:8s step 16094 (99%) loss:2.6018 lr:0.00 dt:41ms tok/s:1587092 rem:8s step 16095 (99%) loss:2.6028 lr:0.00 dt:41ms tok/s:1582651 rem:8s step 16096 (99%) loss:2.6011 lr:0.00 dt:42ms tok/s:1573500 rem:8s step 16097 (99%) loss:2.6176 lr:0.00 dt:42ms tok/s:1574311 rem:8s step 16098 (99%) loss:2.6181 lr:0.00 dt:41ms tok/s:1585453 rem:8s step 16099 (99%) loss:2.6061 lr:0.00 dt:42ms tok/s:1569968 rem:8s step 16100 (99%) loss:2.6008 lr:0.00 dt:41ms tok/s:1587550 rem:8s + local: attn=[0.235, 1.162, 1.337] mlp=[1.542, 0.728, -0.653] + + transition: attn=[4.026, 1.200] mlp=[-1.130, 2.616] + + hierarchy: attn=[3.414, 5.939, 5.616] mlp=[3.742, -4.128, -5.180] + step 16101 (99%) loss:2.6009 lr:0.00 dt:43ms tok/s:1521689 rem:8s step 16102 (99%) loss:2.6151 lr:0.00 dt:41ms tok/s:1611743 rem:8s step 16103 (99%) loss:2.6194 lr:0.00 dt:41ms tok/s:1607210 rem:8s step 16104 (99%) loss:2.6047 lr:0.00 dt:41ms tok/s:1598406 rem:8s step 16105 (99%) loss:2.6171 lr:0.00 dt:41ms tok/s:1589753 rem:8s step 16106 (99%) loss:2.6049 lr:0.00 dt:41ms tok/s:1598044 rem:8s step 16107 (99%) loss:2.6055 lr:0.00 dt:43ms tok/s:1509132 rem:8s step 16108 (99%) loss:2.6003 lr:0.00 dt:46ms tok/s:1416021 rem:8s step 16109 (99%) loss:2.6028 lr:0.00 dt:42ms tok/s:1559573 rem:8s step 16110 (99%) loss:2.6219 lr:0.00 dt:43ms tok/s:1525185 rem:8s step 16111 (99%) loss:2.6425 lr:0.00 dt:42ms tok/s:1576560 rem:8s step 16112 (99%) loss:2.6487 lr:0.00 dt:42ms tok/s:1568356 rem:8s step 16113 (99%) loss:2.6466 lr:0.00 dt:47ms tok/s:1405637 rem:8s step 16114 (99%) loss:2.6672 lr:0.00 dt:43ms tok/s:1522389 rem:7s step 16115 (99%) loss:2.6740 lr:0.00 dt:42ms tok/s:1576876 rem:7s step 16116 (99%) loss:2.6627 lr:0.00 dt:42ms tok/s:1576487 rem:7s step 16117 (99%) loss:2.6710 lr:0.00 dt:42ms tok/s:1544839 rem:7s step 16118 (99%) loss:2.6661 lr:0.00 dt:47ms tok/s:1408301 rem:7s step 16119 (99%) loss:2.6639 lr:0.00 dt:42ms tok/s:1569063 rem:7s step 16120 (99%) loss:2.6639 lr:0.00 dt:42ms tok/s:1556156 rem:7s step 16121 (99%) loss:2.6258 lr:0.00 dt:45ms tok/s:1452306 rem:7s step 16122 (99%) loss:2.6196 lr:0.00 dt:45ms tok/s:1455859 rem:7s step 16123 (99%) loss:2.7405 lr:0.00 dt:42ms tok/s:1546647 rem:7s step 16124 (99%) loss:2.7961 lr:0.00 dt:42ms tok/s:1569117 rem:7s step 16125 (99%) loss:2.7789 lr:0.00 dt:42ms tok/s:1559564 rem:7s step 16126 (99%) loss:2.7560 lr:0.00 dt:42ms tok/s:1570605 rem:7s step 16127 (99%) loss:2.7364 lr:0.00 dt:44ms tok/s:1494885 rem:7s step 16128 (99%) loss:2.7315 lr:0.00 dt:45ms tok/s:1467738 rem:7s step 16129 (99%) loss:2.7106 lr:0.00 dt:42ms tok/s:1560423 rem:7s step 16130 (99%) loss:2.7192 lr:0.00 dt:42ms tok/s:1553000 rem:7s step 16131 (99%) loss:2.7043 lr:0.00 dt:43ms tok/s:1531413 rem:7s step 16132 (99%) loss:2.7126 lr:0.00 dt:45ms tok/s:1454942 rem:7s step 16133 (99%) loss:2.6935 lr:0.00 dt:47ms tok/s:1401237 rem:7s step 16134 (99%) loss:2.6811 lr:0.00 dt:46ms tok/s:1421145 rem:7s step 16135 (99%) loss:2.6840 lr:0.00 dt:42ms tok/s:1545404 rem:7s step 16136 (99%) loss:2.6712 lr:0.00 dt:43ms tok/s:1515966 rem:7s step 16137 (99%) loss:2.6686 lr:0.00 dt:43ms tok/s:1535751 rem:6s step 16138 (99%) loss:2.6527 lr:0.00 dt:47ms tok/s:1382449 rem:6s step 16139 (99%) loss:2.6506 lr:0.00 dt:43ms tok/s:1535356 rem:6s + +[POST] training done: 16140 steps, 1057.8M tokens + +[POST] EMA: swapped 31 params + +[POST] checkpoint saved + +[POST] artifact: 12134416 bytes, headroom: 3791736 + +[POST] model reloaded from artifact + +[POST] running eval: base_int8 + N-gram cache built: 5.0M tokens, contexts: [871, 101782, 200000, 200000], time: 7.6s + +[POST] compile warmup done + N-gram GPU tables: [871, 101782, 200000, 200000] + +[POST] base_int8 val_bpb:1.5976 time:59s +[EVAL-TTT] wall hit at 595s + +[POST] rank 0: base_int8 = 1.5976 (59s) + +[POST] rank 1: ttt_lr5e-4_2s = 1.6217 (459s) + +[POST] rank 2: ttt_lr1e-3_2s = 1.6714 (459s) + +[POST] rank 3: ttt_lr1e-3_1s = 1.6500 (262s) + +[POST] rank 4: ttt_lr5e-4_1s = 1.5918 (262s) + +[POST] rank 5: ttt_lr7e-4_2s = 1.6402 (457s) + +[POST] rank 6: ttt_lr5e-4_3s = 1.6486 (595s) + +[POST] rank 7: ngram_blend = 1.7386 (69s) +final_int8_zlib_roundtrip val_bpb:1.5918 +final_int8_zlib_roundtrip_exact val_bpb:1.59179616 +val_bpb:1.5918 + +[POST] BEST: ttt_lr5e-4_1s val_bpb:1.5918 + +[POST] DONE — clean exit