- Status: Proposed
- Date: 2026-03-12
- Deciders: ruv
- Tags: wasm, browser, cnn, pose-estimation, ruvector, video, multimodal, fusion
WiFi-DensePose estimates human poses from WiFi CSI (Channel State Information).
The ruvector-cnn crate provides a pure Rust CNN (MobileNet-V3) with WASM bindings.
Both modalities exist independently — what's missing is fusing live webcam video
with WiFi CSI in a single browser demo to achieve robust pose estimation that
works even when one modality degrades (occlusion, signal noise, poor lighting).
Existing assets:
- wifi-densepose-wasm: CSI signal processing compiled to WASM
- wifi-densepose-sensing-server: Axum server streaming live CSI via WebSocket
- ruvector-cnn: Pure Rust CNN with MobileNet-V3 backbones, SIMD, contrastive learning
- ruvector-cnn-wasm: wasm-bindgen bindings (WasmCnnEmbedder, SimdOps, LayerOps, contrastive losses)
- vendor/ruvector/examples/wasm-vanilla/: Reference vanilla JS WASM example
Research shows multi-modal fusion (camera + WiFi) significantly outperforms either alone:
- Camera fails under occlusion, poor lighting, privacy constraints
- WiFi CSI fails with signal noise, multipath, low spatial resolution
- Fusion compensates: WiFi provides through-wall coverage, camera provides fine-grained detail
Build a dual-modal browser demo at examples/wasm-browser-pose/ that:
- Captures live webcam video via the getUserMedia API
- Receives live WiFi CSI via WebSocket from the sensing server
- Processes both streams through separate CNN pipelines in ruvector-cnn-wasm
- Fuses embeddings with learned attention weights for combined pose estimation
- Renders video overlay with skeleton + WiFi confidence heatmap on Canvas
- Runs entirely in the browser — all inference client-side via WASM
┌──────────────────────────────────────────────────────────────────┐
│ Browser │
│ │
│ ┌────────────┐ ┌────────────────┐ ┌───────────────────┐ │
│ │ getUserMedia│───▶│ Video Frame │───▶│ CNN WASM │ │
│ │ (Webcam) │ │ Capture │ │ (Visual Embedder) │ │
│ └────────────┘ │ 224×224 RGB │ │ → 512-dim │ │
│ └────────────────┘ └────────┬──────────┘ │
│ │ │
│ visual_embedding │
│ │ │
│ ┌──────▼──────┐ │
│ ┌────────────┐ ┌────────────────┐ │ │ │
│ │ WebSocket │───▶│ CSI WASM │ │ Attention │ │
│ │ Client │ │ (densepose- │ │ Fusion │ │
│ │ │ │ wasm) │ │ Module │ │
│ └────────────┘ └───────┬────────┘ │ │ │
│ │ └──────┬──────┘ │
│ ┌───────▼────────┐ │ │
│ │ CNN WASM │ fused_embedding │
│ │ (CSI Embedder) │ │ │
│ │ → 512-dim │ ┌──────▼──────┐ │
│ └───────┬────────┘ │ Pose │ │
│ │ │ Decoder │ │
│ csi_embedding │ → 17 kpts │ │
│ │ └──────┬──────┘ │
│ └──────────────────────┘ │
│ │ │
│ ┌──────────────┐ ┌─────▼──────┐ │
│ │ Video Canvas │◀────────│ Overlay │ │
│ │ + Skeleton │ │ Renderer │ │
│ │ + Heatmap │ └────────────┘ │
│ └──────────────┘ │
│ │
└──────────────────────────────────────────────────────────────────┘
▲ ▲
│ getUserMedia │ WebSocket
│ (camera) │ (ws://host:3030/ws/csi)
│ │
┌────┴────┐ ┌───────┴─────────┐
│ Webcam │ │ Sensing Server │
└─────────┘ └─────────────────┘
Two parallel CNN pipelines run on each frame tick (~30 FPS):
| Pipeline | Input | Preprocessing | CNN Config | Output |
|---|---|---|---|---|
| Visual | Webcam frame (640×480) | Resize to 224×224 RGB, ImageNet normalize | MobileNet-V3 Small, 512-dim | Visual embedding |
| CSI | CSI frame (ADR-018 binary) | Amplitude/phase/delta → 224×224 pseudo-RGB | MobileNet-V3 Small, 512-dim | CSI embedding |
Both use the same WasmCnnEmbedder but with separate instances and weight sets.
Learned attention-weighted fusion combines the two 512-dim embeddings:
// Attention fusion: learn which modality to trust per-dimension
// α ∈ [0,1]^512 — attention weights (shipped as JSON, trained offline)
// visual_emb, csi_emb ∈ R^512
function fuseEmbeddings(visual_emb, csi_emb, attention_weights) {
const fused = new Float32Array(512);
for (let i = 0; i < 512; i++) {
const α = attention_weights[i];
fused[i] = α * visual_emb[i] + (1 - α) * csi_emb[i];
}
return fused;
}

Dynamic confidence gating adjusts fusion based on signal quality:
| Condition | Behavior |
|---|---|
| Good video + good CSI | Balanced fusion (α ≈ 0.5) |
| Poor lighting / occlusion | CSI-dominant (α → 0, WiFi takes over) |
| CSI noise / no ESP32 | Video-dominant (α → 1, camera only) |
| Video-only mode (no WiFi) | α = 1.0, pure visual CNN pose estimation |
| CSI-only mode (no camera) | α = 0.0, pure WiFi pose estimation |
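
One way to realise the gating table above is to rescale the learned weights by a per-modality quality score. The sketch below assumes both scores are already normalised to [0, 1]; `gateAttention` and its parameters are hypothetical names, not part of the existing crates, and the reweighting rule is one plausible choice rather than the trained behaviour.

```javascript
// Sketch only: gateAttention and its quality inputs are hypothetical.

/**
 * Rescale learned per-dimension attention weights by modality quality.
 * @param {Float32Array} learned  learned alpha values in [0, 1]
 * @param {number} videoQuality   0 (unusable) .. 1 (good)
 * @param {number} csiQuality     0 (unusable) .. 1 (good)
 * @returns {Float32Array} gated alpha values in [0, 1]
 */
function gateAttention(learned, videoQuality, csiQuality) {
  const qv = Math.min(1, Math.max(0, videoQuality));
  const qc = Math.min(1, Math.max(0, csiQuality));
  const gated = new Float32Array(learned.length);
  for (let i = 0; i < learned.length; i++) {
    const v = learned[i] * qv;       // weight left for the visual embedding
    const c = (1 - learned[i]) * qc; // weight left for the CSI embedding
    if (v + c > 0) {
      gated[i] = v / (v + c);
    } else if (qv + qc > 0) {
      gated[i] = qv / (qv + qc);     // learned weight points at a dead stream
    } else {
      gated[i] = learned[i];         // both streams unusable: keep learned value
    }
  }
  return gated;
}
```

With equal quality scores the learned weights pass through unchanged; a video score of 0 drives every weight to 0 (CSI only), and a CSI score of 0 drives every weight to 1 (video only), matching the rows of the table.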
Quality detection:
- Video quality: Frame brightness variance (dark = low quality), motion blur score
- CSI quality: Signal-to-noise ratio from wifi-densepose-wasm, coherence gate output
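
A minimal sketch of the video-side score, assuming the frame is an interleaved RGB byte array like the one passed to the visual embedder. The function names and both thresholds (`minMean`, `minStdDev`) are illustrative assumptions that would need tuning; the motion blur score is not covered here.

```javascript
// Sketch only: names and thresholds are hypothetical, not tuned values.

// Mean and variance of per-pixel luma (Rec. 601 weights) for an RGB buffer.
function brightnessStats(rgb) {
  const pixels = rgb.length / 3;
  let sum = 0;
  let sumSq = 0;
  for (let i = 0; i < rgb.length; i += 3) {
    const y = 0.299 * rgb[i] + 0.587 * rgb[i + 1] + 0.114 * rgb[i + 2];
    sum += y;
    sumSq += y * y;
  }
  const mean = sum / pixels;
  const variance = Math.max(0, sumSq / pixels - mean * mean);
  return { mean, variance };
}

// Quality in [0, 1]: low for dark frames and for flat, low-contrast frames.
function videoQualityScore(rgb, minMean = 40, minStdDev = 20) {
  if (rgb.length === 0) return 0;
  const { mean, variance } = brightnessStats(rgb);
  const exposure = Math.min(1, mean / minMean);
  const contrast = Math.min(1, Math.sqrt(variance) / minStdDev);
  return exposure * contrast;
}
```

The resulting score can feed a gating step directly as the video quality input.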
CSI data encoded as 3-channel pseudo-image for the CSI CNN pipeline:
| Channel | Data | Normalization |
|---|---|---|
| R | CSI amplitude (subcarrier × time window) | Min-max to [0, 255] |
| G | CSI phase (unwrapped, subcarrier × time window) | Min-max to [0, 255] |
| B | Temporal difference (frame-to-frame Δ amplitude) | Abs, min-max to [0, 255] |
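
The channel mapping above can be sketched as follows. This assumes the amplitude and (already unwrapped) phase windows arrive as flat arrays of equal length in subcarrier × time order, and that the previous window's amplitude is available for the temporal difference. `encodeCsiPseudoImage` and `minMaxToBytes` are hypothetical helpers; resizing to 224×224 is left out.

```javascript
// Sketch only: helper names and the flat input layout are assumptions.

// Min-max normalise values to bytes in 0..255; a flat channel maps to zeros.
function minMaxToBytes(values) {
  let min = Infinity;
  let max = -Infinity;
  for (const v of values) {
    if (v < min) min = v;
    if (v > max) max = v;
  }
  const out = new Uint8Array(values.length);
  const range = max - min;
  if (!(range > 0)) return out;
  for (let i = 0; i < values.length; i++) {
    out[i] = Math.round(((values[i] - min) / range) * 255);
  }
  return out;
}

// Pack amplitude (R), phase (G) and |delta amplitude| (B) as interleaved RGB.
function encodeCsiPseudoImage(amplitude, phase, prevAmplitude) {
  const n = amplitude.length;
  if (phase.length !== n || prevAmplitude.length !== n) {
    throw new Error("CSI channels must have equal length");
  }
  const delta = new Float32Array(n);
  for (let i = 0; i < n; i++) {
    delta[i] = Math.abs(amplitude[i] - prevAmplitude[i]);
  }
  const r = minMaxToBytes(amplitude);
  const g = minMaxToBytes(phase);
  const b = minMaxToBytes(delta);
  const rgb = new Uint8Array(n * 3);
  for (let i = 0; i < n; i++) {
    rgb[3 * i] = r[i];
    rgb[3 * i + 1] = g[i];
    rgb[3 * i + 2] = b[i];
  }
  return rgb;
}
```

Normalising each channel per window keeps the CNN input range stable, at the cost of discarding absolute signal strength; whether that trade-off is acceptable depends on how the CSI weights are trained.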
Webcam frames processed through standard ImageNet pipeline:
// Capture frame from video element
const frame = captureVideoFrame(videoElement, 224, 224); // Returns Uint8Array RGB
// ImageNet normalization happens inside WasmCnnEmbedder.extract()
const visual_embedding = visual_embedder.extract(frame, 224, 224);

17 COCO-format keypoints decoded from the fused 512-dim embedding:
0: nose 1: left_eye 2: right_eye
3: left_ear 4: right_ear 5: left_shoulder
6: right_shoulder 7: left_elbow 8: right_elbow
9: left_wrist 10: right_wrist 11: left_hip
12: right_hip 13: left_knee 14: right_knee
15: left_ankle 16: right_ankle
Each keypoint is decoded as (x, y, confidence), giving 17 × 3 = 51 values produced from the 512-dim embedding by a learned linear projection.
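
The projection can be sketched as a single matrix-vector product. The weight layout here (a row-major 51 × 512 matrix plus a 51-element bias, with outputs ordered x, y, confidence per keypoint) is an assumption about pose-weights.json, and `decodePose` is a hypothetical name.

```javascript
// Sketch only: the weight layout and function name are assumptions.
const COCO_KEYPOINTS = [
  "nose", "left_eye", "right_eye", "left_ear", "right_ear",
  "left_shoulder", "right_shoulder", "left_elbow", "right_elbow",
  "left_wrist", "right_wrist", "left_hip", "right_hip",
  "left_knee", "right_knee", "left_ankle", "right_ankle",
];

// Linear projection: raw = W * embedding + bias, then grouped per keypoint.
function decodePose(embedding, weights, bias) {
  const outputs = COCO_KEYPOINTS.length * 3; // 17 keypoints x (x, y, confidence)
  const dim = embedding.length;
  if (weights.length !== outputs * dim || bias.length !== outputs) {
    throw new Error("weights must be 51 x dim and bias must have 51 entries");
  }
  const raw = new Float32Array(outputs);
  for (let row = 0; row < outputs; row++) {
    let acc = bias[row];
    const offset = row * dim;
    for (let col = 0; col < dim; col++) {
      acc += weights[offset + col] * embedding[col];
    }
    raw[row] = acc;
  }
  return COCO_KEYPOINTS.map((name, k) => ({
    name,
    x: raw[3 * k],
    y: raw[3 * k + 1],
    confidence: raw[3 * k + 2],
  }));
}
```

No activation is applied, so x, y and confidence are whatever range the trained projection produces; a renderer would likely clamp or rescale them before drawing.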
The demo supports three modes, selectable in the UI:
| Mode | Video | CSI | Fusion | Use Case |
|---|---|---|---|---|
| Dual (default) | ✅ | ✅ | Attention-weighted | Best accuracy, full demo |
| Video Only | ✅ | ❌ | α = 1.0 | No ESP32 available, quick demo |
| CSI Only | ❌ | ✅ | α = 0.0 | Privacy mode, through-wall sensing |
Video Only mode works without any hardware — just a webcam — making the demo instantly accessible for anyone wanting to try it.
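
As a sketch, the three modes reduce to a small lookup that says which pipelines run and whether the learned attention weights are overridden. `MODES` and `attentionForMode` are hypothetical names chosen for illustration.

```javascript
// Sketch only: the mode keys and helper name are assumptions.
const MODES = {
  dual:  { video: true,  csi: true,  fixedAlpha: null }, // use learned weights
  video: { video: true,  csi: false, fixedAlpha: 1.0 },
  csi:   { video: false, csi: true,  fixedAlpha: 0.0 },
};

// Return the attention vector to use for the selected mode.
function attentionForMode(mode, learned) {
  if (!Object.prototype.hasOwnProperty.call(MODES, mode)) {
    throw new Error(`Unknown mode: ${mode}`);
  }
  const { fixedAlpha } = MODES[mode];
  if (fixedAlpha === null) return learned;
  return new Float32Array(learned.length).fill(fixedAlpha);
}
```

In a single-modality mode the unused pipeline can be skipped entirely, since its embedding is multiplied by zero in the fusion step.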
examples/wasm-browser-pose/
├── index.html # Single-page app (vanilla JS, no bundler)
├── js/
│ ├── app.js # Main entry, mode selection, orchestration
│ ├── video-capture.js # getUserMedia, frame extraction, quality detection
│ ├── csi-processor.js # WebSocket CSI client, frame parsing, pseudo-image encoding
│ ├── fusion.js # Attention-weighted embedding fusion, confidence gating
│ ├── pose-decoder.js # Fused embedding → 17 keypoints
│ └── canvas-renderer.js # Video overlay, skeleton, CSI heatmap, confidence bars
├── data/
│ ├── visual-weights.json # Visual CNN → embedding projection (placeholder until trained)
│ ├── csi-weights.json # CSI CNN → embedding projection (placeholder until trained)
│ ├── fusion-weights.json # Attention fusion α weights (512 values)
│ └── pose-weights.json # Fused embedding → keypoint projection
├── css/
│ └── style.css # Dark theme UI styling
├── pkg/ # Built WASM packages (gitignored, built by script)
│ ├── wifi_densepose_wasm/
│ └── ruvector_cnn_wasm/
├── build.sh # wasm-pack build script for both packages
└── README.md # Setup and usage instructions
#!/bin/bash
# build.sh — builds both WASM packages into pkg/
set -e
# Build wifi-densepose-wasm (CSI processing)
wasm-pack build ../../rust-port/wifi-densepose-rs/crates/wifi-densepose-wasm \
--target web --out-dir "$(pwd)/pkg/wifi_densepose_wasm" --no-typescript
# Build ruvector-cnn-wasm (CNN inference for both video and CSI)
wasm-pack build ../../vendor/ruvector/crates/ruvector-cnn-wasm \
--target web --out-dir "$(pwd)/pkg/ruvector_cnn_wasm" --no-typescript
echo "Build complete. Serve with: python3 -m http.server 8080"

┌─────────────────────────────────────────────────────────┐
│ WiFi-DensePose — Live Dual-Modal Pose Estimation │
│ [Dual Mode ▼] [⚙ Settings] FPS: 28 ◉ Live │
├───────────────────────────┬─────────────────────────────┤
│ │ │
│ ┌───────────────────┐ │ ┌───────────────────┐ │
│ │ │ │ │ │ │
│ │ Video + Skeleton │ │ │ CSI Heatmap │ │
│ │ Overlay │ │ │ (amplitude × │ │
│ │ (main canvas) │ │ │ subcarrier) │ │
│ │ │ │ │ │ │
│ └───────────────────┘ │ └───────────────────┘ │
│ │ │
├───────────────────────────┴─────────────────────────────┤
│ Fusion Confidence: ████████░░ 78% │
│ Video: ██████████ 95% │ CSI: ██████░░░░ 61% │
├─────────────────────────────────────────────────────────┤
│ ┌─────────────────────────────────────────────────┐ │
│ │ Embedding Space (2D projection) │ │
│ │ · · · │ │
│ │ · · · · · · (color = pose cluster) │ │
│ │ · · · · │ │
│ └─────────────────────────────────────────────────┘ │
├─────────────────────────────────────────────────────────┤
│ Latency: Video 12ms │ CSI 8ms │ Fusion 1ms │ Total 21ms│
│ [▶ Record] [📷 Snapshot] [Confidence: ████ 0.6] │
└─────────────────────────────────────────────────────────┘
| Package | Source Crate | Provides | Size (est.) |
|---|---|---|---|
| wifi_densepose_wasm | wifi-densepose-wasm | CSI frame parsing, signal processing, feature extraction | ~200KB |
| ruvector_cnn_wasm | ruvector-cnn-wasm | WasmCnnEmbedder (×2 instances), SimdOps, LayerOps, contrastive losses | ~150KB |
Two WasmCnnEmbedder instances are created — one for video frames, one for CSI pseudo-images.
They share the same WASM module but have independent state.
| API | Purpose | Required | Fallback |
|---|---|---|---|
| getUserMedia | Webcam capture | For video mode | CSI-only mode |
| WebAssembly | CNN inference | Yes | None (hard requirement) |
| WASM SIMD128 | Accelerated inference | No | Scalar fallback (~2× slower) |
| WebSocket | CSI data stream | For CSI mode | Video-only mode |
| Canvas 2D | Rendering | Yes | None |
| requestAnimationFrame | Render loop | Yes | setTimeout fallback |
| ES Modules | Code organization | Yes | None |
Target: Chrome 89+, Firefox 89+, Safari 15+, Edge 89+
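
A hedged sketch of start-up feature detection that maps the table above onto an initial mode. `detectCapabilities` and `pickMode` are hypothetical helpers. The SIMD probe is a tiny module whose body uses v128 instructions, so `WebAssembly.validate` should only accept it on engines with SIMD128 support.

```javascript
// Sketch only: helper names are assumptions; the probe bytes encode a
// minimal module that returns a v128 value.
const SIMD_PROBE = new Uint8Array([
  0, 97, 115, 109, 1, 0, 0, 0, 1, 5, 1, 96, 0, 1, 123, 3,
  2, 1, 0, 10, 10, 1, 8, 0, 65, 0, 253, 15, 253, 98, 11,
]);

// Inspect an environment object (defaults to the global scope).
function detectCapabilities(env = globalThis) {
  const wasm = typeof env.WebAssembly === "object" && env.WebAssembly !== null;
  const media = env.navigator && env.navigator.mediaDevices;
  return {
    wasm,
    simd: wasm && env.WebAssembly.validate(SIMD_PROBE),
    camera: Boolean(media && media.getUserMedia),
    webSocket: typeof env.WebSocket === "function",
  };
}

// Choose the richest mode the environment allows, or null if WASM is missing.
function pickMode(caps) {
  if (!caps.wasm) return null;
  if (caps.camera && caps.webSocket) return "dual";
  if (caps.camera) return "video";
  if (caps.webSocket) return "csi";
  return null;
}
```

This only checks that the APIs exist. Camera permission can still be denied and the sensing server can still be unreachable, so the demo would also need to fall back at runtime.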
| Stage | Target Latency | Notes |
|---|---|---|
| Video frame capture + resize | <3ms | drawImage to offscreen canvas |
| Video CNN embedding | <15ms | 224×224 RGB → 512-dim |
| CSI receive + parse | <2ms | Binary WebSocket message |
| CSI pseudo-image encoding | <3ms | Amplitude/phase/delta channels |
| CSI CNN embedding | <15ms | 224×224 pseudo-RGB → 512-dim |
| Attention fusion | <1ms | Element-wise weighted sum |
| Pose decoding | <1ms | Linear projection |
| Canvas overlay render | <3ms | Video + skeleton + heatmap |
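| Total (dual mode) | <33ms | 30 FPS capable; the stage targets above sum to 43ms, so this relies on stages finishing under target or overlapping |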
| Total (video only) | <22ms | 45 FPS capable |
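Note: Video and CSI CNN pipelines can run in parallel using Web Workers. With the two CNN stages overlapped, CNN inference plus fusion, decoding, and rendering take about max(15, 15) + 1 + 1 + 3 = 20ms (50 FPS); capture, parsing, and encoding add to whichever branch they belong to.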
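
The budget arithmetic can be checked with a short sketch that treats each target in the table as an upper bound. The stage keys are hypothetical names, the figures are targets rather than measurements, and worker message-passing overhead is ignored.

```javascript
// Sketch only: stage targets in ms copied from the latency table.
const BUDGET_MS = {
  videoCapture: 3, videoCnn: 15,
  csiParse: 2, csiEncode: 3, csiCnn: 15,
  fusion: 1, poseDecode: 1, render: 3,
};

// Every stage back to back on one thread.
function sequentialDualMs(b) {
  return Object.values(b).reduce((sum, ms) => sum + ms, 0); // 43
}

// Each branch in its own worker, preprocessing included in the branch.
function parallelDualMs(b) {
  const videoBranch = b.videoCapture + b.videoCnn;       // 18
  const csiBranch = b.csiParse + b.csiEncode + b.csiCnn; // 20
  return Math.max(videoBranch, csiBranch) + b.fusion + b.poseDecode + b.render; // 25
}

// Video-only mode skips the CSI branch and the fusion step.
function videoOnlyMs(b) {
  return b.videoCapture + b.videoCnn + b.poseDecode + b.render; // 22
}
```

Under these assumptions the worst case is 43ms sequentially and 25ms with both branches overlapped, so the 30 FPS target leaves roughly 8ms of headroom once inference moves to workers.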
The demo optionally shows real-time contrastive learning in the browser:
- InfoNCE loss (WasmInfoNCELoss): Compare video vs CSI embeddings for the same pose; trains cross-modal alignment
- Triplet loss (WasmTripletLoss): Push apart different poses, pull together same pose across modalities
- SimdOps: Accelerated dot products for real-time similarity computation
- Embedding space panel: Live 2D projection shows video and CSI embeddings converging when viewing the same person
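
For readers unfamiliar with InfoNCE, the plain-JavaScript sketch below shows the quantity being minimised. It does not use or mirror the WasmInfoNCELoss API. It assumes a batch where row i of each array is the same pose seen by both modalities, scores only the video-to-CSI direction, and uses an assumed temperature of 0.07.

```javascript
// Sketch only: illustrative reference code, not the WASM binding.

function l2Normalize(v) {
  let norm = 0;
  for (const x of v) norm += x * x;
  norm = Math.sqrt(norm) || 1; // leave all-zero vectors unchanged
  return Array.from(v, (x) => x / norm);
}

function dot(a, b) {
  let s = 0;
  for (let i = 0; i < a.length; i++) s += a[i] * b[i];
  return s;
}

// Mean over i of -log( exp(s_ii / t) / sum_j exp(s_ij / t) ),
// where s_ij is the cosine similarity of video i and CSI j.
function infoNceLoss(videoBatch, csiBatch, temperature = 0.07) {
  if (videoBatch.length === 0 || videoBatch.length !== csiBatch.length) {
    throw new Error("batches must be non-empty and the same size");
  }
  const v = videoBatch.map(l2Normalize);
  const c = csiBatch.map(l2Normalize);
  let loss = 0;
  for (let i = 0; i < v.length; i++) {
    const logits = c.map((cj) => dot(v[i], cj) / temperature);
    const maxLogit = Math.max(...logits); // subtract max for stability
    const sumExp = logits.reduce((s, z) => s + Math.exp(z - maxLogit), 0);
    loss += maxLogit + Math.log(sumExp) - logits[i];
  }
  return loss / v.length;
}
```

The loss approaches zero when each video embedding is most similar to its own CSI partner and grows when partners are mismatched, which is the convergence the embedding space panel is meant to visualise.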
| Existing Crate | Role in This Demo |
|---|---|
| ruvector-cnn-wasm | CNN inference for both video frames and CSI pseudo-images |
| wifi-densepose-wasm | CSI frame parsing and signal processing |
| wifi-densepose-sensing-server | WebSocket CSI data source |
| wifi-densepose-core | ADR-018 frame format definitions |
| ruvector-cnn | Underlying MobileNet-V3, layers, contrastive learning |
No new Rust crates are needed. The example is pure HTML/JS consuming existing WASM packages.
- Instant demo: Video-only mode works with just a webcam — no ESP32 needed
- Multi-modal showcase: Demonstrates camera + WiFi fusion, the core innovation of the project
- Graceful degradation: Works with video-only, CSI-only, or both
- Through-wall capability: CSI mode shows pose estimation where cameras cannot reach
- Zero-install: Anyone with a browser can try it
- Training data collection: Can record paired (video, CSI) data for offline model training
- Reusable: JS modules embed directly in the Tauri desktop app's webview
- Model weights: Requires offline-trained weights for visual CNN, CSI CNN, fusion, and pose decoder (~200KB total JSON)
- WASM size: Two WASM modules total ~350KB (acceptable)
- No GPU: CPU-only WASM inference; adequate at 224×224 but limits resolution scaling
- Camera privacy: Video mode requires camera permission (mitigated: CSI-only mode available)
- Two CNN instances: Memory footprint doubles vs single-modal (~10MB total, acceptable for desktop browsers)
- Cross-modal alignment: Video and CSI embeddings must be trained jointly for fusion to work; without proper training, fusion may be worse than either modality alone
- Latency on mobile: Dual CNN on mobile browsers may exceed 33ms; implement automatic quality reduction
- WebSocket drops: Network jitter → CSI frame gaps; buffer last 3 frames, interpolate missing data
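
The WebSocket-drop mitigation could look like the sketch below: a three-frame ring buffer sampled by timestamp, with linear interpolation between neighbouring frames and a hold of the nearest frame outside the buffered range. `CsiFrameBuffer` is a hypothetical class, and timestamps are assumed to be monotonically increasing milliseconds.

```javascript
// Sketch only: class name and sampling policy are assumptions.
class CsiFrameBuffer {
  constructor(capacity = 3) {
    this.capacity = capacity;
    this.frames = []; // oldest first
  }

  // Store a copy of the frame and evict the oldest beyond capacity.
  push(timestampMs, values) {
    this.frames.push({ timestampMs, values: Float32Array.from(values) });
    if (this.frames.length > this.capacity) this.frames.shift();
  }

  // Estimate the frame at timestampMs, or null if nothing is buffered.
  sample(timestampMs) {
    const f = this.frames;
    if (f.length === 0) return null;
    if (timestampMs <= f[0].timestampMs) return f[0].values;
    const last = f[f.length - 1];
    if (timestampMs >= last.timestampMs) return last.values; // hold newest
    for (let i = 1; i < f.length; i++) {
      if (timestampMs <= f[i].timestampMs) {
        const a = f[i - 1];
        const b = f[i];
        const t = (timestampMs - a.timestampMs) / (b.timestampMs - a.timestampMs);
        return a.values.map((v, k) => v + t * (b.values[k] - v));
      }
    }
    return last.values;
  }
}
```

Interpolating requires sampling slightly behind the newest frame; sampling at the current time simply repeats the last frame until the next one arrives.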
- Phase 1 — Scaffold: File layout, build.sh, index.html shell, mode selector UI
- Phase 2 — Video pipeline: getUserMedia → frame capture → CNN embedding → basic pose display
- Phase 3 — CSI pipeline: WebSocket client → CSI parsing → pseudo-image → CNN embedding
- Phase 4 — Fusion: Attention-weighted combination, confidence gating, mode switching
- Phase 5 — Pose decoder: Linear projection with placeholder weights → 17 keypoints
- Phase 6 — Overlay renderer: Video canvas with skeleton overlay, CSI heatmap panel
- Phase 7 — Training: Use wifi-densepose-train to generate real weights for both CNNs + fusion + decoder
- Phase 8 — Contrastive demo: Embedding space visualization, cross-modal similarity display
- Phase 9 — Web Workers: Move CNN inference to workers for parallel video + CSI processing
- Phase 10 — Polish: Recording, snapshots, adaptive quality, mobile optimization
Rejected: Misses the opportunity to show multi-modal fusion and makes the demo less accessible (requires ESP32 hardware). Video-only mode as a fallback is strictly better.
Rejected: Adds latency, requires webcam stream upload (privacy concern), and defeats the WASM-first architecture. All inference must be client-side.
Rejected: Would require two different ML frameworks. Using ruvector-cnn-wasm for both
keeps a single WASM module, unified embedding space, and simpler fusion.
Rejected: Live webcam input is far more compelling for demonstrations. Pre-recorded mode can be added as a secondary option.
Rejected: Adds build tooling. Vanilla JS + ES modules keeps the demo self-contained.
- ADR-018: Binary CSI Frame Format
- ADR-024: Contrastive CSI Embedding / AETHER
- ADR-055: Integrated Sensing Server
- vendor/ruvector/crates/ruvector-cnn/src/lib.rs: CNN embedder implementation
- vendor/ruvector/crates/ruvector-cnn-wasm/src/lib.rs: WASM bindings
- vendor/ruvector/examples/wasm-vanilla/index.html: Reference vanilla JS WASM pattern
- Person-in-WiFi: Fine-grained Person Perception using WiFi (ICCV 2019), camera+WiFi fusion precedent
- WiPose: Multi-Person WiFi Pose Estimation (TMC 2022), cross-modal embedding approach