Multi-Modal Robotics Sensor Fusion Pipeline

This repository contains a custom data engineering and computer vision pipeline I built to process and synchronize raw, multi-modal sensor data from a robotics platform.

The core challenge of this project was dealing with undocumented proprietary binary formats. I had to reverse-engineer the high-frequency IMU and Video-Timestamp (VTS) byte streams from scratch, align them with a fisheye camera feed, and then apply deep learning models (YOLOv8 and Depth Anything V2) to extract spatial and semantic information from the scene.

What's Inside

Binary Reverse Engineering: Custom Python parsers that decode the raw .imu (568 Hz) and .vts formats using byte-level struct unpacking.
Microsecond Synchronization: An O(log N) binary search that links each video frame (about 30 FPS) to its nearest IMU reading, with a median timestamp error of 0.17 ms.
Computer Vision & Perception:
- Depth Estimation: Runs Depth Anything V2 Small via HuggingFace to generate relative depth maps; it coped well with the uncalibrated fisheye lens.
- Instance Segmentation: Runs YOLOv8x-seg to overlay dense object masks and bounding boxes.
Queryable Storage: A robust data layer that ingests all parsed and synchronized data into relational SQLite databases and columnar Parquet files for downstream analytics.
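
The nearest-timestamp lookup described under Microsecond Synchronization can be sketched with Python's standard bisect module. This is a minimal illustration, not the repository's actual code: the function names nearest_index and sync_frames are hypothetical, and timestamps are assumed to be integer microseconds sorted in ascending order.

```python
from bisect import bisect_left


def nearest_index(sorted_ts, t):
    """Return the index of the entry in sorted_ts closest to t (O(log N))."""
    if not sorted_ts:
        raise ValueError("sorted_ts must not be empty")
    i = bisect_left(sorted_ts, t)  # first index whose value is >= t
    if i == 0:
        return 0
    if i == len(sorted_ts):
        return len(sorted_ts) - 1
    # t lies between two neighbours; pick the closer one (ties go to the earlier).
    before, after = sorted_ts[i - 1], sorted_ts[i]
    return i if (after - t) < (t - before) else i - 1


def sync_frames(frame_ts, imu_ts):
    """Pair every frame timestamp with (imu_index, absolute_error)."""
    pairs = []
    for t in frame_ts:
        j = nearest_index(imu_ts, t)
        pairs.append((j, abs(imu_ts[j] - t)))
    return pairs


if __name__ == "__main__":
    # Hypothetical data: IMU samples every 1760 us (roughly 568 Hz),
    # frames every 33333 us (roughly 30 FPS).
    imu = list(range(0, 200_000, 1760))
    frames = [0, 33_333, 66_666]
    for frame_t, (idx, err) in zip(frames, sync_frames(frames, imu)):
        print(frame_t, idx, err)
```

Each lookup costs O(log N) in the number of IMU samples, so the full pass over the video stays cheap even at several hundred IMU readings per second.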

Visual Outputs

(The final rendered videos are available on Google Drive.)

imu_sync.mp4: Side-by-side rendering of the camera feed and live, scrolling matplotlib graphs of the IMU's accelerometer, gyroscope, and magnetometer readings.
depth_map.mp4: The original footage side by side with the depth estimate rendered in the INFERNO colormap.
segmentation.mp4: The camera feed densely overlaid with instance segmentation masks.

Technical Hurdles & How I Solved Them

Building this wasn't totally straightforward. Here are a few interesting problems I had to solve along the way:

Undocumented Byte Layouts: I had to deduce the .imu format empirically. For example, I identified the Z-axis acceleration as a 32-bit float by finding the bytes whose decoded value consistently hovered around -9.81 m/s² (gravity). Similarly, I concluded the gyroscope is in rad/s rather than °/s: a peak reading of about 2.8 converts to ~162°/s, which matches the fast handheld movement, whereas 2.8°/s would be far too slow.
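
The gravity trick can be sketched as a brute-force scan: decode a little-endian 32-bit float at every byte offset of a fixed-size record and keep the offsets whose value stays near -9.81 across all records. Everything specific here is an assumption for illustration only: the 32-byte record, the "<Q6f" layout, and the names make_synthetic_records and find_gravity_offsets are hypothetical and do not describe the real .imu format.

```python
import math
import struct

GRAVITY = 9.81  # m/s^2


def make_synthetic_records():
    """Build three fake 32-byte records: uint64 timestamp + six float32 values.

    Hypothetical layout used only to exercise the scan below.
    """
    az_values = [-9.80, -9.81, -9.82]
    return b"".join(
        struct.pack("<Q6f", i * 1760, 0.1, -0.2, az, 0.0, 0.0, 0.0)
        for i, az in enumerate(az_values)
    )


def find_gravity_offsets(blob, record_size, tol=0.5):
    """Return byte offsets whose float32 value is within tol of -g in every record."""
    n_records = len(blob) // record_size
    hits = []
    for off in range(record_size - 3):  # a float32 needs 4 bytes
        values = [
            struct.unpack_from("<f", blob, r * record_size + off)[0]
            for r in range(n_records)
        ]
        if all(math.isfinite(v) and abs(v + GRAVITY) < tol for v in values):
            hits.append(off)
    return hits


if __name__ == "__main__":
    blob = make_synthetic_records()
    print(find_gravity_offsets(blob, record_size=32))
```

On real data the record size is itself unknown, so a scan like this would typically be repeated over a few candidate sizes until one offset lights up consistently.
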
OpenCV's Lying Metadata: OpenCV's cv2.CAP_PROP_FRAME_COUNT property reported that the 43-second .mp4 had only 30 frames in total. I bypassed this by ignoring the container metadata and using the parsed binary .vts file as the source of truth for the frame count (1,316 frames).
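
A quick sanity check makes the mismatch obvious: 30 frames over 43 seconds implies under 1 FPS, while 1,316 frames implies about 30.6 FPS. The sketch below encodes that check. The helper names implied_fps_is_plausible and choose_frame_count and the FPS bounds are hypothetical choices for illustration, not part of the repository.

```python
def implied_fps_is_plausible(frame_count, duration_s, min_fps=1.0, max_fps=240.0):
    """Check whether frame_count / duration_s falls in a believable FPS range."""
    if duration_s <= 0 or frame_count <= 0:
        return False
    return min_fps <= frame_count / duration_s <= max_fps


def choose_frame_count(container_frames, vts_records, duration_s):
    """Prefer the timestamp-record count; fall back to the container's count."""
    if implied_fps_is_plausible(vts_records, duration_s):
        return vts_records, "vts"
    if implied_fps_is_plausible(container_frames, duration_s):
        return container_frames, "container"
    raise ValueError("neither frame count is plausible for this duration")


if __name__ == "__main__":
    # Numbers from the case described above: 30 reported vs 1,316 VTS records.
    print(choose_frame_count(container_frames=30, vts_records=1316, duration_s=43.0))
```
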
Domain Gap in Segmentation: The pretrained YOLOv8 weights come from COCO (mostly clean, well-lit photos). Pointing the model at a blurry fisheye robotics feed caused some funny misclassifications (like mistaking a computer mouse for a cell phone). I tuned the confidence thresholds to reduce these, but a real fix in production would require fine-tuning the model on a custom robotics workspace dataset.
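
Threshold tuning of this kind can be pictured as a post-filter over detections, with a stricter cut-off for classes that are often confused. This is a library-free sketch under assumptions: the detection dictionaries, the filter_detections name, and the threshold values are all hypothetical. In practice Ultralytics also accepts a global conf argument at prediction time, which may be all that is needed.

```python
def filter_detections(detections, default_conf=0.5, per_class=None):
    """Keep detections whose score meets the threshold for their class.

    detections: list of dicts with "label" and "score" keys (hypothetical shape).
    per_class:  optional mapping of label -> stricter or looser threshold.
    """
    per_class = per_class or {}
    kept = []
    for det in detections:
        threshold = per_class.get(det["label"], default_conf)
        if det["score"] >= threshold:
            kept.append(det)
    return kept


if __name__ == "__main__":
    sample = [
        {"label": "cell phone", "score": 0.61},  # likely a misread mouse
        {"label": "mouse", "score": 0.62},
        {"label": "keyboard", "score": 0.55},
    ]
    # Demand more confidence for the class that tends to be a false positive.
    for det in filter_detections(sample, per_class={"cell phone": 0.7}):
        print(det["label"], det["score"])
```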

Project Architecture

.
├── data/                      # Place your raw .mp4, .imu, and .vts files here
├── outputs/                   # Generated .mp4 visualisation videos land here
├── 01_parse_imu_vts.py        # Standalone parser testing and physical validation script
├── 02_imu_sync_video.py       # Renders the IMU telemetry HUD and scrolling graphs
├── 03_depth_estimation.py     # Runs the HuggingFace Depth Anything pipeline
├── 04_segmentation.py         # Runs the Ultralytics YOLOv8x-seg pipeline
├── 05_data_storage.py         # Writes all data to SQLite and Snappy-compressed Parquet
├── parse_imu_vts_lib.py       # The core parsing engine imported by all other scripts
└── run_all.sh                 # Master execution script
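
As a rough picture of the SQLite half of the storage step, the sketch below loads IMU samples and frame-to-IMU links into two tables and joins them. The schema, column names, and the helpers build_db and accel_for_frame are hypothetical and are not taken from 05_data_storage.py; the Parquet export is left out to keep the example standard-library only.

```python
import sqlite3


def build_db(imu_rows, frame_rows):
    """Create an in-memory database with IMU samples and synchronized frames."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE imu (
            ts_us INTEGER PRIMARY KEY,
            ax REAL, ay REAL, az REAL
        );
        CREATE TABLE frames (
            frame_idx INTEGER PRIMARY KEY,
            ts_us     INTEGER NOT NULL,
            imu_ts_us INTEGER NOT NULL REFERENCES imu(ts_us)
        );
        """
    )
    conn.executemany("INSERT INTO imu VALUES (?, ?, ?, ?)", imu_rows)
    conn.executemany("INSERT INTO frames VALUES (?, ?, ?)", frame_rows)
    conn.commit()
    return conn


def accel_for_frame(conn, frame_idx):
    """Return (ax, ay, az) of the IMU sample linked to a frame, or None."""
    return conn.execute(
        """
        SELECT i.ax, i.ay, i.az
        FROM frames AS f
        JOIN imu AS i ON i.ts_us = f.imu_ts_us
        WHERE f.frame_idx = ?
        """,
        (frame_idx,),
    ).fetchone()


if __name__ == "__main__":
    conn = build_db(
        imu_rows=[(0, 0.1, -0.2, -9.80), (1760, 0.0, 0.0, -9.81)],
        frame_rows=[(0, 0, 0), (1, 1700, 1760)],
    )
    print(accel_for_frame(conn, 1))
```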

How to Run It

I've designed the pipeline to be as plug-and-play as possible. The run_all.sh script automatically checks for missing dependencies, builds the environment, and runs the entire suite.

# 1. Create a clean Python 3.10 environment
python3 -m venv trekion-env
source trekion-env/bin/activate

# 2. Let the master script do the rest
chmod +x run_all.sh
./run_all.sh

Requirements: numpy, opencv-python, matplotlib, tqdm, Pillow, torch, torchvision, transformers, ultralytics, scipy, pandas, pyarrow. (Tested natively on Apple Silicon with PyTorch MPS acceleration).
