This repository contains a custom data engineering and computer vision pipeline I built to process and synchronize raw, multi-modal sensor data from a robotics platform.
The core challenge of this project was dealing with completely undocumented proprietary binary formats. I had to reverse-engineer the high-frequency IMU and Video-Timestamp (VTS) byte streams from scratch, perfectly align them with a fisheye camera feed, and then apply modern deep learning models (YOLOv8 and Depth Anything V2) to extract spatial and semantic intelligence from the scene.
- Binary Reverse Engineering: Custom Python parsers that decode the raw .imu (568 Hz) and .vts formats using byte-level struct unpacking.
- Microsecond Synchronization: An O(log N) binary search that links every 30 FPS video frame to its nearest IMU reading, with a median error of just 0.17 ms.
- Computer Vision & Perception:
  - Depth Estimation: Runs Depth Anything V2 Small via HuggingFace to generate relative depth maps, which proved robust to the uncalibrated fisheye lens.
  - Instance Segmentation: Runs YOLOv8x-seg to overlay dense object masks and bounding boxes.
- Queryable Storage: A robust data layer that ingests all parsed and synchronized data into relational SQLite databases and columnar Parquet files for downstream analytics.
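The frame-to-IMU matching described above can be sketched roughly as follows. This is a minimal illustration rather than the repository's actual code: it assumes both streams have already been parsed into sorted lists of integer microsecond timestamps, and the names `nearest_index` and `sync_frames` are hypothetical.

```python
from bisect import bisect_left


def nearest_index(sorted_ts, t):
    """Return the index of the timestamp in sorted_ts closest to t.

    Uses one binary search, so each lookup is O(log N).
    """
    if not sorted_ts:
        raise ValueError("sorted_ts must not be empty")
    i = bisect_left(sorted_ts, t)
    if i == 0:
        return 0
    if i == len(sorted_ts):
        return len(sorted_ts) - 1
    # t lies between sorted_ts[i - 1] and sorted_ts[i]; pick the closer one
    # (ties go to the earlier sample).
    return i if sorted_ts[i] - t < t - sorted_ts[i - 1] else i - 1


def sync_frames(frame_ts, imu_ts):
    """Map every frame timestamp to (imu_index, absolute_error)."""
    links = []
    for t in frame_ts:
        j = nearest_index(imu_ts, t)
        links.append((j, abs(imu_ts[j] - t)))
    return links


# Assumed example: IMU samples every 1760 us (about 568 Hz) and three
# frames at roughly 30 FPS.
imu_ts = list(range(0, 100_000, 1760))
links = sync_frames([0, 33_333, 66_667], imu_ts)
```

Taking the median of the per-frame errors in `links` gives the kind of summary figure quoted in the list above.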
(If you want to see the final rendered videos, they are available here on Google Drive).
- imu_sync.mp4: Side-by-side rendering of the camera feed and live, scrolling matplotlib graphs of the IMU's acceleration, gyroscope, and magnetometer.
- depth_map.mp4: Side-by-side original footage and an INFERNO colormap depth estimation.
- segmentation.mp4: The camera feed densely overlaid with instance segmentation masks.
Building this wasn't totally straightforward. Here are a few interesting problems I had to solve along the way:
- Undocumented Byte Layouts: I had to empirically deduce the .imu format. For example, I confirmed the Z-axis acceleration was stored as a 32-bit float by finding the bytes that consistently hovered around -9.81 m/s² (gravity). Similarly, I confirmed the gyroscope was in rad/s rather than °/s: a peak reading of about 2.8 converts to ~162°/s, which matches the fast handheld movement, whereas 2.8°/s would be far too slow.
- OpenCV's Lying Metadata: OpenCV (cv2.CAP_PROP_FRAME_COUNT) falsely claimed the 43-second .mp4 only had 30 frames total. I bypassed this by throwing out the container metadata and using the parsed binary .vts file as the absolute source of truth for the frame count (1,316 frames).
- Domain Gap in Segmentation: YOLOv8 is trained on COCO (clean, well-lit photos). Pointing it at a blurry, fisheye robotics feed caused some funny misclassifications (like thinking a computer mouse was a cell phone). I tuned the confidence thresholds to balance this, but a true fix in production would require fine-tuning the model on a custom robotics workspace dataset.
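The gravity trick from the first bullet can be illustrated with a small offset scan. This is only a sketch under stated assumptions: it assumes fixed-size, little-endian records, and the 16-byte `<Ifff` layout used to build the synthetic data is hypothetical, not the real .imu layout. The function name `find_float32_offsets` is likewise made up for this example.

```python
import struct
from statistics import mean, pstdev


def find_float32_offsets(blob, record_size, target, tol):
    """Return byte offsets whose float32 values average close to `target`
    with low spread across all records."""
    n = len(blob) // record_size
    if n == 0:
        return []
    hits = []
    # A float32 needs 4 bytes, so the last candidate offset is record_size - 4.
    for off in range(record_size - 3):
        vals = [
            struct.unpack_from("<f", blob, r * record_size + off)[0]
            for r in range(n)
        ]
        # Misaligned reads tend to decode as NaN, inf, or huge magnitudes.
        if any(v != v or abs(v) > 1e6 for v in vals):
            continue
        if abs(mean(vals) - target) <= tol and pstdev(vals) <= tol:
            hits.append(off)
    return hits


# Synthetic stand-in for a capture: HYPOTHETICAL 16-byte records
# (counter, ax, ay, az) with az jittering around gravity.
RECORD = struct.Struct("<Ifff")
blob = b"".join(
    RECORD.pack(i, 0.1, -0.2, -9.81 + ((i % 7) - 3) * 0.01)
    for i in range(200)
)
candidates = find_float32_offsets(blob, RECORD.size, target=-9.81, tol=0.5)
```

On the synthetic data, the offset of the az field (byte 12) shows up among the candidates. On a real file the same scan only narrows the search; each candidate still needs a physical sanity check like the ones described above.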
.
├── data/ # Place your raw .mp4, .imu, and .vts files here
├── outputs/ # Generated .mp4 visualisation videos land here
├── 01_parse_imu_vts.py # Standalone parser testing and physical validation script
├── 02_imu_sync_video.py # Renders the IMU telemetry HUD and scrolling graphs
├── 03_depth_estimation.py # Runs the HuggingFace Depth Anything pipeline
├── 04_segmentation.py # Runs the Ultralytics YOLOv8x-seg pipeline
├── 05_data_storage.py # Compresses all data into SQLite and Snappy Parquet
├── parse_imu_vts_lib.py # The core parsing engine imported by all other scripts
└── run_all.sh # Master execution script
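To give a feel for what the SQLite side of 05_data_storage.py could look like, here is a minimal sketch using only the standard library. The table names, columns, and helper functions are assumptions made for illustration and are not taken from the script itself; the Parquet export is left out.

```python
import sqlite3


def store_sync(conn, imu_rows, frame_links):
    """Create two small tables and insert parsed IMU rows and frame links.

    imu_rows:    iterable of (idx, t_us, ax, ay, az)
    frame_links: iterable of (frame, t_us, imu_idx)
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS imu ("
        "idx INTEGER PRIMARY KEY, t_us INTEGER NOT NULL, "
        "ax REAL, ay REAL, az REAL)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS frames ("
        "frame INTEGER PRIMARY KEY, t_us INTEGER NOT NULL, "
        "imu_idx INTEGER REFERENCES imu(idx))"
    )
    conn.executemany("INSERT INTO imu VALUES (?, ?, ?, ?, ?)", imu_rows)
    conn.executemany("INSERT INTO frames VALUES (?, ?, ?)", frame_links)
    conn.commit()


def imu_for_frame(conn, frame):
    """Return (t_us, ax, ay, az) of the IMU sample linked to a frame, or None."""
    return conn.execute(
        "SELECT i.t_us, i.ax, i.ay, i.az "
        "FROM frames f JOIN imu i ON i.idx = f.imu_idx "
        "WHERE f.frame = ?",
        (frame,),
    ).fetchone()


# Example with an in-memory database and made-up values.
conn = sqlite3.connect(":memory:")
store_sync(
    conn,
    imu_rows=[(0, 0, 0.0, 0.0, -9.81), (19, 33_440, 0.0, 0.0, -9.81)],
    frame_links=[(0, 0, 0), (1, 33_333, 19)],
)
row = imu_for_frame(conn, 1)
```

Storing the matched IMU index on each frame row keeps per-frame lookups to a single join.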
I've designed the pipeline to be as plug-and-play as possible. The run_all.sh script automatically checks for missing dependencies, builds the environment, and runs the entire suite.
# 1. Create a clean Python 3.10 environment
python3 -m venv trekion-env
source trekion-env/bin/activate
# 2. Let the master script do the rest
chmod +x run_all.sh
./run_all.sh
Requirements: numpy, opencv-python, matplotlib, tqdm, Pillow, torch, torchvision, transformers, ultralytics, scipy, pandas, pyarrow. (Tested natively on Apple Silicon with PyTorch MPS acceleration.)
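If you would rather check the requirements by hand before running anything, a small Python check along these lines may help. It is a sketch, not what run_all.sh actually does: the `missing_packages` helper is hypothetical, and the mapping reflects the fact that some pip package names differ from their import names (opencv-python imports as cv2, Pillow as PIL).

```python
from importlib.util import find_spec

# pip package name -> top-level module name it installs
IMPORT_NAMES = {
    "numpy": "numpy",
    "opencv-python": "cv2",
    "matplotlib": "matplotlib",
    "tqdm": "tqdm",
    "Pillow": "PIL",
    "torch": "torch",
    "torchvision": "torchvision",
    "transformers": "transformers",
    "ultralytics": "ultralytics",
    "scipy": "scipy",
    "pandas": "pandas",
    "pyarrow": "pyarrow",
}


def missing_packages(import_names=IMPORT_NAMES):
    """Return the pip names whose module cannot be found in this environment."""
    return [pkg for pkg, mod in import_names.items() if find_spec(mod) is None]
```

Calling `missing_packages()` inside the activated environment returns a list you can pass straight to pip install.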