# Surgical Scene Reconstruction with Gaussian Splatting

This application demonstrates real-time 3D surgical scene reconstruction by combining **Holoscan SDK** for high-performance streaming, **3D Gaussian Splatting** for neural 3D representation, and **temporal deformation networks** for accurate modeling of dynamic tissue.

![Training Visualization - Ground Truth vs Rendered](train_gt_animation.gif)

## Overview

The application provides a complete end-to-end pipeline—from raw surgical video to real-time 3D reconstruction. Researchers and developers can use it to train custom models on their own endoscopic data and visualize results with GPU-accelerated rendering.

Features of this application include:

- **Real-time Visualization:** Stream surgical scene reconstruction at >30 FPS using Holoscan
- **Temporal Deformation:** Accurate per-frame tissue modeling as it deforms over time
- **Two Operation Modes:** Inference-only (with pre-trained checkpoint) OR train-then-render
- **Production Ready:** Tested and optimized Holoscan pipeline with complete Docker containerization

It takes input from EndoNeRF surgical datasets (RGB images + stereo depth + camera poses + tool masks), processes it using multi-frame Gaussian Splatting with a 4D spatiotemporal deformation network, and outputs a real-time 3D tissue reconstruction without surgical instruments.

It is ideal for use cases such as:

- Surgical scene understanding and visualization
- Tool-free tissue reconstruction for analysis

## Quick Start

### Step 1: Clone the HoloHub Repository

```bash
git clone https://github.com/nvidia-holoscan/holohub.git
cd holohub
```

### Step 2: Read and Agree to the Terms and Conditions of the EndoNeRF Sample Dataset

1. Read and agree to the [Terms and Conditions](https://docs.google.com/document/d/1P6q2hXoGpVMKeD-PpjYYdZ0Yx1rKZdJF1rXxpobbFMY/edit?usp=share_link) for the EndoNeRF dataset.
1. The EndoNeRF sample dataset is downloaded automatically when you build the application.
1. Optionally, for manual download of the dataset, refer to the [Data](#pulling-soft-tissues-dataset) section below.
1. Optionally, if you do not agree to the terms and conditions, set the `HOLOHUB_DOWNLOAD_DATASETS` environment variable to `OFF`, then manually download the dataset and place it in the correct location by following the instructions in the [Data](#pulling-soft-tissues-dataset) section below.

   ```bash
   export HOLOHUB_DOWNLOAD_DATASETS=OFF
   ```

### Step 3: Run Training

To train the model on the sample dataset:

```bash
./holohub run surgical_scene_recon train
```

### Step 4: Dynamic Rendering with a Trained Model

After training completes, visualize your results in real-time:

```bash
./holohub run surgical_scene_recon render
```

![Dynamic Rendering Visualization](surg_recon_inference.gif)

## Pulling Soft Tissues Dataset

This application uses the **EndoNeRF "pulling_soft_tissues" dataset**, which contains:

- Tool segmentation masks for instrument removal
- Camera poses and bounds (poses_bounds.npy)

### Download the Dataset

You can download the dataset from one of the following locations:

* 📦 Direct Google Drive: <https://drive.google.com/drive/folders/1zTcX80c1yrbntY9c6-EK2W2UVESVEug8?usp=sharing>

  1. In the Google Drive folder, you'll see:

     - `cutting_tissues_twice`
     - `pulling_soft_tissues`

  1. Download `pulling_soft_tissues`.

* Visit the [EndoNeRF repository](https://github.com/med-air/EndoNeRF).

### Dataset Setup

The dataset will be automatically used by the application when placed in the correct location. Refer to the [HoloHub glossary](../../README.md#Glossary) for definitions of HoloHub-specific directory terms used below.

To place the dataset at `<HOLOHUB_ROOT>/data/EndoNeRF/pulling/`:

1. From the HoloHub root directory:

   ```bash
   mkdir -p data/EndoNeRF
   ```

1. Extract and move (or copy) the downloaded dataset:

   ```bash
   mv /path/to/pulling_soft_tissues data/EndoNeRF/pulling
   ```

**Important:** The dataset MUST be physically located at the path above; do NOT use symlinks. Docker containers cannot follow symlinks outside mounted volumes.

### Verify the Dataset Structure

Verify that your dataset has this structure:

```text
<HOLOHUB_ROOT>/
└── data/
    └── EndoNeRF/
        └── pulling/
            ├── ...                      # RGB images, depth maps, and tool masks
            └── poses_bounds.npy         # Camera poses (8.5 KB)
```
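
If you want to sanity-check the layout programmatically, a minimal sketch along these lines can help. The helper below is hypothetical (not part of the application); it only checks what the instructions above guarantee: the `pulling` path, the presence of `poses_bounds.npy`, and the no-symlink rule.

```python
from pathlib import Path

def verify_endonerf_layout(root: str = "data/EndoNeRF/pulling") -> None:
    """Hypothetical helper: sanity-check the dataset location before training."""
    root_path = Path(root)
    if root_path.is_symlink():
        # Docker containers cannot follow symlinks outside mounted volumes.
        raise RuntimeError(f"{root} is a symlink; copy the data here instead")
    if not (root_path / "poses_bounds.npy").is_file():
        raise FileNotFoundError(f"missing {root_path / 'poses_bounds.npy'}")
    print(f"Dataset at {root_path} looks plausible.")

verify_endonerf_layout()
```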

## Models Used by the `surgical_scene_recon` Application

The `surgical_scene_recon` application uses a **3D Gaussian Splatting** model with a **temporal deformation network** for dynamic scene reconstruction. A minimal illustrative sketch of both components follows the list below.

- Gaussian Splatting Model

  Both the training and rendering pipelines build on this model:

  - Architecture: 3D Gaussians with learned position, scale, rotation, opacity, and color
  - Initialization: Multi-frame point cloud (~30,000-50,000 points from all frames)
  - Renderer: `gsplat` library (CUDA-accelerated differentiable rasterization)
  - Spherical Harmonics: Degree 3 (16 coefficients per Gaussian for view-dependent color)
  - Resolution: 640×512 pixels (RGB, three channels)

- Temporal Deformation Network Model

  The Temporal Deformation Network enables dynamic scene modeling by deforming the base Gaussians over time, accurately capturing tissue movement and deformation during surgery:

  - Architecture: HexPlane 4D spatiotemporal grid + MLP decoder
  - Input: 3D position + normalized time value [0, 1]
  - Output: Deformed position, scale, rotation, and opacity changes
  - Training: Two-stage process (coarse: static, fine: with deformation)
  - Inference: Direct PyTorch (no conversion, full precision)
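
For orientation, the sketch below shows roughly what these two components look like in PyTorch. It is illustrative only: the class names are invented, and the deformation network is reduced to a plain MLP stand-in for the HexPlane 4D grid + decoder described above.

```python
import torch
import torch.nn as nn

class GaussianParams(nn.Module):
    """Per-Gaussian learnable parameters (illustrative shapes only)."""
    def __init__(self, num_points: int, sh_degree: int = 3):
        super().__init__()
        num_sh = (sh_degree + 1) ** 2  # degree 3 -> 16 coefficients per Gaussian
        self.means = nn.Parameter(torch.zeros(num_points, 3))       # position
        self.scales = nn.Parameter(torch.zeros(num_points, 3))      # log-scale
        self.quats = nn.Parameter(torch.zeros(num_points, 4))       # rotation
        self.opacities = nn.Parameter(torch.zeros(num_points))      # logit opacity
        self.sh = nn.Parameter(torch.zeros(num_points, num_sh, 3))  # view-dependent color

class DeformationNet(nn.Module):
    """Stand-in for the HexPlane 4D grid + MLP decoder: (xyz, t) -> offsets."""
    def __init__(self, hidden: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(4, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3 + 3 + 4 + 1),  # d_position, d_scale, d_rotation, d_opacity
        )

    def forward(self, xyz: torch.Tensor, t: float) -> torch.Tensor:
        t_col = torch.full_like(xyz[:, :1], float(t))  # normalized time in [0, 1]
        return self.mlp(torch.cat([xyz, t_col], dim=-1))
```

At render time, the predicted offsets would be added to the base parameters before `gsplat` rasterizes the deformed Gaussians.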

## About the Model Training Process

The application trains in two stages:

1. The Coarse Stage, in which the application learns the base static Gaussians without deformation.
2. The Fine Stage, in which a temporal deformation network is added for dynamic tissue modeling.

The training uses:

- Multi-modal Data: RGB images, depth maps, tool segmentation masks
- Loss Functions: RGB loss, depth loss, TV loss, masking losses
- Optimization: Adam optimizer with batch-size scaled learning rates
- Tool Removal: Segmentation masks applied during training for tissue-only reconstruction

The **training pipeline** (`gsplat_train.py`) runs in the following order:

1. Data Loading: the EndoNeRF parser loads RGB images, depth maps, masks, and poses.
2. Initialization: a multi-frame point cloud (~30k points).
3. Training in two stages:
   - Coarse
   - Fine
4. Optimization: the Adam (Adaptive Moment Estimation) optimizer with batch-size scaled learning rates.
5. Regularization: depth loss, TV loss, and masking losses.
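
Put together, the loop can be pictured as in the sketch below. It is a simplification under stated assumptions: `render`, `rgb_loss`, and `depth_loss` are hypothetical helpers, and the real schedule, learning rates, and regularizers live in `gsplat_train.py`.

```python
import torch

def train(gaussians, deform_net, frames, coarse_iters=1000, fine_iters=1000):
    opt = torch.optim.Adam(
        list(gaussians.parameters()) + list(deform_net.parameters()), lr=1e-3
    )
    for step in range(coarse_iters + fine_iters):
        frame = frames[step % len(frames)]       # RGB, depth, tool mask, pose, time
        fine_stage = step >= coarse_iters        # stage 2 adds temporal deformation
        offsets = deform_net(gaussians.means, frame.t) if fine_stage else None
        rgb, depth = render(gaussians, offsets, frame.pose)  # hypothetical renderer
        # Tool masks exclude instrument pixels so only tissue is supervised.
        loss = (rgb_loss(rgb, frame.rgb, frame.mask)
                + depth_loss(depth, frame.depth, frame.mask))
        opt.zero_grad()
        loss.backward()
        opt.step()
```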

The default training command trains a model on all 63 frames with 2000 iterations, producing smooth temporal deformation and high-quality reconstruction.

The rendering (inference) pipeline connects the following operators:

```text
EndoNeRFLoaderOp → GsplatLoaderOp → GsplatRenderOp → HolovizOp
                                          ↓
                                    ImageSaverOp
```

**Components:**

- **EndoNeRFLoaderOp:** Streams camera poses and timestamps
- **GsplatLoaderOp:** Loads checkpoint and deformation network
- **GsplatRenderOp:** Applies temporal deformation and renders
- **HolovizOp:** Real-time GPU-accelerated visualization
- **ImageSaverOp:** Optional frame saving
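
In Holoscan's Python API, wiring these operators together looks roughly like the sketch below. The custom operator classes, their import path, and their constructor arguments are assumptions based on the component list above; only `Application`, `add_flow`, and `HolovizOp` are standard Holoscan SDK names.

```python
from holoscan.core import Application
from holoscan.operators import HolovizOp

# Hypothetical import path for the application's custom operators.
from operators import EndoNeRFLoaderOp, GsplatLoaderOp, GsplatRenderOp, ImageSaverOp

class SurgicalSceneReconApp(Application):
    def compose(self):
        loader = EndoNeRFLoaderOp(self, name="endonerf_loader")  # poses + timestamps
        ckpt = GsplatLoaderOp(self, name="gsplat_loader")        # checkpoint + deformation net
        renderer = GsplatRenderOp(self, name="gsplat_render")    # deform + render
        viz = HolovizOp(self, name="holoviz")                    # GPU visualization
        saver = ImageSaverOp(self, name="image_saver")           # optional frame saving

        self.add_flow(loader, ckpt)
        self.add_flow(ckpt, renderer)
        self.add_flow(renderer, viz)
        self.add_flow(renderer, saver)  # tap point for saving is an assumption

if __name__ == "__main__":
    SurgicalSceneReconApp().run()
```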

## Requirements for the `surgical_scene_recon` Application

- **Hardware:**
  - NVIDIA GPU (RTX 3000+ series recommended, tested on RTX 6000 Ada Generation)
  - ~2 GB free disk space (for the dataset)
  - ~30 GB free disk space (for Docker containers)
- **Software:**
  - Docker with NVIDIA GPU support
  - X11 display server (for visualization)
  - Holoscan SDK 3.7.0 or later (automatically provided in containers)

## Application Integration Testing

We provide integration tests. To test the application for training and inference, run:

```bash
./holohub test surgical_scene_recon --verbose
```

## Performance

Tested Configuration:

- GPU: NVIDIA RTX 6000 Ada Generation
- Container: Holoscan SDK 3.7.0
- Training Time: ~5 minutes (63 frames, 2000 iterations)
- Rendering: Real-time >30 FPS

Quality Metrics (train mode):

- PSNR: ~36-38 dB
- SSIM: ~0.80
- Gaussians: ~50,000 splats
- Deformation: Smooth temporal consistency
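
For reference, the PSNR figure above is the standard image-quality metric. Here is a quick sketch of computing it for a rendered frame against ground truth, assuming float tensors scaled to [0, 1]:

```python
import torch

def psnr(rendered: torch.Tensor, ground_truth: torch.Tensor) -> float:
    """PSNR in dB for images in [0, 1]: 10 * log10(1 / MSE)."""
    mse = torch.mean((rendered - ground_truth) ** 2)
    return float(10.0 * torch.log10(1.0 / mse))

# Example: a 640x512 RGB frame pair
a, b = torch.rand(3, 512, 640), torch.rand(3, 512, 640)
print(f"PSNR: {psnr(a, b):.2f} dB")
```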

## Troubleshooting


### Citation

If you use this work, cite the following:

* EndoNeRF:

  ```bibtex
  @inproceedings{wang2022endonerf,
    title={EndoNeRF: Neural Rendering for Stereo 3D Reconstruction of Deformable Tissues in Robotic Surgery},
    author={Wang, Yuehao and Yifan, Wang and Tao, Rui and others},
    booktitle={MICCAI},
    year={2022}
  }
  ```

* 3D Gaussian Splatting:

  ```bibtex
  @article{kerbl20233d,
    title={3D Gaussian Splatting for Real-Time Radiance Field Rendering},
    author={Kerbl, Bernhard and Kopanas, Georgios and Leimk{\"u}hler, Thomas and Drettakis, George},
    journal={ACM Transactions on Graphics},
    year={2023}
  }
  ```

* `gsplat` Library:

  ```bibtex
  @software{ye2024gsplat,
    title={gsplat},
    author={Ye, Vickie and Turkulainen, Matias and others},
    year={2024},
    url={https://github.com/nerfstudio-project/gsplat}
  }
  ```

### License
