This repository contains the code and data used in the project described in Generative_Video_Augmentation_for_Image_Classification.pdf. In brief, we show that by using a generative model to hallucinate a tiny "pseudo-video" (just two extra frames) from each static image, then training an inflated 3D VGG-11 network on the resulting 3-frame clips, we can recover motion cues and boost classification accuracy by over 20 percentage points on a 10-class Tiny ImageNet subset, at the cost of only ~0.14 ms of extra inference time per sample.
The dataset that we generated for this project is available on Kaggle. You can download it from the following link: https://www.kaggle.com/datasets/afnanalgarby/svd-generated-video-dataset/data
The GitHub repository for this project is located at https://github.com/nthPerson/augmented_image_classification
```
├── Baseline_VGG/           # Data loaders, model definition, training script & config for the 2D VGG-11 baseline
├── Data_Augmentation_SVD/  # Script and config used to generate the 3-frame video dataset via Stable Video Diffusion
├── Data_Sourcing/          # Code to extract and filter the 10-class Tiny ImageNet subset
├── Inflated_3D_VGG/        # Data loaders, model definition, training script & config for the 3D-inflated VGG-11
├── Model_Evaluation/       # Inference scripts, metric calculators, FLOPs/parameter counters
├── scripts/                # Throw-away/debug scripts (e.g. video loader tests)
├── svd_videos/             # The generated 3-frame "video" dataset (train/val + CSV indices)
├── utils/                  # Utility functions used across the repo
├── Generative_Video_Augmentation_for_Image_Classification.pdf  # The final report
├── README.md               # You are here
└── requirements.txt        # Python dependencies
```
**`Baseline_VGG/`**
- `tiny_imagenet.py` – Dataset definition for the 10-class Tiny ImageNet subset.
- `vgg_baseline.py` – VGG-11 model wrapper; replaces the classifier head for 10 classes (see the sketch after this list).
- `train.py` – Training script for the 2D baseline model.
- `train_config.yaml` – Hyperparameter YAML config for the 2D baseline.
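For orientation, here is a minimal sketch of the head replacement that `vgg_baseline.py` performs, assuming torchvision's pretrained VGG-11 (the repo's wrapper may differ in detail):

```python
import torch.nn as nn
from torchvision.models import vgg11, VGG11_Weights

# Load ImageNet-pretrained VGG-11 and swap the 1000-way classifier
# head for a 10-way head matching our Tiny ImageNet subset.
model = vgg11(weights=VGG11_Weights.IMAGENET1K_V1)
model.classifier[6] = nn.Linear(in_features=4096, out_features=10)
```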
**`Data_Augmentation_SVD/`**
- `make_svd_videos.py` – Data augmentation entry point: loads each 224×224 image, invokes SVD to generate two extra frames, and saves the 3-frame clips (sketched after this list).
- `tiny_imagenet_aug.py` – Dataset definition for the SVD data augmentation process.
- `train3d_config.yaml` – Data augmentation YAML config: number of frames to generate, model and data IDs.
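In outline, the generation step looks like the sketch below. The model ID, file names, and resolution handling here are assumptions; `make_svd_videos.py` and `train3d_config.yaml` hold the actual settings.

```python
import torch
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import export_to_video
from PIL import Image

# Model ID is an assumption; the YAML config records the one actually used.
pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt", torch_dtype=torch.float16
).to("cuda")

# Hypothetical input image from the 10-class subset.
image = Image.open("n02085620_42.JPEG").convert("RGB").resize((224, 224))

# Generate a short clip conditioned on the still image; the source frame
# plus two generated frames form one 3-frame "pseudo-video".
frames = pipe(image, height=224, width=224, num_frames=3, decode_chunk_size=2).frames[0]
export_to_video(frames, "n02085620_42.mp4", fps=3)
```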
**`Data_Sourcing/`**
- `filter.py` – Filters ImageNet synsets down to our 10 mammal classes and copies the training images (see the sketch after this list).
- `val.py` – Builds the 500-image validation set from the original Tiny ImageNet val folder.
- `README-data_sourcing.txt` – Detailed information about the data curation process.
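The filtering idea, roughly (the synset IDs and paths are placeholders; `filter.py` and `README-data_sourcing.txt` document the real ones):

```python
import shutil
from pathlib import Path

# Placeholder WordNet IDs; the actual 10 mammal classes are listed in filter.py.
MAMMAL_SYNSETS = ["n02085620", "n02123045"]  # ... 8 more

src = Path("tiny-imagenet-200/train")
dst = Path("tiny_imagenet_10/train")
for wnid in MAMMAL_SYNSETS:
    # Copy only the training images belonging to the selected classes.
    shutil.copytree(src / wnid / "images", dst / wnid, dirs_exist_ok=True)
```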
**`Inflated_3D_VGG/`**
- `video_dataset.py` – PyTorch dataset for loading 3×224×224 clips and their labels.
- `vgg_inflated.py` – Converts a pretrained 2D VGG-11 into a 3D VGG-11 (Conv3D/BatchNorm3D/MaxPool3D); the core of the inflation is sketched after this list.
- `train3d.py` – Training script for the inflated 3D model.
- `train3d_video.yaml` – Hyperparameter YAML config for the inflated 3D model.
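The weight inflation follows the familiar I3D recipe: replicate each 2D kernel along a new temporal axis and rescale. A minimal sketch of a per-layer helper (the actual code in `vgg_inflated.py` may differ):

```python
import torch.nn as nn

def inflate_conv(conv2d: nn.Conv2d, time_dim: int = 3) -> nn.Conv3d:
    """Inflate a 2D conv into a 3D conv by replicating its weights over time."""
    conv3d = nn.Conv3d(
        conv2d.in_channels, conv2d.out_channels,
        kernel_size=(time_dim, *conv2d.kernel_size),
        stride=(1, *conv2d.stride),
        padding=(time_dim // 2, *conv2d.padding),
        bias=conv2d.bias is not None,
    )
    # Repeat the 2D kernel time_dim times along the temporal axis and divide
    # by time_dim so the inflated layer produces activations on roughly the
    # same scale as the pretrained 2D layer.
    weight = conv2d.weight.data.unsqueeze(2).repeat(1, 1, time_dim, 1, 1) / time_dim
    conv3d.weight.data.copy_(weight)
    if conv2d.bias is not None:
        conv3d.bias.data.copy_(conv2d.bias.data)
    return conv3d
```

Dividing by the temporal kernel length is what lets the ImageNet pretraining transfer: the inflated filters respond to a static clip much as the original 2D filters respond to a single frame.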
**`Model_Evaluation/`**
- `evaluate.py` – Runs inference on the baseline vs. the inflated model; computes Top-k accuracy and other metrics.
- `evaluate.yaml` – YAML config for the evaluation script: model IDs, output CSVs.
- `prepare_evaluation_data.py` – Transforms evaluation data for visualization in Tableau.
- `parameter_count.py` & `flops_count.py` – Scripts to report model size and computational cost.
- `evaluation_results/` – Stored confusion matrices, per-class metrics, and other evaluation data.
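For reference, minimal versions of two of the reported quantities; these are generic implementations, not necessarily the exact code in `evaluate.py` or `parameter_count.py`:

```python
import torch

def topk_accuracy(logits: torch.Tensor, targets: torch.Tensor, k: int = 5) -> float:
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    topk = logits.topk(k, dim=1).indices              # (N, k)
    hits = (topk == targets.unsqueeze(1)).any(dim=1)  # (N,)
    return hits.float().mean().item()

def parameter_count(model: torch.nn.Module) -> int:
    """Total number of learnable parameters."""
    return sum(p.numel() for p in model.parameters())
```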
**`scripts/`**
- `debug_video_loader.py` – Quick test to verify the 3D DataLoader and transforms.
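Such a smoke test might look like the following; the `VideoDataset` import path and constructor arguments are assumptions about the repo layout:

```python
from torch.utils.data import DataLoader

# Hypothetical import; the real dataset class lives in Inflated_3D_VGG/video_dataset.py.
from video_dataset import VideoDataset

ds = VideoDataset(root="svd_videos/train", index_csv="svd_videos/train.csv")
loader = DataLoader(ds, batch_size=4, shuffle=True, num_workers=2)

clips, labels = next(iter(loader))
# Expect clips of shape (batch, channels, frames, height, width),
# i.e. roughly (4, 3, 3, 224, 224), and labels of shape (4,).
print(clips.shape, labels.shape)
```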
**`svd_videos/`**
- `train/` & `val/` – The actual 3-frame videos.
- `train.csv`, `val.csv` – Filename-to-label indices for the augmented dataset.
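Assuming each clip is stored as a small video file and the CSVs map a filename to an integer label (the column names below are guesses; see `video_dataset.py` for the real schema), a single clip can be read like this:

```python
import pandas as pd
from torchvision.io import read_video

index = pd.read_csv("svd_videos/train.csv")
row = index.iloc[0]

# Read one 3-frame clip as (T, C, H, W) uint8, then scale to [0, 1] and
# reorder to (C, T, H, W), the layout Conv3d expects.
frames, _, _ = read_video(f"svd_videos/train/{row['filename']}", output_format="TCHW")
clip = (frames.float() / 255.0).permute(1, 0, 2, 3)
label = int(row["label"])
```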
**`utils/`**
- `utils.py` – Common helpers (seed setting, progress bars, metric aggregators).
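The seeding helper typically looks like this common pattern (not necessarily the exact code in `utils.py`):

```python
import random

import numpy as np
import torch

def set_seed(seed: int = 42) -> None:
    """Seed Python, NumPy, and PyTorch RNGs so runs are reproducible."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
```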
**`requirements.txt`**
- Exact `pip` versions tested.