This repo details the work done for the module project of NUS CS4243 Computer Vision and Pattern Recognition.
Given a video or image dataset, we are required to develop an algorithm to classify each sample into 3 classes:
- carrying (a weapon is being carried)
- normal (no weapon is seen)
- threat (a weapon-based threat can be detected)
Given a video for classification, we extract its image and audio data.
For image data, we apply YOLOv5 to extract cropped bounding boxes of the detected humans.
For audio data, we apply librosa to extract the mel spectrogram.
- Image Processing

We perform data cleaning on the image dataset by iterating through the given dataset for the `normal`, `carrying` and `threat` classes. For each class, after applying YOLO, we are able to filter out images without a person in them. The implementation can be found here. A minimal sketch of this step follows.
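As an illustration only (the paths, glob pattern and confidence threshold below are assumptions, not the project's actual settings), this sketch loads a pretrained YOLOv5 model via torch.hub, keeps frames containing at least one confident `person` detection, and saves the cropped boxes:

```python
from pathlib import Path

import torch
from PIL import Image

# Load a small pretrained YOLOv5 model from the official torch.hub entry point.
model = torch.hub.load("ultralytics/yolov5", "yolov5s", pretrained=True)

def crop_persons(frame_dir: str, out_dir: str, conf_thresh: float = 0.5) -> None:
    """Save a crop for every confident person detection; skip person-free frames."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for img_path in sorted(Path(frame_dir).glob("*.jpg")):
        # One row per detection: xmin, ymin, xmax, ymax, confidence, class, name.
        det = model(str(img_path)).pandas().xyxy[0]
        persons = det[(det["name"] == "person") & (det["confidence"] >= conf_thresh)]
        if persons.empty:
            continue  # frame has no detected person, so it is filtered out
        img = Image.open(img_path)
        for i, row in persons.iterrows():
            box = (int(row["xmin"]), int(row["ymin"]), int(row["xmax"]), int(row["ymax"]))
            img.crop(box).save(out / f"{img_path.stem}_person{i}.jpg")
```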
- Audio Processing

For every video in a class, the mel spectrogram extracted with librosa is saved. The implementation can be found here. A minimal sketch of this step follows.
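A rough sketch of the extraction, assuming the audio track has already been exported to a `.wav` file (the file paths and `n_mels` value are illustrative):

```python
import numpy as np
import librosa

def save_mel_spectrogram(wav_path: str, out_path: str) -> None:
    # Load the audio at librosa's default sampling rate (22050 Hz).
    y, sr = librosa.load(wav_path)
    # Mel-scaled spectrogram, converted from power to decibels, which is the
    # usual input representation for spectrogram classifiers.
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
    mel_db = librosa.power_to_db(mel, ref=np.max)
    np.save(out_path, mel_db)
```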
- Data Splitting

We split the dataset at the video level: if a video's frames are used for training, its mel spectrogram is also used for training, and the same holds for validation and testing. The implementation can be found here. A minimal sketch of this step follows.
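One way to realise such a video-level split is to shuffle the video ids once and filter both modalities by the resulting partitions, so frames and spectrograms never leak across splits. The ratios and seed below are illustrative assumptions:

```python
import random

def split_videos(video_ids, train_frac=0.7, val_frac=0.15, seed=0):
    """Partition video ids so each video's frames and audio share a split."""
    ids = list(video_ids)
    random.Random(seed).shuffle(ids)  # deterministic shuffle for reproducibility
    n_train = int(len(ids) * train_frac)
    n_val = int(len(ids) * val_frac)
    return {
        "train": ids[:n_train],
        "val": ids[n_train:n_train + n_val],
        "test": ids[n_train + n_val:],  # remainder is held out for testing
    }
```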
In the `training` directory, you can find the scripts for image training, audio training and ensemble testing.
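The exact ensembling rule lives in those scripts; purely as an illustration of one common approach, the sketch below combines the two modalities by a weighted average of softmax probabilities. The model interfaces, tensor shapes and weight are hypothetical placeholders:

```python
import torch
import torch.nn.functional as F

CLASSES = ["carrying", "normal", "threat"]

@torch.no_grad()
def ensemble_predict(image_model, audio_model, frames, mel, w_image=0.5):
    """Weighted average of per-modality class probabilities for one video."""
    # frames: (num_frames, C, H, W) batch of person crops from the video;
    # mel: (1, C, H, W) tensor built from the video's mel spectrogram.
    img_probs = F.softmax(image_model(frames), dim=1).mean(dim=0)  # average over frames
    aud_probs = F.softmax(audio_model(mel), dim=1).squeeze(0)
    probs = w_image * img_probs + (1.0 - w_image) * aud_probs
    return CLASSES[int(probs.argmax())]
```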
In the `demo` directory, you can find our original demo video and the output video with a class annotation on all of its frames.
You can try other videos by changing the `video_name` variable in `demo/demo.ipynb`.