This is an implementation of YOLO (You Only Look Once), a fast, real-time object detection algorithm that is widely used in the field of computer vision. It is capable of detecting multiple objects in an image and assigning them semantic labels based on their class. The following image is an example of the output of an object detection model:
Here, the different colors indicate different object classes.
If you want to learn more about YOLO, here are some useful resources: Original YOLO paper | Intuitive Explanation | YOLO Video Tutorial | Mean Average Precision | Intersection over Union
The data for this project consists of 10,000 street scene images and their corresponding labels. The images are 128x128x3 in dimension, and the labels include the semantic class and bounding box for each object in the image. Note that a small portion of these labels may be noisy, and the size of the training set is not large, so it may not be possible to learn a highly robust object detector.
To use the data, download the images from here and the labels from this repository as labels.npz.
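As a minimal sketch of accessing the labels (the archive's key names are an assumption; inspect `data.files` to see what it actually contains):

```python
import numpy as np

# Load the label archive downloaded as labels.npz.
# The key name below is hypothetical -- check data.files first.
data = np.load("labels.npz", allow_pickle=True)
print(data.files)
```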
Before training the model, the labels must be converted into a ground truth matrix with dimension $8 \times 8 \times 8$:

- The image is divided into $8 \times 8$ grid cells, with each cell representing a 16x16 patch in the original image.
- For simplicity, only one anchor box is used, with the same size as the grid cell. If the center of an object falls within a grid cell, that cell is responsible for detecting the object.
- Each anchor has 8 channels: Pr(Objectness), $x$, $y$, $w$, $h$, P(class=pedestrian), P(class=traffic light), and P(class=car).
- Pr(Objectness) is the probability that this anchor contains an object rather than background: "1" indicates an object, and "0" indicates background.
- The $x$ and $y$ coordinates represent the center of the bounding box relative to the bounds of the grid cell, and $w$ and $h$ are the width and height of the bounding box relative to the width and height of the image.
- The final three channels hold the semantic label of the object, encoded as a one-hot vector.
- If the anchor does not contain an object (Pr = 0), the values of channels 2-8 are left unassigned, as they are not used during training.
- The dimensions are ordered as (channels, x, y). A sketch of this conversion is given below.
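A minimal sketch of this conversion, assuming each raw label is a `(class_id, xc, yc, w, h)` tuple in pixel coordinates (the exact layout inside labels.npz may differ):

```python
import numpy as np

GRID, CELL, IMG = 8, 16, 128  # 8x8 cells, 16 px per cell, 128 px images
NUM_CLASSES = 3               # pedestrian, traffic light, car

def encode_labels(objects):
    """Convert a list of (class_id, xc, yc, w, h) in pixels into the
    8x8x8 ground truth matrix, ordered as (channels, x, y)."""
    gt = np.zeros((5 + NUM_CLASSES, GRID, GRID), dtype=np.float32)
    for cls, xc, yc, w, h in objects:
        gx, gy = int(xc // CELL), int(yc // CELL)  # responsible grid cell
        gt[0, gx, gy] = 1.0                        # Pr(Objectness)
        gt[1, gx, gy] = xc / CELL - gx             # x offset within the cell
        gt[2, gx, gy] = yc / CELL - gy             # y offset within the cell
        gt[3, gx, gy] = w / IMG                    # width relative to image
        gt[4, gx, gy] = h / IMG                    # height relative to image
        gt[5 + cls, gx, gy] = 1.0                  # one-hot class label
    return gt
```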
This model takes input with dimension $128 \times 128 \times 3$ and produces output with dimension $8 \times 8 \times 8$, matching the ground truth matrix.

| Layer | Hyperparameters |
|---|---|
| conv1 | Kernel size |
| conv2 | Kernel size |
| conv3 | Kernel size |
| conv4 | Kernel size |
| conv5 | Kernel size |
| conv6 | Kernel size |
| transposed_conv7 | Kernel size |
| transposed_conv8 | Kernel size |
| conv9 | Kernel size |
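The kernel hyperparameters are not recoverable from the table, so the following PyTorch sketch is only an assumption about the shapes: six stride-2 convolutions downsample $128 \times 128$ to $2 \times 2$, two stride-2 transposed convolutions upsample back to $8 \times 8$, and a final $1 \times 1$ convolution produces the 8 output channels. The channel widths and kernel sizes here are illustrative, not the ones actually used.

```python
import torch
import torch.nn as nn

class YoloNet(nn.Module):
    """Hypothetical layer shapes; only the layer count and the
    128x128x3 -> 8x8x8 mapping come from the write-up."""
    def __init__(self):
        super().__init__()
        chans = [3, 32, 64, 128, 256, 512, 1024]
        blocks = []
        for c_in, c_out in zip(chans, chans[1:]):
            # conv1..conv6: each halves the spatial resolution (128 -> 2)
            blocks += [nn.Conv2d(c_in, c_out, 4, stride=2, padding=1),
                       nn.BatchNorm2d(c_out), nn.ReLU(inplace=True)]
        self.encoder = nn.Sequential(*blocks)
        # transposed_conv7, transposed_conv8: upsample 2 -> 8
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(1024, 256, 4, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(256, 64, 4, stride=2, padding=1),
            nn.ReLU(inplace=True),
        )
        # conv9: 1x1 conv down to the 8 prediction channels
        self.head = nn.Conv2d(64, 8, kernel_size=1)

    def forward(self, x):
        out = self.head(self.decoder(self.encoder(x)))
        # sigmoid keeps objectness, offsets, sizes, and class scores in [0, 1]
        return torch.sigmoid(out)
```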
During training, the localization and classification errors are optimized jointly. The loss function, taken from the original YOLO paper, is:

$$
\begin{aligned}
\mathcal{L} ={}& \lambda_\text{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^\text{obj} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
&+ \lambda_\text{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^\text{obj} \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
&+ \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^\text{obj} \left(C_i - \hat{C}_i\right)^2 + \lambda_\text{noobj} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^\text{noobj} \left(C_i - \hat{C}_i\right)^2 \\
&+ \sum_{i=0}^{S^2} \mathbb{1}_{i}^\text{obj} \sum_{c \in \text{classes}} \left(p_i(c) - \hat{p}_i(c)\right)^2
\end{aligned}
$$

where:
- In our case there is only one anchor box at each grid cell, hence $B = 1$.
- $S^2 =$ total number of grid cells.
- $\mathbb{1}_{ij}^\text{obj} = 1$ if an object appears in grid cell $i$, and 0 otherwise.
- $\hat{C}_i =$ box confidence score $=$ Pr(box contains object) $\times$ IoU.
- IoU $=$ intersection over union between the predicted box and the ground truth.
- $\hat{p}_i(c) =$ conditional class probability of class $c$ in cell $i$.
Each grid cell predicts one bounding box, a confidence score for that box, and the class conditional probabilities.
The confidence score reflects how confident the model is that the box contains an object and how accurate the box is: if no object exists in the cell, the confidence score should be 0; otherwise it should equal the IoU between the predicted box and the ground truth box.
During training, I use the Adam optimizer with a learning rate of $10^{-3}$ and the default $\beta_1$ and $\beta_2$. I also visualize the loss over training iterations, and based on the loss curve I train the model for 20 epochs.
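A minimal training loop under these settings (the dataset and `train_loader` are assumed and not shown in the original):

```python
import torch

# YoloNet and yolo_loss as sketched above; train_loader is hypothetical
model = YoloNet()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # default betas (0.9, 0.999)

losses = []
for epoch in range(20):
    for images, targets in train_loader:   # (N, 3, 128, 128), (N, 8, 8, 8)
        optimizer.zero_grad()
        loss = yolo_loss(model(images), targets)
        loss.backward()
        optimizer.step()
        losses.append(loss.item())          # for plotting loss vs. iteration
```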
During inference, the network predicts many overlapping, redundant bounding boxes. Eliminating them takes two steps:

- Discard predicted boxes with low objectness probability (Pr $< 0.6$).
- For each class, compute the IoU between all remaining bounding boxes and cluster boxes with IoU $> 0.5$ into a group. Within each group, keep the box with the highest Pr and suppress the others. This is referred to as non-maximum suppression (NMS); a sketch follows below.
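A sketch of this procedure, assuming boxes have already been decoded to $(x_1, y_1, x_2, y_2)$ corner format with one score per box (the decoding step is omitted):

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, all (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0]); y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2]); y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = lambda b: (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    return inter / (area(np.asarray(box)) + area(boxes) - inter)

def nms(boxes, scores, score_thresh=0.6, iou_thresh=0.5):
    """Single-class non-maximum suppression; returns indices of kept boxes.
    Run once per class on that class's boxes."""
    keep_mask = scores >= score_thresh   # step 1: drop low-Pr boxes
    idx = np.argsort(-scores)            # highest Pr first
    idx = idx[keep_mask[idx]]
    kept = []
    while idx.size > 0:
        best = idx[0]
        kept.append(best)
        rest = idx[1:]
        # step 2: suppress boxes overlapping the best box by IoU > 0.5
        idx = rest[iou(boxes[best], boxes[rest]) <= iou_thresh]
    return kept
```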
To evaluate the performance of the YOLO implementation, I compute the mean Average Precision (mAP) of inference. A predicted bounding box matches a ground truth box if they share the same label and their IoU exceeds 0.5. These matches yield a precision/recall curve for each class; the Average Precision for a class is the area under this curve, and the mean of the Average Precision values over all classes gives the mean Average Precision of the network.
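A sketch of the per-class Average Precision computation, assuming detections have already been matched to ground truth (each flagged as a true or false positive) and sorted by descending confidence; the matching logic itself is omitted:

```python
import numpy as np

def average_precision(tp_flags, num_gt):
    """tp_flags: 1/0 per detection, sorted by descending confidence;
    1 = matched a same-class ground truth box with IoU > 0.5.
    num_gt: number of ground truth boxes for this class."""
    tp_flags = np.asarray(tp_flags, dtype=float)
    tp = np.cumsum(tp_flags)
    fp = np.cumsum(1.0 - tp_flags)
    precision = tp / (tp + fp)
    recall = tp / num_gt
    # area under the precision/recall curve (rectangular integration)
    return float(np.sum(precision * np.diff(recall, prepend=0.0)))

# mAP = mean of average_precision over the three classes
```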