Related Work

CNNs have achieved significant breakthroughs in image classification and object detection. Nevertheless, the sliding window approach must apply a CNN to many different windows, which amounts to repeating image classification on local regions; as a result, it is extremely computationally expensive, because the CNN computation is repeated for every window.
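To give a rough sense of this cost, the following minimal sketch counts how many windows a sliding-window detector would have to classify. The image size, window sizes, and stride are illustrative assumptions, not values taken from any of the cited works, and the function name is hypothetical.

```python
def count_sliding_windows(image_w, image_h, window_sizes, stride):
    """Count the windows a sliding-window detector would classify.

    window_sizes is a list of (width, height) pairs; stride is the step
    in pixels between neighbouring windows. All values are illustrative.
    """
    total = 0
    for win_w, win_h in window_sizes:
        if win_w > image_w or win_h > image_h:
            continue  # this window does not fit inside the image
        steps_x = (image_w - win_w) // stride + 1
        steps_y = (image_h - win_h) // stride + 1
        total += steps_x * steps_y
    return total


# Assumed setup: a 640x480 image, three square window sizes, stride 16.
n_windows = count_sliding_windows(
    640, 480, [(64, 64), (128, 128), (256, 256)], stride=16
)
print(n_windows)  # 2133 separate CNN evaluations for a single image
```

Even with only three window sizes and a coarse stride, the classifier runs thousands of times per image under these assumptions, and the count grows quickly with more scales or a finer stride.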

To reduce this computational cost, researchers looked for ways to reduce the number of candidate locations, rather than running the CNN at every sliding window location. As a result, region proposal methods were developed to find regions that are likely to contain objects [1], which yields fewer candidate regions than the sliding window approach. One of the most significant breakthroughs in object detection is Faster R-CNN [6]. Faster R-CNN performs Region of Interest (RoI) pooling and lets the CNN itself generate the region proposals: a Region Proposal Network (RPN) is inserted as part of the layers of the CNN model to predict the objectness of each candidate region.

Some detection methods were introduced without a region proposal step in order to reduce training and testing time. YOLO [5] and SSD [4] are two such methods. The input image goes directly into one large convolutional neural network. Inside the network, the input image is first divided into many grid cells, and the classification scores and the bounding box coordinates and scales are determined for each grid cell. The overall object classes and bounding boxes are then computed from the per-cell results. These two approaches further reduce training and test time, but accuracy is compromised compared with methods that use region proposals [3]. Region proposal methods reduce training and testing time compared with sliding window techniques and help increase detection accuracy; in fact, Faster R-CNN achieves the highest accuracy among the methods discussed here. In terms of training and testing time, SSD is significantly faster than the other methods because it dispenses with the region proposal step, at the cost of reduced accuracy compared with region-proposal-based methods.
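The per-cell prediction described above can be illustrated with a minimal sketch of how one grid cell's output might be turned into a bounding box in pixel coordinates. It assumes a YOLO-style parameterization in which the box centre is predicted as an offset within the cell and the width and height are predicted relative to the whole image; the function name, grid size, and input values are hypothetical, and SSD uses a different scheme based on default boxes.

```python
def decode_cell_box(row, col, grid_size, pred, image_w, image_h):
    """Convert one grid cell's prediction into a pixel-space box.

    pred = (x_off, y_off, w_rel, h_rel), each assumed to lie in [0, 1]:
      x_off, y_off: box centre as a fraction of the cell's width/height
      w_rel, h_rel: box size as a fraction of the full image size
    Returns (x_min, y_min, x_max, y_max) in pixels.
    """
    x_off, y_off, w_rel, h_rel = pred
    cell_w = image_w / grid_size
    cell_h = image_h / grid_size
    # Centre of the box: cell origin plus the offset inside the cell.
    cx = (col + x_off) * cell_w
    cy = (row + y_off) * cell_h
    # Box size is expressed relative to the whole image.
    w = w_rel * image_w
    h = h_rel * image_h
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


# Assumed setup: 448x448 image, 7x7 grid, prediction from the centre cell.
box = decode_cell_box(3, 3, 7, (0.5, 0.5, 0.25, 0.5), 448, 448)
print(box)  # (168.0, 112.0, 280.0, 336.0)
```

Because every cell is decoded from a single forward pass of the network, no separate proposal stage is needed, which is where the speed advantage discussed above comes from.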

We also plan to improve a current state-of-the-art method or develop a new one to achieve real-time face detection and recognition. Finally, we will select the best method and model based on an evaluation of the trade-off between accuracy and speed of detection and recognition.

[1] Bogdan Alexe, Thomas Deselaers, and Vittorio Ferrari. Measuring the objectness of image windows. IEEE transactions on pattern analysis and machine intelligence, 34(11):2189–2202, 2012.

[2] Ross Girshick. Fast r-cnn. arXiv preprint arXiv:1504.08083, 2015.

[3] Jonathan Huang, Vivek Rathod, Chen Sun, Menglong Zhu, Anoop Korattikara, Alireza Fathi, Ian Fischer, Zbigniew Wojna, Yang Song, Sergio Guadarrama, et al. Speed/accuracy trade-offs for modern convolutional object detectors. In IEEE CVPR, 2017.

[4] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C Berg. Ssd: Single shot multibox detector. In European conference on computer vision, pages 21–37. Springer, 2016.

[5] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 779–788, 2016.

[6] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pages 91–99, 2015.

[7] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.