Develop a model to perform human activity recognition, specifically to detect falls. Falls are an important health problem worldwide, and reliable automatic fall detection systems can play an important role in mitigating their negative consequences. The automatic detection of falls has attracted considerable attention in the computer vision and pattern recognition communities. To tackle this problem, we leverage the following two neural network models:
- the Fall-Detection-with-CNNs-and-Optical-Flow model, based on the paper "Vision-Based Fall Detection with Convolutional Neural Networks" by Núñez-Marcos et al.
- I3D models based on models reported in the paper: "Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset" by Joao Carreira and Andrew Zisserman
The basic idea of this project is to apply transfer learning to fall detection. Specifically, we preprocess the fall detection videos into different crops, where each crop contains 20 frames and every single frame is labelled. The original videos are in RGB. In addition to this domain, we also convert the RGB frames to the optical flow domain, which captures the correlation and velocity of moving objects across frames. Combining the data from the two domains, we perform transfer learning on the two networks above, which were pre-trained either on fall detection with other datasets or on the Kinetics-400 dataset, which contains videos of daily activities, sports, etc., across 400 labelled categories. We fine-tune the two networks with our prepared dataset and perform fall detection inference once they are trained.
Overall, the I3D model achieved 94% detection accuracy with 82% precision and 82% recall, while the optical flow CNN model achieved 99% (+/- 1%) accuracy with a 1% (+/- 1%) false positive rate and a 6% (+/- 3%) false negative rate.
These instructions will get you a copy of the project up and running on your local machine for development and testing purposes. See deployment for notes on how to deploy the project on a live system. The repository contains the following files:
- Sources for downloading the three fall datasets
- Dataset Pre-processing
- Source code for the two models: a) optical flow CNN, b) I3D model
- Results presentation.
There are three different datasets used in this project: 1) UR Fall Detection Dataset; 2) Kinetics Human Action Video Dataset; 3) Multiple Cameras Fall Dataset.
In order to describe how we process the data, we first introduce some notation. We use V to represent one video sample and V(t) to represent the t-th frame of this video. Lv is the length of the video, Sv is the index of the first frame labelled as "Fall", and Ev is the index of the frame where the label returns to "No Fall". We also have two hyperparameters. One is the window size W, defined as the number of frames per sample, where a sample is the data unit used for training and evaluating the model. The other is T, a threshold on the number of "Fall" frames within a window.
All of our videos come with frame-wise labelling: each frame is labelled 1 if it is a "Fall" frame and 0 otherwise. With this setup, we introduce our algorithm for cutting a video into different pieces.
```python
def cut_video(labels, W, T):
    """Cut a frame-labelled video V into samples of W frames each.

    labels: per-frame labels for V (1 = "Fall", 0 = "No Fall").
    A sample is labelled "Fall" if it contains more than T "Fall" frames.
    """
    samples = []
    for i in range(0, len(labels), W):            # len(labels) == Lv
        window = labels[i:i + W]                  # frames V(i) .. V(i+W)
        sample_label = "Fall" if sum(window) > T else "No Fall"
        samples.append((window, sample_label))
    return samples
```
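For example, the function above could be applied to a short clip as follows; the per-frame labels and the threshold value are hypothetical, chosen only to illustrate the windowing (W = 20 matches the 20-frame crops used in our preprocessing):

```python
# Hypothetical per-frame labels for a 60-frame clip: a fall occurs mid-video.
frame_labels = [0] * 25 + [1] * 20 + [0] * 15

samples = cut_video(frame_labels, W=20, T=10)   # T=10 is illustrative only
for window, label in samples:
    print(len(window), label)   # prints: 20 No Fall / 20 Fall / 20 No Fall
```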
Using the above algorithm, we process all the videos and then apply a train/test split.
Optical flow images represent the motion between two consecutive frames, which is too short a time span to detect a fall. However, by stacking a set of them, the network can also learn longer-term temporal features. These features were used as the input of a classifier, a fully connected neural network (FC-NN), which outputs a "fall" or "no fall" signal. The full pipeline can be seen in Figure 1. Finally, we used a three-step training process, following the original paper.
The optical flow algorithm represents the patterns of motion of objects as displacement vector fields between two consecutive images.
The following figure shows the system architecture, or pipeline: the RGB images are converted to optical flow images, then features are extracted with a CNN, and an FC-NN decides whether there has been a fall or not.
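To illustrate the conversion step, here is a minimal sketch that computes dense optical flow with OpenCV's Farnebäck method and stacks the flow fields into a network input. This is an assumption for illustration only (the project may use a different flow algorithm such as TV-L1), and `video.avi` is a placeholder path:

```python
import cv2
import numpy as np

cap = cv2.VideoCapture("video.avi")          # placeholder path
ok, prev = cap.read()
prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)

flows = []
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # Dense displacement field (dx, dy) for every pixel between consecutive frames.
    flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    flows.append(flow)                        # shape (height, width, 2)
    prev_gray = gray
cap.release()

# Stack 20 consecutive flow fields to form one network input (W = 20 frames).
stacked = np.concatenate(flows[:20], axis=-1)  # shape (height, width, 40)
```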
VGG-16 is based on convolutional neural networks (CNNs), which have a much lower computation cost than recurrent neural networks (RNNs) or long short-term memory (LSTM) networks. Furthermore, as we will show later, the CNN-based model can effectively capture the temporal correlation in the video clips, which makes the use of an RNN or LSTM unnecessary. From this perspective, this lightweight model is much more suitable for real-time detection scenarios.
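The sketch below, assuming 224x224 inputs and 2*W stacked flow channels, shows how such a VGG-16 backbone plus FC-NN head could be assembled in Keras. The backbone is left untrained here for brevity (the project fine-tunes pre-trained weights), and the layer sizes are illustrative:

```python
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

W = 20  # frames per sample -> 2 * W flow channels (x and y displacements)

# Untrained VGG-16 backbone accepting the stacked optical flow volume.
backbone = VGG16(weights=None, include_top=False, input_shape=(224, 224, 2 * W))

model = models.Sequential([
    backbone,
    layers.Flatten(),
    layers.Dense(4096, activation="relu"),   # fully connected classifier head
    layers.Dropout(0.5),
    layers.Dense(1, activation="sigmoid"),   # "fall" vs "no fall"
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```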
Cross-validation is necessary for testing machine learning algorithms on a limited dataset, which is usually the case for video classification problems. In our experiment, we adopted 5-fold cross-validation. The workflow is as follows (a minimal code sketch follows the list):
- Shuffle the dataset randomly.
- Split the dataset into 5 groups.
- For each group, conduct the following experiment:
  a. Reserve the current group as the test set.
  b. Use the remaining 4 groups as the training set.
  c. Train the model on the training set.
  d. Evaluate the model on the reserved test set (the current group).
  e. Record the evaluation scores and discard the model.
- Summarize the performance metrics across all 5 experiments.
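A minimal sketch of this protocol uses scikit-learn's KFold, with a stand-in classifier and randomly generated data in place of the project's actual model and video features:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

# Dummy stand-ins for the real feature matrix and labels (1 = "Fall", 0 = "No Fall").
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 32))
y = rng.integers(0, 2, size=200)

kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = []
for train_idx, test_idx in kf.split(X):
    model = LogisticRegression(max_iter=1000)   # fresh model per fold, discarded afterwards
    model.fit(X[train_idx], y[train_idx])       # train on the 4 remaining groups
    scores.append(model.score(X[test_idx], y[test_idx]))  # evaluate on the held-out group

print(f"Accuracy: {np.mean(scores):.1%} (+/- {np.std(scores):.1%})")
```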
Dataset I is prepared at the frame level, which means that we treat each frame, or group of frames, independently, even though they might come from the same video clip. This dataset is easier to prepare; however, it might inadvertently introduce weak correlation between the training and test sets. This problem is addressed in Dataset II.
The 5-fold cross-validated accuracy, true positive rate (TPR), true negative rate (TNR), false positive rate (FPR), and false negative rate (FNR) are as follows:
Accuracy: 99% (+/- 1%)
TPR: 100% (+/- 1%)
TNR: 99% (+/- 1%)
FPR: 1% (+/- 1%)
FNR: 0% (+/- 1%)
The reported results above are the averages of the 5 independent experiments from the 5-fold cross-validation, and they show the high accuracy and low variance in the performance of our trained model.
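For reference, these rates can be computed from the confusion matrix of each fold; below is a minimal sketch with scikit-learn, where `y_true` and `y_pred` are illustrative placeholders rather than the project's actual predictions:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Illustrative fold predictions: 1 = "Fall", 0 = "No Fall".
y_true = np.array([1, 1, 0, 0, 1, 0, 0, 1])
y_pred = np.array([1, 1, 0, 0, 1, 0, 1, 1])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
tpr = tp / (tp + fn)   # true positive rate (recall on "Fall")
tnr = tn / (tn + fp)   # true negative rate
fpr = fp / (fp + tn)   # false positive rate
fnr = fn / (fn + tp)   # false negative rate
print(f"TPR={tpr:.0%} TNR={tnr:.0%} FPR={fpr:.0%} FNR={fnr:.0%}")
```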
Dataset II is prepared in such a way that all the frames of one particular video clip are either in the training set or in the test set, so there is no correlation between samples in the training set and samples in the test set.
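One way to enforce such a video-level split is to group samples by their source clip when building the folds; the sketch below uses scikit-learn's GroupKFold with placeholder arrays (not the project's actual data loader) to show the idea:

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Illustrative placeholders: 8 samples drawn from 4 video clips.
X = np.arange(8).reshape(-1, 1)                 # features (placeholder)
y = np.array([0, 0, 1, 1, 0, 1, 0, 1])          # window labels
groups = np.array([0, 0, 1, 1, 2, 2, 3, 3])     # source video id per sample

gkf = GroupKFold(n_splits=4)
for train_idx, test_idx in gkf.split(X, y, groups=groups):
    # All samples from a given video land entirely in train or entirely in test.
    assert set(groups[train_idx]).isdisjoint(groups[test_idx])
```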
Accuracy: 99% (+/- 1%)
TPR: 94% (+/- 3%)
TNR: 99% (+/- 1%)
FPR: 1% (+/- 1%)
FNR: 6% (+/- 3%)
Similarly, the reported results above are the averages of the 5 independent experiments from the 5-fold cross-validation, and they once again show the high accuracy and low variance in the performance of our trained model.