3. How it Works
The main effort in software architecture design has gone into the Signapse video processing pipeline. A basic example pipeline is shown in the diagram below. The initial element, `Camera`, creates events; these are passed along pipeline links and consumed by the GUI. Along the way, the various pipeline links process these events, modifying the data and passing it downstream. Callbacks deal with the type `Scene`, which has been created to store information about a given camera frame, including metadata and the frame pixel values. `Scene` is suitable for passing information about any sign output, not just ASL alphabet characters.
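
To make this concrete, a `Scene`-style container might look something like the sketch below. The member names (`frame`, `timestamp`, `cnnResult`, `cnnConfidence`) are illustrative assumptions rather than the actual Signapse definition; they simply reflect the metadata and pixel values described above.

```cpp
#include <opencv2/core.hpp>
#include <chrono>
#include <optional>
#include <string>

// Hypothetical Scene container; field names are assumptions, not the real definition.
struct Scene {
    cv::Mat frame;                                    // raw or pre-processed pixel data
    std::chrono::steady_clock::time_point timestamp;  // capture time metadata
    std::optional<std::string> cnnResult;             // classifier output, if this frame was processed
    std::optional<float> cnnConfidence;               // confidence of the predicted sign
};
```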
Pipeline links have been crafted to adhere strongly to SOLID object-oriented principles, as described by the inheritance hierarchy shown in the UML diagram below. At the top, the basic interface `SceneCallback` defines a pure virtual function, `NextScene()`, which allows an inheriting class to receive a `Scene` via a callback. The derived class `PipelineLink` adds the method `RegisterCallback`; this enables both transmission and reception of `Scene` objects, making it suitable for intermediate video processing stages. These interfaces are then further derived to produce pipeline elements that are interchangeable and serve a common interface. The `Camera` object inherits `PipelineLink` and adds functionality to interface with camera hardware, while the `PreProcessor` object handles the required image processing (including pixel formatting and drawing rectangles). `LinkSplitter` inherits `PipelineLink` while adding a secondary callback so that a scene may be passed to two further elements. For latency-bound elements, `SchedulableLink` has been derived from `PipelineLink`; this is further described in the Real-time Considerations section. `GUI` inherits `SceneCallback` directly, as it has no requirement to pass scenes further downstream.
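
The top of this hierarchy can be approximated by the following sketch. It builds on the `Scene` sketch above; `SceneCallback`, `NextScene` and `RegisterCallback` follow the description, while `PassDownstream` and the exact signatures are assumptions.

```cpp
#include <memory>

// Receives scenes via callback (the GUI and all pipeline links implement this).
class SceneCallback {
public:
    virtual ~SceneCallback() = default;
    virtual void NextScene(std::shared_ptr<Scene> scene) = 0;  // pure virtual
};

// A link that can also forward scenes downstream.
class PipelineLink : public SceneCallback {
public:
    void RegisterCallback(std::shared_ptr<SceneCallback> next) { next_ = std::move(next); }
protected:
    // Helper name assumed; forwards a scene to the registered callback.
    void PassDownstream(std::shared_ptr<Scene> scene) {
        if (next_) next_->NextScene(std::move(scene));
    }
private:
    std::shared_ptr<SceneCallback> next_;
};
```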
The remaining application logic constructs this pipeline in the arrangement shown below. The `GUI` element receives two streams: one from the `CNNProcessor` and one from the `LinkSplitter` instance. Logic within `Gui::NextScene` displays the frame if it does not contain CNN results; otherwise it updates its progress bar.
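
A hypothetical version of this wiring, building on the interface sketch above, is shown below. `RegisterSecondaryCallback` and the free `BuildPipeline` function are assumed names; only the element names and the two-branch arrangement come from the design described here.

```cpp
#include <memory>

// Splitter with an assumed secondary-callback method, forwarding each scene to two branches.
class LinkSplitter : public PipelineLink {
public:
    void RegisterSecondaryCallback(std::shared_ptr<SceneCallback> cb) { secondary_ = std::move(cb); }
    void NextScene(std::shared_ptr<Scene> scene) override {
        if (secondary_) secondary_->NextScene(scene);  // branch 2: towards the CNN
        PassDownstream(std::move(scene));              // branch 1: towards the GUI
    }
private:
    std::shared_ptr<SceneCallback> secondary_;
};

// Hypothetical construction of the pipeline arrangement described above.
void BuildPipeline(std::shared_ptr<PipelineLink> camera,
                   std::shared_ptr<PipelineLink> preProcessor,
                   std::shared_ptr<LinkSplitter> splitter,
                   std::shared_ptr<PipelineLink> cnnProcessor,
                   std::shared_ptr<SceneCallback> gui) {
    camera->RegisterCallback(preProcessor);             // Camera -> PreProcessor
    preProcessor->RegisterCallback(splitter);           // PreProcessor -> LinkSplitter
    splitter->RegisterCallback(gui);                    // frames go straight to the GUI
    splitter->RegisterSecondaryCallback(cnnProcessor);  // frames also go to the CNN
    cnnProcessor->RegisterCallback(gui);                // CNN results end up at the GUI too
    // Gui::NextScene then displays frames without CNN results and uses frames
    // carrying CNN results to update the progress bar.
}
```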
The real-time requirement for Signapse is clear: the smoother the GUI output framerate, the more intuitive, educational and easy to use the system becomes. The framerate at the GUI must therefore be as high as possible while the CNN processes as many frames as possible.
A challenge emerges when consideration is given to the CNN execution latency for processing a single `Scene`. Testing reveals that on a desktop Ubuntu machine with moderate specs, execution latency is around 80ms, and on the RPi this increases to around 660ms (you can try this out by running `./test/main` and checking the output). In both cases, processing every frame with the CNN would not satisfy a basic (minimal) framerate of ~15fps: 80ms per frame caps throughput at roughly 12.5fps, and 660ms at roughly 1.5fps. A naive approach to boost throughput could be to process every other frame (skipping half), with the number of frames to skip adjusted per platform according to the network execution latency. However, this approach assumes the software knows the execution latency before the CNN is executed (from a given preset); it is therefore not resilient to varying background CPU loads or camera input framerates, and is generally unfriendly to cross-platform integration.
Instead, we have provided an object-oriented approach that de-couples latency-bound processing elements from the video pipeline throughput. This ensures a consistent experience across hardware platforms and is resilient to changing latency at execution time. As described in the Processing Architecture section, the system makes use of various `PipelineLink` callback objects. Each link in the pipeline must be constructed from a `PipelineLink` or an object which inherits its properties. For latency-bound pipeline stages, a new compatible interface is provided: `SchedulableLink`. Instances of this type process `Scene` objects when capacity allows, rejecting frames otherwise. `SchedulableLink` contains a worker queue of `Scene` objects and a worker thread which is only awoken when an object is added to the queue. The queue is implemented using an instance of our custom data structure, `BlockingQueue`, which extends the `std::deque` object to sleep the calling thread when `Pop` is called without an available output. Within `SchedulableLink`, the `NextScene` callback from `PipelineLink` is overridden with logic that adds a `Scene` to the worker queue only if the queue is empty. The method `Run` processes elements on the queue, calling the next callback routine. Objects which inherit `SchedulableLink` must implement the pure virtual function `ProcessScene`, which should contain the logic for the high-latency stage. Within Signapse, the object `CNNProcessor` inherits from `SchedulableLink` and calls network inference logic from its implementation of `ProcessScene`.
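
A condensed sketch of this mechanism is given below. It follows the behaviour described above (a blocking queue, a `NextScene` override that only enqueues when the queue is empty, a `Run` loop on the worker thread and a pure virtual `ProcessScene`), but the details are simplified assumptions; in particular, the real `BlockingQueue` extends `std::deque`, whereas this sketch merely wraps one with a mutex and condition variable.

```cpp
#include <atomic>
#include <condition_variable>
#include <deque>
#include <memory>
#include <mutex>

// Simplified blocking queue: Pop() sleeps the calling thread until data arrives.
template <typename T>
class BlockingQueue {
public:
    void Push(T item) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            queue_.push_back(std::move(item));
        }
        cv_.notify_one();  // wake the worker thread
    }
    T Pop() {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [this] { return !queue_.empty(); });  // sleep while empty
        T item = std::move(queue_.front());
        queue_.pop_front();
        return item;
    }
    bool Empty() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return queue_.empty();
    }
private:
    mutable std::mutex mutex_;
    std::condition_variable cv_;
    std::deque<T> queue_;
};

// Latency-bound link: a frame is only accepted when the worker queue is empty,
// so the upstream pipeline is never blocked by slow processing.
class SchedulableLink : public PipelineLink {
public:
    void NextScene(std::shared_ptr<Scene> scene) override {
        if (queue_.Empty()) {
            queue_.Push(std::move(scene));  // accepted for processing
        }                                   // otherwise the frame is rejected
    }
    void Run() {  // executed on the worker thread
        while (running_) {
            auto scene = queue_.Pop();        // blocks (sleeps) until a frame is queued
            ProcessScene(scene);              // high-latency work, e.g. CNN inference
            PassDownstream(std::move(scene));
        }
    }
protected:
    virtual void ProcessScene(std::shared_ptr<Scene>& scene) = 0;  // implemented by e.g. CNNProcessor
private:
    BlockingQueue<std::shared_ptr<Scene>> queue_;
    std::atomic<bool> running_{true};
};
```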
This architecture enables frames to be passed through a `SchedulableLink` while the maximum possible number are allocated to the latency-bound routine. This ensures a smooth frame rate at the GUI for any CNN latency. Additionally, the worker queue idea lends itself well to a future multithreading extension; this is elaborated further in the Future Work section. Finally, on platforms with higher CNN execution latency (such as the RPi), the progress bar will be updated less frequently. This means users must hold a sign for longer on slower devices to complete the progress challenge. As a remedy, a threshold parameter is provided and modifiable in the Gui; this allows the user to set how quickly the progress bar fills. Users may easily adjust how long to sign for, creating a consistent experience across platforms.
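
One way the threshold could feed into the progress logic is sketched below; the count-of-matching-results model is purely an assumption, used here to illustrate why a user-adjustable threshold evens out platforms with different CNN update rates.

```cpp
// Hypothetical progress accumulation inside the GUI. Each CNN result that
// matches the target sign advances the bar; the user-set threshold controls
// how many results are needed, so slower platforms (fewer CNN updates per
// second) can be compensated by lowering it.
class ProgressTracker {
public:
    explicit ProgressTracker(int threshold) : threshold_(threshold) {}
    void SetThreshold(int threshold) { threshold_ = threshold; }  // adjustable from the Gui
    bool AddResult(bool matchesTargetSign) {
        hits_ = matchesTargetSign ? hits_ + 1 : 0;  // reset on a wrong sign
        return hits_ >= threshold_;                 // true once the bar is full
    }
    float Fraction() const { return static_cast<float>(hits_) / threshold_; }
private:
    int threshold_;
    int hits_ = 0;
};
```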
Signapse successfully meets all of its real-time programming requirements while providing an object-oriented design pattern for future modification and a consistent experience on both of the targeted x86 and ARM platforms.
To recognise hand signs in the camera feed, Signapse makes use of a Convolutional Neural Network (CNN) classifier. This is an image processing neural network that outputs the likelihood of each sign being present in the input image. This requires video processing to be performed on a per-image basis - a large motivation for our video pipeline structure. CNN classifiers typically output multiple values with a range of confidences; we simplify this to a single (maximum-confidence) value per input. To keep things simple, we have decided to only classify between a set of ASL alphabet hand signs, with a view to expanding to more complex signs in the future.
After some experimentation, we found the best architecture to use was a MobileNetv2 model. This network is suitably lightweight for real-time applications while still providing enough grunt to distinguish between the 26 hand signs of the alphabet. The academic paper for MobileNetv2 is available here. Our instance of MobileNetv2 is generated with TensorFlow and Keras and exported to the `.pb` (protobuf) format for use with the OpenCV runtime.
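
The inference side can be sketched with the OpenCV `dnn` module roughly as follows. The model path, input size and preprocessing constants are assumptions for illustration and may not match what Signapse actually uses.

```cpp
#include <opencv2/core.hpp>
#include <opencv2/dnn.hpp>
#include <opencv2/imgcodecs.hpp>
#include <iostream>

int main() {
    // Load the exported protobuf model (file name and input size are assumptions).
    cv::dnn::Net net = cv::dnn::readNetFromTensorflow("mobilenetv2_asl.pb");

    cv::Mat frame = cv::imread("hand_sign.jpg");
    cv::Mat blob = cv::dnn::blobFromImage(frame, 1.0 / 255.0, cv::Size(224, 224),
                                          cv::Scalar(), /*swapRB=*/true);
    net.setInput(blob);
    cv::Mat prob = net.forward();  // 1 x 26 vector of per-class confidences

    // Reduce the per-class confidences to a single maximum-confidence prediction.
    cv::Point classId;
    double confidence;
    cv::minMaxLoc(prob, nullptr, &confidence, nullptr, &classId);
    std::cout << "Predicted letter index: " << classId.x
              << " (confidence " << confidence << ")\n";
}
```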
Our MobileNetv2 instance is trained using a transfer learning approach: the initial network is trained on the ImageNet dataset (available here), and further training is then performed on a freely available ASL dataset (available on Kaggle here).
You can try out training the model on the dataset and exporting it into the OpenCV format! Have a look at the `README.md` in `Signapse/tools/neural_network_training/`. We are excited for others to get involved and contribute new CNNs capable of distinguishing more signs, so please check it out!
If you have any questions or queries please contact us at: [email protected].