3. How it Works

Processing Architecture

The main effort in software architecture design has been the Signapse video processing pipeline. A basic example pipeline is shown in the diagram below. The initial element, Camera, creates events; these are passed along pipeline links and consumed by the GUI. Along the way, various pipeline links process these events, modifying the data and passing it downstream. Callbacks deal with the type Scene, which has been created to store information about a given camera frame, including metadata and the frame pixel values. Scene is suitable for passing information about any sign output, not just ASL alphabet characters.
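As a rough mental model, a Scene can be thought of as a small value type bundling the frame with its metadata. The sketch below is illustrative only; the field names are assumptions, not the exact members used in Signapse.

```cpp
// Illustrative sketch of a Scene-style value type (field names are assumptions).
#include <opencv2/core.hpp>
#include <string>

struct Scene {
    cv::Mat frame;           // pixel values for this camera frame
    std::string label;       // sign predicted by the CNN, empty until classified
    double confidence = 0.0; // confidence of the prediction
    bool hasResult = false;  // true once a CNN stage has populated label/confidence
};
```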

Pipeline links have been crafted to adhere strongly to SOLID object-oriented principles, as described by the inheritance hierarchy shown in the UML diagram below. At the top, the basic interface SceneCallback defines a pure virtual function, NextScene(), which allows an inheriting class to be the recipient of a Scene via a callback. The derived class PipelineLink adds the method RegisterCallback; this enables transmission and reception of Scene objects, making it suitable for intermediate video processing stages. These interfaces are then further derived to produce pipeline elements that are interchangeable and serve a common interface. The Camera object inherits PipelineLink and adds functionality to interface with camera hardware, while the PreProcessor object handles the image processing required (including pixel formatting and drawing rectangles). LinkSplitter inherits PipelineLink while adding a secondary callback so that a Scene may be passed to two further elements. For latency-bound elements, SchedulableLink has been derived from PipelineLink; this is further described in the Real-time Considerations section. GUI inherits SceneCallback directly as it has no requirement to pass scenes further downstream.
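Under the assumptions of the Scene sketch above, the two core interfaces might look roughly like the following; the method names come from the description, but the exact signatures are guesses.

```cpp
// Sketch of the callback interfaces described above; signatures are assumptions.
#include <utility>

class SceneCallback {
public:
    virtual ~SceneCallback() = default;
    // Pure virtual: an inheriting class becomes a valid recipient of a Scene.
    virtual void NextScene(Scene scene) = 0;
};

class PipelineLink : public SceneCallback {
public:
    // Register the downstream element that should receive processed Scenes.
    void RegisterCallback(SceneCallback* next) { next_ = next; }

protected:
    // Forward a Scene to the registered downstream element, if any.
    void Emit(Scene scene) {
        if (next_ != nullptr) next_->NextScene(std::move(scene));
    }

private:
    SceneCallback* next_ = nullptr;
};
```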

The remaining application logic constructs this pipeline in the arrangement shown below.

The GUI element receives two streams, one from the CNNProcessor and one from the LinkSplitter instance. Logic within Gui::NextScene displays the frame if it does not contain CNN results; otherwise it updates its progress bar.
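Building on the interface sketches above, wiring that arrangement could look something like the snippet below. The class names come from the text, but RegisterSecondaryCallback and Run are assumed names rather than the actual Signapse API.

```cpp
// Hypothetical wiring of the pipeline arrangement described above.
int main() {
    Camera camera;
    PreProcessor preProcessor;
    LinkSplitter splitter;
    CNNProcessor cnnProcessor;
    Gui gui;

    camera.RegisterCallback(&preProcessor);    // Camera -> PreProcessor
    preProcessor.RegisterCallback(&splitter);  // PreProcessor -> LinkSplitter
    splitter.RegisterCallback(&cnnProcessor);  // branch 1: latency-bound CNN stage
    splitter.RegisterSecondaryCallback(&gui);  // branch 2: frames straight to the GUI
    cnnProcessor.RegisterCallback(&gui);       // CNN results also reach the GUI

    camera.Run();  // assumed entry point that starts producing frames
    return 0;
}
```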

Real-time Considerations

The real-time requirement for Signapse is clear: the smoother the GUI output framerate, the more intuitive, educational and easy to use the system is. Thus, the framerate at the GUI must be kept as high as possible while processing as many frames as possible with the CNN.

A challenge emerges when consideration is given to the CNN execution latency for processing a single Scene. Testing reveals that on a desktop Ubuntu machine with moderate specs, execution latency is around 80ms, and on the RPi this increases to around 660ms (you can try this out by running ./test/main and checking the output). Those latencies cap CNN throughput at roughly 12.5fps and 1.5fps respectively, so in both cases processing every frame with the CNN would not satisfy a basic (minimal) framerate of ~15fps. A naive approach to boost throughput could be to process every other frame (skipping half), with the number of frames to skip adjusted per platform depending on the network execution latency. However, this approach assumes the software knows the execution latency before the CNN is executed (from a given preset); it is therefore not resilient to varying background CPU loads or camera input framerates, and is generally unfriendly to cross-platform integration.

Instead, we have provided an object-oriented approach that de-couples latency-bound processing elements from the video pipeline throughput. This ensures a consistent experience across hardware platforms and is resilient to changing latency at execution time. As described in the Processing Architecture section, the system makes use of various PipelineLink callback objects. Each link in the pipeline must be constructed from a PipelineLink or an object which inherits its properties. For latency-bound pipeline stages, a new compatible interface is provided: SchedulableLink. Instances of this type process Scene objects when capacity allows and reject frames otherwise. SchedulableLink contains a worker queue of Scene objects and a worker thread which is only awoken when an object is added to the queue. The queue is implemented using an instance of our custom data structure, BlockingQueue, which extends the std::deque object so that a thread calling Pop sleeps until an output is available. Within SchedulableLink, the NextScene callback from PipelineLink is overridden with logic that adds a Scene to the worker queue only if the queue is empty. The method Run processes elements on the queue, calling the next callback routine. Objects which inherit SchedulableLink must implement the pure virtual function ProcessScene, which should contain the logic for the high-latency stage. Within Signapse, the object CNNProcessor inherits from SchedulableLink and calls network inference logic from its implementation of ProcessScene.
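A simplified sketch of this scheduling idea is shown below. The real BlockingQueue extends std::deque, whereas here it wraps one for brevity, and all signatures (and the omission of shutdown handling) are assumptions rather than the exact Signapse implementation.

```cpp
// Simplified sketch of BlockingQueue and SchedulableLink; not the exact Signapse API.
#include <condition_variable>
#include <deque>
#include <mutex>
#include <thread>
#include <utility>

template <typename T>
class BlockingQueue {
public:
    void Push(T item) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            queue_.push_back(std::move(item));
        }
        cv_.notify_one();  // wake a thread sleeping in Pop()
    }

    // Sleeps the calling thread until an item is available, then returns it.
    T Pop() {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [this] { return !queue_.empty(); });
        T item = std::move(queue_.front());
        queue_.pop_front();
        return item;
    }

    bool Empty() {
        std::lock_guard<std::mutex> lock(mutex_);
        return queue_.empty();
    }

private:
    std::deque<T> queue_;
    std::mutex mutex_;
    std::condition_variable cv_;
};

class SchedulableLink : public PipelineLink {
public:
    SchedulableLink() : worker_([this] { Run(); }) {}

    // Accept a Scene only when the worker queue is empty; otherwise the frame
    // is rejected so pipeline throughput is unaffected by this stage's latency.
    void NextScene(Scene scene) override {
        if (queue_.Empty()) queue_.Push(std::move(scene));
    }

protected:
    // Latency-bound stages (e.g. CNNProcessor) implement their processing here.
    virtual Scene ProcessScene(Scene scene) = 0;

    // Worker thread body: process queued Scenes and forward results downstream.
    void Run() {
        for (;;) {
            Scene scene = queue_.Pop();  // sleeps until a Scene is queued
            Emit(ProcessScene(std::move(scene)));
        }
    }

private:
    BlockingQueue<Scene> queue_;  // declared before worker_ so it exists when Run() starts
    std::thread worker_;          // shutdown/join handling omitted for brevity
};
```

A CNNProcessor-style element would then derive from SchedulableLink and place its network inference inside ProcessScene, leaving the queueing and thread wake-up behaviour to the base class.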

This architecture enables frames to be passed through a SchedulableLink while as many as possible are allocated to the latency-bound routine. This ensures a smooth frame rate at the GUI for any CNN latency. Additionally, the worker queue lends itself well to a future multithreading extension; this is elaborated further in the Future Work section. Finally, for platforms with higher CNN execution latency (such as the RPi), the progress bar will be updated less frequently. This results in users holding a sign for longer on slower devices to complete the progress challenge. As a remedy, a threshold parameter is provided and is modifiable in the Gui; this allows the user to set how long it takes for the progress bar to fill. Users may easily adjust how long to sign for, creating a consistent experience across platforms.

Signapse successfully meets all of its real-time programming requirements while providing an object-oriented design pattern for future modification and a consistent experience on both of the targeted x86 and ARM platforms.

Neural Network Classifier

To recognise hand signs in the camera feed, Signapse makes use of a Convolutional Neural Network (CNN) classifier: an image processing neural network that outputs the likelihood of each sign being present in the input image. This necessitates video processing being performed on a per-image basis - a large motivation for our video pipeline structure. CNN classifiers typically output multiple values with a range of confidences; we simplify this to one (maximum confidence) value per input. To keep things simple, we have decided to only classify between a set of ASL alphabet hand signs, with a view to expanding to more complex signs in the future.

Architecture

After some experimentation, we found the best architecture to use was a MobileNetv2 model. This network is suitably lightweight for real-time applications while still providing enough grunt to distinguish between the 26 hand signs of the alphabet. The academic paper for MobileNetv2 is available here. Our instance of MobileNetv2 is generated with TensorFlow and Keras and exported to the .pb (protobuf) format for use with the OpenCV runtime.
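For context, loading and running a .pb classifier with OpenCV's DNN module typically looks like the sketch below. The model file name, input size and preprocessing constants here are placeholders, not the exact values Signapse uses.

```cpp
// Sketch of running a TensorFlow .pb classifier through OpenCV's DNN module.
// File name, input size and preprocessing constants are placeholders.
#include <opencv2/core.hpp>
#include <opencv2/dnn.hpp>
#include <opencv2/imgcodecs.hpp>
#include <iostream>

int main() {
    cv::dnn::Net net = cv::dnn::readNetFromTensorflow("mobilenetv2_asl.pb");

    cv::Mat image = cv::imread("hand_sign.jpg");
    // Pack the frame into the blob the network expects (224x224 input assumed).
    cv::Mat blob = cv::dnn::blobFromImage(image, 1.0 / 255.0, cv::Size(224, 224),
                                          cv::Scalar(), /*swapRB=*/true);
    net.setInput(blob);
    cv::Mat scores = net.forward();  // one confidence per class (26 ASL letters here)

    // Keep only the maximum-confidence class, as described above.
    cv::Point classId;
    double confidence = 0.0;
    cv::minMaxLoc(scores.reshape(1, 1), nullptr, &confidence, nullptr, &classId);
    std::cout << "class " << classId.x << " confidence " << confidence << "\n";
    return 0;
}
```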

Dataset

Our MobileNetv2 instance is trained using a transfer learning approach. The initial network is trained on the ImageNet dataset (available here) and further learning is performed on a freely available ASL dataset (available on Kaggle here).

Try it Out!

You can try out training the model on the dataset and exporting it into the OpenCV format! Have a look at the README.md in Signapse/tools/neural_network_training/. We are excited for others to get involved and contribute new CNNs capable of distinguishing more signs - please check it out!