Event Neural Networks
Proc. ECCV 2022
Artificial neural networks are the state of the art in visual recognition. Many of these networks are trained and evaluated on static images. For example, the OpenPose model takes a single image as input and predicts a set of human joint positions. To make predictions on video, a single-frame model can be applied repeatedly to individual frames. However, at the frame rate of a typical camera (e.g., 30 frames per second), there is often significant repetition in image content between adjacent frames. In some regions of an image, there may be repetition in pixel values. In other image regions, there may be repetition of more complex structures (e.g., tree branches, as shown in the figure above). We propose Event Neural Networks (EvNets), which leverage this repetition to achieve considerable computation savings when processing video data. A defining characteristic of EvNets is that each neuron has state variables that provide it with long-term memory, which allows low-cost, high-accuracy inference even in the presence of significant camera motion. We show that it is possible to transform a wide range of neural networks into EvNets without re-training. We demonstrate our method on state-of-the-art architectures for both high- and low-level visual processing, including pose recognition, object detection, optical flow, and image enhancement. We observe roughly an order-of-magnitude reduction in computational costs compared to conventional networks, with minimal reductions in model accuracy.
(a) Conventional neurons completely recompute their activations on each time step. (b) Value-based event neurons only transmit activations that have changed significantly. However, a value-based transmission can still trigger many computations. (c) Delta-based event neurons only transmit differential updates to their activations.
We design an event neuron based on sparse, delta-based transmission. We start with a generic artificial neuron that composes a linear function "g" with a nonlinear activation "f." Then, we assume that the neuron receives sparse deltas as input instead of values. The sparsity of the input deltas leads to computation savings in the linear function g. We add an accumulator state variable "a" that transforms the output of the linear function from a delta-based representation to a value-based representation (activation functions require values, not deltas, as input). After the activation function, we add two new state variables, "b" and "d." The variable b tracks the current best estimate of the neuron's output value, and d stores any not-yet-transmitted output delta (this is the neuron's long-term memory). Finally, we apply a transmission policy that decides which deltas the neuron will transmit on this time step. For example, the policy might transmit the value of d whenever it exceeds a threshold.
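The per-neuron update described above can be sketched as follows. This is a minimal illustration of a delta-based event neuron, not the authors' implementation; the class name, the dict-of-deltas input format, and the fixed threshold policy are assumptions made for the example.

```python
import numpy as np

class EventNeuron:
    """Sketch of a delta-based event neuron (illustrative, hypothetical API).

    State variables follow the description in the text:
      a -- accumulator: running pre-activation value built from received deltas
      b -- current best estimate of the neuron's transmitted output value
      d -- not-yet-transmitted output delta (the long-term memory)
    """

    def __init__(self, weights, bias, threshold=0.05, f=np.tanh):
        self.w = np.asarray(weights, dtype=float)
        self.f = f                  # nonlinear activation
        self.threshold = threshold  # transmission-policy parameter (assumed form)
        self.a = float(bias)        # accumulator starts at the bias term
        self.b = self.f(self.a)     # assume the initial value was transmitted once
        self.d = 0.0                # no pending delta yet

    def step(self, input_deltas):
        """input_deltas: sparse mapping {input index: delta}.

        Returns the transmitted output delta, or None if nothing is sent.
        """
        # Linear function g touches only the nonzero input deltas --
        # this sparsity is where the computation savings come from.
        for i, delta in input_deltas.items():
            self.a += self.w[i] * delta
        # The activation needs a value, not a delta; the accumulator 'a'
        # converts the delta stream back into a value.
        y = self.f(self.a)
        # Fold the new change into the pending (untransmitted) delta.
        self.d = y - self.b
        # Transmission policy: send d only when it exceeds the threshold.
        if abs(self.d) > self.threshold:
            out = self.d
            self.b += out   # downstream now holds the updated value
            self.d = 0.0
            return out
        return None
```

Note the invariant `b + d == f(a)`: the pending delta `d` always records exactly what downstream neurons have not yet seen, which is what lets accuracy hold up over long sequences.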
We assemble event neuron components into network-level structures. An accumulator layer contains multiple "a" variables. A gate layer contains multiple pairs of "b" and "d" variables. A gate layer also applies the transmission policy. By strategically inserting accumulator and gate layers throughout the network, we can convert a wide range of existing neural network architectures into event networks without retraining.
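The layer-level assembly can be sketched in the same spirit. The sketch below is illustrative and assumes a simple per-channel threshold policy; the class names and the conversion recipe shown in the usage comment (deltas → linear → accumulator → activation → gate) are our reading of the text, not the authors' code.

```python
import numpy as np

class AccumulatorLayer:
    """Holds one accumulator 'a' per channel; turns incoming deltas into values."""
    def __init__(self, init_values):
        self.a = np.array(init_values, dtype=float)

    def forward(self, deltas):
        self.a += deltas
        return self.a  # values, ready for a standard activation function

class GateLayer:
    """Holds 'b' and 'd' per channel and applies the transmission policy."""
    def __init__(self, init_values, threshold=0.05):
        self.b = np.array(init_values, dtype=float)  # last transmitted values
        self.d = np.zeros_like(self.b)               # pending output deltas
        self.threshold = threshold

    def forward(self, values):
        self.d = values - self.b                   # what downstream has not seen
        fire = np.abs(self.d) > self.threshold     # transmission policy
        out = np.where(fire, self.d, 0.0)          # sparse delta output
        self.b = np.where(fire, values, self.b)
        self.d = np.where(fire, 0.0, self.d)       # fired deltas are cleared
        return out

# Converting one conventional layer y = tanh(W x + bias) into an event layer:
W = np.array([[0.5, -0.2], [0.1, 0.9]])
bias = np.array([0.0, 0.1])
x0 = np.zeros(2)                                   # reference input
acc = AccumulatorLayer(W @ x0 + bias)              # seed with initial pre-activations
gate = GateLayer(np.tanh(acc.a), threshold=0.01)   # seed with initial outputs

dx = np.array([1.0, 0.0])                          # sparse input delta
out = gate.forward(np.tanh(acc.forward(W @ dx)))   # delta in, sparse delta out
```

Because only accumulator and gate layers carry state, the linear and activation layers in between are unchanged, which is why existing architectures can be converted without retraining.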
The long-term memory variable "d" is critical for maintaining accuracy over long video sequences. If we remove this variable and the associated logic, the model accuracy decays rapidly.
We demonstrate that EvNets are an effective strategy for many high- and low-level vision tasks. Across tasks, we see significant computation savings while maintaining high-quality output. The example frame shows a person mid-jump; the EvNet tracks the subject correctly despite the rapid motion.