Leveraging temporal redundancy for efficient visual recognition on video and other time-series data
GitHub: PyTorch, TensorFlow
Artificial neural networks are the state of the art in visual recognition. Many of these networks are trained and evaluated on static images. For example, the OpenPose model takes a single image as input and predicts a set of human joint positions. To make predictions on video, a single-frame model can be applied repeatedly to individual frames. However, at the frame rate of a typical camera (e.g., 30 frames per second), there is often significant repetition in image content between adjacent frames. In some regions of an image, there may be repetition in pixel values. In other image regions, there may be repetition of more complex structures (e.g., tree branches, as shown in the figure above).

We propose Event Neural Networks (EvNets), which leverage this repetition to achieve considerable computation savings when processing video data. A defining characteristic of EvNets is that each neuron has state variables that provide it with long-term memory, which allows low-cost, high-accuracy inference even in the presence of significant camera motion. We show that it is possible to transform a wide range of neural networks into EvNets without re-training. We demonstrate our method on state-of-the-art architectures for both high- and low-level visual processing, including pose recognition, object detection, optical flow, and image enhancement. We observe roughly an order-of-magnitude reduction in computational costs compared to conventional networks, with minimal reductions in model accuracy.
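To give a flavor of how per-neuron state can turn temporal redundancy into savings, here is a minimal NumPy sketch of an "event" linear layer. It is an illustrative toy, not the authors' implementation: the layer caches its last transmitted input and a running output, and only spends multiply-accumulates on input entries whose change since the last frame exceeds a threshold. The class name, threshold value, and MAC counter are all assumptions made for the example.

```python
import numpy as np

class EventLinear:
    """Toy event-style linear layer (illustrative sketch only).

    Keeps two pieces of per-neuron state: the last transmitted input
    and a running output accumulator. On each frame, only input entries
    that changed by more than `threshold` trigger computation.
    """

    def __init__(self, weight, threshold=1e-3):
        self.W = np.asarray(weight, dtype=float)   # shape (out, in)
        self.threshold = threshold
        self.x_state = np.zeros(self.W.shape[1])   # last transmitted input
        self.y_state = np.zeros(self.W.shape[0])   # running output
        self.macs = 0                              # multiply-accumulates performed

    def __call__(self, x):
        delta = x - self.x_state
        active = np.abs(delta) > self.threshold    # which inputs changed?
        if active.any():
            # Incrementally update the output using only the changed columns.
            self.y_state += self.W[:, active] @ delta[active]
            self.x_state[active] = x[active]
            self.macs += self.W.shape[0] * int(active.sum())
        return self.y_state.copy()

# Two nearly identical "frames": the second costs far fewer MACs.
W = np.array([[0., 1., 2.], [3., 4., 5.]])
layer = EventLinear(W, threshold=0.01)
frame1 = np.array([1.0, 2.0, 3.0])
y1 = layer(frame1)                 # full cost: all inputs are new
frame2 = frame1.copy()
frame2[0] += 0.5                   # only one pixel changes
y2 = layer(frame2)                 # incremental cost: one column updated
```

On the first frame every input is "new," so the layer pays the full dense cost; on the second frame only one column of `W` is touched, yet the accumulated output still equals the dense result `W @ frame2`. The real EvNet transformation applies analogous stateful bookkeeping throughout deep networks without re-training.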