Eventful Transformers: Leveraging Temporal Redundancy in Vision Transformers

Vision Transformers achieve impressive accuracy across a range of visual recognition tasks. Unfortunately, their accuracy frequently comes with high computational costs. This is a particular issue in video recognition, where models are often applied repeatedly across frames or temporal chunks. In this work, we exploit temporal redundancy between subsequent inputs to reduce the cost of Transformers for video processing. We describe a method for identifying and re-processing only those tokens that have changed significantly over time. Our proposed family of models, Eventful Transformers, can be converted from existing Transformers (often without any re-training) and give adaptive control over the compute cost at runtime. We evaluate our method on large-scale datasets for video object detection (ImageNet VID) and action recognition (EPIC-Kitchens 100). Our approach leads to significant computational savings (on the order of 2-4x) with only minor reductions in accuracy.

Publications

Eventful Transformers: Leveraging Temporal Redundancy in Vision Transformers

Matthew Dutson, Yin Li, Mohit Gupta

Proc. ICCV 2023

Token Gating

A vision Transformer represents a scene as a collection of token vectors. We propose a gating module that identifies which tokens have undergone significant changes since their last update. The gating module maintains a set of reference tokens. If the difference between a token and its reference is large, then that token is selected to be updated.
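
Below is a minimal PyTorch sketch of such a gate. The L2-norm error metric, the fixed top-r selection policy, and the TokenGate name are illustrative assumptions rather than the paper's exact configuration.

```python
import torch


class TokenGate(torch.nn.Module):
    """Selects the tokens that have changed most since their last update.

    Sketch only: uses an L2 error metric and a fixed top-r selection policy.
    """

    def __init__(self, num_selected):
        super().__init__()
        self.num_selected = num_selected  # r: tokens to update per frame
        self.reference = None             # last updated value of each token

    def forward(self, tokens):
        # tokens: (N, D) token vectors for the current frame
        if self.reference is None:
            # First frame: every token is new, so select all of them.
            self.reference = tokens.clone()
            return tokens, torch.arange(tokens.shape[0])

        # Per-token error between the current tokens and their references.
        error = torch.norm(tokens - self.reference, dim=-1)

        # Select the r tokens with the largest error and refresh their references.
        indices = torch.topk(error, self.num_selected).indices
        self.reference[indices] = tokens[indices]
        return tokens[indices], indices
```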

Accelerating Token-Wise Operations

Many of the operations in a vision Transformer are applied token-by-token. We can accelerate these operations by computing them only on the reduced set of tokens selected by the gate. We then restore the result to its full, expected size using a token buffer module, which tracks the most recent known value of each token.
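
A matching sketch of the buffer, under the same assumptions as the gate above; the TokenBuffer name and the first-frame handling (where all tokens are assumed to be supplied) are illustrative.

```python
import torch


class TokenBuffer(torch.nn.Module):
    """Stores the most recent known value of every token and scatters sparse
    updates back into a full-size tensor. Sketch only."""

    def __init__(self):
        super().__init__()
        self.state = None  # (N, D) most recent value of each token

    def forward(self, updated_tokens, indices):
        # updated_tokens: (r, D) results for the tokens selected by the gate
        # indices: (r,) positions of those tokens in the full sequence
        if self.state is None:
            # First frame: assumes all N tokens are provided.
            self.state = updated_tokens.clone()
        else:
            self.state[indices] = updated_tokens
        return self.state
```

With these two modules, a token-wise operation (for example, a block's MLP) can be evaluated on only the gated tokens, and the buffer scatters its output back into the full token sequence.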

An Eventful Transformer Block

We propose a modified Transformer block that accounts for temporal redundancy in tokens. We strategically apply token gating and buffering throughout the block and compute modified, sparse updates to the self-attention operator (see below).
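
The sketch below shows one way the pieces might fit together, reusing the TokenGate and TokenBuffer sketches above. For brevity it gates only the MLP branch and keeps self-attention dense; the actual Eventful block also gates the attention branch and uses the sparse query-key and attention-value updates described in the next two sections.

```python
import torch


class EventfulBlockSketch(torch.nn.Module):
    """Structural sketch of a Transformer block with a gated, buffered MLP.
    Uses the TokenGate and TokenBuffer sketches defined above."""

    def __init__(self, dim, num_heads, num_selected):
        super().__init__()
        self.norm1 = torch.nn.LayerNorm(dim)
        self.norm2 = torch.nn.LayerNorm(dim)
        self.attn = torch.nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.mlp = torch.nn.Sequential(
            torch.nn.Linear(dim, 4 * dim),
            torch.nn.GELU(),
            torch.nn.Linear(4 * dim, dim),
        )
        self.mlp_gate = TokenGate(num_selected)
        self.mlp_buffer = TokenBuffer()

    def forward(self, x):
        # x: (N, dim) tokens for the current frame (batch dimension omitted).
        h = self.norm1(x).unsqueeze(0)  # add a batch dim for MultiheadAttention
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out.squeeze(0)

        # Token-wise MLP: recompute only the tokens selected by the gate,
        # then let the buffer fill in the rest from its stored values.
        selected, indices = self.mlp_gate(self.norm2(x))
        x = x + self.mlp_buffer(self.mlp(selected), indices)
        return x
```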

The Query-Key Product

We accelerate the query-key product (one of the primary operations within self-attention) by selectively updating elements of the output matrix. We first compute the changes induced by the first operand (the queries q), then the changes induced by the second operand (the keys k).
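
The sketch below illustrates the idea, assuming the updated query and key rows share the same index set (as they do when both come from the same gated tokens). Caching Q, K, and the product B = Q K^T and recomputing first the affected rows and then the affected columns is one concrete way to realize the selective update; the softmax scaling factor is omitted.

```python
import torch


class SparseQKProduct:
    """Selectively updates B = Q @ K.T when only a few rows of Q and K change.
    Sketch only: caches the operands and the product, and recomputes the
    affected rows (from q) and then the affected columns (from k)."""

    def __init__(self):
        self.q = self.k = self.b = None

    def __call__(self, q_new, k_new, indices):
        # q_new, k_new: (r, D) updated rows; indices: (r,) their positions.
        if self.b is None:
            # First frame: q_new and k_new must contain all N rows.
            self.q, self.k = q_new.clone(), k_new.clone()
            self.b = self.q @ self.k.T
            return self.b

        # Changes induced by the first operand (q): recompute those rows of B.
        self.q[indices] = q_new
        self.b[indices, :] = q_new @ self.k.T

        # Changes caused by the second operand (k): recompute those columns of B.
        self.k[indices] = k_new
        self.b[:, indices] = self.q @ k_new.T
        return self.b
```

In this sketch, each step multiplies an r-row block against a full operand, so its cost scales with the number of updated tokens r rather than requiring a dense N-by-N recomputation.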

The Attention-Value Product

We propose a delta-based strategy for sparsely updating the attention-value product (another major operation within self-attention). We remove rows and columns from the input operands that do not contribute to the result (because they would only be multiplied by zeros).
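
The sketch below shows one delta-based realization, built on the identity P_new = P_old + dA V_new + A_new dV - dA dV. For illustration it assumes dA is nonzero only in a known set of rows and dV only in a known set of rows; the rows and columns that would only multiply zeros are never formed. The paper's bookkeeping of dA's structure is more detailed than this sketch.

```python
import torch


def delta_av_product(p_old, a_new, da_rows, a_indices, v_new, dv_rows, v_indices):
    """Delta-based update of the attention-value product P = A @ V. Sketch only.

    p_old:     (N, D) cached product from the previous frame
    a_new:     (N, N) current attention matrix
    da_rows:   (ra, N) nonzero rows of dA = A_new - A_old, at row indices a_indices
    v_new:     (N, D) current value matrix
    dv_rows:   (rv, D) nonzero rows of dV = V_new - V_old, at row indices v_indices
    """
    p_new = p_old.clone()

    # dA @ V_new: only the rows of dA at a_indices are nonzero, so only those
    # rows of the result need to be computed.
    p_new[a_indices] += da_rows @ v_new

    # A_new @ dV: only the rows of dV at v_indices are nonzero, so only the
    # matching columns of A_new contribute.
    p_new += a_new[:, v_indices] @ dv_rows

    # dA @ dV: both operands are sparse; keep only the contributing rows/columns.
    p_new[a_indices] -= da_rows[:, v_indices] @ dv_rows

    return p_new
```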