Burst Vision Using Single-Photon Cameras
Proc. WACV 2023
Exploring the capabilities of SPAD sensors for a wide gamut of real-world computer vision tasks including object detection, pose estimation, SLAM, text recognition and so on.
Single-photon avalanche diodes (SPADs) are novel image sensors that record the arrival of individual photons at extremely high temporal resolution. In the past, they were only available as single pixels or small-format arrays, for various active imaging applications such as LiDAR and microscopy. Recently, high-resolution SPAD arrays up to 3.2 megapixel have been realized, which for the first time may be able to capture sufficient spatial details for general computer vision tasks, purely as a passive sensor. However, existing vision algorithms are not directly applicable on the binary data captured by SPADs. In this paper, we propose developing quanta vision algorithms based on burst processing for extracting scene information from SPAD photon streams. With extensive real-world data, we demonstrate that current SPAD arrays, along with burst processing as an example plug-and-play algorithm, are capable of a wide range of downstream vision tasks in extremely challenging imaging conditions including fast motion, low light ($<5$ lux) and high dynamic range. To our knowledge, this is the first attempt to demonstrate the capabilities of SPAD sensors for a wide gamut of real-world computer vision tasks including object detection, pose estimation, SLAM, and text recognition. We hope this work will inspire future research into developing computer vision algorithms in extreme scenarios using single-photon cameras.
So far, SPADs have mostly been used for 3D sensing or scientific imaging. Recently, high-resolution SPAD arrays have been developed. Are they ready for general computer vision tasks? We explore this capability of SPADs by first using burst reconstruction algorithms to generate an intensity image, and then running off-the-shelf downstream algorithms.
Why do we need burst vision? Here we show a binary sequence captured in a dark garage (night, lights off). Naive averaging and burst reconstruction are used to reconstruct an intensity image, which is then passed to a pre-trained YOLOv3 model for object detection. (Top) Naive average images are either too noisy or too blurred; consequently, object detection fails for all integration window lengths. (Bottom) Burst vision generates clear images that provide sufficient signal for detecting the person and the bike with sufficiently large integration windows.
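The trade-off above can be illustrated with a minimal sketch of naive averaging of binary SPAD frames (hypothetical parameters and simulated Bernoulli frames; this is not the authors' reconstruction code). Each binary frame records, per pixel, whether at least one photon arrived during its exposure; averaging more frames suppresses noise, but for a moving scene a longer window would also accumulate motion blur.

```python
import numpy as np

rng = np.random.default_rng(0)

flux = 0.05                     # assumed mean photons per pixel per binary frame (dark scene)
p_detect = 1.0 - np.exp(-flux)  # Bernoulli probability of a 1 under Poisson arrivals

def naive_average(num_frames, shape=(64, 64)):
    """Average num_frames simulated binary frames of a static scene.

    Longer windows reduce noise; for a moving scene they would also
    increase motion blur, which is the blur-noise trade-off.
    """
    frames = rng.random((num_frames, *shape)) < p_detect
    return frames.mean(axis=0)

short = naive_average(10)    # short window: low blur, high noise
long = naive_average(1000)   # long window: low noise (blurry if scene moved)
print(short.std(), long.std())
```

For a static patch, the per-pixel noise drops roughly as the square root of the window length, matching why only sufficiently large integration windows yield enough signal for detection.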
Furthermore, the SPAD is also an ideal sensor for burst vision. Conventional burst photography incurs read noise at every frame, resulting in a blur-SNR trade-off: lower read noise yields a higher blur-SNR curve. SPAD sensors have negligible read noise, which results in an ideal flat curve for low-light burst vision.
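The read-noise argument can be sketched numerically (illustrative numbers, not from the paper). Merging N frames sums N·φ signal photons; shot noise grows as √(N·φ), while each frame read adds noise of standard deviation σ_r.

```python
import numpy as np

def merged_snr(n_frames, photons_per_frame, read_noise):
    """SNR of merging n_frames burst frames under shot + read noise."""
    signal = n_frames * photons_per_frame
    noise = np.sqrt(signal + n_frames * read_noise**2)  # shot + per-read noise
    return signal / noise

n = np.array([1, 10, 100, 1000])
snr_spad = merged_snr(n, 0.25, read_noise=0.0)  # SPAD: negligible read noise
snr_cmos = merged_snr(n, 0.25, read_noise=1.0)  # conventional: ~1 photon per read

# With zero read noise, SNR depends only on total photons collected, so
# splitting the exposure into many short frames costs nothing.
print(snr_spad)
print(snr_cmos)
```

With σ_r = 0 the merged SNR equals √(total photons) regardless of how the exposure is split, which is the "flat curve"; with σ_r > 0, finer splitting pays a per-frame penalty.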
To evaluate the capabilities of SPADs for general computer vision, we capture binary sequences for a wide range of tasks, consisting of over 50 million binary images in total. Tested tasks and algorithms include: QR decoding, scene text detection, object detection, SLAM, face detection, human pose estimation, action recognition, background subtraction, and object tracking.
We run a pre-trained YOLOv3 model on images reconstructed by naive averaging and QBP. (a) In a bright scene, both naive averaging (short window) and QBP generate quality images for person detection. 35X contrast stretched. (b) In a dark scene, naive averaging suffers from the blur-noise trade-off, while QBP reconstructs a clear image for detection. 370X contrast stretched. (c) Naive averaging always performs worse than QBP and has a best operating point due to the blur-noise trade-off, whereas QBP keeps improving as the software-defined exposure time increases. (d) We simulate images that would have been captured by a conventional camera with a read noise of 1 photon. A conventional single shot performs similarly to naive averaging due to the blur-noise trade-off; conventional burst photography improves with longer exposure times, but its accuracy remains significantly lower due to read noise.
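"Software-defined exposure" means the effective exposure window is chosen after capture, by re-merging different numbers of frames from the same stored photon stream. A minimal sketch (hypothetical helper on simulated frames; not the QBP implementation, which also aligns frames to remove motion blur):

```python
import numpy as np

rng = np.random.default_rng(0)
# One captured binary photon stream: 4096 frames of a 16x16 static patch.
stream = rng.random((4096, 16, 16)) < 0.1

def expose(stream, window):
    """Merge the first `window` binary frames into one intensity image."""
    return stream[:window].mean(axis=0)

# The same capture yields many virtual exposures; no re-shoot is needed.
images = {w: expose(stream, w) for w in (16, 256, 4096)}
for w, img in images.items():
    print(w, img.std())
```

Because the full binary stream is retained, the operating point in panel (c) can be swept in post-processing rather than fixed at capture time.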
A package box moves quickly in the dark. Naive average images are noisy and blurry. Burst results are clear and sharp, and the text and QR code are correctly detected, which would be challenging if a low-resolution SPAD were used.
(Top) We manually mark the bounding box of the ball at the beginning of the sequence, where the ball starts falling at low speed. (Bottom) In a later frame of the sequence, the ball drops at a higher speed. The night vision camera records a blurred image, and the thermal camera does not capture the visual features of the ball. The SPAD camera produces a clear, sharp burst-reconstructed image for tracking the ball and achieves the best average overlap (AO) among the three cameras.
We placed a SPAD camera and a night vision camera side by side on the dashboard of a car. (Top) The SPAD gives a clearer image, so the person can be detected from a long distance. The bright spots are traffic signs illuminated by the headlights (this street has no streetlights). The scene is very dark (~1 lux, 0.05 photons per pixel on average). Images are contrast stretched (15X) for visualization. (Bottom) The SPAD reconstructs a high-quality image that enables SLAM tracking during nighttime driving, while the night vision camera fails to detect features.
A SPAD camera captures a person walking from a dark room (1 lux) to direct sunlight (50,000 lux). The burst method reconstructs HDR images, enabling object detection across the sequence. As shown in the plot, the person is detected in almost every frame reconstructed from the SPAD. We also simulate a conventional camera from the binary sequence; it fails to reconstruct good images in both the dark and bright conditions, causing the person detection algorithm to fail.
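The HDR capability follows from the SPAD's nonlinear binary response, which can be inverted to a linear flux estimate. Under the standard quanta-sensor model (a sketch with assumed parameters, not the paper's burst method, which additionally handles motion), a pixel fires with probability p = 1 − exp(−φ), so the maximum-likelihood flux from the fraction of 1s over N frames is φ̂ = −ln(1 − Y/N).

```python
import numpy as np

rng = np.random.default_rng(0)

def estimate_flux(binary_frames):
    """Invert the binary response: MLE flux estimate per pixel."""
    p_hat = binary_frames.mean(axis=0)          # fraction of 1s per pixel
    p_hat = np.clip(p_hat, 0, 1 - 1e-6)         # avoid log(0) at saturated pixels
    return -np.log1p(-p_hat)                    # -ln(1 - p_hat)

# Simulated static patches spanning a ~300x range of photon flux.
N = 20000
for true_flux in (0.01, 0.1, 1.0, 3.0):
    frames = rng.random((N, 32, 32)) < (1 - np.exp(-true_flux))
    print(true_flux, estimate_flux(frames).mean())
```

The logarithmic inversion keeps the estimate linear in flux even as pixels approach saturation, which is why the same binary stream covers both the 1 lux room and the sunlit exterior.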