Mask-RadarNet: Enhancing Radar Object Detection With Spatio-Temporal Context
Danilo Orlando
2025-01-01
Abstract
As a cost-effective and robust technology, automotive radar has seen steady improvement in recent years. Radio frequency (RF) images, a radar data format rich in semantic information, have attracted considerable interest in radar object detection. Previous RF-based models rely heavily on convolutional neural networks, which incurs high computational cost. To address this problem, we propose Mask-RadarNet, a model that fully exploits the hierarchical semantic features of RF image sequences. Mask-RadarNet interleaves convolution and attention operations in its encoder. In addition, patch shift is introduced for efficient spatial-temporal feature learning: by shifting part of the patches along the temporal dimension in a specific mosaic pattern, Mask-RadarNet achieves competitive performance while reducing the computational burden of spatial-temporal modeling. To capture spatial-temporal semantic context, we design a class masking attention module (CMAM) in the encoder. Moreover, a lightweight auxiliary decoder aggregates the prior maps generated by the CMAM. Experiments on the CRUW dataset demonstrate that Mask-RadarNet achieves state-of-the-art performance with lower computational complexity and fewer parameters than prior RF-based models.
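The temporal patch shift described in the abstract can be illustrated with a minimal sketch. The exact mosaic pattern used by Mask-RadarNet is not specified here, so the checkerboard pattern below (patches at even row+column positions shifted forward in time, the rest backward) and the function name are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def temporal_patch_shift(x, stride=1):
    """Shift subsets of patch tokens along the temporal axis.

    x: array of shape (T, H, W, C) -- T frames, each an H x W grid of
    C-dimensional patch tokens. A hypothetical checkerboard "mosaic":
    patches where (row + col) is even receive the token from the
    previous frame; the remaining patches receive the token from the
    next frame. Boundary frames keep their original tokens, and the
    shift itself adds no learnable parameters or FLOPs.
    """
    T, H, W, C = x.shape
    out = x.copy()
    rows, cols = np.indices((H, W))
    fwd = (rows + cols) % 2 == 0      # shifted forward in time
    bwd = ~fwd                        # shifted backward in time
    out[stride:, fwd] = x[:-stride, fwd]   # frame t gets frame t - stride
    out[:-stride, bwd] = x[stride:, bwd]   # frame t gets frame t + stride
    return out
```

Because shifted tokens carry features from neighboring frames, a subsequent spatial attention layer mixes temporal information "for free", which is the source of the computational savings the abstract claims.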


