Rethinking the multi-scale feature hierarchy in object detection transformer (DETR)

Elhanashi, Abdussalam; Saponara, Sergio
2025-01-01

Abstract

The Detection Transformer (DETR) has emerged as the dominant paradigm in object detection thanks to its end-to-end architectural design. Researchers have explored various aspects of DETR, including its structure, pre-training strategies, attention mechanisms, and query embeddings, achieving significant progress. However, high computational costs limit the efficient use of multi-scale feature maps and hinder the full exploitation of complex multi-branch structures. We examine the negative impact of multi-scale features on the computational cost of DETRs and find that feeding long token sequences to the encoder is suboptimal. In this work, we aim to push the boundaries of DETR's performance and efficiency from the model-structure perspective, and we develop the fusion Detection Transformer (F-DETR) with a heterogeneous-scale multi-branch structure. To the best of our knowledge, this is the first explicit attempt to integrate multi-scale features into the end-to-end DETR structure. Specifically, we propose a multi-branch structure that simultaneously utilizes feature maps at different levels, facilitating the interaction of local and global features. Additionally, we select certain joint latent variables from the interactive information flow to initialize the object container, a technique commonly used in query-based detectors. Experimental results show that F-DETR achieves 43.9% AP with 36 training epochs on the popular public COCO dataset. Furthermore, our approach demonstrates a better trade-off between accuracy and complexity than the original DETR.
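For a concrete picture of the two ideas described above, the following minimal PyTorch sketch illustrates (a) a multi-branch block that processes backbone feature maps of different scales in parallel so that local and global features can interact, and (b) initializing the object queries (the "object container") from selected joint latent tokens of the fused sequence. All module choices, dimensions, and the saliency-based token selection rule are illustrative assumptions, not the authors' F-DETR implementation.

import torch
import torch.nn as nn

# Hypothetical sketch, not the published F-DETR code. One attention
# branch per feature scale, followed by query initialization from the
# fused ("joint") latent sequence.
class MultiBranchFusion(nn.Module):
    def __init__(self, dims=(512, 1024, 2048), d_model=256, num_queries=100):
        super().__init__()
        # Project each backbone level (e.g. ResNet C3-C5) to a common width.
        self.proj = nn.ModuleList(nn.Conv2d(d, d_model, 1) for d in dims)
        # One lightweight self-attention branch per scale.
        self.branches = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
            for _ in dims
        )
        self.num_queries = num_queries

    def forward(self, feats):  # feats: list of (B, C_i, H_i, W_i) maps
        tokens = []
        for f, proj, branch in zip(feats, self.proj, self.branches):
            t = proj(f).flatten(2).transpose(1, 2)  # (B, H_i*W_i, d_model)
            tokens.append(branch(t))                # per-branch attention
        fused = torch.cat(tokens, dim=1)            # joint latent sequence
        # Pick the top-k most salient latents (L2 norm as a stand-in
        # saliency score) to initialize the object queries.
        scores = fused.norm(dim=-1)                                 # (B, L)
        idx = scores.topk(self.num_queries, dim=1).indices          # (B, k)
        queries = fused.gather(
            1, idx.unsqueeze(-1).expand(-1, -1, fused.size(-1)))    # (B, k, d)
        return fused, queries  # encoder memory and query initialization

feats = [torch.randn(2, c, s, s) for c, s in [(512, 32), (1024, 16), (2048, 8)]]
memory, queries = MultiBranchFusion()(feats)
print(memory.shape, queries.shape)  # (2, 1344, 256) and (2, 100, 256)

The point of the multi-branch layout is that each scale is handled by its own inexpensive encoder branch rather than concatenating all scales into one long token sequence for a single encoder, which is precisely the cost problem the abstract identifies.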
Liu, Fanglin; Zheng, Qinghe; Tian, Xinyu; Shu, Feng; Jiang, Weiwei; Wang, Miaohui; Elhanashi, Abdussalam; Saponara, Sergio
Files in this record:
There are no files associated with this record.

Use this identifier to cite or link to this document: https://hdl.handle.net/11568/1332528

Citations
  • PubMed Central: not available
  • Scopus: 30
  • Web of Science: 27