FBRT-YOLO: Faster and Better for Real-Time Aerial Image Detection

Q: How does FCM integrate spatial and semantic information?

FCM splits channels into two branches (one processed with $3\times 3$ conv for richer semantic/channel info, one with point-wise conv to preserve spatial info), then uses channel interaction (depthwise conv + global average pooling + sigmoid) and spatial interaction (conv + BN + sigmoid) to generate weights $\omega_1$ and $\omega_2$, fusing features as $X_{\text{FCM}}=(X_C\otimes\omega_2)\oplus(X_S\otimes\omega_1)$.

Q: What are the main VisDrone results for FBRT-YOLO variants?

FBRT-YOLO-N: AP $20.2\%$, AP50 $34.4\%$, 0.9M params, 6.7 GFLOPs, 192 FPS. FBRT-YOLO-S: AP $25.9\%$, AP50 $42.4\%$, 2.9M, 22.9G, 143 FPS. FBRT-YOLO-M: AP $28.4\%$, AP50 $45.9\%$, 7.2M, 58.7G, 94 FPS. FBRT-YOLO-X: AP $30.1\%$, AP50 $48.4\%$, 22.8M, 185.8G, 52 FPS.

Q: What improvements are shown on UAVDT and AI-TOD?

On UAVDT, FBRT-YOLO reaches AP $18.4\%$ and AP50 $31.1\%$, outperforming CEASC (AP $17.1\%$, AP50 $30.9\%$). On AI-TOD, FBRT-YOLO-S improves AP from $19.1\%$ to $20.2\%$ and AP50 from $43.6\%$ to $45.8\%$ while reducing parameters (11.2M to 2.9M) and GFLOPs (28.6G to 22.9G).

Yao Xiao, Tingfa Xu, Xin Yu, Jianan Li

Proceedings of the AAAI Conference on Artificial Intelligence·2025·Engineering·67 citations

8 min read

Read the full paper at DOI or on arxiv

TL;DR

FBRT-YOLO is a YOLO-family real-time aerial detector designed to improve small-target detection while maintaining high inference efficiency.

Briefing Cornell Notes

Briefing

This paper addresses a practical bottleneck in real-time aerial image detection: how to improve small-target detection accuracy without sacrificing inference speed and computational efficiency on resource-constrained embedded flight devices. The authors frame the core research question as balancing (i) the difficulty of detecting very small objects in high-resolution aerial imagery and (ii) the need for fast inference under limited compute budgets. This matters because many standard real-time detectors (especially those optimized for low-resolution natural images) either lose small-object information during downsampling/feature extraction or become too computationally expensive when adapted to high-resolution aerial inputs.

The proposed solution, FBRT-YOLO, is a family of real-time detectors built around two lightweight modules designed specifically to improve small-object perception while keeping the model efficient. The Feature Complementary Mapping Module (FCM) targets an information mismatch problem: shallow layers contain precise spatial location cues, while deeper layers contain richer semantic cues, but conventional backbones do not integrate shallow spatial information into deeper representations effectively. The Multi-Kernel Perception Unit (MKP) targets another issue: small objects may occupy only a few pixels and can be “washed out” by receptive-field limitations and downsampling; additionally, small objects exist at multiple scales and are often surrounded by complex backgrounds. MKP is intended to enhance multi-scale feature perception with minimal structural overhead.

Methodologically, the paper follows a standard deep learning object detection evaluation pipeline. FBRT-YOLO is implemented as a YOLO-family detector with modifications to the backbone/neck design. Training is performed for 300 epochs using SGD with momentum 0.937, weight decay 0.0005, batch size 4, and initial learning rate 0.01. Experiments are run on three aerial detection benchmarks: VisDrone (2018), UAVDT (2018), and AI-TOD (2021). Hardware details are provided: training uses an NVIDIA GeForce RTX 4090 GPU, while inference speed is measured on a single RTX 3080 GPU. The paper reports detection quality using average precision (AP) and AP50/AP75, and efficiency using GFLOPs, parameter counts, and FPS.

The key architectural contributions are described as follows. FCM is embedded into each stage of the backbone. It uses a channel split with ratio $α$ \u0007 to divide input feature channels into two branches: one branch is processed with a standard $3 \times 3$ convolution to extract richer semantic/channel information, while the other branch uses point-wise convolution to preserve more shallow spatial positional information. The module then performs complementary mapping via channel interaction (depthwise convolution followed by global average pooling and sigmoid to generate channel weights) and spatial interaction (a convolution + BN + sigmoid to generate spatial attention weights). Finally, it aggregates the two branches using element-wise weighted fusion: $X_{FCM} = (X_{C} \otimes ω_{2}) \oplus (X_{S} \otimes ω_{1}) .$ FCM’s intent is to transfer spatial location information into deeper semantic representations, improving alignment and localization of small targets.

MKP is introduced in the final (fourth) backbone stage. It replaces the final downsampling layer and reduces the corresponding detection heads, aiming to simplify the network while improving multi-scale perception. MKP concatenates depthwise convolutions with different kernel sizes (the paper’s experiments explore kernel-size configurations) and uses point-wise convolutions between scales to integrate cross-scale spatial relationships. The paper summarizes MKP as: $X^{'} = T_{2 k + 1} (A (\dots A (T_{k} (X)) \dots)) .$ In practice, the authors set $k = 3$ (implying multi-kernel sizes such as $3, 5, 7$ in their ablations).

The paper’s main quantitative results are reported on VisDrone in Table 1, comparing FBRT-YOLO variants (N/S/M/L/X) against multiple real-time detectors (YOLOv5-L, YOLOv8-N/S/M/L/X, YOLOv9-M, YOLOv10-S/L/X, and RT-DETR-R34/R50). Across model sizes, FBRT-YOLO improves the accuracy–efficiency trade-off. For example, on VisDrone: - FBRT-YOLO-N achieves AP $20.2%$ and AP50 $34.4%$ with 0.9M parameters and 6.7 GFLOPs, running at 192 FPS. - FBRT-YOLO-S achieves AP $25.9%$ , AP50 $42.4%$ with 2.9M parameters and 22.9 GFLOPs at 143 FPS. - FBRT-YOLO-M achieves AP $28.4%$ , AP50 $45.9%$ with 7.2M parameters and 58.7 GFLOPs at 94 FPS. - FBRT-YOLO-L achieves AP $29.7%$ , AP50 $47.7%$ with 14.6M parameters and 119.2 GFLOPs at 70 FPS. - FBRT-YOLO-X achieves AP $30.1%$ , AP50 $48.4%$ with 22.8M parameters and 185.8 GFLOPs at 52 FPS.

The authors also provide relative improvements in terms of parameter/GFLOPs reductions and AP gains versus YOLOv8 and YOLOv10 baselines. For instance, they state that FBRT-YOLO-N/S reduces parameter counts by 72% and 74% compared to YOLOv8-N/S while improving AP by 0.6% and 2.3%. For medium models, FBRT-YOLO-M reduces GFLOPs by 26% (vs YOLOv8-M) and 23% (vs YOLOv9-M) while improving AP by 1.3% and 1.2%. For large models, FBRT-YOLO-X is reported to have 66% and 23% fewer parameters than YOLOv8-X and YOLOv10-X, with AP improvements of 1.2% and 1.4%.

On VisDrone, Table 2 compares against additional state-of-the-art methods, showing FBRT-YOLO reaching AP $30.1%$ and AP50 $48.4%$ and AP75 $31.7%$ , outperforming YOLOv3-SPP3 (AP $26.4%$ ), DMNet (AP $29.4%$ ), QueryDet (AP $28.3%$ ), and CEASC (AP $28.7%$ ).

On UAVDT (Table 3), FBRT-YOLO reports AP $18.4%$ , AP50 $31.1%$ , and AP75 $18.9%$ , exceeding CEASC (AP $17.1%$ , AP50 $30.9%$ ) and other listed methods such as GLSAN (AP $17.0%$ ) and GFL (AP $16.9%$ ).

On AI-TOD (Table 4), the paper compares FBRT-YOLO-S against YOLOv8-S: FBRT-YOLO-S improves AP from $19.1%$ to $20.2%$ and AP50 from $43.6%$ to $45.8%$ , while reducing parameters from 11.2M to 2.9M and GFLOPs from 28.6G to 22.9G. FPS increases slightly from 131 to 142.

The ablation study (Table 5) on VisDrone using YOLOv8-S as baseline evaluates the contributions of FCM, MKP, and a redundancy-reduction (RR) design. The baseline (no FCM, no MKP, no RR) has AP $23.6%$ , AP50 $39.6%$ , 11.2M parameters, and 28.6 GFLOPs. Adding RR alone yields AP $23.4%$ (slightly lower) but reduces parameters to 9.1M and FLOPs to 25.5G. Adding FCM (with RR) yields AP $24.3%$ and AP50 $40.6%$ with 7.2M parameters and 23.2G FLOPs. Adding both FCM and MKP (and RR) yields the best ablation result: AP $25.9%$ , AP50 $42.4%$ with 2.9M parameters and 22.9G FLOPs.

Additional ablations explore FCM’s split ratio $α$ (Table 7), showing that configurations retaining more spatial information in deeper stages (e.g., $α$ patterns like $(0.75, 0.75, 0.25, 0.25)$ ) achieve higher AP50 and AP. Kernel-size choices for MKP (Table 8) indicate that too-small kernels limit receptive field, while too-large kernels introduce background noise; the mixed kernel configuration $(3, 5, 7)$ achieves AP $25.9%$ and AP50 $42.4%$ with 2.90M parameters and 22.9G FLOPs.

Limitations are not deeply quantified in the provided text. However, several apparent limitations follow from the methodology and reporting style: (1) the paper focuses on three benchmarks and does not report cross-dataset generalization beyond these; (2) inference speed is measured on a specific GPU setup, so real embedded deployment may differ; (3) the ablation and comparisons emphasize AP/AP50/AP75 and efficiency, but do not provide detailed statistical significance testing (e.g., confidence intervals) or robustness analyses across weather/illumination beyond qualitative heatmaps/visualizations; and (4) the paper does not clearly specify whether results are averaged over multiple runs with different seeds.

Practically, the results suggest that FBRT-YOLO is a strong candidate for real-time aerial perception pipelines where small-object detection is critical (e.g., drone-based monitoring, traffic/incident detection, and search-and-rescue). Who should care includes practitioners building on-device or near-device detection systems, researchers working on small-object detection in high-resolution imagery, and teams needing a YOLO-like architecture that improves the accuracy–efficiency trade-off without resorting to expensive multi-scale input pyramids.

Overall, the paper’s core contribution is a YOLO-family detector that combines spatial-semantic complementary mapping (FCM) with multi-kernel multi-scale perception (MKP), achieving improved AP while maintaining real-time performance across multiple model sizes on standard aerial benchmarks.

Cornell Notes

FBRT-YOLO is a YOLO-family real-time aerial detector that improves small-object detection by addressing spatial-semantic feature mismatch (FCM) and enhancing multi-scale perception (MKP). Across VisDrone, UAVDT, and AI-TOD, it reports higher AP while reducing parameters and GFLOPs relative to common real-time baselines, maintaining strong FPS.

What problem does the paper target in real-time aerial image detection?

Small-object detection in high-resolution aerial imagery is difficult while maintaining real-time efficiency on resource-constrained devices; the paper targets both small-target accuracy and the accuracy–speed trade-off.

What are the two core modules proposed in FBRT-YOLO?

FCM (Feature Complementary Mapping Module) integrates shallow spatial positional information into deeper semantic features; MKP (Multi-Kernel Perception Unit) replaces the final downsampling with multi-kernel convolutions to improve multi-scale perception.

How does FCM integrate spatial and semantic information?

FCM splits channels into two branches (one processed with $3 \times 3$ conv for richer semantic/channel info, one with point-wise conv to preserve spatial info), then uses channel interaction (depthwise conv + global average pooling + sigmoid) and spatial interaction (conv + BN + sigmoid) to generate weights $ω_{1}$ and $ω_{2}$ , fusing features as $X_{FCM} = (X_{C} \otimes ω_{2}) \oplus (X_{S} \otimes ω_{1})$ .

Where is MKP applied, and what does it replace?

MKP is introduced in the final (fourth) backbone stage and replaces the final downsampling layer; it also reduces the corresponding detection heads to simplify the network.

What training setup and evaluation metrics are used?

Training uses SGD (momentum 0.937, weight decay 0.0005, batch size 4, learning rate 0.01) for 300 epochs. Evaluation reports AP, AP50, AP75, and efficiency via GFLOPs, parameter count, and FPS.

What are the main VisDrone results for FBRT-YOLO variants?

FBRT-YOLO-N: AP $20.2%$ , AP50 $34.4%$ , 0.9M params, 6.7 GFLOPs, 192 FPS. FBRT-YOLO-S: AP $25.9%$ , AP50 $42.4%$ , 2.9M, 22.9G, 143 FPS. FBRT-YOLO-M: AP $28.4%$ , AP50 $45.9%$ , 7.2M, 58.7G, 94 FPS. FBRT-YOLO-X: AP $30.1%$ , AP50 $48.4%$ , 22.8M, 185.8G, 52 FPS.

How does FBRT-YOLO compare to YOLOv8 on VisDrone in reported relative terms?

The paper states FBRT-YOLO-N/S reduces parameters by 72% and 74% vs YOLOv8-N/S while improving AP by 0.6% and 2.3%. For medium models, FBRT-YOLO-M reduces GFLOPs by 26% vs YOLOv8-M and 23% vs YOLOv9-M while improving AP by 1.3% and 1.2%.

What improvements are shown on UAVDT and AI-TOD?

On UAVDT, FBRT-YOLO reaches AP $18.4%$ and AP50 $31.1%$ , outperforming CEASC (AP $17.1%$ , AP50 $30.9%$ ). On AI-TOD, FBRT-YOLO-S improves AP from $19.1%$ to $20.2%$ and AP50 from $43.6%$ to $45.8%$ while reducing parameters (11.2M to 2.9M) and GFLOPs (28.6G to 22.9G).

What does the ablation study conclude about FCM and MKP?

With YOLOv8-S baseline, adding RR alone reduces compute but slightly lowers AP. Adding FCM improves AP (to $24.3%$ in the reported ablation step), and adding MKP on top of FCM yields the best result: AP $25.9%$ and AP50 $42.4%$ with 2.9M parameters and 22.9G FLOPs.

Review Questions

Which specific feature mismatch does FCM aim to correct, and how does its channel split and attention weighting mechanism implement that goal?
Why might replacing the final downsampling with MKP help small-object detection, and what evidence does the paper provide (tables/ablation)?
From the VisDrone table, how does FBRT-YOLO-S achieve a better accuracy–efficiency trade-off than YOLOv8-S (compare AP, params, GFLOPs, FPS)?
What ablation results demonstrate that both FCM and MKP are needed (not just one), and what are the best AP/AP50 numbers reported?
How do the MKP kernel-size ablations support the claim that multi-scale receptive fields improve performance without excessive background noise?

Key Points

1
FBRT-YOLO is a YOLO-family real-time aerial detector designed to improve small-target detection while maintaining high inference efficiency.
2
FCM addresses spatial-semantic mismatch by transferring shallow spatial positional cues into deeper semantic features using complementary channel and spatial attention weighting.
3
MKP replaces the final downsampling stage with multi-kernel depthwise convolutions (and point-wise integration) to improve multi-scale perception for tiny objects.
4
On VisDrone, FBRT-YOLO-X achieves AP $30.1%$ and AP50 $48.4%$ at 52 FPS with 22.8M parameters and 185.8 GFLOPs.
5
On VisDrone, FBRT-YOLO-S achieves AP $25.9%$ and AP50 $42.4%$ at 143 FPS with 2.9M parameters and 22.9 GFLOPs.
6
On UAVDT, FBRT-YOLO reaches AP $18.4%$ and AP50 $31.1%$ , outperforming CEASC (AP $17.1%$ , AP50 $30.9%$ ).
7
On AI-TOD, FBRT-YOLO-S improves AP to $20.2%$ (from $19.1%$ ) while cutting parameters from 11.2M to 2.9M and GFLOPs from 28.6G to 22.9G.
8
Ablations show the combined use of RR + FCM + MKP yields the best performance (AP $25.9%$ , AP50 $42.4%$ ) with low compute (2.9M params, 22.9G FLOPs).

Highlights

“FBRT-YOLO-X (Ours) … AP 30.1, AP50 48.4 … Params 22.8M … GFLOPs 185.8 … FPS 52” (VisDrone, Table 1).

“FBRT-YOLO-S (Ours) … AP 25.9, AP50 42.4 … Params 2.9M … GFLOPs 22.9 … FPS 143” (VisDrone, Table 1).

On AI-TOD: “FBRT-YOLO-S … AP 20.2 … Params 2.9M … FLOPs 22.9G … FPS 142” vs “YOLOv8-S … AP 19.1 … Params 11.2M … FLOPs 28.6G … FPS 131” (Table 4).

Ablation best: “RR ✓, FCM ✓, MKP ✓ … AP 25.9 … AP50 42.4 … Params 2.9M … FLOPs 22.9G” (Table 5).

FCM fusion is explicitly defined as XFCM​=(XC​⊗ω2​)⊕(XS​⊗ω1​), combining channel- and spatial-weighted complementary features (Method section).

Topics

Computer Vision
Object Detection
Real-Time Detection
Aerial Image Analysis
Small Object Detection
Neural Network Architecture Design
Feature Pyramids and Multi-Scale Representation
Attention Mechanisms
YOLO-family Detectors

Mentioned

FBRT-YOLO
YOLOv5
YOLOv8
YOLOv9
YOLOv10
RT-DETR
Ultralytics YOLO
PyTorch/SGD (implied by training setup)
NVIDIA RTX 4090
NVIDIA RTX 3080
Yao Xiao
Tingfa Xu
Xin Yu
Jianan Li
S. Deng
S. Li
K. Xie
W. Song
X. Liao
A. Hao
H. Qin
B. Du
Y. Huang
J. Chen
D. Huang
D. Du
Y. Qi
H. Yu
K. Yang
K. Tian
J. Jocher
A. Chaurasia
J. Qiu
T.-Y. Lin
P. Dollár
R. Girshick
K. He
B. Hariharan
S. Belongie
C.-Y. Wang
H.-Y. M. Liao
C. Wang
I.-H. Yeh
Z. Zhao
W. Lv
S. Xu
J. Wei
G. Wang
Q. Dang
Y. Liu
J. Chen
FCM - Feature Complementary Mapping Module
MKP - Multi-Kernel Perception Unit
RR - Redundancy-Reduction module (as named in ablations)
AP - Average Precision
AP50 - Average Precision at IoU threshold 0.50
AP75 - Average Precision at IoU threshold 0.75
GFLOPs - Giga Floating Point Operations per second (compute measure)
FPS - Frames Per Second
FPN - Feature Pyramid Network
PANet - Path Aggregation Network
DWConv - Depthwise Convolution
PWConv - Pointwise Convolution
BN - Batch Normalization
SGD - Stochastic Gradient Descent