RARE25 Challenge Results


RARE25 brought together 11 teams from 7 countries, highlighting growing international interest in computer-aided detection for early Barrett’s neoplasia. The submissions reflected a diverse set of modern deep learning strategies, with a clear trend toward large-scale pretraining, ensembling, and careful calibration.

Top teams and approaches

  1. IMSY

    The winning solution combined a GastroNet-pretrained CNN backbone (ResNet-50) with a LoRA-finetuned DINOv3 ViT-Large, extensive ensembling, and post-hoc calibration. Their approach emphasized robustness and consistency across folds rather than relying on a single model.

  2. UT

    The UT team focused on a transformer-based architecture (MaxViT-Tiny) trained purely on challenge data. Their approach highlights that competitive performance is still achievable without large-scale external pretraining, provided training and validation are carefully structured.

  3. Jmees-inc

    This team used a segmentation-driven pipeline with staged pretraining and a compact ensemble. Their method illustrates the continued relevance of localization-aware approaches alongside pure classification pipelines.
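The winning recipe of logit-level ensembling followed by post-hoc calibration can be sketched roughly as follows. The function names and the grid-search temperature calibration are illustrative assumptions, not the IMSY team's actual code:

```python
import numpy as np

def ensemble_logits(logits_list, weights=None):
    """Weighted average of per-model logits (hypothetical two-model ensemble)."""
    logits = np.stack(logits_list)            # (n_models, n_samples, n_classes)
    if weights is None:
        weights = np.full(len(logits_list), 1.0 / len(logits_list))
    return np.tensordot(weights, logits, axes=1)

def softmax(z, T=1.0):
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)     # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def fit_temperature(logits, labels, grid=np.linspace(0.5, 5.0, 46)):
    """Pick the temperature minimizing NLL on a held-out split (simple grid search)."""
    best_T, best_nll = 1.0, np.inf
    for T in grid:
        p = softmax(logits, T)
        nll = -np.log(p[np.arange(len(labels)), labels] + 1e-12).mean()
        if nll < best_nll:
            best_T, best_nll = T, nll
    return best_T
```

Temperature scaling leaves the model's class rankings unchanged; it only rescales confidence, which is what makes it attractive as a post-hoc step after ensembling.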
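Training purely on challenge data, as the UT team did, makes leakage-free validation splits essential: frames from one patient must not appear in both training and validation. A minimal sketch of patient-level fold assignment (the helper name and data layout are assumptions, not the team's code):

```python
import random
from collections import defaultdict

def patient_level_folds(samples, n_folds=5, seed=0):
    """Assign whole patients to folds so that frames from the same
    patient never straddle a train/validation boundary.
    `samples` is a list of (patient_id, label) pairs (hypothetical layout)."""
    by_patient = defaultdict(list)
    for idx, (patient_id, _label) in enumerate(samples):
        by_patient[patient_id].append(idx)
    patients = sorted(by_patient)
    random.Random(seed).shuffle(patients)     # deterministic shuffle of patients
    folds = [[] for _ in range(n_folds)]
    for i, pid in enumerate(patients):
        folds[i % n_folds].extend(by_patient[pid])
    return folds
```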
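Segmentation-driven pipelines like the one above are typically judged with overlap metrics such as the Dice score. A minimal reference implementation for binary masks (illustrative, not the team's evaluation code):

```python
import numpy as np

def dice_score(pred_mask, gt_mask, eps=1e-7):
    """Dice overlap between two binary masks: 2|A∩B| / (|A| + |B|)."""
    pred = pred_mask.astype(bool)
    gt = gt_mask.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    return (2.0 * inter + eps) / (pred.sum() + gt.sum() + eps)
```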

Key outcomes

RARE25 shows several clear developments compared to earlier editions:

  • Stronger baseline performance: Most top teams achieved robust detection performance, indicating that modern architectures and training strategies have matured significantly for this task.
  • Pretraining and ensembling are dominant: The best-performing methods consistently relied on large-scale pretraining and multi-model ensembles, suggesting that single-model solutions are no longer competitive at the top level.
  • Improved robustness: Top teams demonstrated more stable performance across validation splits, pointing toward better generalization and reduced sensitivity to data partitioning.

Remaining challenges

Despite these improvements, the challenge also highlights important open problems:

  • Clinical usability remains limited: Even strong models still struggle with false positives in low-prevalence settings, which limits direct deployment in screening workflows.
  • Data efficiency: Many top solutions depend on heavy pretraining or ensembling, raising questions about scalability and reproducibility in smaller clinical settings.
  • Generalization across centers: Performance in controlled benchmarks does not fully capture domain shift across hospitals, devices, and patient populations.
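The low-prevalence concern above follows directly from Bayes' rule: at screening prevalences of around 1%, even a detector with high sensitivity and specificity produces far more false alarms than true detections. The numbers below are purely illustrative, not challenge results:

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """P(disease | positive result) via Bayes' rule."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)

# Illustrative operating point: 90% sensitivity, 90% specificity, 1% prevalence
ppv = positive_predictive_value(0.90, 0.90, 0.01)
print(f"PPV = {ppv:.3f}")  # roughly 0.083: about 11 false alarms per true case
```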

Read the full report

For full details on the challenge design, participating methods, and quantitative results, see the preprint:

https://arxiv.org/abs/2604.11171

Takeaway

RARE25 demonstrates that while technical performance continues to improve, bridging the gap to clinically reliable deployment—particularly under realistic prevalence and variability—remains the central challenge moving forward.