Task & Evaluation


Task Description 📝

The core task of the RARE26 challenge is binary classification: determining whether an endoscopic image of a Barrett's Esophagus (BE) patient contains early neoplasia. The goal is to build AI algorithms that identify subtle but critical signs of early-stage cancer while maintaining a low false positive rate, a key requirement in real-world clinical use.


Metric 🎯

The central performance metric is the Positive Predictive Value at 90% Recall (PPV@90Recall).

Why not accuracy or AUC?

Early neoplasia in Barrett's Esophagus patients is rare: roughly 1 in 100 surveillance cases. This low prevalence causes two widely used metrics to fail silently:

Accuracy is dominated by the majority class. A model that flags nothing already achieves ~99% accuracy, because 99% of images are non-dysplastic. A model that catches true cases at the cost of some false positives scores about the same as this useless baseline, or even lower, so accuracy cannot show which of the two is clinically useful.

AUROC (the area under the ROC curve, often shortened to AUC) averages discrimination performance across every possible threshold, most of which are clinically irrelevant. A model can achieve an AUROC of 0.90 while producing dozens of false alarms for every true case at the threshold required for clinical use. AUROC never tells you what happens at that specific operating point.
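To make the AUROC point concrete, here is a minimal sketch under an assumption that is not part of the challenge definition: the scores of non-dysplastic and neoplastic images follow two unit-variance normal distributions (the equal-variance binormal model). Under that assumption, a given AUROC fixes the separation between the two distributions, and the PPV at 90% recall follows in closed form. The function name `binormal_ppv_at_recall` is illustrative only.

```python
from math import sqrt
from statistics import NormalDist


def binormal_ppv_at_recall(auroc, prevalence, recall=0.90):
    """PPV at a fixed recall under an ASSUMED equal-variance binormal model.

    Negatives are modelled as N(0, 1) and positives as N(d, 1), for which
    AUROC = Phi(d / sqrt(2)). This is an illustration, not challenge code.
    """
    std_normal = NormalDist()
    # Separation between the class means implied by the AUROC.
    d = sqrt(2) * std_normal.inv_cdf(auroc)
    # Threshold at which a fraction `recall` of positives scores above it.
    threshold = d - std_normal.inv_cdf(recall)
    # Fraction of negatives that also score above that threshold.
    false_positive_rate = 1.0 - std_normal.cdf(threshold)
    true_alerts = recall * prevalence
    false_alerts = false_positive_rate * (1.0 - prevalence)
    return true_alerts / (true_alerts + false_alerts)


ppv = binormal_ppv_at_recall(auroc=0.90, prevalence=0.01)
print(f"PPV at 90% recall: {ppv:.3f}")  # about 0.03
```

In this idealized setting, an AUROC of 0.90 at 1% prevalence corresponds to a PPV of roughly 3% at 90% recall, which is on the order of 30 false alarms per true case. Real score distributions will differ, so the number is indicative only.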

What PPV@90Recall measures

PPV@90Recall fixes recall (sensitivity) at 90% — a level high enough that the vast majority of true neoplasia cases are caught — and then measures Positive Predictive Value (PPV), i.e. the fraction of positive predictions that are actually correct.

PPV  =  TP / (TP + FP)     evaluated at the threshold where Recall = 0.90

Recall  =  TP / (TP + FN)  ← fixed at 0.90

This directly quantifies clinical utility: of every alert the model raises, how many correspond to a real case? A low PPV means clinicians must review many benign patients for every true finding — eroding trust and wasting resources. A high PPV means alerts are actionable.
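As a minimal sketch, PPV@90Recall could be computed from labels and model scores as shown here. This is not the official evaluation code: the function name, the handling of tied scores, and the choice of the highest threshold at which recall reaches at least 0.90 (exactly 0.90 is not always attainable with a finite number of positives) are assumptions.

```python
def ppv_at_recall(labels, scores, target_recall=0.90):
    """Return PPV at the highest score threshold whose recall >= target_recall.

    labels: iterable of 0/1 (1 = neoplasia); scores: higher means more suspicious.
    Images with tied scores are treated as one group, since a threshold
    cannot separate them.
    """
    labels = list(labels)
    scores = list(scores)
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("at least one positive label is required")

    # Walk down the ranking from the most to the least suspicious image.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume every image that shares this score.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1]:
                tp += 1
            else:
                fp += 1
            i += 1
        if tp / n_pos >= target_recall:
            return tp / (tp + fp)
    # Not reached: recall is 1.0 once every image has been flagged.
    raise RuntimeError("target recall was never reached")


# Toy data: 9 neoplasia and 90 benign images share a high score,
# while 1 neoplasia and 900 benign images share a low score.
labels = [1] * 9 + [0] * 90 + [1] + [0] * 900
scores = [0.8] * 99 + [0.1] * 901
print(round(ppv_at_recall(labels, scores), 3))  # 0.091
```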

A concrete example

Consider 1,000 surveillance images with 10 neoplasia cases (1% prevalence). A model at 90% recall correctly flags 9 of those 10 cases. If it also flags 90 benign images:

                   Model flags positive   Model flags negative
True neoplasia     9 (TP)                 1 (FN)
Non-dysplastic     90 (FP)                900 (TN)

  • Accuracy: (9 + 900) / 1000 = 90.9%, while a model flagging nothing scores 99%
  • AUROC: may read 0.90+, masking the FP burden entirely
  • PPV@90Recall: 9 / (9 + 90) ≈ 9.1%, so only 1 in 11 alerts is real

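A more discriminative model at the same recall but flagging only 20 benign images would achieve PPV = 9 / 29 ≈ 31%. This is a clinically meaningful improvement that accuracy and AUROC do not make visible: accuracy rises only to 97.9%, still below the 99% of the flag-nothing baseline, and AUROC does not isolate this operating point.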
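The arithmetic of this example can be checked with a short script. The helper `confusion_metrics` is a hypothetical name introduced for illustration; the counts are the ones given above (1,000 images, 10 neoplasia cases).

```python
def confusion_metrics(tp, fp, fn, tn):
    """Accuracy, recall and PPV from the four confusion-matrix counts."""
    total = tp + fp + fn + tn
    flagged = tp + fp
    return {
        "accuracy": (tp + tn) / total,
        "recall": tp / (tp + fn),
        # PPV is undefined when nothing is flagged.
        "ppv": tp / flagged if flagged else float("nan"),
    }


flag_nothing = confusion_metrics(tp=0, fp=0, fn=10, tn=990)
model_90_fp = confusion_metrics(tp=9, fp=90, fn=1, tn=900)
model_20_fp = confusion_metrics(tp=9, fp=20, fn=1, tn=970)

for name, m in [("flag nothing", flag_nothing),
                ("90 false positives", model_90_fp),
                ("20 false positives", model_20_fp)]:
    print(f"{name:>20}: accuracy={m['accuracy']:.3f} "
          f"recall={m['recall']:.2f} ppv={m['ppv']:.3f}")
```

The flag-nothing baseline has the highest accuracy (0.990) of the three, while PPV separates the two real models clearly (about 0.091 versus 0.310).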

🔬 Interactive figure: Explore why accuracy and AUC fail in low-prevalence detection →


Evaluation 📈

The evaluation procedure is designed to mirror real-world clinical demands and is consistent across both the Open Development Phase and the Closed Testing Phase.

Simulating realistic prevalence

For each evaluation run, all non-dysplastic images in the relevant set are included. Neoplasia images are then sampled with replacement to simulate a realistic class imbalance, targeting a ratio of one neoplasia image per 100 non-dysplastic images (~1% prevalence). This reflects the distribution an algorithm would encounter in actual clinical deployment, not the balanced distributions common in model training.
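One way to picture a single evaluation run is the sketch below. It is an illustration, not the organizers' code: the function name `resample_at_prevalence`, the rounding rule for the number of sampled neoplasia images, and the minimum of one sampled positive are assumptions.

```python
import random


def resample_at_prevalence(negatives, positives, rng, negatives_per_positive=100):
    """Build one evaluation set: every negative plus resampled positives.

    Positives are drawn WITH replacement so that there is about one
    positive per `negatives_per_positive` negatives.
    """
    if not positives:
        raise ValueError("at least one positive item is required")
    # Assumed rounding rule; the challenge text only states the target ratio.
    n_pos = max(1, round(len(negatives) / negatives_per_positive))
    sampled_positives = rng.choices(positives, k=n_pos)  # with replacement
    items = list(negatives) + sampled_positives
    labels = [0] * len(negatives) + [1] * n_pos
    return items, labels


rng = random.Random(0)  # fixed seed so the sketch is reproducible
negatives = [f"nd_{i}" for i in range(2000)]   # placeholder image ids
positives = [f"neo_{i}" for i in range(50)]    # placeholder image ids
items, labels = resample_at_prevalence(negatives, positives, rng)
print(len(items), sum(labels))  # 2020 20
```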

Reducing sampling noise

Because this resampling involves randomness, the evaluation is repeated 1,000 times. In each iteration, PPV@90Recall is computed independently. The final score for a submission is the median PPV@90Recall across all 1,000 repetitions. Using the median rather than the mean provides robustness against outlier runs caused by unusual sampling draws.
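The repetition and median aggregation could be sketched as follows. Here `run_once` is a hypothetical callable that performs one resampled evaluation and returns its PPV@90Recall; the toy stand-in below only mimics occasional outlier runs and does not evaluate a real model.

```python
import random
import statistics


def median_over_repeats(run_once, n_repeats=1000, seed=0):
    """Call `run_once(rng)` n_repeats times and return the median result."""
    rng = random.Random(seed)
    results = [run_once(rng) for _ in range(n_repeats)]
    return statistics.median(results)


def toy_run(rng):
    """Stand-in for one evaluation run: usually 0.30, occasionally an outlier."""
    return 1.0 if rng.random() < 0.05 else 0.30


print(median_over_repeats(toy_run))  # 0.3
```

In this toy setup the occasional outlier runs would pull a mean above 0.30, whereas the median stays at the typical value.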

Secondary metrics

During the Open Development Phase, participants will also receive the AUROC and AUPRC (Area Under the Precision-Recall Curve) as supplementary diagnostics to help guide model development. These metrics are provided for reference only and will not affect the final ranking.

Confidentiality

The results from the Closed Testing Phase will remain confidential until they are officially presented at MICCAI 2026.