Evaluation & Ranking - AutoPET III

🤺 Evaluation criteria¶

Evaluation will be performed on held-out test cases of 200 patients. Test data will be drawn in part (50%) from the same sources and distributions as the training data, i.e. 50 PSMA-PET/CT scans from LMU, 50 FDG-PET/CT scans from UKT. The other part will be drawn crosswise from the other center, i.e. 50 PSMA-PET/CT scans from UKT and 50 FDG-PET/CT scans from LMU.

For evaluation, a combination of three metrics will be used, reflecting the aims and specific challenges of PET lesion segmentation:

Foreground Dice score of segmented lesions
False positive volume (FPvol): Volume of false positive connected components that do not overlap with positives
False negative volume (FNvol): Volume of positive connected components in the ground truth that do not overlap with the estimated segmentation mask

In case of test data that do not contain positives (no FDG- or PSMA-avid lesions), only metric 2 will be used.

Figure: Example for the evaluation. The Dice score is calculated to measure the correct overlap between predicted lesion segmentation (blue) and ground truth (red). Additionally special emphasis is put on false negatives by measuring their volume (i.e. entirely missed lesions) and on false positives by measuring their volume (i.e. large false positive volumes like brain or bladder will result in a low score).

A python script computing these evaluation metrics is provided under https://github.com/lab-midas/autoPET.

📈 Updated Ranking¶

Unfortunatley the proposed ranking sheme did not work. The model was not able to fit the distribution of the challenge results. Because of that we fall back to the folllowing ranking sheme:

We divided the test dataset into subsets based on center and tracer (i.e., PSMA LMU, PSMA UKT, FDG LMU, FDG UKT) and calculated the average metrics for Dice, FPV, and FNV within each subset. Then, we ranked the subset averages across all algorithms. For each metric, we computed an intermediate average rank by averaging the ranks of the subsets. Finally, we generated the overall rank by combining the three metric ranks using a weighting factor: Dice (0.5), FPV (0.25), and FNV (0.25).

¶

📈 Original Ranking¶

Each metric is calculated for every test sample (if applicable). The model's performance is then assessed with a mixed model framework. By incorporating the center and tracer as random effects, and thereby correcting for the (fixed) effect of the actual model performance, we ensure generalization to the random effects (center and tracer). The mean performance values (= fixed effects) per model are compared post-hoc using the Tukey test, and rankings are determined based on these comparisons.
If the p-value of a pairwise comparison is not significant (p \< 0.05), the performance of the corresponding model is considered equally good. This is applied for every metric individually.

Figure: Exemplary ranking mechanism via mixed model evaluation. Left: Traditionally, the challenge winner is determined based solely on the best (highest) mean performance value. Our approach: Middle: We go beyond mean performance by also taking into account the distribution of performance scores obtained from model evaluation on the test set; Right: We account for random variation introduced by tracers and centers by including a random effect in our ranking model.

The ranks are then combined to evaluate the overall best algorithm (metric 1: 50 % weight, metrics 2 and 3: 25 % weight each). This ranking scheme allows for multiple submissions to share the same rank! Awards will be shared in this case since there is no significant difference between these submissions.
For the second award category a submitted model needs to have a higher rank than the supplied datacentric baseline model. Datacentric models are also eligible for Award Category 1. Teams are allowed to submit up to two models in any permutation of category ((cat1, cat1), (cat2,cat2), (cat1,cat2)). Only the better model per category will be taken into account.

Figure: Example for the ranking. The final rank is derived from the combination of the ranks for dice, FPvol, and FNvol. Team A submitted a normal model and reached the highest rank which means it won in award category 1. The baselines are out of competition. Team B and Team C submitted a datacentric model and reached positions 2 and 3, respectively. For award category 2 only Team B is eligible since it submitted a datacentric model and reached a higher rank than the given baseline.