Aim: Research studies evaluating high-resolution (HR) HLA imputation in transplantation have varied substantially in their approaches to measuring prediction performance. To increase study quality and reproducibility, we aim to develop a package for computing comprehensive imputation performance metrics and visualizations for predictions of HLA alleles, amino acids, and molecular mismatch categories for donor-recipient pairs with ambiguous typing.
Method: We developed a Python package for imputation validation in which model performance is evaluated using built-in functions from scikit-learn, a widely used machine learning library. Calibration plots bin imputation probabilities into quantiles and compare, within each bin, the fraction of correct predictions with the mean predicted probability; the gap is summarized using mean-squared error and city-block distance. The Brier score jointly measures calibration and certainty for individual-level predictions (range 0 to 1; lower is better). A histogram of the imputation probability distribution is also provided. Receiver Operating Characteristic (ROC) curves plot the true positive rate against the false positive rate for the most probable imputation prediction and quantify performance as the area under the curve (AUC).
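The metrics described above can be sketched with standard scikit-learn calls. This is a minimal illustration rather than the package's own interface: the helper name `summarize_imputation`, the number of bins, and the synthetic inputs are all assumptions made for the example, and the data are randomly generated, not study results.

```python
import numpy as np
from sklearn.calibration import calibration_curve
from sklearn.metrics import brier_score_loss, roc_auc_score


def summarize_imputation(y_true, y_prob, n_bins=4):
    """Hypothetical helper (not the package's API).

    y_true: 1 if the top imputation prediction was correct, else 0.
    y_prob: probability assigned to that top prediction.
    """
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob)

    # Quantile binning: each bin holds roughly the same number of predictions.
    frac_correct, mean_prob = calibration_curve(
        y_true, y_prob, n_bins=n_bins, strategy="quantile"
    )
    gap = frac_correct - mean_prob

    return {
        "brier": brier_score_loss(y_true, y_prob),
        "roc_auc": roc_auc_score(y_true, y_prob),
        # Distance between the calibration curve and the diagonal.
        "calibration_mse": float(np.mean(gap ** 2)),
        "calibration_city_block": float(np.sum(np.abs(gap))),
        # Counts per 0.1-wide probability bin, for the histogram panel.
        "histogram": np.histogram(y_prob, bins=10, range=(0.0, 1.0))[0],
    }


# Synthetic, well-calibrated example data (assumption, for illustration only).
rng = np.random.default_rng(0)
y_prob = rng.uniform(0.5, 1.0, size=500)
y_true = (rng.uniform(size=500) < y_prob).astype(int)

summary = summarize_imputation(y_true, y_prob)
for name, value in summary.items():
    print(name, value)
```

With inputs skewed toward high probabilities, as here, the ROC AUC has few incorrect predictions to discriminate against, whereas the calibration distances still show how closely stated probabilities track observed accuracy.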
Results: In a small simulation validation study, we analyzed HR imputation results for a dataset of 217 antigen-level typings of deceased kidney donors who also had HR typing. We then computed HLA-DQ eplet mismatches using the HLA Eplet Registry calculator for 100 random HLA pairings. We created a calibration plot for correctly predicting the unique set of DQ eplet mismatches (Figure 1A). Predictions were well calibrated, with a Brier score of 0.1. The ROC AUC for DQ eplet mismatch predictions was 0.78, but nearly all predictions had high probabilities, and this class imbalance limits the utility of ROC metrics (Figure 1B). The package is available on GitHub at https://github.com/lgragert/imputation-validation/.
Conclusion: This simulation validation study illustrates that profiling imputation performance with calibration plots can offer advantages over ROC metrics and simple accuracy measures, especially in datasets with class imbalance. This framework could support high-quality HLA imputation studies across transplant settings.