Aim: We have developed a web tool and Application Programming Interface (API) to support submission of HLA data to public databanks such as the European Nucleotide Archive or GenBank, and ultimately to the IPD-IMGT/HLA Database. The Sequence Feature Annotation Tool (SFAT) annotates exons and introns, which is a mandatory requirement for submission. Feature annotation can be laborious and time-consuming, particularly for groups with limited bioinformatics support. SFAT provides an intuitive UI with easy-to-navigate results and an API that lets programmatic users run concurrent queries. It replaces the existing feature annotation within the HLA Submission Tool and can be integrated into custom bioinformatics pipelines. SFAT will be integrated into the planned redevelopment of the HLA Submission Tool, which aims to provide an easier user experience and more efficient submission processing.
Method: SFAT is developed in Python, with a Flask API and a web-based user interface. The tool uses Biopython’s pairwise alignment tool to identify feature boundaries by alignment to IPD-IMGT/HLA reference sequences. Specific HLA domain knowledge is used to resolve complex annotations. Results are returned in standard formats, allowing integration into HLA annotation pipelines and databank submissions.
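As a rough illustration of projecting feature boundaries from a reference onto a query, the sketch below maps exon and intron coordinates through an alignment. It is not SFAT's implementation. To stay dependency-free it uses Python's standard-library difflib.SequenceMatcher as a stand-in for Biopython's pairwise aligner, and the function names, toy sequences and coordinates are all hypothetical.

```python
from difflib import SequenceMatcher


def project_features(reference, ref_features, query):
    """Project reference features onto a query sequence.

    ref_features is a list of (name, start, end) tuples using 0-based,
    end-exclusive reference coordinates. Returns the same tuples in query
    coordinates, or (name, None, None) if no base of the feature aligned.
    """
    # Stand-in aligner (assumption): difflib, not Biopython's pairwise aligner.
    matcher = SequenceMatcher(None, reference, query, autojunk=False)

    # Map each aligned reference base to its position in the query.
    ref_to_query = {}
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("equal", "replace"):
            for k in range(min(i2 - i1, j2 - j1)):
                ref_to_query[i1 + k] = j1 + k

    projected = []
    for name, start, end in ref_features:
        # Query positions of the aligned bases of this feature, in order.
        hits = [ref_to_query[i] for i in range(start, end) if i in ref_to_query]
        if hits:
            projected.append((name, hits[0], hits[-1] + 1))
        else:
            projected.append((name, None, None))
    return projected


def has_canonical_splice_sites(sequence, start, end):
    """Check that an intron begins with GT and ends with AG."""
    intron = sequence[start:end]
    return intron.startswith("GT") and intron.endswith("AG")


# Toy data (hypothetical, not real HLA sequence).
reference = "ATGGCCGTC" + "GTGAGTCAG" + "GCTCCCACT"
ref_features = [("exon 1", 0, 9), ("intron 1", 9, 18), ("exon 2", 18, 27)]
# The query carries a two-base insertion inside the intron.
query = "ATGGCCGTC" + "GTGAGTAACAG" + "GCTCCCACT"

features = project_features(reference, ref_features, query)
for name, start, end in features:
    print(name, start, end)
```

For this toy query the intron should widen from 9-18 to 9-20 and the second exon should shift to 20-29. A production tool would need a scoring aligner and extra rules for cases such as polymorphic splice motifs, which this sketch does not attempt.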
Results: In validation on over 58,000 submissions to IPD-IMGT/HLA from the last 25 years, SFAT correctly annotated 99.7% of sequences from curated submissions. Ambiguous annotations arise from polymorphisms in the expected splicing motifs. SFAT annotates sequences rapidly, averaging 0.15 seconds per query in the validation set, and offers parallelisation for bulk queries. It is available for use and is hosted as part of the IPD section on the European Bioinformatics Institute servers.
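One way a programmatic client might issue bulk queries concurrently is sketched below. SFAT's endpoint and response format are not described here, so the request itself is represented by a caller-supplied function; annotate_bulk, fake_annotate and the returned dictionary are hypothetical names and shapes chosen only for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def annotate_bulk(sequences, annotate_one, max_workers=8):
    """Run annotate_one over all sequences concurrently, keeping input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the order of the input sequences.
        return list(pool.map(annotate_one, sequences))


def fake_annotate(sequence):
    # Hypothetical stand-in for a single annotation request to the API.
    return {"length": len(sequence)}


results = annotate_bulk(["ATGGCC", "ATGGCCGTC"], fake_annotate)
print(results)
```

Threads suit this pattern because each query spends most of its time waiting on the network; in real use fake_annotate would be replaced by a function that sends one sequence to the service.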
Conclusion: SFAT provides a straightforward pathway for automated feature annotation, suitable for use in bioinformatics pipelines and database submission. It offers an intuitive web-based UI and an API for large volumes of programmatic queries. Validation demonstrates that SFAT is highly accurate, even with hyperpolymorphic HLA sequences. HLA complexity is an ongoing challenge for analysis pipelines; for highly complex sequences, cDNA sequencing remains the most accurate way of annotating features. This work facilitates submission of data to IPD-IMGT/HLA and paves the way for future development of the HLA Submission Tool.