InflANNet: a neural network predictor for Influenza A CTL and HTL epitopes to aid robust vaccine design

An efficient and reliable data-driven method is essential to aid robust vaccine design, particularly in the case of an epidemic like Influenza A. Although various prediction tools exist, most of them focus on MHC-peptide binding affinity predictions. A tool that incorporates features beyond binding affinity to characterize T-cell epitopes as vaccine candidates would be of much value in this scenario. The objective of this study is to develop two separate neural network models for the prediction of CTL (cytotoxic T lymphocyte) and HTL (helper T lymphocyte) epitopes, using datasets manually curated as part of this study from the raw viral sequences of Influenza A. The epitope datasets curated from the raw sequences of the broadly protective Neuraminidase protein were utilized for building and training the models for CTLs and HTLs. Each set consisted of a nearly balanced mix of vaccine candidates and non-vaccine candidates for both CTLs and HTLs. These were fed to neural networks, which have been reported to be powerful predictors when compared with other machine/deep learning algorithms. A set of experimentally proven epitopes was chosen to validate the models, which were also tested through mutational analysis and cross-reactivity. The prepared dataset gave some valuable insights into the epitope distribution statistics and their conservancy in various outbreaks. An idea about the most probable range of peptide-MHC binding affinities was also obtained. Both models performed well, giving high accuracies when validated. The predicted epitopes were checked for cross-reactivity with other antigens and proved to be highly conserved and well suited for vaccine formulation. The combination of various features and the resulting model efficiencies in turn indicated that the collected features are valuable for the easy identification of vaccine candidates.
This suggests that our proposed models have more potential for conserved epitope prediction compared to other existing models trained on similar data and features. The possibility of refining the model with further threshold values based on more parameters is an added feature that makes it more user driven. Furthermore, the uniqueness of the model, owing to its exclusive set of Neuraminidase epitopes, paves a robust way for rapid vaccine design. The analysis of the past viral strains of Neuraminidase delivered an interesting list of possible epitopes for vaccine design. The class-I and class-II MHC binding reports of the test set of epitopes strongly support the high performance (about or above 90%) of the ML models. The homology studies with nematode antigens suggest that our predicted epitopes are free from cross-reactivity with parasitic epitopes. The very few mutations reported for Neuraminidase (the most conserved among the viral proteins) in the mutational analysis suggest that vaccine formulations with our predicted epitopes are capable of overcoming antigenic drifts in the future.


Background
Influenza A continues to be a global illness that causes huge morbidity and mortality in humans. Its continuing ability to mutate, termed 'antigenic shift' and 'drift', makes the circulating strain prediction difficult and antigenic mismatch likely (Kim et al. 2022). This poses a serious challenge for scientists in each pandemic season in discovering an optimized vaccine candidate, owing to the ineffectiveness of the already existing vaccines (Sheikh et al. 2016; Zeller et al. 2021). Thus, a good vaccination strategy, along with easier prediction methods, is in high demand.
Only Influenza A has seasonal, epidemic and pandemic capability, and it is classified into subtypes according to its hemagglutinin (HA) and neuraminidase (NA) surface glycoprotein antigens. A good vaccine recipe should target potential circulating strains of the virus in humans so as to provide population immunity in the new flu season. Hence, targeting the right protein could help rationalize the right solution. Even though HA antigenic regions have been a long-standing interest, the more conserved the NA regions are, the greater their potential for protective immunity (Almalki et al. 2022). Therefore, an effective vaccine recipe could be developed by targeting potentially conserved, cross-reactive T- and B-cell epitopes of this strain. Such a 'universal' vaccine design can potentially be addressed by a T-cell epitope ensemble vaccine comprising short, highly conserved, immunogenic peptides from influenza to activate T-cells. All of this shows the predominant importance of T-cell epitope candidates. The role of CTLs in cellular immunity includes the direct clearance of virally infected cells and the indirect recruitment of other immune cells via chemokine and cytokine secretion (McGee and Huang 2022). The primary roles of CD4+ T cells include B cell stimulation, leading to antigen-specific antibody production, as well as stimulating CD8+ proliferation and memory responses. CD4+ T-cells also mediate direct and indirect viral clearance and a reduction in symptom severity in secondary infection. These T-cell epitope candidates are usually identified using computational prediction tools to facilitate efficient vaccine design (Sanchez-Trincado et al. 2017).
These tools can reduce the time and resources needed for epitope identification projects by narrowing down the peptide repertoire that needs to be experimentally tested. Most of these prediction tools are developed using various statistical and machine learning algorithms trained on mainly two types of data: binding affinities of peptides to specific MHC molecules generated using binding assays, or sets of naturally processed MHC ligands found by eluting peptides from MHC molecules on the cell surface and identifying them by mass spectrometry (Barra et al. 2018). So, there is a need to include more features that determine immunogenicity and thus vaccine candidate potential. If a new model could be developed by training on additional features that determine an epitope's vaccine candidate potential, it would be more rewarding for vaccine research.
In the present scenario, where many epitope prediction servers are available, it is necessary to keep comparing the performance of the different methods against each other, to rationally decide which methods to choose, and to allow developers to understand what changes can truly improve prediction performance. Through this study, we also evaluated the existing servers for Flu A epitope prediction by means of efficiency, accuracy classification parameters and ROC curves (Suri and Dakshanamurthy 2022; Xia et al. 2021). One issue with past evaluations has been that the tools are commonly evaluated using the same set of data on which they were trained, which can bias the performance results. Another problem is that the data used for training such models might consist exclusively of theoretical prediction results unrelated to experimental results, which offers no reliability or translational advantage for vaccine development. Therefore, it is necessary to validate the model with previously curated, experimentally published data for qualitatively robust prediction (Ramírez-Salinas et al. 2020).

Collection of raw sequences of viral strains and conservancy analysis
The raw Neuraminidase sequences of Influenza A virus strains from all of the past four outbreaks were collected from the NCBI Influenza Virus Resource Database (Table 1). All the sequences from these outbreak periods were collected according to the corresponding subtypes expressed with the other parameters. These protein sequences were subjected to Multiple Sequence Alignment (MSA) by means of the Clustal Omega server. The aligned regions were curated from the results. The output aligned sequences from the MSA were further analysed for conserved regions by means of the Castresana Gblocks server. This server gives the blocks of conserved sequences with relevant nucleotide-base representation by removing the least aligned regions and divergent sequences from the input multiple sequence alignments. The conserved blocks were displayed with a blue underline at the bottom. Stringent selection was applied so as not to allow many contiguous non-conserved positions.
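The idea of locating conserved blocks in an alignment can be illustrated with a minimal sketch. This is not the Gblocks algorithm; it is a simplified stand-in that scores each alignment column by the fraction of sequences sharing the most common residue and then reports contiguous runs above a threshold. The function names, the threshold, the minimum block length and the toy alignment are all assumptions made for illustration.

```python
from collections import Counter


def column_conservation(aligned_seqs):
    """For each alignment column, return the fraction of sequences that
    carry the most common non-gap residue (gaps count against the column)."""
    if not aligned_seqs:
        return []
    length = len(aligned_seqs[0])
    if any(len(s) != length for s in aligned_seqs):
        raise ValueError("aligned sequences must have equal length")
    scores = []
    for col in zip(*aligned_seqs):
        residues = [c for c in col if c != "-"]
        if not residues:
            scores.append(0.0)
            continue
        top_count = Counter(residues).most_common(1)[0][1]
        scores.append(top_count / len(col))
    return scores


def conserved_blocks(scores, threshold=1.0, min_len=3):
    """Return half-open (start, end) index pairs for runs of columns whose
    score is >= threshold and whose length is at least min_len."""
    blocks, start = [], None
    for i, s in enumerate(scores):
        if s >= threshold:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_len:
                blocks.append((start, i))
            start = None
    if start is not None and len(scores) - start >= min_len:
        blocks.append((start, len(scores)))
    return blocks


# Hypothetical toy alignment, not real Neuraminidase data.
toy_alignment = ["MNPNQKIIT", "MNPNQKIIT", "MNPSQK-IT"]
scores = column_conservation(toy_alignment)
blocks = conserved_blocks(scores, threshold=1.0, min_len=3)
print(blocks)  # [(0, 3)]
```

Raising `min_len` or `threshold` plays a role loosely analogous to the stringent selection described above, since short or partially conserved runs are then discarded.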

Antigenicity determination
The strains collected were checked for their antigenic potential by means of the VaxiJen 2.0 server. The optimum threshold was set as 0.5. The antigen classification is solely based on the physicochemical properties of proteins without recourse to sequence alignment. The server outputs the list of sequences with Antigenic and Non-antigenic tags based on the threshold provided. The average range of antigenicity scores of the antigenic strains was identified as 0.5-0.75. The dataset from flu 2009 contained some non-antigenic strains, and they were filtered out.

CTL epitope prediction
The NetCTL 1.2 server predicted the CTL (CD8+) epitopes in protein sequences for all 12 MHC supertypes (A1, A2, A3, A24, A26, B7, B8, B27, B39, B44, B58, B62) with the set threshold parameters. The probable epitopes for each supertype were screened out from all peptides based on the combined score, and they were classified year-wise after removing the redundant ones. The immunogenicity scores were obtained for all the CD8+ epitopes using the Immune Epitope Database (IEDB) T cell CD8+ Immunogenicity prediction tool.

HTL epitope prediction
This process involved multiple screening steps, considering the limitation that in-silico methods predict only peptides that bind to MHC II. For the peptides to be epitopes, it is important to understand the activation of T helper cells. For this, the pMHC-II complexes should be immunogenic enough to induce the production of Tc-cells. The final list of predicted 15-mer peptides was checked for CD4+ immunogenicity by means of the IEDB T-cell Immunogenicity prediction tool (CD4 episcore). This step is crucial for CD4+ epitope identification.
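Both prediction steps operate on fixed-length peptides (9-mers for CTL, 15-mers for HTL) and remove redundant entries. As a rough illustration of that bookkeeping only, and not of what the NetCTL or IEDB servers do internally, the sketch below enumerates unique k-mers with a sliding window. The function name and the toy sequence are hypothetical.

```python
def enumerate_peptides(sequence, k):
    """Return the unique k-mers of a protein sequence, in order of first
    appearance (redundant peptides are kept only once)."""
    seen = set()
    peptides = []
    for i in range(len(sequence) - k + 1):
        pep = sequence[i:i + k]
        if pep not in seen:
            seen.add(pep)
            peptides.append(pep)
    return peptides


# Hypothetical toy sequence (the 20 standard residues), not a viral protein.
toy_sequence = "ACDEFGHIKLMNPQRSTVWY"
ctl_candidates = enumerate_peptides(toy_sequence, 9)   # 9-mer windows
htl_candidates = enumerate_peptides(toy_sequence, 15)  # 15-mer windows
print(len(ctl_candidates), len(htl_candidates))  # 12 6
```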

Collection of other physico-chemical features of the epitopes
Along with the binding affinity and immunogenicity values, various other physico-chemical properties of the epitopes, based on their amino-acid sequences, were collected from the ExPASy ProtParam server. This provided more reliable information about their antigenicity. The corresponding parameters were considered for each entry of CTL and HTL peptides, which were screened based on thresholds (Table 2).
The potential features (Rostaminia et al. 2021) collected are as follows:

Neural network development and training
The dataset consisted of 1200 peptides for the CTL model (Model I) and 1400 for the HTL model (Model II), comprising features with similar training criteria. Categorical encoding of the two output categories, vaccine candidates (VC) and non-vaccine candidates (NVC), was also done. Similarly, as the mere alphabetic representation of amino acids poses practical challenges in protein prediction problems, the sequences were one-hot encoded appropriately. Then, the model was built and trained through various methods. For this binary classification problem, the train_test_split method from scikit-learn and the deep learning libraries Keras and TensorFlow were chosen. The data were split into train, test and validation sets in a ratio of 60:20:20 for both datasets.
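The encoding and splitting steps can be sketched as follows. This is a minimal illustration under stated assumptions: the residue ordering, the flattened vector layout, the shuffling seed and the helper names are not given in the text, and a standard-library shuffle is used here as a stand-in for scikit-learn's train_test_split. The peptide CVNGSCFTV is one of the epitopes reported in this study.

```python
import random

# Assumed residue ordering; the text does not specify one.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(peptide):
    """Flatten a peptide into a 0/1 vector of length 20 * len(peptide),
    with one block of 20 positions per residue."""
    vec = [0] * (len(AMINO_ACIDS) * len(peptide))
    for pos, aa in enumerate(peptide):
        vec[pos * len(AMINO_ACIDS) + AA_INDEX[aa]] = 1
    return vec


def split_60_20_20(items, seed=0):
    """Shuffle and split items into train, validation and test parts
    in a 60:20:20 ratio (integer arithmetic avoids rounding surprises)."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = len(items) * 60 // 100
    n_val = len(items) * 20 // 100
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]
    return train, val, test


encoded = one_hot_encode("CVNGSCFTV")           # 9-mer -> 180 features
train, val, test = split_60_20_20(range(1200))  # CTL dataset size
print(len(encoded), len(train), len(val), len(test))  # 180 720 240 240
```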
To develop a well-performing model, hyperparameter tuning was done with an appropriate number of network layers set. Different optimizers, such as Adam and RMSprop, were tried along with the binary_crossentropy loss and the binary_accuracy metric. The models were trained for different numbers of epochs: 50, 80, 100 and 150. An overview of the constructed machine learning model is depicted in Fig. 1.
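For readers unfamiliar with the loss and metric named above, the following plain-Python sketch shows what binary cross-entropy and binary accuracy compute. It is written from the standard definitions rather than taken from the Keras source; the clipping constant, the 0.5 decision threshold and the example values are assumptions.

```python
import math


def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean of -(y*log(p) + (1-y)*log(1-p)); probabilities are clipped
    to [eps, 1-eps] so the logarithm is always defined."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


def binary_accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of samples whose thresholded probability matches the label."""
    correct = sum(
        1 for y, p in zip(y_true, y_prob) if (p > threshold) == bool(y)
    )
    return correct / len(y_true)


# Hypothetical labels (1 = VC, 0 = NVC) and predicted probabilities.
labels = [1, 0, 1, 0]
probs = [0.9, 0.2, 0.6, 0.4]
loss = binary_crossentropy(labels, probs)
acc = binary_accuracy(labels, probs)
print(round(loss, 4), acc)  # 0.3375 1.0
```

The example shows why the two quantities are tracked together: all four predictions are on the correct side of the threshold, yet the loss is non-zero because the model is not fully confident.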

Model validation
The test set of epitopes contained 9 peptide sequences for vaccine candidates (VC) and 9 sequences for non-vaccine candidates (NVC) as predicted by the ML model. All the peptides were checked for their effectiveness in binding with various alleles for MHC-I and MHC-II separately. The MHC-I binding predictions were made for 27 different HLA alleles using the IEDB analysis resource NetMHCpan (ver. 4.1) tool (Reynisson et al. 2020). Similarly, the MHC-II binding predictions were made using the IEDB analysis resource Consensus tool (Wang et al. 2008, 2010).

Mutational analysis for neuraminidase protein
Though Neuraminidase is considered the most conserved protein of the Influenza virus, we suspected that the few mutations reported to date might overlap with our predicted vaccine candidates. Performing an in-depth mutational analysis and checking for mutations within the epitope regions would give a better understanding when considering the epitopes for vaccine formulation. Hence, the complete list of mutations reported to date for Neuraminidase was collected from the literature, and a mutated model of Neuraminidase was generated using the SWISS-MODEL homology modelling server (Waterhouse et al. 2018).
The overlap of sequences was also checked using the ClustalW tool for multiple sequence alignment (Thompson et al. 1994).

Investigation of the disease pathology and comorbidities of Flu A associated with other parasitic infections and antigens
Many clinical and immunological links have already been found between parasitic infections and respiratory infections, particularly Influenza (Breloer and Hartmann 2023). It is through immune responses that these pathogens survive and reproduce, and these effects can be positive or negative. Comorbidities are complex in that they involve the immune system heavily. The human body responds better to Influenza A infection in the presence of various helminthic infections. A good example is Trichinella spiralis (Furze et al. 2006). According to studies conducted on co-infections with Trichinella pseudospiralis, the parasite suppresses inflammation and reduces cellular recruitment around implanted material (Stewart et al. 1985). Korten et al. (2002) observed that the numbers of DX5+/CD3- NK cells and DX5+/CD3+ T cells increased systemically during Litomosoides sigmodontis infection, so it is possible that the systemic release of parasite products and/or local cytokine and chemokine activity resulting from the migration of new-born larvae through the capillary bed lining the lungs prompts the recruitment and activation of these cells.
In this study, we aim to investigate the effects of parasitic pathology on the antigenic epitopes of flu, and how the model predicts these effects. This requires a good understanding of the homology between the strains and the epitopes collected from various studies. Therefore, homology studies between Flu A and other possibly cross-reactive nematode antigens were performed using ClustalW for the following sequences retrieved from the NCBI database.

Epitope statistics and analysis
The initial stage results gave a good understanding of the epitope distribution statistics and the disease    3.
The highly expressed epitopes among various years in MHC II binding are listed in Table 4.
The highly repeated 9-mer binding cores are listed in Table 5.

Model architecture and optimization
The neural network models were built for both 9-mer CTLs and 15-mer HTLs. A binary classifier with a train:validation:test split of 60:20:20 and an optimized model architecture gave a good accuracy score of 99.5% for the CTL model and 99% for the HTL model, with minimal loss. The model architecture consisted of 3 layers with 10, 3, and 2 units, respectively, with a dropout of 0.5 for each layer. The Adam optimizer was employed, and the optimal number of training epochs was found to be 80.
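To give a sense of the size of such a network, the sketch below counts trainable parameters for fully connected layers, reading the 10, 3 and 2 in the description above as layer widths. The input widths are an assumption: only the one-hot encoded peptide is counted (9 x 20 = 180 inputs for CTL, 15 x 20 = 300 for HTL), whereas the actual input, which would also carry the physico-chemical features, is not stated. Dropout adds no trainable parameters.

```python
def dense_param_count(input_dim, layer_sizes):
    """Total weights plus biases for a stack of fully connected layers."""
    total, prev = 0, input_dim
    for units in layer_sizes:
        total += prev * units + units  # weight matrix + bias vector
        prev = units
    return total


# Assumed input widths: one-hot peptide only (see the note above).
ctl_params = dense_param_count(9 * 20, [10, 3, 2])
htl_params = dense_param_count(15 * 20, [10, 3, 2])
print(ctl_params, htl_params)  # 1851 3051
```

Under these assumptions the parameter counts are of the same order as the dataset sizes (1200 and 1400 peptides), which is one reason a dropout of 0.5 is a plausible regularization choice.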

Model evaluation
The model performances were evaluated using various metrics such as accuracy, precision, F1-score and ROC AUC score. The datasets being balanced, accuracy was taken as the primary deciding measure. The prediction models were further validated with a set of experimentally validated CTL and HTL epitopes collected from the IEDB database, which are experimentally curated by T-cell, B-cell, and MHC ligand assays. Some were also collected from findings in the cited literature (Mintaev et al. 2022). All of these gave accurate predictions when tested on the developed models with the same parameters. Different models were trained with various combinations of the features. The best models proved to be the ones with all features, which indicates that all the features contributed well to vaccine candidate design. Binding affinity and immunogenicity were found to be the strongest features, with which the models gave an accuracy of 98% for both models. The ROC plots and accuracies of the other combinations are as given below (Figs. 3 and 4).
The accuracy of the CTL model and HTL model for some of the strong features is listed in Table 6.
Figure 5 represents the normalized confusion matrix for the CTL validation set, and Fig. 6 represents the classification report for the CTL validation set.
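The evaluation metrics named in this section can be computed from first principles, as in the following sketch. It uses the standard definitions (ROC AUC as the probability that a randomly chosen positive outscores a randomly chosen negative, with ties counted as one half); the function names and the toy labels and scores are hypothetical and do not reproduce the reported results.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FP, FN, TN) for binary labels coded as 1 and 0."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def precision_recall_f1(y_true, y_pred):
    """Precision, recall and their harmonic mean (F1)."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def roc_auc(y_true, y_score):
    """Pairwise estimate of the area under the ROC curve."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical labels (1 = VC, 0 = NVC) and model scores.
y_true = [1, 1, 1, 0, 0, 0]
y_score = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]
y_pred = [1 if s > 0.5 else 0 for s in y_score]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
auc = roc_auc(y_true, y_score)
```

A precision and recall of 1.0 correspond to zero false positives and zero false negatives in `confusion_counts`.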

Model validation
The binding activity checked with the epitope analysis resource provided by IEDB showed a strong correlation with the results obtained from our ML model for the test set of epitopes, thereby supporting the performance of our ML model (Figs. 7, 8 and Table 7).

Mutational analysis for neuraminidase
The following reported mutations were collected from various literature sources: D151G, G147R, R292K, Q136K, E119K, V116A, I223R, S247N, H275Y, N295S, N200S, V241I (Hooper and Bloom 2013; Shao et al. 2017; Hossain et al. 2002; Eshaghi et al. 2014; Jain et al. 2018). The reference sequence for the Flu A NA protein, ADK33724.1 neuraminidase [Influenza A virus (A/Aarhus/INS242/2009(H1N1))], was obtained from the NCBI database, and these mutations were introduced into the sequence. The multiple sequence alignment performed in ClustalW (Thompson et al. 1994) revealed that the mutation V241I occurs in the vaccine epitope region CVNGSCFTV predicted by the ML model. However, the other mutations had no overlaps with any of the epitope regions (Fig. 9).
This observation leads to the inference that most of the epitopes predicted as effective vaccine candidates by the ML model lie in highly conserved regions; hence, considering them for vaccine formulation is expected to help overcome antigenic drifts in the future, which is also a major objective of our research.
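The check for mutations falling inside an epitope region was performed here by sequence alignment in ClustalW. The same question can be posed as a simple position test, sketched below as a stand-in: locate the epitope in the sequence and report the mutations whose 1-based position lies inside it. The helper names, the padded toy sequence and the toy mutation labels are hypothetical; only the epitope CVNGSCFTV comes from the text.

```python
import re


def parse_mutation(label):
    """Split a label such as 'H275Y' into (wild type, position, mutant)."""
    m = re.fullmatch(r"([A-Z])(\d+)([A-Z])", label)
    if m is None:
        raise ValueError(f"unrecognized mutation label: {label}")
    return m.group(1), int(m.group(2)), m.group(3)


def mutations_in_epitope(sequence, epitope, mutations):
    """Return the mutation labels whose 1-based position falls inside the
    first occurrence of the epitope in the sequence."""
    start = sequence.find(epitope)
    if start < 0:
        return []
    first, last = start + 1, start + len(epitope)
    hits = []
    for label in mutations:
        _, pos, _ = parse_mutation(label)
        if first <= pos <= last:
            hits.append(label)
    return hits


# Toy sequence: the epitope padded with placeholder residues, so that it
# spans positions 11-19. This is NOT the real Neuraminidase numbering.
toy_sequence = "X" * 10 + "CVNGSCFTV" + "X" * 10
toy_mutations = ["X3R", "V12I", "T18A", "X25K"]  # hypothetical labels
hits = mutations_in_epitope(toy_sequence, "CVNGSCFTV", toy_mutations)
print(hits)  # ['V12I', 'T18A']
```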

Homology studies with possible cross-reactive nematode antigens
In the homology studies performed with the various nematode antigens, no complete overlap was found for any of our predicted epitopes, suggesting that a reformulated vaccine with our predicted epitopes will have the least chance of cross-reaction with parasitic antigens (Breloer and Hartmann 2023).
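What "no complete overlap" means can be made concrete with a small sketch. It is a simplified stand-in for the ClustalW-based comparison: an epitope overlaps completely if it occurs as an exact substring of the antigen, and the longest shared contiguous run gives a rough measure of partial overlap. The function names and the toy antigen sequence are hypothetical.

```python
def has_complete_overlap(epitope, antigen):
    """True if the whole epitope occurs contiguously in the antigen."""
    return epitope in antigen


def longest_shared_run(epitope, antigen):
    """Length of the longest contiguous substring common to both
    sequences (dynamic programming over matching suffix lengths)."""
    best = 0
    prev = [0] * (len(antigen) + 1)
    for a in epitope:
        curr = [0] * (len(antigen) + 1)
        for j, b in enumerate(antigen, start=1):
            if a == b:
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best


# Hypothetical antigen fragment, not a real nematode sequence.
toy_antigen = "MKLCVNGAAFTVQ"
epitope = "CVNGSCFTV"
complete = has_complete_overlap(epitope, toy_antigen)
shared = longest_shared_run(epitope, toy_antigen)
print(complete, shared)  # False 4
```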

Discussion
Influenza has been a globe-threatening infectious disease that is still a concern for epidemiological, social, and biological scientists (Kim et al. 2022). Even though many trivalent vaccines and anti-viral drugs exist, the Flu virus, by virtue of antigenic drift and shift, continues to circulate in human and animal populations, which results in the seasonal evolution of influenza viruses (Kim et al. 2022; Petrova and Russell 2018). Predominantly, the surface proteins of the virus, namely Haemagglutinin (HA) and Neuraminidase (NA), which play a pivotal role in pathogenesis, are the focal point for vaccine design. In our study, we exploit the conserved regions of these proteins across the flu seasons to identify potential epitopes using neural network models incorporating more features for robust prediction. This approach could be a potential tool for proposing universal targets for vaccine development in Influenza from conserved regions. Discovering such a universal solution for these recurring episodes of flu via peptide vaccines could save future generations from serious ailments.
Our study provides such a possible solution through peptide vaccine design via the development of an epitope prediction tool. The current servers for epitope peptide prediction mostly focus on the binding affinity of the peptides to the MHC complexes and are not customized for specific viral infections (Ras-Carmona et al. 2021; Desta et al. 2023). Our tool for epitope prediction is highly specific for Influenza A and carries the additional feature of predicting the potential of a peptide to be a putative vaccine candidate for future outbreaks. These short-listed epitopes could be further validated by additional biological screening assays on clinical samples to develop a successful vaccine. Neuraminidase, being the protein of choice for the vaccine design, has recently attracted interest worldwide due to its highly conserved nature and vaccine effectiveness over the other proteins. We came up with this deep neural network model because of the high demand for fast and accurate solutions that enable effective Flu vaccine target epitope optimization and prediction instead of time-consuming simulation design and experimental methods. This tool was developed from B-cell and T-cell epitope training sets that were curated independently, incorporating validation features for vaccine effectiveness. The high accuracy of 99.5% for the CD8+ CTL model and 99% for the CD4+ HTL model is attributable to the high antigenicity shown by the Neuraminidase strains and the stability of the 9-mer and 15-mer peptides, based on strong physico-chemical parameters such as binding affinity and immunogenicity. These peptides will also effectively complement the existing resources for epitopes across populations owing to their wide allelic coverage, as observed from the IEDB resource results. The high repeatedness of epitope scores among both sets also strengthens the case that Neuraminidase could be a sole active target for Influenza vaccines in future. The Precision and Recall value of 1.0 of the models also supports the robustness of the model, with no or few False Positives (FP) and False Negatives (FN). The validation was done with a set developed from experimentally proven vaccine epitopes from the literature (Kim et al. 2022; Lee et al. 2020). The strong correlation in the average binding score demonstrated by the experimentally proven vaccine epitopes supports the robust design of the ML model. In addition, mutational analysis of the Neuraminidase protein showed that most of the epitopes predicted by the ML model lie in highly conserved regions of the protein. Therefore, the epitopes predicted by our ML model are expected to withstand seasonal epidemics and may provide long-term immune resistance. In addition, the predicted epitopes did not overlap with the parasitic antigens, which should help prevent cross-reactivity during vaccine administration. Interestingly, there are reports of diminished immune responses due to helminth infections in people living in areas endemic for infectious nematodes (Breloer and Hartmann 2023). Our homology studies on selected nematode antigens showed no complete overlap with our predicted epitopes, suggesting that our tool could provide putative epitopes for designing future Influenza vaccines that are more effective in helminth-endemic regions. Overall, this ML model helps bridge the real-world gap between vaccine development and its clinical application.

Conclusions
The highly efficient model developed and presented here, built with pan-data collected from across the globe, would be the first-of-its-kind machine/deep learning algorithm that can predict probable vaccine candidates even with minimal peptide data. This tool could independently complement, and be mutually cross-curated with, the majority of other models, which otherwise predict only MHC-binding affinity. Our prediction tool has a limitation in screening lengthy input protein or peptide sequences but works efficiently for independent, short peptide fragments. The putative epitopes predicted separately could subsequently be concatenated to enable faster vaccine design. Moreover, recent research on Flu indicating the potency of T-cell epitope vaccines in providing a stable solution for viral drift makes this study and model even more worthwhile. The cross-validation of the results obtained from the test set of peptides was in strong correlation with the IEDB Epitope analysis resource, a renowned tool for epitope prediction, which suggests the vigour of our model. Additionally, our model stands apart from other existing models in being well suited to overcoming antigenic drifts in the future, since the algorithm is built on the most conserved protein of the influenza virus, Neuraminidase. The cross-reactivity analysis with the nematode antigens has provided a novel insight into our predicted epitopes when considered for vaccine reformulation in endemic areas with neglected nematode infectious diseases (Breloer and Hartmann 2023). We thereby suggest that InflANNet would prove beneficial to the overall scientific community and could be employed as a kick-start tool in developing strategies for future clinical influenza vaccine experiments.
Fig. 1 A schematic representation of the neural network-based prediction tool, InflANNet

Fig. 3 Accuracy-loss plots and ROC curves for original models

Fig. 4 Comparison of various feature combination models

Fig. 9 A cartoon representation of the superposition of the mutated model with the native structure of Neuraminidase showing an RMSD score of 0.237, created using the PyMOL molecular visualizer (Yuan et al. 2017). The mutated model is represented in magenta and the native structure in cyan, while the mutated residues V241I are represented as spheres in red

Table 1
No. of viral strains of Neuraminidase protein retrieved

Table 2
Search set parameters for CTL and HTL peptides from ProtParam server

Table 3
Highly repeated epitopes from pMHC I binding

Table 4
Highly repeated epitopes from pMHCII binding

Table 7
A tabular column representing the various epitope sequences with their binding activity for MHC-I and MHC-II alleles. A high binding score corresponds to effective binding with MHC-I alleles and a low adjusted rank corresponds to effective binding with MHC-II alleles. For each epitope, the binding scores and the adjusted ranks were taken as an average of 27 different alleles for MHC-I and MHC-II separately