Quantitative structure-activity relationship (QSAR) modelling study of some novel carboxamide series as new anti-tubercular agents

Background: QSAR modelling was performed on thirty-five (35) newly discovered compounds of N-(2-phenoxy) ethyl imidazo[1,2-a] pyridine-3-carboxamide (IPA) to predict their biological activities against Mycobacterium tuberculosis (MTB-H37Rv strain) by using some numerical data derived from structural and chemical features (descriptors) of the compounds. Results: At first, the structure of the compounds was accurately drawn and optimized using the Spartan 14 software at DFT level of theory with B3LYP/6-31G** basis set in a vacuum. The diverse chemometric descriptors were computed from the optimized structures using the PaDEL descriptors software, and the division of the dataset into training and test sets was done based on Kennard-Stone’s algorithm. Five (5) models were generated from the training set using genetic function approximation, and model 1 was chosen as the best due to its robust internal and external validation metrics (Rtrain = 0.8563, R 2 adjusted = 0.8185, PRESS = 3.5724, average R 2 m (LOO-train) = 0.6751, Q 2 cv = 0.7534, R 2 pred 1⁄4 0.7543, Rtest = 0.6993) which passed the model criteria of acceptability. 6-Bromo-N-(2-(4-bromophenoxy) ethyl)-2ethylimidazo[1,2-a] pyridine-3-carboxamide (compound 13) was used as the structural template for the in silico design due to its high pMIC, and it is within the model’s chemical space. Conclusion: Based on the information obtained from model 1, six (6) designed compounds with higher antitubercular activity were obtained. Furthermore, the ADME and drug-likeness prediction of the designed molecules showed good pharmacokinetic properties which indicate the application prospect of these compounds as novel MTB-H37Rv inhibitors. This research could help the medicinal chemists and pharmaceutical practitioners in future designing and development of more potent drug candidates.

Keywords: QSAR, Template, Docking, Hydrogen bonding, Genetic algorithm, Multi-linear regression, Model, Descriptors, Leave-one out Background Mycobacterium tuberculosis (MTB) is the bacterium that causes one of the world's most deadly respiratory communicable diseases called tuberculosis (TB). It was among the ranked top 10 deadliest diseases caused by a single infectious agent (above HIV/AIDS) (Zhai et al. 2019). In recent times, the number of persons receiving life-saving treatment for TB in 2018 has tremendously increased due to enhanced detection and diagnosis (Mabhula and Singh 2019). Nigeria is among the top seven (7) countries that account for 64% of the total burden of tuberculosis worldwide (Ogbuabor and Onwujekwe 2019). The Philippines Department of Health (DOH) and the World Health Organization (WHO) jointly called for an all-out-war against tuberculosis (TB) in early 2019, because it is regarded as the number one infectious killer in the country. According to their report, TB claims the lives of over 70 Filipinos every day. There are also one million Filipinos who have active TB disease, the third-highest global prevalence rate next to South Africa and Lesotho (World Health Organization (WHO) 2019). Similarly, in the global tuberculosis report of WHO (2019), data were reported by 202 countries and territories that account for more than 99% of the world's population and estimated number of TB cases (World Health Organization (WHO) 2019). The existence of extensively drug-resistant (XDR) and the evolution of multidrug-resistant (MDR) TB have attracted the attention of drug scientists who are in search of novel anti-tubercular agents with better bioactivities . Researches have shown that imidazo[1,2-a] pyridine-3-carboxamides (IPA) as an anti-tubercular candidate is currently in the second phase of clinical trials, and it was reported to have resilient inhibitory potency or anti-mycobacterial activity . It was established that the development of more potent compounds with improved bioactivities is very costly and time-consuming (Adeniji et al. 2019). In recent decades, computational chemistry techniques such as computer-aided drug design (CADD) might save the time of discovering new compounds which also reduce the cost of synthesis (Abdullahi et al. n.d.). The quantitative structure-activity relationship (QSAR) technique provides a mathematical model containing some structural features represented as numerical data which predicts the response properties of the compound such as activity, toxicity, and so on . The ultimate goal of this study was to derive a robust QSAR model from the structures of some synthesized IPAs compounds which predicts their biological activities against Mycobacterium tuberculosis (MTB-H37Rv strain), then utilized the model to design new compounds with improved activity.

Dataset collection and optimization
Thirty-five (35) compounds were selected from the newly discovered and synthesized series of N-(2-phenoxy) ethyl imidazo[1,2-a] pyridine-3-carboxamide (IPA) as antitubercular agents . The compound's response against the Mycobacterium tuberculosis (MTB-H37Rv) was measured as minimum inhibitory concentration (MIC) which is the lowest concentration affecting a decrease in fluorescence of greater than 90% relative to the mean of replicate bacterium-only controls in microgram per milliliter . These values were converted into logarithmic MIC values (pMIC) in order to reduce skewness using Eq. 1 where Mw is the molar weight of the compound in grams per mole and MIC is the minimum inhibitory concentration of the compound. The IPAs core structure and the substitution arrangement of the compounds based on R 1 , R 2 , and R 3 along with their anti-tubercular activities were presented in Table 1. The molecular structure of the IPAs showed above were accurately drawn using the ChemDraw Ultra level software V12.0.2, then saved in (*cdx) format. Consequently, the drawn compounds were exported to the Spartan 14 wave function program for equilibrium geometry optimization at ground state with density functional theory calculations (DFT/B3LYP/6-31G**) in a vacuum, starting from the initial molecular geometry (Adedirin et al. 2018a). In principle, geometry optimization is an iterative process whereby the energy and its first derivative with respect to all geometrical coordinates are calculated from a guess geometry, then used the information to project new geometry (Adedirin et al. 2018a). Thus, the process continues until the lowest energy or optimized structure of the molecule is achieved.

Descriptors computation
The thirty-five (35) optimized structures of IPAs from Spartan 14 were accordingly saved as SD file format, then exported to the PaDEL descriptors software which is a product of Pharmaceutical Data Exploration Laboratory, created by Yap Chun Wai (Sanyal et al. 2019). This software allows QSAR users to compute diverse molecular descriptors and fingerprints of a molecule, including electrostatic, topological, spatial, autocorrelation, geometrical, constitutional, and thermodynamic descriptors (Abdullahi et al. 2018).

Data pretreatment and division
PaDEL descriptors output in MS Excel sheet was subjected to a variable reduction method so as to eliminate constant and highly inter-correlated descriptors based on user-specified variance and correlation coefficient cut-off value using Data Pre-treatment GUI 1.2, downloaded from Drug Theoretics and Cheminformatics (DTC) Laboratory. In order to have a rational selection of training set and test set, the Dataset Division GUI 1.0 software was used by engaging Kennard-Stone's algorithm division technique (Adeniji et al. 2019).

Internal validation
The training set compounds were used to build the five (5) multi-linear regression (MLR) models using Material Studio Software (Version 8.0) based on genetic function approximation (GFA) as the variable selection technique where the dependent variable is the logarithmic values of the minimum inhibitory concentration (pMIC) and the independent variables are the descriptors generated from PaDEL program (Tropsha 2010). Numerous internal validation metrics of the models were also generated using MLR-plus validation program such as: a. Friedman lack-of-fit parameter (LOF) from the material studio is defined as where a = number of the terms in the model, b = scaled smoothing factor, c = corresponds to the entire number of descriptors in the model, j = total number of compounds in the training set, and β = a safety factor with a value of 0.99 which guarantee that the denominator of the equation can never be equal to zero.
b. Cross-validated parameter (Q 2 cv ) Y tr = average observed concentrations of the training set, Y = observed concentration, and Y pred = predicted concentration in the training set, respectively.
c. Regression coefficient squared (R 2 train and test ) Y pred and Y were predicted and observed training set concentration (experimental), respectively. Y tr and Y pred were the average observed (experimental) and predicted training set response, respectively d. Coefficient of determination adjusted (R 2 adjusted ) where p = number of the descriptor in the model, n = number of compounds in the training set, R 2 is the correlation coefficient, and n − 1 − p is the degree of freedom.
e. Variance ratio F (Fischer's value) where r 0 2 m and r 2 m represent the reverse and modified square of correlation coefficient computed according to the expressions below: where r 2 and r 2 o represent the correlation coefficients of the plot of observed against predicted training set concentrations with and without intercept, respectively. r 0 2 0 is a squared correlation coefficient of the plot of predicted versus observed training set response without intercept (Veerasamy et al. 2011).
h. Delta modified square of correlation coefficient (Δr 2 m ) where cR 2 p = coefficient of determination, R = correlation of coefficient, and Rr = average "R" of random models.

External validation
The QSAR models predictive competency were examined by using independent test set compounds for external validation, and the metrics proposed by Golbraikh and Tropsha (Tropsha 2010) were also computed using MLRplusValidation 1.3 program as follows: i. Predicted determination coefficient for test set data R 2 Pred expressed as: where Ypred test and Y test are the predicted and observed concentration of test set compounds respectively. Y tr ¼ average values of observed concentration of the training set compounds. ii.
where r 2 = squared correlation coefficient between the observed and predicted activities with intercept and r 2 0 = squared correlation coefficient between the predicted and observed concentration without intercept (Tropsha 2010).
iii. Root mean square error of prediction RMSEP is defined as: where Y (test) and Y pred(test) are the observed and predicted test set concentrations, respectively.
iv. The slope of the plot of observed against the predicted concentration of test set compounds without intercept (k) and plot of predicted against the observed concentration of test set compounds without intercept (k ′ ) are expressed as: However, an acceptable and predictive QSAR model should have 0.85 < k < 1.15 or 0.85 < k′ < 1.15 (Tropsha 2010) Applicability domain (AD) The applicability domain (AD) of the developed model is defined as the chemical space of compound structure and response where the model predictions are highly reliable (Veerasamy et al. 2011). This technique is used to detect the presence of response and structural outliers in the test set and training set compounds, respectively (Tropsha 2010;Veerasamy et al. 2011;Gramatica 2007). The leverage approach of evaluating the model's applicability domain based on models distance measure can be utilized by plotting a scatter plot of standardized residual response and leverage values (f) of both training set and test set (Williams plot). The leverage of compound (f) can be determined as follows: where x is the model's descriptors matrix, x T represents the transpose matrix x, and F is the diagonal element of the hat matrix. In this study, AD of the QSAR model was evaluated as the square area with vertical boundary 0 < f i < f* and horizontal boundary − 3 < standardized residual < 3, where f i is leverage values of compounds and f* is the threshold leverage expressed as: where p is the number of molecules in the training set and m is the number of molecular descriptors used in the model. In addition, compounds with higher leverage scores which are greater than threshold leverage (f i > f*) tend to have unreliable predictions. However, compounds whose leverage scores are less than the threshold score (f i < f*) and the standardized residuals are not greater than ± 3α (3 standard deviation units) are said to fall within the applicability domain (Adedirin et al. 2018b). Similarly, the Euclidean approach of the applicability domain was also determined based on mean distance scores computed by the euclidean distance. As such, the Uzairu plot was determined by plotting the standardized residuals against normalized mean distance scores whose ranges are from 0 to 1 (Arthur et al. 2018). The normalized mean distance score for training set ranges from 0 which is for least diverse and 1 which is for the most diverse training set. However, the normalized mean distance scores for test compounds with scores outside 0 to 1 are regarded as outliers which are outside the applicability domain (Arthur et al. 2018) In silico ADME prediction The designed hypothetical molecules in SMILES format (Simplified Molecular Input Line Entry System) were exported to SwissADME online webserver to predict their absorption, distribution, metabolism, and excretion (ADME) properties (Pan et al. 2019).

QSAR modelling analysis
The logic behind the development of a QSAR model is to arrive at relevant molecular descriptors that describe changes in the structural features of a compound. Molecular descriptors of all compounds in this study were generated using the PaDEL software as mentioned earlier. A total sum of 625 diverse descriptors was generated in MS Excel (.csv) format, and the result was exported to the DTC lab software for the pretreatment and division. In the data pretreatment process, non-informative and highly inter-correlated descriptors with correlation cutoff greater than 0.8 were removed, which reduces about 24.32% of the total descriptors computed by the PaDEL program amounting to 152 descriptors. The pretreated data were divided into the training set and test set based on Kennard-Stone permutation, where 70% of the dataset (25 compounds) are the training set and the remaining 30% (10 compounds) are the test set. Based on the genetic function approximation of the descriptors from the Material Studio software, five (5) multilinear regression models (Eqs.19, 20, 21, 22 and 23) were  developed containing five (5) optimum descriptors, and model 1 was selected as the best model due to its statistical significance of the internal and external validation metrics. The experimental pMIC reported in the literature, the predicted pMIC computed by model 1 for all the 35 anti-tubercular agents, the residual scores, and the leverage values are shown in Table 1. The residual score is the difference between the experimental pMIC and predicted pMIC, and the lower residual score signifies that the developed model has good predictive potentials. The internal validation parameters of the models generated are presented in Table 2, and model 1 revealed the most significant descriptors. The external validations of the models generated are shown in Table 3, and model 1 has also passed the external validation metrics proposed by Golbraikh and Tropsha including the error-based judgment of test set compounds (Tropsha 2010). In addition, the validation metrics reported in this study are in agreement with the metrics from literature and for the purpose of reproducibility; all the computed descriptors for both the training and test set in model 1 are reported in Tables 4 and 5, respectively. Table 6 provides a comprehensive description of the molecular descriptors in the model 1. Furthermore, the model showed a positive contribution of MATS5p, GATS6c, and AATS1i descriptors, while a negative contribution for descriptors ATSC4c and ATSC3v, respectively. This means that the increment in the magnitude of MATS5p, GATS6c, and AATS1i descriptors will positively influence the prediction of pMIC with the negative influence of ATSC4c and ATSC3v descriptors. However, the MATS5p descriptor has the highest contribution which is the most significant descriptor to be considered for designing new hypothetical compounds. The regression coefficient squared (R 2 train = 0.8536 and R 2 test = 0.7543) indicates good extrapolation between the training set and test set. In addition, the models generated are robust due to the small differences in R 2 and Q 2 cv (< 0.3). The regression statistics (Table 7) show P value and t values of the model which suggests that the coefficients of the descriptors are statistically significant at a 95% confidence level. Furthermore, inter-correlated descriptors in model 1 were assessed based on their multicollinearity computed as the variation inflation factor (VIF): where R 2 represents the correlation coefficient. VIF values corresponding to unity depict no inter-correlation among each variable; if the VIF scores ranging from 1 to 5, as such the model is acceptable and stable. But if the VIF scores larger than 10, it means that the model in question is unstable and unacceptable (Driouche and Messadi 2019). Table 8 shows the correlation analysis and VIF values of the descriptors in model 1. The nonexistence of inter-correlation among the descriptors could be observed between descriptors pair, and the VIF values of each descriptor do not exceed 4 depicting that the descriptors in the model are stable. The plot of predicted pMIC against experimental pMIC values is shown in Fig. 1. It could be seen that the values of the test sets are in close agreement with the training set values. The scatter plot of standardized residual against experimental pMIC values (Fig. 2) showed a random scattering of data  above and below the baseline of the standardized residual of zero which signified the non-existence of systematic error. The Williams plot (Fig. 3) which is the scatter plot of standardized residuals versus leverages revealed two (2) response outliers (compounds 22 and 25). This is because their leverage scores are greater than the threshold (f*) of 0.72 which may be due to the changes in substitution arrangement of the substituents on the parent structure. However, the remaining compounds whose leverage score less than the threshold score are within the applicability domain of square area of ± 2.5. Also, the Uzairu plot (Fig. 4) showed that all compounds fall within the chemical space of the model which confirmed its predictive capabilities.

In silico design of new compounds
In order to explore newly hypothetical compounds with improved anti-tubercular activity, the QSAR model 1 was utilized due to its robust statistical metrics as mentioned earlier. In silico screening reduced cost and time of identifying new hits or lead compounds. This is done by substitution, deletion, insertions, or addition of substituents to the scaffold (template) or lead compound. Compound 13 was chosen as the template due to its relatively higher pMIC of 7.2704, a low absolute residual score of 0.5291, and it is within the chemical space or applicability domain of the model with leverage score of 0.2905. The alteration was successfully done around its benzene ring and imidazo[1,2-a] pyridine moiety as shown in Table 9. Subsequently, modifications were done by inserting different amino-N analogs such as -N(CH 3 ) 2 , -NO 2 , -NHCH 3 , and sulfur (S) containing substituents, and -OCH 3 which enhances the molecular polarizability as suggested by the MATS5p descriptor. On this note, six (6) newly designed compounds were obtained with excellent predicted MIC value greater than the experimental MIC value for the template (7.2704) and also having leverage scores less than the threshold leverage (0.72) which indicated that the designed compounds are within the chemical space of the model used for predicting the anti-tubercular response.
In silico ADME prediction of designed compounds The ADME and drug-likeness analysis are very important in the drug discovery which helps to make a rational decision on whether inhibitors can be administered to a biological system or not (Attique et al. 2019). Furthermore, the inhibitors with poor ADME properties and high toxicity effects on the biological systems are often the major cause of most failed medicines in the clinical phase of In an effort to improve the success rate of our QSAR modelling analysis, the pharmacokinetics properties of the designed compounds were predicted and assessed using the SwissADME online software as mentioned earlier, revealing that D1, D2, D3, D4, D5, and D6 obey Linpinski's rule of five (5), which indicates the claim prospect of these compounds as novel MTB inhibitors. The Lipinski's rule of five (5) is a thumbrule for evaluating drug-likeness and to decide if an inhibitor with a certain pharmacological or biological properties would be an orally active drug in the human body (Daina et al. 2017). The rule states that a molecule or an inhibitor can be orally absorbed/active if two (2) or more of these thresholds, molecular weight (Mw) of molecule ˂ 500, octanol/water partition coefficient (iLOGP) ≤ 5, number of hydrogen bond acceptors (nHBA) ≤ 10, number of hydrogen bond donors (nHBD) ≤ 5, and topological polar surface area (TPSA ˂ 140 Å 2 ), are not violated (Daina et al. 2017). From the output of some ADME and drug-likeness properties shown in Table 10, it was observed that only D1 molecule has zero violation of the Lipinski's rule, but D2, D3, D4, D5, and D6 respectfully violated molecular weight rule.

Conclusion
In this research, chemometric modelling analysis has been thoroughly used on 35 IPA molecules as potential antitubercular agents. As such, a regression-dependent quantitative structure-activity relationship (QSAR) model was fabricated and defended with multiple statistical parameters according to Golbraikh and Tropsha (Tropsha 2010). The internal and external validation confirmed the robustness and reliability of the built QSAR model. Molecular descriptors, MATS5p, GATS6c, AATS1i, ATSC4c, and ATSC3v, from the results (model 1) are the optimum descriptors needed to predict the bioactivities of the compounds. Based on the information obtained from model 1, compound 13 was used as a template for the in silico design due to its high pMIC, and it is within the chemical space of the model. Thereafter, six (6) newly designed compounds with better anti-tubercular activity and good ADME/drug-likeness properties were obtained. According to the above work, the designed compounds have shown substantial prospective therapy against Mycobacterium tuberculosis. However, the research encouraged further experimental validation of the designed compounds against Mycobacterium tuberculosis through in vivo and in vitro considerations. Fig. 4 The scatter plot of standardized residuals and normalized mean distance (Uzairu's plot)