QSAR, molecular docking, and design of novel 4-(N,N-diarylmethyl amines) Furan-2(5H)-one derivatives as insecticides against Aphis craccivora

Aphis craccivora has many plant hosts, though it seemingly forechoice to groups of bean family. Other plants it hosts are families of Solanaceae, Rosaceae, Malvaceae, Chenopodiaceae, Caryophyllaceae, Ranunculaceae, Cucurbitaceae, Brassicaceae, and Asteraceae. A computational study was carried out on a series of twenty compounds of novel 4-(N,N-diarylmethylamines) furan-2(5H)-one derivatives against Aphis craccivora insect. Optimization of the compounds was performed with the aid of Spartan 14 software using DFT/B3LYP/6-31G** quantum mechanical method. Using PaDel descriptor software to calculate the descriptors, Generic Function Approximation (GFA) was employed to generate the model. Model 1 found to be the optimal out of four models generated which has the following statistical parameters; R2 = 0.871489, R2adj = 0.83644, cross-validated R2 = 0.790821, and external R2 = 0.550768. Molecular docking study occurred between the compounds and the complex crystal structure of the acetylcholine (protein AChBP) (PDB CODE 2zju) in which compound 13 was identified to have the highest binding energy of − 8.4 kcalmol−1. Statistical analyses, such as variance inflation factor, mean effect, and the applicability domain, were conducted on the model. This compound has a strong affinity with the macromolecular target point of the A. craccivora (2zju) producing H-bond and as well the hydrophobic interaction at the target point of amino acid residue. Molecular docking gave an insight into the structure-based design of the new compounds with better activity against A. craccivora in which three compounds A, B, and C were designed and discovered to be of high quality and have greater binding affinity compared to the one obtained from the literature. The QSAR model was generated by the employment of Genetic Function Approximation (GFA). The model was found to be robust and possessed a good statistical parameter. Furthermore, a molecular docking study was performed to get an idea for structure-based design in which three (3) compounds A, B, and C were designed and were found to be more active than the template (compound 13, i.e., the one with highest docking score). QSAR model was developed to give an insight into the ligand/template-based design of computer-aided drug design.


Background
Aphis craccivora, known as cowpea aphid, peanut (groundnut) aphid, or black legume aphid, is one of the most dangerous agricultural pests which directly causes harm to plants by delaying and deforming the growth of the plants through malnutrition. The molasses manufactured by the vector are placed on plants and stimulate the growth of molds with soot that limits photosynthesis. Aphis craccivora has many plant hosts, though it seemingly forechoice to groups of bean family. Other plants it hosts are families of Solanaceae, Rosaceae, Malvaceae, Chenopodiaceae, Caryophyllaceae, Ranunculaceae, Cucurbitaceae, Brassicaceae, and Asteraceae. Aphid is a vector of a series of plant viruses that include peanut rosette virus, groundnut mottle virus, mosaic virus common bean, alfalfa mosaic virus, and cucumber mosaic virus (Wikipedia, 2018).
Furan-and amide-containing compounds are among the molecular structures found to have an extensively wide range of applications in the field of medicine and agrochemical due to their extensive range of biological activity like antimicrobial/anti-inflammatory activity (Huczyński et al. 2012;Ravindra et al. 2006;Özden et al. 2005), as antibiotic activity (Arjona et al. 1999), as pesticides such antifungal (Yao et al. 2017), and insecticides (Teixeira et al. 2015;Wang et al. 2013) among others. The furan-2(5H)-one was examined to be a potential inhibitor of nicotinic acetylcholine receptor (Tian et al., 2019), and this was the reason for the docking study on the crystal structure of this protein.
The quantitative analysis of the structure-activity relationship (QSAR) is among the most efficient ways to optimize the main compounds and design new compounds. QSAR can be used to predict bioactivities, like toxicity, carcinogenicity, and mutagenicity, depending on the structural characteristics of the molecules and the actual mathematical models. Nowadays, one can easily and accurately calculate quantum chemical parameters of the compounds due to fast development in computer technology as well as theoretical quantum chemical study which helps in predicting the new compounds with better activity than the existing ones. This quantum chemical calculation is extensively applied while forming the QSAR models (Gagic et al. 2016). Molecular docking helps to investigate the capacity of the prepared compounds toward the interaction with the protein residue of the target organism and to also predict the preferred orientation of the molecules.
The objective of the research is to discover a new model that predicts the activity of chemical products with better activity capable of destroying Aphis craccivora using Genetic Function Approximation (GFA) or molecular docking techniques.

Dataset
In this work, we used a dataset of 20 compounds to design a relation between the chemical traces of compounds and their insecticidal activity. These 20 compounds of novel4-(N,N-diarylmethylamines) furan-2(5H)-one derivatives were obtained from the literature (Tian et al., 2019). The logarithm of the measured LC 50 (μg mL −1 ) against  insecticidal activity given by p LC 50 (p LC 50 = −log 1/ LC 50 ) was taken as a dependent parameter; therefore, the data was linearly correlated with the independent parameter/descriptors (Edache et al. 2017).

Optimization/molecular descriptor calculation
The database (see Fig. 1 and Table 1) was optimized at a density function theory level using the "Becke's threeparameter read-Yang-Parr hybrid" (B3LYP) function together with "6-31G**" basis set of Spartan14 software (Arthur et al. 2016a). Graphical-user-interface of Spar-tan14 was utilized in drawing the 2D molecular structures of the dataset which were later exported in the form of 3D. The optimized structures were then taken to PaDel descriptor software to calculate the quantum molecular descriptors (Yap 2011).

Data division
To get a validated model, the dataset was split into (3:1) train test sets. Accordingly, the split was done in such a way that the compounds forming the train (70% of the data) and the test sets (30% of the data) are shared within an entire descriptive space filled by the complete dataset as described by Kennard-Stone Algorithm method (Arthur et al. 2016b). The generated molecular descriptors were taken for regression analysis, with experimental activities as dependent parameters where the molecular descriptors served as independent parameters. With the Genetic Function Approximation method (GFA) incorporated in "Material Studio 2017" software, the compounds of train sets were utilized to develop the QSAR model. Four QSAR models were built where the best model was chosen according to the one with the lowest score of lack of fit (LOF) given as follows: where SSE represents the sum of squares of errors, d is a smoothing parameter defined by the user, c = number of terms a model possessed in addition to the constant term, M is equal to the number of samples present in the training set, and p = overall number of descriptors present in all terms of the model excluding the constant term (Edache et al. 2015).

Internal validation
The generated model was validated internally by the following parameters: (a) The correlation coefficient (R 2 ): explain the division of overall variation ascribed to the built model. The    accepted value of R 2 ranges from 0.5 to <1 and more the value of R 2 and the model considered to be a better model as R 2 approaches 1.0, though there are other analyses that the model passed to be a better one. Being the most common internal validation pointer, R 2 is expressed as follows: where Y expt , Y predt , and Y train represent the experimental, predictive, and average activities of the training set (Adeniji et al. 2018).
(b) Adjusted R 2 : The value of R 2 is inconsistent to evaluate the power of the built model. Thus, R 2 is adjusted to restore and stabilize the model. This adjusted R 2 is defined in equation iii as: where p presents "the number of descriptors constituted the model," while n = number of training set compounds (Ibrahim et al. 2018a).
(c) Cross-validated R 2 : The validity of the models was identified by a cross-validation test measured by predictive Q 2 cv. For a "leave one out (LOO) cross-validation," a data point is eliminated (left-out) in the set and the model is readjusted; the predicted value of the eliminated data point is compared to its real value. This is repeated until each data removed. We can then calculate the value of Q 2 using the sum of the squares of these elimination residues as in the equation below: where Y expt , Y predt , and Y train represent the experimental, predictive, and average activities of the training set (Adedirin et al. 2018).

External validation
The prediction ability of the model was examined by an external validation through the ability of the model to predict the activity values of the test set compounds as well as its application in the calculating the predicted value of R 2 pred according to the equation below: where Y predt and Y expt are the test set's experimental and predicted activities while Y train indicates the average activities of the training set (Edache et al., 2017).

Statistical analysis of the descriptors Variance inflation factor (VIF)
VIF is defined as the measure of multicollinearity amongst the independent variables (i.e., descriptors). It quantifies the extent of correlation between one predictor and the other predictors in a model.
where R 2 gives multiple correlation coefficient between the variables within the model. If the VIF is equal to 1, it  Table 5 Continuation of external validation means there is no intercorrelation in each variable, and if it ranges from 1 to 5, then it is said to be suitable and acceptable. But if the VIF turns out to be greater than 10, this indicates the instability of the model and need to be reexamined (Pourbasheer et al. 2015;Karthikeyan et al. 2009).

Mean effect (ME)
The average effect (mean effect) correlates the effect or influence of given molecular descriptors to the activity of the compounds that made up the model. The sign of descriptors shows the direction of their deviation toward the activity of compounds. That is an increase or decrease in the value of the descriptors will improve the activity of the compounds. The mean effect is defined by the following: where B j and D j are the j-descriptor coefficient in a model and the values of each descriptor in training set, while m and n stand for the number of molecular descriptors as well as the number of compounds in the training set. To evaluate the significance of the model, the ME of all the descriptors was calculated (Edache et al. 2015).

Applicability domain
To confirm the reliability of the model and to examine the outliers as well as the influential compounds, it is very important to evaluate its domain of applicability. It aimed at predicting the uncertainty of a compound depends on its similarities to the compounds used in building the model and also the distance between the train and test set of the compounds. This can be achieved by employing William's plot which was plotted using standardized residuals versus the leverages. The leverages for a particular chemical compound are given as follows: where h i = leverage for a particular compound, Z i = matrix i of the training set. Z = nxk descriptor-matrix for a training set compounds. Z T = transpose of Z matrix. The warning leverage (h*) that is the boundary for usual values of Z outliers is given as follows: where n = number of compounds in the training set whereas p gives the number of descriptors present in the model (Ibrahim et al. 2018a).

Ligand and receptor preparation
From the RCSBPDB (www.rcsb.org), the PDB format of the receptor was successfully downloaded. This was then taken to the discovery studio for an appropriate preparation where all the residues associated with it such as a ligand, water molecules, and other traces associated with the receptor were removed. The ligands (the optimized compounds) which were in the SDF file were transformed into the PDB file format. Figures 2 and 3 showed the prepared receptor and ligand (Ibrahim et al. 2018b).

QSAR model
Genetic function algorithm (GFA) was used to generate the three QSAR models which predicted the activity of the compounds. The first model was chosen as the optimal model due to its statistical significance was presented in equation (x) below: All the validation/statistical parameters that signified the stability, robustness, and the prediction capability of the model were presented in Table 2.
The name, symbol, and class of the three selected descriptors that made up the model are presented in Table 3.
Since the selected model is internally valid, then an external validation is the next step. Tables 4 and 5 represent the external validation as well as the calculation of predicted R 2 of the best-chosen model.
The experimental, predictive, and residual activity for both training sets and test sets are shown in Table 6. This table is represented to show the robustness of the model considering the lower residual values.

Statistical analyses
Statistical analyses on the model's descriptors are very necessary in order to know how related they are. For that, Pearson's correlation, variance inflation factor, mean effect (which contains regression analysis), and applicability domain were carried out.
Pearson's correlation (Tables 7 and 8) Figures 4 and 5 are the plot of predicted activity against the experimental activity (pLC 50 ) and plot of standardized residual against the experimental activity (pLC 50 ).

Applicability domain
A graph of leverages of each compound of dataset versus their standardized residuals terms William's plot was presented in the Fig. 6 below.

Docking studies
Molecular docking analysis was carried out between the ligands (compounds) and the receptor to evaluate the binding affinity at the ligand-receptor interface.

Result of the design
The structure of the three (3) compounds which were designed using an optimization method of structurebased design. The structure of the chosen scaffold (compound 13) was presented in Fig. 9 below.

QSAR model
The QSAR examination was carried out to relate the structure-activity relationship of novel 4-(N,N-diarylmethylamines) furan-2(5H)-one derivatives as a potent inhibitor of Aphis craccivora. Three descriptors were utilized in constructing the QSAR model which predicted the activity of the compounds based on the genetic function algorithm (GFA). The first model was chosen as the optimal model due to its statistical significance. The bestchosen model constructed was presented in equation.
All the validation/statistical parameters that signified the stability, robustness, and the prediction capability of  Table 3 below. The fact that 2D and 3D descriptors are present in the model implies that the descriptors used in the model can determine a better insecticidal activity of the compounds. The individual capability and inducing power of the selected descriptors toward the activity of the compounds depend on their values, signs, and as well their mean effects. Tables 4 and 5 represent the external validation as well as the calculation of predicted R 2 of the best-chosen model.

Statistical analysis
The experimental, predictive, and residual activity for both training sets and test sets are shown in Table 6. The residual value is the difference between the predicted and actual activity.
To evaluate the relationships between each descriptor used in the built model, Pearson's correlation was carried out on the values of the model's descriptors and the results were presented in Table 7. The results show that the descriptors are not significantly inter-correlated for the fact that none of their correlation coefficients are up to 0.5, and this indicates the robustness as well as the stability of the built model. The variance inflation factor (VIF) values for each of the three descriptors were not up to 2, which indicates that the descriptors and the model are stable and accepted. Table 8 showed the standard regression coefficients "bj," the values of mean effect (ME), and confidence interval (p values). These give vital information on the impact and contribution of the descriptors toward the built model. The individual capability and inducing power of the selected descriptors toward the activity of the compounds depend on their values, signs, and as well their mean effects. The p values of the three descriptors (at 95% c.l.) that made up the model are all < 0.05; this implies that there is a significant relationship among these descriptors (as contrary to the null hypothesis) and the inhibitory concentration of the compounds. Figure 5 which presented a graph of observed activity versus standardized residual shows a random dispersion at the baseline where the standardized residual is zero. Therefore, no systematic error occurred in the built model.
A graph of leverages of each compound of dataset versus their standardized residual terms William's plot was plotted to discover the outliers as well as the chemical influential values of the model. The domain of applicability was established within a box at ± 3.0 limit for the residuals and leverage threshold h* (where h* calculated to be 0.8). The result indicates that except three compounds from the test set, all the molecules in the dataset are within the box of the applicability domain of the model. This may be characterized by their clear differences in chemical structures by considering the rest of the compounds highlighted in the dataset.

Molecular docking study
Due to unavailability of the crystal structure of Aphis craccivora, the complex crystal structure of the acetylcholine (protein AChBP) from Lymnaea stagnalis and imidacloprid with PDB code 2ZJU was utilized for the docking analysis since it possess "high homology extracellular domain" of the A. craccivora protein (nAChR) which has been used for many docking studies involving A. craccivora. The docking studies were performed between 2ZJU and the ligands (compounds) of novel4-(N, N-diarylmethylamines) furan-2(5H)-one derivatives to investigate the binding energy of the compounds to the target site of the insect. The ligands show a good interaction with the active site of the Aphis craccivora that is to say they inhibit the activity of the insect. Some ligands show high binding energy that varies from − 7.9 to − 8.4 kcalmol −1 as presented in Table 9. However, compound 13 with the highest binding score (− 8.4 kcal/mol) possessed an interaction mode with H-bond of ARG137 and 2.60716 bond length and hydrophobic interaction of TYR89, TYR89, ASN90, VAL183, TRP53, TYR89, and TYR185. The interaction between the compound with highest binding energy and the binding pocket of the receptor is shown in Fig. 7 while Figs. 8 and 9 is the 2D hydrogen bond interaction of compound 13 with the receptor.

Conclusion
This research involves a QSAR and molecular docking studies on 20 compounds of novel 4-(N,N-diarylmethylamines) furan-2(5H)-one derivatives against Aphis craccivora. Using DFT for molecule optimization, Genetic Function Approximation (GFA) was employed in generating the built model. Out of three models built, the first model was identified to be the optimal constituted with good statistical parameters such as R 2 = 0.871489, R 2 adj = 0.83644, cross-validated R 2 = 0.790821, and external R 2 = 0.550768. A decrease in negative coefficient descriptors (like ATSc4 and Weta3.polar) and an increase in positive coefficients descriptors (like nCl) will improve the activities of the compounds against A. craccivora. According to the docking scores, most of the ligands (compounds) show good inhibitory activity against A. craccivora protein. However, ligands 13 showed a higher binding affinity of -8.4 kcal/mol. This compound has a strong affinity with the macromolecular target point of the A. craccivora (2zju) producing H-bond and as well the hydrophobic interaction at the target point of amino acid residue. Molecular docking gave an insight into the structure-based design of the new compounds with better activity against A. craccivora in which three compounds A, B, and C were designed and discovered to be of high quality and have greater binding affinity compared to the one obtained from the literature. Author's contributions YI: contributed throughout the research work. AU: gives directives and technical advices. SU: partake in technical activities. MTI: partake in technical activities. ABU: partake in technical activities. All authors have read and approved the final manuscript.

Funding
No fund was collected on this research work.

Availability of data and materials
The section is not applicable to this research work

Ethics approval and consent to participate
The section is not applicable to this research work

Consent for publication
The section is not applicable to this research work