Modeling nanofluid viscosity: comparing models and optimizing feature selection—a novel approach

The accurate prediction of viscosity in nanofluids is essential for understanding their flow behavior and enhancing their effectiveness in different industries. This research models the viscosity of nanofluids and assesses various models through cross-validation, using the root mean squared error of the cross-validation sets as the selection criterion. Four feature selection algorithms, namely minimum redundancy maximum relevance (MRMR), F-test, RReliefF, and a novel algorithm based on physical meaning, were evaluated to identify the most influential features for viscosity prediction. The feature selection based on physical meaning yielded the best results, as outlined in this study. This methodology takes into account the physical relevance of most aspects of the nanofluid's viscosity. Cross-validation provided a robust assessment of predictive performance, and this rigorous evaluation identified the most accurate and reliable model for predicting nanofluid viscosity. The results showed that the novel feature selection algorithm outperformed the established approaches in predicting the viscosity of single material nanofluids. The proposed algorithm achieved a root mean squared error of 0.022 and an R² of 0.9941 on the validation set; on the test set, the root mean squared error was 0.0146, the mean squared error was 0.0157, and the R² was 0.9924. This research provides valuable insights into nanofluid viscosity and offers guidance on choosing the most suitable features for viscosity modeling. The study also highlights the importance of using physical meaning to select features and cross-validation to assess model performance.
The models developed in this study can be helpful in predicting nanofluid viscosity and optimizing their use in different industrial processes.


Background
Predicting viscosity in nanofluids plays a crucial role in understanding their flow behavior and optimizing their applications in various industries (Bhaumik et al. 2023; Chiniforooshan Esfahani 2023; Esfe and Arani 2018; Gholizadeh et al. 2020; Onyiriuka 2023b; Said et al. 2021; Tan et al. 2022; Yadav et al. 2020). Nanofluids, suspensions of nanoparticles in base fluids, exhibit unique rheological properties that differ from those of conventional fluids (Tan et al. 2022). Accurate prediction of the viscosity of nanofluids is essential for the efficient design and optimization of heat transfer systems, lubrication processes, and other applications.
As a critical step in the modeling process, feature selection aims to identify the most influential features contributing to nanofluids' viscosity. It involves selecting relevant input variables or features from potential predictors. This study focuses on the feature selection process for predicting the viscosity of single material nanofluids. Single material nanofluids consist of nanoparticles and a base fluid that are stably mixed. A nanofluid viscosity model provides a unique system for investigating the impact of various parameters on viscosity. By carefully selecting the appropriate features, we can uncover the underlying relationships between the composition, particle size, temperature, other factors, and the resulting viscosity of nanofluids.
The objective of this study is to investigate various feature selection methods and pinpoint the primary factors that have a significant impact on the viscosity of single material nanofluids. By utilizing physical, sophisticated statistical, and machine learning techniques, the goal is to create precise prediction models that can estimate the viscosity of nanofluids based on a chosen set of input features.
The findings of this study will contribute to a deeper understanding of the factors that govern the viscosity of nanofluids and provide valuable insights for optimizing their performance in practical applications. Moreover, the developed feature selection techniques can be applied to other nanofluid systems, enabling efficient and effective viscosity prediction models for various nanofluid applications.
Various researchers have studied this subject extensively, but the focus has mainly been on accuracy rather than generality, and on conventional feature selection. In 2020, a group of researchers, Gholizadeh, Jamei, Ahmadianfar, and Pourrajab, conducted a study on predicting the viscosity of nanofluids using the Random Forest (RF) approach (Gholizadeh et al. 2020). What was unique about their research is that they utilized the RF method to estimate this thermophysical property of nanofluids for the very first time. The study focused on five significant parameters, including volume fraction, nanoparticle size, nanoparticle density, base fluid viscosity, and temperature.
The researchers used various statistical tools to compare different correlations and found that their model was the best, with an R² of 0.9972. The next best was Nguyen's model with an R² of 0.654, followed by the Maiga et al. correlation at an R² of 0.652 (Gholizadeh et al. 2020).
It is worth noting that no validation data set was mentioned for their case. The researchers also utilized the out-of-bag error rate method to tune the number of trees and predictors of the RF model. Lastly, they applied a performance index to compare different machine learning models accurately. However, the paper did not consider the application of cross-validation in comparing models; Brownlee (2016) states that, from a machine learning viewpoint, this is an essential step in model evaluation and comparison.
It was observed from the study that the volume fraction increased viscosity while particle size decreased it. The nanoparticle volume fraction was noticed to have the most significant impact in predicting the viscosity of nanofluids, while the temperature had the least predictive impact (Gholizadeh et al. 2020). Rudyak and Minakov (2018) stated that a universal formula describing the viscosity coefficient of any nanofluid has yet to be derived. In addition, most measurements of this quantity have led to contradictory results. Einstein and other researchers, including the international nanofluid properties benchmark exercise (Buongiorno et al. 2009; Kim et al. 2009; Venerus et al. 2010), thought that the volume fraction was the sole determining factor of nanofluids' viscosity. It has now been shown that models fail to be universal because the volume fraction of the nanoparticles is not the only factor determining nanofluids' viscosity.
According to a recent study, the size and material of nanoparticles play a significant role in determining the viscosity of nanofluids. As the concentration of particles increases, the viscosity of nanofluids also increases, while an increase in particle size or temperature results in a decrease in viscosity. Additionally, the type of nanoparticle used can lead to a significant difference in viscosity. Nanofluids have been found to have higher viscosity levels than ordinary fluids with coarse dispersion (Rudyak and Minakov 2018).
The viscosity of nanofluids can be estimated using the modified quadratic form of Einstein's model for low and moderate concentrations of nanoparticles. However, the coefficients in this equation vary based on the material and size of the particles. Increasing the degree of order in a fluid leads to an increase in effective viscosity, which can be achieved by decreasing the particle size and increasing the particle concentration (Rudyak and Minakov 2018).
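For reference, the quadratic generalization referred to here is commonly written in the following form, where the coefficients a and b depend on the particle material and size; Einstein's classical dilute-suspension result corresponds to a = 2.5 and b = 0:

```latex
\mu_{nf} = \mu_{bf}\left(1 + a\,\varphi + b\,\varphi^{2}\right)
```

Here \(\mu_{nf}\) and \(\mu_{bf}\) denote the nanofluid and base fluid viscosities, and \(\varphi\) the particle volume fraction.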
Nanofluids are more ordered than base fluids, and the addition of nanoparticles helps to improve momentum transfer. Molecular dynamics suggest that nanoparticle-molecule interaction is the primary reason for increased viscosity in nanofluids. Einstein's equations do not apply to nanofluids because of assumptions such as neglecting interactions between molecules and nanoparticles, creeping flows, and very low particle Reynolds numbers. Therefore, further investigation is needed to understand the relationship between the viscosity of nanofluids and nanoparticle materials, as concluded by the study (Rudyak and Minakov 2018).

Machine learning models
In this study, several machine learning models were applied, namely: Gaussian process regression, neural networks, support vector machines, decision trees, ensembles, and linear regression.
The Gaussian process regressor uses probability distributions to model relationships between variables. The neural network learns complex patterns in data through layers of interconnected nodes. Support vector machines find a hyperplane that separates data into classes (or, in regression, fits the data within a margin). Decision trees divide data into subsets based on feature thresholds. Ensemble models combine multiple models to improve predictive accuracy and robustness. Linear regression establishes a linear relationship between features and the target (Mahesh 2020; Sarker 2021).
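The study's selection criterion, cross-validated RMSE, can be sketched as follows. This is a minimal illustration with synthetic data and an arbitrary subset of model families, assuming scikit-learn is available; it is not the paper's dataset or exact configuration.

```python
# Sketch: comparing regressors by cross-validated RMSE, the study's criterion.
# Data and model choices here are illustrative only.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 5))              # stand-ins for temperature, volume fraction, etc.
y = 0.5 * X[:, 0] + X[:, 1] ** 2 + 0.05 * rng.normal(size=200)

models = {
    "linear": LinearRegression(),
    "tree": DecisionTreeRegressor(random_state=0),
    "ensemble": RandomForestRegressor(n_estimators=100, random_state=0),
}
cv_rmse = {}
for name, model in models.items():
    # scikit-learn negates RMSE so that larger is better; flip the sign back
    scores = cross_val_score(model, X, y, cv=5, scoring="neg_root_mean_squared_error")
    cv_rmse[name] = -scores.mean()

best = min(cv_rmse, key=cv_rmse.get)        # lowest cross-validated RMSE wins
```

The model with the lowest mean RMSE across folds would be carried forward to the held-out test set, mirroring the procedure described above.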
The variables are represented by the following nomenclature for ease of reference, as shown in Table 1.
Figure 1a shows the distribution of each variable; in general, none of the variables appears to be normally distributed. Each plot in the figure is a histogram displaying the range of values for each feature.
For instance, the temperature values are plotted on the x-axis, while the frequency of each temperature value is represented on the y-axis. The first plot in Fig. 1a shows the temperature values, where the most frequently occurring temperature value is 50 °C. The least frequent temperature value, 70 °C, was also the highest temperature in the data. The temperature values between 35 and 45 °C were the most frequently occurring group in the data set. The general trend in the data shows a rise at the beginning and a fall toward the end. A similar analysis applies to the other features. This property is also illustrated clearly in the normal probability plot in Fig. 1b.
Figure 1b shows a normal probability plot that compares the distribution of the data in each variable to the standard normal distribution. The plot uses plus-sign markers ('+') to represent each data point in each variable. Two reference lines show the theoretical normal distribution: a solid line connecting the data's first and third quartiles, and a dashed line extending the solid line to the ends of the data range. If the data follow a normal distribution, the points align along the reference line.
However, if the data deviate from the normal distribution, curvature appears in the plot, indicating that the data distribution differs from the expected normal distribution (MathWorks 2022). By visually inspecting the normal probability plot in Fig. 1b, we can observe the departure from normality and the nature of the data distribution.
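The computation behind such a plot can be sketched with the standard library alone: sorted sample values are paired with theoretical standard-normal quantiles, and curvature of the resulting point set signals non-normality. The plotting-position convention (i + 0.5)/n used below is one common choice, not necessarily MATLAB's exact one.

```python
# Sketch of the pairing behind a normal probability plot (cf. Fig. 1b).
from statistics import NormalDist

def normal_probability_points(sample):
    """Return (theoretical quantile, ordered value) pairs for a probability plot."""
    n = len(sample)
    ordered = sorted(sample)
    # Plotting positions (i + 0.5)/n mapped through the inverse standard-normal CDF
    theoretical = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, ordered))

# For normally distributed data the pairs fall near a straight line;
# systematic curvature indicates departure from normality.
```

Plotting these pairs (theoretical quantile on one axis, ordered value on the other) reproduces the essential content of the figure.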
Figure 2 shows the box plot of each variable. Box plots are a common method for displaying data distribution using a five-number summary. The temperature box plot, for instance, shows the minimum, first quartile, median, third quartile, and maximum values of temperature. Five components make up each box plot: the median, hinges (the Q1 and Q3 quartiles), fences (adjacent extremes), whiskers (minimum and maximum values, excluding outliers), and outliers (data points outside the whiskers).
Table 1 Variables nomenclature for ease of reference (Onyiriuka 2023a)

Notched box plots narrow the box around the median to provide an approximate 95% confidence interval for the population median. Notches are particularly useful for evaluating the significance of differences between medians. In Fig. 2, the notches of the temperature values and the particle size overlap, signifying similar median distributions. The height of the notches is proportional to the interquartile range (IQR) of the sample and inversely proportional to the square root of the sample size. Analysis of the plot shows that each variable has distinct values, except for the thermal conductivity, thermal diffusivity, specific heat capacity, surface tension, and dielectric constant of the base fluid, which are similar to one another but opposite in pattern to the density, viscosity, kinematic viscosity, and boiling point of the base fluid. To model the viscosity of nanofluids, it is recommended to explore decision trees, ensemble models, and neural networks.

Methods
This section tests various modeling and feature selection algorithms, including the algorithm outlined below in Sect. "Algorithm for parameter selection applied for viscosity" [novel feature selection algorithm (NFSA)]. The other investigated feature selection algorithms are minimum redundancy maximum relevance (MRMR), F-test, and RReliefF. Tables 2 and 3 summarize the results obtained by applying these algorithms.

Minimum redundancy maximum relevance (MRMR)
The MRMR algorithm is a technique used in machine learning and data mining to select a subset of features from a larger set. Its main objective is to maximize the relevance of the chosen features to the target variable while minimizing redundancy among them. The MRMR algorithm works as follows (Çalışkan 2023; Sakthivel et al. 2023; TM and Veni 2023): First, start with an empty set of selected features. Then, calculate the relevance of each feature with respect to the target variable using a metric such as mutual information, the correlation coefficient, or information gain. Next, select the feature with the highest relevance and add it to the selected feature set. After that, for every remaining feature, calculate its redundancy with respect to the already selected features. Redundancy is a measure of how much information a feature provides beyond what is already captured by the selected features (TM and Veni 2023).
Then, calculate the MRMR score for each feature by subtracting its redundancy from its relevance. Choose the feature with the highest MRMR score and add it to the selected feature set. Repeat these steps until the desired number of features is selected or a stopping criterion is met (for example, a predefined threshold on the MRMR score). The final selected features are those in the selected feature set. The MRMR algorithm aims to balance informative features (high relevance) against redundant information. By using this approach, the algorithm can improve the efficiency and interpretability of machine learning models by reducing the dimensionality of the input feature space while retaining the most relevant information (TM and Veni 2023).
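The greedy loop above can be sketched compactly. This minimal version uses the absolute Pearson correlation as both the relevance and redundancy metric, which is only one of the admissible choices (mutual information is another); it is an illustration, not the exact implementation used in the study.

```python
# Minimal greedy MRMR sketch: relevance minus mean redundancy, correlation-based.
import numpy as np

def mrmr_select(X, y, k):
    """Greedily pick k column indices maximizing relevance minus redundancy."""
    n_features = X.shape[1]
    # Relevance: |correlation| of each feature with the target
    relevance = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(n_features)])
    selected, remaining = [], list(range(n_features))
    while len(selected) < k and remaining:
        best_j, best_score = None, -np.inf
        for j in remaining:
            # Redundancy: mean |correlation| with already-selected features
            if selected:
                red = np.mean([abs(np.corrcoef(X[:, j], X[:, s])[0, 1]) for s in selected])
            else:
                red = 0.0
            score = relevance[j] - red
            if score > best_score:
                best_j, best_score = j, score
        selected.append(best_j)
        remaining.remove(best_j)
    return selected
```

With two nearly identical relevant features, the second copy is penalized by its redundancy with the first, which is exactly the behavior the algorithm is designed to produce.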

F-test
The F-test algorithm is a statistical technique that identifies the features with the most relevance or discriminatory power for a given target variable (Mathew 2023; Venkatesan 2023). For each feature in the dataset, the F-statistic is calculated as the ratio of between-class variability to within-class variability, and the corresponding p value is computed to represent the likelihood of obtaining the observed F-statistic by chance. The features are then sorted by their F-statistic or p value, and the top-k features with the highest F-statistics or lowest p values are selected as the final feature subset (Mathew 2023; Venkatesan 2023). By examining the variability between different classes and within each class, the F-test assesses the relationship between each feature and the target variable: features with higher F-statistics or lower p values have stronger associations with the target. The F-test algorithm thus aids in identifying the most relevant features for a given classification or regression task (Mathew 2023; Venkatesan 2023).

RReliefF
The RReliefF algorithm is a technique for selecting features that can effectively differentiate between instances of different classes (Aggarwal et al. 2023). It assigns weights to each feature based on its discriminatory power. The weights are updated iteratively and aggregated across all instances to identify the most relevant features for classification tasks. The selected features are those with the highest scores, indicating their importance in separating instances of different classes (Aggarwal et al. 2023).
To begin, the weights for each feature are initialized to zero. For each instance in the dataset, weight updates are calculated by considering the differences between the feature values of the current instance and those of its closest instances of the same and different classes. The weights are then updated accordingly, with greater emphasis placed on features that contribute more to distinguishing between instances of different classes. The feature scores are calculated by aggregating the weight updates across all instances. Finally, the top-k features with the highest scores are selected as the final feature subset (Aggarwal et al. 2023).
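A heavily simplified sketch of the regression-variant weight estimate is given below. It accumulates, over each instance's nearest neighbors, how often a feature difference co-occurs with a target difference. This is an assumption-laden illustration: it uses L1 nearest neighbors, min-max normalization, and omits the distance-based neighbor weighting of the full RReliefF algorithm.

```python
# Simplified RReliefF-style weight estimate for a continuous target.
import numpy as np

def rrelieff(X, y, n_neighbors=10):
    """Return per-feature weights; higher means more relevant to the target."""
    m, n_features = X.shape
    # Normalize so feature and target differences are on comparable scales
    Xn = (X - X.min(0)) / (np.ptp(X, axis=0) + 1e-12)
    yn = (y - y.min()) / (np.ptp(y) + 1e-12)
    N_dC = 0.0                       # accumulated target differences
    N_dA = np.zeros(n_features)      # accumulated feature differences
    N_dCdA = np.zeros(n_features)    # co-occurring target and feature differences
    for i in range(m):
        dist = np.abs(Xn - Xn[i]).sum(1)
        neighbors = np.argsort(dist)[1:n_neighbors + 1]   # skip the instance itself
        for j in neighbors:
            d_y = abs(yn[i] - yn[j])
            d_a = np.abs(Xn[i] - Xn[j])
            N_dC += d_y
            N_dA += d_a
            N_dCdA += d_y * d_a
    # Features whose differences track target differences get positive weight
    return N_dCdA / N_dC - (N_dA - N_dCdA) / (m * n_neighbors - N_dC)
```

A feature that perfectly tracks the target receives a clearly higher weight than an irrelevant one, which is the ranking behavior the text describes.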

Algorithm for parameter selection applied for viscosity
Here we discuss the procedure for selecting parameters according to the novel method of Onyiriuka (2023a) for predicting the viscosity of single material nanofluids.
(1) Check the problem being solved.
(2) List all the possible features.
(3) Drop features that have no meaning or direct implication for the viscosity of a fluid. For single material nanofluids, for example, the feature groupings that define the nanofluid itself (such as the nanoparticle material and the base fluid material) are retained.
(4) Apply statistical methods to select features from those retained in step (3).
(5) At the end of steps (3)-(5), you should have a reasonable number of features and optimal accuracy.
Note that the main focus of this parameter selection is not accuracy but enhanced model learning for generalization; accuracy nevertheless remains of utmost importance.
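The physics-first screening of step (3) can be sketched as a simple filter. The candidate names and the "physically relevant" whitelist below are hypothetical stand-ins for demonstration, not the paper's exact feature lists.

```python
# Sketch of step (3): drop features with no physical bearing on viscosity.
candidate_features = [
    "temperature", "volume_fraction", "particle_size",
    "base_fluid_viscosity", "base_fluid_density",
    "measurement_lab_id",      # bookkeeping field, no physical bearing on viscosity
    "sample_batch_number",     # bookkeeping field, no physical bearing on viscosity
]

# Whitelist of features with a direct physical implication for viscosity
# (illustrative; the actual list follows the groupings that define a nanofluid)
physically_relevant = {
    "temperature", "volume_fraction", "particle_size",
    "base_fluid_viscosity", "base_fluid_density",
}

stage_one = [f for f in candidate_features if f in physically_relevant]
# Step (4) would then apply a statistical filter (e.g., an F-test) to stage_one.
```

The point of the sketch is the ordering: physical screening first, statistical ranking second, so that generality is preserved before accuracy is optimized.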

Model evaluation methods
The root mean squared error (RMSE), Eq. (1), the mean squared error (MSE), Eq. (6), the mean absolute error (MAE), Eq. (7), and the R-squared (R²), Eqs. (2)-(5), were applied in this study to measure model performance. The main decision-making performance metric in this study was the root mean squared error, chosen for its intuitive and direct interpretation of the error.
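These four metrics can be written out directly; the definitions below are the standard ones, with the equation numbers in the text referring to the paper's own formulas.

```python
# Standard definitions of the four evaluation metrics used in this study.
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: square root of the mean squared residual."""
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))

def mse(y_true, y_pred):
    """Mean squared error."""
    return float(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))

def mae(y_true, y_pred):
    """Mean absolute error."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 minus residual over total sum of squares."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return float(1.0 - ss_res / ss_tot)
```

Note that RMSE is by definition the square root of MSE, which makes it directly interpretable in the units of the target variable.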

Discussion
The best-performing model achieved the lowest RMSE on both the validation and test datasets, indicating its superior predictive accuracy. The other Gaussian Process Regression models also showed promising results but were not as accurate as the top-performing model. The Neural Network models demonstrated competitive performance but were unable to outperform the Gaussian Process Regression models; further tuning of the Neural Network architectures and hyperparameters could potentially improve their performance. The Linear Regression models and tree-based models showed relatively higher RMSE values, suggesting that they do not capture the complex relationships present in the nanofluid viscosity data as effectively as the Gaussian Process Regression and Neural Network models.
Also, Table 2 shows that the best model is obtained by applying the algorithm in Sect. "Algorithm for parameter selection applied for viscosity" [novel feature selection algorithm (NFSA)], with a root mean squared error of 0.0220 on the validation data set. Table 3 presents the settings of each model and the applied feature selection algorithm. The model settings in Table 3 were obtained from the optimizable versions of the original models; they are the settings that give the best results when optimized with the Bayesian optimizer class. Table 3 also provides insight into the hyperparameters and feature selection algorithms applied to the Gaussian Process Regression models.
The model with the "None" feature selection algorithm performed well, suggesting that essentially all features can be used for good predictions. However, the "F-test" and "MRMR" feature selection algorithms also showed competitive performance, indicating that they effectively identified relevant features for nanofluid viscosity prediction.
The predicted-versus-true response plot of the best model is shown in Fig. 3. Each point pairs a predicted response with the corresponding true response. A perfect scenario is represented by the line through the origin, on which the predicted response equals the true response. The vertical distance between any point and the line is the prediction error for that point. A good model has small errors, meaning that the points are concentrated near the line.
The plots in Fig. 3a and 3b visually illustrate the quality of predictions made by the accepted model on the training and test datasets, respectively. The close alignment between the predicted responses and the true responses indicates the model's ability to generalize well to unseen data and its overall reliability.
Figures 3a and 3b also demonstrate that the accepted model fits both the training and testing data groups. It is essential to note that the models had no knowledge of the test data during the training process.
It is important to note that the models were not applied to other scenarios, given their good performance on the test data and the marked differences in data logging, nanofluid preparation, and handling methods between researchers. Future work will apply the approach to other nanofluid thermophysical properties, such as those of hybrid nanofluids, which introduce new features that may be important for prediction.

Conclusions
This study focused on modeling nanofluid viscosity and optimizing feature selection for accurate prediction. Through the comparison of various models using cross-validation techniques, we gained valuable insights into the factors influencing nanofluid viscosity and identified the most influential features. By incorporating physical meaning into the feature selection process, we achieved improved results. The research findings underscore the importance of considering physical relevance when selecting features for nanofluid viscosity prediction.
By prioritizing features that have a direct physical impact on viscosity, we were able to develop more precise and reliable prediction models. This approach not only enhances the accuracy of viscosity estimation but also provides a better understanding of the underlying mechanisms governing nanofluid behavior.
The application of cross-validation techniques further strengthened our evaluation of the models. By assessing the root mean squared error of the cross-validation sets, we obtained robust measures of model performance. This rigorous evaluation allowed us to identify the most accurate and reliable model for predicting nanofluid viscosity.
The insights gained from this research contribute to the broader understanding of nanofluid viscosity and offer guidance for optimizing their use in practical applications. By accurately predicting viscosity, industries can improve the design and efficiency of heat transfer systems, lubrication processes, and other applications involving nanofluids. The optimized feature selection techniques developed in this study can be readily applied to other nanofluid systems, enabling efficient and effective viscosity prediction models across various applications.
It is important to note that the research presented here focused specifically on single material nanofluids.
Further studies could explore modeling and feature selection techniques for other types of nanofluids, such as multi-material and hybrid nanofluids or those with complex compositions. Additionally, investigating the relationship between nanofluid viscosity and thermal conductivity could provide valuable insights into overall fluid behavior.
In conclusion, this study contributes to the field of nanofluid viscosity modeling by providing a novel approach to feature selection and model evaluation. The novel feature selection algorithm offers a more comprehensive method for representing the viscosity of nanofluids in a way that preserves the generality of the models. The accurate prediction of nanofluid viscosity opens up new possibilities for optimizing performance in industrial processes, leading to enhanced efficiency and cost-effectiveness. The models developed in this research serve as valuable tools for predicting nanofluid viscosity and driving advancements in nanofluid-based technologies.

Fig. 2
A box plot of each feature. (i) Nanoparticle material: Any two intensive properties will fix the material of the nanoparticle type (Callister 2007; Cengel et al. 2011; Moran et al. 2010). (ii) Base fluid material: Any two intensive properties will fix the material of the base fluid type (Callister 2007; Cengel et al. 2011; Moran et al. 2010).

Table 2
Models performance and comparison

Table 3
Model parameters and the optimized Gaussian process model