
Simulation of liver function enzymes as determinants of thyroidism: a novel ensemble machine learning approach



Hormone production by the thyroid gland is a prime aspect of maintaining body homeostasis. In this study, three single artificial intelligence (AI)-based models, namely multi-layer perceptron (MLP), support vector machine (SVM), and Hammerstein–Wiener (HW) models, were used to simulate thyroidism status. The study's primary aim is to identify the best performing model for the simulation of thyroidism status using hepatic enzymes and hormones as the independent variables. Three statistical metrics were used to evaluate the performance of the models, namely the determination coefficient (R2), correlation coefficient (R), and mean squared error (MSE).


Based on the quantitative and visual presentation of the results, the MLP model outperformed the SVM and HW models by up to 3.77% and 12.54%, respectively, in the testing stage. Furthermore, to boost the performance of the single AI-based models, three ensemble approaches were employed: neural network ensemble (NNE), weighted average ensemble (WAE), and simple average ensemble (SAE). The NNE technique outperformed the SAE and WAE approaches by up to 2.85% and 1.22%, respectively, in the testing stage.


Comparison of the ensemble techniques with the single models showed that NNE outperformed all three AI-based models (MLP, SVM, and HW), improving on their performance accuracy by up to 7.44%, 11.21%, and 19.98%, respectively, in the testing stage.


Hormone production by the thyroid gland is a prime aspect of maintaining body homeostasis. However, the process must be tightly regulated to prevent a hormonal imbalance that negatively affects certain metabolic activities associated with many diseases (Ghali et al. 2020). The thyroid gland is part of the endocrine system, which produces thyroid hormones, namely triiodothyronine (T3) and thyroxine (T4), responsible for hormonal regulation for proper functioning (Brent 2012).

The thyroid-stimulating hormone (TSH) produced by the pituitary gland is essential in regulating hormonal production by the thyroid gland (Jonklaas 2020). Clinically, an elevated level of TSH denotes underproduction of thyroid hormones; the pituitary gland compensates by producing more TSH, a condition referred to as hypothyroidism. Conversely, low TSH levels indicate production of thyroid hormone above normal; the pituitary gland compensates by decreasing TSH production to restore thyroid function, a condition referred to as hyperthyroidism (Zhang et al. 2020).

Since our research aims to simulate thyroidism status using liver enzymes and thyroid-stimulating hormone as independent variables, it is worth elucidating the biological relevance and association of thyroid hormones and vitamin D with disease severity and liver function in patients with non-cholestatic chronic liver disease (Fisher and Fisher 2007). Thyroid hormones play a substantial role in the biological system, and abnormal levels lead to several medical conditions observed in many patients, such as thyroid tumorigenesis, Hashimoto's thyroiditis, goitre, myxoedema, benign tumour of the pituitary gland, and peripheral neuropathy (Choi et al. 2021; Huang and Liaw 1995). Abnormalities in serum liver enzymes are observed in hypothyroidism and may be related to impaired lipid metabolism, hepatic steatosis, hypothyroidism-induced myopathy, hyperammonemia, ascites, and hepatitis C virus (Piantanida et al. 2020; Płudowski et al. 2013). We believe that proposing the best performing AI and machine learning model using hepatic enzymes and hormones will serve as a breakthrough in drug design and discovery for several medical complications.

Machine learning tools have been widely employed to extract patterns from patient data and predict outcomes for improved clinical management of autoimmune thyroid diseases. Recently, logistic regression and neural network models were compared for diagnosing thyroid disorders, with the neural network model performing better than multinomial logistic regression. However, the predictive performance of the two models was found to be disdainful to laboratory variables (Chowdhury and Chakraborty 2017).

Another study, which used a data set from the Garvan Institute in Sydney, Australia, to predict thyroid disease with a machine learning approach, showed that prediction and classification performance depends on the data set. The algorithms used, such as SVM, Naïve Bayes, autoencoders, ANNs, and CNNs, yielded better results (Chaubey et al. 2020). A cross-sectional study of 143 LN patients with hyperthyroidism diagnosed by renal biopsy, using a PCA–logistic regression model, showed that the model performed well in identifying important risk factors for specific clinical outcomes (Aguayo-Orozco et al. 2021).

An association between hyperthyroidism and liver disease has been reported, and the severity of hyperthyroidism has been identified as an independent risk factor for abnormal liver function enzymes (Piantanida et al. 2020). Furthermore, many subsequent case reports and series have highlighted the prevalence of liver test abnormalities in the setting of hyperthyroidism and several mechanisms of liver dysfunction in hyperthyroidism, including liver abnormalities due to hyperthyroidism alone, liver damage related to heart failure and hyperthyroidism, and concomitant liver disease in hyperthyroidism (Punekar et al. 2018). Moreover, an experimental study applying linear and nonlinear models predicted the thyroid-stimulating hormone (TSH) level using different macro-elements and vitamins as the input parameters (Muhammad Ghali et al. 2020). In a clinical puzzle study, liver transplant evaluation was carefully analysed in a patient with coagulopathy, encephalopathy, and drug-induced acute liver failure. Uncontrolled thyroid diseases are reported to strongly correlate with liver dysfunction (Anugwom and Leventhal 2021).

Unfortunately, none of these studies addresses such correlations and associations with artificial intelligence and machine learning using two groups of independent variables. The major novelty of our work is the use of machine learning algorithms to better understand such complex clinical data and to propose the best AI algorithm for modelling thyroidism status in different subjects using liver function enzymes and the level of TSH as input variables. This evidence suggests that the best performing model has considerable potential for future endocrine-based research and for the diagnosis and treatment of several liver diseases.

The principal aim of our work is to make a comparative analysis and propose the best performing model of AI for the simulation of thyroidism status using hepatic enzymes and hormones as the independent variables, considering the gaps and limitations in many published articles.


Clinical methodology

The samples were assayed as described by Muhammad Ghali et al. (2020). Briefly, an automated COBAS E411 analyser was used to analyse liver function enzymes. Thyroid hormones were assessed using the Elecsys COBAS E411 after centrifuging the samples for 10 min at 2000 g in a red-capped tube. The samples were separated into control, hyperthyroidism, and hypothyroidism groups.

Proposed data computational intelligence approach

Different data-driven approaches were used separately in this research to propose an AI diagnostic approach for TSH. The current study is data-driven and uses the data collected in our previous research (Ghali et al. 2020). The thyroidism status was predicted using two groups of parameters that serve as inputs: liver function parameters, i.e. alanine transaminase (ALT), aspartate transaminase (AST), albumin blood test (ALB), gamma-glutamyl transferase (GGT), alkaline phosphatase (ALP), direct bilirubin (DBIL), and total bilirubin (TBIL), and hormones, i.e. thyroid-stimulating hormone (TSH), triiodothyronine (T3), thyroxine (T4), free triiodothyronine (FT3), and free thyroxine (FT4). The liver enzyme parameters were used to predict thyroidism status with different AI models because the liver is the major organ that synthesizes thyroid-binding globulin, prealbumin, and albumin, which bind thyroid hormone in the peripheral circulation, and because the liver metabolizes thyroid hormone. Liver function enzyme parameters are therefore likely to give a clear picture of the thyroidism status, since the liver plays a crucial role in thyroid disease conditions.

Furthermore, three ensemble techniques (SAE, WAE, and NNE) were applied to boost the prediction accuracy of the thyroidism status by combining the outputs of the single models (MLP, SVM, and HW). In practice, no single model can be expected to outperform all other existing models in predicting the various parameters of a specific study. The proposed method therefore determines thyroid hormone status from liver enzymes and hormones using an ensemble of different models.

Hammerstein–Wiener model (HW)

The Hammerstein–Wiener (HW) model is a black-box system identification approach designed to describe nonlinear systems (Gaya et al. 2017). The HW model consists of three blocks: a static input nonlinear block, a linear dynamic block, and a static output nonlinear block, as shown in Fig. 1 (Abba et al. 2019). The model passes the input through a nonlinear function into the linear block, and then through a second nonlinear function to produce the output. Furthermore, the HW model is reported to capture the relationship between the nonlinear and linear parts of a system more precisely than standard ANNs (Abba et al. 2019). The MATLAB toolbox was used to develop the HW model based on this structure. Piecewise-linear functions serve as the input and output nonlinearity estimators, with the number of units set to the default of 10; the complexity of the model increases as the number of units grows (Guo 2004).

Fig. 1

Schematic of the Hammerstein–Wiener model. \(w\left(t\right)=f(u\left(t\right))\) is the nonlinear function that transforms the input data u(t), and \(x\left(t\right)=w\left(t\right)B/F\) is a linear transfer function; f and h act on the input and output ports of the linear block, respectively. The internal variables w(t) and x(t) define the input and output of the linear block, respectively
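The three-block structure in Fig. 1 can be illustrated with a minimal sketch. It is not the MATLAB implementation used in the study: the function names, the difference-equation form of the linear block B/F, and the output relation y(t) = h(x(t)) are assumptions made for illustration only.

```python
from bisect import bisect_right


def piecewise_linear(x, breakpoints, values):
    """Evaluate a piecewise-linear nonlinearity given sorted breakpoints.

    Outside the breakpoint range the nearest end value is held constant
    (an illustrative choice, not taken from the paper).
    """
    if x <= breakpoints[0]:
        return values[0]
    if x >= breakpoints[-1]:
        return values[-1]
    k = bisect_right(breakpoints, x) - 1
    x0, x1 = breakpoints[k], breakpoints[k + 1]
    y0, y1 = values[k], values[k + 1]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


def linear_block(w, b, f):
    """Apply the transfer function B/F as a difference equation.

    x[t] = sum_k b[k] * w[t-k] - sum_{k>=1} f[k] * x[t-k], assuming f[0] == 1.
    """
    x = []
    for t in range(len(w)):
        acc = sum(b[k] * w[t - k] for k in range(len(b)) if t - k >= 0)
        acc -= sum(f[k] * x[t - k] for k in range(1, len(f)) if t - k >= 0)
        x.append(acc)
    return x


def hammerstein_wiener(u, f_nl, b, f, h_nl):
    """Chain the blocks: w = f_nl(u), x = (B/F) w, y = h_nl(x)."""
    w = [f_nl(ut) for ut in u]
    x = linear_block(w, b, f)
    return [h_nl(xt) for xt in x]


# Hypothetical example: saturating input nonlinearity, first-order linear block,
# identity output nonlinearity.
saturate = lambda v: piecewise_linear(v, [-1.0, 0.0, 1.0], [-1.0, 0.0, 1.0])
y = hammerstein_wiener([1.0, 1.0, 1.0], saturate, [0.5], [1.0, -0.5], lambda v: v)
```

In this sketch the nonlinear blocks are static (they act sample by sample), while all memory of past inputs lives in the linear block, which mirrors the block description above.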

Multi-layer perceptron (MLP) neural network

The multi-layer perceptron (MLP) is among the most common forms of ANN; it represents nonlinear systems and acts as a universal approximator (Choubin et al. 2016). The structure of the MLP consists of an input layer, an output layer, and hidden layers (Kim and Singh 2014; Pham et al. 2019). The input layer nodes are linked to the hidden and output layers. Signals are processed from the input to the output layer and transmitted with the aid of weights and biases through sequential mathematical operations. The Levenberg–Marquardt algorithm is the learning algorithm used to minimize the error between the measured and predicted values. The training iterations are repeated until the required outcome is reached. Like ordinary ANNs, the MLP consists of an input layer, an output layer, and one or more hidden layers, as shown in Fig. 2 (Committee 2000).


$$y_{i}=\sum_{j=1}^{N}w_{ji}x_{j}+w_{i0}$$

where N is the total number of nodes in the layer above node i; \(w_{ji}\) is the weight between nodes i and j in the upper layer; \(x_{j}\) is the output of node j; \(w_{i0}\) is the bias of node i; and \(y_{i}\) is the input signal of node i, which is passed through the transfer function.
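The weighted-sum-plus-bias computation described by these symbols can be sketched as a forward pass. This is an illustrative sketch only: the tanh transfer function, the linear output layer, and all function names are assumptions, and the Levenberg–Marquardt training step is not shown.

```python
import math


def layer_forward(x, weights, biases, transfer=math.tanh):
    """Compute the outputs of one layer.

    weights[i][j] plays the role of w_ji (weight from upstream node j to node i)
    and biases[i] the role of w_i0. Each node forms y_i and applies the transfer
    function to it.
    """
    outputs = []
    for w_row, w_i0 in zip(weights, biases):
        y_i = sum(w_ji * x_j for w_ji, x_j in zip(w_row, x)) + w_i0
        outputs.append(transfer(y_i))
    return outputs


def mlp_forward(x, hidden_w, hidden_b, out_w, out_b):
    """Three-layer MLP: input -> tanh hidden layer -> linear output layer."""
    hidden = layer_forward(x, hidden_w, hidden_b, transfer=math.tanh)
    return layer_forward(hidden, out_w, out_b, transfer=lambda v: v)


# Hypothetical weights for a network with two inputs, two hidden nodes, one output.
prediction = mlp_forward(
    [0.2, 0.7],
    hidden_w=[[0.5, -0.3], [0.1, 0.8]],
    hidden_b=[0.0, 0.1],
    out_w=[[1.0, -1.0]],
    out_b=[0.05],
)
```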

Fig. 2

Three-layer multi-layer perceptron structure

Support vector machine (SVM)

The support vector machine (SVM) learning concept was proposed by Vapnik in 1995 and provides a tool for problems that include prediction, classification, regression, and pattern recognition. SVM is a data-driven model that rests on two foundations: statistical learning theory and structural risk minimization. Its ability to improve the overall efficiency of the model while reducing error, excess data, and complexity is reported to make it superior to ANN (Vapnik 1995).

SVM regression can be categorized into linear and nonlinear support vector regression (SVR). SVR is regarded as a form of SVM with two structural layers: the first applies kernel functions to the input variables, and the second computes a weighted sum of the kernel outputs, as demonstrated in Fig. 3. A linear regression is first fitted to the data; the outputs are then passed through a nonlinear kernel to capture the nonlinear pattern in the data.

Fig. 3

The architecture of SVM Algorithms
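The two-layer view of SVR described above (kernel evaluation followed by a weighted sum of kernel outputs) can be sketched for the prediction step. The radial basis function kernel, the parameter names, and the numerical values are assumptions chosen for illustration; the study does not state which kernel was used, and fitting the coefficients is not shown.

```python
import math


def rbf_kernel(a, b, gamma=1.0):
    """Radial basis function kernel exp(-gamma * ||a - b||^2) (assumed kernel)."""
    squared_distance = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-gamma * squared_distance)


def svr_predict(x, support_vectors, dual_coefs, intercept, kernel=rbf_kernel):
    """Predict with a fitted SVR: weighted sum of kernel outputs plus intercept."""
    # First layer: kernel function applied between each support vector and x.
    kernel_outputs = [kernel(sv, x) for sv in support_vectors]
    # Second layer: weighted sum of the kernel outputs.
    return sum(c * k for c, k in zip(dual_coefs, kernel_outputs)) + intercept


# Hypothetical fitted parameters, for illustration only.
estimate = svr_predict(
    [0.4, 0.6],
    support_vectors=[[0.4, 0.6], [0.9, 0.1]],
    dual_coefs=[0.8, -0.2],
    intercept=0.1,
)
```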

Ensemble techniques

AI-based models given the same inputs show different performance levels, reflecting their individual strengths and limitations. Hence, many methods, such as web ranking algorithms, classification, regression problems, and time series clustering, are used in different study fields (B and Sadaoui 2019; Baba et al. 2015a; Dehghanian et al. 2015; Loos et al. 2019). Ensemble learning is the collective term for the branch of machine learning that deals with multiple homogeneous or heterogeneous models. The ensemble machine learning method combines the outputs of various predictors to boost the performance of a single AI model. Ensemble learning has been shown to produce more accurate outcomes than single models applied to the same problem. To improve the expected performance of the models, three procedures are commonly used: (1) simple averaging ensemble (SAE) for combining the HW, MLP, and SVM predictors, (2) neural network ensemble (NNE), and (3) weighted averaging ensemble (WAE) (Baba et al. 2015b).

Simple averaging ensemble (SAE)

For SAE, the SVM, HW, and MLP single models are first trained and tested independently; their outputs are then averaged and compared against the observed values. The general formula for SAE is:

$${P}_{(t)}= \frac{1}{N}\sum_{i=1}^{N}{p}_{i}(t)$$

N defines the number of learners (here N = 3), and pi represents the output of any single model (i.e. HW, SVM, and MLP) at a specific time t.
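As a minimal sketch of the SAE formula, the following averages the outputs of the N single models sample by sample. The function name and the numerical values are hypothetical and are not taken from the study's data.

```python
def simple_average_ensemble(predictions):
    """Average the outputs of N single models, sample by sample.

    predictions is a list of N equally long lists, one per single model
    (e.g. HW, SVM, MLP), aligned so that position t refers to the same sample.
    """
    n_models = len(predictions)
    return [sum(sample) / n_models for sample in zip(*predictions)]


# Hypothetical outputs of three single models for two samples.
hw_out = [1.0, 2.0]
svm_out = [2.0, 4.0]
mlp_out = [3.0, 6.0]
sae_out = simple_average_ensemble([hw_out, svm_out, mlp_out])
```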

Weighted average ensemble (WAE)

The weighted average ensemble (WAE) assigns a different weight to the output of each single model according to its relative importance, in contrast to simple averaging, where all outputs carry equal weight. The WAE is expressed as:

$${P}_{(t)}={\sum }_{i=1}^{N}{w}_{i}{p}_{i}(t)$$

where \(w_{i}\) represents the weight applied to the ith model's output and is determined from the performance of the model as:

$$w_{i} = \frac{{DC_{i} }}{{\sum\nolimits_{i = 1}^{N} {DC_{i} } }}$$

DCi is the performance efficiency of the ith single model.
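The two WAE expressions can be sketched together: the weights are first normalized from the performance efficiencies DC_i, then used to combine the single-model outputs. The function names and the example values are hypothetical.

```python
def wae_weights(dc):
    """Normalize performance efficiencies DC_i so the weights sum to one."""
    total = sum(dc)
    return [d / total for d in dc]


def weighted_average_ensemble(predictions, dc):
    """Combine single-model outputs with weights w_i = DC_i / sum(DC).

    predictions is a list of N equally long lists (one per single model) and
    dc holds the performance efficiency of each model in the same order.
    """
    weights = wae_weights(dc)
    return [
        sum(w * p for w, p in zip(weights, sample))
        for sample in zip(*predictions)
    ]


# Hypothetical example: the third model is twice as efficient as the others.
wae_out = weighted_average_ensemble([[4.0], [8.0], [2.0]], dc=[1.0, 1.0, 2.0])
```

When all DC_i are equal, the weights reduce to 1/N and the result coincides with the simple average.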

Neural network ensemble (NNE)

In the neural network ensemble (NNE) method, nonlinear averaging is performed by training a separate neural network. The input layer of the NNE is fed with the outputs of the single models, each of which is assigned to one input neuron. The network is trained with the backpropagation algorithm, and the best structure and number of epochs for the ensemble network are determined by trial and error.
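A minimal sketch of this idea follows: a small network whose inputs are the outputs of the single models, trained by backpropagation with plain stochastic gradient descent. The hidden-layer size, learning rate, epoch count, tanh activation, and the toy data are all assumptions for illustration; in the study these settings were chosen by trial and error and are not reported here.

```python
import math
import random


def forward(params, x):
    """Return hidden activations and the network output for one sample."""
    w1, b1, w2, b2 = params
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(w1, b1)]
    return h, sum(w * hi for w, hi in zip(w2, h)) + b2[0]


def train_nne(inputs, targets, hidden=4, epochs=300, lr=0.05, seed=0):
    """Train a one-hidden-layer network on single-model outputs.

    inputs[t] holds the outputs of the single models for sample t, one input
    neuron per model. Returns the parameters and the per-epoch mean squared error.
    """
    rng = random.Random(seed)
    n_in = len(inputs[0])
    w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(hidden)]
    b1 = [0.0] * hidden
    w2 = [rng.uniform(-0.5, 0.5) for _ in range(hidden)]
    b2 = [0.0]
    params = (w1, b1, w2, b2)
    history = []
    for _ in range(epochs):
        squared_error = 0.0
        for x, t in zip(inputs, targets):
            h, y = forward(params, x)
            err = y - t
            squared_error += err * err
            # Backpropagate the gradient of 0.5 * err^2 through both layers.
            for k in range(hidden):
                grad_hidden = err * w2[k] * (1.0 - h[k] ** 2)
                w2[k] -= lr * err * h[k]
                for j in range(n_in):
                    w1[k][j] -= lr * grad_hidden * x[j]
                b1[k] -= lr * grad_hidden
            b2[0] -= lr * err
        history.append(squared_error / len(inputs))
    return params, history


# Toy data: three hypothetical single-model outputs scattered around the target.
targets = [0.0, 0.5, 1.0, 0.0, 0.5, 1.0]
inputs = [[t + 0.10, t - 0.10, t + 0.05] for t in targets]
params, history = train_nne(inputs, targets)
_, nne_prediction = forward(params, inputs[0])
```

Unlike SAE and WAE, the combination learned here is nonlinear, so the network can correct systematic errors shared by the single models rather than only averaging them.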

Data pre-processing and model validation

In computational intelligence modelling, the principal aim is to ensure that a model fitted to a given data set also produces satisfactory predictions on unseen data (Nourani et al. 2018). The most frequent issue in prediction is overfitting, which shows up as a discrepancy between training and testing performance. Different methods can be applied during the validation process, such as k-fold cross-validation, leave-one-out, and holdout. The primary advantage of k-fold cross-validation is that, in every round, the validation and training sets are independent of each other (Usman et al. 2020). In our study, k-fold cross-validation is used to reduce overfitting, as demonstrated in Fig. 4.

Fig. 4

Illustration of k-fold cross-validation

In k-fold cross-validation, the original training data set is split into k equal-sized subsets; in each round, one subset is held out for validation while the remaining k−1 subsets are used for training (Elkiran et al. 2018). The overall result is taken as the average validation efficiency over the k rounds. In general, the value of k is chosen according to sample availability, typically between 2 and 10. The main advantage of k-fold cross-validation is that the calibration and validation sets in every round are independent of one another, which gives a sound basis for model optimization (Abba et al. 2017). The data set is split into two groups, the calibration and verification sets, to make efficient use of the data in model configuration (Soltani et al. 2015). In our study, the data were divided into two parts (75% for calibration and 25% for verification) to avoid the overfitting, underfitting, and local minima issues that may lead to qualitative and quantitative changes, as shown in Fig. 4 (Usman et al. 2021).
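The splitting schemes described here can be sketched with index lists. This is an illustrative sketch, not the authors' procedure: the shuffling, the seed, and the function names are assumptions, and only the 75%/25% proportion is taken from the text.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (training, validation) index lists for k-fold cross-validation.

    Each sample appears in exactly one validation fold, so the training and
    validation sets of every round are disjoint.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation


def holdout_split(n_samples, calibration_fraction=0.75, seed=0):
    """Split indices into calibration and verification sets (75%/25% by default)."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    cut = int(round(n_samples * calibration_fraction))
    return indices[:cut], indices[cut:]


rounds = list(k_fold_indices(n_samples=10, k=5))
calibration, verification = holdout_split(20)
```

A model would be fitted on each training list and scored on the matching validation list, with the k scores averaged at the end.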

The performance accuracy is estimated from various criteria based on the difference between predicted and measured values. In our study, the correlation coefficient (R), determination coefficient (R2), and mean squared error (MSE) were used to evaluate the models:

$${R}^{2}=1-\frac{\sum_{i=1}^{N}{\left({Y}_{obsi}-{Y}_{comi}\right)}^{2}}{\sum_{i=1}^{N}{\left({Y}_{obsi}-{\overline{Y}}_{obsi}\right)}^{2}}$$
$${\text{MSE}}=\frac{1}{N}\sum_{i=1}^{N}{\left({Y}_{obsi}-{Y}_{comi}\right)}^{2}$$
$$R=\frac{\sum_{i=1}^{N}\left({Y}_{obsi}-{\overline{Y}}_{obsi}\right)\left({Y}_{comi}-{\overline{Y}}_{comi}\right)}{\sqrt{\sum_{i=1}^{N}{\left({Y}_{obsi}-{\overline{Y}}_{obsi}\right)}^{2}\sum_{i=1}^{N}{\left({Y}_{comi}-{\overline{Y}}_{comi}\right)}^{2}}}$$

where N is the number of data points, \({Y}_{obsi}\) is the observed data, \(\overline{Y}\) is the average value, and \({Y}_{comi}\) is the computed value.
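The three evaluation metrics can be sketched directly from their definitions. The function names and the example values are hypothetical; the code assumes the observed and computed series have equal length and non-zero variance.

```python
import math


def mean(values):
    return sum(values) / len(values)


def r_squared(observed, computed):
    """Determination coefficient: 1 - SS_residual / SS_total."""
    obs_mean = mean(observed)
    ss_res = sum((o - c) ** 2 for o, c in zip(observed, computed))
    ss_tot = sum((o - obs_mean) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def mse(observed, computed):
    """Mean squared error between observed and computed values."""
    return sum((o - c) ** 2 for o, c in zip(observed, computed)) / len(observed)


def correlation(observed, computed):
    """Pearson correlation coefficient R between observed and computed values."""
    obs_mean, com_mean = mean(observed), mean(computed)
    cov = sum((o - obs_mean) * (c - com_mean) for o, c in zip(observed, computed))
    var_obs = sum((o - obs_mean) ** 2 for o in observed)
    var_com = sum((c - com_mean) ** 2 for c in computed)
    return cov / math.sqrt(var_obs * var_com)


# Hypothetical observed and computed values.
observed = [1.0, 2.0, 3.0]
computed = [1.5, 2.0, 2.5]
scores = (r_squared(observed, computed), mse(observed, computed),
          correlation(observed, computed))
```

In this example the computed values are perfectly correlated with the observations (R = 1) yet R2 is only 0.75, which illustrates why a goodness-of-fit measure and an error measure are reported together.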

According to Nourani et al. (2018) and Elkiran et al. (2019), a sound evaluation of any data intelligence model should include at least one goodness-of-fit measure (e.g. R2) and at least one absolute error measure (e.g. RMSE). Three performance criteria were employed in this study because multi-criteria evaluation of model performance is common practice in contemporary studies. Another important reason for using multiple criteria is that the properties of the data, such as normality, size, and linearity, affect the performance accuracy of any model, and these effects can also be assessed using the criteria. In addition, several studies have shown that, even for the same type of data set, performance results may differ from one model to another. For example, R2 does not account for biases that might be present in the data. Therefore, a good model might have a low R2 value, or a model that does not fit the data might have a high R2 value. Hence, combining the goodness-of-fit (R2) with other evaluation metrics, such as the root mean square error (RMSE) and bias measures, could lead to a more reliable simulation. Other performance efficiency criteria can also be used, such as mean absolute relative error (MAE) (Nourani et al. 2018; Elkiran et al. 2018; Usman et al. 2021).


The efficiency of the AI-based models was checked by comparing the predicted and experimental values regarding the thyroid status of the subjects. The performance skills of the techniques used in the determination were estimated using three different statistical metrics (R2, MSE, and R). Furthermore, visual presentation of the data via scatter plots, radar plots, bar charts, and response plots was equally illustrated. The obtained results are shown in the following section.

The performance of the single AI-based models and the ensemble techniques in both the training and testing stages, in terms of the statistical evaluation metrics, is shown in Tables 2 and 3. The tables also present the quantitative comparative performance of the models in simulating the dependent variable, in the form of thyroidism status, using various liver function enzymes and hormones as the independent variables.

Table 1 Basic descriptive statistics and Spearman–Pearson correlation analysis
Table 2 Performance of the single models

The result for the comparative performance of the models is equally shown in Fig. 5 using a scatter plot to recommend the best performing algorithm in both the training and testing stages.

Fig. 5

Scatter plots of MLP, SVM, and HW models for the thyroidism status

The errors of each model are compared in Fig. 6 to visualize the performance of the models; this is in line with the quantitative performance results shown in Table 2.

Fig. 6

Visualized comparative error performance of the single models

Furthermore, the NNE technique boosted the performance accuracy of the single AI models MLP, SVM, and HW up to 7.44%, 11.21%, and 19.98%, respectively, in the testing stages. Besides, the comparative performance of the models is illustrated graphically, as shown in Figs. 7, 8, and 9.

Fig. 7

Response plots of the ensemble models

Fig. 8

Scatter plots of the ensemble technique

Fig. 9

Radar surface performance of the ensemble techniques in both the training and testing stages


Spearman correlation analysis shown in Table 1 depicts a strong correlation between the dependent variable, in the form of thyroidism status, and TSH, T3, FT3, and T4, with correlation coefficient values of 0.503, 0.569, 0.765, 0.820, and 0.898, respectively. It also showed a weak inverse correlation between the dependent variable and DBIL. The basic descriptive statistics describe the nature of the raw data before normalization and modelling, which helps to simplify the approach employed during the simulation. More information regarding correlation analysis can be found in Abba et al. (2021) and Asnake Metekia et al. (2021).

Table 3 Performance of the ensemble techniques

Based on the performance of the models, it can be observed that MLP outperformed both the SVM and HW models in both the training and testing stages. Quantitatively, MLP improved on the performance of SVM and HW by up to 3.77% and 12.54%, respectively, in the testing stage.

Based on the quantitative performance of the ensemble techniques, it can be noted that all three approaches were able to predict the dependent variable with high performance in both the training and testing stages. Moreover, the NNE method outperformed the other two techniques (SAE and WAE) in the training and testing phases, improving on the performance of the SAE and WAE approaches by up to 2.85% and 1.22%, respectively, in the testing stage.

The comparative performance of the ensemble techniques can be presented visually with a radar plot of their correlation coefficients in both the training and testing stages (Fig. 9). Based on this plot, it can be noted that the NNE approach outperformed the SAE and WAE techniques in both stages, consistent with Table 3.

Furthermore, other approaches can be used to improve the performance efficiency of the single models. These include hybrid data intelligence techniques, which couple linear and nonlinear models to benefit from the properties of both, and emerging metaheuristic approaches such as the Harris Hawks optimization method and the novel neuro-emotional technique.


The results obtained from the data-driven approaches using experimental data indicated that the MLP, SVM, and HW AI models could model the thyroidism status of each subject as normal, hypothyroidism, or hyperthyroidism with R2 values higher than 0.7000 in both the training and testing stages. The NNE model outperformed all three AI-based models (MLP, SVM, and HW), improving on their performance accuracy by up to 7.44%, 11.21%, and 19.98%, respectively, in the testing stage, which makes it the most recommended and promising model for further endocrine-based research. Therefore, our study serves as a breakthrough and calls for the application of the proposed model in drug design and discovery to reduce the morbidity and mortality rate of hepatic and thyroid diseases.

Availability of data and materials

The data are provided in the main manuscript.



Abbreviations

AI: Artificial intelligence
ALT: Alanine transaminase
AST: Aspartate transaminase
ALB: Albumin blood test
ALP: Alkaline phosphatase
DBIL: Direct bilirubin
ECF: Extracellular fluid
FT3: Free triiodothyronine
FT4: Free thyroxine
GGT: Gamma-glutamyl transferase
MSE: Mean squared error
MLP: Multi-layer perceptron
NNE: Neural network ensemble
PTH: Parathyroid hormone
R: Correlation coefficient
R2: Determination coefficient
SAE: Simple average ensemble
SVM: Support vector machine
TSH: Thyroid-stimulating hormone
TBIL: Total bilirubin
T3: Triiodothyronine
T4: Thyroxine
WAE: Weighted average ensemble




The authors are grateful to Near East University and the relevant cited articles used in this manuscript.


Not applicable.

Author information

Authors and Affiliations



AGU and UMG contributed to the conceptualization, analysis and writing of the manuscript. MAA, SMM and EH contributed with the proofreading. AUK and QH contributed to writing the manuscript. SI helps towards proofreading and major revisions of the manuscript, while SIA supervised and drafted the manuscript. All the authors participated in the proofreading and approval of the manuscript.

Corresponding author

Correspondence to Abdulaziz Umar Kurya.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

There is no known competing interest to be declared by the authors.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit


About this article


Cite this article

Usman, A.G., Ghali, U.M., Degm, M.A.A. et al. Simulation of liver function enzymes as determinants of thyroidism: a novel ensemble machine learning approach. Bull Natl Res Cent 46, 73 (2022).
