Diagnosis of Malignancy in Thyroid Tumors by Multi-Layer Perceptron Neural Networks With Different Batch Learning Algorithms

To diagnose malignancy in thyroid tumors, a neural network approach is applied, and the prediction accuracy of thirteen batch learning algorithms is investigated. A back-propagation feed-forward neural network (BP FNN) is designed, and three hidden-layer sizes are compared (5, 10 and 20 neurons). The patients' clinical findings before surgery are used as the inputs, and the pathology result after surgery is used as the target output. The best algorithms are chosen based on the mean or maximum prediction accuracy and on the area under the Receiver Operating Characteristic (ROC) curve. The results show the superiority of the network with 5 neurons in the hidden layer. In addition, the best performance was obtained by the Polak-Ribiere conjugate gradient, BFGS quasi-Newton and one step secant algorithms according to prediction accuracy (83%), and by the Scaled Conjugate Gradient and BFGS quasi-Newton algorithms according to the area under the ROC curve (0.905).


Introduction
In recent years, the digital revolution has led to huge volumes of information being collected and stored. In health care especially, large databases of patients' findings are already available. Data mining methods are powerful tools for assisting physicians in decision making. These methods model the relations among clinical findings and hence help physicians diagnose similar cases, although the final decision still rests with the doctor (Raghavendra & Srivatsa, 2011). For instance, logistic regression is a probabilistic classification technique. It models the relationship between a binary outcome (healthy/unhealthy or death/survival) and a set of related attributes (risk factors). The derived model then helps the physician in the prediction, diagnosis and treatment of diseases within a reasonable time (Raghavendra & Srivatsa, 2011). It is a powerful method with simple interpretations if its assumptions are met. However, it depends heavily on its underlying theoretical assumptions (Pourahmad, Ayatollahi, & Taheri, 2011), which reduces its flexibility in adapting to real data. The nature of clinical findings and the large number of attributes under consideration call for more flexible methods without such assumptions. The neural network is one such method. It can model complex relations in a large data set without any theoretical assumptions. Therefore, it can be a useful tool for modeling the relations among clinical findings (Amato et al., 2013).
The simple neural network was first introduced in 1943 (McCulloch & Pitts, 1943). Since then, there have been many developments in its theory and applications. In theory, for instance, different learning methods and diverse training algorithms have been proposed. Applications of neural networks in diverse fields, including clinical research, are also attracting attention. Recent studies include cancer diagnosis (Bourdes & Bonnevay, 2010), disease diagnosis (Alizadehsani et al., 2013), death prediction (Shi et al., 2012) and image classification (Kuruvilla & Gunavathi, 2014). However, few studies have investigated how different learning algorithms affect the accuracy of the results.
Thyroid nodules are common in the human population, and decisions about their management remain controversial. Management varies from observation to total thyroidectomy. To determine the type of management, Fine Needle Aspiration (FNA) of the nodule is one of the most useful tools. Indeed, it determines the type of surgery. If the test detects a benign tumor, then right, left or subtotal lobectomy is applied. Otherwise, total lobectomy is performed. However, the clinical literature reports limitations in the accuracy of FNA and some significant errors when decisions are based on its result (Zhang & Berardi, 1998). Factors affecting malignancy in thyroid tumors include age, gender, size of the thyroid gland, tumor size, type of operation, type of malignant tumor, malignant tumor size, duration of the disease and family history. To determine the importance of these factors in the diagnosis of malignancy, it is necessary to search thoroughly through the patients' attributes to find and model the meaningful relations among their clinical findings. In this way, the diagnosis process may be improved.
Accordingly, in the present study, three neural networks with one hidden layer of 5, 10 or 20 neurons are considered. The performance of thirteen different batch learning algorithms is then compared in the diagnosis of malignant thyroid tumors. The superior algorithms are chosen based on prediction accuracy and the area under the ROC curve.

Materials
This study includes all patients, of both sexes and all age groups, who were initially diagnosed as candidates for thyroid tumor surgery. The FNA test was performed on them, and they underwent surgery in Shahid Rajaee and Nemazee hospitals (two hospitals in Shiraz, southern Iran) between 2009 and 2012. A total of 345 patients were eligible for the study. Based on clinical expert opinion, all factors related to the type of thyroid tumor (malignant/benign) that were known before surgery were collected from the patients' hospital records. Accordingly, 12 important factors, such as gender, age, type and growth of the thyroid gland, FNA test result, duration of disease, family history of disease and cancer, size of the right and left thyroid gland, and size of nodules in the left and right thyroid glands, were considered in the modeling process.

Methods
In a classic definition, an artificial neural network is a large set of parallel processors with a natural capacity for storing experiential knowledge. It resembles the brain in at least two respects: knowledge is stored in synaptic weights, and it is acquired through a process called learning (Reggia, 1993).
In the present study, a supervised network known as a feed-forward neural network (FNN) is applied with the back-propagation (BP) training algorithm. One hidden layer with three different sizes (5, 10 and 20 neurons) is considered, and thirteen batch learning algorithms for training the network are compared.
For ease of reference, the MATLAB names of the activation functions and learning algorithms are given in parentheses in the following sections.

BP Algorithm
FNNs can approximate complex non-linear functions and are therefore appropriate for modeling the ambiguous relations among clinical findings. The BP algorithm is a frequently used learning algorithm for training FNNs with high modeling power. It adjusts the network parameters iteratively to minimize the sum of squared approximation errors using a gradient descent technique (Sibi, Jones, & Siddarth, 2013).
The learning steps in this algorithm are as follows (Raghavendra & Srivatsa, 2011):
1) The inputs are presented to the network and propagated forward through its layers until the output layer is reached. The output is then predicted using the initial values of the parameters (weights and biases).
2) The network error is defined as the difference between the predicted output and the target output.
3) The error is propagated backward and the weights are adjusted to reduce it. In this way, the mean squared deviation between the predicted and target outputs is minimized.
4) These steps are repeated iteratively until the error between the predicted and the actual outputs is minimized.
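The four steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not the MATLAB implementation used in the study: the function name `train_bp`, the learning rate, the number of epochs and the weight initialization range are all assumptions, and plain batch gradient descent stands in for the thirteen training methods compared here.

```python
import math
import random


def train_bp(inputs, targets, n_hidden=5, lr=0.05, epochs=300, seed=0):
    """Batch gradient descent on a one-hidden-layer tanh network (illustrative)."""
    rng = random.Random(seed)
    n_in = len(inputs[0])
    # Small random initial weights; the last entry of each row is the bias.
    w_hid = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
             for _ in range(n_hidden)]
    w_out = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
    history = []  # mean squared error at the start of each epoch
    n = len(inputs)
    for _ in range(epochs):
        g_hid = [[0.0] * (n_in + 1) for _ in range(n_hidden)]
        g_out = [0.0] * (n_hidden + 1)
        sse = 0.0
        for x, t in zip(inputs, targets):
            # Step 1: forward pass through the hidden and output layers.
            xb = list(x) + [1.0]
            h = [math.tanh(sum(w * v for w, v in zip(row, xb))) for row in w_hid]
            hb = h + [1.0]
            y = math.tanh(sum(w * v for w, v in zip(w_out, hb)))
            # Step 2: error between predicted and target output.
            err = y - t
            sse += err * err
            # Step 3: backward pass; the derivative of tanh is 1 - tanh^2.
            # Gradients are for 0.5 * err^2, so the factor 2 is absorbed in lr.
            d_out = err * (1.0 - y * y)
            for j in range(n_hidden + 1):
                g_out[j] += d_out * hb[j]
            for j in range(n_hidden):
                d_hid = d_out * w_out[j] * (1.0 - h[j] * h[j])
                for i in range(n_in + 1):
                    g_hid[j][i] += d_hid * xb[i]
        # Batch update: weights change once, after all patterns are seen.
        for j in range(n_hidden + 1):
            w_out[j] -= lr * g_out[j] / n
        for j in range(n_hidden):
            for i in range(n_in + 1):
                w_hid[j][i] -= lr * g_hid[j][i] / n
        history.append(sse / n)
        # Step 4: repeat until the epoch budget is used up.
    return w_hid, w_out, history
```

Calling `train_bp` on a small set of bipolar patterns returns the trained weights together with the per-epoch mean squared error, which should fall as training proceeds.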

Activation Functions
An activation function is a linear or non-linear function that is applied to the outputs of the previous layer to build the inputs of the next layer. Different activation functions can be used for each layer and even for each neuron in a layer.
In BP FNNs, three activation functions are commonly used, namely Linear (purelin), Log-Sigmoid (logsig) and Hyperbolic Tangent Sigmoid (tansig), since these functions are differentiable (Hagen, Demuth, & Beale, 1996). As mentioned, the outputs in our clinical dataset were dichotomous. The bipolar data representation was therefore used for the target outputs (malignant: 1 and benign: -1). With a binary (0/1) representation, the weight updates associated with zero-valued units vanish, which impairs the learning process; in effect, zero units are not learned (Fausett, 1994). As a result, tansig was the appropriate activation function for both connections (input-to-hidden and hidden-to-output layers) in our study.
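As a rough illustration of these three functions and of why zero-valued units do not learn, the following Python sketch may help. The function names mirror the MATLAB names, but the code is an assumption-based sketch rather than MATLAB's implementation; `weight_change` is a hypothetical helper showing the generic gradient-descent update, which is proportional to the output of the sending unit.

```python
import math


def purelin(n):
    """Linear activation: output equals net input; range (-inf, inf)."""
    return n


def logsig(n):
    """Log-sigmoid activation; range (0, 1), suited to binary targets."""
    return 1.0 / (1.0 + math.exp(-n))


def tansig(n):
    """Hyperbolic tangent sigmoid; range (-1, 1), suited to bipolar targets.

    Mathematically equal to 2 / (1 + exp(-2n)) - 1.
    """
    return math.tanh(n)


def dlogsig(a):
    """Derivative of logsig expressed through its output a."""
    return a * (1.0 - a)


def dtansig(a):
    """Derivative of tansig expressed through its output a."""
    return 1.0 - a * a


def weight_change(lr, delta, unit_output):
    """Hypothetical helper: size of a gradient-descent weight update."""
    return lr * delta * unit_output


# A unit coded as 0 (binary) contributes no update, whereas the same
# unit coded as -1 (bipolar) does.
binary_update = weight_change(0.1, 0.7, 0.0)
bipolar_update = weight_change(0.1, 0.7, -1.0)
```

Because tansig produces values in (-1, 1), it matches the bipolar coding of the targets (malignant: 1, benign: -1) described above.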

Batch Learning Algorithms
Once the number of layers, the number of neurons and the activation functions in each layer are determined, the method of adjusting the parameters (the learning algorithm) must be chosen. There are two kinds of learning, namely 'sequential or online' and 'batch' learning (MATLAB, 2010a). In sequential learning, the parameters are updated after each pattern (instance) is presented to the network. In batch learning, by contrast, the update is performed only after all patterns have been presented.
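The difference between the two update schedules can be shown with a deliberately tiny example. The sketch below uses a single linear unit y = w * x with squared error, which is an assumption made for brevity; the function names `online_epoch` and `batch_epoch` are hypothetical.

```python
def online_epoch(w, data, lr):
    """Sequential (online) learning: update w after every pattern."""
    updates = 0
    for x, t in data:
        grad = (w * x - t) * x  # gradient of 0.5 * (w*x - t)^2 w.r.t. w
        w -= lr * grad
        updates += 1
    return w, updates


def batch_epoch(w, data, lr):
    """Batch learning: accumulate over all patterns, then update once."""
    grad_sum = 0.0
    for x, t in data:
        grad_sum += (w * x - t) * x
    w -= lr * grad_sum / len(data)
    return w, 1


data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # toy patterns with t = 2 * x
w_online, n_online = online_epoch(0.0, data, lr=0.1)
w_batch, n_batch = batch_epoch(0.0, data, lr=0.1)
```

After one pass over the same three patterns, the online version has changed the weight three times and the batch version once, so the two generally end the epoch at different weight values.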
In the present study, batch learning was used, and the thirteen batch training methods available in MATLAB were compared. These methods are summarized as follows:
IX) Levenberg-Marquardt (trainlm): It is a popular curve-fitting algorithm used in many software applications for solving generic curve-fitting problems; it finds only a local minimum. It is the fastest training algorithm for networks of moderate size and has a memory reduction feature for use when the training set is large.
X) BFGS quasi-Newton (trainbfg): It updates the weight and bias values according to the BFGS quasi-Newton method and requires storage of the approximate Hessian matrix. It requires more computation per iteration than conjugate gradient algorithms but usually converges in fewer iterations.
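XI) One step secant (trainoss): It updates the weight and bias values according to the one step secant method and compromises between conjugate gradient methods and quasi-Newton methods.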
XII) Gradient descent with momentum and adaptive learning rate (traingdx): It updates the weight and bias values according to gradient descent with momentum and an adaptive learning rate. It trains faster than traingd, but it can only be used in batch mode training.

XIII) Bayesian regularization (trainbr): It updates the weight and bias values according to Levenberg-Marquardt optimization. It modifies the Levenberg-Marquardt training algorithm to produce networks that generalize well and reduces the difficulty of determining the optimum network architecture.
In all of the algorithms mentioned, training stops when either of these conditions occurs: the maximum number of epochs (repetitions) is reached, or the maximum amount of time is exceeded.
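The two stopping conditions can be expressed as a small training-loop wrapper. This is a minimal sketch under assumptions: `run_training`, its parameter names and its default limits are hypothetical, and `step` stands for one epoch of whichever training method is in use.

```python
import time


def run_training(step, max_epochs=1000, max_seconds=60.0):
    """Run `step(epoch)` until the epoch or time budget is exhausted.

    Returns the number of epochs completed and the reason for stopping.
    """
    start = time.monotonic()
    for epoch in range(1, max_epochs + 1):
        step(epoch)  # one full pass over the training patterns
        if time.monotonic() - start > max_seconds:
            return epoch, "time limit exceeded"
    return max_epochs, "maximum epochs reached"
```

For example, `run_training(lambda epoch: None, max_epochs=5)` completes five epochs and reports that the epoch limit was reached.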

Receiver Operating Characteristic Curve (ROC curve)
The ROC curve is used to evaluate the discriminating power of different methods, especially for comparing diagnostic tests (Shang, Lin, & Goetz, 2000). When a method or system has no ability to diagnose a disease correctly, its curve is a straight line between the points (0,0) and (1,1) in two-dimensional space. When the method is perfectly accurate, its ROC curve is a vertical line from (0,0) to (0,1) followed by a horizontal line to (1,1). Usually, the curves of different methods lie between these two extremes, unless the method or system performs worse than random prediction. The area under the ROC curve also represents the relative performance of the method: a value of 0.5 indicates no apparent accuracy and a value of 1 indicates perfect accuracy (Shang, Lin, & Goetz, 2000). A nonparametric statistical method was used to test whether this area differed significantly from 0.5.
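The two limiting curves and the area under the curve can be reproduced with a short Python sketch. It assumes bipolar labels (1 for malignant, -1 for benign) and a continuous network output used as the score; the function names `roc_points` and `auc` are hypothetical, and the area is computed with the trapezoidal rule, which is one common choice rather than necessarily the method used in the study.

```python
def roc_points(scores, labels):
    """ROC points (false positive rate, true positive rate).

    The decision threshold is swept from the highest score to the lowest;
    tied scores are processed together so they yield a single point.
    """
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    pairs = sorted(zip(scores, labels), key=lambda p: -p[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(pairs):
        current = pairs[i][0]
        while i < len(pairs) and pairs[i][0] == current:
            if pairs[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y1 + y0) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


labels = [1, 1, -1, -1]
perfect = auc(roc_points([0.9, 0.8, 0.2, 0.1], labels))      # vertical, then horizontal
no_skill = auc(roc_points([0.5, 0.5, 0.5, 0.5], labels))     # straight diagonal
partial = auc(roc_points([0.9, 0.6, 0.7, 0.3], labels))      # between the two
```

The perfectly separated scores give an area of 1, the uninformative constant scores give 0.5, and the partially separated scores fall in between.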

The Accuracy Percentage in Prediction
This value is calculated by the cross-validation method. In the training process, the dataset is divided into k separate parts (k-fold). Then k-1 parts are used for model construction (training the system) and the remaining part is used to test the model. In the testing process, the best model among the k different models is chosen. The accuracy rate in each round is defined as the number of correctly predicted patterns divided by the total number of patterns, multiplied by 100.
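The procedure can be sketched as follows. All names here (`k_fold_indices`, `accuracy_percent`, `cross_validate`) are hypothetical, and the `fit` and `predict` arguments are placeholders for whatever model is being evaluated; the sketch only illustrates the splitting and the accuracy formula described above.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and split them into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def accuracy_percent(predicted, actual):
    """Correctly predicted patterns / total patterns * 100."""
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return 100.0 * correct / len(actual)


def cross_validate(fit, predict, X, y, k=5, seed=0):
    """Train on k-1 folds, test on the held-out fold, for each fold in turn."""
    scores = []
    for held_out in k_fold_indices(len(X), k, seed):
        held = set(held_out)
        train = [i for i in range(len(X)) if i not in held]
        model = fit([X[i] for i in train], [y[i] for i in train])
        preds = [predict(model, X[i]) for i in held_out]
        scores.append(accuracy_percent(preds, [y[i] for i in held_out]))
    return scores  # one accuracy percentage per round


def fit_majority(X_train, y_train):
    """Placeholder model: remember the most frequent training label."""
    return max(set(y_train), key=y_train.count)


def predict_majority(model, x):
    """Placeholder prediction: always return the remembered label."""
    return model
```

The minimum, maximum, mean and standard deviation of the returned list can then be used to summarize an algorithm's accuracy across rounds.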

Results
In this study, a BP FNN with one hidden layer was applied with the 'tansig' activation function in both layers. The activation function was the same for all neurons in each layer. Three hidden-layer sizes (5, 10 and 20 neurons) were compared, and the thirteen batch training methods (explained in the Batch Learning Algorithms section) were applied to train the network. Table 1 summarizes the general characteristics of the network.
Vol. 7, No. 6; 2015
As mentioned earlier, the clinical findings of 345 patients with thyroid tumors referred for surgery were used for training the networks. The dataset was randomly divided into two parts: 80 percent (276 cases) as the training set for learning and 20 percent (69 cases) as unseen data for validation. For each of the thirteen training algorithms, the training dataset was randomly divided into two separate parts. The minimum (Min), maximum (Max), mean and standard deviation (SD) of the accuracy percentages in diagnosis were then calculated for each algorithm (Table 3).
Table 3. Prediction accuracy of thirteen learning algorithms on validation data
The results showed the superiority of the network with 5 neurons in the hidden layer, followed by the networks with 10 and 20 neurons, in that order. Based on the maximum values, the networks were trained most accurately by the Polak-Ribiere conjugate gradient, BFGS quasi-Newton and one step secant algorithms in the 5 neuron structure (83%), basic gradient descent in the 10 neuron structure (80%) and gradient descent with adaptive learning rate in the 20 neuron structure (78%). Based on the mean values, however, basic gradient descent (71%), basic gradient descent (69%) and Bayesian regularization (68%) were chosen for the 5, 10 and 20 neuron structures, respectively (Table 3). Furthermore, the area under the ROC curve was computed for the best trained network of each algorithm in the three defined structures.
Table 4 summarizes the results. Although all the areas under the curves were statistically significant (p-value<0.001), the 5 neuron structure gave better results than the other two structures. By this criterion, the 20 and 10 neuron structures ranked second and third, respectively. The highest diagnostic power on our clinical dataset was obtained by Scaled Conjugate Gradient and BFGS quasi-Newton (area = 0.905) in the 5 neuron structure, Gradient Descent with Momentum (area = 0.863) in the 20 neuron structure and Bayesian regularization (area = 0.835) in the 10 neuron structure.

Discussion
The present study was conducted to help physicians diagnose the type of thyroid tumor in patients initially diagnosed as candidates for thyroid tumor surgery. For this purpose, the prediction accuracy of thirteen different batch learning algorithms was compared. This subject has not been sufficiently investigated in a single study, so the present study may be of technical importance. Some recent studies investigated a subset of these algorithms (Koçer & Canal, 2011; Ramos-Pollán, Guevara-López, & Oliveira, 2012) or compared batch learning algorithms with online algorithms (Randall Wilsona & Martinez, 2003; Duchi & Singer, 2009; Perez-Suay, Francesc, Arevalillo-Herraez, & Jesus, 2013). Furthermore, applications of these learning algorithms to other clinical problems, such as image classification, have received more attention than the diagnosis of cancer type (Steven, Jinz, Zhuy, & Lyuy, 2006).
In addition, since the initial diagnosis of tumor type (malignant/benign) affects the type of surgery (subtotal or total lobectomy), the results of this study may be clinically noteworthy. Few studies have modeled tumor type or other thyroid-related diseases based on the factors that affect them. Some recent studies in this field used classic statistical methods such as logistic or linear regression analysis to model the relations among the factors (Lee & Kwak, 2010; Lima, Neto, Tambascia, & Wittmann, 2013; Zou et al., 2013). Among studies using soft computing techniques, some applied neural networks to model the relations among the factors, but not with the purpose of comparing different learning methods (Sarasvathi & Santhakumaran, 2011; Zhu et al., 2013; Bastias, Horvath, Baesler, & Silva, 2011; Gharehchopogh, Molany, & Mokri, 2013; Shukla, Tiwari, Kaur, & Janghel, 2009; Ozyilmaz & Yildirim, 2002; Zhang & Berardi, 1998). In our earlier research in this field, however, three different data mining classification methods were compared on a subset of this dataset (Pourahmad, Azad, Paydar, & Abbasi, 2012).
According to the literature, the result of the FNA test as the preoperative diagnostic criterion may contain some significant errors (Zhang & Berardi, 1998). In this study, the FNA test result compared with the actual tumor type after surgery showed 63 percent accuracy in diagnosis, which is in agreement with other clinical reports. In contrast, the modeling process presented in this study increased the accuracy rate to at least 75 percent with the preferred algorithms.
Furthermore, increasing the number of neurons in the hidden layer usually leads to better learning (Fausett, 1994), but our results did not confirm this.
Finally, although the algorithms gave acceptable and broadly similar results in the present study, work on a larger dataset is recommended to allow further comparisons and to derive more powerful diagnostic models for this medical problem.