Fırat Tıp Dergisi


[ Ana Sayfa \| Editörler \| Danışma Kurulu \| Dergi Hakkında \| İçindekiler \| Arşiv \| Yayın Arama \| Yazarlara Bilgi \| E-Posta ]

Fırat Tıp Dergisi

2024, Cilt 29, Sayı 4, Sayfa(lar) 191-195

[ Özet ] [ PDF ] [ Benzer Makaleler ] [ Yazara E-Posta ] [ Editöre E-Posta ]

Evaluation of Radial Basis Function Network and Supervised Machine Learning Methods on Brain Stroke Prediction Datasets

Kübra Elif AKBAŞ, Betül DAĞOĞLU HARK

Fırat University Faculty of Medicine, Department of Biostatistics, Elazığ, Türkiye

Keywords: Veri Madenciliği, Denetimli Makine Öğrenmesi, Sinir Ağları, Performans Ölçütleri, Beyin Felci, Data Mining, Supervised Machine Learning, Neural Network, Performance Measures, Brain Stroke

Summary

Objective: Supervised machine learning algorithms and neural networks are widely used classification methods in data mining. In this study, RBFN, one of the widely used supervised machine learning (SML) algorithms and neural network methods, was used according to the factors affecting the diagnosis of cerebral palsy, and it was aimed to evaluate their classification performance.

Material and Method: The dataset is an open source dataset, and there are a total of 4981 people with and without stroke. This dataset is modeled with RBFN from neural networks with four algorithms commonly used in supervised machine learning decision tree (DT), random forest (RF), and K-nearest neighbor (K-NN) and support machine vector (SVM). Their performance was evaluated according to performance criteria.

Results: The algorithms with the highest performance according to the accuracy criteria are DT (0.954), SVM (0.954), RBFN (0.954) and RF (0.953), respectively. The K-NN algorithm was found to be higher than other methods in terms of precision (0.061) and sensitivity (0.080).

Conclusion: The performances of DT, RF, SVM and RBFN methods were found to be close to each other in terms of accuracy criteria. In the decision-making process, the correct classification performance of these four methods is higher than K-NN.

Top

Summary

Introduction

Methods

Results

Disscussion

Conclusion

References

Introduction

Stroke is one of the important health problems en-countered today. Stroke, also known as a cerebrovascular accident, is a neurological disease caused by ischemia or bleeding of the cerebral arteries. This disease causes cognitive disorders. The two leading causes of death and disability worldwide are ischemic heart disease and stroke¹. In addition, the cost of hospitalization increases due to stroke. For these reasons, diagnosis, treatment, estimation of the clinical process, recommendation of therapeutic interventions and rehabilitation programs for stroke have become an important need. Machine learning (ML) is an important decision-making process in this process².

Advances in computer and information technologies have made the emergence of big data sets an inevitable situation. As the amount of data increases, uncovering patterns and trends and understanding data (i.e. learning from data) has become an important issue. An important area known as “machine learning” has developed to process big data statistically. ML is mainly based on supervised and unsupervised learning. Supervised machine learning (SML) assumes that the machine will learn from the data when a target variable is specified. Unsupervised learning is the opposite of supervised learning. Here, no target value or label is specified for the data. In addition, the data is visualized in two or three dimensions³.

Radial basic function neural networks (RBFN) are one of the basic categories of neural networks (NN). The main architectures, learning techniques and applications of RBFN have been demonstrated in many studies⁴^-⁶. The learning and generalization abilities of these networks are very good. In particular, the learning rates are significantly faster compared to other multilayer neural networks⁷.

The aim of this article is to compare the performances of ML and NN approaches, which are predictive methods for the diagnosis of stroke, which causes death and disability. Decision tree (DT), random forest (RF), k-nearest neighbors (K-NN) and support vector machines (SVM) are used as supervised machine learning (SML) methods, while RBFN was used as NN. First of all, SML and RBFN methods were compared and then these methods were evaluated within themselves.

Top

Summary

Introduction

Methods

Results

Disscussion

Conclusion

References

Methods

Dataset
The dataset includes a total of 4981 people with and without stroke, and the open access dataset was obtained from the relevant source https://www.kaggle.com/datasets/zzettrkalpakbal/full-filled-brain-stroke-dataset⁸. In this data set, there are 10 explanatory variables and 1 response variable indicating the presence of stroke. While the total number of patients with stroke was 248 (5.0%), the number of patients without stroke was 4733 (95.0%). The explanation of the variables of the data set is given in table 1.

Click Here to Zoom Table 1: The detailed explanation of the variables in the dataset.

Data Preprocessing
The data set was first evaluated in terms of extreme values for all variables. Then, the presence of missing data was checked and it was determined that 1500 (30.1%) data were missing for the smoking variable. Due to the high amount of loss in this variable, this variable was assigned with the multiple imputation method.

Supervised Machine Learning Algorithms Decision Tree (DT)
Decision trees are one of the powerful ML methods widely used in various fields such as image processing and pattern identification ⁹. The algorithm, as the name suggests, consists of a tree structure with root node, branches and leaf nodes displaying attributes, conditions and results respectively¹⁰. Each node represents a property in a classification category, and each subset specifies a value that the node can access⁹. The DT is a sequential model that efficiently and harmoniously combines a set of core tests in which a numerical feature is compared with a threshold value in each test. Conceptual rules are much easier to construct than numerical weights in a NN between nodes. DT, which is mainly used for grouping in data mining, is a model that is also used for classification purposes⁹.

Random Forest (RF)
The random forest (RF) method was first proposed by Breiman. This algorithm includes many decision trees. This method can be used in both regression and classification problems. Also, RF is one of the best ML algorithms that can be applied to many different fields. This algorithm consists of two parts. The first of these is training data. The second is validation data. From these two data sets, many random decision trees are created with bootstrap samples. The branching of each tree is determined by randomly selected estimators at the nodes. The final estimate of RF is the mean of all results from each tree. Each tree weights are taken into account while estimating the RF in the algorithm, and therefore each tree is not examined individually¹¹.

K-Nearest Neighbors (K-NN)
K-Nearest Neighbors (K-NN) are one of the most common classification techniques used in machine learning. K-NN is considered a non-parametric method when data distribution assumptions are not met. K-NN takes into account the equivalence of the new data with the existing data and places the new data in the nearest existing class. K-NN is used in recognition problems as well as regression problems¹².

Support Vector Machines (SVM)
Support vector machines (SVM) is a data-driven machine learning approach that deals with assigning class labels to unlabeled data. It is a predictive binary classification procedure. SVM is based on the maximum margin function that divides the observations into two classes. This function divides the data into two classes using the set of observations with known labels. The new unlabeled data is then assigned a class based on the classifier function and their geometric position¹³. Some datasets are not linearly separable and any dividing line, no matter how narrow the margin, can cause misclassification¹⁴. This problem can be solved by using the softer margin to estimate training samples with an acceptable misclassification¹⁵.

The disadvantage of support vector machines is that the classification result is binary and the probability of class membership is not estimated.

Radial Basis Function Network (RBFN)
The radial basis function network (RBFN) is used for classification and is a feed forward NN structure with a single hidden layer. The term 'feed-forward' means that neurons are organized in layers in a layered NN. This NN consists of three layers. These are the input layer, hidden layer and output layer. The input layer consists of input data. The hidden layer transforms the data from the input field to the hidden field using a non-linear function. The linear output layer gives the response of the network. The Euclidean distance between the input vector and the center of each hidden unit in an RBFN is determined by the argument of the activation function of that unit¹⁶^,¹⁷.

Data Analysis
IBM SPSS Statistics Version 22.0 package program was used for statistical analysis of the data. Categorical input variables were summarized as frequency and percentage, and continuous input variables as mean and standard deviation. Categorical input variables and the presence of stroke, which is the output variable, were compared using the Chi-Square test statistic. The mean difference between the presence of stroke and continuous input variables was statistically tested using the independent samples t-test. Statistical significance level was taken as 0.05 in all tests.

R-studio software language was used for the application and comparison of machine learning and deep learning methods.

Top

Summary

Introduction

Methods

Results

Disscussion

Conclusion

References

Results

Descriptive statistics and p values of output variables explaining the presence of stroke are given in table 2.

Click Here to Zoom Table 2: Descriptive statistics of patients.

The difference in patients with stroke in Table 2 according to age, presence of hypertension, ever married, type of employment, mean glucose level, smoking and BMI variables was found to be statistically significant (p value <0.001 for each variable). The stroke outcome variable for gender and residence type did not differ significantly (p values 0.552 and 0.268, respectively).

According to the findings from table 3, the DT algorithm has the highest performance in terms of accuracy, F1 score and specificity.

Click Here to Zoom Table 3: Results obtained according to classifier performance criteria of ML and RBFN algorithms.

The accuracy, F1 score and specificity performance criteria for SVM, RF and RBFN algorithms were close to the DT algorithm. The K-NN algorithm is higher than other algorithms in terms of precision and sensitivity.

DT, SVM, RBFN and RF algorithms are very close to each other in terms of performance criteria. Accuracy; calculated based on true positive and true negative observations, F1 score; calculated based on false positive and false negative observations, and specificity; calculated based on negative estimates within the true negative. Therefore, DT, SVM, RBFN, and RF algorithms performed similarly in terms of performance measures involving negative estimates. While precision is the rate of correct classification within a positive prediction, sensitivity is the rate of correct classification within a true positive. Accordingly, the K-NN algorithm is a more precise method for accurate positive classification.

Top

Summary

Introduction

Methods

Results

Disscussion

Conclusion

References

Discussion

Stroke disease occurs when there is a blood flow disorder or deficiency in various parts of the brain. This causes the cells in the damaged areas of the brain to not receive the nutrients and oxygen they need, and as a result, the cells die. Stroke is a medical emergency that requires immediate medical attention. It is important to identify early diagnosis and appropriate treatment management in order to prevent the harm and damage it will cause¹⁸. Stroke is a serious and common disease. At the same time, stroke, many people today suffer from acute brain attacks, which are ischemic strokes caused by blood clots blocking the cerebral arteries¹⁹. Therefore, stroke is a critical medical condition that must be treated before it worsens.

Brain stroke can be detected early. Early detection can reduce the effects of this stroke on the brain and other parts of the body. The aim of our work is to detect the presence of the disease early with machine learning algorithms and RBFN when dealing with a medical diagnostic field such as cerebral palsy. ML and RBFN applications have become very common especially in the medical field. It provides great convenience to physicians in making clinical decisions and predictions and in the decision-making process of the disease. In addition, ML and RBFN solutions are needed to cope with the limited number of doctors and the rapidly increasing challenges of large data sources.

In their study, Singh et al.²⁰ used five different ML techniques to predict stroke in the "Cardiovascular Health Study (CHS)" dataset. In their studies, C4.5 algorithm, DT, principal component analysis (PCA), artificial neural networks (ANN) and SVM methods were used. Kivrak et al.²¹ predicted mortality using machine learning approaches such as RF, K-NN, extreme gradient boosting (XGBoost) and deep learning on an open-source COVID-19 dataset. Sailasya and Kumari²² took various physiological factors and used machine learning algorithms such as logistic regression (LR), RF, DT, K-NN, SVM, and naive bayes (NB) to train five different models to accurately predict the probability of stroke in the brain.

In this article, DT, RF, K-NN, SVM and RBFN are the algorithms used in the decision making process. DT, RF, SVM and RBFN methods were determined as the methods with the best performance.

Cicek and Kücükakcalı⁵ compared the multilayer perceptron neural network (MLPNN) and RBFN methods using the Prostate Cancer Data Set in their study and found that the MLPNN method was more successful in classification. Tan et al.⁶ compared RBFN and SVM methods in their study. They did not find a significant difference between the method results in the study.

The article has limitations. One of them is that the data set has an unbalanced structure in terms of output variable. So the positive observation rate is quite low. Therefore, the outputs of the algorithms for positive classification are low. In other studies, it is recom-mended to compare ML and RBFN algorithms considering this unbalanced structure.

Top

Summary

Introduction

Methods

Results

Discussion

Conclusion

References

Conclusion

Stroke is a critical medical condition that needs to be treated before it gets worse. Thanks to ML and RBFN, stroke can be predicted early and its serious side effects can be reduced. By using these systems, medical doctors can diagnose cerebral palsy earlier and thus take the necessary measures to reduce the effect of stroke. To facilitate this decision-making process, the use of DT, RF, SVM and RBFN algorithms is suggested in the article.

Conflicts of Interest
The authors declare that there is no conflict of interest.

Top

Summary

Introduction

Methods

Results

Discussion

Conclusion

References

1) Johnson W, Onuma O, Owolabi M, Sachdev S. Stroke: a global response is needed. Bulletin of the World Health Organization. 2016; 94: 634.

2) Sirsat MS, Fermé E, Câmara J. Machine learning for brain stroke: a review. J. Stroke Cerebrovasc. Dis 2020; 29: 105162.

3) Muthukrishnan R, Rohini R, editors. LASSO: A feature selection technique in predictive modeling for machine learning. 2016 IEEE international conference on advances in computer applications (ICACA); 2016: IEEE.