Dataset
The dataset includes a total of 4981 people with and without stroke; this open-access dataset was obtained from https://www.kaggle.com/datasets/zzettrkalpakbal/full-filled-brain-stroke-dataset8. The dataset contains 10 explanatory variables and 1 response variable indicating the presence of stroke. The number of patients with stroke was 248 (5.0%), while the number of patients without stroke was 4733 (95.0%). The variables of the dataset are described in Table 1.
Data Preprocessing
The dataset was first screened for extreme values in all variables. The presence of missing data was then checked, and it was determined that 1500 (30.1%) observations were missing for the smoking variable. Because of the high proportion of missingness in this variable, the missing values were filled in using the multiple imputation method.
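The core step of any imputation of this kind can be sketched as follows. This is a deliberately simplified, hypothetical illustration in Python (the study used multiple imputation, which repeats a stochastic draw like this several times and pools the results); the variable values shown are invented, not taken from the dataset.

```python
import random

random.seed(42)  # for a reproducible illustration

# Hypothetical smoking-status column with missing entries (None)
smoking = ["never", "formerly", None, "smokes", None, "never", "never", None]

# Draw replacements from the empirical distribution of the observed values
observed = [v for v in smoking if v is not None]
imputed = [v if v is not None else random.choice(observed) for v in smoking]

print(imputed)  # no None values remain; observed entries are unchanged
```

A full multiple-imputation procedure would repeat this draw (typically conditioning on the other covariates), analyze each completed dataset, and pool the estimates.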
Supervised Machine Learning Algorithms
Decision Tree (DT)
Decision trees are among the powerful ML methods widely used in fields such as image processing and pattern recognition9. As the name suggests, the algorithm consists of a tree structure whose root node, branches, and leaf nodes represent attributes, conditions, and outcomes, respectively10. Each node represents a property of a classification category, and each subset specifies a value that the node can take9. The DT is a sequential model that efficiently combines a series of simple tests, in each of which a numerical feature is compared with a threshold value. Such conceptual rules are much easier to construct than the numerical weights between the nodes of a neural network (NN). DT, which is mainly used for grouping in data mining, is a model that is also used for classification purposes9.
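The sequence of threshold tests described above can be made concrete with a minimal Python sketch. The features, thresholds, and labels here are purely illustrative; they are not the splits learned from the stroke dataset.

```python
# A tiny hand-written decision tree: each internal node compares one
# numerical feature with a threshold; each leaf returns a class label.
def classify(age, avg_glucose_level):
    if age <= 60.0:                     # root node: test on age
        return "no_stroke"              # leaf
    if avg_glucose_level <= 120.0:      # branch: test on glucose
        return "no_stroke"              # leaf
    return "stroke"                     # leaf

print(classify(45, 200))  # -> no_stroke (fails the age test at the root)
print(classify(70, 150))  # -> stroke (passes both threshold tests)
```

A learned tree differs only in that the features and thresholds at each node are chosen automatically (e.g., by minimizing an impurity measure such as the Gini index).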
Random Forest (RF)
The random forest (RF) method was first proposed by Breiman. The algorithm is an ensemble of many decision trees and can be used for both regression and classification problems. RF is also one of the best-performing ML algorithms across many different fields. The algorithm works with two data sets: training data and validation data. From the training data, many random decision trees are grown on bootstrap samples, and the branching of each tree is determined by predictors selected at random at each node. The final RF estimate aggregates the results of all trees (the mean in regression, the majority vote in classification). Because every tree contributes with the same weight to this estimate, individual trees are not examined separately11.
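The two key ingredients named above, bootstrap resampling and aggregation over many trees, can be sketched in Python. To stay short, each "tree" is reduced to a one-feature threshold stump on a toy age variable; a real RF also draws a random feature subset at every split. All data are invented for illustration.

```python
import random

random.seed(0)

# Toy training data: (age, stroke label 0/1)
data = [(30, 0), (40, 0), (50, 0), (65, 1), (70, 1), (80, 1)]

def fit_stump(sample):
    """Pick the age threshold that best separates the given sample."""
    best_t, best_err = None, float("inf")
    for t, _ in sample:
        err = sum((age > t) != bool(y) for age, y in sample)
        if err < best_err:
            best_t, best_err = t, err
    return best_t

# Grow an ensemble: each stump is fitted on its own bootstrap sample
thresholds = []
for _ in range(25):
    sample = [random.choice(data) for _ in data]  # bootstrap sample
    thresholds.append(fit_stump(sample))

def rf_predict(age):
    votes = sum(age > t for t in thresholds)      # each stump votes
    return int(votes > len(thresholds) / 2)       # majority vote

print(rf_predict(75), rf_predict(35))
```

Averaging over bootstrap samples is what reduces the variance of the individual trees, which is the main reason the ensemble outperforms any single tree.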
K-Nearest Neighbors (K-NN)
K-Nearest Neighbors (K-NN) is one of the most common classification techniques used in machine learning. K-NN is a non-parametric method and is therefore suitable when assumptions about the data distribution are not met. K-NN considers the similarity of new data to the existing data and assigns the new data to the nearest existing class. K-NN is used in classification problems as well as regression problems12.
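The assignment rule is simple enough to sketch in a few lines of Python: the new observation takes the majority class among its k nearest neighbours under Euclidean distance. The two-dimensional points below are illustrative, not drawn from the stroke dataset.

```python
import math
from collections import Counter

# Toy labelled training points: ((feature1, feature2), class)
train = [((1.0, 1.0), "A"), ((1.5, 2.0), "A"),
         ((5.0, 5.0), "B"), ((6.0, 5.5), "B"), ((5.5, 6.0), "B")]

def knn_predict(x, k=3):
    # Sort training points by Euclidean distance to the query point
    nearest = sorted(train, key=lambda p: math.dist(x, p[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]  # majority class among k nearest

print(knn_predict((1.2, 1.5)))  # -> A (its neighbours are class "A")
```

In practice, features are usually standardized first, since Euclidean distance is sensitive to the scale of each variable.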
Support Vector Machines (SVM)
Support vector machines (SVM) are a data-driven machine learning approach that assigns class labels to unlabeled data; it is a predictive binary classification procedure. SVM is based on a maximum-margin function that divides the observations into two classes. This function is learned from a set of observations with known labels. New unlabeled data are then assigned a class based on the classifier function and their geometric position13. Some datasets are not linearly separable, and any dividing line, however it is placed, causes some misclassification14. This problem can be solved by using a softer margin that allows an acceptable level of misclassification of the training samples15.
A disadvantage of support vector machines is that the classification result is binary and the probability of class membership is not estimated.
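A common way to implement such a soft margin is sub-gradient descent on the regularised hinge loss; the Python sketch below trains a linear SVM on toy one-dimensional data with labels in {-1, +1}. This is a hypothetical illustration of the principle, not the model fitted in the study.

```python
# Toy 1-D data: (feature value, label in {-1, +1})
data = [(-3.0, -1), (-2.0, -1), (-1.5, -1), (1.5, 1), (2.0, 1), (3.0, 1)]

w, b = 0.0, 0.0      # linear classifier f(x) = w*x + b
lam, lr = 0.01, 0.1  # regularisation strength and learning rate

for epoch in range(200):
    for x, y in data:
        margin = y * (w * x + b)
        if margin < 1:
            # Margin violation: hinge-loss sub-gradient step
            w += lr * (y * x - lam * w)
            b += lr * y
        else:
            # Correctly classified with margin: only regularisation shrinks w
            w -= lr * lam * w

predict = lambda x: 1 if w * x + b >= 0 else -1
print(predict(2.5), predict(-2.5))
```

The regularisation term `lam` plays the role of the soft margin: larger values tolerate more training misclassifications in exchange for a wider margin.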
Radial Basis Function Network (RBFN)
The radial basis function network (RBFN) is a feed-forward NN structure with a single hidden layer that is used for classification. The term 'feed-forward' means that the neurons are organized in successive layers. The network consists of three layers: the input layer, the hidden layer, and the output layer. The input layer receives the input data. The hidden layer transforms the data from the input space to the hidden space using a non-linear function. The linear output layer gives the response of the network. In an RBFN, the argument of each hidden unit's activation function is the Euclidean distance between the input vector and the center of that unit16,17.
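The forward pass just described can be sketched in Python: each hidden unit applies a Gaussian activation to the Euclidean distance between the input and its centre, and the output layer combines the activations linearly. The centres, widths, and weights below are fixed by hand for illustration; in a trained network they are learned from data.

```python
import math

centers = [(0.0, 0.0), (4.0, 4.0)]  # one hidden unit per centre
widths = [1.5, 1.5]                 # Gaussian spread of each unit
out_weights = [1.0, -1.0]           # linear output layer

def rbfn(x):
    # Hidden layer: Gaussian of the distance from x to each centre
    hidden = [math.exp(-math.dist(x, c) ** 2 / (2 * s ** 2))
              for c, s in zip(centers, widths)]
    # Output layer: linear combination of the hidden activations
    return sum(w * h for w, h in zip(out_weights, hidden))

print(rbfn((0.5, 0.5)) > 0)  # True: the point is closer to the first centre
print(rbfn((3.5, 4.0)) > 0)  # False: the point is closer to the second centre
```

For binary classification, the sign of this output (or a threshold on it) gives the predicted class.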
Data Analysis
IBM SPSS Statistics Version 22.0 package program was used for the statistical analysis of the data. Categorical input variables were summarized as frequencies and percentages, and continuous input variables as means and standard deviations. Categorical input variables were compared with the output variable, the presence of stroke, using the chi-square test. Differences in the means of continuous input variables between patients with and without stroke were tested using the independent-samples t-test. The statistical significance level was set at 0.05 in all tests.
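For a 2x2 table of a categorical input against stroke status, the chi-square statistic compares observed counts with the counts expected under independence. The Python sketch below uses made-up counts, not the study's data.

```python
# Observed 2x2 table: rows = exposure groups, columns = stroke / no stroke
table = [[40, 160],   # group 1: stroke, no stroke
         [20, 180]]   # group 2: stroke, no stroke

row_tot = [sum(r) for r in table]
col_tot = [sum(c) for c in zip(*table)]
n = sum(row_tot)

# Chi-square statistic: sum of (observed - expected)^2 / expected,
# where expected = row total * column total / grand total
chi2 = sum((table[i][j] - row_tot[i] * col_tot[j] / n) ** 2
           / (row_tot[i] * col_tot[j] / n)
           for i in range(2) for j in range(2))
print(round(chi2, 2))
```

The statistic is then compared with the chi-square distribution with (rows - 1) x (columns - 1) = 1 degree of freedom to obtain the p-value.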
The RStudio software environment (R language) was used for the application and comparison of the machine learning and deep learning methods.