
2.4 Data mining methods

2.4.1 Classification Methods

In classification, data is assigned to one of several predefined class labels. The word ‘class’ in classification refers to a characteristic of a data set that is of interest to the user; in statistics it is known as the dependent variable. To classify data, the classification method generates a classification model which consists of characterization rules (Yoo et al. 2012). For example, in financial institutions classification methods can be used as a knowledge discovery tool to identify trends in the market (Fayyad et al. 1996). In healthcare, these methods can help relate ailments to the observed symptoms for diagnosis and prognosis (Yoo et al. 2012).

Classification involves a two-step process: the first step is training and the second is testing. In the training step, a classification model consisting of classifying rules is built by analyzing training data that contains class labels. Some classifiers use mathematical formulas instead of IF-THEN rules to improve accuracy. In the testing step, the classifier is examined for its accuracy, or its capability to categorize unidentified items (records) for prediction. The testing step is a computationally easy and cheap process in comparison to the training step (Yoo et al. 2012).
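As a minimal illustration of this two-step process, the following Python sketch (assuming the scikit-learn library and its bundled breast cancer data set, used here purely as an example) builds a classifier from labelled training records and then tests it on records held out from training:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# A labelled data set: X holds the attributes, y the class labels.
X, y = load_breast_cancer(return_X_y=True)

# Step 1 (training): build a classification model from labelled data.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Step 2 (testing): examine the classifier's accuracy on unseen records.
predictions = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, predictions))
```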

The Naïve Bayesian Classifier is a simple and efficient classification method. The term ‘naïve’ indicates the assumption that data attributes are independent; this assumption is known as conditional independence (Wang et al. 2007). This classifier is built on the Bayes theorem and is a probabilistic statistical classifier. One of the benefits of the naïve Bayesian classifier is its ease of use, as it is one of the simplest of the classification algorithms. As a result, it can easily handle a data set with different features (Yoo et al. 2012). “This classifier only requires a small amount of training data to develop accurate parameter estimations because it requires only the calculation of frequencies of attributes and attribute outcome pairs in the training data set” (Yoo et al. 2012).
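For reference, Bayes' theorem, on which the classifier is built, can be written as

$$P(C \mid X) = \frac{P(X \mid C)\,P(C)}{P(X)},$$

where $C$ is a class label and $X$ a set of attribute values; under the conditional independence assumption, $P(X \mid C)$ factorizes into the product $\prod_i P(x_i \mid C)$ over the individual attributes $x_i$.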

The main disadvantage associated with this algorithm is the underlying assumption that all attributes are independent of each other; in many cases this fundamental assumption is unrealistic (Yoo et al. 2012).

A typical example is in the medical field, where several health issues are related to one another, for example body mass index and blood pressure. This may cause anomalies in the generated classification. Overall, the Bayesian classifier provides accurate results when used for classification and is a very popular method in medical data mining (Yoo et al. 2012).
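A minimal naïve Bayesian sketch in Python (assuming scikit-learn; its GaussianNB implementation handles continuous attributes, and the data set is again only illustrative):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = load_breast_cancer(return_X_y=True)

# GaussianNB applies Bayes' theorem under the 'naive' assumption
# that attributes are conditionally independent given the class.
nb = GaussianNB()

# Naive Bayes needs no parameter tuning: training only estimates
# per-class attribute statistics from the training data.
scores = cross_val_score(nb, X, y, cv=5)
print("Mean cross-validated accuracy:", scores.mean())
```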

Neural networks emerged during the 20th century and were considered the most effective classification algorithm before other algorithms, such as decision trees and support vector machines (SVM), were introduced. As one of the earliest classification algorithms in use, they have been applied widely in fields such as healthcare and biomedicine. For example, neural networks have been the algorithm of choice for diagnosing diseases like cancer and have also been used in the prediction of outcomes (Yoo et al. 2012).

Neural networks are computer programs built to mirror the neurological activity of the brain. They are made up of computational nodes that mimic how the neurons in the brain function. These neurons, or nodes, are linked to other nodes through connections with adjustable weights. When the neural network is learning, or being trained, the connection weights are adjusted. The nodes in a neural network can be organized into two layers, input and output, or in some cases into three layers: input, hidden and output (Yoo et al. 2012).

For example, a network could be designed which links a set of observations to a set of diagnoses. Each input node would be assigned to a different datum, and each output node would similarly be assigned to a corresponding diagnosis. The identified observations are then presented to the network, and the output node most stimulated by the input data is preferentially fired, thus producing a diagnosis (Coiera, 2015).

The knowledge derived from the observations and diagnoses is saved within the connections of the network. The main idea behind neural networks is inspired by the reaction of neurons once a certain level of activation has been attained: a node in the network fires when the sum of its inputs exceeds a pre-determined threshold.

“These inputs are determined by the number of input connections that have been fired and the weights upon those connections. Thus when a network is presented with a pattern on its input nodes, it will output a recognition pattern determined by the weights on the connections between layers” (Coiera, 2015).
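As a concrete sketch of this firing rule (a simplified threshold unit written for illustration, not an implementation from the cited sources):

```python
def node_fires(inputs, weights, threshold):
    """A node fires when the weighted sum of its inputs
    exceeds a pre-determined threshold."""
    activation = sum(i * w for i, w in zip(inputs, weights))
    return activation > threshold

# Two of three input connections fired (1), with learned weights.
print(node_fires([1, 0, 1], [0.4, 0.9, 0.3], threshold=0.5))  # True
```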

The most frequently utilized neural network is the multi-layer perceptron with back-propagation, which is available in Weka and is said to perform better than the other neural network algorithms (Yoo et al. 2012).
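A minimal multi-layer perceptron sketch in Python (using scikit-learn's MLPClassifier rather than Weka; the single hidden layer of 20 nodes is an illustrative choice):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Neural networks are sensitive to feature scale, so standardize first.
scaler = StandardScaler().fit(X_train)

# Input, one hidden (20 nodes) and output layer; connection weights
# are adjusted by back-propagation during training.
mlp = MLPClassifier(hidden_layer_sizes=(20,), max_iter=1000, random_state=0)
mlp.fit(scaler.transform(X_train), y_train)
print("Test accuracy:", mlp.score(scaler.transform(X_test), y_test))
```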

FIGURE 4: NN architecture (Aguiar-Pulido et al. 2013)

Although neural networks have been widely used, they still have several limitations. Firstly, the learning or training process is usually slow, because it takes time to find the parameters to be used, owing to the number of different combinations, and it is computationally expensive. Secondly, a neural network is limited in its ability to explain the conclusions it reaches; for this reason health professionals cannot comprehend how it arrives at its classification decisions, unlike with other algorithms such as decision trees. Thirdly, it needs a large number of parameters, and its performance is sensitive to the parameters that have been chosen. Lastly, its classification accuracy falls below that of more recently developed classification algorithms such as support vector machines and decision trees (Yoo et al. 2012).

Decision trees are a popular classification algorithm in data mining and machine learning due to their ease of understanding. They are useful in modelling when the aim is to comprehend the fundamental processes of the environment, and in cases where the data do not meet the assumptions required by more traditional methods (Czajkowski et al. 2014).

Decision tree algorithms were introduced by Ross Quinlan in 1979. The most widely used decision tree algorithm is C4.5, which replaced the Iterative Dichotomiser 3 (ID3) (Yoo et al. 2012).

Decision trees consist of decision nodes joined together by branches that extend in a top-down manner from the root node and terminate at the leaf nodes. The process starts at the root node, which is commonly placed at the top of the decision tree diagram; each attribute is tested at a decision node, with each outcome forming a branch. The newly formed branch then leads either to another decision node or to a terminating leaf node (Larose & Larose, 2015).

For example, the diagram below shows the family history of lung cancer in a decision tree, in which the family history is the root node. When a decision tree is expressed as IF-THEN rules, it is a classification model. This means that the construction of a decision tree can be regarded as an important part of the training process (Yoo et al. 2012).


One of the benefits of the decision tree is the ability to visualize data in a class-focused way. This allows the user to easily comprehend the data structure and to readily observe which attributes affect the class. A limitation of the decision tree is that when there are too many attributes, the tree becomes difficult to understand. Such a complex decision tree can be resolved by tree pruning methods, which use statistical measures to remove the least important branches, helping the user to focus on and work with the more important attributes (Yoo et al. 2012).

FIGURE 5: Decision tree example (Yoo et al. 2012)
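A minimal decision tree sketch in Python (scikit-learn's DecisionTreeClassifier; its cost-complexity parameter ccp_alpha is used here as an illustrative stand-in for the statistical pruning described above):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# An unpruned tree can grow too complex to interpret.
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Pruning removes the least important branches.
pruned_tree = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0)
pruned_tree.fit(X_train, y_train)

print("Full tree depth:  ", full_tree.get_depth())
print("Pruned tree depth:", pruned_tree.get_depth())
# The pruned tree reads as IF-THEN rules from root to leaf nodes.
print(export_text(pruned_tree))
```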

The support vector machine (SVM) came into existence in the early 1990s, and the extended SVM in the mid-1990s; it was developed at AT&T Bell Laboratories by Vladimir Vapnik and his co-workers. The foundational work for the SVM was carried out by Vapnik and his colleagues in 1963. The SVM is grounded in statistical learning theory, and its design is focused on solving two-class classification problems (Yoo et al. 2012).

For example, safe vs. risky. The early and later versions of the SVM differed somewhat: the former provided only a linear kernel function, while the latter supplied non-linear kernel functions, e.g. polynomial and radial basis functions, which helped strengthen the classification accuracy (Yoo et al. 2012).

“The strategic basis of the SVM is when a dataset is represented in a high dimensional feature space, it searches for the ideal separating hyperplane where the margin amidst two unique objects is maximal. Hyperplanes are decision limits between two unique set of objects. To discover the hyperplane that has the maximal margin, the SVM makes use of support vectors and the margin can be identified by the use of two support vectors” (Yoo et al. 2012).

One of the main benefits of the SVM is the accuracy of its classification, although it should be noted that it is not the preferred or best technique for every dataset. One limitation of the SVM is that multiple kernel functions are provided, and no single kernel performs equally well on every data set. Secondly, the SVM is modelled to solve two-class classification problems; a multiclass classification problem is resolved by reducing it to multiple binary problems (Yoo et al. 2012).

FIGURE 6: Example of SVM. Safe vs Risky (Yoo et al. 2012)
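A minimal SVM sketch in Python (scikit-learn's SVC; the radial basis function kernel and the feature scaling step are illustrative choices):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A non-linear (radial basis function) kernel; kernel="linear" or
# kernel="poly" mirror the other kernel functions mentioned above.
svm = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
svm.fit(X_train, y_train)
print("Test accuracy:", svm.score(X_test, y_test))
```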

The ensemble approach is a data mining technique that falls under classification. Its logic is based on the fact that multiple classifiers working together can achieve better classification accuracy than a single classifier. For example, if three classifiers A, B and C predict that a patient who is difficult to classify has lung cancer, while two other classifiers D and E predict that the patient does not have cancer, then by the voting strategy the patient is identified as having lung cancer (Yoo et al. 2012).


When the ensemble approach is used, researchers can be more confident that the prediction results obtained are reliable. There are three kinds of ensemble approach: bagging, boosting and random subspace. Further research has shown that classification performance can be optimized by using meta-ensemble approaches (Yoo et al. 2012).
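A minimal majority-voting ensemble sketch in Python (scikit-learn's VotingClassifier; the particular three base classifiers are an illustrative choice):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each base classifier casts a vote; the majority decides the class.
ensemble = VotingClassifier(
    estimators=[
        ("tree", DecisionTreeClassifier(random_state=0)),
        ("nb", GaussianNB()),
        ("svm", SVC()),
    ],
    voting="hard",  # hard voting = simple majority of predicted labels
)
ensemble.fit(X_train, y_train)
print("Test accuracy:", ensemble.score(X_test, y_test))
```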

AdaBoost, or adaptive boosting, invented by Schapire and Freund around 1997, is a boosting ensemble method. The basic idea behind AdaBoost was first released in an abstract in 1995. In recent times it has gained popularity due to its ability to provide high-quality classification performance. AdaBoost has been able to outperform other classification techniques such as SVM, and for this reason it is the most used ensemble method (Yoo et al. 2012).

A major attribute that allows it to achieve superior classification is weighted majority voting. The logic is that classifiers that produce better classification results during the iterative training process are given a greater voting weight than the rest in the final classification decisions (Yoo et al. 2012).
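A minimal AdaBoost sketch in Python (scikit-learn's AdaBoostClassifier over its default decision-tree stumps; the number of boosting rounds is an illustrative choice):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each boosting round fits a new weak learner, focusing on the
# examples earlier rounds misclassified; learners that classify
# well receive a greater weight in the final weighted majority vote.
boost = AdaBoostClassifier(n_estimators=50, random_state=0)
boost.fit(X_train, y_train)
print("Test accuracy:", boost.score(X_test, y_test))
```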