In this final section of the pre-processing stage, we generated a multitude of feature sub-sets that became the inputs to the classification stage. We created the feature sub-sets in three ways: manually, semi-manually, and algorithmically. In this section, we elaborate on each feature selection strategy. The entire feature selection and sub-set generation pipeline is shown in figure 15 at the end of the section.
3.4.1 Manual Selection
In the manual selection case, we devised two feature sets and generated further sub-sets from their statistical summaries. When selecting only feature mean values, we code-named the sub-set “Feature sub-set (μ)”; when selecting only standard deviation values, “Feature sub-set (σ)”; and when both summaries were used, we combined the code-names into “Feature sub-set (μ, σ)”. Tables 15 and 16 show each manually selected feature sub-set, its dimensionality, and a description.
TABLE 15. The “All Features” sets, comprising all features and every summary-statistic sub-set.
“All Features” Sets Dimensionality Description
“All Features (μ)” 63 Only mean values of all features.
“All Features (σ)” 63 Only standard deviation values of all features.
“All Features (μ, σ)” 126 Mean and standard deviation values of all features.
TABLE 16. The “individual feature” sets, comprising each individual feature and both summary statistics.
“Individual feature” Sets Dimensionality Description
“SB-Entropy (μ, σ)” 20 Mean and standard deviation values of 10 sub-band entropies.
“SB-Flux (μ, σ)” 20 Mean and standard deviation values of 10 sub-band fluxes.
“SB-Kurtosis (μ, σ)” 20 Mean and standard deviation values of 10 sub-band kurtosis values.
“SB-ZCR (μ, σ)” 20 Mean and standard deviation values of 10 sub-band zero-crossing rates.
“SB-Skewness (μ, σ)” 20 Mean and standard deviation values of 10 sub-band skewness values.
“MFCCs (μ, σ)” 26 Mean and standard deviation values of 13 MFCCs.
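To make the construction of these manual sub-sets concrete, the sketch below slices a hypothetical 126-column “All Features (μ, σ)” matrix into its (μ), (σ), and per-feature sub-sets. The column layout assumed here (63 means followed by 63 standard deviations, with the 13 MFCC means occupying the first columns) is an illustrative assumption, not the actual column order used in the study.

```python
import numpy as np

# Hypothetical feature matrix: one row per recording, 126 columns laid out as
# 63 feature means followed by 63 standard deviations ("All Features (mu, sigma)").
# Both the data and the column ordering are placeholders for illustration.
rng = np.random.default_rng(0)
all_features_mu_sigma = rng.normal(size=(100, 126))

all_features_mu = all_features_mu_sigma[:, :63]     # "All Features (mu)"
all_features_sigma = all_features_mu_sigma[:, 63:]  # "All Features (sigma)"

# An "individual feature" sub-set, e.g. "MFCCs (mu, sigma)": 13 mean columns
# plus the 13 matching standard-deviation columns (indices are hypothetical).
mfcc_mu_cols = np.arange(0, 13)
mfcc_sigma_cols = mfcc_mu_cols + 63
mfccs_mu_sigma = all_features_mu_sigma[:, np.concatenate([mfcc_mu_cols, mfcc_sigma_cols])]

print(all_features_mu.shape)   # (100, 63)
print(mfccs_mu_sigma.shape)    # (100, 26)
```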
3.4.2 Semi-Manual Selection
In semi-manual selection, we devised the “Top 2” feature sub-set from the classification performance ranking of the “individual feature” sets. This selection design deals with only two features; accordingly, the codes μ1, σ1 refer to the mean and standard deviation of the best-performing individual feature, while μ2, σ2 refer to the same statistics for the second-best feature. Table 17 displays the composition of our “Top 2” semi-manual selection design.
TABLE 17. The semi-manual feature selection subsets, containing the top two performing individual features.
Top 2 Feature Sub-Set
3.4.3 Algorithmic Selection
Algorithmic feature selection, or simply “feature selection”, plays an important role in automatically ranking and selecting relevant features for a classification task. Often, a portion of the features involved in machine learning is not informative or relevant to the classification task. Indeed, a high number of irrelevant features may increase the complexity of a model and even degrade classification and computational performance by increasing overfitting or imposing other unwanted effects. Feature selection algorithms help to combat such effects. Importantly, feature selection should not be confused with dimensionality reduction: both reduce the number of features for a given task, but dimensionality reduction methods may produce new features derived from the initial feature set, whereas feature selection methods do not.
There are three types of feature selection methods: filter methods, wrapper methods, and embedded methods. Filter methods employ statistical measures to score each feature, either with respect to the dependent variable or independently. In contrast, wrapper methods generate feature combinations and compare the classification accuracy of such combinations directly via the classification stage. Embedded methods identify feature contributions to classification accuracy within and during the classification process. A popular class of embedded methods is regularization techniques, often used to penalize classification and regression algorithms so that they reduce over-reliance on specific features, which in turn often reduces overfitting.
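To illustrate the wrapper idea, here is a minimal sketch, assuming a toy dataset and a stand-in nearest-centroid classifier (neither is from the original study): it exhaustively evaluates every k-feature combination by held-out accuracy and keeps the best-scoring one.

```python
import itertools
import numpy as np

def nearest_centroid_accuracy(X_train, y_train, X_test, y_test):
    """Held-out accuracy of a minimal nearest-centroid classifier,
    standing in for the real classification stage."""
    classes = np.unique(y_train)
    centroids = np.stack([X_train[y_train == c].mean(axis=0) for c in classes])
    dists = np.linalg.norm(X_test[:, None, :] - centroids[None, :, :], axis=2)
    preds = classes[np.argmin(dists, axis=1)]
    return float(np.mean(preds == y_test))

def wrapper_select(X_train, y_train, X_test, y_test, k):
    """Wrapper method: score every k-feature combination and keep the best."""
    best_cols, best_acc = None, -1.0
    for cols in itertools.combinations(range(X_train.shape[1]), k):
        acc = nearest_centroid_accuracy(
            X_train[:, cols], y_train, X_test[:, cols], y_test)
        if acc > best_acc:
            best_cols, best_acc = cols, acc
    return best_cols, best_acc

# Toy data: columns 0 and 1 carry the class signal, the rest are noise.
rng = np.random.default_rng(42)
y = rng.integers(0, 2, size=200)
X = rng.normal(size=(200, 6))
X[:, 0] += 4 * y
X[:, 1] -= 4 * y
cols, acc = wrapper_select(X[:100], y[:100], X[100:], y[100:], k=2)
print(cols, acc)  # the informative columns should dominate the best pair
```

Note the cost: the exhaustive search grows combinatorially with the number of features, which is why wrapper methods are usually paired with greedy search strategies in practice.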
The choice of feature selection algorithm depends heavily on the understanding of a given problem. There is no “one size fits all” selection algorithm; selection algorithms often produce completely different rankings for the same input features. In the absence of deep problem understanding, it is common to evaluate multiple selection methods and, at times, to aggregate the ranks of multiple rankings. In our study, we employed a filter method using the Information Gain (IG) algorithm, which we detail below. The resulting feature sub-set was code-named “Information Gain Top 20”, referring to the top 20 features suggested by IG when given the “All Features (μ, σ)” feature set as input. We implemented IG in Python with the Orange3 library (Demšar et al., 2013).
3.4.4 Information Gain
Information Gain (IG) is a filter method computed for each feature with respect to the class labels (Liu & Motoda, 1998); it relies heavily on Shannon's information entropy (Shannon, 2001). To obtain IG(X|Y), where X and Y are random variables, let us first consider the formula for the information entropy H of variable X:
H(X) = −∑_{x∈X} p(x) log₂(p(x))
where p(x) is the marginal probability density function of X. When we introduce variable Y and partition the values of X with respect to the values of Y, a relationship between X and Y exists only when the entropy of X conditioned on Y is smaller than the initial entropy of X. The entropy of X conditioned on Y is given by:
H(X|Y) = −∑_{y∈Y} p(y) ∑_{x∈X} p(x|y) log₂(p(x|y))
where p(y) is the marginal probability density function of variable Y and p(x|y) is the conditional probability of x given y. After the conditional entropy step, information gain is defined as the amount of entropy decrease in X, representing the surplus information provided by Y about X, formally defined as:
IG(X|Y) = H(X) − H(X|Y)
Importantly, IG(X|Y) = IG(Y|X), because information gain is symmetric in the two variables (Yu & Liu, 2003).
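For discrete variables, the definitions above translate directly into a few lines of NumPy. This standalone sketch is illustrative only (it is not the Orange3 implementation used in the study, and continuous features would first need to be discretized):

```python
import numpy as np

def entropy(values):
    """Shannon entropy H(X) of a discrete variable, in bits."""
    _, counts = np.unique(values, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def information_gain(x, y):
    """IG(X|Y) = H(X) - H(X|Y) for discrete x (class labels) and y (feature)."""
    h_x_given_y = 0.0
    for v in np.unique(y):
        mask = y == v
        # p(y=v) times the entropy of x within that partition of y
        h_x_given_y += mask.mean() * entropy(x[mask])
    return entropy(x) - h_x_given_y

# Toy check: a feature mirroring the labels carries 1 bit of information,
# a feature independent of the labels carries none.
labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])
feat_informative = np.array([0, 0, 0, 0, 1, 1, 1, 1])
feat_irrelevant = np.array([0, 1, 0, 1, 0, 1, 0, 1])

print(information_gain(labels, feat_informative))  # 1.0
print(information_gain(labels, feat_irrelevant))   # 0.0
```

Scoring every feature this way and sorting descending yields the kind of ranking from which a top-20 sub-set can be cut.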
3.4.5 Feature Selection Overview
To summarize, figure 15 maps the flow of our feature selection sets, along with their statistical-summary sub-sets, into the classification stage.
FIGURE 15. The flow of feature selection subsets with respect to the classification stage.