
Support vector machines in spectral data classification

Tuomas Sihvonen

Bachelor's degree program in chemical engineering
5.12.2013

Examiner: Satu-Pia Reinikainen


Contents

1 Introduction
2 Theory
  2.1 Hard margin support vector machines
  2.2 Soft-Margin SVMs
  2.3 Kernel methods
    2.3.1 Linear kernel
    2.3.2 Polynomial kernel
    2.3.3 Radial basis function kernel
3 Experimental
  3.1 Datasets used
  3.2 Data pretreatment
4 Results and discussion
  4.1 Effect of kernel parameters
    4.1.1 Linear kernel
    4.1.2 Polynomial kernel
    4.1.3 RBF kernel
5 Conclusion
References


1 Introduction

In chemistry and chemical engineering, as in many other branches of science and engineering, the advances in computer and measurement technologies have made it possible to gather much more information about different experiments or processes.

Then the problem is how to extract useful information from all of the data gathered.

One way of doing this is to use pattern recognition to classify the data. This is useful when the outcome is not known and we want to explore the dataset and find patterns. [1]

In classification the aim is to separate samples differing from one another by assigning different labels to them. The samples can be virtually anything, and the tasks range from separating healthy people from the sick to monitoring whether the quality of a product is satisfactory. Support vector machines (SVM) are a powerful classification tool that has been gaining popularity in chemometrics during the last 10 years. SVMs are computationally light but provide good generalization properties and can be applied to complex datasets. In chemistry, SVMs have been used especially in the biological and medical fields, where datasets are usually quite big [2, 3]. Some uses in process monitoring have also been reported [4–6]. In the theory part of this study we look at SVM theory and formulation, and in the experimental section we see how different SVM parameters affect the results obtained from an SVM.

2 Theory

The idea in support vector machines is to find a hyperplane that separates the points of different labels such that the margin between the plane and the points is as large as possible. This principle is visualized in Figure 1.

Figure 1: Two separating hyperplanes fitted to a dataset. The plane with the larger margin is the optimal one, and the filled symbols are support vectors. [7]


2.1 Hard margin support vector machines

We are going to use the hard margin support vector machine as the basis when going through the theory of support vector machines. This form of SVM can be considered the simplest. Later we will see that even the more advanced SVM variants are only small modifications of this initial theory; then we only have to look at the equations derived here and point out how they have been modified.

In hard margin SVMs we have data points $\mathbf{x}_i$ $(i = 1, \ldots, M)$ that are linearly separable and have a class label $y_i = 1$ or $y_i = -1$. Then we can place a line or a hyperplane between these points to separate them according to their labels (Fig. 1). This plane has the form $\mathbf{w}^\top\mathbf{x} + b = 0$ and it is used as the decision function:

$$D(\mathbf{x}) = \mathbf{w}^\top\mathbf{x} + b \tag{1}$$

where $\mathbf{w}$ is the normal to the plane and $b$ is the bias term. The distance between the plane and the nearest data point is called the margin. A sample is then classified according to the following rule:

$$\begin{cases} D(\mathbf{x}) < 0 & \Rightarrow \text{sample belongs to class } -1 \\ D(\mathbf{x}) > 0 & \Rightarrow \text{sample belongs to class } +1 \end{cases}$$

There is an infinite number of planes that can separate the data points. To get the optimal separating hyperplane we need to maximize the plane's distance from the data points, i.e. maximize the margin. To do this we must first calculate the distance of a point $\mathbf{x}_n$ from the plane $\mathbf{w}^\top\mathbf{x} + b = 0$. The plane is normalized so that $|\mathbf{w}^\top\mathbf{x}_n + b| = 1$ for the point closest to the plane.

$$\text{distance} = |\hat{\mathbf{w}}^\top(\mathbf{x}_n - \mathbf{x})| \tag{2}$$

where $\hat{\mathbf{w}}$ is the unit vector $\hat{\mathbf{w}} = \frac{\mathbf{w}}{\|\mathbf{w}\|}$ and $\mathbf{x}$ is a point on the plane. Substituting this, and adding and subtracting $b$ in equation (2), we get equation (3):

$$\text{distance} = \frac{1}{\|\mathbf{w}\|}\Bigl|\underbrace{\mathbf{w}^\top\mathbf{x}_n + b}_{1} - \underbrace{(\mathbf{w}^\top\mathbf{x} + b)}_{0}\Bigr| = \frac{1}{\|\mathbf{w}\|} \tag{3}$$

Here we see that the point $\mathbf{x}$ lies on the plane, and thus the second term becomes zero.

To get a plane that maximizes the margin we want to maximize this distance:

$$\max \frac{1}{\|\mathbf{w}\|} \tag{4}$$

$$\text{subject to } \min_n |\mathbf{w}^\top\mathbf{x}_n + b| = 1 \tag{5}$$

This is not an easy optimization task. We notice that we can get rid of the absolute values by multiplying by the class labels:

$$|\mathbf{w}^\top\mathbf{x}_n + b| = y_n(\mathbf{w}^\top\mathbf{x}_n + b) \tag{6}$$

To make the optimization easier we move from maximizing to minimizing and get the following problem:

$$\text{minimize } \frac{1}{2}\mathbf{w}^\top\mathbf{w} \tag{7}$$

$$\text{subject to } y_n(\mathbf{w}^\top\mathbf{x}_n + b) \geq 1 \quad \text{for } n = 1, \ldots, N \tag{8}$$

We transform this constrained optimization problem into an easier one by using the method of Lagrange multipliers:

$$L(\mathbf{w}, b, \boldsymbol{\alpha}) = \frac{1}{2}\mathbf{w}^\top\mathbf{w} - \sum_{n=1}^{N}\alpha_n\bigl(y_n(\mathbf{w}^\top\mathbf{x}_n + b) - 1\bigr) \tag{9}$$

where $\alpha_n$ are the Lagrange multipliers. We want to minimize this equation with respect to $\mathbf{w}$ and $b$ and maximize it w.r.t. each $\alpha_n \geq 0$. To achieve the minimization w.r.t. $\mathbf{w}$ we take the gradient of equation (9), giving (10), and set it to zero:

$$\nabla_{\mathbf{w}} L(\mathbf{w}, b, \boldsymbol{\alpha}) = \mathbf{w} - \sum_{n=1}^{N}\alpha_n y_n\mathbf{x}_n = 0 \tag{10}$$

from which we can solve $\mathbf{w}$:

$$\mathbf{w} = \sum_{n=1}^{N}\alpha_n y_n\mathbf{x}_n \tag{11}$$

To minimize equation (9) w.r.t. $b$ we differentiate it:

$$\frac{\partial L}{\partial b} = -\sum_{n=1}^{N}\alpha_n y_n = 0 \tag{12}$$

Equations (11) and (12) are part of the Karush–Kuhn–Tucker (KKT) conditions, and by substituting them into (9) we get

$$L(\boldsymbol{\alpha}) = \sum_{n=1}^{N}\alpha_n - \frac{1}{2}\sum_{n=1}^{N}\sum_{m=1}^{N} y_n y_m\alpha_n\alpha_m\mathbf{x}_n^\top\mathbf{x}_m \tag{13}$$

Now the equation is free of $\mathbf{w}$ and $b$ and can be maximized w.r.t. the $\alpha_n$ using quadratic programming. Here the terms $y_n y_m\mathbf{x}_n^\top\mathbf{x}_m$ are just the data points from the training set multiplied by their labels, and they can be collected into the matrix $Q$, so that in the end we have the following optimization task:

$$\text{minimize } \frac{1}{2}\boldsymbol{\alpha}^\top Q\boldsymbol{\alpha} - \mathbf{1}^\top\boldsymbol{\alpha} \tag{14}$$

$$\text{subject to } \mathbf{y}^\top\boldsymbol{\alpha} = 0, \quad \alpha_n \geq 0 \quad \text{for } n = 1, \ldots, N \tag{15}$$

After $\boldsymbol{\alpha}$ has been solved by quadratic programming, we can solve $\mathbf{w}$ from equation (11). Those $\mathbf{x}_n$ for which $\alpha_n > 0$ are called support vectors (SV); these are the points that define the separating hyperplane. It is then enough to use just the $\mathbf{x}_n$ that are support vectors to calculate $\mathbf{w}$:

$$\mathbf{w} = \sum_{\mathbf{x}_n \text{ is SV}}\alpha_n y_n\mathbf{x}_n \tag{16}$$

After $\mathbf{w}$ is solved, the bias $b$ can be solved from any support vector:

$$y_n(\mathbf{w}^\top\mathbf{x}_n + b) = 1 \tag{17}$$
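To make the optimization above concrete, the following is a minimal sketch of the hard-margin dual (14)–(15) solved with a general-purpose optimizer instead of a dedicated quadratic programming routine. The function name, the tolerance used to pick support vectors, and the toy data are illustrative assumptions, not taken from the thesis.

```python
import numpy as np
from scipy.optimize import minimize

def hard_margin_svm(X, y):
    """Solve the dual (14)-(15) and recover w (16) and b (17)."""
    N = X.shape[0]
    Q = (y[:, None] * y[None, :]) * (X @ X.T)        # Q_nm = y_n y_m x_n^T x_m

    objective = lambda a: 0.5 * a @ Q @ a - a.sum()  # (1/2) a^T Q a - 1^T a
    gradient = lambda a: Q @ a - np.ones(N)
    constraints = [{"type": "eq", "fun": lambda a: y @ a}]  # y^T a = 0
    bounds = [(0.0, None)] * N                               # a_n >= 0

    res = minimize(objective, np.zeros(N), jac=gradient, bounds=bounds,
                   constraints=constraints, method="SLSQP")
    alpha = res.x

    sv = alpha > 1e-6                        # support vectors have a_n > 0
    w = (alpha[sv] * y[sv]) @ X[sv]          # equation (16)
    b = y[sv][0] - w @ X[sv][0]              # from (17): b = y_n - w^T x_n
    return w, b, sv

# Two linearly separable toy classes
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b, sv = hard_margin_svm(X, y)
print("w =", w, "b =", b, "support vectors:", np.where(sv)[0])
```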

2.2 Soft-Margin SVMs

Hard-margin SVMs work only when the data is linearly separable, because otherwise the constraint (8) is violated. To allow some violation of the margin we add the so-called slack variable $\xi_n$ to the constraint equation. This variable tells us how deep into the margin a point is allowed to be while the plane is placed and still have the plane accepted as the optimal one.

$$y_n(\mathbf{w}^\top\mathbf{x}_n + b) \geq 1 - \xi_n \tag{18}$$

Now our minimization task (7) is altered:

$$\text{minimize } \frac{1}{2}\mathbf{w}^\top\mathbf{w} + C\sum_{n=1}^{N}\xi_n \tag{19}$$

$$\text{subject to } y_n(\mathbf{w}^\top\mathbf{x}_n + b) \geq 1 - \xi_n \quad \text{for } n = 1, \ldots, N \tag{20}$$

$$\text{and } \xi_n \geq 0 \quad \text{for } n = 1, \ldots, N \tag{21}$$

$$\boldsymbol{\xi} \in \mathbb{R}^N, \quad b \in \mathbb{R}, \quad \mathbf{w} \in \mathbb{R}^d \tag{22}$$


Now we can write the Lagrangian again for the modified target function:

$$L(\mathbf{w}, b, \boldsymbol{\xi}, \boldsymbol{\alpha}, \boldsymbol{\beta}) = \frac{1}{2}\mathbf{w}^\top\mathbf{w} + C\sum_{n=1}^{N}\xi_n - \sum_{n=1}^{N}\alpha_n\bigl(y_n(\mathbf{w}^\top\mathbf{x}_n + b) - 1 + \xi_n\bigr) - \sum_{n=1}^{N}\beta_n\xi_n \tag{23}$$

This equation is very similar to the equation for the linearly separable case (9); we have just added another Lagrange multiplier for the new variable $\xi_n$. This function is also minimized with respect to $\mathbf{w}$ and $b$ and maximized w.r.t. each $\alpha_n \geq 0$ and $\beta_n \geq 0$. When equation (23) is minimized w.r.t. $\mathbf{w}$ and $b$ we get exactly the same results as previously in equations (10) and (12). When minimized w.r.t. $\xi_n$ we get the following equation:

$$\frac{\partial L}{\partial \xi_n} = C - \alpha_n - \beta_n = 0 \tag{24}$$

From this equation we get the condition $\alpha_n \leq C$, because $\beta_n \geq 0$, and if $\alpha_n$ were greater than $C$ we could not find a $\beta_n$ that makes equation (24) hold.

When equations (10), (12) and (24) are substituted back into equation (23), we again get equation (13). The only thing that changes in the non-linearly-separable case is the added condition $\alpha_n \leq C$. So the final target function for the soft margin case is the following:

$$\text{minimize } \frac{1}{2}\boldsymbol{\alpha}^\top Q\boldsymbol{\alpha} - \mathbf{1}^\top\boldsymbol{\alpha} \tag{25}$$

$$\text{subject to } \mathbf{y}^\top\boldsymbol{\alpha} = 0, \quad 0 \leq \alpha_n \leq C \quad \text{for } n = 1, \ldots, N \tag{26}$$
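As a hedged sketch, the only change from the hard-margin dual to the soft-margin dual (25)–(26) is the box constraint on $\alpha$: the lower bound stays at zero and the upper bound becomes $C$. Reusing the same optimizer-based approach as in the hard-margin sketch above (again an illustrative implementation, not the Matlab code used in the thesis):

```python
import numpy as np
from scipy.optimize import minimize

def soft_margin_alphas(X, y, C=1.0):
    """Solve the soft-margin dual (25)-(26); only the bounds differ from the hard-margin case."""
    N = X.shape[0]
    Q = (y[:, None] * y[None, :]) * (X @ X.T)
    res = minimize(lambda a: 0.5 * a @ Q @ a - a.sum(), np.zeros(N),
                   jac=lambda a: Q @ a - np.ones(N),
                   bounds=[(0.0, C)] * N,                        # 0 <= a_n <= C, constraint (26)
                   constraints=[{"type": "eq", "fun": lambda a: y @ a}],
                   method="SLSQP")
    return res.x
```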

2.3 Kernel methods

If the data being classified is not linearly separable and a separating hyperplane cannot be found, the data can be mapped to a higher dimension where it is linearly separable with a hyperplane. This can be done by moving from the $X$ space to a $Z$ space and solving the Lagrangian there:

$$L(\boldsymbol{\alpha}) = \sum_{n=1}^{N}\alpha_n - \frac{1}{2}\sum_{n=1}^{N}\sum_{m=1}^{N} y_n y_m\alpha_n\alpha_m\mathbf{z}_n^\top\mathbf{z}_m \tag{27}$$

In fact we only need to know the inner product $\mathbf{z}_n^\top\mathbf{z}_m$ from the $Z$ space, not the vectors $\mathbf{z}$ themselves. To obtain this inner product we can form the following function, which we call the kernel:

$$\mathbf{z}^\top\mathbf{z}' = K(\mathbf{x}, \mathbf{x}') \tag{28}$$

If this kernel corresponds to an inner product in some space $Z$, it can be used as the inner product without calculating the transformation itself. This makes the computations much more economical. When calculating the Lagrangian we get

$$L(\boldsymbol{\alpha}) = \sum_{n=1}^{N}\alpha_n - \frac{1}{2}\sum_{n=1}^{N}\sum_{m=1}^{N} y_n y_m\alpha_n\alpha_m K(\mathbf{x}_n, \mathbf{x}_m) \tag{29}$$

When using a kernel, the decision function is also altered, as we need to move the data being classified into the same space as our separating plane. Computationally this is not much heavier than the normal case, where just the inner product of data points in the original space is calculated:

$$D(\mathbf{x}) = \sum_{n=1}^{N} y_n\alpha_n K(\mathbf{x}_n, \mathbf{x}) + b \tag{30}$$

The kernel function can be selected according to the classification task at hand. With it we can move to very high dimensional spaces to find the best separating plane and use it to classify our data in that space according to (30).
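As a small illustrative sketch of equation (30), the kernelized decision function needs only the support vectors, their labels and multipliers, the bias, and a kernel function; the names and argument layout below are assumptions, not taken from the thesis code.

```python
import numpy as np

def decision_function(x, X_sv, y_sv, alpha_sv, b, kernel):
    """D(x) = sum_n y_n alpha_n K(x_n, x) + b, summed over the support vectors (eq. (30))."""
    k = np.array([kernel(x_n, x) for x_n in X_sv])
    return np.sum(y_sv * alpha_sv * k) + b

def classify(x, X_sv, y_sv, alpha_sv, b, kernel):
    """Apply the sign rule: D(x) > 0 -> class +1, D(x) < 0 -> class -1."""
    return 1 if decision_function(x, X_sv, y_sv, alpha_sv, b, kernel) > 0 else -1
```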

2.3.1 Linear kernel

When the classifiable data is linearly separable, or nearly so, in the input space, mapping to a higher dimensional space is not needed. Then we use the so-called linear kernel, which is just the inner product of the input data:

$$K(\mathbf{x}, \mathbf{x}') = \mathbf{x}^\top\mathbf{x}' \tag{31}$$

2.3.2 Polynomial kernel

Polynomial kernels have the following form:

$$K(\mathbf{x}, \mathbf{x}') = (1 + \mathbf{x}^\top\mathbf{x}')^d \tag{32}$$

Here $d$ is the degree of the polynomial.

2.3.3 Radial basis function kernel

The radial basis function kernel has the following form:

$$K(\mathbf{x}, \mathbf{x}') = \exp(-\gamma\|\mathbf{x} - \mathbf{x}'\|^2) \tag{33}$$

where $\gamma$ is a positive parameter controlling the radius of the function. The value of $\gamma$ determines how far from a data point the separating plane is placed: with large $\gamma$ values the plane is near a point, and for small values the plane is further away. This can be understood by noting that as the distance between points increases, the kernel value approaches zero very rapidly; $\gamma$ as the multiplier affects how fast the value decreases and thus how far the influence of a data point reaches. This parameter can be optimized for the classification task. $\gamma$ can also be written as $\gamma = \frac{1}{2\sigma^2}$.
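For concreteness, below is a minimal sketch of the three kernels (31)–(33) as plain functions. The helper for converting $\sigma$ to $\gamma$ encodes the relation $\gamma = 1/(2\sigma^2)$ stated above; all names are illustrative rather than taken from the thesis code.

```python
import numpy as np

def linear_kernel(x, x2):
    """Equation (31): plain inner product in the input space."""
    return x @ x2

def polynomial_kernel(x, x2, d=2):
    """Equation (32): (1 + x^T x')^d with polynomial degree d."""
    return (1.0 + x @ x2) ** d

def rbf_kernel(x, x2, gamma=1.0):
    """Equation (33): exp(-gamma * ||x - x'||^2)."""
    return np.exp(-gamma * np.sum((x - x2) ** 2))

def gamma_from_sigma(sigma):
    """Convert the radius parameter sigma to gamma = 1/(2 sigma^2)."""
    return 1.0 / (2.0 * sigma ** 2)
```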

3 Experimental

The SVM algorithm was implemented in Matlab R2012a. Two existing algorithms were used as the basis of the Matlab code [8, 9].

3.1 Datasets used

The algorithm was tested on a NIR dataset consisting of 175 spectra that had been classified into two classes. Each spectrum had 1068 data points. In Figure 2 all 175 spectra are plotted in a single image. From this image the differences between spectra are hard to see, and the only area where a clear difference can be seen is between wavenumbers 400–500. The difference does not become clearer even when the classes of spectra are separated and plotted side by side, as in Figure 3.

3.2 Data pretreatment

The data represented in Figures 2 and 3 is clearly not linearly separable. This is why some form of data pretreatment is applied.

One often used multivariate method in the study of spectral data is principal component analysis (PCA). The idea is to transform the data into a set of linearly uncorrelated variables. The transformation is done so that the first principal component (PC) has the largest possible variance, i.e. it explains most of the original data. The subsequent PCs have lower variances, and they are chosen so that every PC is orthogonal to all the others.

Because of this property of PCA the first PCs should explain most of the data, and as they contain most of the variance they also contain most of the differences between the two classes. Thus principal components one and two were chosen for representing the data in plots. In this way the different classes can be observed more easily, as can be seen in Figure 4.

Figure 2: Unmodified spectral dataset (transmittance as a function of wavenumber, 1/cm).

Figure 3: The two classes of spectra separated and plotted side by side.

After the PCA transformation we have 175 PCs, the same number as the original spectra, each with a length of 88. As the dataset was fairly small to begin with, all of the PCs were used, so that no features of the data would be left out. The dataset was split in two: half of the data was used for training the SVM and the other half for testing the classification.
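A rough sketch of this pretreatment step is given below. The thesis used Matlab; here scikit-learn is used purely for illustration, and the variables `spectra` and `labels` are assumed to hold the 175 × 1068 data matrix and the corresponding class labels.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split

def pretreat(spectra, labels, n_components=None, seed=0):
    """Project the spectra onto their principal components and split the scores 50/50."""
    pca = PCA(n_components=n_components)   # None keeps all components
    scores = pca.fit_transform(spectra)    # one score vector per spectrum
    return train_test_split(scores, labels, test_size=0.5, random_state=seed)

# X_train, X_test, y_train, y_test = pretreat(spectra, labels)
```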

Figure 4: PCA-transformed training data plotted as a function of principal components 1 and 2, with the two classes (+1 and −1) marked.

4 Results and discussion

4.1 Effect of kernel parameters

All three kernels described earlier were tested on the classification of the spectral dataset. The effect of different parameters was studied for each kernel. The parameter $C$ was common to all of the kernels, although it is more a parameter of the soft margin SVM than of the kernels. This parameter determines the trade-off between maximizing the margin and minimizing the classification error on the training data. When $C$ is high the SVM aims to classify all the training samples correctly and the resulting decision border becomes more complex. Similarly, when $C$ is small the border becomes smoother, as more error is allowed. This can also be seen from equation (23), where $C$ is the multiplier of the slack variables.

4.1.1 Linear kernel

This kernel is the simplest one to use in the classification: it is just an inner product of the data points in the input space. When using this kernel, the only parameter adjusting the SVM's behaviour is the parameter $C$.


The dataset was classified while varying the parameter $C$ from 0 to 1. The performance of the classification was evaluated by calculating how well the test set was classified. In Figure 5 the classification error is presented as a function of $C$. We can see that after a $C$ value of 0.5 the error is at its minimum of 6.7 %.

Figure 5: Classification error (percentage of misclassified test samples) as a function of $C$ when using the linear kernel.
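A sketch of the kind of $C$ sweep behind Figure 5 is shown below, using scikit-learn's `SVC` with a linear kernel in place of the thesis' own Matlab implementation; `X_train`, `X_test`, `y_train` and `y_test` are assumed to come from the pretreatment step above, and the exact grid of $C$ values is an assumption.

```python
import numpy as np
from sklearn.svm import SVC

def linear_c_sweep(X_train, y_train, X_test, y_test,
                   c_values=np.linspace(0.01, 1.0, 100)):
    """Return the test-set misclassification percentage for each value of C."""
    errors = []
    for C in c_values:
        clf = SVC(kernel="linear", C=C).fit(X_train, y_train)
        errors.append(100.0 * np.mean(clf.predict(X_test) != y_test))
    return np.array(errors)
```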

In Figure 6 the separating line is drawn on the test data for various $C$ values. It should be noted that the line in Figure 6 is a projection of the separating plane fitted to the dataset, which after PCA is 88-dimensional. Still, this image can give us a visual indication of how the classification is working.

Figure 6: Test dataset (PC 1 vs. PC 2) with the separating lines for $C$ values from 0.01 to 0.1.

4.1.2 Polynomial kernel

In a sense the polynomial kernel has two parameters: the parameter $C$ and the power of the polynomial $d$ (eq. (32)). Of course it could also be interpreted that when the power of the polynomial is changed, the kernel itself changes. Polynomial kernels were unable to classify data that was not first transformed through PCA.

In Figure 7 the classification error is represented as a function of $C$ and $d$. We can see that the lowest error (6.9 %) is achieved when the power of the polynomial is one, for which the polynomial kernel essentially reduces to the linear kernel. Another thing noticeable in Figure 7 is that odd-numbered powers give smaller errors.
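A corresponding sketch for the polynomial kernel is given below, sweeping the degree $d$ and the value of $C$ and recording the test error (again with scikit-learn standing in for the thesis' Matlab code; the parameter ranges are assumptions). Setting `gamma=1.0` and `coef0=1.0` makes scikit-learn's polynomial kernel match the form $(1 + \mathbf{x}^\top\mathbf{x}')^d$ of equation (32).

```python
import numpy as np
from sklearn.svm import SVC

def poly_error_surface(X_train, y_train, X_test, y_test,
                       degrees=tuple(range(1, 11)),
                       c_values=np.linspace(0.01, 1.0, 20)):
    """Test-set misclassification (%) for each (degree, C) pair; rows index the degree."""
    errors = np.zeros((len(degrees), len(c_values)))
    for i, d in enumerate(degrees):
        for j, C in enumerate(c_values):
            clf = SVC(kernel="poly", degree=d, coef0=1.0, gamma=1.0, C=C)
            clf.fit(X_train, y_train)
            errors[i, j] = 100.0 * np.mean(clf.predict(X_test) != y_test)
    return errors
```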

Figure 7: Classification error (percentage of misclassified test samples) as a function of the power of the polynomial kernel and the value of $C$.

4.1.3 RBF kernel

The RBF kernel has a parameter $\sigma$ that controls the radius of the function, that is, how close to the data points the margins are set.

The lowest classification errors were achieved using the RBF kernel; at its lowest the error was 2.3 %. Such a low error is most likely caused by overfitting during SVM training. For example, comparing Figures 6 and 9, we see that the RBF kernel has found the data points lying in the middle of what seems to be the other class.

Figure 8: Classification error as a function of the values of $\sigma$ and $C$.
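An error surface like the one in Figure 8 could be computed with a grid over $\sigma$ and $C$, as in the hedged sketch below. Since scikit-learn's RBF kernel is parameterized by $\gamma$ rather than $\sigma$, the conversion $\gamma = 1/(2\sigma^2)$ from Section 2.3.3 is applied; the parameter ranges are illustrative guesses based on the figures, not values taken from the thesis.

```python
import numpy as np
from sklearn.svm import SVC

def rbf_error_surface(X_train, y_train, X_test, y_test,
                      sigmas=np.linspace(0.2, 1.0, 9),
                      c_values=np.linspace(0.01, 1.0, 20)):
    """Test-set misclassification (%) for each (sigma, C) pair; rows index sigma."""
    errors = np.zeros((len(sigmas), len(c_values)))
    for i, sigma in enumerate(sigmas):
        for j, C in enumerate(c_values):
            clf = SVC(kernel="rbf", C=C, gamma=1.0 / (2.0 * sigma ** 2))
            clf.fit(X_train, y_train)
            errors[i, j] = 100.0 * np.mean(clf.predict(X_test) != y_test)
    return errors
```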

The class borders for different values of $C$ are presented in Figure 9. The behaviour of the parameter $C$ can be seen better in this figure than in the case of the linear kernel. Here we can see how the border stretches further, covering more points, as the value of $C$ increases. This illustrates that the RBF kernel can produce very complex borders, to the point that even single samples have been given their own borders. With this kind of border, just looking at the classification error can be dangerous, as there is a great chance of overfitting.

Figure 9: Decision boundaries on the test data (PC 1 vs. PC 2) for $C$ values from 0.04 to 0.1 when $\sigma = 0.5$.

The effect of the parameter $\sigma$ can be seen in Figure 10. For lower $\sigma$ values the border is very close to the data points; in some cases the border goes individually around a single data point. This clearly overfits the data, as it seems unlikely that a new data point would fall into such a tightly confined space. As the $\sigma$ values increase, so does the distance from the data points to the border. Further increase of the parameter would most likely cause the border to move so that points would begin to be misclassified. In fact this can be seen in Figure 8, where the error starts to increase at higher $\sigma$ values. The RBF kernel was also able to classify the data even without the PCA pretreatment, but getting any kind of visual interpretation from this is quite impossible because of the high dimensionality.

Figure 10: Decision boundaries on the test data (PC 1 vs. PC 2) for $\sigma$ values from 0.2 to 1 when $C = 0.5$.

5 Conclusion

Support vector machines are a classification tool that has been gaining popularity in the field of chemometrics. The theoretical background given in this work helps users understand how SVMs work and what their limitations and strengths are. The functionality of SVMs was further explored in the experimental section, where it was shown how data derived from the chemical industry can be classified. Different kernels were tested on the data, and for each kernel the effect of different parameters on classification performance was studied.

Linear and RBF kernels were the most successful in classifying the data used. Of these two, the linear kernel would be the better choice: it has only one parameter to optimize and, in the case of this data, it seems to ignore the outliers in the training data. The RBF kernel gave the lowest classification errors, but this was due to overfitting. The RBF kernel was also the only one capable of classifying the data when it was not transformed to principal components.


References

[1] Richard G. Brereton. Chemometrics for Pattern Recognition. Wiley, 2009.

[2] R. Burbidge, M. Trotter, B. Buxton, and S. Holden. Drug design by machine learning: support vector machines for pharmaceutical data analysis. Computers & Chemistry, 26(1):5–14, 2001.

[3] Isabelle Guyon, Jason Weston, Stephen Barnhill, and Vladimir Vapnik. Gene selection for cancer classification using support vector machines. Machine Learning, 46(1–3):389–422, 2002.

[4] Olivier Devos, Gerard Downey, and Ludovic Duponchel. Simultaneous data pre-processing and SVM classification model selection based on a parallel genetic algorithm applied to spectroscopic data of olive oils. Food Chemistry, 148:124–130, 2014.

[5] Manabu Kano and Yoshiaki Nakagawa. Data-based process monitoring, process control, and quality improvement: Recent developments and applications in steel industry. Computers & Chemical Engineering, 32(1–2):12–24, 2008.

[6] Yingwei Zhang. Enhanced statistical analysis of nonlinear processes using KPCA, KICA and SVM. Chemical Engineering Science, 64(5):801–811, 2009.

[7] Shigeo Abe. Support Vector Machines for Pattern Classification. Springer, 2010.

[8] Anton Schwaighofer. Support vector machine toolbox for Matlab, January 2002.

[9] Simon Rogers and Mark Girolami. A First Course in Machine Learning: accompanying material. Website, August 2012.
