
582669 Supervised Machine Learning (Spring 2014) Homework 5, sample solutions

Credit for the solutions goes mainly to Panu Luosto and Joonas Paalasmaa, with some additional contributions by Jyrki Kivinen.

Problem 1

This exercise is of course a drill: there are simpler ways of finding the minimum of $x^2$ in a closed interval.

First part. Let $f(x) = x^2$ and $p(x) = (x-2)^2 - 1$. We are asked to minimize $f(x)$ with the constraint $p(x) \le 0$. Both functions are clearly convex, and the Karush-Kuhn-Tucker conditions of this convex optimization problem are
$$
f'(x) + \lambda p'(x) = 0, \qquad p(x) \le 0, \qquad \lambda \ge 0, \qquad \lambda p(x) = 0,
$$
or equivalently
\begin{align*}
2x + \lambda(2x - 4) &= 0 && \text{(i)} \\
(x-2)^2 - 1 &\le 0 && \text{(ii)} \\
\lambda &\ge 0 && \text{(iii)} \\
\lambda\bigl((x-2)^2 - 1\bigr) &= 0. && \text{(iv)}
\end{align*}

The solutions of (iv) are $\lambda = 0$, $x = 3$ or $x = 1$. If $\lambda = 0$, we get $x = 0$ from (i), and condition (ii) is not true. Otherwise, if $x = 3$, equation (i) has the form $6 + 2\lambda = 0$, which means that condition (iii) is violated. We are left with $x = 1$. Then (i) yields $2 - 2\lambda = 0 \Leftrightarrow \lambda = 1$, and the inequalities (ii) and (iii) are true. In other words, $f(x)$ is minimized subject to $p(x) \le 0$ when $x = 1$.

The Lagrangian of the problem is
$$
L(x, \lambda) = f(x) + \lambda p(x) = x^2 + \lambda\bigl((x-2)^2 - 1\bigr) = (1+\lambda)x^2 - 4\lambda x + 3\lambda,
$$
and the dual is
$$
g(\lambda) = \inf_{x \in \mathbb{R}} L(x, \lambda) = \inf_{x \in \mathbb{R}} \bigl((1+\lambda)x^2 - 4\lambda x + 3\lambda\bigr).
$$
The dual problem is to maximize $g(\lambda)$ in the set $\lambda \in [0, \infty[$. Let $h(x) = (1+\lambda)x^2 - 4\lambda x + 3\lambda$. Because $\lambda \ge 0$, the graph of $h$ is a convex parabola, and the minimum of $h$ is at the point where $h'(x) = (2 + 2\lambda)x - 4\lambda = 0 \Rightarrow x = 2\lambda/(1+\lambda)$. Thus
$$
g(\lambda) = (1+\lambda)\frac{4\lambda^2}{(1+\lambda)^2} - \frac{8\lambda^2}{1+\lambda} + 3\lambda
= \frac{4\lambda^2 - 8\lambda^2 + 3\lambda + 3\lambda^2}{1+\lambda}
= \frac{3\lambda - \lambda^2}{1+\lambda}.
$$


With some more calculations we can show that the dual $g$ is maximized at $\lambda = 1$, with $g(1) = 1$.
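Indeed, differentiating the dual gives
$$
g'(\lambda) = \frac{(3 - 2\lambda)(1+\lambda) - (3\lambda - \lambda^2)}{(1+\lambda)^2}
= \frac{3 - 2\lambda - \lambda^2}{(1+\lambda)^2}
= -\frac{(\lambda - 1)(\lambda + 3)}{(1+\lambda)^2},
$$
which is positive for $0 \le \lambda < 1$ and negative for $\lambda > 1$, so on $[0, \infty[$ the maximum is attained at $\lambda = 1$, where $g(1) = (3 - 1)/2 = 1$.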

Our optimization problem is convex, and $p(2) < 0$; in other words, there is $x \in \mathbb{R}$ such that $p(x)$ is strictly less than zero. Therefore, according to Theorem 3.4 of the lecture slides, strong duality holds, and $g(\lambda) = f(x)$, where $\lambda$ and $x$ are the solutions of the dual and primal optimization problems, respectively.

Second part. Let $f(x) = x^2$ and let $p(x) = (x-2)^2 - 8$. Again, we want to minimize the convex function $f$ subject to $p(x) \le 0$, where $p$ is also a convex function. The KKT conditions of this convex optimization problem are
\begin{align*}
2x + \lambda(2x - 4) &= 0 && \text{(i)} \\
(x-2)^2 - 8 &\le 0 && \text{(ii)} \\
\lambda &\ge 0 && \text{(iii)} \\
\lambda\bigl((x-2)^2 - 8\bigr) &= 0. && \text{(iv)}
\end{align*}

From (iv) and (i) we easily find the solution $x = \lambda = 0$. In order to check for other possible solutions (for practice only), we also get $x = 2 + 2\sqrt{2}$ and $x = 2 - 2\sqrt{2}$ from (iv). But then solving $\lambda = x/(2 - x)$ from (i) shows that condition (iii) is violated in these two cases. So the solution of the problem is $x = 0$.

The Lagrangian of the problem is
$$
L(x, \lambda) = f(x) + \lambda p(x) = x^2 + \lambda\bigl((x-2)^2 - 8\bigr) = (1+\lambda)x^2 - 4\lambda x - 4\lambda,
$$
and the dual is
$$
g(\lambda) = \inf_{x \in \mathbb{R}} L(x, \lambda) = \inf_{x \in \mathbb{R}} \bigl((1+\lambda)x^2 - 4\lambda x - 4\lambda\bigr).
$$
Finding the infimum goes along similar steps as in the first part of the exercise, and
$$
g(\lambda) = (1+\lambda)\left(\frac{2\lambda}{1+\lambda}\right)^2 - 4\lambda\,\frac{2\lambda}{1+\lambda} - 4\lambda
= \frac{4\lambda^2 - 8\lambda^2 - 4\lambda - 4\lambda^2}{1+\lambda}
= -\frac{8\lambda^2 + 4\lambda}{1+\lambda}.
$$

In the dual problem, we maximize $g(\lambda)$ in the set $\lambda \in [0, \infty[$. Because $g(\lambda) \le 0$ everywhere in $\mathbb{R}_+$, and $g(\lambda) = 0$ only when $\lambda = 0$, the maximum is achieved at $\lambda = 0$. Then $x = (2 \cdot 0)/(1 + 0) = 0$. Strong duality holds also in this case, which can also be seen from the fact that all the KKT conditions are satisfied when $\lambda = x = 0$.
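As a quick numerical sanity check (not part of the original solution), both constrained minima can be verified in Octave/MATLAB by evaluating $f$ on a fine grid over each feasible interval:

% Numerical check of both parts: minimize f(x) = x^2 over the feasible
% interval determined by p(x) <= 0 (illustrative grid search only).
f = @(x) x.^2;

% First part: (x-2)^2 - 1 <= 0, i.e. x in [1, 3].
x1 = linspace(1, 3, 100001);
[fmin1, i1] = min(f(x1));
fprintf('first part:  argmin = %.4f, min = %.4f\n', x1(i1), fmin1);   % ~1, 1

% Second part: (x-2)^2 - 8 <= 0, i.e. x in [2-2*sqrt(2), 2+2*sqrt(2)].
x2 = linspace(2 - 2*sqrt(2), 2 + 2*sqrt(2), 100001);
[fmin2, i2] = min(f(x2));
fprintf('second part: argmin = %.4f, min = %.4f\n', x2(i2), fmin2);   % ~0, 0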

Problem 2

Let $f : \mathbb{R}^d \to \mathbb{R}$, $f(\mathbf{w}) = \|\mathbf{w} - \mathbf{w}_t\|^2 = (\mathbf{w} - \mathbf{w}_t)\cdot(\mathbf{w} - \mathbf{w}_t) = \mathbf{w}\cdot\mathbf{w} - 2\,\mathbf{w}\cdot\mathbf{w}_t + \mathbf{w}_t\cdot\mathbf{w}_t$, and let $p(\mathbf{w}) = 1 - y_t\,\mathbf{w}\cdot\mathbf{x}_t$. Notice that this is the revised version of Exercise 2; in the original version $p$ was different. As a warm-up, we prove formally that $f$ is a convex function in $\mathbb{R}^d$.

A twice differentiable function is convex in a convex set if and only if its Hessian matrix is positive semidefinite on the interior of the set (compare with the non-negativity of the second derivative of a function of type $\mathbb{R} \to \mathbb{R}$). Because $f(\mathbf{w}) = \sum_{i=1}^d (w_i - w_{t,i})^2$, it is easy to see that
$$
\frac{\partial f(\mathbf{w})}{\partial w_i} = 2(w_i - w_{t,i}) \qquad (i \in \{1, 2, \ldots, d\}),
$$
and for all $i, j \in \{1, 2, \ldots, d\}$
$$
\frac{\partial^2 f(\mathbf{w})}{\partial w_i \partial w_j} = 2 \quad \text{if } i = j
\qquad\text{and}\qquad
\frac{\partial^2 f(\mathbf{w})}{\partial w_i \partial w_j} = 0 \quad \text{if } i \ne j.
$$
So the Hessian matrix of $f$ is $H(f) = 2I_d$. It holds for all $\mathbf{z} \in \mathbb{R}^{d \times 1}$ that $\mathbf{z}^T (2I_d)\,\mathbf{z} = 2\|\mathbf{z}\|^2 \ge 0$, which means $H(f)$ is positive semidefinite.

The optimization problem is to minimize $f(\mathbf{w})$ subject to $p(\mathbf{w}) \le 0$. The Karush-Kuhn-Tucker conditions are
$$
\nabla f(\mathbf{w}) + \lambda \nabla p(\mathbf{w}) = \mathbf{0}, \qquad p(\mathbf{w}) \le 0, \qquad \lambda \ge 0, \qquad \lambda p(\mathbf{w}) = 0,
$$
or equivalently
\begin{align*}
2\mathbf{w} - 2\mathbf{w}_t - \lambda y_t \mathbf{x}_t &= \mathbf{0} && \text{(i)} \\
1 - y_t\,\mathbf{w} \cdot \mathbf{x}_t &\le 0 && \text{(ii)} \\
\lambda &\ge 0 && \text{(iii)} \\
\lambda(1 - y_t\,\mathbf{w} \cdot \mathbf{x}_t) &= 0. && \text{(iv)}
\end{align*}

Equation (i) yields
$$
\mathbf{w} = \mathbf{w}_t + \tfrac{1}{2}\lambda y_t \mathbf{x}_t. \tag{1}
$$
From equality (iv) we get two solutions. In the first case $\lambda = 0$, and inequality (iii) is true. Also $\mathbf{w} = \mathbf{w}_t$, and $1 - y_t\,\mathbf{w}_t \cdot \mathbf{x}_t \le 0$ has to hold.

In the second solution $\lambda > 0$, and we get from (iv)
$$
1 - y_t\,\mathbf{w} \cdot \mathbf{x}_t
= 1 - y_t\bigl(\mathbf{w}_t + \tfrac{1}{2}\lambda y_t \mathbf{x}_t\bigr) \cdot \mathbf{x}_t
= 1 - y_t\,\mathbf{w}_t \cdot \mathbf{x}_t - \tfrac{1}{2}\lambda\,\mathbf{x}_t \cdot \mathbf{x}_t
= 0, \tag{2}
$$
where we used $y_t^2 = 1$. In this case (ii) holds with equality. Solving $\lambda$ from (2) and using (iii) gives
$$
\lambda = 2 \cdot \frac{1 - y_t\,\mathbf{w}_t \cdot \mathbf{x}_t}{\|\mathbf{x}_t\|^2} \ge 0, \tag{3}
$$
which implies $1 - y_t\,\mathbf{w}_t \cdot \mathbf{x}_t \ge 0$. Plugging $\lambda$ from (3) into (1) yields
$$
\mathbf{w} = \mathbf{w}_t + \frac{1}{2} \cdot 2 \cdot \frac{1 - y_t\,\mathbf{w}_t \cdot \mathbf{x}_t}{\|\mathbf{x}_t\|^2}\, y_t \mathbf{x}_t
= \mathbf{w}_t + y_t\, \frac{1 - y_t\,\mathbf{w}_t \cdot \mathbf{x}_t}{\|\mathbf{x}_t\|^2}\, \mathbf{x}_t.
$$
Putting finally all things together, we have the update rule
$$
\mathbf{w}_{t+1} = \mathbf{w}_t + \sigma_t y_t\, \frac{1 - y_t\,\mathbf{w}_t \cdot \mathbf{x}_t}{\|\mathbf{x}_t\|^2}\, \mathbf{x}_t,
$$
where
$$
\sigma_t =
\begin{cases}
0 & \text{if } y_t\,\mathbf{w}_t \cdot \mathbf{x}_t \ge 1 \\
1 & \text{if } y_t\,\mathbf{w}_t \cdot \mathbf{x}_t < 1.
\end{cases}
$$
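As a concrete check of the update rule (the numbers are chosen only for illustration and are not from the exercise), one step in Octave/MATLAB:

% One step of the update rule above with illustrative numbers.
w_t = [0 0];             % current weights
x_t = [1 2];  y_t = 1;   % instance and label
if y_t * dot(w_t, x_t) < 1
    sigma_t = 1;
else
    sigma_t = 0;
end
w_next = w_t + sigma_t * y_t * (1 - y_t * dot(w_t, x_t)) / norm(x_t)^2 * x_t;
% w_next = [0.2 0.4], and y_t * dot(w_next, x_t) = 1, i.e. constraint (ii)
% now holds with equality, as the KKT analysis predicts.
disp(w_next);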

We were asked to show that if $\mathbf{w}_1 = \mathbf{0}$, then $\mathbf{w}_t$ is a linear combination of the vectors $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_{t-1}$, in other words, $\mathbf{w}_t = \alpha_1 \mathbf{x}_1 + \alpha_2 \mathbf{x}_2 + \cdots + \alpha_{t-1} \mathbf{x}_{t-1}$, where $\alpha_1, \alpha_2, \ldots, \alpha_{t-1} \in \mathbb{R}$. If $t = 1$, we have an empty sum, and the statement is true. Let $t \in \mathbb{N}$ and assume that $\mathbf{w}_t = \sum_{i=1}^{t-1} \alpha_i \mathbf{x}_i$. Using our update rule,
$$
\mathbf{w}_{t+1} = \sum_{i=1}^{t-1} \alpha_i \mathbf{x}_i + \sigma_t y_t\, \frac{1 - y_t\,\mathbf{w}_t \cdot \mathbf{x}_t}{\|\mathbf{x}_t\|^2}\, \mathbf{x}_t
= \sum_{i=1}^{t} \alpha_i \mathbf{x}_i,
$$
where we have written $\alpha_t = \sigma_t y_t (1 - y_t\,\mathbf{w}_t \cdot \mathbf{x}_t)/\|\mathbf{x}_t\|^2$. According to the induction principle, the statement is thus true for all $t \in \mathbb{N}$.

In the kernelized Perceptron algorithm we replace the vectors $\mathbf{x}_i$ with their counterparts $\psi(\mathbf{x}_i)$ in the feature space. The algorithm can be described as follows (a short code sketch is given after the listing):

For $t = 1, 2, \ldots, T$ do
1. Get the instance $\mathbf{x}_t$.
2. Let $p_t = \mathbf{w}_t \cdot \psi(\mathbf{x}_t) = \bigl(\sum_{i=1}^{t-1} \alpha_i \psi(\mathbf{x}_i)\bigr) \cdot \psi(\mathbf{x}_t) = \sum_{i=1}^{t-1} \alpha_i k(\mathbf{x}_i, \mathbf{x}_t)$. Predict $\hat{y}_t = \operatorname{sign}(p_t)$.
3. Get the correct answer $y_t$.
4. If $y_t p_t \le 1$, set $\alpha_t \leftarrow y_t(1 - y_t\,\mathbf{w}_t \cdot \psi(\mathbf{x}_t))/\|\psi(\mathbf{x}_t)\|^2 = \bigl(y_t - \sum_{i=1}^{t-1} \alpha_i k(\mathbf{x}_i, \mathbf{x}_t)\bigr)/k(\mathbf{x}_t, \mathbf{x}_t)$; otherwise set $\alpha_t \leftarrow 0$ (and $\mathbf{x}_t$ can be discarded).
5. Let $\|\mathbf{w}'_{t+1}\|^2 = \bigl(\sum_{i=1}^{t} \alpha_i \psi(\mathbf{x}_i)\bigr) \cdot \bigl(\sum_{j=1}^{t} \alpha_j \psi(\mathbf{x}_j)\bigr) = \sum_{i=1}^{t} \sum_{j=1}^{t} \alpha_i \alpha_j k(\mathbf{x}_i, \mathbf{x}_j)$. If $\|\mathbf{w}'_{t+1}\| > 1$, set $\alpha_i \leftarrow \alpha_i / \|\mathbf{w}'_{t+1}\|$ for all $i \in \{1, 2, \ldots, t\}$.
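The following Octave/MATLAB sketch implements the algorithm above. The RBF kernel, the random toy data and the mistake counter are assumptions added for illustration; they are not part of the original exercise.

% Kernelized Perceptron with the margin-based update and the projection of
% step 5 above (illustrative sketch, assumed kernel and toy data).
k = @(x, z) exp(-norm(x - z)^2 / 2);      % assumed RBF kernel

n = 100;  d = 2;
X = randn(n, d);                          % toy instances, one per row
y = sign(X(:, 1));  y(y == 0) = 1;        % toy labels in {-1, +1}

alpha = zeros(n, 1);
mistakes = 0;
for t = 1:n
    p = 0;                                % p_t = sum_{i<t} alpha_i k(x_i, x_t)
    for i = 1:t-1
        p = p + alpha(i) * k(X(i, :), X(t, :));
    end
    yhat = sign(p);                       % prediction for x_t
    mistakes = mistakes + (yhat ~= y(t));
    if y(t) * p <= 1                      % step 4: margin violated
        alpha(t) = (y(t) - p) / k(X(t, :), X(t, :));
    else
        alpha(t) = 0;
    end
    % step 5: compute ||w'_{t+1}|| via the kernel and project if needed
    K = zeros(t, t);
    for i = 1:t
        for j = 1:t
            K(i, j) = k(X(i, :), X(j, :));
        end
    end
    wnorm = sqrt(max(alpha(1:t)' * K * alpha(1:t), 0));
    if wnorm > 1
        alpha(1:t) = alpha(1:t) / wnorm;
    end
end
fprintf('online mistakes: %d out of %d\n', mistakes, n);

Note that the weight vector is never formed explicitly; only the coefficients $\alpha_i$ and kernel evaluations are needed, which is exactly what the linear-combination argument above guarantees.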

Problem 3

The idea of this exercise was to show that if there is a solution to the given minimization problem, then coordinate $i$ of the new weight vector satisfies
$$
w_{t+1,i} = \frac{1}{Z}\, w_{t,i}\, \beta_t^{y_t x_{t,i}},
$$
where $\beta_t \ge 1$. Notice that the formulation of the exercise is slightly revised.

Let $f : \mathbb{R}^d_+ \to \mathbb{R}$, $f(\mathbf{w}) = d_{\mathrm{re}}(\mathbf{w}, \mathbf{w}_t)$. See the solution of Exercise 2b of the third homework for some discussion about defining $f$ on the boundary of the set $\mathbb{R}^d_+$. Similarly as in the previous exercise, we start by proving that $f$ is convex in $\mathbb{R}^d_+$, which we do by examining the Hessian matrix of $f$ on the interior of $\mathbb{R}^d_+$. Let $\mathbf{w} = (w_1, w_2, \ldots, w_d) \in (\mathbb{R}_+ \setminus \{0\})^d$. For all $i \in \{1, 2, \ldots, d\}$,

$$
\frac{\partial f(\mathbf{w})}{\partial w_i}
= \frac{\partial \sum_{i=1}^d w_i \ln(w_i / w_{t,i})}{\partial w_i}
= \ln w_i - \ln w_{t,i} + 1,
$$
and
$$
\frac{\partial^2 f(\mathbf{w})}{\partial w_i \partial w_i}
= \frac{\partial (\ln w_i - \ln w_{t,i} + 1)}{\partial w_i}
= \frac{1}{w_i}.
$$
Additionally, for all $i, j \in \{1, 2, \ldots, d\}$, $i \ne j$,
$$
\frac{\partial^2 f(\mathbf{w})}{\partial w_i \partial w_j} = 0.
$$

Let $\mathbf{z} \in \mathbb{R}^{d \times 1}$. The Hessian of $f$ is positive semidefinite because
$$
\mathbf{z}^T H(f)\, \mathbf{z}
= \bigl(z_1/w_1,\; z_2/w_2,\; \ldots,\; z_d/w_d\bigr)\, \mathbf{z}
= \sum_{i=1}^d \frac{z_i^2}{w_i} \ge 0.
$$
We have now proved that $f$ is convex in $\mathbb{R}^d_+$.

This time we have, in addition to an inequality constraint, also an equality constraint. Let
$$
p(\mathbf{w}) = -y_t\,\mathbf{w} \cdot \mathbf{x}_t,
\qquad
q(\mathbf{w}) = \sum_{i=1}^d w_i - 1.
$$
The conditions $w_i \ge 0$ are left out because of the definition of $f$. The optimization problem is to minimize $f$ in the set $\mathbb{R}^d_+$ subject to
$$
p(\mathbf{w}) \le 0
\qquad\text{and}\qquad
q(\mathbf{w}) = 0.
$$
The function $p$ is linear, and therefore convex. Let $\mathbf{a} = (1, 1, \ldots, 1)$. We see that $q(\mathbf{w}) = \mathbf{a} \cdot \mathbf{w} - 1$ is affine. So our minimization problem is convex.

The KKT conditions for the optimization problem are
$$
\nabla f(\mathbf{w}) + \lambda \nabla p(\mathbf{w}) + \gamma \nabla q(\mathbf{w}) = \mathbf{0},
\qquad p(\mathbf{w}) \le 0,
\qquad q(\mathbf{w}) = 0,
\qquad \lambda \ge 0,
\qquad \lambda p(\mathbf{w}) = 0,
$$
where $\gamma \in \mathbb{R}$. For all $i \in \{1, 2, \ldots, d\}$ it holds that
$$
\frac{\partial f(\mathbf{w})}{\partial w_i} = \ln w_i - \ln w_{t,i} + 1,
\qquad
\frac{\partial p(\mathbf{w})}{\partial w_i} = -y_t x_{t,i},
\qquad
\frac{\partial q(\mathbf{w})}{\partial w_i} = 1,
$$
and the first KKT condition thus yields
$$
(\ln w_i - \ln w_{t,i} + 1) - \lambda y_t x_{t,i} + \gamma = 0.
$$
This implies
$$
\ln w_i = \ln w_{t,i} - 1 + \lambda y_t x_{t,i} - \gamma,
$$
and further
$$
w_i = w_{t,i} \exp(-1 + \lambda y_t x_{t,i} - \gamma)
= \exp(-1 - \gamma)\, w_{t,i} \exp(\lambda y_t x_{t,i})
= \frac{1}{Z}\, w_{t,i}\, \beta_t^{y_t x_{t,i}},
$$
where $1/Z = \exp(-1 - \gamma)$ and $\beta_t = \exp(\lambda)$. The condition $q(\mathbf{w}) = 0$ now yields
$$
\sum_{i=1}^d \frac{1}{Z}\, w_{t,i}\, \beta_t^{y_t x_{t,i}} - 1 = 0
\quad\Rightarrow\quad
Z = \sum_{i=1}^d w_{t,i}\, \beta_t^{y_t x_{t,i}},
$$
and because of the condition $\lambda \ge 0$ we know that $\beta_t = \exp(\lambda) \ge 1$.
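The derived update is easy to apply in practice. The following Octave/MATLAB sketch performs one such multiplicative update; the weight vector, instance, label and the value of $\beta_t$ are illustrative assumptions, not values from the exercise.

% One multiplicative update w_{t+1,i} = w_{t,i} * beta_t^(y_t * x_{t,i}) / Z,
% as derived above (illustrative numbers only).
w_t    = [0.5 0.3 0.2];      % current weights on the probability simplex
x_t    = [1.0 -0.5 0.2];     % instance
y_t    = 1;                  % label in {-1, +1}
beta_t = 1.5;                % beta_t = exp(lambda) >= 1, assumed value

unnormalized = w_t .* beta_t .^ (y_t * x_t);
Z      = sum(unnormalized);  % normalizer so that the new weights sum to one
w_next = unnormalized / Z;

disp(w_next);                % the new weights, non-negative
disp(sum(w_next));           % equals 1, so the equality constraint holds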

Problem 4

The accuracies with different choices of the kernel width and the parameter C are shown below. For both parameters, the values 1, 10 and 100 are tried.

Note that the smallest test set error is attained with C = 1.0 when width = 1.0, and with C = 100.0 when width = 10.0. When width = 100.0, the classifier breaks down completely, because the kernel width is much larger than the extent of the data set.

data = dlmread('hw05data.txt');
X = data(:,1); Y = data(:,2); label = data(:,3); N = size(data, 1);

negative_mask = find(label == -1);
positive_mask = find(label == 1);

X_train = X(1:N/2); Y_train = Y(1:N/2); label_train = label(1:N/2);
X_test = X(N/2+1:end); Y_test = Y(N/2+1:end); label_test = label(N/2+1:end);
train_data = data(1:N/2, 1:2); test_data = data(N/2+1:end, 1:2);

sigma = [1 10 100];
C = [1 10 100];

for sigma_i = 1:3
  for C_i = 1:3
    figure;
    model = svmtrain(train_data, label_train, 'kernel_function', ...
                     'rbf', 'rbf_sigma', sigma(sigma_i), ...
                     'boxconstraint', C(C_i), 'showplot', true)
    label_test_data_result = svmclassify(model, test_data, 'showplot', true);
    label_train_data_result = svmclassify(model, train_data);
    train_error = mean(label_train ~= label_train_data_result)
    test_error = mean(label_test ~= label_test_data_result)
    title(sprintf('width=%.1f C=%.1f testerr.=%.3f trainerr.=%.3f', ...
                  sigma(sigma_i), C(C_i), test_error, train_error));
    print -deps
  end
end

The test and training errors for the nine (width, C) combinations were:

  width       C     test err.   train err.
    1.0     1.0       0.045       0.055
    1.0    10.0       0.055       0.055
    1.0   100.0       0.065       0.050
   10.0     1.0       0.250       0.245
   10.0    10.0       0.080       0.085
   10.0   100.0       0.055       0.075
  100.0     1.0       0.510       0.485
  100.0    10.0       0.510       0.485
  100.0   100.0       0.505       0.485

[The nine corresponding plots, each showing the training points, the classified test points, the support vectors and the decision boundary over the range −3 to 3 in both coordinates, are omitted here.]
