
582669 Supervised Machine Learning (Spring 2014) Homework 5, sample solutions

Credit for the solutions goes mainly to Panu Luosto and Joonas Paalasmaa, with some additional contributions by Jyrki Kivinen.

Problem 1

This exercise is of course a drill: there are simpler ways of finding the minimum of $x^2$ in a closed interval.

First part. Let $f(x) = x^2$ and $p(x) = (x-2)^2 - 1$. We are asked to minimize $f(x)$ with the constraint $p(x) \le 0$. Both functions are clearly convex, and the Karush-Kuhn-Tucker conditions of this convex optimization problem are
$$
f'(x) + \lambda p'(x) = 0, \qquad p(x) \le 0, \qquad \lambda \ge 0, \qquad \lambda p(x) = 0,
$$
or equivalently
\begin{align*}
2x + \lambda(2x - 4) &= 0 && \text{(i)} \\
(x-2)^2 - 1 &\le 0 && \text{(ii)} \\
\lambda &\ge 0 && \text{(iii)} \\
\lambda\bigl((x-2)^2 - 1\bigr) &= 0. && \text{(iv)}
\end{align*}

The solutions of (iv) are $\lambda = 0$, $x = 3$ or $x = 1$. If $\lambda = 0$, we get $x = 0$ from (i), and condition (ii) is not true. Otherwise, if $x = 3$, equation (i) has the form $6 + 2\lambda = 0$, which means that condition (iii) is violated. We are left with $x = 1$. Then (i) yields $2 - 2\lambda = 0 \Leftrightarrow \lambda = 1$, and the inequalities (ii) and (iii) are true. In other words, $f(x)$ is minimized subject to $p(x) \le 0$ when $x = 1$.

The Lagrangian of the problem is
$$
L(x, \lambda) = f(x) + \lambda p(x) = x^2 + \lambda\bigl((x-2)^2 - 1\bigr) = (1+\lambda)x^2 - 4\lambda x + 3\lambda,
$$
and the dual is
$$
g(\lambda) = \inf_{x \in \mathbb{R}} L(x, \lambda) = \inf_{x \in \mathbb{R}} \bigl((1+\lambda)x^2 - 4\lambda x + 3\lambda\bigr).
$$
The dual problem is to maximize $g(\lambda)$ in the set $\lambda \in [0, \infty[$. Let $h(x) = (1+\lambda)x^2 - 4\lambda x + 3\lambda$. Because $\lambda \ge 0$, the graph of $h$ is a convex parabola, and the minimum of $h$ is at the point where $h'(x) = (2 + 2\lambda)x - 4\lambda = 0 \Rightarrow x = 2\lambda/(1+\lambda)$. Thus
$$
g(\lambda) = (1+\lambda)\frac{4\lambda^2}{(1+\lambda)^2} - \frac{8\lambda^2}{1+\lambda} + 3\lambda
= \frac{4\lambda^2 - 8\lambda^2 + 3\lambda + 3\lambda^2}{1+\lambda}
= \frac{3\lambda - \lambda^2}{1+\lambda}.
$$


With some more calculations we can show that the dual $g$ is maximized at $\lambda = 1$, with $g(1) = 1$.
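Indeed, differentiating the dual gives
$$
g'(\lambda) = \frac{(3 - 2\lambda)(1+\lambda) - (3\lambda - \lambda^2)}{(1+\lambda)^2}
= \frac{3 - 2\lambda - \lambda^2}{(1+\lambda)^2}
= -\frac{(\lambda - 1)(\lambda + 3)}{(1+\lambda)^2},
$$
which is positive for $0 \le \lambda < 1$ and negative for $\lambda > 1$, so on $[0, \infty[$ the maximum is attained at $\lambda = 1$, where $g(1) = (3 - 1)/2 = 1$.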

Our optimization problem is convex, and $p(2) < 0$; in other words, there is $x \in \mathbb{R}$ such that $p(x)$ is strictly less than zero. Therefore, according to Theorem 3.4 of the lecture slides, strong duality holds, and $g(\lambda) = f(x)$, where $\lambda$ and $x$ are the solutions of the dual and primal optimization problems, respectively.

Second part. Let $f(x) = x^2$ and let $p(x) = (x-2)^2 - 8$. Again, we want to minimize the convex function $f$ subject to $p(x) \le 0$, where $p$ is also a convex function. The KKT conditions of this convex optimization problem are
\begin{align*}
2x + \lambda(2x - 4) &= 0 && \text{(i)} \\
(x-2)^2 - 8 &\le 0 && \text{(ii)} \\
\lambda &\ge 0 && \text{(iii)} \\
\lambda\bigl((x-2)^2 - 8\bigr) &= 0. && \text{(iv)}
\end{align*}

From (iv) and (i) we easily find the solution $x = \lambda = 0$. In order to check for other possible solutions (for practice only), we also get $x = 2 + 2\sqrt{2}$ and $x = 2 - 2\sqrt{2}$ from (iv). But then solving $\lambda = x/(2 - x)$ from (i) shows that condition (iii) is violated in these two cases. So the solution of the problem is $x = 0$.

The Lagrangian of the problem is
$$
L(x, \lambda) = f(x) + \lambda p(x) = x^2 + \lambda\bigl((x-2)^2 - 8\bigr) = (1+\lambda)x^2 - 4\lambda x - 4\lambda,
$$
and the dual is
$$
g(\lambda) = \inf_{x \in \mathbb{R}} L(x, \lambda) = \inf_{x \in \mathbb{R}} \bigl((1+\lambda)x^2 - 4\lambda x - 4\lambda\bigr).
$$
Finding the infimum goes along similar steps as in the first part of the exercise, and
$$
g(\lambda) = (1+\lambda)\left(\frac{2\lambda}{1+\lambda}\right)^2 - 4\lambda\,\frac{2\lambda}{1+\lambda} - 4\lambda
= \frac{4\lambda^2 - 8\lambda^2 - 4\lambda - 4\lambda^2}{1+\lambda}
= -\frac{8\lambda^2 + 4\lambda}{1+\lambda}.
$$

In the dual problem, we maximize $g(\lambda)$ in the set $\lambda \in [0, \infty[$. Because $g(\lambda) \le 0$ everywhere in $\mathbb{R}_+$, and $g(\lambda) = 0$ only when $\lambda = 0$, the maximum is achieved at $\lambda = 0$. Then $x = (2 \cdot 0)/(1 + 0) = 0$. Strong duality holds also in this case, which can also be seen from the fact that all the KKT conditions are satisfied when $\lambda = x = 0$.
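As a quick numerical sanity check (not part of the original solution), both constrained minima can be verified in Octave/MATLAB by evaluating $f$ on a fine grid over each feasible interval:

% Numerical check of both parts: minimize f(x) = x^2 over the feasible
% interval determined by p(x) <= 0 (illustrative grid search only).
f = @(x) x.^2;

% First part: (x-2)^2 - 1 <= 0, i.e. x in [1, 3].
x1 = linspace(1, 3, 100001);
[fmin1, i1] = min(f(x1));
fprintf('first part:  argmin = %.4f, min = %.4f\n', x1(i1), fmin1);   % ~1, 1

% Second part: (x-2)^2 - 8 <= 0, i.e. x in [2-2*sqrt(2), 2+2*sqrt(2)].
x2 = linspace(2 - 2*sqrt(2), 2 + 2*sqrt(2), 100001);
[fmin2, i2] = min(f(x2));
fprintf('second part: argmin = %.4f, min = %.4f\n', x2(i2), fmin2);   % ~0, 0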

Problem 2

Let $f : \mathbb{R}^d \to \mathbb{R}$, $f(\mathbf{w}) = \|\mathbf{w} - \mathbf{w}_t\|^2 = (\mathbf{w} - \mathbf{w}_t)\cdot(\mathbf{w} - \mathbf{w}_t) = \mathbf{w}\cdot\mathbf{w} - 2\,\mathbf{w}\cdot\mathbf{w}_t + \mathbf{w}_t\cdot\mathbf{w}_t$, and let $p(\mathbf{w}) = 1 - y_t\,\mathbf{w}\cdot\mathbf{x}_t$. Notice that this is the revised version of Exercise 2; in the original version $p$ was different. As a warm-up, we prove formally that $f$ is a convex function in $\mathbb{R}^d$.

A twice differentiable function is convex in a convex set if and only if its Hessian matrix is positive semidefinite on the interior of the set (compare with the non-negativity of the second derivative of a function of type $\mathbb{R} \to \mathbb{R}$). Because $f(\mathbf{w}) = \sum_{i=1}^d (w_i - w_{t,i})^2$, it is easy to see that
$$
\frac{\partial f(\mathbf{w})}{\partial w_i} = 2(w_i - w_{t,i}) \qquad (i \in \{1, 2, \ldots, d\}),
$$
and for all $i, j \in \{1, 2, \ldots, d\}$
$$
\frac{\partial^2 f(\mathbf{w})}{\partial w_i \partial w_j} = 2 \quad \text{if } i = j
\qquad\text{and}\qquad
\frac{\partial^2 f(\mathbf{w})}{\partial w_i \partial w_j} = 0 \quad \text{if } i \ne j.
$$
So the Hessian matrix of $f$ is $H(f) = 2I_d$. It holds for all $\mathbf{z} \in \mathbb{R}^{d \times 1}$ that $\mathbf{z}^T (2I_d)\,\mathbf{z} = 2\|\mathbf{z}\|^2 \ge 0$, which means $H(f)$ is positive semidefinite.

The optimization problem is to minimize $f(\mathbf{w})$ subject to $p(\mathbf{w}) \le 0$. The Karush-Kuhn-Tucker conditions are
$$
\nabla f(\mathbf{w}) + \lambda \nabla p(\mathbf{w}) = \mathbf{0}, \qquad p(\mathbf{w}) \le 0, \qquad \lambda \ge 0, \qquad \lambda p(\mathbf{w}) = 0,
$$
or equivalently
\begin{align*}
2\mathbf{w} - 2\mathbf{w}_t - \lambda y_t \mathbf{x}_t &= \mathbf{0} && \text{(i)} \\
1 - y_t\,\mathbf{w} \cdot \mathbf{x}_t &\le 0 && \text{(ii)} \\
\lambda &\ge 0 && \text{(iii)} \\
\lambda(1 - y_t\,\mathbf{w} \cdot \mathbf{x}_t) &= 0. && \text{(iv)}
\end{align*}

Equation (i) yields
$$
\mathbf{w} = \mathbf{w}_t + \tfrac{1}{2}\lambda y_t \mathbf{x}_t. \tag{1}
$$
From equality (iv) we get two solutions. In the first case $\lambda = 0$, and inequality (iii) is true. Also $\mathbf{w} = \mathbf{w}_t$, and $1 - y_t\,\mathbf{w}_t \cdot \mathbf{x}_t \le 0$ has to hold.

In the second solution $\lambda > 0$, and we get from (iv)
$$
1 - y_t\,\mathbf{w} \cdot \mathbf{x}_t
= 1 - y_t\bigl(\mathbf{w}_t + \tfrac{1}{2}\lambda y_t \mathbf{x}_t\bigr) \cdot \mathbf{x}_t
= 1 - y_t\,\mathbf{w}_t \cdot \mathbf{x}_t - \tfrac{1}{2}\lambda\,\mathbf{x}_t \cdot \mathbf{x}_t
= 0, \tag{2}
$$
where we used $y_t^2 = 1$. In this case (ii) holds with equality. Solving $\lambda$ from (2) and using (iii) gives
$$
\lambda = 2 \cdot \frac{1 - y_t\,\mathbf{w}_t \cdot \mathbf{x}_t}{\|\mathbf{x}_t\|^2} \ge 0, \tag{3}
$$
which implies $1 - y_t\,\mathbf{w}_t \cdot \mathbf{x}_t \ge 0$. Plugging $\lambda$ from (3) into (1) yields
$$
\mathbf{w} = \mathbf{w}_t + \frac{1}{2} \cdot 2 \cdot \frac{1 - y_t\,\mathbf{w}_t \cdot \mathbf{x}_t}{\|\mathbf{x}_t\|^2}\, y_t \mathbf{x}_t
= \mathbf{w}_t + y_t\, \frac{1 - y_t\,\mathbf{w}_t \cdot \mathbf{x}_t}{\|\mathbf{x}_t\|^2}\, \mathbf{x}_t.
$$
Putting finally all things together, we have the update rule
$$
\mathbf{w}_{t+1} = \mathbf{w}_t + \sigma_t y_t\, \frac{1 - y_t\,\mathbf{w}_t \cdot \mathbf{x}_t}{\|\mathbf{x}_t\|^2}\, \mathbf{x}_t,
$$
where
$$
\sigma_t =
\begin{cases}
0 & \text{if } y_t\,\mathbf{w}_t \cdot \mathbf{x}_t \ge 1 \\
1 & \text{if } y_t\,\mathbf{w}_t \cdot \mathbf{x}_t < 1.
\end{cases}
$$
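As a concrete check of the update rule (the numbers are chosen only for illustration and are not from the exercise), one step in Octave/MATLAB:

% One step of the update rule above with illustrative numbers.
w_t = [0 0];             % current weights
x_t = [1 2];  y_t = 1;   % instance and label
if y_t * dot(w_t, x_t) < 1
    sigma_t = 1;
else
    sigma_t = 0;
end
w_next = w_t + sigma_t * y_t * (1 - y_t * dot(w_t, x_t)) / norm(x_t)^2 * x_t;
% w_next = [0.2 0.4], and y_t * dot(w_next, x_t) = 1, i.e. constraint (ii)
% now holds with equality, as the KKT analysis predicts.
disp(w_next);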

We were asked to show that if $\mathbf{w}_1 = \mathbf{0}$, then $\mathbf{w}_t$ is a linear combination of the vectors $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_{t-1}$, in other words, $\mathbf{w}_t = \alpha_1 \mathbf{x}_1 + \alpha_2 \mathbf{x}_2 + \cdots + \alpha_{t-1} \mathbf{x}_{t-1}$, where $\alpha_1, \alpha_2, \ldots, \alpha_{t-1} \in \mathbb{R}$. If $t = 1$, we have an empty sum, and the statement is true. Let $t \in \mathbb{N}$ and assume that $\mathbf{w}_t = \sum_{i=1}^{t-1} \alpha_i \mathbf{x}_i$. Using our update rule,
$$
\mathbf{w}_{t+1} = \sum_{i=1}^{t-1} \alpha_i \mathbf{x}_i + \sigma_t y_t\, \frac{1 - y_t\,\mathbf{w}_t \cdot \mathbf{x}_t}{\|\mathbf{x}_t\|^2}\, \mathbf{x}_t
= \sum_{i=1}^{t} \alpha_i \mathbf{x}_i,
$$
where we have written $\alpha_t = \sigma_t y_t (1 - y_t\,\mathbf{w}_t \cdot \mathbf{x}_t)/\|\mathbf{x}_t\|^2$. According to the induction principle, the statement is thus true for all $t \in \mathbb{N}$.

In the kernelized Perceptron algorithm we replace the vectors $\mathbf{x}_i$ with their counterparts $\psi(\mathbf{x}_i)$ in the feature space. The algorithm can be described as follows (a short code sketch is given after the listing):

For $t = 1, 2, \ldots, T$ do
1. Get the instance $\mathbf{x}_t$.
2. Let $p_t = \mathbf{w}_t \cdot \psi(\mathbf{x}_t) = \bigl(\sum_{i=1}^{t-1} \alpha_i \psi(\mathbf{x}_i)\bigr) \cdot \psi(\mathbf{x}_t) = \sum_{i=1}^{t-1} \alpha_i k(\mathbf{x}_i, \mathbf{x}_t)$. Predict $\hat{y}_t = \operatorname{sign}(p_t)$.
3. Get the correct answer $y_t$.
4. If $y_t p_t \le 1$, set $\alpha_t \leftarrow y_t(1 - y_t\,\mathbf{w}_t \cdot \psi(\mathbf{x}_t))/\|\psi(\mathbf{x}_t)\|^2 = \bigl(y_t - \sum_{i=1}^{t-1} \alpha_i k(\mathbf{x}_i, \mathbf{x}_t)\bigr)/k(\mathbf{x}_t, \mathbf{x}_t)$; otherwise set $\alpha_t \leftarrow 0$ (and $\mathbf{x}_t$ can be discarded).
5. Let $\|\mathbf{w}'_{t+1}\|^2 = \bigl(\sum_{i=1}^{t} \alpha_i \psi(\mathbf{x}_i)\bigr) \cdot \bigl(\sum_{j=1}^{t} \alpha_j \psi(\mathbf{x}_j)\bigr) = \sum_{i=1}^{t} \sum_{j=1}^{t} \alpha_i \alpha_j k(\mathbf{x}_i, \mathbf{x}_j)$. If $\|\mathbf{w}'_{t+1}\| > 1$, set $\alpha_i \leftarrow \alpha_i / \|\mathbf{w}'_{t+1}\|$ for all $i \in \{1, 2, \ldots, t\}$.
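The following Octave/MATLAB sketch implements the algorithm above. The RBF kernel, the random toy data and the mistake counter are assumptions added for illustration; they are not part of the original exercise.

% Kernelized Perceptron with the margin-based update and the projection of
% step 5 above (illustrative sketch, assumed kernel and toy data).
k = @(x, z) exp(-norm(x - z)^2 / 2);      % assumed RBF kernel

n = 100;  d = 2;
X = randn(n, d);                          % toy instances, one per row
y = sign(X(:, 1));  y(y == 0) = 1;        % toy labels in {-1, +1}

alpha = zeros(n, 1);
mistakes = 0;
for t = 1:n
    p = 0;                                % p_t = sum_{i<t} alpha_i k(x_i, x_t)
    for i = 1:t-1
        p = p + alpha(i) * k(X(i, :), X(t, :));
    end
    yhat = sign(p);                       % prediction for x_t
    mistakes = mistakes + (yhat ~= y(t));
    if y(t) * p <= 1                      % step 4: margin violated
        alpha(t) = (y(t) - p) / k(X(t, :), X(t, :));
    else
        alpha(t) = 0;
    end
    % step 5: compute ||w'_{t+1}|| via the kernel and project if needed
    K = zeros(t, t);
    for i = 1:t
        for j = 1:t
            K(i, j) = k(X(i, :), X(j, :));
        end
    end
    wnorm = sqrt(max(alpha(1:t)' * K * alpha(1:t), 0));
    if wnorm > 1
        alpha(1:t) = alpha(1:t) / wnorm;
    end
end
fprintf('online mistakes: %d out of %d\n', mistakes, n);

Note that the weight vector is never formed explicitly; only the coefficients $\alpha_i$ and kernel evaluations are needed, which is exactly what the linear-combination argument above guarantees.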

Problem 3

The idea of this exercise was to show that if there is a solution to the given minimization problem, then coordinate $i$ of the new weight vector satisfies
$$
w_{t+1,i} = \frac{1}{Z}\, w_{t,i}\, \beta_t^{y_t x_{t,i}},
$$
where $\beta_t \ge 1$. Notice that the formulation of the exercise is slightly revised.

Let $f : \mathbb{R}^d_+ \to \mathbb{R}$, $f(\mathbf{w}) = d_{\mathrm{re}}(\mathbf{w}, \mathbf{w}_t)$. See the solution of Exercise 2b of the third homework for some discussion about defining $f$ on the boundary of the set $\mathbb{R}^d_+$. Similarly as in the previous exercise, we start by proving that $f$ is convex in $\mathbb{R}^d_+$, which we do by examining the Hessian matrix of $f$ on the interior of $\mathbb{R}^d_+$. Let $\mathbf{w} = (w_1, w_2, \ldots, w_d) \in (\mathbb{R}_+ \setminus \{0\})^d$. For all $i \in \{1, 2, \ldots, d\}$,

$$
\frac{\partial f(\mathbf{w})}{\partial w_i}
= \frac{\partial \sum_{i=1}^d w_i \ln(w_i / w_{t,i})}{\partial w_i}
= \ln w_i - \ln w_{t,i} + 1,
$$
and
$$
\frac{\partial^2 f(\mathbf{w})}{\partial w_i \partial w_i}
= \frac{\partial (\ln w_i - \ln w_{t,i} + 1)}{\partial w_i}
= \frac{1}{w_i}.
$$
Additionally, for all $i, j \in \{1, 2, \ldots, d\}$, $i \ne j$,
$$
\frac{\partial^2 f(\mathbf{w})}{\partial w_i \partial w_j} = 0.
$$

Let $\mathbf{z} \in \mathbb{R}^{d \times 1}$. The Hessian of $f$ is positive semidefinite because
$$
\mathbf{z}^T H(f)\, \mathbf{z}
= \bigl(z_1/w_1,\; z_2/w_2,\; \ldots,\; z_d/w_d\bigr)\, \mathbf{z}
= \sum_{i=1}^d \frac{z_i^2}{w_i} \ge 0.
$$
We have now proved that $f$ is convex in $\mathbb{R}^d_+$.

This time we have, in addition to an inequality constraint, also an equality constraint. Let
$$
p(\mathbf{w}) = -y_t\,\mathbf{w} \cdot \mathbf{x}_t,
\qquad
q(\mathbf{w}) = \sum_{i=1}^d w_i - 1.
$$
The conditions $w_i \ge 0$ are left out because of the definition of $f$. The optimization problem is to minimize $f$ in the set $\mathbb{R}^d_+$ subject to
$$
p(\mathbf{w}) \le 0
\qquad\text{and}\qquad
q(\mathbf{w}) = 0.
$$
The function $p$ is linear, and therefore convex. Let $\mathbf{a} = (1, 1, \ldots, 1)$. We see that $q(\mathbf{w}) = \mathbf{a} \cdot \mathbf{w} - 1$ is affine. So our minimization problem is convex.

The KKT conditions for the optimization problem are
$$
\nabla f(\mathbf{w}) + \lambda \nabla p(\mathbf{w}) + \gamma \nabla q(\mathbf{w}) = \mathbf{0},
\qquad p(\mathbf{w}) \le 0,
\qquad q(\mathbf{w}) = 0,
\qquad \lambda \ge 0,
\qquad \lambda p(\mathbf{w}) = 0,
$$
where $\gamma \in \mathbb{R}$. For all $i \in \{1, 2, \ldots, d\}$ it holds that
$$
\frac{\partial f(\mathbf{w})}{\partial w_i} = \ln w_i - \ln w_{t,i} + 1,
\qquad
\frac{\partial p(\mathbf{w})}{\partial w_i} = -y_t x_{t,i},
\qquad
\frac{\partial q(\mathbf{w})}{\partial w_i} = 1,
$$
and the first KKT condition thus yields
$$
(\ln w_i - \ln w_{t,i} + 1) - \lambda y_t x_{t,i} + \gamma = 0.
$$
This implies
$$
\ln w_i = \ln w_{t,i} - 1 + \lambda y_t x_{t,i} - \gamma,
$$
and further
$$
w_i = w_{t,i} \exp(-1 + \lambda y_t x_{t,i} - \gamma)
= \exp(-1 - \gamma)\, w_{t,i} \exp(\lambda y_t x_{t,i})
= \frac{1}{Z}\, w_{t,i}\, \beta_t^{y_t x_{t,i}},
$$
where $1/Z = \exp(-1 - \gamma)$ and $\beta_t = \exp(\lambda)$. The condition $q(\mathbf{w}) = 0$ now yields
$$
\sum_{i=1}^d \frac{1}{Z}\, w_{t,i}\, \beta_t^{y_t x_{t,i}} - 1 = 0
\quad\Rightarrow\quad
Z = \sum_{i=1}^d w_{t,i}\, \beta_t^{y_t x_{t,i}},
$$
and because of the condition $\lambda \ge 0$ we know that $\beta_t = \exp(\lambda) \ge 1$.
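The derived update is easy to apply in practice. The following Octave/MATLAB sketch performs one such multiplicative update; the weight vector, instance, label and the value of $\beta_t$ are illustrative assumptions, not values from the exercise.

% One multiplicative update w_{t+1,i} = w_{t,i} * beta_t^(y_t * x_{t,i}) / Z,
% as derived above (illustrative numbers only).
w_t    = [0.5 0.3 0.2];      % current weights on the probability simplex
x_t    = [1.0 -0.5 0.2];     % instance
y_t    = 1;                  % label in {-1, +1}
beta_t = 1.5;                % beta_t = exp(lambda) >= 1, assumed value

unnormalized = w_t .* beta_t .^ (y_t * x_t);
Z      = sum(unnormalized);  % normalizer so that the new weights sum to one
w_next = unnormalized / Z;

disp(w_next);                % the new weights, non-negative
disp(sum(w_next));           % equals 1, so the equality constraint holds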

Problem 4

The accuracies with different choices of the kernel width and the parameter C are shown below. For both parameters, the values 1, 10 and 100 are tried.

Note that the smallest test set error is attained with C = 1.0 when width = 1.0, and with C = 100.0 when width = 10.0. When width = 100.0, the classifier breaks down completely, because the kernel width is much larger than the extent of the data set.

data = dlmread('hw05data.txt');
X = data(:,1); Y = data(:,2); label = data(:,3); N = size(data, 1);

negative_mask = find(label == -1);
positive_mask = find(label == 1);

X_train = X(1:N/2); Y_train = Y(1:N/2); label_train = label(1:N/2);
X_test = X(N/2+1:end); Y_test = Y(N/2+1:end); label_test = label(N/2+1:end);
train_data = data(1:N/2, 1:2); test_data = data(N/2+1:end, 1:2);

sigma = [1 10 100];
C = [1 10 100];

for sigma_i = 1:3
  for C_i = 1:3
    figure;
    model = svmtrain(train_data, label_train, 'kernel_function', ...
                     'rbf', 'rbf_sigma', sigma(sigma_i), ...
                     'boxconstraint', C(C_i), 'showplot', true)
    label_test_data_result = svmclassify(model, test_data, 'showplot', true);
    label_train_data_result = svmclassify(model, train_data);
    train_error = mean(label_train ~= label_train_data_result)
    test_error = mean(label_test ~= label_test_data_result)
    title(sprintf('width=%.1f C=%.1f testerr.=%.3f trainerr.=%.3f', ...
                  sigma(sigma_i), C(C_i), test_error, train_error));
    print -deps
  end
end

The test and training errors for the nine (width, C) combinations were:

  width       C     test err.   train err.
    1.0     1.0       0.045       0.055
    1.0    10.0       0.055       0.055
    1.0   100.0       0.065       0.050
   10.0     1.0       0.250       0.245
   10.0    10.0       0.080       0.085
   10.0   100.0       0.055       0.075
  100.0     1.0       0.510       0.485
  100.0    10.0       0.510       0.485
  100.0   100.0       0.505       0.485

[The nine corresponding plots, each showing the training points, the classified test points, the support vectors and the decision boundary over the range −3 to 3 in both coordinates, are omitted here.]
