
582669 Supervised Machine Learning (Spring 2014) Homework 4, sample solutions

Credit for the solutions goes mainly to Panu Luosto and Joonas Paalasmaa, with some additional contributions by Jyrki Kivinen.

Problem 1

First we bound the drop of the potential in one step. Let $t \in \{1, 2, \ldots, T\}$. If $\sigma_t = 0$ then $P_t - P_{t+1} = 0$.

Assume $\sigma_t = 1$. Then
\[
P_t - P_{t+1}
= \frac{1}{2}\|u - w_t\|_2^2 - \frac{1}{2}\|u - w_{t+1}\|_2^2
= (w_{t+1} - w_t)\cdot(u - w_t) - \frac{1}{2}\|w_t - w_{t+1}\|_2^2 .
\]

Since the update rule is $w_{t+1} - w_t = \eta\sigma_t y_t x_t = \eta y_t x_t$, we can write the difference in the form
\[
P_t - P_{t+1}
= \eta y_t x_t \cdot (u - w_t) - \frac{1}{2}\|\eta y_t x_t\|_2^2
= \eta y_t x_t \cdot u - \eta y_t x_t \cdot w_t - \frac{1}{2}\eta^2\|x_t\|_2^2 . \tag{1}
\]

It is assumed that $y_t\, x_t \cdot u \ge 1$. Because $\sigma_t = 1$, it holds that $-\eta y_t\, x_t \cdot w_t \ge 0$. Additionally, $\|x_t\|_2 \le X$.

Plugging all that into (1) yields
\[
P_t - P_{t+1} \ge \eta - \frac{1}{2}\eta^2 X^2 .
\]

We can now use the result in the following (note that $P_{T+1} \ge 0$):
\[
\frac{1}{2}\|u - w_{\mathrm{init}}\|_2^2
= P_1
\ge P_1 - P_{T+1}
= \sum_{t=1}^{T} (P_t - P_{t+1})
\ge \sum_{t=1}^{T} \left(\eta - \frac{1}{2}\eta^2 X^2\right)\sigma_t .
\]

If $\eta - (1/2)\eta^2 X^2 > 0$, that yields
\[
\sum_{t=1}^{T} \sigma_t \le \frac{\frac{1}{2}\|u - w_{\mathrm{init}}\|_2^2}{\eta - \frac{1}{2}\eta^2 X^2} .
\]

To get the lowest upper bound for the number of mistakes, we maximize the value in the denominator.

Let $f(\eta) = \eta - (1/2)\eta^2 X^2$. The function $f$ is differentiable and concave on the interval $\mathopen{]}0, \infty\mathclose{[}$, and it attains its maximum where $f'(\eta) = 1 - \eta X^2 = 0 \Rightarrow \eta = 1/X^2$. The maximum value $f(1/X^2) = 1/(2X^2)$ is positive, and the bound is thus

\[
\sum_{t=1}^{T} \sigma_t
\le \frac{1}{2}\|u - w_{\mathrm{init}}\|_2^2 \cdot 2X^2
= \|u - w_{\mathrm{init}}\|_2^2\, X^2 . \tag{2}
\]
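Bound (2) is easy to check empirically. The following Octave sketch is only an illustration under assumptions of our own choosing (the sampling scheme, the margin parameter and all variable names below are not part of the exercise): it generates linearly separable data with $y_t\, x_t\cdot u \ge 1$, runs the update with $w_{\mathrm{init}} = 0$, $\eta = 1/X^2$ and $\sigma_t = 1$ whenever $y_t\, x_t\cdot w_t \le 0$, and compares the number of updates against $\|u - w_{\mathrm{init}}\|_2^2\, X^2$.

% Sketch: empirical check of the mistake bound (2). All choices below
% (dimension, sample size, margin, data distribution) are illustrative only.
d = 10; T = 5000; margin = 0.2;
u0 = randn(1, d); u0 = u0 / norm(u0);
xs = zeros(T, d); ys = zeros(T, 1); t = 0;
while (t < T)
  x = 2 * (rand(1, d) - 0.5);           % uniform in [-1,1]^d
  if (abs(u0 * x') >= margin)           % rejection sampling keeps a margin
    t++;
    xs(t, :) = x;
    ys(t) = sign(u0 * x');
  endif
endwhile
u = u0 / margin;                        % now y_t * (x_t . u) >= 1 for every t
X = max(sqrt(sum(xs .^ 2, 2)));         % X is an upper bound for ||x_t||_2
eta = 1 / X^2;
w = zeros(1, d);                        % w_init = 0
mistakes = 0;
for t = 1:T
  if (ys(t) * (w * xs(t, :)') <= 0)     % sigma_t = 1: the algorithm updates
    mistakes++;
    w = w + eta * ys(t) * xs(t, :);
  endif
endfor
printf("updates = %d, bound = %.1f\n", mistakes, norm(u)^2 * X^2);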

Problem 2

Let $x = (x_1, x_2, \ldots, x_n)$. We use the notational shorthand $x_m = (x_1, x_2, \ldots, x_m)$ in the following. The definition of the kernel can then be written somewhat explicitly as
\[
k_q(x_m, z_m) = \sum_{\substack{A \subset \{1,2,\ldots,m\}:\\ |A| = q}} \psi_A(x_m)\,\psi_A(z_m) .
\]

Notice first that if $q = 1$, then
\[
k_q(x_m, z_m)
= \sum_{i=1}^{m} \psi_{\{i\}}(x_m)\,\psi_{\{i\}}(z_m)
= \sum_{i=1}^{m} x_i z_i
= x_m \cdot z_m .
\]

Now let $q \in \{2, 3, \ldots, n\}$, and let $m \in \{q, q+1, \ldots, n\}$. For further notational convenience, we define $k_i(x_{i-1}, z_{i-1}) \equiv 0$ for all $i \in \{2, \ldots, n\}$. We can now derive a recursion formula:

\[
\begin{aligned}
k_q(x_m, z_m)
&= \sum_{\substack{A \subset \{1,2,\ldots,m\}:\\ |A| = q}} \psi_A(x_m)\,\psi_A(z_m)
= \sum_{\substack{A \subset \{1,2,\ldots,m\}:\\ |A| = q}} \prod_{i \in A} x_i z_i \\
&= \sum_{\substack{A \subset \{1,2,\ldots,m-1\}:\\ |A| = q-1}} \Bigl(\prod_{i \in A} x_i z_i\Bigr) x_m z_m
\;+ \sum_{\substack{A \subset \{1,2,\ldots,m-1\}:\\ |A| = q}} \prod_{i \in A} x_i z_i \\
&= k_{q-1}(x_{m-1}, z_{m-1}) \cdot x_m z_m + k_q(x_{m-1}, z_{m-1}) .
\end{aligned}
\]
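For example, with $m = 3$ and $q = 2$ the recursion gives
\[
k_2(x_3, z_3) = k_1(x_2, z_2)\, x_3 z_3 + k_2(x_2, z_2)
= (x_1 z_1 + x_2 z_2)\, x_3 z_3 + x_1 z_1 x_2 z_2 ,
\]
which is exactly the sum of $\prod_{i \in A} x_i z_i$ over the three subsets $A \subset \{1,2,3\}$ with $|A| = 2$.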

Our algorithm is simply the following. For $i \in \{1, 2, \ldots, q\}$ in ascending order, compute and store the values $k_i(x_i, z_i), k_i(x_{i+1}, z_{i+1}), \ldots, k_i(x_n, z_n)$.

For computing the values of $k_1(\cdot,\cdot)$, time $O(n)$ is enough, because $k_1(x_i, z_i) = k_1(x_{i-1}, z_{i-1}) + x_i z_i$ for all $i \in \{2, 3, \ldots, n\}$. After that, every single value can be computed using the recursion formula in constant time, so the overall time requirement is $O(nq)$.
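As an illustration, here is a minimal Octave sketch of this dynamic program (the function name and the tabulation layout are our own choices); it uses the convention $k_0 \equiv 1$, under which the same recursion also reproduces the $q = 1$ base case above.

% Computes k_q(x, z) in time O(n*q) using the recursion above.
% K(i+1, m+1) stores k_i(x_m, z_m); the zero-initialised entries
% cover the base case k_i(x_{i-1}, z_{i-1}) = 0.
function val = subset_kernel(x, z, q)
  n = length(x);
  K = zeros(q + 1, n + 1);
  K(1, :) = 1;                          % k_0 is an empty product, i.e. 1
  for i = 1:q
    for m = i:n
      K(i + 1, m + 1) = K(i, m) * x(m) * z(m) + K(i + 1, m);
    endfor
  endfor
  val = K(q + 1, n + 1);
endfunction

For small $n$ and $q$ the result can be compared directly against the explicit sum over all size-$q$ subsets.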

Problem 3

We use a similar potential function as in the first exercise, so for all $t \in \{1, 2, \ldots, T\}$ the drop of the potential is again
\[
P_t - P_{t+1}
= \frac{1}{2}\|u - w_t\|_2^2 - \frac{1}{2}\|u - w_{t+1}\|_2^2
= (w_{t+1} - w_t)\cdot(u - w_t) - \frac{1}{2}\|w_t - w_{t+1}\|_2^2 .
\]

The update rule is this time $w_{t+1} - w_t = \eta(y_t - \hat y_t)x_t$, and thus $(w_{t+1} - w_t)\cdot(u - w_t) = \eta(y_t - \hat y_t)\,x_t\cdot(u - w_t) = \eta(y_t - \hat y_t)(u\cdot x_t - w_t\cdot x_t) = \eta(y_t - \hat y_t)^2$, since $u\cdot x_t = y_t$ and $\hat y_t = w_t\cdot x_t$. Using the assumption that $\|x_t\|_2 \le X$ for all $t$, we have the bound

\[
\begin{aligned}
P_t - P_{t+1}
&= \eta(y_t - \hat y_t)^2 - \frac{1}{2}\|\eta(y_t - \hat y_t)x_t\|_2^2 \\
&\ge \eta(y_t - \hat y_t)^2 - \frac{1}{2}\eta^2 X^2 (y_t - \hat y_t)^2 \\
&= \left(\eta - \frac{1}{2}\eta^2 X^2\right)(y_t - \hat y_t)^2 .
\end{aligned}
\tag{3}
\]

The potentials are all non-negative, and we can estimate that
\[
\frac{1}{2}\|u\|_2^2
= \frac{1}{2}\|u - 0\|_2^2
= P_1
\ge P_1 - P_{T+1}
= \sum_{t=1}^{T} (P_t - P_{t+1})
\ge \sum_{t=1}^{T} \left(\eta - \frac{1}{2}\eta^2 X^2\right)(y_t - \hat y_t)^2 .
\]

From the first exercise we know that $\eta - (1/2)\eta^2 X^2 = 1/(2X^2) > 0$ when $\eta = 1/X^2$. That yields

\[
\sum_{t=1}^{T} (y_t - \hat y_t)^2
\le \frac{1}{2}\|u\|_2^2 \cdot 2X^2
= \|u\|_2^2\, X^2 .
\]
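This bound, too, can be checked numerically. The Octave sketch below is again only an illustration with choices of our own (dimension, sample size, data distribution, variable names): it draws $x_t$ uniformly from $[-1,1]^d$, sets noise-free targets $y_t = u\cdot x_t$, runs the update $w_{t+1} = w_t + \eta(y_t - \hat y_t)x_t$ with $\eta = 1/X^2$, and compares the total square loss against $\|u\|_2^2 X^2$.

% Sketch: empirical check of the noise-free square-loss bound.
d = 10; T = 5000;
u = randn(1, d);
xs = 2 * (rand(T, d) - 0.5);            % x_t uniform in [-1,1]^d
ys = xs * u';                           % noise-free targets y_t = u . x_t
X = max(sqrt(sum(xs .^ 2, 2)));         % X is an upper bound for ||x_t||_2
eta = 1 / X^2;
w = zeros(1, d);
loss = 0;
for t = 1:T
  yhat = w * xs(t, :)';                 % prediction yhat_t = w_t . x_t
  loss += (ys(t) - yhat)^2;
  w = w + eta * (ys(t) - yhat) * xs(t, :);
endfor
printf("total square loss = %.2f, bound = %.2f\n", loss, norm(u)^2 * X^2);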

Extra credit. In the more general case, we no longer assume that $u \cdot x_t = y_t$. For all $t$ we can still write

\[
\begin{aligned}
P_t - P_{t+1}
&= \eta(y_t - \hat y_t)\,x_t\cdot(u - w_t) - \frac{1}{2}\|w_t - w_{t+1}\|_2^2 \\
&\ge \eta(y_t - \hat y_t)(u\cdot x_t - \hat y_t) - \frac{1}{2}\eta^2(y_t - \hat y_t)^2 X^2 \\
&= \eta(y_t - \hat y_t)\bigl((u\cdot x_t - y_t) + (y_t - \hat y_t)\bigr) - \frac{1}{2}\eta^2(y_t - \hat y_t)^2 X^2 \\
&= \left(\eta - \frac{1}{2}\eta^2 X^2\right)(y_t - \hat y_t)^2 - \eta(\hat y_t - y_t)(u\cdot x_t - y_t) .
\end{aligned}
\]

Because it holds for any $a, b \in \mathbb{R}$ that $(a - b)^2 = a^2 - 2ab + b^2 \ge 0 \Rightarrow ab \le \frac{1}{2}(a^2 + b^2)$, we can estimate that

\[
\begin{aligned}
P_t - P_{t+1}
&\ge \left(\eta - \frac{1}{2}\eta^2 X^2\right)(y_t - \hat y_t)^2 - \eta(\hat y_t - y_t)(u\cdot x_t - y_t) \\
&\ge \left(\eta - \frac{1}{2}\eta^2 X^2\right)(y_t - \hat y_t)^2 - \eta\,\frac{1}{2}\bigl((\hat y_t - y_t)^2 + (u\cdot x_t - y_t)^2\bigr) \\
&= \frac{1}{2}\left(\eta - \eta^2 X^2\right)(y_t - \hat y_t)^2 - \frac{1}{2}\eta\,(y_t - u\cdot x_t)^2 .
\end{aligned}
\]
Summing over $t \in \{1, 2, \ldots, T\}$ yields

\[
\frac{1}{2}\|u\|_2^2
= P_1
\ge \sum_{t=1}^{T} (P_t - P_{t+1})
\ge \sum_{t=1}^{T} \left(\frac{1}{2}(\eta - \eta^2 X^2)(y_t - \hat y_t)^2 - \frac{1}{2}\eta\,(y_t - u\cdot x_t)^2\right),
\]
or
\[
(\eta - \eta^2 X^2)\sum_{t=1}^{T} (y_t - \hat y_t)^2
\le \|u\|_2^2 + \eta\sum_{t=1}^{T} (y_t - u\cdot x_t)^2 . \tag{4}
\]

When $\eta - \eta^2 X^2 > 0 \;\Rightarrow\; \eta < 1/X^2$, we thus have the bound
\[
\sum_{t=1}^{T} (y_t - \hat y_t)^2
\le \frac{1}{\eta - \eta^2 X^2}\left(\|u\|_2^2 + \eta\sum_{t=1}^{T} (y_t - u\cdot x_t)^2\right)
= \frac{\|u\|_2^2}{\eta - \eta^2 X^2} + \frac{1}{1 - \eta X^2}\sum_{t=1}^{T} (y_t - u\cdot x_t)^2 .
\]

This bound has an additional penalty term that depends on the sum of squared errors that the comparison predictor $u$ makes.
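For instance, with the particular choice $\eta = 1/(2X^2)$ (one admissible value satisfying $\eta < 1/X^2$), we have $\eta - \eta^2 X^2 = 1/(4X^2)$ and $1 - \eta X^2 = 1/2$, so the bound reads
\[
\sum_{t=1}^{T} (y_t - \hat y_t)^2 \le 4X^2\,\|u\|_2^2 + 2\sum_{t=1}^{T} (y_t - u\cdot x_t)^2 .
\]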

Problem 4

The Perceptron Convergence Theorem guarantees that the Perceptron Algorithm makes at most $B^2/\gamma^2$ mistakes, where $B$ is an upper bound for the Euclidean norm of any $x$ in the sample and $\gamma$ is the margin. Since the data was sampled from the $[-1,1]^d$ hypercube, we know that $B^2 \le d$. However, this bound turned out to be very loose in the following experiment.

#!/usr/bin/octave -qf

# Generate n points uniformly from [-1,1]^d, keeping only those whose
# distance to the hyperplane x(1) + x(2) = 0 exceeds the margin gam.
function [xs, ys] = make_sample(n, d, gam)
  xs = zeros(n, d);
  ys = zeros(n, 1);
  cnt = 0;
  while (cnt < n)
    x = 2 * (rand(1, d) - 0.5);
    t = (x(1) + x(2)) / sqrt(2);
    if (abs(t) > gam)
      xs(++cnt, :) = x;
      ys(cnt) = sign(t);
    endif
  endwhile
endfunction

# Run the Perceptron Algorithm (with a bias term) over the sample until a
# full pass makes no mistakes, and return the total number of mistakes.
function mistakes = perceptron(xs, ys)
  mistakes = 0;
  right_predictions = 0;
  n = rows(xs);
  xs = cat(2, xs, ones(n, 1));
  w = zeros(1, columns(xs));
  while (right_predictions < n)
    for i = [1:n]
      hat_y = sign(w * xs(i, :)');
      if (hat_y == 0)
        hat_y = 1;
      endif
      if (hat_y != ys(i))
        mistakes++;
        right_predictions = 0;
        w = w + ys(i) * xs(i, :);
      elseif (++right_predictions == n)
        break;
      endif
    endfor
  endwhile
endfunction

figure(1, "visible", "off");
n = 1000;

# Mistakes as a function of gamma for a few fixed dimensions.
ds = [5 10 100 500 1000];
gammas = [0.05:0.05:1];
mistakes = zeros(length(ds), length(gammas));
for i = [1:(length(ds))]
  for j = [1:(length(gammas))]
    [xs ys] = make_sample(n, ds(i), gammas(j));
    mistakes(i, j) = perceptron(xs, ys);
  endfor
  plot(gammas, mistakes(i, :), "+");
  print(sprintf("dim_%d.eps", ds(i)), "-deps");
endfor
plot(gammas, mistakes', "o-");
print("fixed_dim.eps", "-deps");

# Mistakes as a function of the dimension for a few fixed gammas.
gammas = [0.05 0.1 0.4 0.7 1];
ds = [5:50:1000];
mistakes = zeros(length(gammas), length(ds));
for i = [1:(length(gammas))]
  for j = [1:(length(ds))]
    [xs ys] = make_sample(n, ds(j), gammas(i));
    mistakes(i, j) = perceptron(xs, ys);
  endfor
  plot(ds, mistakes(i, :), "+");
  print(sprintf("gamma_%.2f.eps", gammas(i)), "-deps");
endfor
plot(ds, mistakes', "o-");
print("fixed_gamma.eps", "-deps");

Figure 1: Number of mistakes as a function of $\gamma$. Number of dimensions from bottom to top: 5, 10, 100, 500, 1000. The size of the sample was $n = 1000$.

Figure 2: Number of mistakes as a function of $d$. $\gamma$ from top to bottom: 0.05, 0.1, 0.4, 0.7, 1. The size of the sample was $n = 1000$.
