582669 Supervised Machine Learning (Spring 2014) Homework 4, sample solutions
Credit for the solutions goes mainly to Panu Luosto and Joonas Paalasmaa, with some additional contributions by Jyrki Kivinen.
Problem 1
First we bound the drop of the potential in one step. Let $t \in \{1, 2, \dots, T\}$. If $\sigma_t = 0$, then $P_t - P_{t+1} = 0$.
Assume $\sigma_t = 1$. Then
\[
P_t - P_{t+1} = \frac{1}{2}\|u - w_t\|_2^2 - \frac{1}{2}\|u - w_{t+1}\|_2^2
= (w_{t+1} - w_t) \cdot (u - w_t) - \frac{1}{2}\|w_t - w_{t+1}\|_2^2.
\]
Since the update rule is $w_{t+1} - w_t = \eta \sigma_t y_t x_t = \eta y_t x_t$, we can write the difference in the form
\begin{align}
P_t - P_{t+1} &= \eta y_t x_t \cdot (u - w_t) - \frac{1}{2}\|\eta y_t x_t\|_2^2 \nonumber \\
&= \eta y_t x_t \cdot u - \eta y_t x_t \cdot w_t - \frac{1}{2}\eta^2 \|x_t\|_2^2. \tag{1}
\end{align}
It is assumed that $y_t x_t \cdot u \geq 1$. Because $\sigma_t = 1$ means that a mistake was made, we have $y_t w_t \cdot x_t \leq 0$ and hence $-\eta y_t x_t \cdot w_t \geq 0$. Additionally $\|x_t\|_2 \leq X$. Plugging all of this into (1) yields
\[
P_t - P_{t+1} \geq \eta - \frac{1}{2}\eta^2 X^2.
\]
We can now use this result in the following. Since the potentials are non-negative,
\[
P_1 = \frac{1}{2}\|u - w_{\mathrm{init}}\|_2^2
\geq P_1 - P_{T+1}
= \sum_{t=1}^{T} (P_t - P_{t+1})
\geq \sum_{t=1}^{T} \left(\eta - \frac{1}{2}\eta^2 X^2\right)\sigma_t.
\]
If $\eta - \frac{1}{2}\eta^2 X^2 > 0$, this yields
\[
\sum_{t=1}^{T} \sigma_t \leq \frac{\frac{1}{2}\|u - w_{\mathrm{init}}\|_2^2}{\eta - \frac{1}{2}\eta^2 X^2}.
\]
To get the lowest upper bound for the number of mistakes, we maximize the value in the denominator.
Let $f(\eta) = \eta - \frac{1}{2}\eta^2 X^2$. The function $f$ is differentiable and concave on $\left]0, \infty\right[$, and it attains its maximum where $f'(\eta) = 1 - \eta X^2 = 0$, that is, at $\eta = 1/X^2$. The maximum value $f(1/X^2) = 1/(2X^2)$ is positive, and the bound is thus
\[
\sum_{t=1}^{T} \sigma_t \leq \frac{1}{2}\|u - w_{\mathrm{init}}\|_2^2 \cdot 2X^2 = \|u - w_{\mathrm{init}}\|_2^2 \, X^2.
\]
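For intuition, here is a small numeric sanity check of this bound, written in the same Octave style as the Problem 4 script. It is our own illustration rather than part of the assignment: all names are our choices, the data is rescaled so that $u$ separates it with margin 1, and $w_{\mathrm{init}} = 0$.

d = 10; n = 500;
u = randn(1, d); u = u / norm(u);          % comparison vector, ||u||_2 = 1
xs = randn(n, d);
ys = sign(xs * u');
xs = xs / min(abs(xs * u'));               % rescale so that y_t (u . x_t) >= 1
X = max(sqrt(sumsq(xs, 2)));               % X = max_t ||x_t||_2
eta = 1 / X^2;                             % the rate that optimizes the bound
w = zeros(1, d); mistakes = 0;
for t = [1:n]
  if (ys(t) * (w * xs(t, :)') <= 0)        % sigma_t = 1: count a mistake and update
    w += eta * ys(t) * xs(t, :);
    mistakes++;
  endif
endfor
printf("mistakes = %d, bound = %.1f\n", mistakes, norm(u)^2 * X^2);

On typical runs the mistake count stays well below $\|u - w_{\mathrm{init}}\|_2^2 X^2$, as the analysis predicts.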
Problem 2
Let $x = (x_1, x_2, \dots, x_n)$. In the following we use the notational shorthand $x_m = (x_1, x_2, \dots, x_m)$. The definition of the kernel can then be written somewhat more explicitly as
\[
k_q(x_m, z_m) = \sum_{\substack{A \subset \{1, 2, \dots, m\}: \\ |A| = q}} \psi_A(x_m)\,\psi_A(z_m).
\]
Notice first that if $q = 1$, then
\[
k_q(x_m, z_m) = \sum_{i=1}^{m} \psi_{\{i\}}(x_m)\,\psi_{\{i\}}(z_m) = \sum_{i=1}^{m} x_i z_i = x_m \cdot z_m.
\]
Now let $q \in \{2, 3, \dots, n\}$ and $m \in \{q, q+1, \dots, n\}$. For further notational convenience, we define $k_i(x_{i-1}, z_{i-1}) \equiv 0$ for all $i \in \{2, \dots, n\}$. We can now derive a recursion formula by splitting the subsets $A$ according to whether they contain the index $m$:
\begin{align*}
k_q(x_m, z_m) &= \sum_{\substack{A \subset \{1,\dots,m\}: \\ |A| = q}} \psi_A(x_m)\,\psi_A(z_m)
= \sum_{\substack{A \subset \{1,\dots,m\}: \\ |A| = q}} \prod_{i \in A} x_i z_i \\
&= \sum_{\substack{A \subset \{1,\dots,m-1\}: \\ |A| = q-1}} \left(\prod_{i \in A} x_i z_i\right) x_m z_m
+ \sum_{\substack{A \subset \{1,\dots,m-1\}: \\ |A| = q}} \prod_{i \in A} x_i z_i \\
&= k_{q-1}(x_{m-1}, z_{m-1}) \cdot x_m z_m + k_q(x_{m-1}, z_{m-1}).
\end{align*}
Our algorithm is simply the following: for $i \in \{1, 2, \dots, q\}$ in ascending order, compute and store the values $k_i(x_i, z_i), k_i(x_{i+1}, z_{i+1}), \dots, k_i(x_n, z_n)$.

Computing the values of $k_1(\cdot, \cdot)$ takes time $O(n)$, because $k_1(x_i, z_i) = k_1(x_{i-1}, z_{i-1}) + x_i z_i$ for all $i \in \{2, 3, \dots, n\}$. After that, each remaining value can be computed from the recursion formula in constant time, so the overall time requirement is $O(nq)$.
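For concreteness, here is one possible Octave implementation of this dynamic program. It is a minimal sketch: the function name subset_product_kernel and the choice to tabulate the values in a $q \times n$ matrix are our own, not part of the original solution.

function K = subset_product_kernel(x, z, q)
  % Computes k_q(x_n, z_n) by the recursion derived above in time O(nq).
  n = length(x);
  Kmat = zeros(q, n);                    % Kmat(i, m) = k_i(x_m, z_m); unset entries stay 0
  Kmat(1, :) = cumsum(x(:)' .* z(:)');   % base case: k_1(x_m, z_m) = x_m . z_m
  for i = [2:q]
    for m = [i:n]
      % k_i(x_m, z_m) = k_{i-1}(x_{m-1}, z_{m-1}) x_m z_m + k_i(x_{m-1}, z_{m-1})
      Kmat(i, m) = Kmat(i-1, m-1) * x(m) * z(m) + Kmat(i, m-1);
    endfor
  endfor
  K = Kmat(q, n);
endfunction

The zero initialization of Kmat plays the role of the convention $k_i(x_{i-1}, z_{i-1}) \equiv 0$ above.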
Problem 3
We use the same potential function as in Problem 1, so for all $t \in \{1, 2, \dots, T\}$ the drop of the potential is again
\[
P_t - P_{t+1} = \frac{1}{2}\|u - w_t\|_2^2 - \frac{1}{2}\|u - w_{t+1}\|_2^2
= (w_{t+1} - w_t) \cdot (u - w_t) - \frac{1}{2}\|w_t - w_{t+1}\|_2^2.
\]
This time the update rule is $w_{t+1} - w_t = \eta(y_t - \hat{y}_t)x_t$, and thus
\[
(w_{t+1} - w_t) \cdot (u - w_t)
= \eta(y_t - \hat{y}_t)\,x_t \cdot (u - w_t)
= \eta(y_t - \hat{y}_t)(u \cdot x_t - w_t \cdot x_t)
= \eta(y_t - \hat{y}_t)^2,
\]
where the last step uses the assumptions $u \cdot x_t = y_t$ and $\hat{y}_t = w_t \cdot x_t$. Using the assumption $\|x_t\|_2 \leq X$ for all $t$, we have the bound
\begin{align*}
P_t - P_{t+1} &= \eta(y_t - \hat{y}_t)^2 - \frac{1}{2}\|\eta(y_t - \hat{y}_t)x_t\|_2^2 \\
&\geq \eta(y_t - \hat{y}_t)^2 - \frac{1}{2}\eta^2 X^2 (y_t - \hat{y}_t)^2 \\
&= \left(\eta - \frac{1}{2}\eta^2 X^2\right)(y_t - \hat{y}_t)^2.
\end{align*}
The potentials are all non-negative, and we can estimate that
\[
P_1 = \frac{1}{2}\|u - 0\|_2^2 = \frac{1}{2}\|u\|_2^2
\geq P_1 - P_{T+1}
= \sum_{t=1}^{T} (P_t - P_{t+1})
\geq \sum_{t=1}^{T} \left(\eta - \frac{1}{2}\eta^2 X^2\right)(y_t - \hat{y}_t)^2.
\]
From Problem 1 we know that $\eta - \frac{1}{2}\eta^2 X^2 = 1/(2X^2) > 0$ when $\eta = 1/X^2$. That yields
\[
\sum_{t=1}^{T} (y_t - \hat{y}_t)^2 \leq \frac{1}{2}\|u\|_2^2 \cdot 2X^2 = \|u\|_2^2 X^2.
\]
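The update analysed above is easy to state in code. The following Octave function is a minimal sketch of our own (the name online_lms and its interface are assumptions, not part of the assignment); with $\eta = 1/X^2$ it runs exactly the update whose cumulative squared loss the bound above controls.

function [w, sq_loss] = online_lms(xs, ys, eta)
  % One pass of the online update w_{t+1} = w_t + eta (y_t - hat_y_t) x_t,
  % accumulating the squared loss sum_t (y_t - hat_y_t)^2.
  [n, d] = size(xs);
  w = zeros(1, d);
  sq_loss = 0;
  for t = [1:n]
    hat_y = w * xs(t, :)';                    % predict hat_y_t = w_t . x_t
    sq_loss += (ys(t) - hat_y)^2;
    w += eta * (ys(t) - hat_y) * xs(t, :);    % update
  endfor
endfunction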
Extra credit. In the more general case, we no longer assume that $u \cdot x_t = y_t$. For all $t$, we can still write
\begin{align*}
P_t - P_{t+1} &= \eta(y_t - \hat{y}_t)\,x_t \cdot (u - w_t) - \frac{1}{2}\|w_t - w_{t+1}\|_2^2 \\
&\geq \eta(y_t - \hat{y}_t)(u \cdot x_t - \hat{y}_t) - \frac{1}{2}\eta^2 (y_t - \hat{y}_t)^2 X^2 \\
&= \eta(y_t - \hat{y}_t)\big((u \cdot x_t - y_t) + (y_t - \hat{y}_t)\big) - \frac{1}{2}\eta^2 (y_t - \hat{y}_t)^2 X^2 \\
&= \left(\eta - \frac{1}{2}\eta^2 X^2\right)(y_t - \hat{y}_t)^2 - \eta(\hat{y}_t - y_t)(u \cdot x_t - y_t).
\end{align*}
Because for any $a, b \in \mathbb{R}$ it holds that $(a - b)^2 = a^2 - 2ab + b^2 \geq 0$, and hence $ab \leq \frac{1}{2}(a^2 + b^2)$, we can estimate that
\begin{align*}
P_t - P_{t+1} &\geq \left(\eta - \frac{1}{2}\eta^2 X^2\right)(y_t - \hat{y}_t)^2 - \eta(\hat{y}_t - y_t)(u \cdot x_t - y_t) \\
&\geq \left(\eta - \frac{1}{2}\eta^2 X^2\right)(y_t - \hat{y}_t)^2 - \eta \cdot \frac{1}{2}\big((\hat{y}_t - y_t)^2 + (u \cdot x_t - y_t)^2\big) \\
&= \frac{1}{2}\left(\eta - \eta^2 X^2\right)(y_t - \hat{y}_t)^2 - \frac{1}{2}\eta\,(y_t - u \cdot x_t)^2.
\end{align*}
Summing over $t \in \{1, 2, \dots, T\}$ yields
\[
P_1 = \frac{1}{2}\|u\|_2^2
\geq \sum_{t=1}^{T} (P_t - P_{t+1})
\geq \sum_{t=1}^{T} \left(\frac{1}{2}(\eta - \eta^2 X^2)(y_t - \hat{y}_t)^2 - \frac{1}{2}\eta\,(y_t - u \cdot x_t)^2\right),
\]
or
\[
(\eta - \eta^2 X^2) \sum_{t=1}^{T} (y_t - \hat{y}_t)^2 \leq \|u\|_2^2 + \eta \sum_{t=1}^{T} (y_t - u \cdot x_t)^2.
\]
When $\eta - \eta^2 X^2 > 0$, that is, $0 < \eta < 1/X^2$, we thus have the bound
\[
\sum_{t=1}^{T} (y_t - \hat{y}_t)^2
\leq \frac{1}{\eta - \eta^2 X^2} \left( \|u\|_2^2 + \eta \sum_{t=1}^{T} (y_t - u \cdot x_t)^2 \right)
= \frac{\|u\|_2^2}{\eta - \eta^2 X^2} + \frac{1}{1 - \eta X^2} \sum_{t=1}^{T} (y_t - u \cdot x_t)^2.
\]
This bound has an additional penalty term that depends on the sum of squared errors that the comparison vector $u$ makes.
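As a quick numeric illustration of the noisy-case bound, the following snippet is our own sketch: it reuses the online_lms function from the Problem 3 sketch above, and the noise level and the choice $\eta = 1/(2X^2)$ are arbitrary assumptions.

d = 5; n = 1000;
u = randn(1, d);
xs = randn(n, d);
X = max(sqrt(sumsq(xs, 2)));               % X = max_t ||x_t||_2
ys = xs * u' + 0.1 * randn(n, 1);          % y_t = u . x_t + Gaussian noise
eta = 1 / (2 * X^2);                       % any 0 < eta < 1/X^2 is admissible
[w, loss] = online_lms(xs, ys, eta);
bound = norm(u)^2 / (eta - eta^2 * X^2) + sum((ys - xs * u').^2) / (1 - eta * X^2);
printf("loss = %.2f, bound = %.2f\n", loss, bound);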
Problem 4
The Perceptron Convergence Theorem guarantees that the Perceptron Algorithm makes at most $B^2/\gamma^2$ mistakes, where $B$ is an upper bound on the Euclidean norm of any $x$ in the sample. Since the data was sampled from the $[-1, 1]^d$ hypercube, we know that $B^2 \leq d$. However, this bound turned out to be very loose in the following experiment.
#!/usr/bin/octave -qf

% Draw n points uniformly from [-1,1]^d, rejecting those that fall within
% distance gam of the separating hyperplane x_1 + x_2 = 0.
function [xs, ys] = make_sample(n, d, gam)
  xs = zeros(n, d);
  ys = zeros(n, 1);
  cnt = 0;
  while (cnt < n)
    x = 2 * (rand(1, d) - 0.5);
    t = (x(1) + x(2)) / sqrt(2);      % signed distance from the hyperplane
    if (abs(t) > gam)
      xs(++cnt, :) = x;
      ys(cnt) = sign(t);
    endif
  endwhile
endfunction

% Run the Perceptron Algorithm with a bias term until a full pass over the
% sample produces no mistakes; return the total number of mistakes made.
function mistakes = perceptron(xs, ys)
  mistakes = 0;
  right_predictions = 0;
  n = rows(xs);
  xs = cat(2, xs, ones(n, 1));        % append a constant feature for the bias
  w = zeros(1, columns(xs));
  while (right_predictions < n)
    for i = [1:n]
      hat_y = sign(w * xs(i, :)');
      if (hat_y == 0)
        hat_y = 1;
      endif
      if (hat_y != ys(i))
        mistakes++;
        right_predictions = 0;
        w = w + ys(i) * xs(i, :);
      elseif (++right_predictions == n)
        break;
      endif
    endfor
  endwhile
endfunction

figure(1, "visible", "off");
n = 1000;

% Mistakes as a function of gamma, for several fixed dimensions.
ds = [5 10 100 500 1000];
gammas = [0.05:0.05:1];
mistakes = zeros(length(ds), length(gammas));
for i = [1:(length(ds))]
  for j = [1:(length(gammas))]
    [xs ys] = make_sample(n, ds(i), gammas(j));
    mistakes(i, j) = perceptron(xs, ys);
  endfor
  plot(gammas, mistakes(i, :), "+");
  print(sprintf("dim_%d.eps", ds(i)), "-deps");
endfor
plot(gammas, mistakes', "o-");
print("fixed_dim.eps", "-deps");

% Mistakes as a function of the dimension, for several fixed gammas.
gammas = [0.05 0.1 0.4 0.7 1];
ds = [5:50:1000];
mistakes = zeros(length(gammas), length(ds));
for i = [1:(length(gammas))]
  for j = [1:(length(ds))]
    [xs ys] = make_sample(n, ds(j), gammas(i));
    mistakes(i, j) = perceptron(xs, ys);
  endfor
  plot(ds, mistakes(i, :), "+");
  print(sprintf("gamma_%.2f.eps", gammas(i)), "-deps");
endfor
plot(ds, mistakes', "o-");
print("fixed_gamma.eps", "-deps");
Figure 1: Number of mistakes as a function of $\gamma$. Number of dimensions from bottom to top: 5, 10, 100, 500, 1000. The size of the sample was $n = 1000$.
Figure 2: Number of mistakes as a function of $d$. $\gamma$ from top to bottom: 0.05, 0.1, 0.4, 0.7, 1. The size of the sample was $n = 1000$.