582669 Supervised Machine Learning (Spring 2014) Homework 4, sample solutions
Credit for the solutions goes mainly to Panu Luosto and Joonas Paalasmaa, with some additional contributions by Jyrki Kivinen.
Problem 1
First we bound the drop of the potential in one step. Let $t \in \{1, 2, \dots, T\}$. If $\sigma_t = 0$, then $P_t - P_{t+1} = 0$.
Assume $\sigma_t = 1$. Then
\[
P_t - P_{t+1} = \frac{1}{2}\|u - w_t\|_2^2 - \frac{1}{2}\|u - w_{t+1}\|_2^2
= (w_{t+1} - w_t) \cdot (u - w_t) - \frac{1}{2}\|w_t - w_{t+1}\|_2^2.
\]
Since the update rule is $w_{t+1} - w_t = \eta \sigma_t y_t x_t = \eta y_t x_t$, we can write the difference in the form
\begin{align}
P_t - P_{t+1} &= \eta y_t x_t \cdot (u - w_t) - \frac{1}{2}\|\eta y_t x_t\|_2^2 \nonumber \\
&= \eta y_t x_t \cdot u - \eta y_t x_t \cdot w_t - \frac{1}{2}\eta^2 \|x_t\|_2^2. \tag{1}
\end{align}
It is assumed that $y_t x_t \cdot u \geq 1$. Because $\sigma_t = 1$ means that a mistake was made, we have $y_t w_t \cdot x_t \leq 0$ and hence $-\eta y_t x_t \cdot w_t \geq 0$. Additionally $\|x_t\|_2 \leq X$. Plugging all of this into (1) yields
\[
P_t - P_{t+1} \geq \eta - \frac{1}{2}\eta^2 X^2.
\]
We can now use this result in the following. Since the potentials are non-negative,
\[
P_1 = \frac{1}{2}\|u - w_{\mathrm{init}}\|_2^2
\geq P_1 - P_{T+1}
= \sum_{t=1}^{T} (P_t - P_{t+1})
\geq \sum_{t=1}^{T} \left(\eta - \frac{1}{2}\eta^2 X^2\right)\sigma_t.
\]
If $\eta - \frac{1}{2}\eta^2 X^2 > 0$, this yields
\[
\sum_{t=1}^{T} \sigma_t \leq \frac{\frac{1}{2}\|u - w_{\mathrm{init}}\|_2^2}{\eta - \frac{1}{2}\eta^2 X^2}.
\]
To get the lowest upper bound for the number of mistakes, we maximize the value in the denominator.
Let $f(\eta) = \eta - \frac{1}{2}\eta^2 X^2$. The function $f$ is differentiable and concave on $\left]0, \infty\right[$, and it attains its maximum where $f'(\eta) = 1 - \eta X^2 = 0$, that is, at $\eta = 1/X^2$. The maximum value $f(1/X^2) = 1/(2X^2)$ is positive, and the bound is thus
\[
\sum_{t=1}^{T} \sigma_t \leq \frac{1}{2}\|u - w_{\mathrm{init}}\|_2^2 \cdot 2X^2 = \|u - w_{\mathrm{init}}\|_2^2 \, X^2.
\]
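For intuition, here is a small numeric sanity check of this bound, written in the same Octave style as the Problem 4 script. It is our own illustration rather than part of the assignment: all names are our choices, the data is rescaled so that $u$ separates it with margin 1, and $w_{\mathrm{init}} = 0$.

d = 10; n = 500;
u = randn(1, d); u = u / norm(u);          % comparison vector, ||u||_2 = 1
xs = randn(n, d);
ys = sign(xs * u');
xs = xs / min(abs(xs * u'));               % rescale so that y_t (u . x_t) >= 1
X = max(sqrt(sumsq(xs, 2)));               % X = max_t ||x_t||_2
eta = 1 / X^2;                             % the rate that optimizes the bound
w = zeros(1, d); mistakes = 0;
for t = [1:n]
  if (ys(t) * (w * xs(t, :)') <= 0)        % sigma_t = 1: count a mistake and update
    w += eta * ys(t) * xs(t, :);
    mistakes++;
  endif
endfor
printf("mistakes = %d, bound = %.1f\n", mistakes, norm(u)^2 * X^2);

On typical runs the mistake count stays well below $\|u - w_{\mathrm{init}}\|_2^2 X^2$, as the analysis predicts.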
Problem 2
Let $x = (x_1, x_2, \dots, x_n)$. In the following we use the notational shorthand $x_m = (x_1, x_2, \dots, x_m)$. The definition of the kernel can then be written somewhat more explicitly as
\[
k_q(x_m, z_m) = \sum_{\substack{A \subset \{1, 2, \dots, m\}: \\ |A| = q}} \psi_A(x_m)\,\psi_A(z_m).
\]
Notice first that if $q = 1$, then
\[
k_q(x_m, z_m) = \sum_{i=1}^{m} \psi_{\{i\}}(x_m)\,\psi_{\{i\}}(z_m) = \sum_{i=1}^{m} x_i z_i = x_m \cdot z_m.
\]
Now let $q \in \{2, 3, \dots, n\}$ and $m \in \{q, q+1, \dots, n\}$. For further notational convenience, we define $k_i(x_{i-1}, z_{i-1}) \equiv 0$ for all $i \in \{2, \dots, n\}$. We can now derive a recursion formula by splitting the subsets $A$ according to whether they contain the index $m$:
\begin{align*}
k_q(x_m, z_m) &= \sum_{\substack{A \subset \{1,\dots,m\}: \\ |A| = q}} \psi_A(x_m)\,\psi_A(z_m)
= \sum_{\substack{A \subset \{1,\dots,m\}: \\ |A| = q}} \prod_{i \in A} x_i z_i \\
&= \sum_{\substack{A \subset \{1,\dots,m-1\}: \\ |A| = q-1}} \left(\prod_{i \in A} x_i z_i\right) x_m z_m
+ \sum_{\substack{A \subset \{1,\dots,m-1\}: \\ |A| = q}} \prod_{i \in A} x_i z_i \\
&= k_{q-1}(x_{m-1}, z_{m-1}) \cdot x_m z_m + k_q(x_{m-1}, z_{m-1}).
\end{align*}
Our algorithm is simply the following: for $i \in \{1, 2, \dots, q\}$ in ascending order, compute and store the values $k_i(x_i, z_i), k_i(x_{i+1}, z_{i+1}), \dots, k_i(x_n, z_n)$.

Computing the values of $k_1(\cdot, \cdot)$ takes time $O(n)$, because $k_1(x_i, z_i) = k_1(x_{i-1}, z_{i-1}) + x_i z_i$ for all $i \in \{2, 3, \dots, n\}$. After that, each remaining value can be computed from the recursion formula in constant time, so the overall time requirement is $O(nq)$.
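For concreteness, here is one possible Octave implementation of this dynamic program. It is a minimal sketch: the function name subset_product_kernel and the choice to tabulate the values in a $q \times n$ matrix are our own, not part of the original solution.

function K = subset_product_kernel(x, z, q)
  % Computes k_q(x_n, z_n) by the recursion derived above in time O(nq).
  n = length(x);
  Kmat = zeros(q, n);                    % Kmat(i, m) = k_i(x_m, z_m); unset entries stay 0
  Kmat(1, :) = cumsum(x(:)' .* z(:)');   % base case: k_1(x_m, z_m) = x_m . z_m
  for i = [2:q]
    for m = [i:n]
      % k_i(x_m, z_m) = k_{i-1}(x_{m-1}, z_{m-1}) x_m z_m + k_i(x_{m-1}, z_{m-1})
      Kmat(i, m) = Kmat(i-1, m-1) * x(m) * z(m) + Kmat(i, m-1);
    endfor
  endfor
  K = Kmat(q, n);
endfunction

The zero initialization of Kmat plays the role of the convention $k_i(x_{i-1}, z_{i-1}) \equiv 0$ above.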
Problem 3
We use the same potential function as in Problem 1, so for all $t \in \{1, 2, \dots, T\}$ the drop of the potential is again
\[
P_t - P_{t+1} = \frac{1}{2}\|u - w_t\|_2^2 - \frac{1}{2}\|u - w_{t+1}\|_2^2
= (w_{t+1} - w_t) \cdot (u - w_t) - \frac{1}{2}\|w_t - w_{t+1}\|_2^2.
\]
This time the update rule is $w_{t+1} - w_t = \eta(y_t - \hat{y}_t)x_t$, and thus
\[
(w_{t+1} - w_t) \cdot (u - w_t)
= \eta(y_t - \hat{y}_t)\,x_t \cdot (u - w_t)
= \eta(y_t - \hat{y}_t)(u \cdot x_t - w_t \cdot x_t)
= \eta(y_t - \hat{y}_t)^2,
\]
where the last step uses the assumptions $u \cdot x_t = y_t$ and $\hat{y}_t = w_t \cdot x_t$. Using the assumption $\|x_t\|_2 \leq X$ for all $t$, we have the bound
\begin{align*}
P_t - P_{t+1} &= \eta(y_t - \hat{y}_t)^2 - \frac{1}{2}\|\eta(y_t - \hat{y}_t)x_t\|_2^2 \\
&\geq \eta(y_t - \hat{y}_t)^2 - \frac{1}{2}\eta^2 X^2 (y_t - \hat{y}_t)^2 \\
&= \left(\eta - \frac{1}{2}\eta^2 X^2\right)(y_t - \hat{y}_t)^2.
\end{align*}
The potentials are all non-negative, and we can estimate that
\[
P_1 = \frac{1}{2}\|u - 0\|_2^2 = \frac{1}{2}\|u\|_2^2
\geq P_1 - P_{T+1}
= \sum_{t=1}^{T} (P_t - P_{t+1})
\geq \sum_{t=1}^{T} \left(\eta - \frac{1}{2}\eta^2 X^2\right)(y_t - \hat{y}_t)^2.
\]
From Problem 1 we know that $\eta - \frac{1}{2}\eta^2 X^2 = 1/(2X^2) > 0$ when $\eta = 1/X^2$. That yields
\[
\sum_{t=1}^{T} (y_t - \hat{y}_t)^2 \leq \frac{1}{2}\|u\|_2^2 \cdot 2X^2 = \|u\|_2^2 X^2.
\]
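The update analysed above is easy to state in code. The following Octave function is a minimal sketch of our own (the name online_lms and its interface are assumptions, not part of the assignment); with $\eta = 1/X^2$ it runs exactly the update whose cumulative squared loss the bound above controls.

function [w, sq_loss] = online_lms(xs, ys, eta)
  % One pass of the online update w_{t+1} = w_t + eta (y_t - hat_y_t) x_t,
  % accumulating the squared loss sum_t (y_t - hat_y_t)^2.
  [n, d] = size(xs);
  w = zeros(1, d);
  sq_loss = 0;
  for t = [1:n]
    hat_y = w * xs(t, :)';                    % predict hat_y_t = w_t . x_t
    sq_loss += (ys(t) - hat_y)^2;
    w += eta * (ys(t) - hat_y) * xs(t, :);    % update
  endfor
endfunction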
Extra credit. In the more general case, we no longer assume that $u \cdot x_t = y_t$. For all $t$, we can still write
\begin{align*}
P_t - P_{t+1} &= \eta(y_t - \hat{y}_t)\,x_t \cdot (u - w_t) - \frac{1}{2}\|w_t - w_{t+1}\|_2^2 \\
&\geq \eta(y_t - \hat{y}_t)(u \cdot x_t - \hat{y}_t) - \frac{1}{2}\eta^2 (y_t - \hat{y}_t)^2 X^2 \\
&= \eta(y_t - \hat{y}_t)\big((u \cdot x_t - y_t) + (y_t - \hat{y}_t)\big) - \frac{1}{2}\eta^2 (y_t - \hat{y}_t)^2 X^2 \\
&= \left(\eta - \frac{1}{2}\eta^2 X^2\right)(y_t - \hat{y}_t)^2 - \eta(\hat{y}_t - y_t)(u \cdot x_t - y_t).
\end{align*}
Because for any $a, b \in \mathbb{R}$ it holds that $(a - b)^2 = a^2 - 2ab + b^2 \geq 0$, and hence $ab \leq \frac{1}{2}(a^2 + b^2)$, we can estimate that
\begin{align*}
P_t - P_{t+1} &\geq \left(\eta - \frac{1}{2}\eta^2 X^2\right)(y_t - \hat{y}_t)^2 - \eta(\hat{y}_t - y_t)(u \cdot x_t - y_t) \\
&\geq \left(\eta - \frac{1}{2}\eta^2 X^2\right)(y_t - \hat{y}_t)^2 - \eta \cdot \frac{1}{2}\big((\hat{y}_t - y_t)^2 + (u \cdot x_t - y_t)^2\big) \\
&= \frac{1}{2}\left(\eta - \eta^2 X^2\right)(y_t - \hat{y}_t)^2 - \frac{1}{2}\eta\,(y_t - u \cdot x_t)^2.
\end{align*}
Summing over $t \in \{1, 2, \dots, T\}$ yields
\[
P_1 = \frac{1}{2}\|u\|_2^2
\geq \sum_{t=1}^{T} (P_t - P_{t+1})
\geq \sum_{t=1}^{T} \left(\frac{1}{2}(\eta - \eta^2 X^2)(y_t - \hat{y}_t)^2 - \frac{1}{2}\eta\,(y_t - u \cdot x_t)^2\right),
\]
or
\[
(\eta - \eta^2 X^2) \sum_{t=1}^{T} (y_t - \hat{y}_t)^2 \leq \|u\|_2^2 + \eta \sum_{t=1}^{T} (y_t - u \cdot x_t)^2.
\]
When $\eta - \eta^2 X^2 > 0$, that is, $0 < \eta < 1/X^2$, we thus have the bound
\[
\sum_{t=1}^{T} (y_t - \hat{y}_t)^2
\leq \frac{1}{\eta - \eta^2 X^2} \left( \|u\|_2^2 + \eta \sum_{t=1}^{T} (y_t - u \cdot x_t)^2 \right)
= \frac{\|u\|_2^2}{\eta - \eta^2 X^2} + \frac{1}{1 - \eta X^2} \sum_{t=1}^{T} (y_t - u \cdot x_t)^2.
\]
This bound has an additional penalty term that depends on the sum of squared errors that the comparison vector $u$ makes.
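As a quick numeric illustration of the noisy-case bound, the following snippet is our own sketch: it reuses the online_lms function from the Problem 3 sketch above, and the noise level and the choice $\eta = 1/(2X^2)$ are arbitrary assumptions.

d = 5; n = 1000;
u = randn(1, d);
xs = randn(n, d);
X = max(sqrt(sumsq(xs, 2)));               % X = max_t ||x_t||_2
ys = xs * u' + 0.1 * randn(n, 1);          % y_t = u . x_t + Gaussian noise
eta = 1 / (2 * X^2);                       % any 0 < eta < 1/X^2 is admissible
[w, loss] = online_lms(xs, ys, eta);
bound = norm(u)^2 / (eta - eta^2 * X^2) + sum((ys - xs * u').^2) / (1 - eta * X^2);
printf("loss = %.2f, bound = %.2f\n", loss, bound);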
Problem 4
The Perceptron Convergence Theorem guarantees that the Perceptron Algorithm makes at most $B^2/\gamma^2$ mistakes, where $B$ is an upper bound on the Euclidean norm of any $x$ in the sample. Since the data was sampled from the $[-1, 1]^d$ hypercube, we know that $B^2 \leq d$. However, this bound turned out to be very loose in the following experiment.
#!/usr/bin/octave -qf

% Draw n points uniformly from [-1,1]^d, rejecting those that fall within
% distance gam of the separating hyperplane x_1 + x_2 = 0.
function [xs, ys] = make_sample(n, d, gam)
  xs = zeros(n, d);
  ys = zeros(n, 1);
  cnt = 0;
  while (cnt < n)
    x = 2 * (rand(1, d) - 0.5);
    t = (x(1) + x(2)) / sqrt(2);      % signed distance from the hyperplane
    if (abs(t) > gam)
      xs(++cnt, :) = x;
      ys(cnt) = sign(t);
    endif
  endwhile
endfunction

% Run the Perceptron Algorithm with a bias term until a full pass over the
% sample produces no mistakes; return the total number of mistakes made.
function mistakes = perceptron(xs, ys)
  mistakes = 0;
  right_predictions = 0;
  n = rows(xs);
  xs = cat(2, xs, ones(n, 1));        % append a constant feature for the bias
  w = zeros(1, columns(xs));
  while (right_predictions < n)
    for i = [1:n]
      hat_y = sign(w * xs(i, :)');
      if (hat_y == 0)
        hat_y = 1;
      endif
      if (hat_y != ys(i))
        mistakes++;
        right_predictions = 0;
        w = w + ys(i) * xs(i, :);
      elseif (++right_predictions == n)
        break;
      endif
    endfor
  endwhile
endfunction

figure(1, "visible", "off");
n = 1000;

% Mistakes as a function of gamma, for several fixed dimensions.
ds = [5 10 100 500 1000];
gammas = [0.05:0.05:1];
mistakes = zeros(length(ds), length(gammas));
for i = [1:(length(ds))]
  for j = [1:(length(gammas))]
    [xs ys] = make_sample(n, ds(i), gammas(j));
    mistakes(i, j) = perceptron(xs, ys);
  endfor
  plot(gammas, mistakes(i, :), "+");
  print(sprintf("dim_%d.eps", ds(i)), "-deps");
endfor
plot(gammas, mistakes', "o-");
print("fixed_dim.eps", "-deps");

% Mistakes as a function of the dimension, for several fixed gammas.
gammas = [0.05 0.1 0.4 0.7 1];
ds = [5:50:1000];
mistakes = zeros(length(gammas), length(ds));
for i = [1:(length(gammas))]
  for j = [1:(length(ds))]
    [xs ys] = make_sample(n, ds(j), gammas(i));
    mistakes(i, j) = perceptron(xs, ys);
  endfor
  plot(ds, mistakes(i, :), "+");
  print(sprintf("gamma_%.2f.eps", gammas(i)), "-deps");
endfor
plot(ds, mistakes', "o-");
print("fixed_gamma.eps", "-deps");
Figure 1: Number of mistakes as a function of $\gamma$. Number of dimensions from bottom to top: 5, 10, 100, 500, 1000. The size of the sample was $n = 1000$.
Figure 2: Number of mistakes as a function of $d$. $\gamma$ from top to bottom: 0.05, 0.1, 0.4, 0.7, 1. The size of the sample was $n = 1000$.