582669 Supervised Machine Learning (Spring 2012) Homework 4 (17 February)
1. We generalise the Perceptron Algorithm by introducing a learning rate $\eta > 0$. The update becomes
\[ w_{t+1} = w_t + \eta \sigma_t y_t x_t. \]
Further, we start the algorithm with $w_1 = w_{\mathrm{init}}$, where the initial weights need not be zero. (Note that if $w_{\mathrm{init}} = 0$, then the learning rate does not affect the predictions $\operatorname{sign}(w_t \cdot x_t)$.) Assume that $\|x_t\|_2 \le X$ for some $X > 0$, and that some $u \in \mathbb{R}^d$ satisfies $y_t \, u \cdot x_t \ge 1$ for all $t$. Modify the proof of the Perceptron Convergence Theorem by using
\[ P_t = \tfrac{1}{2} \|u - w_t\|_2^2 \]
as the potential function. The result should be that
\[ \sum_{t=1}^{T} \sigma_t \le \|u - w_{\mathrm{init}}\|_2^2 \, X^2 \]
for a suitable choice of $\eta$. Thus, if we start the algorithm close to the target, we get a smaller mistake bound.
Hint: This is a fairly straightforward modification of the proof in the lecture notes. Instead of $c$ and $\gamma$, the learning rate $\eta$ will appear in some terms of the potential estimate.
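For reference, the algorithm being analysed might be sketched in Python as follows; the function name and interface are illustrative only, not part of the problem.

import numpy as np

def perceptron(xs, ys, eta, w_init):
    """Generalised Perceptron with learning rate eta and initial
    weights w_init (need not be zero). Returns the final weight
    vector and the number of mistakes."""
    w = np.array(w_init, dtype=float)
    mistakes = 0
    for x, y in zip(xs, ys):
        sigma = 1 if y * (w @ x) <= 0 else 0  # sigma_t = 1 iff a mistake is made
        w += eta * sigma * y * x              # w_{t+1} = w_t + eta sigma_t y_t x_t
        mistakes += sigma
    return w, mistakes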
2. As with the all subsets kernel (Example 2.19, page 110), define for $A \subseteq \{1, \ldots, n\}$ the feature
\[ \psi_A(x) = \prod_{i \in A} x_i. \]
The degree $q$ ANOVA feature map has the $\binom{n}{q}$ features $\psi_A$ where $|A| = q$. (Thus the all subsets feature map combines the ANOVA features for $q = 0, \ldots, n$.)
Let $k_q$ be the kernel of this feature map. There is no nice closed form for this kernel, but given $x, z \in \mathbb{R}^n$ we can still compute the value
\[ k_q(x, z) = \sum_{|A| = q} \psi_A(x) \, \psi_A(z) \]
much more efficiently than the naive $O(n^q)$. Give an algorithm to do this.
Hint: Express $k_q((x_1, \ldots, x_n), (z_1, \ldots, z_n))$ in terms of $k_{q-1}((x_1, \ldots, x_{n-1}), (z_1, \ldots, z_{n-1}))$ and $k_q((x_1, \ldots, x_{n-1}), (z_1, \ldots, z_{n-1}))$. You can save computation effort by dynamic programming.
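As a sketch of what such a dynamic program might look like, assuming the recursion $k_q(x_{1:m}, z_{1:m}) = k_q(x_{1:m-1}, z_{1:m-1}) + x_m z_m \, k_{q-1}(x_{1:m-1}, z_{1:m-1})$, which is one natural reading of the hint:

import numpy as np

def anova_kernel(x, z, q):
    """Compute k_q(x, z) = sum_{|A| = q} prod_{i in A} x_i z_i
    in O(nq) time. K[j] holds k_j evaluated on the first m
    coordinates of x and z."""
    n = len(x)
    K = np.zeros(q + 1)
    K[0] = 1.0                               # k_0 = 1 (the empty product)
    for m in range(1, n + 1):
        # go through j in decreasing order so that K[j-1] still refers
        # to the first m-1 coordinates when K[j] is updated
        for j in range(min(q, m), 0, -1):
            K[j] += x[m - 1] * z[m - 1] * K[j - 1]
    return K[q]

For instance, with $q = 2$ this returns $\sum_{i < j} x_i x_j z_i z_j$, i.e. the sum over all two-element subsets.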
3. Consider online linear regression, where now $\hat{y}_t$ and $y_t$ can both be arbitrary real numbers. The analogue of the Perceptron algorithm is the Least Mean Squares algorithm (LMS, also known as Widrow-Hoff):

Initialise $w_1 = 0$.
Repeat for $t = 1, \ldots, T$:
1. Get $x_t \in \mathbb{R}^n$.
2. Predict $\hat{y}_t = w_t \cdot x_t$.
3. Receive the correct answer $y_t$.
4. Update $w_{t+1} = w_t - \eta(\hat{y}_t - y_t) x_t$.

Here $\eta > 0$ is a learning rate parameter.
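In code, one run of LMS might look like the following minimal sketch (names are illustrative only):

import numpy as np

def lms(xs, ys, eta):
    """Run LMS (Widrow-Hoff) over the sequence (x_t, y_t) and
    return the cumulative square loss sum_t (y_t - yhat_t)^2."""
    w = np.zeros(len(xs[0]))           # w_1 = 0
    loss = 0.0
    for x, y in zip(xs, ys):
        y_hat = w @ x                  # yhat_t = w_t . x_t
        loss += (y - y_hat) ** 2
        w -= eta * (y_hat - y) * x     # w_{t+1} = w_t - eta (yhat_t - y_t) x_t
    return loss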
Assume that there are some $u \in \mathbb{R}^n$ and $X > 0$ such that $y_t = u \cdot x_t$ and $\|x_t\|_2 \le X$ for all $t$.
Show that the square loss of the LMS algorithm can be bounded as
\[ \sum_{t=1}^{T} (y_t - \hat{y}_t)^2 \le \|u\|_2^2 \, X^2. \]
For extra credit (worth one regular problem), generalise this to the “agnostic” case where we do not assume $u \cdot x_t = y_t$.
Hint: For the basic case, show that
\[ \tfrac{1}{2}\|u - w_t\|_2^2 - \tfrac{1}{2}\|u - w_{t+1}\|_2^2 = \left(\eta - \tfrac{1}{2}\eta^2 \|x_t\|_2^2\right)(y_t - \hat{y}_t)^2 \ge \left(\eta - \tfrac{1}{2}\eta^2 X^2\right)(y_t - \hat{y}_t)^2. \]
Optimise $\eta$ and sum over $t$.
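As a starting point, the mechanical expansion of the potential difference (completing the bound is the exercise):
\begin{align*}
\|u - w_{t+1}\|_2^2
  &= \|u - w_t + \eta(\hat{y}_t - y_t)\,x_t\|_2^2 \\
  &= \|u - w_t\|_2^2
     + 2\eta(\hat{y}_t - y_t)\,(u - w_t)\cdot x_t
     + \eta^2 (\hat{y}_t - y_t)^2 \|x_t\|_2^2,
\end{align*}
where $(u - w_t)\cdot x_t = u \cdot x_t - \hat{y}_t = y_t - \hat{y}_t$ under the assumption $y_t = u \cdot x_t$.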
For the agnostic case, show that
\[ \tfrac{1}{2}\|u - w_t\|_2^2 - \tfrac{1}{2}\|u - w_{t+1}\|_2^2 \ge a\,(y_t - \hat{y}_t)^2 - b\,(y_t - u \cdot x_t)^2 \]
for some $a, b > 0$ that depend on $X$ and $\eta$. You do not need to find the optimal $\eta$ for this case.
4. Consider the linear classifier $f(x) = \operatorname{sign}(w \cdot x)$ for $x \in \mathbb{R}^d$, where $w_1 = w_2 = 1$ and $w_i = 0$ for $i = 3, \ldots, d$.
We generate a random sample as follows. First, we draw a large number of instances $x_t$ from the uniform distribution over the cube $[-1, 1]^d$. Then we classify the instances using the above classifier $f$. Finally, we discard from the sample the points where the margin is below some value $\gamma$ we decide in advance. Therefore we get a sample that is linearly separable with margin $\gamma$ by the classifier $f$.
Implement the sampling method and the Perceptron algorithm. Study how the number of mistakes made by the algorithm changes when you
• keep the dimension $d$ fixed but let the margin $\gamma$ vary,
• keep the margin $\gamma$ fixed but let the dimension $d$ vary.
Is the behaviour of the algorithm similar to what you would expect from the Perceptron Convergence Theorem?
Your solution should consist of a brief explanation of the observations you made, a couple of representative plots to support this, and a printout of your program code.
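As a possible starting point, here is a sketch of the sampling method and the mistake count in Python. It uses the unnormalised margin $|w \cdot x|$ and a Perceptron that cycles through the sample until it separates it; both choices, and all names, are illustrative rather than prescribed.

import numpy as np

def sample_with_margin(n_points, d, gamma, rng):
    """Draw points uniformly from [-1, 1]^d, label them with
    f(x) = sign(w . x) for w = (1, 1, 0, ..., 0), and keep only
    points whose margin |w . x| is at least gamma."""
    w = np.zeros(d)
    w[0] = w[1] = 1.0
    xs, ys = [], []
    while len(xs) < n_points:
        x = rng.uniform(-1.0, 1.0, size=d)
        margin = w @ x
        if abs(margin) >= gamma:       # discard small-margin points
            xs.append(x)
            ys.append(np.sign(margin))
    return np.array(xs), np.array(ys)

def perceptron_mistakes(xs, ys):
    """Cycle the (eta = 1) Perceptron through the sample until it
    makes no more mistakes; return the total number of mistakes.
    Terminates because the sample is separable with a margin."""
    w = np.zeros(xs.shape[1])
    mistakes = 0
    changed = True
    while changed:
        changed = False
        for x, y in zip(xs, ys):
            if y * (w @ x) <= 0:       # mistake (or zero prediction)
                w += y * x
                mistakes += 1
                changed = True
    return mistakes

rng = np.random.default_rng(0)
xs, ys = sample_with_margin(1000, d=10, gamma=0.1, rng=rng)
print(perceptron_mistakes(xs, ys))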