582669 Supervised Machine Learning (Spring 2014) Homework 4 (13 February)

To get credit, you must hand in your solution before the exercise session, i.e. no later than 14:15 on Thursday, 13 February. You can send your solutions as a PDF file to jyrki.kivinen@cs.helsinki.fi, drop a paper version in the box outside Jyrki Kivinen's office B229a, or bring a paper version to the exercise session. Please turn in either a single PDF file, or all your solutions on paper.

There are four (4) regular problems, and one voluntary problem for extra credit.

1. We generalise the Perceptron Algorithm by introducing a learning rate η > 0. The update becomes

$$w_{t+1} = w_t + \eta \sigma_t y_t x_t.$$

Further, we start the algorithm with $w_1 = w_{\text{init}}$, where the initial weights need not be zero. (Note that if we have $w_{\text{init}} = 0$, then the learning rate does not affect the predictions $\operatorname{sign}(w_t \cdot x_t)$.) Assume that $\|x_t\|_2 \le X$ for some $X > 0$, and that some $u \in \mathbb{R}^d$ satisfies $y_t\, u \cdot x_t \ge 1$ for all $t$. Modify the proof of the Perceptron Convergence Theorem by using

$$P_t = \frac{1}{2}\,\|u - w_t\|_2^2$$

as the potential function. The result should be that

$$\sum_{t=1}^{T} \sigma_t \le \|u - w_{\text{init}}\|_2^2\, X^2$$

for a suitable choice of $\eta$. Thus, if we start the algorithm close to the target, we get a smaller mistake bound.

Hint: This is a fairly straightforward modification of the proof in the lecture notes. Instead of $c$ and $\gamma$, the learning rate $\eta$ will appear in some terms of the potential estimate.
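For concreteness, here is a minimal Python sketch of the generalised update above (illustrative only; it assumes, as in the lecture notes, that $\sigma_t$ is the mistake indicator, i.e. $\sigma_t = 1$ when $\operatorname{sign}(w_t \cdot x_t) \ne y_t$ and $\sigma_t = 0$ otherwise):

```python
import numpy as np

def perceptron_with_init(X, y, eta, w_init):
    """Generalised Perceptron: learning rate eta > 0, initial weights w_init.

    X is a (T, d) array of instances, y a length-T array of labels in {-1, +1}.
    Returns the final weight vector and the number of mistakes sum_t sigma_t.
    """
    w = np.array(w_init, dtype=float)
    mistakes = 0
    for x_t, y_t in zip(X, y):
        sigma_t = 1 if np.sign(w @ x_t) != y_t else 0  # mistake indicator
        if sigma_t:
            w = w + eta * y_t * x_t  # w_{t+1} = w_t + eta * sigma_t * y_t * x_t
            mistakes += 1
    return w, mistakes
```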

2. As with the all subsets kernel (Example 2.19, page 110), define for $A \subseteq \{1, \ldots, n\}$ the feature

$$\psi_A(x) = \prod_{i \in A} x_i.$$

The degree $q$ ANOVA feature map has the $\binom{n}{q}$ features $\psi_A$ where $|A| = q$. (Thus the all subsets feature map combines the ANOVA features for $q = 0, \ldots, n$.)

Let $k_q$ be the kernel of this feature map. There is no nice closed form for this kernel, but given $x, z \in \mathbb{R}^n$ we can still compute the value

$$k_q(x, z) = \sum_{|A| = q} \psi_A(x)\,\psi_A(z)$$

much more efficiently than the naive $O(n^q)$. Give an algorithm to do this.

Hint: Express $k_q((x_1, \ldots, x_n), (z_1, \ldots, z_n))$ in terms of $k_{q-1}((x_1, \ldots, x_{n-1}), (z_1, \ldots, z_{n-1}))$ and $k_q((x_1, \ldots, x_{n-1}), (z_1, \ldots, z_{n-1}))$. You can save computation effort by dynamic programming.
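Following the recursion suggested in the hint, one possible dynamic-programming sketch in Python (the function name and interface are illustrative, not a prescribed solution); it runs in $O(nq)$ time:

```python
import numpy as np

def anova_kernel(x, z, q):
    """Evaluate k_q(x, z) = sum over |A| = q of prod_{i in A} x_i * z_i.

    K[j] stores k_j on the prefix (x_1..x_i, z_1..z_i) processed so far;
    the empty product gives k_0 = 1, and degrees above the prefix length are 0.
    """
    K = np.zeros(q + 1)
    K[0] = 1.0
    for i in range(len(x)):
        # update higher degrees first so K[j-1] still refers to the previous prefix
        for j in range(min(i + 1, q), 0, -1):
            K[j] += x[i] * z[i] * K[j - 1]
    return K[q]
```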


3. Consider online linear regression, where now $\hat{y}_t$ and $y_t$ can both be arbitrary real numbers. The analogue of the Perceptron algorithm is the Least Mean Squares algorithm (LMS, also known as Widrow-Hoff):

Initialise $w_1 = 0$.

Repeat for $t = 1, \ldots, T$:
1. Get $x_t \in \mathbb{R}^n$.
2. Predict $\hat{y}_t = w_t \cdot x_t$.
3. Receive the correct answer $y_t$.
4. Update $w_{t+1} = w_t - \eta(\hat{y}_t - y_t)\, x_t$.

Here $\eta > 0$ is a learning rate parameter.
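A minimal Python sketch of this update loop (the function name is illustrative; it simply mirrors the four steps listed above):

```python
import numpy as np

def lms(X, y, eta):
    """Least Mean Squares (Widrow-Hoff) online regression.

    X is a (T, n) array of instances, y a length-T array of real targets and
    eta > 0 the learning rate. Returns the final weights and the cumulative
    square loss sum_t (y_t - yhat_t)^2.
    """
    w = np.zeros(X.shape[1])               # w_1 = 0
    total_loss = 0.0
    for x_t, y_t in zip(X, y):
        y_hat = w @ x_t                    # predict yhat_t = w_t . x_t
        total_loss += (y_t - y_hat) ** 2
        w = w - eta * (y_hat - y_t) * x_t  # w_{t+1} = w_t - eta (yhat_t - y_t) x_t
    return w, total_loss
```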

Assume that there are some $u \in \mathbb{R}^n$ and $X > 0$ such that $y_t = u \cdot x_t$ and $\|x_t\|_2 \le X$ for all $t$.

Show that the square loss of the LMS algorithm can be bounded as

$$\sum_{t=1}^{T} (y_t - \hat{y}_t)^2 \le \|u\|_2^2\, X^2.$$

For extra credit (worth one regular problem), generalise this to the “agnostic” case where we do not assume $u \cdot x_t = y_t$.

Hint: For the basic case, show that

$$\frac{1}{2}\|u - w_t\|_2^2 - \frac{1}{2}\|u - w_{t+1}\|_2^2 \ge \left(\eta - \frac{1}{2}\eta^2 X^2\right)(y_t - \hat{y}_t)^2.$$

Optimise $\eta$ and sum over $t$.

For the agnostic case, show that

$$\frac{1}{2}\|u - w_t\|_2^2 - \frac{1}{2}\|u - w_{t+1}\|_2^2 \ge a\,(y_t - \hat{y}_t)^2 - b\,(y_t - u \cdot x_t)^2$$

for some $a, b > 0$ that depend on $X$ and $\eta$. You do not need to find the optimal $\eta$ for this case.

4. Consider the linear classifier $f(x) = \operatorname{sign}(w \cdot x)$ for $x \in \mathbb{R}^d$, where $w_1 = w_2 = 1$ and $w_i = 0$ for $i = 3, \ldots, d$.

We generate a random sample as follows. First, we draw a large number of instances $x_t$ from the uniform distribution over the cube $[-1, 1]^d$. Then we classify the instances using the above classifier $f$. Finally, we discard from the sample the points where the margin is below some value $\gamma$ we decide in advance. Therefore we get a sample that is linearly separable with margin $\gamma$ by the classifier $f$.
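As a concrete reading of this sampling procedure, a possible Python sketch (the margin of a point is taken here as $|w \cdot x|$ with the fixed $w$; normalise by $\|w\|_2 = \sqrt{2}$ instead if the lecture notes' definition of margin requires it):

```python
import numpy as np

def sample_with_margin(num_points, d, gamma, rng=None):
    """Draw points uniformly from [-1, 1]^d, label them with f(x) = sign(w . x)
    for w = (1, 1, 0, ..., 0), and discard points whose margin |w . x| is below gamma.
    """
    rng = np.random.default_rng() if rng is None else rng
    w = np.zeros(d)
    w[0] = w[1] = 1.0
    X = rng.uniform(-1.0, 1.0, size=(num_points, d))
    keep = np.abs(X @ w) >= gamma   # drop points with margin below gamma
    X = X[keep]
    y = np.sign(X @ w)              # labels from the fixed classifier f
    return X, y
```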

Implement the sampling method and the Perceptron algorithm. Study how the number of mistakes made by the algorithm changes when you

• keep the dimension $d$ fixed but let the margin $\gamma$ vary

• keep the margin $\gamma$ fixed but let the dimension $d$ vary.

Is the behaviour of the algorithm similar to what you would expect from the Perceptron Convergence Theorem?

Your solution should consist of a brief explanation of the observations you made, a couple of representative plots to support this, and a printout of your program code.
