582669 Supervised Machine Learning (Spring 2014) Homework 4 (13 February)

To get credit, you must hand in your solution before the exercise session, i.e. no later than 14:15 on Thursday, 13 February. You can send your solutions as a PDF file to jyrki.kivinen@cs.helsinki.fi, drop a paper version in the box outside Jyrki Kivinen's office B229a, or bring a paper version to the exercise session. Please turn in either a single PDF file, or all your solutions on paper.

There are four (4) regular problems, and one voluntary problem for extra credit.

1. We generalise the Perceptron Algorithm by introducing a learning rate η > 0. The update becomes

$$w_{t+1} = w_t + \eta \sigma_t y_t x_t.$$

Further, we start the algorithm with $w_1 = w_{\text{init}}$, where the initial weights need not be zero. (Note that if we have $w_{\text{init}} = 0$, then the learning rate does not affect the predictions $\operatorname{sign}(w_t \cdot x_t)$.) Assume that $\|x_t\|_2 \le X$ for some $X > 0$, and that some $u \in \mathbb{R}^d$ satisfies $y_t\, u \cdot x_t \ge 1$ for all $t$. Modify the proof of the Perceptron Convergence Theorem by using

$$P_t = \frac{1}{2}\,\|u - w_t\|_2^2$$

as the potential function. The result should be that

$$\sum_{t=1}^{T} \sigma_t \le \|u - w_{\text{init}}\|_2^2\, X^2$$

for a suitable choice of $\eta$. Thus, if we start the algorithm close to the target, we get a smaller mistake bound.

Hint: This is a fairly straightforward modification of the proof in the lecture notes. Instead of $c$ and $\gamma$, the learning rate $\eta$ will appear in some terms of the potential estimate.
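For concreteness, here is a minimal Python sketch of the generalised update above (illustrative only; it assumes, as in the lecture notes, that $\sigma_t$ is the mistake indicator, i.e. $\sigma_t = 1$ when $\operatorname{sign}(w_t \cdot x_t) \ne y_t$ and $\sigma_t = 0$ otherwise):

```python
import numpy as np

def perceptron_with_init(X, y, eta, w_init):
    """Generalised Perceptron: learning rate eta > 0, initial weights w_init.

    X is a (T, d) array of instances, y a length-T array of labels in {-1, +1}.
    Returns the final weight vector and the number of mistakes sum_t sigma_t.
    """
    w = np.array(w_init, dtype=float)
    mistakes = 0
    for x_t, y_t in zip(X, y):
        sigma_t = 1 if np.sign(w @ x_t) != y_t else 0  # mistake indicator
        if sigma_t:
            w = w + eta * y_t * x_t  # w_{t+1} = w_t + eta * sigma_t * y_t * x_t
            mistakes += 1
    return w, mistakes
```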

2. As with the all subsets kernel (Example 2.19, page 110), define for $A \subseteq \{1, \ldots, n\}$ the feature

$$\psi_A(x) = \prod_{i \in A} x_i.$$

The degree $q$ ANOVA feature map has the $\binom{n}{q}$ features $\psi_A$ where $|A| = q$. (Thus the all subsets feature map combines the ANOVA features for $q = 0, \ldots, n$.)

Let $k_q$ be the kernel of this feature map. There is no nice closed form for this kernel, but given $x, z \in \mathbb{R}^n$ we can still compute the value

$$k_q(x, z) = \sum_{|A| = q} \psi_A(x)\,\psi_A(z)$$

much more efficiently than the naive $O(n^q)$. Give an algorithm to do this.

Hint: Express $k_q((x_1, \ldots, x_n), (z_1, \ldots, z_n))$ in terms of $k_{q-1}((x_1, \ldots, x_{n-1}), (z_1, \ldots, z_{n-1}))$ and $k_q((x_1, \ldots, x_{n-1}), (z_1, \ldots, z_{n-1}))$. You can save computation effort by dynamic programming.
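Following the recursion suggested in the hint, one possible dynamic-programming sketch in Python (the function name and interface are illustrative, not a prescribed solution); it runs in $O(nq)$ time:

```python
import numpy as np

def anova_kernel(x, z, q):
    """Evaluate k_q(x, z) = sum over |A| = q of prod_{i in A} x_i * z_i.

    K[j] stores k_j on the prefix (x_1..x_i, z_1..z_i) processed so far;
    the empty product gives k_0 = 1, and degrees above the prefix length are 0.
    """
    K = np.zeros(q + 1)
    K[0] = 1.0
    for i in range(len(x)):
        # update higher degrees first so K[j-1] still refers to the previous prefix
        for j in range(min(i + 1, q), 0, -1):
            K[j] += x[i] * z[i] * K[j - 1]
    return K[q]
```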


3. Consider online linear regression, where now $\hat{y}_t$ and $y_t$ can both be arbitrary real numbers. The analogue of the Perceptron algorithm is the Least Mean Squares algorithm (LMS, also known as Widrow-Hoff):

Initialise $w_1 = 0$.

Repeat for $t = 1, \ldots, T$:
1. Get $x_t \in \mathbb{R}^n$.
2. Predict $\hat{y}_t = w_t \cdot x_t$.
3. Receive the correct answer $y_t$.
4. Update $w_{t+1} = w_t - \eta(\hat{y}_t - y_t)\, x_t$.

Here $\eta > 0$ is a learning rate parameter.
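A minimal Python sketch of this update loop (the function name is illustrative; it simply mirrors the four steps listed above):

```python
import numpy as np

def lms(X, y, eta):
    """Least Mean Squares (Widrow-Hoff) online regression.

    X is a (T, n) array of instances, y a length-T array of real targets and
    eta > 0 the learning rate. Returns the final weights and the cumulative
    square loss sum_t (y_t - yhat_t)^2.
    """
    w = np.zeros(X.shape[1])               # w_1 = 0
    total_loss = 0.0
    for x_t, y_t in zip(X, y):
        y_hat = w @ x_t                    # predict yhat_t = w_t . x_t
        total_loss += (y_t - y_hat) ** 2
        w = w - eta * (y_hat - y_t) * x_t  # w_{t+1} = w_t - eta (yhat_t - y_t) x_t
    return w, total_loss
```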

Assume that there are some $u \in \mathbb{R}^n$ and $X > 0$ such that $y_t = u \cdot x_t$ and $\|x_t\|_2 \le X$ for all $t$.

Show that the square loss of the LMS algorithm can be bounded as

$$\sum_{t=1}^{T} (y_t - \hat{y}_t)^2 \le \|u\|_2^2\, X^2.$$

For extra credit (worth one regular problem), generalise this to the “agnostic” case where we do not assume $u \cdot x_t = y_t$.

Hint: For the basic case, show that

$$\frac{1}{2}\|u - w_t\|_2^2 - \frac{1}{2}\|u - w_{t+1}\|_2^2 \ge \left(\eta - \frac{1}{2}\eta^2 X^2\right)(y_t - \hat{y}_t)^2.$$

Optimise $\eta$ and sum over $t$.

For the agnostic case, show that

$$\frac{1}{2}\|u - w_t\|_2^2 - \frac{1}{2}\|u - w_{t+1}\|_2^2 \ge a\,(y_t - \hat{y}_t)^2 - b\,(y_t - u \cdot x_t)^2$$

for some $a, b > 0$ that depend on $X$ and $\eta$. You do not need to find the optimal $\eta$ for this case.

4. Consider the linear classifier $f(x) = \operatorname{sign}(w \cdot x)$ for $x \in \mathbb{R}^d$, where $w_1 = w_2 = 1$ and $w_i = 0$ for $i = 3, \ldots, d$.

We generate a random sample as follows. First, we draw a large number of instances $x_t$ from the uniform distribution over the cube $[-1, 1]^d$. Then we classify the instances using the above classifier $f$. Finally, we discard from the sample the points where the margin is below some value $\gamma$ we decide in advance. Therefore we get a sample that is linearly separable with margin $\gamma$ by the classifier $f$.
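As a concrete reading of this sampling procedure, a possible Python sketch (the margin of a point is taken here as $|w \cdot x|$ with the fixed $w$; normalise by $\|w\|_2 = \sqrt{2}$ instead if the lecture notes' definition of margin requires it):

```python
import numpy as np

def sample_with_margin(num_points, d, gamma, rng=None):
    """Draw points uniformly from [-1, 1]^d, label them with f(x) = sign(w . x)
    for w = (1, 1, 0, ..., 0), and discard points whose margin |w . x| is below gamma.
    """
    rng = np.random.default_rng() if rng is None else rng
    w = np.zeros(d)
    w[0] = w[1] = 1.0
    X = rng.uniform(-1.0, 1.0, size=(num_points, d))
    keep = np.abs(X @ w) >= gamma   # drop points with margin below gamma
    X = X[keep]
    y = np.sign(X @ w)              # labels from the fixed classifier f
    return X, y
```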

Implement the sampling method and the Perceptron algorithm. Study how the number of mistakes made by the algorithm changes when you

• keep the dimension $d$ fixed but let the margin $\gamma$ vary

• keep the margin $\gamma$ fixed but let the dimension $d$ vary.

Is the behaviour of the algorithm similar to what you would expect from the Perceptron Convergence Theorem?

Your solution should consist of a brief explanation of the observations you made, a couple of representative plots to support this, and a printout of your program code.
