
582631 Introduction to Machine Learning, Autumn 2015 Exercise set 5 (based on fourth week of lectures)

To be discussed in the exercise session 26 November. Attending the session is completely voluntary.

Credit is given based on the written solutions turned in to the course assistants.

The deadline for turning in your solutions is Friday, 27 November, at 23:59.

Follow the provided guidelines and send your solutions to the course assistants, Johannes Verwijnen (johannes.verwijnen@cs.helsinki.fi) and Amin Sorkhei (amin.sorkhei@cs.helsinki.fi), by e-mail.

Submission Guidelines

(1) Your submission is composed of two parts:

• A single PDF file containing your answers to the pen-and-paper and programming problems, and a single compressed file (either rar, zip, or tar.gz) containing your code. The submission must be sent directly to BOTH Johannes and Amin. If you are not using Python, please mention in your email what programming language you have used. In that case, the code needs to be submitted as a runnable file or set of files (with the command to run it given in the report).

• All the material (compressed file, code, PDF file) must be renamed to your student ID.

(2) If you are using Python, there is no scaffolding for this week’s exercises. You have to return a single .py file containing your code, which, as always, must be compressed.

(3) Always mention your name and student ID in the report.

(4) Use the tag [IML15][your student ID] as the subject of your email.

Pen-and-paper problems

Problem 1 (3 points) Consider our standard linear regression problem, where we have a set of $N$ data points $(x_i, y_i)$ with $x_i = (x_{i1}, \ldots, x_{in}) \in \mathbb{R}^n$ and $y_i \in \mathbb{R}$ for $i = 1, \ldots, N$. Let $\hat{w} \in \mathbb{R}^n$ be the least-squares weight vector, as given for example on page 202 of the textbook and slide 199 of the lecture notes.

(a) Consider a variable $x_j$ that has a positive coefficient in the optimal solution, that is, $\hat{w}_j > 0$. Does this imply that $x_j$ and $y$ have a positive correlation in the data? Why?

(b) Consider a variable $x_j$ that is positively correlated with $y$. Does this imply that $\hat{w}_j > 0$? Why?

Hint: Consider situations where $x_j$ has a strong correlation with another variable $x_k$.
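This is a pen-and-paper problem, but the situation in the hint is easy to explore numerically. The following is a minimal, purely illustrative numpy sketch (the data-generating setup is our own invention, not part of the problem): $x_j$ is strongly correlated with $y$, yet its least-squares weight can come out near zero or even negative because $x_k$ carries the same information.

```python
# Illustrative only: a variable positively correlated with y can still
# receive a non-positive least-squares weight when it is nearly collinear
# with another input. The setup below is a made-up example.
import numpy as np

rng = np.random.default_rng(0)
N = 1000
x_k = rng.normal(size=N)
x_j = x_k + 0.1 * rng.normal(size=N)    # x_j is almost a copy of x_k
y = x_k + 0.1 * rng.normal(size=N)      # y is driven by x_k alone

X = np.column_stack([x_j, x_k])
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

print("corr(x_j, y):", np.corrcoef(x_j, y)[0, 1])   # strongly positive
print("least-squares weights (w_j, w_k):", w_hat)   # w_j fluctuates around 0
```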

Problem 2 (3 points) Exclusive or is a classical example used to demonstrate the limitations of linear classifiers. In the most basic setting, we have two binary inputs and a binary output. Using $+1$ and $-1$ as the binary values, the exclusive or function XOR is given by the following table:

x1   x2   XOR(x1, x2)
−1   −1       −1
−1   +1       +1
+1   −1       +1
+1   +1       −1

(a) Give a detailed proof of the fact that the exclusive or function cannot be represented as a linear classifier. That is, prove that there is no coefficient vector $(w_0, w_1, w_2)$ such that $\mathrm{XOR}(x_1, x_2) = \mathrm{sign}(w_0 + w_1 x_1 + w_2 x_2)$ would hold for all possible inputs $(x_1, x_2)$.


(b) Show that if we include an extra input variable $x_1 x_2$, we can represent XOR as a linear classifier in terms of the modified inputs. That is, find a coefficient vector $(w_0, w_1, w_2, w_3)$ such that $\mathrm{XOR}(x_1, x_2) = \mathrm{sign}(w_0 + w_1 x_1 + w_2 x_2 + w_3 x_1 x_2)$ holds for all $(x_1, x_2)$.
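Both parts can be sanity-checked numerically. Below is a small illustrative sketch (not a proof, and the grid of candidate weights is our own choice) that evaluates whether a given weight vector reproduces the XOR table, with and without the extra product feature $x_1 x_2$.

```python
# Illustrative sketch: check candidate weight vectors against the XOR table.
# A finite grid scan is of course not a proof for part (a).
import itertools
import numpy as np

INPUTS = [(-1, -1), (-1, 1), (1, -1), (1, 1)]
XOR = {(-1, -1): -1, (-1, 1): 1, (1, -1): 1, (1, 1): -1}

def realizes_xor(w, with_product=False):
    """True if sign(w . features) matches XOR on all four inputs."""
    for x1, x2 in INPUTS:
        feats = [1, x1, x2] + ([x1 * x2] if with_product else [])
        if np.sign(np.dot(w, feats)) != XOR[(x1, x2)]:
            return False
    return True

# Part (a): no vector on this coarse grid realizes XOR (prints False).
grid = [-2, -1, 0, 1, 2]
print(any(realizes_xor(w) for w in itertools.product(grid, repeat=3)))

# Part (b): test your own candidate (w0, w1, w2, w3), e.g.
# print(realizes_xor((w0, w1, w2, w3), with_product=True))
```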

Problem 3 (3 points)

Multilayer neural networks are another way of using linear classifiers for nonlinear problems. The most basic such classifier, for two-dimensional inputs and with two hidden units, has parameters $(u_{ij})$, $i \in \{1, 2\}$, $j \in \{0, 1, 2\}$, and $(v_k)$, $k \in \{0, 1, 2\}$ (that is, a total of 9 parameters), and computes a mapping $(x_1, x_2) \mapsto y$ as follows:

$z_1 = \mathrm{sign}(u_{10} + u_{11} x_1 + u_{12} x_2)$
$z_2 = \mathrm{sign}(u_{20} + u_{21} x_1 + u_{22} x_2)$
$y = \mathrm{sign}(v_0 + v_1 z_1 + v_2 z_2)$.

This can be visualized as follows:

[Figure: network diagram. A constant input 1 and the inputs $x_1, x_2$ feed the hidden units $z_1, z_2$ through the weights $u_{ij}$; the constant 1 and the hidden units $z_1, z_2$ feed the output $y$ through the weights $v_0, v_1, v_2$.]

The term “hidden unit” refers to the computation of intermediate values $z_i$ that are not as such part of the visible input or output.

(a) Show that with a suitable choice of the weights $(u_{ij})$ and $(v_k)$, the neural network described above can compute the XOR function from Problem 2.
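A quick way to verify a candidate weight assignment for part (a) is to code the forward pass directly; the sketch below (our own helper, with placeholder argument names) evaluates the network on any input pair.

```python
# Minimal forward pass of the two-hidden-unit network described above.
# u is a 2x3 nested list of hidden weights u[i][j]; v holds (v0, v1, v2).
import numpy as np

def forward(x1, x2, u, v):
    z1 = np.sign(u[0][0] + u[0][1] * x1 + u[0][2] * x2)
    z2 = np.sign(u[1][0] + u[1][1] * x1 + u[1][2] * x2)
    return np.sign(v[0] + v[1] * z1 + v[2] * z2)

# Check a candidate (u, v) against the XOR table from Problem 2:
# for x1, x2 in [(-1, -1), (-1, 1), (1, -1), (1, 1)]:
#     print(x1, x2, forward(x1, x2, u, v))
```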

(b) There are numerous variants of this basic neural network model. In particular, the signum function used above in computing the values $z_i$ is often replaced by some continuous function, such as the logistic sigmoid $r \mapsto 1/(1 + e^{-r})$. However, one variant that does not make sense is to simply drop the signum function for the $z_i$ and compute

$z_1 = u_{10} + u_{11} x_1 + u_{12} x_2$
$z_2 = u_{20} + u_{21} x_1 + u_{22} x_2$
$y = \mathrm{sign}(v_0 + v_1 z_1 + v_2 z_2)$.

Show that this model would not offer any advantage over a basic linear classifier $y = \mathrm{sign}(w_0 + w_1 x_1 + w_2 x_2)$.
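One way to start is to substitute the linear expressions for $z_1$ and $z_2$ into the output unit; grouping terms shows the network collapses to a single affine function of the inputs:

$$
y = \mathrm{sign}\bigl(v_0 + v_1 z_1 + v_2 z_2\bigr)
  = \mathrm{sign}\bigl((v_0 + v_1 u_{10} + v_2 u_{20}) + (v_1 u_{11} + v_2 u_{21})\,x_1 + (v_1 u_{12} + v_2 u_{22})\,x_2\bigr).
$$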

Computer problem

General instructions:

• In your report, the results of the programming problem will mostly be either figures or program output. In both cases, add a few sentences explaining what you are showing and why the results answer the question.


• We use the report as the main basis for grading: all your results should be in the report. We also look at the code, but we won’t go fishing for results in your code.

• If we ask you to test whether an algorithm works, always give a few examples showing that the algorithm indeed works.

Problem 4 (15 points)

In this problem, we will test linear regression on a simple synthetic dataset. We will use the following polynomial as the underlying target function

$y = f(x) = 2 + x - 0.5x^2$. (1)

First, randomly sample 30 points $x_i$ from the uniform distribution on the interval $[-3, 3]$. Then, randomly generate the $y_i$ using

$y_i = f(x_i) + n_i$, (2)

where $f$ is as defined above, and the $n_i$ are i.i.d. normal random variables with zero mean and standard deviation 0.4. The resulting 30 pairs $(x_i, y_i)$ are your data set for this exercise.
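A minimal sketch of this data generation, assuming numpy (the seed and variable names are arbitrary choices):

```python
# Generate the synthetic dataset described above.
import numpy as np

rng = np.random.default_rng(2015)          # any fixed seed keeps results reproducible
f = lambda t: 2 + t - 0.5 * t**2           # target polynomial, Eq. (1)
x = rng.uniform(-3, 3, size=30)            # 30 points uniform on [-3, 3]
y = f(x) + rng.normal(0.0, 0.4, size=30)   # zero-mean noise with sd 0.4, Eq. (2)
```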

(a) First, let’s fit polynomials of order 0 to 10 to this dataset using linear regression, minimizing the sum of squares error. That is, fit functions of the form

$$\hat{y} = \sum_{p=0}^{K} w_p x^p \qquad (3)$$

with $K = 0, \ldots, 10$ to the data. For instance, for $K = 4$ the polynomial to fit is

$$\hat{y} = w_0 + w_1 x + w_2 x^2 + w_3 x^3 + w_4 x^4. \qquad (4)$$

For each of the 11 values of $K$, produce a separate plot showing the data points $(x_i, y_i)$ and the fitted polynomial. (Plot the polynomial as a curve over the full interval $[-3, 3]$, overlaid on the scatterplot of the points.) You should see that as the order $K$ of the polynomial increases, the curve comes closer and closer to fitting all the data points.
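One possible way to do the fitting, sketched under the assumption that x and y come from the generation snippet above (np.polyfit would work equally well):

```python
# Fit polynomials of order K = 0, ..., 10 by least squares and plot each fit.
import numpy as np
import matplotlib.pyplot as plt

xs = np.linspace(-3, 3, 200)                    # grid for drawing the curve
for K in range(11):
    X = np.vander(x, K + 1, increasing=True)    # columns 1, x, ..., x^K
    w, *_ = np.linalg.lstsq(X, y, rcond=None)   # least-squares coefficients w_p
    curve = np.vander(xs, K + 1, increasing=True) @ w

    plt.figure()
    plt.scatter(x, y)                           # the 30 data points
    plt.plot(xs, curve)                         # fitted polynomial, Eq. (3)
    plt.title(f"K = {K}")
plt.show()
```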

The coefficient of determination is a commonly used measure of how well a regression model fits the data.

It is defined as

$$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} \qquad (5)$$

where $\bar{y} = (1/n) \sum_{i=1}^{n} y_i$. Calculate $R^2$ for each value of $K$ and comment on its behaviour as a function of $K$.
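Computing Eq. (5) takes a couple of lines; the sketch below assumes X, w and K are still bound from the fitting loop above:

```python
# Coefficient of determination, Eq. (5), for the most recent fit.
y_hat = X @ w                                   # fitted values at the training points
r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
print(f"K = {K}: R^2 = {r2:.4f}")
```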

(b) Finally, let’s test model selection based on 10-fold cross-validation. That is, divide the dataset into 10 equal-sized subsets (i.e. 3 data points in each subset), and, for each value of $K = 0, \ldots, 10$ and each data subset $j = 1, \ldots, 10$, use all the data except the data in subset $j$ to fit the polynomial of order $K$, and compute the resulting sum of squared errors on subset $j$. For each value of $K$, sum together the errors coming from the different $j$. Plot these results with $K$ on the horizontal axis and the sum of squared errors on the vertical axis. How does this function behave? Does the cross-validated error improve with increasing $K$? Which $K$ gives the minimum error, and what are the corresponding estimated coefficients? Compare them to the true coefficients used in $f(x)$ to generate the data.
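A sketch of the cross-validation loop, again assuming x and y from the generation snippet; the random fold assignment below is one simple choice among many:

```python
# 10-fold cross-validation over polynomial orders K = 0, ..., 10.
import numpy as np

folds = np.random.permutation(30).reshape(10, 3)   # 10 disjoint folds of 3 points
cv_sse = []
for K in range(11):
    total = 0.0
    for j in range(10):
        test = folds[j]
        train = np.setdiff1d(np.arange(30), test)
        Xtr = np.vander(x[train], K + 1, increasing=True)
        w, *_ = np.linalg.lstsq(Xtr, y[train], rcond=None)
        pred = np.vander(x[test], K + 1, increasing=True) @ w
        total += np.sum((y[test] - pred) ** 2)     # SSE on the held-out fold
    cv_sse.append(total)

print("cross-validated SSE for each K:", cv_sse)
print("best K:", int(np.argmin(cv_sse)))
```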
