
582631 Introduction to Machine Learning, Autumn 2015 Exercise set 6 (based on fifth week of lectures)

To be discussed in the exercise session on 3 December. Attending the session is completely voluntary.

Credit is given based on the written solutions turned in to the course assistants.

The deadline for turning in your solutions is Friday, 4 December, at 23:59.

Follow the provided guidelines and send your solutions to the course assistants (johannes.verwijnen@cs.helsinki.fi) and (amin.sorkhei@cs.helsinki.fi) by e-mail.

Addition to homework policies

Starting from this assignment, please also include in your submitted solution a short note mentioning which other students you discussed the problems with and which external sources of information you used (if any).

Notice that you are still allowed to discuss and use source material as before. We (the teachers) just wish to know whether you actually do so, and also to remind you of the homework policies, including the “half-hour” rule explained on the course homepage.

Submission Guidelines

Important Note: The MNIST package was updated on November 5th and improved for easier use.

The programming section is based entirely on the new package, so to avoid any errors or inconsistencies, download the most recent version of the MNIST package from the course webpage.

(1) Your submission is composed of two parts:

• A single PDF file containing your answers to the pen-and-paper and programming problems, and a single compressed file (rar, zip, or tar.gz) containing your code. The submission must be sent directly to BOTH Johannes and Amin. If you are not using Python, please mention in your email which programming language you have used. In that case, the code needs to be submitted as a runnable file or set of files (with the command to run it given in the report).

• All the material (compressed file, code, PDF file) must be named with your student ID.

(2) If you are using Python, please use the provided scaffolding (https://github.com/amin-sorkhei/IML-req/blob/master/ASSG6_TMP.py) for the programming assignment and submit only the completed scaffolding as your code. You have to implement the one vs. all and all vs. all methods. DO NOT include any of the data files (MovieLens, MNIST) in your email.

(3) Always mention your name and student ID in the report.

(4) Use the tag [IML15][your student ID] as the subject of your email.

Pen-and-paper problems

Problem 1 (6 points)

Assume that in a three-class classification problem the class prior distribution P(Y) is given by P(Y = 0) = 0.4, P(Y = 1) = 0.3 and P(Y = 2) = 0.3, and the class-conditional distributions P(X | Y) are as specified below (with X = (X1, X2)):


P(X1, X2 | Y = 0):

              X1 = 0    X1 = 1
   X2 = 0      0.2       0.1
   X2 = 1      0.4       0.2
   X2 = 2      0.0       0.1

P(X1, X2 | Y = 1):

              X1 = 0    X1 = 1
   X2 = 0      0.6       0.1
   X2 = 1      0.1       0.1
   X2 = 2      0.1       0.0

P(X1, X2 | Y = 2):

              X1 = 0    X1 = 1
   X2 = 0      0.1       0.4
   X2 = 1      0.3       0.0
   X2 = 2      0.2       0.0

We consider the usual (forced choice) classification, trying to minimise 0-1 loss (i.e. unit cost for any wrong answer, zero cost for correct answers).

(a) Give the Bayes optimal classifier, i.e. for each of the 6 possible values of X, which of the classes {0, 1, 2} is the best prediction of Y?

(b) Compute the Bayes error, i.e. the error rate (percentage of wrong answers) for the Bayes optimal classifier.

(c) Construct the corresponding naive Bayes classifier, i.e. when approximating the class-conditional distributions by the product of the marginals, what is the best forced-choice prediction of Y for each of the 6 possible values of X?

(d) Compute the error rate of the naive Bayes classifier.

(e) Is the error rate of the naive Bayes classifier higher or lower than the Bayes error? Could you have answered this last question without computing the actual values?
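The Bayes-rule computations in parts (a)–(d) can also be checked numerically. The following is a minimal Python sketch (not part of the exercise, and not the course scaffolding); the array names and the indexing convention cond[k, x1, x2] are assumptions chosen to match the tables above.

```python
import numpy as np

# Class priors P(Y = k), as given in the problem statement.
prior = np.array([0.4, 0.3, 0.3])

# Class-conditional tables P(X1, X2 | Y = k), indexed as cond[k, x1, x2]
# (assumed layout, matching the tables above).
cond = np.array([
    [[0.2, 0.4, 0.0],    # Y = 0, X1 = 0, X2 = 0, 1, 2
     [0.1, 0.2, 0.1]],   # Y = 0, X1 = 1, X2 = 0, 1, 2
    [[0.6, 0.1, 0.1],    # Y = 1
     [0.1, 0.1, 0.0]],
    [[0.1, 0.3, 0.2],    # Y = 2
     [0.4, 0.0, 0.0]],
])

for x1 in range(2):
    for x2 in range(3):
        joint = prior * cond[:, x1, x2]      # P(Y = k, X1 = x1, X2 = x2)
        posterior = joint / joint.sum()      # P(Y = k | X1 = x1, X2 = x2)
        print(f"X = ({x1}, {x2}): predict Y = {joint.argmax()}, "
              f"posterior = {np.round(posterior, 3)}")
```

The same loop, with each cond[k] replaced by the outer product of its two marginals, gives the naive Bayes predictions asked for in part (c).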

Problem 2 (3 points)

Given a set of N points y_1, \dots, y_N, with each y_i \in \mathbb{R}, show that

(a) the value y which minimizes the sum of squared errors, i.e.

    y = \arg\min_{\hat{y}} \sum_{i=1}^{N} (y_i - \hat{y})^2,    (1)

is given by the mean of the y_i, i.e. y = \frac{1}{N} \sum_{i=1}^{N} y_i.

(b) the value y which minimizes the sum of absolute errors, i.e.

    y = \arg\min_{\hat{y}} \sum_{i=1}^{N} |y_i - \hat{y}|,    (2)

is given by the median of the y_i. [Hint for part (b): Break the sum into parts corresponding to pairs of observations, pairing the smallest with the largest point, the second-smallest with the second-largest point, etc.]
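As a quick numerical sanity check (not a substitute for the proofs asked for above), one can evaluate both error sums on a fine grid of candidate values for the estimate and compare the grid minimisers to the mean and median; the variable names below are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(size=11)                       # arbitrary sample y_1, ..., y_N

# Evaluate both objectives on a dense grid of candidate values for y-hat.
candidates = np.linspace(y.min(), y.max(), 10001)
sse = ((y[:, None] - candidates[None, :]) ** 2).sum(axis=0)   # sum of squared errors
sae = np.abs(y[:, None] - candidates[None, :]).sum(axis=0)    # sum of absolute errors

print("grid minimiser of SSE:", candidates[sse.argmin()], "  mean:  ", y.mean())
print("grid minimiser of SAE:", candidates[sae.argmin()], "  median:", np.median(y))
```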

Computer problem

Important Note: The MNIST package was updated on November 5th and improved for easier use.

The programming section is based entirely on the new package, so to avoid any errors or inconsistencies, download the most recent version of the MNIST package from the course webpage.

General instructions:


• In your report, the results of the programming problem will mostly be either figures or program output. In both cases, add a few sentences explaining what you are showing and why the results answer the question.

• We use the report as the main basis for grading: all your results should be in the report. We also look at the code, but we won't go fishing for results in it.

• If we ask you to test whether an algorithm works, always give a few examples showing that the algorithm indeed works.

Problem 3 (15 points)

Download the MNIST data set. It can be found on the course web page, in a package that also includes some helpful Python, MATLAB and R routines for handling it. Split the data into two equally sized parts, using the first half of the data set as the training set and the second half as the test set. The data set contains 60,000 data points, so both the training and test sets will contain 30,000 points. Each 'X' value is a 784-dimensional vector, corresponding to the grey values of a 28×28 pixel image. The 'y' values are classes, corresponding to the digits '0' through '9'.

Implement the Perceptron algorithm, including the “pocket” modification for use on non-separable data.

Use it to learn a multi-class linear classifier for the MNIST data set. Do this in two different ways, first using the one vs. all technique for applying binary classifiers to a multiclass problem, and then using the all vs. all technique.
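To fix ideas, a bare-bones pocket perceptron and a one vs. all wrapper could look roughly like the sketch below. This is not the course scaffolding (ASSG6_TMP.py): the function names, the choice of updating on a randomly chosen misclassified point, and the update budget n_updates are all assumptions, and the code is not optimised for the full 30,000-point training set.

```python
import numpy as np

def pocket_perceptron(X, y, n_updates=1000, seed=0):
    """Binary perceptron with the 'pocket' modification (simple, unoptimised sketch).

    X: (n, d) feature matrix, y: labels in {-1, +1}.
    Keeps in the 'pocket' the weights with the fewest training errors seen so far.
    """
    rng = np.random.default_rng(seed)
    Xb = np.hstack([X, np.ones((len(X), 1))])     # absorb the bias term into the weights
    w = np.zeros(Xb.shape[1])
    best_w, best_errs = w.copy(), np.inf
    for _ in range(n_updates):
        preds = np.where(Xb @ w > 0, 1, -1)
        wrong = np.flatnonzero(preds != y)
        if len(wrong) < best_errs:                # pocket step: remember the best weights so far
            best_w, best_errs = w.copy(), len(wrong)
        if len(wrong) == 0:                       # training data separated, stop early
            break
        i = rng.choice(wrong)                     # standard perceptron update on one mistake
        w = w + y[i] * Xb[i]
    return best_w

def train_one_vs_all(X, y, n_classes=10, n_updates=1000):
    """Train one binary 'class k vs. the rest' classifier per class."""
    return np.array([pocket_perceptron(X, np.where(y == k, 1, -1), n_updates)
                     for k in range(n_classes)])

def predict_one_vs_all(W, X):
    """Predict the class whose binary classifier assigns the largest score."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return (Xb @ W.T).argmax(axis=1)
```

An all vs. all classifier can be built along the same lines by training one binary classifier for each pair of digits (on the subset of points belonging to those two classes) and predicting by majority vote over the 45 pairwise decisions; that part is left as part of the exercise.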

Apply the multi-class classifiers you have learned to the test set. Create the 10×10 confusion matrix, both for the one vs. all and for the all vs. all classifier. Are the classification results reasonably accurate? Can you see any difference between the two methods? Are there any other interesting observations you can make?
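For tabulating the results, the 10×10 confusion matrix can be built directly from the true and predicted test labels. The sketch below assumes integer digit labels and reuses the (hypothetical) predict_one_vs_all from the previous sketch; rows index the true class and columns the predicted class.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes=10):
    """Rows = true class, columns = predicted class."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

# Example usage (X_test, y_test, W as in the previous sketch):
# cm = confusion_matrix(y_test, predict_one_vs_all(W, X_test))
# print(cm)
# print("accuracy:", cm.trace() / cm.sum())
```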

