Pen-and-paper problems

(1)

582631 Introduction to Machine Learning, Autumn 2013 Exercise set 5 (based on fourth week of lectures)

To be discussed in the exercise session 29 November.

Credit is given based on the written solutions turned in to the course assistant. Extra credit (0.5 points per problem) is given for willingness to present the solution to the pen-and-paper problems at the exercise session.

The deadline for handing in your solutions is9:00am on Wednesday, 27 November.

Send your solutions to the course assistant (Yuan.Zou@cs.helsinki.fi) by e-mail. You should send one PDF file including your solutions to the pen-and-paper problems and the report for the programming problem, and a single compressed file including the code for the programming problem.

Pen-and-paper problems

Problem 1 (3 points) Consider the following two sets of 80 cars each: Set ‘A’ consists of 10 Volvos, 25 Toyotas, and 45 Audis, while set ‘B’ consists of 8 Volvos, 32 Toyotas, and 40 Audis. Which set do you intuitively think is more pure (that is, has lower impurity), and why? Compute the Entropy, the Gini index, and the misclassification error for each of the two sets. According to these measures, which set is more pure? Could this phenomenon (conflict among the measures) happen if there were just two classes (two types of car) rather than three? Why, or why not?

Problem 2 (3 points) Consider a binary classification problem with the following set of attributes and attribute values:

• Air conditioner ={Working, Broken}

• Engine ={Good, Bad}

• Mileage ={High, Medium, Low}

• Rust ={Yes, No}

Suppose a rule-based classifier produces the following rule set:

• (Mileage = High)⇒(Value = Low)

• (Mileage = Low)⇒(Value = High)

• ((Air conditioner = Working) and (Engine = Good)) ⇒(Value = High)

• ((Air conditioner = Working) and (Engine = Bad))⇒(Value = Low)

• (Air conditioner = Broken)⇒(Value = Low) Given the above, answer the following:

(a) Are the rules mutually exclusive?

(b) Is the rule set exhaustive?

(c) Is ordering needed for this set of rules?

(d) Do you need a default class for the rule set?

Problem 3 (3 points) Given a set ofN pointsy₁, . . . , y_N, with eachy_i∈R, show that...

(a) . . . the valuey^∗which minimizes the sum ofsquared errors, i.e.

y^∗= arg min

ˆ y

N

X

i=1

(yi−y)ˆ² (1)

is given by themean of they_i, i.e. y^∗=P

iy_i/N.

Continues on the next page!

1

(2)

(b) . . . the valuey^∗which minimizes the sum ofabsolute errors, i.e.

y^∗= arg min

ˆ y

N

X

i=1

|yi−y|ˆ (2)

is given by the median of the y_i. [Hint for part (b): Break the sum into parts corresponding to pairs of observations, pairing the smallest with the largest point, the second-smallest with the second-largest point, etc.]

Computer problem

General instructions:

• Return a brief written report (as PDF, included in the same file as your pen-and-paper solutions) and one directory (as a zip or tar.gz file) including all your code.

• Do not include any of the data files in your solution file.

• Always mention your name and student ID in the report.

• We use the report as the main basis for grading: All your results should be in the report. We also look at the code, but we won’t however go fishing for results in your code.

• The code needs to be submitted as a runnable file or set of files (command to run it given in the report).

• In your report, the results will be mostly either in the form of a figure or program output. In both cases, add some sentences which explain what you are showing and why the results are the answer to the question.

• If we ask you to test whether an algorithm works, always give a few examples showing that the algorithm indeed works

Problem 4 (15 points)

In this problem, we will test linear regression on a simple synthetic dataset. We will use the following polynomial as the underlying target function

y=f(x) = 2 +x−0.5x². (3)

First, randomly sample 30 points xi from the uniform distribution on the interval [−3,3]. Then, randomly generate theyi using

yi=f(xi) +ni, (4)

wheref is as defined above, and then_iare i.i.d. normal random variables with zero mean and standard deviation 0.4. The resulting 30 pairs (x_i, y_i) is your data set for this exercise.

(a) First, let’s fit polynomials of order 0 to 10 to this dataset using linear regression, minimizing the sum of squares error. That is, fit functions of the form

ˆ y=

K

X

p=0

wpx^p (5)

withK= 0, . . . ,10 to the data. For instance, forK= 4 the polynomial to fit is ˆ

y=w0+w1x+w2x²+w3x³+w4x⁴. (6) For each of the 11 values of K, produce a separate plot showing the datapoints (xi, yi) and the fitted polynomial. (Plot the polynomial as a curve, in the full interval [−3,3], overlayed on the scatterplot of the points.) You should see that as the order of the polynomial K increases, the curve comes closer and closer to fitting all the datapoints. For each value ofK, calculate the coefficient of determinationR² and comment on the behaviour of this measure as a function ofK.

Continues on the next page!

2

(3)

(b) Finally, let’s test model selection based on 10-fold cross-validation. That is, divide the dataset into 10 equal-sized subsets (i.e. 3 datapoints in each subset), and, for each value of K= 0, . . . ,11 and each data subset j = 1, . . .10, use all the data except the data in subsetj to fit the polynomial of order K, and compute the resulting sum of squared errors on subset j. For each value of K, sum together the errors coming from the different j. Plot these results with K on the horizontal axis, and the sum of squared errors on the vertical axis. How does this function behave? Does the cross-validated error improve with increasingK? WhichK gives the minimum error, and what are the corresponding estimated coefficients?

Compare them to the true coefficients used inf(x) to generate the data.

3