
Model complexity, overfitting, and model selection


Academic year: 2022


(1)

Model complexity, overfitting, and model selection

(2)

So, how good is my classifier?

• Apply the learned classifier to the training data?
  • k-Nearest Neighbor with k = 1 performs at 100% accuracy (each point is the nearest neighbor of itself, so the label is correct!)
  • k-Nearest Neighbor with k > 1 generally performs < 100%
  • Naïve Bayes also generally gives < 100% accuracy
  • Many other classifiers (such as decision trees or rule-based classifiers, coming soon!) can achieve 100% accuracy as well, as long as all records x are distinct
  ⇒ so never use kNN with k > 1 or Naive Bayes??!?
• But... the goal of classification is to perform well on new (unseen) data. How can we test that?
• If the i.i.d. model for the classification problem is correct, then...

(3)

...we can estimate the generalization error (the expected error when samples are drawn from the underlying distribution) by using only part of the available data for 'training' and leaving the rest for 'testing'.

(Under the stated assumptions the test data is now 'new data', so with this approach we get unbiased estimates of the generalization error.)

• Typically (almost invariably), the performance on the test set is worse than on the training set, because the classifier has learned some features specific to the training set (in addition to features of the underlying distribution).
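For concreteness, here is a minimal sketch of this train/test idea, assuming a synthetic i.i.d. dataset and scikit-learn; the dataset, classifiers and split sizes are illustrative assumptions, not part of the course material:

```python
# Sketch: estimating generalization error with a held-out test set
# on an assumed synthetic i.i.d. dataset.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

for name, clf in [("1NN", KNeighborsClassifier(n_neighbors=1)),
                  ("15NN", KNeighborsClassifier(n_neighbors=15)),
                  ("Naive Bayes", GaussianNB())]:
    clf.fit(X_train, y_train)
    print(f"{name:12s} train acc = {clf.score(X_train, y_train):.3f}  "
          f"test acc = {clf.score(X_test, y_test):.3f}")
# 1NN scores 1.000 on the training set but typically noticeably lower on the test set.
```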

(4)

• Comparing 1NN, kNN, Naive Bayes, and any other classifier on a held-out test set:
  • Now all generally perform < 100%
  • It is not a priori clear which method is best; this is an empirical issue (and depends on the amount of data, the structure in the data, etc.)
• Flexible classifiers may overfit the training data.
  • Spam filter: picking up on words that discriminate the two classes in the training set but not in the underlying distribution
  • Credit card fraud detection: some thieves may use the card to buy regular items; those particular items (or item combinations) may then be marked suspicious
  • Face recognition: it happened to be darker than average when one of the pictures was taken; now darkness is associated with that particular identity

(5)

1NN example (recap)

• Decision boundary for the k-Nearest Neighbor classifier, k = 1 (figure from Hastie et al., 2009)

[Figure 2.3 from Hastie et al. (2009): the two-class simulation example in two dimensions, with classes coded as a binary variable (BLUE = 0, ORANGE = 1) and predicted by 1-nearest-neighbor classification. Each subregion of the boundary depends on a handful of input points, so the boundary is wiggly and unstable (high variance, low bias), in contrast to the smooth least-squares boundary (low variance, potentially high bias).]

(6)

kNN example (recap)

• Decision boundary for the k-Nearest Neighbor classifier, k = 15 (figure from Hastie et al., 2009)

[Figure 2.2 from Hastie et al. (2009): the same two-class example fit by 15-nearest-neighbor averaging; the predicted class is chosen by majority vote amongst the 15 nearest neighbors. Fewer training observations are misclassified than by least squares, but training error is not a useful criterion for picking k: it is approximately an increasing function of k and always 0 for k = 1, so an independent test set is needed to compare methods.]

(7)

Bayes Optimal Classifier

• Bayes optimal decision boundary (figure from Hastie et al., 2009)

[Figure 2.5 from Hastie et al. (2009): the optimal Bayes decision boundary for the simulation example of Figures 2.1, 2.2 and 2.3; since the generating density is known for each class, this boundary can be calculated exactly. With 0-1 loss, minimizing the expected prediction error pointwise gives Ĝ(x) = argmin_g [1 − Pr(g | X = x)], i.e. the Bayes classifier assigns each x to the most probable class given X = x.]

(8)

Error vs flexibility (train and test)

• Classification error on the training set and the test set (training set size: 200 points, test set size: 10,000 points) (figure from Hastie et al., 2009)

[Figure 2.4 from Hastie et al. (2009): misclassification curves for the simulation example, using a single training sample of size 200 and a test sample of size 10,000. The orange curves are test error and the blue curves training error for k-nearest-neighbor classification over a range of k (equivalently, degrees of freedom N/k); the results for linear regression are the larger squares at three degrees of freedom, and the purple line is the optimal Bayes error rate.]
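A rough sketch of how one might reproduce this kind of train/test error curve; the synthetic dataset and the grid of k values below are assumptions made for illustration, not the book's actual simulation:

```python
# Sketch: training vs test misclassification error of kNN as a function of k,
# in the spirit of Figure 2.4 (assumed synthetic data).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=10200, n_features=2, n_informative=2,
                           n_redundant=0, n_clusters_per_class=2, random_state=1)
# 200 training points and 10,000 test points, matching the sizes in the figure caption.
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=200, random_state=1)

for k in (1, 3, 5, 9, 15, 25, 51, 101):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    train_err = 1 - knn.score(X_train, y_train)
    test_err = 1 - knn.score(X_test, y_test)
    print(f"k={k:4d}  train error={train_err:.3f}  test error={test_err:.3f}")
# Training error is 0 for k = 1 and tends to grow with k; test error is typically
# U-shaped, with its minimum at some intermediate k.
```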

(9)

Error vs flexibility (train and test)

• Typical behaviour: the higher the model complexity (the more flexible the model), the lower the error on the training sample. However, the error curve for a test sample is U-shaped.

(figure from Hastie et al., 2009)

[Figure 2.11 from Hastie et al. (2009): test and training error as a function of model complexity. As complexity increases, variance tends to increase and squared bias to decrease (low complexity: high bias, low variance; high complexity: low bias, high variance). Training error decreases whenever we fit the data harder, so it is not a good estimate of test error; test error is minimized at an intermediate complexity that trades bias off against variance.]

(12)

Analogous problem: Curve fitting

• Which of the following curves 'fits best' to the data?

  [figures: curves of varying flexibility fitted to the same data, labeled 'underfit' and 'overfit' at the extremes]

• The more flexible the curve...
  • ...the better you can make it fit your data...
  • ...but the more likely it is to overfit
⇒ ...so you need to be careful to strive for both model simplicity and good fit to the data!

(13)

Bias-variance tradeoff

• Based on N training datapoints from the distribution, how close is the learned classifier to the optimal classifier?
  Consider multiple trials: repeatedly and independently drawing N training points from the underlying distribution.
  • Bias: the difference between the optimal classifier and the average (over all trials) of the learned classifiers
  • Variance: the average squared difference between the (single-trial) learned classifier and the average (over all trials) of the learned classifiers
• Goal: low bias and low variance.
• High model complexity ⇒ low bias and high variance
  Low model complexity ⇒ high bias and low variance
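A small simulation sketch of this 'multiple trials' definition, assuming a 1-D kNN regression setting; the true function, noise level and sample sizes are illustrative assumptions:

```python
# Sketch: estimating bias and variance at a single query point x0 by repeatedly
# redrawing N training points, for kNN regression with small vs large k.
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(3 * x)          # assumed "true" function
x0, N, trials = 0.5, 50, 2000

def knn_predict(x_train, y_train, x0, k):
    idx = np.argsort(np.abs(x_train - x0))[:k]   # indices of the k nearest neighbors of x0
    return y_train[idx].mean()

for k in (1, 25):
    preds = np.empty(trials)
    for t in range(trials):
        x_train = rng.uniform(0, 1, N)
        y_train = f(x_train) + rng.normal(0, 0.3, N)   # noisy observations
        preds[t] = knn_predict(x_train, y_train, x0, k)
    bias = preds.mean() - f(x0)
    var = preds.var()
    print(f"k={k:2d}  bias={bias:+.3f}  variance={var:.4f}")
# Typically: small k -> small bias, large variance; large k -> larger bias, smaller variance.
```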

(14)

Problems of kNN with large k

• For a fixed number N of datapoints, increasing k smooths out local structure:

  [figure: training examples of class 1, training examples of class 2, and a test example X, shown for increasing values of k]

• For finite N, we don't know the best value of k...
• (Note: it might even be best to use a different number of neighbors in different regions of the space.)
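A toy sketch of this smoothing effect, with assumed synthetic data in which the locally dominant class near the test point X differs from the globally more common class:

```python
# Sketch: how the prediction at a fixed test point X can flip as k grows,
# because a large k averages over the globally more common class (assumed data).
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(2)
X1 = rng.normal(loc=[0.0, 0.0], scale=0.2, size=(10, 2))   # class 1: tight local cluster
X2 = rng.normal(loc=[2.0, 2.0], scale=1.0, size=(40, 2))   # class 2: more numerous overall
X_train = np.vstack([X1, X2])
y_train = np.array([1] * 10 + [2] * 40)
x_test = np.array([[0.0, 0.0]])                            # test point X inside the cluster

for k in (1, 5, 15, 35, 49):
    pred = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train).predict(x_test)
    print(f"k={k:2d}  predicted class for X: {pred[0]}")
# With small k the local cluster (class 1) typically wins; with large k the
# globally more common class takes over.
```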

(18)

All classifiers have this problem: Naive Bayes

• How to estimate the (univariate) class-conditional marginal distributions? How much smoothing to use?
• E.g. for a kernel-density estimate, what value to choose for σ?

  p(x_i | y) = (1 / N_y) Σ_{j=1}^{N_y} N( x_i ; x_i^(j), σ² )    (13)

  [figure: kernel-density estimates of a class-conditional distribution for different values of σ]
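A minimal sketch of equation (13) and of how σ might be chosen; the 1-D data, the held-out evaluation and all numbers are assumptions made for illustration, not code provided with the course:

```python
# Sketch: class-conditional kernel-density estimate as in equation (13),
# evaluated on training and held-out values for several sigmas.
import numpy as np
from scipy.stats import norm

def kde(x, centers, sigma):
    """p(x | y): average of Gaussian kernels centered at the class-y training values."""
    dens = norm.pdf(np.asarray(x)[None, :], loc=centers[:, None], scale=sigma).mean(axis=0)
    return np.maximum(dens, 1e-300)   # guard against log(0) for absurdly small sigma

rng = np.random.default_rng(3)
train_vals = rng.normal(0.0, 1.0, 30)   # feature values of one class (training half)
val_vals = rng.normal(0.0, 1.0, 30)     # held-out values from the same distribution

for sigma in (0.01, 0.3, 1.0, 3.0):
    train_ll = np.log(kde(train_vals, train_vals, sigma)).sum()
    val_ll = np.log(kde(val_vals, train_vals, sigma)).sum()
    print(f"sigma={sigma:4.2f}  train log-lik={train_ll:9.1f}  held-out log-lik={val_ll:9.1f}")
# The smallest sigma 'wins' on the training values (each point sits on its own spike)
# but collapses on held-out data; the held-out log-likelihood favors an intermediate sigma.
```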

(26)

All classifiers have this problem: Decision trees

(Discussed soon in more detail!)

• What is the appropriate depth of the tree?

  [figure: a two-dimensional dataset (x1 from 0 to 5, x2 from 0 to 3) partitioned by a tree that is grown step by step with the splits x1 > 2.8, x2 > 2.0, x2 > 1.2, x1 > 4.3 and x1 > 2.2, with nodes labeled A, B, B, C]

• Use pre-pruning (stop 'early' before adding too many nodes) or post-pruning (after the full tree is constructed, try to remove nodes that do not improve generalization)

(31)

All classifiers have this problem: Rule-based classifiers

(Discussed soon in more detail!)

• Rules are likely to be unreliable if:
  • ...they have very low coverage
  • ...they have complicated antecedents

  [figure: a two-dimensional dataset (x1 from 0 to 5, x2 from 0 to 3) covered by rules r1, r2, r3 whose regions are labeled green, red and green]

(32)

All regression methods have this problem as well

• E.g. what degree polynomial to fit to the data?
  • Note that low-degree polynomials are special cases of higher-degree polynomials, with some coefficients set to zero (i.e. the model classes are 'nested'), e.g. a second-degree polynomial as a special case of a fourth-degree one:

    y = c_0 + c_1·x + c_2·x² + 0·x³ + 0·x⁴    (14)

  ⇒ a high-degree polynomial is guaranteed to fit the data at least as well as a low-degree one!
  ⇒ so some form of regularization is needed to avoid overlearning!
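A small sketch of this nesting effect, assuming noisy data generated from a quadratic; the specific degrees, sample sizes and noise level are illustrative assumptions:

```python
# Sketch: training error never increases with polynomial degree (nested models, eq. 14),
# while held-out error eventually gets worse.
import numpy as np

rng = np.random.default_rng(4)
x_train, x_val = rng.uniform(-1, 1, 20), rng.uniform(-1, 1, 200)
f = lambda x: 1.0 - 2.0 * x + 0.5 * x**2            # assumed true curve: a quadratic
y_train = f(x_train) + rng.normal(0, 0.2, x_train.size)
y_val = f(x_val) + rng.normal(0, 0.2, x_val.size)

for degree in (0, 1, 2, 4, 8, 12):
    coeffs = np.polyfit(x_train, y_train, degree)     # least-squares fit of given degree
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    val_mse = np.mean((np.polyval(coeffs, x_val) - y_val) ** 2)
    print(f"degree={degree:2d}  train MSE={train_mse:.4f}  validation MSE={val_mse:.4f}")
# Training MSE never increases with the degree (up to numerical precision);
# validation MSE typically bottoms out near the true degree and then tends to grow.
```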

(33)

Using ‘validation’ data to select model complexity

1. Split the available data into 'train' and 'validation' subsets:

   available data = [ train | validation ]

2. Fit models with varying complexity on the 'train' data, e.g.
   • kNN with varying k
   • Naive Bayes with varying amounts of smoothing
   • Decision tree with varying depth
   • Rule-based classifier with various coverage requirements

3. Choose the best model based on its performance on the 'validation' set (see the sketch below)

(Not quite optimal, since the amount of training data is then not the full amount available.)
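A minimal sketch of steps 1-3 for choosing k in kNN, assuming a synthetic dataset and scikit-learn; all names and values here are illustrative assumptions:

```python
# Sketch: fit kNN with varying k on the 'train' part,
# pick k by accuracy on the 'validation' part.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=600, n_features=10, random_state=5)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.33, random_state=5)

scores = {}
for k in (1, 3, 5, 9, 15, 25, 51):
    scores[k] = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr).score(X_val, y_val)
    print(f"k={k:3d}  validation accuracy={scores[k]:.3f}")
print("chosen k:", max(scores, key=scores.get))
```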

(34)

Cross-validation

To get more reliable statistics than a single 'split' provides, use K-fold cross-validation:

1. Divide the data into K equal-sized subsets:

   available data = [ 1 | 2 | 3 | 4 | 5 ]

2. For j from 1 to K:
   2.1 Train the model(s) using all data except that of subset j
   2.2 Compute the resulting validation error on subset j

3. Average the K results (see the sketch below)

When K = N (i.e. each datapoint is a separate subset) this is known as leave-one-out cross-validation.
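The K-fold procedure above, written out explicitly as a sketch on assumed synthetic data; in practice a library routine such as scikit-learn's cross_val_score does the same bookkeeping:

```python
# Sketch: K-fold cross-validation for kNN, following steps 1-3 above.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=6)
K = 5
folds = np.array_split(np.random.default_rng(6).permutation(len(X)), K)  # K equal-sized subsets

for k in (1, 5, 15):
    errors = []
    for j in range(K):
        val_idx = folds[j]                                               # subset j held out
        tr_idx = np.concatenate([folds[i] for i in range(K) if i != j])  # train on the rest
        model = KNeighborsClassifier(n_neighbors=k).fit(X[tr_idx], y[tr_idx])
        errors.append(1 - model.score(X[val_idx], y[val_idx]))           # error on fold j
    print(f"k={k:2d}  {K}-fold CV error = {np.mean(errors):.3f}")
```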

(35)

Multiple testing (multiple comparison)

Example:

• A number of investment advisors are predicting whether the market will rise or fall on the next day.
• No advisor is actually any better than a coin-flip.
• We have the record of 50 advisors for 10 consecutive days.
• Probability that a specific advisor will be correct on at least 8 days out of 10:

  [ (10 choose 8) + (10 choose 9) + (10 choose 10) ] / 2^10 ≈ 0.0547    (15)

• Probability that at least one of them gets at least 8 correct guesses:

  1 − (1 − 0.0547)^50 ≈ 0.9399    (16)
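These two probabilities can be checked directly with exact binomial counts:

```python
# Sketch: verifying (15) and (16).
from math import comb

p_single = sum(comb(10, d) for d in (8, 9, 10)) / 2**10   # one advisor right on >= 8 of 10 days
p_at_least_one = 1 - (1 - p_single) ** 50                 # at least one of 50 such advisors
print(f"single advisor : {p_single:.4f}")                 # ~0.0547
print(f"best of 50     : {p_at_least_one:.4f}")           # ~0.9399
```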

(36)

• The moral of the story: if you are comparing a number of random predictors, it is likely that some will have very good empirical performance even if they are all quite random.
• It makes sense to select the best predictors, but one should not have much faith in their predictive power unless, after selection, one tests them on a fresh dataset.
• This problem is strongly related to machine learning:
  1. Many machine learning algorithms employ multiple tests when selecting nodes to split or variables to include in the analysis.
  2. Overall, data mining is concerned with finding interesting and useful relationships in data, and if the search is not restricted it is almost always possible to find 'strange patterns', even in completely random data.
• The importance of this issue cannot be overstated!

(37)

Estimating generalization performance

• So, at the end, to get an unbiased estimate of the generalization performance, test the model on data that has not been used in any way to guide the search for the best model. (The test data should be 'locked in a vault'. No peeking!)
• If you are testing several models, beware that the best/worst results may not be good estimates of the generalization error of those models. (Don't fall into that trap again!)
• Often, the data is divided in three: 'train', 'validation', and 'test':

  available data = [ train | validation | test ]

• One can use a cross-validation approach for the train/validation part, keeping the test set aside.

(38)

Explicitly penalizing complexity

• Models can be regularized by explicitly adding a term that penalizes complexity, as in

  (model score) = (model fit to data) − λ · (model complexity)

  where the fit to data can be likelihood, classification accuracy, or similar, and the model complexity can be
  • the number of nodes in a decision tree
  • the norm of the weights (parameters) of the model
  • ...
• But how to select λ?
  Cross-validation, Bayesian approach, MDL, BIC, AIC...
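A toy sketch of such a penalized score, using polynomial regression with the negative training MSE as the fit term and the number of coefficients as the complexity term; the choice of terms and the value of λ are illustrative assumptions:

```python
# Sketch: (model score) = (fit to data) - lambda * (model complexity)
# for polynomial models of increasing degree, on assumed data.
import numpy as np

rng = np.random.default_rng(7)
x = rng.uniform(-1, 1, 40)
y = 1.0 - 2.0 * x + 0.5 * x**2 + rng.normal(0, 0.2, x.size)   # quadratic plus noise

lam = 0.01
for degree in range(0, 9):
    coeffs = np.polyfit(x, y, degree)
    fit = -np.mean((np.polyval(coeffs, x) - y) ** 2)   # higher is better (negative MSE)
    complexity = degree + 1                            # number of coefficients
    print(f"degree={degree}  fit={fit:.4f}  penalized score={fit - lam * complexity:.4f}")
# Without the penalty the highest degree always scores best on the training data;
# with lambda > 0 the score peaks at a moderate degree. How to set lambda itself is
# exactly the question addressed by cross-validation and the other criteria above.
```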

(39)

Bayesian model selection

• Treat the model parameters as a (hidden) random vector θ, with a prior probability density p(θ | M_i) for each model M_i (with different complexities for different i).
• Select the model M_i which gives the highest marginal likelihood of the observed data, integrating out θ:

  p(data | M_i) = ∫ p(data | θ, M_i) p(θ | M_i) dθ    (17)

• This favors the simplest (least flexible) model that nevertheless provides a decent fit to the observed data.

  [figure: marginal likelihood p(D) for models M1, M2, M3 of increasing flexibility]
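A rough Monte Carlo sketch of equation (17) on an assumed toy problem: polynomial models of increasing degree, a standard normal prior on the coefficients, and a known noise level. All of these choices are assumptions made for illustration, not the course's setup:

```python
# Sketch: estimating log p(data | M_i) by averaging the likelihood over prior samples.
import numpy as np
from scipy.special import logsumexp

rng = np.random.default_rng(8)
x = rng.uniform(-1, 1, 30)
y = 1.0 - 2.0 * x + rng.normal(0, 0.3, x.size)        # data actually come from a line

noise, n_prior_samples = 0.3, 20000
for degree in (0, 1, 2, 4, 8):
    # theta ~ p(theta | M_i): standard normal prior on the degree+1 coefficients
    thetas = rng.normal(0, 1, size=(n_prior_samples, degree + 1))
    preds = np.polynomial.polynomial.polyvander(x, degree) @ thetas.T    # (30, n_samples)
    log_lik = (-0.5 * np.sum((y[:, None] - preds) ** 2, axis=0) / noise**2
               - x.size * np.log(np.sqrt(2 * np.pi) * noise))
    log_marginal = logsumexp(log_lik) - np.log(n_prior_samples)          # MC estimate of (17)
    print(f"degree={degree}  log p(data | M) ~ {log_marginal:8.2f}")
# In such a comparison the marginal likelihood typically peaks at a low degree:
# the constant model fits too poorly, and the very flexible high-degree models
# are penalized for spreading their prior over many possible datasets.
```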

(40)

Minimum description length (MDL)

• Basic idea: the best 'model' for the data is the most compact (shortest) description of the data.
• Typically, this will consist of describing some general characteristics of the data, and then the details of how the data deviates from those predictions:

  L(description of model) + L(description of data given the model)

• Thus, this framework provides one way of balancing model complexity with data fit (but the devil is in the details).
• For a thorough treatment, take the course 'Information-Theoretic Modeling'.
