Probabilistic predictions: Calibration
- Informally, a probabilistic predictor is calibrated if the observed relative frequencies agree with the predicted probabilities.
- Definition: A probabilistic predictor is calibrated if, whenever the predicted probability is P̂(y | x) = α, the observed relative frequency of the event is also α, for all values of α.
- Calibration is a desirable property of probabilistic predictions, independent of the cost function.
- For example: in those cases where P̂(rain) = 0.2, how often did it actually rain? If it rained in exactly 20% of such cases, then this prediction was calibrated. If this holds for all predictions (all probabilities), then the probabilistic predictor is calibrated.
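A calibration check can be sketched directly from this definition: bin the predicted probabilities and compare each bin's mean prediction with the observed relative frequency. A minimal pure-Python sketch; the function name and equal-width binning scheme are illustrative, not from the slides:

```python
def calibration_table(p_pred, y_true, n_bins=10):
    """Group predictions into equal-width probability bins and compare the
    mean predicted probability in each bin with the observed frequency."""
    bins = {}
    for p, y in zip(p_pred, y_true):
        b = min(int(p * n_bins), n_bins - 1)   # bin index; p = 1.0 goes in the top bin
        bins.setdefault(b, []).append((p, y))
    table = []
    for b in sorted(bins):
        ps, ys = zip(*bins[b])
        table.append((sum(ps) / len(ps), sum(ys) / len(ys), len(ps)))
    return table  # (mean predicted prob, observed frequency, count) per bin
```

For a calibrated predictor the first two entries of each row are approximately equal: e.g. among cases predicted around 0.2, about 20% should actually be positive.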
ROC curve for binary classification
- For binary variables, using probabilistic predictions (or any real-valued predictions that indicate some level of confidence), we can plot a Receiver Operating Characteristic (ROC) curve (example on next slide):
  1. Sort the records by increasing estimated probability of the positive class
  2. For i in 1 . . . N:
     2.1 Classify all records j < i as negative, and all records j ≥ i as positive
     2.2 Plot the corresponding false positive rate vs. true positive rate as a point in the graph (and connect the points into a curve)
- One performance measure is AUC, the area under the ROC curve (0 ≤ AUC ≤ 1)
ROC curve for binary classification: Example
[Figure: an example ROC curve, with false positive rate on the x-axis and true positive rate on the y-axis, both from 0.0 to 1.0; the shaded red area is the AUC for the red classifier.]
ROC curve for binary classification: Example

instance   P̂(y = '+' | x)   true class
    1           0.92             +
    2           0.34             –
    3           0.56             –
    4           0.49             +
    5           0.87             +
    6           0.54             +
    7           0.13             –
    8           0.70             +
    9           0.60             +
   10           0.53             –

Sorted by increasing P̂(y = '+' | x):

instance   P̂(y = '+' | x)   true class
    7           0.13             –
    2           0.34             –
    4           0.49             +
   10           0.53             –
    6           0.54             +
    3           0.56             –
    9           0.60             +
    8           0.70             +
    5           0.87             +
    1           0.92             +

[Figure: the resulting ROC curve, with FPR on the x-axis and TPR on the y-axis, both from 0.0 to 1.0.]

TPR = TP / (TP + FN) = 'what proportion of all actual positives were classified as positive?'
FPR = FP / (TN + FP) = 'what proportion of all actual negatives were classified as positive?'
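The threshold-sweeping procedure can be sketched in code on this table's data (helper names are illustrative, and the sketch assumes distinct scores; tied scores would need extra care):

```python
def roc_points(scores, labels):
    """Sweep the threshold over the sorted scores; at each threshold,
    everything with score >= threshold is classified positive."""
    pairs = sorted(zip(scores, labels))          # increasing estimated probability
    P = sum(1 for _, y in pairs if y == '+')     # number of actual positives
    N = len(pairs) - P                           # number of actual negatives
    points = [(1.0, 1.0)]                        # threshold below the minimum: all positive
    tp, fp = P, N
    for _, y in pairs:                           # raise the threshold past each score
        if y == '+':
            tp -= 1
        else:
            fp -= 1
        points.append((fp / N, tp / P))          # (FPR, TPR)
    return points                                # ends at (0.0, 0.0)

def auc(points):
    """Trapezoidal area under the (FPR, TPR) curve."""
    pts = sorted(points)
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(pts, pts[1:]))

scores = [0.92, 0.34, 0.56, 0.49, 0.87, 0.54, 0.13, 0.70, 0.60, 0.53]
labels = ['+', '-', '-', '+', '+', '+', '-', '+', '+', '-']
```

For this table (6 positives, 4 negatives) the sketch gives AUC = 21/24 = 0.875.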
The classification problem: full specification
- Observe pairs (x_1, y_1), ..., (x_N, y_N), where x_i ∈ X, with X an arbitrary set, and y_i ∈ Y, with Y a finite (typically small) set.
- Given a set {x_{N+1}, ..., x_{N+M}} of new objects (each one also ∈ X), predict the corresponding classes y_{N+1}, ..., y_{N+M} (each one also ∈ Y). (Typically the new objects are predicted one at a time, independently of each other.)
Full specification:
- X, Y, and N training dataset pairs (x_1, y_1), ..., (x_N, y_N)
- Prediction type
- Objective function (minimizing expected cost, or optimizing some other given performance metric)
Goal: optimize the objective function for new examples (x_0, y_0).
Simply 'memorizing the training data'
- Simple classifier, with "don't know" output option:
  When presented with a new object x_0 to classify, check whether x_0 = x_i for some x_i in the training data.
  - If so, return the corresponding y_i (if there are several matches, use a 'majority voting' scheme).
  - If not, return "don't know".
- Problem: No generalization! In many (most) applications, the set X is very large and there are essentially never any exact matches, so the method almost always returns "don't know".
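The memorizing classifier above, with majority voting and the "don't know" fallback, can be sketched as (class and method names are illustrative):

```python
from collections import Counter, defaultdict

class MemorizingClassifier:
    """Exact-match lookup over the training data, with majority voting
    among duplicate matches and a "don't know" fallback otherwise."""

    def fit(self, X, y):
        self.table = defaultdict(Counter)
        for x_i, y_i in zip(X, y):          # count labels per distinct object
            self.table[x_i][y_i] += 1
        return self

    def predict(self, x_new):
        if x_new in self.table:
            # majority vote among all training pairs that match exactly
            return self.table[x_new].most_common(1)[0][0]
        return "don't know"
```

Objects must be hashable here (e.g. tuples of feature values); with a large X, almost every query falls through to "don't know", which is exactly the problem noted above.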
Nearest Neighbor classifier
Adapt the above 'simple memory' as follows:
- Instead of saying "don't know" for new unseen objects x_0, find the closest match x_i to x_0, and return the corresponding y_i.
- Note: 'closest' here means the highest similarity, or equivalently the lowest dissimilarity. The specific choice of proximity measure will of course influence the results!
[Figure: training examples of class 1 and class 2 in feature space X, with a test example placed between them.]
Nearest Neighbor classifier: Decision boundary
- Decision boundary for the nearest neighbor classifier (figure from Hastie et al., 2009: 'The Elements of Statistical Learning')
[Figure 2.3 from Hastie et al.: a two-class classification example in two dimensions, with the classes coded as a binary variable (BLUE = 0, ORANGE = 1) and predicted by 1-nearest-neighbor classification; the resulting decision boundary is highly irregular.]
k-Nearest Neighbor classifier (kNN)
- Select the k nearest neighbors (instead of just the single nearest data point).
- 'Forced choice': Set y_0 = mode of the neighbors' values y_i.
- Output "don't know" when there is no clear winner?
- Or output the empirical class distribution of the neighbors...
[Figure: training examples of class 1 and class 2 in feature space X, with a test example.]
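The 'forced choice' variant can be sketched as follows (Euclidean distance is an illustrative choice of proximity measure, and the function name is hypothetical; with k = 1 this reduces to the nearest neighbor classifier above):

```python
import math
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training points
    (Euclidean distance; in practice the proximity measure matters)."""
    dists = sorted((math.dist(x_i, x_new), y_i)
                   for x_i, y_i in zip(X_train, y_train))
    votes = Counter(y for _, y in dists[:k])   # mode of the k nearest labels
    return votes.most_common(1)[0][0]
```

Replacing the final `most_common(1)` by the full `votes` counter would give the empirical class distribution of the neighbors instead of a forced choice.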
k-Nearest Neighbor classifier: Decision boundary
- Decision boundary for the k-Nearest Neighbor classifier, k = 15 (figure from Hastie et al., 2009)
[Figure 2.2 from Hastie et al.: the same two-class example in two dimensions, with the classes coded as a binary variable (BLUE = 0, ORANGE = 1) and fit by 15-nearest-neighbor averaging; the predicted class is chosen by majority vote among the 15 nearest neighbors, and the boundary is much smoother than for k = 1.]
What is a good value for k?
- k = 1 produces a very jagged decision boundary (capturing small-scale structure, but depending strongly on the specific training examples used)
- k = 15 produces a much smoother boundary (capturing only larger-scale structure, and possibly missing local small-scale structure, but more robust with respect to the set of training examples)
- We need additional tools and assumptions to say much about the theoretical behavior of the classifier as a function of k. (So far this has all been pretty ad hoc/intuitive.)
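In practice, a common (still ad hoc, but effective) tool is to hold out part of the training data and pick the k with the best held-out accuracy. A minimal sketch, with illustrative names and a single fixed split (cross-validation would average over several splits):

```python
import math
import random
from collections import Counter

def knn_predict(X, y, x_new, k):
    """Majority vote among the k nearest training points (Euclidean)."""
    dists = sorted((math.dist(x_i, x_new), y_i) for x_i, y_i in zip(X, y))
    return Counter(lbl for _, lbl in dists[:k]).most_common(1)[0][0]

def pick_k(X, y, candidates, val_frac=0.3, seed=0):
    """Hold out a fraction of the data; pick the candidate k with the
    best held-out accuracy."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    n_val = max(1, int(val_frac * len(X)))
    val, tr = idx[:n_val], idx[n_val:]
    X_tr, y_tr = [X[i] for i in tr], [y[i] for i in tr]

    def acc(k):
        return sum(knn_predict(X_tr, y_tr, X[i], k) == y[i] for i in val) / n_val

    return max(candidates, key=acc)
```

Note the held-out set is essential: as the Hastie et al. discussion suggests, the *training* error is minimized trivially at k = 1.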
Computational complexity of kNN
- 'Training phase': just reading the data into memory, an O(N) operation.
- 'Online' (in operation): for each new point to be classified, a straightforward implementation computes the proximity to all N training examples, i.e. O(N) per query. This can be too expensive in many practical situations (as real-time operation is often required).
- Efficient indexing techniques may be available that reduce the number of proximity computations that need to be performed.
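As a toy illustration of the indexing idea (not from the slides): for one-dimensional x, keeping the training examples sorted turns a nearest-neighbor query into binary search, O(log N) instead of O(N). Real systems generalize this with k-d trees, ball trees, or approximate indexes; the class name here is illustrative.

```python
import bisect

class SortedIndex1D:
    """Toy index for 1-D data: binary search narrows the nearest-neighbor
    candidates to the two values bracketing the query."""

    def __init__(self, values, labels):
        order = sorted(range(len(values)), key=values.__getitem__)
        self.xs = [values[i] for i in order]   # sorted once at 'training' time
        self.ys = [labels[i] for i in order]

    def nearest(self, x):
        j = bisect.bisect_left(self.xs, x)     # O(log N) bracketing
        cands = [i for i in (j - 1, j) if 0 <= i < len(self.xs)]
        best = min(cands, key=lambda i: abs(self.xs[i] - x))
        return self.xs[best], self.ys[best]
```

In higher dimensions no total order exists, which is why tree-based and hashing-based indexes are needed instead.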
Standard probabilistic assumptions for classification
Two assumptions:
- All training records (x_i, y_i) are drawn independently, identically distributed (i.i.d.) from some joint distribution P(x, y) (represented by a density as needed for continuous-valued x).
- The test records (x_0, y_0) are also drawn i.i.d. from the same distribution.
[Figure: the class prior P over classes 0 and 1, and the class-conditional densities p(x | y = 0) and p(x | y = 1).]
Is this a reasonable assumption?
- Handwritten digit recognition?
- Designing a spam filter?
- News stories (classify into foreign, domestic, economic, sports)?
- Classifying web pages into porn vs. non-porn?
- ...
Note: By randomly permuting the data records, we can remove temporal/sequential structure. However, in many cases it can be crucial to model this structure.
Bayesian classifier
- If P(y) and P(x | y) were known, we would also know

      P(x, y) = P(y) P(x | y)                                  (7)

  from which we get

      P(y | x) = P(x, y) / P(x) = P(y) P(x | y) / P(x)         (8)

  which is known as Bayes' rule. Note that the denominator is a constant in y (so when maximizing the left-hand side with respect to y it can safely be ignored).
- The latter equation gives us the true probabilistic predictions (but this of course requires knowing P(y) and P(x | y))...
- ...from which we can derive optimal 'forced-choice' classifications (or optimal classifications with a "don't know" option).
Bayesian classifier: Example
- For simplicity, say x is one-dimensional (so write it as x), and p(x | y) is Gaussian for both values of y. Say we know P(y), and we are maximizing standard accuracy. We can explicitly compute the optimal decision boundary: it is given by the point(s!), i.e. value(s) of x, for which

      p(x | y = 0) P(y = 0) = p(x | y = 1) P(y = 1).

- The Bayes error (= 'unavoidable error') is the error rate of this optimal classifier.

[Inset slides from Stephan Dreiseitl (Hagenberg/SE/IM), Lecture 13: Statistical Pattern Recognition, Artificial Intelligence SS2010:]
Making optimal decisions: For two classes c_1 and c_2 and feature vector x, the decision rule "decide class c_1 iff P(c_1 | x) ≥ P(c_2 | x)" minimizes the misclassification error. With Bayes' rule we obtain

      P(c_1 | x) ≥ P(c_2 | x)
      ⇔ P(x | c_1) P(c_1) / P(x) ≥ P(x | c_2) P(c_2) / P(x)
      ⇔ P(x | c_1) P(c_1) ≥ P(x | c_2) P(c_2)

For continuous features x, replace P(x | c_i) by the density p(x | c_i). The components of p(x | c_i) P(c_i) are the class-conditional densities p(x | c_i) and the priors (class prevalences) P(c_i). A graphical illustration shows that the decision threshold at p(x | c_1) P(c_1) = p(x | c_2) P(c_2) minimizes the misclassification error: the grey area under the overlapping curves is the Bayes error, the unavoidable error due to class overlap.
Say P(y = 0) = 1/2 and P(y = 1) = 1/2. Additionally assume p(x | y = 0) = N(x; μ_0, σ²) and p(x | y = 1) = N(x; μ_1, σ²), where

      N(x; μ, σ²) = (1 / (√(2π) σ)) exp( −(x − μ)² / (2σ²) )   (9)

The maximum accuracy (probability of correct classification) is obtained by placing the decision boundary at the point x_t where

      p(x_t | y = 0) P(y = 0) = p(x_t | y = 1) P(y = 1)
      exp( −(x_t − μ_0)² / (2σ²) ) = exp( −(x_t − μ_1)² / (2σ²) )
      (x_t − μ_0)² = (x_t − μ_1)²
      x_t² − 2 x_t μ_0 + μ_0² = x_t² − 2 x_t μ_1 + μ_1²
      2(μ_1 − μ_0) x_t = μ_1² − μ_0²
      x_t = (μ_1 − μ_0)(μ_1 + μ_0) / (2(μ_1 − μ_0)) = (μ_1 + μ_0) / 2
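A quick numerical check of the derived threshold, with illustrative parameter values: at x_t = (μ_0 + μ_1)/2 the two equal-prior class scores p(x_t | y) P(y) coincide, which is exactly the boundary condition above.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density N(x; mu, sigma^2) from equation (9)."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

mu0, mu1, sigma = 0.0, 2.0, 1.0     # illustrative values, not from the slides
x_t = (mu0 + mu1) / 2               # derived threshold

# Equal priors: the two class scores are equal at the threshold.
score0 = 0.5 * normal_pdf(x_t, mu0, sigma)
score1 = 0.5 * normal_pdf(x_t, mu1, sigma)
assert abs(score0 - score1) < 1e-12
```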
The Bayes error rate in this case is equal to

      (1/2) ∫_{−∞}^{x_t} N(x; μ_1, σ²) dx + (1/2) ∫_{x_t}^{∞} N(x; μ_0, σ²) dx
      = ∫_{x_t}^{∞} N(x; μ_0, σ²) dx

(the last step uses the symmetry of the two error terms about x_t, given the equal priors and equal variances).

When the variances are the same but the class probabilities are unequal, i.e. P(y = 0) ≠ P(y = 1), the optimal threshold (again, just a single threshold) is biased away from (μ_0 + μ_1)/2.

If the variances differ, then in general there will be two thresholds!
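The Bayes error ∫_{x_t}^{∞} N(x; μ_0, σ²) dx is a Gaussian tail probability and can be evaluated with the complementary error function (parameter values are illustrative, matching the earlier sketch):

```python
import math

def gaussian_tail(a, mu, sigma):
    """P(X > a) for X ~ N(mu, sigma^2), via the complementary error function."""
    return 0.5 * math.erfc((a - mu) / (sigma * math.sqrt(2)))

mu0, mu1, sigma = 0.0, 2.0, 1.0     # illustrative values, not from the slides
x_t = (mu0 + mu1) / 2               # equal-prior, equal-variance threshold
bayes_error = gaussian_tail(x_t, mu0, sigma)
```

With these values the class means are two standard deviations apart, so the Bayes error is 1 − Φ(1) ≈ 0.159: even the optimal classifier misclassifies about 16% of cases.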