Estimating class-conditional distributions P(x | y)
(1)

Estimating class prevalences P(y)

▶ How to estimate P(y)?

▶ Simplest estimate: For each class, just divide the empirical frequency by the total number of examples:

$$P(y = \alpha) = \frac{\#[y = \alpha]}{N} \qquad (10)$$

For example: The proportion of females in a computer science class at our department is estimated by the number of females in this class divided by the size of the class. (This is an unbiased estimate if the students taking the course are a random sample of the students at the department... this may or may not be true.)

▶ In the i.i.d. setting this estimate is consistent, and it works reasonably well when there is a sufficient number of examples of each class
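
A minimal sketch (not from the slides; the labels below are made up) of computing these empirical class prevalences with numpy:

```python
import numpy as np

# Hypothetical label vector: one class label per training example.
y = np.array(["msg", "spam", "msg", "msg", "spam"])

classes, counts = np.unique(y, return_counts=True)
priors = counts / counts.sum()          # P(y = alpha) = #[y = alpha] / N

for label, p in zip(classes, priors):
    print(f"P(y = {label}) = {p:.2f}")
```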


(2)

Estimating class-conditional distributions P(x | y)

▶ How to estimate P(x | y) for each value of y?

▶ For high-dimensional x this is an extremely non-trivial problem

▶ Discrete x:

Dividing empirical frequencies by the total count for every element of X does not work in practice in most cases (too many states, too few data points ⇒ unreliable, many zeros)

▶ Continuous-valued x:

Multivariate normal (Gaussian) distribution (n dimensions):

$$\mathcal{N}(x; \mu, \Sigma) = \frac{1}{(2\pi)^{n/2}\,|\Sigma|^{1/2}} \exp\!\left(-\frac{1}{2}(x - \mu)^{T}\Sigma^{-1}(x - \mu)\right)$$

'Only' requires estimates of the mean µ and the covariance matrix Σ, so the total number of parameters is n + n(n+1)/2.

For low-dimensional data this may be ok, but even this is unreliable for high-dimensional data.
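
As a rough sketch (data and function names invented for the illustration), fitting one full-covariance Gaussian per class and evaluating its density requires exactly the mean and covariance mentioned above:

```python
import numpy as np

def fit_gaussian(X):
    """Estimate the mean and covariance of one class-conditional Gaussian."""
    mu = X.mean(axis=0)               # n parameters
    Sigma = np.cov(X, rowvar=False)   # n(n+1)/2 free parameters (symmetric matrix)
    return mu, Sigma

def gaussian_density(x, mu, Sigma):
    """Evaluate N(x; mu, Sigma) at a single point x."""
    n = len(mu)
    diff = x - mu
    norm = (2 * np.pi) ** (n / 2) * np.linalg.det(Sigma) ** 0.5
    return np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff)) / norm

# Made-up 2-dimensional data for one class:
rng = np.random.default_rng(0)
X_class0 = rng.normal(size=(100, 2))
mu, Sigma = fit_gaussian(X_class0)
print(gaussian_density(np.zeros(2), mu, Sigma))
```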

(3)

Naïve Bayes classifier

▶ The central idea of the Naïve Bayes classifier is to assume that the class-conditional distributions factorize, i.e. that for each class, the joint distribution over examples is given by the product of distributions in each dimension:

$$P(x \mid y = y_0) = \prod_{i=1}^{n} P(x_i \mid y = y_0) \qquad (11)$$

▶ In other words, one estimates the distributions of each class-attribute pair separately and uses these to approximate the class-conditional distributions in the joint space.
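
In code, the factorization simply means multiplying (or summing the logarithms of) per-dimension estimates. A minimal sketch, assuming we already have one estimated conditional model per dimension for the class in question (the callables below are hypothetical placeholders):

```python
import math

def naive_bayes_likelihood(x, per_dim_models):
    """P(x | y = y0) as the product over dimensions of P(x_i | y = y0)."""
    # Each element of per_dim_models returns the (estimated) probability
    # or density of its own component x_i for the class y0.
    log_p = sum(math.log(p_i(x_i)) for p_i, x_i in zip(per_dim_models, x))
    return math.exp(log_p)

# Tiny made-up example with two binary attributes:
models = [lambda v: 0.8 if v == 1 else 0.2,
          lambda v: 0.3 if v == 1 else 0.7]
print(naive_bayes_likelihood([1, 0], models))   # 0.8 * 0.7 = 0.56
```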


(4)

Naïve Bayes classifier – example 1

▶ Gaussian class-conditional distributions:

▶ Instead of n + n(n+1)/2 parameters for each class-conditional distribution, we only use n + n = 2n (a mean and a variance for each dimension)

▶ The fit to the data is only approximate, especially when there are strong correlations between attributes within one or more classes (see the sketch after the figure below)

[Figure: contour plots of the true class-conditional distributions p(x | y = 0) and p(x | y = 1) next to their Naive Bayes approximations ∏_i p(x_i | y = 0) and ∏_i p(x_i | y = 1); note that the approximations are axis-aligned.]
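
A sketch of the Gaussian Naive Bayes fit under this assumption (the data below are made up): only a per-dimension mean and variance are estimated for each class, and the resulting axis-aligned model simply ignores the correlation built into the sample:

```python
import numpy as np

def fit_gaussian_nb(X):
    """Per-dimension mean and variance for one class: n + n = 2n parameters."""
    return X.mean(axis=0), X.var(axis=0)

def log_density_nb(x, means, variances):
    """Sum of univariate Gaussian log-densities, one per dimension."""
    return np.sum(-0.5 * np.log(2 * np.pi * variances)
                  - 0.5 * (x - means) ** 2 / variances)

# Strongly correlated made-up data for one class:
rng = np.random.default_rng(1)
X_class1 = rng.multivariate_normal([0, 0], [[1.0, 0.8], [0.8, 1.0]], size=200)
means, variances = fit_gaussian_nb(X_class1)
print(log_density_nb(np.array([0.0, 0.0]), means, variances))
```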

(5)

Naïve Bayes classifier – example 2

▶ Discrete attributes (n attributes, each with L possible values):

▶ Instead of L^n − 1 parameters for each class-conditional distribution (for the full joint distribution), we only need (L − 1)n parameters (for the n marginal distributions, each over L values); see the numerical check after the figure below

[Figure: 2×2 probability tables over the binary attributes x1 and x2. True class-conditional distributions: p(x | y = 0) with cell values 0.4, 0.0, 0.2, 0.4 and p(x | y = 1) with cell values 0.1, 0.5, 0.3, 0.1. Naive Bayes approximations built from the per-attribute marginals: ∏_i p(x_i | y = 0) with cell values 0.24, 0.16, 0.36, 0.24 and ∏_i p(x_i | y = 1) with cell values 0.24, 0.36, 0.16, 0.24.]
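
The Naive Bayes approximation of such a 2×2 table is simply the outer product of its two marginals. A small numpy check using the p(x | y = 0) numbers above (the assignment of cells to particular (x1, x2) values is an assumption made for this illustration):

```python
import numpy as np

# Assumed cell layout of the true class-conditional p(x | y = 0).
joint = np.array([[0.4, 0.0],
                  [0.2, 0.4]])

p_a = joint.sum(axis=1)      # marginal of one attribute: [0.4, 0.6]
p_b = joint.sum(axis=0)      # marginal of the other attribute: [0.6, 0.4]

nb_approx = np.outer(p_a, p_b)
print(nb_approx)             # cells 0.24, 0.16, 0.36, 0.24 as on the slide
```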


(6)

Naïve Bayes – fitting univariate class conditionals

▶ Continuous variables: densities (see Appendix C)

▶ Gaussian
▶ Exponential
▶ Student's t
▶ Histogram (i.e. discretizing)
▶ Kernel density estimation
▶ ...

▶ How to estimate the parameters? (see the example after this list)

▶ Moment matching
▶ Maximum likelihood
▶ ...
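
As a small illustration with made-up data, moment matching for two of the listed families; for the Gaussian and the exponential the moment-matching estimates coincide with the maximum-likelihood ones:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.exponential(scale=2.0, size=1000)   # values of one attribute within one class

# Gaussian: match the first two moments (also the maximum-likelihood estimates).
mu_hat, var_hat = x.mean(), x.var()

# Exponential: matching the mean gives rate = 1 / mean,
# which is also the maximum-likelihood estimate.
rate_hat = 1.0 / x.mean()

print(mu_hat, var_hat, rate_hat)
```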

(7)

Naïve Bayes – fitting univariate class conditionals

▶ Discrete variables with finite state space:

▶ We already gave a simple estimator when discussing the estimation of P(y): just normalize the counts. In this case, for a given class y = y0 and a given attribute x_i we would normalize the counts by

$$P(x_i = \alpha \mid y = y_0) = \frac{\#[x_i = \alpha \text{ and } y = y_0]}{\#[y = y_0]} \qquad (12)$$

▶ For some class-attribute pairs, some of the counts may be zero
⇒ some of the class-conditional probabilities are zero!
⇒ for some (new) data points x, we have ∀y: P(x | y) = 0, so we cannot compute P(y | x) (it is undefined!)

▶ A simple solution: Add a constant c (for example c = 1) to the numerator, and add Lc (where L is the number of distinct values of x_i) to the denominator
⇒ no more zeros, probabilities still sum to one, and this can even be justified on theoretical grounds
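
A minimal sketch of the smoothed estimator, i.e. Eq. (12) with the additive constant c (the function and argument names are hypothetical):

```python
import numpy as np

def smoothed_conditional(x_i, y, value, y0, L, c=1.0):
    """P(x_i = value | y = y0) with additive (Laplace-style) smoothing.

    x_i, y : arrays with the attribute values and the class labels
    L      : number of distinct values the attribute can take
    """
    in_class = (y == y0)
    count = np.sum(in_class & (x_i == value))
    return (count + c) / (np.sum(in_class) + L * c)
```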


(8)

▶ Example (spam filtering):

           'viagra'  'millions'  'grade'  'confidential'  ...
msg #1        0          1          1           0         ...
msg #2        0          0          1           1         ...
msg #3        0          0          1           0         ...
msg #4        0          1          0           1         ...
msg #5        0          0          0           0         ...
spam #1       1          1          0           0         ...
spam #2       1          0          0           1         ...
spam #3       0          1          0           1         ...
spam #4       0          0          0           1         ...
spam #5       0          1          0           0         ...

Simply normalizing the frequencies:

P(msg) = 5/10 = 0.5
P(spam) = 5/10 = 0.5
P(viagra | msg) = 0/5 = 0
P(millions | msg) = 2/5 = 0.4
P(grade | msg) = 3/5 = 0.6
P(confidential | msg) = 2/5 = 0.4
P(viagra | spam) = 2/5 = 0.4
P(millions | spam) = 3/5 = 0.6
P(grade | spam) = 0/5 = 0
P(confidential | spam) = 3/5 = 0.6

Compute: P(spam | "confidential: how to win millions") (next slide)

Problem: A new message “premium grade viagra available now!!!” would get 0 probability under both classes!

(9)

email = "confidential: how to win millions"

P(spam | email) = P(email | spam) P(spam) / P(email)
                = P(¬viagra | spam) P(millions | spam) P(¬grade | spam) P(confidential | spam) P(spam) / P(email)
                = 0.6 · 0.6 · 1.0 · 0.6 · 0.5 / P(email) = 0.108 / P(email)

P(msg | email) = P(email | msg) P(msg) / P(email)
               = P(¬viagra | msg) P(millions | msg) P(¬grade | msg) P(confidential | msg) P(msg) / P(email)
               = 1.0 · 0.4 · 0.4 · 0.4 · 0.5 / P(email) = 0.032 / P(email)

P(email) = 0.108 + 0.032 = 0.140

P(spam | email) = 0.108 / 0.140 ≈ 0.77
P(msg | email) = 0.032 / 0.140 ≈ 0.23
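
The same computation in Python, with the unsmoothed estimates from the previous slide copied in by hand, just to make the arithmetic explicit:

```python
# Unsmoothed estimates from the training data above.
p_spam, p_msg = 0.5, 0.5
p_word_given_spam = {"viagra": 0.4, "millions": 0.6, "grade": 0.0, "confidential": 0.6}
p_word_given_msg  = {"viagra": 0.0, "millions": 0.4, "grade": 0.6, "confidential": 0.4}

# Which of the four tracked words occur in the email.
present = {"viagra": 0, "millions": 1, "grade": 0, "confidential": 1}

def joint(p_words, p_class):
    prob = p_class
    for word, is_present in present.items():
        p = p_words[word]
        prob *= p if is_present else (1.0 - p)
    return prob

j_spam, j_msg = joint(p_word_given_spam, p_spam), joint(p_word_given_msg, p_msg)
evidence = j_spam + j_msg                        # P(email) = 0.140
print(j_spam / evidence, j_msg / evidence)       # approx. 0.77 and 0.23
```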


(10)

▶ Example (spam filtering), continued, using the same training data as above:

Instead, let's add c = 1 to all counts:

P(msg) = (5 + 1)/12 = 0.5
P(spam) = (5 + 1)/12 = 0.5
P(viagra | msg) = (0 + 1)/7 = 1/7
P(millions | msg) = (2 + 1)/7 = 3/7
P(grade | msg) = (3 + 1)/7 = 4/7
P(confidential | msg) = (2 + 1)/7 = 3/7
P(viagra | spam) = (2 + 1)/7 = 3/7
P(millions | spam) = (3 + 1)/7 = 4/7
P(grade | spam) = (0 + 1)/7 = 1/7
P(confidential | spam) = (3 + 1)/7 = 4/7

Note: For the conditional probabilities we add 2 to the denominator because each variable is binary (has 2 states); for the class priors we add 2 because there are 2 classes. Now all probabilities are > 0.
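
The smoothed numbers can be checked in a few lines (counts copied from the table above; each attribute is binary, so L = 2):

```python
counts_msg  = {"viagra": 0, "millions": 2, "grade": 3, "confidential": 2}
counts_spam = {"viagra": 2, "millions": 3, "grade": 0, "confidential": 3}
c, L, n_per_class = 1, 2, 5

for word in counts_msg:
    p_w_msg  = (counts_msg[word] + c) / (n_per_class + L * c)
    p_w_spam = (counts_spam[word] + c) / (n_per_class + L * c)
    print(word, round(p_w_msg, 3), round(p_w_spam, 3))   # all strictly > 0
```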

(11)

Practical issues with Naïve Bayes

▶ In high dimensions, the probabilities/densities P(x | y_j) tend to become very close to zero for all classes y_j. The same applies to the marginals P(x).
⇒ Numerical problems in comparing the P(x | y_j) with each other, and in computing P(x | y_j) P(y_j) / P(x) (floating-point underflow, 0/0 problems)

▶ Solution: Use logarithms and add a suitable constant before exponentiating, as follows (a sketch in code follows below):

1. L(y_j | x) = log(P(x | y_j) P(y_j)) = Σ_i log(P(x_i | y_j)) + log(P(y_j))
2. L(y_j | x) = L(y_j | x) − max_j L(y_j | x)
3. Pt(y_j | x) = exp(L(y_j | x))
4. P(y_j | x) = Pt(y_j | x) / Σ_j Pt(y_j | x)
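
The four steps map directly onto numpy. A sketch with assumed inputs: a matrix of per-dimension log class-conditional probabilities log P(x_i | y_j) and a vector of log priors log P(y_j):

```python
import numpy as np

def posterior_from_logs(log_p_xi_given_y, log_prior):
    """Steps 1-4: compute P(y_j | x) stably in log space.

    log_p_xi_given_y : shape (n_classes, n_dims), entries log P(x_i | y_j)
    log_prior        : shape (n_classes,), entries log P(y_j)
    """
    L = log_p_xi_given_y.sum(axis=1) + log_prior   # step 1
    L = L - L.max()                                # step 2: avoid underflow
    Pt = np.exp(L)                                 # step 3
    return Pt / Pt.sum()                           # step 4

# Tiny made-up example: two classes, three attributes.
log_lik = np.log(np.array([[0.6, 0.6, 0.6],
                           [0.4, 0.4, 0.4]]))
log_prior = np.log(np.array([0.5, 0.5]))
print(posterior_from_logs(log_lik, log_prior))
```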


(12)

Generative vs discriminative

▶ Naïve Bayes is a representative of what we call a generative approach to classification...

▶ From the model, it is actually possible to synthesize (generate) new examples: first draw y ∼ P(y), then draw x ∼ P(x | y) (see the sketch at the end of this slide)

▶ ...as opposed to a discriminative approach, which only models P(y | x), or even more simply just the function ŷ = f(x)

▶ Note: The generative approach is, in a sense, solving a more difficult problem than is asked for.
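
A sketch of the generation step mentioned above, for a made-up Gaussian Naive Bayes model over two attributes (all parameter values are invented for the illustration):

```python
import numpy as np

rng = np.random.default_rng(3)

prior = {"msg": 0.5, "spam": 0.5}                                    # P(y)
means = {"msg": np.array([0.0, 0.0]), "spam": np.array([2.0, 2.0])}  # per-class means
stds  = {"msg": np.array([1.0, 1.0]), "spam": np.array([0.5, 0.5])}  # per-class std devs

def sample_example():
    # First draw y ~ P(y), then draw each x_i ~ P(x_i | y).
    y = rng.choice(list(prior), p=list(prior.values()))
    x = rng.normal(means[y], stds[y])
    return x, y

print(sample_example())
```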
