Estimating class-conditional distributions P(x | y)
(1)

Estimating class prevalences P(y)

▶ How to estimate P(y)?

▶ Simplest estimate: For each class, just divide the empirical frequency by the total number of examples:

$$P(y = \alpha) = \frac{\#[y = \alpha]}{N} \qquad (10)$$

For example: The proportion of females in a computer science class at our department is estimated by the number of females in this class divided by the size of the class. (This is an unbiased estimate if the students taking the course are a random sample of the students at the department... this may or may not be true.)

▶ In the i.i.d. setting this estimate is consistent, and it works reasonably well when there is a sufficient number of examples of each class
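
A minimal sketch (not from the slides; the labels below are made up) of computing these empirical class prevalences with numpy:

```python
import numpy as np

# Hypothetical label vector: one class label per training example.
y = np.array(["msg", "spam", "msg", "msg", "spam"])

classes, counts = np.unique(y, return_counts=True)
priors = counts / counts.sum()          # P(y = alpha) = #[y = alpha] / N

for label, p in zip(classes, priors):
    print(f"P(y = {label}) = {p:.2f}")
```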


(2)

Estimating class-conditional distributions P(x | y)

▶ How to estimate P(x | y) for each value of y?

▶ For high-dimensional x this is an extremely non-trivial problem

▶ Discrete x:

Dividing empirical frequencies by the total count for every element of X does not work in practice in most cases (too many states, too few data points ⇒ unreliable, many zeros)

▶ Continuous-valued x:

Multivariate normal (Gaussian) distribution (n dimensions):

$$\mathcal{N}(x; \mu, \Sigma) = \frac{1}{(2\pi)^{n/2}\,|\Sigma|^{1/2}} \exp\!\left(-\frac{1}{2}(x - \mu)^{T}\Sigma^{-1}(x - \mu)\right)$$

'Only' requires estimates of the mean µ and the covariance matrix Σ, so the total number of parameters is n + n(n+1)/2.

For low-dimensional data this may be ok, but even this is unreliable for high-dimensional data.
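
As a rough sketch (data and function names invented for the illustration), fitting one full-covariance Gaussian per class and evaluating its density requires exactly the mean and covariance mentioned above:

```python
import numpy as np

def fit_gaussian(X):
    """Estimate the mean and covariance of one class-conditional Gaussian."""
    mu = X.mean(axis=0)               # n parameters
    Sigma = np.cov(X, rowvar=False)   # n(n+1)/2 free parameters (symmetric matrix)
    return mu, Sigma

def gaussian_density(x, mu, Sigma):
    """Evaluate N(x; mu, Sigma) at a single point x."""
    n = len(mu)
    diff = x - mu
    norm = (2 * np.pi) ** (n / 2) * np.linalg.det(Sigma) ** 0.5
    return np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff)) / norm

# Made-up 2-dimensional data for one class:
rng = np.random.default_rng(0)
X_class0 = rng.normal(size=(100, 2))
mu, Sigma = fit_gaussian(X_class0)
print(gaussian_density(np.zeros(2), mu, Sigma))
```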

(3)

Naïve Bayes classifier

▶ The central idea of the Naïve Bayes classifier is to assume that the class-conditional distributions factorize, i.e. that for each class, the joint distribution over examples is given by the product of distributions in each dimension:

$$P(x \mid y = y_0) = \prod_{i=1}^{n} P(x_i \mid y = y_0) \qquad (11)$$

▶ In other words, one estimates the distributions of each class-attribute pair separately and uses these to approximate the class-conditional distributions in the joint space.
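
In code, the factorization simply means multiplying (or summing the logarithms of) per-dimension estimates. A minimal sketch, assuming we already have one estimated conditional model per dimension for the class in question (the callables below are hypothetical placeholders):

```python
import math

def naive_bayes_likelihood(x, per_dim_models):
    """P(x | y = y0) as the product over dimensions of P(x_i | y = y0)."""
    # Each element of per_dim_models returns the (estimated) probability
    # or density of its own component x_i for the class y0.
    log_p = sum(math.log(p_i(x_i)) for p_i, x_i in zip(per_dim_models, x))
    return math.exp(log_p)

# Tiny made-up example with two binary attributes:
models = [lambda v: 0.8 if v == 1 else 0.2,
          lambda v: 0.3 if v == 1 else 0.7]
print(naive_bayes_likelihood([1, 0], models))   # 0.8 * 0.7 = 0.56
```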


(4)

Naïve Bayes classifier – example 1

▶ Gaussian class-conditional distributions:

▶ Instead of n + n(n+1)/2 parameters for each class-conditional distribution, we only use n + n = 2n (a mean and a variance for each dimension)

▶ The fit to the data is only approximate, especially when there are strong correlations between attributes within one or more classes (see the sketch after the figure below)

[Figure: contour plots of the true class-conditional distributions p(x | y = 0) and p(x | y = 1) next to their Naive Bayes approximations ∏_i p(x_i | y = 0) and ∏_i p(x_i | y = 1); note that the approximations are axis-aligned.]
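
A sketch of the Gaussian Naive Bayes fit under this assumption (the data below are made up): only a per-dimension mean and variance are estimated for each class, and the resulting axis-aligned model simply ignores the correlation built into the sample:

```python
import numpy as np

def fit_gaussian_nb(X):
    """Per-dimension mean and variance for one class: n + n = 2n parameters."""
    return X.mean(axis=0), X.var(axis=0)

def log_density_nb(x, means, variances):
    """Sum of univariate Gaussian log-densities, one per dimension."""
    return np.sum(-0.5 * np.log(2 * np.pi * variances)
                  - 0.5 * (x - means) ** 2 / variances)

# Strongly correlated made-up data for one class:
rng = np.random.default_rng(1)
X_class1 = rng.multivariate_normal([0, 0], [[1.0, 0.8], [0.8, 1.0]], size=200)
means, variances = fit_gaussian_nb(X_class1)
print(log_density_nb(np.array([0.0, 0.0]), means, variances))
```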

(5)

Naïve Bayes classifier – example 2

▶ Discrete attributes (n attributes, each with L possible values):

▶ Instead of L^n − 1 parameters for each class-conditional distribution (for the full joint distribution), we only need (L − 1)n parameters (for the n marginal distributions, each over L values); see the numerical check after the figure below

[Figure: 2×2 probability tables over the binary attributes x1 and x2. True class-conditional distributions: p(x | y = 0) with cell values 0.4, 0.0, 0.2, 0.4 and p(x | y = 1) with cell values 0.1, 0.5, 0.3, 0.1. Naive Bayes approximations built from the per-attribute marginals: ∏_i p(x_i | y = 0) with cell values 0.24, 0.16, 0.36, 0.24 and ∏_i p(x_i | y = 1) with cell values 0.24, 0.36, 0.16, 0.24.]
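
The Naive Bayes approximation of such a 2×2 table is simply the outer product of its two marginals. A small numpy check using the p(x | y = 0) numbers above (the assignment of cells to particular (x1, x2) values is an assumption made for this illustration):

```python
import numpy as np

# Assumed cell layout of the true class-conditional p(x | y = 0).
joint = np.array([[0.4, 0.0],
                  [0.2, 0.4]])

p_a = joint.sum(axis=1)      # marginal of one attribute: [0.4, 0.6]
p_b = joint.sum(axis=0)      # marginal of the other attribute: [0.6, 0.4]

nb_approx = np.outer(p_a, p_b)
print(nb_approx)             # cells 0.24, 0.16, 0.36, 0.24 as on the slide
```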


(6)

Naïve Bayes – fitting univariate class conditionals

▶ Continuous variables: densities (see Appendix C)

▶ Gaussian
▶ Exponential
▶ Student's t
▶ Histogram (i.e. discretizing)
▶ Kernel density estimation
▶ ...

▶ How to estimate the parameters? (see the example after this list)

▶ Moment matching
▶ Maximum likelihood
▶ ...
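
As a small illustration with made-up data, moment matching for two of the listed families; for the Gaussian and the exponential the moment-matching estimates coincide with the maximum-likelihood ones:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.exponential(scale=2.0, size=1000)   # values of one attribute within one class

# Gaussian: match the first two moments (also the maximum-likelihood estimates).
mu_hat, var_hat = x.mean(), x.var()

# Exponential: matching the mean gives rate = 1 / mean,
# which is also the maximum-likelihood estimate.
rate_hat = 1.0 / x.mean()

print(mu_hat, var_hat, rate_hat)
```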

(7)

Naïve Bayes – fitting univariate class conditionals

▶ Discrete variables with finite state space:

▶ We already gave a simple estimator when discussing the estimation of P(y): just normalize the counts. In this case, for a given class y = y0 and a given attribute x_i we would normalize the counts by

$$P(x_i = \alpha \mid y = y_0) = \frac{\#[x_i = \alpha \text{ and } y = y_0]}{\#[y = y_0]} \qquad (12)$$

▶ For some class-attribute pairs, some of the counts may be zero
⇒ some of the class-conditional probabilities are zero!
⇒ for some (new) data points x, we have ∀y: P(x | y) = 0, so we cannot compute P(y | x) (it is undefined!)

▶ A simple solution: Add a constant c (for example c = 1) to the numerator, and add Lc (where L is the number of distinct values of x_i) to the denominator
⇒ no more zeros, probabilities still sum to one, and this can even be justified on theoretical grounds
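
A minimal sketch of the smoothed estimator, i.e. Eq. (12) with the additive constant c (the function and argument names are hypothetical):

```python
import numpy as np

def smoothed_conditional(x_i, y, value, y0, L, c=1.0):
    """P(x_i = value | y = y0) with additive (Laplace-style) smoothing.

    x_i, y : arrays with the attribute values and the class labels
    L      : number of distinct values the attribute can take
    """
    in_class = (y == y0)
    count = np.sum(in_class & (x_i == value))
    return (count + c) / (np.sum(in_class) + L * c)
```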


(8)

▶ Example (spam filtering):

           'viagra'  'millions'  'grade'  'confidential'  ...
msg #1        0          1          1           0         ...
msg #2        0          0          1           1         ...
msg #3        0          0          1           0         ...
msg #4        0          1          0           1         ...
msg #5        0          0          0           0         ...
spam #1       1          1          0           0         ...
spam #2       1          0          0           1         ...
spam #3       0          1          0           1         ...
spam #4       0          0          0           1         ...
spam #5       0          1          0           0         ...

Simply normalizing the frequencies:

P(msg) = 5/10 = 0.5
P(spam) = 5/10 = 0.5
P(viagra | msg) = 0/5 = 0
P(millions | msg) = 2/5 = 0.4
P(grade | msg) = 3/5 = 0.6
P(confidential | msg) = 2/5 = 0.4
P(viagra | spam) = 2/5 = 0.4
P(millions | spam) = 3/5 = 0.6
P(grade | spam) = 0/5 = 0
P(confidential | spam) = 3/5 = 0.6

Compute: P(spam | "confidential: how to win millions") (next slide)

Problem: A new message “premium grade viagra available now!!!” would get 0 probability under both classes!

(9)

email = "confidential: how to win millions"

P(spam | email) = P(email | spam) P(spam) / P(email)
                = P(¬viagra | spam) P(millions | spam) P(¬grade | spam) P(confidential | spam) P(spam) / P(email)
                = 0.6 · 0.6 · 1.0 · 0.6 · 0.5 / P(email) = 0.108 / P(email)

P(msg | email) = P(email | msg) P(msg) / P(email)
               = P(¬viagra | msg) P(millions | msg) P(¬grade | msg) P(confidential | msg) P(msg) / P(email)
               = 1.0 · 0.4 · 0.4 · 0.4 · 0.5 / P(email) = 0.032 / P(email)

P(email) = 0.108 + 0.032 = 0.140

P(spam | email) = 0.108 / 0.140 ≈ 0.77
P(msg | email) = 0.032 / 0.140 ≈ 0.23
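
The same computation in Python, with the unsmoothed estimates from the previous slide copied in by hand, just to make the arithmetic explicit:

```python
# Unsmoothed estimates from the training data above.
p_spam, p_msg = 0.5, 0.5
p_word_given_spam = {"viagra": 0.4, "millions": 0.6, "grade": 0.0, "confidential": 0.6}
p_word_given_msg  = {"viagra": 0.0, "millions": 0.4, "grade": 0.6, "confidential": 0.4}

# Which of the four tracked words occur in the email.
present = {"viagra": 0, "millions": 1, "grade": 0, "confidential": 1}

def joint(p_words, p_class):
    prob = p_class
    for word, is_present in present.items():
        p = p_words[word]
        prob *= p if is_present else (1.0 - p)
    return prob

j_spam, j_msg = joint(p_word_given_spam, p_spam), joint(p_word_given_msg, p_msg)
evidence = j_spam + j_msg                        # P(email) = 0.140
print(j_spam / evidence, j_msg / evidence)       # approx. 0.77 and 0.23
```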


(10)

▶ Example (spam filtering), continued, using the same training data as above:

Instead, let's add c = 1 to all counts:

P(msg) = (5 + 1)/12 = 0.5
P(spam) = (5 + 1)/12 = 0.5
P(viagra | msg) = (0 + 1)/7 = 1/7
P(millions | msg) = (2 + 1)/7 = 3/7
P(grade | msg) = (3 + 1)/7 = 4/7
P(confidential | msg) = (2 + 1)/7 = 3/7
P(viagra | spam) = (2 + 1)/7 = 3/7
P(millions | spam) = (3 + 1)/7 = 4/7
P(grade | spam) = (0 + 1)/7 = 1/7
P(confidential | spam) = (3 + 1)/7 = 4/7

Note: For the conditional probabilities we add 2 to the denominator because each variable is binary (has 2 states); for the class priors we add 2 because there are 2 classes. Now all probabilities are > 0.
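
The smoothed numbers can be checked in a few lines (counts copied from the table above; each attribute is binary, so L = 2):

```python
counts_msg  = {"viagra": 0, "millions": 2, "grade": 3, "confidential": 2}
counts_spam = {"viagra": 2, "millions": 3, "grade": 0, "confidential": 3}
c, L, n_per_class = 1, 2, 5

for word in counts_msg:
    p_w_msg  = (counts_msg[word] + c) / (n_per_class + L * c)
    p_w_spam = (counts_spam[word] + c) / (n_per_class + L * c)
    print(word, round(p_w_msg, 3), round(p_w_spam, 3))   # all strictly > 0
```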

(11)

Practical issues with Naïve Bayes

▶ In high dimensions, the probabilities/densities P(x | y_j) tend to become very close to zero for all classes y_j. The same applies to the marginals P(x).
⇒ Numerical problems in comparing the P(x | y_j) with each other, and in computing P(x | y_j) P(y_j) / P(x) (floating-point underflow, 0/0 problems)

▶ Solution: Use logarithms and add a suitable constant before exponentiating, as follows (a sketch in code follows below):

1. L(y_j | x) = log(P(x | y_j) P(y_j)) = Σ_i log(P(x_i | y_j)) + log(P(y_j))
2. L(y_j | x) = L(y_j | x) − max_j L(y_j | x)
3. Pt(y_j | x) = exp(L(y_j | x))
4. P(y_j | x) = Pt(y_j | x) / Σ_j Pt(y_j | x)
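
The four steps map directly onto numpy. A sketch with assumed inputs: a matrix of per-dimension log class-conditional probabilities log P(x_i | y_j) and a vector of log priors log P(y_j):

```python
import numpy as np

def posterior_from_logs(log_p_xi_given_y, log_prior):
    """Steps 1-4: compute P(y_j | x) stably in log space.

    log_p_xi_given_y : shape (n_classes, n_dims), entries log P(x_i | y_j)
    log_prior        : shape (n_classes,), entries log P(y_j)
    """
    L = log_p_xi_given_y.sum(axis=1) + log_prior   # step 1
    L = L - L.max()                                # step 2: avoid underflow
    Pt = np.exp(L)                                 # step 3
    return Pt / Pt.sum()                           # step 4

# Tiny made-up example: two classes, three attributes.
log_lik = np.log(np.array([[0.6, 0.6, 0.6],
                           [0.4, 0.4, 0.4]]))
log_prior = np.log(np.array([0.5, 0.5]))
print(posterior_from_logs(log_lik, log_prior))
```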


(12)

Generative vs discriminative

▶ Naïve Bayes is a representative of what we call a generative approach to classification...

▶ From the model, it is actually possible to synthesize (generate) new examples: first draw y ∼ P(y), then draw x ∼ P(x | y) (see the sketch at the end of this slide)

▶ ...as opposed to a discriminative approach, which only models P(y | x), or even more simply just the function ŷ = f(x)

▶ Note: The generative approach is, in a sense, solving a more difficult problem than is asked for.
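
A sketch of the generation step mentioned above, for a made-up Gaussian Naive Bayes model over two attributes (all parameter values are invented for the illustration):

```python
import numpy as np

rng = np.random.default_rng(3)

prior = {"msg": 0.5, "spam": 0.5}                                    # P(y)
means = {"msg": np.array([0.0, 0.0]), "spam": np.array([2.0, 2.0])}  # per-class means
stds  = {"msg": np.array([1.0, 1.0]), "spam": np.array([0.5, 0.5])}  # per-class std devs

def sample_example():
    # First draw y ~ P(y), then draw each x_i ~ P(x_i | y).
    y = rng.choice(list(prior), p=list(prior.values()))
    x = rng.normal(means[y], stds[y])
    return x, y

print(sample_example())
```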
