Autumn 2012, Lecture 2: Mathematical Preliminaries, Jyrki Kivinen, Information-Theoretic Modeling

Information-Theoretic Modeling

Lecture 2: Mathematical Preliminaries

Jyrki Kivinen

Department of Computer Science, University of Helsinki

Autumn 2012

1 Calculus
    Limits and Convergence
    Convexity

2 Probability
    Probability Space and Random Variables
    Joint and Conditional Distributions
    Expectation
    Law of Large Numbers

3 Inequalities
    Jensen’s Inequality
    Gibbs’s Inequality

Exponent Function

[Figure: plot of exp x]

Exponent function $\exp : \mathbb{R} \to \mathbb{R}^+$, $\exp k = e^k = \underbrace{e \times e \times \cdots \times e}_{k}$:

multiplicative growth (nuclear reaction, “interest on interest”, ...)

$\exp x \cdot \exp y = \exp(x + y)$

Derivative: $\frac{d \exp x}{dx} = \exp x$.
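The two properties of exp can be checked numerically. The following is a minimal sketch, not part of the original slides: the helper `numeric_derivative` and the sample points `x`, `y` are illustrative choices, and the derivative is only approximated by a central difference.

```python
import math

def numeric_derivative(f, x, h=1e-6):
    """Central-difference approximation of f'(x) (illustrative helper)."""
    return (f(x + h) - f(x - h)) / (2 * h)

x, y = 1.3, -0.7                           # arbitrary sample points

product = math.exp(x) * math.exp(y)        # exp x * exp y
sum_form = math.exp(x + y)                 # exp(x + y)
slope = numeric_derivative(math.exp, x)    # approximates d exp x / dx

print("exp x * exp y =", product, "  exp(x + y) =", sum_form)
print("numeric slope =", slope, "  exp x =", math.exp(x))
```

Both pairs of printed numbers should agree up to floating-point and discretization error.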

Examples: Logarithm

[Figure: plots of exp x and ln x]

Natural logarithm $\ln : \mathbb{R}^+ \to \mathbb{R}$, $\ln \exp x = x$:

time to grow to $x$, number of digits ($\log_{10}$).

General (base $a$) logarithm, $\log_a a^x = x$: $\log_a x = \frac{1}{\ln a} \ln x$

Logarithm Function

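[Figure: plots of exp x, ln x and x − 1]

$\ln xy = \ln x + \ln y$

$\ln x^r = r \ln x$

$\ln \frac{1}{x} = -\ln x$

$\ln \frac{x}{y} = \ln x - \ln y$

$\ln x \le x - 1$ with equality if and only if $x = 1$
(NB: doesn’t work with $\log_a x$ if $a \ne e$)

$\frac{d \ln x}{dx} = \frac{1}{x}$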
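The bound ln x ≤ x − 1, and the warning that it fails for other bases, can be illustrated on a few sample points. This is a sketch under assumptions: `log_base` is a hypothetical helper built from the change-of-base formula, and the grid `xs` is arbitrary.

```python
import math

def log_base(a, x):
    """Base-a logarithm via log_a x = ln x / ln a (illustrative helper)."""
    return math.log(x) / math.log(a)

xs = [0.01, 0.5, 1.0, 1.5, 2.0, 10.0]                 # arbitrary positive sample points

# Gap (x - 1) - ln x: non-negative everywhere, zero only at x = 1.
gaps = [(x - 1) - math.log(x) for x in xs]

# The same gap with log base 2 becomes negative for some x (here x = 1.5).
gaps_base2 = [(x - 1) - log_base(2, x) for x in xs]

for x, g, g2 in zip(xs, gaps, gaps_base2):
    print(f"x = {x:5.2f}   (x-1) - ln x = {g:8.4f}   (x-1) - log2 x = {g2:8.4f}")
```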

Limits and Convergence

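A sequence of values $(x_i : i \in \mathbb{N})$ converges to limit $L$, $\lim_{i \to \infty} x_i = L$, iff for any $\varepsilon > 0$ there exists a number $N \in \mathbb{N}$ such that

$|x_i - L| < \varepsilon$ for all $i \ge N$.

$f(x)$ has a limit $L$ as $x$ approaches $c$, $\lim_{x \to c} f(x) = L$ (from above $c^+$ / below $c^-$), iff for any $\varepsilon > 0$ there exists a number $\delta > 0$ such that

$|f(x) - L| < \varepsilon$ for all $x$ with

$c < x < c + \delta$ (limit from above, $x \to c^+$),
$c - \delta < x < c$ (limit from below, $x \to c^-$),
$0 < |x - c| < \delta$ (two-sided limit).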
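As a concrete instance of the sequence definition, take $x_i = 1/i$ with limit $L = 0$ (an example chosen here, not taken from the slides). For a given ε the sketch below returns a suitable N; the function name `tail_index` is hypothetical, and floating-point rounding of 1/ε may shift the result by one for some inputs.

```python
import math

def tail_index(eps):
    """For x_i = 1/i and L = 0: smallest N with |x_i - L| < eps for all i >= N."""
    if eps <= 0:
        raise ValueError("eps must be positive")
    # 1/i < eps  <=>  i > 1/eps, so take the first integer above 1/eps.
    return math.floor(1 / eps) + 1

for eps in (0.5, 0.1):
    N = tail_index(eps)
    print(f"eps = {eps}: |1/i| < eps for all i >= {N}")
```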

Example: Logarithm Again

[Figure: plot of x ln x for x between 0 and 1]

Even though $x \ln x$ is undefined at $x = 0$, we have (by l’Hôpital’s rule):

$\lim_{x \to 0^+} x \ln x = 0$.
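One way to spell out the l’Hôpital step (a worked sketch; the slide only states the result): rewrite the product as a quotient of the form $-\infty / \infty$ and differentiate numerator and denominator.

```latex
\lim_{x \to 0^+} x \ln x
  = \lim_{x \to 0^+} \frac{\ln x}{1/x}
  = \lim_{x \to 0^+} \frac{1/x}{-1/x^2}
  = \lim_{x \to 0^+} (-x)
  = 0 .
```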

Convexity

Function $f : \mathcal{X} \to \mathbb{R}$ is said to be convex iff for any $x, y \in \mathcal{X}$ and any $0 \le t \le 1$ we have

$f(tx + (1-t)y) \le t f(x) + (1-t) f(y)$.

[Figure: plot of exp x]

Function $f$ is strictly convex iff the above inequality holds strictly (‘<’ instead of ‘≤’) when $0 < t < 1$ and $x \ne y$.

Function $f$ is (strictly) concave iff the above holds for $-f$.
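The defining inequality can be tested on a finite grid. This is only a sketch: a grid check can expose a violation but cannot prove convexity, and the helper `violates_convexity`, the grid and the tolerance are all assumptions made for illustration.

```python
import itertools
import math

def violates_convexity(f, points, ts, tol=1e-12):
    """True if some (x, y, t) on the grid breaks f(tx+(1-t)y) <= t f(x) + (1-t) f(y)."""
    for x, y in itertools.product(points, repeat=2):
        for t in ts:
            lhs = f(t * x + (1 - t) * y)
            rhs = t * f(x) + (1 - t) * f(y)
            if lhs > rhs + tol:        # tolerance absorbs rounding error
                return True
    return False

points = [-4 + 0.5 * i for i in range(17)]   # grid on [-4, 4]
ts = [i / 10 for i in range(11)]             # t = 0, 0.1, ..., 1

exp_violates = violates_convexity(math.exp, points, ts)   # exp is convex
sin_violates = violates_convexity(math.sin, points, ts)   # sin is not convex on [-4, 4]
print(exp_violates, sin_violates)
```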

Convexity and Derivatives

Theorem

If function $f$ has a second derivative $f''$, and $f''$ is non-negative ($\ge 0$) for all $x$, then $f$ is convex. If $f''$ is positive ($> 0$) for all $x$, then $f$ is strictly convex.

[Figure: plot of exp x]

e^x is conve^x!

Example: $f'(x) = \frac{d \exp x}{dx} = \exp x \Rightarrow f''(x) = \exp x > 0$. Hence exp is strictly convex.
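The same test can be applied to the logarithm (a worked sketch added here, restricted to the domain $x > 0$): take $f(x) = -\ln x$ and use $\frac{d \ln x}{dx} = \frac{1}{x}$.

```latex
f(x) = -\ln x, \qquad
f'(x) = -\frac{1}{x}, \qquad
f''(x) = \frac{1}{x^2} > 0 \quad \text{for all } x > 0 .
```

Hence $-\ln$ is strictly convex on $\mathbb{R}^+$, which by the definition of concavity means that $\ln$ is strictly concave.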

Probability

A.N. Kolmogorov, 1903–1987

Probability Space

A probability space $(\Omega, \mathcal{F}, P)$ is defined by

the sample space $\Omega$, whose elements are called outcomes $\omega$,
a sigma algebra $\mathcal{F}$ of subsets of $\Omega$, whose elements are called events $E$, and
a measure $P$ which determines the probabilities of events, $P : \mathcal{F} \to [0,1]$.

Measure $P$ has to satisfy the probability axioms: $P(E) \ge 0$ for all $E \in \mathcal{F}$, $P(\Omega) = 1$, and $P(E_1 \cup E_2 \cup \ldots) = \sum_i P(E_i)$ if $(E_i)$ is a countable sequence of disjoint events.

These axioms imply the usual rules of probability calculus, e.g., $P(A \cup B) = P(A) + P(B) - P(A \cap B)$, $P(\Omega \setminus E) = 1 - P(E)$, etc.
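For a finite sample space the definition can be written out directly. The sketch below assumes a fair six-sided die with the uniform measure and uses the power set as the sigma algebra; the names `omega`, `power_set` and `P` are illustrative choices, not notation from the slides.

```python
from fractions import Fraction
from itertools import chain, combinations

omega = frozenset(range(1, 7))            # sample space: faces of a die

def power_set(s):
    """All subsets of s; for finite s this is a sigma algebra."""
    items = list(s)
    subsets = chain.from_iterable(combinations(items, r) for r in range(len(items) + 1))
    return [frozenset(c) for c in subsets]

def P(event):
    """Uniform measure: each outcome has probability 1/6."""
    return Fraction(len(event), len(omega))

events = power_set(omega)
A = frozenset({2, 4, 6})                  # "even"
B = frozenset({4, 5, 6})                  # "greater than 3"

print(P(A | B), P(A) + P(B) - P(A & B))   # inclusion-exclusion
print(P(omega - A), 1 - P(A))             # complement rule
```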

Venn Diagrams

[Figure: Venn diagrams of events A and B in the sample space Ω]

Probability Calculus

1 The conditional probability of event $B$ given that event $A$ occurs is defined as

$P(B \mid A) = \frac{P(A \cap B)}{P(A)}$ for $A$ such that $P(A) > 0$.

2 $P(A \cap B) = P(A) \cdot P(B \mid A) = P(B) \cdot P(A \mid B)$.

3 Bayes’ rule: $P(B \mid A) = \frac{P(A \mid B) \cdot P(B)}{P(A)}$.

4 Chain rule:

$P(\cap_{i=1}^{N} E_i) = \prod_{i=1}^{N} P(E_i \mid \cap_{j=1}^{i-1} E_j)$
$= P(E_1) \cdot P(E_2 \mid E_1) \cdot P(E_3 \mid E_1 \cap E_2) \cdots P(E_N \mid E_1 \cap \ldots \cap E_{N-1})$.
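Bayes’ rule and the chain rule can be verified on a small example. The sketch assumes a fair die with the uniform measure; the events `A`, `B` and the list `E` are arbitrary illustrative choices, and `prob` and `cond` are hypothetical helper names.

```python
from fractions import Fraction

OMEGA = frozenset(range(1, 7))            # fair die, uniform measure (assumption)

def prob(event):
    return Fraction(len(event & OMEGA), len(OMEGA))

def cond(b, a):
    """P(b | a) = P(a ∩ b) / P(a), defined only when P(a) > 0."""
    if prob(a) == 0:
        raise ValueError("conditioning event has probability zero")
    return prob(a & b) / prob(a)

A = frozenset({2, 4, 6})
B = frozenset({4, 5, 6})
bayes = cond(A, B) * prob(B) / prob(A)    # should equal P(B | A)

# Chain rule: multiply P(E_i | E_1 ∩ ... ∩ E_{i-1}) for i = 1, 2, 3.
E = [frozenset({2, 4, 6}), frozenset({4, 5, 6}), frozenset({1, 2, 3, 4})]
chain = Fraction(1)
seen = OMEGA                              # empty intersection is the whole space
for e in E:
    chain *= cond(e, seen)
    seen = seen & e

print(bayes, cond(B, A))
print(chain, prob(seen))
```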

Random Variables

Technically, a random variable is a (measurable) function $X : \Omega \to \mathbb{R}$ from the sample space to the reals.

The probability measure $P$ on $\Omega$ determines the distribution of $X$:

$P_X(A) = \Pr[X \in A] = P(\{\omega : X(\omega) \in A\})$,

where $A \subseteq \mathbb{R}$.

It is often more natural to relabel the outcomes and denote them, for instance, by letters, A, B, C, ..., or words red, black, ...

In practice, we often forget about the underlying probability space $\Omega$, and just speak of random variable $X$ and its distribution $P_X$.

The distribution of a random variable can always be represented as a cumulative distribution function (cdf) $F_X(x) = \Pr[X \le x]$.

In addition:

A discrete random variable $X$ with countable alphabet $\mathcal{X}$ has a probability mass function (pmf) $p_X$ such that $\Pr[X = x] = p_X(x)$.

A continuous random variable $Y$ has a probability density function (pdf) $f_Y$ such that $\Pr[Y \in A] = \int_A f_Y(y)\,dy$.

There are also mixed random variables that are neither discrete nor continuous. They don’t have a pmf or pdf, but they do have a cdf.

We often omit the subscripts $X, Y, \ldots$ and write $p(x)$, $f(y)$, etc.

Since random variables are functions, we can define more random variables as functions of random variables: if $f$ is a function, and $X$ and $Y$ are r.v.’s, then $f(X) : \Omega \to \mathbb{R}$ is a r.v., $X + Y$ is a r.v., etc.

Example: Let r.v. $X$ be the outcome of a die.

The pmf of $X$ is given by $p_X(x) = 1/6$ for all $x \in \{1,2,3,4,5,6\}$.

The pmf of r.v. $X^2$ is given by $p_{X^2}(x) = 1/6$ for all $x \in \{1,4,9,16,25,36\}$.

In particular, a pmf $p_X$ is a function, and hence, $p_X(X)$ is also a random variable. Further, $p_X^2(X)$, $\ln p_X(X)$, etc. are random variables.
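The die example can be reproduced by pushing the pmf of X through a function. This is a minimal sketch for discrete variables only; `pushforward` is a hypothetical helper name, and the parity example is added here purely for illustration.

```python
from collections import defaultdict
from fractions import Fraction

p_X = {x: Fraction(1, 6) for x in range(1, 7)}     # pmf of a fair die

def pushforward(pmf, g):
    """pmf of g(X): add up the mass of all x that g maps to the same value."""
    out = defaultdict(Fraction)
    for x, p in pmf.items():
        out[g(x)] += p
    return dict(out)

p_X_squared = pushforward(p_X, lambda x: x * x)    # pmf of X^2
p_parity = pushforward(p_X, lambda x: x % 2)       # pmf of X mod 2 (not one-to-one)
p_of_p = pushforward(p_X, lambda x: p_X[x])        # pmf of the random variable p_X(X)

print(p_X_squared)
print(p_parity)
print(p_of_p)
```

For the fair die, p_X(X) takes the single value 1/6 with probability 1, so it is a constant random variable.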

Multivariate Distributions

The probabilistic behavior of two or more random variables is described by multivariate distributions.

The joint distribution of r.v.’s $X$ and $Y$ is

$P_{X,Y}(A, B) = \Pr[X \in A \wedge Y \in B] = P(\{\omega : X(\omega) \in A, Y(\omega) \in B\})$.

For each multivariate distribution $P_{X,Y}$, there are unique marginal distributions $P_X$ and $P_Y$ such that

$P_X(A) = P_{X,Y}(A, \mathbb{R})$, $P_Y(B) = P_{X,Y}(\mathbb{R}, B)$,

pmf: $p_Y(y) = \sum_{x \in \mathcal{X}} p_{X,Y}(x, y)$
pdf: $f_Y(y) = \int_{\mathbb{R}} f_{X,Y}(x, y)\,dx$.
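In the discrete case, marginalization is a sum over the other coordinate. The joint pmf below is made up for illustration (it does not come from the slides), and `marginal` is a hypothetical helper.

```python
from collections import defaultdict
from fractions import Fraction

# Hypothetical joint pmf p_{X,Y}(x, y) on {0, 1} x {0, 1, 2}; numbers are illustrative.
p_XY = {
    (0, 0): Fraction(1, 8), (0, 1): Fraction(1, 4), (0, 2): Fraction(1, 8),
    (1, 0): Fraction(1, 4), (1, 1): Fraction(1, 8), (1, 2): Fraction(1, 8),
}

def marginal(joint, axis):
    """Sum the joint pmf over the other coordinate (axis 0 gives p_X, axis 1 gives p_Y)."""
    out = defaultdict(Fraction)
    for pair, p in joint.items():
        out[pair[axis]] += p
    return dict(out)

p_X = marginal(p_XY, 0)
p_Y = marginal(p_XY, 1)
print(p_X)
print(p_Y)
```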

Multivariate Distributions

The conditional distribution is defined similarly to conditional probability:

$P_{Y|X}(B \mid A) = \frac{P_{X,Y}(A, B)}{P_X(A)}$ for $A$ such that $P_X(A) > 0$.

For discrete/continuous variables we have:

discrete r.v.’s: $p_{Y|X}(y \mid x) = \frac{p_{X,Y}(x, y)}{p_X(x)}$, $p_X(x) > 0$,

continuous r.v.’s: $f_{Y|X}(y \mid x) = \frac{f_{X,Y}(x, y)}{f_X(x)}$, $f_X(x) > 0$.
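For discrete variables the conditional pmf is a renormalized row of the joint table. The sketch uses a made-up joint pmf (not from the slides) and a hypothetical helper `conditional_Y_given_X`.

```python
from fractions import Fraction

# Hypothetical joint pmf p_{X,Y}(x, y); numbers are illustrative.
p_XY = {
    (0, 0): Fraction(1, 8), (0, 1): Fraction(1, 4), (0, 2): Fraction(1, 8),
    (1, 0): Fraction(1, 4), (1, 1): Fraction(1, 8), (1, 2): Fraction(1, 8),
}

def conditional_Y_given_X(joint, x):
    """p_{Y|X}(y | x) = p_{X,Y}(x, y) / p_X(x), defined only when p_X(x) > 0."""
    p_x = sum(p for (xi, _), p in joint.items() if xi == x)
    if p_x == 0:
        raise ValueError("p_X(x) must be positive")
    return {y: p / p_x for (xi, y), p in joint.items() if xi == x}

p_Y_given_0 = conditional_Y_given_X(p_XY, 0)
p_Y_given_1 = conditional_Y_given_X(p_XY, 1)
print(p_Y_given_0)
print(p_Y_given_1)
```

Each conditional pmf sums to one, as a distribution over y must.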

Independence

Variable $X$ is said to be independent of variable $Y$ ($X \perp\!\!\!\perp Y$) iff

$P_{X,Y}(A, B) = P_X(A) \cdot P_Y(B)$ for all $A, B \subseteq \mathbb{R}$.

This is equivalent to

$P_{X|Y}(A \mid B) = P_X(A)$ for all $B$ such that $P_Y(B) > 0$, and

$P_{Y|X}(B \mid A) = P_Y(B)$ for all $A$ such that $P_X(A) > 0$.

In words, knowledge about one variable tells nothing about the other. Note that independence is symmetric, $X \perp\!\!\!\perp Y \Leftrightarrow Y \perp\!\!\!\perp X$.
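For discrete variables, independence is equivalent to the joint pmf factorizing into the product of its marginals, which is easy to test. The sketch below compares two made-up cases: two separate fair dice, and one die read twice. The helper names are hypothetical.

```python
from fractions import Fraction
from itertools import product

def marginals(joint):
    """Return (p_X, p_Y) computed from a joint pmf given as {(x, y): probability}."""
    p_x, p_y = {}, {}
    for (x, y), p in joint.items():
        p_x[x] = p_x.get(x, 0) + p
        p_y[y] = p_y.get(y, 0) + p
    return p_x, p_y

def is_independent(joint):
    """Check p_{X,Y}(x, y) == p_X(x) * p_Y(y) for every pair (x, y)."""
    p_x, p_y = marginals(joint)
    return all(joint.get((x, y), 0) == p_x[x] * p_y[y] for x, y in product(p_x, p_y))

two_dice = {(x, y): Fraction(1, 36) for x, y in product(range(1, 7), repeat=2)}
same_die = {(x, x): Fraction(1, 6) for x in range(1, 7)}     # Y is a copy of X

print(is_independent(two_dice), is_independent(same_die))
```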

Expectation

The expectation (or expected value, or mean) of a discrete random variable is given by

$E[X] = \sum_{x \in \mathcal{X}} p(x)\, x$.

The expectation of a continuous random variable is given by

$E[X] = \int_{\mathcal{X}} f(x)\, x \, dx$.

In both cases, it is possible that $E[X] = \pm\infty$.

$E[kX] = k E[X]$
$E[X + Y] = E[X] + E[Y]$
$E[XY] = E[X] E[Y]$ if $X \perp\!\!\!\perp Y$
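The three rules can be checked exactly for two independent fair dice, an example assumed here for illustration. The helper `expectation` is hypothetical; the last line shows that the product rule can fail without independence, using Y = X.

```python
from fractions import Fraction
from itertools import product

faces = range(1, 7)
p = Fraction(1, 6)

def expectation(pmf, g=lambda v: v):
    """E[g(V)] for a discrete pmf given as {value: probability}."""
    return sum(prob * g(value) for value, prob in pmf.items())

p_X = {x: p for x in faces}                                # one fair die
p_XY = {(x, y): p * p for x, y in product(faces, faces)}   # two independent dice

E_X = expectation(p_X)
E_3X = expectation(p_X, lambda x: 3 * x)
E_sum = expectation(p_XY, lambda v: v[0] + v[1])
E_prod = expectation(p_XY, lambda v: v[0] * v[1])
E_X_sq = expectation(p_X, lambda x: x * x)                 # E[XY] when Y = X (dependent)

print(E_X, E_3X, E_sum, E_prod, E_X_sq)
```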

Let $S_n = \sum_{i=1}^{n} X_i$ be the sum of the first $n$ outcomes.

The distribution of $S_n$ is given by

Source: Wikipedia