
Enhancing Nearest Neighbor Based Entropy Estimator for High Dimensional Distributions via Bootstrapping Local Ellipsoid

Chien Lu, Jaakko Peltonen

Tampere University, Kalevantie 4, 33100 Tampere, Finland

chien.lu@tuni.fi, jaakko.peltonen@tuni.fi

Abstract

An improved, ellipsoid-based kNN entropy estimator for random samples from high-dimensional distributions is developed. We argue that the inaccuracy of the classical kNN estimator in high dimensional spaces results from the local uniformity assumption, and the proposed method mitigates this assumption with two crucial extensions: a local ellipsoid-based volume correction and a correction acceptance testing procedure. Relevant theoretical contributions are provided, and several experiments ranging from simple to complicated cases show that the proposed estimator can effectively reduce the bias, especially in high dimensionalities, outperforming current state-of-the-art alternative estimators.

Introduction

The differential entropy of a continuous-valued random variable X is defined as

H(X) = -\int p(x) \log p(x)\, dx,   (1)

where p(x) is the probability density function of X. Entropy has been an important numerical quantity in Statistics, Machine Learning and other disciplines such as Physics. It provides a summary measurement of the degree of uncertainty of a system. Entropy is also related to other important information theoretic measures, including the Kullback-Leibler divergence (Kullback and Leibler 1951) and mutual information (Cover and Thomas 2006), which is defined for random variables X and Y as

I(X, Y) = \int_Y \int_X p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)}\, dx\, dy = -H(Y|X) + H(Y).   (2)

Classical kNN Estimator

In theory, to obtain the entropy of a system, the underlying probability distribution must be known; that is, the analytical form of the probability density function (PDF) needs to be available. However, in real-world cases, it is common that the underlying PDF is not available and only a set of samples is observed. This raises the research problem of estimating the entropy from the observed data only; in particular, flexible estimation approaches that do not assume the distribution to lie in a particular parametric model family are known as non-parametric entropy estimation. Estimators such as the k-nearest neighbor estimator (kNN; Kozachenko and Leonenko 1987 and Goria et al. 2005) and the kernel density estimator (KDE; Silverman 2018) or hybrid methods (Orava 2011) have been proposed. Note that there is another important group of approaches, called ensemble estimators (Sricharan, Wei, and Hero 2013; Moon, Sricharan, and Hero 2017; Moon et al. 2018), which use weighted combinations of different estimators. This work focuses on the kNN estimator and approaches related to it.

Assume there are N i.i.d. D-dimensional samples x_1, \ldots, x_N \sim P, where the probability density function p is unknown. The classical kNN estimator is written as

H(X) \approx -\frac{1}{N} \sum_{i=1}^{N} \log p(x_i) \approx \psi(N) - \psi(k) + \log(c_D) + \frac{D}{N} \sum_{i=1}^{N} \log \varepsilon_i   (3)

where \psi denotes the digamma function, \varepsilon_i denotes the Euclidean distance from x_i to its k-th nearest neighbor, c_D = \pi^{D/2} / \Gamma(1 + D/2), and \psi(N) - \psi(k) is the correction term.
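To make Equation (3) concrete, the following is a minimal Python sketch of the classical Kozachenko-Leonenko estimator; the function name and the use of scipy and scikit-learn utilities are our own illustrative choices and not the authors' implementation (which is in R).

```python
import numpy as np
from scipy.special import digamma, gammaln
from sklearn.neighbors import NearestNeighbors

def knn_entropy(x, k=25):
    """Classical kNN (Kozachenko-Leonenko) entropy estimate, Equation (3)."""
    n, d = x.shape
    # Distance from each point to its k-th nearest neighbor (the point itself is its own 0-th neighbor).
    nn = NearestNeighbors(n_neighbors=k + 1).fit(x)
    dist, _ = nn.kneighbors(x)
    eps = dist[:, -1]  # epsilon_i: distance to the k-th nearest neighbor
    # log c_D = log volume of the unit D-ball.
    log_c_d = (d / 2) * np.log(np.pi) - gammaln(1 + d / 2)
    return digamma(n) - digamma(k) + log_c_d + d * np.mean(np.log(eps))
```

As a sanity check, for samples from a standard 2-dimensional Gaussian the estimate should be close to the analytic value log(2πe) ≈ 2.84 nats.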

The classical estimator has been widely applied in many different research problems. It has also been adopted to non-parametrically estimate the mutual information (Kraskov, Stögbauer, and Grassberger 2004) of Equation (2) and the KL divergence (Pérez-Cruz 2008; Wang, Kulkarni, and Verdú 2009).

Bias of the classical kNN

Although the classical kNN estimator has shown promising applicability, it has been found to be biased especially in higher dimensional cases (Noh et al. 2014; Chauveau and Vandekerkhove 2014). Theoretical bounds relating the convergence to the data dimension include (Gao, Oh, and Viswanath 2018; Singh and Póczos 2016b; 2014; 2016a); for example, Gao, Oh, and Viswanath (2018) showed that the bias of the classical kNN estimator is \tilde{O}(N^{-1/D}), where \tilde{O} denotes limiting behavior up to polylogarithmic factors in N.

One explanation of the bias is that the local uniformity assumption of the classical kNN cannot always hold (Lombardi and Pant 2016; Gao, Ver Steeg, and Galstyan 2015; Gao, Steeg, and Galstyan 2015; Gao, Oh, and Viswanath 2016; Lord, Sun, and Bollt 2018), especially when the dimension is high. In other words, the hyperspherical structure (the ε-ball) is not capable of capturing the twisted shape of the underlying distribution around the sample; such a situation can have an increasingly strong effect on the bias as the dimensionality grows.

The rest of the paper is organized as follows. We first describe our proposed method and its bias analysis, including the corresponding correction term and the bias bound. We then review related work based on local approximations, and describe comparison experiments in both entropy estimation and mutual information estimation. Lastly, we provide conclusions.

Method

We now introduce our novel entropy estimation method, called the kNN estimator with Ellipsoidal Correction (EC-kNN); the resulting method is directly applicable both to entropy estimation and to mutual information estimation.

The idea is to construct a local ellipsoid instead of an ε-ball, so that the samples inside the local ellipsoid can be assumed to be more uniformly distributed and hence better fit the uniformity assumption of kNN based entropy estimation. The proposed method comprises two parts: the first part is the local ellipsoidal correction, where a local ellipsoid is learned by performing a local principal component analysis (PCA); the second part is an acceptance testing procedure which is performed in a bootstrapping manner. Note that the local-PCA approach has been adopted by other similar works, but the bootstrap step is a significant novelty in our approach.

The classical kNN estimator of Equation (3) can be rewritten as

H(X) \approx \frac{1}{N} \sum_{i=1}^{N} H_{kNN}(x_i)   (4)

where H_{kNN}(x_i) is an approximation of -\log p(x_i) based on a D-dimensional ball of radius \varepsilon_i encompassing k neighbors, with

H_{kNN}(x_i) = \psi(N) - \psi(k) + \log(V_i)   (5)

and the \log(V_i) term denotes the logarithm of the volume V_i of the ball, computed as

\log(V_i) = \log(c_D) + D \log \varepsilon_i   (6)

where c_D = \pi^{D/2} / \Gamma(1 + D/2) as before.

Our aim is to generalize the above by a logarithmic volume correction term \log(\Delta \tilde{V}_i), corresponding to the difference between the logarithmic volume of a local ellipsoid and that of the ε-ball. Furthermore, as a result of the local ellipsoid, the number of neighbors k may slightly change around each point x_i, yielding an individual k_i for each point.

Theorem 1. The correction term of the ellipsoid-based estimator is \psi(N) - \psi(k), which is the same as for the classical estimator.

Proof. Let E(x_i, r(x_i)) denote the ellipsoid around x_i defined by axes r(x_i) = [r_1(x_i), r_2(x_i), \ldots, r_D(x_i)], where r_1(x_i) and r_D(x_i) are the longest and shortest axes respectively. For any choice of r(x_i), the ellipsoid E(x_i, r(x_i)) can also be defined by introducing a Mahalanobis matrix M_{x_i} such that

\forall \xi \in E(x_i, r(x_i)): \quad \sqrt{(\xi - x_i)^\top M_{x_i} (\xi - x_i)} \le r_1(x_i).   (7)

In this definition the matrix controls the shape of the ellipsoid and r_1(x_i) controls its overall size.

Let p(r_1(x_i)) denote the probability mass inside the ellipsoid controlled by r_1(x_i) and M_{x_i}; for brevity the notation p(r_1(x_i)) shows the dependence on r_1(x_i) only. The probability mass is

p(r_1(x_i)) = \int_{\xi \in E(x_i, r(x_i))} p(\xi)\, d\xi = \int_{\sqrt{(\xi - x_i)^\top M_{x_i} (\xi - x_i)} \le r_1(x_i)} p(\xi)\, d\xi.   (8)

Similar to the work of Kraskov, Stögbauer, and Grassberger (2004), let \mu(r_1(x_i)) be the probability density of r_1(x_i); then \mu(r_1(x_i))\, dr_1(x_i) is the probability that the Mahalanobis distance from x_i to its k-th nearest neighbor lies in the interval [r_1(x_i), r_1(x_i) + dr_1(x_i)]. It can be written as

\mu(r_1(x_i))\, dr_1(x_i) = \frac{(N-1)!}{1!\,(k-1)!\,(N-k-1)!} \times \frac{dp(r_1(x_i))}{dr_1(x_i)}\, dr_1(x_i) \times p(r_1(x_i))^{k-1} \times \left(1 - p(r_1(x_i))\right)^{N-k-1}   (9)

by using the trinomial formula. The correction term can be obtained by taking the expectation of -\log p(r_1(x_i)):

E[-\log p(r_1(x_i))] = \int_0^{\infty} \mu(r_1(x_i)) \left(-\log p(r_1(x_i))\right) dr_1(x_i) = \psi(N) - \psi(k).   (10)

Under the assumption that inside E(x_i, r(x_i)) the probability density is a constant p(x_i), such that p(x_i) \times V_{E_i} \approx p(r_1(x_i)) and \log V_{E_i} = \log(V_i) + \log(\Delta \tilde{V}_i),

E[-\log p(x_i)] \approx \psi(N) - \psi(k) + \log V_{E_i} = \psi(N) - \psi(k) + \log(V_i) + \log(\Delta \tilde{V}_i).   (11)

The resulting entropy estimator EC-kNN is then of the form

H(X) \approx \frac{1}{N} \sum_{i=1}^{N} H_{EC\text{-}kNN}(x_i)   (12)

where

H_{EC\text{-}kNN}(x_i) = \psi(N) - \psi(k_i) + \log(V_i) + \log(\Delta \tilde{V}_i)   (13)

and \log(V_i) is defined by Equation (6). However, since the local ellipsoid is learned from a small number of local data points, we must guard against potential over-correction to the shape learned from this limited amount of data; we do this by an acceptance testing procedure, and apply the logarithmic correction term \log(\Delta \tilde{V}_i) only if the test is successful.

Next, we describe the local ellipsoidal correction for computing the correction term \Delta \tilde{V}_i, and then the bootstrap based acceptance testing.

Local Ellipsoidal Correction

The ε-ball in the classical kNN entropy estimator treats all directions equally. In a high-dimensional space the underlying probability distribution around a data point x_i may be curved or stretched so that, due to the shape of the distribution, the density along some directions of the high-dimensional space may be larger than along other directions; moreover, the density may experience larger changes along some directions than along others. In such a situation, it can still be a reasonable approximation to assume the density to be approximately uniform in a small hyper-region, but it is no longer a good assumption that the hyper-region would be an ε-ball. We instead aim to learn an ellipsoid as a better representation of a local hyper-region with approximately uniform density.

To learn a local ellipsoid structure, we propose to use a local PCA based approach as follows. We first find the k nearest neighbors of x_i by smallest Euclidean distance as usual. We then compute the sample covariance matrix of the resulting neighborhood of x_i (including x_i itself) and rotate the neighborhood to a new coordinate system by projecting the data onto the eigenvectors of the obtained covariance matrix. The eigenvectors are used as the axes of the local ellipsoid. The widths of the local ellipsoid along each axis are then computed by searching for the maximum distance from x_i to the neighbor points along each coordinate axis (see Figure 1).

After performing the local PCA, the sum of the logarithms of the ratios of each of the other axes to the longest axis of the estimated ellipsoid is taken as the logarithmic volume correction term; additionally, the number of neighbors is re-counted based on the ellipsoid.

In detail, since the volume of an ellipsoid with axes r_1(x_i), r_2(x_i), \ldots, r_D(x_i) is

\frac{\pi^{D/2}}{\Gamma(1 + D/2)} \prod_{d=1}^{D} r_d(x_i),   (14)

and we take the longest axis r_1(x_i), the distance from the center to the farthest point along the longest axis, to represent the original radius, the correction term is obtained as

\Delta \hat{V}(x_i, X) = \prod_{d=1}^{D} \frac{r_d(x_i)}{r_1(x_i)}.   (15)

Algorithm 1 shows the computation of the correction term.

Figure 1: (a) The ε-ball of the classical kNN estimator. (b) Local ellipsoid learned via local PCA. Lines extending from the center denote the local axes found by local PCA; the widths r_d of the lines denote the distance from the central point to the furthest neighbor along each axis, and the width of the ellipsoid along the axis is defined based on that distance. Note that as a result, due to the curvature of the ellipsoid boundary, the furthest neighbor along an axis may not lie inside the ellipsoid.

In the complete entropy estimation algorithm described in the next section, we denote the correction term as \Delta \tilde{V}_i.

Algorithm 1 Local ellipsoid-based volume correction
Require: x_i: sample point; X = {x_1, \ldots, x_N}: all sample points; D: dimension; k: number of neighbors
Ensure: \Delta \hat{V}(x_i, X): local ellipsoid-based volume correction
1: Find the k-th nearest neighbor of x_i and compute the distance \varepsilon_i
2: Perform a PCA on the set of the k + 1 points
3: Project the k + 1 points to the new coordinate system and obtain the new center by averaging all the projected points
4: Find the lengths r_1, \ldots, r_D of the axes of the projected ellipsoid by computing, along each PCA projection axis, the maximum distance from the projected points to the center
return \Delta \hat{V}(x_i, X) = \prod_{d=1}^{D} r_d / r_1
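As an illustration of Algorithm 1, the following is a minimal Python sketch of the local ellipsoid-based volume correction under our reading of the text; the function name local_volume_correction and the use of numpy's eigendecomposition are our own choices, not the authors' R implementation.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def local_volume_correction(x_i, X, k=25):
    """Sketch of Algorithm 1: local ellipsoid-based volume correction Delta V_hat(x_i, X)."""
    # The k+1 local points: x_i together with its k nearest neighbors.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(np.asarray(x_i).reshape(1, -1))
    local = X[idx[0]]
    # Local PCA: eigenvectors of the neighborhood covariance give the ellipsoid axes.
    cov = np.cov(local, rowvar=False)
    _, eigvecs = np.linalg.eigh(cov)
    projected = local @ eigvecs
    center = projected.mean(axis=0)
    # Axis lengths: maximum distance from the projected points to the center along each axis.
    r = np.abs(projected - center).max(axis=0)
    r_sorted = np.sort(r)[::-1]          # r_1 is the longest axis
    return np.prod(r_sorted / r_sorted[0])
```

Since every ratio r_d / r_1 is at most 1, the correction factor is at most 1, so the logarithmic correction is non-positive and can only reduce the classical per-point estimate.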

Bootstrap Testing for Correction Acceptance

A nonuniform empirical distribution of a finite set of data samples along different coordinate axes can also arise from random sampling variation alone. That is, a nonuniform distribution of the data subset in an ε-ball around a data point can be observed even when the underlying distribution is uniform there. Therefore, it can happen that the volume corrections for some points are unnecessary, and an over-correction problem can occur if the correction is performed for every sample point.


Hence, we introduce a bootstrap style acceptance testing procedure to amend the potential over-correction issue. For each sample point, a reference correction \Delta \hat{V}_u is generated, representing a rough estimate of the correction that can occur due to random sampling alone, under a uniformity assumption of the underlying density in the ε-ball around the sample point. The reference correction is then used in the acceptance testing procedure as an acceptance threshold.

The reference correction \Delta \hat{V}_u is simply a volume correction obtained from a set of k + 1 samples generated uniformly inside the ε-ball around x_i. The details of generating \Delta \hat{V}_u are shown in Algorithm 2.

Since the data point x_i has itself been randomly generated from an underlying distribution, the center point of the ε-ball is itself a random variable. Therefore, in order to simulate a true random configuration, the randomness of whether x_i is the center of the ε-ball should also be taken into account. Hence, after simulating the k + 1 points inside the ε-ball, the center of the ball is not necessarily the original x_i but is instead redefined by choosing the point among the k + 1 points which is closest to the original point.

The proposed kNN estimator with ellipsoidal correction (EC-kNN) combines the above-proposed Algorithm 1 and Algorithm 2. For each sample point, the volume correction is performed with a bootstrap acceptance test: the correction term \Delta \tilde{V}_i and the reference correction term \Delta \hat{V}_u are generated, and the bootstrap acceptance test is conducted by comparing their values. The correction is accepted if \Delta \tilde{V}_i < \Delta \hat{V}_u; otherwise the algorithm uses the result of the classical kNN estimator.

Note that, in high dimensional cases, it can happen that when the number of neighbors inside the new ellipsoid is counted, there are no points of the finite data set inside the ellipsoid. When this issue is encountered, the axes of the ellipsoid are increased slightly until there is at least one data point inside the ellipsoid. In the proposed algorithm, every axis is lengthened by multiplying r_d by a small ratio (e.g., 1.01), and the volume correction is changed accordingly. By definition, a point x_i = [x_{i,1}, \ldots, x_{i,D}]^\top is inside the ellipsoid if

\sum_{d=1}^{D} \frac{(x_{i,d} - c_d)^2}{r_d^2} \le 1   (16)

where c = [c_1, \ldots, c_D]^\top is the center of the ellipsoid.

After going through every data point in the data set, the algorithm produces the final corrected result by averaging the corrected entropy values over all data points. Note that no hyper-parameter setting is required, and since the bootstrap acceptance test depends on the randomly generated value \Delta \hat{V}_u, the result of the computation can be slightly different every time.
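The following is a small sketch of the ellipsoid-membership test of Equation (16) and of the axis-lengthening loop described above, assuming the points have already been expressed in the ellipsoid's coordinate frame and that the point set is non-empty; the names inside_ellipsoid and grow_until_nonempty and the loop structure are our own illustrative choices.

```python
import numpy as np

def inside_ellipsoid(points, center, r):
    """Equation (16): True for points whose sum_d (x_d - c_d)^2 / r_d^2 is at most 1."""
    return (((points - center) ** 2) / r ** 2).sum(axis=1) <= 1.0

def grow_until_nonempty(points, center, r, ratio=1.01):
    """Lengthen every axis by a small ratio until at least one data point lies inside
    the ellipsoid, as described for the high-dimensional corner case above."""
    r = np.asarray(r, dtype=float).copy()
    while not inside_ellipsoid(points, center, r).any():
        r *= ratio  # the volume correction would then be adjusted accordingly (see Algorithm 3)
    return r
```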

The final algorithm, including the local correction computation, reference correction computation, bootstrap testing, enlargement of ellipsoids, and computation of the entropy estimate, is summarized as Algorithm 3.

Algorithm 2 Reference correction
Require: D: dimension; k: number of neighbors; ε: radius
Ensure: \Delta \hat{V}_u: acceptance variable
1: Generate random samples U = {u_1, \ldots, u_{k+1}} uniformly from a D-dimensional ε-ball
2: Select the point u_j which is closest to the origin among the k + 1 random samples
return \Delta \hat{V}_u = \Delta \hat{V}(u_j, U) using Algorithm 1
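Below is a minimal Python sketch of the reference correction of Algorithm 2 and of the acceptance decision, reusing the local_volume_correction helper sketched earlier (our own helper, not the authors' implementation); uniform sampling inside the ε-ball is done by the standard trick of scaling a random direction by a radius drawn as U^(1/D), which is our own implementation choice.

```python
import numpy as np

def sample_uniform_ball(n, d, radius, rng):
    """Draw n points uniformly from a d-dimensional ball of the given radius."""
    directions = rng.normal(size=(n, d))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    radii = radius * rng.uniform(size=(n, 1)) ** (1.0 / d)
    return directions * radii

def reference_correction(d, k, eps, rng):
    """Sketch of Algorithm 2: reference correction Delta V_hat_u."""
    u = sample_uniform_ball(k + 1, d, eps, rng)
    # Re-center on the sampled point closest to the origin of the ball.
    u_j = u[np.argmin(np.linalg.norm(u, axis=1))]
    return local_volume_correction(u_j, u, k=k)

# Acceptance test: apply the ellipsoidal correction only if it is smaller than
# what random sampling alone could produce under local uniformity, e.g.
# rng = np.random.default_rng()
# accept = local_volume_correction(x_i, X, k) < reference_correction(D, k, eps_i, rng)
```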

Bias analysis

Theorem 2. The bias E_{X \sim p}\left[ \left| \log p(x) - \log \hat{p}_{r(x)}(x) \right| \right] is bounded.

Proof. By the mean value theorem (Lebesgue 1910), for any 0 < a < b we have (\log(b) - \log(a)) / (b - a) = 1/c for some c in the open interval (a, b). Applying this inside the expectation E_{X \sim p}\left[ \left| \log p(x) - \log \hat{p}(x) \right| \right] we get the bound

E_{X \sim p}\left[ \left| \log p(x) - \log \hat{p}(x) \right| \right] \le M\, E_{X \sim p}\left[ \left| p(x) - \hat{p}(x) \right| \right]   (18)

where \hat{p}(x) is an arbitrary estimator of p(x), and M = \sup_{X \sim p} \max(1/p(x), 1/\hat{p}(x)), which is finite if p and \hat{p} are both bounded above zero inside the support of p. We make use of the result that E_{X \sim p}\left[ \left| p(x) - \hat{p}_{\varepsilon(x)}(x) \right| \right] is bounded for the classical estimator (Singh and Póczos 2016b).

Consider the classical estimator \hat{p}_{\varepsilon(x)}(x), which assumes the density is uniform inside the ball. In the asymptotic case of increasing data the estimate becomes the integral of the density over the ball,

\hat{p}_{\varepsilon(x)}(x) = \frac{1}{V_B} \int_{\xi \in B(x, \varepsilon(x))} p(\xi)\, d\xi

where, by an abuse of notation, V_B denotes the volume of the ball. The bias of the estimator then becomes

E_{X \sim p}\left[ \left| p(x) - \hat{p}_{\varepsilon(x)}(x) \right| \right] = E_{X \sim p}\left[ \left| \frac{1}{V_B} \int_{\xi \in B(x, \varepsilon(x))} p(x) - p(\xi)\, d\xi \right| \right].   (19)

Similarly, for the case of the proposed estimator, let r(x) denote the axes of the ellipsoid E(x, r(x)) around x, so that

E_{X \sim p}\left[ \left| p(x) - \hat{p}_{r(x)}(x) \right| \right] = E_{X \sim p}\left[ \left| \frac{1}{V_E} \int_{\xi \in E(x, r(x))} p(x) - p(\xi)\, d\xi \right| \right].   (20)

Algorithm 3 EC-kNN
Require: X = {x_1, \ldots, x_N}: all sample points; D: dimension; k: number of neighbors
Ensure: \hat{H}_X: corrected entropy estimate
1: for each x_i do
2:   Find the k nearest neighbors, compute the distance \varepsilon_i and the volume V_i of the \varepsilon_i-ball
3:   Compute the empirical volume correction \Delta \tilde{V}_i = \Delta \hat{V}(x_i, X) (17) using Algorithm 1
4:   Compute the reference volume correction \Delta \hat{V}_u using Algorithm 2
5:   if \Delta \tilde{V}_i < \Delta \hat{V}_u then
6:     Find the number of samples inside the ellipsoid
7:     while no samples are inside the ellipsoid do: lengthen every axis by multiplying each r_d with a small ratio (1.01), and adjust \Delta \tilde{V}_i accordingly
8:     end while
9:     H_{EC\text{-}kNN}(x_i) = \psi(N) - \psi(k_i) + \log(V_i) + \log(\Delta \tilde{V}_i)
10:  else
11:    H_{EC\text{-}kNN}(x_i) = \psi(N) - \psi(k_i) + \log(V_i)
12:  end if
13: end for
return \hat{H}(X) = \frac{1}{N} \sum_{i=1}^{N} H_{EC\text{-}kNN}(x_i)

Since by construction the ellipsoid is contained within the corresponding ball, (20) becomes

E_{X \sim p}\left[ \left| \frac{V_B}{V_E} \frac{1}{V_B} \int_{\xi \in E(x, r(x))} p(x) - p(\xi)\, d\xi \right| \right]
\le C \cdot E_{X \sim p}\left[ \left| \frac{1}{V_B} \int_{\xi \in E(x, r(x))} p(x) - p(\xi)\, d\xi \right| \right]
\le C \cdot \left( E_{X \sim p}\left[ \left| \frac{1}{V_B} \int_{\xi \in E(x, r(x))} p(x) - p(\xi)\, d\xi \right| \right] + E_{X \sim p}\left[ \left| \frac{1}{V_B} \int_{\xi \in B(x, \varepsilon(x)) \setminus E(x, r(x))} p(x) - p(\xi)\, d\xi \right| \right] \right)
= C \cdot E_{X \sim p}\left[ \left| p(x) - \hat{p}_{\varepsilon(x)}(x) \right| \right]   (21)

where C = \sup_{X \sim p} V_B / V_E is bounded above by any suitable regularization keeping the ellipsoids at non-zero volume, and the right-hand term, which is the bias of the classical estimator, is known to be bounded; thus the ellipsoid case is bounded as well.

Theorem 3. The bound of E_{X \sim p}\left[ \left| \log p(x) - \log \hat{p}_{r(x)}(x) \right| \right] is less than or equal to the bound of E_{X \sim p}\left[ \left| \log p(x) - \log \hat{p}_{\varepsilon(x)}(x) \right| \right].

Proof. We continue from (18). Assume that the probability density p is second-order differentiable. Then, with a Taylor expansion, the density bias term of the classical estimator can be approximated as

E_{X \sim p}\left[ \left| p(x) - \hat{p}_{\varepsilon(x)}(x) \right| \right] = E_{X \sim p}\left[ \left| \frac{1}{V_B} \int_{\xi \in B(x, \varepsilon(x))} p(x) - p(\xi)\, d\xi \right| \right] = E_{X \sim p}\left[ \left| \frac{1}{2 V_B} \int_{\xi \in B(x, \varepsilon(x))} (\xi - x)^\top H(x) (\xi - x)\, d\xi \right| \right] + B   (22)

where, again by an abuse of notation, V_B denotes the volume of the ball and B denotes the remainder term of the Taylor series; we assume B is small enough to be neglected. Similarly, for the case of the proposed estimator, let r(x) again denote the axes of the ellipsoid E(x, r(x)) around x,

E_{X \sim p}\left[ \left| p(x) - \hat{p}_{r(x)}(x) \right| \right] = E_{X \sim p}\left[ \left| \frac{1}{V_E} \int_{\xi \in E(x, r(x))} p(x) - p(\xi)\, d\xi \right| \right]
= E_{X \sim p}\left[ \frac{1}{V_E} \int_{\xi \in E(x, r(x))} \nabla p(x)^\top (\xi - x)\, d\xi \right] + E_{X \sim p}\left[ \frac{1}{2 V_E} \int_{\xi \in E(x, r(x))} (\xi - x)^\top H(x) (\xi - x)\, d\xi \right] + E   (23)

where V_E denotes the volume of the ellipsoid and, again, the remainder E is assumed negligible. Due to the symmetry of the ellipsoid, E_{X \sim p}\left[ \frac{1}{V_E} \int_{\xi \in E(x, r(x))} \nabla p(x)^\top (\xi - x)\, d\xi \right] = 0.

Hence,

E_{X \sim p}\left[ \left| p(x) - \hat{p}_{r(x)}(x) \right| \right] \approx E_{X \sim p}\left[ \frac{1}{2 V_E} \int_{\xi \in E(x, r(x))} (\xi - x)^\top H(x) (\xi - x)\, d\xi \right]
= E_{X \sim p}\left[ \frac{1}{2 V_E} \int \cdots \int_{\sum_{d=1}^{D} (\xi_d - x_d)^2 / r_d(x)^2 \le 1} (\xi - x)^\top H(x) (\xi - x)\, d\xi \right]
= E_{X \sim p}\left[ \frac{\prod_d r_d(x)}{2 V_E} \int \cdots \int_{u^\top u \le 1} (r(x) u)^\top H(x) (r(x) u)\, du \right]
\le E_{X \sim p}\left[ \frac{\varepsilon(x)^D}{2 V_B} \int \cdots \int_{u^\top u \le 1} (\varepsilon(x) u)^\top H(x) (\varepsilon(x) u)\, du \right]
= E_{X \sim p}\left[ \frac{1}{2 V_B} \int_{\xi \in B(x, \varepsilon(x))} (\xi - x)^\top H(x) (\xi - x)\, d\xi \right]
\approx E_{X \sim p}\left[ \frac{1}{V_B} \int_{\xi \in B(x, \varepsilon(x))} p(x) - p(\xi)\, d\xi \right]
= E_{X \sim p}\left[ \left| p(x) - \hat{p}_{\varepsilon(x)}(x) \right| \right]   (24)

where \approx denotes the Taylor approximation and we use the fact that r_d(x) \le \varepsilon(x) for all r_d(x) \in r(x). With the proven inequality, the mean value theorem indicates that the asymptotic bias bound of the proposed estimator is less than or equal to the bias bound of the classical estimator.

Related Works

Local PCA approximation

Several approaches based on local PCA approximation (Gao, Ver Steeg, and Galstyan 2015; Lord, Sun, and Bollt 2018) have been developed. Gao et al. proposed the LNC (Local Nonuniformity Correction) estimator, which comprises a local PCA based correction and a test procedure. However, the testing procedure depends heavily on an arbitrary "small value" α, and how to determine α is not addressed, whereas our proposed method requires no hyper-parameter for the testing procedure.

Likewise, Lord et al. proposed a similar approach that re-centers the local points and applies SVD to estimate the local volume. Nonetheless, there is no testing or filtering procedure in their method, which might lead to over-correction. Moreover, in their paper only the estimation of mutual information is performed; it is possible that errors resulting from over-correction cancel out during that computation. Also note that neither of the above-mentioned works applied their methods to realistically high dimensional cases: the highest dimension used to evaluate LNC is D = 3, whereas in Lord et al.'s work the highest dimension is D = 4.

Local Gaussian approximation

Approaches based on local Gaussian approximation have been proposed (Gao, Steeg, and Galstyan 2015; Lord, Sun, and Bollt 2018). One approach, kNN-bw (bandwidth) (Gao, Oh, and Viswanath 2016), utilizes a local Gaussian kernel to determine the bandwidth of the kNN estimator; as this method is the only one in this group with a public implementation from its authors, we chose it as the representative for this group of approaches.

Simulation Experiments

We carry out several experiments comparing our estimator EC-kNN to other methods in two tasks, entropy estimation and mutual information estimation.¹ For both tasks we compare to state-of-the-art alternatives, using implementations from their respective authors with default values. We also compare to the baseline kNN estimator using several values of k, whereas for our method EC-kNN, k is simply fixed to 25 in every dimension, which already turns out to work well.

We next describe the data distributions used in the experiments, and then describe the entropy estimation and mutual information tasks along with their respective results.

¹ An R implementation can be found in the online repository https://github.com/hummmblelu/eckNN

Designed Cases

To verify the usability of the proposed approach, three cases are set up, ranging from simple to complicated; details are as follows, with a data-generation sketch given after the list.

1. Symmetric Gaussian. Samples are generated from a D-dimensional Gaussian distribution N(\mu, \Sigma). The mean vector \mu is set to the zero vector

\mu = [0, 0, \ldots, 0]   (25)

and the covariance matrix \Sigma is generated as

\Sigma = s s^\top + \mathrm{diag}(1, \ldots, 1)   (26)

where s \sim N(0, \mathrm{diag}(3, 3, \ldots, 3)). This generation of \Sigma creates random dependence between dimensions.

2. Asymmetric Gaussian. Similar to the previous case, the mean vector \mu is set to the zero vector, but the covariance matrix \Sigma is generated as

\Sigma = s s^\top + \mathrm{diag}(1, \ldots, D)   (27)

where s \sim N(0, \mathrm{diag}(3, 3, \ldots, 3)). When the dimensionality D grows, the variation increases accordingly.

3. Mixture Gaussian. A mixture of two Gaussian distributions is selected to evaluate the performance of the proposed method in a scenario more complicated than the previous two cases. The probability density function of a mixture of two Gaussian distributions is

p(x) = \pi\, p(x \mid \mu_1, \Sigma_1) + (1 - \pi)\, p(x \mid \mu_2, \Sigma_2)   (28)

where \mu_1 and \mu_2 are the mean vectors, \Sigma_1 and \Sigma_2 are the covariance matrices of the two Gaussian distributions, and \pi is the mixture weight. Here we set \pi to 0.4,

\mu_1 = [0, 0, \ldots, 0]   (29)
\mu_2 = [10, 10, \ldots, 10]   (30)
\Sigma_1 = s_1 s_1^\top + \mathrm{diag}(1, \ldots, 1)   (31)
\Sigma_2 = s_2 s_2^\top + \mathrm{diag}(1, \ldots, D)   (32)

where s_1 and s_2 are sampled from N(0, \mathrm{diag}(3, 3, \ldots, 3)).
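The following is a minimal sketch of how the three designed cases can be simulated, under our reading of Equations (25)-(32); the function name make_case and the use of numpy's random generator are our own illustrative choices.

```python
import numpy as np

def make_case(case, d, n, rng):
    """Draw n samples from one of the designed cases ('symmetric', 'asymmetric',
    'mixture') in dimension d, following Equations (25)-(32)."""
    def random_cov(diag):
        # s ~ N(0, diag(3, ..., 3)), so each component has variance 3.
        s = rng.normal(scale=np.sqrt(3.0), size=(d, 1))
        return s @ s.T + np.diag(diag)

    if case == "symmetric":
        return rng.multivariate_normal(np.zeros(d), random_cov(np.ones(d)), size=n)
    if case == "asymmetric":
        return rng.multivariate_normal(np.zeros(d), random_cov(np.arange(1, d + 1)), size=n)
    # Mixture of two Gaussians with mixture weight pi = 0.4.
    cov1 = random_cov(np.ones(d))
    cov2 = random_cov(np.arange(1, d + 1))
    z = rng.uniform(size=n) < 0.4
    x1 = rng.multivariate_normal(np.zeros(d), cov1, size=n)
    x2 = rng.multivariate_normal(np.full(d, 10.0), cov2, size=n)
    return np.where(z[:, None], x1, x2)
```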

Entropy Estimation

We compare our approach (EC-kNN) with the classical kNN estimator (Kozachenko and Leonenko 1987) and the kNN-bw estimator (Gao, Oh, and Viswanath 2016)² on the three cases. The ground truth entropy of the first two cases can be obtained analytically as

H(X) = \frac{1}{2} \log \det(2 \pi e \Sigma);   (33)

for the mixture Gaussian case, since the entropy cannot be obtained analytically, Monte Carlo estimation with 100000 samples drawn from the distribution is employed.
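As a sketch of how such ground truth values can be computed, the analytic Gaussian entropy of Equation (33) and a Monte Carlo estimate for the mixture case are shown below; the function names are our own, and the mixture estimator assumes scipy's multivariate normal density.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gaussian_entropy(cov):
    """Equation (33): differential entropy of N(mu, cov) in nats."""
    return 0.5 * np.linalg.slogdet(2 * np.pi * np.e * cov)[1]

def mixture_entropy_mc(mu1, cov1, mu2, cov2, pi=0.4, n=100000, rng=None):
    """Monte Carlo estimate of the mixture entropy, -E[log p(x)] over mixture samples."""
    rng = rng or np.random.default_rng(0)
    z = rng.uniform(size=n) < pi
    x = np.where(z[:, None],
                 rng.multivariate_normal(mu1, cov1, size=n),
                 rng.multivariate_normal(mu2, cov2, size=n))
    p = (pi * multivariate_normal.pdf(x, mu1, cov1)
         + (1 - pi) * multivariate_normal.pdf(x, mu2, cov2))
    return -np.mean(np.log(p))
```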

For each case, each dimension and each method, we repeat the experiment 10 times, simulating 1000 samples and computing the estimate to compare with the ground truth entropy. The average root mean square error (RMSE) is taken as the performance measure. The results can be found in Figure 2. EC-kNN outperforms both the classical kNN and kNN-bw, especially in high dimensionalities.

² We directly use the program from https://github.com/wgao9/lnn; input parameters are set to the default values.


Figure 2: Performance comparisons for entropy estimation (average RMSE). (a) Symmetric Gaussian case; (b) asymmetric Gaussian case; (c) mixture Gaussian case. EC-kNN outperforms the other approaches in all cases when dimensionality becomes higher.

Figure 3: Performance comparisons for mutual information estimation. In order to fit the LNC dotted line (grey) into the plot, we apply a logarithmic scale on the y axis above approximately 50. (a) Symmetric Gaussian case; (b) asymmetric Gaussian case; (c) mixture Gaussian case. The proposed EC-kNN outperforms the other approaches in all cases when dimensionality becomes higher.

Mutual Information Estimation

As a multivariate extension of mutual information for a random vector X = [X_1, \ldots, X_D], where X_d denotes the d-th component random variable, we adopt the definition

\sum_{d=1}^{D} H(X_d) - H(X).   (34)

This quantity is also called the total correlation or multi-information (Van de Cruys 2011), and it is also used in one of the above-mentioned works (Gao, Ver Steeg, and Galstyan 2015).
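Estimating Equation (34) then only requires applying an entropy estimator to the joint sample and to each marginal; a minimal sketch, reusing the knn_entropy helper sketched earlier (our own helper, not the authors' implementation):

```python
import numpy as np

def total_correlation(x, entropy_estimator, **kwargs):
    """Equation (34): sum of marginal entropies minus the joint entropy."""
    marginals = sum(entropy_estimator(x[:, [d]], **kwargs) for d in range(x.shape[1]))
    return marginals - entropy_estimator(x, **kwargs)

# Example usage (hypothetical): total_correlation(samples, knn_entropy, k=25)
```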

The ground truth value of mutual information for the first two cases can be obtained analytically using Equation (33). For the third case (mixture Gaussian), the ground truth value is obtained using Monte Carlo integration with 100000 samples. We compare our approach with the kNN estimator, the KSG estimator, the LNC³ estimator and the kNN-bw estimator.

Again, for each case, each dimension and each method, we repeat the experiment 10 times, simulating 1000 samples and computing the estimate to compare with the ground truth mutual information. The average root mean square error (RMSE) is taken as the performance measure. The results can be found in Figure 3. EC-kNN again outperforms both the classical kNN and the other alternatives, especially in high dimensionalities.

³ We use the program from https://github.com/BiuBiuBiLL/NPEET_LNC for the computations of KSG and LNC; input parameters are set to the default values.

Conclusions and Discussions

The contribution of this paper is a novel approach for entropy estimation, the EC-kNN estimator, which reduces the bias of the kNN estimator. The proposed EC-kNN comprises local PCA learning and a bootstrap style correction acceptance procedure, which together address the bias resulting from the uniformity assumption.

The advantage of EC-kNN has been shown to be prominent especially when the data set is high-dimensional and complicated. The experiments imply that the local ellipsoidal correction and the bootstrap type acceptance procedure can properly capture the local uniformity region.

Our approach suggests several interesting directions for future work. Our method with a fixed value of k already outperformed the alternatives, and performance with an optimized k could be even better. Secondly, we proved a bias bound here; additional bounds on the variance of the estimator would also be valuable.

Acknowledgments

The work was supported by Academy of Finland decisions 312395 and 313748.


References

Chauveau, D., and Vandekerkhove, P. 2014. The nearest neighbor entropy estimate: an adequate tool for adaptive MCMC evaluation. Technical report, HAL. https://hal.archives-ouvertes.fr/hal-0106808.

Cover, T. M., and Thomas, J. A. 2006. Elements of Information Theory, 2nd edition. Wiley-Interscience: NJ.

Gao, W.; Oh, S.; and Viswanath, P. 2016. Breaking the bandwidth barrier: Geometrical adaptive entropy estimation. In Advances in Neural Information Processing Systems, 2460–2468.

Gao, W.; Oh, S.; and Viswanath, P. 2018. Demystifying fixed k-nearest neighbor information estimators. IEEE Transactions on Information Theory.

Gao, S.; Steeg, G. V.; and Galstyan, A. 2015. Estimating mutual information by local Gaussian approximation. In Uncertainty in Artificial Intelligence, 278–285.

Gao, S.; Ver Steeg, G.; and Galstyan, A. 2015. Efficient estimation of mutual information for strongly dependent variables. In Artificial Intelligence and Statistics, 277–286.

Goria, M. N.; Leonenko, N. N.; Mergel, V. V.; and Novi Inverardi, P. L. 2005. A new class of random vector entropy estimators and its applications in testing statistical hypotheses. Journal of Nonparametric Statistics 17(3):277–297.

Kozachenko, L., and Leonenko, N. N. 1987. Sample estimate of the entropy of a random vector. Problemy Peredachi Informatsii 23(2):9–16.

Kraskov, A.; Stögbauer, H.; and Grassberger, P. 2004. Estimating mutual information. Physical Review E 69(6):066138.

Kullback, S., and Leibler, R. A. 1951. On information and sufficiency. The Annals of Mathematical Statistics 22(1):79–86.

Lebesgue, H. 1910. Sur l'intégration des fonctions discontinues. In Annales scientifiques de l'École normale supérieure, volume 27, 361–450.

Lombardi, D., and Pant, S. 2016. Nonparametric k-nearest-neighbor entropy estimator. Physical Review E 93(1):013310.

Lord, W. M.; Sun, J.; and Bollt, E. M. 2018. Geometric k-nearest neighbor estimation of entropy and mutual information. Chaos: An Interdisciplinary Journal of Nonlinear Science 28(3):033114.

Moon, K.; Sricharan, K.; Greenewald, K.; and Hero, A. 2018. Ensemble estimation of information divergence. Entropy 20(8):560.

Moon, K. R.; Sricharan, K.; and Hero, A. O. 2017. Ensemble estimation of mutual information. In 2017 IEEE International Symposium on Information Theory (ISIT), 3030–3034. IEEE.

Noh, Y.-K.; Sugiyama, M.; Liu, S.; Plessis, M. C.; Park, F. C.; and Lee, D. D. 2014. Bias reduction and metric learning for nearest-neighbor estimation of Kullback-Leibler divergence. In Artificial Intelligence and Statistics, 669–677.

Orava, J. 2011. K-nearest neighbour kernel density estimation, the choice of optimal k. Tatra Mountains Mathematical Publications 50(1):39–50.

Pérez-Cruz, F. 2008. Kullback-Leibler divergence estimation of continuous distributions. In Information Theory, 2008. ISIT 2008. IEEE International Symposium on, 1666–1670. IEEE.

Silverman, B. W. 2018. Density Estimation for Statistics and Data Analysis. Routledge.

Singh, S., and Póczos, B. 2014. Exponential concentration of a density functional estimator. In Advances in Neural Information Processing Systems, 3032–3040.

Singh, S., and Póczos, B. 2016a. Analysis of k-nearest neighbor distances with application to entropy estimation. arXiv preprint arXiv:1603.08578.

Singh, S., and Póczos, B. 2016b. Finite-sample analysis of fixed-k nearest neighbor density functional estimators. In Advances in Neural Information Processing Systems, 1217–1225.

Sricharan, K.; Wei, D.; and Hero, A. O. 2013. Ensemble estimators for multivariate entropy estimation. IEEE Transactions on Information Theory 59(7):4374–4388.

Van de Cruys, T. 2011. Two multivariate generalizations of pointwise mutual information. In Proceedings of the Workshop on Distributional Semantics and Compositionality, 16–20. Association for Computational Linguistics.

Wang, Q.; Kulkarni, S. R.; and Verdú, S. 2009. Divergence estimation for multidimensional densities via k-nearest-neighbor distances. IEEE Transactions on Information Theory 55(5):2392–2405.
