Community detection with the non-backtracking operator
Marc Lelarge
INRIA-ENS
Aalto University, Helsinki, October 2016
Motivation
Community detection in social or biological networks in the sparse regime with a small average degree.
Adamic, Glance ’05
Performance analysis of spectral algorithms on a toy model (where the ground truth is known!).
A model: the stochastic block model
The sparse stochastic block model
A random graph model on n nodes with three parameters a, b, c ≥ 0.
Assign each vertex a spin +1 or −1 uniformly at random.
Independently for each pair (u, v):
if σu = σv = +1, draw the edge w.p. a/n;
if σu ≠ σv, draw the edge w.p. b/n;
if σu = σv = −1, draw the edge w.p. c/n.
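A minimal sketch of sampling from this model (function name and parameter choices are ours, not from the slides):

```python
import numpy as np

def sample_sbm(n, a, b, c, seed=None):
    """Sample the sparse SBM: uniform +/-1 spins, edge probabilities
    a/n (both spins +1), b/n (opposite spins), c/n (both spins -1)."""
    rng = np.random.default_rng(seed)
    sigma = rng.choice([-1, 1], size=n)
    spin_sum = np.add.outer(sigma, sigma)          # 2, 0 or -2
    P = np.where(spin_sum == 2, a / n,
                 np.where(spin_sum == -2, c / n, b / n))
    upper = np.triu(rng.random((n, n)) < P, k=1)   # decide each pair once
    A = (upper | upper.T).astype(int)              # symmetric, no self-loops
    return A, sigma
```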
Community detection problem
Reconstruct the underlying communities (i.e. the spin configuration σ) from one realization of the graph.
Asymptotics: n → ∞.
Sparse graph: the parameters a, b, c are fixed.
Notion of performance:
w.h.p. strictly less than half of the vertices are misclassified
= positively correlated partition.
A first attempt: looking at degrees
The degree of a vertex in community +1 is
D+ ∼ Bin(n/2 − 1, a/n) + Bin(n/2, b/n).
We have
E[D+] ≈ (a+b)/2 and Var(D+) ≈ (a+b)/2,
and similarly, in community −1,
E[D−] ≈ (c+b)/2 and Var(D−) ≈ (c+b)/2.
Clustering based on degrees should 'work' as soon as
(E[D+] − E[D−])² ≫ max(Var(D+), Var(D−)),
i.e. (ignoring constant factors)
(a − c)² ≫ b + max(a, c).
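As a quick numerical illustration (parameter values are our own, chosen so that (a−c)² is comfortably larger than b + max(a, c)), thresholding the degrees halfway between the two expected means already recovers most of the partition:

```python
import numpy as np

rng = np.random.default_rng(1)
n, a, b, c = 4000, 12.0, 2.0, 4.0        # (a-c)^2 = 64 >> b + max(a,c) = 14
sigma = rng.choice([-1, 1], size=n)
spin_sum = np.add.outer(sigma, sigma)
P = np.where(spin_sum == 2, a / n, np.where(spin_sum == -2, c / n, b / n))
upper = np.triu(rng.random((n, n)) < P, k=1)
A = (upper | upper.T).astype(int)

deg = A.sum(axis=1)
threshold = ((a + b) / 2 + (c + b) / 2) / 2      # midpoint of the two means
guess = np.where(deg > threshold, 1, -1)
accuracy = max(np.mean(guess == sigma), np.mean(guess == -sigma))
```

On typical runs the accuracy is well above one half, i.e. a positively correlated partition.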
Is it any good?
Data: A, the adjacency matrix of the graph.
We define the mean column for each community:
A+ = (1/n) (a, …, a, b, …, b)ᵀ  and  A− = (1/n) (b, …, b, c, …, c)ᵀ.
The variance of each entry is ≤ max(a, b, c)/n.
Pretend the columns are i.i.d. spherical Gaussians and k = n…
Clustering a mixture of Gaussians
Consider a mixture of two spherical Gaussians in R^n with respective means m1 and m2 and variance σ².
Pb: given k samples ∼ ½ N(m1, σ²) + ½ N(m2, σ²), recover the unknown parameters m1, m2 and σ².
Doing better than naive algorithm
If ‖m1 − m2‖² ≫ nσ², then the densities 'do not overlap' in R^n. Projection preserves the variance σ². So projecting onto the line through m1 and m2 gives 1-dim. Gaussian variables with no overlap as soon as ‖m1 − m2‖² ≫ σ². We gain a factor of n.
Algorithm for clustering a mixture of Gaussians
Each sample is a column of the matrix
A = (A1, A2, …, Ak) ∈ R^{n×k}.
Consider the SVD of A:
A = Σ_{i=1}^{n} λi ui viᵀ,  with ui ∈ R^n, vi ∈ R^k, λ1 ≥ λ2 ≥ …
Then the best approximation of the direction (m1, m2) given by the data is u1.
Project the points from R^n onto this line and then do the clustering.
Provided k is large enough, this 'works' as soon as
‖m1 − m2‖² ≫ σ².
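A compact sketch of this algorithm (the dimensions, separation and spherical-Gaussian setup below are illustrative choices of ours): the separation is chosen below nσ², so the naive rule would struggle, yet after projection on the top singular vector the two 1-dim Gaussians barely overlap.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 50, 400
m1 = np.zeros(n); m1[0] = 3.0            # ||m1 - m2||^2 = 36 < n*sigma^2 = 50
m2 = -m1                                 # but 36 >> sigma^2 = 1
labels = rng.choice([0, 1], size=k)
means = np.where(labels[None, :] == 0, m1[:, None], m2[:, None])
A = means + rng.standard_normal((n, k))  # columns are the k samples
u1 = np.linalg.svd(A, full_matrices=False)[0][:, 0]  # top left singular vector
proj = u1 @ A                            # project samples on the u1 line
guess = (proj > 0).astype(int)
accuracy = max(np.mean(guess == labels), np.mean(guess == 1 - labels))
```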
Back to our clustering problem
Data:Athe adjacency matrix of the graph.
The mean columns for each community are:
A+ = (1/n) (a, …, a, b, …, b)ᵀ  and  A− = (1/n) (b, …, b, c, …, c)ᵀ.
The variance of each entry is ≤ max(a, b, c)/n.
Heuristics for community detection
The naive algorithm should work as soon as
‖A+ − A−‖² ≫ n · max(a, b, c)/n   (n dimensions times the per-entry variance),
i.e.
(a−b)² + (b−c)² ≫ n max(a, b, c).
Spectral clustering should allow a gain of n, i.e.
(a−b)² + (b−c)² ≫ max(a, b, c).
Our previous analysis shows that clustering based on degrees works as soon as
(a−c)² ≫ max(a, b, c).
When a = c, the degrees give no information.
The sparse symmetric stochastic block model
A random graph model on n nodes with two parameters a, b ≥ 0.
Independently for each pair (u, v):
if σu = σv, draw the edge w.p. a/n;
if σu ≠ σv, draw the edge w.p. b/n.
Heuristic: spectral should work as soon as (a−b)² ≫ a + b.
Efficiency of Spectral Algorithms
Boppana ’87, Condon, Karp ’01, Carson, Impagliazzo ’01, McSherry ’01, Kannan, Vempala, Vetta ’04...
Theorem
Suppose that for sufficiently large constants K and K′,
(a−b)²/(a+b) ≥ K + K′ ln(a+b);
then 'trimming + spectral + greedy improvement' outputs a positively correlated (almost exact) partition w.h.p.
Coja-Oghlan ’10
Heuristic based on the analogy with a mixture of Gaussians:
(a−b)² ≫ a + b.
Another look at spectral algorithms
Take a finite, simple, undirected graph G = (V, E).
Adjacency matrix: symmetric, indexed by vertices; for u, v ∈ V, Auv = 1({u,v} ∈ E).
Low-rank approximation of the adjacency matrix works as soon as
(a−b)² ≫ a + b.
Spectral analysis
Assume that a → ∞ and a − b ≈ √(a+b), so that a ∼ b.
A = ((a+b)/2) (1/√n)(1ᵀ/√n) + ((a−b)/2) (σ/√n)(σᵀ/√n) + (A − E[A]).
(a+b)/2 is the mean degree, and the degrees in the graph are very concentrated if a ≫ ln n. We can construct
A − ((a+b)/(2n)) J = ((a−b)/2) (σ/√n)(σᵀ/√n) + (A − E[A]).
Spectrum of the noise matrix
The matrix A − E[A] is a symmetric random matrix with independent centered entries of variance ∼ a/n.
To get convergence to the Wigner semicircle law, we normalize the variance to 1/n:
ESD((A − E[A])/√a) → μsc(x) = (1/2π) √(4 − x²) if |x| ≤ 2, and 0 otherwise.
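A quick numerical check of this normalization (parameters are our own; we normalize by the mean degree (a+b)/2 in place of a, which is equivalent in the regime a ∼ b, and the centering neglects a negligible diagonal term):

```python
import numpy as np

rng = np.random.default_rng(0)
n, a, b = 2000, 60.0, 40.0
sigma = rng.choice([-1, 1], size=n)
P = np.where(np.add.outer(sigma, sigma) != 0, a / n, b / n)
upper = np.triu(rng.random((n, n)) < P, k=1)
A = (upper | upper.T).astype(float)
W = (A - P) / np.sqrt((a + b) / 2)   # center and normalize the variance to 1/n
evals = np.linalg.eigvalsh(W)
# the bulk should fill [-2, 2], the support of the semicircle law
```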
Naive spectral analysis
To sum up, we can construct
M = (1/√a) (A − ((a+b)/(2n)) J) = θ (σ/√n)(σᵀ/√n) + (A − E[A])/√a,
with θ = (a−b)/√(2(a+b)).
We should be able to detect the signal as soon as
θ > 2 ⇔ (a−b)²/(2(a+b)) > 4.
We can do better!
A lower bound on the spectral radius of M = θ (σ/√n)(σᵀ/√n) + W:
λ1(M) = sup_{‖x‖=1} ‖Mx‖ ≥ ‖M σ/√n‖.
But
‖M σ/√n‖² = θ² + ‖W σ/√n‖² + 2θ ⟨W σ/√n, σ/√n⟩ ≈ θ² + (1/n) Σ_{i,j} W²ij ≈ θ² + 1.
As a result, we get
λ1(M) > 2 ⇔ θ > 1 ⇔ (a−b)² > 2(a+b).
Baik, Ben Arous, Péché phase transition
Rank-one perturbation of a Wigner matrix:
λ1(θ σσᵀ + W) →a.s. θ + 1/θ if θ > 1, and 2 otherwise.
Let σ̃ be the eigenvector associated with λ1(θ σσᵀ + W); then
|⟨σ̃, σ⟩|² →a.s. 1 − 1/θ² if θ > 1, and 0 otherwise.
Watkin, Nadal ’94; Baik, Ben Arous, Péché ’05; Newman, Rao ’14
For the SBM with a, b → ∞,
θ² = (a−b)²/(2(a+b)) > 1.
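The transition is easy to see numerically (a finite-n sketch with our own parameters; at n = 1000 the limits are only approximate):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
G = rng.standard_normal((n, n))
W = (G + G.T) / np.sqrt(2 * n)          # Wigner matrix, bulk edge at 2
v = np.ones(n) / np.sqrt(n)             # unit-norm signal direction sigma
top = {}
for theta in (0.5, 3.0):
    vals, vecs = np.linalg.eigh(theta * np.outer(v, v) + W)
    top[theta] = (vals[-1], (vecs[:, -1] @ v) ** 2)
# theta = 0.5 < 1: top eigenvalue sticks near 2, overlap near 0
# theta = 3.0 > 1: top eigenvalue near theta + 1/theta, overlap near 1 - 1/theta^2
```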
When a, b → ∞ spectral is optimal
SBM with n = 2000, average degree 50 and (a−b)²/(2(a+b)) = 2.
Random matrix theory predicts λ1 = 51, λ2 = 15 and noise at |λ3| < 14.14.
Decreasing the average degree
SBM with n = 2000, average degree 10 and (a−b)²/(2(a+b)) = 2.
Random matrix theory predicts λ1 = 11, λ2 = 6.7 and noise at |λ3| < 6.3.
Problems when the average degree is small
SBM with n = 2000, average degree 3 and (a−b)²/(2(a+b)) = 2.
Random matrix theory predicts λ1 = 4, λ2 = 3.67 and noise at |λ3| < 3.46.
Problems when the average degree is finite
High-degree nodes: a star with degree d has eigenvalues {−√d, 0, √d}.
In the regime where a and b are finite, the degrees are asymptotically Poisson with mean (a+b)/2. The adjacency matrix has eigenvalues of order √(ln n / ln ln n) (the maximal degree grows like ln n / ln ln n).
Low-degree nodes: instead of the adjacency matrix, take the (normalized) Laplacian, but then isolated edges produce spurious eigenvalues.
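The star-eigenvalue fact is easy to verify directly (d = 9 is an arbitrary choice of ours):

```python
import numpy as np

d = 9
A = np.zeros((d + 1, d + 1))
A[0, 1:] = A[1:, 0] = 1.0            # star: center 0 joined to d leaves
evals = np.linalg.eigvalsh(A)
# spectrum is {-sqrt(d), 0 (multiplicity d-1), sqrt(d)}
```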
Problems when the average degree is small
Same graph after trimming.
Phase transition for a, b = O(1)
Theorem
τ = (a−b)²/(2(a+b)).
If τ > 1, then positively correlated reconstruction is possible.
If τ < 1, then positively correlated reconstruction is impossible.
Conjectured by Decelle, Krzakala, Moore, Zdeborová ’11 based on statistical physics arguments.
Non-reconstruction proved by Mossel, Neeman, Sly ’12.
Reconstruction proved by Massoulié ’13 and Mossel, Neeman, Sly ’13.
Regularization through the non-backtracking matrix
Let E⃗ = {u → v : {u, v} ∈ E} be the set of oriented edges; m = |E⃗| is twice the number of unoriented edges.
The non-backtracking matrix is the m × m matrix defined by
B_{u→v, v→w} = 1({u,v} ∈ E) 1({v,w} ∈ E) 1(u ≠ w).
B is NOT symmetric: Bᵀ ≠ B. We denote its eigenvalues by λ1, λ2, …, with |λ1| ≥ … ≥ |λm|.
Proposed by Krzakala et al. ’13.
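The definition translates directly into code (a small, dense construction of ours, only meant for tiny graphs):

```python
import numpy as np

def nonbacktracking(edges):
    """Non-backtracking matrix on the oriented edges of an unoriented edge list."""
    oriented = list(edges) + [(v, u) for u, v in edges]
    m = len(oriented)
    B = np.zeros((m, m))
    for i, (u, v) in enumerate(oriented):
        for j, (vv, w) in enumerate(oriented):
            # v -> w can follow u -> v iff it starts at v and does not backtrack
            B[i, j] = float(vv == v and w != u)
    return B, oriented
```

On a triangle each oriented edge has exactly one non-backtracking successor, so B is a permutation matrix; consistently with the d-regular case below, its spectral radius is d − 1 = 1.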
Ihara-Bass’ Identity
Let D be the diagonal matrix with Dvv = deg(v). We have
det(z·Id − B) = (z² − 1)^{|E|−|V|} det(z²·Id − zA + D − Id).
If G is d-regular, then D = d·Id and
σ(B) = {±1} ∪ {λ : λ² − λμ + (d−1) = 0 with μ ∈ σ(A)}.
Non-Backtracking matrix of regular graphs
For a d-regular graph, λ1 = d − 1.
Alon-Boppana bound: max_{k≠1} Re(λk) ≥ √λ1 − o(1).
Ramanujan (non-bipartite): |λ2| = √λ1.
Friedman's thm: |λ2| ≤ √λ1 + o(1) if G is a uniformly random d-regular graph.
Simulation for Erdős–Rényi Graph
Eigenvalues of B for an Erdős–Rényi graph G(n, λ/n) with n = 500 and λ = 4.
Erdős–Rényi Graph
Eigenvalues of B: λ1 ≥ |λ2| ≥ …
Theorem
Let λ > 1 and G with distribution G(n, λ/n). With high probability,
λ1 = λ + o(1),  |λ2| ≤ √λ + o(1).
Bordenave, Lelarge, Massoulié ’15
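Rather than forming the m × m matrix B, one can read off its nontrivial spectrum from the 2n × 2n companion matrix [[A, Id − D], [Id, 0]] given by the Ihara-Bass identity (a standard trick; the parameters below are our own, and finite-size effects blur the circles):

```python
import numpy as np

rng = np.random.default_rng(0)
n, lam = 800, 4.0
upper = np.triu(rng.random((n, n)) < lam / n, k=1)
A = (upper | upper.T).astype(float)
D = np.diag(A.sum(axis=1))
Bp = np.block([[A, np.eye(n) - D],
               [np.eye(n), np.zeros((n, n))]])   # companion form of B
mods = np.sort(np.abs(np.linalg.eigvals(Bp)))[::-1]
# expect mods[0] near lam = 4 and the rest (roughly) inside the circle sqrt(lam) = 2
```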
Simulation for Stochastic Block Model
Eigenvalues of B for a stochastic block model with n = 2000, mean degree (a+b)/2 = 3 and (a−b)/2 = 2.45.
Stochastic Block Model
Eigenvalues of B: λ1 ≥ |λ2| ≥ …
Theorem
Let G be a stochastic block model with parameters a, b. If (a−b)² > 2(a+b), then with high probability,
λ1 = (a+b)/2 + o(1),  λ2 = (a−b)/2 + o(1),  |λ3| ≤ √((a+b)/2) + o(1).
Bordenave, Lelarge, Massoulié ’15
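The same companion-matrix trick gives a full detection sketch on the symmetric SBM (our own parameters, with (a−b)² = 36 > 2(a+b) = 20; the signs of the second eigenvector estimate the communities):

```python
import numpy as np

rng = np.random.default_rng(0)
n, a, b = 600, 8.0, 2.0
sigma = rng.choice([-1, 1], size=n)
P = np.where(np.add.outer(sigma, sigma) != 0, a / n, b / n)
upper = np.triu(rng.random((n, n)) < P, k=1)
A = (upper | upper.T).astype(float)
D = np.diag(A.sum(axis=1))
Bp = np.block([[A, np.eye(n) - D],
               [np.eye(n), np.zeros((n, n))]])
vals, vecs = np.linalg.eig(Bp)
order = np.argsort(-np.abs(vals))
lam1 = vals[order[0]].real               # theory: (a+b)/2 = 5
lam2 = vals[order[1]].real               # theory: (a-b)/2 = 3, bulk ~ sqrt(5)
guess = np.sign(vecs[:n, order[1]].real)
accuracy = max(np.mean(guess == sigma), np.mean(guess == -sigma))
```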
Test with real benchmarks
The Power Law Shop
The non-backtracking matrix on real data
from Krzakala, Moore, Mossel, Neeman, Sly, Zdeborová ’13
Back to political blogging network data
Non-symmetric Stochastic Block Model
Consider the case where there is a small community of size pn with p < 1/2; the SNR is then given by d(1−b)², where d is the average degree.
[Phase diagram: SNR vs p, with the Kesten–Stigum line and p* delimiting the EASY, HARD and IMPOSSIBLE regions.]
Phase diagram with p* = 1/2 − 1/(2√3). Lelarge, Caltagirone & Miolane ’16
Some extensions
For the labeled stochastic block model, we also conjecture a phase transition. We have partial results and an optimal spectral algorithm.
Saade, Krzakala, Lelarge, Zdeborová ’15, ’16
Some extensions
The non-backtracking matrix also works for the degree-corrected SBM.
Ongoing work with Gulikers and Massoulié.
We can adapt the non-backtracking matrix to deal with small cliques.
[Spectrum plot of the adapted non-backtracking matrix.]
Ongoing work with Caltagirone.
Some extensions
SBM with no noise (b = 0) but with overlap.
Spectrum of the non-backtracking operator with n = 1200, sn = 400, and a = 9 and 13. The circle has radius √(a(2−3s)) in each case.
Kaufmann, Bonald, Lelarge ’16
Non-backtracking vs adjacency
On the sparse stochastic block model with intra-edge probability a/n and inter-edge probability b/n:
The problem: if a, b → ∞, we get Wigner's semicircle law + the BBP phase transition, but if a, b < ∞ as n → ∞, we get Lifshitz tails.
The solution: the non-backtracking matrix on the oriented edges of the graph, B_{u→v, v→w} = 1({u,v} ∈ E) 1({v,w} ∈ E) 1(u ≠ w), achieves optimal detection on the SBM.
THANK YOU!