• Ei tuloksia

2. Graph Theory

2.4 Random graph generation models

Ck(ni) =

s∈V

σks(ni)

σks , (2.6)

whereσks(ni)is the number ofk-paths (a path of length k) originating from nodesand passing through node ni, and σks is the total number of k-paths originating from node s. k-path centrality of an edge is defined by replacing the node with an edge in the previous definition.

Corresponding to the degree centrality of a node, the degree product of an edge—the product of the degrees of the two end nodes of the edge—is used as a centrality measurement for edges.

Authenticity measures whether an edge follows the clustering proper-ties in a graph or not. Zhang et. al. [P2] defined the authenticity score as

aeij =meij−eeij, (2.7) wheremeij is the actual number of edges that connect the two groups of the neighboring nodes of the two end nodes of edge eij, and eeij is the expected number of edges between these two groups of nodes.

Edge authenticity can be used for graph clustering, outlier detection, and graph data preprocessing [P2, P4].

2.4 Random graph generation models

Random graph generation model has been an important research topic for the last several decades [73]. Numerous models have been proposed to generate graphs by stochastic processes to simulate or mimic real-world graphs. The most important aspect of a good random graph generation model is that the generated graphs show similar properties as those real-world graphs. This section will not give a thorough review of these models. Instead, we will only discuss some graph generation

models that have a huge impact on the research in this field and the models that are used in other sections of this thesis.

Erdős–Rényi model is the first random graph generation model and the most well-studied one [10]. A commonly used Erdős–Rényi model is G(n, p) model where a graph of nnodes is generated by randomly connecting two nodes by an edge with probability p. Many impor-tant properties have been found about the graphs generated by the Erdős–Rényi model. For example, ln(nn) is the sharp threshold of the connectedness of graph G(n, p). If p > ln(n)n , G(n, p) is almost sure to be connected. Otherwise, it is almost surely to be disconnected.

However, random graphs generated by the Erdős–Rényi model lack many important properties that a real-world graph has. One signifi-cant shortcoming is that the generated graphs do not show clustering structure. Another important defect is that the degree distribution of the nodes is binomial, whereas the degree distribution of real-world graphs often follows the power law [74]. To overcome these short-comings, many other random graph generation models have been pro-posed.

Preferential attachment, also named as “the rich get richer” or “cumu-lative advantage”, is a principle that an entity gets more of a certain asset if it has already possessed more of this asset. Barabási and Al-bert applied this principle to the random graph growth process in their model [75, 76]. The generation process starts from a single node and the nodes are added one by one. Every time a node is added, the probability that the new node is connected to nodeni is proportional to the degree of node ni. The degree distribution of the graph gener-ated by the Barabási–Albert model follows the power law in the form of p(k) = k3. The Barabási–Albert model is a very simple process that whenever a node is added to the graph, edges can only be added to connect the newly added node. Also, whenever an edge is added,

2.4. Random graph generation models 19 it can not be removed from the graph anymore. This is obviously not the case in many real-world graphs such as social network or web page graphs. Many extensions have been proposed to address these limitations [77, 78].

Small-world is another interesting property that many real-world graphs have [38, 79]. Small-world means that the average distance between any pair of nodes in a graph is limited by a small number. It is also known as “six degrees of separation” for social networks. A small-world graph is a graph where the average path length (APL) of any pair of nodes is proportional to log(N). APL of a graph generated by the Erdős–Rényi model follows AP LER log(Nlog(k)), where k is the average number degree. APL of a graph generated by the Barabási–Albert model follows AP LBA log log(log(NN)) [79, 80, 81]. Watts and Strogatz proposed a model that generate graphs with not only the small-world property but also a constant clustering coefficient. With the Watts-Strogatz model, a ring type of lattice graph is first constructed. Then a process, similar to the Erdős–Rényi model, is applied to randomly rewire the edges with a certain probability.

Stochastic block model aims at generating graphs with another im-portant property of real-world graphs: clustering. With this model, nodes are first divided into r groups. Then ar×r probability matrix P is defined such that the element Pij defines the probability of an edge is generated to connect nodes between groupiandj. This model can be viewed as a generalization of the Erdős–Rényi model where all elements in matrix P are identical.

Fig. 2.3 shows the samples of random graphs generated by different random graph generation models.

(a) Erdős–Rényi model (b) Barabási–Albert model

(c) Watts-Strogatz model (d) Stochastic block model Figure 2.3 Random graphs generated by different generation models

21

3. GRAPH CLUSTERING AND