PART III: DEVELOPMENT AND EVALUATION OF MODEL IDENTIFICATION

8.5   RESULTS WITH MCMC-GENERATED SYNTHETIC DATA

8.5.4   EFFECT OF DATA CHARACTERISTICS

This subsection examines how data characteristics, i.e., the type of node load distribution, node neighbourhood size, data set size, and network size, affect topology identification. Furthermore, the quality of the synthetic data is ascertained by studying the impact of the length of the burn-in period in MCMC data generation. The functional relationships between the overall coherence measures, , ASSMI, and , and the graph correlations as functions of are the main focus here.

Figure 8.6. Distributions of internode distances in estimated node location maps of true graph neighbours (left-hand-side bars) and of all nodes (right-hand-side bars), and corresponding distribution similarity measures (bottom). The distribution plots from top-left to middle-right are shown for the following parameter values: 0.02, 0.04, 0.08, 0.12, 0.16, and 0.20. KLD (bottom-left), JSD (bottom-centre), and the CSS approximation of KLD (bottom-right) are shown for all 21 parameterisations as functions of the parameter value. Calculation: for each parameter value, histograms are calculated by using all the data of the three ensembles, with bars defined at equal intervals. The histograms are then represented in the form of probability densities. The number of bars is the same for each parameter value, but the range changes according to the distance values. Similarity measures are obtained from the distributions.
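The calculation described in the caption of Figure 8.6 can be sketched as follows. This is a minimal illustration, not the procedure actually used: the function names, the small smoothing constant `eps` that avoids division by zero in empty bars, and the synthetic gamma-distributed distances are all assumptions. Because the bars are of equal width, the sketch works with the probability mass per bar, which differs from the density only by a constant factor. The CSS approximation of KLD is not covered.

```python
import numpy as np

def histogram_mass(samples, edges):
    """Relative frequency of the samples in each bar defined by `edges`."""
    counts, _ = np.histogram(samples, bins=edges)
    return counts / counts.sum()

def kld(p, q, eps=1e-12):
    """Kullback-Leibler divergence KLD(p || q) between two discrete distributions.

    `eps` is a hypothetical smoothing constant that keeps empty bars finite.
    """
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p /= p.sum()
    q /= q.sum()
    return float(np.sum(p * np.log(p / q)))

def jsd(p, q):
    """Jensen-Shannon divergence: symmetrised KLD against the mixture of p and q."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    m = 0.5 * (p + q)
    return 0.5 * kld(p, m) + 0.5 * kld(q, m)

# Synthetic stand-ins for the two sets of internode distances (assumed data).
rng = np.random.default_rng(0)
neighbour_distances = rng.gamma(2.0, 0.5, size=300)
all_distances = rng.gamma(4.0, 0.5, size=3000)

# Same number of bars for both sets; the range follows the observed distances.
upper = max(neighbour_distances.max(), all_distances.max())
edges = np.linspace(0.0, upper, 21)

p = histogram_mass(neighbour_distances, edges)
q = histogram_mass(all_distances, edges)
print("KLD:", kld(p, q), "JSD:", jsd(p, q))
```

Unlike KLD, JSD is symmetric in its two arguments and bounded above by ln 2, which makes it easier to compare across parameterisations.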

First, the distribution from which node loadings are randomly drawn is varied. Besides the uniform distribution Uni(0, 1) of the reference case, normally distributed node loadings (mean 0.5, variance 0.25) and exponentially distributed node loadings with Exp(0.58) (mean 0.58) are studied. The results with these three loading distributions are shown in Figure 8.8. With exponentially distributed loadings, ASSMI assumes a slightly different functional form, and assumes clearly smaller values than with the other two distribution types. The differences may result from the smaller median value, which is 0.4 with the Exp(0.58) distribution. The graph correlation shows rather high and similar values with all three load distributions, implying that the MGMN method works well regardless of the type of node load distribution.
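The median values behind this comparison can be checked with a short worked calculation, assuming that Exp(0.58) is parameterised by its mean, as the text states. The symbol μ for the mean is introduced here only for illustration.

```latex
\operatorname{median}\bigl[\mathrm{Exp}(\mu)\bigr] = \mu \ln 2
\quad\Longrightarrow\quad
0.58 \times \ln 2 \approx 0.58 \times 0.693 \approx 0.40,
\qquad
\operatorname{median}\bigl[\mathrm{Uni}(0,1)\bigr] = 0.5 .
```

A normal distribution is symmetric about its mean, so its median is also 0.5. Only the exponential case therefore has a median below 0.5.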

Next, the neighbourhood size is varied as 6.8, 8.8 (reference), and 10.8. The results are shown in Figure 8.8, where the range of differs slightly from the rest of the figures.

Obviously, the larger the value and the network connectivity, the larger the coherence with the values of and ASSMI being large. In fact, with the neighbourhood size of 10.8 and a few largest values, the coherence is so great that nearly all nodes appear constantly in equal states, entailing problems with topology estimation. Among the cases with 0.16, only one ensemble at 0.19 yielded topology estimates, this value corresponding to the GC value, which is clearly distinct from the remaining values at 33. Otherwise, topology is successfully identified in each case, though small values tend to yield somewhat better graph structures.

The effect of data size is tested with data set sizes of 270 (reference), 540, and 1080. Figure 8.8 shows that both ASSMI and GC values are affected. However,  remains practically unchanged, because changing the data set size does not change average node states or loadings. The ASSMI increases with data set size, because large data sets are more informative about node dependencies. As node dependencies are better estimated with larger data sets, the graph correlation also increases; the change in GC from 270 to 540 is particularly large, whereas the difference between 540 and 1080 is small. Consequently, the smallest data set still seems too small to estimate SSMI values accurately and thereby to obtain an accurate topology estimate, whereas the second largest data size already gives results similar to the largest one and is thus large enough.

Figure 8.7. Histograms of true graph distances of estimated graph neighbours. Bars with each graph distance from left to right correspond to increasing parameter values from 0 to 0.20 at even intervals of 0.01. Calculation: for each parameter value, histograms are first calculated for all three ensembles, then the hits at each bar are summed over the three ensembles, and finally the number of hits at each bar is divided by the total number of hits in the three ensembles together.
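The pooled histogram described in the caption of Figure 8.7 could be computed along the following lines. This is a sketch under assumptions: graphs are given as adjacency lists, estimated neighbours as node pairs, and the function names and the toy four-node path graph are hypothetical. True graph distances are obtained with a breadth-first search.

```python
from collections import Counter, deque

def graph_distances(adj_list, source):
    """Shortest-path distances (in edges) from `source` by breadth-first search."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj_list[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def pooled_distance_histogram(ensembles):
    """Relative frequency of true graph distances over all ensembles together.

    `ensembles` is a list of (true_adj_list, estimated_edges) pairs. Hits are
    summed over the ensembles and divided by the total number of hits.
    """
    hits = Counter()
    for true_adj, estimated_edges in ensembles:
        for i, j in estimated_edges:
            # None marks a pair that is disconnected in the true graph.
            hits[graph_distances(true_adj, i).get(j)] += 1
    total = sum(hits.values())
    return {d: count / total for d, count in hits.items()}

# Toy example: true graph is the path 0-1-2-3, with two hypothetical ensembles.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
ensembles = [
    (path, [(0, 1), (1, 3), (2, 3)]),  # true distances 1, 2, 1
    (path, [(0, 3)]),                  # true distance 3
]
histogram = pooled_distance_histogram(ensembles)
print(histogram)
```

A perfect topology estimate would put all the mass at distance 1, so the share of hits at larger distances indicates how far the estimated neighbours lie from the true ones.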

Testing the effect of network size is more complicated, because increasing the number of nodes in a network should be coupled with a simultaneous increase in data set size. The following network sizes are studied here: 30 (reference), 60, and 120. Tests are run in three ways: the data set size is first kept constant for each network size, then increased linearly, and finally quadratically in the network size. The rationale for the last case is that the number of node pairs in a network grows quadratically in the number of nodes. To obtain data of the same quality for each network size, the number of steps in the MCMC burn-in period must be increased linearly in the number of nodes, as 500 times the number of nodes. The neighbourhood size is constant at 8.8 for all network sizes.
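The three scaling schemes and the burn-in rule can be made concrete with a small sketch. The function names and the choice to scale relative to the reference case (30 nodes, 270 observations) are assumptions made for illustration; the factor of 500 steps per node is taken from the text.

```python
def node_pairs(n_nodes):
    """Number of unordered node pairs, which grows quadratically in n_nodes."""
    return n_nodes * (n_nodes - 1) // 2

def data_set_size(n_nodes, mode, ref_nodes=30, ref_size=270):
    """Data set size under constant, linear, or quadratic scaling.

    Scaling relative to the reference case is an assumption of this sketch.
    """
    ratio = n_nodes / ref_nodes
    if mode == "constant":
        return ref_size
    if mode == "linear":
        return round(ref_size * ratio)
    if mode == "quadratic":
        return round(ref_size * ratio ** 2)
    raise ValueError(f"unknown scaling mode: {mode}")

def burn_in_steps(n_nodes, steps_per_node=500):
    """MCMC burn-in length increased linearly in the number of nodes."""
    return steps_per_node * n_nodes

for n in (30, 60, 120):
    print(
        n,
        node_pairs(n),
        data_set_size(n, "constant"),
        data_set_size(n, "linear"),
        data_set_size(n, "quadratic"),
        burn_in_steps(n),
    )
```

Under these assumptions, linear scaling gives 540 and 1080 observations for the 60- and 120-node networks, and quadratic scaling gives 1080 observations for the 60-node network, which matches the data set sizes tested in this subsection.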

Figure 8.8. Effect of load distribution type (top row), node neighbourhood size (middle row), and data set size (bottom row) on topology identification. (left column) and ASSMI (centre column) are shown as functions of , and GC (right column) as a function of . Top row: exponential (squares), uniform (circles), and normal (triangles) node load distributions. Middle row: neighbourhood sizes 6.8 (squares), 8.8 (circles), and 10.8 (triangles). Bottom row: data set sizes 270 (circles), 540 (squares), and 1080 (triangles). Calculation: measures are medians over the respective values with the three ensembles.

Figure 8.9 shows the results for a constant data size of 270. With the smallest network, both the ASSMI and GC assume clearly larger values than with the two larger networks. Consequently, with the two larger networks, the data set is all too small to uncover node dependencies and to obtain adequate topology estimates. ′ remains nearly unchanged when the network size varies, which implies that the data assumes similar values with each network size. However, the ASSMI varies and is particularly different with the 30-node network, assuming clearly greater values than with the two larger networks. The conclusion is that the data set size must be increased to gain reasonable topology estimates also with large networks.

Next, the data set size is increased linearly in the network size (results in Figure 8.9). Because the computation time increases greatly with network size, as the number of node pairs grows quadratically in the number of nodes, only single ensembles are studied with the 60- and 120-node networks. Consequently, the results may contain some extra uncertainty. Though the coherence measures behave as in the previous case, GC values are similar for all the network sizes, suggesting that a linear increase is indeed suitable. Even more computation time is required to test the quadratic increase of data set size in network size. Therefore, again only a single ensemble is studied, and only the 60-node network is tested, with data set sizes of 270, 540, and 1080 (results in Figure 8.9). The results for 270 are the worst, whereas with the two larger data sets, despite rather heavy fluctuations, GC assumes similar values that are almost those of the reference case. ASSMI values are similar with the two larger data sizes, whereas with 270 they are somewhat smaller. ′ again remains nearly constant, indicating that the data is similar, as it should be, because in each case it is generated with the same Ising model parameters and with the same graph structure.

Figure 8.9. Effect of network size on topology identification. (left column) and ASSMI (centre column) are shown as functions of , and GC (right column) as a function of . Top row: network sizes 30 (circles), 60 (squares), and 120 (triangles), each with data set size 270. Middle row: network size 30 with data set size 270 (circles), 60 with 540 (squares), and 120 with 1080 (triangles). Bottom row: data set sizes 270 (circles), 540 (squares), and 1080 (triangles), each with network size 60. Calculation: for network size 30, measures are medians over the respective values with the three ensembles. For the other network sizes, measures are calculated from single ensembles.

In conclusion, the question remains why the ASSMI seems to depend on network size, or at least differs greatly with the 30-node network when compared to the two larger networks. The reason may simply lie in the particular properties of that randomly generated network: because the network is so small, even a few highly connected nodes may drastically affect its properties, contributing to high coherence and a high ASSMI.

Finally, the quality of the data used in the above analyses is checked by changing the number of steps in the burn-in period of the MCMC data generation procedure. A well-chosen burn-in period is especially important here, because the ensemble scheme is used to generate data observations. If too short a burn-in period is chosen, the Markov chain of the MCMC does not converge to the stationary distribution of the respective Ising model, and the samples are not valid draws from that distribution. If too long a burn-in period is chosen, some time is wasted on sample generation, but the quality of the generated data set remains unaffected. With the 30-node network, the number of MCMC steps is varied here from the reference case's 500 × 30 to 250 × 30 and to 1000 × 30. As Figure 8.10 shows, all three cases yield similar results, confirming that the length of the burn-in period in the reference case is long enough to generate reasonably good network observations.
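The role of the burn-in period in an ensemble scheme can be illustrated with a minimal single-site Gibbs sampler. This is a sketch only, not the generation procedure used here: the ±1 state coding, the field vector `h`, the single `coupling` parameter, the interpretation of one MCMC step as one single-node update, and the toy six-node ring graph are all assumptions. Each observation comes from its own chain, started from a random state and read out once after the burn-in.

```python
import numpy as np

def gibbs_ensemble_sample(adj, h, coupling, n_obs, burn_in, rng):
    """Draw n_obs observations, each from an independent chain after burn-in.

    Assumed model: P(s) proportional to
    exp(sum_i h[i] * s[i] + coupling * sum_{i<j} adj[i, j] * s[i] * s[j]),
    with states s[i] in {-1, +1}.
    """
    n_nodes = adj.shape[0]
    data = np.empty((n_obs, n_nodes), dtype=int)
    for k in range(n_obs):
        s = rng.choice([-1, 1], size=n_nodes)  # random initial state
        for _ in range(burn_in):
            i = rng.integers(n_nodes)          # pick one node to update
            local_field = h[i] + coupling * (adj[i] @ s)
            p_up = 1.0 / (1.0 + np.exp(-2.0 * local_field))
            s[i] = 1 if rng.random() < p_up else -1
        data[k] = s                            # keep only the final state
    return data

# Toy ring graph with six nodes (hypothetical example network).
n_nodes = 6
adj = np.zeros((n_nodes, n_nodes), dtype=int)
for i in range(n_nodes):
    adj[i, (i + 1) % n_nodes] = adj[(i + 1) % n_nodes, i] = 1

rng = np.random.default_rng(0)
h = np.zeros(n_nodes)

# Burn-in lengths of 250, 500, and 1000 steps per node, as in the comparison above.
samples = {
    steps_per_node: gibbs_ensemble_sample(
        adj, h, coupling=0.1, n_obs=8, burn_in=steps_per_node * n_nodes, rng=rng
    )
    for steps_per_node in (250, 500, 1000)
}
for steps_per_node, data in samples.items():
    print(steps_per_node, data.shape, data.mean())
```

Because every observation pays the full burn-in cost in this scheme, doubling the burn-in doubles the generation time, which is why the comparison checks whether the shorter setting already suffices.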