
Replicability on a separate sample

If a clustering is “real” in the sense that it reflects an underlying subgrouping of individuals in the original population, then two samples from that population should reveal a similar (though, due to random effects, probably not completely identical) clustering model. This idea was used in our analysis of temperament data, where a clustering model based on a sample from the Finnish population was validated by comparing it to a clustering obtained from a second sample. In this section we describe the simulations used to study whether this intuitive property actually holds. Such a check is crucial for the applicability of clustering methods.

We first describe the process at a high level. We create three datasets: two by sampling from the same distribution, and one by sampling from a distribution of a similar class but with different parameters. Call these datasets the “original”, “replication”, and “independent” samples. We learn a clustering model on each of the datasets. We then use the model built on the original dataset to assign cluster labels to individuals in the replication and independent samples, and the models built on those two samples to assign two further sets of cluster labels to individuals in the original dataset.

We combine these labels to obtain four clusterings: A1) labels for individuals in the original and replication datasets, based on the original model; A2) labels for individuals in the original and replication datasets, based on the replication model; B1) labels for individuals in the original and independent datasets, based on the original model; and B2) labels for individuals in the original and independent datasets, based on the independent model. Comparing the similarity score between A1 and A2 to that between B1 and B2 gives insight into whether two clusterings from the same population can be expected to be more similar to each other than two clusterings from different populations, given that the populations have an underlying cluster structure.
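The construction of the paired clusterings can be sketched as follows. This is a minimal illustration, not the code used in the experiments: a fitted model is represented by a hypothetical callable that maps one individual to a cluster label, and the function names are assumptions introduced here.

```python
from typing import Callable, List, Sequence, Tuple

Point = Sequence[float]
# Hypothetical stand-in for a fitted clustering model: maps one individual
# to a cluster label.
Labeler = Callable[[Point], int]


def label_all(model: Labeler, *datasets: List[Point]) -> List[int]:
    """Label every individual of the concatenated datasets with one model."""
    return [model(x) for data in datasets for x in data]


def paired_clusterings(
    original: List[Point],
    other: List[Point],
    model_original: Labeler,
    model_other: Labeler,
) -> Tuple[List[int], List[int]]:
    """Return two label vectors over original + other, one per model.

    With other = replication sample this yields (A1, A2);
    with other = independent sample it yields (B1, B2).
    """
    by_original_model = label_all(model_original, original, other)
    by_other_model = label_all(model_other, original, other)
    return by_original_model, by_other_model


# Toy usage with one-dimensional data and threshold "models".
toy_original = [[0.0], [1.0]]
toy_replication = [[0.2], [0.9]]
a1, a2 = paired_clusterings(
    toy_original,
    toy_replication,
    lambda x: int(x[0] > 0.5),
    lambda x: int(x[0] > 0.1),
)
```

Both label vectors cover the same individuals in the same order, which is what makes a cross-tabulation of the two clusterings possible.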

First, data was generated from a mixture of multivariate normal distributions as described in Section 3.1, for a fixed number k ∈ {3, 4, 5} of true centers. This distribution was sampled for two datasets of size 1000 each (the “original” and “replication” samples). In addition, a third (“independent”) sample was created from another distribution, with the number of centers randomly selected to be between k − 1 and k + 1 (inclusive).
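A minimal sketch of this sampling step is given below. The actual generator is the one described in Section 3.1; here we assume, purely for illustration, two dimensions, unit-variance spherical components, equal mixing weights, centers drawn uniformly from a cube, and an independent sample of the same size as the other two. All function names are hypothetical.

```python
import random
from typing import List, Tuple


def random_mixture(k: int, dim: int, rng: random.Random,
                   spread: float = 10.0) -> Tuple[List[List[float]], List[float]]:
    """Draw k centers uniformly in a cube, with equal weights (assumption)."""
    centers = [[rng.uniform(-spread, spread) for _ in range(dim)]
               for _ in range(k)]
    weights = [1.0 / k] * k
    return centers, weights


def sample_mixture(centers: List[List[float]], weights: List[float],
                   n: int, rng: random.Random) -> List[List[float]]:
    """Sample n points from a mixture of unit-variance spherical Gaussians."""
    points = []
    for _ in range(n):
        # Pick a component, then draw each coordinate around its center.
        c = rng.choices(range(len(centers)), weights=weights)[0]
        points.append([rng.gauss(mu, 1.0) for mu in centers[c]])
    return points


rng = random.Random(0)
k = 4  # true number of centers; the text uses 3, 4, and 5
centers, weights = random_mixture(k, dim=2, rng=rng)

# Original and replication samples share the same distribution.
original = sample_mixture(centers, weights, 1000, rng)
replication = sample_mixture(centers, weights, 1000, rng)

# The independent sample uses new parameters and k-1..k+1 centers (inclusive).
k_independent = rng.randint(k - 1, k + 1)
other_centers, other_weights = random_mixture(k_independent, dim=2, rng=rng)
independent = sample_mixture(other_centers, other_weights, 1000, rng)
```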

These datasets were clustered independently using the EM algorithm to fit mixtures of Gaussian distributions (see Section 2.2). The best k for the original sample was determined by the Bayesian information criterion (Section 2.3.2), and the replication and independent samples were forced to the same k.
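The selection of k can be illustrated with a small sketch. It assumes the common form BIC = −2 ln L + p ln N, where lower is better, and full covariance matrices for each component; the definition in Section 2.3.2 may use a different sign convention or parameterization. The log-likelihood values in the usage example are made up for illustration and are not results from the experiments.

```python
import math
from typing import Dict


def gmm_free_parameters(k: int, dim: int) -> int:
    """Free parameters of a k-component GMM with full covariances (assumption)."""
    weights = k - 1                          # mixing weights sum to one
    means = k * dim
    covariances = k * dim * (dim + 1) // 2   # symmetric matrices
    return weights + means + covariances


def bic(log_likelihood: float, k: int, dim: int, n: int) -> float:
    """BIC in the 'lower is better' convention: -2 ln L + p ln N."""
    return -2.0 * log_likelihood + gmm_free_parameters(k, dim) * math.log(n)


def select_k(log_likelihoods: Dict[int, float], dim: int, n: int) -> int:
    """Pick the k whose fitted model has the smallest BIC."""
    return min(log_likelihoods,
               key=lambda k: bic(log_likelihoods[k], k, dim, n))


# Illustrative (invented) maximized log-likelihoods for k = 2, 3, 4.
example_loglik = {2: -4000.0, 3: -3500.0, 4: -3495.0}
best_k = select_k(example_loglik, dim=2, n=1000)
```

In this toy example the small gain in likelihood from k = 3 to k = 4 does not outweigh the penalty for the extra parameters, so k = 3 is selected.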

To compare this to a situation where there is no cluster structure in the original population, data was generated from a single multivariate normal distribution for the original and replication samples, and the independent dataset was generated from a distribution with 1–5 centers, chosen at random. These data were clustered in the same way, but with k forced in turn to 3, 4, or 5, in order to simulate a situation where we mistakenly identify a cluster structure that is not there.

Alternative clusterings for the original/replication and original/independent datasets were compared by cross-tabulating the cluster labels and calculating the χ² statistic. One hundred experiments were performed for the case of an existing cluster structure and for the case of no cluster structure in the original population, for each possible value of k.
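The comparison of two label vectors can be sketched as below. This is a plain Pearson χ² statistic on the contingency table of the two labelings, written with the standard library only; the function name is an assumption, and the experiments may have used a library routine instead.

```python
from collections import Counter
from typing import Sequence


def chi_square(labels_a: Sequence[int], labels_b: Sequence[int]) -> float:
    """Pearson chi-square statistic of the cross-tabulation of two labelings."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same individuals")
    n = len(labels_a)
    table = Counter(zip(labels_a, labels_b))  # observed cell counts
    rows = Counter(labels_a)                  # row margins
    cols = Counter(labels_b)                  # column margins
    stat = 0.0
    for a, row_total in rows.items():
        for b, col_total in cols.items():
            expected = row_total * col_total / n
            observed = table.get((a, b), 0)
            stat += (observed - expected) ** 2 / expected
    return stat


# Perfect agreement up to a renaming of the labels, three clusters of ten.
first = [0] * 10 + [1] * 10 + [2] * 10
second = [2] * 10 + [0] * 10 + [1] * 10
perfect = chi_square(first, second)

# Two labelings that are exactly independent of each other.
unrelated = chi_square([0, 0, 1, 1], [0, 1, 0, 1])
```

The statistic does not depend on how the clusters are numbered, which matters because the labels assigned by two separately fitted models are arbitrary. When two labelings with k non-empty clusters agree perfectly on n individuals, the statistic equals n(k − 1), which is 60 in the first toy case; for exactly independent labelings it is 0.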

Figures 3.6 to 3.8 show observed χ² values for the four cases (original population does or does not have a cluster structure, second sample is a replication or an independent sample) for k = 3 to 5. Note that this k refers to the true number of clusters in the case where a cluster structure exists, and to the arbitrarily picked k in the case of no cluster structure.

We see basically the same phenomenon in all of the cases. Where there is no cluster structure and the second dataset is an independent sample, the statistic stays below N = 1000. Where there is a cluster structure but the second sample is an independent one, or where there is no cluster structure but the replication is from the same distribution, we see a distribution of observed χ² values that resembles the theoretical χ² distribution, with maximum observed values falling at about 2N = 2000, which is what one would expect for two random clusterings.

In the case of a replication from the same distribution with a cluster structure, the values are spread out, with the extremes reaching close to what we would expect from a perfect replication. The higher the number of clusters, the more poor-quality replications there are. Since our N does not increase with k, this is to be expected: the smaller N/k is, the more likely it is that the sample contains only a small number of representatives from a particular cluster, making the clustering more random. In any case, the distribution does not overlap substantially with those obtained in the other scenarios.

This means that if we observe a high similarity between the replication clusterings, we can fairly safely assume that 1) there is a cluster structure in the population, and 2) we have managed to replicate the sampling procedure accurately. However, if we observe a low similarity, we cannot deduce from this alone whether it is due to the lack of a real cluster structure or to a failure of replication.

Figure 3.6: Histograms of chi-square values comparing clustering based on a model obtained on a sample from a distribution with or without cluster structure, to either a model obtained on another sample from the same distribution (replication) or to a model obtained with a sample from a different distribution with cluster structure (independent). Three clusters in the original distribution.

Figure 3.7: Histograms of chi-square values comparing clustering based on a model obtained on a sample from a distribution with or without cluster structure, to either a model obtained on another sample from the same distribution (replication) or to a model obtained with a sample from a different distribution with cluster structure (independent). Four clusters in the original distribution. Red and green lines show observations from a real-life replication experiment [WSM+12].

As a point of interest, for the case of four clusters I have also shown two values observed in a real-life dataset (described in [WSM+12] and Chapter 4.2). In real data, whatever structures exist probably do not conform very closely to the model assumptions. This might explain why the observed values fall between the distributions obtained for the different simulated scenarios.

Figure 3.8: Histograms of chi-square values comparing clustering based on a model obtained on a sample from a distribution with or without cluster structure, to either a model obtained on another sample from the same distribution (replication) or to a model obtained with a sample from a different distribution with cluster structure (independent). Five clusters in the original distribution.