
Faculty of Technology

Department of Mathematics and Physics
Laboratory of Applied Mathematics

Analysis of Patterns

in Mining Time Series using Qlucore

The topic of this Master's thesis was approved by the departmental council of the Department of Mathematics and Physics on July 20th, 2009.

The examiners of the thesis were Professor Markku Lukka and Ph.D. Tuomo Kauranne.

The thesis was supervised by Ph.D. Tuomo Kauranne.

Lappeenranta, December 2nd, 2009

Pitos Seleka Biganda
Punkkerikatu 2 A 6
53850 Lappeenranta, Finland
Tel. +358 44 966 1508
pitos.biganda [at] lut.

Abstract

Lappeenranta University of Technology
Department of Mathematics and Physics
Pitos Seleka Biganda

Analysis of Patterns in Mining Time Series using Qlucore
Thesis for the Degree of Master of Science in Technology

2009

64 pages, 36 figures, 8 tables, 1 appendix

Examiners: Prof. Markku Lukka and Ph.D. Tuomo Kauranne.

Keywords: Qlucore analysis, principal component analysis, sparse Bayesian regression, missing values reconstruction.

Raw measurement data does not always immediately convey useful information, but applying mathematical and statistical analysis tools to the measurements can improve the situation. Data analysis can offer benefits such as acquiring meaningful insight from the dataset, basing critical decisions on the findings, and ruling out human bias through proper statistical treatment. In this thesis we analyze data from an industrial mineral processing plant with the aim of studying the possibility of forecasting the quality of the final product, given by one variable, with a model based on the other variables. For the study, mathematical tools such as Qlucore Omics Explorer (QOE) and Sparse Bayesian regression (SB) are used. Linear regression is then used to build a model based on a subset of variables that have the most significant weights in the SB model. The results obtained from QOE show that the variable representing the desired final product does not correlate with the other variables. For SB and linear regression, the results show that both models built on 1-day averaged data seriously underestimate the variance of the true data, whereas the two models built on 1-month averaged data are reliable and able to explain a larger proportion of the variability in the available data, making them suitable for prediction purposes. However, it is concluded that no single model can fit the whole available dataset well. It is therefore proposed as future work either to build piecewise nonlinear regression models if the same dataset is used, or for the plant to provide another dataset, collected in a more systematic fashion than the present data, for further analysis.

Acknowledgements

First of all, I would like to express my gratitude to the Department of Mathematics and Physics of Lappeenranta University of Technology for the financial support during the entire duration of my studies. Thank you very much.

I express my warm thanks to the supervisor of the thesis, Ph.D. Tuomo Kauranne, for giving valuable comments and guidance; and Professor Markku Lukka for examining my thesis. Their guidance helped to shape my ideas as I was writing the thesis.

I especially thank Ville Manninen for his assistance in the area of Qlucore analysis and Virpi Junttila for her advice in the area of sparse Bayesian modeling. Many thanks to E. Antero Tammi for the technical support and advice he gave during the preparation of this thesis.

To all my friends, those near and those far away, I thank you all for your kind support and being there for me, especially during the difficult times.

I extend my deepest gratitude to my family for being a great source of support and for encouraging me not to give up.

Above all, I thank the Almighty God for the many blessings upon my life, and for the opportunities that motivate me to strive to be the best that I can be.

Kiitoksia kaikille!

Lappeenranta, December 2nd, 2009.

Pitos Seleka Biganda.

Contents

Abstract
Acknowledgements
List of Tables
List of Figures
Symbols and Abbreviations
1 Introduction
2 Description of Statement of the Problem
   2.1 Overall description of Data
   2.2 The Target Problem
3 Description of Mathematical Modeling Approach
   3.1 Principal Component analysis (PCA)
       3.1.1 Variance and Covariance
       3.1.2 Principal Component Analysis
   3.2 Set theory
       3.2.1 Basic concepts
       3.2.2 Logical functions
   3.3 Adaptive Moving Average
       3.3.1 Moving Average, ma(p)
       3.3.2 Weighted Moving Average, wma(p)
       3.3.3 Adaptive (weighted) moving average, awma(p)
4 Data Preprocessing
   4.1 Handling of NaNs
   4.2 Justification of preprocessing of the data
5 Description of Mathematical Tools for the Problem
   5.1 Qlucore Analysis
       5.1.1 Dynamic PCA
       5.1.2 Interpreting a PCA plot
       5.1.3 Statistical significance of a PCA plot
       5.1.4 Missing Value Reconstruction
   5.2 Sparse Bayesian Modeling
       5.2.1 Sparse Bayesian Regression
       5.2.2 Marginal Likelihood Maximization
6 Results
   6.1 Results by Qlucore Analysis
       6.1.1 Findings from the Simple Average algorithm
       6.1.2 Findings from the kNN Impute algorithm
   6.2 Results by Sparse Bayesian Modeling
       6.2.1 Model obtained from 1-day averaged data
       6.2.2 Model obtained from 30-days averaged data
       6.2.3 Model reliability applied to unique random samples of data
7 Discussion
8 Conclusion
References
A Proofs and definitions
   A.1 For any orthogonal matrix P, the covariance matrix of Y is P^T C P, where Y = P^T X and C is the covariance matrix of X
   A.2 The inverse of an orthogonal matrix is its transpose
   A.3 If A is any matrix, the matrices A^T A and A A^T are both symmetric
   A.4 Woodbury matrix identity

List of Tables

1 Row-wise patterns of NaNs
2 Row-wise patterns of NaNs in the averaged data
3 Correlated variables (from QOE simple average algorithm)
4 Correlated variables (from QOE kNN impute algorithm)
5 Average coefficients of variables in the model derived from 1-day averaged data
6 Linear regression model derived from 1-day averaged data
7 Average model parameters for the variables in the SB model derived from 30-days averaged data
8 Linear regression model derived from 30-days averaged data

List of Figures

1 Combined daily observations of materials of variables throughout the period
2 Example of measurements made for a single variable
3 Illustration of time points where the Process was running
4 Illustration of time points where the Process was not running
5 Row-wise patterns of NaNs in the data
6 Row-wise patterns of NaNs in the data
7 Preprocessing of the data by adaptive (weighted) moving average
8 An example of general behaviour of awma
9 Row-wise patterns of NaNs in the averaged data
10 Row-wise patterns of NaNs in the averaged data
11 Further preprocessing of the data
12 Justification of preprocessing of the data
13 An example of a PCA plot in QOE
14 QOE findings based on SA missing value reconstruction algorithm
15 QOE findings based on kNN impute missing value reconstruction algorithm
16 Empirical cumulative distribution of the data
17 Sparse Bayesian Regression model derived from 1-day averaged data
18 Residuals of SB Regression model derived from 1-day averaged data
19 Prediction of SB Regression model derived from 1-day averaged data
20 Prediction of SB Regression model using whole 1-day averaged data
21 Linear regression model derived from 1-day averaged data
22 Residuals of linear regression model derived from 1-day averaged data
23 Prediction of linear regression model derived from 1-day averaged data
24 Prediction of linear regression model using full 1-day averaged data
25 Sparse Bayesian Regression model derived from 30-days averaged data
26 Residuals of SB Regression model derived from 30-days averaged data
27 Prediction of SB Regression model derived from 30-days averaged data
28 Prediction of SB Regression model using full 30-days averaged data
29 Linear regression model derived from 30-days averaged data
30 Residuals of linear regression model derived from 30-days averaged data
31 Prediction of linear regression model derived from 30-days averaged data
32 Prediction of linear regression model using full 30-days averaged data
33 SB Regression model derived from 1-day averaged data using unique random samples
34 Linear Regression model derived from 1-day averaged data using unique random samples
35 SB Regression model derived from 30-days averaged data using unique random samples
36 Linear Regression model derived from 30-days averaged data using unique random samples

Symbols and Abbreviations

¬      Logical negation symbol
awma   Adaptive weighted moving average
isnan  Logical function, used to find missing values in a data set
kNN    k Nearest Neighbors
ma     Moving average
NaNs   Missing values
PCA    Principal Component Analysis
QOE    Qlucore Omics Explorer
SA     Simple average
SB     Sparse Bayesian (regression)
wma    Weighted moving average

1 Introduction

This thesis is based on analyzing data from a mineral processing plant. The data were not gathered for a designed experiment, but were merely collected during the operation of the process. We may refer to such data as "happenstance data." Because happenstance data are not gathered under controlled conditions, they are subject to a wide variety of unknown influences that make their validity for statistical analysis questionable [1]. Therefore, analysis of happenstance data should not be used to draw conclusions. However, because organizations usually invest substantial amounts of money to collect the data, it would be a waste to ignore the information in the data warehouse just because it is not "clean" or perfect.

Since the data are of a happenstance nature, some questions arise, such as:

1. Which variable or variables are likely to be important?
2. Approximately what effect is caused by changing a given variable by a certain amount?
3. Is there a combination of variables that influences another variable or variables?

These questions can be partly answered by data analysis, which is the major task in this thesis. In this thesis we study the possibility of forecasting the quality of the final product, given by one variable, with a model based on the other variables. The focus of the analysis is to determine the variables that influence this quality. We do not need to know the real name of the desired product; we will simply call it variable x27. All other variables are denoted in the same way, but with different numbers in the subscript.

The structure of this thesis is as follows. The next section goes through the description of the statement of the problem: an overall description of the data, and the target problem and its importance. Section 3 covers the description of the mathematical modeling approach: history, theory and definitions of the (applied) mathematical tools to be used. Section 4 gives a description of the data preprocessing. Section 5 moves to the description of the mathematical models/tools used to solve the problem at hand; this section describes Qlucore Analysis and Sparse Bayesian regression modeling. Analysis results are presented in Section 6 and Section 7 gives a discussion of the whole analysis. Finally, Section 8 concludes and gives some recommendations for future work.

2 Description of Statement of the Problem

2.1 Overall description of Data

The data to be analyzed have 24869 observation points as rows and 37 variables as columns in the observation matrix. The variables are named x1, x2, ..., x37, with the subscripts representing the column numbers in the observation matrix. Figure 1 illustrates the combined daily observations of materials of each variable during the entire period when recording was done. The graph is intended only to convey how complex the data are to work with, and was produced after transposing the data matrix. The 33 variables included in this figure are the ones believed to carry the most important information about the process. Most of the analysis in this study will cover these variables.

Figure 1: Combined daily observations of materials of variables throughout the period

The data were collected from a number of refining processes of mineral ores over a period of about two years. Data collection was not a linear process because the flow of material was not linear either. There were different time delays between one data collection point and the other. One can easily observe from figure 1 that there are a number of complicated structural patterns in the data. Except for variable 1, all other variables contain some missing values (NaNs). Our task is therefore to understand and find explanations for at least some of the emerging patterns of NaNs, in order to fill them in before further analysis of the data is done. By looking roughly at figure 1 we can identify three major patterns that the NaNs form in the data. These are

1. The process was running or not.

2. Some variables are not observed every hour.

3. Some variables are observed only during certain periods.

The presence of NaNs means that the data, as such, have no well-defined mathematical representation. Therefore preprocessing is needed to transform the data into well-defined mathematical entities. A troubling question, however, is how to preprocess the data and still reasonably achieve this goal.

2.2 The Target Problem

Despite the presence of many NaNs in the data, some variations and anomalies can be identified among different variables during the runs of the process when the measurements were made. Figure 2 illustrates the variations in the measurements made for variable x27 during the whole period.

Figure 2: Example of measurements made for a single variable

As stated in the introduction, variable x27 is the variable of interest in this analysis. The objective now is to find the other variable(s) in the data that account for the variations and anomalies in the observed measurements for this particular variable. As can be seen from figure 2, there are fluctuations (bumps) in the measured data. The question is whether we can explain these anomalies based on the available data, so as to forecast this variable in the future.

In this work, a series of mathematical tools is presented in the following sections to address the understanding of the problem and, hopefully, to find a solution to it. Taking into account the missing values and the happenstance nature of the data, achieving a satisfactory solution may be somewhat challenging.

3 Description of Mathematical Modeling Approach

3.1 Principal Component analysis (PCA)

3.1.1 Variance and Covariance

Let X be an observation matrix. Each row gives values of the variables x_1, ..., x_m:

X = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1m} \\ x_{21} & x_{22} & \cdots & x_{2m} \\ \vdots & \vdots & & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{nm} \end{pmatrix},   (1)

We define the basic concepts as follows:

The mean value
\bar{x}_k = \frac{1}{n} \sum_{i=1}^{n} x_{ik}

The variance
\mathrm{var}(x_k) = \frac{1}{n-1} \sum_{i=1}^{n} (x_{ik} - \bar{x}_k)^2 =: \sigma_{x_k}^2

The standard deviation
\mathrm{std}(x_k) = \sqrt{\mathrm{var}(x_k)} =: \sigma_{x_k}

The covariance
\mathrm{cov}(x_k, x_l) = \frac{1}{n-1} \sum_{i=1}^{n} (x_{ik} - \bar{x}_k)(x_{il} - \bar{x}_l) =: \sigma_{x_k x_l}

The correlation
\mathrm{cor}(x_k, x_l) = \frac{\sigma_{x_k x_l}}{\sigma_{x_k} \sigma_{x_l}}

The covariance matrix
C = \mathrm{cov}(X) = \begin{pmatrix} \sigma_{x_1}^2 & \sigma_{x_1 x_2} & \cdots & \sigma_{x_1 x_m} \\ \sigma_{x_1 x_2} & \sigma_{x_2}^2 & & \vdots \\ \vdots & & \ddots & \\ \sigma_{x_1 x_m} & \cdots & & \sigma_{x_m}^2 \end{pmatrix}

If X_0 denotes the mean-deviation form of X, obtained by subtracting the mean value from each column of X, then

C = \frac{1}{n-1} X_0^T X_0

The diagonal entry c_{kk} in C gives the variance of the k-th column of X. The variance of x_k measures the spread of the values of x_k. The entry c_{kl} in C for k \ne l is the covariance of x_k and x_l. If c_{kl} = 0 then x_k and x_l are uncorrelated. Analysis of the multivariate data in X is greatly simplified when most or all of the variables x_1, ..., x_m are uncorrelated, that is, when the covariance matrix of X is diagonal or nearly diagonal [4].
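As an illustration of these definitions, the small MATLAB sketch below computes the covariance matrix of an observation matrix from its mean-deviation form; the example matrix and variable names are ours, not from the thesis.

```matlab
% Minimal sketch: covariance matrix of an observation matrix X
% (rows = observations, columns = variables), following the definitions above.
X  = randn(100, 5);                 % hypothetical 100 x 5 observation matrix
n  = size(X, 1);
X0 = X - repmat(mean(X, 1), n, 1);  % mean-deviation form X0
C  = (X0' * X0) / (n - 1);          % covariance matrix C = X0'*X0/(n-1)
max(max(abs(C - cov(X))))           % agrees with MATLAB's built-in cov(X)
```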

3.1.2 Principal Component Analysis

Principal component analysis (PCA) was invented in 1901 by Karl Pearson [2] and is a mathematical procedure that transforms a number of correlated variables into a smaller number of uncorrelated variables called principal components. The first principal component accounts for as much of the variability in the data as possible, and each succeeding component accounts for as much of the remaining variability as possible. PCA is mostly used as a tool in exploratory data analysis and for making predictive models. It is based on the eigenvalue decomposition of a data covariance matrix, or the singular value decomposition of a data matrix after mean-centering the data for each attribute.

Mathematically we define PCA as an orthogonal linear transformation that transforms the data to a new coordinate system such that the greatest variance by any projection of the data comes to lie on the first coordinate (called the first principal component), the second greatest variance on the second coordinate, and so on [3].

Let us define the matrix (1) as X = [x_1, x_2, ..., x_m]^T. Assume that X is already in mean-deviation form. The goal of PCA is to find an orthogonal m \times m matrix P = [u_1 \cdots u_m] that determines a change of variable,

X = PY,   (2)

or

\begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_m \end{pmatrix} = [u_1 \ u_2 \ \cdots \ u_m] \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{pmatrix}

with the property that the new variables y_1, ..., y_m are uncorrelated and are arranged in order of decreasing variance.

The orthogonal change of variable (2) means that each observation vector X_k receives a "new name," Y_k, such that X_k = P Y_k. Notice that Y_k is the coordinate vector of X_k with respect to the columns of P, and

Y_k = P^{-1} X_k = P^T X_k   (3)

for k = 1, ..., n.

It is easy to verify that for any orthogonal P, the covariance matrix of Y is P^T C P (see Appendix A.1). So the desired orthogonal matrix P is one that makes P^T C P diagonal.

Let D be a diagonal matrix with the eigenvalues \lambda_1, ..., \lambda_m of C on the diagonal, arranged so that \lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_m \ge 0, and let P be an orthogonal matrix whose columns are the corresponding unit eigenvectors u_1, ..., u_m. Then C = P D P^T and P^T C P = D. The unit eigenvectors u_1, ..., u_m of the covariance matrix C are called the principal components of the data (in the matrix of observations). The first principal component is the eigenvector corresponding to the largest eigenvalue of C, the second principal component is the eigenvector corresponding to the second largest eigenvalue, and so on.

The first principal component u_1 determines the new variable y_1 in the following way. Let a_1, ..., a_m be the entries in u_1. Since u_1^T is the first row of P^T, equation (3) shows that

y_1 = u_1^T X = a_1 x_1 + a_2 x_2 + \cdots + a_m x_m

Thus y_1 is a linear combination of the original variables x_1, ..., x_m, using the entries in the eigenvector u_1 as weights. In a similar fashion, u_2 determines the variable y_2, and so on.
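A compact way to see this procedure end to end is the following MATLAB sketch, which computes the principal components of a mean-centred observation matrix via the eigendecomposition C = P D P^T; it is only an illustration using our own example data, not the PCA implementation used by QOE.

```matlab
% Minimal sketch: PCA of an observation matrix via eigendecomposition.
X  = randn(200, 6);                            % hypothetical observation matrix
X0 = X - repmat(mean(X, 1), size(X, 1), 1);    % mean-deviation form
C  = (X0' * X0) / (size(X, 1) - 1);            % covariance matrix
[P, D]   = eig(C);                             % columns of P are unit eigenvectors
[lam, j] = sort(diag(D), 'descend');           % lambda_1 >= lambda_2 >= ... >= lambda_m
P    = P(:, j);                                % principal components u_1, ..., u_m
Y    = X0 * P;                                 % new, uncorrelated variables y_1, ..., y_m
frac = lam / sum(lam);                         % fraction of variance per component
```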

3.2 Set theory

3.2.1 Basic concepts

Let A be a set. If an object o is a member of A we write o ∈ A. This relation between o and A is known as a 'binary relation' between an object and a set. A binary relation between two sets, say A and B, is the subset relation (also called set inclusion) and is defined as: if all the members of A are also members of B, then A is a subset of B, denoted A ⊆ B.

Operations on sets

Some binary operations on sets include:

• Union of sets A and B, denoted A ∪ B: the set whose members are members of at least one of A or B.

• Intersection, A ∩ B: the set whose members are members of both A and B.

• Complement of a set, A^c: the set of all members of the universal set that are not members of A.

3.2.2 Logical functions

Basic set theory together with some operators (e.g. relational and logical operators) is used to define logical functions. In this section we define one logical function, isnan, which is used to find NaNs in a data set. Let X be the observation data matrix. Then we determine the NaNs in X by

L = isnan(X).   (4)

isnan(X) is a MATLAB function which returns an array of the same size as X containing logical 1 (true) where the elements of X are NaNs and logical 0 (false) where they are not. In this case L, defined by (4), is a logical (Boolean) matrix of the same size as X.
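As a small illustration (not the code used in the thesis), the logical matrix L can be used directly to summarize the row-wise NaN patterns of the kind reported later in Table 1; X is assumed to be the observation matrix already loaded in the workspace.

```matlab
% Minimal sketch: count distinct row-wise NaN patterns using L = isnan(X).
L = isnan(X);                              % 1 where X has a NaN, 0 elsewhere
completeRows = sum(~any(L, 2));            % rows without any NaN
[patterns, ia, idx] = unique(L, 'rows');   % distinct row-wise NaN patterns
counts = accumarray(idx, 1);               % number of rows sharing each pattern
fprintf('rows without NaNs: %d, unique patterns: %d, most frequent: %d\n', ...
        completeRows, size(patterns, 1), max(counts));
```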

3.3 Adaptive Moving Average

3.3.1 Moving Average, ma(p)

Let x ∈ R^n and y ∈ R^n; then y = ma(x, p) is a moving average of order p for the vector x, with 1 ≤ p < n. Define a constant weight vector by

w = \left( \frac{1}{p+1}, \frac{1}{p+1}, \cdots, \frac{1}{p+1} \right)^T   (5)

where w ∈ R^{p+1} and w^T 1 = 1. Here p + 1 is the size of the window. Let k be the position index of the k-th value in x, where p < k < n - p + s, and s is the index shift. Then for 0 ≤ s ≤ p,

y_k = w^T x(k-p+s : k+s)   (6)

is the moving average value of the k-th value in x.

3.3.2 Weighted Moving Average, wma(p)

The weighted moving average wma(p) is simply the moving average as given by (6), but with non-uniformly distributed weights w_i ≥ 0 (with, of course, w^T 1 = 1). It is used, for instance, with irregular sampling.

3.3.3 Adaptive (weighted) moving average, awma(p)

The two kinds of moving averages defined above do not allow missing values in x. So we need to define a moving average that can somehow handle missing values in the vector x ∈ R^n. An adaptive (weighted) moving average, awma(p), is just a wma(p) that allows the weight vector w to adapt to the neighbourhood of k (without changing p). In this case, the weight vector is not constant and w^T 1 > 0. Now let l_k = ¬ isnan(x(k-p+s : k+s)). Define the weight vector w_k as the element-wise multiplication between two vectors as shown below:

w_k = w .* l_k

where w is given by (5). Let \hat{w}_k = w_k / (w_k^T 1). Then the adaptive (weighted) moving average y_k is defined by

y_k = \begin{cases} \hat{w}_k^T x(k-p+s : k+s) & \text{if } \hat{w}_k^T 1 > 0 \\ \mathrm{NaN} & \text{otherwise} \end{cases}   (7)
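A minimal MATLAB sketch of this definition, assuming a trailing window (index shift s = 0), is given below; it is our own illustration, not the exact implementation used in the thesis.

```matlab
function y = awma(x, p)
% Adaptive (weighted) moving average, a sketch of Eq. (5)-(7) with s = 0.
n = numel(x);
y = NaN(size(x));
w = ones(p + 1, 1) / (p + 1);        % uniform weights, Eq. (5)
for k = p + 1 : n
    xw = x(k - p : k);               % window x(k-p : k)
    xw = xw(:);
    l  = ~isnan(xw);                 % l_k = ¬isnan(x(k-p : k))
    wk = w .* l;                     % adapted weights w_k = w .* l_k
    if sum(wk) > 0
        wk     = wk / sum(wk);       % normalised weights w_hat_k
        xw(~l) = 0;                  % drop NaN entries from the weighted sum
        y(k)   = wk' * xw;           % Eq. (7)
    end                              % otherwise y(k) remains NaN
end
end
```

For example, the 24-hour window used in Section 4.1 corresponds to p = 23, i.e. y = awma(x, 23) for an hourly series x.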


4 Data Preprocessing

Before we put the data into the analysis we first need to perform some preprocessing. Different kinds of preprocessing exist for different types of data. The aim of preprocessing is to transform the data in a reasonable manner so that the conclusions to be drawn from the analysis results are as good as possible. In this sense, preprocessing can be regarded as part of the real analysis of the data, because the better the preprocessing is done, the better the analysis results will be. In this section we discuss the preprocessing of the data, focusing on how to handle the NaNs and on outlier detection.

4.1 Handling of NaNs

As stated in the earlier sections, the data we are analyzing contain so many NaNs that no mathematical entity can define the data as such. These NaNs are thought to arise because the process was not running at some time points, and therefore no measurements were collected at those particular time points. Other NaNs are due to irregular and time-varying sampling. Figure 3 illustrates the time points when the process was running. The figure aims at showing how short or long the process runs were at different time points. The width of the rectangles is equivalent to the run length in hours, as indicated on the vertical axis. One can see that at some points the process had been running for very long stretches, while at other points the runs were very short. As a result, different rows in the data array contain different numbers of NaNs.

Figure 3: Illustration of time points where the Process was running (horizontal axis: time in hours; vertical axis: run length in hours)

The opposite of the features described in figure 3 is given in figure 4, which illustrates the time points where the process was not running, with the corresponding non-run length in hours. We can observe from this figure both short and very long periods of time when the process was not running. In other words, the figure shows the behaviour of the NaNs in the data.

Figure 4: Illustration of time points where the Process was not running (horizontal axis: time in hours; vertical axis: non-run length in hours)

Before we consider how to treat the NaNs, let us look at the different roles they have in the data. Table 1 shows the row-wise patterns the NaNs form in the data. The table tells us that no single row in the whole data array has full observations, that is, every row contains at least one NaN. There are 528 distinct row-wise patterns, and the most frequent of them is found in 3667 rows of the whole data. A proper visualization of these patterns can be seen in figure 5.

Table 1: Row-wise patterns of NaNs

rows without NaNs: 0
unique row-wise NaN patterns: 528
rows with the most frequent pattern: 3667

Figure 5: Row-wise patterns of NaNs in the data (top panel: histogram of patterns, i.e. number of rows per row-wise pattern; bottom panel: cumulative sum)

In the first subplot of figure 5 we can see that the first 50 distinct row-wise patterns are the most frequent ones in the data set, occupying more than 3000 rows. The second subplot shows the total number of rows (blue line), the number of observations in x27 (red line) and the number of observations in x19 (green line) that are occupied by each of the distinct row-wise patterns of NaNs. This subplot is analogous to figure 6, which shows the relative cumulative sum of rows and observations for the two variables. From this figure we can see that the first 100 row-wise patterns occupy more than 90% of the total rows (blue line) and of the total number of observations in x27 (red line), and about 79% of the total observations in x19 (green line). The line in cyan is the relative cumulative sum of unique row-wise patterns in the data.

Figure 6: Row-wise patterns of NaNs in the data (relative cumulative sum of rows and of measurements per pattern)

Having seen the different patterns of NaNs and their roles in the data, we now face open questions about these NaNs: What should be done with them? Do we need to fill in some numbers? Answers to the first question can be either to ignore them and analyze the data with the NaNs, or to remove them from the data and analyze what remains. Either option will affect the results of the analysis differently, and the conclusions should be drawn accordingly. Also, since the NaNs are due to the process not running at some points, filling in constant numbers may produce bad results, because we would be adding false measurements where there were no recordings. It may be that the recordings were not taken for a reason; no one knows the real meaning of the NaNs beforehand. The analyst is therefore left with the decision of how to handle these NaNs in order to get as good results as possible. In our case we will use the adaptive (weighted) moving average (awma) given by Eq. (7) as one way of addressing this problem. A moving average is used to average noise out of the data, and the awma may also fill in some missing values. However, it should be noted that moving average filtering has general problems (even with data without NaNs). They include the following:

• May interpolate values not in the given domain.

• Lack of full support both at the beginning and end of the vector.

• It is sensitive to the size of the window. Therefore serial correlation structure of the data may be destroyed if the window size is not carefully chosen.


• It underestimates variances

• It is mean seeking

Now, if many missing values exist in the data set, mean seeking is probably the wrong way of handling the data, because it underestimates the influence of the variables. Basically, all moving-window based approaches share the same or similar problems. The awma can fill in only a few of the NaNs when the vector contains many NaNs. Figure 7 illustrates the general observations for each variable after performing awma with a window size of 24 hours. As can be seen from the top panel, there are still many NaNs in the data array. This is because, if the k-th value has only NaNs in its window neighbourhood, the awma can only adapt to the NaNs. The bottom panel shows the relative number of observations (on a logarithmic scale) of all variables (columns of the data array).

Figure 7: Preprocessing of the data by adaptive (weighted) moving average

Figure 8(a) shows the observation time point plot of variable x27 as an example. The blue colour stands for the observations in the raw data, while the heavy green line shows the averaged values using awma with a window size of 24 hours. Figure 8(b) shows the distribution of the residuals between the raw data and the averaged data in x27, and figure 8(c) is the serial correlation plot of the residuals.

Figure 8: An example of general behaviour of awma

Table 2 gives a summary of the row-wise patterns of NaNs remaining in the data set after performing the adaptive (weighted) moving average on the raw data. We can still observe that no single row has full measurements; the awma has, however, managed to fill in some missing values, leading to a decrease in the total number of distinct row-wise patterns of NaNs from 528 to 211. The patterns now remaining are more frequent than before, as we can read from the table: the most frequent one is found in 4269 rows.

Table 2: Row-wise patterns of NaNs in the averaged data

rows without NaNs: 0
unique row-wise NaN patterns: 211
rows with the most frequent pattern: 4269

The counterpart of figure 5 is figure 9, which illustrates the patterns of NaNs remaining in the averaged data from which the summary in Table 2 was drawn. Now we can see that of the 211 patterns remaining, 20 are the most frequent. We can also see from figure 10 that the first 100 distinct patterns of NaNs now occupy more than 95% of the total rows (blue line) and of the total number of observations in each of the variables x27 and x19. The other colored lines in figure 10 represent the same features as in figure 6 for the non-averaged data.

Figure 9: Row-wise patterns of NaNs in the averaged data (top panel: histogram of patterns; bottom panel: cumulative sum)

Figure 10: Row-wise patterns of NaNs in the averaged data (relative cumulative sum of rows and of measurements per pattern)

As a second step in handling the NaNs, we removed all rows where the process was not running and replaced the remaining NaNs with zeros. This was done using the data array produced by the awma. Rows that indicate a very short run of the process were also removed, retaining those whose run lengths were long enough. The first subplot of Figure 11 shows the resulting observations of the variables from the awma step after removing the rows where the process was not running and replacing NaNs with zeros. The second subplot shows the observations of the variables for sufficiently long run lengths of the process.

Figure 11: Further preprocessing of the data

As we can see from figure 11, we cannot get rid of all missing values in the data without destroying the real structure of the data. Therefore, at this point we acknowledge that the treatment of the NaNs stops here, and we take the resulting preprocessed data into further analysis. However, a justification is needed to know how much we have changed the original data and what the quality of the resulting data is.

4.2 Justification of preprocessing of the data

The process of justifying our preprocessing method is the most difficult part of this analysis, especially when we need to know the quality of the new data produced by the method. This is because there is no reference data to compare with, and the original data have a complicated structure that is not mathematically well defined. In this case the justification remains subjective and, through trial and error, the quality of the resulting data will be revealed by the results produced by the mathematical models in the real analysis of the data given in the following sections. To see how much we have modified the original data, we have calculated the relative Euclidean norm of the variables in the data matrix, such that for k = 1, ..., 33:

d_k = \frac{\| D_k - \hat{D}_k \|_2}{\| D_k \|_2}   (8)

where D and \hat{D} are the original and new data, respectively. The results, as can be seen in figure 12, indicate that (for some variables) we have not destroyed much information. The largest relative change is only about 41%, for variable number 16, and the smallest is 0%, for variable 1. The relative variances, which were calculated in the same way as in (8) but with the norm replaced by the variance, show that, as in the case of the relative norm, some variables have more than 50% variance of the difference between the two data sets relative to the original one. However, in any interpretation of these results one must bear in mind that the variables in the data have different patterns of missing values, and therefore the choice of the window size used to average the data affects the resulting values differently for different variables.
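A small MATLAB sketch of this check, under our own assumption that rows containing NaNs in either data set are skipped when comparing a variable, is given below; D and Dhat stand for the original and preprocessed data matrices (columns = variables).

```matlab
% Minimal sketch of Eq. (8): relative norm and relative variance of the
% change made to each variable by the preprocessing.
m = size(D, 2);
d = zeros(1, m);                                 % relative Euclidean norms
v = zeros(1, m);                                 % relative variances
for k = 1 : m
    ok   = ~isnan(D(:, k)) & ~isnan(Dhat(:, k)); % compare common observations only
    d(k) = norm(D(ok, k) - Dhat(ok, k)) / norm(D(ok, k));
    v(k) = var(D(ok, k) - Dhat(ok, k))  / var(D(ok, k));
end
```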

Figure 12: Justification of preprocessing of the data (relative norm and relative variance per variable)

5 Description of Mathematical Tools for the Problem

In this section we give a description of the mathematical models/tools that will be used to solve the problem at hand. These tools are Qlucore Analysis and Sparse Bayesian Modeling.

5.1 Qlucore Analysis

Qlucore Analysis here means analysis with an exploration and visualization tool from the company Qlucore. The tool grew out of a collaborative research project at Lund University, Sweden, with researchers at the Department of Mathematics and Clinical Genetics.

The starting point of the project was the vast amount of high-dimensional data generated by microarray gene expression analysis; it was discovered that an interactive scientific software tool was needed to conceptualize the ideas evolving from the research collaboration. The basic concept behind the software is to provide a tool that can take full advantage of the most powerful pattern recognizer that exists, the human brain, in order to find relevant structures in the data. The result of the research is a core software engine that visualizes the data in 3D and aids the user in identifying hidden structures and patterns.

Qlucore was founded in early 2007 and so far three versions of the software have been released. The first one was "Qlucore Gene Explorer 1.0"; the second was version 1.1, released in June 2008; and in June 2009 version 2.0 was released, called "Qlucore Omics Explorer (QOE)".

Our purpose here is not to explain in detail how the software works; rather, we give a short description of the mathematical functionalities included in the software that are used to solve our problem. A detailed description of how the software works can be found in the Reference Manual supplied in the Help Menu of the software, or at [5], where more information is provided.

QOE is a powerful interactive visualization environment. It can be used to analyze many different types of data sets. The tool helps the analyst to uncover hidden structures and find patterns in large data sets. The basic mathematical operation used in the QOE environment is principal component analysis (PCA), which makes it possible to present and visualize high-dimensional data in a lower dimension. In QOE, all data are presented in three dimensions, and this three-dimensional representation is plotted on the computer screen. Other mathematical functionalities included are scatter plots, heat maps and data tables.

5.1.1 Dynamic PCA

While working with QOE, the analyst has the possibility to interactively, and in real time, modify the different PCA plots (and other representations of the data) directly on the computer screen using the available mathematical and statistical algorithms, and at the same time to work with all annotations and other links in a fully integrated way. The dynamic PCA functionality is a unique feature of QOE.

5.1.2 Interpreting a PCA plot

The basic meaning of a PCA plot of any multidimensional data in QOE is that data points that are similar are presented close together in the generated plots. A characteristic feature of the PCA operation is that it preserves as much of the originally available information as possible in the generated 3D plots. The information content is measured by the statistical variance in the data when applying PCA. In figure 13, for example, the red group consists of data samples that are similar to each other and that are different from the yellow samples.

Figure 13: An example of a PCA plot in QOE

5.1.3 Statistical significance of a PCA plot

The PCA operation does not make any assumptions about the data to be analyzed. The structures or patterns visible on the computer screen are therefore there because they are present in the data. QOE also provides some statistical methods (e.g. ANOVA) which may create patterns even from random data. Such patterns are then, with very high probability, not statistically stable, and you must therefore look at the statistical significance of the structures you discover. QOE has several tools available for controlling statistical significance: cross validation (leaving one or several data samples out), and randomization or permutation tests. QOE also provides p-values and q-values for the chosen statistical methods, making it easy to dynamically check the statistical significance of the structures you discover.

5.1.4 Missing Value Reconstruction

Next we look at how QOE treats the NaNs present in the data being analyzed. When a data frame file with missing values is loaded, QOE automatically fills in the missing values using a missing value reconstruction algorithm. There are two different algorithms:

• Simple Average (SA)

• K Nearest Neighbors (kNN) Impute

Using the SA algorithm, a missing value is reconstructed as the average of the non-missing values of the same variable over all samples. Note that QOE treats all rows of the data matrix as variables and the columns as samples. Therefore an observation matrix whose rows are samples and whose columns are variables needs to be transposed before being loaded into QOE.

With the kNN Impute algorithm, a missing value is reconstructed as a normalized average of the non-missing values of similar variables for the same sample; the details can be found in [6]. A drawback of this algorithm is that it was developed for data sets with 1-20% missing values. Thus, for a data set with a larger proportion of missing values, this algorithm can mislead the conclusions drawn from the analysis results. With a reasonable proportion of missing values, however, the kNN impute algorithm appears to provide a more robust and sensitive method for missing value estimation than the simple average method [6].
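To make the simple average idea concrete, the sketch below fills the NaNs of each column with that column's mean over the non-missing entries. It is only an illustration on an observation matrix with rows as samples and columns as variables; it is not QOE's implementation.

```matlab
% Minimal sketch of simple-average (SA) style imputation.
Ximp = X;                                  % X: observation matrix (rows = samples)
for k = 1 : size(X, 2)
    col = X(:, k);
    mk  = mean(col(~isnan(col)));          % average of the non-missing values
    col(isnan(col)) = mk;                  % fill in the NaNs of this variable
    Ximp(:, k) = col;
end
```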

5.2 Sparse Bayesian Modeling

Happenstance data [9] typically have highly intercorrelated variables. This calls for regularization, which sparse Bayesian modeling does automatically. Using sparse Bayesian modeling in a time series context also amounts to randomization, as presented in [9]. In this section, therefore, we give a mathematical description of sparse Bayesian models.

Consider a predictive modeling task where there is a set of input data samples X = {x_n} with corresponding desired target values t = {t_n}, n = 1, ..., N. The objective is to find a functional model y(x) that "predicts" t well given x, and is not compromised by noisy, real-world data. One possible predictor y(x) is a sparse model.

The sparse model [12] takes the form:

y(x; w) = \sum_{m=1}^{M} w_m \phi_m(x),   (9)

where w = (w_1, w_2, ..., w_M)^T is the vector of adjustable parameters. This model is said to be linear in the parameters, but by choosing the "basis functions" \phi_m(x) to be nonlinear, y(x; w) may be nonlinear too. If M is large, this model is potentially very flexible and, provided the computational and statistical complexity of the model is appropriately managed, it can be applied very effectively to a wide range of problems.

5.2.1 Sparse Bayesian Regression

Let \{x_n, t_n\}_{n=1}^{N} be a given data set. We write the targets as a vector t = (t_1, ..., t_N)^T and express it as the sum of an approximation vector y = (y(x_1), ..., y(x_N))^T and an 'error' vector \epsilon = (\epsilon_1, ..., \epsilon_N)^T:

t = y + \epsilon = \Phi w + \epsilon,   (10)

where w is the parameter vector and \Phi = [\phi_1 \cdots \phi_M] is the N \times M 'design' matrix whose columns contain the complete set of 'basis vectors'.

The sparse Bayesian framework assumes that the errors are modeled probabilistically as independent zero-mean Gaussians with variance \sigma^2, so p(\epsilon) = \prod_{n=1}^{N} \mathcal{N}(\epsilon_n \mid 0, \sigma^2), which is equivalent to a mean-squared-error criterion. The parameter \sigma^2 can be set in advance if known, but more usually it is estimated from the data. This error model therefore implies a multivariate Gaussian likelihood for the target vector t:

p(t \mid w, \sigma^2) = (2\pi)^{-N/2} \sigma^{-N} \exp\left\{ -\frac{\|t - y\|^2}{2\sigma^2} \right\}.   (11)

The likelihood (11) is complemented by a prior over the parameters which takes the form:

p(w \mid \alpha) = (2\pi)^{-M/2} \prod_{m=1}^{M} \alpha_m^{1/2} \exp\left( -\frac{\alpha_m w_m^2}{2} \right).   (12)

The M independent hyperparameters \alpha = (\alpha_1, ..., \alpha_M)^T introduced here each individually control the strength of the prior over the associated weight. This form of prior is responsible for the sparsity properties of the model [7]. Given \alpha, the posterior parameter distribution conditioned on the data is given by combining the likelihood and prior within Bayes' rule:

p(w \mid t, \alpha, \sigma^2) = \frac{p(t \mid w, \sigma^2)\, p(w \mid \alpha)}{p(t \mid \alpha, \sigma^2)},   (13)

and is Gaussian \mathcal{N}(\mu, \Sigma) with

\Sigma = (A + \sigma^{-2} \Phi^T \Phi)^{-1}, \qquad \mu = \sigma^{-2} \Sigma \Phi^T t,   (14)

and A defined as \mathrm{diag}(\alpha_1, ..., \alpha_M). In order not to extend the model to include Bayesian inference over those hyperparameters (which is analytically intractable), a most-probable point estimate \alpha_{MP} may be found via a type-II maximum likelihood procedure. That is, sparse Bayesian 'learning' is formulated as the local maximisation with respect to \alpha of the marginal likelihood [8, 7], or equivalently, of its logarithm L(\alpha):

L(\alpha) = \log p(t \mid \alpha, \sigma^2) = \log \int_{-\infty}^{\infty} p(t \mid w, \sigma^2)\, p(w \mid \alpha)\, dw = -\frac{1}{2}\left[ N \log 2\pi + \log |C| + t^T C^{-1} t \right],   (15)

with

C = \sigma^2 I + \Phi A^{-1} \Phi^T.   (16)

A point estimate \mu_{MP} for the parameters is then obtained by evaluating (14) with \alpha = \alpha_{MP}, giving a final (posterior mean) approximator y = \Phi \mu_{MP}. The essential observation is that typically the optimal values of many hyperparameters are infinite [7]. From (14) this leads to a parameter posterior infinitely peaked at zero for many weights w_m, with the consequence that \mu_{MP} correspondingly contains very few non-zero elements.
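Given fixed values of the hyperparameters, the posterior statistics in Eq. (14) are straightforward to compute; the following MATLAB sketch illustrates them under our own variable names (Phi is the N x M design matrix, t the target vector, alpha the hyperparameter vector and sigma2 the noise variance). It is only an illustration of Eq. (14), not the full learning algorithm.

```matlab
% Minimal sketch of Eq. (14): posterior covariance, mean and fitted values.
A     = diag(alpha);                        % A = diag(alpha_1, ..., alpha_M)
Sigma = inv(A + (Phi' * Phi) / sigma2);     % posterior covariance
mu    = Sigma * (Phi' * t) / sigma2;        % posterior mean
yfit  = Phi * mu;                           % posterior-mean approximation of t
```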

5.2.2 Marginal Likelihood Maximization

In this part we give a summary of some key properties of the algorithm for maximization of the marginal likelihood (15). A more detailed description of the algorithm can be obtained in [11].

Consider the dependence of L(\alpha) on a single hyperparameter \alpha_i, i \in \{1, ..., M\}. Then we can decompose C in (16) as

C = \sigma^2 I + \sum_{m \ne i} \alpha_m^{-1} \phi_m \phi_m^T + \alpha_i^{-1} \phi_i \phi_i^T = C_{-i} + \alpha_i^{-1} \phi_i \phi_i^T,   (17)

where C_{-i} is C with the contribution of basis vector i removed. Using matrix determinant and inverse identities we can write the terms of interest in L(\alpha) as:

|C| = |C_{-i}| \left( 1 + \alpha_i^{-1} \phi_i^T C_{-i}^{-1} \phi_i \right),   (18)

C^{-1} = C_{-i}^{-1} - \frac{C_{-i}^{-1} \phi_i \phi_i^T C_{-i}^{-1}}{\alpha_i + \phi_i^T C_{-i}^{-1} \phi_i}.   (19)

From this, we can write L(\alpha) as:

L(\alpha) = -\frac{1}{2}\left[ N \log(2\pi) + \log |C_{-i}| + t^T C_{-i}^{-1} t - \log \alpha_i + \log\left( \alpha_i + \phi_i^T C_{-i}^{-1} \phi_i \right) - \frac{\left( \phi_i^T C_{-i}^{-1} t \right)^2}{\alpha_i + \phi_i^T C_{-i}^{-1} \phi_i} \right]
= L(\alpha_{-i}) + \frac{1}{2}\left[ \log \alpha_i - \log(\alpha_i + s_i) + \frac{q_i^2}{\alpha_i + s_i} \right] = L(\alpha_{-i}) + l(\alpha_i),   (20)

where for simplification purposes we have defined:

s_i \triangleq \phi_i^T C_{-i}^{-1} \phi_i, \quad \text{and} \quad q_i \triangleq \phi_i^T C_{-i}^{-1} t.   (21)

The 'sparsity factor' s_i is a measure of the extent to which basis vector \phi_i overlaps those already present in the model. The 'quality factor' q_i, which can be written as q_i = \sigma^{-2} \phi_i^T (t - y_{-i}), is a measure of the alignment of \phi_i with the error of the model with that vector excluded.

The objective function has now been decomposed into L(\alpha_{-i}), the marginal likelihood with \phi_i excluded, and l(\alpha_i), where the terms in \alpha_i are conveniently isolated.

Analysis of l(\alpha_i) [11] indicates that L(\alpha) has a unique maximum with respect to \alpha_i:

\alpha_i = \begin{cases} \dfrac{s_i^2}{q_i^2 - s_i} & \text{if } q_i^2 > s_i \\ \infty & \text{if } q_i^2 \le s_i \end{cases}   (22)

It is relatively easy to compute q_i and s_i for all the basis functions \phi_i in the dictionary, including those not currently utilized in the model (i.e. those for which \alpha_i = \infty). It is also easy to maintain and update the values of

S_m \triangleq \phi_m^T C^{-1} \phi_m, \qquad Q_m \triangleq \phi_m^T C^{-1} t,   (23)

and from these it follows that:

s_m = \frac{\alpha_m S_m}{\alpha_m - S_m}, \qquad q_m = \frac{\alpha_m Q_m}{\alpha_m - S_m}.   (24)

Note that when \alpha_m = \infty, s_m = S_m and q_m = Q_m. In practice, then, it is convenient to make use of the Woodbury identity to obtain the quantities of interest:

S_m = \phi_m^T B \phi_m - \phi_m^T B \Phi \Sigma \Phi^T B \phi_m,   (25)
Q_m = \phi_m^T B \hat{t} - \phi_m^T B \Phi \Sigma \Phi^T B \hat{t},   (26)

where B \equiv \sigma^{-2} I and \hat{t} \equiv t. The quantities \Phi and \Sigma here contain only those basis functions that are currently included in the model, and the computation therefore scales with the cube of that number, which is typically only a very small fraction of the full M. The result (22) implies the following (a small numerical sketch of this add/delete rule is given after the list):

• If \phi_i is in the model (i.e. \alpha_i < \infty) yet q_i^2 \le s_i, then \phi_i may be deleted (i.e. \alpha_i is set to \infty).

• If \phi_i is excluded from the model (\alpha_i = \infty) and q_i^2 > s_i, then \phi_i may be added (i.e. \alpha_i is set to some optimal finite value).
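The sketch below shows this decision for a single basis function m, given current vectors alpha, S and Q (names are ours); it is only an illustration of Eqs. (22)-(24), not the full marginal likelihood maximisation algorithm of [11].

```matlab
% Minimal sketch of the add/delete/re-estimate rule for basis function m.
if isinf(alpha(m))                            % currently excluded from the model
    s = S(m);  q = Q(m);                      % s_m = S_m, q_m = Q_m when alpha_m = inf
else
    s = alpha(m) * S(m) / (alpha(m) - S(m));  % Eq. (24)
    q = alpha(m) * Q(m) / (alpha(m) - S(m));
end
if q^2 > s
    alpha(m) = s^2 / (q^2 - s);               % add or re-estimate, Eq. (22)
else
    alpha(m) = Inf;                           % delete the basis function
end
```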


6 Results

6.1 Results by Qlucore Analysis

Because Qlucore Omics Explorer is able to fill in the missing values present in the data, the raw data were used in this part of the analysis. The results presented in this section are therefore based on the two different missing value reconstruction algorithms supplied in QOE. The analysis has been an exploratory one, aimed at finding different patterns existing among the variables in the data. The results only provide insight for further analysis of the data, so we should not draw any conclusions from them. This is the only part of the data analysis where the whole data matrix (except x1) was used; that is, 36 variables and all 24869 samples were put into the analysis.

6.1.1 Findings from the Simple Average algorithm

Figure 14 shows the exploratory findings from the data whose missing value reconstruction was done by the simple average algorithm described in Section 5.1.4. The patterns seen in the plot have a q-value equal to 0.0031412 (which can be interpreted as a false discovery rate threshold), and therefore the variables included have a very high statistical significance. Variables not included in any of the patterns shown in the figure are x6, x12, x29, x35, x36 and x37. These were found to have very small variance compared with the rest of the variables over the samples, and were therefore filtered out by variance.

Figure 14: QOE findings based on SA missing value reconstruction algorithm

Looking at this figure we identify five patterns of variables. Four of them each have a certain number of variables connected by a line, and the remaining one has variables not connected by a line at all. The patterns with variables connected by a line contain variables with at least a 50% Pearson correlation coefficient within the group, and the non-connected variables have Pearson correlation coefficients of less than 50%. The variables and their corresponding patterns are summarized in Table 3.

Table 3: Correlated variables (from QOE simple average algorithm)

Pattern   Variables                                                               Pearson's Corr.
1         x8, x9, x10, x11, x13, x15, x25                                         >= 50%
2         x31, x32                                                                >= 50%
3         x4, x5                                                                  >= 50%
4         x17, x21, x22, x23                                                      >= 50%
5         x2, x3, x7, x14, x16, x18, x19, x20, x24, x26, x27, x28, x30, x33, x34  < 50%

Since the objective of this work is to study the variables that best influence x27, the significance of this finding is that it gives the analyst a clue to put emphasis on analyzing the variables included in pattern 5 (see Table 3), as x27 is also a member of this pattern. Therefore, in Section 6.2 we will present the results from Sparse Bayesian regression.

6.1.2 Findings from the kNN Impute algorithm

Figure 15 is a PCA plot showing the exploratory data analysis results using QOE when the kNN impute algorithm was used to fill in the missing values in the data. Unlike the results obtained using the SA algorithm shown in figure 14, we can now identify only two patterns of variables. The first pattern contains variables with Pearson correlation coefficients of at least 50%; these can be seen in the plot with lines connecting them to indicate their correlations. The second pattern contains variables with correlations below 50%; in the plot these appear with no lines connecting them.

Figure 15: QOE findings based on kNN impute missing value reconstruction algorithm

The variables belonging to each of the two patterns are summarized in Table 4 below.

Table 4: Correlated variables (from QOE kNN impute algorithm)

Pattern   Variables                                                               Pearson's Corr.
1         x2, x3, x4, x5, x6, x7, x8, x9, x10, x11, x12, x13, x14, x15, x16,
          x17, x18, x19, x21, x22, x23, x24, x25, x26, x28, x30, x31, x32,
          x33, x35, x36, x37                                                      >= 50%
2         x20, x27, x29, x34                                                      < 50%

As in the simple average case, it can be seen from Table 4 that x27 falls into the pattern of variables with less than a 50% Pearson correlation coefficient even when the kNN impute algorithm is applied. However, these results reveal that the two missing value estimation methods used in QOE are different and that each produces different analysis results. This can be verified by comparing the patterns produced by each method: some variables that fall in the group of less correlated variables under one algorithm (e.g. x2, x3, x7, x14, x16, x18, x19, x24, x26, x28, x30, x33) fall in the group of correlated variables under the other. This prevents us from drawing conclusions from these results; instead, we move forward with further analysis, based on the information obtained from these results, to see where the reality lies. Therefore, in the next section we present the results from Sparse Bayesian regression as another tool for modeling our problem.

6.2 Results by Sparse Bayesian Modeling

The results presented in this section are based on two types of data sets, depending on the window length used in the adaptive weighted moving average to average the data. The window lengths are 24 hours (1 day) and 720 hours (30 days). Except for variables x29, x30, x31 and x32, all other variables (x1 included) were used in this part of the analysis. For each of the two data types, 100 samples of n observations each were randomly selected with replacement from the data, and a model was built from each sample. The overall model is the average of all models obtained from the selected samples. We call these samples learning or training sets. Another random sample of the same size was then selected from the same data to test the predictions of the model. In this way we use cross validation to estimate how accurately the predictive model will perform in practice. For each model we calculate the coefficient of determination, the R2-value, to estimate how well the model fits the data, and the Q2-value to estimate how well the model can predict future outcomes.
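The exact formulas used for these measures are not spelled out here; a common choice, shown below as an assumption, is to compute both as one minus the ratio of the residual sum of squares to the total sum of squares, with R2 evaluated on the training sample and Q2 on the independent validation sample (ttrain, yfit, tvalid and ypred are our hypothetical vectors of observed and modeled values).

```matlab
% Minimal sketch of the goodness-of-fit measures (our assumption of the formulas).
R2 = 1 - sum((ttrain - yfit ).^2) / sum((ttrain - mean(ttrain)).^2);  % fit
Q2 = 1 - sum((tvalid - ypred).^2) / sum((tvalid - mean(tvalid)).^2);  % prediction
```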

Before looking at the results, let us discuss how the samples were selected from the data. Initially, an empirical distribution of the data (see figure 16) was constructed based on some preferences in the data, to ensure that important features shown by the data are taken into account. For instance, to ensure that the "bulge" or fluctuation in the observed values for x27 (see figure 2) is taken into account, some scaling and translation was applied while computing the cumulative probabilities, such that high probabilities were given to the rows that cover the bulge. Zero probabilities were given to rows indicating that the process was not running. Other preferences included computing probabilities based on non-missing values and on process run lengths. Finally, these three preferences were combined by choosing combination coefficients whose sum is 1.

Figure 16: Empirical cumulative distribution of the data

Weighted samples were then randomly drawn with replacement using the MATLAB built-in function randsample(), which returns indices of rows (observations) randomly selected according to their weights. The weights were the computed probabilities discussed above. The aim of the random sampling was to stratify the data so that one random sample is used for fitting the model and the other for validating it.
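The sampling step can be sketched as follows, assuming the row probabilities are stored in a vector p that sums to 1 and the preprocessed data in a matrix X (names are ours); drawing the training and validation indices separately mirrors the procedure described above.

```matlab
% Minimal sketch of the weighted resampling with randsample (Statistics Toolbox).
n        = 3000;                                  % sample size used in Section 6.2.1
trainIdx = randsample(size(X, 1), n, true, p);    % with replacement, weighted by p
validIdx = randsample(size(X, 1), n, true, p);
Xtrain   = X(trainIdx, :);                        % training sample
Xvalid   = X(validIdx, :);                        % validation sample
```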


6.2.1 Model obtained from 1-day averaged data

Using the 1-day averaged data, normalized to zero mean and unit variance, 100 samples of 3000 observations each were randomly selected to build a sparse Bayesian regression model. The basis vectors were all 32 variables, and the target vector was variable x27. The resulting average model is shown in figure 17, where fitted values are indicated in green and original values in blue. The right panel of this figure is a zoomed version of the left panel for clearer visualization of the fit. The variables with their average coefficients in the model are tabulated in table 5. The R2-value for this model was 0.3303, which tells us that the model can explain only 33% of the variability in the data set.

Figure 18 illustrates the residuals of the model. The left panel shows the distribution of the residuals, while the right panel shows the correlation behaviour between the residuals. We can see from the left panel that the residuals are randomly distributed. To test for independence of the residuals, as shown in the right panel, a Durbin-Watson test was performed. The Durbin-Watson test is used to test whether the residuals are independent, against the alternative that there is autocorrelation among them. The computed p-value for the test was 0.9866, which makes it reasonable to accept the null hypothesis.
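For reference, the Durbin-Watson statistic itself is easy to compute from a residual vector r; values near 2 indicate little first-order autocorrelation. The p-value reported above requires the full test (for instance dwtest in the Statistics Toolbox, which also needs the design matrix), so the line below shows only the statistic, not the test used in the thesis.

```matlab
% Minimal sketch: Durbin-Watson statistic of a residual vector r.
d = sum(diff(r).^2) / sum(r.^2);
```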

Figure 17: Sparse Bayesian Regression model derived from 1-day averaged data


Figure 18: Residuals of SB Regression model derived from 1-day averaged data

Table 5: Average coecients of variables in the model derived from 1-day averaged data

Variable Parameter value Variable Parameter value

x1 -0.3233 x17 0.1920

x2 -0.0154 x18 -0.0988

x3 -0.1342 x19 0.0382

x4 0.6728 x20 0.0078

x5 -0.1479 x21 -0.0883

x6 0.1049 x22 -0.0817

x7 -0.1587 x23 0.0832

x8 -0.0681 x24 -0.1004

x9 -0.1222 x25 -0.0597

x10 0.0372 x26 -0.4831

x11 -0.0649 x28 0.0270

x12 0.0340 x33 0.1512

x13 -0.3498 x34 -0.1349

x14 0.1578 x35 -0.1548

x15 -0.2637 x36 0.2915

x16 -0.0038 x37 -0.0769

For validation purposes, a random sample of 3000 observations was taken from the data. The sample was then used with the above model to predict values for x27. Figures 19 and 20 illustrate the model prediction using the random sample and the full observations, respectively. In both figures, predicted values are shown in green while the original values of x27 are shown in blue. The left panel of each figure is a zoomed part of the right panel for clearer visualization. The calculated Q2-value was 0.3757, which indicates that the model has poor predictive power.
