Positive Matrix Factorization and its application to FIGAERO-CIMS

Paper II utilizes Positive Matrix Factorization (PMF, Paatero and Tapper, 1994; Ulbrich et al., 2009) to extract volatility information from FIGAERO-CIMS thermogram data. PMF can be used to reduce the complexity of mass spectrometer data, which often consists of several hundred compounds and their time-series. PMF is a bilinear statistical model that can be represented as

$$\mathbf{X} = \mathbf{G}\mathbf{F} + \mathbf{E}, \tag{8}$$

where the solutions are constrained to non-zero, non-negative values. When applied to mass spectral data, X is an m × n matrix containing the measured observations, G is an m × p matrix that contains the factor time series as columns, F is a p × n matrix that contains the factor mass spectra as rows, and E is an m × n matrix that accounts for the residuals between the observation matrix and the fitted values. PMF does not need any a priori information about G and F or about the number of factors (p). Instead, the user must decide which number of factors best represents the data.

To obtain the final PMF output, Equation (8) is solved iteratively in the least-squares sense by minimizing an object function Q,

$$Q = \sum_{i=1}^{m} \sum_{j=1}^{n} \left( \frac{E_{ij}}{S_{ij}} \right)^{2}, \tag{9}$$

where Sij is the uncertainty (or error) of each observed data point. The minimization is continued until the value of Q converges to a nearly stable value.
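The iterative minimization can be illustrated with a short NumPy sketch. It does not reproduce any particular PMF solver: it swaps in weighted multiplicative updates, a simpler non-negative factorization scheme, only to show how an object function of the form Q = Σ (Eij/Sij)² is reduced from a seeded pseudorandom starting point. The function names, the iteration count and the small constant `eps` are assumptions made for this illustration, and X is assumed to be non-negative.

```python
import numpy as np


def q_value(X, G, F, S):
    """Object function Q: squared residuals scaled by the uncertainties S."""
    E = X - G @ F
    return float(np.sum((E / S) ** 2))


def weighted_nmf(X, S, p, n_iter=500, seed=0, eps=1e-12):
    """Illustrative weighted non-negative factorization X ~ G F.

    Uses multiplicative updates (an assumption of this sketch, not a PMF
    solver). X and S are (m, n) arrays, X >= 0 and S > 0; p is the number
    of factors; seed fixes the pseudorandom starting point.
    """
    rng = np.random.default_rng(seed)
    m, n = X.shape
    G = rng.random((m, p)) + eps  # factor time series as columns
    F = rng.random((p, n)) + eps  # factor mass spectra as rows
    W = 1.0 / S**2                # points with larger errors get less weight
    for _ in range(n_iter):
        G *= ((W * X) @ F.T) / ((W * (G @ F)) @ F.T + eps)
        F *= (G.T @ (W * X)) / (G.T @ (W * (G @ F)) + eps)
    return G, F
```

Running `weighted_nmf` with several values of `seed` and comparing the resulting `q_value` is one simple way to see whether different starting points lead to different local minima.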

When applying PMF, the user must choose the number of factors and can further aid the convergence of the solution by utilizing so-called “seed” values, which are pseudorandom starting points for the iteration. Seed values can be beneficial when the solution space contains several local minima, as they may help the algorithm to converge to additional solutions. Additional solutions can also be found by using so-called “rotations”, which enable further exploration of the solution parameter space. However, a more detailed discussion of the effects of seeds and rotations on PMF results is beyond the scope of this thesis. Further information can be found in Ulbrich et al. (2009) and references therein.

The values of S in Equation (9) act as weights for the observation matrix, so that observation points with higher error values are weighted less and vice versa. Therefore, choosing the error matrix correctly plays a critical role in obtaining reliable information from PMF. Typically, when PMF is applied to long time-series data, the values of S follow the actual data. For example, in Yan et al. (2016), Sij was defined following a Poisson-type distribution:

๐‘บ๐‘–๐‘— = ๐›ผ โ‹… โˆš๐—๐‘–๐‘—

๐‘ก๐‘  + ฯƒ๐‘›๐‘œ๐‘–๐‘ ๐‘’,๐‘–,

(10)

where Xij is the signal intensity of ion i, ts is the sampling or averaging interval of the data, σnoise,i is the electrical noise of ion i and α is an experimental constant. In Chen et al. (2020), the error was determined simply via Poisson counting statistics:

๐‘บ๐‘–๐‘— = โˆš๐‘ฟ๐’Š๐’‹. (11)

These types of error models (hereafter Poisson-like errors, PLerror) work well with long datasets where rapid changes in the data often indicate outliers or instrument malfunction. However, this kind of weighting is problematic when applied to quickly changing FIGAERO-CIMS thermogram data, as shown in Figure 4a. The PLerror gives too much weight to the stable start and end parts of the thermograms and too little weight to the actual peak region of the data, which contains most of the volatility information. We therefore introduced a new error scheme in which the individual Sij values are constant (hereafter constant noise, CNerror):

๐‘บ๐‘–๐‘— = ฯƒ๐‘›๐‘œ๐‘–๐‘ ๐‘’,๐‘–. (12)

In both Equations (10) and (12), σnoise,i was determined by first fitting a line to the stable part of each thermogram, usually at the end of the thermogram, and then calculating the residual (resij) between the data and the line fit. The error was then determined as

ฯƒ๐‘›๐‘œ๐‘–๐‘ ๐‘’,๐‘– = {๐‘š๐‘’๐‘‘๐‘–๐‘Ž๐‘›(๐‘ ๐‘ก๐‘‘๐‘’๐‘ฃ(๐‘Ÿ๐‘’๐‘ )) | ๐‘ ๐‘ก๐‘‘๐‘’๐‘ฃ(๐‘Ÿ๐‘’๐‘ ๐‘–๐‘—) โ‰ค ๐‘š๐‘’๐‘‘๐‘–๐‘Ž๐‘›(๐‘ ๐‘ก๐‘‘๐‘’๐‘ฃ(๐‘Ÿ๐‘’๐‘ ))

๐‘ ๐‘ก๐‘‘๐‘’๐‘ฃ(๐‘Ÿ๐‘’๐‘ ๐‘–๐‘—) | ๐‘ ๐‘ก๐‘‘๐‘’๐‘ฃ(๐‘Ÿ๐‘’๐‘ ๐‘–๐‘—) > ๐‘š๐‘’๐‘‘๐‘–๐‘Ž๐‘›(๐‘ ๐‘ก๐‘‘๐‘’๐‘ฃ(๐‘Ÿ๐‘’๐‘ )). (13) In other words, the value of ฯƒnoise,i was at least the median value of all stdev(res) values.

This method was chosen because the solution would have taken an excessively long time to converge to a stable value if the values of Sij were too small.
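The noise estimate described above (line fit to the stable part, standard deviation of the residuals, floored at the median over all ions) can be sketched as follows. The function name, the argument names, the array layout (rows are thermogram points, columns are ions) and the use of the sample standard deviation (`ddof=1`) are assumptions made for this illustration.

```python
import numpy as np


def sigma_noise(X, temperature, stable):
    """Per-ion noise from a straight-line fit to the stable thermogram part.

    X is an (m, n) array (thermogram points x ions), temperature an (m,)
    array, and stable a slice or boolean mask selecting the stable region,
    for example the end of the thermogram.
    """
    x = np.asarray(temperature, dtype=float)[stable]
    Y = np.asarray(X, dtype=float)[stable, :]
    # np.polyfit fits every column of a 2-D y at once; coeffs has shape (2, n).
    coeffs = np.polyfit(x, Y, deg=1)
    fit = np.outer(x, coeffs[0]) + coeffs[1]
    res = Y - fit
    sd = res.std(axis=0, ddof=1)  # stdev of the residuals of each ion
    # Floor at the median over all ions: ions with a smaller stdev get the median.
    return np.maximum(sd, np.median(sd))
```

The returned array has one value per ion and can be used directly as the constant error of that ion.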

For the correct interpretation of PMF results, it is critical to choose the correct solution from all possible solutions with different numbers of factors and seed values. Normally, an indication of the correct solution can be found by investigating the ratio Q/Qexp, where Qexp is equal to the degrees of freedom of the model solution. In the ideal case, this ratio should approach unity. However, when using the CNerror scheme, the Q/Qexp ratio is often larger, as some part of the signal is bound to be left unexplained unless the data are overfitted with too many factors. Therefore, we calculated the fraction of the explained absolute variance (Ratioexp) for determining the “best solution” as:

๐‘Ž๐‘๐‘ ๐‘‰๐‘Ž๐‘Ÿtotal= โˆ‘ |๐‘‹๐‘–๐‘— ๐‘–๐‘—โˆ’ ๐‘‹ฬ… |๐‘– , (14)

๐‘Ž๐‘๐‘ ๐‘‰๐‘Ž๐‘Ÿexplained = โˆ‘ |R๐‘–๐‘— ๐‘–๐‘—โˆ’ ๐‘‹ฬ… |๐‘– , (15) and

๐‘…๐‘Ž๐‘ก๐‘–๐‘œ๐‘’๐‘ฅ๐‘ =absVarexplained

absVartotal , (16)

where Rij is the value of each ion in the reconstructed data matrix (R = GF), X̄i is the average of the measured ion value, and absVartotal and absVarexplained are the total and explained absolute variance, respectively.
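Equations (14) to (16) translate directly into a few array operations. The following NumPy sketch assumes that rows of X are thermogram points and columns are ions, so that the average of each ion is taken along the rows. The function name and this layout are assumptions made for illustration.

```python
import numpy as np


def explained_ratio(X, G, F):
    """Fraction of explained absolute variance, Ratio_exp.

    X is the (m, n) measured data matrix, G the (m, p) factor time series
    and F the (p, n) factor mass spectra.
    """
    R = G @ F                                # reconstructed data matrix
    X_mean = X.mean(axis=0, keepdims=True)   # average of each measured ion
    abs_var_total = np.abs(X - X_mean).sum()
    abs_var_explained = np.abs(R - X_mean).sum()
    return abs_var_explained / abs_var_total
```

Calling `explained_ratio` for solutions with an increasing number of factors gives one number per solution that can be compared alongside Q/Qexp.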

It should be noted that even with Ratioexp as a selection criterion, the ultimate choice of the “best” solution is still left to the end user and must therefore be considered somewhat subjective.

3 Results

In this chapter I will go through the results from FIGAERO-CIMS volatility calibration experiments (Sect. 3.1) and continue with results from utilizing PMF on FIGAERO-CIMS data (Sect. 3.2). Finally, in Sect. 3.3 I will discuss different aspects impacting the volatility of the SOA particles, including oxidative conditions and structures of the precursor compounds.

3.1 FIGAERO-CIMS particle phase calibration for extracting volatility