Chemometrics
Matti Hotokka
Physical chemistry
Åbo Akademi University
• Consider spectrophotometry as an example
  – Beer–Lambert law: A = εcl
  – Experiment
    – Make three known references with concentrations c1, c2, c3 and measure the absorbances A1, A2, and A3
    – Place a straight line through the points: A = a + bc
    – Measure the absorbance A of the unknown sample
    – Read the concentration from the calibration curve
Linear regression
Experiment
Linear regression
Calibration curve
[Figure: calibration curve, absorbance A plotted against concentration c, with the fitted line Acalc(c) = a + bc]
We have good theoretical grounds for saying that the calibration model is linear.
The intercept a is determined by additional disturbing components in the sample and can be ignored.
• One measured value, one regressor
  – Equation: ycalc(x) = b0 + b1·x
  – To be solved: b0, b1
  – How to:
Linear regression
How to solve
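The least-squares fit behind the calibration curve can be sketched in a few lines. The concentrations and absorbances below are hypothetical, chosen only to illustrate the procedure:

```python
import numpy as np

# Hypothetical calibration data: three reference standards
# (concentrations in mmol/L) and their measured absorbances.
c = np.array([1.0, 2.0, 3.0])
A = np.array([0.12, 0.23, 0.35])

# Least-squares fit of A = a + b*c: build the design matrix [1, c].
X = np.column_stack([np.ones_like(c), c])
(a, b), *_ = np.linalg.lstsq(X, A, rcond=None)

# Read an unknown concentration off the calibration line: c = (A - a) / b.
A_unknown = 0.29
c_unknown = (A_unknown - a) / b
print(round(c_unknown, 2))
```

The same design-matrix construction (a column of ones plus the regressor) is reused in every regression variant that follows.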
• Two components a and b
  – Measure at two wavelengths, λ1 and λ2
  – Beer–Lambert law, but different for each component and wavelength; the contributions are additive
    – At λ1: A1 = εa1·ca + εb1·cb
    – At λ2: A2 = εa2·ca + εb2·cb
  – In matrix form
Multiregression
Experiment
• Obtain the unknown coefficients ε
  – In this case, two solutions containing only either a or b in known concentrations
  – The molar absorption coefficients are calculated from
    – A1' = εa1·ca at λ1
    – A2' = εa2·ca at λ2
    – A1" = εb1·cb at λ1
    – A2" = εb2·cb at λ2
Multiregression
Calibrate the model
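A sketch of this two-wavelength calibration, assuming hypothetical absorbance readings for the two pure standards and for the mixture:

```python
import numpy as np

# Hypothetical single-component standards: pure a and pure b
# at known concentrations (mmol/L).
c_a, c_b = 1.0, 1.0
A1p, A2p = 0.50, 0.20    # A1', A2': pure-a absorbances at λ1, λ2
A1pp, A2pp = 0.10, 0.40  # A1", A2": pure-b absorbances at λ1, λ2

# Molar absorption coefficients from the four calibration measurements.
E = np.array([[A1p / c_a, A1pp / c_b],   # [εa1, εb1]
              [A2p / c_a, A2pp / c_b]])  # [εa2, εb2]

# For the mixture, A = E @ c, so the concentrations are c = E⁻¹ A.
A_mix = np.array([0.35, 0.28])
c_mix = np.linalg.solve(E, A_mix)
print(c_mix)
```

With two components and two wavelengths the system is square and solved exactly; with more wavelengths than components it becomes a least-squares problem.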
• One dimension
  – y = b0 + b1·x = b0·1 + b1·x
• Generalization of the linear model
Linear regression again
Standard method
ANOVA
In linear regression
Definitions
Sum of squares                       Calculation         D.f.
SST     Total                        Σ y²                n
SSM     Mean                         n·ȳ²                1
SScorr  Corrected for the mean       Σ (y − ȳ)²          n − 1
SSfact  Factors                      SScorr − SSR        p − 1
SSR     Residuals                    Σ (y − ŷ)²          n − p
SSlof   Lack of fit                  SSR − SSpe          n − p − f
SSpe    Pure experimental error      Σ (y − ȳrep)²       f

n = observations; p = coefficients b; f = replications
ANOVA
Example
No.   x    y
 1    0    0.3
 2    1    2.2
 3    2    3.0
 4    2    4.0
ANOVA
Sums of squares
ANOVA
Quality of the fit
Correlation
Mean sum of squares: MSS = SS / D.f.
F-test for goodness of fit: F = MSSfact / MSSR. If this F-value exceeds the critical value in the table, the fit is significant.
F-test for lack of fit: F = MSSlof / MSSpe. If the model is appropriate, this F-value stays below the critical value.
ANOVA
The example
Goodness of fit
This F-value exceeds the critical value at 5 % risk, F(0.05; 1, 2) = 18.51. The fit is statistically significant.
Lack of fit
This value is below the critical value at 5 % risk, F(0.05; 1, 1) = 161. The model is appropriate because the lack of fit is not significant.
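The two F-tests for this example can be reproduced numerically from the four data points in the table:

```python
import numpy as np

# The example data from the slides.
x = np.array([0.0, 1.0, 2.0, 2.0])
y = np.array([0.3, 2.2, 3.0, 4.0])
n, p = len(x), 2                      # observations, coefficients (b0, b1)

# Least-squares fit y = b0 + b1*x.
X = np.column_stack([np.ones_like(x), x])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ b

# Sums of squares.
SS_corr = np.sum((y - y.mean()) ** 2)   # corrected for the mean
SS_R = np.sum((y - y_hat) ** 2)         # residuals
SS_fact = SS_corr - SS_R                # factors
# Pure experimental error from the replicated point x = 2 (f = 1 replication).
y_rep = y[x == 2.0]
SS_pe = np.sum((y_rep - y_rep.mean()) ** 2)
SS_lof = SS_R - SS_pe
f = 1

# F-values, to be compared against the tabulated critical values.
F_fit = (SS_fact / (p - 1)) / (SS_R / (n - p))
F_lof = (SS_lof / (n - p - f)) / (SS_pe / f)
print(round(F_fit, 2), round(F_lof, 3))
```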
ANOVA
Confidence intervals
Variance-covariance matrix: V(b) = (XᵀX)⁻¹ s²
For an appropriate fit with a low value of SSlof, MSSR = sR² can be used instead of SSpe. The diagonal elements of the variance-covariance matrix are the variances of the factors b.
The confidence limits of a factor b (either b0 or b1) are b ± √F(α; 1, n−p) · sb, where sb² is the corresponding diagonal element of the variance-covariance matrix.
The prediction at a given point x0 = (1 x0) is y0 = x0·b.
ANOVA
The example
At 5 % risk level, F(0.05; 1, 2) = 18.51. Thus we obtain
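The variance-covariance matrix and the resulting confidence limits for this example can be computed as follows:

```python
import numpy as np

# Continuing the example: data and least-squares fit.
x = np.array([0.0, 1.0, 2.0, 2.0])
y = np.array([0.3, 2.2, 3.0, 4.0])
X = np.column_stack([np.ones_like(x), x])
b, *_ = np.linalg.lstsq(X, y, rcond=None)

n, p = len(x), 2
s2_R = np.sum((y - X @ b) ** 2) / (n - p)   # MSS_R = s_R²
V = s2_R * np.linalg.inv(X.T @ X)           # variance-covariance matrix
s_b = np.sqrt(np.diag(V))                   # standard errors of b0, b1

# 95 % confidence limits, using F(0.05; 1, 2) = 18.51 from the slides
# (the t-quantile is the square root of this F-value).
t = np.sqrt(18.51)
for name, bi, si in zip(("b0", "b1"), b, s_b):
    print(f"{name} = {bi:.3f} ± {t * si:.3f}")
```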
Multiple linear regression
Ordinary regression
Two-dimensional case
Measurements at three points (x11, x12), (x21, x22), and (x31, x32) are needed
In matrix form this system of equations is written as
The order has been changed to stress similarity to 1-dim regression
Multiple linear regression
Ordinary regression
If there are several dependent variables y, each with a different equation:

Y = X B + Residuals

where Y is n×m, X is n×p, and B is p×m.
Multiple linear regression
Ordinary regression
The equation is solved exactly as the 1-dim equation.
However, if there are linear dependencies between the x's the system becomes singular and cannot be solved.
Prediction of a y0 vector (dimension 1×m) at a given point x0 (dimension 1×p) is y0 = x0 B.
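A sketch of multi-response ordinary regression. The data are hypothetical and noise-free, so the recovered B matches the generating coefficients exactly:

```python
import numpy as np

# Hypothetical data: n = 5 observations, p = 3 regressors (incl. constant 1),
# m = 2 dependent variables.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(5), rng.random((5, 2))])
B_true = np.array([[1.0, 0.5], [2.0, -1.0], [0.0, 3.0]])   # p x m
Y = X @ B_true                                             # n x m, noise-free

# Solve Y = X B in the least-squares sense; B is p x m.
B, *_ = np.linalg.lstsq(X, Y, rcond=None)

# Prediction of y0 (1 x m) at a new point x0 (1 x p): y0 = x0 B.
x0 = np.array([[1.0, 0.2, 0.7]])
y0 = x0 @ B
print(y0)
```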
PCR
Principal component regression
In full multicomponent regression it often happens that some of the x’s are interdependent, i.e., not linearly independent.
To avoid this only a few coordinate axes are used. They are chosen to be orthogonal. The selection of orthogonal coordinates resembles the PCA method.
PCR
PCA revisited
In PCA, the original data matrix X is written as a product of the scores and loadings, X = T Lᵀ.
One method of solving the problem is to use SVD (singular value decomposition). In that case the X matrix is written as X = U W Vᵀ.
Here matrix U corresponds to T and V corresponds to L. They are joined by a diagonal matrix W whose diagonal elements are wii = √λii, where λii are the eigenvalues of the XᵀX matrix. The smallest eigenvalues can be forced to zero; the truncated matrix then removes the small eigenvalues that indicate dependencies.
PCR
The SVD method
Solution of the full linear equation is B = (XᵀX)⁻¹ Xᵀ Y.
Now the matrix X is written in the SVD approximation as X ≈ U W Vᵀ.
Then a pseudo-inverse matrix X⁺ = V W⁻¹ Uᵀ can be used.
The solution along a desired number of principal axes is then B = X⁺ Y, with the elements of W⁻¹ belonging to the discarded axes set to zero.
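A minimal PCR sketch along these lines, assuming hypothetical data with one deliberately built-in linear dependency among the x's:

```python
import numpy as np

# Hypothetical data: 10 observations, 4 regressors, 2 responses.
rng = np.random.default_rng(1)
X = rng.random((10, 4))
X[:, 3] = X[:, 0] + X[:, 1]          # built-in linear dependency
Y = rng.random((10, 2))

# SVD: X = U W Vᵀ (W returned as a vector of singular values).
U, w, Vt = np.linalg.svd(X, full_matrices=False)

d = 3                                # number of principal axes kept
w_inv = np.zeros_like(w)
w_inv[:d] = 1.0 / w[:d]              # small singular values forced to zero

X_pinv = Vt.T @ np.diag(w_inv) @ U.T # pseudo-inverse X⁺ along d axes
B = X_pinv @ Y                       # regression coefficients
print(B.shape)
```

The fourth singular value is numerically zero because of the dependency; truncating it is exactly what keeps the solution finite here.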
PLS
Partial Least Squares method
In PCA the matrix X is split into a product of the scores matrix and the loadings matrix:

X = T Pᵀ + E

where X is n×p, T is n×d, Pᵀ is d×p, and E is n×p.
PLS
Partial Least Squares method
In PLS, also the matrix Y is split into a product of the scores matrix and the loadings matrix:

Y = U Qᵀ + F

where Y is n×m, U is n×d, Qᵀ is d×m, and F is n×m.
PLS
Partial Least Squares method
The solution can then be written as
Here W is a d×p matrix of PLS weights. Only a few of the eigenvalues are kept; the rest are set to zero.
PLS
Algorithm
Initialize: Shift (mean-center) the columns of the X and Y matrices.
Initialize: Use the first column of the Y matrix as the first Y score vector u.
PLS
Algorithm
(1) Compute X-weights
(2) Scale the weights
(3) Estimate the scores of the X matrix
PLS
Algorithm
(4) Compute the Y loadings
(5) Generate a new u vector
Repeat from step (1) until u is stationary.
PLS
Algorithm
(6) Determine the scalar coefficient b for this variable.
(7) Compute the loadings of the X matrix.
(8) Compute the residuals.
PLS
Algorithm
Stopping criterion: Calculate the standard error of prediction from cross-validation, SEPCV.
If SEPCV is greater than it was for the previous number of factors, the optimum number of dimensions has been reached and the final B coefficients can be calculated. Otherwise use the residuals from step (8) as the new X and Y matrices and continue from the initialization step with an additional dimension.
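The algorithm above can be sketched as follows. The cross-validation stopping criterion is omitted for brevity (the number of dimensions is fixed by the caller), and the demo data are hypothetical:

```python
import numpy as np

def nipals_pls(X, Y, n_dims, tol=1e-10, max_iter=500):
    """Minimal NIPALS PLS sketch following the numbered steps above."""
    # Initialize: shift (mean-center) the columns of both matrices.
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    T, P, W, Q, b = [], [], [], [], []
    for _ in range(n_dims):
        u = Y[:, 0].copy()                # first Y column as first u score
        for _ in range(max_iter):
            w = X.T @ u / (u @ u)         # (1) X-weights
            w /= np.linalg.norm(w)        # (2) scale the weights
            t = X @ w                     # (3) X scores
            q = Y.T @ t / (t @ t)         # (4) Y loadings
            u_new = Y @ q / (q @ q)       # (5) new u vector
            if np.linalg.norm(u_new - u) < tol:   # u stationary?
                u = u_new
                break
            u = u_new
        b_a = (t @ u) / (t @ t)           # (6) scalar coefficient b
        p = X.T @ t / (t @ t)             # (7) X loadings
        X = X - np.outer(t, p)            # (8) X residuals
        Y = Y - b_a * np.outer(t, q)      #     Y residuals
        T.append(t); P.append(p); W.append(w); Q.append(q); b.append(b_a)
    return (np.array(T).T, np.array(P).T, np.array(W).T,
            np.array(Q).T, np.array(b))

# Hypothetical demo data: 12 observations, 5 regressors, 2 responses.
rng = np.random.default_rng(2)
X = rng.random((12, 5))
Y = X @ rng.random((5, 2))
T, P, W, Q, b = nipals_pls(X, Y, n_dims=3)
print(T.shape, P.shape)
```

Because each dimension deflates X by t pᵀ, the successive score vectors t come out mutually orthogonal, which is what makes the per-dimension regressions independent.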