Chemometrics
Matti Hotokka
Physical chemistry
Åbo Akademi University
• Consider spectrophotometry as an example
  – Beer–Lambert law: A = εcl
  – Experiment
    – Make three known references with concentrations c1, c2, c3 and measure the absorbances A1, A2, and A3
    – Place a straight line through the points: A = a + bc
    – Measure the absorbance A of the unknown sample
    – Read the concentration from the calibration curve
Linear regression
Experiment
Linear regression
Calibration curve
[Figure: calibration curve, absorbance A plotted against concentration c, with the fitted line Acalc(c) = a + bc]
We have good theoretical grounds for saying that the calibration model is linear.
The intercept a is determined by additional disturbing components in the sample and can be ignored.
• One measured value, one regressor
  – Equation: ycalc(x) = b0 + b1·x
  – To be solved: b0, b1
  – How to:
Linear regression
How to solve
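The least-squares fit behind the calibration curve can be sketched in a few lines. The concentrations and absorbances below are hypothetical, chosen only to illustrate the procedure:

```python
import numpy as np

# Hypothetical calibration data: three reference standards
# (concentrations in mmol/L) and their measured absorbances.
c = np.array([1.0, 2.0, 3.0])
A = np.array([0.12, 0.23, 0.35])

# Least-squares fit of A = a + b*c: build the design matrix [1, c].
X = np.column_stack([np.ones_like(c), c])
(a, b), *_ = np.linalg.lstsq(X, A, rcond=None)

# Read an unknown concentration off the calibration line: c = (A - a) / b.
A_unknown = 0.29
c_unknown = (A_unknown - a) / b
print(round(c_unknown, 2))
```

The same design-matrix construction (a column of ones plus the regressor) is reused in every regression variant that follows.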
• Two components a and b
  – Measure at two wavelengths, λ1 and λ2
  – Beer–Lambert law, but different for each component and wavelength; the contributions are additive
    – At λ1: A1 = εa1·ca + εb1·cb
    – At λ2: A2 = εa2·ca + εb2·cb
  – In matrix form
Multiregression
Experiment
• Obtain the unknown coefficients ε
  – In this case, two solutions containing only either a or b in known concentrations
  – The molar absorption coefficients are calculated from
    – A1' = εa1·ca at λ1
    – A2' = εa2·ca at λ2
    – A1" = εb1·cb at λ1
    – A2" = εb2·cb at λ2
Multiregression
Calibrate the model
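A sketch of this two-wavelength calibration, assuming hypothetical absorbance readings for the two pure standards and for the mixture:

```python
import numpy as np

# Hypothetical single-component standards: pure a and pure b
# at known concentrations (mmol/L).
c_a, c_b = 1.0, 1.0
A1p, A2p = 0.50, 0.20    # A1', A2': pure-a absorbances at λ1, λ2
A1pp, A2pp = 0.10, 0.40  # A1", A2": pure-b absorbances at λ1, λ2

# Molar absorption coefficients from the four calibration measurements.
E = np.array([[A1p / c_a, A1pp / c_b],   # [εa1, εb1]
              [A2p / c_a, A2pp / c_b]])  # [εa2, εb2]

# For the mixture, A = E @ c, so the concentrations are c = E⁻¹ A.
A_mix = np.array([0.35, 0.28])
c_mix = np.linalg.solve(E, A_mix)
print(c_mix)
```

With two components and two wavelengths the system is square and solved exactly; with more wavelengths than components it becomes a least-squares problem.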
• One dimension
  – y = b0 + b1·x = b0·1 + b1·x
• Generalization of the linear model
Linear regression again
Standard method
ANOVA
In linear regression
Definitions
Sum of squares                       Calculation         D.f.
SST     Total                        Σ y²                n
SSM     Mean                         n·ȳ²                1
SScorr  Corrected for the mean       Σ (y − ȳ)²          n − 1
SSfact  Factors                      SScorr − SSR        p − 1
SSR     Residuals                    Σ (y − ŷ)²          n − p
SSlof   Lack of fit                  SSR − SSpe          n − p − f
SSpe    Pure experimental error      Σ (y − ȳrep)²       f

n = observations; p = coefficients b; f = replications
ANOVA
Example
No.   x    y
 1    0    0.3
 2    1    2.2
 3    2    3.0
 4    2    4.0
ANOVA
Sums of squares
ANOVA
Quality of the fit
Correlation
Mean sum of squares: MSS = SS / D.f.
F-test for goodness of fit: F = MSSfact / MSSR. If this F-value exceeds the critical value in the table, the fit is significant.
F-test for lack of fit: F = MSSlof / MSSpe. If the model is appropriate, this F-value stays below the critical value.
ANOVA
The example
Goodness of fit
This F-value exceeds the critical value at 5 % risk, F(0.05; 1, 2) = 18.51. The fit is statistically significant.
Lack of fit
This value is below the critical value at 5 % risk, F(0.05; 1, 1) = 161. The model is appropriate because the lack of fit is not significant.
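The two F-tests for this example can be reproduced numerically from the four data points in the table:

```python
import numpy as np

# The example data from the slides.
x = np.array([0.0, 1.0, 2.0, 2.0])
y = np.array([0.3, 2.2, 3.0, 4.0])
n, p = len(x), 2                      # observations, coefficients (b0, b1)

# Least-squares fit y = b0 + b1*x.
X = np.column_stack([np.ones_like(x), x])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ b

# Sums of squares.
SS_corr = np.sum((y - y.mean()) ** 2)   # corrected for the mean
SS_R = np.sum((y - y_hat) ** 2)         # residuals
SS_fact = SS_corr - SS_R                # factors
# Pure experimental error from the replicated point x = 2 (f = 1 replication).
y_rep = y[x == 2.0]
SS_pe = np.sum((y_rep - y_rep.mean()) ** 2)
SS_lof = SS_R - SS_pe
f = 1

# F-values, to be compared against the tabulated critical values.
F_fit = (SS_fact / (p - 1)) / (SS_R / (n - p))
F_lof = (SS_lof / (n - p - f)) / (SS_pe / f)
print(round(F_fit, 2), round(F_lof, 3))
```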
ANOVA
Confidence intervals
Variance-covariance matrix: V(b) = (XᵀX)⁻¹ s²
For an appropriate fit with a low value of SSlof, MSSR = sR² can be used instead of SSpe. The diagonal elements of the variance-covariance matrix are the variances of the factors b.
The confidence limits of a factor b (either b0 or b1) are b ± √F(α; 1, n−p) · sb, where sb² is the corresponding diagonal element of the variance-covariance matrix.
The prediction at a given point x0 = (1 x0) is y0 = x0·b.
ANOVA
The example
At 5 % risk level, F(0.05; 1, 2) = 18.51. Thus we obtain
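The variance-covariance matrix and the resulting confidence limits for this example can be computed as follows:

```python
import numpy as np

# Continuing the example: data and least-squares fit.
x = np.array([0.0, 1.0, 2.0, 2.0])
y = np.array([0.3, 2.2, 3.0, 4.0])
X = np.column_stack([np.ones_like(x), x])
b, *_ = np.linalg.lstsq(X, y, rcond=None)

n, p = len(x), 2
s2_R = np.sum((y - X @ b) ** 2) / (n - p)   # MSS_R = s_R²
V = s2_R * np.linalg.inv(X.T @ X)           # variance-covariance matrix
s_b = np.sqrt(np.diag(V))                   # standard errors of b0, b1

# 95 % confidence limits, using F(0.05; 1, 2) = 18.51 from the slides
# (the t-quantile is the square root of this F-value).
t = np.sqrt(18.51)
for name, bi, si in zip(("b0", "b1"), b, s_b):
    print(f"{name} = {bi:.3f} ± {t * si:.3f}")
```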
Multiple linear regression
Ordinary regression
Two-dimensional case
Measurements at three points (x11, x12), (x21, x22), and (x31, x32) are needed
In matrix form this system of equations is written as
The order has been changed to stress similarity to 1-dim regression
Multiple linear regression
Ordinary regression
If there are several dependent variables y, each with a different equation:

Y = X B + Residuals

where Y is n×m, X is n×p, and B is p×m.
Multiple linear regression
Ordinary regression
The equation is solved exactly as the 1-dim equation.
However, if there are linear dependencies between the x's the system becomes singular and cannot be solved.
Prediction of a y0 vector (dimension 1×m) at a given point x0 (dimension 1×p) is y0 = x0 B.
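A sketch of multi-response ordinary regression. The data are hypothetical and noise-free, so the recovered B matches the generating coefficients exactly:

```python
import numpy as np

# Hypothetical data: n = 5 observations, p = 3 regressors (incl. constant 1),
# m = 2 dependent variables.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(5), rng.random((5, 2))])
B_true = np.array([[1.0, 0.5], [2.0, -1.0], [0.0, 3.0]])   # p x m
Y = X @ B_true                                             # n x m, noise-free

# Solve Y = X B in the least-squares sense; B is p x m.
B, *_ = np.linalg.lstsq(X, Y, rcond=None)

# Prediction of y0 (1 x m) at a new point x0 (1 x p): y0 = x0 B.
x0 = np.array([[1.0, 0.2, 0.7]])
y0 = x0 @ B
print(y0)
```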
PCR
Principal component regression
In full multicomponent regression it often happens that some of the x’s are interdependent, i.e., not linearly independent.
To avoid this only a few coordinate axes are used. They are chosen to be orthogonal. The selection of orthogonal coordinates resembles the PCA method.
PCR
PCA revisited
In PCA, the original data matrix X is written as a product of the scores and loadings, X = T Lᵀ.
One method of solving the problem is to use SVD (singular value decomposition). In that case the X matrix is written as X = U W Vᵀ.
Here matrix U corresponds to T and V corresponds to L. They are joined by a diagonal matrix W whose diagonal elements are wii = √λii, where λii are the eigenvalues of the XᵀX matrix. The smallest eigenvalues can be forced to zero; the truncated matrix then removes the small eigenvalues that indicate dependencies.
PCR
The SVD method
Solution of the full linear equation is B = (XᵀX)⁻¹ Xᵀ Y.
Now the matrix X is written in the SVD approximation as X ≈ U W Vᵀ.
Then a pseudo-inverse matrix X⁺ = V W⁻¹ Uᵀ can be used.
The solution along a desired number of principal axes is then B = X⁺ Y, with the elements of W⁻¹ belonging to the discarded axes set to zero.
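A minimal PCR sketch along these lines, assuming hypothetical data with one deliberately built-in linear dependency among the x's:

```python
import numpy as np

# Hypothetical data: 10 observations, 4 regressors, 2 responses.
rng = np.random.default_rng(1)
X = rng.random((10, 4))
X[:, 3] = X[:, 0] + X[:, 1]          # built-in linear dependency
Y = rng.random((10, 2))

# SVD: X = U W Vᵀ (W returned as a vector of singular values).
U, w, Vt = np.linalg.svd(X, full_matrices=False)

d = 3                                # number of principal axes kept
w_inv = np.zeros_like(w)
w_inv[:d] = 1.0 / w[:d]              # small singular values forced to zero

X_pinv = Vt.T @ np.diag(w_inv) @ U.T # pseudo-inverse X⁺ along d axes
B = X_pinv @ Y                       # regression coefficients
print(B.shape)
```

The fourth singular value is numerically zero because of the dependency; truncating it is exactly what keeps the solution finite here.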
PLS
Partial Least Squares method
In PCA the matrix X is split into a product of the scores matrix and the loadings matrix:

X = T Pᵀ + E

where X is n×p, T is n×d, Pᵀ is d×p, and E is n×p.
PLS
Partial Least Squares method
In PLS, also the matrix Y is split into a product of the scores matrix and the loadings matrix:

Y = U Qᵀ + F

where Y is n×m, U is n×d, Qᵀ is d×m, and F is n×m.
PLS
Partial Least Squares method
The solution can then be written as
Here W is a d×p matrix of PLS weights. Only a few of the eigenvalues are kept; the rest are set to zero.
PLS
Algorithm
Initialize: Shift (mean-center) the columns of the X and Y matrices.
Initialize: Use the first column of the Y matrix as the first Y score vector u.
PLS
Algorithm
(1) Compute X-weights
(2) Scale the weights
(3) Estimate the scores of the X matrix
PLS
Algorithm
(4) Compute the Y loadings
(5) Generate a new u vector
Repeat from step (1) until u is stationary.
PLS
Algorithm
(6) Determine the scalar coefficient b for this variable.
(7) Compute the loadings of the X matrix.
(8) Compute the residuals.
PLS
Algorithm
Stopping criterion: Calculate the standard error of prediction from cross-validation, SEPCV.
If SEPCV is greater than it was for the previous number of factors, the optimum number of dimensions has been reached and the final B coefficients can be calculated. Otherwise use the residuals from step (8) as the new X and Y matrices and continue from the initialization step with an additional dimension.
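The algorithm above can be sketched as follows. The cross-validation stopping criterion is omitted for brevity (the number of dimensions is fixed by the caller), and the demo data are hypothetical:

```python
import numpy as np

def nipals_pls(X, Y, n_dims, tol=1e-10, max_iter=500):
    """Minimal NIPALS PLS sketch following the numbered steps above."""
    # Initialize: shift (mean-center) the columns of both matrices.
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    T, P, W, Q, b = [], [], [], [], []
    for _ in range(n_dims):
        u = Y[:, 0].copy()                # first Y column as first u score
        for _ in range(max_iter):
            w = X.T @ u / (u @ u)         # (1) X-weights
            w /= np.linalg.norm(w)        # (2) scale the weights
            t = X @ w                     # (3) X scores
            q = Y.T @ t / (t @ t)         # (4) Y loadings
            u_new = Y @ q / (q @ q)       # (5) new u vector
            if np.linalg.norm(u_new - u) < tol:   # u stationary?
                u = u_new
                break
            u = u_new
        b_a = (t @ u) / (t @ t)           # (6) scalar coefficient b
        p = X.T @ t / (t @ t)             # (7) X loadings
        X = X - np.outer(t, p)            # (8) X residuals
        Y = Y - b_a * np.outer(t, q)      #     Y residuals
        T.append(t); P.append(p); W.append(w); Q.append(q); b.append(b_a)
    return (np.array(T).T, np.array(P).T, np.array(W).T,
            np.array(Q).T, np.array(b))

# Hypothetical demo data: 12 observations, 5 regressors, 2 responses.
rng = np.random.default_rng(2)
X = rng.random((12, 5))
Y = X @ rng.random((5, 2))
T, P, W, Q, b = nipals_pls(X, Y, n_dims=3)
print(T.shape, P.shape)
```

Because each dimension deflates X by t pᵀ, the successive score vectors t come out mutually orthogonal, which is what makes the per-dimension regressions independent.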