
Analysis techniques and variable selection

Both univariate and multivariate approaches have been applied in spectroscopy [95].

However, due to the lack of specificity and the heavily overlapping overtones in the NIR region, univariate analysis often results in unreliable models; nevertheless, it has been applied for cartilage evaluation by Spahn et al. [19,47–49]. The NIR spectral region was vastly under-exploited by the scientific community until recently, when it gained considerable attention in multiple fields requiring fast analytical solutions. This is due to the introduction and application of multivariate analysis, which enables swift evaluation of complex non-linear data. Multivariate modeling is conventionally applied in classification or prediction problems to understand relationships between non-linear multivariate data and reference parameters. For multivariate modeling, a large number of observations is required in order to train and develop robust and well-generalized models.

Partial least squares regression

Partial least squares regression (PLSR) and principal component regression (PCR) are the most commonly applied chemometric techniques in NIR spectroscopic analysis [96, 109]. Conventionally, PLSR is more robust than PCR due to two crucial differences. Firstly, in a process of deflation, each PLS factor is subtracted from the predictor variables (spectral data), enabling the development of robust models by maximizing the covariance structure in the data. Secondly, PLSR takes into account the variation of the dependent or response variable (y, reference).

The method was first introduced by Wold et al. [109] to describe complicated multivariate systems by a sequence of simple least squares regressions [96]; furthermore, it can efficiently process highly multi-collinear variables. In multivariate analysis, the goal is to predict a reference variable (y) from a set of complex multivariate predictor variables (X) using:

$$y_i = b_0 + \sum_{k} b_k X_{ik},$$

where $X_{ik}$ is the absorption at wavelength $k$, $b_0$ is a regression constant, and $b_k$ are regression coefficients that are estimated statistically based on calibration data: spectra (X) and reference variables (y).
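As an illustration, the following is a minimal sketch of fitting such a model with scikit-learn's PLSRegression; the data arrays, their sizes, and the number of PLS factors are placeholder assumptions, not values from this work.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

# Illustrative data: X holds NIR absorption spectra (samples x wavelengths),
# y the corresponding reference values (e.g., a tissue property).
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 200))                             # placeholder spectra
y = X[:, :5].sum(axis=1) + rng.normal(scale=0.1, size=60)  # placeholder reference

# Fit a PLSR model with a chosen number of latent variables (PLS factors).
pls = PLSRegression(n_components=5)
pls.fit(X, y)

# The regression coefficients b_k relate absorbance at each wavelength to y.
b = pls.coef_.ravel()
y_hat = pls.predict(X).ravel()
```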

In PLSR modeling, choosing the optimal number of PLS factors is essential. If too many factors are included in the model, this may result in overfitting, in which the model is tailored to the training data and does not generalize well to new samples. Conversely, with too few PLS factors, key variance and important relationships between predictor and response variables may be unaccounted for in the model, therefore limiting its predictive capability.

Validation of the accuracy and performance of PLSR models generally involves one of two techniques, namely cross-validation or (independent) test set validation.

Cross-validation techniques, such as leave-one-out (LOO) or k-fold, enable the determination of the optimal number of PLS factors and, thus, ensure generalization of the model. In LOO cross-validation, a single sample is iteratively excluded from modeling and its value is predicted with the model based on the other samples. In k-fold cross-validation, a similar exclusion is performed by dividing the data into k groups.

With independent test set validation, the data are split into two groups: a calibration group and a validation group (e.g., 67% and 33%). A model is built with the calibration group using internal cross-validation (LOO or k-fold), which is followed by validation of the model performance using the independent test set. The prediction performance of the models is determined with the root mean square error (RMSE):
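As an illustration of this workflow, the following sketch (continuing the placeholder example above and assuming scikit-learn) selects the number of PLS factors by k-fold cross-validation on the calibration group and then evaluates the final model on the independent test set; the split ratio, fold count, and factor range are illustrative assumptions.

```python
from sklearn.model_selection import train_test_split, KFold, cross_val_predict
from sklearn.metrics import mean_squared_error

# Split into a calibration group and an independent test group (~67% / 33%).
X_cal, X_test, y_cal, y_test = train_test_split(X, y, test_size=0.33, random_state=0)

# Select the number of PLS factors by minimizing the k-fold cross-validation error.
cv = KFold(n_splits=10, shuffle=True, random_state=0)
errors = []
for n in range(1, 16):
    y_cv = cross_val_predict(PLSRegression(n_components=n), X_cal, y_cal, cv=cv).ravel()
    errors.append(np.sqrt(mean_squared_error(y_cal, y_cv)))
best_n = int(np.argmin(errors)) + 1

# Refit on the whole calibration group and validate on the independent test set.
model = PLSRegression(n_components=best_n).fit(X_cal, y_cal)
rmsep = np.sqrt(mean_squared_error(y_test, model.predict(X_test).ravel()))
```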

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2},$$

where $y_i$ is the measured reference value and $\hat{y}_i$ the predicted value. The RMSE parameter has several variations, which depend on the model and the cross-validation used. The three most common variations are the RMSE of calibration (RMSEC), describing the error of the calibration model; the RMSE of cross-validation (RMSECV), describing the error of the iteratively excluded samples; and the RMSE of prediction (RMSEP), describing the error of an independent test set. The performance of models can also be described with other metrics, such as the Pearson correlation coefficient (r).
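To make the distinction between these variants concrete, a short continuation of the sketch above computes each of them from the corresponding data sets; the helper function and variable names are illustrative.

```python
def rmse(y_true, y_pred):
    """Root mean square error between measured reference and predicted values."""
    return float(np.sqrt(np.mean((np.asarray(y_pred) - np.asarray(y_true)) ** 2)))

# The three common RMSE variants for the model fitted above.
rmsec = rmse(y_cal, model.predict(X_cal).ravel())                            # calibration error
rmsecv = rmse(y_cal, cross_val_predict(model, X_cal, y_cal, cv=cv).ravel())  # cross-validation error
rmsep = rmse(y_test, model.predict(X_test).ravel())                          # independent test set error
```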

Artificial neural network

Artificial neural networks (ANN) were originally designed to mimic how the human brain processes information. Due to the ability of neural networks to model non-linear relationships, and the breakthroughs in implementing them efficiently, they have been applied in several fields for solving complex problems with high accuracy. Modeling with ANN can be unsupervised or supervised, where the latter involves information on the reference values. ANNs are conventionally categorized into shallow or deep networks, including a single or multiple hidden layers, respectively. Each hidden layer includes artificial neurons which receive a set of weighted inputs, process the sum and apply an activation function (σa), and then pass the result onwards to the next layer. The most common activation functions are

$$\begin{aligned}
\text{linear:}\quad &\sigma_a(x) = x\\
\text{sigmoid:}\quad &\sigma_a(x) = \frac{1}{1+\exp(-x)}\\
\text{hyperbolic tangent:}\quad &\sigma_a(x) = \tanh(x).
\end{aligned} \qquad (3.13)$$
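As a rough illustration of how one hidden layer processes information, the following NumPy sketch forms the weighted sum for each neuron and applies an activation function; the layer sizes, weights, and function names are arbitrary assumptions.

```python
# Illustrative forward pass through one hidden layer: each neuron receives weighted
# inputs, sums them, applies an activation function, and passes the result onwards.
def linear(x):
    return x

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hidden_layer(inputs, weights, bias, activation=np.tanh):
    """inputs: (n_in,), weights: (n_neurons, n_in), bias: (n_neurons,)."""
    z = weights @ inputs + bias   # weighted sum received by each neuron
    return activation(z)          # output passed on to the next layer

x_in = rng.normal(size=4)
h = hidden_layer(x_in, rng.normal(size=(3, 4)), np.zeros(3), activation=sigmoid)
```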

The objective in ANN training is to find a set of weights (w) that minimizes the sum of squared errors of prediction

$$E(X,w) = \sum_{i=1}^{n} e_i^2, \qquad e_i = \hat{y}_i - y_i, \qquad (3.14)$$

where $e_i$ is the prediction error for sample $i$. Several algorithms are available for optimizing the weights, including the steepest descent algorithm, Newton's method, the Gauss-Newton algorithm, and the Levenberg-Marquardt algorithm [110, 111]. The Levenberg-Marquardt back-propagation algorithm incorporates the steepest descent method and the Gauss-Newton algorithm [112], and is generally favored due to its stable and fast optimization. The weights are updated as follows

$$w_{k+1} = w_k - \left(J_k^T J_k + \mu I\right)^{-1} J_k e_k, \qquad (3.15)$$

where $J$ is the Jacobian matrix, $\mu$ is the always-positive combination coefficient, and $I$ the identity matrix. When the combination coefficient is very small (µ ≈ 0), the algorithm behaves similarly to the Gauss-Newton algorithm, whereas with a very large combination coefficient, the steepest descent method is used [112]. In modeling, deep neural networks are considered to generalize better but require a substantial amount of data and computational resources. In addition, adopting unnecessarily complicated networks may result in overfitting and, thus, reduce model reliability when the model is presented with new data. During initialization of model building, the initial weights should be randomized and small (close to zero) to break symmetry and avoid saturation, respectively [113]. Additionally, the initial weights have a substantial effect on the convergence speed of learning.
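A minimal sketch of a single update of this type is given below; it assumes the Jacobian of the errors with respect to the weights (rows indexed by samples, columns by weights) and the error vector are already available, and it uses the common J^T e form of the gradient term. All names and sizes are illustrative.

```python
# One Levenberg-Marquardt update step, assuming J (Jacobian of the errors with
# respect to the weights) and e (error vector) are computed for the current weights w.
def levenberg_marquardt_step(w, J, e, mu):
    """Return w - (J^T J + mu*I)^(-1) J^T e."""
    H = J.T @ J + mu * np.eye(w.size)        # damped Gauss-Newton approximation of the Hessian
    return w - np.linalg.solve(H, J.T @ e)   # solve the linear system instead of inverting

# Small mu: behaves like the Gauss-Newton algorithm; large mu: (scaled) steepest descent.
w_new = levenberg_marquardt_step(np.zeros(6), rng.normal(size=(20, 6)), rng.normal(size=20), mu=1e-2)
```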

Variable selection

Multivariate data (e.g., NIR spectra) are often complex and include dispensable parameters; thus, only the variables that enhance model reliability should be retained.

Variable selection techniques can be divided into two categories: techniques that determine the most relevant variables and techniques that eliminate the least relevant, non-contributing variables. Conventionally, the techniques based on selection of the most relevant variables achieve better results more swiftly due to the smaller number of variables processed. Currently, several variable selection techniques are available, such as Monte Carlo uninformative variable elimination (MC-UVE), competitive adaptive reweighted sampling (CARS), variable combination population analysis (VCPA), backward interval partial least squares (BiPLS), genetic algorithm (GA), and jack-knife [114]. In addition, exhaustive variable selection techniques (e.g., forward selection) can be applied. In the forward variable selection technique, univariate models are built and the single most reliable variable is selected based on model performance. The selection is continued by iteratively identifying the next most reliable variable by building a bivariate model while retaining the first variable [115]. An exhaustive search that evaluates all variable combinations is the only approach that guarantees optimal model performance [116]; however, it is extremely time consuming.
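As an illustration of the forward selection procedure described above, the following sketch (reusing the placeholder X, y, rmse, and cv objects from earlier) greedily adds, at each step, the variable that most reduces the cross-validated RMSE; the stopping criterion and model choice (ordinary least squares) are simplifying assumptions.

```python
from sklearn.linear_model import LinearRegression

def forward_selection(X, y, n_select=5):
    """Greedy forward variable selection based on cross-validated RMSE."""
    remaining = list(range(X.shape[1]))
    selected = []
    for _ in range(n_select):
        scores = []
        for j in remaining:
            cols = selected + [j]
            y_cv = cross_val_predict(LinearRegression(), X[:, cols], y, cv=cv).ravel()
            scores.append((rmse(y, y_cv), j))
        _, best_j = min(scores)       # variable giving the lowest cross-validation error
        selected.append(best_j)
        remaining.remove(best_j)
    return selected
```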

3.5 NEAR INFRARED SPECTROSCOPY OF OSTEOCHONDRAL