Algorithms - Principal Component Analysis Methods, Optimization: Selecting the Principal Component


One of the most difficult tasks in using PCR and PLS is determining the correct number of loading vectors (factors) to use to model the data. As more and more vectors are calculated, they are ordered by the degree of importance to the model (either by variance in PCA or concentration weighted variance in PLS). Eventually the loading vectors will begin to model the system noise (which usually provides the smallest contribution to the data).

The earlier vectors in the model are most likely to be the ones related to the constituents of interest, while later vectors generally have less information that is useful for predicting concentration. In fact, if these vectors are included in the model, the predictions can actually be worse than if they were ignored altogether. Thus, decomposing spectra with these techniques and selecting the correct number of loading vectors is a very effective way of filtering out noise.

However, if too few vectors are used to construct the model, the prediction accuracy for unknown samples will suffer since not enough terms are being used to model all the spectral variations that compose the constituents of interest. Therefore, it is very important to define a model that contains enough vectors to properly model the components of interest without adding too much contribution from the noise.

Models that include noise vectors or more vectors than are actually necessary to predict the constituent concentrations are called overfit. Models that do not have enough factors in them are known as underfit.

Unfortunately, there is usually no clear indication of how many factors are required to move from "constituent" vectors into "noise" vectors and prevent both underfitting and overfitting. However, there are a variety of methods that can be used to aid in determining this value. One of the most effective is to calculate the PRESS (Prediction Residual Error Sum of Squares) for every possible factor. This is calculated by building a calibration model with a number of factors, then predicting some samples of known concentration (usually the training set data itself) against the model. The sum of the squared difference between the predicted and known concentrations gives the PRESS value for that model.

$$\mathrm{PRESS} = \sum_{i=1}^{n} \sum_{j=1}^{m} \left( (C_p)_{ij} - C_{ij} \right)^2$$

In the above equation, n is the number of samples in the training set, and m is the number of constituents. Cp is the matrix of predicted sample concentrations from the model, and C is the matrix of known concentrations of the samples.
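The PRESS calculation described above is short enough to write out directly. The following is a minimal sketch using NumPy; the function name `press` and the concentration values are made up for illustration and do not come from any particular software package.

```python
import numpy as np

def press(c_pred, c_known):
    """Sum of squared differences between predicted and known concentrations.

    Both arguments are (samples x constituents) arrays of the same shape.
    """
    c_pred = np.asarray(c_pred, dtype=float)
    c_known = np.asarray(c_known, dtype=float)
    if c_pred.shape != c_known.shape:
        raise ValueError("predicted and known matrices must have the same shape")
    return float(np.sum((c_pred - c_known) ** 2))

# Made-up numbers: 3 samples (rows) and 2 constituents (columns).
c_known = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
c_pred = [[1.5, 2.0], [3.0, 3.0], [5.0, 6.5]]
press_value = press(c_pred, c_known)  # 0.25 + 1.0 + 0.25 = 1.5
```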

The smaller the PRESS value, the better the model is able to predict the concentrations of the calibrated constituents. By calculating the PRESS value for a model using all possible factors (i.e., first with 1 factor, then 2, 3, etc.) and plotting the results, a very clear trend should emerge.

However, as with everything in chemometrics, there are a variety of methods that can be used to optimize a model. The main issue is what data to use during the prediction step before calculating the PRESS.

Self-Prediction

This is the simplest method for testing a calibration model, but unfortunately it is not very useful. In this method, the models are built using all the spectra in the training set, then the same spectra are predicted back against these models. The problem with this approach is that the model vectors are calculated from these same spectra. Therefore, ALL the vectors calculated exist in ALL the training spectra. The PRESS plot will continue to fall as new factors are added to the model and will never rise. In effect, this gives the impression that all the vectors are "constituent" vectors, and there are no "noise" vectors to eliminate, which is never the case with real data.

Figure 1. A PRESS plot for a self prediction validation of a training set of NIR diffuse reflectance spectra of 50 samples of wheat. Note that the PRESS value continues to decrease as new factors are added. There is no clear indication of the optimum number of factors for this model.

The only reason to use this method is that it is very fast. Since it only requires building the models once, predicting the samples can be done in one step. Sometimes it is possible to select the number of factors as the place where the plot starts to "flatten out". However, this is an inexact measure, and gives no indication of the true optimum number of factors for the model when predicting unknown samples.
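The behavior of the self-prediction PRESS can be illustrated with a small sketch. It assumes a PCR model computed through a singular value decomposition of the mean-centered spectra, and it uses synthetic two-constituent data rather than real spectra; the function name and all numbers are hypothetical.

```python
import numpy as np

def pcr_self_prediction_press(X, C, max_factors):
    """PRESS for 1..max_factors when the training set is predicted by its own PCR model."""
    Xc, Cc = X - X.mean(axis=0), C - C.mean(axis=0)
    U, _, _ = np.linalg.svd(Xc, full_matrices=False)
    press = []
    for k in range(1, max_factors + 1):
        Uk = U[:, :k]
        # Regressing the centered concentrations on the first k scores is a
        # projection onto the span of the first k left singular vectors.
        residuals = Cc - Uk @ (Uk.T @ Cc)
        press.append(float(np.sum(residuals ** 2)))
    return press

# Synthetic data (an assumption): 20 samples, 30 wavelengths, 2 constituents.
rng = np.random.default_rng(0)
C = rng.uniform(0.0, 1.0, size=(20, 2))
axis = np.linspace(0.0, 1.0, 30)
S = np.vstack([np.exp(-((axis - 0.3) / 0.1) ** 2),
               np.exp(-((axis - 0.7) / 0.1) ** 2)])
X = C @ S + rng.normal(scale=0.01, size=(20, 30))

self_press = pcr_self_prediction_press(X, C, max_factors=10)
```

Because each added factor only enlarges the space the concentrations are projected onto, the values in `self_press` cannot increase, which is why this kind of plot never shows a minimum.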

Cross-Validation
Cross-validation is conceptually very simple to understand, but it is also the most computationally intensive method of optimizing a model. In effect, cross-validation attempts to emulate predicting "unknown" samples by using the training set data itself. The procedure is as follows:
1. Select a sample (or a small group of samples, if the training set is large enough) and remove the spectrum (spectra) and corresponding concentration data from the training set. Set the factor counter to i = 1.
2. Use the remaining training set samples to perform the decomposition and calibration calculations for factor i (loading vector).
3. Predict the concentrations of the removed sample(s) using the calibration equation from Step 2 and calculate PRESS(i).
4. Increment the factor counter (i = i + 1) and repeat from Step 2 until all desired factors (i = f) have been calculated and predicted.
5. Place the previously left out sample data back into the training set and select a different sample (or group). Return to Step 1 and repeat the calculations. As each sample is left out, add the calculated squared residual error to all the previous PRESS values. Repeat until all samples have been left out and predicted at least once.
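Steps 1 to 5 can be sketched in a few lines. The example below is one possible reading of the procedure, assuming a PCR model fitted by singular value decomposition; the function names `pcr_predict` and `cross_validation_press` and the synthetic data are hypothetical. For clarity it refits the model for every factor, as the steps are written; a practical implementation would likely decompose each reduced training set only once.

```python
import numpy as np

def pcr_predict(X_train, C_train, X_test, k):
    """Fit a k-factor PCR model on the training data and predict X_test."""
    x_mean, c_mean = X_train.mean(axis=0), C_train.mean(axis=0)
    U, s, Vt = np.linalg.svd(X_train - x_mean, full_matrices=False)
    scores = U[:, :k] * s[:k]
    coef, *_ = np.linalg.lstsq(scores, C_train - c_mean, rcond=None)
    return (X_test - x_mean) @ Vt[:k].T @ coef + c_mean

def cross_validation_press(X, C, max_factors, group_size=1):
    """Accumulate PRESS(i) over all left-out groups for i = 1..max_factors."""
    n = X.shape[0]
    press = np.zeros(max_factors)
    for start in range(0, n, group_size):                # Steps 1 and 5: rotate groups
        left_out = np.arange(start, min(start + group_size, n))
        keep = np.setdiff1d(np.arange(n), left_out)
        for i in range(1, max_factors + 1):              # Steps 2 to 4: loop over factors
            pred = pcr_predict(X[keep], C[keep], X[left_out], i)
            press[i - 1] += np.sum((pred - C[left_out]) ** 2)
    return press

# Synthetic data (an assumption): 20 samples, 30 wavelengths, 2 constituents.
rng = np.random.default_rng(0)
C = rng.uniform(0.0, 1.0, size=(20, 2))
axis = np.linspace(0.0, 1.0, 30)
S = np.vstack([np.exp(-((axis - 0.3) / 0.1) ** 2),
               np.exp(-((axis - 0.7) / 0.1) ** 2)])
X = C @ S + rng.normal(scale=0.01, size=(20, 30))

cv_press = cross_validation_press(X, C, max_factors=8)
```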

Figure 2. A PRESS plot for a cross validation prediction of a training set of NIR diffuse reflectance spectra of 50 samples of wheat. Note how the plot reaches a minimum (at approximately 7 factors), then starts to rise as more "noise" vectors are added to the model.

In Figure 2 notice that from 0 to 7 factors the prediction error (PRESS) decreases as each new factor is added to the model. This indicates that the model is underfit and there are not enough factors to completely account for the constituents of interest.

At some point the PRESS plot should reach a minimum and start to ascend again. At this point the model is beginning to add factors that contain uncorrelated noise which are not related to the constituents of interest. When these extra "noise" vectors are included in the model, it is overfit and its predictive ability is diminished.

There are two main advantages of cross-validation over all other methods. The first is in how it estimates the performance of the model. Since the predicted samples are not the same as the samples used to build the model, the calculated PRESS value is a very good indication of the error in the accuracy of the model when used to predict "unknown" samples in the future. The larger the training set and the smaller the groups of samples left out in each pass (optimally only one sample at a time, but this can be very time consuming), the better this estimate will be. In effect, the model is validated with a large number of "unknown" samples (since each training sample is left out at least once) without having to measure an entirely new set of data (see Validation Set below).

The second benefit of cross-validation is better outlier detection. Cross-validation is the only validation method that can give complete outlier detection for the training set data. Since each sample is left out of the models during the cross-validation process, it is possible to calculate how well the spectrum matches the model by calculating the spectral reconstruction and comparing it to the original training spectrum (via the spectral residual). If the predicted concentrations for a single sample are way off and the spectrum does not match the model very well but the rest of the data works very well, the sample is possibly an outlier. Identifying and removing outlier samples from the training set should always improve the predictive ability of the model.

It is very difficult to perform outlier detection on the training set data without performing a complete Cross-Validation. The results of the other validation methods (Self-Prediction, Leverage and Validation Set) are generally not adequate since the predictions are based on a model built using every available sample. Any unique variations that are present in outlier sample(s) are therefore built into the model. Thus, when the validation spectra (either the training set or a separate validation set) are predicted back against the model, it can appear to be working well. The accuracy of the predictions is actually worse than if those training samples were removed and the model rebuilt.

Unfortunately, cross-validation is a very time consuming process. It requires re-calculating the models for every sample left out. However, there are a few somewhat acceptable short cuts. If the number of samples in the training set is large enough, the number of samples rotated out in each pass can be more than one. This obviously does not give the best statistics for each sample, but it does speed the calculations and can be acceptable for determining the number of factors for the model.

In fact, in some cases, leaving out groups of samples at a time can be preferable to leaving out only one at a time. In training sets that contain replicate spectra of the same sample, the rotation should be performed on each standard sample, not on each spectrum. For example, if a training set of 50 spectra contains two spectra each of 25 known samples, then each pair of replicates should be left out together. This completely removes the contribution of that sample from the model before prediction. Otherwise, if a rotation value of one is used, there will always be a similar spectrum of the removed sample in the set and the sample will never be predicted as a true unknown.
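A minimal sketch of rotating by sample rather than by spectrum is shown below. It assumes each spectrum carries a label identifying the physical sample it was measured from; the function name `replicate_folds` and the labels are hypothetical.

```python
from collections import defaultdict

def replicate_folds(sample_ids):
    """Group spectrum indices by the physical sample they came from."""
    folds = defaultdict(list)
    for index, sample_id in enumerate(sample_ids):
        folds[sample_id].append(index)
    return list(folds.values())

# 50 spectra: two replicate spectra for each of 25 samples (made-up labels).
sample_ids = [f"wheat_{k:02d}" for k in range(25) for _ in range(2)]
folds = replicate_folds(sample_ids)
```

Each entry of `folds` lists the spectra that would be left out together in one cross-validation pass, so no replicate of the removed sample remains in the model.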

Another trick is to use cross-validation to perform a pseudo-validation set prediction. If the training set is very large, setting the rotation to one-half the total number of samples effectively accomplishes the same goal. By building a model with half the training set data and predicting it with the other half, similar trends will appear in the PRESS plot. The added advantage is that all the collected training samples can ultimately be used to build the final calibration, making it more robust.

Leverage Prediction
This method is an attempt to compromise between a full cross-validation (which is very slow, but gives the best estimate of the model's performance when it is applied to unknown samples) and a self prediction (which is very fast, but gives limited information about the predictive ability of the model).

Figure 3. A PRESS plot for a leverage validation of a PCR model built from a training set of NIR diffuse reflectance spectra of 50 samples of wheat. The minimum occurs at 13 factors; however, an argument could also be made for selecting 11. In either case, the results are higher than the 7 factors reported by cross-validation for the same data set.

In leverage validation, the models are built using all of the training spectra, similar to self prediction. However, when the samples are predicted, the scores are corrected for the individual sample leverage. The leverage is a measure of the importance of the sample to the overall model equations. Generally, samples at the high and low end of the constituent concentration range will have large leverages, while samples that lie closer to the mean concentrations will have low leverages. If a single sample has a leverage that is significantly larger than the rest of the training set, this can indicate that the spectrum is very different from the rest of the training set and does not represent the actual sample (this is called an "outlier", but more about that later).

Figure 4. A PRESS plot for a leverage validation of a PLS model built from a training set of NIR diffuse reflectance spectra of 50 samples of wheat. With this data set, no minimum is reached within the 20 factors calculated.

When this method is used, the models are only calculated once, which significantly increases the speed of the analysis. However, the leverage correction is applied to the scores assuming that they will then be used in a regression. While this is fine for PCR models, remember that PLS models do not use a separate regression step. The outcome is that leverage validation works fairly well with PCR models, but suffers the same problem as self prediction when applied to PLS models.
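The text does not give the exact form of the leverage correction, so the sketch below assumes one common definition for a PCR model: the leverage of sample i is 1/n plus the squared length of its row in the first k left singular vectors, and each self-prediction residual is divided by (1 - leverage). The function name `leverage_press` and the synthetic data are hypothetical.

```python
import numpy as np

def leverage_press(X, C, k):
    """Return (self-prediction PRESS, leverage-corrected PRESS, leverages) for k factors."""
    n = X.shape[0]
    Xc, Cc = X - X.mean(axis=0), C - C.mean(axis=0)
    U, _, _ = np.linalg.svd(Xc, full_matrices=False)
    Uk = U[:, :k]
    # Self-prediction residuals of the regression of concentrations on the scores.
    residuals = Cc - Uk @ (Uk.T @ Cc)
    # Assumed leverage definition: 1/n for the mean plus the score contribution.
    leverages = 1.0 / n + np.sum(Uk ** 2, axis=1)
    corrected = residuals / (1.0 - leverages)[:, None]
    return float(np.sum(residuals ** 2)), float(np.sum(corrected ** 2)), leverages

# Synthetic data (an assumption): 20 samples, 30 wavelengths, 2 constituents.
rng = np.random.default_rng(0)
C = rng.uniform(0.0, 1.0, size=(20, 2))
axis = np.linspace(0.0, 1.0, 30)
S = np.vstack([np.exp(-((axis - 0.3) / 0.1) ** 2),
               np.exp(-((axis - 0.7) / 0.1) ** 2)])
X = C @ S + rng.normal(scale=0.01, size=(20, 30))

plain, corrected, h = leverage_press(X, C, k=3)
```

Under this definition the correction only accounts for the regression step: the scores themselves are still computed from all samples, which is consistent with the method working better for PCR than for PLS.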

Validation Set Prediction
In this method, a new set of spectra is measured under the same conditions as the training set, and the corresponding concentrations are determined by the primary method. This data is called a validation set. This new data set is then predicted against the calibration models built from the training set. This approach gives the best estimate of the model's performance since none of the samples in the validation set was used to build the model.

The downside to using a separate validation set is the time and cost involved in generating the data. When the validation set method is used, typically a very large set of training data (both spectra and concentrations) is measured. This is then split into two groups: a set of training data and a set of validation data. Depending on the available samples, the splits could be 80%/20%, 60%/40%, or even 50%/50% respectively of the total number of samples. After the model is built, the validation set is effectively "discarded." However, if the same data were used in a cross-validation, every sample collected could be used to both build the model and then validate it.
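Splitting one large data set into training and validation groups might look like the sketch below. The function name, the fixed random seed, and the 80%/20% default are illustrative assumptions; if the data contain replicate spectra, the split would need to be made by sample rather than by spectrum.

```python
import random

def split_training_validation(n_samples, training_fraction=0.8, seed=0):
    """Randomly assign sample indices to a training group and a validation group."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_train = round(training_fraction * n_samples)
    return sorted(indices[:n_train]), sorted(indices[n_train:])

train_idx, valid_idx = split_training_validation(50, training_fraction=0.8)
```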

Figure 5. A PRESS plot of a validation set prediction (solid) of 50 samples and the cross-validation prediction (dashed) from Figure 2 of a training set of 50 samples of NIR diffuse reflectance spectra of wheat. The minimum in the validation set prediction is at 8 factors.

The main advantage of validation set prediction is the ability to test the model's performance with a completely different data set than the calibration/training data. This is most important for determining the long term stability of the model. If a model is constructed and used to predict samples, there is no guarantee that the spectrometer will continue to perform exactly the same way. There are many things that affect the spectrum of a sample that cannot be completely controlled: spectrometer/detector wear, sample handling, environment (moisture, temperature). Collecting and predicting a validation set long after the model has been in use is one way of ensuring that the concentration predictions are still within the desired range of accuracy.

Selecting the Factors Based on PRESS
To avoid building a model that is either overfit or underfit, the number of factors where the PRESS plot reaches a minimum would be the obvious choice for the best model (except in the case of Self-Prediction). While the minimum of the PRESS may be the best choice for predicting the particular set of samples, most likely it is not optimum for prediction of all unknown samples in the future.

Since there are a finite number of samples in the set used for prediction, in many cases the number of factors that gives a minimum PRESS value can still be overfit for predicting unknown samples. In other words, there is a statistical possibility that some of the "noise" vectors from the spectral decomposition may be present in more than one sample. These vectors can appear to improve the calibration by a small amount when, by random correlation, they are added to the model. However, if these exact same noise vectors are not present in future unknown samples (and most likely they will not be), the predicted concentrations will have significantly larger prediction errors than if those additional vectors were left out of the model.

A solution to this problem has been suggested in which the PRESS values for all previous factors are compared to the PRESS value at the minimum. The ratio between these values (also known as the F-ratio) can be calculated and assigned a statistical significance based on the number of samples used in the calibration set:

$$F(i) = \frac{\mathrm{PRESS}(i)}{\mathrm{PRESS}_{\min}}$$

where i indicates the number of factors in the model and $\mathrm{PRESS}_{\min}$ is the PRESS value at the minimum.

This ratio is an indicator of the relative significance of each model to the model with the number of factors at the minimum of the PRESS. The number of factors where the F-ratio falls below a predefined significance level determines the optimum number of factors for a model used for predicting unknowns. The work done in Reference 16 suggests that this is the point at which adding a new factor to the model causes the F-test probability level to fall at or below 0.75. This is applied by calculating the F-ratio as described, and looking the value up in a table of F-statistic values (these can commonly be found in the back of a statistics book) for the α = 0.25 significance level.

In order to use the F-statistic tables properly, it is also necessary to know the degrees of freedom in both the numerator (n1) and denominator (n2) of the F-ratio value. For F-ratios based on PRESS values, the number of samples used to calibrate the model has been suggested as the proper value for both. Therefore, in the case of a cross-validation, the degrees of freedom would be the total number of samples in the training set minus the number left out in each group. For a validation set prediction, they would be the total number of samples in the training set.
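The selection rule can be sketched as follows, under the assumption that the chosen model is the one with the fewest factors whose F-ratio against the PRESS minimum falls below the tabulated critical value. The function name `select_factors`, the PRESS values, and the critical value of 1.25 are all made up for illustration; a real critical value has to be looked up for the actual degrees of freedom.

```python
def select_factors(press_values, f_critical):
    """Return the smallest number of factors whose F-ratio is below f_critical.

    press_values[i - 1] is the PRESS for a model with i factors.
    """
    press_min = min(press_values)
    n_min = press_values.index(press_min) + 1
    if press_min <= 0.0:
        return n_min  # the ratio is undefined; fall back to the PRESS minimum
    for i in range(1, n_min + 1):
        if press_values[i - 1] / press_min < f_critical:
            return i
    return n_min

# Made-up PRESS values for 1..7 factors; the minimum is at 5 factors.
press_values = [10.0, 4.0, 2.4, 2.1, 2.0, 2.2, 2.5]
# Placeholder critical value, NOT taken from a real F table.
chosen = select_factors(press_values, f_critical=1.25)  # ratios: 5.0, 2.0, 1.2, ...
```

With these numbers the rule picks 3 factors rather than the 5 at the PRESS minimum. If SciPy is available, `scipy.stats.f.ppf(0.75, n1, n2)` is one way to obtain the critical value for the 0.25 significance level instead of using a printed table.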

Applying the F-test to PRESS values from a self-prediction generally does not work. This is due to the fact that the F-test is primarily designed to find the statistically optimum number of factors for predicting samples that were not included when the model was built. In the self-prediction scheme, every sample is already included in the model which gives no information on the performance of the model with true unknowns. This is merely one more reason why one of the other validation methods should be used to optimize the number of factors for the model.


For more information, see References.