|
Optimization
Getting a good calibration with PCR and PLS techniques requires considerable effort from the analyst to optimize the conditions that will lead to the most robust model for the samples under analysis.
Training Set Design
One of the most important factors in creating a good calibration model is the quality of the training set data. As with any quantitative calibration method, the predictive ability of the equations is only as good as the data used to calculate them in the first place. Control over such variables as collecting representative samples, using an accurate primary calibration method, and making appropriate sample measurements is critical for obtaining good results.
One of the most apparent drawbacks of multivariate calibration models is the comparatively large number of training set samples required. Some complex materials may require hundreds or even thousands of samples to be tested and spectra measured before a suitable set of training samples can be identified. While creating a multivariate calibration model takes more work, such models tend to be more robust than the simpler univariate methods and require less maintenance in the long run.
Training Samples Should be as Similar as Possible to "Unknowns"
A common misconception in quantitative spectroscopic calibration is that the spectrum of a constituent looks pretty much the same when it is part of a mixture as when it’s in the pure form. Unless the samples are very simple mixtures or they are being measured in the gas phase, nothing could be further from the truth. Mixing constituents together causes all sorts of changes in the spectra that are not readily apparent. Factor-based models can compensate for these inter-constituent interactions, but only if the training set contains examples of them.
Most samples used for factor-based multivariate quantitative spectroscopic analysis are not simple mixtures. Otherwise, it would not be necessary to use these models and the calibration could be performed using a much simpler method. These models have a distinct advantage for complex samples because they can find the important information in the spectra and ignore the rest. To give the model the best chance of learning to recognize the information for the constituents of interest, it is important to train it using samples that emulate the "unknowns" as closely as possible.
Most experienced analysts using these methods collect actual samples from the plant, the field, or any other source of the material they expect to measure with the calibration model. These samples are then brought back to the lab and analyzed using other primary calibration methods (chromatography, wet chemical test, drying, etc.) to arrive at the constituent values. These data, along with the sample spectra, form the training set they will use for building the calibration model. Remember, one of the main advantages of factor-based multivariate methods is the ability to calibrate for individual constituents in samples with very complex compositions, provided the "unknowns" exhibit the same behavior as the training samples.
Bracket the Expected Range of Constituent Values
As in all quantitative methods, the constituent values for the training samples should span the expected range of all future unknown samples. While the models can sometimes extrapolate outside the calibration range, this is generally not a good idea for multivariate calibration or for any other calibration method. Only external validation can indicate how well a model will predict outside the original calibration range, and even that is not a good measure of future performance.
Basically, this means that the training samples should include constituent values both larger and smaller than the values expected in "unknown" samples. Bracketing the concentration range in this way lets the model give the most accurate answer possible. Keep in mind that calibration points in the middle of the range are required as well.
In some situations, it can be difficult to get samples with both high and low concentrations of the constituents of interest. This is especially true of samples of natural products that tend to be very homogeneous from batch to batch and where the analyst has no control over the sample composition. As mentioned before, it may require collecting hundreds or thousands of samples and testing them with the primary calibration method until a suitable set of samples can be identified for multivariate calibration.
Use Enough Samples to Model the Data Variability
In order to use multivariate calibration methods, the training set must have at least as many samples as there are constituents of interest, and usually many, many more than that. How many samples are required to build a good model? Unfortunately, there is no hard answer like, "use at least 10 samples for one constituent, 20 samples for 2 constituents, etc." The real answer is, "use as many samples as it takes." A statistically significant number of samples is critical both in evaluating the analysis and in obtaining a robust calibration model. The more data, the higher the confidence in the analysis and in the statistics.
Another reason to use a large number of samples for calibration is to allow more factors in the model. Due to the nature of the linear algebra used to solve the eigenvector decomposition of the spectra, the maximum number of factors that can be calculated for a given training set is limited by the smallest dimension of the data matrix. Therefore, a training set with only 10 samples can yield at most 10 factors; if one-sample-out cross validation is used, then only 9 factors are possible. For complex materials, this may not be enough to account for all the variability in the real samples that will be predicted as "unknowns." As a side note, the same limit applies to the number of wavelengths selected for calibration. If a training set has 500 samples but the calibration regions have only 20 total spectral data points, then the maximum number of factors is limited to 20 as well.
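The limit described above can be checked numerically. The following is a minimal sketch, not part of the original text: it assumes NumPy, uses random numbers in place of real spectra, and applies no preprocessing such as mean centering. The function name `count_nonzero_factors` and the tolerance are illustrative choices.

```python
import numpy as np

def count_nonzero_factors(spectra: np.ndarray, tol: float = 1e-10) -> int:
    """Count singular values that are meaningfully above zero.

    Rows are samples, columns are spectral data points. The count can
    never exceed the smaller of the two matrix dimensions.
    """
    s = np.linalg.svd(spectra, compute_uv=False)
    return int(np.sum(s > tol * s[0]))

rng = np.random.default_rng(0)

# Synthetic stand-ins for real training sets (assumption: random data).
small_set = rng.normal(size=(10, 200))   # 10 samples, 200 spectral points
narrow_set = rng.normal(size=(500, 20))  # 500 samples, 20 spectral points

factors_small = count_nonzero_factors(small_set)           # limited by 10 samples
factors_one_out = count_nonzero_factors(small_set[:-1])    # one sample left out
factors_narrow = count_nonzero_factors(narrow_set)         # limited by 20 points

print(factors_small, factors_one_out, factors_narrow)
```

With these shapes the counts come out as 10, 9, and 20, matching the three cases in the paragraph above.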
As mentioned before, most training sets will need many, many samples before an accurate, robust calibration can be built. Since these types of chemometric models look at the relative changes in the data, putting more samples in the training set allows the calculations to more easily identify which spectral information is important (real signal) and which is not (noise). Just remember, the quality of the data is just as important as the quantity, if not more so. Simply piling a huge number of spectra together as a training set will not guarantee a better model than carefully measuring and qualifying a much smaller number for calibration.
Avoid Constituent Collinearity
What exactly is collinearity and why is it a problem in multivariate models? Collinearity is the effect observed when the relative amounts of two or more constituents are constant throughout all the training samples. The reason this causes so much trouble for multivariate models lies in the way they correlate information. Remember that these models do not calibrate by creating a direct relationship between the constituent data and spectral response. Instead, they try to correlate the change in concentration with some corresponding changes in the spectra. When constituents are collinear, multivariate models cannot differentiate them, and the calibrations for the constituents will be unstable.
To give a simple example, consider the case of creating artificial "standards" for calibration. A typical practice is to make one mixture with high concentrations of all constituents of interest and then make multiple dilutions of that one mixture to create the remaining samples. While this approach will work fine for univariate methods and is used quite frequently for Least Squares Regression models, it will completely fail for multivariate models. The main problem is that there is no inter-constituent variation in the data. When the concentration of one constituent increases, they all increase, and vice versa. Correspondingly, the spectra will have the same problem: the spectral responses of the constituents all increase and decrease in sympathy. To a multivariate model, this appears as one constituent regardless of how many were mixed together in the original high concentration standard. To an eigenvector-based model, only one factor will arise containing nearly all the variance in the data. Any sample predicted against this model that does not have exactly the same ratio of constituent concentrations as the mixed standards will be predicted completely wrong.
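The dilution problem can be illustrated with a small simulation. This is a sketch under stated assumptions rather than a real calibration: it uses NumPy, invents three random "pure component" spectra, assumes the mixture spectrum is a simple linear sum of the constituents plus a little noise, and decomposes the raw (uncentered) spectra. All names and numbers are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n_points = 100

# Hypothetical pure-component spectra for three constituents.
pure_spectra = rng.random((3, n_points))

# One stock mixture, then dilutions of it: all constituents scale together.
stock = np.array([10.0, 5.0, 2.0])
dilutions = np.array([1.0, 0.8, 0.6, 0.4, 0.2, 0.1])
conc_diluted = np.outer(dilutions, stock)   # 6 samples x 3 constituents

# For comparison: concentrations varied independently of each other.
conc_varied = rng.random((6, 3)) * stock

def variance_fractions(conc: np.ndarray) -> np.ndarray:
    """Fraction of total variance captured by each factor of the spectra."""
    noise = rng.normal(scale=1e-3, size=(conc.shape[0], n_points))
    spectra = conc @ pure_spectra + noise   # assumed linear, additive mixing
    s = np.linalg.svd(spectra, compute_uv=False)
    return s**2 / np.sum(s**2)

frac_diluted = variance_fractions(conc_diluted)
frac_varied = variance_fractions(conc_varied)

print("diluted standards:", np.round(frac_diluted[:3], 6))
print("varied standards: ", np.round(frac_varied[:3], 6))
```

In the diluted set, the first factor carries essentially all of the variance and the remaining factors contain only noise, so the three constituents cannot be separated. In the independently varied set, the second and third factors carry real information.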
While this example is a fairly obvious one, there are cases where the data can be collinear seemingly without good reason. A simple visual aid for identifying this potential problem is to plot the sample concentrations of each constituent in the model against the others. If the points fall on a straight line, the concentrations are collinear. If the constituents were completely uncorrelated, they would form a nice symmetric square shape. However, in most cases, they will look more like a cluster of points.
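A numerical companion to the visual check is to compute the correlation coefficient for each pair of constituents. The sketch below assumes NumPy and uses made-up concentration values; the function name `collinear_pairs` and the 0.9 cutoff are illustrative assumptions, not recommendations from the text.

```python
import numpy as np

def collinear_pairs(conc: np.ndarray, threshold: float = 0.9):
    """Return (i, j, r) for constituent pairs with |correlation| above threshold.

    conc has one row per training sample and one column per constituent.
    """
    r = np.corrcoef(conc, rowvar=False)
    n = r.shape[0]
    return [
        (i, j, float(r[i, j]))
        for i in range(n)
        for j in range(i + 1, n)
        if abs(r[i, j]) > threshold
    ]

# Hypothetical training-set concentrations (5 samples, 3 constituents).
# Constituent 1 is roughly twice constituent 0; constituent 2 varies freely.
conc = np.array([
    [1.0, 2.1, 7.0],
    [2.0, 3.9, 1.0],
    [3.0, 6.2, 4.0],
    [4.0, 7.8, 9.0],
    [5.0, 10.1, 2.0],
])

pairs = collinear_pairs(conc)
for i, j, r in pairs:
    print(f"constituents {i} and {j} look collinear (r = {r:.3f})")
```

Here only constituents 0 and 1 are flagged. A pair that is flagged this way is worth inspecting on a scatter plot before building the model.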
 |
| Figure 1. This training set has constituent values that are very collinear. Notice the trend in the values for Constituent 2; they increase as the corresponding values for Constituent 1 increase. It will be very difficult for a multivariate model to distinguish these constituents. |
 |
| Figure 2. Another training set with more evenly distributed constituent variations. There appears to be little to no collinearity in the values which should lead to a better multivariate model. |
|