Algorithms - Principal Component Analysis Methods Optimization: Outlier Sample Detection


Outlier detection is as important as choosing the optimum number of factors for the model. If one or more of the training samples are in error, they will cause errors in the calibration model and ultimately poor prediction results for unknowns, because the models attempt to account for all the variations in the training set data when they are calibrated. Outlier samples usually arise from some incorrect measurement, whether in the concentration data (e.g., errors in the primary calibration technique, transcription errors) or in the spectral data (e.g., spectrometer error, sample handling procedures, environmental conditions such as temperature and humidity). Including outlier samples in the training set introduces a bias into the final model. In effect, outlier samples tend to "pull" the model in their direction, making the predicted concentrations of valid samples less accurate (or even erroneous) than they would be if the outliers were eliminated from the training set.

Concentration Residuals
One powerful tool for outlier detection is the cross-validation procedure used to calculate the PRESS values described previously. When the optimum number of factors for the model has been determined, the predicted concentrations of each training sample from the sample rotation with the selected factor model can be used for outlier detection. The difference between the actual and predicted concentrations for a sample is known as the concentration residual.

The model attempts to account for all the variations in the training data when the calibration calculations are performed. Therefore, the prediction error of most of the samples should be approximately the same. Samples that have significantly larger concentration residuals than the rest of the training set are known as concentration outliers.

Figure 1. A plot of concentration residual of the predicted hydroxyl number versus the sample number for a cross-validation of a PLS model built from 35 FT-NIR spectra. Note that sample 31 has a significantly different residual than the remainder of the data set, indicating that it is probably an outlier.

This type of outlier generally arises when the experimenter makes a mistake in creating the calibration mixtures, or when there is an error in the analysis of the sample by the primary calibration technique used to generate the calibration concentration values. Another frequent cause is a transcription error: the analyst simply types in the wrong concentration value when building the computerized training set.

In Figure 1 above, sample #31 is clearly different from the rest of the training set and is most likely a concentration outlier. However, outliers in most data sets will not be as obvious as this. While the human eye is excellent at discerning patterns in data, visual inspection is not always a valid basis for a decision of this type. What is really needed is a mathematical way to determine the likelihood that a sample is an outlier.

The F-test method for determining the optimum number of factors from a cross-validation PRESS analysis is also useful for determining the statistical significance of a sample's concentration residual with respect to the rest of the training set. In this case, the F-ratio value is calculated by:

$$F_i = \frac{(n-1)\,Cr_i^{2}}{\sum_{j \neq i} Cr_j^{2}}$$

where i is the number of the sample being tested, n is the number of samples in the training set, and Cr are the concentration residual values of the sample predictions. To determine whether the sample is an outlier, this F-ratio can be looked up in an F-statistic table. In this case, the degrees of freedom are one (1) for the numerator (ν1) and (n-1) for the denominator (ν2). Generally, samples that exhibit probabilities of 0.99 or higher (α = 0.01) are considered outliers and should be removed from the training set before calculating the final calibration model.
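
The table lookup described above can be sketched in a few lines of Python. This is a minimal illustration, not a definitive implementation: it assumes the F-ratio has the form F_i = (n-1)·Cr_i² / Σ_{j≠i} Cr_j², the function names and the residual values are hypothetical, and the critical value 10.56 is the tabulated F value for α = 0.01 with (1, 9) degrees of freedom, which matches the ten made-up samples used here.

```python
def concentration_f_ratios(residuals):
    """Return F_i = (n - 1) * Cr_i**2 / sum_{j != i} Cr_j**2 for every sample."""
    n = len(residuals)
    squares = [r * r for r in residuals]
    total = sum(squares)
    ratios = []
    for sq in squares:
        rest = total - sq  # squared residuals of the other n - 1 samples
        ratios.append((n - 1) * sq / rest if rest > 0 else float("inf"))
    return ratios


def flag_outliers(ratios, f_critical):
    """Return 0-based indices of samples whose F-ratio exceeds the critical value."""
    return [i for i, f in enumerate(ratios) if f > f_critical]


# Hypothetical cross-validation concentration residuals for 10 training samples.
cr = [0.10, -0.20, 0.15, -0.10, 0.05, 0.20, -0.15, 0.10, 2.00, -0.05]
ratios = concentration_f_ratios(cr)

# Tabulated critical value for alpha = 0.01 with (1, n - 1) = (1, 9) degrees of freedom.
outliers = flag_outliers(ratios, f_critical=10.56)
print(outliers)
```

With these made-up residuals only the ninth sample (index 8) exceeds the critical value. After removing a flagged sample, the test would normally be repeated on the reduced training set, since n and the critical value both change.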

Spectral Residuals
Another powerful tool in seeking out outlier samples is the spectral residual. This was discussed briefly in an earlier section. Similar to looking for concentration outliers, spectral outliers are detected by using a model for which the optimum number of factors has been determined by a cross validation.

Remember that when each sample is predicted, a set of scores is found that best fits the model loading vectors to the sample spectrum. By using the calculated scores and the calibration loading vectors, a model-reconstructed spectrum can be calculated. This new spectrum is what the PLS or PCR model thinks the sample spectrum should look like. The spectral residual is the difference between this spectrum and the actual sample spectrum and is calculated as:

$$Sr = \sum_{k=1}^{p} \left(A_{orig,k} - A_{pred,k}\right)^{2}$$

where p is the number of wavelengths (data points) in the spectrum, Aorig are the original spectrum absorbances, and Apred are the model predicted spectrum absorbances.
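
As a rough sketch of this calculation, the Python below rebuilds a spectrum from scores and loading vectors and then sums the squared differences over the p data points. It assumes the spectral residual is that sum of squares, and it ignores any mean-centering or other preprocessing a real PLS or PCR model would apply. The function names and the four-point "spectrum" are hypothetical.

```python
def reconstruct_spectrum(scores, loadings):
    """Rebuild a spectrum as the sum of score_k * loading_k over all factors.

    scores   -- list of f score values for one sample
    loadings -- list of f loading vectors, each with p absorbance points
    """
    p = len(loadings[0])
    return [sum(s * vec[k] for s, vec in zip(scores, loadings)) for k in range(p)]


def spectral_residual(a_orig, a_pred):
    """Sum of squared differences between measured and reconstructed spectra."""
    if len(a_orig) != len(a_pred):
        raise ValueError("spectra must have the same number of data points")
    return sum((o - p) ** 2 for o, p in zip(a_orig, a_pred))


# Hypothetical two-factor model and a four-point measured spectrum.
loadings = [[1.0, 0.0, 1.0, 0.0],
            [0.0, 1.0, 0.0, 1.0]]
scores = [0.5, 0.2]
measured = [0.5, 0.2, 0.6, 0.2]

predicted = reconstruct_spectrum(scores, loadings)
sr = spectral_residual(measured, predicted)
print(predicted, sr)
```

Only the third data point differs from the reconstruction here, so the whole residual comes from that single point.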

Figure 2. A plot of the spectral residual versus sample number for a cross-validation of a PLS model of Research Octane Number (RON) built from 57 NIR spectra of gasoline. Notice that sample number 45 has a significantly different residual than the remainder of the set indicating that it is a possible outlier.

As with concentration residuals, samples that have significantly higher spectral residuals than the rest of the training set may be outliers. Spectral outliers can be caused by many different factors including inconsistent sample handling, changes in the performance of the instrument, or anything that contributes to a significant change in the spectrum of a given sample.

The spectral residual plot in Figure 2 was chosen to illustrate an obvious outlier (sample #45). Even so, it is still necessary to use the F-test method to confirm that suspect samples are indeed outliers, in the same manner as for concentration outliers. For spectral residuals, the F-ratio is calculated as:

$$F_i = \frac{(n-1)\,Sr_i}{\sum_{j \neq i} Sr_j}$$

where the subscript i indicates the number of the sample being tested, n is the number of samples in the training set, and Sr are the spectral residual values of the sample predictions.

Unfortunately, there is some debate over the actual number of degrees of freedom to use for spectral residuals. Some values have been suggested that seem to work well in practice: one (1) for the numerator (ν1) and (n-f-1) for the denominator (ν2), where f is the number of factors in the model. Again, samples that exhibit probabilities of 0.99 or higher (α = 0.01) are considered outliers and should be removed from the training set before calculating the final calibration model.

Cluster Analysis
There are other methods of outlier detection that are more abstract but equally valuable. Cluster analysis is a method that is used to look for samples which have scores inconsistent with other samples in the training set. In this technique, the scores of one loading vector are plotted versus the scores of another vector for every sample in the training set. (Remember that the scores are the scalar values by which each loading is multiplied to reconstruct the original spectrum.) If all the samples in the training set are similar in composition and calibration value, the data points will tend to "cluster" about some mean value. If a sample point lies significantly outside this cluster, it indicates that the ratio of the two factor scores for this sample is inconsistent with the other spectra in the training set and it may be an outlier.

There is, however, one exception: samples that lie at the ends of the calibration concentration range (i.e., the sample contains the highest or the lowest concentration of a constituent) can be expected to lie at the extreme limits of the cluster. An extreme sample will sometimes appear as an outlier, even though it may not be one at all.

Figure 3. The Mahalanobis distance of a point is measured from the mean point of the cluster (indicated by X). Unlike an absolute distance measurement, it takes into account the "shape" of the cluster. Although points A and B appear to be equidistant from the mean, in terms of Mahalanobis distance, A is much closer and therefore more likely to be a member of the cluster.

Once again, it is desirable to have a more statistical measure of a sample's potential to be an outlier than simple visual inspection. For score clusters, it is possible to use the Mahalanobis distance. This is calculated as the distance of the potential outlier sample point from the mean of all the remaining points in the cluster. The distance is scaled for the range of variation in the cluster in all dimensions, so that it is expressed in units of standard deviation. Any sample that lies more than 3 standard deviations from the mean can be considered suspicious.
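
For a two-dimensional score plot, the distance can be sketched directly from the sample covariance of the cluster. The Python below is an illustration under stated assumptions rather than the method of any particular package: the function name and the score values are hypothetical, the candidate point is excluded from the cluster before the mean and covariance are computed, and only two score dimensions are handled.

```python
import math


def mahalanobis_2d(point, cluster):
    """Mahalanobis distance of `point` from the mean of `cluster`.

    point   -- (x, y) scores of the candidate sample
    cluster -- list of (x, y) scores of the remaining training samples
    """
    n = len(cluster)
    mx = sum(x for x, _ in cluster) / n
    my = sum(y for _, y in cluster) / n
    # Sample covariance matrix [[sxx, sxy], [sxy, syy]] of the cluster.
    sxx = sum((x - mx) ** 2 for x, _ in cluster) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in cluster) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in cluster) / (n - 1)
    det = sxx * syy - sxy * sxy
    if det <= 0:
        raise ValueError("cluster covariance is degenerate")
    dx, dy = point[0] - mx, point[1] - my
    # d^2 = [dx dy] * inverse(covariance) * [dx dy]^T, written out for 2 x 2.
    d2 = (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det
    return math.sqrt(d2)


# Hypothetical score cluster elongated along the diagonal, mean at (0, 0).
cluster = [(2, 2), (-2, -2), (1, 1), (-1, -1), (0.5, -0.5), (-0.5, 0.5)]
d_a = mahalanobis_2d((1.5, 1.5), cluster)   # along the long axis of the cluster
d_b = mahalanobis_2d((1.5, -1.5), cluster)  # across the short axis
print(d_a, d_b)
```

The two test points are equally far from the mean in absolute terms, but the point along the long axis of the cluster falls well inside 3 standard deviations while the point across the short axis falls outside.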

The Mahalanobis distance is also useful in qualitative analysis of spectral data for which the constituent concentrations are not known. This method, along with the mathematics, will be discussed in a later section.

Figure 4. Sometimes trends in the training data can be revealed by looking at score cluster plots. This plot shows the sample scores of the first two principal components of the training set data for the same model in Figure 2 above. The data are clearly split into two separate clusters indicating two different types of samples. After careful examination of the sample concentration values, it was determined that the clusters do not represent the low and high concentrations of RON. It was later discovered that the spectra had been collected at different times by two different analysts.

 

Leverage and Studentized T-Test
Another useful plot for identifying outliers is a plot of the Studentized concentration residuals versus the leverage value for each sample in the training set. The leverage value gives a measure of how important an individual training sample is to the overall model. The Studentized residuals give an indication of how well the sample's predicted concentration is in line with the leverage. A sample with a very high leverage compared to the rest of the training set is not necessarily an outlier; it could simply be a sample at the high or low end of the concentration range. However, if a sample has both a high leverage and a Studentized residual that is very different from the rest of the data set, it can most likely be eliminated as an outlier.

Figure 5. A Leverage vs. Studentized Residual plot for the same cross-validation prediction of the model from Figure 2 above. Note that the Studentized concentration residual and the leverage of sample #45 are both significantly larger than those of the remainder of the training set. This is another confirmation that this sample is an outlier.

Sample leverages are calculated from the factor scores in PCA/PCR and PLS models. It is a relatively simple calculation:

$$H = S\,(S^{T}S)^{-1}S^{T}, \qquad \text{Leverage}_i = H_{ii}$$

where S is the n by f matrix of sample scores, and H is an n by n square matrix known as the Hat matrix (see References). As before, n is the number of samples in the training set, and f is the number of factors in the model. The subscript i is the sample number in the training set. Note that the individual sample leverages are the diagonal elements of the Hat matrix.
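
The diagonal of the Hat matrix can be computed without forming the full n by n matrix. The sketch below assumes H = S (SᵀS)⁻¹ Sᵀ, so that the leverage of sample i is s_i (SᵀS)⁻¹ s_iᵀ, where s_i is row i of S. The helper names and the score values are hypothetical, and no intercept term is added to S.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with pivoting."""
    m = len(b)
    aug = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("S^T S is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(m):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][m] / aug[i][i] for i in range(m)]


def leverages(scores):
    """Diagonal elements of H = S (S^T S)^-1 S^T for an n-by-f score matrix S."""
    f = len(scores[0])
    # f-by-f matrix S^T S.
    sts = [[sum(row[a] * row[b] for row in scores) for b in range(f)]
           for a in range(f)]
    result = []
    for row in scores:
        x = solve(sts, row)  # x = (S^T S)^-1 s_i
        result.append(sum(s * xi for s, xi in zip(row, x)))
    return result


# Hypothetical scores for 5 samples and 2 factors; the last sample is extreme.
S = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0], [3.0, 0.0]]
h = leverages(S)
print(h)
```

A useful sanity check is that the leverages sum to f, the number of factors, because the trace of the Hat matrix equals its rank.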

The Studentized residual is then calculated by:

where Cr are the concentration residuals of every sample in the training set.
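
For illustration only, the sketch below uses one common textbook form of the (internally) Studentized residual, t_i = Cr_i / (s · sqrt(1 - h_i)), with s² = Σ Cr_j² / (n - f - 1) and h_i the leverage of sample i. Both this form and the choice of n - f - 1 degrees of freedom are assumptions and may differ from the definition the text has in mind. The function name and all numbers are hypothetical.

```python
import math


def studentized_residuals(residuals, leverage, n_factors):
    """Assumed form: t_i = Cr_i / (s * sqrt(1 - h_i)), s^2 = sum(Cr^2) / (n - f - 1)."""
    n = len(residuals)
    dof = n - n_factors - 1
    if dof <= 0:
        raise ValueError("not enough samples for this number of factors")
    s = math.sqrt(sum(r * r for r in residuals) / dof)
    # Every leverage is assumed to be strictly less than 1.
    return [r / (s * math.sqrt(1.0 - h)) for r, h in zip(residuals, leverage)]


# Hypothetical residuals and leverages for a one-factor model with 5 samples.
cr = [0.1, -0.1, 0.1, -0.1, 0.8]
lev = [0.1, 0.1, 0.1, 0.1, 0.6]
t = studentized_residuals(cr, lev, n_factors=1)
print(t)
```

Plotting these values against the leverages gives the kind of plot described above: the last made-up sample stands out on both axes.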


For more information, see References.