Algorithms

Discriminant Analysis, The PCA/MDR Method

Principal Component Analysis (PCA) is a very effective data reduction technique for spectroscopic data. To review, PCA decomposes the training set spectra into mathematical spectra (called loading vectors, factors, principal components, etc.) which represent the variations most common to all the data. A set of scaling coefficients (called scores) for each factor can be calculated for every spectrum in the training set. When the scores are multiplied by the loading vectors, and the results summed, the original spectra are reconstructed. Given the set of loading vectors, the scores therefore represent the spectra as accurately as the original responses at all the wavelengths.
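The decomposition and reconstruction described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the spectra are synthetic, the data are mean-centered before the decomposition, the factors are taken from a singular value decomposition, and the number of factors (f = 2) is chosen by hand. None of these choices is specified in the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "training spectra": n spectra by p wavelengths, built from two
# made-up band shapes plus a little noise (illustrative data only).
n, p = 20, 100
x = np.linspace(0.0, 1.0, p)
band_a = np.exp(-((x - 0.3) / 0.05) ** 2)
band_b = np.exp(-((x - 0.7) / 0.08) ** 2)
A = (rng.uniform(0.5, 1.5, (n, 1)) * band_a
     + rng.uniform(0.5, 1.5, (n, 1)) * band_b
     + rng.normal(0.0, 0.001, (n, p)))

# Assumed preprocessing: mean-center, then decompose with the SVD.
mean_spectrum = A.mean(axis=0)
U, s, Vt = np.linalg.svd(A - mean_spectrum, full_matrices=False)

f = 2                            # number of factors kept (assumed)
F = Vt[:f]                       # f x p loading vectors (factors)
S = (A - mean_spectrum) @ F.T    # n x f scores

# Scores multiplied by the loading vectors, and summed, rebuild the spectra.
A_hat = S @ F + mean_spectrum
print("largest reconstruction error:", np.abs(A - A_hat).max())
```

Each spectrum of 100 points is now described by two scores, and the reconstruction error is on the order of the added noise.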

Replacing the selected wavelengths in the Mahalanobis distance equation with PCA scores gives a much more robust solution to the discriminant analysis problem. Using PCA as a data reduction technique allows full spectral coverage for all samples and alleviates the need for wavelength optimization. Since PCA also reduces data into a smaller set of representative numbers (known as scores), the problem of over-discrimination can be avoided while still using entire spectra or spectral ranges (as with PCR and quantitative methods). This approach has been called the PCA/MD method.
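As a rough illustration of using scores in place of selected wavelengths, the sketch below computes a Mahalanobis distance directly from a matrix of training scores. The function name `mahalanobis_from_scores` and the random scores are hypothetical, and the group matrix is taken to be the ordinary sample covariance (`np.cov`), which is an assumption rather than something stated here.

```python
import numpy as np

def mahalanobis_from_scores(S_train, s_unknown):
    """Distance of one score vector from the group of training scores."""
    center = S_train.mean(axis=0)
    M = np.cov(S_train, rowvar=False)      # f x f group covariance (assumed form)
    d = s_unknown - center
    # Solve M z = d instead of forming the inverse explicitly.
    return float(np.sqrt(d @ np.linalg.solve(M, d)))

rng = np.random.default_rng(1)
# 50 training samples, 3 factors with very different spreads (made-up data).
S_train = rng.normal(size=(50, 3)) * np.array([5.0, 1.0, 0.2])

near = S_train.mean(axis=0)                 # the group center itself
far = near + np.array([0.0, 0.0, 2.0])      # 2 units along the tightest factor

print(mahalanobis_from_scores(S_train, near))
print(mahalanobis_from_scores(S_train, far))
```

A shift of only 2 units is a large distance here because it lies along the factor in which the training group varies least, which is the sense in which the distance is measured in standard deviations.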

There is one problem with basing the discrimination purely on the PCA factor scores: any "impurities" or extra aberrations that appear in the unknown spectrum, but were not present in the training samples, will not appear in the score calculations. As long as the rest of the variations in the spectrum are consistent with the training set, the model will predict the sample as a match. Therefore, the information in the unknown spectrum that is not accounted for by the PCA factors must be considered in order to make an accurate assessment of the match.

As mentioned before, one incidental advantage of PCA is that a spectrum can be mathematically "reconstructed" by multiplying the spectrum scores by the set of primary factors and summing the results. The reconstructed spectrum can be subtracted from the original spectrum to determine how well the PCA model is performing for the sample. The result of this subtraction is known as the spectral residual. If the spectral residual is a relatively flat line (or just noise) near zero, then the model is able to account for all the variations in the spectrum. However, if the spectral residual has additional peaks or significantly more noise than expected, then the model is not completely predicting the information in the sample.

By calculating the sum of the squares of the spectral residuals across all the wavelengths, an additional representative value can be generated for each spectrum. The spectral residual is effectively a measure of the amount of each spectrum left over in the secondary or "noise" vectors. This value is the basis of another type of discrimination method known as SIMCA (see References). Another method is to combine the PCA/MD method with SIMCA to provide a bi-parametric method of discriminant analysis. In this method, both the Mahalanobis distance test and the SIMCA test on the spectral residual must pass in order for a sample to be classified as a match.
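The effect of an unmodeled band on the sum squared spectral residual can be seen in a small experiment. The sketch below is illustrative only: the spectra are synthetic, the helper names `band` and `sum_squared_residual` are hypothetical, and the PCA step assumes mean-centering and an SVD with two factors.

```python
import numpy as np

rng = np.random.default_rng(2)
p = 200
x = np.linspace(0.0, 1.0, p)

def band(center, width):
    """Gaussian band shape on the shared wavelength axis (synthetic)."""
    return np.exp(-((x - center) / width) ** 2)

# Training spectra: two bands of varying intensity plus noise.
n = 30
A = (rng.uniform(0.8, 1.2, (n, 1)) * band(0.3, 0.05)
     + rng.uniform(0.8, 1.2, (n, 1)) * band(0.6, 0.08)
     + rng.normal(0.0, 0.002, (n, p)))
mean_spectrum = A.mean(axis=0)
F = np.linalg.svd(A - mean_spectrum, full_matrices=False)[2][:2]   # 2 factors

def sum_squared_residual(spectrum):
    centered = spectrum - mean_spectrum
    scores = centered @ F.T
    residual = centered - scores @ F     # original minus reconstructed spectrum
    return float(residual @ residual)    # summed over all wavelengths

in_spec = (1.0 * band(0.3, 0.05) + 1.1 * band(0.6, 0.08)
           + rng.normal(0.0, 0.002, p))
contaminated = in_spec + 0.05 * band(0.85, 0.02)   # small extra band

print("in spec:     ", sum_squared_residual(in_spec))
print("contaminated:", sum_squared_residual(contaminated))
```

The extra band lies where the factors have essentially no intensity, so it barely changes the scores but raises the residual sum of squares well above the noise level.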

However, another approach is to combine the PCA scores and spectral residuals for each spectrum and use them all for the Mahalanobis group matrix calculations. These values are included in the Mahalanobis group calculations by adding an extra "score" for each spectrum that contains its sum squared spectral residual. Therefore, the relationship between the PCA scores and the spectral residual is also considered when an unknown sample is predicted. The main benefit of this method is that all of the data in the sample spectrum is included in the calculations, and not just the PCA-modeled data (as with PCA/MD) or the spectral residual data that is left over (as with SIMCA). This is known as the PCA/MDR method.

Including the sum squared spectral residual as an additional discriminating factor for the Mahalanobis group is an important extra step that improves the sensitivity of the unknown sample classifications. It not only sets the maximum allowed variation in the factors, but also limits the range of variation in the residual for a sample to be classified as a member of the group. This is particularly important in quality control applications where it is not only important to verify the identity of a material but also to determine if it contains substantial impurities different from those in the training data.

 

Figure 1. Spectra of samples for discrimination against the training set in Figure 1 of the Mahalanobis section. The solid line is a spectrum of the same material ("In Spec"), while the dotted line is a spectrum of a very similar, but not the same material ("Out Spec"). The spectra have not been baseline corrected, and actually are nearly the same after the baseline is removed.

 

Figure 2. Spectra of two samples of the same material as the training set shown in Figure 1 of the Mahalanobis section. One is the same "In Spec" spectrum from Figure 1 above, while the other is contaminated with a very small band at approximately 1140 cm-1 ("Contaminated").

Figure 3. Predicted Mahalanobis distances of the spectra shown in Figure 1 and Figure 2 above by the three methods discussed in the text. The "In Spec" values are from the good sample, the "Contaminated" values are from the sample with the extra band, and the "Out Spec" values are for the sample of the similar but not same material.

 

The sample spectra in the previous figures were predicted against Mahalanobis discrimination models built using the three methods discussed so far: selected wavelengths, PCA scores, and PCA scores with spectral residuals. Notice that the wavelength method accurately predicts the sample that is supposed to match ("In Spec") and the sample that is not a match ("Out Spec"). The PCA scores-only method performs similarly. However, both methods misclassify the "Contaminated" sample as a match. The PCA scores with spectral residuals method accurately predicts all samples. In addition, the sensitivity to the "Out Spec" sample is drastically increased. This indicates that the material is spectrally very similar to the group samples; however, there is some extra information left in the spectral residual that helps discrimination.

Another method that is used to improve the sensitivity of the analysis is to normalize the predicted distances by the Root Mean Squared Group size (RMSG). Depending on the number of training samples and the variations in the sampling technique, the points that form the Mahalanobis ellipse may be "scattered" more for one group than another. This causes each group to have a different "size" in the multi-dimensional Mahalanobis space. In order for the Mahalanobis distance from one group to be compared to the distance from another group, this "size" difference must be removed. This is easy to understand by remembering that Mahalanobis distances are measured in terms of standard deviations. If the standard deviations for the two groups are different, no distance cutoff value can be determined that will work effectively for both groups. The RMSG is simply determined by calculating the root mean square of the predicted distances for every training sample against the whole training set. Normalizing the predicted sample distances by the RMSG causes all groups to have the same weight in the predictions.

Finally, one of the best ways to improve the sensitivity of the discrimination is to create separate training sets for each group. Some methods in the literature pool all samples together for calculating the PCA, then try to calculate a Mahalanobis matrix for the subsets of scores that belong to individual classes of materials. This has the advantage of performing all calculations in the same score space and only requires one PCA to calculate all the factors. However, it has some serious disadvantages as well. Certainly, creating a new training set for every material or group requires collecting a lot more spectra. On the other hand, using separate training sets creates PCA factors that are unique to each group used for classification. Since the factors are going to be slightly different for each group (even for very spectrally similar materials), the likelihood of group overlap, and thus misclassification, is reduced. In addition, because each group is calculated separately, inclusion of a new classification group in the analysis does not require re-optimizing the entire model for all groups; a new model is simply built for the new group.

Calculating PCA/MDR
The following information is provided for those who are interested in the complete calculations of the PCA/MDR method. This discussion assumes that the training data has already been reduced to the component PCA factors and scores. The basic model for PCA of the spectral data matrix is:

$$A = SF + E$$

where A is the n by p matrix of training spectra, S is an n by f matrix of the scores, F is an f by p matrix of the PCA factors, and E is an n by p matrix of the residual error not modeled by PCA. The dimensions are n for the number of spectra, p for the number of wavelengths and f for the number of PCA factors.

The spectral residual is therefore:

$$E = A - SF$$

The sum squared residual for each row of the spectral residual matrix is then calculated as:

$$R_i = \sum_{j=1}^{p} E_{ij}^{2}$$

where R_i is the sum squared residual for spectrum number i. This vector of residuals is then mean centered and appended as an extra column to the PCA scores matrix:

Mean Center Residuals:

$$\tilde{R}_i = R_i - \frac{1}{n} \sum_{k=1}^{n} R_k$$

Append to Scores:

$$S_r = [\, S \;\; \tilde{R} \,]$$

where S_r is the new n by f+1 residual-augmented scores matrix. The calculation of the Mahalanobis matrix is then done on the S_r matrix:

The Root Mean Squared Group size is then calculated so that predicted samples can be normalized. This first requires predicting all training samples against the Mahalanobis matrix:

where Di is the predicted distance of training sample number i. The predicted distances are then used to calculate the RMSG normalization factor:
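The three quantities just described (the Mahalanobis matrix, the training sample distances, and the RMSG) can be written out as follows. This is a sketch under assumptions, not a statement of the exact published form: it takes M to be the usual covariance-style matrix of the augmented scores S_r with an n - 1 divisor, takes the distance to enter as a squared quadratic form, and uses the same n - 1 divisor inside the RMSG. Here S_{r,i} denotes row i of S_r.

```latex
M = \frac{S_r^{T} S_r}{n - 1}

D_i^{2} = S_{r,i} \, M^{-1} \, S_{r,i}^{T}

\mathrm{RMSG} = \sqrt{\frac{\sum_{i=1}^{n} D_i^{2}}{n - 1}}
```

A different divisor (for example n in place of n - 1) changes only the overall scale of the distances and of the RMSG, not the ordering of the samples.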

Whenever an unknown sample is predicted, effectively the same set of steps is used; however, the mean of the training group residuals is subtracted from the residual of the unknown, and the predicted distance is divided by RMSG before reporting the value.
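The full sequence of steps, from training spectra to a normalized distance for an unknown, can be sketched as below. Everything specific in it is an assumption made for illustration: the function names `build_pca_mdr` and `predict` are hypothetical, the spectra are synthetic, the spectra are mean-centered before an SVD-based PCA, and the Mahalanobis matrix and RMSG both use an n - 1 divisor.

```python
import numpy as np

def build_pca_mdr(A, f):
    """Build a PCA/MDR model from an n x p matrix of training spectra."""
    n = A.shape[0]
    mean_spectrum = A.mean(axis=0)               # assumed preprocessing
    centered = A - mean_spectrum
    F = np.linalg.svd(centered, full_matrices=False)[2][:f]   # f x p factors
    S = centered @ F.T                           # n x f scores
    E = centered - S @ F                         # n x p spectral residuals
    R = (E ** 2).sum(axis=1)                     # sum squared residual per spectrum
    R_mean = R.mean()
    Sr = np.column_stack([S, R - R_mean])        # n x (f + 1) augmented scores
    M_inv = np.linalg.inv(Sr.T @ Sr / (n - 1))   # assumed n - 1 divisor
    D2 = np.einsum("ij,jk,ik->i", Sr, M_inv, Sr) # squared training distances
    rmsg = float(np.sqrt(D2.sum() / (n - 1)))    # assumed n - 1 divisor
    return {"mean": mean_spectrum, "F": F, "R_mean": R_mean,
            "M_inv": M_inv, "rmsg": rmsg}

def predict(model, spectrum):
    """Return the RMSG-normalized distance of one unknown spectrum."""
    centered = spectrum - model["mean"]
    s = centered @ model["F"].T
    e = centered - s @ model["F"]
    # Subtract the training group's mean residual from the unknown's residual.
    sr = np.append(s, e @ e - model["R_mean"])
    return float(np.sqrt(sr @ model["M_inv"] @ sr) / model["rmsg"])

# Synthetic demonstration data (not taken from the figures).
rng = np.random.default_rng(3)
p = 200
x = np.linspace(0.0, 1.0, p)

def band(center, width):
    return np.exp(-((x - center) / width) ** 2)

A = (rng.uniform(0.8, 1.2, (30, 1)) * band(0.3, 0.05)
     + rng.uniform(0.8, 1.2, (30, 1)) * band(0.6, 0.08)
     + rng.normal(0.0, 0.002, (30, p)))
model = build_pca_mdr(A, f=2)

in_spec = (1.0 * band(0.3, 0.05) + 1.1 * band(0.6, 0.08)
           + rng.normal(0.0, 0.002, p))
contaminated = in_spec + 0.05 * band(0.85, 0.02)

d_in_spec = predict(model, in_spec)
d_contaminated = predict(model, contaminated)
print("RMSG:        ", model["rmsg"])
print("in spec:     ", d_in_spec)
print("contaminated:", d_contaminated)
```

With the n - 1 divisor used in both places, the RMSG of the training set works out to the square root of f + 1 (here the square root of 3), which is a convenient sanity check on an implementation. The contaminated spectrum is flagged through the residual column even though its scores are nearly unchanged.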
