Algorithms - Discriminant Analysis, Optimization

Optimization
Selecting the Principal Components
Outlier Sample Detection
Spectral Region Selection


Selecting the Principal Components
One of the biggest problems in using PCA spectral decomposition for discriminant analysis is identifying the correct number of factors to use for the models. Quantitative analysis methods such as PCR and PLS always have a benchmark against which to judge the quality of the model: the primary calibration data. A PRESS analysis calculates the prediction error of the constituent values at every factor, which makes it easy to determine the number of factors. The smaller the error, the better the model.

However, with discriminant analysis methods, the only information available is the set of training spectra. There is no external set of data against which to compare the model’s predictive capabilities. The basic problem therefore becomes how to include enough factors in the model to give the appropriate discrimination, without including extra "noise" factors that are unique to the training set data and are not likely to be found in the spectra of "unknown" samples. To add some new definitions, the vectors that represent the true variations in the data are often called the primary factors, while the remaining "noise" vectors are known as the secondary factors.

Fortunately, there are ways of determining the number of factors by looking at the Eigenvalues of the PCA factors. One fact that has been left out of the discussions of PCA so far is that the data is not broken down into just two sets of values (scores and factors), but into three. The third set of values is the Eigenvalues. Due to the way the PCA decomposition is calculated, the scores and factors generally only span a data range of ±1. If the scores and factors were the only representation of the data, all the principal component spectra would have the same relative intensities in the samples. Clearly, this is not the case; some components vary much more than others. The Eigenvalues therefore measure the importance of each factor in reconstructing the real spectra.
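One common way to write a three-part decomposition of this kind is the singular value decomposition. The block below is only an illustration: it assumes the scores and factors are each scaled to unit length, which this text does not state explicitly, and the symbols X, T, F and Λ are chosen here rather than taken from the original.

```latex
% X: n-by-p matrix of training spectra (n samples, p data points)
% T: n-by-f matrix of unit-length scores
% F: p-by-f matrix of unit-length factors
% \Lambda: f-by-f diagonal matrix of Eigenvalues \lambda_1 \ge \dots \ge \lambda_f
X = T \, \Lambda^{1/2} F^{\mathsf{T}},
\qquad T^{\mathsf{T}} T = I,
\qquad F^{\mathsf{T}} F = I,
\qquad \sum_{a,b} X_{ab}^{2} = \sum_{j=1}^{f} \lambda_j
```

Under this scaling every entry of T and F lies between -1 and +1, so the size of each component's contribution is carried entirely by its Eigenvalue.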

Typically, the Eigenvalues for the first few factors are much larger than those of the remaining factors. The trick is to determine the relative importance of each factor in the model by comparing the PCA Eigenvalues. There are a number of methods in the literature. This discussion will focus on some simple views of the Eigenvalue data.

 

Figure 1. The calculated Eigenvalues of a PCA decomposition of the spectral data in Figure 1 of the Mahalanobis section. Notice that the first Eigenvalue is substantially larger than the rest, and that the values fall off rapidly at the higher numbered factors.

 

 

Simply looking at a plot of the Eigenvalues might lead to a model that has too few factors. Empirically, from examination of the Eigenvalue plot in Figure 1, it appears that a model with 4 factors would probably work fine, as the values seem to be very small from this factor on up. However, in actuality, this would build a model that is significantly underfit for this data set.

Figure 2. A plot of Total Percent (%) Variance from the Eigenvalues in Figure 1 above. Typically, a good model is formed when at least 99.9% of the variance in the data has been accounted for in the factors. For this data set, this occurs at 7 factors, although it is difficult to see in this plot.

 

Instead, methods that use the Eigenvalues to calculate some comparative statistics are needed. One such method is to calculate the Total Percent Variance for each factor. When the maximum number of factors is calculated (which is equal to the total number of samples in the training set), all the variance in the data is accounted for, or 100%. Since the PCA factors represent the variations in the data, and the Eigenvalues are the relative weights of each of the factors, then the Eigenvalues can also be thought of as the amount of variance in the data that is represented by that factor. By summing the Eigenvalues, an estimate can be made of how much variance is accounted for by the PCA factors:

$$TPV_i = 100 \times \frac{\sum_{j=1}^{i} \lambda_j}{\sum_{j=1}^{f} \lambda_j}$$

where TPV_i is the Total Percent Variance accounted for by a model with i PCA factors, λ is the 1 by f vector of Eigenvalues, and f is the total number of factors for the data set. When looking at a plot of the Total Percent Variance versus the number of factors, a model that accounts for at least 99.9% of the total variance will generally give good predictive ability for similar samples. However, this is not a hard and fast rule. Depending on the nature of the variations in the data, some models will work with less total variance, and some with more.

Another method is through the use of "Indicator Functions." Similar to a PRESS for quantitative analyses, these functions will usually give a minimum at the optimum number of primary factors, or they will show a "leveling" once the optimum has been reached. One of the more useful functions is called Malinowski’s Indicator (see References). Much like the Total Percent Variance, it is calculated from the PCA Eigenvalues:

$$RE_i = \sqrt{\frac{\sum_{j=i+1}^{f} \lambda_j}{d_2 \, (d_1 - i)}}, \qquad MI_i = \frac{RE_i}{(d_1 - i)^2}$$

where RE_i is the Real Error function at factor number i, MI_i is the value of Malinowski’s Indicator function at factor number i, λ is the 1 by f vector of Eigenvalues, and f is the total number of factors for the data set. The d_1 and d_2 values in the denominator are the dimensions of the original spectral data matrix used for the PCA. The d_1 value is the smaller of n, the number of samples, and p, the number of spectral data points selected; d_2 is the larger.
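As a rough sketch, the Real Error and indicator values can be computed directly from the Eigenvalues. The code assumes the commonly published form of Malinowski's functions, RE_i = sqrt(sum of Eigenvalues i+1..f / (d2 * (d1 - i))) and MI_i = RE_i / (d1 - i)^2, and that the Eigenvalues are sums of squares sorted largest first. The function name and example data are hypothetical.

```python
import math


def malinowski_indicator(eigenvalues, n_samples, n_points):
    """Return a list of (i, RE_i, MI_i) tuples.

    The last usable factor is min(f, d1) - 1, so that at least one
    residual eigenvalue remains and no denominator becomes zero.
    """
    d1 = min(n_samples, n_points)  # smaller matrix dimension
    d2 = max(n_samples, n_points)  # larger matrix dimension
    last = min(len(eigenvalues), d1) - 1
    results = []
    for i in range(1, last + 1):
        residual = sum(eigenvalues[i:])  # eigenvalues i+1 .. f
        re_i = math.sqrt(residual / (d2 * (d1 - i)))
        mi_i = re_i / (d1 - i) ** 2
        results.append((i, re_i, mi_i))
    return results


# Invented example: 8 samples, 100 data points, three large eigenvalues
# followed by a flat "noise" tail.
example = [50.0, 20.0, 5.0, 0.01, 0.01, 0.01, 0.01, 0.01]
table = malinowski_indicator(example, n_samples=8, n_points=100)
best = min(table, key=lambda row: row[2])[0]
print(best)
```

In this invented case the Real Error levels off once the three large Eigenvalues are removed, and the indicator reaches its minimum at the same point.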

While indicator functions such as Malinowski’s are useful in helping determine the optimum number of factors, they generally tend to create models that are overfit when used for discriminant analysis.

   

 

Figure 3. Plot of Malinowski’s Indicator function for the Eigenvalues in Figure 1 above. Notice that the minimum occurs at 10 factors. This is 3 more factors than indicated by the Total Percent Variance plot in Figure 2 above.

 

Another method (which was also proposed by Malinowski) is to calculate the statistical significance of the factors by performing an F-test on the Eigenvalues. This method is very similar to the F-test on the F-ratio values from a PRESS analysis for optimizing a quantitative model. The F-test is not actually applied directly to the Eigenvalues, but to what Malinowski called the Reduced Eigenvalue (REV):

$$REV_i = \frac{\lambda_i}{(n - i + 1)(p - i + 1)}$$

where REV_i is the Reduced Eigenvalue at factor number i, λ is the 1 by f vector of Eigenvalues, f is the total number of factors, n is the number of samples in the training set, and p is the number of spectral data points selected. Effectively, the data set Eigenvalues are normalized by the degrees of freedom in the data set to arrive at REV.

To arrive at the optimum number of factors, the F-ratio and F-test calculations are then performed on REV at a significance level of α = 0.01. Therefore, all factors with a probability greater than or equal to 0.99 are maintained as primary factors, and the remaining factors are assumed to be noise, that is, the set of secondary factors.
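The sketch below shows one way the Reduced Eigenvalues and an F-ratio could be computed. Two things are assumptions rather than statements from this text: the REV form, REV_i = eigenvalue_i / ((n - i + 1)(p - i + 1)), and the F-ratio, which here compares REV_i with the pooled Reduced Eigenvalue of all higher-numbered factors, with 1 and f - i degrees of freedom. Function names and data are hypothetical.

```python
def reduced_eigenvalues(eigenvalues, n_samples, n_points):
    """Assumed form: REV_i = eigenvalue_i / ((n - i + 1) * (p - i + 1))."""
    if len(eigenvalues) > min(n_samples, n_points):
        raise ValueError("cannot have more factors than min(n, p)")
    return [
        lam / ((n_samples - i + 1) * (n_points - i + 1))
        for i, lam in enumerate(eigenvalues, start=1)
    ]


def f_ratios(eigenvalues, n_samples, n_points):
    """Return {i: (F_i, denominator degrees of freedom)} for i = 1..f-1.

    Assumed test: REV_i against the pooled REV of factors i+1..f.
    """
    f = len(eigenvalues)
    rev = reduced_eigenvalues(eigenvalues, n_samples, n_points)
    ratios = {}
    for i in range(1, f):
        pool_eig = sum(eigenvalues[i:])  # eigenvalues i+1 .. f
        pool_dof = sum(
            (n_samples - j + 1) * (n_points - j + 1)
            for j in range(i + 1, f + 1)
        )
        if pool_eig == 0:
            ratios[i] = (float("inf"), f - i)
        else:
            ratios[i] = (rev[i - 1] / (pool_eig / pool_dof), f - i)
    return ratios


# Invented example: 8 samples, 100 data points.
example = [50.0, 20.0, 5.0, 0.01, 0.01, 0.01, 0.01, 0.01]
rev = reduced_eigenvalues(example, 8, 100)
ratios = f_ratios(example, 8, 100)
print(ratios[3], ratios[4])
```

Each ratio still has to be converted to a probability using an F distribution with the stated degrees of freedom, from statistical tables or a statistics library, before it is compared with the 0.99 cutoff.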

Another approach is to perform a cross-validation much like that used for quantitative models. However, rather than predicting the constituent values as each sample is rotated out (and there are not any to predict in discriminant analysis anyway), the Mahalanobis distance of each sample is predicted at every factor. The cross-validation procedure is basically the same: remove a sample or set of samples, construct a Mahalanobis matrix for 1 factor, 2 factors, etc., and then predict the sample(s) left out against it. The samples are then returned to the training set, and a new set is removed. The process is continued until every sample has been rotated out once.

Remember that the Mahalanobis distance is normally distributed and measures distance in terms of standard deviations from the group mean point. Therefore, good samples should lie within 3 Mahalanobis distance units of the group mean to be classified as members of the group. Assuming that all of the samples in the training set are "good" (no outliers), all samples should give a predicted distance of 3 or less when rotated out. It should therefore be possible to determine the number of factors by calculating the average predicted distance for all the samples at each factor and locating the point where the value goes above 3.
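Once the cross-validation has produced a predicted distance for every sample at every factor, the averaging and threshold step is simple. The sketch below covers only that final step. It assumes the predicted distances are already available as a table with one row per sample and one column per number of factors; the names and the numbers are hypothetical.

```python
from statistics import mean


def average_by_factor(predicted):
    """predicted[s][k] is the distance of sample s with k + 1 factors."""
    return [mean(column) for column in zip(*predicted)]


def first_factor_above(averages, limit=3.0):
    """First number of factors whose average distance exceeds limit."""
    for n_factors, avg in enumerate(averages, start=1):
        if avg > limit:
            return n_factors
    return None  # the average never exceeds the limit


# Invented distances for three samples at 1, 2 and 3 factors.
predicted = [
    [1.0, 2.0, 3.5],
    [2.0, 3.0, 3.1],
    [1.5, 2.5, 3.0],
]
averages = average_by_factor(predicted)
crossing = first_factor_above(averages)
print(averages, crossing)
```

The function reports only where the average first crosses the limit. Whether to keep that number of factors or a neighboring one remains a judgment call when the curve hovers near 3.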

 

 

Figure 4. Average Predicted Mahalanobis Distance for a cross-validation of the spectra in Figure 1 of the Mahalanobis section. The plot reaches a value of 3.3 distances at 8 factors. However, the value at 9 factors is actually 3.0. An argument could be made for either number being correct.

 

 

 

Outlier Sample Detection
As with quantitative models, outlier samples in the training set can have an unwanted influence on the discrimination ability of the model. Many of the same techniques (spectral residual plots, cluster plots) used there can be used to check for outliers in discriminant models too. However, keep in mind the purpose of the discriminant analysis experiment: to build a model that can accurately match a spectrum to the training group, but allow enough variation in the model to compensate for the natural variations seen in real samples.

Any data that is included in the training set in effect becomes part of that allowed variation. If a few sample spectra are substantially different from the rest, which tend to cluster tightly around the group mean, they may appear as outliers in many of the statistical tests. However, if these samples are known to be "good," then they should not be removed from the training set just because they fail the statistical tests.

There is one additional method to use in determining outliers in discriminant analysis models, and that is to look at a plot of the predicted Mahalanobis distances (either from a cross-validation or self prediction) to see if any samples stand out.

 

 

Figure 1. Predicted Mahalanobis distances from a cross-validation of the training spectra in Figure 1 of the Mahalanobis section. The model was created with 9 factors. Notice that sample number 5 appears to be substantially different from the rest. However, further examination of the data showed that the spectrum of this sample had the largest baseline shift. Most likely the sample is fine, so it was left in the training set for building the final model.

 


 

 
