Algorithms - Principal Component Analysis

Principal Component Analysis Methods

In real samples, there are usually many different variations that make up a spectrum: the constituents in the sample mixture, inter-constituent interactions, instrument variations such as detector noise, changing environmental conditions that affect the baseline and absorbance, and differences in sample handling. Yet, even with all of these complex changes occurring, there should be some finite number of independent variations occurring in the spectral data. Hopefully, the largest variations in the calibration set would be the changes in the spectrum due to the different concentrations of the constituents of the mixtures. If it were possible to calculate a set of "variation spectra" that represented the changes in the absorbances at all the wavelengths in the spectra, then this data could be used instead of the raw spectral data for building the calibration model. There should be fewer common variations than the number of calibration spectra (in most cases), and thus, the number of calculations for the calibration equations will be reduced as well.

Presumably, the "variation spectra" could be used to reconstruct the spectrum of a sample by multiplying each one by a different constant scaling factor and adding the results together until the new spectrum closely matches the unknown spectrum. Obviously, each spectrum in the calibration set would have a different set of scaling constants for each variation since the concentrations of the constituents are all different. Therefore, the fraction of each "spectrum" that must be added to reconstruct the unknown data should be related to the concentration of the constituents.
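The reconstruction idea above can be sketched in a few lines. This is only an illustration under stated assumptions: the two "variation spectra" are made-up Gaussian bands, the names `V`, `unknown`, and `scales` are hypothetical, and least squares is used here as one reasonable way to find the scaling constants.

```python
import numpy as np

# Hypothetical wavelength axis and two made-up "variation spectra".
x = np.linspace(0.0, 100.0, 201)
v1 = np.exp(-0.5 * ((x - 30.0) / 5.0) ** 2)
v2 = np.exp(-0.5 * ((x - 70.0) / 8.0) ** 2)
V = np.vstack([v1, v2])          # shape (2, 201): one variation spectrum per row

# An "unknown" spectrum built from known scaling constants (for checking only).
true_scales = np.array([0.8, 0.3])
unknown = true_scales @ V

# Find the scaling constants that best rebuild the unknown spectrum.
scales, *_ = np.linalg.lstsq(V.T, unknown, rcond=None)

# Multiply each variation spectrum by its constant and add the results.
rebuilt = scales @ V
```

Here `scales` recovers the constants used to build the spectrum, which is the sense in which the scaling factors carry the concentration information.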

The "variation spectra" are often called eigenvectors (also known as spectral loadings, loading vectors, principal components, or factors), after the methods used to calculate them. The scaling constants used to reconstruct the spectra are generally known as scores. This method of breaking down a set of spectroscopic data into its most basic variations is called Principal Components Analysis (PCA).

Since the calculated eigenvectors came from the original calibration data, they must somehow relate to the concentrations of the constituents that make up the samples. The same loading vectors can be used to predict "unknown" samples; thus, the only difference between the spectra of samples with different constituent concentrations is the fraction of each loading vector added (scores).

The calculated scores are unique to each separate principal component and training spectrum, and can be used in place of absorbances in either of the classical model equations (CLS or ILS). Since the representation of the mixture spectrum is reduced from many wavelengths to a few scores, it seems best to use the ILS expression of Beer's Law for calculating concentrations, because it can calculate concentrations in the presence of interfering species. Note, however, that the calculations maintain the CLS averaging effect by using a large number of wavelengths in the spectrum (up to the entire spectrum) for calculating the eigenvectors. So, in effect, eigenvector models combine the best features of both the CLS and ILS methods in the same calculation. This is the main reason why eigenvector models are generally better than classical models in both accuracy and robustness.
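As a rough sketch of using scores in an ILS-style regression, the example below regresses concentrations on PCA scores and then predicts a hypothetical unknown. Everything here is an assumption made for illustration: the two pure spectra, the noise level, the choice of `f = 2` factors, and the use of NumPy's SVD to obtain the eigenvectors.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 100.0, 200)

def band(center, width):
    """Made-up Gaussian absorbance band on the wavelength axis x."""
    return np.exp(-0.5 * ((x - center) / width) ** 2)

pure = np.vstack([band(25.0, 6.0), band(60.0, 8.0)])   # 2 hypothetical pure spectra
C = rng.uniform(0.1, 1.0, size=(15, 2))                # calibration concentrations
A = C @ pure + rng.normal(0.0, 1e-4, size=(15, 200))   # calibration spectra

# Mean center, then take the first f eigenvectors from an SVD.
A_mean, C_mean = A.mean(axis=0), C.mean(axis=0)
_, _, Vt = np.linalg.svd(A - A_mean, full_matrices=False)
f = 2
F = Vt[:f]                       # eigenvectors, shape (f, p)
S = (A - A_mean) @ F.T           # scores, shape (n, f)

# ILS-style step: regress concentrations on the scores instead of absorbances.
B, *_ = np.linalg.lstsq(S, C - C_mean, rcond=None)

# Predict an "unknown" by projecting it onto the same eigenvectors.
c_true = np.array([0.4, 0.7])
a_unknown = c_true @ pure
s_unknown = (a_unknown - A_mean) @ F.T
c_pred = s_unknown @ B + C_mean
```

The regression has only `f` unknowns per constituent, yet every wavelength contributes to the scores, which is the combination of ILS and CLS behavior described above.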

PCA breaks apart the spectral data into the most common spectral variations (factors, eigenvectors, loadings) and the corresponding scaling coefficients (scores).

 

The trick in using these models comes in how the eigenvectors are calculated. Note that these models base the concentration predictions on changes in the data, not absolute absorbance measurements (which are used in all the Classical models). In order to calculate the PCA model, the spectral data must change in some way. The best way to accomplish this is to vary the concentrations of the constituents of interest. As with the ILS model, there can be problems with collinearity. If 2 important constituents are always present in the calibration samples in the same ratio (for example, 2:1 of A to B, as happens when dilutions are made from a single stock sample), the model will detect only one variation, NOT TWO! As far as the model is concerned, all the absorbance peaks of constituent A increase or decrease whenever those of constituent B do, and vice versa. Thus, only one variation is detected: the changes in the spectrum of A+B. Therefore, it is very important when calibrating eigenvector models that the calibration data contain the individual constituents of interest in evenly and randomly distributed ratios.
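The collinearity problem can be seen numerically. The sketch below is an illustration only: it uses two made-up, noise-free pure spectra and counts the significant singular values of the mean-centered data as a stand-in for "how many variations the model detects". The function name `n_variations` and the tolerance are hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0.0, 100.0, 200)
spec_a = np.exp(-0.5 * ((x - 30.0) / 6.0) ** 2)   # made-up spectrum of A
spec_b = np.exp(-0.5 * ((x - 65.0) / 6.0) ** 2)   # made-up spectrum of B
pure = np.vstack([spec_a, spec_b])

def n_variations(conc, tol=1e-8):
    """Count significant singular values of the mean-centered spectra."""
    A = conc @ pure
    s = np.linalg.svd(A - A.mean(axis=0), compute_uv=False)
    return int(np.sum(s > tol * s[0]))

# Dilutions of one stock: A and B always in a 2:1 ratio.
conc_b = rng.uniform(0.1, 1.0, size=12)
diluted = np.column_stack([2.0 * conc_b, conc_b])

# Independently and randomly varied concentrations.
independent = rng.uniform(0.1, 1.0, size=(12, 2))

print(n_variations(diluted), n_variations(independent))
```

With the diluted set, every spectrum is a multiple of the single combined spectrum of A+B, so only one variation appears; the independently varied set exposes both.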

For the following discussion of the different eigenvector methods, the set of synthetic spectral data shown in the image below is used for demonstrations.

A hypothetical training set of spectra with 3 independent constituents consisting of 2 Gaussian bands each.
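A training set of this kind can be simulated as follows. This is not the data set used in the figure: the band positions, widths, heights, noise level, and number of samples are all assumptions chosen only to give 3 independent constituents with 2 Gaussian bands each.

```python
import numpy as np

rng = np.random.default_rng(42)
wavelengths = np.linspace(0.0, 100.0, 300)

def gaussian(center, width, height=1.0):
    """One Gaussian band evaluated on the wavelength axis."""
    return height * np.exp(-0.5 * ((wavelengths - center) / width) ** 2)

# Three hypothetical constituents, two bands each.
pure_spectra = np.vstack([
    gaussian(15.0, 4.0) + gaussian(55.0, 6.0, 0.6),
    gaussian(30.0, 5.0) + gaussian(75.0, 5.0, 0.8),
    gaussian(42.0, 4.0, 0.7) + gaussian(90.0, 6.0),
])

# Randomly and independently varied concentrations for 20 samples.
concentrations = rng.uniform(0.0, 1.0, size=(20, 3))

# Mixture spectra (Beer's Law style linear mixing) plus a little detector noise.
training_spectra = concentrations @ pure_spectra
training_spectra = training_spectra + rng.normal(0.0, 0.002, size=training_spectra.shape)
```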

 

Before PCA is applied to a training set, the data is commonly mean centered. This means that the mean spectrum (average spectrum) is calculated from all of the calibration spectra and then subtracted from every calibration spectrum. Mean centering has the effect of enhancing the subtle differences between the spectra. Remember, eigenvector methods calculate the principal components based on changes in the absorbance data, and not the absolute absorbance. Therefore, anything that improves the ability of the calculation to detect the differences between the calibration spectra will improve the model.
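Mean centering itself is a one-line operation. In this small sketch the three "spectra" are invented numbers with only three data points each, used just to show the arithmetic.

```python
import numpy as np

# Three tiny made-up calibration spectra (rows), three wavelengths (columns).
spectra = np.array([
    [1.0, 2.0, 3.0],
    [1.2, 2.1, 2.9],
    [0.8, 1.9, 3.1],
])

mean_spectrum = spectra.mean(axis=0)   # average over the calibration spectra
centered = spectra - mean_spectrum     # subtract it from every spectrum
```

After centering, each wavelength column averages to zero, so only the differences between spectra remain. The mean spectrum must be kept, since it has to be added back when spectra are reconstructed and subtracted from any unknown before prediction.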

This actually makes a lot of sense when considered in the context of how PCA calculates the eigenvectors. Since the eigenvectors represent the changes in the spectral data that are common to all the calibration spectra, removing the mean simply removes the first most common variation before the data is even processed by the PCA algorithm.

PCA is effectively a process of elimination. By iteratively eliminating each independent variation from the calibration spectra in series, it is possible to create a set of eigenvectors (principal components) that represent the changes in the absorbances that are common to all. When the training data has been fully processed by the PCA algorithm, it is reduced to two main matrices: the eigenvectors (spectra) and the scores (the eigenvector weighting values for all the calibration spectra). The matrix expression of the model equation for the spectral data looks something like:
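The "process of elimination" can be sketched with a NIPALS-style loop, which extracts one eigenvector at a time and subtracts its contribution from the data before looking for the next. The source does not say which algorithm it uses, so treat this as one common approach rather than the method behind the text; the function name `nipals_pca` and its parameters are hypothetical.

```python
import numpy as np

def nipals_pca(A, n_factors, max_iter=500, tol=1e-12):
    """Extract factors one at a time, removing each from the data (deflation)."""
    residual = A - A.mean(axis=0)                 # mean center first
    n, p = residual.shape
    scores = np.zeros((n, n_factors))
    loadings = np.zeros((n_factors, p))
    for k in range(n_factors):
        # Start from the wavelength column with the largest variance.
        t = residual[:, np.argmax(residual.var(axis=0))].copy()
        for _ in range(max_iter):
            v = residual.T @ t                    # candidate eigenvector
            v /= np.linalg.norm(v)                # normalize to unit length
            t_new = residual @ v                  # scores for this eigenvector
            converged = np.linalg.norm(t_new - t) <= tol * np.linalg.norm(t_new)
            t = t_new
            if converged:
                break
        scores[:, k] = t
        loadings[k] = v
        residual = residual - np.outer(t, v)      # eliminate this variation
    return scores, loadings, residual
```

The function returns the two matrices described above, scores (n by f) and eigenvectors (f by p), along with whatever is left of the data after the requested factors have been removed. Asking for more factors than the data contain would make the loop fit noise, so `n_factors` should be chosen with care.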

$$A = S F + E_A$$

where A is an n by p matrix of spectral absorbances, S is an n by f matrix of score values for all of the spectra, and F is an f by p matrix of eigenvectors. The $E_A$ matrix holds the errors in the model's ability to predict the calibration absorbances and has the same dimensionality as the A matrix. In the case of eigenvector analysis, the $E_A$ matrix is often called the matrix of residual spectra. The dimensions of the matrices are representative of the data they hold; n is the number of samples (spectra), p is the number of data points (wavelengths) used for calibration, and f is the number of PCA eigenvectors. As will be shown later, this is actually a simplification of the true model equation.

By multiplying PC1 & PC2 (Eigenvectors) by the set of representative scalar fractions (Scores) and summing the results (along with the Mean spectrum if the data was mean centered), the original calibration spectra can be recreated. The "spectral residual" is the difference between this reconstruction and the original.
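The reconstruction described in this caption can be sketched as follows. The data are simulated (two made-up constituents plus noise), and the eigenvectors are taken from NumPy's SVD of the mean-centered spectra, which is an assumption about how they might be computed rather than a statement of the method behind the figure.

```python
import numpy as np

rng = np.random.default_rng(7)
x = np.linspace(0.0, 100.0, 150)
pure = np.vstack([
    np.exp(-0.5 * ((x - 35.0) / 7.0) ** 2),
    np.exp(-0.5 * ((x - 70.0) / 9.0) ** 2),
])
conc = rng.uniform(0.1, 1.0, size=(10, 2))
A = conc @ pure + rng.normal(0.0, 1e-3, size=(10, 150))   # n = 10, p = 150

mean_spectrum = A.mean(axis=0)
U, s, Vt = np.linalg.svd(A - mean_spectrum, full_matrices=False)

f = 2
F = Vt[:f]                  # PC1 and PC2 (eigenvectors), shape (f, p)
S = U[:, :f] * s[:f]        # scores for every calibration spectrum, shape (n, f)

# Scale each eigenvector by its score, sum, and add the mean spectrum back.
reconstructed = mean_spectrum + S @ F

# The spectral residual is what the two factors could not reproduce.
residual = A - reconstructed
```

With two simulated constituents and two factors, the residual spectra should contain little more than the added noise; a structured residual would suggest that more factors are needed.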

 
