Algorithms - Chemometric Preprocessing Techniques

Algorithms

Chemometrics, Preprocessing Techniques

Light Scattering Correction, Multiplicative Scatter Correction(MSC)
Light Scattering Correction, Standard Normal Variate (SNV) Transformation and Detrending
Correcting Sample Pathlength Differences
Measured Pathlength Calibration
Sample Thickness Correction
Unit Area Normalization
Correcting Baseline Effects
First and Second Derivatives
Data Enhancement

Correcting Sample Pathlength Differences
Beer’s Law states that there is a direct, linear relationship between the absorbance of light at a particular wavelength and both the sample concentration and the pathlength. Most chemometric models are built using samples that vary in concentration while the pathlength is held fixed (diffuse reflectance measurements are the exception, but MSC or SNV is used to correct that type of data). Factor-based chemometric models are not limited by this requirement, although the models certainly perform better when the pathlength is constant.

Unfortunately, it is not always possible to collect spectra of either training samples or "unknown" samples with a constant pathlength. For example, when measuring transmission spectra of thin films, it is very difficult to extrude polymers to a constant thickness every time. If the pathlength varies from sample to sample, this will appear in the spectra as changes in response that are not correlated to concentration. Sometimes factor-based models can correct for these effects if the range of different pathlengths is not too large. However, the model will almost certainly work much better if the pathlength effect can be removed altogether.

In addition to the methods listed here, MSC has also been applied to correct spectra of samples with indeterminate pathlengths that were not necessarily measured by diffuse reflectance. While it will not be as effective as either of the two specific pathlength corrections below (Measured and Thickness), it will usually give better results than no correction at all. Some success has also been shown in using MSC to correct the pathlength effects in spectra measured using Attenuated Total Reflectance (ATR).

 

Measured Pathlength Calibration
In some cases, it is possible to actually measure the exact pathlengths of the training samples. This information can be used during the model building step by including the pathlength as an extra constituent in the calibration. During the calibration calculations, all the concentration data is scaled to the entered pathlength for each sample. When an "unknown" sample is predicted, its pathlength is predicted at the same time. The concentrations are then un-scaled by the predicted pathlength before they are reported.

This method is useful in cases where the pathlengths of the training samples are easy to acquire. This can be used to correct for samples where the pathlength of the "unknowns" either cannot be (easily) measured or is expected to vary substantially over a period of time. The downside to this method is that accurately measuring the pathlengths of every training sample can be a tedious process.
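
The scale, predict and un-scale sequence described above can be illustrated with a small sketch. Several things here are assumptions made only for illustration: the synthetic band shapes, the helper names `make_spectrum` and `predict`, the use of plain inverse least squares as a stand-in for a factor-based model, and the choice to scale by multiplying concentration by pathlength (which follows from Beer's Law but is not spelled out in the text).

```python
import numpy as np

rng = np.random.default_rng(0)
axis = np.linspace(0.0, 1.0, 60)

# Hypothetical pure-component band shapes (unit concentration, unit pathlength).
analyte = np.exp(-((axis - 0.5) / 0.05) ** 2)
matrix = np.exp(-((axis - 0.2) / 0.08) ** 2)  # constant-concentration background


def make_spectrum(conc, path):
    # Beer's Law: absorbance is proportional to concentration * pathlength.
    return path * (conc * analyte + 1.0 * matrix)


# Training set with known concentrations and measured pathlengths.
conc_train = rng.uniform(2.0, 10.0, size=20)
path_train = rng.uniform(0.5, 1.5, size=20)
X = np.array([make_spectrum(c, l) for c, l in zip(conc_train, path_train)])

# Assumption: "scaling" means multiplying concentration by pathlength.
# The pathlength itself is added as an extra constituent column.
Y = np.column_stack([conc_train * path_train, path_train])

# Inverse least-squares calibration (a stand-in for PCR or PLS).
B, *_ = np.linalg.lstsq(X, Y, rcond=1e-10)


def predict(spectrum):
    scaled_conc, path = spectrum @ B
    # Un-scale the concentration by the predicted pathlength.
    return scaled_conc / path, path


conc_hat, path_hat = predict(make_spectrum(6.0, 1.3))
print(conc_hat, path_hat)
```

In this noise-free example the "unknown" is recovered even though its pathlength was never supplied, because the pathlength is predicted alongside the scaled concentration.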

Sample Thickness Correction
This type of pathlength correction is sometimes called the "internal standard" method. It is primarily used for samples that cannot be corrected by the Measured Pathlength method. One requirement for this method is that every spectrum must contain an isolated band arising from a constituent whose concentration is the same in all samples, both in the training set and in the "unknowns" for prediction.

Since the chosen spectral band is assumed to be concentration invariant in all samples, an increase or decrease of the absorption of that band in the spectrum can be assumed to be entirely due to an increase or decrease in the sample pathlength. Therefore, by normalizing the entire spectrum to the intensity of the band, the pathlength variation is effectively removed. The intensity of the band can be calculated as either the response at a single wavelength in the band (usually the peak maximum) or the integrated area.

One potential problem with this method is that it is extremely susceptible to baseline offset and slope effects in the spectrum. When calculating the pathlength normalization constants, the spectra must either be baseline corrected before creating the training set, or a local baseline must be calculated "on-the-fly" under the thickness correction band for each spectrum. The latter approach is usually recommended as the former method requires separate manual baseline correction of each "unknown" spectrum before prediction against the calibration model.
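
As a rough sketch of the normalization described above, the following computes the integrated area of the thickness band over a local linear baseline and divides the whole spectrum by it. The function names, the choice of a straight line drawn between the band's end points as the "on-the-fly" baseline, and the synthetic spectrum are all assumptions for illustration; the x axis is assumed to be in ascending order.

```python
import numpy as np


def trapezoid_area(x, y):
    # Trapezoidal integration written out explicitly.
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x)))


def band_area(x, y, lo, hi):
    """Area of the band between lo and hi above a local linear baseline.

    Assumption: the baseline is the straight line joining the first and
    last points of the band region.
    """
    mask = (x >= lo) & (x <= hi)
    xb, yb = x[mask], y[mask]
    baseline = np.interp(xb, [xb[0], xb[-1]], [yb[0], yb[-1]])
    return trapezoid_area(xb, yb - baseline)


def thickness_correct(x, y, lo, hi):
    # Normalize the entire spectrum to the intensity of the reference band.
    return y / band_area(x, y, lo, hi)


# Synthetic example: a constant-concentration reference band near 1460
# and a constituent band near 1250 (hypothetical positions).
x = np.linspace(1000.0, 1800.0, 801)
reference = np.exp(-((x - 1460.0) / 15.0) ** 2)
constituent = 0.5 * np.exp(-((x - 1250.0) / 20.0) ** 2)
film = reference + constituent

thin = thickness_correct(x, 0.8 * film, 1400.0, 1525.0)
thick = thickness_correct(x, 1.9 * film, 1400.0, 1525.0)
```

After correction the thin and thick films give the same spectrum, and the band area itself does not change when a baseline offset or slope is added to the raw spectrum.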

Figure 1. Spectra of 4 polymer thin films. The constituent of interest is clearly the band in the middle: 2% (solid), 4% (dashed), 8% (dotted) and 10% (dash-dot). Note that the relative intensities of the constituent bands are not correct due to the varying thickness of the samples.
Figure 2. Spectra from Figure 1 after thickness correction. The integrated area of the band on the left (1525 - 1400 cm-1) was used as the thickness correction factor. Notice that the relative intensities of the constituent bands now appear more in line with the known concentrations.

Thickness Correction is very useful for spectral measurements of samples whose pathlength cannot be guaranteed to be constant. However, it does require that the samples contain a constituent of constant concentration and that an isolated spectral band can be identified which is due solely to that constituent.

Another use for Thickness Correction is to allow the calibration model to be pathlength independent. Instead of normalizing to a small region (isolated band) in the spectrum, a much larger region or even the entire spectral range is used to calculate the integrated area. This allows correcting samples with nearly any pathlength. The only requirement in using Thickness Correction in this manner is that the range of constituent concentrations must be relatively small. Large variations in the concentrations will cause the integrated areas of the spectra to vary mostly with concentration rather than with pathlength. This will actually introduce non-linearities in the spectra-constituent correlations and degrade the predictive ability of the model. However, if the concentration range is relatively small, this is a great way to build models that are insensitive to changes in pathlength of both the training spectra and spectra of "unknowns."

Unit Area Normalization
This method attempts to correct the spectra for indeterminate pathlength when there is no way of measuring it, or isolating a band of a constant concentration constituent. In this approach, the spectra are normalized by calculating the area under the curve for the entire spectrum.

In effect, this method is the same as using the Thickness Correction on a large region of the spectrum as mentioned above. However, here the entire spectrum is always used, rather than a large selected region. This method is very simple to implement, but has some drawbacks. First, the concentration variations between all the training samples and "unknown" samples must not be too large, for the same reasons discussed in Thickness Correction. In addition, since this method uses the entire spectrum, the responses at all wavelengths in the spectrum must contain useful data. This means spectra that exhibit evidence of detector or optics cutoffs, or "black sample" (sample is too thick or too concentrated causing complete absorbance of all light at some wavelengths) cannot be corrected. Finally, if the spectra do not have a constant baseline between all measurements, the integrated area will be calculated incorrectly. It is generally best to combine this method with some form of baseline correction.
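
A minimal sketch of unit area normalization follows, assuming the spectra are stored one per row and that trapezoidal integration over the full axis is an acceptable way to compute the area. The function name and the synthetic data are hypothetical, and no baseline correction is applied here.

```python
import numpy as np


def unit_area_normalize(x, spectra):
    """Scale each spectrum (row) so its integrated area over x equals 1."""
    spectra = np.atleast_2d(np.asarray(spectra, dtype=float))
    widths = np.diff(x)
    # Trapezoidal area under each row.
    areas = np.sum(0.5 * (spectra[:, 1:] + spectra[:, :-1]) * widths, axis=1)
    return spectra / areas[:, None]


x = np.linspace(0.0, 100.0, 201)
shape = np.exp(-((x - 40.0) / 6.0) ** 2) + 0.4 * np.exp(-((x - 70.0) / 4.0) ** 2)

# Same sample measured at two hypothetical pathlengths.
raw = np.vstack([0.7 * shape, 2.2 * shape])
normalized = unit_area_normalize(x, raw)
```

Because both rows differ only by a pathlength factor, they coincide after normalization. If the two rows also differed substantially in concentration, the areas would track concentration instead, which is the drawback noted above.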

Correcting Baseline Effects
As all spectroscopists know and have observed, spectrometers do not always collect data with an ideal baseline. Due to a variety of problems (detector drift, changing environmental conditions such as temperature, spectrometer purge, sampling accessories, etc.), the baseline of a given spectrum is not always where it should be. Beer’s Law assumes that the absorption of light at a given wavelength is due entirely to the absorptivity of the constituents in the sample; it does not account for "spectrometer error" or "sampling error." Therefore, in order to accurately calculate concentrations, it is necessary to remove the baseline effect introduced by the spectrometer.

As with most random variations in the spectral data, most chemometric models can compensate for these effects by adding extra factors or, if the variations are truly random, by ignoring them altogether. However, as with all preprocessing methods, a more robust model will usually result when the known interferences in the data are removed first.

There are a number of methods used by spectroscopists to remove baseline effects from the spectra they collect. The problem with most methods is that they require the spectroscopist to decide that the baseline is correct by visual inspection. In addition to being very subjective, most of these methods cannot be easily applied in the somewhat automated fashion required for a calibration model.

However, some methods are automated enough to be used as part of a calibration model. The following list of baseline correction methods is not exhaustive; there are many other ways of auto-correcting the spectrum baseline as a chemometric preprocessing step.

Linear Regression Baseline Fitting
This is a very simple approach to baseline correction in that it requires no effort to set up. In this method, a least squares regression line is fit to the responses in each spectral region selected for calibration. This line is then subtracted from the response values in the region before using the data to perform the calibration model calculations.

Unfortunately, this is not always the best approach, especially when the selected spectral regions consist primarily of large bands from the constituents of interest. It tends to work better when the entire spectrum is used or when the selected regions are very broad. In some cases, this method actually degrades the performance of the calibration models more than if no baseline correction were used at all. In general, this method should only be used in situations where baseline aberrations are severe and a limited number of training sample spectra are available.
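
The regression baseline can be sketched in a few lines. The function name and the synthetic data are assumptions for illustration, and a single region is treated here; with several selected regions the same fit would be repeated for each.

```python
import numpy as np


def regression_baseline_correct(x, y):
    """Subtract the least-squares straight line fitted to the region."""
    slope, intercept = np.polyfit(x, y, 1)
    return y - (slope * x + intercept)


x = np.linspace(0.0, 100.0, 201)

# A pure tilted baseline is removed completely.
tilted = 0.2 + 0.003 * x
flat = regression_baseline_correct(x, tilted)

# A region dominated by one large band: the fit is pulled up by the band,
# so the true baseline points end up below zero after correction.
band = np.exp(-((x - 50.0) / 10.0) ** 2)
banded = regression_baseline_correct(x, band + 0.2)
print(banded.min())
```

The second case shows the weakness discussed above: when the region is mostly band, the fitted line no longer represents the baseline and the correction distorts the data.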

Two Point Linear Baseline

Another approach is the tried-and-true method of selecting two baseline points in the spectrum, connecting them with a line, and subtracting it from the spectral responses. This is known as a two-point baseline correction.

The main problem with this method is selecting the two points. In the optimum case, the training sample spectra will always have at least two regions that are at different ends of the spectrum where there is no absorption. In the worst case, the entire spectrum may exhibit absorption, and defining a baseline point can be difficult if not almost impossible. Another problem is how to select baseline points from looking at a single spectrum, and be sure that no band will suddenly appear at that wavelength in future spectra.

Despite these limitations, there are some things that can be done to make this method of baseline correction more robust. For example, instead of selecting two single points for baseline correction, select a range of points in two parts of the spectrum. Then locate the wavelength point which has the minimum response in each range, and use these values as the two points. Another method is to calculate the average of the points in the selected baseline regions. These methods will presumably get around the problem of peaks shifting near the selected points.
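
The two more robust variants described above (minimum response in each range, or the average of each range) might be sketched as follows. The function names, the `mode` argument and the synthetic spectrum are hypothetical, and for the averaging variant the sketch assumes the anchor is placed at the mean x position of the range.

```python
import numpy as np


def anchor_point(x, y, lo, hi, mode="min"):
    """Pick one baseline anchor from the range lo..hi."""
    mask = (x >= lo) & (x <= hi)
    xr, yr = x[mask], y[mask]
    if mode == "min":
        i = np.argmin(yr)  # point of minimum response in the range
        return xr[i], yr[i]
    # Assumption: the averaged anchor sits at the mean x of the range.
    return xr.mean(), yr.mean()


def two_point_baseline_correct(x, y, region_a, region_b, mode="min"):
    xa, ya = anchor_point(x, y, region_a[0], region_a[1], mode)
    xb, yb = anchor_point(x, y, region_b[0], region_b[1], mode)
    slope = (yb - ya) / (xb - xa)
    # Subtract the line through the two anchors from the whole spectrum.
    return y - (ya + slope * (x - xa))


x = np.linspace(0.0, 100.0, 201)
band = np.exp(-((x - 50.0) / 5.0) ** 2)
measured = band + 0.1 + 0.002 * x  # offset plus tilt

by_min = two_point_baseline_correct(x, measured, (0.0, 10.0), (90.0, 100.0), "min")
by_mean = two_point_baseline_correct(x, measured, (0.0, 10.0), (90.0, 100.0), "mean")
```

Both variants recover the band here because the two ranges are free of absorption; that requirement is the main limitation noted above.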

First and Second Derivatives
One of the best methods for removing baseline effects is the use of derivative spectra. This method is one of the earliest methods used to attempt to correct for baseline effects in spectra solely for the purpose of creating robust calibration models. The 1st derivative of a spectrum is simply a measure of the slope of the spectral curve at every point. The slope of the curve is not affected by baseline offsets in the spectrum, and thus the 1st derivative is a very effective method for removing baseline offsets. The 2nd derivative is a measure of the change in the slope of the curve. In addition to ignoring the offset, it is not affected by any linear "tilt" that may exist in the data, and is therefore a very effective method for removing both the baseline offset and slope from a spectrum.

There are many ways to calculate derivatives. One of the easiest is by using the method of simple differences. In this approach, the derivative at a given point is calculated by:

$$A'_i = A_{i+1} - A_i$$

where $A_i$ is the response at data point $i$.

Unfortunately, this method is not always useful for calculating "real" derivatives. In fact, since it attempts to estimate the derivative from the between-point differences, in most cases it only succeeds in enhancing the noise in the spectrum.

There are better algorithms for calculating derivatives, including the Gap method and the Savitzky-Golay method. Both of these algorithms use information from a localized segment of the spectrum to calculate the derivative at a particular wavelength rather than the difference between adjacent data points. In most cases, this avoids the problem of noise enhancement from the simple difference method and may actually apply some smoothing to the data.

One problem in applying these methods is that they require an extra parameter: the size of the spectral segment to use for calculation of the derivative points. For the Gap method, this is the size of the gap (usually measured in wavelength span, but sometimes in terms of data points) between the difference points. The Savitzky-Golay method uses a convolution function, and thus the number of data points in the function must be specified. If the segment is too small, the result may be no better than using the simple difference method. If it is too large, the derivative will not represent the local behavior of the spectrum (esp. Gap), and it will smooth out too much of the important information (esp. Savitzky-Golay). Although there have been many studies done on the appropriate size of the spectral segment to use, a good general rule is to use a sufficient number of points to cover the full width at half height of the largest absorbing band in the spectrum.
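
To make the three approaches concrete, here is a hedged sketch. The function names are hypothetical. The Gap derivative is written as a centered difference across `half_gap` points on each side, which is one common formulation and may differ from a particular software package. The Savitzky-Golay derivative is computed by fitting a local polynomial directly in each window rather than by precomputed convolution coefficients; for evenly spaced data the two give the same result.

```python
import numpy as np


def simple_difference(y):
    # Difference between adjacent data points.
    return np.diff(y)


def gap_derivative(x, y, half_gap):
    """Centered difference across a gap of 2 * half_gap points (assumed form)."""
    g = half_gap
    d = np.full(len(y), np.nan)  # edges without a full gap stay NaN
    d[g:-g] = (y[2 * g:] - y[:-2 * g]) / (x[2 * g:] - x[:-2 * g])
    return d


def savgol_derivative(x, y, window, polyorder=2, deriv=1):
    """Savitzky-Golay style derivative via a local polynomial fit per window."""
    half = window // 2
    d = np.full(len(y), np.nan)
    for i in range(half, len(y) - half):
        xs = x[i - half:i + half + 1] - x[i]  # center the window at zero
        coeffs = np.polyfit(xs, y[i - half:i + half + 1], polyorder)
        d[i] = np.polyder(np.poly1d(coeffs), deriv)(0.0)
    return d


x = np.linspace(1100.0, 1300.0, 201)  # 1 nm spacing (hypothetical axis)
line = 0.3 + 0.01 * x                 # baseline offset plus tilt only

first = gap_derivative(x, line, 5)                              # constant slope
second = savgol_derivative(x, line, 11, polyorder=2, deriv=2)   # about zero

# Noise comparison on a flat, noisy signal.
rng = np.random.default_rng(1)
noisy = 0.5 + rng.normal(0.0, 0.01, size=x.size)
noise_simple = np.std(simple_difference(noisy) / np.diff(x))
noise_gap = np.nanstd(gap_derivative(x, noisy, 5))
```

The first derivative of the tilted baseline is a constant and the second derivative is essentially zero, which is why the offset and slope drop out. On the noisy signal, the gap estimate has a much smaller spread than the simple difference.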

The main disadvantage of using derivative preprocessing is that the resulting spectra are very difficult to interpret. As mentioned above, the loading vectors for the calibration model represent the changes in the constituents of interest. In some cases (especially in the case of PLS-1 models), the vectors can be visually identified as representing a particular constituent. However, when derivative spectra are used, the loading vectors cannot be easily identified. In addition, the derivative makes visual interpretation of the residual spectrum more difficult, so the spectral absorbances of impurities in the samples cannot be readily located.

Figure 1. First derivatives of the spectra in Figure 3 below. Derivatives were calculated using the Gap method with a gap value of 10 nm.
Figure 2. Second derivatives of the spectra in Figure 3 below. Derivatives were calculated using the Gap method with a gap value of 10 nm.
Figure 3. A set of 50 Log(1/R) NIR spectra of ground wheat samples measured using diffuse reflectance. The concentrations of the constituents of interest fall in a relatively narrow concentration range. However, note that the light scattering causes the spectra to appear quite different.

Data Enhancement
Due to the multivariate nature of factor-based chemometric models, the direct relationship between the spectral response and the constituent concentration (univariate) is not very important. These models do not look at the absolute relationship between these values, but instead they calculate the relative change in the spectra and attempt to correlate that to a corresponding change in the constituent concentrations. This is why the models tend to be so robust and why they can calibrate for the constituents of interest in the presence of many other interferences.

Due to this fact, there are some mathematical enhancements that can be applied to data that is to be used in a multivariate model that would render it useless for a univariate model. The purpose of these algorithms is to remove redundant information and enhance the important sample-to-sample differences that exist within the data.

Mean Centering
Mean centering is almost always applied when calculating any multivariate calibration model. This involves calculating the average spectrum of all the spectra in the training set and then subtracting the result from each spectrum. In addition, the mean concentration value for each constituent is calculated and subtracted from the concentrations of every sample.

Mean Spectrum:

$$\bar{A} = \frac{1}{n} \sum_{j=1}^{n} A_j$$

Mean Centering:

$$A_{j,\mathrm{centered}} = A_j - \bar{A}$$

In these equations, A is the n by p matrix of training set spectral responses for all the wavelengths, $\bar{A}$ is a 1 by p vector of the average responses of all the training set spectra at each wavelength, Aj is a 1 by p vector of the responses for a single spectrum in the training set, n is the number of training spectra, and p is the number of wavelengths in the spectra.

By removing the mean from the data, the differences between the samples are substantially enhanced in terms of both concentration and spectral response. This usually leads to calibration models that give more accurate predictions.
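
A small sketch of mean centering, using a made-up training set of four spectra at three wavelengths and a single constituent. The array names and values are illustrative only.

```python
import numpy as np

# Hypothetical training set: n = 4 spectra (rows) by p = 3 wavelengths (columns).
A = np.array([[0.10, 0.50, 0.30],
              [0.12, 0.55, 0.33],
              [0.08, 0.45, 0.27],
              [0.14, 0.60, 0.36]])

# Hypothetical concentrations: one column per constituent.
C = np.array([[2.0], [4.0], [1.0], [5.0]])

mean_spectrum = A.mean(axis=0)  # 1 by p average spectrum
mean_conc = C.mean(axis=0)      # mean concentration of each constituent

A_centered = A - mean_spectrum  # subtract the mean spectrum from every row
C_centered = C - mean_conc      # subtract the mean concentration likewise
```

After centering, every wavelength column and every constituent column averages to zero, so only the sample-to-sample differences remain.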

Variance Scaling
Variance scaling is used to emphasize small variations in the data by giving all values equal weighting. Variance scaling is calculated by dividing the response at each spectral data point by the standard deviation of the responses of all training spectra at that point. The concentration data is scaled likewise for each constituent. Note that variance scaling is only applicable after the data has already been mean centered.

Variance Spectrum:

Variance Scaling:

In these equations, A is the n by p matrix of training set spectral responses for all the wavelengths, Av is a 1 by p vector of the variance of the training set spectral responses at each wavelength, Aj is a 1 by p vector of the responses for a single spectrum in the training set, n is the number of training spectra, and p is the number of wavelengths in the spectra.

This preprocessing algorithm is most useful when analyzing minor (low concentration) constituents that have spectral bands that overlap those of major (higher concentration) constituents. By giving all the information in the data equal weighting, the calibration errors in the model should be more consistent across all constituents.
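
The effect on a minor constituent can be seen in a short sketch. The function name and data are hypothetical, and the sketch assumes the sample standard deviation (divisor n - 1) is the one intended; the text does not say which divisor is used. A column with zero variation would need special handling, which is omitted here.

```python
import numpy as np


def variance_scale(A_centered, ddof=1):
    """Divide each mean-centered column by its standard deviation.

    Assumption: ddof=1 (sample standard deviation).
    """
    std = A_centered.std(axis=0, ddof=ddof)
    return A_centered / std, std


# Hypothetical responses: column 0 is a strong band from a major constituent,
# column 1 is a weak band from a minor constituent.
A = np.array([[1.00, 0.010],
              [1.40, 0.012],
              [0.80, 0.009],
              [1.20, 0.013]])

A_centered = A - A.mean(axis=0)  # variance scaling follows mean centering
A_scaled, std = variance_scale(A_centered)
```

Before scaling, the spread of the major band is roughly a hundred times that of the minor band; afterwards both columns have unit standard deviation and carry equal weight in the model.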

 
