Algorithms - Discriminant Analysis, Mahalanobis Distance

Algorithms

Discriminant Analysis, The Mahalanobis Distance

The Mahalanobis distance is a very useful way of determining the "similarity" of a set of values from an "unknown" sample to a set of values measured from a collection of "known" samples. The mathematics of the Mahalanobis distance calculation has been known for some time, and the method has been applied successfully to spectral discrimination in a number of cases. One of the main reasons the Mahalanobis distance method is used is that it is very sensitive to inter-variable changes in the training data. In addition, since the Mahalanobis distance is measured in terms of standard deviations from the mean of the training samples, the reported matching values give a statistical measure of how well the spectrum of the unknown sample matches (or does not match) the original training spectra.

The mathematical basis behind the Mahalanobis distance measurement is really quite simple, but it is much easier to understand when it is explained graphically. Consider a set of spectra of different samples of the same material as shown below in Figure 1.

 

Figure 1. Multiple spectra of the same compound collected on different dates using the same FT-IR spectrometer. Notice the subtle variations in the spectral bands and the large variations in the baseline.

 

When measuring samples of the same material, it is expected that the spectra will be very similar to one another. However, no two spectra will be exactly the same. Each spectrum will be slightly different due to spectrometer drift, differences in sample handling, changing environmental conditions such as humidity, and batch-to-batch variations in the sample material.

However, since the spectra are all of the same material, the relative intensities at all wavelengths should remain approximately the same. The peaks will tend to rise and fall together throughout the entire spectrum. To demonstrate this, measure a series of spectra of different samples of the same material to form a training set. Then select any two wavelengths in the spectrum (preferably at or near the tops of two major bands) and plot the responses at the first wavelength versus the responses at the second. The result should be a plot similar to Figure 2.
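The procedure just described can be sketched in a few lines of Python. Everything in this sketch is an assumption made for illustration: the five-point `reference` spectrum, the scale and baseline values in `variations`, and the choice of indices 1 and 3 as the two "selected wavelengths" are all hypothetical, and real spectra would of course be measured rather than generated.

```python
import math

# Hypothetical "pure" absorbances of one material at five wavelengths.
reference = [0.10, 0.50, 0.20, 0.80, 0.15]

# Each simulated training spectrum is the reference scaled by a
# pathlength/concentration factor plus a constant baseline offset.
variations = [(0.90, 0.02), (0.95, 0.00), (1.00, 0.01),
              (1.05, 0.03), (1.10, 0.00), (1.02, 0.02)]
training_spectra = [[scale * a + offset for a in reference]
                    for scale, offset in variations]

# Indices of the two selected wavelengths (the two largest bands here).
first, second = 1, 3
pairs = [(s[first], s[second]) for s in training_spectra]

def pearson(xy):
    """Pearson correlation coefficient of a list of (x, y) pairs."""
    n = len(xy)
    mx = sum(x for x, _ in xy) / n
    my = sum(y for _, y in xy) / n
    sxy = sum((x - mx) * (y - my) for x, y in xy)
    sxx = sum((x - mx) ** 2 for x, _ in xy)
    syy = sum((y - my) ** 2 for _, y in xy)
    return sxy / math.sqrt(sxx * syy)

correlation = pearson(pairs)
print(f"correlation between the two wavelengths: {correlation:.3f}")
```

The `pairs` list holds exactly the points that would be plotted against each other. A correlation close to 1 is the numerical counterpart of the peaks rising and falling together, and it is what stretches the cluster into an elongated shape.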

 

Figure 2. Absorbance of two selected wavelengths plotted against each other. The values are from multiple spectra of the compound shown in Figure 1. The mean (X) point of the group should be unique for the material. The elliptical cluster shape formed by the group points is typical of spectra from the same material, with most samples clustered near the mean.

Notice that the points tend to form an elliptical cluster, reflecting the subtle differences between the spectra (in terms of baseline shift, pathlength and concentration). The mean position of the cluster is unique to the particular material of interest, since the intensities at these two wavelengths would be different for a different material. This theory can be tested by taking a series of spectra of a different compound and plotting the intensities at the same wavelengths together with the previous data. An additional group of points would form on the graph, but centered at a different position. Each of these groups is unique to the particular material that created it. This gives a very simple method of determining the similarity of an unknown spectrum to one of the groups. Just measure the spectrum of the "unknown," plot the intensities of the selected wavelengths, and see if the point falls "near" the mean point of one of the groups. If the "unknown" point is close enough, the sample can be classified as being the same material. If the point is far away, it does not match, and the sample may be a different material or may not have the same purity as the training set data.

The approach seems relatively straightforward, but how is the concept of being "near" a group actually defined? As mentioned earlier, visual inspection is not a viable method for real-world discriminant analysis applications. What is needed is a mathematical equation to measure the "nearness" of the unknown point to the mean point of the group(s).

One such measurement technique has already been mentioned: the Euclidean distance. As with using the responses at all wavelengths in a spectrum to perform a simple single spectrum match, the same formula can be applied to calculating the distance of the "unknown" point to the group mean point. In this case, the vectors would only have two data points each: the responses at the selected wavelengths. This would be a fine method except for two simple facts. First, as mentioned above, the Euclidean distance does not give any statistical measurement of how well the unknown matches the training set. Second, the Euclidean distance only measures a relative distance from the mean point of the group. It does not take into account the distribution of the points in the group.

For reference, an example Euclidean boundary has been superimposed on the group points in Figure 2. In addition, two hypothetical unknown sample points "A" and "B" have been added as well. Notice that although the training set group points tend to form an elliptical shape, the Euclidean distance describes a circular boundary around the mean point. By the Euclidean distance method, sample "B" is just as likely to be classified as belonging to the group as sample "A." However, sample "A" clearly lies along the elongated axis of the group points, indicating that the selected wavelengths in the spectrum are behaving much more like the training group than those same wavelengths in the spectrum of sample "B." Clearly, the Euclidean distance method does not take into account the variability of the values in all dimensions, and is therefore not an optimum discriminant analysis algorithm for this case.
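The circular boundary can be illustrated with a small calculation. The coordinates below are hypothetical absorbance values chosen for this sketch (they are not read from Figure 2): sample "A" is displaced from the mean along the long axis of the cluster, and sample "B" is displaced by the same amount across it.

```python
import math

def euclidean_distance(point, mean):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((p - m) ** 2 for p, m in zip(point, mean)))

# Hypothetical absorbances at the two selected wavelengths.
group_mean = (0.50, 0.80)
sample_a = (0.62, 0.96)  # displaced along the elongated axis of the group
sample_b = (0.66, 0.68)  # displaced across the elongated axis

dist_a = euclidean_distance(sample_a, group_mean)
dist_b = euclidean_distance(sample_b, group_mean)
print(f"A: {dist_a:.3f}   B: {dist_b:.3f}")
```

Both distances come out the same (0.2 absorbance units), so a Euclidean cutoff cannot tell the two samples apart, regardless of how the training points are distributed around the mean.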

The Mahalanobis distance, however, does take the sample variability into account. Instead of treating all values equally when calculating the distance from the mean point, it weights the differences by the range of variability in the direction of the sample point. Refer to the Mahalanobis boundary that has been superimposed on Figure 2 and this concept becomes much clearer. The Mahalanobis distance constructs a space that weights variation along the axis of elongation less than variation along the shorter axis of the group ellipse. In terms of Mahalanobis measurements, sample "A" will have a substantially smaller distance to the mean than sample "B" since it lies along the axis of the group that has the largest variability. Therefore, sample "A" is far more likely to be classified as the same material as the group. Mahalanobis distances look not only at variations (variance) between the responses at the same wavelengths, but also at the inter-wavelength variations (covariance). The Mahalanobis group defines a multi-dimensional space whose boundaries determine the range of variation that is acceptable for unknown samples to be classified as members.
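A minimal two-wavelength sketch of this weighting is given below. It uses the standard definition of the Mahalanobis distance (the deviation from the mean weighted by the inverse of the training-set covariance matrix) with the 2x2 inverse written out by hand. The training points, the sample coordinates and all function names are hypothetical and exist only for this illustration.

```python
import math

def mean_vector(points):
    """Mean of a list of (x, y) points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

def covariance_2d(points, mean):
    """Sample variances and covariance (s11, s12, s22) of two variables."""
    n = len(points)
    s11 = sum((p[0] - mean[0]) ** 2 for p in points) / (n - 1)
    s22 = sum((p[1] - mean[1]) ** 2 for p in points) / (n - 1)
    s12 = sum((p[0] - mean[0]) * (p[1] - mean[1]) for p in points) / (n - 1)
    return s11, s12, s22

def mahalanobis_2d(point, mean, cov):
    """Distance from the mean in units of the group's own spread."""
    s11, s12, s22 = cov
    det = s11 * s22 - s12 * s12
    dx, dy = point[0] - mean[0], point[1] - mean[1]
    # (dx, dy) * inverse(S) * (dx, dy)^T, with the 2x2 inverse expanded.
    d_squared = (s22 * dx * dx - 2.0 * s12 * dx * dy + s11 * dy * dy) / det
    return math.sqrt(d_squared)

# Hypothetical training set: an elongated cluster around (0.50, 0.80).
training = [
    (0.50 + 0.06 * t + 0.016 * s, 0.80 + 0.08 * t - 0.012 * s)
    for t in (-2, -1, 0, 1, 2)
    for s in (-1, 0, 1)
]
group_mean = mean_vector(training)
group_cov = covariance_2d(training, group_mean)

sample_a = (0.62, 0.96)  # along the long axis of the cluster
sample_b = (0.66, 0.68)  # across the long axis, same Euclidean distance

dist_a = mahalanobis_2d(sample_a, group_mean, group_cov)
dist_b = mahalanobis_2d(sample_b, group_mean, group_cov)
print(f"A: {dist_a:.2f}   B: {dist_b:.2f}")
```

With these made-up numbers, the two samples are equally far from the mean in absorbance units, yet "A" lies roughly 1.4 standard deviations from the group while "B" lies roughly 12, because the cluster varies far less in the direction of "B".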

Another advantage of using the Mahalanobis measurement for discrimination is that the distances are calculated in units of standard deviation from the group mean. Therefore, the calculated circumscribing ellipse formed around the cluster actually defines the one standard deviation boundary of that group. This allows the analyst to assign a statistical probability to that measurement. In theory, samples that have a Mahalanobis distance of 3 or greater have a probability of about 0.01 or less of belonging to the group and can be classified as non-members. Samples that have distances less than 3 are then classified as members. In practice, others have found that a Mahalanobis distance of 10-15 works better as a maximum cutoff for classification [33]. However, the choice of cutoff value depends on the application and the type of samples.
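The probability quoted for a distance of 3 can be checked under some simplifying assumptions that are not stated in the text: two selected wavelengths, normally distributed training data, and a group mean and covariance that are known exactly. Under those assumptions the squared distance follows a chi-square distribution with two degrees of freedom, and the chance of a true member lying at distance d or more is exp(-d^2/2). The function names and the default cutoff are illustrative only.

```python
import math

def tail_probability_2d(distance):
    """P(Mahalanobis distance >= distance) for a true member, two variables.

    Assumes bivariate normal data with exactly known mean and covariance.
    """
    return math.exp(-0.5 * distance ** 2)

def classify(distance, cutoff=3.0):
    """Label a sample by comparing its distance with the chosen cutoff."""
    return "member" if distance < cutoff else "non-member"

for d in (1.0, 2.0, 3.0):
    print(f"distance {d:.0f}: probability {tail_probability_2d(d):.4f}, {classify(d)}")
```

For a distance of 3 this gives a probability of about 0.011, in line with the figure of roughly 0.01. With more than two wavelengths the same distance corresponds to a different probability, which is one reason the cutoff has to be tuned to the application.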

Just like many multivariate quantitative methods, the Mahalanobis distance can solve for multiple dimensions simultaneously. The Mahalanobis group can therefore be extended to more than 2 dimensions by simply selecting more wavelengths. This is generally a good idea as this will attempt to compensate for variations in other regions of the spectrum.
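For reference, the standard textbook definition of the Mahalanobis distance for any number of selected wavelengths can be written as follows. The symbols are introduced here and do not appear in the original text: x is the vector of responses of the unknown at the n selected wavelengths, mu is the vector of training-set means at those wavelengths, and S is the n x n covariance matrix of the training set.

```latex
D_M(\mathbf{x}) = \sqrt{(\mathbf{x} - \boldsymbol{\mu})^{\mathsf{T}} \, \mathbf{S}^{-1} \, (\mathbf{x} - \boldsymbol{\mu})}
```

The diagonal of S holds the variance at each wavelength and the off-diagonal elements hold the inter-wavelength covariances. If S were the identity matrix, the expression would reduce to the ordinary Euclidean distance.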

Unfortunately, this method is not perfect, and in fact there are a number of drawbacks. First of all, this approach to discriminant analysis relies on selecting a subset of wavelengths to represent the entire spectrum. Again, if any impurities or aberrations appear in the spectra of the "unknowns" which do not appear at the selected wavelengths, the discriminant analysis will determine that the sample matches the group, when in fact it does not!

A simple solution would appear to be to select more wavelengths. For that matter, why not select every data point in the spectrum? The reason is that the Mahalanobis model tends to become overfit very quickly as more wavelengths are added. This is only logical when the method of calculating the Mahalanobis matrix is considered. Since all the inter-wavelength variations are considered just as important as matching the corresponding wavelengths, the likelihood of an unknown sample having the same relative intensity values at all selected wavelengths decreases substantially. In the worst case, using too many wavelengths can cause "good" samples to be misclassified as not in the group. In practice, using more than approximately 10 to 15 wavelengths can lead to misclassification of known samples. In other words, samples that should be classified as members of the group are rejected as non-members. Procedures have been put forth for optimum wavelength selection [33] based on all the groups used for comparison. However, this can be a time-consuming and computationally intensive process.
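One contributing effect can be sketched numerically, under idealized assumptions that go beyond the text: normally distributed responses and a group mean and covariance that are known exactly. In that case the squared Mahalanobis distance of a genuine member follows a chi-square distribution whose degrees of freedom equal the number of wavelengths, so a fixed cutoff rejects more and more good samples as wavelengths are added. The sketch uses the closed form of the chi-square tail that holds for even degrees of freedom; the function name and the cutoff of 3 are illustrative.

```python
import math

def fraction_rejected(num_wavelengths, cutoff=3.0):
    """Fraction of true members with distance >= cutoff (even counts only).

    Uses the closed-form chi-square tail for even degrees of freedom:
    exp(-x/2) * sum_{j < k} (x/2)^j / j!, with x = cutoff^2 and k = dof / 2.
    """
    if num_wavelengths <= 0 or num_wavelengths % 2:
        raise ValueError("this closed form needs a positive, even wavelength count")
    half = 0.5 * cutoff ** 2
    terms = num_wavelengths // 2
    return math.exp(-half) * sum(half ** j / math.factorial(j) for j in range(terms))

for count in (2, 4, 10, 20):
    print(f"{count:2d} wavelengths: {fraction_rejected(count):.3f} of good samples rejected")
```

Under these assumptions a cutoff of 3 rejects about 1% of genuine members with two wavelengths but more than half of them with ten. This does not model overfitting of the covariance matrix itself, which would add to the problem when the training set is small.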

In order to ensure that all impurities and other anomalies in the unknowns are detected, the discrimination method needs to be able to use the entire spectrum, or at least large regions, instead of selected wavelengths. However, spectra are usually collected with many data points, and certainly more than the 10-15 variable limit of the Mahalanobis distance method. So how is it possible to combine these apparently opposing requirements into one method for spectral discrimination? As with quantitative analysis methods, the answer lies in first reducing the spectral data into its component variations with Principal Component Analysis.

 
