Sara Stewarta,
Michelle Adams Ivyb and
Eric V. Anslyn*a
aInstitute for Cell and Molecular Biology, The University of Texas at Austin, 1 University Station A4800, Austin, Texas 78712, USA
bDepartment of Chemistry and Biochemistry, The University of Texas at Austin, 1 University Station A1590, Austin, Texas 78712, USA. E-mail: anslyn@austin.utexas.edu; Fax: +1 512 471 7791; Tel: +1 512 471 0068
First published on 2nd September 2013
Statistical analysis techniques such as principal component analysis (PCA) and discriminant analysis (DA) have become an integral part of data analysis for differential sensing. These multivariate statistical tools, while extremely versatile and useful, are sometimes used as “black boxes”. Our aim in this paper is to improve the general understanding of how PCA and DA process and display differential sensing data, which should lead to the ability to better interpret the final results. With various sets of model data, we explore several topics, such as how to choose an appropriate number of hosts for an array, selectivity compared to cross-reactivity, when to add hosts, how to obtain the best visually representative plot of a data set, and when arrays are not necessary. We also include items at the end of the paper as general recommendations which readers can follow when using PCA or DA in a practical application. Through this paper we hope to present these statistical analysis methods in a manner such that chemists gain further insight into approaches that optimize the discriminatory power of their arrays.
Key learning points
(1) This review is intended to advance the reader's ability to interpret PCA and DA plots.
(2) We would like the reader to better understand how certain types of chemical behaviour translate to plot behaviour.
(3) We would like readers to understand what modifications can be made to their data analysis to improve data representation with these methods.
(4) We would also like the reader to understand some pitfalls common to these methods and how to avoid them.
(5) Finally, through this work, it is hoped that an increased understanding of multivariate statistical analysis tools is gained, which aids in the interpretation of complex data sets, specifically those seen in differential sensing routines.
To alleviate such difficulties, chemists have explored the use of statistical analysis techniques such as principal component analysis (PCA) and discriminant analysis (DA). Although these techniques are becoming particularly important for differential sensing purposes,7–16 they are sometimes used as a “black box”. PCA and DA are widely utilized across multiple fields of academia and industry, and thus there are numerous reviews and tutorials on these techniques available to study.17–21 However, these articles are often heavily laden with mathematical symbols and derivations, or with seemingly unrelated examples, that are challenging to translate to differential sensing. For this reason, we see a present need for a qualitative explanation of these techniques to help chemists interpret the PCA and DA plots that result from differential sensing studies. Our aim is to present PCA and DA to chemists in a manner which will shed light on the types of receptor arrays which lead to certain plots, and to give a few general criteria for obtaining optimal PCA and DA plots. This information ultimately can be utilized to refine differential sensing systems for better analyte and solution discrimination and differentiation.
To explore how PCA and DA work, a relevant analogy can be made to a more familiar eigenvalue problem (see Scheme 1 for how eigenvalue problems are written). Most chemists know that the Schrödinger equation plays a fundamental role in quantum mechanics (eqn (1)). This differential equation is usually simplified and reduced to a problem involving the eigenvectors and eigenvalues of a square matrix.22 The eigenvectors of this matrix represent the molecular orbitals with which we are all familiar, and the eigenvalues give the orbital energies that correspond roughly to the ionization potentials of the molecules. This classic equation is just one of many examples of eigenvalue problems, which play roles in fields as diverse as signal processing and civil engineering. Simply stated, when a matrix multiplies one of its eigenvectors, the result is proportional to the eigenvector (it has the same directional sense), where the constant of proportionality is the eigenvalue.22
HΨ = EΨ (1)
What differentiates one eigenvalue problem from another is the way that the elements of the square matrix are defined. In the Schrödinger equation describing electrons in molecules, the matrix elements are quite complicated integrals involving the “basis functions” (models of atomic orbitals) that describe the problem, and the eigenvectors are linear combinations of the basis functions, giving a mathematical description of the molecular orbitals.
In PCA (eqn (2)), the square matrix C is referred to as the co-variance matrix, while v is the set of the eigenvectors, and D is the set of eigenvalues. Because the goal of PCA is to find the greatest extents of variance in a set of data, the square matrix is a function of variance. Specifically, in PCA, the matrix reflects co-variance. Deriving the co-variance matrix C is the key to PCA, just like deriving the Hamiltonian matrix is key to solving the Schrödinger equation.
Cv = Dv (2)
To create the co-variance matrix, we first take a matrix of experimental observations (m) for different samples (n) to make an m × n data matrix. For example, in array sensing the observations may be absorbances at various wavelengths for different receptors mixed with the different analytes. The samples (number = n) are the individual analytes and replicates of the analytes. If we record 50 absorbance values with 5 receptors we would have 250 experimental observations (m = 50 × 5) for every sample. Next, for each sample n, the variance in the data (experimental observations) is derived from the standard deviation, presented in eqn (3), where N equals the number of total observations in a group, xi is a single observation within a group and x̄ is the mean of all the observations in a group. Variance is the square of the standard deviation (eqn (4)).
s = √[Σ(xi − x̄)² / (N − 1)] (3)
Var(X) = s² (4)
In our example, the data for which variance is calculated consist of all the absorbance values for the series of receptors. So far, this would mean that for each sample (n of these), corresponding to potentially a large set of data (m observations), we simply have one number – variance. The goal of PCA is to seek how the variance of one sample correlates with the variance of another sample. To do this, the method calculates co-variance, defined as in eqn (5). In this formula xi is a single observation in a group, x̄ is the mean of all the observations in a group, yi is a single observation in a different group and ȳ is the mean of all observations in that different group.
cov(X,Y) = Σ(xi − x̄)(yi − ȳ) / (N − 1) (5)
Importantly, there is a co-variance value for each sample relative to each other sample. Hence, for n samples there will be n × n co-variance values. These values can therefore be arranged into the square co-variance matrix (see Scheme 2), and this sets the stage for an eigenvalue problem as discussed above. The matrix is symmetric across the diagonal, because the co-variance of, for example, sample 3 with sample 5 must be the same as between 5 and 3. When Pearson's correlation is used, a specific variant of PCA in which the data set is normalized before running the PCA algorithm, the diagonal elements of the co-variance matrix will be equal to 1.23
Scheme 2 An n × n matrix used in PCA, where cov = co-variance, and the variance of the data for each sample is labeled with a number up to n samples.
In the original data matrix, one can define a vector for each sample in an m dimensional space. After PCA, each sample consists as a vector in n dimensional space, where each dimension reflects decreasing extents of variance between the samples. The x, y, z… coordinates for each sample in the new space are called the scores for that sample, and the score values along each axis reflect the extent to which the samples differ along the variance expressed by that axis. The extent of variance along each principal component axis is the eigenvalue for that axis.
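For readers who prefer to see the arithmetic, the following minimal Python/NumPy sketch (our own illustration, not code from the original work; the matrix dimensions and random values are placeholders) carries out the covariance-and-eigenvalue route just described.

```python
import numpy as np

# Hypothetical toy data: m = 20 observations (e.g. absorbance readings from several
# receptors) recorded for n = 6 samples; the values are random placeholders.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 6))                     # m x n data matrix

Xc = X - X.mean(axis=0)                          # centre the observations for each sample
C = (Xc.T @ Xc) / (X.shape[0] - 1)               # n x n co-variance matrix (cf. Scheme 2)

eigvals, eigvecs = np.linalg.eigh(C)             # solve Cv = Dv for the symmetric matrix C
order = np.argsort(eigvals)[::-1]                # order the PCs by decreasing variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

print("variance captured by each PC (%):", np.round(100 * eigvals / eigvals.sum(), 1))
# Following the description above, the elements of each eigenvector give the
# coordinates (scores) of the individual samples along that principal component.
```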
The above example presents what is generally called eigenvalue decomposition (EVD). We have used it as an example, along with molecular orbital theory (MOT), in order to allow the reader to better conceptualize the underlying mechanics of PCA. However, this is only one approach to solving for the eigenvalues, and it is limited by requiring a square matrix. While it is a legitimate method of calculating the eigenvalues, it is computationally taxing. A more generalized approach to the problem is singular value decomposition (SVD). PCA can be thought of in the general form presented in eqn (6), where T and P′ are matrices which capture the underlying data pattern of X, which is in this case the covariance matrix.24
X = TP′ (6)
For the purposes of PCA, T is a matrix whose columns contain the factor scores for each component, and P′ is a matrix whose rows contain the loading scores. Fundamentally, the factor scores are the coordinates of each sample and the loading scores are the coordinates of each variable in a data-reduced space.25,26
SVD has the form presented in eqn (7). In this case the columns of V are termed the right singular vectors and are equivalent to the columns of P. U is equivalent to T except that its columns are normalized; the factor scores in T are recovered by scaling the columns of U by the singular values held in the diagonal matrix D.
X = UDV′ (7)
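The relationship between eqn (6) and eqn (7) can be checked numerically. The short NumPy sketch below (again on placeholder data, and assuming the score/loading relations stated above) performs the SVD and confirms that the factor scores T are the columns of U scaled by the singular values in D.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(8, 5))                      # any (not necessarily square) data matrix
X -= X.mean(axis=0)                              # centre before decomposition

U, d, Vt = np.linalg.svd(X, full_matrices=False) # X = U D V'
T = U * d                                        # factor scores: columns of U scaled by the singular values
P = Vt.T                                         # loadings: the right singular vectors

# Check that the decomposition reproduces X, i.e. X = T P'
assert np.allclose(X, T @ P.T)

# The singular values relate to the eigenvalues of the co-variance matrix of the
# columns of X: eigenvalue_k = d_k**2 / (number of rows - 1)
print(d**2 / (X.shape[0] - 1))
```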
Another alternative to EVD for solving for the eigenvectors is non-linear iterative partial least squares (NIPALS). In this method, initial guesses for the loadings and scores are set and then iteratively modified until convergence between the previous values and the new values is attained.27,28
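A bare-bones NIPALS iteration, written in Python/NumPy purely as an illustration (the function name, initial guess, and convergence settings are our own choices, not prescribed by the references), might look like the following.

```python
import numpy as np

def nipals(X, n_components=2, tol=1e-10, max_iter=500):
    """Minimal NIPALS sketch: extract principal components one at a time by
    iterating between score and loading estimates until convergence."""
    X = X - X.mean(axis=0)                         # centre the data first
    scores, loadings = [], []
    for _ in range(n_components):
        t = X[:, np.argmax(X.var(axis=0))].copy()  # initial score guess: highest-variance column
        for _ in range(max_iter):
            p = X.T @ t / (t @ t)                  # update the loading estimate
            p /= np.linalg.norm(p)                 # normalise the loading vector
            t_new = X @ p                          # update the score estimate
            converged = np.linalg.norm(t_new - t) < tol
            t = t_new
            if converged:                          # stop once old and new values agree
                break
        X = X - np.outer(t, p)                     # deflate: remove this component, then repeat
        scores.append(t)
        loadings.append(p)
    return np.column_stack(scores), np.column_stack(loadings)

# Example use on random placeholder data
T, P = nipals(np.random.default_rng(2).normal(size=(10, 6)))
```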
More simply, PCA rotates and combines the original data such that each new orthogonal axis explains the most possible variance. This is referred to as a change of basis, and it results in the apparent shifting of the data points such that they are centered around the origin of each axis. It is here that the real strength of PCA arises. It is generally safe to assume that the most important variables are those with the greatest variance; however, it is not always apparent which combination of variables will yield a vector that explains the most variance. Furthermore, when a large data set is generated, such as spectroscopic data, there is a fair amount of redundancy between variables. PCA fundamentally reduces the dimensionality of the data by removing this redundancy: it finds collinear variables and expresses them along a single axis. In essence, PCA finds the axes that best fit an n-dimensional space of data and projects those axes into a simpler space.29
The eigenvectors of any matrix, not just with PCA, can be viewed as a “coordinate system” that is optimal for the problem under consideration. For example, if one does PCA for a set of x,y data points, the eigenvectors correspond to two lines in the x,y plane. Along one of these lines, the variance of the data points is maximized (the data exhibits a wide range of values along that axis), while the other axis has the opposite behaviour (the data exhibits a narrow range of values). While it can be quite easy to see – by eye – what these axes are in a two-dimensional case, the generalization to more dimensions is less easily visualized but no less straightforwardly amenable to computation.
Discriminant analysis (DA) is another eigenvalue problem, and has many features in common with PCA. The main difference between DA and PCA is that with PCA there is no bias placed on finding the greatest variance between samples. This means that replicates of the same analyte are treated no differently than different analytes. Therefore, clustering of the samples in PCA means that the variance between these samples is indeed smaller than the variance with other samples. In DA, the mathematics place a bias toward clustering repetitive samples (called a class) and separating them from repetitions of a different set of analytes (a different class). Unlike PCA, variance is not the parameter used to distinguish data in DA. Instead, DA fundamentally finds the best way to organize data in order to maximize class discrimination. For this manuscript, the percent captured values for the PCA plots represent variance captured, and the percent captured values for the DA plots represent discrimination captured.
An important distinction that needs to be made here is the precise form of DA used. For the sake of simplicity, the more general form of DA, called canonical discriminant analysis (CDA), will be used here. In the most basic sense, CDA identifies some combination of variables which maximizes the Euclidean distance between groups while minimizing the distance between members of a group. There are other forms of discriminant analysis, such as linear discriminant analysis (LDA) and quadratic discriminant analysis (QDA). LDA differs from CDA in that, rather than relying on both within- and between-group data to classify, LDA uses distance from a centroid to classify data.30 QDA is a more complex application of likelihood classification: rather than finding a linear combination of variables, it identifies a quadratic surface which minimizes misclassifications.31
CDA is a function which maximizes the difference between the means of differing classes, while minimizing the difference within a class. This is done by defining the scatter within a class and the scatter between the classes. Scatter is defined by matrices which are analogous in form to the co-variance matrices used in PCA.20 Importantly, with CDA two matrices are used, one for between-class scatter (SB) and one for within-class scatter (SW) (i.e. variance). Given this, the eigenvalue problem is formulated as in eqn (8). The inverse of the within-class scatter matrix multiplied by the between-class scatter matrix acts to maximize between-class scatter while minimizing within-class scatter. The eigenvectors (w) represent weighted combinations of scatter within and between the samples, while the eigenvalues (J) represent the extent to which scatter is maximized between classes and minimized within classes. The J values are analogous to the extent-of-variance eigenvalues found in PCA. Because DA has a bias built into the mathematical approach, it is called a “supervised” routine, while PCA is “unsupervised”. Consequently, due to the supervised nature of DA, the resulting plots often show better analyte classification than a corresponding PCA plot.
SW−1SBw = Jw (8)
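As an illustration of eqn (8), the following Python/NumPy sketch (our own construction, with placeholder data and an illustrative function name) builds the within- and between-class scatter matrices and solves the resulting eigenvalue problem.

```python
import numpy as np

def cda_axes(X, labels):
    """Sketch of eqn (8): build within-class (SW) and between-class (SB) scatter
    matrices, then solve SW^-1 SB w = J w for the discriminant axes."""
    X = np.asarray(X, dtype=float)
    overall_mean = X.mean(axis=0)
    n_features = X.shape[1]
    SW = np.zeros((n_features, n_features))
    SB = np.zeros((n_features, n_features))
    for cls in np.unique(labels):
        Xc = X[labels == cls]
        mc = Xc.mean(axis=0)
        SW += (Xc - mc).T @ (Xc - mc)                 # scatter within this class
        diff = (mc - overall_mean)[:, None]
        SB += len(Xc) * diff @ diff.T                 # scatter of the class mean about the grand mean
    J, W = np.linalg.eig(np.linalg.inv(SW) @ SB)      # eigenvalues J, discriminant axes w
    order = np.argsort(J.real)[::-1]
    return J.real[order], W.real[:, order]

# Toy usage: three classes of five replicates in a four-variable space
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(loc=i, size=(5, 4)) for i in range(3)])
labels = np.repeat([0, 1, 2], 5)
J, W = cda_axes(X, labels)
scores = (X - X.mean(axis=0)) @ W[:, :2]              # project samples onto the first two axes
```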
Now that a summary of the Schrödinger equation and PCA/DA eigenvalue problems has been given, we can draw an analogy between the results of the two kinds of problems. The eigenvectors of the Schrödinger equation are linear combinations of atomic orbitals we interpret as molecular orbitals. Each value of the eigenvector is the coefficient of the atomic orbital that contributes to the molecular orbital, and it acts as a weighting factor for that atomic orbital. Each element in an eigenvector for a particular eigenvalue from PCA is the coordinate position of an individual sample along a different axis in the n-dimensional space. The dimensions in this new space are orthogonal, and are referred to as PC 1, 2,…n, where each PC is associated with an eigenvalue. The most important eigenvalues from the Schrödinger equation are those associated with the HOMO and LUMO, meaning the orbitals near the middle of the energies. In PCA, the eigenvalues are the extent of variance carried by each axis in the n-dimensional space. It is the first few principal components that are the most important because they reflect the greatest amount of variance between the samples.
In a PCA or DA plot resulting from differential sensing, the response from multiple receptors can contribute to each axis in the plot, although some receptors often have a much larger contribution to a particular axis than others. The power of PCA and DA becomes most apparent in the cases that have data sets with a large number of receptors, spectral data, or other experimental data where it is nearly impossible to comprehensively evaluate the raw data with a few simple calibration curves. When there are many more variables than samples, DA in particular may not perform well due to an issue called “over-fitting” which will be discussed later in this manuscript.
As already mentioned, PCA and DA are common techniques employed to analyze the data that result from differential sensing. The receptors used in this technique are commonly referred to as differential, or cross-reactive. The terms are often used synonymously, and we have used them, and likely will continue to use them, in this manner. However, for the purposes of this concept article, we give them slightly different definitions. Differential receptors simply show different responses from each other to the analytes. Cross-reactive receptors actually have differences in their trends of affinities to the analytes, meaning that some receptors have higher affinities for some analytes, while the corresponding cross-reactive receptors prefer different analytes. This means that cross-reactive receptors are a subset of differential receptors.
It is important to note that PCA and DA are not the only algorithms used for pattern recognition. Factor analysis (FA), partial least squares (PLS), maximum redundancy analysis (MRA), and hierarchical clustering are examples of alternatives.23,27,32
For each scenario, values were selected to represent the Ka of each host:guest pair. For each pair, five values representing repetitions were randomly generated, following a normal distribution.32 This distribution was set such that the mean of the values was equal to the Ka selected to represent the host:guest pair. Standard deviations (σ) of 0.5 to 5 were used for the distributions of Ka values for each host:guest pair in order to simulate a range of variances within repetitions. For each scenario presented, we have included a summary of the mean Ka values used and the σ-value used.
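A data set of this kind can be generated in a few lines. The sketch below (Python/NumPy, using placeholder Ka values rather than the exact ones used in the figures) mimics the procedure of drawing five replicate Ka values per host:guest pair from a normal distribution with a chosen σ.

```python
import numpy as np

# Hypothetical re-creation of the simulated data sets described above: for each
# host:guest pair a mean Ka is chosen, then five "repetitions" are drawn from a
# normal distribution centred on that Ka with the chosen standard deviation.
rng = np.random.default_rng(4)

mean_ka = np.array([[100, 10, 10, 10, 10],     # rows = hosts, columns = guests;
                    [10, 100, 10, 10, 10],     # these values are placeholders, not
                    [10, 10, 100, 10, 10],     # the exact ones used in the paper
                    [10, 10, 10, 100, 10],
                    [10, 10, 10, 10, 100]])
sigma = 0.5
n_reps = 5

# One row per replicate measurement, one column per host
data = np.vstack([rng.normal(loc=mean_ka[:, g], scale=sigma, size=(n_reps, mean_ka.shape[0]))
                  for g in range(mean_ka.shape[1])])
guest_labels = np.repeat(np.arange(1, mean_ka.shape[1] + 1), n_reps)
```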
Fig. 1 (A) PCA plot of the antibody-like scenario. (B) Mean Ka values for the “antibody-like” scenario. In this example, each host:guest pair behaves in a very specific manner. For example, guest 1 (G1) and host 1 (H1) have a very high affinity for each other relative to the other host:guest pairs (0.5 standard deviations).
Similar results can be achieved through lower-selectivity but fully cross-reactive receptors (Fig. 2A). Fig. 2A presents a plot in which each receptor is cross-reactive with all other receptors. In this case, because each of the host:guest pairs behaves in a unique manner, each guest is separated from the others along both the F1 and F2 axes. From this example, it seems that there is very little difference in using a panel of receptors that have antibody-like behavior as opposed to cross-reactive behavior, since discrimination of analytes can be effectively achieved in both circumstances. In these models each host responds in an unambiguously different manner to all the guests. This situation is ideal for optimal discrimination. However, quality discrimination can still be achieved with small differences between responses to guests for an array of hosts, as is the case in most cross-reactive arrays, provided each receptor behaves in a sufficiently unique manner.
On one hand it may seem that these conclusions support the notion of utilizing cross-reactive arrays in all scenarios. However, despite the power cross-reactive arrays may show for discriminating similar targets, they are not well suited for all cases. When a specific target or targets need to be identified from a complex solution, cross-reactive arrays may fall short. Further, the sensitivity that cross-reactive arrays show toward subtle variations in a target may be rendered useless by a large concentration of competing background analytes. In cases like this, a very specific receptor that is not perturbed by compounds in the background would obviously be far superior to a cross-reactive array.
Fig. 2 (A) PCA plot of the cross-reactive scenario. (B) Mean Ka values for the cross-reactive scenario. In this example, each host:guest pair behaves in a unique manner. For example, guest 1 (G1) and host 1 (H1) have lower affinity for each other than the affinity of H1 for any of the other guests, whereas host 2 (H2) has the lowest affinity for G2 relative to the other host:guest pairs (2 standard deviations).
Fig. 3 (A) PCA plot of the antibody-like scenario with four hosts. (B) The same data set as presented in Fig. 1B, however one of the hosts has been omitted (0.5 standard deviations).
This can be thought of in terms of combinations of 1s and 0s (this is the source of the 2: two options, 1 or 0). An antibody-like sensor can be considered “perfect”: when presented with its target, it has maximum signal and can be assigned a 1; when presented with a non-target analyte it has no signal and can be assigned a 0. A combination of 2 receptors could have the values (1,1), (0,1), (1,0) or (0,0), giving four unique combinations. Thus, the number of possible combinations for a set of n perfect receptors is equal to 2^n (a quick check of this counting argument is sketched below). Not all “perfect” sensor systems are capable of displaying all four possible combinations. Frequently, systems of highly selective sensors (like antibodies) are designed in such a way as to eliminate the possibility of the (1,1) situation. One further consideration for this scheme is a restriction on the (0,0) response. As with any signal, having multiple guests that respond in the same manner limits the discriminatory power of the assay. While this limitation exists for all guests, there can commonly be more guests that show no response than those that do. Thus, for full classification, it is important to ensure there is only one guest that shows a null response to all receptors in the array.
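The 2^n counting argument can be verified directly; the snippet below simply enumerates the response patterns for two hypothetical on/off receptors.

```python
from itertools import product

# The 2**n counting argument above, made concrete for n = 2 "perfect" receptors
n_receptors = 2
patterns = list(product([0, 1], repeat=n_receptors))
print(patterns)                           # [(0, 0), (0, 1), (1, 0), (1, 1)]
print(len(patterns) == 2 ** n_receptors)  # True: 4 unique combinations for n = 2
```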
Cross-reactive arrays are not limited by being constrained by on and off values. Rather, they are limited only by their ability to create a reproducible and sufficiently unique pattern of binding for each target. When a single host is removed from the cross-reactive data set in Fig. 2A, the score plot still retains a high level of discrimination (Fig. 4A). However, guests 2 and 4 begin to show overlap in a two-dimensional plot. This is due to the lack of a sufficient number of hosts behaving in a distinctly unique manner. Also, in the cross-reactive case an increase can be seen in the overall variance of PC1 and PC2, from 69.74% to 75.17%. This is due to the properties of the data set itself. In x–y plots of multivariate data, there is variance found in each of the variables but only two dimensions are displayed. Therefore, in a five variable array, the variance is distributed across five dimensions. When a variable is removed from the system, there are fewer dimensions across which the variance can be distributed.
Fig. 4 (A) PCA plot of cross-reactive scenario with four hosts. (B) The same data set as presented in Fig. 2B, however, one of the hosts has been omitted (2 standard deviations).
Fig. 5 (A) PCA plot of the overlapping data set with 5 hosts. (B) DA plot of the overlapping data set with 5 hosts. (C) PCA plot of the overlapping data set with ten hosts. (D) LDA plot of the overlapping data set with ten hosts. (E) Mean Ka values (×100) for each host:guest pair (5 standard deviations).
In Fig. 5 clustering of the various guests may roughly exist, but the groups are not readily distinguishable. This is due to a high standard deviation between the repetitions, relative to the magnitude of the Ka values for all the guest groups. Fig. 5A is the PCA plot of the data, which appears as total scatter. However, even with DA (Fig. 5B), the method falls short of completely discriminating the analyte classes.
One possible reason for this lack of discrimination has to do with what characteristics are being used to classify the data. As is typical with PCA, it is assumed that variance between groups of guests is sufficient to categorize the data. DA, on the other hand, requires information about class membership to group the data. This situation is hinted at by the difference between the variance captured by the PCA in Fig. 5A and the percent of discrimination captured by the DA in Fig. 5B. If the amount of variance captured by the first few PCs in PCA is low, then it is possible that variance is not a good classifier. However, this leads to the debate of how many axes are appropriate to include in a model. There are arguments that, in cross-reactive sensing routines, including many axes to account for a large amount of the variance is a more appropriate method, indicating that variance dispersed across many sensor combinations is the discriminatory element.33 The counterpoint to this is that by including many axes in the analysis, the ability of PCA to reduce the dimensionality is neglected. To some extent this is a philosophical question we will not be exploring here. However, careful consideration of what exactly a researcher wishes to achieve, as well as how the data are expected to behave, may allow one to decide what the best modelling approach may be. While the DA appears to be superior to the PCA in Fig. 5A due to the increased amount of discrimination captured, it will likely perform poorly as a predictive model. This is because the model relies on all data points to create the classifier. If a leave-one-out cross-validation is performed, one can see that the model loses its predictive power.
In cases like this, adding additional hosts can improve the data discrimination by reinforcing difficult-to-observe patterns in the data sets. Fig. 5C and D show the PCA and DA plots obtained when ten hosts are considered, rather than the five used for Fig. 5A and B. The five new hosts were chosen to respond identically to the first five. This could represent additional replicates in the system, or additional hosts whose deviation from the other hosts is subtle. Now we see that each guest is more localized in the PCA plot, though overlap still exists. In the DA plot (Fig. 5D), there is a much tighter clustering of the guests. The improvement can be further supported by comparing the jack-knife analysis for the five-host data set DA plot (76%) with that for the ten-host data set DA plot (84%). This analysis means that the identity of each guest was correctly predicted 76% or 84% of the time, respectively. Please see the section on validation for a more detailed explanation of this method.
The reason additional hosts improved the discriminatory power of this system is that each host responds to the guests in a specific manner that is not easily observed due to a high amount of noise in the system. In such circumstances, differentiation of analytes can be achieved either by adding hosts that reinforce observed patterns, i.e. hosts with similar behaviour to existing hosts (as in the example presented here), or by adding hosts with wholly unique behaviour. This situation, where adding hosts to a high-noise system increases the discriminatory power of an array, is called “co-linearity”.34 When this data set is expanded to 20 hosts (Fig. 6), a situation which in this case could be considered as having many more hosts than guests, the amount of variance captured across the first few PCs decreases. This is not unexpected, as each PC captures a portion of the variance; as the number of PCs increases, the amount of variance captured by each PC decreases. Visual examination of the first 2 PCs appears to show superior grouping of the samples. However, when a validation method is applied, it is found that the model has very poor predictive power; in this case the jack-knife analysis yields a 12% correct classification rate.
Fig. 6 PCA of overlapping data with 20 hosts. The data has been “over-fit”.
This is a situation referred to as over-fitting, and it is a common trap many researchers fall into. Even in the most random data set, it is possible to find an equation which can perfectly group the data by whatever parameter the researcher wants. If too many variables are used relative to the number of samples, however, the equation is only relevant to the presently available data. Any new data are not likely to follow the lines of discrimination, resulting in a model that only predicts itself. When examining the quality of a PCA or DA, one must be cognizant that adding more variables can appear to make a better-fitting model while reducing the predictive power of the model.35
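The effect is easy to reproduce with random numbers. The sketch below (using scikit-learn's LinearDiscriminantAnalysis as a stand-in DA, on purely random placeholder data) fits 25 samples with 20 uninformative variables: the training classification looks excellent, while leave-one-out validation collapses to roughly the chance rate.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Illustration of over-fitting with random data: with many more variables ("hosts")
# than samples, LDA separates the training data almost perfectly, yet
# leave-one-out cross-validation shows the model has little predictive power.
rng = np.random.default_rng(5)
X = rng.normal(size=(25, 20))                    # 25 samples, 20 uninformative variables
y = np.repeat(np.arange(5), 5)                   # 5 "guests", 5 replicates each

lda = LinearDiscriminantAnalysis()
print("training accuracy:", lda.fit(X, y).score(X, y))                 # close to 1.0
print("leave-one-out accuracy:",
      cross_val_score(lda, X, y, cv=LeaveOneOut()).mean())             # near chance (~0.2)
```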
The math behind the decomposition of a data set in PCA into its corresponding eigenvectors necessitates that the number of eigenvectors which emerge from the calculation be equal to the number of receptors in the array or the number of samples, whichever of these two numbers is smaller.23 Similarly, the number of eigenvectors which can emerge from a DA is equal to the number of classes minus one.39 However, this correlation between the number of eigenvectors and the number of receptors often leads to an incorrect conclusion: that each discriminating axis represents one receptor (variable). To further understand why this conclusion is a misconception, we must turn to loading plots, which are simultaneously generated when PCA and DA plots are produced. Loading plots show the influence which each receptor, or variable, has on the corresponding discriminating axis. Each receptor is represented by a vector in a loading plot. The x,y coordinates (or higher coordinates) of each vector indicate the extent to which each receptor contributes to a discriminating axis. Vectors of (1,0) or (−1,0) most influence the discrimination of analytes along the x-axis (F1), with the vector (−1,0) best discriminating analytes on the left side of F1 and the vector (1,0) best discriminating analytes on the right side of F1. Conversely, vectors (0,−1) or (0,1) most influence the lower half or the upper half of the y-axis, respectively. Receptors with vectors of intermediate x,y values contribute to both axes. Thus, the loading plot becomes very useful in analyzing which receptors or variables are most useful for discrimination, thus aiding in determining receptor performance.
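For concreteness, the following Python/NumPy sketch (placeholder data; the receptor labels are illustrative) shows how loading vectors can be obtained and read from a PCA.

```python
import numpy as np

# A minimal way to obtain loading vectors (the contribution of each receptor to
# each discriminating axis) from a PCA; the data here are random placeholders.
rng = np.random.default_rng(6)
X = rng.normal(size=(30, 6))                     # 30 measurements x 6 receptors
Xc = X - X.mean(axis=0)

U, d, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = U * d                                   # sample coordinates in the reduced space
loadings = Vt.T                                  # rows = receptors, columns = PCs

# The (x, y) pair for receptor i in a loading plot is loadings[i, 0], loadings[i, 1];
# a value near +/-1 on one axis and near 0 on the other means the receptor mainly
# drives that single discriminating axis.
for i, (lx, ly) in enumerate(loadings[:, :2], start=1):
    print(f"H{i}: F1 contribution = {lx:+.2f}, F2 contribution = {ly:+.2f}")
```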
Similar to a loading plot is a biplot. Biplots are most commonly seen in conjunction with PCA plots, where the loading plot of the PCA plot is superimposed onto its corresponding PCA plot. In these plots, the receptors which most influence a particular data point are located close in vector space to the data point. The proximity of a receptor vector endpoint to a data point allows further analysis of the array system to determine whether the receptor is important for discriminating that particular analyte.
The loading plots and biplots make it clear that the differing principal components can be made up of responses from several receptors. Both loading plots and biplots are important plots to analyze once PCA or DA results have been generated. They allow the user to probe the importance of the receptors in an array, which in turn provides information to improve and modify the array to obtain the best results. The DA plot in Fig. 7A was derived from an array consisting of 15 receptors, and Fig. 7C shows the mean Ka values used for this simulation. These values were chosen to maximize the dissimilarity between the behavior of each host:guest interaction in order to observe how each receptor can contribute to multiple axes. In the loading plot (Fig. 7B), we see that the first axis (F1) discriminates analytes based on the data from all the receptors except H2, which has a near-zero x-component in its loading vector. The second axis (F2) discriminates analytes based on the data from nearly all the receptors, because most receptors have a non-zero y-component in their loading vectors (those of H3 and H6 are near zero). Receptors H2, H8, H10, and H4 contribute the most to the discrimination seen along the F2 axis because the absolute values of their y-components are larger than those of the other receptors. Therefore, each discriminating axis contains contributions from multiple variables.
Care must be taken not to rely on loading plots and biplots exclusively for receptor selection. While it is likely that H2 and H8 are the primary contributors to the position of G4 in Fig. 7A, the precise relationship between a host's influence and a guest's position cannot be determined. Biplots are useful tools for approximating a host's influence; however, for a more exact measure and for optimal variable selection a factor analysis could be performed. This method uses the loadings and biplots in PCA and applies a set of rules and criteria in order to quantify the relative significance each factor has on the data structure of the model.23 This gives a quantitative estimation of the importance of each factor. That being said, in many cases PCA will yield results equivalent to a factor analysis.40
The misconception that each discriminating axis represents one receptor in an array is most likely a result of the direct correlation seen between the number of receptors (variables) in an array and the number of discriminating axes obtained. Another reason for this misunderstanding may be that in high-dimensionality systems with a large number of receptors, it is often the case that only a few of the receptors make a pertinent contribution to a particular discriminating axis, while the other variables in the array have a very small contribution that can be considered negligible.
Fig. 8 (A) PCA plot with a large variance data set. (B) Three-dimensional PCA plot of the data set. (C) Mean Ka values of inconsistent variance data (data for G1 contains 0.5 standard deviations, data for G2–G5 with Ka values of 10 contains 0.5 standard deviations, data for G2–G5 with Ka values of 20 contains 1 standard deviation, data for G2–G5 with Ka values of 30 contains 1.5 standard deviations, data for G2–G5 with Ka values of 40 contains 2 standard deviations).
Once we examine the three-dimensional plot, which takes into account a third discriminating component (Fig. 8B), we see excellent discrimination of all the analytes. If any other of the discriminating axes calculated by the PCA or DA algorithms hold a substantial discriminating percentage (i.e. >5%), it would be beneficial to further examine those axes in addition to the two greatest discriminatory axes. For instance, there may be circumstances where a third and fourth discriminating component is important to differentiate a data set. In this case, an examination of the plots generated from all combinations of the first through fourth axes may be necessary (i.e. compare plots – 1 vs. 2, 1 vs. 3, 1 vs. 4, 2 vs. 3, 2 vs. 4, etc.). Thus, careful consideration of all components may lead to the best visual representation of differentiated data.
To specifically optimize PCA plots, there exist rotation methods which often aid in simplifying the discriminating axes for easier interpretation. Although there are several methods, varimax rotation, introduced by Kaiser in 1958,41 is the most common. Varimax works by searching for a rotation of the original discriminating axes which maximizes the variance of the squared loading scores. The advantage of utilizing varimax is that the new plot may be easier to analyse because each axis represents a response from one, or only a few, receptors.42 This tends to lead to loading scores (i.e. placements within the PCA plot) which have a wide range of values and emphasize clustering.23
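A commonly circulated NumPy implementation of the varimax criterion is sketched below (our own adaptation rather than code from the cited references); it rotates a loading matrix so as to maximize the variance of the squared loadings, after which the scores can be recomputed in the rotated frame.

```python
import numpy as np

def varimax(loadings, max_iter=100, tol=1e-8):
    """Varimax rotation sketch: iteratively rotate the loading matrix so that the
    variance of the squared loadings is maximised (Kaiser's criterion)."""
    p, k = loadings.shape
    R = np.eye(k)                                        # start from no rotation
    d_old = 0.0
    for _ in range(max_iter):
        L = loadings @ R
        # Update step from Kaiser's criterion, solved via an SVD
        u, s, vt = np.linalg.svd(
            loadings.T @ (L ** 3 - L @ np.diag((L ** 2).sum(axis=0)) / p))
        R = u @ vt
        d_new = s.sum()
        if d_old != 0 and d_new / d_old < 1 + tol:       # stop once the criterion plateaus
            break
        d_old = d_new
    return loadings @ R

# Usage: rotate the first few PCA loadings, then project the centred data onto
# the rotated loadings to obtain scores in the rotated frame.
```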
Fig. 9 (A) DA plot of low variance data with blank included in the data set. (B) DA plot with blank excluded from data set. (C) Ka values of low variance data (0.5 standard deviations).
Fig. 10 (A) A PCA plot of data where an array is not needed. (B) An LDA plot of data where an array is not needed. (C) Ka values for a plot where an array is not needed (0.5 standard deviations).
As already mentioned, it is very common for a DA plot to look better than its corresponding PCA plot. Therefore, as we have mentioned a few times, validation techniques are run alongside trained models to allow users to evaluate the validity of the model for their data set. A common initial validation technique for DA is the jack-knife analysis, also known as the “leave-one-out” analysis. In this validation technique, data for one sample (or several samples) are removed from the data set and a new model is constructed. The classification of the removed sample is then estimated from the new model and compared to its known class. This entire process is repeated with every analyte in the data set, and the resulting number is the percentage of classifications which were correctly identified.23 A minimal sketch of this procedure is given below.
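Written out explicitly, the jack-knife procedure looks like the following sketch (Python, using scikit-learn's LinearDiscriminantAnalysis as a stand-in DA; the function name is our own).

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def jackknife_classification_rate(X, y):
    """Explicit leave-one-out loop matching the description above: remove one
    sample, rebuild the DA model, predict the held-out sample, and report the
    percentage of samples classified correctly."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    correct = 0
    for i in range(len(y)):
        keep = np.arange(len(y)) != i                       # everything except sample i
        model = LinearDiscriminantAnalysis().fit(X[keep], y[keep])
        correct += int(model.predict(X[[i]])[0] == y[i])    # was the held-out sample recovered?
    return 100.0 * correct / len(y)
```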
In addition to validation techniques for DA, most computer software programs give an option to display confidence ellipses for the grouped data. These ellipses generally represent a 95% confidence limit for a specific analyte group, typically calculated using the Hotelling T2 statistic. These confidence ellipses help the user to more easily identify how close each sample is to the group consensus, the centroid. On this note, however, we strongly discourage the incorporation of arbitrarily drawn circles which encompass analyte groups, as these may be mistaken for confidence ellipses. A more relevant statistic for assessing the quality of classification would be the use of bootstrapping in order to approximate the true characteristics of the population. Briefly, a bootstrap method resamples data from a sample population in order to create an expected distribution of the data. This process is repeated many times, often several thousand, until a reasonably accurate distribution is generated. This allows the researcher to generate confidence intervals directly related to the true distribution of the data rather than make assumptions when using the T2 statistic.43
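A minimal bootstrap of a group centroid might look like the sketch below (Python/NumPy; the function name, the number of resamples, and the confidence level are illustrative choices rather than prescriptions from the cited work).

```python
import numpy as np

def bootstrap_centroid_ci(group_scores, n_boot=5000, alpha=0.05, seed=0):
    """Bootstrap sketch: resample one analyte group's scores with replacement and
    build a confidence interval for its centroid, rather than assuming the
    Hotelling T2 distribution."""
    rng = np.random.default_rng(seed)
    group_scores = np.asarray(group_scores, dtype=float)    # replicates x components
    n = len(group_scores)
    centroids = np.array([group_scores[rng.integers(0, n, size=n)].mean(axis=0)
                          for _ in range(n_boot)])          # centroids of the resampled sets
    lower = np.percentile(centroids, 100 * alpha / 2, axis=0)
    upper = np.percentile(centroids, 100 * (1 - alpha / 2), axis=0)
    return lower, upper                                     # per-component confidence interval
```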
These techniques can be used to validate PCA data as well as DA data. While PCA is not considered a classification method in and of itself, PCA scores can be used for classification. Most software packages support validation methods for both DA and PCA, though the exact title may vary. One software package may use the term “predictive PCA”,44 while another may call it “principal component regression”.
If the final goal of these methods is to classify unknown data, an external validation data set should be used. This is data for which the researcher knows the class (though, if possible, blinding the experimenter should be considered) but which was not used to develop the model. This can include data that were collected as part of an experiment but not used in the model, or data collected as part of another experiment. There are a variety of methods used to quantify the predictive power of a model using an external data set. The precise method used should be selected to best reflect how the model is expected to be used and the availability/quality of external data.44
Typically, scaling is used in conjunction with centering to further normalize the variables to each other. With scaling, variables with large values are fundamentally “shrunk” while variables with small values are “stretched” to put them on the same footing.45 Without these steps it is often the case that the primary vector of variance is defined by the mean of the data. This limits the utility of the model by obscuring latent variables which are more powerful for discrimination.46
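In code, centering and scaling (often called autoscaling) amount to a one-line transformation. The sketch below is a generic Python/NumPy version, not the specific routine used by any particular software package.

```python
import numpy as np

def autoscale(X):
    """Centre each variable and scale it to unit variance so that large- and
    small-valued variables contribute on the same footing."""
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Equivalent behaviour is available in scikit-learn via
# sklearn.preprocessing.StandardScaler (which uses ddof = 0 by default).
```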
Another method commonly used to remove noise from data is referred to as smoothing. One of the simplest examples of this is the moving average. In this example a window of points is averaged together to generate a single point in their place. The window is then shifted through the entire data set point by point until each new point represents an average of a subset of points. This method reduces the impact of especially high or low values in the data set.47 There are many different methods used to smooth and remove noise from a data set, such as the Fourier transformation or the Savitzky–Golay smoothing filter.48
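Both approaches are illustrated in the short Python sketch below (NumPy/SciPy on a synthetic noisy signal; the window sizes are arbitrary choices made for illustration).

```python
import numpy as np
from scipy.signal import savgol_filter

rng = np.random.default_rng(7)
signal = np.sin(np.linspace(0, 4 * np.pi, 200)) + rng.normal(scale=0.3, size=200)

# Moving average: each point is replaced by the mean of a sliding window of points
window = 5
moving_avg = np.convolve(signal, np.ones(window) / window, mode="same")

# Savitzky-Golay: fits a low-order polynomial within each window instead of a flat mean
smoothed = savgol_filter(signal, window_length=11, polyorder=3)
```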
Pearson's correlation is considered to be a normalization method in which the standard deviation is used to scale the data between −1 and +1.33 Correlation methods are always applied when the data set being used contains different units (i.e. absorbance and fluorescence data both contained within a data set). There are four main ways of carrying out this pre-processing: (1) covariance about the origin, (2) covariance about the mean, (3) correlation about the origin, and (4) correlation about the mean. Rozett and Petersen give a detailed analysis of these methods and their advantages and disadvantages.49 In the context of differential sensing, covariance about the origin is the typical approach. Using this pre-processing method prevents the loss of data around the zero point of the experimental scale and avoids the loss of information regarding the relative size and relative error associated with the data from different receptors.23
It is important to note that because pre-processing of data can be key to obtaining the best differentiating model for the data set, some programs which run PCA often include a pre-processing step in the program calculations. In these cases, additional pre-processing may not be necessary for the user and the raw data can be used directly.
It is also important to note the difference between technical replicates and experimental replicates. Technical replicates involve replicated data which were derived from using the same stock solutions. These types of replicates help to evaluate pipetting accuracy and the homogeneity of the solutions or media being tested. Experimental replicates require the entire experiment to be reproduced including the growing of cells and preparation of stock solutions. These types of replicates are very useful in preventing results which discriminate data based on irrelevant variables such as the petri dish in which the cells were grown, the well plate in which the array was run, and the conditions in which the solutions or media were stored. Not all systems may require the incorporation of both technical replicates and experimental replicates in a data set, but a clear understanding of the benefits which arise from each different type of replicate may prevent false or unsupported discrimination in a plot, thus avoiding incorrect conclusions.