Taxonomy 3 - A multivariate genetic analysis
get newsletter
email us
PCA: main objectives and properties   |   Correlation (understanding) versus covariation (prediction)   |   Mathematical details and algorithms

PCA: main objectives and properties    top of page

When mining a dataset comprised of numerous variables (e.g. SNPs), it is likely that subsets of variables are highly correlated with each other. Given a high correlation between two or more variables it can be concluded that these variables are quite redundant and/or share the same driving principle (in our case share biological properties) in defining the outcome of interest.

Principal components analysis (PCA) is a statistical technique applied to a single set of variables to discover which sets of variables form coherent subsets that are relatively independent of one another. Variables that are correlated with one another which are also largely independent of other subsets of variables are combined into components. Components which are generated are thought to be representative of the underlying processes that have created the correlations among variables. The first principal component is the component retaining the most information of the original dataset.

PCA is widely used in signal processing, statistics, and neural computing. It is used for two objectives:
1. Reducing the number of variables comprising a dataset while retaining the variability in the data (i.e. reduce the dimensionality of the data set).
2. Identifying hidden patterns in the data (i.e. structure in the relationships between variables), and classifying them (i.e. identify new meaningful underlying groups of variables) according to how much of the information, stored in the data, they account for.

In 'taxonomy 3', the information retained by the first principal component is relevant to the case/control distinction (the between group variability). The other components describe within group variability, i.e. population heterogeneity and sub-phenotypes. This orthogonality (components are independent from each other) is the key reason why we have chosen the PCA to analyse the LBF matrix.

The PCA can be performed in any domain of interest: SNP, gene, ontology.

Within each component, variables have specific loadings, and subjects specific scores (see our examples page):

  • Loadings represent the relative importance of a given variable in a given component. For example, a SNP (or a gene, or an ontology, depending on the domain being analyzed) with a high loading 1 (the loading of the 1st component) will be of importance for the case/control distinction.

  • Scores represent the subjects distribution and structure for a given component.

Covariation versus correlation - 'prediction' versus 'understanding'    top of page

PCA is obtained from the eigen-decomposition of a covariation or correlation matrix. In 'taxonomy 3', we are using the covariation or correlation of the LBF matrix. The Eigen decomposition is a widely used mathematical function, but is likely too technical for the lay user. However, it is important to understand the differences between correlation and covariations.

Many diseases have low penetrance and disease predisposing genotypes at loci causing complex diseases are generally expected to have modest size effects. Using the correlation matrix across subjects in the eigen analysis (rather than the covariance matrix) prevents large genetic effects (i.e. SNPs with large frequency differences) dominating the decomposition and preserves the importance of effects at low minor allele frequencies. Together, with the use of a relative frequency measure (i.e. LBF), this keeps a wider size range for potential marker effects of physiological interest.

In a nutshell, covariation will favor variables having a strong effect (i.e. large variance), important in the whole cohort, and thus should be used for 'predictive' purposes (e.g. SNPs of predictive interest). On the other hand, correlation will also favor variables having a small effect (i.e. small variance), which may be important only in a sub-group of the cohort, and thus should be used for 'understanding' purposes (e.g. genes of physiological interest).

It is of note that, since the LBF function rescales variables (log likelihood ratio), a covariation matrix of LBFs can be used even if original variables are not all of the same type: e.g. SNPs, clinical variables, -omics,....

Mathematical definitions of covariation and correlation

Population covariation and correlation coefficient and matrix.

The correlation corr(X,Y) and covariation cov(X,Y) coefficient between 2 random variables X and Y with expected values and , standard deviations and , where E is the expected value, is defined as:

In 'taxonomy 3', we are producing a correlation or covariation matrix, defined as a matrix of correlation of covariances coefficient between LBF vectors (i.e. one LBF vector for each SNP analyzed).

The covariance matrix emphasizes variables with most variance, whereas the correlation matrix rescales variables to equi-variance.

 


Sample covariation and correlation coefficient matrix.

In our method, we are dealing with observed LBFs vectors. The LBF vector for the jth variable is defined as, where n is the number of subjects:  

The sample correlation coefficient between the 2 variables j1 and j2 is defined as, where n is the number of subjects:

The correlation or covariation matrix is simply the p by p squared, symmetric matrix made of these elements, where p is the number of variables analyzed.


An example. Differences between covariation and correlation: prediction versus understanding.

The example below clarifies the differences between covariation and correlation. It is a case-control collection with 12 subjects and 2 SNPs. For both SNPs, the 'AA' genotype is associated with some cases and the 'TT' genotype with some controls. The 'AT' genotype is not specifically associated with either.

  • SNP2 is of importance for the case/control distinction in most of the subjects (2/3rd), and could be used for prediction.

  • SNP1 is of importance for the case/control distinction only in 1/3rd of the subjects. This SNP is potentially important for the understanding of the outcome (especially in cases #1 and #2), but cannot be used alone for prediction.

In the covariation matrix, the highest coefficients are for combinations of SNP2 and casecont. SNP2 is clearly favored, and would show up as a SNP of interest in the eigen decomposition.

In the correlation matrix, SNP1 and SNP2 are almost at the same level of importance. In particular, the SNP1.x.SNP2 correlation coefficient is greater than for covarition. Both SNP1 and SNP2 would show up as SNPs of interest in the eigen decomposition.

The Excel file used to create this example can be accessed here (the user can enter new genotypes: LBFs and COV/CORR are automatically calculated).

The main features of 'taxonomy 3' can be tested on-line with small datasets, including testing correlation versus covariation: see the on-line analysis page.

For more details, please refer to these external links:


Mathematical details and algorithms    top of page

PCA reduces the dimensionality of a data set consisting of a large number of interrelated variables, while retaining as much as possible of the variation present in the data set. This is achieved by transforming to a new set of variables, the principal components (PCs), which are uncorrelated, and which are ordered so that the first few retain most of the variation present in all of the original variables.

PCA is the orthogonal linear transformation that transforms the data to a new coordinate system such that the greatest variance by any projection of the data comes to lie on the first coordinate (called the first principal component), the second greatest variance on the second coordinate, and so on. In other words, the basic idea in PCA is to find the components s1,s2,...,sn so that they explain the maximum amount of variance possible by n linearly transformed components.

PCA has the distinction of being the optimal linear transformation for keeping the subspace that has largest variance.

Having defined the PCA, it can be demonstrated that the kth principal component of a set of variables is an eigenvector of their correlation or covariation matrix corresponding to its kth largest eigenvalue. PCA is obtained from the eigen-decomposition of a covariance or correlation matrix and gives the least square estimate of the original datamatrix.

Also PCA represents a specific Singular Value Decomposition in the case where principal components are calculated from the covariance or correlation matrix.

PCA's objective: Linear transformation based on maximum variance.

 

Let a vector of p random variables. The variance of these p variables and their covariation or correlation structure is of interest. If p is large, one may want to look for a few (<<p) derived variables that preserve most of the information given by these variances and correlations or covariances.

The first step is to look for a linear function of the p elements of having maximum variance. If  is a vector of constants, this linear function can be noted as:

The next step is to look for a linear function , uncorrelated with , having maximum variance, and so on.

It is hoped that in general, most of the variation in will be accounted for by m principal components , where m<<p.

 


Eigen decomposition (also called spectral decomposition).

Matrix decomposition refers to the transformation of a given matrix (often assumed to be a square matrix) into a given canonical form.

A canonical form is a function that is written in the most standard, conventional, and logical way. A canonical form is required to have two essential properties. Every object under consideration must have exactly one canonical form, and two objects that have the same canonical form must be essentially the same.

The decomposition of a matrix A into matrices composed of its eigenvectors P and Eigenvalues  is called an Eigen decomposition. It satisfies the following:

  , where  

Eigenvectors and Eigenvalues are also referred to as characteristic vectors and latent roots or characteristic equation (in German, "eigen" means "specific of" or "characteristic of"). The set of Eigenvalues of a matrix is also called its spectrum.

The fact that this decomposition is always possible for a square matrix A as long as P is a square matrix is known as the Eigen decomposition theorem.

When A is a semi-definite matrix (symmetric matrices such as covariation or correlation matrices), then its Eigenvalues are always positive or null, and its eigenvectors are composed of real values and are pair wise orthogonal when their Eigenvalues are different (Two vectors are 'orthogonal' or 'perpendicular' when their 'dot product' is zero). Because eigenvectors corresponding to different Eigenvalues are orthogonal, we have in this particular case:


Demonstration that PCA is the Eigen decomposition of the covariance matrix (for the 1st PC).

To derive the first principal component, one should maximize 

If   is the (p by p) covariance matrix of the p variables of vector , then:

Using the Lagrange multiplier technique, the following should be maximized:

  where  is a scalar and the Lagrange multiplier.

Lagrange multiplier

A function f(P) has to be to "extremized" (minimized or maximized), subject to a specific constraint on P given as g(P) = 0.

One answer is to add a new variable  to the problem, to define a new 'unconstrained' Lagrangian function to extremize:


Differentiation with respect to   gives: 

Thus  is the eigenvalue of  and is the corresponding eigenvector.

If  is chosen to be as large as possible then is the first eigenvector of , and  the largest eigenvalue is: 

The above is valid for a sample covariance or correlation matrix, such as the covariation or correlation matrix of LBF.


Recapitulation. Definition of coefficients, loadings and scores.


The kth principal component is defined as a linear combination of the p random variables :

In 'taxonomy 3' we are decomposing a sample of , which is our LBF matrix.

The amount of variance accounted for by the kth component is given by its eigenvalue,

The proportion of variance accounted for by each component is , where if the correlation matrix was used.

Each eigenvector is composed of scalars called 'coefficients' they relate original variables to components. For the kth PC, coefficients are simply defined as . They represent the relative 'weight' of each variable in each component.

Loadings are defined as: . They represent the relative 'weight' of each variable in each component, re-scaled with the amount of variance explained by the component.

In 'taxonomy 3' we are decomposing a sample of , which is our LBF matrix. In this case the subjects' contribution can be simply 'projected' on each component. For the kth component, the projected LBF for subject i and variable j is . The score for subject i and component k is defined as the mean of LBFk(i,j) over the p variables.


PCA algorithms for very large scale datasets (Kernel PCA and ARPACK).


For very large datasets, above reasonable computational resources, 2 problems arise with the usual procedures:

- the size of the correlation matrix : which is ~nVar^2     (23K variables CORR SAS dataset requires 2Gb RAM)
- the duration of the Eigen decomposition: which is ~nVar^3    (15K variables require ~15 hours on a 'normal' machine)

Therefore, above ~20K variables, other PCA algorithms than the default SAS procedures have to be used (see tax3 manual for more details):

ARPACK

This Eigen function is faster than the SAS eigen function (~nVar^2) and requires 50% less RAM than SAS. We have used it up to 30K variables (this is a good option for whole genome scans analyzed at the gene level).

For more information, please see:
http://www.caam.rice.edu/software/ARPACK/

We have used the fortran source, and compiled it for windows using CygWin.

Above ~30K variables, the Kernel PCA method is recommended (see below).

Kernel PCA

This is fast PCA algorithm, dramatically reducing the computational resources needed (problem size is rescalled from nVar^3 to nObs^3). With this algorithm, we have analyzed datasets up to 650K variables and 200 subjects ( see dataset 4).

Kernel PCA was developed by :

Wen Wu
PhD, CSci (Chartered Scientist), CChem (Chartered Chemist)
FRSC (Fellow of the Royal Society of Chemistry)
Member of the editorial board of Chemometrics and Intelligent Laboratory Systems
BIx Investigator Analyst
Email: wen_2_wu@gsk.com

Please contact Wen should you want to use kernel PCA to solve large data problem. This algorithm is described in these articles:

The kernel PCA algorithms for wide data. Part I: Theory and algorithms
Chemometrics and Intelligent Laboratory Systems, Volume 36, Issue 2, April 1997, Pages 165-172
W. Wu, D. L. Massart and S. de Jong -- Abstract

Kernel-PCA algorithms for wide data Part II: Fast cross-validation and application in classification of NIR data
Chemometrics and Intelligent Laboratory Systems, Volume 37, Issue 2, June 1997, Pages 271-280
W. Wu, D. L. Massart and S. de Jong -- Abstract

Fast regression methods in a Lanczos (or PLS-1) basis. Theory and applications
Chemometrics and Intelligent Laboratory Systems, Volume 51, Issue 2, 24 July 2000, Pages 145-161
Wen Wu and Rolf Manne -- Abstract

.

 

References and external links relevant to PCA and eigen decomposition

 

 


   top of page
Newsletter

    -> We plan to send (infrequent) emails regarding publications, talks, software updates, etc...

To subscribe, or manage your subscription, just enter your email address below:

email:

 

Send us your comments
Name   (optional)
Title   (optional)
Email  
Comment  
    

You can also email us directly at:   taxonomy@delrieu.org