When
mining a dataset comprised of numerous variables (e.g. SNPs), it is
likely that subsets of variables are highly correlated with each other.
Given a high correlation between two or more variables it can be concluded
that these variables are quite redundant and/or share the same driving
principle (in our case share biological properties) in defining the
outcome of interest.
Principal
components analysis (PCA) is a statistical technique applied to a single
set of variables to discover which sets of variables form coherent subsets
that are relatively independent of one another. Variables that
are correlated with one another which are also largely independent of
other subsets of variables are combined into components. Components
which are generated are thought to be representative of the underlying
processes that have created the correlations among variables. The first
principal component is the component retaining the most information
of the original dataset.
PCA
is widely used in signal processing, statistics, and neural computing.
It is used for two objectives:
1. Reducing the number of variables comprising a dataset while retaining
the variability in the data (i.e. reduce the dimensionality of the data
set).
2. Identifying hidden patterns in the data (i.e. structure in the relationships
between variables), and classifying them (i.e. identify new meaningful
underlying groups of variables) according to how much of the information,
stored in the data, they account for.
In
'taxonomy 3', the information retained by the first principal component
is relevant to the case/control distinction (the between group variability).
The other components describe within group variability, i.e. population
heterogeneity and sub-phenotypes. This orthogonality (components
are independent from each other) is the key reason why we have chosen
the PCA to analyse the LBF matrix.
The
PCA can be performed in any domain of interest: SNP, gene, ontology.
Within
each component, variables have specific loadings, and subjects
specific scores (see our examples
page):
- Loadings
represent the relative importance of a given variable in a given component.
For example, a SNP (or a gene, or an ontology, depending on the domain
being analyzed) with a high loading 1 (the loading of the 1st component)
will be of importance for the case/control distinction.
- Scores
represent the subjects distribution and structure for a given component.
| Covariation
versus correlation - 'prediction' versus 'understanding' |
|
top
of page |
PCA
is obtained from the eigen-decomposition of a covariation or correlation
matrix. In 'taxonomy 3', we are using the covariation or correlation
of the LBF matrix. The Eigen decomposition is a widely used mathematical
function, but is likely too technical for the lay user. However, it
is important to understand the differences between correlation and covariations.
Many
diseases have low penetrance and disease predisposing genotypes at loci
causing complex diseases are generally expected to have modest size
effects. Using the correlation matrix across subjects in the eigen analysis
(rather than the covariance matrix) prevents large genetic effects (i.e.
SNPs with large frequency differences) dominating the decomposition
and preserves the importance of effects at low minor allele frequencies.
Together, with the use of a relative frequency measure (i.e. LBF), this
keeps a wider size range for potential marker effects of physiological
interest.
In
a nutshell, covariation will favor variables having a strong effect
(i.e. large variance), important in the whole cohort, and thus should
be used for 'predictive' purposes (e.g. SNPs of predictive interest).
On the other hand, correlation will also favor variables having a small
effect (i.e. small variance), which may be important only in a sub-group
of the cohort, and thus should be used for 'understanding' purposes
(e.g. genes of physiological interest).
It
is of note that, since the LBF function rescales variables (log likelihood
ratio), a covariation matrix of LBFs can be used even if original variables
are not all of the same type: e.g. SNPs, clinical variables, -omics,....
Mathematical definitions of covariation and correlation
Population
covariation and correlation coefficient and matrix.
|
|
|
The correlation corr(X,Y) and covariation cov(X,Y)
coefficient between 2 random variables X and Y with expected
values and
,
standard deviations and
,
where E is the expected value, is defined as:

In 'taxonomy 3', we are producing a correlation or covariation
matrix, defined as a matrix of correlation of covariances
coefficient between LBF vectors (i.e. one LBF vector for each
SNP analyzed).
The covariance matrix emphasizes variables with most variance,
whereas the correlation matrix rescales variables to equi-variance.
|
Sample
covariation and correlation coefficient matrix.
|
|
|
| In our method, we are dealing with observed
LBFs vectors. The LBF vector for the jth variable
is defined as, where n is the number of subjects:
|
|
 |
The sample correlation coefficient between
the 2 variables j1 and j2 is defined as, where
n is the number of subjects:

The correlation or covariation matrix is
simply the p by p squared, symmetric matrix
made of these elements, where p is the number of variables
analyzed.
|
An example. Differences between covariation and correlation: prediction
versus understanding.
The example below clarifies the differences
between covariation and correlation. It is a case-control collection
with 12 subjects and 2 SNPs. For both SNPs, the 'AA' genotype is associated
with some cases and the 'TT' genotype with some controls. The 'AT'
genotype is not specifically associated with either.
- SNP2 is of importance for the case/control distinction
in most of the subjects (2/3rd), and could be used for prediction.
- SNP1 is of importance for the case/control distinction
only in 1/3rd of the subjects. This SNP is potentially important
for the understanding of the outcome (especially in cases
#1 and #2), but cannot be used alone for prediction.
In the covariation matrix, the highest coefficients
are for combinations of SNP2 and casecont. SNP2 is clearly favored,
and would show up as a SNP of interest in the eigen decomposition.
In the correlation matrix, SNP1 and SNP2 are almost
at the same level of importance. In particular, the SNP1.x.SNP2 correlation
coefficient is greater than for covarition. Both SNP1 and SNP2 would
show up as SNPs of interest in the eigen decomposition.

The Excel file used to create this example can be accessed here
(the user can enter new genotypes: LBFs and COV/CORR are automatically
calculated).
The main features of
'taxonomy 3' can be tested on-line with small datasets, including
testing correlation versus covariation: see the on-line
analysis page.
For more details, please refer to these external
links:
PCA reduces the dimensionality
of a data set consisting of a large number of interrelated variables,
while retaining as much as possible of the variation present in the
data set. This is achieved by transforming to a new set of variables,
the principal components (PCs), which are uncorrelated, and which are
ordered so that the first few retain most of the variation present in
all of the original variables.
PCA is the orthogonal linear transformation
that transforms the data to a new coordinate system such that the greatest
variance by any projection of the data comes to lie on the first coordinate
(called the first principal component), the second greatest variance
on the second coordinate, and so on. In other words, the basic idea
in PCA is to find the components s1,s2,...,sn so that they explain the
maximum amount of variance possible by n linearly transformed components.
PCA has the distinction of being the optimal linear
transformation for keeping the subspace that has largest variance.
Having defined the PCA, it can be demonstrated that
the kth principal component of a set of variables is an eigenvector
of their correlation or covariation matrix corresponding to its kth
largest eigenvalue. PCA is obtained from the eigen-decomposition of
a covariance or correlation matrix and gives the least square estimate
of the original datamatrix.
Also PCA represents a specific Singular Value Decomposition
in the case where principal components are calculated from the covariance
or correlation matrix.
PCA's
objective: Linear transformation based on maximum variance.
|
|
|
Let
a vector of p random variables. The variance of these
p variables and their covariation or correlation structure
is of interest. If p is large, one may want to look for
a few (<<p) derived variables that preserve most
of the information given by these variances and correlations
or covariances.
The first step is to look for a linear function of the p
elements of
having maximum variance. If is
a vector of constants, this linear function can be noted as:

The next step is to look for a linear function ,
uncorrelated with ,
having maximum variance, and so on.
It is hoped that in general, most of the variation in will
be accounted for by m principal components ,
where m<<p.
|
Eigen
decomposition (also called spectral decomposition).
|
|
|
Matrix decomposition refers to the transformation of a given
matrix (often assumed to be a square matrix) into a given canonical
form.
A canonical form is a function that is written in the most
standard, conventional, and logical way. A canonical form is
required to have two essential properties. Every object under
consideration must have exactly one canonical form, and two
objects that have the same canonical form must be essentially
the same.
The decomposition of a matrix A into matrices composed
of its eigenvectors P and Eigenvalues is
called an Eigen decomposition. It satisfies the following:
,
where 
Eigenvectors and Eigenvalues are also referred to as characteristic
vectors and latent roots or characteristic equation (in German,
"eigen" means "specific of" or "characteristic
of"). The set of Eigenvalues of a matrix is also called
its spectrum.
The fact that this decomposition is always possible for a square
matrix A as long as P is a square matrix is known
as the Eigen decomposition theorem.
When A is a semi-definite matrix (symmetric matrices
such as covariation or correlation matrices), then its Eigenvalues
are always positive or null, and its eigenvectors are composed
of real values and are pair wise orthogonal when their Eigenvalues
are different (Two vectors are 'orthogonal' or 'perpendicular'
when their 'dot product' is zero). Because eigenvectors corresponding
to different Eigenvalues are orthogonal, we have in this particular
case:

|
Demonstration
that PCA is the Eigen decomposition of the covariance matrix (for the
1st PC).
|
|
|
To derive the first principal component, one should maximize 
If is
the (p by p) covariance matrix of the p
variables of vector ,
then:

Using the Lagrange multiplier technique, the following should
be maximized:
where
is
a scalar and the Lagrange multiplier.
|
Lagrange multiplier
A function f(P) has to be to
"extremized" (minimized or maximized),
subject to a specific constraint on P given as
g(P) = 0.
One answer is to add a new variable
to
the problem, to define a new 'unconstrained' Lagrangian
function to extremize:

|
Differentiation with respect to gives: 
Thus is
the eigenvalue of and
is
the corresponding eigenvector.
If is
chosen to be as large as possible then
is the first eigenvector of ,
and the largest eigenvalue is:

The above is valid for a sample covariance or correlation
matrix, such as the covariation or correlation matrix of LBF.
|
Recapitulation.
Definition of coefficients, loadings and scores.
|
|
|
The kth principal component
is defined as a linear combination of the p random variables
: 
In 'taxonomy 3' we are decomposing a sample
of ,
which is our LBF matrix.
The amount of variance accounted for
by the kth component is given by its eigenvalue, 
The proportion of variance accounted
for by each component is ,
where if
the correlation matrix was used.
Each eigenvector is composed of scalars called
'coefficients' they relate original variables to components.
For the kth PC, coefficients are simply defined as
. They represent the relative 'weight' of each variable in each
component.
Loadings are defined as: .
They represent the relative 'weight' of each variable in each
component, re-scaled with the amount of variance explained by
the component.
In 'taxonomy 3' we are decomposing a
sample of ,
which is our LBF matrix. In this case the subjects' contribution
can be simply 'projected' on each component. For the kth
component, the projected LBF for subject i and variable
j is .
The score for subject i and component k
is defined as the mean of LBFk(i,j)
over the p variables.
|
PCA
algorithms for very large scale datasets (Kernel PCA and ARPACK).
|
|
|
For very large datasets, above reasonable computational resources,
2 problems arise with the usual procedures:
- the size of the correlation matrix : which is ~nVar^2 (23K
variables CORR SAS dataset requires 2Gb RAM)
- the duration of the Eigen decomposition: which is ~nVar^3
(15K variables require ~15 hours on a 'normal'
machine)
Therefore, above ~20K variables,
other PCA algorithms than the default SAS procedures have to
be used (see tax3 manual for more details):
ARPACK
This Eigen function is faster than the SAS eigen function
(~nVar^2) and requires 50% less RAM than SAS. We have used
it up to 30K variables (this is a good option for whole genome
scans analyzed at the gene level).
For more information, please see:
http://www.caam.rice.edu/software/ARPACK/
We have used the fortran source, and compiled it for windows
using CygWin.
Above ~30K variables, the Kernel PCA method is recommended
(see below).
Kernel PCA
This is fast PCA algorithm, dramatically reducing the computational
resources needed (problem size is rescalled from nVar^3 to
nObs^3). With this algorithm, we have analyzed datasets up
to 650K variables and 200 subjects ( see dataset
4).
Kernel PCA was developed by :
Wen Wu
PhD, CSci (Chartered Scientist), CChem (Chartered Chemist)
FRSC (Fellow of the Royal Society of Chemistry)
Member of the editorial board of Chemometrics
and Intelligent Laboratory Systems
BIx Investigator Analyst
Email: wen_2_wu@gsk.com
Please contact Wen should you want to use kernel PCA to solve
large data problem. This algorithm is described in these articles:
The kernel
PCA algorithms for wide data. Part I: Theory and algorithms
Chemometrics and Intelligent Laboratory Systems, Volume
36, Issue 2, April 1997, Pages 165-172
W. Wu, D. L. Massart and S. de Jong -- Abstract
Kernel-PCA algorithms for wide data Part II: Fast cross-validation
and application in classification of NIR data
Chemometrics and Intelligent Laboratory Systems, Volume
37, Issue 2, June 1997, Pages 271-280
W. Wu, D. L. Massart and S. de Jong -- Abstract
Fast regression methods in a Lanczos (or PLS-1) basis.
Theory and applications
Chemometrics and Intelligent Laboratory Systems, Volume
51, Issue 2, 24 July 2000, Pages 145-161
Wen Wu and Rolf Manne -- Abstract
.
|
References and external links relevant to PCA and
eigen decomposition
|