The main results of the 'taxonomy 3' method
are as follows:
Principal components
Each principal component is characterized
by its eigenvalue, loadings of variables (SNPs, genes, ontologies)
and scores of subjects.
Principal components should be understood
as orthogonal (i.e. mutually uncorrelated) directions within the
dataset.
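As a rough illustration of how eigenvalues, loadings and scores relate to one another, here is a minimal sketch of a PCA computed by singular value decomposition. It is not the 'taxonomy 3' implementation: the toy matrix `lbf`, the column centring, and the convention of taking loadings as unit-length eigenvectors are all assumptions made for the example.

```python
import numpy as np

# Hypothetical toy data: 6 subjects (rows) x 4 variables (columns) of LBF values.
rng = np.random.default_rng(0)
lbf = rng.normal(size=(6, 4))

# Assumption: each variable is centred before the decomposition.
centred = lbf - lbf.mean(axis=0)
u, s, vt = np.linalg.svd(centred, full_matrices=False)

eigenvalues = s**2 / (lbf.shape[0] - 1)   # variance carried by each component
loadings = vt.T                            # variables x components (unit-length eigenvectors)
scores = centred @ loadings                # subjects x components (same as u * s)

# The component directions are orthonormal.
assert np.allclose(loadings.T @ loadings, np.eye(4))
```

With this convention the covariance matrix of the scores is diagonal, with the eigenvalues on the diagonal, which is what "orthogonal directions" means in practice.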
Because the LBF measure amplifies all the
observed contrasts between groups, the between-group variability
of this measure is expected to be higher than any within-group
variability. Hence, the first principal component is expected to be
relevant to the case/control distinction (i.e. aligned, or parallel, with
the case/control direction). Other components are expected to be
relevant to within-group heterogeneity (population heterogeneity)
and irrelevant (i.e. orthogonal) to the case/control distinction.
This is generally the case, and it can be tested
by adding a dummy case/control variable to the PCA. Our example
page (dataset 2) shows
a counter-example: the within-group variability
was higher than the between-group variability, due to a sampling bias.
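The dummy-variable check can be sketched as follows. This is one possible reading of the test, not necessarily the one used by the method: the dummy is appended as an extra column, the PCA is rerun, and the correlation between the dummy and the scores of each component is read off. The simulated data, the 0/1 coding and the names `status` and `dummy_corr` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical coding: 1 = case, 0 = control.
status = np.array([1, 1, 1, 1, 0, 0, 0, 0], dtype=float)

# Toy LBF matrix: 3 variables whose group means differ by 3,
# plus small within-group noise.
lbf = 3.0 * status[:, None] + rng.normal(scale=0.3, size=(8, 3))

# Append the dummy as an extra column and run the PCA on the augmented matrix.
augmented = np.column_stack([lbf, status])
centred = augmented - augmented.mean(axis=0)
u, s, vt = np.linalg.svd(centred, full_matrices=False)
scores = u * s

# Correlation between the dummy and the scores of each component.
dummy_corr = np.array(
    [np.corrcoef(status, scores[:, k])[0, 1] for k in range(scores.shape[1])]
)
aligned_component = int(np.argmax(np.abs(dummy_corr)))  # 0 means the first component
```

The correlation is used here rather than the raw eigenvector entry because a unit-length eigenvector can give the dummy a large entry on a trailing component that carries almost no variance.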
Eigenvalues and scree plots
There is one eigenvalue for each principal
component. It measures the share of the variability of the original
dataset that is accounted for by this component.
When the first component is aligned with the case/control direction,
the first eigenvalue measures the variability
of the original dataset that is accounted for by the case/control
distinction.
Scree plots are simple line plots
that show the fraction of total variance in the data explained
by each component. The components are ordered by decreasing
contribution to total variance. Read left to right
along the abscissa, such a plot often shows a clear break
where the 'most important' components end and
the 'least important' components begin. This break is
often called the 'elbow'. The plot is called a 'scree' plot because
it often looks like a scree slope, where rocks have fallen down
and accumulated on the side of a mountain.
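The quantities shown on a scree plot can be computed directly from the eigenvalues. The sketch below uses made-up eigenvalues and prints a text-only version of the plot; a real analysis would pass `fraction` to a plotting library instead.

```python
import numpy as np

# Hypothetical eigenvalues, already sorted in decreasing order.
eigenvalues = np.array([4.0, 2.0, 1.0, 0.5, 0.5])

fraction = eigenvalues / eigenvalues.sum()   # share of total variance per component
cumulative = np.cumsum(fraction)             # running total, reaches 1 at the last component

# Text-only scree plot: one bar per component.
for k, f in enumerate(fraction, start=1):
    print(f"PC{k}  {f:6.3f}  {'#' * int(round(f * 40))}")
```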

Loadings
There is one loading for each variable and
component. It represents the importance of the variable in the component.
Covarying markers that follow a consistent genetic model
for the trait will have a high loading 1. The method is equivalent to
ranking case-predictive markers that are consistent across people,
allowing for the pattern of marker co-occurrences (i.e. allowing for
allelic association or linkage disequilibrium, LD) common across people.
It encompasses, without pre-specification, any additive genetic model
and, by virtue of taking all genetic effects and their co-occurrences
into account collectively, should be well suited to dissecting the
manifold complexities of common and multifactorial chronic diseases.
Loadings are plotted in 2D orthogonal graphs,
such as loading1_x_loading2, loading1_x_loading3, ... (see example
page dataset 1)
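Ranking variables by their first-component loading can be sketched as below. The marker names and loading values are invented for the example, and ranking by absolute value is an assumption, made because the overall sign of a principal component is arbitrary.

```python
import numpy as np

names = np.array(["snp_a", "snp_b", "snp_c", "snp_d"])   # hypothetical labels
loading1 = np.array([0.10, -0.70, 0.65, 0.05])            # hypothetical loading 1 values

# Sort from largest to smallest absolute loading.
order = np.argsort(-np.abs(loading1))
ranked_names = names[order].tolist()
ranked_loadings = loading1[order]

for name, value in zip(ranked_names, ranked_loadings):
    print(f"{name}  {value:+.2f}")
```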

Scores
There is one score for each subject and component.
The original dataset can be reconstructed for each component (i.e.
projected on the component). The reconstructed signal gives important
information regarding subject heterogeneity. A subject's score is that
subject's mean LBF (over all the variables), projected on the component.
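The per-component reconstruction described above can be sketched as a rank-one product of a score vector and a loading vector. The random toy data, the centring step and the variable names are assumptions for the example; the per-subject mean at the end is one plausible way to summarise the projected signal.

```python
import numpy as np

rng = np.random.default_rng(2)
lbf = rng.normal(size=(5, 3))              # hypothetical subjects x variables LBF matrix
centred = lbf - lbf.mean(axis=0)           # assumption: variables are centred
u, s, vt = np.linalg.svd(centred, full_matrices=False)
scores = u * s

# Rank-one reconstruction of the centred data for each component k.
reconstructed = [np.outer(scores[:, k], vt[k]) for k in range(len(s))]

# Per-subject mean (over variables) of the signal projected on component 1.
mean_projected_1 = reconstructed[0].mean(axis=1)

# Summing the reconstructions over all components recovers the centred data.
assert np.allclose(sum(reconstructed), centred)
```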
The first-component scores (scores 1)
show the differentiation between cases and controls, which is driven
by variables having a high loading 1.
Scores are plotted in 2D orthogonal graphs,
such as score1_x_score2, score1_x_score3, ... (see example page dataset
1)
Biplots are 2D orthogonal graphs where
variable loadings and subject scores are plotted together
on the same graph (see example page dataset
3). They usually show clearly how variables having high loadings
'pull apart' groups of subjects.
LBF moments and sub-phenotypes
SNP LBFs can be aggregated using any known
domain (gene, ontology), or newly discovered ontologies, such as genes
having a high loading 1.
Aggregation using a known domain allows 'supervised'
biological exploration of the data.
Aggregation using ontologies defined by each
component of the PCA allows the biological exploration of independent
components of the data. Aggregation of LBFs projected on the first
component allows the exploration of potential subject heterogeneity
relevant to the case/control distinction.
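Aggregation of SNP LBFs over a known domain can be sketched as follows. The text does not say which aggregation function is used, so taking the mean over the SNPs of each gene is an assumption; the SNP-to-gene mapping and the LBF values are invented.

```python
import numpy as np

# Hypothetical mapping of 5 SNPs to 2 genes.
snp_gene = np.array(["GENE_A", "GENE_A", "GENE_B", "GENE_B", "GENE_B"])

# Hypothetical LBF matrix: 2 subjects (rows) x 5 SNPs (columns).
lbf = np.array([
    [1.0, 3.0, 0.0, 0.0, 3.0],
    [2.0, 2.0, 1.0, 2.0, 3.0],
])

# Assumption: a gene-level LBF is the mean of the LBFs of its SNPs.
genes = np.unique(snp_gene)
gene_lbf = np.column_stack(
    [lbf[:, snp_gene == g].mean(axis=1) for g in genes]
)  # subjects x genes
```

The same pattern applies to an ontology or to a set of genes selected by their loading 1: only the grouping array changes.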
Plots of mean(LBF) versus var(LBF) across
markers for each subject can help classification of subjects and the
discovery of sub-phenotypes (see example page dataset
1).
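The two moments used in such a plot are simple row statistics of the LBF matrix. A minimal sketch, with an invented matrix and the sample variance (ddof=1) chosen as an assumption:

```python
import numpy as np

# Hypothetical LBF matrix: 4 subjects (rows) x 3 markers (columns).
lbf = np.array([
    [1.0, 1.0, 1.0],
    [0.0, 2.0, 4.0],
    [-1.0, 0.0, 1.0],
    [2.0, 2.0, 5.0],
])

mean_lbf = lbf.mean(axis=1)          # one mean per subject, across markers
var_lbf = lbf.var(axis=1, ddof=1)    # one sample variance per subject, across markers

# Each subject becomes one point (mean, variance) in the plot.
points = np.column_stack([mean_lbf, var_lbf])
```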
There are two ways to explore and discover sub-phenotypes
systematically:
- Clinical/biological variables can be added
to the main analysis. Their directions will indicate patterns of
co-occurrence with other variables (SNPs, genes) and subgroups of
subjects (see example page dataset
3).
- Associations between clinical/biological
variables and LBF moments (mean, variance, ...) can be looked for
using several techniques: clustering, recursive partitioning, heatmaps,
etc. (see example page dataset
1).
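As a simple stand-in for the techniques listed above, the sketch below computes the correlation matrix between two LBF moments and one clinical variable, which is the kind of matrix a heatmap would display. Plain Pearson correlation is used here for brevity, not clustering or recursive partitioning, and the clinical variable `age_at_onset` and all values are hypothetical.

```python
import numpy as np

# Hypothetical per-subject LBF moments (4 subjects).
mean_lbf = np.array([1.0, 2.0, 0.0, 3.0])
var_lbf = np.array([0.0, 4.0, 1.0, 3.0])

# Hypothetical clinical variable for the same 4 subjects.
age_at_onset = np.array([32.0, 41.0, 25.0, 47.0])

# Rows are variables, columns are subjects; corr[i, j] is the Pearson
# correlation between variable i and variable j.
labels = ["mean_lbf", "var_lbf", "age_at_onset"]
corr = np.corrcoef(np.vstack([mean_lbf, var_lbf, age_at_onset]))

for label, row in zip(labels, corr):
    print(f"{label:>13}  " + "  ".join(f"{v:+.2f}" for v in row))
```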

