Taxonomy 3 - A multivariate genetic analysis

Results and interpretation

The main results of the 'taxonomy 3' method are as follows:

Principal components

Each principal component is characterized by its eigenvalue, loadings of variables (SNPs, genes, ontologies) and scores of subjects.

Principal components should be understood as orthogonal (i.e. mutually uncorrelated) directions within the dataset.

Because the LBF measure amplifies all the observed contrasts between groups, the between-group variability of this measure is expected to be higher than any within-group variability. Hence, the first principal component is expected to be relevant to the case/control distinction (i.e. aligned with, or parallel to, the case/control direction). The other components are expected to be relevant to within-group heterogeneity (population heterogeneity) and irrelevant (i.e. orthogonal) to the case/control distinction.

This is generally the case, and it can be tested by adding a dummy case/control variable to the PCA. Our example page (dataset 2) shows a situation where it is not the case: the within-group variability was higher than the between-group variability, due to a sampling bias.
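As an illustration of this check, here is a minimal sketch in Python with NumPy. It is not the Taxonomy 3 implementation: the LBF matrix is simulated, all names (`lbf`, `status`, `score_status_corr`) are hypothetical, and instead of adding the dummy variable as an active variable in the PCA it correlates the dummy with the scores of each component, which is a closely related but different technique.

```python
import numpy as np

rng = np.random.default_rng(0)
n_per_group, n_vars = 50, 20

# Dummy case/control variable: 0 = control, 1 = case.
status = np.repeat([0.0, 1.0], n_per_group)

# Simulated LBF-like matrix (rows = subjects, columns = variables).
# Assumption: cases are shifted on the first 5 variables only.
lbf = rng.normal(size=(2 * n_per_group, n_vars))
lbf[status == 1.0, :5] += 3.0

# PCA by singular value decomposition of the column-centred matrix.
centred = lbf - lbf.mean(axis=0)
u, s, vt = np.linalg.svd(centred, full_matrices=False)
scores = u * s  # one column of subject scores per component

# Correlation between the dummy variable and the scores of each component.
score_status_corr = np.array(
    [np.corrcoef(scores[:, k], status)[0, 1] for k in range(scores.shape[1])]
)

# Index of the component most aligned with the case/control direction.
most_aligned = int(np.argmax(np.abs(score_status_corr)))
print("component most aligned with case/control:", most_aligned + 1)
```

The sign of a component is arbitrary, so only the absolute value of each correlation is meaningful. If the most aligned component is not the first one, the within-group variability dominates, as in the sampling bias situation described above.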

Eigenvalues and scree plots

There is one eigenvalue for each principal component. It measures the variability of the original dataset that is accounted for by this component.

When the first component is aligned with the case/control direction, the first eigenvalue measures the variability of the original dataset that is accounted for by the case/control distinction.

Scree plots are simple line plots that show the fraction of total variance in the data explained by each component. The components are ordered by decreasing contribution to total variance. Read left to right along the abscissa, such a plot often shows a clear separation where the 'most important' components cease and the 'least important' components begin. This point of separation is often called the 'elbow'. The plot is called a 'scree' plot because it often looks like a scree slope, where rocks have fallen down and accumulated on the side of a mountain.
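The quantities shown on a scree plot can be computed directly from the eigenvalues. The sketch below uses made-up eigenvalues and hypothetical function names (`variance_fractions`, `text_scree`); it only illustrates the arithmetic and draws a crude text version of the plot.

```python
def variance_fractions(eigenvalues):
    """Fraction of total variance per component, in decreasing order."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    return [value / total for value in ordered]


def text_scree(fractions, width=40):
    """Return a text scree plot: one bar per component."""
    lines = []
    for index, fraction in enumerate(fractions, start=1):
        bar = "#" * round(fraction * width)
        lines.append(f"PC{index:<2} {fraction:6.1%} {bar}")
    return "\n".join(lines)


# Hypothetical eigenvalues, for illustration only.
eigenvalues = [6.0, 2.0, 1.0, 0.5, 0.5]
fractions = variance_fractions(eigenvalues)
print(text_scree(fractions))
```

With these illustrative values the first component accounts for 60% of the total variance, and the drop between the first and second bars is where an 'elbow' would be read.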


Loadings

There is one loading for each variable and component. It represents the importance of the variable in the component.

Covarying markers that are consistent with a genetic model for the trait will have a high loading on the first component (loading 1). The method is equivalent to ranking the case-predictive markers that are consistent across people, allowing for the pattern of marker co-occurrences (i.e. allelic association or linkage disequilibrium, LD) that is common across people. It encompasses, without pre-specification, any additive genetic model. Because it takes all genetic effects and their co-occurrences into account collectively, it should be well suited to dissecting the manifold complexities of common, multifactorial chronic diseases.

Loadings are plotted in 2D orthogonal graphs, such as loading1_x_loading2, loading1_x_loading3, etc. (see example page dataset 1).
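To make the idea of ranking variables by loading 1 concrete, here is a minimal sketch with NumPy on simulated data. The data, the variable names (`snp_0`, `snp_1`, ...) and the choice of a plain SVD-based PCA are assumptions made for illustration, not the actual Taxonomy 3 pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)
n_per_group, n_vars = 50, 20
variable_names = [f"snp_{i}" for i in range(n_vars)]  # hypothetical labels

# Simulated LBF-like matrix: cases (last 50 rows) are shifted
# on the first 5 variables, which therefore covary across subjects.
lbf = rng.normal(size=(2 * n_per_group, n_vars))
lbf[n_per_group:, :5] += 3.0

# PCA by SVD of the column-centred matrix; rows of vt are the components.
centred = lbf - lbf.mean(axis=0)
u, s, vt = np.linalg.svd(centred, full_matrices=False)
loading1 = vt[0]  # loading of each variable on the first component

# Rank variables by absolute loading 1 (the sign of a component is arbitrary).
ranking = np.argsort(-np.abs(loading1))
top5 = ranking[:5]
for index in top5:
    print(variable_names[index], round(float(abs(loading1[index])), 3))
```

In this simulation the five shifted variables are expected to come out on top, because they are the ones driving the contrast captured by the first component.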


Scores

There is one score for each subject and component. The original dataset can be reconstructed for each component (i.e. projected on the component). The reconstructed signal gives important information regarding heterogeneity among subjects. Scores are each subject's mean LBF (over all the variables), projected on each component.
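The reconstruction of the dataset on a single component can be sketched as follows, using NumPy and random data. The names (`reconstruct`, `first_component_part`) are hypothetical, and the example only shows the generic PCA identity: the centred data equal the sum of the per-component reconstructions.

```python
import numpy as np

rng = np.random.default_rng(1)
lbf = rng.normal(size=(8, 5))  # hypothetical: 8 subjects x 5 variables

# PCA by SVD of the column-centred matrix.
centred = lbf - lbf.mean(axis=0)
u, s, vt = np.linalg.svd(centred, full_matrices=False)
scores = u * s   # shape (subjects, components)
loadings = vt.T  # shape (variables, components)


def reconstruct(component):
    """Centred data projected on one component (a rank-1 matrix)."""
    return np.outer(scores[:, component], loadings[:, component])


# Signal carried by the first component only.
first_component_part = reconstruct(0)

# Adding up all components recovers the centred dataset.
total = sum(reconstruct(k) for k in range(len(s)))
print(np.allclose(total, centred))
```

Each row of `first_component_part` is what a subject's profile would look like if only the first component were present, which is why it is useful for looking at heterogeneity along that direction.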

Subjects' scores on the first component (scores 1) show the differentiation between cases and controls, which is driven by the variables having a high loading 1.

Scores are plotted in 2D orthogonal graphs, such as score1_x_score2, score1_x_score3, etc. (see example page dataset 1).

Biplots are 2D orthogonal graphs where variable loadings and subject scores are plotted together on the same graph (see example page dataset 3). A biplot usually shows clearly how variables having high loadings 'pull apart' groups of subjects.



LBF moments and sub-phenotypes

SNP LBFs can be aggregated using any known domain (gene, ontology), or newly discovered ontologies, such as the set of genes having a high loading 1.

Aggregation using a known domain allows 'supervised' biological exploration of the data.

Aggregation using ontologies defined by each component of the PCA allows the biological exploration of independent components of the data. Aggregation of LBFs projected on the first component allows the exploration of potential heterogeneity among subjects that is relevant to the case/control distinction.

Plots of mean(LBF) versus var(LBF), computed across markers for each subject, can help with the classification of subjects and the discovery of sub-phenotypes (see example page dataset 1).
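The two moments used in such a plot can be computed per subject as in the sketch below. The LBF values are invented, the name `lbf_by_subject` is hypothetical, and the sample variance (n - 1 denominator) is an assumption, since the source does not say which variance estimator is used.

```python
from statistics import mean, variance

# Hypothetical LBF values: one list of per-marker LBFs for each subject.
lbf_by_subject = {
    "subject_a": [1.0, 2.0, 3.0],
    "subject_b": [0.0, 0.0, 0.0],
    "subject_c": [2.0, 2.0, 8.0],
}

# First two LBF moments across markers, one (mean, variance) pair per subject.
moments = {
    subject: (mean(values), variance(values))
    for subject, values in lbf_by_subject.items()
}

for subject, (lbf_mean, lbf_var) in moments.items():
    print(f"{subject}: mean(LBF) = {lbf_mean:.2f}, var(LBF) = {lbf_var:.2f}")
```

Each pair gives the coordinates of one subject on the mean(LBF) versus var(LBF) plot.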

There are two ways to explore and discover sub-phenotypes systematically:

  • Clinical/biological variables can be added to the main analysis. Their directions will indicate patterns of co-occurrence with other variables (SNPs, genes) and subgroups of subjects (see example page dataset 3).

  • Associations between clinical/biological variables and LBF moments (mean, variance, etc.) can be looked for using several techniques: clustering, recursive partitioning, heatmaps, etc. (see example page dataset 1).
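As a small illustration of the second approach, the sketch below performs the first step of a recursive partitioning: it looks for the threshold on the per-subject mean LBF that best separates the values of a clinical variable. The data, the function name `best_split` and the least-squares criterion are assumptions made for this example; a full analysis would use a dedicated recursive partitioning tool.

```python
def best_split(x, y):
    """Threshold on x minimising the within-group sum of squares of y."""

    def sum_of_squares(values):
        centre = sum(values) / len(values)
        return sum((value - centre) ** 2 for value in values)

    pairs = sorted(zip(x, y))
    best_threshold, best_cost = None, float("inf")
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold can separate identical x values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [value for _, value in pairs[:i]]
        right = [value for _, value in pairs[i:]]
        cost = sum_of_squares(left) + sum_of_squares(right)
        if cost < best_cost:
            best_threshold, best_cost = threshold, cost
    return best_threshold, best_cost


# Hypothetical data: per-subject mean LBF and a clinical measurement.
mean_lbf = [0.1, 0.2, 0.3, 1.1, 1.2, 1.3]
clinical = [5.0, 5.5, 5.2, 9.0, 9.4, 9.1]

threshold, cost = best_split(mean_lbf, clinical)
print(f"split subjects at mean(LBF) = {threshold:.2f}")
```

Applying the same search again within each resulting subgroup gives the recursive procedure; the subgroups obtained are candidate sub-phenotypes to be examined further.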

 


Comments can be sent by email to: taxonomy@delrieu.org