Taxonomy 3 - A multivariate genetic analysis
What are LBFs?  

The first step of the 'taxonomy3' analysis is to transform the input dataset (a matrix of observations for variables of potentially several data types) into a matrix of LBFs. This is done one variable at a time.

For each variable, an observation's LBF value represents that observation's relative contribution within its group to the overall distinction between the groups. For example, a 'case' with a high LBF will be representative of its group and representative of the overall difference observed between cases and controls (this person's 'case-ness').

In other words, an LBF value represents the information content that this observation contributes to the overall between-group contrast.

The mathematical foundations of LBFs are complex and driven by 'taxonomic' and Bayesian considerations, free of any genetic assumptions. They are detailed in our LASR 2005 'princeps' paper. LBFs are a subset of a family of measures called divergences, as described in our LASR 2007 paper. Divergences are the natural metric for the comparison of individuals or groups in the co-analysis of multiple variable types.

Since LBFs represent an information content, they are additive and independent of the type of the variable they originate from (e.g. binary, Poisson, normal, exponential...). Hence any complex dataset composed of several variable types can be analyzed as a single, homogeneous entity. Once the original data matrix is transformed into an LBF matrix, various multivariate techniques (linear algebra) can be used. We have chosen Principal Component Analysis.
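As a rough illustration of this last step, the sketch below extracts the first principal component of a small LBF matrix. The matrix values are hypothetical, and power iteration on the covariance matrix is simply one convenient way to obtain the leading component without external libraries; it is not necessarily how 'taxonomy3' performs its PCA.

```python
import math

# Hypothetical LBF matrix: rows are subjects, columns are variables.
lbf = [
    [1.20, 0.85, -0.10],
    [1.20, -0.40, 0.60],
    [-0.12, 0.85, 0.60],
    [-1.32, -0.40, -0.10],
    [-1.32, -0.90, -0.75],
    [-0.12, -0.90, -0.75],
]

def first_principal_component(matrix, iterations=200):
    """Return (direction, scores) of the first principal component."""
    n, p = len(matrix), len(matrix[0])
    # Centre each column on its mean.
    means = [sum(row[j] for row in matrix) / n for j in range(p)]
    centered = [[row[j] - means[j] for j in range(p)] for row in matrix]
    # Sample covariance matrix (p x p).
    cov = [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(p)]
           for a in range(p)]
    # Power iteration: repeatedly apply the covariance matrix and renormalise.
    v = [1.0] * p
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(p)) for a in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Project each centred subject onto the leading direction.
    scores = [sum(r[j] * v[j] for j in range(p)) for r in centered]
    return v, scores

direction, scores = first_principal_component(lbf)
for subject, score in enumerate(scores, start=1):
    print(subject, round(score, 3))
```

Because every column is already on the same additive LBF scale, no per-variable rescaling is applied before the decomposition in this sketch.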

LBFs for categorical variables  

LBFs are easy and straightforward to calculate. The calculation and assignment of LBFs to all individuals in a case-control collection can be carried out trivially with a spreadsheet.

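For example, let's assume a subject carries the 'AT' genotype for a given SNP. Then the subject's LBF for this SNP is the (natural) log of the ratio between the frequency estimate of 'AT' in cases and its frequency estimate in controls:

LBF(AT) = log{ [frequency estimate of AT in cases] / [frequency estimate of AT in controls] }

We use 'frequency estimates' instead of raw frequencies in order to prevent numerical issues should a frequency be equal to zero. If the frequency is n/N (n carriers of the genotype among the N subjects of a group), then the frequency estimate will be (n+1)/(N+1).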

[Figure: the LBF as a function of the frequency of the categorical value in cases and in controls.]

The main features of 'taxonomy 3' can be tested on-line with small datasets, including LBF calculations for categorical variables: see the on-line analysis page, or use the basic LBF calculator below:

Simple LBF calculator for categorical variables (e.g. SNP)  

In the on-line calculator, the SNP genotypes of these 5 case and 7 control subjects can be edited and the LBFs recalculated at the click of a button.

LBFs are calculated for each unique genotype and then attributed back to each subject.
Subject   Group   SNP genotype   SNP LBF
1         CASE    AT              1.2
2         CASE    AT              1.2
3         CASE    AT              1.2
4         CASE    AT              1.2
5         CASE    TT             -0.12
6         CONT    AT              1.2
7         CONT    TT             -0.12
8         CONT    TT             -0.12
9         CONT    AA             -1.32
10        CONT    AA             -1.32
11        CONT    AA             -1.32
12        CONT    AA             -1.32

Unique genotype   Formula                                       LBF
AT                log{ [(4+1)/(5+1)] / [(1+1)/(7+1)] }           1.2
TT                log{ [(1+1)/(5+1)] / [(2+1)/(7+1)] }          -0.12
AA                log{ [(0+1)/(5+1)] / [(4+1)/(7+1)] }          -1.32

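The calculator example can be reproduced in a few lines. This is a minimal sketch, not the site's own implementation: the function name `categorical_lbfs` is ours, and the per-subject genotypes are inferred from the LBF values listed for each subject.

```python
import math
from collections import Counter

def categorical_lbfs(case_values, control_values):
    """Return {value: LBF} using the (n + 1) / (N + 1) frequency estimates."""
    case_counts = Counter(case_values)
    control_counts = Counter(control_values)
    n_cases, n_controls = len(case_values), len(control_values)
    lbfs = {}
    for value in sorted(set(case_values) | set(control_values)):
        # Counter returns 0 for unseen values, so the +1 keeps estimates non-zero.
        case_est = (case_counts[value] + 1) / (n_cases + 1)
        control_est = (control_counts[value] + 1) / (n_controls + 1)
        lbfs[value] = math.log(case_est / control_est)
    return lbfs

# Genotypes of the 5 cases and 7 controls (inferred from the table).
cases = ["AT", "AT", "AT", "AT", "TT"]
controls = ["AT", "TT", "TT", "AA", "AA", "AA", "AA"]

lbf_by_genotype = categorical_lbfs(cases, controls)
# Attribute each genotype's LBF back to every subject carrying it.
subject_lbfs = [lbf_by_genotype[g] for g in cases + controls]

for genotype, lbf in lbf_by_genotype.items():
    print(genotype, round(lbf, 2))
# AA -1.32
# AT 1.2
# TT -0.12
```

Note that 'AA' never occurs among the cases: with raw frequencies the ratio would be zero and its log undefined, which is exactly the situation the frequency estimates guard against.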
LBF : definition and properties for categorical variables (e.g. SNP).

 

We are using an informal inference technique that filters SNP data to extract meaning. It relies upon a directed discrete multivariate "Bayes Rule" measure of evidence for the SNP distinction of an individual free of genetic assumptions.

The LBFs are the observed difference in log genotype frequencies (between trait cases and controls) or a difference in the relative frequencies of categorical variables between the two groups. This measure transforms the characterisation of people from a binary domain of SNPs to a rapidly calculable continuous measure with simple additive properties.

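For subject i, categorical variable j and categorical values k, the LBF value is:

$$\mathrm{LBF}_{i,j} = \sum_{k} S_{i,j,k} \, \log \frac{\nu_{j,k}}{\theta_{j,k}}$$

Where:

S_{i,j,k} = {0,1} : presence/absence of the kth categorical value (genotype) of the jth categorical variable (SNP) for the ith subject

ν and θ are Bayesian estimates of the genotype frequencies (non-informative beta prior). The expressions below are reconstructed from the calculator example above, taking ν as the estimate in cases and θ as the estimate in controls:

$$\nu_{j,k} = \frac{n^{\mathrm{case}}_{j,k} + 1}{N^{\mathrm{case}} + 1}, \qquad \theta_{j,k} = \frac{n^{\mathrm{cont}}_{j,k} + 1}{N^{\mathrm{cont}} + 1}$$

where n is the number of subjects in the group carrying value k of variable j, and N is the number of subjects in that group.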

This Bayes Factor or 'diagnostic' likelihood ratio (DLR) is the amount of evidence that the ith case individual is classified as not having the same genome as that of a set of controls. LBFs are a directed or asymmetric measure that indicates the 'case-ness' of an individual, i.e. their propensity to be a non-control. In other words, the LBF values for each individual (i.e. the individualised difference in information content) represent that person's relative contribution to the 'case-ness' in this overall distinction, or 'index of separation', between the groups.

This extremely simple empirical Bayesian predictive measure is effectively the summation of case-control contrasts (i.e. group-by-genotype differences, or interactions) over genotypes x loci. The contrasts come from a log-linear model estimated for each locus on its own (partial log-linear modeling) using all subjects in the 2 groups, and are instantiated by the SNP genotype markers present in that individual.

This method non-linearly deforms or amplifies the SNP data space by re-weighting all subjects with the observed between group differences in order to yield a continuous measure of the genetic model involved in that trait.

This measure has simple additive properties and can be used in any linear algebra tool: addition, averaging, moments, singular value decomposition, eigen analysis, recursive partitioning, ....
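To make the indicator form and the additivity concrete, the sketch below computes an LBF for every subject and SNP as a sum over genotypes weighted by the presence/absence indicator, then adds the per-SNP values for each subject. The genotype data and the function name `lbf_matrix` are hypothetical, and assigning nu to cases and theta to controls is an assumption carried over from the calculator example.

```python
import math

# Hypothetical genotypes: one row per subject, one column per SNP.
cases = [["AT", "CC"], ["AT", "CG"], ["TT", "CC"]]
controls = [["AA", "GG"], ["TT", "CG"], ["AA", "GG"], ["AA", "CC"]]

def lbf_matrix(cases, controls):
    """Return per-subject, per-SNP LBFs (cases first, then controls)."""
    subjects = cases + controls
    n_snps = len(subjects[0])
    matrix = [[0.0] * n_snps for _ in subjects]
    for j in range(n_snps):
        for k in sorted({row[j] for row in subjects}):
            # Frequency estimates (n + 1) / (N + 1); nu = cases, theta = controls
            # (assumed assignment).
            nu = (sum(row[j] == k for row in cases) + 1) / (len(cases) + 1)
            theta = (sum(row[j] == k for row in controls) + 1) / (len(controls) + 1)
            for i, row in enumerate(subjects):
                s_ijk = 1 if row[j] == k else 0  # indicator S_{i,j,k}
                matrix[i][j] += s_ijk * math.log(nu / theta)
    return matrix

lbfs = lbf_matrix(cases, controls)

# Additivity: a subject's overall score is the sum of its per-SNP LBFs.
totals = [sum(row) for row in lbfs]
for subject, total in enumerate(totals, start=1):
    print(subject, round(total, 2))
```

The same matrix can be averaged by column or by group, or passed to a decomposition such as PCA, without any further transformation.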


 

LBFs for continuous variables and other LBFs  

The LBF concept, defined above as observed Log Likelihood ratios (LLR) of contrasts, applies to categorical variables and represents 'case-ness' for each subject and variable. It can be generalized to continuous variables, and can aim at other types of group distinction: 'this-ness' or 'other-ness'.

This generalisation was presented at the 25th Leeds Annual Statistical Research Workshop in 2006 (LASR 2006). The PDF version of the presentation can be downloaded here.

For example, the following table shows these other definitions for observed LLR. We assume 'cases' are group 1 and 'controls' (the reference group) are group 2. The binary variable is described by its frequency in each group; the continuous variable y is normally distributed in cases and in controls, characterised by its mean µ and its variance.

Divergences for other variable types are described in our LASR 2007 paper.

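Aim          Domain         Binary variable           Normal variable
Case-ness    All subjects   [formula not recovered]   [formula not recovered]
This-ness    Cases          [formula not recovered]   [formula not recovered]
This-ness    Controls       [formula not recovered]   [formula not recovered]
Other-ness   Cases          [formula not recovered]   [formula not recovered]
Other-ness   Controls       [formula not recovered]   [formula not recovered]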
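Since the formulas of this table are not available here, the following is only one plausible reading of 'case-ness' for a normal variable, not the definition from the LASR 2006 presentation: the log of the ratio between the normal density fitted to cases and the normal density fitted to controls, evaluated at the subject's value y. The function names and the data are hypothetical.

```python
import math
from statistics import mean, variance

def normal_log_density(y, mu, var):
    """Log density of a normal distribution with mean mu and variance var."""
    return -0.5 * (math.log(2 * math.pi * var) + (y - mu) ** 2 / var)

def normal_caseness(y, case_values, control_values):
    """Assumed form: log-likelihood ratio of y, cases (group 1) vs controls (group 2)."""
    mu1, var1 = mean(case_values), variance(case_values)
    mu2, var2 = mean(control_values), variance(control_values)
    return normal_log_density(y, mu1, var1) - normal_log_density(y, mu2, var2)

# Hypothetical measurements of a continuous variable y.
case_values = [5.1, 5.9, 6.4, 5.6]
control_values = [4.0, 4.4, 3.8, 4.6, 4.2]

# A value typical of cases scores positive, one typical of controls negative.
high = normal_caseness(6.0, case_values, control_values)
low = normal_caseness(4.0, case_values, control_values)
print(high > 0, low < 0)
```

Under this reading the result is on the same log-ratio scale as the categorical LBFs, which is what allows both kinds of variable to sit in a single LBF matrix.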

 


You can email us directly at:   taxonomy@delrieu.org