‘Taxonomy 3’ Packages – Revision Log

 

v3.0:

 

EIGEN_ANALYSIS MACRO

 

The macro can now cope with larger datasets; it was tested with up to ~500K collections.
(The resources needed for each option are given in the installation file.)

 

·        The user now has two options for performing the Eigen decomposition:

o   MainPCAMethod='CLASSICAL'. The correlation matrix of the collections' LBFs is calculated.

o   MainPCAMethod='KERNEL'. The correlation matrix of the standardised subjects' LBFs is calculated. This provides an approximation suitable for very large datasets.

Therefore, when nCollections >> nSubjects, the limiting factor becomes hard-disk space rather than available RAM or analysis time. For example, on a 32-bit Windows XP system, 500K collections on 50 subjects were analyzed in 30 min with ~100 MB of RAM and ~50 GB of hard-disk space.

·        When MainPCAMethod='CLASSICAL' is used, the user has two options for the Eigen decomposition of the correlation matrix:

o   EigenFunction='SAS'. SAS PROC PRINCOMP is used to decompose the correlation matrix. This function is limited to ~20K collections, or by the available RAM.

o   EigenFunction='ARPACK'. An external function is used to decompose the correlation matrix. This function can cope with larger datasets or limited RAM, and was tested with up to 50K collections.
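Why a subject-side decomposition such as the KERNEL option helps when nCollections >> nSubjects can be illustrated with a small sketch. This is not the macro's code. It assumes plain column-centred toy data rather than standardised LBFs, uses a hand-written power iteration, and every name in it is hypothetical. It only shows the general idea that the leading eigenvalue and loading vector of the large collection-by-collection matrix can be recovered from the much smaller subject-by-subject matrix; the macro's KERNEL option is described as an approximation, so it presumably differs in detail.

```python
import math

def power_iteration(m, iters=500):
    """Leading eigenvalue and unit eigenvector of a symmetric PSD matrix."""
    n = len(m)
    v = [float(i + 1) for i in range(n)]  # non-constant start vector
    for _ in range(iters):
        w = [sum(m[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    mv = [sum(m[i][j] * v[j] for j in range(n)) for i in range(n)]
    return sum(a * b for a, b in zip(v, mv)), v

# Toy data (made up): 3 subjects (rows) x 5 collections (columns).
x = [[2.0, 1.0, 0.0, 3.0, 1.0],
     [0.0, 1.0, 4.0, 1.0, 2.0],
     [1.0, 3.0, 2.0, 0.0, 0.0]]
n_subj, n_coll = len(x), len(x[0])

# Centre each collection (column) on its mean.
means = [sum(row[j] for row in x) / n_subj for j in range(n_coll)]
xc = [[row[j] - means[j] for j in range(n_coll)] for row in x]

# Collection side: collections x collections cross-product (5 x 5).
big = [[sum(xc[k][i] * xc[k][j] for k in range(n_subj))
        for j in range(n_coll)] for i in range(n_coll)]
# Subject side: subjects x subjects cross-product (3 x 3).
small = [[sum(xc[a][j] * xc[b][j] for j in range(n_coll))
          for b in range(n_subj)] for a in range(n_subj)]

lam_big, load_big = power_iteration(big)
lam_small, u = power_iteration(small)

# Recover collection loadings from the small problem: v = Xc^T u / sqrt(lambda).
load_small = [sum(xc[k][j] * u[k] for k in range(n_subj)) / math.sqrt(lam_small)
              for j in range(n_coll)]

# |cosine| between the two loading vectors; 1 means identical up to sign.
agreement = abs(sum(a * b for a, b in zip(load_big, load_small)))
```

In this toy case the small matrix has nSubjects x nSubjects entries regardless of how many collections there are, which is consistent with RAM no longer being the limiting factor.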

 

Other important changes.

 

·         A new dataset can be provided to the EigenAnalysis macro: VarModifDataSet.
This dataset provides a modifier for each collection, which is applied to the correlation matrix before the Eigen decomposition. This makes it possible, for example, to adjust for the number of SNPs per gene; in this case, the dataset is provided by the Aggregation macro.
Typically, VarModif(i) = number of categorical variables used for the aggregation of collection i, and:

o    if i<>j,    CORmodif(i,j) = COR(i,j) / sqrt(VarModif(i)*VarModif(j))

o    if i=j,     CORmodif(i,j) = COR(i,j)
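The adjustment rule above can be sketched in a few lines of Python. This is an illustration only, not the macro's implementation: the function name modify_correlation and the list-of-lists matrix layout are assumptions made for the example.

```python
import math

def modify_correlation(cor, var_modif):
    """Apply CORmodif(i,j) = COR(i,j) / sqrt(VarModif(i) * VarModif(j)) for i != j.

    Diagonal entries are left unchanged, as in the rule for i = j.
    """
    n = len(cor)
    out = [row[:] for row in cor]  # copy so the input matrix is not altered
    for i in range(n):
        for j in range(n):
            if i != j:
                out[i][j] = cor[i][j] / math.sqrt(var_modif[i] * var_modif[j])
    return out

# Made-up example: two collections aggregated from 4 and 9 categorical variables.
cor = [[1.0, 0.6],
       [0.6, 1.0]]
cor_modif = modify_correlation(cor, [4, 9])
```

Here the off-diagonal correlation is divided by sqrt(4*9) = 6, while the diagonal stays at 1.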

 

·        AssessImpactMissing='YES' is no longer used. It is replaced with two options:

o   AssessImpactMissing='IMPACT': the impact of missing values on the case/control distinction is assessed (LBFs are replaced with log(2) if CASE and -log(2) if CONTROL)

o   AssessImpactMissing='PATTERN': the pattern of missing values within the dataset is analyzed (LBFs are replaced with 0 if CASE or CONTROL, and 1 if the value is missing)
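The two replacement rules can be written out as a small sketch. The function name and arguments are hypothetical, and one point is an assumption rather than something stated in this log: under 'IMPACT', a missing value is taken to stay missing (returned as None here), with only observed LBFs replaced by the class constant.

```python
import math

def replace_lbf(mode, subject_class, is_missing):
    """Return the value substituted for one LBF under AssessImpactMissing."""
    if subject_class not in ("CASE", "CONTROL"):
        raise ValueError("subject_class must be 'CASE' or 'CONTROL'")
    if mode == "IMPACT":
        if is_missing:
            return None  # assumption: missing values are left missing
        return math.log(2) if subject_class == "CASE" else -math.log(2)
    if mode == "PATTERN":
        # Only the missingness pattern is kept: 1 if missing, 0 otherwise.
        return 1.0 if is_missing else 0.0
    raise ValueError("mode must be 'IMPACT' or 'PATTERN'")
```

Under either rule the original LBF values no longer enter the analysis, so any structure found afterwards reflects the missing values rather than the data.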

 

Miscellaneous

 

·        The BigCorr option is no longer used; it is replaced by MaxCorrSize.

·        The MaxCorrMemUsage option limits the amount of RAM used by ‘PROC CORR’. If more RAM is needed, the correlation/covariance matrix is computed in several steps, each within the MaxCorrMemUsage limit. This allows a larger number of collections to be analyzed with the RAM available. The default is 300 MB.

·        The IMLready option is no longer used. SAS/IML is no longer required to run the EigenAnalysis macro (it is still required for the projection macro).
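The stepwise calculation controlled by MaxCorrMemUsage can be pictured with the following sketch, which fills a correlation matrix one pair of column blocks at a time. It is not the macro's ‘PROC CORR’: block_size is a hypothetical stand-in for the number of collections that fit within the memory limit, all columns are held in memory here for simplicity, and every column is assumed to be non-constant.

```python
import math

def standardise(col):
    """Centre a column and scale it to unit length, so dot products are correlations."""
    mean = sum(col) / len(col)
    centred = [x - mean for x in col]
    norm = math.sqrt(sum(c * c for c in centred))
    return [c / norm for c in centred]

def corr_in_blocks(columns, block_size):
    """Correlation matrix computed block pair by block pair."""
    z = [standardise(c) for c in columns]
    p = len(z)
    corr = [[0.0] * p for _ in range(p)]
    for bi in range(0, p, block_size):
        for bj in range(bi, p, block_size):  # upper triangle of blocks only
            for i in range(bi, min(bi + block_size, p)):
                for j in range(bj, min(bj + block_size, p)):
                    r = sum(a * b for a, b in zip(z[i], z[j]))
                    corr[i][j] = corr[j][i] = r  # fill both halves by symmetry
    return corr

# Made-up data: 5 collections observed on 4 subjects.
cols = [[1.0, 2.0, 3.0, 4.0],
        [2.0, 4.0, 6.0, 8.0],
        [4.0, 3.0, 2.0, 1.0],
        [1.0, 3.0, 2.0, 5.0],
        [0.0, 1.0, 0.0, 1.0]]
full = corr_in_blocks(cols, block_size=5)     # one pass
blocked = corr_in_blocks(cols, block_size=2)  # several smaller steps
```

Both calls give the same matrix; a smaller block only changes how much has to be processed in any one step.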

 

 

AGGREGATION MACRO

·        This macro now produces a new dataset, ‘DatasetPreFix_MapDensity’, which reports the number of categorical variables for each collection. This dataset can be used to adjust the correlation matrix in the EigenAnalysis macro.

   

 

LBF MACROS

·        A new LBF definition was added: LBFtype='pooled'. This is intended mainly for testing and is not, strictly speaking, an 'LBF'. In this case:

lbf(group_i)=freq(group_i)/freq(all groups)
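A worked example of the 'pooled' ratio follows. It assumes that freq means the relative frequency of one category level, within a group and across all groups pooled together; the function name and the input layout are hypothetical.

```python
def pooled_lbf(counts_by_group):
    """lbf(group_i) = freq(group_i) / freq(all groups).

    counts_by_group maps a group name to (n_with_level, n_total).
    """
    total_hits = sum(hits for hits, _ in counts_by_group.values())
    total_n = sum(n for _, n in counts_by_group.values())
    freq_all = total_hits / total_n  # frequency with all groups pooled
    return {group: (hits / n) / freq_all
            for group, (hits, n) in counts_by_group.items()}

# Made-up counts: the level is seen in 30 of 100 cases and 10 of 100 controls.
lbf = pooled_lbf({"CASE": (30, 100), "CONTROL": (10, 100)})
```

With these counts the pooled frequency is 40/200 = 0.2, so the ratio is 0.3/0.2 for cases and 0.1/0.2 for controls.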

 

 

 

 

v2.0:

 

EIGEN_ANALYSIS MACRO

·        New ‘PROC CORR’ introduced

This new procedure allows larger datasets to be analyzed. The correlation (or covariance) matrix is calculated one variable at a time.

This new procedure is used if the BigCorr='YES' option is set; otherwise, all variables are analyzed in one pass.

 

 

LBF MACROS

·        A new subject class was introduced: 'TEST'.

Subjects with this label are NOT used to calculate LBFs. However, an LBF is given for these subjects, using the data provided for cases and controls.

The intent is to use this information to ‘predict’ a subject’s class (case or control) after the Eigen analysis has been performed on cases and controls: one would multiply the TEST_LBFs matrix by the Loading1 vector to obtain the subjects’ scores.
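The projection described above is a matrix-vector product, sketched here in Python. The function name, the row-per-subject layout, and the numbers are illustrative assumptions; how a score is then turned into a predicted class is not specified in this log and is not shown.

```python
def project_scores(test_lbfs, loading1):
    """Multiply the TEST_LBFs matrix (one row per subject) by the Loading1 vector."""
    for row in test_lbfs:
        if len(row) != len(loading1):
            raise ValueError("each row must have one LBF per loading")
    return [sum(x * w for x, w in zip(row, loading1)) for row in test_lbfs]

# Made-up values: 2 TEST subjects, 3 collections.
test_lbfs = [[0.5, -0.2, 0.1],
             [-0.3, 0.4, 0.0]]
loading1 = [0.6, -0.8, 0.0]
scores = project_scores(test_lbfs, loading1)
```

Each score is the sum of a subject's LBFs weighted by the corresponding entries of the first loading vector.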