‘Taxonomy 3’ Packages – Revision Logs
v3.0:
EIGEN_ANALYSIS MACRO
The macro can now cope with larger datasets and was tested with up to ~500K collections. (The resources needed for each option are provided in the installation file.)
· The user is now provided with 2 options to perform the Eigen decomposition:
o MainPCAMethod='CLASSICAL'. The correlation matrix of collections LBF is calculated.
o MainPCAMethod='KERNEL'. The correlation matrix of standardised subjects LBF is calculated. This provides an approximation, suitable for very large datasets.
Therefore, in cases where nCollections >> nSubjects, the limiting factor becomes the hard-disk space rather than the available RAM or the analysis duration. For example, on a 32-bit Windows XP system, 500K collections on 50 subjects were analyzed in 30 minutes with ~100 MB of RAM and ~50 GB of hard-disk space.
· When MainPCAMethod='CLASSICAL' is used, the user is provided with 2 options to perform the Eigen decomposition of the correlation matrix:
o EigenFunction='SAS'. In this case, SAS PROC PRINCOMP is used to decompose the correlation matrix. This function is limited to ~20K collections or by the RAM available.
o EigenFunction='ARPACK'. In this case, an external function is used to decompose the correlation matrix. This function can cope with larger datasets or limited RAM, and was tested with up to 50K collections.
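To illustrate why working on the subject side helps when nCollections >> nSubjects, here is a minimal Python sketch. It is not the macro's code: it uses plain column-centered cross-products rather than the correlation matrices described above, and the toy data and function names are hypothetical. Under those assumptions, the small subjects-by-subjects matrix and the large collections-by-collections matrix share the same leading eigenvalue.

```python
def center_columns(x):
    """Subtract each column (collection) mean; rows are subjects."""
    n = len(x)
    means = [sum(col) / n for col in zip(*x)]
    return [[v - m for v, m in zip(row, means)] for row in x]


def cross_products(vectors):
    """Matrix of dot products between every pair of the given vectors."""
    return [[sum(a * b for a, b in zip(u, v)) for v in vectors] for u in vectors]


def leading_eigenvalue(m, iterations=200):
    """Power iteration for a symmetric positive semi-definite matrix."""
    v = [1.0 + 0.1 * i for i in range(len(m))]
    value = 0.0
    for _ in range(iterations):
        w = [sum(a * b for a, b in zip(row, v)) for row in m]
        value = sum(c * c for c in w) ** 0.5
        if value == 0.0:
            return 0.0
        v = [c / value for c in w]
    return value


# Hypothetical toy data: 3 subjects (rows) x 5 collections (columns).
lbf = [
    [1.0, 2.0, 0.0, 1.0, 3.0],
    [2.0, 1.0, 1.0, 0.0, 1.0],
    [0.0, 3.0, 2.0, 2.0, 5.0],
]
centered = center_columns(lbf)

# Collection-side view: 5 x 5 matrix, one row/column per collection.
by_collection = cross_products([list(col) for col in zip(*centered)])
# Subject-side view: 3 x 3 matrix, one row/column per subject.
by_subject = cross_products(centered)

classical = leading_eigenvalue(by_collection)
kernel = leading_eigenvalue(by_subject)
```

With 500K collections and 50 subjects, the subject-side matrix would have 50 x 50 entries instead of 500K x 500K, which is consistent with the low RAM figure quoted above.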
Other important changes
· A new dataset can be provided to the EigenAnalysis macro: VarModifDataSet. This dataset provides a modifier for each collection, which is applied to the correlation matrix before the Eigen decomposition. This makes it possible, for example, to adjust for the number of SNPs per gene: in this case, the dataset is provided by the Aggregation macro. Typically, VarModif(i) = number of categorical variables used for the aggregation of collection i, and:
o if i<>j, CORmodif(i,j)=COR(i,j)/sqrt(VarModif(i)*VarModif(j))
o if i=j, CORmodif(i,j)=COR(i,j)
· AssessImpactMissing='YES' is no longer used. It is replaced with 2 options:
o AssessImpactMissing='IMPACT': the impact of missing values on the case/control distinction is assessed (LBF are replaced with: log(2) if CASE and -log(2) if CONTROL)
o AssessImpactMissing='PATTERN': the pattern of missing values within the dataset is analyzed (LBF are replaced with: 0 if CASE or CONTROL, 1 if missing value)
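The VarModifDataSet adjustment described above (CORmodif) can be sketched in Python as follows. This is only an illustration of the two formulas, not the macro's SAS code; the function name and the toy values are hypothetical.

```python
import math


def modify_correlation(cor, var_modif):
    """Apply the per-collection modifiers to the off-diagonal correlations.

    cor       : square correlation matrix as a list of lists
    var_modif : one positive modifier per collection, e.g. the number of
                categorical variables aggregated into that collection
    """
    n = len(cor)
    if len(var_modif) != n:
        raise ValueError("one modifier is needed per collection")
    out = [row[:] for row in cor]  # the diagonal is left unchanged
    for i in range(n):
        for j in range(n):
            if i != j:
                out[i][j] = cor[i][j] / math.sqrt(var_modif[i] * var_modif[j])
    return out


# Toy example: collection 0 aggregates 4 variables, collection 1 only 1.
adjusted = modify_correlation([[1.0, 0.8], [0.8, 1.0]], [4, 1])
```

In this toy case the off-diagonal correlation 0.8 is divided by sqrt(4*1) = 2, giving 0.4, while the diagonal stays at 1.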
Miscellaneous
· The BigCorr option is no longer used: it is replaced by MaxCorrSize.
· The MaxCorrMemUsage option limits the amount of RAM used by 'PROC CORR'. If more RAM is needed, the correlation/covariance matrix is obtained in several steps, each of MaxCorrMemUsage size. This option allows a larger number of collections to be analyzed with the RAM available. The default is set to 300 MB.
· The IMLready option is no longer used. SAS/IML is no longer required to run the EigenAnalysis macro (it is still required for the projection macro).
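The stepwise computation controlled by MaxCorrMemUsage can be pictured with the following Python sketch. It is an assumption-laden illustration, not the macro's implementation: the hypothetical block_size parameter counts columns rather than megabytes, the standardised data are kept in memory, and only the correlation output is produced block by block (so that each block could, for instance, be written to disk before the next one is computed).

```python
import math


def standardise_columns(data):
    """Return one z-scored list per column; assumes no constant column."""
    n = len(data)
    columns = []
    for col in zip(*data):
        mean = sum(col) / n
        sd = math.sqrt(sum((v - mean) ** 2 for v in col) / (n - 1))
        columns.append([(v - mean) / sd for v in col])
    return columns


def correlation_blocks(data, block_size):
    """Yield (row_start, col_start, block) for the upper triangle of blocks."""
    z = standardise_columns(data)
    n, p = len(data), len(z)
    for i0 in range(0, p, block_size):
        for j0 in range(i0, p, block_size):
            block = [
                [sum(a * b for a, b in zip(z[i], z[j])) / (n - 1)
                 for j in range(j0, min(j0 + block_size, p))]
                for i in range(i0, min(i0 + block_size, p))
            ]
            yield i0, j0, block


# Toy data: 4 subjects (rows) x 3 collections (columns).
toy = [[1, 2, 3], [2, 4, 1], [3, 6, 2], [4, 8, 5]]
blocks = list(correlation_blocks(toy, block_size=2))
```

Because the matrix is symmetric, only the blocks on or above the diagonal are produced; a smaller block size lowers the peak size of each step at the cost of more steps.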
AGGREGATION MACRO
· This macro produces a new dataset ‘DatasetPreFix_MapDensity’, reporting the number of categorical variables for each collection. This dataset can be used to adjust the correlation matrix in the Eigen analysis macro.
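As a rough picture of what such a density table contains, the Python sketch below counts categorical variables per collection. The input layout (a mapping from variable name to collection name) and all names are hypothetical; the actual structure of the SAS dataset is not described in this log.

```python
from collections import Counter


def map_density(variable_to_collection):
    """Count how many categorical variables map to each collection."""
    return dict(Counter(variable_to_collection.values()))


# Hypothetical example: three SNPs aggregated into two genes.
density = map_density({"rs1": "GENE_A", "rs2": "GENE_A", "rs3": "GENE_B"})
```

Counts of this kind are what would serve as per-collection modifiers when adjusting the correlation matrix.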
LBF MACROS
· A new LBF definition was added: LBFtype='pooled'. It is intended mainly for testing and is not, strictly speaking, an 'LBF'. In this case:
lbf(group_i)=freq(group_i)/freq(all groups)
v2.0:
EIGEN_ANALYSIS MACRO
This new procedure allows larger datasets to be analyzed. The correlation (or covariance) matrix is calculated one variable at a time.
This new procedure is used if the BigCorr='YES' option is defined; otherwise, all variables are analyzed in one pass.
LBF MACROS
· A new subject class was introduced: 'TEST'.
Subjects with this label are NOT used to calculate LBFs. However, an LBF is given for these subjects, using the data provided for cases and controls.
The intent is to use this information to 'predict' a subject's class (case or control) after the Eigen analysis has been performed on cases and controls: one would multiply the TEST_LBFs matrix by the Loading1 vector to obtain the subjects' scores.
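The multiplication of the TEST_LBFs matrix by the Loading1 vector can be sketched in Python as below. This is a minimal illustration with hypothetical toy values; it assumes one row per TEST subject and one column per collection, in the same order as the loadings, and says nothing about how a score is then turned into a case/control call.

```python
def project_scores(test_lbfs, loading1):
    """Return one score per TEST subject: the row of LBFs dotted with Loading1."""
    scores = []
    for row in test_lbfs:
        if len(row) != len(loading1):
            raise ValueError("each subject needs one LBF per loading")
        scores.append(sum(lbf * w for lbf, w in zip(row, loading1)))
    return scores


# Toy example: 2 TEST subjects, 2 collections.
scores = project_scores([[1.0, 2.0], [0.0, -1.0]], [0.5, 0.25])
```

Here the first subject scores 1*0.5 + 2*0.25 = 1.0 and the second 0*0.5 + (-1)*0.25 = -0.25.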