Taxonomy3 package v3.0 – Installation and Description
This package includes all SAS macros required to run ‘taxonomy 3’.
It also includes:
- An ARPACK FORTRAN MS-Windows executable for large-scale Eigen problems
- WinBUGS source code for the signal-to-noise decomposition
- An MS Excel add-in for drawing heatmaps (at a very early draft stage and not described here)
This document shows how to install the package and how to use its main features.
Content
Software & Hardware requirements
Local installation
Summary of SAS macros in C:\taxonomy\v3.0\SasMacros
SAS macros: main features and short step by step description
EXAMPLES
Appendix 1. Large problems: resources needed for EigenAnalysis and suggested strategy.
Software & Hardware requirements
This package is intended to run on Windows and Unix stations.
Software
· SAS: v9.1 (the SAS macros provided herein will NOT function with SAS 8.2, since variables and datasets have long names)
SAS/GRAPH, SAS/IML, SAS/ACCESS: optional
· Optional:
- WinBUGS: for signal/noise decomposition
Hardware [for more details see appendix 1]
The main limiting factor is the RAM used in the Eigen analysis, which is determined by the size of the correlation/covariance matrix on which the Eigen analysis is performed.
Two analysis options are provided:
· Exact calculation (Classical PCA option – SAS Eigen function):
RAM required ≃ 4 × nCollections^2 bytes
· Approximation (Kernel PCA option):
RAM required ≃ 4 × nSubjects^2 bytes
A third option is provided (Classical PCA – ARPACK Eigen function): this is an exact calculation method where the amount of RAM used is fixed by the user, and disk swapping is used for whatever does not fit in that amount.
For example, on a system with 1Gb RAM available, the following options are recommended (assuming nSubjects << nCollections) :
· nCollections < 10000: Classical PCA – SAS Eigen function
· nCollections < 20000: Classical PCA – ARPACK Eigen function
· nCollections > 15000: Kernel PCA approximation
See ‘LargeProblem’ example and detailed manual for more information.
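As a rough check before choosing an option, the two RAM formulas above can be evaluated in a short data step. This is only a sketch: it assumes the formulas give a result in bytes (which agrees with the figures in appendix 1, e.g. 400 Mb for 10000 collections) and takes 1 Mb as 10^6 bytes. The variable names are illustrative and are not part of the package.

```sas
data _null_;
  nCollections = 10000;   /* illustrative value */
  nSubjects    = 50;      /* illustrative value */

  /* assumption: the formulas return bytes; 1 Mb is taken as 1e6 bytes */
  ramClassicalMb = 4 * nCollections**2 / 1e6;
  ramKernelMb    = 4 * nSubjects**2    / 1e6;

  /* results are written to the SAS log */
  put 'Classical PCA (SAS Eigen function), approx. RAM in Mb: ' ramClassicalMb;
  put 'Kernel PCA, approx. RAM in Mb: ' ramKernelMb;
run;
```

With these values the classical estimate is 400 Mb, while the kernel estimate is negligible because it depends only on the number of subjects.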
Local installation:
Simply unzip taxonomy.zip into "C:\taxonomy" – you will obtain this structure:

The SAS macros are located in the folder C:\Taxonomy\v3.0\SasMacros
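One way to make the macros available in a SAS session is to %include the macro files before calling them. The sketch below assumes the default installation path and uses file names from the summary table in this document. Whether the package's own examples load the macros this way is not stated here, so treat it as an illustration; TaxMacroFolder is a hypothetical macro variable name.

```sas
/* assumption: default installation folder */
%let TaxMacroFolder = C:\taxonomy\v3.0\SasMacros;

/* compile the macros needed for the LBF, aggregation, Eigen analysis and graph steps */
%include "&TaxMacroFolder.\LBF.sas";
%include "&TaxMacroFolder.\Aggregation.sas";
%include "&TaxMacroFolder.\Eigen_Analysis.sas";
%include "&TaxMacroFolder.\Eigen_graphs.sas";
```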

Summary of SAS macros in C:\taxonomy\v3.0\SasMacros
Macros are to be used roughly in the order given in this table:
| Macro file names and %macros | Description |
|---|---|
| Filter.sas | |
| %Filter | Filters a dataset for poorly characterized subjects or variables. The goal is to prevent potential biases in the Eigen analysis. |
| Display_missing_values.sas | |
| %MakeMissingDataSet %DisplayMissingDataSet | Displays missing observations that could later introduce biases in the Eigen analysis |
| Interaction.sas | |
| %Interaction | Combines categorical variables and allows, for example, gene by gene interaction analyses |
| LBF.sas | Returns LBFs from categorical variables (genetic or non-genetic): |
| %LBF_Categorical | Categorical variables |
| %LBF_Autosomal | Autosomal markers |
| %LBF_x | Chromosome X markers |
| %LBF_HLA | HLA markers |
| %LBF_mixed | Mix of the above categories |
| %LBF_ByChromosome | Gets LBFs chromosome by chromosome [useful for big datasets] |
| %LBF_moments | Gets and plots moments (mean, var, …) of LBF across subjects |
| Aggregation.sas | |
| %Aggregate | Aggregates LBFs from a variable level to a collection level using a variable2collection map |
| %AggregateInteraction | Aggregates LBFs from an interaction dataset. |
| Eigen_Analysis.sas | |
| %EigenAnalysis | Performs the Eigen decomposition of the LBF correlation matrix |
| %EigenExportAcc | Exports main results to an MS Access database |
| Eigen_graphs.sas | |
| %EigenGraph | Helps visualize the Eigen analysis and plots: loadings, scores, biplot, scree plots, projected LBF moments, … |
| Projection.sas | |
| %MultiDimProj | Projects all variables (e.g. genes) on a direction (e.g. casecont or a sub-phenotype) given by the Eigen decomposition. Multidimensional (n>2) projections are possible. |
SAS MACROS: main features and short step by step description
All macros are annotated, and a detailed description is provided in the *.sas macro files. All these descriptions are collected in one Word document in the v3.0 folder:
tax3 package v3.0 – 2007-02 - SAS macros - detailed manual.doc
The bits of code provided below are located in, and can be run from:
C:\taxonomy\v3.0\Examples\description\CodesInDescriptionFile.sas
Please familiarize yourself with this code, log file, datasets and graphs produced by the analysis.
§ Rules.
- Do not use the ‘work’ library: it is used by the macros and is wiped out.
- Store and check the LOG files: all macros write important information to the LOGs.
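In practice, these two rules mean assigning a permanent library for the project datasets and routing the log to a file. A minimal sketch, assuming a project folder C:\taxonomy\myProject\ that already exists (the path and the log file name are hypothetical; myProj and MyRootFolder are the names used in the code of this document):

```sas
/* hypothetical project folder, written with a trailing backslash; it must already exist */
%let MyRootFolder = C:\taxonomy\myProject\;

/* permanent library for the project datasets, separate from 'work' */
libname myProj "&MyRootFolder.";

/* write the log to a file so that it can be stored and checked afterwards */
proc printto log="&MyRootFolder.analysis.log" new;
run;

/* ... macro calls go here ... */

/* restore the default log destination */
proc printto;
run;
```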
§ Step 1: input datasets
Genotypes and the case/control status of subjects are provided in a ‘tall/skinny’ format.
Please note that one value is missing: the genotype of subject 5 for SNP3.
data myProj.genotypes;
input subid polyid $ genotype $;
datalines;
1 SNP1 A_T
2 SNP1 A_T
3 SNP1 T_T
4 SNP1 T_T
5 SNP1 A_A
6 SNP1 A_A
1 SNP2 G_G
2 SNP2 G_C
3 SNP2 G_C
4 SNP2 C_C
5 SNP2 G_C
6 SNP2 C_C
1 SNP3 C_A
2 SNP3 C_A
3 SNP3 C_A
4 SNP3 C_C
6 SNP3 C_C
;
run;
data myProj.casecontset;
input subid casecont $;
datalines;
1 CASE
2 CASE
3 CASE
4 CONT
5 CONT
6 CONT
;
run;
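In the example above, the missing genotype appears simply as an absent row, which is easy to overlook. A quick check, independent of the package's own missing-value macros (this is plain PROC SQL and only a sketch), is to list the subjects that have fewer genotype rows than there are distinct SNPs:

```sas
proc sql;
  /* subjects with fewer genotype rows than the number of distinct SNPs */
  select subid, count(*) as nGenotyped
  from myProj.genotypes
  group by subid
  having count(*) < (select count(distinct polyid) from myProj.genotypes);
quit;
```

With the example data, this lists subject 5 with 2 genotyped SNPs. The check counts rows only, so a missing genotype stored as a row with a blank value would not be detected.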
§ Step 2: LBF calculations
*define how cases and controls are labelled in the datasets;
*to be used in LBF and EigenAnalysis macros;
%let labelCASE='CASE';
%let labelCONTROL='CONT';
*calculate LBFs;
%let GenotypeDelimiter='_';
%LBF_Autosomal(
longdata=myProj.genotypes
,longlbf=myProj.lbf
,casecontset=myProj.casecontset
);
This dataset is produced: myProj.LBF

§ Step 3: Aggregation from SNP LBFs to gene LBFs
This step is optional: the Eigen analysis can be performed on the SNP LBF dataset.
- A SNP to gene map has to be provided.
- You have the option to keep SNPs that are not in the map (Orphan SNPs).
*SNP to gene map;
data myProj.map;
length gene $ 15;
input polyid $ gene $;
datalines;
SNP1 GeneA
SNP2 GeneA
SNP3 GeneB
;
run;
*aggregate LBFs from SNP level to gene level;
%Aggregate(
DataIn = myProj.lbf(rename=(polyid=catvar))
,Map = myProj.map(rename=(polyid=catvar gene=collection))
,Library = myProj
,DatasetPreFix = lbf_gene
,KeepOrphanCatVar = 'NO'
);
This main dataset is produced: myProj.LBF_gene_meanlbf

Another dataset, myProj.lbf_gene_mapdensity, is also produced. It can be used in the Eigen analysis to correct for the fact that GeneA and GeneB do not have the same ‘granularity’: GeneA includes 2 SNPs, whereas GeneB includes 1 SNP. See the manual for more details.
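To see this difference in ‘granularity’ directly, the number of SNPs per gene can be counted from the map. The sketch below only illustrates the idea with plain PROC SQL; it is not the code that builds the mapdensity dataset, whose exact content is described in the manual.

```sas
proc sql;
  /* number of SNPs mapped to each gene */
  select gene, count(polyid) as nSNPs
  from myProj.map
  group by gene;
quit;
```

With the example map, this returns 2 SNPs for GeneA and 1 SNP for GeneB.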

§ Step 4: Eigen analysis of gene LBFs – and plot
*number of components to calculate;
%let nCompToStore=2;
*number of components for which mean.var LBFs tables are generated;
%let nLBFc=2;
*any missing element in the correlation matrix is replaced with zero;
%let MissingValues='MISSING_CORR_ZERO';
* Eigen Analysis of correlation matrix of gene LBFs;
* correlation matrix is corrected with the MapDensity dataset;
%EigenAnalysis(
lbflong=myProj.lbf_gene_meanlbf
,Library=myProj
,DataSetPostFix=_eigen
,IncludeCaseCont='YES'
,CorCov='COR'
,VarModifDataSet=myProj.lbf_gene_mapdensity(rename=(nCatVar=VarModif))
);
This produces several exxxx_eigen datasets, mainly the ones used by the EigenGraph macro:
- eloadings : loadings of the genes analyzed
- escores: scores of the subjects analyzed
- evalue: Eigen values
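To see which datasets were actually written, the library can be listed after the run. The sketch below uses the standard dictionary.tables view; it assumes that the library is named myProj and that the output dataset names end with the _eigen postfix used in the call above.

```sas
proc sql;
  /* datasets in the myProj library whose name ends with EIGEN */
  select memname, nobs
  from dictionary.tables
  where libname = 'MYPROJ'
    and upcase(memname) like '%EIGEN';
quit;
```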
§ Step 5: Graphs
%EigenGraph(
DatasetPostFix=_eigen
,Library=myProj
,OutputFolder=&MyRootFolder.graph\
,GraphTitle=%str( eigen decomposition )
);
This produces, among others, the main ‘bi-plot’ graph, showing:
- Distinction of cases (red dots) and controls (blue dots)
- Case/cont distinction represents most of the variability of the dataset (case/cont aligned with component1)
- GeneB is in correlation with the case/cont distinction (case/cont and GeneB have same direction)
- GeneA is involved with the case/cont distinction (high loading1), but is also involved with some within group variability (high loading2): this is mainly due to the heterogeneity observed in the case group (GeneB pulls away one control from the 2 others).

EXAMPLES.
Three examples are provided to help you deal with simple or more complex datasets.
A simple example using a small dataset (candidate gene study, autosomal SNPs)
Please start by looking at, and then running, the file: SasCode\__DO__ALL_.sas
(the analysis takes ~5 mins to run)
A complex example: a candidate gene study, with:
- autosomal SNPs
- X-linked SNPs
- HLA markers
- sub-phenotypes
- gene by gene interactions
- signal-to-noise decomposition using WinBUGS
- heatmaps
Please start by looking at, and then running, the file: SasCode\__DO__ALL_.sas
(the analysis takes ~15 mins to run)
A simple example to help you test the options provided for the analysis of large or very large datasets.
Please start by looking at, and then running, the file: SasCode\__DO__ALL_.sas
(the analysis takes ~5 mins to run)
This example was used to produce the detailed list of resources needed for large problems, as described in appendix 1.
Machine used: P4 2.8GHz, 4Gb RAM, Windows XP Pro 32-bit with /3Gb boot switch
| (nSubjects = 50) \ Number of collections: | | | 5000 | 10000 | 15000 | 20000 | 30000 | 50000 | 500000 |
|---|---|---|---|---|---|---|---|---|---|
| CLASSICAL METHOD | NEW CORR function (1) | TIME | 40 sec | 6 mins | 30 mins | 1 h | 4 h 15 | - | - |
| | SAS EIGEN FUNCTION | RAM | 100 Mb | 400 Mb | 900 Mb | (2) | - | - | - |
| | | TIME (PRINCOMP) | 12 mins | 2 h 40 | 14 h | - | - | - | - |
| | | HD SPACE | <1 Gb | 3 Gb | 6 Gb | - | - | - | - |
| | | TOTAL TIME | 15 mins | 3 h | 15 h | - | - | - | - |
| | ARPACK EIGEN FUNCTION | RAM + VM allocated (3) (requested) | - | - | 1500 Mb | 1500 Mb | 1500 Mb | - | - |
| | | TIME (ARPACK) | - | - | 45 mins | 1 h 20 | 29 h | - | - |
| | | HD SPACE | - | - | 8 Gb | 12 Gb | 38 Gb | - | - |
| | | TOTAL TIME | - | - | 1 h 30 | 2 h 45 | 34 h (4) | - | - |
| KERNEL PCA | | RAM | - | - | - | - | <100 Mb | <100 Mb | |
| | | HD SPACE | - | - | - | - | <1 Gb | <1 Gb | 2 Gb |
| | | TOTAL TIME | - | - | - | - | 1 min | 2 mins | 20 mins |
(1) RAM usage was fixed to 2000 Mb using the MaxCorrMemUsage=2000 option.
(2) 2 Gb is the memory allocation limit for SAS PRINCOMP v9 (on both 32-bit and 64-bit systems). This represents ~23000 variables.
(3) On Windows XP 32-bit, this implementation of ARPACK allocates 1500 Mb of RAM + virtual memory for the storage of the correlation matrix. If more memory is requested, disk swapping is used: this is a slow option, but it allows going above 20000 variables on this system.
(4) This large increase in duration from 20000 to 30000 variables is due to the 1703-1500=203 Mb that cannot be allocated in RAM. Disk swapping was used, hence the slowness.