Taxonomy3 package v3.0 – Installation and Description
This package includes all SAS macros required to run ‘taxonomy 3’. It also includes:
- an ARPACK FORTRAN MS Windows executable for large-scale eigenproblems
- WinBUGS source code for the signal-to-noise decomposition
- an MS Excel add-in for drawing heatmaps (at a very early draft stage and not described here)
This document shows how to install the package and how to use its main features.
Content
Software & Hardware requirements
Local installation
Summary of SAS macros in C:\taxonomy\v3.0\SasMacros
SAS macros: main features and short step by step description
EXAMPLES
Appendix 1. Large problems: resources needed for EigenAnalysis and suggested strategy.
Software & Hardware requirements
This package is intended to run on Windows and Unix stations.
Software
· SAS: v9.1 (the SAS macros provided herein will NOT function with SAS 8.2, since variables and datasets have long names)
  SAS/GRAPH, SAS/IML, SAS/ACCESS: optional
· Optional:
- WinBUGS: for signal/noise decomposition
Hardware [for more details see appendix 1]
The main limiting factor is the RAM used in the Eigen analysis, which is driven by the size of the correlation/covariance matrix on which the Eigen analysis is performed.
Two analysis options are provided:
· Exact calculation (Classical PCA option – SAS Eigen function):
RAM required ≃ 4 × nCollections^2
· Approximation (Kernel PCA option):
RAM required ≃ 4 × nSubjects^2
A third option is provided (Classical PCA – Arpack Eigen function): this is an exact calculation method where the RAM needed is fixed by the user (disk swapping is used instead).
For example, on a system with 1 GB of RAM available, the following options are recommended (assuming nSubjects << nCollections):
· nCollections < 10000: Classical PCA – SAS Eigen function
· nCollections < 20000: Classical PCA – ARPACK Eigen function
· nCollections > 15000: Kernel PCA approximation
See ‘LargeProblem’ example and detailed manual for more information.
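The two rules of thumb above can be evaluated before choosing an option. The following is a minimal sketch, not part of the package: the study sizes are hypothetical, and the unit of the result is assumed to be bytes, which the text does not state.

```sas
* Sketch: evaluate the two RAM rules of thumb for a hypothetical study size;
* ASSUMPTION: the result is in bytes, the unit is not stated in this document;
data _null_;
  nCollections = 10000;  * hypothetical number of collections, for example genes;
  nSubjects    = 2000;   * hypothetical number of subjects;
  ramClassical = 4 * nCollections**2;  * Classical PCA with the SAS Eigen function;
  ramKernel    = 4 * nSubjects**2;     * Kernel PCA approximation;
  put 'Classical PCA, SAS Eigen: ' ramClassical comma20.;
  put 'Kernel PCA approximation: ' ramKernel comma20.;
run;
```

With these inputs the classical rule gives 4 × 10^8 and the kernel rule gives 1.6 × 10^7, which shows why the kernel option is attractive when nSubjects << nCollections.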
Local installation:
Simply unzip taxonomy.zip into "C:\taxonomy". You will obtain this structure:

The SAS macros are located in the folder C:\Taxonomy\v3.0\SasMacros

Summary of SAS macros in C:\taxonomy\v3.0\SasMacros
Macros are to be used roughly in the order given in this table:
| Macro file names and %macros | Description |
| Filter.sas | |
| %Filter | Filters a dataset for poorly characterized subjects or variables. The goal is to prevent potential biases in the Eigen analysis. |
| Display_missing_values.sas | |
| %MakeMissingDataSet, %DisplayMissingDataSet | Display missing observations that could later introduce biases in the Eigen analysis. |
| Interaction.sas | |
| %Interaction | Combines categorical variables, allowing for example gene-by-gene interaction analyses. |
| LBF.sas | Returns LBFs from categorical variables (genetic or non-genetic): |
| %LBF_Categorical | Categorical variables |
| %LBF_Autosomal | Autosomal markers |
| %LBF_x | Chromosome X markers |
| %LBF_HLA | HLA markers |
| %LBF_mixed | Mix of the above categories |
| %LBF_ByChromosome | Gets LBFs chromosome by chromosome [useful for big datasets] |
| %LBF_moments | Gets and plots moments (mean, var, …) of LBF across subjects |
| Aggregation.sas | |
| %Aggregate | Aggregates LBFs from a variable level to a collection level using a variable2collection map. |
| %AggregateInteraction | Aggregates LBFs from an interaction dataset. |
| Eigen_Analysis.sas | |
| %EigenAnalysis | Performs the Eigen decomposition of the LBF correlation matrix. |
| %EigenExportAcc | Exports main results to an MS Access database. |
| Eigen_graphs.sas | |
| %EigenGraph | Helps visualize the Eigen analysis and plots: loadings, scores, biplot, scree plots, projected LBF moments, … |
| Projection.sas | |
| %MultiDimProj | Projects all variables (e.g. genes) on a direction (e.g. casecont or a sub-phenotype) given by the Eigen decomposition. Multidimensional (n>2) projections are possible. |
SAS MACROS: main features and short step by step description
All macros are annotated, and a detailed description is provided in the *.sas macro files. All these descriptions are collected into one Word document in the v3.0 folder:
tax3 package v3.0 – 2007-02 - SAS macros - detailed manual.doc
The bits of code provided below are located in, and can be run from:
C:\taxonomy\v3.0\Examples\description\CodesInDescriptionFile.sas
Please familiarize yourself with this code, log file, datasets and graphs produced by the analysis.
§ Rules.
- Do not use the ‘work’ library: the macros use it and wipe it out.
- Store and check the LOG files: all macros write important information to the LOGs.
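A session setup that follows these two rules could look like the sketch below. It is not taken from the package: the myProj folder and the log file name are hypothetical, and loading the macro files with %include is an assumption (the detailed manual may prescribe another mechanism). The macro file names are those listed in the summary table.

```sas
/* Sketch of a session setup. Items marked HYPOTHETICAL are not from the package. */

/* Permanent library for all project datasets, so that work is never used. */
/* HYPOTHETICAL folder, it must already exist. */
libname myProj "C:\taxonomy\myProj";

/* Route the log to a file so that it can be stored and checked later. */
/* HYPOTHETICAL file name. */
proc printto log="C:\taxonomy\myProj\session.log" new;
run;

/* ASSUMPTION: the macro definitions are loaded by including the macro files. */
%include "C:\taxonomy\v3.0\SasMacros\LBF.sas";
%include "C:\taxonomy\v3.0\SasMacros\Aggregation.sas";
%include "C:\taxonomy\v3.0\SasMacros\Eigen_Analysis.sas";

/* ... run the analysis steps here ... */

/* Restore the default log destination at the end of the session. */
proc printto;
run;
```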
§ Step 1: input datasets
Genotypes and the subjects' case/control status are provided in a ‘tall/skinny’ format (one row per subject and SNP).
Note that the genotype for subject 5 and SNP3 is missing: the dataset has no row for that pair.
data myProj.genotypes;
input subid polyid $ genotype $;
datalines;
1 SNP1 A_T
2 SNP1 A_T
3 SNP1 T_T
4 SNP1 T_T
5 SNP1 A_A
6 SNP1 A_A
1 SNP2 G_G
2 SNP2 G_C
3 SNP2 G_C
4 SNP2 C_C
5 SNP2 G_C
6 SNP2 C_C
1 SNP3 C_A
2 SNP3 C_A
3 SNP3 C_A
4 SNP3 C_C
6 SNP3 C_C
;
run;
data myProj.casecontset;
input subid casecont $ ;
datalines;
1 CASE
2 CASE
3 CASE
4 CONT
5 CONT
6 CONT
;
run;
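Missing subject/SNP pairs such as the one noted above can be listed before any LBF is calculated. The following is a minimal sketch in plain PROC SQL, independent of the package's own %MakeMissingDataSet and %DisplayMissingDataSet macros; the output dataset name missing_pairs is hypothetical.

```sas
* Sketch: list subject/SNP pairs that have no row in the genotype dataset;
* The output name missing_pairs is hypothetical and is stored in myProj, not in work;
proc sql;
  create table myProj.missing_pairs as
  /* every subject crossed with every SNP */
  select s.subid, p.polyid
  from (select distinct subid from myProj.casecontset) as s,
       (select distinct polyid from myProj.genotypes) as p
  except
  /* minus the pairs that were actually genotyped */
  select subid, polyid
  from myProj.genotypes;
quit;

proc print data=myProj.missing_pairs noobs;
run;
```

On the example data this lists a single pair, subject 5 with SNP3. SAS writes a note about the Cartesian product to the log, which is expected here.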
§ Step 2: LBF calculations
*define how cases and controls are labelled in the datasets;
*to be used in LBF and EigenAnalysis macros;
%let labelCASE='CASE';
%let labelCONTROL='CONT';
*calculate LBFs;
%let GenotypeDelimiter='_';
%LBF_Autosomal(
longdata=myProj.genotypes
,longlbf=myProj.lbf
,casecontset=myProj.casecontset
);
This dataset is produced: myProj.LBF

§ Step 3: Aggregation from SNP LBFs to gene LBFs
This step is optional: the Eigen analysis can be performed on the SNP LBF dataset.
- A SNP to gene map has to be provided.
- You have the option to keep SNPs that are not in the map (Orphan SNPs).
*SNP to gene map;
data myProj.map;
length gene $ 15;
input polyid $ gene $ ;
datalines;
SNP1 GeneA
SNP2 GeneA
SNP3 GeneB
;
run;
*aggregate LBFs from SNP level to gene level;
%Aggregate(
DataIn = myProj.lbf(rename=(polyid=catvar))
,Map = myProj.map(rename=(polyid=catvar gene=collection))
,Library = myProj
,DatasetPreFix = lbf_gene
,KeepOrphanCatVar = 'NO'
);
The main dataset produced is myProj.LBF_gene_meanlbf

Another dataset, myProj.Lbf_gene_mapdensity, is also produced. It can be used in the Eigen analysis to correct for the fact that geneA and geneB do not have the same ‘granularity’: geneA includes 2 SNPs, whereas geneB includes 1 SNP. See the manual for more details.
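The difference in granularity can be read directly from the map. The sketch below only illustrates the idea by counting the SNPs per gene; the column name nSNP is hypothetical, and whether Lbf_gene_mapdensity stores exactly these counts is an assumption (the nCatVar column used in Step 4 suggests it does).

```sas
* Sketch: count the SNPs mapped to each gene in the example map;
* The column name nSNP is hypothetical;
proc sql;
  select gene, count(distinct polyid) as nSNP
  from myProj.map
  group by gene;
quit;
```

For the example map this reports 2 SNPs for GeneA and 1 SNP for GeneB.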

§ Step 4: Eigen analysis of gene LBFs – and plot
*number of components to calculate;
%let nCompToStore=2;
*number of components for which mean.var LBFs tables are generated;
%let nLBFc=2;
*any missing element in the correlation matrix is replaced with zero;
%let MissingValues='MISSING_CORR_ZERO';
* Eigen Analysis of correlation matrix of gene LBFs;
* correlation matrix is corrected with the MapDensity dataset;
%EigenAnalysis(
lbflong=myProj.lbf_gene_meanlbf
,Library=myProj
,DataSetPostFix=_eigen
,IncludeCaseCont='YES'
,CorCov='COR'
,VarModifDataSet=myProj.lbf_gene_mapdensity(rename=(nCatVar=VarModif))
);
This produces several exxxx_eigen datasets, mainly those used by the %EigenGraph macro:
- eloadings: loadings of the genes analyzed
- escores: scores of the subjects analyzed
- evalue: eigenvalues
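Because the exact dataset names depend on the DataSetPostFix value, it can help to list what the macro actually wrote to the library before plotting. The following is a minimal sketch using standard PROC DATASETS; it makes no assumption about the names or columns of the output datasets.

```sas
* Sketch: list all datasets in myProj to find the exact names written by the macro;
proc datasets library=myProj;
  contents data=_all_ nods;  * NODS prints the directory only, not each dataset;
run;
quit;
```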