Taxonomy3 package v3.0 – Installation and Description

 

 

This package includes all SAS macros required to run ‘taxonomy 3’.

 

It also includes:

-          an ARPACK Fortran executable for MS Windows, for large-scale Eigen problems

-          WinBUGS source code for the signal-to-noise decomposition

-          an MS Excel add-in for drawing heatmaps (at a very early draft stage and not described here)

 

 

This document shows how to install the package and how to use its main features.

 

 

Content

 

Software & Hardware requirements

Local installation

Summary of SAS macros in C:\taxonomy\v3.0\SasMacros

SAS macros: main features and short step by step description

EXAMPLES

Appendix 1. Large problems: resources needed for EigenAnalysis and suggested strategy.

 

 


Software & Hardware requirements

 

This package is intended to run on Windows and Unix workstations.

 

Software

 

·         SAS v9.1
(the SAS macros provided here will NOT work with SAS 8.2, because they use long variable and dataset names)


SAS/GRAPH, SAS/IML, SAS/ACCESS : optional


·        Optional:

-          WinBUGS: for the signal-to-noise decomposition

 

 

Hardware [for more details see appendix 1]

 

The main limiting factor is the RAM used in the Eigen analysis, which is driven by the size of the correlation/covariance matrix on which the Eigen analysis is performed.

Two analysis options are provided:

 

·        Exact calculation (Classical PCA option – SAS Eigen function):

RAM required: 4 × nCollections^2 bytes

·        Approximation (Kernel PCA option):

RAM required: 4 × nSubjects^2 bytes

A third option is provided (Classical PCA – ARPACK Eigen function): this is an exact calculation method in which the amount of RAM used is fixed by the user (disk swapping makes up the difference).
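As a rough worked example of the two RAM formulas above, the following sketch reads the factor 4 as 4 bytes per matrix element. That reading is an assumption, although it agrees with the figures in appendix 1 (about 100 Mb for 5000 collections). The study sizes are hypothetical.

```sas
* Rough RAM estimate for the two in-memory options;
* Assumption: 4 bytes per matrix element;
data _null_;
   nCollections = 10000;   * hypothetical number of collections;
   nSubjects    = 50;      * hypothetical number of subjects;
   ramClassicalMb = 4 * nCollections**2 / 1e6;
   ramKernelMb    = 4 * nSubjects**2 / 1e6;
   put 'Classical PCA (SAS Eigen function), Mb: ' ramClassicalMb;
   put 'Kernel PCA approximation, Mb: ' ramKernelMb;
run;
```

The step uses data _null_, so it writes only to the log and creates no dataset.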

 

            For example, on a system with 1Gb RAM available, the following options are recommended (assuming nSubjects << nCollections) :

·        nCollections < 10000: Classical PCA – SAS Eigen function

·        nCollections < 20000: Classical PCA – ARPACK Eigen function

·        nCollections > 15000: Kernel PCA approximation

 

 

            See the ‘LargeProblems’ example and the detailed manual for more information.

 


Local installation:

 

Simply unzip taxonomy.zip into "C:\taxonomy". This creates the v3.0 folder, which contains the SasMacros and Examples subfolders referred to in this document.

 

 

 

 

The SAS macros are located in the folder C:\Taxonomy\v3.0\SasMacros
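A SAS session might be prepared along the following lines before running the code in this document. This is only a sketch: the project folder and its data subfolder are hypothetical and must already exist, and the example scripts shipped with the package may already perform equivalent steps. The macro file names are those listed in the summary of SAS macros.

```sas
* Hypothetical project folder, adjust to your own location;
%let MyRootFolder=C:\taxonomy\myProject\;

* Permanent library for your datasets, since the macros wipe the WORK library;
libname myProj "&MyRootFolder.data";

* Load the macro definitions shipped with the package;
%include "C:\taxonomy\v3.0\SasMacros\LBF.sas";
%include "C:\taxonomy\v3.0\SasMacros\Aggregation.sas";
%include "C:\taxonomy\v3.0\SasMacros\Eigen_Analysis.sas";
%include "C:\taxonomy\v3.0\SasMacros\Eigen_graphs.sas";
```

Include further files (Filter.sas, Interaction.sas, Projection.sas, ...) in the same way when their macros are needed.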

 

 

 

 


Summary of SAS macros in C:\taxonomy\v3.0\SasMacros

Macros are to be used roughly in the order given in this table:

 

Macro file names, %macros and descriptions

Filter.sas

-         %Filter: Filters a dataset for poorly characterized subjects or variables. The goal is to prevent potential biases in the Eigen analysis.

Display_missing_values.sas

-         %MakeMissingDataSet, %DisplayMissingDataSet: Display missing observations that could later introduce biases in the Eigen analysis.

Interaction.sas

-         %Interaction: Combines categorical variables, allowing for example gene-by-gene interaction analyses.

LBF.sas (returns LBFs from categorical variables, genetic or non-genetic)

-         %LBF_Categorical: Categorical variables

-         %LBF_Autosomal: Autosomal markers

-         %LBF_x: Chromosome X markers

-         %LBF_HLA: HLA markers

-         %LBF_mixed: Mix of the above categories

-         %LBF_ByChromosome: Gets LBFs chromosome by chromosome [useful for big datasets]

-         %LBF_moments: Gets and plots moments (mean, var, …) of LBF across subjects

Aggregation.sas

-         %Aggregate: Aggregates LBFs from a variable level to a collection level using a variable2collection map.

-         %AggregateInteraction: Aggregates LBFs from an interaction dataset.

Eigen_Analysis.sas

-         %EigenAnalysis: Performs the Eigen decomposition of the LBF correlation matrix.

-         %EigenExportAcc: Exports the main results to an MS Access database.

Eigen_graphs.sas

-         %EigenGraph: Helps visualize the Eigen analysis and plots loadings, scores, biplot, scree plots, projected LBF moments, …

Projection.sas

-         %MultiDimProj: Projects all variables (e.g. genes) on a direction (e.g. casecont or a sub-phenotype) given by the Eigen decomposition. Multidimensional (n>2) projections are possible.

 


SAS MACROS: main features and short step by step description

 

All macros are annotated and a detailed description is provided in the *.sas macro files. All these descriptions are gathered in one Word document in the v3.0 folder:
tax3 package v3.0 – 2007-02 - SAS macros - detailed manual.doc

 

The bits of code provided below are located in, and can be run from:

C:\taxonomy\v3.0\Examples\description\CodesInDescriptionFile.sas

 

Please familiarize yourself with this code and with the log file, datasets and graphs produced by the analysis.

 

§         Rules.

-         Do not store your own datasets in the ‘work’ library: the macros use it and wipe it out.

-         Store and check the LOG files: all macros write important information in the LOGs.

 

 

§         Step 1: input datasets

 

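Genotypes and the case/control status of subjects are provided in a ‘tall/skinny’ format (one row per subject and marker for genotypes, one row per subject for case/control status).

Please note that one value is missing: there is no row for subject 5 and SNP3.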

 

data myProj.genotypes;
       input subid polyid $ genotype $;
       datalines;
1      SNP1   A_T
2      SNP1   A_T
3      SNP1   T_T
4      SNP1   T_T
5      SNP1   A_A
6      SNP1   A_A
1      SNP2   G_G
2      SNP2   G_C
3      SNP2   G_C
4      SNP2   C_C
5      SNP2   G_C
6      SNP2   C_C
1      SNP3   C_A
2      SNP3   C_A
3      SNP3   C_A
4      SNP3   C_C
6      SNP3   C_C
;
run;

 

data myProj.casecontset;
       input subid casecont $;
       datalines;
1      CASE
2      CASE
3      CASE
4      CONT
5      CONT
6      CONT
;
run;
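Before calculating LBFs, it can be useful to confirm where genotypes are missing. The query below is a minimal sketch in plain SAS, not one of the package macros (the package provides %MakeMissingDataSet and %DisplayMissingDataSet for this purpose). It counts the subjects genotyped for each SNP in the example dataset.

```sas
* Count the subjects genotyped for each SNP;
proc sql;
   select polyid, count(distinct subid) as nGenotyped
   from myProj.genotypes
   group by polyid;
quit;
```

With the example data, SNP1 and SNP2 are genotyped for 6 subjects and SNP3 for 5, which reflects the missing genotype of subject 5.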


§         Step 2: LBF calculations

 

*define how cases and controls are labelled in the datasets;
*to be used in LBF and EigenAnalysis macros;
%let labelCASE='CASE';
%let labelCONTROL='CONT';

*calculate LBFs;
%let GenotypeDelimiter='_';
%LBF_Autosomal(
       longdata=myProj.genotypes
       ,longlbf=myProj.lbf
       ,casecontset=myProj.casecontset
       );

 

 

 

This dataset is produced: myProj.LBF

 

 

 

 

 


§         Step 3: Aggregation from SNP LBFs to gene LBFs

 

This step is optional: the Eigen analysis can be performed on the SNP LBF dataset.

 

-         A SNP to gene map has to be provided.

-         You have the option to keep SNPs that are not in the map (Orphan SNPs).

 

 

*SNP to gene map;
data myProj.map;
       length gene $ 15;
       input polyid $ gene $;
       datalines;
SNP1   GeneA
SNP2   GeneA
SNP3   GeneB
;
run;
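To see whether the choice about orphan SNPs matters for a given dataset, the SNPs that are genotyped but absent from the map can be listed first. This is a sketch in plain SAS, independent of the package macros; the column label OrphanSNP is only illustrative.

```sas
* List genotyped SNPs that have no entry in the SNP to gene map;
proc sql;
   select distinct polyid as OrphanSNP
   from myProj.genotypes
   where polyid not in (select polyid from myProj.map);
quit;
```

With the example data every SNP is mapped, so the query returns no rows.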

 

*aggregate LBFs from SNP level to gene level;
%Aggregate(
       DataIn = myProj.lbf(rename=(polyid=catvar))
       , Map = myProj.map(rename=(polyid=catvar gene=collection))
       , Library = myProj
       , DatasetPreFix = lbf_gene
       , KeepOrphanCatVar = 'NO'
) ;

 

The main dataset produced is myProj.lbf_gene_meanlbf.

Another dataset, myProj.lbf_gene_mapdensity, is also produced. It can be used in the Eigen analysis to correct for the fact that GeneA and GeneB do not have the same ‘granularity’: GeneA includes 2 SNPs, whereas GeneB includes 1 SNP. See the manual for more details.
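The difference in granularity can be checked directly from the map. The query below is only an illustration of the idea: it counts SNPs per gene from myProj.map and does not reproduce the layout of the mapdensity dataset written by %Aggregate.

```sas
* Number of SNPs mapped to each gene;
proc sql;
   select gene, count(*) as nSNPs
   from myProj.map
   group by gene;
quit;
```

With the example map this gives 2 SNPs for GeneA and 1 SNP for GeneB.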

 


§         Step 4: Eigen analysis of gene LBFs – and plot

 

 

*number of components to calculate;
%let nCompToStore=2;

*number of components for which mean/var LBF tables are generated;
%let nLBFc=2;

*any missing element in the correlation matrix is replaced with zero;
%let MissingValues='MISSING_CORR_ZERO';

* Eigen Analysis of correlation matrix of gene LBFs;
* correlation matrix is corrected with the MapDensity dataset;
%EigenAnalysis(
       lbflong=myProj.lbf_gene_meanlbf
       ,Library=myProj
       ,DataSetPostFix=_eigen
       ,IncludeCaseCont='YES'
       ,CorCov='COR'
       ,VarModifDataSet=myProj.lbf_gene_mapdensity(rename=(nCatVar=VarModif))
);

 

 

 

This produces several exxxx_eigen datasets. The main ones, which are used by the EigenGraph macro, are:

-         eloadings: loadings of the genes analyzed

-         escores: scores of the subjects analyzed

-         evalue: Eigen values
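To see exactly which datasets were written, the library can be queried through the SAS dictionary tables. This sketch assumes that the output dataset names end with the _eigen postfix passed in DataSetPostFix, and that the library is named myProj.

```sas
* List the datasets in myProj whose names end in EIGEN, with their row counts;
proc sql;
   select memname, nobs
   from dictionary.tables
   where libname = 'MYPROJ' and memname like '%EIGEN';
quit;
```

Library and member names are stored in upper case in dictionary.tables, hence the upper-case literals.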

 

 

 


§         Step 5: Graphs

 

 

%EigenGraph(
       DatasetPostFix=_eigen
       ,Library=myProj
       ,OutputFolder=&MyRootFolder.graph\
       ,GraphTitle=%str( eigen decomposition )
);

 

This produces, among others, the main ‘bi-plot’ graph, which shows that:

-         Cases (red dots) and controls (blue dots) are separated

-         The case/cont distinction accounts for most of the variability in the dataset
(case/cont is aligned with component 1)

-         GeneB correlates with the case/cont distinction
(case/cont and GeneB have the same direction)

-         GeneA is involved in the case/cont distinction (high loading 1), but also in some within-group variability (high loading 2): this is mainly due to the heterogeneity observed in the case group (GeneB pulls away one control from the 2 others).

 

 


EXAMPLES.

 

Three examples are provided to help you deal with simple or more complex datasets.

 

C:\taxonomy\v3.0\Examples\simple\

 

A simple example using a small dataset (candidate gene study, autosomal SNPs)

 

Please start by reading and running the file: SasCode\__DO__ALL_.sas
(the analysis takes ~5 mins to run)

 

 

C:\taxonomy\v3.0\Examples\complex\

A complex example: a candidate gene study, with:

-          autosomal SNPs

-          X-linked SNPs

-          HLA markers

-          sub-phenotypes

-          gene by gene interactions

-          signal to noise decomposition using WinBUGS.

-          heatmaps.

Please start by reading and running the file: SasCode\__DO__ALL_.sas
(the analysis takes ~15 mins to run)

 

C:\taxonomy\v3.0\Examples\LargeProblems\

A simple example to help you test the options provided for the analysis of large or very large datasets.

Please start by reading and running the file: SasCode\__DO__ALL_.sas
(the analysis takes ~5 mins to run)

 

This example was used to produce the detailed list of resources needed for large problems, as described in appendix 1.

 

 

 


Appendix 1. Large problems: resources needed for EigenAnalysis and suggested strategy.

 

Machine used: P4 2.8 GHz, 4 Gb RAM, Windows XP Pro 32-bit with the /3GB boot switch

Number of collections (nSubjects = 50 throughout); "-" marks combinations for which no value is reported:

                                  5000       10000      15000      20000      30000      50000      500000

CLASSICAL METHOD

NEW CORR function (1)
  Time                            40 sec     6 mins     30 mins    1 h        4 h 15     -          -

SAS EIGEN FUNCTION (EXACT)
  RAM (~ nVar^2)                  100 Mb     400 Mb     900 Mb     (2)        -          -          -
  Time (PRINCOMP, ~ nVar^3)       12 mins    2 h 40     14 h       -          -          -          -
  HD space                        <1 Gb      3 Gb       6 Gb       -          -          -          -
  Total time                      15 mins    3 h        15 h       -          -          -          -

ARPACK EIGEN FUNCTION
  RAM + VM allocated (3)          -          -          1500 Mb    1500 Mb    1500 Mb    -          -
  RAM + VM requested              -          -          431 Mb     770 Mb     1703 Mb    -          -
  Time (ARPACK, ~ nVar^3 + swap)  -          -          45 mins    1 h 20     29 h       -          -
  HD space                        -          -          8 Gb       12 Gb      38 Gb      -          -
  Total time                      -          -          1 h 30     2 h 45     34 h (4)   -          -

KERNEL PCA APPROXIMATION (SAS EIGEN FUNCTION)
  RAM (~ nSub^2)                  -          -          -          -          <100 Mb    <100 Mb    <100 Mb
  HD space                        -          -          -          -          <1 Gb      <1 Gb      2 Gb
  Total time                      -          -          -          -          1 min      2 mins     20 mins

 

(1)    RAM usage was set to 2000 Mb using the MaxCorrMemUsage=2000 option.

(2)    2 Gb is the memory allocation limit for SAS PRINCOMP v9 (on both 32-bit and 64-bit systems). This represents ~23000 variables.

(3)   On Windows XP 32-bit, this implementation of ARPACK allocates 1500 Mb of RAM + Virtual Memory for the storage of the correlation matrix. If more memory is requested, disk swapping is used: this is a slow option, but it allows going above 20000 variables on this system.

(4)   This large increase in duration from 20000 to 30000 variables is due to the 1703-1500=203 Mb that cannot be allocated in RAM. Disk swapping was used, hence the slowness.