Supplementary Materials

Large-Scale Comparison of Publicly Available SAGE, cDNA Microarray, and Oligonucleotide Microarray Expression Data for Global Co-Expression Analyses

Authors
Obi L. Griffith(1), Erin D. Pleasance(1), Debra L. Fulton(2), Mehrdad Oveisi(1), Martin Ester(3), Asim Siddiqui(1) and Steven J.M. Jones(1)

Affiliations
1. Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, British Columbia, Canada, V5Z 4E6
2. Molecular Biology and Biochemistry, Simon Fraser University, Burnaby, British Columbia, Canada, V5A 1S6
3. School of Computing Science, Simon Fraser University, Burnaby, British Columbia, Canada, V5A 1S6

I) Supplementary Materials

Supplementary methods, results, and figures referred to in the manuscript:
Suppl. materials (WORD doc)

Supplementary methods, results, and figure titles without actual figures inserted (see below for separate figure files):
Suppl. materials WORD doc (text only)

High resolution figures (zip archive of tiff files):
Figures 1-4
Suppl. Figures 1-9

II) Data

A) Affymetrix HG-U133A oligonucleotide array

Details

Gene Expression Omnibus (GEO)
GEO Platform: GPL96
Full sample set as of April 07, 2004.
889 samples (667 with p-value or detection calls)
22215 probes

Downloads

Affymetrix annotation file used in analysis
Complete data (log transformed)
Processed data
Gene Pair Pearson Correlations

Data processing

Affymetrix probe intensities were converted to natural log values. Only intensities with a 'P' (present) detection call (p-value < 0.04) were considered; intensities with 'A' or 'M' calls were set to null. All ln(intensity) values were then normalized by subtracting the median and dividing by the interquartile range for the experiment (Davidson et al. 2001). Affymetrix probe IDs were mapped to LocusLink IDs using the most current Affymetrix annotation file for the HG-U133A chip (www.affymetrix.com). Probes with ambiguous mapping to LocusLink (see SAGE processing below) were discarded, resulting in a final set of 8106 genes from the Affymetrix dataset. Genes not common to all three datasets (Affymetrix, cDNA and SAGE) were removed, resulting in a final gene set of 5881.
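The per-experiment normalization described above can be sketched in Python as follows. This is a minimal illustration and not the code used in the analysis. The function name normalize_experiment is hypothetical, and two details are assumptions: the median and interquartile range are computed over the 'P' values only, and quartiles use the "inclusive" method of Python's statistics module.

```python
import math
from statistics import median, quantiles


def normalize_experiment(intensities, calls):
    """Normalize one experiment; returns one value per probe, None = null.

    intensities: raw probe intensities for a single experiment.
    calls: detection calls ('P', 'A' or 'M'), one per probe.
    """
    # Keep ln(intensity) only for 'P' calls; 'A' and 'M' calls become nulls.
    # Non-positive intensities cannot be logged and are also treated as nulls.
    logged = [math.log(x) if call == "P" and x > 0 else None
              for x, call in zip(intensities, calls)]

    # Assumption: median and IQR are taken over the non-null values only.
    observed = [v for v in logged if v is not None]
    if len(observed) < 2:
        raise ValueError("need at least two 'P' values to compute an IQR")

    med = median(observed)
    q1, _, q3 = quantiles(observed, n=4, method="inclusive")
    iqr = q3 - q1
    if iqr == 0:
        raise ValueError("interquartile range is zero")

    # Subtract the median and divide by the interquartile range.
    return [None if v is None else (v - med) / iqr for v in logged]
```

Nulls are carried through as None so that later steps can count how many experiments two genes have in common.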

B) cDNA microarray

Details

Stuart et al. 2003.
Kim Lab Website
1202 samples
13595 genes

Downloads

Data processing

cDNA microarray data were used as provided by Stuart et al. (2003) except for minor formatting changes. Genes not common to all three datasets (Affymetrix, cDNA and SAGE) were removed, resulting in a final gene set of 5881.

C) SAGE

Details

Gene Expression Omnibus (GEO)
GEO Platform: GPL4
Full sample set as of March 31, 2004.
242 samples
609,224 tags

Downloads

Data processing

SAGE data were first filtered to remove tags present in fewer than 10 libraries, reducing the unique tags from 609,224 to 87,521. Next, SAGE tags were mapped to genes by the lowest sense tag predicted from Refseq or MGC sequences and then mapped to LocusLink IDs using DISCOVERYspace, reducing the unique tag set further to 47,263. In the event of a discrepancy between Refseq and MGC, the former was taken as correct. If a tag mapped to more than one LocusLink ID, or more than one tag mapped to the same LocusLink ID, it was discarded, resulting in a final set of 15,426 unique tags confidently mapped to LocusLink IDs. SAGE tag counts of zero were converted to nulls. Non-zero SAGE tag counts were converted to log frequency as follows:

Tag frequency = ln((tag count x 10000)/total tags in library)
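The library filter and the log-frequency conversion can be sketched as below. This is an illustrative sketch only: the function names and the nested-dictionary input layout are hypothetical, and whether "total tags in library" is counted before or after filtering is not stated here, so the total is simply passed in by the caller.

```python
import math


def filter_tags(libraries, min_libraries=10):
    """Return the set of tags observed in at least min_libraries libraries.

    libraries: dict mapping library id -> dict mapping tag -> raw count
    (hypothetical layout chosen for this example).
    """
    presence = {}
    for counts in libraries.values():
        for tag, count in counts.items():
            if count > 0:
                presence[tag] = presence.get(tag, 0) + 1
    return {tag for tag, n in presence.items() if n >= min_libraries}


def tag_frequency(tag_count, total_tags):
    """ln((tag count x 10000) / total tags in library); zero counts become None."""
    if tag_count == 0:
        return None  # zero counts are treated as nulls, not as ln(0)
    return math.log(tag_count * 10000 / total_tags)
```

For example, a tag seen 5 times in a library of 50,000 tags has a scaled frequency of 1 per 10,000 and therefore a log frequency of ln(1) = 0.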

III) Distance Calculations and Modifications to C clustering library

A Pearson correlation coefficient was calculated for all possible gene pairs for each platform as a measure of expression similarity. These calculations were performed by a modified version of the C clustering library (de Hoon et al. 2004) on 64-bit Opteron Linux machines with 8-32 GB of memory (code available upon request). In all platforms, each gene is represented by a vector of expression values across all the experiments in the data set. A gene has a null value for an experiment if it is not represented on that array (cDNA), no tags were observed (SAGE), or its intensity was not significantly detected (Affymetrix). Thus, when calculating Pearson distances between gene pairs, the number of shared data points varied from zero to the total number of experiments. A minimum number of common experiments (MCE) was required for each gene pair to provide some confidence in the value calculated (e.g. a Pearson distance based on observations from only two experiments is meaningless). This MCE was 95 for Affymetrix, 28 for cDNA and 23 for SAGE.
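The idea of computing a correlation only over experiments where both genes have non-null values, subject to an MCE threshold, can be sketched in Python as follows. This is not the modified C clustering library code; the function name pearson_with_mce is hypothetical, and the sketch returns the correlation coefficient r rather than a distance.

```python
import math


def pearson_with_mce(x, y, mce):
    """Pearson r over experiments where both genes are non-null.

    x, y: expression vectors of equal length; None marks a null value.
    mce: minimum number of common experiments required.
    Returns (r, n_common); r is None if n_common < mce or r is undefined.
    """
    # Keep only experiments in which both genes have a value.
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    if n < mce:
        return None, n

    mean_x = sum(a for a, _ in pairs) / n
    mean_y = sum(b for _, b in pairs) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    sxx = sum((a - mean_x) ** 2 for a, _ in pairs)
    syy = sum((b - mean_y) ** 2 for _, b in pairs)
    if sxx == 0 or syy == 0:
        return None, n  # a constant vector has no defined correlation

    return sxy / math.sqrt(sxx * syy), n
```

Returning the number of common experiments alongside r lets a caller apply a stricter threshold later without recomputing the correlation.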

Open Source Clustering Software Webpage
M.J.L. de Hoon, S. Imoto, J. Nolan, and S. Miyano: Open Source Clustering Software. Bioinformatics, 20 (9): 1453-1454 (2004).

Document explaining modifications

IV) GO Analysis

Methodology
Perl Source Code

V) Other recently published coexpression datasets

A) Lee HK, Hsu AK, Sajdak J, Qin J, Pavlidis P. 2004. Coexpression analysis of human genes across many microarray data sets. Genome Res. 14(6):1085-94.
TMM website
TMM abstract

B) Jensen LJ, Lagarde J, von Mering C, Bork P. 2004. ArrayProspector: a web resource of functional associations inferred from microarray expression data. Nucleic Acids Res. 32(Web Server issue):W445-8.
ArrayProspector website
ArrayProspector abstract

VI) High-confidence coexpression set

A set of high-confidence coexpressed gene pairs was chosen from the three datasets using the following criteria. These coexpression links are being used by the Genome Sciences Centre Gene Regulation (Informatics) Team to predict human regulatory elements as part of the cisRED project (www.cisred.org). Note: A gene pair may be present in the list multiple times if it passes more than one of the following criteria.

High-confidence Coexpression Criteria (version 1):

Two-platform combined (2PC) method:
1. Minimum Common Experiments (MCE): Affy and cDNA: 100, SAGE: 25
2. 2PC average Pearson: r_avg >= 0.65
TMM method: TMM >= 7
ArrayProspector method: AP >= 0.7
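As a rough illustration, the criteria above could be applied to a single gene pair as in the following Python sketch. The exact rule for combining two platforms is not spelled out here, so the sketch makes assumptions: a pair passes 2PC if, for any two platforms that both meet their MCE, the mean of the two Pearson coefficients is at least 0.65, and the MCE of 100 applies to both Affymetrix and cDNA. The function name, platform keys, and input layout are hypothetical.

```python
from itertools import combinations

# Assumed reading of the MCE line: 100 for Affymetrix and cDNA, 25 for SAGE.
MCE = {"affy": 100, "cdna": 100, "sage": 25}


def passing_methods(platform_results, tmm=None, ap=None,
                    r_min=0.65, tmm_min=7, ap_min=0.7):
    """List every criterion a gene pair passes (possibly more than one).

    platform_results: dict platform -> (pearson_r or None, n_common_experiments).
    tmm, ap: TMM and ArrayProspector scores for the pair, or None if absent.
    """
    passed = []

    # Platforms with a defined correlation and enough common experiments.
    usable = {p: r for p, (r, n) in platform_results.items()
              if r is not None and n >= MCE[p]}

    # Assumed 2PC rule: average r over each pair of usable platforms.
    for p1, p2 in combinations(sorted(usable), 2):
        if (usable[p1] + usable[p2]) / 2 >= r_min:
            passed.append(f"2PC:{p1}+{p2}")

    if tmm is not None and tmm >= tmm_min:
        passed.append("TMM")
    if ap is not None and ap >= ap_min:
        passed.append("ArrayProspector")
    return passed
```

Returning a list rather than a single flag mirrors the note that a gene pair may appear more than once when it passes several criteria.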

Downloads

13145 co-expressed gene pairs for 2979 genes

High-confidence Coexpression Set (Uniprot)

VII) Updates to high-confidence coexpression set

As part of our studies of gene regulation the analysis presented above will be updated and expanded to include new expression data, new species, and new analysis methods. Updated high-confidence coexpression sets will be provided on our coexpression resources page.

References

Davidson GS, Wylie BN, Boyack KW. 2001. Cluster stability and the use of noise in interpretation of clustering. Proc. IEEE Information Visualization 2001, 23-30.

de Hoon MJL, Imoto S, Nolan J, Miyano S. 2004. Open Source Clustering Software. Bioinformatics. 20(9):1453-1454.

Jensen LJ, Lagarde J, von Mering C, Bork P. 2004. ArrayProspector: a web resource of functional associations inferred from microarray expression data. Nucleic Acids Res. 32(Web Server issue):W445-8.

Lee HK, Hsu AK, Sajdak J, Qin J, Pavlidis P. 2004. Coexpression analysis of human genes across many microarray data sets. Genome Res. 14(6):1085-94.

Stuart JM, Segal E, Koller D, Kim SK. 2003. A gene-coexpression network for global discovery of conserved genetic modules. Science. 302(5643):249-55.

Page last modified Nov 26, 2007
