Spark User Guide


1. Overview

2. Getting Started

3. Components of the Visualization

4. New Analysis

5. Output Files

6. Command-line Options


1. Overview


Spark is a clustering and visualization tool that enables the interactive exploration of genome-wide data sets. Popular genome browsers display data along the genomic x-coordinate and organize each data sample as a distinct row or "track". While powerful for integrating diverse data types, this view is inherently limited to individual genomic loci and it can be difficult to obtain a global overview of the predominant data patterns.

Our approach targets genomic regions of interest, such as transcriptional start sites or a user-defined set of data enrichment peaks, and extracts a data matrix for each region. The matrix rows correspond to data samples and the columns map to genomic positions. These matrices are clustered and the summarized data for each cluster is visualized in an interactive display.

Selection of any cluster reveals a scrollable panel of the cluster's individual members, enabling the data for each individual region to be easily viewed. This approach utilizes data clusters as a high-level visual guide, while facilitating detailed inspection of individual regions. The detailed view links to existing genome browser displays to take advantage of their wealth of annotation and provide a richer biological context.



2. Getting Started


To get a feel for how Spark works, click on the "Example" icon on the opening screen to load an example clustering.



3. Components of the Visualization


3.1 Cluster Sizes

A size bar drawn above the clusters displays the absolute number of member regions in each cluster. The total number of regions is reported on the far left.

3.2 Cluster Display

The upper panel displays the clusters arranged from left to right in descending order by number of member regions. Each cluster displays the average values of its member regions and thus provides a global overview of its pattern. The sample names are available upon mouseover.

3.3 Region Browser

The Region Browser is a scrollable panel in which the individual members of the currently selected cluster can be explored. Only five regions are displayed at one time. If the "Name" attribute was specified in the regions GFF3 file (see section on "New Analysis: Step 2: Input Regions"), then these names will appear above the corresponding regions and are also searchable in the Search box (upper right). If the names are too long to be displayed without overlaps, then they are made available upon mouseover, as are the sample names and region coordinates.

3.4 Linking to the UCSC Genome Browser

Right-clicking on an individual region in the Region Browser will provide a context menu with the option to link to the UCSC Genome Browser. This link will open the UCSC Genome Browser at the corresponding genomic coordinates.
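For example, in an hg19 analysis, a region spanning chr1:26370-32369 would open a UCSC Genome Browser URL of the following form (shown here for illustration only):

http://genome.ucsc.edu/cgi-bin/hgTracks?db=hg19&position=chr1:26370-32369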

3.5 Search

The search box on the upper right allows you to search for a region name and find it in the clustering. Only "Name" attributes in the GFF3 regions file are indexed for searching. Currently, if two regions have the same "Name" attribute, then "-1" and "-2" will be appended to the root name to distinguish the two instances (e.g. two regions both named WASH7P would be searchable as WASH7P-1 and WASH7P-2).

3.6 Interactive Cluster Splitting

To split a cluster within the visualization tool, right-click on the target cluster and select the "Split cluster" menu option. This will launch a new k-means clustering with k = 2 using only the regions within the selected cluster, effectively splitting it into two new clusters. While the clustering is running, the target cluster will appear grey. However, the clustering is performed on a separate thread, so you can still interact with the unaffected clusters.
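Conceptually, the split is equivalent to the following Python sketch, which re-clusters the selected cluster's members with k-means (k = 2). This is an illustration only, not Spark's internal code; it assumes scikit-learn and a NumPy matrix holding the binned, normalized values of the cluster's member regions:

# Illustrative sketch of "Split cluster": re-run k-means with k = 2
# on only the selected cluster's member regions.
import numpy as np
from sklearn.cluster import KMeans

def split_cluster(member_values):
    # member_values: (n_regions, n_samples * n_bins) array of binned,
    # normalized data values for the selected cluster's regions
    labels = KMeans(n_clusters=2).fit_predict(member_values)
    return member_values[labels == 0], member_values[labels == 1]

# e.g. 50 regions, 2 samples x 20 bins (random data for illustration)
cluster_a, cluster_b = split_cluster(np.random.rand(50, 40))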

3.7 Gene Ontology (GO) Analysis

To launch a GO analysis on the regions within a given cluster, right-click on the target cluster and select the "Launch GO analysis" menu option. This will upload the region IDs from the selected cluster to the DAVID website, enabling you to view GO classifications. This option only functions if the input GFF3 regions file contains the attribute "ID" (see section on "New Analysis: Step 2: Input Regions").

There is a limit on the number of IDs that can be uploaded in this fashion at one time, so you may encounter a dialog with the option to 'Copy and Launch'. This will copy the region IDs to the clipboard and launch the DAVID website. Once the site is loaded, paste your ID list into the 'Upload' tab. Note that you will also have to "Select Identifier" ("REFSEQ_MRNA" for the example cluster files) and a "List Type" (select "Gene List").

3.8 File Menu

Most of these options are straightforward. However, note that Spark writes its output to a directory (not a single file). As such, you need to select an output directory when using the "Open..." and "Save As..." options.

4. New Analysis


Step 1: Input Data Tracks


You must specify at least one input data track for the clustering analysis. Spark provides access to data tracks generated by the NIH Roadmap Epigenomics Project in addition to your own data files. Currently only wig and bigWig formats are supported: http://genome.ucsc.edu/goldenPath/help/wiggle.html

Epigenome Atlas Tracks

Select one or more tracks and click on 'Add' to add your selections to the Input Data Tracks table.

For more information, visit www.epigenomeatlas.org

ENCODE Atlas Tracks

Select one or more tracks and click on 'Add' to add your selections to the Input Data Tracks table.

For more information, visit http://genome.ucsc.edu/ENCODE/downloads.html

Local Data Tracks

Click on 'Browse' to choose one or more files from your local computer, then click on 'Add' to add these selections to the Input Data Tracks table.

Reference Genome

This optional field allows you to specify the reference genome of your input data tracks (e.g. hg19). Spark only uses this information when creating URL links to an external genome browser (see section "3.4: Linking to the UCSC Genome Browser"). If you have selected an Epigenome Atlas Track, then this field will default to the correct human reference genome.

Input Data Tracks table

This table displays information about your input data tracks. You must add at least one track to this table to continue. The 'Sample Name' column is automatically populated with names parsed from the "name" field in each input data file, or with the file name if this field is missing. You can double-click on sample names in this column to edit them. You can also change the display color for each sample by using the right-click menu.


Step 2: Input Regions


You must specify a single set of region coordinates. Spark will cluster only those data from your input data tracks that map to these regions.

Spark Region Sets

Spark provides several region sets for your convenience. The Epigenome Atlas Tracks have been preprocessed with these region files, so a clustering analysis will run faster when one of these region sets is used.

Local GFF File

Alternatively, you can select one of your own region sets to use in the analysis by clicking on 'Browse'. This file must be in GFF format, with tab-delimited chromosome, source, feature, start, end, score, strand, and frame fields, followed by ";"-separated attributes. Please see the Sequence Ontology Project site for a full specification of this format: http://www.sequenceontology.org/gff3.shtml
For example:

chr1 ucsc TSS 26370 32369 . - . ID=NR_024540 ; Name=WASH7P

Note that the end coordinate is inclusive: e.g. a feature from 1 to 10 has a length of 10 nt (not 9 nt). The regions can be of variable length. The "Name" attribute is later used for region searching (see section "3.5: Search"). In addition, the GFF3 attribute with the key "ID" is required if you want to use the interactive GO analysis feature (see section "3.7: Gene Ontology (GO) Analysis" for details).
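To make the expected structure concrete, here is a minimal Python sketch that parses such a GFF3 line and extracts the "ID" and "Name" attributes used by the GO analysis and Search features. This is illustrative only and simplifies the full GFF3 specification:

# Minimal GFF3 line parser (illustrative; see the full GFF3 spec).
def parse_gff3_line(line):
    chrom, source, feature, start, end, score, strand, frame, attrs = \
        line.rstrip("\n").split("\t")
    attributes = {}
    for pair in attrs.split(";"):
        if "=" in pair:
            key, value = pair.strip().split("=", 1)
            attributes[key] = value
    # The end coordinate is inclusive: length = end - start + 1
    return chrom, int(start), int(end), strand, attributes

line = "chr1\tucsc\tTSS\t26370\t32369\t.\t-\t.\tID=NR_024540 ; Name=WASH7P"
chrom, start, end, strand, attrs = parse_gff3_line(line)
print(attrs["ID"], attrs["Name"], end - start + 1)  # NR_024540 WASH7P 6000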

Regions Label

This optional field allows you to label your region set. The label will be used in the window title bar in the Spark display and it can be a helpful reminder when viewing a clustering.


Step 3: Settings

Output Directory

Click on 'Browse' to specify an output directory in which Spark will write its output files. You can later use the "File:Open..." option to select this directory and view your clustering analysis again.

Number of bins

Data from across the input regions are first binned prior to clustering. The "Number of bins" parameter specifies the number of (equally sized) bins to use for each region. The provided default is 20.

For example, if you use the default and your regions have variable lengths, as is the case for gene boundaries, then each gene will be divided into 20 equally sized bins for the purposes of clustering and display, but the absolute bin size in nucleotides will likely differ between genes.
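The following Python sketch illustrates this per-region binning. It is illustrative only; Spark's actual binning code may differ in details such as the handling of partial bins:

# Divide one region's values into num_bins equally sized bins and
# average within each bin (illustrative sketch).
def bin_region(values, num_bins=20):
    n = len(values)
    binned = []
    for b in range(num_bins):
        lo = b * n // num_bins
        hi = (b + 1) * n // num_bins
        chunk = values[lo:hi]
        binned.append(sum(chunk) / len(chunk) if chunk else 0.0)
    return binned

# A 6000 nt region yields 20 bins of 300 nt each; a 1000 nt region
# would instead yield 20 bins of 50 nt each.
print(len(bin_region(list(range(6000)))))  # 20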

Number of clusters

The user must specify how many clusters Spark should initially generate. The provided default is 3. See the section on Interactive Cluster Splitting to learn more about how clusters can subsequently be subdivided.

Normalization

Data in each input data track are normalized separately to be in the range 0.0 to 1.0. This is important for the clustering step to ensure that equal weighting is given to the different data tracks that may have inherently different dynamic ranges.

The normalization is either done 'globally' (default), in which case the highest values across the whole genome from a given data track are assigned a 1.0 and, similarly, the lowest values across the whole genome are assigned a 0.0. In contrast, a 'regional' normalization only considers the input regions, not the whole genome. As a result, 0.0 and 1.0 indicate the minimum and maximum values, respectively, for a given data track across the input regions.

Spark employs the same normalization scheme as used by ChromaSig (Hon G, Ren B, and Wang W, PLoS Comput Biol. 2008 Oct;4(10):e1000201):

x'_{h,i} = 1 / (1 + e^(-(x_{h,i} - median(x_h)) / std(x_h)))

for bin i and sample h, where median(x_h) and std(x_h) are the median and standard deviation of the values for sample h (computed either globally or regionally, as described above). The result is that all binned values are normalized to be between 0.0 and 1.0, with the median value mapping to a normalized value of 0.5.
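A minimal Python sketch of this normalization follows. It is illustrative only; in Spark, the median and standard deviation come from the precomputed global or regional statistics, depending on the selected normalization:

import math
import statistics

# ChromaSig-style sigmoid normalization of one sample's binned values
# (illustrative sketch).
def normalize(values):
    med = statistics.median(values)
    sd = statistics.stdev(values)
    return [1.0 / (1.0 + math.exp(-(x - med) / sd)) for x in values]

# The median value maps to exactly 0.5; extreme values approach 0.0/1.0.
print(normalize([1.0, 2.0, 3.0, 4.0, 100.0]))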


5. Output Files


Spark outputs several files to the analysis directory that you specify. You do not need to know about their formats or functions in order to use the tool. You can simply use the graphical interface to create and open analyses. However, some of the output files may be useful to you for downstream analyses, such as the clusters.gff file. They are described in detail below:


properties.txt

This file captures all of the parameters of your clustering analysis. If you ever wish to run Spark on the command-line, you will need to generate a properties.txt file (see Section 6: Command-line Options).

clusters

There are two files inside the clusters subdirectory. clusters.gff is identical to your input regions GFF file, but has been annotated to include the regions' cluster assignments (attribute "cID"). This can be a useful file if you wish to do your own downstream analyses on the clusters.
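For example, a downstream script might group region names by their cluster assignment. The following Python sketch is illustrative only and reuses the simplified attribute parsing shown in Section 4:

# Group region "Name" attributes by cluster assignment ("cID")
# from a clusters.gff file (illustrative sketch).
from collections import defaultdict

def regions_by_cluster(path):
    clusters = defaultdict(list)
    with open(path) as fh:
        for line in fh:
            if line.startswith("#") or not line.strip():
                continue
            attrs_field = line.rstrip("\n").split("\t")[8]
            attrs = dict(pair.strip().split("=", 1)
                         for pair in attrs_field.split(";") if "=" in pair)
            clusters[attrs.get("cID")].append(attrs.get("Name"))
    return clusters

# e.g. regions_by_cluster("myAnalysis/clusters/clusters.gff")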

The second file, clusters.values, simply specifies the precomputed cluster averages to display in Spark.

stats

Spark computes statistics (such as mean, standard deviation, etc.) for each input data file. These are calculated both globally (considering the whole input data file) and regionally (considering only the specified input regions) and are captured in the stats subdirectory as x__global.stats and x__regional.stats files, respectively.

tables

Spark stores a data table as a binary .dat file in the tables subdirectory for each input data file. Each table contains the data values from only the specified input regions. This enables Spark to avoid reparsing the input data files on subsequent reloads, thus improving performance.


6. Command-line Options


Running Spark on the command-line can be useful if you wish to preprocess many files in batch.


1. Download the jar file from:

http://www.bcgsc.ca/downloads/spark/v1.1.0/lib/Spark-v1.1.0.jar

2. You can see the command-line options by running:

"java -jar Spark-v1.1.0.jar --help"

You should see the following output:

Spark v1.1.0

Usage:
java -jar Spark-v1.1.0.jar -p directoryPath (Preprocessing mode)
java -jar Spark-v1.1.0.jar (GUI mode)

Options:
-h [--help]
-p [--preprocessing] - Must specify an analysis directory to use
If no options are specified, the GUI is launched.

3. You can then create a directory for your analysis (e.g. "mkdir myAnalysis") and create a "properties.txt" file inside this directory. This file specifies all of the parameters for your analysis. Here is an example:


dataFiles=http://www.genboree.org/EdaccData/Release-5/sample-experiment/H1_Cell_Line/Histone_H3K4me3/UCSF-UBC.H1.H3K4me3.H1EScd1-me3K4-A.wig.gz,http://www.genboree.org/EdaccData/Release-5/sample-experiment/H1_Cell_Line/MRE-Seq/UCSF-UBC.H1.MRE-Seq.HS1052.wig.gz
sampleNames=UCSF-UBC.H1.H3K4me3.H1EScd1-me3K4-A,UCSF-UBC.H1.MRE-Seq.HS1052
regionsFile=http://www.bcgsc.ca/downloads/cydney-test/EdaccRegions/hg19/TSS/tss_hg19_+-3000_noNeighbors.gff
org=hg19
regionsLabel=TSS
numBins=20
k=3
normType=exp
statsType=global
colLabels=blue,blue

dataFiles
This is a comma-separated list of files, which can be local paths or URLs (or both). Currently only wig or bigWig formats are supported. These are equivalent to the "Input Data Tracks" described in Section 4: New Analysis.

sampleNames
A comma-separated list of sample names (in the same order as the data files above).

regionsFile
A local path or URL to a single GFF file. This is equivalent to the "Input Regions" described in Section 4: New Analysis.

org
Reference genome (e.g. "hg19"). See a full description under "Reference Genome" in Section 4: New Analysis.

regionsLabel
An optional label for your regions set. See a full description under "Regions Label" in Section 4: New Analysis.

numBins
The number of bins to use per region. See a full description under "Number of bins" in Section 4: New Analysis.

k
The number of clusters to generate.

normType
Specifies which type of normalization to use. Currently, Spark provides only the ChromaSig normalization scheme, specified with "exp" (see "Normalization" under Section 4: New Analysis).

statsType
Specifies how to compute the statistics, either "global" or "regional". See "Normalization" under Section 4: New Analysis.

colLabels
A comma-separated list of colors to use in the visualization (one per input data file). Supported options are "blue", "black", "green", "orange", "pink", and "purple".

4. Run your analysis:

"java -Xms256m -Xmx1024m -jar Spark-v1.1.0.jar -p myAnalysis"

The "-Xms" and "-Xmx" specify the min and max heap size for Java. You may need to increase "-Xmx" for memory intensive jobs.

5. Open your new analysis in the graphical interface:

"java -Xms256m -Xmx1024m -jar Spark-v1.1.0.jar"

Select your analysis directory using the "File:Open..." option.