Slider

Maximum use of probability information for alignment of short sequence reads and SNP detection.

Project Description

Slider is replaced by SliderII (March, 2009)

http://www.bcgsc.ca/platform/bioinfo/software/SliderII

Slider paper

Malhis et. al. 2008

Abstract

Slider is an application for the Illumina Sequence Analyzer output that uses the probability files instead of the sequence files as an input for alignment to a reference sequence or a set of reference sequences.

Motivation: A plethora of alignment tools have been created that are designed to best fit different types of alignment conditions. While some of these are made for aligning Illumina Sequence Analyzer reads, none of these are fully utilizing its probability (prb) output. In this paper, we will introduce a new alignment approach (Slider) that reduces the alignment problem space by utilizing each read base’s probabilities given in the prb files.

Results: Compared with other aligners, Slider has higher alignment accuracy and efficiency. In addition, given that Slider matches bases with probabilities other than the most probable, it significantly reduces the percentage of base mismatches. The result is that its SNPs predictions are more accurate than other SNPs prediction approaches used today that start from the most probable sequence, including those using base quality.

Using Slider

Slider consists of five main commands and some utilities which are all java classes. Each of these commands/utilities requires some parameters that are listed in a configuration file. This is explained in further detail under "Running Slider".

The five main commands are:

CreateDB.java for generating a reference database table: This table is generated using a sliding window of size SZ_d, where every subsequence of size SZ_d from the reference and its reverse complement are included. Up to two undefined bases, noted as Ns in each subsequence, are allowed by substituting all four bases for every N and generating all possible sub-sequences. Finally, this database table is lexicographically sorted. Generating a reference database table needs to be done once for each reference sequence. For example, the human genome reference database table has approximately 6 billion records (counting both forward and reverse complement) that are sorted in lexicographical order. For the human genome, this table was generated in less than 24 hours on a single CPU and used about 160 GB of disk space. Once a reference database table is generated, it can be used for every alignment to that reference, and there is no need for it to be regenerated with each alignment.
CreateSequences.java for generating the P0_Reads Table: Using prb values, we first generate read sequences R_ps for each prb line as explained earlier. One unique id is given for all reads that are generated from the same prb line source. When a prb line has a number of non-crisp bases larger than some threshold value UD_max (a value of 11 is used), only the MPS (most probable sequence) is generated and is marked as a low quality, LQ. LQ sequences will not be used for SNP prediction. The table of all reads generated from all prb lines is then lexicographically sorted to form the P0_Reads table.
Alignment.Java: Find read locations on the reference sequence with an exact match and one-off match (one base mismatch) to prb derived sequences.
Separation.java: for each reference sequence, aligned reads are then separated into two file: a unique match file and a multiple match file.
SNPsPrediction.java: For each reference sequence, a SNP prediction table, a coverage information file, and a set of coverage histogram files are created from the unique matches file of a reference sequence.

Compiling Slider

Slider must first be compiled for your operating environment.

javac JavaCode/*.java

Running Slider

The package includes a set of sample data and a configuration file. Slider requires a configuration file to tell it where to look for the prb file(s) and the reference file and the location for storing results. In this example, the configuration file is found in:

input/T02

It's contents include a parameter, a space, and the value. The location of this file is arbitrary and is specified on the command line as shown later when running Slider commands. In this example set, this file looks like:

oMxSize  27
skip     0
ETag     SET
DATASET  T02/
DATABASE T02/
genome_dir data/seqDB/T02/
pathDB     data/Database/
pathPRB    data/PRBs/
pathData   data/OUT/
ch_name1
SEQ_VAR  T02_ref
ch_name2 .fa
pathMulti
pathParallel

This can be used as a template for future assemblies. A suggested format is included here. These parameters describe the read length (oMxSize), dataset and database names, the folders containing the reference sequence (genome_dir), database, input prb files, and output folders. We also provide information on the reference sequence files. For this sample dataset:

Each read is 27 bases, if the prb files have lines for more than 27 bases, only the first 27 bases will be used. No starting bases will be skipped ('skip' is set to 0).
Its output will be placed in: /data/OUT/T02
Its prb files are in: /data/PRBs/T02/*/*prb*. Note that Slider requires a subdirectory (of any name) inside the PRBs/dataset directory (T02 in this case) that will hold the prb files. This allows flexibility of using prb files with the same name.
Its reference genome is in: /data/seqDB/T02/
Its reference genome file is: T02_ref.fa. This is a concatenation of SEQ_VAR and ch_name2. The parameter, ch_name1 would be place before SEQ_VAR if this is required. This allows for flexibility in defining a list of reference sequences to align to. For example, to align to two reference sequences these lines in the configuration file could look like:
```
ch_name1 chr
SEQ_VAR  1 2
ch_name2 .fa
```
In this case, the alignment would occur on the reference files chr1.fa and chr2.fa which are located in the the folder that 'genome_dir' is set to.
The reference genome database will be placed in: /data/Database/T02/
pathMulti and pathParallel are used in properties that are under development.

Once the parameters file for a dataset is created, each command/utility can be used as:

java –cp . command parametersFile

For example, using the dataset provided, if we are using the reference sequence for the first time, we need first to generate its database table:

java -cp . -Xmx1G JavaCode/CreateDB  input/T02

Then, to generate the P0_Reads table using prb values for a dataset T02, then we use the command CreateSequences as:

java -cp . -Xmx1G JavaCode/CreateSequences  input/T02

To align these sequences to the reference sequence, we use Alignment as:

java -cp . -Xmx1G JavaCode/Alignment  input/T02

And then we need to separate the alignment result into different files associated with each reference sequence using the Separate command:

java -cp . -Xmx1G JavaCode/Separate  input/T02

And finally SNPs prediction can be done using:

java -cp . -Xmx1G JavaCode/SNPsPrediction  input/T02

This will show homozygous SNPs identified with a minimum coverage of 2 (nmCount).

SNP calling parameters can be adjusted as follows:

java –cp . -Xmx1G JavaCode/SNPsPrediction input/T02 [[c] t]

where c is the minimum nmCount coverage for SNPs <default is 2>

and t can be either 1 or 2 <default is 1>:

If t=1: Slider will search for homozygous SNPs only.

If t=2: Slider will search for homozygous and heterozygous SNPs.

Results

For viewing alignment information, this command will send the alignment results to output.txt:

java -cp . -Xmx1G JavaCode/display  input/T02 data/OUT/T02/Results/T02_ref.fa.um.s.id/ n

This will display the first n reads of the alignment. These include unique matches or U0 or U1. The default is the first 10 lines and to display all, set n to -1.

0:      7       3013629 F       1.0     36      986311278       Crisp   MPS     S0      UM      U28     u0      SET     CAAACTAACC AGCTTCACTA GGCTCCTGAC TACCTG         ZZ[[[\\]]] ^^^^^^^^^^ \\\\\ZZZZZ XTTTXW 36
1:      7       3013629 F       1.0     36      978646192       Crisp   MPS     S0      UM      U28     u0      SET     CAAACTAACC AGCTTCACTA GGCTCCTGAC TACCTG         ZZ[[[\\]]] ^^^^^^^^^^ \\\\\ZZZZZ XTTTXW 36
2:      7       3014948 R       1.0     36      989971955       Crisp   MPS     S0      UM      U25     u0      SET     GCCAAGGGAA ATTTTGAAGA TGTGTTGAAA AATAAT         WXTTTXZZZZ Z\\\\\^^^^ ^^^^^^]]]\ \[[[ZZ 36
3:      7       3014948 R       1.0     36      982306869       Crisp   MPS     S0      UM      U25     u0      SET     GCCAAGGGAA ATTTTGAAGA TGTGTTGAAA AATAAT         WXTTTXZZZZ Z\\\\\^^^^ ^^^^^^]]]\ \[[[ZZ 36
4:      7       3017750 R       1.0     36      997039988       GQ      PS      S0      UM      U21     u0      SET     AGAAAGAAaa ATATAAAAAT GCTAATATTG TGTATC         DXTTTEZZAD ZJ\\\\^^F] ^^B^^M]]]\ \[[[ZZ 36      G9 T10
5:      7       3017856 F       1.0     36      993872057       Crisp   MPS     S0      UM      U14     u0      SET     AACGGCCCAT TTCTATCATC CAACCAAATG ATCTTT         ZZ[[[\\]]] ^^^^^^^^^^ \\\\\ZZZZZ XTTTXL 36
6:      7       3020029 R       1.0     36      988863573       Crisp   MPS     S0      UM      U16     u1      SET     ATTTTAGGAg TGTGTTTATT TCTGAGACGT CTCCTT         WXDTTXZZZZ Z\\\\\^^^^ ^^^^^^]]]\ \[[[ZZ 36      <A10 (Q-26)>
7:      7       3020029 R       1.0     36      981198487       Crisp   MPS     S0      UM      U16     u1      SET     ATTTTAGGAg TGTGTTTATT TCTGAGACGT CTCCTT         WXDTTXZZZZ Z\\\\\^^^^ ^^^^^^]]]\ \[[[ZZ 36      <A10 (Q-26)>
8:      7       3021071 R       1.0     36      992052084       GQ      MPS     S0      UM      U17     u0      SET     TAGCCTCACA AGAGACAGTC ATATCTGGGT CCCTTC         WXTTTXZTZZ Z\VJF\^^^^ ^^J^^O]]]\ \[[[ZZ 36
9:      7       3021085 R       1.0     36      984649984       Crisp   MPS     S0      UM      U31     u0      SET     cCAGcCATAT CTGGGTCCCT TCAGCAAAAT CTTGCT         IXTTTXZZZZ Z\\\\\^^^^ ^^^^^^]]]\ \[[[ZZ 36      A1 <T5 (Q-24)>

Above is the information for the first 6 matches in this alignment. The columns are as follows: read id, reference aligned to, reference coordinate of alignment start, forward or reverse strand, weight of read (see paper for further details), read length, alignment id, quality of read ("Crisp", "LQ" for low quality, or "GQ" - see paper for further details), MPS (most probable sequence) or RS, read end (if a PET run, can be either S0 or S1), SET (single end tag) or PET (paired end tag), UM = unique match or MM=multiple match, U0 (no mismatch bases) or U1 (1 base mismatch; a number follows with the location of the mismatch), sequence of aligned read.

To display the first n reads from the oligo input table (before alignment) s.ol0:

java -cp . -Xmx1G JavaCode/display  input/T02 data/OUT/T02/s.ol0/ n

If n is larger than the number of reads in the oligo table, the full oligo table is displayed. This can be used to convert an oligo table to a text format. The contents of this file is sorted lexographically.

994:    0.23    27      999998827       GQ      RS      S0       AATCGCTTGA ACCTGAGAGG TGGAAGA
995:    0.74    27      999998827       GQ      MPS     S0       AATCGCTTGA ACCTGAGAGG TGGAAGT
996:    0.97    27      999999582       Crisp   MPS     S0       AATCTCCCAG CGTGGTTTAA TCAGACG
997:    0.0     27      999996886       GQ      RS      S0       AATCTCCCCA GTGACACTTC TCCAAGG
998:    0.05    27      999996886       GQ      RS      S0       AATCTCCCCA GTGACACTTC TCCCAGG
999:    0.04    27      999996886       GQ      RS      S0       AATCTCCCCA GTGTCACTTC TCCAAGG

Above is extracted from the sample oligo table.

Final list of SNPs produced by SNPsPrediction is located at: data/OUT/T02/Results/ScoreResults/T02_ref.fa.snp. For this small test dataset, eight SNPs were found.

84      15698   197088613       G       A       8       8       4       44      30      A       100     0       0       0       Novel   H1
3       15699   197091993       G       T       0       0       2       8       0       -       0       0       0       100     Novel   H1
50      15700   197104764       A       C       2       2       2       0       26      C       0       100     0       0       Novel   H1
13      15701   197104822       C       T       1       1       2       92      0       -       0       0       0       100     Novel   H1
77      15702   197112635       C       A       2       2       2       81      26      A       100     0       0       0       Novel   H1
77      15703   197140911       T       G       2       2       2       64      28      G       0       0       100     0       Novel   H1
29      15704   197141740       G       C       1       1       2       89      0       -       0       100     0       0       Novel   H1
13      15705   197170288       A       G       1       1       2       92      0       -       0       0       100     0       Novel   H1
13      15706   197190959       C       T       1       1       2       92      0       -       0       0       0       100     Novel   H1

Please see SNPs Table for a description of this file.

Expanded Table 3 (from the Slider paper) is available which includes extended information on the alignment accuracy of Slider compared to Eland and RMAP.

Full datasets

Two larger datasets are included here for Slider testing:
http://www.bcgsc.ca/downloads/slider/T02_full_data.tar.gz
http://www.bcgsc.ca/downloads/slider/CT302_full_data.tar.gz

Current Release
Slider 0.6

Released Nov 04, 2008

This is the first release of Slider. A seed-based release of Slider will be available at the end of November.
More about this release…

Get Slider for all platforms: Slider Package

All Releases

Version	Released	Description	Compatibility	Licenses	Status
1.0		Slider II More about this release…		AFL	pre-release
0.6	Nov 04, 2008	This is the first release of Slider. A seed-based release of Slider will be available at the end of November. More about this release…		AFL	final

Personal tools