High quality SNP calling using Illumina data at minimal coverage

Project Description

SliderII paper

Under review.

Original Slider paper

Malhis et. al. 2008 - High Quality Alignment and SNP Calling for Illumina High Throughput Sequence Data

SliderII Results of SNPs concordance comparison to Maq

We used 68 lanes human whole genome shotgun sequencing, WGSS, PET data (real data), a total of 906 million reads sizes from 36 to 42 bases, using each of SliderII and MAQ, we aligned these reads to the human genome hg18 resulting a coverage of about 7.5 X. For each aligner, SNPs are sorted in descending order, using SNPs score for SliderII, and the Phred-like consensus quality score for MAQ.  Concordance with Ensembl Variation 50 SNPs is used to compare SNPs calling accuracy.

Figure 1 show this concordance of top scored SNPs for SliderII without using known SNPs as priors (SliderII A line), and with known SNPs priors to be used in the alignment and SNP calling processes (SliderII B) and for MAQ at minimum coverage depths of 1, 2, and 3.  In figure 1, while SliderII SNPs concordance is correlated with their scores, MAQ concordance dropped significantly for those SNPs with high quality score.  We contributed that to SNPs generated as a result of paralogous alignments since MAQ dose not have the capability of filtering out such SNPs, so, in order to reduce those inaccurately called SNPs in MAQ output, we filtered out SNPs with coverage higher than 15, figure 2.

when we have a concordance of x%, this means that:

1-     Our SNPs calling accuracy is at least x%. It is exactly x% when the sample data contain only known SNPs.

2-     The SNPs in our sample data has at most (1-x)% novel SNPs. It is (1-x)% when we have no calling errors.

So, for example, if the sample data contain y% novel SNPs, (and y <= 1-x), novel SNPs accuracy is z%, and z = y/(1-x).

Figure 1

Figure 1

Figure 2          

Figure 2


Slider is an application for the Illumina Sequence Analyzer output that uses the probability files instead of the sequence files as an input for alignment to a reference sequence or a set of reference sequences. 

Motivation: A plethora of alignment tools have been created that are designed to best fit different types of alignment conditions. While some of these are made for aligning Illumina Sequence Analyzer reads, none of these are fully utilizing its probability (prb) output.  In this paper, we will introduce a new alignment approach (Slider) that reduces the alignment problem space by utilizing each read base’s probabilities given in the prb files.

Results: Compared with other aligners, Slider has higher alignment accuracy and efficiency. In addition, given that Slider matches bases with probabilities other than the most probable, it significantly reduces the percentage of base mismatches.  The result is that its SNPs predictions are more accurate than other SNPs prediction approaches used today that start from the most probable sequence, including those using base quality.

Using Slider

Slider consists of seven main commands and some utilities which are all java classes.  Each of these commands/utilities requires some parameters that are listed in a configuration file.  This is explained in further detail under "Running Slider". 

These commands are described below:

  • for generating a reference database table: This table is generated using a sliding window of size SZd, where every subsequence of size SZd from the reference and its reverse complement are included. Up to two undefined bases, noted as Ns in each subsequence, are allowed by substituting all four bases for every N and generating all possible sub-sequences. Finally, this database table is lexicographically sorted. Generating a reference database table needs to be done once for each reference sequence. For example, the human genome reference database table has approximately 6 billion records (counting both forward and reverse complement) that are sorted in lexicographical order. For the human genome, this table was generated in less than 24 hours on a single CPU and used about 300 GB of disk space. Once a reference database table is generated, it can be used for every alignment to that reference, and there is no need for it to be regenerated with each alignment.
  • for generating the P0_Reads Table: Using prb values, we first generate read sequences Rps for each prb line as explained earlier. One unique id is given for all reads that are generated from the same prb line source. When a prb line has a number of non-crisp bases larger than some threshold value UDmax (a value of 11 is used), only the MPS (most probable sequence) is generated and is marked as a low quality, LQ.  LQ sequences will not be used for SNP prediction. The table of all reads generated from all prb lines is then lexicographically sorted to form the P0_Reads table.
  • Alignment.Java: Find read locations on the reference sequence with an exact match and one-off match (one base mismatch) to prb derived sequences.
  • Expand reads to include up to 3 mismatches
  • for each reference sequence, aligned reads are then separated into two file: a unique match file and a multiple match file.
  • Generates SNP and alignment information files.
  • ScoreSNPs: Adds a single score in the range 0-100 (where 100 is highest score) that coorelates with SNPs accuracy by combining other scores .

Compiling Slider

Slider must first be compiled for your operating environment.
javac JavaCode/*.java

Running Slider

The package includes a set of sample data and a configuration file.  Slider requires a configuration file to tell it where to look for the prb file(s) and the reference file and the location for storing results.  In this example, there are two configuration files one for a single end tag alignment, T02, and one for a paired end tag alignment which will be used in this example:

It's contents include a parameter, a space, and the value.  The location of this file is arbitrary and is specified on the command line as shown later when running Slider commands.  In this example set, this file looks like:

seedSize 31
skip     -1
ETag     PET
SNP_in   100000
genome_dir data/seqDB/BACPET/
KnownSNPs  data/seqDB/BACPET/
pathDB     data/Database/
pathPRB    data/PRBs/BACPET/
pathData   data/OUT/
SEQ_VAR    h03
ch_name2  .ref
This can be used as a template for future assemblies.  A suggested format is included here.  These parameters describe the SeedSize, dataset and database names, the folders containing the reference sequence in fasta format (genome_dir), database, input prb files, and output folders.  We also provide information on the reference sequence files.  For this sample dataset:
  • The seed size is 31.  The smaller the seed size  is, the faster the alignment will by.
  • skip sets the number of bases will be skipped from the start of each sequence.  In this case, it is set to -1; if Slider believes that the first base is adapter sequence (based on it repeating), this base will be eliminated.  Set to 0 tells Slider not to skip the first base and 1 or more tells it to skip the first number of bases specified.
  • Its output will be placed in: data/OUT/   (this path needs to be created before running Slider)
  • Its prb files are in: data/PRBs/BACPET/*/*prb*.  Note that Slider requires a subdirectory (of any name) inside the PRBs/DATASET directory that will hold the prb files.  This allows flexibility of using prb files with the same name by placing them in different subdirectories.  Note that Slider can read compressed .gz format files.
  • Its reference genome is in: data/seqDB/BACPET
  • Its reference genome file is: h03.ref.  This is a  concatenation of SEQ_VAR and ch_name2.  The parameter, ch_name1 would be place before SEQ_VAR  if this is required.  This allows for flexibility in defining a list of reference sequences to align to.  For example, to align to two reference sequences these lines in the configuration file could look like:
    ch_name1 chr
    SEQ_VAR  1 2
    ch_name2 .fa

    In this case, the alignment would occur on the reference files chr1.fa and chr2.fa which are located in the the folder that 'genome_dir' is set to.

  • The reference genome database will be placed in: data/Database/  (this path needs to be created before running Slider)
  • pathMulti and pathParallel are used in properties that are under development.

Once the parameters file for a dataset is created, each command/utility can be used as:

java –cp . command parametersFile

For example, using the dataset provided, if we are using the reference sequence for the first time, we need first to generate its database table.  The full path name to the configuration file is required:

java -cp . -Xmx1G JavaCode/CreateDB2  ./input/BACPET

If we want to incorporate known SNPs in our database, we place a file of known SNPs called SNPs in the location pointed to by KnownSNPs in the configuration file as the above example shows. 

The format of this file is:

"Reference ID"    "Chromosome name" "Position on Chromosome (bp)"     "Strand"  "Allele"  "Mapweight"       "Ancestral allele"
rs3     13      31344842        1       C/T     1       C
rs4     13      31345222        1       A/G     1       A
rs5     7       91677046        1       C/T     1
rs6     7       91585067        1       A/G     1       G

Then, to generate the P0_Reads table using prb values for the dataset, we use the command CreateSequences as:

java -cp . -Xmx1G JavaCode/CreateSequences  ./input/BACPET

To align these sequences to the reference sequence, we use Alignment and Extend:

java -cp . -Xmx1G JavaCode/Alignment  ./input/BACPET
java -cp . -Xmx1G JavaCode/Extend  ./input/BACPET

And then we need to separate the alignment result into different files associated with each reference sequence using the Separate command followed by Extend:

java -cp . -Xmx1G JavaCode/Separate  ./input/BACPET

The last step provides alignment information and SNP calls:

java -cp . -Xmx1G JavaCode/CallSNPs  ./input/BACPET
To score the SNP calls, use ScoreSNPs.  By combining other scores, it reflects the prediction accuracy (in the range 0-100 where 100 is highest score):
java -cp . -Xmx1G JavaCode/ScoreSNPs ./input/BACPET


For viewing alignment information, this command will send the alignment results to output.txt:

java -cp . -Xmx1G JavaCode/display ./input/BACPET data/OUT/BACPET/Results/ n

This will display the first n reads of the alignment.  These include unique matches or U0 or U1.  The default is the first 10 lines and to display all, set n to -1.

0:      h03     73027   F       1.0     37      1000000000      GQ      MPS     S0      UM      U12     u0      217     GGCTGGGGCC ACGGGGAGCC GGAGGGAGGC GGAGGGG        ZZ[[[\\]]] ^^^^^^^^^^ \\Q\\ZJZZU XXJXWXW        37
1:      h03     73207   R       1.0     37      1000000000      GQ      MPS     S1      UM      U11     u0      217     CCCGGTGCCC GCCGCGGAGC CCGCGCCCCC GGCCCCC        WNWSIAUZZZ ZZ\V\T\^^^ ^^^^^^^]]] \\[[[ZZ        37
2:      h03     66801   R       1.0     37      999999999       GQ      MPS     S0      UM      U9      u1      226     GGGCTGCGAG CCCCTCcTCT CTGAGCACGG ACTCCTG        NXIXOSXZZO ZZ\\\\\^^^ ^^^^^^^]]] \\[[[ZZ        37      <T17 (Q-28)>
3:      h03     66612   F       1.0     37      999999999       Crisp   MPS     S1      UM      U10     u0      226     CAGGCCTCTC GCCTTCTGTG GGGCAGGGCA GCCCCCA        ZZ[[[\\]]] ^^^^^^^^^^ \\\\VZZZZT XPRXSTO        37
4:      h03     66918   R       1.0     37      999999998       GQ      MPS     S0      UM      U9      u0      SET     CTCTGGCCAG CTGGACAGCT CTGCCCAAAT CTCAAAT        WLWQNSXZZP ZZ\\\\\^^^ ^^^^^^^]]] \\[[[ZZ        37
5:      h03     119779  F       1.0     37      999999998       LQ      MPS     S1      UM      U9      u0      SET     GCACAGGATT TCCCAGCTTC CCGCCCTTTT CCTCACa        ZZ[G[\\C]] ^?^]WG^]OQ OVABGRTGTD RBQL;K<        37      C37
6:      h03     61315   R       1.0     37      999999997       Crisp   MPS     S0      UM      U15     u0      217     CCCTTGTTAC TGTCCCCGCC CCATGCCTGG GTGGTGG        WXWXXXXZZZ ZZ\\\\\^^^ ^^^^^^^]]] \\[[[ZZ        37
7:      h03     61135   F       1.0     37      999999997       Crisp   MPS     S1      UM      U11     u0      217     CCAGAGTCCT GGAGAGCTTC CAACTCTGGC AGGGCCT        ZZ[[[\\]]] ^^^^^^^^^^ \\\\\ZZZZZ XXXXWXW        37
8:      h03     53404   F       1.0     37      999999995       Crisp   MPS     S0      UM      U10     u0      SET     TGAGGAGAGG TCAGAGAGCC AGCCCACCTC GTCGTGA        ZZ[[[\\]]] ^^^^^^^^^^ \\\\\ZZZZZ XPOXRXN        37
9:      h03     101674  F       1.0     37      999999994       Crisp   MPS     S0      UM      U12     u0      217     CTGTGGGGAG GGCTCGAGGC ACCGACTCAC ACTCCTG        ZZ[[[\\]]] ^^^^^^^^^^ \\\\\ZZZZZ XXXXWXW        37

Above is the information for the first 10 matches in this alignment.  The columns are as follows:
1.   Read id
2.   Reference name that read aligned to
3.   Reference coordinate/position of alignment start
4.   Forward or reverse strand
5.   Weight of read (see paper for further details)
6.   Read length
7.   Alignment id
8.   Quality of read ("Crisp", "LQ" for low quality, or "GQ" - see paper for further details)
9.   MPS (most probable sequence) or RS
10. Read end (if a PET run, can be either S0 or S1)
11. UM = unique match or MM=multiple match
12. How long read must be to be unique (short will be more accurate)
13. Base mismatch - u0 (no mismatch bases) or u1 (1 base mismatch)
14. SET (single end tag) or a number if PET (fragment size = gap + length of both reads)
15. Sequence of aligned read
16. Quality string - quality of read (qcal)
16. Length of read
17. Information on mismatch bases, blank if there are none

To display the first n reads from the oligo input table (before alignment) s.ol0: 

java -cp . -Xmx1G JavaCode/display  ./input/BACPET data/OUT/BACPET/s.ol0/ n

If n is larger than the number of reads in the oligo table, the full oligo table is displayed.  This can be used to convert an oligo table to a text format.  The contents of this file is sorted lexographically.

988:    1.0     31      999985733       Crisp   MPS     S1      u1       AAAAGTCAGA GGCCACCAAA GTCCGTTTCC C
989:    1.0     31      999983044       Crisp   MPS     S1      u1       AAAAGTCAGA GGCCACCAAA GTCCGTTTCC C
990:    0.01    31      999997594       GQ      PS      S0      u1       AAAAGTCGAG GAGGTGCTGC ATCCACCACA G
991:    0.0     31      999997594       GQ      PS      S0      u1       AAAAGTCGAG GAGGTGCTGC GTCCACCACA G
992:    1.0     31      999982737       Crisp   MPS     S0      u1       AAAAGTGACT CTGATGATGA GGAGTACAAG G
993:    0.0     31      999996814       GQ      PS      S0      u1       AAAAGTGCCC TTGCCTTCTC TATCCTCTTC A
994:    0.01    31      999996814       GQ      PS      S0      u1       AAAAGTGCCC TTGCCTTCTC TCTCCTCCTC A
995:    0.12    31      999996814       GQ      PS      S0      u1       AAAAGTGCCC TTGCCTTCTC TCTCCTCTTC A
996:    0.01    31      999996814       GQ      PS      S0      u1       AAAAGTGCCC TTGTCTTCTC TATCCTCTTC A
997:    0.03    31      999996814       GQ      PS      S0      u1       AAAAGTGCCC TTGTCTTCTC TCTCCTCCTC A
998:    0.38    31      999996814       GQ      MPS     S0      u1       AAAAGTGCCC TTGTCTTCTC TCTCCTCTTC A
999:    0.15    31      999987179       LQ      PS      S1      u1       AAAAGTGTAC ACATGTGGGC AAAAATACAT
Above is extracted from the sample oligo table. 

An unfiltered list of SNPs produced by SNPs is located at:  data/OUT/BACPET/Results/h03.ref.snps.  For this dataset, 290 SNPs were found (first ten displayed below).
1       11652   A       G       5       5       4       65      28      G       29      G       0       0       100     0       Novel   H1
2       12431   T       C       10      10      5       29      30      C       30      C       0       100     0       0       Novel   H1
3       12575   G       T       13      13      8       57      30      T       30      T       0       0       0       100     Novel   H1
4       13268   A       G       4       4       3       28      27      G       27      G       0       0       100     0       Novel   H1
5       13375   C       G       15      15      7       49      30      G       30      G       0       0       100     0       Novel   H1
6       13792   A       G       27      27      11      53      30      G       30      G       0       0       100     0       Novel   H1
7       13829   A       G       10      10      5       66      30      G       30      G       0       0       100     0       Novel   H1
8       14427   A       G       11      11      5       36      30      G       30      G       0       0       100     0       Novel   H1
9       14917   A       M       3       3       5       51      29      A       29      A       10      90      0       0       Novel   H2
10      15181   A       C       17      17      7       46      30      C       30      C       0       100     0       0       Novel   H1

The filtered/scored list of SNPs is located at: data/OUT/BACPET/Results/filter/h03.ref.snps which is the same formatting as above but with a SNP Score value at the beginning of each line.  The previous list has been filtered down to 270 SNPs (first ten displayed below):
19      1       11652   A       G       5       5       4       65      28      G       29      G       0       0       100     0       Novel   H1
39      2       12431   T       C       10      10      5       29      30      C       30      C       0       100     0       0       Novel   H1
43      3       12575   G       T       13      13      8       57      30      T       30      T       0       0       0       100     Novel   H1
19      4       13268   A       G       4       4       3       28      27      G       27      G       0       0       100     0       Novel   H1
74      5       13375   C       G       15      15      7       49      30      G       30      G       0       0       100     0       Novel   H1
79      6       13792   A       G       27      27      11      53      30      G       30      G       0       0       100     0       Novel   H1
34      7       13829   A       G       10      10      5       66      30      G       30      G       0       0       100     0       Novel   H1
58      8       14427   A       G       11      11      5       36      30      G       30      G       0       0       100     0       Novel   H1
6       9       14917   A       M       3       3       5       51      29      A       29      A       10      90      0       0       Novel   H2
86      10      15181   A       C       17      17      7       46      30      C       30      C       0       100     0       0       Novel   H1
Above is the information for the first 10 matches in this alignment.  The columns are as follows:
1.   Score - generated by ScoreSNPs by combining other scores; it reflects the prediction accuracy (in the range 0-100 where 100 is highest score)
2.   Id
3.   Coordinate
4.   Reference
5.   Observed
6.   Score1 - this score is based on the accumulated probability of the coverage and the expected percentage of SNPs in the reference genome.
7.   Score2 - this is the same as score1 but takes into account a priori knowledge of SNPs already identified; if the SNP database was not built with a list of known SNPs, this value will be the same as Score1
8.   Coverage
9.   Average location of the SNP in the reads
10. This is used by ScoreSNPs
11. This is used by ScoreSNPs
12. This is used by ScoreSNPs
13. This is used by ScoreSNPs
14. Accumulated probability of base A
15. Accumulated probability of base C
16. Accumulated probability of base G
17. Accumulated probability of base T
18. If using known SNPs as prior knowledge, this value is Known if SNP is in the database and otherwise if it is not in the database or not using a SNP database this is listed as Novel.
19. H1 - homozygous, H2 - heterozygous

Current Release
SliderII 1.1

Released Jun 09, 2009

Latest version
More about this release…

Download file Get SliderII for all platforms
Slider II 1.1

All Releases

Version Released Description Compatibility Licenses Status
1.1 Jun 09, 2009 Latest version More about this release… AFL final
Filed under: ,