Data inputs

This document describes the old input format required by SVA versions prior to 1.1. For SVA 1.1 and later, which support the VCF format, see here.

SVA users need to prepare four (4) types of input files for an SVA project. All of these files, except type 3, can be generated from a pileup file, a format first used by Tony Cox and Zemin Ning at the Sanger Institute. A pileup file can be generated in a next-generation sequencing study by software tools such as SAMtools. Please note, however, that the specific pileup format used here differs slightly from the default SAMtools output format (described here). Detailed information and programs for generating these files are given later on this page.

These 4 types of files are:

  1. A list of identified single nucleotide variants (SNVs) in a specific pileup format - text file with file name extension .samtools;
  2. A list of identified insertions/deletions (INDELs) in a specific pileup format - text file with file name extension .samtoolsindels;
  3. (Optional) A list of structural variations (SVs) in HMMCNV output format - text file with file name extension .events;
  4. A chromosome-wise coverage and quality control data file, generated from SAMtools pileup output - binary file with file name extension .bco.

In addition, there is an optional pedinf file for an SVA project. This file lists the subjects in a linkage format. This file is not necessary for SVA annotation tasks, but is necessary for some SVA analysis and exporting functions.

Optional pedinf file:

The pedinf file lists the subjects in a linkage format, consisting of six columns separated by spaces or tabs:

Family ID, Individual ID, Father ID, Mother ID, Gender (1=male, 2=female), Affected status (1=control, 2=case, -9=unknown)

Here is an example for this file.
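As a rough illustration of the six-column layout, the sketch below parses one pedinf line and checks the coded columns. The names PedinfRecord and parse_pedinf_line are hypothetical and not part of SVA, the sample line is invented for illustration, and rejecting codes other than those listed above is an assumption of this sketch.

```python
from typing import NamedTuple


class PedinfRecord(NamedTuple):
    family_id: str
    individual_id: str
    father_id: str
    mother_id: str
    gender: int    # 1 = male, 2 = female
    affected: int  # 1 = control, 2 = case, -9 = unknown


def parse_pedinf_line(line: str) -> PedinfRecord:
    """Split one pedinf line on spaces or tabs and check the coded columns."""
    fields = line.split()
    if len(fields) != 6:
        raise ValueError(f"expected 6 columns, got {len(fields)}: {line!r}")
    gender, affected = int(fields[4]), int(fields[5])
    # Assumption: only the codes listed in the column description are valid.
    if gender not in (1, 2):
        raise ValueError(f"gender must be 1 or 2, got {gender}")
    if affected not in (1, 2, -9):
        raise ValueError(f"affected status must be 1, 2 or -9, got {affected}")
    return PedinfRecord(fields[0], fields[1], fields[2], fields[3], gender, affected)


# Invented example line (tab-separated), not taken from the SVA documentation.
record = parse_pedinf_line("fam1\tsubject1\t0\t0\t1\t2")
print(record.individual_id, record.gender, record.affected)
```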

 

I will assume that SVA users are already familiar with next-generation sequencing data pipelines, particularly those using BWA/SAMtools. The file name extensions listed above serve only to let SVA conveniently recognize the respective formats. Although we use BWA/SAMtools ourselves, the file extensions do not mean that SVA accepts output only from SAMtools. SVA does not distinguish which software generated the alignment results, as long as the files are in the pileup format described below.

Important note:

The default SVA release comes with a supporting database based on human reference sequence build 36 and Ensembl release 50_36l (June 2008 version). So it supports annotation and analysis consistent with that build. All the following steps, including the alignment processes upstream to this discussion, should also be based on human reference sequence build 36.

The supporting database of SVA can be updated to newer builds, but the default SVA release uses build 36. We will release newer versions of the supporting databases in the future.

In any case, the SVA supporting database version must be consistent with the reference build used in your alignment process. Otherwise, the annotation results will simply be wrong. The default SVA release is build 36.

 

The basic data generation workflow described below is based on our own experience and is provided for your reference.

Step 1. Generating pileup file

We used SAMtools to generate the pileup file:

[YOUR SAMTOOLS DIR]/samtools pileup -f [YOUR REFERENCE FASTA FILE] -c [YOUR ALIGNMENT .bam file] > [YOUR pileup file]

There is an important note regarding the chromosome designations (see the note in Step 6), which affects the data generation steps that follow.

Step 2. Generating variant file

We used SAMtools to generate the variant file (please note that this is a basic example; your actual parameters may vary):

[YOUR SAMTOOLS DIR]/misc/samtools.pl varFilter [YOUR pileup file] > [YOUR SNP_INDEL_FILE]

Step 3. Generate SNV file .samtools

We used a simple Perl script, snp_filter.pl (download it here), to generate the SNV file:

perl [YOUR snp_filter DIR]/snp_filter.pl [YOUR SNP_INDEL_FILE] > [YOUR_SNV_FILE.samtools]

Here is an example of the generated .samtools file:

X 2163 G A 51 51 50 8 aAaAaAAA
X 2219 G A 66 66 56 13 AAAaaaaaAaaAa

The columns are: chromosome name, coordinate, reference allele, variant allele, Phred-like consensus score, SNP quality, RMS score, read depth, pileup bases.
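To make the nine-column layout concrete, here is a minimal sketch that splits one .samtools line into named fields. SnvRecord and parse_snv_line are hypothetical names used only for illustration; the field names simply follow the column description above.

```python
from typing import NamedTuple


class SnvRecord(NamedTuple):
    chrom: str
    pos: int
    ref: str
    alt: str
    consensus_score: int  # Phred-like consensus score
    snp_quality: int
    rms_score: int
    depth: int            # read depth
    bases: str            # pileup bases


def parse_snv_line(line: str) -> SnvRecord:
    """Parse one whitespace-separated line of a .samtools SNV file."""
    f = line.split()
    if len(f) != 9:
        raise ValueError(f"expected 9 columns, got {len(f)}: {line!r}")
    return SnvRecord(f[0], int(f[1]), f[2], f[3],
                     int(f[4]), int(f[5]), int(f[6]), int(f[7]), f[8])


# First example line from the .samtools listing above.
rec = parse_snv_line("X 2163 G A 51 51 50 8 aAaAaAAA")
print(rec.chrom, rec.pos, rec.depth)  # prints: X 2163 8
```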

Step 4. Generate INDEL file .samtoolsindels

We used a simple Perl script, indel_filter.pl (download it here), to generate the INDEL file:

perl [YOUR indel_filter DIR]/indel_filter.pl [YOUR SNP_INDEL_FILE] > [YOUR_INDEL_FILE.samtoolsindels]

Here is an example of the generated .samtoolsindels file:

X 106234 * -G/-G 104 1055 55 30 -G * 27 3
X 111910 * +T/+T 43 501 46 21 +T * 14 7

The columns are: chromosome name, coordinate, a star, the genotype, consensus quality, SNP quality, RMS mapping quality, # covering reads, the first allele, the second allele, # reads supporting the first allele, # reads supporting the second allele.
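The twelve columns can be read in the same way. The sketch below is only an illustration: IndelRecord and parse_indel_line are hypothetical names, and treating a third column other than a star as an error is an assumption of this sketch.

```python
from typing import NamedTuple


class IndelRecord(NamedTuple):
    chrom: str
    pos: int
    genotype: str
    consensus_quality: int
    snp_quality: int
    rms_mapping_quality: int
    depth: int          # number of covering reads
    allele1: str
    allele2: str
    reads_allele1: int  # reads supporting the first allele
    reads_allele2: int  # reads supporting the second allele


def parse_indel_line(line: str) -> IndelRecord:
    """Parse one whitespace-separated line of a .samtoolsindels file."""
    f = line.split()
    if len(f) != 12:
        raise ValueError(f"expected 12 columns, got {len(f)}: {line!r}")
    if f[2] != "*":
        # Assumption: the third column is always a literal star.
        raise ValueError(f"expected '*' in column 3, got {f[2]!r}")
    return IndelRecord(f[0], int(f[1]), f[3], int(f[4]), int(f[5]), int(f[6]),
                       int(f[7]), f[8], f[9], int(f[10]), int(f[11]))


# First example line from the .samtoolsindels listing above.
indel = parse_indel_line("X 106234 * -G/-G 104 1055 55 30 -G * 27 3")
print(indel.genotype, indel.depth, indel.reads_allele1, indel.reads_allele2)
# prints: -G/-G 30 27 3
```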

(Optional) Step 5. Generate SV file .events

We used a separate program (ERDS) to generate the SV file:

[YOUR HMMSV_PROGRAM DIR]/HMMSV_PROGRAM [YOUR pileup file] > [YOUR_SV_FILE.events]

Here is an example of the generated .events file:

X 2130001 2206000 76000 2 213.2
X 2206001 2208000 2000 0 0.7

The columns are: chromosome name, start coordinate, end coordinate, event length (equal to end - start + 1 in the examples above), SV status (diploid=2), LOD score.
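The example rows above contain six fields. The sketch below assumes the fourth field is the event length, because it equals end - start + 1 in both rows; that reading is an inference from the examples rather than something stated by the format's authors. SvEvent and parse_event_line are hypothetical names.

```python
from typing import NamedTuple


class SvEvent(NamedTuple):
    chrom: str
    start: int
    end: int
    length: int  # assumed meaning of the fourth field
    status: int  # SV status (diploid = 2)
    lod: float   # LOD score


def parse_event_line(line: str) -> SvEvent:
    """Parse one whitespace-separated line of a .events file."""
    f = line.split()
    if len(f) != 6:
        raise ValueError(f"expected 6 fields, got {len(f)}: {line!r}")
    return SvEvent(f[0], int(f[1]), int(f[2]), int(f[3]), int(f[4]), float(f[5]))


def length_is_consistent(event: SvEvent) -> bool:
    """Check the assumed relation length == end - start + 1."""
    return event.length == event.end - event.start + 1


# Both example lines from the .events listing above.
events = [parse_event_line("X 2130001 2206000 76000 2 213.2"),
          parse_event_line("X 2206001 2208000 2000 0 0.7")]
print([length_is_consistent(e) for e in events])  # prints: [True, True]
```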

Step 6. Generate coverage and quality score file .bco

We used a simple Java program, pileup2bco.jar (download it here), to generate the chromosome-wise .bco files. Please note that the output parameter takes this particular form: [YOUR_BCO_OUTPUT_DIRECTORY]/[YOUR PREFIX TO THE OUTPUT]. For example, this could be /usr/jack/bco/subject1, in which case the .bco output files will be named /usr/jack/bco/subject1_.1.bco through /usr/jack/bco/subject1_.Y.bco.

[YOUR pileup2bco.jar DIR]/pileup2bco.jar [YOUR pileup file] [YOUR_BCO_OUTPUTSTEM]
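To check that a run produced the expected files, the naming pattern from the example can be spelled out as below. This is only a sketch: expected_bco_paths is a hypothetical helper, and the chromosome list (1 to 22, X, Y) is an assumption read from the phrase "through ... subject1_.Y.bco"; whether a file for M is also written is not stated here.

```python
# Assumed chromosome order: 1..22, then X and Y.
CHROMOSOMES = [str(n) for n in range(1, 23)] + ["X", "Y"]


def expected_bco_paths(output_stem: str) -> list:
    """Build the per-chromosome .bco file names for a given output stem."""
    return [f"{output_stem}_.{chrom}.bco" for chrom in CHROMOSOMES]


paths = expected_bco_paths("/usr/jack/bco/subject1")
print(paths[0])   # prints: /usr/jack/bco/subject1_.1.bco
print(paths[-1])  # prints: /usr/jack/bco/subject1_.Y.bco
```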

Note: This small Java program (pileup2bco.jar) accepts only pileup files whose chromosome designations (column 1) are an integer from 1 to 22, or X, Y, or M.

In the following example, pileup2bco accepts "16" but not "Chr16" (or "chr16").

Acceptable:

16 41 t A 0 0 60 1 ^~, !
16 42 c C 4 0 60 1 , #

Not acceptable:

Chr16 41 t A 0 0 60 1 ^~, !
Chr16 42 c C 4 0 60 1 , #
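The chromosome naming rule for pileup2bco can be checked before running it. The sketch below is a hypothetical pre-check, not part of SVA: is_accepted_chromosome encodes the rule stated in the note, and strip_chr_prefix shows one possible conversion, assuming that removing a leading "chr" (in any letter case) is all that is needed.

```python
# Designations accepted in column 1, per the note: 1-22, X, Y, M.
ACCEPTED = {str(n) for n in range(1, 23)} | {"X", "Y", "M"}


def is_accepted_chromosome(name: str) -> bool:
    """Return True if the designation matches the rule stated in the note."""
    return name in ACCEPTED


def strip_chr_prefix(name: str) -> str:
    """Remove a leading 'chr'/'Chr' prefix, if present (assumed conversion)."""
    return name[3:] if name.lower().startswith("chr") else name


for designation in ("16", "Chr16", "chr16"):
    print(designation, is_accepted_chromosome(designation),
          is_accepted_chromosome(strip_chr_prefix(designation)))
```

If the prefix is stripped from a pileup file, the same designations should be used consistently in every file generated for the project.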



The .bco file is in binary format. It uses 4 bytes for each base, with one byte for each of the following scores: consensus quality, SNP quality, RMS mapping quality, and read depth. Please note that because each score occupies a single byte, its upper limit in this process is 255. Any score greater than 255 will be trimmed to that limit.
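The per-base layout can be pictured with the sketch below. It is an illustration under stated assumptions and not the pileup2bco implementation: the four bytes are assumed to be stored in the order listed above, scores above 255 are assumed to be capped at 255, and any file header or indexing that a real .bco file might contain is ignored. pack_base and unpack_base are hypothetical names.

```python
import struct


def pack_base(consensus_q: int, snp_q: int, rms_mapq: int, depth: int) -> bytes:
    """Pack the four per-base scores into 4 bytes, capping each at 255."""
    capped = [min(score, 255) for score in (consensus_q, snp_q, rms_mapq, depth)]
    return struct.pack("4B", *capped)  # four unsigned bytes


def unpack_base(data: bytes) -> tuple:
    """Read the four scores back from a 4-byte record."""
    return struct.unpack("4B", data)


# Scores taken from the first INDEL example line (104, 1055, 55, 30).
packed = pack_base(104, 1055, 55, 30)
print(len(packed), list(packed))  # prints: 4 [104, 255, 55, 30]
```

With this layout, the record for the n-th base of a chromosome would start at byte offset 4 * n, again assuming there is no header.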

After you generate these four types of files (with step 5 as optional), you may proceed to create your project.