Using SVA Analysis functions - Gene Prioritization

 

The 'Gene Prioritization' function produces a list of protein coding genes that are 'knocked out' by homozygous variants in case genomes. This analysis prioritizes genes by the number of cases containing homozygous protein truncating or non-synonymous genetic variants that are absent in control genomes. By this definition, different variants in the same gene are allowed and contribute equally to the ranking of this gene. In other words, this analysis tries to answer a simple question: do you see enrichment of protein truncating (or, optionally, non-synonymous) variants in any gene in the cases, when compared to control genomes?
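The ranking idea described above can be illustrated with a minimal Python sketch. This is not SVA's actual implementation: the function name, the tuple layout, the variant identifiers, and the alphabetical tie-breaking are all assumptions made for illustration.

```python
from collections import defaultdict

def prioritize_genes(case_variants, control_variants):
    """Rank genes by the number of distinct cases with a qualifying variant.

    case_variants: iterable of (case_id, gene, variant_id) tuples, one per
        homozygous knock out variant found in a case (hypothetical layout).
    control_variants: set of variant_ids observed in control genomes.
    """
    cases_per_gene = defaultdict(set)
    for case_id, gene, variant_id in case_variants:
        if variant_id in control_variants:
            continue  # variant is also seen in controls, so it does not count
        # Different variants in the same gene count toward the same gene.
        cases_per_gene[gene].add(case_id)
    # Most cases first; ties broken alphabetically (an assumption).
    ranked = sorted(cases_per_gene.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    return [(gene, len(cases)) for gene, cases in ranked]

# Made-up example data: two cases carry different F8 variants.
hits = [
    ("case1", "F8", "chrX:1001:A>T"),
    ("case2", "F8", "chrX:2002:delG"),
    ("case1", "NUCKS1", "chr1:3003:C>G"),
    ("case2", "GENE_X", "chr2:4004:G>A"),
]
controls = {"chr2:4004:G>A"}
ranking = prioritize_genes(hits, controls)
print(ranking)
```

In this made-up example, F8 ranks first because two different cases each carry a (different) qualifying variant in it, while the GENE_X variant is dropped because it also appears in a control.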

Your SVA project must contain at least one case and one control for this function to work. It must contain at least two cases for this function to reliably prioritize knocked out genes (with only one case, all homozygous knock out variants are listed, but they all have the same likelihood of impacting the case phenotype).

To access the gene prioritization function, go to the menu 'Analysis -> Gene prioritization'. The Genome Browser must be initialized to run this function.

This will open a 'Prioritization analysis' window. This window will allow you to customize this function in four different ways, explained below. The 'Prioritization analysis' window looks like this:

Function of genetic variant

This section allows you to select the types of 'knock out' variants you wish to include. The three options are:
  1. 'Protein truncation only'. This is the default setting and includes all homozygous variants that result in a premature truncation of a protein. Protein truncating mutations will always be included in the output.
  2. 'Including non-synonymous SNPs'. This setting includes both protein truncating variants and homozygous SNPs resulting in the change of an amino acid residue in the resulting protein.
  3. 'For non-synonymous SNPs: including only intolerable NS change'. This setting is supplementary to option 2. When checked, an additional filter is applied to the non-synonymous SNPs, eliminating any 'tolerable' non-synonymous SNPs. Note: The status of an 'intolerable' non-synonymous SNP is determined using the MAPP software (http://mendel.stanford.edu/SidowLab/downloads/MAPP/index.html).
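How the three options combine can be sketched as a small Python filter. This is an illustration only: the dictionary keys (`effect`, `intolerable`) and the function name are hypothetical, and the `intolerable` flag stands in for a MAPP-derived classification that is computed elsewhere.

```python
def passes_function_filter(variant, include_nonsynonymous=False,
                           intolerable_only=False):
    """Return True if a homozygous variant qualifies under the chosen options.

    variant: dict with hypothetical keys
        'effect'      -> 'truncating', 'nonsynonymous', or anything else
        'intolerable' -> bool, only meaningful for non-synonymous SNPs
    """
    if variant["effect"] == "truncating":
        return True  # option 1: truncating variants are always included
    if variant["effect"] == "nonsynonymous" and include_nonsynonymous:
        # option 3 only narrows option 2; it has no effect on its own
        return bool(variant.get("intolerable")) if intolerable_only else True
    return False

truncating = {"effect": "truncating"}
tolerable_ns = {"effect": "nonsynonymous", "intolerable": False}
intolerable_ns = {"effect": "nonsynonymous", "intolerable": True}

print(passes_function_filter(tolerable_ns))                      # default setting
print(passes_function_filter(tolerable_ns, True))                # option 2
print(passes_function_filter(tolerable_ns, True, True))          # options 2 + 3
```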

Criteria for homozygous variants

This section allows you to adjust the minimum coverage required for a homozygous variant. This setting affects the number of variants that appear in your gene prioritization output. If you do not care about coverage (i.e. confidence) of the homozygous variants, you can uncheck 'Minimum coverage for a homozygous variant'. Otherwise, the default parameters are set as follows:

                             Number of sequencing reads
Autosomes (case)             10
Sex chromosomes (case)       5
Autosomes (control)          10
Sex chromosomes (control)    5

A brief explanation of default parameters for homozygous variants:
In order to determine reliable default parameters, a number of whole genome sequence samples were analyzed for the coverage (read depth) distribution for confident 'homozygous' SNPs called by SAMtools. The coverage data was compared to the 1 million SNP chip data for the same sample. This analysis revealed the following:

    1. Mistakenly called homozygous SNPs (defined by a mismatch with the 1M chip) had an average read depth of 7.03, with a median of 5.
    2. Correctly called homozygous SNPs (defined by a match with the 1M chip) had an average read depth of 23.48, with a median of 24.
    3. When no read depth threshold is specified, 97.2% of homozygous SNPs are correctly called.
    4. When a read depth threshold of >=10X is specified, 99.30% of homozygous SNPs are correctly called.
    5. These data suggest that a read depth threshold should be considered when you define a SNP as homozygous.
      1. If you want a lower false negative rate at the cost of a relatively higher false positive rate (high sensitivity), use a lower threshold in cases and a higher threshold in controls.
      2. If you want a lower false positive rate at the cost of a relatively higher false negative rate (high specificity), use a higher threshold in cases and a lower threshold in controls.
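The default coverage thresholds listed in this section can be expressed as a simple lookup, sketched below in Python. The function and variable names are hypothetical, and treating every chromosome other than X and Y as an autosome is an assumption of this sketch, not documented SVA behavior.

```python
# Default minimum number of sequencing reads, as listed in this section.
DEFAULT_MIN_READS = {
    ("case", "autosome"): 10,
    ("case", "sex"): 5,
    ("control", "autosome"): 10,
    ("control", "sex"): 5,
}

SEX_CHROMOSOMES = {"X", "Y"}

def meets_min_coverage(group, chromosome, read_depth,
                       thresholds=DEFAULT_MIN_READS, enforce=True):
    """Check a homozygous call against the minimum read depth.

    group: 'case' or 'control'
    chromosome: name such as 'chr1', '1', 'chrX', or 'X'
    enforce: False mimics unchecking 'Minimum coverage for a homozygous variant'
    """
    if not enforce:
        return True
    name = chromosome.upper()
    if name.startswith("CHR"):
        name = name[3:]
    kind = "sex" if name in SEX_CHROMOSOMES else "autosome"
    return read_depth >= thresholds[(group, kind)]

print(meets_min_coverage("case", "chr1", 9))      # below the autosome default
print(meets_min_coverage("control", "chrX", 5))   # meets the sex chromosome default
```

To favor sensitivity as described in point 5, you could pass a custom `thresholds` dictionary with lower values for cases and higher values for controls.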

Genetic model

This section allows you to select the genotype of the variant (heterozygous or homozygous) that you will allow to be present in your control genomes. The target variants in your cases are unaffected by this setting (they are still homozygous knock out variants).
  1. 'Recessive model (allowing for heterozygotes in controls)'. This is the default setting. This setting allows for any target variant to be present in control genomes in heterozygous form.
  2. Alternately, you have an option to specify a model 'not allowing heterozygotes' in control genomes. This means that only target variants for which the control genomes are homozygous for the reference allele will appear on the Gene Prioritization output.
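The difference between the two models can be sketched as a check over the control genotypes at a target variant. This is a hypothetical illustration: the genotype labels ('hom_ref', 'het', 'hom_alt') and the function name are assumptions, and missing genotype calls are not handled.

```python
def controls_permit_variant(control_genotypes, recessive_model=True):
    """Decide whether a target variant survives the control check.

    control_genotypes: iterable of genotype labels for the control genomes,
        each one of 'hom_ref', 'het', or 'hom_alt' (hypothetical labels).
    recessive_model: True allows heterozygotes in controls (the default);
        False requires every control to be homozygous for the reference.
    """
    allowed = {"hom_ref", "het"} if recessive_model else {"hom_ref"}
    return all(genotype in allowed for genotype in control_genotypes)

controls = ["hom_ref", "het", "hom_ref"]
print(controls_permit_variant(controls))                         # recessive model
print(controls_permit_variant(controls, recessive_model=False))  # no heterozygotes
```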

Control genomes

  1. If 'check control samples for reference sequence' is unchecked, then all homozygous variants are listed, not just those unique to the cases. This option is disabled, since the analysis should always check control samples for the reference sequence.
  2. You may choose to 'Treat Venter's sequence as a control genome' by clicking the check box next to this option.

After selecting your desired settings in the 'Prioritization analysis' window, click 'Output to...', choose the location where you wish to save the output file, enter a name in the 'File Name' field, and select 'Save'.

After you click 'Save', the path to this file is displayed in the 'Prioritization analysis' window. Then select 'Analyze'.

Sample output (.txt file):

Rank GeneSym SNP_count Indel_count Total_count
1 NUCKS1 1 0 1
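If you want to post-process this file, a reader along the following lines may help. It is a sketch that assumes the columns are separated by whitespace (spaces or tabs) and that every column except GeneSym holds an integer, as in the sample above; the function name is hypothetical.

```python
import io

def read_prioritization_output(handle):
    """Parse gene prioritization output into a list of dictionaries."""
    header = handle.readline().split()
    rows = []
    for line in handle:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        row = dict(zip(header, fields))
        for key in header:
            if key != "GeneSym":
                row[key] = int(row[key])  # counts and rank are integers
        rows.append(row)
    return rows

# The sample output from this page, held in memory instead of a file.
sample = io.StringIO(
    "Rank GeneSym SNP_count Indel_count Total_count\n"
    "1 NUCKS1 1 0 1\n"
)
rows = read_prioritization_output(sample)
print(rows[0]["GeneSym"], rows[0]["Total_count"])
```

For a saved file, pass an open file object instead, for example `with open(path) as handle: rows = read_prioritization_output(handle)`, where `path` is the location you chose under 'Output to...'.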

 

There is also an example run demonstrating this function to prioritize the Factor VIII (F8) gene in type A hemophilia patients.