--------------------------------------------------------

README - MutPred predictions for all variants in dbNSFP

Written by:
Vikas Pejaver
August 31, 2016

Contact:
vpejaver@indiana.edu

--------------------------------------------------------


Background:
-----------
dbNSFP is a database of functional prediction scores for all theoretically possible single nucleotide variants (SNVs), i.e. every base in a given exonic region is substituted with each of the three other possible nucleotides, and the resulting variants are scored and stored [1]. In order to make MutPred predictions available to the community at large, we have scored ~80 million non-synonymous variants from this exhaustive database. We encourage users to use these predictions, as they may help circumvent the wait for results from the web-server and the need for customized large batch runs. If you use these predictions in your work, please cite [3].


Description:
------------
MutPred requires protein sequences and amino acid substitution information as inputs. To this end, we mapped chromosomal information to the protein information available in dbNSFP v2.7, downloaded the corresponding sequences from UniProt and Ensembl and ran MutPred on these sequences and substitutions.

In this directory, there are 24 result files in total, with each file corresponding to one of the 22 autosomes and the two sex chromosomes. Run gunzip on each file to decompress it. Each resulting file is a tab-delimited file with the following fields: 
COLUMN 1 - Colon-delimited information on chromosomal co-ordinates represented as follows: chromosome_number:position:reference_nucleotide_allele:alternate_nucleotide_allele.
         -> The positions are in the 1-based coordinate system (as in the original dbNSFP files).
         -> hg19 was used for these positions (as in the original dbNSFP files).
         -> The nucleotide alleles are those on the + strand (as in the original dbNSFP files).
COLUMN 2 - Mapped UniProt accessions or Ensembl transcript IDs.
         -> In the case of multiple isoforms, only one isoform (and thus, one protein or transcript ID) was selected. When UniProt IDs were available, the canonical UniProt isoform was chosen. If this did not exist, the canonical Ensembl transcript was chosen. If even this did not exist, the longest secondary UniProt isoform was chosen and if this was also unavailable, the longest secondary Ensembl transcript.
         -> In the case of Ensembl transcripts, the protein sequences were obtained by mapping the transcript to protein identifiers.
         -> This column may be marked with a "-" if no such mapping existed. This could happen for a number of reasons, such as obsolete IDs, annotation errors, etc.
COLUMN 3 - Amino acid substitution in the following format: reference_amino_acid|protein_position|alternate_amino_acid
         -> This column may be marked with a "-" if a protein mapping was not found.
COLUMN 4 - General MutPred score (prediction of pathogenicity from the random forest model)
         -> This column may be marked with a "-" if a protein mapping was not found or if certain features could not be computed by MutPred due to technical limitations of the pipeline, such as insufficient memory, extremely short protein sequences, discrepancies between the reference amino acid and the residue in the actual sequence at the given position, etc.
COLUMN 5 - Top 5 features (molecular mechanisms of disease) as predicted by MutPred
         -> This column may be marked with a "-" if a protein mapping was not found or if certain features could not be computed by MutPred due to technical limitations of the pipeline, such as insufficient memory, extremely short protein sequences, etc.
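
As an illustration of the layout described above, the following is a minimal Python sketch for reading one of the result files. It is not part of the MutPred distribution. The names MutPredRecord, parse_line and read_records are hypothetical, and the sketch assumes that the files contain no header line and exactly five tab-separated fields per line. It reads the gzip-compressed file directly, so running gunzip first is not required for this approach. Columns 3 and 5 are kept as raw strings because their exact internal layout is not assumed here.

```python
import gzip
from typing import Iterator, NamedTuple, Optional


class MutPredRecord(NamedTuple):
    """One line of a result file (hypothetical container, not an official API)."""
    chrom: str                   # COLUMN 1: chromosome number
    pos: int                     # COLUMN 1: 1-based hg19 position
    ref: str                     # COLUMN 1: reference allele on the + strand
    alt: str                     # COLUMN 1: alternate allele on the + strand
    protein_id: Optional[str]    # COLUMN 2: UniProt accession or Ensembl transcript ID
    substitution: Optional[str]  # COLUMN 3: amino acid substitution, kept as-is
    score: Optional[float]       # COLUMN 4: general MutPred score
    top_features: Optional[str]  # COLUMN 5: top 5 features, kept as-is


def _missing_to_none(value: str) -> Optional[str]:
    """Translate the "-" placeholder used for missing values into None."""
    return None if value == "-" else value


def parse_line(line: str) -> MutPredRecord:
    """Split one tab-delimited line into its five documented columns."""
    coords, protein_id, substitution, score, features = line.rstrip("\r\n").split("\t")
    # COLUMN 1 is colon-delimited: chromosome:position:ref_allele:alt_allele
    chrom, pos, ref, alt = coords.split(":")
    score_text = _missing_to_none(score)
    return MutPredRecord(
        chrom=chrom,
        pos=int(pos),
        ref=ref,
        alt=alt,
        protein_id=_missing_to_none(protein_id),
        substitution=_missing_to_none(substitution),
        score=None if score_text is None else float(score_text),
        top_features=_missing_to_none(features),
    )


def read_records(path: str) -> Iterator[MutPredRecord]:
    """Yield records from a gzip-compressed result file, skipping blank lines."""
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if line.strip():
                yield parse_line(line)
```

Records whose score is None correspond to lines where column 4 is marked with a "-", so they can be filtered out before any downstream analysis that requires a numeric score.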

For more information on the interpretation of columns 4 and 5, please refer to [2] and [3]. 


References:
-----------
[1] Liu X, Jian X, Boerwinkle E. dbNSFP: a lightweight database of human nonsynonymous SNPs and their functional predictions. Human Mutation (2011) 32(8): 894-899.
[2] http://mutpred1.mutdb.org/about.html
[3] Li B, Krishnan VG, Mort ME, Xin F, Kamati KK, Cooper DN, Mooney SD, Radivojac P. Automated inference of molecular mechanisms of disease from amino acid substitutions. Bioinformatics (2009) 25(21): 2744-2750.