What is FAVOR?

Functional Annotation of Variants - Online Resource (FAVOR) is an open-access web portal that assembles functional annotation data for individual variants from a variety of sources and displays the information through a web interface. FAVOR currently provides functional annotations in 13 major attribute categories (Basic, Variant Category, Allele Frequencies, Integrative Score, Protein Function, Epigenetics, Conservation, Transcription Factors, Chromatin States, Local Nucleotide Diversity, Mutation Density, Mappability, and Proximity), as well as tissue- and cell-specific annotations. FAVOR supports the following queries:

  • Single variant query: The FAVOR web portal supports single variant queries by either genome position (Build GRCh38) or rsID. The results are displayed in tables. If an rsID corresponds to multiple alleles, the results for each allele are shown in a separate tab.

  • Region-based and gene-based query: Annotations for all variants in the Trans-Omics for Precision Medicine (TOPMed) Freeze 8 BRAVO variant set (705,486,649 variants observed in the whole genomes of 132,345 samples) that fall in a given gene or region are displayed in tables in the web interface. For region-based queries, genome positions are specified using Build GRCh38.

  • Batch submission: A variant list using either genome positions (Build GRCh38) or rsIDs can be uploaded. The functional annotation results are available for download in multiple formats. See the Batch Submission page for details about the format of the input file and the output files.
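Since a query can be either an rsID or a genome position, a variant list usually needs to be sorted into these two kinds before submission. The following is a minimal sketch of such a check. The helper `classify_query` and the `chrom-pos-ref-alt` position syntax are assumptions made purely for illustration and are not taken from the FAVOR documentation; the Batch Submission page remains the authority on the accepted input format.

```python
import re

# rsIDs are "rs" followed by digits.
RSID_RE = re.compile(r"^rs\d+$")

# Hypothetical position syntax for this sketch only: chrom-pos-ref-alt,
# with an optional "chr" prefix. The real portal format may differ.
POSITION_RE = re.compile(r"^(?:chr)?(\d{1,2}|X|Y)-(\d+)-([ACGT]+)-([ACGT]+)$")


def classify_query(query: str) -> str:
    """Return 'rsid', 'position', or 'unknown' for one query string."""
    q = query.strip()
    if RSID_RE.match(q):
        return "rsid"
    if POSITION_RE.match(q):
        return "position"
    return "unknown"


# Illustrative queries only; none of these are taken from FAVOR.
queries = ["rs7412", "19-44908822-C-T", "not a variant"]
for q in queries:
    print(q, "->", classify_query(q))
```

Validating entries locally in this way makes it easier to spot malformed lines before a list is uploaded.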

Variant set

The current version of the FAVOR database contains a total of 8,892,915,237 variants, comprising all 8,812,917,339 possible SNVs and 79,997,898 indels.
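The counts above can be checked with a few lines of arithmetic. The sketch below confirms that the SNV and indel counts add up to the stated total. It also derives the number of reference sites implied by the SNV count under one assumption that is not stated in the source: that "all possible SNVs" means three alternate alleles at every reference base with a known nucleotide.

```python
# Counts quoted in the text above.
TOTAL_VARIANTS = 8_892_915_237
SNVS = 8_812_917_339
INDELS = 79_997_898

# The two categories should add up to the stated total.
assert SNVS + INDELS == TOTAL_VARIANTS

# Assumption (not from the source): three alternate alleles per reference base.
ALT_ALLELES_PER_SITE = 3
sites, remainder = divmod(SNVS, ALT_ALLELES_PER_SITE)
print(f"implied reference sites: {sites:,} (remainder {remainder})")
```

A remainder of zero is consistent with that reading, although it does not prove it.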

Graphical illustration of gene-based annotation

Please refer to the gene-based annotation figure above for the detailed meaning of each annotation category. Here “exonic” refers only to the coding portion of exons, not the UTRs, because two keywords (UTR5, UTR3) are reserved specifically for UTR annotations. “splicing” is defined as a variant within 2 bp of an exon/intron boundary. If a variant is located in both a 5’ UTR and a 3’ UTR (possibly of two different genes), “UTR5,UTR3” is printed as the output. The terms “upstream” and “downstream” are defined as within 1 kb of the transcription start site or transcription end site, respectively, taking the strand of the mRNA into account. If a variant is located in both a downstream and an upstream region (possibly of two different genes), “upstream,downstream” is printed as the output. If a variant is located in a CAGE promoter/enhancer or GeneHancer region, it is also annotated as CAGE promoter/enhancer or GeneHancer.
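The strand-aware upstream/downstream rule can be illustrated with a short sketch. This is a simplified illustration and not the annotation tool's actual implementation: the function names, the 1-based inclusive coordinate convention, and the representation of a transcript as a `(start, end, strand)` tuple are all assumptions made for this example.

```python
from typing import Iterable, Tuple

FLANK_WINDOW_BP = 1000  # the 1-kb window described in the text

Transcript = Tuple[int, int, str]  # (start, end, strand), with start < end


def flank_label(pos: int, tx_start: int, tx_end: int, strand: str) -> str:
    """Return 'upstream', 'downstream', or '' for one transcript."""
    if tx_start - FLANK_WINDOW_BP <= pos < tx_start:
        # Left flank: before the TSS on '+', after the TES on '-'.
        return "upstream" if strand == "+" else "downstream"
    if tx_end < pos <= tx_end + FLANK_WINDOW_BP:
        # Right flank: after the TES on '+', before the TSS on '-'.
        return "downstream" if strand == "+" else "upstream"
    return ""


def combined_flank_label(pos: int, transcripts: Iterable[Transcript]) -> str:
    """Merge labels across transcripts, e.g. 'upstream,downstream'."""
    labels = {flank_label(pos, s, e, strand) for s, e, strand in transcripts}
    return ",".join(name for name in ("upstream", "downstream") if name in labels)


# Two hypothetical '+' strand genes: the variant lies just after the first
# gene's end and just before the second gene's start.
genes = [(5_000, 9_500, "+"), (10_000, 20_000, "+")]
print(combined_flank_label(9_800, genes))
```

Note that the same genomic flank is labelled differently depending on strand, which is what "taking the strand of the mRNA into account" means in practice.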

Calculation of annotation PCs (aPCs) and interpretation

Often it is helpful to have a single metric summarizing multiple similar annotations that measure the same underlying biological function. We achieve this goal by proposing variant annotation Principal Components (aPCs), which are principal component summaries of the multi-faceted functional annotation data in FAVOR. Unlike ancestry PCs, which are subject-specific and are calculated using genotypes across the genome to control for population structure, annotation PCs are variant-specific and are calculated using functional annotations for individual variants. Annotation PCs summarize multiple aspects of variant function, with different blocks of individual functional annotations in the heatmap below captured by different annotation PCs (Figure 1) [@li2020dynamic]. We summarize the detailed steps of obtaining aPCs as follows (currently aPCs are calculated for all PASS SNVs in the variant set):

  • Step 0 (pre-processing): For variants with missing individual scores, we impute the default value of each score. We then transform particular individual scores so that (1) a higher value of each individual score indicates increased functionality according to that annotation, and (2) the distribution of each individual score becomes less skewed. Specifically, we use
$$\widetilde{\text{SIFTval}} = 1 - \text{SIFTval}$$
$$\widetilde{\text{minDistTSS}} = -\log(\text{minDistTSS})$$
$$\widetilde{\text{minDistTSE}} = -\log(\text{minDistTSE})$$
$$\widetilde{\text{Encode}}_a = \log(\text{Encode}_a)$$

    where $a \in \{\text{H3K4me1, H3K4me2, H3K4me3, H3K9ac, H3K9me3, H3K27ac, H3K27me3, H3K36me3, H3K79me2, H4K20me1, H2AFZ}\}$.

  • Step 1: We group the individual annotation scores into major functional blocks based on a priori knowledge. Each block captures a specific aspect of variant biological function: protein function, conservation, epigenetics, local nucleotide diversity, proximity (distance) to coding, mutation density, transcription factors, mappability, and proximity (distance) to TSS/TES (See Table above) [@li2020dynamic].

  • Step 2: For each annotation block $k \in \{1,\dots,K\}$, we center and standardize all (pre-processed) individual scores within the block, and obtain the standardized individual annotation score matrix $\widetilde{\boldsymbol{X}}_k$ (i.e., each column of $\widetilde{\boldsymbol{X}}_k$ has mean 0 and variance 1).

  • Step 3: For each annotation block $k \in \{1,\dots,K\}$, we calculate the aPC raw score of that block as the first PC from the standardized individual scores. Specifically, for annotation block $k$,

$$\text{aPC.raw}_k = \widetilde{\boldsymbol{X}}_k \boldsymbol{e}_{k,1}$$

    where $\boldsymbol{e}_{k,1}$ is the eigenvector corresponding to the largest eigenvalue of $\widetilde{\boldsymbol{X}}_k^\top\widetilde{\boldsymbol{X}}_k$. Note that here we flip the sign of $\text{aPC.raw}_k$ (if necessary) such that each $\text{aPC.raw}_k$ is positively correlated with the individual scores in that block.

  • Step 4: To facilitate interpretation, these aPC raw scores are transformed into PHRED-scaled scores for each variant across the genome [@li2020dynamic], defined as
$$\text{aPC.PHRED}_k = -10\times\log_{10}\left(\text{rank}(-\text{aPC.raw}_k)/M\right)$$

    where $M$ is the total number of variants sequenced across the whole genome. Note that the PHRED-scaled scores used in annotation PCs express the rank on an order-of-magnitude scale. For example, among all the variants in the FAVOR database, a variant in the top 10% of aPC raw scores has a PHRED score of 10, a variant in the top 1% has a PHRED score of 20, and a variant in the top 0.1% has a PHRED score of 30.
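Steps 0 through 4 can be sketched in NumPy for a single annotation block. This is a minimal illustration under stated assumptions, not the FAVOR pipeline itself: the function names are hypothetical, missing-value imputation is omitted, the logs in Step 0 assume strictly positive inputs, the sign is flipped based on the sum of the eigenvector loadings (one way to make the score positively correlated with a block of positively correlated annotations), and ties in the ranking are broken by input order.

```python
import numpy as np


def preprocess(sift, min_dist_tss, min_dist_tse, encode):
    """Step 0 transforms; distances and Encode values must be > 0 for the logs."""
    return 1.0 - sift, -np.log(min_dist_tss), -np.log(min_dist_tse), np.log(encode)


def standardize(block: np.ndarray) -> np.ndarray:
    """Step 2: give every column mean 0 and variance 1."""
    return (block - block.mean(axis=0)) / block.std(axis=0)


def apc_raw(block: np.ndarray) -> np.ndarray:
    """Step 3: first-PC score of a (variants x annotations) block."""
    x = standardize(block)
    # eigh returns eigenvalues in ascending order, so the last column
    # is the eigenvector of the largest eigenvalue of X^T X.
    _, eigvecs = np.linalg.eigh(x.T @ x)
    e1 = eigvecs[:, -1]
    # Assumption: flip so the loadings sum to a positive value, which makes
    # the score positively correlated with positively correlated annotations.
    if e1.sum() < 0:
        e1 = -e1
    return x @ e1


def apc_phred(raw: np.ndarray) -> np.ndarray:
    """Step 4: -10 * log10(rank(-raw) / M); rank 1 is the largest raw score."""
    m = raw.size
    order = np.argsort(-raw, kind="stable")
    ranks = np.empty(m, dtype=float)
    ranks[order] = np.arange(1, m + 1)
    return -10.0 * np.log10(ranks / m)


# Toy block: 1000 simulated variants, 3 noisy copies of one latent signal.
rng = np.random.default_rng(0)
latent = rng.normal(size=1000)
toy_block = latent[:, None] + 0.5 * rng.normal(size=(1000, 3))
toy_phred = apc_phred(apc_raw(toy_block))
print("largest PHRED score in the toy block:", toy_phred.max())
```

With $M = 1000$ variants, the top-ranked variant has rank 1 and therefore a PHRED score of $-10\log_{10}(1/1000) = 30$, matching the "top 0.1%" example in Step 4.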

Figure 1. Correlation heatmap of individual and integrative functional annotations. The figure shows pairwise correlations between 63 individual and integrative functional annotations. The cells in the visualization are colored by Pearson's correlation coefficient values with deeper colors indicating higher positive (red) or negative (blue) correlations. Each annotation principal component (aPC) is the first PC calculated from the set of standardized individual functional annotations that measure similar biological function. These aPCs are then transformed into the PHRED-scaled scores for each variant across the genome.