Introduction
Functional Annotation of Variants - Online Resource (FAVOR) is an open-access web portal that assembles individual variant functional annotation data from a variety of sources and displays the information through a web interface. FAVOR currently provides functional annotations from 13 major attributes: Basic, Variant Category, Allele Frequencies, Integrative Score, Protein Function, Epigenetics, Conservation, Transcription Factors, Chromatin States, Local Nucleotide Diversity, Mutation Density, Mappability, and Proximity. FAVOR supports the following queries:
- Single variant query: The FAVOR web portal supports single-variant queries by either genome position (Build GRCh38) or rsID. The results are displayed in tables. If an rsID corresponds to multiple alleles, the results for each allele are shown on separate pages.
- Region-based and gene-based query: Annotations for all variants in the Trans-Omics for Precision Medicine (TOPMed) Freeze 8 BRAVO variant set (705,486,649 variants observed in the whole genomes of 132,345 samples) in a given gene or region are displayed in the web interface using tables. For region-based queries, genome positions are specified using Build GRCh38.
- Batch submission: A variant list using either genome locations (Build GRCh38) or rsIDs can be uploaded in the format (Chr, Position, Ref, Alt) or in other standard file formats (.txt). The functional annotation results are displayed in the web interface using tables, and the full annotated results are available for download in .csv format.
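The (Chr, Position, Ref, Alt) layout used for batch submission can be sketched in Python as follows. This is only an illustration: the tab delimiter, the absence of a header row, and the helper names (`Variant`, `validate`, `to_batch_lines`, `is_rsid`) are assumptions rather than documented portal requirements, and the coordinates are placeholders.

```python
import re
from typing import Iterable, List, NamedTuple


class Variant(NamedTuple):
    """One variant in (Chr, Position, Ref, Alt) order, GRCh38 coordinates."""
    chrom: str
    pos: int
    ref: str
    alt: str


_ALLELE = re.compile(r"[ACGT]+")
_RSID = re.compile(r"rs\d+")


def is_rsid(query: str) -> bool:
    """Return True if the query looks like an rsID such as 'rs123'."""
    return _RSID.fullmatch(query) is not None


def validate(v: Variant) -> Variant:
    """Reject records that cannot describe an SNV or indel."""
    if v.pos < 1:
        raise ValueError(f"position must be >= 1: {v}")
    if not (_ALLELE.fullmatch(v.ref) and _ALLELE.fullmatch(v.alt)):
        raise ValueError(f"alleles must be non-empty A/C/G/T strings: {v}")
    if v.ref == v.alt:
        raise ValueError(f"ref and alt must differ: {v}")
    return v


def to_batch_lines(variants: Iterable[Variant], sep: str = "\t") -> List[str]:
    """Format validated variants, one per line, as Chr/Position/Ref/Alt.

    The separator is an assumption; adjust it to whatever the portal accepts.
    """
    return [sep.join((v.chrom, str(v.pos), v.ref, v.alt))
            for v in map(validate, variants)]


if __name__ == "__main__":
    # Placeholder coordinates for illustration only.
    batch = [Variant("1", 12345, "A", "G"), Variant("2", 67890, "CT", "C")]
    print("\n".join(to_batch_lines(batch)))
```

Validating records before upload keeps malformed lines out of the submission; the exact file layout the portal accepts should be confirmed against the portal itself.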
Variant set
The current version of the FAVOR database contains a total of 8,892,915,237 variants, which include all possible 8,812,917,339 SNVs and 79,997,898 indels.
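The counts above can be cross-checked with a few lines of Python. The interpretation of "all possible SNVs" as three alternate alleles per reference base is an assumption made here for illustration, not a statement from the FAVOR documentation.

```python
# Counts quoted in the text above.
TOTAL_VARIANTS = 8_892_915_237
SNVS = 8_812_917_339
INDELS = 79_997_898

# The two variant classes should add up to the stated total.
assert SNVS + INDELS == TOTAL_VARIANTS

# Assumption: "all possible SNVs" means three alternate alleles per
# reference base, so dividing by 3 gives the implied number of positions.
positions, remainder = divmod(SNVS, 3)

# Share of the database made up of indels.
indel_fraction = INDELS / TOTAL_VARIANTS

print(f"implied reference positions: {positions:,} (remainder {remainder})")
print(f"indel share of all variants: {indel_fraction:.2%}")
```

The SNV count divides evenly by three, which is consistent with (though it does not prove) the three-alternates-per-position reading; indels make up a bit under 1% of the database.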
Functional annotations and annotation PCs (aPCs)
The functional annotations provided in the FAVOR web portal are as follows:
Detailed descriptions of selected functional annotations and annotation PCs in the FAVOR database. For numeric annotations marked (+), a higher value indicates increased functionality according to that annotation. For numeric annotations marked (-), a lower value indicates increased functionality according to that annotation.
Block Name | Annotation Name | Explanation | Type | Source |
---|---|---|---|---|
Basic | Variant | The unique identifier of the given variant. Reported in chr-pos-ref-alt format. | String | |
Basic | rsID | The rsID of the given variant (if one exists). | String | |
Basic | TOPMed Depth | TOPMed depth of the given variant. | String | |
Basic | TOPMed QC Status | TOPMed QC status of the given variant. | String | |
ClinVar | Clinical Significance | Clinical significance for this single variant. [@landrum2013clinvar; @landrum2017clinvar] | String | [Source][Ref1,2] |
ClinVar | Clinical significance (genotype includes) | Clinical significance for a haplotype or genotype that includes this variant. Reported as pairs of VariationID:clinical significance. [@landrum2013clinvar; @landrum2017clinvar] | String | [Source][Ref1,2] |
ClinVar | Disease Name | ClinVar's preferred disease name for the concept specified by disease identifiers in CLNDISDB. [@landrum2013clinvar; @landrum2017clinvar] | String | [Source][Ref1,2] |
ClinVar | Disease Name (included variant) | For included variant: ClinVar's preferred disease name for the concept specified by disease identifiers in CLNDISDB. [@landrum2013clinvar; @landrum2017clinvar] | String | [Source][Ref1,2] |
ClinVar | Review Status | ClinVar review status for the Variation ID. [@landrum2013clinvar; @landrum2017clinvar] | String | [Source][Ref1,2] |
ClinVar | Allele Origin | Allele origin: 0 - unknown; 1 - germline; 2 - somatic; 4 - inherited; 8 - paternal; 16 - maternal; 32 - de-novo; 64 - biparental; 128 - uniparental; 256 - not-tested; 512 - tested-inconclusive. [@landrum2013clinvar; @landrum2017clinvar] | String | [Source][Ref1,2] |
ClinVar | Disease Database ID | Tag-value pairs of disease database name and identifier, e.g. OMIM:NNNNNN. [@landrum2013clinvar; @landrum2017clinvar] | String | [Source][Ref1,2] |
ClinVar | Disease Database ID (included variant) | For included variant: Tag-value pairs of disease database name and identifier, e.g. OMIM:NNNNNN. [@landrum2013clinvar; @landrum2017clinvar] | String | [Source][Ref1,2] |
ClinVar | Gene Reported | Gene(s) for the variant reported as gene symbol:gene id. The gene symbol and id are delimited by a colon (:) and each pair is delimited by a vertical bar (\|). [@landrum2013clinvar; @landrum2017clinvar] | String | [Source][Ref1,2] |
Variant Category | Gencode Comprehensive Info | Identifies whether the variant causes protein-coding changes under the Gencode gene definition. The gene affected by the variant is labeled; if the variant lies in an intergenic region, the nearby gene name is labeled instead. [@harrow2012gencode; @frankish2018gencode] | String | [Source1,2][Ref1,2] |
Variant Category | Gencode Comprehensive Category | Identifies whether the variant causes protein-coding changes under the Gencode gene definition. The gene affected by the variant is labeled; if the variant lies in an intergenic region, the nearby gene name is labeled instead. [@harrow2012gencode; @frankish2018gencode] | String | [Source1,2][Ref1,2] |
Variant Category | Disruptive Missense | Identify whether the variant is a disruptive missense variant, defined as "disruptive" by the ensemble MetaSVM annotation. [@dong2014comparison] | Factor | [Source1,2][Ref] |
Variant Category | CAGE Promoter | CAGE-defined promoter sites from FANTOM5. [@forrest2014promoter] | String | [Source][Ref] |
Variant Category | CAGE Enhancer | CAGE-defined permissive enhancer sites from FANTOM5. [@andersson2014atlas] | String | [Source][Ref] |
Variant Category | GeneHancer | Predicted human enhancer sites from the GeneHancer database. [@fishilevich2017genehancer] | String | [Ref] |
Variant Category | SuperEnhancer | Predicted super-enhancer sites and targets in a range of human cell types. [@hnisz2013super] | String | [Source][Ref] |
Variant Category | Gencode Comprehensive Exonic Category | Identifies the variant's impact using the Gencode exonic definition, labeling only exonic categories such as synonymous, non-synonymous, and frameshift indels. [@harrow2012gencode; @frankish2018gencode] | String | [Source1,2][Ref1,2] |
Variant Category | Gencode Comprehensive Exonic Info | Identifies whether the variant causes protein-coding changes under the Gencode gene definition, and gives detailed annotation of which exons the variant affects and the resulting amino acid changes. [@harrow2012gencode; @frankish2018gencode] | String | [Source1,2][Ref1,2] |
Variant Category | UCSC Info | Identifies whether the variant causes protein-coding changes under the UCSC gene definition. The gene affected by the variant is labeled; if the variant lies in an intergenic region, the nearby gene name is labeled instead. | String | [Source] |
Variant Category | UCSC Exonic Info | Identifies whether the variant causes protein-coding changes under the UCSC gene definition, and gives detailed annotation of which exons the variant affects and the resulting amino acid changes. | String | [Source] |
Variant Category | RefSeq Info | Identifies whether the variant causes protein-coding changes under the RefSeq gene definition. The gene affected by the variant is labeled; if the variant lies in an intergenic region, the nearby gene name is labeled instead. | String | [Source] |
Variant Category | RefSeq Exonic Info | Identifies whether the variant causes protein-coding changes under the RefSeq gene definition, and gives detailed annotation of which exons the variant affects and the resulting amino acid changes. | String | [Source] |
Allele Frequencies | TOPMed Bravo AF | TOPMed Bravo Genome Allele Frequency. [@taliun2019sequencing; @nhlbi2018bravo] | num | [Source][Ref] |
Allele Frequencies | GNOMAD Total AF | GNOMAD v3 Genome Allele Frequency using all the samples. [@karczewski2020mutational; @gnomad2019browser] | num | [Source][Ref] |
Allele Frequencies | AFR GNOMAD AF | GNOMAD v3 Genome African population frequency. [@karczewski2020mutational; @gnomad2019browser] | num | [Source][Ref] |
Allele Frequencies | AMR GNOMAD AF | GNOMAD v3 Genome Admixed American population frequency. [@karczewski2020mutational; @gnomad2019browser] | num | [Source][Ref] |
Allele Frequencies | EAS GNOMAD AF | GNOMAD v3 Genome East Asian population frequency. [@karczewski2020mutational; @gnomad2019browser] | num | [Source][Ref] |
Allele Frequencies | NFE GNOMAD AF | GNOMAD v3 Genome Non-Finnish European population frequency. [@karczewski2020mutational; @gnomad2019browser] | num | [Source][Ref] |
Allele Frequencies | FIN GNOMAD AF | GNOMAD v3 Genome Finnish European population frequency. [@karczewski2020mutational; @gnomad2019browser] | num | [Source][Ref] |
Allele Frequencies | SAS GNOMAD AF | GNOMAD v3 Genome South Asian population frequency. [@karczewski2020mutational; @gnomad2019browser] | num | [Source][Ref] |
Allele Frequencies | AMI GNOMAD AF | GNOMAD v3 Genome Amish population frequency. [@karczewski2020mutational; @gnomad2019browser] | num | [Source][Ref] |
Allele Frequencies | ASJ GNOMAD AF | GNOMAD v3 Genome Ashkenazi Jewish population frequency. [@karczewski2020mutational; @gnomad2019browser] | num | [Source][Ref] |
Allele Frequencies | OTH GNOMAD AF | GNOMAD v3 Genome Other (population not assigned) frequency. [@karczewski2020mutational; @gnomad2019browser] | num | [Source][Ref] |
Allele Frequencies | Male GNOMAD AF | GNOMAD v3 Genome Male Allele Frequency. [@karczewski2020mutational; @gnomad2019browser] | num | [Source][Ref] |
Allele Frequencies | AFR Male GNOMAD AF | GNOMAD v3 Genome African Male population frequency. [@karczewski2020mutational; @gnomad2019browser] | num | [Source][Ref] |
Allele Frequencies | AMI Male GNOMAD AF | GNOMAD v3 Genome Amish Male population frequency. [@karczewski2020mutational; @gnomad2019browser] | num | [Source][Ref] |
Allele Frequencies | AMR Male GNOMAD AF | GNOMAD v3 Genome Admixed American Male population frequency. [@karczewski2020mutational; @gnomad2019browser] | num | [Source][Ref] |
Allele Frequencies | ASJ Male GNOMAD AF | GNOMAD v3 Genome Ashkenazi Jewish Male population frequency. [@karczewski2020mutational; @gnomad2019browser] | num | [Source][Ref] |
Allele Frequencies | EAS Male GNOMAD AF | GNOMAD v3 Genome East Asian Male population frequency. [@karczewski2020mutational; @gnomad2019browser] | num | [Source][Ref] |
Allele Frequencies | FIN Male GNOMAD AF | GNOMAD v3 Genome Finnish European Male population frequency. [@karczewski2020mutational; @gnomad2019browser] | num | [Source][Ref] |
Allele Frequencies | NFE Male GNOMAD AF | GNOMAD v3 Genome Non-Finnish European Male population frequency. [@karczewski2020mutational; @gnomad2019browser] | num | [Source][Ref] |
Allele Frequencies | OTH Male GNOMAD AF | GNOMAD v3 Genome Other (population not assigned) Male frequency. [@karczewski2020mutational; @gnomad2019browser] | num | [Source][Ref] |
Allele Frequencies | SAS Male GNOMAD AF | GNOMAD v3 Genome South Asian Male population frequency. [@karczewski2020mutational; @gnomad2019browser] | num | [Source][Ref] |
Allele Frequencies | Female GNOMAD AF | GNOMAD v3 Genome Female Allele Frequency. [@karczewski2020mutational; @gnomad2019browser] | num | [Source][Ref] |
Allele Frequencies | AFR Female GNOMAD AF | GNOMAD v3 Genome African Female population frequency. [@karczewski2020mutational; @gnomad2019browser] | num | [Source][Ref] |
Allele Frequencies | AMI Female GNOMAD AF | GNOMAD v3 Genome Amish Female population frequency. [@karczewski2020mutational; @gnomad2019browser] | num | [Source][Ref] |
Allele Frequencies | AMR Female GNOMAD AF | GNOMAD v3 Genome Admixed American Female population frequency. [@karczewski2020mutational; @gnomad2019browser] | num | [Source][Ref] |
Allele Frequencies | ASJ Female GNOMAD AF | GNOMAD v3 Genome Ashkenazi Jewish Female population frequency. [@karczewski2020mutational; @gnomad2019browser] | num | [Source][Ref] |
Allele Frequencies | EAS Female GNOMAD AF | GNOMAD v3 Genome East Asian Female population frequency. [@karczewski2020mutational; @gnomad2019browser] | num | [Source][Ref] |
Allele Frequencies | FIN Female GNOMAD AF | GNOMAD v3 Genome Finnish European Female population frequency. [@karczewski2020mutational; @gnomad2019browser] | num | [Source][Ref] |
Allele Frequencies | NFE Female GNOMAD AF | GNOMAD v3 Genome Non-Finnish European Female population frequency. [@karczewski2020mutational; @gnomad2019browser] | num | [Source][Ref] |
Allele Frequencies | OTH Female GNOMAD AF | GNOMAD v3 Genome Other (population not assigned) Female frequency. [@karczewski2020mutational; @gnomad2019browser] | num | [Source][Ref] |
Allele Frequencies | SAS Female GNOMAD AF | GNOMAD v3 Genome South Asian Female population frequency. [@karczewski2020mutational; @gnomad2019browser] | num | [Source][Ref] |
Allele Frequencies | ALL 1000G AF | 1000 Genomes allele frequency (whole-genome allele frequencies from the 1000 Genomes Project phase 3 data). | num | [Source] |
Allele Frequencies | AFR 1000G AF | 1000 Genomes African population frequency. | num | [Source] |
Allele Frequencies | AMR 1000G AF | 1000 Genomes Admixed American population frequency. | num | [Source] |
Allele Frequencies | EAS 1000G AF | 1000 Genomes East Asian population frequency. | num | [Source] |
Allele Frequencies | EUR 1000G AF | 1000 Genomes European population frequency. | num | [Source] |
Allele Frequencies | SAS 1000G AF | 1000 Genomes South Asian population frequency. | num | [Source] |
Integrative Score | aPC-Protein-Function | Protein function annotation PC: the first PC of the standardized scores of "SIFTval, PolyPhenVal, Grantham, Polyphen2_HDIV_score, Polyphen2_HVAR_score, MutationTaster_score, MutationAssessor_score" in PHRED scale. Range: [2.970, 97.690]. [@li2020dynamic] | num (+) | Individual annotation channels in the FAVOR database. |
Integrative Score | aPC-Conservation | Conservation annotation PC: the first PC of the standardized scores of "GerpN, GerpS, priPhCons, mamPhCons, verPhCons, priPhyloP, mamPhyloP, verPhyloP" in PHRED scale. Range: [1.478E-09, 99.451]. [@li2020dynamic] | num (+) | Individual annotation channels in the FAVOR database. |
Integrative Score | aPC-Epigenetics-Active | Active Epigenetic annotation PC: the first PC of the standardized scores of "EncodeH3K4me1.max, EncodeH3K4me2.max, EncodeH3K4me3.max, EncodeH3K9ac.max, EncodeH3K27ac.max, EncodeH4K20me1.max, EncodeH2AFZ.max" in PHRED scale. Range: [0, 99.451]. [@li2020dynamic] | num (+) | Individual annotation channels in the FAVOR database. |
Integrative Score | aPC-Epigenetics-Repressed | Repressed Epigenetic annotation PC: the first PC of the standardized scores of "EncodeH3K9me3.max, EncodeH3K27me3.max" in PHRED scale. Range: [0, 99.451]. [@li2020dynamic] | num (+) | Individual annotation channels in the FAVOR database. |
Integrative Score | aPC-Epigenetics-Transcription | Transcription Epigenetic annotation PC: the first PC of the standardized scores of "EncodeH3K36me3.max, EncodeH3K79me2.max" in PHRED scale. Range: [0, 99.451]. [@li2020dynamic] | num (+) | Individual annotation channels in the FAVOR database. |
Integrative Score | aPC-Local-Nucleotide-Diversity | Local nucleotide diversity annotation PC: the first PC of the standardized scores of "bStatistic, RecombinationRate, NuclearDiversity" in PHRED scale. Range: [0, 99.451]. [@li2020dynamic] | num | Individual annotation channels in the FAVOR database. |
Integrative Score | aPC-Mutation-Density | Mutation density annotation PC: the first PC of the standardized scores of "Common100bp, Rare100bp, Sngl100bp, Common1000bp, Rare1000bp, Sngl1000bp, Common10000bp, Rare10000bp, Sngl10000bp" in PHRED scale. Range: [0, 99.451]. [@li2020dynamic] | num | Individual annotation channels in the FAVOR database. |
Integrative Score | aPC-Transcription-Factor | Transcription factor annotation PC: the first PC of the standardized scores of "RemapOverlapTF, RemapOverlapCL" in PHRED scale. Range: [1.185, 99.451]. [@li2020dynamic] | num (+) | Individual annotation channels in the FAVOR database. |
Integrative Score | aPC-Mappability | Mappability annotation PC: the first PC of the standardized scores of "umap_k100, bismap_k100, umap_k50, bismap_k50, umap_k36, bismap_k36, umap_k24, bismap_k24" in PHRED scale. Range: [0.185, 99.451]. [@li2020dynamic] | num (+) | Individual annotation channels in the FAVOR database. |
Integrative Score | aPC-Proximity-To-TSS-TES | Proximity to TSS (Transcription Start Site) and TES (Transcription End Site) annotation PC: the first PC of "minDistTSS, minDistTSE" in PHRED scale. Range: [0, 99.451]. [@li2020dynamic] | num (+) | Individual annotation channels in the FAVOR database. |
Integrative Score | CADD RawScore | The CADD raw score (integrative score). A higher CADD score indicates a more deleterious variant. Range: [-237.102, 22.763]. [@kircher2014general; @rentzsch2018cadd] | num (+) | [Source][Ref1,2] |
Integrative Score | CADD PHRED | The CADD score in PHRED scale (integrative score). A higher CADD score indicates a more deleterious variant. Range: [0, 99]. [@kircher2014general; @rentzsch2018cadd] | num (+) | [Source][Ref1,2] |
Integrative Score | LINSIGHT | The LINSIGHT score (integrative score). A higher LINSIGHT score indicates more functionality. Range: [0.215, 0.995]. [@huang2017fast] | num (+) | [Source][Ref] |
Integrative Score | FATHMM-XF | The FATHMM-XF score (integrative score). A higher FATHMM-XF score indicates more functionality. Range: [0.405, 99.451]. [@rogers2017fathmm] | num (+) | [Source][Ref] |
Integrative Score | Funseq Value (impact score) | A flexible framework to prioritize regulatory mutations from cancer genome sequencing (integrative score). [@fu2014funseq2] | num (+) | [Source][Ref] |
Integrative Score | Funseq Description (annotation) | The Funseq annotation points out whether a given mutation falls in a coding or non-coding region (integrative score). [@fu2014funseq2] | String | [Source][Ref] |
Integrative Score | Aloft Value (impact score) | ALoFT provides extensive annotations to putative loss-of-function variants (LoF) in protein-coding genes including functional, evolutionary and network features (integrative score). [@balasubramanian2017using] | num (+) | [Source][Ref] |
Integrative Score | Aloft Description (annotation) | ALoFT annotation can predict the impact of premature stop variants and classify them as dominant disease-causing, recessive disease-causing and benign variants (integrative score). [@balasubramanian2017using] | String | [Source][Ref] |
Protein Function | PolyPhenCat | PolyPhen category of change. [@adzhubei2010method] | Factor | [Source][Ref] |
Protein Function | PolyPhenVal | PolyPhen score: It predicts the functional significance of an allele replacement from its individual features. Range: [0, 1] (default: 0). [@adzhubei2010method] | num (+) | [Source][Ref] |
Protein Function | Polyphen2_HDIV | Predicts possible impact of an amino acid substitution on the structure and function of a human protein using straightforward physical and comparative considerations. HumDiv is Mendelian disease variants vs. divergence from close mammalian homologs of human proteins (>=95% sequence identity). Range: [0, 1] (default: 0). [@adzhubei2010method] | num (+) | [Source1,2,3][Ref] |
Protein Function | Polyphen2_HVAR | Predicts possible impact of an amino acid substitution on the structure and function of a human protein using straightforward physical and comparative considerations. HumVar is all human variants associated with some disease (except cancer mutations) or loss of activity/function vs. common (minor allele frequency >1%) human polymorphisms with no reported association with a disease or other effect. Range: [0, 1] (default: 0). [@adzhubei2010method] | num (+) | [Source1,2,3][Ref] |
Protein Function | Grantham | Grantham score: oAA, nAA. It attempts to predict the distance between two amino acids, in an evolutionary sense. A lower Grantham score reflects less evolutionary distance. A higher Grantham score reflects a greater evolutionary distance, and is considered more deleterious. Range: [0, 215] (default: 0). [@grantham1974amino] | num (+) | [Source1,2][Ref] |
Protein Function | MutationTaster | MutationTaster is a free web-based application to evaluate DNA sequence variants for their disease-causing potential. The software performs a battery of in silico tests to estimate the impact of the variant on the gene product/protein. Range: [0, 1] (default: 0). [@schwarz2014mutationtaster2] | num (+) | [Source1,2,3][Ref] |
Protein Function | MutationAssessor | Predicts the functional impact of amino-acid substitutions in proteins, such as mutations discovered in cancer or missense polymorphisms. Range: [-5.135, 6.490] (default: -5.545). [@reva2011predicting] | num (+) | [Source1,2,3][Ref] |
Protein Function | SIFTcat | SIFT category of change. [@ng2003sift] | Factor | [Source][Ref] |
Protein Function | SIFTval | SIFT score, ranges from 0.0 (deleterious) to 1.0 (tolerated). Range: [0, 1] (default: 1). [@ng2003sift] | num (-) | [Source][Ref] |
Conservation | priPhCons | Primate phastCons conservation score (excl. human). A higher score means the region is more conserved. PhastCons considers n species rather than two. It considers the phylogeny by which these species are related and, instead of measuring similarity/divergence simply in terms of percent identity, uses statistical models of nucleotide substitution that allow for multiple substitutions per site and for unequal rates of substitution between different pairs of bases. Range: [0, 0.999] (default: 0.0). [@siepel2005evolutionarily] | num (+) | [Source][Ref] |
Conservation | mamPhCons | Mammalian phastCons conservation score (excl. human). A higher score means the region is more conserved. PhastCons considers n species rather than two. It considers the phylogeny by which these species are related and, instead of measuring similarity/divergence simply in terms of percent identity, uses statistical models of nucleotide substitution that allow for multiple substitutions per site and for unequal rates of substitution between different pairs of bases. Range: [0, 1] (default: 0.0). [@siepel2005evolutionarily] | num (+) | [Source][Ref] |
Conservation | verPhCons | Vertebrate phastCons conservation score (excl. human). A higher score means the region is more conserved. PhastCons considers n species rather than two. It considers the phylogeny by which these species are related and, instead of measuring similarity/divergence simply in terms of percent identity, uses statistical models of nucleotide substitution that allow for multiple substitutions per site and for unequal rates of substitution between different pairs of bases. Range: [0, 1] (default: 0.0). [@siepel2005evolutionarily] | num (+) | [Source][Ref] |
Conservation | priPhyloP | Primate phyloP score (excl. human). A higher score means the region is more conserved. PhyloP scores measure evolutionary conservation at individual alignment sites. The scores are calculated by comparing with the evolution expected under neutral drift. Positive scores: measure conservation, i.e., slower evolution than expected, at sites that are predicted to be conserved. Negative scores: measure acceleration, i.e., faster evolution than expected, at sites that are predicted to be fast-evolving. Range: [-10.761, 0.595] (default: -0.029). [@pollard2010detection] | num (+) | [Source][Ref] |
Conservation | mamPhyloP | Mammalian phyloP score (excl. human). A higher score means the region is more conserved. PhyloP scores measure evolutionary conservation at individual alignment sites. The scores are calculated by comparing with the evolution expected under neutral drift. Positive scores: measure conservation, i.e., slower evolution than expected, at sites that are predicted to be conserved. Negative scores: measure acceleration, i.e., faster evolution than expected, at sites that are predicted to be fast-evolving. Range: [-20, 4.494] (default: -0.005). [@pollard2010detection] | num (+) | [Source][Ref] |
Conservation | verPhyloP | Vertebrate phyloP score (excl. human). A higher score means the region is more conserved. PhyloP scores measure evolutionary conservation at individual alignment sites. The scores are calculated by comparing with the evolution expected under neutral drift. Positive scores: measure conservation, i.e., slower evolution than expected, at sites that are predicted to be conserved. Negative scores: measure acceleration, i.e., faster evolution than expected, at sites that are predicted to be fast-evolving. Range: [-20, 11.295] (default: 0.042). [@pollard2010detection] | num (+) | [Source][Ref] |
Conservation | GerpN | Neutral evolution score defined by GERP++. A higher score means the region is more conserved. Range: [0, 19.8] (default: 3.0). [@davydov2010identifying] | num (+) | [Source][Ref] |
Conservation | GerpS | Rejected Substitution score defined by GERP++. A higher score means the region is more conserved. GERP (Genomic Evolutionary Rate Profiling) identifies constrained elements in multiple alignments by quantifying substitution deficits. These deficits represent substitutions that would have occurred if the element were neutral DNA, but did not occur because the element has been under functional constraint. These deficits are referred to as "Rejected Substitutions". Rejected substitutions are a natural measure of constraint that reflects the strength of past purifying selection on the element. GERP estimates constraint for each alignment column; elements are identified as excess aggregations of constrained columns. Positive scores (fewer substitutions than expected) indicate that a site is under evolutionary constraint. Negative scores may be weak evidence of accelerated rates of evolution. Range: [-39.5, 19.8] (default: -0.2). [@davydov2010identifying] | num (+) | [Source][Ref] |
Epigenetics | EncodeDNase | Maximum Encode DNase-seq level over 12 cell lines. Range: [0, 118672] (default: 0.0). [@encode2012integrated] | num (+) | [Source][Ref] |
Epigenetics | EncodeH3K27ac | Maximum Encode H3K27ac level over 14 cell lines. Range: [0.010, 1442.690] (default: 0.36). [@encode2012integrated] | num (+) | [Source][Ref] |
Epigenetics | EncodeH3K4me1 | Maximum Encode H3K4me1 level over 13 cell lines. Range: [0.010, 227.81] (default: 0.37). [@encode2012integrated] | num (+) | [Source][Ref] |
Epigenetics | EncodeH3K4me2 | Maximum Encode H3K4me2 level over 14 cell lines. Range: [0.010, 774.99] (default: 0.37). [@encode2012integrated] | num (+) | [Source][Ref] |
Epigenetics | EncodeH3K4me3 | Maximum Encode H3K4me3 level over 14 cell lines. Range: [0.010, 1093.75] (default: 0.38). [@encode2012integrated] | num (+) | [Source][Ref] |
Epigenetics | EncodeH3K9ac | Maximum Encode H3K9ac level over 13 cell lines. Range: [0.010, 1340.42] (default: 0.41). [@encode2012integrated] | num (+) | [Source][Ref] |
Epigenetics | EncodeH4K20me1 | Maximum Encode H4K20me1 level over 11 cell lines. Range: [0.010, 226.64] (default: 0.47). [@encode2012integrated] | num (+) | [Source][Ref] |
Epigenetics | EncodeH2AFZ | Maximum Encode H2AFZ level over 13 cell lines. Range: [0.020, 468.98] (default: 0.42). [@encode2012integrated] | num (+) | [Source][Ref] |
Epigenetics | EncodeH3K9me3 | Maximum Encode H3K9me3 level over 14 cell lines. Range: [0.010, 226.64] (default: 0.38). [@encode2012integrated] | num (+) | [Source][Ref] |
Epigenetics | EncodeH3K27me3 | Maximum Encode H3K27me3 level over 14 cell lines. Range: [0.010, 193.38] (default: 0.47). [@encode2012integrated] | num (+) | [Source][Ref] |
Epigenetics | EncodeH3K36me3 | Maximum Encode H3K36me3 level over 10 cell lines. Range: [0.020, 246.88] (default: 0.39). [@encode2012integrated] | num (+) | [Source][Ref] |
Epigenetics | EncodeH3K79me2 | Maximum Encode H3K79me2 level over 13 cell lines. Range: [0.020, 553.06] (default: 0.34). [@encode2012integrated] | num (+) | [Source][Ref] |
Epigenetics | EncodetotalRNA | Maximum Encode totalRNA-seq level over 10 cell lines (minus and plus strand separately). Range: [0, 385096] (default: 0.0). [@encode2012integrated] | num (+) | [Source][Ref] |
Epigenetics | GC | Percent GC in a window of +/- 75bp. Range: [0, 1] (default: 0.42). | num (+) | [Source] |
Epigenetics | CpG | Percent CpG in a window of +/- 75bp. Range: [0, 0.604] (default: 0.02). | num (+) | [Source] |
Transcription Factors | RemapOverlapTF | Remap number of different transcription factors binding. Range: [1, 350] (default: -0.5). | int (+) | [Source] |
Transcription Factors | RemapOverlapCL | Remap number of different transcription factor - cell line combinations binding. Range: [1, 1068] (default: -0.5). | int (+) | [Source] |
Chromatin States | cHmm E1 | Number of cell types (out of 48) in chromHMM state E1_poised (default: 1.92). [@ernst2015large] | num | [Source][Ref] |
Chromatin States | cHmm E2 | Number of cell types (out of 48) in chromHMM state E2_repressed (default: 1.92). [@ernst2015large] | num | [Source][Ref] |
Chromatin States | cHmm E3 | Number of cell types (out of 48) in chromHMM state E3_dead (default: 1.92). [@ernst2015large] | num | [Source][Ref] |
Chromatin States | cHmm E4 | Number of cell types (out of 48) in chromHMM state E4_dead (default: 1.92). [@ernst2015large] | num | [Source][Ref] |
Chromatin States | cHmm E5 | Number of cell types (out of 48) in chromHMM state E5_repressed (default: 1.92). [@ernst2015large] | num | [Source][Ref] |
Chromatin States | cHmm E6 | Number of cell types (out of 48) in chromHMM state E6_repressed (default: 1.92). [@ernst2015large] | num | [Source][Ref] |
Chromatin States | cHmm E7 | Number of cell types (out of 48) in chromHMM state E7_weak (default: 1.92). [@ernst2015large] | num | [Source][Ref] |
Chromatin States | cHmm E8 | Number of cell types (out of 48) in chromHMM state E8_gene (default: 1.92). [@ernst2015large] | num | [Source][Ref] |
Chromatin States | cHmm E9 | Number of cell types (out of 48) in chromHMM state E9_gene (default: 1.92). [@ernst2015large] | num | [Source][Ref] |
Chromatin States | cHmm E10 | Number of cell types (out of 48) in chromHMM state E10_gene (default: 1.92). [@ernst2015large] | num | [Source][Ref] |
Chromatin States | cHmm E11 | Number of cell types (out of 48) in chromHMM state E11_gene (default: 1.92). [@ernst2015large] | num | [Source][Ref] |
Chromatin States | cHmm E12 | Number of cell types (out of 48) in chromHMM state E12_distal (default: 1.92). [@ernst2015large] | num | [Source][Ref] |
Chromatin States | cHmm E13 | Number of cell types (out of 48) in chromHMM state E13_distal (default: 1.92). [@ernst2015large] | num | [Source][Ref] |
Chromatin States | cHmm E14 | Number of cell types (out of 48) in chromHMM state E14_distal (default: 1.92). [@ernst2015large] | num | [Source][Ref] |
Chromatin States | cHmm E15 | Number of cell types (out of 48) in chromHMM state E15_weak (default: 1.92). [@ernst2015large] | num | [Source][Ref] |
Chromatin States | cHmm E16 | Number of cell types (out of 48) in chromHMM state E16_tss (default: 1.92). [@ernst2015large] | num | [Source][Ref] |
Chromatin States | cHmm E17 | Number of cell types (out of 48) in chromHMM state E17_proximal (default: 1.92). [@ernst2015large] | num | [Source][Ref] |
Chromatin States | cHmm E18 | Number of cell types (out of 48) in chromHMM state E18_proximal (default: 1.92). [@ernst2015large] | num | [Source][Ref] |
Chromatin States | cHmm E19 | Number of cell types (out of 48) in chromHMM state E19_tss (default: 1.92). [@ernst2015large] | num | [Source][Ref] |
Chromatin States | cHmm E20 | Number of cell types (out of 48) in chromHMM state E20_poised (default: 1.92). [@ernst2015large] | num | [Source][Ref] |
Chromatin States | cHmm E21 | Number of cell types (out of 48) in chromHMM state E21_dead (default: 1.92). [@ernst2015large] | num | [Source][Ref] |
Chromatin States | cHmm E22 | Number of cell types (out of 48) in chromHMM state E22_repressed (default: 1.92). [@ernst2015large] | num | [Source][Ref] |
Chromatin States | cHmm E23 | Number of cell types (out of 48) in chromHMM state E23_weak (default: 1.92). [@ernst2015large] | num | [Source][Ref] |
Chromatin States | cHmm E24 | Number of cell types (out of 48) in chromHMM state E24_distal (default: 1.92). [@ernst2015large] | num | [Source][Ref] |
Chromatin States | cHmm E25 | Number of cell types (out of 48) in chromHMM state E25_distal (default: 1.92). [@ernst2015large] | num | [Source][Ref] |
Local Nucleotide Diversity | RecombinationRate | Recombination rate measures how likely the region is to undergo recombination. Range: [0, 54.96] (default: 0). [@gazal2017linkage] | num (+) | [Ref] |
Local Nucleotide Diversity | NuclearDiversity | Nuclear diversity measures how likely the region is to diversify. Range: [0.05, 60.25] (default: 0). [@gazal2017linkage] | num (+) | [Ref] |
Local Nucleotide Diversity | bStatistic | Background selection score. A background selection (B) value for each position in the genome. B indicates the expected fraction of neutral diversity that is present at a site, with values close to 0 representing near complete removal of diversity as a result of selection and values near 1000 indicating little effect of selection. Range: [0, 1000] (default: 800). [@mcvicker2009widespread] | int (+) | [Source][Ref] |
Mutation Density | Common100bp | Number of common (MAF > 0.05) BRAVO SNVs in the nearby 100 bp window. A higher value indicates that more mutations occur in the region and a higher likelihood of mutation. Scores range from 0 to 100. Range: [0, 14] (default: 0). | int (+) | [Source] |
Mutation Density | Rare100bp | Number of rare (MAF < 0.05) BRAVO SNVs in the nearby 100 bp window. A higher value indicates that more mutations occur in the region and a higher likelihood of mutation. Scores range from 0 to 100. Range: [0, 31] (default: 0). | int (+) | [Source] |
Mutation Density | Sngl100bp | Number of single-occurrence BRAVO SNVs in the nearby 100 bp window. A higher value indicates that more mutations occur in the region and a higher likelihood of mutation. Scores range from 0 to 100. Range: [0, 99] (default: 0). | int (+) | [Source] |
Mutation Density | Common1000bp | Number of common (MAF > 0.05) BRAVO SNVs in the nearby 1000 bp window. A higher value indicates that more mutations occur in the region and a higher likelihood of mutation. Scores range from 0 to 1000. Range: [0, 73] (default: 0). | int (+) | [Source] |
Mutation Density | Rare1000bp | Number of rare (MAF < 0.05) BRAVO SNVs in the nearby 1000 bp window. A higher value indicates that more mutations occur in the region and a higher likelihood of mutation. Scores range from 0 to 1000. Range: [0, 74] (default: 0). | int (+) | [Source] |
Mutation Density | Sngl1000bp | Number of single-occurrence BRAVO SNVs in the nearby 1000 bp window. A higher value indicates that more mutations occur in the region and a higher likelihood of mutation. Scores range from 0 to 1000. Range: [0, 658] (default: 0). | int (+) | [Source] |
Mutation Density | Common10000bp | Number of common (MAF > 0.05) BRAVO SNVs in the nearby 10000 bp window. A higher value indicates that more mutations occur in the region and a higher likelihood of mutation. Scores range from 0 to 10000. Range: [0, 443] (default: 0). | int (+) | [Source] |
Mutation Density | Rare10000bp | Number of rare (MAF < 0.05) BRAVO SNVs in the nearby 10000 bp window. A higher value indicates that more mutations occur in the region and a higher likelihood of mutation. Scores range from 0 to 10000. Range: [0, 355] (default: 0). | int (+) | [Source] |
Mutation Density | Sngl10000bp | Number of single-occurrence BRAVO SNVs in the nearby 10000 bp window. A higher value indicates that more mutations occur in the region and a higher likelihood of mutation. Scores range from 0 to 10000. Range: [0, 4750] (default: 0). | int (+) | [Source] |
Mappability | Umap (k100, k50, k36, k24) | Mappability of unconverted genome. It measures the extent to which a position can be uniquely mapped by sequence reads. Lower mappability means the estimates of genomic and epigenomic characteristics from sequencing assays are less reliable, and the region has increased susceptibility to spurious mapping from reads from other regions of the genome with sequencing errors or unexpected genetic variation. Range: [0, 1] (default: 0). [@karimzadeh2018umap] | num (+) | [Source][Ref] |
Mappability | Bismap (k100, k50, k36, k24) | Mappability of the bisulfite-converted genome. Bisulfite sequencing approaches used to identify DNA methylation introduce large numbers of reads that map to multiple regions. This annotation identifies mappability of the bisulfite-converted genome. Range: [0, 1] (default: 0). [@karimzadeh2018umap] | num (+) | [Source][Ref] |
Proximity Table | minDistTSS | Distance to closest Transcribed Sequence Start (TSS). Range: [1, 3604063] (default: 1e7). | num (-) | [Source] |
Proximity Table | minDistTSE | Distance to closest Transcribed Sequence End (TSE). Range: [1, 3608885] (default: 1e7). | num (-) | [Source] |
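
To illustrate how the Mutation Density counts in the table above could be derived, the following minimal sketch counts SNV positions that fall in a window around a query position. It is not the FAVOR implementation: the function name `count_in_window`, the toy positions, and the window definition (centered on the query position, extending half the window size in each direction, and including the query position itself) are assumptions made for illustration, since the table only says "nearby".

```python
from bisect import bisect_left, bisect_right


def count_in_window(sorted_positions, pos, window_bp):
    """Count SNV positions inside a window of `window_bp` centered on `pos`.

    Assumption (not stated in the source): "nearby" means the closed interval
    [pos - window_bp // 2, pos + window_bp // 2] on the same chromosome.
    `sorted_positions` must be sorted in ascending order.
    """
    half = window_bp // 2
    lo = bisect_left(sorted_positions, pos - half)   # first index >= left edge
    hi = bisect_right(sorted_positions, pos + half)  # first index > right edge
    return hi - lo


# Hypothetical toy data: sorted positions of common SNVs on one chromosome.
common_snvs = [1000, 1020, 1049, 1051, 1400, 1501, 9000]

for window in (100, 1000, 10000):
    print(window, count_in_window(common_snvs, 1000, window))
```

Running the same function over separate position lists for common, rare, and single-occurrence SNVs would give one count per variant class and window size, mirroring the nine Mutation Density rows.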
Graphical illustration of gene-based annotation
Please refer to the gene-based annotation figure above for the detailed meaning of each annotation category. Here, “exonic” refers only to the coding exonic portion and not the UTR portion, as two keywords (UTR5, UTR3) are specifically reserved for UTR annotations. “splicing” is defined as a variant that is within 2 bp of an exon/intron boundary. If a variant is located in both a 5’ UTR and a 3’ UTR region (possibly for two different genes), then “UTR5,UTR3” will be printed as the output. The terms “upstream” and “downstream” are defined as 1 kb away from the transcription start site or transcription end site, respectively, taking the strand of the mRNA into account. If a variant is located in both the downstream and upstream regions (possibly for two different genes), then “upstream,downstream” will be printed as the output. If a variant is located in CAGE promoter/enhancer or GeneHancer regions, it will also be annotated as CAGE promoter/enhancer or GeneHancer.
Calculation of annotation PCs (aPCs) and interpretation
Often it is helpful to have a single metric summarizing multiple similar annotations measuring the same underlying biological function. We achieve this goal by proposing variant annotation Principal Components (aPCs), which are principal component summaries of the multi-faceted functional annotation data in FAVOR. Unlike ancestral PCs that are subject-specific and are calculated using genotypes across the genome to control for population structure, annotation PCs are variant-specific and are calculated using functional annotations for individual variants. Annotation PCs summarize multiple aspects of variant function, with different blocks of individual functional annotations in the heatmap below captured by different annotation PCs (Figure 1) [@li2020dynamic]. We summarize the detailed steps of obtaining aPCs as follows (currently aPCs are calculated for all PASS SNVs in the variant set):
- Step 0 (pre-processing): We impute variants with missing individual scores using their default values, and transform particular individual scores such that (1) a higher value of each individual score indicates increased functionality according to that annotation; (2) the distribution of each individual score becomes less skewed. Specifically, we use
where
- Step 1: We group the individual annotation scores into major functional blocks based on a priori knowledge. Each block captures a specific aspect of variant biological function: protein function, conservation, epigenetics, local nucleotide diversity, proximity (distance) to coding, mutation density, transcription factors, mappability, and proximity (distance) to TSS/TES (see Table above) [@li2020dynamic].
- Step 2: For each annotation block $k$, we center and standardize all (pre-processed) individual scores within the block, and obtain the standardized individual annotation score matrix $X_k$ (i.e., each column of $X_k$ has mean 0 and variance 1).
- Step 3: For each annotation block $k$, we calculate the aPC raw score of that block as the first PC from the standardized individual scores. Specifically, for annotation block $k$,

  $$\mathrm{aPC}_k = X_k v_k,$$

  where $v_k$ is the eigenvector corresponding to the largest eigenvalue of $X_k^\top X_k$. Note that here we flip the sign of $v_k$ (if necessary) such that each $\mathrm{aPC}_k$ is positively correlated with the individual scores in that block.
- Step 4: To facilitate better interpretation, these aPC raw scores are transformed into PHRED-scaled scores for each variant across the genome [@li2020dynamic], defined as

  $$\mathrm{PHRED} = -10 \log_{10}\left(\frac{\mathrm{rank}(-\mathrm{aPC}_k)}{M}\right),$$

  where $\mathrm{rank}(-\mathrm{aPC}_k)$ is the rank of the variant when the aPC raw scores of annotation block $k$ are sorted in decreasing order, and $M$ is the total number of variants sequenced across the whole genome. Note that the PHRED-scaled scores used in annotation PCs express the rank in order of magnitude. For example, among all the variants in the FAVOR database, a variant in the top 10% of aPC raw scores has a PHRED score of 10, one in the top 1% has a PHRED score of 20, and one in the top 0.1% has a PHRED score of 30.
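
The steps above can be sketched for a single annotation block as follows. This is a minimal illustration under stated assumptions, not the FAVOR pipeline. The function name `apc_phred` and the simulated scores are hypothetical. The skew-reducing transformation of Step 0 is omitted. The sign is flipped when the summed covariance between the first PC and the standardized scores is negative, which is one possible way to enforce a positive correlation. The block is also assumed to have no constant columns.

```python
import numpy as np


def apc_phred(scores, defaults):
    """Return (aPC raw scores, PHRED-scaled scores) for one annotation block.

    scores:   (n_variants, n_annotations) array, NaN marks a missing value.
    defaults: per-annotation default values used for imputation (Step 0).
    """
    X = np.asarray(scores, dtype=float)
    # Step 0: impute missing scores with the annotation's default value.
    X = np.where(np.isnan(X), np.asarray(defaults, dtype=float), X)
    # Step 2: center and standardize each column (mean 0, variance 1).
    X = (X - X.mean(axis=0)) / X.std(axis=0)
    # Step 3: first PC = X v, with v the eigenvector of X^T X that has the
    # largest eigenvalue (np.linalg.eigh returns eigenvalues in ascending order).
    _, eigvecs = np.linalg.eigh(X.T @ X)
    v = eigvecs[:, -1]
    if np.sum(X.T @ (X @ v)) < 0:  # assumption: flip on negative summed covariance
        v = -v
    raw = X @ v
    # Step 4: PHRED = -10 * log10(rank / M), rank 1 = largest raw score.
    n = raw.shape[0]
    order = np.argsort(-raw, kind="stable")
    rank = np.empty(n)
    rank[order] = np.arange(1, n + 1)
    phred = -10.0 * np.log10(rank / n)
    return raw, phred


# Simulated toy block: three noisy measurements of one latent signal `z`.
rng = np.random.default_rng(0)
z = rng.normal(size=1000)
scores = np.column_stack([z + 0.5 * rng.normal(size=1000) for _ in range(3)])
scores[::50, 0] = np.nan          # a few missing values in the first annotation
defaults = [0.0, 0.0, 0.0]        # hypothetical default values

raw, phred = apc_phred(scores, defaults)
print(round(float(phred.max()), 6))  # top-ranked of 1000 variants -> 30.0
```

With 1000 variants, the top-ranked variant receives a PHRED score of 30 and the variant at rank 100 receives 10, matching the percentile interpretation given in Step 4.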
Figure 1. Correlation heatmap of individual and integrative functional annotations. The figure shows pairwise correlations between 63 individual and integrative functional annotations. The cells in the visualization are colored by Pearson's correlation coefficient values with deeper colors indicating higher positive (red) or negative (blue) correlations. Each annotation principal component (aPC) is the first PC calculated from the set of standardized individual functional annotations that measure similar biological function. These aPCs are then transformed into the PHRED-scaled scores for each variant across the genome.