Introduction

Functional Annotation of Variants - Online Resource (FAVOR) is an open-access web portal that assembles functional annotation data for individual variants from a variety of sources and displays it through a web interface. FAVOR currently provides functional annotations grouped into 13 major attributes: Basic, Variant Category, Allele Frequencies, Integrative Score, Protein Function, Epigenetics, Conservation, Transcription Factors, Chromatin States, Local Nucleotide Diversity, Mutation Density, Mappability, and Proximity. FAVOR supports the following queries:

  • Single variant query: The FAVOR web portal supports single-variant queries by either genome position (Build GRCh38) or rsID. The results are displayed in tables. If an rsID corresponds to multiple alleles, the results for each allele are shown on a separate page.

  • Region-based and gene-based query: Annotations for all variants in the Trans-Omics for Precision Medicine (TOPMed) Freeze 8 BRAVO variant set (705,486,649 variants observed in the whole genomes of 132,345 samples) that fall in a given gene or region are displayed in tables in the web interface. For region-based queries, genome positions are specified using Build GRCh38.

  • Batch submission: A variant list using either genome locations (Build GRCh38) or rsIDs can be uploaded, either in the (Chr, Position, Ref, Alt) format or in another standard file format (.txt). The functional annotation results are displayed in tables in the web interface, and the full annotated results are available for download in .csv format.
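
As a rough illustration of the (Chr, Position, Ref, Alt) layout, the sketch below converts identifiers written in the chr-pos-ref-alt form (the format the annotation table gives for the Variant field) into rows for a batch file. The tab delimiter, the absence of a header row, the helper names, and the example variant are assumptions made for illustration; the portal's upload page is the authority on what it actually accepts.

```python
from typing import List, NamedTuple


class Variant(NamedTuple):
    chrom: str
    pos: int
    ref: str
    alt: str


def parse_variant_id(variant_id: str) -> Variant:
    """Split a chr-pos-ref-alt identifier into its four fields (hypothetical helper)."""
    chrom, pos, ref, alt = variant_id.split("-")
    return Variant(chrom, int(pos), ref, alt)


def to_batch_lines(variants: List[Variant], sep: str = "\t") -> List[str]:
    """Render variants as Chr, Position, Ref, Alt rows; the delimiter is an assumption."""
    return [sep.join([v.chrom, str(v.pos), v.ref, v.alt]) for v in variants]


# Illustrative identifier only; positions are on Build GRCh38.
example = parse_variant_id("19-44908822-C-T")
batch_lines = to_batch_lines([example])
```

Writing `"\n".join(batch_lines)` to a .txt file would give one variant per line.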

Variant set

The current version of the FAVOR database contains a total of 8,892,915,237 variants: all 8,812,917,339 possible SNVs and 79,997,898 indels.
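
The counts can be checked with a few lines of arithmetic. The sketch below confirms that the SNV and indel counts add up to the stated total. It also shows that the SNV count is an exact multiple of three, which is what one would expect if every reference position contributes its three alternate bases; that reading is an inference from the numbers, not something the text states.

```python
total_variants = 8_892_915_237
snvs = 8_812_917_339
indels = 79_997_898

# The two categories should account for every variant in the database.
counts_add_up = (snvs + indels == total_variants)

# Three alternate alleles per reference base would make the SNV count divisible by 3.
positions, remainder = divmod(snvs, 3)
```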

Functional annotations and annotation PCs (aPCs)

The functional annotations provided in the FAVOR web portal are as follows:

Block Name | Annotation Name | Explanation | Type | Source
BasicVariantThe unique identifier of the given variant. Reported as chr-pos-ref-alt format.String
BasicrsIDThe rsID of the given variant (if exists).String
BasicTOPMed DepthTOPMed depth of the given variant.String
BasicTOPMed QC StatusTOPMed QC status of the given variant.String
ClinVarClinical SignificanceClinical significance for this single variant. [@landrum2013clinvar; @landrum2017clinvar]String[Source][Ref1,2]
ClinVarClinical significance (genotype includes)Clinical significance for a haplotype or genotype that includes this variant. Reported as pairs of VariationID:clinical significance. [@landrum2013clinvar; @landrum2017clinvar]String[Source][Ref1,2]
ClinVarDisease NameClinVar's preferred disease name for the concept specified by disease identifiers in CLNDISDB. [@landrum2013clinvar; @landrum2017clinvar]String[Source][Ref1,2]
ClinVarDisease Name (included variant)For included variant: ClinVar's preferred disease name for the concept specified by disease identifiers in CLNDISDB. [@landrum2013clinvar; @landrum2017clinvar]String[Source][Ref1,2]
ClinVarReview StatusClinVar review status for the Variation ID. [@landrum2013clinvar; @landrum2017clinvar]String[Source][Ref1,2]
ClinVarAllele OriginAllele origin: 0 - unknown; 1 - germline; 2 - somatic; 4 - inherited; 8 - paternal; 16 - maternal; 32 - de-novo; 64 - biparental; 128 - uniparental; 256 - not-tested; 512 - tested-inconclusive. [@landrum2013clinvar; @landrum2017clinvar]String[Source][Ref1,2]
ClinVarDisease Database IDTag-value pairs of disease database name and identifier, e.g. OMIM:NNNNNN. [@landrum2013clinvar; @landrum2017clinvar]String[Source][Ref1,2]
ClinVarDisease Database ID (included variant)For included variant: Tag-value pairs of disease database name and identifier, e.g. OMIM:NNNNNN. [@landrum2013clinvar; @landrum2017clinvar]String[Source][Ref1,2]
ClinVarGene ReportedGene(s) for the variant reported as gene symbol:gene id. The gene symbol and id are delimited by a colon (:) and each pair is delimited by a vertical bar (|). [@landrum2013clinvar; @landrum2017clinvar]String[Source][Ref1,2]
Variant CategoryGencode Comprehensive InfoIdentifies whether the variant causes protein-coding changes using the Gencode gene definitions and labels the name of the gene that the variant affects. If the variant lies in an intergenic region, the name of the nearby gene is labeled instead. [@harrow2012gencode; @frankish2018gencode]String[Source1,2][Ref1,2]
Variant CategoryGencode Comprehensive CategoryIdentifies whether the variant causes protein-coding changes using the Gencode gene definitions and labels the name of the gene that the variant affects. If the variant lies in an intergenic region, the name of the nearby gene is labeled instead. [@harrow2012gencode; @frankish2018gencode]String[Source1,2][Ref1,2]
Variant CategoryDisruptive MissenseIdentify whether the variant is a disruptive missense variant, defined as "disruptive" by the ensemble MetaSVM annotation. [@dong2014comparison]Factor[Source1,2][Ref]
Variant CategoryCAGE PromoterCAGE defined promoter sites from Fantom 5. [@forrest2014promoter]String[Source][Ref]
Variant CategoryCAGE EnhancerCAGE defined permissive Enhancer sites from Fantom 5. [@andersson2014atlas]String[Source][Ref]
Variant CategoryGeneHancerPredicted human enhancer sites from the GeneHancer database. [@fishilevich2017genehancer]String[Ref]
Variant CategorySuperEnhancerPredicted super-enhancer sites and targets in a range of human cell types. [@hnisz2013super]String[Source][Ref]
Variant CategoryGencode Comprehensive Exonic CategoryIdentifies the variant's impact using the Gencode exonic definition and labels only exonic categories such as synonymous, non-synonymous, and frameshift indels. [@harrow2012gencode; @frankish2018gencode]String[Source1,2][Ref1,2]
Variant CategoryGencode Comprehensive Exonic InfoIdentifies whether the variant causes protein-coding changes using the Gencode gene definitions, and gives detailed annotation of which exons the variant affects and the resulting amino acid changes. [@harrow2012gencode; @frankish2018gencode]String[Source1,2][Ref1,2]
Variant CategoryUCSC InfoIdentifies whether the variant causes protein-coding changes using the UCSC gene definitions and labels the name of the gene that the variant affects. If the variant lies in an intergenic region, the name of the nearby gene is labeled instead.String[Source]
Variant CategoryUCSC Exonic InfoIdentifies whether the variant causes protein-coding changes using the UCSC gene definitions, and gives detailed annotation of which exons the variant affects and the resulting amino acid changes.String[Source]
Variant CategoryRefSeq InfoIdentifies whether the variant causes protein-coding changes using the RefSeq gene definitions and labels the name of the gene that the variant affects. If the variant lies in an intergenic region, the name of the nearby gene is labeled instead.String[Source]
Variant CategoryRefSeq Exonic InfoIdentifies whether the variant causes protein-coding changes using the RefSeq gene definitions, and gives detailed annotation of which exons the variant affects and the resulting amino acid changes.String[Source]
Allele FrequenciesTOPMed Bravo AFTOPMed Bravo Genome Allele Frequency. [@taliun2019sequencing; @nhlbi2018bravo]num[Source][Ref]
Allele FrequenciesGNOMAD Total AFGNOMAD v3 Genome Allele Frequency using all the samples. [@karczewski2020mutational; @gnomad2019browser]num[Source][Ref]
Allele FrequenciesAFR GNOMAD AFGNOMAD v3 Genome African population frequency. [@karczewski2020mutational; @gnomad2019browser]num[Source][Ref]
Allele FrequenciesAMR GNOMAD AFGNOMAD v3 Genome Ad Mixed American population frequency. [@karczewski2020mutational; @gnomad2019browser]num[Source][Ref]
Allele FrequenciesEAS GNOMAD AFGNOMAD v3 Genome East Asian population frequency. [@karczewski2020mutational; @gnomad2019browser]num[Source][Ref]
Allele FrequenciesNFE GNOMAD AFGNOMAD v3 Genome Non-Finnish European population frequency. [@karczewski2020mutational; @gnomad2019browser]num[Source][Ref]
Allele FrequenciesFIN GNOMAD AFGNOMAD v3 Genome Finnish European population frequency. [@karczewski2020mutational; @gnomad2019browser]num[Source][Ref]
Allele FrequenciesSAS GNOMAD AFGNOMAD v3 Genome South Asian population frequency. [@karczewski2020mutational; @gnomad2019browser]num[Source][Ref]
Allele FrequenciesAMI GNOMAD AFGNOMAD v3 Genome Amish population frequency. [@karczewski2020mutational; @gnomad2019browser]num[Source][Ref]
Allele FrequenciesASJ GNOMAD AFGNOMAD v3 Genome Ashkenazi Jewish population frequency. [@karczewski2020mutational; @gnomad2019browser]num[Source][Ref]
Allele FrequenciesOTH GNOMAD AFGNOMAD v3 Genome Other (population not assigned) frequency. [@karczewski2020mutational; @gnomad2019browser]num[Source][Ref]
Allele FrequenciesMale GNOMAD AFGNOMAD v3 Genome Male Allele Frequency. [@karczewski2020mutational; @gnomad2019browser]num[Source][Ref]
Allele FrequenciesAFR Male GNOMAD AFGNOMAD v3 Genome African Male population frequency. [@karczewski2020mutational; @gnomad2019browser]num[Source][Ref]
Allele FrequenciesAMI Male GNOMAD AFGNOMAD v3 Genome Amish Male population frequency. [@karczewski2020mutational; @gnomad2019browser]num[Source][Ref]
Allele FrequenciesAMR Male GNOMAD AFGNOMAD v3 Genome Ad Mixed American Male population frequency. [@karczewski2020mutational; @gnomad2019browser]num[Source][Ref]
Allele FrequenciesASJ Male GNOMAD AFGNOMAD v3 Genome Ashkenazi Jewish Male population frequency. [@karczewski2020mutational; @gnomad2019browser]num[Source][Ref]
Allele FrequenciesEAS Male GNOMAD AFGNOMAD v3 Genome East Asian Male population frequency. [@karczewski2020mutational; @gnomad2019browser]num[Source][Ref]
Allele FrequenciesFIN Male GNOMAD AFGNOMAD v3 Genome Finnish European Male population frequency. [@karczewski2020mutational; @gnomad2019browser]num[Source][Ref]
Allele FrequenciesNFE Male GNOMAD AFGNOMAD v3 Genome Non-Finnish European Male population frequency. [@karczewski2020mutational; @gnomad2019browser]num[Source][Ref]
Allele FrequenciesOTH Male GNOMAD AFGNOMAD v3 Genome Other (population not assigned) Male frequency. [@karczewski2020mutational; @gnomad2019browser]num[Source][Ref]
Allele FrequenciesSAS Male GNOMAD AFGNOMAD v3 Genome South Asian Male population frequency. [@karczewski2020mutational; @gnomad2019browser]num[Source][Ref]
Allele FrequenciesFemale GNOMAD AFGNOMAD v3 Genome Female Allele Frequency. [@karczewski2020mutational; @gnomad2019browser]num[Source][Ref]
Allele FrequenciesAFR Female GNOMAD AFGNOMAD v3 Genome African Female population frequency. [@karczewski2020mutational; @gnomad2019browser]num[Source][Ref]
Allele FrequenciesAMI Female GNOMAD AFGNOMAD v3 Genome Amish Female population frequency. [@karczewski2020mutational; @gnomad2019browser]num[Source][Ref]
Allele FrequenciesAMR Female GNOMAD AFGNOMAD v3 Genome Ad Mixed American Female population frequency. [@karczewski2020mutational; @gnomad2019browser]num[Source][Ref]
Allele FrequenciesASJ Female GNOMAD AFGNOMAD v3 Genome Ashkenazi Jewish Female population frequency. [@karczewski2020mutational; @gnomad2019browser]num[Source][Ref]
Allele FrequenciesEAS Female GNOMAD AFGNOMAD v3 Genome East Asian Female population frequency. [@karczewski2020mutational; @gnomad2019browser]num[Source][Ref]
Allele FrequenciesFIN Female GNOMAD AFGNOMAD v3 Genome Finnish European Female population frequency. [@karczewski2020mutational; @gnomad2019browser]num[Source][Ref]
Allele FrequenciesNFE Female GNOMAD AFGNOMAD v3 Genome Non-Finnish European Female population frequency. [@karczewski2020mutational; @gnomad2019browser]num[Source][Ref]
Allele FrequenciesOTH Female GNOMAD AFGNOMAD v3 Genome Other (population not assigned) Female frequency. [@karczewski2020mutational; @gnomad2019browser]num[Source][Ref]
Allele FrequenciesSAS Female GNOMAD AFGNOMAD v3 Genome South Asian Female population frequency. [@karczewski2020mutational; @gnomad2019browser]num[Source][Ref]
Allele FrequenciesALL 1000G AF1000 Genome Allele Frequency (Whole genome allele frequencies from the 1000 Genomes Project phase 3 data).num[Source]
Allele FrequenciesAFR 1000G AF1000 Genomes African population frequency.num[Source]
Allele FrequenciesAMR 1000G AF1000 Genomes Ad Mixed American population frequency.num[Source]
Allele FrequenciesEAS 1000G AF1000 Genomes East Asian population frequency.num[Source]
Allele FrequenciesEUR 1000G AF1000 Genomes European population frequency.num[Source]
Allele FrequenciesSAS 1000G AF1000 Genomes South Asian population frequency.num[Source]
Integrative ScoreaPC-Protein-FunctionProtein function annotation PC: the first PC of the standardized scores of "SIFTval, PolyPhenVal, Grantham, Polyphen2_HDIV_score, Polyphen2_HVAR_score, MutationTaster_score, MutationAssessor_score" in PHRED scale. Range: [2.970, 97.690]. [@li2020dynamic]num (+)Individual annotation channels in the FAVOR database.
Integrative ScoreaPC-ConservationConservation annotation PC: the first PC of the standardized scores of "GerpN, GerpS, priPhCons, mamPhCons, verPhCons, priPhyloP, mamPhyloP, verPhyloP" in PHRED scale. Range: [1.478E-09, 99.451]. [@li2020dynamic]num (+)Individual annotation channels in the FAVOR database.
Integrative ScoreaPC-Epigenetics-ActiveActive Epigenetic annotation PC: the first PC of the standardized scores of "EncodeH3K4me1.max, EncodeH3K4me2.max, EncodeH3K4me3.max, EncodeH3K9ac.max, EncodeH3K27ac.max, EncodeH4K20me1.max, EncodeH2AFZ.max" in PHRED scale. Range: [0, 99.451]. [@li2020dynamic]num (+)Individual annotation channels in the FAVOR database.
Integrative ScoreaPC-Epigenetics-RepressedRepressed Epigenetic annotation PC: the first PC of the standardized scores of "EncodeH3K9me3.max, EncodeH3K27me3.max" in PHRED scale. Range: [0, 99.451]. [@li2020dynamic]num (+)Individual annotation channels in the FAVOR database.
Integrative ScoreaPC-Epigenetics-TranscriptionTranscription Epigenetic annotation PC: the first PC of the standardized scores of "EncodeH3K36me3.max, EncodeH3K79me2.max" in PHRED scale. Range: [0, 99.451]. [@li2020dynamic]num (+)Individual annotation channels in the FAVOR database.
Integrative ScoreaPC-Local-Nucleotide-DiversityLocal nucleotide diversity annotation PC: the first PC of the standardized scores of "bStatistic, RecombinationRate, NuclearDiversity" in PHRED scale. Range: [0, 99.451]. [@li2020dynamic]numIndividual annotation channels in the FAVOR database.
Integrative ScoreaPC-Mutation-DensityMutation density annotation PC: the first PC of the standardized scores of "Common100bp, Rare100bp, Sngl100bp, Common1000bp, Rare1000bp, Sngl1000bp, Common10000bp, Rare10000bp, Sngl10000bp" in PHRED scale. Range: [0, 99.451]. [@li2020dynamic]numIndividual annotation channels in the FAVOR database.
Integrative ScoreaPC-Transcription-FactorTranscription factor annotation PC: the first PC of the standardized scores of "RemapOverlapTF, RemapOverlapCL" in PHRED scale. Range: [1.185, 99.451]. [@li2020dynamic]num (+)Individual annotation channels in the FAVOR database.
Integrative ScoreaPC-MappabilityMappability annotation PC: the first PC of the standardized scores of "umap_k100, bismap_k100, umap_k50, bismap_k50, umap_k36, bismap_k36, umap_k24, bismap_k24" in PHRED scale. Range: [0.185, 99.451]. [@li2020dynamic]num (+)Individual annotation channels in the FAVOR database.
Integrative ScoreaPC-Proximity-To-TSS-TESProximity to TSS (Transcription Start Site) and TES (Transcription End Site) annotation PC: the first PC of "minDistTSS, minDistTSE" in PHRED scale. Range: [0, 99.451]. [@li2020dynamic]num (+)Individual annotation channels in the FAVOR database.
Integrative ScoreCADD RawScoreThe CADD raw score (integrative score). A higher CADD score indicates a more deleterious variant. Range: [-237.102, 22.763]. [@kircher2014general; @rentzsch2018cadd]num (+)[Source][Ref1,2]
Integrative ScoreCADD PHREDThe CADD score in PHRED scale (integrative score). A higher CADD score indicates a more deleterious variant. Range: [0, 99]. [@kircher2014general; @rentzsch2018cadd]num (+)[Source][Ref1,2]
Integrative ScoreLINSIGHTThe LINSIGHT score (integrative score). A higher LINSIGHT score indicates more functionality. Range: [0.215, 0.995]. [@huang2017fast]num (+)[Source][Ref]
Integrative ScoreFATHMM-XFThe FATHMM-XF score (integrative score). A higher FATHMM-XF score indicates more functionality. Range: [0.405, 99.451]. [@rogers2017fathmm]num (+)[Source][Ref]
Integrative ScoreFunseq Value (impact score)A flexible framework to prioritize regulatory mutations from cancer genome sequencing (integrative score). [@fu2014funseq2]num (+)[Source][Ref]
Integrative ScoreFunseq Description (annotation)Funseq annotation points out whether the given mutation falls in a coding or non-coding region (integrative score). [@fu2014funseq2]String[Source][Ref]
Integrative ScoreAloft Value (impact score)ALoFT provides extensive annotations to putative loss-of-function variants (LoF) in protein-coding genes including functional, evolutionary and network features (integrative score). [@balasubramanian2017using]num (+)[Source][Ref]
Integrative ScoreAloft Description (annotation)ALoFT annotation can predict the impact of premature stop variants and classify them as dominant disease-causing, recessive disease-causing and benign variants (integrative score). [@balasubramanian2017using]String[Source][Ref]
Protein FunctionPolyPhenCatPolyPhen category of change. [@adzhubei2010method]Factor[Source][Ref]
Protein FunctionPolyPhenValPolyPhen score: It predicts the functional significance of an allele replacement from its individual features. Range: [0, 1] (default: 0). [@adzhubei2010method]num (+)[Source][Ref]
Protein FunctionPolyphen2_HDIVPredicts possible impact of an amino acid substitution on the structure and function of a human protein using straightforward physical and comparative considerations. HumDiv is Mendelian disease variants vs. divergence from close mammalian homologs of human proteins (>=95% sequence identity). Range: [0, 1] (default: 0). [@adzhubei2010method]num (+)[Source1,2,3][Ref]
Protein FunctionPolyphen2_HVARPredicts possible impact of an amino acid substitution on the structure and function of a human protein using straightforward physical and comparative considerations. HumVar is all human variants associated with some disease (except cancer mutations) or loss of activity/function vs. common (minor allele frequency >1%) human polymorphisms with no reported association with a disease or other effect. Range: [0, 1] (default: 0). [@adzhubei2010method]num (+)[Source1,2,3][Ref]
Protein FunctionGranthamGrantham score: oAA, nAA. It attempts to predict the distance between two amino acids, in an evolutionary sense. A lower Grantham score reflects less evolutionary distance. A higher Grantham score reflects a greater evolutionary distance, and is considered more deleterious. Range: [0, 215] (default: 0). [@grantham1974amino]num (+)[Source1,2][Ref]
Protein FunctionMutationTasterMutationTaster is a free web-based application to evaluate DNA sequence variants for their disease-causing potential. The software performs a battery of in silico tests to estimate the impact of the variant on the gene product/protein. Range: [0, 1] (default: 0). [@schwarz2014mutationtaster2]num (+)[Source1,2,3][Ref]
Protein FunctionMutationAssessorPredicts the functional impact of amino-acid substitutions in proteins, such as mutations discovered in cancer or missense polymorphisms. Range: [-5.135, 6.490] (default: -5.545). [@reva2011predicting]num (+)[Source1,2,3][Ref]
Protein FunctionSIFTcatSIFT category of change. [@ng2003sift]Factor[Source][Ref]
Protein FunctionSIFTvalSIFT score, ranges from 0.0 (deleterious) to 1.0 (tolerated). Range: [0, 1] (default: 1). [@ng2003sift]num (-)[Source][Ref]
ConservationpriPhConsPrimate phastCons conservation score (excl. human). A higher score means the region is more conserved. PhastCons considers n species rather than two. It takes into account the phylogeny by which these species are related and, instead of measuring similarity/divergence simply in terms of percent identity, uses statistical models of nucleotide substitution that allow for multiple substitutions per site and for unequal rates of substitution between different pairs of bases. Range: [0, 0.999] (default: 0.0). [@siepel2005evolutionarily]num (+)[Source][Ref]
ConservationmamPhConsMammalian phastCons conservation score (excl. human). A higher score means the region is more conserved. PhastCons considers n species rather than two. It takes into account the phylogeny by which these species are related and, instead of measuring similarity/divergence simply in terms of percent identity, uses statistical models of nucleotide substitution that allow for multiple substitutions per site and for unequal rates of substitution between different pairs of bases. Range: [0, 1] (default: 0.0). [@siepel2005evolutionarily]num (+)[Source][Ref]
ConservationverPhConsVertebrate phastCons conservation score (excl. human). A higher score means the region is more conserved. PhastCons considers n species rather than two. It takes into account the phylogeny by which these species are related and, instead of measuring similarity/divergence simply in terms of percent identity, uses statistical models of nucleotide substitution that allow for multiple substitutions per site and for unequal rates of substitution between different pairs of bases. Range: [0, 1] (default: 0.0). [@siepel2005evolutionarily]num (+)[Source][Ref]
ConservationpriPhyloPPrimate phyloP score (excl. human). A higher score means the region is more conserved. PhyloP scores measure evolutionary conservation at individual alignment sites. The scores are calculated by comparing with the evolution expected under neutral drift. Positive scores: measure conservation, i.e., slower evolution than expected, at sites that are predicted to be conserved. Negative scores: measure acceleration, i.e., faster evolution than expected, at sites that are predicted to be fast-evolving. Range: [-10.761, 0.595] (default: -0.029). [@pollard2010detection]num (+)[Source][Ref]
ConservationmamPhyloPMammalian phyloP score (excl. human). A higher score means the region is more conserved. PhyloP scores measure evolutionary conservation at individual alignment sites. The scores are calculated by comparing with the evolution expected under neutral drift. Positive scores: measure conservation, i.e., slower evolution than expected, at sites that are predicted to be conserved. Negative scores: measure acceleration, i.e., faster evolution than expected, at sites that are predicted to be fast-evolving. Range: [-20, 4.494] (default: -0.005). [@pollard2010detection]num (+)[Source][Ref]
ConservationverPhyloPVertebrate phyloP score (excl. human). A higher score means the region is more conserved. PhyloP scores measure evolutionary conservation at individual alignment sites. The scores are calculated by comparing with the evolution expected under neutral drift. Positive scores: measure conservation, i.e., slower evolution than expected, at sites that are predicted to be conserved. Negative scores: measure acceleration, i.e., faster evolution than expected, at sites that are predicted to be fast-evolving. Range: [-20, 11.295] (default: 0.042). [@pollard2010detection]num (+)[Source][Ref]
ConservationGerpNNeutral evolution score defined by GERP++. A higher score means the region is more conserved. Range: [0, 19.8] (default: 3.0). [@davydov2010identifying]num (+)[Source][Ref]
ConservationGerpSRejected Substitution score defined by GERP++. A higher score means the region is more conserved. GERP (Genomic Evolutionary Rate Profiling) identifies constrained elements in multiple alignments by quantifying substitution deficits. These deficits represent substitutions that would have occurred if the element were neutral DNA, but did not occur because the element has been under functional constraint. These deficits are referred to as "Rejected Substitutions". Rejected substitutions are a natural measure of constraint that reflects the strength of past purifying selection on the element. GERP estimates constraint for each alignment column; elements are identified as excess aggregations of constrained columns. Positive scores (fewer than expected) indicate that a site is under evolutionary constraint. Negative scores may be weak evidence of accelerated rates of evolution. Range: [-39.5, 19.8] (default: -0.2). [@davydov2010identifying]num (+)[Source][Ref]
EpigeneticsEncodeDNaseMaximum Encode DNase-seq level over 12 cell lines. Range: [0, 118672] (default: 0.0). [@encode2012integrated]num (+)[Source][Ref]
EpigeneticsEncodeH3K27acMaximum Encode H3K27ac level over 14 cell lines. Range: [0.010, 1442.690] (default: 0.36). [@encode2012integrated]num (+)[Source][Ref]
EpigeneticsEncodeH3K4me1Maximum Encode H3K4me1 level over 13 cell lines. Range: [0.010, 227.81] (default: 0.37). [@encode2012integrated]num (+)[Source][Ref]
EpigeneticsEncodeH3K4me2Maximum Encode H3K4me2 level over 14 cell lines. Range: [0.010, 774.99] (default: 0.37). [@encode2012integrated]num (+)[Source][Ref]
EpigeneticsEncodeH3K4me3Maximum Encode H3K4me3 level over 14 cell lines. Range: [0.010, 1093.75] (default: 0.38). [@encode2012integrated]num (+)[Source][Ref]
EpigeneticsEncodeH3K9acMaximum Encode H3K9ac level over 13 cell lines. Range: [0.010, 1340.42] (default: 0.41). [@encode2012integrated]num (+)[Source][Ref]
EpigeneticsEncodeH4K20me1Maximum Encode H4K20me1 level over 11 cell lines. Range: [0.010, 226.64] (default: 0.47). [@encode2012integrated]num (+)[Source][Ref]
EpigeneticsEncodeH2AFZMaximum Encode H2AFZ level over 13 cell lines. Range: [0.020, 468.98] (default: 0.42). [@encode2012integrated]num (+)[Source][Ref]
EpigeneticsEncodeH3K9me3Maximum Encode H3K9me3 level over 14 cell lines. Range: [0.010, 226.64] (default: 0.38). [@encode2012integrated]num (+)[Source][Ref]
EpigeneticsEncodeH3K27me3Maximum Encode H3K27me3 level over 14 cell lines. Range: [0.010, 193.38] (default: 0.47). [@encode2012integrated]num (+)[Source][Ref]
EpigeneticsEncodeH3K36me3Maximum Encode H3K36me3 level over 10 cell lines. Range: [0.020, 246.88] (default: 0.39). [@encode2012integrated]num (+)[Source][Ref]
EpigeneticsEncodeH3K79me2Maximum Encode H3K79me2 level over 13 cell lines. Range: [0.020, 553.06] (default: 0.34). [@encode2012integrated]num (+)[Source][Ref]
EpigeneticsEncodetotalRNAMaximum Encode totalRNA-seq level over 10 cell lines (minus and plus strand separately). Range: [0, 385096] (default: 0.0). [@encode2012integrated]num (+)[Source][Ref]
EpigeneticsGCPercent GC in a window of +/- 75bp. Range: [0, 1] (default: 0.42).num (+)[Source]
EpigeneticsCpGPercent CpG in a window of +/- 75bp. Range: [0, 0.604] (default: 0.02).num (+)[Source]
Transcription FactorsRemapOverlapTFRemap number of different transcription factors binding. Range: [1, 350] (default: -0.5).int (+)[Source]
Transcription FactorsRemapOverlapCLRemap number of different transcription factor - cell line combinations binding. Range: [1, 1068] (default: -0.5).int (+)[Source]
Chromatin StatescHmm E1Number of cell types (out of 48) in chromHMM state E1_poised (default: 1.92). [@ernst2015large]num[Source][Ref]
Chromatin StatescHmm E2Number of cell types (out of 48) in chromHMM state E2_repressed (default: 1.92). [@ernst2015large]num[Source][Ref]
Chromatin StatescHmm E3Number of cell types (out of 48) in chromHMM state E3_dead (default: 1.92). [@ernst2015large]num[Source][Ref]
Chromatin StatescHmm E4Number of cell types (out of 48) in chromHMM state E4_dead (default: 1.92). [@ernst2015large]num[Source][Ref]
Chromatin StatescHmm E5Number of cell types (out of 48) in chromHMM state E5_repressed (default: 1.92). [@ernst2015large]num[Source][Ref]
Chromatin StatescHmm E6Number of cell types (out of 48) in chromHMM state E6_repressed (default: 1.92). [@ernst2015large]num[Source][Ref]
Chromatin StatescHmm E7Number of cell types (out of 48) in chromHMM state E7_weak (default: 1.92). [@ernst2015large]num[Source][Ref]
Chromatin StatescHmm E8Number of cell types (out of 48) in chromHMM state E8_gene (default: 1.92). [@ernst2015large]num[Source][Ref]
Chromatin StatescHmm E9Number of cell types (out of 48) in chromHMM state E9_gene (default: 1.92). [@ernst2015large]num[Source][Ref]
Chromatin StatescHmm E10Number of cell types (out of 48) in chromHMM state E10_gene (default: 1.92). [@ernst2015large]num[Source][Ref]
Chromatin StatescHmm E11Number of cell types (out of 48) in chromHMM state E11_gene (default: 1.92). [@ernst2015large]num[Source][Ref]
Chromatin StatescHmm E12Number of cell types (out of 48) in chromHMM state E12_distal (default: 1.92). [@ernst2015large]num[Source][Ref]
Chromatin StatescHmm E13Number of cell types (out of 48) in chromHMM state E13_distal (default: 1.92). [@ernst2015large]num[Source][Ref]
Chromatin StatescHmm E14Number of cell types (out of 48) in chromHMM state E14_distal (default: 1.92). [@ernst2015large]num[Source][Ref]
Chromatin StatescHmm E15Number of cell types (out of 48) in chromHMM state E15_weak (default: 1.92). [@ernst2015large]num[Source][Ref]
Chromatin StatescHmm E16Number of cell types (out of 48) in chromHMM state E16_tss (default: 1.92). [@ernst2015large]num[Source][Ref]
Chromatin StatescHmm E17Number of cell types (out of 48) in chromHMM state E17_proximal (default: 1.92). [@ernst2015large]num[Source][Ref]
Chromatin StatescHmm E18Number of cell types (out of 48) in chromHMM state E18_proximal (default: 1.92). [@ernst2015large]num[Source][Ref]
Chromatin StatescHmm E19Number of cell types (out of 48) in chromHMM state E19_tss (default: 1.92). [@ernst2015large]num[Source][Ref]
Chromatin StatescHmm E20Number of cell types (out of 48) in chromHMM state E20_poised (default: 1.92). [@ernst2015large]num[Source][Ref]
Chromatin StatescHmm E21Number of cell types (out of 48) in chromHMM state E21_dead (default: 1.92). [@ernst2015large]num[Source][Ref]
Chromatin StatescHmm E22Number of cell types (out of 48) in chromHMM state E22_repressed (default: 1.92). [@ernst2015large]num[Source][Ref]
Chromatin StatescHmm E23Number of cell types (out of 48) in chromHMM state E23_weak (default: 1.92). [@ernst2015large]num[Source][Ref]
Chromatin StatescHmm E24Number of cell types (out of 48) in chromHMM state E24_distal (default: 1.92). [@ernst2015large]num[Source][Ref]
Chromatin StatescHmm E25Number of cell types (out of 48) in chromHMM state E25_distal (default: 1.92). [@ernst2015large]num[Source][Ref]
Local Nucleotide DiversityRecombinationRateRecombination rate measures how likely the region is to undergo recombination. Range: [0, 54.96] (default: 0). [@gazal2017linkage]num (+)[Ref]
Local Nucleotide DiversityNuclearDiversityNuclear diversity measures how likely the region is to diversify. Range: [0.05, 60.25] (default: 0). [@gazal2017linkage]num (+)[Ref]
Local Nucleotide DiversitybStatisticBackground selection score. A background selection (B) value for each position in the genome. B indicates the expected fraction of neutral diversity that is present at a site, with values close to 0 representing near complete removal of diversity as a result of selection and values near 1000 indicating little effect of selection. Range: [0, 1000] (default: 800). [@mcvicker2009widespread]int (+)[Source][Ref]
Mutation DensityCommon100bpNumber of common (MAF > 0.05) BRAVO SNVs in the nearby 100 bp window. A higher value indicates that more mutations occur in the region and that mutations are more likely. Scores range from 0 to 100. Range: [0, 14] (default: 0).int (+)[Source]
Mutation DensityRare100bpNumber of rare (MAF < 0.05) BRAVO SNVs in the nearby 100 bp window. A higher value indicates that more mutations occur in the region and that mutations are more likely. Scores range from 0 to 100. Range: [0, 31] (default: 0).int (+)[Source]
Mutation DensitySngl100bpNumber of single-occurrence BRAVO SNVs in the nearby 100 bp window. A higher value indicates that more mutations occur in the region and that mutations are more likely. Scores range from 0 to 100. Range: [0, 99] (default: 0).int (+)[Source]
Mutation DensityCommon1000bpNumber of common (MAF > 0.05) BRAVO SNVs in the nearby 1000 bp window. A higher value indicates that more mutations occur in the region and that mutations are more likely. Scores range from 0 to 1000. Range: [0, 73] (default: 0).int (+)[Source]
Mutation DensityRare1000bpNumber of rare (MAF < 0.05) BRAVO SNVs in the nearby 1000 bp window. A higher value indicates that more mutations occur in the region and that mutations are more likely. Scores range from 0 to 1000. Range: [0, 74] (default: 0).int (+)[Source]
Mutation DensitySngl1000bpNumber of single-occurrence BRAVO SNVs in the nearby 1000 bp window. A higher value indicates that more mutations occur in the region and that mutations are more likely. Scores range from 0 to 1000. Range: [0, 658] (default: 0).int (+)[Source]
Mutation DensityCommon10000bpNumber of common (MAF > 0.05) BRAVO SNVs in the nearby 10000 bp window. A higher value indicates that more mutations occur in the region and that mutations are more likely. Scores range from 0 to 10000. Range: [0, 443] (default: 0).int (+)[Source]
Mutation DensityRare10000bpNumber of rare (MAF < 0.05) BRAVO SNVs in the nearby 10000 bp window. A higher value indicates that more mutations occur in the region and that mutations are more likely. Scores range from 0 to 10000. Range: [0, 355] (default: 0).int (+)[Source]
Mutation DensitySngl10000bpNumber of single-occurrence BRAVO SNVs in the nearby 10000 bp window. A higher value indicates that more mutations occur in the region and that mutations are more likely. Scores range from 0 to 10000. Range: [0, 4750] (default: 0).int (+)[Source]
MappabilityUmap (k100, k50, k36, k24)Mappability of unconverted genome. It measures the extent to which a position can be uniquely mapped by sequence reads. Lower mappability means the estimates of genomic and epigenomic characteristics from sequencing assays are less reliable, and the region has increased susceptibility to spurious mapping from reads from other regions of the genome with sequencing errors or unexpected genetic variation. Range: [0, 1] (default: 0). [@karimzadeh2018umap]num (+)[Source][Ref]
MappabilityBismap (k100, k50, k36, k24)Mappability of the bisulfite-converted genome. Bisulfite sequencing approaches used to identify DNA methylation introduce large numbers of reads that map to multiple regions. This annotation identifies mappability of the bisulfite-converted genome. Range: [0, 1] (default: 0). [@karimzadeh2018umap]num (+)[Source][Ref]
Proximity TableminDistTSSDistance to closest Transcribed Sequence Start (TSS). Range: [1, 3604063] (default: 1e7).num (-)[Source]
Proximity TableminDistTSEDistance to closest Transcribed Sequence End (TSE). Range: [1, 3608885] (default: 1e7).num (-)[Source]
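
The Mutation Density annotations in the table are window counts, which the short Python sketch below illustrates. It is not FAVOR's implementation. The function name `window_counts`, the `(position, maf)` input layout, and the choice of a window centered on the query position are all assumptions made for illustration; the table does not state how the window is placed or whether the query variant itself is counted.

```python
def window_counts(variants, center, window_bp, maf_cutoff=0.05):
    """Count common and rare SNVs near `center` on one chromosome.

    variants: iterable of (position, maf) pairs (hypothetical layout).
    Assumption: the window is centered on `center` and covers
    window_bp // 2 bases on each side; the exact window definition
    is not given in the table above.
    """
    half = window_bp // 2
    common = rare = 0
    for pos, maf in variants:
        if abs(pos - center) > half:
            continue  # outside the window
        if maf > maf_cutoff:
            common += 1  # analogous to Common100bp, Common1000bp, ...
        elif maf < maf_cutoff:
            rare += 1    # analogous to Rare100bp, Rare1000bp, ...
    return common, rare
```

Counting single-occurrence SNVs (the Sngl columns) would additionally require allele counts, which this sketch does not model.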

Graphical illustration of gene-based annotation

Please refer to the gene-based annotation figure above for the detailed meaning of each annotation category. Here “exonic” refers only to the coding exonic portion, not the UTR portion, because two keywords (UTR5, UTR3) are specifically reserved for UTR annotations. “splicing” is defined as a variant that lies within 2 bp of an exon/intron boundary. If a variant is located in both a 5’ UTR and a 3’ UTR region (possibly for two different genes), then “UTR5,UTR3” is printed as the output. The terms “upstream” and “downstream” are defined as within 1 kb of the transcription start site or transcription end site, respectively, taking the strand of the mRNA into account. If a variant is located in both the downstream and upstream regions (possibly for two different genes), then “upstream,downstream” is printed as the output. If a variant is located in a CAGE promoter/enhancer or GeneHancer region, it is also annotated as CAGE promoter/enhancer or GeneHancer.
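
As a rough illustration of the strand-aware “upstream”/“downstream” rule described above, the following Python sketch labels a position relative to one or more transcripts. The function names `flank_label` and `combined_label`, the tuple layout `(tx_start, tx_end, strand)`, and the exact boundary handling are assumptions for illustration only; the sketch does not model the other categories (exonic, splicing, UTR) or any precedence among them.

```python
def flank_label(pos, tx_start, tx_end, strand, window=1000):
    """Label `pos` relative to one transcript (hypothetical helper).

    tx_start < tx_end are genomic coordinates; strand is '+' or '-'.
    Returns 'upstream', 'downstream', or None.
    """
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    before = tx_start - window <= pos < tx_start  # lower-coordinate flank
    after = tx_end < pos <= tx_end + window       # higher-coordinate flank
    if strand == "+":
        if before:
            return "upstream"
        if after:
            return "downstream"
    else:
        # On the minus strand, transcription starts at the higher coordinate.
        if before:
            return "downstream"
        if after:
            return "upstream"
    return None


def combined_label(pos, transcripts):
    """Join labels from several transcripts, e.g. 'upstream,downstream'."""
    labels = {flank_label(pos, *t) for t in transcripts} - {None}
    return ",".join(name for name in ("upstream", "downstream") if name in labels)
```

For example, a position that lies just past the end of one plus-strand gene and just before the start of another would receive both labels from `combined_label`.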

Calculation of annotation PCs (aPCs) and interpretation

Often it is helpful to have a single metric summarizing multiple similar annotations that measure the same underlying biological function. We achieve this goal by proposing variant annotation Principal Components (aPCs), which are principal component summaries of the multi-faceted functional annotation data in FAVOR. Unlike ancestral PCs, which are subject-specific and calculated from genotypes across the genome to control for population structure, annotation PCs are variant-specific and calculated from the functional annotations of individual variants. Annotation PCs summarize multiple aspects of variant function, with different blocks of individual functional annotations in the heatmap below captured by different annotation PCs (Figure 1) [@li2020dynamic]. The detailed steps for obtaining aPCs are as follows (currently aPCs are calculated for all PASS SNVs in the variant set):

  • Step 0 (pre-processing): We impute missing individual scores with their default values and transform particular individual scores such that (1) a higher value of each individual score indicates increased functionality according to that annotation, and (2) the distribution of each individual score becomes less skewed. Specifically, we use

$$\widetilde{\text{SIFTval}} = 1 - \text{SIFTval}$$
$$\widetilde{\text{minDistTSS}} = -\log(\text{minDistTSS})$$
$$\widetilde{\text{minDistTSE}} = -\log(\text{minDistTSE})$$
$$\widetilde{\text{Encode}}_a = \log(\text{Encode}_a)$$

    where $a \in \{\text{H3K4me1, H3K4me2, H3K4me3, H3K9ac, H3K9me3, H3K27ac, H3K27me3, H3K36me3, H3K79me2, H4K20me1, H2AFZ}\}$.

  • Step 1: We group the individual annotation scores into major functional blocks based on a priori knowledge. Each block captures a specific aspect of variant biological function: protein function, conservation, epigenetics, local nucleotide diversity, proximity (distance) to coding, mutation density, transcription factors, mappability, and proximity (distance) to TSS/TSE (see the table above) [@li2020dynamic].

  • Step 2: For each annotation block $k \in \{1,\dots,K\}$, we center and standardize all (pre-processed) individual scores within the block, and obtain the standardized individual annotation score matrix $\widetilde{\boldsymbol{X}}_k$ (i.e., each column of $\widetilde{\boldsymbol{X}}_k$ has mean 0 and variance 1).

  • Step 3: For each annotation block $k \in \{1,\dots,K\}$, we calculate the aPC raw score of that block as the first PC from the standardized individual scores. Specifically, for annotation block $k$,

$$\text{aPC.raw}_k = \widetilde{\boldsymbol{X}}_k \boldsymbol{e}_{k,1}$$

    where $\boldsymbol{e}_{k,1}$ is the eigenvector corresponding to the largest eigenvalue of $\widetilde{\boldsymbol{X}}_k^\top\widetilde{\boldsymbol{X}}_k$. Note that we flip the sign of $\text{aPC.raw}_k$ (if necessary) such that each $\text{aPC.raw}_k$ is positively correlated with the individual scores in that block.

  • Step 4: To facilitate interpretation, these aPC raw scores are transformed into PHRED-scaled scores for each variant across the genome [@li2020dynamic], defined as

$$\text{aPC.PHRED}_k = -10\times\log_{10}\left(\text{rank}(-\text{aPC.raw}_k)/M\right)$$

    where $M$ is the total number of variants sequenced across the whole genome. Note that the PHRED-scaled scores used in annotation PCs express the rank in orders of magnitude. For example, among all the variants in the FAVOR database, a variant at the top 10% of aPC raw scores has a PHRED score of 10, a variant at the top 1% has a PHRED score of 20, and a variant at the top 0.1% has a PHRED score of 30.
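
Steps 0 to 4 can be sketched in a few lines of Python with NumPy. This is a minimal illustration under stated assumptions, not the FAVOR pipeline: the function names `preprocess` and `apc_phred` are hypothetical, the `Encode` name prefix is an assumed naming convention, the logarithm is taken to be the natural log, the sign of the raw score is chosen so that its mean correlation with the block's columns is positive, ties in the ranking are broken by input order, and $M$ is taken to be the number of rows passed in.

```python
import numpy as np


def preprocess(scores, defaults):
    """Step 0 sketch: impute defaults, then apply the listed transforms.

    scores: dict of name -> 1-D array, with NaN marking missing values.
    defaults: dict of name -> default value (caller-supplied).
    """
    out = {}
    for name, values in scores.items():
        v = np.asarray(values, dtype=float)
        v = np.where(np.isnan(v), defaults[name], v)  # impute with default
        if name == "SIFTval":
            v = 1.0 - v
        elif name in ("minDistTSS", "minDistTSE"):
            v = -np.log(v)
        elif name.startswith("Encode"):  # assumed naming convention
            v = np.log(v)
        out[name] = v
    return out


def apc_phred(X):
    """Steps 2-4 sketch for one block.

    X: (M variants, p annotations) array of pre-processed scores.
    Returns one PHRED-scaled aPC value per variant.
    """
    X = np.asarray(X, dtype=float)
    # Step 2: center and standardize each column (mean 0, variance 1).
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    # Step 3: eigenvector of Z^T Z with the largest eigenvalue.
    # numpy.linalg.eigh returns eigenvalues in ascending order.
    _, eigvecs = np.linalg.eigh(Z.T @ Z)
    raw = Z @ eigvecs[:, -1]
    # Flip the sign if the raw score is negatively correlated with the
    # columns on average (an assumed tie-breaking rule for mixed signs).
    if np.mean(Z.T @ raw) < 0:
        raw = -raw
    # Step 4: rank of -raw (rank 1 = largest raw score), then PHRED scale.
    M = raw.shape[0]
    order = np.argsort(-raw, kind="stable")
    ranks = np.empty(M)
    ranks[order] = np.arange(1, M + 1)
    return -10.0 * np.log10(ranks / M)
```

With ten variants, for instance, the variant with the largest raw score has rank 1 and therefore a PHRED score of $-10\log_{10}(1/10) = 10$, matching the top 10% example in Step 4.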

Figure 1. Correlation heatmap of individual and integrative functional annotations. The figure shows pairwise correlations between 63 individual and integrative functional annotations. The cells in the visualization are colored by Pearson's correlation coefficient values with deeper colors indicating higher positive (red) or negative (blue) correlations. Each annotation principal component (aPC) is the first PC calculated from the set of standardized individual functional annotations that measure similar biological function. These aPCs are then transformed into the PHRED-scaled scores for each variant across the genome.