Introduction

FAVOR-GPT is a natural language query interface integrated into the Functional Annotation of Variants Online Resources (FAVOR). It is a versatile, user-centric chatbot built to make FAVOR easier to use by providing informative interpretation and interactive summaries of its annotation data.

It complements the original FAVOR portal, making the resource accessible and usable for a wider range of users and increasing its value.

FAVOR-GPT simplifies raw annotations by providing interpretable explanations and result summaries in response to the user's prompt, so users can navigate variant functional annotation with less effort. This integration is a step forward in functional annotation database development and supports informed decision-making in genetics studies.

Key Features

  • Natural Language Interface: Allows users to interact with the database using everyday language.
  • Tabular Data Interpretation: Capable of understanding and responding to queries related to tabular data structures.
  • Database Integration: Seamlessly connects with databases to fetch and present relevant information.
  • Adaptive Learning: Continuously improves its understanding and responses based on user interactions.

Implementation

FAVOR-GPT was designed with flexibility, resource efficiency, and adaptability in mind. Rather than relying on extensive fine-tuning over the existing biomedical literature, it draws on the available documentation and information, which keeps it flexible. FAVOR-GPT is also user-centric and is deployed as a fast, accessible web application.

FAVOR-GPT is based on Retrieval Augmented Generation (RAG), an AI framework that enhances responses using an external knowledge source. Here, FAVOR-GPT retrieves data from the external FAVOR database, which grounds the model in accurate, current information for reliable and insightful responses.
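The retrieve-then-generate flow described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not FAVOR-GPT's actual implementation: `TOY_DB`, `retrieve`, and `build_prompt` are hypothetical names, the in-memory dictionary stands in for the FAVOR database, and the call to a language model is left out. The single toy record reuses values from the rs7412 example in this document.

```python
from dataclasses import dataclass


@dataclass
class Annotation:
    variant: str
    fields: dict


# Hypothetical in-memory stand-in for the FAVOR database (assumption).
TOY_DB = {
    "19-44908822-C-T": Annotation(
        variant="19-44908822-C-T",
        fields={"rsid": "rs7412", "gene": "APOE"},
    ),
}


def retrieve(question: str, db: dict) -> list:
    """Return annotations whose variant ID or rsID appears in the question."""
    hits = []
    for variant_id, annotation in db.items():
        rsid = annotation.fields.get("rsid")
        if variant_id in question or (rsid and rsid in question):
            hits.append(annotation)
    return hits


def build_prompt(question: str, hits: list) -> str:
    """Combine the retrieved records and the question into one grounded prompt."""
    if hits:
        context = "\n".join(f"{a.variant}: {a.fields}" for a in hits)
    else:
        context = "No matching annotations were retrieved."
    return (
        "Answer using only the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {question}"
    )


question = "What is the pathogenicity of the variant 19-44908822-C-T/rs7412?"
prompt = build_prompt(question, retrieve(question, TOY_DB))
print(prompt)
```

In a real system the prompt would then be passed to a language model; the point of the sketch is only that the answer is conditioned on records fetched at query time rather than on what the model memorized.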


Examples

Get information about a variant

Request: "What is the pathogenicity of the variant 19-44908822-C-T/rs7412?"

Response

The variant 19-44908822-C-T (rs7412) is located in the APOE gene and is a nonsynonymous SNV, meaning it
results in a change in the amino acid sequence of the protein. This SNV has been associated with various
clinical significance annotations, including drug response, but also noted as pathogenic for conditions such
as Hypercholesterolemia, Warfarin response, and Familial type 3 hyperlipoproteinemia. For some databases, it
has an uncertain significance. The variant has a relatively high allele frequency in the general population
as per the BRAVO database. The variant is predicted to be probably damaging by PolyPhen, deleterious by
SIFT, and has a MutationTaster score of 0.93, which signifies a high probability of the variant being
disease-causing.

For a more detailed look at this variant, please refer to FAVOR.

Request: "what are the epigenetics score for variant 19-44908822-C-T/rs7412?"

Response

The epigenetic scores for the variant 19-44908822-C-T are as follows:

Active: 5.26114650042989
Repressed: 1.92246709640996
Transcription: 2.86284431612327
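The variant identifiers used in these requests, such as 19-44908822-C-T, appear to follow a chromosome-position-reference-alternate pattern. That reading is an assumption inferred from the examples, not something this page states. Under that assumption, a minimal Python sketch for splitting an identifier into its parts before sending a query could look like this (`Variant` and `parse_variant` are hypothetical names):

```python
import re
from typing import NamedTuple


class Variant(NamedTuple):
    chrom: str
    pos: int
    ref: str
    alt: str


# Assumed layout: <chromosome>-<position>-<ref allele>-<alt allele>
_VARIANT_RE = re.compile(r"^(\w+)-(\d+)-([ACGT]+)-([ACGT]+)$")


def parse_variant(text: str) -> Variant:
    """Split an identifier such as '19-44908822-C-T' into typed fields."""
    match = _VARIANT_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not a chrom-pos-ref-alt identifier: {text!r}")
    chrom, pos, ref, alt = match.groups()
    return Variant(chrom, int(pos), ref, alt)


print(parse_variant("19-44908822-C-T"))
```

An rsID such as rs7412 does not match this pattern, so the function raises `ValueError` for it; a caller could use that to tell the two identifier styles apart.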


Get information about a gene

Request: "What is the function of the gene APOE?"

Response

The gene APOE (Apolipoprotein E) is located on chromosome 19q13.32 and is involved in the metabolism of fats
in the body. Apolipoprotein E is a component of several lipoproteins, including very low-density
lipoproteins (VLDL), which are released into the bloodstream from the liver. APOE is essential for the
normal catabolism of triglyceride-rich lipoprotein constituents. More precisely, APOE helps in the clearance
of chylomicron remnants and VLDL remnants by acting as a ligand for receptors in the liver that remove
remnants of triglyceride-rich lipoproteins from the bloodstream. You can find more information on APOE from
its gene-database entry.

Request: "What are the variants associated with the gene APOE?"

Response

Below are two of the variants associated with the gene APOE:

Variant: 19-44905485-A-G
rsID: rs1038445539
Bravo AN: 264690
Bravo AC: 13

Variant: 19-44906328-T-G
rsID: rs1343559899
Bravo AN: 264690
Bravo AC: 7926

This is a small selection of the total 1470 variants associated with the gene APOE found in the database.
For more detailed information, please visit the gene-specific variant list for APOE.

Request: "How many pathogenic variants are there in the gene APOE?"

Response

The APOE gene has a total of 11 pathogenic variants. To learn more, you can visit the FAVOR APOE gene-level
annotation page.

Request: "What are the pathogenic variants for the gene APOE?"

Response

Variant: 19-44905485-A-G
rsid: rs1038445539
BRAVO_AN: 264690
BRAVO_AC: 13

Variant: 19-44906328-T-G
rsid: rs1343559899
BRAVO_AN: 264690
BRAVO_AC: 7926

Variant: 19-44906374-T-C
rsid: rs893154126
BRAVO_AN: 264690
BRAVO_AC: 3

Variant: 19-44906522-G-C
rsid: rs1463299466
BRAVO_AN: 264690
BRAVO_AC: 1

Variant: 19-44907212-G-A
rsid: rs1969825044
BRAVO_AN: 264690
BRAVO_AC: 1

Variant: 19-44908867-G-A
rsid: rs1265472491
BRAVO_AN: 264690
BRAVO_AC: 1

Variant: 19-44908947-C-T
rsid: rs72654468
BRAVO_AN: 264690
BRAVO_AC: 224

Variant: 19-44909272-C-T
rsid: rs1374831423
BRAVO_AN: 264690
BRAVO_AC: 1

Variant: 19-44906195-G-A
rsid: rs1969803374
BRAVO_AN: 264690
BRAVO_AC: 1

Variant: 19-44906285-G-C
rsid: rs1969804676
BRAVO_AN: 264690
BRAVO_AC: 4

Please note that not all variants may be clinically defined as pathogenic; the clinical significance can
vary. Furthermore, BRAVO_AN and BRAVO_AC represent the allele number and allele count observed in a
population database, which doesn’t necessarily indicate pathogenicity. For additional information and
annotation, you can refer to FAVOR APOE gene-level variants.
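Because BRAVO_AC is an allele count and BRAVO_AN is the total allele number, dividing one by the other gives the allele frequency, which is often easier to compare across variants than the raw counts. The short Python sketch below shows the arithmetic; `allele_frequency` is a hypothetical helper name, and the two input pairs are taken from the listing above.

```python
def allele_frequency(allele_count: int, allele_number: int) -> float:
    """Return allele_count / allele_number after basic sanity checks."""
    if allele_number <= 0:
        raise ValueError("allele number must be positive")
    if not 0 <= allele_count <= allele_number:
        raise ValueError("allele count must lie between 0 and the allele number")
    return allele_count / allele_number


# (BRAVO_AC, BRAVO_AN) pairs copied from the response above.
examples = {
    "19-44905485-A-G": (13, 264690),
    "19-44906328-T-G": (7926, 264690),
}

for variant_id, (ac, an) in examples.items():
    print(f"{variant_id}: AF = {allele_frequency(ac, an):.6f}")
```

As the response itself notes, a low or high frequency does not by itself indicate pathogenicity; the ratio is only a summary of how often the allele was observed.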


Get information about a specific region

Request: "How many non-synonymous SNV variants are there in region 19-44908820-44908922?"

Response

In the region 19-44908820-44908922, there are a total of 26 non-synonymous SNV variants.

Request: "What are the variants in the region 19-44908820-44908922?"

Response

The region 19-44908820-44908922 has a total of 38 variants. Here are the details of the first two:

Variant: 19-44908922-G-A
rsid: rs1047319304
bravo_an: 264690
bravo_ac: 2

Variant: 19-44908921-C-T
rsid: rs1459595735
bravo_an: 264690
bravo_ac: 6

Request: "How many exonic variants are there in region 19-44908820-44908922?"

Response

In the region 19-44908820-44908922, there are a total of 38 exonic variants.

Request: "Can you list the exonic variants in the region 19-44908820-44908922?"

Response

The exonic variants in the region 19-44908820-44908922 are:

Variant: 19-44908922-G-A
rsid: rs1047319304
BRAVO_AN: 264690
BRAVO_AC: 2

Variant: 19-44908921-C-T
rsid: rs1459595735
BRAVO_AN: 264690
BRAVO_AC: 6

Variant: 19-44908918-G-A
rsid: rs770942678
BRAVO_AN: 264690
BRAVO_AC: 5

Variant: 19-44908917-C-T
rsid: rs1296647170
BRAVO_AN: 264690
BRAVO_AC: 1

Variant: 19-44908917-C-G
rsid: rs1296647170
BRAVO_AN: 264690
BRAVO_AC: 4

Variant: 19-44908915-C-T
rsid: rs749750245
BRAVO_AN: 264690
BRAVO_AC: 31

Variant: 19-44908915-C-G
rsid: rs749750245
BRAVO_AN: 264690
BRAVO_AC: 1

Variant: 19-44908915-C-A
rsid: rs749750245
BRAVO_AN: 264690
BRAVO_AC: 4

Variant: 19-44908902-G-A
rsid: rs1478080731
BRAVO_AN: 264690
BRAVO_AC: 2

Variant: 19-44908896-G-C
rsid: rs1969874223
BRAVO_AN: 264690
BRAVO_AC: 1

For more information and the full list of variants in this region, visit the FAVOR region lookup page.
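Region queries use identifiers of the form chromosome-start-end. The responses above list 19-44908922-G-A for the region 19-44908820-44908922, which suggests the end coordinate is inclusive; the sketch below assumes inclusive bounds on both ends. It is a hypothetical client-side check written in Python (`Region`, `parse_region`, and `in_region` are illustrative names), not a description of how FAVOR performs the lookup.

```python
from typing import NamedTuple


class Region(NamedTuple):
    chrom: str
    start: int
    end: int


def parse_region(text: str) -> Region:
    """Parse 'chrom-start-end', e.g. '19-44908820-44908922'."""
    chrom, start, end = text.strip().split("-")
    region = Region(chrom, int(start), int(end))
    if region.start > region.end:
        raise ValueError(f"start is after end in region {text!r}")
    return region


def in_region(variant_id: str, region: Region) -> bool:
    """Check whether a chrom-pos-ref-alt variant falls inside the region."""
    chrom, pos = variant_id.split("-")[:2]
    # Both bounds are treated as inclusive (assumption, see lead-in).
    return chrom == region.chrom and region.start <= int(pos) <= region.end


region = parse_region("19-44908820-44908922")
for variant_id in ["19-44908922-G-A", "19-44908896-G-C", "19-44905485-A-G"]:
    print(variant_id, in_region(variant_id, region))
```

The first two identifiers come from the region listing above and fall inside the bounds; the third comes from the APOE gene listing and lies upstream of the region.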


Explain a term specific to the database

Request: "What is BRAVO_AN?"

Response

BRAVO_AN refers to the TopMed Bravo Genome Allele Number, which indicates the number of alleles observed in
the TopMed research project. This data comes from the NHLBI TOPMed Consortium and was published in 2018 and
2019 by Taliun et al.

Request: "What does PolyPhenVal mean?"

Response

PolyPhenVal represents the PolyPhen score, which predicts the potential impact of an amino acid substitution
on the structure and function of a human protein. This is achieved by analyzing various sequence features
and structural attributes. The score ranges from 0 to 1, with higher scores indicating a higher likelihood
that the substitution is damaging to the protein function.