Tutorials Overview
We provide video tutorials and links to resources in three areas:
- Patients’ therapeutic odysseys and challenge description. Gain an understanding of (1) the clinical context motivating this challenge as well as (2) the structure of the challenge, including timelines for each phase, expected input and output, the source and kinds of questions your system should handle, and rationale for the justification field in your output.
- Biomedical resources for gathering key evidence necessary for treatment. Learn about the current (manual) clinical workflows used to identify plausible treatments for rare genetic diagnoses. This will include references to key biomedical datasets and resources that are available for programmatic information retrieval to be used by your models.
- Compute environments and model availability. Explore available compute environments including Google Cloud Platform (GCP), learn how to find models and access data via Hugging Face, and get a brief introduction to challenge-critical concepts like Retrieval-Augmented Generation (RAG).
Therapeutic Identification Workflows
Finding treatments for patients with rare genetic disease diagnoses is a long, frustrating, largely manual process. Targeted treatments exist for only ~5% of rare disorders (Kaufmann et al. 2018).
Biomedical Resources
Once a clinician receives a genetic testing report with a confirmed diagnostic variant, the immediate next steps are to identify whether there are:
- existing targeted therapies, FDA-approved treatments for this specific genetic diagnosis
- existing supportive therapies, FDA-approved drugs with other primary indications but for which there is peer-reviewed literature suggesting their efficacy for this specific genetic diagnosis
- ongoing clinical trials for this specific genetic diagnosis where the patient meets all inclusion criteria and no exclusion criteria
- drugs that can be repurposed for off-label usage based on experimental evidence and/or molecular functionality rationale
- opportunities for personalized therapeutics such as antisense oligonucleotides or gene therapies based on the genetic variant location, impact, and affected tissue(s)
Learn more about resources used to manually identify therapeutic avenues:
Biomedical Resources (Nonexhaustive)
- Established (and in trial) Targeted and Supportive Therapies
  - PubMed – A searchable database of biomedical literature, including research articles relevant to genetics and molecular biology.
  - GeneReviews – Expert-authored, peer-reviewed disease overviews focused on the diagnosis, management, and genetic counseling of inherited disorders.
  - ClinicalTrials.gov – A registry of clinical trials, providing information on ongoing or completed studies involving genetic conditions and therapies.
- Druggability Information
  - ClinGen Dosage Sensitivity – Curated data on gene dosage sensitivity (haploinsufficiency and triplosensitivity) and gene-disease validity classifications.
  - PHAROS – An interface to the Target Central Resource Database (TCRD) providing information on the druggability and function of human proteins.
  - DrugBank – A richly annotated database of drugs and drug targets, including FDA-approved therapies and molecular interactions.
- Variant Amenability to Personalized Therapies
  - N=1 Collaborative (“N1C”) – Initiative to clarify and streamline development of individualized medicines, centralizing knowledge, data, and safety information.
- Gene-Phenotype Associations
  - ClinVar – A public archive of interpretations of clinically relevant variants and their supporting evidence.
  - OMIM – Online Mendelian Inheritance in Man; a comprehensive catalog of human genes and genetic disorders with phenotype descriptions and molecular relationships.
  - Geno2MP – A resource linking rare variants in exomes to de-identified phenotypic profiles, used for rare disease discovery and matchmaking.
  - DECIPHER – A database of chromosomal imbalance and pathogenic sequence variants with phenotype correlations and genome browser integration.
  - NCBI Gene – Comprehensive gene-specific information including nomenclature, sequences, pathways, and bibliographic links.
  - GenCC – The Gene Curation Coalition; a unified platform of gene-disease validity assertions from multiple expert groups.
- Variant, Gene and Pathway Information
  - Ensembl – Genome browser with rich annotation of genes, variants, comparative genomics, and regulation for multiple species.
  - UniProt – A comprehensive resource for protein sequence and functional information, including isoforms and domain annotations.
  - Mutalyzer / VariantValidator – Tools for validating and correcting variant nomenclature according to HGVS standards.
  - gnomAD – Aggregated population allele frequency data for assessing variant rarity and constraint metrics across diverse ancestries.
  - GTEx – The Genotype-Tissue Expression project; provides data on tissue-specific gene expression and eQTLs across human tissues.
  - VastDB / ExonSkipDB / Snaptron – Databases and tools for exploring alternative splicing events and transcript usage from RNA-seq data.
- Resource aggregation platforms
  - MobiDetails – A meta-resource that aggregates variant-level information and prioritization tools for clinical variant interpretation.
  - Franklin by Genoox – A clinical-grade platform that integrates multiple databases and tools for variant classification and evidence curation.
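Several of the resources above can be queried programmatically. As one illustration, the sketch below searches PubMed through the NCBI E-utilities ESearch endpoint and extracts the returned article IDs (PMIDs). The function names and the example query term are hypothetical, and the code is a minimal starting point under those assumptions, not a recommended retrieval pipeline; check the NCBI usage guidelines (rate limits, API keys) before issuing requests at scale.

```python
import json
import urllib.parse
import urllib.request
from typing import Any, Dict, List

# Public NCBI E-utilities ESearch endpoint.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_pubmed_search_url(term: str, max_results: int = 5) -> str:
    """Build an ESearch URL asking PubMed for matching article IDs as JSON."""
    params = {
        "db": "pubmed",         # database to search
        "term": term,           # free-text or fielded query
        "retmax": max_results,  # maximum number of IDs to return
        "retmode": "json",      # request a JSON response
    }
    return EUTILS_ESEARCH + "?" + urllib.parse.urlencode(params)


def parse_pubmed_ids(payload: Dict[str, Any]) -> List[str]:
    """Extract the PMID list from a decoded ESearch JSON response."""
    return payload.get("esearchresult", {}).get("idlist", [])


def search_pubmed(term: str, max_results: int = 5, timeout: float = 10.0) -> List[str]:
    """Run the search over the network and return the matching PMIDs."""
    url = build_pubmed_search_url(term, max_results)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return parse_pubmed_ids(json.load(response))
```

Calling `search_pubmed("GENE1 AND therapy")` (with a placeholder gene symbol replaced by a real one) would return a list of PMID strings that can then be passed to a follow-up fetch step to obtain abstracts.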
Computational Resources
Hugging Face is a universal hub for:
- AI models: ~1.6M models across various tasks
- Datasets: ~375K for training and evaluation
- Deployment: single-click deployment of your project (across multiple providers) using shareable self-contained “spaces”
- User Interfaces: build and share ML apps with user interfaces (UIs) using Python on Gradio
- Education: free courses and certifications
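To make "finding models" concrete, the sketch below queries the public Hugging Face Hub HTTP API for models matching a search string, using only the Python standard library. The helper names and the search string are hypothetical; the official `huggingface_hub` client library offers a richer interface, and this is only one possible way to browse the Hub from code.

```python
import json
import urllib.parse
import urllib.request
from typing import Any, Dict, List

# Public Hub endpoint that lists models.
HUB_MODELS_API = "https://huggingface.co/api/models"


def build_model_search_url(query: str, limit: int = 5) -> str:
    """Build a Hub API URL that searches model names for `query`."""
    return HUB_MODELS_API + "?" + urllib.parse.urlencode({"search": query, "limit": limit})


def extract_model_ids(payload: List[Dict[str, Any]]) -> List[str]:
    """Pull the repository IDs (e.g. "org/model-name") out of a decoded response."""
    return [entry["id"] for entry in payload if "id" in entry]


def search_models(query: str, limit: int = 5, timeout: float = 10.0) -> List[str]:
    """Run the search over the network and return matching model IDs."""
    with urllib.request.urlopen(build_model_search_url(query, limit), timeout=timeout) as response:
        return extract_model_ids(json.load(response))
```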
Learn more about accessing data and deploying models in Hugging Face:
Challenge-Relevant Concepts
- Retrieval-Augmented Generation (RAG)
  - Retrieve up-to-date, specialized, and verifiable information from trusted knowledge bases
  - Insert this information into your “model context” in order to generate answer(s) using in-context information
  - Example: Almanac, NEJM AI
- Model Context Protocol (MCP)
  - Standardized protocol to allow LLMs (from any provider) to use context (i.e., as required for RAG)
  - Building MCP Servers with Gradio
  - Example ClinicalTrials.gov MCP server on Hugging Face: Embeddings of “brief description” column from entire Clinical Trials database
  - Example PubMed Medline MCP server & Gradio demo on Hugging Face: Embeddings of the description/abstract of every article on PubMed
- AI Agents
  - MCP Client and a “While Loop” to simulate a chatbot
  - Example AI-Tx Challenge AI Agent Demo on Hugging Face
- Constrained Generation
  - Without constraints, LLMs often produce malformed JSON (valid JSON is required per our Challenge Output Requirements)
  - Open libraries include “Outlines” for context-free grammars (CFG) and JSONFormer
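The retrieve-then-insert pattern and the JSON requirement above can be sketched in a few lines. Everything in this example is an assumption made for illustration: the three placeholder passages stand in for a real knowledge base, simple word overlap stands in for the embedding search used by the example MCP servers, and the two-field schema (`answer`, `justification`) is a guess at the output shape, so consult the Challenge Output Requirements for the actual format. The validation step is a post-hoc check on a model's reply, not constrained decoding as performed by libraries such as Outlines or JSONFormer.

```python
import json
import re
from typing import Dict, List, Optional

# Hypothetical mini knowledge base; real systems would query trusted sources.
KNOWLEDGE_BASE = [
    "Placeholder passage A: gene X loss of function is discussed with drug alpha.",
    "Placeholder passage B: clinical trial for condition Y is recruiting.",
    "Placeholder passage C: tissue expression of gene Z is highest in liver.",
]

# Assumed output schema, for illustration only.
REQUIRED_FIELDS = ("answer", "justification")


def tokenize(text: str) -> set:
    """Lowercase the text and split it into a set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, passages: List[str], top_k: int = 2) -> List[str]:
    """Rank passages by word overlap with the question (stand-in for embeddings)."""
    question_tokens = tokenize(question)
    ranked = sorted(passages, key=lambda p: len(question_tokens & tokenize(p)), reverse=True)
    return ranked[:top_k]


def build_prompt(question: str, passages: List[str]) -> str:
    """Insert the retrieved passages into the model context ahead of the question."""
    context = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    return (
        "Answer using only the numbered context passages.\n"
        f"Context:\n{context}\n"
        f"Question: {question}\n"
        'Reply with JSON containing "answer" and "justification".'
    )


def parse_answer(raw: str) -> Optional[Dict[str, str]]:
    """Return the parsed reply if it is valid JSON with the required string fields."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    if any(not isinstance(data.get(field), str) for field in REQUIRED_FIELDS):
        return None
    return data
```

In an agent loop, a reply for which `parse_answer` returns `None` could be rejected and the model re-prompted; grammar-constrained decoding avoids that retry by preventing invalid output in the first place.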