--- title: "Functional Inference Pipeline (FIP) DEMO" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Functional Inference Pipeline (FIP) DEMO} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include=FALSE} knitr::opts_chunk$set(collapse = TRUE, comment = "#>") find_leo_sc_root <- function(start = getwd()) { cur <- normalizePath(start, winslash = "/", mustWork = TRUE) repeat { desc <- file.path(cur, "DESCRIPTION") if (file.exists(desc)) { desc_lines <- readLines(desc, warn = FALSE) if (any(grepl("^Package:\\s*leo\\.sc\\s*$", desc_lines))) { return(cur) } } parent <- dirname(cur) if (identical(parent, cur)) { return(NULL) } cur <- parent } } # Prefer local checkout when available; otherwise use installed package. pkg_root <- find_leo_sc_root() if (!is.null(pkg_root) && requireNamespace("devtools", quietly = TRUE)) { devtools::load_all(pkg_root, quiet = TRUE) } else if (!requireNamespace("leo.sc", quietly = TRUE)) { stop("Please install leo.sc, or render from a leo.sc source checkout with devtools installed.") } silico_ko <- get("silico_ko", envir = asNamespace("leo.sc"), inherits = FALSE) ``` ## Overview This tutorial demonstrates the **Functional Inference Pipeline (FIP)** via `silico_ko()` from `leo.sc`. FIP estimates gene-associated transcriptional programs by comparing cells with **low** versus **high** endogenous expression of a target gene ***g***, conceptually mimicking transcriptional changes after a perturbation knock-down — but entirely *in silico* using existing single-cell data. The function is named `silico_ko()` (SKO) for memorability, but the underlying logic is more akin to **endogenous knock-down / over-expression** contrasting rather than a true genetic knockout. - **Dataset**: `SeuratObject::pbmc_small` (80-cell Seurat demo object, for illustration only) - **Goal**: Identify functional gene programs associated with the perturbation of a target gene - **Output**: Cell-type composition of high-expressors, DEG table, and enrichment results ## Method FIP proceeds in six steps: 1. **Filter**: Exclude cells with zero target-gene expression; remove rare cell types (< `filter_cell_threshold` cells). 2. **Rank**: Rank all remaining ***g***-expressing cells (*N*) by expression of the target gene. 3. **Select *g*high**: Select the top cells (top `pct_threshold` fraction, e.g., 0.1 = 10%; or `abs_threshold` absolute count) as the **high-expression group**, preserving cell-type composition. 4. **Match *g*low**: Select an equal number of bottom cells as the **low-expression group** with matched cell-type composition to minimise lineage confounding. 5. **DEG**: Apply `Seurat::FindMarkers()` (Wilcoxon rank-sum; ***g***low as `ident.1`) to identify differentially expressed genes between groups. 6. **Enrichment**: Run `leo.basic::leo_enrich()` (ORA / GSEA against GO, KEGG, or custom backgrounds) to infer the biological functions associated with ***g***high vs ***g***low. ## Data & Gene Setup ```{r data-setup} data("pbmc_small", package = "SeuratObject") srt <- pbmc_small # Annotate clusters with broad cell-type labels # Source: https://satijalab.org/seurat/articles/pbmc3k_tutorial srt$cell_anno <- dplyr::recode( as.character(Seurat::Idents(srt)), "0" = "T/NK cells", "1" = "CD14+ Monocytes", "2" = "B cells" ) # Pick a target gene: here we choose NKG7 for illustration. gene_use <- "NKG7" gene_use ``` > **⚠️ Method accuracy note:** `pbmc_small` contains only 80 cells and is used here **purely for a reproducible demo**. DEG and enrichment results are statistically underpowered and should **not** be interpreted biologically. For a realistic FIP result, see the [real-world example](#real-world-example-lrrk2-in-vkh-disease) below. ## Run SKO ```{r run-sko} res <- silico_ko( all = srt, gene = gene_use, sko_mode = "ko", cell_col = "cell_anno", filter_cell_threshold = 3, pct_threshold = 0.2, deg_method = "default", enrichment_method = "ORA", enrichment_bg = "GO", simplify = FALSE # pbmc_small is too small for reliable simplification ) names(res) ``` ## Interpret Outputs ```{r inspect-output} length(res$high_cells) length(res$low_cells) head(res$cell_rato) deg <- res$deg_results deg <- tibble::rownames_to_column(deg, "gene") head(deg[, c("gene", "avg_log2FC", "p_val_adj")]) if (length(res$enrichments) == 0) { cat("No enrichment result was returned (expected for very small demo data).\n") } else { print(names(res$enrichments)) } ``` ## Caveats - `pbmc_small` is intentionally tiny and is only used here to keep the demo lightweight. - Enrichment output can be empty (`res$enrichments` may be `list()`) and this is expected in small datasets. - For biological interpretation, use larger datasets and review parameters such as `pct_threshold`, `filter_cell_threshold`, and the enrichment background. ## Real-World Example: LRRK2 in VKH Disease To demonstrate FIP's capability to uncover biological insights at scale, we now walk through a **real-world result** from a Vogt–Koyanagi–Harada (VKH) disease study, in which `silico_ko()` was applied to **LRRK2** across a scRNA-seq cohort of 34 VKH patients and 14 healthy controls. ### Load pre-computed FIP result The pre-computed result is hosted on [GitHub Releases](https://github.com/laleoarrow/leo.sc/releases/tag/data-v2). We download it once and cache it locally: ```{r lrrk2-load} data_url <- "https://github.com/laleoarrow/leo.sc/releases/download/data-v2/LRRK2_ko_res.rds" cache_dir <- tools::R_user_dir("leo.sc", which = "cache") if (!dir.exists(cache_dir)) dir.create(cache_dir, recursive = TRUE) dest <- file.path(cache_dir, "LRRK2_ko_res.rds") if (!file.exists(dest)) { message("Downloading LRRK2 FIP result (7.8 MB) ...") utils::download.file(data_url, dest, mode = "wb", quiet = TRUE) } lrrk2 <- readRDS(dest) # **This represents a standard output object generated by FIP.** names(lrrk2) ``` ### Understanding the comparison direction In FIP's **KO mode**, `ident.1 = "low"` (simulated knock-out) and `ident.2 = "high"` (wild-type–like). According to `Seurat::FindMarkers()` convention: - **`avg_log2FC > 0`** → gene is **up-regulated after simulated KO** (more expressed in *g*low) - **`avg_log2FC < 0`** → gene is **down-regulated after simulated KO** (more expressed in *g*high) ```{r lrrk2-direction} # Confirm comparison direction stored in the result cat("ident.1 (test):", unique(lrrk2$deg_results$ident.1), "\n") cat("ident.2 (ref): ", unique(lrrk2$deg_results$ident.2), "\n") ``` ### Cell-type composition of ***g***high FIP preserves cell-type composition between ***g***high and ***g***low groups, ensuring that DEGs reflect variation in LRRK2 expression *within each lineage* as a whole rather than confounding by certain cell-type abundance: ```{r lrrk2-cell-composition} # Number of high- and low-expression cells cat("g_high cells:", length(lrrk2$high_cells), "\n") cat("g_low cells:", length(lrrk2$low_cells), "\n") # Cell-type composition of the high-expression group knitr::kable( head(lrrk2$cell_rato, 10), caption = "Top 10 cell types in LRRK2-high group (pct = percentage)" ) ``` ### Differential expression summary ```{r lrrk2-deg-summary} deg <- lrrk2$deg_results up <- sum(deg$avg_log2FC > 0 & deg$p_val_adj < 0.05) down <- sum(deg$avg_log2FC < 0 & deg$p_val_adj < 0.05) cat("Significant DEGs (FDR < 0.05):\n") cat(" Up after KO (log2FC > 0): ", up, "\n") cat(" Down after KO (log2FC < 0):", down, "\n") # Top 10 genes up-regulated after simulated KO deg_df <- tibble::rownames_to_column(deg, "gene") top_up <- head(deg_df[deg_df$avg_log2FC > 0 & deg_df$p_val_adj < 0.05, ], 10) knitr::kable( top_up[, c("gene", "avg_log2FC", "p_val_adj")], digits = 3, caption = "Top 10 genes up-regulated after simulated LRRK2 knock-out" ) ``` ### GSEA enrichment (GO) The enrichment results are stored in `lrrk2$enrichments`: ```{r lrrk2-enrichments} cat("Available enrichment results:\n") print(names(lrrk2$enrichments)) ``` Three representative GO pathways are visualised below using `enrichplot::gseaplot2()`: ```{r lrrk2-gsea-plot, out.width="100%", fig.width=7, fig.height=5, fig.cap="GSEA enrichment plot for LRRK2-high vs LRRK2-low contrast (GO). Top enriched pathways: antigen binding, oxidative phosphorylation, and leukocyte-mediated cytotoxicity."} library(enrichplot) gseaplot2( lrrk2$enrichments$GSEA_GO, color = ggsci::pal_npg("nrc")(3), geneSetID = c("GO:0003823", "GO:0006119", "GO:0001909"), subplots = 1:2, pvalue_table = TRUE ) ``` > **Key biological insight.**: All three pathways show **positive NES**, > meaning they are **up-regulated in the LRRK2-low (simulated KO) group**. > This implies that LRRK2 normally **suppresses** these immune programmes > under physiological conditions. Notably, this computational prediction > aligns with experimental evidence: Gardet *et al.* demonstrated that > LRRK2 controls inflammatory cytokine production via NF-κB > phosphorylation and **antigen presentation** in bone marrow-derived > dendritic cells > ([*Int J Mol Sci* 2020;21:6097](https://doi.org/10.3390/ijms21176097)). > The concordance between FIP's *in silico* inference and wet-lab > perturbation underscores the method's utility for generating testable > hypotheses from single-cell data.