---
title: "Functional Inference Pipeline (FIP) DEMO"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{Functional Inference Pipeline (FIP) DEMO}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
find_leo_sc_root <- function(start = getwd()) {
cur <- normalizePath(start, winslash = "/", mustWork = TRUE)
repeat {
desc <- file.path(cur, "DESCRIPTION")
if (file.exists(desc)) {
desc_lines <- readLines(desc, warn = FALSE)
if (any(grepl("^Package:\\s*leo\\.sc\\s*$", desc_lines))) {
return(cur)
}
}
parent <- dirname(cur)
if (identical(parent, cur)) {
return(NULL)
}
cur <- parent
}
}
# Prefer local checkout when available; otherwise use installed package.
pkg_root <- find_leo_sc_root()
if (!is.null(pkg_root) && requireNamespace("devtools", quietly = TRUE)) {
devtools::load_all(pkg_root, quiet = TRUE)
} else if (!requireNamespace("leo.sc", quietly = TRUE)) {
stop("Please install leo.sc, or render from a leo.sc source checkout with devtools installed.")
}
silico_ko <- get("silico_ko", envir = asNamespace("leo.sc"), inherits = FALSE)
```
## Overview
This tutorial demonstrates the **Functional Inference Pipeline (FIP)** via `silico_ko()` from `leo.sc`.
FIP estimates gene-associated transcriptional programs by comparing cells with **low** versus **high** endogenous expression of a target gene ***g***, conceptually mimicking transcriptional changes after a perturbation knock-down — but entirely *in silico* using existing single-cell data. The function is named `silico_ko()` (SKO) for memorability, but the underlying logic is more akin to **endogenous knock-down / over-expression** contrasting rather than a true genetic knockout.
- **Dataset**: `SeuratObject::pbmc_small` (80-cell Seurat demo object, for illustration only)
- **Goal**: Identify functional gene programs associated with the perturbation of a target gene
- **Output**: Cell-type composition of high-expressors, DEG table, and enrichment results
## Method
FIP proceeds in six steps:
1. **Filter**: Exclude cells with zero target-gene expression; remove rare cell types (< `filter_cell_threshold` cells).
2. **Rank**: Rank all remaining ***g***-expressing cells (*N*) by expression of the target gene.
3. **Select *g*high**: Select the top cells (top `pct_threshold` fraction, e.g., 0.1 = 10%; or `abs_threshold` absolute count) as the **high-expression group**, preserving cell-type composition.
4. **Match *g*low**: Select an equal number of bottom cells as the **low-expression group** with matched cell-type composition to minimise lineage confounding.
5. **DEG**: Apply `Seurat::FindMarkers()` (Wilcoxon rank-sum; ***g***low as `ident.1`) to identify differentially expressed genes between groups.
6. **Enrichment**: Run `leo.basic::leo_enrich()` (ORA / GSEA against GO, KEGG, or custom backgrounds) to infer the biological functions associated with ***g***high vs ***g***low.
## Data & Gene Setup
```{r data-setup}
data("pbmc_small", package = "SeuratObject")
srt <- pbmc_small
# Annotate clusters with broad cell-type labels
# Source: https://satijalab.org/seurat/articles/pbmc3k_tutorial
srt$cell_anno <- dplyr::recode(
as.character(Seurat::Idents(srt)),
"0" = "T/NK cells", "1" = "CD14+ Monocytes", "2" = "B cells"
)
# Pick a target gene: here we choose NKG7 for illustration.
gene_use <- "NKG7"
gene_use
```
> **⚠️ Method accuracy note:** `pbmc_small` contains only 80 cells and is used here **purely for a reproducible demo**. DEG and enrichment results are statistically underpowered and should **not** be interpreted biologically. For a realistic FIP result, see the [real-world example](#real-world-example-lrrk2-in-vkh-disease) below.
## Run SKO
```{r run-sko}
res <- silico_ko(
all = srt,
gene = gene_use,
sko_mode = "ko",
cell_col = "cell_anno",
filter_cell_threshold = 3,
pct_threshold = 0.2,
deg_method = "default",
enrichment_method = "ORA",
enrichment_bg = "GO",
simplify = FALSE # pbmc_small is too small for reliable simplification
)
names(res)
```
## Interpret Outputs
```{r inspect-output}
length(res$high_cells)
length(res$low_cells)
head(res$cell_rato)
deg <- res$deg_results
deg <- tibble::rownames_to_column(deg, "gene")
head(deg[, c("gene", "avg_log2FC", "p_val_adj")])
if (length(res$enrichments) == 0) {
cat("No enrichment result was returned (expected for very small demo data).\n")
} else {
print(names(res$enrichments))
}
```
## Caveats
- `pbmc_small` is intentionally tiny and is only used here to keep the demo lightweight.
- Enrichment output can be empty (`res$enrichments` may be `list()`) and this is expected in small datasets.
- For biological interpretation, use larger datasets and review parameters such as `pct_threshold`, `filter_cell_threshold`, and the enrichment background.
## Real-World Example: LRRK2 in VKH Disease
To demonstrate FIP's capability to uncover biological insights at scale,
we now walk through a **real-world result** from
a Vogt–Koyanagi–Harada (VKH) disease study, in which `silico_ko()` was
applied to **LRRK2** across a scRNA-seq cohort of 34 VKH patients and
14 healthy controls.
### Load pre-computed FIP result
The pre-computed result is hosted on
[GitHub Releases](https://github.com/laleoarrow/leo.sc/releases/tag/data-v2).
We download it once and cache it locally:
```{r lrrk2-load}
data_url <- "https://github.com/laleoarrow/leo.sc/releases/download/data-v2/LRRK2_ko_res.rds"
cache_dir <- tools::R_user_dir("leo.sc", which = "cache")
if (!dir.exists(cache_dir)) dir.create(cache_dir, recursive = TRUE)
dest <- file.path(cache_dir, "LRRK2_ko_res.rds")
if (!file.exists(dest)) {
message("Downloading LRRK2 FIP result (7.8 MB) ...")
utils::download.file(data_url, dest, mode = "wb", quiet = TRUE)
}
lrrk2 <- readRDS(dest) # **This represents a standard output object generated by FIP.**
names(lrrk2)
```
### Understanding the comparison direction
In FIP's **KO mode**, `ident.1 = "low"` (simulated knock-out) and
`ident.2 = "high"` (wild-type–like). According to `Seurat::FindMarkers()`
convention:
- **`avg_log2FC > 0`** → gene is **up-regulated after simulated KO** (more expressed in *g*low)
- **`avg_log2FC < 0`** → gene is **down-regulated after simulated KO** (more expressed in *g*high)
```{r lrrk2-direction}
# Confirm comparison direction stored in the result
cat("ident.1 (test):", unique(lrrk2$deg_results$ident.1), "\n")
cat("ident.2 (ref): ", unique(lrrk2$deg_results$ident.2), "\n")
```
### Cell-type composition of ***g***high
FIP preserves cell-type composition between
***g***high and ***g***low groups, ensuring that DEGs
reflect variation in LRRK2 expression *within each lineage* as a whole rather
than confounding by certain cell-type abundance:
```{r lrrk2-cell-composition}
# Number of high- and low-expression cells
cat("g_high cells:", length(lrrk2$high_cells), "\n")
cat("g_low cells:", length(lrrk2$low_cells), "\n")
# Cell-type composition of the high-expression group
knitr::kable(
head(lrrk2$cell_rato, 10),
caption = "Top 10 cell types in LRRK2-high group (pct = percentage)"
)
```
### Differential expression summary
```{r lrrk2-deg-summary}
deg <- lrrk2$deg_results
up <- sum(deg$avg_log2FC > 0 & deg$p_val_adj < 0.05)
down <- sum(deg$avg_log2FC < 0 & deg$p_val_adj < 0.05)
cat("Significant DEGs (FDR < 0.05):\n")
cat(" Up after KO (log2FC > 0): ", up, "\n")
cat(" Down after KO (log2FC < 0):", down, "\n")
# Top 10 genes up-regulated after simulated KO
deg_df <- tibble::rownames_to_column(deg, "gene")
top_up <- head(deg_df[deg_df$avg_log2FC > 0 & deg_df$p_val_adj < 0.05, ], 10)
knitr::kable(
top_up[, c("gene", "avg_log2FC", "p_val_adj")],
digits = 3,
caption = "Top 10 genes up-regulated after simulated LRRK2 knock-out"
)
```
### GSEA enrichment (GO)
The enrichment results are stored in `lrrk2$enrichments`:
```{r lrrk2-enrichments}
cat("Available enrichment results:\n")
print(names(lrrk2$enrichments))
```
Three representative GO pathways are visualised below using
`enrichplot::gseaplot2()`:
```{r lrrk2-gsea-plot, out.width="100%", fig.width=7, fig.height=5, fig.cap="GSEA enrichment plot for LRRK2-high vs LRRK2-low contrast (GO). Top enriched pathways: antigen binding, oxidative phosphorylation, and leukocyte-mediated cytotoxicity."}
library(enrichplot)
gseaplot2(
lrrk2$enrichments$GSEA_GO,
color = ggsci::pal_npg("nrc")(3),
geneSetID = c("GO:0003823", "GO:0006119", "GO:0001909"),
subplots = 1:2,
pvalue_table = TRUE
)
```
> **Key biological insight.**: All three pathways show **positive NES**,
> meaning they are **up-regulated in the LRRK2-low (simulated KO) group**.
> This implies that LRRK2 normally **suppresses** these immune programmes
> under physiological conditions. Notably, this computational prediction
> aligns with experimental evidence: Gardet *et al.* demonstrated that
> LRRK2 controls inflammatory cytokine production via NF-κB
> phosphorylation and **antigen presentation** in bone marrow-derived
> dendritic cells
> ([*Int J Mol Sci* 2020;21:6097](https://doi.org/10.3390/ijms21176097)).
> The concordance between FIP's *in silico* inference and wet-lab
> perturbation underscores the method's utility for generating testable
> hypotheses from single-cell data.