Introduction to swapper
Milan Malfait
Ghent Universitymilan.malfait94@gmail.com
11 October 2022
Source:vignettes/swapper.Rmd
swapper.Rmd
Introduction
The swapper package implements a simple DE simulator based on feature swapping.The essence of the method involves randomly scrambling a subset of features in one subgroup of the data. This effectively induces a DE signal for those scrambled genes, while retaining the characteristics of the original data set without having to rely on any modeling assumptions.
Quick start
Mock up a data set
First we mock up a dataset.
sce <- scuttle::mockSCE(ncells = 100, ngenes = 50)
sce
#> class: SingleCellExperiment
#> dim: 50 100
#> metadata(0):
#> assays(1): counts
#> rownames(50): Gene_0001 Gene_0002 ... Gene_0049 Gene_0050
#> rowData names(0):
#> colnames(100): Cell_001 Cell_002 ... Cell_099 Cell_100
#> colData names(3): Mutation_Status Cell_Cycle Treatment
#> reducedDimNames(0):
#> mainExpName: NULL
#> altExpNames(1): Spikes
This mock data set contains a "Treatment"
variable, which is just a random partitioning of the cells in 2 groups.
table(sce$Treatment)
#>
#> treat1 treat2
#> 55 45
Simulate DE by swapping features
Next we use swapper to simulate DE genes, using the "Treatment"
variable as the grouping factor. We induce DE in 10% of the original genes for cells belonging to the "treat1"
group.
set.seed(123) # for reproducibility
cells_to_swap <- sce$Treatment == "treat1"
sim <- simulateDE(sce, which_cols = cells_to_swap, prop_DE = 0.1)
rowData(sim)
#> DataFrame with 50 rows and 2 columns
#> is_DE swapped_feature
#> <logical> <character>
#> Gene_0001 FALSE NA
#> Gene_0002 FALSE NA
#> Gene_0003 TRUE Gene_0031
#> Gene_0004 FALSE NA
#> Gene_0005 FALSE NA
#> ... ... ...
#> Gene_0046 FALSE NA
#> Gene_0047 FALSE NA
#> Gene_0048 FALSE NA
#> Gene_0049 FALSE NA
#> Gene_0050 FALSE NA
table(rowData(sim)$is_DE)
#>
#> FALSE TRUE
#> 46 4
Explore the simulated data
From the rowData
we can retrieve the true DE genes. The rowData
also contains information regarding which gene was swapped with which.
(trueDE_genes <- rownames(sim)[rowData(sim)$is_DE])
#> [1] "Gene_0003" "Gene_0015" "Gene_0031" "Gene_0042"
rowData(sim)[trueDE_genes, ]
#> DataFrame with 4 rows and 2 columns
#> is_DE swapped_feature
#> <logical> <character>
#> Gene_0003 TRUE Gene_0031
#> Gene_0015 TRUE Gene_0042
#> Gene_0031 TRUE Gene_0015
#> Gene_0042 TRUE Gene_0003
The "sim_group"
column in the colData
indicates for which cells the swapping occurred. In this example, this should be equivalent to the "Treatment"
grouping.
colData(sim)
#> DataFrame with 100 rows and 4 columns
#> Mutation_Status Cell_Cycle Treatment sim_group
#> <character> <character> <character> <logical>
#> Cell_001 negative S treat1 TRUE
#> Cell_002 positive G1 treat2 FALSE
#> Cell_003 negative G2M treat2 FALSE
#> Cell_004 positive G0 treat2 FALSE
#> Cell_005 positive G1 treat2 FALSE
#> ... ... ... ... ...
#> Cell_096 positive G0 treat2 FALSE
#> Cell_097 positive G0 treat1 TRUE
#> Cell_098 negative G1 treat2 FALSE
#> Cell_099 positive G2M treat2 FALSE
#> Cell_100 negative S treat2 FALSE
table(sim$Treatment, sim$sim_group)
#>
#> FALSE TRUE
#> treat1 0 55
#> treat2 45 0
We can visualize these and compare them to their original counts to see how the swapping works.
if (requireNamespace("scater", quietly = TRUE)) {
library(scater)
## Log-normalize counts for visualization
sce <- logNormCounts(sce)
sim <- logNormCounts(sim)
p_orig <- plotExpression(
sce, features = trueDE_genes, x = "Treatment", colour_by = "Treatment"
) + ggtitle("Original counts")
p_sim <- scater::plotExpression(
sim, features = trueDE_genes, x = "Treatment", colour_by = "Treatment"
) + ggtitle("Simulated counts")
gridExtra::grid.arrange(p_orig, p_sim, ncol = 2)
}
#> Loading required package: scuttle
#> Loading required package: ggplot2
data:image/s3,"s3://crabby-images/e9563/e9563fe8e2b8ee114c876e63c856b1fda0e8cff1" alt="Comparing expression values for the simulated DE genes between the original data (left) and simulated data (right)."
Comparing expression values for the simulated DE genes between the original data (left) and simulated data (right).
From this plot and the rowData
shown above, we can see that the counts for e.g. Gene_0003
in the treat1
group of the simulated data originated from the treat1
counts for Gene_0014
in the original data. Likewise, the simulated treat1
counts for Gene_0014
are the original treat1
counts for Gene_0042
. So we effectively induced DE by selectively swapping out counts in one, but not the other, treatment group.
Session Info
Session info
#> ─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────
#> setting value
#> version R version 4.2.1 (2022-06-23)
#> os Ubuntu 20.04.5 LTS
#> system x86_64, linux-gnu
#> ui X11
#> language en
#> collate C.UTF-8
#> ctype C.UTF-8
#> tz UTC
#> date 2022-10-11
#> pandoc 2.14.2 @ /usr/bin/ (via rmarkdown)
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────
#> package * version date (UTC) lib source
#> beachmat 2.12.0 2022-04-26 [1] Bioconductor
#> beeswarm 0.4.0 2021-06-01 [1] RSPM
#> Biobase * 2.56.0 2022-04-26 [1] Bioconductor
#> BiocGenerics * 0.42.0 2022-04-26 [1] Bioconductor
#> BiocManager 1.30.18 2022-05-18 [1] RSPM
#> BiocNeighbors 1.14.0 2022-04-26 [1] Bioconductor
#> BiocParallel 1.30.3 2022-06-05 [1] Bioconductor
#> BiocSingular 1.12.0 2022-04-26 [1] Bioconductor
#> BiocStyle * 2.24.0 2022-04-26 [1] Bioconductor
#> bitops 1.0-7 2021-04-24 [1] RSPM
#> bookdown 0.29 2022-09-12 [1] RSPM
#> bslib 0.4.0 2022-07-16 [1] RSPM
#> cachem 1.0.6 2021-08-19 [1] RSPM
#> cli 3.4.1 2022-09-23 [1] RSPM
#> codetools 0.2-18 2020-11-04 [2] CRAN (R 4.2.1)
#> colorspace 2.0-3 2022-02-21 [1] RSPM
#> DelayedArray 0.22.0 2022-04-26 [1] Bioconductor
#> DelayedMatrixStats 1.18.1 2022-09-27 [1] Bioconductor
#> desc 1.4.2 2022-09-08 [1] RSPM
#> digest 0.6.29 2021-12-01 [1] RSPM
#> evaluate 0.16 2022-08-09 [1] RSPM
#> fansi 1.0.3 2022-03-24 [1] RSPM
#> farver 2.1.1 2022-07-06 [1] RSPM
#> fastmap 1.1.0 2021-01-25 [1] RSPM
#> fs 1.5.2 2021-12-08 [1] RSPM
#> GenomeInfoDb * 1.32.4 2022-09-06 [1] Bioconductor
#> GenomeInfoDbData 1.2.8 2022-10-11 [1] Bioconductor
#> GenomicRanges * 1.48.0 2022-04-26 [1] Bioconductor
#> ggbeeswarm 0.6.0 2017-08-07 [1] RSPM
#> ggplot2 * 3.3.6 2022-05-03 [1] RSPM
#> ggrepel 0.9.1 2021-01-15 [1] RSPM
#> glue 1.6.2 2022-02-24 [1] RSPM
#> gridExtra 2.3 2017-09-09 [1] RSPM
#> gtable 0.3.1 2022-09-01 [1] RSPM
#> highr 0.9 2021-04-16 [1] RSPM
#> htmltools 0.5.3 2022-07-18 [1] RSPM
#> IRanges * 2.30.1 2022-08-18 [1] Bioconductor
#> irlba 2.3.5.1 2022-10-03 [1] RSPM
#> jquerylib 0.1.4 2021-04-26 [1] RSPM
#> jsonlite 1.8.2 2022-10-02 [1] RSPM
#> knitr 1.40 2022-08-24 [1] RSPM
#> labeling 0.4.2 2020-10-20 [1] RSPM
#> lattice 0.20-45 2021-09-22 [2] CRAN (R 4.2.1)
#> lifecycle 1.0.2 2022-09-09 [1] RSPM
#> magrittr 2.0.3 2022-03-30 [1] RSPM
#> Matrix 1.5-1 2022-09-13 [1] RSPM
#> MatrixGenerics * 1.8.1 2022-06-26 [1] Bioconductor
#> matrixStats * 0.62.0 2022-04-19 [1] RSPM
#> memoise 2.0.1 2021-11-26 [1] RSPM
#> munsell 0.5.0 2018-06-12 [1] RSPM
#> pillar 1.8.1 2022-08-19 [1] RSPM
#> pkgconfig 2.0.3 2019-09-22 [1] RSPM
#> pkgdown 2.0.6 2022-07-16 [1] any (@2.0.6)
#> purrr 0.3.5 2022-10-06 [1] RSPM
#> R6 2.5.1 2021-08-19 [1] RSPM
#> ragg 1.2.3 2022-09-29 [1] RSPM
#> Rcpp 1.0.9 2022-07-08 [1] RSPM
#> RCurl 1.98-1.9 2022-10-03 [1] RSPM
#> rlang 1.0.6 2022-09-24 [1] RSPM
#> rmarkdown 2.17 2022-10-07 [1] CRAN (R 4.2.1)
#> rprojroot 2.0.3 2022-04-02 [1] RSPM
#> rsvd 1.0.5 2021-04-16 [1] RSPM
#> S4Vectors * 0.34.0 2022-04-26 [1] Bioconductor
#> sass 0.4.2 2022-07-16 [1] RSPM
#> ScaledMatrix 1.4.1 2022-09-11 [1] Bioconductor
#> scales 1.2.1 2022-08-20 [1] RSPM
#> scater * 1.24.0 2022-04-26 [1] Bioconductor
#> scuttle * 1.6.3 2022-08-23 [1] Bioconductor
#> sessioninfo 1.2.2 2021-12-06 [1] any (@1.2.2)
#> SingleCellExperiment * 1.18.1 2022-10-02 [1] Bioconductor
#> sparseMatrixStats 1.8.0 2022-04-26 [1] Bioconductor
#> stringi 1.7.8 2022-07-11 [1] RSPM
#> stringr 1.4.1 2022-08-20 [1] RSPM
#> SummarizedExperiment * 1.26.1 2022-04-29 [1] Bioconductor
#> swapper * 0.99.2 2022-10-11 [1] local
#> systemfonts 1.0.4 2022-02-11 [1] RSPM
#> textshaping 0.3.6 2021-10-13 [1] RSPM
#> tibble 3.1.8 2022-07-22 [1] RSPM
#> utf8 1.2.2 2021-07-24 [1] RSPM
#> vctrs 0.4.2 2022-09-29 [1] RSPM
#> vipor 0.4.5 2017-03-22 [1] RSPM
#> viridis 0.6.2 2021-10-13 [1] RSPM
#> viridisLite 0.4.1 2022-08-22 [1] RSPM
#> withr 2.5.0 2022-03-03 [1] RSPM
#> xfun 0.33 2022-09-12 [1] RSPM
#> XVector 0.36.0 2022-04-26 [1] Bioconductor
#> yaml 2.3.5 2022-02-21 [1] RSPM
#> zlibbioc 1.42.0 2022-04-26 [1] Bioconductor
#>
#> [1] /home/runner/work/_temp/Library
#> [2] /opt/R/4.2.1/lib/R/library
#>
#> ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────