Introduction to swapper
Milan Malfait
Ghent Universitymilan.malfait94@gmail.com
11 October 2022
Source:vignettes/swapper.Rmd
swapper.Rmd
Introduction
The swapper package implements a simple DE simulator based on feature swapping.The essence of the method involves randomly scrambling a subset of features in one subgroup of the data. This effectively induces a DE signal for those scrambled genes, while retaining the characteristics of the original data set without having to rely on any modeling assumptions.
Quick start
Mock up a data set
First we mock up a dataset.
sce <- scuttle::mockSCE(ncells = 100, ngenes = 50)
sce
#> class: SingleCellExperiment
#> dim: 50 100
#> metadata(0):
#> assays(1): counts
#> rownames(50): Gene_0001 Gene_0002 ... Gene_0049 Gene_0050
#> rowData names(0):
#> colnames(100): Cell_001 Cell_002 ... Cell_099 Cell_100
#> colData names(3): Mutation_Status Cell_Cycle Treatment
#> reducedDimNames(0):
#> mainExpName: NULL
#> altExpNames(1): Spikes
This mock data set contains a "Treatment"
variable, which is just a random partitioning of the cells in 2 groups.
table(sce$Treatment)
#>
#> treat1 treat2
#> 55 45
Simulate DE by swapping features
Next we use swapper to simulate DE genes, using the "Treatment"
variable as the grouping factor. We induce DE in 10% of the original genes for cells belonging to the "treat1"
group.
set.seed(123) # for reproducibility
cells_to_swap <- sce$Treatment == "treat1"
sim <- simulateDE(sce, which_cols = cells_to_swap, prop_DE = 0.1)
rowData(sim)
#> DataFrame with 50 rows and 2 columns
#> is_DE swapped_feature
#> <logical> <character>
#> Gene_0001 FALSE NA
#> Gene_0002 FALSE NA
#> Gene_0003 TRUE Gene_0031
#> Gene_0004 FALSE NA
#> Gene_0005 FALSE NA
#> ... ... ...
#> Gene_0046 FALSE NA
#> Gene_0047 FALSE NA
#> Gene_0048 FALSE NA
#> Gene_0049 FALSE NA
#> Gene_0050 FALSE NA
table(rowData(sim)$is_DE)
#>
#> FALSE TRUE
#> 46 4
Explore the simulated data
From the rowData
we can retrieve the true DE genes. The rowData
also contains information regarding which gene was swapped with which.
(trueDE_genes <- rownames(sim)[rowData(sim)$is_DE])
#> [1] "Gene_0003" "Gene_0015" "Gene_0031" "Gene_0042"
rowData(sim)[trueDE_genes, ]
#> DataFrame with 4 rows and 2 columns
#> is_DE swapped_feature
#> <logical> <character>
#> Gene_0003 TRUE Gene_0031
#> Gene_0015 TRUE Gene_0042
#> Gene_0031 TRUE Gene_0015
#> Gene_0042 TRUE Gene_0003
The "sim_group"
column in the colData
indicates for which cells the swapping occurred. In this example, this should be equivalent to the "Treatment"
grouping.
colData(sim)
#> DataFrame with 100 rows and 4 columns
#> Mutation_Status Cell_Cycle Treatment sim_group
#> <character> <character> <character> <logical>
#> Cell_001 negative S treat1 TRUE
#> Cell_002 positive G1 treat2 FALSE
#> Cell_003 negative G2M treat2 FALSE
#> Cell_004 positive G0 treat2 FALSE
#> Cell_005 positive G1 treat2 FALSE
#> ... ... ... ... ...
#> Cell_096 positive G0 treat2 FALSE
#> Cell_097 positive G0 treat1 TRUE
#> Cell_098 negative G1 treat2 FALSE
#> Cell_099 positive G2M treat2 FALSE
#> Cell_100 negative S treat2 FALSE
table(sim$Treatment, sim$sim_group)
#>
#> FALSE TRUE
#> treat1 0 55
#> treat2 45 0
We can visualize these and compare them to their original counts to see how the swapping works.
if (requireNamespace("scater", quietly = TRUE)) {
library(scater)
## Log-normalize counts for visualization
sce <- logNormCounts(sce)
sim <- logNormCounts(sim)
p_orig <- plotExpression(
sce, features = trueDE_genes, x = "Treatment", colour_by = "Treatment"
) + ggtitle("Original counts")
p_sim <- scater::plotExpression(
sim, features = trueDE_genes, x = "Treatment", colour_by = "Treatment"
) + ggtitle("Simulated counts")
gridExtra::grid.arrange(p_orig, p_sim, ncol = 2)
}
#> Loading required package: scuttle
#> Loading required package: ggplot2
From this plot and the rowData
shown above, we can see that the counts for e.g. Gene_0003
in the treat1
group of the simulated data originated from the treat1
counts for Gene_0014
in the original data. Likewise, the simulated treat1
counts for Gene_0014
are the original treat1
counts for Gene_0042
. So we effectively induced DE by selectively swapping out counts in one, but not the other, treatment group.
Session Info
Session info
#> ─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────
#> setting value
#> version R version 4.2.1 (2022-06-23)
#> os Ubuntu 20.04.5 LTS
#> system x86_64, linux-gnu
#> ui X11
#> language en
#> collate C.UTF-8
#> ctype C.UTF-8
#> tz UTC
#> date 2022-10-11
#> pandoc 2.14.2 @ /usr/bin/ (via rmarkdown)
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────
#> package * version date (UTC) lib source
#> beachmat 2.12.0 2022-04-26 [1] Bioconductor
#> beeswarm 0.4.0 2021-06-01 [1] RSPM
#> Biobase * 2.56.0 2022-04-26 [1] Bioconductor
#> BiocGenerics * 0.42.0 2022-04-26 [1] Bioconductor
#> BiocManager 1.30.18 2022-05-18 [1] RSPM
#> BiocNeighbors 1.14.0 2022-04-26 [1] Bioconductor
#> BiocParallel 1.30.3 2022-06-05 [1] Bioconductor
#> BiocSingular 1.12.0 2022-04-26 [1] Bioconductor
#> BiocStyle * 2.24.0 2022-04-26 [1] Bioconductor
#> bitops 1.0-7 2021-04-24 [1] RSPM
#> bookdown 0.29 2022-09-12 [1] RSPM
#> bslib 0.4.0 2022-07-16 [1] RSPM
#> cachem 1.0.6 2021-08-19 [1] RSPM
#> cli 3.4.1 2022-09-23 [1] RSPM
#> codetools 0.2-18 2020-11-04 [2] CRAN (R 4.2.1)
#> colorspace 2.0-3 2022-02-21 [1] RSPM
#> DelayedArray 0.22.0 2022-04-26 [1] Bioconductor
#> DelayedMatrixStats 1.18.1 2022-09-27 [1] Bioconductor
#> desc 1.4.2 2022-09-08 [1] RSPM
#> digest 0.6.29 2021-12-01 [1] RSPM
#> evaluate 0.16 2022-08-09 [1] RSPM
#> fansi 1.0.3 2022-03-24 [1] RSPM
#> farver 2.1.1 2022-07-06 [1] RSPM
#> fastmap 1.1.0 2021-01-25 [1] RSPM
#> fs 1.5.2 2021-12-08 [1] RSPM
#> GenomeInfoDb * 1.32.4 2022-09-06 [1] Bioconductor
#> GenomeInfoDbData 1.2.8 2022-10-11 [1] Bioconductor
#> GenomicRanges * 1.48.0 2022-04-26 [1] Bioconductor
#> ggbeeswarm 0.6.0 2017-08-07 [1] RSPM
#> ggplot2 * 3.3.6 2022-05-03 [1] RSPM
#> ggrepel 0.9.1 2021-01-15 [1] RSPM
#> glue 1.6.2 2022-02-24 [1] RSPM
#> gridExtra 2.3 2017-09-09 [1] RSPM
#> gtable 0.3.1 2022-09-01 [1] RSPM
#> highr 0.9 2021-04-16 [1] RSPM
#> htmltools 0.5.3 2022-07-18 [1] RSPM
#> IRanges * 2.30.1 2022-08-18 [1] Bioconductor
#> irlba 2.3.5.1 2022-10-03 [1] RSPM
#> jquerylib 0.1.4 2021-04-26 [1] RSPM
#> jsonlite 1.8.2 2022-10-02 [1] RSPM
#> knitr 1.40 2022-08-24 [1] RSPM
#> labeling 0.4.2 2020-10-20 [1] RSPM
#> lattice 0.20-45 2021-09-22 [2] CRAN (R 4.2.1)
#> lifecycle 1.0.2 2022-09-09 [1] RSPM
#> magrittr 2.0.3 2022-03-30 [1] RSPM
#> Matrix 1.5-1 2022-09-13 [1] RSPM
#> MatrixGenerics * 1.8.1 2022-06-26 [1] Bioconductor
#> matrixStats * 0.62.0 2022-04-19 [1] RSPM
#> memoise 2.0.1 2021-11-26 [1] RSPM
#> munsell 0.5.0 2018-06-12 [1] RSPM
#> pillar 1.8.1 2022-08-19 [1] RSPM
#> pkgconfig 2.0.3 2019-09-22 [1] RSPM
#> pkgdown 2.0.6 2022-07-16 [1] any (@2.0.6)
#> purrr 0.3.5 2022-10-06 [1] RSPM
#> R6 2.5.1 2021-08-19 [1] RSPM
#> ragg 1.2.3 2022-09-29 [1] RSPM
#> Rcpp 1.0.9 2022-07-08 [1] RSPM
#> RCurl 1.98-1.9 2022-10-03 [1] RSPM
#> rlang 1.0.6 2022-09-24 [1] RSPM
#> rmarkdown 2.17 2022-10-07 [1] CRAN (R 4.2.1)
#> rprojroot 2.0.3 2022-04-02 [1] RSPM
#> rsvd 1.0.5 2021-04-16 [1] RSPM
#> S4Vectors * 0.34.0 2022-04-26 [1] Bioconductor
#> sass 0.4.2 2022-07-16 [1] RSPM
#> ScaledMatrix 1.4.1 2022-09-11 [1] Bioconductor
#> scales 1.2.1 2022-08-20 [1] RSPM
#> scater * 1.24.0 2022-04-26 [1] Bioconductor
#> scuttle * 1.6.3 2022-08-23 [1] Bioconductor
#> sessioninfo 1.2.2 2021-12-06 [1] any (@1.2.2)
#> SingleCellExperiment * 1.18.1 2022-10-02 [1] Bioconductor
#> sparseMatrixStats 1.8.0 2022-04-26 [1] Bioconductor
#> stringi 1.7.8 2022-07-11 [1] RSPM
#> stringr 1.4.1 2022-08-20 [1] RSPM
#> SummarizedExperiment * 1.26.1 2022-04-29 [1] Bioconductor
#> swapper * 0.99.2 2022-10-11 [1] local
#> systemfonts 1.0.4 2022-02-11 [1] RSPM
#> textshaping 0.3.6 2021-10-13 [1] RSPM
#> tibble 3.1.8 2022-07-22 [1] RSPM
#> utf8 1.2.2 2021-07-24 [1] RSPM
#> vctrs 0.4.2 2022-09-29 [1] RSPM
#> vipor 0.4.5 2017-03-22 [1] RSPM
#> viridis 0.6.2 2021-10-13 [1] RSPM
#> viridisLite 0.4.1 2022-08-22 [1] RSPM
#> withr 2.5.0 2022-03-03 [1] RSPM
#> xfun 0.33 2022-09-12 [1] RSPM
#> XVector 0.36.0 2022-04-26 [1] Bioconductor
#> yaml 2.3.5 2022-02-21 [1] RSPM
#> zlibbioc 1.42.0 2022-04-26 [1] Bioconductor
#>
#> [1] /home/runner/work/_temp/Library
#> [2] /opt/R/4.2.1/lib/R/library
#>
#> ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────