Skip to contents

Introduction

The swapper package implements a simple DE simulator based on feature swapping.The essence of the method involves randomly scrambling a subset of features in one subgroup of the data. This effectively induces a DE signal for those scrambled genes, while retaining the characteristics of the original data set without having to rely on any modeling assumptions.

Quick start

Mock up a data set

First we mock up a dataset.

sce <- scuttle::mockSCE(ncells = 100, ngenes = 50)
sce
#> class: SingleCellExperiment 
#> dim: 50 100 
#> metadata(0):
#> assays(1): counts
#> rownames(50): Gene_0001 Gene_0002 ... Gene_0049 Gene_0050
#> rowData names(0):
#> colnames(100): Cell_001 Cell_002 ... Cell_099 Cell_100
#> colData names(3): Mutation_Status Cell_Cycle Treatment
#> reducedDimNames(0):
#> mainExpName: NULL
#> altExpNames(1): Spikes

This mock data set contains a "Treatment" variable, which is just a random partitioning of the cells in 2 groups.

table(sce$Treatment)
#> 
#> treat1 treat2 
#>     55     45

Simulate DE by swapping features

Next we use swapper to simulate DE genes, using the "Treatment" variable as the grouping factor. We induce DE in 10% of the original genes for cells belonging to the "treat1" group.

set.seed(123) # for reproducibility

cells_to_swap <- sce$Treatment == "treat1"
sim <- simulateDE(sce, which_cols = cells_to_swap, prop_DE = 0.1)
rowData(sim)
#> DataFrame with 50 rows and 2 columns
#>               is_DE swapped_feature
#>           <logical>     <character>
#> Gene_0001     FALSE              NA
#> Gene_0002     FALSE              NA
#> Gene_0003      TRUE       Gene_0031
#> Gene_0004     FALSE              NA
#> Gene_0005     FALSE              NA
#> ...             ...             ...
#> Gene_0046     FALSE              NA
#> Gene_0047     FALSE              NA
#> Gene_0048     FALSE              NA
#> Gene_0049     FALSE              NA
#> Gene_0050     FALSE              NA
table(rowData(sim)$is_DE)
#> 
#> FALSE  TRUE 
#>    46     4

Explore the simulated data

From the rowData we can retrieve the true DE genes. The rowData also contains information regarding which gene was swapped with which.

(trueDE_genes <- rownames(sim)[rowData(sim)$is_DE])
#> [1] "Gene_0003" "Gene_0015" "Gene_0031" "Gene_0042"
rowData(sim)[trueDE_genes, ]
#> DataFrame with 4 rows and 2 columns
#>               is_DE swapped_feature
#>           <logical>     <character>
#> Gene_0003      TRUE       Gene_0031
#> Gene_0015      TRUE       Gene_0042
#> Gene_0031      TRUE       Gene_0015
#> Gene_0042      TRUE       Gene_0003

The "sim_group" column in the colData indicates for which cells the swapping occurred. In this example, this should be equivalent to the "Treatment" grouping.

colData(sim)
#> DataFrame with 100 rows and 4 columns
#>          Mutation_Status  Cell_Cycle   Treatment sim_group
#>              <character> <character> <character> <logical>
#> Cell_001        negative           S      treat1      TRUE
#> Cell_002        positive          G1      treat2     FALSE
#> Cell_003        negative         G2M      treat2     FALSE
#> Cell_004        positive          G0      treat2     FALSE
#> Cell_005        positive          G1      treat2     FALSE
#> ...                  ...         ...         ...       ...
#> Cell_096        positive          G0      treat2     FALSE
#> Cell_097        positive          G0      treat1      TRUE
#> Cell_098        negative          G1      treat2     FALSE
#> Cell_099        positive         G2M      treat2     FALSE
#> Cell_100        negative           S      treat2     FALSE
table(sim$Treatment, sim$sim_group)
#>         
#>          FALSE TRUE
#>   treat1     0   55
#>   treat2    45    0

We can visualize these and compare them to their original counts to see how the swapping works.

if (requireNamespace("scater", quietly = TRUE)) {
    library(scater)
    
    ## Log-normalize counts for visualization
    sce <- logNormCounts(sce)
    sim <- logNormCounts(sim)
    
    p_orig <- plotExpression(
        sce, features = trueDE_genes, x = "Treatment", colour_by = "Treatment"
    ) + ggtitle("Original counts")
    
    p_sim <- scater::plotExpression(
        sim, features = trueDE_genes, x = "Treatment", colour_by = "Treatment"
    ) + ggtitle("Simulated counts")
    
    gridExtra::grid.arrange(p_orig, p_sim, ncol = 2)
}
#> Loading required package: scuttle
#> Loading required package: ggplot2
Comparing expression values for the simulated DE genes between the original data (left) and simulated data (right).

Comparing expression values for the simulated DE genes between the original data (left) and simulated data (right).

From this plot and the rowData shown above, we can see that the counts for e.g. Gene_0003 in the treat1 group of the simulated data originated from the treat1 counts for Gene_0014 in the original data. Likewise, the simulated treat1 counts for Gene_0014 are the original treat1 counts for Gene_0042. So we effectively induced DE by selectively swapping out counts in one, but not the other, treatment group.

Session Info

Session info
#> ─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.2.1 (2022-06-23)
#>  os       Ubuntu 20.04.5 LTS
#>  system   x86_64, linux-gnu
#>  ui       X11
#>  language en
#>  collate  C.UTF-8
#>  ctype    C.UTF-8
#>  tz       UTC
#>  date     2022-10-11
#>  pandoc   2.14.2 @ /usr/bin/ (via rmarkdown)
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────
#>  package              * version  date (UTC) lib source
#>  beachmat               2.12.0   2022-04-26 [1] Bioconductor
#>  beeswarm               0.4.0    2021-06-01 [1] RSPM
#>  Biobase              * 2.56.0   2022-04-26 [1] Bioconductor
#>  BiocGenerics         * 0.42.0   2022-04-26 [1] Bioconductor
#>  BiocManager            1.30.18  2022-05-18 [1] RSPM
#>  BiocNeighbors          1.14.0   2022-04-26 [1] Bioconductor
#>  BiocParallel           1.30.3   2022-06-05 [1] Bioconductor
#>  BiocSingular           1.12.0   2022-04-26 [1] Bioconductor
#>  BiocStyle            * 2.24.0   2022-04-26 [1] Bioconductor
#>  bitops                 1.0-7    2021-04-24 [1] RSPM
#>  bookdown               0.29     2022-09-12 [1] RSPM
#>  bslib                  0.4.0    2022-07-16 [1] RSPM
#>  cachem                 1.0.6    2021-08-19 [1] RSPM
#>  cli                    3.4.1    2022-09-23 [1] RSPM
#>  codetools              0.2-18   2020-11-04 [2] CRAN (R 4.2.1)
#>  colorspace             2.0-3    2022-02-21 [1] RSPM
#>  DelayedArray           0.22.0   2022-04-26 [1] Bioconductor
#>  DelayedMatrixStats     1.18.1   2022-09-27 [1] Bioconductor
#>  desc                   1.4.2    2022-09-08 [1] RSPM
#>  digest                 0.6.29   2021-12-01 [1] RSPM
#>  evaluate               0.16     2022-08-09 [1] RSPM
#>  fansi                  1.0.3    2022-03-24 [1] RSPM
#>  farver                 2.1.1    2022-07-06 [1] RSPM
#>  fastmap                1.1.0    2021-01-25 [1] RSPM
#>  fs                     1.5.2    2021-12-08 [1] RSPM
#>  GenomeInfoDb         * 1.32.4   2022-09-06 [1] Bioconductor
#>  GenomeInfoDbData       1.2.8    2022-10-11 [1] Bioconductor
#>  GenomicRanges        * 1.48.0   2022-04-26 [1] Bioconductor
#>  ggbeeswarm             0.6.0    2017-08-07 [1] RSPM
#>  ggplot2              * 3.3.6    2022-05-03 [1] RSPM
#>  ggrepel                0.9.1    2021-01-15 [1] RSPM
#>  glue                   1.6.2    2022-02-24 [1] RSPM
#>  gridExtra              2.3      2017-09-09 [1] RSPM
#>  gtable                 0.3.1    2022-09-01 [1] RSPM
#>  highr                  0.9      2021-04-16 [1] RSPM
#>  htmltools              0.5.3    2022-07-18 [1] RSPM
#>  IRanges              * 2.30.1   2022-08-18 [1] Bioconductor
#>  irlba                  2.3.5.1  2022-10-03 [1] RSPM
#>  jquerylib              0.1.4    2021-04-26 [1] RSPM
#>  jsonlite               1.8.2    2022-10-02 [1] RSPM
#>  knitr                  1.40     2022-08-24 [1] RSPM
#>  labeling               0.4.2    2020-10-20 [1] RSPM
#>  lattice                0.20-45  2021-09-22 [2] CRAN (R 4.2.1)
#>  lifecycle              1.0.2    2022-09-09 [1] RSPM
#>  magrittr               2.0.3    2022-03-30 [1] RSPM
#>  Matrix                 1.5-1    2022-09-13 [1] RSPM
#>  MatrixGenerics       * 1.8.1    2022-06-26 [1] Bioconductor
#>  matrixStats          * 0.62.0   2022-04-19 [1] RSPM
#>  memoise                2.0.1    2021-11-26 [1] RSPM
#>  munsell                0.5.0    2018-06-12 [1] RSPM
#>  pillar                 1.8.1    2022-08-19 [1] RSPM
#>  pkgconfig              2.0.3    2019-09-22 [1] RSPM
#>  pkgdown                2.0.6    2022-07-16 [1] any (@2.0.6)
#>  purrr                  0.3.5    2022-10-06 [1] RSPM
#>  R6                     2.5.1    2021-08-19 [1] RSPM
#>  ragg                   1.2.3    2022-09-29 [1] RSPM
#>  Rcpp                   1.0.9    2022-07-08 [1] RSPM
#>  RCurl                  1.98-1.9 2022-10-03 [1] RSPM
#>  rlang                  1.0.6    2022-09-24 [1] RSPM
#>  rmarkdown              2.17     2022-10-07 [1] CRAN (R 4.2.1)
#>  rprojroot              2.0.3    2022-04-02 [1] RSPM
#>  rsvd                   1.0.5    2021-04-16 [1] RSPM
#>  S4Vectors            * 0.34.0   2022-04-26 [1] Bioconductor
#>  sass                   0.4.2    2022-07-16 [1] RSPM
#>  ScaledMatrix           1.4.1    2022-09-11 [1] Bioconductor
#>  scales                 1.2.1    2022-08-20 [1] RSPM
#>  scater               * 1.24.0   2022-04-26 [1] Bioconductor
#>  scuttle              * 1.6.3    2022-08-23 [1] Bioconductor
#>  sessioninfo            1.2.2    2021-12-06 [1] any (@1.2.2)
#>  SingleCellExperiment * 1.18.1   2022-10-02 [1] Bioconductor
#>  sparseMatrixStats      1.8.0    2022-04-26 [1] Bioconductor
#>  stringi                1.7.8    2022-07-11 [1] RSPM
#>  stringr                1.4.1    2022-08-20 [1] RSPM
#>  SummarizedExperiment * 1.26.1   2022-04-29 [1] Bioconductor
#>  swapper              * 0.99.2   2022-10-11 [1] local
#>  systemfonts            1.0.4    2022-02-11 [1] RSPM
#>  textshaping            0.3.6    2021-10-13 [1] RSPM
#>  tibble                 3.1.8    2022-07-22 [1] RSPM
#>  utf8                   1.2.2    2021-07-24 [1] RSPM
#>  vctrs                  0.4.2    2022-09-29 [1] RSPM
#>  vipor                  0.4.5    2017-03-22 [1] RSPM
#>  viridis                0.6.2    2021-10-13 [1] RSPM
#>  viridisLite            0.4.1    2022-08-22 [1] RSPM
#>  withr                  2.5.0    2022-03-03 [1] RSPM
#>  xfun                   0.33     2022-09-12 [1] RSPM
#>  XVector                0.36.0   2022-04-26 [1] Bioconductor
#>  yaml                   2.3.5    2022-02-21 [1] RSPM
#>  zlibbioc               1.42.0   2022-04-26 [1] Bioconductor
#> 
#>  [1] /home/runner/work/_temp/Library
#>  [2] /opt/R/4.2.1/lib/R/library
#> 
#> ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────