Title: | Analysis of Biological Count Data, Especially from Single-Cell RNA-Seq |
---|---|
Description: | A set of functions for applying a restricted linear algebra to the analysis of count-based data. See the accompanying preprint manuscript: "Normalizing need not be the norm: count-based math for analyzing single-cell data" Church et al (2022) <doi:10.1101/2022.06.01.494334> This tool is specifically designed to analyze count matrices from single cell RNA sequencing assays. The tools implement several count-based approaches for standard steps in single-cell RNA-seq analysis, including scoring genes and cells, comparing cells and clustering, calculating differential gene expression, and several methods for rank reduction. There are many opportunities for further optimization that may prove useful in the analysis of other data. We provide the source code freely available at <https://github.com/shchurch/countland> and encourage users and developers to fork the code for their own purposes. |
Authors: | Church Samuel H. [aut, cre]
|
Maintainer: | Church Samuel H. <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.1.2 |
Built: | 2025-02-26 04:33:01 UTC |
Source: | https://github.com/cran/countland |
Recapitulate Seurat centering scaled and transformed data
Center(C)
Center(C)
C |
countland object |
countland object with slots centered_counts
Perform spectral clustering on dot products.
Cluster(C, n_clusters, n_components = NULL)
Cluster(C, n_clusters, n_components = NULL)
C |
countland object |
n_clusters |
number of clusters, integer |
n_components |
number of components from spectral embedding to use (default NULL, will be set to n_clusters), integer |
countland object with slot cluster_labels
gold_path <- system.file("testdata", package = "countland", mustWork = TRUE) gold.data <- Seurat::Read10X(data.dir = gold_path) C <- countland(gold.data) C <- Dot(C) C <- Embed(C,n_components=5) C <- Cluster(C,n_clusters=3)
gold_path <- system.file("testdata", package = "countland", mustWork = TRUE) gold.data <- Seurat::Read10X(data.dir = gold_path) C <- countland(gold.data) C <- Dot(C) C <- Embed(C,n_components=5) C <- Cluster(C,n_clusters=3)
Internal function for calculating count index.
CountIndex(lm)
CountIndex(lm)
lm |
column vector |
count index = largest n where n cells have >= n counts
Initialize a countland object from a dgCMatrix
countland(m, remove_empty = TRUE, verbose = TRUE)
countland(m, remove_empty = TRUE, verbose = TRUE)
m |
A matrix of counts (dense or sparse) |
remove_empty |
filter out cells and genes with no observed counts (default=TRUE) |
verbose |
show stderr message statements (default=TRUE) |
countland object
gold_path <- system.file("testdata", package = "countland", mustWork = TRUE) gold.data <- Seurat::Read10X(data.dir = gold_path) C <- countland(gold.data)
gold_path <- system.file("testdata", package = "countland", mustWork = TRUE) gold.data <- Seurat::Read10X(data.dir = gold_path) C <- countland(gold.data)
An S4 class to represent a countland object
counts
A dgCMatrix with rows as cells, columns as genes.
names_genes
A character vector of column names.
names_cells
A character vector of row names.
raw_counts
The count dgCMatrix as originally loaded.
raw_names_genes
The gene name character vector as originally loaded.
raw_names_cells
The cell name character vector as originally loaded.
subsample
A dgCMatrix with row sums equal.
cell_scores
A data.frame of cell count measures.
gene_scores
A data.frame of gene expression measures.
dots
A similarity dgCMatrix of dot products.
eigenvals
An vector of eigenvalues from spectral embedding
embedding
An array of two columns (spectral embeddings).
cluster_labels
A numeric vector of cluster assignments of length n cells.
marker_full
A list of data.frames with genes ranked for each cluster.
marker_genes
A data.frame of top ten marker genes per cluster.
matrixU
A dgCMatrix of dimensions cells x features.
matrixV
A dgCMatrix of dimensions genes x features.
matrixLambda
A diagonal dgCMatrix of scaling factors.
sharedcounts
A similarity dgCMatrix of shared counts between genes.
sum_sharedcounts
A dgCMatrix with counts summed within gene clusters.
sum_sharedcounts_all
A dgCMatrix with counts summed and including all genes not present in any cluster.
norm_factor
A numeric vector of cell normalization factors.
norm_counts
A dgCMatrix of normalized counts.
log_counts
A dgCMatrix of log transformed counts.
scaled_counts
A dgCMatrix of counts scaled by gene unit variance.
centered_counts
A dgCMatrix of counts centered at zero.
verbose
A T/F object for suppressing messages
Calculate pairwise dot products of counts between all cells.
Dot(C, subsample = FALSE)
Dot(C, subsample = FALSE)
C |
countland object |
subsample |
if TRUE, use subsampled counts, otherwise use counts (default=FALSE) |
countland object with slot dots
gold_path <- system.file("testdata", package = "countland", mustWork = TRUE) gold.data <- Seurat::Read10X(data.dir = gold_path) C <- countland(gold.data) C <- Dot(C)
gold_path <- system.file("testdata", package = "countland", mustWork = TRUE) gold.data <- Seurat::Read10X(data.dir = gold_path) C <- countland(gold.data) C <- Dot(C)
Perform spectral embedding on dot products.
Embed(C, n_components = 10)
Embed(C, n_components = 10)
C |
countland object |
n_components |
number of components, integer (default=10) |
countland object with slot embedding
, eigenvals
gold_path <- system.file("testdata", package = "countland", mustWork = TRUE) gold.data <- Seurat::Read10X(data.dir = gold_path) C <- countland(gold.data) C <- Dot(C) C <- Embed(C,n_components=5)
gold_path <- system.file("testdata", package = "countland", mustWork = TRUE) gold.data <- Seurat::Read10X(data.dir = gold_path) C <- countland(gold.data) C <- Dot(C) C <- Embed(C,n_components=5)
run integer matrix approximation
IMA(X, params)
IMA(X, params)
X |
observed data matrix |
params |
parameter object |
U, V, and Lambda matrix factors
rescale if max val is above upper bound
IMA_Compute_Init_Scaled(h, l_bound, u_bound)
IMA_Compute_Init_Scaled(h, l_bound, u_bound)
h |
matrix to be rescaled |
l_bound |
lower bound |
u_bound |
upper bound |
rescaled matrix
function to initialize U, V, and Lambda
IMA_init(X, params)
IMA_init(X, params)
X |
observed data matrix |
params |
parameter object |
initialized U, V, and Lambda matrices
Parameter class for IMA
IMA_params( rank, u_bounds, l_bounds = c(0, 0), maxiter = 1e+06, stop_crit = 1e-04 )
IMA_params( rank, u_bounds, l_bounds = c(0, 0), maxiter = 1e+06, stop_crit = 1e-04 )
rank |
target number of features in final matrices |
u_bounds |
upper bounds on integers |
l_bounds |
lower bounds on integers (default = c(0,0)) |
maxiter |
maximum number of iterations (default = 1000000) |
stop_crit |
criterion of difference at which to stop (default = 0.0001) |
parameter object
Update factor matrix - see SUSTain code
IMA_Update_Factor(M, coeff, mkrp, mode, lambda_, params)
IMA_Update_Factor(M, coeff, mkrp, mode, lambda_, params)
M |
matrix to be updated (either U or V) |
coeff |
matrix used in updating algorithm |
mkrp |
matrix used in updating algorithm |
mode |
whether update U or V |
lambda_ |
scaling matrix |
params |
parameter object |
updated matrix and scaling factors
Split dgCMatrix into column vectors.
listCols(m)
listCols(m)
m |
dgCMatrix |
list of column vectors, numeric
Recapitulate Seurat log transformation
Log(C)
Log(C)
C |
countland object |
countland object with slots log_counts
Recapitulate Seurat normalization
Normalize(C)
Normalize(C)
C |
countland object |
countland object with slots norm_factor
, norm_counts
Plots eigenvalues to investigate the optimal number of clusters
PlotEigengap(C)
PlotEigengap(C)
C |
countland object |
generates plot of eigenvalues by number of components
gold_path <- system.file("testdata", package = "countland", mustWork = TRUE) gold.data <- Seurat::Read10X(data.dir = gold_path) C <- countland(gold.data) C <- Dot(C) C <- Embed(C,n_components=5) PlotEigengap(C)
gold_path <- system.file("testdata", package = "countland", mustWork = TRUE) gold.data <- Seurat::Read10X(data.dir = gold_path) C <- countland(gold.data) C <- Dot(C) C <- Embed(C,n_components=5) PlotEigengap(C)
Plot cells using spectral embedding of dot products.
PlotEmbedding(C, colors = color_palette)
PlotEmbedding(C, colors = color_palette)
C |
countland object |
colors |
color palette for ggplot2, default=palette of 11 colors |
generates plot of cells in two spectral embedding dimensions
gold_path <- system.file("testdata", package = "countland", mustWork = TRUE) gold.data <- Seurat::Read10X(data.dir = gold_path) C <- countland(gold.data) C <- Dot(C) C <- Embed(C,n_components=5) C <- Cluster(C,n_clusters=3) PlotEmbedding(C)
gold_path <- system.file("testdata", package = "countland", mustWork = TRUE) gold.data <- Seurat::Read10X(data.dir = gold_path) C <- countland(gold.data) C <- Dot(C) C <- Embed(C,n_components=5) C <- Cluster(C,n_clusters=3) PlotEmbedding(C)
Generate a strip plot for counts across selected genes
PlotGeneCounts(C, gene_indices, colors = color_palette)
PlotGeneCounts(C, gene_indices, colors = color_palette)
C |
countland object |
gene_indices |
vector of gene index values |
colors |
color palette for ggplot2, default=palette of 11 colors |
generates plot of gene count distributions
gold_path <- system.file("testdata", package = "countland", mustWork = TRUE) gold.data <- Seurat::Read10X(data.dir = gold_path) C <- countland(gold.data) PlotGeneCounts(C,gene_indices=1:10)
gold_path <- system.file("testdata", package = "countland", mustWork = TRUE) gold.data <- Seurat::Read10X(data.dir = gold_path) C <- countland(gold.data) PlotGeneCounts(C,gene_indices=1:10)
Plot cells using integer matrix approximation
PlotIMA(C, x = 1, y = 2, colors = color_palette, subsample = TRUE)
PlotIMA(C, x = 1, y = 2, colors = color_palette, subsample = TRUE)
C |
countland object |
x |
feature on x-axis, integer (default=1) |
y |
feature on y-axis, integer (default=2) |
colors |
color palette for ggplot2, default=palette of 11 colors |
subsample |
if TRUE, use subsampled counts (default), otherwise use counts |
generates plot of cells using integer matrix approximation
Plot the difference between the observed and reconstructed count matrix using integer matrix approximation and a series of total features.
PlotIMAElbow(C, max_features, u_bounds, subsample = TRUE)
PlotIMAElbow(C, max_features, u_bounds, subsample = TRUE)
C |
countland object |
max_features |
maximum number of features to assess, integer |
u_bounds |
upper bounds for U and V matrices, vector of length 2 |
subsample |
if TRUE, use subsampled counts (default), otherwise use counts |
generates elbow plot for the difference between observed and reconstructed matrices as number of features increases
Plot cell using spectral embedding and display counts in a given gene.
PlotMarker(C, gene_index, colors = color_palette)
PlotMarker(C, gene_index, colors = color_palette)
C |
countland object |
gene_index |
index value for gene to visualize |
colors |
color palette for ggplot2, default=palette of 11 colors |
generates plot of cells with spectral embedding, colored by marker gene counts
Restore count matrix to original state
PrintGeneNumber(C)
PrintGeneNumber(C)
C |
countland object |
countland object
Rank the top marker genes for each cluster from spectral clustering.
RankMarkerGenes(C, method = "prop-zero", subsample = FALSE)
RankMarkerGenes(C, method = "prop-zero", subsample = FALSE)
C |
countland object |
method |
|
subsample |
if TRUE, use subsampled counts, otherwise use counts (default=FALSE) |
countland object with slots marker_genes
and marker_full
gold_path <- system.file("testdata", package = "countland", mustWork = TRUE) gold.data <- Seurat::Read10X(data.dir = gold_path) C <- countland(gold.data) C <- Dot(C) C <- Embed(C,n_components=5) C <- Cluster(C,n_clusters=3) C <- RankMarkerGenes(C,method='prop-zero',subsample=FALSE)
gold_path <- system.file("testdata", package = "countland", mustWork = TRUE) gold.data <- Seurat::Read10X(data.dir = gold_path) C <- countland(gold.data) C <- Dot(C) C <- Embed(C,n_components=5) C <- Cluster(C,n_clusters=3) C <- RankMarkerGenes(C,method='prop-zero',subsample=FALSE)
Internal function to remove empty columns and rows
RemoveEmpty(C)
RemoveEmpty(C)
C |
countland object |
countland object, count matrix updated
Recapitulate Seurat scaling to unit variance
RescaleVariance(C)
RescaleVariance(C)
C |
countland object |
countland object with slots scaled_counts
Restore count matrix to original state
RestoreCounts(C)
RestoreCounts(C)
C |
countland object |
countland object
gold_path <- system.file("testdata", package = "countland", mustWork = TRUE) gold.data <- Seurat::Read10X(data.dir = gold_path) C <- countland(gold.data) C <- SubsetGenes(C,gene_indices=1:200) C <- SubsetCells(C,cell_indices=1:50) C <- RestoreCounts(C)
gold_path <- system.file("testdata", package = "countland", mustWork = TRUE) gold.data <- Seurat::Read10X(data.dir = gold_path) C <- countland(gold.data) C <- SubsetGenes(C,gene_indices=1:200) C <- SubsetCells(C,cell_indices=1:50) C <- RestoreCounts(C)
Perform integer matrix approximation on count matrix.
RunIMA( C, features, u_bounds, l_bounds = c(0, 0), maxiter = 1e+06, stop_crit = 1e-04, subsample = TRUE )
RunIMA( C, features, u_bounds, l_bounds = c(0, 0), maxiter = 1e+06, stop_crit = 1e-04, subsample = TRUE )
C |
countland object |
features |
target number of features, integer |
u_bounds |
upper bounds for U and V matrices, vector of length 2 |
l_bounds |
lower bounds for U and V matrices, vector of length 2 (default=c(0,0)) |
maxiter |
maximum number of iterations, integer (default=1000000) |
stop_crit |
criterion for stopping based on difference between iterations, numeric (default=0.0001) |
subsample |
if TRUE, use subsampled counts (default), otherwise use counts |
countland object with slots matrixU
, matrixV
, matrixLambda
gold_path <- system.file("testdata", package = "countland", mustWork = TRUE) gold.data <- Seurat::Read10X(data.dir = gold_path) C <- countland(gold.data) C <- RunIMA(C,features=10,u_bounds=c(10,10),subsample=FALSE)
gold_path <- system.file("testdata", package = "countland", mustWork = TRUE) gold.data <- Seurat::Read10X(data.dir = gold_path) C <- countland(gold.data) C <- RunIMA(C,features=10,u_bounds=c(10,10),subsample=FALSE)
Recapitulate scikit.manifold.spectral_embedding from python.
ScikitManifoldSpectralEmbedding(A, n_components)
ScikitManifoldSpectralEmbedding(A, n_components)
A |
similarity matrix, dgCMatrix |
n_components |
number of eigenvectors to retain, integer |
matrix of eigenvectors
Calculate several scores for counts across cells
ScoreCells(C, gene_string = NULL)
ScoreCells(C, gene_string = NULL)
C |
countland object |
gene_string |
string with regular expression expression matching gene names of interest (default=NULL) |
countland object with slot cell_scores
gold_path <- system.file("testdata", package = "countland", mustWork = TRUE) gold.data <- Seurat::Read10X(data.dir = gold_path) C <- countland(gold.data) C <- ScoreCells(C,gene_string="*149932$")
gold_path <- system.file("testdata", package = "countland", mustWork = TRUE) gold.data <- Seurat::Read10X(data.dir = gold_path) C <- countland(gold.data) C <- ScoreCells(C,gene_string="*149932$")
Calculate several scores for count-based gene expression.
ScoreGenes(C, subsample = FALSE)
ScoreGenes(C, subsample = FALSE)
C |
countland object |
subsample |
if TRUE, use subsampled counts, otherwise use counts (default=FALSE) |
countland object with slot gene_scores
gold_path <- system.file("testdata", package = "countland", mustWork = TRUE) gold.data <- Seurat::Read10X(data.dir = gold_path) C <- countland(gold.data) C <- ScoreGenes(C)
gold_path <- system.file("testdata", package = "countland", mustWork = TRUE) gold.data <- Seurat::Read10X(data.dir = gold_path) C <- countland(gold.data) C <- ScoreGenes(C)
Subsample cells to a standard number of counts by randomly sampling observations without replacement.
Subsample(C, gene_counts = NA, cell_counts = NA)
Subsample(C, gene_counts = NA, cell_counts = NA)
C |
countland object |
gene_counts |
maximum total counts for genes |
cell_counts |
sequencing depth for all cells, or "min" to use the minimum cell total |
countland object with slot subsample
gold_path <- system.file("testdata", package = "countland", mustWork = TRUE) gold.data <- Seurat::Read10X(data.dir = gold_path) C <- countland(gold.data) C <- Subsample(C,gene_counts=250,cell_counts=100)
gold_path <- system.file("testdata", package = "countland", mustWork = TRUE) gold.data <- Seurat::Read10X(data.dir = gold_path) C <- countland(gold.data) C <- Subsample(C,gene_counts=250,cell_counts=100)
Internal function for subsampling a column from a sparse matrix.
SubsampleCol(lm, li, j, n_counts)
SubsampleCol(lm, li, j, n_counts)
lm |
column vector |
li |
row positions |
j |
column index |
n_counts |
count to sample |
subsampled column as dgTMatrix components
Subsets cells using a vector of cell indices
SubsetCells(C, cell_indices, remove_empty = TRUE)
SubsetCells(C, cell_indices, remove_empty = TRUE)
C |
countland object |
cell_indices |
vector of cell index values |
remove_empty |
filter out cells and genes with no observed counts (default=TRUE) |
countland object, count matrix updated
gold_path <- system.file("testdata", package = "countland", mustWork = TRUE) gold.data <- Seurat::Read10X(data.dir = gold_path) C <- countland(gold.data) C <- SubsetCells(C,cell_indices=1:50)
gold_path <- system.file("testdata", package = "countland", mustWork = TRUE) gold.data <- Seurat::Read10X(data.dir = gold_path) C <- countland(gold.data) C <- SubsetCells(C,cell_indices=1:50)
Subsets genes using a vector of gene indices
SubsetGenes(C, gene_indices, remove_empty = TRUE)
SubsetGenes(C, gene_indices, remove_empty = TRUE)
C |
countland object |
gene_indices |
vector of gene index values |
remove_empty |
filter out cells and genes with no observed counts (default=TRUE) |
countland object, count matrix updated
gold_path <- system.file("testdata", package = "countland", mustWork = TRUE) gold.data <- Seurat::Read10X(data.dir = gold_path) C <- countland(gold.data) C <- SubsetGenes(C,gene_indices=1:200)
gold_path <- system.file("testdata", package = "countland", mustWork = TRUE) gold.data <- Seurat::Read10X(data.dir = gold_path) C <- countland(gold.data) C <- SubsetGenes(C,gene_indices=1:200)