singlet.dataset package

Module contents

class singlet.dataset.Dataset(counts_table=None, samplesheet=None, featuresheet=None, dataset=None, plugins=None)[source]

Bases: object

Collection of cells, with feature counts and metadata

average(axis, column)[source]

Average samples or features based on metadata

Parameters:
  • axis (string) – Must be ‘samples’ or ‘features’.
  • column (string) – Must be a column of the samplesheet (for axis=‘samples’) or of the featuresheet (for axis=‘features’). Samples or features with a common value in this column are averaged over.
Returns:

A Dataset with the averaged counts.

Note: if you average over samples, you get an empty samplesheet. Similarly, if you average over features, you get an empty featuresheet.
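The averaging described above can be illustrated with plain pandas. This is a minimal sketch of the idea, not the library's implementation; the toy table, the sample names, and the cell_type column are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: rows are features, columns are samples.
counts = pd.DataFrame(
    [[10.0, 20.0, 30.0, 50.0],
     [0.0, 4.0, 8.0, 8.0]],
    index=["geneA", "geneB"],
    columns=["cell1", "cell2", "cell3", "cell4"],
)
# Hypothetical samplesheet with one metadata column.
samplesheet = pd.DataFrame(
    {"cell_type": ["T", "T", "B", "B"]},
    index=counts.columns,
)

# Average the samples that share a value in the chosen metadata column.
# Transposing puts samples on the rows so they can be grouped.
averaged = counts.T.groupby(samplesheet["cell_type"]).mean().T
print(averaged)
```

The result has one column per unique value of cell_type, which is why the original per-sample metadata no longer applies to it.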

bootstrap(groupby=None)[source]

Resample with replacement, aka bootstrap dataset

Parameters:
  • groupby (str or list of str or None) – If None, bootstrap random samples disregarding sample metadata. If a string or a list of strings, bootstrap over groups of samples with consistent entries for that/those columns.
Returns:

A Dataset with the resampled samples.
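As a rough illustration of resampling with replacement, the sketch below works on sample names only. The function bootstrap_names and its arguments are hypothetical. It reads "bootstrap over groups" as drawing whole groups with replacement; that reading is an assumption and the library may behave differently.

```python
import random


def bootstrap_names(samplenames, groups=None, seed=None):
    """Return a resampled list of sample names (hypothetical helper)."""
    rng = random.Random(seed)
    if groups is None:
        # Plain bootstrap: draw as many samples as there are, with replacement.
        return rng.choices(samplenames, k=len(samplenames))

    # Grouped bootstrap (assumed reading): collect the samples of each group,
    # then draw whole groups with replacement.
    by_group = {}
    for name, group in zip(samplenames, groups):
        by_group.setdefault(group, []).append(name)
    keys = list(by_group)
    drawn = rng.choices(keys, k=len(keys))
    return [name for key in drawn for name in by_group[key]]


names = ["cell1", "cell2", "cell3", "cell4"]
plain = bootstrap_names(names, seed=1)
grouped = bootstrap_names(names, groups=["a", "a", "b", "b"], seed=1)
```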

compare(other, features='mapped', phenotypes=(), method='kolmogorov-smirnov')[source]

Statistically compare with another Dataset.

Parameters:
  • other (Dataset) – The Dataset to compare with.
  • features (list, string, or None) – Features to compare. The string ‘total’ means all features including spikeins and other, ‘mapped’ means all features excluding spikeins and other, ‘spikeins’ means only spikeins, and ‘other’ means only ‘other’ features. If empty list or None, do not compare features (useful for phenotypic comparison).
  • phenotypes (list of strings) – Phenotypes to compare.
  • method (string or function) – Statistical test to use for the comparison. If a string it must be one of ‘kolmogorov-smirnov’ or ‘mann-whitney’. If a function, it must accept two arrays as arguments (one for each dataset, running over the samples) and return a P-value for the comparison.
Returns:

A pandas.DataFrame containing the P-values of the comparisons for all features and phenotypes.
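When method is a function, it takes two arrays and returns a P-value. The following sketch shows one function of that shape, a two-sided permutation test on the difference of means. The name permutation_pvalue and its extra parameters are hypothetical, and this test is not one of the built-in methods.

```python
import random
from statistics import mean


def permutation_pvalue(x, y, n_permutations=2000, seed=0):
    """Two-sided permutation test on the difference of means."""
    x, y = list(x), list(y)
    rng = random.Random(seed)
    observed = abs(mean(x) - mean(y))
    pooled = x + y
    hits = 0
    for _ in range(n_permutations):
        # Shuffle the pooled values and split them into two pseudo-groups.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:len(x)]) - mean(pooled[len(x):]))
        if diff >= observed:
            hits += 1
    # Add-one correction so the P-value is never exactly zero.
    return (hits + 1) / (n_permutations + 1)


p_same = permutation_pvalue([1, 2, 3, 4], [1, 2, 3, 4])
p_diff = permutation_pvalue(range(8), range(100, 108))
```

Under the description above, such a function could be supplied as method=permutation_pvalue.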

copy()[source]

Copy of the Dataset

counts

Matrix of gene expression counts.

Rows are features, columns are samples.

Notice: If you reset this matrix with features that are not in the featuresheet or samples that are not in the samplesheet, those tables will be reset to empty.

featuremetadatanames

pandas.Index of feature metadata column names

featurenames

pandas.Index of feature names

featuresheet

Matrix of feature metadata.

Rows are features, columns are metadata (e.g. Gene Ontologies).

n_features

Number of features

n_samples

Number of samples

query_features_by_counts(expression, inplace=False, local_dict=None)[source]

Select features based on their expression.

Parameters:
  • expression (string) – An expression compatible with pandas.DataFrame.query.
  • inplace (bool) – Whether to change the Dataset in place or return a new one.
  • local_dict (dict) – A dictionary of local variables, useful if you are using @var assignments in your expression. By far the most common usage of this argument is to set local_dict=locals().
Returns:

If inplace is True, None. Else, a Dataset.

query_features_by_metadata(expression, inplace=False, local_dict=None)[source]

Select features based on metadata.

Parameters:
  • expression (string) – An expression compatible with pandas.DataFrame.query.
  • inplace (bool) – Whether to change the Dataset in place or return a new one.
  • local_dict (dict) – A dictionary of local variables, useful if you are using @var assignments in your expression. By far the most common usage of this argument is to set local_dict=locals().
Returns:

If inplace is True, None. Else, a Dataset.

query_features_by_name(featurenames, inplace=False, ignore_missing=False)[source]

Select features by name.

Parameters:
  • featurenames – names of the features to keep.
  • inplace (bool) – Whether to change the Dataset in place or return a new one.
  • ignore_missing (bool) – Whether to silently skip missing features.
query_samples_by_counts(expression, inplace=False, local_dict=None)[source]

Select samples based on gene expression.

Parameters:
  • expression (string) – An expression compatible with pandas.DataFrame.query.
  • inplace (bool) – Whether to change the Dataset in place or return a new one.
  • local_dict (dict) – A dictionary of local variables, useful if you are using @var assignments in your expression. By far the most common usage of this argument is to set local_dict=locals().
Returns:

If inplace is True, None. Else, a Dataset.

query_samples_by_metadata(expression, inplace=False, local_dict=None)[source]

Select samples based on metadata.

Parameters:
  • expression (string) – An expression compatible with pandas.DataFrame.query.
  • inplace (bool) – Whether to change the Dataset in place or return a new one.
  • local_dict (dict) – A dictionary of local variables, useful if you are using @var assignments in your expression. By far the most common usage of this argument is to set local_dict=locals().
Returns:

If inplace is True, None. Else, a Dataset.
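The expression and local_dict arguments follow pandas.DataFrame.query, so the mechanism can be tried on a bare table. In this sketch the samplesheet, its columns, and the variable min_quality are hypothetical; only the pandas call itself is real.

```python
import pandas as pd

# Hypothetical samplesheet: rows are samples, columns are metadata.
samplesheet = pd.DataFrame(
    {"quality": [0.9, 0.4, 0.7], "tissue": ["brain", "liver", "brain"]},
    index=["cell1", "cell2", "cell3"],
)

min_quality = 0.5

# "@min_quality" refers to a Python variable rather than a column.
# Passing local_dict tells pandas explicitly where to look it up.
selected = samplesheet.query(
    "quality >= @min_quality and tissue == 'brain'",
    local_dict={"min_quality": min_quality},
)
print(list(selected.index))
```

Passing local_dict=locals() would have the same effect here, since min_quality is defined in the calling scope.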

query_samples_by_name(samplenames, inplace=False, ignore_missing=False)[source]

Select samples by name.

Parameters:
  • samplenames – names of the samples to keep.
  • inplace (bool) – Whether to change the Dataset in place or return a new one.
  • ignore_missing (bool) – Whether to silently skip missing samples.
reindex(axis, column, drop=False, inplace=False)[source]

Reindex samples or features from a metadata column

Parameters:
  • axis (string) – Must be ‘samples’ or ‘features’.
  • column (string) – Must be a column of the samplesheet (for axis=‘samples’) or of the featuresheet (for axis=‘features’) with unique names of samples or features.
  • drop (bool) – Whether to drop the column from the metadata table.
  • inplace (bool) – Whether to change the Dataset in place or return a new one.
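Reindexing samples amounts to relabeling both the count columns and the samplesheet rows from one metadata column. The sketch below shows this with plain pandas; the helper reindex_samples and the barcode column are hypothetical and only mirror the parameters described above.

```python
import pandas as pd


def reindex_samples(counts, samplesheet, column, drop=False):
    """Relabel samples from a metadata column (hypothetical helper)."""
    new_names = samplesheet[column]
    # The new names must be unique, otherwise samples become ambiguous.
    if not new_names.is_unique:
        raise ValueError(f"Column {column!r} does not contain unique names")
    renamed_counts = counts.rename(columns=new_names.to_dict())
    # drop controls whether the column stays in the metadata table.
    renamed_sheet = samplesheet.set_index(column, drop=drop)
    return renamed_counts, renamed_sheet


counts = pd.DataFrame(
    [[1, 2], [3, 4]],
    index=["geneA", "geneB"],
    columns=["cell1", "cell2"],
)
samplesheet = pd.DataFrame({"barcode": ["AAAC", "GGTT"]}, index=counts.columns)

kept_counts, kept_sheet = reindex_samples(counts, samplesheet, "barcode")
_, dropped_sheet = reindex_samples(counts, samplesheet, "barcode", drop=True)
```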
rename(axis, column, inplace=False)[source]

Rename samples or features

Parameters:
  • axis (string) – Must be ‘samples’ or ‘features’.
  • column (string) – Must be a column of the samplesheet (for axis=‘samples’) or of the featuresheet (for axis=‘features’) with unique names of samples or features.
  • inplace (bool) – Whether to change the Dataset in place or return a new one.

DEPRECATED: use reindex instead.

samplemetadatanames

pandas.Index of sample metadata column names

samplenames

pandas.Index of sample names

samplesheet

Matrix of sample metadata.

Rows are samples, columns are metadata (e.g. phenotypes).

split(phenotypes, copy=True)[source]

Split Dataset based on one or more categorical phenotypes

Parameters:
  • phenotypes (string or list of strings) – one or more phenotypes to use for the split. Unique values of combinations of these determine the split Datasets.
Returns:

The keys are either unique values of the phenotype chosen or, if more than one, tuples of unique combinations.

Return type:

dict of Datasets

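The key structure of the returned dict can be sketched on a samplesheet alone. The helper split_names below is hypothetical and returns lists of sample names instead of Datasets; it only illustrates how single values and tuples arise as keys.

```python
import pandas as pd


def split_names(samplesheet, phenotypes):
    """Group sample names by one or more phenotypes (hypothetical helper)."""
    if isinstance(phenotypes, str):
        phenotypes = [phenotypes]
    groups = {}
    for name, row in samplesheet[phenotypes].iterrows():
        values = tuple(row)
        # One phenotype: the key is the value itself. Several: a tuple.
        key = values[0] if len(values) == 1 else values
        groups.setdefault(key, []).append(name)
    return groups


samplesheet = pd.DataFrame(
    {"tissue": ["brain", "liver", "brain"], "sex": ["F", "F", "M"]},
    index=["cell1", "cell2", "cell3"],
)

by_tissue = split_names(samplesheet, "tissue")
by_both = split_names(samplesheet, ["tissue", "sex"])
```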
to_dataset_file(filename, fmt=None, **kwargs)[source]

Store dataset into an integrated dataset file

Parameters:
  • filename (str) – path of the file to write to.
  • fmt (str or None) – file format. If None, infer from the file extension.
  • **kwargs (keyword arguments) – depend on the format.

The additional keyword arguments for the supported formats are:

  • loom: axis_samples, either ‘rows’ or ‘columns’ (default)

Submodules

singlet.dataset.correlations module

class singlet.dataset.correlations.Correlation(dataset)[source]

Bases: singlet.dataset.plugins.Plugin

Correlate gene expression and phenotype in single cells

correlate_features_features(features='all', features2=None, method='spearman')[source]

Correlate feature expression with the expression of one or more features.

Parameters:
  • features (list or string) – list of features to correlate. Use a string for a single feature. The special string ‘all’ (default) uses all features.
  • features2 (list or string) – list of features to correlate with. Use a string for a single feature. The special string ‘all’ uses all features. None (default) takes the same list as features, returning a square matrix.
  • method (string) – type of correlation. Must be one of ‘pearson’ or ‘spearman’.
Returns:

pandas.DataFrame with the correlation coefficients. If either features or features2 is a single string, the function returns a pandas.Series. If both are a string, it returns a single correlation coefficient.
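To show how the two correlation types differ, the sketch below computes both on a toy table with plain pandas. The data are hypothetical, and Spearman is computed here as Pearson on ranks, which is its definition rather than a statement about the library's internals.

```python
import pandas as pd

# Hypothetical toy data: rows are features, columns are samples.
# geneB is the square of geneA: monotonic but not linear.
counts = pd.DataFrame(
    [[1.0, 2.0, 3.0, 4.0, 5.0],
     [1.0, 4.0, 9.0, 16.0, 25.0]],
    index=["geneA", "geneB"],
    columns=["cell1", "cell2", "cell3", "cell4", "cell5"],
)

# DataFrame.corr correlates columns, so put features on the columns.
pearson = counts.T.corr(method="pearson")

# Spearman correlation is the Pearson correlation of the ranks.
spearman = counts.T.rank().corr(method="pearson")

print(pearson.loc["geneA", "geneB"])
print(spearman.loc["geneA", "geneB"])
```

Both results are square feature-by-feature matrices. The rank-based coefficient is insensitive to the nonlinearity, while Pearson falls slightly below one.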

correlate_features_phenotypes(phenotypes, features='all', method='spearman', fillna=None)[source]

Correlate feature expression with one or more phenotypes.

Parameters:
  • phenotypes (list of string) – list of phenotypes, i.e. columns of the samplesheet. Use a string for a single phenotype.
  • features (list or string) – list of features to correlate. Use a string for a single feature. The special string ‘all’ (default) uses all features.
  • method (string) – type of correlation. Must be one of ‘pearson’ or ‘spearman’.
  • fillna (dict, int, or None) – a dictionary with phenotypes as keys and numbers to fill for NaNs as values. None will do nothing.
Returns:

pandas.DataFrame with the correlation coefficients. If either phenotypes or features is a single string, the function returns a pandas.Series. If both are a string, it returns a single correlation coefficient.

correlate_phenotypes_phenotypes(phenotypes, phenotypes2=None, method='spearman', fillna=None, fillna2=None)[source]

Correlate one or more phenotypes with one or more phenotypes.

Parameters:
  • phenotypes (list of string) – list of phenotypes, i.e. columns of the samplesheet. Use a string for a single phenotype.
  • phenotypes2 (list of string) – list of phenotypes, i.e. columns of the samplesheet. Use a string for a single phenotype. None (default) uses the same as phenotypes.
  • method (string) – type of correlation. Must be one of ‘pearson’ or ‘spearman’.
  • fillna (dict, int, or None) – a dictionary with phenotypes as keys and numbers to fill for NaNs as values. None will do nothing, potentially yielding NaN as correlation coefficients.
  • fillna2 (dict, int, or None) – as fillna, but for phenotypes2.
Returns:

pandas.DataFrame with the correlation coefficients. If either phenotypes or phenotypes2 is a single string, the function returns a pandas.Series. If both are a string, it returns a single correlation coefficient.

correlate_samples(samples='all', samples2=None, phenotypes=None, method='spearman')[source]

Correlate samples with one another, based on feature expression and, optionally, phenotypes.

Parameters:
  • samples (list or string) – list of samples to correlate. Use a string for a single sample. The special string ‘all’ (default) uses all samples.
  • samples2 (list or string) – list of samples to correlate with. Use a string for a single sample. The special string ‘all’ uses all samples. None (default) takes the same list as samples, returning a square matrix.
  • method (string) – type of correlation. Must be one of ‘pearson’ or ‘spearman’.
  • phenotypes (list) – phenotypes to include as additional features in the correlation calculation. None (default) means only feature counts are used.
Returns:

pandas.DataFrame with the correlation coefficients. If either samples or samples2 is a single string, the function returns a pandas.Series. If both are a string, it returns a single correlation coefficient.

singlet.dataset.plot module

class singlet.dataset.plot.Plot(dataset)[source]

Bases: singlet.dataset.plugins.Plugin

Plot gene expression and phenotype in single cells

clustermap(cluster_samples=False, cluster_features=False, phenotypes_cluster_samples=(), phenotypes_cluster_features=(), annotate_samples=False, annotate_features=False, labels_samples=True, labels_features=True, orientation='horizontal', colorbars=False, **kwargs)[source]

Samples versus features / phenotypes.

Parameters:
  • cluster_samples (bool or linkage) – Whether to cluster samples and show the dendrogram. Can be False, True, or a linkage from scipy.cluster.hierarchy.linkage.
  • cluster_features (bool or linkage) – Whether to cluster features and show the dendrogram. Can be False, True, or a linkage from scipy.cluster.hierarchy.linkage.
  • phenotypes_cluster_samples (iterable of strings) – Phenotypes to add to the features for joint clustering of the samples. If the clustering has been precomputed including phenotypes and the linkage matrix is explicitly set as cluster_samples, the same phenotypes must be specified here, in the same order.
  • phenotypes_cluster_features (iterable of strings) – Phenotypes to add to the features for joint clustering of the features and phenotypes. If the clustering has been precomputed including phenotypes and the linkage matrix is explicitly set as cluster_features, the same phenotypes must be specified here, in the same order.
  • annotate_samples (dict, or False) – Whether and how to annotate the samples with separate colorbars. The dictionary must have phenotypes or features as keys. For qualitative phenotypes, the values can be palette names or palettes (with at least as many colors as there are categories). For quantitative phenotypes and features, they can be colormap names or colormaps.
  • annotate_features (dict, or False) – Whether and how to annotate the features with separate colorbars. The dictionary must have features metadata as keys. For qualitative annotations, the values can be palette names or palettes (with at least as many colors as there are categories). For quantitative annotations, the values can be colormap names or colormaps. Keys must be columns of the Dataset.featuresheet, except for the key ‘mean expression’ which is interpreted to mean the average of the counts for that feature.
  • labels_samples (bool) – Whether to show the sample labels. If you have hundreds or more samples, you may want to turn this off to make the plot tidier.
  • labels_features (bool) – Whether to show the feature labels. If you have hundreds or more features, you may want to turn this off to make the plot tidier.
  • orientation (string) – Whether the samples are on the abscissa (‘horizontal’) or on the ordinate (‘vertical’).
  • tight_layout (bool or dict) – Whether to call matplotlib.pyplot.tight_layout at the end of the plotting. If it is a dict, pass it unpacked to that function.
  • colorbars (bool) – Whether to add colorbars. One colorbar refers to the heatmap. Moreover, if annotations for samples or features are shown, a colorbar for each of them will be shown as well.
  • **kwargs – named arguments passed to seaborn.clustermap.
Returns:

A seaborn ClusterGrid instance.
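A precomputed linkage of the kind accepted by cluster_samples or cluster_features can be built with scipy.cluster.hierarchy.linkage. The sketch below uses random toy data; the choice of average linkage and Euclidean distance, and the orientation of the matrix (one row per sample), are assumptions made for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Hypothetical toy matrix: 6 samples (rows) by 20 features (columns).
rng = np.random.default_rng(0)
data = rng.poisson(5.0, size=(6, 20)).astype(float)

# Hierarchical clustering of the rows. Each row of the result describes
# one merge: the two cluster indices, their distance, and the new size.
sample_linkage = linkage(data, method="average", metric="euclidean")
print(sample_linkage.shape)
```

For n observations the linkage matrix has n - 1 rows, one per merge.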

plot_coverage(features='total', kind='cumulative', ax=None, tight_layout=True, legend=False, **kwargs)[source]

Plot number of reads for each sample

Parameters:
  • features (list or string) – Features to sum over. The string ‘total’ means all features including spikeins and other, ‘mapped’ means all features excluding spikeins and other, ‘spikeins’ means only spikeins, and ‘other’ means only ‘other’ features.
  • kind (string) – Kind of plot (default: cumulative distribution).
  • ax (matplotlib.axes.Axes) – The axes to plot into. If None (default), a new figure with one axes is created. ax need not strictly be a matplotlib class, but it must have common methods such as ‘plot’ and ‘set’.
  • tight_layout (bool or dict) – Whether to call matplotlib.pyplot.tight_layout at the end of the plotting. If it is a dict, pass it unpacked to that function.
  • legend (bool or dict) – If True, call ax.legend(). If a dict, pass as **kwargs to ax.legend.
  • **kwargs – named arguments passed to the plot function.
Returns:

matplotlib.axes.Axes with the axes containing the plot.

plot_distributions(features, kind='violin', ax=None, tight_layout=True, legend=False, orientation='vertical', sort=False, bottom=0, grid=None, **kwargs)[source]

Plot distribution of spike-in controls

Parameters:
  • features (list or string) – List of features to plot. If it is the string ‘spikeins’, plot all spikeins, if the string ‘other’, plot other features.
  • kind (string) – Kind of plot, one of ‘violin’ (default), ‘box’, ‘swarm’.
  • ax (matplotlib.axes.Axes) – Axes to plot into. If None (default), create a new figure and axes.
  • tight_layout (bool or dict) – Whether to call matplotlib.pyplot.tight_layout at the end of the plotting. If it is a dict, pass it unpacked to that function.
  • legend (bool or dict) – If True, call ax.legend(). If a dict, pass as **kwargs to ax.legend. Notice that legend has a special meaning in these kinds of seaborn plots.
  • orientation (string) – ‘horizontal’ or ‘vertical’.
  • sort (bool or string) – True or ‘ascending’ sorts the features by median, ‘descending’ uses the reverse order.
  • bottom (float or string) – The value of zero-count features. If you are using a log axis, you may want to set this to 0.1 or any other small positive number. If a string, it must be ‘pseudocount’, in which case CountsTable.pseudocount is used.
  • grid (bool or None) – Whether to add a grid to the plot. None defaults to your existing settings.
  • **kwargs – named arguments passed to the plot function.
Returns:

The axes with the plot.

Return type:

matplotlib.axes.Axes

scatter_reduced_samples(vectors_reduced, color_by=None, color_log=None, cmap='viridis', ax=None, tight_layout=True, **kwargs)[source]

Scatter samples after dimensionality reduction.

Parameters:
  • vectors_reduced (pandas.DataFrame) – matrix of coordinates of the samples after dimensionality reduction. Rows are samples, columns (typically 2 or 3) are the components in the low-dimensional embedding.
  • color_by (string or None) – color sample dots by phenotype or expression of a certain feature.
  • color_log (bool or None) – use log of phenotype/expression in the colormap. Default None only logs expression, but not phenotypes.
  • cmap (string or matplotlib colormap) – color map to use for the sample dots.
  • ax (matplotlib.axes.Axes) – The axes to plot into. If None (default), a new figure with one axes is created. ax need not strictly be a matplotlib class, but it must have common methods such as ‘plot’ and ‘set’.
  • tight_layout (bool or dict) – Whether to call matplotlib.pyplot.tight_layout at the end of the plotting. If it is a dict, pass it unpacked to that function.
  • **kwargs – named arguments passed to the plot function.
Returns:

matplotlib.axes.Axes with the axes containing the plot.

scatter_statistics(features='mapped', x='mean', y='cv', ax=None, tight_layout=True, legend=False, grid=None, **kwargs)[source]

Scatter plot statistics of features.

Parameters:
  • features (list or string) – List of features to plot. The string ‘mapped’ means everything excluding spikeins and other, ‘all’ means everything including spikeins and other.
  • x (string) – Statistics to plot on the x axis.
  • y (string) – Statistics to plot on the y axis.
  • ax (matplotlib.axes.Axes) – The axes to plot into. If None (default), a new figure with one axes is created. ax need not strictly be a matplotlib class, but it must have common methods such as ‘plot’ and ‘set’.
  • tight_layout (bool or dict) – Whether to call matplotlib.pyplot.tight_layout at the end of the plotting. If it is a dict, pass it unpacked to that function.
  • legend (bool or dict) – If True, call ax.legend(). If a dict, pass as **kwargs to ax.legend.
  • grid (bool or None) – Whether to add a grid to the plot. None defaults to your existing settings.
  • **kwargs – named arguments passed to the plot function.
Returns:

matplotlib.axes.Axes with the axes containing the plot.
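The default axes, x=‘mean’ and y=‘cv’, can be reproduced on a toy table. This sketch assumes that ‘cv’ denotes the coefficient of variation (standard deviation divided by mean) and uses the pandas default sample standard deviation (ddof=1); both are assumptions, and the data are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: rows are features, columns are samples.
counts = pd.DataFrame(
    [[2.0, 4.0, 6.0],
     [5.0, 5.0, 5.0]],
    index=["geneA", "geneB"],
    columns=["cell1", "cell2", "cell3"],
)

# Per-feature statistics, computed across samples (axis=1).
stats = pd.DataFrame({
    "mean": counts.mean(axis=1),
    "std": counts.std(axis=1),  # sample standard deviation, ddof=1
})
# Coefficient of variation: spread relative to the mean.
stats["cv"] = stats["std"] / stats["mean"]
print(stats)
```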

singlet.dataset.dimensionality module

class singlet.dataset.dimensionality.DimensionalityReduction(dataset)[source]

Bases: singlet.dataset.plugins.Plugin

Reduce dimensionality of gene expression and phenotype

pca(n_dims=2, transform='log10', robust=True, random_state=None)[source]

Principal component analysis

Parameters:
  • n_dims (int) – Number of dimensions (2+).
  • transform (string or None) – Whether to preprocess the data.
  • robust (bool) – Whether to use Principal Component Pursuit to exclude outliers.
Returns:

dict of the left eigenvectors (vs), right eigenvectors (us) of the singular value decomposition, eigenvalues (lambdas), the transform, and the whiten function (for plotting).
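For orientation, a non-robust PCA via singular value decomposition can be sketched with NumPy alone. This is not the library's implementation: the pseudocount of 0.1 before the log10 transform, the centering step, and the variable names are assumptions, and the robust variant based on Principal Component Pursuit is not shown.

```python
import numpy as np

# Hypothetical toy counts: 50 features (rows) by 8 samples (columns).
rng = np.random.default_rng(0)
counts = rng.poisson(10.0, size=(50, 8)).astype(float)

# log10 transform with a small pseudocount (assumed) to avoid log(0).
X = np.log10(counts + 0.1)
# Center each feature across samples.
X = X - X.mean(axis=1, keepdims=True)

# Thin SVD: X = U @ diag(s) @ Vt.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

n_dims = 2
loadings = U[:, :n_dims]                      # features x n_dims
coords = (s[:n_dims, None] * Vt[:n_dims]).T   # samples x n_dims
lambdas = s[:n_dims] ** 2                     # eigenvalues of X @ X.T

print(coords.shape)
```

Projecting the centered data onto the loadings gives the same sample coordinates, which is a quick consistency check on the decomposition.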

tsne(n_dims=2, perplexity=30, theta=0.5, rand_seed=0, **kwargs)[source]

t-SNE algorithm.

Parameters:
  • n_dims (int) – Number of dimensions to use.
  • perplexity (float) – Perplexity of the algorithm.
  • theta (float) – A number between 0 and 1. Higher is faster but less accurate (via the Barnes-Hut approximation).
  • rand_seed (int) – Random seed. -1 randomizes each run.
  • **kwargs – Named arguments passed to the t-SNE algorithm.

Returns:

umap(n_dims=2, rand_seed=0, **kwargs)[source]

Uniform Manifold Approximation and Projection.

Parameters:
  • n_dims (int) – Number of dimensions to use.
  • rand_seed (int) – Random seed. -1 randomizes each run.
  • **kwargs – Named arguments passed to umap.UMAP.

Returns: