singlet.dataset.cluster

class singlet.dataset.cluster.Cluster(dataset)

Bases: object

Cluster samples, features, and phenotypes.

dbscan(axis, phenotypes=(), **kwargs)

Density-Based Spatial Clustering of Applications with Noise.

Parameters:

axis (string) – Must be ‘samples’ or ‘features’. Clustering runs on the Dataset.counts matrix and groups either its samples or its features.
phenotypes (iterable of strings) – Phenotypes to add to the features for joint clustering.
log_features (bool) – Whether to add pseudocounts and take a log of the feature counts before calculating distances.
**kwargs – arguments passed to sklearn.cluster.DBSCAN.

Returns:

pd.Series with the labels of the clusters.
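
As a rough illustration of what this method wraps, the sketch below calls sklearn.cluster.DBSCAN directly on a small made-up counts table and returns the labels as a pd.Series. The toy data, the log10(count + 1) transform, and the eps and min_samples values are assumptions chosen for the example, not taken from singlet itself.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN

# Hypothetical counts table: rows are features, columns are samples.
counts = pd.DataFrame(
    [[0, 1, 100, 110, 50],
     [1, 0, 120, 100, 0],
     [90, 100, 0, 2, 50]],
    index=["geneA", "geneB", "geneC"],
    columns=["s1", "s2", "s3", "s4", "s5"],
)

# Clustering samples: one row per sample. The pseudocount of 1 and the
# base-10 log are assumptions made for this sketch.
X = np.log10(counts.T.to_numpy(dtype=float) + 1.0)

# eps and min_samples are ordinary sklearn.cluster.DBSCAN arguments, the
# kind of options that **kwargs would carry.
model = DBSCAN(eps=1.0, min_samples=2)
labels = pd.Series(model.fit_predict(X), index=counts.columns, name="cluster")

# s5 has no neighbour within eps, so DBSCAN marks it as noise (label -1).
print(labels)
```

Samples that DBSCAN cannot assign to any dense region receive the label -1, so downstream code should treat that value as "no cluster" instead of a regular cluster id.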

hierarchical(axis, phenotypes=(), metric='correlation', method='average', log_features=True, optimal_ordering=False)

Hierarchical clustering.

Parameters:

axis (string) – Must be ‘samples’ or ‘features’. Clustering runs on the Dataset.counts matrix and groups either its samples or its features.
phenotypes (iterable of strings) – Phenotypes to add to the features for joint clustering.
metric (string) – Metric to calculate the distance matrix. Should be a string accepted by scipy.spatial.distance.pdist.
method (string) – Clustering method. Must be a string accepted by scipy.cluster.hierarchy.linkage.
log_features (bool) – Whether to add pseudocounts and take a log of the feature counts before calculating distances.
optimal_ordering (bool) – Whether to re-sort the linkage so that the distance between successive leaves is minimal. This may take longer than the clustering itself.

Returns:

dict with the linkage, distance matrix, and ordering.
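
The parameters above map onto scipy.spatial.distance.pdist and scipy.cluster.hierarchy.linkage. The following is a minimal sketch of that pipeline on made-up data, not singlet's actual implementation. The dictionary keys, the log10(count + 1) transform, and the toy table are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import leaves_list, linkage
from scipy.spatial.distance import pdist, squareform

# Hypothetical counts table: rows are features, columns are samples.
counts = pd.DataFrame(
    [[0, 1, 100, 110, 50],
     [1, 0, 120, 100, 0],
     [90, 100, 0, 2, 50]],
    index=["geneA", "geneB", "geneC"],
    columns=["s1", "s2", "s3", "s4", "s5"],
)

# Clustering samples: one row per sample, log-transformed with a
# pseudocount (both choices are assumptions of this sketch).
X = np.log10(counts.T.to_numpy(dtype=float) + 1.0)

# Condensed pairwise distances, using the same defaults as the signature.
condensed = pdist(X, metric="correlation")

# Agglomerative clustering; optimal_ordering=True would additionally
# reorder the leaves at extra computational cost.
Z = linkage(condensed, method="average", optimal_ordering=False)

# Leaf order of the dendrogram, as positions into the sample axis.
order = leaves_list(Z)

# Hypothetical result layout; the real key names may differ.
result = {
    "linkage": Z,
    "distance": squareform(condensed),
    "leaves": list(counts.columns[order]),
}
print(result["leaves"])
```

The linkage matrix has one row per merge, so n samples give n - 1 rows, and the leaf order is a permutation of the original sample positions that can be used to reorder a heatmap.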

kmeans(n_clusters, axis, phenotypes=(), random_state=0)

K-Means clustering.

Parameters:

n_clusters (int) – The number of clusters to form.
axis (string) – Must be ‘samples’ or ‘features’. Clustering runs on the Dataset.counts matrix and groups either its samples or its features.
phenotypes (iterable of strings) – Phenotypes to add to the features for joint clustering.
log_features (bool) – Whether to add pseudocounts and take a log of the feature counts before calculating distances.
random_state (int) – Set to the same int for deterministic results.

Returns:

pd.Series with the labels of the clusters.
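
To show the effect of random_state, the sketch below runs sklearn.cluster.KMeans twice with the same seed on a made-up table and checks that the labels agree. The toy data, the log10(count + 1) transform, and the explicit n_init value are assumptions for this example and do not describe singlet's internals.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical counts table: rows are features, columns are samples.
counts = pd.DataFrame(
    [[0, 1, 100, 110],
     [1, 0, 120, 100],
     [90, 100, 0, 2]],
    index=["geneA", "geneB", "geneC"],
    columns=["s1", "s2", "s3", "s4"],
)

# Clustering samples: one row per sample, log-transformed with a
# pseudocount (an assumption of this sketch).
X = np.log10(counts.T.to_numpy(dtype=float) + 1.0)


def run_kmeans(n_clusters, random_state):
    """Fit K-Means and return the cluster label of each sample."""
    # n_init is set explicitly so the behaviour does not depend on the
    # default of the installed scikit-learn version.
    model = KMeans(n_clusters=n_clusters, random_state=random_state, n_init=10)
    return pd.Series(model.fit_predict(X), index=counts.columns, name="cluster")


labels = run_kmeans(n_clusters=2, random_state=0)
labels_again = run_kmeans(n_clusters=2, random_state=0)

# Same seed, same data: the two runs assign identical labels.
print(labels.equals(labels_again))
```

The numeric cluster ids themselves are arbitrary; only the grouping is meaningful, so compare partitions instead of raw label values when using different seeds.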
