singlet.dataset.cluster

class singlet.dataset.cluster.Cluster(dataset)

Bases: object

Cluster samples, features, and phenotypes.

dbscan(axis, phenotypes=(), **kwargs)

Density-Based Spatial Clustering of Applications with Noise.

Parameters:
  • axis (string) – Either ‘samples’ or ‘features’: the axis of the Dataset.counts matrix to cluster.
  • phenotypes (iterable of strings) – Phenotypes to add to the features for joint clustering.
  • log_features (bool) – Whether to add pseudocounts and take a log of the feature counts before calculating distances. This parameter does not appear in the signature shown above.
  • **kwargs – Additional keyword arguments passed to sklearn.cluster.DBSCAN (for example eps or min_samples).
Returns:
  pd.Series with the labels of the clusters.
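
As a rough sketch of what a dbscan call on the ‘samples’ axis amounts to, the example below runs sklearn.cluster.DBSCAN directly on a small made-up counts table and wraps the labels in a pd.Series. It does not use singlet itself. The toy data, the features-as-rows layout, the log2(x + 1) transform, and the eps and min_samples values are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN

# Made-up counts table; assumed layout: features as rows, samples as columns.
counts = pd.DataFrame(
    [[2, 3, 2, 100, 120, 90],
     [200, 180, 220, 3, 2, 3],
     [5, 6, 4, 5, 6, 5]],
    index=["gene_a", "gene_b", "gene_c"],
    columns=["s1", "s2", "s3", "s4", "s5", "s6"],
)

# Clustering samples: one row per sample, one column per feature.
# The log2(x + 1) transform is an assumption, not necessarily singlet's choice.
data = np.log2(counts.T + 1)

# eps and min_samples are examples of keyword arguments forwarded via **kwargs.
model = DBSCAN(eps=1.5, min_samples=2)
labels = pd.Series(model.fit_predict(data.values), index=data.index, name="cluster")
print(labels)
```

In scikit-learn, points that DBSCAN treats as noise receive the label -1, so a result containing -1 does not denote an extra cluster.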

hierarchical(axis, phenotypes=(), metric='correlation', method='average', log_features=True, optimal_ordering=False)

Hierarchical clustering.

Parameters:
  • axis (string) – Either ‘samples’ or ‘features’: the axis of the Dataset.counts matrix to cluster.
  • phenotypes (iterable of strings) – Phenotypes to add to the features for joint clustering.
  • metric (string) – Metric to calculate the distance matrix. Should be a string accepted by scipy.spatial.distance.pdist.
  • method (string) – Clustering method. Must be a string accepted by scipy.cluster.hierarchy.linkage.
  • log_features (bool) – Whether to add pseudocounts and take a log of the feature counts before calculating distances.
  • optimal_ordering (bool) – Whether to reorder the linkage so that the distance between successive leaves is minimal. This may take longer than the clustering itself.
Returns:
  dict with the linkage, distance matrix, and ordering.
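
The default arguments above (metric='correlation', method='average') map onto scipy.spatial.distance.pdist and scipy.cluster.hierarchy.linkage. The sketch below chains those two SciPy calls on a made-up counts table to show how a linkage, a distance matrix, and a leaf ordering can be obtained. It does not call singlet; the toy data, the log2(x + 1) transform, and the dictionary keys are hypothetical and may differ from what hierarchical actually returns.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import leaves_list, linkage
from scipy.spatial.distance import pdist

# Made-up counts table; assumed layout: features as rows, samples as columns.
counts = pd.DataFrame(
    [[2, 3, 2, 100, 120, 90],
     [200, 180, 220, 3, 2, 3],
     [5, 6, 4, 5, 6, 5],
     [50, 40, 60, 45, 55, 50]],
    index=["gene_a", "gene_b", "gene_c", "gene_d"],
    columns=["s1", "s2", "s3", "s4", "s5", "s6"],
)

# Clustering samples: one row per sample. The log transform is an assumption.
data = np.log2(counts.T + 1)

# Condensed pairwise distance matrix, then agglomerative linkage.
distance = pdist(data.values, metric="correlation")
Z = linkage(distance, method="average", optimal_ordering=False)

# Order of the leaves in the resulting dendrogram, as sample names.
leaves = list(data.index[leaves_list(Z)])

# Hypothetical key names, chosen for this sketch only.
result = {"linkage": Z, "distance": distance, "leaves": leaves}
print(result["leaves"])
```

Passing optimal_ordering=True to linkage changes only the leaf order, not which samples are merged.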

kmeans(n_clusters, axis, phenotypes=(), random_state=0)

K-Means clustering.

Parameters:
  • n_clusters (int) – Number of clusters to form.
  • axis (string) – Either ‘samples’ or ‘features’: the axis of the Dataset.counts matrix to cluster.
  • phenotypes (iterable of strings) – Phenotypes to add to the features for joint clustering.
  • log_features (bool) – Whether to add pseudocounts and take a log of the feature counts before calculating distances. This parameter does not appear in the signature shown above.
  • random_state (int) – Seed for the random number generator. Use the same int across runs for reproducible results.
Returns:
  pd.Series with the labels of the clusters.
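
To illustrate the role of n_clusters and random_state, the sketch below runs K-Means on a made-up counts table twice with the same seed and compares the labels. It assumes scikit-learn's sklearn.cluster.KMeans as the backend, which the text above does not state. The toy data, the log2(x + 1) transform, the n_init value, and the run_kmeans helper are likewise hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Made-up counts table; assumed layout: features as rows, samples as columns.
counts = pd.DataFrame(
    [[2, 3, 2, 100, 120, 90],
     [200, 180, 220, 3, 2, 3],
     [5, 6, 4, 5, 6, 5]],
    index=["gene_a", "gene_b", "gene_c"],
    columns=["s1", "s2", "s3", "s4", "s5", "s6"],
)

# Clustering samples: one row per sample. The log transform is an assumption.
data = np.log2(counts.T + 1)


def run_kmeans(n_clusters, random_state=0):
    """Hypothetical helper: fit K-Means and return labels indexed by sample."""
    model = KMeans(n_clusters=n_clusters, random_state=random_state, n_init=10)
    return pd.Series(model.fit_predict(data.values), index=data.index, name="cluster")


labels = run_kmeans(2, random_state=0)
labels_again = run_kmeans(2, random_state=0)

# The same seed gives the same labels on repeated runs.
print(labels.equals(labels_again))
print(labels)
```

The integer labels themselves are arbitrary, so comparisons between runs with different seeds should look at which samples share a label rather than at the label values.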