clustering

Contains classes for clustering molecules: agglomerative (hierarchical) clustering, k-means, HDBSCAN, and Butina clustering.

BaseClustering

 BaseClustering ()

Base class for clustering a collection of molecules. Use the child classes KMeansClustering, HDBSCANClustering, or ButinaClustering to cluster molecules.


HierarchicalClustering

 HierarchicalClustering (dataset:Union[
     numpy.__array_like._SupportsArray[numpy.dtype],
     numpy.__nested_sequence._NestedSequence[
         numpy.__array_like._SupportsArray[numpy.dtype]],
     bool,int,float,complex,str,bytes,
     numpy.__nested_sequence._NestedSequence[
         Union[bool,int,float,complex,str,bytes]]])

Performs agglomerative hierarchical clustering on a dataset of molecules

Attributes:

dataset : numpy.array
    An array of features with shape (n,p), where n is the number of molecules and p is the number of descriptors.

Methods:

cluster(n_clusters:int)
    Performs agglomerative hierarchical clustering on `self.dataset`.


HierarchicalClustering.cluster

 HierarchicalClustering.cluster (n_clusters:int=2,
                                 affinity:str='euclidean', memory=None,
                                 connectivity=None,
                                 compute_full_tree='auto', linkage='ward',
                                 distance_threshold=None,
                                 compute_distances=False)

Clusters molecules using the agglomerative hierarchical methods available in scikit-learn.

Arguments:

n_clusters : int or None, default=2
    The number of clusters to find. It must be ``None`` if
    ``distance_threshold`` is not ``None``.

affinity : str or callable, default='euclidean'
    Metric used to compute the linkage. Can be "euclidean", "l1", "l2",
    "manhattan", "cosine", or "precomputed".
    If linkage is "ward", only "euclidean" is accepted.
    If "precomputed", a distance matrix (instead of a similarity matrix)
    is needed as input for the fit method.

memory : str or object with the joblib.Memory interface, default=None
    Used to cache the output of the computation of the tree.
    By default, no caching is done. If a string is given, it is the
    path to the caching directory.

connectivity : array-like or callable, default=None
    Connectivity matrix. Defines for each sample the neighboring
    samples following a given structure of the data.
    This can be a connectivity matrix itself or a callable that transforms
    the data into a connectivity matrix, such as derived from
    `kneighbors_graph`. Default is ``None``, i.e., the
    hierarchical clustering algorithm is unstructured.

compute_full_tree : 'auto' or bool, default='auto'
    If ``False``, stop the construction of the tree early at
    ``n_clusters``. This is useful to decrease computation time if the
    number of clusters is not small compared to the number of samples.
    This option is useful only when specifying a connectivity matrix.
    Note also that when varying the number of clusters and using caching,
    it may be advantageous to compute the full tree. It must be ``True``
    if ``distance_threshold`` is not ``None``. By default
    `compute_full_tree` is "auto", which is equivalent to `True` when
    `distance_threshold` is not `None` or when `n_clusters` is less than
    the maximum of 100 and `0.02 * n_samples`.
    Otherwise, "auto" is equivalent to `False`.

linkage : {'ward', 'complete', 'average', 'single'}, default='ward'
    Which linkage criterion to use. The linkage criterion determines which
    distance to use between sets of observations. The algorithm merges
    the pair of clusters that minimizes this criterion.
    - 'ward' minimizes the variance of the clusters being merged.
    - 'average' uses the average of the distances between each
      observation of the two sets.
    - 'complete' (maximum) linkage uses the maximum distance between
      all observations of the two sets.
    - 'single' uses the minimum of the distances between all observations
      of the two sets.

distance_threshold : float, default=None
    The linkage distance threshold above which clusters will not be
    merged. If not ``None``, ``n_clusters`` must be ``None`` and
    ``compute_full_tree`` must be ``True``.

compute_distances : bool, default=False
    Computes distances between clusters even if `distance_threshold` is not
    used. This can be used to make dendrogram visualization, but introduces
    a computational and memory overhead.
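
The linkage criteria listed above can be sketched in plain Python. This is only an illustration of the definitions, not the scikit-learn implementation; the function name `linkage_distance` and the toy coordinates are assumptions made for the example. For 'ward', the sketch uses the standard merge cost, i.e. the increase in within-cluster sum of squares.

```python
import math
from itertools import product


def linkage_distance(a, b, linkage):
    """Distance between two sets of points under a given linkage criterion.

    `a` and `b` are lists of coordinate tuples (hypothetical toy input).
    """
    pair_dists = [math.dist(p, q) for p, q in product(a, b)]
    if linkage == "single":
        return min(pair_dists)
    if linkage == "complete":
        return max(pair_dists)
    if linkage == "average":
        return sum(pair_dists) / len(pair_dists)
    if linkage == "ward":
        # Increase in within-cluster sum of squares caused by merging a and b.
        centroid_a = [sum(axis) / len(a) for axis in zip(*a)]
        centroid_b = [sum(axis) / len(b) for axis in zip(*b)]
        weight = len(a) * len(b) / (len(a) + len(b))
        return weight * math.dist(centroid_a, centroid_b) ** 2
    raise ValueError(f"unknown linkage: {linkage!r}")


set_a = [(0.0, 0.0), (0.0, 1.0)]
set_b = [(3.0, 0.0), (3.0, 1.0)]
for name in ("single", "complete", "average", "ward"):
    print(name, linkage_distance(set_a, set_b, name))
```

Single linkage never exceeds average linkage, which never exceeds complete linkage, so the choice controls how readily elongated groups of molecules are merged.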

Returns:

labels : np.array
    Clustering labels
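
The constraints that tie `n_clusters`, `distance_threshold`, and `compute_full_tree` together are easy to get wrong, so here is a minimal sketch of the rules as described in the argument list above. The helper name `resolve_compute_full_tree` is hypothetical and is not part of this module or of scikit-learn; it only mirrors the documented behaviour.

```python
def resolve_compute_full_tree(n_clusters, n_samples,
                              distance_threshold=None,
                              compute_full_tree="auto"):
    """Return the effective boolean value of `compute_full_tree`.

    Hypothetical helper that follows the documented rules only.
    """
    # Exactly one of n_clusters and distance_threshold may be set.
    if (n_clusters is None) == (distance_threshold is None):
        raise ValueError(
            "Set exactly one of n_clusters and distance_threshold; "
            "the other must be None."
        )
    if compute_full_tree == "auto":
        if distance_threshold is not None:
            return True
        # Full tree when n_clusters is below max(100, 2% of the samples).
        return n_clusters < max(100, 0.02 * n_samples)
    if distance_threshold is not None and not compute_full_tree:
        raise ValueError(
            "compute_full_tree must be True when distance_threshold is set."
        )
    return bool(compute_full_tree)


print(resolve_compute_full_tree(2, n_samples=1000))
print(resolve_compute_full_tree(150, n_samples=1000))
print(resolve_compute_full_tree(None, n_samples=1000, distance_threshold=1.5))
```

In practice this means that a call using `distance_threshold` should pass `n_clusters=None` explicitly, because the default value of `n_clusters` is 2.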


HierarchicalClustering.plot_dendrogram

 HierarchicalClustering.plot_dendrogram (figsize:tuple=(12, 9), **kwargs)

Plots the dendrogram generated from the hierarchical clustering.

Arguments:

figsize : tuple (default=(12,9))
    Figure size for the plot.


KMeansClustering

 KMeansClustering (dataset:Union[
     numpy.__array_like._SupportsArray[numpy.dtype],
     numpy.__nested_sequence._NestedSequence[
         numpy.__array_like._SupportsArray[numpy.dtype]],
     bool,int,float,complex,str,bytes,
     numpy.__nested_sequence._NestedSequence[
         Union[bool,int,float,complex,str,bytes]]])

Performs k-means clustering on a dataset of molecules

Attributes:

dataset : numpy.array
    An array of features with shape (n,p), where n is the number of molecules and p is the number of descriptors.

Methods:

cluster(n_clusters:int)
    Performs k-means clustering on `self.dataset`.

elbow_method(n_clusters:List, figsize:Tuple)
    Uses the elbow method to find the optimal number of clusters.


KMeansClustering.cluster

 KMeansClustering.cluster (n_clusters:int=10, **kwargs)

Run k-means on the dataset

Arguments:

n_clusters : int (default=10)
    Number of clusters

Keyword arguments:

max_iter : int (default=5)

n_init : int (default=5)

init : str (default='k-means++')

random_state : int (default=None)

Returns:

labels : np.array
    Clustering labels
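
The idea behind `elbow_method` can be sketched without any dependencies: run k-means for several values of `n_clusters`, record the inertia (sum of squared distances to the assigned centre), and look for the point where adding clusters stops paying off. The code below is an illustrative sketch only, not this class's implementation. The names `kmeans` and `elbow_inertias`, the toy data, and the deterministic farthest-point initialisation (used here instead of 'k-means++' for reproducibility) are all assumptions.

```python
import math


def kmeans(points, k, max_iter=50):
    """Plain Lloyd's algorithm on a list of tuples; returns (labels, inertia)."""
    # Deterministic farthest-point initialisation (assumption; not k-means++).
    centers = [points[0]]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centers))
        )

    def assign(current):
        # Index of the nearest centre for every point.
        return [
            min(range(k), key=lambda j: math.dist(p, current[j])) for p in points
        ]

    for _ in range(max_iter):
        labels = assign(centers)
        updated = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            # Keep the old centre if a cluster ends up empty.
            updated.append(
                tuple(sum(axis) / len(members) for axis in zip(*members))
                if members else centers[j]
            )
        if updated == centers:
            break
        centers = updated

    labels = assign(centers)
    inertia = sum(math.dist(p, centers[lab]) ** 2 for p, lab in zip(points, labels))
    return labels, inertia


def elbow_inertias(points, n_clusters):
    """Map each candidate number of clusters to its k-means inertia."""
    return {k: kmeans(points, k)[1] for k in n_clusters}


# Toy descriptors: three well-separated groups of three "molecules" each.
toy = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
       (10.0, 10.0), (10.0, 11.0), (11.0, 10.0),
       (20.0, 0.0), (20.0, 1.0), (21.0, 0.0)]
for k, inertia in elbow_inertias(toy, [1, 2, 3, 4]).items():
    print(k, round(inertia, 3))
```

On data like this the inertia drops sharply up to three clusters and only marginally afterwards; that bend is the "elbow" the plot is meant to reveal.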


HDBSCANClustering

 HDBSCANClustering (dataset:Union[
     numpy.__array_like._SupportsArray[numpy.dtype],
     numpy.__nested_sequence._NestedSequence[
         numpy.__array_like._SupportsArray[numpy.dtype]],
     bool,int,float,complex,str,bytes,
     numpy.__nested_sequence._NestedSequence[
         Union[bool,int,float,complex,str,bytes]]])

Performs HDBSCAN clustering on a dataset of molecules

Attributes:

dataset : numpy.array
    An array of features with shape (n,p), where n is the number of molecules and p is the number of descriptors.

Methods:

cluster(min_cluster_size:int, min_samples:int, metric:str)
    Performs HDBSCAN clustering on `self.dataset`.

validate_clustering(X, labels)
    Compute the density based cluster validity index for the clustering specified by labels and for each cluster in labels.


HDBSCANClustering.cluster

 HDBSCANClustering.cluster (min_cluster_size:int=5, min_samples:int=None,
                            metric:str='jaccard', **kwargs)

Run HDBSCAN clustering on the dataset

Arguments:

min_cluster_size : int, optional (default=5)
    The minimum size of clusters; single linkage splits that contain fewer points than this will be considered points "falling out" of a cluster rather than a cluster splitting into two new clusters.

min_samples : int, optional (default=None)
    The number of samples in a neighbourhood for a point to be considered a core point.

metric : string or callable, optional (default='jaccard')
    The metric to use when calculating distance between instances in a feature array. If metric is a string or callable, it must be one of the options allowed by metrics.pairwise.pairwise_distances for its metric parameter. If metric is "precomputed", X is assumed to be a distance matrix and must be square.

Keyword arguments:

See HDBSCAN documentation (https://hdbscan.readthedocs.io/en/latest/index.html)
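
Since the signature defaults to `metric='jaccard'`, it may help to see what that distance means for binary fingerprints, and what a square matrix suitable for `metric='precomputed'` looks like. The following is a dependency-free sketch; the function names and the toy bit vectors are assumptions for illustration and are not part of this module.

```python
def jaccard_distance(u, v):
    """Jaccard distance between two equal-length binary vectors.

    Defined as 1 - |bits set in both| / |bits set in either|.
    Two all-zero vectors are given distance 0.0 by convention.
    """
    both = sum(1 for a, b in zip(u, v) if a and b)
    either = sum(1 for a, b in zip(u, v) if a or b)
    return 1.0 - both / either if either else 0.0


def pairwise_jaccard(fingerprints):
    """Square, symmetric distance matrix as nested lists."""
    return [[jaccard_distance(u, v) for v in fingerprints] for u in fingerprints]


# Toy 4-bit fingerprints (hypothetical data).
fps = [[1, 1, 0, 0], [1, 0, 1, 0], [0, 0, 1, 1]]
for row in pairwise_jaccard(fps):
    print([round(d, 3) for d in row])
```

For binary fingerprints the Jaccard distance equals one minus the Tanimoto similarity, which is why it is a natural default for molecular data.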

Returns:

labels : np.array
    Clustering labels
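
The HDBSCAN library marks points that belong to no cluster with the label -1. Assuming this method returns those labels unchanged (an assumption, since the page does not say so), a small helper like the hypothetical `group_by_label` below can separate noise from real clusters before further analysis.

```python
from collections import defaultdict


def group_by_label(labels):
    """Split sample indices into clusters and noise.

    Returns (clusters, noise) where `clusters` maps a label to the list of
    sample indices carrying it and `noise` lists indices labelled -1.
    """
    clusters = defaultdict(list)
    noise = []
    for index, label in enumerate(labels):
        if label == -1:
            noise.append(index)
        else:
            clusters[int(label)].append(index)
    return dict(clusters), noise


# Hypothetical label array for six molecules.
clusters, noise = group_by_label([0, 0, -1, 1, 1, -1])
print(clusters)
print(noise)
```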


ButinaClustering

 ButinaClustering (dataset:List, fp_type='rdkit')

Performs Butina clustering

See the original implementation at: https://github.com/PatWalters/clusterama

Attributes:

dataset : list
    A list of SMILES.

Methods:

cluster(sim_cutoff:float, nbits:int, radius:int)
    Performs Butina clustering on `self.dataset`.

get_fps(mol_list:list, nbits:int, radius:int)
    Generate descriptors for `self.dataset`.

cluster_mols(mol_list, sim_cutoff:float, nbits:int, radius:int)
    Cluster molecules.


ButinaClustering.cluster

 ButinaClustering.cluster (sim_cutoff:float, nbits:int=2048, radius:int=2)

Run Butina clustering on the dataset

Arguments:

sim_cutoff : float
    The minimum Tanimoto similarity to consider for putting compounds in the same cluster

nbits : int, optional (default=2048)
    Number of bits of the fingerprints if `fp_type` is 'morgan2'

radius : int, optional (default=2)
    Radius of the fingerprints if `fp_type` is 'morgan2'

Returns:

labels : np.array
    Clustering labels
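
To show how `sim_cutoff` shapes the result, here is a dependency-free sketch of a Butina-style procedure. It is not the RDKit implementation and not this class's code: fingerprints are modelled as Python sets of "on" bit indices, the names `tanimoto` and `butina_labels` are hypothetical, and this variant recounts neighbours among the still-unassigned molecules at every step.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two sets of 'on' bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def butina_labels(fingerprints, sim_cutoff):
    """Assign a cluster label to every fingerprint.

    Repeatedly picks the unassigned molecule with the most unassigned
    neighbours (similarity >= sim_cutoff) as a centroid and groups it
    with those neighbours. Ties go to the lowest index.
    """
    n = len(fingerprints)
    neighbours = [
        {j for j in range(n)
         if j != i and tanimoto(fingerprints[i], fingerprints[j]) >= sim_cutoff}
        for i in range(n)
    ]
    labels = [-1] * n
    unassigned = set(range(n))
    next_label = 0
    while unassigned:
        centroid = max(sorted(unassigned),
                       key=lambda i: len(neighbours[i] & unassigned))
        members = {centroid} | (neighbours[centroid] & unassigned)
        for member in members:
            labels[member] = next_label
        unassigned -= members
        next_label += 1
    return labels


# Toy fingerprints (hypothetical bit indices, not real molecules).
fps = [{1, 2, 3, 4}, {1, 2, 3, 5}, {1, 2, 3, 4, 5},
       {10, 11, 12}, {10, 11, 13}, {20}]
print(butina_labels(fps, sim_cutoff=0.7))
print(butina_labels(fps, sim_cutoff=0.5))
```

Lowering `sim_cutoff` merges more molecules into each cluster, while a high cutoff leaves more singletons. Every molecule receives a label, so singletons appear as clusters of size one rather than as noise.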