clustering
BaseClustering
BaseClustering ()
Base class for clustering a collection of molecules. Use one of the child classes (KMeansClustering, HDBSCANClustering or ButinaClustering) to cluster molecules.
HierarchicalClustering
HierarchicalClustering (dataset:Union[numpy.__array_like._SupportsArray[numpy.dtype],numpy.__nested_sequence._NestedSequence[numpy.__array_like._SupportsArray[numpy.dtype]],bool,int,float,complex,str,bytes,numpy.__nested_sequence._NestedSequence[Union[bool,int,float,complex,str,bytes]]])
Performs agglomerative hierarchical clustering on a dataset of molecules
Attributes:
dataset : numpy.array An array of features with shape (n,p), where n is the number of molecules and p is the number of descriptors.
Methods:
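cluster(n_clusters:int) Performs agglomerative hierarchical clustering on `self.dataset`
plot_dendrogram(figsize:tuple) Plots the dendrogram generated from the hierarchical clustering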
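As a rough illustration of what agglomerative clustering does, the following is a minimal pure-Python sketch, not this class's implementation (the class relies on scikit-learn). The function name `agglomerative` is hypothetical, only the 'single', 'complete' and 'average' linkages are covered ('ward' is omitted), and labels are returned as a plain list rather than an `np.array`.

```python
from itertools import combinations
from math import dist


def agglomerative(points, n_clusters=2, linkage="single"):
    """Naive agglomerative clustering; returns one integer label per point."""
    # Start with every point in its own cluster (lists of point indices).
    clusters = [[i] for i in range(len(points))]

    def cluster_distance(a, b):
        # Euclidean distances between every cross-cluster pair of points.
        d = [dist(points[i], points[j]) for i in a for j in b]
        if linkage == "single":
            return min(d)
        if linkage == "complete":
            return max(d)
        if linkage == "average":
            return sum(d) / len(d)
        raise ValueError(f"unsupported linkage: {linkage}")

    # Repeatedly merge the pair of clusters that minimizes the criterion.
    while len(clusters) > n_clusters:
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: cluster_distance(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[a].extend(clusters[b])  # a < b, so index a stays valid
        del clusters[b]

    labels = [0] * len(points)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


# Toy 2-D descriptors: three points near the origin, two near (5, 5).
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.2), (5.0, 5.0), (5.1, 5.0)]
labels = agglomerative(points, n_clusters=2, linkage="average")
print(labels)
```

The sketch recomputes all pairwise distances at every merge, so it is only suitable for a handful of points.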
HierarchicalClustering.cluster
HierarchicalClustering.cluster (n_clusters:int=2, affinity:str='euclidean', memory=None, connectivity=None, compute_full_tree='auto', linkage='ward', distance_threshold=None, compute_distances=False)
Clustering molecules using different hierarchical methods available on scikit-learn.
Arguments:
n_clusters : int or None, default=2
The number of clusters to find. It must be ``None`` if
``distance_threshold`` is not ``None``.
affinity : str or callable, default='euclidean'
Metric used to compute the linkage. Can be "euclidean", "l1", "l2",
"manhattan", "cosine", or "precomputed".
If linkage is "ward", only "euclidean" is accepted.
If "precomputed", a distance matrix (instead of a similarity matrix)
is needed as input for the fit method.
memory : str or object with the joblib.Memory interface, default=None
Used to cache the output of the computation of the tree.
By default, no caching is done. If a string is given, it is the
path to the caching directory.
connectivity : array-like or callable, default=None
Connectivity matrix. Defines for each sample the neighboring
samples following a given structure of the data.
This can be a connectivity matrix itself or a callable that transforms
the data into a connectivity matrix, such as derived from
`kneighbors_graph`. Default is ``None``, i.e., the
hierarchical clustering algorithm is unstructured.
compute_full_tree : 'auto' or bool, default='auto'
Whether to build the full tree or stop its construction early at
``n_clusters``. Stopping early decreases computation time if the number
of clusters is not small compared to the number of samples. This option
is useful only when specifying a connectivity matrix. Note also that when
varying the number of clusters and using caching, it may be advantageous
to compute the full tree. It must be ``True`` if ``distance_threshold``
is not ``None``. By default `compute_full_tree` is "auto", which is
equivalent to `True` when `distance_threshold` is not `None` or when
`n_clusters` is less than the maximum of 100 and `0.02 * n_samples`.
Otherwise, "auto" is equivalent to `False`.
linkage : {'ward', 'complete', 'average', 'single'}, default='ward'
Which linkage criterion to use. The linkage criterion determines which
distance to use between sets of observations. The algorithm will merge
the pairs of clusters that minimize this criterion.
- 'ward' minimizes the variance of the clusters being merged.
- 'average' uses the average of the distances of each observation of
the two sets.
- 'complete' or 'maximum' linkage uses the maximum distances between
all observations of the two sets.
- 'single' uses the minimum of the distances between all observations
of the two sets.
distance_threshold : float, default=None
The linkage distance threshold above which clusters will not be
merged. If not ``None``, ``n_clusters`` must be ``None`` and
``compute_full_tree`` must be ``True``.
compute_distances : bool, default=False
Computes distances between clusters even if `distance_threshold` is not
used. This can be used to make dendrogram visualization, but introduces
a computational and memory overhead.
Returns:
labels : np.array
Clustering labels
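The rules relating `n_clusters`, `distance_threshold`, `compute_full_tree`, `linkage` and `affinity` in the argument list above can be sketched as plain checks. The helper names `resolve_compute_full_tree` and `check_arguments` are hypothetical; they are not part of this library or of scikit-learn and only restate the documented constraints.

```python
def resolve_compute_full_tree(compute_full_tree, n_clusters, distance_threshold, n_samples):
    """Resolve compute_full_tree='auto' following the documented rule."""
    if compute_full_tree != "auto":
        return bool(compute_full_tree)
    # A distance threshold always requires the full tree.
    if distance_threshold is not None:
        return True
    # Otherwise build the full tree only when few clusters are requested.
    return n_clusters < max(100, 0.02 * n_samples)


def check_arguments(n_clusters, distance_threshold, linkage="ward",
                    affinity="euclidean", compute_full_tree="auto"):
    """Raise ValueError for combinations the documentation rules out."""
    # Exactly one of n_clusters and distance_threshold may be set.
    if (n_clusters is None) == (distance_threshold is None):
        raise ValueError("set exactly one of n_clusters and distance_threshold")
    # A distance threshold cannot be combined with an early stop.
    if distance_threshold is not None and compute_full_tree is False:
        raise ValueError("compute_full_tree must be True with distance_threshold")
    # Ward linkage is only defined for Euclidean distances.
    if linkage == "ward" and affinity != "euclidean":
        raise ValueError("ward linkage only accepts the euclidean affinity")


# Valid: cut the tree by distance instead of by cluster count.
check_arguments(n_clusters=None, distance_threshold=0.5)

# With 1000 samples, 'auto' builds the full tree for fewer than 100 clusters.
print(resolve_compute_full_tree("auto", 2, None, 1000))
print(resolve_compute_full_tree("auto", 150, None, 1000))
```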
HierarchicalClustering.plot_dendrogram
HierarchicalClustering.plot_dendrogram (figsize:tuple=(12, 9), **kwargs)
Plots the dendrogram generated from the hierarchical clustering.
Arguments:
figsize : tuple (default=(12,9)) Figure size for the plot.
KMeansClustering
KMeansClustering (dataset:Union[numpy.__array_like._SupportsArray[numpy.dtype],numpy.__nested_sequence._NestedSequence[numpy.__array_like._SupportsArray[numpy.dtype]],bool,int,float,complex,str,bytes,numpy.__nested_sequence._NestedSequence[Union[bool,int,float,complex,str,bytes]]])
Performs k-means clustering on a dataset of molecules
Attributes:
dataset : numpy.array An array of features with shape (n,p), where n is the number of molecules and p is the number of descriptors.
Methods:
cluster(n_clusters:int) Performs k-means clustering on `self.dataset`
elbow_method(n_clusters:List, figsize:Tuple) Uses the elbow method to find the optimal number of clusters
KMeansClustering.cluster
KMeansClustering.cluster (n_clusters:int=10, **kwargs)
Run k-means on the dataset
Arguments:
n_clusters : int (default=10)
Number of clusters
Keyword arguments:
max_iter : int (default=5)
n_init : int (default=5)
init : str (default='k-means++')
random_state : int (default=None)
Returns:
labels : np.array
Clustering labels
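To show how `n_clusters`, `max_iter`, `n_init` and `random_state` interact, here is a minimal pure-Python sketch of Lloyd's algorithm with random restarts. It is an assumption-laden illustration, not this method's implementation: the function name `kmeans` is hypothetical, initial centroids are sampled at random from the data instead of using 'k-means++', and labels come back as a list rather than an `np.array`. The returned inertia (within-cluster sum of squared distances) is the kind of quantity an elbow plot would typically track across values of `n_clusters`.

```python
import random
from math import dist


def kmeans(points, n_clusters=2, max_iter=5, n_init=5, random_state=None):
    """Lloyd's algorithm with n_init restarts; returns (labels, inertia)."""
    rng = random.Random(random_state)

    def assign(centroids):
        # Label each point with the index of its nearest centroid.
        return [min(range(n_clusters), key=lambda c: dist(p, centroids[c]))
                for p in points]

    best_labels, best_inertia = None, float("inf")
    for _ in range(n_init):
        # Random initialization from the data (not k-means++).
        centroids = rng.sample(points, n_clusters)
        for _ in range(max_iter):
            labels = assign(centroids)
            for c in range(n_clusters):
                members = [p for p, l in zip(points, labels) if l == c]
                if members:  # keep the old centroid if a cluster is empty
                    centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
        labels = assign(centroids)
        inertia = sum(dist(p, centroids[l]) ** 2 for p, l in zip(points, labels))
        # Keep the restart with the lowest inertia.
        if inertia < best_inertia:
            best_labels, best_inertia = labels, inertia
    return best_labels, best_inertia


# Two well-separated toy groups of 2-D descriptors.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, inertia = kmeans(points, n_clusters=2, random_state=0)
print(labels, inertia)
```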
HDBSCANClustering
HDBSCANClustering (dataset:Union[numpy.__array_like._SupportsArray[numpy.dtype],numpy.__nested_sequence._NestedSequence[numpy.__array_like._SupportsArray[numpy.dtype]],bool,int,float,complex,str,bytes,numpy.__nested_sequence._NestedSequence[Union[bool,int,float,complex,str,bytes]]])
Performs HDBSCAN clustering on a dataset of molecules
Attributes:
dataset : numpy.array
An array of features with shape (n,p), where n is the number of molecules and p is the number of descriptors.
Methods:
cluster(min_cluster_size:int, min_samples:int, metric:str)
Performs HDBSCAN clustering on `self.dataset`
validate_clustering(X, labels)
Compute the density based cluster validity index for the clustering specified by labels and for each cluster in labels.
HDBSCANClustering.cluster
HDBSCANClustering.cluster (min_cluster_size:int=5, min_samples:int=None, metric:str='jaccard', **kwargs)
Run HDBSCAN clustering on the dataset
Arguments:
min_cluster_size : int, optional (default=5) The minimum size of clusters; single linkage splits that contain fewer points than this will be considered points “falling out” of a cluster rather than a cluster splitting into two new clusters.
min_samples : int, optional (default=None) The number of samples in a neighbourhood for a point to be considered a core point.
metric : string or callable, optional (default='jaccard') The metric to use when calculating distance between instances in a feature array. If metric is a string or callable, it must be one of the options allowed by metrics.pairwise.pairwise_distances for its metric parameter. If metric is "precomputed", the dataset is assumed to be a distance matrix and must be square.
Keyword arguments:
See HDBSCAN documentation (https://hdbscan.readthedocs.io/en/latest/index.html)
Returns:
labels : np.array
Clustering labels
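The default `metric='jaccard'` suits binary fingerprints. The sketch below is a pure-Python illustration, under stated assumptions, of the Jaccard distance and of the "core distance" idea behind `min_samples`: the distance from a point to its `min_samples`-th nearest neighbour. The helper names are hypothetical, and whether a point counts itself among its neighbours is a convention that may differ from the HDBSCAN library; here it does not.

```python
def jaccard_distance(a, b):
    """Jaccard distance between two equal-length binary fingerprints."""
    either = sum(1 for x, y in zip(a, b) if x or y)
    both = sum(1 for x, y in zip(a, b) if x and y)
    # Two all-zero fingerprints are treated as identical.
    return 0.0 if either == 0 else 1.0 - both / either


def core_distances(fps, min_samples):
    """Distance from each fingerprint to its min_samples-th nearest other one."""
    result = []
    for i, fp in enumerate(fps):
        others = sorted(
            jaccard_distance(fp, other) for j, other in enumerate(fps) if j != i
        )
        # Clamp so small datasets do not index past the last neighbour.
        result.append(others[min(min_samples, len(others)) - 1])
    return result


# Toy 4-bit fingerprints; the last one shares no bits with the others.
fps = [
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 1],
]
core = core_distances(fps, min_samples=2)
print(core)
```

Points with a large core distance sit in sparse regions, which is where a density-based method tends to label samples as noise.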
ButinaClustering
ButinaClustering (dataset:List, fp_type='rdkit')
Performs Butina clustering
See the original implementation at: https://github.com/PatWalters/clusterama
Attributes:
dataset : list
A list of SMILES.
Methods:
cluster(sim_cutoff:float, nbits:int, radius:int)
Performs Butina clustering on `self.dataset`.
get_fps(mol_list:list, nbits:int, radius:int)
Generate descriptors for `self.dataset`.
cluster_mols(mol_list, sim_cutoff:float, nbits:int, radius:int)
Cluster molecules.
ButinaClustering.cluster
ButinaClustering.cluster (sim_cutoff:float, nbits:int=2048, radius:int=2)
Run Butina clustering on the dataset
Arguments:
sim_cutoff : float
The minimum Tanimoto similarity required to put compounds in the same cluster.
nbits : int, optional (default=2048)
Number of bits of the fingerprints if `fp_type` is 'morgan2'
radius : int, optional (default=2)
Radius of the fingerprints if `fp_type` is 'morgan2'
Returns:
labels : np.array
Clustering labels
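The role of `sim_cutoff` can be illustrated with a minimal pure-Python sketch of Butina-style clustering. It is not this class's implementation: fingerprints are assumed to be given as sets of on-bit indices instead of being generated from SMILES, the function names `tanimoto` and `butina` are hypothetical, and labels are returned as a list rather than an `np.array`.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def butina(fps, sim_cutoff):
    """Greedy Butina-style clustering; returns one integer label per fingerprint."""
    n = len(fps)
    # Neighbours of i: every other fingerprint at least sim_cutoff similar to it.
    neighbours = [
        {j for j in range(n) if j != i and tanimoto(fps[i], fps[j]) >= sim_cutoff}
        for i in range(n)
    ]
    labels = [-1] * n  # -1 marks "not assigned yet"
    next_label = 0
    # Visit candidates from most to fewest neighbours (ties keep input order).
    for i in sorted(range(n), key=lambda k: -len(neighbours[k])):
        if labels[i] != -1:
            continue
        # i becomes a centroid and claims its still-unassigned neighbours.
        labels[i] = next_label
        for j in neighbours[i]:
            if labels[j] == -1:
                labels[j] = next_label
        next_label += 1
    return labels


# Toy fingerprints: two groups of related "molecules" and one singleton.
fps = [{1, 2, 3, 4}, {1, 2, 3, 5}, {1, 2, 3}, {7, 8, 9}, {7, 8, 10}, {20}]
labels = butina(fps, sim_cutoff=0.5)
print(labels)
```

Raising `sim_cutoff` makes neighbour lists shorter, so the same data splits into more, tighter clusters and more singletons.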