Clustering molecules
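A primer on exploratory data analysis.
Exploratory data analysis is a fundamental first step in cheminformatics: it lets us inspect distributions, spot outliers, and check the quality of the structures before any modeling.
Requirements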
- rdkit >= 2020.09.1
- pandas >= 1.1.3
- seaborn
- matplotlib
- fastcore (!conda install fastcore)
In this tutorial we'll use a water solubility dataset that Sorkun et al. (2019) compiled from multiple projects. You can download the original dataset from here.
At the end of the notebook I added a class that processes the original dataset: it removes salts and mixtures, neutralizes charges, and generates canonical SMILES. I highly recommend checking each structure before modeling.
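To give a feel for what such a cleanup step can look like, here is a minimal sketch using RDKit's `rdMolStandardize` module. It is not the class from the end of the notebook: the function name `standardize_smiles` is hypothetical, and keeping only the largest fragment is my assumption for handling salts (for true mixtures you may prefer to drop the entry instead).

```python
from rdkit import Chem
from rdkit.Chem.MolStandardize import rdMolStandardize


def standardize_smiles(smiles):
    """Return a canonical SMILES of the largest, neutralized fragment.

    Hypothetical helper for illustration; returns None when RDKit
    cannot parse the input string.
    """
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    # Assumption: the largest fragment is the compound of interest,
    # so counter-ions such as Na+ are discarded.
    mol = rdMolStandardize.LargestFragmentChooser().choose(mol)
    # Neutralize charges where a neutral form can be obtained.
    mol = rdMolStandardize.Uncharger().uncharge(mol)
    # MolToSmiles produces canonical SMILES by default.
    return Chem.MolToSmiles(mol)


# Sodium acetate: the sodium counter-ion is removed and the acetate is protonated.
print(standardize_smiles("CC(=O)[O-].[Na+]"))
```

Even with a step like this, it is still worth looking at the structures by eye, because automatic rules cannot catch every problematic entry.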
# Notebook magics: reload edited modules automatically and render plots inline
%reload_ext autoreload
%autoreload 2
%matplotlib inline

# Display and plotting
from IPython.display import Image
import matplotlib.pyplot as plt
from matplotlib.ticker import FormatStrFormatter
import seaborn as sns

# Data handling and preprocessing
import pandas as pd
import numpy as np
from sklearn.preprocessing import Normalizer, normalize, RobustScaler

# Utilities
from functools import partial
from pathlib import Path
from joblib import load, dump

# Statistics
from scipy import stats
from scipy.stats import norm
from statsmodels.graphics.gofplots import qqplot

# Cheminformatics
from rdkit.Chem import Draw
from rdkit.Chem import MolFromSmiles, MolToSmiles

# Fix the random seed for reproducibility and set the plotting defaults
np.random.seed(5)
sns.set(rc={'figure.figsize': (16, 16)})
sns.set_style('whitegrid')
sns.set_context('paper', font_scale=1.5)

# Load the processed solubility dataset and preview the first rows
data = pd.read_csv('../_data/curated-solubility-dataset_processed.csv')
data.head()
Detecting outliers
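Q-Q plots (UCD): https://www.ucd.ie/ecomodel/Resources/QQplots_WebVersion.html
Graphical tests of data outliers (Dummies): https://www.dummies.com/programming/big-data/data-science/graphical-tests-of-data-outliers/
Machine Learning Mastery post on using statistics to identify outliers: https://machinelearningmastery.com/how-to-use-statistics-to-identify-outliers-in-data/
Towards Data Science post on ways to detect and remove outliers: https://towardsdatascience.com/ways-to-detect-and-remove-the-outliers-404d16608dba
Local outlier factor (LOF) paper: https://www.dbs.ifi.lmu.de/Publikationen/Papers/LOF.pdf
Towards Data Science post on LOF for anomaly detection: https://towardsdatascience.com/local-outlier-factor-for-anomaly-detection-cc0c770d2ebe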
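Two of the approaches covered in the links above can be sketched in a few lines: the interquartile-range rule (Tukey's fences) and the local outlier factor. This is only an illustration under assumptions: the helper names `iqr_outlier_mask` and `lof_outlier_mask` are hypothetical, and the toy values are synthetic stand-ins for a numeric column of the dataset.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor


def iqr_outlier_mask(values, k=1.5):
    """True where a value lies outside Tukey's fences (Q1 - k*IQR, Q3 + k*IQR)."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


def lof_outlier_mask(features, n_neighbors=20):
    """True where scikit-learn's LocalOutlierFactor labels a sample as an outlier."""
    features = np.asarray(features, dtype=float)
    if features.ndim == 1:
        # LocalOutlierFactor expects a 2D array of shape (n_samples, n_features).
        features = features.reshape(-1, 1)
    labels = LocalOutlierFactor(n_neighbors=n_neighbors).fit_predict(features)
    return labels == -1  # fit_predict returns -1 for outliers and 1 for inliers


# Synthetic data: 200 roughly normal values plus two extreme points at the end.
rng = np.random.default_rng(5)
toy = np.append(rng.normal(loc=-3.0, scale=1.0, size=200), [12.0, -20.0])

print(iqr_outlier_mask(toy).sum(), "values outside the IQR fences")
print(lof_outlier_mask(toy).sum(), "values flagged by LOF")
```

The IQR rule only looks at one variable at a time, while LOF compares the local density around each point with that of its neighbors, so it can also be applied to several descriptors at once. A flagged value is a candidate for inspection, not automatically an error.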
Normality tests
Machine Learning Mastery post: https://machinelearningmastery.com/a-gentle-introduction-to-normality-tests-in-python/
Shapiro-Wilk test: https://www.itl.nist.gov/div898/handbook/prc/section2/prc213.htm
Anderson-Darling test: https://www.itl.nist.gov/div898/handbook/prc/section2/prc21.htm
GraphPad entry: https://www.graphpad.com/guides/prism/latest/statistics/stat_choosing_a_normality_test.htm
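As a rough sketch of how the Shapiro-Wilk and Anderson-Darling tests from the links above can be run with SciPy, the snippet below wraps both in one function. The name `normality_report`, the 5% significance level, and the synthetic samples are assumptions made for illustration; in the notebook you would pass a numeric column of `data` instead.

```python
import numpy as np
from scipy import stats


def normality_report(sample, alpha=0.05):
    """Run Shapiro-Wilk and Anderson-Darling tests against a normal distribution."""
    sample = np.asarray(sample, dtype=float)

    # Shapiro-Wilk: a small p-value is evidence against normality.
    shapiro_stat, shapiro_p = stats.shapiro(sample)

    # Anderson-Darling: compare the statistic with the critical value
    # tabulated for the requested significance level (given in percent).
    ad = stats.anderson(sample, dist="norm")
    idx = int(np.argmin(np.abs(ad.significance_level - 100 * alpha)))
    critical = float(ad.critical_values[idx])

    return {
        "shapiro_stat": float(shapiro_stat),
        "shapiro_p": float(shapiro_p),
        "shapiro_normal": bool(shapiro_p > alpha),
        "anderson_stat": float(ad.statistic),
        "anderson_critical": critical,
        "anderson_normal": bool(ad.statistic < critical),
    }


# Synthetic samples: one drawn from a normal distribution, one strongly skewed.
rng = np.random.default_rng(5)
gaussian = rng.normal(size=500)
skewed = rng.exponential(size=500)

print(normality_report(gaussian))
print(normality_report(skewed))
```

Failing to reject normality does not prove that the data are normal, and with large samples these tests flag even small deviations, so it helps to read the results together with a Q-Q plot.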