Exploratory data analysis is fundamental in cheminformatics.

Requirements

  • rdkit >= 2020.09.1
  • pandas >= 1.1.3
  • seaborn
  • matplotlib
  • fastcore (!conda install fastcore)

In this tutorial we'll use a dataset that Sorkun et al. (2019) compiled from multiple projects for predicting water solubility. You can download the original dataset from here.

At the end of the notebook I added a class that processes the original dataset: it removes salts and mixtures, neutralizes charges, and generates canonical SMILES. I highly recommend checking each structure before modeling.
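
As a rough illustration of this kind of clean-up, and not the class from the end of the notebook, a minimal sketch could rely on RDKit's rdMolStandardize module. The function name standardize_smiles and the sodium acetate input are assumptions made for this example. The sketch keeps only the largest fragment, so it handles mixtures the same way as salts, which may differ from what the actual class does.

```python
from rdkit import Chem
from rdkit.Chem.MolStandardize import rdMolStandardize


def standardize_smiles(smiles):
    """Hypothetical helper: largest fragment, neutralized, as canonical SMILES.

    Returns None when RDKit cannot parse the input.
    """
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    # Keep the largest fragment, which drops counter-ions such as [Br-] or [Na+].
    mol = rdMolStandardize.LargestFragmentChooser().choose(mol)
    # Neutralize charges where that is chemically possible.
    mol = rdMolStandardize.Uncharger().uncharge(mol)
    # MolToSmiles returns canonical SMILES by default.
    return Chem.MolToSmiles(mol)


examples = [
    '[Br-].CCCCCCCCCCCCCCCCCC[N+](C)(C)C',  # salt, taken from the dataset
    'Clc1ccc(C=O)cc1',                      # single neutral molecule, from the dataset
    '[Na+].CC(=O)[O-]',                     # sodium acetate, assumed example
]
cleaned = [standardize_smiles(s) for s in examples]
for raw, clean in zip(examples, cleaned):
    print(raw, '->', clean)
```

Returning None for unparsable input makes it easy to flag and inspect those rows by hand, in line with the advice to check each structure before modeling.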

Import modules

%reload_ext autoreload
%autoreload 2
%matplotlib inline
from IPython.display import Image
import matplotlib.pyplot as plt
from matplotlib.ticker import FormatStrFormatter
import seaborn as sns
import pandas as pd
import numpy as np
from sklearn.preprocessing import Normalizer, normalize, RobustScaler
from functools import partial
from pathlib import Path
from joblib import load, dump

from scipy import stats
from scipy.stats import norm
from statsmodels.graphics.gofplots import qqplot
from rdkit.Chem import Draw
from rdkit.Chem import MolFromSmiles, MolToSmiles
np.random.seed(5)  # fix the NumPy seed so results are reproducible
sns.set(rc={'figure.figsize': (16, 16)})  # default figure size for all plots
sns.set_style('whitegrid')
sns.set_context('paper', font_scale=1.5)

Load Data

data = pd.read_csv('../_data/curated-solubility-dataset_processed.csv')
data.head()

  | ID | Name | InChI | InChIKey | SMILES | Solubility | SD | Ocurrences | Group | MolWt | ... | NumValenceElectrons | NumAromaticRings | NumSaturatedRings | NumAliphaticRings | RingCount | TPSA | LabuteASA | BalabanJ | BertzCT | processed_smiles
0 | A-3 | N,N,N-trimethyloctadecan-1-aminium bromide | InChI=1S/C21H46N.BrH/c1-5-6-7-8-9-10-11-12-13-... | SZEMGTQCPRNXEG-UHFFFAOYSA-M | [Br-].CCCCCCCCCCCCCCCCCC[N+](C)(C)C | -3.616127 | 0.0 | 1 | G1 | 392.510 | ... | 142.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.00 | 158.520601 | 0.000000 | 210.377334 | CCCCCCCCCCCCCCCCCC[N+](C)(C)C
1 | A-4 | Benzo[cd]indol-2(1H)-one | InChI=1S/C11H7NO/c13-11-8-5-1-3-7-4-2-6-9(12-1... | GPYLCFQEKPUWLD-UHFFFAOYSA-N | O=C1Nc2cccc3cccc1c23 | -3.254767 | 0.0 | 1 | G1 | 169.183 | ... | 62.0 | 2.0 | 0.0 | 1.0 | 3.0 | 29.10 | 75.183563 | 2.582996 | 511.229248 | O=C1Nc2cccc3cccc1c23
2 | A-5 | 4-chlorobenzaldehyde | InChI=1S/C7H5ClO/c8-7-3-1-6(5-9)2-4-7/h1-5H | AVPYQKSLYISFPO-UHFFFAOYSA-N | Clc1ccc(C=O)cc1 | -2.177078 | 0.0 | 1 | G1 | 140.569 | ... | 46.0 | 1.0 | 0.0 | 0.0 | 1.0 | 17.07 | 58.261134 | 3.009782 | 202.661065 | O=Cc1ccc(Cl)cc1
3 | A-9 | 4-({4-[bis(oxiran-2-ylmethyl)amino]phenyl}meth... | InChI=1S/C25H30N2O4/c1-5-20(26(10-22-14-28-22)... | FAUAZXVRLVIARB-UHFFFAOYSA-N | C1OC1CN(CC2CO2)c3ccc(Cc4ccc(cc4)N(CC5CO5)CC6CO... | -4.662065 | 0.0 | 1 | G1 | 422.525 | ... | 164.0 | 2.0 | 4.0 | 4.0 | 6.0 | 56.60 | 183.183268 | 1.084427 | 769.899934 | c1cc(N(CC2CO2)CC2CO2)ccc1Cc1ccc(N(CC2CO2)CC2CO...
4 | A-10 | vinyltoluene | InChI=1S/C9H10/c1-3-9-6-4-5-8(2)7-9/h3-7H,1H2,2H3 | JZHGRUMIRATHIU-UHFFFAOYSA-N | Cc1cccc(C=C)c1 | -3.123150 | 0.0 | 1 | G1 | 118.179 | ... | 46.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.00 | 55.836626 | 3.070761 | 211.033225 | C=Cc1cccc(C)c1

5 rows × 27 columns

References

Fin