Water solubility is one of the main factors in lead optimization. If a molecule is not soluble, we might have problems in biological assays and in making it reach the desired target in vivo.

This post is the first of a series that will end with a prediction model to estimate water solubility. In this first notebook, we will see some exploratory data analysis and how to prepare a dataset for modeling.

Requirements

  • rdkit >= 2020.09.1
  • pandas >= 1.1.3
  • seaborn
  • matplotlib
  • fastcore (!conda install fastcore)

In this tutorial we'll use a dataset compiled by Sorkun et al. (2019) from multiple projects to predict water solubility. You can download the original dataset from here.

At the end of the notebook I added a class to process the original dataset: it removes salts and mixtures, neutralizes charges, and generates canonical SMILES. I highly recommend checking each structure before modeling.

Import modules

%reload_ext autoreload
%autoreload 2
%matplotlib inline
from IPython.display import Image
import matplotlib.pyplot as plt
from matplotlib.ticker import FormatStrFormatter
import seaborn as sns
import pandas as pd
import numpy as np
from sklearn.preprocessing import Normalizer, normalize, RobustScaler
from functools import partial
from pathlib import Path
from joblib import load, dump

from scipy import stats
from scipy.stats import norm
from statsmodels.graphics.gofplots import qqplot
from rdkit.Chem import Draw
from rdkit.Chem import MolFromSmiles, MolToSmiles
np.random.seed(5)
sns.set(rc={'figure.figsize': (16, 16)})
sns.set_style('whitegrid')
sns.set_context('paper',font_scale=1.5)

Load Data

data = pd.read_csv('../_data/curated-solubility-dataset_processed.csv')
data.head()

ID Name InChI InChIKey SMILES Solubility SD Ocurrences Group MolWt ... NumValenceElectrons NumAromaticRings NumSaturatedRings NumAliphaticRings RingCount TPSA LabuteASA BalabanJ BertzCT processed_smiles
0 A-3 N,N,N-trimethyloctadecan-1-aminium bromide InChI=1S/C21H46N.BrH/c1-5-6-7-8-9-10-11-12-13-... SZEMGTQCPRNXEG-UHFFFAOYSA-M [Br-].CCCCCCCCCCCCCCCCCC[N+](C)(C)C -3.616127 0.0 1 G1 392.510 ... 142.0 0.0 0.0 0.0 0.0 0.00 158.520601 0.000000 210.377334 CCCCCCCCCCCCCCCCCC[N+](C)(C)C
1 A-4 Benzo[cd]indol-2(1H)-one InChI=1S/C11H7NO/c13-11-8-5-1-3-7-4-2-6-9(12-1... GPYLCFQEKPUWLD-UHFFFAOYSA-N O=C1Nc2cccc3cccc1c23 -3.254767 0.0 1 G1 169.183 ... 62.0 2.0 0.0 1.0 3.0 29.10 75.183563 2.582996 511.229248 O=C1Nc2cccc3cccc1c23
2 A-5 4-chlorobenzaldehyde InChI=1S/C7H5ClO/c8-7-3-1-6(5-9)2-4-7/h1-5H AVPYQKSLYISFPO-UHFFFAOYSA-N Clc1ccc(C=O)cc1 -2.177078 0.0 1 G1 140.569 ... 46.0 1.0 0.0 0.0 1.0 17.07 58.261134 3.009782 202.661065 O=Cc1ccc(Cl)cc1
3 A-9 4-({4-[bis(oxiran-2-ylmethyl)amino]phenyl}meth... InChI=1S/C25H30N2O4/c1-5-20(26(10-22-14-28-22)... FAUAZXVRLVIARB-UHFFFAOYSA-N C1OC1CN(CC2CO2)c3ccc(Cc4ccc(cc4)N(CC5CO5)CC6CO... -4.662065 0.0 1 G1 422.525 ... 164.0 2.0 4.0 4.0 6.0 56.60 183.183268 1.084427 769.899934 c1cc(N(CC2CO2)CC2CO2)ccc1Cc1ccc(N(CC2CO2)CC2CO...
4 A-10 vinyltoluene InChI=1S/C9H10/c1-3-9-6-4-5-8(2)7-9/h3-7H,1H2,2H3 JZHGRUMIRATHIU-UHFFFAOYSA-N Cc1cccc(C=C)c1 -3.123150 0.0 1 G1 118.179 ... 46.0 1.0 0.0 0.0 1.0 0.00 55.836626 3.070761 211.033225 C=Cc1cccc(C)c1

5 rows × 27 columns

Exploratory data analysis

In this notebook we'll cover three main topics:

  • What's the relationship between our features and solubility?
  • What's the distribution of our target variable?
  • Are there outliers and how can we detect them?

Preliminary analysis

Let's start by counting how many features we have. In total we have 17 descriptors already calculated for our data set.

descriptors = ['MolWt', 'MolLogP', 'MolMR', 'HeavyAtomCount',
       'NumHAcceptors', 'NumHDonors', 'NumHeteroatoms', 'NumRotatableBonds',
       'NumValenceElectrons', 'NumAromaticRings', 'NumSaturatedRings',
       'NumAliphaticRings', 'RingCount', 'TPSA', 'LabuteASA', 'BalabanJ',
       'BertzCT']
print(len(descriptors))
17

We can check the correlation between each feature and solubility using seaborn's regplot:

nr_rows = 6
nr_cols = 3
target = 'Solubility'
fig, axs = plt.subplots(nr_rows, nr_cols, figsize=(nr_cols*3.5,nr_rows*3))

for r in range(0,nr_rows):
    
    for c in range(0,nr_cols):  
        i = r*nr_cols+c
        
        if i < len(descriptors):
            
            sns.regplot(x=data[descriptors[i]], y=data[target], ax = axs[r][c])
            
            stp = stats.pearsonr(data[descriptors[i]], data[target])

            str_title = "r = " + "{0:.2f}".format(stp[0]) + "      " "p = " + "{0:.2f}".format(stp[1])
            axs[r][c].set_title(str_title,fontsize=11)
            
plt.tight_layout()    
plt.show()  

Most features have a negative correlation with solubility. In fact, the available features reflect more or less the size/complexity of the molecule (e.g. MolWt, RingCount, MolMR) and its capacity to make inter- and intramolecular interactions (e.g. NumValenceElectrons, NumHAcceptors, NumHDonors, MolLogP), which are known to influence water solubility. For instance, bigger molecules tend to be less soluble in water; the same applies to very lipophilic (i.e. water-averse) molecules. On the other hand, more polar molecules, capable of making hydrogen bonds with water, are usually more soluble.

Based on this preliminary analysis we can start thinking about which features to include when training a regression model.
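
As a starting point for that decision, here is a small sketch (using the data frame and descriptors list defined above) that ranks each descriptor by its Pearson correlation with solubility:

# rank descriptors by their Pearson correlation with the target (sketch)
corr_with_target = data[descriptors].corrwith(data['Solubility'])
print(corr_with_target.sort_values())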

Reliability of data

Sorkun et al. describe in their paper that the molecules in the dataset were classified according to the number of occurrences and the standard deviation (SD) of the logS measurements. Based on this, the authors defined 5 groups, G1-G5, with G5 showing the highest reliability (3+ occurrences and SD < 0.5), followed by G4 (3+ occurrences and SD > 0.5), G3 (2 occurrences and SD < 0.5) and G2 (2 occurrences and SD > 0.5), and finally G1 (just one occurrence across all source datasets).
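
The Group column in the dataset already encodes this, but to make the rule explicit, here is a hypothetical helper that mirrors it (how ties at exactly SD = 0.5 are handled is an assumption):

# hypothetical helper mirroring the reliability rule described above
def reliability_group(occurrences, sd):
    if occurrences >= 3:
        return 'G5' if sd < 0.5 else 'G4'
    if occurrences == 2:
        return 'G3' if sd < 0.5 else 'G2'
    return 'G1'  # single occurrence

# example: a compound measured 4 times with SD = 0.2 falls in G5
print(reliability_group(4, 0.2))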

Compounds can also be classified according to their solubility values (logS): compounds with logS of 0 or higher are highly soluble, those in the range 0 to −2 are soluble, those in the range −2 to −4 are slightly soluble, and those below −4 are insoluble.

data['class'] = None
data.loc[data['Solubility'] > 0.0, 'class'] = 'highly soluble'
data.loc[(data['Solubility'] > -2.0) & (data['Solubility'] <= 0.0), 'class'] = 'soluble'
data.loc[(data['Solubility'] > -4.0) & (data['Solubility'] <= -2.0), 'class'] = 'slightly soluble'
data.loc[data['Solubility'] <= -4.0, 'class'] = 'insoluble'
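
An equivalent, more compact way to assign the same classes is pandas' cut function (a sketch using the thresholds above):

# same thresholds: (-inf,-4] insoluble, (-4,-2] slightly soluble,
# (-2,0] soluble, (0,inf) highly soluble
data['class'] = pd.cut(data['Solubility'],
                       bins=[-np.inf, -4.0, -2.0, 0.0, np.inf],
                       labels=['insoluble', 'slightly soluble',
                               'soluble', 'highly soluble']).astype(str)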

Let's count the molecules in each reliability group.

counts = data.groupby(['Group','class'])['ID'].count().reset_index()
counts
Group class ID
0 G1 highly soluble 624
1 G1 insoluble 1976
2 G1 slightly soluble 2364
3 G1 soluble 1929
4 G2 highly soluble 11
5 G2 insoluble 114
6 G2 slightly soluble 51
7 G2 soluble 29
8 G3 highly soluble 90
9 G3 insoluble 305
10 G3 slightly soluble 353
11 G3 soluble 324
12 G4 highly soluble 10
13 G4 insoluble 44
14 G4 slightly soluble 46
15 G4 soluble 21
16 G5 highly soluble 40
17 G5 insoluble 140
18 G5 slightly soluble 194
19 G5 soluble 177
sns.barplot(x='Group',y='ID',data=counts,hue='class',ci=95)
<AxesSubplot:xlabel='Group', ylabel='ID'>

The G1 group dominates, with the majority of molecules, followed by G3 and G5. G4 and G2 are the smallest groups and contain mostly soluble or slightly soluble molecules. The number of insoluble molecules in every group is high and comparable to the number of soluble and slightly soluble ones. In addition, for all groups the number of highly soluble molecules is small; in G2 these molecules represent only about 5%.

This initial analysis already shows that our distribution is probably skewed towards less soluble molecules. This might be a problem when training our model, because it may make more errors in the chemical space of highly soluble molecules, since they are not well represented in the dataset.
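
To put numbers on that imbalance, here is a small sketch (using the data frame and class labels from above) that computes the fraction of each solubility class within each reliability group:

# fraction of each solubility class within each reliability group
class_fractions = data.groupby('Group')['class'].value_counts(normalize=True)
print(class_fractions.round(3))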

Summary statistics

data.describe()
Solubility SD Ocurrences MolWt MolLogP MolMR HeavyAtomCount NumHAcceptors NumHDonors NumHeteroatoms NumRotatableBonds NumValenceElectrons NumAromaticRings NumSaturatedRings NumAliphaticRings RingCount TPSA LabuteASA BalabanJ BertzCT
count 8842.000000 8842.000000 8842.000000 8842.000000 8842.000000 8842.000000 8842.000000 8842.000000 8842.000000 8842.000000 8842.000000 8842.000000 8842.000000 8842.000000 8842.000000 8842.000000 8842.000000 8842.000000 8842.000000 8842.000000
mean -2.976133 0.059580 1.345397 253.005300 2.445421 66.159777 16.904207 3.174169 1.097263 4.667496 3.927279 92.041280 1.064917 0.318706 0.467541 1.532459 55.691938 103.250839 2.618086 443.741261
std 2.315261 0.205385 0.809062 150.964318 2.811272 41.035370 10.507310 2.693266 1.433689 3.480039 5.265856 56.843391 1.194696 0.910786 1.080055 1.528568 46.990709 62.039117 0.819560 436.470929
min -13.171900 0.000000 1.000000 26.038000 -17.406400 3.450000 2.000000 0.000000 0.000000 0.000000 0.000000 10.000000 0.000000 0.000000 0.000000 0.000000 0.000000 11.307062 -0.000004 0.000000
25% -4.346725 0.000000 1.000000 163.176000 0.957900 42.360000 11.000000 2.000000 0.000000 2.000000 1.000000 60.000000 0.000000 0.000000 0.000000 0.000000 26.020000 67.199259 2.169297 183.391645
50% -2.689542 0.000000 1.000000 226.446000 2.141950 59.451150 15.000000 3.000000 1.000000 4.000000 3.000000 82.000000 1.000000 0.000000 0.000000 1.000000 48.560000 92.834705 2.622906 368.308634
75% -1.366225 0.000000 1.000000 310.365500 3.555000 81.085350 20.000000 4.000000 2.000000 6.000000 5.000000 110.000000 2.000000 0.000000 1.000000 2.000000 75.270000 126.296571 3.069899 606.196178
max 2.137682 3.270214 12.000000 5299.456000 68.541140 1419.351700 388.000000 86.000000 26.000000 89.000000 141.000000 2012.000000 35.000000 30.000000 30.000000 36.000000 1214.340000 2230.685124 7.517310 20720.267708

Let's take a look at the pairplot to see how the features interact with each other and with our target.

sns.pairplot(data)

<seaborn.axisgrid.PairGrid at 0x7f0dd6fafcd0>

We can see some interesting information about this dataset. Our target has a roughly normal-looking distribution (first plot on the top left). This is convenient because some algorithms assume that the data follows a normal distribution; if our data is close to normal, it should be easier to train the model.

Normality check

Here's another view of the target solubility variable using violin and kernel density plots. The plots show a negative skew (a longer tail towards low solubility values), indicating that most molecules are not highly soluble.

fig, (ax1, ax2) = plt.subplots(2,figsize=(12,8))

sns.set_context(font_scale=2)
sns.histplot(x='Solubility',data=data,ax=ax1,hue='class')

ax1.axvline(data['Solubility'].mean(), color='k', linestyle='dashed', linewidth=2.5)

sns.violinplot(x=data['Solubility'].values, ax=ax2)

# set the spacing between subplots
plt.subplots_adjust(left=0.1,
                    bottom=0.1, 
                    right=0.9, 
                    top=0.9, 
                    wspace=0.4, 
                    hspace=0.4)

ax1.title.set_text('Histogram')
ax2.title.set_text('Violin plot')

plt.ylabel('Count',fontsize=18)
plt.xlabel('Solubility',fontsize=18)
Text(0.5, 0, 'Solubility')

To gain better insight into our distribution, let's calculate how big that negative skew is and how much the distribution deviates from a Gaussian (normal) distribution. We'll use two values: skewness and kurtosis.

Skewness and kurtosis can be defined as follows:

  • Skewness is how much a distribution is pushed to the left or right; it is a measure of asymmetry. We can think about skew as if someone were pulling on the tail of a normal distribution.
  • Kurtosis quantifies how much of the distribution sits in the tails. Together with skewness, it is commonly used in statistical tests for normality. A good analogy is punching the peak of the distribution so that it spreads out towards the tails.

We can calculate skewness and kurtosis using scipy's skew and kurtosis functions, respectively.

Figure 1. Skewness

Figure 2. Kurtosis

from scipy.stats import skew, kurtosis, kurtosistest, skewtest
skew_val = skew(data['Solubility'])
kurtosis_val = kurtosis(data['Solubility'])
print(f'Skewness = {skew_val}')
print(f'Kurtosis = {kurtosis_val}')
Skewness = -0.5593279897021967
Kurtosis = 0.04726151580435545

Now let's confirm the hypothesis that our distribution is not normal by running different normality tests, such as the D'Agostino-Pearson, Shapiro-Wilk and Anderson-Darling tests.

from scipy.stats import normaltest, shapiro, anderson

The D'Agostino-Pearson normality test measures how far a distribution is from normality by combining skewness and kurtosis. In this test, the null hypothesis (H0) is that our distribution is normal. We can use the K2 statistic and the p-value to decide whether to reject H0.

print("D'Agostino K2")
k2, p = normaltest(data['Solubility'])
print(f'Statistics={k2}, p={p}')
# interpret
alpha = 0.05
if p > alpha:
    print('Sample looks Gaussian (fail to reject H0)')
else:
    print('Sample does not look Gaussian (reject H0)')
D'Agostino K2
Statistics=404.5266797956529, p=1.4392876240350922e-88
Sample does not look Gaussian (reject H0)

The Shapiro-Wilk normality test calculates a W statistic to check if a sample comes from a normal distribution. In this test, the null hypothesis (H0) is that our distribution is normal. We can use the W statistic and the p-value to decide whether to reject H0.

print("Shapiro-Wilk")
w_stat, p = shapiro(data['Solubility'])
print(f'Statistics={w_stat}, p={p}')
# interpret
alpha = 0.05
if p > alpha:
    print('Sample looks Gaussian (fail to reject H0)')
else:
    print('Sample does not look Gaussian (reject H0)')
Shapiro-Wilk
Statistics=0.9754287600517273, p=2.2349054059092123e-36
Sample does not look Gaussian (reject H0)
UserWarning: p-value may not be accurate for N > 5000.

The Anderson-Darling normality test checks whether a sample comes from a specific distribution (e.g. normal). It measures the distance between the cumulative distribution function of the sample and that of the reference distribution, and is a modified version of the nonparametric Kolmogorov-Smirnov goodness-of-fit test. In this test, the null hypothesis (H0) is that our distribution is normal. Unlike the previous tests, the scipy function does not return a p-value; instead it returns the A statistic along with critical values at several significance levels, which we compare to decide whether to reject H0.

print("Anderson-Darling")
result  = anderson(data['Solubility'])
print(f'A statistic = {result.statistic}')
p = 0
for i in range(len(result.critical_values)):
	sl, cv = result.significance_level[i], result.critical_values[i]
	if result.statistic < result.critical_values[i]:
		print(f'{sl:.3f}: {cv:.3f}, data looks normal (fail to reject H0)')
	else:
		print(f'{sl:.3f}: {cv:.3f}, data does not look normal (reject H0)')
Anderson-Darling
A statistic = 54.56324161518205
15.000: 0.576, data does not look normal (reject H0)
10.000: 0.656, data does not look normal (reject H0)
5.000: 0.787, data does not look normal (reject H0)
2.500: 0.918, data does not look normal (reject H0)
1.000: 1.092, data does not look normal (reject H0)
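
A complementary visual check is a quantile-quantile (Q-Q) plot; the qqplot function we imported from statsmodels earlier can be used for this (a minimal sketch):

# Q-Q plot of solubility against a fitted normal distribution;
# points bending away from the line reveal the skewed tails
fig, ax = plt.subplots(figsize=(6, 6))
qqplot(data['Solubility'].values, line='s', ax=ax)
plt.show()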

We can also plot some features to investigate their distribution.

fig, ((ax1,ax2),(ax3,ax4)) = plt.subplots(2,2,figsize=(12,12))

sns.histplot(data['MolWt'],ax=ax1)
sns.histplot(data['MolLogP'],ax=ax2)
sns.histplot(data['TPSA'],ax=ax3)
sns.histplot(data['BertzCT'],ax=ax4)

# set the spacing between subplots
plt.subplots_adjust(left=0.1,
                    bottom=0.1, 
                    right=0.9, 
                    top=0.9, 
                    wspace=0.4, 
                    hspace=0.4)

ax1.set_title('Molecular weight (Da)')
ax2.set_title('Lipophilicity (logP)')
ax3.set_title('Topological polar surface area (TPSA)')
ax4.set_title('BertzCT descriptor')
Text(0.5, 1.0, 'BertzCT descriptor')

Molecular weight (MolWt), lipophilicity (MolLogP), topological polar surface area (TPSA) and the BertzCT topological complexity descriptor show a clear positive skew. These distributions make sense given our observations about solubility: most of the molecules in the dataset are in the slightly soluble to soluble range (solubility between −4 and 0), and as we can see from the plots above, molecular weight and lipophilicity are positively skewed.
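
We can quantify that skew directly with pandas' built-in skew method (a quick sketch over the four descriptors plotted above):

# positive values confirm the right-skew seen in the histograms
print(data[['MolWt', 'MolLogP', 'TPSA', 'BertzCT']].skew())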

Detecting outliers

We can use the information we have to identify data points that are very different from the general distribution and that could hinder the performance of a predictive model. These data points are called outliers, and their presence is problematic for many types of algorithms, leading to longer training times, more complex models and possibly degraded performance.

fig, axs = plt.subplots(nr_rows, nr_cols, figsize=(nr_cols*3.5,nr_rows*3))

for r in range(0,nr_rows):
    for c in range(0,nr_cols):  
        i = r*nr_cols+c
        if i < len(descriptors):
            sns.scatterplot(x=data[descriptors[i]], y=data[target], ax = axs[r][c])
            axs[r][c].set_title(descriptors[i],fontsize=11)
            
plt.tight_layout()    
plt.show()  

The scatter plots show us that some points are distant from the others. An extreme case is a data point on the top right of the plots, showing high molecular weight, TPSA and logP! It does seem we have some outliers in our data. But how can we detect which molecules are outliers? Can we simply remove them based on some kind of rule of thumb? Let's investigate!

Using mean and std to find outliers

The first outlier detection method we'll use is based on the mean and standard deviation of our data. In this method, a point is considered an outlier if it lies more than 2 standard deviations below or above the mean of our sample.

def get_outliers(data, feature_name, c=2):
    # flag points more than c standard deviations away from the mean
    mean, std = data[feature_name].mean(), data[feature_name].std()
    bounds = mean - (c * std), mean + (c * std)
    outliers = data[~data[feature_name].between(*bounds)]
    non_outliers = data.loc[data.index.difference(outliers.index)]
    return bounds, outliers, non_outliers
outliers = []

for i in descriptors:
    outliers.append(get_outliers(data, feature_name= i, c=2)[1])
fig, axs = plt.subplots(nr_rows, nr_cols, figsize=(nr_cols*3.5,nr_rows*3))

for r in range(0,nr_rows):
    for c in range(0,nr_cols):  
        i = r*nr_cols+c
        if i < len(descriptors):
            sns.scatterplot(x=data[descriptors[i]], y=data[target], ax = axs[r][c])
            sns.scatterplot(x=outliers[i][descriptors[i]], y=outliers[i][target], ax = axs[r][c])
            
            axs[r][c].set_title(f'{descriptors[i]}\nNumber of outliers = {len(outliers[i])}',fontsize=11)
            
# set the spacing between subplots
plt.subplots_adjust(left=0.1,
                    bottom=0.1, 
                    right=0.9, 
                    top=0.9, 
                    wspace=0.4, 
                    hspace=0.4)            
plt.tight_layout()    
plt.show()  

The scatterplots show some potential outliers for each feature.

Let's take a closer look at solubility and one of the descriptors to get some insight into the physicochemical properties of potential outliers.

Water solubility has a strong negative correlation with lipophilicity (logP): ideally, we should see solubility decrease as molecules become more lipophilic. Let's try using logP to identify potential outliers.

logp_bounds, logp_outliers, logp_non_outliers = get_outliers(data, feature_name='MolLogP', c=2)
print(logp_bounds)
(-3.1771227601854855, 8.067965494860898)
print(f'Possible number of outliers = {len(logp_outliers)}')
Possible number of outliers = 300

We found 300 potential outliers whose MolLogP differs from the mean by more than 2 standard deviations. We can clearly see the extreme point we identified earlier and some others more distant from the distribution.

Here's MolLogP versus solubility with the potential outliers highlighted:

fig, ax1 = plt.subplots(1,figsize=(20,16), sharey=True)
s=90
sns.scatterplot(x='MolLogP', y='Solubility', data =logp_non_outliers,ax=ax1,s=s,color='gray')
sns.scatterplot(x='MolLogP', y='Solubility', data =logp_outliers,ax=ax1,color='goldenrod',s=s)
ax1.set_title('MolLogP outliers',fontsize=18)
ax1.set_ylabel('Solubility',fontsize=18)
ax1.set_xlabel('MolLogP',fontsize=18)

Text(0.5, 0, 'MolLogP')

If we remove the putative outliers we can get a better view of the relationship between logP and solubility.

fig, (ax1, ax2, ax3) = plt.subplots(1,3,figsize=(36,12), sharey=True)
s=90

sns.scatterplot(x='MolLogP', y='Solubility', data =data,ax=ax1,s=s,color='gray')
ax1.set_title('Original data')

sns.scatterplot(x='MolLogP', y='Solubility', data =logp_non_outliers,ax=ax2,s=s,color='gray')
sns.scatterplot(x='MolLogP', y='Solubility', data =logp_outliers,ax=ax2,color='goldenrod',s=s)
ax2.set_title('Outliers')

sns.scatterplot(x='MolLogP', y='Solubility', data = logp_non_outliers,ax=ax3,color='gray')
ax3.set_title('Removed outliers')
Text(0.5, 1.0, 'Removed outliers')

The scatterplot on the right shows that as logP increases, solubility tends to decrease, which is what we expected. In general, very large, lipophilic compounds are less soluble and are harder to optimize into lead compounds or drugs. But what kind of molecules did we remove?

from rdkit.Chem import Draw
Draw.MolsToGridImage(list(map(MolFromSmiles, logp_outliers.processed_smiles)),subImgSize=(900,900))

UserWarning: Truncating the list of molecules to be displayed to 50. Change the maxMols value to display more.

Some potential outliers don't look very friendly, or like something that would be found in drug discovery libraries! Take a look at this guy:

smi = logp_outliers.processed_smiles.tolist()[2]
mol = MolFromSmiles(smi)
print(f'Outlier SMILES - {smi}')
mol

Outlier SMILES - CCCCCCCCCCCCCCCCCC(=O)OCC(COC(=O)CCCCCCCCCCCCCCCCC)OC(=O)CCCCCCCCCCCCCCCCC

We might be better off without this kind of "ugly chemistry". But let's keep these monsters for now and explore other methods for outlier detection.

Interquartile range

We can also use the interquartile range (IQR) to detect outliers. Simply put, we'll flag molecules with values above the 75th percentile plus 1.5 × IQR or below the 25th percentile minus 1.5 × IQR (the same rule used to draw boxplot whiskers).

def iqr_detection(data, feature_name, iqr_mult=1.5):
    # flag points beyond iqr_mult * IQR from the quartiles
    df = data.copy()
    df.reset_index(drop=True, inplace=True)

    q3 = np.quantile(df[feature_name], .75)
    q1 = np.quantile(df[feature_name], .25)
    iqr = q3 - q1

    cutoff = iqr_mult * iqr
    lower, upper = q1 - cutoff, q3 + cutoff

    outliers = df[(df[feature_name] < lower) | (df[feature_name] > upper)]
    ok_points = df.loc[df.index.difference(outliers.index)]

    return (lower, upper), outliers, ok_points

We will consider a data point an outlier if it lies above or below these cutoffs (1.5 × IQR beyond the quartiles).

iqr_bounds, iqr_outliers, iqr_okpoints = iqr_detection(data, 'Solubility')
print(iqr_bounds)
(-8.817475000000002, 3.1045250000000006)
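
As a quick sanity check, here is a sketch that counts and previews the molecules flagged by the IQR rule (reusing the objects returned above):

# molecules flagged as outliers by the 1.5 * IQR rule on Solubility
print(f'Possible number of outliers = {len(iqr_outliers)}')
print(iqr_outliers[['Name', 'Solubility']].head())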

Automatic detection

We can also automate the detection of outliers using the local outlier factor algorithm.

The local outlier factor (LOF) adopts a notion of outlier that depends on the local neighborhood of each data point. This allows the algorithm to find data points that differ not only from the global distribution but also from their neighbors. LOF does this by assigning an outlier factor to each data point, which reflects the degree to which that point deviates from its neighbors. An outlier is expected to have a local density that is substantially lower than that of its k nearest neighbors (i.e. it is far away from them).

Normalize data

Let's try to find some outliers using the available descriptors. But first, let's scale them (RobustScaler uses the median and IQR, so it is itself robust to outliers)!

X = data[descriptors].values
Y = data['Solubility'].values
Y.shape
(8842,)
scaler = RobustScaler()
Xnorm = scaler.fit(X).transform(X)
Xnorm.mean(),Xnorm.std()
(0.176258938735513, 1.0250171066612603)

LocalOutlierFactor

from sklearn.neighbors import LocalOutlierFactor
outlier_scores = LocalOutlierFactor(n_neighbors=20,novelty=False).fit_predict(Xnorm)
lof_no_outliers = data.iloc[outlier_scores == 1]
lof_outliers = data.iloc[outlier_scores == -1]

lof_no_outliers.shape,lof_outliers.shape
((8525, 28), (317, 28))
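
If we want to rank molecules by how anomalous LOF considers them, we can refit the estimator and inspect its negative_outlier_factor_ attribute (a sketch; values much lower than -1 indicate stronger outliers):

# refit to access the per-sample LOF scores
lof = LocalOutlierFactor(n_neighbors=20, novelty=False).fit(Xnorm)
scores = lof.negative_outlier_factor_

# ten most anomalous molecules according to LOF
most_anomalous = data.assign(lof_score=scores).nsmallest(10, 'lof_score')
print(most_anomalous[['Name', 'MolWt', 'MolLogP', 'Solubility', 'lof_score']])
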
Draw.MolsToGridImage(list(map(MolFromSmiles, lof_outliers.processed_smiles.tolist())),subImgSize=(600,600))

UserWarning: Truncating the list of molecules to be displayed to 50. Change the maxMols value to display more.

It seems LOF removed the most extreme points. After removing the outliers, the correlation between each feature and solubility increased. The only exception was TPSA, which showed a small decrease in the Pearson correlation.
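
Here is a minimal sketch of how that comparison can be reproduced (assuming the data, descriptors and lof_no_outliers objects defined above):

# Pearson correlation with solubility before and after removing LOF outliers
comparison = pd.DataFrame({
    'all_data': data[descriptors].corrwith(data['Solubility']),
    'lof_filtered': lof_no_outliers[descriptors].corrwith(lof_no_outliers['Solubility']),
})
print(comparison)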

In the next posts, we'll explore more of this data. The main question we'll try to answer is: can we train a model with good performance to predict water solubility?

References

  • Sorkun, M. C., Khetan, A. & Er, S. AqSolDB, a curated reference set of aqueous solubility and 2D descriptors for a diverse set of compounds. Scientific Data 6, 143 (2019).

Fin

Sample class to sanitize the dataset

from rdkit import Chem
from tqdm.notebook import tqdm
from rdkit.Chem.Draw import IPythonConsole, MolsToGridImage
from rdkit.Chem.SaltRemover import SaltRemover as saltremover
from rdkit.Chem.MolStandardize.rdMolStandardize import LargestFragmentChooser
from rdkit.Chem import AllChem, SanitizeMol
from rdkit import rdBase
from rdkit.Chem import rdmolops
from fastcore.all import *
rdBase.DisableLog('rdApp.error')
rdBase.DisableLog('rdApp.info')

_saltremover = saltremover()
_unwanted = Chem.MolFromSmarts('[!#1!#6!#7!#8!#9!#15!#16!#17!#35!#53]')

def _initialiseNeutralisationReactions():
  patts = (
        # Imidazoles
        ('[n+;H]', 'n'),
        # Amines
        ('[N+;!H0]', 'N'),
        # Carboxylic acids and alcohols
        ('[$([O-]);!$([O-][#7])]', 'O'),
        # Thiols
        ('[S-;X1]', 'S'),
        #Amidines
        ('[C+](N)', 'C(=N)'),
        # Sulfonamides
        ('[$([N-;X2]S(=O)=O)]', 'N'),
        # Enamines
        ('[$([N-;X2][C,N]=C)]', 'N'),
        # Tetrazoles
        ('[n-]', '[nH]'),
        # Sulfoxides
        ('[$([S-]=O)]', 'S'),
        # Amides
        ('[$([N-]C=O)]', 'N'),
    )
  return [(Chem.MolFromSmarts(x),Chem.MolFromSmiles(y,False)) for x,y in patts]

def neutralise(mol, reactions=None):
  # use the default neutralisation reactions unless a custom set is provided
  if reactions is None:
    reactions = _initialiseNeutralisationReactions()
  for reactant, product in reactions:
    while mol.HasSubstructMatch(reactant):
      mol = AllChem.ReplaceSubstructs(mol, reactant, product)[0]

  # clean up and return the (possibly) neutralised molecule
  rdmolops.Cleanup(mol)
  rdmolops.SanitizeMol(mol)
  mol = rdmolops.RemoveHs(mol, implicitOnly=False, updateExplicitCount=False, sanitize=True)
  return mol

def _rare_filters(mol):
  if mol:
    cyano_filter = "[C-]#[N+]"
    oh_filter = "[OH+]"
    sulfur_filter = "[SH]"
    sulfur_filter2 = "[SH2]"
    iodine_filter1 = "[IH2]"
    iodine_filter2 = "[I-]"
    iodine_filter3 = "[I+]"

    if not mol.HasSubstructMatch(Chem.MolFromSmarts(cyano_filter)) \
      and not mol.HasSubstructMatch(Chem.MolFromSmarts(oh_filter)) \
      and not mol.HasSubstructMatch(Chem.MolFromSmarts(sulfur_filter2)) \
      and not mol.HasSubstructMatch(Chem.MolFromSmarts(iodine_filter1)) \
      and not mol.HasSubstructMatch(Chem.MolFromSmarts(iodine_filter2)) \
      and not mol.HasSubstructMatch(Chem.MolFromSmarts(iodine_filter3)) \
      and not mol.HasSubstructMatch(Chem.MolFromSmarts(sulfur_filter)):
      return mol

def set_active(row):
    active = 'Inactive'
    if row['standard_value'] <= 10000: active = 'Active'
    return active

class SanitizeDataset():
    def __init__(self,df,smiles_column:str='Smiles',activity_column = None, sep:str='\t'):
        self.df, self.smiles_column, self.activity_column = df, smiles_column, activity_column

@patch
@delegates(SanitizeDataset)
def add_mol_column(x:SanitizeDataset, smiles, **kwargs):
    if type(smiles) == str and smiles != '':

        mol = Chem.MolFromSmiles(smiles) #.replace('@','').replace('/','').replace("\\","")
        return mol
    else:
        return None

@patch
@delegates(SanitizeDataset)
def remove_unwanted(x:SanitizeDataset,mol, **kwargs):
    if not mol.HasSubstructMatch(_unwanted) and (sum([atom.GetIsotope() for atom in mol.GetAtoms()])==0):
        return mol
@patch
@delegates(SanitizeDataset)
def _remove_salts(x:SanitizeDataset,mol, **kwargs):
    mol = _saltremover.StripMol(mol, dontRemoveEverything=True)
    mol = _saltremover.StripMol(mol, dontRemoveEverything=True)
    mol = rdmolops.RemoveHs(mol, implicitOnly=False, updateExplicitCount=False, sanitize=True)
    mol = x._getlargestFragment(mol)
    return mol

@patch
@delegates(SanitizeDataset)
def _getlargestFragment(x:SanitizeDataset, mol, **kwargs):
    frags = rdmolops.GetMolFrags(mol, asMols=True, sanitizeFrags=True)
    maxmol = None
    for mol in frags:
        if mol is None:
            continue
        if maxmol is None:
            maxmol = mol
        if maxmol.GetNumHeavyAtoms() < mol.GetNumHeavyAtoms():
            maxmol = mol
    return maxmol

@patch
@delegates(SanitizeDataset)
def is_valid(x:SanitizeDataset, mol, min_heavy_atoms = 1, max_heavy_atoms = 9999, **kwargs):
    if mol:
        return _rare_filters(mol) if min_heavy_atoms < mol.GetNumHeavyAtoms() < max_heavy_atoms else None
    return None

@patch
@delegates(SanitizeDataset)
def canonilize(x:SanitizeDataset, mol, **kwargs):
    if mol:
        return Chem.MolToSmiles(mol, isomericSmiles = True) if mol.GetNumHeavyAtoms()>0 else print(type(mol))
    else:
        return None

@patch
@delegates(SanitizeDataset)
def process_data(x:SanitizeDataset, **kwargs):

    x.df = x.df.loc[~x.df[x.smiles_column].isnull()]

    x.df['mol'] = x.df[x.smiles_column].apply(x.add_mol_column)
    x.df = x.df.loc[~x.df['mol'].isnull()]

    x.df['mol'] = x.df['mol'].apply(x.remove_unwanted)
    x.df['mol'] = x.df['mol'].apply(_rare_filters)
    x.df = x.df.loc[~x.df['mol'].isnull()]

    x.df['mol'] = x.df['mol'].apply(x._remove_salts)
    x.df['mol'] = x.df['mol'].apply(neutralise)
    x.df['mol'] = x.df['mol'].apply(x.is_valid)
    x.df = x.df.loc[~x.df['mol'].isnull()]  

    x.df['processed_smiles'] = x.df['mol'].apply(lambda x : Chem.MolToSmiles(x))
    x.df = x.df.loc[~x.df['processed_smiles'].isnull()]
    x.df.drop_duplicates(subset='processed_smiles',inplace=True)
    return x.df

#sanitizer = SanitizeDataset(df=df, smiles_column='SMILES',activity_column = 'Solubility', sep=',')
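
For reference, here is a hedged usage sketch of the class above; the input file name is an assumption (point it at wherever you saved the original AqSolDB CSV):

# hypothetical usage: sanitize the raw dataset and write the processed SMILES
raw = pd.read_csv('../_data/curated-solubility-dataset.csv')  # assumed path to the raw file
sanitizer = SanitizeDataset(df=raw, smiles_column='SMILES', activity_column='Solubility')
clean_df = sanitizer.process_data()
clean_df.to_csv('../_data/curated-solubility-dataset_processed.csv', index=False)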