CP-MLR/PLS directed quantitative structure-activity relationship study on the CDK8 inhibitory activity: The derivatives of naphthyridine and isoquinoline

The CDK8 and 7dF3 inhibition activities of naphthyridine and isoquinoline derivatives have been quantitatively analyzed in terms of Dragon descriptors. The statistically validated quantitative structure-activity relationship (QSAR) models provided rationales to explain the inhibition activities of these congeners. The descriptors identified through CP-MLR analysis for the CDK8 inhibitory activity have highlighted the role of the sum of the topological distances between N..N (T(N..N)), the distance/detour ring index of order 6 (D/Dr06), the aromatic ratio (ARR), the number of double bonds (nDB), the number of 6-membered rings (nR06), the number of sulphur atoms (nS), the number of unsubstituted sp2 hybridized aromatic carbon atoms (nCaH), the number of aliphatic N hydrazines (nN-N), the number of aromatic primary amines (nNH2Ph) and certain atom centered fragments such as R--CH--R (C-024), X--CR..X (C-034), H attached to C1(sp3)/C0(sp2) (H-047) and RCO-N</>N-X=X (N-072) in quantifying the inhibitory actions. The highest eigenvalue n.2 of the Burden matrix weighted by atomic polarizabilities (BEHp2) and the Geary autocorrelation of lag 8 weighted by atomic Sanderson electronegativities (GATS8e) were also prominent in modeling the CDK8 inhibitory activity. PLS analysis corroborated the dominance of the CP-MLR identified descriptors. Applicability domain analysis revealed that the suggested model matches the high quality parameters with good fitting power and the capability of assessing external data; all of the compounds were within the applicability domain of the proposed model and were evaluated correctly. The models obtained from the descriptor pool chosen for the CDK8 inhibitory activity are able to explain nearly 74% of the variance in the observed 7dF3 activities of the titled compounds.


Introduction
The pivotal role played by cyclin-dependent kinases (CDKs) in the regulation of cell progression and gene transcription makes them promising therapeutic targets for various diseases [1]. In recent decades, intensive efforts have been made in the search for novel and potent CDK inhibitors. CDK8, a transcriptional CDK and a unique member of the CDK family, is a component of the mediator complex. By participating in a myriad of signaling pathways, CDK8 regulates the transcription of related genes in a context-specific way [2]. The implication of CDK8 as an oncogene in colorectal and gastric cancers is due to the activation of WNT signaling [3][4][5][6]. Because of the association of CDK8 with the development and progression of cancers, its selective inhibition has been regarded as a promising approach for cancer therapy. A variety of small molecule modulators of CDK8 have recently been reported; these include Sorafenib [7], Senexin A [8], Cortistatin A [9] and 6-azabenzothiophene derivatives [10]. The optimization of a 3,4,5-trisubstituted pyridine series led to the potent, selective and orally bioavailable dual CDK8/19 ligands CCT251545 and CCT251921 [11][12][13]. Mallinger et al. have reported a novel series of 2,8-disubstituted-1,6-naphthyridines and 4,6-disubstituted-isoquinolines as dual CDK8/19 ligands obtained by means of a scaffold-hop approach [14].
The aim of the present communication is to establish quantitative relationships between the reported activities and the molecular descriptors that capture the substitutional changes in the titled compounds.

Data-set
For the present work, the fifty reported naphthyridines and isoquinolines have been considered as the data set [14]. The general structure of these compounds is represented in Figure 1 and the structural variations are given in Table 1.

Figure 1 General structure of naphthyridine (X=N) and isoquinoline analogues
These derivatives were evaluated for their inhibition of CDK8 and of WNT signaling in a luciferase reporter assay in HEK293 7dF3 cells. Both inhibition activities are also reported in Table 1 [14]. The inhibition activity, IC50, represents the concentration of a compound required to achieve 50% inhibition of CDK8 and 7dF3. It is expressed as pIC50 on a molar basis and is considered as the dependent variable for the present quantitative analysis. The initial assessment of activity against all descriptors suggested compound 23 as a potential outlier. An outlier to a QSAR can indicate the limits of applicability of QSAR models. This outlier was excluded from the data set, leaving 49 compounds. The data set was sub-divided into a training set to develop models and a test set to validate the models externally. The test-set compounds, which were selected using an in-house randomization program, are also indicated in Table 1.
Table 1 footnotes: ᵃ Reference [14]; IC50 represents the concentration of a compound required to bring about 50% inhibition of CDK8 and 7dF3. ᵇ Compound included in the test set. ᶜ "Outlier" compound not included in the data set.
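The conversion from IC50 to pIC50 on a molar basis can be illustrated with a minimal sketch. The function name `pic50_from_ic50` and the unit table are hypothetical; the source does not state the units in which the IC50 values of Table 1 are reported, so the unit must be supplied by the user.

```python
from math import log10

# Conversion factors to mol/L (assumed unit labels, not taken from the paper)
UNIT_TO_MOLAR = {"M": 1.0, "mM": 1e-3, "uM": 1e-6, "nM": 1e-9}

def pic50_from_ic50(ic50, unit="nM"):
    """Return pIC50 = -log10(IC50 in mol/L)."""
    if ic50 <= 0:
        raise ValueError("IC50 must be positive")
    return -log10(ic50 * UNIT_TO_MOLAR[unit])

# Illustrative values only (not measurements from Table 1)
for value, unit in [(10.0, "nM"), (1.0, "uM")]:
    print(value, unit, "->", round(pic50_from_ic50(value, unit), 2))
```

A lower IC50 therefore maps to a higher pIC50, so a positive regression coefficient in a pIC50 model corresponds to increased potency.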

Molecular descriptors
The structures of the compounds under study (Table 1) were drawn in 2D in ChemDraw [15] and converted into 3D objects using the default conversion procedure implemented in CS Chem3D Ultra. The generated 3D structures were subjected to energy minimization in the MOPAC module, using the AM1 procedure for closed shell systems, as implemented in CS Chem3D Ultra. This ensures a well defined conformer relationship across the compounds of the study. All the energy-minimized structures were then ported to the DRAGON software [16] for computing the descriptors corresponding to the 0D-, 1D- and 2D-classes.

Development and validation of model
The combinatorial protocol in multiple linear regression (CP-MLR) [17][18][19][20][21] and partial least squares (PLS) [22][23][24] procedures have been used in the present work for developing QSAR models. CP-MLR is a "filter"-based variable selection procedure, which employs a combinatorial strategy with MLR to produce selected subset regressions for the extraction of diverse structure-activity models, each having a unique combination of descriptors from the generated dataset of the compounds under study. The embedded filters make the variable selection process efficient and lead to a unique solution. A risk of "chance correlations" exists where large descriptor pools are used in multilinear QSAR/QSPR studies [25,26]. To detect any chance correlation associated with the models recognized in CP-MLR, each cross-validated model has therefore been put to a randomization test [27,28] by repeated randomization of the activity. For this, every model has been subjected to 100 simulation runs with scrambled activity. The scrambled activity models with regression statistics better than or equal to those of the original activity model have been counted to express the percent chance correlation of the model under scrutiny.
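The randomization test described above can be sketched as follows. This is a minimal illustration and not the authors' program: it assumes that r² is the regression statistic being compared, uses a plain normal-equation least-squares fit, and runs on made-up data. All function names are hypothetical.

```python
import random

def solve_linear(a, b):
    """Solve a.x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def fit_ols(X, y):
    """Least-squares coefficients [intercept, b1, ...] via the normal equations."""
    A = [[1.0] + list(row) for row in X]
    p = len(A[0])
    ata = [[sum(r[i] * r[j] for r in A) for j in range(p)] for i in range(p)]
    aty = [sum(r[i] * yi for r, yi in zip(A, y)) for i in range(p)]
    return solve_linear(ata, aty)

def r_squared(X, y):
    """Coefficient of determination of the fitted model."""
    coef = fit_ols(X, y)
    mean_y = sum(y) / len(y)
    pred = [coef[0] + sum(c * v for c, v in zip(coef[1:], row)) for row in X]
    ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, pred))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot

def percent_chance_correlation(X, y, runs=100, seed=0):
    """Percent of scrambled-activity models with r2 >= that of the original model."""
    rng = random.Random(seed)
    reference = r_squared(X, y)
    hits = 0
    for _ in range(runs):
        scrambled = y[:]
        rng.shuffle(scrambled)  # break the descriptor-activity pairing
        if r_squared(X, scrambled) >= reference:
            hits += 1
    return 100.0 * hits / runs

# Made-up scaled descriptors and activities (not data from the paper)
X = [[0.1, 0.9], [0.2, 0.4], [0.3, 0.7], [0.4, 0.1], [0.5, 0.6],
     [0.6, 0.2], [0.7, 0.8], [0.8, 0.3], [0.9, 0.5], [1.0, 0.0]]
y = [5.0 + 2.0 * a - 1.0 * b + 0.05 * ((i % 3) - 1) for i, (a, b) in enumerate(X)]
print("percent chance correlation:", percent_chance_correlation(X, y))
```

A model with a genuine descriptor-activity relationship is expected to give a percentage near zero, whereas a value well above zero would suggest that a comparable fit can arise by chance.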
Validation of the derived model is necessary to test its prediction and generalization within the study domain. For each model, derived from n data points, a number of statistical parameters such as r (the multiple correlation coefficient), s (the standard deviation), F (the F ratio between the variances of calculated and observed activities) and Q²LOO (the cross-validated index from the leave-one-out procedure) have been obtained to assess its overall statistical significance. In internal validation, Q²LOO is used as a criterion of both the robustness and the predictive ability of the model. A Q² value greater than 0.5 suggests a statistically significant model. The predictive power of the derived model is judged on the test-set compounds. The model obtained from the training set has reliable predictive power if the value of r²Test (the squared correlation coefficient between the observed and predicted values of the test-set compounds) is greater than 0.5.
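The two validation indices can be illustrated with a short sketch. It assumes the common definition Q²LOO = 1 − PRESS/SS_total (the source does not spell out the formula) and takes r²Test as the squared Pearson correlation between observed and predicted test-set values, as defined above. The function names and the data are hypothetical.

```python
def solve_linear(a, b):
    """Solve a.x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def fit_ols(X, y):
    """Least-squares coefficients [intercept, b1, ...] via the normal equations."""
    A = [[1.0] + list(row) for row in X]
    p = len(A[0])
    ata = [[sum(r[i] * r[j] for r in A) for j in range(p)] for i in range(p)]
    aty = [sum(r[i] * yi for r, yi in zip(A, y)) for i in range(p)]
    return solve_linear(ata, aty)

def predict(coef, row):
    return coef[0] + sum(c * v for c, v in zip(coef[1:], row))

def q2_loo(X, y):
    """Leave-one-out cross-validated index: 1 - PRESS / SS_total."""
    mean_y = sum(y) / len(y)
    press = 0.0
    for i in range(len(y)):
        coef = fit_ols(X[:i] + X[i + 1:], y[:i] + y[i + 1:])  # refit without compound i
        press += (y[i] - predict(coef, X[i])) ** 2
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - press / ss_tot

def r2_test(coef, X_test, y_test):
    """Squared Pearson correlation between observed and predicted test-set values."""
    pred = [predict(coef, row) for row in X_test]
    mo, mp = sum(y_test) / len(y_test), sum(pred) / len(pred)
    cov = sum((o - mo) * (p - mp) for o, p in zip(y_test, pred))
    var_o = sum((o - mo) ** 2 for o in y_test)
    var_p = sum((p - mp) ** 2 for p in pred)
    return cov * cov / (var_o * var_p)

# Made-up one-descriptor example (not data from the paper)
X_train = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]
y_train = [1.1, 2.9, 5.2, 6.8, 9.1, 10.9]
X_test, y_test = [[1.5], [2.5], [3.5]], [4.1, 5.9, 8.2]

model = fit_ols(X_train, y_train)
print("Q2 LOO :", round(q2_loo(X_train, y_train), 3))
print("r2 Test:", round(r2_test(model, X_test, y_test), 3))
```

Both indices would then be compared against the 0.5 threshold stated above.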

Applicability Domain
The utility of a QSAR model is based on its ability to predict new compounds accurately. A model is valid only within its training domain, and new compounds must be assessed as belonging to the domain before the model is applied. The applicability domain is assessed by the leverage value of each compound [29]. The Williams plot (the plot of standardized residuals versus leverage values, h) can then be used for an immediate and simple graphical detection of both the response outliers (Y outliers) and the structurally influential chemicals (X outliers) in the model. In this plot, the applicability domain is established inside a squared area within ±x standard deviations (s.d.) and a leverage threshold h*. The threshold h* is generally fixed at 3(k + 1)/n (n is the number of training-set compounds and k is the number of model parameters), whereas x = 2 or 3. Predictions must be considered unreliable for compounds with a high leverage value (h > h*). On the other hand, when the leverage value of a compound is lower than the threshold value, the probability of accordance between predicted and observed values is as high as that for the training-set compounds.
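As a worked illustration of the leverage criterion, the sketch below computes h = xᵀ(XᵀX)⁻¹x for each compound and the threshold h* = 3(k + 1)/n. With k = 5 descriptors and n = 35 training compounds the formula gives 18/35 ≈ 0.514, which agrees with the h* value quoted for the Williams plot of this study. The inclusion of an intercept column in X, the function names and the example data are assumptions.

```python
def solve_linear(a, b):
    """Solve a.x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def leverage_threshold(k, n):
    """Warning leverage h* = 3(k + 1)/n."""
    return 3.0 * (k + 1) / n

def leverages(X_train, X_query):
    """h = x^T (X^T X)^-1 x for each query row, with an intercept column added."""
    A = [[1.0] + list(row) for row in X_train]
    p = len(A[0])
    xtx = [[sum(r[i] * r[j] for r in A) for j in range(p)] for i in range(p)]
    result = []
    for row in X_query:
        q = [1.0] + list(row)
        z = solve_linear(xtx, q)  # z = (X^T X)^-1 q without forming the inverse
        result.append(sum(qi * zi for qi, zi in zip(q, z)))
    return result

print("h* for k=5, n=35:", round(leverage_threshold(5, 35), 3))

# Made-up two-descriptor training set (not data from the paper)
X = [[0.1, 0.9], [0.2, 0.4], [0.3, 0.7], [0.4, 0.1], [0.5, 0.6], [0.6, 0.2]]
h_star = leverage_threshold(k=2, n=len(X))
for row, h in zip(X, leverages(X, X)):
    print(row, round(h, 3), "outside AD" if h > h_star else "inside AD")
```

The training-set leverages sum to k + 1 (the trace of the hat matrix), which provides a quick sanity check on any implementation.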

QSAR results
For the compounds in Table 1, a total of 506 descriptors belonging to the 0D- to 2D-classes of DRAGON have been computed. Prior to model development, all descriptors that were inter-correlated beyond 0.90 or that showed a correlation of less than 0.1 with the biological endpoints (descriptor versus activity, r < 0.1) were excluded. This procedure reduced the number of descriptors from 506 to 107 relevant ones, which were subjected to CP-MLR analysis with the default "filters" set in it. The descriptors have been scaled to the interval 0 to 1 [30] to ensure that a descriptor does not dominate simply because it has larger or smaller pre-scaled values than the other descriptors. In this way, the scaled descriptors have equal potential to influence the QSAR models.
In a multi-descriptor class environment, exploring the best model equation(s) along the descriptor classes provides an opportunity to unravel the phenomenon under investigation. In other words, the concepts embedded in the descriptor classes relate to the biological actions shown by the compounds.
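The descriptor pre-treatment described above can be sketched as follows. The source does not state which member of an inter-correlated pair is retained; this sketch assumes the descriptor with the higher absolute correlation to the activity is kept. The function names and the data are hypothetical.

```python
from math import sqrt

def pearson(a, b):
    """Pearson correlation coefficient; returns 0.0 for a constant column."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    if va == 0.0 or vb == 0.0:
        return 0.0
    return cov / sqrt(va * vb)

def prefilter(descriptors, activity, inter_cut=0.90, activity_cut=0.1):
    """Drop descriptors weakly correlated with activity, then prune inter-correlated ones."""
    relevant = {name: vals for name, vals in descriptors.items()
                if abs(pearson(vals, activity)) >= activity_cut}
    # Assumption: visit descriptors from most to least activity-correlated
    ranked = sorted(relevant, key=lambda nm: abs(pearson(relevant[nm], activity)),
                    reverse=True)
    kept = []
    for name in ranked:
        if all(abs(pearson(relevant[name], relevant[k])) <= inter_cut for k in kept):
            kept.append(name)
    return {name: relevant[name] for name in kept}

def minmax_scale(values):
    """Scale a descriptor column to the interval 0 to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]

# Made-up example (not data from the paper)
activity = [5.0, 5.5, 6.0, 6.5, 7.0]
descriptors = {
    "d1": [1.0, 2.0, 3.0, 4.0, 5.0],    # strongly related to activity
    "d2": [2.0, 4.0, 6.0, 8.0, 10.1],   # nearly duplicates d1
    "d3": [1.0, -1.0, 0.0, -1.0, 1.0],  # uncorrelated with activity
    "d4": [3.0, 1.0, 4.0, 1.0, 5.0],    # weakly related, independent of d1
}
selected = prefilter(descriptors, activity)
scaled = {name: minmax_scale(vals) for name, vals in selected.items()}
print(sorted(scaled))
```

In this toy pool, d3 is removed for its negligible correlation with the activity and d2 for its redundancy with d1.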
The 49 compounds were divided into a training set and a test set. Fourteen compounds (nearly 30% of the total population) were selected for the test set. The identified test set was then used for external validation of the models derived from the remaining thirty-five compounds in the training set. The squared correlation coefficient between the observed and predicted values of the test-set compounds, r²Test, was calculated to express the fraction of explained variance in the test set, which is not part of the regression/model derivation. It is a measure of the goodness of the derived model equation. A high r²Test value is always desirable, but considering the stringency of test-set procedures, r²Test values in the range of 0.5 to 0.6 are often regarded as acceptable. Following the strategy of exploring only predictive models, CP-MLR resulted in 8, 44 and 18 models in two, three and four descriptors, respectively. The generated models in two and three descriptors for the CDK8 inhibitory activity all had r²Test < 0.5. The selected models are given in Table 2. The signs of the regression coefficients indicate the direction of influence of the explanatory variables in these models. A positive regression coefficient associated with a descriptor augments the activity profile of a compound, while a negative coefficient has a detrimental effect on it.
In the above model Eqs. (1) to (3), the descriptor T(N..N) is a topological class descriptor. The other participating descriptors are BEHp2 (BCUT class descriptor), nDB and nS (constitutional class descriptors), N-072 and C-024 (atom centered fragments) and nCaH (functional group). The positive sign of the regression coefficients of nDB (number of double bonds), nCaH (number of unsubstituted sp2 hybridized aromatic carbon atoms) and N-072 (presence of an RCO-N</>N-X=X type atom centered fragment) suggests that a higher value of these descriptors would be beneficial in augmenting the CDK8 inhibitory activity. On the other hand, a lower value of the descriptors T(N..N) (sum of the topological distances between N..N), nS (number of sulfur atoms), C-024 (R--CH--R) and BEHp2 (highest eigenvalue n.2 of the Burden matrix weighted by atomic polarizabilities) would be supportive of CDK8 inhibition.
Considering the number of observations in the dataset, models with up to five descriptors were explored. This resulted in 18 five-parameter models with test-set r² > 0.50. These models (from the 107 descriptors) were identified in CP-MLR by successively incrementing filter-3 with an increasing number of descriptors per equation. For this, the optimum rbar value of the preceding level model (= 0.867) has been used as the new threshold of filter-3 for the next generation. These models share 16 descriptors among them. All 16 descriptors, along with their brief meaning, average regression coefficients and total incidence, are listed in Table 3, which serves as a measure of their estimate across these models.
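The combinatorial, filter-based search can be illustrated with a heavily simplified sketch. It is not the CP-MLR program itself: it assumes that filter-1 is a cutoff on the inter-descriptor correlation within a subset and that filter-3 is a threshold on rbar (taken here as the square root of the adjusted r²), and it omits filter-2 and filter-4 entirely. The source names the filters and their settings but does not define them, so these meanings, the function names and the data are assumptions.

```python
from itertools import combinations
from math import sqrt

def solve_linear(a, b):
    """Solve a.x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def fit_ols(X, y):
    """Least-squares coefficients [intercept, b1, ...] via the normal equations."""
    A = [[1.0] + list(row) for row in X]
    p = len(A[0])
    ata = [[sum(r[i] * r[j] for r in A) for j in range(p)] for i in range(p)]
    aty = [sum(r[i] * yi for r, yi in zip(A, y)) for i in range(p)]
    return solve_linear(ata, aty)

def pearson(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / sqrt(va * vb)

def subset_search(descriptors, y, size, filter1=0.79, filter3=0.867):
    """Return (subset, rbar) for every descriptor subset passing both filters."""
    n = len(y)
    mean_y = sum(y) / n
    ss_tot = sum((v - mean_y) ** 2 for v in y)
    models = []
    for subset in combinations(sorted(descriptors), size):
        # Assumed filter-1: reject subsets containing inter-correlated descriptors
        if any(abs(pearson(descriptors[a], descriptors[b])) > filter1
               for a, b in combinations(subset, 2)):
            continue
        X = [[descriptors[name][i] for name in subset] for i in range(n)]
        coef = fit_ols(X, y)
        pred = [coef[0] + sum(c * v for c, v in zip(coef[1:], row)) for row in X]
        r2 = 1.0 - sum((yi - pi) ** 2 for yi, pi in zip(y, pred)) / ss_tot
        adj = 1.0 - (1.0 - r2) * (n - 1) / (n - size - 1)
        rbar = sqrt(max(adj, 0.0))
        # Assumed filter-3: keep only models reaching the rbar threshold
        if rbar >= filter3:
            models.append((subset, rbar))
    return models

# Made-up scaled descriptors (not data from the paper); d3 nearly duplicates d1
pool = {
    "d1": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
    "d2": [0.9, 0.4, 0.7, 0.1, 0.6, 0.2, 0.8, 0.3, 0.5, 0.0],
    "d3": [0.12, 0.19, 0.31, 0.42, 0.49, 0.61, 0.72, 0.79, 0.91, 1.02],
}
y = [5.0 + 2.0 * a - 1.0 * b + 0.05 * ((i % 3) - 1)
     for i, (a, b) in enumerate(zip(pool["d1"], pool["d2"]))]
found = subset_search(pool, y, size=2)
for subset, rbar in found:
    print(subset, round(rbar, 3))
```

Raising `filter3` to the best rbar of the preceding level before searching subsets with one more descriptor mirrors the successive incrementing described above.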
Table 3 Identified descriptorsᵃ along with their physical meaning, average regression coefficient and incidenceᵇ, in modeling the CDK8 inhibitory activities
ᵃ The descriptors are identified from the four parameter models for activity that emerged from the CP-MLR protocol with filter-1 as 0.79, filter-2 as 2.0, filter-3 as 0.867 and filter-4 as 0.3 ≤ q² ≤ 1.0, with a training set of 35 compounds.
ᵇ The average regression coefficient of the descriptor corresponding to all models and the total number of its incidence. The arithmetic sign of the coefficient represents the actual sign of the regression coefficient in the models.
Following are the selected five-descriptor models for the CDK8 inhibitory activities that emerged through CP-MLR. These models account for nearly 84% of the variance in the observed activities. In the randomization study (100 simulations per model), none of the identified models showed any chance correlation. Values of the Q² index greater than 0.5 are in accordance with a reasonably robust QSAR model. The pIC50 values of the training-set compounds calculated using Eqs. (4) to (7) are included in Table 4. The models (4) to (7) are validated with an external test set of 14 compounds listed in Table 4. The predictions for the test-set compounds are found to be satisfactory, as reflected in the test-set r² (r²Test) values reported in Table 4. The plot showing the goodness of fit between observed and calculated activities for the training- and test-set compounds is given in Figure 2.
Table 4 Observed and calculated CDK8 and 7dF3 inhibition activities of naphthyridine and isoquinoline derivatives

Figure 2 Plot of observed versus calculated pIC50 values for training- and test-set compounds for CDK8 inhibition
The newly appearing descriptors in the above models are nR06 (constitutional class), ARR (empirical class) and nN-N (functional group class). The descriptor ARR is positively correlated with the activity, whereas the descriptors nR06 and nN-N are negatively correlated with it. The signs of the regression coefficients suggest that a higher aromatic ratio (descriptor ARR), fewer 6-membered rings (descriptor nR06) and fewer aliphatic N hydrazines (descriptor nN-N) in a molecular structure would help to augment the CDK8 inhibitory activity.
A partial least squares (PLS) analysis has been carried out on the 16 CP-MLR identified descriptors listed in Table 3 to facilitate the development of a "single window" structure-activity model. For the purpose of PLS, the descriptors have been autoscaled (zero mean and unit SD) to give each of them equal weight in the analysis. In the PLS cross-validation, four components were found to be the optimum for these 16 descriptors, and they explained 86.86% of the variance in the activity. The MLR-like PLS coefficients of these 16 descriptors are given in Table 5. For the sake of comparison, the plot showing the goodness of fit between observed and calculated activities (through PLS analysis) for the training- and test-set compounds is also given in Figure 2. Figure 3 shows a plot of the fraction contribution of the normalized regression coefficients of these descriptors to the activity.
The PLS analysis suggested N-072 as the most determining descriptor for modeling the activity of the compounds (descriptor S. No. 15 in Table 5; Figure 3). The other nine descriptors, in decreasing order of significance, are nNH2Ph, nDB, nN-N, GATS8e, nR06, nCaH, C-034, H-047 and D/Dr06. The descriptors nDB, nN-N, nR06 and nCaH are part of Eqs. (1) to (7) and convey the same inference in the PLS model as well. The topological class descriptor D/Dr06, the distance/detour ring index of order 6, indicates that a lower value of it would be beneficial to the activity. The negative influence of the descriptors nNH2Ph (number of aromatic primary amines), C-034 (X--CR..X) and H-047 (H attached to C1(sp3)/C0(sp2)) recommends the absence of such functionality or fragments in a compound for improved activity. The positive regression coefficient of the Geary autocorrelation of lag 8 weighted by atomic Sanderson electronegativities (descriptor GATS8e) indicates that a higher positive value of it increases the activity. It is also observed that the PLS model from the dataset devoid of the 16 CP-MLR identified descriptors (Table 3) is inferior in explaining the activity of the analogues.
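The autoscaling step and the fraction contribution of normalized coefficients can be illustrated with a minimal sketch. The source does not give the formula for the fraction contribution; the sketch assumes it is each normalized coefficient divided by the sum of the absolute values of all normalized coefficients, with the sign retained, and it assumes the sample standard deviation (n − 1) for autoscaling. The names and coefficient values are made up and are not those of Table 5.

```python
from math import sqrt

def autoscale(values):
    """Centre a descriptor column to zero mean and scale it to unit sample SD."""
    n = len(values)
    mean = sum(values) / n
    sd = sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / sd for v in values]

def fraction_contribution(normalized_coefs):
    """Signed share of each coefficient in the total absolute coefficient sum."""
    total = sum(abs(c) for c in normalized_coefs.values())
    return {name: c / total for name, c in normalized_coefs.items()}

# Hypothetical normalized coefficients for three placeholder descriptors
coefs = {"desc_a": 2.0, "desc_b": -1.0, "desc_c": 1.0}
fc = fraction_contribution(coefs)

# Rank descriptors by the magnitude of their contribution
for name in sorted(fc, key=lambda nm: abs(fc[nm]), reverse=True):
    print(name, round(fc[name], 2))

print(autoscale([1.0, 2.0, 3.0]))
```

Because the coefficients refer to autoscaled descriptors, their magnitudes are directly comparable, which is what allows a ranking of descriptors by significance.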
CP-MLR analysis has also been carried out for the other reported inhibition activity, 7dF3, using the same descriptor pool and test set. Following are the selected five-descriptor models for the 7dF3 inhibitory activities that emerged through CP-MLR. Except for the descriptors BEHe1, BEHm3, nBM and GGI2, all the descriptors participating in models (8) to (11) are part of the earlier discussed models (1) to (7) and convey the same inference. It is evident from these models that the descriptors BEHe1, BEHm3, nBM and GGI2 contribute positively to the activity. Thus a higher value of the descriptors BEHe1 (highest eigenvalue n.1 of the Burden matrix weighted by atomic Sanderson electronegativities), BEHm3 (highest eigenvalue n.3 of the Burden matrix weighted by atomic masses), nBM (number of multiple bonds) and GGI2 (Galvez topological charge index of 2nd order) will be supportive in enhancing the 7dF3 inhibition activity.
These models account for nearly 74% of the variance in the observed activities. Values of the Q² index greater than 0.5 are in accordance with a reasonably robust QSAR model. The pIC50 values of the training-set compounds calculated using Eqs. (8) to (11) are included in Table 4. The models (8) to (11) are validated with an external test set of 5 compounds listed in Table 4. The predictions for the test-set compounds are found to be satisfactory, as reflected in the test-set r² (r²Test) values reported in Table 4. The plot showing the goodness of fit between observed and calculated activities for the training- and test-set compounds is given in Figure 4.

Applicability domain
On analyzing the applicability domain (AD) for the CDK8 inhibitory actions in the Williams plot (Figure 5) of the model based on the whole data set (Table 5), no compound was identified as an obvious outlier for the CDK8 inhibitory activity when the limit of normal values for the Y outliers (response outliers) was set at 2.5 × (standard deviation) units. None of the compounds had a leverage (h) value greater than the threshold leverage (h*). For both the training set and the test set, the suggested model matches the high quality parameters with good fitting power and the capability of assessing external data. Furthermore, all of the compounds were within the applicability domain of the proposed model and were evaluated correctly.

Conclusion
The CDK8 and 7dF3 inhibition activities of naphthyridine and isoquinoline derivatives have been quantitatively analyzed in terms of Dragon descriptors. The statistically validated quantitative structure-activity relationship (QSAR) models provided rationales to explain the inhibition activities of these congeners. The descriptors identified through CP-MLR analysis for the CDK8 inhibitory activity have highlighted the role of the sum of the topological distances between N..N (T(N..N)), the distance/detour ring index of order 6 (D/Dr06), the aromatic ratio (ARR), the number of double bonds (nDB), the number of 6-membered rings (nR06), the number of sulphur atoms (nS), the number of unsubstituted sp2 hybridized aromatic carbon atoms (nCaH), the number of aliphatic N hydrazines (nN-N), the number of aromatic primary amines (nNH2Ph) and certain atom centered fragments such as R--CH--R (C-024), X--CR..X (C-034), H attached to C1(sp3)/C0(sp2) (H-047) and RCO-N</>N-X=X (N-072) in quantifying the inhibitory actions. The highest eigenvalue n.2 of the Burden matrix weighted by atomic polarizabilities (BEHp2) and the Geary autocorrelation of lag 8 weighted by atomic Sanderson electronegativities (GATS8e) were also prominent in modeling the CDK8 inhibitory activity. PLS analysis corroborated the dominance of the CP-MLR identified descriptors. Applicability domain analysis revealed that the suggested model matches the high quality parameters with good fitting power and the capability of assessing external data; all of the compounds were within the applicability domain of the proposed model and were evaluated correctly. The models obtained from the descriptor pool chosen for the CDK8 inhibitory activity are able to explain nearly 74% of the variance in the observed 7dF3 activities of the titled compounds.

Figure 3 Plot of fraction contribution of MLR-like PLS coefficients (normalized) against 16 CP-MLR identified descriptors (Table 3) associated with CDK8 inhibitory activity of naphthyridine and isoquinoline derivatives

Figure 4 Plot of observed versus calculated pIC50 values for training- and test-set compounds for 7dF3 inhibition

Figure 5 Williams plot for the training set and test set for CDK8 inhibition activity of the compounds in Table 1. The horizontal dotted line refers to the residual limit (±3 × standard deviation) and the vertical dotted line represents the threshold leverage h* (= 0.514)

Table 2 Highest significant models in two, three and four parameters derived for the training set through CP-MLR for CDK8 inhibitory activity

Table 5 PLS and MLR-like PLS models from the 16 descriptors of five parameter CP-MLR models for CDK8 inhibitory activities
ᵃ Regression coefficient of PLS factor and its standard error. ᵇ Coefficients of the MLR-like PLS equation in terms of descriptors for their original values. ᶜ f.c. is the fraction contribution of the regression coefficient, computed from the normalized regression coefficients obtained from the autoscaled (zero mean and unit s.d.) data.

Table 5 Models derived for the whole data set (n = 49) in descriptors identified through CP-MLR for CDK8 inhibitory actions