As practice with gradient boosted trees, I used an open dataset of gene expression in tumour cell lines to predict resistance to chemotherapy drugs.
Published
2024-04-02
Show the code
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import lightgbm
import shap
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
from sklearn.linear_model import LinearRegression
from re import sub
This is a bit of work I did during my postdoc at the University of Nottingham. There’s not enough here for a publication, and it was a bit out of the scope of the project, so I’m writing it up here. I’m not reproducing the entirety of this work here, mostly because running the whole thing takes ages. The full workflow is still in the code; I have just commented parts of it out. If you do want to run the full analysis yourself, you can uncomment these lines and change a few variable names.
I went to a conference1 and heard a talk2 in which the speaker described using a dataset of tumour cell lines to identify proteins important in cancer, and so pick a target to study. The Genomics of Drug Sensitivity in Cancer (GDSC) dataset is a massive project testing the sensitivity of tumour cell lines to different drugs. Part of the dataset is a table describing how sensitive cell lines are to each of a few hundred drugs.
Show the code
# It seems that previous releases get moved, so I'm just downloading it locally instead of getting it via ftp
#ic50_data = pd.read_excel("ftp://ftp.sanger.ac.uk/pub/project/cancerrxgene/releases/current_release/GDSC1_fitted_dose_response_25Feb20.xlsx")
ic50_data = pd.read_excel("data/GDSC1_fitted_dose_response_27Oct23.xlsx")
ic50_data.drop(columns = ['NLME_RESULT_ID', 'NLME_CURVE_ID', 'SANGER_MODEL_ID', 'TCGA_DESC', 'DRUG_ID',
                          'PUTATIVE_TARGET', 'PATHWAY_NAME', 'COMPANY_ID', 'WEBRELEASE', 'MIN_CONC',
                          'MAX_CONC', 'AUC', 'RMSE'], inplace=True)
ic50_data.head()
   DATASET  COSMIC_ID  CELL_LINE_NAME  DRUG_NAME   LN_IC50    Z_SCORE
0  GDSC1    684057     ES5             Erlotinib  3.966813   1.299144
1  GDSC1    684059     ES7             Erlotinib  2.692090   0.156076
2  GDSC1    684062     EW-11           Erlotinib  2.477990  -0.035912
3  GDSC1    684072     SK-ES-1         Erlotinib  2.033564  -0.434437
4  GDSC1    687448     COLO-829        Erlotinib  2.966007   0.401702
I’m just using the first version of the dataset here, as shown by “GDSC1” in the DATASET column. There is a GDSC2 that could be added, but we have enough to be getting on with. The COSMIC_ID and CELL_LINE_NAME columns identify the cell line used in the experiment. DRUG_NAME should be fairly obvious.
The last two columns might be a bit less familiar. To explain these, I will have to be a little more precise about what “sensitivity” means. To create this dataset, the experimenters grew lots of samples of each cell line, and added different concentrations of drugs to the samples. They then let them grow for a bit, and counted how many cells were still viable (alive and able to keep going). They then analysed this data to see how much each concentration of the drug slowed the growth of cells, and fitted a dose-response curve to estimate the concentration of the drug that inhibited growth by 50%, a measure called the IC50. This measure is best described using a natural logarithm, so it is reported as LN_IC50. If you want to get a better feel for dose-response curves, I have a toy here. The Z_SCORE describes how far that cell line’s LN_IC50 sits from the mean across all the cell lines tested with that drug, in units of standard deviations.
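To make the IC50 idea a little more concrete, here’s a minimal sketch of fitting a dose-response curve to some made-up viability data. The four-parameter logistic form and all the names and numbers here are my own choices for illustration, not the GDSC fitting pipeline.

import numpy as np
from scipy.optimize import curve_fit

# Four-parameter logistic (Hill) curve: viability as a function of log concentration
def dose_response(log_conc, top, bottom, log_ic50, hill):
    return bottom + (top - bottom) / (1 + np.exp(hill * (log_conc - log_ic50)))

# Made-up example data: viability falls as drug concentration rises
log_conc = np.log([0.01, 0.1, 1, 10, 100])
viability = np.array([0.98, 0.95, 0.60, 0.15, 0.05])

# Fit the curve; log_ic50 is the (natural) log concentration at half-maximal inhibition
params, _ = curve_fit(dose_response, log_conc, viability, p0=[1.0, 0.0, 0.0, 1.0])
print(f"Estimated ln(IC50): {params[2]:.2f}")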
Another part of the dataset is a measure of how genes are expressed in each cell line. For each of the cell lines in the dataset, the mRNA for each gene has been measured using a microarray.
Show the code
# I did have it downloading directly from the source, but I've had failures retrieving it and the IC50 data, so I'm loading it locally instead
#rma_expr = pd.read_csv("https://www.cancerrxgene.org/gdsc1000/GDSC1000_WebResources//Data/preprocessed/Cell_line_RMA_proc_basalExp.txt.zip", sep = "\t")
rma_expr = pd.read_csv("data/Cell_line_RMA_proc_basalExp.txt", sep="\t")
rma_expr = rma_expr.drop('GENE_title', axis=1)
rma_expr = rma_expr.set_index('GENE_SYMBOLS')
rma_expr.head()
              DATA.906826  DATA.687983  DATA.910927  DATA.1240138  DATA.1240139  DATA.906792  DATA.910688  DATA.1240135  DATA.1290812  DATA.907045  ...  DATA.753584  DATA.907044  DATA.998184  DATA.908145  DATA.1659787  DATA.1298157  DATA.1480372  DATA.1298533  DATA.930299  DATA.905954.1
GENE_SYMBOLS
TSPAN6           7.632023     7.548671     8.712338      7.797142      7.729268     7.074533     3.285198      6.961606      5.943046     3.455951  ...     7.105637     3.236503     3.038892     8.373223      6.932178      8.441628      8.422922      8.089255     3.112333       7.153127
TNMD             2.964585     2.777716     2.643508      2.817923      2.957739     2.889677     2.828203      2.874751      2.686874     3.290184  ...     2.798847     2.745137     2.976406     2.852552      2.622630      2.639276      2.879890      2.521169     2.870468       2.834285
DPM1            10.379553    11.807341     9.880733      9.883471     10.418840     9.773987    10.264385     10.205931     10.299757    11.570155  ...    10.486486    10.442951    10.311962    10.454830     10.418475     11.463742     10.557777     10.792750     9.873902      10.788218
SCYL3            3.614794     4.066887     3.956230      4.063701      4.341500     4.270903     5.968168      3.715033      3.848112     5.560883  ...     3.696835     4.624013     4.348524     3.858121      3.947561      4.425849      3.550390      4.443337     4.266828       4.100493
C1orf112         3.380681     3.732485     3.236620      3.558414      3.840373     3.815055     3.011867      3.268449      3.352835     3.571228  ...     3.726833     3.947744     3.806584     3.196988      3.814831      4.384732      4.247189      3.071359     3.230197       3.435795

5 rows × 1018 columns
In this table, each row is a gene, and each column shows the expression of that gene in a cell line. This data can be linked to the drug sensitivity dataset via the COSMIC_ID column of that table: the column names in the expression table contain the COSMIC ID (DATA.[COSMIC ID]).
We can see how many of the cell lines described in the expression table are present in the sensitivity table.
Show the code
rma_cells = [int(x.split('.')[1]) for x in rma_expr.columns]
cell_id_matches = []
for cell_id in rma_cells:
    if cell_id in ic50_data['COSMIC_ID'].values:
        cell_id_matches.append(cell_id)
print(f'Number of matches: {len(cell_id_matches)} of {len(rma_cells)}')
Number of matches: 946 of 1018
And filter the expression data so we’re only using columns that are relevant to the cell lines of interest.
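The filtering code itself is folded away in the original notebook, so here’s a minimal sketch of what it might look like, reusing cell_id_matches from the matching step above. The variable names here are my guesses rather than the original code.

# Keep only the expression columns whose embedded COSMIC ID also has IC50 data
matched_ids = set(cell_id_matches)
matched_cols = [col for col in rma_expr.columns if int(col.split('.')[1]) in matched_ids]
rma_expr_matched = rma_expr[matched_cols]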
What’s the plan here? A lot of statistical analysis has been carried out on this dataset by bioinformaticians much smarter than I am. What I want to get an idea of is how well this dataset can be used to predict drug sensitivity from expression data. In this first pass, I’m getting a baseline. There are things you can do to improve models like this, which I have carried out, but let’s pretend, for the sake of narrative, that I haven’t yet.
To use this data to train a predictive model, we really want a table where each row is a cell line, and each column is a gene.
Show the code
rma_expr_matched = rma_expr_matched.T
rma_expr_matched['COSMIC_ID'] = [int(x.split('.')[1]) for x in rma_expr_matched.index]
rma_expr_matched = rma_expr_matched.loc[:, rma_expr_matched.columns.notna()]
rma_expr_matched.head()
GENE_SYMBOLS    TSPAN6      TNMD       DPM1     SCYL3  C1orf112       FGR       CFH     FUCA2      GCLC      NFYA  ...     OR1D5    ZNF234      MYH4  LINC00526      PPY2  KRT18P55  POLRMTP1    UBL5P2  TBC1D3P5  COSMIC_ID
DATA.906826   7.632023  2.964585  10.379553  3.614794  3.380681  3.324692  3.566350  8.204530  5.235118  5.369039  ...  3.134197  4.841169  2.628932   6.786925  2.997054  3.331134  3.130696  9.986616  3.073724     906826
DATA.687983   7.548671  2.777716  11.807341  4.066887  3.732485  3.152404  7.827172  6.616972  5.809264  7.209653  ...  3.327528  4.570476  2.783441   5.317911  3.263745  2.992611  3.260982  9.002814  3.000182     687983
DATA.910927   8.712338  2.643508   9.880733  3.956230  3.236620  3.241246  2.931034  8.191246  5.426841  5.120747  ...  3.326309  4.214729  2.603604   3.143006  3.112145  2.886574  3.176239  9.113243  2.916274     910927
DATA.1240138  7.797142  2.817923   9.883471  4.063701  3.558414  3.101247  7.211707  8.630643  5.617714  4.996434  ...  2.921903  4.060761  2.619540   3.153896  3.151576  3.812119  3.074432  9.958284  3.256500    1240138
DATA.1240139  7.729268  2.957739  10.418840  4.341500  3.840373  3.001802  3.375422  8.296950  5.669418  4.180205  ...  3.474086  4.869199  2.450375   3.652660  2.918475  3.412586  3.213545  9.938978  3.396126    1240139

5 rows × 17420 columns
I’m going to use a kind of model called “gradient boosted trees” to predict drug sensitivity here. I won’t go into too much detail on how they work (Google has a course on them here), but essentially they build up a model from a set of small models called decision trees. It starts with a simple decision tree, then adds another simple tree, chosen to best correct the errors of the previous ones. This process is repeated until adding more trees doesn’t improve prediction very much. These models perform really well on small, structured datasets, so are a good choice here. The implementation I’m using is called LightGBM, which, from my understanding, makes models that are nearly as good as gradient boosted trees can be, while being relatively fast to train.
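To illustrate the boosting idea, here’s a toy sketch using plain scikit-learn decision trees. This is just the general principle, not how LightGBM is implemented internally.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def toy_boost(X, y, n_trees=100, learning_rate=0.1):
    """Fit a sequence of small trees, each one to the residuals of the ensemble so far."""
    prediction = np.full(len(y), np.mean(y))   # start from a constant prediction
    trees = []
    for _ in range(n_trees):
        residuals = y - prediction             # what the current ensemble gets wrong
        tree = DecisionTreeRegressor(max_depth=3).fit(X, residuals)
        prediction = prediction + learning_rate * tree.predict(X)  # nudge predictions towards the target
        trees.append(tree)
    return trees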
For each drug, my model_drug function takes the IC50 data matching that drug, then merges that data with matching data from the expression table.
It then splits this into a training and test set. The training set contains 80% of the data. The decision trees are built based on this data, then the model’s performance is assessed on how well it predicts the IC50 in the remaining 20%.
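model_drug itself isn’t reproduced in this post, so here’s a rough sketch of what a function like it could look like based on that description. The exact columns dropped, the LightGBM parameters, and the random seed are assumptions rather than the original code.

def model_drug(drug_name):
    # Take the IC50 rows for this drug and merge them with the expression table on COSMIC_ID
    drug_ic50 = ic50_data[ic50_data['DRUG_NAME'] == drug_name]
    merged = drug_ic50.merge(rma_expr_matched, on='COSMIC_ID')

    # Features are the gene expression columns; the target is ln(IC50)
    X = merged.drop(columns=['DATASET', 'COSMIC_ID', 'CELL_LINE_NAME', 'DRUG_NAME', 'LN_IC50', 'Z_SCORE'])
    y = merged['LN_IC50']

    # 80/20 train/test split, train the gradient boosted tree model, assess on the held-out 20%
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    model = lightgbm.LGBMRegressor()
    model.fit(X_train, y_train)

    predicted = model.predict(X_test)
    mae = mean_absolute_error(y_test, predicted)
    print(f'{drug_name}: MAE = {mae:.3g} (range {y_test.max() - y_test.min():.3g})')
    return model, predicted, y_test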
In my first run at this, I trained a model for each of the drugs in the dataset. To do this, I had to start it when I got in one morning, go do a day’s work in the lab, then come back to collect the results. Here, I’ve just chosen 14 drugs that (spoiler alert) do well here, and one that I know doesn’t. If you want to check my work, the code for running the whole lot is still in there.
For each drug, the Mean Absolute Error of the model and the range of ln(IC50) in the test set are reported.
AZ628: MAE = 1.23 (range 8.49)
WZ3105: MAE = 1.19 (range 8.98)
NPK76-II-72-1: MAE = 1.07 (range 7.88)
Tubastatin A: MAE = 0.858 (range 6.29)
PIK-93: MAE = 1.13 (range 8.7)
Venotoclax: MAE = 0.629 (range 9.47)
Methotrexate: MAE = 1.0 (range 7.75)
Refametinib: MAE = 0.811 (range 8.17)
AZD7762: MAE = 0.785 (range 10.7)
Tanespimycin: MAE = 0.961 (range 7.35)
Nutlin-3a (-): MAE = 0.741 (range 5.03)
Trametinib: MAE = 1.39 (range 10.4)
Dabrafenib: MAE = 1.26 (range 10.3)
SN-38: MAE = 1.01 (range 10.6)
Erlotinib: MAE = 0.699 (range 3.74)
The R² value describes how much of the variance of the dependent variable can be explained by the independent variables. I’m using it here to get an idea of how well the predictions from the models correlate with the observations.
Show the code
def r_squared(predicted, true):
    mean = np.mean(true)
    true_diff_sq = np.square(true - mean)
    pred_diff_sq = np.square(true - predicted)
    return 1 - (np.sum(pred_diff_sq) / np.sum(true_diff_sq))

models_r_sq = dict([(x, r_squared(y[1], y[2])) for x, y in example_models.items()])
[(x, y) for x, y in models_r_sq.items() if y > 0.4]
And here’s Erlotinib, one whose sensitivity can’t be predicted well (in this first go, at least).
Show the code
plot_test('Erlotinib')
The panels on the left of these plots show this comparison. Hopefully, you’ll be wondering what the other panel shows. A neat feature of LightGBM models is that the model can tell you how important different features (independent variables) are when making a prediction. This means that we can get an idea of how the differential expression of particular genes contributes to the sensitivity of tumour cells to drugs3.
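plot_test isn’t shown in this post either; here’s a minimal sketch of what a function like it might do, assuming example_models maps each drug name to a (model, predictions, observed) tuple as in the sketch of model_drug above.

def plot_test(drug_name):
    model, predicted, observed = example_models[drug_name]

    fig, (ax_scatter, ax_importance) = plt.subplots(1, 2, figsize=(12, 5))

    # Left panel: predicted vs observed ln(IC50) in the test set
    ax_scatter.scatter(observed, predicted)
    ax_scatter.set_xlabel('Observed ln(IC50)')
    ax_scatter.set_ylabel('Predicted ln(IC50)')
    ax_scatter.set_title(drug_name)

    # Right panel: the most important features according to the trained model
    importances = pd.Series(model.feature_importances_, index=model.feature_name_)
    importances.nlargest(10).plot.barh(ax=ax_importance)
    ax_importance.set_xlabel('Feature importance')

    plt.tight_layout()
    plt.show()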
To get a little preview of what I’ll do later in this series, there’s a more powerful way of looking at how different variables affect predictions: something called SHAP (SHapley Additive exPlanations) values. I’ll go into these in more detail in another post, but here’s a plot of SHAP values for the model trying to predict AZ628 sensitivity.
Show the code
shap.initjs()
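The code that produced the plot itself is folded away; below is a sketch of how SHAP values for a LightGBM model are typically computed and plotted with the shap library. The names az628_model and az628_X_test (the fitted AZ628 model and its test-set features) are my assumptions about what the hidden code holds.

# Assume az628_model and az628_X_test come from the model_drug step for AZ628
explainer = shap.TreeExplainer(az628_model)
shap_values = explainer.shap_values(az628_X_test)

# Beeswarm summary plot: one point per cell line per feature,
# coloured by feature value and positioned by its SHAP value
shap.summary_plot(shap_values, az628_X_test)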
The colour of a point describes how high the feature (dependent variable) value is, and the x-position is the SHAP value. For example, a high DUSP6 expression value is associated with a more negative SHAP value, meaning a lower IC50, so the cell line is more sensitive to AZ628. There’s a lot more that can be done with these, but I think this is quite long enough for one post.
In subsequent posts, I’ll be aiming to improve model performance through various avenues, and trying to find out whether we can actually make any useful interpretations of these models. They’re quite promising: for Refametinib, the model’s predictions correlate really well with the observed data. The model doesn’t take into account any factors other than changes in expression, so that’s quite impressive!
Footnotes
1. The International Transmembrane Society Symposium
2. by Emily Barnes at the MRC Laboratory of Medical Sciences
3. to be properly rigorous, we can’t really: correlation doesn’t imply causation! Without properly designed experiments this is all speculation