Deep Learning Tutorial¶
In this tutorial we demonstrate how to create a basic deep learning model using train, test, and validation sets. The model uses a subset of the bladderpdo dataset, where transcriptomics and drug structures are the inputs and drug response values are the output. Throughout the tutorial we will be using functions included in coderdata. For more information on the functions used from the package, see the coderdata/API reference.
Import coderdata, pandas, and numpy.
[1]:
import pandas as pd
import numpy as np
import coderdata as cd
Download and View Datasets¶
To view the available datasets within the coderdata API, use the following function.
[2]:
cd.list_datasets()
beataml: Beat acute myeloid leukemia (BeatAML) focuses on acute myeloid leukemia tumor data. Data includes drug response, proteomics, and transcriptomics datasets.
bladderpdo: Tumor Evolution and Drug Response in Patient-Derived Organoid Models of Bladder Cancer Data includes transcriptomics, mutations, copy number, and drug response data.
ccle: Cancer Cell Line Encyclopedia (CCLE).
cptac: The Clinical Proteomic Tumor Analysis Consortium (CPTAC) project is a collaborative network funded by the National Cancer Institute (NCI) focused on improving our understanding of cancer biology through the integration of transcriptomic, proteomic, and genomic data.
ctrpv2: Cancer Therapeutics Response Portal version 2 (CTRPv2)
fimm: Institute for Molecular Medicine Finland (FIMM) dataset.
gcsi: The Genentech Cell Line Screening Initiative (gCSI)
gdscv1: Genomics of Drug Sensitivity in Cancer (GDSC) v1
gdscv2: Genomics of Drug Sensitivity in Cancer (GDSC) v2
hcmi: Human Cancer Models Initiative (HCMI) encompasses numerous cancer types and includes cell line, organoid, and tumor data. Data includes the transcriptomics, somatic mutation, and copy number datasets.
mpnst: Malignant Peripheral Nerve Sheath Tumor is a rare, aggressive sarcoma that affects peripheral nerves throughout the body.
mpnstpdx: Patient derived xenograft data for MPNST
nci60: National Cancer Institute 60
pancpdo: Organoid Profiling Identifies Common Responders to Chemotherapy in Pancreatic Cancer Data includes transcriptomics, mutations, copy number, and drug response data.
prism: Profiling Relative Inhibition Simultaneously in Mixtures
sarcpdo: The landscape of drug sensitivity and resistance in sarcoma Data includes transcriptomics, mutations, and drug response data.
To download and load the needed dataset(s), use the functions below. For this example we will be using bladderpdo.
[ ]:
cd.download("bladderpdo")
[ ]:
bladderpdo=cd.load("bladderpdo")
The built-in function type()
can be used to view the class of any object. As seen below, bladderpdo is a Dataset.
[5]:
type(bladderpdo)
[5]:
coderdata.dataset.dataset.Dataset
To view all the datatypes included with a dataset object, use the function below. The types()
method lets us see which datatypes are included in the bladderpdo dataset object. We will be using the transcriptomics, drugs, and experiments data in our example. Note that each datatype has different format options and arguments. To view all options for each type, navigate to the usage page.
[6]:
pd.DataFrame({"datatypes":bladderpdo.types()})
[6]:
datatypes | |
---|---|
0 | transcriptomics |
1 | mutations |
2 | copy_number |
3 | samples |
4 | drugs |
5 | experiments |
6 | genes |
Examples are shown below of how to view a specified datatype of a dataset object.
[7]:
bladderpdo.drugs
[7]:
improve_drug_id | chem_name | pubchem_id | canSMILES | InChIKey | formula | weight | |
---|---|---|---|---|---|---|---|
0 | SMI_8294 | bi-2536 | 11364421 | CCC1C(=O)N(C2=CN=C(N=C2N1C3CCCC3)NC4=C(C=C(C=C... | XQVVPGYIWAGRNI-JOCHJYFZSA-N | C28H39N7O3 | 521.7 |
1 | SMI_8294 | ((r)-4-[(8-cyclopentyl-7-et-5,6,7,8-tetrahydro... | 11364421 | CCC1C(=O)N(C2=CN=C(N=C2N1C3CCCC3)NC4=C(C=C(C=C... | XQVVPGYIWAGRNI-JOCHJYFZSA-N | C28H39N7O3 | 521.7 |
2 | SMI_8294 | en300-18167228 | 11364421 | CCC1C(=O)N(C2=CN=C(N=C2N1C3CCCC3)NC4=C(C=C(C=C... | XQVVPGYIWAGRNI-JOCHJYFZSA-N | C28H39N7O3 | 521.7 |
3 | SMI_8294 | 4-((r)-8-cyclopentyl-7-ethyl-5-methyl-6-oxo-5,... | 11364421 | CCC1C(=O)N(C2=CN=C(N=C2N1C3CCCC3)NC4=C(C=C(C=C... | XQVVPGYIWAGRNI-JOCHJYFZSA-N | C28H39N7O3 | 521.7 |
4 | SMI_8294 | bcpp000342 | 11364421 | CCC1C(=O)N(C2=CN=C(N=C2N1C3CCCC3)NC4=C(C=C(C=C... | XQVVPGYIWAGRNI-JOCHJYFZSA-N | C28H39N7O3 | 521.7 |
... | ... | ... | ... | ... | ... | ... | ... |
5015 | SMI_438 | metatrexan | 126941 | CN(CC1=CN=C2C(=N1)C(=NC(=N2)N)N)C3=CC=C(C=C3)C... | FBOZXECLQNJBKD-ZDUSSCGKSA-N | C20H22N8O5 | 454.4 |
5016 | SMI_438 | hms2233o18 | 126941 | CN(CC1=CN=C2C(=N1)C(=NC(=N2)N)N)C3=CC=C(C=C3)C... | FBOZXECLQNJBKD-ZDUSSCGKSA-N | C20H22N8O5 | 454.4 |
5017 | SMI_438 | amethopterin l- | 126941 | CN(CC1=CN=C2C(=N1)C(=NC(=N2)N)N)C3=CC=C(C=C3)C... | FBOZXECLQNJBKD-ZDUSSCGKSA-N | C20H22N8O5 | 454.4 |
5018 | SMI_438 | tocris-1230 | 126941 | CN(CC1=CN=C2C(=N1)C(=NC(=N2)N)N)C3=CC=C(C=C3)C... | FBOZXECLQNJBKD-ZDUSSCGKSA-N | C20H22N8O5 | 454.4 |
5019 | SMI_438 | amethopterin (hydrate); cl14377 (hydrate); wr1... | 126941 | CN(CC1=CN=C2C(=N1)C(=NC(=N2)N)N)C3=CC=C(C=C3)C... | FBOZXECLQNJBKD-ZDUSSCGKSA-N | C20H22N8O5 | 454.4 |
5020 rows × 7 columns
[8]:
bladderpdo.transcriptomics
[8]:
entrez_id | transcriptomics | improve_sample_id | source | study | |
---|---|---|---|---|---|
0 | 7105 | 69.277129 | 5553 | Gene Expression Omnibus | Lee et al. 2018 Bladder PDOs |
1 | 64102 | 0.027787 | 5553 | Gene Expression Omnibus | Lee et al. 2018 Bladder PDOs |
2 | 8813 | 81.403736 | 5553 | Gene Expression Omnibus | Lee et al. 2018 Bladder PDOs |
3 | 57147 | 20.024448 | 5553 | Gene Expression Omnibus | Lee et al. 2018 Bladder PDOs |
4 | 55732 | 29.819016 | 5553 | Gene Expression Omnibus | Lee et al. 2018 Bladder PDOs |
... | ... | ... | ... | ... | ... |
816096 | 140032 | 0.024479 | 5502 | Gene Expression Omnibus | Lee et al. 2018 Bladder PDOs |
816097 | 9782 | 44.949383 | 5502 | Gene Expression Omnibus | Lee et al. 2018 Bladder PDOs |
816098 | 102724737 | 0.024479 | 5502 | Gene Expression Omnibus | Lee et al. 2018 Bladder PDOs |
816099 | 101669762 | 2.203401 | 5502 | Gene Expression Omnibus | Lee et al. 2018 Bladder PDOs |
816100 | 100527949 | 0.024479 | 5502 | Gene Expression Omnibus | Lee et al. 2018 Bladder PDOs |
816101 rows × 5 columns
[9]:
bladderpdo.experiments
[9]:
source | improve_sample_id | improve_drug_id | study | time | time_unit | dose_response_metric | dose_response_value | |
---|---|---|---|---|---|---|---|---|
0 | Synapse | 5473 | SMI_32023 | Lee etal 2018 Bladder PDOs | 6 | days | fit_auc | 0.7725 |
1 | Synapse | 5473 | SMI_55473 | Lee etal 2018 Bladder PDOs | 6 | days | fit_auc | 0.9277 |
2 | Synapse | 5473 | SMI_27249 | Lee etal 2018 Bladder PDOs | 6 | days | fit_auc | 0.8538 |
3 | Synapse | 5473 | SMI_55606 | Lee etal 2018 Bladder PDOs | 6 | days | fit_auc | 0.8478 |
4 | Synapse | 5473 | SMI_20004 | Lee etal 2018 Bladder PDOs | 6 | days | fit_auc | 0.8825 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
32995 | Synapse | 5606 | SMI_438 | Lee etal 2018 Bladder PDOs | 6 | days | dss | 0.1221 |
32996 | Synapse | 5606 | SMI_52219 | Lee etal 2018 Bladder PDOs | 6 | days | dss | 0.1104 |
32997 | Synapse | 5606 | SMI_34999 | Lee etal 2018 Bladder PDOs | 6 | days | dss | 0.1202 |
32998 | Synapse | 5606 | SMI_55689 | Lee etal 2018 Bladder PDOs | 6 | days | dss | 0.0000 |
32999 | Synapse | 5606 | SMI_10376 | Lee etal 2018 Bladder PDOs | 6 | days | dss | 0.0000 |
33000 rows × 8 columns
Coderdata has a function format()
that will reformat any dataset datatype with the specified arguments. Each datatype has different arguments specific to its contents; to view each datatype's argument options, visit coderdata/usage. Here we will format bladderpdo.experiments. For this tutorial we only want to include auc metric values in our dataframe; to view the other metric options for a dataset, use the pandas unique() function.
[10]:
pd.DataFrame({"dose_response_metric":bladderpdo.experiments.dose_response_metric.unique()})
[10]:
dose_response_metric | |
---|---|
0 | fit_auc |
1 | fit_ic50 |
2 | fit_ec50 |
3 | fit_r2 |
4 | fit_ec50se |
5 | fit_einf |
6 | fit_hs |
7 | aac |
8 | auc |
9 | dss |
Set bladderpdo.experiments
to only include samples where the dose_response_metric
is auc.
[11]:
bladderpdo.experiments = bladderpdo.format(data_type='experiments', metrics='auc')
bladderpdo.experiments
[11]:
source | improve_sample_id | improve_drug_id | study | time | time_unit | dose_response_metric | dose_response_value | |
---|---|---|---|---|---|---|---|---|
26400 | Synapse | 5473 | SMI_32023 | Lee etal 2018 Bladder PDOs | 6 | days | auc | 0.7530 |
26401 | Synapse | 5473 | SMI_55473 | Lee etal 2018 Bladder PDOs | 6 | days | auc | 0.9067 |
26402 | Synapse | 5473 | SMI_27249 | Lee etal 2018 Bladder PDOs | 6 | days | auc | 0.8972 |
26403 | Synapse | 5473 | SMI_55606 | Lee etal 2018 Bladder PDOs | 6 | days | auc | 0.8320 |
26404 | Synapse | 5473 | SMI_20004 | Lee etal 2018 Bladder PDOs | 6 | days | auc | 0.8937 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
29695 | Synapse | 5606 | SMI_438 | Lee etal 2018 Bladder PDOs | 6 | days | auc | 0.4095 |
29696 | Synapse | 5606 | SMI_52219 | Lee etal 2018 Bladder PDOs | 6 | days | auc | 0.6260 |
29697 | Synapse | 5606 | SMI_34999 | Lee etal 2018 Bladder PDOs | 6 | days | auc | 0.8730 |
29698 | Synapse | 5606 | SMI_55689 | Lee etal 2018 Bladder PDOs | 6 | days | auc | 0.9406 |
29699 | Synapse | 5606 | SMI_10376 | Lee etal 2018 Bladder PDOs | 6 | days | auc | 0.9813 |
3300 rows × 8 columns
Develop Deep Learning Model¶
In addition to the packages already imported, this section requires:
tensorflow (Note: tensorflow currently supports Python versions only up to 3.12)
matplotlib
rdkit
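If any of these packages are missing, a typical way to install them from inside the notebook is sketched below. This assumes a pip-based environment; adjust for conda or other package managers.
[ ]:
# Hypothetical install cell; package names are the usual PyPI names.
%pip install tensorflow matplotlib rdkit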
[12]:
import keras
from keras import layers
from keras.metrics import MeanAbsoluteError
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.linear_model import LinearRegression
from rdkit import Chem
from rdkit.Chem.rdFingerprintGenerator import GetMorganGenerator
import matplotlib.pyplot as plt
Coderdata has two split functions, a two-way and a three-way split, that can be used with any dataset. Below are examples of how to use both functions, but for this tutorial we will use the three-way split for our model.
Quick tip: To view all available arguments for any function, use the built-in help()
function as shown below.
[13]:
help(bladderpdo.split_train_other)
Help on method split_train_other in module coderdata.dataset.dataset:
split_train_other(split_type: "Literal['mixed-set', 'drug-blind', 'cancer-blind']" = 'mixed-set', ratio: 'tuple[int, int]' = (8, 2), stratify_by: 'Optional[str]' = None, balance: 'bool' = False, random_state: 'Optional[Union[int, RandomState]]' = None, **kwargs: 'dict') -> 'TwoWaySplit' method of coderdata.dataset.dataset.Dataset instance
Here is an example of a two-way split applied to the bladderpdo dataset. To view the definitions of the different split_type
options and the associated arguments, see the API reference. Note that the other split functions use similar arguments.
[14]:
two_way_split = bladderpdo.split_train_other(split_type='drug-blind',
                                             ratio=[8, 2],
                                             random_state=42)
The split can be used with any datatype; here we will use the experiments datatype. To view both the train and other splits, use the calls below. You will see that the train
set is ~80% of the original bladderpdo.experiments and the other
set is ~20%, aligning with the values passed to ratio
. A quick check of the drug-blind property is sketched after the previews.
[15]:
two_way_split.train.experiments.head()
[15]:
source | improve_sample_id | improve_drug_id | study | time | time_unit | dose_response_metric | dose_response_value | |
---|---|---|---|---|---|---|---|---|
0 | Synapse | 5473 | SMI_20004 | Lee etal 2018 Bladder PDOs | 6 | days | auc | 0.8937 |
1 | Synapse | 5473 | SMI_27249 | Lee etal 2018 Bladder PDOs | 6 | days | auc | 0.8972 |
2 | Synapse | 5473 | SMI_28189 | Lee etal 2018 Bladder PDOs | 6 | days | auc | 0.7877 |
3 | Synapse | 5473 | SMI_32023 | Lee etal 2018 Bladder PDOs | 6 | days | auc | 0.7530 |
4 | Synapse | 5473 | SMI_34999 | Lee etal 2018 Bladder PDOs | 6 | days | auc | 0.7171 |
[16]:
two_way_split.other.experiments.head()
[16]:
source | improve_sample_id | improve_drug_id | study | time | time_unit | dose_response_metric | dose_response_value | |
---|---|---|---|---|---|---|---|---|
0 | Synapse | 5473 | SMI_25616 | Lee etal 2018 Bladder PDOs | 6 | days | auc | 0.6962 |
1 | Synapse | 5474 | SMI_25616 | Lee etal 2018 Bladder PDOs | 6 | days | auc | 0.6962 |
2 | Synapse | 5475 | SMI_25616 | Lee etal 2018 Bladder PDOs | 6 | days | auc | 0.6962 |
3 | Synapse | 5476 | SMI_25616 | Lee etal 2018 Bladder PDOs | 6 | days | auc | 0.6962 |
4 | Synapse | 5477 | SMI_25616 | Lee etal 2018 Bladder PDOs | 6 | days | auc | 0.6962 |
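Because we requested a drug-blind split, no improve_drug_id should appear in both sets. A minimal sanity check, our own addition rather than part of the original tutorial, is sketched below:
[ ]:
# Hedged check: a drug-blind split should share no drugs across train and other.
train_drugs = set(two_way_split.train.experiments['improve_drug_id'])
other_drugs = set(two_way_split.other.experiments['improve_drug_id'])
print(len(train_drugs & other_drugs))  # expected: 0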
The three-way split uses input arguments similar to the two-way split's. However, the three-way split provides train, test, and validation sets.
[17]:
split = bladderpdo.split_train_test_validate(split_type='mixed-set',
                                             ratio=[8, 1, 1],
                                             random_state=42,
                                             thresh=0.8)
The CoderData package includes its own split function that is intended only for coderdata-specific datasets. Here is an overview of what each split is and how it will be used.
Train (80%): This gathers the majority of the data and allows the model to learn from the patterns and relationships in the data. During this step the model iteratively adjusts its parameters to better fit the data.
Validate (10%): Generally the smallest portion of the data, used to refine the model's hyperparameters. Comparing training performance to validation performance reveals overfitting or underfitting; each training iteration (epoch) is followed by an evaluation on the validation set.
Test (10%): Finally, the test data is used to evaluate the final model and provide an estimate of how it will perform on unseen data.
Note: The percentage of data used for each split may vary but typically follows this size ordering (Train > Test >= Validate). For a more in-depth explanation of how train, test, and validate work, visit mlu-explain.github.io. A quick check of the resulting proportions is sketched below.
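A minimal sketch (our addition) that confirms the resulting 8/1/1 proportions:
[ ]:
# Hedged check: compare each split's share of the experiments to the requested ratio.
counts = {name: len(getattr(split, name).experiments)
          for name in ('train', 'test', 'validate')}
total = sum(counts.values())
for name, count in counts.items():
    print(f"{name}: {count} rows ({count / total:.0%})")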
[18]:
split.train.experiments
[18]:
source | improve_sample_id | improve_drug_id | study | time | time_unit | dose_response_metric | dose_response_value | |
---|---|---|---|---|---|---|---|---|
0 | Synapse | 5553 | SMI_55531 | Lee etal 2018 Bladder PDOs | 6 | days | auc | 0.9275 |
1 | Synapse | 5556 | SMI_55473 | Lee etal 2018 Bladder PDOs | 6 | days | auc | 0.2615 |
2 | Synapse | 5479 | SMI_55606 | Lee etal 2018 Bladder PDOs | 6 | days | auc | 0.8320 |
3 | Synapse | 5566 | SMI_25708 | Lee etal 2018 Bladder PDOs | 6 | days | auc | 0.7642 |
4 | Synapse | 5564 | SMI_16144 | Lee etal 2018 Bladder PDOs | 6 | days | auc | 0.7298 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
2635 | Synapse | 5528 | SMI_27186 | Lee etal 2018 Bladder PDOs | 6 | days | auc | 0.9594 |
2636 | Synapse | 5529 | SMI_10282 | Lee etal 2018 Bladder PDOs | 6 | days | auc | 0.8449 |
2637 | Synapse | 5532 | SMI_25708 | Lee etal 2018 Bladder PDOs | 6 | days | auc | 0.8323 |
2638 | Synapse | 5523 | SMI_42845 | Lee etal 2018 Bladder PDOs | 6 | days | auc | 0.6559 |
2639 | Synapse | 5604 | SMI_35935 | Lee etal 2018 Bladder PDOs | 6 | days | auc | 0.4328 |
2640 rows × 8 columns
[19]:
split.test.experiments
[19]:
source | improve_sample_id | improve_drug_id | study | time | time_unit | dose_response_metric | dose_response_value | |
---|---|---|---|---|---|---|---|---|
0 | Synapse | 5522 | SMI_51528 | Lee etal 2018 Bladder PDOs | 6 | days | auc | 0.5353 |
1 | Synapse | 5544 | SMI_20644 | Lee etal 2018 Bladder PDOs | 6 | days | auc | 1.0000 |
2 | Synapse | 5563 | SMI_55531 | Lee etal 2018 Bladder PDOs | 6 | days | auc | 0.9126 |
3 | Synapse | 5539 | SMI_20644 | Lee etal 2018 Bladder PDOs | 6 | days | auc | 0.9315 |
4 | Synapse | 5490 | SMI_16144 | Lee etal 2018 Bladder PDOs | 6 | days | auc | 0.7088 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
325 | Synapse | 5484 | SMI_8294 | Lee etal 2018 Bladder PDOs | 6 | days | auc | 0.7684 |
326 | Synapse | 5491 | SMI_18136 | Lee etal 2018 Bladder PDOs | 6 | days | auc | 0.7549 |
327 | Synapse | 5532 | SMI_20004 | Lee etal 2018 Bladder PDOs | 6 | days | auc | 0.9606 |
328 | Synapse | 5548 | SMI_659 | Lee etal 2018 Bladder PDOs | 6 | days | auc | 0.7099 |
329 | Synapse | 5490 | SMI_55531 | Lee etal 2018 Bladder PDOs | 6 | days | auc | 0.9967 |
330 rows × 8 columns
[20]:
split.validate.experiments
[20]:
source | improve_sample_id | improve_drug_id | study | time | time_unit | dose_response_metric | dose_response_value | |
---|---|---|---|---|---|---|---|---|
0 | Synapse | 5603 | SMI_42965 | Lee etal 2018 Bladder PDOs | 6 | days | auc | 0.7433 |
1 | Synapse | 5555 | SMI_55473 | Lee etal 2018 Bladder PDOs | 6 | days | auc | 0.2615 |
2 | Synapse | 5493 | SMI_40409 | Lee etal 2018 Bladder PDOs | 6 | days | auc | 0.5723 |
3 | Synapse | 5553 | SMI_46771 | Lee etal 2018 Bladder PDOs | 6 | days | auc | 0.2187 |
4 | Synapse | 5489 | SMI_27249 | Lee etal 2018 Bladder PDOs | 6 | days | auc | 0.7267 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
325 | Synapse | 5541 | SMI_50331 | Lee etal 2018 Bladder PDOs | 6 | days | auc | 0.5340 |
326 | Synapse | 5537 | SMI_25616 | Lee etal 2018 Bladder PDOs | 6 | days | auc | 0.8480 |
327 | Synapse | 5530 | SMI_31885 | Lee etal 2018 Bladder PDOs | 6 | days | auc | 0.8442 |
328 | Synapse | 5542 | SMI_28189 | Lee etal 2018 Bladder PDOs | 6 | days | auc | 0.5715 |
329 | Synapse | 5603 | SMI_5060 | Lee etal 2018 Bladder PDOs | 6 | days | auc | 0.7683 |
330 rows × 8 columns
save()
can be used to keep the split for later use.
[ ]:
split.train.save(path='/Path/coderdata/bladderpdo_train.pickle')
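To reload a saved split later, one option is the standard pickle module. This is an assumption on our part, since save() writes a .pickle file; check the coderdata API reference for a dedicated loader.
[ ]:
import pickle

# Hypothetical reload of the split saved above; the path is a placeholder.
with open('/Path/coderdata/bladderpdo_train.pickle', 'rb') as f:
    train_split = pickle.load(f)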
Preparing the Data¶
Impute Missing Data¶
[22]:
# Impute missing transcriptomics values (if any) with the global mean.
columns_to_fill = ['transcriptomics']
for i in columns_to_fill:
    if bladderpdo.transcriptomics[i].isnull().any():
        bladderpdo.transcriptomics[i] = bladderpdo.transcriptomics[i].fillna(bladderpdo.transcriptomics[i].mean())
Scale Input Data¶
(Optional) Here we use a scikit-learn function to scale the transcriptomics data to a small range of non-negative numbers. This step can be helpful when there is a wide range of values in the data.
[23]:
scaler=MinMaxScaler()
bladderpdo.transcriptomics[["transcriptomics"]] = scaler.fit_transform(bladderpdo.transcriptomics[["transcriptomics"]])
Preview the scaled transcriptomics values with duplicates dropped.
[24]:
bladderpdo.transcriptomics[['transcriptomics']].drop_duplicates().head()
[24]:
transcriptomics | |
---|---|
0 | 1.572852e-03 |
1 | 3.745602e-07 |
2 | 1.848217e-03 |
3 | 4.544482e-04 |
4 | 6.768580e-04 |
From the bladderpdo.drugs
dataframe we pull the canSMILES column and use rdkit to convert each SMILES string into a Morgan fingerprint. The resulting fingerprint
column will then be used as another input, along with transcriptomics, for the model.
[25]:
# Convert SMILES to Morgan fingerprints
def smiles_to_fingerprint(smiles):
    try:
        mol = Chem.MolFromSmiles(smiles)
        if mol is None:  # Handle cases where MolFromSmiles fails
            return None
        morgan_gen = GetMorganGenerator(radius=2, fpSize=1024)  # Adjust radius and fpSize as needed
        fingerprint = morgan_gen.GetFingerprint(mol)
        return np.array(fingerprint)
    except Exception as e:
        print(f"Error processing SMILES '{smiles}': {e}")
        return None

# Apply the function to the canSMILES column
bladderpdo.drugs['fingerprint'] = bladderpdo.drugs['canSMILES'].apply(smiles_to_fingerprint)
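As a quick check (our addition, not part of the original tutorial), each successfully converted fingerprint should be a 1024-element array:
[ ]:
# Hedged check: inspect the first non-missing fingerprint's shape.
print(bladderpdo.drugs['fingerprint'].dropna().iloc[0].shape)  # expected: (1024,)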
Merge Data¶
Group the transcriptomics data into an array for each improve_sample_id
. This is in preparation for the model, because the input values must be in an array. Then drop any duplicated transcriptomics values within the same array.
[26]:
group_df = bladderpdo.transcriptomics.groupby('improve_sample_id')['transcriptomics'].apply(list).reset_index()
group_df['transcriptomics'] = group_df['transcriptomics'].apply(lambda x: list(set(x)))
[27]:
group_df.head()
[27]:
improve_sample_id | transcriptomics | |
---|---|---|
0 | 5473 | [0.0004120739513098224, 0.005498299890739134, ... |
1 | 5477 | [0.03584645254124297, 0.0006806983197478193, 0... |
2 | 5482 | [0.0010883074665564165, 0.004632827848666922, ... |
3 | 5485 | [0.0014773197417206076, 0.0038554997961206556,... |
4 | 5488 | [0.000516024688106854, 0.0020666180961583826, ... |
We will merge the split dataframes with the bladderpdo.drugs
dataframe based on the improve_drug_id
first, and then in the next step we will merge the dataframes with the transcriptomics
.
[28]:
split.train.experiments = pd.merge(split.train.experiments[['improve_sample_id','improve_drug_id', 'dose_response_value']],
bladderpdo.drugs[['improve_drug_id','fingerprint']],
on='improve_drug_id',
how= 'left')
split.test.experiments = pd.merge(split.test.experiments[['improve_sample_id','improve_drug_id', 'dose_response_value']],
bladderpdo.drugs[['improve_drug_id','fingerprint']],
on='improve_drug_id',
how= 'left')
split.validate.experiments = pd.merge(split.validate.experiments[['improve_sample_id','improve_drug_id', 'dose_response_value']],
bladderpdo.drugs[['improve_drug_id','fingerprint']],
on='improve_drug_id',
how= 'left')
To the split bladderpdo.experiments
dataframes, add the transcriptomics
arrays based on improve_sample_id
. Then drop all columns besides our input values, transcriptomics
and fingerprint
, and the corresponding output, dose_response_value
. This is repeated for the train, test, and validate dataframes.
[29]:
train_df = pd.merge(split.train.experiments, group_df, on='improve_sample_id', how = 'left')
train_df.dropna(subset=['transcriptomics'], inplace=True)
train_df.drop_duplicates(subset=['dose_response_value'], inplace=True)
train_df = train_df[['dose_response_value','transcriptomics', 'fingerprint' ]]
test_df = pd.merge(split.test.experiments, group_df, on='improve_sample_id', how = 'left')
test_df.dropna(subset=['transcriptomics'], inplace=True)
test_df.drop_duplicates(subset=['dose_response_value'], inplace=True)
test_df = test_df[['dose_response_value','transcriptomics','fingerprint']]
validate_df = pd.merge(split.validate.experiments, group_df, on='improve_sample_id', how = 'left')
validate_df.dropna(subset=['transcriptomics'], inplace=True)
validate_df.drop_duplicates(subset=['dose_response_value'], inplace=True)
validate_df = validate_df[['dose_response_value','transcriptomics','fingerprint']]
[30]:
train_df.head()
[30]:
dose_response_value | transcriptomics | fingerprint | |
---|---|---|---|
0 | 0.9275 | [0.0031862060625194033, 2.728004558250126e-05,... | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, ... |
107 | 0.2615 | [0.026989577082694893, 0.002483812226380167, 0... | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... |
870 | 0.5044 | [0.0031862060625194033, 2.728004558250126e-05,... | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... |
1055 | 0.7376 | [0.044913496837260425, 0.0005729141409851485, ... | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... |
1168 | 0.9244 | [0.0, 0.0013422786305636884, 0.006175404786558... | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... |
[31]:
test_df.head()
[31]:
dose_response_value | transcriptomics | fingerprint | |
---|---|---|---|
308 | 0.9126 | [0.0007689920202373915, 0.002660806480477618, ... | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, ... |
836 | 0.8929 | [0.008620298094601964, 0.003253032313722315, 0... | [0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... |
1334 | 0.7684 | [0.0010883074665564165, 0.004632827848666922, ... | [0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... |
1500 | 0.4067 | [0.0031862060625194033, 2.728004558250126e-05,... | [0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... |
2137 | 0.2959 | [0.026989577082694893, 0.002483812226380167, 0... | [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... |
[32]:
validate_df.head()
[32]:
dose_response_value | transcriptomics | fingerprint | |
---|---|---|---|
171 | 0.5723 | [0.0007678460342894769, 0.0014434679191595028,... | [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, ... |
254 | 0.2187 | [0.0031862060625194033, 2.728004558250126e-05,... | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... |
436 | 0.7674 | [0.008620298094601964, 0.003253032313722315, 0... | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... |
506 | 0.6207 | [0.002719332067378844, 0.0028423386644789166, ... | [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, ... |
589 | 0.7549 | [0.000516024688106854, 0.0020666180961583826, ... | [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... |
Quick tip: View the lengths of the transcriptomics arrays to decide the input shape for the model. Here we have arrays of multiple sizes, so we will need to pad or truncate them to ensure a consistent shape. An example of that is shown later in the tutorial.
[33]:
pd.DataFrame({"Array Length":validate_df['transcriptomics'].apply(len).unique()})
[33]:
Array Length | |
---|---|
0 | 3996 |
1 | 4741 |
2 | 4040 |
3 | 4766 |
4 | 3964 |
5 | 195 |
6 | 4207 |
7 | 4067 |
8 | 3723 |
9 | 4184 |
10 | 3698 |
11 | 4083 |
12 | 4236 |
13 | 2985 |
14 | 4193 |
15 | 4845 |
16 | 4491 |
Set the Model Parameters¶
Here we set the model parameters. You may modify the inputs, layers, activation functions, and outputs of the model. This is a general setup for a keras model, but it may need to be adjusted to better suit your data. Here we use activation='relu'
for the output layer; note that this activation is only intended for non-negative values (consistent with how the transcriptomics inputs were scaled previously).
[34]:
#Define Inputs
transcriptomics_input = keras.Input(shape=(3000,), name="transcriptomics")
fingerprint_input = keras.Input(shape=(1024,), name="fingerprint")
# You may adjust these layers and activation functions to get better (or worse) results.
# Merge all available features into a single large vector via concatenation
x = layers.concatenate([transcriptomics_input, fingerprint_input])
x = layers.Dense(16, activation='relu')(x)
x = layers.Dense(64, activation="relu")(x)
x = layers.Dropout(0.5)(x)
x = layers.Dense(32, activation='relu')(x)
x = layers.Dense(12, activation='relu')(x)
# Priority prediction is here. You may add layers (and an output) after this too.
# This just allows for an early prediction in case you have a large dataset and want preliminary results.
priority_pred = layers.Dense(1, name="priority", activation='relu')(x)  # Using the relu activation.
# Instantiate an end-to-end model predicting both priority and department
model = keras.Model(
inputs= [transcriptomics_input, fingerprint_input],
outputs={"priority": priority_pred},
)
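Since the AUC targets lie between 0 and 1, a sigmoid output layer is a reasonable alternative to relu here. This variant is our own suggestion, not part of the original tutorial:
[ ]:
# Hypothetical alternative output: sigmoid bounds predictions to (0, 1),
# matching the range of the AUC targets. Swap it in for priority_pred above
# (keeping the name "priority" if you do).
alt_priority_pred = layers.Dense(1, name="priority_sigmoid", activation='sigmoid')(x)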
Compile the Model¶
Here we use a basic keras compile()
call with the Adam optimizer. For other function options visit keras.io/compile-method, and for optimizer options see keras.io/optimizers. Then view the model summary; this can be helpful when adjusting parameters and inspecting the model layers. For more information on the model summary visit
keras.io/summary-method.
[35]:
model.compile(
optimizer=keras.optimizers.Adam(learning_rate=0.0001),
loss={
"priority": keras.losses.MeanSquaredError(),
},
metrics=[MeanAbsoluteError()],
loss_weights={"priority": 1},
)
model.summary()
Model: "functional"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┓
┃ Layer (type)                 ┃ Output Shape ┃  Param # ┃ Connected to       ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━┩
│ transcriptomics (InputLayer) │ (None, 3000) │        0 │ -                  │
│ fingerprint (InputLayer)     │ (None, 1024) │        0 │ -                  │
│ concatenate (Concatenate)    │ (None, 4024) │        0 │ transcriptomics[…  │
│                              │              │          │ fingerprint[0][0]  │
│ dense (Dense)                │ (None, 16)   │   64,400 │ concatenate[0][0]  │
│ dense_1 (Dense)              │ (None, 64)   │    1,088 │ dense[0][0]        │
│ dropout (Dropout)            │ (None, 64)   │        0 │ dense_1[0][0]      │
│ dense_2 (Dense)              │ (None, 32)   │    2,080 │ dropout[0][0]      │
│ dense_3 (Dense)              │ (None, 12)   │      396 │ dense_2[0][0]      │
│ priority (Dense)             │ (None, 1)    │       13 │ dense_3[0][0]      │
└──────────────────────────────┴──────────────┴──────────┴────────────────────┘
Total params: 67,977 (265.54 KB)
Trainable params: 67,977 (265.54 KB)
Non-trainable params: 0 (0.00 B)
Run your Model¶
Before you can run the model you may need to pad/truncate the input arrays. As mentioned earlier, this is because the arrays must all be the same length for the model to run correctly. Here we adjust both the transcriptomics
and fingerprint
inputs for all split dataframes.
[36]:
# Pad or truncate each input array to a fixed target length.
def pad_sequence(seq, target_length, pad_value=0):
    """Pad or truncate a sequence to a target length."""
    if len(seq) > target_length:  # Truncate if longer than target_length
        return seq[:target_length]
    elif len(seq) < target_length:  # Pad with zeros if shorter than target_length
        return seq + [pad_value] * (target_length - len(seq))
    else:
        return seq
# Example: Create fixed-length arrays
train_df['transcriptomics'] = train_df['transcriptomics'].apply(lambda x: pad_sequence(x, target_length=3000))
test_df['transcriptomics'] = test_df['transcriptomics'].apply(lambda x: pad_sequence(x, target_length=3000))
validate_df['transcriptomics'] = validate_df['transcriptomics'].apply(lambda x: pad_sequence(x, target_length=3000))
train_df['fingerprint'] = train_df['fingerprint'].apply(lambda x: pad_sequence(x, target_length=1024))
test_df['fingerprint'] = test_df['fingerprint'].apply(lambda x: pad_sequence(x, target_length=1024))
validate_df['fingerprint'] = validate_df['fingerprint'].apply(lambda x: pad_sequence(x, target_length=1024))
Define the output data, dose_response_value
, in preparation for running the model.
[37]:
y_train = np.array(train_df['dose_response_value'])
y_test = np.array(test_df['dose_response_value'])
y_validate = np.array(validate_df['dose_response_value'])
Now use the keras fit()
function with the train and validation data. For more information on this function visit keras.io/fit-method. Note you may adjust the epochs
and batch_size
to better fit your model.
[ ]:
# Train the model
history = model.fit(
[np.array(train_df['transcriptomics'].tolist()),
np.array(train_df['fingerprint'].tolist())
],
y_train,
validation_data=([np.array(validate_df['transcriptomics'].tolist()),
np.array(validate_df['fingerprint'].tolist())
],
y_validate),
epochs=100, # Number of epochs
batch_size=32, # Batch size
verbose=0 # Reduces the output (set equal to 1 if you want to see the output steps)
)
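Rather than hand-tuning the number of epochs, you can stop training once validation loss stops improving. Below is a minimal sketch using keras's EarlyStopping callback; this is an addition to the original tutorial, and the patience value is arbitrary.
[ ]:
# Hedged sketch: halt training when validation loss plateaus and keep the best weights.
early_stop = keras.callbacks.EarlyStopping(monitor='val_loss',
                                           patience=10,
                                           restore_best_weights=True)
# Pass callbacks=[early_stop] to the model.fit(...) call above to enable it.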
Evaluate your Model¶
Now we will use the test data to evaluate our model. For information on the evaluate()
function, see keras.io/evaluate-method.
[39]:
losses = model.evaluate(
[np.array(test_df['transcriptomics'].tolist()),
np.array(test_df['fingerprint'].tolist())],
y_test,
return_dict=True)
print(losses)
3/3 ━━━━━━━━━━━━━━━━━━━━ 0s 17ms/step - loss: 0.0197 - mean_absolute_error: 0.1211
{'loss': 0.01972586289048195, 'mean_absolute_error': 0.12107016891241074}
For a guide on how to evaluate your model, visit machinelearning.com. The site gives examples of graphs, similar to those shown below, demonstrating what overfitting, underfitting, and well-fit models may output.
Plot Loss Function and Error of the Model¶
[40]:
# summarize history for mean absolute error
plt.plot(history.history['val_mean_absolute_error'])
plt.plot(history.history['mean_absolute_error'])
plt.title('Mean Absolute Error of Deep Learning Model')
plt.ylabel('Mean Absolute Error')
plt.xlabel('epoch')
plt.legend(['Validation Set','Training Set',], loc='upper right')
plt.show()
# summarize history for loss
plt.plot(history.history['val_loss'])
plt.plot(history.history['loss'])
plt.title('Loss Function of Deep Learning Model')
plt.ylabel('Loss')
plt.xlabel('epoch')
plt.legend(['Validation Set','Training Set'], loc='upper right')
plt.show()


Prediction vs True Values¶
We will now use the test data to see how our model does at making predictions, based on the train and validation data it was given. Using graphs and summary statistics, we can see how well the model was able to reflect the given data.
[41]:
# make a prediction
predict = model.predict([np.array(test_df['transcriptomics'].tolist()),
np.array(test_df['fingerprint'].tolist())])
3/3 ━━━━━━━━━━━━━━━━━━━━ 0s 31ms/step
[42]:
new_df = pd.DataFrame({
'Predicted Values': predict["priority"].tolist(),
'True Values': y_test
})
new_df['Predicted Values'] = new_df['Predicted Values'].apply(lambda x: x[0] if isinstance(x, list) and len(x) == 1 else x)
sorted_df = new_df.sort_values(by='True Values', ascending=True) # Change 'ascending' to False for descending order
sorted_df.head()
[42]:
Predicted Values | True Values | |
---|---|---|
4 | 0.432947 | 0.2959 |
13 | 0.340147 | 0.3021 |
15 | 0.548971 | 0.3230 |
74 | 0.497476 | 0.3567 |
34 | 0.371287 | 0.3596 |
[43]:
true_values = np.array(sorted_df["True Values"])
predicted_values = np.array(sorted_df["Predicted Values"])
# Plot the true values and predicted values
plt.plot(true_values)
plt.plot(predicted_values)
# Fit trendlines
true_values_trendline = LinearRegression().fit(np.arange(len(true_values)).reshape(-1, 1), true_values.reshape(-1, 1)).predict(np.arange(len(true_values)).reshape(-1, 1))
predicted_values_trendline = LinearRegression().fit(np.arange(len(predicted_values)).reshape(-1, 1), predicted_values.reshape(-1, 1)).predict(np.arange(len(predicted_values)).reshape(-1, 1))
# Plot trendlines
plt.plot(true_values_trendline, linestyle='--', color='blue', alpha=0.3)
plt.plot(predicted_values_trendline, linestyle='--', color='orange', alpha=0.3)
# Customize plot
plt.title('Predicted AUC vs True AUC')
plt.ylabel('Drug Response AUC')
plt.xlabel('Observation Values')
plt.legend(['True Values', 'Predicted Values', 'True Values Trendline', 'Predicted Values Trendline'], loc='lower right')
plt.show()

[44]:
true_values = np.array(new_df["True Values"])
predicted_values = np.array(new_df["Predicted Values"])
# Plot the true values and predicted values
plt.plot(true_values)
plt.plot(predicted_values)
# Fit trendlines
true_values_trendline = LinearRegression().fit(np.arange(len(true_values)).reshape(-1, 1), true_values.reshape(-1, 1)).predict(np.arange(len(true_values)).reshape(-1, 1))
predicted_values_trendline = LinearRegression().fit(np.arange(len(predicted_values)).reshape(-1, 1), predicted_values.reshape(-1, 1)).predict(np.arange(len(predicted_values)).reshape(-1, 1))
# Plot trendlines (needed so the legend entries below match the drawn lines)
plt.plot(true_values_trendline, linestyle='--', color='blue', alpha=0.3)
plt.plot(predicted_values_trendline, linestyle='--', color='orange', alpha=0.3)
# Customize plot
plt.title('Predicted AUC vs True AUC')
plt.ylabel('Drug Response AUC')
plt.xlabel('Observation Values')
plt.legend(['True Values', 'Predicted Values', 'True Values Trendline', 'Predicted Values Trendline'], loc='lower right')
plt.show()

[45]:
import seaborn as sns
# Visualizing the regression results
plt.figure(figsize=(10,6))
sns.scatterplot(x=y_test, y=predicted_values, alpha=0.6)
sns.lineplot(x=[y_test.min(), y_test.max()], y=[y_test.min(), y_test.max()], color='orange')
plt.title('True vs Predicted AUC Values')
plt.xlabel('True AUC Values')
plt.ylabel('Predicted AUC Values')
plt.show()

[46]:
# Calculate Summary Statistics
# Calculate Mean Absolute Error (MAE)
mae = mean_absolute_error(new_df['True Values'], new_df['Predicted Values'])
# Calculate Root Mean Squared Error (RMSE)
rmse = np.sqrt(mean_squared_error(new_df['True Values'], new_df['Predicted Values']))
# Calculate R-squared (R2) score
r2 = r2_score(new_df['True Values'], new_df['Predicted Values'])
summary_statistics = new_df.describe()
# Print the statistics
print("Mean Absolute Error (MAE):", mae)
print("Root Mean Squared Error (RMSE):", rmse)
print("R-squared (R2) Score:", r2)
print("\n")
# Print summary statistics
print(summary_statistics)
Mean Absolute Error (MAE): 0.12107017955400727
Root Mean Squared Error (RMSE): 0.14044879395085597
R-squared (R2) Score: 0.40249192535799005
Predicted Values True Values
count 88.000000 88.000000
mean 0.665929 0.749632
std 0.118047 0.182738
min 0.340147 0.295900
25% 0.586784 0.668400
50% 0.673757 0.776050
75% 0.753646 0.890800
max 0.885833 0.997100
[47]:
# Side-by-side comparison of predicted and true values (previewing the first rows).
new_df[0:50].head()
[47]:
Predicted Values | True Values | |
---|---|---|
0 | 0.849208 | 0.9126 |
1 | 0.837691 | 0.8929 |
2 | 0.632486 | 0.7684 |
3 | 0.548004 | 0.4067 |
4 | 0.432947 | 0.2959 |