Datasets Included

This page provides an overview of the datasets included in CoderData version 2.2.0. The package collects 18 datasets that pair molecular data with corresponding drug sensitivity data. All data are reprocessed and standardized so they can be used directly as a benchmark for machine learning models that predict drug response.

The dataset files are in csv format and are available at the link below:

Figshare record: https://api.figshare.com/v2/articles/28823159

Version: 2.2.0
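
The Figshare URL above is an API endpoint that returns JSON metadata rather than a web page. The following is a minimal sketch of how the csv files could be listed from it. It assumes the response follows the usual Figshare v2 article layout, with a `files` list whose entries carry `name` and `download_url` fields. The helper names and the example file names are hypothetical and are not part of CoderData.

```python
import json
from urllib.request import urlopen

ARTICLE_URL = "https://api.figshare.com/v2/articles/28823159"


def fetch_article(url=ARTICLE_URL):
    """Download the article metadata as a dict (needs network access)."""
    with urlopen(url) as response:
        return json.load(response)


def list_csv_files(article):
    """Return {file name: download URL} for entries whose name contains '.csv'.

    Assumes the 'files' / 'name' / 'download_url' layout of the Figshare v2 API.
    """
    return {
        entry["name"]: entry["download_url"]
        for entry in article.get("files", [])
        if ".csv" in entry["name"].lower()
    }


# Offline example with a hand-written payload (file names are made up).
example = {
    "files": [
        {"name": "demo_samples.csv", "download_url": "https://example.org/1"},
        {"name": "demo_experiments.tsv", "download_url": "https://example.org/2"},
        {"name": "README.txt", "download_url": "https://example.org/3"},
    ]
}
print(list_csv_files(example))
```

With network access, `list_csv_files(fetch_article())` would apply the same filter to the live record.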

Dataset Overview

Datasets and Modalities
Dataset	References	Sample	Drug	Drug Descriptor	Experiments	Transcriptomics	Proteomics	Mutations	Copy Number
BeatAML	[1], [2]	1022	164	X	X	X	X	X
Bladder	[3]	134	50	X	X	X		X	X
CCLE	[4]	502	24	X	X	X	X	X	X
Colorectal	[18]	61	10	X		X		X	X
CPTAC	[5]	1139				X	X	X	X
CTRPv2	[6], [7], [8]	846	459	X	X	X		X	X
FIMM	[9], [10]	52	52	X	X	X
GDSC v1	[23], [24], [25]	984	294	X		X	X	X	X
GDSC v2	[23], [24], [25]	806	171	X		X	X	X	X
gCSI	[21], [22]	569	44	X		X	X	X	X
HCMI	[11]	886				X		X	X
Liver	[19]	62	76	X		X		X	X
MPNST	[12]	50	30	X	X	X	X	X	X
NCI60	[13]	83	55157	X	X	X	X	X
Novartis	[20]	386	25	X		X		X	X
Pancreatic	[14]	70	25	X	X	X		X	X
PRISM	[15], [16]	478	1419	X	X	X
Sarcoma	[17]	36	34	X	X	X		X

The table above lists the datasets included in CoderData version 2.2.0, along with references to their original publications, counts of samples and drugs, and the types of data available for each dataset.

CoderData includes the following data:

Sample - cell lines, patient-derived samples, or patient-derived organoids
Drug - compounds tested for sensitivity
Drug Descriptor - molecular descriptors for each drug (computed using RDKit)
Experiments - dose-response experiments (various metrics such as AUC, IC50, etc.)
Transcriptomics - gene expression (in transcripts per million, TPM)
Proteomics - protein expression (in log2 ratio to reference)
Mutations - gene mutations (variant calls)
Copy Number - gene copy number variations (number of copies of each gene, 2 being diploid)

An “X” indicates the presence of a particular data type for the corresponding dataset. Each sample in the datasets corresponds to either a cancer cell line, a patient-derived xenograft, or a patient-derived organoid, depending on the specific dataset.
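
As a small illustration of reading the "X" marks programmatically, the sketch below parses three rows copied from the table above and selects the datasets that offer a given combination of data types. It is only an example of working with the tab-separated layout shown on this page; the function name `datasets_with` is hypothetical.

```python
import csv
import io

# Header and three rows copied from the table above (tab-separated).
TABLE = (
    "Dataset\tReferences\tSample\tDrug\tDrug Descriptor\tExperiments\t"
    "Transcriptomics\tProteomics\tMutations\tCopy Number\n"
    "CCLE\t[4]\t502\t24\tX\tX\tX\tX\tX\tX\n"
    "CPTAC\t[5]\t1139\t\t\t\tX\tX\tX\tX\n"
    "FIMM\t[9], [10]\t52\t52\tX\tX\tX\n"
)


def datasets_with(table_text, required):
    """Names of datasets marked 'X' in every column listed in `required`."""
    reader = csv.DictReader(io.StringIO(table_text), delimiter="\t")
    # Short rows leave trailing columns as None, which counts as absent.
    return [
        row["Dataset"]
        for row in reader
        if all(row.get(column) == "X" for column in required)
    ]


print(datasets_with(TABLE, ["Experiments", "Proteomics"]))
print(datasets_with(TABLE, ["Transcriptomics"]))
```

Of these three rows, only CCLE is marked for both Experiments and Proteomics, while all three are marked for Transcriptomics.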

Dataset Summary Statistics

The following table summarizes combination counts for each dataset: the number of unique experimental sample-drug pairs, and the number of those pairs for which particular molecular data types are also available. For example, the sample_drug_transcriptomics_mutation_pairs column gives the number of unique sample-drug pairs that have both transcriptomics and mutation data. These counts let you estimate how much paired data is available for a task such as building a predictive model from transcriptomics and drug response. The cptac and hcmi rows are empty because those datasets include no drug data.

Dataset Summary Statistics
dataset	sample_drug_pairs	sample_drug_transcript_pairs	sample_drug_transcriptomics_mutation_pairs	sample_drug_transcriptomics_copynumber_pairs	sample_drug_mutation_copynumber_pairs
beataml	31926	4137	3958
bladder	3300	840	640	640	3100
ccle	11543	10887	10792	10887	11118
colorectal	140	60	60	60	140
cptac
ctrpv2	309401	300507	295742	299698	300616
fimm	2663	2457	2457	2457	2611
gcsi	13398	12506	12338	12506	13112
gdscv1	247753	245220	241999	241240	242570
gdscv2	115440	114373	112829	112523	113133
hcmi
liver	4453	4453	4453	4453	4453
mpnst	272	193	184	191	184
nci60	2960756	2329149	2329132	2329149	2784474
novartis	1766	1734	1734	1723	1723
pancreatic	190	190	185	185	185
prism	638983	632078	630672	632078	636226
sarcoma	275	234	187
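
One way to read the pair counts above is as set intersections: take the unique (sample, drug) pairs from the dose-response experiments, then keep only those whose sample also appears in each required molecular table. The sketch below shows that reading on toy data. It is an assumption about how such counts can be derived, not the project's actual script, and the IDs and data layout are hypothetical.

```python
# Toy stand-ins for three CoderData tables; IDs and layout are hypothetical.
experiments = [  # (sample_id, drug_id) for each dose-response record
    ("s1", "d1"), ("s1", "d1"), ("s1", "d2"), ("s2", "d1"), ("s3", "d2"),
]
transcriptomics_samples = {"s1", "s2"}
mutation_samples = {"s1", "s3"}


def count_pairs(records, *sample_sets):
    """Count unique (sample, drug) pairs whose sample appears in every set."""
    pairs = set(records)  # repeated measurements collapse to one pair
    return sum(
        all(sample in samples for samples in sample_sets)
        for sample, _drug in pairs
    )


sample_drug_pairs = count_pairs(experiments)
sample_drug_transcript_pairs = count_pairs(experiments, transcriptomics_samples)
sample_drug_transcriptomics_mutation_pairs = count_pairs(
    experiments, transcriptomics_samples, mutation_samples
)
print(sample_drug_pairs, sample_drug_transcript_pairs,
      sample_drug_transcriptomics_mutation_pairs)
```

Each additional data type can only shrink the count, which matches the pattern in most rows of the table, where the counts fall from left to right as more modalities are required.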

Drug Curve Metrics Collected

The following table summarizes the number of drugs associated with each dose-response metric across the datasets.

Drug Curve Metrics Summary
dataset	num_drugs	aac	abc	auc	dss	fit_auc	fit_ec50	fit_ec50se	fit_einf	fit_hs	fit_ic50	fit_r2	lmm	mRESCIST	published_auc	TGI
beataml	164	X		X	X	X	X	X	X	X	X	X
bladder	50	X		X	X	X	X	X	X	X	X	X
ccle	24	X		X	X	X	X	X	X	X	X	X
colorectal	10	X		X	X	X	X	X	X	X	X	X
ctrpv2	459	X		X	X	X	X	X	X	X	X	X
fimm	52	X		X	X	X	X	X	X	X	X	X
gcsi	44	X		X	X	X	X	X	X	X	X	X
gdscv1	294	X		X	X	X	X	X	X	X	X	X
gdscv2	171	X		X	X	X	X	X	X	X	X	X
liver	76	X		X	X	X	X	X	X	X	X	X
mpnst	30	X	X	X	X	X	X	X	X	X	X	X	X	X		X
nci60	55157	X		X	X	X	X	X	X	X	X	X
novartis	25		X										X	X		X
pancreatic	25	X		X	X	X	X	X	X	X	X	X
prism	1419	X		X	X	X	X	X	X	X	X	X
sarcoma	34														X

Types of dose-response metrics collected include:

AAC - Area above the response curve; the complement of AUC.
ABC - Area between curves; the difference between the AUC of the control cells and that of the treated cells.
AUC - Area under the fitted Hill curve across all doses present. A lower AUC indicates less growth.
DSS - Drug sensitivity score; a multiparametric dose-response value that takes into account control and treated cells.
fit_auc - Area under the fitted Hill curve across the common interval of −log10[M], where the molar concentration ranges from 10⁻⁴ to 10⁻¹⁰.
fit_ec50 - The fitted curve's prediction of the −log10[M] concentration at which 50% of the maximal effect is observed.
fit_ec50se - Standard error of the fit_ec50 estimate.
fit_einf - The fraction of cells that are unaffected even at an infinite dose concentration, calculated as the lower asymptote of the Hill function.
fit_hs - The estimated Hill slope (binding cooperativity), calculated as the slope of the sigmoidal Hill curve.
fit_ic50 - The fitted curve's prediction of the −log10[M] concentration required to reduce tumor growth by 50%.
fit_r2 - Coefficient of determination between observed growth and the fitted Hill curve, indicating goodness of fit.
lmm - The “time and treatment interaction” term of a linear mixed model with time and treatment as fixed effects and patient as a random effect. Indicates how much the treatment changes the slope of log(volume) over time compared to the control.
mRESCIST - Disease status classified into PD (progressive disease), SD (stable disease), PR (partial response), and CR (complete response), based on percent volume change and cumulative average response.
published_auc - Area under the curve as published by the original source.
TGI - Tumor growth inhibition between the control and treatment time-volume curves.
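
To make the fitted-curve metrics more concrete, the sketch below evaluates a Hill curve and derives AUC and AAC from it. It rests on two assumptions that this page does not confirm: a common Hill parameterization in which the surviving fraction falls from 1 toward fit_einf as the dose increases, and an AUC normalized by the width of the dose interval so that AAC equals 1 − AUC. CoderData's own fitting code may differ, so treat this as an illustration of how the quantities relate rather than as the reference implementation.

```python
def molar_from_neg_log10(value):
    """Convert a -log10[M] value (as in fit_ec50 or fit_ic50) to mol/L."""
    return 10.0 ** (-value)


def hill_response(neg_log10_dose, ec50, einf, hs):
    """Fraction of surviving cells at a dose, under the assumed Hill form.

    `neg_log10_dose` and `ec50` are both in -log10[M]; `einf` is the lower
    asymptote and `hs` the Hill slope.
    """
    ratio = 10.0 ** ((ec50 - neg_log10_dose) * hs)  # (dose / EC50) ** hs
    return einf + (1.0 - einf) / (1.0 + ratio)


def normalized_auc(ec50, einf, hs, low=4.0, high=10.0, steps=600):
    """Trapezoidal area under the curve over -log10[M] in [low, high],
    divided by the interval width so the result lies between 0 and 1."""
    width = (high - low) / steps
    xs = [low + i * width for i in range(steps + 1)]
    ys = [hill_response(x, ec50, einf, hs) for x in xs]
    area = sum((ys[i] + ys[i + 1]) / 2.0 * width for i in range(steps))
    return area / (high - low)


auc = normalized_auc(ec50=7.0, einf=0.2, hs=1.0)
aac = 1.0 - auc  # complement of AUC under the normalization assumed here
print(molar_from_neg_log10(6.0))  # a value of 6 corresponds to 1 micromolar
print(round(hill_response(7.0, ec50=7.0, einf=0.2, hs=1.0), 3))
print(round(auc, 3), round(aac, 3))
```

In this example the EC50 sits at the midpoint of the 10⁻⁴ to 10⁻¹⁰ interval, so the curve is symmetric about it and the normalized AUC comes out at roughly 0.6, halfway between the lower asymptote of 0.2 and full survival.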
