CoderData Cancer Omics and Drug Experiment Response Data (`coderdata`) Python Package

Introduction

CoderData is a cancer benchmark data package developed in Python and R. The package has two parts: a backend build process and a user-facing Python package. The build process is a GitHub workflow that generates four cancer datasets in a format that is easy for users and algorithms to ingest. The Python package lets users download the data, load it into Python, and reformat it as needed.

Installation and Usage

Installation

Assuming python>=3.9 is installed on the system, run the following command in a terminal to install the most recent release of the coderdata API:

$ pip install coderdata
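
Before installing, it can help to confirm that the interpreter meets the stated python>=3.9 requirement. The snippet below is a minimal sketch using only the standard library; the helper name `meets_minimum` is hypothetical and not part of coderdata.

```python
import sys

# Minimum version taken from the installation note above.
MIN_PYTHON = (3, 9)


def meets_minimum(version_info=None, minimum=MIN_PYTHON):
    """Return True if a (major, minor, ...) version tuple is at least `minimum`.

    Hypothetical helper for illustration; defaults to the running interpreter.
    """
    if version_info is None:
        version_info = sys.version_info
    # Compare only (major, minor); tuples compare element by element.
    return tuple(version_info[:2]) >= tuple(minimum)


print("Python version OK:", meets_minimum())
```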

Bash / Command line

A full list of available datasets can be retrieved via:

$ coderdata --list

To download a dataset, run the following command in your terminal, substituting <DATASET> with the name of the desired dataset (e.g. beataml). To download all datasets, use --name all.

$ coderdata download --name <DATASET>
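
The same command can be driven from a script, for example in a setup step of a pipeline. The following is a sketch that wraps the documented `coderdata download --name <DATASET>` invocation with the standard library's `subprocess` module. The function names `build_download_command` and `download_via_cli` are hypothetical, and the sketch assumes the `coderdata` executable is installed and on the PATH.

```python
import subprocess


def build_download_command(name="all"):
    """Return the argument list for `coderdata download --name <name>`.

    Hypothetical helper; it only assembles the command documented above.
    """
    return ["coderdata", "download", "--name", name]


def download_via_cli(name="all"):
    """Run the download command in a subprocess.

    Assumes `coderdata` is on the PATH. With check=True, a non-zero exit
    status raises subprocess.CalledProcessError.
    """
    subprocess.run(build_download_command(name), check=True)
```

Calling `download_via_cli("beataml")` would then run the same download as the terminal command. Passing the arguments as a list avoids shell quoting issues.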

Python

To download a dataset, load it, and inspect its contents in Python, run the following commands.

>>> import coderdata as cd
>>> cd.download(name='beataml')
>>> beataml = cd.load('beataml')
>>> beataml.experiments
         source  improve_sample_id improve_drug_id    study  time time_unit dose_response_metric  dose_response_value
0       synapse               3907       SMI_11123  BeatAML    72       hrs              fit_auc               0.0564
1       synapse               3907       SMI_11211  BeatAML    72       hrs              fit_auc               0.9621
2       synapse               3907       SMI_12192  BeatAML    72       hrs              fit_auc               0.1691
3       synapse               3907       SMI_12254  BeatAML    72       hrs              fit_auc               0.4245
4       synapse               3907       SMI_12469  BeatAML    72       hrs              fit_auc               0.7397
...         ...                ...             ...      ...   ...       ...                  ...                  ...
233775  synapse               3626        SMI_7110  BeatAML    72       hrs                  dss               0.0000
233776  synapse               3626        SMI_7590  BeatAML    72       hrs                  dss               0.0000
233777  synapse               3626        SMI_8159  BeatAML    72       hrs                  dss               0.1946
233778  synapse               3626        SMI_8724  BeatAML    72       hrs                  dss               0.0000
233779  synapse               3626         SMI_987  BeatAML    72       hrs                  dss               0.7165

[233780 rows x 8 columns]
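
The printed table mixes several response metrics (for example fit_auc and dss) in the dose_response_metric column, so a typical first step is to select the rows for one metric. The sketch below illustrates that selection on plain dictionaries built from the ten rows shown above, reduced to three columns. It assumes the rows are available as records (for a pandas DataFrame, `to_dict("records")` produces this shape); the helper `values_for_metric` is hypothetical and not part of coderdata.

```python
from statistics import mean

# Rows copied from the printed output above, reduced to three columns.
records = [
    {"improve_drug_id": "SMI_11123", "dose_response_metric": "fit_auc", "dose_response_value": 0.0564},
    {"improve_drug_id": "SMI_11211", "dose_response_metric": "fit_auc", "dose_response_value": 0.9621},
    {"improve_drug_id": "SMI_12192", "dose_response_metric": "fit_auc", "dose_response_value": 0.1691},
    {"improve_drug_id": "SMI_12254", "dose_response_metric": "fit_auc", "dose_response_value": 0.4245},
    {"improve_drug_id": "SMI_12469", "dose_response_metric": "fit_auc", "dose_response_value": 0.7397},
    {"improve_drug_id": "SMI_7110", "dose_response_metric": "dss", "dose_response_value": 0.0000},
    {"improve_drug_id": "SMI_7590", "dose_response_metric": "dss", "dose_response_value": 0.0000},
    {"improve_drug_id": "SMI_8159", "dose_response_metric": "dss", "dose_response_value": 0.1946},
    {"improve_drug_id": "SMI_8724", "dose_response_metric": "dss", "dose_response_value": 0.0000},
    {"improve_drug_id": "SMI_987", "dose_response_metric": "dss", "dose_response_value": 0.7165},
]


def values_for_metric(rows, metric):
    """Return the response values of all rows that report `metric`."""
    return [
        row["dose_response_value"]
        for row in rows
        if row["dose_response_metric"] == metric
    ]


fit_auc = values_for_metric(records, "fit_auc")
dss = values_for_metric(records, "dss")
print("fit_auc rows:", len(fit_auc), "mean:", round(mean(fit_auc), 4))
print("dss rows:", len(dss), "mean:", round(mean(dss), 4))
```

With pandas, the equivalent selection is a boolean mask on the dose_response_metric column.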

For more in-depth instructions, view our Usage page.

Datasets

Dataset       Cancer Types  Samples  Drugs  Transcriptomics  Proteomics  Mutations  Copy Number
Broad Sanger           106     2053  56082             1697        1008       1729         1790
CPTAC                   10     1139      0             1113        1086        833         1024
HCMI                    29      758      0              396           0        289          282
BeatAML                  1     1022    163              707         210        871            0
MPNST                    1       50     25               35           6         29           32
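
When choosing a dataset for a task, it is useful to know which ones include a given data type at all. The sketch below transcribes the counts from the table above into a dictionary and lists the datasets with a non-zero count in a chosen column. The names `COLUMNS`, `TABLE`, and `datasets_with` are hypothetical and exist only for this illustration.

```python
# Column order matches the table above.
COLUMNS = (
    "cancer_types", "samples", "drugs",
    "transcriptomics", "proteomics", "mutations", "copy_number",
)

# Counts transcribed from the table above.
TABLE = {
    "Broad Sanger": (106, 2053, 56082, 1697, 1008, 1729, 1790),
    "CPTAC": (10, 1139, 0, 1113, 1086, 833, 1024),
    "HCMI": (29, 758, 0, 396, 0, 289, 282),
    "BeatAML": (1, 1022, 163, 707, 210, 871, 0),
    "MPNST": (1, 50, 25, 35, 6, 29, 32),
}


def datasets_with(column):
    """Return dataset names whose count in `column` is greater than zero."""
    idx = COLUMNS.index(column)
    return [name for name, counts in TABLE.items() if counts[idx] > 0]


print("With drug data:", datasets_with("drugs"))
print("With proteomics:", datasets_with("proteomics"))
```

For instance, a drug response model needs a dataset returned by `datasets_with("drugs")`, since CPTAC and HCMI list zero drugs.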

Data Overview

[Figure: Summary 1]
[Figure: Summary 2]