Usage¶
CoderData is a comprehensive package designed for handling cancer benchmark data in Python.
It offers functionalities to download datasets, load them into Python environments, and reformat them according to user needs.
The primary way to interact with coderdata is through the coderdata API.
A command line interface with limited functionality (primarily for downloading data) is also available.
CLI¶
Invoking coderdata from the command line will by default print a help / usage message and exit (see below):
$ coderdata
usage: coderdata [-h] [-l | -v] {download} ...
options:
-h, --help show this help message and exit.
-l, --list prints list of available datasets and exits program.
-v, --version prints the versions of the coderdata API and dataset and exits the program.
commands:
{download}
download subroutine to download datasets. See "coderdata download -h" for more options.
The primary use case of the CLI is to retrieve datasets from the repository.
This can be done by invoking the download routine of coderdata.
Without defining a specific dataset the whole repository will be downloaded:
$ coderdata download
Downloaded 'https://ndownloader.figshare.com/files/48032953' to '/tmp/beataml_drugs.tsv.gz'
Downloaded 'https://ndownloader.figshare.com/files/48032962' to '/tmp/mpnst_drugs.tsv.gz'
...
Downloading a specific dataset can be achieved by passing the -n/--name argument to the download routine:
$ coderdata download --name beataml
Downloaded 'https://ndownloader.figshare.com/files/48032953' to 'beataml_drugs.tsv.gz'
Downloaded 'https://ndownloader.figshare.com/files/48032959' to 'beataml_samples.csv'
...
A full list of available arguments of the download function including a short explanation can be retrieved via the command shown below:
$ coderdata download -h
usage: coderdata download [-h] [-n DATASET_NAME] [-p LOCAL_PATH] [-o]
options:
-h, --help show this help message and exit.
-n, --name DATASET_NAME
name of the dataset to download (e.g., "beataml"). Alternatively, "all" will download the full repository of coderdata datasets. See "coderdata --list" for a
complete list of available datasets. Defaults to "all" if omitted.
-p, --local_path LOCAL_PATH
defines the folder the datasets should be stored in. Defaults to the current working directory if omitted.
-o, --overwrite allow dataset files to be overwritten if they already exist.
In addition to the download functionality, the CLI currently supports displaying basic information such as the version numbers of the package and the dataset (see example call below):
$ coderdata --version
package version: 2.1.0
dataset version: 2.1.0
The CLI can also list the datasets that are available for download (example output below):
$ coderdata --list
Available datasets
------------------
beataml: Beat acute myeloid leukemia (BeatAML) focuses on acute myeloid leukemia tumor data. Data includes drug response, proteomics, and transcriptomics datasets.
bladderpdo: Tumor Evolution and Drug Response in Patient-Derived Organoid models of Bladder Cancer Data includes transcriptomics, mutations, copy number, and drug response data.
ccle: Cancer Cell Line Encyclopedia (CCLE).
cptac: The Clinical Proteomic Tumor Analysis Consortium (CPTAC) project is a collaborative network funded by the National Cancer Institute (NCI) focused on improving our understanding of cancer biology through the integration of transcriptomic, proteomic, and genomic data.
ctrpv2: Cancer Therapeutics Response Portal version 2 (CTRPv2).
fimm: Institute for Molecular Medicine Finland (FIMM) dataset.
gcsi: The Genentech Cell Line Screening Initiative (gCSI).
gdscv1: Genomics of Drug Sensitivity in Cancer (GDSC) v1.
gdscv2: Genomics of Drug Sensitivity in Cancer (GDSC) v2.
hcmi: Human Cancer Models Initiative (HCMI) encompasses numerous cancer types and includes cell line, organoid, and tumor data. Data includes the transcriptomics, somatic mutation, and copy number datasets.
mpnst: Malignant Peripheral Nerve Sheath Tumor is a rare, aggressive sarcoma that affects peripheral nerves throughout the body.
mpnstpdx: Patient derived xenograft data for MPNST.
nci60: National Cancer Institute 60.
pancpdo: Organoid Profiling Identifies Common Responders to Chemotherapy in Pancreatic Cancer Data includes transcriptomics, mutations, copy number, and drug response data.
prism: Profiling Relative Inhibition Simultaneously in Mixtures.
sarcpdo: The landscape of drug sensitivity and resistance in sarcoma Data includes transcriptomics, mutations, and drug response data.
------------------
To download individual datasets run "coderdata download -name DATASET_NAME" where "DATASET_NAME" is for example "beataml".
Windows Users¶
On Windows, a different syntax is needed to invoke coderdata from the command line. Use python to call the module located within the coderdata package, following the structure python -m <package>.<module>.
Refer to instructions at docs.python.org/cmdline for additional support.
To invoke coderdata, replace the command $ coderdata with:
$ python -m coderdata.cli
All other commands demonstrated above keep the same syntax, with coderdata replaced by the substitute shown above.
Examples of adjusted commands to list available datasets, download a dataset, and view the available arguments of the download function, respectively:
$ python -m coderdata.cli -l
$ python -m coderdata.cli download -n beataml
$ python -m coderdata.cli download -h
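When these calls are scripted from Python, the module form shown above can be assembled with sys.executable so that the interpreter currently running is used on every platform. The following is a minimal sketch; the helper names build_cli_command and run_cli are hypothetical and not part of coderdata, and actually running a command assumes coderdata is installed.

```python
import subprocess
import sys


def build_cli_command(*args):
    """Return the argument list for invoking the coderdata CLI as a module.

    Hypothetical helper for illustration, not part of coderdata itself.
    """
    return [sys.executable, "-m", "coderdata.cli", *args]


def run_cli(*args):
    """Run the CLI and return its captured stdout (requires coderdata)."""
    result = subprocess.run(
        build_cli_command(*args), capture_output=True, text=True, check=True
    )
    return result.stdout


# Only the command lists are built here; nothing is executed.
list_cmd = build_cli_command("-l")
download_cmd = build_cli_command("download", "-n", "beataml")
```

Calling run_cli("-l") would then execute the equivalent of the first adjusted command above.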
API Usage¶
Downloading data ¶
Using the coderdata API, the download process is handled through the download function in the downloader module.
>>> import coderdata as cd
>>> cd.download(name='beataml')
Downloaded 'https://ndownloader.figshare.com/files/48032953' to 'beataml_drugs.tsv.gz'
Downloaded 'https://ndownloader.figshare.com/files/48032959' to 'beataml_samples.csv'
Downloaded 'https://ndownloader.figshare.com/files/48032965' to 'beataml_mutations.csv.gz'
Downloaded 'https://ndownloader.figshare.com/files/48032968' to 'beataml_proteomics.csv.gz'
Downloaded 'https://ndownloader.figshare.com/files/48032974' to 'beataml_experiments.tsv.gz'
Downloaded 'https://ndownloader.figshare.com/files/48033052' to 'beataml_transcriptomics.csv.gz'
Downloaded 'https://ndownloader.figshare.com/files/48033058' to 'beataml_drug_descriptors.tsv.gz'
Downloaded 'https://ndownloader.figshare.com/files/48033064' to 'genes.csv.gz'
As with the CLI, the download() function accepts a local path in which to store the downloaded files, as well as a flag that defines whether existing files should be overwritten.
For example, the function call below will download all ‘BeatAML’ related datasets to the local path /tmp/coderdata/ and will overwrite files if they already exist.
>>> cd.download(name='beataml', local_path='/tmp/coderdata/', exist_ok=True)
Note that if exist_ok=False (the default if omitted) and a downloaded file already exists, a warning is issued and the file is not stored.
Finally, to download all datasets, the name argument can be set explicitly to name='all' or omitted altogether, as name defaults to 'all'.
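The effect of exist_ok described above can be pictured with a small stand-in that needs no network access. This is only a sketch of the described behavior under the assumption that a skipped file is left untouched; the function store_file is hypothetical and is not how coderdata implements its downloader.

```python
import tempfile
import warnings
from pathlib import Path


def store_file(path, data, exist_ok=False):
    """Write data to path; skip with a warning if it exists and exist_ok is False.

    Illustrative stand-in only, not coderdata's actual implementation.
    Returns True if the file was written.
    """
    path = Path(path)
    if path.exists() and not exist_ok:
        warnings.warn(f"{path} already exists; file not stored")
        return False
    path.write_bytes(data)
    return True


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "beataml_samples.csv"
    first = store_file(target, b"v1")                  # new file: written
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        second = store_file(target, b"v2")             # exists: skipped, warning
    third = store_file(target, b"v3", exist_ok=True)   # exists: overwritten
    final_content = target.read_bytes()
```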
The Dataset object¶
The Dataset object is the central data structure in CoderData.
It automatically initializes attributes for each data type, such as tumor samples and drug response data, as well as associated omics data such as proteomics.
Each datatype in a Dataset is internally stored in a pandas.DataFrame.
Loading data into a Dataset object ¶
The code snippet below loads the previously downloaded ‘BeatAML’ dataset into a Dataset object called beataml.
>>> beataml = cd.load(name='beataml', local_path='/tmp/coderdata')
Importing raw data ...
Importing 'transcriptomics' from /tmp/coderdata/beataml_transcriptomics.csv.gz ... DONE
Importing 'drugs' from /tmp/coderdata/beataml_drugs.tsv.gz ... DONE
Importing 'proteomics' from /tmp/coderdata/beataml_proteomics.csv.gz ... DONE
Importing 'drug_descriptors' from /tmp/coderdata/beataml_drug_descriptors.tsv.gz ... DONE
Importing 'mutations' from /tmp/coderdata/beataml_mutations.csv.gz ... DONE
Importing 'samples' from /tmp/coderdata/beataml_samples.csv ... DONE
Importing 'experiments' from /tmp/coderdata/beataml_experiments.tsv.gz ... DONE
Importing 'genes' from /tmp/coderdata/genes.csv.gz ... DONE
Importing raw data ... DONE
Additionally, the load() function allows loading data from a previously pickled Dataset object (see Saving manipulated Dataset objects).
Displaying the datatypes in a Dataset object¶
The data types associated with a dataset can be displayed via the Dataset.types() function.
The function will return a simple list of available datatypes.
>>> beataml.types()
['transcriptomics', 'proteomics', 'mutations', 'samples', 'drugs', 'experiments', 'genes']
Individual datatypes can be accessed and manipulated via the corresponding attribute of the dataset.
For example, extracting the underlying pandas.DataFrame that contains drug response values for ‘BeatAML’ can be done via the command below:
>>> beataml.experiments
source improve_sample_id improve_drug_id study time time_unit dose_response_metric dose_response_value
0 synapse 3909 SMI_3871 BeatAML 72 hrs fit_auc 0.8004
1 synapse 3909 SMI_4862 BeatAML 72 hrs fit_auc 0.8718
2 synapse 3909 SMI_11493 BeatAML 72 hrs fit_auc 0.4916
3 synapse 3909 SMI_23048 BeatAML 72 hrs fit_auc 0.7600
4 synapse 3909 SMI_51801 BeatAML 72 hrs fit_auc 0.5468
... ... ... ... ... ... ... ... ...
236615 synapse 3628 SMI_13100 BeatAML 72 hrs dss 0.2715
236616 synapse 3628 SMI_40233 BeatAML 72 hrs dss 0.1521
236617 synapse 3628 SMI_16810 BeatAML 72 hrs dss 0.2663
236618 synapse 3628 SMI_35928 BeatAML 72 hrs dss 0.1946
236619 synapse 3628 SMI_32922 BeatAML 72 hrs dss 0.2500
[236620 rows x 8 columns]
Reformatting and exporting datatypes¶
Internally all data is stored in long format.
If different formats are needed for further analysis or as input for the training of machine learning models, the Dataset.format(data_type, **kwargs) function is able to return individual data types in altered formats.
For example the drug response data can be reformatted into wide format via the following command:
>>> beataml.format(data_type='experiments', shape='wide', metrics=['fit_auc', 'dss'])
source improve_sample_id improve_drug_id study time time_unit dss fit_auc
0 synapse 3192 SMI_10197 BeatAML 72 hrs 0.0000 0.8013
1 synapse 3192 SMI_10282 BeatAML 72 hrs 0.0000 0.7837
2 synapse 3192 SMI_10376 BeatAML 72 hrs 0.3403 0.6893
3 synapse 3192 SMI_11493 BeatAML 72 hrs 0.0000 0.7933
4 synapse 3192 SMI_11502 BeatAML 72 hrs 0.2797 0.6160
... ... ... ... ... ... ... ... ...
23657 synapse 3918 SMI_56594 BeatAML 72 hrs 0.0000 0.8086
23658 synapse 3918 SMI_56596 BeatAML 72 hrs 0.0000 0.8136
23659 synapse 3918 SMI_6419 BeatAML 72 hrs 0.0000 0.8103
23660 synapse 3918 SMI_7971 BeatAML 72 hrs 0.2662 0.7069
23661 synapse 3918 SMI_8294 BeatAML 72 hrs 0.2355 0.6617
[23662 rows x 8 columns]
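The long-to-wide reshaping shown above can be illustrated without coderdata or pandas. The sketch below is an assumption about what the wide shape means (one row per sample and drug pair, one column per requested metric), not coderdata's implementation; the function long_to_wide and the toy values are hypothetical, while the column names are taken from the output above.

```python
def long_to_wide(rows, metrics):
    """Pivot long-format response rows into one row per (sample, drug) pair."""
    wide = {}
    for row in rows:
        # Drop metrics that were not requested.
        if row["dose_response_metric"] not in metrics:
            continue
        key = (row["improve_sample_id"], row["improve_drug_id"])
        entry = wide.setdefault(
            key, {"improve_sample_id": key[0], "improve_drug_id": key[1]}
        )
        # Each metric becomes its own column.
        entry[row["dose_response_metric"]] = row["dose_response_value"]
    return list(wide.values())


# Toy values for illustration only, not real BeatAML measurements.
long_rows = [
    {"improve_sample_id": 1, "improve_drug_id": "SMI_A",
     "dose_response_metric": "fit_auc", "dose_response_value": 0.80},
    {"improve_sample_id": 1, "improve_drug_id": "SMI_A",
     "dose_response_metric": "dss", "dose_response_value": 0.10},
    {"improve_sample_id": 1, "improve_drug_id": "SMI_B",
     "dose_response_metric": "fit_auc", "dose_response_value": 0.55},
    {"improve_sample_id": 1, "improve_drug_id": "SMI_B",
     "dose_response_metric": "fit_ec50", "dose_response_value": 2.0},
]
wide_rows = long_to_wide(long_rows, metrics=["fit_auc", "dss"])
```

In this sketch a pair without a value for a requested metric simply lacks that key; how the real function fills such gaps is not covered here.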
Note that the Dataset.format(data_type, **kwargs) function behaves slightly differently for different data_type values.
For example, for data_type='experiments' the accepted keyword arguments are shape and metrics.
shape defines which format the resulting pandas.DataFrame should be in (e.g. long, wide or matrix).
metrics defines the drug response metrics that should be filtered for.
A full list of parameters for the individual data types can be found below:
Dataset.format(data_type='transcriptomics') returns a matrix-like pandas.DataFrame where each cell contains the measured transcriptomics value for a gene (row - entrez_id) in a specific cancer sample (column - improve_sample_id).
Dataset.format(data_type='mutations', mutation_type=...) returns a binary matrix-like pandas.DataFrame with rows representing genes and columns representing samples. mutation_type can be any of the recorded mutation types available (e.g. 'Frame_Shift_Del', 'Frame_Shift_Ins', 'Missense_Mutation' or 'Start_Codon_SNP' among others). Cells contain the value 1 if a mutation in the given gene/sample falls into the category defined by mutation_type. To explore mutation_type options for a Dataset use Dataset.mutations.variant_classification.unique().
Dataset.format(data_type='copy_number', copy_call=False) returns a matrix-like pandas.DataFrame where cells report the mean copy number value for each combination of gene (row - entrez_id) and cancer sample (column - improve_sample_id). If copy_call=True, cells report the discretized measurement (‘deep del’, ‘het loss’, ‘diploid’, ‘gain’, ‘amp’) of copy number provided by the schema.
Dataset.format(data_type='proteomics') returns a matrix-like pandas.DataFrame where each cell contains the measured proteomics value for a gene (row - entrez_id) in a specific cancer sample (column - improve_sample_id).
Dataset.format(data_type='experiments', shape=..., metrics=...) returns a pandas.DataFrame formatted according to the defined shape (shape can be 'long', 'wide' or 'matrix'). metrics further defines which drug response metrics the resulting DataFrame should be filtered for. Examples are 'fit_auc', 'fit_ec50' or 'dss'. If shape='wide', a list containing more than one value can be passed to metrics. To explore metrics options for a Dataset use Dataset.experiments.dose_response_metric.unique().
Dataset.format(data_type='drug_descriptor', shape=..., drug_descriptor_type=...) returns a pandas.DataFrame formatted either in long or wide format (depending on the shape argument). drug_descriptor_type can be defined as a list of desired structural_descriptors in conjunction with shape='wide', to limit the resulting DataFrame to only list the desired structural_descriptors as columns.
Dataset.format(data_type='drugs') is equal to Dataset.drugs. It returns the underlying pandas.DataFrame containing the drug information.
Dataset.format(data_type='genes') is equal to Dataset.genes. It returns the underlying pandas.DataFrame containing the gene information.
Dataset.format(data_type='samples') is equal to Dataset.samples. It returns the underlying pandas.DataFrame containing the cancer sample information.
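To make the binary matrix described for data_type='mutations' more concrete, the sketch below builds such a gene by sample matrix from long-format records in plain Python. It is an illustration under assumptions, not coderdata's implementation: the function binary_mutation_matrix and the toy records are hypothetical, and whether the real output keeps genes without a matching mutation as all-zero rows is assumed here.

```python
def binary_mutation_matrix(rows, mutation_type):
    """Return {gene: {sample: 0 or 1}} marking mutations of the given type."""
    genes = sorted({r["entrez_id"] for r in rows})
    samples = sorted({r["improve_sample_id"] for r in rows})
    # Start with an all-zero gene x sample grid.
    matrix = {g: {s: 0 for s in samples} for g in genes}
    for r in rows:
        if r["variant_classification"] == mutation_type:
            matrix[r["entrez_id"]][r["improve_sample_id"]] = 1
    return matrix


# Toy records for illustration only.
mutation_rows = [
    {"entrez_id": 25, "improve_sample_id": 1,
     "variant_classification": "Missense_Mutation"},
    {"entrez_id": 25, "improve_sample_id": 2,
     "variant_classification": "Frame_Shift_Del"},
    {"entrez_id": 7157, "improve_sample_id": 2,
     "variant_classification": "Missense_Mutation"},
]
missense = binary_mutation_matrix(mutation_rows, "Missense_Mutation")
```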
Creating training / testing and validation splits with coderdata¶
coderdata provides two functions to generate dataset splits.
Dataset.split_train_other() performs a “two-way” split (useful if no validation set is needed for machine learning), and Dataset.split_train_test_validate() performs a “three-way” split.
Both functions return @dataclass objects that contain either the attributes .train and .other (.split_train_other()) or .train, .test and .validate (.split_train_test_validate()), each of which references a Dataset object.
Example uses of .split_train_test_validate() follow below.
Note that both splitting functions share the same arguments; only ratio differs, in that .split_train_test_validate() expects a tuple with 3 elements whereas .split_train_other() expects a 2-element tuple.
>>> split = beataml.split_train_test_validate()
>>> split.train.experiments.shape
(189290, 8)
>>> split.test.experiments.shape
(23660, 8)
>>> split.validate.experiments.shape
(23670, 8)
By default the returned splits will be mixed-set (drugs and cancer samples can appear in all three folds), with a ratio of 8:1:1, no stratification and no set random state (seed).
This behavior can be changed by passing split_type, ratio, stratify_by and random_state to the function.
split_type can be either 'mixed-set', 'drug-blind' or 'cancer-blind':
mixed-set: Splits randomly, independent of the drug / cancer association of the samples. Individual drugs or cancer types can appear in all three splits.
drug-blind: Splits according to drug association. Any sample associated with a drug will be unique to one of the splits. For example, samples associated with drug A will only be present in the train split, but never in test or validate.
cancer-blind: Splits according to cancer association. Equivalent to drug-blind, except cancer types will be unique to splits.
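The idea behind a drug-blind split can be sketched in plain Python: the drugs, rather than the rows, are shuffled and divided, and every row then follows its drug. This is a simplified illustration and not coderdata's algorithm; the function drug_blind_split and the toy rows are hypothetical, and because drugs are divided by count, the row proportions only approximate the requested ratio.

```python
import random


def drug_blind_split(rows, ratio=(8, 1, 1), random_state=None):
    """Assign whole drugs to train/test/validate so no drug spans two splits."""
    rng = random.Random(random_state)
    drugs = sorted({r["improve_drug_id"] for r in rows})
    rng.shuffle(drugs)
    total = sum(ratio)
    n_train = round(len(drugs) * ratio[0] / total)
    n_test = round(len(drugs) * ratio[1] / total)
    groups = {
        "train": set(drugs[:n_train]),
        "test": set(drugs[n_train:n_train + n_test]),
        "validate": set(drugs[n_train + n_test:]),
    }
    # Each row lands in the split that owns its drug.
    return {
        name: [r for r in rows if r["improve_drug_id"] in members]
        for name, members in groups.items()
    }


# Toy data: 5 samples, each screened against 10 drugs.
toy_rows = [
    {"improve_sample_id": s, "improve_drug_id": f"SMI_{d}"}
    for s in range(5) for d in range(10)
]
toy_split = drug_blind_split(toy_rows, ratio=(8, 1, 1), random_state=42)
```

A cancer-blind variant would follow the same pattern with the grouping key swapped from the drug identifier to the cancer type.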
ratio can be used to adjust the split ratios using a 3-item tuple of integers (2 items for .split_train_other()).
For example, ratio=(5, 3, 2) would result in a split where train, test and validate contain roughly 50%, 30% and 20% of the original data respectively.
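How integer ratio weights translate into split sizes can be worked through with a few lines of arithmetic. The helper split_sizes below is hypothetical, and the rounding rule (the last split takes the remainder) is an assumption; coderdata may round differently, which is why the documented proportions are only approximate.

```python
def split_sizes(n_rows, ratio):
    """Return the number of rows per split for integer ratio weights."""
    total = sum(ratio)
    # Integer division for all but the last split.
    sizes = [n_rows * weight // total for weight in ratio[:-1]]
    sizes.append(n_rows - sum(sizes))  # last split takes the remainder
    return tuple(sizes)


three_way = split_sizes(1000, (5, 3, 2))  # 50% / 30% / 20%
two_way = split_sizes(1000, (8, 2))       # 80% / 20%
```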
random_state defines a seed value for the random number generator.
Defining a random_state guarantees reproducibility, as two runs with the same random_state will result in the same splits.
stratify_by defines whether the training, testing, and validation sets should be stratified.
Stratification tries to maintain a similar distribution of feature classes across different splits.
For example, assuming a drug response value threshold that defines positive and negative classes (e.g. reduced vs. no change in cancer cell viability), the splitting algorithm could attempt to keep the proportion of positive to negative class instances similar in each split.
Stratification is performed on the drug response values.
Any value other than None enables stratification and names the drug response metric (e.g. 'fit_auc') whose values should be used as the basis for the stratification.
None indicates that no stratification should be performed.
Which type of stratification should be performed can further be customized with keyword arguments (thresh, num_classes, quantiles).
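Threshold-based stratification can be sketched as follows: values are first binarized with a threshold, then each class is shuffled and divided separately, so that every split keeps a similar class balance. This is a simplified illustration, not coderdata's implementation; the function stratified_split is hypothetical, and treating values at or above thresh as the positive class is an assumption.

```python
import random


def stratified_split(values, thresh, ratio=(8, 1, 1), random_state=None):
    """Split indices so each part keeps a similar share of the positive class."""
    rng = random.Random(random_state)
    parts = [[] for _ in ratio]
    total = sum(ratio)
    for label in (True, False):
        # Indices belonging to the current class (assumed: >= thresh is positive).
        idx = [i for i, v in enumerate(values) if (v >= thresh) == label]
        rng.shuffle(idx)
        start = 0
        for k, weight in enumerate(ratio):
            if k == len(ratio) - 1:
                end = len(idx)  # the last part takes whatever remains
            else:
                end = start + len(idx) * weight // total
            parts[k].extend(idx[start:end])
            start = end
    return parts


# Toy values: 30 "positive" (0.9) and 70 "negative" (0.5) responses.
toy_values = [0.9] * 30 + [0.5] * 70
train_idx, test_idx, validate_idx = stratified_split(
    toy_values, thresh=0.8, ratio=(8, 1, 1), random_state=42
)
positives_per_part = [
    sum(toy_values[i] >= 0.8 for i in part)
    for part in (train_idx, test_idx, validate_idx)
]
```

With these toy values every part ends up with a 30% share of positive instances, matching the share in the full data.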
An example call to create a 70/20/10 drug-blind split that is stratified by fit_auc could look like this:
>>> split = beataml.split_train_test_validate(
... split_type='drug-blind',
... ratio=(7, 2, 1),
... random_state=42,
... stratify_by='fit_auc',
... thresh=0.8
... )
>>> split.train.experiments.shape
(180080, 8)
>>> split.test.experiments.shape
(28520, 8)
>>> split.validate.experiments.shape
(28080, 8)
Saving manipulated Dataset objects (e.g. saving splits) ¶
In order to save a Dataset for later use, the Dataset.save() function can be used.
>>> split.train.save(path='/tmp/coderdata/beataml_train.pickle')
>>> split.test.save(path='/tmp/coderdata/beataml_test.pickle')
>>> split.validate.save(path='/tmp/coderdata/beataml_validate.pickle')
This function can be used to save either the individual splits (as demonstrated above) or the raw Dataset that was the basis for the splits, for example if any modifications of the dataset were performed.
To reload the splits (or the full dataset) the coderdata.load() function (see also Loading data into a Dataset object) can be used.
To load a pickled Dataset, the argument from_pickle=True must be passed to the function:
>>> beataml_train = cd.load('beataml_train', local_path='/tmp/coderdata/', from_pickle=True)
Importing pickled data ... DONE
If the error below occurs with save() or load():
"SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-1: truncated \UXXXXXXXX escape"
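Adjust the syntax for the functions by passing the path as a raw string (prefix r), so that backslashes are not interpreted as escape sequences (example below):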
>>> split.train.save(path=r'\...')
>>> beataml_train = cd.load('beataml_train', local_path=r'\...', from_pickle=True)
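The error arises because, in a regular Python string literal, a backslash followed by U starts a Unicode escape, which a Windows path such as one beginning with C:\Users triggers. The short sketch below shows three equivalent ways of writing such a path; the folder C:\Users\me\coderdata is a hypothetical example and not a location coderdata uses.

```python
from pathlib import PureWindowsPath

# Hypothetical Windows folder used purely for illustration.
raw_path = r"C:\Users\me\coderdata"        # raw string: backslashes kept as-is
escaped_path = "C:\\Users\\me\\coderdata"  # regular string: backslashes doubled

# pathlib joins path components without manual backslash handling.
pickle_file = PureWindowsPath(raw_path) / "beataml_train.pickle"
```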
Note that only individual splits (e.g. only train) can be saved and loaded and not the full Split object.
To learn more about pickling refer to the page docs.python.org/pickle.
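For readers unfamiliar with pickling, the round trip behind saving and reloading individual splits can be sketched with the standard library alone. Plain dictionaries stand in for Dataset objects here, and all names and values are hypothetical; the sketch shows the general pickle mechanism, not coderdata's save() or load() internals.

```python
import pickle
import tempfile
from pathlib import Path

# Plain dicts standing in for the train / test / validate Dataset objects.
toy_parts = {
    "train": {"name": "toy_train", "experiments": [0.80, 0.87]},
    "test": {"name": "toy_test", "experiments": [0.49]},
    "validate": {"name": "toy_validate", "experiments": [0.76]},
}

reloaded = {}
with tempfile.TemporaryDirectory() as tmp:
    # Each part is written to, and read back from, its own pickle file.
    for part, dataset in toy_parts.items():
        path = Path(tmp) / f"{dataset['name']}.pickle"
        with open(path, "wb") as handle:
            pickle.dump(dataset, handle)
        with open(path, "rb") as handle:
            reloaded[part] = pickle.load(handle)
```

Since unpickling can execute arbitrary code, pickle files should only be loaded when they come from a trusted source.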