CoderData Cancer Omics and Drug Experiment Response Data (`coderdata`) Python Package

Introduction

CoderData is a comprehensive package designed for handling cancer benchmark data in Python.
It offers functionalities to download datasets, load them into Python environments, and reformat them according to user needs.

Installation

To install, confirm that Python is available, then run the following command in your terminal:

pip install coderdata

Downloading Data

The download function in CoderData facilitates the downloading of datasets from Figshare. Users can specify a dataset prefix to filter the required files.

Command Line Usage

To download data via the command line, execute the following command:

coderdata download --prefix [PREFIX]

Replace [PREFIX] with the desired dataset prefix (e.g., ‘hcmi’, ‘beataml’). Omit the prefix argument to download all available datasets.

Python Usage

In Python, the download process is handled through the download_data_by_prefix function from the downloader module.

import coderdata as cd

# Download a specific dataset

cd.download_data_by_prefix('beataml')

# Download all datasets

cd.download_data_by_prefix()
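
When several datasets are needed, the call above can be wrapped in a loop. The following is a minimal sketch, not part of the package: `download_prefixes` is a hypothetical helper, and it assumes `download_data_by_prefix` accepts a single prefix string as shown above. The `download` parameter exists only so the helper can be tried out without coderdata installed or network access.

```python
def download_prefixes(prefixes, download=None):
    """Download each dataset prefix in turn and report which ones failed.

    Hypothetical helper for illustration; not part of coderdata.
    """
    if download is None:
        # Imported lazily so the helper can be exercised without coderdata.
        import coderdata as cd
        download = cd.download_data_by_prefix

    failed = {}
    for prefix in prefixes:
        try:
            download(prefix)
        except Exception as exc:
            # Record the error and continue so one bad prefix does not stop the rest.
            failed[prefix] = exc
    return failed


# Example usage (requires coderdata and network access):
# failed = download_prefixes(['hcmi', 'beataml'])
```

The returned dictionary maps each prefix that raised an error to its exception, so an empty dictionary means every download call completed.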

Loading Data

The DatasetLoader class in CoderData is designed for loading datasets into Python.
It automatically initializes an attribute for each datatype, such as transcriptomics, proteomics, and mutations.

import coderdata as cd

# Initialize the DatasetLoader for a specific dataset

broad_sanger = cd.DatasetLoader('broad_sanger')

# Access pandas formatted preview of the samples data

broad_sanger.samples

# Access pandas formatted preview of each data type

broad_sanger.transcriptomics

broad_sanger.proteomics

broad_sanger.pertubations

broad_sanger.mutations

broad_sanger.copy_number

broad_sanger.drugs

broad_sanger.experiments

broad_sanger.genes
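
To see at a glance which datatypes a loaded dataset contains, the attributes listed above can be inspected in a loop. This is a rough sketch under assumptions: `summarize_datatypes` is a hypothetical helper, the attribute names are a subset of those in the listing above, and it assumes each attribute supports `len()` (as a pandas DataFrame does) and that a datatype may be missing or set to `None` for some datasets.

```python
# Attribute names taken from the listing above.
DATATYPES = [
    'samples', 'transcriptomics', 'proteomics', 'mutations',
    'copy_number', 'drugs', 'experiments', 'genes',
]


def summarize_datatypes(loader, datatypes=DATATYPES):
    """Return {datatype: row count} for the attributes present on `loader`.

    Hypothetical helper for illustration; not part of coderdata.
    """
    summary = {}
    for name in datatypes:
        table = getattr(loader, name, None)
        if table is None:
            # Skip datatypes this dataset does not provide.
            continue
        summary[name] = len(table)
    return summary


# Example usage (requires coderdata and downloaded data):
# print(summarize_datatypes(broad_sanger))
```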

Joining Datasets

The join_datasets function in CoderData is designed for joining and loading datasets in Python with the most flexibility possible. It is capable of joining initialized, previously joined, or non-initialized datasets. This means you may modify a dataset before joining it with another.

import coderdata as cd

# Initialize the DatasetLoader for a specific dataset

hcmi = cd.DatasetLoader('hcmi')

# Initialize a second dataset to join with the first

beataml = cd.DatasetLoader('beataml')

# Join two previously initialized datasets

joined_dataset1 = cd.join_datasets(beataml, hcmi)

# Join a previously joined dataset with a non-initialized dataset

# Passing a dataset name as a quoted string loads it from local files using the DatasetLoader class.

joined_dataset2 = cd.join_datasets(joined_dataset1, "broad_sanger")

# Join multiple datasets using every method available

joined_dataset3 = cd.join_datasets("broad_sanger", beataml)

joined_dataset4 = cd.join_datasets(joined_dataset3, "cptac", hcmi)

Reformatting Datasets

You can reformat datasets into long or wide formats using the reformat_dataset method. By default, data is in the long format.
Reformatting from long to wide retains three columns: entrez_id, improve_sample_id, and the value of interest (such as transcriptomics).
Datasets cannot be joined while any datatype is in the wide format.

import coderdata as cd

# Reformat a specific datatype

hcmi.reformat_dataset('transcriptomics', 'wide')

# Reformat all datatypes to the wide format

hcmi.reformat_dataset('wide')

# Reformat all datatypes back to the long format

hcmi.reformat_dataset('long')
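
To make the long and wide layouts concrete, the sketch below reshapes a few made-up long-format records by hand. It is an illustration only, not how reformat_dataset is implemented: the `long_to_wide` helper and the sample values are hypothetical, and only the column names `entrez_id`, `improve_sample_id`, and `transcriptomics` come from the description above. The orientation chosen here (one row per gene, one column per sample) is an assumption.

```python
def long_to_wide(records, value_column):
    """Pivot long-format rows into wide[entrez_id][improve_sample_id] = value.

    Hypothetical helper for illustration; not part of coderdata.
    """
    wide = {}
    for row in records:
        # One nested dict per gene, keyed by sample id.
        gene_row = wide.setdefault(row['entrez_id'], {})
        gene_row[row['improve_sample_id']] = row[value_column]
    return wide


# Made-up long-format records: one row per (gene, sample) pair.
long_records = [
    {'entrez_id': 7157, 'improve_sample_id': 1, 'transcriptomics': 5.2},
    {'entrez_id': 7157, 'improve_sample_id': 2, 'transcriptomics': 4.8},
    {'entrez_id': 672, 'improve_sample_id': 1, 'transcriptomics': 1.1},
]

wide = long_to_wide(long_records, 'transcriptomics')
print(wide)
```

In the long layout every measurement is its own row; in the wide layout each gene appears once and the samples become columns. With pandas, a comparable reshape is available through `DataFrame.pivot`.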

Reloading Datasets

The reload_datasets method is useful for reloading specific datasets or all datasets from local storage, especially if the data files have been updated or altered.

import coderdata as cd

# Reload a specific datatype

hcmi.reload_datasets('transcriptomics')

# Reload all datatypes

hcmi.reload_datasets()

Info Function

The info method tells you which datatypes are available, their long/wide format, and which datasets they came from.

# Get information about the joined datasets
joined_dataset4.info()
# The output is as follows:
This is a joined dataset comprising of:
- beataml: Beat acute myeloid leukemia (BeatAML) data was collected though GitHub and Synapse.
- hcmi: Human Cancer Models Initiative (HCMI) data was collected though the National Cancer Institute (NCI) Genomic Data Commons (GDC) Data Portal.
- broad_sanger: The cell line datasets were collected from numerous resources such as the LINCS project, broad_sanger, and the Sanger Institute.
- cptac: The Clinical Proteomic Tumor Analysis Consortium (CPTAC) project is a collaborative network funded by the National Cancer Institute (NCI).
Available Datatypes and Their Formats
- copy_number: long format
- mutations: long format
- proteomics: long format
- samples: long format
- transcriptomics: long format
- drugs: long format
- experiments: long format
Datatype Origins:
- proteomics: Data from beataml, broad_sanger, cptac
- transcriptomics: Data from beataml, broad_sanger, hcmi, cptac
- copy_number: Data from broad_sanger, hcmi, cptac
- mutations: Data from beataml, broad_sanger, hcmi, cptac
- samples: Data from beataml, broad_sanger, hcmi, cptac
- drugs: Data from beataml, broad_sanger
- experiments: Data from beataml, broad_sanger

Conclusion

CoderData provides a robust and flexible way to work with cancer benchmark data.
With these functions, researchers and data scientists can load, combine, and reshape complex datasets in their Python environments.