CoderData is a comprehensive package designed for handling cancer benchmark data in Python.
It offers functionalities to download datasets, load them into Python environments, and reformat them according to user needs.
To install, confirm that Python is available and then run the following command in your terminal:
pip install coderdata
The download function in CoderData downloads datasets from Figshare. Users can specify a dataset prefix to select only the files they need.
To download data via the command line, execute the following command:
coderdata download --prefix [PREFIX]
Replace [PREFIX] with the desired dataset prefix (e.g., ‘hcmi’, ‘beataml’). Omit the prefix argument to download all available datasets.
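When the command-line download needs to be driven from a script, the command can be assembled programmatically. The following is only a minimal sketch: `build_download_command` and `run_download` are hypothetical helpers, not part of CoderData, and the sketch assumes the `coderdata` executable is on the PATH.

```python
import subprocess


def build_download_command(prefix=None):
    """Return the argument list for `coderdata download` (hypothetical helper)."""
    command = ["coderdata", "download"]
    if prefix:
        # Pass --prefix only when a dataset prefix was requested;
        # omitting it downloads all available datasets.
        command += ["--prefix", prefix]
    return command


def run_download(prefix=None):
    """Run the CLI download; raises CalledProcessError on a non-zero exit code."""
    subprocess.run(build_download_command(prefix), check=True)


# Inspect the commands without running them.
print(build_download_command("beataml"))
print(build_download_command())
```

Building the argument list separately from running it keeps the command easy to inspect or log before any download starts.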
In Python, the download process is handled through the download_data_by_prefix function from the downloader module.
import coderdata as cd
# Download a specific dataset
cd.download_data_by_prefix('beataml')
# Download all datasets
cd.download_data_by_prefix()
The DatasetLoader class in CoderData is designed for loading datasets into Python.
It automatically initializes an attribute for each data type, such as transcriptomics, proteomics, and mutations.
import coderdata as cd
# Initialize the DatasetLoader for a specific dataset
broad_sanger = cd.DatasetLoader('broad_sanger')
# Access pandas formatted preview of the samples data
broad_sanger.samples
# Access pandas formatted preview of each data type
broad_sanger.transcriptomics
broad_sanger.proteomics
broad_sanger.pertubations
broad_sanger.mutations
broad_sanger.copy_number
broad_sanger.drugs
broad_sanger.experiments
broad_sanger.genes
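Assuming each attribute behaves like a pandas DataFrame, ordinary pandas operations apply to it. The sketch below uses a small hand-made table in place of real data: the column names entrez_id, improve_sample_id, and transcriptomics follow the long-format description in this document, the values are invented for illustration, and `records_per_sample` is a hypothetical helper rather than a CoderData function.

```python
import pandas as pd

# Toy stand-in for a long-format table such as broad_sanger.transcriptomics.
# The values are made up for illustration only.
toy = pd.DataFrame({
    "entrez_id": [1, 2, 1, 2, 3],
    "improve_sample_id": [10, 10, 11, 11, 11],
    "transcriptomics": [5.2, 0.0, 4.8, 1.1, 7.3],
})


def records_per_sample(table):
    """Count the rows (one per gene measurement) recorded for each sample."""
    return table.groupby("improve_sample_id").size()


counts = records_per_sample(toy)
print(counts)
```

With a loaded dataset, the same helper could be applied to an attribute in place of the toy table, provided the attribute carries an improve_sample_id column.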
The join_datasets function in CoderData joins and loads datasets in Python with as much flexibility as possible.
It is capable of joining initialized, previously joined, or non-initialized datasets. This means you may modify a dataset before joining it with another.
import coderdata as cd
# Initialize the DatasetLoader for a specific dataset
hcmi = cd.DatasetLoader('hcmi')
# Initialize the DatasetLoader for a second dataset
beataml = cd.DatasetLoader('beataml')
# Join two previously initialized datasets
joined_dataset1 = cd.join_datasets(beataml, hcmi)
# Join a previously joined dataset with a non-initialized dataset
# A dataset name passed as a quoted string is loaded from local files using DatasetLoader.
joined_dataset2 = cd.join_datasets(joined_dataset1, "broad_sanger")
# Join multiple datasets using every method available
joined_dataset3 = cd.join_datasets("broad_sanger", beataml)
joined_dataset4 = cd.join_datasets(joined_dataset3, "cptac", hcmi)
You can reformat datasets into long or wide formats using the reformat_dataset method. By default, data is in the long format.
Reformatting from long to wide retains three fields: entrez_id, improve_sample_id, and the value of interest (such as transcriptomics).
Datasets cannot be joined while any datatype is in the wide format.
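import coderdata as cd
# Load the dataset whose datatypes will be reformatted
hcmi = cd.DatasetLoader('hcmi')
# Reformat a specific datatype to the wide format
hcmi.reformat_dataset('transcriptomics', 'wide')
# Reformat all datatypes to the wide format
hcmi.reformat_dataset('wide')
# Reformat all datatypes back to the long format
hcmi.reformat_dataset('long')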
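To make the difference between the two layouts concrete, the following sketch converts a toy table between long and wide using plain pandas. It is not CoderData's implementation: the values are invented, and the wide orientation shown here (genes as rows, samples as columns) is an assumption made for illustration.

```python
import pandas as pd

# Long format: one row per (gene, sample) measurement.
long_df = pd.DataFrame({
    "entrez_id": [1, 2, 1, 2],
    "improve_sample_id": [10, 10, 11, 11],
    "transcriptomics": [5.2, 0.0, 4.8, 1.1],
})

# Long -> wide: one row per gene and one column per sample.
wide_df = long_df.pivot(
    index="entrez_id",
    columns="improve_sample_id",
    values="transcriptomics",
)

# Wide -> long: melt the sample columns back into rows.
back_to_long = wide_df.reset_index().melt(
    id_vars="entrez_id",
    var_name="improve_sample_id",
    value_name="transcriptomics",
)

print(wide_df)
print(back_to_long)
```

The wide table holds the same three pieces of information as the long one; the sample identifiers simply move from a column into the column labels.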
The reload_datasets method is useful for reloading specific datasets or all datasets from local storage, especially if the data files have been updated or altered.
import coderdata as cd
# Load the dataset to be reloaded later
hcmi = cd.DatasetLoader('hcmi')
# Reload a specific datatype
hcmi.reload_datasets('transcriptomics')
# Reload all datatypes
hcmi.reload_datasets()
The info method tells you which datatypes are available, their long/wide format, and which datasets they came from.
CoderData provides a robust and flexible way to work with cancer benchmark data.
By using these functionalities, researchers and data scientists can easily manipulate and analyze complex datasets in their Python environments.