API Reference

CoderData Object

coderdata.download.downloader.download(name: str = 'all', local_path: PathLike = PosixPath('/home/runner/work/coderdata/coderdata'), exist_ok: bool = False)

Download the most recent version of files from a Figshare dataset, filtered by a specific prefix or all files.

This function queries the Figshare API to retrieve details of a dataset and then downloads files from it. Files can be filtered by a specified prefix such as hcmi, beataml, etc. If ‘all’, an empty string, or None is passed as the prefix, all files in the dataset are downloaded. The function identifies the most recent version of a file by selecting the one with the highest ID among duplicates with the same name.

Parameters:

name (str, optional) – The prefix of the dataset to download (e.g., ‘hcmi’). If ‘all’, an empty string, or None, all files in the dataset are downloaded. Default is ‘all’.

Returns:

The function downloads files to the local repository and does not return any value.

Return type:

None
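
The selection rule described above (keep, for each file name, the entry with the highest ID, after an optional prefix filter) can be sketched in plain Python. This is an illustrative sketch only, not the package's implementation: the helper name latest_files, the dictionary keys "id" and "name", and the use of str.startswith for prefix matching are assumptions.

```python
def latest_files(files, prefix="all"):
    """Keep the highest-ID entry per file name, optionally filtered by prefix.

    Hypothetical helper for illustration; not part of coderdata.
    """
    # 'all', an empty string, or None all mean "no prefix filter".
    if prefix in ("all", "", None):
        wanted = files
    else:
        wanted = [f for f in files if f["name"].startswith(prefix)]

    latest = {}
    for entry in wanted:
        current = latest.get(entry["name"])
        # Among duplicates with the same name, the highest ID wins.
        if current is None or entry["id"] > current["id"]:
            latest[entry["name"]] = entry
    return latest


# Hypothetical file listing; real Figshare records carry more fields.
listing = [
    {"id": 101, "name": "beataml_samples.csv"},
    {"id": 205, "name": "beataml_samples.csv"},
    {"id": 150, "name": "hcmi_samples.csv"},
]
selected = latest_files(listing, prefix="beataml")
```

Here selected holds a single entry for beataml_samples.csv, the one with ID 205; passing prefix=None keeps one entry per distinct file name.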

Collection of small utility and helper functions.

coderdata.utils.utils.list_datasets(raw: bool = False) dict | None

Helper function that returns a list of available datasets, including a short description and any additional information available.

Parameters:

raw (bool, default=False) – If set to True, returns a YAML dictionary containing all available datasets including additional information. If set to False, prints information to stdout and returns None.

Returns:

A dict containing the information if raw==True; otherwise the information is printed to stdout and None is returned.

Return type:

dict | None

coderdata.utils.utils.version() dict

Helper function that returns the version strings for the package and the dataset build.

Returns:

Contains package and dataset build version.

Return type:

dict

Command Line Interface to retrieve coderdata datasets.

coderdata.cli.check_folder(path: str | PathLike | Path) Path

Helper function to check if a defined folder exists.

Returns:

Cleaned path object with the absolute path to the folder passed to the function.

Return type:

Path

Raises:
  • TypeError – If passed path argument is not of the requested type.

  • OSError – If the passed path argument does not link to a valid existing folder.
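
The documented contract (accept a str or path-like object, return an absolute Path, raise TypeError or OSError otherwise) could be met by a helper along the following lines. This is a minimal sketch under assumptions: the name check_folder_sketch and the error messages are made up, and the actual coderdata.cli.check_folder may be implemented differently.

```python
from os import PathLike
from pathlib import Path


def check_folder_sketch(path):
    """Return the absolute Path of an existing folder, or raise.

    Hypothetical stand-in for illustration; not the coderdata source.
    """
    # Reject anything that is neither a string nor a path-like object.
    if not isinstance(path, (str, PathLike)):
        raise TypeError(f"expected str or PathLike, got {type(path).__name__}")

    # resolve() turns the path into an absolute one.
    folder = Path(path).resolve()
    if not folder.is_dir():
        raise OSError(f"{folder} is not an existing folder")
    return folder


current = check_folder_sketch(".")
```

Validating the type before touching the file system keeps the two failure modes distinct, which matches the two exceptions listed above.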

coderdata.cli.info(args)

Helper function that takes the parsed command line arguments and prints either version information or information on the available datasets, depending on the arguments in args.

Parameters:

args (Namespace) – A Namespace object that contains commandline arguments parsed by ArgumentParser.parse_args().

Dataset Object

class coderdata.dataset.dataset.Dataset(name: str | None = None, transcriptomics: DataFrame | None = None, proteomics: DataFrame | None = None, mutations: DataFrame | None = None, copy_number: DataFrame | None = None, samples: DataFrame | None = None, drugs: DataFrame | None = None, drug_descriptors: DataFrame | None = None, mirna: DataFrame | None = None, experiments: DataFrame | None = None, methylation: DataFrame | None = None, metabolomics: DataFrame | None = None, genes: DataFrame | None = None, combinations: DataFrame | None = None)
coderdata.dataset.dataset.Dataset.save(self, path: Path) None

_summary_

coderdata.dataset.dataset.load(name: str, local_path: str | Path = PosixPath('/home/runner/work/coderdata/coderdata'), from_pickle: bool = False) Dataset

_summary_

Parameters:
  • name (str) – _description_

  • directory (str | Path, optional) – _description_, by default Path.cwd()

Returns:

_description_

Return type:

Dataset

Raises:
  • OSError – _description_

  • TypeError – _description_

coderdata.dataset.dataset.train_test_validate(data: Dataset, split_type: Literal['mixed-set', 'drug-blind', 'cancer-blind'] = 'mixed-set', ratio: tuple[int, int, int] = (8, 1, 1), stratify_by: str | None = None, balance: bool = False, random_state: int | RandomState | None = None, **kwargs: dict) Split

Splits a CoderData object (see also coderdata.load.loader.DatasetLoader) into three subsets for training, testing and validating machine learning algorithms.

The size of the splits can be adjusted to be different from 80:10:10 (the default) for train:test:validate. The function also accepts optional arguments that define the type of split that is performed (‘mixed-set’, ‘drug-blind’, ‘cancer-blind’), whether the splits should be stratified (and which drug response metric to use), and a random seed to enable the creation of reproducible splits. Furthermore, keyword arguments can be supplied that will be passed to the stratification function if so desired.

Parameters:
  • data (DatasetLoader) – CoderData object containing a full dataset either downloaded from the CoderData repository (see also coderdata.download.downloader.download_data_by_prefix) or built locally via the build_all process. The object must first be loaded via coderdata.load.loader.DatasetLoader.

  • split_type ({'mixed-set', 'drug-blind', 'cancer-blind'}, default='mixed-set') –

    Defines the type of split that should be generated:

    • mixed-set: Splits randomly independent of drug / cancer association of the samples. Individual drugs or cancer types can appear in all three splits.

    • drug-blind: Splits according to drug association. Any sample associated with a drug will be unique to one of the splits. For example samples with association to drug A will only be present in the train split, but never in test or validate.

    • cancer-blind: Splits according to cancer association. Equivalent to drug-blind, except cancer types will be unique to splits.

  • ratio (tuple[int, int, int], default=(8,1,1)) – Defines the size ratio of the resulting train, test and validation sets.

  • stratify_by (str | None, default=None) – Defines if the training, testing and validation sets should be stratified. Any value other than None indicates stratification and defines which drug response value should be used as the basis for the stratification. None indicates that no stratification should be performed.

  • random_state (int | RandomState | None, default=None) – Defines a seed value for the randomization of the splits. Will get passed to internal functions. Providing the seed enables reproducibility of the generated splits.

  • **kwargs – Additional keyword arguments that will be passed to the function that generates classes for the stratification (see also _create_classes).

Returns:

Splits – A Split object that contains three Dataset objects as attributes (Split.train, Split.test, Split.validate)

Return type:

Split

Raises:
  • ValueError – If supplied split_type is not in the list of accepted values.
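
To make the ‘drug-blind’ idea concrete, the sketch below assigns whole drugs (rather than individual samples) to the three subsets using only the standard library. It is an illustration under stated assumptions, not the algorithm used by train_test_validate: the function name drug_blind_split, the record layout with a "drug" key, and the choice to apply the ratio to the number of drugs are all hypothetical.

```python
import random


def drug_blind_split(records, ratio=(8, 1, 1), seed=None):
    """Assign whole drugs to train/test/validate according to ratio.

    Hypothetical sketch; the ratio is applied to drugs, not to rows.
    """
    rng = random.Random(seed)  # a fixed seed gives a reproducible split
    drugs = sorted({r["drug"] for r in records})
    rng.shuffle(drugs)

    total = sum(ratio)
    n_train = round(len(drugs) * ratio[0] / total)
    n_test = round(len(drugs) * ratio[1] / total)
    groups = {
        "train": set(drugs[:n_train]),
        "test": set(drugs[n_train:n_train + n_test]),
        "validate": set(drugs[n_train + n_test:]),
    }
    # Every record follows its drug, so no drug appears in two subsets.
    return {
        name: [r for r in records if r["drug"] in members]
        for name, members in groups.items()
    }


# Hypothetical records: 10 drugs with 3 samples each.
records = [{"drug": f"drug_{d}", "sample": s} for d in range(10) for s in range(3)]
split = drug_blind_split(records, ratio=(8, 1, 1), seed=0)
```

With 10 drugs and the default ratio, 8 drugs land in train and 1 each in test and validate. A ‘cancer-blind’ variant would follow the same pattern with the cancer type as the grouping key, while ‘mixed-set’ would shuffle the records themselves.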

Collection of helper scripts to generate general statistics on the data contained in a CoderData Object.

coderdata.utils.stats.plot_response_metric(data: Dataset, metric: str = 'auc', ax: Axes = None, **kwargs: dict) None

Creates a histogram detailing the distribution of dose response values for a given dose response metric.

If used in conjunction with matplotlib.pyplot.subplot or matplotlib.pyplot.subplots and the axes object is passed to the function, the function populates the axes object with the generated plot.

Parameters:
  • data (coderdata.DataLoader) – A full CoderData object of a dataset

  • metric (str, default='auc') – A string that defines the response metric that should be plotted

  • ax (matplotlib.axes.Axes, default=None) – An Axes object can be defined. This is useful if a multipanel subplot has been defined beforehand via matplotlib.pyplot.subplots. Passing the location of the axes to the function will then populate the subplot at the given location with the generated plot.

  • **kwargs (dict, optional) –

    Additional keyword arguments that can be passed to the function:

    • bins : int - sets the number of bins; passed to seaborn.histplot

    • title : str - sets the title of the axes

    • kde : bool - adds a kernel density estimate plot into the histogram

Return type:

None

Example

In a Jupyter Notebook environment the following snippet can be used to display a histogram detailing the distribution of drug response AUC measures in the beataml dataset.

>>> import coderdata as cd
>>> beataml = cd.DataLoader('beataml')
>>> cd.plot_response_metric(data=beataml, metric='auc', bins=10)

For generating multipanel plots we can make use of matplotlib and the ax parameter of this function. Furthermore, other features / parameters of the created figure can be changed (e.g. the title of the figure via suptitle()). Finally, the figure can be saved.

>>> import coderdata as cd
>>> import matplotlib.pyplot as plt
>>> beataml = cd.DataLoader('beataml')
>>> fig, axs = plt.subplots(ncols=2, figsize=(10, 5))
>>> cd.plot_response_metric(
...     data=beataml,
...     metric='auc',
...     bins=10,
...     ax=axs[0]
...     )
>>> cd.plot_response_metric(
...     data=beataml,
...     metric='aac',
...     bins=10,
...     ax=axs[1]
...     )
>>> fig.set_layout_engine('tight')
>>> fig.suptitle('Distribution of drug response values')
>>> fig.savefig('figure.png')

coderdata.utils.stats.summarize_response_metric(data: Dataset) DataFrame

Helper function to extract basic statistics for the experiments object in a CoderData object. Uses pandas.DataFrame.describe() internally to generate count, mean, standard deviation, minimum, 25th, 50th and 75th percentiles as well as maximum for dose_response_value for each dose_response_metric present in experiments.

Parameters:

data (coderdata.cd.Dataset) – A full CoderData object of a dataset

Returns:

A pandas.DataFrame containing basic statistics for each dose response metric.

Return type:

pandas.DataFrame

Example

The example assumes that a dataset with the prefix ‘beataml’ has been downloaded previously. See also coderdata.download().

>>> import coderdata as cd
>>> beataml = cd.DataLoader('beataml')
>>> summary_stats = cd.summarize_response_metric(data=beataml)
>>> summary_stats
                        count          mean           std
dose_response_metric
aac                   23378.0  3.028061e-01  1.821265e-01  ...
auc                   23378.0  6.971939e-01  1.821265e-01  ...
dss                   23378.0  3.218484e-01  5.733492e-01  ...
...                   ...      ...           ...           ...
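
The kind of per-metric summary shown above can be reproduced on a toy table with plain pandas. This is a sketch under assumptions, not the function's source: the description only states that pandas.DataFrame.describe() is used internally, so the groupby step and the made-up values below are illustrative.

```python
import pandas as pd

# Hypothetical long-format experiments table with made-up values.
experiments = pd.DataFrame(
    {
        "dose_response_metric": ["auc", "auc", "auc", "aac", "aac", "aac"],
        "dose_response_value": [0.6, 0.7, 0.8, 0.4, 0.3, 0.2],
    }
)

# One row of descriptive statistics per dose response metric.
summary = (
    experiments
    .groupby("dose_response_metric")["dose_response_value"]
    .describe()
)
```

The resulting frame is indexed by dose_response_metric and has the columns count, mean, std, min, 25%, 50%, 75% and max.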