API Reference
CoderData Object
- coderdata.download.downloader.download(name: str = 'all', local_path: PathLike = PosixPath('/home/runner/work/coderdata/coderdata'), exist_ok: bool = False)
Download the most recent version of files from a Figshare dataset, filtered by a specific prefix or all files.
This function queries the Figshare API to retrieve details of a dataset and then downloads files from it. Files can be filtered by a specified prefix such as hcmi, beataml, etc. If ‘all’, an empty string, or None is passed as the prefix, all files in the dataset are downloaded. The function identifies the most recent version of a file by selecting the one with the highest ID among duplicates with the same name.
- Parameters:
name (str, optional) – The prefix of the dataset to download (e.g., ‘hcmi’). If ‘all’, an empty string, or None, all files in the dataset are downloaded. Default is ‘all’.
- Returns:
The function downloads files to the local repository and does not return any value.
- Return type:
None
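The selection rule described above (filter by prefix, then keep the highest ID among files that share a name) can be sketched in a few lines. This is only an illustration, not the library's implementation: the helper name `select_latest_files`, the `id` and `name` record fields, the example file names, and the use of `str.startswith` for prefix matching are all assumptions made for this example.

```python
def select_latest_files(files, prefix=None):
    """Return {name: record}, keeping the highest-ID record per file name.

    Hypothetical helper: 'all', '' and None are treated as "no filtering",
    mirroring the behaviour described for the download function.
    """
    match_all = prefix in (None, "", "all")
    latest = {}
    for record in files:
        name = record["name"]
        # Skip files that do not belong to the requested dataset prefix.
        if not match_all and not name.startswith(prefix):
            continue
        # Among duplicates with the same name, the highest ID wins.
        if name not in latest or record["id"] > latest[name]["id"]:
            latest[name] = record
    return latest


# Made-up file listing standing in for an API response.
files = [
    {"id": 101, "name": "hcmi_samples.csv"},
    {"id": 205, "name": "hcmi_samples.csv"},
    {"id": 150, "name": "beataml_samples.csv"},
]

hcmi_only = select_latest_files(files, prefix="hcmi")
everything = select_latest_files(files, prefix="all")
```

Here `hcmi_only` holds a single entry (the record with ID 205), while `everything` holds one record per distinct file name.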
Dataset Object
- class coderdata.dataset.dataset.Dataset(name: str = None, transcriptomics: DataFrame = None, proteomics: DataFrame = None, mutations: DataFrame = None, copy_number: DataFrame = None, samples: DataFrame = None, drugs: DataFrame = None, drug_descriptors: DataFrame = None, mirna: DataFrame = None, experiments: DataFrame = None, methylation: DataFrame = None, metabolomics: DataFrame = None, genes: DataFrame = None, combinations: DataFrame = None)
- coderdata.dataset.dataset.Dataset.save(self, path: Path) → None
- coderdata.dataset.dataset.format(data: Dataset, data_type: Literal['transcriptomics', 'mutations', 'copy_number', 'proteomics', 'experiments', 'combinations', 'drug_descriptor', 'drugs', 'genes', 'samples'], use_polars: bool = False, **kwargs: dict)
- coderdata.dataset.dataset.load(name: str, local_path: str | Path = PosixPath('/home/runner/work/coderdata/coderdata'), from_pickle: bool = False) → Dataset
- Parameters:
name (str)
local_path (str | Path, optional) – by default Path.cwd()
- Return type:
Dataset
- Raises:
OSError
TypeError
- coderdata.dataset.dataset.split_train_other(data: Dataset, split_type: Literal['mixed-set', 'drug-blind', 'cancer-blind'] = 'mixed-set', ratio: tuple[int, int] = (8, 2), stratify_by: str | None = None, balance: bool = False, random_state: int | RandomState | None = None, **kwargs: dict)
- coderdata.dataset.dataset.split_train_test_validate(data: Dataset, split_type: Literal['mixed-set', 'drug-blind', 'cancer-blind'] = 'mixed-set', ratio: tuple[int, int, int] = (8, 1, 1), stratify_by: str | None = None, balance: bool = False, random_state: int | RandomState | None = None, **kwargs: dict) → Split
- coderdata.dataset.dataset.train_test_validate(data: Dataset, split_type: Literal['mixed-set', 'drug-blind', 'cancer-blind'] = 'mixed-set', ratio: tuple[int, int, int] = (8, 1, 1), stratify_by: str | None = None, balance: bool = False, random_state: int | RandomState | None = None, **kwargs: dict) → Split
Splits a CoderData object (see also coderdata.load.loader.DatasetLoader) into three subsets for training, testing and validating machine learning algorithms.
The size of the splits can be adjusted to differ from the default of 80:10:10 for train:test:validate. Optional arguments define the type of split that is performed (‘mixed-set’, ‘drug-blind’, ‘cancer-blind’), whether the splits should be stratified (and which drug response metric to use), and a random seed that makes the splits reproducible. Furthermore, keyword arguments can be supplied that will be passed on to the stratification function.
- Parameters:
data (DatasetLoader) – CoderData object containing a full dataset either downloaded from the CoderData repository (see also coderdata.download.downloader.download_data_by_prefix) or built locally via the build_all process. The object must first be loaded via coderdata.load.loader.DatasetLoader.
split_type ({'mixed-set', 'drug-blind', 'cancer-blind'}, default='mixed-set') – Defines the type of split that should be generated:
- mixed-set: Splits randomly, independent of the drug / cancer association of the samples. Individual drugs or cancer types can appear in all three splits.
- drug-blind: Splits according to drug association. Any sample associated with a drug will be unique to one of the splits. For example, samples associated with drug A will only be present in the train split, but never in test or validate.
- cancer-blind: Splits according to cancer association. Equivalent to drug-blind, except cancer types will be unique to splits.
ratio (tuple[int, int, int], default=(8,1,1)) – Defines the size ratio of the resulting train, test and validation sets.
stratify_by (str | None, default=None) – Defines whether the training, testing and validation sets should be stratified. Any value other than None enables stratification and names the drug response value used as its basis. None indicates that no stratification should be performed.
random_state (int | RandomState | None, default=None) – Defines a seed value for the randomization of the splits. Will get passed to internal functions. Providing a seed makes the generated splits reproducible.
**kwargs – Additional keyword arguments that will be passed to the function that generates classes for the stratification (see also _create_classes).
- Returns:
Splits – A Split object that contains three Dataset objects as attributes (Split.train, Split.test, Split.validate)
- Return type:
Split
- Raises:
ValueError – If the supplied split_type is not in the list of accepted values.
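To make the `ratio` and "blind" split semantics more concrete, the following standard-library sketch assigns whole groups (for instance, drugs) to train, test and validate so that no group appears in more than one split. It is not CoderData's implementation: `ratio_to_fractions` and `group_blind_split` are hypothetical helpers, the records are made-up dictionaries rather than a Dataset, the split is sized by number of groups rather than number of samples, and stratification and balancing are left out entirely.

```python
import random
from collections import defaultdict


def ratio_to_fractions(ratio):
    """Turn an integer ratio such as (8, 1, 1) into fractions summing to 1."""
    total = sum(ratio)
    return tuple(part / total for part in ratio)


def group_blind_split(records, group_key, ratio=(8, 1, 1), random_state=None):
    """Hypothetical 'blind' split: each group lands in exactly one subset."""
    rng = random.Random(random_state)

    # Collect the records belonging to each group (e.g. each drug).
    groups = defaultdict(list)
    for record in records:
        groups[record[group_key]].append(record)

    # Sort before shuffling so a given seed always yields the same order.
    group_ids = sorted(groups)
    rng.shuffle(group_ids)

    # Size the subsets by number of groups (an assumption of this sketch).
    train_frac, test_frac, _ = ratio_to_fractions(ratio)
    n_train = round(train_frac * len(group_ids))
    n_test = round(test_frac * len(group_ids))
    assignment = {
        "train": group_ids[:n_train],
        "test": group_ids[n_train:n_train + n_test],
        "validate": group_ids[n_train + n_test:],
    }
    return {
        split: [rec for gid in ids for rec in groups[gid]]
        for split, ids in assignment.items()
    }


# Toy data: 10 made-up drugs with 3 samples each.
records = [
    {"drug": f"drug_{d}", "sample": f"sample_{s}"}
    for d in range(10)
    for s in range(3)
]
splits = group_blind_split(records, "drug", ratio=(8, 1, 1), random_state=42)
```

A 'mixed-set' split would instead shuffle the individual records, so the same drug could appear in all three subsets. Passing the same `random_state` twice yields identical subsets, which is the reproducibility property the parameter description refers to.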