Decomprolute

The goal of this package is to both run and evaluate tumor deconvolution algorithms on multi-omics data. We provide the ability to assess a suite of algorithms and cell signature matrices so that you can select your algorithm in a data-driven fashion. We also provide a modular framework that enables you to add your own algorithm or cell signature. To do so, please see our GitHub site.

Prepare your system

To run the code you will need to download Docker and a CWL interpreter that supports CWL v1.2, such as cwltool. These tools enable the different modules to interoperate. Once both are installed, you can test your setup by running deconvolution on a single data type as shown below:

cwltool https://raw.githubusercontent.com/PNNL-CompBio/decomprolute/main/metrics/prot-deconv.cwl --cancer hnscc --protAlg mcpcounter --sampleType tumor --signature LM7c

This will run the MCP-counter algorithm on proteomics data from the CPTAC HNSCC (head and neck squamous cell carcinoma) cohort using our LM7c signature, and confirm that your system is able to run the more complex analyses. More specific use cases are described in the sections that follow.
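If the test command fails immediately, it may help to first confirm that both prerequisites are on your PATH. The following is a minimal sketch; the helper name `have_tool` is our own and is not part of Decomprolute.

```shell
# Hypothetical helper (not part of Decomprolute): succeeds if a command is on PATH.
have_tool() {
  command -v "$1" >/dev/null 2>&1
}

# Report the status of each prerequisite without aborting on the first miss.
for tool in docker cwltool; do
  if have_tool "$tool"; then
    echo "$tool: found"
  else
    echo "$tool: NOT found; install it before running the workflows"
  fi
done
```

Note that this only checks that the executables exist; it does not verify that the Docker daemon is running or that your cwltool version supports CWL v1.2.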

Deconvolve CPTAC data

Decomprolute can be used to deconvolve cell types in a specific CPTAC dataset, as we have included numerous publicly available datasets and algorithms within the framework. Specifically, you can run the prot-deconv.cwl script with the following arguments:

cwltool https://raw.githubusercontent.com/PNNL-CompBio/decomprolute/main/metrics/prot-deconv.cwl --cancer hnscc --protAlg cibersort --sampleType tumor --signature LM9
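To compare algorithms on the same cohort, the command can be generated once per algorithm. The sketch below only prints the commands so you can inspect them first; the helper name `deconv_cmd` is hypothetical, and we have not confirmed here that every algorithm runs with every signature, so treat the LM7c pairing as an assumption.

```shell
# Workflow URL taken from the command above.
WORKFLOW=https://raw.githubusercontent.com/PNNL-CompBio/decomprolute/main/metrics/prot-deconv.cwl

# Hypothetical helper: print the cwltool command for one run.
# $1 = cancer type, $2 = algorithm, $3 = signature
deconv_cmd() {
  echo "cwltool $WORKFLOW --cancer $1 --protAlg $2 --sampleType tumor --signature $3"
}

# Print one command per algorithm listed in this README.
for alg in cibersort epic xcell mcpcounter; do
  deconv_cmd hnscc "$alg" LM7c
done
```

Piping the output to `sh` would execute the four runs in sequence.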

CPTAC Data

This package leverages data collected through the Clinical Proteomic Tumor Analysis Consortium (CPTAC) as the foundation of its benchmarking metrics. This consortium has collected data from hundreds of patient tumors, including proteomic and transcriptomic data from the same patients. Given the general confidence in transcriptomic-based tumor deconvolution, we can use these data to compare transcriptomic and proteomic tumor deconvolution in the same patient samples.

We have collected these data via the CPTAC Python API to better match the mRNA data. The CWL tools and Docker images are in the protData and mRNAdata directories.

Below are the available tumor types:

| Dataset name | Description | Data reuse status | Publication link |
| --- | --- | --- | --- |
| Brca | breast cancer | no restrictions | https://pubmed.ncbi.nlm.nih.gov/33212010/ |
| Ccrcc | clear cell renal cell carcinoma (kidney) | no restrictions | https://pubmed.ncbi.nlm.nih.gov/31675502/ |
| Colon | colorectal cancer | no restrictions | https://pubmed.ncbi.nlm.nih.gov/31031003/ |
| Endometrial | endometrial carcinoma (uterine) | no restrictions | https://pubmed.ncbi.nlm.nih.gov/32059776/ |
| Gbm | glioblastoma | no restrictions | https://pubmed.ncbi.nlm.nih.gov/33577785/ |
| Hnscc | head and neck squamous cell carcinoma | no restrictions | https://pubmed.ncbi.nlm.nih.gov/33417831/ |
| **Lscc** | **lung squamous cell carcinoma** | **password access only** | **unpublished** |
| Luad | lung adenocarcinoma | no restrictions | https://pubmed.ncbi.nlm.nih.gov/32649874/ |
| Ovarian | high grade serous ovarian cancer | no restrictions | https://pubmed.ncbi.nlm.nih.gov/27372738/ |
| **Pdac** | **pancreatic ductal adenocarcinoma** | **password access only** | **unpublished** |

As such, the available datasets have been updated to the following: ['brca', 'ccrcc', 'endometrial', 'colon', 'ovarian', 'hnscc', 'luad']

As more datasets are published we will update the list accordingly.
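Because the list of available datasets changes over time, a wrapper script may want to validate the cancer type before launching a workflow. This is a minimal sketch: the names are copied from the list above, and `is_supported_cancer` is a hypothetical helper of our own.

```shell
# Dataset names copied from the list above; update as new datasets are published.
SUPPORTED_CANCERS="brca ccrcc endometrial colon ovarian hnscc luad"

# Hypothetical helper: succeeds only if $1 exactly matches a supported name.
is_supported_cancer() {
  for name in $SUPPORTED_CANCERS; do
    [ "$name" = "$1" ] && return 0
  done
  return 1
}

cancer=hnscc
if is_supported_cancer "$cancer"; then
  echo "$cancer is available"
else
  echo "$cancer is not in the current dataset list" >&2
fi
```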

Algorithms

We have included numerous algorithms in this package. Docker files and requisite data are included in the existing code base.

| Algorithm | Source |
| --- | --- |
| cibersort | Cibersort |
| epic | EPIC |
| xcell | xCell |
| mcpcounter | MCP Counter |

Cell type signatures

There are numerous ways to define the individual cell types we are using to run the deconvolution algorithms. We will upload specific lists to compare in our workflow.

| List Name | Description | Source |
| --- | --- | --- |
| LM7c | Seven cell types (B, CD4 T, CD8 T, dendritic cells, granulocytes, monocytes, NK) collapsed from proteomic data | Rieckmann et al. |
| 3’ PBMCs | Seven cell types (B, CD4 T, CD8 T (CD8 T + NK T), dendritic cells, megakaryocytes, monocytes, NK) from scRNA-seq data | Newman et al. |
| LM9 | Ten cell types predicted by MCPCounter signature | |
| LM22 | The original matrix from cibersort | Newman et al. |

Deconvolve your own data

If you have a specific dataset you’d like to deconvolve but are not sure which tool to use, you can use the tools in the metrics directory to determine and then run the best algorithm for your data.

To identify the signature matrix/algorithm combination that agrees best between your own mRNA and protein data, you can run the workflow described below, replacing the files listed in the best-test.yml file with your own.

Run the algorithm/signature matrix that correlates best between mRNA and protein

To assess which algorithm/signature matrix provides the best agreement between mRNA and protein datasets, you will need to provide two matrices from your own data as input into the run-best-alg-by-cor workflow.

We recommend replacing the two files listed in the YAML file below with your own matrices, so that the mRNA and protein correlations identify the best algorithm for your data.

cwltool https://raw.githubusercontent.com/PNNL-CompBio/decomprolute/main/metrics/mrna-prot/run-best-alg-by-cor.cwl https://raw.githubusercontent.com/PNNL-CompBio/decomprolute/main/metrics/mrna-prot/best-test.yml
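One way to substitute your own files is to download the example inputs file, edit it locally, and pass the local copy to the workflow. The sketch below assumes `curl` is installed; the local file name `my-best-test.yml` and both function names are hypothetical. The calls are left commented out so that nothing is downloaded or run until you choose to.

```shell
# Directory containing the workflow and its example inputs (from the command above).
BASE=https://raw.githubusercontent.com/PNNL-CompBio/decomprolute/main/metrics/mrna-prot
LOCAL_YAML=my-best-test.yml   # hypothetical local file name

# Download the example inputs file so it can be edited locally.
fetch_template() {
  curl -fsSL -o "$LOCAL_YAML" "$BASE/best-test.yml"
}

# Run the workflow against the edited local copy.
run_best_by_cor() {
  cwltool "$BASE/run-best-alg-by-cor.cwl" "$LOCAL_YAML"
}

# Typical sequence (uncomment to run):
# fetch_template
# "${EDITOR:-vi}" "$LOCAL_YAML"   # point the two file entries at your own matrices
# run_best_by_cor
```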

Run the algorithm on simulated data

To assess which algorithm/signature matrix combination agrees best with simulated data, you can use either mRNA or protein data as input into the run-best-alg-by-sim workflow. Below is an example using our test data.

cwltool https://raw.githubusercontent.com/PNNL-CompBio/decomprolute/main/metrics/data-sim/run-best-alg-by-sim.cwl --datFile https://raw.githubusercontent.com/PNNL-CompBio/decomprolute/main/toy_data/ov-all-prot-reduced.tsv --data-type prot

Evaluate metrics on a new algorithm or signature matrix

In the manuscript we completed three separate tests of proteomic tumor deconvolution algorithms. To benchmark your own algorithm or signature matrix, follow the Contribution guide on the main GitHub page to add to our framework, then you can run the following metrics as described in our manuscript.

Performance on simulated data

We have simulated both mRNA and proteomics data from established experiments as described below. We try to evaluate mRNA data on mRNA-derived simulations, and proteomics data on proteomics-derived simulated data. The datasets themselves are stored in the simulatedData directory.

We have included two YAML files to use as test runs of each simulation.

cwltool https://raw.githubusercontent.com/PNNL-CompBio/decomprolute/main/metrics/data-sim/simul-data-comparison.cwl https://raw.githubusercontent.com/PNNL-CompBio/decomprolute/main/metrics/data-sim/rna-sim-test.yml ##evaluate rna-based deconvolution
cwltool https://raw.githubusercontent.com/PNNL-CompBio/decomprolute/main/metrics/data-sim/simul-data-comparison.cwl https://raw.githubusercontent.com/PNNL-CompBio/decomprolute/main/metrics/data-sim/prot-sim-test.yml ##evaluate protein based deconvolution

These will produce the necessary summary statistics and figures.
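Since each test run can take a while, it may be convenient to run both in sequence and keep a log per run. This is a sketch under our own assumptions: the function names and the log file naming are hypothetical, and the final call is commented out so nothing executes until you enable it.

```shell
# Directory containing the simulation workflow and its test inputs (from the commands above).
SIM_BASE=https://raw.githubusercontent.com/PNNL-CompBio/decomprolute/main/metrics/data-sim

# Hypothetical helper: derive a log file name from an inputs file name.
log_name() {
  echo "${1%.yml}.log"
}

# Run both test inputs in order, stopping at the first failure.
run_sim_tests() {
  for inputs in rna-sim-test.yml prot-sim-test.yml; do
    log=$(log_name "$inputs")
    echo "running $inputs (log: $log)"
    cwltool "$SIM_BASE/simul-data-comparison.cwl" "$SIM_BASE/$inputs" >"$log" 2>&1 || return 1
  done
}

# run_sim_tests   # uncomment to execute both test runs
```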

mRNA-Proteomics Comparison

We also wanted to measure how consistent an algorithm is between mRNA and proteomics data. This workflow iterates through all algorithms, datasets, and signature matrices and compares how similar each cell type prediction is across mRNA and proteomic samples.

cwltool https://raw.githubusercontent.com/PNNL-CompBio/decomprolute/main/metrics/mrna-prot/mrna-prot-comparison.cwl https://raw.githubusercontent.com/PNNL-CompBio/decomprolute/main/metrics/mrna-prot/alg-test.yml

This will run the evaluation in our test YAML file. To update the parameters, create your own YAML file. The workflow currently has five parameters:

  1. mrna-algorithms: List of algorithms to use to deconvolve mRNA data. Each entry must be one of epic, xcell, cibersort, mcpcounter.
  2. prot-algorithms: List of algorithms to use to deconvolve protein data. Each entry must be one of epic, xcell, cibersort, mcpcounter.
  3. cancerTypes: List of cancer types
  4. signatures: List of signature matrices, currently found in the signature matrix directory
  5. tissueTypes: List of tissue types: tumor, normal, or all
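As a starting point for your own inputs file, the sketch below writes a YAML file containing the five keys listed above. The key names come from this README, but the value format is an assumption: we write each value as a plain list of strings, whereas the real alg-test.yml may encode some entries differently (for example, signatures as file references). Compare against alg-test.yml before using it. The output file name `my-alg-test.yml` is hypothetical.

```shell
# Write a custom inputs file. Key names come from the parameter list above.
# ASSUMPTION: every value is a plain YAML list of strings; check alg-test.yml
# for the exact encoding the workflow expects.
cat > my-alg-test.yml <<'EOF'
mrna-algorithms:
  - mcpcounter
  - xcell
prot-algorithms:
  - mcpcounter
  - xcell
cancerTypes:
  - hnscc
  - luad
signatures:
  - LM7c
tissueTypes:
  - tumor
EOF

# Print the command that would run the comparison with the new file.
MRNA_PROT=https://raw.githubusercontent.com/PNNL-CompBio/decomprolute/main/metrics/mrna-prot
echo "cwltool $MRNA_PROT/mrna-prot-comparison.cwl my-alg-test.yml"
```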

Pan-Immune clustering annotation

Lastly, we can cross-reference known immune types with predicted cell types from the various deconvolution algorithms to ascertain how well predicted cell types align with immune populations.


cwltool https://raw.githubusercontent.com/PNNL-CompBio/decomprolute/main/metrics/imm-subtypes/pan-can-immune-preds.cwl https://raw.githubusercontent.com/PNNL-CompBio/decomprolute/main/metrics/imm-subtypes/imm-args.yml