Here we use coderdata to generate UMAPs colored by model type, cancer type, and source (study). We run these on each individual dataset as well as on all of the datasets joined together. The resulting plots can be used to form general hypotheses for further testing.
We focus on transcriptomics; however, proteomics can be analyzed using the same methods shown below.
Note: UMAPs are easy to make and interpret, but clusters are not guaranteed to be meaningful or consistent. They are best used as a way to generate hypotheses and display possible trends.
import pandas as pd
import coderdata as cd
import umap
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import numpy as np
import seaborn as sns
import matplotlib.patches as mpatches
import warnings
warnings.filterwarnings('ignore')
Load Datasets into respective objects. Then merge them into joined datasets.
# Load in all Datasets
hcmi = cd.DatasetLoader('hcmi')
beataml = cd.DatasetLoader('beataml')
cptac = cd.DatasetLoader('cptac')
depmap = cd.DatasetLoader('broad_sanger')
mpnst = cd.DatasetLoader('mpnst')
Processing Data... Loaded genes dataset. Processing Data... Loaded genes dataset. Processing Data... Loaded genes dataset. Processing Data... Loaded genes dataset. Processing Data... Loaded genes dataset.
# Join BeatAML and HCMI
joined_dataset0 = cd.join_datasets(beataml, hcmi)
# Join DepMap and CPTAC
joined_dataset1 = cd.join_datasets(depmap, cptac)
# Join Datasets
joined_dataset2 = cd.join_datasets(joined_dataset0,joined_dataset1)
# Final Join
joined_dataset3 = cd.join_datasets(joined_dataset2,mpnst)
Processing Data... Loaded genes dataset.
joined_dataset3.transcriptomics= joined_dataset3.transcriptomics[["improve_sample_id", "transcriptomics", "entrez_id", "source", "study"]]
joined_dataset3.info()
This is a joined dataset comprising of:
- mpnst: A collection of NF1-MPNST patient-derived xenografts, organoids, and tumors. Data hosted on synapse.
- cptac: The Clinical Proteomic Tumor Analysis Consortium (CPTAC) project is a collaborative network funded by the National Cancer Institute (NCI).
- hcmi: Human Cancer Models Initiative (HCMI) data was collected though the National Cancer Institute (NCI) Genomic Data Commons (GDC) Data Portal.
- beataml: Beat acute myeloid leukemia (BeatAML) data was collected though GitHub and Synapse.
Available Datatypes and Their Formats:
- copy_number: long format
- drugs: long format
- experiments: long format
- genes: Format not specified
- mutations: long format
- proteomics: long format
- samples: Format not specified
- transcriptomics: long format
joined_dataset3.samples
| | other_id | improve_sample_id | other_names | common_name | cancer_type | model_type | other_id_source | species |
---|---|---|---|---|---|---|---|---|
0 | 11-00261 | 4102 | Acute myelomonocytic leukaemia | Peripheral Blood | Acute Myeloid Leukaemia | ex vivo | beatAML | NaN |
1 | 11-00503 | 4103 | AML with mutated NPM1 | Bone Marrow Aspirate | Acute Myeloid Leukaemia | ex vivo | beatAML | NaN |
2 | 11-00475 | 4104 | AML with mutated NPM1 | Bone Marrow Aspirate | Acute Myeloid Leukaemia | ex vivo | beatAML | NaN |
3 | 13-00047 | 4105 | Mixed phenotype acute leukaemia, T/myeloid, NOS | Peripheral Blood | Acute Myeloid Leukaemia | ex vivo | beatAML | NaN |
4 | 12-00032 | 4106 | Chronic myelomonocytic leukaemia | Peripheral Blood | Acute Myeloid Leukaemia | ex vivo | beatAML | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... |
45 | WU-487 Tumor | 5169 | NaN | WU-487 | Malignant peripheral nerve sheath tumor | Tumor | NF Data Portal | Human |
46 | WU-505 Tumor | 5170 | NaN | WU-505 | Malignant peripheral nerve sheath tumor | Tumor | NF Data Portal | Human |
47 | WU-536 Tumor | 5171 | NaN | WU-536 | Malignant peripheral nerve sheath tumor | Tumor | NF Data Portal | Human |
48 | WU-545 Tumor | 5172 | NaN | WU-545 | Malignant peripheral nerve sheath tumor | Tumor | NF Data Portal | Human |
49 | WU-561 Tumor | 5173 | NaN | WU-561 | Malignant peripheral nerve sheath tumor | Tumor | NF Data Portal | Human |
45285 rows × 8 columns
joined_dataset3.transcriptomics
| | improve_sample_id | transcriptomics | entrez_id | source | study |
---|---|---|---|---|---|
0 | 5087 | 1.523670 | 7105.0 | synapse | BeatAML |
1 | 5087 | 1.523670 | 7105.0 | synapse | BeatAML |
2 | 5087 | 7.107711 | 8813.0 | synapse | BeatAML |
3 | 5087 | 7.107711 | 8813.0 | synapse | BeatAML |
4 | 5087 | 3.362605 | 6359.0 | synapse | BeatAML |
... | ... | ... | ... | ... | ... |
44552582 | 3188 | 11.940000 | 23140.0 | bcm | CPTAC3 |
44552583 | 3189 | 12.970000 | 23140.0 | bcm | CPTAC3 |
44552584 | 3190 | 11.860000 | 23140.0 | bcm | CPTAC3 |
44552585 | 3191 | 11.620000 | 23140.0 | bcm | CPTAC3 |
44552586 | 3192 | 12.040000 | 23140.0 | bcm | CPTAC3 |
187115143 rows × 5 columns
These mapping dictionaries will be used to map samples (improve_sample_id) to model type, cancer type, common name, and study.
# Model Type Mapping
model_type_dict = {
'Solid Tissue': 'tumor',
'tumor': 'tumor',
"organoid" : "organoid",
'cell line': 'cell line',
'Tumor': 'tumor',
'ex vivo': 'tumor',
'3D Organoid': 'organoid',
'Peripheral Blood Components NOS': 'tumor',
'Buffy Coat': np.nan,
None: np.nan,
'Peripheral Whole Blood': 'tumor',
'Adherent Cell Line': 'cell line',
'3D Neurosphere': 'organoid',
'2D Modified Conditionally Reprogrammed Cells': 'cell line',
'Pleural Effusion': np.nan,
'Human Original Cells': 'cell line',
'Not Reported': np.nan,
'Mixed Adherent Suspension': 'cell line',
'Cell': 'cell line',
'Saliva': np.nan
}
model_type_sample_map = dict(zip(joined_dataset3.samples['improve_sample_id'], joined_dataset3.samples['model_type']))
common_name_sample_map = dict(zip(joined_dataset3.samples['improve_sample_id'], joined_dataset3.samples['common_name']))
cancer_type_sample_map = dict(zip(joined_dataset3.samples['improve_sample_id'], joined_dataset3.samples['cancer_type']))
study_sample_map = dict(zip(joined_dataset3.transcriptomics['improve_sample_id'], joined_dataset3.transcriptomics['study']))
Convert the transcriptomics data from default (long) to wide.
joined_dataset3.reformat_dataset("transcriptomics","wide")
transcriptomics successfully converted to wide format
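To make the long-to-wide conversion concrete, here is a minimal sketch on a toy table. It uses pandas `pivot_table` as a stand-in; how `reformat_dataset` performs the conversion internally (including how it treats duplicate rows) is not shown in this notebook, so averaging duplicates here is an assumption, and the values are made up.

```python
import pandas as pd

# Toy long-format table: one row per (sample, gene) measurement.
# Note the duplicated (1, 7105.0) row, similar to the duplicates seen in the long table above.
long_df = pd.DataFrame({
    "improve_sample_id": [1, 1, 1, 2, 2],
    "entrez_id": [7105.0, 7105.0, 8813.0, 7105.0, 8813.0],
    "transcriptomics": [1.5, 1.5, 7.1, 2.0, 6.9],
})

# Wide format: one row per sample, one column per gene.
# Assumption: duplicate measurements are averaged.
wide_df = long_df.pivot_table(
    index="improve_sample_id",
    columns="entrez_id",
    values="transcriptomics",
    aggfunc="mean",
).reset_index()

print(wide_df)
```

The result has `improve_sample_id` as its first column followed by one column per `entrez_id`, which is the layout the rest of this notebook relies on.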
We store improve_sample_id in a separate dataframe. This will be used to link sample information to the UMAP embeddings.
Retrievable data: model_type, study, cancer_type.
jd3_sample_col = joined_dataset3.transcriptomics.iloc[:, 0].to_frame()
jd3_sample_col['model_type'] = jd3_sample_col['improve_sample_id'].map(model_type_sample_map)
jd3_sample_col['model_type'] = jd3_sample_col['model_type'].map(model_type_dict)
jd3_sample_col['study'] = jd3_sample_col['improve_sample_id'].map(study_sample_map)
jd3_sample_col['cancer_type'] = jd3_sample_col['improve_sample_id'].map(cancer_type_sample_map)
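The two-step lookup used above (sample ID to raw label, then raw label to harmonized category) can be sketched on toy data as follows. The sample IDs and the reduced dictionary are made up for illustration; only the pattern mirrors the cell above.

```python
import numpy as np
import pandas as pd

# Toy stand-in for joined_dataset3.samples (values are illustrative only).
samples = pd.DataFrame({
    "improve_sample_id": [1, 2, 3, 4],
    "model_type": ["Solid Tissue", "3D Organoid", "Adherent Cell Line", "Saliva"],
})

# A small subset of the harmonization dictionary defined earlier.
model_type_dict = {
    "Solid Tissue": "tumor",
    "3D Organoid": "organoid",
    "Adherent Cell Line": "cell line",
    "Saliva": np.nan,
}

# Step 1: sample ID -> raw model type.
model_type_sample_map = dict(zip(samples["improve_sample_id"], samples["model_type"]))

# Step 2: look up each ID, then harmonize the raw label.
# IDs missing from the samples table (99 here) end up as NaN.
sample_col = pd.DataFrame({"improve_sample_id": [3, 1, 4, 2, 99]})
sample_col["model_type"] = (
    sample_col["improve_sample_id"].map(model_type_sample_map).map(model_type_dict)
)

print(sample_col)
```

Samples whose raw label maps to NaN, or whose ID is not found, stay NaN and are the points hidden in the model type plot.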
joined_dataset3.transcriptomics
entrez_id | improve_sample_id | 1.0 | 2.0 | 3.0 | 9.0 | 10.0 | 11.0 | 12.0 | 13.0 | 14.0 | ... | 118097967.0 | 118126072.0 | 118142757.0 | 118568804.0 | 122394733.0 | 122405565.0 | 124905743.0 | 124906461.0 | 125316803.0 | 125505920.0 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0.377908 | 0.214517 | 0.07 | 10.884808 | 1.004783 | 0.0 | 0.374231 | 5.125926 | 58.706594 | ... | 0.0 | 4.250000 | 0.128292 | 0.0 | 0.540773 | 2.314544 | 0.000000 | 0.070195 | 21.933752 | 0.13 |
1 | 2 | 2.016174 | 0.178161 | 0.02 | 4.039268 | 1.309193 | 0.0 | 0.247848 | 0.158752 | 71.169380 | ... | 0.0 | 1.910000 | 0.017178 | 0.0 | 0.098161 | 6.066530 | 0.000000 | 0.007178 | 10.166839 | 0.11 |
2 | 3 | 0.927081 | 20.780606 | 0.00 | 4.297862 | 0.034285 | 0.0 | 5.436207 | 0.000000 | 41.411760 | ... | 0.0 | 3.530000 | 0.000000 | 0.0 | 2.387944 | 27.548022 | 0.000000 | 0.000000 | 23.857035 | 0.75 |
3 | 4 | 0.068752 | 0.497908 | 0.00 | 14.466601 | 0.090195 | 0.0 | 0.329962 | 0.665773 | 89.898180 | ... | 0.0 | 3.010000 | 0.007178 | 0.0 | 1.978817 | 34.059067 | 0.000000 | 0.000000 | 31.096575 | 0.07 |
4 | 5 | 2.837025 | 0.520713 | 0.00 | 6.530024 | 0.022178 | 0.0 | 0.021322 | 0.056322 | 57.371410 | ... | 0.0 | 23.490000 | 0.022178 | 0.0 | 2.944486 | 18.276848 | 0.000000 | 0.022178 | 31.300927 | 0.03 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
3908 | 5118 | -2.948888 | -0.509818 | NaN | 2.541549 | NaN | NaN | NaN | NaN | 6.503961 | ... | NaN | 1.964441 | NaN | NaN | -0.589435 | 5.347260 | 3.780014 | NaN | 6.300062 | NaN |
3909 | 5119 | -2.671581 | -0.051962 | NaN | 2.107456 | NaN | NaN | NaN | NaN | 6.267933 | ... | NaN | 3.081971 | NaN | NaN | -1.789189 | 5.714423 | 4.520107 | NaN | 6.159214 | NaN |
3910 | 5120 | -3.720447 | 0.918365 | NaN | 2.160300 | NaN | NaN | NaN | NaN | 6.317204 | ... | NaN | 3.347740 | NaN | NaN | -1.689536 | 5.734799 | 5.603482 | NaN | 6.171670 | NaN |
3911 | 5121 | -2.254779 | 1.955952 | NaN | 2.230090 | NaN | NaN | NaN | NaN | 6.954112 | ... | NaN | 4.166740 | NaN | NaN | 1.183738 | 6.455522 | -0.976484 | NaN | 6.394535 | NaN |
3912 | 5122 | -3.144297 | -1.273558 | NaN | 2.163439 | NaN | NaN | NaN | NaN | 6.285517 | ... | NaN | 3.991816 | NaN | NaN | -0.286764 | 5.575612 | 3.880383 | NaN | 6.172091 | NaN |
3913 rows × 38619 columns
Next we drop the improve_sample_id column so that only expression values remain. The same method can be used for proteomics or other data types as well.
The points in the UMAP are in the same order as the rows of jd3_sample_col, so they can still be colored and labeled.
joined_dataset3.transcriptomics = joined_dataset3.transcriptomics.drop(joined_dataset3.transcriptomics.columns[:1], axis=1)
joined_dataset3.transcriptomics
entrez_id | 1.0 | 2.0 | 3.0 | 9.0 | 10.0 | 11.0 | 12.0 | 13.0 | 14.0 | 15.0 | ... | 118097967.0 | 118126072.0 | 118142757.0 | 118568804.0 | 122394733.0 | 122405565.0 | 124905743.0 | 124906461.0 | 125316803.0 | 125505920.0 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.377908 | 0.214517 | 0.07 | 10.884808 | 1.004783 | 0.0 | 0.374231 | 5.125926 | 58.706594 | 0.178292 | ... | 0.0 | 4.250000 | 0.128292 | 0.0 | 0.540773 | 2.314544 | 0.000000 | 0.070195 | 21.933752 | 0.13 |
1 | 2.016174 | 0.178161 | 0.02 | 4.039268 | 1.309193 | 0.0 | 0.247848 | 0.158752 | 71.169380 | 0.773686 | ... | 0.0 | 1.910000 | 0.017178 | 0.0 | 0.098161 | 6.066530 | 0.000000 | 0.007178 | 10.166839 | 0.11 |
2 | 0.927081 | 20.780606 | 0.00 | 4.297862 | 0.034285 | 0.0 | 5.436207 | 0.000000 | 41.411760 | 0.192164 | ... | 0.0 | 3.530000 | 0.000000 | 0.0 | 2.387944 | 27.548022 | 0.000000 | 0.000000 | 23.857035 | 0.75 |
3 | 0.068752 | 0.497908 | 0.00 | 14.466601 | 0.090195 | 0.0 | 0.329962 | 0.665773 | 89.898180 | 0.366517 | ... | 0.0 | 3.010000 | 0.007178 | 0.0 | 1.978817 | 34.059067 | 0.000000 | 0.000000 | 31.096575 | 0.07 |
4 | 2.837025 | 0.520713 | 0.00 | 6.530024 | 0.022178 | 0.0 | 0.021322 | 0.056322 | 57.371410 | 0.091322 | ... | 0.0 | 23.490000 | 0.022178 | 0.0 | 2.944486 | 18.276848 | 0.000000 | 0.022178 | 31.300927 | 0.03 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
3908 | -2.948888 | -0.509818 | NaN | 2.541549 | NaN | NaN | NaN | NaN | 6.503961 | NaN | ... | NaN | 1.964441 | NaN | NaN | -0.589435 | 5.347260 | 3.780014 | NaN | 6.300062 | NaN |
3909 | -2.671581 | -0.051962 | NaN | 2.107456 | NaN | NaN | NaN | NaN | 6.267933 | NaN | ... | NaN | 3.081971 | NaN | NaN | -1.789189 | 5.714423 | 4.520107 | NaN | 6.159214 | NaN |
3910 | -3.720447 | 0.918365 | NaN | 2.160300 | NaN | NaN | NaN | NaN | 6.317204 | NaN | ... | NaN | 3.347740 | NaN | NaN | -1.689536 | 5.734799 | 5.603482 | NaN | 6.171670 | NaN |
3911 | -2.254779 | 1.955952 | NaN | 2.230090 | NaN | NaN | NaN | NaN | 6.954112 | NaN | ... | NaN | 4.166740 | NaN | NaN | 1.183738 | 6.455522 | -0.976484 | NaN | 6.394535 | NaN |
3912 | -3.144297 | -1.273558 | NaN | 2.163439 | NaN | NaN | NaN | NaN | 6.285517 | NaN | ... | NaN | 3.991816 | NaN | NaN | -0.286764 | 5.575612 | 3.880383 | NaN | 6.172091 | NaN |
3913 rows × 38618 columns
Here we use the median of each column to fill in NaN values.
This is a crude imputation method, and you may wish to use a more precise one here.
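for column in joined_dataset3.transcriptomics.columns:
    median_value = joined_dataset3.transcriptomics[column].median()
    # Assign the result back; a chained fillna(..., inplace=True) may not modify the dataframe in newer pandas versions
    joined_dataset3.transcriptomics[column] = joined_dataset3.transcriptomics[column].fillna(median_value)
# Drop genes that have no values at all (their median is NaN, so they cannot be imputed)
joined_dataset3.transcriptomics = joined_dataset3.transcriptomics.dropna(axis='columns', how='all')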
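As a small illustration of the imputation step, the sketch below applies the same idea to a toy matrix using the vectorized form `df.fillna(df.median())`. The column names and values are invented; this is an alternative way to write the loop, not a different method.

```python
import numpy as np
import pandas as pd

# Toy expression matrix: rows are samples, columns are genes.
expr = pd.DataFrame({
    "g1": [1.0, np.nan, 3.0, 5.0],
    "g2": [np.nan, np.nan, np.nan, np.nan],  # gene with no measurements at all
    "g3": [0.0, 2.0, np.nan, 4.0],
})

# Fill each column's missing values with that column's median.
imputed = expr.fillna(expr.median())

# A column that is entirely NaN has no median, so it stays NaN and is dropped.
imputed = imputed.dropna(axis="columns", how="all")

print(imputed)
```

Because the median is computed per gene across all samples, datasets measured on different scales will pull imputed values toward whichever dataset contributes the most samples, which is one reason to treat this step with caution.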
Data is scaled, transformed and embedded using the UMAP functions from umap-learn.
reducer = umap.UMAP()
t_full_data = joined_dataset3.transcriptomics.values
scaled_t_full_data = StandardScaler().fit_transform(t_full_data)
embedding_t_full_data = reducer.fit_transform(scaled_t_full_data)
embedding_t_full_data.shape
(3913, 2)
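For readers unfamiliar with the scaling step, the sketch below reproduces in plain NumPy what column-wise standardization does: each gene is shifted to zero mean and divided by its population standard deviation. The toy matrix is invented, and the handling of constant columns (dividing by 1) is written out explicitly here for illustration.

```python
import numpy as np

# Toy matrix: 3 samples x 3 genes; the last gene is constant across samples.
X = np.array([
    [1.0, 10.0, 5.0],
    [2.0, 20.0, 5.0],
    [3.0, 30.0, 5.0],
])

mean = X.mean(axis=0)
std = X.std(axis=0)      # population standard deviation (ddof=0)
std[std == 0] = 1.0      # constant columns: avoid dividing by zero

# Each column now has mean 0; non-constant columns have unit variance.
Z = (X - mean) / std

print(Z)
```

Scaling matters here because UMAP works on distances between samples, so without it a handful of highly expressed genes would dominate the embedding.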
This maps cancer types to common names. In the future this will be done within the CoderData pipeline and this step can be removed.
cell_line_types_df = pd.read_csv('cellLineTypes.csv')
mapping_dict = {}
# Iterate through each row in the DataFrame
for _, row in cell_line_types_df.iterrows():
    # Find the first non-null value in the row to use as the mapping target
    target_value = row.dropna().iloc[0] if not row.dropna().empty else None
    if target_value:
        # Iterate over all values in the row
        for value in row:
            # Check if the value is not null and not already the target value
            if pd.notnull(value) and value != target_value:
                # Map this value to the target_value
                mapping_dict[value] = target_value
jd3_sample_col['cancer_type'] = jd3_sample_col['cancer_type'].map(mapping_dict).fillna(jd3_sample_col['cancer_type'])
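The row-wise synonym mapping above can be tried on a toy table. The layout assumed here (first non-null cell is the preferred name, remaining cells are aliases) is inferred from the loop; the actual contents of cellLineTypes.csv are not shown in this notebook, so the column names and values below are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical synonym table: the first non-null cell in each row is the preferred name.
synonyms = pd.DataFrame({
    "canonical": ["Acute myeloid leukemia", "Glioblastoma multiforme"],
    "alias_1": ["Acute Myeloid Leukaemia", "Glioblastoma"],
    "alias_2": ["AML", np.nan],
})

mapping = {}
for _, row in synonyms.iterrows():
    values = row.dropna().tolist()
    if not values:
        continue  # skip rows that are entirely empty
    target, aliases = values[0], values[1:]
    for alias in aliases:
        if alias != target:
            mapping[alias] = target

# Apply the mapping; names without an entry keep their original value.
cancer_type = pd.Series(["AML", "Glioblastoma", "Mesothelioma"])
harmonized = cancer_type.map(mapping).fillna(cancer_type)

print(harmonized.tolist())
```

The trailing `fillna(cancer_type)` is what keeps unmapped names such as "Mesothelioma" from turning into NaN.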
In this plot, we color the points by model type (Organoid, Tumor, Cell Line, Other). In this example, the points that map to the Other model type are hidden; these include saliva, buffy coat, etc. The Other model types will likely be removed from the coderdata package.
#plot umap. Hide/unhide unknowns
legend_handles = [
mpatches.Patch(color=sns.color_palette()[0], label='Tumor'),
mpatches.Patch(color=sns.color_palette()[1], label='Organoid'),
mpatches.Patch(color=sns.color_palette()[2], label='Cell Line')
# mpatches.Patch(color=sns.color_palette()[3], label='Other') # Uncomment this to include the Other model type legend label.
]
# This is used to hide the Unknown model types.
alphas = [0 if x == 3 else 1 for x in jd3_sample_col.model_type.map({"tumor": 0, "organoid": 1, "cell line": 2, np.nan: 3})]
plt.scatter(
embedding_t_full_data[:, 0],
embedding_t_full_data[:, 1],
c=[sns.color_palette()[x] for x in jd3_sample_col.model_type.map({"tumor": 0, "organoid": 1, "cell line": 2, np.nan: 3})],
alpha=alphas, # Apply the alpha values here
s=3
)
plt.gca().set_aspect('equal', 'datalim')
plt.legend(handles=legend_handles, title='Model Type')
plt.title('All Datasets, Transcriptomics UMAP by Model Type', fontsize=12)
Text(0.5, 1.0, 'All Datasets, Transcriptomics UMAP by Model Type')
jd3_sample_col
| | improve_sample_id | model_type | study | cancer_type |
---|---|---|---|---|
0 | 1 | cell line | Sanger & Broad Cell Lines RNASeq | Pancreatic Carcinoma |
1 | 2 | cell line | Sanger & Broad Cell Lines RNASeq | Colorectal Carcinoma |
2 | 3 | cell line | Sanger & Broad Cell Lines RNASeq | Glioblastoma multiforme |
3 | 4 | cell line | Sanger & Broad Cell Lines RNASeq | Mesothelioma |
4 | 5 | cell line | Sanger & Broad Cell Lines RNASeq | B-Lymphoblastic Leukemia |
... | ... | ... | ... | ... |
3908 | 5118 | tumor | BeatAML | Acute myeloid leukemia |
3909 | 5119 | tumor | BeatAML | Acute myeloid leukemia |
3910 | 5120 | tumor | BeatAML | Acute myeloid leukemia |
3911 | 5121 | tumor | BeatAML | Acute myeloid leukemia |
3912 | 5122 | tumor | BeatAML | Acute myeloid leukemia |
3913 rows × 4 columns
unique_studies = jd3_sample_col['study'].dropna().unique()
palette = sns.color_palette("Set2", len(unique_studies))
study_to_color = {study: color for study, color in zip(unique_studies, palette)}
# Prepare the colors for each point
colors = jd3_sample_col['study'].map(study_to_color).fillna('black')
plt.scatter(
embedding_t_full_data[:, 0],
embedding_t_full_data[:, 1],
c=colors,
alpha=1, # Adjust based on your preference for visibility
s=3 # Adjust size as needed
)
plt.gca().set_aspect('equal', 'datalim')
# Create legend handles manually
legend_handles = [plt.Line2D([0], [0], marker='o', color='w', label=study,
markerfacecolor=color, markersize=10) for study, color in study_to_color.items()]
plt.legend(handles=legend_handles, title='Study',prop={'size': 6}, title_fontsize=8)
plt.title('All Datasets, Transcriptomics UMAP by Study', fontsize=12)
plt.show()
def interlace_lists(*lists):
    """Interlace items from multiple lists in an alternating fashion."""
    max_length = max(len(lst) for lst in lists)
    interlaced = []
    for i in range(max_length):
        for lst in lists:
            if i < len(lst):
                interlaced.append(lst[i])
    return interlaced
top_10_organoid = jd3_sample_col[jd3_sample_col.model_type == "organoid"].cancer_type.value_counts().head(10).index
top_10_cell_line = jd3_sample_col[jd3_sample_col.model_type == "cell line"].cancer_type.value_counts().head(10).index
top_10_tumor = jd3_sample_col[jd3_sample_col.model_type == "tumor"].cancer_type.value_counts().head(10).index
top_10_cancer_type = jd3_sample_col.cancer_type.value_counts().head(10).index
top_10_organoid_series = pd.Series(top_10_organoid)
top_10_cell_line_series = pd.Series(top_10_cell_line)
top_10_tumor_series = pd.Series(top_10_tumor)
top_10_cancer_type_series = pd.Series(top_10_cancer_type)
# Step 2: Create a Unified List of Unique Top Cancer Types
all_top_cancer_types = interlace_lists(top_10_cancer_type,top_10_organoid, top_10_cell_line, top_10_tumor)
all_top_cancer_types = pd.Series(all_top_cancer_types).unique()
# Step 3: Generate a Color Mapping for These Cancer Types
colors = sns.color_palette("hsv", len(all_top_cancer_types))
color_map = {cancer_type: color for cancer_type, color in zip(all_top_cancer_types, colors)}
# Manually set "Other" to grey
color_map['Other'] = (0.5, 0.5, 0.5)
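A short usage sketch of the interlacing helper, followed by the same order-preserving de-duplication idea, is given below on toy lists. The function is repeated so the example runs on its own; the reading that interlacing is meant to spread each group's most common cancer types across the palette is an interpretation, not something the notebook states.

```python
def interlace_lists(*lists):
    """Interlace items from multiple lists in an alternating fashion."""
    max_length = max(len(lst) for lst in lists)
    interlaced = []
    for i in range(max_length):
        for lst in lists:
            if i < len(lst):
                interlaced.append(lst[i])
    return interlaced


# Lists of unequal length are handled by skipping exhausted lists.
result = interlace_lists(["a", "b", "c"], ["x"], ["p", "q"])
print(result)

# Toy "top cancer types" per group, with overlap between groups.
overall = ["AML", "Lung", "Glioma"]
organoid = ["Colon", "AML"]
merged = interlace_lists(overall, organoid)

# Keep the first occurrence of each name, preserving order.
unique_types = list(dict.fromkeys(merged))
print(unique_types)
```

Every cancer type then receives exactly one color, regardless of how many of the top-10 lists it appears in.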
In this plot, we color the points by cancer type to show differences between groups. As there are hundreds of cancer types present, we must filter down to a reasonable number, such as 10, for plotting.
top_10_types = jd3_sample_col.cancer_type.value_counts().head(10).index
# Create a new column for mapping colors with the unified color scheme, considering only top 10 for the current model
jd3_sample_col[f'full_color_group'] = jd3_sample_col.apply(
lambda row: row.cancer_type if row.cancer_type in top_10_types else 'Other', axis=1
)
jd3_sample_col['color'] = jd3_sample_col[f'full_color_group'].map(color_map)
jd3_sample_col['alpha'] = jd3_sample_col['full_color_group'].apply(lambda x: 0.25 if x == 'Other' else 1.0)
legend_handles = [mpatches.Patch(color=color_map[group], label=group) for group in list(top_10_types) + ['Other']]
# Plotting
plt.scatter(
embedding_t_full_data[:, 0],
embedding_t_full_data[:, 1],
c=jd3_sample_col['color'], # Use the mapped colors
alpha=jd3_sample_col['alpha'], # Apply the alpha values here
s=1.5
)
plt.legend(handles=legend_handles, title='Cancer Type', prop={'size': 6}, title_fontsize=8)
plt.gca().set_aspect('equal', 'datalim')
plt.title('All Datasets, Transcriptomics UMAP by Cancer Type', fontsize=12)
plt.show()
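The "top N plus Other" grouping used for this plot can be sketched on a toy series as follows. It uses `Series.where` with `isin` instead of a row-wise `apply`, which is an equivalent, shorter way to express the same grouping; the labels and counts are invented.

```python
import pandas as pd

# Toy cancer type labels for seven samples.
cancer_type = pd.Series(
    ["AML", "AML", "AML", "Lung", "Lung", "Glioma", "Melanoma"]
)

top_n = 2  # the notebook uses 10
top_types = cancer_type.value_counts().head(top_n).index

# Keep the label if it is among the most frequent types, otherwise call it "Other".
color_group = cancer_type.where(cancer_type.isin(top_types), "Other")

# Fade the "Other" points so the highlighted groups stand out.
alpha = color_group.map(lambda group: 0.25 if group == "Other" else 1.0)

print(color_group.tolist())
print(alpha.tolist())
```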
# Function to apply unified color mapping across cancer types and plot UMAP by model type
def plot_umap_by_model_type(model_type, embedding, color_map, jd3_sample_col, num):
    # Determine the top `num` cancer types for the current model type
    top_x_types = jd3_sample_col[jd3_sample_col.model_type == model_type].cancer_type.value_counts().head(num).index
    # Create a new column for mapping colors with the unified color scheme, considering only the top `num` types for the current model
    jd3_sample_col[f'{model_type}_color_group'] = jd3_sample_col.apply(
        lambda row: row.cancer_type if row.cancer_type in top_x_types else 'Other', axis=1
    )
    # Map the color_group column to actual colors using the unified color map
    jd3_sample_col['color'] = jd3_sample_col[f'{model_type}_color_group'].map(color_map)
    # Filtering rows for the current model type; these match the rows of `embedding` one to one
    filtered_rows = jd3_sample_col[jd3_sample_col.model_type == model_type]
    # Plotting
    plt.scatter(
        embedding[:, 0],
        embedding[:, 1],
        c=filtered_rows['color'],
        alpha=[0.1 if x == 'Other' else 1 for x in filtered_rows[f'{model_type}_color_group']],  # Fade points in the Other group
        s=12
    )
    # Adjusting legend to reflect new grouping with unified color scheme
    legend_handles = [mpatches.Patch(color=color_map[group], label=group) for group in list(top_x_types) + ['Other']]
    plt.gca().set_aspect('equal', 'datalim')
    plt.legend(handles=legend_handles, title='Cancer Type', prop={'size': 5}, title_fontsize=8)
    plt.title(f'All Datasets, {model_type.capitalize()}, Transcriptomics UMAP by Cancer Type', fontsize=12)
    plt.show()
reducer = umap.UMAP()
t_full_organoid_data = joined_dataset3.transcriptomics[jd3_sample_col.model_type == "organoid"].values
scaled_t_full_organoid_data = StandardScaler().fit_transform(t_full_organoid_data)
embedding_t_full_organoid_data = reducer.fit_transform(scaled_t_full_organoid_data)
embedding_t_full_organoid_data.shape
(192, 2)
plot_umap_by_model_type("organoid", embedding_t_full_organoid_data, color_map, jd3_sample_col, 8)
reducer = umap.UMAP()
t_full_cell_line_data = joined_dataset3.transcriptomics[jd3_sample_col.model_type == "cell line"].values
scaled_t_full_cell_line_data = StandardScaler().fit_transform(t_full_cell_line_data)
embedding_t_full_cell_line_data = reducer.fit_transform(scaled_t_full_cell_line_data)
embedding_t_full_cell_line_data.shape
(1761, 2)
plot_umap_by_model_type("cell line", embedding_t_full_cell_line_data, color_map, jd3_sample_col,10)
reducer = umap.UMAP()
t_full_tumor_data = joined_dataset3.transcriptomics[jd3_sample_col.model_type == "tumor"].values
scaled_t_full_tumor_data = StandardScaler().fit_transform(t_full_tumor_data)
embedding_t_full_tumor_data = reducer.fit_transform(scaled_t_full_tumor_data)
embedding_t_full_tumor_data.shape
(1960, 2)
plot_umap_by_model_type("tumor", embedding_t_full_tumor_data, color_map, jd3_sample_col,10)
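The per-model-type embeddings above depend on the value matrix and the metadata frame being filtered with the same boolean mask, so that row i of the embedding still describes row i of the filtered metadata. A minimal sketch of that alignment on toy data (all names and values invented):

```python
import numpy as np
import pandas as pd

# Toy metadata and value matrix sharing the same row order and index.
meta = pd.DataFrame({
    "improve_sample_id": [10, 11, 12, 13],
    "model_type": ["tumor", "organoid", "tumor", "cell line"],
})
values = pd.DataFrame(np.arange(8.0).reshape(4, 2), columns=["g1", "g2"])

# One mask, applied to both objects.
mask = meta["model_type"] == "tumor"
subset_values = values[mask].values
subset_meta = meta[mask].reset_index(drop=True)

# The subset matrix and the subset metadata stay row-aligned.
print(subset_values)
print(subset_meta)
```

If either frame were re-sorted or re-indexed between the mask creation and the filtering, the colors in the plot would silently be attached to the wrong points.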
All of the code above is now applied to HCMI.
hcmi.reformat_dataset("transcriptomics","wide")
hcmi_sample_col = hcmi.transcriptomics.iloc[:, 0].to_frame()
hcmi_sample_col['model_type'] = hcmi_sample_col['improve_sample_id'].map(model_type_sample_map)
hcmi_sample_col['model_type'] = hcmi_sample_col['model_type'].map(model_type_dict)
hcmi.transcriptomics = hcmi.transcriptomics.drop(hcmi.transcriptomics.columns[:1], axis=1)
hcmi.transcriptomics
for column in hcmi.transcriptomics.columns:
    median_value = hcmi.transcriptomics[column].median()
    # Assign the result back rather than relying on a chained inplace fillna
    hcmi.transcriptomics[column] = hcmi.transcriptomics[column].fillna(median_value)
hcmi.transcriptomics
reducer = umap.UMAP()
t_hcmi_data = hcmi.transcriptomics.values
scaled_t_hcmi_data = StandardScaler().fit_transform(t_hcmi_data)
embedding_t_hcmi_data = reducer.fit_transform(scaled_t_hcmi_data)
embedding_t_hcmi_data.shape
transcriptomics successfully converted to wide format
(396, 2)
legend_handles = [
mpatches.Patch(color=sns.color_palette()[0], label='Tumor'),
mpatches.Patch(color=sns.color_palette()[1], label='Organoid'),
mpatches.Patch(color=sns.color_palette()[2], label='Cell Line')
]
alphas = [0 if x == 3 else 1 for x in hcmi_sample_col.model_type.map({"tumor": 0, "organoid": 1, "cell line": 2, np.nan: 3})]
plt.scatter(
embedding_t_hcmi_data[:, 0],
embedding_t_hcmi_data[:, 1],
c=[sns.color_palette()[x] for x in hcmi_sample_col.model_type.map({"tumor": 0, "organoid": 1, "cell line": 2, np.nan: 3})],
alpha=alphas, # Apply the alpha values here
s=12
)
plt.gca().set_aspect('equal', 'datalim')
plt.legend(handles=legend_handles, title='Model Type')
plt.title('HCMI, Transcriptomics UMAP by Model Type', fontsize=12)
Text(0.5, 1.0, 'HCMI, Transcriptomics UMAP by Model Type')
All of the code above is now applied to BeatAML.
beataml.reformat_dataset("transcriptomics","wide")
beataml_sample_col = beataml.transcriptomics.iloc[:, 0].to_frame()
beataml_sample_col['common_name'] = beataml_sample_col['improve_sample_id'].map(common_name_sample_map)
# beataml_sample_col['model_type'] = beataml_sample_col['model_type'].map(model_type_dict)
beataml.transcriptomics = beataml.transcriptomics.drop(beataml.transcriptomics.columns[:1], axis=1)
beataml.transcriptomics
for column in beataml.transcriptomics.columns:
    median_value = beataml.transcriptomics[column].median()
    # Assign the result back rather than relying on a chained inplace fillna
    beataml.transcriptomics[column] = beataml.transcriptomics[column].fillna(median_value)
beataml.transcriptomics
reducer = umap.UMAP()
t_beataml_data = beataml.transcriptomics.values
scaled_t_beataml_data = StandardScaler().fit_transform(t_beataml_data)
embedding_t_beataml_data = reducer.fit_transform(scaled_t_beataml_data)
embedding_t_beataml_data.shape
transcriptomics successfully converted to wide format
(707, 2)
# Map each sample type to a palette index. The legend and the point colors are both derived from this one mapping so they cannot disagree.
sample_type_index = {
    'Peripheral Blood': 0,
    'Leukapheresis': 1,
    'Bone Marrow Aspirate': 2,
    'Healthy pooled CD34+': 3,
    'Healthy Individual BM MNC': 4,
    'Healthy Individual CD34+': 5,
}
# Here we hide Healthy Individual CD34+ because there is only 1 sample.
hidden_type = 'Healthy Individual CD34+'
palette = sns.color_palette()
legend_handles = [
    mpatches.Patch(color=palette[i], label=name)
    for name, i in sample_type_index.items() if name != hidden_type
]
alphas = [0 if x == hidden_type else 1 for x in beataml_sample_col.common_name]
plt.scatter(
    embedding_t_beataml_data[:, 0],
    embedding_t_beataml_data[:, 1],
    c=[palette[i] for i in beataml_sample_col.common_name.map(sample_type_index)],
    s=12,
    alpha=alphas
)
plt.gca().set_aspect('equal', 'datalim')
plt.legend(handles=legend_handles, title='Sample Type', fontsize='small', title_fontsize='small')
plt.title('BeatAML, Transcriptomics UMAP by Sample Type', fontsize=12)
Text(0.5, 1.0, 'BeatAML, Transcriptomics UMAP by Sample Type')
All of the code above is now applied to CPTAC. The plotting code below is condensed to avoid hard coding of cancer types.
cptac.reformat_dataset("transcriptomics","wide")
cptac_sample_col = cptac.transcriptomics.iloc[:, 0].to_frame()
cptac_sample_col['cancer_type'] = cptac_sample_col['improve_sample_id'].map(cancer_type_sample_map)
# cptac_sample_col['model_type'] = cptac_sample_col['model_type'].map(model_type_dict)
cptac.transcriptomics = cptac.transcriptomics.drop(cptac.transcriptomics.columns[:1], axis=1)
cptac.transcriptomics
for column in cptac.transcriptomics.columns:
    median_value = cptac.transcriptomics[column].median()
    # Assign the result back rather than relying on a chained inplace fillna
    cptac.transcriptomics[column] = cptac.transcriptomics[column].fillna(median_value)
cptac.transcriptomics
reducer = umap.UMAP()
t_cptac_data = cptac.transcriptomics.values
scaled_t_cptac_data = StandardScaler().fit_transform(t_cptac_data)
embedding_t_cptac_data = reducer.fit_transform(scaled_t_cptac_data)
embedding_t_cptac_data.shape
transcriptomics successfully converted to wide format
(1113, 2)
cancer_types = cptac.samples.cancer_type.unique()
colors = sns.color_palette(n_colors=len(cancer_types))
# Create legend handles dynamically
legend_handles = [mpatches.Patch(color=colors[i], label=label) for i, label in enumerate(cancer_types)]
color_mapping = {cancer_type: color for cancer_type, color in zip(cancer_types, colors)}
cptac_colors = cptac_sample_col.cancer_type.map(color_mapping).tolist()
plt.scatter(
embedding_t_cptac_data[:, 0],
embedding_t_cptac_data[:, 1],
c=cptac_colors,
s=12,
)
plt.gca().set_aspect('equal', 'datalim')
plt.legend(handles=legend_handles, title='Cancer Type', prop={'size': 6}, title_fontsize=8)
plt.title('CPTAC, Transcriptomics UMAP by Cancer Type', fontsize=12)
Text(0.5, 1.0, 'CPTAC, Transcriptomics UMAP by Cancer Type')
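The idea of building the color mapping from whatever labels are present, rather than hard coding them, can be sketched without any plotting library. The example below swaps seaborn's palette for evenly spaced hues from the standard library `colorsys` module; `build_color_mapping` is a hypothetical helper written for this illustration and is not part of coderdata or seaborn.

```python
import colorsys


def build_color_mapping(labels):
    """Assign one evenly spaced hue to each distinct label, in first-seen order."""
    unique = list(dict.fromkeys(labels))
    n = len(unique)
    return {
        label: colorsys.hsv_to_rgb(i / n, 0.8, 0.9)
        for i, label in enumerate(unique)
    }


# Toy labels; in the notebook these would be the cancer types of each sample.
labels = ["Lung", "Breast", "Lung", "Colon", "Breast"]
color_mapping = build_color_mapping(labels)

# One RGB tuple per point, ready to pass as the `c` argument of a scatter plot.
point_colors = [color_mapping[label] for label in labels]

for label, color in color_mapping.items():
    print(label, tuple(round(channel, 2) for channel in color))
```

Because the legend handles and the point colors would both be read from `color_mapping`, adding or removing a cancer type requires no changes to the plotting code.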
All of the code above is now applied to DepMap/Sanger. However, because there are over 200 cancer types, they cannot easily be distinguished by color in a UMAP. As such, we filter the data to the 10 most abundant cancer types.
# Assuming depmap is already loaded and initialized
depmap.reformat_dataset("transcriptomics", "wide")
# Creating a mapping for cancer types
depmap_sample_col = depmap.transcriptomics.iloc[:, 0].to_frame()
depmap_sample_col['cancer_type'] = depmap_sample_col['improve_sample_id'].map(cancer_type_sample_map)
# Identify the top 10 most prevalent cancer types
top_10_cancer_types = depmap_sample_col['cancer_type'].value_counts().head(10).index.tolist()
# Filter the dataset for only the top 10 cancer types
depmap_sample_col = depmap_sample_col[depmap_sample_col['cancer_type'].isin(top_10_cancer_types)]
# Keep only the selected samples and drop the improve_sample_id column so it is not used as a feature
filtered_transcriptomics = depmap.transcriptomics.loc[depmap_sample_col.index].drop(columns=['improve_sample_id'])
# Fill missing values with the median of each column
for column in filtered_transcriptomics.columns:
    median_value = filtered_transcriptomics[column].median()
    filtered_transcriptomics[column] = filtered_transcriptomics[column].fillna(median_value)
filtered_transcriptomics = filtered_transcriptomics.dropna(axis=1, how='all')
# UMAP Analysis
reducer = umap.UMAP()
t_depmap_data = filtered_transcriptomics.values
scaled_t_depmap_data = StandardScaler().fit_transform(t_depmap_data)
embedding_t_depmap_data = reducer.fit_transform(scaled_t_depmap_data)
# Visualization
cancer_types = top_10_cancer_types
colors = sns.color_palette(n_colors=len(cancer_types))
legend_handles = [mpatches.Patch(color=colors[i], label=label) for i, label in enumerate(cancer_types)]
color_mapping = {cancer_type: color for cancer_type, color in zip(cancer_types, colors)}
depmap_colors = depmap_sample_col['cancer_type'].map(color_mapping).tolist()
plt.scatter(
embedding_t_depmap_data[:, 0],
embedding_t_depmap_data[:, 1],
c=depmap_colors,
s=12,
)
plt.gca().set_aspect('equal', 'datalim')
plt.legend(handles=legend_handles, title='Cancer Type', prop={'size': 6}, title_fontsize=8)
plt.title('DepMap, Transcriptomics UMAP by Cancer Type', fontsize=12)
plt.show()
transcriptomics successfully converted to wide format