Welcome to the Data Exploration UMAP Tutorial¶

Here we will use coderdata to generate UMAPs colored by model type, cancer type, and source. We run them on each individual dataset and on all datasets joined together. The resulting plots can help form general hypotheses for further testing.

We focus on transcriptomics here; proteomics can be analyzed with the same methods shown below.

  • You can find and replace "transcriptomics" with "proteomics"; only a couple of small adjustments are then needed to get it to run.
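
As an alternative to find-and-replace, the data type can be kept in one variable and the table looked up by name. This is a minimal sketch, not part of coderdata: `get_omics_table` is a hypothetical helper, and a `SimpleNamespace` stands in for a loaded dataset whose tables are plain attributes (as `joined_dataset3.transcriptomics` is used below). That the proteomics value column is named "proteomics" is an assumption.

```python
from types import SimpleNamespace

import pandas as pd


def get_omics_table(dataset, datatype="transcriptomics"):
    """Return the table named `datatype` from a dataset-like object.

    Hypothetical helper: it only assumes that tables are plain attributes.
    """
    table = getattr(dataset, datatype, None)
    if table is None:
        raise ValueError(f"Dataset has no '{datatype}' table")
    return table


# Stand-in for a loaded dataset (assumption: real objects expose the same attributes).
toy = SimpleNamespace(
    transcriptomics=pd.DataFrame({"improve_sample_id": [1, 2], "transcriptomics": [0.5, 1.2]}),
    proteomics=pd.DataFrame({"improve_sample_id": [1], "proteomics": [3.1]}),
)

datatype = "proteomics"  # switch here instead of editing every cell
table = get_omics_table(toy, datatype)
value_column = datatype  # assumption: the value column shares the data type's name
```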

Note: UMAPs are easy to make and interpret, but clusters are not guaranteed to be meaningful or consistent between runs. They are best used to generate hypotheses and display possible trends.

Import Packages¶

In [1]:
import pandas as pd
import coderdata as cd
import umap
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
import numpy as np
import seaborn as sns
import matplotlib.patches as mpatches
import warnings
warnings.filterwarnings('ignore')

Load all Datasets¶

Load each dataset into its own object, then merge them into a single joined dataset.

In [2]:
# Load in all Datasets
hcmi = cd.DatasetLoader('hcmi')
beataml = cd.DatasetLoader('beataml')
cptac = cd.DatasetLoader('cptac')
depmap = cd.DatasetLoader('broad_sanger')
mpnst = cd.DatasetLoader('mpnst')
Processing Data...
Loaded genes dataset.
Processing Data...
Loaded genes dataset.
Processing Data...
Loaded genes dataset.
Processing Data...
Loaded genes dataset.
Processing Data...
Loaded genes dataset.
In [3]:
# Join BeatAML and HCMI
joined_dataset0 = cd.join_datasets(beataml, hcmi)

# Join DepMap and CPTAC
joined_dataset1 = cd.join_datasets(depmap, cptac)

# Join Datasets
joined_dataset2 = cd.join_datasets(joined_dataset0,joined_dataset1)

# Final Join
joined_dataset3 = cd.join_datasets(joined_dataset2,mpnst)
Processing Data...
Loaded genes dataset.
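
The pairwise joins above can also be written as a fold over a list of datasets. The sketch below is illustrative only: `join_all` is a hypothetical helper and `toy_join` is a stand-in for a two-argument join, used so the example runs without downloading any data.

```python
from functools import reduce


def join_all(datasets, join_two):
    """Fold a list of datasets into one by joining them pairwise, left to right."""
    return reduce(join_two, datasets)


# Stand-in join for illustration only: merges dicts of row lists.
def toy_join(left, right):
    merged = {key: list(rows) for key, rows in left.items()}
    for key, rows in right.items():
        merged.setdefault(key, []).extend(rows)
    return merged


a = {"samples": [1, 2]}
b = {"samples": [3]}
c = {"samples": [4], "drugs": ["x"]}
joined = join_all([a, b, c], toy_join)
```

With the loaders above, the call would look like `join_all([beataml, hcmi, depmap, cptac, mpnst], cd.join_datasets)`, assuming `cd.join_datasets` is called with two datasets at a time as in the cell above. A left fold joins in a different order than the nested joins in this tutorial, so row order in the result may differ.
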
In [7]:
joined_dataset3.transcriptomics = joined_dataset3.transcriptomics[["improve_sample_id", "transcriptomics", "entrez_id", "source", "study"]]
In [8]:
joined_dataset3.info()
This is a joined dataset comprising of:
- mpnst: A collection of NF1-MPNST patient-derived xenografts, organoids, and tumors. Data hosted on synapse.
- cptac: The Clinical Proteomic Tumor Analysis Consortium (CPTAC) project is a collaborative network funded by the National Cancer Institute (NCI).
- hcmi: Human Cancer Models Initiative (HCMI) data was collected though the National Cancer Institute (NCI) Genomic Data Commons (GDC) Data Portal.
- beataml: Beat acute myeloid leukemia (BeatAML) data was collected though GitHub and Synapse.

Available Datatypes and Their Formats:
- copy_number: long format
- drugs: long format
- experiments: long format
- genes: Format not specified
- mutations: long format
- proteomics: long format
- samples: Format not specified
- transcriptomics: long format
In [9]:
joined_dataset3.samples
Out[9]:
other_id improve_sample_id other_names common_name cancer_type model_type other_id_source species
0 11-00261 4102 Acute myelomonocytic leukaemia Peripheral Blood Acute Myeloid Leukaemia ex vivo beatAML NaN
1 11-00503 4103 AML with mutated NPM1 Bone Marrow Aspirate Acute Myeloid Leukaemia ex vivo beatAML NaN
2 11-00475 4104 AML with mutated NPM1 Bone Marrow Aspirate Acute Myeloid Leukaemia ex vivo beatAML NaN
3 13-00047 4105 Mixed phenotype acute leukaemia, T/myeloid, NOS Peripheral Blood Acute Myeloid Leukaemia ex vivo beatAML NaN
4 12-00032 4106 Chronic myelomonocytic leukaemia Peripheral Blood Acute Myeloid Leukaemia ex vivo beatAML NaN
... ... ... ... ... ... ... ... ...
45 WU-487 Tumor 5169 NaN WU-487 Malignant peripheral nerve sheath tumor Tumor NF Data Portal Human
46 WU-505 Tumor 5170 NaN WU-505 Malignant peripheral nerve sheath tumor Tumor NF Data Portal Human
47 WU-536 Tumor 5171 NaN WU-536 Malignant peripheral nerve sheath tumor Tumor NF Data Portal Human
48 WU-545 Tumor 5172 NaN WU-545 Malignant peripheral nerve sheath tumor Tumor NF Data Portal Human
49 WU-561 Tumor 5173 NaN WU-561 Malignant peripheral nerve sheath tumor Tumor NF Data Portal Human

45285 rows × 8 columns

In [10]:
joined_dataset3.transcriptomics
Out[10]:
improve_sample_id transcriptomics entrez_id source study
0 5087 1.523670 7105.0 synapse BeatAML
1 5087 1.523670 7105.0 synapse BeatAML
2 5087 7.107711 8813.0 synapse BeatAML
3 5087 7.107711 8813.0 synapse BeatAML
4 5087 3.362605 6359.0 synapse BeatAML
... ... ... ... ... ...
44552582 3188 11.940000 23140.0 bcm CPTAC3
44552583 3189 12.970000 23140.0 bcm CPTAC3
44552584 3190 11.860000 23140.0 bcm CPTAC3
44552585 3191 11.620000 23140.0 bcm CPTAC3
44552586 3192 12.040000 23140.0 bcm CPTAC3

187115143 rows × 5 columns

Initialize Mapping Dictionaries¶

These mapping dictionaries will be used to map samples (improve_sample_id) to model type, cancer type, common name, and study.

In [11]:
# Model Type Mapping
model_type_dict = {
    'Solid Tissue': 'tumor',
    'tumor': 'tumor',
    "organoid" : "organoid",
    'cell line': 'cell line',
    'Tumor': 'tumor',
    'ex vivo': 'tumor',
    '3D Organoid': 'organoid',
    'Peripheral Blood Components NOS': 'tumor',
    'Buffy Coat': np.nan,
     None: np.nan,
    'Peripheral Whole Blood': 'tumor',
    'Adherent Cell Line': 'cell line',
    '3D Neurosphere': 'organoid',
    '2D Modified Conditionally Reprogrammed Cells': 'cell line',
    'Pleural Effusion': np.nan,
    'Human Original Cells': 'cell line',
    'Not Reported': np.nan, 
    'Mixed Adherent Suspension': 'cell line',
    'Cell': 'cell line',
    'Saliva': np.nan
    }

model_type_sample_map = dict(zip(joined_dataset3.samples['improve_sample_id'], joined_dataset3.samples['model_type']))
common_name_sample_map = dict(zip(joined_dataset3.samples['improve_sample_id'], joined_dataset3.samples['common_name']))
cancer_type_sample_map = dict(zip(joined_dataset3.samples['improve_sample_id'], joined_dataset3.samples['cancer_type']))

study_sample_map = dict(zip(joined_dataset3.transcriptomics['improve_sample_id'], joined_dataset3.transcriptomics['study']))
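
To see how these dictionaries behave, here is a small sketch on toy data (the sample IDs and the three-entry `toy_model_type_dict` are made up for illustration). It shows the two-step lookup used later: sample ID to raw label, then raw label to harmonized label, with anything unmapped becoming NaN.

```python
import numpy as np
import pandas as pd

samples = pd.DataFrame({
    "improve_sample_id": [1, 2, 3],
    "model_type": ["Adherent Cell Line", "3D Organoid", "Saliva"],
})
toy_model_type_dict = {
    "Adherent Cell Line": "cell line",
    "3D Organoid": "organoid",
    "Saliva": np.nan,
}

# improve_sample_id -> raw model_type label
id_to_raw = dict(zip(samples["improve_sample_id"], samples["model_type"]))

ids = pd.Series([1, 2, 3, 99], name="improve_sample_id")  # 99 is not in the samples table
harmonized = ids.map(id_to_raw).map(toy_model_type_dict)
```

Note that `dict(zip(...))` keeps only the last value seen for a repeated key, so a sample ID that appears with two different labels silently keeps the later one.
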

Reformat transcriptomics Data¶

Convert the transcriptomics data from default (long) to wide.

In [13]:
joined_dataset3.reformat_dataset("transcriptomics","wide")
transcriptomics successfully converted to wide format
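
The internals of `reformat_dataset` are not shown in this tutorial. As a rough picture of what a long-to-wide conversion does, the sketch below uses plain pandas `pivot_table` on toy data; averaging duplicated (sample, gene) rows is an assumption made here, not a statement about coderdata.

```python
import pandas as pd

long_df = pd.DataFrame({
    "improve_sample_id": [1, 1, 1, 2, 2],
    "entrez_id": [10.0, 10.0, 20.0, 10.0, 20.0],
    "transcriptomics": [1.5, 1.5, 7.1, 2.0, 3.0],
})

# One row per sample, one column per gene; duplicated (sample, gene) rows are averaged.
wide_df = (
    long_df.pivot_table(
        index="improve_sample_id",
        columns="entrez_id",
        values="transcriptomics",
        aggfunc="mean",
    )
    .reset_index()
)
```

The result has `improve_sample_id` as its first column followed by one column per `entrez_id`, which matches the layout of the wide table displayed further down.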

Store improve_sample_id in a Separate Dataframe¶

We store improve_sample_id in a separate dataframe. This will be used to link information to the UMAP embeddings.
Retrievable data: model_type, study, cancer_type.

In [14]:
jd3_sample_col = joined_dataset3.transcriptomics.iloc[:, 0].to_frame()
jd3_sample_col['model_type'] = jd3_sample_col['improve_sample_id'].map(model_type_sample_map)
jd3_sample_col['model_type'] = jd3_sample_col['model_type'].map(model_type_dict)
jd3_sample_col['study'] = jd3_sample_col['improve_sample_id'].map(study_sample_map)
jd3_sample_col['cancer_type'] = jd3_sample_col['improve_sample_id'].map(cancer_type_sample_map)
In [15]:
joined_dataset3.transcriptomics
Out[15]:
entrez_id improve_sample_id 1.0 2.0 3.0 9.0 10.0 11.0 12.0 13.0 14.0 ... 118097967.0 118126072.0 118142757.0 118568804.0 122394733.0 122405565.0 124905743.0 124906461.0 125316803.0 125505920.0
0 1 0.377908 0.214517 0.07 10.884808 1.004783 0.0 0.374231 5.125926 58.706594 ... 0.0 4.250000 0.128292 0.0 0.540773 2.314544 0.000000 0.070195 21.933752 0.13
1 2 2.016174 0.178161 0.02 4.039268 1.309193 0.0 0.247848 0.158752 71.169380 ... 0.0 1.910000 0.017178 0.0 0.098161 6.066530 0.000000 0.007178 10.166839 0.11
2 3 0.927081 20.780606 0.00 4.297862 0.034285 0.0 5.436207 0.000000 41.411760 ... 0.0 3.530000 0.000000 0.0 2.387944 27.548022 0.000000 0.000000 23.857035 0.75
3 4 0.068752 0.497908 0.00 14.466601 0.090195 0.0 0.329962 0.665773 89.898180 ... 0.0 3.010000 0.007178 0.0 1.978817 34.059067 0.000000 0.000000 31.096575 0.07
4 5 2.837025 0.520713 0.00 6.530024 0.022178 0.0 0.021322 0.056322 57.371410 ... 0.0 23.490000 0.022178 0.0 2.944486 18.276848 0.000000 0.022178 31.300927 0.03
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3908 5118 -2.948888 -0.509818 NaN 2.541549 NaN NaN NaN NaN 6.503961 ... NaN 1.964441 NaN NaN -0.589435 5.347260 3.780014 NaN 6.300062 NaN
3909 5119 -2.671581 -0.051962 NaN 2.107456 NaN NaN NaN NaN 6.267933 ... NaN 3.081971 NaN NaN -1.789189 5.714423 4.520107 NaN 6.159214 NaN
3910 5120 -3.720447 0.918365 NaN 2.160300 NaN NaN NaN NaN 6.317204 ... NaN 3.347740 NaN NaN -1.689536 5.734799 5.603482 NaN 6.171670 NaN
3911 5121 -2.254779 1.955952 NaN 2.230090 NaN NaN NaN NaN 6.954112 ... NaN 4.166740 NaN NaN 1.183738 6.455522 -0.976484 NaN 6.394535 NaN
3912 5122 -3.144297 -1.273558 NaN 2.163439 NaN NaN NaN NaN 6.285517 ... NaN 3.991816 NaN NaN -0.286764 5.575612 3.880383 NaN 6.172091 NaN

3913 rows × 38619 columns

Format the Wide transcriptomics Data for the UMAP¶

Here we drop the improve_sample_id column so that only expression values are passed to UMAP. The same approach works for proteomics or other data types.
The points in the UMAP are in the same row order as jd3_sample_col, so they can still be colored and labeled.

In [16]:
joined_dataset3.transcriptomics = joined_dataset3.transcriptomics.drop(joined_dataset3.transcriptomics.columns[:1], axis=1)
In [17]:
joined_dataset3.transcriptomics
Out[17]:
entrez_id 1.0 2.0 3.0 9.0 10.0 11.0 12.0 13.0 14.0 15.0 ... 118097967.0 118126072.0 118142757.0 118568804.0 122394733.0 122405565.0 124905743.0 124906461.0 125316803.0 125505920.0
0 0.377908 0.214517 0.07 10.884808 1.004783 0.0 0.374231 5.125926 58.706594 0.178292 ... 0.0 4.250000 0.128292 0.0 0.540773 2.314544 0.000000 0.070195 21.933752 0.13
1 2.016174 0.178161 0.02 4.039268 1.309193 0.0 0.247848 0.158752 71.169380 0.773686 ... 0.0 1.910000 0.017178 0.0 0.098161 6.066530 0.000000 0.007178 10.166839 0.11
2 0.927081 20.780606 0.00 4.297862 0.034285 0.0 5.436207 0.000000 41.411760 0.192164 ... 0.0 3.530000 0.000000 0.0 2.387944 27.548022 0.000000 0.000000 23.857035 0.75
3 0.068752 0.497908 0.00 14.466601 0.090195 0.0 0.329962 0.665773 89.898180 0.366517 ... 0.0 3.010000 0.007178 0.0 1.978817 34.059067 0.000000 0.000000 31.096575 0.07
4 2.837025 0.520713 0.00 6.530024 0.022178 0.0 0.021322 0.056322 57.371410 0.091322 ... 0.0 23.490000 0.022178 0.0 2.944486 18.276848 0.000000 0.022178 31.300927 0.03
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3908 -2.948888 -0.509818 NaN 2.541549 NaN NaN NaN NaN 6.503961 NaN ... NaN 1.964441 NaN NaN -0.589435 5.347260 3.780014 NaN 6.300062 NaN
3909 -2.671581 -0.051962 NaN 2.107456 NaN NaN NaN NaN 6.267933 NaN ... NaN 3.081971 NaN NaN -1.789189 5.714423 4.520107 NaN 6.159214 NaN
3910 -3.720447 0.918365 NaN 2.160300 NaN NaN NaN NaN 6.317204 NaN ... NaN 3.347740 NaN NaN -1.689536 5.734799 5.603482 NaN 6.171670 NaN
3911 -2.254779 1.955952 NaN 2.230090 NaN NaN NaN NaN 6.954112 NaN ... NaN 4.166740 NaN NaN 1.183738 6.455522 -0.976484 NaN 6.394535 NaN
3912 -3.144297 -1.273558 NaN 2.163439 NaN NaN NaN NaN 6.285517 NaN ... NaN 3.991816 NaN NaN -0.286764 5.575612 3.880383 NaN 6.172091 NaN

3913 rows × 38618 columns
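
The row-order guarantee is what makes the later per-model-type subsets work. A minimal sketch on toy data (all values invented) of splitting off the ID column and then selecting matching rows from the feature matrix with a boolean mask:

```python
import numpy as np
import pandas as pd

wide = pd.DataFrame({
    "improve_sample_id": [11, 12, 13, 14],
    1.0: [0.1, 0.2, 0.3, 0.4],
    2.0: [1.0, 2.0, 3.0, 4.0],
})
labels = pd.DataFrame({
    "improve_sample_id": wide["improve_sample_id"],
    "model_type": ["tumor", "organoid", "tumor", "cell line"],
})

features = wide.drop(columns="improve_sample_id")

# Row i of `features` (and later of the embedding) belongs to row i of `labels`.
mask = (labels["model_type"] == "tumor").to_numpy()
tumor_features = features.to_numpy()[mask]
tumor_ids = labels.loc[mask, "improve_sample_id"].tolist()
```

Dropping the column by name, as here, is a little more explicit than dropping the first column by position, but both rely on the label table and the feature table never being sorted or filtered independently.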

Impute transcriptomics Data¶

Here we use the median of each column to fill in NaN values.
This is a crude imputation method, and you may wish to use a more careful one here.

In [18]:
# Fill missing values with each column's median, then drop columns that are entirely missing
medians = joined_dataset3.transcriptomics.median()
joined_dataset3.transcriptomics = joined_dataset3.transcriptomics.fillna(medians)
joined_dataset3.transcriptomics = joined_dataset3.transcriptomics.dropna(axis='columns', how='all')
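
The imputation step can be checked on a toy table first. The sketch below (invented gene names and values) fills each column with its own median and shows why the final `dropna` is needed: a column with no observed values has no median and stays empty.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "gene_a": [1.0, np.nan, 3.0],
    "gene_b": [np.nan, np.nan, np.nan],  # never measured
    "gene_c": [4.0, 5.0, np.nan],
})

# Per-column median; all-NaN columns have a NaN median and are left untouched.
imputed = df.fillna(df.median())
# Remove the columns that could not be imputed.
imputed = imputed.dropna(axis="columns", how="all")
```

Because the median is taken over all rows at once, a gene that is missing from one whole dataset is filled with a value determined entirely by the other datasets. That can make samples from the same source look more alike than they are.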

Run UMAP Functions¶

Data is scaled, transformed and embedded using the UMAP functions from umap-learn.

In [19]:
reducer = umap.UMAP()
t_full_data = joined_dataset3.transcriptomics.values
scaled_t_full_data = StandardScaler().fit_transform(t_full_data)
embedding_t_full_data = reducer.fit_transform(scaled_t_full_data)
embedding_t_full_data.shape
Out[19]:
(3913, 2)
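
For intuition about the scaling step, `StandardScaler` centers each column and divides by its population standard deviation, leaving constant columns at zero. The numpy sketch below reproduces that transform on a toy array; `standardize` is a name chosen here for illustration.

```python
import numpy as np


def standardize(x):
    """Column-wise z-score using the population standard deviation (ddof=0)."""
    x = np.asarray(x, dtype=float)
    mean = x.mean(axis=0)
    std = x.std(axis=0)
    std[std == 0] = 1.0  # constant columns: avoid division by zero, result is all zeros
    return (x - mean) / std


x = np.array([[1.0, 5.0], [3.0, 5.0], [5.0, 5.0]])
z = standardize(x)
```

Scaling gives every gene equal weight in the distance calculation, regardless of its expression range. Separately, UMAP is stochastic, so repeated runs give different layouts; passing a fixed seed such as `umap.UMAP(random_state=42)` makes a run repeatable.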

Add in Cancer Mapping Types¶

This maps cancer types to common names. In the future this will be done within the CoderData pipeline and this step can be removed.

In [20]:
cell_line_types_df = pd.read_csv('cellLineTypes.csv')
mapping_dict = {}

# Iterate through each row in the DataFrame
for _, row in cell_line_types_df.iterrows():
    # Find the first non-null value in the row to use as the mapping target
    target_value = row.dropna().iloc[0] if not row.dropna().empty else None
    
    if target_value:
        # Iterate over all values in the row
        for value in row:
            # Check if the value is not null and not already the target value
            if pd.notnull(value) and value != target_value:
                # Map this value to the target_value
                mapping_dict[value] = target_value
jd3_sample_col['cancer_type'] = jd3_sample_col['cancer_type'].map(mapping_dict).fillna(jd3_sample_col['cancer_type'])
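
The contents of cellLineTypes.csv are not shown here, so the sketch below uses a made-up table with the layout the loop expects: each row lists names for one cancer type, and the first non-null entry is treated as the name to keep.

```python
import numpy as np
import pandas as pd

# Toy stand-in for cellLineTypes.csv (column names and values are invented).
synonyms = pd.DataFrame({
    "common": ["Acute Myeloid Leukemia", "Glioblastoma"],
    "alias_1": ["Acute Myeloid Leukaemia", "Glioblastoma multiforme"],
    "alias_2": ["AML", np.nan],
})

mapping = {}
for _, row in synonyms.iterrows():
    values = row.dropna()
    if values.empty:
        continue
    target = values.iloc[0]
    # Every later name in the row points back to the first one.
    for value in values.iloc[1:]:
        if value != target:
            mapping[value] = target

cancer_type = pd.Series(["AML", "Glioblastoma multiforme", "Mesothelioma"])
# Names without a mapping keep their original value.
harmonized = cancer_type.map(mapping).fillna(cancer_type)
```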

Plot All Datasets UMAP by Model Types¶

In this plot, points are labeled by model type (Tumor, Organoid, Cell Line, Other). Points that map to the Other model types (saliva, buffy coat, etc.) are hidden in this example. The Other model types will likely be removed from the coderdata package.

In [21]:
#plot umap. Hide/unhide unknowns

legend_handles = [
    mpatches.Patch(color=sns.color_palette()[0], label='Tumor'),  
    mpatches.Patch(color=sns.color_palette()[1], label='Organoid'),
    mpatches.Patch(color=sns.color_palette()[2], label='Cell Line')
    # mpatches.Patch(color=sns.color_palette()[3], label='Other')  # Uncomment this to include the Other model type legend label.
]

# This is used to hide the Unknown model types.
alphas = [0 if x == 3 else 1 for x in jd3_sample_col.model_type.map({"tumor": 0, "organoid": 1, "cell line": 2, np.nan: 3})]

plt.scatter(
    embedding_t_full_data[:, 0],
    embedding_t_full_data[:, 1],
    c=[sns.color_palette()[x] for x in jd3_sample_col.model_type.map({"tumor": 0, "organoid": 1, "cell line": 2, np.nan: 3})],
    alpha=alphas,  # Apply the alpha values here
    s=3
)
plt.gca().set_aspect('equal', 'datalim')

plt.legend(handles=legend_handles, title='Model Type')
plt.title('All Datasets, Transcriptomics UMAP by Model Type', fontsize=12)
Out[21]:
Text(0.5, 1.0, 'All Datasets, Transcriptomics UMAP by Model Type')
In [22]:
jd3_sample_col
Out[22]:
improve_sample_id model_type study cancer_type
0 1 cell line Sanger & Broad Cell Lines RNASeq Pancreatic Carcinoma
1 2 cell line Sanger & Broad Cell Lines RNASeq Colorectal Carcinoma
2 3 cell line Sanger & Broad Cell Lines RNASeq Glioblastoma multiforme
3 4 cell line Sanger & Broad Cell Lines RNASeq Mesothelioma
4 5 cell line Sanger & Broad Cell Lines RNASeq B-Lymphoblastic Leukemia
... ... ... ... ...
3908 5118 tumor BeatAML Acute myeloid leukemia
3909 5119 tumor BeatAML Acute myeloid leukemia
3910 5120 tumor BeatAML Acute myeloid leukemia
3911 5121 tumor BeatAML Acute myeloid leukemia
3912 5122 tumor BeatAML Acute myeloid leukemia

3913 rows × 4 columns

In [30]:
unique_studies = jd3_sample_col['study'].dropna().unique()
palette = sns.color_palette("Set2", len(unique_studies))
study_to_color = {study: color for study, color in zip(unique_studies, palette)}

# Prepare the colors for each point
colors = jd3_sample_col['study'].map(study_to_color).fillna('black')

plt.scatter(
    embedding_t_full_data[:, 0],
    embedding_t_full_data[:, 1],
    c=colors,
    alpha=1,  # Adjust based on your preference for visibility
    s=3  # Adjust size as needed
)
plt.gca().set_aspect('equal', 'datalim')

# Create legend handles manually
legend_handles = [plt.Line2D([0], [0], marker='o', color='w', label=study,
                             markerfacecolor=color, markersize=10) for study, color in study_to_color.items()]

plt.legend(handles=legend_handles, title='Study',prop={'size': 6}, title_fontsize=8)
plt.title('All Datasets, Transcriptomics UMAP by Study', fontsize=12)
plt.show()

Map Colors to Cancer Types for the Next Four Plots¶

In [32]:
def interlace_lists(*lists):
    """Interlace items from multiple lists in an alternating fashion."""
    max_length = max(len(lst) for lst in lists)
    interlaced = []
    for i in range(max_length):
        for lst in lists:
            if i < len(lst):
                interlaced.append(lst[i])
    return interlaced


top_10_organoid = jd3_sample_col[jd3_sample_col.model_type == "organoid"].cancer_type.value_counts().head(10).index
top_10_cell_line = jd3_sample_col[jd3_sample_col.model_type == "cell line"].cancer_type.value_counts().head(10).index
top_10_tumor = jd3_sample_col[jd3_sample_col.model_type == "tumor"].cancer_type.value_counts().head(10).index
top_10_cancer_type = jd3_sample_col.cancer_type.value_counts().head(10).index

# Create a unified list of unique top cancer types
all_top_cancer_types = interlace_lists(top_10_cancer_type,top_10_organoid, top_10_cell_line, top_10_tumor)
all_top_cancer_types = pd.Series(all_top_cancer_types).unique()

# Generate a color mapping for these cancer types
colors = sns.color_palette("hsv", len(all_top_cancer_types))
color_map = {cancer_type: color for cancer_type, color in zip(all_top_cancer_types, colors)}
# Manually set "Other" to grey
color_map['Other'] = (0.5, 0.5, 0.5)
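
A small worked example of the interlacing step, using invented cancer type lists. The function is re-stated compactly so the snippet runs on its own, and `dict.fromkeys` is used here as a pure-Python way to drop repeats while keeping first-seen order.

```python
def interlace_lists(*lists):
    """Take the first item of every list, then the second of every list, and so on."""
    longest = max(len(lst) for lst in lists)
    return [lst[i] for i in range(longest) for lst in lists if i < len(lst)]


overall = ["Lung", "Breast", "Colon"]
organoid = ["Colon", "Pancreas"]
tumor = ["AML"]

interlaced = interlace_lists(overall, organoid, tumor)
# dict.fromkeys drops repeats while keeping first-seen order
unique_types = list(dict.fromkeys(interlaced))
```

Since colors are then taken in order from a continuous "hsv" palette, interlacing presumably helps keep the top types of a single model type from landing on neighboring, hard-to-distinguish hues.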

Plot All Datasets UMAP by Cancer Types¶

In this plot, we color by cancer type to show differences between groups. As there are hundreds of cancer types present, we must filter down to a reasonable number, such as 10, for plotting.

In [33]:
top_10_types = jd3_sample_col.cancer_type.value_counts().head(10).index

# Create a new column for mapping colors with the unified color scheme, keeping only the overall top 10 cancer types
jd3_sample_col['full_color_group'] = jd3_sample_col.apply(
    lambda row: row.cancer_type if row.cancer_type in top_10_types else 'Other', axis=1
)

jd3_sample_col['color'] = jd3_sample_col['full_color_group'].map(color_map)

jd3_sample_col['alpha'] = jd3_sample_col['full_color_group'].apply(lambda x: 0.25 if x == 'Other' else 1.0)
legend_handles = [mpatches.Patch(color=color_map[group], label=group) for group in list(top_10_types) + ['Other']]

# Plotting
plt.scatter(
    embedding_t_full_data[:, 0],
    embedding_t_full_data[:, 1],
    c=jd3_sample_col['color'],  # Use the mapped colors
    alpha=jd3_sample_col['alpha'],  # Apply the alpha values here
    s=1.5
)

plt.legend(handles=legend_handles, title='Cancer Type', prop={'size': 6}, title_fontsize=8)
plt.gca().set_aspect('equal', 'datalim')
plt.title('All Datasets, Transcriptomics UMAP by Cancer Type', fontsize=12)
plt.show()
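
The "top N or Other" grouping can also be written without a row-wise `apply`. The sketch below shows one vectorized way on toy labels; it is an alternative formulation, not the code used to produce the plot above.

```python
import pandas as pd

cancer_type = pd.Series(["AML", "AML", "AML", "Lung", "Lung", "Colon", "Skin"])

top_n = cancer_type.value_counts().head(2).index
# Keep the label where it is in the top N, otherwise replace it with 'Other'.
color_group = cancer_type.where(cancer_type.isin(top_n), "Other")
# Fade the 'Other' points so the highlighted groups stand out.
alpha = color_group.map(lambda g: 0.25 if g == "Other" else 1.0)
```
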
In [26]:
# Function to apply unified color mapping across cancer types and plot UMAP by model type
def plot_umap_by_model_type(model_type, embedding, color_map, jd3_sample_col, num):
    # Determine the top `num` cancer types for the current model type
    top_x_types = jd3_sample_col[jd3_sample_col.model_type == model_type].cancer_type.value_counts().head(num).index

    # Create a new column for mapping colors with the unified color scheme, considering only the top `num` for the current model type
    jd3_sample_col[f'{model_type}_color_group'] = jd3_sample_col.apply(
        lambda row: row.cancer_type if row.cancer_type in top_x_types else 'Other', axis=1
    )

    # Map the color_group column to actual colors using the unified color map
    jd3_sample_col['color'] = jd3_sample_col[f'{model_type}_color_group'].map(color_map)

    # Filtering rows for the current model type
    filtered_rows = jd3_sample_col[jd3_sample_col.model_type == model_type]

    # Plotting
    plt.scatter(
        embedding[:, 0],
        embedding[:, 1],
        c=filtered_rows['color'],
        alpha=[0.1 if x == 'Other' else 1 for x in filtered_rows[f'{model_type}_color_group']],  # One-liner for conditional alpha
        s=12 
    )

    # Adjusting legend to reflect new grouping with unified color scheme
    legend_handles = [mpatches.Patch(color=color_map[group], label=group) for group in list(top_x_types) + ['Other']]
    plt.gca().set_aspect('equal', 'datalim')
    plt.legend(handles=legend_handles, title='Cancer Type', prop={'size': 5}, title_fontsize=8)
    plt.title(f'All Datasets, {model_type.capitalize()}, Transcriptomics UMAP by Cancer Type', fontsize=12)
    plt.show()

Plot All Datasets UMAP Within Organoid Model Type¶

In [27]:
reducer = umap.UMAP()
t_full_organoid_data = joined_dataset3.transcriptomics[jd3_sample_col.model_type == "organoid"].values
scaled_t_full_organoid_data = StandardScaler().fit_transform(t_full_organoid_data)
embedding_t_full_organoid_data = reducer.fit_transform(scaled_t_full_organoid_data)
embedding_t_full_organoid_data.shape
Out[27]:
(192, 2)
In [28]:
plot_umap_by_model_type("organoid", embedding_t_full_organoid_data, color_map, jd3_sample_col, 8)

Plot All Datasets UMAP Within Cell Line Model Type¶

In [29]:
reducer = umap.UMAP()
t_full_cell_line_data = joined_dataset3.transcriptomics[jd3_sample_col.model_type == "cell line"].values
scaled_t_full_cell_line_data = StandardScaler().fit_transform(t_full_cell_line_data)
embedding_t_full_cell_line_data = reducer.fit_transform(scaled_t_full_cell_line_data)
embedding_t_full_cell_line_data.shape
Out[29]:
(1761, 2)
In [30]:
plot_umap_by_model_type("cell line", embedding_t_full_cell_line_data, color_map, jd3_sample_col,10)

Plot All Datasets UMAP Within Tumor Model Type¶

In [31]:
reducer = umap.UMAP()
t_full_tumor_data = joined_dataset3.transcriptomics[jd3_sample_col.model_type == "tumor"].values
scaled_t_full_tumor_data = StandardScaler().fit_transform(t_full_tumor_data)
embedding_t_full_tumor_data = reducer.fit_transform(scaled_t_full_tumor_data)
embedding_t_full_tumor_data.shape
Out[31]:
(1960, 2)
In [32]:
plot_umap_by_model_type("tumor", embedding_t_full_tumor_data, color_map, jd3_sample_col,10)

Run UMAP for HCMI on Model Types¶

All of the code above is now applied to HCMI.

In [33]:
hcmi.reformat_dataset("transcriptomics","wide")
hcmi_sample_col = hcmi.transcriptomics.iloc[:, 0].to_frame()
hcmi_sample_col['model_type'] = hcmi_sample_col['improve_sample_id'].map(model_type_sample_map)
hcmi_sample_col['model_type'] = hcmi_sample_col['model_type'].map(model_type_dict)
hcmi.transcriptomics = hcmi.transcriptomics.drop(hcmi.transcriptomics.columns[:1], axis=1)
# Fill missing values with the median of each column
hcmi.transcriptomics = hcmi.transcriptomics.fillna(hcmi.transcriptomics.median())
reducer = umap.UMAP()
t_hcmi_data = hcmi.transcriptomics.values
scaled_t_hcmi_data = StandardScaler().fit_transform(t_hcmi_data)
embedding_t_hcmi_data = reducer.fit_transform(scaled_t_hcmi_data)
embedding_t_hcmi_data.shape
transcriptomics successfully converted to wide format
Out[33]:
(396, 2)
In [34]:
legend_handles = [
    mpatches.Patch(color=sns.color_palette()[0], label='Tumor'),  
    mpatches.Patch(color=sns.color_palette()[1], label='Organoid'),
    mpatches.Patch(color=sns.color_palette()[2], label='Cell Line')
]
alphas = [0 if x == 3 else 1 for x in hcmi_sample_col.model_type.map({"tumor": 0, "organoid": 1, "cell line": 2, np.nan: 3})]
plt.scatter(
    embedding_t_hcmi_data[:, 0],
    embedding_t_hcmi_data[:, 1],
    c=[sns.color_palette()[x] for x in hcmi_sample_col.model_type.map({"tumor": 0, "organoid": 1, "cell line": 2, np.nan: 3})],
    alpha=alphas,  # Apply the alpha values here
    s=12
)
plt.gca().set_aspect('equal', 'datalim')
plt.legend(handles=legend_handles, title='Model Type')
plt.title('HCMI, Transcriptomics UMAP by Model Type', fontsize=12)
Out[34]:
Text(0.5, 1.0, 'HCMI, Transcriptomics UMAP by Model Type')

Run UMAP for BeatAML on Sample Types¶

All of the code above is now applied to BeatAML.

In [35]:
beataml.reformat_dataset("transcriptomics","wide")
beataml_sample_col = beataml.transcriptomics.iloc[:, 0].to_frame()
beataml_sample_col['common_name'] = beataml_sample_col['improve_sample_id'].map(common_name_sample_map)
beataml.transcriptomics = beataml.transcriptomics.drop(beataml.transcriptomics.columns[:1], axis=1)
# Fill missing values with the median of each column
beataml.transcriptomics = beataml.transcriptomics.fillna(beataml.transcriptomics.median())
reducer = umap.UMAP()
t_beataml_data = beataml.transcriptomics.values
scaled_t_beataml_data = StandardScaler().fit_transform(t_beataml_data)
embedding_t_beataml_data = reducer.fit_transform(scaled_t_beataml_data)
embedding_t_beataml_data.shape
transcriptomics successfully converted to wide format
Out[35]:
(707, 2)
In [36]:
# Map each sample type to one palette index. The legend is built from the same
# mapping so that colors and labels cannot drift apart.
sample_type_index = {
    'Peripheral Blood': 0,
    'Leukapheresis': 1,
    'Bone Marrow Aspirate': 2,
    'Healthy pooled CD34+': 3,
    'Healthy Individual BM MNC': 4,
    'Healthy Individual CD34+': 5,
}
# Here we hide Healthy Individual CD34+ because there is only 1 sample.
hidden_type = 'Healthy Individual CD34+'
palette = sns.color_palette()
sample_idx = beataml_sample_col.common_name.map(sample_type_index)

legend_handles = [
    mpatches.Patch(color=palette[i], label=label)
    for label, i in sample_type_index.items() if label != hidden_type
]
alphas = [0 if name == hidden_type else 1 for name in beataml_sample_col.common_name]

plt.scatter(
    embedding_t_beataml_data[:, 0],
    embedding_t_beataml_data[:, 1],
    c=[palette[i] for i in sample_idx],
    s=12,
    alpha=alphas
)
plt.gca().set_aspect('equal', 'datalim')
plt.legend(handles=legend_handles, title='Sample Type', fontsize='small', title_fontsize='small')

plt.title('BeatAML, Transcriptomics UMAP by Sample Type', fontsize=12)
Out[36]:
Text(0.5, 1.0, 'BeatAML, Transcriptomics UMAP by Sample Type')

Run UMAP for CPTAC on Cancer Types¶

All of the code above is now applied to CPTAC. The plotting code below is condensed to avoid hard coding of cancer types.

In [37]:
cptac.reformat_dataset("transcriptomics","wide")
cptac_sample_col = cptac.transcriptomics.iloc[:, 0].to_frame()
cptac_sample_col['cancer_type'] = cptac_sample_col['improve_sample_id'].map(cancer_type_sample_map)
cptac.transcriptomics = cptac.transcriptomics.drop(cptac.transcriptomics.columns[:1], axis=1)
# Fill missing values with the median of each column
cptac.transcriptomics = cptac.transcriptomics.fillna(cptac.transcriptomics.median())
reducer = umap.UMAP()
t_cptac_data = cptac.transcriptomics.values
scaled_t_cptac_data = StandardScaler().fit_transform(t_cptac_data)
embedding_t_cptac_data = reducer.fit_transform(scaled_t_cptac_data)
embedding_t_cptac_data.shape
transcriptomics successfully converted to wide format
Out[37]:
(1113, 2)
In [38]:
cancer_types = cptac.samples.cancer_type.unique()
colors = sns.color_palette(n_colors=len(cancer_types))

# Create legend handles dynamically
legend_handles = [mpatches.Patch(color=colors[i], label=label) for i, label in enumerate(cancer_types)]

color_mapping = {cancer_type: color for cancer_type, color in zip(cancer_types, colors)}

cptac_colors = cptac_sample_col.cancer_type.map(color_mapping).tolist()

plt.scatter(
    embedding_t_cptac_data[:, 0],
    embedding_t_cptac_data[:, 1],
    c=cptac_colors,
    s=12,
)

plt.gca().set_aspect('equal', 'datalim')
plt.legend(handles=legend_handles, title='Cancer Type', prop={'size': 6}, title_fontsize=8)
plt.title('CPTAC, Transcriptomics UMAP by Cancer Type', fontsize=12)
Out[38]:
Text(0.5, 1.0, 'CPTAC, Transcriptomics UMAP by Cancer Type')

Run UMAP for DepMap/Sanger on Top 10 Cancer Types¶

All of the code above is now applied to DepMap/Sanger. However, because there are over 200 cancer types, they cannot easily be distinguished by color in a UMAP. As such, we filter to the 10 most abundant cancer types.

In [39]:
# Assuming depmap is already loaded and initialized
depmap.reformat_dataset("transcriptomics", "wide")

# Creating a mapping for cancer types
depmap_sample_col = depmap.transcriptomics.iloc[:, 0].to_frame()
depmap_sample_col['cancer_type'] = depmap_sample_col['improve_sample_id'].map(cancer_type_sample_map)

# Identify the top 10 most prevalent cancer types
top_10_cancer_types = depmap_sample_col['cancer_type'].value_counts().head(10).index.tolist()

# Filter the dataset for only the top 10 cancer types
depmap_sample_col = depmap_sample_col[depmap_sample_col['cancer_type'].isin(top_10_cancer_types)]
# Keep the matching rows and drop the ID column so it is not used as a feature
filtered_transcriptomics = depmap.transcriptomics.loc[depmap_sample_col.index].drop(columns='improve_sample_id')

# Fill missing values with the median of each column, then drop columns that are entirely missing
filtered_transcriptomics = filtered_transcriptomics.fillna(filtered_transcriptomics.median())
filtered_transcriptomics = filtered_transcriptomics.dropna(axis=1, how='all')

# UMAP Analysis
reducer = umap.UMAP()
t_depmap_data = filtered_transcriptomics.values
scaled_t_depmap_data = StandardScaler().fit_transform(t_depmap_data)
embedding_t_depmap_data = reducer.fit_transform(scaled_t_depmap_data)

# Visualization
cancer_types = top_10_cancer_types
colors = sns.color_palette(n_colors=len(cancer_types))

legend_handles = [mpatches.Patch(color=colors[i], label=label) for i, label in enumerate(cancer_types)]
color_mapping = {cancer_type: color for cancer_type, color in zip(cancer_types, colors)}

depmap_colors = depmap_sample_col['cancer_type'].map(color_mapping).tolist()

plt.scatter(
    embedding_t_depmap_data[:, 0],
    embedding_t_depmap_data[:, 1],
    c=depmap_colors,
    s=12,
)

plt.gca().set_aspect('equal', 'datalim')
plt.legend(handles=legend_handles, title='Cancer Type', prop={'size': 6}, title_fontsize=8)
plt.title('DepMap, Transcriptomics UMAP by Cancer Type', fontsize=12)
plt.show()
transcriptomics successfully converted to wide format

Good luck creating your own UMAPs!¶
