import pandas as pd
import coderdata as cd
import umap
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import numpy as np
import seaborn as sns
import matplotlib.patches as mpatches
import warnings
warnings.filterwarnings('ignore')


# Load in all Datasets
hcmi = cd.DatasetLoader('hcmi')
beataml = cd.DatasetLoader('beataml')
cptac = cd.DatasetLoader('cptac')
depmap = cd.DatasetLoader('broad_sanger')
mpnst = cd.DatasetLoader('mpnst')

Processing Data...
Loaded genes dataset.
Processing Data...
Loaded genes dataset.
Processing Data...
Loaded genes dataset.
Processing Data...
Loaded genes dataset.
Processing Data...
Loaded genes dataset.


# Join BeatAML and HCMI
joined_dataset0 = cd.join_datasets(beataml, hcmi)

# Join DepMap and CPTAC
joined_dataset1 = cd.join_datasets(depmap, cptac)

# Join Datasets
joined_dataset2 = cd.join_datasets(joined_dataset0,joined_dataset1)

# Final Join
joined_dataset3 = cd.join_datasets(joined_dataset2,mpnst)

Processing Data...
Loaded genes dataset.


joined_dataset3.transcriptomics= joined_dataset3.transcriptomics[["improve_sample_id", "transcriptomics", "entrez_id", "source", "study"]]


joined_dataset3.info()

This is a joined dataset comprising of:
- mpnst: A collection of NF1-MPNST patient-derived xenografts, organoids, and tumors. Data hosted on synapse.
- cptac: The Clinical Proteomic Tumor Analysis Consortium (CPTAC) project is a collaborative network funded by the National Cancer Institute (NCI).
- hcmi: Human Cancer Models Initiative (HCMI) data was collected though the National Cancer Institute (NCI) Genomic Data Commons (GDC) Data Portal.
- beataml: Beat acute myeloid leukemia (BeatAML) data was collected though GitHub and Synapse.

Available Datatypes and Their Formats:
- copy_number: long format
- drugs: long format
- experiments: long format
- genes: Format not specified
- mutations: long format
- proteomics: long format
- samples: Format not specified
- transcriptomics: long format


joined_dataset3.samples


joined_dataset3.transcriptomics


# Model Type Mapping
model_type_dict = {
    'Solid Tissue': 'tumor',
    'tumor': 'tumor',
    "organoid" : "organoid",
    'cell line': 'cell line',
    'Tumor': 'tumor',
    'ex vivo': 'tumor',
    '3D Organoid': 'organoid',
    'Peripheral Blood Components NOS': 'tumor',
    'Buffy Coat': np.nan,
     None: np.nan,
    'Peripheral Whole Blood': 'tumor',
    'Adherent Cell Line': 'cell line',
    '3D Neurosphere': 'organoid',
    '2D Modified Conditionally Reprogrammed Cells': 'cell line',
    'Pleural Effusion': np.nan,
    'Human Original Cells': 'cell line',
    'Not Reported': np.nan, 
    'Mixed Adherent Suspension': 'cell line',
    'Cell': 'cell line',
    'Saliva': np.nan
    }

model_type_sample_map = dict(zip(joined_dataset3.samples['improve_sample_id'], joined_dataset3.samples['model_type']))
common_name_sample_map = dict(zip(joined_dataset3.samples['improve_sample_id'], joined_dataset3.samples['common_name']))
cancer_type_sample_map = dict(zip(joined_dataset3.samples['improve_sample_id'], joined_dataset3.samples['cancer_type']))

study_sample_map = dict(zip(joined_dataset3.transcriptomics['improve_sample_id'], joined_dataset3.transcriptomics['study']))



joined_dataset3.reformat_dataset("transcriptomics","wide")

transcriptomics successfully converted to wide format


jd3_sample_col = joined_dataset3.transcriptomics.iloc[:, 0].to_frame()
jd3_sample_col['model_type'] = jd3_sample_col['improve_sample_id'].map(model_type_sample_map)
jd3_sample_col['model_type'] = jd3_sample_col['model_type'].map(model_type_dict)
jd3_sample_col['study'] = jd3_sample_col['improve_sample_id'].map(study_sample_map)
jd3_sample_col['cancer_type'] = jd3_sample_col['improve_sample_id'].map(cancer_type_sample_map)


joined_dataset3.transcriptomics


joined_dataset3.transcriptomics = joined_dataset3.transcriptomics.drop(joined_dataset3.transcriptomics.columns[:1], axis=1)


joined_dataset3.transcriptomics


# Impute missing values with each gene's median. Assigning the result back avoids
# chained in-place fills, which newer pandas versions warn about or silently ignore.
joined_dataset3.transcriptomics = joined_dataset3.transcriptomics.fillna(joined_dataset3.transcriptomics.median())
joined_dataset3.transcriptomics = joined_dataset3.transcriptomics.dropna(axis='columns', how='all')
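
To see what the imputation step does, here is a minimal sketch on a toy table. The DataFrame `toy` and the gene names `g1` to `g3` are hypothetical and not part of the coderdata datasets. It shows that a column median fills ordinary gaps, while a gene with no measurements at all has no median, stays NaN, and is removed by the final `dropna`.

```python
import numpy as np
import pandas as pd

# Hypothetical wide-format table: rows are samples, columns are genes.
toy = pd.DataFrame({
    "g1": [1.0, np.nan, 3.0],
    "g2": [np.nan, np.nan, np.nan],  # never measured in any sample
    "g3": [4.0, 6.0, np.nan],
})

# Fill each gap with its column median; an all-NaN column has no median and stays NaN.
imputed = toy.fillna(toy.median())

# Drop the genes that are still entirely missing.
imputed = imputed.dropna(axis="columns", how="all")
print(imputed)
```

Dropping the all-NaN columns matters because `StandardScaler` and UMAP cannot work with missing values.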


reducer = umap.UMAP()
t_full_data = joined_dataset3.transcriptomics.values
scaled_t_full_data = StandardScaler().fit_transform(t_full_data)
embedding_t_full_data = reducer.fit_transform(scaled_t_full_data)
embedding_t_full_data.shape

(3913, 2)


cell_line_types_df = pd.read_csv('cellLineTypes.csv')
mapping_dict = {}

# Iterate through each row in the DataFrame
for _, row in cell_line_types_df.iterrows():
    # Find the first non-null value in the row to use as the mapping target
    target_value = row.dropna().iloc[0] if not row.dropna().empty else None
    
    if target_value:
        # Iterate over all values in the row
        for value in row:
            # Check if the value is not null and not already the target value
            if pd.notnull(value) and value != target_value:
                # Map this value to the target_value
                mapping_dict[value] = target_value
jd3_sample_col['cancer_type'] = jd3_sample_col['cancer_type'].map(mapping_dict).fillna(jd3_sample_col['cancer_type'])
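
The harmonization logic above can be tried on a small in-memory table. This is only a sketch: the `synonyms` DataFrame is a hypothetical stand-in for `cellLineTypes.csv`, whose actual contents are not shown here. It assumes, as the loop above does, that the first non-null value in each row is the canonical name and the remaining values are aliases.

```python
import pandas as pd

# Hypothetical stand-in for cellLineTypes.csv: canonical name first, aliases after it.
synonyms = pd.DataFrame([
    ["Acute Myeloid Leukemia", "Acute myeloid leukemia", "Acute Myeloid Leukaemia"],
    ["Glioblastoma", "Glioblastoma multiforme", None],
])

mapping = {}
for _, row in synonyms.iterrows():
    names = row.dropna()
    if names.empty:
        continue
    canonical = names.iloc[0]
    # Every other spelling in the row points at the canonical name.
    for name in names.iloc[1:]:
        if name != canonical:
            mapping[name] = canonical

cancer_type = pd.Series(["Acute Myeloid Leukaemia", "Glioblastoma multiforme", "Mesothelioma"])
# Names without an entry in the mapping keep their original value.
harmonized = cancer_type.map(mapping).fillna(cancer_type)
print(harmonized.tolist())
```

The trailing `fillna` is what keeps unmapped cancer types (here `Mesothelioma`) from turning into NaN.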


# Plot the UMAP by model type. Samples with an unknown model type are hidden via alpha=0.

legend_handles = [
    mpatches.Patch(color=sns.color_palette()[0], label='Tumor'),  
    mpatches.Patch(color=sns.color_palette()[1], label='Organoid'),
    mpatches.Patch(color=sns.color_palette()[2], label='Cell Line')
    # mpatches.Patch(color=sns.color_palette()[3], label='Other')  # Uncomment this to include the Other model type legend label.
]

# This is used to hide the Unknown model types.
alphas = [0 if x == 3 else 1 for x in jd3_sample_col.model_type.map({"tumor": 0, "organoid": 1, "cell line": 2, np.nan: 3})]

plt.scatter(
    embedding_t_full_data[:, 0],
    embedding_t_full_data[:, 1],
    c=[sns.color_palette()[x] for x in jd3_sample_col.model_type.map({"tumor": 0, "organoid": 1, "cell line": 2, np.nan: 3})],
    alpha=alphas,  # Apply the alpha values here
    s=3
)
plt.gca().set_aspect('equal', 'datalim')

plt.legend(handles=legend_handles, title='Model Type')
plt.title('All Datasets, Transcriptomics UMAP by Model Type', fontsize=12)

Text(0.5, 1.0, 'All Datasets, Transcriptomics UMAP by Model Type')


jd3_sample_col


unique_studies = jd3_sample_col['study'].dropna().unique()
palette = sns.color_palette("Set2", len(unique_studies))
study_to_color = {study: color for study, color in zip(unique_studies, palette)}

# Prepare the colors for each point
colors = jd3_sample_col['study'].map(study_to_color).fillna('black')

plt.scatter(
    embedding_t_full_data[:, 0],
    embedding_t_full_data[:, 1],
    c=colors,
    alpha=1,  # Adjust based on your preference for visibility
    s=3  # Adjust size as needed
)
plt.gca().set_aspect('equal', 'datalim')

# Create legend handles manually
legend_handles = [plt.Line2D([0], [0], marker='o', color='w', label=study,
                             markerfacecolor=color, markersize=10) for study, color in study_to_color.items()]

plt.legend(handles=legend_handles, title='Study',prop={'size': 6}, title_fontsize=8)
plt.title('All Datasets, Transcriptomics UMAP by Study', fontsize=12)
plt.show()


def interlace_lists(*lists):
    """Interlace items from multiple lists in an alternating fashion."""
    max_length = max(len(lst) for lst in lists)
    interlaced = []
    for i in range(max_length):
        for lst in lists:
            if i < len(lst):
                interlaced.append(lst[i])
    return interlaced
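
A short usage sketch of `interlace_lists`, with made-up ranked lists (`overall` and `organoid` are hypothetical names). Interlacing followed by de-duplication keeps the top-ranked entry of every list near the front, so each grouping's most common cancer types receive colors first. The function is repeated so the example runs on its own, and `dict.fromkeys` stands in here for the `pd.Series(...).unique()` call used in the notebook; both keep first-seen order.

```python
def interlace_lists(*lists):
    """Interlace items from multiple lists in an alternating fashion."""
    max_length = max(len(lst) for lst in lists)
    interlaced = []
    for i in range(max_length):
        for lst in lists:
            if i < len(lst):
                interlaced.append(lst[i])
    return interlaced

# Hypothetical ranked lists, most frequent first.
overall = ["AML", "Lung", "Colon"]
organoid = ["Colon", "Pancreas"]

merged = interlace_lists(overall, organoid)
# dict.fromkeys drops repeats while keeping first-seen order.
unique_ranked = list(dict.fromkeys(merged))

print(merged)         # ['AML', 'Colon', 'Lung', 'Pancreas', 'Colon']
print(unique_ranked)  # ['AML', 'Colon', 'Lung', 'Pancreas']
```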


top_10_organoid = jd3_sample_col[jd3_sample_col.model_type == "organoid"].cancer_type.value_counts().head(10).index
top_10_cell_line = jd3_sample_col[jd3_sample_col.model_type == "cell line"].cancer_type.value_counts().head(10).index
top_10_tumor = jd3_sample_col[jd3_sample_col.model_type == "tumor"].cancer_type.value_counts().head(10).index
top_10_cancer_type = jd3_sample_col.cancer_type.value_counts().head(10).index

top_10_organoid_series = pd.Series(top_10_organoid)
top_10_cell_line_series = pd.Series(top_10_cell_line)
top_10_tumor_series = pd.Series(top_10_tumor)
top_10_cancer_type_series = pd.Series(top_10_cancer_type)
# Create a unified list of unique top cancer types across all groupings
all_top_cancer_types = interlace_lists(top_10_cancer_type,top_10_organoid, top_10_cell_line, top_10_tumor)
all_top_cancer_types = pd.Series(all_top_cancer_types).unique()

# Generate a color mapping for these cancer types
colors = sns.color_palette("hsv", len(all_top_cancer_types))
color_map = {cancer_type: color for cancer_type, color in zip(all_top_cancer_types, colors)}
# Manually set "Other" to grey
color_map['Other'] = (0.5, 0.5, 0.5)


top_10_types = jd3_sample_col.cancer_type.value_counts().head(10).index

# Label each sample with its cancer type if that type is in the overall top 10, otherwise 'Other'
jd3_sample_col['full_color_group'] = jd3_sample_col.apply(
    lambda row: row.cancer_type if row.cancer_type in top_10_types else 'Other', axis=1
)

jd3_sample_col['color'] = jd3_sample_col['full_color_group'].map(color_map)

jd3_sample_col['alpha'] = jd3_sample_col['full_color_group'].apply(lambda x: 0.25 if x == 'Other' else 1.0)
legend_handles = [mpatches.Patch(color=color_map[group], label=group) for group in list(top_10_types) + ['Other']]

# Plotting
plt.scatter(
    embedding_t_full_data[:, 0],
    embedding_t_full_data[:, 1],
    c=jd3_sample_col['color'],  # Use the mapped colors
    alpha=jd3_sample_col['alpha'],  # Apply the alpha values here
    s=1.5
)

plt.legend(handles=legend_handles, title='Cancer Type', prop={'size': 6}, title_fontsize=8)
plt.gca().set_aspect('equal', 'datalim')
plt.title('All Datasets, Transcriptomics UMAP by Cancer Type', fontsize=12)
plt.show()


# Plot a UMAP embedding for one model type, colored by cancer type with the unified color map.
# Note: this adds or overwrites columns on jd3_sample_col ('<model_type>_color_group' and 'color').
def plot_umap_by_model_type(model_type, embedding, color_map, jd3_sample_col, num):
    # Determine the `num` most common cancer types for the current model type
    top_x_types = jd3_sample_col[jd3_sample_col.model_type == model_type].cancer_type.value_counts().head(num).index

    # Group every sample as one of those top cancer types or 'Other'
    jd3_sample_col[f'{model_type}_color_group'] = jd3_sample_col.apply(
        lambda row: row.cancer_type if row.cancer_type in top_x_types else 'Other', axis=1
    )

    # Map the color_group column to actual colors using the unified color map
    jd3_sample_col['color'] = jd3_sample_col[f'{model_type}_color_group'].map(color_map)

    # Filtering rows for the current model type
    filtered_rows = jd3_sample_col[jd3_sample_col.model_type == model_type]

    # Plotting
    plt.scatter(
        embedding[:, 0],
        embedding[:, 1],
        c=filtered_rows['color'],
        alpha=[0.1 if x == 'Other' else 1 for x in filtered_rows[f'{model_type}_color_group']],  # One-liner for conditional alpha
        s=12 
    )

    # Adjusting legend to reflect new grouping with unified color scheme
    legend_handles = [mpatches.Patch(color=color_map[group], label=group) for group in list(top_x_types) + ['Other']]
    plt.gca().set_aspect('equal', 'datalim')
    plt.legend(handles=legend_handles, title='Cancer Type', prop={'size': 5}, title_fontsize=8)
    plt.title(f'All Datasets, {model_type.capitalize()}, Transcriptomics UMAP by Cancer Type', fontsize=12)
    plt.show()


reducer = umap.UMAP()
t_full_organoid_data = joined_dataset3.transcriptomics[jd3_sample_col.model_type == "organoid"].values
scaled_t_full_organoid_data = StandardScaler().fit_transform(t_full_organoid_data)
embedding_t_full_organoid_data = reducer.fit_transform(scaled_t_full_organoid_data)
embedding_t_full_organoid_data.shape

(192, 2)


plot_umap_by_model_type("organoid", embedding_t_full_organoid_data, color_map, jd3_sample_col, 8)


reducer = umap.UMAP()
t_full_cell_line_data = joined_dataset3.transcriptomics[jd3_sample_col.model_type == "cell line"].values
scaled_t_full_cell_line_data = StandardScaler().fit_transform(t_full_cell_line_data)
embedding_t_full_cell_line_data = reducer.fit_transform(scaled_t_full_cell_line_data)
embedding_t_full_cell_line_data.shape

(1761, 2)


plot_umap_by_model_type("cell line", embedding_t_full_cell_line_data, color_map, jd3_sample_col,10)


reducer = umap.UMAP()
t_full_tumor_data = joined_dataset3.transcriptomics[jd3_sample_col.model_type == "tumor"].values
scaled_t_full_tumor_data = StandardScaler().fit_transform(t_full_tumor_data)
embedding_t_full_tumor_data = reducer.fit_transform(scaled_t_full_tumor_data)
embedding_t_full_tumor_data.shape

(1960, 2)


plot_umap_by_model_type("tumor", embedding_t_full_tumor_data, color_map, jd3_sample_col,10)


hcmi.reformat_dataset("transcriptomics","wide")
hcmi_sample_col = hcmi.transcriptomics.iloc[:, 0].to_frame()
hcmi_sample_col['model_type'] = hcmi_sample_col['improve_sample_id'].map(model_type_sample_map)
hcmi_sample_col['model_type'] = hcmi_sample_col['model_type'].map(model_type_dict)
hcmi.transcriptomics = hcmi.transcriptomics.drop(hcmi.transcriptomics.columns[:1], axis=1)
# Impute missing values with each gene's median
hcmi.transcriptomics = hcmi.transcriptomics.fillna(hcmi.transcriptomics.median())
reducer = umap.UMAP()
t_hcmi_data = hcmi.transcriptomics.values
scaled_t_hcmi_data = StandardScaler().fit_transform(t_hcmi_data)
embedding_t_hcmi_data = reducer.fit_transform(scaled_t_hcmi_data)
embedding_t_hcmi_data.shape

transcriptomics successfully converted to wide format

(396, 2)


legend_handles = [
    mpatches.Patch(color=sns.color_palette()[0], label='Tumor'),  
    mpatches.Patch(color=sns.color_palette()[1], label='Organoid'),
    mpatches.Patch(color=sns.color_palette()[2], label='Cell Line')
]
alphas = [0 if x == 3 else 1 for x in hcmi_sample_col.model_type.map({"tumor": 0, "organoid": 1, "cell line": 2, np.nan: 3})]
plt.scatter(
    embedding_t_hcmi_data[:, 0],
    embedding_t_hcmi_data[:, 1],
    c=[sns.color_palette()[x] for x in hcmi_sample_col.model_type.map({"tumor": 0, "organoid": 1, "cell line": 2, np.nan: 3})],
    alpha=alphas,  # Apply the alpha values here
    s=12
)
plt.gca().set_aspect('equal', 'datalim')
plt.legend(handles=legend_handles, title='Model Type')
plt.title('HCMI, Transcriptomics UMAP by Model Type', fontsize=12)

Text(0.5, 1.0, 'HCMI, Transcriptomics UMAP by Model Type')


beataml.reformat_dataset("transcriptomics","wide")
beataml_sample_col = beataml.transcriptomics.iloc[:, 0].to_frame()
beataml_sample_col['common_name'] = beataml_sample_col['improve_sample_id'].map(common_name_sample_map)
# beataml_sample_col['model_type'] = beataml_sample_col['model_type'].map(model_type_dict)
beataml.transcriptomics = beataml.transcriptomics.drop(beataml.transcriptomics.columns[:1], axis=1)
# Impute missing values with each gene's median
beataml.transcriptomics = beataml.transcriptomics.fillna(beataml.transcriptomics.median())
reducer = umap.UMAP()
t_beataml_data = beataml.transcriptomics.values
scaled_t_beataml_data = StandardScaler().fit_transform(t_beataml_data)
embedding_t_beataml_data = reducer.fit_transform(scaled_t_beataml_data)
embedding_t_beataml_data.shape

transcriptomics successfully converted to wide format

(707, 2)


# A single ordered list drives both the point colors and the legend, so labels cannot drift from the points.
sample_types = [
    'Peripheral Blood',
    'Leukapheresis',
    'Bone Marrow Aspirate',
    'Healthy pooled CD34+',
    'Healthy Individual BM MNC',
    'Healthy Individual CD34+',
]
sample_type_index = {name: i for i, name in enumerate(sample_types)}
palette = sns.color_palette()

# Here we hide Healthy Individual CD34+ because there is only 1 sample.
hidden_type = 'Healthy Individual CD34+'
legend_handles = [
    mpatches.Patch(color=palette[i], label=name)
    for name, i in sample_type_index.items() if name != hidden_type
]
alphas = [0 if name == hidden_type else 1 for name in beataml_sample_col.common_name]

plt.scatter(
    embedding_t_beataml_data[:, 0],
    embedding_t_beataml_data[:, 1],
    c=[palette[i] for i in beataml_sample_col.common_name.map(sample_type_index)],
    s=12,
    alpha=alphas
)
plt.gca().set_aspect('equal', 'datalim')
plt.legend(handles=legend_handles, title='Sample Type', fontsize='small', title_fontsize='small')

plt.title('BeatAML, Transcriptomics UMAP by Sample Type', fontsize=12)

Text(0.5, 1.0, 'BeatAML, Transcriptomics UMAP by Sample Type')


cptac.reformat_dataset("transcriptomics","wide")
cptac_sample_col = cptac.transcriptomics.iloc[:, 0].to_frame()
cptac_sample_col['cancer_type'] = cptac_sample_col['improve_sample_id'].map(cancer_type_sample_map)
# cptac_sample_col['model_type'] = cptac_sample_col['model_type'].map(model_type_dict)
cptac.transcriptomics = cptac.transcriptomics.drop(cptac.transcriptomics.columns[:1], axis=1)
# Impute missing values with each gene's median
cptac.transcriptomics = cptac.transcriptomics.fillna(cptac.transcriptomics.median())
reducer = umap.UMAP()
t_cptac_data = cptac.transcriptomics.values
scaled_t_cptac_data = StandardScaler().fit_transform(t_cptac_data)
embedding_t_cptac_data = reducer.fit_transform(scaled_t_cptac_data)
embedding_t_cptac_data.shape

transcriptomics successfully converted to wide format

(1113, 2)


cancer_types = cptac.samples.cancer_type.unique()
colors = sns.color_palette(n_colors=len(cancer_types))

# Create legend handles dynamically
legend_handles = [mpatches.Patch(color=colors[i], label=label) for i, label in enumerate(cancer_types)]

color_mapping = {cancer_type: color for cancer_type, color in zip(cancer_types, colors)}

cptac_colors = cptac_sample_col.cancer_type.map(color_mapping).tolist()

plt.scatter(
    embedding_t_cptac_data[:, 0],
    embedding_t_cptac_data[:, 1],
    c=cptac_colors,
    s=12,
)

plt.gca().set_aspect('equal', 'datalim')
plt.legend(handles=legend_handles, title='Cancer Type', prop={'size': 6}, title_fontsize=8)
plt.title('CPTAC, Transcriptomics UMAP by Cancer Type', fontsize=12)

Text(0.5, 1.0, 'CPTAC, Transcriptomics UMAP by Cancer Type')


# Assuming depmap is already loaded and initialized
depmap.reformat_dataset("transcriptomics", "wide")

# Creating a mapping for cancer types
depmap_sample_col = depmap.transcriptomics.iloc[:, 0].to_frame()
depmap_sample_col['cancer_type'] = depmap_sample_col['improve_sample_id'].map(cancer_type_sample_map)

# Identify the top 10 most prevalent cancer types
top_10_cancer_types = depmap_sample_col['cancer_type'].value_counts().head(10).index.tolist()

# Filter the dataset for only the top 10 cancer types
depmap_sample_col = depmap_sample_col[depmap_sample_col['cancer_type'].isin(top_10_cancer_types)]
# Keep only the selected samples and drop the ID column so it is not treated as a gene feature
filtered_transcriptomics = depmap.transcriptomics.loc[depmap_sample_col.index].drop(columns='improve_sample_id')

# Fill missing values with the median of each column
filtered_transcriptomics = filtered_transcriptomics.fillna(filtered_transcriptomics.median())

# Drop genes that have no values at all
filtered_transcriptomics = filtered_transcriptomics.dropna(axis=1, how='all')
    
    
# UMAP Analysis
reducer = umap.UMAP()
t_depmap_data = filtered_transcriptomics.values
scaled_t_depmap_data = StandardScaler().fit_transform(t_depmap_data)
embedding_t_depmap_data = reducer.fit_transform(scaled_t_depmap_data)

# Visualization
cancer_types = top_10_cancer_types
colors = sns.color_palette(n_colors=len(cancer_types))

legend_handles = [mpatches.Patch(color=colors[i], label=label) for i, label in enumerate(cancer_types)]
color_mapping = {cancer_type: color for cancer_type, color in zip(cancer_types, colors)}

depmap_colors = depmap_sample_col['cancer_type'].map(color_mapping).tolist()

plt.scatter(
    embedding_t_depmap_data[:, 0],
    embedding_t_depmap_data[:, 1],
    c=depmap_colors,
    s=12,
)

plt.gca().set_aspect('equal', 'datalim')
plt.legend(handles=legend_handles, title='Cancer Type', prop={'size': 6}, title_fontsize=8)
plt.title('DepMap, Transcriptomics UMAP by Cancer Type', fontsize=12)
plt.show()

transcriptomics successfully converted to wide format

	other_id	improve_sample_id	other_names	common_name	cancer_type	model_type	other_id_source	species
0	11-00261	4102	Acute myelomonocytic leukaemia	Peripheral Blood	Acute Myeloid Leukaemia	ex vivo	beatAML	NaN
1	11-00503	4103	AML with mutated NPM1	Bone Marrow Aspirate	Acute Myeloid Leukaemia	ex vivo	beatAML	NaN
2	11-00475	4104	AML with mutated NPM1	Bone Marrow Aspirate	Acute Myeloid Leukaemia	ex vivo	beatAML	NaN
3	13-00047	4105	Mixed phenotype acute leukaemia, T/myeloid, NOS	Peripheral Blood	Acute Myeloid Leukaemia	ex vivo	beatAML	NaN
4	12-00032	4106	Chronic myelomonocytic leukaemia	Peripheral Blood	Acute Myeloid Leukaemia	ex vivo	beatAML	NaN
...	...	...	...	...	...	...	...	...
45	WU-487 Tumor	5169	NaN	WU-487	Malignant peripheral nerve sheath tumor	Tumor	NF Data Portal	Human
46	WU-505 Tumor	5170	NaN	WU-505	Malignant peripheral nerve sheath tumor	Tumor	NF Data Portal	Human
47	WU-536 Tumor	5171	NaN	WU-536	Malignant peripheral nerve sheath tumor	Tumor	NF Data Portal	Human
48	WU-545 Tumor	5172	NaN	WU-545	Malignant peripheral nerve sheath tumor	Tumor	NF Data Portal	Human
49	WU-561 Tumor	5173	NaN	WU-561	Malignant peripheral nerve sheath tumor	Tumor	NF Data Portal	Human

	improve_sample_id	transcriptomics	entrez_id	source	study
0	5087	1.523670	7105.0	synapse	BeatAML
1	5087	1.523670	7105.0	synapse	BeatAML
2	5087	7.107711	8813.0	synapse	BeatAML
3	5087	7.107711	8813.0	synapse	BeatAML
4	5087	3.362605	6359.0	synapse	BeatAML
...	...	...	...	...	...
44552582	3188	11.940000	23140.0	bcm	CPTAC3
44552583	3189	12.970000	23140.0	bcm	CPTAC3
44552584	3190	11.860000	23140.0	bcm	CPTAC3
44552585	3191	11.620000	23140.0	bcm	CPTAC3
44552586	3192	12.040000	23140.0	bcm	CPTAC3

entrez_id	improve_sample_id	1.0	2.0	3.0	9.0	10.0	11.0	12.0	13.0	14.0	...	118097967.0	118126072.0	118142757.0	118568804.0	122394733.0	122405565.0	124905743.0	124906461.0	125316803.0	125505920.0
0	1	0.377908	0.214517	0.07	10.884808	1.004783	0.0	0.374231	5.125926	58.706594	...	0.0	4.250000	0.128292	0.0	0.540773	2.314544	0.000000	0.070195	21.933752	0.13
1	2	2.016174	0.178161	0.02	4.039268	1.309193	0.0	0.247848	0.158752	71.169380	...	0.0	1.910000	0.017178	0.0	0.098161	6.066530	0.000000	0.007178	10.166839	0.11
2	3	0.927081	20.780606	0.00	4.297862	0.034285	0.0	5.436207	0.000000	41.411760	...	0.0	3.530000	0.000000	0.0	2.387944	27.548022	0.000000	0.000000	23.857035	0.75
3	4	0.068752	0.497908	0.00	14.466601	0.090195	0.0	0.329962	0.665773	89.898180	...	0.0	3.010000	0.007178	0.0	1.978817	34.059067	0.000000	0.000000	31.096575	0.07
4	5	2.837025	0.520713	0.00	6.530024	0.022178	0.0	0.021322	0.056322	57.371410	...	0.0	23.490000	0.022178	0.0	2.944486	18.276848	0.000000	0.022178	31.300927	0.03
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
3908	5118	-2.948888	-0.509818	NaN	2.541549	NaN	NaN	NaN	NaN	6.503961	...	NaN	1.964441	NaN	NaN	-0.589435	5.347260	3.780014	NaN	6.300062	NaN
3909	5119	-2.671581	-0.051962	NaN	2.107456	NaN	NaN	NaN	NaN	6.267933	...	NaN	3.081971	NaN	NaN	-1.789189	5.714423	4.520107	NaN	6.159214	NaN
3910	5120	-3.720447	0.918365	NaN	2.160300	NaN	NaN	NaN	NaN	6.317204	...	NaN	3.347740	NaN	NaN	-1.689536	5.734799	5.603482	NaN	6.171670	NaN
3911	5121	-2.254779	1.955952	NaN	2.230090	NaN	NaN	NaN	NaN	6.954112	...	NaN	4.166740	NaN	NaN	1.183738	6.455522	-0.976484	NaN	6.394535	NaN
3912	5122	-3.144297	-1.273558	NaN	2.163439	NaN	NaN	NaN	NaN	6.285517	...	NaN	3.991816	NaN	NaN	-0.286764	5.575612	3.880383	NaN	6.172091	NaN

entrez_id	1.0	2.0	3.0	9.0	10.0	11.0	12.0	13.0	14.0	15.0	...	118097967.0	118126072.0	118142757.0	118568804.0	122394733.0	122405565.0	124905743.0	124906461.0	125316803.0	125505920.0
0	0.377908	0.214517	0.07	10.884808	1.004783	0.0	0.374231	5.125926	58.706594	0.178292	...	0.0	4.250000	0.128292	0.0	0.540773	2.314544	0.000000	0.070195	21.933752	0.13
1	2.016174	0.178161	0.02	4.039268	1.309193	0.0	0.247848	0.158752	71.169380	0.773686	...	0.0	1.910000	0.017178	0.0	0.098161	6.066530	0.000000	0.007178	10.166839	0.11
2	0.927081	20.780606	0.00	4.297862	0.034285	0.0	5.436207	0.000000	41.411760	0.192164	...	0.0	3.530000	0.000000	0.0	2.387944	27.548022	0.000000	0.000000	23.857035	0.75
3	0.068752	0.497908	0.00	14.466601	0.090195	0.0	0.329962	0.665773	89.898180	0.366517	...	0.0	3.010000	0.007178	0.0	1.978817	34.059067	0.000000	0.000000	31.096575	0.07
4	2.837025	0.520713	0.00	6.530024	0.022178	0.0	0.021322	0.056322	57.371410	0.091322	...	0.0	23.490000	0.022178	0.0	2.944486	18.276848	0.000000	0.022178	31.300927	0.03
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
3908	-2.948888	-0.509818	NaN	2.541549	NaN	NaN	NaN	NaN	6.503961	NaN	...	NaN	1.964441	NaN	NaN	-0.589435	5.347260	3.780014	NaN	6.300062	NaN
3909	-2.671581	-0.051962	NaN	2.107456	NaN	NaN	NaN	NaN	6.267933	NaN	...	NaN	3.081971	NaN	NaN	-1.789189	5.714423	4.520107	NaN	6.159214	NaN
3910	-3.720447	0.918365	NaN	2.160300	NaN	NaN	NaN	NaN	6.317204	NaN	...	NaN	3.347740	NaN	NaN	-1.689536	5.734799	5.603482	NaN	6.171670	NaN
3911	-2.254779	1.955952	NaN	2.230090	NaN	NaN	NaN	NaN	6.954112	NaN	...	NaN	4.166740	NaN	NaN	1.183738	6.455522	-0.976484	NaN	6.394535	NaN
3912	-3.144297	-1.273558	NaN	2.163439	NaN	NaN	NaN	NaN	6.285517	NaN	...	NaN	3.991816	NaN	NaN	-0.286764	5.575612	3.880383	NaN	6.172091	NaN

	improve_sample_id	model_type	study	cancer_type
0	1	cell line	Sanger & Broad Cell Lines RNASeq	Pancreatic Carcinoma
1	2	cell line	Sanger & Broad Cell Lines RNASeq	Colorectal Carcinoma
2	3	cell line	Sanger & Broad Cell Lines RNASeq	Glioblastoma multiforme
3	4	cell line	Sanger & Broad Cell Lines RNASeq	Mesothelioma
4	5	cell line	Sanger & Broad Cell Lines RNASeq	B-Lymphoblastic Leukemia
...	...	...	...	...
3908	5118	tumor	BeatAML	Acute myeloid leukemia
3909	5119	tumor	BeatAML	Acute myeloid leukemia
3910	5120	tumor	BeatAML	Acute myeloid leukemia
3911	5121	tumor	BeatAML	Acute myeloid leukemia
3912	5122	tumor	BeatAML	Acute myeloid leukemia

Welcome to the Data Exploration UMAP Tutorial

Import Packages

Load all Datasets

Initialize Mapping Dictionaries

Reformat transcriptomics Data

Store improve_sample_id to a Separate Dataframe

Format the transcriptomics Data into Wide Format for the UMAP

Impute transcriptomics Data

Run UMAP Functions

Add in Cancer Mapping Types

Plot All Datasets UMAP by Model Types

Map Colors to Cancer Types for the Next Four Plots

Plot All Datasets UMAP by Cancer Types

Plot All Datasets UMAP Within Organoid Model Type

Plot All Datasets UMAP Within Cell Line Model Type

Plot All Datasets UMAP Within Tumor Model Type

Run UMAP for HCMI on Model Types

Run UMAP for BeatAML on Sample Types

Run UMAP for CPTAC on Cancer Types

Run UMAP for DepMap/Sanger on Top 10 Cancer Types

Good luck creating your own UMAPs!