Note: This tutorial was originally written as a Jupyter Notebook (ipynb). If you would like to run or edit the notebook, it can be found here.

Now for the fun bit, wrangling the data and plotting our results! Below, we do this using pandas, matplotlib and statsmodels, and some convenience functions from clonedetective.clone_analysis.

Recap - background and experimental aims

Our original demo dataset consisted of 67 images of the Drosophila intestine from a MARCM genetic cell lineage experiment. GFP (C1 channel) marks genetic recombination events, and so clusters of GFP positive cells define "clones" of daughter cells that over time have grown out from a single parent cell. The remaining images include DAPI (C0 channel), which ubiquitously marks all cell nuclei; PDM1 (C2 channel), which labels EnteroCyte (EC) cells; and Prospero (C3 channel), which labels EnteroEndocrine cells.

We aimed to investigate:
1) Does our gene of interest regulate cell proliferation?
2) Does our gene of interest control cell differentiation?
3) Does our gene of interest affect local tissue cell arrangement and structure?

We will tackle each of these experimental aims using data previously generated by clonedetective.

Loading data and updating column metadata

Reading in our previous results:

import pandas as pd

df = pd.read_csv("data/example_results.csv")

df.head()

To ease downstream wrangling and plotting, we next update some of the column metadata.

For example, when plotting, it will be helpful to label whether a cell is inside a clone (IC) or outside a clone (OC). This information is already present within the df['GFP_clonepos'] column that states if a cell is GFP positive. We, therefore, recode this column and rename it to clone_status.

import numpy as np

df["GFP_clonepos"] = np.where(df["GFP_clonepos"], "IC", "OC")

df = df.rename(columns={"GFP_clonepos": "clone_status"})
df.head()
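
To see the recoding step in isolation, here is a minimal sketch on a made-up three-row table. The toy values are assumptions for illustration only; the column names follow the tutorial.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real results table (values are made up).
toy = pd.DataFrame({"label": [1, 2, 3], "GFP_clonepos": [True, False, True]})

# True -> "IC" (inside clone), False -> "OC" (outside clone).
toy["GFP_clonepos"] = np.where(toy["GFP_clonepos"], "IC", "OC")

# Give the recoded column a name that describes its new content.
toy = toy.rename(columns={"GFP_clonepos": "clone_status"})

print(toy)
```

np.where evaluates the boolean column once and picks one of the two labels per row, so no explicit loop is needed.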

We can also extract metadata stored within the image names. We know that images are named 'a1' if from a wild-type (ctrl) animal, and 'a2' if from a mutant (mut) animal. Using this information, we can add a 'genotype' column.

df["genotype"] = np.where(df["int_img"].str.match(r"a1"), "ctrl", "mut")

df.head()
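
As a quick illustration of how the pattern is applied, the sketch below uses a few hypothetical image names (only 'a1g01p1' appears in the real data shown in this tutorial). Note that str.match only tests the start of each string, so an 'a1' later in a name would not count.

```python
import numpy as np
import pandas as pd

# Hypothetical image names; "xa1g01p1" is included to show the anchoring.
toy = pd.DataFrame({"int_img": ["a1g01p1", "a2g03p2", "a1g10p1", "xa1g01p1"]})

# str.match anchors the pattern at the start of the string.
is_ctrl = toy["int_img"].str.match(r"a1")
toy["genotype"] = np.where(is_ctrl, "ctrl", "mut")

print(toy)
```

Anything that does not start with 'a1' is labelled 'mut' here, so this approach assumes every image follows the naming scheme.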

Cell Proliferation - number of cells per clone

Our first aim is to determine whether there is a difference in the number of cells per clone in mutant vs control animals. df currently contains four rows for each cell, one for each of the four intensity image channels. Since we are presently interested in cell types and numbers, rather than the intensity values of individual cells, we can query df for a single intensity image channel. This ensures that count_df contains only a single row for each cell.

For simplicity, we also filter out columns not required for counting the number and type of cells per clone.

count_df = df.query('int_img_ch == "C1"').filter(
    regex=r"clone$|pos$|int_img|clone_status|genotype"
)
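
To check what the query and the regular expression keep, here is a small sketch on a made-up table with two cells measured in two channels. The values are assumptions; the column names mirror the tutorial's.

```python
import pandas as pd

# Toy table: two cells, each measured in two channels (values are made up).
toy = pd.DataFrame(
    {
        "int_img": ["a1g01p1"] * 4,
        "int_img_ch": ["C0", "C1", "C0", "C1"],
        "label": [11, 11, 12, 12],
        "GFP_clone": [0, 0, 1, 1],
        "clone_status": ["OC", "OC", "IC", "IC"],
        "genotype": ["ctrl"] * 4,
        "EC_pos": [True, True, False, False],
        "EE_pos": [False, False, True, True],
        "area_um2": [73.3, 73.3, 98.7, 98.7],
    }
)

# Keep one row per cell, then keep only columns whose names match the regex.
kept = toy.query('int_img_ch == "C1"').filter(
    regex=r"clone$|pos$|int_img|clone_status|genotype"
)

print(list(kept.columns))
print(len(kept), "rows")
```

The regex is searched anywhere in each column name, so 'int_img' also retains 'int_img_ch', while 'label' and 'area_um2' are dropped.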

To help with later calculations, we add a 'total' column:

count_df["total"] = count_df.filter(regex="_pos").sum(axis=1)

And quickly simplify the cell type column names:

import re

count_df.columns = [re.sub("_pos", "", col) for col in count_df.columns]
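
The unanchored pattern works for the column names used here. As a slightly more defensive variant, the sketch below anchors the pattern to the end of the name; the 'x_position' column is hypothetical and only shows what could go wrong without the anchor.

```python
import re

cols = ["int_img", "GFP_clone", "EC_pos", "ISCorEB_pos", "total"]

# "$" restricts the match to a trailing "_pos".
renamed = [re.sub(r"_pos$", "", col) for col in cols]
print(renamed)

# Hypothetical column name: the unanchored pattern removes "_pos" mid-name.
unanchored = re.sub("_pos", "", "x_position")
anchored = re.sub(r"_pos$", "", "x_position")
print(unanchored, anchored)
```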

Next, we group by image and clone, and sum to get the number of cells of each type per clone.

count_df = count_df.groupby(["int_img", "GFP_clone"], as_index=False).agg(
    {
        "clone_status": "first",
        "genotype": "first",
        "EC": "sum",
        "EE": "sum",
        "preEC": "sum",
        "ECEE": "sum",
        "ISCorEB": "sum",
        "total": "sum",
    }
)
count_df.head()
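
The same aggregation pattern on a made-up five-cell table may make the mechanics clearer. This is only a sketch: the values are invented and only two cell types are included.

```python
import pandas as pd

# One row per cell; boolean columns flag the cell type (values are made up).
cells = pd.DataFrame(
    {
        "int_img": ["a1g01p1"] * 3 + ["a2g01p1"] * 2,
        "GFP_clone": [0, 1, 1, 1, 1],
        "genotype": ["ctrl"] * 3 + ["mut"] * 2,
        "EC": [True, True, False, False, True],
        "EE": [False, False, True, False, False],
        "total": [1, 1, 1, 1, 1],
    }
)

# Metadata is constant within a clone, so "first" is enough;
# summing the boolean flags counts the cells of each type.
per_clone = cells.groupby(["int_img", "GFP_clone"], as_index=False).agg(
    {"genotype": "first", "EC": "sum", "EE": "sum", "total": "sum"}
)

print(per_clone)
```

Grouping on both int_img and GFP_clone matters because clone labels restart in every image.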

We then group by the genotype column only and generate a stacked bar plot displaying cell numbers and types per clone. Here, we are only interested in plotting cells inside clones, and so we first query on 'clone_status':

count_df_inside = count_df.query('clone_status == "IC"')

inside_clone_mean = (
    count_df_inside[["genotype", "ISCorEB", "preEC", "EC", "EE", "ECEE"]]
    .groupby("genotype")
    .mean()
)

inside_clone_sem = (
    count_df_inside[["genotype", "ISCorEB", "preEC", "EC", "EE", "ECEE"]]
    .groupby("genotype")
    .sem()
)
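
For reference, the standard error of the mean returned by .sem() is the sample standard deviation (n - 1 in the denominator, which is the pandas default) divided by the square root of the number of observations. A minimal standard-library sketch, using illustrative clone sizes:

```python
import math
import statistics

clone_sizes = [14, 1, 4, 34]  # illustrative cells-per-clone values

n = len(clone_sizes)
mean = statistics.mean(clone_sizes)
sample_sd = statistics.stdev(clone_sizes)  # sample SD, divides by n - 1
sem = sample_sd / math.sqrt(n)

print(f"mean = {mean:.2f}, SEM = {sem:.2f}")
```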

import matplotlib.pyplot as plt
import seaborn as sns

from clonedetective import clone_analysis as ca

with sns.plotting_context("notebook", font_scale=1.2):
    ca.create_stack_bar_plot(
        inside_clone_mean,
        inside_clone_sem,
        x_figSize=2,
        y_figSize=4,
        color_pal=["#008b68", "#539DC2", "#F57171", "#95a5a6", "#000000"],
        bar_width=0.8,
        y_axis_limit=16,
        y_label="Cells per clone",
    )

From this plot we can see that 'mutant' clones are on average smaller than 'ctrl' clones. This suggests that mutation of our gene of interest may affect cell proliferation. Since we have count data, let's fit a negative binomial regression using the statsmodels library to test whether this difference is significant:

import statsmodels.api as sm
from statsmodels.formula.api import glm

glm_nb = glm(
    "total ~ genotype", data=count_df_inside, family=sm.families.NegativeBinomial()
).fit()

glm_nb.summary()

glm_nb.t_test_pairwise("genotype").result_frame
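
Because the model uses a log link, the coefficients are on the log scale, and exponentiating them gives quantities that are easier to read. The sketch below assumes the two coefficients reported in the model summary for this dataset (Intercept 2.5930 and genotype[T.mut] -1.1887); with other data the numbers would differ.

```python
import math

# Coefficients copied from the negative binomial summary (log link).
intercept = 2.5930  # log of the mean number of cells per ctrl clone
coef_mut = -1.1887  # log of the mut / ctrl ratio

ctrl_mean = math.exp(intercept)
rate_ratio = math.exp(coef_mut)
mut_mean = ctrl_mean * rate_ratio

print(f"ctrl: {ctrl_mean:.1f} cells per clone")
print(f"mut:  {mut_mean:.1f} cells per clone")
print(f"mut / ctrl ratio: {rate_ratio:.2f}")
```

On these numbers, the fitted means are roughly 13 cells per control clone and 4 cells per mutant clone, a ratio of about 0.3.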

Yes, the total number of cells per clone is significantly different in "ctrl" vs "mutant" animals.

Conclusion:
These data are consistent with a role for our gene of interest in controlling cell proliferation. However, complementary studies are required to test this hypothesis further and eliminate alternative theories, e.g. does mutation of our gene of interest enhance cell death, leading to fewer cells per clone?

Cell Differentiation - percentage cell type inside and outside clones

Next, let's investigate whether the types of cells inside and outside clones are different in ctrl and mutant animals. This information will help us determine whether our gene of interest affects cell differentiation.

First, since we already know that 'mutant' clones possess fewer cells than 'ctrl' clones, it is helpful to summarise cell types as a percentage.

percentage_df = count_df.eval(
    """EC = EC / total * 100 
    EE = EE / total * 100
    ECEE = ECEE / total * 100 
    preEC = preEC / total * 100 
    ISCorEB = ISCorEB / total * 100"""
)
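
A multi-line eval expression assigns each column in turn and returns a new DataFrame, leaving the original untouched. Here is a small sketch with two made-up clones and only two cell types:

```python
import pandas as pd

# Two toy clones: counts per cell type and the clone total.
toy = pd.DataFrame({"EC": [9, 0], "ISCorEB": [5, 1], "total": [14, 1]})

# Each line is one assignment; the result is a modified copy.
pct = toy.eval(
    """EC = EC / total * 100
    ISCorEB = ISCorEB / total * 100"""
)

print(pct.round(1))
```

Clones with a total of zero would produce NaN or infinite percentages, so it may be worth checking for them before plotting.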

Next, we plot a stacked barplot as before. However, this time we plot percentage cell types:

percentage_df_mean = (
    percentage_df[["genotype", "clone_status", "ISCorEB", "preEC", "EC", "EE", "ECEE"]]
    .groupby(["clone_status", "genotype"])
    .mean()
)

percentage_df_sem = (
    percentage_df[["genotype", "clone_status", "ISCorEB", "preEC", "EC", "EE", "ECEE"]]
    .groupby(["clone_status", "genotype"])
    .sem()
)

with sns.plotting_context("notebook", font_scale=1.2):
    ca.create_stack_bar_plot(
        percentage_df_mean,
        percentage_df_sem,
        x_figSize=2,
        y_figSize=4,
        color_pal=["#008b68", "#539DC2", "#F57171", "#95a5a6", "#000000"],
        bar_width=0.8,
        y_label="% cell type per clone",
    )

This plot suggests there might be a difference in percentage cell-types of mutant clones vs ctrl. Let's check if these differences are statistically significant.

To do this, we reshape our percentage data into a tidy format:

tidy_percentage_df = pd.melt(
    percentage_df,
    id_vars=["genotype", "clone_status"],
    value_vars=["ISCorEB", "preEC", "EC", "EE", "ECEE"],
    var_name="cell_type",
    value_name="perc_cell_num",
)
tidy_percentage_df.head()
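
To make the wide-to-tidy reshaping concrete, here is a sketch on a made-up two-clone table with two cell types. After melting, each row holds one clone and cell type combination.

```python
import pandas as pd

# Wide format: one row per clone, one column per cell type (values made up).
wide = pd.DataFrame(
    {
        "genotype": ["ctrl", "mut"],
        "clone_status": ["IC", "IC"],
        "EC": [60.0, 20.0],
        "EE": [5.0, 10.0],
    }
)

# Tidy format: the cell-type columns are stacked into name/value pairs.
tidy = pd.melt(
    wide,
    id_vars=["genotype", "clone_status"],
    value_vars=["EC", "EE"],
    var_name="cell_type",
    value_name="perc_cell_num",
)

print(tidy)
```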

We then create a composite column that contains genotype (g), clone_status (cs) and cell_type (ct) information.

tidy_percentage_df["g_cs_ct"] = (
    tidy_percentage_df["genotype"]
    + "_"
    + tidy_percentage_df["clone_status"]
    + "_"
    + tidy_percentage_df["cell_type"]
)
tidy_percentage_df.head()

tidy_percentage_df["g_cs_ct"] = tidy_percentage_df["g_cs_ct"].astype("category")

Since we are interested in specific comparisons, we set up a contrast matrix.

Our first set of contrasts compares the percentage of each cell type inside mutant vs ctrl clones:

contrasts = [
    "mut_IC_EC-ctrl_IC_EC",
    "mut_IC_EE-ctrl_IC_EE",
    "mut_IC_preEC-ctrl_IC_preEC",
    "mut_IC_ISCorEB-ctrl_IC_ISCorEB",
    "mut_IC_ECEE-ctrl_IC_ECEE",
]

Next, we would like to compare the percentage of each cell type outside mutant vs ctrl clones:

contrasts = contrasts + [
    "mut_OC_EC-ctrl_OC_EC",
    "mut_OC_EE-ctrl_OC_EE",
    "mut_OC_preEC-ctrl_OC_preEC",
    "mut_OC_ISCorEB-ctrl_OC_ISCorEB",
    "mut_OC_ECEE-ctrl_OC_ECEE",
]

Finally, we would like to compare the percentage of each cell type outside mutant vs inside mutant clones.

contrasts = contrasts + [
    "mut_OC_EC-mut_IC_EC",
    "mut_OC_EE-mut_IC_EE",
    "mut_OC_preEC-mut_IC_preEC",
    "mut_OC_ISCorEB-mut_IC_ISCorEB",
    "mut_OC_ECEE-mut_IC_ECEE",
]

Below is a helper function to construct the values of the contrast matrix:

def create_contrast_matrix(desired_contrasts, df_column):
    # One row per contrast, one column per category level.
    # For a contrast "A-B", level A is coded -1, level B is coded 1 and
    # every other level is coded 0.
    rows = []
    levels = pd.Series(df_column.cat.categories.values)
    for contrast in desired_contrasts:
        group_A, group_B = contrast.split("-")
        rows.append(
            levels.replace({f"^{group_A}$": -1, f"^{group_B}$": 1, ".": 0}, regex=True)
        )
    df = pd.DataFrame(rows)
    df.columns = levels.values
    df.index = desired_contrasts
    return df

We can now set up a custom contrast matrix:

matrix = create_contrast_matrix(contrasts, tidy_percentage_df["g_cs_ct"])
matrix
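
To make the coding of a single row explicit, here is a plain-Python sketch of the same idea. The contrast_row function is a hypothetical helper written for this illustration, not part of clonedetective, and it assumes level names contain no hyphen.

```python
def contrast_row(contrast, levels):
    """Return the weights for one contrast written as "A-B"."""
    group_a, group_b = contrast.split("-")
    # Same coding as the helper above: A -> -1, B -> 1, everything else -> 0.
    weights = {group_a: -1, group_b: 1}
    return [weights.get(level, 0) for level in levels]


levels = ["ctrl_IC_EC", "ctrl_IC_EE", "mut_IC_EC", "mut_IC_EE"]
row = contrast_row("mut_IC_EC-ctrl_IC_EC", levels)

print(dict(zip(levels, row)))
```

Each row sums to zero, and the sign of the resulting estimate follows this coding, which is worth keeping in mind when reading the coefficients.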

Finally, since we have proportional data, we use statsmodels to construct a quasi-binomial regression model and perform tests for our desired contrasts:

tidy_percentage_df["pro_cell_num"] = tidy_percentage_df["perc_cell_num"] / 100

glm_quasi = glm(
    "pro_cell_num ~ g_cs_ct", data=tidy_percentage_df, family=sm.families.Binomial(),
).fit(scale="X2")

t_test_df = glm_quasi.t_test(matrix).summary_frame()
t_test_df.index = matrix.index
t_test_df.round(4)  # round for easier reading

However, as we are performing multiple comparisons, we need to adjust our p values. We can do that using the Benjamini/Hochberg false-discovery rate procedure:

from statsmodels.stats.multitest import fdrcorrection

fdr_results = fdrcorrection(t_test_df["P>|z|"])

t_test_df["pval_fdr_bh"] = fdr_results[1]
t_test_df["reject_fdr_bh"] = fdr_results[0]

t_test_df.round(4)  # round for easier reading
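
For readers curious about what the Benjamini/Hochberg adjustment does, here is a plain-Python sketch of the procedure. The bh_adjust function is a hypothetical helper written for illustration, and the p-values are the rounded values from the contrast results for this dataset, so the adjusted values are approximate.

```python
def bh_adjust(pvals):
    """Benjamini/Hochberg adjusted p-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotonic.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Rounded P>|z| values from the 15 contrasts, in table order.
pvals = [0.0001, 0.0048, 0.3433, 0.0000, 0.9720,
         0.2034, 0.7279, 0.4947, 0.3395, 0.8950,
         0.0013, 0.9415, 0.2498, 0.0023, 0.9098]

adjusted = bh_adjust(pvals)
n_significant = sum(p < 0.05 for p in adjusted)
print(n_significant, "contrasts remain significant at FDR 0.05")
```

Each sorted p-value is scaled by the number of tests divided by its rank, and the running minimum keeps a larger raw p-value from ending up with a smaller adjusted one.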

To aid visualization, let's add these results to our stacked percentage barplot from earlier. We annotate only statistically significant comparisons.

from clonedetective.clone_analysis import plot_stat_annotation, pvals_to_stat_anots

# convert t_test_results to annotations:
stat_anots = pvals_to_stat_anots(t_test_df["pval_fdr_bh"])

with sns.plotting_context("notebook", font_scale=1.2):
    ca.create_stack_bar_plot(
        percentage_df_mean,
        percentage_df_sem,
        x_figSize=2,
        y_figSize=5,
        color_pal=["#008b68", "#539DC2", "#F57171", "#95a5a6", "#000000"],
        bar_width=0.8,
        y_label="% cell type per clone",
    )
    plot_stat_annotation(
        (0, 1),
        105,
        stat_anots[
            [
                "mut_IC_EC-ctrl_IC_EC",
                "mut_IC_ISCorEB-ctrl_IC_ISCorEB",
                "mut_IC_EE-ctrl_IC_EE",
            ]
        ],
        sep=1,
        text_colors=["#008b68", "#F57171", "#95a5a6"],
    )
    plot_stat_annotation(
        (1, 3),
        120,
        stat_anots[["mut_OC_EC-mut_IC_EC", "mut_OC_ISCorEB-mut_IC_ISCorEB"]],
        sep=1,
        text_colors=["#008b68", "#F57171"],
    )

Summary of results:

Comparing the percentage makeup of control and mutant clones, mutant clones possess significantly fewer ECs and significantly more progenitor cells (e.g. ISCs or EBs) and EEs.
Comparing the percentage makeup of cells inside mutant clones and outside mutant clones, cells inside clones similarly possess significantly fewer ECs and more progenitor cells.
No significant difference is seen in the percentage makeup of cells outside clones in control vs mutant animals, suggesting our gene of interest may not be haploinsufficient.

Conclusion:
Together, these data suggest mutation of our gene of interest may hinder ISC -> EC differentiation. However, complementary studies are required to further test this hypothesis and eliminate alternative theories, e.g. does mutation of our gene of interest lead to enhanced EC cell death, resulting in a higher percentage of progenitor and EE cells?

Number of neighbours

We have found that mutant clones are smaller than control (wildtype) clones, and that our gene of interest may therefore regulate cell proliferation. We also found that our gene of interest may regulate cell differentiation, as mutant clones possess a lower percentage of ECs and a higher percentage of progenitor cells (ISCs or EBs). We now wish to find out whether our gene of interest regulates local tissue cell arrangement and structure by comparing the number and type of cell neighbours in control vs mutant clones.

First, we perform a 'sanity check' to make sure our 'neighbour counts' are reporting what we would expect.

Since mutant clones are smaller than control clones, we would expect that, compared to cells within control clones, each cell within mutant clones will have (on average) a lower proportion of GFP positive neighbours and a higher proportion of GFP negative neighbours.

gfp_df = (
    df.query('int_img_ch == "C0" & clone_status == "IC"')
    .set_index(["int_img", "label", "GFP_clone"])
    .filter(regex="GFP|tot_nc|genotype|clone_status")
    .groupby(["int_img", "GFP_clone"])
    .agg(
        {
            "genotype": "first",
            "GFPneg_nc": "mean",
            "GFPpos_nc": "mean",
            "tot_nc": "mean",
        }
    )
)
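
The chain above groups on the names of index levels rather than on ordinary columns, which pandas allows. Here is a small sketch on a made-up three-cell table, with only one neighbour-count column kept for brevity:

```python
import pandas as pd

# Three cells from one image; the first two belong to the same clone.
toy = pd.DataFrame(
    {
        "int_img": ["a1g01p1"] * 3,
        "label": [11, 12, 13],
        "GFP_clone": [1, 1, 2],
        "genotype": ["ctrl"] * 3,
        "GFPpos_nc": [2, 4, 0],
        "tot_nc": [4, 6, 5],
    }
)

# After set_index, "int_img" and "GFP_clone" are index levels, and groupby
# accepts their names just like column names.
per_clone = (
    toy.set_index(["int_img", "label", "GFP_clone"])
    .groupby(["int_img", "GFP_clone"])
    .agg({"genotype": "first", "GFPpos_nc": "mean", "tot_nc": "mean"})
)

print(per_clone)
```

The result is indexed by image and clone. pd.melt ignores the index by default, so calling reset_index() first is one way to carry those identifiers into the tidy table if they are needed.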

tidy_gfp_df = pd.melt(
    gfp_df,
    id_vars=["genotype"],
    value_vars=["GFPneg_nc", "GFPpos_nc", "tot_nc"],
    var_name="neigh_type",
    value_name="neighbour_count",
)

import matplotlib.pyplot as plt
import seaborn as sns
from statannotations.Annotator import Annotator

with sns.plotting_context("notebook", font_scale=1.1):
    fig, ax = plt.subplots(figsize=(4, 4))

    sns.set_style("ticks")
    ax = sns.barplot(
        x="genotype",
        y="neighbour_count",
        hue="neigh_type",
        ax=ax,
        data=tidy_gfp_df,
        palette=["#95a5a6", "#008b68", "#539DC2"],
    )

    # adjust position of legend
    handles, labels = ax.get_legend_handles_labels()
    ax.legend(handles, labels, bbox_to_anchor=(1, 1), loc="upper left")
    plt.tight_layout()

Indeed, compared to cells in control clones, cells in mutant clones have, on average, fewer GFP positive neighbours and more GFP negative neighbours.

Output of df.head() after loading the results:

	int_img	label	GFPneg_nc	GFPpos_nc	tot_nc	ECneg_nc	ECpos_nc	EEneg_nc	...	centroid-1	eccentricity	area_um2	GFP_clonepos	total_intensity	EC_pos	EE_pos	ECEE_pos	ISCorEB_pos	preEC_pos
0	a1g01p1	11	2	2	4	1	3	4	...	89.552116	0.440881	73.280625	False	2196457.0	True	False	False	False	False
1	a1g01p1	11	2	2	4	1	3	4	...	89.552116	0.440881	73.280625	False	52757.0	True	False	False	False	False
2	a1g01p1	11	2	2	4	1	3	4	...	89.552116	0.440881	73.280625	False	2262648.0	True	False	False	False	False
3	a1g01p1	11	2	2	4	1	3	4	...	89.552116	0.440881	73.280625	False	67331.0	True	False	False	False	False
4	a1g01p1	12	4	0	4	1	3	4	...	486.798467	0.465227	98.690625	False	4113516.0	True	False	False	False	False

Output of df.head() after recoding GFP_clonepos as clone_status:

	int_img	label	GFPneg_nc	GFPpos_nc	tot_nc	ECneg_nc	ECpos_nc	EEneg_nc	...	centroid-1	eccentricity	area_um2	clone_status	total_intensity	EC_pos	EE_pos	ECEE_pos	ISCorEB_pos	preEC_pos
0	a1g01p1	11	2	2	4	1	3	4	...	89.552116	0.440881	73.280625	OC	2196457.0	True	False	False	False	False
1	a1g01p1	11	2	2	4	1	3	4	...	89.552116	0.440881	73.280625	OC	52757.0	True	False	False	False	False
2	a1g01p1	11	2	2	4	1	3	4	...	89.552116	0.440881	73.280625	OC	2262648.0	True	False	False	False	False
3	a1g01p1	11	2	2	4	1	3	4	...	89.552116	0.440881	73.280625	OC	67331.0	True	False	False	False	False
4	a1g01p1	12	4	0	4	1	3	4	...	486.798467	0.465227	98.690625	OC	4113516.0	True	False	False	False	False

Output of glm_nb.summary():

Dep. Variable:	total	No. Observations:	429
Model:	GLM	Df Residuals:	427
Model Family:	NegativeBinomial	Df Model:	1
Link Function:	log	Scale:	1.0000
Method:	IRLS	Log-Likelihood:	-1297.0
Date:	Sat, 18 Sep 2021	Deviance:	445.65
Time:	19:15:46	Pearson chi2:	577.
No. Iterations:	6
Covariance Type:	nonrobust

	coef	std err	z	P>|z|	[0.025	0.975]
Intercept	2.5930	0.074	34.926	0.000	2.447	2.738
genotype[T.mut]	-1.1887	0.104	-11.420	0.000	-1.393	-0.985

Output of t_test_df.round(4), before FDR correction:

	coef	std err	z	P>|z|	Conf. Int. Low	Conf. Int. Upp.
mut_IC_EC-ctrl_IC_EC	0.7649	0.1978	3.8681	0.0001	0.3773	1.1525
mut_IC_EE-ctrl_IC_EE	-0.8064	0.2862	-2.8173	0.0048	-1.3675	-0.2454
mut_IC_preEC-ctrl_IC_preEC	-1.3296	1.4032	-0.9476	0.3433	-4.0797	1.4205
mut_IC_ISCorEB-ctrl_IC_ISCorEB	-0.8487	0.1217	-6.9731	0.0000	-1.0873	-0.6102
mut_IC_ECEE-ctrl_IC_ECEE	0.0260	0.7414	0.0351	0.9720	-1.4271	1.4792
mut_OC_EC-ctrl_OC_EC	0.3803	0.2990	1.2719	0.2034	-0.2057	0.9663
mut_OC_EE-ctrl_OC_EE	-0.2120	0.6093	-0.3479	0.7279	-1.4062	0.9823
mut_OC_preEC-ctrl_OC_preEC	-1.2323	1.8048	-0.6828	0.4947	-4.7695	2.3050
mut_OC_ISCorEB-ctrl_OC_ISCorEB	-0.2758	0.2887	-0.9552	0.3395	-0.8417	0.2901
mut_OC_ECEE-ctrl_OC_ECEE	0.2190	1.6601	0.1319	0.8950	-3.0346	3.4727
mut_OC_EC-mut_IC_EC	-0.7743	0.2403	-3.2226	0.0013	-1.2452	-0.3034
mut_OC_EE-mut_IC_EE	0.0324	0.4411	0.0734	0.9415	-0.8321	0.8968
mut_OC_preEC-mut_IC_preEC	-1.2188	1.0590	-1.1509	0.2498	-3.2943	0.8567
mut_OC_ISCorEB-mut_IC_ISCorEB	0.6880	0.2252	3.0544	0.0023	0.2465	1.1294
mut_OC_ECEE-mut_IC_ECEE	-0.1520	1.3415	-0.1133	0.9098	-2.7813	2.4773

Output of count_df.head() after grouping by image and clone:

	int_img	GFP_clone	clone_status	genotype	EC	EE	preEC	ECEE	ISCorEB	total
0	a1g01p1	0	OC	ctrl	70	2	1	4	98	175
1	a1g01p1	1	IC	ctrl	9	0	0	0	5	14
2	a1g01p1	2	IC	ctrl	0	0	0	0	1	1
3	a1g01p1	3	IC	ctrl	1	0	0	0	3	4
4	a1g01p1	4	IC	ctrl	12	0	0	1	21	34

Output of tidy_percentage_df.head() after melting:

	genotype	clone_status	cell_type	perc_cell_num
0	ctrl	OC	ISCorEB	56.000000
1	ctrl	IC	ISCorEB	35.714286
2	ctrl	IC	ISCorEB	100.000000
3	ctrl	IC	ISCorEB	75.000000
4	ctrl	IC	ISCorEB	61.764706

The contrast matrix:

	ctrl_IC_EC	ctrl_IC_ECEE	ctrl_IC_EE	ctrl_IC_ISCorEB	ctrl_IC_preEC	ctrl_OC_EC	ctrl_OC_ECEE	ctrl_OC_EE	ctrl_OC_ISCorEB	ctrl_OC_preEC	mut_IC_EC	mut_IC_ECEE	mut_IC_EE	mut_IC_ISCorEB	mut_IC_preEC	mut_OC_EC	mut_OC_ECEE	mut_OC_EE	mut_OC_ISCorEB	mut_OC_preEC
mut_IC_EC-ctrl_IC_EC	1	0	0	0	0	0	0	0	0	0	-1	0	0	0	0	0	0	0	0	0
mut_IC_EE-ctrl_IC_EE	0	0	1	0	0	0	0	0	0	0	0	0	-1	0	0	0	0	0	0	0
mut_IC_preEC-ctrl_IC_preEC	0	0	0	0	1	0	0	0	0	0	0	0	0	0	-1	0	0	0	0	0
mut_IC_ISCorEB-ctrl_IC_ISCorEB	0	0	0	1	0	0	0	0	0	0	0	0	0	-1	0	0	0	0	0	0
mut_IC_ECEE-ctrl_IC_ECEE	0	1	0	0	0	0	0	0	0	0	0	-1	0	0	0	0	0	0	0	0
mut_OC_EC-ctrl_OC_EC	0	0	0	0	0	1	0	0	0	0	0	0	0	0	0	-1	0	0	0	0
mut_OC_EE-ctrl_OC_EE	0	0	0	0	0	0	0	1	0	0	0	0	0	0	0	0	0	-1	0	0
mut_OC_preEC-ctrl_OC_preEC	0	0	0	0	0	0	0	0	0	1	0	0	0	0	0	0	0	0	0	-1
mut_OC_ISCorEB-ctrl_OC_ISCorEB	0	0	0	0	0	0	0	0	1	0	0	0	0	0	0	0	0	0	-1	0
mut_OC_ECEE-ctrl_OC_ECEE	0	0	0	0	0	0	1	0	0	0	0	0	0	0	0	0	-1	0	0	0
mut_OC_EC-mut_IC_EC	0	0	0	0	0	0	0	0	0	0	1	0	0	0	0	-1	0	0	0	0
mut_OC_EE-mut_IC_EE	0	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0	0	-1	0	0
mut_OC_preEC-mut_IC_preEC	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0	0	-1
mut_OC_ISCorEB-mut_IC_ISCorEB	0	0	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0	0	-1	0
mut_OC_ECEE-mut_IC_ECEE	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0	0	-1	0	0	0
