2 Chapter 2: GammaGateR
2.1 Research question
Before important spatial insights can be gleaned using statistical methods, mIF images undergo an intensive preprocessing pipeline to obtain single-cell measurements. While there are various steps included in the pipeline such as image registration, single-cell segmentation, quantification, and batch correction, cell phenotyping is typically the final step before downstream analyses on the cell-level data (Graf et al. 2022; Harris et al. 2022). Cell phenotyping identifies individual cell phenotypes from the measured marker expression values of the cell and directly affects the subsequent cell population analysis results.
The two most common approaches for cell phenotyping in mIF are manual gating and graph-based multivariate clustering. In manual gating, each sample is visualized separately to determine a threshold, and super-threshold cells are labeled as marker positive. This procedure is repeated for all marker channels and slides, and the phenotypes are determined by combining combinations of marker-positive cells (Chen et al. 2021; Lin et al. 2023). An example for cell phenotyping from gated markers is presented in Table 2.1. Alternatively, multivariate graph-based clustering is adapted from other single-cell assays. This approach first performs cell clustering, then assigns a phenotype to each cell group based on their average expression profile. Multivariate graph-based clustering is implemented with various modifications across many software packages. Unfortunately, both methods are labor intensive, and their accuracy suffers from image noise and spatial artifacts in mIF images that cause marker expression histograms to appear continuous or uni-modal . As a result, both phenotyping methods possess shortcomings that cannot be ignored. On one hand, manual gating can be subjective. On the other hand, graph-based clustering results are prone to over-clustering and producing poor separation between clusters.
Cell Types | Markers |
---|---|
B cells | CD19+ |
Tumor | CK+ |
Cell ID | CD19+ | CK+ | Tumor | B cells |
---|---|---|---|---|
1 | TRUE | TRUE | TRUE | TRUE |
2 | TRUE | FALSE | TRUE | FALSE |
3 | FALSE | FALSE | FALSE | FALSE |
2.2 Previous works
The challenges described above are well recognized and there are a few methods and software developed that attempt to automate cell phenotyping for mIF images. For example, CellSighter is a recently proposed supervised deep-learning algorithm for cell phenotyping that requires a “gold standard” training dataset (Amitay et al. 2023). Another recent solution, ASTIR (Automated assignment of cell identity from single-cell multiplexed imaging and proteomic data), is a fast unsupervised approach that defines cell phenotypes from segmented cell-level data by using a neural network-based mixture model assuming a multivariate log-normal distribution (Geuenich et al. 2021). Instead of binary outputs like in classification methods, ASTIR returns posterior probabilities of different cell types for each cell. This type of output is advantageous because it offers more information than nominal cell types and leaves cell labeling to the clinician’s discretion. Lastly, (Ahmadian et al. 2022) treat the analysis as a pixel classification problem and design a single-step framework for mIF phenotyping that is integrated with other preprocessing steps.
Nevertheless, inconsistencies persist in the results rendered by these learning-based methods when applied across markers, slides, batches, and datasets. These inconsistencies result from the immense variation in the cell-level distribution of phenotyping markers that are often too nuanced to be removed by existing batch correction methods (Wilson et al. 2021; Hunt et al. 2022). For these reasons, it is difficult to fully automate the cell phenotyping process, despite the availability of automated tools, and manual gating is still used to perform cell phenotyping because it is easy to visualize and evaluate the quality of the phenotype.
2.3 Methods
Since automated methods cannot be run without evaluation and supervised methods require a gold-standard dataset, no method is truly fully automated. As a solution, we develop an explicitly semi-automated algorithm called GammaGateR. GammaGateR allows the user to easily perform cell phenotyping, visualize results, and conduct interpretable quality control while reducing manual labor. Based on a novel closed-form Gamma mixture model (cfGMM), GammaGateR is a probabilistic model that is fitted to each channel and slide separately, and outputs positive-component probabilities for each marker. These can then be easily thresholded and combined for semi-automated marker gating or input directly into downstream analysis. GammaGateR has important technical advantages, including 1) improved computation time and model convergence due to its novel closed-form property, and 2) high consistency and reproducibility for phenotyping results across mIF data batches due to incorporation of parameter boundary constraints. In applications on real-world mIF data, we find that GammaGateR has fast and consistent results across many slides and markers. We provide an open-source implementation of our method in the new GammaGateR R package.
2.3.1 GammaGateR
The GammaGateR algorithm is unique to existing methods for its focus on parsimoniously modeling cell-level marker expression densities. This approach yields tailored-to-slide model estimation in cell-level mIF data where marker expression distributions can vary substantially across slides. The algorithm uses a zero-inflated two-component GMM to model marker expression for each slide. The Gamma mixture model naturally identifies marker-positive and marker-negative cell distributions and returns the probability of belonging to the marker-positive cell distribution for each cell. The returned probabilities can either be used directly in subsequent analysis or combined and dichotomized to define cell phenotypes. GammaGateR incorporates user-specified constraints to provide consistent model fit across a large number of slides. The model evaluation methods included in GammaGateR allow the user to evaluate the constraints and quality check results. The power source of GammaGateR is the closed-form Gamma mixture model, which is a novel approach to phenotyping for mIF data that makes it more computationally efficient than traditional GMMs.
The analysis pipeline is illustrated for the CD4 marker channel (Figure 2.1). GammaGateR takes a single-cell image dataset, with each row corresponding to an individual cell, and each column as the normalized intensity of a marker channel for each cell (Figure 2.1 1). The first step is selecting biological constraints for model fitting by visualizing overlay histograms for each marker channel (Figure 2.1 2). The constraints are not manual thresholds, but represent boundaries for the mode of each component of the fitted distribution across all slides in the dataset. Because marker-positive cells often are a small proportion of all cells and have higher expression values, we limit the mode of the higher component to be no lower than the “elbow” of the overlay histograms (Figure 2.1 2, e.g. 0.45). While GammaGateR can be fit without constraints, the constraints provide more consistent model estimation across many slides. Given the data and constraints, GammaGateR generates the parameter estimates of the Gamma mixture model including modes and proportion of each component, and the posterior and marginal probabilities of each cell being marker-positive.
To ensure accurate model fitting, GammaGateR includes functionality for users to evaluate model fit and modify the fit when needed. Diagnostic plots for the fitted GammaGateR model object consist of a scatter plot of all the slides fitted model modes and lambda (marker-positive probability), and the fitted density curve over the cell expression histogram for each slide in the data set (Figure 2.1 5.a). The x-axis of the scatter plot is the mode for the marker-positive component, and the y-axis is the proportion of the corresponding component. The scatter plot is useful for identifying slides that are outliers with respect to where the mode of marker-positive cells lies or the estimated proportion of marker-positive cells. The histograms can be used for visually evaluating model fit for one slide. A good model fit shows an approximate fit of the smooth density line to the histogram with a marker-positive cell distribution sitting to the right (Figure 2.1 5.a). If there is poor model fit, users can compare fitted models between two different constraints to check how different boundaries affect fitted values (Figure 2.1 5.b). Figure 2.1 5.b compares the model fit for CD4 with no constraints (green) to the model fit with an initial constraint with a lower bound of 0.45 for the marker-positive component (red). The model without constraints places the marker-positive distribution directly over the marker-negative distribution. Users can adjust the parameter boundaries and fit again until satisfactory fittings are rendered. Finally, the output of the fitted models is easily accessible in the GammaGateR model object (Figure 2.1 6). The vignette in the GammaGateR R package provides a guide to fitting the GammaGateR model in the lung cancer dataset from the VectraPolaris dataset available on Bioconductor.
In the extreme tails, the marker-positive probability might not always be higher than that of marker-negative, due to different variances of the two components’ Gamma distributions. Therefore, we apply a correction to the posterior probabilities to force them to be monotonic with respect to the marker values. Specifically, after the first crossing of the density curves of the two components, the density curve of the first component will be forced to be non-increasing, and the density curve of the second component will be forced to be non-decreasing. In addition to the posterior probability, GammaGateR also outputs the marginal probability of the observed marker value for the marker-positive component. The marginal probability is monotonically increasing in the marker intensities and represents the probability that a marker positive cell is less than the given value, x.
2.3.2 cfGMM
For mIF data, we use the GMM to fit cell marker expression values as a weighted sum of different probability distributions that represent unique cell populations (McLachlan, Lee, and Rathnayake 2019). The Gamma distribution is an excellent model for marker values because the domain of the Gamma distribution is strictly positive and it has the flexibility to model the varying skewed densities seen in mIF marker values (Figure 2.1 5.a). However, GMMs are not scalable for mIF image data, because they rely on computationally inefficient numerical methods to obtain the maximum likelihood estimator (MLE). The slow convergence of the MLE for the GMM makes it prohibitive to apply across a large number of channels, slides, and cells. As a solution, we develop a closed-form GMM (cfGMM) estimation procedure based on a recently developed estimator for the Gamma distribution (Ye and Chen 2017). In addition, to improve computational efficiency, the cfGMM has the benefit of allowing prior constraints on model parameters. With the cfGMM in GammaGateR, we enable the flexibility to include a biologically meaningful range for the mode of each component in the Gamma mixture model. This way, users of GammaGateR can restrict estimation to biologically meaningful values.
2.3.2.1 Derivation
We assume the data is a random sample from a component generalized gamma mixture distribution. The density function of is and the log-likelihood of the dataset is For each generalized gamma component , are the mixture parameters, ; denotes the generalized gamma density function; are the parameters for the generalized gamma.
Here, we use the expectation maximization (EM) algorithm [dempster_maximum_1977] for parameter estimation. EM algorithm is a standard approach for parameter estimation in mixture models. It introduces the latent multinomial variable into the model and maximizes the expected value of the complete data likelihood [dempster_maximum_1977]. The expectation of the complete data likelihood to be maximized for the generalized gamma distribution is where , and , are similarly defined vectors.
From here, the maximization of the expectation is now analogous to the maximization of generalized gamma distribution for each component of the mixture model.
The expectation of log-likelihood is and where .
By the two formulas above, The expected joint log-likelihood is
The estimators of each of the terms of the expected joint log-likelihood are derived as follows:
first take derivative of the expression from the above equation with respect to separately:
Note that is digamma function. Among which, (2) can be solved as Substitute (4) into (3):
Solving this, we have
Plug in (5) , we now have In addition, can simply be estimated as
It is worth noting that we are not maximizing the exact Gamma distribution, therefore the algorithm we devise here is an EM-type algorithm.
2.4 Results
2.4.1 Simulation for cfGMM
To compare the bias and compute time of the closed-form GMM to maximum likelihood GMM implementation, we run the cfGMM, the constrained cfGMM, and the GMM to evaluate bias and variance in a sample size of 10,000 across 1,000 simulations. We simulate a two-component mixture model with parameters . For the constrained estimator, we restrict the mode of each component to be in the range and for marker negative and marker positive components, respectively, which include the true mode for each component, (no mode) and .
Both closed-form estimation procedures have substantially faster computation time than the MLE while maintaining similarly low bias, as in Figure 2.2. The sample size used in the simulation is roughly similar to that of the cell-level mIF image dataset, which further proves that cfGMM brings computation efficiency to our target application. The closed-form GMM, therefore, enables computationally feasible, precise, and flexible model estimation when applied to a large number of channels and slides using GammaGateR. It is also worth noting that the constrained cfGMM converges slightly faster than without constraints. This implies that when using cfGMM, computational cost can be reduced with proper knowledge of biological priors.
2.4.2 Analysis in real dataset
We use three single-cell imaging datasets to evaluate model performance and demonstrate the use of the GammaGateR analysis pipeline: the Colorectal Molecular Atlas Project (Colon MAP) dataset (Chen et al. 2021), the Spatial Colorectal Cancer (CRC) Atlas dataset (Heiser et al. 2023) and Ovarian Cancer dataset (Steinhart et al. 2021; Wrobel and Ghosh 2022). After processing and prior to analysis, cell expression values were normalized by first mean division then log10 transformation to reduce slide-to-slide variation (Harris et al. 2022).
To compare the methods in determining cell phenotypes we assess the accuracy of each method relative to a “silver standard” manual phenotyping in the Colon MAP and CRC atlas datasets and evaluate the efficacy in predicting survival in the ovarian cancer data. We compare phenotyping results obtained using GammaGateR and ASTIR to “silver standard” manual phenotyping using the Adjusted Rand Index (Hubert and Arabie 1985). Adjusted rand index typically takes values between 0 and 1, where a larger value indicates a better alignment between two categorical variables. The silver standard is obtained by gating the raw images based on visual inspection and defining marker-positive cells as those that contain more than 50% marker-positive pixels within each cell. Semi-automated marker gating results are obtained using GammaGateR as described in Section 2.3.1, with monotonically adjusted posterior probability and marginal probability thresholded at 0.5 to define marker positive cells. The same phenotype definitions used for ASTIR are used to define phenotypes from marker positive labels using GammaGateR and manual marker gating as well. Each cell belongs to a given phenotype if it is marker positive for combinations of markers for that phenotype. ASTIR phenotypes are determined by selecting the cell type with the maximum probability for each cell. All methods use the same combinations of markers to define phenotypes. For both datasets and all cell types in these datasets, the posterior probability by GammaGateR yields higher Median ARI (Figure 2.3 a & b). This means that the posterior probability has consistently greater similarity to the silver standard than marginal probability and ASTIR (Figure 2.3 a & b). However, for some cell types (e.g. Macrophage, B-cells, Myeloid), all methods have low performance. This is an indication of systematic difference in how the algorithms identify positive cells relative to the manual labels.
Because the Ovarian Cancer dataset does not include manual cell phenotypes, we instead compare the prediction accuracy of survival time data for each method across all patients in the study to determine if one method has greater biological sensitivity than the other. The original study shows that survival of ovarian cancer patients is significantly correlated with B-cell and CD4 T-cell, as well as spatial interaction between CD4 T-cell and macrophage. Therefore, we evaluate the methods using this dataset by fitting a survival model with age, cancer stage, B-cell proportions, CD4 T-cell proportions, and the spatial interaction between Macrophage and CD4 T-cells estimated by Ripley’s H with r=50 (Kiskowski, Hancock, and Kenworthy 2009). Ripley’s H is a geospatial index that describes spatial attraction/repulsion. We fit a model for each method, where the cell phenotypes are determined using the given method, and compare all models, as well as a base model that includes only age and cancer stage. We use a random forest survival model to be sensitive to complex nonlinear relationships (Ishwaran et al. 2008; Ishwaran, Kogalur, and Kogalur 2023). To estimate variability in the out-of-bag performance error, we compare the methods across 100 bootstrap samples. Performance error quantification is based on C-index (Harrell et al. 1982), where a low performance error means that the model is a good fit, and a performance error of 0.5 is a random chance. For all methods, incorporating the cell-level information reduces out-of-bag error performance by approximately 0.075, over the base model that includes only age and cancer stage. This indicates that spatial cell phenotype covariates are useful in predicting survival outcomes, consistent with the original findings in Steinhart et al. (2021). The posterior probability slightly outperforms other methods, having the lowest prediction error in 46% of the bootstrap samples, compared to 36% with ASTIR and 18% with the marginal probability (Figure 2.3 c).
2.5 Summary
GammaGateR is a semi-automated maker gating tool. Driven by a novel cfGMM estimation framework, GammaGateR generates reproducible and evaluable marker gating for mIF data. In addition, cfGMM enables computationally feasible model estimation for large-scale datasets like mIF. The marker gating output of GammaGateR can be used to define phenotypes and as input to downstream analysis. GammaGateR implements interactive visualization to quality check the clustering results (see vignette in Supplementary Material) and allows users to modify the constraints to improve model results when needed. Consequently, GammaGateR provides more consistent results with silver standard labels than ASTIR, the existing state-of-the-art method for automated phenotyping of cell-level mIF data. In the examples shown in this section, GammaGateR phenotypes had slightly improved ovarian cancer survival prediction accuracy compared to ASTIR. This paper also compares the posterior and the marginal probabilities returned by GammaGateR. The marginal probabilities only use the marker positive cell distribution to determine cell phenotypes, whereas the posterior probabilities take into account the distribution of the marker negative cells. Using posterior probabilities was almost always better indicating the importance of accounting for the full distribution of the marker intensities when identifying marker-positive cells.