3 Missing data imputation in mIF imaging
Like all data created and collected by human beings, missing data is inevitable in mIF images as well. Bao et al. (2021) give a brief summary of the types of missing data in mIF images, as in Figure 3.1. Case 1 in Figure 3.1 refers to the missingness of one or more entire marker channels due to low image quality, which occurs in roughly 3% of cases. Other possible reasons for a missing channel, not described in Bao et al. (2021), include supply shortages of a certain type of fluorescent material or changes in the research plan. Interestingly, the existing marker-channel imputation applications are instead motivated by the expense of high-plex imaging. Due to time and financial constraints, mIF images with no more than seven channels are often more feasible to obtain than 40-channel mIF images (Wu et al. 2023). To overcome the limited availability of cell phenotypes from a small number of markers, imputation of marker channels has been proposed. Case 2 in Figure 3.1 occurs more frequently, when tissue wears off during the staining and wash-off cycles described in Figure 1.2.
Owing to the rapid development of the field of computer vision, all current applications of mIF imputation are implemented with machine learning and/or deep learning methods. In the three applications covered in this document, Bao et al. (2021) use generative adversarial networks (GANs), Wu et al. (2023) use gradient-boosted decision trees in combination with a convolutional neural network, and Sims and Chang (2023) use masked autoencoders (MAEs). All three applications perform well in the evaluations set up in their respective papers. However, the subsequent analysis of imputed images can benefit from statistical thinking about data imputation. This will be discussed further in Chapter 4.
3.1 Application case 1: Missing tissue imputation
3.1.1 Method: GANs
The fundamental version of GANs comprises two components: a discriminator and a generator (Goodfellow et al. 2014). Figure 3.2, by Bok and Langr (2019), gives a brief sketch of how GANs work. Like a turn-based strategy game, the two components take turns within each epoch. Starting from a noise distribution (usually a uniform distribution), the generator's goal is to generate data that is close to the real data. The discriminator's goal is to identify the real data within a mix of real data and data produced by the generator. With the classification error fed back to the generator and the discriminator, the two opponents update their weights: the generator tries to maximize the probability that the discriminator misclassifies generated data as real, and the discriminator tries to maximize its classification accuracy. Over a sufficient number of rounds, they eventually reach a state close to equilibrium, where either party can improve only negligibly: the generator generates close-to-real data, and the discriminator classifies with 50% accuracy (Bok and Langr 2019). This is the point where the algorithm stops.
The value function of GANs can be written as follows:

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))] \tag{3.1}$$

Here $D(x)$ represents the probability that the discriminator classifies $x$ as real data, $p_z(z)$ is the noise distribution, and $G(z)$ is the fake data that the generator creates based on the noise distribution. The discriminator is trained to maximize $V(D, G)$, and the generator is trained to minimize $\mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$, the probability that fake data is recognized and classified as fake.
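To make the two-player game concrete, below is a minimal sketch of one GAN training step in PyTorch. The architectures, dimensions, and hyperparameters are illustrative assumptions, not those of any paper discussed here.

```python
import torch
import torch.nn as nn

# Toy generator and discriminator; real applications would use convolutional nets.
G = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 784), nn.Tanh())
D = nn.Sequential(nn.Linear(784, 128), nn.LeakyReLU(0.2), nn.Linear(128, 1), nn.Sigmoid())
opt_G = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_D = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

def train_step(real):  # real: (batch, 784) tensor of flattened images
    batch = real.size(0)
    ones, zeros = torch.ones(batch, 1), torch.zeros(batch, 1)

    # Discriminator turn: maximize log D(x) + log(1 - D(G(z)))
    z = torch.rand(batch, 64)          # noise from a uniform distribution
    fake = G(z).detach()               # freeze G while updating D
    loss_D = bce(D(real), ones) + bce(D(fake), zeros)
    opt_D.zero_grad(); loss_D.backward(); opt_D.step()

    # Generator turn: fool D into classifying fake data as real
    # (non-saturating variant of minimizing log(1 - D(G(z))))
    z = torch.rand(batch, 64)
    loss_G = bce(D(G(z)), ones)
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()
```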
One disadvantage of the original GANs is their weak control over the generated data, due to the random noise input. This disadvantage stands out especially in image synthesis. Conditional GANs (CGANs) provide a promising solution to this issue by feeding auxiliary information to both the generator and the discriminator (Mirza and Osindero 2014). The auxiliary information is usually data from the same class, for example other images in the case of image synthesis. Suppose $x$ belongs to the input data class, and $y$ is the intended output data class. GANs would learn the mapping $G: z \rightarrow y$, while CGANs learn $G: \{x, z\} \rightarrow y$ (Isola et al. 2017; Souza et al. 2023). The updated value function for CGANs is as follows:

$$\min_G \max_D V(D, G) = \mathbb{E}_{x, y}[\log D(x, y)] + \mathbb{E}_{x, z}[\log(1 - D(x, G(x, z)))] \tag{3.2}$$
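A minimal sketch of how such conditioning can be wired into the generator, assuming the simple concatenation scheme of Mirza and Osindero (2014); the class name and dimensions are illustrative.

```python
import torch
import torch.nn as nn

class CondGenerator(nn.Module):
    """Generator that receives auxiliary information alongside the noise."""

    def __init__(self, z_dim=64, cond_dim=10, out_dim=784):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(z_dim + cond_dim, 128), nn.ReLU(),
            nn.Linear(128, out_dim), nn.Tanh(),
        )

    def forward(self, z, cond):
        # Conditioning: concatenate noise z with the auxiliary input, so the
        # network learns the mapping {x, z} -> y rather than z -> y.
        return self.net(torch.cat([z, cond], dim=1))
```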
Based on this idea, pix2pix was developed. It performs image-to-image translation by training on paired images. Figure 3.3 shows the kinds of translation it can be used for. Intuitively, image imputation can make use of this idea: trained on sets of channels without missing data, a well-trained pix2pix model can generate imputed data.
3.1.2 Application in mIF: pixN2N-HD
pixN2N-HD is a "novel multi-channel high-resolution image synthesis approach", an extension of pix2pix. "N2N" stands for "N-to-N", which distinguishes it from the widely used (N-1)-to-1 design. N is the number of marker channels; in the dataset used in this paper, N = 11. In the (N-1)-to-1 design, 10 channels are used as input and 1 channel as output, and this is repeated across 11 separate models. The "N-to-N" design instead uses a random gate strategy, as shown in Figure 3.4. Let $\tau = (\tau_1, \ldots, \tau_N)$ be the binary vector indicating channels without missing tissue. When $\tau_i$ is turned on, the image of channel $i$ goes into the generator; when it is turned off, a blank image is fed into the generator instead. The value function is similar to Equation 3.1, with the non-missing (gated-on) channels as the input and the gated-off channels as the output.
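A sketch of how such a random gate could be applied to a channel stack during training; this is an illustration of the idea, not the authors' implementation, and the keep-at-least-one-channel rule is an assumption.

```python
import torch

def random_gate(channels: torch.Tensor, p_keep: float = 0.5):
    """channels: (N, H, W) multiplexed image stack.

    Returns the gated generator input and the binary gate vector tau.
    """
    N = channels.size(0)
    tau = (torch.rand(N) < p_keep).float()   # tau_i = 1: channel i is gated on
    tau[torch.randint(N, (1,))] = 1.0        # assumption: keep at least one channel on
    gated = channels * tau.view(N, 1, 1)     # gated-off channels become blank images
    return gated, tau
```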
The paper evaluates model performance by comparing the "N-to-N" model with the "(N-1)-to-1" model and another "(N-1)-to-1 random gate" model, which blends in the random gate but still needs to train 11 separate models. An index for measuring image similarity, the structural similarity index measure (SSIM), is used to assess whether the "N-to-N" model generates results comparable with the other two methods (Wang et al. 2004). The results show that no pair of methods differs significantly at the 0.05 significance level, and the methods are therefore concluded to be comparable. The "N-to-N" model takes significantly less time to train than the other methods, since a single model replaces 11 of them, which is meaningful in terms of computational efficiency.
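For reference, SSIM between a synthesized channel and its ground truth can be computed with scikit-image as below; the arrays here are random stand-ins, not data from the paper.

```python
import numpy as np
from skimage.metrics import structural_similarity as ssim

rng = np.random.default_rng(0)
truth = rng.random((256, 256))                 # stand-in for a real marker channel
generated = truth + 0.05 * rng.standard_normal((256, 256))

score = ssim(truth, generated, data_range=generated.max() - generated.min())
print(f"SSIM = {score:.3f}")                   # 1.0 would mean identical images
```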
3.2 Application case 2: Marker channel imputation
Both 7-UP and CyCIF panel reduction are intended for marker channel imputation, providing access to otherwise expensive high-plex (40+ channels) mIF images for studies that can only obtain low-plex images. Interestingly, the two applications use very different methods for imputation.
3.2.1 Application 2.1: 7-UP
7-UP starts from a 7-plex mIF image and generates a high-plex image that can identify up to 16 different cell types (Wu et al. 2023). This approach consists of three main parts:
- Marker panel selection. This part selects the seven markers to start with, using a concrete autoencoder. The concrete autoencoder is a feature selection method whose loss function is the difference between the original sample and the sample reconstructed from the selected low-dimensional features (Balın, Abid, and Zou 2019).
- Morphology feature extraction. This step uses a convolutional neural network to learn morphology features, i.e. the spatial and structural features of cells. Convolutional neural networks resemble stacked layers of linear models, except that each layer applies shared, localized weights (filters) across the image and passes the result through a nonlinearity.
- Marker expression imputation. Once the location and structure of cells are learned, the remaining task is to impute the expression of each marker on each cell. The imputation is performed using XGBoost, a scalable gradient-boosted tree library (Chen and Guestrin 2016); a minimal sketch follows this list.
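A minimal sketch of the imputation step, with one XGBoost regressor per target marker. Feature composition, shapes, and hyperparameters are illustrative assumptions, not 7-UP's exact setup.

```python
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
n_cells = 1000
X = rng.random((n_cells, 7 + 64))   # 7 measured markers + 64 morphology features
y = rng.random(n_cells)             # expression of one held-out marker

# One regressor per target marker; train on some cells, impute the rest.
model = xgb.XGBRegressor(n_estimators=200, max_depth=4, learning_rate=0.1)
model.fit(X[:800], y[:800])
imputed = model.predict(X[800:])
```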
A series of evaluations and analyses are performed to show the validity of the method. The performance of the method is examined in three ways:
- Calculating the Pearson correlation coefficient between the imputed marker expression and the marker expression in the test data.
- Calculating the F1 score between the imputed and the test-data cell types. The F1 score is the harmonic mean of precision and sensitivity: $F_1 = \frac{2 \cdot \text{precision} \cdot \text{sensitivity}}{\text{precision} + \text{sensitivity}}$. Cell types are derived from the marker expression through k-nearest neighbors (a short computation sketch follows this list).
- Patient survival status, HPV status, and disease recurrence are used to further evaluate the cell-type outcomes. AUC scores for patient status prediction are calculated for both the imputed-data outcomes and the training data.
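The first two metrics can be computed with scipy and scikit-learn as below; the values and cell-type labels are toy examples, not the paper's data.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import f1_score

true_expr = np.array([0.2, 0.5, 0.9, 0.4])
imputed_expr = np.array([0.25, 0.45, 0.85, 0.5])
r = pearsonr(true_expr, imputed_expr)[0]       # Pearson correlation coefficient

true_type = ["B", "T", "T", "B"]
imputed_type = ["B", "T", "B", "B"]
f1 = f1_score(true_type, imputed_type, average="macro")  # per-class harmonic mean, averaged
print(f"Pearson r = {r:.3f}, macro F1 = {f1:.3f}")
```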
All evaluations show that the imputation generates results comparable with the training data, hence supporting the validity of this method.
3.2.2 Application 2.2: CyCIF panel reduction
This method is intended as an improvement on the authors' own previous work (Ternes et al. 2022). The previous work first goes through panel selection and then imputes marker channels with a variational autoencoder (VAE). The current, improved method (Sims and Chang 2023) uses a masked autoencoder for image synthesis, as shown in Figure 3.5. The key difference is the adoption of within-model iterative selection of marker panels, as the authors believe that panel selection should be tied more closely to panel reconstruction. Starting with standard DAPI, each candidate marker is added to the panel in turn, the intensities of the other markers are predicted, and the mean Spearman correlation is calculated between the predicted and the real intensities. The marker with the highest correlation is selected, and the next round continues until the panel is fully constructed. The ratio of masked channels depends on the task, though 25%-75% is a reasonable range.
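A sketch of this greedy selection loop, assuming a trained model exposed through a hypothetical `predict_intensities(panel, held_out, data)` function that stands in for the MAE's reconstruction; all names are illustrative.

```python
import numpy as np
from scipy.stats import spearmanr

def select_panel(markers, data, predict_intensities, panel_size, seed=("DAPI",)):
    """Greedy panel construction: add the marker that best predicts the rest."""
    panel = list(seed)
    while len(panel) < panel_size:
        best_marker, best_score = None, -np.inf
        for m in markers:
            if m in panel:
                continue
            held_out = [k for k in markers if k not in panel and k != m]
            pred = predict_intensities(panel + [m], held_out, data)  # hypothetical MAE call
            score = np.mean([spearmanr(data[k], pred[k])[0] for k in held_out])
            if score > best_score:
                best_marker, best_score = m, score
        panel.append(best_marker)   # keep the candidate with the highest mean Spearman
    return panel
```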
The method's outcome is evaluated by Spearman correlation with the true data. The results show that both the MAE and the iterative panel selection outperform the VAE and the out-of-the-box panel selection of the previous method.