3 Missing data imputation in mIF imaging
Like all data created and collected by human beings, missing data is inevitable in mIF images as well. Bao et al. (2021) give a brief summary of the types of missing data in mIF images, as in Figure 3.1. Case 1 in Figure 3.1 refers to the missingness of one or more entire marker channels due to low image quality, which occurs in roughly 3% of cases. Other possible reasons for a missing channel, not described in Bao et al. (2021), include supply shortages of a certain type of fluorescent material or changes in the research plan. Interestingly, the existing marker-channel imputation applications are instead motivated by the expense of high-plex imaging. Due to time and financial constraints, mIF images with no more than seven channels are often more feasible to obtain than 40-channel mIF images (Wu et al. 2023). To overcome the limited availability of cell phenotypes from a small number of markers, imputation of marker channels has been proposed. Case 2 in Figure 3.1 occurs more frequently, when tissue wears off during the staining and wash-off cycles described in Figure 1.2.
Owing to the rapid development of the field of computer vision, all current applications of mIF imputation are implemented with machine learning and/or deep learning methods. In the three applications covered in this document, Bao et al. (2021) use generative adversarial networks (GANs), Wu et al. (2023) use gradient-boosted decision trees in combination with a convolutional neural network, and Sims and Chang (2023) use masked autoencoders (MAEs). All three applications perform well in the evaluations set up in their respective papers. However, the subsequent analysis of imputed images can benefit from statistical thinking about data imputation. This will be discussed further in Chapter 4.
3.1 Application case 1: Missing tissue imputation
3.1.1 Method: GANs
The fundamental version of GANs comprises two components: a discriminator and a generator (Goodfellow et al. 2014). Figure 3.2, by Bok and Langr (2019), gives a brief sketch of how GANs work. Like a turn-based strategy game, the two components take turns within each epoch. Starting from a noise distribution (usually a uniform distribution), the generator's goal is to generate data that is close to the real data. The discriminator's goal is to identify the real data within a mix of real data and data produced by the generator. With the classification error fed back to the generator and the discriminator, the two opponents update their weights: the generator tries to maximize the probability that the discriminator misclassifies generated data as real, and the discriminator tries to maximize its classification accuracy. Over a sufficient number of rounds, they eventually reach a state close to equilibrium, where either party can improve only negligibly: the generator generates close-to-real data, and the discriminator classifies with 50% accuracy (Bok and Langr 2019). This is the point where the algorithm stops.
The value function of GANs can be written as follows:

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))] \tag{3.1}$$

Here $D(x)$ represents the probability that the discriminator classifies $x$ as real data, $p_z(z)$ is the noise distribution, and $G(z)$ is the fake data that the generator creates based on the noise distribution. The discriminator is trained to maximize $V(D, G)$, and the generator is trained to minimize $\mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$, the probability that fake data is recognized and classified as fake.
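To make the two-player game concrete, below is a minimal sketch of one GAN training step in PyTorch. The architectures, dimensions, and hyperparameters are illustrative assumptions, not those of any paper discussed here.

```python
import torch
import torch.nn as nn

# Toy generator and discriminator; real applications would use convolutional nets.
G = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 784), nn.Tanh())
D = nn.Sequential(nn.Linear(784, 128), nn.LeakyReLU(0.2), nn.Linear(128, 1), nn.Sigmoid())
opt_G = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_D = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

def train_step(real):  # real: (batch, 784) tensor of flattened images
    batch = real.size(0)
    ones, zeros = torch.ones(batch, 1), torch.zeros(batch, 1)

    # Discriminator turn: maximize log D(x) + log(1 - D(G(z)))
    z = torch.rand(batch, 64)          # noise from a uniform distribution
    fake = G(z).detach()               # freeze G while updating D
    loss_D = bce(D(real), ones) + bce(D(fake), zeros)
    opt_D.zero_grad(); loss_D.backward(); opt_D.step()

    # Generator turn: fool D into classifying fake data as real
    # (non-saturating variant of minimizing log(1 - D(G(z))))
    z = torch.rand(batch, 64)
    loss_G = bce(D(G(z)), ones)
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()
```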
One disadvantage of the original GANs is their weak control over the generated data, due to the random noise input. This disadvantage stands out especially in image synthesis. Conditional GANs (CGANs) provide a promising solution to this issue by feeding auxiliary information to both the generator and the discriminator (Mirza and Osindero 2014). The auxiliary information is usually data from the same class, for example other images in the case of image synthesis. Suppose $x$ belongs to the input data class, and $y$ is the intended output data class. GANs would learn the mapping $G: z \rightarrow y$, while CGANs learn $G: \{x, z\} \rightarrow y$ (Isola et al. 2017; Souza et al. 2023). The updated value function for CGANs is as follows:

$$\min_G \max_D V(D, G) = \mathbb{E}_{x, y}[\log D(x, y)] + \mathbb{E}_{x, z}[\log(1 - D(x, G(x, z)))] \tag{3.2}$$
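A minimal sketch of how such conditioning can be wired into the generator, assuming the simple concatenation scheme of Mirza and Osindero (2014); the class name and dimensions are illustrative.

```python
import torch
import torch.nn as nn

class CondGenerator(nn.Module):
    """Generator that receives auxiliary information alongside the noise."""

    def __init__(self, z_dim=64, cond_dim=10, out_dim=784):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(z_dim + cond_dim, 128), nn.ReLU(),
            nn.Linear(128, out_dim), nn.Tanh(),
        )

    def forward(self, z, cond):
        # Conditioning: concatenate noise z with the auxiliary input, so the
        # network learns the mapping {x, z} -> y rather than z -> y.
        return self.net(torch.cat([z, cond], dim=1))
```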
Based on this idea, pix2pix was developed. It performs image-to-image translation by training on paired images. Figure 3.3 shows the kinds of translation it can be used for. Intuitively, image imputation can make use of this idea: trained on sets of channels without missing data, a well-trained pix2pix model can generate imputed data.
3.1.2 Application in mIF: pixN2N-HD
pixN2N-HD is a "novel multi-channel high-resolution image synthesis approach", an extension of pix2pix. "N2N" stands for "N-to-N", which distinguishes it from the widely used (N-1)-to-1 design. N is the number of marker channels; in the dataset used in this paper, N = 11. In the (N-1)-to-1 design, 10 channels are used as input and 1 channel as output, and this is repeated across 11 separate models. The "N-to-N" design instead uses a random gate strategy, as shown in Figure 3.4. Let $\tau = (\tau_1, \ldots, \tau_N)$ be the binary vector indicating channels without missing tissue. When $\tau_i$ is turned on, the image of channel $i$ goes into the generator; when it is turned off, a blank image is fed into the generator instead. The value function is similar to Equation 3.1, with the non-missing (gated-on) channels as the input and the gated-off channels as the output.
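A sketch of how such a random gate could be applied to a channel stack during training; this is an illustration of the idea, not the authors' implementation, and the keep-at-least-one-channel rule is an assumption.

```python
import torch

def random_gate(channels: torch.Tensor, p_keep: float = 0.5):
    """channels: (N, H, W) multiplexed image stack.

    Returns the gated generator input and the binary gate vector tau.
    """
    N = channels.size(0)
    tau = (torch.rand(N) < p_keep).float()   # tau_i = 1: channel i is gated on
    tau[torch.randint(N, (1,))] = 1.0        # assumption: keep at least one channel on
    gated = channels * tau.view(N, 1, 1)     # gated-off channels become blank images
    return gated, tau
```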
The paper evaluates model performance by comparing the "N-to-N" model with the "(N-1)-to-1" model and another "(N-1)-to-1 random gate" model, which blends in the random gate but still needs to train 11 separate models. An index for measuring image similarity, the structural similarity index measure (SSIM), is used to assess whether the "N-to-N" model generates results comparable with the other two methods (Wang et al. 2004). The results show that no pair of methods differs significantly at the 0.05 significance level, and the methods are therefore concluded to be comparable. The "N-to-N" model takes significantly less time to train than the other methods, since a single model replaces 11 of them, which is meaningful in terms of computational efficiency.
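For reference, SSIM between a synthesized channel and its ground truth can be computed with scikit-image as below; the arrays here are random stand-ins, not data from the paper.

```python
import numpy as np
from skimage.metrics import structural_similarity as ssim

rng = np.random.default_rng(0)
truth = rng.random((256, 256))                 # stand-in for a real marker channel
generated = truth + 0.05 * rng.standard_normal((256, 256))

score = ssim(truth, generated, data_range=generated.max() - generated.min())
print(f"SSIM = {score:.3f}")                   # 1.0 would mean identical images
```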
3.2 Application case 2: Marker channel imputation
Both 7-UP and CyCIF panel reduction are intended for marker channel imputation, providing access to otherwise expensive high-plex (40+ channels) mIF images for studies that can only obtain low-plex images. Interestingly, the two applications use very different methods for imputation.
3.2.1 Application 2.1: 7-UP
7-UP starts from a 7-plex mIF image and generates a high-plex image that can identify up to 16 different cell types (Wu et al. 2023). This approach consists of three main parts:
- Marker panel selection. This part selects the seven markers to start with, using a concrete autoencoder. The concrete autoencoder is a feature selection method whose loss function is the difference between the original sample and the sample reconstructed from the selected low-dimensional features (Balın, Abid, and Zou 2019).
- Morphology feature extraction. This step uses a convolutional neural network to learn morphology features, i.e. the spatial and structural features of cells. Convolutional neural networks resemble stacked layers of linear models, except that each layer applies shared, localized weights (filters) across the image and passes the result through a nonlinearity.
- Marker expression imputation. Once the location and structure of cells are learned, the remaining task is to impute the expression of each marker on each cell. The imputation is performed using XGBoost, a scalable gradient-boosted tree library (Chen and Guestrin 2016); a minimal sketch follows this list.
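A minimal sketch of the imputation step, with one XGBoost regressor per target marker. Feature composition, shapes, and hyperparameters are illustrative assumptions, not 7-UP's exact setup.

```python
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
n_cells = 1000
X = rng.random((n_cells, 7 + 64))   # 7 measured markers + 64 morphology features
y = rng.random(n_cells)             # expression of one held-out marker

# One regressor per target marker; train on some cells, impute the rest.
model = xgb.XGBRegressor(n_estimators=200, max_depth=4, learning_rate=0.1)
model.fit(X[:800], y[:800])
imputed = model.predict(X[800:])
```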
A series of evaluations and analyses are performed to show the validity of the method. The performance of the method is examined in three ways:
- Calculating the Pearson correlation coefficient between the imputed marker expression and the marker expression in the test data.
- Calculating the F1 score between the imputed and the test-data cell types. The F1 score is the harmonic mean of precision and sensitivity: $F_1 = \frac{2 \cdot \text{precision} \cdot \text{sensitivity}}{\text{precision} + \text{sensitivity}}$. Cell types are derived from the marker expression through k-nearest neighbors (a short computation sketch follows this list).
- Patient survival status, HPV status, and disease recurrence are used to further evaluate the cell-type outcomes. AUC scores for patient status prediction are calculated for both the imputed-data outcomes and the training data.
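The first two metrics can be computed with scipy and scikit-learn as below; the values and cell-type labels are toy examples, not the paper's data.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import f1_score

true_expr = np.array([0.2, 0.5, 0.9, 0.4])
imputed_expr = np.array([0.25, 0.45, 0.85, 0.5])
r = pearsonr(true_expr, imputed_expr)[0]       # Pearson correlation coefficient

true_type = ["B", "T", "T", "B"]
imputed_type = ["B", "T", "B", "B"]
f1 = f1_score(true_type, imputed_type, average="macro")  # per-class harmonic mean, averaged
print(f"Pearson r = {r:.3f}, macro F1 = {f1:.3f}")
```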
All evaluations show that the imputation generates results comparable with the training data, hence supporting the validity of this method.
3.2.2 Application 2.2: CyCIF panel reduction
This method is intended as an improvement on the authors' own previous work (Ternes et al. 2022). The previous work first goes through panel selection and then imputes marker channels with a variational autoencoder (VAE). The current, improved method (Sims and Chang 2023) uses a masked autoencoder for image synthesis, as shown in Figure 3.5. The key difference is the adoption of within-model iterative selection of marker panels, as the authors believe that panel selection should be tied more closely to panel reconstruction. Starting with standard DAPI, each candidate marker is added to the panel in turn, the intensities of the other markers are predicted, and the mean Spearman correlation is calculated between the predicted and the real intensities. The marker with the highest correlation is selected, and the next round continues until the panel is fully constructed. The ratio of masked channels depends on the task, though 25%-75% is a reasonable range.
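A sketch of this greedy selection loop, assuming a trained model exposed through a hypothetical `predict_intensities(panel, held_out, data)` function that stands in for the MAE's reconstruction; all names are illustrative.

```python
import numpy as np
from scipy.stats import spearmanr

def select_panel(markers, data, predict_intensities, panel_size, seed=("DAPI",)):
    """Greedy panel construction: add the marker that best predicts the rest."""
    panel = list(seed)
    while len(panel) < panel_size:
        best_marker, best_score = None, -np.inf
        for m in markers:
            if m in panel:
                continue
            held_out = [k for k in markers if k not in panel and k != m]
            pred = predict_intensities(panel + [m], held_out, data)  # hypothetical MAE call
            score = np.mean([spearmanr(data[k], pred[k])[0] for k in held_out])
            if score > best_score:
                best_marker, best_score = m, score
        panel.append(best_marker)   # keep the candidate with the highest mean Spearman
    return panel
```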
The method's outcome is evaluated by Spearman correlation with the true data. The results show that both the MAE and the iterative panel selection outperform the VAE and the out-of-the-box panel selection of the previous method.