Style transfer as data augmentation: evaluating unpaired image-to-image translation models in mammography

Emir Ahmed¹, Spencer A. Thomas¹ and Ciaran Bench¹

¹E. Ahmed, S. Thomas, and C. Bench are with the Department of Data Science and AI, National Physical Laboratory, Hampton Road, Teddington, United Kingdom, TW11 0LW. [email protected]

Abstract

Several studies indicate that deep learning models can learn to detect breast cancer from mammograms (X-ray images of the breasts). However, challenges with overfitting and poor generalisability prevent their routine use in the clinic. Models trained on data from one patient population may not perform well on another due to differences in their data domains, emerging due to variations in scanning technology or patient characteristics. Data augmentation techniques can be used to improve generalisability by expanding the diversity of feature representations in the training data by altering existing examples. Image-to-image translation models are one approach capable of imposing the characteristic feature representations (i.e. style) of images from one dataset onto another. However, evaluating model performance is non-trivial, particularly in the absence of ground truths (a common reality in medical imaging). Here, we describe some key aspects that should be considered when evaluating style transfer algorithms, highlighting the advantages and disadvantages of popular metrics, and important factors to be mindful of when implementing them in practice. We consider two types of generative models: a cycle-consistent generative adversarial network (CycleGAN) and a diffusion-based SynDiff model. We learn unpaired image-to-image translation across three mammography datasets. We highlight that undesirable aspects of model performance may determine the suitability of some metrics, and also provide some analysis indicating the extent to which various metrics assess unique aspects of model performance. We emphasise the need to use several metrics for a comprehensive assessment of model performance.

Clinical relevance— Image-to-image translation models are used to augment the training sets of disease classifiers, which can supplement small datasets and potentially reduce bias. This can improve model generalisability, equitability and performance, and hence, patient outcomes. This work describes important factors that need to be considered to effectively evaluate their performance.

I Introduction

Mammography is a widely used X-ray breast imaging technique that facilitates the early detection of cancer. One in eight women in the United States will develop breast cancer at some point during their lifetime, and screening programs have drastically improved patient outcomes [1]. Yet the need for human clinicians to manually inspect images places a heavy burden on healthcare systems. Automated diagnostic protocols based on deep neural networks have the potential to help clinicians make their diagnoses more efficiently and accurately [2]. However, poor generalisability to unseen data is a major barrier to routine application. For example, a deep learning model trained to classify whether a patient has breast cancer may not perform as well when applied to mammograms from different scanners [3]. This is because the data domains describing the two sets of images differ. A data domain $\mathcal{D}=\{\chi,P(x),P(x,y)\}$ is composed of an input feature space $\chi$ (a vector space containing all image features), a marginal distribution $P(x)$ , and a joint probability distribution $P(x,y)$ . Here, $x$ is a member of the set $X$ of $N$ training example inputs $x_{1},x_{2},\ldots,x_{N}\in X$ and $y$ is a member of the corresponding set of ground truths $y_{1},y_{2},\ldots,y_{N}\in Y$ . A significant discrepancy between $\mathcal{D}_{\text{train}}$ and $\mathcal{D}_{\text{test}}$ results in poor generalisability of models trained on data from $\mathcal{D}_{\text{train}}$ when applied to data from $\mathcal{D}_{\text{test}}$ .

Expanding the training set with additional/diverse examples can help align the two domains. However, collating a large set of mammograms from different sources is challenging due to patient privacy laws and complications with coordinating data collection efforts across institutions. Instead, data augmentation schemes can be used to modify existing datasets to produce additional training examples. These modifications often take the form of generic geometric transformations, such as flipping and warping [4]. However, more complex and domain-specific transformations that align source feature representations with those of the target domain can be applied using certain types of deep neural networks. This task is referred to as image-to-image translation, performed by neural style transfer models that adapt source images to appear more like images from the target domain [5]. These models have been used in mammography studies, where the inclusion of augmented examples has improved the performance of classification models [6]. With that said, style transfer models are known to suffer from overfitting and other undesirable behaviours, motivating a need to carefully evaluate their performance.

There are two key concepts to assess: the degree to which the target style (i.e. the feature representations for common objects) has been imposed onto the source domain, and whether the transformed/adapted image is still representative of the original tissue [7, 8]. Ideally, a model makes an image of source domain tissue appear as though it had been acquired using the same technology used to acquire images from the target domain. However, when considering images acquired from different patient populations, differences in tissue composition (e.g. density) may be imposed on the source domain images in addition to the stylistic features imposed by the properties of the different scanners. This assessment is particularly challenging in the absence of ground truths (e.g. images of the same tissue acquired with a different scanner), requiring the use of higher-order statistical measures of stylistic content. While several metrics have been proposed, none provides a fully comprehensive assessment of model performance [9].

II Methods

II-A Outline

We conduct experiments that exhibit the use of various metrics and highlight important aspects to consider when evaluating the performance of style transfer models. We employ the two most common frameworks for generative modelling by using a cycle-consistent generative adversarial network (CycleGAN) [10] and a diffusion-based SynDiff model [7]. We train these to perform unpaired image-to-image translation on patches parsed from images from three open source mammography datasets: VinDr Mammo (VDM) [11], the Chinese Mammography Database (CMMD) [12], and the Curated Breast Imaging Subset of the Digital Database for Screening Mammography (CBIS-DDSM) [13].

II-B Evaluating style

Differences in the styles of two image sets are typically assessed by observing the similarities between the distributions of features found in adapted source and target domain images (i.e. how often particular features are found in either dataset). Manual feature extraction-based methods, such as gray-level co-occurrence matrices and Gabor filters, can be used to acquire features contained in the images. However, these do not necessarily capture the full extent of features that define the style of a domain. Networks trained to perform tasks that require the encoding of more complex feature representations (e.g. classification networks) may be preferable.

The Fréchet Inception Distance (FID) [14] is one popular metric that implements the Inception-V3 architecture to extract activations from images drawn from a given domain (e.g. the adapted source images $\hat{x}_{1},\hat{x}_{2},...,\hat{x}_{N}\in\hat{X}$ ). The mean $\mu_{\hat{x}}$ and covariance $\Sigma_{\hat{x}}$ of the resultant activations are used to parameterise a multivariate Gaussian distribution $\mathcal{N}_{{\hat{x}}}(\mu_{\hat{x}},\Sigma_{\hat{x}})$ , that is then compared to the corresponding distribution computed from a set of target images (unpaired) $\mathcal{N}_{y}(\mu_{y},\Sigma_{y})$ with a Fréchet distance:

$$\mathrm{FID}(\hat{X},Y)=d_{F}\left(\mathcal{N}_{\hat{x}}(\mu_{\hat{x}},\Sigma_{\hat{x}}),\mathcal{N}_{y}(\mu_{y},\Sigma_{y})\right)^{2}=\left\|\mu_{\hat{x}}-\mu_{y}\right\|_{2}^{2}+\mathrm{tr}\left(\Sigma_{\hat{x}}+\Sigma_{y}-2\left(\Sigma_{\hat{x}}\Sigma_{y}\right)^{\frac{1}{2}}\right).\tag{1}$$

However, modelling the distribution as a multivariate Gaussian is a strong assumption to make about the distributions of activations, especially considering the representations are produced from layers with ReLU activations, whose distribution is inherently unsmooth. The authors of [15] note that typically 2% of Inception representations were zero, indicating a complex, non-Gaussian form for the distribution of activations in their experiments.
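As a minimal sketch of Eq. 1, the Fréchet distance can be computed directly from two arrays of activations. The function name `frechet_distance` and the random arrays standing in for Inception-V3 activations are illustrative assumptions, not part of the pipeline used in this work. The trace of the matrix square root is obtained from the eigenvalues of $\Sigma_{\hat{x}}\Sigma_{y}$, which avoids the need for a dedicated matrix square root routine.

```python
import numpy as np

def frechet_distance(act_a, act_b):
    """Squared Frechet distance between Gaussians fitted to two activation sets.

    act_a, act_b: arrays of shape (num_images, num_features).
    """
    mu_a, mu_b = act_a.mean(axis=0), act_b.mean(axis=0)
    sigma_a = np.cov(act_a, rowvar=False)
    sigma_b = np.cov(act_b, rowvar=False)
    # tr((Sigma_a Sigma_b)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of Sigma_a Sigma_b, which are real and non-negative for
    # positive semi-definite covariances (up to numerical noise).
    eigvals = np.linalg.eigvals(sigma_a @ sigma_b)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    mean_term = np.sum((mu_a - mu_b) ** 2)
    return float(mean_term + np.trace(sigma_a) + np.trace(sigma_b) - 2.0 * tr_sqrt)

# Hypothetical stand-ins for activations of adapted source and target images.
rng = np.random.default_rng(0)
adapted = rng.normal(0.0, 1.0, size=(500, 8))
target = rng.normal(0.5, 1.0, size=(500, 8))

fid_same = frechet_distance(adapted, adapted)  # close to zero
fid_diff = frechet_distance(adapted, target)   # dominated by the mean shift
```

Only the first two moments of the activations enter this calculation, which is why the Gaussian assumption discussed above matters.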

The Kernel Inception Distance (KID) [15] is an alternative metric that measures the squared maximum mean discrepancy between activation distributions. Here, a polynomial kernel is used to assess similarities of activations for the different images,

$$k(\alpha,\omega)=\left(\frac{1}{d}\alpha^{\top}\omega+1\right)^{3},\tag{2}$$

where $\alpha$ and $\omega$ are activations produced from two inputs and $d$ is their dimensionality. Here, we represent the application of this kernel onto the activations produced using two different images, e.g., $x_{1}$ and $\hat{x}_{1}$ , with $k(\phi(x_{1}),\phi(\hat{x}_{1}))$ , where $\phi$ represents application of Inception-V3 to acquire activations. The KID uses the polynomial kernel to assess the similarity of the adapted source images with themselves $k(\phi(\hat{x}),\phi(\hat{x}))$ , the target images with themselves $k(\phi(y),\phi(y))$ , and between the adapted source and target images $k(\phi(\hat{x}),\phi(y))$ and $k(\phi(y),\phi(\hat{x}))$ . The kernel values are acquired and used to compute the KID, where $m$ and $n$ are the total number of images for the adapted source and target, respectively:

$$\begin{aligned}\mathrm{KID}(\hat{X},Y)=&\ \frac{1}{m(m-1)}\sum_{i\neq j}^{m}k(\phi(\hat{x}_{i}),\phi(\hat{x}_{j}))\\ &+\frac{1}{n(n-1)}\sum_{i\neq j}^{n}k(\phi(y_{i}),\phi(y_{j}))\\ &-\frac{2}{mn}\sum_{i=1}^{m}\sum_{j=1}^{n}k(\phi(\hat{x}_{i}),\phi(y_{j})).\end{aligned}\tag{3}$$
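The three sums in the KID can be sketched with a few matrix operations. The snippet below is an illustrative NumPy version operating on precomputed activations; the function names and the synthetic activations are assumptions made for the example, and a real evaluation would use Inception-V3 features instead.

```python
import numpy as np

def polynomial_kernel(a, b):
    """Cubic polynomial kernel applied to all pairs of rows of a and b."""
    d = a.shape[1]
    return (a @ b.T / d + 1.0) ** 3

def kid(act_x, act_y):
    """Unbiased squared maximum mean discrepancy between two activation sets."""
    m, n = len(act_x), len(act_y)
    k_xx = polynomial_kernel(act_x, act_x)
    k_yy = polynomial_kernel(act_y, act_y)
    k_xy = polynomial_kernel(act_x, act_y)
    # Exclude the i == j terms from the two within-set sums.
    sum_xx = k_xx.sum() - np.trace(k_xx)
    sum_yy = k_yy.sum() - np.trace(k_yy)
    # k_xy.mean() is the double sum divided by m * n.
    return float(sum_xx / (m * (m - 1)) + sum_yy / (n * (n - 1)) - 2.0 * k_xy.mean())

# Hypothetical activations: two draws from one distribution, and a shifted one.
rng = np.random.default_rng(0)
act_a = rng.normal(0.0, 1.0, size=(200, 16))
act_b = rng.normal(0.0, 1.0, size=(200, 16))
act_c = rng.normal(1.0, 1.0, size=(200, 16))

kid_same = kid(act_a, act_b)  # near zero, and may be slightly negative
kid_diff = kid(act_a, act_c)  # clearly positive
```

Because the estimator is unbiased, small negative values are possible when the two sets come from the same distribution.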

Aside from not requiring a Gaussian distribution, another benefit of using the KID is that the cubic kernel in Eq. 2 enables a comparison of the skewness as well as the mean and variance of the activation distributions. The additional moment provides a more thorough assessment of the similarity between the distributions. The KID has been reported to be more effective than the FID on smaller sets of images [15], which is appealing in the context of mammography given the typical scarcity of available data. However, the Inception-V3 network used in the FID and KID is pretrained on the ImageNet dataset (image classification). Because it was trained to classify natural images, it is not optimised to detect features that characterise the style of mammography images. Nonetheless, both the FID and KID are widely used in medical imaging studies in the absence of a domain-specific feature extraction network. Other metrics like the CLIP (Contrastive Language-Image Pre-Training) Maximum Mean Discrepancy (CLIP-MMD) [16] use CLIP embeddings that may capture a more diverse range of feature representations, but these are still not specialised to detect information relevant to characterising mammography images. Here, we simply employ the more commonly used KID and FID.

II-C Evaluating content preservation

Assessing stylistic content alone fails to capture another important aspect of model performance: content preservation. We desire translated images to not only contain features characteristic of the target domain, but also still be representative of the original tissue. The preservation of the image’s structural content, or the underlying tissue components and anatomical features, is often used to assess this aspect of model performance. The key challenge is disentangling this lower-order information from the higher-order statistical information related to style [17]. A hypothetical, ideal metric would return an optimal score when processing two images of the same tissues acquired with different scanners. However, the metrics typically used to assess this aspect of model performance exhibit this behaviour to varying degrees. Here, we consider reference-based metrics (i.e. those that compare the unadapted source image with an adapted version) because these provide a more direct assessment of changes in content post-translation compared to metrics that do not incorporate a comparison with a reference image.

The mean squared error (MSE) between a source image and its adapted counterpart is one crude metric [18]. However, its optimal value occurs when no adaptation is applied to the image. It also fails to consider spatial dependencies/correlations that can be used to effectively define lower-level structural content. The peak signal-to-noise ratio (PSNR) [7] is another metric that contains an MSE term, and so suffers similar drawbacks.
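A short sketch makes the first drawback concrete: both metrics reach their optimum when the adapted image is identical to the source. The helper names and the synthetic images are illustrative assumptions, with pixel values taken to lie in [0, 1].

```python
import numpy as np

def mse(reference, adapted):
    """Mean squared error between two images of equal shape."""
    return float(np.mean((reference - adapted) ** 2))

def psnr(reference, adapted, data_range=1.0):
    """Peak signal-to-noise ratio in dB; infinite for identical images."""
    err = mse(reference, adapted)
    if err == 0.0:
        return float("inf")
    return float(10.0 * np.log10(data_range ** 2 / err))

# Hypothetical source patch and a perturbed copy standing in for an adapted patch.
rng = np.random.default_rng(0)
source = rng.random((256, 256))
adapted = np.clip(source + rng.normal(0.0, 0.05, source.shape), 0.0, 1.0)

mse_identity = mse(source, source)    # optimal value, reached with no adaptation
psnr_adapted = psnr(source, adapted)  # finite once any change is applied
```

Since the PSNR is a monotonic function of the MSE for a fixed data range, it ranks image pairs in the reverse order to the MSE.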

The Structural Similarity Index Measure (SSIM) is widely used in mammography and other medical imaging studies [19]. In contrast to MSE or PSNR, which use pixel-wise comparisons of image amplitude, differences in the contrast and luminance between groups of pixels are used to measure structure. Here, contextual information that is important to defining structural features is considered, making the metric better equipped to detect distortions which typically span regions of pixels [20]. With that said, SSIM is known to be highly sensitive to geometric and scale distortions [19].
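To illustrate how luminance, contrast and structure enter the score, the sketch below evaluates the SSIM expression once over the whole image. This is a deliberate simplification: practical implementations, such as `structural_similarity` in scikit-image, evaluate the same expression over local windows and average the result. The function name `global_ssim` is an assumption for this example, and the constants follow the commonly used defaults $K_1=0.01$ and $K_2=0.03$.

```python
import numpy as np

def global_ssim(x, y, data_range=1.0, k1=0.01, k2=0.03):
    """Single-window SSIM computed from whole-image statistics."""
    c1 = (k1 * data_range) ** 2
    c2 = (k2 * data_range) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = np.mean((x - mu_x) * (y - mu_y))
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return float(numerator / denominator)

# Hypothetical image with values in [0, 1].
rng = np.random.default_rng(0)
image = rng.random((128, 128))

ssim_identity = global_ssim(image, image)        # equals 1 for identical images
ssim_inverted = global_ssim(image, 1.0 - image)  # negative for inverted structure
```

The covariance term is what distinguishes this measure from pixel-wise errors: it rewards images whose intensities vary together, and it turns negative when the structure is inverted.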

The Deep Image Structure and Texture Similarity (DISTS) [8] is another metric, where activations from an image are extracted using a unique variant of the VGG (Visual Geometry Group) model pre-trained to classify ImageNet. Here, an image from the source domain dataset is passed through the network, where activations from several layers are acquired and normalised. Activations are also produced from its corresponding adapted source image. The covariance between the corresponding activations as well as the variance of the activations are used to assess structural preservation, while textural similarity is assessed using the mean of the activations from each image. The metric combines these measurements from different convolution layers using a weighted sum. The textural similarity, in part, defines the style of the images, which is an unappealing aspect of using this metric to assess content preservation. Another drawback is, like the FID, the activations are produced from a model trained to classify natural images, and so may not be effective at encoding the structural content of mammography images [21].

The Feature Similarity Index for Image Quality Assessment (FSIM) [17] is another metric that uses differences in the phase congruency and gradient magnitudes between pairs made up of an unadapted image and its adapted counterpart to assess changes in image quality. Phase congruency assesses the significance of local structures using Fourier components (based on the premise that visually discernible features coincide with points where the Fourier waves at different frequencies have congruent phases), while the gradient magnitudes detect variations in contrast important to defining structure. The authors argue that the use of phase congruency as the key feature provides a more effective means of extracting low-level feature information related to encoding structural content.

II-D Practical considerations

The effectiveness of some metrics may be influenced by processing applied to the output, or artefacts imposed onto the image by the model. For example, some models may impose offsets on translated images that can affect the use of structural similarity metrics. Correcting these artefacts with other software packages may introduce further artefacts that should be considered when evaluating model performance.

Some metrics (e.g. the continuous wavelet variant of the SSIM discussed in Section III-B [22]) may require images in a particular format, where any conversions may result in a loss of information if not performed carefully (e.g. float32 to unsigned 8-bit integers). Numerical precision may also affect the efficacy of some metrics (e.g. the FID, as will be discussed in Section III-D). These factors should be carefully considered when using evaluation metrics, or developing post-processing protocols.
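The float32 to unsigned 8-bit conversion mentioned above can be sketched as follows. The array is synthetic and the variable names are illustrative; the point is that the scaling and rounding must be explicit, and that some quantisation error remains even when the conversion is done carefully.

```python
import numpy as np

rng = np.random.default_rng(0)
img_f32 = rng.random((256, 256), dtype=np.float32)  # values in [0, 1)

# Scale, round and clip before casting to 8-bit integers.
img_u8 = np.clip(np.rint(img_f32 * 255.0), 0, 255).astype(np.uint8)
restored = img_u8.astype(np.float32) / 255.0

# The round trip is lossy: the error is bounded by half a grey level.
max_error = float(np.abs(restored - img_f32).max())

# A bare cast truncates towards zero, so every value in [0, 1) becomes 0.
naive = img_f32.astype(np.uint8)
```

With 256 grey levels, the worst-case round-trip error is half a level, i.e. 0.5/255 on a [0, 1] scale.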

II-E Data

We train a CycleGAN and a SynDiff model on 1500 image patches of 256 $\times$ 256 pixels from each domain for all six tasks (style transfer each way between every pair of the three datasets). We then evaluate 3500 test patches, and apply the metrics discussed in Section II, providing commentary on their use and interpretation.

The characteristics of each dataset are outlined in Table LABEL:tab:Metadata. The variation in the physiological characteristics of the patient populations and in scanning technology provides a significant challenge for the translation models.

We considered patches rather than full-scale mammography images to alleviate issues with memory consumption. These were parsed from whole mammography images in 256 $\times$ 256 pixel patches using a step size of 246 pixels. Only patches containing 99% of pixels with non-zero values were included in the dataset.
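The patch parsing described above can be sketched as a sliding window. The function name `extract_patches` is hypothetical, and treating the 99% criterion as a lower bound (at least 99% non-zero pixels) is an assumption of this sketch rather than a detail confirmed by the text.

```python
import numpy as np

def extract_patches(image, patch_size=256, step=246, min_nonzero_fraction=0.99):
    """Collect square patches whose fraction of non-zero pixels meets the threshold."""
    patches = []
    height, width = image.shape
    for top in range(0, height - patch_size + 1, step):
        for left in range(0, width - patch_size + 1, step):
            patch = image[top:top + patch_size, left:left + patch_size]
            # Assumed rule: keep the patch if at least 99% of pixels are non-zero.
            if np.count_nonzero(patch) / patch.size >= min_nonzero_fraction:
                patches.append(patch)
    return patches

# Hypothetical 2224 x 2224 image: fully non-zero, then with a blank left half.
full = np.ones((2224, 2224), dtype=np.float32)
half = full.copy()
half[:, :1112] = 0.0

num_full = len(extract_patches(full))
num_half = len(extract_patches(half))
```

With a step of 246 pixels, neighbouring patches overlap by 10 pixels, and a 2224-pixel side yields nine window positions.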

Before decomposing each mammogram into patches, the breast was segmented from the background for VDM and CBIS-DDSM images using Otsu’s method [23] (CMMD was already segmented). We applied contrast inversion to images from all three datasets that had photometric interpretation set to MONOCHROME1. We also horizontally flipped all ‘right’ laterality images. Each image was normalised so the maximum pixel value was 1, and the minimum was 0. Padding was applied to images so they had dimensions of 2224 $\times$ 2224 pixels. Histogram equalisation was applied to each patch and the values were then normalised. 1500 patches and 3500 patches from each dataset were collated for training and testing, respectively (with no overlap across the training/test sets at the patch or whole image level). The number of training patches was in part determined to accommodate training times with the SynDiff model (i.e. 100 epochs within 4 days) using the available computational resources (NVIDIA A100 GPU, and an AMD EPYC 7643 2.3 GHz CPU) [9].
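A few of these steps can be sketched in NumPy. The sketch below covers contrast inversion, horizontal flipping, min-max normalisation and padding only; segmentation with Otsu's method and histogram equalisation are omitted. The function name, the `"R"` laterality code, the choice to pad at the bottom and right edges, and the handling of only those images no larger than the target size are all assumptions made for illustration.

```python
import numpy as np

def preprocess(pixels, photometric="MONOCHROME2", laterality="L", size=2224):
    """Invert, flip, normalise to [0, 1] and zero-pad a single mammogram."""
    img = np.asarray(pixels, dtype=np.float32)
    # MONOCHROME1 stores low values as bright, so invert the contrast.
    if photometric == "MONOCHROME1":
        img = img.max() - img
    # Flip right-laterality images so every breast faces the same way.
    if laterality == "R":
        img = img[:, ::-1]
    # Min-max normalisation; a constant image maps to all zeros.
    lo, hi = float(img.min()), float(img.max())
    img = (img - lo) / (hi - lo) if hi > lo else np.zeros_like(img)
    # Zero-pad at the bottom and right (assumed placement) up to the target size.
    pad_rows = max(size - img.shape[0], 0)
    pad_cols = max(size - img.shape[1], 0)
    return np.pad(img, ((0, pad_rows), (0, pad_cols)), mode="constant")

# Tiny hypothetical example so the effect of each step is easy to follow.
demo = preprocess(np.arange(12).reshape(3, 4), "MONOCHROME1", "R", size=6)
```

Normalising before padding means the padded background shares the value of the darkest pixel in the image, which keeps it at zero for the patch filtering step.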

All adapted images (model outputs) were normalised with pixel values in the range [0,1] before calculating any evaluation metric/subsequent post-processing steps.

II-F Training

We use the default parameterisations of the CycleGAN and SynDiff models, aside from an increased cycle-consistency weight of 1000 for the SynDiff models. Our aim here is not to compare which model is better, but rather to use the results/unique behaviours of each to emphasise important aspects of evaluation. We also use the default number of training epochs associated with each implementation (100 for SynDiff and 200 for CycleGAN), as this was reported to produce sensible results on related experiments [10, 7].

III Experiments and Results

III-A Evaluating content preservation (CycleGAN)

TABLE I: Content Preservation Metrics: CycleGAN

Task	SSIM	PSNR	MSE	DISTS	FSIM
CMMD $\rightarrow$ CBIS	0.86	23.7	0.0048	0.1378	0.91
CBIS $\rightarrow$ CMMD	0.95	33.5	0.0016	0.0656	0.97
CMMD $\rightarrow$ VDM	0.96	29.5	0.0018	0.0903	0.97
VDM $\rightarrow$ CMMD	0.95	27.0	0.0024	0.1333	0.96
VDM $\rightarrow$ CBIS	0.90	24.3	0.0046	0.1081	0.95
CBIS $\rightarrow$ VDM	0.96	33.0	0.0015	0.0707	0.98

We use the scikit-image package (https://scikit-image.org/) to calculate SSIM, PSNR and MSE, and other code repositories to calculate DISTS (https://github.com/dingkeyan93/DISTS) and FSIM (https://piq.readthedocs.io/en/latest/). FSIM scores typically range from 0 to 1, where 1 indicates perfect similarity between two images; however, values greater than 1 can occur if the adapted source image has greater contrast than its unadapted counterpart. DISTS scores also lie in the range [0,1], but here 0 represents perfect similarity. The bounds for SSIM are -1 and 1, where negative structural similarity values correspond to cases where the local image structures are inverted. PSNR has a lower bound of 0 but no upper bound, making it challenging to assess which score indicates a ‘good quality’ of preservation.

In Table I we see high scores for SSIM and FSIM, and low scores for MSE and DISTS for all tasks, indicating that much of the structural content of the tissue has been preserved. Fig. LABEL:fig:correlation_matrix displays a correlation matrix between the five metrics for the VDM $\rightarrow$ CBIS task. We choose Spearman’s correlation coefficient as we are interested in assessing the monotonicity of the relationship between pairs of metrics. While a low correlation between two metrics computed for a given test set does not necessarily indicate that each metric considers different aspects of image content, assessing this provides a useful starting point for a deeper analysis of their distinct qualities, and whether in combination they may provide a more comprehensive assessment of model performance. In Fig. LABEL:fig:correlation_matrix we see that PSNR and SSIM have a moderate positive correlation, which broadly agrees with the scatterplot of the two metrics shown in Fig. LABEL:vindr_cbis_psnr_ssim, where the lower cluster of qualitatively less correlated points accompanies a smaller cluster of more highly correlated points. This modest correlation is not entirely unexpected given the different ways in which structure is assessed by the two metrics. The correlation between SSIM and MSE has the same magnitude but the opposite sign, which is expected given the inverse relation between MSE and PSNR. We also see modest correlations when comparing all these metrics with DISTS. In addition to the unique manner in which it acquires information about structure, DISTS also encodes information about style/textural features, which likely contributes to the lower correlation with these more structure-focused metrics. The FSIM correlates strongly with the SSIM, which could suggest that both ultimately use similar information when assessing structural content.
The variation we observe in the correlations between different metrics highlights that each considers different aspects of image content to assess structure, and that several should be used to provide a more comprehensive assessment of model performance.
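A correlation matrix of this kind can be sketched by ranking the per-image scores of each metric and taking the Pearson correlation of the ranks, which is the definition of Spearman's coefficient. The helper names and the synthetic scores are assumptions for illustration and do not reproduce the values reported here.

```python
import numpy as np

def average_ranks(values):
    """Ranks starting at 1, with tied values sharing the mean of their ranks."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="mergesort")
    ranks = np.empty(len(values), dtype=float)
    ranks[order] = np.arange(1, len(values) + 1)
    _, inverse, counts = np.unique(values, return_inverse=True, return_counts=True)
    rank_sums = np.bincount(inverse, weights=ranks)
    return (rank_sums / counts)[inverse]

def spearman_matrix(scores):
    """scores: dict mapping a metric name to its per-image scores."""
    ranked = np.vstack([average_ranks(v) for v in scores.values()])
    return np.corrcoef(ranked)

# Hypothetical per-image scores; PSNR is derived from MSE as in its definition.
rng = np.random.default_rng(0)
mse_scores = rng.uniform(0.001, 0.01, size=200)
psnr_scores = 10.0 * np.log10(1.0 / mse_scores)
ssim_scores = 1.0 - 20.0 * mse_scores + rng.normal(0.0, 0.05, size=200)

corr = spearman_matrix({"MSE": mse_scores, "PSNR": psnr_scores, "SSIM": ssim_scores})
```

Because the PSNR is a strictly decreasing function of the MSE when both are computed per image, their rank correlation is exactly -1, so any metric has correlations of equal magnitude and opposite sign with the two.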

III-B Evaluating content preservation (SynDiff)

TABLE II: Content Preservation Metrics: SynDiff

Task	SSIM	PSNR	CW-SSIM	DISTS
CMMD $\rightarrow$ CBIS	0.71	21.4	0.93	0.1427
CBIS $\rightarrow$ CMMD	0.75	20.9	0.92	0.1260
CMMD $\rightarrow$ VDM	0.57	19.1	0.96	0.1453
VDM $\rightarrow$ CMMD	0.59	19.4	0.95	0.1256
VDM $\rightarrow$ CBIS	0.52	19.6	0.94	0.1200
CBIS $\rightarrow$ VDM	0.54	19.7	0.94	0.1040

The SynDiff model was found to introduce small offsets of a few pixels in translated images. This resulted in low values for the more generic structural similarity metrics (SSIM and PSNR in Table II). However, qualitatively, the image contents otherwise appeared to be of high quality. Some variants of common metrics are more tolerant of small distortions and are better suited to assessing data that exhibits these undesirable model behaviours. We select the continuous wavelet variant of SSIM (CW-SSIM), which has been used in other mammography-based studies [19], and we use DISTS, as in Table I, which is tolerant to mild geometric translations. Even though SSIM and PSNR are known to be sensitive to such distortions, we retain their scores in our results to show a comparison between sensitive and insensitive metrics. The results in Table II indicate that SSIM and PSNR are heavily affected by the offset imposed by SynDiff, resulting in lower scores compared to CycleGAN. With that said, the more tolerant metrics (CW-SSIM and DISTS) show that much of the structure is preserved post-adaptation. This is in line with the results achieved with more generic structural similarity metrics on versions of the image where this offset has been corrected (described shortly), indicating that these more tolerant metrics are assessing underlying structure effectively in the face of the offset.
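The sensitivity of pixel-wise metrics to small offsets can be illustrated on a synthetic pattern. In the sketch below, a circular shift from `np.roll` stands in for a model-induced offset; the test pattern and helper name are assumptions for the example, and the image content is otherwise left unchanged.

```python
import numpy as np

def psnr(reference, test, data_range=1.0):
    """Peak signal-to-noise ratio in dB; infinite for identical images."""
    err = np.mean((reference - test) ** 2)
    return float("inf") if err == 0 else float(10.0 * np.log10(data_range ** 2 / err))

# Hypothetical smooth test pattern with values in [0, 1] and a 32-pixel period.
yy, xx = np.mgrid[0:256, 0:256]
image = 0.5 + 0.5 * np.sin(2 * np.pi * xx / 32) * np.sin(2 * np.pi * yy / 32)

# PSNR between the image and horizontally shifted copies of itself.
psnr_by_shift = {shift: psnr(image, np.roll(image, shift, axis=1)) for shift in range(5)}
```

The score falls steadily as the shift grows from one to four pixels, even though every shifted copy contains exactly the same structures as the original.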

Figure 1: Example image pair of sections of adapted CBIS image patches a) before and b) after registration, for the CBIS $\rightarrow$ VDM task. The adapted CBIS output, after registration, contains some blur, which could have an impact on SSIM and PSNR scores, c) and d), respectively.

We note that in some cases, it may be preferable to correct artefacts like offsets (e.g. to facilitate uncertainty quantification, which may require pixel-wise comparisons of image content) instead of using metrics that tolerate them. However, these can impose their own distortions on image contents, and their effects should be considered when evaluating model performance.

Here, we demonstrate this approach by correcting the offsets imposed on images by the SynDiff model using a popular processing package, ANTsPy (https://github.com/ANTsX/ANTsPy). We notice that the observed offset differs in magnitude and direction for each SynDiff model. We use the translation-mode registration functionality of the ANTsPy package to correct the offsets. We trained a SynDiff model for the CBIS $\leftrightarrow$ VDM tasks, distinct from the model used to collate the results in Table II, and we register the test outputs to their respective source images. To remove any lines of black pixels from the images after registration, we cropped five pixels from each side, which also ensured that all images had the same dimensions. We notice that the registered images contain some blurring, such as in Fig. 1, which could have a significant effect when applying evaluation metrics. We also compute the SSIM and PSNR scores after correcting the offsets, with the aim of obtaining a better indication of the amount of content preserved, rather than scores dominated by the offset.
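For readers without access to a registration package, the idea of estimating and undoing a pure translation can be sketched with phase correlation in NumPy. This is a swapped-in technique for illustration only and is not the ANTsPy routine used in this work: it recovers integer shifts, assumes the offset behaves like a circular shift, and introduces no interpolation blur. The function name and test image are assumptions.

```python
import numpy as np

def estimate_shift(reference, moved):
    """Integer (row, col) shift that maps `reference` onto `moved`."""
    cross = np.fft.fft2(moved) * np.conj(np.fft.fft2(reference))
    cross /= np.maximum(np.abs(cross), 1e-12)  # keep the phase only
    corr = np.fft.ifft2(cross).real
    peak = np.unravel_index(np.argmax(corr), corr.shape)
    # Peaks past the midpoint correspond to negative shifts.
    return tuple(int(p - s) if p > s // 2 else int(p) for p, s in zip(peak, corr.shape))

# Hypothetical source patch and an output offset by 3 rows and -2 columns.
rng = np.random.default_rng(0)
source = rng.random((64, 64))
output = np.roll(source, (3, -2), axis=(0, 1))

shift = estimate_shift(source, output)
registered = np.roll(output, (-shift[0], -shift[1]), axis=(0, 1))

# Crop five pixels from each side, mirroring the border removal described above.
crop = 5
source_cropped = source[crop:-crop, crop:-crop]
registered_cropped = registered[crop:-crop, crop:-crop]
```

With real, non-circular offsets the border pixels carry no valid content after correction, which is the reason for the crop.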

As expected, the SSIM and PSNR scores increase substantially after registration (Fig. 1). However, the interpretation of these metrics is complicated by a report indicating that some metrics, including the SSIM, produce better scores for blurred images [24]. These results emphasise the need to consider how post-processing can affect image quality and perceptions about model performance.

III-C Evaluating style

We compare the FID and KID values assessed between adapted source images and target images with a baseline score computed from unadapted source images and the target test images.

TABLE III: FID scores summary

Task	Baseline	CycleGAN	SynDiff
CMMD $\rightarrow$ CBIS	35.2	32.7	15.2
CBIS $\rightarrow$ CMMD	35.2	21.4	20.0
CMMD $\rightarrow$ VDM	37.2	30.5	34.2
VDM $\rightarrow$ CMMD	37.2	23.2	24.1
VDM $\rightarrow$ CBIS	44.8	18.7	19.8
CBIS $\rightarrow$ VDM	44.8	40.4	34.9

TABLE IV: Mean (Std Dev) KID scores summary

Task	Baseline	CycleGAN	SynDiff
CMMD $\rightarrow$ CBIS	0.0220 (0.0047)	0.0219 (0.0055)	0.0044 (0.0027)
CBIS $\rightarrow$ CMMD	0.0220 (0.0047)	0.0099 (0.0034)	0.0091 (0.0027)
CMMD $\rightarrow$ VDM	0.0173 (0.0040)	0.0079 (0.0040)	0.0110 (0.0043)
VDM $\rightarrow$ CMMD	0.0173 (0.0040)	0.0091 (0.0036)	0.0074 (0.0031)
VDM $\rightarrow$ CBIS	0.0300 (0.0063)	0.0060 (0.0036)	0.0080 (0.0035)
CBIS $\rightarrow$ VDM	0.0300 (0.0063)	0.0164 (0.0059)	0.0157 (0.0057)

In Table III, all the tasks for CycleGAN and SynDiff have a FID score lower than the baseline, suggesting that the adapted source images are in greater alignment with the target domain. For the KID (reported in Table IV), we set the number of subsets to 50 and the subset size to 100. This emulates the resampling used in the original implementation [15], in which there is a high chance that the same sample appears in multiple subsets. We did not find any significant changes in the mean KID scores when using higher values of the subset size and number of subsets (not reported here). However, we did find some reduction in the standard deviations with larger subset sizes. The need to pick an optimal subset size is an unappealing aspect of using the KID. We observe some differences between the mean KID and FID scores. For the VDM $\rightarrow$ CMMD task, the mean KID score for CycleGAN is higher than that for SynDiff, but Table III shows that the FID score for CycleGAN is lower. It is also interesting to note that the baseline for CMMD $\leftrightarrow$ VDM is the lowest in Table IV when compared with the other tasks, but in Table III this baseline is higher than that of CMMD $\leftrightarrow$ CBIS. Even though these differences could be used to draw opposing conclusions about model performance, the large standard deviations relative to the mean KID scores make comparisons challenging.
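The subset-based reporting of a mean and standard deviation can be sketched as follows. The defaults mirror the 50 subsets of size 100 used here, but the function names, the sampling scheme (without replacement within a subset, with images free to recur across subsets) and the synthetic activations are assumptions for illustration.

```python
import numpy as np

def kid_estimate(a, b):
    """Unbiased squared MMD with the cubic polynomial kernel."""
    d = a.shape[1]
    k_aa = (a @ a.T / d + 1.0) ** 3
    k_bb = (b @ b.T / d + 1.0) ** 3
    k_ab = (a @ b.T / d + 1.0) ** 3
    m, n = len(a), len(b)
    return ((k_aa.sum() - np.trace(k_aa)) / (m * (m - 1))
            + (k_bb.sum() - np.trace(k_bb)) / (n * (n - 1))
            - 2.0 * k_ab.mean())

def kid_over_subsets(act_x, act_y, num_subsets=50, subset_size=100, seed=0):
    """Mean and standard deviation of the KID over random subsets."""
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(num_subsets):
        idx_x = rng.choice(len(act_x), subset_size, replace=False)
        idx_y = rng.choice(len(act_y), subset_size, replace=False)
        scores.append(kid_estimate(act_x[idx_x], act_y[idx_y]))
    return float(np.mean(scores)), float(np.std(scores))

# Hypothetical activations for a baseline-like and an adapted-like comparison.
rng = np.random.default_rng(1)
target = rng.normal(0.0, 1.0, size=(1000, 16))
close = rng.normal(0.0, 1.0, size=(1000, 16))
far = rng.normal(0.3, 1.0, size=(1000, 16))

mean_close, std_close = kid_over_subsets(close, target)
mean_far, std_far = kid_over_subsets(far, target)
```

Reporting the standard deviation alongside the mean makes it easier to judge whether a difference between two models exceeds the spread introduced by subsampling.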

III-D Numerical precision and the FID

Another factor to consider when using a given metric is whether the numerical precision of the data or data type is appropriate for its use. For example, the PyTorch Lightning (TorchMetrics) documentation for FID (https://lightning.ai/docs/torchmetrics/stable/image/frechet_inception_distance.html) states that the metric is known to be unstable in its calculations and that float64 precision is recommended for best results. However, we found no differences in the scores computed with float64 precision (Table III) compared to using float32 (not shown here). Additionally, with our computational resources, computing a score with float64 precision took approximately twice as long as with float32. This suggests that this particular recommendation may not be significant for mammography-based style transfer tasks when using thousands of 256 $\times$ 256 pixel images.

IV Conclusions

We have discussed the key challenges with evaluating the performance of unpaired image-to-image translation models, describing the advantages and disadvantages of several popular metrics. We suggest that a more comprehensive assessment may be acquired by using several metrics, each of which considers unique aspects of image quality. We have also emphasised the need to consider how artefacts or other undesirable model behaviours, as well as the post-processing steps used to correct them, may affect the use/interpretation of common performance evaluation metrics. Ultimately, this work provides a useful reference for others interested in using common metrics to evaluate the performance of generative models, and a starting point for more in-depth analysis.

ACKNOWLEDGMENT

The project 22HLT05 MAIBAI has received funding from the European Partnership on Metrology, co-financed from the European Union’s Horizon Europe Research and Innovation Programme and by the Participating States. Funding for the UK partners was provided by Innovate UK under the Horizon Europe Guarantee Extension.

References

[1] Therese B Bevers, Benjamin O Anderson, Ermelinda Bonaccio, Sandra Buys, Mary B Daly, Peter J Dempsey, William B Farrar, Irving Fleming, Judy E Garber, Randall E Harris, et al. Breast cancer screening and diagnosis. Journal of the National Comprehensive Cancer Network, 7(10):1060–1096, 2009.
[2] Lulu Wang. Mammography with deep learning for breast cancer detection. Frontiers in Oncology, 14:1281922, 2024.
[3] Scott Mayer McKinney, Marcin Sieniek, Varun Godbole, Jonathan Godwin, Natasha Antropova, Hutan Ashrafian, Trevor Back, Mary Chesus, Greg S Corrado, Ara Darzi, et al. International evaluation of an ai system for breast cancer screening. Nature, 577(7788):89–94, 2020.
[4] Arthur C Costa, Helder CR Oliveira, and Marcelo AC Vieira. Data augmentation: Effect in deep convolutional neural network for the detection of architectural distortion in digital mammography. 2019.
[5] Yongcheng Jing, Yezhou Yang, Zunlei Feng, Jingwen Ye, Yizhou Yu, and Mingli Song. Neural style transfer: A review. IEEE transactions on visualization and computer graphics, 26(11):3365–3385, 2019.
[6] Sheng Wang, Jiayu Huo, Xi Ouyang, Jifei Che, Zhong Xue, Dinggang Shen, Qian Wang, and Jie-Zhi Cheng. mr nst: Multi-resolution and multi-reference neural style transfer for mammography. In International Workshop on PRedictive Intelligence In MEdicine, pages 169–177. Springer, 2020.
[7] Muzaffer Özbey, Onat Dalmaz, Salman UH Dar, Hasan A Bedel, Şaban Özturk, Alper Güngör, and Tolga Çukur. Unsupervised medical image translation with adversarial diffusion models. IEEE Transactions on Medical Imaging, 2023.
[8] Keyan Ding, Kede Ma, Shiqi Wang, and Eero P Simoncelli. Image quality assessment: Unifying structure and texture similarity. IEEE transactions on pattern analysis and machine intelligence, 44(5):2567–2581, 2020.
[9] Ciaran Bench, Emir Ahmed, and Spencer A. Thomas. Trustworthy image-to-image translation: evaluating uncertainty calibration in unpaired training scenarios, 2025.
[10] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE international conference on computer vision, pages 2223–2232, 2017.
[11] Hieu T Nguyen, Ha Q Nguyen, Hieu H Pham, Khanh Lam, Linh T Le, Minh Dao, and Van Vu. Vindr-mammo: A large-scale benchmark dataset for computer-aided diagnosis in full-field digital mammography. Scientific Data, 10(1):277, 2023.
[12] Hongmin Cai, Jinhua Wang, Tingting Dan, Jiao Li, Zhihao Fan, Weiting Yi, Chunyan Cui, Xinhua Jiang, and Li Li. An online mammography database with biopsy confirmed types. Scientific Data, 10(1):123, 2023.
[13] Rebecca Sawyer-Lee, Francisco Gimenez, Assaf Hoogi, and Daniel Rubin. Curated breast imaging subset of digital database for screening mammography (cbis-ddsm). 2016.
[14] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems, 30, 2017.
[15] Mikołaj Bińkowski, Danica J Sutherland, Michael Arbel, and Arthur Gretton. Demystifying mmd gans. arXiv preprint arXiv:1801.01401, 2018.
[16] Sadeep Jayasumana, Srikumar Ramalingam, Andreas Veit, Daniel Glasner, Ayan Chakrabarti, and Sanjiv Kumar. Rethinking fid: Towards a better evaluation metric for image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9307–9315, 2024.
[17] Lin Zhang, Lei Zhang, Xuanqin Mou, and David Zhang. Fsim: A feature similarity index for image quality assessment. IEEE transactions on Image Processing, 20(8):2378–2386, 2011.
[18] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 586–595, 2018.
[19] Maxine Tan, Bin Zheng, Joseph K. Leader, and David Gur. Association between changes in mammographic image features and risk for near-term breast cancer development. IEEE Transactions on Medical Imaging, 35(7):1719–1728, 2016.
[20] Zhou Wang, A.C. Bovik, H.R. Sheikh, and E.P. Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, 2004.
[21] Rucha Deshpande, Varun A Kelkar, Dimitrios Gotsis, Prabhat Kc, Rongping Zeng, Kyle J Myers, Frank J Brooks, and Mark A Anastasio. Report on the aapm grand challenge on deep generative modeling for learning medical image statistics. ArXiv, 2024.
[22] Mehul P. Sampat, Zhou Wang, Shalini Gupta, Alan Conrad Bovik, and Mia K. Markey. Complex wavelet structural similarity: A new image similarity index. IEEE Transactions on Image Processing, 18(11):2385 – 2401, 2009.
[23] Nobuyuki Otsu. A threshold selection method from gray-level histograms. IEEE Trans SMC, 9:62, 1979.
[24] Melanie Dohmen, Tuan Truong, Ivo M Baltruschat, and Matthias Lenga. Five pitfalls when assessing synthetic medical images with reference metrics. In MICCAI Workshop on Deep Generative Models, pages 150–159. Springer, 2024.