Paper: Perceptual Losses for Real-Time Style Transfer and Super-Resolution, ECCV 2016
Authors: Justin Johnson, Alexandre Alahi, Li Fei-Fei
Problems
Feed-forward methods are fast, but the per-pixel losses they are trained with do not capture perceptual differences between output and ground-truth images.
High-quality images can be generated using perceptual loss functions based not on differences between pixels but on differences between high-level image feature representations extracted from pretrained convolutional neural networks.
These optimization-based approaches produce high-quality images, but are slow since inference requires solving an optimization problem.
Idea
We combine the benefits of these two approaches. We train feed-forward transformation networks for image transformation tasks, but rather than using per-pixel loss functions depending only on low-level pixel information, we train our networks using perceptual loss functions that depend on high-level features from a pretrained loss network.
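The setup can be sketched with stand-in numpy functions (names and the toy "networks" below are hypothetical; in the paper the transformation network is a deep residual CNN and the fixed loss network is a pretrained VGG-16):

```python
import numpy as np

def transform_net(x, w):
    # Stand-in feed-forward transformation network with one scalar
    # "weight"; the real network is a deep CNN trained end to end.
    return np.tanh(w * x)

def loss_net_features(img):
    # Stand-in for the fixed pretrained loss network (VGG-16 in the
    # paper): 2x2 average pooling as a crude "high-level feature map".
    c, h, w = img.shape
    return img.reshape(c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))

def perceptual_loss(x, y_target, w):
    # Compare features of the network output and the target image,
    # not raw pixels; only the transformation net's weights are trained.
    f_hat = loss_net_features(transform_net(x, w))
    f_tgt = loss_net_features(y_target)
    return np.sum((f_hat - f_tgt) ** 2) / f_hat.size
```

At inference time only `transform_net` runs, a single feed-forward pass, which is why this is much faster than solving an optimization problem per image.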
Method
loss functions
feature reconstruction loss $\ell_{feat}^\phi$
style reconstruction loss $\ell_{style}^\phi$
To compute the Gram matrix efficiently, reshape $\phi_j(x)$ into a matrix $\psi$ of shape $C_j\times H_jW_j$; then $G_j^\phi(x)=\psi\psi^\top/C_jH_jW_j$
The style reconstruction loss is the squared Frobenius norm of the difference between the Gram matrices of the output and target images: $\ell_{style}^{\phi,j}(\hat{y},y)=\|G_j^\phi(\hat{y})-G_j^\phi(y)\|_F^2$
pixel loss $\ell_{pixel}(\hat{y},y)$
Total Variation Regularization $\ell_{TV}(\hat{y})$
- Loss: the network is trained to minimize a weighted combination of the loss terms above
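The loss terms above can be sketched in numpy; the feature maps passed in would come from the pretrained loss network (VGG-16 in the paper), but here they are just arrays of shape $C\times H\times W$, and the TV variant shown is one common choice:

```python
import numpy as np

def gram_matrix(phi):
    """Gram matrix of a C x H x W feature map, normalized by C*H*W:
    G = psi @ psi.T / (C*H*W), with psi of shape C x (H*W)."""
    c, h, w = phi.shape
    psi = phi.reshape(c, h * w)
    return psi @ psi.T / (c * h * w)

def feature_loss(phi_hat, phi_target):
    """Feature reconstruction loss: squared Euclidean distance
    between feature maps, normalized by their size."""
    return np.sum((phi_hat - phi_target) ** 2) / phi_hat.size

def style_loss(phi_hat, phi_target):
    """Style reconstruction loss: squared Frobenius norm of the
    difference between the two Gram matrices."""
    diff = gram_matrix(phi_hat) - gram_matrix(phi_target)
    return np.sum(diff ** 2)

def tv_loss(img):
    """Total variation regularizer (squared-difference variant):
    penalizes differences between neighboring pixels."""
    dh = img[:, 1:, :] - img[:, :-1, :]
    dw = img[:, :, 1:] - img[:, :, :-1]
    return np.sum(dh ** 2) + np.sum(dw ** 2)
```

Because the Gram matrix is $C_j\times C_j$ regardless of spatial size, the style loss can be matched between images of different resolutions.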
Experiments
Style Transfer
goal
To generate an image $\hat{y}$ that combines the content of a content target image $y_c$ with the style of a style image $y_s$.
single-image super-resolution
The task is to generate a high-resolution output image from a low-resolution input.
Future
In future work we hope to explore the use of perceptual loss functions for other image transformation tasks, such as colorization and semantic segmentation. We also plan to investigate the use of different loss networks to see whether for example loss networks trained on different tasks or datasets can impart image transformation networks with different types of semantic knowledge.