Paper: Perceptual Losses for Real-Time Style Transfer and Super-Resolution, ECCV 2016
Authors: Justin Johnson, Alexandre Alahi, Li Fei-Fei
Problems
Feed-forward methods are fast, but the per-pixel losses they are trained with do not capture perceptual differences between output and ground-truth images.
High-quality images can be generated using perceptual loss functions based not on differences between pixels but on differences between high-level image feature representations extracted from pretrained convolutional neural networks.
These optimization-based approaches produce high-quality images, but are slow since inference requires solving an optimization problem.
Idea
We combine the benefits of these two approaches. We train feed-forward transformation networks for image transformation tasks, but rather than using per-pixel loss functions depending only on low-level pixel information, we train our networks using perceptual loss functions that depend on high-level features from a pretrained loss network.
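The setup can be sketched with stand-in numpy functions (names and the toy "networks" below are hypothetical; in the paper the transformation network is a deep residual CNN and the fixed loss network is a pretrained VGG-16):

```python
import numpy as np

def transform_net(x, w):
    # Stand-in feed-forward transformation network with one scalar
    # "weight"; the real network is a deep CNN trained end to end.
    return np.tanh(w * x)

def loss_net_features(img):
    # Stand-in for the fixed pretrained loss network (VGG-16 in the
    # paper): 2x2 average pooling as a crude "high-level feature map".
    c, h, w = img.shape
    return img.reshape(c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))

def perceptual_loss(x, y_target, w):
    # Compare features of the network output and the target image,
    # not raw pixels; only the transformation net's weights are trained.
    f_hat = loss_net_features(transform_net(x, w))
    f_tgt = loss_net_features(y_target)
    return np.sum((f_hat - f_tgt) ** 2) / f_hat.size
```

At inference time only `transform_net` runs, a single feed-forward pass, which is why this is much faster than solving an optimization problem per image.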
Method
loss functions
feature reconstruction loss $\ell_{feat}^\phi$
style reconstruction loss $\ell_{style}^\phi$
To compute the Gram matrix efficiently, reshape $\phi_j(x)$ into a matrix $\psi$ of shape $C_j\times H_jW_j$; then $G_j^\phi(x)=\psi\psi^\top/C_jH_jW_j$
The style reconstruction loss is the squared Frobenius norm of the difference between the Gram matrices of the output and target images: $\ell_{style}^{\phi,j}(\hat{y},y)=\|G_j^\phi(\hat{y})-G_j^\phi(y)\|_F^2$
pixel loss $\ell_{pixel}(\hat{y},y)$
Total Variation Regularization $\ell_{TV}(\hat{y})$
- Loss: the network is trained to minimize a weighted combination of the loss terms above
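The loss terms above can be sketched in numpy; the feature maps passed in would come from the pretrained loss network (VGG-16 in the paper), but here they are just arrays of shape $C\times H\times W$, and the TV variant shown is one common choice:

```python
import numpy as np

def gram_matrix(phi):
    """Gram matrix of a C x H x W feature map, normalized by C*H*W:
    G = psi @ psi.T / (C*H*W), with psi of shape C x (H*W)."""
    c, h, w = phi.shape
    psi = phi.reshape(c, h * w)
    return psi @ psi.T / (c * h * w)

def feature_loss(phi_hat, phi_target):
    """Feature reconstruction loss: squared Euclidean distance
    between feature maps, normalized by their size."""
    return np.sum((phi_hat - phi_target) ** 2) / phi_hat.size

def style_loss(phi_hat, phi_target):
    """Style reconstruction loss: squared Frobenius norm of the
    difference between the two Gram matrices."""
    diff = gram_matrix(phi_hat) - gram_matrix(phi_target)
    return np.sum(diff ** 2)

def tv_loss(img):
    """Total variation regularizer (squared-difference variant):
    penalizes differences between neighboring pixels."""
    dh = img[:, 1:, :] - img[:, :-1, :]
    dw = img[:, :, 1:] - img[:, :, :-1]
    return np.sum(dh ** 2) + np.sum(dw ** 2)
```

Because the Gram matrix is $C_j\times C_j$ regardless of spatial size, the style loss can be matched between images of different resolutions.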
Experiments
Style Transfer
goal
To generate an image $\hat{y}$ that combines the content of a content target image $y_c$ with the style of a style image $y_s$.
single-image super-resolution
The task is to generate a high-resolution output image from a low-resolution input.
Future
In future work we hope to explore the use of perceptual loss functions for other image transformation tasks, such as colorization and semantic segmentation. We also plan to investigate the use of different loss networks to see whether for example loss networks trained on different tasks or datasets can impart image transformation networks with different types of semantic knowledge.