
in existing approaches cannot yet achieve such separation.
In this paper, we present a novel Relightable Neural Renderer (RNR) for view synthesis and relighting from multi-view inputs. A unique step in our approach is that we model image formation in terms of environment lighting, object intrinsic attributes, and a light transport function (LTF). RNR sets out to conduct regression on these three individual components rather than directly translating deep features to appearance as in existing NR. In addition, the use of an LTF instead of a parametric BRDF model extends the capability of modeling global illumination. While enabling relighting, RNR can also produce view synthesis using the same network architecture. Comprehensive experiments on synthetic and real data show that RNR provides a practical and effective solution for conducting free-viewpoint relighting.
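To make this decomposition concrete, one plausible reading of such an image formation model (the exact parameterization adopted by RNR may differ; the symbols below are introduced purely for illustration) writes the radiance leaving a surface point $\mathbf{x}$ toward the viewing direction $\omega_o$ as
$$
I(\mathbf{x}, \omega_o) \;=\; \int_{\Omega} T(\mathbf{x}, \omega_i, \omega_o)\, A(\mathbf{x})\, L(\omega_i)\, \mathrm{d}\omega_i ,
$$
where $L$ is the environment lighting, $A$ collects the object's intrinsic attributes (e.g., albedo), and $T$ is the light transport function. Because $T$ is not constrained to a local parametric BRDF multiplied by a visibility term, it can in principle absorb global-illumination effects such as interreflection and soft shadowing, which motivates regressing it directly.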
2. Related Work
Image-based Rendering (IBR). Traditional IBR methods [17, 37, 24, 5, 7, 86, 23, 57] synthesize novel views by blending pixels from input images. Compared with physically based rendering, which requires high-resolution geometry and accurate surface reflectance, they can use lower-quality geometry as proxies to produce relatively high-quality rendering. The ultimate rendering quality, however, is a trade-off between the density of sampled images and geometry: low-quality geometry requires dense sampling to reduce artifacts; otherwise the rendering exhibits various artifacts including ghosting, aliasing, misalignment, and appearance jumps. The same trade-off applies to image-based relighting, although for low-frequency lighting, sparse sampling may suffice to produce realistic appearance. Hand-crafted blending schemes [9, 35, 8, 23, 57] have been developed for specific rendering tasks, but they generally require extensive parameter tuning.
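As a concrete illustration of such schemes (a generic sketch rather than the formulation of any particular reference above), a novel-view pixel is often composed as a weighted average of the colors $c_i(\mathbf{x})$ reprojected from the input views through the geometric proxy,
$$
\hat{c}(\mathbf{x}) \;=\; \frac{\sum_i w_i(\mathbf{x})\, c_i(\mathbf{x})}{\sum_i w_i(\mathbf{x})}, \qquad w_i(\mathbf{x}) = \exp\!\left(-\frac{\theta_i(\mathbf{x})^2}{\sigma^2}\right),
$$
where $\theta_i$ is the angular deviation between the novel-view ray and the ray of input view $i$. The bandwidth $\sigma$, together with any additional resolution or field-of-view penalties, is exactly the kind of parameter that typically needs per-scene tuning.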
Deep View Synthesis. Recently, there has been a large corpus of works on learning-based novel view synthesis. [68, 13] learn an implicit 3D representation by training on synthetic datasets. Warping-based methods [88, 55, 67, 90, 28, 11] synthesize novel views by predicting the optical flow field. Flow estimation can also be enhanced with geometry priors [87, 45]. Kalantari et al. [30] separate the synthesis process into disparity and color estimations for light field data. Srinivasan et al. [66] further extend to RGB-D view synthesis on small-baseline light fields.
Eslami et al. [14] propose the Generative Query Network to embed appearances of different views in a latent space. Disentangled understanding of scenes can also be conducted through interpretable transformations [82, 36, 77], Lie group-based latent variables [15], or attention modules [6]. Instead of 2D latent features, [72, 52, 20] utilize volumetric representations as a stronger multi-view constraint, whereas Sitzmann et al. [64] represent a scene as a continuous mapping from 3D geometry to deep features.
To create more photo-realistic rendering over a wide viewing range, [22, 70, 10, 63, 47, 69, 2, 61, 49, 79] require many more images as input. Hedman et al. [22] learn the blending scheme in IBR. Thies et al. [70] model the view-dependent component with self-supervised learning and then combine it with the diffuse component. Chen et al. [10] apply fully connected networks to model the surface light field by exploiting appearance redundancies. Volume-based methods [63, 47] utilize learnable 3D volumes to represent the scene and combine them with projection or ray marching to enforce geometric constraints. Thies et al. [69] present a novel learnable neural texture to model rendering as image translation. They use coarse geometry for texture projection and offer flexible content editing. Aliev et al. [2] directly use a neural point cloud to avoid surface meshing. Auxiliary information such as poses can be used to synthesize more complex objects such as human bodies [61].
To accommodate relighting, Meshry et al. [49] learn an embedding for appearance style, whereas Xu et al. [79] use deep image-based relighting [81] on multi-view, multi-light photometric images captured using a specialized gantry. Geometry-differentiable neural rendering [58, 46, 44, 40, 32, 48, 84, 43, 27, 71, 51] can potentially handle relighting, but our technique focuses on view synthesis and relighting without modifying 3D geometry.
Free-Viewpoint Relighting. Earlier free-viewpoint relighting of real-world objects requires delicate acquisition of reflectance [18, 75, 76], while more recent low-cost approaches still require controlled active illumination or known illumination/geometry [50, 25, 89, 78, 16, 83, 42, 31, 12, 41, 80]. Our work aims to use multi-view images captured under a single unknown natural illumination. Previous approaches solve this ill-posed problem via spherical harmonics (SH) [85], wavelets [19], or both [39] to represent illumination and a parametric BRDF model to represent reflectance. Imber et al. [26] extract pixel-resolution intrinsic textures. Despite these advances, accurate geometry remains a key component for reliable relighting, whereas our RNR aims to simultaneously compensate for geometric inaccuracy and disentangle intrinsic properties from lighting. Tailored illumination models can support outdoor relighting [59, 21, 60] or indoor inverse rendering [3], whereas our RNR uses a more generic lighting model for learning the light transport process. Specifically, our work uses a set of multi-view images of an object under fixed yet unknown natural illumination as input. To carry out view projection and texture mapping, we assume known camera parameters of the input views and known coarse 3D geometry of the object, where standard structure-from-motion and multi-view stereo reconstruction can provide reliable estimations.
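For context on the illumination representations mentioned above, the SH approaches approximate the (assumed distant) environment illumination by a truncated basis expansion,
$$
L(\omega) \;\approx\; \sum_{l=0}^{n} \sum_{m=-l}^{l} c_{lm}\, Y_{lm}(\omega),
$$
where $Y_{lm}$ are the spherical harmonic basis functions and $c_{lm}$ the lighting coefficients (this is the standard formulation rather than a detail specific to any one method above). With a low truncation order $n$, shading under a parametric BRDF reduces to a compact inner product of coefficient vectors, which keeps the inverse problem tractable but also restricts such methods to low-frequency illumination and local reflectance, in contrast to the more generic light transport that RNR learns.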