0% found this document useful (0 votes)
18 views15 pages

Scene Viewer

Uploaded by

tiracolate
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
18 views15 pages

Scene Viewer

Uploaded by

tiracolate
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 15

IEEE TRANSACTIONS ON VISUALIZATION AND COMPUTER GRAPHICS, VOL. 29, NO.

12, DECEMBER 2023 5523

SceneViewer: Automating Residential


Photography in Virtual Environments
Shao-Kui Zhang , Hou Tam , Yi-Xiao Li, Tai-Jiang Mu , and Song-Hai Zhang , Member, IEEE

Abstract—Selecting views is one of the most common but overlooked procedures in topics related to 3D scenes. Typically, existing
applications and researchers manually select views through a trial-and-error process or “preset” a direction, such as the top-down
views. For example, literature for scene synthesis requires views for visualizing scenes. Research on panorama and VR also require
initial placements for cameras, etc. This article presents SceneViewer, an integrated system for automatic view selections. Our system
is achieved by applying rules of interior photography, which guides potential views and seeks better views. Through experiments and
applications, we show the potentiality and novelty of the proposed method.

Index Terms—Interior photography, view selection, 3D interior scene

1 INTRODUCTION views, for predefined or fixed views are not always suitable
for every scene. Recent online businesses on estates would
ESEARCH on 3D scenes has advanced over the last decades,
R where the topics spread over a wide range, including
scene understanding [1], [2], [3], scene synthesis [4], [5], [6],
like to show the best views of their properties. Their busi-
nesses mainly depend on exhibiting rendered photos taken
from the floorplans to potential buyers with an excellent
style compatibility [7], scene detailing [8], interaction [9], [10],
first impression, so selecting the views accurately and
industrial applications1, etc. Among all the topics around 3D
appealingly saves time for marketing and gives the custom-
scenes, selecting a view to visualize or render a scene should
ers a better understanding of the rooms. Handa et al. [15]
be the most fundamental step since we can not perceive scenes
generate datasets for computer vision by using 3D scenes
intuitively, given only text-based configurations. However,
through different views, demonstrating that good choices of
view selection is either ignored by setting an empirical fixed
views yield informative datasets. The rendering focuses on
direction or conducted manually in the existing literature.
developing as realistic an image as possible [16], but we
First, for 3D scene synthesis [4], it is essential to render
argue that selecting positions and directions also matters
generated scenes for evaluation. However, since no rules of
considering the camera.
3D scene viewing are considered, current state-of-the-art
Nevertheless, automatic view selection in 3D scenes is not
techniques depend on fixed predefined views, e.g., top-
well explored previously. Existing literature on view selec-
down views [11], [12], [13], [14]. Consequently, the evalua-
tions is still designated for 3D meshes [17]. It lacks semantic
tions may result in perceptual biases due to the selections of
channels such as layouts of objects and does not consider
human-centric constraints such as connections of rooms and
1. https://www.kujiale.com/ visually pleasing qualities. Furthermore, the hypothesis space
of possible views for a mesh is often distributed on a regular
 Shao-Kui Zhang, Hou Tam, and Tai-Jiang Mu are with the Department of geometry such as a sphere or regular polyhedron [17], [18].
Computer Science and Technology, Tsinghua University, Beijing 100084, However, potential views for 3D scenes lie in a more arbitrary
China. E-mail: {zhangsk18, th21}@mails.tsinghua.edu.cn, taijiang@tsinghua. and irregular space, i.e., a view could be positioned anywhere
edu.cn.
 Yi-Xiao Li is with the Academy of Arts & Design, Tsinghua University,
among any set of objects within a scene.
Beijing 100084, China. E-mail: [email protected]. In this paper, we propose SceneViewer for automatic
 Song-Hai Zhang is with the Department of Computer Science and Tech- view selections in 3D scenes. Our method automatically
nology, Tsinghua University, Beijing 100084, China, and also with the generates appealing perspective views representing the
Beijing National Research Center for Information Science and Technology
(BNRist), Tsinghua University, Beijing 100084, China. underlying 3D scene, given a typical floorplan with multiple
E-mail: [email protected]. rooms, as shown in Fig. 1. This is achieved by adopting the
Manuscript received 5 October 2021; revised 6 October 2022; accepted 12 interior photography rules, where “probe views” are gener-
October 2022. Date of publication 17 October 2022; date of current version 10 ated first instead of being exhausted in the entire 3D space.
November 2023. To generate the set of “probe views”, we geometrically for-
This work was supported in part by the Natural Science Foundation of China
under Grant 62132012, and in part by the Tsinghua-Tencent Joint Laboratory mulate the rules of “one-point perspective (OPP)” and
for Internet Innovation Technology. “two-point perspective (TPP)” [19], [20] w.r.t. room shapes.
(Corresponding author: Song-Hai Zhang.) OPP refers to a single vanishing point formed by a room
Recommended for acceptance by V. Popescu. shape, while TPP refers to two vanishing points. The OPP is
This article has supplementary downloadable material available at https://doi.
org/10.1109/TVCG.2022.3214836, provided by the authors. dedicated to capturing the architectures and symmetrical
Digital Object Identifier no. 10.1109/TVCG.2022.3214836 perfections (section 3.1). The TPP is dedicated to showing
1077-2626 © 2022 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See ht_tps://www.ieee.org/publications/rights/index.html for more information.
Authorized licensed use limited to: Tsinghua University. Downloaded on November 11,2023 at 12:40:28 UTC from IEEE Xplore. Restrictions apply.
5524 IEEE TRANSACTIONS ON VISUALIZATION AND COMPUTER GRAPHICS, VOL. 29, NO. 12, DECEMBER 2023

Fig. 2. A problem of one-point perspective (OPP) in residential photogra-


phy. Although each line guides the camera to satisfy OPP, most camera
positions are still unfavourable due to irregular room shapes in real life,
e.g., the views at the white cameras.

Fig. 1. We present a framework for viewing 3D scenes by automatically


generating photographs. Our framework positions and directs a set of  We present SceneViewer for automatically selecting
appealing views into scenes following the rules of residential photogra- views for 3D scenes by computationally formulating
phy. Subsequently, our method also supports generating a series of the rules of residential photography.
views for floorplans and more applications (seeing section 4), thus giving
users a more perceptible and faster understanding of 3D scenes.  We formulate a set of constraints that quantitatively
measure how views are informative and visual-
pleasingly satisfy residential photography.
the human perception of depth (section 3.1). In the simplest  We present several applications, including (series of)
case, OPP could be satisfied by placing the camera facing views generation, trajectory, view-based 3D scene
the middle of a wall. TPP could be satisfied by placing the synthesis, etc., to demonstrate the potentiality of our
camera in front of a corner and orienting it to the corner. SceneViewer.
However, this merely works for rooms with well-aligned
layouts and cuboid shapes. For example, as shown in Fig. 2,
to view the main content of the room, placing the camera 2 RELATED WORKS
along with all positions on the red dotted lines satisfies the Research on 3D Scenes. Scene synthesis techniques require
constraint of OPP. However, most views are not favourable showcasing the synthesized results to debug and conduct
because some views may be blocked by the walls or include experiments, e.g., user study [6], [14], [22]. Scene under-
very few objects. The fundamental problem is that room standing techniques require a high-quality training set [2],
shapes are irregular. Therefore, we computationally gener- [23], where existing literature pursues realistic rendering
alize residential photography, where five schemes are pro- [24], [25] and view selections are indeed ignored. [26] learns
posed to adapt complex geometries while keeping the OPP and predicts human poses given furniture groups using
and TPP. Our method can generate probe views for arbi- captured views from sensors. [27] inserts objects given
trary room shapes by doing so. views of 3D scenes. [8], [9], [28] interactively synthesize 3D
One could easily derive a series of views satisfying OPP scenes, but training users to control the camera would be
and TPP from a room, but not all the generated probe views time-consuming. [15] renders a computer vision dataset
are guaranteed to be informative and aesthetic. To this end, using different views derived from 3D scenes and [29] trains
several constraints, i.e., the content and aesthetic con- neural networks to evaluate 3D scenes based on various
straints, are proposed to refine the final human-centric views from datasets. [7], [30] make the style and colour of
views. The former measures the “informative extent” of rooms consistent, so different views of verifying the results
views, e.g., a view should contain as many objects and con- are needed. Our method is a tool for automatically photo-
nections of rooms as possible. The latter measures the visual graphing 3D scenes, bridging computer graphics and com-
effects of the views, e.g., the trisection rules [21]. As shown puter vision [31].
in Fig. 1, our method automatically generates individual Viewpoint Selections. Existing literature focuses on select-
photographs and a series of views, where further mappings ing views for objects, e.g., CAD models, meshes, flows, etc.
from perspective views to floorplans are available, thus [32], [33] select views for volume rendering and [34] is
composing the basic features of SceneViewer. Section 3 will intended for streamlines. [35] suggests camera views for
illustrate how we formulate interior photography and point cloud segmentation. [36] locates optimal views based
develop SceneViewer. In Section 4, we conduct experiments on feature extraction. [37] investigates camera control in cin-
to verify our method and present potential applications that ematography. [38], [39] select views for CAD models, but
could be exclusively achieved using SceneViewer2. the optimization loss is the number of primitives seen by
In summary, our work makes the following contributions: the camera. The aesthetic of views for 3D scenes is more
related to semantics. [18], [40] generate a series of views for
2. Code is publicly available at https://github.com/Shao-Kui/ a model, but the views are derived from regular shapes,
3DScenePlatform#sceneviewer. e.g., dodecahedron, and icosahedron, which is unsuitable
Authorized licensed use limited to: Tsinghua University. Downloaded on November 11,2023 at 12:40:28 UTC from IEEE Xplore. Restrictions apply.
ZHANG ET AL.: SCENEVIEWER: AUTOMATING RESIDENTIAL PHOTOGRAPHY IN VIRTUAL ENVIRONMENTS 5525

Fig. 3. The derived basic views. (a): OPP-Mid, where each view is positioned in the wall middle behind the view and directed toward the wall normal.
(b) and (c): OPP-Thin & OPP-Expand, where each view is positioned toward a target wall and directed against the normal of the target wall. The for-
mer prefer narrower views, thus following the adjacent walls of the probe wall. The latter prefers wider views, thus expanding virtual walls based on
the probe wall. (d): TPP-2, where views are placed in front of two adjacent walls as much as possible. (e): TPP-3, based on OPP-Thin and OPP-
Expand, views are placed at the trisection point and are directed toward another trisection point of the probe wall.

for arbitrary room shapes. [41] measures the quality of dif- “narrow” the space of possible probe views, we adapt rules
ferent viewpoints for 3D objects. [42] selects upright orienta- from residential photography to arbitrary room shapes.
tions for 3D models. Please refer to an insightful survey [17] Even so, many probe views may still satisfy the rules, so a
for more comparisons on selecting viewpoints for 3D mod- set of constraints are incorporated to filter them further.
els. Additionally, [43] estimates the visibility of a scene In general, the process of finding a set of n views S ^¼
given a view based on many primitives and objects. View- fc1 ; c2 ; . . . ; cn g for a room follows Equation (1), where S is
points are also used for generating pathways. [44], [45] gen- the set consisting of all probe views derived from a room
erate a series of viewpoints (nodes) connected to guide a shape R. The constraints function Cview ðcjR; OÞ is used to
tour in a given 3D scene. [46] generates camera trajectories measure the goodness of a view c, given the room shape R
outside scenes or objects for automatic explorations. [47] and its corresponding content O. In this paper, each view is
develops an interactive system to offer users as much envi- parameterized as c ¼ ð~ z; ~
bÞ, where ~
z positions the probe
ronmental knowledge as possible. In contrast, we focus on view and ~ b is the normalized direction. The room shape R
how every single view satisfies the rules of residential pho- is formulated as a polygon, and the content O typically con-
tography, where the connected pathways are our additional tains a set of furniture objects
features (see Section 4.3). To the best of our knowledge, we
are the first to formulate a view selection method for 3D X
^ ¼ arg max
S Cview ðcjR; OÞ; (1)
scenes that adhere to the rules of photography. S 0 S;jS 0 j¼n c2S 0
Residential photography has formed its matured and
detailed standards in industries. These standards are Cview ðcÞ ¼ c Cc ðcjR; OÞ þ ð1  c ÞCa ðcjR; OÞ: (2)
mainly based on the desire to take photos that are as visual-
pleasing as possible, in line with how humans see scenes Based on residential photography [19], [20], we adapt the
and remember them [19]. In response to this, the rules of one-point and two-point perspectives for deriving candi-
“One Point Perspective” and “Two Point Perspective” [48] date views c 2 S. However, there are further concerns. First,
are summarized, where the visual plane is carefully the shapes of modern rooms and floorplans are irregular.
adjusted, and the height should be consistent with human For example, a room shape could be a concave polygon
eyes [20]. Residential photography emphasizes essential ele- with many edges (walls) instead of a simple rectangle. As a
ments in the visual center [49] to guide viewers’ attention. result, we extend the OPP and TPP to make it more robust
For a more exquisite composition, photographers could also on various polygons and discover more potential probe
refer to the rule of thirds [50], where photographers arrange views as shown in Fig. 3. Second, residential photography
objects according to the intersections of vertical and hori- should bring meaningful content to humans and be visually
zontal lines trisecting a photo, e.g., the dominant object is pleasing. However, there may exist many unappealing
often exhibited at the bottom third horizontally. Since users probe views subjecting to the rules for a given room shape.
are more concerned about objects than walls, we typically Therefore, we utilize constraints in Cview ðÞ for further filter-
try involving as many objects as possible via assigning ing out less informative and visually unpleasing probe
fewer biases on room shapes [21] by tilting the camera. views in S.
Additionally, [51] also investigates the problem of balancing A room with s sides is abstracted as an s-gon composed by
objects in photos. a set of edges R ¼ fri ji ¼ 1; 2; . . . ; sg. Contents O of a room is
a set of objects (e.g., furniture), windows, doors, etc. Cview ðÞ
considers the content constraints Cc ðÞ and the aesthetic con-
3 FORMULATING RESIDENTIAL PHOTOGRAPHY straints Ca ðÞ. The former calculates the number of contents
Generating informative and visual-pleasing photographs perceived by a probe view, e.g., the number of furniture, con-
for scenes is difficult due to the sizeable hypothetical space nections of rooms, etc. The latter analyzes how visually com-
in complicated 3D environments. In other words, the lay- fortable a probe view is, including the law of the third,
outs of 3D scenes could be arbitrary when placing potential angles of the camera, etc. These two constraints are balanced
probe views, unlike views derived from regular shapes, with c . In the following subsections, Sections 3.1 and 3.2
which are adopted for viewing meshes [18], [34], [36]. To detail how we adapt residential photography with room
Authorized licensed use limited to: Tsinghua University. Downloaded on November 11,2023 at 12:40:28 UTC from IEEE Xplore. Restrictions apply.
5526 IEEE TRANSACTIONS ON VISUALIZATION AND COMPUTER GRAPHICS, VOL. 29, NO. 12, DECEMBER 2023

geometry. Sections 3.3 and 3.4 formulate the content con- 1) assuming walls behind the camera satisfy the rule of
straints and aesthetic constraints, respectively. central symmetry (OPP-Mid).
In addition to the one-point and two-point perspectives, 2) assuming opposite walls with complements satisfy-
the generated views should be subject to the following two ing the rule (OPP-Thin).
independent rules: 3) assuming the opposite walls with expansions satisfy-
ing the rule (OPP-Expand).
1) A view should give fewer biases on ceilings [21]. First of all, we assume the main contents of a room may
2) A human-centric view should be as tall as a camera be arranged in front of a wall, thus satisfying the central
held by a person [20]. symmetry rule, so a camera is positioned in each wall mid-
Note that the above rules are optional but affect people’s dle behind it, towards the direction of wall normals, shown
visual perception. Rule 1 suggests that photographs should in Figs. 3a as “OPP-Mid”. This assumption is a straightfor-
accommodate more content on walls and grounds, i.e., the ward solution since we assume that the content is spread
camera should be pushed lower or rotated towards the across a room.
ground, which refers to the “pitch” of the camera. We However, the primary content of a room usually gathers
adopt rule 1 by applying Rodrigues’ formula to ~ u ~z where in a sub-area. Simply placing the camera in front of each wall
u is the up vectors of views. Rule 2 suggests that photo-
~ could be far away from the content. The walls or other objects
graphs should not violate the view height of a person, e.g., will also likely occlude the content. Therefore, we further
humans seldom watch objects clinging to the ground or present “OPP-Thin” and “OPP-Expand” to seek positions
watching from top-down views. We utilize rule 2 by empir- along the sub-areas of a room. If the rule is not satisfied by
ically assigning the camera’s height slightly lower than the walls behind the camera, it should be satisfied by oppo-
human eyes. site walls. Nevertheless, suppose the room shapes are irregu-
lar. In that case, we could fit the length of the walls behind
Algorithm 1. The Pipeline of Finding One-Point Perspec- the camera by complementing/clipping the opposite walls,
tive Views as shown in Fig. 3b or fit the room shape by expanding the
Input: Polygon of the room’s inner side R; opposite walls across the room, as shown in Fig. 3c. Thus, the
Output: A set of one-point perspective views S opp  S; former prefers a narrower view, and the latter prefers a
1: S opp := ; broader view.
2: for ri 2 R do
3: Find the middle point m of ri
4: Calculate the ray ni originated from m and following the Algorithm 2. Expanding Virtual Walls
normal of edge ri Input:
5: Find the predecessor wall rpre of edge ri 1: An edge ri and its corresponding endpoints p1 and p2 ;
6: Find the successor wall rnxt of edge ri 2: Polygon of the room’s inner side R;
7: for rj 2 R n fri g do Output: Two virtual and expanded “edges” rpre and rnxt ;
8: p := LineIntersection(ni , rj ) 3: N := -1
9: if p exists then 4: P := -1
10: t1 := LineIntersection(rpre , rj ) 5: rpre := null
11: t2 := LineIntersection(rnxt , rj ) 6: rnxt := null
12: ~z := ðt1 þ t2 Þ=2 7: m := ðp1 þ p2 Þ=2
13: ~b := m  ~ z 8: for r 2 R n fri g do
14: Push ð~ z; ~
bÞ to S opp 9: p := LineIntersection(ri , r)
15: end if 10: if p exists then :
16: end for 11: d := jjm  pjj
17: end for 12: if ðp2  p1 Þ  ðp  mÞ > 0 then
13: if d > N then
14: N := d
15: rnxt := p
3.1 One-Point Perspective
16: end if
The one-point perspective refers to a single vanishing point 17: else
given a view in a room shape, requiring the central symme- 18: if d > P then
try w.r.t walls. The OPP aims to show the interior as the 19: P := d
architect intended, in all its symmetrical perfection, while 20: rpre := p
creating a visually stunning shot of the views [20]. How- 21: end if
ever, OPP may not always be satisfied. For example, the 22: end if
wall behind the L-shape sofa in Fig. 2 is longer than its two 23: end if
opposite walls, and choosing either to place the camera 24: end for
gives unfair biases to the dinning table with four chairs. The
core problem is that room shapes are irregular in most
cases. Humans have the prior knowledge to find a rectangu- The idea of OPP-Thin is to make the area in front of a cam-
lar area satisfying OPP in a room, but not machines. There- era as rectangular as possible. Views of OPP-Thin focus on
fore, we propose three strategies that try to generate views the quasi-rectangular region, excluding other parts of the
satisfying OPP, as illustrated in Figs. 3a, 3b, and 3c: room. As described in Algorithm 1, OPP-Thin initially selects
Authorized licensed use limited to: Tsinghua University. Downloaded on November 11,2023 at 12:40:28 UTC from IEEE Xplore. Restrictions apply.
ZHANG ET AL.: SCENEVIEWER: AUTOMATING RESIDENTIAL PHOTOGRAPHY IN VIRTUAL ENVIRONMENTS 5527

TABLE 1
Views Generated by Applying OPP and TPP to Irregular Room Shapes

Photographs in the same row belong to the same room and photographs in the same column belong to the same view-selecting strategy.

a probe wall ri . A ray ni is then cast from the center of ri fol- derived by two expanded virtual walls, so a probe view has
lowing its normal direction. The ray can intersect the a broader perception of the room.
extended line of the other wall rj . If there is an intersection, Due to various shapes, these three strategies may over-
i.e., “LineIntersection()” returns a point p, we subsequently lap, thus generating the same results. Our strategies are pro-
calculate the intersections of rj with rpre and rnxt , which are posed to deal robustly with irregular shapes. OPP-Mid will
the adjacent walls of ri . This process yields a virtual wall con- be a straightforward solution if a room shape is a regular
necting t1 and t2. Finally, a probe view c is calculated as rectangle. Table 1 shows the qualitative OPP views of sev-
eral rooms generated by our method, where each room is
ððt1 þ t2 Þ=2; m  ðt1 þ t2 Þ=2Þ. This process is repeated to try
guaranteed to yield an OPP-Mid, an OPP-Thin and an OPP-
each pair of walls.
Expand. We could see the views neatly concerning the
In contrast, OPP-Expand tries to expand the region as
room shapes.
much as possible, which is achieved by replacing the calcu-
lation of rpre and rnxt in Algorithm 1 to Algorithm 2, where
a probe wall ri tries to intersect all extended lines of others 3.2 Two-Point Perspective
and find two endpoints. The endpoints should be as far When a photographer is taking a picture of a room, she/he
away from ri as possible. As a result, t1 and t2 would be usually includes two or three walls to give viewers a sense
Authorized licensed use limited to: Tsinghua University. Downloaded on November 11,2023 at 12:40:28 UTC from IEEE Xplore. Restrictions apply.
5528 IEEE TRANSACTIONS ON VISUALIZATION AND COMPUTER GRAPHICS, VOL. 29, NO. 12, DECEMBER 2023

of depth so that the viewers can perceive the space. For t1 and t2 and draws back along the ~ b. If the horizontal field
example, if a photo has only one relatively complete wall, of view (FoV) 2f is sufficient to accommodate two walls, the
where other walls are not seen, or only tiny parts are shown, length is directly calculated using f. Otherwise, the camera
it would be hard for the viewers to determine their locations draws back until it touches a wall
w.r.t the room shape. If two or more walls are presented,
locations are easier to be determined, referring us to the b ¼ ðt1  t2 Þ  ~
~ u; (3)
two-point perspective (TPP) [20]. To this end, two more t1 þ t2 ~ jjt1  t2 jj
strategies are proposed based on TPP. First, to capture two z¼
~ b : (4)
2 2 tan f
walls in one view, the best camera position is usually
located in one corner, looking towards the opposite cor- The best position to place a camera is often a third along
ner [20]. However, as shown in Fig. 3d, irregular room the length of the back wall towards the opposite wall since
shapes often prevent the cameras from capturing the oppo- it avoids photographing the extra side wall at an overly obli-
site corners since they are likely to be occluded by other que angle [20]. This intuition implies that the camera should
walls or two adjacent walls with different lengths. Thus, we give a primary bias to the front wall while giving fewer
formulate this rule with the modification, referred to as biases to the side walls. The “back wall” and the “opposite
TPP-2 in Fig. 3d, that the camera should be as far away from wall” lead us to Algorithm 1, which aims at finding two rel-
the two walls as possible to capture as much content w.r.t ative walls. To apply “a third of the way”, referred to as
the two walls as possible. TPP-3 in Fig. 3e, we make the following modifications: a
view is taken from one of the trisection points along rj and
is directed towards another trisection point on ri . In prac-
tice, views of TPP-3 are derived concurrently with both
Algorithm 3. Detecting Whether an Object is Semanti- OPP-Thin and OPP-Expand.
cally Visible Note that TPP-3 theoretically still belongs to TPP since it
Input: aims at more than one vanishing point. Table 1 also shows
1: An object o to be tested with vertices v1 ; v2 ; . . . ; vn 2 o; the qualitative TPP views of several rooms.
2: A probe view c ¼ ð~ z; ~
bÞ to be tested;
3: Polygon of the room’s inner side R; 3.3 Content Constraints
4: Horizontal and vertical FoV 2f and 2u (aspect ratio); A probe view should perceive as many objects as possible,
5: Contents of the room O; which leads us to occlusion detection. However, whether or
Output: is o seen by c;
not an object is in sight depends on various conditions. For
6: opre := o
example, the occlusion of an object could be attributed to
v := ~
7: ~ b~ u=j~b~ uj
two or more objects. Therefore, we present Algorithm 3 to
8: ~
h := ~b ~ v=j~b ~ vj
9: for v 2 o do
decide if an object o is visible concerning a probe view c.
10: ~ t := v  ~z First, for each vertex v of an object, we calculate if it is on
11: ~ v0 := ð~ v ~tÞ~
v þ~ t the camera’s horizontal and vertical visual planes, i.e., a v
12: ~ h0 := ð~ h ~tÞ~
h þ~ t could only be seen if it can be projected within the horizon-
13: if ð~ v0 Þ=j~
b ~ v0 j < cos u then
bjj~ tal and vertical visual plane of the frustum. Then, we check
14: o := o n fvg if the room shape occludes the vertices. The function
15: else if ð~ b~ h0 Þ=j~bjj~h0 j < cos f then “isCross” detects if two geometries cross each other. The
16: o := o n fvg function “isOccluded” conducts a raycasting-based app-
17: end if roach [52] based on [53] to calculate how many vertices in
18: end for object o are occluded by other objects in O. If the ð~ z; ~
bÞ of
19: for v 2 o do probe views does not vary, rendering-based approaches
20: if isCross(v  c, R) then such as [54] could also be leveraged for the acceleration.
21: o := o n fvg [39] could also be assembled to prevent perspective distor-
22: end if tions if our method is applied in the real world instead of
23: end for digital 3D scenes. Finally, Algorithm 3 decides the visibility
24: o := on isOccludedðo; OÞ of o based on a tolerance M. The visibility is measured by
25: if joj=jopre j > ¼ M then the ratio of visible vertices to the number of all vertices. If
26: return True M takes a smaller value, more objects could be perceived in
27: else the results, while the visible part of each object may be
28: return False
small, e.g., a stool leg, and vice versa. Note that M is never
29: end if
set more significant than 0.5 since an object could occlude
the vertices of itself.
To better perceive the room’s layout, a view should be set
Each TPP-2 probe view is generated by traversing each to see the connections between rooms, e.g., doors and corri-
pair of adjacent walls. A probe view is calculated by Equa- dors, so we formulate the constraints for windows and
b takes the cross product of
tions (3) and (4). The direction ~ doors. The overall process of counting the number of con-
camera up vector ~ u with the vector connecting t1 and t2 , nections follows Algorithm 3, where the function “isCross”
which are the two opposite endpoints of the two walls, takes the R n v as input and v is the wall that a door or a
respectively. The position ~z starts from the middle point of window belongs. Due to the speciality of connections, this
Authorized licensed use limited to: Tsinghua University. Downloaded on November 11,2023 at 12:40:28 UTC from IEEE Xplore. Restrictions apply.
ZHANG ET AL.: SCENEVIEWER: AUTOMATING RESIDENTIAL PHOTOGRAPHY IN VIRTUAL ENVIRONMENTS 5529

P P
constraint is measured by o; cÞ= o0 jaðo0 ; cÞj, where
o^ að^ respectively. lay , trd and off are weights for them
að; Þ calculates the projected area of a connection, o^ and o0
are from the sets of visible connections and all connections. 8
>
< 0; if c is derived from TPP-2
Consequently, the constraints contribute by a weighted
Coff ¼  arccosð~b  ni Þ; if c is derived from OPP-Mid :
sum: Cc ðÞ ¼ obj Cobj ðÞ þ win Cwin ðÞ, where Cobj and Cwin >
:
refer to the number of objects and area of connections, b  nj Þ; otherwise
arccosð~
respectively. obj and win are the weights for them. (5)
Table 2 shows views generated by discarding one con-
3.4 Aesthetic Constraints straint in each column. Since constraints are designed to
In addition to content constraints, a view should also be work together, sometimes removing a single constraint may
visually aesthetic. First, from an aesthetic view, the orienta- not yield poor results. Therefore, we give negative coeffi-
tions of objects w.r.t the camera should be carefully consid- cients to the discarded constraints to amplify the adverse
ered, e.g., we appreciate a bed from the front view. effect. For the content constraints, the results contain as
Otherwise, if a probe view has the same direction w.r.t a much content as possible. For the layout constraint, it
double bed, it would either be blocked by the bed or lack favours views that capture a set of objects facing the camera.
details For the rule of the third, the method guarantees that a domi-
P of the bed. Thus, we formulate the layout constraint
as o2Osim ~ b  u~o , where u~o denotes the normalized direction nant object appears at the trisection points of images. As for
vector of the object o. The Osim  O is the functional sym- the offset constraint, since it can be well-satisfied by being
metries extracted with the guidance of PlanIT [13], a tool perpendicular to walls behind it, a set of views could all
used to generate the relation graphs for scene synthesis. comply with it. However, there are unappealing cases. For
According to [13] and [28], only dominant objects with example, the rule of the third may pull an object closer so
properties including “left-right”, “front-back”, “corner-left”, that the object is at the trisection point. However, this single
and “corner-right” would be added to Osim . object would occupy most parts of the image, resulting in a
The rule of thirds is one of the best-known composition visual imbalance. Similarly, the connection constraint may
rules used in painting and photography, motivated by the cause the view to be too inclined to capture more windows
golden ratio [55]. With the rule of thirds, the focus of interest and doors. As a result, each constraint is necessary, and
must be placed at lines’ intersections that divide the frame Cview ðcÞ incorporates all of them.
into thirds from top to bottom and from left to right [21].
Commonly, the dominant furniture stands at the intersec- 3.5 View Mapping
tion of the third to the bottom [20]. Subsequently, viewers’ After having a set of independent views derived from each
eyes are led along the trisection lines through the image, room, a floorplan could be already explored. However, this
thus creating a more balanced composition and an aestheti- still isolates the views since connections among the views are
cally pleasing photo. One advantage of this approach is that inferred manually, e.g., finding a view of the exact next room
it avoids directly placing dominant objects at the center of based on the current view. Thus, we propose a way to auto-
the photo [56]. When placing a dominant object, such as a matically “map” individual views together with the floorplan.
table at the center of the photo, no matter where the other
objects are arranged around the dominant object, the bal- Algorithm 4. Mapping Perspective Views Together With
ance of the composition will not be improved. As a result, the Orthographic View
this constraint is formulated by a sign function, which Input:
returns 1 if a ray cast from intersections of the thirds touches 1: Orthographic view Vorth ;
a dominant object. Otherwise, it returns 0. We assemble the 2: Perspective views and the positions of their corresponding
idea of MageAdd [28], which interactively inserts objects in viewpoint on the orthographic view S ^ ¼ fðc1 ; pc Þ; ðc2 ; pc Þ;
1 1
real-time based on partitioning objects into dominant and . . . ; ðcn ; pc1 Þg;
subordinate objects to determine the dominance of objects. 3: Maximum perspective views per room k;
We also constrain the direction of the camera concerning Output: Mapping result;
walls. If a probe view is geared too close to walls, it would 4: Vselected := fg
contain much less meaningful content from walls causing 5: for each room do
obvious imbalance [51]. An extreme example should be that 6: Vselected := Vselected + selectViews(room, k)
the direction of a camera is perpendicular to the normal of 7: end for
the wall behind it, where half the view is filled with the 8: sort Vselected by viewpoint’s x-coordinate
wall. To alleviate this, we propose an offset constraint 9: (Vleft , Vright , Vtop , Vbottom ) := arrangeViews(Vselected )
Coff ðÞ measured by Equation (5). Specifically, for c derived 10: sort Vleft , Vright by viewpoint’s z-coordinate respectively
from TPP-2, since c is towards two walls, no penalty will be 11: sort Vtop , Vbottom by viewpoint’s x-coordinate respectively
imposed. If c is derived from OPP-Mid, the wall ri is used 12: while IntersectionExist() do
according to Algorithm 1. Otherwise, the virtual wall rj is 13: for ci , cj in (Vleft , Vright , Vtop , Vbottom ) do
14: if mappingLineIntersect((ci , pci ), (cj , pcj )) then
used. The returned value equals the included angle between
15: swapPosition(ci , cj )
wall normals and ~ b. Similar to content constraints, aesthetic
16: end if
constraints are weighted and summed as Ca ðÞ ¼ lay Clay ðÞ
17: end for
þtrd Ctrd ðÞ þ off Coff ðÞ, where Clay ðÞ, Ctrd ðÞ and Coff ðÞ
18: end while
refer to layout constraints, the rule of thirds and offset
Authorized licensed use limited to: Tsinghua University. Downloaded on November 11,2023 at 12:40:28 UTC from IEEE Xplore. Restrictions apply.
5530 IEEE TRANSACTIONS ON VISUALIZATION AND COMPUTER GRAPHICS, VOL. 29, NO. 12, DECEMBER 2023

TABLE 2
Qualitative Results When we Discard a Constraint

Each cell contains a relatively unappealing case due to the lack of a constraint.

Algorithm 4 formulates the process of view mapping. This and right columns according to the x-coordinate of their
algorithm selects no more than k perspective views for each viewpoints. The top and bottom spaces are utilized to place
room and places them around the orthographic view with as the middlemost perspective views when there are too many
few intersections between the mapping lines as possible. views to be arranged to make the result more compact. Sup-
In order to provide users views from different positions, pose the top or bottom row capacity is sufficient for the
the function “selectViews” selects the top-k scattered view- selected middlemost views. Which row to choose depends
points that cover the most area of each room. Since k is on the overall distance to the corresponding side. Depend-
small, we calculate the convex hull of all available view- ing on the number of excess views, it may also take both the
points and select k points from the hull to satisfy the most- top and bottom space. Since all views are arranged to one of
scattered condition. If there are less than k points in the the four sides, an initial mapping layout is obtained with a
hull, a random pick from the interior points is adopted since fixed margin between each view. The views along each side
the convex hull covers all the points. are sorted to determine their initial positions and to avoid
The function “arrangeView” adaptively assigns the per- intersections of mapping lines within the same side. As
spective views to the four sides around the orthographic shown in Algorithm 4, the views arranged to the left/right
view and gives an initial mapping layout. The fundamental sides are sorted by the z-coordinate of their viewpoints. The
strategy is to evenly split the perspective views to the left x-coordinate sorts those from the top and bottom sides.
Authorized licensed use limited to: Tsinghua University. Downloaded on November 11,2023 at 12:40:28 UTC from IEEE Xplore. Restrictions apply.
ZHANG ET AL.: SCENEVIEWER: AUTOMATING RESIDENTIAL PHOTOGRAPHY IN VIRTUAL ENVIRONMENTS 5531

Though the sorting and arrangement above can signifi-


cantly reduce intersections of lines, further intersection
check is needed: we add a distance check between end-
points and line segments to the regular segment intersection
test in Algorithm 43. The distance check is triggered if the
distance between an endpoint and a mapping line is less
than the endpoint radius. The placement of the intersected
views is swapped until no lines intersect.

4 EXPERIMENTS AND APPLICATIONS


4.1 Setup and Results
We utilize the recently released 3D-Front dataset [57], [58],
which contains nearly 10,000 floorplans with more than
Fig. 4. A platform for visualizing, exploring and manipulating 3D indoor
70,000 rooms and nearly 10,000 3D models, to demonstrate
scenes. More details are included in the supplementary video, available
our method fully. The aspect ratio of views is set as 1920:1080. online.
The vertical FoV is set as 75 . For the height of cameras, as
suggested by rule 2, we set a typical value of 1.3 meters. had a basic knowledge of photography by showing them
The weight of Equation (2) is set as 0.5. The weights of several well-designed photographs derived from the data-
content constraints are set as 1.0 for objects and 0.6 for con- set. The rooms were selected with coherent groups greater
nections. The weights of aesthetic constraints are set as 3.0 than 2 to enable sufficient variations of views. The technical
for layout, 1.0 for the rule of thirds, and 10.0 for camera staff stood by the participant in case of technical questions
directions. The interactive platform is rendered based on during the experiment. In this way, we collected a set of
Three.js4, a popular rendering engine on top of WebGL, and over 400 views considered the ground truth (GT), which
other results are rendered using Mitsuba [16], a photo-real- will be compared in the following experiments. Next, we
istic rendering system. Our method is implemented with conducted a user study to verify our ability to generate indi-
Numpy (Scipy) and Shapely mainly for operating geome- vidual photographs, as shown in Fig. 65. We invited 58 sub-
tries, e.g., processing room shapes. The system is deployed jects to compare the views automatically generated by our
on a desktop computer with a Titan RTX, 32GB memory, method and the GT views collected before. These subjects
and an AMD Ryzen 2700x CPU. are also amateurs, such as university students, workers,
Table 1 tabulates the generated views of several rooms engineers, designers, researchers, social media, etc. Each
based on the above constraints settings. Each row belongs to subject was presented with a series of questions. For each
a specific room, and each column belongs to one of the views question, a subject compared two photos and rated them
illustrated in Fig. 3. Normally, views are selected based on respectively, where two photos were taken in the same
Equation (1), regardless of which basic views they are. In room with the same layout. The rating scores range from 0
Table 1, to ensure that each room contains all the basic views (very poor) to 5 (exceptionally appealing), considering the
to show how they differ, views in different columns are inde- aesthetic, richness, goodness, and information obtained
pendent w.r.t Equation (1). In other words, Each strategy in from the photo. Each questionnaire was uniquely and ran-
Fig. 3 is guaranteed to be selected once and only once in each domly generated, i.e., rooms were arbitrarily assigned to
row. Ordinary results are shown in Fig. 5 and the supplemen- subjects, and two photos were selected randomly. Results
tary materials, available online. Our method could capture are shown in Table 3, where we report the ratings’ average
different levels of details in rooms, leveraging the variety of (AVG) and standard deviation (STD). We could conclude
proposed view-selecting strategies, e.g., capturing an entire that our method can generate photographs competitive
room or focusing on a coherent group. with manual operation.

Professional Evaluation. Next, we evaluate our method from


4.2 User Study
the perspective of professional photographers. Another 10
We conducted several user studies to verify how our method
subjects were recruited, and all reported professional skills
can generate views satisfying for residential photography.
in residential photography. They were invited to conduct a
Third-Party Evaluation. We invited 10 non-professional
similar process described in the third-party evaluations to
participants to adjust the cameras manually. We obtained a
annotate ground truth views using the platform in Fig. 4.
set of “handmade” photographs, leveraging the platform
Subsequently, annotated views were organized. We recre-
above, which enables participants to interact with 3D scenes
ated the questionnaire, where each question contains a pro-
through both the orbital control and the first-person control,
fessionally designed view and a generated view by our
as shown in Fig. 4. Before the experiment, one technical staff
method in the same room. The newly recruited professio-
and a detailed manual told each participant how to control
nals were further invited to conduct the new questionnaire.
the platform. The technical staff also ensured participants
Instead of purely finishing a questionnaire, this process is
also considered an interview. Technical staff was nearby to
3. The thickness of a line segment and the radius of the endpoint is
also taken into account when testing whether two mapping lines
intersect. 5. See the supplementary document for more details about the UI,
4. https://threejs.org/ available online.
Authorized licensed use limited to: Tsinghua University. Downloaded on November 11,2023 at 12:40:28 UTC from IEEE Xplore. Restrictions apply.
5532 IEEE TRANSACTIONS ON VISUALIZATION AND COMPUTER GRAPHICS, VOL. 29, NO. 12, DECEMBER 2023

Fig. 5. The results of view mapping, where generated photos are mapped around floorplans. More results are included in the supplementary materi-
als, available online.

record opinions, reasons and suggestions about why they 4.3 View Visualization
favoured one view. Note that each subject not only In this section, we implement a platform aiming at manipu-
answered questions involving her/his photos but also lating 3D scenes with the help of SceneViewer to show the
answered questions involving photos annotated by the potentiality of our method. First, as shown in Fig. 4a, we
others, leading to a more comprehensive evaluation. Table 4 embed the view mapping into the platform to give the user
shows the averages and standard deviations of ratings. a more accessible and quicker understanding of floorplans.
Since professionals have insights, their results are slightly As verified in Section 4.2, view mapping notably reduces
better, but our method are still comparable. The qualitative the time consumption of selecting favoured floorplans for
summary of the interviews will be discussed in Section 5. each user. Note that view mapping focuses on how results
Mapping Evaluation. We conduct one more user study to are effectively presented to users instead of how results are
verify the ability to generate a series of photographs for recommended to users, which belongs to the topic of recom-
floorplans. Another 20 subjects were invited, and each sub- mendation systems [60].
ject was presented with two series of photos: the tradi- Second, when arranging objects in floorplans, we often
tional solution and the view mapping of this paper. The need to wander the 3D scenes, causing trial-and-error and
traditional solution is adopted by most platforms, e.g., time-consuming operations on changing views. As shown
[59], where they typically show a top-down orthographic in Fig. 4b, by clicking on the “SceneViewer” button, individ-
view of the entire floorplan and several accompanying ual views are generated at the left search bar in the form of
photos captured by ordinary users. Each subject was only rendered images. The camera is smoothly translated into
given one mapped view of each floorplan. For both meth- the 3D scene by clicking on a rendered image. Users can eas-
ods, two sets of floorplans were shown to users, and each ily access different rooms and switch to different views
set contained 20 floorplans. The two sets do not intersect, within each room, so the manipulations, such as transform-
and floorplans have approximately the same scales. Sub- ing objects, adding objects, and deleting objects, can be per-
jects were asked to watch all floorplans and sorted them formed more efficiently.
by preference carefully. Results are shown in Table 5, Additionally, the generated views are not “still”. Users
where we report the average time consumption for sorting could play an automatic animation connecting views on a
and the visual satisfaction of the two methods (brackets floorplan if they favour a quicker tour instead of interac-
contain the standard deviations). Consequently, our tions. A trajectory is a user-oriented tour, so it should go
method alleviates the time required to understand floor- through each view at least once. Meanwhile, the trajectory
plans and improves users’ visual satisfaction. According should pass each view at most once to reduce redundancies.
to the Kruskal-Wallis H-Test (brackets contain p-values), Therefore, this leads us to the Hamilton pathfinding algo-
there are significant statistical differences between our rithm, which is an np-complete problem [61], [62].
method and the traditional solution. Thus, we can con- Efforts [63], [64] were made to heuristically solve this prob-
clude that our method of generating a series of photo- lem by efficiently finding the approximated and sufficiently
graphs for floorplans is more effective than the traditional close results. In this paper, we utilize the algorithm of
method. We also interviewed several subjects, and their Angluin and Valiant [64] to find such a trajectory.
responses indicate that our method has a sense of whole- The basic idea of [64] is that we start from a random ver-
ness and compactness, which reduces the time consumed tex p, and each time we choose an edge ðp; v0 Þ from the
to understand the floorplans. graph G ¼ ðV; EÞ. If v0 is not an existing vertex in the Hamil-
ton path, we add v0 and set it to the new p. Otherwise, a
short-circuit case is detected, and we cut off the edge ðu; v0 Þ
where u is the neighbour of v0 . Then, the edge ðp; v0 Þ is
added to the path and p is set to u. However, additional con-
cerns are inevitable due to the complexity of 3D scenes. Spe-
cifically, an initial graph G ¼ ðV; EÞ is formulated and
passed to [64], where V refers to the generated views, but E
is not explicitly given in 3D scenes. For three reasons, We
can not assume that E connects all the vertices resulting in a
complete graph. First, a transition between two views
should not go through walls; second, a transition between
Fig. 6. The user study platform of section 4.2. (a): Each user is pre- two views in different rooms should not happen; and third,
sented with a series of questions containing a manual photo and a photo
taken by our method. Temporally skipping a question or jumping to edge connections should consider the distances between
another question is enabled. (b): Each photo can be zoomed in or out. vertices. Additionally, a trajectory should be physically as
Authorized licensed use limited to: Tsinghua University. Downloaded on November 11,2023 at 12:40:28 UTC from IEEE Xplore. Restrictions apply.
ZHANG ET AL.: SCENEVIEWER: AUTOMATING RESIDENTIAL PHOTOGRAPHY IN VIRTUAL ENVIRONMENTS 5533

TABLE 3
User Study: Third-Party Evaluation

- Bedroom Living&Dinning Room Total


Metric GT Ours GT Ours GT Ours
AVG 3.228 3.359 3.209 3.198 3.221 3.296
STD 1.09 1.096 1.157 1.203 1.117 1.142

TABLE 4
User Study: Professional Evaluation

- Bedroom Living&Dinning Room Total


Metric GT Ours GT Ours GT Ours
AVG 3.644 3.492 3.797 3.52 3.707 3.503
STD 0.993 0.94 0.937 0.923 0.973 0.933

TABLE 5
User Study: Comparing Mapped Views

Metric Ground Truth Ours


Time Consumption 1395.0 (416.896) 880.0 (271.819)
Kruskal-Wallis H-Test 13.534 (0.0)
Visual Satisfaction 2.6 (1.281) 4.05 (0.865)
Kruskal-Wallis H-Test 13.093 (0.0)

short as possible and be thematically successive, i.e., several rotate/rescale objects in the “god’s perspective”. Neverthe-
adjacent transitions need to focus on the same room instead less, the orbital control does not consider the daily views,
of “wandering” among rooms. Consequently, to determine i.e., residential photography. In this application, we facili-
E and address concerns in 3D scenes, we proposed the fol- tate the manipulation of an object in 3D scenes [65], [66] by
lowing rules to complement [64]: viewing the effect of manipulation with our efficient view
generation.
1) Two views do not share an edge in E if they are in Foremost, intuitively, views of a room being appealing
two different rooms. are necessary for the layout of the room to be appealing.
2) We no longer uniformly choose a random edge ðp; v0 Þ This intuition motivates us to incorporate automatic pho-
from G. Instead, we choose an incident edge with the tography into interactive scene synthesis [8], [9], [28], where
nearest distance w.r.t v0 . we iteratively suggest views to users and require users to
3) Extra-views are placed at the positions of doors. fill in the room based on the views. Filling in rooms could
4) Each extra view can be passed more than once. be accomplished by either traditional interactions [65], [66]
5) Each photography-satisfied view belongs to and or MageAdd [28], a framework for synthesizing 3D scenes
only belongs to the room where it is generated. through “popping” up objects with transformations based
6) Each extra-view belongs to two rooms which are the on cursor movements in real-time.
rooms on both sides of the door. Starting from an empty room, we first suggest the best
Rules 1 and 3 address the problem of showing connec- view, typically faced toward windows or doors, according
tions between rooms. Rule 2 guides thematically successive to content constraints (Section 3.3). Subsequently, our
trajectories. Rule 1, 5, and 6 are for constructing the edge set method iteratively generates appropriate room views when
E. Finally, rule 4 guarantees the existence of trajectories the contents are changed. Users should fill the room with
because it is very likely to have a room with only one rota- objects using the suggested view for each iteration. This iter-
tion to pass in and out. The animated results are shown in ative process terminates once the worst view of the room
the supplementary materials, available online. satisfies Equation (1) (Section 3) and the room has no space
left. We formulate two rules during iterations to fully
4.4 View-Based Scene Synthesis explore the variation of views. First, we suggest alterna-
In this section, we implement a scene synthesis application tively best and worst views according to Equation (1). For
to show the additional feature of our method, thus further example, if the best view is suggested in the current itera-
strengthening our contribution to view generations. We tion, the worst view will be suggested in the next iteration.
observe that top-down views are not standard in our daily Second, swapping views are only triggered by dominant
lives. 3D scene modelling tools such as Kujiale [65] and objects, e.g., a double bed, coffee table, etc. Selecting and
Planner5d [66] utilize the orbital control that translates and transforming a dominant object is independent of other
rotates the perspective camera based on the top-down objects, whereas inferring subordinate objects depends on
views, which is convenient for users to insert/translate/ the dominant objects they belong to [6], [14], [28], which
Authorized licensed use limited to: Tsinghua University. Downloaded on November 11,2023 at 12:40:28 UTC from IEEE Xplore. Restrictions apply.
5534 IEEE TRANSACTIONS ON VISUALIZATION AND COMPUTER GRAPHICS, VOL. 29, NO. 12, DECEMBER 2023

Fig. 7. A scenario of view-based scene synthesis, where the transparent object is the object to be added. We first suggest a view w.r.t the double bed,
and the room is further filled with several subordinate objects relating to the bed. After the user inserts a desk, the camera is transformed toward
another wall hinting the user to add a cabinet. This process continues until the room is appropriately filled. More details are included in the supple-
mentary video, available online.

allows users to exploit a view as much as possible. A sce- photos, pushing the camera closer to the ground. Second,
nario is shown in Fig. 7. one could rotate the pitch of the camera towards the
ground. However, this results in more vanishing points,
thus visually affecting the aesthetic of views. Third, photog-
5 DISCUSSIONS AND LIMITATIONS raphers could cut off the upper image patch containing
This section presents further discussions on our method and parts of the ceilings while preserving the lower after obtain-
its limitations, giving insights for future work. ing the views, but the extent of how many pixels to cut, how
Preparing Method Input. This paper focuses on automatic to preserve the aspect ratio and how this potentially influen-
photography in digital 3D scenes, where the input, i.e., the ces the calculation of constraints should be carefully investi-
scene configurations, can either be captured by standard 3D gated in the future.
scene modelling tools [65], [66] or be provided by 3D scene Balance of View Content. We may also force views to be
datasets [57], [58], [67]. For pure 3D scene meshes and 3D balanced w.r.t objects, but this would potentially break the
scans, existing literature is available for semantic segmenta- rules of photography or distract cameras, as shown in
tion [68], [69], model retrieval [70] and registering [71] to Fig. 9. Future research should be conducted to improve the
convert them into the configurations required by our balance of objects.
method. These approaches may not always work, for real- Rounded Walls. Because our method is executed based on
world scans could be complex, so we recommend using the lines of polygons, i.e., room shapes, some exceptional cases,
3D scene modelling tools above. However, developing tools such as rounded walls, should be discretized to polygons
for acquiring scene configurations is beyond the scope of beforehand. The mapping of our method focuses on floor-
our paper. plans. However, given a house with multiple floors, our
Object-Oriented Photography. One of our motivations is to method could only generate separate maps for each floor.
enable users to perceive as much content as possible. How- So an improvement could be a more comprehensive map-
ever, sometimes a user may want to take a specific object as ping on multiple floors.
the theme while other accompanied contents are considered Alternative Types of Views. Our method does not cover all
less significant. For example, this happens when a furniture types of “shots” humans would have taken. A person may
seller tries to advertise her/his products. The photos would try other alternative views in addition to OPP and TPP. We
only focus on the commodity objects, and other objects con- have interviewed the professionals in Section 4.2 and sum-
tribute to the reality of the 3D scene, which is exclusively marised their suggestions, which are also future directions
elaborated for the objects they would like to sell. Fig. 8 of automatic residential photography. First, though views
shows a few preliminary results of group-oriented photo- generated by our method follow the OPP and TPP rules, it
graphs as alternative future directions of automatic would be more appealing to slightly tune the camera direc-
photography. tion to achieve a better perspective, as shown in Fig. 10a.
Photo Clipping. To enable views to be aesthetic and con-
tain more content, ceilings are considered less critical, and a
camera should give more attention to walls and grounds.
Therefore, three methods are adopted. First, photographers
could lower their cameras so that grounds are “lifted” in the

Fig. 8. The photography for an object group includes a coffee table and Fig. 9. The balancing problem to be explored in the future. Top: The origi-
two sofas, where other objects serve as foils. (a) Mimicking first-person nal results. Bottom: The balanced results based on objects. More
views. (b) Using the bounding square of the group as the virtual room advanced techniques should be made so that the balancing neither viola-
shape and calculating its one-point perspective view. tes the existing photography rules nor drives cameras in weird directions.
Authorized licensed use limited to: Tsinghua University. Downloaded on November 11,2023 at 12:40:28 UTC from IEEE Xplore. Restrictions apply.
ZHANG ET AL.: SCENEVIEWER: AUTOMATING RESIDENTIAL PHOTOGRAPHY IN VIRTUAL ENVIRONMENTS 5535

Fig. 11. Three representative room layouts that are abnormal in the data-
set. (a): A room is filled up with a huge object. (b): An implausible layout.
(c): A room with a tiny size.

Hypothesizing views for tiny rooms is also not as capable as


Fig. 10. Typical views based on the shots taken by professionals (top dealing with regular rooms. Consequently, they are also
row) and the corresponding views generated (bottom row) in the same worthy of investigation.
rooms. (a): A better perspective could be achieved by slightly tuning the
camera direction. (b) Sometimes, our method may create a sense of
oppression. (c) Sometimes an object (e.g., the double bed) should be
fully exposed. 6 CONCLUSION AND FUTURE WORKS
In this paper, we proposed and demonstrated a framework,
Second, the compositions of photos can be optimized to SceneViewer, to generate views for 3D scenes by computa-
avoid a sense of oppression, as shown in Fig. 10b. Third, tionally formulating residential photography. Our approach
sometimes, it is better to force an object to be fully exposed, is flexible in that it can generate individual views by select-
e.g., the double bed in Fig. 10c. ing the best viewpoints and combining them with the floor-
Views Generated w.r.t the Constraints. Currently, con- plan, providing users with both local details and overall
straints of our method are used for evaluating the probe understanding. Extensive experiments and applications
views. In contrast, those views are not generated according show that our method is effective and has various academic
to a specific constraint, i.e., the views are generated accord- and industrial uses. We hope this work could advance
ing to OPP and TPP rules regardless of constraints. For future research, especially the literature relying upon show-
example, the method does not guarantee that the objects of casing 3D scenes.
interest satisfy the rule of thirds. Instead, the views might Our method effectively visualizes indoor scenes in vir-
be chosen because the probe views have the suitable objects tual environments, thus setting up a foundation for future
already placed, satisfying the rule of the thirds and other works such as trajectory generation for movie cameras. We
constraints since constraints work together. provide a way for selecting camera positions and animating
FoV and Camera Height. We also notice two hyper-param- them efficiently. By adding constraints such as character
eters: the FoV and the camera’s height. The FoV is set as 75 importance or the location of a critical storyline, a similar
because it is a choice to capture more content [72]. However, idea can be derived to support generating character-aware,
humans feel differently given different lenses (FoVs). It is story-aware, and even audience-interactable movie camera
generally believed that the typical focal length is the focal trajectory.
length that best represents the human eye’s perception of
the surrounding environment. The field of view covered by ACKNOWLEDGMENTS
a standard lens is between 40 and 50 , similar to what a
human would see. A relatively large FoV increases the per- The authors would like to thank all reviewers for their
ceived depth and the distances between objects. This incre- thoughtful comments.
ment causes near objects to appear closer, thus visually
being bigger than they are, while distant objects appear REFERENCES
smaller and farther away [73]. Selecting FoVs is a trade-off [1] L. Nan, K. Xie, and A. Sharf, “A search-classify approach for clut-
between mimicking human eyes and the sense of depth. tered indoor scene understanding,” ACM Trans. Graph., vol. 31,
no. 6, pp. 1–10, 2012.
Note that optical distortions of a real-world camera could [2] S. Satkin, J. Lin, and M. Hebert, “Data-driven scene understanding
be corrected given the right lens profiles, and a virtual cam- from 3D models,” 2012.
era’s projective distortions could be corrected using an [3] K. Xu et al., “Organizing heterogeneous scene collections
existing method such as [39]. Similarly, the camera height is through contextual focal points,” ACM Trans. Graph., vol. 33,
no. 4, pp. 1–12, 2014.
set lower than human eyes because it gives fewer biases to [4] S.-H. Zhang, S.-K. Zhang, Y. Liang, and P. Hall, “A survey of 3D
the ceilings. In daily life, we may take photos with our knees indoor scene synthesis,” J. Comput. Sci. Technol., vol. 34, no. 3,
bent, which is also a trade-off between mimicking human 2019, Art. no. 594.
[5] Q. Fu, X. Chen, X. Wang, S. Wen, B. Zhou, and H. Fu, “Adaptive
eyes and catching more content. Thus, investigating how synthesis of indoor scenes via activity-associated object relation
humans feel given different FoVs and heights of a camera is graphs,” ACM Trans. Graph., vol. 36, no. 6, pp. 1–13, 2017.
also a promising future direction. [6] S.-H. Zhang, S.-K. Zhang, W.-Y. Xie, C.-Y. Luo, Y.-L. Yang, and
Dataset. As a method designated for 3D scenes, our H. Fu, “Fast 3D indoor scene synthesis by learning spatial relation
priors of objects,” IEEE Trans. Vis. Comput. Graph., vol. 28, no. 9,
method also suffers from implausible cases from the data- pp. 3082–3092, Sep. 2022.
set. As shown in Fig. 11, if a room is occupied with a simple [7] T. Liu, A. Hertzmann, W. Li, and T. Funkhouser, “Style compati-
but colossal object, the constraints are hard to be met for bility for 3D furniture models,” ACM Trans. Graph., vol. 34, no. 4,
given probe views. Our method may also fail to adjust pp. 1–9, 2015.
[8] L.-F. Yu, S.-K. Yeung, and D. Terzopoulos, “The clutterpalette: An
views given a room with an implausible layout, e.g., the ori- interactive tool for detailing indoor scenes,” IEEE Trans. Vis. Com-
entation of a double bed is inconsistent with its nightstands. put. Graph., vol. 22, no. 2, pp. 1138–1148, Feb. 2015.
Authorized licensed use limited to: Tsinghua University. Downloaded on November 11,2023 at 12:40:28 UTC from IEEE Xplore. Restrictions apply.
5536 IEEE TRANSACTIONS ON VISUALIZATION AND COMPUTER GRAPHICS, VOL. 29, NO. 12, DECEMBER 2023

[9] S. Zhang, Z. Han, and H. Zhang, “User guided 3D scene [35] S. Yang, J. Xu, K. Chen, and H. Fu, “View suggestion for interac-
enrichment,” in Proc. 15th ACM SIGGRAPH Conf. Virtual-Reality tive segmentation of indoor scenes,” Comput. Vis. Media, vol. 3,
Continuum Appl. Ind., 2016, pp. 353–362. no. 2, pp. 131–146, 2017.
[10] M. Yan, X. Chen, and J. Zhou, “An interactive system for efficient [36] S. Takahashi, I. Fujishiro, Y. Takeshima, and T. Nishita, “A fea-
3D furniture arrangement,” in Proc. Comput. Graph. Int. Conf., ture-driven approach to locating optimal viewpoints for volume
2017, pp. 1–6. visualization,” in Proc. IEEE Vis., 2005, pp. 495–502.
[11] S. Qi, Y. Zhu, S. Huang, C. Jiang, and S.-C. Zhu, “Human-centric [37] M. Christie, P. Olivier, and J.-M. Normand, “Camera control in
indoor scene synthesis using stochastic grammar,” in Proc. IEEE computer graphics,” in Comput. Graph. Forum, vol. 27, no. 8,
Conf. Comput. Vis. Pattern Recognit., 2018, pp. 5899–5908. pp. 2197–2218, 2008.
[12] K. Wang, M. Savva, A. X. Chang, and D. Ritchie, “Deep convolu- [38] P.-P. Vazquez, M. Feixas, M. Sbert, and W. Heidrich, “Automatic
tional priors for indoor scene synthesis,” ACM Trans. Graph., view selection using viewpoint entropy and its application to
vol. 37, no. 4, 2018, Art. no. 70. image-based modelling,” in Comput. Graph. Forum, vol. 22, no. 4,
[13] K. Wang, Y.-A. Lin, B. Weissmann, M. Savva, A. X. Chang, and pp. 689–700, 2003.
D. Ritchie, “PlanIT: Planning and instantiating indoor scenes with [39] P.-P. Vazquez and M. Sbert, “On the fly best view detection using
relation graph and spatial prior networks,” ACM Trans. Graph., graphics hardware,” in Proc. 4th Int Conf. Vis., Imag., Image Pro-
vol. 38, no. 4, 2019, Art. no. 132. cess., 2004, pp. 625–630.
[14] S.-K. Zhang, W.-Y. Xie, and S.-H. Zhang, “Geometry-based layout [40] M. Eitz, R. Richter, T. Boubekeur, K. Hildebrand, and M. Alexa,
generation with hyper-relations among objects,” Graph. Models- “Sketch-based shape retrieval,” ACM Trans. Graph., vol. 31, no. 4,
vol. 116, 2021, Art. no. 101104. pp. 1–10, 2012.
[15] A. Handa, V. Patraucean, V. Badrinarayanan, S. Stent, and R. [41] L. Neumann, M. Sbert, B. Gooch, and W. Purgathofer et al.,
Cipolla, “Understanding real world indoor scenes with synthetic “Viewpoint quality: Measures and applications,” in Proc. 1st Euro-
data,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, graphics Workshop Comput. Aesthetics Graph., Vis. Imag., 2005,
pp. 4077–4085. pp. 185–192.
[16] W. Jakob, “Mitsuba renderer,” 2010, [Online]. Available: http:// [42] Z. Liu, J. Zhang, and L. Liu, “Upright orientation of 3D shapes with
www.mitsuba-renderer.org convolutional networks,” Graphical Models, vol. 85, pp. 22–29,
[17] X. Bonaventura, M. Feixas, M. Sbert, L. Chuang, and C. Wall- 2016.
raven, “A survey of viewpoint selection methods for polygonal [43] S. Freitag, B. Weyers, and T. W. Kuhlen, “Efficient approximate
models,” Entropy, vol. 20, no. 5, 2018, Art. no. 370. computation of scene visibility based on navigation meshes and
[18] D.-Y. Chen, X.-P. Tian, Y.-T. Shen, and M. Ouhyoung, “On visual applications for navigation and scene analysis,” in Proc. IEEE
similarity based 3D model retrieval,” in Comput. Graph. Forum, Symp. 3D User Interfaces, 2017, pp. 134–143.
vol. 22, no. 3, pp. 223–232, 2003. [44] P. Barral, G. Dorme, and D. Plemenos, “Intelligent scene explora-
[19] R. Hicks and F. Schultz, Interiors: A Guide to Professional Lighting tion with a camera,” in Proc. Int. Conf. 3IA, 2000, pp. 3–4.
Techniques Interior Shots (Pro-lighting), Brighton, U.K.: RotoVision, [45] C. And ujar, P. Vazquez, and M. Fairen, “Way-finder: Guided
1996. tours through complex walkthrough models,” in Comput. Graph.
[20] M. G. Harris, Professional Interior Photography, New York, NY, Forum, vol. 23, no. 3, pp. 499–508, 2004.
USA: Taylor & Francis, 2003. [46] B. Jaubert, K. Tamine, and D. Plemenos, “Techniques for off-line
[21] D. Prakel, Basics Photography 01: Composition, vol. 1, Worthing, scene exploration using a virtual camera,” in Proc. Int. Conf. 3IA,
U.K., AVA Publishing, 2006. 2006, Art. no. 1.
[22] L.-F. Yu, S. K. Yeung, C.-K. Tang, D. Terzopoulos, T. F. Chan, and [47] S. Freitag, B. Weyers, and T. W. Kuhlen, “Interactive exploration
S. Osher, “Make it home: Automatic optimization of furniture assistance for immersive virtual environments based on object vis-
arrangement,” ACM Trans. Graph., vol. 30, no. 4, 2011, Art. no. 86. ibility and viewpoint quality,” in Proc. IEEE Conf. Virtual Reality
[23] Y. M. Kim, N. J. Mitra, D.-M. Yan, and L. Guibas, “Acquiring 3D 3D User Interfaces, 2018, pp. 355–362.
indoor environments with variability and repetition,” ACM Trans. [48] J. D’amelio, Perspective Drawing Handbook. Chelmsford, MA, USA:
Graph., vol. 31, no. 6, pp. 1–11, 2012. Courier Corporation, 2004.
[24] W. Li et al., “Interiornet: Mega-scale multi-sensor photo-realis- [49] M. Claassens, “Interior photography: The ins and outs,” Ph.D. dis-
tic indoor scenes dataset,” in Proc. Brit. Mach. Vis. Conf., 2018, sertation, Faculty of Human Sciences, Central Univ. Technology,
p. 77. Bloemfontein, Free State, 1997.
[25] J. Zheng, J. Zhang, J. Li, R. Tang, S. Gao, and Z. Zhou, [50] B. P. Krages, “The art of composition,” New York, NY, USA: All-
“Structured3D: A large photo-realistic dataset for structured 3D worth Communications, 2005.
modeling,” in Proc. Eur. Conf. Comput. Vis., 2020, pp. 519–535. [51] J. Webb, Basics Creative Photography 01: Design Principles. London,
[26] M. Savva, A. X. Chang, P. Hanrahan, M. Fisher, and M. Nießner, U.K.: Bloomsbury Publishing, 2017.
“PiGraphs: Learning interaction snapshots from observations,” [52] I. Pantazopoulos and S. Tzafestas, “Occlusion culling algorithms:
ACM Trans. Graph., vol. 35, no. 4, pp. 1–12, 2016. A comprehensive survey,” J. Intell. Robotic Syst., vol. 35, no. 2,
[27] Y. Liang, L. Fan, P. Ren, X. Xie, and X.-S. Hua, “Decorin: An auto- pp. 123–156, 2002.
matic method for plane-based decorating,” IEEE Trans. Vis. Com- [53] D. Cohen-Or, G. Fibich, D. Halperin, and E. Zadicario,
put. Graph., vol. 27, no. 8, pp. 3438–3450, Aug. 2020. “Conservative visibility and strong occlusion for viewspace parti-
[28] S.-K. Zhang, Y.-X. Li, Y. He, Y.-L. Yang, and S.-H. Zhang, tioning of densely occluded scenes,” in Comput. Graph. Forum,
“Mageadd: Real-time interaction simulation for scene synthesis,” vol. 17, no. 3, pp. 243–253, 1998.
in Proc. 29th ACM Int. Conf. Multimedia, 2021, pp. 965–973. [54] H. Weghorst, G. Hooper, and D. P. Greenberg, “Improved compu-
[29] A. Luo, Z. Zhang, J. Wu, and J. B. Tenenbaum, “End-to-end opti- tational methods for ray tracing,” ACM Trans. Graph., vol. 3, no. 1,
mization of scene layout,” in Proc. IEEE/CVF Conf. Comput. Vis. pp. 52–69, 1984.
Pattern Recognit., 2020, pp. 3754–3763. [55] B. Gooch, E. Reinhard, C. Moulding, and P. Shirley, “Artistic com-
[30] Y. He et al., “Style-compatible object recommendation for multi- position for image creation,” in Proc. Eurographics Workshop Ren-
room indoor scene synthesis,” 2020, arXiv:2003.04187. dering Techn., 2001, pp. 83–88.
[31] M.-M. Cheng, Q.-B. Hou, S.-H. Zhang, and P. L. Rosin, “Intelligent [56] B. Krages, Photography: The Art of Composition. New York, NY,
visual media processing: When graphics meets vision,” J. Comput. USA: Simon and Schuster, 2012.
Sci. Technol., vol. 32, no. 1, pp. 110–121, 2017. [57] H. Fu et al., “3D-front: 3D furnished rooms with layouts and
[32] U. D. Bordoloi and H.-W. Shen, “View selection for volume ren- semantics,” in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2021,
dering,” in Proc. IEEE Vis., 2005, pp. 487–494. pp. 10933–10942.
[33] M. Ruiz, I. Boada, M. Feixas, and M. Sbert, “Viewpoint informa- [58] H. Fu et al., “3D-future: 3D furniture shape with texture,”
tion channel for illustrative volume rendering,” Comput. Graph., 2020, arXiv:2009.09633.
vol. 34, no. 4, pp. 351–360, 2010. [59] ziroom.com, “Ziroom,” Jun. 2022. [Online]. Available: https://
[34] J. Tao, J. Ma, C. Wang, and C.-K. Shene, “A unified approach www.ziroom.com/
to streamline selection and viewpoint selection for 3D flow [60] P. Nagarnaik and A. Thomas, “Survey on recommendation sys-
visualization,” IEEE Trans. Vis. Comput. Graph., vol. 19, no. 3, tem methods,” in Proc. 2nd Int. Conf. Electron. Commun. Syst., 2015,
pp. 393–406, Mar. 2013. pp. 1603–1608.

Authorized licensed use limited to: Tsinghua University. Downloaded on November 11,2023 at 12:40:28 UTC from IEEE Xplore. Restrictions apply.
ZHANG ET AL.: SCENEVIEWER: AUTOMATING RESIDENTIAL PHOTOGRAPHY IN VIRTUAL ENVIRONMENTS 5537

[61] H. R. Lewis, “Computers and intractability. A guide to the theory Hou Tam received the bachelor’s degree in com-
of NP-completeness,” J. Symbolic Log., vol. 48, no. 2, pp. 498–500, puter science and technology from Tsinghua Uni-
1983. versity, in 2021. He is currently working toward the
[62] H. S. Wilf and H. S. Wilf, Algorithms and Complexity, vol. 986, Eng- master’s degree with the Department of Computer
lewood Cliffs, NJ, USA: Prentice-Hall, 1986. Science and Technology, Tsinghua University.
[63] F. Rubin, “A search procedure for hamilton paths and circuits,”
J. ACM, vol. 21, no. 4, pp. 576–580, 1974.
[64] D. Angluin and L. G. Valiant, “Fast probabilistic algorithms for
hamiltonian circuits and matchings,” J. Comput. Syst. Sci., vol. 18,
no. 2, pp. 155–193, 1979.
[65] kujiale.com, “Kujiale,” Dec. 2020. [Online]. Available: https://
www.kujiale.com/ Yi-Xiao Li received the bachelor’s degree in arts
[66] planner5d.com, “Planner5d,” Dec. 2020. [Online]. Available: & design from Tsinghua University, Beijing, in 2020.
https://planner5d.com/ She is currently working toward the master’s degree
[67] S. Song, F. Yu, A. Zeng, A. X. Chang, M. Savva, and T. Funk- with the Academy of Arts & Design, Tsinghua Uni-
houser, “Semantic scene completion from a single depth versity, Beijing. Her research interests include
image,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, human-computer interaction and virtual reality.
pp. 1746–1754.
[68] A. Dai, M. Nießner, M. Zoll€ ofer, S. Izadi, and C. Theobalt,
“Bundlefusion: Real-time globally consistent 3D reconstruction
using on-the-fly surface re-integration,” ACM Trans. Graph.,
vol. 36, no. 3, pp. 1–18, 2017.
[69] T. Vu, K. Kim, T. M. Luu, X. T. Nguyen, and C. D. Yoo, “SoftGroup Tai-Jiang Mu received the bachelor’s and PhD
for 3D instance segmentation on 3D point clouds,” in Proc. IEEE degrees in computer science and technology from
Conf. Comput. Vis. Pattern Recognit., 2022, pp. 2698–2707. Tsinghua University, in 2011 and 2016, respectively.
[70] C. R. Qi, H. Su, K. Mo, and L. J. Guibas, “PointNet: Deep learning He is currently an assistant researcher with the
on point sets for 3D classification and segmentation,” in Proc. Graphics and Geometric Computing Group in the
IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 652–660. Department of Computer Science and Technology,
[71] Z. J. Yew and G. H. Lee, “RPM-Net: Robust point matching using Tsinghua University. His research interests include
learned features,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern computer graphics and visual media learning.
Recognit., 2020, pp. 11824–11833.
[72] M. Claassens, “Interior photography: The ins and outs,” Ph.D. dis-
sertation, Faculty of Human Sciences, Central Univ. Technology,
Bloemfontein, Free State, 1997.
[73] C. Marquardt, Wide-Angle Photography, San Rafael, CA, USA:
Rocky Nook, 2018. Song-Hai Zhang (Member, IEEE) received the
PhD degree in computer science and technology
from Tsinghua University, Beijing, in 2007. He is
currently an Associate Professor with the Depart-
Shao-Kui Zhang received the BS degree in soft- ment of Computer Science and Technology, Tsing-
ware engineering from Northeastern University, hua University. His research interests include
Shenyang, in 2018. He is currently working computer graphics and virtual reality.
toward the PhD degree with the Department of
Computer Science and Technology, Tsinghua
University, Beijing, China. His research interests
include computer graphics, 3D scene synthesis,
and intelligent 3D scene interaction.
" For more information on this or any other computing topic,
please visit our Digital Library at www.computer.org/csdl.

Authorized licensed use limited to: Tsinghua University. Downloaded on November 11,2023 at 12:40:28 UTC from IEEE Xplore. Restrictions apply.

You might also like