MeshFormer: High-Quality Mesh Generation with 3D-Guided Reconstruction Model

Minghua Liu^∗1,2† Chong Zeng^∗3‡ Xinyue Wei^1,2† Ruoxi Shi^1,2†
Linghao Chen^2,3† Chao Xu^2,4† Mengqi Zhang² Zhaoning Wang²
Xiaoshuai Zhang^1,2† Isabella Liu¹ Hongzhi Wu³ Hao Su^1,2

¹ UC San Diego ² Hillbot Inc. ³ Zhejiang University ⁴ UCLA

Project Website: https://meshformer3d.github.io/

Abstract

Open-world 3D reconstruction models have recently garnered significant attention. However, without sufficient 3D inductive bias, existing methods typically entail expensive training costs and struggle to extract high-quality 3D meshes. In this work, we introduce MeshFormer, a sparse-view reconstruction model that explicitly leverages 3D native structure, input guidance, and training supervision. Specifically, instead of using a triplane representation, we store features in 3D sparse voxels and combine transformers with 3D convolutions to leverage an explicit 3D structure and projective bias. In addition to sparse-view RGB input, we require the network to take input and generate corresponding normal maps. The input normal maps can be predicted by 2D diffusion models, significantly aiding in the guidance and refinement of the geometry’s learning. Moreover, by combining Signed Distance Function (SDF) supervision with surface rendering, we directly learn to generate high-quality meshes without the need for complex multi-stage training processes. By incorporating these explicit 3D biases, MeshFormer can be trained efficiently and deliver high-quality textured meshes with fine-grained geometric details. It can also be integrated with 2D diffusion models to enable fast single-image-to-3D and text-to-3D tasks.

^∗ Equal contribution. ^† Work done during internship at Hillbot Inc. ^‡ Work done during internship at UC San Diego.

1 Introduction

High-quality 3D meshes are essential for numerous applications, including rendering, simulation, and 3D printing. Traditional photogrammetry systems [57, 61] and recent neural approaches, such as NeRF [43], typically require a dense set of input views of the object and long processing times. Recently, open-world 3D object generation has made significant advancements, aiming to democratize 3D asset creation by reducing input requirements. There are several prevailing paradigms: training a native 3D generative model using only 3D data [13, 95] or performing per-shape optimization with Score Distillation Sampling (SDS) losses [47, 30]. Another promising direction is to first predict a sparse set of multi-view images using 2D diffusion models [33, 59] and then lift these predicted images into a 3D model by training a feed-forward network [32, 31]. This strategy addresses the limited generalizability of models trained solely on 3D data and overcomes the long runtime and 3D inconsistency of per-shape-optimization-based methods.

While many recent works explore utilizing priors from 2D diffusion models, such as generating consistent multi-view images [60, 59] and predicting normal maps from RGB [37, 12, 59], the feed-forward model that converts multi-view images into 3D remains underexplored. One-2-3-45 [32] leverages a generalizable NeRF method for 3D reconstruction but suffers from limited quality and success rates. One-2-3-45++ [31] improves on this by using a two-stage 3D diffusion model, yet it still struggles to generate high-quality textures or fine-grained geometry. Given that sparse-view reconstruction of open-world objects requires extensive priors, another family of works pioneered by the large reconstruction model (LRM) [16] combines large-scale transformer models with the triplane representation and trains the model primarily using rendering loss. Although straightforward, these methods typically require over a hundred GPUs to train. Moreover, due to their reliance on volume rendering, these methods have difficulty extracting high-quality meshes. For instance, some recent follow-up works [85, 79] implement complex multi-stage “NeRF-to-mesh” training strategies, but the results still leave room for improvement.

In this work, we present MeshFormer, an open-world sparse-view reconstruction model that takes a sparse set of posed images of an arbitrary object as input and delivers high-quality 3D textured meshes with a single forward pass in a few seconds. Instead of representing 3D data as “2D planes” and training a “black box” transformer model optimizing only rendering loss, we find that by incorporating various types of 3D-native priors into the model design, including network architecture, supervision signals, and input guidance, our model can significantly improve both mesh quality and training efficiency. Specifically, we propose representing features in explicit 3D voxels and introduce a novel architecture that combines large-scale transformers with 3D (sparse) convolutions. Compared to triplanes and pure transformer models with little 3D-native design, MeshFormer leverages the explicit 3D structure of voxel features and the precise projective correspondence between 3D voxels and 2D multi-view features, enabling faster and more effective learning.

Unlike previous works that rely on NeRF-based representation in their pipeline, we utilize mesh representation throughout the process and train MeshFormer in a unified, single-stage manner. Specifically, we propose combining surface rendering with additional explicit 3D supervision, requiring the model to learn a signed distance function (SDF) field. The network is trained with high-resolution SDF supervision, and efficient differentiable surface rendering is applied to the extracted meshes for rendering losses. Due to the explicit 3D geometry supervision, MeshFormer enables faster training while eliminating the need for expensive volume rendering and learning an initial coarse NeRF. Furthermore, in addition to multi-view posed RGB images, we propose using corresponding normal maps as input, which can be captured through sensors and photometric techniques [82, 4] or directly estimated by recent 2D vision models [59, 37, 12]. These multi-view normal images provide important clues for 3D reconstruction and fine-grained geometric details. We also task the model with learning a normal texture in addition to the RGB texture, which can then be used to enhance the generated geometry through a traditional post-processing algorithm [44].

Thanks to the explicit 3D-native structure, supervision signal, and normal guidance that we have incorporated, MeshFormer can generate high-quality textured meshes with fine-grained geometric details, as shown in the teaser figure. Compared to concurrent methods that require over one hundred GPUs or complex multi-stage training, MeshFormer can be trained more efficiently and conveniently with just eight GPUs over two days, achieving on-par or even better performance. It can also seamlessly integrate with various 2D diffusion models to enable numerous tasks, such as single-image-to-3D and text-to-3D. In summary, our key contributions include:

• We introduce MeshFormer, an open-world sparse-view reconstruction model capable of generating high-quality 3D textured meshes with fine-grained geometric details in a few seconds. It can be trained with only 8 GPUs, outperforming baselines that require over one hundred GPUs.
• We propose a novel architecture that combines 3D (sparse) convolution and transformers. By explicitly leveraging 3D structure and projective bias, it facilitates better and faster learning.
• We propose a unified single-stage training strategy for generating high-quality meshes by combining surface rendering and explicit 3D geometric supervision.
• We are the first to introduce multi-view normal images as input to the feed-forward reconstruction network, providing crucial geometric guidance. Additionally, we propose to predict extra 3D normal texture for geometric enhancement.

2 Related Work

Open-world 3D Object Generation Thanks to the emergence of large-scale 3D datasets [9, 8] and the extensive priors learned by 2D models [51, 56, 55, 50], open-world 3D object generation has recently made significant advancements. Exemplified by DreamFusion [47], a line of work [70, 10, 26, 58, 6, 76, 48, 62, 60, 30, 5, 65] uses 2D models as guidance to generate 3D objects through per-shape optimization with SDS-like losses. Although these methods produce increasingly better results, they are still limited by lengthy runtimes and other issues, such as 3D inconsistency. Another line of work [45, 20, 40, 84, 16, 96] trains a feed-forward generative model solely on 3D data that consumes text prompts or single-image inputs. While fast during inference, these methods struggle to generalize to unseen object categories due to the scarcity of 3D data. More recently, works such as Zero123 [33] have shown that 2D diffusion models can be fine-tuned with 3D data for novel view synthesis. A line of work [31, 27, 85, 79, 77, 64], pioneered by One-2-3-45 [32], proposes first predicting multi-view images through 2D diffusion models and then lifting them to 3D through a feed-forward network, effectively addressing the speed and generalizability issues. Many recent works have also explored better strategies to fine-tune 2D diffusion models for enhancing the 3D consistency of multi-view images [60, 34, 80, 89, 59, 17, 91, 36, 81, 69, 14, 49, 23, 72]. In addition to the feed-forward models, the generated multi-view images can also be lifted to 3D through optimizations [37, 14, 34].

Sparse-View Feed-Forward Reconstruction Models When a small baseline between input images is assumed, existing generalizable NeRF methods [52, 68, 35, 88] aim to find pixel correspondences and learn generalizable priors across scenes by leveraging cost-volume-based techniques [3, 90, 38] or transformer-based structures [74, 24, 71, 19, 54]. Some of these methods have also incorporated a 2D diffusion process into the pipeline [1, 66, 21]. However, these methods often struggle to handle large baseline settings (e.g., only frontal-view reconstruction) or are limited by a small training set and fail to generalize to open-world objects. Recently, many models [94, 77, 64, 27, 86, 79, 92, 85, 73, 87] specifically aimed at open-world 3D object generation have been proposed. They typically build large networks and aim to learn extensive reconstruction priors by training on large-scale 3D datasets [9]. For example, the triplane representation and transformer models are often used. By applying volume rendering or Gaussian splatting [64, 86, 92], they train the model with rendering losses. However, these methods typically require extensive GPU resources to train and have difficulty extracting high-quality meshes. While some recent (concurrent) works [85, 79] utilize multi-stage “NeRF-to-mesh” training strategies to improve the quality, the results still leave room for improvement.

Geometry Guidance for 3D Reconstruction Many recent works have shown that in addition to multi-view RGB images, 2D diffusion models can be fine-tuned to generate other geometric modalities, such as depth maps [75], normal maps [37, 41, 12], or coordinate maps [28, 77]. These additional modalities can provide crucial guidance for 3D generation and reconstruction. While many recent methods utilize these geometric cues as inverse optimization guidance [49, 5, 37, 12, 28, 77], we propose to take normal maps as input in a feed-forward reconstruction model and task the model with generating 3D-consistent normal texture for geometry enhancement of sharp details.

3D Native Representations and Network Architectures in 3D Generation The use of 3D voxel representations and 3D convolutions is common in general 3D generation. However, most recent works focus on 3D-native diffusion [53, 29, 7, 95, 31, 18], one of the key paradigms in 3D generation, which differs from the route taken by MeshFormer. These 3D-diffusion-based methods have some common limitations. For instance, they focus solely on geometry generation and cannot directly predict high-quality textures from the network [53, 29, 7, 95, 31, 18]. Due to the limited availability of 3D data, 3D-native diffusion methods also typically struggle with open-world capabilities and are often constrained to closed-domain datasets (e.g., ShapeNet [2]) in their experiments [29, 7, 95].

In MeshFormer, our goal is to achieve direct high-quality texture generation while handling arbitrary object categories. Therefore, we adopt a different approach: sparse-view feed-forward reconstruction, as opposed to 3D-native diffusion. In this specific task setting, more comparable works are recent LRM-style methods [85, 79, 64, 67]. However, most of these methods rely on a combination of triplane representation and large-scale transformers. In this paper, we demonstrate that 3D-native representations and networks can not only be used in 3D-native diffusion but can also be combined with differentiable rendering to train a feed-forward sparse-view reconstruction model using rendering losses. In open-world sparse-view reconstruction, we are not limited to the triplane representation. Instead, 3D-native structures (e.g., voxels), network architectures, and projective priors can facilitate more efficient training, significantly reducing the required training resources. While scalable networks are necessary to learn extensive priors, scalability is not exclusive to triplane-based transformers. By integrating 3D convolutions with transformer layers, scalability can also be achieved.

3 Method

Figure 1: Pipeline Overview. MeshFormer takes a sparse set of multi-view RGB and normal images as input, which can be estimated using existing 2D diffusion models. We utilize a 3D feature volume representation, and the submodules VoxelFormer and SparseVoxelFormer share a similar novel architecture, detailed in the gray region. We train MeshFormer in a unified single stage by combining mesh surface rendering and $512^{3}$ SDF supervision. MeshFormer learns an additional normal texture, which can be used to further enhance the geometry and generate fine-grained sharp geometric details.

As shown in Figure 1, MeshFormer takes a sparse set of posed multi-view RGB and normal images as input and generates a high-quality textured mesh in a single feed-forward pass. In the following sections, we will first introduce our choice of 3D representation and a novel model architecture that combines large-scale transformers with 3D convolutions (Sec. 3.1). Then, we will describe our training objectives, which integrate surface rendering and explicit 3D SDF supervision (Sec. 3.2). Last but not least, we will present our normal guidance and geometry enhancement module, which plays a crucial role in generating high-quality meshes with fine-grained geometric details (Sec. 3.3).

3.1 3D Representation and Model Architecture

Triplane vs. 3D Voxels Open-world sparse-view reconstruction requires extensive priors, which can be learned through a large-scale transformer. Prior works [27, 79, 85, 67, 77] typically utilize the triplane representation, which decomposes a 3D neural field into a set of 2D planes. While straightforward for processing by transformers, the triplane representation lacks explicit 3D spatial structures and makes it hard to enable precise interaction between each 3D location and its corresponding 2D projected pixels from multi-view images. For instance, these methods often simply apply self-attention across all triplane patch tokens and cross-attention between triplane tokens and all multi-view image tokens. This all-to-all attention is not only costly but also makes the methods cumbersome to train. Moreover, the triplane representation often shows results with notable artifacts at the boundaries of patches and may suffer from limited expressiveness for complex structures. Consequently, we choose the 3D voxel representation instead, which explicitly preserves the 3D spatial structures.

Combining Transformer with 3D Convolution To leverage the explicit 3D structure and the powerful expressiveness of a large-scale transformer model while avoiding an explosion of computational costs, we propose VoxelFormer and SparseVoxelFormer, which follow a 3D UNet architecture while integrating a transformer at the bottleneck. The overall idea is that we use local 3D convolution to encode and decode a high-resolution 3D feature volume, while the global transformer layer handles reasoning and memorizing priors for the compressed low-resolution feature volume. Specifically, as shown in Figure 1, a 3D feature volume begins with a learnable token shared by all 3D voxels. With the 3D voxel coordinates, we can leverage the projection matrix to enable each 3D voxel to aggregate 2D local features from multi-view images via a projection-aware cross-attention layer. By iteratively performing projection-aware cross-attention and 3D (sparse) convolution, we can compress the 3D volume to a lower-resolution one. After compression, each 3D voxel feature then serves as a latent token, and a deep transformer model is applied to a sequence of all 3D voxel features (position encoded) to enhance the model’s expressiveness. Finally, we use the convolution-based inverse upper branch with skip connection to decode a 3D feature volume with the initial high resolution.

Projection-Aware Cross Attention Regarding 3D-2D interaction, the input multi-view RGB and normal images are initially processed by a 2D feature extractor, such as a trainable DINOv2 [46], to generate multi-view patch features. While previous cost-volume-based methods [38, 3] typically use mean or max pooling to aggregate multi-view 2D features, these simple pooling operations might be suboptimal for addressing occlusion and visibility issues. Instead, we propose a projection-aware cross-attention mechanism to adaptively aggregate the multi-view features for each 3D voxel. Specifically, we project each 3D voxel onto the $m$ views to interpolate $m$ RGB and normal features. We then concatenate these local patch features with the projected RGB and normal values to form $m$ 2D features. In the projection-aware cross-attention module, we use the 3D voxel feature to calculate a query and use both the 3D voxel feature and the $m$ 2D features to calculate $m+1$ keys and values. A cross-attention is then performed for each 3D voxel, enabling precise interaction between each 3D location and its corresponding 2D projected pixels, and allowing adaptive aggregation of 2D features, which can be formulated as:

$$v\leftarrow\operatorname{CrossAttention}\left(Q=\{v\},\;K=\{p^{v}_{i}\}_{i=1}^{m}+\{v\},\;V=\{p^{v}_{i}\}_{i=1}^{m}+\{v\}\right) \tag{1}$$

where $v$ denotes a 3D voxel feature, and $p^{v}_{i}$ denotes its projected 2D pixel feature from view $i$, which is a concatenation of the RGB feature $f^{v}_{i}$, the normal feature $g^{v}_{i}$, and the RGB and normal values $c^{v}_{i}$ and $n^{v}_{i}$, respectively.
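As an illustration of Eq. (1), the following is a minimal single-voxel sketch in plain Python. It makes several simplifying assumptions that are not from the paper: a single attention head, identity query/key/value projections, and 2D features $p^{v}_{i}$ that have already been mapped to the same dimension as the voxel feature. The function name `projection_aware_cross_attention` is hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    """Dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))

def projection_aware_cross_attention(voxel, pixel_feats):
    """Update one voxel feature from its m projected 2D features.

    voxel:       list[float] of length d (the voxel feature v)
    pixel_feats: m lists of length d (the projected features p_i^v)
    Assumption: identity Q/K/V projections and one head; a real layer
    would apply learned linear maps before the dot products.
    """
    tokens = pixel_feats + [voxel]              # m + 1 keys and values
    scale = 1.0 / math.sqrt(len(voxel))
    weights = softmax([dot(voxel, k) * scale for k in tokens])
    updated = [sum(w * t[j] for w, t in zip(weights, tokens))
               for j in range(len(voxel))]
    return updated, weights
```

Because the softmax is taken only over the $m+1$ tokens associated with one voxel, the cost per voxel grows with the number of views rather than with the total number of image tokens.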

Coarse-to-Fine Feature Generation As shown in Fig. 1, to generate a high-resolution 3D feature volume that captures the fine-grained details of 3D shapes, we follow previous work [95, 31] by employing a coarse-to-fine strategy. Specifically, we first use VoxelFormer, which is equipped with full 3D convolution, to predict a low-resolution (e.g., $64^{3}$), coarse 3D occupancy volume. Each voxel in this volume stores a binary value indicating whether it is close to the surface. The predicted occupied voxels are then subdivided to create higher-resolution sparse voxels (e.g., $256^{3}$). Next, we utilize a second module, SparseVoxelFormer, which features 3D sparse convolution [63], to predict features for these sparse voxels. After this, we trilinearly interpolate the 3D feature of any near-surface 3D point, which encodes both geometric and color information, from the high-resolution sparse feature volume. The features are then fed into various MLPs to learn the corresponding fields.
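The subdivision and interpolation steps can be sketched as follows. This is an illustrative plain-Python version, not the authors' implementation: the sparse volume is assumed to be a dictionary keyed by integer voxel indices, voxels missing from it are assumed to contribute zeros, and the function names are hypothetical.

```python
import math
from itertools import product

def subdivide_occupied(occupied, factor=4):
    """Map occupied coarse voxel indices to fine voxel indices.

    occupied: iterable of (i, j, k) indices in the coarse grid
    factor:   fine/coarse resolution ratio (256 / 64 = 4 for the sizes quoted)
    Returns the set of (i, j, k) indices in the fine grid.
    """
    fine = set()
    for (i, j, k) in occupied:
        for di, dj, dk in product(range(factor), repeat=3):
            fine.add((i * factor + di, j * factor + dj, k * factor + dk))
    return fine

def trilinear(features, point, dim):
    """Trilinearly interpolate a sparse feature volume at a continuous point.

    features: dict mapping (i, j, k) -> list[float] of length dim
    point:    (x, y, z) in voxel-index units
    Assumption: voxels absent from the sparse volume contribute zeros.
    """
    base = [math.floor(c) for c in point]
    frac = [c - b for c, b in zip(point, base)]
    out = [0.0] * dim
    for corner in product((0, 1), repeat=3):
        weight = 1.0
        for axis in range(3):
            weight *= frac[axis] if corner[axis] else 1.0 - frac[axis]
        feat = features.get(tuple(b + o for b, o in zip(base, corner)))
        if feat is not None:
            out = [o + weight * f for o, f in zip(out, feat)]
    return out
```

Each occupied coarse voxel yields `factor ** 3` fine voxels, so only the near-surface region is ever stored at the high resolution.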

3.2 Unified Single-Stage Training: Surface Rendering with SDF Supervision

Existing works typically use NeRF [42] and volume rendering or 3D Gaussian splatting [22] since they come with a relatively easy and stable learning process. However, extracting high-quality meshes from their results is often non-trivial. For example, directly applying Marching Cubes [39] to density fields of learned NeRFs typically generates meshes with many artifacts. Recent methods [79, 85, 78] have designed complex, multi-stage “NeRF-to-mesh” training with differentiable surface rendering, but the generated meshes still leave room for improvement. On the other hand, skipping a good initialization and directly learning meshes from scratch using purely differentiable surface rendering losses is also infeasible, as it is highly unstable to train and typically results in distorted geometry.

In this work, we propose leveraging explicit 3D supervision in addition to 2D rendering losses. As shown in Figure 1, we task MeshFormer with learning a signed distance function (SDF) field supervised by a high-resolution (e.g., $512^{3}$) ground truth SDF volume. The SDF loss provides explicit guidance for the underlying 3D geometry and facilitates faster learning. It also allows us to use mesh representation and differentiable surface rendering from the beginning without worrying about good geometry initialization or unstable training, as the SDF loss serves as a strong regularization for the underlying geometry. By combining surface rendering with explicit 3D SDF supervision, we train MeshFormer in a unified, single-stage training process. As shown in Figure 1, we employ three tiny MLPs that take as input the 3D feature interpolated from the 3D sparse feature volume to learn an SDF field, a 3D color texture, and a 3D normal texture. We extract meshes from the SDF volume using dual Marching Cubes [39] and employ NVDiffRast [25] for differentiable surface rendering. We render both the multi-view RGB and normal images and compute the rendering losses, which consist of both the MSE and perceptual loss terms. As a result, our training loss can be expressed as:

$$\mathcal{L}=\lambda_{1}\mathcal{L}_{\mathrm{MSE}}^{\mathrm{color}}+\lambda_{2}\mathcal{L}_{\mathrm{LPIPS}}^{\mathrm{color}}+\lambda_{3}\mathcal{L}_{\mathrm{MSE}}^{\mathrm{normal}}+\lambda_{4}\mathcal{L}_{\mathrm{LPIPS}}^{\mathrm{normal}}+\lambda_{5}\mathcal{L}_{\mathrm{occ}}+\lambda_{6}\mathcal{L}_{\mathrm{SDF}} \tag{2}$$

where $\mathcal{L}_{\mathrm{occ}}$ and $\mathcal{L}_{\mathrm{SDF}}$ are MSE losses for the occupancy and SDF volumes, and $\lambda_{i}$ denotes the weight of each loss term. Note that we do not use mesh geometry to derive normal maps; instead, we utilize the learned normal texture from the MLP, which will be detailed later.
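The weighted sum in Eq. (2) can be sketched as below. The term names, the helper functions, and any weights passed in are placeholders rather than values from the paper, and the LPIPS terms are assumed to be precomputed scalars, since computing them requires a pretrained perceptual network.

```python
def mse(pred, target):
    """Mean squared error between two equal-length flat lists."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def total_loss(terms, weights):
    """Weighted sum of named loss terms, mirroring the structure of Eq. (2).

    terms:   dict of term name -> scalar loss value
    weights: dict of term name -> lambda_i (placeholder values, not the paper's)
    """
    mismatched = set(terms) ^ set(weights)
    if mismatched:
        raise KeyError(f"terms and weights disagree on: {sorted(mismatched)}")
    return sum(weights[name] * terms[name] for name in terms)
```

A call would pass six entries, for example `mse_color`, `lpips_color`, `mse_normal`, `lpips_normal`, `occ`, and `sdf`, with the two normal terms computed on the MLP-queried normal maps rather than on mesh-derived normals.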

3.3 Fine-Grained Geometric Details: Normal Guidance and Geometry Enhancement

Without dense-view correspondences, 3D reconstruction from sparse-view RGB images typically struggles to capture geometric details and suffers from texture ambiguity. While many recent works [27, 79, 85] attempt to employ large-scale models to learn mappings from RGB to geometric details, this typically requires significant computational resources. Additionally, these methods are primarily trained using 3D data, but it’s still uncertain whether the scale of 3D datasets is sufficient for learning such extensive priors. On the other hand, unlike RGB images, normal maps explicitly encode geometric information and can provide crucial guidance for 3D reconstruction. Notably, open-world normal map estimation has achieved great advancements. Many recent works [59, 12, 37] demonstrate that 2D diffusion models, trained on billions of natural images, embed extensive priors and can be fine-tuned to predict normal maps. Given the significant disparity in data scale between 2D and 3D datasets, it may be more effective to use 2D models first for generating geometric guidance.

Input Normal Guidance As shown in Figure 1, in addition to multi-view RGB images, MeshFormer also takes multi-view normal maps as input, which can be generated using recent open-world normal estimation models [59, 12, 37]. In our experiments, we utilize Zero123++ v1.2 [59], which trains an additional ControlNet [93] over the multi-view prediction model. The ControlNet takes multi-view RGB images, predicted by Zero123++, as a condition and produces corresponding multi-view normal maps, expressed in the camera coordinate frame. Given these maps, MeshFormer first converts them to a unified world coordinate frame, and then treats them similarly to the multi-view RGB images, using projection-aware cross-attention to guide 3D reconstruction. According to our experiments (Sec. 4.4), the multi-view normal maps enable the networks to better capture geometry details, and thus greatly improve final mesh quality.
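The conversion from camera-frame normals to a shared world frame amounts to applying each view's camera-to-world rotation. A minimal sketch follows, assuming unit normals stored as 3-tuples and a row-major $3\times 3$ rotation matrix; the function name is hypothetical, and the exact camera convention used by the authors is not specified here.

```python
import math

def normals_camera_to_world(normals_cam, rot_c2w):
    """Rotate per-pixel normals from one camera frame into the world frame.

    normals_cam: list of (nx, ny, nz) expressed in the camera frame
    rot_c2w:     3x3 camera-to-world rotation as nested lists (row-major)
    Normals are renormalized afterwards to absorb numerical drift.
    """
    world = []
    for n in normals_cam:
        r = [sum(rot_c2w[row][col] * n[col] for col in range(3))
             for row in range(3)]
        length = math.sqrt(sum(c * c for c in r))
        if length > 0:
            world.append(tuple(c / length for c in r))
        else:
            world.append((0.0, 0.0, 0.0))   # background or invalid pixel
    return world
```

Only the rotation is applied, because normals are directions and are unaffected by the camera translation.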

Geometry Enhancement While the straightforward approach of deriving normal maps from the learned mesh and using a normal loss to guide geometry learning has been commonly used, we find that this approach makes our mesh learning less stable. Instead, we propose learning a 3D normal texture, similar to a color texture, using a separate MLP. By computing the normal loss for MLP-queried normal maps instead of mesh-derived normal maps, we decouple normal texture learning from underlying geometry learning. This makes the training more stable, as it is easier to learn a sharp 3D normal map than to directly learn a sharp mesh geometry. The learned 3D normal texture can be exported with the mesh, similar to the color texture, to support various graphics rendering pipelines. In applications that require precise 3D geometry, such as 3D printing, the learned normal texture can also be used to refine the mesh geometry with traditional algorithms. Specifically, during inference, after extracting a 3D mesh from the SDF volume, we utilize a post-processing algorithm [44] that takes as input the 3D positions of the mesh vertices and the vertex normals estimated from the MLP. The algorithm adjusts the mesh vertices to align with the predicted normals in a few seconds, further enhancing the geometry quality and generating sharp geometric details, as shown in Figure 4.

4 Experiments

4.1 Implementation Details and Evaluation Settings

Implementation Details We trained MeshFormer on the Objaverse [9] dataset. The total number of network parameters is approximately 648 million. We trained the model using 8 H100 GPUs for about one week (350k iterations) with a batch size of 1 per GPU, although we also show that the model can achieve similar results in just two days. Please refer to the supplementary for more details.

Evaluation Settings We evaluate the methods on two datasets: GSO [11] and OmniObject3D [83]. Both datasets contain real-scanned 3D objects that were not seen during training. For the GSO dataset, we use all 1,030 3D shapes for evaluation. For the OmniObject3D dataset, we randomly sample up to 5 shapes from each category, resulting in 1,038 shapes for evaluation. We utilize both 2D and 3D metrics. For 3D metrics, we use both the F-score and Chamfer distance (CD), calculated between the predicted meshes and ground truth meshes, following [31, 85]. For 2D metrics, we compute both PSNR and LPIPS for the rendered color images. Since each baseline may use a different coordinate frame for generated results, we carefully align the predicted meshes of all methods to the ground truth meshes before calculating the metrics. Please refer to the supplemental material for more details.
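For reference, the metrics named above can be sketched in plain Python on small point sets and flat pixel lists. Conventions for Chamfer distance and F-score differ between papers (squared versus unsquared distances, sum versus mean, choice of threshold), so the definitions below are one common variant and an assumption, not necessarily the exact protocol used in this evaluation.

```python
import math

def _nearest(src, dst):
    """Distance from each point in src to its nearest neighbor in dst."""
    return [min(math.dist(p, q) for q in dst) for p in src]

def chamfer_distance(pred, gt):
    """Symmetric Chamfer distance: average of the two mean nearest distances."""
    d_pg = _nearest(pred, gt)
    d_gp = _nearest(gt, pred)
    return 0.5 * (sum(d_pg) / len(d_pg) + sum(d_gp) / len(d_gp))

def f_score(pred, gt, tau):
    """Harmonic mean of precision and recall at distance threshold tau."""
    d_pg = _nearest(pred, gt)
    d_gp = _nearest(gt, pred)
    precision = sum(d <= tau for d in d_pg) / len(d_pg)
    recall = sum(d <= tau for d in d_gp) / len(d_gp)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB for pixel values in [0, max_val]."""
    err = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    return math.inf if err == 0 else 10.0 * math.log10(max_val ** 2 / err)
```

The brute-force nearest-neighbor search is quadratic in the number of points; an evaluation on densely sampled meshes would normally use a spatial index instead.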

4.2 Comparison with Single/Sparse-View to 3D Methods

We compare MeshFormer with recent open-world feed-forward single/sparse-view to 3D methods, including One-2-3-45++ [31], TripoSR [67], CRM [77], LGM [64], InstantMesh [85], and MeshLRM [79]. Many of these methods have been released recently and should be considered concurrent methods. For MeshLRM [79], we contacted the authors for the results. For the other methods, we utilized their official implementations. Please refer to the supplementary for details.

Since input settings differ among the baselines, we evaluate all methods in a unified single-view to 3D setting. For the GSO dataset, we utilized the first thumbnail image as the single-view input. For the OmniObject3D dataset, we used a rendered image with a random pose as input. For One-2-3-45++ [31], InstantMesh [85], MeshLRM [79], and our MeshFormer, we first utilized Zero123++ [59] to convert the input single-view image into multi-view images before 3D reconstruction. Other baselines follow their original settings and take a single-view image directly as input. In addition to the RGB images, our MeshFormer also takes additional multi-view normal images as input, which are also predicted by Zero123++ [59]. Note that when comparing with baseline methods, we never use ground truth normal images to ensure a fair comparison.

In Fig. 2, we showcase qualitative examples. Our MeshFormer produces the most accurate meshes with fine-grained, sharp geometric details. In contrast, baseline methods produce inferior mesh quality. For example, TripoSR directly extracts meshes from the learned NeRF representation, resulting in significant artifacts. While InstantMesh and MeshLRM use mesh representation in their second stage, notable uneven artifacts are still observable upon a zoom-in inspection. Additionally, all baseline methods incorrectly close the surface of the copper bell. We also provide quantitative results in Tab. 1. Although our baselines include four methods released just one or two months before the time of submission, our MeshFormer significantly outperforms many of them and achieves the best performance on most metrics across two datasets. For the color LPIPS metric, our performance is very similar to MeshLRM’s, despite a perceptual loss being their main training loss term. We also highlight that many of the baselines require over one hundred GPUs for training, whereas our model can be efficiently trained with just 8 GPUs. Please refer to Sec. 4.4 for analysis on training efficiency.

4.3 Application: Text to 3D

Table 1: Quantitative Results of Single Image to 3D. Evaluated on the 1,030 and 1,038 3D shapes from the GSO [11] and the OmniObject3D [83] datasets, respectively. One-2-3-45++ [31], InstantMesh [85], MeshLRM [79], and our method all take the same multi-view RGB images predicted by Zero123++ [59] as input. CD denotes Chamfer Distance.

Method	GSO [11]				OmniObject3D [83]
Method	F-Score $\uparrow$	CD $\downarrow$	PSNR $\uparrow$	LPIPS $\downarrow$	F-Score $\uparrow$	CD $\downarrow$	PSNR $\uparrow$	LPIPS $\downarrow$
One-2-3-45++ [31]	0.936	0.039	20.97	0.21	0.871	0.054	17.08	0.31
TripoSR [67]	0.896	0.047	19.85	0.26	0.895	0.048	17.68	0.28
CRM [77]	0.886	0.051	19.99	0.27	0.821	0.065	16.01	0.34
LGM [64]	0.776	0.074	18.52	0.35	0.635	0.114	14.75	0.45
InstantMesh [85]	0.934	0.037	20.90	0.22	0.889	0.049	17.61	0.28
MeshLRM [79]	0.956	0.033	21.31	0.19	0.910	0.045	18.10	0.26
Ours	0.963	0.031	21.47	0.20	0.914	0.043	18.14	0.27

In addition to the single image to 3D, MeshFormer can also be integrated with 2D diffusion models to enable various 3D object generation tasks. For example, we follow the framework proposed by [37] to finetune Stable Diffusion [56] and build a text-to-multi-view model. By integrating this model, along with the normal prediction from Zero123++ [59], with MeshFormer, we can enable the task of text to 3D. Figure 3 shows some interesting results, where we convert a single text prompt into a high-quality 3D mesh in just a few seconds. Please refer to the supplemental materials for a qualitative comparison with one of the state-of-the-art text-to-3D methods, Instant3D [27].

4.4 Analysis and Ablation Study

Table 2: We compare methods using limited training resources. Evaluated on the GSO [11] dataset.

Method	Training Resources	F-Score $\uparrow$	CD $\downarrow$	PSNR-C $\uparrow$	LPIPS-C $\downarrow$	PSNR-N $\uparrow$	LPIPS-N $\downarrow$
MeshLRM [79]	8 $\times$ H100 48h	0.925	0.0397	21.09	0.26	21.69	0.22
Ours	8 $\times$ H100 48h	0.960	0.0317	21.41	0.20	23.01	0.15

Explicit 3D structure vs. Triplane In Section 4.2, we demonstrated that MeshFormer outperforms baseline methods that primarily utilize the triplane representation. Here, we highlight two additional advantages of using the explicit 3D voxel structure: training efficiency and the avoidance of “triplane artifacts”. Without leveraging explicit 3D structure, existing triplane-based large reconstruction models require extensive computing resources for training. For example, TripoSR requires 176 A100 GPUs for five days of training. InstantMesh relies on OpenLRM [15], which requires 128 A100 GPUs for three days of training. MeshLRM also utilizes similar resources during training. By utilizing explicit 3D structure and projective bias, our MeshFormer can be trained much more efficiently using only 8 GPUs. To better understand the gap, we trained both MeshLRM and our MeshFormer under very limited training resources, and the results are shown in Table 2. When using only 8 GPUs for two days, we found that MeshLRM failed to converge and experienced significant performance degradation compared to the results shown in Table 1, while our MeshFormer had already converged to a decent result, close to the fully-trained version, demonstrating superior training efficiency.
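The training budgets quoted above can be put on a common footing with a short back-of-the-envelope calculation. The sketch below multiplies the number of GPUs by the number of training days; the dictionary keys are labels introduced for illustration. Note that the baselines report A100 GPUs while our limited-resource setting uses H100 GPUs, so GPU-days are only a rough proxy and not a like-for-like measure of compute.

```python
# Training budgets quoted in the text: (number of GPUs, days of training).
budgets = {
    "TripoSR (A100)": (176, 5),
    "OpenLRM, used by InstantMesh (A100)": (128, 3),
    "MeshFormer, limited setting (H100)": (8, 2),
}

# GPU-days = number of GPUs x number of days.
gpu_days = {name: gpus * days for name, (gpus, days) in budgets.items()}

ours = gpu_days["MeshFormer, limited setting (H100)"]
for name, value in gpu_days.items():
    print(f"{name}: {value} GPU-days ({value / ours:.0f}x the limited setting)")
```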

We observe that the triplane typically generates results with axis-aligned artifacts, as shown in Fig. 2 (5th row, please zoom in). As demonstrated in the supplementary (Fig. 6), these artifacts also cause difficulties for MeshLRM [79] in capturing the words on objects. These limitations are likely caused by the limited number of triplane tokens (e.g., $32\times 32\times 3$), constrained by the global attention, which often leads to artifacts at the boundaries of the triplane patches. In contrast, MeshFormer leverages sparse voxels, supports a higher feature resolution of $256^{3}$, and is free from such artifacts.
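To give a sense of scale for the two resolutions mentioned above, the following sketch counts the triplane tokens and the cells of a dense $256^{3}$ grid. This is a rough count under simplifying assumptions rather than a like-for-like comparison: each triplane token may decode to a patch of features, and a sparse voxel grid stores only the occupied cells, so the dense cell count is an upper bound.

```python
# Token budget of a triplane with 32 x 32 tokens per plane and 3 planes.
triplane_tokens = 32 * 32 * 3

# Number of cells in a dense 256^3 grid; a sparse grid stores only the occupied subset.
dense_cells_256 = 256 ** 3

# Global self-attention compares every token with every other token,
# so its cost grows quadratically with the number of tokens.
triplane_attention_pairs = triplane_tokens ** 2

print(triplane_tokens)           # 3072
print(dense_cells_256)           # 16777216
print(triplane_attention_pairs)  # 9437184
```

The quadratic growth of the last quantity is one reason why the token count of a globally attended triplane is hard to increase.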

Table 3: Ablation Study on the GSO [11] dataset. -C denotes color renderings, and -N denotes normal renderings. CD stands for Chamfer distance. By default, ground truth multi-view images are used to exclude the influence of errors from 2D diffusion models.

	Setting	PSNR-C $\uparrow$	LPIPS-C $\downarrow$	PSNR-N $\uparrow$	LPIPS-N $\downarrow$	F-Score $\uparrow$	CD $\downarrow$
a	w/o normal input	24.82	0.129	24.85	0.107	0.964	0.024
b	w/o SDF supervision	20.72	0.244	20.42	0.257	0.940	0.035
c	w/o transformer layer	26.63	0.101	29.80	0.036	0.992	0.013
d	w/o projection-aware cross-attention	25.48	0.155	29.01	0.045	0.991	0.013
e	w/o geometry enhancement	27.95	0.085	29.10	0.048	0.992	0.012
f	w/ pred normal	26.84	0.096	26.99	0.067	0.987	0.017
g	full	28.15	0.083	29.80	0.036	0.992	0.012

Normal Input and SDF supervision As shown in Table 3 (a), the performance significantly drops when multi-view input normal maps are removed, indicating that the geometric guidance and clues provided by normal images are crucial for facilitating network training, particularly for local geometric details. In (f), we replace ground truth normal maps with normal predictions by Zero123++ [59] and observe a notable performance gap compared to (g). This indicates that although predicted multi-view normal images can be beneficial, existing 2D diffusion models still have room for improvement in generating more accurate results. See supplementary for qualitative examples. As shown in (b), if we remove the SDF loss after the first epoch and train the network using only surface rendering losses, the geometry learning quickly deteriorates, resulting in poor geometry. This explains why existing methods [27, 79] typically employ complex multi-stage training and use volume rendering to learn a coarse NeRF in the initial stage. By leveraging explicit 3D SDF supervision as strong geometric regularization, we enable a unified single-stage training, using mesh as the only representation.
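To make the notion of explicit SDF supervision concrete, the toy sketch below samples the analytic signed distance function of a sphere at the voxel centers of a coarse grid and measures the discrepancy between a perturbed "prediction" and this ground truth. The sphere, the $8^{3}$ grid, the constant offset, and the choice of an L1 penalty are all assumptions made for illustration; they are not the loss formulation or the resolution used in our training.

```python
import math

def sphere_sdf(p, radius=0.5):
    """Signed distance to a sphere at the origin: negative inside, positive outside."""
    return math.sqrt(sum(c * c for c in p)) - radius

def voxel_centers(res):
    """Centers of a res^3 grid covering the cube [-1, 1]^3."""
    step = 2.0 / res
    coords = [-1.0 + (i + 0.5) * step for i in range(res)]
    return [(x, y, z) for x in coords for y in coords for z in coords]

def l1_sdf_loss(pred, gt):
    """Mean absolute difference between predicted and ground truth SDF values."""
    return sum(abs(a - b) for a, b in zip(pred, gt)) / len(gt)

centers = voxel_centers(8)
gt_sdf = [sphere_sdf(p) for p in centers]

# A stand-in prediction that is off by a constant offset everywhere.
pred_sdf = [d + 0.01 for d in gt_sdf]

loss = l1_sdf_loss(pred_sdf, gt_sdf)
inside = sum(d < 0 for d in gt_sdf)
print(f"{len(centers)} voxels, {inside} inside the surface, L1 loss = {loss:.4f}")
```

Unlike a rendering loss, this signal is defined at every voxel, including those that are not visible from any rendered view.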

Projection-Aware Cross-Attention and Transformer Layers We propose to utilize projection-aware cross-attention to precisely aggregate multi-view projected 2D features for each 3D voxel. In conventional learning-based multi-view stereo (MVS) methods [3, 38], average or max pooling is typically employed for feature aggregation. In Table 3 (d), we replace the cross-attention with a simple average pooling and we observe a significant performance drop. This verifies that projection-aware cross-attention provides a more effective way for 3D-2D interaction while simple average pooling may fail to handle the occlusion and visibility issues. In the bottleneck of the UNet, we treat all 3D (sparse) voxels as a sequence of tokens and apply transformer layers to them. As shown in row (c), after removing these layers, we observe a performance drop in metrics related to texture quality. This indicates that texture learning requires more extensive priors and benefits more from the transformer layers.
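The difference between the two aggregation strategies compared in row (d) can be illustrated with a minimal sketch. For a single voxel, we assume that one feature vector has already been sampled from each view at the voxel's projected pixel location. Average pooling weights all views equally, whereas cross-attention uses the voxel feature as a query and weights each view by its similarity. The sketch is single-head, omits the learned query, key, and value projections, and uses hypothetical toy features, so it illustrates the mechanism rather than reproducing our module.

```python
import math

def average_pool(view_feats):
    """Aggregate per-view features with equal weights."""
    n = len(view_feats)
    dim = len(view_feats[0])
    return [sum(f[c] for f in view_feats) / n for c in range(dim)]

def cross_attention(query, view_feats):
    """Single-head scaled dot-product attention of one voxel query over per-view features.

    Keys and values are both the sampled view features (no learned projections).
    """
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, f)) for f in view_feats]
    # Numerically stable softmax over views.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    out = [sum(w * f[c] for w, f in zip(weights, view_feats)) for c in range(len(query))]
    return out, weights

# Toy example: two views see the surface at this voxel, the third view is occluded
# and contributes an unrelated feature.
voxel_query = [1.0, 0.0]
view_feats = [[4.0, 0.0], [4.0, 0.0], [-4.0, 2.0]]

pooled = average_pool(view_feats)
attended, weights = cross_attention(voxel_query, view_feats)
print("average pooling:", pooled)
print("cross-attention:", attended, "weights:", weights)
```

In this toy case the occluded view receives a weight close to zero under attention, while it shifts the averaged feature noticeably, which mirrors the occlusion and visibility argument above.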

Geometry Enhancement We propose to learn an additional normal map texture and apply a traditional algorithm as post-processing for geometry enhancement during inference. As shown in Figure 4, the geometry enhancement aligns the mesh geometry with the learned normal texture and generates fine-grained sharp details. In some cases (such as the wolf), the meshes output by the network are already good enough, and the difference caused by the enhancement tends to be subtle. Row (e) also quantitatively verifies the effectiveness of the module.

5 Conclusion and Limitations

We present MeshFormer, an open-world sparse-view reconstruction model that leverages explicit 3D native structure, supervision signals, and input guidance. MeshFormer can be conveniently trained in a unified single-stage manner and efficiently with just 8 GPUs. It generates high-quality meshes with fine-grained geometric details and outperforms baselines trained with over one hundred GPUs.

MeshFormer relies on 2D models to generate multi-view RGB and normal images from a single input image or text prompt. However, existing models still have limited capabilities to generate consistent multi-view images, which can cause a performance drop. Strategies to improve model robustness against such imperfect predictions are worth further exploration, and we leave this as future work.

References

[1] Eric R Chan, Koki Nagano, Matthew A Chan, Alexander W Bergman, Jeong Joon Park, Axel Levy, Miika Aittala, Shalini De Mello, Tero Karras, and Gordon Wetzstein. Genvs: Generative novel view synthesis with 3d-aware diffusion models, 2023.
[2] Angel X Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, et al. Shapenet: An information-rich 3d model repository. arXiv preprint arXiv:1512.03012, 2015.
[3] Anpei Chen, Zexiang Xu, Fuqiang Zhao, Xiaoshuai Zhang, Fanbo Xiang, Jingyi Yu, and Hao Su. Mvsnerf: Fast generalizable radiance field reconstruction from multi-view stereo. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 14124–14133, 2021.
[4] Guanying Chen, Kai Han, Boxin Shi, Yasuyuki Matsushita, and Kwan-Yee K. Wong. Sdps-net: Self-calibrating deep photometric stereo networks. In CVPR, 2019.
[5] Rui Chen, Yongwei Chen, Ningxin Jiao, and Kui Jia. Fantasia3d: Disentangling geometry and appearance for high-quality text-to-3d content creation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2023.
[6] Zilong Chen, Feng Wang, Yikai Wang, and Huaping Liu. Text-to-3d using gaussian splatting, 2024.
[7] Yen-Chi Cheng, Hsin-Ying Lee, Sergey Tulyakov, Alexander G Schwing, and Liang-Yan Gui. Sdfusion: Multimodal 3d shape completion, reconstruction, and generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4456–4465, 2023.
[8] Matt Deitke, Ruoshi Liu, Matthew Wallingford, Huong Ngo, Oscar Michel, Aditya Kusupati, Alan Fan, Christian Laforte, Vikram Voleti, Samir Yitzhak Gadre, Eli VanderBilt, Aniruddha Kembhavi, Carl Vondrick, Georgia Gkioxari, Kiana Ehsani, Ludwig Schmidt, and Ali Farhadi. Objaverse-xl: A universe of 10m+ 3d objects, 2023.
[9] Matt Deitke, Dustin Schwenk, Jordi Salvador, Luca Weihs, Oscar Michel, Eli VanderBilt, Ludwig Schmidt, Kiana Ehsani, Aniruddha Kembhavi, and Ali Farhadi. Objaverse: A universe of annotated 3d objects, 2022.
[10] Congyue Deng, Chiyu Jiang, Charles R Qi, Xinchen Yan, Yin Zhou, Leonidas Guibas, Dragomir Anguelov, et al. Nerdi: Single-view nerf synthesis with language-guided diffusion as general image priors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20637–20647, 2023.
[11] Laura Downs, Anthony Francis, Nate Koenig, Brandon Kinman, Ryan Hickman, Krista Reymann, Thomas B McHugh, and Vincent Vanhoucke. Google scanned objects: A high-quality dataset of 3d scanned household items. In 2022 International Conference on Robotics and Automation (ICRA), pages 2553–2560. IEEE, 2022.
[12] Xiao Fu, Wei Yin, Mu Hu, Kaixuan Wang, Yuexin Ma, Ping Tan, Shaojie Shen, Dahua Lin, and Xiaoxiao Long. Geowizard: Unleashing the diffusion priors for 3d geometry estimation from a single image. arXiv preprint arXiv:2403.12013, 2024.
[13] Jun Gao, Tianchang Shen, Zian Wang, Wenzheng Chen, Kangxue Yin, Daiqing Li, Or Litany, Zan Gojcic, and Sanja Fidler. Get3d: A generative model of high quality 3d textured shapes learned from images. Advances In Neural Information Processing Systems, 35:31841–31854, 2022.
[14] Ruiqi Gao, Aleksander Holynski, Philipp Henzler, Arthur Brussee, Ricardo Martin-Brualla, Pratul P. Srinivasan, Jonathan T. Barron, and Ben Poole. Cat3d: Create anything in 3d with multi-view diffusion models. arXiv, 2024.
[15] Zexin He and Tengfei Wang. Openlrm: Open-source large reconstruction models. https://github.com/3DTopia/OpenLRM, 2023.
[16] Yicong Hong, Kai Zhang, Jiuxiang Gu, Sai Bi, Yang Zhou, Difan Liu, Feng Liu, Kalyan Sunkavalli, Trung Bui, and Hao Tan. Lrm: Large reconstruction model for single image to 3d. arXiv preprint arXiv:2311.04400, 2023.
[17] Hanzhe Hu, Zhizhuo Zhou, Varun Jampani, and Shubham Tulsiani. Mvd-fusion: Single-view 3d via depth-consistent multi-view generation. In CVPR, 2024.
[18] Ka-Hei Hui, Aditya Sanghi, Arianna Rampini, Kamal Rahimi Malekshan, Zhengzhe Liu, Hooman Shayani, and Chi-Wing Fu. Make-a-shape: a ten-million-scale 3d shape model. In Forty-first International Conference on Machine Learning, 2024.
[19] Mohammad Mahdi Johari, Yann Lepoittevin, and François Fleuret. Geonerf: Generalizing nerf with geometry priors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18365–18375, 2022.
[20] Heewoo Jun and Alex Nichol. Shap-e: Generating conditional 3d implicit functions, 2023.
[21] Animesh Karnewar, Andrea Vedaldi, David Novotny, and Niloy J Mitra. Holodiffusion: Training a 3d diffusion model using 2d images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18423–18433, 2023.
[22] Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering. ACM Transactions on Graphics, 42(4), July 2023.
[23] Xin Kong, Shikun Liu, Xiaoyang Lyu, Marwan Taher, Xiaojuan Qi, and Andrew J Davison. Eschernet: A generative model for scalable view synthesis. arXiv preprint arXiv:2402.03908, 2024.
[24] Jonáš Kulhánek, Erik Derner, Torsten Sattler, and Robert Babuška. Viewformer: Nerf-free neural rendering from few images using transformers. In European Conference on Computer Vision, pages 198–216. Springer, 2022.
[25] Samuli Laine, Janne Hellsten, Tero Karras, Yeongho Seol, Jaakko Lehtinen, and Timo Aila. Modular primitives for high-performance differentiable rendering. ACM Transactions on Graphics, 39(6), 2020.
[26] Han-Hung Lee and Angel X. Chang. Understanding pure clip guidance for voxel grid nerf models, 2022.
[27] Jiahao Li, Hao Tan, Kai Zhang, Zexiang Xu, Fujun Luan, Yinghao Xu, Yicong Hong, Kalyan Sunkavalli, Greg Shakhnarovich, and Sai Bi. Instant3d: Fast text-to-3d with sparse-view generation and large reconstruction model. arXiv preprint arXiv:2311.06214, 2023.
[28] Weiyu Li, Rui Chen, Xuelin Chen, and Ping Tan. Sweetdreamer: Aligning geometric priors in 2d diffusion for consistent text-to-3d. arXiv preprint arXiv:2310.02596, 2023.
[29] Yuhan Li, Yishun Dou, Xuanhong Chen, Bingbing Ni, Yilin Sun, Yutian Liu, and Fuzhen Wang. Generalized deep 3d shape prior via part-discretized diffusion process. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16784–16794, 2023.
[30] Chen-Hsuan Lin, Jun Gao, Luming Tang, Towaki Takikawa, Xiaohui Zeng, Xun Huang, Karsten Kreis, Sanja Fidler, Ming-Yu Liu, and Tsung-Yi Lin. Magic3d: High-resolution text-to-3d content creation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
[31] Minghua Liu, Ruoxi Shi, Linghao Chen, Zhuoyang Zhang, Chao Xu, Xinyue Wei, Hansheng Chen, Chong Zeng, Jiayuan Gu, and Hao Su. One-2-3-45++: Fast single image to 3d objects with consistent multi-view generation and 3d diffusion. arXiv preprint arXiv:2311.07885, 2023.
[32] Minghua Liu, Chao Xu, Haian Jin, Linghao Chen, Mukund Varma T, Zexiang Xu, and Hao Su. One-2-3-45: Any single image to 3d mesh in 45 seconds without per-shape optimization. Advances in Neural Information Processing Systems, 36, 2024.
[33] Ruoshi Liu, Rundi Wu, Basile Van Hoorick, Pavel Tokmakov, Sergey Zakharov, and Carl Vondrick. Zero-1-to-3: Zero-shot one image to 3d object. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9298–9309, 2023.
[34] Yuan Liu, Cheng Lin, Zijiao Zeng, Xiaoxiao Long, Lingjie Liu, Taku Komura, and Wenping Wang. Syncdreamer: Generating multiview-consistent images from a single-view image. arXiv preprint arXiv:2309.03453, 2023.
[35] Yuan Liu, Sida Peng, Lingjie Liu, Qianqian Wang, Peng Wang, Christian Theobalt, Xiaowei Zhou, and Wenping Wang. Neural rays for occlusion-aware image-based rendering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7824–7833, 2022.
[36] Yuxin Liu, Minshan Xie, Hanyuan Liu, and Tien-Tsin Wong. Text-guided texturing by synchronized multi-view diffusion. arXiv preprint arXiv:2311.12891, 2023.
[37] Xiaoxiao Long, Yuan-Chen Guo, Cheng Lin, Yuan Liu, Zhiyang Dou, Lingjie Liu, Yuexin Ma, Song-Hai Zhang, Marc Habermann, Christian Theobalt, et al. Wonder3d: Single image to 3d using cross-domain diffusion. arXiv preprint arXiv:2310.15008, 2023.
[38] Xiaoxiao Long, Cheng Lin, Peng Wang, Taku Komura, and Wenping Wang. Sparseneus: Fast generalizable neural surface reconstruction from sparse views. In European Conference on Computer Vision, pages 210–227. Springer, 2022.
[39] William E. Lorensen and Harvey E. Cline. Marching cubes: a high resolution 3D surface construction algorithm, page 347–353. Association for Computing Machinery, New York, NY, USA, 1998.
[40] J. Lorraine, K. Xie, X. Zeng, C. Lin, T. Takikawa, N. Sharp, T. Lin, M. Liu, S. Fidler, and J. Lucas. Att3d: Amortized text-to-3d object synthesis. In 2023 IEEE/CVF International Conference on Computer Vision (ICCV), pages 17900–17910, Los Alamitos, CA, USA, oct 2023. IEEE Computer Society.
[41] Yuanxun Lu, Jingyang Zhang, Shiwei Li, Tian Fang, David McKinnon, Yanghai Tsin, Long Quan, Xun Cao, and Yao Yao. Direct2.5: Diverse text-to-3d generation via multi-view 2.5d diffusion, 2024.
[42] Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. In ECCV, 2020.
[43] Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65(1):99–106, 2021.
[44] Diego Nehab, Szymon Rusinkiewicz, James Davis, and Ravi Ramamoorthi. Efficiently combining positions and normals for precise 3d geometry. ACM Trans. Graph., 24(3):536–543, jul 2005.
[45] Alex Nichol, Heewoo Jun, Prafulla Dhariwal, Pamela Mishkin, and Mark Chen. Point-e: A system for generating 3d point clouds from complex prompts, 2022.
[46] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mahmoud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Hervé Jegou, Julien Mairal, Patrick Labatut, Armand Joulin, and Piotr Bojanowski. Dinov2: Learning robust visual features without supervision, 2023.
[47] Ben Poole, Ajay Jain, Jonathan T. Barron, and Ben Mildenhall. Dreamfusion: Text-to-3d using 2d diffusion. In ICLR, 2023.
[48] Guocheng Qian, Jinjie Mai, Abdullah Hamdi, Jian Ren, Aliaksandr Siarohin, Bing Li, Hsin-Ying Lee, Ivan Skorokhodov, Peter Wonka, Sergey Tulyakov, and Bernard Ghanem. Magic123: One image to high-quality 3d object generation using both 2d and 3d diffusion priors. In The Twelfth International Conference on Learning Representations (ICLR), 2024.
[49] Lingteng Qiu, Guanying Chen, Xiaodong Gu, Qi Zuo, Mutian Xu, Yushuang Wu, Weihao Yuan, Zilong Dong, Liefeng Bo, and Xiaoguang Han. Richdreamer: A generalizable normal-depth diffusion model for detail richness in text-to-3d, 2023.
[50] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision, 2021.
[51] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation, 2021.
[52] Konstantinos Rematas, Ricardo Martin-Brualla, and Vittorio Ferrari. Sharf: Shape-conditioned radiance fields from a single view. arXiv preprint arXiv:2102.08860, 2021.
[53] Xuanchi Ren, Jiahui Huang, Xiaohui Zeng, Ken Museth, Sanja Fidler, and Francis Williams. Xcube: Large-scale 3d generative modeling using sparse voxel hierarchies. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4209–4219, 2024.
[54] Yufan Ren, Tong Zhang, Marc Pollefeys, Sabine Süsstrunk, and Fangjinhua Wang. Volrecon: Volume rendering of signed ray distance functions for generalizable multi-view reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16685–16695, 2023.
[55] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer. High-resolution image synthesis with latent diffusion models. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10674–10685, Los Alamitos, CA, USA, jun 2022. IEEE Computer Society.
[56] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S. Sara Mahdavi, Rapha Gontijo Lopes, Tim Salimans, Jonathan Ho, David J Fleet, and Mohammad Norouzi. Photorealistic text-to-image diffusion models with deep language understanding, 2022.
[57] Johannes L Schonberger and Jan-Michael Frahm. Structure-from-motion revisited. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4104–4113, 2016.
[58] Junyoung Seo, Wooseok Jang, Min-Seop Kwak, Jaehoon Ko, Hyeonsu Kim, Junho Kim, Jin-Hwa Kim, Jiyoung Lee, and Seungryong Kim. Let 2d diffusion model know 3d-consistency for robust text-to-3d generation. arXiv preprint arXiv:2303.07937, 2023.
[59] Ruoxi Shi, Hansheng Chen, Zhuoyang Zhang, Minghua Liu, Chao Xu, Xinyue Wei, Linghao Chen, Chong Zeng, and Hao Su. Zero123++: a single image to consistent multi-view diffusion base model. arXiv preprint arXiv:2310.15110, 2023.
[60] Yichun Shi, Peng Wang, Jianglong Ye, Long Mai, Kejie Li, and Xiao Yang. Mvdream: Multi-view diffusion for 3d generation. arXiv:2308.16512, 2023.
[61] Yasutaka Furukawa and Jean Ponce. Accurate, dense, and robust multiview stereopsis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(8), 2010.
[62] Jingxiang Sun, Bo Zhang, Ruizhi Shao, Lizhen Wang, Wen Liu, Zhenda Xie, and Yebin Liu. Dreamcraft3d: Hierarchical 3d generation with bootstrapped diffusion prior, 2023.
[63] Haotian Tang, Shang Yang, Zhijian Liu, Ke Hong, Zhongming Yu, Xiuyu Li, Guohao Dai, Yu Wang, and Song Han. Torchsparse++: Efficient training and inference framework for sparse convolution on gpus. In IEEE/ACM International Symposium on Microarchitecture (MICRO), 2023.
[64] Jiaxiang Tang, Zhaoxi Chen, Xiaokang Chen, Tengfei Wang, Gang Zeng, and Ziwei Liu. Lgm: Large multi-view gaussian model for high-resolution 3d content creation. arXiv preprint arXiv:2402.05054, 2024.
[65] Jiaxiang Tang, Jiawei Ren, Hang Zhou, Ziwei Liu, and Gang Zeng. Dreamgaussian: Generative gaussian splatting for efficient 3d content creation. arXiv preprint arXiv:2309.16653, 2023.
[66] Ayush Tewari, Tianwei Yin, George Cazenavette, Semon Rezchikov, Joshua B Tenenbaum, Frédo Durand, William T Freeman, and Vincent Sitzmann. Diffusion with forward models: Solving stochastic inverse problems without direct supervision. arXiv preprint arXiv:2306.11719, 2023.
[67] Dmitry Tochilkin, David Pankratz, Zexiang Liu, Zixuan Huang, Adam Letts, Yangguang Li, Ding Liang, Christian Laforte, Varun Jampani, and Yan-Pei Cao. Triposr: Fast 3d object reconstruction from a single image, 2024.
[68] Alex Trevithick and Bo Yang. Grf: Learning a general radiance field for 3d representation and rendering. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15182–15192, 2021.
[69] Vikram Voleti, Chun-Han Yao, Mark Boss, Adam Letts, David Pankratz, Dmitrii Tochilkin, Christian Laforte, Robin Rombach, and Varun Jampani. SV3D: Novel multi-view synthesis and 3D generation from a single image using latent video diffusion. arXiv, 2024.
[70] Haochen Wang, Xiaodan Du, Jiahao Li, Raymond A. Yeh, and Greg Shakhnarovich. Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. arXiv preprint arXiv:2212.00774, 2022.
[71] Peihao Wang, Xuxi Chen, Tianlong Chen, Subhashini Venugopalan, Zhangyang Wang, et al. Is attention all nerf needs? arXiv preprint arXiv:2207.13298, 2022.
[72] Peng Wang and Yichun Shi. Imagedream: Image-prompt multi-view diffusion for 3d generation. arXiv preprint arXiv:2312.02201, 2023.
[73] Peng Wang, Hao Tan, Sai Bi, Yinghao Xu, Fujun Luan, Kalyan Sunkavalli, Wenping Wang, Zexiang Xu, and Kai Zhang. Pf-lrm: Pose-free large reconstruction model for joint pose and shape prediction. arXiv preprint arXiv:2311.12024, 2023.
[74] Qianqian Wang, Zhicheng Wang, Kyle Genova, Pratul P Srinivasan, Howard Zhou, Jonathan T Barron, Ricardo Martin-Brualla, Noah Snavely, and Thomas Funkhouser. Ibrnet: Learning multi-view image-based rendering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4690–4699, 2021.
[75] Zhen Wang, Qiangeng Xu, Feitong Tan, Menglei Chai, Shichen Liu, Rohit Pandey, Sean Fanello, Achuta Kadambi, and Yinda Zhang. Mvdd: Multi-view depth diffusion models, 2023.
[76] Zhengyi Wang, Cheng Lu, Yikai Wang, Fan Bao, Chongxuan Li, Hang Su, and Jun Zhu. Prolificdreamer: High-fidelity and diverse text-to-3d generation with variational score distillation. arXiv preprint arXiv:2305.16213, 2023.
[77] Zhengyi Wang, Yikai Wang, Yifei Chen, Chendong Xiang, Shuo Chen, Dajiang Yu, Chongxuan Li, Hang Su, and Jun Zhu. Crm: Single image to 3d textured mesh with convolutional reconstruction model. arXiv preprint arXiv:2403.05034, 2024.
[78] Xinyue Wei, Fanbo Xiang, Sai Bi, Anpei Chen, Kalyan Sunkavalli, Zexiang Xu, and Hao Su. NeuManifold: Neural Watertight Manifold Reconstruction with Efficient and High-Quality Rendering Support. arXiv preprint, 2023.
[79] Xinyue Wei, Kai Zhang, Sai Bi, Hao Tan, Fujun Luan, Valentin Deschaintre, Kalyan Sunkavalli, Hao Su, and Zexiang Xu. Meshlrm: Large reconstruction model for high-quality mesh. arXiv preprint arXiv:2404.12385, 2024.
[80] Haohan Weng, Tianyu Yang, Jianan Wang, Yu Li, Tong Zhang, C. L. Philip Chen, and Lei Zhang. Consistent123: Improve consistency for one image to 3d object synthesis, 2023.
[81] Sangmin Woo, Byeongjun Park, Hyojun Go, Jin-Young Kim, and Changick Kim. Harmonyview: Harmonizing consistency and diversity in one-image-to-3d, 2023.
[82] Robert J. Woodham. Photometric method for determining surface orientation from multiple images, page 513–531. MIT Press, Cambridge, MA, USA, 1989.
[83] Tong Wu, Jiarui Zhang, Xiao Fu, Yuxin Wang, Jiawei Ren, Liang Pan, Wayne Wu, Lei Yang, Jiaqi Wang, Chen Qian, et al. Omniobject3d: Large-vocabulary 3d object dataset for realistic perception, reconstruction and generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 803–814, 2023.
[84] Kevin Xie, Jonathan Lorraine, Tianshi Cao, Jun Gao, James Lucas, Antonio Torralba, Sanja Fidler, and Xiaohui Zeng. Latte3d: Large-scale amortized text-to-enhanced3d synthesis. arXiv preprint arXiv:2403.15385, 2024.
[85] Jiale Xu, Weihao Cheng, Yiming Gao, Xintao Wang, Shenghua Gao, and Ying Shan. Instantmesh: Efficient 3d mesh generation from a single image with sparse-view large reconstruction models. arXiv preprint arXiv:2404.07191, 2024.
[86] Yinghao Xu, Zifan Shi, Wang Yifan, Sida Peng, Ceyuan Yang, Yujun Shen, and Gordon Wetzstein. Grm: Large gaussian reconstruction model for efficient 3d reconstruction and generation. arXiv preprint arXiv:2403.14621, 2024.
[87] Yinghao Xu, Hao Tan, Fujun Luan, Sai Bi, Peng Wang, Jiahao Li, Zifan Shi, Kalyan Sunkavalli, Gordon Wetzstein, Zexiang Xu, et al. Dmv3d: Denoising multi-view diffusion using 3d large reconstruction model. arXiv preprint arXiv:2311.09217, 2023.
[88] Hao Yang, Lanqing Hong, Aoxue Li, Tianyang Hu, Zhenguo Li, Gim Hee Lee, and Liwei Wang. Contranerf: Generalizable neural radiance fields for synthetic-to-real novel view synthesis via contrastive learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16508–16517, 2023.
[89] Jianglong Ye, Peng Wang, Kejie Li, Yichun Shi, and Heng Wang. Consistent-1-to-3: Consistent image to 3d view synthesis via geometry-aware diffusion models. arXiv preprint arXiv:2310.03020, 2023.
[90] Alex Yu, Vickie Ye, Matthew Tancik, and Angjoo Kanazawa. pixelnerf: Neural radiance fields from one or few images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4578–4587, 2021.
[91] Wangbo Yu, Li Yuan, Yan-Pei Cao, Xiangjun Gao, Xiaoyu Li, Wenbo Hu, Long Quan, Ying Shan, and Yonghong Tian. Hifi-123: Towards high-fidelity one image to 3d content generation, 2024.
[92] Kai Zhang, Sai Bi, Hao Tan, Yuanbo Xiangli, Nanxuan Zhao, Kalyan Sunkavalli, and Zexiang Xu. Gs-lrm: Large reconstruction model for 3d gaussian splatting. arXiv, 2024.
[93] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3836–3847, 2023.
[94] Xin-Yang Zheng, Hao Pan, Yu-Xiao Guo, Xin Tong, and Yang Liu. Mvd²: Efficient multiview 3d reconstruction for multiview diffusion. In SIGGRAPH, 2024.
[95] Xin-Yang Zheng, Hao Pan, Peng-Shuai Wang, Xin Tong, Yang Liu, and Heung-Yeung Shum. Locally attentional sdf diffusion for controllable 3d shape generation. ACM Transactions on Graphics (ToG), 42(4):1–13, 2023.
[96] Zi-Xin Zou, Zhipeng Yu, Yuan-Chen Guo, Yangguang Li, Ding Liang, Yan-Pei Cao, and Song-Hai Zhang. Triplane meets gaussian splatting: Fast and generalizable single-view 3d reconstruction with transformers, 2023.

Appendix A Supplemental Material

A.1 Comparison with Instant3D

In Figure 5, we showcase the comparison with Instant3D [27] on the text-to-3D task. The Instant3D results were obtained from the authors of that paper. While Instant3D [27] also generates 3D shapes that match the input text prompt, our method generates results with superior mesh quality and fine-grained, sharp geometric details.

A.2 Triplane Artifacts

As shown in Fig. 6, MeshLRM [79] has difficulty capturing words on objects, even when ground truth multi-view RGB images are used as input. We speculate that this is due to the limited number of triplane patches (e.g., $32\times 32\times 3$) restricted by global attention. In contrast, our method leverages sparse voxels and supports a much higher feature resolution of $256^{3}$, making it free from such issues.

A.3 Ablation Study: Input Normal Maps

In Figure 7, we qualitatively demonstrate the effect of input normal maps. When the model is trained without multi-view normal maps, we find that the generated model can only capture the global 3D shape but fails to generate fine-grained geometric details. However, when the model is given predicted normal maps, the performance is significantly better, although there are still some small gaps when compared to the results of ground truth normals (see the bread hole of the toaster and the wheel of the tram). This indicates errors or inconsistencies from the 2D normal prediction models.

A.4 Ablation Study: Geometry Enhancement

Table 4: Normal consistency (angle error) between the mesh geometry (mesh vertex normals) and the predicted normal maps, both before and after the geometry enhancement post-processing. The ratio of mesh vertices below a specific error threshold is shown. Evaluated on the GSO dataset.

angle error threshold	before	after
$<1^{\circ}$	8.83%	16.27%
$<2^{\circ}$	26.39%	40.83%
$<5^{\circ}$	60.55%	73.19%
$<10^{\circ}$	78.79%	86.43%
$<15^{\circ}$	86.46%	91.29%

We propose asking the network to predict an additional normal texture, which can be used for further geometric enhancement by applying a traditional algorithm as post-processing. The geometric enhancement aims to align the mesh geometry with the predicted normal map by adjusting the vertex locations. However, the traditional algorithm we used cannot guarantee that the mesh normals will be fully aligned with the predicted normal maps after processing. This limitation arises because the algorithm operates in local space and avoids large vertex displacements. Moreover, the predicted normal maps may contain errors or inconsistencies, such as conflicting neighboring normals. The adopted algorithm is an iterative numerical optimization method and does not compute an analytic solution.

However, we have quantitatively verified that the post-processing module can significantly improve normal consistency with the predicted normal map. For example, before post-processing, only 26.4% of mesh vertices had a normal angle error of less than 2 degrees. After post-processing, this number increased to 40.8%. For a 10-degree threshold, the ratio increases from 78.8% to 86.4%. For more details, please refer to Table 4.
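The kind of statistic reported in Table 4 can be sketched as follows: for each mesh vertex, compute the angle between the mesh vertex normal and the corresponding predicted normal, then report the fraction of vertices whose error falls below each threshold. The function names and the four toy normal pairs are hypothetical and only illustrate the computation; how predicted normals are looked up per vertex is not covered here.

```python
import math

def angle_error_deg(n1, n2):
    """Angle in degrees between two (not necessarily unit-length) normal vectors."""
    dot = sum(a * b for a, b in zip(n1, n2))
    norm = math.sqrt(sum(a * a for a in n1)) * math.sqrt(sum(b * b for b in n2))
    # Clamp to [-1, 1] to guard against rounding slightly outside the acos domain.
    cos = max(-1.0, min(1.0, dot / norm))
    return math.degrees(math.acos(cos))

def ratio_below(errors, thresholds=(1, 2, 5, 10, 15)):
    """Fraction of vertices whose angle error is strictly below each threshold."""
    return {t: sum(e < t for e in errors) / len(errors) for t in thresholds}

def tilted(deg):
    """Unit normal tilted by `deg` degrees away from the +z axis."""
    rad = math.radians(deg)
    return (math.sin(rad), 0.0, math.cos(rad))

# Toy data: the predicted normal is +z everywhere; mesh normals deviate by varying angles.
predicted = [(0.0, 0.0, 1.0)] * 4
mesh = [tilted(0.5), tilted(3.0), tilted(12.0), tilted(20.0)]

errors = [angle_error_deg(m, p) for m, p in zip(mesh, predicted)]
ratios = ratio_below(errors)
for t, r in ratios.items():
    print(f"< {t} deg: {100 * r:.2f}%")
```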

A.5 Ablation Study: Training time

Table 5: Analysis of our mesh generation quality over training time. Evaluated on the GSO [11] dataset.

Training Time	PSNR-C $\uparrow$	LPIPS-C $\downarrow$	PSNR-N $\uparrow$	LPIPS-N $\downarrow$	CD $\downarrow$	F-Score $\uparrow$
8 $\times$ H100 12h	21.28	0.2135	22.89	0.1536	0.0330	0.960
8 $\times$ H100 24h	21.32	0.2076	22.96	0.1516	0.0320	0.960
8 $\times$ H100 48h	21.41	0.2033	23.01	0.1484	0.0317	0.960
8 $\times$ H100 120h	21.44	0.2029	23.04	0.1480	0.0314	0.961
8 $\times$ H100 168h	21.47	0.2010	23.09	0.1466	0.0313	0.963

Our MeshFormer can be trained efficiently using only 8 GPUs, typically converging in approximately two days. Table 5 presents a quantitative analysis of our mesh generation quality over the training period. We observe that performance improves rapidly and nearly converges, with only marginal changes occurring after the two-day training period.
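The diminishing returns visible in Table 5 can be quantified with a short calculation. The sketch below copies the PSNR-C and F-Score columns from Table 5 and compares the average PSNR-C gain per training hour before and after the two-day mark; the helper name `psnr_gain_per_hour` is introduced here for illustration.

```python
# (hours on 8 x H100, PSNR-C, F-Score), copied from Table 5.
rows = [
    (12, 21.28, 0.960),
    (24, 21.32, 0.960),
    (48, 21.41, 0.960),
    (120, 21.44, 0.961),
    (168, 21.47, 0.963),
]

def psnr_gain_per_hour(a, b):
    """Average PSNR-C gain per training hour between two checkpoints."""
    return (b[1] - a[1]) / (b[0] - a[0])

by_hours = {r[0]: r for r in rows}
early = psnr_gain_per_hour(by_hours[12], by_hours[48])   # 12h to 48h
late = psnr_gain_per_hour(by_hours[48], by_hours[168])   # 48h to 168h

print(f"12h-48h:  {early:.4f} dB per hour")
print(f"48h-168h: {late:.4f} dB per hour")
```

Under this simple measure, the average gain per hour after the two-day mark is several times smaller than during the first two days.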

A.6 Training Details and Evaluation Metrics

Training Details: We train the model using a subset of 395k 3D shapes filtered from the Objaverse [9] dataset. These objects have a distributable Creative Commons license and were obtained by the Objaverse team using Sketchfab’s public API. For each filtered 3D shape, we randomly rotate the mesh and generate 10 data samples. For each data sample, we compute a $512^{3}$ ground truth SDF volume using a CUDA-based program and render multi-view RGB and normal images using BlenderProc. In our experiments, the resolutions of the occupancy volume and sparse feature volume are 64 and 256, respectively. The resolution of the predicted and ground truth SDF volumes is 512. The model is trained with the Adam optimizer and a cosine learning rate scheduler. The loss weights $\lambda_{1},\ldots,\lambda_{6}$ are set to 80, 2, 16, 2, 8, and 8, respectively.

All data preparation, including image rendering and SDF computation, is performed using an internal cluster. This process can be completed using 4000 CPU cores in roughly one week. The generated data takes up approximately 30TB. All model training tasks are conducted in public cloud clusters. Our main model is trained using 8 H100 GPUs for one week. All experiments listed in the paper can be completed in 15 days using 32 H100 GPUs (running multiple parallel experiments), excluding the preliminary exploration experiments.
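As an illustration of the training setup described above, the following sketch shows one common form of a cosine learning rate schedule and the weighted sum of six loss terms. It is not our training code: the base learning rate, the minimum learning rate, and the step count are not specified in this section and appear here only as hypothetical parameters, and the loss terms are assumed to be passed in the same order as $\lambda_{1},\ldots,\lambda_{6}$.

```python
import math

# Loss weights lambda_1..lambda_6 as listed in the training details.
LOSS_WEIGHTS = (80.0, 2.0, 16.0, 2.0, 8.0, 8.0)

def cosine_lr(step, total_steps, base_lr, min_lr=0.0):
    """Cosine-annealed learning rate; base_lr and min_lr are assumed values."""
    # Fraction of training completed, clamped to [0, 1].
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

def total_loss(loss_terms, weights=LOSS_WEIGHTS):
    """Weighted sum of the individual loss terms, in the order of the weights."""
    if len(loss_terms) != len(weights):
        raise ValueError("expected one loss term per weight")
    return sum(w * l for w, l in zip(weights, loss_terms))
```

The schedule starts at `base_lr`, reaches the midpoint between `base_lr` and `min_lr` halfway through training, and ends at `min_lr`.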

Architecture Details: For VoxelFormer, the UNet consists of four levels with resolutions of $64^{3}$, $32^{3}$, $16^{3}$ and $16^{3}$. Each level includes a ResNet module, a projection-aware cross-attention module, and a downsampling module, with channel sizes of 64, 128, 256, and 512. We added 6 transformer layers at the bottleneck of the UNet, with each 3D voxel treated as a token, and token channels set to 512.

For SparseVoxelFormer, the sparse UNet consists of six levels with resolutions of $256^{3}$, $128^{3}$, $64^{3}$, $32^{3}$, $16^{3}$, and $16^{3}$. Each level includes a sparse ResNet module, a projection-aware cross-attention module, and a downsampling module, with channel sizes of 16, 32, 64, 128, 512, and 2,048. We added 16 transformer layers at the bottleneck of the UNet, with each 3D sparse voxel treated as a token, and token channels set to 1,024. The feature dimension of the output sparse feature volume (before the MLP) is 32.

In both networks, a skip connection is added to the UNet.

Evaluation Metrics: To account for the scale and pose ambiguity of the generated mesh from different baselines, we align the predicted mesh with the ground truth mesh prior to the evaluation metric calculation. This alignment process involves uniformly sampling rotations and scales for initialization and subsequently refining the alignment using the Iterative Closest Point (ICP) algorithm. We select the alignment that yields the highest inlier ratio. Both the ground truth and predicted meshes are then scaled to fit within a unit bounding box.
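The initialization search and the final normalization can be sketched as follows. This is a simplified illustration under several assumptions, not our evaluation code: ICP refinement is omitted, the meshes are represented by sampled point sets, the inlier distance of 0.01 is a hypothetical value, "unit bounding box" is interpreted as the longest side having length 1, and all function names (including the helper `rot_z`, which only builds rotations about one axis) are hypothetical.

```python
import numpy as np

def rot_z(deg):
    """Rotation matrix about the z-axis (helper for building candidates)."""
    t = np.radians(deg)
    c, s = np.cos(t), np.sin(t)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def to_unit_bbox(points):
    """Center points and scale so the longest bounding-box side has length 1."""
    lo, hi = points.min(axis=0), points.max(axis=0)
    return (points - 0.5 * (lo + hi)) / (hi - lo).max()

def inlier_ratio(src, dst, inlier_dist):
    """Fraction of src points within inlier_dist of their nearest dst point."""
    # Brute-force pairwise squared distances; fine for small point sets only.
    d2 = np.sum((src[:, None, :] - dst[None, :, :]) ** 2, axis=-1)
    return float(np.mean(np.sqrt(d2.min(axis=1)) < inlier_dist))

def best_initialization(pred, gt, rotations, scales, inlier_dist=0.01):
    """Try every (rotation, scale) pair and keep the highest inlier ratio."""
    best = None
    for R in rotations:
        for s in scales:
            # Points are row vectors, so rotating by R means multiplying by R.T.
            ratio = inlier_ratio((s * pred) @ R.T, gt, inlier_dist)
            if best is None or ratio > best[2]:
                best = (R, s, ratio)
    return best
```

In a full pipeline, each candidate would additionally be refined with ICP before its inlier ratio is measured, and the selected alignment would be followed by `to_unit_bbox` on both point sets.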

For 3D metrics, we sample 100,000 points on both the ground truth mesh and the predicted mesh and compute the F-score and Chamfer distance, setting the F-score threshold at 0.05. To evaluate texture quality, we compute the PSNR and LPIPS between images rendered from the reconstructed mesh and those of the ground truth. Following InstantMesh [85], we sample 24 camera poses, encompassing a full 360-degree view around the object, and use BlenderProc to render RGB and normal images at a resolution of $320\times 320$. Since we use the VGG variant of LPIPS for the training loss, we employ the AlexNet variant of LPIPS for evaluation.
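The two 3D metrics can be sketched on sampled point sets as follows. Conventions for the Chamfer distance vary between papers; this sketch assumes the sum of the two mean nearest-neighbor distances (unsquared), which is an assumption and not necessarily the exact variant behind our tables. The function names are hypothetical, and the brute-force nearest-neighbor search is only meant for small inputs; for 100,000 points per mesh a KD-tree would be the practical choice.

```python
import numpy as np

def nearest_distances(src, dst, chunk=128):
    """Distance from each point in src (N, 3) to its nearest point in dst (M, 3)."""
    out = np.empty(len(src))
    # Process src in chunks to bound the size of the pairwise distance matrix.
    for start in range(0, len(src), chunk):
        block = src[start:start + chunk]
        d2 = np.sum((block[:, None, :] - dst[None, :, :]) ** 2, axis=-1)
        out[start:start + chunk] = np.sqrt(d2.min(axis=1))
    return out

def chamfer_and_fscore(pred, gt, threshold=0.05):
    """Return (Chamfer distance, F-score) for two point sets."""
    d_pred = nearest_distances(pred, gt)  # pred -> gt, determines precision
    d_gt = nearest_distances(gt, pred)    # gt -> pred, determines recall
    chamfer = d_pred.mean() + d_gt.mean()
    precision = np.mean(d_pred < threshold)
    recall = np.mean(d_gt < threshold)
    if precision + recall == 0:
        return float(chamfer), 0.0
    return float(chamfer), float(2 * precision * recall / (precision + recall))
```

The default `threshold=0.05` matches the F-score threshold stated above; both point sets are assumed to be aligned and normalized beforehand.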

A.7 Training Details of MeshLRM

All results of MeshLRM, except those in Table 2, were reproduced by the MeshLRM authors at Hillbot following the original settings as described in the paper. For the results in Table 2, we trained the model using the same training data as our method on 8 $\times$ H100 GPUs for 48 hours. We maintained the same batch size as reported in the paper and proportionally scaled down the original training time for each stage of MeshLRM based on a total training time of 48 hours. This included 5.8 seconds per iteration for 20,000 iterations in the 256-resolution pre-training, 12 seconds per iteration for 4,000 iterations in the 512-resolution fine-tuning, and 4.7 seconds per iteration for 4,000 iterations in mesh refinement.

A.8 Qualitative Examples of One-2-3-45++ and CRM

Figure 8 shows qualitative results of One-2-3-45++ [31] and CRM [77] on the single-image-to-3D task. Our method produces better results.

A.9 Broader Impact

We introduce an efficient approach for training open-world sparse-view reconstruction models, which has the potential to significantly reduce energy consumption and carbon emissions, as baseline models typically require far more computing resources for training. Previously, the creation of 3D assets was reserved for specialized artists who spent hours or even days producing a single 3D model. Our proposed technique allows even novices without specialized 3D modeling knowledge to create high-quality 3D assets in seconds. This democratization of 3D modeling can unlock new creative potential and operational efficiency across various sectors.

However, like other generative AI models, it also carries the risk of misuse, such as spreading misinformation and creating pornographic content. Therefore, it is crucial to implement strict ethical guidelines to mitigate these risks.