
The rapid development of generative AI models, especially deep generative models, has significantly advanced capabilities in natural language generation, 3D generation, image generation, and speech synthesis. These models have revolutionized 3D production across various industries. Nevertheless, many face a challenge: their complex wiring and generated meshes often aren’t compatible with traditional rendering pipelines such as Physically Based Rendering (PBR). Diffusion-based models, notably those that produce textures free of baked-in lighting, show impressive and diverse 3D asset generation, enhancing 3D workflows in filmmaking, gaming, and AR/VR.
This article introduces Paint3D, a novel framework for producing diverse, high-resolution 2K UV texture maps for untextured 3D meshes, conditioned on visual or textual inputs. Paint3D’s main challenge is generating high-quality textures without embedded illumination, so that users can re-edit or re-light them inside modern graphics pipelines. It employs a pre-trained 2D diffusion model and multi-view texture fusion to generate initial coarse texture maps. Nevertheless, these maps often show illumination artifacts and incomplete areas because the 2D model cannot fully suppress lighting effects or fully represent 3D shapes. We’ll delve into Paint3D’s workings, architecture, and comparisons with other deep generative frameworks. Let’s begin.
The capabilities of deep generative AI models in natural language generation, 3D generation, and image synthesis are well known and already used in real-life applications, where they are revolutionizing the 3D generation industry. Despite these capabilities, modern deep generative frameworks produce meshes characterized by complex wiring and chaotic lighting textures that are often incompatible with conventional rendering pipelines, including Physically Based Rendering (PBR). Like deep generative AI models, texture synthesis has also advanced rapidly, especially through the use of 2D diffusion models. Texture synthesis models employ pre-trained depth-to-image diffusion models to generate high-quality textures from text conditions. Nevertheless, these approaches struggle with pre-illuminated textures, which can significantly degrade the final 3D environment renderings and introduce lighting errors when the lights are changed in common workflows, as demonstrated in the following image.
As can be observed, the illumination-free texture map works in sync with standard rendering pipelines and delivers accurate results, whereas the pre-illuminated texture map shows inappropriate shadows when relighting is applied. On the other hand, texture generation frameworks trained on 3D data offer an alternative approach in which the framework generates textures by comprehending a specific 3D object’s entire geometry. Although they can deliver better results, such frameworks lack generalization capabilities, which hinders their ability to apply the model to 3D objects outside their training data.
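The relighting problem can be made concrete with a small numeric sketch. It assumes a simple Lambertian shading model (colour = albedo × clamped cosine between surface normal and light direction), which is our simplification and not something Paint3D prescribes; all names and values below are hypothetical.

```python
def lambert(normal, light):
    """Lambertian shading term: clamped dot product of two unit vectors."""
    return max(0.0, sum(n * l for n, l in zip(normal, light)))

albedo = 0.8                  # true surface reflectance (single channel)
normal = (0.0, 0.0, 1.0)      # surface facing +z
old_light = (0.0, 0.6, 0.8)   # light direction baked into the texture
new_light = (0.0, 0.0, 1.0)   # light chosen later inside the renderer

# A lighting-less texture stores only the albedo.
clean_texel = albedo
# A pre-illuminated texture stores albedo already multiplied by the old shading.
baked_texel = albedo * lambert(normal, old_light)

# Relighting multiplies the stored texel by the new shading term.
relit_clean = clean_texel * lambert(normal, new_light)   # correct result
relit_baked = baked_texel * lambert(normal, new_light)   # old shading applied twice

print(relit_clean, relit_baked)
```

In this toy setup the clean texel relights to the expected brightness, while the baked texel comes out darker because the old shading is still embedded in it. That is the kind of inappropriate shadow the comparison above refers to.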
Current texture generation models face two critical challenges. The first is using image guidance or diverse prompts to achieve broader generalization across different objects. The second is eliminating the coupled illumination that results inherit from pre-training. Pre-illuminated textures can interfere with the final appearance of textured objects inside rendering engines. In addition, because pre-trained 2D diffusion models produce 2D results only in the view domain, they lack a comprehensive understanding of shape and therefore cannot maintain view consistency for 3D objects.
To address these challenges, the Paint3D framework develops a dual-stage texture diffusion model for 3D objects that generalizes across different pre-trained generative models and preserves view consistency while learning lighting-less texture generation.
Paint3D is a dual-stage, coarse-to-fine texture generation model that aims to leverage the strong prompt guidance and image generation capabilities of pre-trained generative AI models to texture 3D objects. In the first stage, the Paint3D framework progressively samples multi-view images from a pre-trained depth-aware 2D image diffusion model, which lets it generalize to high-quality, rich texture results from diverse prompts. The model then generates an initial texture map by back-projecting these images onto the 3D mesh surface. In the second stage, the model focuses on generating lighting-less textures by applying diffusion models specialized in removing lighting influences and in shape-aware refinement of incomplete regions. Throughout the process, Paint3D consistently generates semantically consistent, high-quality 2K textures and eliminates intrinsic illumination effects.
To sum up, Paint3D is a novel coarse-to-fine generative AI model that aims to produce diverse, lighting-less, high-resolution 2K UV texture maps for untextured 3D meshes. It achieves state-of-the-art performance in texturing 3D objects with different conditional inputs, including text and images, and offers significant advantages for synthesis and graphics editing tasks.
Methodology and Architecture
The Paint3D framework progressively generates and refines texture maps to produce diverse, high-quality textures for 3D models from the desired conditional inputs, including images and prompts, as demonstrated in the following image.
In the coarse stage, the Paint3D model uses pre-trained 2D image diffusion models to sample multi-view images and then creates the initial texture maps by back-projecting these images onto the surface of the mesh. In the second stage, the refinement stage, the model runs a diffusion process in UV space to enhance the coarse texture maps, performing inpainting, high-definition enhancement, and lighting removal to ensure the visual appeal and completeness of the final texture.
Stage 1: Progressive Coarse Texture Generation
In the progressive coarse texture generation stage, the Paint3D model generates a coarse UV texture map for the 3D mesh using a pre-trained depth-aware 2D diffusion model. More specifically, the model first renders depth maps from different camera views, then uses these depth conditions to sample images from the image diffusion model, and finally back-projects the images onto the mesh surface. The framework alternates between rendering, sampling, and back-projection to improve the consistency of the textured mesh, which ultimately enables the progressive generation of the texture map.
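The alternation of rendering, sampling, and back-projection can be sketched as a simple loop. This is a toy illustration under strong assumptions, not Paint3D's implementation: texels are plain integer indices, each camera view is the set of texels it can see, and `dummy_sampler` is a hypothetical stand-in for the depth-conditioned diffusion sampler.

```python
def paint_progressively(num_texels, views, sample_view):
    """Alternate render -> sample -> back-project over a list of views.

    views: list of sets, the texel indices visible from each camera.
    sample_view: placeholder for the diffusion sampler; it receives the view
        index and the texels to fill, and returns a dict {texel: colour}.
    """
    texture = [None] * num_texels          # None marks an unpainted texel
    for view_id, visible in enumerate(views):
        # "Render": find the visible texels that are still unpainted.
        todo = sorted(t for t in visible if texture[t] is None)
        if not todo:
            continue
        # "Sample": ask the (hypothetical) sampler for colours.
        colours = sample_view(view_id, todo)
        # "Back-project": write new colours without overwriting earlier views.
        for t in todo:
            texture[t] = colours[t]
    return texture


def dummy_sampler(view_id, texels):
    """Labels each texel with the view that painted it."""
    return {t: f"view{view_id}" for t in texels}


views = [{0, 1, 2, 3}, {2, 3, 4, 5}, {5, 6}]
coarse = paint_progressively(8, views, dummy_sampler)
holes = [i for i, c in enumerate(coarse) if c is None]
print(coarse, holes)
```

In this toy run texel 7 is never visible from any camera, so it stays unpainted. Holes of this kind are what the refinement stage is later asked to fill.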
The model starts by generating the texture of the visible region, with the camera views focused on the 3D mesh, and renders the mesh to a depth map from the first view. It then samples a texture image given an appearance condition and a depth condition, and back-projects that image onto the 3D mesh. For subsequent viewpoints, the Paint3D model follows a similar approach with one change: it performs texture sampling using an image inpainting approach. The model also takes the regions textured from previous viewpoints into account, so the rendering step outputs not only a depth image but also a partially coloured RGB image with a mask of the uncoloured area in the current view.
The model then uses a depth-aware image inpainting model with an inpainting encoder to fill the uncoloured area inside the RGB image. It generates the texture map for that view by back-projecting the inpainted image onto the 3D mesh under the current view, which lets the model build the texture map progressively until it arrives at the entire coarse texture map. Finally, the model extends the texture sampling process to a scene or object with multiple views. More specifically, it uses a pair of cameras to capture two depth maps from symmetric viewpoints during the initial texture sampling. It then combines the two depth maps into a depth grid and replaces the single depth image with this grid to perform multi-view depth-aware texture sampling.
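As a rough sketch of the depth grid idea, two equally sized depth maps can be tiled into one image, sampled jointly, and split back into per-view images. The side-by-side layout and the helper names `compose_grid` and `split_grid` are assumptions made for illustration; the text above does not specify how the grid is arranged.

```python
def compose_grid(left, right):
    """Place two equally sized 2D maps side by side (layout is an assumption)."""
    if len(left) != len(right):
        raise ValueError("maps must have the same height")
    return [row_l + row_r for row_l, row_r in zip(left, right)]


def split_grid(grid, left_width):
    """Inverse of compose_grid: recover the two per-view images."""
    left = [row[:left_width] for row in grid]
    right = [row[left_width:] for row in grid]
    return left, right


# Toy 2x2 depth maps from two symmetric viewpoints (values are made up).
depth_front = [[1.0, 1.2], [1.1, 1.3]]
depth_back = [[2.0, 2.2], [2.1, 2.3]]

depth_grid = compose_grid(depth_front, depth_back)
front, back = split_grid(depth_grid, left_width=2)
print(depth_grid)
```

Sampling one image conditioned on the whole grid, rather than one image per view, gives the sampler a chance to keep both views consistent with each other.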
Stage 2: Texture Refinement in UV Space
Although the appearance of the coarse texture maps is plausible, they still have problems, such as texture holes caused by self-occlusion during rendering and lighting shadows introduced by the 2D image diffusion models. Paint3D therefore performs a diffusion process in UV space on the basis of the coarse texture map, aiming to mitigate these issues and further enhance the visual appeal of the texture map during refinement. However, refining texture maps in UV space with a mainstream image diffusion model introduces texture discontinuity, because UV mapping cuts the continuous texture of the 3D surface into a series of individual fragments in UV space. As a consequence of this fragmentation, the model finds it difficult to learn the 3D adjacency relationships among the fragments, which leads to texture discontinuity.
The model refines the texture map in UV space by performing the diffusion process under the guidance of the texture fragments’ adjacency information. In UV space, it is the position map that represents the 3D adjacency information of texture fragments, with the model treating each non-background element as a 3D point coordinate. During the diffusion process, the model fuses the 3D adjacency information by adding an individual position map encoder to the pre-trained image diffusion model. The new encoder resembles the design of the ControlNet framework: it has the same architecture as the encoder in the image diffusion model, with a zero-convolution layer connecting the two. The texture diffusion model is trained on a dataset of texture and position maps, and it learns to predict the noise added to the noisy latent. During training, only the position encoder is optimized, while the pre-trained denoiser is kept frozen so that it retains its image diffusion capability.
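A small sketch can show why a position map carries adjacency information that UV coordinates alone do not. The 8x8 map, the two fragments, and the cube-edge geometry below are all hypothetical and chosen only for illustration.

```python
import math

# Hypothetical 8x8 position map: background texels hold None, every other
# texel holds the 3D point it maps to on the mesh surface.
SIZE = 8
position_map = [[None] * SIZE for _ in range(SIZE)]

# Fragment A (top-left island): a strip on the face z = 1, x from 0.0 to 0.75.
for col in range(4):
    position_map[0][col] = (col * 0.25, 0.0, 1.0)

# Fragment B (bottom-right island): a strip on the face x = 1 that continues
# around the same cube edge, z from 1.0 down to 0.25.
for col in range(4):
    position_map[7][4 + col] = (1.0, 0.0, 1.0 - col * 0.25)


def uv_distance(a, b):
    """Distance between two texels measured in UV (pixel) space."""
    return math.dist(a, b)


def surface_distance(a, b):
    """Distance between the 3D points stored at two texels."""
    return math.dist(position_map[a[0]][a[1]], position_map[b[0]][b[1]])


end_of_a = (0, 3)     # last texel of fragment A, stores (0.75, 0, 1)
start_of_b = (7, 4)   # first texel of fragment B, stores (1, 0, 1)

print(uv_distance(end_of_a, start_of_b), surface_distance(end_of_a, start_of_b))
```

The two texels sit in opposite corners of the UV image, yet their stored 3D points are only 0.25 units apart. An encoder that sees the position map can in principle recover this neighbourhood relation, which is the motivation given above for conditioning the diffusion process on it.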
The model then uses the position conditional encoder together with other encoders to perform refinement tasks in UV space. In this respect, it has two refinement capabilities: UVHD (UV High Definition) and UV inpainting. UVHD is designed to enhance the visual appeal and aesthetics of the texture map; to achieve it, the model uses an image enhance encoder and a position encoder with the diffusion model. UV inpainting fills the texture holes in the UV plane, which avoids the self-occlusion issues that arise during rendering. In the refinement stage, Paint3D first performs UV inpainting and then performs UVHD to generate the final refined texture map. By integrating the two refinement methods, the Paint3D framework can produce complete, diverse, high-resolution, and lighting-less UV texture maps.
Paint3D: Experiments and Results
The Paint3D model employs the Stable Diffusion text2image model for texture generation and uses its image encoder component to handle image conditions. To further strengthen its control over conditions such as image inpainting, depth, and image high definition, the Paint3D framework employs ControlNet domain encoders. The model is implemented in PyTorch, with rendering and texture projection implemented with Kaolin.
Text to Textures Comparison
To analyze its performance, we start by evaluating Paint3D’s texture generation when conditioned on textual prompts and compare it against state-of-the-art frameworks including Text2Tex, TEXTure, and Latent-Paint. As can be observed in the following image, the Paint3D framework not only excels at generating high-quality texture details but also synthesizes an illumination-free texture map reasonably well.
In comparison, the Latent-Paint framework is prone to generating blurry textures, which leads to suboptimal visual effects. The TEXTure framework generates clear textures, but they lack smoothness and exhibit noticeable splicing and seams. Finally, the Text2Tex framework generates smooth textures remarkably well, but it fails to reach the same level when generating fine textures with intricate detail.
The following image compares the Paint3D framework quantitatively with state-of-the-art frameworks.
As can be observed, the Paint3D framework outperforms all existing models by a significant margin, with nearly 30% improvement over the baseline in FID and roughly 40% improvement in KID. These gains in FID and KID demonstrate Paint3D’s ability to generate high-quality textures across diverse objects and categories.
Image to Texture Comparison
To evaluate Paint3D’s generative capabilities with visual prompts, we use the TEXTure model as the baseline. As mentioned earlier, the Paint3D model employs an image encoder sourced from the Stable Diffusion text2image model. As can be seen in the following image, the Paint3D framework synthesizes exquisite textures remarkably well while maintaining high fidelity to the image condition.
The TEXTure framework, on the other hand, can generate a texture similar to Paint3D’s, but it falls short of accurately representing the texture details in the image condition. Moreover, as demonstrated in the following image, the Paint3D framework delivers better (lower) FID and KID scores than the TEXTure framework, with FID decreasing from 40.83 to 26.86 and KID dropping from 9.76 to 4.94.
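To put the image-to-texture scores quoted above in relative terms, the reduction can be computed directly. This is plain arithmetic on the reported numbers; the helper name `relative_reduction` is ours, and lower is better for both metrics.

```python
def relative_reduction(baseline, ours):
    """Fractional drop of a lower-is-better metric relative to the baseline."""
    return (baseline - ours) / baseline


fid_drop = relative_reduction(40.83, 26.86)   # TEXTure FID vs. Paint3D FID
kid_drop = relative_reduction(9.76, 4.94)     # TEXTure KID vs. Paint3D KID

print(f"FID reduced by {fid_drop:.1%}, KID reduced by {kid_drop:.1%}")
```

This works out to a reduction of roughly 34% in FID and roughly 49% in KID relative to TEXTure for the image-conditioned setting.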
Final Thoughts
In this article, we’ve talked about Paint3D, a novel coarse-to-fine framework capable of producing lighting-less, diverse, and high-resolution 2K UV texture maps for untextured 3D meshes conditioned on either visual or textual inputs. The main highlight of the Paint3D framework is that it can generate lighting-less, high-resolution 2K UV textures that remain semantically consistent with the image or text condition. Owing to its coarse-to-fine approach, the Paint3D framework produces lighting-less, diverse, and high-resolution texture maps and delivers better performance than current state-of-the-art frameworks.