CustomTex and the Disentanglement of Scene Appearance: Toward Factorized 3D Assets
Introduction
The generation of photorealistic 3D indoor scenes remains a bottleneck in computer graphics and vision, particularly when user customization enters the equation. While neural reconstruction methods such as Neural Radiance Fields and 3D Gaussian Splatting have achieved impressive geometric fidelity, they typically produce entangled representations in which lighting, material, and geometry collapse into view-dependent radiance fields. This entanglement forces practitioners to accept "baked-in" appearances that resist editing or relighting under novel conditions.
Texture synthesis offers an alternative pathway, yet prevailing methods rely heavily on text-to-image diffusion models that struggle with the precise, instance-level control required for interior design and architectural visualization. Text prompts are inherently ambiguous; they fail to capture the specific weave of a fabric, the exact grain of a wood surface, or the subtle pattern of a wallpaper. Even when reference images supplement text, existing approaches typically provide only coarse, global control rather than fine-grained, object-specific customization.
In their paper "CustomTex: High-fidelity Indoor Scene Texturing via Multi-Reference Customization," Chen et al. propose a framework that addresses these limitations through a dual-distillation approach. By separating semantic control from pixel-level enhancement within a Variational Score Distillation (VSD) optimization framework, CustomTex enables instance-specific texturing driven by multiple reference images. This method not only improves visual fidelity but also represents a conceptual shift toward disentangling geometric identity from material properties, pointing toward a future where 3D assets resemble editable physically based rendering graphs rather than static texture maps.
Technical Architecture: Dual Distillation and VSD Optimization
The core innovation of CustomTex lies in its explicit separation of semantic generation from pixel-quality enhancement. Rather than forcing a single diffusion model to simultaneously handle object recognition and high-frequency detail synthesis, CustomTex employs two distinct distillation processes, each leveraging a pre-trained Stable Diffusion model.
The first component, semantic-level distillation, utilizes an instance cross-attention mechanism to align specific objects in the 3D scene with their corresponding reference images. This mechanism ensures that the sofa in the generated texture matches the fabric pattern of its reference photograph, while the adjacent cabinet adheres to its own distinct wood-grain specification. By routing instance-specific attention through the optimization process, the framework maintains semantic plausibility without conflating the visual characteristics of different scene elements.
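The routing described above can be pictured as masked cross-attention: each texel's query attends only to tokens drawn from its own instance's reference image. The following is an illustrative toy in plain Python, not the paper's implementation; the function name `instance_cross_attention` and its argument layout are assumptions for the sketch.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def instance_cross_attention(queries, keys, values, query_ids, key_ids):
    """Toy masked cross-attention: each query token (a texel feature)
    attends only to reference-image tokens sharing its instance id,
    so the sofa's texels cannot borrow the cabinet's wood grain.
    Vectors are plain float lists; ids are integers (hypothetical)."""
    out = []
    for q, qid in zip(queries, query_ids):
        # Mask: keep only reference tokens belonging to this instance.
        idx = [i for i, kid in enumerate(key_ids) if kid == qid]
        scores = [sum(qc * kc for qc, kc in zip(q, keys[i])) for i in idx]
        weights = softmax(scores)
        dim = len(values[0])
        out.append([sum(w * values[i][d] for w, i in zip(weights, idx))
                    for d in range(dim)])
    return out
```

In the real framework this masking would live inside the diffusion model's cross-attention layers; the point of the sketch is only that the instance id, not the text prompt, decides which reference tokens a region may read from.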
The second component, pixel-level distillation, focuses exclusively on visual fidelity, sharpness, and artifact reduction. This process enhances high-frequency details while preserving the structural and semantic information established by the first stage. Both processes operate within a unified Variational Score Distillation framework, which allows the optimization to capture multiple modes present in the reference imagery rather than collapsing to a single average representation.
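Under VSD, each branch contributes a score-distillation gradient of the familiar form w(t)·(ε_branch − ε_LoRA), where ε_LoRA is the fine-tuned variational model's noise prediction. A toy update step blending a semantic branch and a pixel-level branch might look like the following; the function name, the blending weight `lam`, and the flattened-list representation are all assumptions for illustration, not the paper's formulation.

```python
def vsd_dual_step(texture, eps_semantic, eps_pixel, eps_lora,
                  w=1.0, lam=0.5, lr=0.1):
    """One toy optimization step blending two score-distillation signals.

    eps_semantic / eps_pixel: noise predictions from the two distillation
    branches; eps_lora: prediction of the fine-tuned (variational) model.
    The per-branch VSD gradient is w * (eps_branch - eps_lora); lam trades
    off semantic guidance against pixel-level enhancement. Tensors are
    flattened to plain float lists for illustration.
    """
    grad = [w * (lam * (es - el) + (1.0 - lam) * (ep - el))
            for es, ep, el in zip(eps_semantic, eps_pixel, eps_lora)]
    # Plain gradient descent on the texture parameters.
    return [t - lr * g for t, g in zip(texture, grad)]
```

In practice both ε terms would come from the pre-trained Stable Diffusion model under different conditioning (reference-aligned versus quality-focused); here they are just numbers standing in for those predictions.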
This architectural choice reflects a deeper understanding of how diffusion models process information. Standard text-driven methods entangle semantic content with pixel-level statistics, often producing blurry or unnaturally uniform textures when the model averages over ambiguous prompts. By decoupling these aspects, CustomTex achieves textures with superior sharpness and significantly fewer artifacts than state-of-the-art alternatives.
Instance-Level Control versus Global Constraints
Previous approaches to reference-driven texturing typically treat the reference image as a global style guide, applying its aesthetic uniformly across the entire scene. This global constraint fails in complex indoor environments where different objects require distinct material specifications. CustomTex introduces instance-level customization, allowing users to specify reference images for individual scene components such as chairs, walls, or flooring.
This granularity addresses a fundamental limitation in text-driven pipelines. Text prompts operate at the level of categories and attributes, but they cannot reliably specify the exact visual fingerprint of a particular material sample. By accepting multiple reference images and binding each to specific geometric instances through the instance cross-attention mechanism, CustomTex provides the precision necessary for professional visualization workflows.
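One way to picture the multi-reference input is a simple binding table from scene instances to exemplar images, validated before optimization begins. The names below (`ReferenceBinding`, `validate_bindings`, the file paths) are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ReferenceBinding:
    """Bind one geometric instance in the scene to one material exemplar."""
    instance_id: str       # e.g. a segmented object in the scene
    reference_image: str   # path to the exemplar photo (illustrative)

def validate_bindings(bindings, scene_instances):
    """Report instances with no reference and references with no instance.

    Catching these gaps up front matters because a missing binding would
    leave an object to fall back on ambiguous text guidance alone.
    """
    bound = {b.instance_id for b in bindings}
    missing = sorted(set(scene_instances) - bound)
    extra = sorted(bound - set(scene_instances))
    return missing, extra
```

A workflow tool built on this idea could warn the user that, say, the wall still lacks an exemplar before any expensive distillation is run.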
Furthermore, the framework explicitly addresses the problem of baked-in shading. Traditional diffusion-based texturing methods often replicate lighting cues present in the training data, generating textures that contain spurious highlights or shadows. These baked-in lighting effects make the textures unsuitable for relighting under different environmental conditions. CustomTex reduces this entanglement by isolating material appearance from illumination during the distillation process, producing textures that more closely resemble albedo maps suitable for physically based rendering pipelines.
Original Insights: The Trajectory Toward Factorized Assets
The separation of semantic distillation from pixel-level enhancement in CustomTex represents more than an incremental quality improvement. It signals a broader transition in 3D generation from monolithic texture maps toward factorized asset representations. In professional computer graphics, physically based rendering pipelines rely on distinct material channels: albedo, roughness, metallic properties, and normal maps remain separate to enable dynamic relighting and editing. Current neural texturing methods, by contrast, typically output baked RGB textures that collapse these properties into a single channel.
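The contrast between a factorized asset and a baked texture can be made concrete with a minimal data structure. The channel set mirrors standard PBR conventions; `FactorizedMaterial` and the helper `bake_lambertian` are illustrative names, and the shading model is deliberately the simplest possible one.

```python
from dataclasses import dataclass

@dataclass
class FactorizedMaterial:
    """Separate PBR channels, in contrast to a single baked RGB texture.

    Per-texel maps are flattened float lists for illustration.
    """
    albedo: list     # lighting-free base color
    roughness: list  # microfacet roughness in [0, 1]
    metallic: list   # dielectric (0.0) vs. conductor (1.0)
    normal: list     # tangent-space surface detail

def bake_lambertian(mat: FactorizedMaterial, light: float = 1.0) -> list:
    """Baking multiplies albedo by a fixed shading term.

    The product can no longer be relit: albedo and light are not
    individually recoverable from the single baked map.
    """
    return [a * light for a in mat.albedo]
```

Keeping the channels separate is exactly what lets a renderer swap the lighting term at display time; baking discards that freedom, which is why reducing baked-in shading moves CustomTex toward the factorized end of this spectrum.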
CustomTex moves toward disentangling these factors by reducing baked-in shading and preserving high-frequency material details. However, the framework still ultimately produces static UV-mapped textures rather than procedural material graphs or explicit BRDF parameters. The logical next step involves extending the dual-distillation approach to predict full material properties, including specular response and surface microstructure, directly from reference imagery.
Within five years, we can reasonably expect standard scene-texturing outputs to include complete BRDF parameter estimation, enabling true relighting without model retraining. This would require expanding the current architecture to distill not just appearance but physical material properties from the reference images. The instance cross-attention mechanism provides a template for this extension, potentially allowing different objects to inherit distinct material types (fabric, metal, ceramic) with physically accurate light response.
Limitations remain. The computational cost of running two diffusion models through VSD optimization is substantial, potentially limiting real-time applications or very large scenes. Additionally, the reliance on reference images assumes access to high-quality exemplars for every instance, which may not always be available in practical workflows. The method also assumes clean geometric segmentation of instances; errors in scene parsing would propagate into texture misalignment.
Conclusion
CustomTex establishes a more direct and technically sophisticated path to high-quality, customizable 3D scene appearance. By decoupling semantic guidance from pixel enhancement through its dual-distillation framework, it achieves instance-level consistency with reference images while producing textures with markedly improved sharpness and fewer lighting artifacts. This work sits at a critical juncture between traditional texture baking and the emerging paradigm of editable, factorized 3D assets.
As the field progresses, the principles demonstrated in CustomTex will likely extend beyond RGB texture synthesis toward full material-graph generation. The ability to disentangle geometry, material, and lighting at the generative stage, rather than in post-processing, will define the next generation of neural scene creation tools. The research community must now address the challenge of scaling these instance-aware distillation techniques to predict complete material properties, ultimately bridging the gap between neural generation and professional physically based rendering workflows.