PixNerd: Pixel Neural Field Diffusion
Credit: https://arxiv.org/pdf/2507.23268

Today's paper introduces PixNerd (Pixel Neural Field Diffusion), a novel approach to image generation that operates directly in pixel space rather than compressed latent representations. The method addresses limitations of current diffusion models that rely on variational autoencoders (VAEs), which can introduce artifacts and require complex two-stage training. By combining diffusion transformers with neural field representations, PixNerd achieves competitive image generation quality while maintaining computational efficiency.

Method Overview

PixNerd replaces the final linear projection layer of a diffusion transformer with a neural field representation. Noisy image patches are processed through the transformer layers as usual, but instead of a single linear layer predicting the final pixel values, a neural field models the fine details within each patch.
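
To make this swap concrete, the sketch below contrasts the two output heads in PyTorch; all dimensions and layer sizes here are illustrative assumptions rather than the authors' configuration.

```python
# Illustrative sketch (assumed sizes, not the paper's exact configuration):
# a standard DiT head maps each patch token straight to pixel values, while a
# PixNerd-style head instead emits the parameters of a small per-patch MLP.
import torch.nn as nn

hidden_dim, patch = 768, 16

# Standard DiT output head: one linear map -> all pixels of the patch.
linear_head = nn.Linear(hidden_dim, patch * patch * 3)

# Neural-field output head: one linear map -> weights of a tiny two-layer MLP
# that is queried once per pixel (see the sketch after the next paragraph).
coord_dim, mlp_dim = 16, 64
in_dim = coord_dim + 3  # encoded pixel coordinates + noisy RGB value
n_field_params = in_dim * mlp_dim + mlp_dim * 3
field_head = nn.Linear(hidden_dim, n_field_params)
```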

Working directly in pixel space typically requires much larger patch sizes than latent diffusion models use in order to remain computationally efficient, which makes it difficult to capture fine details. PixNerd addresses this by using the transformer's hidden states to predict the weights of a small neural network (MLP) for each patch. For each pixel within a patch, the method encodes the pixel's local coordinates and combines this encoding with the noisy pixel value; this input is then passed through the patch-specific network to predict how the pixel should be denoised.
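
A minimal sketch of how such a per-patch field could look, assuming a two-layer MLP and the illustrative sizes above (the paper's exact parameterization may differ):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NeuralFieldHead(nn.Module):
    """Predicts a tiny per-patch MLP from the transformer hidden state,
    then queries it once per pixel. Sizes are illustrative assumptions."""

    def __init__(self, hidden_dim=768, coord_dim=16, mlp_dim=64):
        super().__init__()
        self.mlp_dim = mlp_dim
        in_dim = coord_dim + 3               # coord encoding + noisy RGB
        self.n_w1 = in_dim * mlp_dim
        self.n_w2 = mlp_dim * 3
        # Hypernetwork: hidden state -> flattened field weights per patch.
        self.to_weights = nn.Linear(hidden_dim, self.n_w1 + self.n_w2)

    def forward(self, h, coord_enc, noisy_rgb):
        # h:         (B, N, hidden_dim)    one token per patch
        # coord_enc: (B, N, P, coord_dim)  encoded pixel coordinates
        # noisy_rgb: (B, N, P, 3)          noisy pixel values in the patch
        B, N, _ = h.shape
        w = self.to_weights(h)
        w1 = w[..., :self.n_w1].view(B, N, -1, self.mlp_dim)
        w2 = w[..., self.n_w1:].view(B, N, self.mlp_dim, 3)
        x = torch.cat([coord_enc, noisy_rgb], dim=-1)     # (B, N, P, in_dim)
        x = F.silu(torch.einsum("bnpi,bnim->bnpm", x, w1))
        return torch.einsum("bnpm,bnmo->bnpo", x, w2)     # per-pixel output
```

Predicting weights per patch is what lets a large patch still resolve per-pixel detail: the transformer carries the global structure, while the tiny MLP interpolates within the patch.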

The neural field component uses coordinate encodings based on DCT (Discrete Cosine Transform) basis functions, which help the model understand spatial relationships within each patch. The method also applies normalization techniques to the neural field parameters to stabilize training. This approach allows the model to capture fine-grained details even when working with large patches, effectively bridging the gap between computational efficiency and detail preservation.
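One plausible reading of these two ingredients is sketched below; the number of basis functions and the weight-norm-style normalization are assumptions, not the paper's exact scheme.

```python
import math
import torch

def dct_coord_encoding(patch_size: int, n_freqs: int = 8) -> torch.Tensor:
    """(patch_size**2, 2 * n_freqs) DCT-style features of pixel coordinates."""
    # Pixel centers in [0, 1) along one axis.
    pos = (torch.arange(patch_size).float() + 0.5) / patch_size
    ks = torch.arange(n_freqs).float()
    # DCT-II style basis: cos(pi * k * t) evaluated at pixel centers.
    basis = torch.cos(math.pi * ks[None, :] * pos[:, None])   # (P, K)
    y, x = torch.meshgrid(torch.arange(patch_size),
                          torch.arange(patch_size), indexing="ij")
    # Concatenate row-axis and column-axis encodings per pixel.
    return torch.cat([basis[y.reshape(-1)], basis[x.reshape(-1)]], dim=-1)

def normalize_field_weights(w: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # Weight-norm-style stabilization: unit-norm columns per output unit.
    # (An assumed reading of the paper's normalization, not its exact form.)
    return w / (w.norm(dim=-2, keepdim=True) + eps)
```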

Results

PixNerd achieves competitive performance on standard benchmarks while operating entirely in pixel space. On ImageNet 256×256, the method achieves an FID score of 2.15, which is comparable to latent diffusion models while being significantly faster than other pixel-space approaches. On ImageNet 512×512, it maintains strong performance with an FID of 2.84.

For text-to-image generation, PixNerd-XXL/16 achieves a 0.73 overall score on the GenEval benchmark and 80.9 on the DPG benchmark, demonstrating competitive performance across different evaluation metrics. The method also supports training-free arbitrary resolution generation by interpolating neural field coordinates, allowing it to generate images at different resolutions without additional training.
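Because each patch is a continuous field over normalized coordinates, changing the output resolution amounts to sampling that field on a different grid. A hedged sketch, reusing dct_coord_encoding from the previous snippet (patch sizes are illustrative):

```python
# Training-free resolution change: query the same per-patch field on a denser
# coordinate grid. The encoding depends only on normalized pixel-center
# positions, so a finer grid samples the field at finer fractional coordinates.
trained_patch, target_patch = 16, 24        # e.g. 256x256 -> 384x384 output

dense_enc = dct_coord_encoding(target_patch)          # (576, 2 * n_freqs)
# Reshape to (1, 1, P', coord_dim) and feed through NeuralFieldHead together
# with correspondingly resampled noisy pixels to decode a 24x24 patch --
# no retraining, since the field weights are resolution-agnostic.
dense_enc = dense_enc.view(1, 1, target_patch * target_patch, -1)
```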

The approach shows superior spatial structure quality (sFID of 4.55) compared to other pixel-space methods, indicating better preservation of image structure and details. Memory consumption and training efficiency are comparable to latent diffusion models while avoiding the complexity of VAE training and potential decoding artifacts.

Conclusion

PixNerd presents an elegant solution to pixel-space image generation by combining diffusion transformers with neural field representations. The method successfully addresses the computational challenges of working directly in pixel space while maintaining image quality competitive with latent diffusion models. By eliminating the need for VAEs, it offers a simpler, single-stage training paradigm that avoids the accumulated errors and decoding artifacts inherent in two-stage approaches. For more information, please consult the full paper.

Congrats to the authors for their work!

Wang, Shuai, et al. "PixNerd: Pixel Neural Field Diffusion." arXiv preprint arXiv:2507.23268, 2025.
