PixNerd: Pixel Neural Field Diffusion
Credit: https://arxiv.org/pdf/2507.23268

Today's paper introduces PixNerd (Pixel Neural Field Diffusion), a novel approach to image generation that operates directly in pixel space rather than compressed latent representations. The method addresses limitations of current diffusion models that rely on variational autoencoders (VAEs), which can introduce artifacts and require complex two-stage training. By combining diffusion transformers with neural field representations, PixNerd achieves competitive image generation quality while maintaining computational efficiency.

Method Overview

PixNerd replaces the final linear projection layer of a diffusion transformer with a neural field representation. Noisy image patches are processed through the transformer layers as usual, but instead of a single linear layer predicting the final pixel values, a neural field models the fine details within each patch.
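
To make this swap concrete, the sketch below contrasts the two output heads in PyTorch; all dimensions and layer sizes here are illustrative assumptions rather than the authors' configuration.

```python
# Illustrative sketch (assumed sizes, not the paper's exact configuration):
# a standard DiT head maps each patch token straight to pixel values, while a
# PixNerd-style head instead emits the parameters of a small per-patch MLP.
import torch.nn as nn

hidden_dim, patch = 768, 16

# Standard DiT output head: one linear map -> all pixels of the patch.
linear_head = nn.Linear(hidden_dim, patch * patch * 3)

# Neural-field output head: one linear map -> weights of a tiny two-layer MLP
# that is queried once per pixel (see the sketch after the next paragraph).
coord_dim, mlp_dim = 16, 64
in_dim = coord_dim + 3  # encoded pixel coordinates + noisy RGB value
n_field_params = in_dim * mlp_dim + mlp_dim * 3
field_head = nn.Linear(hidden_dim, n_field_params)
```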

Working directly in pixel space typically requires much larger patch sizes than latent diffusion models use in order to remain computationally efficient, which makes it difficult to capture fine details. PixNerd addresses this by using the transformer's hidden states to predict the weights of a small neural network (MLP) for each patch. For each pixel within a patch, the method encodes the pixel's local coordinates and combines this encoding with the noisy pixel value; this input is then passed through the patch-specific network to predict how the pixel should be denoised.
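
A minimal sketch of how such a per-patch field could look, assuming a two-layer MLP and the illustrative sizes above (the paper's exact parameterization may differ):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NeuralFieldHead(nn.Module):
    """Predicts a tiny per-patch MLP from the transformer hidden state,
    then queries it once per pixel. Sizes are illustrative assumptions."""

    def __init__(self, hidden_dim=768, coord_dim=16, mlp_dim=64):
        super().__init__()
        self.mlp_dim = mlp_dim
        in_dim = coord_dim + 3               # coord encoding + noisy RGB
        self.n_w1 = in_dim * mlp_dim
        self.n_w2 = mlp_dim * 3
        # Hypernetwork: hidden state -> flattened field weights per patch.
        self.to_weights = nn.Linear(hidden_dim, self.n_w1 + self.n_w2)

    def forward(self, h, coord_enc, noisy_rgb):
        # h:         (B, N, hidden_dim)    one token per patch
        # coord_enc: (B, N, P, coord_dim)  encoded pixel coordinates
        # noisy_rgb: (B, N, P, 3)          noisy pixel values in the patch
        B, N, _ = h.shape
        w = self.to_weights(h)
        w1 = w[..., :self.n_w1].view(B, N, -1, self.mlp_dim)
        w2 = w[..., self.n_w1:].view(B, N, self.mlp_dim, 3)
        x = torch.cat([coord_enc, noisy_rgb], dim=-1)     # (B, N, P, in_dim)
        x = F.silu(torch.einsum("bnpi,bnim->bnpm", x, w1))
        return torch.einsum("bnpm,bnmo->bnpo", x, w2)     # per-pixel output
```

Predicting weights per patch is what lets a large patch still resolve per-pixel detail: the transformer carries the global structure, while the tiny MLP interpolates within the patch.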

The neural field component uses coordinate encodings based on DCT (Discrete Cosine Transform) basis functions, which help the model understand spatial relationships within each patch. The method also applies normalization techniques to the neural field parameters to stabilize training. This approach allows the model to capture fine-grained details even when working with large patches, effectively bridging the gap between computational efficiency and detail preservation.
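One plausible reading of these two ingredients is sketched below; the number of basis functions and the weight-norm-style normalization are assumptions, not the paper's exact scheme.

```python
import math
import torch

def dct_coord_encoding(patch_size: int, n_freqs: int = 8) -> torch.Tensor:
    """(patch_size**2, 2 * n_freqs) DCT-style features of pixel coordinates."""
    # Pixel centers in [0, 1) along one axis.
    pos = (torch.arange(patch_size).float() + 0.5) / patch_size
    ks = torch.arange(n_freqs).float()
    # DCT-II style basis: cos(pi * k * t) evaluated at pixel centers.
    basis = torch.cos(math.pi * ks[None, :] * pos[:, None])   # (P, K)
    y, x = torch.meshgrid(torch.arange(patch_size),
                          torch.arange(patch_size), indexing="ij")
    # Concatenate row-axis and column-axis encodings per pixel.
    return torch.cat([basis[y.reshape(-1)], basis[x.reshape(-1)]], dim=-1)

def normalize_field_weights(w: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # Weight-norm-style stabilization: unit-norm columns per output unit.
    # (An assumed reading of the paper's normalization, not its exact form.)
    return w / (w.norm(dim=-2, keepdim=True) + eps)
```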

Results

PixNerd achieves competitive performance on standard benchmarks while operating entirely in pixel space. On ImageNet 256×256, the method achieves an FID score of 2.15, which is comparable to latent diffusion models while being significantly faster than other pixel-space approaches. On ImageNet 512×512, it maintains strong performance with an FID of 2.84.

For text-to-image generation, PixNerd-XXL/16 achieves a 0.73 overall score on the GenEval benchmark and 80.9 on the DPG benchmark, demonstrating competitive performance across different evaluation metrics. The method also supports training-free arbitrary resolution generation by interpolating neural field coordinates, allowing it to generate images at different resolutions without additional training.
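Because each patch is a continuous field over normalized coordinates, changing the output resolution amounts to sampling that field on a different grid. A hedged sketch, reusing dct_coord_encoding from the previous snippet (patch sizes are illustrative):

```python
# Training-free resolution change: query the same per-patch field on a denser
# coordinate grid. The encoding depends only on normalized pixel-center
# positions, so a finer grid samples the field at finer fractional coordinates.
trained_patch, target_patch = 16, 24        # e.g. 256x256 -> 384x384 output

dense_enc = dct_coord_encoding(target_patch)          # (576, 2 * n_freqs)
# Reshape to (1, 1, P', coord_dim) and feed through NeuralFieldHead together
# with correspondingly resampled noisy pixels to decode a 24x24 patch --
# no retraining, since the field weights are resolution-agnostic.
dense_enc = dense_enc.view(1, 1, target_patch * target_patch, -1)
```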

The approach shows superior spatial structure quality (sFID of 4.55) compared to other pixel-space methods, indicating better preservation of image structure and details. Memory consumption and training efficiency are comparable to latent diffusion models while avoiding the complexity of VAE training and potential decoding artifacts.

Conclusion

PixNerd presents an elegant solution to pixel-space image generation by combining diffusion transformers with neural field representations. The method successfully addresses the computational challenges of working directly in pixel space while maintaining image quality competitive with latent diffusion models. By eliminating the need for VAEs, it offers a simpler, single-stage training paradigm that avoids the accumulated errors and decoding artifacts inherent in two-stage approaches. For more information, please consult the full paper.

Congrats to the authors for their work!

Wang, Shuai, et al. "PixNerd: Pixel Neural Field Diffusion." arXiv preprint arXiv:2507.23268, 2025.
