
New York University researchers have developed a new architecture for diffusion models that improves the semantic representation of the images they create. Their paper, "Diffusion Transformers with Representation Autoencoders" (RAE), challenges some of the accepted principles of building diffusion models. The NYU researchers' model is more efficient and accurate than standard diffusion models, takes advantage of the latest research in representation learning, and could pave the way for new applications that were previously too difficult or expensive.
These advances could unlock more reliable and powerful features for enterprise applications. "To edit images well, a model has to really understand what's in them," Saining Xie, co-author of the paper, told VentureBeat. "RAE helps to connect this understanding part with the generation part." He also pointed to future applications such as "RAG-based generation, where you use the RAE encoder features for searching and then generate new images based on the search results," as well as "video generation and action-conditioned world models."
The state of generative modeling
In a diffusion model, a variational autoencoder (VAE) learns a compact representation of an image's key features in a so-called "latent space." The model is then trained to reverse this process, generating new images starting from random noise.
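To make this concrete, here is a minimal, hypothetical sketch of a latent-diffusion training step in PyTorch: a toy encoder compresses images into a latent, noise is mixed in at a random level, and a toy denoiser learns to predict that noise. The module names and the simple interpolation-style noising schedule are illustrative stand-ins, not the architecture used in the paper.

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins: any VAE-style encoder and denoising network would do.
class ToyEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        # Compress a 256x256x3 image into a 32x32x4 latent grid.
        self.net = nn.Conv2d(3, 4, kernel_size=8, stride=8)

    def forward(self, x):
        return self.net(x)

class ToyDenoiser(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Conv2d(4, 4, kernel_size=3, padding=1)

    def forward(self, z_noisy, t):
        # A real model would also condition on the noise level t; omitted for brevity.
        return self.net(z_noisy)

encoder, denoiser = ToyEncoder(), ToyDenoiser()
images = torch.randn(8, 3, 256, 256)            # a batch of training images
with torch.no_grad():
    z = encoder(images)                         # compact latent representation
t = torch.rand(z.size(0), 1, 1, 1)              # random noise level per sample
noise = torch.randn_like(z)
z_noisy = (1 - t) * z + t * noise               # corrupt the latent toward pure noise
loss = ((denoiser(z_noisy, t) - noise) ** 2).mean()
loss.backward()                                 # the denoiser learns to undo the corruption
```

At sampling time, the trained denoiser is applied iteratively to pure noise, and the VAE decoder maps the resulting latent back into an image.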
While diffusion models themselves have evolved considerably, the autoencoder used in most of them has changed little in recent years. According to the NYU researchers, this standard autoencoder (SD-VAE) is well suited to capturing low-level features and local appearance, but it "lacks the global semantic structure" that is important for generalization and generative performance.
At the same time, the field has seen impressive progress in learning image representations with models such as DINO, MAE, and CLIP. These models learn semantically structured visual features that generalize across tasks and could serve as a natural basis for visual understanding. However, a widely held belief has kept developers from using these architectures for image generation: models focused on semantics are assumed to be unsuitable for generating images because they do not capture fine-grained, pixel-level features. Practitioners also believe that diffusion models do not work well with the kind of high-dimensional representations that semantic models produce.
Diffusion with representational encoders
The NYU researchers propose replacing the standard VAE with "representation autoencoders" (RAE). This new type of autoencoder pairs a pretrained representation encoder, such as Meta's DINO, with a trained vision transformer decoder. This approach simplifies training by reusing existing, powerful encoders that have already been trained on large-scale datasets.
To make this work, the team developed a variant of the Diffusion Transformer (DiT), the backbone of most modern image generation models. This modified DiT can be trained efficiently in the high-dimensional latent space of RAEs without extra computational cost. The researchers show that frozen representation encoders, even those optimized for semantics, can be adapted to image generation tasks. Their method produces reconstructions that are superior to the standard SD-VAE without adding architectural complexity.
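The sketch below illustrates the RAE idea under simplified assumptions: a frozen patch-embedding "representation encoder" stands in for a pretrained model like DINO, and a small vision transformer decoder is trained to reconstruct pixels from its tokens. All class names and hyperparameters here are hypothetical placeholders, not the authors' implementation.

```python
import torch
import torch.nn as nn

class FrozenRepresentationEncoder(nn.Module):            # stand-in for DINO/MAE/CLIP
    def __init__(self, dim=768, patch=16):
        super().__init__()
        self.proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        for p in self.parameters():
            p.requires_grad = False                      # the encoder stays frozen

    def forward(self, x):
        # Return a sequence of high-dimensional patch tokens: (B, num_patches, dim).
        return self.proj(x).flatten(2).transpose(1, 2)

class ViTDecoder(nn.Module):                             # trained to map tokens back to pixels
    def __init__(self, dim=768, patch=16, img=256):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=4)
        self.to_pixels = nn.Linear(dim, 3 * patch * patch)
        self.patch, self.img = patch, img

    def forward(self, tokens):
        h = self.to_pixels(self.blocks(tokens))           # (B, N, 3 * patch * patch)
        b, n, _ = h.shape
        side = self.img // self.patch
        h = h.view(b, side, side, 3, self.patch, self.patch)
        return h.permute(0, 3, 1, 4, 2, 5).reshape(b, 3, self.img, self.img)

encoder, decoder = FrozenRepresentationEncoder(), ViTDecoder()
images = torch.randn(4, 3, 256, 256)
latents = encoder(images)                 # semantic, high-dimensional latent tokens
recon = decoder(latents)                  # only the decoder receives gradients
loss = ((recon - images) ** 2).mean()
loss.backward()
```

A diffusion transformer would then be trained in the space of these encoder tokens rather than in an SD-VAE latent space, which is the co-design Xie describes next.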
However, adopting this approach requires a shift in thinking. "RAE is not a simple plug-and-play autoencoder. The diffusion modeling part also needs to be developed," Xie explained. "An important point we want to highlight is that latent space modeling and generative modeling should be co-designed rather than treated separately."
With the right architectural adjustments, the researchers found that the high-dimensional representation is an advantage, offering better structure, faster convergence, and better generation quality. In their paper, the researchers note that these "high-dimensional latents effectively introduce no additional compute or memory overhead." The standard SD-VAE is, in fact, more computationally expensive, requiring about six times more compute for the encoder and three times more for the decoder compared to RAE.
Strong performance and efficiency
The new model architecture provides significant benefits in both training efficiency and generation quality. The team's diffusion transformer achieves strong results after only 80 training epochs. Compared to prior diffusion models trained on VAEs, the RAE-based model achieves a 47x training speedup. It also improves upon recent approaches based on representation alignment, with a 16x training speedup. This level of efficiency translates directly into lower training costs and faster model development cycles.
For enterprise use, this translates into more reliable and consistent results. Xie noted that RAE-based models are less prone to the semantic errors seen in classic diffusion models, adding that the RAE model provides "a much smarter lens on data." He observed that the field is moving toward what flagship models like GPT-4o and Google's Nano Banana offer: "topic-driven, highly consistent and knowledge-driven generation," and that RAE's semantically rich foundation is key to achieving this reliability at scale and in open-source models.
The researchers demonstrated this performance on the ImageNet benchmark. Using the Fréchet Inception Distance (FID) metric, where a lower score indicates higher-quality images, the RAE-based model achieved a state-of-the-art score of 1.51 without guidance. With AutoGuidance, a technique that uses a smaller model to guide the generation process, the FID score dropped to an even more impressive 1.13 for both 256×256 and 512×512 images.
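For context on the metric: FID fits a Gaussian to the feature statistics of real and generated images (the standard metric extracts those features with an Inception network) and measures the Fréchet distance between the two distributions. Below is a minimal sketch of that distance computation, assuming the feature vectors have already been extracted; it is illustrative, not the evaluation code used in the paper.

```python
import numpy as np
from scipy import linalg

def frechet_distance(feats_real: np.ndarray, feats_fake: np.ndarray) -> float:
    """Fréchet distance between Gaussians fitted to two sets of feature vectors."""
    mu1, mu2 = feats_real.mean(axis=0), feats_fake.mean(axis=0)
    sigma1 = np.cov(feats_real, rowvar=False)
    sigma2 = np.cov(feats_fake, rowvar=False)
    covmean = linalg.sqrtm(sigma1 @ sigma2).real          # matrix square root of the product
    return float(((mu1 - mu2) ** 2).sum() + np.trace(sigma1 + sigma2 - 2 * covmean))

# Toy usage with random "features"; lower scores mean the two distributions match better.
real = np.random.randn(1000, 64)
fake = np.random.randn(1000, 64) + 0.1
print(frechet_distance(real, fake))
```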
By successfully integrating advanced representation learning into a diffusion framework, this work opens a path to more efficient and cost-effective generative models. This convergence points to a future of more integrated AI systems.
"We believe that in the future, there will be a single, unified representation model that captures the rich, underlying structure of reality … able to be decoded in many different output modes," Xie said. He added that Roy offers a unique path towards this goal: "The high-dimensional latent space must be learned separately for a robust representation that can then be decoded in a variety of ways—rather than relying on a brute-force approach of combining all the data at once with multiple objectives."