On May 8, HiDream open-sourced HiDream-O1-Image (codename: Peanut) on Hugging Face. It is an 8B-parameter image generation foundation model, released under the MIT license.
Two things make this project stand out.
Architecture: No Detours
Most mainstream image generation models today follow the diffusion + VAE route — compress pixels into latent space, generate there, then decode back to pixels. HiDream-O1-Image takes a more direct approach:
A single Pixel-level Unified Transformer, trained directly on raw pixels. No external VAE, no separate text encoder. Text and image are unified in a single shared token space.
This might sound like making a simple thing complicated, but it removes one source of error: VAE compression loss. A VAE's encode/decode round trip is lossy, whereas a model that learns at the pixel level reads and writes raw pixels with no intermediate compression step.
The trade-off, of course, is higher compute cost: processing pixels directly is more expensive than working with compressed latents. How efficiently an 8B model runs on this architecture remains to be seen, and community benchmarks should settle it.
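To make the trade-off concrete, here is a rough token-count sketch in Python. The patch sizes and the 8x VAE downsampling factor are illustrative assumptions, not figures from the HiDream-O1-Image technical report, and the helper names are hypothetical. The sketch relies only on the general fact that self-attention cost grows with the square of the sequence length.

```python
def num_tokens(height: int, width: int, patch: int, downsample: int = 1) -> int:
    """Image tokens after optional VAE downsampling followed by patchification."""
    stride = patch * downsample
    if height % stride or width % stride:
        raise ValueError("image size must be divisible by patch * downsample")
    return (height // stride) * (width // stride)


def relative_attention_cost(tokens_a: int, tokens_b: int) -> float:
    """Ratio of pairwise self-attention cost, which scales with tokens squared."""
    return (tokens_a / tokens_b) ** 2


def values_per_pixel_token(patch: int, channels: int = 3) -> int:
    """Raw values one pixel-space token has to represent (RGB by default)."""
    return patch * patch * channels


# Hypothetical settings, chosen only for illustration.
latent_tokens = num_tokens(2048, 2048, patch=2, downsample=8)  # VAE route
pixel_tokens_p16 = num_tokens(2048, 2048, patch=16)            # pixel route, coarse patches
pixel_tokens_p8 = num_tokens(2048, 2048, patch=8)              # pixel route, finer patches

print("latent tokens:      ", latent_tokens)
print("pixel tokens (p=16):", pixel_tokens_p16)
print("pixel tokens (p=8): ", pixel_tokens_p8)
print("attention cost, p=8 vs latent:",
      relative_attention_cost(pixel_tokens_p8, latent_tokens))
print("values per pixel token (p=16):", values_per_pixel_token(16))
```

Under these assumed settings, a pixel-space model with 16×16 patches ends up with the same sequence length as the latent pipeline, but each token must represent 768 raw values. Halving the patch size to 8×8 quadruples the token count and raises the pairwise attention cost 16-fold. Where the real model sits on this curve depends on design choices the article does not describe.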
Capabilities: More Than Text-to-Image
HiDream-O1-Image's ambitions go beyond text-to-image. It packs multiple capabilities into a single model:
- Text-to-image generation, up to 2048×2048 resolution
- Long text rendering and layout — accurately renders multi-region, multilingual text within generated images
- Instruction-based image editing
- Subject-driven personalization (preserving identity/IP across new scenes)
- Storyboard generation
There's also a built-in Reasoning-Driven Prompt Agent — before generating, the model "thinks" first to resolve implicit knowledge, layout, and text rendering issues in the prompt. It's like GPT's thinking mode applied to the image generation pipeline.
Results
On the Artificial Analysis Text to Image Arena, HiDream-O1-Image ranks 8th (as of May 5, 2026). Among open-weights models, this is currently the best result.
Two days after open-sourcing, it has 124 likes and 1.2K followers on Hugging Face. The technical report was released simultaneously.
Should You Try It
If you work in image generation, this project deserves 30 minutes of your time:
- MIT license, no commercial restrictions
- The pixel-level architecture is a technically distinct path from the mainstream diffusion + VAE route
- Long text rendering capability is rare among open-source models
- The HiDream team has a solid track record in image generation
But set expectations right: direct pixel generation costs more compute than VAE-based approaches. Running 2048×2048 on consumer GPUs may require patience.
A distilled Dev variant is also available. If you don't need peak quality, the Dev version should be cheaper to run.
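As a back-of-the-envelope check on the consumer-GPU question, the sketch below estimates the memory needed for the weights alone. It assumes exactly 8 billion parameters (the article only says "8B") and standard byte widths per dtype. Whether the released checkpoints support each precision is not stated here, and activations, which grow with resolution, are not counted.

```python
# Bytes needed to store one parameter at common precisions.
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "fp16": 2, "int8": 1}


def weight_memory_gib(num_params: float, dtype: str) -> float:
    """Memory for model weights only, in GiB (excludes activations and caches)."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


ASSUMED_PARAMS = 8e9  # assumption: "8B" taken as exactly 8 billion

for dtype in ("fp32", "bf16", "int8"):
    print(f"{dtype}: {weight_memory_gib(ASSUMED_PARAMS, dtype):.1f} GiB")
```

At 16-bit precision the weights alone come to roughly 15 GiB under this assumption, which already fills most of a 16 GB card before any activations are allocated.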
Main sources: