About these results:
Rankings reflect a single benchmark run with default parameters. Model performance varies with prompts, settings, and API versions.
These are not absolute rankings — run evalytic bench on your own use case for representative results.
Showcase 01 · Text2Img · 5 Models · 10 Prompts
Do I really need the flagship model?
flux-schnell delivers 96% of the winner's quality for 96% less. Closing the 0.2-point gap costs 27× more per image.
Model Rankings
ideogram-v3 wins at 4.7, but flux-schnell scores 4.5 at $0.003 per image. That's 1,490 points per dollar versus 58 for ideogram-v3. For most production workloads, the cheapest model is good enough.
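The cost-efficiency figures above are simple ratios. A minimal sketch of the arithmetic, assuming the ideogram-v3 price is 27× the flux-schnell price (the page does not state that price directly, so it is inferred here), and using the rounded scores shown on the page:

```python
# Scores and the flux-schnell price are taken from the text above.
# ASSUMPTION: the ideogram-v3 price is not stated on the page; it is
# inferred from the "27x more per image" claim.
FLUX_PRICE = 0.003
models = {
    "flux-schnell": {"score": 4.5, "price": FLUX_PRICE},
    "ideogram-v3": {"score": 4.7, "price": FLUX_PRICE * 27},
}


def points_per_dollar(score: float, price: float) -> float:
    """Quality score obtained per dollar spent on one image."""
    return score / price


def quality_ratio(score: float, best_score: float) -> float:
    """Fraction of the winner's score that a model reaches."""
    return score / best_score


def cost_saving(price: float, best_price: float) -> float:
    """Fraction of the winner's per-image cost that is saved."""
    return 1 - price / best_price


winner = models["ideogram-v3"]
for name, m in models.items():
    print(
        f"{name}: {points_per_dollar(m['score'], m['price']):.0f} points per dollar, "
        f"{quality_ratio(m['score'], winner['score']):.0%} of winner quality, "
        f"{cost_saving(m['price'], winner['price']):.0%} cheaper"
    )
```

With the rounded score of 4.5 this gives roughly 1,500 points per dollar for flux-schnell rather than 1,490; the page's figure presumably comes from unrounded scores.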
Cost Efficiency (Score per Dollar)
Dimension Breakdown
Gallery
When prompts are straightforward, quality differences vanish — all 5 models hit 5.0. Differentiation happens on harder prompts like the neon sign above.
Showcase 02 · Img2Img · 5 Models · 9 Inputs · Face Metric
Why do users say "that's not me"?
ArcFace cosine similarity confirms the split across the 5 models: 3 preserve identity, 1 destroys faces (similarity 0.03), and 1 fails completely.
Model Rankings
The VLM judge's identity_preservation scores and ArcFace cosine similarity correlate near-perfectly. Two independent methods, a vision-language model and a deterministic face embedding, agree on the ranking.
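For readers unfamiliar with the face metric, here is a minimal sketch of cosine similarity between two embedding vectors. The 4-dimensional vectors are hypothetical stand-ins; a real pipeline would obtain much higher-dimensional embeddings from an ArcFace model, and that step is not shown here.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction,
    0.0 means orthogonal (no similarity)."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Hypothetical toy embeddings, for illustration only.
input_face = [0.9, 0.1, 0.3, 0.2]
edited_face_similar = [0.8, 0.2, 0.3, 0.1]     # points in a similar direction
edited_face_different = [-0.1, 0.9, -0.3, 0.2]  # points elsewhere

print(cosine_similarity(input_face, edited_face_similar))
print(cosine_similarity(input_face, edited_face_different))
```

A value near 1 suggests the edited face still matches the input identity, while a value near 0, such as the 0.03 reported above, suggests the identity was lost.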
Gallery — "Place this person in a professional office with a city view behind them"
Showcase 03 · Text2Img · 1 Model · 5 Prompt Pairs
Which prompt actually performs better?
"More detail = better results" is a myth. 2 of 5 stuffed prompts scored worse than their simple versions, 2 showed no change, and only 1 improved. Test your prompts, don't guess.
Simple vs Stuffed Prompt Comparison
| Subject | Simple | Stuffed | Delta | Verdict |
|---|---|---|---|---|
| Cat on windowsill | 4.7 | 4.7 | 0.0 | No change |
| Coffee shop | 3.0 | 4.3 | +1.3 | Helped! |
| Neon sign | 5.0 | 5.0 | 0.0 | No change |
| Mountain landscape | 5.0 | 4.7 | -0.3 | Hurt |
| Robot reading | 3.3 | 3.0 | -0.3 | Hurt |
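The Delta and Verdict columns can be reproduced from the two score columns. A small sketch using the values from the table; the `compare` helper is illustrative and not part of any tool's API:

```python
# (subject, simple score, stuffed score), copied from the table above.
rows = [
    ("Cat on windowsill", 4.7, 4.7),
    ("Coffee shop", 3.0, 4.3),
    ("Neon sign", 5.0, 5.0),
    ("Mountain landscape", 5.0, 4.7),
    ("Robot reading", 3.3, 3.0),
]


def compare(simple: float, stuffed: float) -> tuple[float, str]:
    """Return (delta, verdict) for one prompt pair.

    The delta is rounded to one decimal to avoid floating-point noise
    such as 1.2999999999999998.
    """
    delta = round(stuffed - simple, 1)
    if delta > 0:
        verdict = "Helped!"
    elif delta < 0:
        verdict = "Hurt"
    else:
        verdict = "No change"
    return delta, verdict


results = [(subject, *compare(simple, stuffed)) for subject, simple, stuffed in rows]
for subject, delta, verdict in results:
    print(f"{subject}: {delta:+.1f} ({verdict})")

hurt_count = sum(1 for _, _, verdict in results if verdict == "Hurt")
print(f"{hurt_count} of {len(rows)} stuffed prompts scored worse")
```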
Gallery — Simple vs Stuffed
Showing the biggest improvement and one prompt where extra detail hurt.
Showcase 04 · Img2Img · 4 Models · 9 Inputs
Is my product photo still my product?
AI edits warp shapes, lose logos, and change colors. One model fails on 7 of 9 edits, and another has mixed results. Input fidelity is the key differentiator: measure, don't assume.
Model Rankings
Dimension Breakdown
Gallery — "Place this product on a marble kitchen countertop with morning light"
Run your own benchmark.
One command. Real scores. Your models, your prompts, your data.