evalytic

About these results: Rankings reflect a single benchmark run with default parameters. Model performance varies with prompts, settings, and API versions. These are not absolute rankings — run evaly bench on your own use case for representative results.

Showcase 01 · Text2Img · 5 Models · 10 Prompts

Do I really need the flagship model?

flux-schnell delivers 96% of the winner's quality at 96% less cost. The 0.2 point gap costs 27× more per image to close.

5 models · 50 images · $2.06 total cost · 5m 36s duration · 100% success

Model Rankings

Rank        Model         $/img   Score  Visual Quality  Prompt Adherence  Text Rendering  Score/$
Winner      ideogram-v3   $0.08    4.7   4.5             4.9               4.6                  58
#2          flux-dev      $0.025   4.6   4.4             5.0               4.4                 184
#3          flux-pro      $0.05    4.6   4.1             4.9               4.7                  91
#4          recraft-v3    $0.04    4.5   4.8             4.7               4.0                 112
Best Value  flux-schnell  $0.003   4.5   4.2             4.9               4.3               1,490
0.2 point gap. 27× price gap.

ideogram-v3 wins at 4.7 — but flux-schnell scores 4.5 at $0.003 per image. That's 1,490 points per dollar vs 58. For most production workloads, the cheapest model is good enough.

Cost efficiency ratio: 25.7× (flux-schnell vs ideogram-v3)

Cost Efficiency (Score per Dollar)

flux-schnell  1,490
flux-dev        184
recraft-v3      112
flux-pro         91
ideogram-v3      58
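
Score per dollar is simple arithmetic: overall score divided by price per image. A minimal Python sketch reproducing the chart above (scores here are rounded to one decimal, so a few outputs differ slightly from the report, which uses unrounded values):

# Score per dollar = overall score / price per image.
results = {
    "ideogram-v3":  (4.7, 0.08),
    "flux-dev":     (4.6, 0.025),
    "flux-pro":     (4.6, 0.05),
    "recraft-v3":   (4.5, 0.04),
    "flux-schnell": (4.5, 0.003),
}
for model, (score, price) in sorted(results.items(), key=lambda kv: -kv[1][0] / kv[1][1]):
    print(f"{model:<13} {score / price:>7,.0f} pts/$")
# Cost-efficiency ratio quoted above (schnell vs ideogram):
print(f"{(4.5 / 0.003) / (4.7 / 0.08):.1f}x")  # ~25.5x with rounded scores; the report's 25.7x uses unrounded values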

Dimension Breakdown

Visual Quality ★ differentiator
recraft-v3 4.8 · ideogram-v3 4.5 · flux-dev 4.4 · flux-schnell 4.2 · flux-pro 4.1

Prompt Adherence ≈ near ceiling
flux-dev 5.0 · ideogram-v3 4.9 · flux-pro 4.9 · flux-schnell 4.9 · recraft-v3 4.7

Text Rendering ★ differentiator
flux-pro 4.7 · ideogram-v3 4.6 · flux-dev 4.4 · flux-schnell 4.3 · recraft-v3 4.0

Gallery

"A neon sign reading 'OPEN 24/7' in a foggy downtown street at 2am"
flux-schnell
flux-schnell
3.0 / 5.0
flux-dev
flux-dev
3.0 / 5.0
flux-pro
flux-pro
4.0 / 5.0
recraft-v3
recraft-v3
2.0 / 5.0
ideogram-v3
ideogram-v3
4.0 / 5.0
"White sneaker on a marble countertop, soft shadows, product photography"

When prompts are straightforward, quality differences vanish — all 5 models hit 5.0. Differentiation happens on harder prompts like the neon sign above.

flux-schnell
flux-schnell
5.0 / 5.0
flux-dev
flux-dev
5.0 / 5.0
flux-pro
flux-pro
5.0 / 5.0
recraft-v3
recraft-v3
5.0 / 5.0
ideogram-v3
ideogram-v3
5.0 / 5.0

Showcase 02 · Img2Img · 5 Models · 9 Inputs · Face Metric

Why do users say "that's not me"?

ArcFace cosine similarity confirms what users feel: 3 models preserve identity, 1 destroys faces (similarity 0.029), and 1 fails completely.

Warning · flux-dev-i2i: face destroyed (similarity 0.029)
Error · firered-edit: generation failed (0/9 images)
5 models · 45 images · $1.36 total cost · 21m 16s duration · ArcFace face metric

Model Rankings

Rank                 Model          $/img   Score  Face   Visual Quality  Input Fidelity  Identity Pres.
Winner · Best Value  seedream-edit  $0.03    4.9   0.939  4.9             5.0             4.8
#2                   flux-kontext   $0.05    4.7   0.877  5.0             4.7             4.9
#3                   reve-edit      $0.04    4.4   0.910  4.6             4.2             4.3
Face Destroyed       flux-dev-i2i   $0.025   2.3   0.029  4.9             1.0             1.0
Failed               firered-edit   $0.03    0.0   N/A    generation failed on all 9 inputs
VLM + ArcFace agreement: r = 0.99

The VLM judge's identity_preservation scores and ArcFace cosine similarity correlate near-perfectly. Two independent methods — a vision-language model and a deterministic face embedding — confirm the same ranking.
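
A minimal sketch of that agreement check, using numpy and the per-model means from the ranking above for illustration (the report correlates per-image scores across all 9 inputs; the cosine helper just shows how a similarity is derived from two face embeddings, however they are produced):

import numpy as np

def cosine(a, b):
    # Cosine similarity between two face embedding vectors.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Per-model means from the ranking above (illustrative only).
vlm_identity = np.array([4.8, 4.9, 4.3, 1.0])          # seedream, kontext, reve, flux-dev-i2i
arcface_sim  = np.array([0.939, 0.877, 0.910, 0.029])
print(f"r = {np.corrcoef(vlm_identity, arcface_sim)[0, 1]:.2f}")  # r = 0.99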

Gallery — "Place this person in a professional office with a city view behind them"

Input: original photo
seedream-edit  4.9 · face 0.939
flux-kontext   4.7 · face 0.877
reve-edit      4.4 · face 0.910
flux-dev-i2i   2.3 · face 0.029
firered-edit   generation failed (0/9 images)

Showcase 03 · Text2Img · 1 Model · 5 Prompt Pairs

Which prompt actually performs better?

"More detail = better results" is a myth. 2 out of 5 stuffed prompts actually scored worse than their minimal versions. Test your prompts, don't guess.

10 images · $0.04 total cost · 72s duration · model: flux-schnell

Simple vs Stuffed Prompt Comparison

Subject             Simple  Stuffed  Delta  Verdict
Cat on windowsill   4.7     4.7       0.0   No change
Coffee shop         3.0     4.3      +1.3   Helped!
Neon sign           5.0     5.0       0.0   No change
Mountain landscape  5.0     4.7      -0.3   Hurt
Robot reading       3.3     3.0      -0.3   Hurt
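
Comparisons like this are easy to script once per-prompt scores are in hand. A minimal sketch with the numbers above (the verdict thresholds are an assumption for illustration, not evalytic's logic):

simple_vs_stuffed = {
    "Cat on windowsill":  (4.7, 4.7),
    "Coffee shop":        (3.0, 4.3),
    "Neon sign":          (5.0, 5.0),
    "Mountain landscape": (5.0, 4.7),
    "Robot reading":      (3.3, 3.0),
}
for subject, (simple, stuffed) in simple_vs_stuffed.items():
    delta = stuffed - simple
    verdict = "Helped" if delta > 0 else "Hurt" if delta < 0 else "No change"
    print(f"{subject:<20} {simple:.1f} -> {stuffed:.1f} ({delta:+.1f})  {verdict}")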

Gallery — Simple vs Stuffed

Showing the biggest improvement and one prompt where extra detail hurt.

Coffee Shop — biggest improvement (+1.3)
Simple (3.0 / 5.0): "Coffee shop interior, morning light"
Stuffed (4.3, +1.3): "Hyperrealistic coffee shop interior with exposed brick walls, reclaimed wood tables, barista making pour-over coffee, golden morning light streaming through floor-to-ceiling windows, steam rising, warm tones, cinematic composition, shot on Hasselblad"

Mountain Landscape — extra detail hurt (-0.3)
Simple (5.0 / 5.0): "Mountain landscape with a lake"
Stuffed (4.7, -0.3): "Epic panoramic mountain landscape with crystal clear alpine lake reflecting snow-capped peaks, wildflowers in foreground, dramatic cumulus clouds, golden hour lighting, National Geographic quality, medium format film look"

Showcase 04 · Img2Img · 4 Models · 9 Inputs

Is my product photo still my product?

AI edits warp shapes, lose logos, change colors. One model fails on 7 of 9 edits, another has mixed results. Input fidelity is the key differentiator — measure, don't assume.

Error · firered-edit: 7/9 failures
Warning · seedream-edit: 3/9 failures
4 models · 36 images · $1.10 total cost · 17m 12s duration

Model Rankings

Rank             Model          $/img  Score  Success  Visual Quality  Input Fidelity  Transform. Quality  Artifact Detect.
Winner           flux-kontext   $0.05   4.8   9/9      5.0             4.6             5.0                 4.8
#2 · Best Value  reve-edit      $0.04   4.5   9/9      5.0             3.6             4.8                 4.8
#3               seedream-edit  $0.03   3.1   6/9      3.2             2.7             3.3                 3.3
Mostly Failed    firered-edit   $0.03   1.0   2/9      1.1             0.8             1.1                 1.1

Dimension Breakdown

Input Fidelity ★ key differentiator
flux-kontext 4.6 · reve-edit 3.6 · seedream-edit 2.7 · firered-edit 0.8

Visual Quality ≈ ceiling for top models
flux-kontext 5.0 · reve-edit 5.0 · seedream-edit 3.2 · firered-edit 1.1
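
The fidelity scores above come from the VLM judge. If you want a deterministic cross-check of your own, a structural-similarity pass between input and edit is one rough option (a sketch assuming scikit-image and Pillow, not part of evalytic; SSIM also penalizes intended changes such as a new background, so treat it as a sanity check, not a score):

import numpy as np
from PIL import Image
from skimage.metrics import structural_similarity as ssim

def input_fidelity(input_path, edit_path):
    # Resize the edit to the input's size, then compare structure.
    a = np.asarray(Image.open(input_path).convert("RGB"))
    b = np.asarray(Image.open(edit_path).convert("RGB").resize((a.shape[1], a.shape[0])))
    return ssim(a, b, channel_axis=-1)  # 1.0 = structurally identical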

Gallery — "Place this product on a marble kitchen countertop with morning light"

Input: product photo
flux-kontext   4.8 / 5.0
reve-edit      4.5 / 5.0
seedream-edit  3.1 / 5.0
firered-edit   1.0 / 5.0

Run your own benchmark.

One command. Real scores. Your models, your prompts, your data.

# Install & setup
pip install evalytic
evaly init
# Run your first benchmark
evaly bench -y
# Or compare specific models
evaly bench -m flux-schnell -m flux-pro \
-p "A cat on a windowsill" --review