Evalytic — Real Benchmark Results

Showcase 01 · Text2Img · 5 Models · 10 Prompts

Do I really need the flagship model?

flux-schnell delivers 96% of the winner's quality at 96% less cost. The 0.2 point gap costs 27× more per image to close.

5 models · 50 images · $2.06 total cost · 5m 36s duration · 100% success

Model Rankings

Rank	Model	Cost/img	Overall	Visual Quality	Prompt Adherence	Text Rendering	Score/$
Winner	ideogram-v3	$0.08	4.7	4.5	4.9	4.6	58
#2	flux-dev	$0.025	4.6	4.4	5.0	4.4	184
#3	flux-pro	$0.05	4.6	4.1	4.9	4.7	91
#4	recraft-v3	$0.04	4.5	4.8	4.7	4.0	112
Best Value	flux-schnell	$0.003	4.5	4.2	4.9	4.3	1,490

0.2 point gap. 27× price gap.

ideogram-v3 wins at 4.7 — but flux-schnell scores 4.5 at $0.003 per image. That's 1,490 points per dollar vs 58. For most production workloads, the cheapest model is good enough.

Cost efficiency ratio: 25.7× (flux-schnell vs ideogram-v3)

Cost Efficiency (Score per Dollar)

Model	Score/$
flux-schnell	1,490
flux-dev	184
recraft-v3	112
flux-pro	91
ideogram-v3	58
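The score-per-dollar figures can be reproduced approximately by dividing each overall score by its per-image price. The sketch below is a minimal illustration that uses the rounded scores shown in the rankings; the published values (for example 1,490 rather than 1,500 for flux-schnell) presumably come from unrounded scores, which is an assumption. The function and variable names are illustrative and not part of Evalytic.

```python
# Overall score and per-image price (USD), copied from the rankings.
# Scores are the rounded values shown on the page, so results differ
# slightly from the published Score/$ column.
MODELS = {
    "ideogram-v3": {"score": 4.7, "price": 0.08},
    "flux-dev": {"score": 4.6, "price": 0.025},
    "flux-pro": {"score": 4.6, "price": 0.05},
    "recraft-v3": {"score": 4.5, "price": 0.04},
    "flux-schnell": {"score": 4.5, "price": 0.003},
}


def score_per_dollar(score: float, price: float) -> float:
    """Quality points bought per dollar spent on one image."""
    return score / price


efficiency = {
    name: score_per_dollar(m["score"], m["price"]) for name, m in MODELS.items()
}

winner, cheapest = MODELS["ideogram-v3"], MODELS["flux-schnell"]
quality_retained = cheapest["score"] / winner["score"]  # share of winner's score
cost_saved = 1 - cheapest["price"] / winner["price"]    # fraction of price avoided
price_multiple = winner["price"] / cheapest["price"]    # how many times pricier

for name, value in sorted(efficiency.items(), key=lambda kv: -kv[1]):
    print(f"{name:<13} {value:>7.0f} points per dollar")
print(f"quality retained: {quality_retained:.0%}")
print(f"cost saved:       {cost_saved:.0%}")
print(f"price multiple:   {price_multiple:.0f}x")
```

The same three ratios (quality retained, cost saved, price multiple) are what the headline claim rests on, so swapping in your own scores and prices shows whether the trade-off holds for your workload.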

Dimension Breakdown

Visual Quality ★ differentiator

Model	Score
recraft-v3	4.8
ideogram-v3	4.5
flux-dev	4.4
flux-schnell	4.2
flux-pro	4.1

Prompt Adherence ≈ near ceiling

Model	Score
flux-dev	5.0
ideogram-v3	4.9
flux-pro	4.9
flux-schnell	4.9
recraft-v3	4.7

Text Rendering ★ differentiator

Model	Score
flux-pro	4.7
ideogram-v3	4.6
flux-dev	4.4
flux-schnell	4.3
recraft-v3	4.0

Gallery

"A neon sign reading 'OPEN 24/7' in a foggy downtown street at 2am"

Model	Score
flux-schnell	3.0 / 5.0
flux-dev	3.0 / 5.0
flux-pro	4.0 / 5.0
recraft-v3	2.0 / 5.0
ideogram-v3	4.0 / 5.0

"White sneaker on a marble countertop, soft shadows, product photography"

When prompts are straightforward, quality differences vanish — all 5 models hit 5.0. Differentiation happens on harder prompts like the neon sign above.

Model	Score
flux-schnell	5.0 / 5.0
flux-dev	5.0 / 5.0
flux-pro	5.0 / 5.0
recraft-v3	5.0 / 5.0
ideogram-v3	5.0 / 5.0


Showcase 02 · Img2Img · 5 Models · 9 Inputs · Face Metric

Why do users say "that's not me"?

ArcFace cosine similarity shows that 3 models preserve identity, 1 destroys faces (similarity 0.03), and 1 fails to generate any image.

Warning: flux-dev-i2i: face destroyed (similarity 0.029)
Error: firered-edit: generation failed (0/9 images)
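For readers unfamiliar with the face metric: cosine similarity compares the direction of two embedding vectors, giving values near 1 for the same identity and near 0 for unrelated faces. The following is a minimal sketch, assuming the embeddings have already been extracted by a face model such as ArcFace. The four-dimensional vectors are made-up toy values (real face embeddings are much longer, commonly 512 dimensions), and this is not Evalytic's implementation.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


# Hypothetical toy embeddings, for illustration only.
input_face = [0.9, 0.1, 0.3, 0.2]
preserved_face = [0.8, 0.2, 0.3, 0.1]    # points in nearly the same direction
destroyed_face = [-0.1, 0.9, -0.3, 0.2]  # nearly orthogonal to the input

same_identity = cosine_similarity(input_face, preserved_face)
lost_identity = cosine_similarity(input_face, destroyed_face)
print(f"preserved: {same_identity:.3f}")
print(f"destroyed: {lost_identity:.3f}")
```

Because the metric is deterministic, the same pair of images always yields the same score, which is what makes it a useful cross-check on a VLM judge.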

5 models · 45 images · $1.36 total cost · 21m 16s duration · Face metric: ArcFace

Model Rankings

Rank	Model	Cost/img	Overall	Face Similarity	Visual Quality	Input Fidelity	Identity Preservation
Winner, Best Value	seedream-edit	$0.03	4.9	0.939	4.9	5.0	4.8
#2	flux-kontext	$0.05	4.7	0.877	5.0	4.7	4.9
#3	reve-edit	$0.04	4.4	0.910	4.6	4.2	4.3
Face Destroyed	flux-dev-i2i	$0.025	2.3	0.029	4.9	1.0	1.0
Failed	firered-edit	$0.03	0.0	N/A	-	-	-

firered-edit: generation failed on all 9 inputs.


VLM + ArcFace agreement: r = 0.99

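The VLM judge's identity_preservation scores and ArcFace cosine similarity correlate near-perfectly. Two independent methods, a vision-language model and a deterministic face embedding, agree on which models preserve identity and which destroy it. Among the three models that preserve identity, the two methods order them differently: flux-kontext leads on the VLM score (4.9), seedream-edit on ArcFace (0.939).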
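As a rough sanity check, a Pearson correlation can be computed from the per-model averages in the rankings. This sketch assumes model-level aggregates for the four models that produced images; the published r = 0.99 may have been computed per image, which the page does not state. With only four points, one of them an extreme outlier (flux-dev-i2i), the coefficient mainly reflects the gap between that model and the rest.

```python
import math

# Per-model averages copied from the rankings.
# firered-edit is omitted because it produced no images.
vlm_identity = {
    "seedream-edit": 4.8,
    "flux-kontext": 4.9,
    "reve-edit": 4.3,
    "flux-dev-i2i": 1.0,
}
arcface_similarity = {
    "seedream-edit": 0.939,
    "flux-kontext": 0.877,
    "reve-edit": 0.910,
    "flux-dev-i2i": 0.029,
}


def pearson_r(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


names = list(vlm_identity)
r = pearson_r(
    [vlm_identity[n] for n in names],
    [arcface_similarity[n] for n in names],
)
print(f"r = {r:.2f}")
```

On these aggregates the result lands close to the published figure, but a per-image calculation on your own run is the more meaningful check.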

Gallery — "Place this person in a professional office with a city view behind them"

Image	Result
Input	Original
seedream-edit	4.9 · face 0.939
flux-kontext	4.7 · face 0.877
reve-edit	4.4 · face 0.910
flux-dev-i2i	2.3 · face 0.029
firered-edit	Generation failed (0/9 images)


Showcase 03 · Text2Img · 1 Model · 5 Prompt Pairs

Which prompt actually performs better?

"More detail = better results" is a myth. 2 out of 5 stuffed prompts actually scored worse than their minimal versions. Test your prompts, don't guess.

10 images · $0.04 total cost · 72s duration · Model: flux-schnell

Simple vs Stuffed Prompt Comparison

Subject	Simple	Stuffed	Delta	Verdict
Cat on windowsill	4.7	4.7	0.0	No change
Coffee shop	3.0	4.3	+1.3	Helped!
Neon sign	5.0	5.0	0.0	No change
Mountain landscape	5.0	4.7	-0.3	Hurt
Robot reading	3.3	3.0	-0.3	Hurt
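The Delta and Verdict columns follow directly from the two score columns. A minimal sketch of that arithmetic, using the scores from the table; the helper name and the zero threshold for "No change" are assumptions for illustration, not Evalytic's own logic.

```python
# (subject, simple score, stuffed score) copied from the table.
RESULTS = [
    ("Cat on windowsill", 4.7, 4.7),
    ("Coffee shop", 3.0, 4.3),
    ("Neon sign", 5.0, 5.0),
    ("Mountain landscape", 5.0, 4.7),
    ("Robot reading", 3.3, 3.0),
]


def verdict(delta: float) -> str:
    """Label a score change; any non-zero delta counts as a change."""
    if delta > 0:
        return "Helped"
    if delta < 0:
        return "Hurt"
    return "No change"


rows = []
for subject, simple, stuffed in RESULTS:
    delta = round(stuffed - simple, 1)  # round away floating-point noise
    rows.append((subject, delta, verdict(delta)))
    print(f"{subject:<20} {delta:+.1f}  {verdict(delta)}")

hurt = sum(1 for _, _, label in rows if label == "Hurt")
print(f"{hurt} of {len(rows)} stuffed prompts scored worse")
```

With five prompt pairs and deltas of 0.3, small differences may sit within judge noise, so a larger prompt set is worth running before drawing firm conclusions.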

Gallery — Simple vs Stuffed

Showing the biggest improvement and one prompt where extra detail hurt.

Coffee Shop — biggest improvement (+1.3)

Simple: "Coffee shop interior, morning light" (3.0 / 5.0)

Stuffed: "Hyperrealistic coffee shop interior with exposed brick walls, reclaimed wood tables, barista making pour-over coffee, golden morning light streaming through floor-to-ceiling windows, steam rising, warm tones, cinematic composition, shot on Hasselblad" (4.3 / 5.0, +1.3)

Mountain Landscape — extra detail hurt (-0.3)

Simple: "Mountain landscape with a lake" (5.0 / 5.0)

Stuffed: "Epic panoramic mountain landscape with crystal clear alpine lake reflecting snow-capped peaks, wildflowers in foreground, dramatic cumulus clouds, golden hour lighting, National Geographic quality, medium format film look" (4.7 / 5.0, -0.3)


Showcase 04 · Img2Img · 4 Models · 9 Inputs

Is my product photo still my product?

AI edits warp shapes, lose logos, and change colors. One model fails on 7 of 9 edits and another on 3 of 9. Input fidelity is the key differentiator: measure, don't assume.

Error: firered-edit: 7/9 failures
Warning: seedream-edit: 3/9 failures

4 models · 36 images · $1.10 total cost · 17m 12s duration

Model Rankings

Rank	Model	Cost/img	Overall	Success	Visual Quality	Input Fidelity	Transformation Quality	Artifact Detection
Winner	flux-kontext	$0.05	4.8	9/9	5.0	4.6	5.0	4.8
#2, Best Value	reve-edit	$0.04	4.5	9/9	5.0	3.6	4.8	4.8
#3	seedream-edit	$0.03	3.1	6/9	3.2	2.7	3.3	3.3
Mostly Failed	firered-edit	$0.03	1.0	2/9	1.1	0.8	1.1	1.1

Dimension Breakdown

Input Fidelity ★ key differentiator

Model	Score
flux-kontext	4.6
reve-edit	3.6
seedream-edit	2.7
firered-edit	0.8

Visual Quality ≈ ceiling for top models

Model	Score
flux-kontext	5.0
reve-edit	5.0
seedream-edit	3.2
firered-edit	1.1

Gallery — "Place this product on a marble kitchen countertop with morning light"

Image	Result
Input	Product photo
flux-kontext	4.8 / 5.0
reve-edit	4.5 / 5.0
seedream-edit	3.1 / 5.0
firered-edit	1.0 / 5.0


Run your own benchmark.

One command. Real scores. Your models, your prompts, your data.

# Install & setup
pip install evalytic
evaly init

# Run your first benchmark
evaly bench -y

# Or compare specific models
evaly bench -m flux-schnell -m flux-pro \
  -p "A cat on a windowsill" --review

GitHub · Documentation