evalytic

Evals for visual AI

Judge your AI output before users do.

One command. Seven dimensions. Real scores.

Prompt

"A neon sign reading 'MIDNIGHT CAFE' above a door in a rainy Tokyo alley at night"

Model          Visual  Prompt  Text  Time  Cost
flux-dev        3.0     4.0    3.0   2.6s  $0.025
flux-schnell    3.0     5.0    5.0   1.2s  $0.003   ← Winner (8x cheaper)
sdxl            3.0     3.0    1.0   2.6s  $0.010
$ pip install evalytic

What is Evalytic?

Quality evaluation for AI-generated visuals.

VLM Judges

Gemini, GPT-4o, or Claude evaluates your images the way a human would.

7 Dimensions

Visual quality, prompt adherence, text rendering, and four more.

Model Comparison

Side-by-side benchmarks across 40+ fal.ai models.

One Command

Generate, score, and report. No backend needed.

Use Cases

Same tool, five questions answered.

Model Selection

Which model is best for my use case? Compare Flux Schnell vs Dev vs Pro across your actual prompts.

Regression Detection

Did this model update break anything? Per-item comparison catches problems that averages hide.

Prompt Optimization

Which prompt version is better? Replace "looks good to me" with 7-dimension data.

Quality Gate

Is this safe to ship? Threshold checks, dimension checks, confidence checks. Exit 0 or 1.
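The gate logic described above can be sketched in a few lines of Python. The report schema and threshold values here are illustrative assumptions, not evalytic's actual format or defaults:

```python
# Illustrative thresholds -- not evalytic's real defaults.
MIN_OVERALL, MIN_DIMENSION, MIN_CONFIDENCE = 4.0, 3.0, 0.8

def gate(report: dict) -> int:
    """Return a CI-style exit code: 0 = safe to ship, 1 = blocked.

    Assumes a hypothetical report shape like:
    {"overall": 4.1, "dimensions": {"visual": 4.0, "text": 3.2}, "confidence": 0.9}
    """
    if report["overall"] < MIN_OVERALL:
        return 1  # threshold check failed
    if any(s < MIN_DIMENSION for s in report["dimensions"].values()):
        return 1  # a single dimension dragged the result down
    if report.get("confidence", 1.0) < MIN_CONFIDENCE:
        return 1  # judges were not confident enough to trust the score
    return 0
```

Returning 0 or 1 lets a CI pipeline fail the build directly on the result.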

Production Monitoring

Is quality stable? Scheduled benchmarks with alerting via JSON output and cron jobs.

Text-to-Image & Image-to-Image

Both pipelines supported. Text-to-image generation quality plus image transformations: background removal, style transfer, product photo editing.

Features

Everything you need to evaluate visual AI.

Multi-Judge Consensus

2-3 VLM judges for reliable scores. Adaptive 2+1 algorithm with dispute detection.
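One way an adaptive 2+1 scheme can work: score with two judges, and only pay for a third when the first two dispute. This sketch is an assumption about the approach, not evalytic's implementation; the function names are hypothetical:

```python
from statistics import median

def consensus_score(judge_a: float, judge_b: float,
                    call_third_judge, dispute_threshold: float = 1.0) -> float:
    """Adaptive 2+1 consensus (sketch).

    If the first two judges agree within the threshold, average them.
    Otherwise a dispute is detected: invoke the third judge (passed as a
    callable so the extra VLM call stays lazy) and take the median of three.
    """
    if abs(judge_a - judge_b) <= dispute_threshold:
        return (judge_a + judge_b) / 2
    judge_c = call_third_judge()
    return median([judge_a, judge_b, judge_c])
```

The median makes the tie-break robust: one outlier judge cannot drag the score.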

Local Metrics

CLIP score, LPIPS similarity, ArcFace identity matching. Deterministic, runs on your machine.
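At its core, a CLIP score is just the cosine similarity between an image embedding and a text embedding produced by a CLIP model. A minimal numpy sketch of that final step (the embeddings themselves would come from a CLIP model; the 0-100 scaling is the common convention, and this is not evalytic's actual code):

```python
import numpy as np

def clip_score(image_emb: np.ndarray, text_emb: np.ndarray) -> float:
    """Cosine similarity of L2-normalized embeddings, scaled to 0-100.

    Negative similarities are clamped to 0, following common practice.
    """
    image_emb = image_emb / np.linalg.norm(image_emb)
    text_emb = text_emb / np.linalg.norm(text_emb)
    return float(100.0 * max(np.dot(image_emb, text_emb), 0.0))
```

Because it is pure arithmetic over fixed embeddings, the metric is deterministic: the same image and prompt always give the same score.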

Rich Reports

Terminal tables, HTML with image grids, JSON for automation. Interactive browser review.

Cost Tracking

Per-image and total costs for every model. Know the quality-cost tradeoff.

40+ Models

Built-in registry for fal.ai models. Flux, SDXL, Ideogram, Recraft, and more.

Open Source

MIT licensed. Use with your own API keys. No vendor lock-in.

See It In Action

One command, full picture.

Terminal
$ evalytic bench -m flux-schnell -m flux-dev -m flux-pro -p prompts.json

Evalytic Bench v0.2
3 models x 5 prompts = 15 images

Generating  ████████████████████ 100%
Scoring     ████████████████████ 100%

Model         Visual  Prompt  Overall  Cost
flux-schnell   4.2     3.6     4.0     $0.02
flux-dev       4.5     4.1     4.3     $0.13
flux-pro       4.8     4.6     4.7     $0.25

Winner: flux-pro (4.7/5) · Total: $0.40 · 34s
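A prompt file for a run like the one above could be built like this. The schema (a list of objects with `id` and `prompt`) is an assumption for illustration; check the evalytic docs for the real format:

```python
import json

# Hypothetical prompts.json schema -- the actual evalytic format may differ.
prompts = [
    {"id": "neon-sign",
     "prompt": "A neon sign reading 'MIDNIGHT CAFE' above a door "
               "in a rainy Tokyo alley at night"},
    {"id": "product-shot",
     "prompt": "A matte black espresso machine on a marble counter, "
               "soft studio lighting"},
]

with open("prompts.json", "w") as f:
    json.dump(prompts, f, indent=2)
```

Keeping prompts in a versioned JSON file means the same set can be rerun after every model update for regression comparisons.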

Start scoring in five minutes.

Free, open source, no backend. Just your API keys and a terminal.

$ pip install evalytic