evalytic

Evals for visual AI

Judge your AI output before users do.

One command. Seven dimensions. Real scores.

Prompt

"A neon sign reading 'MIDNIGHT CAFE' above a door in a rainy Tokyo alley at night"

Model          Visual  Prompt  Text  Time  Cost
flux-dev        3.0     4.0    3.0   2.6s  $0.025
flux-schnell    3.0     5.0    5.0   1.2s  $0.003   ← Winner (8x cheaper)
sdxl            3.0     3.0    1.0   2.6s  $0.010
$ pip install evalytic

What is Evalytic?

Quality evaluation for AI-generated visuals.

VLM Judges

Gemini, GPT-4o, or Claude evaluates your images the way a human would.

7 Dimensions

Visual quality, prompt adherence, text rendering, and four more.

Model Comparison

Side-by-side benchmarks across 40+ fal.ai models.

One Command

Generate, score, and report. No backend needed.

Use Cases

Same tool, five questions answered.

Model Selection

Which model is best for my use case? Compare Flux Schnell vs Dev vs Pro across your actual prompts.

Regression Detection

Did this model update break anything? Per-item comparison catches problems that averages hide.

Prompt Optimization

Which prompt version is better? Replace "looks good to me" with 7-dimension data.

Quality Gate

Is this safe to ship? Threshold checks, dimension checks, confidence checks. Exit 0 or 1.
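The gate logic described above can be sketched in a few lines of Python. The report schema and threshold values here are illustrative assumptions, not evalytic's actual format or defaults:

```python
# Illustrative thresholds -- not evalytic's real defaults.
MIN_OVERALL, MIN_DIMENSION, MIN_CONFIDENCE = 4.0, 3.0, 0.8

def gate(report: dict) -> int:
    """Return a CI-style exit code: 0 = safe to ship, 1 = blocked.

    Assumes a hypothetical report shape like:
    {"overall": 4.1, "dimensions": {"visual": 4.0, "text": 3.2}, "confidence": 0.9}
    """
    if report["overall"] < MIN_OVERALL:
        return 1  # threshold check failed
    if any(s < MIN_DIMENSION for s in report["dimensions"].values()):
        return 1  # a single dimension dragged the result down
    if report.get("confidence", 1.0) < MIN_CONFIDENCE:
        return 1  # judges were not confident enough to trust the score
    return 0
```

Returning 0 or 1 lets a CI pipeline fail the build directly on the result.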

Production Monitoring

Is quality stable? Scheduled benchmarks with alerting via JSON output and cron jobs.

Text-to-Image & Image-to-Image

Both pipelines supported. Text-to-image generation quality plus image transformations: background removal, style transfer, product photo editing.

Features

Everything you need to evaluate visual AI.

Multi-Judge Consensus

2-3 VLM judges for reliable scores. Adaptive 2+1 algorithm with dispute detection.
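One way an adaptive 2+1 scheme can work: score with two judges, and only pay for a third when the first two dispute. This sketch is an assumption about the approach, not evalytic's implementation; the function names are hypothetical:

```python
from statistics import median

def consensus_score(judge_a: float, judge_b: float,
                    call_third_judge, dispute_threshold: float = 1.0) -> float:
    """Adaptive 2+1 consensus (sketch).

    If the first two judges agree within the threshold, average them.
    Otherwise a dispute is detected: invoke the third judge (passed as a
    callable so the extra VLM call stays lazy) and take the median of three.
    """
    if abs(judge_a - judge_b) <= dispute_threshold:
        return (judge_a + judge_b) / 2
    judge_c = call_third_judge()
    return median([judge_a, judge_b, judge_c])
```

The median makes the tie-break robust: one outlier judge cannot drag the score.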

Local Metrics

CLIP score, LPIPS similarity, ArcFace identity matching. Deterministic, runs on your machine.
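At its core, a CLIP score is just the cosine similarity between an image embedding and a text embedding produced by a CLIP model. A minimal numpy sketch of that final step (the embeddings themselves would come from a CLIP model; the 0-100 scaling is the common convention, and this is not evalytic's actual code):

```python
import numpy as np

def clip_score(image_emb: np.ndarray, text_emb: np.ndarray) -> float:
    """Cosine similarity of L2-normalized embeddings, scaled to 0-100.

    Negative similarities are clamped to 0, following common practice.
    """
    image_emb = image_emb / np.linalg.norm(image_emb)
    text_emb = text_emb / np.linalg.norm(text_emb)
    return float(100.0 * max(np.dot(image_emb, text_emb), 0.0))
```

Because it is pure arithmetic over fixed embeddings, the metric is deterministic: the same image and prompt always give the same score.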

Rich Reports

Terminal tables, HTML with image grids, JSON for automation. Interactive browser review.

Cost Tracking

Per-image and total costs for every model. Know the quality-cost tradeoff.

40+ Models

Built-in registry for fal.ai models. Flux, SDXL, Ideogram, Recraft, and more.

Open Source

MIT licensed. Use with your own API keys. No vendor lock-in.

See It In Action

One command, full picture.

Terminal
$ evalytic bench -m flux-schnell -m flux-dev -m flux-pro -p prompts.json

Evalytic Bench v0.2
3 models x 5 prompts = 15 images

Generating  ████████████████████ 100%
Scoring     ████████████████████ 100%

Model         Visual  Prompt  Overall  Cost
flux-schnell   4.2     3.6     4.0     $0.02
flux-dev       4.5     4.1     4.3     $0.13
flux-pro       4.8     4.6     4.7     $0.25

Winner: flux-pro (4.7/5) · Total: $0.40 · 34s
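A prompt file for a run like the one above could be built like this. The schema (a list of objects with `id` and `prompt`) is an assumption for illustration; check the evalytic docs for the real format:

```python
import json

# Hypothetical prompts.json schema -- the actual evalytic format may differ.
prompts = [
    {"id": "neon-sign",
     "prompt": "A neon sign reading 'MIDNIGHT CAFE' above a door "
               "in a rainy Tokyo alley at night"},
    {"id": "product-shot",
     "prompt": "A matte black espresso machine on a marble counter, "
               "soft studio lighting"},
]

with open("prompts.json", "w") as f:
    json.dump(prompts, f, indent=2)
```

Keeping prompts in a versioned JSON file means the same set can be rerun after every model update for regression comparisons.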

Start scoring in five minutes.

Free, open source, no backend. Just your API keys and a terminal.

$ pip install evalytic