skval

Give it a skill, get a score.

The Skill Validator Framework — point it at a Claude Code skill and get back a 0–100 score, a letter grade, a per-dimension breakdown, ranked findings, and a Ship / Revise / Reject verdict. The measurement counterpart to skill-creator.

163 tests, deterministic · 6 dimensions, safety-gated · self-validates 100/A · MIT licensed
A full skval scorecard (95/100, grade A, Ship): structure, effectiveness lift, reliability, artifact quality, triggering, and the safety gate.

How it scores

A safety-gated, normalized weighted composite over six dimensions — reported as a vector and a single number.

| Dim | Dimension | Weight | How it's measured |
| --- | --- | --- | --- |
| D1 | Structural integrity | 0.15 | deterministic |
| D2 | Effectiveness (pass rate + lift over a no-skill baseline) | 0.30 | behavioral, LLM-graded |
| D3 | Reliability (pass^k over N trials) | 0.20 | behavioral |
| D4 | Artifact quality (decomposed LLM rubric) | 0.20 | LLM-as-judge |
| D5 | Triggering (precision / recall / F1) | 0.15 | behavioral |
| D6 | Safety / least-surprise | gate | unsafe ⇒ score 0, verdict Reject |

Bands: A≥90, B≥80 (Ship) · C≥70, D≥50 (Revise) · else F (Reject). skval also classifies the skill (task / file-transform / interactive / discipline / reference) and routes evals accordingly.
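
One way to read the scoring rule above is sketched here. It is a minimal illustration, not skval's actual implementation: the names `composite` and `grade` are hypothetical, each dimension is assumed to be scored on 0-100, and "normalized" is assumed to mean dividing by the weights of the dimensions that were actually measured.

```python
from typing import Mapping

# Weights from the table above; D6 is a gate, not a weighted term.
WEIGHTS = {"D1": 0.15, "D2": 0.30, "D3": 0.20, "D4": 0.20, "D5": 0.15}

# (minimum score, grade, verdict), checked from the top band down.
BANDS = [(90, "A", "Ship"), (80, "B", "Ship"), (70, "C", "Revise"), (50, "D", "Revise")]


def composite(dims: Mapping[str, float], safe: bool = True) -> float:
    """Weighted mean of the measured dimensions (each assumed to be 0-100)."""
    if not safe:
        return 0.0  # safety gate: an unsafe skill scores 0 regardless of the rest
    measured = {d: s for d, s in dims.items() if d in WEIGHTS}
    total_weight = sum(WEIGHTS[d] for d in measured)
    if total_weight == 0:
        raise ValueError("no weighted dimensions were measured")
    # Assumption: "normalized" means dividing by the weights actually present.
    return sum(WEIGHTS[d] * s for d, s in measured.items()) / total_weight


def grade(score: float) -> tuple[str, str]:
    """Map a 0-100 score to (letter grade, verdict) using the bands above."""
    for floor, letter, verdict in BANDS:
        if score >= floor:
            return letter, verdict
    return "F", "Reject"
```

Under these assumptions, a run with D1 to D5 at 100, 90, 85, 95, and 100 comes out at 93 (A, Ship), a structural-only run that scores 100 on D1 stays at 100, and any unsafe skill drops to 0 (F, Reject).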

skval vs. skill-creator

They aren't rivals — they're two halves of one loop. Anthropic's skill-creator helps you write and iterate on a skill with a human in the loop; skval gives an automated, defensible verdict on the result — and can audit skills you didn't write. skval even exports to skill-creator's eval-viewer, so the two interoperate.

| Capability | skill-creator | skval |
| --- | --- | --- |
| Primary job | Author & iterate (draft → test → improve) | Audit & grade a finished skill |
| Headline output | Eval viewer + benchmark; you decide | 0–100 score, grade, Ship / Revise / Reject |
| Structural / lint checks | | deterministic static checks (D1) |
| Safety gate | | veto gate + safe extraction of untrusted skills (D6) |
| Offline, no-model path | needs model runs + review | deterministic scan, CI-gateable (exit ≠ 0 on Reject) |
| Pre-run cost estimate | tokens / time only, after the run | token + $ projection before you launch |
| With-skill vs. baseline evals | core of the loop | effectiveness + lift, significance (D2) |
| Reliability | variance (mean ± stddev) | pass^k over N trials, scored (D3) |
| Triggering | description optimizer (train / test) | precision / recall / F1, scored (D5) |
| Skills you didn't write | aimed at your own authoring | dir / SKILL.md / .skill, untrusted-safe |
| Many skills at once | single-skill iteration | batch leaderboard + regression history |
| Judge-bias controls | blind A/B (optional, advanced) | blind + position-swap + cross-family by default |
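
The reliability and triggering rows are easier to compare with numbers in hand. The sketch below is illustrative only: `pass_hat_k` and `trigger_prf` are hypothetical names, and the combinatorial pass^k estimator is a common choice that skval may or may not use.

```python
from math import comb
from statistics import mean, pstdev


def pass_hat_k(passes: int, trials: int, k: int) -> float:
    """Estimate the chance that k independent trials all pass.

    Uses C(passes, k) / C(trials, k); assuming this estimator is an
    illustration choice, not a confirmed detail of skval.
    """
    if not 0 < k <= trials or not 0 <= passes <= trials:
        raise ValueError("need 0 < k <= trials and 0 <= passes <= trials")
    return comb(passes, k) / comb(trials, k)


def trigger_prf(fired: list[bool], should_fire: list[bool]) -> tuple[float, float, float]:
    """Precision, recall, and F1 for whether a skill triggered on each prompt."""
    tp = sum(f and s for f, s in zip(fired, should_fire))
    fp = sum(f and not s for f, s in zip(fired, should_fire))
    fn = sum(s and not f for f, s in zip(fired, should_fire))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


outcomes = [1, 1, 0, 1, 1]                   # 4 passes in 5 trials
rate, spread = mean(outcomes), pstdev(outcomes)
both_pass = pass_hat_k(sum(outcomes), len(outcomes), k=2)
```

With 4 passes in 5 trials, the mean pass rate is 0.8, while the estimated chance that two independent trials both pass is 0.6. The mean and spread describe typical behavior; pass^k penalizes a skill for every trial that fails.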

Reach for skill-creator when…

you're building a skill — capturing intent, drafting, and iterating on a handful of examples with your eyes on every output.

Reach for skval when…

you need a decision: is it good enough to ship? did my edit regress? is this third-party skill safe? — or you're gating many skills in CI.
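
Because the deterministic scan exits non-zero on Reject, a CI job can gate on its exit code. The sketch below wraps the `uv run skval structural` command from the quick start; the helper names `run_structural`, `gate`, and `main` are hypothetical, and it assumes `uv` and skval are installed on the runner.

```python
import subprocess
import sys
from typing import Callable, Iterable


def run_structural(skill: str) -> int:
    """Run the deterministic scan on one skill and return its exit code."""
    return subprocess.run(["uv", "run", "skval", "structural", skill]).returncode


def gate(skills: Iterable[str], run: Callable[[str], int] = run_structural) -> list[str]:
    """Return the skills whose scan exited non-zero."""
    return [skill for skill in skills if run(skill) != 0]


def main(argv: list[str]) -> int:
    """Scan every skill path in argv; return 1 if any scan failed, else 0."""
    failed = gate(argv)
    for skill in failed:
        print(f"failed gate: {skill}", file=sys.stderr)
    return 1 if failed else 0
```

A CI step would end with `sys.exit(main(sys.argv[1:]))`. Passing `run` as a parameter keeps the gate testable without invoking skval.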

Benchmark

skval's deterministic structural scan over 10 widely-used skills: 8 score 100 with no findings, and the other two are flagged for exceeding the size budget. Separate runs cover 75 installed skills and 69 from the web. See the full benchmark →

| Skill | Type | Score | Findings |
| --- | --- | --- | --- |
| mcp-builder | task | 100 / A | |
| frontend-design | task | 100 / A | |
| test-driven-development | discipline | 100 / A | |
| pdf | file_transform | 100 / A | |
| pptx | file_transform | 100 / A | |
| canvas-design | file_transform | 100 / A | |
| xlsx | file_transform | 100 / A | |
| web-artifacts-builder | task | 100 / A | |
| skill-creator | file_transform | 96 / A | SKILL.md ~8246 tokens (>5000) |
| docx | file_transform | 92 / A | 599 lines & ~5142 tokens (over budget) |

The upgrade: score → fix → re-score

skval doesn't just grade — its ranked findings drive the fix. The biggest real-world turnarounds:

annotate · invalid frontmatter · 50 → 100

A real published skill (glebis/claude-skills) whose unquoted colon broke its YAML — it used to crash skval, and now scores. One-line fix: quote the description. compare.py: +50, Revise → Ship. case study →

bad-skill · 73 → 100

Four findings fixed one-for-one — kebab name, drop <>, broken ref, stray key. compare.py: +27, Revise → Ship. case study →

Know the cost before you run

A full validation spawns dozens of subagents (~1M tokens for a default run). For token-billed / enterprise teams, skval estimate projects the token + $ cost up front — deterministic, no model calls.

# preview the cost of a full run (directory, SKILL.md, or .skill archive)
uv run skval estimate <skill-source>

# → ## $4.04 – $6.19 – $12.23   (684k – 1.04M – 2.04M tokens)
skval estimate output: a low / expected / high range ($4.04 / $6.19 / $12.23 projected for a full run), broken down per stage by tokens and cost.

How it's computed

Per-stage token assumptions (executors, graders, judge, triggering) × your run plan × a per-model rate table, as a low / expected / high range. Anchored to observed runs; prompt caching isn't modeled, so real cost usually runs lower.
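
The multiplication described above can be sketched as follows. Every number here is made up for illustration, the names `Stage`, `RATES`, and `estimate` are hypothetical, and a single blended rate per model is a simplification; skval's real per-stage assumptions and rate table are not reproduced.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    model: str
    calls: int                    # from the run plan, e.g. evals x trials x configs
    tokens: tuple[int, int, int]  # per-call (low, expected, high), hypothetical


# Hypothetical blended USD rate per million tokens, keyed by model name.
RATES = {"executor-model": 3.0, "judge-model": 15.0}


def estimate(stages: list[Stage], rates: dict[str, float]):
    """Return ((low, expected, high) tokens, (low, expected, high) USD)."""
    tokens = [0, 0, 0]
    cost = [0.0, 0.0, 0.0]
    for stage in stages:
        for i, per_call in enumerate(stage.tokens):
            stage_tokens = stage.calls * per_call
            tokens[i] += stage_tokens
            cost[i] += stage_tokens / 1_000_000 * rates[stage.model]
    return tuple(tokens), tuple(round(c, 2) for c in cost)


# A made-up plan: 5 evals x 3 trials x 2 configs = 30 calls per stage.
plan = [
    Stage("executors", "executor-model", 30, (10_000, 20_000, 40_000)),
    Stage("graders", "judge-model", 30, (2_000, 4_000, 8_000)),
]
total_tokens, total_cost = estimate(plan, RATES)
```

For this made-up plan the totals are 360k / 720k / 1.44M tokens and $1.80 / $3.60 / $7.20. No model is called at any point, which is what makes an estimate of this kind deterministic.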

Tune the plan

Adjust the run plan with --evals, --trials, --configs, --executor-model, and --judge-model. The command is read-only by default; --write also saves an estimate.json.

Quick start

Run the deterministic scan now — no model calls, fully offline:

# install
uv venv && uv pip install -e ".[dev]"

# score a skill (directory, SKILL.md, or .skill archive)
uv run skval structural <skill-source>

# → ██████████████  100 / 100   Grade: A   Verdict: Ship

For the full six-dimension run, ask Claude to “validate / score this skill” — it generates evals, runs the task with and without the skill, grades the outputs, judges quality, and tests triggering, then assembles the scorecard (preview the cost first).

Step-by-step guide

docs/USAGE.md — install to a full scorecard, with screenshots.

Browse a real run

commit-conventions — the eval set + every per-trial result.