The model that benchmarks fine
A classifier scores 92% on a public benchmark but only 64% on the company's own set.
Whether they construct an eval set that mirrors the real distribution.
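One way to build an eval set that mirrors the real distribution is to sample a labeled pool so that each label's share matches what is seen in real traffic. The sketch below is illustrative only: `build_eval_set`, the `(text, label)` pool format, and the `target_dist` shares are all assumptions made for the example, not something the scenario prescribes.

```python
import random
from collections import defaultdict


def build_eval_set(pool, target_dist, size, seed=0):
    """Sample `size` examples from `pool` so label shares follow `target_dist`.

    pool:        list of (text, label) pairs (hypothetical format)
    target_dist: dict mapping label -> share in real traffic (assumed known)
    """
    rng = random.Random(seed)

    # Group the labeled pool by label so each class can be sampled separately.
    by_label = defaultdict(list)
    for example in pool:
        by_label[example[1]].append(example)

    eval_set = []
    for label, share in target_dist.items():
        quota = round(size * share)  # number of examples this label should get
        candidates = by_label[label]
        if len(candidates) < quota:
            raise ValueError(
                f"need {quota} examples of {label!r}, only {len(candidates)} available"
            )
        # Sample without replacement so no example appears twice.
        eval_set.extend(rng.sample(candidates, quota))

    rng.shuffle(eval_set)
    return eval_set


# Usage: a balanced pool, resampled to an assumed 80/20 real-world split.
pool = [(f"ham {i}", "ham") for i in range(100)] + [
    (f"spam {i}", "spam") for i in range(100)
]
eval_set = build_eval_set(pool, {"ham": 0.8, "spam": 0.2}, size=50)
```

Scoring the classifier on a set like this, rather than on the public benchmark, is more likely to expose a gap such as the 92% versus 64% above. Label shares are only one axis of the distribution; input length, source, and time period may need the same treatment.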
- a real Linux box in the browser
- kubectl, docker, terraform, jq, yq
- cluster, repo and cloud creds pre-wired
- auto-checks running in the background
- every keystroke + every command
- terminal + screen recording
- auto-graded pass/fail per check
