
Automated scoring tools promise to remove subjectivity from code review. The candidate submits their work, the model evaluates it, you get a score. No bias, no late-Friday judgment calls, no disagreement between interviewers.
It sounds clean. It is also missing several things that matter.
Start with what automated scoring does handle well.
Style consistency. Automated tools reliably flag code that does not match a given style guide.
Test coverage. They can count tests, check that they pass, and surface obvious gaps.
Basic correctness. If the output for a given input is wrong, the model will catch it.
For high-volume screening, these checks are useful. They clear out obviously broken submissions without senior review time.
What a score on the final code does not capture is a longer list.
Approach quality under constraint. A candidate who wrote slightly messy but production-ready code under time pressure will often score lower than one who wrote clean code that solves the wrong version of the problem. The model has no context for which tradeoff mattered.
Debugging behavior. How did they get to the solution? Did they understand the failure mode or iterate blindly? The final code artifact cannot show you this. The session recording can.
Recovery under failure. Did the candidate hit a dead end, recognize it, and change course? Or did they keep pushing a broken approach until time ran out? This is one of the strongest signals in real engineering work. A score on the submitted code has no visibility into it.
Judgment about what to skip. A strong candidate in a 45-minute session knows which parts of a task are not worth completing. A model that scores for completeness will penalize exactly that wisdom.
When candidates use Claude during the assessment, an AI-generated score on the final code tells you even less. The submission might be high quality and entirely AI-produced. Or it might be low quality despite good AI assistance, because the candidate could not direct it.
One way to get more signal: run the same candidate through two sessions, one with AI tools and one without. The difference between the two results tells you what the candidate actually contributed. No automated score surfaces that. A reviewer watching both recordings does.
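If a reviewer rates both recordings against the same rubric, the comparison can be written down rather than remembered. The sketch below is only an illustration: the `SessionRatings` class, the 1 to 5 scale, and the three dimension names (borrowed from the signals described earlier) are assumptions, and the numbers come from a human watching the sessions, not from a model.

```python
from dataclasses import dataclass

# Hypothetical rubric dimensions, taken from the signals discussed above.
DIMENSIONS = ("debugging", "recovery", "prioritization")


@dataclass
class SessionRatings:
    """Reviewer-assigned ratings (assumed 1 to 5 scale) for one recorded session."""
    ai_allowed: bool
    ratings: dict  # dimension name -> reviewer rating


def rating_delta(with_ai: SessionRatings, without_ai: SessionRatings) -> dict:
    """Per-dimension difference: rating with AI tools minus rating without."""
    if not with_ai.ai_allowed or without_ai.ai_allowed:
        raise ValueError("expected one AI-assisted and one unassisted session")
    return {d: with_ai.ratings[d] - without_ai.ratings[d] for d in DIMENSIONS}


assisted = SessionRatings(True, {"debugging": 4, "recovery": 3, "prioritization": 5})
unassisted = SessionRatings(False, {"debugging": 4, "recovery": 2, "prioritization": 3})
print(rating_delta(assisted, unassisted))
```

A large gap on one dimension is a prompt to rewatch that part of both recordings, not a verdict on the candidate.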
AI grading works well as a first filter. It removes the obvious bottom of the pool without consuming senior time.
After that, a human who watched the session recording will catch things no score captures. The output is not the candidate. The process that produced it is.
Use AI to filter. Use humans to decide.
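As a rough sketch of that split, the automated stage below only decides who reaches a reviewer; it never records a hiring decision. The `Submission` fields, the function names, and the 50% pass-rate threshold are all assumptions made for illustration, not a recommended cutoff.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Submission:
    candidate: str
    tests_passed: int
    tests_total: int
    # Left empty by the automated stage; only a human reviewer fills this in.
    human_decision: Optional[str] = None


def passes_first_filter(sub: Submission, min_pass_rate: float = 0.5) -> bool:
    """Automated stage: screen out submissions that are obviously broken."""
    if sub.tests_total == 0:
        return False
    return sub.tests_passed / sub.tests_total >= min_pass_rate


def review_queue(subs: List[Submission], min_pass_rate: float = 0.5) -> List[Submission]:
    """Everything that survives the filter goes to a human with the recording."""
    return [s for s in subs if passes_first_filter(s, min_pass_rate)]


pool = [Submission("a", 9, 10), Submission("b", 1, 10), Submission("c", 0, 0)]
for sub in review_queue(pool):
    print(sub.candidate, "-> needs human review of the session recording")
```

Keeping `human_decision` out of the automated path is the design point here: the filter can shrink the pool, but the decision field stays empty until a reviewer has watched the session.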