The Technical Reason Your AI Recruiter Plays Favourites
- Martyn Redstone
- Sep 11
The promise of AI in recruitment is alluring: sift through hundreds of applications in seconds, identify top talent, and free up human recruiters for more strategic work. However, as organisations rush to adopt these tools, a critical question is often overlooked: are they consistent? My own research has shown that they are not, behaving more like "over-confident interns" than reliable systems.
For months, the exact reason for this instability has been a subject of debate. Now, a technical deep-dive from Thinking Machines Lab provides a definitive answer, and it has profound implications for anyone using AI in a high-stakes context like hiring.
The Problem: Uncovering "Rank Roulette" in AI Screening
In my field experiment, the "LLM Reality Check", I tested the stability of commercial large language models tasked with screening 109 CVs for a single role. I hypothesised that, given identical inputs, the models would produce stable shortlists that agreed with one another.
The hypothesis failed spectacularly. The study revealed:
Shocking Disagreement: When comparing the top-ten shortlists from different AI models, there was only a 14% overlap in the candidates they selected. In effect, two AI recruiters disagreed on more than four out of five of the candidates they picked.
Rank Roulette: The ranking of a single candidate was highly volatile, shifting by an average of ±2.5 places from one day to the next. A candidate ranked #10 on a Monday could jump to #1 on a Wednesday with no new information added (a sketch of how overlap and rank-drift metrics like these can be computed follows this list).
Significant Blind Spots: A staggering 55% of résumés in the dataset were never shortlisted by any of the models, effectively becoming invisible without any audit trail. This creates the risk of "invisible disqualifiers" that would violate regulations like the EU AI Act and GDPR.
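For teams who want to run a similar check on their own stack, the two headline numbers are straightforward to reproduce in principle. The sketch below is not the study's exact methodology, just a minimal, assumed way of measuring top-n overlap between two shortlists and the average rank drift between two runs; the inputs are hypothetical lists of candidate identifiers ordered best-first.

```python
def top_n_overlap(shortlist_a, shortlist_b, n=10):
    """Percentage of candidates that two top-n shortlists have in common."""
    shared = set(shortlist_a[:n]) & set(shortlist_b[:n])
    return 100 * len(shared) / n

def mean_rank_shift(ranking_day1, ranking_day2):
    """Average absolute change in position for candidates present in both rankings."""
    shifts = [abs(ranking_day1.index(c) - ranking_day2.index(c))
              for c in ranking_day1 if c in ranking_day2]
    return sum(shifts) / len(shifts)

# Hypothetical example: the same CVs ranked on two different days.
monday = ["cv_07", "cv_21", "cv_03", "cv_14", "cv_09"]
wednesday = ["cv_14", "cv_03", "cv_21", "cv_30", "cv_07"]
print(top_n_overlap(monday, wednesday, n=5))  # 80.0
print(mean_rank_shift(monday, wednesday))     # 2.25
```

Against identical inputs, a trustworthy screening tool should score close to 100% overlap and near-zero drift.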
These findings point to a fundamental instability that goes far beyond a user-experience issue; it is a serious compliance risk. But they still left the deeper question unanswered: why was this happening?
The "Why": A Technical Breakthrough Reveals the Real Culprit
The common explanation for AI's variability is a combination of factors, including the use of parallel processing on GPUs and the nature of floating-point mathematics. While these play a role, the Thinking Machines Lab post, "Defeating Nondeterminism in LLM Inference", identifies the true culprit: a lack of "batch invariance".
To run efficiently, AI servers "batch" multiple user requests together for simultaneous processing. The problem is that the mathematical operations performed on your single request can change depending on the size and composition of the batch it happens to be in.
Imagine you are baking a cake. The taste of your single cake should not be affected by whether the baker is making one cake or ten in the same oven. With most current LLM systems, however, it is: the size of the "batch of cakes" subtly changes the recipe for each one.
This means that your CV screening request, submitted at 10:00 AM on a Tuesday, will be processed in a different server batch from the identical request submitted at 2:00 PM. Because floating-point arithmetic is sensitive to the order of operations, the different batch leads to slightly different internal calculations, which in turn can produce a different final shortlist. The system is not truly random, but it is non-deterministic from the user's perspective, because you have no control over the server's load at any given moment.
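The arithmetic behind this is easy to demonstrate. The toy sketch below is not the inference code the Thinking Machines post analyses; it simply shows the underlying effect: floating-point addition is not associative, so reducing the same numbers in different-sized groups, which is exactly what batch-dependent kernels do, yields slightly different totals.

```python
import random

random.seed(0)
# The same 10,000 numbers, spanning several orders of magnitude.
values = [random.uniform(-1.0, 1.0) * 10 ** random.randint(-6, 6) for _ in range(10_000)]

def grouped_sum(nums, group_size):
    """Sum `nums`, but reduce them in chunks of `group_size` first,
    loosely mimicking how a kernel splits work across a batch."""
    partials = [sum(nums[i:i + group_size]) for i in range(0, len(nums), group_size)]
    return sum(partials)

# Identical input, different groupings: the totals agree only to so many digits.
print(grouped_sum(values, 1))
print(grouped_sum(values, 64))
print(grouped_sum(values, 1024))
```

Inside a transformer, these tiny rounding differences are amplified layer after layer, and they can be enough to flip which token, and ultimately which candidate, comes out on top.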
The Implications: From Technical Quirk to Governance Nightmare
This technical insight directly explains the "rank roulette" I observed. A candidate's CV was ranked differently not because the AI "changed its mind" but because the server's batch size was different with each run.
This creates an indefensible position from a governance, risk, and compliance (GRC) perspective.
Regulators, quite rightly, have labelled AI résumé screening as "high-risk" under the EU AI Act. This legislation, along with GDPR, demands that automated employment systems produce consistent and explainable outcomes. A system whose results are dependent on fluctuating server traffic fails this test entirely. You cannot defend an unstable and arbitrary process to a candidate or a regulator.
The Path Forward: A "Controlled Copilot" with Guardrails
The clear conclusion from both my research and the Thinking Machines analysis is that off-the-shelf LLMs cannot be trusted as autonomous gatekeepers in hiring. The workable model is one of a "controlled copilot" where AI augments human judgment, but never replaces it.
For HR and TA leaders, this means building a defensible AI strategy requires robust guardrails. My original study proposed a checklist for this, and this new technical understanding makes it more relevant than ever:
Demand Deterministic Systems: The most important question to ask any HR tech vendor is now: "Is your system deterministic and batch-invariant?" They must be able to prove that their tool will produce the exact same output for the same input every single time (a minimal version of this test is sketched after this checklist).
Implement Technical Safeguards: Insist on using programmatic API calls with the "temperature" set to 0 to reduce randomness, though we now know this alone is not enough. You must also log the specific model version used for every decision.
Conduct Continuous Audits: Implement "shadow audits" by inserting known CVs into every batch to monitor for drift and ensure the model is behaving as expected (see the second sketch below).
Ensure Human Oversight: A human recruiter must have the final sign-off on any automated rejection. This is your ultimate compliance safeguard.
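To make the first two checklist items concrete, here is a minimal sketch of the kind of determinism test you could run against a vendor's tool. `score_cv` and the response fields are assumptions standing in for whatever scoring endpoint the vendor actually exposes; the pattern is what matters: identical input, repeated calls, the model version logged for every decision, and a hard failure on any variation.

```python
import hashlib
import json
from datetime import datetime, timezone

def check_determinism(score_cv, cv_text, runs=5):
    """Call the vendor's scoring function repeatedly with an identical CV and
    flag any variation. `score_cv` is assumed to return a JSON-serialisable
    dict such as {"model_version": "...", "score": 0.87, "rationale": "..."}."""
    results = []
    for _ in range(runs):
        response = score_cv(cv_text)  # same CV every time; temperature pinned to 0 vendor-side
        digest = hashlib.sha256(
            json.dumps(response, sort_keys=True).encode()
        ).hexdigest()
        # Audit log: when the decision was made, by which model build, and its fingerprint.
        print(datetime.now(timezone.utc).isoformat(),
              response.get("model_version"), digest)
        results.append((response.get("model_version"), digest))

    versions = {version for version, _ in results}
    digests = {digest for _, digest in results}
    assert len(versions) == 1, f"Model version changed mid-audit: {versions}"
    assert len(digests) == 1, "Non-deterministic output for identical input"
```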
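Likewise, the "shadow audit" in the third item can be as simple as seeding every live batch with a fixed panel of control CVs and checking that the model still orders them the way it did at sign-off. Again, `rank_cvs` is a hypothetical stand-in for the vendor's ranking call, not a real API.

```python
def shadow_audit(rank_cvs, live_cvs, sentinel_cvs, baseline_order):
    """Insert known 'sentinel' CVs into a live batch, then check that their
    relative order matches the order recorded when the model was validated.
    `rank_cvs` is assumed to return candidate identifiers ordered best-first."""
    sentinels = set(sentinel_cvs)
    ranking = rank_cvs(live_cvs + sentinel_cvs)
    observed_order = [cv for cv in ranking if cv in sentinels]
    if observed_order != baseline_order:
        # The model no longer orders the control CVs the way it did at sign-off:
        # hold the run for human review before any rejection goes out.
        return {"drift": True, "expected": baseline_order, "observed": observed_order}
    return {"drift": False}
```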
The technology is powerful and offers incredible speed for tasks like summarising CVs and drafting outreach. But its judgment is demonstrably weak and its reasoning often superficial. Until vendors can prove their systems are stable, we must treat them as brilliant but erratic apprentices: invaluable for the first draft, but never the final authority.
