Generation 1: Boolean search (1995-2010)
The original. Recruiter types "Java AND (microservices OR Kubernetes) NOT junior" and gets a list of CVs that contain those tokens. Brittle, slow, and severely biased toward candidates who happened to use the exact phrasing the recruiter imagined.
A candidate who wrote "led our migration to a containerized platform" would never match "Kubernetes," even if she spent three years running production EKS clusters. The recruiter never sees her CV. Multiply this across 200 applicants and you have silently rejected 30-40% of qualified candidates.
Boolean search still ships as the default in most legacy ATS platforms installed at Indian staffing agencies. If your primary search tool is a text box that accepts "AND/OR/NOT," that is what you are using. It is fine for known-item retrieval ("find the candidate we shortlisted last week") but actively harmful for discovery.
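To make the failure concrete, here is a minimal sketch of the query "Java AND (microservices OR Kubernetes) NOT junior" as a token filter. The tokenizer, function names, and both sample CVs are assumptions made up for illustration; a real ATS will tokenize and parse queries differently.

```python
import re

def tokens(text: str) -> set[str]:
    """Lowercase word tokens; a stand-in for whatever tokenizer a real ATS uses."""
    return set(re.findall(r"[a-z0-9+#]+", text.lower()))

def boolean_match(cv_text: str) -> bool:
    """Hard-coded form of: Java AND (microservices OR Kubernetes) NOT junior."""
    t = tokens(cv_text)
    return (
        "java" in t
        and ("microservices" in t or "kubernetes" in t)
        and "junior" not in t
    )

# Two invented CVs describing similar work in different words.
cv_exact = "Senior Java engineer. Ran Kubernetes clusters in production."
cv_paraphrase = "Senior Java engineer. Led our migration to a containerized platform on EKS."

print(boolean_match(cv_exact))       # True
print(boolean_match(cv_paraphrase))  # False: no literal "Kubernetes" or "microservices"
```

The second CV is filtered out entirely rather than ranked lower, which is why the recruiter never learns it existed.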
Generation 2: TF-IDF and keyword scoring (2010-2020)
Statistical relevance scoring. The system tokenizes the JD and each CV, computes term-frequency / inverse-document-frequency vectors for each, and ranks candidates by cosine similarity. Better than Boolean because it surfaces candidates who share only some of the JD's terms, weighting each term by how distinctive it is, and because it ranks candidates instead of applying a binary filter.
Still keyword-anchored at the core. A CV that says "Kafka" gets credit for Kafka; a CV that says "event streaming on AWS MSK" does not, because MSK is a less common token and its connection to Kafka is not in the TF-IDF model. Phrase variance still kills 15-25% of qualified candidates.
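The Kafka versus MSK gap can be reproduced with a few lines of TF-IDF. This is a minimal sketch using only the standard library, with a smoothed IDF chosen for convenience; the JD and CV strings are invented, and production systems typically add stemming, stop-word removal, and other preprocessing not shown here.

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, using a smoothed IDF."""
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for toks in tokenized for term in set(toks))
    idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append({t: (count / len(toks)) * idf[t] for t, count in tf.items()})
    return vectors

def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Invented sample texts.
jd = "Senior engineer with Kafka experience in production"
cv_kafka = "Built Kafka pipelines in production for five years"
cv_msk = "Ran event streaming on AWS MSK for five years"

jd_vec, kafka_vec, msk_vec = tfidf_vectors([jd, cv_kafka, cv_msk])
score_kafka = cosine(jd_vec, kafka_vec)
score_msk = cosine(jd_vec, msk_vec)

print(score_kafka > score_msk)  # True
print(score_msk)                # 0.0: the MSK CV shares no token with the JD
```

Nothing in the vectors connects "MSK" to "Kafka", so the second CV scores zero regardless of how relevant the experience is.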
Most "AI screening" features shipping in 2026-era ATS products are still this generation under the hood, with a thin LLM "explanation" layer added on top for the demo. The LLM reads the top 10 TF-IDF results and writes a paragraph about why each was selected. The selection itself is pure TF-IDF, which means the LLM explanation is post-hoc rationalization, not evaluation.
The tell: TF-IDF systems cannot explain why a candidate was ranked 11th instead of 12th, because the ranking is a continuous similarity score with no semantic meaning. Ask a vendor to show you the evidence for position 11, and you will get hand-waving.
Generation 3: Embedding similarity (2020-2024)
Modern semantic search. The JD and each CV get embedded into a high-dimensional vector space (typically 768 or 1536 dimensions); candidates are ranked by vector similarity to the JD. This catches the "led the move from EC2 to ECS" versus "migrated our cloud infrastructure" case because both phrases embed near each other in semantic space.
Big improvement over TF-IDF. Phrase variance rejection drops from 15-25% to under 5%. Candidates with adjacent phrasing, industry-specific vocabulary, or translated-from-Hindi English get surfaced correctly. This is the generation that the more sophisticated ATS products shipped in 2023-2024.
But embeddings optimize for similarity, not fitness. A senior architect and a junior dev both list the same stack (React, Node, MongoDB) on their CVs, and both embed near the JD. The senior architect ranks somewhere between 3rd and 15th; the junior dev ranks somewhere between 3rd and 15th. The model cannot tell them apart because similarity in vector space does not encode seniority, depth, or judgment.
The other failure mode is negative signal blindness. If a JD requires "5+ years Kafka in production," a CV that lists "Kafka (self-taught, tutorial project)" embeds nearly identically to "Kafka (4 years, 50-node cluster running 2M msg/sec)." The model sees both say Kafka, does not know to penalize the tutorial version, and both rank high.
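A toy version of negative signal blindness is sketched below. The two-axis "embedding" is a hand-written concept table invented for this example, not output from any real model; real embeddings have hundreds of learned dimensions and would separate the two CVs slightly rather than not at all. The point it illustrates is structural: the ranking sees only direction in vector space, and nothing in that space stands for years or depth.

```python
import math
import re

# Hypothetical concept table standing in for a learned embedding model.
CONCEPT_OF = {
    "kafka": "streaming", "msk": "streaming", "streaming": "streaming",
    "kubernetes": "containers", "eks": "containers", "ecs": "containers",
}
AXES = sorted(set(CONCEPT_OF.values()))  # ["containers", "streaming"]

def toy_embed(text):
    """Count how often each concept is mentioned; everything else is ignored."""
    vec = [0.0] * len(AXES)
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        if word in CONCEPT_OF:
            vec[AXES.index(CONCEPT_OF[word])] += 1.0
    return vec

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

jd = "5+ years Kafka in production"
cv_tutorial = "Kafka (self-taught, tutorial project)"
cv_production = "Kafka (4 years, 50-node cluster running 2M msg/sec)"

jd_vec = toy_embed(jd)
sim_tutorial = cosine(jd_vec, toy_embed(cv_tutorial))
sim_production = cosine(jd_vec, toy_embed(cv_production))

print(sim_tutorial, sim_production)  # 1.0 1.0
```

Both CVs tie because "self-taught", "tutorial", "4 years", and "50-node" never reach the similarity function in a form that can lower or raise the score.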
Generation 4: Evidence-based LLM evaluation (2024-today)
Current frontier. A language model reads the structured intake (see Lever 1 of the time-to-hire guide) and each CV in full, scores the candidate against each criterion with a citation back to specific CV text, and outputs a ranked shortlist where every score is auditable. This is qualitatively different from the first three generations because it actually reads both documents and applies judgment, rather than computing a similarity function.
The CVPRO STEP0 formula is a concrete example: Skills 40% + Experience 25% + Domain 15% + Location 10% + Recency 10%, with 42 evidence points underpinning each candidate score. For every one of the 42 points, the system cites the exact sentence or bullet from the CV that triggered the score. A recruiter can click any score and see the proof.
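The weighting itself is simple arithmetic; the auditability comes from refusing any score that lacks a verbatim quote. Below is a minimal sketch under stated assumptions: only the five weights come from the formula above, while the 0-100 per-criterion scale, the class and function names, the sample CV, and the sample scores are all hypothetical. It does not model how the 42 evidence points are distributed across criteria, and it is not CVPRO's implementation.

```python
from dataclasses import dataclass

# Weights in percent, taken from the formula quoted above.
WEIGHTS_PCT = {"skills": 40, "experience": 25, "domain": 15, "location": 10, "recency": 10}

@dataclass(frozen=True)
class CriterionScore:
    criterion: str
    score: int      # assumed 0-100 scale
    evidence: str   # verbatim span copied from the CV

def weighted_total(scores, cv_text):
    """Combine criterion scores, rejecting any score whose evidence is not in the CV."""
    by_name = {s.criterion: s for s in scores}
    if set(by_name) != set(WEIGHTS_PCT):
        raise ValueError("every criterion needs exactly one score")
    for s in scores:
        if not s.evidence or s.evidence not in cv_text:
            raise ValueError(f"no verbatim evidence for {s.criterion!r}")
    return sum(WEIGHTS_PCT[name] * by_name[name].score for name in WEIGHTS_PCT) / 100

# Invented CV text and scores, for illustration only.
cv_text = (
    "Built Kafka pipelines handling 2M msg/sec. "
    "Four years as backend engineer at a payments company. "
    "Based in Bengaluru. "
    "Most recent role ended in 2021."
)
scores = [
    CriterionScore("skills", 80, "Built Kafka pipelines handling 2M msg/sec"),
    CriterionScore("experience", 60, "Four years as backend engineer"),
    CriterionScore("domain", 100, "payments company"),
    CriterionScore("location", 100, "Based in Bengaluru"),
    CriterionScore("recency", 40, "Most recent role ended in 2021"),
]

print(weighted_total(scores, cv_text))  # 76.0
```

The substring check is the design choice that matters: a score with a paraphrased or invented justification fails loudly instead of reaching the recruiter.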
Two characteristics distinguish real Generation 4 screening from dressed-up Generation 2. First, the evidence citation: real Gen 4 shows you the exact sentence span on the CV that triggered the score, not a generic "has React experience" with no quote. Second, reproducibility: running the same JD and same CV twice produces the same score within a point or two. TF-IDF systems with LLM explanation layers produce different rationalizations each run because the LLM is writing fresh copy each time.
The practical impact on a staffing agency is that recruiters stop spending time justifying scores to clients. The evidence is already there, linked to CV text, visible in the client portal. When the hiring manager asks "why is candidate 4 ranked above candidate 7," the recruiter opens the rubric, points at the evidence spans, and the conversation moves to judgment calls on the 2-3 areas that actually differ, not on the 15 areas that are equivalent.
How to test a vendor demo (three live checks)
Run these three checks during any AI screening vendor demo. They take 15 minutes total and they distinguish real Generation 4 from dressed-up Generation 2 with high confidence.
- Test 1 Phrase variance: submit two CVs describing the same experience in different words. CV A says "5 years Java, built microservices at scale." CV B says "half a decade of backend development in Java, led distributed architecture work." Real Gen 4 scores them within 5 points of each other. TF-IDF with an LLM skin scores them 20+ points apart because it is reading keywords, not meaning.
- Test 2 Evidence citation: click any candidate score in the demo. Real Gen 4 shows the exact CV sentences that justified each individual criterion (Skills, Experience, Domain, etc.), with the text highlighted on the CV. TF-IDF returns generic prose like "has React experience" with no citation, or highlights random paragraphs that do not match the claim.
- Test 3 Negative signals: add a CV that has the right keywords but is clearly underqualified (e.g., "Kafka tutorial, AWS Solutions Architect cert obtained 2 weeks ago, 1 year total experience" for a senior Kafka role). Gen 4 flags the shallow signals and ranks low. Gen 2 ranks this candidate high because the keyword density is actually above average.
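If the vendor exposes an API or a bulk export, Tests 1 and 3 and the reproducibility check described earlier can be scripted rather than eyeballed. The sketch below assumes a hypothetical wrapper with the shape (jd_text, cv_text) -> score on a 0-100 scale; no real vendor API is implied. The tolerances of 5 and 2 points come from the text above. A keyword-overlap scorer stands in for a Gen 2 system so the example runs on its own.

```python
import re
from typing import Callable

# Hypothetical wrapper around a vendor's scoring call: (jd_text, cv_text) -> 0-100.
ScoreFn = Callable[[str, str], float]

def phrase_variance_ok(score: ScoreFn, jd, cv_a, cv_b, tolerance=5.0):
    """Test 1: same experience in different words should score within `tolerance`."""
    return abs(score(jd, cv_a) - score(jd, cv_b)) <= tolerance

def negative_signal_ok(score: ScoreFn, jd, strong_cv, shallow_cv):
    """Test 3: the keyword-rich but underqualified CV must rank below the strong one."""
    return score(jd, shallow_cv) < score(jd, strong_cv)

def reproducible(score: ScoreFn, jd, cv, runs=3, tolerance=2.0):
    """Same JD and CV scored repeatedly should stay within a point or two."""
    results = [score(jd, cv) for _ in range(runs)]
    return max(results) - min(results) <= tolerance

def _terms(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def keyword_overlap_score(jd, cv):
    """Stand-in for a keyword-anchored scorer: percent of JD terms found in the CV."""
    jd_terms = _terms(jd)
    return 100.0 * len(jd_terms & _terms(cv)) / len(jd_terms)

jd = "Java microservices"
cv_a = "5 years Java, built microservices at scale."
cv_b = "half a decade of backend development in Java, led distributed architecture work."

print(phrase_variance_ok(keyword_overlap_score, jd, cv_a, cv_b))  # False: 100 vs 50
print(reproducible(keyword_overlap_score, jd, cv_a))              # True: deterministic
```

A keyword scorer passes the reproducibility check trivially, so that check only means something in combination with the other two.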
Bias, adverse impact, and audit
AI screening is not automatically less biased than human screening. The screening model inherits bias from its training data (which includes historical hiring patterns) and from the JD itself (which may encode biased language without the author realizing it). A model that scores high-pedigree candidates higher because training data rewarded that pattern will do the same in production.
Two safeguards are non-negotiable for any staffing agency shipping AI screening to clients. First, require evidence citations for every score so a human auditor can check whether the model is scoring on skill signals or on proxies for gender, caste, college tier, or age. Second, test screening output for adverse impact across protected groups at least quarterly: if female candidates with matching skills are consistently ranked lower, the model needs retraining or the rubric needs adjustment.
In India specifically, watch for three common adverse-impact patterns. College-tier bias (IIT/IIIT graduates ranked higher independent of skill evidence) is the most pervasive. English-fluency bias (candidates who write fluent business English ranked higher even on roles where English is not client-facing) is widespread. Age bias (candidates with 15+ years of experience quietly down-ranked) may be unlawful under some DPDPA interpretations but is common in black-box models.
A working audit cadence looks like: monthly spot-checks of 10 random shortlists by a second reviewer; quarterly adverse-impact analysis across gender, age bands, and college tiers; annual external audit by a third party if your agency has enterprise clients who will ask. CVPRO ships evidence citations by default, which makes the first two items cheap to run.
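One simple way to run the quarterly adverse-impact analysis is to compare shortlist rates across groups. The sketch below borrows the "four-fifths" heuristic from US employment-selection guidance as a flagging threshold; that threshold is an assumption brought in for illustration, not an Indian legal standard and not something the cadence above prescribes. The records are synthetic and are assumed to be pre-filtered to candidates with comparable skill evidence.

```python
from collections import defaultdict

def selection_rates(records):
    """records: iterable of (group, shortlisted) pairs -> {group: shortlist rate}."""
    totals = defaultdict(int)
    picked = defaultdict(int)
    for group, shortlisted in records:
        totals[group] += 1
        picked[group] += int(shortlisted)
    return {g: picked[g] / totals[g] for g in totals}

def impact_ratios(rates):
    """Each group's rate divided by the highest group's rate."""
    best = max(rates.values())
    if best == 0:
        return {g: 1.0 for g in rates}  # nobody shortlisted, so no disparity to measure
    return {g: r / best for g, r in rates.items()}

def flagged_groups(ratios, threshold=0.8):
    """Groups whose ratio falls below the (assumed) four-fifths threshold."""
    return sorted(g for g, r in ratios.items() if r < threshold)

# Synthetic quarter: 50 candidates per group, comparable skills assumed.
records = (
    [("men", True)] * 20 + [("men", False)] * 30
    + [("women", True)] * 10 + [("women", False)] * 40
)

rates = selection_rates(records)
ratios = impact_ratios(rates)
print(rates)                   # {'men': 0.4, 'women': 0.2}
print(flagged_groups(ratios))  # ['women']
```

The same functions work for age bands or college tiers by changing the group label. A flag is a prompt for a human to read the evidence citations, not a verdict on its own.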
What to ask vendors before buying
Beyond the three live demo tests, these are the operational questions that separate a tool that works in demos from a tool that works in production.
- What is the latency for screening 200 CVs? Under 15 minutes is production-ready. Over 30 minutes means recruiters will not wait and will revert to manual.
- Does the system retain CVs and for how long? 12 months or less aligns with DPDPA defaults. Indefinite retention is a compliance hazard.
- Can the scoring rubric be customized per client or per role? A fixed rubric that cannot adapt to a specific JD is a Gen 2 giveaway. Real Gen 4 lets you weight criteria per role.
- What is the pricing model, and is there an API rate limit? If the vendor charges per screening run, back-calculate your monthly cost at realistic volumes (300 roles x 200 CVs = 60,000 CVs/month), and check that the rate limit can sustain that volume.
- Can the vendor show you a real customer running 50+ roles/month on the platform, not a demo account?
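The volume and latency questions reduce to a few lines of arithmetic worth running before the sales call. In the sketch below, the 300 roles, 200 CVs, and 15-minute figures come from the list above; the ₹2 per-CV price is a made-up placeholder to be replaced with the vendor's actual quote.

```python
def monthly_cv_volume(roles_per_month: int, cvs_per_role: int) -> int:
    return roles_per_month * cvs_per_role

def monthly_cost_inr(volume: int, price_per_cv_inr: float) -> float:
    """Cost under per-screening pricing."""
    return volume * price_per_cv_inr

def seconds_per_cv(batch_minutes: float, batch_size: int) -> float:
    """Average per-CV time budget implied by a batch latency target."""
    return batch_minutes * 60 / batch_size

volume = monthly_cv_volume(300, 200)
print(volume)                         # 60000

# Hypothetical price of 2 rupees per CV; substitute the real quote.
print(monthly_cost_inr(volume, 2.0))  # 120000.0

# 200 CVs in 15 minutes leaves 4.5 seconds per CV on average.
print(seconds_per_cv(15, 200))        # 4.5
```

If the vendor processes CVs in parallel, the per-CV figure is an average budget rather than a hard limit on any single call.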
The build-vs-buy question
Staffing agencies doing ₹5 crore+ ARR sometimes ask whether to build their own Gen 4 screening in-house on top of GPT-4 or Claude APIs. The short answer is almost always no, for three reasons.
First, the engineering work is not the hard part. Wiring a JD and a CV into a Claude API call with a structured prompt takes a competent engineer 2 weeks. The hard part is the rubric, the evidence citation UX, the client portal integration, the audit trail, the adverse-impact monitoring, and the continuous prompt tuning as you discover edge cases. That easily becomes a 6-12 month engineering and product effort, and your engineers will be building commodity infrastructure instead of differentiated agency services.
Second, the cost arithmetic is unforgiving. At 60,000 CV screenings per month and current Claude API pricing, you are looking at ₹2-4 lakh/month in API costs alone, before engineering salaries. A commercial tool priced at ₹25,000-50,000/month for unlimited screening costs roughly 4-16x less than those API bills, and the gap widens once salaries are included.
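Taking the figures in this paragraph at face value, the build-versus-buy ratio can be worked out from the endpoints of the two ranges. This is only a sketch of the arithmetic; the rupee amounts are the ones quoted above and have not been checked against any price list.

```python
LAKH = 100_000  # 1 lakh = 100,000 rupees

monthly_cvs = 60_000
api_cost_inr = (2 * LAKH, 4 * LAKH)   # in-house API spend per month, low and high
vendor_price_inr = (25_000, 50_000)   # commercial tool per month, low and high

def ratio_band(build, buy):
    """(smallest, largest) build/buy cost ratio across the two ranges."""
    return build[0] / buy[1], build[1] / buy[0]

low, high = ratio_band(api_cost_inr, vendor_price_inr)
per_cv = tuple(round(cost / monthly_cvs, 2) for cost in api_cost_inr)

print(low, high)  # 4.0 16.0
print(per_cv)     # (3.33, 6.67) rupees of API spend per CV
```

Pairing low with low and high with high gives 8x in both cases, so a single headline multiple sits inside a band that runs from 4x to 16x depending on which ends of the ranges apply to you.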
Third, the regulatory surface is non-trivial. DPDPA requires data flow documentation, retention policies, breach notification workflows, candidate consent mechanisms, and a right-to-deletion implementation. A vendor ships this on day one. Building in-house means your ops lead becomes a part-time compliance officer.