03. The Evaluation Panel
The evaluation panel is the only place in Refery where LLM compute is spent on hiring decisions. By the time the panel runs, the matching engine has already narrowed the field to a small set of candidate-role pairs that pass the deterministic filters and rank highly on multi-vector similarity. The panel's job is to make the final call on the genuinely ambiguous ones.
The panel is adversarial by design. Five personas evaluate independently, each with deliberately different priors, and one of them (the Skeptic) holds hard-veto power that overrides aggregate scoring. This architecture guards against the failure mode most LLM-based evaluation systems share: confidence collapse, where a single LLM optimizes for politeness over accuracy.
The five personas
Each persona is a distinct prompt with its own lens, role-specific instructions, and output format. All five personas see the same shared context (signal engine output, Gmail extracts, market calibration data) but interpret it through their own frame.
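In code terms, a persona reduces to a named system prompt applied to that shared context. A minimal sketch of the shapes involved (interface and field names here are illustrative assumptions, not Refery's actual schema):
// panel/types.ts (field names are hypothetical)
// The shared context every persona sees before applying its own frame.
export interface SharedContext {
  signalEngineOutput: string; // logo tier, trajectory, flags
  gmailExtracts: string;      // relevant email excerpts
  marketCalibration: string;  // comp and placement market data
}

// A persona is a fixed lens: a name plus a system prompt.
export interface Persona {
  name: string;         // e.g. 'skeptic'
  systemPrompt: string; // role-specific instructions and priors
}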
Persona 1: The Hiring Manager (Functional Lead)
Lens. "Can they actually do the job."
Assesses. Craft depth, scope of past work, ownership level, output velocity. The Hiring Manager is the persona most concerned with the literal job requirements. Did this person ship the kind of work being asked for? At what scope and ownership level?
Output. Functional fit %, one sample screening question.
Sample questions. "Walk me through the most ambiguous problem you owned end-to-end." "Smallest team you have shipped real production work in?"
Persona 2: The Founder (Early-Stage Operator)
Lens. "Do they have startup DNA."
Assesses. Missionary or mercenary? Builder energy? Will they self-direct without scaffolding? Have they thrived in chaos before? This persona is calibrated to spot the difference between someone who can execute a defined role and someone who can carry an ambiguous problem from inception to ship without a PM, without a spec, without a manager checking in.
Output. Startup fit %, one sample screening question.
Sample questions. "Why this stage of company?" "Last time you shipped something with no PM, no spec, just a problem?"
Persona 3: The Skeptic (Risk Officer)
Lens. "What are we missing."
Assesses. Job hopping, title inflation, gaps, why they left each role, flight risk, comp mismatch, structural placement difficulty. The Skeptic owns interpretation of the non-tech flag and the placement constraint output. The Skeptic is the only persona with hard veto power: a Skeptic veto overrides any aggregate score from the other four personas.
Output. Risk-adjusted confidence %, one sample screening question, explicit veto flag if applicable.
Sample questions. "Why did you leave [each role]?" "Longest role tenure and why did you stay?"
Persona 4: The Future Peer
Lens. "Would I want to work next to this person."
Assesses. Communication style, ego check, intellectual honesty, collaboration. The Future Peer is logo-blind by design: it judges substance only, ignoring company prestige, as a deliberate counterweight to logo bias. A candidate from Stripe and a candidate from a no-name startup look identical to the Future Peer.
Output. Team fit %, one sample screening question.
Sample questions. "Tell me about a time you were wrong and changed your mind." "Best person you have ever worked with and why?"
Persona 5: Stage & Signal
Lens. "Is this person calibrated for this stage."
Assesses. Trajectory, logo tier, pedigree, AI bonus, sales client profile. Maps strengths to bottlenecks at each company stage. This persona produces the stage fit matrix: a per-stage hire-ROI score across Seed, Series A, Series B, and Series C+.
Output. Hire ROI % per stage, one sample screening question.
Sample questions. "What metric did you most directly move?" "Justify your hire to the board in one sentence."
Persona prompt structure
Each persona prompt is structured to prevent a known LLM failure mode: agreement collapse. When given the same input and asked for an evaluation, vanilla LLMs tend to produce similar scores no matter how the evaluator is framed. This collapses the panel back into a single voice.
Refery's panel mitigates this in three ways:
- Different priors per persona. Each persona's system prompt carries explicit instructions about what to weight heavily and what to discount. The Skeptic is instructed to push back on the rest of the panel.
- Logo blinding for the Future Peer. The Future Peer's prompt specifically excludes company names from its view of the candidate. It sees roles, scope, and outputs only (see the sketch after this list).
- Forced discrete brackets. Personas must commit to a specific bracket (Top 1%, Top 5%, Top 10%, Top 25%, Top 50%, Bottom 50%) rather than producing a continuous score. This forces real differentiation.
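As an illustration of the second mitigation, logo blinding can be implemented as a redaction pass over the candidate context before it reaches the Future Peer. A minimal sketch (the function and the placeholder scheme are hypothetical):
// panel/blind.ts (hypothetical sketch)
// Replace known company names with neutral placeholders so the
// Future Peer judges roles, scope, and outputs only.
export function blindLogos(context: string, companies: string[]): string {
  return companies.reduce(
    (text, company, i) =>
      text.split(company).join(`Company ${String.fromCharCode(65 + i)}`),
    context,
  );
}

// Usage: blindLogos(candidateContext, ['Stripe', 'Acme Robotics'])
// renders the two employers as "Company A" and "Company B".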
Below is a simplified extract from the Skeptic prompt (the most distinctive of the five):
// panel/personas/skeptic.ts
export const SKEPTIC_SYSTEM_PROMPT = `
You are the Skeptic on a 5-persona hiring panel. Your job is to push back.
The other four personas will tend toward consensus. Your role is the
counterweight. You are looking for what the rest of the panel is missing.
Specifically, evaluate:
1. Job hopping: tenure under 18 months at multiple roles is a red flag
unless explained by acquisition, layoff, or visa transitions.
2. Title inflation: VP at 2-person company is not the same as VP at 200.
Adjust accordingly.
3. Reasons for leaving: why did they leave each role? Push the panel to
verify before celebrating.
4. Flight risk: do their priorities (location, comp, stage preference)
match this role? If not, this hire will not stick.
5. Non-tech flag interpretation: if the signal engine raised the flag,
examine the suppression criteria carefully.
6. Placement constraint: if the candidate is structurally hard to place,
this is a market signal, not a candidate flaw, but it changes urgency.
You have HARD VETO power. If you identify a structural disqualifier (work
authorization mismatch, off-market comp, fundamental wrong-stage fit,
veto-level cultural concerns), you may issue a veto that overrides any
aggregate score the other personas produce.
Output format:
- Risk-adjusted confidence: <bracket>
- Reasoning: 2-4 sentences, name risks directly
- Veto: yes/no
- If veto: one sentence reason
- Sample question: <one question that would de-risk this hire>
`;
Aggregation and bracketing
After all five personas have run, the synthesizer aggregates their outputs into a single decision.
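Running the panel itself is embarrassingly parallel: five independent prompt invocations over the same shared context. A minimal sketch, assuming the types sketched earlier and a generic callLLM helper (hypothetical; it stands in for whatever completion API is used):
// panel/run.ts (hypothetical sketch)
// Each persona gets its own system prompt but the same shared context,
// so the five verdicts stay independent of one another.
export async function runPanel(
  personas: Persona[],
  context: SharedContext,
  callLLM: (systemPrompt: string, input: string) => Promise<PersonaVerdict>,
): Promise<Record<string, PersonaVerdict>> {
  const input = JSON.stringify(context);
  const verdicts = await Promise.all(
    personas.map(async (p) => [p.name, await callLLM(p.systemPrompt, input)] as const),
  );
  return Object.fromEntries(verdicts);
}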
Weighted aggregate
// panel/synthesizer.ts
const PERSONA_WEIGHTS = {
  hiring_manager: 2.0,
  founder: 2.0,
  skeptic: 1.0,
  future_peer: 1.0,
  stage_signal: 1.0,
};

export function aggregateScore(panel: PanelOutput): AggregateResult {
  // Hard veto overrides everything, regardless of the other four scores
  if (panel.skeptic.veto) {
    return {
      bracket: 'bottom_50_percent', // a veto forces the lowest bracket (a pass)
      reason: `Skeptic veto: ${panel.skeptic.vetoReason}`,
      override: true,
    };
  }

  // Weighted average of the five bracket scores
  const totalWeight = Object.values(PERSONA_WEIGHTS).reduce((a, b) => a + b, 0);
  const weightedSum =
    bracketToScore(panel.hiring_manager.bracket) * PERSONA_WEIGHTS.hiring_manager +
    bracketToScore(panel.founder.bracket) * PERSONA_WEIGHTS.founder +
    bracketToScore(panel.skeptic.bracket) * PERSONA_WEIGHTS.skeptic +
    bracketToScore(panel.future_peer.bracket) * PERSONA_WEIGHTS.future_peer +
    bracketToScore(panel.stage_signal.bracket) * PERSONA_WEIGHTS.stage_signal;
  const aggregate = weightedSum / totalWeight;

  return {
    bracket: scoreToBracket(aggregate), // snap back to a discrete bracket
    reason: synthesizePanelReasoning(panel),
    override: false,
  };
}
The Hiring Manager and Founder personas are weighted 2x because they speak most directly to the role's core requirements. The Skeptic's 1x weight in normal aggregation is by design: the veto path is where the Skeptic exerts disproportionate influence, not by skewing the average.
Forced discrete brackets
Continuous scoring (e.g., "73%") creates a false sense of precision and makes cross-candidate comparison muddy. Refery's panel forces every persona, and the aggregate, into one of six discrete brackets:
type Bracket =
  | 'top_1_percent'      // exceptional, rare find
  | 'top_5_percent'      // strong yes
  | 'top_10_percent'     // solid hire
  | 'top_25_percent'     // worth interviewing
  | 'top_50_percent'     // likely pass
  | 'bottom_50_percent'; // pass
This is the same discipline that strong interview rubrics use. The forced choice produces actual differentiation across candidates, which makes prioritization tractable for the operator.
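The synthesizer above leans on bracketToScore and scoreToBracket, which are not shown. One consistent way to define them, sketched with illustrative midpoint cutoffs (the numeric values are assumptions, not Refery's calibration):
// panel/brackets.ts (numeric values are illustrative, not Refery's)
// Representative score per bracket so brackets can be averaged.
const BRACKET_SCORES: Record<Bracket, number> = {
  top_1_percent: 99,
  top_5_percent: 95,
  top_10_percent: 90,
  top_25_percent: 75,
  top_50_percent: 50,
  bottom_50_percent: 25,
};

export function bracketToScore(bracket: Bracket): number {
  return BRACKET_SCORES[bracket];
}

// Snap a weighted average back to a bracket; cutoffs are midpoints
// between adjacent representative scores.
export function scoreToBracket(score: number): Bracket {
  if (score >= 97) return 'top_1_percent';
  if (score >= 92.5) return 'top_5_percent';
  if (score >= 82.5) return 'top_10_percent';
  if (score >= 62.5) return 'top_25_percent';
  if (score >= 37.5) return 'top_50_percent';
  return 'bottom_50_percent';
}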
The stage fit matrix
Stage fit is the most operationally useful output the panel produces. A single aggregate score answers "is this person a good hire?" but the stage fit matrix answers a more useful question: "which type of company should we route this person to?"
Seed: Top 3% peak fit, builder DNA, has done 0 to 1 twice
Series A: Top 5% strong, comfortable with ambiguity
Series B: Top 15% capable but no team-of-10 management experience
Series C+: Top 40% under-leveled for scaled orgs, missing process muscle
A candidate can legitimately be Top 3% for Seed and Top 40% for Series C+. That is not a contradiction. That is the matrix doing its job. Refery uses this matrix to route candidates to the right clients on the roster.
When the spread between best and worst stage in the matrix is 3+ brackets, the system surfaces an explicit stage mismatch flag:
Stage mismatch: best at Seed (Top 5%), drops to Top 25% at Series C+. Pitch to early-stage clients only.
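That spread check is straightforward to express against the six-bracket scale. A sketch (the StageFitMatrix type and stage keys are assumptions):
// panel/stageFit.ts (hypothetical sketch)
// The six brackets as an ordinal scale, best (index 0) to worst.
const BRACKET_ORDER: Bracket[] = [
  'top_1_percent',
  'top_5_percent',
  'top_10_percent',
  'top_25_percent',
  'top_50_percent',
  'bottom_50_percent',
];

// Per-stage brackets; the stage keys are assumptions.
export type StageFitMatrix = Record<
  'seed' | 'series_a' | 'series_b' | 'series_c_plus',
  Bracket
>;

// Flag a stage mismatch when best and worst stages sit 3+ brackets apart.
export function hasStageMismatch(matrix: StageFitMatrix): boolean {
  const ranks = Object.values(matrix).map((b) => BRACKET_ORDER.indexOf(b));
  return Math.max(...ranks) - Math.min(...ranks) >= 3;
}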
The brief
The output of the panel is a structured intelligence brief that is written to the candidate's ai_analysis field in Postgres. This is the database-of-record artifact. It is what gets shown to the hiring manager (with internal-only sections redacted) and what shapes outreach and prioritization downstream.
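Persisting the brief is a single column write. A minimal sketch, assuming node-postgres and a candidates table carrying the ai_analysis column named above (client setup and table name are assumptions); the brief template itself follows the sketch:
// brief/persist.ts (hypothetical sketch, using node-postgres)
import { Pool } from 'pg';

const pool = new Pool(); // connection config from standard PG* env vars

// Write the rendered brief to the candidate's ai_analysis field,
// making Postgres the database of record for panel output.
export async function saveBrief(candidateId: string, brief: string): Promise<void> {
  await pool.query(
    'UPDATE candidates SET ai_analysis = $1 WHERE id = $2',
    [brief, candidateId],
  );
}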
═══════════════════════════════════════════════════════════════
REFERY ELITE CANDIDATE INTELLIGENCE BRIEF
<Name> | <Role focus>
Assessed by: AI Recruiter Engine, 5-Persona Panel
Date: <today>
═══════════════════════════════════════════════════════════════
━━━ HEADLINE ━━━
Aggregate: Top X%
Recommendation: <Strong yes / Yes / Yes (conditional) / Interview / Pass>
━━━ STAGE FIT MATRIX ━━━
Seed: Top X% <one-line reasoning>
Series A: Top X% <one-line reasoning>
Series B: Top X% <one-line reasoning>
Series C+: Top X% <one-line reasoning>
Best stage fit: <stage(s)>
━━━ SIGNAL READOUT ━━━
Logo tier: <raw → modified, with reasoning>
Investor pedigree: <list + funding>
AI bonus: <if applicable>
Trajectory: <one-line summary>
Non-tech flag: <RAISED/SUPPRESSED + mitigation if any>
━━━ PLACEMENT CONSTRAINT CHECK ━━━
Comp floor reach: X% of board
Location reach: X% of board
Visa reach: X% of board
Verdict: <easy to place / standard / structurally hard, rationale>
━━━ PANEL VERDICTS ━━━
Hiring Manager (functional fit X%):
<2 to 4 sentence reasoning>
Sample question: "<best question from this persona>"
Founder (startup fit X%):
<2 to 4 sentence reasoning>
Sample question: "<best question from this persona>"
Skeptic (risk-adjusted X%) <VETO if applicable>:
<2 to 4 sentence reasoning, including any flags>
Sample question: "<best question from this persona>"
Future Peer (team fit X%):
<2 to 4 sentence reasoning, logo-blind>
Sample question: "<best question from this persona>"
Stage & Signal (per-stage ROI %):
<2 to 4 sentence reasoning mapping strengths to stages>
Sample question: "<best question from this persona>"
━━━ TOP 3 SCREENING QUESTIONS ━━━
1. <question> (from <persona>)
2. <question> (from <persona>)
3. <question> (from <persona>)
━━━ ROLE-BY-ROLE FIT ━━━
| Priority | Company | Role | Comp | Stage | Why |
| 1 | ... | ... | ... | ... | <fit rationale> |
━━━ RECRUITER VERDICT ━━━
GRADE: <A+ / A / A- / B+ / pass>
URGENCY: <high/med/low + reason>
ACTION: <concrete next step>
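As noted earlier, the hiring-manager view redacts internal-only sections. A sketch of that redaction pass over the rendered brief (which sections count as internal is an assumption here; the section delimiter follows the template above):
// brief/redact.ts (hypothetical sketch)
// Sections assumed internal-only; the real list may differ.
const INTERNAL_SECTIONS = ['PLACEMENT CONSTRAINT CHECK', 'RECRUITER VERDICT'];

// Drop internal sections before the brief is shown to a hiring manager.
// Sections are delimited by the ━━━ header lines in the template.
export function redactForHiringManager(brief: string): string {
  const sections = brief.split(/(?=━━━ )/); // keep each header with its body
  return sections
    .filter((s) => !INTERNAL_SECTIONS.some((name) => s.startsWith(`━━━ ${name}`)))
    .join('');
}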
Why this is novel
The five-persona adversarial panel is the most unusual part of Refery's architecture and the place where the most concentrated technical IP lives.
- Adversarial multi-persona evaluation with hard-veto. Most LLM evaluation systems either use a single LLM (which overfits to politeness and consensus) or use simple multi-LLM voting (which collapses to majority opinion). The hard-veto Skeptic role is structurally different: it is a single persona with override authority, calibrated specifically to surface what the consensus is missing.
- Forced discrete brackets instead of continuous scores. This is borrowed from disciplined interviewing rubrics and rare in LLM-based evaluation systems, which tend to default to continuous percentages.
- Logo blinding for the Future Peer persona. Removing company names from one persona's view is a deliberate bias-correction technique. It produces measurable disagreements with the rest of the panel that are themselves a useful signal.
- Stage fit matrix as a routing primitive. Most evaluation systems produce a single fit score. The matrix produces four scores across stages, which directly drives candidate-to-client routing decisions.
- Persona prompts engineered against consensus collapse. The persona prompts are written to actively resist agreement, including explicit instructions for the Skeptic to push back on the rest of the panel.
The output of this system is consistently higher quality than any single-LLM evaluation, and it is auditable: every candidate's brief contains the reasoning of all five personas, so a human operator can see exactly where the panel agreed and where it disagreed. This auditability is what makes the system trustworthy at scale.