06. The Data Flywheel
The most defensible part of any AI system is the data it learns from. Refery's data flywheel is built around two unusual assets: a network of 300+ operator scouts who provide structured candidate submissions, and a pipeline outcome ledger that tracks every placement decision back to its predictions.
This chapter describes how the scout network functions as a distributed human-in-the-loop labeling layer, how the schema is structured to capture training signal at every step, and how outcomes flow back into the system to refine scoring over time.
Why scouts matter (technically)
Most "AI recruiting" platforms have a cold-start problem: they need labeled data to train a matching model, but they have no way to acquire labels except through expensive expert review. The labels they do have are noisy (recruiter-generated free text) and one-sided (no negative examples; nobody documents the candidates they passed on).
Refery's scout network solves this structurally. Each scout submission is, by construction, a labeled training example:
- Positive label: A scout submits a candidate to Refery for a specific role. This is implicit positive signal that the scout (a domain expert) believes this candidate is a fit.
- Quality grade: Refery's panel evaluates the candidate and assigns a bracket (Top 1%, Top 5%, etc.) and a recruiter grade (A+, A, A-, B+, pass). This becomes the calibrated quality signal.
- Outcome label: The pipeline state machine eventually produces a terminal stage (hired, rejected, withdrawn). This is the ground-truth outcome.
- Time series: Stage transitions in pipeline_stage_history produce a time series of intermediate signals (was the candidate's profile shared, did they reply, did the client engage).
Each candidate that flows through the system therefore generates ~5-10 labeled signals across multiple time horizons. With 300+ scouts each submitting periodically, the labeled data accumulates faster than any single recruiter could generate manually.
The structured submission contract
Scouts do not submit free-form resumes. Submissions are structured at intake: each submission specifies the candidate's identity, prior context (how the scout knows them), comp expectations, location stance, stage preference, and one or more roles the scout believes this candidate would fit.
This structure forces the scout to commit to specific predictions ("this person fits this role at this comp") which can later be validated against actual outcomes.
CREATE TABLE scout_submissions (
id uuid PRIMARY KEY DEFAULT gen_random_uuid(),
scout_id uuid NOT NULL REFERENCES scouts(id),
candidate_id uuid NOT NULL REFERENCES candidates(id),
submitted_at timestamptz NOT NULL DEFAULT now(),
-- Scout's prediction at submission time
predicted_fit_companies uuid[] NOT NULL,
predicted_fit_roles uuid[],
scout_confidence text CHECK (scout_confidence IN ('low', 'medium', 'high')),
scout_relationship text, -- "former colleague", "ran a hackathon together", etc.
-- Refery's panel evaluation
panel_aggregate_bracket text,
panel_grade text,
panel_evaluated_at timestamptz,
-- Outcome (populated as pipeline progresses)
outcome_pipeline_id uuid REFERENCES job_candidate_pipeline(id),
outcome_terminal_stage text, -- 'hired', 'rejected', 'withdrawn', or null while in motion
outcome_terminal_at timestamptz
);
CREATE INDEX idx_scout_submissions_scout ON scout_submissions(scout_id);
CREATE INDEX idx_scout_submissions_outcome ON scout_submissions(outcome_terminal_stage);
This schema captures the full prediction-and-outcome chain in a single row, which makes it directly trainable.
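Because each row carries both the prediction and the outcome, converting it into a supervised training example is mechanical. A hedged sketch, using the column names from the schema above (the feature encodings are illustrative assumptions):

```python
# Sketch: one scout_submissions row -> one (features, label) pair.
# Column names follow the schema above; the numeric encodings are
# illustrative, not tuned production values.
GRADE_SCORE = {"A+": 1.0, "A": 0.9, "A-": 0.8, "B+": 0.6, "pass": 0.0}
CONFIDENCE = {"low": 0.33, "medium": 0.66, "high": 1.0}

def to_training_example(row: dict):
    """Features known at/after submission, label from the terminal stage.
    Rows still in motion (no terminal stage) are skipped."""
    if row.get("outcome_terminal_stage") is None:
        return None
    features = {
        "scout_confidence": CONFIDENCE.get(row.get("scout_confidence"), 0.0),
        "panel_grade": GRADE_SCORE.get(row.get("panel_grade"), 0.0),
        "n_predicted_roles": len(row.get("predicted_fit_roles") or []),
    }
    label = 1 if row["outcome_terminal_stage"] == "hired" else 0
    return features, label

example = to_training_example({
    "scout_confidence": "high", "panel_grade": "A-",
    "predicted_fit_roles": ["r1", "r2"], "outcome_terminal_stage": "hired",
})
print(example)
# ({'scout_confidence': 1.0, 'panel_grade': 0.8, 'n_predicted_roles': 2}, 1)
```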
Scout quality scoring
Not all scouts are equal. Some have high precision (their submissions consistently grade A or A-); some have high recall (they submit a lot, with mixed quality); some are domain specialists (their fintech submissions are excellent, their AI submissions are weak).
Refery scores scouts on rolling windows using their submission outcomes. A scout's score affects how their future submissions are prioritized in the operator's queue and informs the platform's reputation system.
-- Rolling 90-day scout precision
SELECT
s.id,
s.name,
  COUNT(ss.id) AS total_submissions,
  COUNT(*) FILTER (WHERE ss.panel_grade IN ('A+', 'A', 'A-')) AS high_grade_count,
  COUNT(*) FILTER (WHERE ss.panel_grade IN ('A+', 'A', 'A-'))::float
    / NULLIF(COUNT(ss.id), 0) AS precision_score,
  COUNT(*) FILTER (WHERE ss.outcome_terminal_stage = 'hired') AS placements,
  COUNT(*) FILTER (WHERE ss.outcome_terminal_stage = 'hired')::float
    / NULLIF(COUNT(ss.id), 0) AS hire_rate
FROM scouts s
LEFT JOIN scout_submissions ss
  ON ss.scout_id = s.id
  -- The date filter belongs in the join condition, not a WHERE clause:
  -- a WHERE filter on ss would silently turn the LEFT JOIN into an inner
  -- join and drop scouts with no submissions in the window.
  AND ss.submitted_at > now() - interval '90 days'
GROUP BY s.id, s.name
ORDER BY precision_score DESC NULLS LAST;
This same query, run with different filters, produces:
- Scout-by-vertical scores (best fintech scouts, best AI scouts)
- Scout-by-stage scores (best Seed scouts, best Series A scouts)
- Scout-by-function scores (best for engineering submissions vs sales submissions)
These are used to weight submissions when surfacing them to the operator: a Series A engineering submission from a scout with strong Series A engineering precision is prioritized over the same submission from a scout with no track record in that segment.
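The prioritization described above can be sketched as a simple multiplier on the submission's base score. The segment key, the neutral prior, and the multiplier shape are assumptions for illustration:

```python
# Sketch of segment-aware submission prioritization. The segment key
# (vertical, stage, function), the base score, and the multiplier shape
# are illustrative assumptions.
def priority(base_score: float,
             scout_segment_precision: dict,
             segment: tuple,
             default_precision: float = 0.5) -> float:
    """Boost a submission by the scout's realized precision in this segment.
    Scouts with no track record fall back to a neutral prior (multiplier 1.0)."""
    precision = scout_segment_precision.get(segment, default_precision)
    return base_score * (0.5 + precision)

seg = ("fintech", "series_a", "engineering")
strong = priority(0.8, {seg: 0.9}, seg)   # proven scout in this segment
unknown = priority(0.8, {}, seg)          # no track record: neutral prior
print(strong > unknown)  # True: the proven scout's submission ranks higher
```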
The scoring model retraining loop
The signal engine and matching scores are not static. They are recalibrated periodically against the full outcome corpus.
Calibration cycle
Every quarter:
- Pull all terminal-state pipeline rows from the prior quarter. These are the labels.
- For each row, recover the scoring features at submission time. These are stored in pipeline_stage_history.notes and candidates.ai_analysis.
- Compute calibration tables. What was the actual hire rate for candidates the panel rated Top 5%? Top 10%? If the calibration is off (e.g., Top 5% candidates only convert at the rate Top 10% should), the bracket thresholds are recalibrated.
- Compute axis-weight effectiveness. Was the stage_fit axis a strong predictor of placement? Was comp_signals weighted too high? The retrieval weights in chapter 02 are tuned against this.
- Diff the proposed weight changes. A human operator reviews the diff and approves or rejects the proposed update.
This is deliberately not fully automated. The operator-in-the-loop step prevents drift from local optima and keeps the system's behavior interpretable.
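The calibration-table step of the cycle reduces to a grouped hire-rate computation over terminal-state rows. A minimal sketch, assuming rows of (panel bracket, terminal stage):

```python
from collections import defaultdict

# Minimal sketch of the calibration-table step: actual hire rate by panel
# bracket, computed from terminal-state rows. Input shape is an assumption.
def calibration_table(rows):
    counts = defaultdict(lambda: [0, 0])  # bracket -> [hired, total]
    for bracket, terminal_stage in rows:
        counts[bracket][1] += 1
        if terminal_stage == "hired":
            counts[bracket][0] += 1
    return {b: hired / total for b, (hired, total) in counts.items()}

rows = [
    ("Top 5%", "hired"), ("Top 5%", "hired"), ("Top 5%", "rejected"),
    ("Top 10%", "hired"), ("Top 10%", "rejected"), ("Top 10%", "withdrawn"),
]
table = calibration_table(rows)
print(table)  # Top 5% converts at ~0.67, Top 10% at ~0.33 in this toy sample
```

If Top 5% and Top 10% converge toward the same rate, that is the signal to recalibrate the bracket thresholds.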
Outcome attribution
A subtle but important point: not every "hired" outcome is positive signal for the matching system. A candidate who was hired but then left within 6 months is a weak positive. A candidate who was hired and is still at the company 18 months later is a strong positive.
Refery extends the outcome ledger with retention checks:
ALTER TABLE scout_submissions
  ADD COLUMN retention_check_6mo boolean DEFAULT NULL,
  ADD COLUMN retention_check_18mo boolean DEFAULT NULL;
The operator periodically updates these checks based on LinkedIn or direct contact with the placed candidate. The training loop weights long-retention placements more heavily.
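The retention-based weighting can be sketched as a sample-weight function over the two check columns. The specific weight values are illustrative assumptions, not tuned constants:

```python
# Sketch of retention-weighted placement labels, using the retention_check_*
# columns above. The weight values are illustrative, not tuned constants.
def placement_weight(hired: bool,
                     retained_6mo,
                     retained_18mo) -> float:
    """Training weight for a placement: longer retention => stronger positive.
    None means the retention check has not been performed yet."""
    if not hired:
        return 0.0
    if retained_18mo:
        return 1.0   # strong positive: still at the company at 18 months
    if retained_6mo:
        return 0.7   # moderate positive: cleared the 6-month check
    if retained_6mo is False:
        return 0.3   # weak positive: hired but left within 6 months
    return 0.5       # retention unknown so far: neutral prior

print(placement_weight(True, True, True))    # 1.0
print(placement_weight(True, False, None))   # 0.3
```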
The candidate brief as a database artifact
The structured intelligence brief produced by the panel (described in chapter 03) is written to candidates.ai_analysis as text. But the structure of the brief is consistent enough that it functions as semi-structured data: pattern-matching queries can extract specific sections (the bracket, the stage fit matrix, the panel verdicts) and reason about them across the candidate base.
This allows queries like:
-- Find all candidates with stage mismatch flag for routing
SELECT id, name, ai_analysis
FROM candidates
WHERE ai_analysis ILIKE '%Stage mismatch:%'
AND status IN ('active', 'reviewing');
-- Find all candidates with non-tech flag SUPPRESSED with a specific reason pattern
SELECT id, name, ai_analysis
FROM candidates
WHERE ai_analysis ILIKE '%Non-tech flag:%SUPPRESSED%'
AND ai_analysis ILIKE '%Goldman%early career%';
A future refactor will move these structured fields out of the text blob into dedicated columns, enabling stronger analytics. The current structure is a pragmatic intermediate state: storing the brief as text keeps the schema simple while still allowing pattern-based extraction.
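The same pattern-based extraction can be done application-side with regular expressions. A sketch, assuming section labels like "Bracket:" and "Stage mismatch:" in the brief text (the exact brief layout is an assumption):

```python
import re

# Illustrative extraction of semi-structured fields from the ai_analysis
# text blob. "Stage mismatch:" mirrors the pattern queried above;
# "Bracket:" and the overall brief layout are assumptions.
BRACKET_RE = re.compile(r"Bracket:\s*(Top \d+%)")
MISMATCH_RE = re.compile(r"Stage mismatch:\s*([^\n]+)")

def parse_brief(text: str) -> dict:
    """Pull the bracket and any stage-mismatch note out of a brief."""
    bracket = BRACKET_RE.search(text)
    mismatch = MISMATCH_RE.search(text)
    return {
        "bracket": bracket.group(1) if bracket else None,
        "stage_mismatch": mismatch.group(1).strip() if mismatch else None,
    }

brief = "Bracket: Top 5%\nStage mismatch: strong at growth, unproven at seed\n"
print(parse_brief(brief))
# {'bracket': 'Top 5%', 'stage_mismatch': 'strong at growth, unproven at seed'}
```

Moving these fields into dedicated columns, as the planned refactor does, replaces this extraction with ordinary column reads.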
Schema summary
The full set of tables that participate in the data flywheel:
candidates - core identity + signal engine output + ai_analysis
companies - target client companies + do_not_contact
jobs - open roles + filter classification + status
job_candidate_pipeline - active pipeline rows with current stage
pipeline_stage_history - append-only transition log with evidence
recruiter_notes - operator notes, timestamped, typed
job_candidate_notes - notes scoped to a specific pipeline row
scouts - scout identities + tier + relationship status
scout_submissions - submission contract + panel grade + outcome
candidate_embeddings - multi-vector representations (pgvector)
job_embeddings - multi-vector representations (pgvector)
outreach_log - every outbound message with channel + outcome
Twelve tables, all relational, all in a single Postgres instance. No separate vector database. No separate analytics warehouse. No separate logging system. The entire data architecture fits in a single SQL schema.
This is intentional. The cost and operational complexity of running 4-5 separate data systems (which is the typical "AI startup" data architecture in 2026) is one of the largest hidden expenses of an AI-driven product. Refery avoids it entirely.
Why this is novel
- Scout network as structured HITL labeling. A 300+ operator network with a structured submission contract produces labels at a rate that no internal recruiting team could match.
- Outcome ledger spanning the entire prediction-to-placement chain. Every prediction the system makes is eventually validated against a terminal outcome stored alongside it.
- Scout quality scoring fed back into prioritization. The system gets smarter about which scouts to listen to over time, as a function of their realized precision.
- Single-database architecture. Vector search, pipeline state, history, embeddings, and outreach logs all live in one Postgres instance. This is operationally radical compared to the typical AI startup multi-system stack.
- Quarterly calibration with operator-in-the-loop. Not fully automated, not fully manual. The retraining cycle preserves interpretability while still benefiting from accumulated outcomes.
The flywheel is what makes Refery's AI quality improve faster than a competitor that lacks the scout network. The scout submissions are the moat.