07. Scalability and Efficiency
This is the chapter that does the math.
A core engineering claim about Refery is that the platform produces top-1% recruiter quality output while consuming roughly 4% of the compute that a naive "LLM-evaluates-every-pair" architecture would burn through (the ~96% reduction computed below). This is not a marketing claim. The numbers below are conservative, computed from real workload sizing, and the architectural choices that produce them are described component by component.
The same architectural choices that drive cost efficiency also drive energy efficiency. AI compute, particularly large-model inference, has measurable carbon impact. Refery's design produces dramatically lower energy consumption per hiring decision than a naive AI recruiting platform, and that delta scales linearly with platform usage.
The naive baseline
To compute Refery's efficiency advantage we first need a credible baseline. Consider a "naive AI recruiter" platform that does the simplest thing: for every active candidate and every open job, it asks an LLM to score the fit.
At Refery's current scale (illustrative numbers, in the right order of magnitude):
- 1,500 active candidates
- 200 open jobs
- = 300,000 candidate × job pairs
Suppose each pairwise evaluation uses GPT-4o-class compute with ~3,000 input tokens and ~2,000 output tokens, totaling ~5,000 tokens per evaluation.
| Metric | Naive baseline | Per evaluation |
|---|---|---|
| Total tokens per match cycle | 1,500,000,000 (1.5 billion) | 5,000 |
| Approx GPT-4o cost per cycle | ~$5,250 | ~$0.0175 |
| Approx energy per cycle | ~750 kWh equivalent | ~2.5 Wh |
| Wall clock time per cycle | ~hours, even heavily parallelized | ~2 seconds |
If this match cycle runs even weekly, the naive platform burns through ~$273,000 per year in inference cost and consumes the equivalent energy footprint of a small office building.
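For readers who want to check the arithmetic, here is a minimal Python sketch that reproduces the baseline table from the per-evaluation assumptions stated above (token counts, blended per-evaluation cost, and the 0.5 Wh per 1,000 tokens energy figure used later in this chapter):

```python
# Back-of-envelope reproduction of the naive baseline, using the assumptions stated in the text.
candidates, jobs = 1_500, 200
pairs = candidates * jobs                       # 300,000 candidate x job pairs
tokens_per_eval = 3_000 + 2_000                 # ~3K input + ~2K output tokens
total_tokens = pairs * tokens_per_eval          # 1.5 billion tokens per cycle

cost_per_eval = 0.0175                          # USD, blended GPT-4o-class rate assumed above
cycle_cost = pairs * cost_per_eval              # ~$5,250 per cycle

wh_per_1k_tokens = 0.5                          # conservative low end (see the energy section)
cycle_kwh = total_tokens / 1_000 * wh_per_1k_tokens / 1_000   # ~750 kWh per cycle

annual_cost = cycle_cost * 52                   # ~$273,000 per year at weekly cadence
annual_kwh = cycle_kwh * 52                     # ~39,000 kWh per year
```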
This is the platform Refery does not build.
Refery's architecture
Refery's design produces approximately the same hiring decisions as the naive baseline, but it allocates compute very differently.
graph LR
A[1500 candidates × 200 jobs<br/>= 300,000 pairs] --> B[Hard filters<br/>visa, salary, location, blacklist<br/>microseconds, free]
B -->|~75% eliminated| C[225,000 pairs eliminated]
B -->|~25% pass| D[~75,000 pairs<br/>multi-vector retrieval<br/>cosine similarity, free]
D -->|Top-30 per role| E[~6,000 candidate-role pairs<br/>signal engine evaluation<br/>microseconds, free]
E -->|High-quality subset| F[~2,000 pairs<br/>5-persona panel evaluation<br/>~25K tokens each]
F --> G[Final pipeline writes<br/>state machine + history]
Layer 1: Hard filters (free)
About 75% of all candidate-job pairs fail at least one hard filter (visa mismatch, salary floor, location incompatibility, blacklist hit). These are eliminated in microseconds with no LLM compute and no API call.
Tokens consumed: zero.
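A minimal sketch of what a hard-filter pass might look like; the field names and schema here are illustrative placeholders, not Refery's actual data model:

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    id: str
    needs_sponsorship: bool
    salary_floor: int
    locations: set[str]          # locations the candidate will work from

@dataclass
class Job:
    company_id: str
    sponsors_visas: bool
    salary_max: int
    location: str
    remote_ok: bool

def passes_hard_filters(c: Candidate, j: Job, blacklist: set[tuple[str, str]]) -> bool:
    """Microsecond-scale rejection: no embeddings, no LLM, no API call."""
    if c.needs_sponsorship and not j.sponsors_visas:
        return False                                  # visa mismatch
    if c.salary_floor > j.salary_max:
        return False                                  # salary floor above the role's ceiling
    if not j.remote_ok and j.location not in c.locations:
        return False                                  # location incompatibility
    if (c.id, j.company_id) in blacklist:
        return False                                  # blacklist hit
    return True
```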
Layer 2: Multi-vector retrieval (effectively free)
The remaining 25% (~75,000 pairs) are scored via pgvector cosine similarity. The retrieval query for a single role returns the top-30 candidates ordered by weighted score across seven embedding axes, in single-digit milliseconds.
Embeddings are precomputed and cached. A candidate's embedding is recomputed only when their underlying signals change. At Refery's current data volume, the embedding cache hit rate exceeds 98%.
A pgvector cosine-similarity comparison is roughly 10,000x cheaper than an LLM call on the same pair.
Tokens consumed: zero (embeddings amortized).
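A sketch of the kind of per-role retrieval query this layer runs, assuming a pgvector-backed candidates table with one vector column per axis. The axis names and weights shown are hypothetical, and the real system weights seven axes rather than the three shown; `<=>` is pgvector's cosine-distance operator, so `1 - distance` is cosine similarity.

```python
# Illustrative per-role retrieval query (Python string constant; parameters bound at execution time).
RETRIEVAL_SQL = """
SELECT c.id,
       0.5 * (1 - (c.skills_vec     <=> %(skills_query)s))     +
       0.3 * (1 - (c.experience_vec <=> %(experience_query)s)) +
       0.2 * (1 - (c.industry_vec   <=> %(industry_query)s))   AS weighted_score
FROM   candidates c
WHERE  c.id = ANY(%(hard_filter_survivors)s)   -- only pairs that passed Layer 1
ORDER  BY weighted_score DESC
LIMIT  30;                                     -- top-K per role, K = 30
"""
```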
Layer 3: Signal engine (free)
The deterministic signal engine (logo tier, trajectory, pedigree modifier, AI bonus, non-tech flag, sales client profile) runs as pure code. No external API calls. Sub-millisecond per candidate.
Tokens consumed: zero.
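An illustrative sketch of deterministic signal scoring in the spirit described above; the specific weights, tiers, and field names are hypothetical, not Refery's actual model:

```python
# Pure-code signal scoring: no external API, sub-millisecond per candidate.
def signal_score(candidate: dict) -> float:
    score = 0.0
    score += {"tier1": 3.0, "tier2": 1.5, "tier3": 0.5}.get(candidate.get("logo_tier"), 0.0)
    score += {"accelerating": 2.0, "steady": 1.0}.get(candidate.get("trajectory"), 0.0)
    score *= candidate.get("pedigree_modifier", 1.0)         # pedigree as a multiplier
    if candidate.get("ai_experience"):
        score += 1.0                                          # "AI bonus"
    if candidate.get("non_tech"):
        score -= 0.5                                          # non-tech flag
    return score
```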
Layer 4: Five-persona panel (this is where the compute goes)
The panel runs only on the top-K retrieved candidates per role. With K=30 and 200 roles, that produces a maximum of 6,000 panel evaluations. In practice many candidates appear as top-K for multiple roles, so the actual number of unique panel runs per cycle is closer to 1,500-2,000.
Each panel run consumes approximately:
- 5 personas × 5,000 tokens each = 25,000 tokens per candidate brief
- Synthesizer pass: ~3,000 tokens
- Total: ~28,000 tokens per candidate
| Metric | Refery architecture | Per panel run |
|---|---|---|
| Total tokens per match cycle | ~56,000,000 (56 million) | ~28,000 |
| Approx Claude Sonnet cost per cycle | ~$170 | ~$0.085 |
| Approx energy per cycle | ~28 kWh equivalent | ~14 Wh |
| Wall clock time per cycle | ~30 minutes (parallelized) | ~10 seconds |
The numbers
| Metric | Naive baseline | Refery | Reduction |
|---|---|---|---|
| Tokens per match cycle | 1,500,000,000 | ~56,000,000 | ~96.3% reduction |
| Cost per match cycle | ~$5,250 | ~$170 | ~96.8% reduction |
| Energy per match cycle | ~750 kWh | ~28 kWh | ~96.3% reduction |
Roughly 96% reduction in compute, cost, and energy per match cycle. At weekly cadence, the difference is approximately $265,000 per year in saved inference cost and ~37,500 kWh per year in avoided energy consumption.
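The same reductions follow directly from the per-cycle figures above; a short calculation, under the same pricing and energy assumptions:

```python
# Refery-side totals and reductions vs the naive baseline (same assumptions as above).
panel_runs = 2_000                       # unique panel runs per cycle (upper band of 1,500-2,000)
tokens_per_run = 5 * 5_000 + 3_000       # five personas + synthesizer = 28,000 tokens
refery_tokens = panel_runs * tokens_per_run           # ~56 million tokens
refery_cost = panel_runs * 0.085                      # ~$170 per cycle (Sonnet-class blended rate)
refery_kwh = refery_tokens / 1_000 * 0.5 / 1_000      # ~28 kWh

baseline_tokens, baseline_cost, baseline_kwh = 1_500_000_000, 5_250, 750
for name, ours, theirs in [("tokens", refery_tokens, baseline_tokens),
                           ("cost", refery_cost, baseline_cost),
                           ("energy", refery_kwh, baseline_kwh)]:
    print(f"{name}: {(1 - ours / theirs):.1%} reduction")   # ~96.3% / ~96.8% / ~96.3%

annual_savings = (baseline_cost - refery_cost) * 52          # ~$264,000 per year
annual_kwh_avoided = (baseline_kwh - refery_kwh) * 52        # ~37,500 kWh per year
```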
These numbers are conservative. They assume the naive baseline and the panel use comparable model classes (GPT-4o and Claude Sonnet are roughly similar in per-token cost). If the naive baseline instead used heavier GPT-4-class compute, the Refery delta would be even larger.
Where the savings come from, decomposed
The 96% reduction is not a single trick. It is the cumulative effect of five architectural decisions, each of which contributes to the total.
1. Hard filters before any compute (~75% of work eliminated)
The cheapest decision is the one you do not need to compute at all. Refery's hard filters eliminate three-quarters of candidate-job pairs before any embedding, scoring, or LLM call happens. This single decision is the largest contributor to the efficiency delta.
2. Vector retrieval instead of pairwise LLM (a further ~92% reduction, ~75,000 pairs down to ~6,000)
Of the pairs that pass the hard filters, multi-vector retrieval ranks them in milliseconds and produces top-K candidates per role. Without this layer, the system would need to LLM-score every remaining pair. Vector retrieval converts an O(N×M) problem into O(K×M) where K is small and constant.
3. Panel only on the retrieval-ranked subset
Because the panel runs only on top-K (~30 candidates per role), the panel's compute scales linearly with the number of open roles, not quadratically with the candidate pool. Doubling the candidate pool does not double the panel cost; the candidate pool size affects retrieval, not panel evaluation.
4. Embedding cache (~98% hit rate)
Embeddings are precomputed when a candidate's signals change. They are not recomputed on every match cycle. At Refery's current scale this means embedding cost is approximately 2% of what it would be without the cache.
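A minimal sketch of the change-detection pattern, assuming a fingerprint of the candidate's signals gates recomputation; the storage shape and function names are illustrative:

```python
import hashlib
import json

def signals_fingerprint(signals: dict) -> str:
    """Stable hash of the candidate's underlying signals."""
    return hashlib.sha256(json.dumps(signals, sort_keys=True).encode()).hexdigest()

def get_embedding(candidate_id: str, signals: dict, cache: dict, embed_fn) -> list[float]:
    fp = signals_fingerprint(signals)
    cached = cache.get(candidate_id)
    if cached and cached["fingerprint"] == fp:
        return cached["vector"]                 # ~98% of lookups land here
    vector = embed_fn(signals)                  # recompute only when signals changed
    cache[candidate_id] = {"fingerprint": fp, "vector": vector}
    return vector
```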
5. Skill-based execution boundaries (bounded panel cost)
The panel runs as a deterministic sequence of LLM calls (5 personas + synthesizer = 6 calls), not as an open-ended agent loop. This bounds the cost of every panel run to a known quantity. Open-ended agent architectures, by contrast, can consume 10-100x more tokens per task because the LLM has discretion over how much to think.
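A sketch of what bounded execution means in practice: exactly six calls per candidate brief, never a loop whose length the model controls. The persona names and the `call_llm` client are placeholders, not Refery's actual prompts or SDK.

```python
from typing import Callable

PERSONAS = ["hiring_manager", "tech_lead", "recruiter", "domain_peer", "skeptic"]  # illustrative names

def run_panel(brief: str, role: str, call_llm: Callable[[str], str]) -> dict:
    reviews = {
        persona: call_llm(f"You are the {persona}. Evaluate this candidate for {role}:\n{brief}")
        for persona in PERSONAS                               # 5 calls, ~5,000 tokens each
    }
    synthesis = call_llm(
        "Synthesize these five evaluations into a single recommendation:\n\n"
        + "\n\n".join(reviews.values())                       # 1 call, ~3,000 tokens
    )
    return {"reviews": reviews, "synthesis": synthesis}        # ~28,000 tokens, bounded by construction
```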
What scales linearly vs sublinearly
Understanding scaling behavior is critical for projecting future cost.
| Variable | Scaling |
|---|---|
| Candidate count | Sublinear: hard filters eliminate most; embedding cache amortizes |
| Job count | Linear: panel runs once per role × K candidates |
| Match cycle frequency | Linear: each cycle is a full re-evaluation |
| Pipeline state transitions | Linear in transitions, constant in pipeline size: idempotent reconciliation reads only what changed |
| Outreach campaigns | Linear in messages sent: voice engine + Gmail draft |
The dominant cost driver as Refery scales is the panel running on the top-K subset per role. This is the right place to spend compute because it is where the actual hiring intelligence lives. Costs scale predictably with platform usage rather than with total candidate database size.
Energy footprint and environmental implications
Per the IEA's 2024 estimates and follow-up modeling in 2025-2026, large language model inference consumes on the order of 0.5-2 Wh per 1,000 tokens depending on model class and hardware efficiency. The conservative low end of that range, ~0.5 Wh per 1,000 tokens, is what is used in the calculations above.
Applied to Refery's architecture:
| Workload | Annual tokens | Annual kWh | Annual CO₂e (kg) |
|---|---|---|---|
| Refery (weekly cycles) | ~2.9 billion | ~1,460 kWh | ~580 kg CO₂e |
| Naive baseline (weekly) | ~78 billion | ~39,000 kWh | ~15,600 kg CO₂e |
| Avoided (delta) | ~75 billion | ~37,500 kWh | ~15,000 kg CO₂e |
The CO₂e estimates use a conservative 0.4 kg CO₂e per kWh global grid average. The avoided emissions are roughly equivalent to taking 3-4 mid-sized passenger vehicles off the road, per Refery instance per year, compared to a naive AI recruiting platform serving the same volume.
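The annual figures in the table reduce to a few lines of arithmetic under the stated assumptions (0.5 Wh per 1,000 tokens, 0.4 kg CO₂e per kWh). The ~4,600 kg CO₂e per passenger vehicle per year used for the comparison is a commonly cited external figure, not from this chapter.

```python
def annual_footprint(tokens_per_cycle: int, cycles_per_year: int = 52):
    """Annual tokens, kWh, and kg CO2e under the assumptions stated in the text."""
    tokens = tokens_per_cycle * cycles_per_year
    kwh = tokens / 1_000 * 0.5 / 1_000       # 0.5 Wh per 1,000 tokens
    co2e_kg = kwh * 0.4                      # 0.4 kg CO2e per kWh
    return tokens, kwh, co2e_kg

refery = annual_footprint(56_000_000)        # ~2.9B tokens, ~1,460 kWh, ~580 kg CO2e
naive = annual_footprint(1_500_000_000)      # ~78B tokens, ~39,000 kWh, ~15,600 kg CO2e
avoided_kg = naive[2] - refery[2]            # ~15,000 kg CO2e avoided per year
cars_equivalent = avoided_kg / 4_600         # ~3.3 passenger vehicles off the road
```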
This is not a marginal claim. As AI inference grows as a share of global electricity consumption, the architectural choice between "throw an LLM at every problem" and "use rules where rules suffice, embeddings where embeddings suffice, and LLMs only for the genuinely ambiguous" has cumulative effects measured in megawatt-hours per year, per platform.
Refery's architecture is one of the cleanest applied examples of this principle in the recruiting category. It is structurally aligned with Spain's stated priorities for sustainable digitalization and the EU's 2026 emissions targets for the digital sector.
How this scales to millions
The same architecture that produces the 96% efficiency delta at current scale extends cleanly to dramatically larger scale.
| Scale point | Candidates | Roles | Panel runs/cycle | Compute model |
|---|---|---|---|---|
| Today | 1,500 | 200 | ~2,000 | One Postgres instance |
| 10x | 15,000 | 2,000 | ~20,000 | Same architecture, larger Postgres |
| 100x | 150,000 | 20,000 | ~200,000 | Sharded Postgres, distributed pgvector |
| 1,000x | 1,500,000 | 200,000 | ~2,000,000 | Multi-region, GPU embedding workers |
At every scale point, the bottleneck remains the panel running on top-K per role. The hard filters and retrieval layer absorb candidate-side growth at near-zero marginal cost. The architecture does not need to be redesigned to scale; it needs to be horizontally scaled along well-understood axes (Postgres sharding, pgvector indexing, embedding worker pool).
By contrast, the naive baseline scales catastrophically. At 100x scale, the naive baseline would consume billions of dollars per year in inference cost. The naive architecture is not just expensive at small scale; it is structurally incapable of reaching large scale.
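A rough projection of both cost curves across the scale points in the table, keeping the per-unit assumptions from earlier in the chapter and assuming roughly ten unique panel runs per role (today's ~2,000 runs across 200 roles):

```python
# Annual inference cost projection at weekly cadence, naive baseline vs Refery.
SCALE_POINTS = [(1_500, 200), (15_000, 2_000), (150_000, 20_000), (1_500_000, 200_000)]

for candidates, roles in SCALE_POINTS:
    naive_pairs = candidates * roles
    naive_annual = naive_pairs * 0.0175 * 52     # every pair LLM-scored, weekly
    panel_runs = roles * 10                      # ~10 unique panel runs per role
    refery_annual = panel_runs * 0.085 * 52
    print(f"{candidates:>9,} candidates x {roles:>7,} roles: "
          f"naive ~${naive_annual:,.0f}/yr, Refery ~${refery_annual:,.0f}/yr")
# At 100x, the naive baseline is already ~$2.7B per year; Refery stays under ~$1M per year.
```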
Why this is novel
- Cost-gradient architecture. Most AI platforms put their cheapest decisions and most expensive decisions through the same compute path. Refery's architecture is built around the cost gradient: free things first, expensive things last.
- Embedding amortization via cache + change detection. Most platforms recompute embeddings on every query. Refery recomputes only when underlying signals change.
- Skill-based bounded execution. Compared to open agent loops, deterministic skill execution provides predictable, auditable cost per task.
- Single-database architecture. No separate vector DB, no separate analytics warehouse, no separate logging system. Each of those would be a multiplier on operational cost; Refery avoids all of them.
- Energy-efficient AI by design. The 96% reduction in inference cost translates directly to 96% lower energy consumption per hiring decision. Few applied AI platforms can credibly make this claim.
The architecture is what allows Refery to charge 20% placement fees with 70% pass-through to scouts and recruiter partners. The platform's own margin works because the platform's cost structure is dramatically lower than alternatives. This is a structural cost advantage, not a temporary one.