```yaml
---
title: Evaluating LLMs for Code Tasks
subtitle: "Benchmarking Models on Real Workloads, Avoiding Benchmark Gaming, and Making Cost-Quality Decisions"
author: David Kelly Price
version: "1.0"
date: 2026-04-20
status: draft
type: ebook
target_audience: Senior ML engineers, architects, and engineering managers choosing or comparing LLMs for code generation, review, or search
estimated_pages: 75
chapters:
  - "1. Why Vendor Benchmarks Are Not Enough"
  - "2. Designing Your Own Evaluation Suite"
  - "3. Task Categories: Generation, Completion, Review, Search"
  - "4. Evaluation Metrics That Actually Predict Production Quality"
  - "5. Cost-Quality Tradeoffs and the Efficient Frontier"
  - "6. Latency, Throughput, and Context Window Limits"
  - "7. Fine-Tuning vs. Prompting for Code Tasks"
  - "8. Building a Continuous Evaluation System"
tags:
  - pyckle
  - ebook
  - llm-evaluation
  - benchmarking
  - code-generation
  - model-selection
  - ml-engineering
---
```

<!--
  DESIGN / LAYOUT NOTES
  =====================
  Font pairing : IBM Plex Mono (code) + Inter (body) + Sora (headings)
  Base font size: 11pt body, 9pt code blocks
  Line height  : 1.6 body, 1.4 code
  Page margins : 1.25in top/bottom, 1.0in left/right (print); full-bleed cover
  Code blocks  : dark background (#1e1e2e), syntax highlighting (Catppuccin Mocha)
  Callout boxes: left border accent, light tinted background — used for key claims
                 and "watch out" notes only, not decorative
  Chapter opens: full-width dark header, chapter number (large, muted), title below
  TOC          : chapter number left-aligned, title, dotted leader, page number right
  No widows or orphans. Section breaks with thin rule (not decorative flourish).
  Print-safe: no background fills on body pages outside callout boxes.
  PDF bookmarks: chapters + H2 sections only, not H3.
-->

# Evaluating LLMs for Code Tasks
## Benchmarking Models on Real Workloads, Avoiding Benchmark Gaming, and Making Cost-Quality Decisions

**By David Kelly Price**
Version 1.0 — April 2026

---

## Table of Contents

1. Why Vendor Benchmarks Are Not Enough
2. Designing Your Own Evaluation Suite
3. Task Categories: Generation, Completion, Review, Search
4. Evaluation Metrics That Actually Predict Production Quality
5. Cost-Quality Tradeoffs and the Efficient Frontier
6. Latency, Throughput, and Context Window Limits
7. Fine-Tuning vs. Prompting for Code Tasks
8. Building a Continuous Evaluation System

---

## About This Guide

Every major LLM vendor publishes benchmark numbers. HumanEval, MBPP, SWE-bench, LiveCodeBench: the names change, but the format is always the same, a table showing their model at or near the top. Some of those numbers transfer to your workload. Many do not. The problem is not that vendors lie; it is that benchmark performance and production performance measure different things, and the gap between them is where most bad model selection decisions live. This guide is about closing that gap by building the evaluation infrastructure to make model decisions based on your workloads, your quality bar, and your cost constraints.

The intended reader is someone who has already shipped something with an LLM. You know what a context window is. You have probably written a prompt, watched it fail in an unexpected way, and wondered how to prevent that at scale. You may be choosing between two or three models for a new code generation or review feature, or re-evaluating a choice made eighteen months ago when the options were different. This is not an introduction to language models. It is a practical framework for people who need to make defensible, repeatable decisions about which model to use and how to know when that decision should change.

By the end, you will be able to design an evaluation suite against real task categories, select metrics that correlate with what you actually care about in production, reason systematically about the cost-quality tradeoff surface, and build a continuous evaluation system that keeps pace with model releases. The goal is not a perfect benchmark. It is a process that is honest about what it measures and reproducible enough to be useful over time.

---

## Chapter 1: Why Vendor Benchmarks Are Not Enough

### Chapter Overview

Vendor benchmarks are not lies. They are, in many cases, carefully constructed, rigorously executed, and entirely accurate on their own terms. The problem is that their terms are not your terms. This chapter examines how benchmark design decisions — dataset selection, task framing, metric choice, contamination risk — systematically diverge from the conditions under which real engineering teams actually use these models. Understanding where that divergence comes from is the prerequisite to building evaluations that mean something.

---

### The Gap Between "Best in Class" and "Best for You"

A model that leads HumanEval is not necessarily the model you want writing production Python for a Django REST API. This seems obvious when stated plainly, but it gets ignored constantly in procurement decisions. The confusion is understandable: when a vendor publishes a benchmark showing their model solves 87% of coding tasks, it is natural to assume that number describes something you care about.

HumanEval, released by OpenAI in 2021, consists of 164 Python programming problems. Each problem provides a function signature and a docstring, and the model must complete the function body. Solutions are verified by running a small set of unit tests. The benchmark has been useful for tracking broad capability growth over time.

But consider what it does not measure. It does not measure whether the model can navigate an unfamiliar 50,000-line codebase. It does not measure whether the model's suggestions introduce security vulnerabilities. It does not measure whether the model hallucinates library APIs that do not exist. It does not measure latency, token efficiency, consistency across semantically equivalent prompts, or the ability to follow project-specific conventions. These are not exotic requirements. They are baseline expectations for any model operating in a real engineering environment.

The gap is not a flaw in HumanEval specifically. It is a structural property of general benchmarks. General benchmarks optimize for breadth and reproducibility. Your use case demands depth and specificity. These are genuinely different objectives.

> **Key Insight**
>
> Benchmark performance and production utility are correlated but not equivalent. A model can rank first on every major coding benchmark and still be the wrong choice for your specific workflow. Correlation at the population level does not predict individual fit.

---

### How Benchmarks Get Contaminated

There is a second problem, separate from scope: contamination. Large language models are trained on massive corpora scraped from the internet, and public benchmarks (including HumanEval, MBPP, and SWE-bench) exist on the internet. When benchmark problems appear in training data, model performance on that benchmark measures something between genuine capability and memorization. The exact mix is difficult to determine from the outside.

Vendors are aware of this and most make reasonable efforts to detect and mitigate contamination. But the incentive structure is not neutral. A vendor whose model scores 91% on a benchmark they designed and administered, using training data they curated, does not face the same verification pressure as peer-reviewed science. That is not an accusation of bad faith — it is a description of how incentives work.

The practical consequence: when a model performs suspiciously well on a benchmark — particularly one that has been publicly available for years — it is worth asking whether that performance generalizes. One way to probe this is to take the benchmark tasks, modify them slightly (rename variables, invert the problem structure, change the language of the docstring), and retest. Models that learned the shape of a problem rather than the skill needed to solve it will show noticeable performance degradation on modified versions.

```python
# Original HumanEval-style prompt:
def add_two_numbers(a: int, b: int) -> int:
    """Return the sum of a and b."""

# Modified version — semantically identical, surface-different:
def compute_integer_total(x: int, y: int) -> int:
    """Given two integers, produce their arithmetic sum."""
```

A model with genuine arithmetic reasoning should score identically on both. A model that has pattern-matched on training examples may not.
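Surface variants like the one above can be produced mechanically rather than by hand. The sketch below is one minimal way to do the identifier-renaming part using Python's standard `ast` module; `RenameIdentifiers` and `perturb_prompt` are hypothetical names, `ast.unparse` assumes Python 3.9 or later, and the docstring text is left untouched, so rewording it remains a separate step.

```python
import ast


class RenameIdentifiers(ast.NodeTransformer):
    """Rename function names, parameters, and variable references."""

    def __init__(self, mapping: dict[str, str]) -> None:
        self.mapping = mapping

    def visit_FunctionDef(self, node: ast.FunctionDef) -> ast.FunctionDef:
        node.name = self.mapping.get(node.name, node.name)
        self.generic_visit(node)  # descend into arguments and body
        return node

    def visit_arg(self, node: ast.arg) -> ast.arg:
        node.arg = self.mapping.get(node.arg, node.arg)
        self.generic_visit(node)
        return node

    def visit_Name(self, node: ast.Name) -> ast.Name:
        node.id = self.mapping.get(node.id, node.id)
        return node


def perturb_prompt(source: str, mapping: dict[str, str]) -> str:
    """Return `source` with identifiers renamed according to `mapping`."""
    tree = RenameIdentifiers(mapping).visit(ast.parse(source))
    return ast.unparse(tree)


# A reference solution is included here so the renamed code can be executed.
original = '''
def add_two_numbers(a: int, b: int) -> int:
    """Return the sum of a and b."""
    return a + b
'''

perturbed = perturb_prompt(
    original,
    {"add_two_numbers": "compute_integer_total", "a": "x", "b": "y"},
)
print(perturbed)
```

Running both the original and the perturbed prompt through the same model and test suite gives a paired comparison; the size of the score drop is the quantity of interest.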

> **Warning**
>
> Do not discard benchmark results because contamination is possible. Do treat dramatic score drops on semantically equivalent but surface-different tasks as a signal worth investigating before committing to a model.

---

### The Metric Mismatch Problem

Even when a benchmark tests something real, the metric it uses may not reflect what you actually optimize for.

Pass@k is a common example. It measures whether at least one of the model's k completions passes the test suite. Pass@1 (does the first completion pass?) is the cleanest signal. Pass@10 and Pass@100 tell you about the breadth of the model's solution space, which matters for some workflows (like using models in sampling loops with automated verifiers) and is largely irrelevant for others (like a developer expecting a single, correct response from an assistant).

When vendors report Pass@k, the choice of k is not neutral. Pass@k can only rise as k grows, so a large k flatters every model. A model that scores well at Pass@100 but poorly at Pass@1 has a different operational profile than a model whose Pass@1 is already close to its Pass@100. If your workflow is one-shot code generation (a developer asks, the model responds, the developer reviews), then Pass@1 is the only number that matters. Reporting Pass@100 for a one-shot use case is technically accurate and practically misleading.
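To see how much the choice of k changes the picture, it helps to compute the metric yourself. Reported figures are usually estimated from n sampled completions per problem using the unbiased estimator published alongside HumanEval, 1 - C(n-c, k) / C(n, k), where c is the number of passing samples. The sketch below implements that estimator; the function name and the sample counts are illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which passed.

    Probability that at least one of k samples drawn without replacement
    from the n generated samples is a passing one.
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    # comb(n - c, k) is 0 when fewer than k failing samples exist,
    # which correctly yields 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Illustrative numbers: 100 samples for one problem, 5 of them pass.
for k in (1, 10, 100):
    print(f"pass@{k}: {pass_at_k(n=100, c=5, k=k):.3f}")
```

With these illustrative counts, the same model has a pass@1 of 0.05, a pass@10 of roughly 0.42, and a pass@100 of 1.0. All three describe the same underlying behavior.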

The same logic applies to language coverage. A benchmark that reports aggregate performance across 10 programming languages produces a number that reflects average capability. If your team writes exclusively in Rust, the aggregate number tells you almost nothing. Rust-specific performance on relevant task types is the signal. Aggregate performance is noise.

---

### Benchmark Tasks Versus Real Tasks

The structure of benchmark tasks diverges from real engineering tasks in ways that compound over time.

Benchmark tasks are typically self-contained. The function to be written is specified completely by its signature and docstring. The correct solution can be verified by a small, deterministic test suite. The problem fits in a single context window. These properties make benchmarks reproducible and measurable. They also make benchmark tasks qualitatively different from the problems engineers actually face.

Real tasks involve partial information. A developer asking a model to "add pagination to this endpoint" is not providing a complete specification. The model needs to infer from the existing code what framework is in use, what the data model looks like, what existing patterns to follow, and what behavior is expected at edge cases. It needs to either ask clarifying questions or state its assumptions explicitly. None of this appears in HumanEval.

Real tasks also require maintaining coherence across a session. A developer might ask the model to refactor a module, then add a feature to the refactored code, then write tests for the feature. Each step depends on the previous. Benchmark tasks rarely chain in this way, which means benchmark scores tell you nothing about session-level coherence — a property that determines whether a model is actually pleasant to work with over a full development session.

> **Try This**
>
> Take the last three non-trivial coding requests you made to an LLM assistant. Write down what you actually needed the model to produce. Then check: does any public benchmark directly test that? If yes, how closely do the benchmark conditions match your actual context — codebase size, language, task structure, verification method? The delta between your answers and the benchmark conditions is the measurement gap you need to close.

---

### What Vendors Are and Are Not Optimizing For

It would be unfair to characterize vendors as indifferent to real-world performance. The push toward harder benchmarks reflects genuine effort to close the gap between benchmark conditions and production conditions: SWE-bench tests models on actual GitHub issues, and LiveCodeBench maintains a rolling set of problems to limit contamination. This is progress.

But vendors are also optimizing for leaderboard position, which is a competitive necessity. Leaderboards rank models on fixed tasks. Fixed tasks can be studied and optimized for. The result is that models improve faster on the dimensions being measured than on dimensions that are not being measured. This is Goodhart's Law operating at scale: when a measure becomes a target, it ceases to be a good measure.

The models getting trained today are shaped in part by the benchmarks that exist today. That means the capabilities your team needs but that do not appear in current benchmarks are systematically undertrained relative to capabilities that do appear in benchmarks. The only reliable way to measure fitness for your specific requirements is to evaluate directly against those requirements.

---

### Key Takeaways

- Benchmark tasks are designed for reproducibility and breadth. Your production use case requires depth and specificity. These objectives do not automatically align.
- Contamination risk is real and difficult to quantify from the outside. Surface-modifying benchmark problems is a practical way to probe whether performance reflects generalization or memorization.
- The choice of metric (Pass@k, aggregate vs. per-language) is not neutral. Verify that the metric a vendor reports corresponds to the operational conditions your team actually works under.
- Benchmark tasks are self-contained and fully specified. Real engineering tasks are not. Session-level coherence, partial information handling, and convention-following are real capabilities that most benchmarks do not measure.
- Vendors are not optimizing for your use case. They are optimizing for competitive positioning. That is rational for them and insufficient for you.

---

### Try This

Pull the last two weeks of prompts from your team's LLM usage logs (or reconstruct them from memory if logs are not available). Group them into categories: code generation, code review, refactoring, debugging, search/retrieval, documentation. For each category, write a one-sentence description of what a correct response actually looks like in your context.

Then look up how HumanEval, MBPP, or any benchmark the model vendor cites defines "correct." Write down the differences. Those differences are not abstractions — they are the specific gaps between what you are paying for and what you are measuring. The next step is turning that list into test cases you own and control.


---

## Chapter 2: Designing Your Own Evaluation Suite

### Chapter Overview

If vendor benchmarks measure how well a model performs on someone else's problems, a custom evaluation suite measures how well it performs on yours. Building one is not glamorous work, but it is the only way to make a defensible decision about which model belongs in your stack. This chapter covers the practical mechanics: how to assemble a representative test set, how to define what "good" means in your context, how to score outputs without burning your engineers' time, and how to structure the whole thing so it stays useful as models and requirements change.

---

### Start With Your Own Code, Not Generic Benchmarks

The single highest-leverage move in evaluation design is using your own codebase as the source of truth. Not open-source repositories selected for cleanliness. Not textbook examples. Your actual code, with your actual naming conventions, your actual domain logic, your actual debt.

Generic benchmarks fail for a predictable reason: they optimize for coverage across many domains, which means they are shallow in every one of them. A model that scores well on HumanEval has demonstrated it can write self-contained Python functions against a known test suite. That tells you almost nothing about whether it can complete a method in your internal data pipeline, understand your authentication layer, or review a pull request in a codebase built on five years of pragmatic decisions.

Your codebase has two things no public benchmark has. First, it has real complexity — the kind that comes from multiple contributors, evolving requirements, and accumulated workarounds. Second, it has ground truth you can verify. You know what the correct behavior is. You have existing tests. You have colleagues who can tell you whether an output is actually useful.

Start by identifying fifty to one hundred representative tasks. "Representative" does not mean easy or clean. It means typical. Pull from your actual backlog: functions your team recently wrote from scratch, review comments on recent PRs, search queries your developers run against internal tooling. The goal is a distribution that looks like a Tuesday, not an ideal week.

> **Key Insight**
>
> Your evaluation suite is only as useful as its resemblance to your actual work. A test set built from your production codebase will surface model weaknesses that no public benchmark will catch, and will confirm strengths that actually matter for your team.

---

### Defining What "Good" Means for Your Use Case

Before you run a single evaluation, you need a rubric. Not a vague one. A rubric with enough specificity that two different engineers applying it independently would reach the same conclusion most of the time.

The temptation is to use a single quality score. Resist it. "Good" means something different across code generation, code review, and code search, and it means something different across teams. A fintech team cares about whether generated code handles edge cases in financial arithmetic. A developer tools team cares about whether the model understands abstract syntax trees. A platform team running infrastructure automation cares about idempotency and error handling above almost everything else.

Break your rubric into dimensions. Typical dimensions include: correctness (does the output do what was asked?), style adherence (does it match your conventions?), completeness (did it handle edge cases, error paths, documentation where required?), and relevance (for search or review tasks, did it surface what actually mattered?).

Assign weights. Correctness should almost always dominate, but the relative weight of other dimensions depends on your context. If you are evaluating a model for code review in a junior-heavy team, depth of explanation might carry significant weight. If you are evaluating for autocomplete in a mature codebase, style adherence might matter more than it would in a greenfield project.

Write explicit failure conditions alongside your positive criteria. A rubric that only describes success tends to inflate scores. Define what makes an output a zero: generates code that would cause a silent failure, ignores security-sensitive context in a review, returns irrelevant results for a search query. These failure conditions are where models most sharply differentiate from each other.
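The combination of weighted dimensions and hard failure conditions can be sketched in a few lines. The weights, dimension names, and the `RubricScore` class below are hypothetical placeholders for whatever your team settles on; the only behavior taken from the text is that correctness dominates and that any recorded failure condition forces the score to zero.

```python
from dataclasses import dataclass, field

# Hypothetical weights that sum to 1.0; correctness dominates.
WEIGHTS = {
    "correctness": 0.50,
    "completeness": 0.20,
    "style": 0.15,
    "relevance": 0.15,
}


@dataclass
class RubricScore:
    """Per-dimension scores in [0.0, 1.0] plus any triggered failure conditions."""

    dimensions: dict[str, float]
    failures: list[str] = field(default_factory=list)

    def total(self, weights: dict[str, float] = WEIGHTS) -> float:
        # Any explicit failure condition makes the output a zero,
        # regardless of how well it scored elsewhere.
        if self.failures:
            return 0.0
        return sum(weight * self.dimensions[name] for name, weight in weights.items())


clean = RubricScore(
    {"correctness": 1.0, "completeness": 1.0, "style": 0.5, "relevance": 1.0}
)
silent_bug = RubricScore(
    {"correctness": 0.8, "completeness": 1.0, "style": 1.0, "relevance": 1.0},
    failures=["would cause a silent failure"],
)
print(clean.total(), silent_bug.total())
```

Keeping failure conditions as a separate list, rather than folding them into the correctness number, makes it possible to report later which failure conditions each model triggers most often.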

---

### Building a Representative Test Set

Structure matters here. A test set should have three things: inputs, expected outputs or evaluation criteria, and metadata about each task.

The inputs are your prompts. Be precise about what context you include. If your real-world usage includes file context injected via your IDE or tooling, include it in your test prompts. If users typically query without much context, reflect that. The distribution of context richness in your test set should mirror the distribution in production.

Expected outputs are trickier than they look. For deterministic tasks — "write a function that does X, where X has one correct implementation" — you can provide a reference output and test against it. For tasks with multiple valid implementations, define criteria instead. This distinction matters because automated scoring against a single reference output will penalize correct-but-different responses.

```python
# Example test case structure
test_case = {
    "id": "TC-042",
    "category": "generation",
    "subcategory": "data_transformation",
    "prompt": "...",
    "context_files": ["src/models/transaction.py", "src/utils/validators.py"],
    "evaluation_type": "criteria",  # or "reference"
    "criteria": [
        "handles null values without raising",
        "preserves existing keys not in transform_map",
        "does not mutate the input dict",
    ],
    "reference_output": None,  # only set when evaluation_type is "reference"
    "tags": ["nullability", "dict_mutation", "side_effects"],
}
```

Metadata — category, subcategory, tags — pays off later. Once you have results, you want to slice them. Which model does better on generation vs. review? Which handles nullable types better? Without tags, you have a single average score. With tags, you have a diagnostic tool.
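Slicing by tag needs very little machinery. The sketch below assumes each result row carries the model name, the task's tags, and a score between 0 and 1; the row layout, model names, and `mean_score_by_tag` are assumptions made for illustration, not a required schema.

```python
from collections import defaultdict
from statistics import mean


def mean_score_by_tag(results: list[dict]) -> dict[tuple[str, str], float]:
    """Average score for every (model, tag) pair found in `results`."""
    buckets: dict[tuple[str, str], list[float]] = defaultdict(list)
    for row in results:
        for tag in row["tags"]:
            buckets[(row["model"], tag)].append(row["score"])
    return {key: mean(scores) for key, scores in buckets.items()}


# Illustrative rows only; real rows come from your evaluation runs.
results = [
    {"model": "model-a", "task_id": "TC-042", "tags": ["nullability", "side_effects"], "score": 1.0},
    {"model": "model-a", "task_id": "TC-043", "tags": ["nullability"], "score": 0.5},
    {"model": "model-b", "task_id": "TC-042", "tags": ["nullability", "side_effects"], "score": 0.5},
    {"model": "model-b", "task_id": "TC-043", "tags": ["nullability"], "score": 0.0},
]

for (model, tag), score in sorted(mean_score_by_tag(results).items()):
    print(f"{model:8s} {tag:14s} {score:.2f}")
```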

Aim for at least thirty tasks per major use case category. Below that, individual variance swamps signal. You can run a useful evaluation with fewer, but do not draw hard conclusions from it.
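A confidence interval makes the sample-size point concrete. The sketch below uses the Wilson score interval for a pass rate, which is one reasonable choice for small samples rather than a method this guide prescribes; the 21-of-30 figure is illustrative.

```python
from math import sqrt


def wilson_interval(passes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate (z = 1.96 gives roughly 95%)."""
    if n <= 0 or not 0 <= passes <= n:
        raise ValueError("need n > 0 and 0 <= passes <= n")
    p = passes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# Illustrative: a model passes 21 of 30 tasks in one category.
low, high = wilson_interval(21, 30)
print(f"observed 0.70, plausible range {low:.2f} to {high:.2f}")
```

At thirty tasks, an observed 70% pass rate is consistent with a true rate anywhere from roughly 52% to 83%. Two models whose intervals overlap that heavily have not been separated by the evaluation.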

> **Warning**
>
> Test sets degrade. Code that was representative six months ago may reflect a codebase that no longer exists. Build a process to rotate roughly twenty percent of your test cases each quarter, retiring tasks that no longer reflect real work and replacing them with fresh ones.

---

### Scoring: Automated vs. Human Judgment

The correct answer to "automated or human scoring?" is almost always both, and the split should be deliberate.

Automated scoring is fast, cheap, and consistent. It is well-suited for correctness checks where you can run the output against a test suite, or where you can parse structured outputs and validate against schema. If a model generates a function, run your unit tests against it. If it generates SQL, execute it against a sandboxed replica of your data layer. These checks catch objective failures without requiring reviewer time.

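```python
# Simple automated evaluation loop.
# `model`, `run_tests`, and `validate_schema` stand in for your own harness.
def evaluate(test_suite, model, run_tests, validate_schema):
    for task in test_suite:
        output = model.generate(task.prompt, task.context)

        # Reset per-run state so tasks without tests do not raise AttributeError
        task.auto_score = None
        task.needs_human_review = False

        if task.has_executable_tests:
            result = run_tests(output, task.test_cases)
            task.auto_score = result.pass_rate

        if task.has_schema:
            task.schema_valid = validate_schema(output, task.schema)

        # Flag for human review if auto signals are ambiguous
        if task.auto_score is None or task.auto_score < 0.5:
            task.needs_human_review = True
```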

Human scoring is slower and introduces variance, but it captures things automated checks cannot: whether a code review comment would actually help a developer, whether a generated implementation is idiomatic, whether a search result surfaces the most useful file or just the most syntactically similar one. Reserve human scoring for tasks where subjective quality matters and for cases where automated scoring gives ambiguous results.

A practical split: automate everything you can, human-review a stratified sample. The sample should include high-scoring outputs (to catch automated false positives), low-scoring outputs (to validate that failures are real), and a random draw from the middle. Size the sample to what your team can review in two to three hours per evaluation cycle.
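Drawing that sample can be automated so it is reproducible from one cycle to the next. In the sketch below, the 0.5 and 0.9 cut-offs, the per-stratum size, and the function name are all assumptions to adjust; the fixed seed is there so the same scores always produce the same review queue.

```python
import random


def stratified_review_sample(
    auto_scores: dict[str, float],
    per_stratum: int,
    low_cutoff: float = 0.5,
    high_cutoff: float = 0.9,
    seed: int = 0,
) -> dict[str, list[str]]:
    """Pick up to `per_stratum` task IDs from the low, middle, and high score bands."""
    strata: dict[str, list[str]] = {"low": [], "middle": [], "high": []}
    for task_id, score in auto_scores.items():
        if score < low_cutoff:
            strata["low"].append(task_id)
        elif score >= high_cutoff:
            strata["high"].append(task_id)
        else:
            strata["middle"].append(task_id)

    rng = random.Random(seed)  # seeded so the draw is repeatable
    return {
        name: rng.sample(ids, min(per_stratum, len(ids)))
        for name, ids in strata.items()
    }


# Illustrative scores for 100 tasks, evenly spread from 0.00 to 0.99.
auto_scores = {f"TC-{i:03d}": i / 100 for i in range(100)}
review_queue = stratified_review_sample(auto_scores, per_stratum=5)
print(review_queue)
```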

Calibrate your human reviewers before they score. Show them anchor examples at each quality level before they start. Disagreement between reviewers is information — high disagreement on a task usually means the task is underspecified or the rubric dimension being scored is not well-defined.
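Reviewer disagreement is easier to act on once it is a number. Cohen's kappa is one common statistic for two reviewers assigning categorical labels; it is offered here as an option, not as the only valid choice. The sketch below computes it from scratch, and the pass/fail labels are illustrative.

```python
from collections import Counter


def cohens_kappa(labels_a: list, labels_b: list) -> float:
    """Agreement between two reviewers, corrected for chance agreement."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both reviewers must label the same non-empty set of tasks")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both reviewers pick the same label independently.
    expected = sum(
        counts_a[label] * counts_b[label] for label in counts_a.keys() | counts_b.keys()
    ) / n**2

    if expected == 1.0:  # both reviewers used a single identical label throughout
        return 1.0
    return (observed - expected) / (1 - expected)


# Illustrative labels from two reviewers on the same six outputs.
reviewer_1 = ["pass", "pass", "fail", "fail", "pass", "fail"]
reviewer_2 = ["pass", "fail", "fail", "fail", "pass", "pass"]
print(f"kappa = {cohens_kappa(reviewer_1, reviewer_2):.2f}")
```

A kappa near 1 means the reviewers agree far more often than chance would produce; a value near 0 means their agreement is no better than chance, which usually points back at the rubric or the task definition.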

---

### Versioning and Repeatability

An evaluation you cannot repeat is not an evaluation — it is a snapshot. Models update. Your codebase changes. Your requirements evolve. The value of a custom evaluation suite compounds only if you can run it again and compare results with confidence.

Version your test suite explicitly. Use a manifest file that records which tasks are active, their versions, and when they were last reviewed. When you retire a task, keep it in the archive rather than deleting it — you may want to compare across time.

```yaml
# eval_manifest.yaml
version: "2.4"
last_updated: "2025-11-15"
active_tasks: 94
categories:
  generation: 38
  review: 31
  search: 25
changelog:
  - version: "2.4"
    date: "2025-11-15"
    changes: "Rotated 18 stale generation tasks; added 12 new infra-focused cases"
  - version: "2.3"
    date: "2025-08-02"
    changes: "Added schema validation for SQL generation tasks"
```

Pin model versions in your evaluation runs. If you are comparing Model A to Model B, both should be evaluated against the same test suite version on the same day if possible. Model providers update their offerings, sometimes silently. A comparison made across different dates may be comparing different underlying systems.

Store results alongside your test suite, not in a separate system. A flat JSON file or a simple SQLite database is enough. The goal is to be able to run a query three months from now and answer "how did Model A score on nullability tasks in Q3 vs. Q4?" without reconstructing anything from memory.
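As a rough illustration of how little is needed, the sketch below stores runs and per-task results in SQLite using Python's standard `sqlite3` module and answers the quarter-over-quarter question with one query. The table layout, column names, model names, and scores are all hypothetical.

```python
import sqlite3

# In practice this would be a file kept next to the test suite, not ":memory:".
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE runs (
        run_id        INTEGER PRIMARY KEY,
        model         TEXT NOT NULL,
        model_version TEXT NOT NULL,   -- pinned provider version string
        suite_version TEXT NOT NULL,   -- matches eval_manifest.yaml
        run_date      TEXT NOT NULL    -- ISO 8601 date
    );
    CREATE TABLE results (
        run_id  INTEGER NOT NULL REFERENCES runs(run_id),
        task_id TEXT NOT NULL,
        tag     TEXT NOT NULL,
        score   REAL NOT NULL
    );
    """
)

# Illustrative data only.
conn.executemany(
    "INSERT INTO runs VALUES (?, ?, ?, ?, ?)",
    [
        (1, "model-a", "2025-07", "2.3", "2025-08-02"),
        (2, "model-a", "2025-10", "2.4", "2025-11-15"),
    ],
)
conn.executemany(
    "INSERT INTO results VALUES (?, ?, ?, ?)",
    [
        (1, "TC-042", "nullability", 0.5),
        (1, "TC-043", "nullability", 0.75),
        (2, "TC-042", "nullability", 1.0),
        (2, "TC-043", "nullability", 0.75),
    ],
)


def tag_score_by_quarter(conn: sqlite3.Connection, model: str, tag: str) -> list[tuple]:
    """Average score per calendar quarter for one model and one tag."""
    return conn.execute(
        """
        SELECT strftime('%Y', r.run_date) || '-Q'
               || ((CAST(strftime('%m', r.run_date) AS INTEGER) + 2) / 3) AS quarter,
               AVG(res.score) AS mean_score
        FROM results AS res
        JOIN runs AS r USING (run_id)
        WHERE r.model = ? AND res.tag = ?
        GROUP BY quarter
        ORDER BY quarter
        """,
        (model, tag),
    ).fetchall()


print(tag_score_by_quarter(conn, "model-a", "nullability"))
```

Recording the pinned model version and the suite version on every run is what makes a later comparison trustworthy: a score change can then be attributed to the model, the suite, or both.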

---

### Key Takeaways

- Use your own codebase as the source of evaluation inputs. Public benchmarks measure performance on someone else's problems.
- Define your rubric before you run anything. Break it into explicit dimensions with weights, and write failure conditions, not just success criteria.
- Tag every test case with enough metadata to slice results by category and characteristic. A single average score is nearly useless for decision-making.
- Combine automated and human scoring deliberately. Automate what is objectively verifiable; reserve human time for tasks where subjective quality is the point.
- Version your test suite and pin model versions in evaluation runs. The ability to reproduce and compare results over time is what turns a one-time evaluation into a defensible, compounding asset.

---

### Try This

Take one recurring code task your team handles — a specific type of function your engineers write regularly, or a common code review pattern, or a query type that frequently goes to internal tooling. Write five test cases for it: two easy, two typical, one edge case. For each, define the prompt (including realistic context), the evaluation type (reference output or criteria), and at least three explicit criteria or failure conditions. Run those five cases against the model you are currently using. Score each output against your criteria. Notice where your criteria were ambiguous enough that you had to make a judgment call mid-scoring. Those ambiguities are exactly what you need to resolve before building out the full suite.


---

## Chapter 3: Task Categories: Generation, Completion, Review, Search

### Chapter Overview

Not all code tasks are the same problem wearing different clothes. Generation, completion, review, and search each impose different cognitive demands on a model — different context requirements, different failure modes, different success criteria. A model that writes clean boilerplate from a spec can still hallucinate during review. One that completes function bodies with fluency may produce useless results in retrieval. Before you can evaluate, you need to know what you're actually asking the model to do, and why those categories don't map cleanly onto each other.

---

### Generation: Writing Code from Intent

Generation is what most people picture when they think of LLM coding assistance: a description goes in, code comes out. It looks simple. It isn't.

The core challenge is specification fidelity — whether the model produces code that actually satisfies the requirement, not just code that plausibly resembles it. Vendor benchmarks love generation tasks because they're easy to demo. A model generates a working Fibonacci function and everyone nods. The harder question is what happens when the spec is ambiguous, incomplete, or domain-specific.

Generation quality degrades predictably as requirements complexity increases. Single-function prompts with clear inputs and outputs are a weak signal. Multi-function, multi-file generation with architectural constraints is where models start to differentiate. Testing generation at the lower end tells you very little about production utility.

There's also the question of what "correct" means. Generated code that passes unit tests but introduces an O(n²) loop, ignores error paths, or violates your conventions isn't a success. Your evaluation framework needs to account for this.

```python
# Weak generation prompt — tests nothing interesting
prompt = "Write a Python function that returns the sum of a list."

# Stronger generation prompt — tests spec adherence and edge handling
prompt = """
Write a Python function `safe_sum` that:
- Accepts a list of numeric types (int, float, Decimal)
- Returns 0 for an empty list, not an error
- Raises ValueError with a descriptive message for non-numeric items
- Does not use the built-in sum() function
"""
```

The second prompt has enough specificity to fail on. That's what you want in an evaluation. If a model can't handle explicit constraints, it won't handle implicit ones from your actual codebase.
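Explicit constraints also translate directly into automated checks. The sketch below is one possible checker for the `safe_sum` prompt: it inspects the generated source with `ast` for calls to the built-in `sum()` and then exercises the function. `check_safe_sum` and the check names are hypothetical, and the `exec` call assumes the model output is already running inside a sandbox.

```python
import ast
from decimal import Decimal


def check_safe_sum(source: str) -> dict[str, bool]:
    """Check a candidate `safe_sum` implementation against the prompt's constraints."""
    tree = ast.parse(source)
    checks = {
        # Static check: no direct call to the built-in sum().
        "avoids_builtin_sum": not any(
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Name)
            and node.func.id == "sum"
            for node in ast.walk(tree)
        )
    }

    namespace: dict = {}
    exec(source, namespace)  # run untrusted model output only inside a sandbox
    safe_sum = namespace["safe_sum"]

    def passes(check) -> bool:
        """Treat any unexpected exception as a failed check."""
        try:
            return bool(check())
        except Exception:
            return False

    def rejects_non_numeric() -> bool:
        try:
            safe_sum([1, "two"])
        except ValueError as exc:
            return bool(str(exc))  # the message must not be empty
        return False

    checks["empty_returns_zero"] = passes(lambda: safe_sum([]) == 0)
    checks["sums_ints"] = passes(lambda: safe_sum([1, 2, 3]) == 6)
    checks["sums_decimals"] = passes(
        lambda: safe_sum([Decimal("1.5"), Decimal("2.5")]) == Decimal("4")
    )
    checks["rejects_non_numeric"] = passes(rejects_non_numeric)
    return checks


# Illustrative model output that ignores two of the constraints.
lazy_candidate = "def safe_sum(items):\n    return sum(items)\n"
print(check_safe_sum(lazy_candidate))
```

The lazy candidate returns correct totals and still fails two checks, which is exactly the kind of result a pass/fail unit test on the happy path would hide.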

> **Key Insight**
>
> Generation evaluations collapse when prompts are too clean. Real engineering requirements are ambiguous, incomplete, and occasionally contradictory. Benchmark your models against prompts that look like actual tickets, not textbook exercises. The delta between "ideal spec" and "messy spec" performance is often the most predictive signal you'll find.

---

### Completion: The Context Dependency Problem

Completion is generation with a running start: the model fills in code given partial context already on the screen. It's the most common interaction pattern in IDE-based assistants, and it needs to be evaluated very differently from generation from scratch.

The key variable is how well the model uses the surrounding context. A completion model needs to use what's already there: existing variable names, established patterns, open function signatures. Models that ignore context produce syntactically valid code that clashes with everything around it. Models that over-index on surface similarity produce code that looks right but does the wrong thing.

One useful framing: completion quality is a function of context coherence. You're not just measuring whether the model finishes the function — you're measuring whether it finishes it as *this* codebase would finish it. That means your evaluation inputs need to come from real code, not synthetic snippets.

Test for this by seeding completion prompts with code that has distinctive idioms: custom abstractions, domain-specific variable names, non-obvious patterns. If the model ignores those signals, it fails at the actual task even if it generates valid Python.

Latency is also a real constraint here, not just a nice-to-have. Completion that takes two seconds isn't completion anymore — it's just slow generation. If you're evaluating for real-time IDE use, include latency as a first-class metric.
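Treating latency as a first-class metric mostly means recording it on every call and reporting percentiles rather than an average. The sketch below times end-to-end wall-clock latency around any callable; `complete` stands in for your own client call, and `measure_latency` is a hypothetical helper name.

```python
import time
from statistics import quantiles
from typing import Callable


def measure_latency(complete: Callable[[str], object], prompts: list[str]) -> dict[str, float]:
    """Time each call to `complete` and summarize latency in milliseconds."""
    if len(prompts) < 2:
        raise ValueError("need at least two prompts to compute percentiles")
    latencies_ms = []
    for prompt in prompts:
        start = time.perf_counter()
        complete(prompt)
        latencies_ms.append((time.perf_counter() - start) * 1000)

    # n=100 yields 99 cut points; index 49 is the median, index 94 the 95th percentile.
    cuts = quantiles(latencies_ms, n=100, method="inclusive")
    return {"p50_ms": cuts[49], "p95_ms": cuts[94], "max_ms": max(latencies_ms)}


def fake_complete(prompt: str) -> str:
    """Placeholder for a real completion call."""
    time.sleep(0.001)
    return prompt


print(measure_latency(fake_complete, ["def handler(request):"] * 20))
```

The tail matters more than the median for interactive use: a completion feature with a fast p50 and a slow p95 will still feel unreliable to the developers using it.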

> **Warning**
>
> Most completion benchmarks use synthetic file fragments that are structurally simple and contextually clean. Real completions happen mid-function, across files, with imports partially resolved and types implicit. If your benchmark doesn't reflect that messiness, you're testing a capability the model will rarely be asked to exercise.

---

### Review: What Models Miss and Why

Code review is a different cognitive task entirely. Instead of producing code, the model reads existing code and identifies problems. This sounds easier. It's harder.

The difficulty is that review requires understanding intent, not just structure. A model can identify that a loop is inefficient. It's much harder for a model to flag that the loop is inefficient *for its stated purpose*, or that the logic contradicts a comment written three lines above it, or that this pattern was already fixed elsewhere in the codebase and this is a regression.

Security review is the highest-stakes application and also the most failure-prone. Models are very good at finding the categories of vulnerability they've seen often in training data. They're systematically weak at novel patterns, at business-logic flaws, and at issues that only become visible across multiple files.

When evaluating review capability, structure your test cases in three tiers:

1. **Known vulnerability patterns** — SQLi, XSS, insecure deserialization. Models should catch these reliably. If they don't, that's a floor failure.
2. **Logic errors with domain context** — Incorrect rate limiting, broken auth flow, wrong access control boundary. These require understanding what the code is *supposed* to do.
3. **Cross-file issues** — A function that looks correct in isolation but calls a deprecated interface, contradicts a contract defined elsewhere, or creates a subtle race condition across services.

Most models perform well on tier one and degrade significantly through tiers two and three. Your evaluation should spend most of its time in tiers two and three, because that's where the real risk lives.
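
Reporting one catch rate per tier keeps a strong tier-one result from hiding weakness in the harder tiers. A minimal sketch, assuming each test case carries a seeded defect and a human (or rubric) has already judged whether the model flagged it; the `ReviewCase` fields and the sample data are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class ReviewCase:
    case_id: str
    tier: int              # 1 = known pattern, 2 = domain logic, 3 = cross-file
    expected_issue: str    # short label for the seeded defect
    caught: bool           # did the model's review flag it?


def catch_rate_by_tier(cases: list[ReviewCase]) -> dict[int, float]:
    """Fraction of seeded defects the model flagged, per tier."""
    totals: dict[int, int] = defaultdict(int)
    hits: dict[int, int] = defaultdict(int)
    for case in cases:
        totals[case.tier] += 1
        hits[case.tier] += int(case.caught)
    return {tier: hits[tier] / totals[tier] for tier in sorted(totals)}


# Illustrative results only; replace with your own labeled cases.
cases = [
    ReviewCase("sqli-01", 1, "string-built SQL query", True),
    ReviewCase("xss-01", 1, "unescaped template output", True),
    ReviewCase("rate-01", 2, "rate limit keyed on wrong identifier", True),
    ReviewCase("auth-01", 2, "role check after data fetch", False),
    ReviewCase("dep-01", 3, "call into deprecated interface", False),
    ReviewCase("race-01", 3, "cross-service race on retry", False),
]
print(catch_rate_by_tier(cases))
```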

---

### Search: Precision Over Volume

Code search is the most underspecified of the four categories, and models vary more on it than any other task. The user states a goal in natural language; the model either retrieves the right code or it doesn't.

Two failure modes dominate. The first is false precision — the model returns a confident answer pointing to code that is syntactically related but semantically wrong. The query was "find where we handle auth token expiry" and the model returns the token generation logic instead. The second is excessive breadth — the model returns eight results when one correct one exists, leaving the engineer to filter.

Retrieval quality depends heavily on how the model was trained and what retrieval architecture sits underneath it. A model with strong code embeddings but weak cross-file reasoning will find relevant files but fail to identify the specific entry point. A model with strong reasoning but weak embeddings may correctly interpret the query but search the wrong places.

For evaluation, treat search like information retrieval: precision, recall, and MRR are your metrics. Define a gold standard of query-to-location pairs from your actual codebase, and measure against that. Don't use synthetic queries — they don't capture the vocabulary mismatches and implicit context that make real search hard.
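
These metrics are short to compute once the gold standard exists. The sketch below assumes locations are written as `file::function` strings and that the model returns a ranked list per query; the queries, paths, and function names are made up for illustration.

```python
def reciprocal_rank(ranked: list[str], relevant: set[str]) -> float:
    """1/rank of the first relevant result, or 0.0 if none is returned."""
    for rank, location in enumerate(ranked, start=1):
        if location in relevant:
            return 1.0 / rank
    return 0.0


def precision_recall_at_k(ranked: list[str], relevant: set[str], k: int) -> tuple[float, float]:
    """Precision and recall over the top-k results."""
    top_k = ranked[:k]
    hits = sum(1 for location in top_k if location in relevant)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical gold standard: query -> correct code locations.
gold = {
    "where do we handle auth token expiry?": {"auth/session.py::refresh_if_expired"},
    "where does the payment retry logic live?": {"billing/retry.py::schedule_retry"},
}
# Hypothetical model output: ranked locations per query.
results = {
    "where do we handle auth token expiry?": [
        "auth/tokens.py::generate_token",
        "auth/session.py::refresh_if_expired",
    ],
    "where does the payment retry logic live?": ["billing/retry.py::schedule_retry"],
}

mrr = sum(reciprocal_rank(results[q], gold[q]) for q in gold) / len(gold)
print(f"MRR: {mrr:.2f}")
for query in gold:
    print(query, precision_recall_at_k(results[query], gold[query], k=3))
```

In the first query the token-generation result outranks the correct one, which is the false-precision failure described above; MRR penalizes it even though the right location was returned.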

> **Try This**
>
> Build a 20-query search benchmark from your own codebase. Use queries that your team actually asked in Slack or code review: "where does the payment retry logic live?", "what handles token refresh?". For each query, document the correct answer manually first. Then run it against the model. Count exact-match hits, near-miss hits (correct file, wrong function), and misses. This benchmark costs about two hours to build and gives you a more honest signal than any vendor leaderboard.

---

### Choosing the Right Evaluation for the Right Task

The mistake most teams make is evaluating all four task types with the same rubric. That produces noisy, misleading results. A model optimized for generation will be evaluated fairly on generation and unfairly on everything else — or vice versa.

The right approach is to map your actual use case to the task categories first. If your team uses an IDE assistant primarily for completion, weight completion heavily. If you're building an automated review tool, tier your review evaluation by the failure categories above. If you're building a code search interface, invest in a retrieval benchmark before you do anything else.

Task mix also matters. Many production tools combine tasks in ways that compound error rates. A pipeline that generates code, then reviews it with the same model, accumulates the failure modes of both. Evaluating each task in isolation tells you something; evaluating them in sequence tells you something more useful.

One practical heuristic: start with the task where failure is most costly for your use case. Wrong generation is usually caught before it ships. Wrong security review might not be. Prioritize your evaluation investment accordingly.

---

### Key Takeaways

- Generation quality is tested by prompts with real constraints, not clean specs. If the prompt can't fail, it doesn't reveal anything useful.
- Completion is a context coherence problem. Evaluate using real code with distinctive patterns, not synthetic fragments.
- Code review degrades as problem complexity increases. Most models handle common vulnerability patterns reliably and cross-file logic poorly. Your evaluation should spend most of its time on the hard tier.
- Search requires retrieval-specific metrics — precision, recall, MRR. Evaluate against queries from your actual team, not vendor-provided examples.
- Task categories compound in pipelines. Evaluate in isolation first, then in sequence. The sequential failure rate is almost always higher than the isolated rates would predict.

---

### Try This

Pick the one task category most central to your intended use case. Build five evaluation inputs that represent the *hardest* realistic version of that task — not the easiest. For generation, use an underspecified prompt with at least three implicit constraints. For completion, use a file fragment from your production code with domain-specific idioms. For review, use code with a business-logic flaw that isn't a textbook vulnerability. For search, use a query that any engineer on your team would ask but that uses none of the actual function names.

Run those five inputs against the model you're currently considering. Document every failure. That's your baseline. Every model you evaluate after this should be measured against the same five inputs before you look at anything else.


---

## Chapter 4: Evaluation Metrics That Actually Predict Production Quality

### Chapter Overview

Benchmarks lie. Not maliciously — they just measure the wrong things. Pass@k scores on HumanEval tell you whether a model can solve 164 Python puzzles from 2021. They tell you almost nothing about whether that model will write correct SQL against your schema, catch a subtle race condition in your async code, or produce a diff your team can actually review. This chapter is about building an evaluation stack that predicts real production quality — the kind of metrics that, when they go up, actual outcomes improve. That requires understanding what each metric class actually captures, where each breaks down, and how to combine them into a signal you can act on.

---

### Why Standard Benchmarks Fall Short

HumanEval, MBPP, SWE-bench — these are useful for one thing: roughly rank-ordering models before you commit time to serious evaluation. Nothing more. The problems are known, the solutions are often in training data, and the task distribution bears little resemblance to production codebases with ten years of accumulated context, inconsistent naming conventions, and business logic that can't be explained in a docstring.

The deeper problem is that benchmark performance is optimized directly. Labs tune on these distributions, and they'd be irrational not to. A model that scores 87% on HumanEval and 72% on your internal eval suite is telling you something real. The gap between those two numbers is where you learn about the model.

There's also a fundamental unit-of-analysis mismatch. Function-level benchmarks such as HumanEval and MBPP score individual functions in isolation. Production code lives in systems. A model that writes a clean `parse_invoice()` in a vacuum may consistently miss that the calling code passes a `bytes` object, not `str`, because it never sees that context. Evaluation that doesn't account for context size, retrieval quality, and cross-file dependencies is measuring a skill that doesn't transfer.

---

### Functional Correctness: What It Actually Measures

Test-pass rate is the closest thing to ground truth for generation tasks. If the code runs and the tests pass, you have evidence of correctness — not proof, but evidence.

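The standard metric is pass@k: the probability that at least one of k sampled completions passes the test suite. In practice you generate n samples per problem (with n at least k), count how many pass, and feed both counts into the unbiased estimator. Pass@1 is what matters for production. Pass@5 tells you about the model's ceiling. The gap between them tells you about consistency.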

```python
import math


def pass_at_k(n_samples: int, n_correct: int, k: int) -> float:
    """Unbiased estimator for pass@k from n_samples generations, n_correct of which pass."""
    if n_samples - n_correct < k:
        return 1.0
    return 1.0 - math.comb(n_samples - n_correct, k) / math.comb(n_samples, k)
```
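
To report a suite-level number, average the per-problem estimate. The sketch below uses made-up sample counts, and it repeats the estimator so the block runs on its own; `suite_pass_at_k` is a hypothetical helper name.

```python
import math


def pass_at_k(n_samples: int, n_correct: int, k: int) -> float:
    """Unbiased pass@k estimate from n_samples generations, n_correct of which pass."""
    if n_samples - n_correct < k:
        return 1.0
    return 1.0 - math.comb(n_samples - n_correct, k) / math.comb(n_samples, k)


def suite_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Mean pass@k over (n_samples, n_correct) pairs, one pair per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Illustrative counts: 20 samples per problem, with 20, 5, and 0 passing.
results = [(20, 20), (20, 5), (20, 0)]
p1 = suite_pass_at_k(results, k=1)
p5 = suite_pass_at_k(results, k=5)
print(f"pass@1={p1:.3f}  pass@5={p5:.3f}  gap={p5 - p1:.3f}")
```

The middle problem drives the gap: the model can solve it, but only a quarter of the time, which is exactly the inconsistency the pass@1 versus pass@5 comparison is meant to expose.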

The catch: test quality determines metric quality. Weak tests pass bad code. The evaluation is only as strong as the oracle. If you're building internal evals, writing rigorous tests for your evaluation cases is 80% of the work. Property-based tests are worth the overhead here — they probe edge cases that hand-written tests miss.

> **Warning:** Never evaluate on tests the model could have memorized. If your eval borrows from public repositories, assume the model has seen the test suite. Construct private, held-out test cases for any evaluation you want to trust.

Functional correctness also collapses on non-deterministic code — anything involving concurrency, randomness, or external I/O. For those categories, you need different approaches: sandboxed execution with mocked dependencies, or human-in-the-loop review as ground truth.

---

### Code Quality Signals That Aren't Subjective

Correctness is necessary but not sufficient. Code that passes tests can still be unmaintainable, insecure, or wrong in ways tests don't cover. Several measurable proxies capture this.

**Cyclomatic complexity** measures the number of independent paths through a function. High complexity correlates with defect density and maintenance cost. You can compute it with `radon` in Python or `lizard` for polyglot codebases. A useful eval signal: compare the complexity of model-generated code to the complexity of human-written code in the same codebase for equivalent tasks. Models that produce structurally simpler code without sacrificing correctness are generally better collaborators.

**Type correctness** is underused as an eval metric. Run mypy or pyright on model outputs. Strict mode violations are a proxy for semantic errors — the model generated something that looks right but violates contract guarantees the type system encodes. This is especially valuable for completion tasks where the model is extending existing typed code.

**Security linting** via tools like Bandit, Semgrep, or CodeQL gives you a vulnerability signal. This matters a lot for code review tasks. A model that misses an obvious SQL injection is less useful than its pass@1 score suggests.

```bash
# Automated quality scoring pipeline
radon cc -a -s generated_code/ > complexity_report.txt
mypy --strict generated_code/ 2>&1 | grep "error:" | wc -l > type_errors.txt
bandit -r generated_code/ -f json > security_report.json
```

These tools produce noisy signals in isolation. Aggregated across a large eval set, they surface systematic weaknesses — a model that consistently introduces security findings in database interaction code, for example, or one that degrades type coverage in every file it touches.
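
Aggregation can be as simple as grouping per-sample counts by task category. The sketch below assumes you have already parsed the tool reports into one record per generated sample; the record fields and category names are assumptions for illustration, not a format any of these tools emit.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-sample records built from your own parsing of the reports.
samples = [
    {"category": "database", "type_errors": 0, "security_findings": 2},
    {"category": "database", "type_errors": 1, "security_findings": 1},
    {"category": "parsing", "type_errors": 3, "security_findings": 0},
    {"category": "parsing", "type_errors": 2, "security_findings": 0},
    {"category": "http", "type_errors": 0, "security_findings": 0},
]


def mean_by_category(samples: list[dict], field: str) -> dict[str, float]:
    """Average one signal per task category across the eval set."""
    grouped: dict[str, list[int]] = defaultdict(list)
    for sample in samples:
        grouped[sample["category"]].append(sample[field])
    return {category: mean(values) for category, values in grouped.items()}


for field in ("type_errors", "security_findings"):
    print(field, mean_by_category(samples, field))
```

With a real eval set you would want many samples per category before reading anything into the averages; five records are only enough to show the shape of the calculation.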

---

### Semantic Similarity and Its Limits

BLEU, ROUGE, CodeBLEU, and embedding-based similarity scores compare model output to a reference solution. They're fast, cheap, and useful for regression testing: if a metric that was stable suddenly drops after a model update, something changed and you should investigate.

> **Key Insight:** Semantic similarity metrics are best used as anomaly detectors, not absolute quality scores. A BLEU drop of 8 points between model versions means "look here." It doesn't mean "reject this model."

The fundamental limitation is that there are many correct implementations of the same function, and they can look syntactically very different. Two correct merge sort implementations may share almost no tokens. Embedding similarity helps here over token-level metrics — CodeBERT representations capture functional intent better than n-gram overlap — but even semantic similarity can't resolve the multiple-correct-solutions problem reliably.

For search and retrieval tasks, mean reciprocal rank (MRR) and normalized discounted cumulative gain (nDCG) are more appropriate. If you're evaluating a model's ability to retrieve relevant code snippets or rank candidates, these metrics translate directly from information retrieval. The key is constructing a relevance judgment set specific to your codebase — relevance in retrieval is domain-specific, and a generic judgment set won't reflect your actual retrieval quality.
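
nDCG is worth a concrete look because it rewards graded relevance, which suits code search where "right file, wrong function" is better than a miss. A minimal sketch, assuming a three-level grading scheme invented for this example and computing the ideal ordering only from the returned results (a simplification; a full implementation would use every judged item for the query).

```python
import math


def dcg(relevances: list[float]) -> float:
    """Discounted cumulative gain for relevance grades in ranked order."""
    return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(relevances, start=1))


def ndcg(relevances: list[float], k: int) -> float:
    """DCG of the top-k results divided by the DCG of the ideal ordering."""
    ideal = dcg(sorted(relevances, reverse=True)[:k])
    return dcg(relevances[:k]) / ideal if ideal > 0 else 0.0


# Hypothetical graded judgments for one query, in the order the model ranked them:
# 2 = exact location, 1 = right file but wrong function, 0 = irrelevant.
ranked_grades = [1, 0, 2]
print(round(ndcg(ranked_grades, k=3), 3))
```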

---

### Human Evaluation: When to Use It and How to Do It Right

Automated metrics miss a class of problems that humans catch immediately: code that's technically correct but structurally bizarre, naming that's inconsistent with the surrounding codebase, comments that are confidently wrong, or logic that solves the literal problem while ignoring the obvious intent.

Human eval is expensive, slow, and subject to evaluator bias. Use it selectively.

The cases where it's worth the cost: calibrating automated metrics (run human eval on a stratified sample to check whether your automated scores correlate with human judgment), evaluating code review quality (does the model catch real bugs, or does it produce plausible-sounding commentary on non-issues?), and any task involving natural language in code — docstrings, error messages, comments.

When you run human eval, the protocol matters. Use a blind, randomized setup. Give evaluators specific rubrics rather than "rate quality 1-5." Ratings like "would you accept this in a PR without modification?" or "does this introduce a bug you'd need to catch in review?" produce more consistent signal than abstract quality scales. Inter-annotator agreement should be measured and reported — if two experienced engineers disagree on more than 30% of samples, your task definition is ambiguous.
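
Raw agreement overstates consistency when one answer dominates, so a common companion statistic is Cohen's kappa, which corrects for agreement expected by chance. The sketch below is a plain-Python version for two annotators; the labels are invented, and the 30% disagreement threshold above refers to raw disagreement, not kappa.

```python
from collections import Counter


def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Chance-corrected agreement between two annotators on the same samples."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Illustrative answers to "would you accept this in a PR without modification?"
engineer_1 = ["yes", "yes", "no", "no", "yes", "no", "yes", "no", "yes", "yes"]
engineer_2 = ["yes", "no", "no", "no", "yes", "no", "yes", "yes", "yes", "yes"]
raw = sum(a == b for a, b in zip(engineer_1, engineer_2)) / len(engineer_1)
print(f"raw agreement={raw:.2f}  kappa={cohens_kappa(engineer_1, engineer_2):.2f}")
```

Here the engineers agree on 80% of samples, yet kappa is noticeably lower because a fair share of that agreement would occur by chance with these label frequencies.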

> **Try This:** Take 20 outputs from your model eval set where automated metrics scored highest. Have two engineers review them independently using a structured rubric. Calculate how often the automated scores agree with human judgment. That agreement rate is a rough measure of how far you can trust your automated eval system. Because the sample contains only top-scoring outputs, it tells you about false positives, not about good outputs the metrics wrongly penalize.

---

### Building a Composite Score

No single metric captures production quality. The right approach is a weighted composite that reflects what actually matters for your use case.

For a code generation workflow, a reasonable starting point:

| Metric | Weight | Rationale |
|---|---|---|
| Pass@1 (test suite) | 0.40 | Ground truth on correctness |
| Type error rate | 0.20 | Semantic contract violations |
| Security findings | 0.20 | Risk surface |
| Cyclomatic complexity delta | 0.10 | Maintainability |
| Human accept rate (sampled) | 0.10 | Calibration signal |

The weights aren't universal. A security-sensitive context should weight vulnerability findings much higher. A context where test coverage is thin should weight human eval more. The point is to make weighting explicit and revisable — not to find the one true composite score.

Calibrate the composite against actual outcomes. If you're tracking PR rejection rates, defect rates, or time-to-review for model-assisted code, those outcomes should correlate with your composite score over time. If they don't, your weights are wrong.
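
The table translates into a few lines of code. This sketch assumes every metric has already been normalized to the range 0 to 1 with higher meaning better, so "lower is better" signals such as type error rate are inverted first; the metric names and the sample values are placeholders.

```python
WEIGHTS = {
    "pass_at_1": 0.40,
    "type_correctness": 0.20,
    "security": 0.20,
    "complexity": 0.10,
    "human_accept_rate": 0.10,
}


def composite_score(metrics: dict[str, float], weights: dict[str, float] = WEIGHTS) -> float:
    """Weighted sum of metrics already normalized to [0, 1], higher is better."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    missing = weights.keys() - metrics.keys()
    if missing:
        raise ValueError(f"missing metrics: {sorted(missing)}")
    for name in weights:
        if not 0.0 <= metrics[name] <= 1.0:
            raise ValueError(f"{name} must be normalized to [0, 1]")
    return sum(weights[name] * metrics[name] for name in weights)


# Illustrative, pre-normalized values. For example,
# type_correctness = 1 - share of samples with type errors.
model_metrics = {
    "pass_at_1": 0.72,
    "type_correctness": 0.90,
    "security": 0.95,
    "complexity": 0.60,
    "human_accept_rate": 0.55,
}
print(f"{composite_score(model_metrics):.3f}")
```

Keeping the weights in one named dictionary makes them easy to version and revise as outcome data accumulates.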

---

### Key Takeaways

- Pass@k on public benchmarks is useful for initial model ranking only. It does not predict how a model performs on your codebase, your task distribution, or your context sizes.
- Test-pass rate is the strongest single metric for generation tasks, but it's only as good as the test suite. Invest in evaluation test quality before interpreting results.
- Automated code quality tools — type checkers, complexity analyzers, security linters — produce noisy individual signals that become meaningful evaluation inputs when aggregated across a large eval set.
- Human evaluation is not optional for code review tasks, but it should be used selectively and with rigorous protocol. Correlation between human judgment and your automated metrics is itself a metric worth tracking.
- Build a composite score with explicit weights tied to your actual production risk profile. Make the weights revisable as you accumulate outcome data.

### Try This

Pull the last 50 pull requests in your repository that involved significant code changes. For each, note whether the change introduced a defect found in production within 30 days. Now score each PR's code using the automated pipeline from this chapter — complexity, type errors, security findings. Calculate the correlation between your composite automated score and defect outcome. If the correlation is below 0.3, your current metric weights don't reflect your actual risk distribution and need recalibration. If it's above 0.5, you have a working leading indicator for production defect risk — and a foundation for evaluating whether an LLM-assisted workflow would have caught what human review missed.


---

## Chapter 5: Cost-Quality Tradeoffs and the Efficient Frontier

### Chapter Overview

Every organization running LLMs at scale eventually hits the same wall: the model that scores best on your evaluations is not the model that makes economic sense to deploy. This chapter is about how to think rigorously about that tradeoff — not by finding the cheapest option that's "good enough," but by identifying where on the cost-quality curve your workload actually belongs. That requires building a proper efficient frontier: the set of models where no alternative gives you better quality at the same cost, or lower cost at the same quality. Everything else is just guessing.

---

### Why "Good Enough" Is the Wrong Frame

The instinct is understandable. You need a model for code review. You benchmark three candidates. One scores noticeably better, but costs four times as much per token. The cheaper one seems adequate. You ship it.

Six months later, you have data showing that the cheaper model misses subtle logic errors — the kind that make it through code review and cause production incidents. The cost of those incidents dwarfs what you saved on inference.

The problem with "good enough" is that it anchors on cost without quantifying what quality failures actually cost you. Bad code review doesn't fail loudly. It fails quietly, later, somewhere else. That makes it easy to undercount.

Before you can make a rational cost-quality decision, you need two numbers: what a unit of quality improvement is worth in your system, and what a unit of quality degradation costs you. Neither number is easy to compute, but approximating them is better than ignoring them entirely.

Start with your incident data. If you can tie even a fraction of production bugs to review gaps, you have a floor on what better review quality is worth. If your team averages 2 hours to debug an incident that better code review would have caught, that's your conversion rate between review quality and engineer hours.

This reframes the comparison. It's no longer "Model A costs 4x more." It's "Model A costs 4x more and we estimate it reduces escaped defects by 30%, which at our incident rate is worth $X/month." Now you have an actual decision.
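
The arithmetic behind that "$X/month" can be made explicit. Every input in the sketch below is a placeholder (the defect count, hourly rate, and extra spend are invented), and the model is deliberately simple: it counts only debugging time and ignores customer impact.

```python
def monthly_net_value(
    extra_inference_cost: float,      # added monthly spend for the pricier model, USD
    escaped_defects_per_month: float,
    defect_reduction: float,          # estimated fraction of escaped defects avoided
    hours_per_incident: float,
    loaded_hourly_rate: float,        # fully loaded engineer cost, USD/hour
) -> float:
    """Estimated monthly savings from avoided incidents minus added inference cost."""
    avoided = escaped_defects_per_month * defect_reduction
    savings = avoided * hours_per_incident * loaded_hourly_rate
    return savings - extra_inference_cost


# All inputs below are placeholders; substitute your own incident data.
net = monthly_net_value(
    extra_inference_cost=1_500.0,
    escaped_defects_per_month=20,
    defect_reduction=0.30,
    hours_per_incident=2.0,
    loaded_hourly_rate=150.0,
)
print(f"net monthly value: ${net:,.0f}")
```

A positive result means the more expensive model pays for itself under these assumptions; the more useful exercise is varying `defect_reduction` to see how wrong your estimate can be before the sign flips.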

---

### Building the Efficient Frontier

The efficient frontier concept comes from portfolio theory, but it applies cleanly here. Plot your candidate models on a two-axis chart: cost on the x-axis (cost per 1K tokens, or cost per task), quality on the y-axis (your composite score from Chapter 4). The frontier is the upper-left boundary: the Pareto-optimal models, those for which no other model is at least as good on both cost and quality and strictly better on one.

Any model that falls below and to the right of that frontier is dominated. It costs more than an equivalent-quality option, or delivers less quality than an equivalent-cost option. You can eliminate those from consideration immediately.

```python
import matplotlib.pyplot as plt
import numpy as np

models = {
    "Model A": {"cost_per_1k": 0.003, "quality_score": 0.71},
    "Model B": {"cost_per_1k": 0.008, "quality_score": 0.78},
    "Model C": {"cost_per_1k": 0.015, "quality_score": 0.84},
    "Model D": {"cost_per_1k": 0.012, "quality_score": 0.76},  # dominated
    "Model E": {"cost_per_1k": 0.030, "quality_score": 0.85},  # near-dominated
}

def find_pareto_frontier(models):
    pareto = []
    for name, m in models.items():
        dominated = any(
            other["quality_score"] >= m["quality_score"] and
            other["cost_per_1k"] <= m["cost_per_1k"] and
            (other["quality_score"] > m["quality_score"] or
             other["cost_per_1k"] < m["cost_per_1k"])
            for other_name, other in models.items()
            if other_name != name
        )
        if not dominated:
            pareto.append(name)
    return pareto

frontier = find_pareto_frontier(models)
print("Frontier models:", frontier)
# Output: Frontier models: ['Model A', 'Model B', 'Model C', 'Model E']
```

Model D is dominated: Model B delivers better quality at a lower cost. Model E technically stays on the frontier because nothing else matches its quality, but it is nearly dominated: Model C is only marginally worse at half the price. Once you have the frontier, your decision becomes: which point on this curve matches your workload's value function?

> **Key Insight**
>
> The efficient frontier eliminates the noise. You don't need to evaluate 12 models against each other. You need to identify which 2-3 are Pareto-optimal for your task type, then pick the one that fits your cost tolerance. Everything else is distraction.

---

### Task Decomposition Changes the Math

One of the most underused strategies in LLM cost optimization is task decomposition — routing subtasks to different models based on what each subtask actually requires.

Code tasks are rarely monolithic. Reviewing a pull request might involve understanding the diff, identifying which files are affected, generating a review comment, checking for security issues, and summarizing changes. These are not equally hard problems. Summarizing changes mostly restates what the diff already shows. Identifying security issues requires deep reasoning. Generating a review comment needs style consistency with your codebase norms.

If you route every subtask to your most capable model, you're overpaying on the easy parts. A common decomposition pattern:

```python
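import asyncio

# lightweight_model, mid_tier_model, and capable_model are placeholders for
# your own async client wrappers exposing `analyze(prompt=..., max_tokens=...)`.

async def review_pull_request(diff: str, codebase_context: str) -> dict:
    # The three subtasks are independent, so they can run concurrently.
    affected_files, review_comments, security_issues = await asyncio.gather(
        # Cheap model for structural analysis
        lightweight_model.analyze(
            prompt=f"List the files modified and their likely purpose:\n{diff}",
            max_tokens=200,
        ),
        # Mid-tier model for review comments, given the codebase conventions
        mid_tier_model.analyze(
            prompt=(
                "Generate specific, actionable review comments.\n"
                f"Codebase conventions:\n{codebase_context}\n\nDiff:\n{diff}"
            ),
            max_tokens=800,
        ),
        # Most capable (and most expensive) model reserved for the security pass
        capable_model.analyze(
            prompt=f"Identify any security vulnerabilities in this diff:\n{diff}",
            max_tokens=400,
        ),
    )

    return {
        "affected_files": affected_files,
        "review": review_comments,
        "security": security_issues
    }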
```

In practice, you might find that 70% of your token spend was going to tasks where a cheaper model performs within 3% of your best model. That asymmetry is where decomposition pays off.

> **Warning**
>
> Task decomposition introduces orchestration complexity and new failure modes. Each model call is a potential point of latency, rate limiting, or quality degradation. Don't decompose for the sake of optimization until you've measured where quality variance actually comes from in your pipeline. Premature decomposition creates fragile systems that are hard to debug.

---

### Quality Is Not Linear in Cost

Here's a property of the cost-quality curve that surprises most people: quality gains are not proportional to cost increases. The curve flattens.

Going from a $0.003/1k-token model to a $0.008/1k-token model might buy you 8-10 quality points. Going from $0.015 to $0.030 might buy you 1-2. You're paying 2x for a fraction of the improvement.

This is where the "just use the best model" instinct is most expensive. The last few points of quality on most benchmarks cost disproportionately more — and those last few points often don't translate to proportionate production value. A model that generates slightly more idiomatic code in edge cases is a marginal improvement for most teams. A model that reliably handles your specific language stack and catches common error patterns is a substantial one.

What this means practically: don't optimize for the best absolute score. Optimize for where you get the steepest quality-per-dollar gain. For most workloads, that's somewhere in the middle of the price range, not at the top.

One useful heuristic is to compute quality-per-dollar for each frontier model and look for cliff edges — places where the ratio drops sharply. Below that cliff, you're on favorable terms. Above it, you're paying premium prices for diminishing returns.

```python
def quality_per_dollar(models):
    return {
        name: m["quality_score"] / m["cost_per_1k"]
        for name, m in models.items()
    }

ratios = quality_per_dollar(models)
sorted_ratios = sorted(ratios.items(), key=lambda x: x[1], reverse=True)
for name, ratio in sorted_ratios:
    print(f"{name}: {ratio:.1f} quality score per $ (per 1K tokens)")
```

Run this across your frontier models. The drop-off is usually obvious.
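
The average ratio is dominated by whichever model is cheapest, so a complementary view is the marginal gain: how much extra quality each step up the frontier buys per extra dollar. The sketch below reuses the illustrative numbers from the frontier example; `marginal_gains` is a hypothetical helper, not part of any library.

```python
def marginal_gains(
    models: dict[str, dict[str, float]], frontier: list[str]
) -> list[tuple[str, str, float]]:
    """Quality gained per extra dollar (per 1K tokens) between adjacent frontier models."""
    ordered = sorted(frontier, key=lambda name: models[name]["cost_per_1k"])
    steps = []
    for cheaper, pricier in zip(ordered, ordered[1:]):
        d_quality = models[pricier]["quality_score"] - models[cheaper]["quality_score"]
        d_cost = models[pricier]["cost_per_1k"] - models[cheaper]["cost_per_1k"]
        steps.append((cheaper, pricier, d_quality / d_cost))
    return steps


# Same illustrative numbers as the frontier example; Model D is left out
# because it is dominated.
models = {
    "Model A": {"cost_per_1k": 0.003, "quality_score": 0.71},
    "Model B": {"cost_per_1k": 0.008, "quality_score": 0.78},
    "Model C": {"cost_per_1k": 0.015, "quality_score": 0.84},
    "Model E": {"cost_per_1k": 0.030, "quality_score": 0.85},
}
steps = marginal_gains(models, list(models))
for cheaper, pricier, gain in steps:
    print(f"{cheaper} -> {pricier}: {gain:.2f} quality score per extra $ (per 1K tokens)")
```

With these numbers the marginal gain falls at each step and collapses on the last one, which is the cliff edge described above.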

---

### When the Expensive Model Is Actually Cheaper

There's a counterintuitive scenario that comes up in agentic and multi-step code tasks: using a more capable model reduces total token spend, because it makes fewer mistakes.

When a model misunderstands a task, it generates output that requires correction — either a human re-prompt, an automated retry, or a follow-up call to fix the error. Each correction is more tokens. In multi-step pipelines, errors compound. A wrong intermediate output can derail downstream steps, requiring the entire sequence to restart.

In agentic code generation benchmarks, less capable models sometimes consume 3-4x more tokens than capable models on the same task, because they spend tokens on wrong turns. The sticker price per token is lower, but the total task cost is higher.

This is why cost-quality analysis must be done at the task level, not the token level. Measure cost per completed task, not cost per token. The denominator changes everything.
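
A minimal sketch of that calculation, assuming you log the token count of every LLM call per task, retries included. The `TaskRun` structure, the logs, and the single blended price per 1K tokens are simplifying assumptions; real pricing usually distinguishes input from output tokens.

```python
from dataclasses import dataclass


@dataclass
class TaskRun:
    task_id: str
    tokens_per_call: list[int]   # every LLM call for the task, retries included
    completed: bool


def cost_per_completed_task(runs: list[TaskRun], price_per_1k: float) -> float:
    """Total spend across all runs divided by the number of tasks that finished."""
    total_tokens = sum(sum(run.tokens_per_call) for run in runs)
    completed = sum(run.completed for run in runs)
    if completed == 0:
        return float("inf")
    return (total_tokens / 1000) * price_per_1k / completed


# Illustrative logs: the cheaper model retries more and abandons one task.
cheap_runs = [
    TaskRun("t1", [4000, 4000, 4000], True),
    TaskRun("t2", [4000, 4000, 4000, 4000], False),
    TaskRun("t3", [4000, 4000], True),
]
capable_runs = [
    TaskRun("t1", [3000], True),
    TaskRun("t2", [3000, 3000], True),
    TaskRun("t3", [3000], True),
]
print(cost_per_completed_task(cheap_runs, price_per_1k=0.008))
print(cost_per_completed_task(capable_runs, price_per_1k=0.015))
```

Tokens spent on the abandoned task still count toward the numerator, which is what makes failed attempts visible in the final figure.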

> **Try This**
>
> Pick a representative sample of 20-30 completed tasks from your current pipeline. For each, count total tokens consumed across all LLM calls — including retries, corrections, and follow-ups. Compute average cost-per-task. Then run the same tasks against a more capable (and more expensive) model. Compare cost-per-task, not cost-per-token. You may find the gap is smaller than expected — or inverted.

---

### Key Takeaways

- Build an efficient frontier before comparing costs. Models below the frontier are strictly dominated and can be removed from consideration without further analysis.
- Quantify what quality failures cost you before deciding a cheaper model is "good enough." Bad code review fails silently and shows up as incident cost downstream.
- Task decomposition can dramatically reduce spend by routing subtasks to appropriately-capable models — but only after you've identified where quality variance actually originates.
- The cost-quality curve flattens at the high end. The best-per-dollar model is rarely the most expensive one. Find the cliff edge where quality-per-dollar drops sharply and stay below it.
- For multi-step and agentic workloads, measure cost-per-completed-task, not cost-per-token. More capable models often consume fewer total tokens by making fewer errors, which changes the economics significantly.

### Try This

Run a minimum viable efficient frontier analysis on your current model selection. Take three to five models you're considering or currently using. For each, compute your composite quality score (from your existing evaluation suite) and your average cost per task. Plot them on a two-axis chart. Identify which models are Pareto-optimal and which are dominated. If you only have one model today, add at least one cheaper and one more expensive alternative to the comparison. The exercise takes a few hours and usually produces at least one clear surprise — either a model you're overpaying for, or one you dismissed on price that deserves a second look.


---

## Chapter 6: Latency, Throughput, and Context Window Limits

### Chapter Overview

Cost and quality get most of the attention in LLM evaluations, but latency and throughput are what break production systems. A model that produces excellent code but takes 45 seconds to respond is useless in an IDE plugin. A model with a generous context window that silently degrades on long inputs is dangerous in a code review pipeline. This chapter covers how to measure and reason about the operational characteristics of LLMs — time to first token, generation speed, throughput under load, and the actual behavior of models as context fills up — so you can make deployment decisions based on evidence rather than spec sheets.

---

### Time to First Token Is Not the Same as Response Time

When engineers talk about "latency," they usually mean two different things without realizing it. Time to first token (TTFT) is how long the model takes to start responding. Generation speed is how fast tokens arrive after that. These have completely different implications depending on your use case.

For interactive applications — an autocomplete widget, an inline code suggestion, a chat interface — TTFT dominates the user experience. Users will tolerate a slow stream if it starts quickly. A 200ms TTFT with 30 tokens/second feels responsive. A 2-second TTFT followed by 80 tokens/second feels broken, even though the second scenario produces more text faster.

For batch pipelines — nightly code review, automated test generation, documentation synthesis — total response time matters more than TTFT. Here you want raw throughput: tokens per second, requests per minute, cost per thousand tokens processed.

Measure both, separately, in your actual infrastructure. Provider dashboards report aggregate averages that include best-case scenarios during off-peak hours. Run your own benchmark at the time of day you expect production load.

```python
import time
import anthropic

client = anthropic.Anthropic()

def measure_ttft_and_throughput(prompt: str, model: str) -> dict:
    start = time.perf_counter()
    first_token_time = None

    with client.messages.stream(
        model=model,
        max_tokens=512,
        messages=[{"role": "user", "content": prompt}]
    ) as stream:
        for text in stream.text_stream:
            if first_token_time is None:
                first_token_time = time.perf_counter()
        # Use the provider-reported token count; splitting streamed text on
        # whitespace counts words, not tokens.
        output_tokens = stream.get_final_message().usage.output_tokens

    end = time.perf_counter()
    if first_token_time is None:
        # No text was streamed, so TTFT and generation speed are undefined.
        return {"ttft_seconds": None, "tokens_per_second": 0.0, "total_seconds": end - start}
    generation_duration = end - first_token_time

    return {
        "ttft_seconds": first_token_time - start,
        "tokens_per_second": output_tokens / generation_duration if generation_duration > 0 else 0,
        "total_seconds": end - start,
    }
```

Run this across your candidate models with a prompt representative of your actual workload. Not "Hello, world." The prompt that mirrors what your pipeline sends.

---

### Throughput Under Load Looks Nothing Like Single-Request Benchmarks

A model that handles one request in 3 seconds will not necessarily handle ten concurrent requests in 3 seconds each, and the slowdown is rarely a clean multiple. Provider rate limits, queue depths, and infrastructure scaling all interact in ways that single-request benchmarks cannot capture.

If your system needs to handle bursts — a CI pipeline that triggers fifty review requests in parallel when a PR merges, for example — you need to measure throughput under realistic concurrency, not idealized sequential performance.

**Key Insight:** Most provider rate limits are expressed in tokens per minute (TPM) and requests per minute (RPM), but your effective throughput also depends on whether they enforce limits at the account level, the API key level, or the organization level. A system using multiple API keys to distribute load may behave very differently under rate pressure than one using a single key.

Test this by sending concurrent requests and measuring P50, P95, and P99 latencies. P50 tells you typical behavior. P95 and P99 tell you what your users experience when things go sideways. A model with a 3-second P50 and a 45-second P99 is a support ticket generator.

```python
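import asyncio
import time

import anthropic

client = anthropic.AsyncAnthropic()

async def concurrent_throughput_test(prompts: list[str], model: str, concurrency: int) -> dict:
    semaphore = asyncio.Semaphore(concurrency)
    latencies = []

    async def single_request(prompt):
        async with semaphore:
            start = time.perf_counter()
            # Swap in your own async client call if you use another provider
            await client.messages.create(
                model=model,
                max_tokens=512,
                messages=[{"role": "user", "content": prompt}],
            )
            elapsed = time.perf_counter() - start
            latencies.append(elapsed)

    await asyncio.gather(*[single_request(p) for p in prompts])

    latencies.sort()
    n = len(latencies)
    if n == 0:
        raise ValueError("no prompts were provided")
    # Nearest-rank percentiles over the sorted latencies
    return {
        "p50": latencies[int(n * 0.50)],
        "p95": latencies[int(n * 0.95)],
        "p99": latencies[int(n * 0.99)],
    }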
```

Run this test against the actual API endpoint you'll use in production. The numbers from a provider's status page are not your numbers.

---

### Context Window Size and Effective Context Window Are Different Things

Providers advertise context window sizes — 128K tokens, 200K tokens, 1M tokens — as though larger is unconditionally better. The advertised number is the maximum input length the model will accept. The effective context window is the range over which the model actually retrieves and uses information reliably. These are not the same.

**Warning:** Several models show measurable performance degradation on retrieval tasks when relevant information is placed in the middle of a long context, even when the total length is well within the advertised limit. This is sometimes called the "lost in the middle" problem. For code tasks specifically, this means a model might correctly identify a bug when the relevant function is near the top of the context but miss it when it's buried at position 40K in a 100K-token input.

Test this explicitly. Construct a synthetic eval: place a specific, findable artifact (a bug, a function signature, a variable name) at positions 10%, 30%, 50%, 70%, and 90% through a long context window. Measure retrieval accuracy at each position. If accuracy drops in the middle, your pipeline needs to account for that — either through chunking, reranking, or explicit positioning of high-priority content.
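
One way to build the positional eval described above is sketched here. It is a minimal illustration rather than a full harness: `place_needle`, `build_position_eval`, and `accuracy_by_position` are hypothetical helper names, and position is measured in lines rather than tokens, an approximation you may want to replace with counts from your tokenizer.

```python
def place_needle(filler_lines: list[str], needle: str, fraction: float) -> str:
    """Insert `needle` roughly `fraction` of the way through the filler lines."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    index = round(len(filler_lines) * fraction)
    return "\n".join(filler_lines[:index] + [needle] + filler_lines[index:])


def build_position_eval(
    filler_lines: list[str],
    needle: str,
    fractions: tuple[float, ...] = (0.1, 0.3, 0.5, 0.7, 0.9),
) -> dict[float, str]:
    """Build one long context per position, each containing the same needle."""
    return {f: place_needle(filler_lines, needle, f) for f in fractions}


def accuracy_by_position(results: dict[float, list[bool]]) -> dict[float, float]:
    """Turn per-position hit/miss lists into retrieval accuracy per position."""
    return {pos: sum(hits) / len(hits) for pos, hits in results.items() if hits}
```

Send each generated context to the model, record whether it found the artifact, and pass the hits per position to `accuracy_by_position`. A dip around the 0.5 position is the pattern to look for.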

This matters most for code review and code search applications where the model needs to reason across large codebases. A model with a "200K context window" that reliably attends to the first and last 20K tokens but drifts in the middle is effectively a model with two separate ~20K context windows — which changes your architecture significantly.

---

### Structured Outputs and Tool Use Add Latency You May Not Have Accounted For

Constrained decoding (forcing the model to output valid JSON, follow a schema, or call a specific tool) adds latency that doesn't show up in basic benchmarks. The overhead varies by provider and implementation. Some providers handle structured output with post-processing; others constrain generation itself, for example with constrained beam search or by masking invalid tokens at each step. Constraining generation is more reliable than post-processing but tends to be slower.

For code generation pipelines that need structured outputs (function signatures, test cases in a specific format, dependency lists), measure latency with structured output enabled, not just in free-form generation mode. The difference can be significant — 20% to 60% slower in some configurations — and it's workload-dependent.

**Try This:** Run the same code generation prompt twice: once asking for plain code in a markdown block, and once using your provider's structured output or tool-use mechanism to enforce a specific schema. Time both. If the structured version is substantially slower, decide whether you actually need the schema enforcement or whether prompt-based formatting is reliable enough for your use case. Sometimes it is.

Tool use — where the model decides to call a function and waits for the result before continuing — introduces additional round-trip latency that compounds in agentic workflows. A pipeline with five sequential tool calls at 2 seconds each looks very different from one where those calls can be parallelized. Model behavior here matters: some models batch tool calls in a single turn, others issue them one at a time. Check the actual call patterns in your workload, not the theoretical behavior described in the documentation.
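
A back-of-envelope model makes the difference concrete. The sketch below assumes each model turn costs a fixed `model_turn_seconds` and that batched tool calls run fully in parallel; both are simplifications, and the function names are illustrative.

```python
def sequential_latency(tool_seconds: list[float], model_turn_seconds: float) -> float:
    """One tool call per turn: every call pays for a model turn plus the tool itself."""
    return sum(model_turn_seconds + s for s in tool_seconds)


def batched_latency(tool_seconds: list[float], model_turn_seconds: float) -> float:
    """All calls issued in a single turn and executed in parallel."""
    if not tool_seconds:
        return 0.0
    return model_turn_seconds + max(tool_seconds)
```

With five 2-second tool calls and a 1-second model turn, the sequential pattern comes to 15 seconds and the batched pattern to 3. Replace the inputs with timings measured from your own traces before drawing conclusions.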

---

### Rate Limits as an Architectural Constraint

Rate limits are often treated as an ops problem — something to work around with retries and exponential backoff. That framing misses the deeper issue. If your system's throughput ceiling is determined by provider rate limits, then the model you choose has an architectural dependency on the provider's pricing tier, not just the model's capabilities.

A model that is technically superior but lives in a tier with aggressive rate limits may perform worse in production than a slightly less capable model with more headroom. This is especially true at launch, when you can't yet predict peak load.

Factor rate limits into your evaluation the same way you factor in token cost. Request the actual limits for your expected pricing tier, not the published defaults. Enterprise limits are often negotiable and frequently not listed in public documentation. Build a simple model of your expected peak RPM and TPM, compare it against the limits you can get access to, and identify where the ceiling is before you're in production trying to find it under pressure.
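
The "simple model" can be a few lines of arithmetic. The sketch below assumes input and output tokens count against one combined TPM limit; some providers may meter them separately, in which case you would split the check. `LoadEstimate` and `rate_limit_headroom` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class LoadEstimate:
    peak_requests_per_minute: float
    avg_input_tokens: float
    avg_output_tokens: float


def rate_limit_headroom(load: LoadEstimate, rpm_limit: float, tpm_limit: float) -> dict:
    """Compare expected peak load against RPM and TPM limits."""
    tokens_per_request = load.avg_input_tokens + load.avg_output_tokens
    peak_tpm = load.peak_requests_per_minute * tokens_per_request
    rpm_utilization = load.peak_requests_per_minute / rpm_limit
    tpm_utilization = peak_tpm / tpm_limit
    return {
        "peak_tpm": peak_tpm,
        "rpm_utilization": rpm_utilization,
        "tpm_utilization": tpm_utilization,
        # The limit with the higher utilization is the one you hit first
        "binding_limit": "tpm" if tpm_utilization >= rpm_utilization else "rpm",
        "within_limits": max(rpm_utilization, tpm_utilization) <= 1.0,
    }
```

For code review workloads with large inputs, the token limit is often the binding one even when request counts look comfortable, which is why both ratios are reported.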

---

### Key Takeaways

- Time to first token and generation speed are separate metrics with different implications — measure both and weight them according to your application's interaction model.
- Single-request latency benchmarks don't predict concurrent throughput. Test under realistic load with P95 and P99, not just averages.
- Advertised context window sizes describe maximum input length, not the range over which the model reliably retrieves information. Test retrieval accuracy at different positions in long contexts before trusting a large context window for production use.
- Structured output and tool-use mechanisms add latency that doesn't appear in standard benchmarks — measure it with your specific schemas and tool configurations.
- Provider rate limits are an architectural constraint, not just an ops detail. Account for them when choosing a model and pricing tier, before you hit them at the worst possible time.

---

### Try This

Pick the model you're currently most likely to deploy. Write a script that sends fifty concurrent requests — each containing a 10K-token code file with a deliberate bug inserted at a random position — and measures TTFT, total latency, and whether the model correctly identifies the bug. Run it twice: once during business hours and once during off-peak hours. Compare the P95 latencies and bug-detection rates across both runs and across different positions in the file. What you find will tell you more about whether this model is suitable for your use case than any benchmark table the provider has published.


---

## Chapter 7: Fine-Tuning vs. Prompting for Code Tasks

### Chapter Overview

The choice between fine-tuning a model and engineering better prompts is one of the most consequential decisions in any LLM deployment for code tasks — and it's routinely made too early, with too little data. This chapter lays out the actual tradeoffs: when prompting hits a real ceiling, when fine-tuning pays off, what fine-tuning actually costs across infrastructure, data, and ongoing maintenance, and how to tell which approach fits your situation before you've sunk weeks into the wrong one.

---

### The Default Should Be Prompting

Most teams reach for fine-tuning too quickly. They hit an inconsistency in model output, or the model doesn't know their internal framework, and the instinct is to train it in. Resist that instinct until you've genuinely exhausted prompting.

Prompting is reversible. Fine-tuning is not. If you prompt and it doesn't work, you iterate in minutes. If you fine-tune and the results are worse, you've spent compute, time, and now you need to decide whether to keep iterating or cut losses. That asymmetry matters.

The prompting toolkit is deeper than most teams use. Before concluding prompting can't solve the problem, you should have tried:

- **System prompt with concrete examples** of correct output in your target format
- **Few-shot examples** with at least 3–5 representative cases, not cherry-picked ones
- **Chain-of-thought for complex tasks** — ask the model to reason through the code structure before writing it
- **Structured output constraints** — JSON schemas, function signatures, strict format requirements passed as part of the prompt
- **Retrieval augmentation** — injecting relevant internal code, docs, or patterns into the context window dynamically

If you've run all of those and you're still seeing systematic failure on a well-defined task, now you have a real reason to evaluate fine-tuning.

> **Key Insight**
>
> Prompting failures are usually data problems in disguise. The model doesn't know your internal library because you haven't shown it. Retrieval-augmented generation — pulling relevant code into context dynamically — solves this for most "the model doesn't know our stack" complaints without touching the weights.

---

### What Fine-Tuning Actually Buys You

Fine-tuning is not magic. It adjusts model weights on your dataset, which shifts the probability distribution of outputs toward patterns present in your training data. That's it. The model doesn't become smarter. It becomes more likely to generate outputs that resemble your examples.

This is genuinely useful in two cases:

**Consistent format and style at scale.** If you need every generated function to follow a specific pattern — particular docstring format, error handling idiom, naming convention — and you're generating at enough volume that prompt-based enforcement has too much variance, fine-tuning can lock that in. The model stops needing instruction; it's been shaped to default there.

**Specialized domain knowledge not representable in context.** If your codebase uses an internal DSL, a proprietary framework with unique semantics, or patterns so idiosyncratic that explaining them in context would require more tokens than you can afford per call, fine-tuning lets you bake those patterns into the weights. This is the strongest legitimate case for it.

What fine-tuning does not buy you: general reasoning improvement, better debugging capability, or fixes for fundamental model limitations. If the base model can't reason through a complex algorithm, a fine-tune on your data won't fix that. It will just make the outputs look more like your failures.

---

### The Real Cost of Fine-Tuning

Compute cost is the visible cost. It's not the largest one.

**Data collection and curation** is usually the actual bottleneck. A useful fine-tuning dataset for code tasks requires high-quality input-output pairs where the output is genuinely correct, properly formatted, and representative of the task distribution. Generating that takes either significant human review time or a carefully constructed automated pipeline with its own validation layer.

A rough benchmark: for instruction-following tasks on code, you typically need at least 500–1,000 high-quality examples to see reliable improvement over a well-prompted base model, and 2,000–5,000 to see meaningful gains on complex tasks. Below that threshold, results are noisy and often worse than prompting.

Then there's the maintenance surface. Base models get updated. Provider fine-tuning APIs change. If you're using a hosted fine-tuning service and the provider deprecates the base model you tuned on, you're re-running the process on a new checkpoint. Your fine-tuned weights don't transfer. You own that debt going forward.

> **Warning**
>
> Parameter-efficient fine-tuning (LoRA, QLoRA) reduces compute cost significantly but doesn't reduce the data quality requirement. Teams sometimes confuse "cheaper to train" with "less data needed." Low-quality training data with an efficient adapter still produces a low-quality model. The training efficiency gains are real; the data shortcuts aren't.

---

### Evaluation Is Harder After Fine-Tuning

Before fine-tuning, your evaluation baseline is the model's general behavior on your benchmark. After fine-tuning, your model is now domain-specific, and your benchmarks need to reflect that.

This creates a subtle trap. Fine-tuned models often score higher on the specific task distribution they were trained on while scoring lower on related but slightly different tasks. If your eval set has any overlap with your training set — even indirect overlap through shared patterns — your metrics will be optimistic.

Build a held-out evaluation set before you collect training data, not after. Keep it strictly separated. Evaluate on both in-distribution and out-of-distribution tasks, because if your fine-tuned model regresses on out-of-distribution code tasks, that regression will surface in production whether or not it shows up in your benchmark.
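
One way to keep the held-out set strictly separated is to assign each example to a split by hashing a stable identifier, so the assignment never changes as the dataset grows. The sketch below works under that assumption; `assign_split` and the 20% default are illustrative choices, not a recommendation from any library.

```python
import hashlib


def assign_split(group_key: str, eval_fraction: float = 0.2) -> str:
    """Deterministically map a stable key to "eval" or "train".

    Use a key that groups related examples (a source file path or a
    repository name, for instance) so near-duplicates from the same
    source land on the same side of the split.
    """
    digest = hashlib.sha256(group_key.encode("utf-8")).digest()
    # Map the first 4 bytes of the hash to a number in [0, 1)
    bucket = int.from_bytes(digest[:4], "big") / 2**32
    return "eval" if bucket < eval_fraction else "train"
```

Because the split depends only on the key, an example collected six months from now cannot silently move a held-out item into the training set.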

```python
# Simple check for training/eval contamination
from difflib import SequenceMatcher

def is_contaminated(train_sample: str, eval_sample: str, threshold: float = 0.8) -> bool:
    ratio = SequenceMatcher(None, train_sample, eval_sample).ratio()
    return ratio >= threshold

def check_eval_contamination(train_set: list[str], eval_set: list[str]) -> list[int]:
    flagged = []
    for i, eval_sample in enumerate(eval_set):
        if any(is_contaminated(t, eval_sample) for t in train_set):
            flagged.append(i)
    return flagged
```

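This is a blunt check, not a comprehensive one. Semantic similarity matching is more robust, and because this version compares every eval sample against every training sample, it gets slow on large sets. Even so, running it before training is cheap relative to a training run and catches obvious leakage.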

---

### When to Fine-Tune and When to Prompt

The decision tree is shorter than vendor documentation makes it seem.

Start with prompting. Measure systematically. If you're seeing error rates above your acceptable threshold on a specific, well-defined subtask after genuine prompting effort, ask: is this failure due to missing knowledge, or missing reasoning?

Missing knowledge — the model doesn't know your internal patterns, your DSL, your conventions — is a retrieval or fine-tuning problem. Try retrieval first because it's cheaper and more maintainable. If the relevant context doesn't fit in the window, or if latency from retrieval is prohibitive, fine-tuning becomes the cleaner option.

Missing reasoning — the model fails at the logical structure of the task even when given full context — is not a fine-tuning problem. Fine-tuning on examples of correct reasoning can help at the margins, but if the base model lacks the capability, a fine-tune won't install it. You're looking at the wrong lever. The right answer there is a different base model.
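
The decision logic in the last few paragraphs is small enough to write down. This is a sketch with hypothetical names and string labels chosen for illustration, and it assumes you have already classified the failure as a knowledge gap or a reasoning gap.

```python
def recommend_next_step(
    failure_kind: str,
    context_fits: bool = True,
    retrieval_latency_ok: bool = True,
) -> str:
    """Map a diagnosed failure to the next lever worth trying."""
    if failure_kind == "reasoning":
        # Fine-tuning will not install a capability the base model lacks
        return "different base model"
    if failure_kind == "knowledge":
        # Retrieval is cheaper and easier to maintain, so it goes first
        if context_fits and retrieval_latency_ok:
            return "retrieval"
        return "fine-tuning"
    raise ValueError(f"unknown failure kind: {failure_kind!r}")
```

The hard part is the diagnosis that feeds `failure_kind`, not the branching. The function only records the order in which the options should be tried.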

> **Try This**
>
> Pick one code task your current model handles inconsistently. Write five high-quality few-shot examples covering the range of inputs you actually see. Measure the error rate on your eval set with and without the few-shot examples. If error rate drops by more than 30%, you have a prompting problem, not a model problem. If it barely moves, you have a capabilities gap — and that's a data point worth having before you commit to anything else.

---

### Continuous Deployment of Fine-Tuned Models

Fine-tuned models introduce a deployment lifecycle that prompted models don't require. You need a versioning scheme for the models you deploy, a mechanism to retrain when the base model updates, and a rollback path if the new version regresses.

Teams that skip the rollback path discover they needed it the hard way.

The practical minimum: version your fine-tuned models explicitly, keep your training data under version control alongside them, and never deploy a fine-tuned model without running it against your held-out eval set first. This sounds obvious. It's skipped more often than you'd expect, usually because the eval pipeline wasn't built before the fine-tuning started.
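
A minimal sketch of that discipline follows: a release record that ties a model version to a fingerprint of its training data and its held-out eval result, plus a gate that refuses to promote a regression. `FineTuneRelease`, `fingerprint_training_data`, and `can_deploy` are hypothetical names, and the pass-rate comparison is an assumed policy rather than a standard one.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FineTuneRelease:
    model_version: str
    base_model: str
    training_data_sha256: str
    eval_pass_rate: float  # measured on the held-out eval set


def fingerprint_training_data(examples: list[dict]) -> str:
    """Hash the training examples so a release can be traced to its data."""
    canonical = json.dumps(examples, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def can_deploy(
    candidate: FineTuneRelease,
    current: Optional[FineTuneRelease],
    max_regression: float = 0.0,
) -> bool:
    """Allow deployment only if the candidate does not regress past the tolerance."""
    if current is None:
        return True
    return candidate.eval_pass_rate >= current.eval_pass_rate - max_regression
```

Keeping the previous `FineTuneRelease` record around is what makes rollback a lookup instead of an investigation.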

If you're deploying through a hosted provider's fine-tuning API, check the model deprecation policy before you start. Some providers give 12 months of access to deprecated checkpoints; others don't. Knowing that before you build your deployment pipeline is significantly better than finding out when a deprecation notice arrives.

---

### Key Takeaways

- Prompting with retrieval solves the majority of "the model doesn't know our stack" problems without the maintenance overhead of fine-tuning.
- Fine-tuning is valuable for enforcing consistent style at scale and for encoding genuinely proprietary patterns that can't fit in context — not for fixing reasoning limitations.
- The dominant cost of fine-tuning is not compute; it's data curation and ongoing maintenance as base models evolve.
- Build your eval set before your training set. Contaminated benchmarks produce optimistic metrics that don't survive contact with production.
- Fine-tuned models require a versioned deployment lifecycle, rollback capability, and retraining discipline that prompted deployments don't. Factor that operational cost in before committing.

---

### Try This

Take a code task your team currently uses an LLM for. Document the current prompt in full. Then build a 20-sample eval set of representative inputs with known-correct outputs — manually verified, not generated. Run the current prompt against that eval set and record your pass rate.

Now iterate on the prompt: add examples, add format constraints, add chain-of-thought instruction. Run each variant against the same eval set. Track the delta. After three or four iterations, you'll know either that prompting has a ceiling on this task, or that you were underutilizing the prompting surface area. Either answer is valuable — and you'll have the data to back whichever decision you make next.


---

## Chapter 8: Building a Continuous Evaluation System

### Chapter Overview

A one-time benchmark tells you where a model stands on the day you ran it. It says nothing about whether that model will still perform well next quarter after a prompt change, a new codebase pattern, or a model update from the vendor. Real evaluation is a system, not an event. This chapter covers how to build infrastructure that monitors LLM code quality continuously — catching regressions before they reach production, tracking performance across model versions, and giving your team signal they can actually act on.

---

### Why Point-in-Time Evals Fail in Practice

Most teams run evaluation when they're choosing a model or justifying a decision. They build a test set, run it, produce a number, move on. Six months later, the model has been updated by the vendor. The prompt template changed three times. The codebase the system indexes has grown by 40,000 lines. Nobody reruns the eval.

This is how teams end up defending a model choice with stale data — or worse, discovering regressions through user complaints rather than instrumentation.

The problem isn't laziness. It's that evaluation is treated as a phase, not a process. Building it into your CI/CD pipeline changes that. It makes regression detection automatic and gives engineers a feedback loop fast enough to actually use.

The shift in thinking is straightforward: every meaningful change to your LLM pipeline — model version, prompt, retrieval strategy, context window size — should trigger an eval run, just like a test suite. The output isn't a one-time report. It's a metric tracked over time.

---

### Designing Your Golden Set

The foundation of a continuous eval system is a golden set: a curated collection of inputs with known-good outputs. For code tasks, this means representative prompts paired with reference solutions that your team has reviewed and agreed on.

What makes a golden set useful is diversity and maintenance. If every example is a simple function completion, you'll miss regressions on refactoring tasks. If you never update the set as your codebase evolves, it stops being representative.

Practical structure for a code generation golden set:

```python
golden_examples = [
    {
        "id": "auth-middleware-001",
        "task_type": "generation",
        "prompt": "Write a FastAPI middleware that validates JWT tokens and attaches the decoded user to request.state",
        "reference_output": "...",  # reviewed, correct implementation
        "tags": ["auth", "middleware", "fastapi"],
        "complexity": "medium"
    },
    {
        "id": "refactor-002",
        "task_type": "refactor",
        "prompt": "Refactor this function to eliminate nested conditionals: ...",
        "reference_output": "...",
        "tags": ["refactor", "readability"],
        "complexity": "low"
    }
]
```

Tag by task type, complexity, and domain. This lets you slice performance reports — a model might hold its score on simple completions but regress on complex refactors after a vendor update. Flat aggregate scores hide that.
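
Slicing is a small group-by over the per-example results. The sketch below assumes each result is a dict carrying the golden example's metadata plus a numeric `score` field; that field name and `scores_by_slice` are assumptions for illustration.

```python
from collections import defaultdict
from statistics import mean


def scores_by_slice(results: list[dict], key: str) -> dict[str, float]:
    """Average the `score` field within each value of `key` (e.g. "task_type")."""
    buckets: dict[str, list[float]] = defaultdict(list)
    for result in results:
        buckets[result[key]].append(result["score"])
    return {value: mean(scores) for value, scores in buckets.items()}
```

Calling it once with `key="task_type"` and once with `key="complexity"` gives two views of the same run, which is usually enough to see where an aggregate drop is coming from.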

Aim for 50–150 examples at launch. More than that and maintenance becomes a burden. Fewer and the signal is too noisy to trust.

> **Key Insight**
>
> Your golden set is a living artifact. Add examples when you discover new failure modes. Retire examples that are no longer representative. Treat it like a test suite — review it quarterly, own it like code.

---

### Scoring at Scale Without Human Review Bottlenecks

Running continuous eval means scoring outputs automatically. Human review doesn't scale to every CI run. The practical approach is a tiered scoring strategy: fast automated checks catch obvious regressions, and human review is reserved for edge cases and periodic audits.

For code generation tasks, automated scoring combines several signals:

```python
def score_output(
    generated: str,
    reference: str,
    task_id: str,
    language: str,
    judge_model: str = "claude-sonnet-4-6",
) -> dict:
    # check_compilation, run_test_suite, compute_chrf, and llm_judge are
    # placeholders for functions in your own harness
    scores = {}

    # Structural checks: fast and binary
    scores["compiles"] = check_compilation(generated, language=language)
    scores["tests_pass"] = run_test_suite(generated, task_id=task_id)

    # Similarity to reference: useful but not sufficient alone
    scores["chrf"] = compute_chrf(generated, reference)

    # LLM-as-judge for semantic quality. Set judge_model to a different
    # model family than the system under test.
    scores["judge_score"] = llm_judge(
        prompt=f"Rate this code solution 1-5 for correctness and style:\n{generated}",
        model=judge_model,
    )

    return scores
```

The compilation check and test pass rate are your most reliable signals: they're objective and fast. chrF, a character n-gram similarity metric, or a comparable surface-similarity measure gives you a rough proxy for whether the output drifted from the reference style. LLM-as-judge adds semantic signal but introduces variance; use it as a tiebreaker, not a primary metric.

One important caveat on LLM-as-judge: if you're evaluating model A using model B as the judge, you're measuring agreement between two models, not ground truth. Use judge scores to detect large movements, not to rank models within a few percentage points of each other.

> **Warning**
>
> Never use the same model family as both the system under test and the judge. A GPT-4-class model grading GPT-4 outputs tends to prefer its own style, which inflates scores. Use a different model family or a deterministic scoring function wherever possible.

---

### Integrating Evals into CI/CD

The mechanics of integration are simpler than they look. The goal is: on every relevant change, run the golden set, compute scores, compare to baseline, and fail the pipeline if regression exceeds a threshold.

A minimal GitHub Actions setup:

```yaml
name: LLM Eval

on:
  push:
    paths:
      - 'prompts/**'
      - 'pipeline/**'
      - 'models/config.yaml'

jobs:
  eval:
    runs-on: ubuntu-latest
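    steps:
      - uses: actions/checkout@v4

      # Assumes a requirements.txt at the repository root
      - name: Install dependencies
        run: pip install -r requirements.txt

      - name: Run golden set eval
        env:
          # Assumes the provider key is stored as a repository secret with this name
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
        run: python scripts/run_eval.py --golden golden_set.json --baseline baselines/main.json

      - name: Check for regression
        run: python scripts/check_regression.py --threshold 0.05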
```

The threshold matters a lot here. A 5% drop in aggregate score might be noise — especially if your golden set is small and the judge has variance. Setting the threshold too low means constant false alarms that engineers learn to ignore. Too high and you miss real regressions.

Stratify your thresholds by task type. A 3% drop on simple completions is probably noise. A 3% drop on security-sensitive code review tasks is worth stopping for.
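
Stratified thresholds could look like the sketch below. It treats a "drop" as relative to the baseline score, which is one reasonable reading; switch to absolute differences if that matches how your team talks about scores. The category names and `find_regressions` are hypothetical.

```python
def find_regressions(
    baseline: dict[str, float],
    current: dict[str, float],
    thresholds: dict[str, float],
    default_threshold: float = 0.05,
) -> dict[str, float]:
    """Return categories whose relative score drop exceeds their threshold."""
    regressions = {}
    for category, base_score in baseline.items():
        if base_score <= 0:
            continue  # nothing to regress from
        # A category missing from the current run counts as a score of zero
        new_score = current.get(category, 0.0)
        drop = (base_score - new_score) / base_score
        if drop > thresholds.get(category, default_threshold):
            regressions[category] = drop
    return regressions
```

With a 2% threshold on security review and 5% on completions, an identical 3% drop fails the pipeline for one category and passes for the other.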

Store every eval result with its metadata — model version, prompt hash, retrieval config, timestamp. Without that, you can't reconstruct why a score changed.
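
A record with that metadata can be as simple as one JSON line per run. The field names below follow the list in the paragraph above; `prompt_hash`, `build_eval_record`, and `to_jsonl_line` are hypothetical helpers, and truncating the hash to 12 characters is a readability choice rather than a requirement.

```python
import hashlib
import json
from datetime import datetime, timezone


def prompt_hash(prompt_template: str) -> str:
    """Short, stable fingerprint of the exact prompt text used in a run."""
    return hashlib.sha256(prompt_template.encode("utf-8")).hexdigest()[:12]


def build_eval_record(
    model_version: str,
    prompt_template: str,
    retrieval_config: dict,
    scores: dict,
) -> dict:
    """Bundle scores with the metadata needed to explain them later."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model_version": model_version,
        "prompt_hash": prompt_hash(prompt_template),
        "retrieval_config": retrieval_config,
        "scores": scores,
    }


def to_jsonl_line(record: dict) -> str:
    """Serialize one record for an append-only JSON Lines log."""
    return json.dumps(record, sort_keys=True)
```

Appending these lines to a file or a table gives you the history you need when two runs disagree: diff the metadata first, then look at the scores.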

---

### Tracking Model Drift Over Time

Vendors update models without announcing every change. A model that scored well in January may behave differently in March. This isn't hypothetical — it's been documented across GPT-4, Claude, and open-source models hosted on managed inference.

Continuous eval catches this. If you're running evals on a schedule — weekly at minimum — you'll see the score graph move when the vendor pushes a silent update. Without that graph, you're flying blind.

What to track per eval run:

- Aggregate score by task category
- Pass/fail on compilation and test checks
- Score distribution (mean, p10, p90) — the mean can hold steady while the tail gets worse
- Latency percentiles — a model can regress on speed without changing quality scores
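
The distribution summary from the list above takes a few lines with the standard library. This sketch uses `statistics.quantiles` with the inclusive method; `summarize_scores` is a hypothetical name, and other percentile conventions will give slightly different values on small samples.

```python
from statistics import mean, quantiles


def summarize_scores(scores: list[float]) -> dict[str, float]:
    """Report the mean alongside the 10th and 90th percentiles."""
    if len(scores) < 2:
        raise ValueError("need at least two scores to estimate percentiles")
    # n=10 yields nine cut points: index 0 is p10, index 8 is p90
    deciles = quantiles(scores, n=10, method="inclusive")
    return {"mean": mean(scores), "p10": deciles[0], "p90": deciles[8]}
```

Tracking `p10` next to the mean is what exposes the case where the average holds steady while the worst outputs get worse.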

Build a simple dashboard. It doesn't need to be elaborate — a time-series plot of category scores with model version annotations is enough to make regressions visible. Many teams use Weights & Biases or a simple Postgres table with a Grafana frontend. The tooling matters less than the habit.

> **Try This**
>
> Pick one prompt from your production system that you've seen produce inconsistent results. Run it 10 times against your current model, score each output, and compute the variance. If the standard deviation is high relative to the mean, you have a stability problem — and you now have baseline data to track it against future model or prompt changes.

---

### Operationalizing Feedback Loops

Eval infrastructure without action is just logging. The value comes from closing the loop: catching a regression, identifying the cause, and deploying a fix — all within a cycle short enough that engineers stay engaged.

That means the eval system needs to surface actionable signal, not just a number. When a regression fires, the on-call engineer needs to know which task categories regressed, which specific examples failed, and what changed since the last passing run.

A useful regression report structure:

```
Regression detected: eval/2026-04-20 vs eval/2026-04-13

Overall score: 0.71 → 0.65 (-8.5%)

Failing categories:
  - refactor/complexity-high: 0.68 → 0.51 (-25%)
  - generation/security: 0.80 → 0.74 (-7.5%)

Top failing examples:
  - refactor-complex-007: judge_score dropped 3.0 → 1.5
  - security-review-012: tests_pass = False (was True)

Changes since last run:
  - prompts/refactor.txt modified (commit a3f81c2)
  - Model config: claude-sonnet-4-5 → claude-sonnet-4-6
```

That's enough information for an engineer to start investigating without needing to re-run anything manually. The diff between the last passing run and the current one is the starting point.

Treat eval failures the same way you treat test failures: they block the pipeline until resolved or explicitly overridden with a documented reason. The override path matters — sometimes a regression is acceptable because the change also improves something else. But it should be a conscious decision, not a silently passing pipeline.

---

### Key Takeaways

- A single benchmark run is a snapshot, not a system. Continuous eval catches regressions that point-in-time testing can't see.
- A curated golden set of 50–150 examples, tagged by task type and complexity, is more useful than a large unstructured benchmark. Maintain it like code.
- Automated scoring with tiered checks — compilation, test pass, similarity, judge — gives you speed without sacrificing coverage. Reserve human review for audits and edge cases.
- Store every eval result with full metadata: model version, prompt hash, retrieval config, timestamp. Without provenance, you can't explain why scores changed.
- Close the loop. Regression reports need to surface which examples failed and what changed, not just a score delta. If the system doesn't support action, it won't get used.

---

### Try This

Set up a minimal eval harness this week. Take five prompts from your current LLM pipeline — real ones, not synthetic toy examples — and write reference outputs for each. Score your model's current output against those references using a combination of a deterministic check (does it compile? does a simple test pass?) and a manual quality rating from 1–5. Record the scores in a spreadsheet with today's date and your current model version. Rerun the same five prompts in 30 days without changing anything. Compare. If the scores moved, you've just demonstrated why continuous eval exists. If they didn't, you have a clean baseline to build on.


---

## Conclusion

The benchmark numbers on any vendor's website are not lies; they're just answers to questions you didn't ask. What this book gives you is the ability to ask your own questions and trust the answers. You now know how to build evaluation suites that reflect your actual codebase, your actual users, and your actual failure modes. You know which metrics predict production quality and which ones only measure how well a model performs on public benchmarks. That distinction matters every time someone pushes a new model release and claims state-of-the-art.

The harder work starts after the evaluation framework exists. Continuous evaluation means treating model selection and prompt strategy as engineering problems that never fully close — models change, your codebase changes, usage patterns shift. The teams that get this right are not the ones who ran the best benchmark once. They're the ones who built the infrastructure to catch regressions before users do, and who made cost-quality tradeoffs explicit enough that the whole team can reason about them without a spreadsheet.

The next step is to pick one task category — generation, review, or search — and run your first real evaluation against your own data this week. Not a proof of concept, not a demo. A scored run with results you'd be comfortable sharing. Everything else in this book compounds from that foundation.

---

## Appendix A: Glossary

**Benchmark leakage** — Contamination of a benchmark dataset into a model's training corpus, causing evaluation scores to reflect memorization rather than generalization. A leading reason vendor benchmarks overstate real-world performance.

**Context window** — The maximum number of tokens a model can process in a single inference call, spanning both input (prompt + retrieved content) and output (completion). Relevant not just as a ceiling but as a cost multiplier.

**BLEU / ROUGE** — N-gram overlap metrics originally designed for machine translation and summarization. Widely misused for code evaluation. Measure surface similarity, not correctness or functionality.

**Edit distance (Levenshtein)** — Minimum character-level operations to transform one string into another. Useful as a rough proxy for how much a model's completion diverges from a reference, but blind to semantic equivalence.

**Exact match (EM)** — Binary correctness metric: 1 if output matches reference exactly, 0 otherwise. High precision, low recall for code tasks where multiple correct implementations exist.

**pass@k** — The probability that at least one of k generated samples passes a correctness test. More informative than single-sample accuracy when sampling temperature is non-zero.
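In practice pass@k is usually estimated from n samples per problem, with n larger than k, rather than by drawing exactly k. The sketch below follows the unbiased estimator described in Chen et al. (2021), 1 - C(n-c, k) / C(n, k), where `n` is the number of samples drawn and `c` is how many of them passed. The function name `pass_at_k` is an illustrative choice.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n samples, c of which passed."""
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    # Probability that a random size-k subset contains no passing sample.
    return 1.0 - comb(n - c, k) / comb(n, k)


# 10 samples, 3 passed: estimate for k=1 and k=5.
print(pass_at_k(10, 3, 1), pass_at_k(10, 3, 5))
```

Average the per-problem estimates across the suite to get the reported number. Computing `1 - (1 - c/n) ** k` instead is tempting but biased for small n.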

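**Efficient frontier** — In cost-quality space, the set of model configurations that no other option beats on one axis while at least matching it on the other. Points off the frontier are dominated choices: something else is cheaper at equal or better quality, or better at equal or lower cost.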
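A minimal sketch of how the frontier can be computed from a list of scored configurations. The `Config` fields, the units, and the sample numbers are hypothetical placeholders; substitute whatever cost and quality measures your own evaluation produces.

```python
from typing import NamedTuple


class Config(NamedTuple):
    name: str
    cost: float     # hypothetical unit, e.g. dollars per 1,000 requests
    quality: float  # hypothetical unit, e.g. pass rate on your own suite


def efficient_frontier(configs: list[Config]) -> list[Config]:
    """Return the non-dominated configs, ordered from cheapest to most expensive."""
    frontier: list[Config] = []
    best_quality = float("-inf")
    # Walk from cheapest to most expensive; at equal cost, try the better one first.
    for cfg in sorted(configs, key=lambda c: (c.cost, -c.quality)):
        # Keep a config only if it beats everything that costs the same or less.
        if cfg.quality > best_quality:
            frontier.append(cfg)
            best_quality = cfg.quality
    return frontier


candidates = [
    Config("A", 1.0, 0.60),
    Config("B", 2.0, 0.55),  # dominated by A: costs more, scores lower
    Config("C", 3.0, 0.80),
    Config("D", 3.0, 0.70),  # dominated by C: same cost, scores lower
    Config("E", 5.0, 0.80),  # dominated by C: same score, costs more
]
print([cfg.name for cfg in efficient_frontier(candidates)])  # ['A', 'C']
```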

**Retrieval-Augmented Generation (RAG)** — Architecture that supplements model inference with retrieved context (code snippets, documentation, prior examples) at inference time, without modifying model weights.

**Fine-tuning** — Continued training of a pretrained model on domain-specific data to shift its output distribution toward target behavior. Distinct from prompting, which conditions behavior without weight updates.

**LoRA (Low-Rank Adaptation)** — Parameter-efficient fine-tuning technique that injects trainable low-rank matrices into attention layers, reducing compute and memory requirements relative to full fine-tuning.
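The savings are easiest to see as arithmetic. For a frozen weight matrix of shape d_out × d_in, LoRA trains two factors of shapes d_out × r and r × d_in. The sketch below counts parameters for a single hypothetical 4096 × 4096 projection at rank 8; the dimensions are illustrative and not tied to any specific model.

```python
def full_params(d_out: int, d_in: int) -> int:
    """Parameters in the frozen dense weight matrix."""
    return d_out * d_in


def lora_params(d_out: int, d_in: int, rank: int) -> int:
    """Parameters in the two trainable low-rank factors."""
    return rank * (d_out + d_in)


# Hypothetical layer size and rank, chosen only for illustration.
d_out, d_in, rank = 4096, 4096, 8
frozen = full_params(d_out, d_in)
trainable = lora_params(d_out, d_in, rank)
print(f"{trainable:,} trainable vs {frozen:,} frozen ({trainable / frozen:.2%})")
# 65,536 trainable vs 16,777,216 frozen (0.39%)
```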

**Throughput** — Tokens generated per second at a given concurrency level. Determines whether a model deployment can serve multiple developers simultaneously without queue buildup.

**Time to first token (TTFT)** — Latency from request submission to receipt of the first output token. The dominant latency component for interactive use cases like inline code completion.
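A minimal sketch of measuring TTFT and single-stream generation rate from any iterable that yields tokens as they arrive. `fake_stream` is a stand-in for a real streaming client, and its delays are made-up values. Note that this measures one request at a time; throughput at a given concurrency level needs many streams running in parallel.

```python
import time
from typing import Iterable, Iterator


def measure_stream(tokens: Iterable[str]) -> dict:
    """Consume a token stream and report TTFT, total time, and tokens per second."""
    start = time.perf_counter()
    first_token_at = None
    count = 0
    for _ in tokens:
        if first_token_at is None:
            first_token_at = time.perf_counter()
        count += 1
    end = time.perf_counter()
    if first_token_at is None:
        raise ValueError("stream produced no tokens")
    total = end - start
    return {
        "ttft_s": first_token_at - start,
        "total_s": total,
        "tokens": count,
        "tokens_per_s": count / total if total > 0 else float("inf"),
    }


def fake_stream(n_tokens: int, first_delay_s: float, gap_s: float) -> Iterator[str]:
    """Hypothetical stand-in for a streaming API: slow first token, then steady output."""
    time.sleep(first_delay_s)
    for i in range(n_tokens):
        if i:
            time.sleep(gap_s)
        yield f"tok{i}"


stats = measure_stream(fake_stream(n_tokens=5, first_delay_s=0.05, gap_s=0.01))
print(stats)
```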

**Latency percentile (p95, p99)** — The latency value below which 95% or 99% of requests fall. More operationally relevant than mean latency because tail latency governs user experience at scale.
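A small sketch of why the percentile matters more than the mean, using the nearest-rank definition. The latency values are invented for illustration: 95 fast requests and 5 slow ones.

```python
import math
from statistics import mean


def percentile(samples: list[float], p: int) -> float:
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Hypothetical latencies in milliseconds.
latencies_ms = [100.0] * 95 + [2000.0] * 5
print(mean(latencies_ms))            # 195.0
print(percentile(latencies_ms, 95))  # 100.0
print(percentile(latencies_ms, 99))  # 2000.0
```

The mean of 195 ms describes no request that actually happened, and p95 alone hides the slow tail that p99 exposes. Monitoring libraries often interpolate between ranks, so expect small differences from this definition.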

**Functional correctness** — Whether generated code produces the correct output for a given input, as verified by execution against test cases. The most reliable single metric for generation tasks.
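A minimal sketch of an execution-based check, assuming the candidate is Python and the tests are plain `assert` statements. The function name `passes_tests` is hypothetical. A subprocess with a timeout only guards against hangs; it is not a sandbox, so run untrusted model output inside a container or similar isolation.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def passes_tests(candidate_source: str, test_source: str, timeout_s: float = 10.0) -> bool:
    """Run candidate code plus its tests in a fresh interpreter; True if it exits cleanly."""
    program = candidate_source + "\n\n" + test_source + "\n"
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate_under_test.py"
        script.write_text(program, encoding="utf-8")
        try:
            result = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True,
                timeout=timeout_s,
                cwd=tmp,
            )
        except subprocess.TimeoutExpired:
            return False  # treat hangs and infinite loops as failures
    return result.returncode == 0


tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0"
print(passes_tests("def add(a, b):\n    return a + b", tests))  # True
print(passes_tests("def add(a, b):\n    return a - b", tests))  # False
```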

**Hallucination (code context)** — Model output that references functions, APIs, types, or symbols that do not exist in the relevant codebase or standard library. Distinct from logical errors in otherwise valid code.
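One narrow slice of this can be checked statically. The sketch below, an illustration rather than a complete detector, parses generated Python and flags `module.attribute` references where the attribute does not exist on an imported top-level module. The function name is hypothetical. It ignores `from x import y`, nested attributes, and local names that shadow a module, and it imports the modules it inspects, so point it only at modules you trust.

```python
import ast
import importlib


def missing_module_attributes(source: str) -> list[str]:
    """Return 'module.attr' references in source that do not resolve on the imported module."""
    tree = ast.parse(source)

    # Map local alias -> real module name for plain "import x" / "import x as y".
    aliases: dict[str, str] = {}
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                if "." not in alias.name:  # keep the sketch to top-level modules
                    aliases[alias.asname or alias.name] = alias.name

    missing: set[str] = set()
    for node in ast.walk(tree):
        if (
            isinstance(node, ast.Attribute)
            and isinstance(node.value, ast.Name)
            and node.value.id in aliases
        ):
            module_name = aliases[node.value.id]
            try:
                module = importlib.import_module(module_name)
            except ImportError:
                missing.add(module_name)  # the module itself does not exist here
                continue
            if not hasattr(module, node.attr):
                missing.add(f"{module_name}.{node.attr}")
    return sorted(missing)


generated = (
    "import json\n"
    "import os\n"
    "a = json.loads('{}')\n"
    "b = json.parse_string('{}')\n"
    "c = os.path.join('x', 'y')\n"
)
print(missing_module_attributes(generated))  # ['json.parse_string']
```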

**Evaluation harness** — The end-to-end system for running prompts, collecting model outputs, scoring results, and aggregating metrics. The infrastructure layer beneath any evaluation methodology.
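Stripped to its core, a harness is a loop over cases with a pluggable model and a pluggable scorer. The sketch below uses hypothetical names (`Case`, `run_harness`, `exact_match`) and a stub model in place of a real API call. A production harness would add retries, caching, and per-case logging.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable


@dataclass
class Case:
    prompt: str
    reference: str


def exact_match(output: str, reference: str) -> float:
    """1.0 if output equals reference after trimming surrounding whitespace, else 0.0."""
    return 1.0 if output.strip() == reference.strip() else 0.0


def run_harness(
    model: Callable[[str], str],
    cases: list[Case],
    scorer: Callable[[str, str], float],
) -> dict:
    """Run every prompt through the model, score each output, and aggregate."""
    scores = [scorer(model(case.prompt), case.reference) for case in cases]
    return {
        "n": len(scores),
        "mean_score": mean(scores) if scores else 0.0,
        "scores": scores,
    }


# Stub model standing in for a real client call.
canned = {"2+2": "4\n", "capital of France": "London"}
report = run_harness(
    model=lambda prompt: canned[prompt],
    cases=[Case("2+2", "4"), Case("capital of France", "Paris")],
    scorer=exact_match,
)
print(report)  # {'n': 2, 'mean_score': 0.5, 'scores': [1.0, 0.0]}
```

Keeping the scorer as a parameter lets the same loop serve exact match, execution-based checks, or a rubric-based judge without changing the harness itself.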

---

## Appendix B: Tools & Resources

### Evaluation Frameworks

**EvalPlus** — Extended version of HumanEval with additional test cases per problem. Addresses the sparse testing problem in the original benchmark and produces more reliable pass@k estimates.

**BigCode Evaluation Harness** — Open-source framework from Hugging Face / BigCode for running standardized evaluations across multiple code benchmarks. Supports HumanEval, MBPP, DS-1000, and others.

**DeepEval** — Python library for building custom LLM evaluation pipelines. Supports metric definition, dataset management, and regression tracking.

**LangSmith** — Tracing and evaluation platform from LangChain. Useful for logging prompt/response pairs in production and running evaluations against captured traces.

**Promptfoo** — CLI and library for running head-to-head LLM comparisons with custom assertions. Designed for prompt engineers and engineers who want fast iteration cycles.

### Benchmarks & Datasets

**HumanEval** — OpenAI's original Python function completion benchmark. 164 problems with unit tests. Baseline for most code generation comparisons, but limited in scope and diversity.

**MBPP (Mostly Basic Python Problems)** — 974 crowd-sourced Python problems with test cases. Broader coverage than HumanEval at similar complexity.

**SWE-bench** — GitHub issue resolution benchmark. Tests whether models can write patches that fix real bugs in open-source Python repositories. Higher ecological validity than function-completion benchmarks.

**DS-1000** — 1,000 data science problems drawn from Stack Overflow. Covers NumPy, pandas, SciPy, scikit-learn, and others. Relevant for teams working in data-heavy codebases.

**CRUXEval** — Benchmark focused on code reasoning: predicting inputs from outputs and outputs from inputs. Tests understanding rather than generation.

### Serving & Infrastructure

**vLLM** — High-throughput inference server for open-weight models. PagedAttention implementation delivers significant throughput improvements over naive serving. The default choice for self-hosted deployments.

**Ollama** — Local model serving tool optimized for developer ergonomics. Useful for evaluation runs on open-weight models without cloud API costs.

**LiteLLM** — Unified API layer that normalizes calls across OpenAI, Anthropic, Cohere, and open-weight model endpoints. Simplifies provider-switching in evaluation harnesses.

**OpenRouter** — API gateway aggregating models from multiple providers under a single endpoint. Useful for cost comparison across providers during evaluation.

### Observability & Cost Tracking

**Helicone** — Proxy-based observability tool for LLM API calls. Logs requests, tracks token usage, and surfaces cost metrics without code changes.

**Weights & Biases (W&B)** — Experiment tracking platform with LLM-specific features for logging prompts, completions, and evaluation scores across runs.

### Fine-Tuning

**Hugging Face PEFT** — Library implementing LoRA, prefix tuning, and other parameter-efficient fine-tuning methods. Standard tooling for adapting open-weight models.

**Axolotl** — Training framework built on top of Hugging Face Transformers, optimized for instruction fine-tuning workflows. Handles dataset formatting, multi-GPU setup, and checkpoint management.

**Unsloth** — Fine-tuning library focused on memory efficiency and speed. Significant training time reductions on consumer and mid-tier GPU hardware.

---

## Appendix C: Further Reading

**"Evaluating Large Language Models Trained on Code" (Chen et al., 2021)** — The HumanEval paper. Popularizes pass@k, introduces its unbiased estimator, and establishes the functional correctness framing that underlies most subsequent code evaluation methodology.

**"SWE-bench: Can Language Models Resolve Real-World GitHub Issues?" (Jimenez et al., 2023)** — Argues convincingly that function-completion benchmarks miss the skills required for real software engineering work, then proposes a more ecologically valid alternative.

**"Is Your Code Generated by ChatGPT Really Correct?" (Liu et al., 2023)** — Demonstrates that EvalPlus's denser test suites substantially reduce pass rates compared to the original HumanEval tests, quantifying how much sparse testing flatters model performance.

**"Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks" (Lewis et al., 2020)** — The foundational RAG paper. Required reading before designing any code search or context-augmented generation system.

**"LoRA: Low-Rank Adaptation of Large Language Models" (Hu et al., 2021)** — The paper that made fine-tuning accessible. Essential if you're evaluating whether fine-tuning is worth the investment for your task.

**"Holistic Evaluation of Language Models" (Liang et al., 2022)** — HELM benchmark paper. Useful less for its specific benchmarks and more for its framework of thinking about model evaluation across multiple axes simultaneously.

**"CodeBLEU: a Method for Automatic Evaluation of Code Synthesis" (Ren et al., 2020)** — Proposes an extension of BLEU that accounts for code syntax and data flow. Useful context for understanding both the appeal and the limits of n-gram metrics applied to code.

**"Scaling Laws for Neural Language Models" (Kaplan et al., 2020)** — Establishes the relationship between model size, compute, and performance. Informs how to reason about cost-quality tradeoffs as model scale increases.

**"Out of the BLEU: How Should We Assess Quality of the Code Generation Models?" (Evtikhiev et al., 2023)** — Empirical study of which automatic metrics actually correlate with human judgments of code quality. Directly applicable to metric selection decisions.

**"A Survey of Large Language Models for Code" (Zheng et al., 2023)** — Comprehensive survey of architectures, training strategies, and evaluation approaches across code-focused models. Useful as a map of the landscape before going deep on any single area.

**"Hungry Hungry Hippos: Towards Language Modeling with State Space Models" (Fu et al., 2023)** — Relevant for teams evaluating long-context alternatives to transformer architectures, particularly for codebases where context windows are a binding constraint.

**"Constitutional AI: Harmlessness from AI Feedback" (Bai et al., 2022)** — Background on RLHF and RLAIF training, relevant for understanding why model behavior on code review and explanation tasks may reflect training signal beyond raw capability.

---

*Evaluating LLMs for Code Tasks — Version 1.0 — April 2026*
*By David Kelly Price | pyckle.co*

---

*© 2026 Pyckle. All rights reserved. This guide may be shared freely for personal and educational use. Commercial reproduction or redistribution requires written permission. Contact kellyprice@pyckle.co.*

---

## Related Blog Posts

- [We Trained Our Own Code Embedding Model from Scratch](https://pyckle.co/blog/we-trained-our-own-code-embedding-model-from-scratch-heres-what-happened.html)
- [Why Naive Retrieval Breaks at Scale](https://pyckle.co/blog/why-naive-retrieval-breaks-at-scale-and-what-we-built-instead.html)
