Reverse-Engineering How Harvey Built Its Legal Agent Benchmark

When Harvey launched its Legal Agent Benchmark (LAB) on May 6, the headline numbers were striking: 1,251 tasks, 24 practice areas, more than 75,000 expert-written rubric criteria. My first thought was that this was a whole lot of tasks (and rubric criteria!) for one team to have created, so I wanted to see how they did it. I've been working on benchmarks of my own for a few years now, and the sheer amount of labor involved is always a big sticking point. So I downloaded the repo and started poking around.

What I want to write about is what the corpus reveals about how this kind of work gets done, and what that means for anyone trying to build something similar. The corpus is generous — it's a real engineering achievement and Harvey deserves credit for shipping it open. But studying it carefully turns out to be useful in a way that goes beyond admiring the artifact: it gives you a target spec for what well-engineered output looks like, and it makes the labor problem in domain benchmarks visible in ways that aren't obvious from the headlines.

A note on epistemics. I don't have visibility into Harvey's internal process. The patterns I describe in the corpus are real; my inferences about how those patterns were produced are informed guesses. Treat the "this is consistent with X" claims as exactly that. If anyone at Harvey wants to correct or clarify, I'd genuinely welcome it.

The labor math is the whole point

Let's just do a sanity check. 75,000 individual pass/fail criteria. The criteria aren't uniformly hard to write — a factual check ("PASS if the memo correctly states the closing date as March 15, 2024") takes a few minutes once the synthetic record exists; an analytical criterion that requires the author to think through a doctrinal wrinkle and tie it to specific anchors might take 20 or 30 minutes. The mix matters, and the corpus skews hard: the analytical criteria substantially outnumber the mechanical ones (more on this when I walk through the rubric structure below). Even with generous time estimates, a weighted average lands around 18-19 minutes per criterion. That's roughly 22,000-24,000 attorney-hours.

Harvey hasn't disclosed how big the team that built LAB was. The launch post credits an "Applied Legal Research" group along with various external contributors, but no headcount. To get a rough sense of the math, suppose that team is somewhere in the 10-30 person range — pick 20 as a midpoint. At 22,000 hours / 20 attorneys / 40 hours per week, that's roughly 28 weeks of full-time work for the entire team doing nothing else, and that doesn't account for the synthetic data rooms, which contain hundreds of fictional contracts, opinions, regulations, and emails that all have to be coherent with the rubrics that grade against them.
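If you want to poke at the assumptions, here's the back-of-envelope as a few lines of Python. The mix, per-criterion minutes, and team size are my guesses, not Harvey's numbers.

# Back-of-envelope labor estimate. Every number here is an assumption
# (my guesses, not Harvey's): adjust the mix and watch the total move.
CRITERIA = 75_000
ANALYTICAL_SHARE = 0.70      # assumed share of analytical criteria
MIN_ANALYTICAL = 25          # assumed minutes per analytical criterion
MIN_FACTUAL = 4              # assumed minutes per factual check

avg_min = ANALYTICAL_SHARE * MIN_ANALYTICAL + (1 - ANALYTICAL_SHARE) * MIN_FACTUAL
total_hours = CRITERIA * avg_min / 60

TEAM = 20                    # hypothetical midpoint of a 10-30 person team
weeks = total_hours / (TEAM * 40)

print(f"{avg_min:.1f} min/criterion, {total_hours:,.0f} hours, {weeks:.0f} weeks")
# -> 18.7 min/criterion, 23,375 hours, 29 weeks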

Hand-authoring at this scale is possible. OpenAI's GDPval put together 1,320 tasks across 44 occupations using industry experts averaging 14 years of experience. SWE-bench Verified ran 93 software developers through three review passes per sample. Serious teams ship serious hand-authored corpora. So the question isn't whether hand-authoring "works" — it does. The question is who can afford it.

The answer is: not most of us. If you're a solo lawyer-builder, a small research group, a law school clinic, or a startup trying to validate a domain-specific agent, 22,000 attorney-hours is not a budget you have. The pipeline approach — SME-designed taxonomy, LLM-generated artifacts, SME review — is what makes benchmark construction at this scale accessible to anyone who isn't OpenAI or Harvey. That's why it matters, and that's why studying LAB's corpus is worth the time even if you'll never build something its size.

Harvey acknowledges they used a pipeline. The launch post credits Julio Pereyra for "developing a novel document and scenario generation pipeline that helped us scale task creation," and the tutorial in the repo elaborates that documents are "synthetically generated in large batches, under the guidance and review of human lawyers." The structural patterns I'll walk through next are consistent with that pipeline extending to the rubric layer too — which makes sense, because once you've built the document generator, not using it to also generate the FACTS and DISTRACTOR criteria that anchor to those documents would be leaving most of the leverage on the table.

What a good pipeline produces: a spec extracted from the corpus

Here's what's useful about LAB even without the generator: the corpus tells you what a well-engineered pipeline's output looks like. If you're building your own, this is your target spec.

I ran some basic scans across all 75,000 criteria. Two patterns are worth knowing about.

Syntactic structure is unusually consistent. Every criterion ID follows the same C-NNN format. No C-001a, no idiosyncratic numbering, no author drift:

$ grep -roh '"id": "[^"]*"' tasks/ | sort -u | head
# "id": "C-001"
# "id": "C-002"
# "id": "C-003"

99.9% of all criteria literally end with "FAIL if…":

$ grep -c '"FAIL if' tasks/**/task.json | awk -F: '{s+=$NF} END {print s}'
74928

The opening is equally templated:

Opening                          Count
"PASS if the memo…"             22,130
"PASS if the memorandum…"       10,142
"PASS if the report…"            7,039
"PASS if the output…"            3,921
"PASS if the LPA…"               1,513

The deliverable noun varies mechanically by document type. The action verb is drawn from a small fixed dictionary (identifies, recommends, notes, states, references, correctly, explains, concludes). Title prefixes follow patterns too: a third of all criteria use ISSUE_NNN: prefixes, and the count distribution decays smoothly — ISSUE_001: appears 2,734 times, ISSUE_002: 2,366 times, all the way down to ISSUE_014: at 356.
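If you want to reproduce these counts, the scan is a few lines of Python. The JSON field names here ("criteria", "text") are guesses at the schema rather than confirmed keys; check CONTRIBUTING.md before trusting them.

import collections
import json
import pathlib
import re

# Tally criterion openings across the corpus. The "criteria" and "text"
# keys are guesses at the task.json schema, not confirmed field names.
openings = collections.Counter()
for path in pathlib.Path("tasks").rglob("task.json"):
    task = json.loads(path.read_text())
    for crit in task.get("criteria", []):          # assumed key
        text = crit.get("text", "")                # assumed key
        match = re.match(r"PASS if the (\w+)", text)
        if match:
            openings[f"PASS if the {match.group(1)}"] += 1

for opening, count in openings.most_common(10):
    print(f"{count:7,}  {opening}")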

Surface details are inconsistent in ways that suggest batch generation. The same deliverable concept shows up under multiple slightly different filenames: issues-memorandum.docx (44 tasks), issue-memorandum.docx (27), issue-identification-memo.docx (16), issues-memo.docx (11). These aren't different deliverables — they're the same archetype with different filenames. The cleanest read is batch generation with loose surface-detail constraints: humans normalizing centrally would have caught the duplicates; a pipeline generating tasks in independent batches wouldn't.

The takeaway for builders isn't the specific patterns — it's that this level of structural discipline is what your pipeline output should look like. Predictable IDs, predictable verb dictionaries, predictable section ordering, with controlled variation in surface details. That's the legibility that lets human reviewers QA a 75,000-criterion corpus efficiently and lets LLM judges score consistently.

The taxonomy you can reverse-engineer

Here's the input spec a generator would need to produce LAB-shaped output, reverse-engineered from the output itself. CONTRIBUTING.md documents the output format — the task.json schema, the C-NNN ID convention, the "PASS if … FAIL if …" criterion form. What follows is the input side: the dimensions of variation a generator has to take as arguments to produce tasks that fit that output format.

TASK = {
  practice_area:    one of 24
  work_type:        one of {analyze, draft, review, research}
  archetype:        one of ~12 deliverable types
  scenario:         {client/matter/jurisdiction/deal-type/complications}
  document_kit:     {N synthetic source docs matching the scenario}
  rubric:           {co-generated against the scenario + archetype}
}
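Written as code, that input side looks something like the following. The field names mirror the sketch above; none of this comes from Harvey's repo.

from dataclasses import dataclass, field

@dataclass
class ScenarioSeed:
    client: str
    matter: str
    jurisdiction: str
    deal_type: str
    complications: list[str] = field(default_factory=list)  # the planted issues

@dataclass
class TaskSpec:
    practice_area: str        # one of the 24 practice areas
    work_type: str            # analyze | draft | review | research
    archetype: str            # one of ~12 deliverable types
    scenario: ScenarioSeed
    num_documents: int = 8    # synthetic source docs to co-generate
    num_criteria: int = 60    # target rubric size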

The 12 deliverable archetypes are the real organizing unit, more than work_type:

  1. Issue Memorandum (~150 tasks)

  2. Issue Identification Memorandum (~57)

  3. Deviation Report / Memo (~50)

  4. Gap Analysis Memorandum (~25)

  5. Redline Analysis / Review Memorandum (~32)

  6. Compliance Gap Report (~13)

  7. Discrepancy Report (~10)

  8. Risk Assessment / Risk Memorandum (~6)

  9. In-House Legal Memorandum (~5)

  10. Term Sheet Summary (~9)

  11. Drafting Memorandum / Cover Memo (~24)

  12. Closing Checklist (~6)

The instruction template hits 100% compliance across all 1,251 tasks:

"{ACTION_VERB} the attached {N} {DOC_TYPE}
 [against {REFERENCE: playbook/form/precedent/policy}],
 and {VERB2: prepare|produce|draft} a {DELIVERABLE_DESCRIPTION}.
 Output: `{kebab-case-filename}.docx`."

Median instruction length: 207 characters. 60% of instructions begin with the literal word "Review."
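The template is literal enough that a few lines of string formatting reproduce it. The example values below are invented, not pulled from a real task.

# Render the instruction template above. All example values are invented.
def render_instruction(action_verb, n_docs, doc_type, deliverable, filename,
                       reference=None, verb2="prepare"):
    against = f" against the {reference}" if reference else ""
    return (f"{action_verb} the attached {n_docs} {doc_type}{against}, "
            f"and {verb2} a {deliverable}. Output: `{filename}.docx`.")

print(render_instruction(
    "Review", 7, "supply agreements",
    "memorandum identifying deviations from the negotiation playbook",
    "deviation-memorandum", reference="negotiation playbook"))
# Review the attached 7 supply agreements against the negotiation playbook,
# and prepare a memorandum identifying deviations from the negotiation
# playbook. Output: `deviation-memorandum.docx`.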

The rubric architecture, section by section

This is the most useful thing in the corpus. The rubric for any given task isn't a flat list of criteria — it's organized into sections that appear in roughly the same order across the corpus, each with a distinct verb dictionary, distinct anchoring, and distinct structural role.

Section 1 — ISSUES (criteria prefixed ISSUE_NNN:). The substantive analytical points the deliverable should identify. Mean position 0.37 in the rubric; about a third of all criteria. "PASS if the memo identifies that the indemnification cap of $2M is below the deal's $5M materiality threshold. FAIL if the memo does not identify this issue." Each task has between 1 and 14 issues.

Section 2 — CITATIONS (Cites…, References…). Whether the deliverable invokes the right authority. "PASS if the memo cites Section 14(a) of the Securities Exchange Act." Maps directly to a (deliverable-type, doctrine) pair: a Rule 12(b)(6) brief should cite Twombly; a Stark Law memo should cite 42 USC § 1395nn. Easy to template from a per-archetype list of canonical authorities.

Section 3 — IDENTIFICATION/ANALYSIS (Identifies…, Notes…, Recognizes…, Analyzes…). Standalone analytical assertions without the numbered prefix — "you should have noticed this thing" criteria. "PASS if the memo notes that the change-of-control provision is triggered by the proposed transaction. FAIL if the memo does not address change-of-control implications." This is the section where the pipeline approach is at its weakest. The other sections all have natural anchors: ISSUES anchor to scenario seeds, CITATIONS anchor to per-archetype authority lists, FACTS and DISTRACTORS anchor mechanically to facts planted in the synthetic record. Section 3 doesn't have those anchors. It's testing whether the agent caught something doctrinally salient that the document set didn't telegraph — exactly the kind of judgment that's hardest to template, and exactly the kind of criterion where a human author drawing on their own practice experience produces something a template can't. If you grep across LAB looking for criteria that feel hand-authored rather than pipeline-generated, this is the section where you'd find them. It's also where contributors hand-authoring tasks following CONTRIBUTING.md guidance are most likely to produce work that's better per criterion than what a generator can produce — which makes the asymmetry I'll get to in a moment more concrete.

Section 4 — FACTS (Correctly states…). Discrete factual checks that anchor the rubric to specific facts in the synthetic record. "PASS if the memo correctly states the closing date as March 15, 2024." The section where co-generation of documents and rubrics pays off most: once you've planted "$3.5M initial capitalization" in the synthetic financial summary, generating the matching criterion is mechanical.
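Here's that co-generation in miniature: plant the fact and emit the criterion from the same record so they can't drift apart. The values and key names are illustrative.

# Co-generate a planted fact and its FACTS criterion from one record,
# so the document and the rubric can never disagree. Illustrative values.
fact = {"label": "initial capitalization",
        "value": "$3.5M",
        "doc": "financial-summary.docx"}

document_sentence = f"The company's {fact['label']} was {fact['value']}."
criterion = {
    "id": "C-042",
    "text": (f"PASS if the memo correctly states the {fact['label']} as "
             f"{fact['value']}. FAIL if the memo omits or misstates it."),
}

print(document_sentence)
print(criterion["text"])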

Section 5 — DISTRACTORS (Does not…, Does NOT falsely claim…). About 3% of all criteria. Tests for hallucinations and false confident claims. "PASS if the memo does NOT claim the indemnification survives indefinitely." These bake hallucination-resistance into the same rubric as substantive scoring — you don't need a separate hallucination eval. The prevalence is low but it's more than most benchmarks use, and it's probably the single most underused pattern I've seen elsewhere.

Section 6 — RECOMMENDATIONS (Recommends…). Whether the deliverable proposes the right operational action. Maps cleanly to issues identified in Section 1 — every issue should have a corresponding recommendation. This is the kind of thing a generator would produce in pairs.
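A sketch of that pairing, again with invented values: one complication in the scenario seed yields one ISSUE criterion and one matching RECOMMENDS criterion.

# Generate ISSUE/RECOMMENDS criteria as a pair from one seed complication.
seed = {
    "complication": ("the $2M indemnification cap is below the deal's "
                     "$5M materiality threshold"),
    "fix": "renegotiating the cap up to at least the materiality threshold",
}

issue = ("ISSUE_001: PASS if the memo identifies that "
         f"{seed['complication']}. FAIL if the memo does not identify this issue.")
recommendation = (f"PASS if the memo recommends {seed['fix']}. "
                  "FAIL if the memo flags the issue but proposes no remediation.")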

Section 7 — OVERALL. One or two criteria at the end of the rubric: the bottom-line conclusion check. "PASS if the memo concludes that the proposed transaction presents material risks that warrant renegotiation." The "did the agent get to the right answer" criterion.

Hard for humans, easy for generators

Most of the analytical criteria — ISSUES, CITATIONS, RECOMMENDATIONS, OVERALL — are hard for a human author writing from scratch but not hard for a generator working from a scenario seed. Once an SME has written a seed that says "indemnification cap: $2M; materiality threshold: $5M; complication: cap-below-threshold mismatch," producing "PASS if the memo identifies that the indemnification cap of $2M is below the deal's $5M materiality threshold" is almost mechanical. The hard thinking happened upstream, in seed design. The rubric layer is templating.

The exception is Section 3 (IDENTIFICATION/ANALYSIS), which doesn't have natural anchors and is genuinely hard for a generator to produce. That's roughly 25% of the corpus. Everything else is the kind of thing a pipeline can produce efficiently once the taxonomy and seeds are in place.

So the labor math and the pipeline argument actually converge. For a human author, the corpus is mostly hard criteria — which is why 22,000 hours. For a generator, the corpus is mostly easy criteria — which is why the pipeline approach is what makes this scale possible, and which is why "release the corpus, keep the generator" is a meaningful asymmetry.

A master prompt

Whether or not this matches Harvey's actual prompt, here's the spec the corpus implies:

Given a practice area, work type, deliverable archetype, and scenario seed, produce: (1) a task title in "{Verb Phrase} — {Deliverable Type}" form or "Draft {Specific Document} for {Client}" form; (2) an instruction in "{Verb} the attached documents [against {reference}] and {verb2} a {deliverable}. Output: `{kebab-case-filename}.docx`." form, under 300 characters; (3) a synthetic document set of 5-15 files with planted issues; (4) a rubric of 40-80 criteria organized in this fixed section order, each anchored to specific facts in the synthetic record, each in "PASS if the {memo|report} {action_verb} {specific content}. FAIL if {negative}." form.

Writing out a generator prompt forces you to be explicit about every dimension of variation, every constraint, and every quality bar. You learn a lot about what your benchmark even is just by trying. That's the spec the corpus implies — reverse-engineered from the artifact, not handed to me by Harvey, but enough to start building against.
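As a prompt template, my reconstruction looks roughly like this. It is inferred from the corpus, not taken from Harvey, and the placeholder names are mine.

# My reconstruction of a master generator prompt, inferred from the corpus.
# Not Harvey's prompt; the placeholder names and counts are assumptions.
MASTER_PROMPT = """\
You are generating one legal benchmark task.
Practice area: {practice_area}. Work type: {work_type}. Archetype: {archetype}.
Scenario seed: {scenario}

Produce:
1. A task title in "{{Verb Phrase}} — {{Deliverable Type}}" form.
2. An instruction under 300 characters: "{{Verb}} the attached documents
   [against {{reference}}] and {{verb2}} a {{deliverable}}.
   Output: `{{kebab-case-filename}}.docx`."
3. A synthetic document set of {num_documents} files, with the scenario's
   complications planted as specific facts in specific documents.
4. A rubric of about {num_criteria} criteria in fixed section order (ISSUES,
   CITATIONS, IDENTIFICATION, FACTS, DISTRACTORS, RECOMMENDATIONS, OVERALL),
   each anchored to a planted fact, each in "PASS if the {{memo|report}}
   {{action_verb}} {{specific content}}. FAIL if {{negative}}." form.
"""

def build_prompt(spec: dict) -> str:
    # spec carries the fields from the TASK sketch earlier in the post
    return MASTER_PROMPT.format(**spec)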

What's actually missing

Here's where the gap is. The corpus gives you a target. The eval-strategy docs give you the scoring rule. The judge model is named (Claude Sonnet 4.6). What's not there is the thing that turns a taxonomy and a scenario seed into a coherent synthetic data room and 60 anchored rubric criteria. That's the generator.

Mercor's APEX-Agents (January 2026) is a useful comparison case. It's structurally similar — synthetic worlds, expert-built rubrics, atomic pass/fail criteria, LLM judge — and Mercor open-sourced their evaluation infrastructure (Archipelago) along with the corpus and an arXiv paper detailing methodology. That's more disclosure than Harvey's launch package, but Archipelago is still the evaluator, not the generator. As far as I can tell, neither company has released the tooling that produced the synthetic worlds and the anchored rubrics in the first place. The two companies are aware of each other's work — Harvey is credited in the APEX-Agents launch and Mercor is listed as a research partner on LAB's. Both chose to share the corpus and the evaluator. Neither chose to share the generator.

There's a charitable read here worth acknowledging: releasing a generator is genuinely harder than releasing a corpus. Generators tend to be entangled with internal taxonomies, proprietary seed data, and infrastructure that doesn't port cleanly outside the company that built it; a corpus is a static artifact, a generator is a live system. So "didn't release" isn't necessarily "chose to withhold." Still, the practical effect on downstream builders is the same: the fast path stays internal, and the slow path is what's on offer.

Harvey's CONTRIBUTING.md actually makes this gap concrete. It explicitly invites the community to add tasks and write sharper rubrics, with detailed guidance on what makes a good criterion: name the required fact, include expected numbers and dates, state failure modes in "FAIL if" language, avoid "nice to have" padding. That guidance is useful and the invitation is genuine. But the workflow it describes is one task at a time — write the instructions, build or gather the documents, draft the rubric criteria by hand. The tooling that would let a contributor produce tasks the way Harvey likely produced its 1,251 isn't part of what's on offer. So the on-ramp for community contribution is the slow path, not the fast one — which puts contributors back in the same labor bind that motivated this whole post.

There's an asymmetry worth naming here. Hand-authored contributions following the CONTRIBUTING.md guidance might actually be more valuable per task than pipeline output — they'd cover the heterogeneity that's hardest to template, especially in the analytical-judgment criteria where templates struggle. Whether or not it's intentional, the current setup is one where contributor attorney-hours flow into the benchmark on the slow path while the tooling that would let contributors work at Harvey's speed stays internal. That's a legitimate model — plenty of open-source projects benefit from contributor labor while keeping internal tools internal — but it's worth being clear-eyed about what the contribution ask actually involves. "Help us build a benchmark" means something different when the fast path is proprietary.

This is the actual bottleneck. Not "should you use a pipeline" — yes, obviously, if you can't afford 22,000 attorney-hours. The bottleneck is that "use a pipeline" is currently advice you can only act on if you can build the pipeline yourself, and building one that produces output as disciplined as LAB's is nontrivial. You need: a seeding system that produces coherent scenarios, a document generator that plants specific facts in specific places, a rubric generator that anchors to those plants, a section-by-section template system with the right verb dictionaries, and the QA tooling to keep all of it consistent across 75,000 criteria. Many teams building domain benchmarks are currently reinventing this wheel.
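If it helps to see the shape of that work, here's the skeleton as I'd structure it: every function body is a placeholder, so this is the outline of the components above rather than a working generator.

# Skeleton of the pipeline components listed above. Placeholders only.
def make_scenario(practice_area: str, archetype: str) -> dict:
    """SME-designed seed: parties, matter, jurisdiction, planted complications."""
    raise NotImplementedError

def generate_documents(scenario: dict) -> list[dict]:
    """LLM pass that writes the data room and records every planted fact."""
    raise NotImplementedError

def generate_rubric(scenario: dict, documents: list[dict]) -> list[dict]:
    """LLM pass that emits section-ordered criteria anchored to planted facts."""
    raise NotImplementedError

def qa_lint(task: dict) -> list[str]:
    """Checks: C-NNN IDs, PASS/FAIL template, verb dictionary, anchor coverage."""
    raise NotImplementedError

def build_task(practice_area: str, archetype: str) -> dict:
    scenario = make_scenario(practice_area, archetype)
    documents = generate_documents(scenario)
    rubric = generate_rubric(scenario, documents)
    task = {"scenario": scenario, "documents": documents, "rubric": rubric}
    failures = qa_lint(task)
    assert not failures, f"fix {len(failures)} lint failures before SME review"
    return task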

What builders should take from this

1. Use the section structure. Issues → citations → identification → facts → distractors → recommendations → overall conclusion. Distinct verb dictionaries per section. Distinct anchoring per section. This is well-engineered and freely visible in the corpus. Use it.

2. Co-generate documents and rubrics. They're factually dependent. The criterion "Reports debt-to-equity ratio of approximately 9.81:1" only makes sense if the synthetic financial summary contains $41.2M debt and $4.2M equity. Generate them separately and you'll spend forever debugging.

3. Use distractors. About 3% of LAB's criteria are negative ("PASS if the agent does NOT claim X"). It's a small slice but it's nonzero, which is more than most benchmarks manage. Distractors bake hallucination-resistance into the same rubric as substantive scoring without a separate eval pass.

4. Atomic pass/fail is the right primitive for production work, but know what it costs you. It can't grade contested calls, judgment, synthesis, or "two experts could defensibly disagree" work. Be explicit with your audience about what your benchmark covers and doesn't.

5. Hand-authoring is still alive — don't dismiss it if you have the resources. GDPval and SWE-bench Verified shipped real hand-authored corpora at scale. If you have senior expert time and want maximum defensibility, it's a legitimate path. The pipeline approach is what makes benchmarks accessible to those of us who don't.

6. The structural fingerprints of your generation pipeline will be visible to anyone willing to grep. Plan for the audit. Be ready to describe your workflow accurately if someone asks.

What would continue to move the field

The thing that would change the math for everyone outside the well-funded labs is an open-source domain-benchmark generator — even a rough one, even with strong opinions about structure. Right now we have a small number of large benchmarks from well-funded teams and a long tail of small benchmarks from everyone else, with not much in between, because the tooling that produces structured, anchored, distractor-aware tasks at scale lives inside the companies that have built it. "Build your own benchmark" and "contribute a task to ours" remain advice for people with plenty of attorney-hours to spend or the engineering capacity to build the generator first.

Until that gap closes, four things are available to builders without that tooling:

  1. Study the LAB corpus and copy its structural choices — the section taxonomy, the verb dictionaries, the per-section anchoring strategies. These are the legibility layer that makes a benchmark gradeable, and they're freely visible.

  2. Build smaller benchmarks with the same architecture. A 100-task benchmark with LAB's structural discipline is more useful than a 1,000-task benchmark without it.

  3. Contribute hand-authored tasks to LAB or APEX-Agents, knowing exactly what the contribution model is asking of you — and that the analytical-judgment criteria are where your hand-authoring is most valuable per task.

  4. If you're building a generator yourself, share it. Even partial, even opinionated, even rough.

The artifact-as-spec reframe is the most useful thing I've taken from looking at LAB carefully: the corpus tells you what well-engineered output looks like. Producing it is still on you, but at least the spec is in the open.

If you want to dig into the corpus yourself, the repo is at github.com/harveyai/harvey-labs. The scripts I used were mostly one-line greps and a small Python file that walks the tasks/ directory and counts things. If anyone is working on open generation tooling for domain benchmarks, I'd genuinely like to hear about it.