Retrieval SystemsEngineering Log November 12, 2025 9 min read

Grounded Legal AI: Building a Research Backend Around Live Case Law

Blackletter keeps legal synthesis grounded by querying live CourtListener records first, cleaning and trimming opinion text, then sending bounded source context into the model with citation mapping and cost controls.

Backend ArchitectureLegal AICourtListenerRetrievalFastAPIStreamingCitation MappingRate LimitingBudget ControlsSource Traceability

I’ve been working on Blackletter, a legal research assistant for US case law.

The goal is simple enough to explain: ask a legal question, retrieve relevant federal opinions, synthesize the answer, and show the source cases behind the analysis.

The architecture is where the work is.

A generic legal chatbot can answer confidently from stale context, loose prompts, or whatever the model remembers from training. That is a bad shape for legal research. The model needs to be boxed into a retrieval path, and the backend needs to control what reaches it.

The first rule I used was this:

The LLM should never be the first system of record.

For Blackletter, the first system of record is CourtListener. The backend queries live case metadata, fetches opinion text, cleans it, trims it, and then sends only bounded context into the model.

The backend controls the research handoff

The request flow looks like this:

user query
→ FastAPI endpoint
→ rate limit check
→ CourtListener search
→ relevance and authority sorting
→ parallel opinion text fetch
→ text cleanup
→ context trimming
→ Together AI synthesis
→ citation footer parsing
→ structured response with source cases

Blackletter request flow showing frontend query, FastAPI search and analysis endpoints, Redis rate-limit and budget gate, CourtListener search, parallel opinion fetch, text cleanup, chunk trimming, Together AI synthesis, citation parsing, and frontend source display. — Blackletter legal research flow

I split the backend into two main paths.

The first path is search.

GET /api/search

That endpoint does not call the LLM. It only returns case metadata, snippets, citations, dates, and source URLs from CourtListener.

That matters because not every legal research action deserves an expensive model call. Sometimes the user just needs to find the cases. A search path should be fast, cheap, and predictable.

The second path is analysis.

POST /api/analysis

That path is heavier. It searches CourtListener, takes the top cases, fetches full opinion text in parallel, builds the context packet, sends it to the model, and returns a structured answer with citations and sources.

There is also a streaming path.

POST /api/analysis/stream

That exists because full legal synthesis can take long enough to make the frontend feel dead. The streaming endpoint uses server-sent events so the UI can start rendering answer text as the model produces it.

The frontend runs search and streaming analysis in parallel.

case search
→ render case cards as soon as metadata arrives

streaming analysis
→ render answer text as tokens arrive

That made the interface feel less blocked. The source list and answer panel do not have to wait for the same backend step to finish.

Frontend hook running case search and streaming analysis in parallel with AbortController cancellation, independent loading states, and streamed answer updates. — Parallel search and streaming analysis

Retrieval and synthesis are separate jobs

The important backend boundary is between retrieval and synthesis.

CourtListener handles legal source retrieval.

The LLM handles synthesis.

FastAPI controls the handoff.

That handoff cannot be loose.

Opinion text from public legal databases is messy. Some records have plain text. Some have HTML with citations. Some have HTML lawbox content. Some have weak snippets. Some have empty text. The backend has to normalize enough of that before the model sees it.

The CourtListener service uses this priority:

plain text
→ HTML with citations
→ HTML
→ lawbox HTML
→ empty fallback

Then it strips HTML tags and collapses whitespace.

That sounds small, but it protects the prompt budget. Legal opinions can be long. Sending raw HTML and repeated whitespace into an LLM burns money and context window.

Context size has to be controlled before inference

The next problem was context size.

If the backend retrieves three cases and one opinion is huge, that single case can dominate the prompt. Then the model gets an uneven view of the legal context.

So the LLM service trims chunks evenly.

The backend calculates a total input budget, divides it across the retrieved case chunks, and truncates anything too large.

Trim chunks logic that distributes the input budget across retrieved case texts so one long opinion does not consume the entire prompt. — Even prompt-budget trimming

This is one of those backend controls that matters more than it looks.

Without it, the answer quality becomes unstable. One oversized opinion can crowd out the other sources. The model still gives an answer, but the answer may reflect the shape of the prompt more than the actual case set.

The answer carries citation metadata

The analysis prompt also has structure.

It tells the model to organize the response by:

Landmark Precedents
Key Rulings
Majority Opinions
Concurring Opinions
Dissenting Opinions

That is not a guarantee of legal correctness. It is a formatting and reasoning constraint. The model still has to be treated as fallible.

So the backend requires a machine-readable citation footer:

CITATIONS: [1, 2, 3]

After the response comes back, the service parses that footer and returns citation IDs separately from the answer body.

answer text
→ citation footer split
→ citation IDs extracted
→ structured API response
→ frontend displays analysis beside sources

This gives the UI something concrete to work with. The answer is not just markdown. It has citation metadata that can be mapped back to the retrieved source list.

Missing source text should stay visible

The summarization path uses a similar shape.

GET /api/summarize/{case_id}

It fetches the opinion text for a specific CourtListener case ID, checks whether enough text exists, then asks the model to produce a structured brief with Facts, Issue, Holding, and Reasoning.

If the full opinion text is unavailable, the backend returns a data-unavailable response instead of pretending the model can summarize what it does not have.

That failure mode matters.

For legal software, empty source text should be visible. Missing retrieval should not be hidden behind confident generation.

Expensive legal routes need cost controls

The other major piece is cost control.

This is a public-facing prototype shape, so the backend needs guardrails before expensive routes. Search can be relatively permissive. Analysis and summarization need stricter limits.

Blackletter uses an Upstash Redis-backed security service for three things:

Rate limiting
Monthly spend tracking
Budget cutoff

The service can run fail-open locally when Redis is not configured, then enforce limits when credentials are present.

The rate limiter uses a fixed-window counter:

ratelimit:{endpoint}:{window}

Different endpoints get different limits.

Search can allow more requests because it avoids the LLM path.

Analysis gets a tighter limit because it triggers retrieval, context assembly, and model inference.

The budget tracker estimates spend from prompt and completion tokens, increments a monthly Redis key, and blocks new LLM calls once the configured cap is reached.

Redis-backed rate limit and monthly spend gate showing fixed-window counters, token-cost tracking, and HTTP 402 budget cutoff. — Redis-backed rate and spend gate

This is not enterprise billing infrastructure. It is the minimum responsible version for a live demo: stop one curious user, one bug, or one reload loop from turning a small prototype into a surprise invoice.

Deployment details still matter

On application startup, the FastAPI lifespan hook initializes local database tables and checks the current Redis spend state. The health endpoint reports version, security state, and allowed CORS origins.

There is also defensive CORS parsing.

That one came from a normal deployment headache. Environment variables often pick up quotes, spaces, or trailing slashes. Starlette CORS handling can reject preflight requests if origins do not match exactly. The backend strips those values before registering allowed origins.

Small thing. Real thing.

The cache layer exists, but it is not active yet

The current architecture has a local SQLite cache service defined for answers and summaries. I’m not treating that as part of the active request path yet because the shown /analysis and /summarize routes do not currently call it.

That matters for the writeup. A cache table existing in the codebase is different from a cache layer actively protecting the request path.

The honest version is this:

The current Blackletter backend already has retrieval, synthesis, streaming, budget controls, and source mapping.

The next architecture step is wiring cache reads and writes into the expensive paths.

That would change the flow from:

query
→ retrieve
→ synthesize
→ return

to:

query
→ cache check
→ return cached answer if valid
→ retrieve only on miss
→ synthesize
→ save answer with sources
→ return

For legal research, cache invalidation needs care. A cached answer should probably include the original query, source case IDs, generation model, timestamp, and retrieval parameters. Otherwise the system can accidentally serve an old synthesis as if it came from a fresh search.

That is the type of backend detail that matters more than the UI.

The control plane is the product

The model can write the memo. The backend has to decide what the model is allowed to see, how much it can spend, which source text counts, how stale the answer can be, and what happens when the source database returns nothing useful.

Using AI for legal research is the least interesting part of the system now.

The useful part is the control plane around the model:

Live federal source retrieval
Separate search and analysis paths
Parallel opinion fetching
Text cleanup before inference
Bounded prompt construction
Streaming response path
Citation footer parsing
Structured source mapping
Rate limits for expensive operations
Monthly spend cutoff
Visible failure when source text is unavailable

That is the shape I trust more for legal AI.

The LLM is inside the system. It is not the system.

For legal work, that boundary is the whole point. The backend has to keep the research grounded, the cost bounded, and the source trail visible enough for a human to verify.

Onto the next one. Let’s keep sharpening that edge!

First written on November 12, 2025.

Want to implement this architecture in your business?

Discuss Your Project