AI Systems | Engineering Log | June 28, 2020 | 9 min read

The Generative Trap: Defending Against Probabilistic Outputs

A GPT-2 extraction prototype returned polite prose instead of JSON and crashed the database write path. The fix was a defensive model wrapper with raw output capture, parsing cleanup, schema validation, evidence checks, capped retries, a dead letter queue, and human review.

Tags: Generative AI, GPT-2, Extraction Systems, Schema Validation, Pydantic, Dead Letter Queues, Human Review, Backend Architecture

The downstream database crashed because the model returned a polite paragraph instead of a JSON object.

OpenAI announced its API earlier this month. GPT-3 is now the thing everyone is watching.

The demo energy is understandable. Text in, text out. Prompt the model and it writes, extracts, summarizes, classifies, answers, completes.

That interface is going to tempt developers into a bad backend decision.

They’re going to treat generative models like normal APIs.

I hit the same failure mode yesterday while working with a GPT-2-based extraction prototype on EC2.

The objective was contract extraction.

Take unstructured legal text, identify important clauses, return structured fields, and store the result for review.

The intended path was simple:

contract text
→ model inference endpoint
→ JSON extraction result
→ parser
→ PostgreSQL write
→ review screen

Figure: Early legal extraction pipeline. Contract text is sent to a model inference endpoint, expected JSON comes back, the backend parses it, stores it in PostgreSQL, and shows it in a review screen.

The first test looked good.

The model identified a liability clause.

The second test hallucinated a termination fee that did not exist in the contract.

The third test ignored the requested JSON shape and returned a bulleted list starting with:

“Here is the data you requested:”

That broke the write path.

The backend expected keys. The model returned prose.

No stable object.
No predictable fields.
No clean database insert.

Just a helpful-looking answer in the wrong shape.

That was the useful failure.

Generative output is not a normal API response

A generative model does not behave like a normal REST endpoint. A normal endpoint has a contract. The response shape is part of the system design.

A language model gives you probable text.

Sometimes that text follows your format. Sometimes it drifts. Sometimes it invents. Sometimes it sounds useful while violating the software contract around it.

The original backend path was too trusting:

send prompt
→ receive text
→ parse as JSON
→ write to database

Figure: Fragile model integration. The backend sends a prompt, receives raw model text, tries to parse it as JSON, and writes directly to the database.

That path was fine for a demo.

It was too weak for a real extraction workflow.
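A rough sketch of that trusting path in Python. The function name and the bullet content in the sample output are made up for illustration; only the opening line "Here is the data you requested:" comes from the actual failure.

```python
import json


def fragile_extract(raw_text: str) -> dict:
    # Trusting path: assume the model returned a JSON object and nothing else.
    return json.loads(raw_text)


# Illustrative stand-in for the prose the model sent back.
polite_output = "Here is the data you requested:\n- Clause: liability\n- Cap: fees paid"

try:
    record = fragile_extract(polite_output)
    failure = None
except json.JSONDecodeError as exc:
    # With nothing in between, this error lands right in front of the database write.
    record = None
    failure = f"unparseable model output: {exc.msg}"
```

The parse step is the only thing standing between the model and the database, and it has exactly one behavior on bad output: raise.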

I added a defensive wrapper

So I added a defensive wrapper around the inference endpoint.

The wrapper became the boundary between probabilistic output and deterministic software.

The new shape looked like this:

contract text
→ prompt builder
→ model inference
→ raw output capture
→ cleanup parser
→ schema validation
→ retry or dead letter queue
→ database write only after validation

Figure: Defensive model wrapper. Contract text goes into a prompt builder and model endpoint. Raw output is captured, cleaned, validated, retried if needed, and only written to PostgreSQL after passing validation.

Prompt discipline helped, but did not solve it

The first fix was prompt discipline.

Loose prompts produced loose outputs. Asking the model to “extract key contract terms” gave it room to explain, summarize, and decorate the answer.

I needed records.

So I moved to few-shot prompting.

The prompt included several examples of the exact input and output pattern:

contract snippet
→ expected JSON
contract snippet
→ expected JSON
contract snippet
→ expected JSON
actual contract snippet
→ model completes the next JSON object

Figure: Few-shot extraction prompt, with legal contract snippets and exact JSON outputs for liability, termination, payment terms, renewal, and governing law.
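A minimal sketch of that prompt shape in Python. The example snippets, the `Contract:` / `JSON:` labels, and the names `FEW_SHOT_EXAMPLES` and `build_prompt` are assumptions for illustration, not the original prompt.

```python
import json

# Hypothetical examples: each pairs a contract snippet with the exact JSON expected back.
FEW_SHOT_EXAMPLES = [
    (
        "Either party may terminate this agreement with 30 days written notice.",
        {
            "clause_type": "termination",
            "extracted_value": "30 days written notice",
            "source_text": "Either party may terminate this agreement with 30 days written notice.",
        },
    ),
    (
        "This agreement is governed by the laws of the State of New York.",
        {
            "clause_type": "governing_law",
            "extracted_value": "State of New York",
            "source_text": "This agreement is governed by the laws of the State of New York.",
        },
    ),
]


def build_prompt(snippet: str, examples=FEW_SHOT_EXAMPLES) -> str:
    """Build a few-shot prompt that ends where the next JSON object should start."""
    blocks = [
        f"Contract: {text}\nJSON: {json.dumps(expected)}"
        for text, expected in examples
    ]
    # End on an open "JSON:" so the likeliest continuation is another object.
    blocks.append(f"Contract: {snippet}\nJSON:")
    return "\n\n".join(blocks)


prompt = build_prompt("Fees are due within 45 days of invoice.")
```

Serializing the examples with `json.dumps` keeps every example in the same key order and quoting style, which is most of what "boring" means here.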

The examples were deliberately boring.

No open-ended instruction.
No analysis request.
No legal commentary.

Just pattern pressure.

That reduced format drift.

It did not remove it.

The model still added filler. Sometimes it changed field names. Sometimes it wrapped the JSON in explanation. Sometimes it returned a valid-looking object with a value that was not supported by the contract text.

So the raw output had to be treated as hostile input.

The parser and schema became the safety boundary

The parser did four jobs:

capture the raw response
extract the JSON-like block
strip conversational filler
reject anything that could not be parsed cleanly

Figure: Model output parsing layer. Raw model text enters a cleanup parser. The parser extracts a JSON-like block, removes filler, parses it, and sends the object to schema validation.
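One way to sketch the extract-and-reject part of that parser, using only the standard library. `extract_json_object` is a hypothetical name, and scanning for the first decodable object is an assumption about the cleanup strategy, not necessarily what the prototype did.

```python
import json
from typing import Optional

_decoder = json.JSONDecoder()


def extract_json_object(raw: str) -> Optional[dict]:
    """Return the first JSON object embedded in raw model text, or None."""
    start = raw.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start` and ignores
            # whatever trails it, which drops filler after the object.
            obj, _end = _decoder.raw_decode(raw, start)
            if isinstance(obj, dict):
                return obj
        except json.JSONDecodeError:
            pass
        # Not a parseable object here; try the next opening brace.
        start = raw.find("{", start + 1)
    return None  # nothing parsed cleanly: reject


wrapped = (
    "Here is the data you requested:\n"
    '{"clause_type": "liability", "extracted_value": "capped at fees paid"}\n'
    "Let me know if you need anything else."
)
parsed = extract_json_object(wrapped)
```

Returning `None` instead of raising keeps the rejection explicit, so the caller has to decide between a retry and the dead letter queue.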

After parsing, the object hit a strict schema.

The extraction record needed:

document_id
clause_type
extracted_value
source_text
model_name
prompt_version
created_at

Figure: Contract extraction record. Each record stores document ID, clause type, extracted value, exact source text, model name, prompt version, and creation timestamp.
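A sketch of a strict schema for that record. The pipeline validated with Pydantic; this version swaps in a standard-library dataclass with hand-written checks so it runs without dependencies. The clause vocabulary, `RecordValidationError`, and `validate_record` are assumptions for illustration.

```python
from dataclasses import dataclass, fields
from datetime import datetime, timezone

# Hypothetical clause vocabulary; a real one would come from the review team.
ALLOWED_CLAUSE_TYPES = {
    "liability", "termination", "termination_fee",
    "payment_terms", "renewal", "governing_law",
}


class RecordValidationError(ValueError):
    """Raised when a parsed payload does not match the extraction record."""


@dataclass(frozen=True)
class ExtractionRecord:
    document_id: str
    clause_type: str
    extracted_value: str
    source_text: str
    model_name: str
    prompt_version: str
    created_at: datetime


def validate_record(payload: dict) -> ExtractionRecord:
    expected = {f.name for f in fields(ExtractionRecord)}
    missing = expected - set(payload)
    extra = set(payload) - expected
    if missing or extra:
        raise RecordValidationError(
            f"missing={sorted(missing)} unexpected={sorted(extra)}"
        )
    for name in sorted(expected - {"created_at"}):
        value = payload[name]
        if not isinstance(value, str) or not value.strip():
            raise RecordValidationError(f"{name} must be a non-empty string")
    if payload["clause_type"] not in ALLOWED_CLAUSE_TYPES:
        raise RecordValidationError(f"unknown clause_type: {payload['clause_type']!r}")
    if not isinstance(payload["created_at"], datetime):
        raise RecordValidationError("created_at must be a datetime")
    return ExtractionRecord(**payload)


example_payload = {
    "document_id": "doc-1",
    "clause_type": "termination",
    "extracted_value": "30 days written notice",
    "source_text": "Either party may terminate this agreement with 30 days written notice.",
    "model_name": "gpt2-local",   # hypothetical label
    "prompt_version": "v1",       # hypothetical label
    "created_at": datetime(2020, 6, 28, tzinfo=timezone.utc),
}
example_record = validate_record(example_payload)
```

Rejecting unexpected keys matters as much as requiring the expected ones, because renamed fields were one of the observed drift modes.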

The most important field was source_text.

If the model claimed the contract had a termination fee, it had to provide the exact sentence or clause that supported the claim.

No source text, no write.

That rule caught the failure that plain JSON validation would miss.

A valid JSON object can still contain garbage.

Example:

{
  "clause_type": "termination_fee",
  "extracted_value": "$50,000",
  "source_text": "Either party may terminate this agreement with 30 days written notice."
}

The object parses.

The shape is valid.

The extracted value is false. The quoted source text describes a notice period and never mentions a fee.

This is where the backend has to stop acting impressed.

The model made a claim. The system needed evidence.
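One crude way to make that demand for evidence mechanical, assuming literal matching is acceptable as a first filter. `evidence_problems` is a hypothetical helper; a paraphrased value would fail this check and still need a human.

```python
import re
from typing import List


def _normalize(text: str) -> str:
    # Collapse whitespace and case so line wrapping does not defeat the match.
    return re.sub(r"\s+", " ", text).strip().lower()


def evidence_problems(contract_text: str, source_text: str, extracted_value: str) -> List[str]:
    """Return reasons the claim is unsupported; an empty list means it passed."""
    problems = []
    source = _normalize(source_text)
    if not source:
        problems.append("source_text is empty")
    elif source not in _normalize(contract_text):
        problems.append("source_text not found in contract")
    if _normalize(extracted_value) not in source:
        problems.append("extracted_value not found in source_text")
    return problems


contract = (
    "Either party may terminate this agreement\n"
    "with 30 days written notice. Fees are due within 45 days of invoice."
)
quoted = "Either party may terminate this agreement with 30 days written notice."

hallucinated = evidence_problems(contract, quoted, "$50,000")
supported = evidence_problems(contract, quoted, "30 days written notice")
```

On the termination fee example, the quote is real but the value is not in it, so the record is rejected even though it parses and has the right shape.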

So the review layer showed extracted values beside the source text. The database stored enough context to debug the decision later.

extracted value
source text
document_id
model_name
prompt_version
raw_output_id

Figure: Legal extraction review screen. The extracted value appears beside the exact source text, document ID, model name, prompt version, and raw output reference.

Failures became data

Validation failures went through a capped retry path.

First attempt: normal few-shot prompt.
Second attempt: lower temperature.
Third attempt: stricter repair prompt using the validation error.
After that, the payload went to a dead letter queue.

Figure: Defensive inference wrapper. Example Python wrapper around a local inference endpoint. It captures raw output, parses JSON, validates with Pydantic, retries with lower temperature, and routes failed outputs to a dead letter queue.
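A condensed sketch of that retry path, not the original wrapper. The `generate` and `validate` callables stand in for the inference endpoint and the schema check, the temperatures and repair prompt wording are assumed values, and the dead letter queue is a plain list.

```python
import json
from datetime import datetime, timezone
from typing import Callable, List, Optional, Sequence


def run_extraction(
    document_id: str,
    input_text: str,
    prompt: str,
    generate: Callable[[str, float], str],  # hypothetical: (prompt, temperature) -> raw text
    validate: Callable[[dict], None],       # hypothetical: raises ValueError on a bad record
    dead_letters: List[dict],               # stand-in for a real queue or table
    model_name: str = "gpt2-local",         # hypothetical label
    prompt_version: str = "v1",             # hypothetical label
    temperatures: Sequence[float] = (0.7, 0.3, 0.3),  # assumed values, one per attempt
) -> Optional[dict]:
    raw, error = "", ""
    for attempt, temperature in enumerate(temperatures, start=1):
        attempt_prompt = prompt
        if attempt == len(temperatures) and error:
            # Last attempt: stricter repair prompt that quotes the validation error.
            attempt_prompt = (
                f"The previous output was rejected: {error}\n"
                f"Return exactly one JSON object and nothing else.\n\n{prompt}"
            )
        raw = generate(attempt_prompt, temperature)  # keep the raw text before parsing
        try:
            # Crude cleanup: keep the span from the first "{" to the last "}".
            record = json.loads(raw[raw.index("{"): raw.rindex("}") + 1])
            validate(record)
            return record
        except ValueError as exc:  # covers JSONDecodeError and a missing brace
            error = str(exc)
    # Retries exhausted: preserve everything needed to debug the failure later.
    dead_letters.append({
        "document_id": document_id,
        "input_text": input_text,
        "raw_model_output": raw,
        "validation_error": error,
        "model_name": model_name,
        "prompt_version": prompt_version,
        "attempt_count": len(temperatures),
        "created_at": datetime.now(timezone.utc).isoformat(),
    })
    return None
```

The function returns a record only after validation passes, so the caller never sees unvalidated output. The cap is the length of `temperatures`, which keeps the retry budget visible in one place.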

The dead letter queue stored:

document_id
input_text
raw_model_output
validation_error
model_name
prompt_version
attempt_count
created_at

Figure: Dead letter queue for generative extraction. Failed records preserve the input text, raw model output, validation error, model name, prompt version, attempt count, and timestamp.

That queue became useful quickly.

It showed which clauses made the model drift. It showed where the prompt examples were too weak. It gave me actual failed outputs to improve against.

Logs were not enough here. Failed generations needed to become a dataset.
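A small sketch of treating the queue as a dataset, assuming dead letters are dicts with the fields listed above. `summarize_dead_letters` and the sample rows are hypothetical.

```python
from collections import Counter
from typing import List, Tuple


def summarize_dead_letters(dead_letters: List[dict]) -> List[Tuple[Tuple[str, str], int]]:
    """Count failures per (prompt_version, validation_error), most frequent first."""
    counts = Counter(
        (entry["prompt_version"], entry["validation_error"]) for entry in dead_letters
    )
    return counts.most_common()


# Made-up rows, only to show the shape of the summary.
sample_queue = [
    {"prompt_version": "v1", "validation_error": "source_text must be a non-empty string"},
    {"prompt_version": "v1", "validation_error": "source_text must be a non-empty string"},
    {"prompt_version": "v2", "validation_error": "unknown clause_type: 'fees'"},
]
summary = summarize_dead_letters(sample_queue)
```

Grouping by prompt version is what makes a prompt change measurable: the same error either shrinks under the new version or it does not.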

The final path was less fragile:

contract text
→ prompt builder
→ GPT-2 inference endpoint
→ raw output log
→ parser
→ Pydantic schema validation
→ evidence check
→ capped retry
→ dead letter queue
→ human review
→ PostgreSQL write

Figure: End-to-end defensive extraction workflow. Contract text passes through a prompt builder, GPT-2 inference endpoint, raw output log, parser, schema validation, evidence check, retry path, dead letter queue, human review, and final PostgreSQL write.

Result

This is the part I think will matter as larger models arrive.

GPT-3 will make the output look much better. That does not remove the backend problem.

A smoother answer can still break a parser. A more fluent answer can still invent a value. A bigger model can still drift away from the format the database expects.

The engineering work sits at the boundary:

prompt templates
raw output capture
parsing cleanup
schema validation
source text requirements
prompt versioning
model versioning
capped retries
dead letter queue
human review path

That is the system that keeps probabilistic text from corrupting deterministic software.

The model can generate.

The backend has to decide what becomes data.

Onto the next one. Let’s keep sharpening that edge.

First written on June 28, 2020.
