Data Pipelines | Engineering Log | February 24, 2021

Using a Model to Audit Label Quality at Scale

A document classifier stalled at 0.74 F1 because the training labels were wrong. The fix was a separate label-audit workflow with out-of-fold predictions, confidence disagreement filters, a review queue, correction records, and reproducible cleaned exports.

Tags: Label Quality, Data Pipelines, Classification, Model Evaluation, Review Queues, Streamlit, Backend Architecture, Operational Data

I lost three days tuning a document classification model before the real problem showed up in the pipeline.

The classifier was part of a larger records-routing system. Text records came in, passed through a controlled category taxonomy, and moved downstream based on the assigned class. The model sat inside a backend flow, so a bad prediction was not just a weak metric. It could route work into the wrong queue.

The baseline F1 score was stuck around 0.74. I changed the learning rate, adjusted dropout, swapped the encoder, and reran training jobs. Nothing moved enough to matter.

The first system shape was simple:

text records
→ human labels
→ training dataset
→ classifier
→ routing queue

Records-routing classification flow. Text records receive human labels, feed a training dataset, train a classifier, and route records into downstream queues.

The strange cases came from label disagreements

The cases that stood out were rows where the model disagreed with the assigned label at high confidence.

I sorted the training set by loss and pulled rows where the model strongly disagreed with the assigned label. The records were not random edge cases. A noticeable number had labels that did not match the taxonomy rules.
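A minimal sketch of that sort, assuming each row already carries the model's predicted probabilities. The field names (`record_id`, `label`, `probs`) and the category names are hypothetical, and the loss is plain per-row cross-entropy against the assigned label, which may differ from what the original training script logged.

```python
import math

def rank_by_loss(rows):
    """Return rows sorted by cross-entropy loss against the assigned label, highest first.

    Each row is a dict with hypothetical keys:
      record_id, label (assigned), probs (category -> predicted probability).
    """
    eps = 1e-12  # guard against log(0) when the label gets zero probability
    scored = []
    for row in rows:
        p_label = row["probs"].get(row["label"], 0.0)
        loss = -math.log(max(p_label, eps))
        predicted = max(row["probs"], key=row["probs"].get)
        scored.append({**row, "loss": loss, "predicted": predicted})
    return sorted(scored, key=lambda r: r["loss"], reverse=True)

# Toy rows with made-up categories, for illustration only.
rows = [
    {"record_id": "r1", "label": "billing", "probs": {"billing": 0.95, "legal": 0.05}},
    {"record_id": "r2", "label": "billing", "probs": {"billing": 0.03, "legal": 0.97}},
    {"record_id": "r3", "label": "legal", "probs": {"billing": 0.40, "legal": 0.60}},
]
ranked = rank_by_loss(rows)
for r in ranked:
    print(r["record_id"], r["label"], r["predicted"], round(r["loss"], 3))
```

Rows at the top of this ranking are the ones where the model is most confident the assigned label is wrong, which is where a manual read is most likely to pay off.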

The training script had treated every label as truth. The backend had no audit path between labeled data and production routing.

That was the system gap.

The path from labels to production looked like this:

labeled records
→ training job
→ model artifact
→ deployment
→ routing decisions

Weak label pipeline. Labeled records move straight into training, then into model deployment and production routing without an audit layer.

I built a separate label-audit pipeline

I needed a way to audit label quality without manually reading the whole dataset.

So I built a separate label-audit pipeline. It was not the production classifier. It was a temporary backend workflow for finding rows that deserved review.

The audit flow looked like this:

training dataset
→ K-fold split
→ temporary classifier runs
→ out-of-fold predictions
→ confidence disagreement filter
→ review queue

Label audit pipeline. The dataset is split into folds, temporary classifiers generate out-of-fold predictions, high-confidence disagreements are filtered, and suspicious rows move into a review queue.

Out-of-fold predictions kept the audit honest. Each record was scored by a model that had not trained on that record, so the model could not simply memorize a wrong label and then agree with it.

For every row, the pipeline stored:

record_id
original_label
predicted_label
prediction_confidence
fold_id
audit_run_id
model_version

Out-of-fold audit job. Example audit job that creates K-fold splits, generates out-of-fold predictions, filters high-confidence disagreements, and writes suspicious rows into a review table.
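A minimal sketch of the split-and-score part of such a job, using only the standard library. It is not the original script. `TinyNaiveBayes` is a hypothetical stand-in for whatever temporary classifier the audit uses, the folds are shuffled but not stratified, and the default `audit_run_id` and `model_version` values are placeholders. With scikit-learn available, `cross_val_predict(..., method="predict_proba")` covers the same out-of-fold step.

```python
import math
import random
from collections import Counter, defaultdict

def kfold_indices(n, k, seed=0):
    """Shuffle row indices and deal them into k folds (not stratified)."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

class TinyNaiveBayes:
    """Stand-in bag-of-words classifier; swap in the real temporary model."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict_proba(self, text):
        total = sum(self.class_counts.values())
        log_scores = {}
        for label, n in self.class_counts.items():
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            score = math.log(n / total)
            for w in text.lower().split():
                # Laplace smoothing so unseen words do not zero out a class
                score += math.log((self.word_counts[label][w] + 1) / denom)
            log_scores[label] = score
        top = max(log_scores.values())
        exp = {label: math.exp(s - top) for label, s in log_scores.items()}
        z = sum(exp.values())
        return {label: v / z for label, v in exp.items()}

def run_audit(records, k=5, audit_run_id="audit-001", model_version="tmp-nb-0"):
    """Score every record with a model that never saw it during training."""
    out = []
    for fold_id, held_out in enumerate(kfold_indices(len(records), k)):
        held = set(held_out)
        train = [r for i, r in enumerate(records) if i not in held]
        model = TinyNaiveBayes().fit(
            [r["text"] for r in train], [r["label"] for r in train]
        )
        for i in held_out:
            probs = model.predict_proba(records[i]["text"])
            predicted = max(probs, key=probs.get)
            out.append({
                "record_id": records[i]["record_id"],
                "original_label": records[i]["label"],
                "predicted_label": predicted,
                "prediction_confidence": probs[predicted],
                "fold_id": fold_id,
                "audit_run_id": audit_run_id,
                "model_version": model_version,
            })
    return out

# Toy data with made-up categories; the last row is deliberately mislabeled.
examples = [
    ("invoice payment overdue", "billing"),
    ("refund invoice amount", "billing"),
    ("payment receipt invoice", "billing"),
    ("contract clause dispute", "legal"),
    ("legal contract review", "legal"),
    ("invoice payment reminder", "legal"),
]
records = [
    {"record_id": f"r{i}", "text": t, "label": l} for i, (t, l) in enumerate(examples)
]
audit_rows = run_audit(records, k=3)
```

Each output row carries the seven stored fields listed above, so the disagreement filter and the review queue can work from this table alone.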

The first useful filter was strict:

prediction confidence above 0.90
prediction disagreed with the assigned label
record text passed validation
predicted category existed in the approved taxonomy

That produced around 4,200 suspicious records.
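The four conditions translate into a small predicate. This is a sketch under assumptions: the taxonomy contents are invented, `text_is_valid` is a placeholder for whatever validation the pipeline really applied, and the row fields match the stored audit columns listed earlier.

```python
APPROVED_TAXONOMY = {"billing", "legal", "support"}  # hypothetical categories
CONFIDENCE_THRESHOLD = 0.90

def text_is_valid(text):
    """Placeholder validation: a non-empty string after stripping whitespace."""
    return isinstance(text, str) and bool(text.strip())

def is_suspicious(row, text):
    """True only when all four filter conditions hold for this audit row."""
    return (
        row["prediction_confidence"] > CONFIDENCE_THRESHOLD
        and row["predicted_label"] != row["original_label"]
        and text_is_valid(text)
        and row["predicted_label"] in APPROVED_TAXONOMY
    )

def build_review_queue(audit_rows, texts_by_id):
    """Keep suspicious rows, most confident disagreement first."""
    queue = [
        r for r in audit_rows
        if is_suspicious(r, texts_by_id.get(r["record_id"], ""))
    ]
    return sorted(queue, key=lambda r: r["prediction_confidence"], reverse=True)

texts_by_id = {
    "r1": "invoice payment reminder",
    "r2": "   ",
    "r3": "contract dispute",
    "r4": "password reset",
    "r5": "office lunch order",
}
audit_rows = [
    {"record_id": "r1", "original_label": "legal", "predicted_label": "billing", "prediction_confidence": 0.97},
    {"record_id": "r2", "original_label": "legal", "predicted_label": "billing", "prediction_confidence": 0.99},   # blank text
    {"record_id": "r3", "original_label": "legal", "predicted_label": "legal", "prediction_confidence": 0.98},     # agrees
    {"record_id": "r4", "original_label": "billing", "predicted_label": "support", "prediction_confidence": 0.80}, # below threshold
    {"record_id": "r5", "original_label": "billing", "predicted_label": "misc", "prediction_confidence": 0.95},    # not in taxonomy
]
queue = build_review_queue(audit_rows, texts_by_id)
print([r["record_id"] for r in queue])
```

Keeping the threshold and taxonomy as named constants makes it easy to record them alongside the audit run, so a later run with a looser filter stays comparable.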

The review queue needed a real backend

A spreadsheet would have broken the cleanup work. Too many rows, too easy to duplicate decisions, lose edits, or overwrite someone else’s review. I built a small Streamlit review tool backed by a correction table.

The UI stayed basic:

record text
original label
model-proposed label
confidence score
action: accept the proposed label or keep the original

Label review tool. The reviewer sees the text record, original label, model-proposed label, confidence score, and action buttons.

The backend around it mattered more than the screen. Each review action wrote a durable correction record:

record_id
original_label
proposed_label
review_decision
reviewed_by
reviewed_at
audit_run_id
model_version

Correction table. Each reviewed row stores the original label, proposed label, review decision, reviewer, timestamp, audit run ID, and model version.
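One way to back that table is SQLite. The sketch below is an assumption, not the original storage layer: the table name `label_corrections`, the `accept`/`keep` decision values, and the primary key on `(record_id, audit_run_id)` are all illustrative choices. The key is there to reject a second decision for the same row in the same audit run, which is the duplicate-review problem a spreadsheet cannot prevent.

```python
import sqlite3
from datetime import datetime, timezone

SCHEMA = """
CREATE TABLE IF NOT EXISTS label_corrections (
    record_id       TEXT NOT NULL,
    original_label  TEXT NOT NULL,
    proposed_label  TEXT NOT NULL,
    review_decision TEXT NOT NULL CHECK (review_decision IN ('accept', 'keep')),
    reviewed_by     TEXT NOT NULL,
    reviewed_at     TEXT NOT NULL,
    audit_run_id    TEXT NOT NULL,
    model_version   TEXT NOT NULL,
    PRIMARY KEY (record_id, audit_run_id)
);
"""

def record_review(conn, row, decision, reviewer):
    """Write one review decision. Returns False if the row was already reviewed in this run."""
    if decision not in ("accept", "keep"):
        raise ValueError(f"unknown decision: {decision!r}")
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO label_corrections VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
                (
                    row["record_id"],
                    row["original_label"],
                    row["predicted_label"],
                    decision,
                    reviewer,
                    datetime.now(timezone.utc).isoformat(),
                    row["audit_run_id"],
                    row["model_version"],
                ),
            )
        return True
    except sqlite3.IntegrityError:
        return False  # primary key hit: a decision already exists

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

suspicious_row = {
    "record_id": "r1",
    "original_label": "legal",
    "predicted_label": "billing",
    "audit_run_id": "audit-001",
    "model_version": "tmp-nb-0",
}
first = record_review(conn, suspicious_row, "accept", "reviewer_a")
second = record_review(conn, suspicious_row, "keep", "reviewer_b")
stored = conn.execute(
    "SELECT review_decision, reviewed_by FROM label_corrections"
).fetchall()
```

A review UI would call `record_review` from its accept and keep buttons and skip to the next row when it returns `False`.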

That gave the cleanup process a real data trail. The original labels stayed intact. Corrections were stored separately. The cleaned dataset could be rebuilt from the source records plus accepted corrections.
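The rebuild step can be sketched as a pure function over source records and correction rows. The assumptions here: `keep` decisions leave the source label untouched, the latest accepted correction wins when a record has more than one, and `reviewed_at` values are ISO 8601 strings in one timezone so they sort correctly as text.

```python
def build_cleaned_dataset(source_records, corrections):
    """Apply accepted corrections to a copy of the source records.

    The source list is never mutated, so the export can be regenerated
    at any time from the same two inputs.
    """
    accepted = {}
    for c in sorted(corrections, key=lambda c: c["reviewed_at"]):
        if c["review_decision"] == "accept":
            accepted[c["record_id"]] = c["proposed_label"]  # later accepts overwrite earlier ones

    cleaned = []
    for record in source_records:
        label = accepted.get(record["record_id"], record["label"])
        cleaned.append({**record, "label": label})
    return cleaned

# Toy inputs with made-up values, for illustration only.
source = [
    {"record_id": "r1", "text": "invoice payment reminder", "label": "legal"},
    {"record_id": "r2", "text": "contract dispute", "label": "legal"},
]
corrections = [
    {"record_id": "r1", "proposed_label": "billing", "review_decision": "accept",
     "reviewed_at": "2021-02-01T10:00:00+00:00"},
    {"record_id": "r2", "proposed_label": "billing", "review_decision": "keep",
     "reviewed_at": "2021-02-01T10:05:00+00:00"},
]
cleaned = build_cleaned_dataset(source, corrections)
```

Because the function only reads its inputs, rerunning it against the same source snapshot and correction table yields the same export, which is what makes the cleaned dataset reproducible.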

The same model improved after the data was fixed

The review queue moved quickly. The proposed correction was accepted around 88% of the time.

After applying accepted corrections, I retrained the same baseline classifier. Same general architecture. Same training setup. Cleaner input data.

F1 moved from around 0.74 to around 0.92.

Label audit result. The initial classifier stalls at 0.74 F1. After review and accepted corrections, the same baseline model reaches 0.92 F1.

The final architecture became:

raw records
→ initial labels
→ audit job
→ out-of-fold predictions
→ suspicious row table
→ review tool
→ correction table
→ cleaned dataset export
→ production classifier training
→ routing service

Final label-quality workflow. Raw records are labeled, audited by temporary classifiers, reviewed through a correction table, exported into a cleaned dataset, then used to train the production classifier for routing.

Result

This was backend work wearing an ML costume.

The model helped find inconsistent labels, but the useful system was the pipeline around it: audit runs, review queues, correction tables, reproducible exports, model metadata, and clean handoff back into training.

That changed the project from “tune the classifier again” into “build a repeatable label-quality workflow.”

The quality jump came from treating labels as operational data with failure modes, ownership, and history.

Onto the next one. Let’s keep sharpening that edge.

First written on February 24, 2021.
