InfrastructureEngineering Log December 14, 2018 8 min read

A Backend Mindset Beyond Jupyter: Serving XGBoost with Flask and Gunicorn

A fraud model worked in a notebook, but the real work started when it had to sit behind checkout. The useful backend pieces were validation, model startup loading, Gunicorn serving, traceable database records, score storage, and decision logs.

XGBoostFlaskGunicornFraud DetectionModel ServingValidationDatabase DesignBackend Architecture

Hello, world.

I am making my engineering logs public today.

From time to time, I’ll post what I’ve been building, breaking, fixing, and everything in between. Usually, it’s a working log: some backend, some data, some architecture, some mistakes, some breakthroughs.

Today’s log is about integrating XGBoost into the backend.

Or more accurately, it is about what happens when a backend developer starts poking around machine learning and quickly realizes the model is only one part of the system.

At this point, most of my work has been around macros, automation, VBA, Python, backend systems, database design, application logic, and building apps. Data science has been getting louder recently, so I wanted to understand more of it from the inside instead of letting it stay as a buzzword.

So I built a small fraud detection pipeline.

The use case was e-commerce checkout fraud.

The idea was simple enough:

transaction comes in
→ store the transaction
→ prepare the features
→ call the model
→ return a fraud risk score
→ let the app decide what to do next

High-level checkout fraud detection flow. Checkout payload enters the backend, gets stored, converted into model features, scored by XGBoost, then passed into a decision rule before the app responds. — Checkout fraud detection flow

I used XGBoost as the model. Why? We’ll get to that below. But I knew there was more going on underneath the backend.

In this case, fraud detection sits inside a checkout flow. That means the system has to be fast, predictable, and careful.

If it blocks a legitimate customer, the business loses money and annoys the buyer. If it lets a fraudulent transaction through, the business eats the loss.

So even though this was a machine learning experiment, I treated it like a backend problem first.

Database design mattered first

Database design matters.

I needed a clean way to store the raw transaction, the normalized feature values, the model output, and the final decision.

Database schema sketch with four separate tables or records: transactions_raw, transaction_features, model_scores, and checkout_decisions. Show the same transaction ID moving through each layer. — Fraud pipeline data records

The raw transaction is what the app received.

The feature row is what the model actually consumed.

The model output is the score.

The decision is what the business logic did with that score.

Keeping those separate made the system easier to inspect. If a transaction was flagged incorrectly, no guessing. I wanted to trace the path from payload to feature mapping to score to decision.

Traceability view for one transaction. Left side shows raw JSON payload. Middle shows normalized feature values. Right side shows fraud score and final decision. — Transaction traceability view

I didn’t bury important state inside a script or let the model become a black box.

The notebook version was not the serving version

Well, the first version worked locally.

I trained the model. I called predict(). I returned a score from a Flask route. It looked fine in the notebook and fine in a local API test.

Minimal early Flask route calling model.predict from a request payload. Keep this intentionally simple to show the first working version. — First local Flask prediction route

Then I tried serving it like an actual endpoint.

That is where the problem showed up.

A notebook does not care about concurrent requests. A checkout flow does.

My first issue: loading too much work inside the request path.

The model file was being loaded too casually, as if each request was just another local script run.

That is fine when one developer is testing one request. It is not fine when multiple requests hit the service.

Bad serving pattern. Multiple incoming requests each trigger model loading inside the request path, causing duplicated memory usage and slow response time. — Bad serving pattern

The model loaded at startup

The model loading had to move out of the request cycle.

I reworked the Flask app so the model loaded when the application started. The route itself only handled validation, feature preparation, prediction, and response formatting.

Then I served it with Gunicorn instead of the basic Flask development server.

Corrected serving pattern. Gunicorn starts the app, the model loads once during startup, workers handle incoming requests, and each request only runs validation, feature mapping, and prediction. — Gunicorn serving pattern

Application startup pattern showing the XGBoost model loaded once at module or app initialization, then reused by the predict route. — Startup model loading

First takeaway:

A model is not production-ready just because it can return a prediction.

It needs a serving layer.

Bad input had to fail before inference

Second issue: input quality.

In the notebook, the dataset is organized. The columns are there. Types are known. Missing values have already been handled. The model receives the exact feature shape it expects.

In an API, the outside world sends whatever it sends.

Missing fields. Strings where numbers should be. Nulls. Extra fields. Wrong formats. Bad assumptions from the frontend.

Validation boundary diagram. Bad payloads stop at the API validation layer. Only clean, typed, normalized feature data reaches the XGBoost model. — Validation boundary

So I added request validation before inference.

The API had to reject bad payloads before they reached the model. I did not want the model silently accepting malformed input and returning a confident-looking answer from broken data.

Example validation schema for transaction payload fields such as amount, timestamp, user_id, payment_method, account_age, and device signal. Show bad input returning a 422-style response. — Transaction payload validation

That is one of the more delicate parts of machine learning inside an application.

Bad input does not always crash. Sometimes it returns a number. And because the number looks precise, people trust it.

The database helped here too.

By keeping raw inputs, feature rows, scores, and decisions separate, I could debug the pipeline without guessing where the mistake happened.

The backend shape became something like this:

checkout payload
→ Flask API
→ schema validation
→ transaction record
→ feature preparation
→ XGBoost prediction
→ score record
→ decision rule
→ response to app

End-to-end backend architecture. Browser checkout calls Flask API. API validates payload, writes raw transaction, prepares features, calls preloaded XGBoost model, stores score, applies decision rule, returns response. — End-to-end XGBoost backend architecture

XGBoost fit the data shape

All fundamentals, which is the point.

The useful work was making the system boring enough to reason about.

Why use XGBoost?

Because the data was tabular. Transaction records are structured: amounts, timestamps, user history, order details, payment signals, account behavior. This is the kind of data where tree-based models are practical and fast.

Small feature table showing transaction fields mapped into model-ready columns. Example columns: amount, hour_of_day, account_age_days, prior_orders, payment_risk_flag, shipping_country_match. — Transaction feature table

But the model choice was only one layer.

The more important engineering questions were:

Can the API stay up?
Can the payload be trusted?
Can bad input be rejected?
Can the database explain what happened later?
Can the app call this without slowing down checkout?
Can a developer debug a wrong decision?

That is where my backend background became useful.

Data science is still new territory for me, but the production concerns were familiar: inputs, outputs, database state, failure modes, request latency, deployment behavior.

Same old problems. Just with a model in the middle.

Result

That is probably the main thing I learned from this build.

Once a model starts influencing real application behavior, the system around it has to be clear. Validation. Traceability. Clean boundaries. A way to understand what happened when the output is wrong.

A notebook can prove that an approach might work.

A backend system decides whether anyone can safely use it.

Data science is interesting, but backend is the part that makes it useful.

Onto the next one. Let’s keep sharpening that edge.

First written on December 14, 2018.

Want to implement this architecture in your business?

Discuss Your Project