Case Study · 9 min read

Shipping a production RAG system on Azure AI Search

A practical retrieval-augmented generation build at enterprise scale — ingestion, hybrid retrieval, citation-grounded generation, and the evaluation loop that kept it honest. Written from leading a team of seven engineers at Prometheus on an AI integration for an enterprise knowledge base.

01

The brief

The client was an enterprise with a large, heterogeneous internal knowledge base — product documentation, internal runbooks, structured records — spread across formats and systems. Their users kept asking the same kinds of questions in different wording, and the existing full-text search was returning documents, not answers. A few in-house LLM prototypes had already been tried. All of them hallucinated confidently or cited sources they had never read.

The constraints we inherited:

  • Azure-only deployment — it was the committed cloud.
  • Citation-grounded answers. Every claim in the response must point back to a retrievable chunk. No ungrounded generation.
  • Sub-3-second p95 for interactive queries. Fast enough that users don't stop trusting it.
  • Cost-controlled at the scale of millions of chunks, hundreds of thousands of queries a month.

02

Why RAG, and not fine-tuning or pure prompting

Before touching code, the team aligned on the framing. Three options were on the table, and we rejected two:

  • Fine-tuning doesn't cheaply teach new facts. It is the right tool for changing style or constraining output format — not for making an LLM memorize an evolving corpus.
  • Pure prompting — stuffing retrieved documents into a long context — scales badly on cost and hits context limits on real documents. It also makes citation verification a downstream problem.
  • RAG gave us three things the others couldn't: freshness (re-index instead of retrain), grounded citations as a byproduct of retrieval, and per-query cost that scales with the question, not the corpus.

The decision took an afternoon. Everything after it was the actual work.

03

The architecture

Three pipelines, each with its own failure mode:

Ingestion

Azure Blob Storage held the raw corpus. A scheduled NestJS job pulled new or updated files, extracted text and metadata (Markdown, PDF, HTML all needed different extractors), chunked the content, embedded each chunk, and upserted the result into Azure AI Search. Every chunk carried provenance — source URI, section path, position — so retrieval could later cite it exactly.
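The upsert step above can be sketched as follows. This is a minimal illustration, not the production code: the field names (`id`, `content`, `sourceUri`, `sectionPath`, `position`) are assumed to match the index schema, and the extraction and embedding calls are elided.

```typescript
// Build the documents upserted into Azure AI Search, one per chunk,
// each carrying provenance so the query path can cite it exactly.
interface IndexDoc {
  id: string;
  content: string;
  sourceUri: string;
  sectionPath: string;
  position: number;
}

function toIndexDocs(
  sourceUri: string,
  sections: { path: string; text: string }[],
): IndexDoc[] {
  return sections.map((s, i) => ({
    // Deterministic, URL-safe key: re-ingesting a changed file overwrites
    // its chunks in place instead of accumulating stale ones.
    id: Buffer.from(`${sourceUri}#${i}`).toString("base64url"),
    content: s.text,
    sourceUri,
    sectionPath: s.path,
    position: i,
  }));
}
```

The deterministic key is the load-bearing detail: it makes re-ingestion idempotent, which is what lets a scheduled job run safely on a corpus that keeps changing.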

Query path

A NestJS REST API accepted the user's natural-language question, ran a hybrid query against Azure AI Search (vector + BM25), reranked the top results with the semantic reranker, formatted the top-k chunks into a system prompt, and called Azure OpenAI for the final answer. The answer came back as structured JSON — never free text — with an explicit list of citation chunk IDs.
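The prompt-assembly step can be sketched like this, assuming the chunks arrive with the ids and provenance written at ingestion time. The instruction wording is illustrative, not the production prompt:

```typescript
// Format the top-k chunks into a system prompt that demands structured
// JSON output with citations drawn only from the supplied chunk ids.
interface RetrievedChunk {
  id: string;
  sourceUri: string;
  text: string;
}

function buildMessages(question: string, chunks: RetrievedChunk[]) {
  const context = chunks
    .map((c) => `[chunk:${c.id}] (${c.sourceUri})\n${c.text}`)
    .join("\n\n");
  return [
    {
      role: "system",
      content:
        "Answer ONLY from the chunks below. Respond as JSON: " +
        '{"answer": string, "citations": string[]}, where each citation ' +
        "is a chunk id that appears in the context.\n\n" + context,
    },
    { role: "user", content: question },
  ];
}
```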

Evaluation loop

A separate pipeline ran a golden question-and-answer set against the system on every deploy, scored retrieval recall and answer faithfulness, and posted the diff to a Slack channel. Any regression over 3% blocked the deploy. This caught more production bugs than any amount of manual QA.
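The deploy gate reduces to a small comparison. A sketch, with illustrative metric names and the 3% threshold from above expressed as a relative drop:

```typescript
// Compare current eval scores against the last accepted baseline and
// return the metrics that regressed past the threshold; a non-empty
// result blocks the deploy.
type Scores = Record<string, number>; // e.g. { "recall@5": 0.91 }

function regressions(
  baseline: Scores,
  current: Scores,
  maxDrop = 0.03,
): string[] {
  return Object.keys(baseline).filter((metric) => {
    const drop = (baseline[metric] - (current[metric] ?? 0)) / baseline[metric];
    return drop > maxDrop;
  });
}
```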

04

Three decisions worth talking about

Hybrid search beat vector-only retrieval

Our first retrieval implementation was pure cosine similarity over embeddings. It handled paraphrase beautifully and failed embarrassingly on exact-term queries — product names, internal IDs, proper nouns the embedding model had never seen. Adding keyword scoring alongside vector scoring, then passing the union through the semantic reranker, closed the gap. The reranker is the underappreciated piece: vector + BM25 is a messy union; the reranker makes it usable.
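The combined query is one request body in the Azure AI Search REST shape: `search` drives the BM25 leg, `vectorQueries` drives the vector leg, and `queryType: "semantic"` applies the reranker over the merged results. A sketch; the field and configuration names (`embedding`, `kb-semantic`) are assumptions from our index, not service defaults:

```typescript
// Build a hybrid (BM25 + vector) query with semantic reranking.
function hybridQuery(
  question: string,
  embedding: number[],
  k = 50,   // candidates per leg before reranking
  top = 5,  // chunks returned after reranking
) {
  return {
    search: question, // BM25 leg: exact terms, product names, IDs
    vectorQueries: [
      { kind: "vector", vector: embedding, fields: "embedding", k },
    ],
    queryType: "semantic",                // enable the reranker
    semanticConfiguration: "kb-semantic", // assumed config name
    top,
    select: "id,content,sourceUri,sectionPath",
  };
}
```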

Chunk size mattered more than model choice

We spent a day on embedding-model bakeoffs and a week on chunking strategy. That ratio was backwards, and the chunking week was the one that moved the needle. Semantic chunking at Markdown headings, with a fixed-size fallback for runaway paragraphs, outperformed naive fixed-window chunking by roughly fifteen percent on recall@5. Chunk overlap mattered less than we expected; chunk boundaries aligned to the document's own structure mattered more.
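The strategy is simple to state in code: split at Markdown headings, then fall back to fixed-size windows for runaway sections. A minimal sketch; the `maxChars` and `window` values are illustrative, not the production settings:

```typescript
// Semantic chunking at Markdown headings with a fixed-size fallback.
interface Chunk {
  sectionPath: string; // owning heading, e.g. "## Setup"
  text: string;
}

function chunkMarkdown(doc: string, maxChars = 1500, window = 1000): Chunk[] {
  const chunks: Chunk[] = [];
  let heading = "(root)";
  let buffer: string[] = [];

  const flush = () => {
    const text = buffer.join("\n").trim();
    buffer = [];
    if (!text) return;
    if (text.length <= maxChars) {
      chunks.push({ sectionPath: heading, text });
    } else {
      // Fixed-size fallback for runaway sections.
      for (let i = 0; i < text.length; i += window) {
        chunks.push({ sectionPath: heading, text: text.slice(i, i + window) });
      }
    }
  };

  for (const line of doc.split("\n")) {
    if (/^#{1,6}\s/.test(line)) {
      flush();
      heading = line.trim();
    } else {
      buffer.push(line);
    }
  }
  flush();
  return chunks;
}
```

Note that the chunk boundaries come from the document's own structure, which is the property that moved recall; the window fallback only exists so one pathological section can't produce an oversized chunk.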

Citations aren't a prompt trick — they're an API contract

Early iterations asked the LLM to "include citations." Sometimes it did. Sometimes it invented citations that looked plausible. We moved to a strict JSON response format — a schema with answer and citations[], each citation a chunk ID we had literally just passed into the prompt. Unparseable or off-contract responses hard-failed and retried with a corrective message. The change converted hallucinated citations from a reputation risk into a bug with a stack trace, which is the only form of bug a team can fix.
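The contract check described above can be sketched as follows: parse, validate the shape, and verify every citation is a chunk id that was actually passed into the prompt. Off-contract responses return an error string to feed back as the corrective retry message. The error wording is illustrative:

```typescript
// Enforce the {answer, citations[]} contract on the model's raw output.
interface Grounded {
  answer: string;
  citations: string[];
}

function validateResponse(
  raw: string,
  passedChunkIds: Set<string>,
): { ok: true; value: Grounded } | { ok: false; error: string } {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return { ok: false, error: "Not valid JSON. Return only the JSON object." };
  }
  const p = parsed as Partial<Grounded>;
  if (typeof p.answer !== "string" || !Array.isArray(p.citations)) {
    return { ok: false, error: 'Expected {"answer": string, "citations": string[]}.' };
  }
  // The key check: a citation must be a chunk we literally just sent.
  const invented = p.citations.filter((id) => !passedChunkIds.has(id));
  if (invented.length > 0) {
    return {
      ok: false,
      error: `Unknown citation ids: ${invented.join(", ")}. Cite only provided chunks.`,
    };
  }
  return { ok: true, value: p as Grounded };
}
```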

05

What we measured

The golden set was about two hundred human-written question-answer pairs, versioned in git alongside the code that used it. Four metrics tracked across deploys:

  • Retrieval recall@5 — was the chunk containing the true answer in the top five retrieved?
  • Answer faithfulness — did the generated answer only use retrieved chunks, or did the model add from memory? LLM-judged, sampled, reviewed.
  • Citation correctness — did each cited chunk actually contain the claim it supported?
  • Latency and cost — p50, p95, and $/query, broken down by retrieval, reranking, and generation.
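The first metric is mechanical to compute. A sketch, assuming each golden item records the id of the chunk containing the true answer:

```typescript
// Recall@k over the golden set: was the answer-bearing chunk retrieved?
interface GoldenItem {
  question: string;
  answerChunkId: string;
}

function recallAtK(
  golden: GoldenItem[],
  retrieve: (q: string) => string[], // returns ranked chunk ids
  k = 5,
): number {
  const hits = golden.filter((g) =>
    retrieve(g.question).slice(0, k).includes(g.answerChunkId),
  ).length;
  return hits / golden.length;
}
```

The other two quality metrics need an LLM judge and human sampling; recall@k is the one cheap enough to run on every deploy without sampling.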

A regression of more than three percent on any of the first three blocked the deploy. Latency and cost regressions surfaced as Slack alerts. Nobody shipped blind.

06

What I'd do differently

  • Build the eval set before the system. We did retrieval-quality bakeoffs on intuition for a week before we had numbers. The eval harness should be the first service, not the last.
  • Start with hybrid retrieval + reranker on day one. We rewrote the query path once. The production architecture was clear from the first day; we just wanted to believe pure vector would be enough.
  • Keep a corpus of bad queries. User-reported failures are the highest-signal training data you have, and they're free. Start collecting them in week one.

07

Closing

RAG is less about the language model and more about retrieval quality. Most production RAG failures are retrieval failures — the LLM faithfully rendered bad source material, and the bad source material came from a chunking strategy or a vector-only index that failed on the query. Fix retrieval first. Measure everything. Make citations a contract, not a hope.

The rest — the NestJS routes, the Docker builds, the CI/CD, the Azure Container Apps deployment — is regular backend work dressed up as AI. It's the part that determines whether your system runs. Retrieval is the part that determines whether it's useful.

08 — Stack
Retrieval
Azure AI Search — hybrid (vector + BM25) with semantic reranking
Embeddings
Azure OpenAI — text-embedding-3-large (1536 dim)
Generation
Azure OpenAI — GPT-4 class model, structured JSON output
Source storage
Azure Blob Storage
API layer
NestJS, class-validator DTOs, Swagger/OpenAPI contracts
Runtime
Docker → Azure Container Apps, GitHub Actions CI/CD
Observability
Application Insights, structured logs, per-stage traces
Evaluation
Python harness, ~200 golden Q/A pairs, LLM-as-judge for faithfulness

Hiring for an AI-native backend team?

I'm currently open to senior / staff backend and full-stack roles, remote across time zones.