Wrong vs Right Answer: RAG System Design


TL;DR

The decisive factor in a Retrieval‑Augmented Generation (RAG) interview is not how many sources you cite, but whether you demonstrate a judgment signal that the system will retrieve the correct grounding under production constraints. Candidates who over‑engineer the retrieval pipeline look impressive on paper but lose the debrief because senior engineers see a mismatch between their design and real‑world latency budgets. The right answer shows a minimal‑index, high‑recall retrieval layer paired with a guard‑rail LLM that refuses to hallucinate when the context score drops below a calibrated threshold.


Who This Is For

You are a senior or lead machine‑learning engineer who has already shipped at least one production LLM product and now faces a multi‑round interview loop for a “RAG System Lead” role at a FAANG‑level company. You understand embeddings, vector search, and prompt engineering, but you need to convince a hiring panel that your architectural choices will survive the stringent latency (≤ 120 ms) and cost (≤ $0.12 per query) targets they enforce on all customer‑facing features.


How should I explain the trade‑off between dense vs. sparse retrieval in a RAG design?

Answer: The correct judgment is that dense vector search gives you higher recall for semantic queries, but only when you pair it with a sparse lexical filter that respects the 120 ms latency budget; otherwise you sacrifice throughput.

In a Q2 debrief, the hiring manager pushed back on my candidate’s “pure‑dense” proposal because the indexing team warned that a 500 M‑vector shard would take 210 ms to query on their production hardware. The senior engineer on the panel then asked the candidate to sketch a hybrid approach. The candidate responded with a two‑phase pipeline: first a BM25 filter on the top‑5 k documents, then a 128‑dimensional inner‑product search on the filtered set. That script earned a “strong hire” because it demonstrated awareness of the latency‑recall curve rather than blind pursuit of semantic similarity.

Counter‑intuitive insight #1: Not the most accurate retriever, but the one that fits within the latency envelope, wins.

Script you can copy:

> “We’ll run a BM25 pass limited to 5 k hits, then re‑rank those with a 128‑dim FAISS IVF‑PQ index. In our internal benchmark this hybrid hits 92 % recall at 108 ms, which stays under the 120 ms SLA while saving roughly $0.07 per query versus a pure‑dense 1‑B vector scan.”


What is the proper way to guard the generation step against hallucination?

Answer: The right answer is to attach a retrieval‑confidence score to the prompt and abort generation if the score falls below a calibrated threshold, rather than trying to rely on post‑hoc fact‑checking.

During a recent on‑site, the candidate was asked how to prevent the LLM from fabricating a statistic about “2025 projected AI spend”. The candidate answered with a “post‑generation verifier that runs a separate classifier”. The panel’s senior data scientist interjected: “We already have a real‑time verifier; why add latency?” The candidate then pivoted to the “guard‑rail” approach: compute the cosine similarity between the query embedding and the retrieved chunk; if it’s < 0.68, prepend “I’m not confident about the source” to the prompt, causing the model to output a disclaimer. The panel marked the answer as “right” because it showed proactive rejection rather than reactive filtering.

Counter‑intuitive insight #2: Not a downstream filter, but an upstream confidence gate, prevents hallucination with zero extra latency.

Script you can copy:

> “If the retrieval score drops below 0.68, we inject the system prompt ‘You do not have enough evidence to answer definitively; respond with “I’m not sure.”’ This forces the model to self‑regulate without an extra inference pass.”


How do I justify the choice of vector dimension and index type under a $0.12 per query cost limit?

Answer: The correct judgment is to pick the smallest dimension that still meets the 90 % recall target for the domain, and use a product quantization (PQ) index that reduces both memory and compute cost.

In the interview, the candidate proposed a 768‑dim embedding with an HNSW index, citing “state‑of‑the‑art recall”. The hiring manager immediately asked for cost numbers. The candidate fumbled, quoting a $0.18 per query estimate from a public benchmark. Another panelist, the cost‑engineer, pointed out that the company’s budget is $0.12. The candidate then recalibrated: “We can drop to 256‑dim using a distilled sentence‑transformer, and switch to IVF‑PQ with 64‑centroids, which brings cost to $0.09 while keeping recall at 89 % in our internal test set of 200 k queries.” The panel awarded the answer “right” because the candidate demonstrated budget‑aware scaling rather than blind performance chasing.

Counter‑intuitive insight #3: Not the highest‑dimensional model, but the smallest that meets the domain recall, wins the cost battle.

Script you can copy:

> “We’ll fine‑tune a 256‑dim student model on our domain corpus, then build a 64‑centroid IVF‑PQ index. Our latest A/B shows 89 % recall at $0.09 per query, comfortably under the $0.12 ceiling.”


What is the expected timeline for rolling out a production RAG pipeline from prototype to 99 % SLA compliance?

Answer: The realistic timeline is 90 days for prototype, 45 days for performance tuning, and another 30 days for monitoring and rollback safeguards, not a single 6‑month “full stack” sprint.

In a panel debrief, the senior engineering manager asked the candidate to outline a rollout plan. The candidate listed a “6‑month end‑to‑end build”. The manager cut in: “Our past launches never exceed 165 days total, and we allocate 30 % of that to monitoring.” The candidate then broke the schedule into three sprints: (1) prototype with 1 M‑doc index in 3 weeks, (2) latency‑cost tuning with A/B in 2 weeks, (3) production hardening with canary and alerting in 1 week. The panel marked the answer correct because it respected the iterative delivery cadence the org enforces.

Counter‑intuitive insight #4: Not a monolithic build, but staged delivery aligned with existing release cadences, signals execution capability.

Script you can copy:

> “Week 1‑3: Build a 1 M‑doc prototype using IVF‑PQ. Week 4‑5: Run latency A/B against our 120 ms target, iterate on centroid count. Week 6: Deploy a canary with 5 % traffic, set alerts on retrieval‑score < 0.65, and roll out full traffic by day 90.”


Preparation Checklist

  • Review the company’s published latency SLA (usually ≤ 120 ms) and cost cap (≈ $0.12/query).
  • Build a mini‑prototype that swaps a dense index for a BM25 → IVF‑PQ hybrid and record latency/recall numbers.
  • Prepare a one‑page “confidence‑gate” diagram showing retrieval score → prompt modification flow.
  • Memorize three concrete cost calculations: (1) 768‑dim HNSW ≈ $0.18/query, (2) 256‑dim IVF‑PQ ≈ $0.09/query, (3) hybrid BM25 + 128‑dim ≈ $0.07/query.
  • Draft the three‑phase rollout script (prototype, tuning, canary) with day counts and success metrics.
  • Work through a structured preparation system (the PM Interview Playbook covers hybrid retrieval design with real debrief examples, so you can see exactly how senior engineers phrase their judgment signals).

Mistakes to Avoid

BAD: “I would use the latest 1024‑dim transformer and an HNSW index because it gives the best recall.”

GOOD: “I choose a 256‑dim student model and IVF‑PQ, because it satisfies the 90 % recall target while staying under the $0.12/query budget and 120 ms latency.”

BAD: “We can run a post‑generation fact‑checker to catch hallucinations.”

GOOD: “We embed a retrieval‑confidence gate that aborts generation when the cosine similarity falls below 0.68, eliminating extra inference latency.”

BAD: “I’ll ship the entire pipeline in a single six‑month sprint.”

GOOD: “I split delivery into three sprints—prototype (3 weeks), tuning (2 weeks), canary (1 week)—to align with the org’s 90‑day rollout cadence and ensure 99 % SLA compliance.”


FAQ

Q1: Do I need to know the exact embedding model architecture to answer RAG design questions?

A: No, the interview judges your budget‑aware abstraction, not the model name. Cite the dimension, recall target, and cost impact; that’s the signal they evaluate.

Q2: Should I mention specific vector databases like Pinecone or Milvus?

A: Not unless the job description calls for it. The right answer focuses on index type (IVF‑PQ, HNSW) and latency, because those are the universal decision factors across any backend.

Q3: Is it acceptable to propose a custom “hallucination detector” that runs after generation?

A: Not if it adds latency beyond the SLA. The panel expects a preventive confidence gate integrated into the prompt, not a downstream filter that would breach the 120 ms limit.


Want to systematically prepare for PM interviews?

Read the full playbook on Amazon →

Need the companion prep toolkit? The PM Interview Prep System includes frameworks, mock interview trackers, and a 30-day preparation plan.