Large Language Model (LLM) applications often look simple on the surface, just a single text input, but the pipeline behind them involves complex, probabilistic stages: intent classification, document retrieval, ranking, prompt construction, model inference, and safety filtering. Altering any part of this sequence can have unforeseen effects, potentially turning a previously accurate response into an error. Developing Dropbox Dash demonstrated that in the era of foundation models, robust AI evaluation (a system of structured tests for accuracy and reliability) is as crucial as model training itself.
Initially, evaluations were largely informal, relying on ad-hoc testing rather than a systematic methodology. Through continued experimentation, it became clear that significant improvements stemmed from refining processes, such as optimizing information retrieval, adjusting prompts, and balancing answer consistency with variety. This led to the development of a more rigorous, standardized evaluation process, treating every experiment with the same diligence as production code. The principle was straightforward: every modification required thorough testing before integration, making evaluation an integral part of every development stage, not just a final step.
These insights were compiled into a comprehensive playbook detailing datasets, metrics, tooling, and workflows. Recognizing that work involves more than just text, the evaluation framework also needs to encompass images, video, and audio to accurately reflect real-world interactions. The findings are presented here to enable others working with LLMs to adopt this evaluation-first methodology.
Step 1: Curate the right datasets
Evaluation began with publicly available datasets to establish a baseline for retrieval and question-answering performance. For question answering, resources like Google’s Natural Questions, Microsoft Machine Reading Comprehension (MS MARCO), and MuSiQue were utilized. Each dataset offered unique challenges: Natural Questions assessed retrieval from extensive documents, MS MARCO focused on managing multiple document matches for a single query, and MuSiQue presented multi-hop question-answering scenarios. This diverse dataset combination provided early insights into system and parameter effectiveness.
However, public datasets are insufficient on their own. To capture the nuances of real-world language, internal datasets were created from anonymized production logs of employees using Dash. Two types of evaluation sets were developed:
- Representative query datasets mirrored actual user behavior through anonymized and ranked internal queries, with annotations from proxy labels or internal annotators.
- Representative content datasets focused on frequently used materials like shared files, documentation, and connected data sources. LLMs were then used to generate synthetic questions and answers from this content, covering formats such as tables, images, tutorials, and factual lookups.
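To make the synthetic generation step concrete, the sketch below shows one way a question-answer pair could be produced from a chunk of internal content with a single templated LLM call. It is a minimal illustration under assumptions: the call_llm helper and the prompt wording are hypothetical stand-ins, not the actual Dash pipeline.

import json

SYNTH_PROMPT = """You are building evaluation data.
Given the document excerpt below, write one question a user might ask
that this excerpt can answer, plus the correct answer grounded in it.
Return JSON with keys "question" and "answer".

Excerpt:
{chunk}
"""

def generate_synthetic_qa(chunk: str, call_llm) -> dict:
    """Turn one content chunk into a (question, answer) pair.

    call_llm is any function that takes a prompt string and returns the
    model's text completion; swap in whichever provider client you use.
    """
    raw = call_llm(SYNTH_PROMPT.format(chunk=chunk))
    qa = json.loads(raw)          # expected: {"question": ..., "answer": ...}
    qa["source_chunk"] = chunk    # keep provenance for later source scoring
    return qa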
The combination of public and internal datasets resulted in a comprehensive collection of queries and answers reflecting real-world complexity. On their own, however, these datasets are inert; they need scoring logic to be useful. The next step was to turn these examples into an active monitoring system, where each test run clearly indicates success or failure, defined by metrics, budget constraints, and automated checks established before any experiment begins.
Step 2: Define actionable metrics and rubrics
When evaluating conversational AI system outputs, common metrics like BLEU, ROUGE, METEOR, BERTScore, and embedding cosine similarity are often considered. These offline metrics are well-understood, computationally efficient, and have long been fundamental for natural language processing benchmarking. However, their effectiveness diminishes rapidly when applied to real-world tasks such as retrieving source-cited answers, summarizing internal wikis, or parsing tabular data.
Traditional metrics have limitations:
- BLEU: Excels at exact word overlap but struggles with paraphrasing, fluency, and factuality.
- ROUGE: Strong in recall-heavy matching but weak on source attribution and hallucination detection.
- BERTScore: Good for semantic similarity but lacks granularity for specific errors or citation gaps.
- Embedding similarity: Measures vector-space proximity but often fails to capture faithfulness, formatting, or tone.
These metrics were initially used for quick checks to identify significant model deviations. However, they proved inadequate for ensuring deployment-ready correctness. High ROUGE scores could occur even when sources were omitted, strong BERTScore results might accompany hallucinated file names, and fluent Markdown outputs could still contain factual errors. Such issues are common in production AI deployments. This led to a critical question: Could LLMs themselves be used to grade outputs?
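A toy example makes the pitfall concrete: a simple unigram-recall score (the core idea behind ROUGE-1) can reward an answer that hallucinates a file name more than a faithful paraphrase, because overlapping filler words dominate the score. The strings below are made up for illustration; this is not the production metric code.

def unigram_recall(reference: str, candidate: str) -> float:
    """Fraction of reference words that also appear in the candidate."""
    ref = reference.lower().split()
    cand = set(candidate.lower().split())
    return sum(w in cand for w in ref) / len(ref)

reference = "The Q3 budget is in finance/q3_budget.xlsx as noted in the planning doc."
faithful = "The Q3 budget is in finance/q3_budget.xlsx per the planning doc."
hallucinated = "The Q3 budget is in finance/q3_summary.xlsx as noted in the planning doc."

print(unigram_recall(reference, faithful))      # ~0.83: correct file, paraphrased wording
print(unigram_recall(reference, hallucinated))  # ~0.92: wrong file name, higher score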
The concept of using an LLM as a judge, where one LLM evaluates another, might seem circular but offers substantial flexibility. A judge model can verify factual accuracy against ground truth or context, confirm proper citation for every claim, enforce formatting and tone requirements, and scale across dimensions that traditional metrics overlook. The key insight is that LLMs are often more effective at scoring natural language when the evaluation problem is clearly defined.
Crucially, evaluation rubrics and judge models also require their own assessment and refinement. Changes in prompts, instructions, or even the choice of judge model can influence outcomes. For specific languages or technical domains, specialized models may be necessary to maintain fair and accurate scoring. Thus, evaluating the evaluators became an integral part of the quality assurance process.
LLM-based evaluation was structured like software modules: designed, calibrated, tested, and versioned. A reusable template forms the core, with each evaluation run incorporating the query, the model’s response, available source context, and sometimes a hidden reference answer. The judge prompt guides the evaluation through structured questions, such as:
- Does the answer directly address the query?
- Are all factual claims supported by the provided context?
- Is the answer clear, well-formatted, and consistent in voice?
The judge provides both a justification and a score, which can be scalar or categorical depending on the metric. For instance, a rubric output might appear as follows:
{
  "factual_accuracy": 4,
  "citation_correctness": 1,
  "clarity": 5,
  "formatting": 4,
  "explanation": "The answer was mostly accurate but referenced a source not present in context."
}
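A minimal sketch of what one judge call can look like is shown below. The rubric wording is abridged and call_llm is again a hypothetical stand-in for whichever model client serves as the judge; it is not the actual Dash judge prompt.

import json

JUDGE_PROMPT = """You are grading an AI assistant's answer.

Query: {query}
Context provided to the assistant:
{context}
Assistant answer:
{answer}

Score each dimension from 1 to 5 and explain briefly:
factual_accuracy, citation_correctness, clarity, formatting.
Respond with a single JSON object using those keys plus "explanation".
"""

def judge_answer(query: str, context: str, answer: str, call_llm) -> dict:
    """Run one LLM-as-judge evaluation and return the parsed rubric scores."""
    raw = call_llm(JUDGE_PROMPT.format(query=query, context=context, answer=answer))
    return json.loads(raw)  # e.g. {"factual_accuracy": 4, ..., "explanation": "..."}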
Regular spot-checks involved manually labeling sampled outputs every few weeks. These calibration sets were instrumental in tuning judge prompts, benchmarking human-model agreement rates, and tracking long-term drift. Any divergence in a judge’s behavior from the established gold standard prompted updates to either the prompt or the underlying model.
While LLM judges automated much of the evaluation, human spot-audits remained vital. For each release, engineers manually reviewed 5–10% of the regression suite. Discrepancies were logged and traced to prompt bugs or model hallucinations, with recurring issues leading to prompt rewrites or more granular scoring adjustments.
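Checking a judge against human labels can be as simple as measuring agreement on a calibration sample and flagging the judge for re-tuning when agreement drops. The sketch below assumes the paired scores have already been collected; the numbers are illustrative.

def judge_agreement(human_scores: list[int], judge_scores: list[int], tolerance: int = 0) -> float:
    """Fraction of calibration examples where judge and human agree,
    optionally within a tolerance on the 1-5 scale."""
    assert len(human_scores) == len(judge_scores)
    hits = sum(abs(h - j) <= tolerance for h, j in zip(human_scores, judge_scores))
    return hits / len(human_scores)

# Example: factual_accuracy scores on a small calibration set
humans = [5, 4, 2, 5, 3, 1]
judge = [5, 4, 3, 5, 3, 2]
print(judge_agreement(humans, judge))               # exact agreement: ~0.67
print(judge_agreement(humans, judge, tolerance=1))  # within one point: 1.0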
To ensure enforceability, three types of metrics were defined, each with a distinct role in the development pipeline:
- Boolean gates: Examples include “Citations present?” or “Source present?”. These trigger hard failures, preventing changes from proceeding.
- Scalar budgets: Such as “Source F1 ≥ 0.85” or “p95 latency ≤ 5s”. These block deployments if performance thresholds are not met.
- Rubric scores: Covering tone, formatting, and narrative quality. These are logged in dashboards and monitored over time.
Every new model version, retriever setting, or prompt modification was assessed against these criteria. If performance fell below predefined thresholds, the change was halted. To integrate metrics effectively, they were woven into every development stage. Automated fast regression tests ran on every pull request, the full suite of curated datasets ran in staging, and live traffic was continuously sampled and scored in production. Dashboards centralized results, providing clear visibility into key metrics, pass/fail rates, and performance shifts over time.
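In code, enforcing the first two metric types reduces to comparing results against thresholds agreed on before the experiment. The sketch below is a simplified illustration: the function name and result keys are hypothetical, and the budgets mirror the illustrative values above rather than the actual Dash thresholds.

def check_gates(results: dict) -> list[str]:
    """Return human-readable failures; an empty list means the change may proceed."""
    failures = []

    # Boolean gates: hard failures
    if not results["citations_present"]:
        failures.append("Boolean gate failed: citations missing")
    if not results["source_present"]:
        failures.append("Boolean gate failed: source missing")

    # Scalar budgets: block deployment when thresholds are missed
    if results["source_f1"] < 0.85:
        failures.append(f"Source F1 {results['source_f1']:.2f} below budget 0.85")
    if results["p95_latency_s"] > 5.0:
        failures.append(f"p95 latency {results['p95_latency_s']:.1f}s over budget 5s")

    # Rubric scores (tone, formatting, narrative quality) are logged to
    # dashboards rather than gated, so they are not checked here.
    return failures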
This comprehensive setup ensured that the same evaluation logic governed every prompt adjustment and retriever update, leading to consistent, traceable, and reliable quality control.
Step 3: Set up an evaluation platform
After establishing datasets and metrics and completing several build, test, and deploy cycles, the need for greater structural organization became apparent. Managing disparate artifacts and experiments proved unsustainable. This led to the adoption of Braintrust, an evaluation platform that streamlined workflows by centralizing the management of datasets, scorers, experiments, automation, tracing, and monitoring.
The platform provided four core capabilities. First, a central store offered a unified, versioned repository for datasets and experiment outputs. Second, an experiment API defined each run by its dataset, endpoint, parameters, and scorers, generating an immutable run ID. (Lightweight wrappers were developed to simplify run management.) Third, dashboards facilitated side-by-side comparisons, instantly highlighting regressions and quantifying trade-offs in latency, quality, and cost. Finally, trace-level debugging allowed for single-click access to retrieval hits, prompt payloads, generated answers, and judge critiques.
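The lightweight wrappers amounted to little more than a declarative description of a run. The sketch below is a generic illustration of that idea, not the Braintrust SDK itself; the dataset name, parameter keys, and submit_run helper are hypothetical.

from dataclasses import dataclass, field

@dataclass(frozen=True)
class EvalRun:
    """Declarative description of one evaluation run."""
    dataset: str                                    # versioned dataset name
    endpoint: str                                   # system under test
    parameters: dict = field(default_factory=dict)  # retriever/prompt/model settings
    scorers: tuple = ()                             # judge prompts or metric functions

run = EvalRun(
    dataset="dash_regression_v12",
    endpoint="staging",
    parameters={"prompt_hash": "abc123", "model": "model-a"},
    scorers=("factual_accuracy", "citation_correctness"),
)
# run_id = submit_run(run)  # hypothetical helper returning an immutable run ID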
While spreadsheets sufficed for initial demonstrations, they quickly became inadequate for serious experimentation. Results were fragmented, difficult to reproduce, and challenging to compare. When different individuals ran the same test with minor variations in prompts or model versions, tracking changes and their rationale was nearly impossible. A more structured, shared environment was necessary where every run was versioned, results were reproducible, and regressions were automatically identified. An evaluation platform provided this, enabling collaborative reproduction, comparison, and debugging without hindering progress.
Step 4: Automate evaluation in the dev‑to‑prod pipeline
Prompts, context selection settings, and model choices were treated as application code, subject to the same automated checks:
- Pull request opened: a GitHub Action runs roughly 150 canonical queries through the automated judges, returning results in under ten minutes; any threshold miss blocks the merge.
- Pull request merged: a GitHub Action re-runs the canonical suite plus quick smoke checks for latency and cost; any red-line miss blocks the change from rolling out further.
These canonical queries, though few, were carefully selected to cover critical scenarios like multiple document connectors, “no-answer” cases, and non-English queries. Each test meticulously recorded the retriever version, prompt hash, and model choice to ensure reproducibility. If scores fell below a threshold—for instance, due to excessive missing citations—the build was halted. This setup allowed regressions, previously unnoticed until staging, to be caught at the pull-request level.
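Wired into CI, the merge gate reduces to a small script that runs the canonical suite and exits nonzero when a red line is missed, which is what actually blocks the pull request. The eval_client module, run_canonical_suite helper, and threshold values below are hypothetical stand-ins for the real evaluation tooling.

import sys

# Illustrative red lines; the real budgets live in a shared config
THRESHOLDS = {"source_f1": 0.85, "answer_correctness": 0.90}

def main() -> int:
    # Hypothetical helper: runs the ~150 canonical queries and returns
    # aggregate metrics keyed by name.
    from eval_client import run_canonical_suite  # assumed internal module

    results = run_canonical_suite()
    failed = {
        name: (results[name], floor)
        for name, floor in THRESHOLDS.items()
        if results[name] < floor
    }
    for name, (value, floor) in failed.items():
        print(f"EVAL GATE FAILED: {name}={value:.3f} < {floor}")
    return 1 if failed else 0  # nonzero exit status blocks the merge in CI

if __name__ == "__main__":
    sys.exit(main())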
On-demand synthetic sweeps
Large refactors or engine updates could conceal subtle regressions, necessitating end-to-end evaluation sweeps for early detection. These sweeps commenced with a golden dataset and could be dispatched as a Kubeflow DAG, executing hundreds of requests in parallel. (A Kubeflow DAG is a workflow within Kubeflow Pipelines, an open-source ML platform, where steps are organized as a directed acyclic graph.) Each run was logged with a unique run_id for easy comparison against the last accepted baseline.
Focus was placed on RAG-specific metrics such as binary answer correctness, completeness, source F1 (an F1 score for retrieved sources, balancing precision and recall), and source recall. Any deviation beyond predefined thresholds was automatically flagged. LLMOps tools then enabled slicing traces by retrieval quality, prompt version, or model settings, helping to pinpoint the exact stage of change for remediation before reaching staging.
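Source F1 itself is an ordinary F1 score computed over the sets of retrieved and expected sources, as in the minimal sketch below.

def source_f1(retrieved: set[str], expected: set[str]) -> float:
    """F1 over source documents: balances precision (no spurious sources)
    against recall (no missing sources)."""
    if not retrieved or not expected:
        return 0.0
    true_positives = len(retrieved & expected)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(retrieved)
    recall = true_positives / len(expected)
    return 2 * precision * recall / (precision + recall)

print(source_f1({"doc_a", "doc_b", "doc_c"}, {"doc_a", "doc_b"}))  # 0.8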
Live-traffic scoring
While offline evaluation is crucial, real user queries provide the ultimate test. To detect silent degradations promptly, live production traffic was continuously sampled and scored using the same metrics and logic as offline suites. (All of this work adheres to Dropbox's AI principles.) Each response, along with its context and retrieval trace, was logged and processed through automated judgment, measuring accuracy, completeness, citation fidelity, and latency in near real time.
Dashboards, accessible to both engineering and product teams, tracked rolling quality and performance medians over various intervals (e.g., one-hour, six-hour, 24-hour). If metrics deviated beyond a set threshold, such as a sudden drop in source F1 or a latency spike, immediate alerts were triggered, allowing teams to respond before users were impacted. Since scoring ran asynchronously in parallel with user requests, production traffic experienced no added latency. This real-time feedback loop facilitated quick detection of subtle issues, bridging the gap between code and user experience, and maintaining system reliability as it evolved.
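Conceptually, the live-traffic loop samples a small fraction of requests and hands them to the judges off the request path, so scoring never adds user-facing latency. The sketch below shows the shape of that flow; the sample rate and the enqueue_for_scoring helper are hypothetical.

import random

SAMPLE_RATE = 0.02  # score ~2% of production traffic (illustrative rate)

def maybe_sample_for_scoring(query: str, answer: str, trace: dict, enqueue_for_scoring) -> None:
    """Called after the response has already been returned to the user.

    enqueue_for_scoring is a hypothetical helper that pushes the record onto
    an async queue where the LLM judges and latency checks run later."""
    if random.random() < SAMPLE_RATE:
        enqueue_for_scoring({
            "query": query,
            "answer": answer,
            "retrieval_trace": trace,  # sources, prompt payload, model settings
        })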
Layered gates
To manage risk throughout the pipeline, layered gates were implemented, progressively tightening requirements and aligning the evaluation environment with real-world usage. The merge gate executed curated regression tests on every change, allowing only those meeting baseline quality and performance to pass. The stage gate expanded coverage to larger, more diverse datasets with stricter thresholds, checking for rare edge cases. Finally, the production gate continuously sampled and scored real traffic to identify issues emerging only at scale. If metrics fell below thresholds, automated alerts were triggered, and rollbacks could be initiated immediately.
By gradually increasing dataset size and realism at each gate, regressions were blocked early, ensuring that staging and production evaluations remained closely aligned with real-world behavior.
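One way to express the progressive tightening is a small threshold table keyed by gate, with later gates covering larger datasets and demanding more. The dataset names and numbers below are illustrative only, not the actual Dash budgets.

# Illustrative only: datasets grow and thresholds tighten at each gate
GATES = {
    "merge":      {"dataset": "canonical_150",   "source_f1": 0.85, "p95_latency_s": 5.0},
    "stage":      {"dataset": "full_regression", "source_f1": 0.88, "p95_latency_s": 5.0},
    "production": {"dataset": "live_sample",     "source_f1": 0.90, "p95_latency_s": 5.0},
}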
Step 5: Close the loop with continuous improvement
Evaluation is not a finite phase but a continuous feedback loop. A system that learns from its errors evolves more rapidly than any predetermined roadmap. While gates and live-traffic scoring provide essential safeguards, resilient AI systems require evaluation to also drive continuous learning. Every low-scoring output, unstable regression, or drifted metric is not merely a red flag but an opportunity for end-to-end system improvement, initiating the next cycle of refinement.
Each poorly scored query offers valuable lessons. By analyzing low-rated traces from live traffic, failure patterns often missed by synthetic datasets were uncovered. These included retrieval gaps for rare file formats, prompts truncated by context windows, inconsistent tone in multilingual inputs, or hallucinations triggered by underspecified queries. Such “hard negatives” were directly incorporated into subsequent dataset iterations; some became labeled examples in the regression suite, while others generated new variants for synthetic sweeps. This process fostered a virtuous cycle, stress-testing the system precisely on the edge cases where it previously failed.
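Folding a low-scoring trace back into the datasets can be as mechanical as converting it into a labeled regression example. The sketch below illustrates the idea; the trace and label fields are a hypothetical schema, not the actual Dash format.

def trace_to_regression_case(trace: dict, human_label: dict) -> dict:
    """Convert a low-scoring production trace into a labeled example
    for the next iteration of the regression suite."""
    return {
        "query": trace["query"],
        "expected_sources": human_label["correct_sources"],
        "reference_answer": human_label["corrected_answer"],
        "failure_mode": human_label["failure_mode"],  # e.g. "truncated context"
        "origin": "hard_negative",                    # tag for targeted synthetic sweeps
    }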
Not all changes were immediately suitable for production gates, especially riskier experiments involving new chunking policies, reranking models, or tool-calling approaches. To safely explore these, a structured A/B playground was developed, allowing teams to conduct controlled experiments against consistent baselines. Inputs included golden datasets, user cohorts, or synthetic clusters. Variants encompassed different retrieval methods, prompt styles, or model configurations. Outputs covered trace comparisons, judge scores, and latency and cost budgets. This secure environment enabled tweaks to prove their value, or fail quickly, without impacting production bandwidth.
LLM pipelines are multi-stage systems, and diagnosing answer failures through guesswork proved costly. To accelerate debugging, playbooks were developed to guide engineers directly to the probable cause. If a document was not retrieved, retrieval logs were checked. If context was included but ignored, prompt structure and truncation risks were reviewed. If an answer failed due to mis-scoring by the judge, it was re-run against the calibration set and human labels. These playbooks became integral to triage, ensuring regressions were systematically traced rather than debated.
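In spirit, such a playbook is just a decision tree over trace fields. The sketch below captures that routing logic with hypothetical field names; the real playbooks are richer documents, not a single function.

def triage(failure: dict) -> str:
    """Route an evaluation failure to the most likely stage (hypothetical trace fields)."""
    if not failure["document_retrieved"]:
        return "check retrieval logs and index coverage"
    if failure["context_included"] and failure["context_ignored"]:
        return "review prompt structure and truncation risk"
    if failure["judge_disagrees_with_human"]:
        return "re-run the judge against the calibration set and human labels"
    return "escalate for manual trace review"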
Finally, the cultural aspect: Evaluation was not confined to a single team but integrated into daily engineering practice. Every feature pull request was linked to evaluation runs. On-call rotations included dashboards and alert thresholds. All negative feedback was triaged and reviewed. Engineers were made responsible for the quality impact of their changes, not just correctness. While speed is important in product delivery, the cost of errors can be high. Predictability is achieved through guardrails, and evaluation serves as these guardrails.
What was learned
Initial prototypes relied on available evaluation data, which sufficed for quick demonstrations. However, when real users began posing genuine questions, system vulnerabilities became apparent.
Minor prompt adjustments frequently led to unexpected regressions. Product managers and engineers often engaged in subjective debates about answer quality. Critically, issues bypassed staging and entered production due to inadequate detection mechanisms.
The solution was not increased effort but structured processes. A central repository for datasets was established, and every change underwent the same Braintrust-powered evaluation workflows. Automated checks served as the primary defense, identifying missing citations or formatting errors before code merges. Shared dashboards replaced subjective discussions with objective data, visible to both engineering and product teams.
A significant discovery was that many regressions originated not from model changes but from prompt modifications. Even a single word alteration in an instruction could severely impact citation accuracy or formatting. Formal gates, rather than manual review, proved to be the only dependable safety net. It was also learned that judge models and rubrics are not static assets; their prompts require versioning, testing, and recalibration. For evaluations in other languages or specialized technical domains, a dedicated judge model was often essential for fair and accurate scoring.
The key insight is that evaluation is not a secondary aspect of development. By applying the same rigor to the evaluation stack as to production code, development can proceed faster, safer, and with significantly fewer unexpected issues.
While the current system effectively identifies regressions and maintains quality, the next step involves shifting evaluation from purely protective to proactive. This entails moving beyond mere accuracy to measure aspects like user satisfaction, task success, and answer confidence. It also involves developing self-healing pipelines that can suggest fixes when metrics decline, thereby shortening the debugging cycle. Furthermore, coverage needs to extend beyond text to include images, audio, and low-resource languages, ensuring evaluation accurately reflects diverse work practices.
The objective is straightforward: continuously elevate evaluation standards so it not only safeguards the product but actively drives its advancement. By treating evaluation as a primary discipline—supported by rigorous datasets, actionable metrics, and automated gates—probabilistic LLMs can be transformed into reliable products.