According to our internal research, data + AI teams are quickly pushing AI agents into production. As of March this year (2025):
- 40% are in the production stage (30% just got there)
- 40% are in the semi or pre-production stage
- 20% are in the experimentation stage
Unsurprisingly, the need to monitor the output of those agents has risen just as quickly. Traditional data quality monitoring methods, designed for more deterministic systems and structured outputs, are not always well-suited for this task.
Nature hates a vacuum and so do AI engineers. LLM-as-judge evaluations have quickly emerged to fill this gap as a means of monitoring the fitness of AI outputs. Like everything else in the AI space, the technique has spread like wildfire without being fully understood.
Teams are racing to implement these evaluations while basic questions persist:
- Does it work?
- How should you format an evaluation prompt?
- What are the challenges you will run into?
This guide aims to answer those questions by looking at the most recent academic research as well as the practical experience we’ve gleaned in operating and monitoring our own customer facing AI agents.
We’ll start at the introductory level and dive deeper as we go along. Let’s get into it.
What is LLM-as-judge?
LLM-as-judge is a technique that uses AI to evaluate the fitness of AI outputs.
For example, one agent may be prompted to answer customer support requests while another may be prompted to evaluate those responses across dimensions such as helpfulness or relevance.

This may seem counterintuitive, and don’t get me wrong, there are fundamental challenges inherent in these evaluations, but the genesis of the idea is that each agent has a different set of motivations.
One is instructed to be helpful in completing a task (and as many of us know AI can sometimes be too eager to please), while the other is told to provide a critical assessment of the other’s work.
And yes, I know what you are thinking: how can you monitor a system that occasionally hallucinates with another system that occasionally hallucinates? Do you need to evaluate your evaluators? Don’t worry, we’ll cover that and other key concepts.
Why implement LLM-as-judge?
No monitoring approach is perfect. LLM-as-judge evaluations have their fair share of trade-offs that require expertise from AI engineers and others to navigate (I promise we’ll get there!).
But as one wise senior director of data science services at an event production company told me, “It’s not a production-grade application unless it’s being monitored.”
LLMs and agents are non-deterministic systems, meaning you can provide the same input and get a slightly different output.

Non-deterministic systems are a bit like slot machines. You pull the same lever, but the outcome is uncertain.
That means there is not always a practical way to set and test for an expected output, particularly when the goal is to evaluate blocks of text across subjective dimensions such as relevancy, clarity, helpfulness, prompt adherence, and so on.
Prior to the emergence of LLMs, the quality of natural language responses was evaluated using heuristics or mathematical metrics such as ROUGE, BLEU, and cosine similarity. While these evaluations are explainable and deterministic, they unfortunately do a poor job of identifying unfit AI outputs in practice.
For example, ROUGE measures the recall, or overlap, between a response and a source text, and it fails miserably in situations where the response is meaningfully the same but uses different words.
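To make that failure mode concrete, here is a minimal sketch of ROUGE-1-style unigram recall (illustrative only, not the official ROUGE implementation). A paraphrase with the same meaning scores far lower than a near-verbatim response:

```python
from collections import Counter

def rouge1_recall(reference: str, candidate: str) -> float:
    """Fraction of reference unigrams that also appear in the candidate."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    overlap = sum(min(ref[w], cand[w]) for w in ref)
    return overlap / max(sum(ref.values()), 1)

reference = "The refund was issued to your original payment method"
literal = "The refund was issued to your original payment method today"
paraphrase = "Your money has been returned to the card you paid with"

print(rouge1_recall(reference, literal))               # 1.0
print(round(rouge1_recall(reference, paraphrase), 2))  # 0.33
```

Both candidates are acceptable answers, but word-overlap metrics only reward the first one.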
Most AI engineers, including our team at Monte Carlo and the team at Dropbox, quickly abandon these approaches after a brief flirtation. Academic research also shows these approaches performing poorly.
“Conventional reference-based metrics, such as BLEU and ROUGE, have been shown to have relatively low correlation with human judgments, especially for tasks that require creativity and diversity.” –G-Eval (Liu et al., EMNLP 2023)
Simply put, LLM-as-judge is one of the only practical, scalable approaches for understanding the underlying meaning of a non-deterministic response and evaluating its fitness.
When not to implement LLM-as-judge
It’s important to understand LLM-as-judge is merely a tool in your monitoring belt and you still need to use the right tool for the right job.
There are cases where deterministic, code-based monitors can be effective in evaluating AI: for example, when the use case and system prompt dictate a certain format. If a response should only be so long, or structured in a very specific way, code-based monitors are often the best tool for the job.
For example, it’s common to instruct agents to produce output in JSON format when they need to interact with other IT systems or sub-agents. Another example is a pharmaceutical customer of ours that is using AI to enrich their customer database; all outputs must use a valid US postal code format.

A code-based monitor to ensure an output is a valid US zip code.
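A minimal sketch of such a code-based monitor, assuming outputs should match the 5-digit ZIP or ZIP+4 format:

```python
import re

# Deterministic check: 5-digit ZIP, optionally followed by a 4-digit extension
# (e.g. "94105" or "94105-1804").
ZIP_RE = re.compile(r"\d{5}(-\d{4})?")

def is_valid_zip(value: str) -> bool:
    """Return True only if the whole string is a valid US zip code."""
    return bool(ZIP_RE.fullmatch(value.strip()))

print(is_valid_zip("94105"))       # True
print(is_valid_zip("94105-1804"))  # True
print(is_valid_zip("9410"))        # False
```

Unlike an LLM judge, this check is free, instant, and gives the same answer every time.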
More traditional code-based monitors can also be effective for simple binary conditions: certain words that should never be used, or a rule that every response must have a corresponding citation.
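A sketch of such binary checks; the banned words and the "[1]"-style citation marker here are hypothetical policies for illustration:

```python
import re

# Hypothetical policy: these words must never appear in a response.
BANNED = {"guarantee", "lawsuit"}

def passes_binary_checks(response: str) -> bool:
    """Fail on banned words; require at least one [n]-style citation."""
    words = set(re.findall(r"[a-z]+", response.lower()))
    has_banned = bool(words & BANNED)
    has_citation = bool(re.search(r"\[\d+\]", response))
    return not has_banned and has_citation

print(passes_binary_checks("Returns are processed in 5 days [1]."))  # True
print(passes_binary_checks("We guarantee a refund [1]."))            # False
```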
Finally, the other scenario where an LLM-as-judge approach may not be appropriate is when you are evaluating AI in development at a small scale. In this case you could leverage human annotators if you wish, although you may want to start on a more automated evaluation suite for development and your CI/CD process sooner rather than later.
But here is the most important question.
Does LLM-as-judge actually work?
While there is research to support both sides, the emerging consensus is that LLM-as-judge is not infallible, but can be used to spot degradations over time when leveraging best practices.
Our own hands-on experience in evaluating customer-facing AI agents broadly reflects the same conclusion. Individual evaluations can be flaky at times, but when smoothed and monitored over time with anomaly detection, LLM-as-judge evaluations are a valid means for detecting and resolving issues that have led to a decline in output quality.
A real incident caught by LLM-as-judge

An LLM-as-judge evaluation monitor catches an issue with our Monitoring Agent.
Here is a very recent example of LLM-as-judge evaluations successfully catching an AI reliability incident.
For context, Monte Carlo’s Monitoring Agent leverages information about a customer’s data landscape (data profile, lineage, metadata, etc.) to provide sophisticated monitoring suggestions for specific tables. Our team has an LLM-as-judge “prompt adherence,” or “completion score,” monitor that alerts when the Monitoring Agent produces an output that does not follow the instructions it is given.
The Monitoring Agent generates many different types of monitor recommendations. The specific task within the Monitoring Agent is designed to provide recommendations specifically for cross-field rules. An example might be that one timestamp field must always be more recent than the other, or that the value in field X is always greater than the value in field Y.
As you can see in the image above, this task provided a valid monitor but it was a simple “alert when id field is null” recommendation. This is exactly the type of issue that would go unnoticed and unreported, but could impact the real and perceived value (and thus adoption) of the agent over time.
7 LLM-as-judge best practices
That being said, it is easier than you think to get LLM-as-judge evaluations wrong. The consequences are also much more severe than you would initially expect.
There is not only a high cost associated with wasted time and compute, but the impact on trust can slow innovation and time-to-market. In some industries that can play a role in the overall organization’s future viability.
The 2024 paper, “A Survey on LLM-as-a-Judge” (Gu, Jiawei, et al.), does an excellent job of summarizing the canonical research underpinning many of the accepted LLM-as-judge best practices leveraged by AI engineers today.
We’ll reference a select number of those best practices that are easiest to implement. While strategies like providing iterative feedback or hierarchical evaluations can be helpful, they aren’t always practical to deploy in production where the scale is massive and the inputs aren’t entirely predictable.
Few-shot prompting
Few-shot prompting involves providing one or more examples of what good or bad outputs look like within the prompt. The easiest way to remember this is just to replace the word shot with the word example in your head.
More examples don’t always mean better performance, however. In a paper focused on the effectiveness of LLMs evaluating code, researchers found that all major models performed better with one shot, but experienced declines when more examples were included.

Comparative F-1 scores of different models on code correctness assessments Source: CodeJudge-Eval: Can Large Language Models be Good Judges in Code Understanding?
Here is an example of a few-shot prompt evaluation:
You are an expert evaluator of response relevance.
Rate each answer 1-5 for relevance.
## Example 1:
User: Can I get a refund for my headphones?
Agent: Yes, your headphone purchase was made within the return window.
Score: 5
Step decomposition
Step decomposition involves helping your LLM-as-judge make big subjective decisions by providing smaller criteria and reasoning steps.
This is a best practice we have found helpful in our own AI monitoring endeavors, which is why it’s included within the pre-built evaluation templates within our platform. Here is an example of step decomposition using our “Answer Relevance” template.

Criteria decomposition
LLMs are much more effective when given clear, single-objective tasks. Criteria decomposition is a fancy term for having each evaluation monitor a single criterion. For example, within the Monte Carlo platform there is a template for evaluating relevancy and one for evaluating clarity, but these are not combined into a single evaluation template. Don’t confuse your judges.
Evaluation template (grading rubric)
G-Eval (Liu et al., EMNLP 2023) is among the most cited papers detailing this technique which involves providing a scoring scale and rubric for your judge. The researchers asked LLMs to generate a chain of thought of detailed evaluation steps by only feeding the task introduction and evaluation criteria as a prompt.
This is a best practice we have seen be effective in our own evaluation efforts as well. If we return to our “Answer Relevance” evaluation template we can see this technique is included.

“In general, scores that are floats are not great. LLM-as-judge does better with a categorical integer scoring scale with a very clear explanation of what each score category means.” – Elor Arieli, Monte Carlo’s AI engineering manager
Constrain to structured outputs
Here’s the thing about English, or any natural language: it can be filled with ambiguity and multiple meanings. That’s beautiful if you are a poet, but confusing if you are an LLM judge.
Constraining some agent steps or spans to structured outputs (JSON is the most common) can be helpful for the LLM judge, as it removes ambiguity and allows for a more standardized evaluation.
There are some synergies to this strategy as well because many agent outputs need to be structured as JSON when the agent needs to interact or query a tool (rather than engage with a human).
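The same idea can be applied to the judge itself: require it to answer in JSON and validate the result in code. The score/explanation schema below is a hypothetical example, not a fixed standard:

```python
import json

# Hypothetical schema the judge is instructed to follow.
REQUIRED_FIELDS = {"score", "explanation"}

def parse_judge_output(completion: str) -> dict:
    """Validate a judge completion: well-formed JSON, required fields, 1-5 integer score."""
    data = json.loads(completion)  # raises ValueError on malformed JSON
    missing = REQUIRED_FIELDS - data.keys()
    if missing:
        raise ValueError(f"judge output missing fields: {missing}")
    if not isinstance(data["score"], int) or not 1 <= data["score"] <= 5:
        raise ValueError("score must be an integer from 1 to 5")
    return data

result = parse_judge_output('{"score": 4, "explanation": "Mostly relevant."}')
print(result["score"])  # 4
```

Any completion that fails these checks can be rerouted or rerun before it ever reaches a dashboard.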
Looking back at the real incident we caught using LLM-as-judge monitors via our agent observability platform, we can see that both the prompts (inputs) and completions (outputs) are structured in JSON.

Provide explanations
Chain of thought and providing explanations have been among the most explored strategies for LLM evaluations, so we won’t go into too much detail here. The basic concept is pretty simple: have the LLM-as-judge explain why it gave a certain score.
CLAIRE and FLEUR, two frameworks that evaluate an LLM’s ability to caption images, both use this best practice. It is also a strategy leveraged by Monte Carlo’s own evaluation framework.
In addition to helping standardize scores, explanations can also expedite human understanding of alerts. For Monte Carlo, the LLM judge accurately explained that the problem wasn’t with how the monitor was formatted or its validity, but rather that it was a single-field rule instead of the multi-field rule intended.

LLM as judge evaluation with explanation for failure.
Score smoothing
Score smoothing is the process of taking raw scores (1-5) and reducing their random fluctuations. The core idea is that AI hallucinates, and it can be more helpful to pay attention to the broader signal than to the noise.
There is a tradeoff, however: you can miss the very behaviors you are seeking to catch and correct with the LLM-as-judge monitors in the first place.
Monte Carlo’s internal data + AI team uses a slightly different strategy to account for the occasional evaluation hallucination. When enough “soft failures” occur the evaluation is automatically re-run and if those same failures occur the second time the team investigates.
Monte Carlo’s Agent Observability platform also provides some flexibility in this area. Users can set a hard threshold or use anomaly detection to catch the responses that are way outside the norm.
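For illustration, here is a minimal moving-average smoother over raw 1-5 judge scores; the window size is an assumption you would tune. A single hallucinated low score barely moves the smoothed line, while a sustained decline still shows through:

```python
from collections import deque

def moving_average(scores, window=5):
    """Smooth raw scores with a simple trailing moving average."""
    buf, out = deque(maxlen=window), []
    for s in scores:
        buf.append(s)
        out.append(sum(buf) / len(buf))
    return out

raw = [5, 5, 1, 5, 5, 5, 2, 2, 2, 2]  # one fluke, then a real decline
sm = moving_average(raw)
print(round(sm[2], 2), sm[-1])  # 3.67 2.6
```

The fluke at position 3 stays well above a failure threshold like 3.0, but the sustained drop at the end falls below it.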
Example LLM-as-Judge Prompts & Evaluation Templates
Here are two LLM-as-Judge evaluation templates that you may find handy: answer relevance and task completion.
Other interesting evaluation criteria may include helpfulness, clarity, language match and tool usage.
Answer Relevance
You are an expert evaluator tasked with assessing how well an LLM output addresses its input.
## Evaluation Criteria:
1. Analyze the input to understand what is being asked or requested
2. Examine the output to see what information is provided
3. Determine if the output directly addresses the input
4. Check for irrelevant or off-topic information in the output
5. Assess completeness - does the output answer all aspects of the input?
6. Consider conciseness - is the output appropriately focused?
## Input:
{{prompts}}
## Output:
{{completions}}
## Evaluation Instructions:
Evaluate how well the output addresses the input by analyzing the relevance of the response content. Assign a score from 1 to 5 where:
- 5 = Output perfectly addresses the input with all content being relevant
- 4 = Output mostly addresses the input with minor irrelevant details
- 3 = Output partially addresses the input with some irrelevant content
- 2 = Output barely addresses the input, mostly irrelevant
- 1 = Output does not address the input at all
Task Completion
You are an expert evaluator tasked with assessing task completion in LLM outputs.
## Evaluation Criteria:
1. Identify the specific task requested in the input
2. Determine all requirements and constraints mentioned
3. Check if the output fulfills each requirement
4. Verify the output format matches any specified format
5. Assess completeness - are all parts of the task done?
6. Validate the quality of task execution
## Input:
{{prompts}}
## Output:
{{completions}}
## Evaluation Instructions:
Evaluate whether the output successfully completes the requested task. Assign a score from 1 to 5 where:
- 5 = Task fully completed with all requirements met
- 4 = Task mostly completed with minor omissions
- 3 = Task partially completed with significant gaps
- 2 = Task barely attempted with major failures
- 1 = Task not completed or attempted
Challenges: Why agent observability platforms are useful
Unfortunately, monitoring AI and agents is not always as easy as rolling out a few natural language prompts.
There are significant challenges that arise when using a manual LLM evaluation framework or one built into platforms such as Bedrock, MLflow, and others. The most significant include cost, non-deterministic scoring, and root cause analysis/incident management.
Challenge #1- Evaluation cost
LLM workloads aren’t cheap, and a single agent session can involve hundreds of LLM calls. Now imagine that for each of those calls you are also calling another LLM multiple times to judge different quality dimensions. It can add up quickly.
One data + AI leader confessed to us that their evaluation cost was 10 times as expensive as the baseline agent workload. Monte Carlo’s agent development team strives to maintain roughly a one-to-one workload-to-evaluation ratio.
Best practices to contain evaluation cost
Most teams will sample a percentage or aggregate number of spans per trace to manage costs while still retaining the ability to detect degradations in performance. Stratified sampling, or sampling a representative portion of the data, can be helpful in this regard. Conversely, it can also be helpful to filter for specific spans, such as those with a longer-than-average duration.
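A minimal sketch of stratified sampling over spans; the span_type field and the 10% fraction are hypothetical choices for illustration:

```python
import random

def stratified_sample(spans, key="span_type", fraction=0.1, seed=42):
    """Sample a fixed fraction of spans from each stratum (at least one each)."""
    rng = random.Random(seed)
    groups = {}
    for span in spans:
        groups.setdefault(span[key], []).append(span)
    sample = []
    for group in groups.values():
        k = max(1, round(len(group) * fraction))
        sample.extend(rng.sample(group, k))
    return sample

# 90 LLM-call spans and 10 tool-call spans; plain random sampling at 10%
# could easily miss the rare tool calls entirely.
spans = [{"span_type": "llm_call", "id": i} for i in range(90)] + \
        [{"span_type": "tool_call", "id": i} for i in range(10)]
picked = stratified_sample(spans)
print(len(picked))  # 10
```

Every stratum is guaranteed representation, so degradations in rare span types remain detectable at a tenth of the evaluation cost.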

Challenge #2- Defining failure and alert conditions
Even when teams have all the right telemetry and evaluation infrastructure in place, deciding what actually constitutes “failure” turns out to be surprisingly difficult.
To start, defining failure requires being deeply familiar with the agent’s use case and user expectations. A customer support bot, a sales assistant, and a research summarizer all have different standards for what counts as “good enough.”
What’s more, the relationship between a bad response and its real-world impact on adoption isn’t always linear or obvious. For example, if an evaluation model judges a response to be a 0.75 for clarity, is that a failure?
Best practices for defining failure and alert conditions
Aggregate multiple evaluation dimensions. Rather than declaring a failure based on a single score, combine several key metrics – such as helpfulness, accuracy, faithfulness, and clarity – and treat them as a composite pass/fail test. This is the approach Monte Carlo takes in our agent evaluation framework for our internal agents.
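A minimal sketch of such a composite pass/fail check; the dimensions and thresholds below are hypothetical, not Monte Carlo’s actual values:

```python
# Hypothetical per-dimension minimum scores (1-5 scale).
THRESHOLDS = {"helpfulness": 3, "accuracy": 4, "faithfulness": 4, "clarity": 3}

def composite_pass(scores: dict) -> bool:
    """Pass only if every dimension meets its threshold; missing dimensions fail."""
    return all(scores.get(dim, 0) >= t for dim, t in THRESHOLDS.items())

print(composite_pass(
    {"helpfulness": 5, "accuracy": 4, "faithfulness": 4, "clarity": 4}))  # True
print(composite_pass(
    {"helpfulness": 5, "accuracy": 3, "faithfulness": 5, "clarity": 5}))  # False
```

Requiring every dimension to clear its own bar prevents one strong score from masking a weak one.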
Most teams will also leverage anomaly detection to identify a consistent drop in scores over a period of time rather than a single (possibly hallucinated) evaluation. Dropbox for example leverages dashboards that track their evaluation score trends over hour, six-hour, and daily intervals.
Finally, know what monitors are “soft” and what monitors are “hard.” There are some monitors that should immediately trigger an alert condition when their threshold is breached. Typically these are more deterministic monitors evaluating an operational metric such as latency or a system failure.
Challenge #3- Flaky evaluations
Who evaluates the evaluators? Using a system that can hallucinate to monitor a system that can hallucinate has obvious drawbacks.
The other challenge for creating valid evaluations is that, as every single person who has put an agent into production has bemoaned to me, small changes to the prompt have a large impact on the outcome. This means creating customized evaluations or experimenting with evaluations can be difficult.
Best practices for avoiding flaky evaluations:
Most teams avoid flaky tests or evaluations by testing extensively in staging on golden datasets with known input-output pairs. This will typically include representative queries that have proved problematic in the past.
It is also a common practice to test evaluations in production on a small sample of real-world traces with a human in the loop.
Of course, LLM judges will still occasionally hallucinate. Or as one data scientist put it to me, “one in every ten tests spits out absolute garbage.” He automatically reruns evaluations with low scores to confirm issues.
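That rerun-to-confirm pattern can be sketched in a few lines; judge_fn here stands in for a real LLM judge call:

```python
def confirmed_failure(judge_fn, sample, threshold=2, retries=1):
    """Flag a failure only if a low score reproduces on rerun."""
    if judge_fn(sample) > threshold:
        return False  # first score is fine
    for _ in range(retries):
        if judge_fn(sample) > threshold:
            return False  # low score didn't reproduce: likely a flaky judge
    return True  # low score reproduced: worth investigating

scores = iter([1, 5])  # first run is a fluke, rerun passes
print(confirmed_failure(lambda s: next(scores), "trace-123"))  # False
```

The extra judge calls only fire on suspected failures, so the cost overhead is small relative to the noise it filters out.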

Challenge #4- Visibility across the data + AI lifecycle
Of course once a monitor sends an alert the immediate next question is always: “why did that fail?” Getting the answer isn’t easy! Agents are highly complex, interdependent systems.
Finding the root cause requires end-to-end visibility across the four components that introduce reliability issues into a data + AI system: data, systems, code, and model. Here are some examples:
Data
- Real-world changes and input drift. For example, if a company enters a new market and more users now speak Spanish than English, performance can suffer if that differs from the language the model was primarily trained on.
- Unavailable context. We recently wrote about an issue where the model was working as intended but the context on the root cause (in this case a list of recent pull requests made on table queries) was missing.
System
- Pipeline or job failures
- Any change to what tools are provided to the agent or changes in the tools themselves.
- Changes to how the agents are orchestrated
Code
- Data transformation issues (changing queries, transformation models)
- Updates to prompts
- Changes impacting how the output is formatted
Model
- The platform updates its model version
- Changes to which model is used for a specific call
Best practices for visibility across the data + AI lifecycle
It is critical to consolidate telemetry from your data + AI systems into a single source of truth, and many teams are choosing the warehouse or lakehouse as their central platform.
This unified view lets teams correlate failures across domains – for example, seeing that a model’s relevancy drop coincided with a schema change in an upstream dataset or an updated model. Monte Carlo’s own approach is to consolidate traces, evaluations, and metadata in one place to make cross-component debugging faster and easier.
You can’t manage AI if you aren’t observing it
If there’s one takeaway from our experience, it’s this: you can’t manage what you don’t measure, and that includes your AI.
LLM-as-judge is quickly becoming the most practical way to know whether your AI features and agents are doing what they’re supposed to, and more importantly, when they aren’t.
When combined with good AI engineering practices like structured prompts, clear rubrics, and score smoothing, it gives data and AI teams a reliable early-warning system for degradation before customers ever notice.
For teams moving from ad-hoc pilots to creating the necessary infrastructure to leverage agents at scale, agent observability is the key that will give your internal teams and external consumers the trust they need to launch and disrupt with confidence.
Because at the end of the day, putting AI in production isn’t just about building something smart; it’s about keeping it reliable and trustworthy long after launch.
The post 7 LLM-As-Judge Best Practices From Research & Experience appeared first on Datafloq.
