4 minute read

Chat Metrics for Enterprise-Scale Retrieval-Augmented Generation


Ensuring the efficiency and accuracy of your retrieval-augmented generation (RAG) system is paramount. So you’ve put the work in — hosted an LLM inference server, created a vector database and document store, built a document ingestion pipeline, and connected it all to a user interface. Your RAG workflow is running smoothly: accepting user queries, returning contextual documents from your document store, and generating an LLM response. But how do you know that your responses are good? What happens if you change your prompt or model? Is there an improvement or degradation in your end-to-end RAG workflow? 

At Yurts, we know deploying an effective RAG pipeline means more than just connecting components. It’s about ensuring every change improves functionality without degrading performance. That’s why our on-premises solutions—whether in private VPCs of major providers like GCP, OCI, and AWS, or even on bare metal—come with comprehensive dashboards to meticulously track your RAG system’s performance. Dive in as we reveal cutting-edge metrics that bring clarity and precision to your chat evaluations.

The three steps of RAG are Ingestion, Retrieval, and Generation. Chat metrics evaluate the performance of retrieval and generation on real-world data in production.

While there are plenty of open-source resources for testing document retrieval against a gold-standard dataset with labeled data (such as the excellent RAGAS package), Yurts saw a major gap in tooling for quantifying the end-to-end chat performance of a RAG system across both its retrieval and generation steps. Instead of testing against a gold-standard dataset, we wanted metrics that evaluate chat performance on real-world customer data in production. Our chat metrics evaluation suite runs over each chat message in our production and development environments, logging the metrics live to our data dashboard.

Each chat metric accepts a data structure containing all the metadata created for a chat message, most importantly the user query, the retrieved context, and the generated response. Each metric then returns a numerical result, typically a score between 0 and 1. Importantly, none of these metrics require labeled data, and we don’t rely on a larger LLM like GPT-4 to judge our RAG performance. They run on real-world data live in production using small embedding and encoder/decoder models.
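To make that concrete, here is a minimal sketch of what such a metric interface could look like. The class and field names are illustrative, not our internal API:

```python
from dataclasses import dataclass, field

@dataclass
class ChatMessageRecord:
    """Metadata captured for a single chat message (illustrative fields)."""
    query: str                                   # the user's question
    context: list[str]                           # retrieved context, split into sentences
    response: str                                # the LLM-generated answer
    extra: dict = field(default_factory=dict)    # any other logged metadata

@dataclass
class MetricResult:
    """One metric evaluation, typically a score in [0, 1]."""
    name: str
    mean: float
    std: float | None = None

def evaluate(record: ChatMessageRecord, metrics) -> list[MetricResult]:
    """Run every metric over one chat message and collect the results."""
    return [metric(record) for metric in metrics]
```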

Below, we describe what some of our metrics compute and how we interpret them.

The Query-Context Agreement metric answers the question, “Is the retrieved context relevant to the user’s query?” We use a cross-encoder model to compute an agreement score between the query and each sentence of the retrieved context, then take the mean and standard deviation of those scores.
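A minimal sketch of this metric, assuming the sentence-transformers CrossEncoder and a publicly available relevance checkpoint (the model name and the sigmoid squashing are illustrative choices, not our production setup):

```python
import numpy as np
from sentence_transformers import CrossEncoder

# Assumed checkpoint; any query/passage relevance cross-encoder works similarly.
_cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def query_context_agreement(query: str, context_sentences: list[str]) -> tuple[float, float]:
    """Score the query against each context sentence and aggregate."""
    pairs = [(query, sentence) for sentence in context_sentences]
    raw = _cross_encoder.predict(pairs)                # one relevance logit per pair
    scores = 1.0 / (1.0 + np.exp(-np.asarray(raw)))    # squash logits into [0, 1]
    return float(scores.mean()), float(scores.std())
```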

The Query-Response Agreement metric determines how relevant the response was to the user query. We use a small LLM to generate questions that can be answered using only the response, then take the mean and standard deviation of similarity scores between embeddings of the user query and the generated questions. The resulting score is a value between 0 and 1 that describes how well the response answers the original user question.
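A simplified version of this metric might look like the following. The question-generation model, the embedding model, and the prompt are all assumptions for illustration; any small instruction-tuned seq2seq model and sentence embedder would fit the same pattern:

```python
import numpy as np
from sentence_transformers import SentenceTransformer
from transformers import pipeline

# Assumed models; swap in whichever small generator and embedder you use.
_embedder = SentenceTransformer("all-MiniLM-L6-v2")
_generator = pipeline("text2text-generation", model="google/flan-t5-base")

def query_response_agreement(query: str, response: str, n_questions: int = 3) -> tuple[float, float]:
    """Generate questions answerable from the response, then compare them to the query."""
    prompt = f"Generate a question that the following text answers:\n\n{response}"
    questions = []
    for _ in range(n_questions):
        out = _generator(prompt, do_sample=True, max_new_tokens=64)
        questions.append(out[0]["generated_text"])

    query_vec = _embedder.encode([query], normalize_embeddings=True)
    question_vecs = _embedder.encode(questions, normalize_embeddings=True)
    sims = question_vecs @ query_vec.T    # cosine similarity, since embeddings are unit-norm
    return float(sims.mean()), float(sims.std())
```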

We use the Response-Context Agreement metric to evaluate how well the generated response used the context in its answer. We don’t look for exact matches between the context and the generation since we want the LLM to paraphrase or synthesize information if that would make a better response. Instead, this metric computes a pairwise cosine similarity matrix between embeddings of the response and context sentences and then uses the Hungarian algorithm to find the optimal bipartite matching (i.e., the optimal one-to-one mapping between response and context sentences). The idea is that each sentence of context should approximately map to one sentence in the response.
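The matching step is a few lines with SciPy’s linear_sum_assignment. In this sketch the embedding model is an assumption, and averaging the matched similarities is one reasonable way to reduce the matching to a single score:

```python
from scipy.optimize import linear_sum_assignment
from sentence_transformers import SentenceTransformer

_embedder = SentenceTransformer("all-MiniLM-L6-v2")    # assumed embedding model

def response_context_agreement(response_sentences: list[str],
                               context_sentences: list[str]) -> float:
    """Optimal one-to-one matching between response and context sentences."""
    resp = _embedder.encode(response_sentences, normalize_embeddings=True)
    ctx = _embedder.encode(context_sentences, normalize_embeddings=True)
    similarity = resp @ ctx.T                          # pairwise cosine similarities

    # The Hungarian algorithm minimizes cost, so negate similarity to maximize it.
    rows, cols = linear_sum_assignment(-similarity)
    return float(similarity[rows, cols].mean())        # mean similarity of matched pairs
```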

Hallucinations exist in any LLM system. While we have taken significant steps to mitigate hallucinations in our generated responses, they still appear from time to time. For this reason, we implemented inline hallucination flagging, and we report hallucinations as a metric as well, counting the number of hallucinated entities relative to the number of entities in the response. For more information, read our blog post on hallucination flagging.
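Our actual flagging approach is covered in that post; as a rough illustration of how the reported ratio could be computed, one can count named entities in the response that never appear in the retrieved context (spaCy is used here purely as a stand-in NER model):

```python
import spacy

_nlp = spacy.load("en_core_web_sm")    # assumed NER model; must be installed separately

def hallucinated_entity_ratio(response: str, context: str) -> float:
    """Fraction of response entities that never appear in the retrieved context."""
    entities = [ent.text for ent in _nlp(response).ents]
    if not entities:
        return 0.0
    context_lower = context.lower()
    hallucinated = [ent for ent in entities if ent.lower() not in context_lower]
    return len(hallucinated) / len(entities)
```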

Here's a snapshot of our Response-Context Agreement metric. The top plot shows average daily scores; the bottom shows a histogram of all scores.

We expose some of our metrics directly to users in the Yurts platform UI, while others run as a nightly cron job and are stored in our database. We visualize all of our metrics in our data dashboard, available to power users in on-premises deployments of the Yurts platform. One particularly salient use case is A/B testing a new model or prompt: does changing the prompt or model improve or degrade the quality of the RAG system? That question can be answered easily via the data dashboard, enabling power users to self-manage and customize the platform for their use cases and their data.

Contact us to evaluate and enhance your workflows today!

Written by Alec Hoyland, Sr. Applied ML Engineer