Continuous improvement of RAG results

Ben Colborn, Member of Knowledge Staff

The previous blogs described some of DevRev’s early experiences with building a RAG pipeline and how we observe what’s going on inside it. The example of inconsistent responses to multiple instances of the same query showed how to narrow down where the failure occurred.

That’s all well and good, but it’s not at all scalable. As we continue to optimize Turing and tune it for different use cases, we can’t make that kind of manual investment for every query. Instead, we need to specify a set of queries with success criteria that can be run against various configurations, then automatically send all of those queries to a particular configuration of Turing and check whether what’s returned is correct.
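As a rough illustration of what that automation could look like (the names here, such as TestCase and run_suite, are illustrative rather than Turing’s actual interfaces), the runner only needs the test cases and a callable that wraps one pipeline configuration:

```python
from dataclasses import dataclass

@dataclass
class TestCase:
    query: str
    expected_answer: str        # ground truth answer
    expected_sources: list[str] # ground truth sources

def run_suite(test_cases, ask):
    """Run every test query against one pipeline configuration.

    `ask` stands in for a single configuration of the pipeline: it takes a
    query string and returns (answer, sources).
    """
    results = []
    for case in test_cases:
        answer, sources = ask(case.query)
        results.append({
            "query": case.query,
            "answer": answer,
            "sources": sources,
            "expected_answer": case.expected_answer,
            "expected_sources": case.expected_sources,
        })
    return results
```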

Parameters

At every step of the RAG pipeline, there are parameters that can be modified to yield different results. Some examples are below.

Stage | Parameters
Indexing | Contents of KB; size of chunks; embedding algorithm
Rephrasing | LLM (ChatGPT, Gemini, Claude, etc.); prompt
Search | Search providers; syntactic + semantic search hybrid score calculation
Validation | Search result ranking; score sensitivity and thresholds
Answer creation | LLM; prompt; number of chunks to use

In addition, we would want to experiment with adding new stages.

That’s a lot of moving parts. It’s necessary to have a set of tests that can be run while modifying one parameter at a time so that we can see the effects of changes.
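One way to keep those moving parts manageable is to capture them in a single configuration object and vary one field at a time. The sketch below is an assumption about how such a configuration might be grouped; the field names and defaults are illustrative, not Turing’s actual settings.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class PipelineConfig:
    # Indexing
    chunk_size: int = 512
    embedding_model: str = "default-embedding-model"
    # Rephrasing
    rephrase_llm: str = "default-llm"
    rephrase_prompt: str = "default"
    # Search
    search_providers: tuple[str, ...] = ("syntactic", "semantic")
    hybrid_weight: float = 0.5   # blend of syntactic and semantic scores
    # Validation
    score_threshold: float = 0.7
    max_results: int = 10
    # Answer creation
    answer_llm: str = "default-llm"
    chunks_to_use: int = 5

baseline = PipelineConfig()
# Vary one parameter at a time so the effect of each change can be isolated.
variants = [replace(baseline, chunk_size=size) for size in (256, 512, 1024)]
```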

Test specification

The specification and execution of tests are shown in the following diagram.

Here is an example test.

Query | What are articles?
Ground truth answer | Articles are pieces of information about your product or organization that are stored in a knowledge base. They are used to answer customer queries and provide self-service support. Articles are associated with a specific part (product or service) and can be created by internal users. Customers can search across articles to find answers to their questions, reducing the need to wait for support team assistance.
Ground truth sources | ARTICLE-908

The ground truth answer may either be written by a person or be generated by an LLM from the ground truth sources and verified by a person.
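Using the TestCase sketch from earlier, the example test could be written as a single record (the ground truth answer is truncated here only for brevity):

```python
example_test = TestCase(
    query="What are articles?",
    expected_answer=(
        "Articles are pieces of information about your product or organization "
        "that are stored in a knowledge base. They are used to answer customer "
        "queries and provide self-service support. ..."
    ),
    expected_sources=["ARTICLE-908"],
)
```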

Test result evaluation

The framework measures various metrics to assess the AI’s performance. These include the answering rate (whether the system provided any answer), correctness (whether the answer was factually correct), and similarity (semantic similarity between the provided answer and the ground truth). The system records detailed data about each step of the query processing pipeline using an observability platform for LLMs, which helps in debugging and further refining the system by pinpointing where improvements are needed.
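The post doesn’t spell out how these metrics are computed, but the first and third can be sketched roughly as follows, assuming embed is any function that maps text to a vector; correctness typically requires a judgment against the ground truth (by a person or an LLM) and isn’t shown here.

```python
import numpy as np

def answering_rate(results):
    """Fraction of test queries for which the system returned any answer at all."""
    answered = sum(1 for r in results if r["answer"] and r["answer"].strip())
    return answered / len(results)

def semantic_similarity(answer, expected, embed):
    """Cosine similarity between embeddings of the received and ground truth answers."""
    a = np.asarray(embed(answer), dtype=float)
    b = np.asarray(embed(expected), dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```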

After running the tests, the framework outputs detailed results including the questions asked, the answers received, the expected answers, and detailed logs from the observability platform. This comprehensive documentation aids in analyzing the system’s performance and making informed decisions about future enhancements.

A run of the example test above could return the following.

Query | What are articles?
Received answer | Articles are pieces of information about a product or organization that are stored in a knowledge base. They are used to answer customer queries through platforms like Turing bot, customer portal, or PLuG. Articles are associated with a specific part (product or service) and can be created by internal users. Customers can search through these articles to find answers to their questions.
Received sources | ARTICLE-293

The received answer is compared to the ground truth answer and the received sources to the ground truth sources. While the answers have a high similarity score, the sources differ, which may indicate a need to examine the contents of the KB or the search parameters.
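A per-test evaluation along these lines might look like the following sketch, reusing the semantic_similarity helper from above; the 0.8 threshold is an illustrative assumption, not the framework’s actual setting.

```python
def evaluate(result, embed, similarity_threshold=0.8):
    """Compare one received answer and its sources against the ground truth."""
    sim = semantic_similarity(result["answer"], result["expected_answer"], embed)
    sources_match = set(result["sources"]) == set(result["expected_sources"])
    return {
        "similarity": sim,
        "answer_ok": sim >= similarity_threshold,
        "sources_match": sources_match,
        # High similarity but mismatched sources, as in the example above,
        # points at the KB contents or the search parameters rather than the LLM.
        "needs_review": sim >= similarity_threshold and not sources_match,
    }
```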

Testing at scale

Using the test database and test environments, DevRev can continuously improve the performance of Turing. At present we have several hundred tests that can be run with various configurations of the RAG pipeline. These test runs generate the same analytics as user queries.

One of the challenges is the manual work to revalidate the test specifications when the contents of the KB change. This means that we can’t always use the latest KB for testing. To fill this gap, we have a smaller set of queries that are more customer-critical. The support team manages and runs this set of tests, which can be updated more frequently and can use the latest KB.

The Turing team continuously optimizes the RAG pipeline based on customer feedback. By continuously running the tests, they ensure that their changes are always improvements, never regressions.
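A simple way to express that guarantee, purely as an illustration, is a regression gate that only accepts a candidate configuration when no aggregate metric drops below the baseline run:

```python
def is_regression_free(baseline_metrics, candidate_metrics, tolerance=0.0):
    """Accept a candidate configuration only if no aggregate metric falls
    below the baseline (within an optional tolerance).

    Both arguments are dicts such as {"answering_rate": 0.97, "similarity": 0.88}.
    """
    return all(
        candidate_metrics[metric] >= value - tolerance
        for metric, value in baseline_metrics.items()
    )
```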

Ben Colborn, Member of Knowledge Staff

Ben leads the Knowledge team at DevRev, responsible for product documentation. Previously he was at Nutanix as Director of Technical Publications.