There’s significant buzz and excitement around using AI copilots to reduce manual work, improving software developer productivity with code generators, and innovating with generative AI. The business opportunities are driving many development teams to build knowledge bases with vector databases and embed large language models (LLMs) into their applications.
Some general use cases for building applications with LLM capabilities include search experiences, content generation, document summarization, chatbots, and customer support applications. Industry examples include developing patient portals in healthcare, improving junior banker workflows in financial services, and paving the way for the factory of the future in manufacturing.
Companies investing in LLMs have some upfront hurdles, including improving data governance around data quality, selecting an LLM architecture, addressing security risks, and developing a cloud infrastructure plan.
My bigger concerns lie in how organizations plan to test their LLM models and applications. Issues making the news include one airline being forced to honor a refund its chatbot offered, lawsuits over copyright infringement, and the ongoing risk of hallucinations.
“Testing LLM models requires a multifaceted approach that goes beyond technical rigor,” says Amit Jain, co-founder and COO of Roadz. “Teams should engage in iterative improvement and create detailed documentation to memorialize the model’s development process, testing methodologies, and performance metrics. Engaging with the research community to benchmark and share best practices is also effective.”
4 testing strategies for embedded LLMs
Development teams need an LLM testing strategy. Consider as a starting point the following practices for testing LLMs embedded in custom applications:
- Create test data to extend software QA
- Automate model quality and performance testing
- Evaluate RAG quality based on the use case
- Develop quality metrics and benchmarks
Create test data to extend software QA
Most development teams won’t be creating generalized LLMs; they will be developing applications for specific end users and use cases. To develop a testing strategy, teams need to understand the user personas, goals, workflows, and quality benchmarks involved.
“The first requirement of testing LLMs is to know the task that the LLM should be able to solve,” says Jakob Praher, CTO of Mindbreeze. “For these tasks, one would construct test datasets to establish metrics for the performance of the LLM. Then, one can either optimize the prompts or fine-tune the model systematically.”
For example, an LLM designed for customer service might include a test data set of common user problems and the best responses. Other LLM use cases may not have straightforward means to evaluate the results, but developers can still use the test data to perform validations.
“The most reliable way to test an LLM is to create relevant test data, but the challenge is the cost and time to create such a dataset,” says Kishore Gadiraju, VP of engineering for Solix Technologies. “Like any other software, LLM testing includes unit, functional, regression, and performance testing. Additionally, LLM testing requires bias, fairness, safety, content control, and explainability testing.”
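As a rough illustration of what such a test set can look like in practice, the sketch below assumes a hypothetical `query_llm()` wrapper around the application’s LLM call and a hand-built JSON file of customer-support prompts. Because reference answers are rarely matched verbatim, the assertions check for required and prohibited phrases rather than exact strings.

```python
# test_support_llm.py -- a minimal sketch; my_llm_app.query_llm and the JSON
# test set are hypothetical stand-ins for a team's own application and data.
import json

from my_llm_app import query_llm  # hypothetical wrapper around the app's LLM call


def load_test_cases(path="support_test_set.json"):
    """Each case pairs a prompt with phrases the answer must (or must not) contain."""
    with open(path) as f:
        return json.load(f)


def test_known_support_questions():
    for case in load_test_cases():
        answer = query_llm(case["prompt"]).lower()
        for phrase in case["must_include"]:
            assert phrase.lower() in answer, f"Missing '{phrase}' for: {case['prompt']}"
        for phrase in case.get("must_exclude", []):
            assert phrase.lower() not in answer, f"Disallowed '{phrase}' for: {case['prompt']}"
```

A suite like this slots into an existing CI pipeline alongside conventional unit and regression tests, which is what makes it an extension of software QA rather than a separate process.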
Automate model quality and performance testing
Once there’s a test data set, development teams should consider several testing approaches depending on quality goals, risks, and cost considerations. “Companies are beginning to move towards automated evaluation methods, rather than human evaluation, because of their time and cost efficiency,” says Olga Megorskaya, CEO of Toloka AI. “However, companies should still engage domain experts for situations where it’s crucial to catch nuances that automated systems might overlook.”
Finding the right balance of automation and human-in-the-loop testing isn’t easy for developers or data scientists. “We suggest a combination of automated benchmarking for each step of the modeling process and then a mixture of automation and manual verification for the end-to-end system,” says Steven Hillion, SVP of data and AI at Astronomer. “For major application releases, you will almost always want a final round of manual validation against your test set. That’s especially true if you’ve introduced new embeddings, new models, or new prompts that you expect to raise the general level of quality because often the improvements are subtle or subjective.”
Manual testing is a prudent measure until there are robust LLM testing platforms. Nikolaos Vasiloglou, VP of Research ML at RelationalAI, says, “There are no state-of-the-art platforms for systematic testing. When it comes to reliability and hallucination, a knowledge graph question-generating bot is the best solution.”
Gadiraju shares the following LLM testing libraries and tools:
- AI Fairness 360, an open source toolkit used to examine, report, and mitigate discrimination and bias in machine learning models
- DeepEval, an open-source LLM evaluation framework similar to pytest but specialized for unit testing LLM outputs (a usage sketch follows this list)
- Baserun, a tool to help debug, test, and iteratively improve models
- Nvidia NeMo-Guardrails, an open-source toolkit for adding programmable constraints on an LLM’s outputs
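For teams adopting a framework such as DeepEval, LLM unit tests follow the same pytest conventions as the rest of the suite. The sketch below follows DeepEval’s documented quickstart pattern, so class names and defaults may differ by version, and the relevancy metric typically calls an evaluator model behind the scenes; `query_llm` is the same hypothetical application wrapper used in the earlier sketch.

```python
# A pytest-style check using DeepEval's test case and metric abstractions.
# Based on the project's documented quickstart; exact APIs may vary by version.
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

from my_llm_app import query_llm  # hypothetical application wrapper


def test_refund_answer_is_relevant():
    prompt = "What is your refund policy for canceled flights?"
    test_case = LLMTestCase(input=prompt, actual_output=query_llm(prompt))
    # Fails the test if the output's relevancy score falls below the threshold.
    assert_test(test_case, [AnswerRelevancyMetric(threshold=0.7)])
```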
Monica Romila, director of data science tools and runtimes at IBM Data and AI, shared two testing areas for LLMs in enterprise use cases:
- Model quality evaluation assesses the model’s quality using academic and internal data sets for use cases like classification, extraction, summarization, generation, and retrieval-augmented generation (RAG).
- Model performance testing validates the model’s latency (elapsed time for data transmission) and throughput (amount of data processed in a certain timeframe).
Romila says performance testing depends on two critical parameters: the number of concurrent requests and the number of generated tokens (chunks of text a model uses). “It’s important to test for various load sizes and types and compare performance to existing models to see if updates are needed.”
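As a rough sketch of what that looks like in practice, the script below assumes a hypothetical `/generate` HTTP endpoint that returns a completion and a token count, fires a batch of concurrent requests with asyncio and httpx, and reports median latency and approximate throughput. A real load test would sweep concurrency levels, prompt lengths, and output lengths.

```python
# A minimal async load-test sketch; the /generate endpoint and its JSON shape
# are assumptions standing in for a team's own LLM service.
import asyncio
import time

import httpx  # any async HTTP client would work

ENDPOINT = "http://localhost:8000/generate"  # hypothetical LLM service


async def timed_request(client, prompt):
    start = time.perf_counter()
    resp = await client.post(ENDPOINT, json={"prompt": prompt}, timeout=120)
    latency = time.perf_counter() - start
    return latency, resp.json().get("completion_tokens", 0)


async def run_load_test(prompt="Summarize our refund policy.", concurrency=16):
    start = time.perf_counter()
    async with httpx.AsyncClient() as client:
        results = await asyncio.gather(
            *(timed_request(client, prompt) for _ in range(concurrency))
        )
    elapsed = time.perf_counter() - start
    latencies = sorted(r[0] for r in results)
    total_tokens = sum(r[1] for r in results)
    print(f"p50 latency: {latencies[len(latencies) // 2]:.2f}s")
    print(f"throughput: {total_tokens / elapsed:.1f} tokens/sec at concurrency {concurrency}")


if __name__ == "__main__":
    asyncio.run(run_load_test())
```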
DevOps and cloud architects should consider infrastructure requirements to conduct performance and load testing of LLM applications. “Deploying testing infrastructure for large language models involves setting up robust compute resources, storage solutions, and testing frameworks,” says Heather Sundheim, managing director of solutions engineering at SADA. “Automated provisioning tools like Terraform and version control systems like Git play pivotal roles in reproducible deployments and effective collaboration, emphasizing the importance of balancing resources, storage, deployment strategies, and collaboration tools for reliable LLM testing.”
Evaluate RAG quality based on the use case
Some techniques to improve LLM accuracy include centralizing content, updating models with the latest data, and using RAG in the query pipeline. RAG is important for marrying the power of LLMs with a company’s proprietary information.
In a typical LLM application, the user enters a prompt, the app sends it to the LLM, and the LLM generates a response that the app sends back to the user. With RAG, the app first sends the prompt to an information store such as a search engine or a vector database to retrieve relevant, subject-related information. The app then sends the prompt along with this contextual information to the LLM, which uses it to formulate a response. RAG thus confines the LLM’s response to relevant, contextual information.
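In code, that flow reduces to a retrieve-then-generate step. The sketch below uses hypothetical `vector_store` and `llm_client` objects; the point for testing is that retrieval and generation are separate stages, each of which can be evaluated on its own.

```python
# A minimal retrieve-then-generate (RAG) pipeline; vector_store and llm_client
# are hypothetical stand-ins for a team's own search index and model client.
from my_llm_app import llm_client, vector_store  # hypothetical clients

PROMPT_TEMPLATE = (
    "Answer the question using only the context below.\n"
    "Context:\n{context}\n\n"
    "Question: {question}\nAnswer:"
)


def answer_with_rag(question: str, top_k: int = 4) -> dict:
    # Stage 1: retrieve subject-related passages (testable for relevance on its own).
    passages = vector_store.search(question, top_k=top_k)
    context = "\n---\n".join(p["text"] for p in passages)
    # Stage 2: generate a response constrained to the retrieved context.
    answer = llm_client.complete(PROMPT_TEMPLATE.format(context=context, question=question))
    # Returning sources alongside the answer supports attribution and spot checks.
    return {"answer": answer, "sources": [p["id"] for p in passages]}
```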
Igor Jablokov, CEO and founder of Pryon, says, “RAG is more plausible for enterprise-style deployments where verifiable attribution to source content is necessary, especially in critical infrastructure.”
Using RAG with an LLM has been shown to reduce hallucinations and improve accuracy. However, RAG also adds a new component whose relevance and performance must be tested. The types of testing depend on how easy it is to evaluate the RAG and LLM’s responses and on how much development teams can leverage end-user feedback.
I recently spoke with Deon Nicholas, CEO of Forethought, about the options to evaluate RAGs used in his company’s generative customer support AI. He shared three different approaches:
- Gold standard datasets, or human-labeled datasets of correct answers for queries that serve as a benchmark for model performance
- Reinforcement learning, or testing the model in real-world scenarios like asking for a user’s satisfaction level after interacting with a chatbot
- Adversarial networks, or training a secondary LLM to assess the primary LLM’s performance, which provides automated evaluation without relying on human feedback (a sketch of this pattern appears below)
“Each method carries trade-offs, balancing human effort against the risk of overlooking errors,” says Nicholas. “The best systems leverage these methods across system components to minimize errors and foster a robust AI deployment.”
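As a hedged illustration of how the first and third approaches can combine, the sketch below grades answers from the earlier hypothetical `answer_with_rag()` pipeline against a gold-standard set, using a second “judge” model (also hypothetical) to score factual agreement. The prompt wording, scoring scale, and passing threshold are all assumptions to tune.

```python
# Automated grading against a gold-standard set, with a second "judge" model
# scoring each answer; answer_with_rag and judge_llm are hypothetical.
import json

from my_llm_app import answer_with_rag, judge_llm

JUDGE_PROMPT = (
    "Reference answer:\n{reference}\n\nCandidate answer:\n{candidate}\n\n"
    "Score the candidate from 0 to 10 for factual agreement with the reference. "
    "Reply with a number only."
)


def evaluate_against_gold(path="gold_set.json", passing_score=7.0):
    with open(path) as f:
        cases = json.load(f)
    scores = []
    for case in cases:
        candidate = answer_with_rag(case["question"])["answer"]
        raw = judge_llm.complete(
            JUDGE_PROMPT.format(reference=case["reference"], candidate=candidate)
        )
        scores.append(float(raw.strip()))
    pass_rate = sum(s >= passing_score for s in scores) / len(scores)
    print(f"mean score: {sum(scores) / len(scores):.1f}, pass rate: {pass_rate:.0%}")
```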
Develop quality metrics and benchmarks
Once you have testing data, a new or updated LLM, and a testing strategy, the next step is to validate quality against stated objectives.
“To ensure the development of safe, secure, and trustworthy AI, it’s important to create specific and measurable KPIs and establish defined guardrails,” says Atena Reyhani, chief product officer at ContractPodAi. “Some criteria to consider are accuracy, consistency, speed, and relevance to domain-specific use cases. Developers need to evaluate the entire LLM ecosystem and operational model in the targeted domain to ensure it delivers accurate, relevant, and comprehensive results.”
One tool to learn from is the Chatbot Arena, an open environment for comparing the results of LLMs. It uses the Elo rating system, an algorithm often used to rank players in competitive games, which also works well when people compare the responses of different LLMs or model versions.
“Human evaluation is a central part of testing, particularly when hardening an LLM to queries appearing in the wild,” says Joe Regensburger, VP of research at Immuta. “Chatbot Arena is an example of crowdsourcing testing, and these types of human evaluator studies can provide an important feedback loop to incorporate user feedback.”
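Teams that want to run similar pairwise comparisons internally can start with the classic Elo update itself, which is only a few lines. In the sketch below, the starting ratings and K-factor are conventional defaults, not values prescribed by Chatbot Arena.

```python
# Classic Elo rating update applied to a human vote between two LLM variants.
def update_elo(rating_a: float, rating_b: float, a_wins: bool, k: float = 32.0):
    """Return updated (rating_a, rating_b) after a reviewer prefers one response."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    score_a = 1.0 if a_wins else 0.0
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b


# Both variants start at 1000; a reviewer prefers variant B's response.
print(update_elo(1000.0, 1000.0, a_wins=False))  # (984.0, 1016.0)
```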
Romila of IBM Data and AI shared three metrics to consider depending on the LLM’s use case.
- F1 score is a composite of precision and recall that applies when LLMs are used for classification or prediction. For example, a customer support LLM can be evaluated on how well it recommends a course of action.
- ROUGE-L can be used to test RAG and LLMs for summarization use cases, but this generally requires a human-created summary to benchmark the results against.
- sacreBLEU, a method originally used to evaluate language translation, is now also used for quantitative evaluation of LLM responses, alongside other methods such as TER, ChrF, and BERTScore (a short computation sketch follows this list).
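The snippet below sketches how each metric is typically computed with commonly used open source packages (scikit-learn, rouge-score, and sacrebleu); the example texts are made up, and exact APIs and defaults may vary by package version.

```python
# Sketch of the three metrics using scikit-learn, rouge-score, and sacrebleu;
# the inputs are toy examples, and APIs may differ slightly across versions.
import sacrebleu
from rouge_score import rouge_scorer
from sklearn.metrics import f1_score

# F1 on predicted vs. expected action labels from a support-ticket classifier.
y_true = ["refund", "escalate", "refund", "close"]
y_pred = ["refund", "refund", "refund", "close"]
print(f1_score(y_true, y_pred, average="macro"))

# ROUGE-L against a human-written reference summary.
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
reference = "the customer received a full refund"
generated = "the customer was refunded in full"
print(scorer.score(reference, generated)["rougeL"].fmeasure)

# sacreBLEU against one or more reference translations.
print(sacrebleu.corpus_bleu(["the cat sat on the mat"],
                            [["the cat is sitting on the mat"]]).score)
```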
Some industries have quality and risk metrics to consider. Karthik Sj, VP of product management and marketing at Aisera, says, “In education, assessing age-appropriateness and toxicity avoidance is crucial, but in consumer-facing applications, prioritize response relevance and latency.”
Testing does not end once a model is deployed, and data scientists should seek out end-user reactions, performance metrics, and other feedback to improve the models. “Post-deployment, integrating results with behavior analytics becomes crucial, offering rapid feedback and a clearer measure of model performance,” says Dustin Pearce, VP of engineering and CISO at Amplitude.
One important step to prepare for production is to use feature flags in the application. AI technology companies Anthropic, Character.ai, Notion, and Brex build their products with feature flags to test the application collaboratively, slowly introduce capabilities to large groups, and target experiments to different user segments.
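A minimal sketch of that pattern, assuming a hypothetical feature-flag SDK (`flag_client`) and the application functions used earlier: the LLM-backed path is wrapped in a flag check so it can be rolled out to a small segment and turned off without a redeploy.

```python
# Gating a new LLM-backed capability behind a feature flag; flag_client,
# answer_with_rag, and legacy_search are hypothetical.
from my_flags import flag_client
from my_llm_app import answer_with_rag, legacy_search


def handle_support_query(user_id: str, question: str) -> str:
    # The flag service decides per user or segment whether the LLM path is on,
    # enabling gradual rollouts, targeted experiments, and instant rollback.
    if flag_client.is_enabled("llm-support-answers", user_id=user_id):
        return answer_with_rag(question)["answer"]
    return legacy_search(question)
```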
While there are emerging techniques to validate LLM applications, none of these are easy to implement or provide definitive results. For now, just building an app with RAG and LLM integrations may be the easy part compared to the work required to test it and support enhancements.