Tutorial: Model-Based Evaluation of RAG Pipelines
Last Updated: February 13, 2024
- Level: Beginner
- Time to complete: 10 minutes
- Components Used: InMemoryDocumentStore, InMemoryBM25Retriever, PromptBuilder, OpenAIGenerator, UpTrainEvaluator
- Prerequisites: You must have an API key from an active OpenAI account, as this tutorial uses OpenAI’s gpt-3.5-turbo model: https://platform.openai.com/api-keys
- Goal: After completing this tutorial, you’ll have learned how to evaluate your RAG pipelines using some of the model-based evaluation frameworks integrated into Haystack.
This tutorial uses Haystack 2.0 Beta. To learn more, read the Haystack 2.0 Beta announcement or see Haystack 2.0 Documentation.
Overview
This tutorial shows you how to evaluate a generative question-answering pipeline that uses the retrieval-augmentation (RAG) approach with Haystack 2.0. As we’re doing model-based evaluation, no ground-truth labels are required. Haystack integrates several model-based evaluation frameworks; in this tutorial, you’ll use its UpTrain integration.
For this tutorial, you’ll use the Wikipedia pages of Seven Wonders of the Ancient World as Documents, but you can replace them with any text you want.
Preparing the Colab Environment
Installing Haystack
Install Haystack 2.0 Beta, datasets, and the UpTrain integration with pip:
%%bash
pip install haystack-ai
pip install "datasets>=2.6.1"
pip install uptrain-haystack
Enabling Telemetry
Knowing you’re using this tutorial helps us decide where to invest our efforts to build a better product, but you can always opt out by commenting out the following line. See Telemetry for more details.
from haystack.telemetry import tutorial_running
tutorial_running(35)
Create the RAG Pipeline to Evaluate
To evaluate a RAG pipeline, we need a RAG pipeline to start with. So, we will start by creating a question answering pipeline.
💡 For a complete tutorial on creating Retrieval-Augmented Generation pipelines, check out the Creating Your First QA Pipeline with Retrieval-Augmentation tutorial.
First, we will initialize a DocumentStore. A DocumentStore stores the Documents that the question answering system uses to find answers to your questions. In this tutorial, you’ll be using the InMemoryDocumentStore.
You’ll use the Wikipedia pages of Seven Wonders of the Ancient World as Documents. We preprocessed the data and uploaded it to a Hugging Face Space: Seven Wonders. Thus, you don’t need to perform any additional cleaning or splitting.
from datasets import load_dataset
from haystack import Document
from haystack.document_stores.in_memory import InMemoryDocumentStore
document_store = InMemoryDocumentStore()
dataset = load_dataset("bilgeyucel/seven-wonders", split="train")
docs = [Document(content=doc["content"], meta=doc["meta"]) for doc in dataset]
document_store.write_documents(docs)
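As an optional sanity check (not one of the tutorial steps), you can confirm the write by counting the stored Documents; count_documents() is part of Haystack’s DocumentStore interface:
# Optional: verify that the Documents were written to the store
print(document_store.count_documents())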
InMemoryDocumentStore is the simplest DocumentStore to get started with. It requires no external dependencies and it’s a good option for smaller projects and debugging. But it doesn’t scale up so well to larger Document collections, so it’s not a good choice for production systems. To learn more about the different types of external databases that Haystack supports, see DocumentStore Integrations.
Now that we have our data ready, we can create a simple RAG pipeline.
In this example, we’ll be using:
- InMemoryBM25Retriever, which will get the documents relevant to the query.
- OpenAIGenerator to generate answers to queries. You can replace OpenAIGenerator in your pipeline with another Generator. Check out the full list of generators here.
import os
from getpass import getpass
from haystack.components.builders import PromptBuilder
from haystack.components.generators import OpenAIGenerator
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
retriever = InMemoryBM25Retriever(document_store)
template = """
Given the following information, answer the question.
Context:
{% for document in documents %}
{{ document.content }}
{% endfor %}
Question: {{question}}
Answer:
"""
prompt_builder = PromptBuilder(template=template)
os.environ["OPENAI_API_KEY"] = getpass("Enter OpenAI API key:")
generator = OpenAIGenerator()
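Before wiring everything into a pipeline, you can run the components on their own to preview what the generator will receive. This is an optional sketch (the example question is just illustrative); every Haystack component exposes a run() method:
# Optional: preview the rendered prompt for an example question
example_question = "When was the Rhodes Statue built?"
retrieved = retriever.run(query=example_question)  # returns {"documents": [...]}
preview = prompt_builder.run(documents=retrieved["documents"], question=example_question)
print(preview["prompt"][:500])  # print the beginning of the prompt the LLM would see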
To build a pipeline, add all components to your pipeline and connect them. Create connections from retriever to the prompt_builder and from prompt_builder to llm. Explicitly connect the output of retriever to the “documents” input of prompt_builder to make the connection obvious, as prompt_builder has two inputs (“documents” and “question”). For more information on pipelines and creating connections, refer to the Creating Pipelines documentation.
from haystack import Pipeline
from haystack.components.builders.answer_builder import AnswerBuilder
rag_pipeline = Pipeline()
# Add components to your pipeline
rag_pipeline.add_component("retriever", retriever)
rag_pipeline.add_component("prompt_builder", prompt_builder)
rag_pipeline.add_component("llm", generator)
rag_pipeline.add_component(instance=AnswerBuilder(), name="answer_builder")
# Now, connect the components to each other
rag_pipeline.connect("retriever", "prompt_builder.documents")
rag_pipeline.connect("prompt_builder", "llm")
rag_pipeline.connect("llm.replies", "answer_builder.replies")
rag_pipeline.connect("llm.meta", "answer_builder.meta")
rag_pipeline.connect("retriever", "answer_builder.documents")
That’s it! The pipeline’s ready to generate answers to questions!
Asking a Question
When asking a question, use the run() method of the pipeline. Make sure to provide the question to both the retriever and the prompt_builder. This ensures that the {{question}} variable in the template prompt gets replaced with your specific question.
question = "When was the Rhodes Statue built?"
response = rag_pipeline.run(
{"retriever": {"query": question}, "prompt_builder": {"question": question}, "answer_builder": {"query": question}}
)
print(response["answer_builder"]["answers"][0].data)
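Because the retriever is also connected to the AnswerBuilder, the returned GeneratedAnswer keeps the Documents used as context, so you can inspect the retrieved passages before evaluating them. A small optional check:
# Optional: look at the context Documents behind the answer
answer = response["answer_builder"]["answers"][0]
for doc in answer.documents:
    print(doc.content[:120], "...")  # first characters of each retrieved Document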
Now that we have the RAG pipeline ready, we can create an evaluation pipeline. The evaluation pipeline defines which metrics we want to compute, using one of the evaluation frameworks integrated into Haystack.
Evaluate The Pipeline with UpTrain
Now that we have a RAG pipeline, let’s look at how we can evaluate it. Here, we’re using the Haystack UpTrain integration. We will perform two evaluations:
- Context Relevance, grading how relevant the retrieved context is to the specified question
- Critique Tone, grading whether the generated responses follow the expected tone or persona
For a full list of available metrics and their expected inputs, check out our UpTrainEvaluator Docs.
1) Evaluate Context Relevance
from haystack_integrations.components.evaluators.uptrain import UpTrainEvaluator, UpTrainMetric
evaluator = UpTrainEvaluator(metric=UpTrainMetric.CONTEXT_RELEVANCE, api="openai")
evaluator_pipeline = Pipeline()
evaluator_pipeline.add_component("evaluator", evaluator)
Next, we can create a helper function to evaluate the context relevance of a RAG pipeline with multiple questions. The context relevance metric expects two inputs, both produced by the RAG pipeline we are evaluating:
- questions
- contexts
def evaluate_context_relevance(questions, evaluation_pipeline):
    contexts = []
    for question in questions:
        # Run the RAG pipeline for each question to collect the retrieved contexts
        response = rag_pipeline.run(
            {
                "retriever": {"query": question},
                "prompt_builder": {"question": question},
                "answer_builder": {"query": question},
            }
        )
        # Only the retrieved context is needed for the context relevance metric
        contexts.append([d.content for d in response["answer_builder"]["answers"][0].documents])
    # Pass the questions and their retrieved contexts to the evaluator
    evaluation_results = evaluation_pipeline.run({"evaluator": {"questions": questions, "contexts": contexts}})
    return evaluation_results
questions = ["When was the Rhodes Statue built?", "Where is the Pyramid of Giza?", "When was the pyramid built?"]
evaluate_context_relevance(questions=questions, evaluation_pipeline=evaluator_pipeline)
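The exact shape of the returned dictionary depends on your version of the UpTrain integration, so print it first. The sketch below re-runs the helper, captures its output, and assumes the evaluator exposes one list of score dictionaries per input under a results key; the key names here are assumptions, so adjust them to whatever you see in the raw output:
# Hypothetical post-processing sketch; the "results", "name", and "score" keys are assumptions
results = evaluate_context_relevance(questions=questions, evaluation_pipeline=evaluator_pipeline)
print(results)  # inspect the raw output first
for question, metrics in zip(questions, results["evaluator"]["results"]):
    for metric in metrics:
        print(f"{question} -> {metric.get('name')}: {metric.get('score')}")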
2) Critique Tone
An evaluator that uses the CRITIQUE_TONE metric expects to be initialized with an llm_persona. This is the persona the generative model being assessed was expected to follow, for example methodical teacher, helpful chatbot, or, here, simply informative.
from haystack_integrations.components.evaluators.uptrain import UpTrainEvaluator, UpTrainMetric
evaluator = UpTrainEvaluator(
metric=UpTrainMetric.CRITIQUE_TONE, api="openai", metric_params={"llm_persona": "informative"}
)
evaluator_pipeline = Pipeline()
evaluator_pipeline.add_component("evaluator", evaluator)
Next, we can create a helper function to critique the tone of the results of a RAG pipeline. This metric expects one input, produced by the RAG pipeline we are evaluating:
- responses
def evaluate_critique_tone(questions, evaluation_pipeline):
    responses = []
    for question in questions:
        # Run the RAG pipeline for each question to collect the generated responses
        response = rag_pipeline.run(
            {
                "retriever": {"query": question},
                "prompt_builder": {"question": question},
                "answer_builder": {"query": question},
            }
        )
        responses.append(response["answer_builder"]["answers"][0].data)
    # Pass the generated responses to the evaluator
    evaluation_results = evaluation_pipeline.run({"evaluator": {"responses": responses}})
    return evaluation_results
questions = ["When was the Rhodes Statue built?", "Where is the Pyramid of Giza?", "When was the pyramid built?"]
evaluate_critique_tone(questions=questions, evaluation_pipeline=evaluator_pipeline)
What’s next
🎉 Congratulations! You’ve learned how to evaluate a RAG pipeline with model-based evaluation frameworks and without any labeling efforts.
If you liked this tutorial, you may also enjoy our other Haystack tutorials.
To stay up to date on the latest Haystack developments, you can sign up for our newsletter. Thanks for reading!