RAGAS - Automated Evaluation of RAG

RAGAS: Automated Evaluation of Retrieval Augmented Generation

  • (Note : Well. I am not convinced. They did come up with some ways of testing it. But all of those ways used the LLM to test itself, which is uh. Not very nice or representative. I am unsure of how accurate any of these are.
  • Logically it seems slightly flawed, but I respect the hustle)

RAGAS (Retrieval Augmented Generation Assessment)

  • framework for reference-free evaluation of Retrieval Augmented Generation (RAG) pipelines
  • evaluate these different dimensions without having to rely on ground truth human annotations
  • While the usefulness of retrieval-augmented strategies is clear, their implementation requires a significant amount of tuning, as the overall performance will be affected by the retrieval model, the considered corpus, the LM, or the prompt formulation, among others

Evaluation Strategies

  • standard RAG setting
  • given a question q, the system first retrieves some context c(q) and then uses the retrieved context to generate an answer as(q)
  • (Note : This seems like an interesting start so far)
  • we usually do not have access to human-annotated datasets or reference answers

Faithfulness

  • answer should be grounded in the given context
  • important to avoid hallucinations, and to ensure that the retrieved context can act as a justification for the generated answer
  • RAG systems are often used in applications where the factual consistency of the generated text w.r.t. the grounded sources is highly important
  • (Note : Basically they use an LLM to generate an answer and then check with the ground truth to see how closely it matched by comparing the number of matches/total number of statements)

Answer Relevance

  • generated answer should address the actual question that was provided
  • (Note : Basically uses cosine similarity. I am unsure of how different it is to that usual llm but oh well)

Context Relevance

  • retrieved context should be focused, containing as little irrelevant information as possible
  • gpt-3.5-turbo-16k model
  • penalise the inclusion of redundant information.
  • (Note : -.-, they just asked the LLM to find the most important sentences and then divided it by the total number of sentences. I am now progressively getting more and more annoyed with this paper)