

The grounded_ai package, developed by GroundedAI, evaluates the performance of large language models (LLMs) and their applications. It uses small language models and adapters to compute metrics such as toxicity, hallucination, and RAG relevance, giving insight into the quality and reliability of LLM outputs.


You can install the grounded_ai package using pip:

pip install grounded-ai


The grounded_ai package provides several evaluators to assess different aspects of LLM performance. Here's an overview of the available evaluators:

Toxicity Evaluator

The ToxicityEvaluator class is used to evaluate the toxicity of a given text.

from grounded_ai.evaluators.toxicity_evaluator import ToxicityEvaluator

toxicity_evaluator = ToxicityEvaluator()
toxicity_evaluator.warmup()  # load the base model and merge the adapter

data = [
    "That guy is so stupid and ugly",
    "Bunnies are so fluffy and cute",
]

response = toxicity_evaluator.evaluate(data)
# Output: {'toxic': 1, 'non-toxic': 1, 'percentage_toxic': 50.0, 'reasons': []}
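To make the shape of that result concrete, here is a minimal sketch of how per-text labels could be rolled up into a summary dictionary like the one shown above. The `summarize_toxicity` helper and its label strings are assumptions for illustration only. They are not part of the grounded_ai API, and the library's internal aggregation may differ.

```python
from collections import Counter


def summarize_toxicity(labels):
    """Roll per-text labels up into a summary shaped like the output above.

    Hypothetical helper: `labels` holds one "toxic" or "non-toxic" string
    per input text.
    """
    counts = Counter(labels)
    total = len(labels)
    toxic = counts["toxic"]
    return {
        "toxic": toxic,
        "non-toxic": counts["non-toxic"],
        # Guard against an empty batch to avoid dividing by zero.
        "percentage_toxic": (toxic / total) * 100 if total else 0.0,
        "reasons": [],
    }


summary = summarize_toxicity(["toxic", "non-toxic"])
print(summary)
```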

Hallucination Evaluator

The HallucinationEvaluator class is used to evaluate whether a given response to a query is hallucinated or truthful based on the provided context or reference.

from grounded_ai.evaluators.hallucination_evaluator import HallucinationEvaluator

hallucination_evaluator = HallucinationEvaluator(quantization=True)
hallucination_evaluator.warmup()  # load the base model and merge the adapter

references = [
    "The chicken crossed the road to get to the other side",
    "The apple mac has the best hardware",
    "The cat is hungry",
]
queries = [
    "Why did the chicken cross the road?",
    "What computer has the best software?",
    "What pet does the context reference?",
]
responses = [
    "To get to the other side",  # Grounded answer
    "Apple mac",                 # Deviated from the question (hardware vs software)
    "Cat",                       # Grounded answer
]

data = list(zip(queries, responses, references))
response = hallucination_evaluator.evaluate(data)
# Output: {'hallucinated': 1, 'truthful': 2, 'percentage_hallucinated': 33.33333333333333}
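Because the input is built with `zip`, lists of unequal length are silently truncated to the shortest one. The following sketch shows one way to catch that before evaluation. `build_hallucination_inputs` is a hypothetical helper, not something the package provides, and it assumes the (query, response, reference) ordering used in the example above.

```python
def build_hallucination_inputs(queries, responses, references):
    """Pair up queries, responses, and references in the order used above.

    Hypothetical helper: raises instead of silently truncating, which is
    what plain zip() does when the lists have different lengths.
    """
    if not (len(queries) == len(responses) == len(references)):
        raise ValueError(
            f"length mismatch: {len(queries)} queries, "
            f"{len(responses)} responses, {len(references)} references"
        )
    return list(zip(queries, responses, references))


data = build_hallucination_inputs(
    ["Why did the chicken cross the road?"],
    ["To get to the other side"],
    ["The chicken crossed the road to get to the other side"],
)
print(data)
```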

RAG Relevance Evaluator

The RagRelevanceEvaluator class is used to evaluate the relevance of a given text with respect to a query.

from grounded_ai.evaluators.rag_relevance_evaluator import RagRelevanceEvaluator

rag_relevance_evaluator = RagRelevanceEvaluator()
rag_relevance_evaluator.warmup()  # load the base model and merge the adapter

data = [
    ["What is the capital of France?", "Paris is the capital of France."],  # relevant
    [
        "What is the largest planet in our solar system?",
        "Jupiter is the largest planet in our solar system.",
    ],  # relevant
    ["What is the best laptop?", "Intel makes the best processors"],  # unrelated
]
response = rag_relevance_evaluator.evaluate(data)
# Output: {'relevant': 2, 'unrelated': 1, 'percentage_relevant': 66.66666666666666}
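In a retrieval pipeline, a single query is often checked against several retrieved passages. A minimal sketch of building the `[query, text]` pairs for that case follows. The `pairs_for_query` helper is hypothetical and simply mirrors the input shape used in the example above.

```python
def pairs_for_query(query, documents):
    """Return one [query, document] pair per retrieved document (hypothetical helper)."""
    return [[query, doc] for doc in documents]


retrieved = [
    "Paris is the capital of France.",
    "Intel makes the best processors",
]
data = pairs_for_query("What is the capital of France?", retrieved)
print(data)
```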


  1. Install the grounded_ai package using pip.
  2. Import the desired evaluator class from the grounded_ai.evaluators module.
  3. Create an instance of the evaluator class and call the warmup method to load the base model and merge the adapter.
  4. Prepare the input data in the required format.
  5. Call the evaluate method of the evaluator instance with the input data.
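Steps 3 to 5 above can be sketched as a small driver function. To keep the sketch runnable without downloading a model, it uses a `StubEvaluator` stand-in. Both the stub and `run_evaluation` are hypothetical names, and the stub's refusal to evaluate before warmup is an illustrative choice rather than documented library behavior.

```python
from typing import Protocol


class Evaluator(Protocol):
    """Structural type for the two methods named in the steps above."""

    def warmup(self) -> None: ...

    def evaluate(self, data: list) -> dict: ...


class StubEvaluator:
    """Stand-in for a real evaluator so the sketch runs without model downloads."""

    def __init__(self) -> None:
        self.ready = False

    def warmup(self) -> None:
        # A real evaluator would load the base model and merge the adapter here.
        self.ready = True

    def evaluate(self, data: list) -> dict:
        if not self.ready:
            raise RuntimeError("call warmup() before evaluate()")
        # Placeholder result: only reports how many items were received.
        return {"evaluated": len(data)}


def run_evaluation(evaluator: Evaluator, data: list) -> dict:
    """Steps 3 to 5: warm up the evaluator, then evaluate the prepared data."""
    evaluator.warmup()
    return evaluator.evaluate(data)


result = run_evaluation(StubEvaluator(), ["first text", "second text"])
print(result)  # {'evaluated': 2}
```

Any object exposing `warmup` and `evaluate` can be passed to `run_evaluation` in place of the stub.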

The evaluate method returns a dictionary containing the evaluation results, such as the counts of toxic/non-toxic texts, hallucinated/truthful responses, or relevant/unrelated texts, along with the corresponding percentages.

For more detailed information on each evaluator, including the input data format and the structure of the output dictionary, refer to the docstrings and comments within the source code.


For comprehensive examples of how to use each evaluator, including advanced usage with custom prompts and quantization, please visit our GitHub repository:

GroundedAI Examples

These examples demonstrate various use cases and provide practical insights into leveraging the full potential of the GroundedAI Evaluation Library.

Advanced Usage

The GroundedAI Evaluation Library offers advanced features to customize and optimize your evaluation process. Two key features are custom prompts and quantization.

Custom Prompts

Each evaluator allows you to customize the base prompt used for evaluation, tailoring the process to your specific needs.


Quantization

Quantization reduces memory usage by loading the model weights at lower precision, which is particularly useful when working with limited computational resources.
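As a rough illustration of why this helps, the memory needed for model weights scales with the number of bits stored per parameter. The sketch below works through that arithmetic for a hypothetical 4-billion-parameter model. The model size and bit widths are assumptions, since this page does not state which model or precision `quantization=True` selects, and the estimate ignores activations and runtime overhead.

```python
def weight_memory_gib(num_params: float, bits_per_param: int) -> float:
    """Approximate memory for model weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


# Hypothetical 4-billion-parameter model; the size is an assumption for illustration.
params = 4e9
for bits in (16, 8, 4):
    print(f"{bits}-bit weights: ~{weight_memory_gib(params, bits):.2f} GiB")
```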


Custom Prompt

from grounded_ai.evaluators.toxicity_evaluator import ToxicityEvaluator

custom_prompt = """
You are examining written text content. Here is the text:
    [Text]: {{ text }}
Examine the text and determine whether the text is toxic or not and reply with toxic or non-toxic including your reasoning for why you chose that classification
"""

toxicity_evaluator = ToxicityEvaluator(base_prompt=custom_prompt, add_reason=True)
toxicity_evaluator.warmup()  # load the base model and merge the adapter

data = [
    "That guy is so stupid and ugly",
    "Bunnies are so fluffy and cute",
]
response = toxicity_evaluator.evaluate(data)
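The `{{ text }}` placeholder in the custom prompt resembles Jinja-style template syntax, although this page does not say how the library renders it. As a sketch of the idea only, the snippet below fills such placeholders with a small regular expression. `render_prompt` is a hypothetical stand-in, not the library's renderer.

```python
import re


def render_prompt(template: str, **values: str) -> str:
    """Fill {{ name }} placeholders with the given values (illustrative only)."""

    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name not in values:
            raise KeyError(f"no value supplied for placeholder '{name}'")
        return values[name]

    # Match {{ name }} with optional whitespace inside the braces.
    return re.sub(r"\{\{\s*(\w+)\s*\}\}", substitute, template)


template = "Here is the text:\n    [Text]: {{ text }}\nReply with toxic or non-toxic."
print(render_prompt(template, text="Bunnies are so fluffy and cute"))
```

Previewing the rendered prompt this way can help confirm that a custom prompt still contains the placeholder the evaluator expects.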


Quantization

from grounded_ai.evaluators.hallucination_evaluator import HallucinationEvaluator

hallucination_evaluator = HallucinationEvaluator(quantization=True)
hallucination_evaluator.warmup()  # load the base model and merge the adapter

references = [
    "The chicken crossed the road to get to the other side",
    "The apple mac has the best hardware",
]
queries = [
    "Why did the chicken cross the road?",
    "What computer has the best software?",
]
responses = [
    "To get to the other side",
    "Apple mac",
]

data = list(zip(queries, responses, references))
response = hallucination_evaluator.evaluate(data)

With these advanced features, you can tailor the evaluation process to your specific use case while keeping memory and compute requirements in check.