Out of Box Evaluators
Learn about the built-in evaluators available in Autoblocks.
Autoblocks provides a set of evaluators that can be used out of the box. These evaluators are designed to be easily integrated into your test suite and can help you get started with testing your AI-powered applications.
Each evaluator below lists the custom properties and methods that need to be implemented to use the evaluator in your test suite.
You must set the `id` property, which is a unique identifier for the evaluator.
All of the code snippets can be run by following the instructions in the Quick Start guide.
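To keep things concrete, the sketches on this page build on a small scaffold like the one below. The `MyTestCase` fields, the `answer_question` function, and the hashing choice are illustrative, not part of the SDK; only the `BaseTestCase` dataclass pattern comes from the Quick Start, so adjust the import to match your SDK version.

```python
# Hypothetical scaffold reused by the evaluator sketches below.
import dataclasses
import hashlib

from autoblocks.testing.models import BaseTestCase


@dataclasses.dataclass
class MyTestCase(BaseTestCase):
    question: str
    expected_answer: str

    def hash(self) -> str:
        # A stable, unique identifier for this test case across runs.
        return hashlib.md5(self.question.encode()).hexdigest()


def answer_question(test_case: MyTestCase) -> str:
    # Stand-in for your real application logic (e.g. an LLM call).
    return f"Stub answer to: {test_case.question}"
```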
Ragas
- LLM Context Precision With Reference
- Non LLM Context Precision With Reference
- LLM Context Recall
- Non LLM Context Recall
- Context Entities Recall
- Noise Sensitivity
- Response Relevancy
- Faithfulness
- Factual Correctness
- Semantic Similarity
Logic Based
Is Equals
The `IsEquals` evaluator checks if the expected output equals the actual output.
Scores 1 if equal, 0 otherwise.
Name | Required | Type | Description |
---|---|---|---|
test_case_mapper | Yes | Callable[[BaseTestCase], str] | Map your test case to a string for comparison. |
output_mapper | Yes | Callable[[OutputType], str] | Map your output to a string for comparison. |
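A minimal sketch, assuming the evaluator is exposed as `BaseIsEquals` in `autoblocks.testing.evaluators` and that the mappers are implemented as methods on a subclass; adjust the import and class name to your SDK version.

```python
from autoblocks.testing.evaluators import BaseIsEquals  # assumed import path


class ExactMatch(BaseIsEquals):
    id = "exact-match"

    # MyTestCase comes from the scaffold near the top of this page.
    def test_case_mapper(self, test_case: MyTestCase) -> str:
        # The string the output is compared against.
        return test_case.expected_answer

    def output_mapper(self, output: str) -> str:
        return output
```

An instance of the evaluator (e.g. `ExactMatch()`) is then passed in the `evaluators` list of your test suite run, as shown in the combined example further down.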
Is Valid JSON
The `IsValidJSON` evaluator checks if the output is valid JSON.
Scores 1 if it is valid, 0 otherwise.
Name | Required | Type | Description |
---|---|---|---|
output_mapper | Yes | Callable[[OutputType], str] | Map your output to the string that you want to check is valid JSON. |
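A minimal sketch under the same assumptions, with `BaseIsValidJSON` as the assumed class name:

```python
from autoblocks.testing.evaluators import BaseIsValidJSON  # assumed import path


class ProducesValidJSON(BaseIsValidJSON):
    id = "produces-valid-json"

    def output_mapper(self, output: str) -> str:
        # The string that should parse as JSON.
        return output
```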
Has All Substrings
The `HasAllSubstrings` evaluator checks if the output contains all the expected substrings.
Scores 1 if all substrings are present, 0 otherwise.
Name | Required | Type | Description |
---|---|---|---|
test_case_mapper | Yes | Callable[[BaseTestCase], list[str]] | Map your test case to a list of strings to check for in the output. |
output_mapper | Yes | Callable[[OutputType], str] | Map your output to a string for comparison. |
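A minimal sketch, again assuming a `BaseHasAllSubstrings` class; deriving the expected substrings from the scaffold's `expected_answer` field is an illustrative choice, not an SDK requirement.

```python
from autoblocks.testing.evaluators import BaseHasAllSubstrings  # assumed import path


class MentionsExpectedAnswer(BaseHasAllSubstrings):
    id = "mentions-expected-answer"

    def test_case_mapper(self, test_case: MyTestCase) -> list[str]:
        # Every string in this list must appear in the output.
        return [test_case.expected_answer]

    def output_mapper(self, output: str) -> str:
        return output
```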
Assertions (Rubric/Rules)
The `Assertions` evaluator enables you to define a set of assertions or rules that your output must satisfy.
Individual assertions can be marked as not required; if a non-required assertion is not met, the evaluator will still pass.
Name | Required | Type | Description |
---|---|---|---|
evaluate_assertions | Yes | Callable[[BaseTestCase, Any], Optional[List[Assertion]]] | Implement your logic to evaluate the assertions. |
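A hedged sketch: it assumes a `BaseAssertions` class and an `Assertion` model with roughly `criterion`, `passed`, and `required` fields. Check the SDK models for the exact shape before copying this.

```python
from typing import List, Optional

from autoblocks.testing.evaluators import BaseAssertions  # assumed import path
from autoblocks.testing.models import Assertion  # field names below are assumptions


class AnswerRules(BaseAssertions):
    id = "answer-rules"

    def evaluate_assertions(self, test_case: MyTestCase, output: str) -> Optional[List[Assertion]]:
        return [
            # Required: an empty answer fails the evaluator.
            Assertion(
                criterion="answer is not empty",
                passed=bool(output.strip()),
                required=True,
            ),
            # Not required: failing this alone does not fail the evaluator.
            Assertion(
                criterion="answer is under 500 characters",
                passed=len(output) <= 500,
                required=False,
            ),
        ]
```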
LLM Judges
Custom LLM Judge
The `CustomLLMJudge` evaluator enables you to define custom evaluation criteria using an LLM judge.
Name | Required | Type | Description |
---|---|---|---|
output_mapper | Yes | Callable[[OutputType], str] | Map your output to the string that you want to evaluate. |
model | No | str | The OpenAI model to use. Defaults to “gpt-4o”. |
num_overrides | No | int | Number of recent evaluation overrides to use as examples. Defaults to 0. |
example_output_mapper | No | Callable[[EvaluationOverride], str] | Map an EvaluationOverride to a string representation of the output. |
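A rough sketch filling in only the properties listed above; it assumes a `BaseCustomLLMJudge` class, and your SDK version may additionally require a prompt or criteria property that defines what the judge scores.

```python
from autoblocks.testing.evaluators import BaseCustomLLMJudge  # assumed import path


class ToneJudge(BaseCustomLLMJudge):
    id = "tone-judge"
    model = "gpt-4o"   # optional; "gpt-4o" is already the default
    num_overrides = 5  # optional; include 5 recent human overrides as few-shot examples

    def output_mapper(self, output: str) -> str:
        # The string handed to the LLM judge.
        return output

    def example_output_mapper(self, override) -> str:
        # How a past EvaluationOverride is rendered for the judge;
        # the `.output` attribute is an assumption, check the SDK model.
        return override.output
```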
Automatic Battle
The `AutomaticBattle` evaluator enables you to compare two outputs using an LLM judge.
Name | Required | Type | Description |
---|---|---|---|
output_mapper | Yes | Callable[[OutputType], str] | Map your output to the string that you want to compare. |
model | No | str | The OpenAI model to use. Defaults to “gpt-4o”. |
num_overrides | No | int | Number of recent evaluation overrides to use as examples. Defaults to 0. |
example_output_mapper | No | Callable[[EvaluationOverride], str] | Map an EvaluationOverride to a string representation of the output. |
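A minimal sketch, assuming a `BaseAutomaticBattle` class; where the second output in the battle comes from is configured per the SDK docs, so only the documented mapper is shown here.

```python
from autoblocks.testing.evaluators import BaseAutomaticBattle  # assumed import path


class AutomaticAnswerBattle(BaseAutomaticBattle):
    id = "automatic-answer-battle"

    def output_mapper(self, output: str) -> str:
        # The string the LLM judge uses for the comparison.
        return output
```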
Manual Battle
The `ManualBattle` evaluator enables you to compare two outputs using human evaluation.
Name | Required | Type | Description |
---|---|---|---|
output_mapper | Yes | Callable[[OutputType], str] | Map your output to the string that you want to compare. |
num_overrides | No | int | Number of recent evaluation overrides to use as examples. Defaults to 0. |
example_output_mapper | No | Callable[[EvaluationOverride], str] | Map an EvaluationOverride to a string representation of the output. |
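The shape mirrors `AutomaticBattle`, minus the `model` property, since the comparison is done by a human reviewer; a minimal sketch assuming a `BaseManualBattle` class:

```python
from autoblocks.testing.evaluators import BaseManualBattle  # assumed import path


class HumanAnswerBattle(BaseManualBattle):
    id = "human-answer-battle"

    def output_mapper(self, output: str) -> str:
        # The string surfaced to the human reviewer for comparison.
        return output
```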
Accuracy
The `Accuracy` evaluator checks if the output is accurate compared to an expected output.
Scores 1 if accurate, 0.5 if somewhat accurate, 0 if inaccurate.
Name | Required | Type | Description |
---|---|---|---|
output_mapper | Yes | Callable[[OutputType], str] | Map your output to the string that you want to check for accuracy. |
model | No | str | The OpenAI model to use. Defaults to “gpt-4o”. |
num_overrides | No | int | Number of recent evaluation overrides to use as examples. Defaults to 0. |
example_output_mapper | No | Callable[[EvaluationOverride], str] | Map an EvaluationOverride to a string representation of the output. |
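A minimal sketch filling in the documented properties, assuming a `BaseAccuracy` class:

```python
from autoblocks.testing.evaluators import BaseAccuracy  # assumed import path


class AnswerAccuracy(BaseAccuracy):
    id = "answer-accuracy"
    model = "gpt-4o"  # optional; "gpt-4o" is already the default

    def output_mapper(self, output: str) -> str:
        # The string checked for accuracy against the expected output.
        return output
```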
NSFW
The `NSFW` evaluator checks if the output is safe for work.
Scores 1 if safe, 0 otherwise.
Name | Required | Type | Description |
---|---|---|---|
output_mapper | Yes | Callable[[OutputType], str] | Map your output to the string that you want to check for NSFW content. |
model | No | str | The OpenAI model to use. Defaults to “gpt-4o”. |
num_overrides | No | int | Number of recent evaluation overrides to use as examples. Defaults to 0. |
example_output_mapper | No | Callable[[EvaluationOverride], str] | Map an EvaluationOverride to a string representation of the output. |
Toxicity
The `Toxicity` evaluator checks whether the output is toxic.
Scores 1 if the output is not toxic, 0 otherwise.
Name | Required | Type | Description |
---|---|---|---|
output_mapper | Yes | Callable[[OutputType], str] | Map your output to the string that you want to check for toxicity. |
model | No | str | The OpenAI model to use. Defaults to “gpt-4o”. |
num_overrides | No | int | Number of recent evaluation overrides to use as examples. Defaults to 0. |
example_output_mapper | No | Callable[[EvaluationOverride], str] | Map an EvaluationOverride to a string representation of the output. |
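`NSFW` and `Toxicity` share the same interface, so one sketch covers both and also shows the evaluators wired into a run. It assumes `BaseNSFW`, `BaseToxicity`, and the `run_test_suite` entrypoint from the Quick Start, with a hypothetical suite id.

```python
from autoblocks.testing.evaluators import BaseNSFW, BaseToxicity  # assumed import path
from autoblocks.testing.run import run_test_suite


class SafeForWork(BaseNSFW):
    id = "safe-for-work"

    def output_mapper(self, output: str) -> str:
        return output


class NotToxic(BaseToxicity):
    id = "not-toxic"

    def output_mapper(self, output: str) -> str:
        return output


# MyTestCase and answer_question come from the scaffold near the top of this page.
run_test_suite(
    id="my-test-suite",  # hypothetical suite id
    test_cases=[MyTestCase(question="What is 2 + 2?", expected_answer="4")],
    evaluators=[SafeForWork(), NotToxic()],
    fn=answer_question,
)
```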
Ragas
Ragas is a framework that helps you evaluate your Retrieval Augmented Generation (RAG) pipelines. We have built wrappers around the metrics to make integration with Autoblocks seamless.
Available Ragas evaluators:
- `BaseRagasLLMContextPrecisionWithReference` uses an LLM to measure the proportion of relevant chunks in the `retrieved_contexts`.
- `BaseRagasNonLLMContextPrecisionWithReference` measures the proportion of relevant chunks in the `retrieved_contexts` without using an LLM.
- `BaseRagasLLMContextRecall` evaluates the extent to which the retrieved context aligns with the annotated answer, treated as the ground truth.
- `BaseRagasNonLLMContextRecall` uses non-LLM string comparison metrics to identify whether a retrieved context is relevant.
- `BaseRagasContextEntitiesRecall` measures recall of the retrieved context based on the number of entities present in both `ground_truths` and `contexts`, relative to the number of entities present in the `ground_truths` alone.
- `BaseRagasNoiseSensitivity` measures how often a system makes errors by providing incorrect responses when utilizing either relevant or irrelevant retrieved documents.
- `BaseRagasResponseRelevancy` assesses how pertinent the generated answer is to the given prompt.
- `BaseRagasFaithfulness` measures the factual consistency of the generated answer against the given context.
- `BaseRagasFactualCorrectness` compares the factual accuracy of the generated response with the reference and determines the extent to which they align.
- `BaseRagasSemanticSimilarity` measures the semantic resemblance between the generated answer and the ground truth.
The Ragas evaluators are only available in the Python SDK. You must install Ragas (`pip install ragas`) before using these evaluators. Our wrappers require at least version `0.2.*` of Ragas.
Name | Required | Type | Description |
---|---|---|---|
id | Yes | str | The unique identifier for the evaluator. |
threshold | No | Threshold | The threshold for the evaluation used to determine pass/fail. |
llm | No | Any | Custom LLM for the evaluation. Required for any Ragas evaluator that uses a LLM. Read More: https://docs.ragas.io/en/stable/howtos/customizations/customize_models/ |
embeddings | No | Any | Custom embeddings model for the evaluation. Required for any Ragas evaluator that uses embeddings. Read More: https://docs.ragas.io/en/stable/howtos/customizations/customize_models/ |
mode | No | str | Only applicable for the BaseRagasFactualCorrectness evaluator. Read More: https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/factual_correctness |
atomicity | No | str | Only applicable for the BaseRagasFactualCorrectness evaluator. Read More: https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/factual_correctness |
focus | No | str | Only applicable for the BaseRagasNoiseSensitivity and BaseRagasFaithfulness evaluators. Read More: https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/noise_sensitivity and https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/faithfulness |
user_input_mapper | No | Callable[[TestCaseType, OutputType], str] | Map your test case or output to the user input passed to Ragas. |
response_mapper | No | Callable[[OutputType], str] | Map your output to the response passed to Ragas. |
reference_mapper | No | Callable[[TestCaseType], str] | Map your test case to the reference passed to Ragas. |
retrieved_contexts_mapper | No | Callable[[TestCaseType, OutputType], str] | Map your test case and output to the retrieved contexts passed to Ragas. |
reference_contexts_mapper | No | Callable[[TestCaseType], str] | Map your test case to the reference contexts passed to Ragas. |
Individual Ragas evaluators require different parameters. You can find sample implementations for each of the Ragas evaluators here.
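As a rough illustration of the wrapper pattern (not a drop-in implementation), a `BaseRagasSemanticSimilarity` subclass might look like the following. The embeddings wrapper follows the Ragas model-customization docs linked above, and the `Threshold` usage assumes a `gte` field; check the SDK models before copying.

```python
from autoblocks.testing.evaluators import BaseRagasSemanticSimilarity  # assumed import path
from autoblocks.testing.models import Threshold

# Wrap an embeddings model for Ragas, per the customization docs linked above.
from langchain_openai import OpenAIEmbeddings
from ragas.embeddings import LangchainEmbeddingsWrapper


class SemanticSimilarity(BaseRagasSemanticSimilarity):
    id = "semantic-similarity"
    threshold = Threshold(gte=0.8)  # fail the test if the score drops below 0.8
    embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())

    def response_mapper(self, output: str) -> str:
        # The generated answer passed to Ragas.
        return output

    def reference_mapper(self, test_case: MyTestCase) -> str:
        # The ground-truth answer passed to Ragas (MyTestCase is from the scaffold above).
        return test_case.expected_answer
```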