Learn about the built-in evaluators available in Autoblocks.
Every evaluator has an id property, which is a unique identifier for the evaluator.
All of the code snippets can be run by following the instructions in the Quick Start guide.
IsEquals
The IsEquals evaluator checks if the expected output equals the actual output.
Scores 1 if equal, 0 otherwise.
Name | Required | Type | Description |
---|---|---|---|
test_case_mapper | Yes | Callable[[BaseTestCase], str] | Map your test case to a string for comparison. |
output_mapper | Yes | Callable[[OutputType], str] | Map your output to a string for comparison. |
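Here is a minimal sketch of subclassing this evaluator with the Python SDK. The import paths, the Base-prefixed class name, and the generic parameters are assumptions based on the SDK's conventions, and MyTestCase is a hypothetical test case; verify the details against your installed SDK version.

```python
import dataclasses

from autoblocks.testing.evaluators import BaseIsEquals
from autoblocks.testing.models import BaseTestCase


@dataclasses.dataclass
class MyTestCase(BaseTestCase):
    # Hypothetical test case used throughout these sketches.
    input: str
    expected_output: str

    def hash(self) -> str:
        # Uniquely identifies this test case across runs.
        return self.input


class IsEquals(BaseIsEquals[MyTestCase, str]):
    id = "is-equals"

    def test_case_mapper(self, test_case: MyTestCase) -> str:
        # The string the output is compared against.
        return test_case.expected_output

    def output_mapper(self, output: str) -> str:
        return output
```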
IsValidJSON
The IsValidJSON evaluator checks if the output is valid JSON.
Scores 1 if it is valid, 0 otherwise.
Name | Required | Type | Description |
---|---|---|---|
output_mapper | Yes | Callable[[OutputType], str] | Map your output to the string that you want to check is valid JSON. |
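Reusing the hypothetical MyTestCase from the IsEquals sketch above (with the same caveats about import paths), a subclass only needs to map the output to the string to validate:

```python
from autoblocks.testing.evaluators import BaseIsValidJSON


class IsValidJSON(BaseIsValidJSON[MyTestCase, str]):
    id = "is-valid-json"

    def output_mapper(self, output: str) -> str:
        # The string that should parse as valid JSON.
        return output
```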
HasAllSubstrings
The HasAllSubstrings evaluator checks if the output contains all the expected substrings.
Scores 1 if all substrings are present, 0 otherwise.
Name | Required | Type | Description |
---|---|---|---|
test_case_mapper | Yes | Callable[[BaseTestCase], list[str]] | Map your test case to a list of strings to check for in the output. |
output_mapper | Yes | Callable[[OutputType], str] | Map your output to a string for comparison. |
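Under the same assumptions, a sketch that requires the expected output to appear verbatim in the actual output:

```python
from autoblocks.testing.evaluators import BaseHasAllSubstrings


class HasAllSubstrings(BaseHasAllSubstrings[MyTestCase, str]):
    id = "has-all-substrings"

    def test_case_mapper(self, test_case: MyTestCase) -> list[str]:
        # Hypothetical choice: the expected output must appear verbatim.
        return [test_case.expected_output]

    def output_mapper(self, output: str) -> str:
        return output
```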
Assertions
The Assertions evaluator enables you to define a set of assertions, or rules, that your output must satisfy.
Name | Required | Type | Description |
---|---|---|---|
evaluate_assertions | Yes | Callable[[BaseTestCase, Any], Optional[List[Assertion]]] | Implement your logic to evaluate the assertions. |
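A sketch, assuming an Assertion model is exported by the SDK; the field names used below (criterion, passed, required) are assumptions, so check the SDK reference for the exact shape:

```python
from autoblocks.testing.evaluators import BaseAssertions
from autoblocks.testing.models import Assertion  # assumed export; fields may differ


class OutputRules(BaseAssertions[MyTestCase, str]):
    id = "output-rules"

    def evaluate_assertions(self, test_case: MyTestCase, output: str) -> list[Assertion]:
        return [
            # Hypothetical rule: the output must be non-empty.
            Assertion(
                criterion="output-is-not-empty",
                passed=bool(output.strip()),
                required=True,
            ),
        ]
```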
CustomLLMJudge
The CustomLLMJudge evaluator enables you to define custom evaluation criteria using an LLM judge.
Name | Required | Type | Description |
---|---|---|---|
output_mapper | Yes | Callable[[OutputType], str] | Map your output to the string that you want to evaluate. |
model | No | str | The OpenAI model to use. Defaults to "gpt-4o". |
num_overrides | No | int | Number of recent evaluation overrides to use as examples. Defaults to 0. |
example_output_mapper | No | Callable[[EvaluationOverride], str] | Map an EvaluationOverride to a string representation of the output. |
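A sketch of a judge that scores outputs for politeness, reusing the hypothetical MyTestCase. How the judge's criteria (e.g. its prompt) are supplied is SDK-specific and not shown here:

```python
from autoblocks.testing.evaluators import BaseCustomLLMJudge


class Politeness(BaseCustomLLMJudge[MyTestCase, str]):
    id = "politeness"
    model = "gpt-4o"   # optional; this is the documented default
    num_overrides = 3  # optional; use recent human overrides as few-shot examples

    def output_mapper(self, output: str) -> str:
        return output
```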
Accuracy
The Accuracy evaluator checks if the output is accurate compared to an expected output.
Scores 1 if accurate, 0.5 if somewhat accurate, 0 if inaccurate.
Name | Required | Type | Description |
---|---|---|---|
output_mapper | Yes | Callable[[OutputType], str] | Map your output to the string that you want to check for accuracy. |
model | No | str | The OpenAI model to use. Defaults to "gpt-4o". |
num_overrides | No | int | Number of recent evaluation overrides to use as examples. Defaults to 0. |
example_output_mapper | No | Callable[[EvaluationOverride], str] | Map an EvaluationOverride to a string representation of the output. |
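A sketch that also feeds recent human overrides to the judge as examples. EvaluationOverride's import path and the attribute accessed in example_output_mapper are assumptions:

```python
from autoblocks.testing.evaluators import BaseAccuracy
from autoblocks.testing.models import EvaluationOverride  # assumed export


class Accuracy(BaseAccuracy[MyTestCase, str]):
    id = "accuracy"
    num_overrides = 5  # feed the judge recent human corrections as examples

    def output_mapper(self, output: str) -> str:
        return output

    def example_output_mapper(self, example: EvaluationOverride) -> str:
        # Render a past overridden output for the judge; the attribute
        # accessed here is an assumption.
        return str(example.output)
```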
NSFW
The NSFW evaluator checks if the output is safe for work.
Scores 1 if safe, 0 otherwise.
Name | Required | Type | Description |
---|---|---|---|
output_mapper | Yes | Callable[[OutputType], str] | Map your output to the string that you want to check for NSFW content. |
model | No | str | The OpenAI model to use. Defaults to "gpt-4o". |
num_overrides | No | int | Number of recent evaluation overrides to use as examples. Defaults to 0. |
example_output_mapper | No | Callable[[EvaluationOverride], str] | Map an EvaluationOverride to a string representation of the output. |
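The subclass shape is the same as the other LLM-judge evaluators; a minimal sketch:

```python
from autoblocks.testing.evaluators import BaseNSFW


class NSFW(BaseNSFW[MyTestCase, str]):
    id = "nsfw"

    def output_mapper(self, output: str) -> str:
        return output
```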
Toxicity
The Toxicity evaluator checks that the output is not toxic.
Scores 1 if it is not toxic, 0 otherwise.
Name | Required | Type | Description |
---|---|---|---|
output_mapper | Yes | Callable[[OutputType], str] | Map your output to the string that you want to check for toxicity. |
model | No | str | The OpenAI model to use. Defaults to "gpt-4o". |
num_overrides | No | int | Number of recent evaluation overrides to use as examples. Defaults to 0. |
example_output_mapper | No | Callable[[EvaluationOverride], str] | Map an EvaluationOverride to a string representation of the output. |
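Toxicity follows the same shape. The sketch below also shows how evaluators are typically registered with a test suite run; run_test_suite and its parameters follow the Python SDK docs, and my_llm_app is a hypothetical stand-in for your function under test:

```python
from autoblocks.testing.evaluators import BaseToxicity
from autoblocks.testing.run import run_test_suite


class Toxicity(BaseToxicity[MyTestCase, str]):
    id = "toxicity"

    def output_mapper(self, output: str) -> str:
        return output


def my_llm_app(prompt: str) -> str:
    # Hypothetical stand-in for the system under test.
    return "hello"


run_test_suite(
    id="my-test-suite",
    test_cases=[MyTestCase(input="hi", expected_output="hello")],
    evaluators=[Toxicity()],
    fn=lambda test_case: my_llm_app(test_case.input),
)
```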
Ragas
Autoblocks also provides wrappers around the Ragas metrics for evaluating RAG pipelines:

- BaseRagasLLMContextPrecisionWithReference uses an LLM to measure the proportion of relevant chunks in the retrieved_contexts.
- BaseRagasNonLLMContextPrecisionWithReference measures the proportion of relevant chunks in the retrieved_contexts without using an LLM.
- BaseRagasLLMContextRecall evaluates the extent to which the retrieved context aligns with the annotated answer, treated as the ground truth.
- BaseRagasNonLLMContextRecall uses non-LLM string comparison metrics to identify whether a retrieved context is relevant.
- BaseRagasContextEntitiesRecall measures the recall of the retrieved context based on the number of entities present in both ground_truths and contexts, relative to the number of entities present in the ground_truths alone.
- BaseRagasNoiseSensitivity measures how often a system makes errors by providing incorrect responses when utilizing either relevant or irrelevant retrieved documents.
- BaseRagasResponseRelevancy assesses how pertinent the generated answer is to the given prompt.
- BaseRagasFaithfulness measures the factual consistency of the generated answer against the given context.
- BaseRagasFactualCorrectness compares the factual accuracy of the generated response with the reference, measuring the extent to which the response aligns with it.
- BaseRagasSemanticSimilarity measures the semantic resemblance between the generated answer and the ground truth.

You must install Ragas (pip install ragas) before using these evaluators. Our wrappers require at least version 0.2.* of Ragas.

Name | Required | Type | Description |
---|---|---|---|
id | Yes | str | The unique identifier for the evaluator. |
threshold | No | Threshold | The threshold for the evaluation used to determine pass/fail. |
llm | No | Any | Custom LLM for the evaluation. Required for any Ragas evaluator that uses an LLM. Read More: https://docs.ragas.io/en/stable/howtos/customizations/customize_models/ |
embeddings | No | Any | Custom embeddings model for the evaluation. Required for any Ragas evaluator that uses embeddings. Read More: https://docs.ragas.io/en/stable/howtos/customizations/customize_models/ |
mode | No | str | Only applicable for the BaseRagasFactualCorrectness evaluator. Read More: https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/factual_correctness |
atomicity | No | str | Only applicable for the BaseRagasFactualCorrectness evaluator. Read More: https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/factual_correctness |
focus | No | str | Only applicable for the BaseRagasNoiseSensitivity and BaseRagasFaithfulness evaluators. Read More: https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/noise_sensitivity and https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/faithfulness |
user_input_mapper | No | Callable[[TestCaseType, OutputType], str] | Map your test case or output to the user input passed to Ragas. |
response_mapper | No | Callable[[OutputType], str] | Map your output to the response passed to Ragas. |
reference_mapper | No | Callable[[TestCaseType], str] | Map your test case to the reference passed to Ragas. |
retrieved_contexts_mapper | No | Callable[[TestCaseType, OutputType], str] | Map your test case and output to the retrieved contexts passed to Ragas. |
reference_contexts_mapper | No | Callable[[TestCaseType], str] | Map your test case to the reference contexts passed to Ragas. |
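Putting it together, here is a sketch of one Ragas wrapper, assuming Ragas 0.2.* and its Langchain integration are installed. The test case fields are hypothetical, the Threshold keyword is an assumption, and the mappers mirror the table above:

```python
import dataclasses

from langchain_openai import ChatOpenAI
from ragas.llms import LangchainLLMWrapper

from autoblocks.testing.evaluators import BaseRagasFaithfulness
from autoblocks.testing.models import BaseTestCase, Threshold


@dataclasses.dataclass
class RagTestCase(BaseTestCase):
    # Hypothetical RAG test case.
    question: str
    retrieved_context: str

    def hash(self) -> str:
        return self.question


class Faithfulness(BaseRagasFaithfulness[RagTestCase, str]):
    id = "faithfulness"
    threshold = Threshold(gte=0.8)  # fail anything scoring below 0.8
    # Faithfulness uses an LLM, so a custom model must be supplied:
    llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o"))

    def user_input_mapper(self, test_case: RagTestCase, output: str) -> str:
        return test_case.question

    def response_mapper(self, output: str) -> str:
        return output

    def retrieved_contexts_mapper(self, test_case: RagTestCase, output: str) -> str:
        # Typed as str per the table above; Ragas itself consumes a list of
        # context strings, so check how your SDK version handles this.
        return test_case.retrieved_context
```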