oumi.judges

This module provides access to various judge configurations for the Oumi project.

The judges are used to evaluate the quality of AI-generated responses based on different criteria such as helpfulness, honesty, and safety.

class oumi.judges.BaseJudge(prompt_template: str, prompt_template_placeholders: set[str] | None, system_instruction: str | None, example_field_values: list[dict[str, str]], response_format: JudgeResponseFormat, output_fields: list[JudgeOutputField], inference_engine: BaseInferenceEngine | None = None)[source]#

Bases: object

Base class for implementing judges that evaluate model outputs.

A judge takes structured inputs, formats them using a prompt template, runs inference to get judgments, and parses the results into structured outputs.

build_conversations(inputs: list[dict[str, str]]) → list[Conversation][source]#

Build judge conversations from inputs without running inference.

Validates inputs, builds few-shot examples, creates judgment prompts, and assembles full conversations ready for inference.

Parameters: inputs – List of dictionaries containing input data for evaluation. Each dict must contain values for all prompt_template placeholders.
Returns: List of Conversation objects ready for inference.

judge(inputs: list[dict[str, str]]) → list[JudgeOutput][source]#

Evaluate a batch of inputs and return structured judgments.

Parameters: inputs – List of dictionaries containing input data for evaluation. Each dict must contain values for all prompt_template placeholders.
Returns: List of structured judge outputs with parsed results.
Raises: ValueError – If inference returns an unexpected number of conversations.

judge_batch_result(batch_id: str, conversations: list[Conversation]) → list[JudgeOutput][source]#

Retrieve and parse results from a completed batch judging job.

Parameters:

batch_id – The batch job ID from judge_batch_submit.
conversations – The conversations returned by judge_batch_submit.

Returns:

List of structured judge outputs with parsed results.

Raises:

ValueError – If inference_engine is None or not a RemoteInferenceEngine.
RuntimeError – If any items failed judging.

judge_batch_result_partial(batch_id: str, conversations: list[Conversation]) → JudgeBatchResult[source]#

Retrieve and parse partial results from a completed batch judging job.

Parameters:

batch_id – The batch job ID from judge_batch_submit.
conversations – The conversations returned by judge_batch_submit.

Returns:

JudgeBatchResult with successful outputs and failure info.

Raises:

ValueError – If inference_engine is None or not a RemoteInferenceEngine.

judge_batch_submit(inputs: list[dict[str, str]]) → tuple[str, list[Conversation]][source]#

Submit a batch judging job.

Builds conversations from inputs and submits them as a batch job via the inference engine’s batch API.

Parameters: inputs – List of dictionaries containing input data for evaluation.
Returns: Tuple of (batch_id, conversations): the batch_id for polling, and the conversations needed to retrieve results later.
Raises: ValueError – If inference_engine is None or not a RemoteInferenceEngine.
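The submit-then-retrieve flow described by the three batch methods can be sketched as follows. This is a minimal illustration, not Oumi's implementation: `BatchJudge`, `submit_then_collect`, and `FakeJudge` are hypothetical names, and `FakeJudge` is an in-memory stand-in so the example runs without Oumi or a remote engine. Only the method names and argument order come from the signatures above.

```python
from typing import Any, Protocol


class BatchJudge(Protocol):
    """Structural type covering only the two batch methods used below."""

    def judge_batch_submit(
        self, inputs: list[dict[str, str]]
    ) -> tuple[str, list[Any]]: ...

    def judge_batch_result(
        self, batch_id: str, conversations: list[Any]
    ) -> list[Any]: ...


def submit_then_collect(judge: BatchJudge, inputs: list[dict[str, str]]) -> list[Any]:
    """Submit a batch job, then fetch parsed outputs for it."""
    batch_id, conversations = judge.judge_batch_submit(inputs)
    # Keep both values: retrieval needs the batch ID and the conversations.
    # A real job takes time to finish; waiting for completion is omitted here.
    return judge.judge_batch_result(batch_id, conversations)


class FakeJudge:
    """Hypothetical stand-in (not part of Oumi) that completes instantly."""

    def judge_batch_submit(self, inputs):
        conversations = [f"conversation for {item['text']}" for item in inputs]
        return "batch-001", conversations

    def judge_batch_result(self, batch_id, conversations):
        return [{"batch_id": batch_id, "judged": c} for c in conversations]


outputs = submit_then_collect(FakeJudge(), [{"text": "a"}, {"text": "b"}])
print(len(outputs))
```

With a real judge, judge_batch_result_partial could replace judge_batch_result when some failed items are acceptable, since it returns a JudgeBatchResult instead of raising RuntimeError.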

parse_judge_outputs(completed_conversations: list[Conversation]) → list[JudgeOutput][source]#

Parse completed conversations into structured judge outputs.

Validates each conversation has the expected structure (correct message count, ends with assistant message) and parses the raw output.

Parameters: completed_conversations – List of conversations with model responses.
Returns: List of structured judge outputs with parsed results.
Raises: ValueError – If any conversation has an unexpected structure.

property total_cached_tokens: int#: Total cached tokens accumulated across all judge() calls.

property total_input_tokens: int#: Total input/prompt tokens accumulated across all judge() calls.

property total_output_tokens: int#: Total output/completion tokens accumulated across all judge() calls.

validate_dataset(inputs: list[dict[str, str]], raise_on_error: bool = True) → bool[source]#: Validate that all inputs contain the required placeholder keys.
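The placeholder check that validate_dataset describes can be illustrated without Oumi. The sketch below assumes templates use Python `str.format`-style `{name}` placeholders, which the text above does not confirm; `find_placeholders` and `validate_inputs` are hypothetical helpers, not Oumi functions.

```python
from string import Formatter


def find_placeholders(prompt_template: str) -> set[str]:
    """Collect the named {placeholders} in a str.format-style template."""
    return {
        field_name
        for _, field_name, _, _ in Formatter().parse(prompt_template)
        if field_name  # skips literal text (None) and positional "{}" fields
    }


def validate_inputs(
    prompt_template: str,
    inputs: list[dict[str, str]],
    raise_on_error: bool = True,
) -> bool:
    """Return True if every input dict supplies all template placeholders."""
    required = find_placeholders(prompt_template)
    for index, example in enumerate(inputs):
        missing = required - set(example)
        if missing:
            if raise_on_error:
                raise ValueError(f"Input {index} is missing keys: {sorted(missing)}")
            return False
    return True


template = "Question: {question}\nAnswer: {answer}\nIs the answer helpful?"
good = [{"question": "What is 2 + 2?", "answer": "4"}]
bad = [{"question": "What is 2 + 2?"}]

print(validate_inputs(template, good))
print(validate_inputs(template, bad, raise_on_error=False))
```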

class oumi.judges.BaseRule[source]#

Bases: ABC

Base class for rules used for deterministic evals.

abstractmethod apply(input_data: dict[str, str], rule_config: dict) → tuple[bool, float][source]#

Apply the rule to input data.

Returns: (judgment: bool, score: float)
Return type: tuple
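A custom rule only needs to implement apply with the signature shown above. The sketch below defines a local stand-in for BaseRule so it runs without Oumi installed; with Oumi available, one would presumably subclass oumi.judges.BaseRule instead. `MaxLengthRule` and its `max_chars` config key are hypothetical and exist only for illustration.

```python
from abc import ABC, abstractmethod


class BaseRule(ABC):
    """Local stand-in mirroring the documented BaseRule interface."""

    @abstractmethod
    def apply(
        self, input_data: dict[str, str], rule_config: dict
    ) -> tuple[bool, float]:
        """Apply the rule to input data."""


class MaxLengthRule(BaseRule):
    """Hypothetical rule: pass when a field is at most `max_chars` long."""

    def apply(
        self, input_data: dict[str, str], rule_config: dict
    ) -> tuple[bool, float]:
        text = input_data[rule_config["input_field"]]
        judgment = len(text) <= rule_config["max_chars"]
        # Deterministic score: 1.0 on pass, 0.0 on fail.
        return judgment, 1.0 if judgment else 0.0


rule = MaxLengthRule()
config = {"input_field": "output", "max_chars": 10}
print(rule.apply({"output": "short"}, config))
print(rule.apply({"output": "far too long for the limit"}, config))
```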

class oumi.judges.JudgeOutput(*, raw_output: str, parsed_output: dict[str, str] = {}, output_fields: list[JudgeOutputField] | None = None, field_values: dict[str, float | int | str | bool | None] = {}, field_scores: dict[str, float | None] = {}, response_format: JudgeResponseFormat | None = None)[source]#

Bases: BaseModel

Represents the output from a judge evaluation.

Variables:

raw_output (str) – The original unprocessed output from the judge
parsed_output (dict[str, str]) – Structured data (fields & their values) extracted from raw output
output_fields (list[oumi.judges.base_judge.JudgeOutputField] | None) – List of expected output fields for this judge
field_values (dict[str, float | int | str | bool | None]) – Typed values for each expected output field
field_scores (dict[str, float | None]) – Numeric scores for each expected output field (if applicable)
response_format (oumi.core.configs.params.judge_params.JudgeResponseFormat | None) – Format used for generating output (XML, JSON, or RAW)

field_scores: dict[str, float | None]#

field_values: dict[str, float | int | str | bool | None]#

classmethod from_raw_output(raw_output: str, response_format: JudgeResponseFormat, output_fields: list[JudgeOutputField]) → Self[source]#: Generate a structured judge output from a raw model output.

generate_raw_output(field_values: dict[str, str]) → str[source]#

Generate raw output string from field values in the specified format.

Parameters: field_values – Dictionary mapping field keys to their string values. Must contain values for all required output fields.
Returns: Formatted raw output string ready for use as an assistant response.
Raises: ValueError – If required output fields are missing from field_values, if response_format/output_fields are not set, or if response_format is not supported.
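To make the relationship between generate_raw_output and from_raw_output concrete, here is a rough round-trip sketch for the JSON and XML formats. The exact tag layout and parsing strategy Oumi uses are not stated above, so the `<key>value</key>` layout, the string format names, and both functions are assumptions for illustration; the RAW format is left out.

```python
import json
import re


def generate_raw_output(field_values: dict[str, str], response_format: str) -> str:
    """Serialize field values as JSON or as XML-style tags (assumed layout)."""
    if response_format == "json":
        return json.dumps(field_values)
    if response_format == "xml":
        return "\n".join(
            f"<{key}>{value}</{key}>" for key, value in field_values.items()
        )
    raise ValueError(f"Unsupported response format: {response_format!r}")


def parse_raw_output(
    raw_output: str, response_format: str, field_keys: list[str]
) -> dict[str, str]:
    """Extract the expected fields from a raw output string."""
    if response_format == "json":
        data = json.loads(raw_output)
        return {key: str(data[key]) for key in field_keys if key in data}
    if response_format == "xml":
        parsed = {}
        for key in field_keys:
            tag = re.escape(key)
            match = re.search(rf"<{tag}>(.*?)</{tag}>", raw_output, re.DOTALL)
            if match:
                parsed[key] = match.group(1).strip()
        return parsed
    raise ValueError(f"Unsupported response format: {response_format!r}")


values = {"judgment": "Yes", "explanation": "Accurate and concise."}
for fmt in ("json", "xml"):
    raw = generate_raw_output(values, fmt)
    print(fmt, parse_raw_output(raw, fmt, list(values)) == values)
```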

model_config = {}#: Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

output_fields: list[JudgeOutputField] | None#

parsed_output: dict[str, str]#

raw_output: str#

response_format: JudgeResponseFormat | None#

to_json() → str[source]#

Convert the JudgeOutput to a JSON string.

Returns: JSON string representation of the JudgeOutput data.

class oumi.judges.JudgeOutputField(*, field_key: str, field_type: JudgeOutputType, field_scores: dict[str, float] | None)[source]#

Bases: BaseModel

Represents a single output field that a judge can produce.

Variables:

field_key (str) – The key/name for this field in the judge’s output
field_type (oumi.core.configs.params.judge_params.JudgeOutputType) – The data type expected for this field’s value
field_scores (dict[str, float] | None) – Optional mapping from categorical values to numeric scores

field_key: str#

field_scores: dict[str, float] | None#

field_type: JudgeOutputType#

get_typed_value(raw_value: str) → float | int | str | bool | None[source]#

Convert the field’s raw string value to the appropriate type.

Parameters: raw_value – The raw string value from the judge’s output.
Returns: The typed value, or None if conversion fails.
Raises: ValueError – If the field_type is not supported.
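The contract above (typed value on success, None on a failed conversion, ValueError for an unsupported type) can be sketched as below. `OutputType` is a hypothetical stand-in for JudgeOutputType, whose members are not listed in this section, and the accepted boolean spellings are an assumption. `to_score` illustrates how a field_scores mapping could turn a categorical value into a number.

```python
from enum import Enum
from typing import Optional, Union

TypedValue = Optional[Union[float, int, str, bool]]


class OutputType(Enum):
    """Hypothetical stand-in for JudgeOutputType; member names are assumed."""

    TEXT = "text"
    INT = "int"
    FLOAT = "float"
    BOOL = "bool"


def to_typed_value(raw_value: str, field_type: OutputType) -> TypedValue:
    """Convert a raw string to the field's type, or None if conversion fails."""
    text = raw_value.strip()
    if field_type is OutputType.TEXT:
        return text
    if field_type is OutputType.BOOL:
        lowered = text.lower()
        if lowered in ("true", "yes", "1"):  # assumed spellings
            return True
        if lowered in ("false", "no", "0"):
            return False
        return None
    if field_type in (OutputType.INT, OutputType.FLOAT):
        try:
            return int(text) if field_type is OutputType.INT else float(text)
        except ValueError:
            return None
    raise ValueError(f"Unsupported field type: {field_type!r}")


def to_score(raw_value: str, field_scores: dict[str, float]) -> Optional[float]:
    """Look up the numeric score for a categorical value, if one is defined."""
    return field_scores.get(raw_value.strip())


print(to_typed_value("42", OutputType.INT))
print(to_typed_value("not a number", OutputType.INT))
print(to_score("good", {"excellent": 1.0, "good": 0.5, "poor": 0.0}))
```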

model_config = {}#: Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

class oumi.judges.RegexRule[source]#

Bases: BaseRule

Rule that checks if input text matches a regex pattern.

Config Parameters:

pattern (str) – The regex pattern to match against.
input_field (str) – The field name to extract text from in input_data.
match_mode (str) – How to match: "search", "match", or "fullmatch".
inverse (bool) – If True, pass when the pattern does NOT match (default: False).
flags (int) – Optional regex flags, e.g. re.IGNORECASE (default: 0).

Examples

Match a phone number pattern:

>>> rule_config = {
...     "pattern": r"\d{3}-\d{4}",
...     "input_field": "text",
...     "match_mode": "search"
... }
>>> rule = RegexRule()
>>> result, score = rule.apply({"text": "Call 555-1234"}, rule_config)
>>> print(result, score)
True 1.0

Inverse matching (expect NOT to match):

>>> rule_config = {
...     "pattern": r"error|fail",
...     "input_field": "output",
...     "inverse": True
... }
>>> result, score = rule.apply({"output": "Success!"}, rule_config)
>>> print(result, score)
True 1.0

apply(input_data: dict[str, str], rule_config: dict) → tuple[bool, float][source]#

Apply regex pattern matching to input data.

Parameters:

input_data – Dictionary containing input fields (e.g. {"text": "…", "expected": "…"}).
rule_config – Configuration with ‘pattern’, ‘input_field’, ‘inverse’, etc.

Returns:

Tuple of (judgment: bool, score: float).
judgment – True if the test passes (the pattern matches, or does not match if inverse=True).
score – 1.0 if judgment is True, 0.0 otherwise.

Raises:

ValueError – If required config parameters are missing or invalid
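The behaviour described for RegexRule.apply maps directly onto Python's `re` module. The function below is a standalone sketch of that logic, not Oumi's source: `apply_regex_rule` is a hypothetical name, and defaulting match_mode to "search" when it is omitted is an assumption (the inverse-matching example above omits it, but the default is not stated).

```python
import re


def apply_regex_rule(
    input_data: dict[str, str], rule_config: dict
) -> tuple[bool, float]:
    """Return (judgment, score) for a regex check on one input field."""
    for key in ("pattern", "input_field"):
        if key not in rule_config:
            raise ValueError(f"Missing required config parameter: {key}")

    # Each mode corresponds to an `re` function with the same name.
    matchers = {"search": re.search, "match": re.match, "fullmatch": re.fullmatch}
    match_mode = rule_config.get("match_mode", "search")  # assumed default
    if match_mode not in matchers:
        raise ValueError(f"Invalid match_mode: {match_mode!r}")

    text = input_data[rule_config["input_field"]]
    flags = rule_config.get("flags", 0)
    matched = matchers[match_mode](rule_config["pattern"], text, flags) is not None

    # With inverse=True the rule passes only when the pattern does NOT match.
    judgment = matched != rule_config.get("inverse", False)
    return judgment, 1.0 if judgment else 0.0


phone = {"pattern": r"\d{3}-\d{4}", "input_field": "text"}
data = {"text": "Call 555-1234"}
print(apply_regex_rule(data, {**phone, "match_mode": "search"}))
print(apply_regex_rule(data, {**phone, "match_mode": "fullmatch"}))
```

The two calls differ only in match_mode: "search" finds the number anywhere in the text, while "fullmatch" requires the whole string to match and therefore fails on the leading "Call ".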

class oumi.judges.RuleBasedJudge(judge_config: JudgeConfig | str)[source]#

Bases: BaseJudge

A Rule Based Judge for evaluating outputs based on a configuration.

judge(inputs: list[dict[str, str]]) → list[JudgeOutput][source]#

Evaluate a batch of inputs and return structured judgments.

Parameters: inputs – List of dictionaries containing input data for evaluation. Each dict must contain values for all prompt_template placeholders.
Returns: List of structured judge outputs with parsed results.
Raises: ValueError – If inference returns an unexpected number of conversations.

class oumi.judges.SimpleJudge(judge_config: JudgeConfig | str)[source]#

Bases: BaseJudge

Judge class for evaluating outputs based on a given configuration.

Subpackages#

oumi.judges.rules
- BaseRule
- RegexRule
