oumi.judges

This module provides access to the judge configurations for the Oumi project.

Judges evaluate the quality of AI-generated responses against criteria such as helpfulness, honesty, and safety.

class oumi.judges.BaseJudge(prompt_template: str, prompt_template_placeholders: set[str] | None, system_instruction: str | None, example_field_values: list[dict[str, str]], response_format: JudgeResponseFormat, output_fields: list[JudgeOutputField], inference_engine: BaseInferenceEngine)

Bases: object

Base class for implementing judges that evaluate model outputs.

A judge takes structured inputs, formats them using a prompt template, runs inference to get judgments, and parses the results into structured outputs.
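
The flow described above (format the inputs with the prompt template, run inference, parse the results) can be sketched in plain Python. This is a minimal illustration, not Oumi's implementation: `run_judge_sketch`, `fake_infer`, and `parse_verdict` are hypothetical names, the template is assumed to use `str.format`-style placeholders, and the inference engine is replaced by a stub that returns canned text.

```python
from typing import Callable


def run_judge_sketch(
    prompt_template: str,
    inputs: list[dict[str, str]],
    infer: Callable[[list[str]], list[str]],
    parse: Callable[[str], dict[str, str]],
) -> list[dict[str, str]]:
    """Hypothetical stand-in for the format -> infer -> parse flow."""
    # 1. Fill the template placeholders from each input dict.
    prompts = [prompt_template.format(**example) for example in inputs]
    # 2. Run inference on the whole batch.
    raw_outputs = infer(prompts)
    # 3. Refuse to continue if the number of outputs does not match.
    if len(raw_outputs) != len(prompts):
        raise ValueError(
            f"Expected {len(prompts)} outputs, got {len(raw_outputs)}"
        )
    # 4. Parse each raw output into structured fields.
    return [parse(raw) for raw in raw_outputs]


def fake_infer(prompts: list[str]) -> list[str]:
    """Stub engine (assumption): says 'Yes' when the prompt mentions Paris."""
    return ["Yes" if "Paris" in prompt else "No" for prompt in prompts]


def parse_verdict(raw: str) -> dict[str, str]:
    """Stub parser (assumption): wraps the raw text in a single field."""
    return {"judgment": raw.strip()}


template = "Question: {question}\nAnswer: {answer}\nIs the answer correct?"
batch = [
    {"question": "Capital of France?", "answer": "Paris"},
    {"question": "Capital of France?", "answer": "Lyon"},
]
results = run_judge_sketch(template, batch, fake_infer, parse_verdict)
```

The count check in step 3 mirrors the kind of condition under which `judge` is documented to raise `ValueError`; the real class may detect it differently.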

judge(inputs: list[dict[str, str]]) -> list[JudgeOutput]

Evaluate a batch of inputs and return structured judgments.

Parameters:

inputs – List of dictionaries containing input data for evaluation. Each dict must contain values for all prompt_template placeholders.

Returns:

List of structured judge outputs with parsed results

Raises:

ValueError – If inference returns an unexpected number of conversations

validate_dataset(inputs: list[dict[str, str]], raise_on_error: bool = True) -> bool

Validate that all inputs contain the required placeholder keys.
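
A check of this kind can be sketched as follows. The helper names are hypothetical, the template is assumed to use `str.format`-style braces, and the behavior for `raise_on_error=False` (return `False` instead of raising) is an assumption inferred from the signature rather than documented behavior.

```python
import string


def extract_placeholders(prompt_template: str) -> set[str]:
    """Collect the named placeholders in a str.format-style template."""
    return {
        field_name
        for _, field_name, _, _ in string.Formatter().parse(prompt_template)
        if field_name
    }


def validate_inputs_sketch(
    inputs: list[dict[str, str]],
    placeholders: set[str],
    raise_on_error: bool = True,
) -> bool:
    """Hypothetical check that every input supplies every placeholder."""
    for index, example in enumerate(inputs):
        missing = placeholders - set(example)
        if missing:
            if raise_on_error:
                raise ValueError(
                    f"Input {index} is missing keys: {sorted(missing)}"
                )
            # Assumption: report failure quietly when raising is disabled.
            return False
    return True


required = extract_placeholders("Question: {question}\nAnswer: {answer}")
complete = validate_inputs_sketch(
    [{"question": "2 + 2?", "answer": "4"}], required
)
incomplete = validate_inputs_sketch(
    [{"question": "2 + 2?"}], required, raise_on_error=False
)
```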

class oumi.judges.JudgeOutput(*, raw_output: str, parsed_output: dict[str, str] = {}, output_fields: list[JudgeOutputField] | None = None, field_values: dict[str, float | int | str | bool | None] = {}, field_scores: dict[str, float | None] = {}, response_format: JudgeResponseFormat | None = None)

Bases: BaseModel

Represents the output from a judge evaluation.

Variables:
  • raw_output (str) – The original unprocessed output from the judge

  • parsed_output (dict[str, str]) – Structured data (fields & their values) extracted from raw output

  • output_fields (list[oumi.judges.base_judge.JudgeOutputField] | None) – List of expected output fields for this judge

  • field_values (dict[str, float | int | str | bool | None]) – Typed values for each expected output field

  • field_scores (dict[str, float | None]) – Numeric scores for each expected output field (if applicable)

  • response_format (oumi.core.configs.params.judge_params.JudgeResponseFormat | None) – Format used for generating output (XML, JSON, or RAW)

field_scores: dict[str, float | None]
field_values: dict[str, float | int | str | bool | None]
classmethod from_raw_output(raw_output: str, response_format: JudgeResponseFormat, output_fields: list[JudgeOutputField]) -> Self

Generate a structured judge output from a raw model output.

generate_raw_output(field_values: dict[str, str]) -> str

Generate raw output string from field values in the specified format.

Parameters:

field_values – Dictionary mapping field keys to their string values. Must contain values for all required output fields.

Returns:

Formatted raw output string ready for use as assistant response.

Raises:

ValueError – If required output fields are missing from field_values, if response_format/output_fields are not set, or if response_format is not supported.
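
The relationship between generating a raw output and parsing one back can be illustrated with a round trip. This page does not specify the exact serialization Oumi uses, so the sketch below makes its own assumptions: one `<key>value</key>` tag per field for XML and a flat object for JSON, with the format passed as a plain string. Both function names are hypothetical.

```python
import json
import re


def generate_raw_sketch(
    field_values: dict[str, str], required_keys: list[str], response_format: str
) -> str:
    """Hypothetical serializer: field values -> raw assistant text."""
    missing = [key for key in required_keys if key not in field_values]
    if missing:
        raise ValueError(f"Missing output fields: {missing}")
    if response_format == "XML":
        # Assumed layout: one tag per field, one field per line.
        return "\n".join(
            f"<{key}>{field_values[key]}</{key}>" for key in required_keys
        )
    if response_format == "JSON":
        return json.dumps({key: field_values[key] for key in required_keys})
    raise ValueError(f"Unsupported response format: {response_format}")


def parse_raw_sketch(
    raw_output: str, required_keys: list[str], response_format: str
) -> dict[str, str]:
    """Hypothetical parser: raw text -> field keys and string values."""
    if response_format == "JSON":
        data = json.loads(raw_output)
        return {key: str(data[key]) for key in required_keys if key in data}
    parsed = {}
    for key in required_keys:
        tag = re.escape(key)
        match = re.search(rf"<{tag}>(.*?)</{tag}>", raw_output, re.DOTALL)
        if match:
            parsed[key] = match.group(1).strip()
    return parsed


keys = ["judgment", "explanation"]
values = {"judgment": "Yes", "explanation": "The answer is correct."}
raw_xml = generate_raw_sketch(values, keys, "XML")
round_trip = parse_raw_sketch(raw_xml, keys, "XML")
```

Under these assumptions, parsing the generated text recovers the original field values, which is the property that makes generated outputs usable as example assistant responses.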

model_config: ClassVar[ConfigDict] = {}

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

output_fields: list[JudgeOutputField] | None
parsed_output: dict[str, str]
raw_output: str
response_format: JudgeResponseFormat | None
to_json() -> str

Convert the JudgeOutput to a JSON string.

Returns:

JSON string representation of the JudgeOutput data.

class oumi.judges.JudgeOutputField(*, field_key: str, field_type: JudgeOutputType, field_scores: dict[str, float] | None)

Bases: BaseModel

Represents a single output field that a judge can produce.

Variables:
  • field_key (str) – The key/name for this field in the judge’s output

  • field_type (oumi.core.configs.params.judge_params.JudgeOutputType) – The data type expected for this field’s value

  • field_scores (dict[str, float] | None) – Optional mapping from categorical values to numeric scores

field_key: str
field_scores: dict[str, float] | None
field_type: JudgeOutputType
get_typed_value(raw_value: str) -> float | int | str | bool | None

Convert the field’s raw string value to the appropriate type.

Parameters:

raw_value – The raw string value from the judge’s output

Returns:

The typed value, or None if conversion fails

Raises:

ValueError – If the field_type is not supported

model_config: ClassVar[ConfigDict] = {}

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

class oumi.judges.SimpleJudge(judge_config: JudgeConfig | str)

Bases: BaseJudge

Judge class for evaluating outputs based on a given configuration.