oumi.analyze.analyzers#

Analyzer implementations and result models.

This module contains concrete analyzer implementations that inherit from the base analyzer classes and return typed result models. Each analyzer file contains both the analyzer class and its result model for better cohesion.

class oumi.analyze.analyzers.DataQualityAnalyzer[source]#

Bases: ConversationAnalyzer[DataQualityMetrics]

Analyzer for basic data quality checks on conversations.

Checks for three common data quality issues without requiring an LLM:

- Non-alternating user/assistant message patterns
- Empty or whitespace-only turns
- Invalid placeholder values serialized as strings (NaN, null, None, undefined)

Example

>>> from oumi.analyze.analyzers.quality import DataQualityAnalyzer
>>> from oumi.core.types.conversation import Conversation, Message, Role
>>>
>>> analyzer = DataQualityAnalyzer()
>>> conversation = Conversation(messages=[
...     Message(role=Role.USER, content="Hello"),
...     Message(role=Role.ASSISTANT, content="Hi there!"),
... ])
>>> result = analyzer.analyze(conversation)
>>> print(result.has_non_alternating_turns)
False
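The three checks can be sketched in plain Python, independent of the library. This is an illustrative sketch only: role/content tuples stand in for oumi's Message objects, and the function names are hypothetical, not oumi's API.

```python
# Illustrative sketch of the three checks DataQualityAnalyzer performs.
# Messages are modeled as (role, content) tuples; oumi uses Message objects.
INVALID_PATTERNS = {"nan", "null", "none", "undefined"}

def has_non_alternating_turns(messages):
    """True if two consecutive non-system messages share a role."""
    roles = [role for role, _ in messages if role != "system"]
    return any(a == b for a, b in zip(roles, roles[1:]))

def find_empty_turns(messages):
    """Indices of empty or whitespace-only turns."""
    return [i for i, (_, content) in enumerate(messages) if not content.strip()]

def find_invalid_values(messages):
    """Placeholder strings (NaN, null, ...) appearing as entire message contents."""
    return sorted({content.strip().lower() for _, content in messages
                   if content.strip().lower() in INVALID_PATTERNS})

msgs = [("user", "Hello"), ("assistant", "Hi!"), ("assistant", "   "), ("user", "NaN")]
print(has_non_alternating_turns(msgs))  # True: two assistant turns in a row
print(find_empty_turns(msgs))           # [2]
print(find_invalid_values(msgs))        # ['nan']
```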
analyze(conversation: Conversation) DataQualityMetrics[source]#

Analyze data quality for a conversation.

Parameters:

conversation – The conversation to analyze.

Returns:

DataQualityMetrics with the quality check results.

classmethod get_config_schema() dict[source]#

Get JSON schema for DataQualityAnalyzer configuration.

class oumi.analyze.analyzers.DataQualityMetrics(*, has_non_alternating_turns: bool, has_empty_turns: bool, empty_turn_count: int, has_invalid_values: bool, invalid_value_patterns: list[str])[source]#

Bases: BaseModel

Result model for data quality checks on a conversation.

Example

>>> result = DataQualityMetrics(
...     has_non_alternating_turns=False,
...     has_empty_turns=False,
...     empty_turn_count=0,
...     has_invalid_values=False,
...     invalid_value_patterns=[],
... )
>>> print(result.has_non_alternating_turns)
False
empty_turn_count: int#
has_empty_turns: bool#
has_invalid_values: bool#
has_non_alternating_turns: bool#
invalid_value_patterns: list[str]#
model_config = {}#

Configuration for the model; should be a dictionary conforming to pydantic's ConfigDict.

class oumi.analyze.analyzers.LengthAnalyzer(tokenizer: Tokenizer | None = None)[source]#

Bases: ConversationAnalyzer[LengthMetrics]

Analyzer for computing token length metrics of conversations.

Computes token counts for conversations using a provided tokenizer. Provides both conversation-level totals and per-message breakdowns.

Example

>>> from oumi.analyze.analyzers.length import LengthAnalyzer
>>> from oumi.core.types.conversation import Conversation, Message, Role
>>>
>>> analyzer = LengthAnalyzer.from_config({"tokenizer_name": "cl100k_base"})
>>> conversation = Conversation(messages=[
...     Message(role=Role.USER, content="Hello, how are you?"),
...     Message(role=Role.ASSISTANT, content="I'm doing well, thanks!"),
... ])
>>> result = analyzer.analyze(conversation)
>>> print(f"Total tokens: {result.total_tokens}")
Total tokens: 12
Parameters:

tokenizer – Tokenizer instance for token counting. Must have an encode(text) -> list method. Use from_config() to construct from a tokenizer name, or pass any compatible tokenizer directly.
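Any object with a matching encode method works. The relationship between the per-message and aggregate fields reported in LengthMetrics can be sketched with a toy whitespace tokenizer (an assumption for illustration; the real analyzer uses whatever tokenizer you configure, so actual counts will differ):

```python
# Toy whitespace "tokenizer" satisfying the encode(text) -> list contract.
class WhitespaceTokenizer:
    def encode(self, text: str) -> list[int]:
        return [hash(tok) for tok in text.split()]

tok = WhitespaceTokenizer()
messages = ["Hello, how are you?", "I'm doing well, thanks!"]

# Per-message counts, then the aggregates LengthMetrics reports.
message_token_counts = [len(tok.encode(m)) for m in messages]
total_tokens = sum(message_token_counts)
avg_tokens_per_message = total_tokens / len(messages)
print(message_token_counts, total_tokens, avg_tokens_per_message)  # [4, 4] 8 4.0
```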

analyze(conversation: Conversation) LengthMetrics[source]#

Analyze token length metrics for a conversation.

Parameters:

conversation – The conversation to analyze.

Returns:

LengthMetrics containing token counts.

analyze_text(text: str) LengthMetrics[source]#

Analyze token length metrics for a single text string.

Convenience method for analyzing text without creating a Conversation.

Parameters:

text – The text to analyze.

Returns:

LengthMetrics for the text (treated as a single message).

classmethod from_config(config: dict[str, Any]) LengthAnalyzer[source]#

Create a LengthAnalyzer from a config dictionary.

Parameters:

config – See LengthAnalyzerConfig for supported keys.

Returns:

LengthAnalyzer instance with configured tokenizer.
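The config is a plain dict whose keys mirror LengthAnalyzerConfig (documented below). A sketch of the expected shape, using the defaults shown in that model:

```python
# Config dict for LengthAnalyzer.from_config(); keys mirror LengthAnalyzerConfig.
# Both keys are optional; the values below are the documented defaults.
config = {
    "tokenizer_name": "cl100k_base",  # tokenizer to load for token counting
    "trust_remote_code": False,       # whether remote tokenizer code may run
}
# analyzer = LengthAnalyzer.from_config(config)  # requires oumi installed
```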

get_available_metric_names() list[str][source]#

Return metrics this instance will produce.

Excludes rendered_tokens when the tokenizer doesn't support apply_chat_template (e.g. a tiktoken encoding, or when no tokenizer is provided).

classmethod get_config_schema() dict[str, Any][source]#

Get JSON schema for this analyzer’s configuration.

class oumi.analyze.analyzers.LengthAnalyzerConfig(*, tokenizer_name: str = 'cl100k_base', trust_remote_code: bool = False)[source]#

Bases: BaseModel

Configuration for LengthAnalyzer.

model_config = {}#

Configuration for the model; should be a dictionary conforming to pydantic's ConfigDict.

tokenizer_name: str#
trust_remote_code: bool#
class oumi.analyze.analyzers.LengthMetrics(*, total_tokens: int, rendered_tokens: int | None = None, avg_tokens_per_message: float, message_token_counts: list[int], num_messages: int, user_total_tokens: int = 0, assistant_total_tokens: int = 0, system_total_tokens: int = 0, tool_total_tokens: int = 0)[source]#

Bases: BaseModel

Result model for length analysis of conversations.

Example

>>> result = LengthMetrics(
...     total_tokens=25,
...     avg_tokens_per_message=12.5,
...     message_token_counts=[10, 15],
...     num_messages=2,
... )
>>> print(result.total_tokens)
25
assistant_total_tokens: int#
avg_tokens_per_message: float#
message_token_counts: list[int]#
model_config = {}#

Configuration for the model; should be a dictionary conforming to pydantic's ConfigDict.

num_messages: int#
rendered_tokens: int | None#
system_total_tokens: int#
tool_total_tokens: int#
total_tokens: int#
user_total_tokens: int#
class oumi.analyze.analyzers.Tokenizer(*args, **kwargs)[source]#

Bases: Protocol

Protocol for tokenizers used by LengthAnalyzer.

encode(text: str) list[int][source]#

Encode text to token IDs.
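Because Tokenizer is a Protocol, any object with a matching encode method satisfies it structurally; no subclassing is needed. A minimal sketch, using a stand-alone copy of the protocol for illustration (the real one lives in oumi.analyze.analyzers, and the CharTokenizer here is hypothetical):

```python
from typing import Protocol, runtime_checkable

@runtime_checkable
class Tokenizer(Protocol):
    """Stand-in copy of the protocol: anything with encode(text) -> list[int]."""
    def encode(self, text: str) -> list[int]: ...

class CharTokenizer:
    """Toy tokenizer: one token per character (its code point)."""
    def encode(self, text: str) -> list[int]:
        return [ord(ch) for ch in text]

tok = CharTokenizer()
print(isinstance(tok, Tokenizer))  # True: structural match, no inheritance
print(tok.encode("Hi"))            # [72, 105]
```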

class oumi.analyze.analyzers.TurnStatsAnalyzer[source]#

Bases: ConversationAnalyzer[TurnStatsMetrics]

Analyzer for computing turn statistics of conversations.

Computes turn counts and per-role statistics to help understand conversation structure and balance.

Example

>>> from oumi.analyze.analyzers.turn_stats import TurnStatsAnalyzer
>>> from oumi.core.types.conversation import Conversation, Message, Role
>>>
>>> analyzer = TurnStatsAnalyzer()
>>> conversation = Conversation(messages=[
...     Message(role=Role.USER, content="What is Python?"),
...     Message(
...         role=Role.ASSISTANT,
...         content="Python is a programming language.",
...     ),
... ])
>>> result = analyzer.analyze(conversation)
>>> print(f"Turns: {result.num_turns}")
Turns: 2
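The per-role counting behind TurnStatsMetrics can be sketched in plain Python. Role strings stand in for oumi's Role enum, the helper name is illustrative, and the sketch assumes system messages are excluded from num_turns (an assumption, not confirmed by the docs above):

```python
from collections import Counter

def turn_stats(roles: list[str]) -> dict:
    """Illustrative per-role counting, mirroring TurnStatsMetrics fields."""
    counts = Counter(roles)
    # Assumption: system messages do not count as turns.
    non_system = [r for r in roles if r != "system"]
    return {
        "num_turns": len(non_system),
        "num_user_turns": counts["user"],
        "num_assistant_turns": counts["assistant"],
        "num_tool_turns": counts["tool"],
        "has_system_message": "system" in counts,
        "first_turn_role": non_system[0] if non_system else None,
        "last_turn_role": non_system[-1] if non_system else None,
    }

stats = turn_stats(["system", "user", "assistant", "user", "assistant"])
print(stats["num_turns"], stats["has_system_message"])  # 4 True
```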
analyze(conversation: Conversation) TurnStatsMetrics[source]#

Analyze turn statistics for a conversation.

Parameters:

conversation – The conversation to analyze.

Returns:

TurnStatsMetrics containing turn counts and statistics.

classmethod get_config_schema() dict[source]#

Get JSON schema for TurnStatsAnalyzer configuration.

class oumi.analyze.analyzers.TurnStatsMetrics(*, num_turns: int, num_user_turns: int, num_assistant_turns: int, num_tool_turns: int = 0, has_system_message: bool, first_turn_role: str | None = None, last_turn_role: str | None = None)[source]#

Bases: BaseModel

Result model for turn statistics analysis of conversations.

Example

>>> result = TurnStatsMetrics(
...     num_turns=4,
...     num_user_turns=2,
...     num_assistant_turns=2,
...     has_system_message=False,
...     first_turn_role="user",
...     last_turn_role="assistant",
... )
>>> print(result.num_turns)
4
first_turn_role: str | None#
has_system_message: bool#
last_turn_role: str | None#
model_config = {}#

Configuration for the model; should be a dictionary conforming to pydantic's ConfigDict.

num_assistant_turns: int#
num_tool_turns: int#
num_turns: int#
num_user_turns: int#