oumi.analyze.analyzers#
Analyzer implementations and result models.
This module contains concrete analyzer implementations that inherit from the base analyzer classes and return typed result models. Each analyzer file contains both the analyzer class and its result model for better cohesion.
- class oumi.analyze.analyzers.DataQualityAnalyzer[source]#
Bases: ConversationAnalyzer[DataQualityMetrics]

Analyzer for basic data quality checks on conversations.

Checks for three common data quality issues without requiring an LLM:
- Non-alternating user/assistant message patterns
- Empty or whitespace-only turns
- Values serialized as strings (NaN, null, None, undefined)
Example
>>> from oumi.analyze.analyzers.quality import DataQualityAnalyzer
>>> from oumi.core.types.conversation import Conversation, Message, Role
>>>
>>> analyzer = DataQualityAnalyzer()
>>> conversation = Conversation(messages=[
...     Message(role=Role.USER, content="Hello"),
...     Message(role=Role.ASSISTANT, content="Hi there!"),
... ])
>>> result = analyzer.analyze(conversation)
>>> print(result.has_non_alternating_turns)
False
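The three checks above can be sketched in plain Python. The helper below is a hypothetical stand-in operating on (role, content) pairs, not the analyzer's actual implementation:

```python
# Assumed string patterns for the "values serialized as strings" check.
INVALID_PATTERNS = {"nan", "null", "none", "undefined"}


def check_quality(turns: list[tuple[str, str]]) -> dict:
    """Run the three data quality checks on (role, content) pairs."""
    roles = [role for role, _ in turns]
    # Non-alternating: two consecutive turns with the same role.
    non_alternating = any(a == b for a, b in zip(roles, roles[1:]))
    # Empty or whitespace-only content.
    empty_count = sum(1 for _, content in turns if not content.strip())
    # Entire message is a serialized null-like value, e.g. "NaN" or "null".
    invalid = sorted({
        content.strip().lower()
        for _, content in turns
        if content.strip().lower() in INVALID_PATTERNS
    })
    return {
        "has_non_alternating_turns": non_alternating,
        "has_empty_turns": empty_count > 0,
        "empty_turn_count": empty_count,
        "has_invalid_values": bool(invalid),
        "invalid_value_patterns": invalid,
    }


result = check_quality([
    ("user", "Hello"),
    ("assistant", "NaN"),
    ("assistant", "   "),  # consecutive assistant turns, one whitespace-only
])
```

The returned dict mirrors the fields of DataQualityMetrics; the real analyzer returns that typed model instead.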
- analyze(conversation: Conversation) → DataQualityMetrics[source]#
Analyze data quality for a conversation.
- Parameters:
conversation – The conversation to analyze.
- Returns:
DataQualityMetrics with the quality check results.
- class oumi.analyze.analyzers.DataQualityMetrics(*, has_non_alternating_turns: bool, has_empty_turns: bool, empty_turn_count: int, has_invalid_values: bool, invalid_value_patterns: list[str])[source]#
Bases: BaseModel

Result model for data quality checks on a conversation.
Example
>>> result = DataQualityMetrics(
...     has_non_alternating_turns=False,
...     has_empty_turns=False,
...     empty_turn_count=0,
...     has_invalid_values=False,
...     invalid_value_patterns=[],
... )
>>> print(result.has_non_alternating_turns)
False
- empty_turn_count: int#
- has_empty_turns: bool#
- has_invalid_values: bool#
- has_non_alternating_turns: bool#
- invalid_value_patterns: list[str]#
- model_config = {}#
Configuration for the model; should be a dictionary conforming to pydantic's ConfigDict.
- class oumi.analyze.analyzers.LengthAnalyzer(tokenizer: Tokenizer | None = None)[source]#
Bases: ConversationAnalyzer[LengthMetrics]

Analyzer for computing token length metrics of conversations.
Computes token counts for conversations using a provided tokenizer. Provides both conversation-level totals and per-message breakdowns.
Example
>>> from oumi.analyze.analyzers.length import LengthAnalyzer
>>> from oumi.core.types.conversation import Conversation, Message, Role
>>>
>>> analyzer = LengthAnalyzer.from_config({"tokenizer_name": "cl100k_base"})
>>> conversation = Conversation(messages=[
...     Message(role=Role.USER, content="Hello, how are you?"),
...     Message(role=Role.ASSISTANT, content="I'm doing well, thanks!"),
... ])
>>> result = analyzer.analyze(conversation)
>>> print(f"Total tokens: {result.total_tokens}")
Total tokens: 12
- Parameters:
tokenizer – Tokenizer instance for token counting. Must have an encode(text) -> list method. Use from_config() to construct from a tokenizer name, or pass any compatible tokenizer directly.
- analyze(conversation: Conversation) → LengthMetrics[source]#
Analyze token length metrics for a conversation.
- Parameters:
conversation – The conversation to analyze.
- Returns:
LengthMetrics containing token counts.
- analyze_text(text: str) → LengthMetrics[source]#
Analyze token length metrics for a single text string.
Convenience method for analyzing text without creating a Conversation.
- Parameters:
text – The text to analyze.
- Returns:
LengthMetrics for the text (treated as a single message).
- classmethod from_config(config: dict[str, Any]) → LengthAnalyzer[source]#
Create a LengthAnalyzer from a config dictionary.
- Parameters:
config – See LengthAnalyzerConfig for supported keys.
- Returns:
LengthAnalyzer instance with configured tokenizer.
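Per the LengthAnalyzerConfig signature below, the config dictionary accepts two keys; their defaults are shown here. What each value is passed to when loading the tokenizer is an assumption in the comments:

```python
# Config keys accepted by LengthAnalyzer.from_config, mirroring
# LengthAnalyzerConfig's fields (values shown are the documented defaults).
config = {
    "tokenizer_name": "cl100k_base",  # tokenizer/encoding to load (assumption: a tiktoken or HF name)
    "trust_remote_code": False,       # assumption: forwarded to the tokenizer loader
}
```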
- class oumi.analyze.analyzers.LengthAnalyzerConfig(*, tokenizer_name: str = 'cl100k_base', trust_remote_code: bool = False)[source]#
Bases: BaseModel

Configuration for LengthAnalyzer.
- model_config = {}#
Configuration for the model; should be a dictionary conforming to pydantic's ConfigDict.
- tokenizer_name: str#
- trust_remote_code: bool#
- class oumi.analyze.analyzers.LengthMetrics(*, total_tokens: int, rendered_tokens: int | None = None, avg_tokens_per_message: float, message_token_counts: list[int], num_messages: int, user_total_tokens: int = 0, assistant_total_tokens: int = 0, system_total_tokens: int = 0, tool_total_tokens: int = 0)[source]#
Bases: BaseModel

Result model for length analysis of conversations.
Example
>>> result = LengthMetrics(
...     total_tokens=25,
...     avg_tokens_per_message=12.5,
...     message_token_counts=[10, 15],
...     num_messages=2,
... )
>>> print(result.total_tokens)
25
- assistant_total_tokens: int#
- avg_tokens_per_message: float#
- message_token_counts: list[int]#
- model_config = {}#
Configuration for the model; should be a dictionary conforming to pydantic's ConfigDict.
- num_messages: int#
- rendered_tokens: int | None#
- system_total_tokens: int#
- tool_total_tokens: int#
- total_tokens: int#
- user_total_tokens: int#
- class oumi.analyze.analyzers.Tokenizer(*args, **kwargs)[source]#
Bases: Protocol

Protocol for tokenizers used by LengthAnalyzer.
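Per the LengthAnalyzer parameter docs above, the protocol only requires an encode(text) -> list method. A toy whitespace tokenizer satisfying it might look like this (an illustrative stand-in; real usage would pass a tiktoken or Hugging Face tokenizer with the same method):

```python
class WhitespaceTokenizer:
    """Toy tokenizer satisfying the encode(text) -> list protocol.

    For illustration only; token counts from a real BPE tokenizer
    will differ from simple whitespace splitting.
    """

    def encode(self, text: str) -> list[str]:
        # Treat each whitespace-separated chunk as one "token".
        return text.split()


tokens = WhitespaceTokenizer().encode("Hello, how are you?")
```

An instance could then be passed directly, e.g. LengthAnalyzer(tokenizer=WhitespaceTokenizer()), since only len(encode(text)) matters for counting.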
- class oumi.analyze.analyzers.TurnStatsAnalyzer[source]#
Bases: ConversationAnalyzer[TurnStatsMetrics]

Analyzer for computing turn statistics of conversations.
Computes turn counts and per-role statistics to help understand conversation structure and balance.
Example
>>> from oumi.analyze.analyzers.turn_stats import TurnStatsAnalyzer
>>> from oumi.core.types.conversation import Conversation, Message, Role
>>>
>>> analyzer = TurnStatsAnalyzer()
>>> conversation = Conversation(messages=[
...     Message(role=Role.USER, content="What is Python?"),
...     Message(
...         role=Role.ASSISTANT,
...         content="Python is a programming language.",
...     ),
... ])
>>> result = analyzer.analyze(conversation)
>>> print(f"Turns: {result.num_turns}")
Turns: 2
- analyze(conversation: Conversation) → TurnStatsMetrics[source]#
Analyze turn statistics for a conversation.
- Parameters:
conversation – The conversation to analyze.
- Returns:
TurnStatsMetrics containing turn counts and statistics.
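The statistics could be sketched as a simple pass over the message roles. This is a hypothetical helper, not the analyzer's implementation; in particular, whether a system message counts toward num_turns is an assumption here (this sketch excludes it):

```python
def turn_stats(roles: list[str]) -> dict:
    """Compute turn statistics from an ordered list of message roles."""
    # Assumption: the system message is excluded from turn counts and
    # from first/last turn role.
    non_system = [r for r in roles if r != "system"]
    return {
        "num_turns": len(non_system),
        "num_user_turns": roles.count("user"),
        "num_assistant_turns": roles.count("assistant"),
        "num_tool_turns": roles.count("tool"),
        "has_system_message": "system" in roles,
        "first_turn_role": non_system[0] if non_system else None,
        "last_turn_role": non_system[-1] if non_system else None,
    }


stats = turn_stats(["system", "user", "assistant", "user", "assistant"])
```

The returned dict mirrors the fields of TurnStatsMetrics documented below.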
- class oumi.analyze.analyzers.TurnStatsMetrics(*, num_turns: int, num_user_turns: int, num_assistant_turns: int, num_tool_turns: int = 0, has_system_message: bool, first_turn_role: str | None = None, last_turn_role: str | None = None)[source]#
Bases: BaseModel

Result model for turn statistics analysis of conversations.
Example
>>> result = TurnStatsMetrics(
...     num_turns=4,
...     num_user_turns=2,
...     num_assistant_turns=2,
...     has_system_message=False,
...     first_turn_role="user",
...     last_turn_role="assistant",
... )
>>> print(result.num_turns)
4
- first_turn_role: str | None#
- has_system_message: bool#
- last_turn_role: str | None#
- model_config = {}#
Configuration for the model; should be a dictionary conforming to pydantic's ConfigDict.
- num_assistant_turns: int#
- num_tool_turns: int#
- num_turns: int#
- num_user_turns: int#