oumi.core.datasets#
Core datasets module for the Oumi (Open Universal Machine Intelligence) library.
This module provides base classes for the different types of datasets used in the Oumi framework. These base classes can be extended to create custom datasets tailored to specific machine learning tasks.
For more detailed information on each class, please refer to their respective documentation.
- class oumi.core.datasets.BaseDpoDataset(*, dataset_name: str | None = None, dataset_path: str | None = None, split: str | None = None, tokenizer: PreTrainedTokenizerBase | None = None, return_tensors: bool = False, **kwargs)[source]#
Bases:
BaseMapDataset
Preprocess the samples to the Oumi format.
- dataset_name: str#
- trust_remote_code: bool#
- class oumi.core.datasets.BaseExperimentalDpoDataset(*args, **kwargs)[source]#
Bases:
BaseDpoDataset
Preprocess the samples to the Oumi format.
Warning
This class is experimental and subject to change.
- dataset_name: str#
- trust_remote_code: bool#
- class oumi.core.datasets.BaseExperimentalGrpoDataset(*, dataset_name: str | None = None, dataset_path: str | None = None, split: str | None = None, **kwargs)[source]#
Bases:
BaseMapDataset
Preprocess the GRPO samples to the Oumi format.
Warning
This class is experimental and subject to change.
- conversation(idx: int) Conversation [source]#
Returns the conversation at the specified index.
- Parameters:
idx (int) – The index of the conversation to retrieve.
- Returns:
The conversation at the specified index.
- Return type:
Conversation
- conversations() list[Conversation] [source]#
Returns a list of all conversations.
- dataset_name: str#
- transform_conversation(sample: dict | Series) Conversation [source]#
Converts the input sample to a Conversation.
- Parameters:
sample (Union[dict, pd.Series]) – The input example.
- Returns:
The resulting conversation.
- Return type:
Conversation
- trust_remote_code: bool#
- class oumi.core.datasets.BaseExperimentalKtoDataset(*, dataset_name: str | None = None, dataset_path: str | None = None, split: str | None = None, **kwargs)[source]#
Bases:
BaseMapDataset
Base class for KTO (Kahneman-Tversky Optimization) datasets.
This class provides a comprehensive foundation for creating KTO datasets that work with binary feedback signals rather than preference pairs. KTO is an alignment method that optimizes language models based on simple binary labels indicating whether outputs are desirable or undesirable, making it simpler than preference-based methods like DPO, which require paired comparisons.
The class handles the standardization of diverse dataset formats into the consistent KTO format required by training frameworks. It supports both string-based completions and chat-formatted conversations, automatically extracting assistant responses when needed.
- Key Features:
Standardized KTO format with prompt, completion, and binary label
Automatic handling of chat format vs string format completions
Optimized feature schema for efficient dataset processing
Memory-efficient processing for large datasets
Consistent API across different KTO dataset implementations
- Dataset Format:
The standardized KTO format includes:
- prompt (str): The input text given to the model
- completion (str): The model’s response to be evaluated
- label (bool): True for desirable responses, False for undesirable ones
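For illustration, a single desirable record in this format might look like the following (the content is hypothetical):

{
    "prompt": "Summarize the benefits of unit testing.",
    "completion": "Unit tests catch regressions early and document intended behavior.",
    "label": true
}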
- Usage:
Subclasses should implement the _load_data() method to load their specific dataset format and optionally override _transform_kto_example() for custom preprocessing logic.
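A minimal sketch of such a subclass, assuming the raw data is a local JSONL file whose records already carry prompt, completion, and label fields (the class name, dataset identifier, and file layout are hypothetical):

import pandas as pd

from oumi.core.datasets import BaseExperimentalKtoDataset


class MyKtoDataset(BaseExperimentalKtoDataset):
    """Hypothetical KTO dataset backed by a local JSONL file."""

    default_dataset = "my_org/my_kto_data"  # hypothetical identifier

    def _load_data(self) -> pd.DataFrame:
        # Assumes every JSONL record already has "prompt", "completion",
        # and a boolean "label" field, matching the format described above.
        return pd.read_json(self.dataset_path, lines=True)

An instance can then be constructed with dataset_path pointing at the local file.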
Warning
This class is experimental and subject to change as KTO training methods evolve and mature.
See also
TRL KTO Trainer: https://huggingface.co/docs/trl/main/en/kto_trainer
KTO Paper: https://arxiv.org/abs/2402.01306
- dataset_name: str#
- transform(sample: dict) dict [source]#
Transform a raw dataset sample into the standardized format.
This is the main entry point for processing dataset samples. It delegates to the KTO-specific transformation method to ensure consistent formatting across all KTO datasets.
- Parameters:
sample (dict) – A raw dataset sample containing prompt, completion, and label information.
- Returns:
The transformed sample in KTO format with the standardized keys prompt, completion, and label.
- Return type:
dict
- trust_remote_code: bool#
- class oumi.core.datasets.BaseIterableDataset(*, dataset_name: str | None = None, dataset_path: str | None = None, subset: str | None = None, split: str | None = None, trust_remote_code: bool = False, stream: bool = True, **kwargs)[source]#
Bases:
IterDataPipe, ABC
Abstract base class for iterable datasets.
- property data: Iterable[Any]#
Returns the underlying dataset data.
- dataset_name: str#
- dataset_path: str | None = None#
- default_dataset: str | None = None#
- default_subset: str | None = None#
- to_hf(return_iterable: bool = True) IterableDataset [source]#
Converts the dataset to a Hugging Face dataset.
- abstractmethod transform(sample: Any) dict[str, Any] [source]#
Preprocesses the inputs in the given sample.
- Parameters:
sample (Any) – A sample from the dataset.
- Returns:
A dictionary containing the preprocessed input data.
- Return type:
dict
- trust_remote_code: bool = False#
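To make the abstract transform() concrete, here is a minimal sketch of an iterable subclass, assuming each streamed record is a dict with a "text" field (the class name and dataset identifier are hypothetical):

from typing import Any

from oumi.core.datasets import BaseIterableDataset


class MyStreamingTextDataset(BaseIterableDataset):
    """Hypothetical streaming dataset that keeps only the text field."""

    default_dataset = "my_org/my_streaming_corpus"  # hypothetical identifier

    def transform(self, sample: Any) -> dict[str, Any]:
        # Assumes each raw sample is a dict containing a "text" key.
        return {"text": sample["text"]}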
- class oumi.core.datasets.BaseMapDataset(*, dataset_name: str | None, dataset_path: str | None = None, subset: str | None = None, split: str | None = None, trust_remote_code: bool = False, transform_num_workers: str | int | None = None, **kwargs)[source]#
Bases:
MapDataPipe, Sized, ABC
Abstract base class for map datasets.
- __getitem__(idx: int) dict [source]#
Gets the item at the specified index.
- Parameters:
idx (int) – The index of the item to retrieve.
- Returns:
The item at the specified index.
- Return type:
dict
- __len__() int [source]#
Gets the number of items in the dataset.
- Returns:
The number of items in the dataset.
- Return type:
int
- property data: DataFrame#
Returns the underlying dataset data.
- dataset_name: str#
- dataset_path: str | None = None#
- default_dataset: str | None = None#
- default_subset: str | None = None#
- raw(idx: int) Series [source]#
Returns the raw data at the specified index.
- Parameters:
idx (int) – The index of the data to retrieve.
- Returns:
The raw data at the specified index.
- Return type:
pd.Series
- to_hf(return_iterable: bool = False) Dataset | IterableDataset [source]#
Converts the dataset to a Hugging Face dataset.
- Parameters:
return_iterable – Whether to return an iterable dataset. Iterable datasets aren’t cached to disk, which can sometimes be advantageous. For example, if transformed examples are very large (e.g., if pixel_values are large for multimodal data), or if you don’t want to post-process the whole dataset before training starts.
- Returns:
A HuggingFace dataset. Can be datasets.Dataset or datasets.IterableDataset depending on the value of return_iterable.
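A quick sketch of both modes, assuming my_dataset is an already-constructed BaseMapDataset subclass:

# Cached datasets.Dataset (the default).
hf_dataset = my_dataset.to_hf()

# Lazily transformed datasets.IterableDataset, not cached to disk.
hf_iterable = my_dataset.to_hf(return_iterable=True)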
- abstractmethod transform(sample: Series) dict [source]#
Preprocesses the inputs in the given sample.
- Parameters:
sample (pd.Series) – The raw sample to preprocess.
- Returns:
A dictionary containing the preprocessed input data.
- Return type:
dict
- transform_num_workers: str | int | None = None#
- trust_remote_code: bool#
- class oumi.core.datasets.BasePretrainingDataset(*, tokenizer: PreTrainedTokenizerBase, seq_length: int, dataset_text_field: str = 'text', append_concat_token: bool = True, add_special_tokens: bool = True, skip_last: bool = True, **kwargs)[source]#
Bases:
BaseIterableDataset
Base class for pretraining iterable datasets.
This class extends BaseIterableDataset to provide functionality specific to pretraining tasks.
- Variables:
tokenizer (BaseTokenizer) – The tokenizer used for text encoding.
seq_length (int) – The desired sequence length for model inputs.
concat_token_id (int) – The ID of the token used to concatenate documents.
Example
>>> from transformers import AutoTokenizer
>>> from oumi.builders import build_tokenizer
>>> from oumi.core.configs import ModelParams
>>> from oumi.core.datasets import BasePretrainingDataset
>>> tokenizer = build_tokenizer(ModelParams(model_name="gpt2"))
>>> dataset = BasePretrainingDataset(
...     dataset_name="wikimedia/wikipedia",
...     subset="20231101.en",
...     split="train",
...     tokenizer=tokenizer,
...     seq_length=512
... )
>>> example = next(iter(dataset))
- __iter__()[source]#
Iterates over the dataset and yields samples of a specified sequence length.
The underlying dataset is a stream of documents. Each document is expected to contain a text field (self._dataset_text_field) that is tokenized. Training samples are then yielded in sequences of length self.seq_length.
Because this iterator may yield samples that span document boundaries, self.concat_token_id can optionally be used to separate sequences coming from different documents.
- dataset_name: str#
- class oumi.core.datasets.BaseSftDataset(*, dataset_name: str | None = None, dataset_path: str | None = None, split: str | None = None, tokenizer: PreTrainedTokenizerBase | None = None, task: Literal['sft', 'generation', 'auto'] = 'auto', return_tensors: bool = False, text_col: str = 'text', assistant_only: bool = False, response_template: str | None = None, instruction_template: str | None = None, return_conversations: bool = False, **kwargs)[source]#
Bases:
BaseMapDataset, ABC
In-memory dataset for SFT data.
- property assistant_only: bool#
Gets whether the dataset is set to train only on assistant turns.
- conversation(idx: int) Conversation [source]#
Returns the conversation at the specified index.
- Parameters:
idx (int) – The index of the conversation to retrieve.
- Returns:
The conversation at the specified index.
- Return type:
Conversation
- conversations() list[Conversation] [source]#
Returns a list of all conversations.
- dataset_name: str#
- default_dataset: str | None = None#
- prompt(idx: int) str [source]#
Returns the prompt at the specified index.
- Parameters:
idx (int) – The index of the prompt to retrieve.
- Returns:
The prompt at the specified index.
- Return type:
str
- property task: str#
Gets the task mode for the dataset.
The generated prompt is often different for generation vs SFT tasks.
- property text_col: str#
Gets the text target column.
The generated text will be stored in this column.
- tokenize(sample: dict | Series | Conversation, tokenize: bool = True) dict [source]#
Applies the chat template carried by the tokenizer to the input example.
- Parameters:
sample (Union[dict, pd.Series, Conversation]) – A mapping whose "messages" key contains the (ordered) messages exchanged within a single chat dialogue. Each item of sample["messages"] is a dict holding the content of the message and the role of the participant who sent it, e.g. role == 'user' or role == 'assistant'.
tokenize (bool) – Whether to tokenize the messages or not.
- Raises:
NotImplementedError – Currently only the sft task mode is supported.
ValueError – If the requested task is not "sft" or "generation".
- Returns:
The input example dictionary with an added text key, mapped to a string carrying the messages rendered in the tokenizer's chat format.
- Return type:
Dict
- abstractmethod transform_conversation(example: dict | Series) Conversation [source]#
Preprocesses the inputs of the example and returns an Oumi Conversation.
- Parameters:
example (Union[dict, pd.Series]) – The example containing the input and instruction.
- Returns:
The preprocessed inputs as an Oumi Conversation.
- Return type:
Conversation
- trust_remote_code: bool#
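For reference, a minimal sketch of a concrete subclass, assuming the raw examples carry "instruction" and "output" columns and that the Message and Role types are available from oumi.core.types.conversation (the class name and dataset identifier are hypothetical):

from typing import Union

import pandas as pd

from oumi.core.datasets import BaseSftDataset
from oumi.core.types.conversation import Conversation, Message, Role


class MyInstructionDataset(BaseSftDataset):
    """Hypothetical SFT dataset with "instruction" and "output" columns."""

    default_dataset = "my_org/my_instruction_data"  # hypothetical identifier

    def transform_conversation(self, example: Union[dict, pd.Series]) -> Conversation:
        # Assumes each raw example has "instruction" and "output" fields.
        return Conversation(
            messages=[
                Message(role=Role.USER, content=example["instruction"]),
                Message(role=Role.ASSISTANT, content=example["output"]),
            ]
        )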
- class oumi.core.datasets.PackedSftDataset(base_dataset: BaseSftDataset, max_seq_len: int, show_progress: bool = True, split_samples: bool = False, concat_token_id: int | None = None, pad_token_id: int | None = None, enable_padding: bool = True, **kwargs)[source]#
Bases:
BaseMapDataset
A dataset that packs samples from a base SFT dataset to maximize efficiency.
- dataset_name: str#
- trust_remote_code: bool#
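A brief usage sketch, assuming sft_dataset is an instantiated BaseSftDataset subclass that yields tokenized samples and that tokenizer was built beforehand:

from oumi.core.datasets import PackedSftDataset

packed = PackedSftDataset(
    base_dataset=sft_dataset,                 # assumed to yield tokenized samples
    max_seq_len=2048,
    concat_token_id=tokenizer.eos_token_id,
    pad_token_id=tokenizer.pad_token_id,
)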
- class oumi.core.datasets.PretrainingAsyncTextDataset(tokenizer: PreTrainedTokenizerBase | None, dataset: Dataset, dataset_text_field: str | None = None, formatting_func: Callable | None = None, infinite: bool = False, seq_length: int = 1024, sequence_buffer_size: int = 1024, eos_token_id: int = 0, shuffle: bool = False, append_concat_token: bool = True, add_special_tokens: bool = True, pretokenized: bool = True)[source]#
Bases:
IterableDataset
Iterable dataset that returns constant length chunks of tokens.
Prefetches, formats, and tokenizes asynchronously from main thread.
- property column_names: list[str]#
Returns the column names of the dataset.
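A brief usage sketch, assuming a raw-text Hugging Face dataset and a previously built tokenizer; it is assumed that pretokenized=False makes the dataset tokenize the configured text field on the fly:

from datasets import load_dataset

from oumi.core.datasets import PretrainingAsyncTextDataset

hf_dataset = load_dataset("wikimedia/wikipedia", "20231101.en", split="train")
dataset = PretrainingAsyncTextDataset(
    tokenizer=tokenizer,          # assumes a previously built tokenizer
    dataset=hf_dataset,
    dataset_text_field="text",
    seq_length=1024,
    pretokenized=False,
)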
- class oumi.core.datasets.VisionLanguageDpoDataset(*, dataset_name: str | None = None, dataset_path: str | None = None, split: str | None = None, tokenizer: PreTrainedTokenizerBase | None = None, return_tensors: bool = False, processor: Any | None = None, processor_name: str | None = None, trust_remote_code: bool = False, processor_kwargs: dict[str, Any] | None = None, max_size: int | None = None, prompt_key: str = 'prompt', chosen_key: str = 'chosen', rejected_key: str = 'rejected', images_key: str = 'images', **kwargs)[source]#
Bases:
BaseDpoDataset
Dataset for vision-language DPO (Direct Preference Optimization) models.
This class extends BaseDpoDataset to provide functionality specific to vision-language preference optimization tasks. It handles the processing of both image and text data for preference learning.
The dataset expects data in one of the following formats:
{
    "prompt": "What's in this image?",
    "images": ["path/to/image.jpg", ...],  # Optional image paths/URLs
    "chosen": [{"role": "assistant", "content": "I see a cat"}],
    "rejected": [{"role": "assistant", "content": "I see a dog"}]
}
OR
{
    "prompt": "What's in this image?",
    "images": ["path/to/image.jpg", ...],
    "chosen": "preferred response",
    "rejected": "rejected response"
}
- dataset_name: str#
- transform_preference(sample: dict) dict [source]#
Transforms a raw DPO sample into the format expected by the DPO trainer: three Oumi Conversation objects (prompt, chosen, and rejected).
- Parameters:
sample (dict) – A dictionary representing a single DPO preference example.
- Returns:
A dict with the prompt, chosen, and rejected conversations or processed features.
- trust_remote_code: bool#
- class oumi.core.datasets.VisionLanguageSftDataset(*, return_conversations: bool = False, tokenizer: PreTrainedTokenizerBase | None = None, processor: BaseProcessor | None = None, processor_name: str | None = None, processor_kwargs: dict[str, Any] | None = None, limit: int | None = None, trust_remote_code: bool = False, max_images: int | None = None, **kwargs)[source]#
Bases:
BaseSftDataset, ABC
Abstract dataset for vision-language models.
This class extends BaseSftDataset to provide functionality specific to vision-language tasks. It handles the processing of both image and text data.
Note
This dataset is designed to work with models that can process both image and text inputs simultaneously, such as CLIP, BLIP, or other multimodal architectures.
Example
>>> from oumi.builders import build_processor, build_tokenizer
>>> from oumi.core.configs import ModelParams
>>> from oumi.core.types.conversation import Conversation
>>> from oumi.core.datasets import VisionLanguageSftDataset
>>> class MyVisionLanguageSftDataset(VisionLanguageSftDataset):
...     def transform_conversation(self, example: dict):
...         # Implement the abstract method
...         # Convert the raw example into a Conversation object
...         pass
>>> tokenizer = build_tokenizer(
...     ModelParams(model_name="Qwen/Qwen2-1.5B-Instruct")
... )
>>> dataset = MyVisionLanguageSftDataset(
...     tokenizer=tokenizer,
...     processor_name="openai/clip-vit-base-patch32",
...     dataset_name="coco_captions",
...     split="train"
... )
>>> sample = next(iter(dataset))
>>> print(sample.keys())
- dataset_name: str#
- transform(sample: dict) dict [source]#
Transforms an Oumi conversation into a dictionary of inputs for a model.
- Parameters:
sample (dict) – A dictionary representing a single conversation example.
- Returns:
A dictionary of inputs for a model.
- Return type:
dict
- abstractmethod transform_conversation(example: dict) Conversation [source]#
Transforms a raw example into an Oumi Conversation object.
- Parameters:
example (dict) – A dictionary representing a single conversation example.
- Returns:
A Conversation object representing the conversation.
- Return type:
Conversation
- trust_remote_code: bool#