oumi.core.datasets#
Core datasets module for the Oumi (Open Universal Machine Intelligence) library.
This module provides base classes for the different types of datasets used in the Oumi framework. These base classes can be extended to create custom datasets tailored to specific machine learning tasks.
For more detailed information on each class, please refer to their respective documentation.
- class oumi.core.datasets.BaseDpoDataset(*, dataset_name: str | None = None, dataset_path: str | None = None, split: str | None = None, tokenizer: PreTrainedTokenizerBase | None = None, return_tensors: bool = False, **kwargs)[source]#
Bases:
BaseMapDataset
Preprocess the samples to the Oumi format.
- dataset_name: str#
- trust_remote_code: bool#
- class oumi.core.datasets.BaseExperimentalDpoDataset(*args, **kwargs)[source]#
Bases:
BaseDpoDataset
Preprocess the samples to the Oumi format.
Warning
This class is experimental and subject to change.
- dataset_name: str#
- trust_remote_code: bool#
- class oumi.core.datasets.BaseExperimentalGrpoDataset(*, dataset_name: str | None = None, dataset_path: str | None = None, split: str | None = None, **kwargs)[source]#
Bases:
BaseMapDataset
Preprocess the GRPO samples to the Oumi format.
Warning
This class is experimental and subject to change.
- conversation(idx: int) Conversation [source]#
Returns the conversation at the specified index.
- Parameters:
idx (int) – The index of the conversation to retrieve.
- Returns:
The conversation at the specified index.
- Return type:
Conversation
- conversations() list[Conversation] [source]#
Returns a list of all conversations.
- dataset_name: str#
- transform_conversation(sample: dict | Series) Conversation [source]#
Converts the input sample to a Conversation.
- Parameters:
sample (Union[dict, pd.Series]) – The input example.
- Returns:
The resulting conversation.
- Return type:
Conversation
- trust_remote_code: bool#
- class oumi.core.datasets.BaseExperimentalKtoDataset(*, dataset_name: str | None = None, dataset_path: str | None = None, split: str | None = None, **kwargs)[source]#
Bases:
BaseMapDataset
Base class for KTO (Kahneman-Tversky Optimization) datasets.
This class provides a comprehensive foundation for creating KTO datasets that work with binary feedback signals rather than preference pairs. KTO is an alignment method that optimizes language models based on simple binary labels indicating whether outputs are desirable or undesirable, making it simpler than preference-based methods like DPO, which require paired comparisons.
The class handles the standardization of diverse dataset formats into the consistent KTO format required by training frameworks. It supports both string-based completions and chat-formatted conversations, automatically extracting assistant responses when needed.
- Key Features:
Standardized KTO format with prompt, completion, and binary label
Automatic handling of chat format vs string format completions
Optimized feature schema for efficient dataset processing
Memory-efficient processing for large datasets
Consistent API across different KTO dataset implementations
- Dataset Format:
The standardized KTO format includes:
- prompt (str): The input text given to the model
- completion (str): The model’s response to be evaluated
- label (bool): True for desirable responses, False for undesirable ones
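For illustration, a single desirable record in this format might look like the following (the content is hypothetical):

{
    "prompt": "Summarize the benefits of unit testing.",
    "completion": "Unit tests catch regressions early and document intended behavior.",
    "label": true
}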
- Usage:
Subclasses should implement the _load_data() method to load their specific dataset format and optionally override _transform_kto_example() for custom preprocessing logic.
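A minimal sketch of such a subclass, assuming the raw data is a local JSONL file whose records already carry prompt, completion, and label fields (the class name, dataset identifier, and file layout are hypothetical):

import pandas as pd

from oumi.core.datasets import BaseExperimentalKtoDataset


class MyKtoDataset(BaseExperimentalKtoDataset):
    """Hypothetical KTO dataset backed by a local JSONL file."""

    default_dataset = "my_org/my_kto_data"  # hypothetical identifier

    def _load_data(self) -> pd.DataFrame:
        # Assumes every JSONL record already has "prompt", "completion",
        # and a boolean "label" field, matching the format described above.
        return pd.read_json(self.dataset_path, lines=True)

An instance can then be constructed with dataset_path pointing at the local file.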
Warning
This class is experimental and subject to change as KTO training methods evolve and mature.
See also
TRL KTO Trainer: https://huggingface.co/docs/trl/main/en/kto_trainer
KTO Paper: https://arxiv.org/abs/2402.01306
- dataset_name: str#
- transform(sample: dict) dict [source]#
Transform a raw dataset sample into the standardized format.
This is the main entry point for processing dataset samples. It delegates to the KTO-specific transformation method to ensure consistent formatting across all KTO datasets.
- Parameters:
sample (dict) – A raw dataset sample containing prompt, completion, and label information.
- Returns:
The transformed sample in KTO format with the standardized keys prompt, completion, and label.
- Return type:
dict
- trust_remote_code: bool#
- class oumi.core.datasets.BaseIterableDataset(*, dataset_name: str | None = None, dataset_path: str | None = None, subset: str | None = None, split: str | None = None, trust_remote_code: bool = False, stream: bool = True, **kwargs)[source]#
Bases:
IterDataPipe, ABC
Abstract base class for iterable datasets.
- property data: Iterable[Any]#
Returns the underlying dataset data.
- dataset_name: str#
- dataset_path: str | None = None#
- default_dataset: str | None = None#
- default_subset: str | None = None#
- to_hf(return_iterable: bool = True) IterableDataset [source]#
Converts the dataset to a Hugging Face dataset.
- abstractmethod transform(sample: Any) dict[str, Any] [source]#
Preprocesses the inputs in the given sample.
- Parameters:
sample (Any) – A sample from the dataset.
- Returns:
A dictionary containing the preprocessed input data.
- Return type:
dict
- trust_remote_code: bool = False#
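To make the abstract transform() concrete, here is a minimal sketch of an iterable subclass, assuming each streamed record is a dict with a "text" field (the class name and dataset identifier are hypothetical):

from typing import Any

from oumi.core.datasets import BaseIterableDataset


class MyStreamingTextDataset(BaseIterableDataset):
    """Hypothetical streaming dataset that keeps only the text field."""

    default_dataset = "my_org/my_streaming_corpus"  # hypothetical identifier

    def transform(self, sample: Any) -> dict[str, Any]:
        # Assumes each raw sample is a dict containing a "text" key.
        return {"text": sample["text"]}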
- class oumi.core.datasets.BaseMapDataset(*, dataset_name: str | None, dataset_path: str | None = None, subset: str | None = None, split: str | None = None, trust_remote_code: bool = False, transform_num_workers: str | int | None = None, **kwargs)[source]#
Bases:
MapDataPipe, Sized, ABC
Abstract base class for map datasets.
- __getitem__(idx: int) dict [source]#
Gets the item at the specified index.
- Parameters:
idx (int) – The index of the item to retrieve.
- Returns:
The item at the specified index.
- Return type:
dict
- __len__() int [source]#
Gets the number of items in the dataset.
- Returns:
The number of items in the dataset.
- Return type:
int
- property data: DataFrame#
Returns the underlying dataset data.
- dataset_name: str#
- dataset_path: str | None = None#
- default_dataset: str | None = None#
- default_subset: str | None = None#
- raw(idx: int) Series [source]#
Returns the raw data at the specified index.
- Parameters:
idx (int) – The index of the data to retrieve.
- Returns:
The raw data at the specified index.
- Return type:
pd.Series
- to_hf(return_iterable: bool = False) Dataset | IterableDataset [source]#
Converts the dataset to a Hugging Face dataset.
- Parameters:
return_iterable – Whether to return an iterable dataset. Iterable datasets aren’t cached to disk, which can sometimes be advantageous. For example, if transformed examples are very large (e.g., if pixel_values are large for multimodal data), or if you don’t want to post-process the whole dataset before training starts.
- Returns:
A HuggingFace dataset. Can be datasets.Dataset or datasets.IterableDataset depending on the value of return_iterable.
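A quick sketch of both modes, assuming my_dataset is an already-constructed BaseMapDataset subclass:

# Cached datasets.Dataset (the default).
hf_dataset = my_dataset.to_hf()

# Lazily transformed datasets.IterableDataset, not cached to disk.
hf_iterable = my_dataset.to_hf(return_iterable=True)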
- abstractmethod transform(sample: Series) dict [source]#
Preprocesses the inputs in the given sample.
- Parameters:
sample (pd.Series) – The raw sample to preprocess.
- Returns:
A dictionary containing the preprocessed input data.
- Return type:
dict
- transform_num_workers: str | int | None = None#
- trust_remote_code: bool#
- class oumi.core.datasets.BasePretrainingDataset(*, tokenizer: PreTrainedTokenizerBase, seq_length: int, dataset_text_field: str = 'text', append_concat_token: bool = True, add_special_tokens: bool = True, skip_last: bool = True, **kwargs)[source]#
Bases:
BaseIterableDataset
Base class for pretraining iterable datasets.
This class extends BaseIterableDataset to provide functionality specific to pretraining tasks.
- Variables:
tokenizer (BaseTokenizer) – The tokenizer used for text encoding.
seq_length (int) – The desired sequence length for model inputs.
concat_token_id (int) – The ID of the token used to concatenate documents.
Example
>>> from transformers import AutoTokenizer
>>> from oumi.builders import build_tokenizer
>>> from oumi.core.configs import ModelParams
>>> from oumi.core.datasets import BasePretrainingDataset
>>> tokenizer = build_tokenizer(ModelParams(model_name="gpt2"))
>>> dataset = BasePretrainingDataset(
...     dataset_name="wikimedia/wikipedia",
...     subset="20231101.en",
...     split="train",
...     tokenizer=tokenizer,
...     seq_length=512
... )
>>> example = next(iter(dataset))
- __iter__()[source]#
Iterates over the dataset and yields samples of a specified sequence length.
The underlying dataset is a stream of documents. Each document is expected to contain a text field (self._dataset_text_field) that is tokenized. Training samples are then yielded in sequences of length self.seq_length.
Because this iterator may yield samples that span document boundaries, self.concat_token_id can optionally be used to separate sequences coming from different documents.
- dataset_name: str#
- class oumi.core.datasets.BaseSftDataset(*, dataset_name: str | None = None, dataset_path: str | None = None, split: str | None = None, tokenizer: PreTrainedTokenizerBase | None = None, task: Literal['sft', 'generation', 'auto'] = 'auto', return_tensors: bool = False, text_col: str = 'text', assistant_only: bool = False, response_template: str | None = None, instruction_template: str | None = None, return_conversations: bool = False, **kwargs)[source]#
Bases:
BaseMapDataset, ABC
In-memory dataset for SFT data.
- property assistant_only: bool#
Gets whether the dataset is set to train only on assistant turns.
- conversation(idx: int) Conversation [source]#
Returns the conversation at the specified index.
- Parameters:
idx (int) – The index of the conversation to retrieve.
- Returns:
The conversation at the specified index.
- Return type:
Conversation
- conversations() list[Conversation] [source]#
Returns a list of all conversations.
- dataset_name: str#
- default_dataset: str | None = None#
- prompt(idx: int) str [source]#
Returns the prompt at the specified index.
- Parameters:
idx (int) – The index of the prompt to retrieve.
- Returns:
The prompt at the specified index.
- Return type:
str
- property task: str#
Gets the task mode for the dataset.
The generated prompt is often different for generation vs SFT tasks.
- property text_col: str#
Gets the text target column.
The generated text will be stored in this column.
- tokenize(sample: dict | Series | Conversation, tokenize: bool = True) dict [source]#
Applies the chat template carried by the tokenizer to the input example.
- Parameters:
sample (Union[dict, pd.Series, Conversation]) – A mapping whose "messages" key contains the (ordered) messages exchanged within a single chat dialogue. Each item of sample["messages"] is a dict holding the content of the message and the role of the participant who sent it, e.g. role == 'user' or role == 'assistant'.
tokenize (bool) – Whether to tokenize the messages or not.
- Raises:
NotImplementedError – Currently only the sft task mode is supported.
ValueError – If the requested task is not "sft" or "generation".
- Returns:
The input example dictionary with an added text key, mapped to a string carrying the messages rendered in the tokenizer's chat format.
- Return type:
Dict
- abstractmethod transform_conversation(example: dict | Series) Conversation [source]#
Preprocesses the inputs of the example and returns an Oumi Conversation.
- Parameters:
example (Union[dict, pd.Series]) – The example containing the input and instruction.
- Returns:
The preprocessed inputs as an Oumi Conversation.
- Return type:
Conversation
- trust_remote_code: bool#
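For reference, a minimal sketch of a concrete subclass, assuming the raw examples carry "instruction" and "output" columns and that the Message and Role types are available from oumi.core.types.conversation (the class name and dataset identifier are hypothetical):

from typing import Union

import pandas as pd

from oumi.core.datasets import BaseSftDataset
from oumi.core.types.conversation import Conversation, Message, Role


class MyInstructionDataset(BaseSftDataset):
    """Hypothetical SFT dataset with "instruction" and "output" columns."""

    default_dataset = "my_org/my_instruction_data"  # hypothetical identifier

    def transform_conversation(self, example: Union[dict, pd.Series]) -> Conversation:
        # Assumes each raw example has "instruction" and "output" fields.
        return Conversation(
            messages=[
                Message(role=Role.USER, content=example["instruction"]),
                Message(role=Role.ASSISTANT, content=example["output"]),
            ]
        )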
- class oumi.core.datasets.PackedSftDataset(base_dataset: BaseSftDataset, max_seq_len: int, show_progress: bool = True, split_samples: bool = False, concat_token_id: int | None = None, pad_token_id: int | None = None, enable_padding: bool = True, **kwargs)[source]#
Bases:
BaseMapDataset
A dataset that packs samples from a base SFT dataset to maximize efficiency.
- dataset_name: str#
- trust_remote_code: bool#
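A brief usage sketch, assuming sft_dataset is an instantiated BaseSftDataset subclass that yields tokenized samples and that tokenizer was built beforehand:

from oumi.core.datasets import PackedSftDataset

packed = PackedSftDataset(
    base_dataset=sft_dataset,                 # assumed to yield tokenized samples
    max_seq_len=2048,
    concat_token_id=tokenizer.eos_token_id,
    pad_token_id=tokenizer.pad_token_id,
)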
- class oumi.core.datasets.PretrainingAsyncTextDataset(tokenizer: PreTrainedTokenizerBase | None, dataset: Dataset, dataset_text_field: str | None = None, formatting_func: Callable | None = None, infinite: bool = False, seq_length: int = 1024, sequence_buffer_size: int = 1024, eos_token_id: int = 0, shuffle: bool = False, append_concat_token: bool = True, add_special_tokens: bool = True, pretokenized: bool = True)[source]#
Bases:
IterableDataset
Iterable dataset that returns constant length chunks of tokens.
Prefetches, formats, and tokenizes asynchronously from main thread.
- property column_names: list[str]#
Returns the column names of the dataset.
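A brief usage sketch, assuming a raw-text Hugging Face dataset and a previously built tokenizer; it is assumed that pretokenized=False makes the dataset tokenize the configured text field on the fly:

from datasets import load_dataset

from oumi.core.datasets import PretrainingAsyncTextDataset

hf_dataset = load_dataset("wikimedia/wikipedia", "20231101.en", split="train")
dataset = PretrainingAsyncTextDataset(
    tokenizer=tokenizer,          # assumes a previously built tokenizer
    dataset=hf_dataset,
    dataset_text_field="text",
    seq_length=1024,
    pretokenized=False,
)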
- class oumi.core.datasets.VisionLanguageDpoDataset(*, dataset_name: str | None = None, dataset_path: str | None = None, split: str | None = None, tokenizer: PreTrainedTokenizerBase | None = None, return_tensors: bool = False, processor: Any | None = None, processor_name: str | None = None, trust_remote_code: bool = False, processor_kwargs: dict[str, Any] | None = None, max_size: int | None = None, prompt_key: str = 'prompt', chosen_key: str = 'chosen', rejected_key: str = 'rejected', images_key: str = 'images', **kwargs)[source]#
Bases:
BaseDpoDataset
Dataset for vision-language DPO (Direct Preference Optimization) models.
This class extends BaseDpoDataset to provide functionality specific to vision-language preference optimization tasks. It handles the processing of both image and text data for preference learning.
The dataset expects data in one of the following formats:
{
    "prompt": "What's in this image?",
    "images": ["path/to/image.jpg", ...],  # Optional image paths/URLs
    "chosen": [{"role": "assistant", "content": "I see a cat"}],
    "rejected": [{"role": "assistant", "content": "I see a dog"}]
}
OR
{
    "prompt": "What's in this image?",
    "images": ["path/to/image.jpg", ...],
    "chosen": "preferred response",
    "rejected": "rejected response"
}
- dataset_name: str#
- transform_preference(sample: dict) dict [source]#
Transforms a raw DPO sample into the format expected by the DPO trainer: three Oumi Conversation objects (prompt, chosen, and rejected).
- Parameters:
sample (dict) – A dictionary representing a single DPO preference example.
- Returns:
A dict with the prompt, chosen, and rejected conversations or processed features.
- trust_remote_code: bool#
- class oumi.core.datasets.VisionLanguageSftDataset(*, return_conversations: bool = False, tokenizer: PreTrainedTokenizerBase | None = None, processor: BaseProcessor | None = None, processor_name: str | None = None, processor_kwargs: dict[str, Any] | None = None, limit: int | None = None, trust_remote_code: bool = False, max_images: int | None = None, **kwargs)[source]#
Bases:
BaseSftDataset, ABC
Abstract dataset for vision-language models.
This class extends BaseSftDataset to provide functionality specific to vision-language tasks. It handles the processing of both image and text data.
Note
This dataset is designed to work with models that can process both image and text inputs simultaneously, such as CLIP, BLIP, or other multimodal architectures.
Example
>>> from oumi.builders import build_processor, build_tokenizer
>>> from oumi.core.configs import ModelParams
>>> from oumi.core.types.conversation import Conversation
>>> from oumi.core.datasets import VisionLanguageSftDataset
>>> class MyVisionLanguageSftDataset(VisionLanguageSftDataset):
...     def transform_conversation(self, example: dict):
...         # Implement the abstract method
...         # Convert the raw example into a Conversation object
...         pass
>>> tokenizer = build_tokenizer(
...     ModelParams(model_name="Qwen/Qwen2-1.5B-Instruct")
... )
>>> dataset = MyVisionLanguageSftDataset(
...     tokenizer=tokenizer,
...     processor_name="openai/clip-vit-base-patch32",
...     dataset_name="coco_captions",
...     split="train"
... )
>>> sample = next(iter(dataset))
>>> print(sample.keys())
- dataset_name: str#
- transform(sample: dict) dict [source]#
Transforms an Oumi conversation into a dictionary of inputs for a model.
- Parameters:
sample (dict) – A dictionary representing a single conversation example.
- Returns:
A dictionary of inputs for a model.
- Return type:
dict
- abstractmethod transform_conversation(example: dict) Conversation [source]#
Transforms a raw example into an Oumi Conversation object.
- Parameters:
example (dict) – A dictionary representing a single conversation example.
- Returns:
A Conversation object representing the conversation.
- Return type:
Conversation
- trust_remote_code: bool#