Datasets module for the Oumi (Open Universal Machine Intelligence) library.
This module provides various dataset implementations for use in the Oumi framework.
These datasets are designed for different machine learning tasks and can be used
with the models and training pipelines provided by Oumi.
For more information on the available datasets and their usage, see the
oumi.datasets documentation.
Each dataset is implemented as a separate class, inheriting from appropriate base
classes in the oumi.core.datasets module. For usage examples and more detailed
information on each dataset, please refer to their respective class documentation.
See also
oumi.models: Compatible models for use with these datasets.
A dataset for pretraining on the Colossal Clean Crawled Corpus (C4).
The C4 dataset is based on the Common Crawl dataset and is available in
multiple variants: 'en', 'en.noclean', 'en.noblocklist', 'realnewslike',
and 'multilingual' (mC4). It is intended for pretraining language models
and word representations.
This dataset follows the same structure as the Alpaca dataset, with
instruction, input, and output fields. It is designed for training
Conversational Agentic Language Models (CoALM) that can handle both
task-oriented dialogue and function calling.
>>> from oumi.datasets import CoALMDataset
>>> dataset = CoALMDataset()
>>> # The dataset will be loaded from HuggingFace with the path
>>> # "uiuc-convai/CoALM-IT" and transformed into the Oumi
>>> # conversation format automatically.
Dataset class for the HuggingFaceM4/Docmatix dataset.
The dataset has the same data layout and format as HuggingFaceM4/the_cauldron
(hence it’s defined as a sub-class) but the underlying data is different.
Unlike HuggingFaceM4/the_cauldron, the dataset contains many multi-image examples,
and fewer subsets.
Be aware that ‘HuggingFaceM4/Docmatix’ is a very large dataset (~0.5TB) that
requires a lot of Internet bandwidth to download, and a lot of disk space to store,
so only use it if you know what you’re doing.
Using the 'Docmatix' dataset in Oumi should become easier once streaming
support is added for custom Oumi datasets (OPE-1021).
Dolma: A dataset of 3 trillion tokens from diverse web content.
Dolma [1] is a large-scale dataset containing
approximately 3 trillion tokens sourced from various web content, academic
publications, code, books, and encyclopedic materials. It is designed for
language modeling tasks and causal language model training.
The dataset is available in multiple versions, with v1.7 being the latest
release used to train OLMo 7B-v1.7. It includes data from sources such as
Common Crawl, Refined Web, StarCoder, C4, Reddit, Semantic Scholar, arXiv,
StackExchange, and more.
Data Fields:
id (str) – Unique identifier for the data entry.
text (str) – The main content of the data entry.
added (str, optional) – Timestamp indicating when the entry was added
to the dataset.
created (str, optional) – Timestamp indicating when the original content
was created.
source (str, optional) – Information about the origin or source of the
data.
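The field layout above can be sketched as a small validator. The helper and the sample record below are illustrative assumptions, not part of the Oumi or Dolma tooling:

```python
from typing import Any

# Fields documented above; "id" and "text" are required, the rest optional.
REQUIRED_FIELDS = {"id": str, "text": str}
OPTIONAL_FIELDS = {"added": str, "created": str, "source": str}

def is_valid_dolma_record(record: dict[str, Any]) -> bool:
    """Return True if a record matches the documented field layout."""
    for name, typ in REQUIRED_FIELDS.items():
        if not isinstance(record.get(name), typ):
            return False
    return all(
        isinstance(record[name], typ)
        for name, typ in OPTIONAL_FIELDS.items()
        if name in record
    )

print(is_valid_dolma_record({"id": "0", "text": "Some text.", "source": "arXiv"}))  # True
```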
A massive English web dataset built by TII for pretraining large language models.
The Falcon RefinedWeb dataset is created through stringent filtering and
large-scale deduplication of CommonCrawl. It contains about 1B instances
(968M individual web pages) for a total of 2.8TB of clean text data.
This dataset is intended primarily for pretraining large language models and
can be used on its own or augmented with curated sources.
FineWeb-Edu: A high-quality educational dataset filtered from web content.
This dataset contains 1.3 trillion tokens of educational web pages filtered
from the FineWeb dataset using an educational quality classifier. It aims to
provide the finest collection of educational content from the web
[2].
The dataset is available in multiple configurations.
Dataset class for the oumi-ai/oumi-letter-count dataset.
A sample from the dataset:
{
    "conversation_id": "oumi_letter_count_0",
    "messages": [
        {
            "content": "Can you let me know how many 'r's are in 'pandered'?",
            "role": "user"
        }
    ],
    "metadata": {
        "letter": "r",
        "letter_count_integer": 1,
        "letter_count_string": "one",
        "unformatted_prompt": "Can you let me know how many {letter}s are in {word}?",
        "word": "pandered"
    }
}
Multimodal Open R1 8K Verified Dataset from LMMS Lab.
A specialized dataset class for the lmms-lab/multimodal-open-r1-8k-verified dataset
that contains multimodal reasoning problems with images, problems, and solutions.
The Pile: An 825 GiB diverse, open source language modeling dataset.
The Pile is a large-scale English language dataset consisting of 22 smaller,
high-quality datasets combined together. It is designed for training large
language models and supports various natural language processing tasks
[3][4].
Data Fields:
text (str) – The main text content.
meta (dict) – Metadata about the instance, including ‘pile_set_name’.
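As a sketch of this layout, the 'pile_set_name' entry in meta can be used to tally which subset each instance came from. The records below are made up for illustration and are not real Pile data:

```python
from collections import Counter

# Hypothetical records in the documented layout: "text" plus a "meta"
# dict carrying the subset name under "pile_set_name".
records = [
    {"text": "From: a@enron.com ...", "meta": {"pile_set_name": "enron_emails"}},
    {"text": "The court holds ...", "meta": {"pile_set_name": "free_law"}},
    {"text": "Re: quarterly report", "meta": {"pile_set_name": "enron_emails"}},
]

# Tally how many instances each Pile subset contributes.
subset_counts = Counter(r["meta"]["pile_set_name"] for r in records)
print(subset_counts.most_common(1))  # [('enron_emails', 2)]
```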
Key Features:
825 GiB of diverse text data
Primarily in English
Supports text generation and fill-mask tasks
Includes various subsets like enron_emails, europarl, free_law, etc.
This dataset contains text from various sources and may include
personal or sensitive information. Users should consider potential biases
and limitations when using this dataset.
Sample “question”: “[USER] Can you come up with a joke? [ASSISTANT]”
It starts with a [USER] and ends with an [ASSISTANT] role tag.
The Assistant response appears in the “answer” field.
Dataset for RaR-Medicine from the Rubrics as Rewards paper.
This dataset contains 22.4k medical prompts with structured rubric annotations
for training with GRPO. The prompts focus on complex medical reasoning tasks
like diagnosis (50.3%) and treatment (16.0%).
>>> dataset = RaRMedicineDataset(split="train")
>>> sample = dataset.raw(0)
>>> print(sample["prompt"])
>>> print(sample["rubrics"])  # List of weighted rubric dicts
The rubrics follow this structure:
{
    "name": "Identify Most Sensitive Modality",
    "description": "Essential Criteria: Identifies non-contrast helical CT...",
    "weight": 5,
    "evaluation_type": "binary"
}
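Given this rubric layout, one plausible way to turn weighted binary rubrics into a scalar reward is a weight-normalized sum. The second rubric name, the judgments, and the scoring function below are hypothetical illustrations, not the paper's exact method:

```python
# Rubrics in the documented shape; "weight" scales each binary judgment.
rubrics = [
    {"name": "Identify Most Sensitive Modality", "weight": 5, "evaluation_type": "binary"},
    {"name": "Mention Contrast Risks", "weight": 2, "evaluation_type": "binary"},  # hypothetical
]
# Hypothetical 0/1 judgments, e.g. from a judge model.
judgments = {"Identify Most Sensitive Modality": 1, "Mention Contrast Risks": 0}

# Weight-normalized score in [0, 1].
total_weight = sum(r["weight"] for r in rubrics)
score = sum(r["weight"] * judgments[r["name"]] for r in rubrics) / total_weight
print(round(score, 3))  # 0.714
```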
Dataset for RaR-Science from the Rubrics as Rewards paper.
This dataset contains 22.9k expert-level science prompts with structured
rubric annotations for training with GRPO. The prompts are aligned with
the GPQA Diamond benchmark, covering topics from quantum mechanics to
molecular biology.
>>> dataset = RaRScienceDataset(split="train")
>>> sample = dataset.raw(0)
>>> print(sample["prompt"])
>>> print(sample["rubrics"])  # List of weighted rubric dicts
The rubrics follow this structure:
{
    "name": "Temperature Conversion",
    "description": "Essential Criteria: The response must mention...",
    "weight": 5,
    "evaluation_type": "binary"
}
RedPajama is a clean-room, fully open-source reproduction of the LLaMA training dataset.
This dataset contains approximately 1.2 trillion tokens from various sources:
Commoncrawl (878B), C4 (175B), GitHub (59B), ArXiv (28B), Wikipedia (24B),
and StackExchange (20B) [5].
The dataset is primarily in English, though the Wikipedia slice contains
multiple languages.
RedPajama V2 Dataset for training large language models.
This dataset includes over 100B text documents from 84 CommonCrawl snapshots,
processed using the CCNet pipeline. It contains 30B documents with quality
signals and 20B deduplicated documents [5].
The dataset is available in English, German, French, Italian, and Spanish.
Key Features:
Over 100B text documents
30B documents with quality annotations
20B unique documents after deduplication
Estimated 50.6T tokens in total (30.4T after deduplication)
SlimPajama-627B: A cleaned and deduplicated version of RedPajama.
SlimPajama is the largest extensively deduplicated, multi-corpora, open-source
dataset for training large language models. It was created by cleaning and
deduplicating the 1.2T token RedPajama dataset, resulting in a 627B token dataset.
The dataset consists of 59,166 jsonl files and is ~895GB compressed. It includes
training, validation, and test splits [6].
StarCoder Training Dataset used for training StarCoder and StarCoderBase models.
This dataset contains 783GB of code in 86 programming languages, including 54GB
of GitHub Issues, 13GB of Jupyter notebooks in scripts and text-code pairs, and
32GB of GitHub commits, totaling approximately 250 billion tokens.
The dataset is a cleaned, decontaminated, and near-deduplicated version of
The Stack dataset, with PII removed. It includes various programming languages,
GitHub issues, Jupyter Notebooks, and GitHub commits.
GitHub issues, GitHub commits, and Jupyter notebooks subsets have different
columns from the rest. It’s recommended to load programming languages separately
from these categories:
- jupyter-scripts-dedup-filtered
- jupyter-structured-clean-dedup
- github-issues-filtered-structured
- git-commits-cleaned
Subsets (See dataset for full list):
python
javascript
assembly
awk
git-commits-cleaned
github-issues-filtered-structured
…
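The loading recommendation above can be sketched as a small helper that separates the four differently-formatted subsets from plain programming-language subsets. The helper is illustrative, not part of the Oumi API:

```python
# The four subsets with a different column layout, as listed above.
SPECIAL_SUBSETS = {
    "jupyter-scripts-dedup-filtered",
    "jupyter-structured-clean-dedup",
    "github-issues-filtered-structured",
    "git-commits-cleaned",
}

def partition_subsets(names: list[str]) -> tuple[list[str], list[str]]:
    """Split subset names into programming languages vs. special subsets."""
    languages = [n for n in names if n not in SPECIAL_SUBSETS]
    special = [n for n in names if n in SPECIAL_SUBSETS]
    return languages, special

langs, special = partition_subsets(
    ["python", "javascript", "git-commits-cleaned", "assembly"]
)
print(langs)  # ['python', 'javascript', 'assembly']
```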
TextSftJsonLinesDataset for loading SFT data in oumi and alpaca formats.
This dataset class is designed to work with JSON Lines (.jsonl) or
JSON (.json) files containing text-based supervised fine-tuning (SFT) data.
It supports loading data either from a file or from a provided list of data
samples in oumi and alpaca formats.
Supported formats:
1. JSONL or JSON of conversations (Oumi format)
2. JSONL or JSON of Alpaca-style turns (instruction, input, output)
Parameters:
dataset_path (Optional[Union[str, Path]]) – Path to the dataset file
(.jsonl or .json).
data (Optional[List[Dict[str, Any]]]) – List of conversation dicts if not
loading from a file.
format (Optional[str]) – The format of the data. Either “conversations” or
“alpaca”. If not provided, the format will be auto-detected.
**kwargs – Additional arguments to pass to the parent class.
Examples
Loading conversations from a JSONL file with auto-detection:
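As a sketch of how such auto-detection might work, Alpaca records can be recognized by their instruction/input/output keys and Oumi conversations by a messages list. The detect_format helper below is hypothetical, not the class's actual implementation:

```python
import json
from pathlib import Path
from tempfile import TemporaryDirectory

def detect_format(sample: dict) -> str:
    """Guess the SFT data format from a single record (illustrative only)."""
    if {"instruction", "input", "output"} <= sample.keys():
        return "alpaca"
    if "messages" in sample:
        return "conversations"
    raise ValueError(f"Unrecognized sample keys: {sorted(sample)}")

with TemporaryDirectory() as tmp:
    path = Path(tmp) / "data.jsonl"
    rows = [{"instruction": "Add 2+2", "input": "", "output": "4"}]
    path.write_text("\n".join(json.dumps(r) for r in rows))
    first = json.loads(path.read_text().splitlines()[0])
    print(detect_format(first))  # alpaca
```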
Dataset class for the HuggingFaceM4/the_cauldron dataset.
The HuggingFaceM4/the_cauldron dataset is a comprehensive collection of
50 vision-language datasets, primarily training sets, used
for fine-tuning the Idefics2 vision-language model.
The datasets cover various domains such as general visual question answering,
captioning, OCR, document understanding, chart/figure understanding,
table understanding, reasoning, logic, maths, textbook/academic questions,
differences between images, and screenshot to code.
A dataset containing over 6TB of permissively-licensed source code files.
The Stack was created as part of the BigCode Project, an open scientific
collaboration working on the responsible development of Large Language Models
for Code (Code LLMs). It serves as a pre-training dataset for Code LLMs,
enabling the synthesis of programs from natural language descriptions and
other code snippets, and covers 358 programming languages.
The dataset contains code in multiple natural languages, primarily found in
comments and docstrings. It supports tasks such as code completion,
documentation generation, and auto-completion of code snippets.
TinyStoriesDataset class for loading and processing the TinyStories dataset.
This dataset contains synthetically generated short stories with a small
vocabulary, created by GPT-3.5 and GPT-4. It is designed for text generation
tasks and is available in English.
A dataset of textbook-like content for training small language models.
This dataset contains 420,000 textbook documents covering a wide range of topics
and concepts. It provides a comprehensive and diverse learning resource for
causal language models, focusing on quality over quantity.
The dataset was synthesized using the Nous-Hermes-Llama2-13b model, combining
the best of the falcon-refinedweb and minipile datasets to ensure diversity and
quality while maintaining a small size.
VLJsonlinesDataset for loading Vision-Language SFT data in Oumi format.
This dataset class is designed to work with JSON Lines (.jsonl) files containing
Vision-Language supervised fine-tuning (SFT) data. It supports loading data either
from a file or from a provided list of data samples.
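A hypothetical record sketch, assuming the messages layout shown for other Oumi datasets above; the exact content-item schema (the "image_path" and "text" types) is an assumption, not the library's documented contract:

```python
import json
from pathlib import Path
from tempfile import TemporaryDirectory

# Hypothetical Vision-Language SFT record: a conversation whose user turn
# pairs an image reference with a text question.
record = {
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "image_path", "content": "images/cat.png"},
                {"type": "text", "content": "What animal is this?"},
            ],
        },
        {"role": "assistant", "content": "A cat."},
    ]
}

with TemporaryDirectory() as tmp:
    path = Path(tmp) / "vl_sft.jsonl"
    path.write_text(json.dumps(record) + "\n")
    loaded = json.loads(path.read_text().splitlines()[0])
    print(loaded["messages"][1]["content"])  # A cat.
```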
VisionDpoJsonlinesDataset for loading Vision-Language DPO data in Oumi format.
This dataset class is designed to work with JSON Lines (.jsonl) files containing
Vision-Language Direct Preference Optimization (DPO) data. It supports loading data
either from a file or from a provided list of data samples.
The WikiText dataset is a collection of over 100 million tokens extracted from
verified Good and Featured articles on Wikipedia. It is available in two sizes:
WikiText-2 (2 million tokens) and WikiText-103 (103 million tokens). Each size
comes in two variants: raw (for character-level work) and processed (for
word-level work) [7].
The dataset is well-suited for models that can take advantage of long-term
dependencies, as it is composed of full articles and retains original case,
punctuation, and numbers.
Dataset containing cleaned Wikipedia articles in multiple languages.
This dataset is built from the Wikipedia dumps (https://dumps.wikimedia.org/)
with one subset per language, each containing a single train split.
Each example contains the content of one full Wikipedia article
with cleaning to strip markdown and unwanted sections (references, etc.).
Data Fields:
id (str) – ID of the article.
url (str) – URL of the article.
title (str) – Title of the article.
text (str) – Text content of the article.
Note
All configurations contain a single ‘train’ split.
This dataset is a collection of audio transcripts from 2,063,066 videos shared on
YouTube under a CC-By license. It contains 22,709,724 original and automatically
translated transcripts from 3,156,703 videos (721,136 individual channels),
representing nearly 45 billion words.
The corpus is multilingual, with a majority of English-speaking content (71%) for
original languages. Automated translations are provided for nearly all videos in
English, French, Spanish, German, Russian, Italian, and Dutch.
This dataset aims to expand the availability of conversational data for research
in AI, computational social science, and digital humanities.
The text can be used for training models and republished for reproducibility
purposes. In accordance with the CC-By license, every YouTube channel is fully
credited.