<div class="align-center">
<a href="https://oumi.ai/"><img src="https://oumi.ai/docs/en/latest/_static/logo/header_logo.png" height="200"></a>

[![Documentation](https://img.shields.io/badge/Documentation-latest-blue.svg)](https://oumi.ai/docs/en/latest/index.html)
[![Discord](https://img.shields.io/discord/1286348126797430814?label=Discord)](https://discord.gg/oumi)
[![GitHub Repo stars](https://img.shields.io/github/stars/oumi-ai/oumi)](https://github.com/oumi-ai/oumi)
<a target="_blank" href="https://colab.research.google.com/github/oumi-ai/oumi/blob/main/notebooks/Oumi - Deploying a Job.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
</div>

👋 Welcome to Open Universal Machine Intelligence (Oumi)!

🚀 Oumi is a fully open-source platform that streamlines the entire lifecycle of foundation models - from [data preparation](https://oumi.ai/docs/en/latest/resources/datasets/datasets.html) and [training](https://oumi.ai/docs/en/latest/user_guides/train/train.html) to [evaluation](https://oumi.ai/docs/en/latest/user_guides/evaluate/evaluate.html) and [deployment](https://oumi.ai/docs/en/latest/user_guides/launch/launch.html). Whether you're developing on a laptop, launching large-scale experiments on a cluster, or deploying models in production, Oumi provides the tools and workflows you need.

🤝 Make sure to join our [Discord community](https://discord.gg/oumi) to get help, share your experiences, and contribute to the project! If you are interested in joining one of the community's open-science efforts, check out our [open collaboration](https://oumi.ai/community) page.

⭐ If you like Oumi and you would like to support it, please give it a star on [GitHub](https://github.com/oumi-ai/oumi).

# Deploying a Job

In this tutorial, we'll take a working [JobConfig](https://github.com/oumi-ai/oumi/tree/main/src/oumi/core/configs/job_config.py) and deploy it remotely on a cluster of your choice.

This guide pairs well with our [Finetuning Tutorial](https://github.com/oumi-ai/oumi/blob/main/notebooks/Oumi%20-%20Finetuning%20Tutorial.ipynb), where you create your own TrainingConfig and run it locally. Give it a try if you haven't already!

We'll cover the following topics:
1. Prerequisites
1. Choosing a Cloud
1. Preparing Your JobConfig
1. Launching Your Job
1. \[Advanced\] Deploying a Training Config

## Prerequisites


### Oumi Installation
First, let's install Oumi. You can find more detailed instructions [here](https://oumi.ai/docs/en/latest/get_started/installation.html).


In [None]:
%pip install oumi

### Creating our working directory
For our experiments, we'll use the following folder to save our configs.

In [1]:
import time
from pathlib import Path

tutorial_dir = "deploy_training_tutorial"

Path(tutorial_dir).mkdir(parents=True, exist_ok=True)

## Choosing a Cloud

We'll use the Oumi Launcher to run jobs remotely. To use the launcher, you need to specify which cloud the job should run on.
The cell below prints the clouds available to you:

In [None]:
import oumi.launcher as launcher

# Print all available clouds
print(launcher.which_clouds())

#### Local Cloud
If you don't have any clouds set up yet, feel free to use the `local` cloud. It executes your job on your current device as if it were a remote cluster. Hardware requirements are ignored for the `local` cloud.

#### Other Providers
Note that to use a cloud you must already have an account registered with that cloud provider.

For example, GCP, RunPod, and Lambda require accounts with billing enabled. Polaris requires an account set up with [ALCF](https://www.alcf.anl.gov/polaris).
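
The choice can also be made in code. The sketch below is a hypothetical helper (`choose_cloud` is not part of Oumi) that returns the first preferred cloud found in a list of available ones and falls back to `local` otherwise. The cloud name strings used here are assumptions; in the notebook you would pass the result of `launcher.which_clouds()` and use whatever names it prints.

```python
def choose_cloud(available, preferred):
    """Return the first preferred cloud that is available, else "local".

    Names are compared case-insensitively; the spelling from `available`
    is returned so it can be used as-is in the job config.
    """
    by_lower = {name.lower(): name for name in available}
    for name in preferred:
        match = by_lower.get(name.lower())
        if match is not None:
            return match
    return "local"


# Hypothetical inputs; replace the first argument with launcher.which_clouds().
print(choose_cloud(["local", "gcp"], preferred=["runpod", "gcp"]))
print(choose_cloud(["local"], preferred=["runpod", "gcp"]))
```

The returned name can then be assigned to the `cloud` field of the config in the next step.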

Once you've picked a cloud, move on to the next step.

## Preparing Your JobConfig

Let's get started by creating your JobConfig. In the config below, feel free to change `cloud: local` to the cloud you chose in the previous step.

In [None]:
%%writefile $tutorial_dir/job.yaml

name: job-tutorial
resources:
  cloud: local
  # `accelerators` is ignored for the local cloud.
  accelerators: A100

# Upload the working directory to the remote cluster.
# On the local cloud, the job changes into the working directory before running.
working_dir: .

envs:
  TEST_ENV_VARIABLE: '"Hello, World!"'
  OUMI_LOGGING_DIR: "deploy_training_tutorial/logs"

# `setup` will always be executed before `run`.
# No setup is required for this job.
#setup: |
#  echo "Running setup..."

run: |
  set -e  # Exit immediately if any command fails.

  echo "$TEST_ENV_VARIABLE"
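
The nested quotes in `TEST_ENV_VARIABLE` are easy to misread: YAML consumes the outer single quotes, so the value itself keeps the inner double quotes. The sketch below reproduces the `run` commands in a plain shell to show what they print. It does not go through the Oumi Launcher, and it assumes a POSIX `sh` is available on your `PATH`.

```python
import os
import subprocess

# The value YAML yields for '"Hello, World!"': the inner double quotes remain.
env_value = '"Hello, World!"'

# Same commands as the `run` section of job.yaml.
run_script = 'set -e\necho "$TEST_ENV_VARIABLE"\n'

result = subprocess.run(
    ["sh", "-c", run_script],
    env={**os.environ, "TEST_ENV_VARIABLE": env_value},
    capture_output=True,
    text=True,
    check=True,  # Raise if the script exits with a non-zero status.
)
print(result.stdout, end="")  # prints "Hello, World!" including the quotes
```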

## Launching Your Job

First let's load your JobConfig:

In [4]:
# Read our JobConfig from the YAML file.
job_config = launcher.JobConfig.from_yaml(str(Path(tutorial_dir) / "job.yaml"))

At any point you can easily change the cloud where your job will run by modifying the job's `resources.cloud` parameter:

In [5]:
# Manually set the cloud to use.
job_config.resources.cloud = "local"

Once you have a job config, kicking off your job is simple:

In [None]:
# You can optionally specify a cluster name here. If not specified, a random name will
# be generated. This is also useful for launching multiple jobs on the same cluster.
cluster_name = None

# Launch the job!
cluster, job_status = launcher.up(job_config, cluster_name)
print(f"Job status: {job_status}")

If `launcher.up` raises an error, you may need to configure permissions to run a job on your specified cloud. The error message should include the command to run to authenticate (for GCP this is often `gcloud auth application-default login`).

We can quickly check on the status of our job using the `cluster` returned in the previous command:

In [None]:
while job_status and not job_status.done:
    print("Job is running...")
    time.sleep(15)
    job_status = cluster.get_job(job_status.id)
print("Job is done!")
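
The loop above polls forever if the job never finishes. As a minimal sketch, the hypothetical helper below (`wait_for_job` is not part of Oumi) adds a deadline to the same polling pattern. It only assumes that the status object exposes a `done` attribute, as used above; the dry run uses stand-in objects rather than a real cluster.

```python
import time
from types import SimpleNamespace


def wait_for_job(get_status, poll_seconds=15.0, timeout_seconds=3600.0):
    """Poll `get_status()` until the status is done or the deadline passes.

    Returns the last status seen (which may be None). Raises TimeoutError
    if the job is still running when the deadline is reached.
    """
    deadline = time.monotonic() + timeout_seconds
    status = get_status()
    while status is not None and not status.done:
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Job not done after {timeout_seconds} seconds")
        time.sleep(poll_seconds)
        status = get_status()
    return status


# Dry run with stand-in statuses: two "running" polls, then done.
fake_statuses = iter(
    [SimpleNamespace(done=False), SimpleNamespace(done=False), SimpleNamespace(done=True)]
)
final = wait_for_job(lambda: next(fake_statuses), poll_seconds=0.01, timeout_seconds=5)
print(final.done)
```

In the notebook, the same helper could be called as `wait_for_job(lambda: cluster.get_job(job_status.id))`.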

If the job ran on the `local` cloud, its logs are in the directory we set as `OUMI_LOGGING_DIR` in the job config, and we can view them below:

In [None]:
logs_dir = Path(tutorial_dir) / "logs"
for log_file in logs_dir.iterdir():
    print(f"Log file: {log_file}")
    with open(log_file) as f:
        print(f.read())
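
The cell above raises `FileNotFoundError` if the logs directory has not been created yet, for example when the job ran on a remote cloud. A slightly more defensive variant is sketched below; `read_logs` is a hypothetical helper, and the directory path is the one this tutorial sets in `OUMI_LOGGING_DIR`.

```python
from pathlib import Path


def read_logs(logs_dir):
    """Return {file name: contents} for the regular files in `logs_dir`.

    Returns an empty dict if the directory does not exist.
    """
    logs_dir = Path(logs_dir)
    if not logs_dir.is_dir():
        return {}
    return {
        path.name: path.read_text(errors="replace")
        for path in sorted(logs_dir.iterdir())
        if path.is_file()
    }


for name, text in read_logs("deploy_training_tutorial/logs").items():
    print(f"Log file: {name}")
    print(text)
```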

Now that we're done with the cluster, let's tear it down. On non-local clouds, this stops billing.

In [9]:
cluster.down()
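
If a cell fails between launching and teardown, the cluster can be left running. One way to guard against that is a small context manager that always calls `down()` on exit. This is a sketch: `torn_down_on_exit` is a hypothetical helper, and `FakeCluster` stands in for the object returned by `launcher.up` so the example runs without any cloud access.

```python
from contextlib import contextmanager


@contextmanager
def torn_down_on_exit(cluster):
    """Yield `cluster`, then call `cluster.down()` even if the body raised."""
    try:
        yield cluster
    finally:
        cluster.down()


class FakeCluster:
    """Stand-in for a real cluster; only records whether down() was called."""

    def __init__(self):
        self.is_down = False

    def down(self):
        self.is_down = True


fake = FakeCluster()
try:
    with torn_down_on_exit(fake):
        raise RuntimeError("simulated failure while the job is running")
except RuntimeError:
    pass
print(fake.is_down)  # True: teardown ran despite the error
```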

## \[Advanced\] Deploying a Training Config

In our [Finetuning Tutorial](https://github.com/oumi-ai/oumi/blob/main/notebooks/Oumi%20-%20Finetuning%20Tutorial.ipynb), we created and saved a TrainingConfig. We then invoked training by running:
```shell
oumi train -c "$tutorial_dir/train.yaml"
```

You can also run that command as a job! Simply update the `run` section of the JobConfig with your desired command:


In [10]:
path_to_your_train_config = Path(tutorial_dir) / "train.yaml"  # Make sure this exists!

# Set the `run` command to run your training script.
job_config.run = f'oumi train -c "{path_to_your_train_config}"'

And now your job will run your training config when executed!
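
Wrapping the path in double quotes by hand works for simple paths, but breaks if the path itself contains a double quote or a `$`. As a minimal sketch, the standard library's `shlex.join` can build the same command with correct shell quoting. `build_train_command` is a hypothetical helper, and the second path is made up to show the quoting.

```python
import shlex


def build_train_command(config_path):
    """Return an `oumi train` command line with the config path shell-quoted."""
    return shlex.join(["oumi", "train", "-c", str(config_path)])


print(build_train_command("deploy_training_tutorial/train.yaml"))
# oumi train -c deploy_training_tutorial/train.yaml
print(build_train_command("my experiments/train.yaml"))
# oumi train -c 'my experiments/train.yaml'
```

The result can be assigned to `job_config.run` in the same way as the f-string above.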

For a more in-depth overview of the fields in JobConfig, please see our [Running Jobs Remotely tutorial](https://github.com/oumi-ai/oumi/blob/main/notebooks/Oumi%20-%20Running%20Jobs%20Remotely.ipynb).