By Gabriel Navarro
May 14, 2025
As your models grow in size and complexity, you’ll inevitably hit the limits of local GPUs. Google Cloud’s Vertex AI lets you offload heavy training workloads to managed clusters of GPUs, so you can scale seamlessly, track experiments in the cloud, and integrate with the rest of GCP. In this tutorial, we’ll turn our FlashAttention-powered Transformer example into a Vertex AI Custom Job, walking through:
- Building the FlashAttention container image and pushing it to Artifact Registry
- Staging the training script, model config, and dataset in a GCS bucket
- Submitting a Custom Job on H100 GPUs with the Vertex AI Python client
- Targeting dedicated GPU reservations to avoid preemption
Let’s get started!
Before you begin, make sure you have:
- A GCP project with Vertex AI and Artifact Registry enabled
- A service account JSON key, with GOOGLE_APPLICATION_CREDENTIALS pointing to it
- The gcloud CLI, gsutil, and Docker installed locally
🔑 If you haven’t set up your GCP project or service account yet, follow GCP Setup first.
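If you want to double-check that the key is wired up before going further, here’s a quick optional sanity check. This is just a sketch (not part of the original walkthrough) using the same google-auth library the submission script imports later:
import os
from google.oauth2 import service_account

# Load the key file referenced by GOOGLE_APPLICATION_CREDENTIALS and print the
# service account it belongs to. If this raises, fix your setup before moving on.
key_path = os.environ["GOOGLE_APPLICATION_CREDENTIALS"]
creds = service_account.Credentials.from_service_account_file(key_path)
print(f"Authenticated as: {creds.service_account_email}")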
We’ll reuse the same FlashAttention Dockerfile from local dev—just target Artifact Registry:
# Authenticate Docker to Artifact Registry
gcloud auth configure-docker us-central1-docker.pkg.dev
# Clone the repo with the Dockerfile (or use your own repo)
git clone https://github.com/gabenavarro/MLContainerLab.git && \
cd MLContainerLab
# Build your image
docker build -f ./assets/build/Dockerfile.flashattn.cu128py26cp312 \
-t us-central1-docker.pkg.dev/my-project/repo/flash-attention:latest .
# Push it up
docker push us-central1-docker.pkg.dev/my-project/repo/flash-attention:latest
# Verify
gcloud artifacts docker images list us-central1-docker.pkg.dev/my-project/repo/flash-attention
Tip: Replace us-central1 and my-project/repo with your GCP region & Artifact Registry names.
Vertex AI jobs pull code and data from GCS. Let’s create buckets and upload everything:
# Make a bucket (if you haven’t already)
gsutil mb -l us-central1 gs://flashattn-example
# Note: GCS has no real folders; the config/, scripts/, datasets/,
# checkpoints/ and staging/ prefixes are created automatically by the
# uploads below (gsutil has no mkdir command).
# Upload model config
gsutil cp ./assets/test-files/flash-attn-config.yaml \
gs://flashattn-example/config/
# Upload training script
gsutil cp ./scripts/flash_attn_train.py \
gs://flashattn-example/scripts/
# Upload the processed dataset
# (follow the instructions in MLContainerLab to generate it, or use your own):
# https://github.com/gabenavarro/MLContainerLab/blob/main/documentation/flash-attn.ipynb
gsutil -m cp -r ./datasets/auto_regressive_processed_timeseries \
gs://flashattn-example/datasets/
# Inspect your uploads
gsutil ls -R gs://flashattn-example
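One thing worth knowing before we write the job spec: inside a Vertex AI custom training job, your buckets are mounted with Cloud Storage FUSE, so everything uploaded above appears as regular files under /gcs/flashattn-example. As a rough illustration (not part of the tutorial’s training script, and assuming PyYAML is installed in the image), the config could be read like this from inside the container:
import pathlib
import yaml  # assumes PyYAML is available in the container image

# The /gcs/<bucket> path only exists inside a Vertex AI custom training job,
# where Cloud Storage FUSE mounts the bucket as a filesystem.
config_path = pathlib.Path("/gcs/flashattn-example/config/flash-attn-config.yaml")
config = yaml.safe_load(config_path.read_text())
print(sorted(config))  # top-level keys of the model config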
Now we glue it all together with the Python client. This snippet authenticates with your service account, points Vertex AI at the container image, defines a single worker pool with an 8×H100 machine, and submits the Custom Job:
from google.cloud import aiplatform
from google.oauth2 import service_account
import os
# ——— Configuration ———
PROJECT_ID = "my-project"
REGION = "us-central1"
BUCKET = "gs://flashattn-example"
IMAGE_URI = f"{REGION}-docker.pkg.dev/{PROJECT_ID}/repo/flash-attention:latest"
SERVICE_KEY = os.getenv("GOOGLE_APPLICATION_CREDENTIALS")
SERVICE_ACCT = f"vertex-ai@{PROJECT_ID}.iam.gserviceaccount.com"
DISPLAY = "flash-attn-crypto-training"

# Command to launch inside the container
CMD = [
    "python3",
    "/gcs/flashattn-example/scripts/flash_attn_train.py",
    "--config", "/gcs/flashattn-example/config/flash-attn-config.yaml",
]

# GPU machine spec
worker_pool_specs = [
    {
        "replica_count": 1,
        "machine_spec": {
            "machine_type": "a3-megagpu-8g",
            "accelerator_type": "NVIDIA_H100_MEGA_80GB",
            "accelerator_count": 8,
            "reservation_affinity": {"reservation_affinity_type": "ANY_RESERVATION"},
        },
        "container_spec": {
            "image_uri": IMAGE_URI,
            "command": CMD,
        },
    }
]

# Initialize Vertex AI
aiplatform.init(
    project=PROJECT_ID,
    location=REGION,
    credentials=service_account.Credentials.from_service_account_file(SERVICE_KEY),
)

# Create & submit the CustomJob
job = aiplatform.CustomJob(
    display_name=DISPLAY,
    worker_pool_specs=worker_pool_specs,
    staging_bucket=BUCKET + "/staging",
)
job.submit(service_account=SERVICE_ACCT)
print(f"Submitted: {job.resource_name}")
Once you run this, Vertex AI will spin up your H100 cluster, pull the container, and kick off training—complete with logs in the GCP console.
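If you’d rather poll from Python than watch the console, a minimal sketch like the following works with the same SDK; the job resource name below is a placeholder for the one printed by job.submit() above:
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")

# Placeholder resource name -- use the value printed by the submission script.
job = aiplatform.CustomJob.get(
    "projects/123456789/locations/us-central1/customJobs/1234567890123456789"
)
print(job.state)  # e.g. JobState.JOB_STATE_PENDING, _RUNNING, or _SUCCEEDED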
If your organization has dedicated GPU reservations, swap reservation_affinity to lock onto them:
"reservation_affinity": {
"reservation_affinity_type": "SPECIFIC_RESERVATION",
"key": "compute.googleapis.com/reservation-name",
"values": [
f"projects/{PROJECT_ID}/zones/us-central1-a/reservations/my-h100-resv"
]
}
This guarantees your job runs on reserved hardware, avoiding preemption.
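To make the wiring explicit, here’s the machine_spec from the job above with the specific-reservation block slotted in; my-h100-resv and the us-central1-a zone are placeholders for your own reservation:
# Same worker pool as before (PROJECT_ID, IMAGE_URI and CMD reused from the
# submission script), but pinned to a named reservation.
worker_pool_specs = [
    {
        "replica_count": 1,
        "machine_spec": {
            "machine_type": "a3-megagpu-8g",
            "accelerator_type": "NVIDIA_H100_MEGA_80GB",
            "accelerator_count": 8,
            "reservation_affinity": {
                "reservation_affinity_type": "SPECIFIC_RESERVATION",
                "key": "compute.googleapis.com/reservation-name",
                "values": [
                    f"projects/{PROJECT_ID}/zones/us-central1-a/reservations/my-h100-resv"
                ],
            },
        },
        "container_spec": {"image_uri": IMAGE_URI, "command": CMD},
    }
]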
By containerizing your code and orchestration logic, Vertex AI Custom Jobs let you scale effortlessly to large GPU fleets, integrate with GCP’s IAM and monitoring, and reproduce experiments consistently. Once you’ve mastered this flow, you can swap in bigger machine types, pin jobs to dedicated reservations, or point the same container at new configs and datasets without touching your training code.
Happy scaling—and may your training queues always be short! 🚀