Google Vertex AI pricing operates on a pay-as-you-go model—but without the right strategies, those “as you go” charges can spiral into surprise bills that derail your AI budget. Multiple teams have reported shock invoices ranging from $400 to over $20,000 in a single month, often from services they didn’t realize were still running.

Optimizing Vertex AI pricing requires controlling token consumption, managing idle endpoints, avoiding hidden egress fees, and most importantly—leveraging Google Cloud Committed Use Discounts (CUDs) to lock in savings of up to 55% on predictable workloads.

This guide breaks down every cost component of Vertex AI, common pitfalls, and practical strategies for leveraging commitment management strategically to lower costs.

What Is Vertex AI?

Vertex AI is Google Cloud’s unified machine learning platform designed to streamline the entire ML lifecycle—from data preparation and model training to deployment, monitoring, and ongoing optimization. It consolidates what used to be scattered across AutoML, AI Platform, and dozens of separate APIs into a single managed environment used by data scientists, ML engineers and app teams.

Core Capabilities

Key features of Vertex AI include:

  • AutoML Models: Pre-configured workflows for image classification, object detection, tabular forecasting, and natural language processing. Teams with limited ML expertise can build production models without writing training loops or tuning hyperparameters manually.
  • Custom Training & Model Tuning: Full control over model architectures, frameworks (TensorFlow, PyTorch, scikit-learn), and compute configurations. You bring your own model training code, and Vertex AI manages distributed infrastructure, job orchestration, and resource scaling.
  • Gen AI Models: Access to Google’s latest foundation model types including Gemini 2.5 Pro, 2.5 Flash and Imagen. These models power text generation, multimodal reasoning, code completion, image generation, and video generation use cases.
  • Managed Endpoints: Deploy models to scalable prediction endpoints with automatic load balancing, versioning, and traffic splitting. Endpoints remain active (and billable) until you explicitly un-deploy them—a common source of runaway costs.
  • Vector Search & RAG Engine: Semantic search infrastructure for retrieval-augmented generation (RAG) pipelines. Charges apply for index build, streaming updates, and storage in addition to LLM inference costs.
  • Pipelines & MLOps: Orchestrate multi-step workflows with Vertex AI Pipelines, track experiments with TensorBoard, and manage model metadata with ML Metadata. Each service has its own pricing structure.

Common Use Cases

The most common include:

  • Predictive Analytics: Time-series forecasting for demand planning, inventory optimization, and financial projections. Vertex AI Forecast uses ARIMA+ models with per-TB training costs plus per-1K-point prediction fees.
  • Customer Support Automation: GenAI models power chatbots, email auto-responders, and knowledge base search. Token consumption can skyrocket when prompts include large context windows or multi-turn conversations.
  • Computer Vision: Image classification, object detection, and video analysis for quality control, security monitoring, and content moderation. AutoML image models charge $3.465/hour for training and $1.375/hour per deployed endpoint.
  • Content Generation: Imagen for image synthesis, Veo for video generation, and Gemini for text/code completion. Video generation pricing ($0.50-$0.75/second) has caused some of the most dramatic billing surprises.

How Vertex AI Pricing Works

Vertex AI’s flexible pricing model includes charges for compute, storage, API calls, and specialized services.

Primary Cost Drivers

The main cost components of Vertex AI pricing are:

1. Compute Resources: Training jobs and prediction endpoints bill per node-hour based on machine type (vCPU, RAM) and attached accelerators (GPUs, TPUs). A single A100 GPU costs $2.93 per hour in us-central1, and charges accumulate from the moment Google provisions resources until the job completes or you undeploy the endpoint.

2. Token Consumption: Generative AI models charge per million tokens processed. Gemini 2.5 Pro costs $1.25 per million input tokens (≤200K context) and $10.00 per million output tokens. Prompts beyond 200K tokens bill at higher tiers ($2.50/M input, $15.00/M output)—and every request re-sends the full context.

3. Endpoint Uptime: Online prediction endpoints charge hourly fees even during idle periods. An e2-standard-2 endpoint costs $0.077 per hour continuously. Teams often forget to undeploy development endpoints, accumulating hundreds of dollars per month in “ghost” charges.

4. Storage Costs: Models stored in Vertex AI Model Registry, datasets in Cloud Storage, and vector indexes in Vector Search all generate data storage fees. Standard Cloud Storage costs $0.020/GB-month; SSD-backed storage costs $0.170/GB-month.

5. Data Transfer (Egress): Moving data out of Google Cloud incurs egress fees—$0.12/GB to most destinations, $0.23/GB to China/Australia. Large batch prediction jobs or frequent model downloads can add unexpected charges.

6. Management Fees: Vertex AI adds management fees on top of underlying Compute Engine costs. For example, an NVIDIA A100 GPU costs $2.93 per hour for compute plus $0.44 per hour Vertex management fee, totaling $3.37 per hour.
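The management-fee arithmetic above can be sketched as a small helper. This is a minimal model using the us-central1 example rates quoted in this article, not an official pricing API:

```python
# Sketch: effective hourly cost of a Vertex AI accelerator, combining the
# underlying Compute Engine rate with the Vertex AI management fee.
# Rates below are the us-central1 examples from this article.

def vertex_hourly_cost(compute_rate: float, management_fee: float, hours: float) -> float:
    """Total cost of running an accelerator for `hours` on Vertex AI."""
    return (compute_rate + management_fee) * hours

# NVIDIA A100: $2.93/hr compute + $0.44/hr management fee = $3.37/hr
a100_day = vertex_hourly_cost(2.93, 0.44, 24)
print(f"A100 for 24h: ${a100_day:.2f}")  # A100 for 24h: $80.88
```

The same function covers any machine type by passing a zero management fee.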

Vertex AI Pricing Breakdown by Service

Vertex AI pricing varies by service, with different costs for generative AI models, AutoML training, custom model infrastructure, forecasting tools, and other Google Cloud services.

Generative AI Models

Gemini 2.5 Pro (text, image, video, and audio inputs)

| Usage Type | ≤200K Tokens | >200K Tokens | Batch API ≤200K | Batch API >200K |
|---|---|---|---|---|
| Input | $1.25/M | $2.50/M | $0.625/M | $1.25/M |
| Text Output | $10.00/M | $15.00/M | $5.00/M | $7.50/M |
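The tiered rates above can be expressed as a function. This is a simplified sketch of the article's rate card (it applies a single tier to the whole prompt and halves both rates for Batch API), not Google's official billing logic:

```python
# Sketch of the Gemini 2.5 Pro rate card: rates are per million tokens,
# prompts over 200K tokens bill at the higher tier, and the Batch API
# halves both input and output rates.

def gemini_pro_cost(input_tokens: int, output_tokens: int, batch: bool = False) -> float:
    long_context = input_tokens > 200_000
    input_rate = 2.50 if long_context else 1.25    # $/M input tokens
    output_rate = 15.00 if long_context else 10.00  # $/M output tokens
    if batch:
        input_rate /= 2
        output_rate /= 2
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000

print(f"${gemini_pro_cost(100_000, 2_000):.4f}")              # $0.1450
print(f"${gemini_pro_cost(100_000, 2_000, batch=True):.4f}")  # $0.0725
```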

Gemini 2.5 Flash (lower cost, faster inference)

| Input Type | ≤200K Tokens | Cached Tokens | Batch API |
|---|---|---|---|
| Text / Image / Video | $0.30/M | $0.030/M | $0.15/M |
| Audio Input | $1.00/M | $0.100/M | $0.50/M |
| Text Output | $2.50/M | N/A | $1.25/M |
| Image Output | $30.00/M | N/A | $15.00/M |
Note: Multi-turn conversations re-send the full conversation history with every request. A 10-turn chat with 50K tokens per request consumes 500K input tokens, costing $0.15 with Flash or about $0.63 with Pro before generating a single output token.

Grounding Features (add-on costs)

  • Google Search Grounding: $35 per 1,000 grounded prompts after the free tier
  • Web Grounding for Enterprise: $45 per 1,000 grounded prompts
  • Google Maps Grounding: $25 per 1,000 grounded prompts

AutoML Models

AutoML model pricing is based on infrastructure usage, billed per node-hour: you pay for the resources your model consumes during each hour of training, deployment, or prediction.

Image Data Pricing Table

| Operation | Price |
|---|---|
| Training | $3.465 per node-hour |
| Training (Edge on-device) | $18.00 per node-hour |
| Deployment & Online Prediction | $1.375 per node-hour (classification) / $2.002 per node-hour (object detection) |
| Batch Prediction | $2.222 per node-hour |

Tabular Data

| Operation | Price |
|---|---|
| Training | $21.252 per node-hour |
| Inference | Same as custom-trained models (see below) |

Note: Deployed AutoML models charge hourly fees continuously. A classification endpoint left running for 30 days costs $990.00 ($1.375/hour × 720 hours), even with zero prediction requests.

Custom-Trained Models

Unlike AutoML pricing, which bundles managed training and deployment costs into service-specific node-hour rates, custom-trained model pricing is based on the exact machine types and accelerators you choose, giving you more flexibility and more direct control over hourly cost.

Machine Types (us-central1 pricing)

| Machine Type | Price per Hour |
|---|---|
| n1-standard-4 | $0.219 |
| n1-highmem-16 | $1.088 |
| n2-standard-32 | $1.787 |
| a2-highgpu-8g* | $35.40 (includes 8× A100 GPUs) |
| a3-ultragpu-8g* | $99.77 (includes 8× H100 GPUs) |

*GPU cost is included in the machine type price.

Accelerator Pricing (us-central1)

| Accelerator | Price per Hour |
|---|---|
| NVIDIA A100 (40GB) | $2.93 + $0.44 management fee |
| NVIDIA H100 (80GB) | $9.80 + $1.47 management fee |
| NVIDIA L4 | $0.64 |
| TPU v3 Pod (32 cores) | $36.80 |
Training jobs are billed from the moment resources are provisioned until the job ends. If a job fails 4 hours into a 5-hour run, you are still charged for those 4 hours, and user-initiated cancellations also incur charges.

Vertex AI Forecast

Unlike AutoML and custom-trained model pricing, Vertex AI forecasting costs are based on prediction volume or data processed rather than compute hours, so pricing depends more on how much forecasting work you run than on how long infrastructure stays active.

AutoML Forecasting

| Volume Tier | Price |
|---|---|
| 0–1M predictions/month | $0.20 per 1,000 predictions |
| 1M–50M predictions/month | $0.10 per 1,000 predictions |
| 50M+ predictions/month | $0.02 per 1,000 predictions |
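The volume tiers above can be sketched as a calculator. This assumes graduated (marginal) pricing, where each tier's rate applies only to predictions within that tier; confirm the exact tier semantics against Google's pricing page before relying on it:

```python
# Sketch of the AutoML Forecasting volume tiers, assuming graduated
# (marginal) pricing. Rates are per 1,000 predictions.

TIERS = [  # (tier ceiling in predictions, price per 1,000)
    (1_000_000, 0.20),
    (50_000_000, 0.10),
    (float("inf"), 0.02),
]

def forecast_cost(predictions: int) -> float:
    cost, floor = 0.0, 0
    for ceiling, rate in TIERS:
        in_tier = min(predictions, ceiling) - floor
        if in_tier <= 0:
            break
        cost += in_tier / 1_000 * rate
        floor = ceiling
    return cost

print(f"{forecast_cost(500_000):.2f}")    # 100.00
print(f"{forecast_cost(2_000_000):.2f}")  # 300.00
```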

ARIMA+ Forecasting

  • Training: $250 per TB × number of candidate models × number of backtesting windows
  • Prediction: $5.00 per 1,000 data points

Supporting Services

Unlike the earlier Vertex AI pricing categories, these supporting services charge in product-specific ways: per run, per node-hour, per GiB processed, or per hour of underlying compute and notebook usage.

Vertex AI Pipelines

Pipeline orchestration: $0.03 per run, plus compute costs for each pipeline step

Vertex AI Feature Store

  • Data Processing Node: $0.08/hour
  • Optimized Serving Node: $0.30/hour (includes 200 GB)
  • Bigtable Serving Node: $0.94/hour

Vertex AI Vector Search

  • Index Build / Update (Batch): $3.00 per GiB processed
  • Index Serving: $0.094 per node-hour (e2-standard-2)
  • Streaming Update: $0.45 per GiB inserted

Vertex AI Workbench (Managed Notebooks)

  • vCPU: $0.0379/hour (N1/N2/A2) or $0.0261/hour (E2)
  • Memory: $0.0051 per GiB-hour (N1/N2/A2) or $0.0035 per GiB-hour (E2)
  • GPU Management Fee: $0.35/hour (standard GPUs) or $2.48/hour (premium GPUs)

The Hidden Costs That Surprise Teams

Let’s talk about a few pitfalls and practical ways to optimize costs.

1. Idle Endpoints (No Scale-to-Zero)

Vertex AI does not support automatic scale-to-zero for deployed models. Once you deploy a model to an endpoint, charges accumulate continuously until you explicitly undeploy it.

For example, a development team deploys three experimental models to separate e2-standard-4 endpoints ($0.154/hour each) for A/B testing. They pause testing but forget to undeploy. After 30 days, the bill shows $332.64 in unused endpoint fees ($0.154 × 3 endpoints × 720 hours).

Fix: Implement automated deprovisioning scripts that undeploy endpoints after a configurable idle period (e.g., 4 hours without prediction requests). Tag endpoints with environment labels (dev/staging/prod) and enforce stricter cleanup policies for non-production resources.

2. Context Window Multipliers

Large language models charge separately for input and output tokens—and multi-turn conversations re-send the entire conversation history with every new user message.

Imagine that a customer support chatbot uses Gemini 2.5 Pro with a 50K-token knowledge base included in every prompt. Each user message adds 2K tokens, and each assistant response generates 500 tokens. A 10-turn conversation consumes:

  • Input: (50K base + 2K user) × 10 turns = 520K tokens → $0.65
  • Output: 500 tokens × 10 turns = 5K tokens → $0.05
  • Total per conversation: $0.70

At 10,000 conversations/day, monthly cost reaches $210,000.

Fix: Use prompt caching to store the 50K knowledge base once, reducing repeat-input costs by 90%. Switch to Gemini 2.5 Flash ($0.30 input/$2.50 output per million tokens) for routine queries, reserving Pro for complex reasoning.
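The chatbot arithmetic can be sketched as a small model. This assumes cached tokens bill at 10% of the standard input rate (the 90% reduction cited above); the turn structure is the simplified one used in this example:

```python
# Sketch: cost of a multi-turn chat that re-sends a 50K-token knowledge
# base each turn, at Gemini 2.5 Pro rates ($1.25/M input, $10.00/M output).
# The cached case assumes cached tokens bill at 10% of the input rate.

def conversation_cost(turns=10, base=50_000, user=2_000, out=500, cached=False):
    base_rate = 0.125 if cached else 1.25  # $/M for the re-sent knowledge base
    input_cost = (base * base_rate + user * 1.25) * turns / 1_000_000
    output_cost = out * 10.00 * turns / 1_000_000
    return input_cost + output_cost

print(f"uncached: ${conversation_cost():.3f}")             # uncached: $0.700
print(f"cached:   ${conversation_cost(cached=True):.4f}")  # cached:   $0.1375
```

At 10,000 conversations/day, the gap between those two numbers is what prompt caching saves.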

3. Batch Prediction Job Overhead

Batch prediction jobs spin up clusters of virtual machines to process requests in parallel. Google bills for the full cluster runtime—not just active processing time.

Example: a batch job processes 10,000 predictions using 40 n1-highmem-8 machines. Each machine costs $0.544/hour. The job completes in 15 minutes, but Google rounds up to 30-second billing increments and charges for the full cluster:

  • Cost: 40 machines × $0.544/hour × 0.25 hours = $5.44

If job startup and teardown add 5 minutes, total billed time increases to 20 minutes → $7.25.

Fix: Batch requests into larger groups to maximize per-job throughput. Schedule batch jobs during off-peak hours to leverage any future spot pricing discounts Google may introduce for Vertex AI.

4. Egress Fees for Model Artifacts

Every time you download a trained model, export predictions, or transfer data out of Google Cloud, egress fees apply.

For example, a team exports a 5 GB trained model to AWS S3 for multi-cloud deployment. Egress to AWS costs $0.12/GB → $0.60 per export. If they update the model daily and export each version, monthly egress costs reach $18 just for model transfers.

Fix: Store model artifacts in Cloud Storage buckets co-located with downstream consumers. Use Cloud Interconnect or Direct Peering to reduce egress fees for high-volume cross-cloud transfers.

5. Untagged Resource Sprawl

Vertex AI creates dozens of resources (training jobs, endpoints, pipelines, experiments) across multiple projects and regions. Without consistent tagging, cost attribution becomes impossible.

Imagine a FinOps team discovers $15,000/month in Vertex AI charges but can’t identify which business unit or application is responsible. Billing reports show only generic “Vertex AI Prediction” line items with no team, project, or feature labels.

Fix: Enforce mandatory labels at resource creation time. Tag every training job, endpoint, and pipeline with:

  • `team`: Data Science, ML Engineering, Product
  • `environment`: dev, staging, prod
  • `cost_center`: business unit or budget code
  • `application`: recommendation-engine, fraud-detection

Use Google Cloud’s Organization Policy Service to block resource creation without required labels.

How to Optimize Vertex AI Costs

Now let’s talk about a few ways to make Vertex AI more cost-effective.

1. Right-Size Machine Types

Many teams default to high-memory or GPU-accelerated instances for every workload. Most training jobs and inference endpoints don’t need premium hardware.

Strategy:

  • Start with e2-standard-4 ($0.154/hour) for inference endpoints. Upgrade to n2 or c2 series only if latency requirements demand it.
  • Use n1-standard-4 ($0.219/hour) for small-scale training jobs. Reserve a2-highgpu-8g ($35.40/hour) for distributed training on datasets >100 GB.
  • Benchmark inference latency with T4 GPUs ($0.40/hour) before committing to A100s ($2.93/hour). Many vision and NLP models run acceptably on T4 hardware at 1/7th the cost.

Impact: Downgrading 10 inference endpoints from n2-standard-4 ($0.223/hour) to e2-standard-4 ($0.154/hour) saves $49.68/month per endpoint → $496.80/month total.

2. Leverage Batch API for Non-Time-Sensitive Workloads

Batch API pricing cuts both input and output token costs by 50% compared to real-time inference.

Strategy:

  • Route offline analysis, content moderation queues, and daily report generation to Batch API.
  • Schedule batch jobs during off-peak hours (e.g., 2-6 AM) to reduce contention with production traffic.
  • Group requests into larger batches (1,000-10,000 items) to maximize per-job efficiency.

Impact: Processing 1 million Gemini 2.5 Flash requests (10K input tokens, 500 output tokens each):

  • Real-time: 1M × (10K × $0.30/M + 500 × $2.50/M) = $3,000 + $1,250 = $4,250
  • Batch API: 1M × (10K × $0.15/M + 500 × $1.25/M) = $1,500 + $625 = $2,125
  • Monthly savings (at 1M requests/day): ($4,250 – $2,125) × 30 = $63,750
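The real-time versus Batch API comparison can be scripted directly from the Flash rate card (a sketch using this article's rates):

```python
# Sketch: cost of N Gemini 2.5 Flash requests at real-time vs Batch API
# rates. Rates are $/M tokens from the Flash table in this article.

def flash_cost(requests, in_tok, out_tok, batch=False):
    in_rate, out_rate = (0.15, 1.25) if batch else (0.30, 2.50)
    return requests * (in_tok * in_rate + out_tok * out_rate) / 1_000_000

realtime = flash_cost(1_000_000, 10_000, 500)
batched = flash_cost(1_000_000, 10_000, 500, batch=True)
print(f"real-time: ${realtime:,.2f}")        # real-time: $4,250.00
print(f"batch:     ${batched:,.2f}")         # batch:     $2,125.00
print(f"savings:   ${realtime - batched:,.2f}")
```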

3. Implement Prompt Caching

Gemini models support prompt caching, which stores repeated input tokens (e.g., system instructions, knowledge bases) and charges only 10% of the standard input rate for cached content.

Strategy:

  • Identify static prompt components that don’t change between requests (e.g., product catalogs, policy documents, few-shot examples).
  • Separate dynamic user queries from static context in your prompt structure.
  • Enable caching for prompts >1K tokens with TTL set to match your content update frequency.

Impact: A customer support bot sends a 50K-token knowledge base with every request. With caching:

  • First request: 50K input tokens × $0.30/M = $0.015
  • Subsequent requests (cache hit): 50K cached tokens × $0.030/M = $0.0015
  • Savings per request after cache hit: $0.0135 (90% reduction)

At 10,000 daily requests (9,999 cache hits), monthly savings reach $4,050.

4. Use Committed Use Discounts (CUDs)

Google Cloud CUDs provide up to 55% savings on Vertex AI compute costs in exchange for 1-year or 3-year spending commitments.

How It Works:

  • Resource-based CUDs: Commit to a specific amount of vCPU, memory, or GPU hours per month. Discounts apply automatically to Vertex AI training and inference workloads using Compute Engine SKUs.

  • Spend-based CUDs: Commit to a minimum spend on eligible Compute Engine resources backing Vertex AI workloads. Google applies tiered discounts (25-55%) to the eligible usage covered by the commitment.

Commitment Terms:

  • 1-year commitment: 25-35% discount

  • 3-year commitment: 40-55% discount

Example:

A team runs 10 production inference endpoints 24/7 on n2-standard-8 machines ($0.447/hour each):

  • Monthly usage: 10 endpoints × $0.447/hour × 720 hours = $3,218.40

  • On-demand annual cost: $3,218.40 × 12 = $38,620.80

  • With 3-year CUD (52% discount): $38,620.80 × 0.48 = $18,537.98

  • Annual savings: $20,082.82
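The CUD example above reduces to a one-line formula, sketched here with the article's figures:

```python
# Sketch: annual cost of N always-on endpoints, on demand vs under a
# committed use discount. 720 hours/month approximates 24/7 operation.

def annual_cost(endpoints, hourly_rate, discount=0.0, hours_per_month=720):
    return endpoints * hourly_rate * hours_per_month * 12 * (1 - discount)

on_demand = annual_cost(10, 0.447)                # $38,620.80
with_cud = annual_cost(10, 0.447, discount=0.52)  # $18,537.98
print(f"annual savings: ${on_demand - with_cud:,.2f}")  # annual savings: $20,082.82
```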

Strategy:

  • Analyze historical usage patterns to identify baseline workloads that run continuously (production endpoints, scheduled batch jobs).

  • Commit 70-80% of your expected usage via CUDs. Leave 20-30% on-demand for experimentation and spiky workloads.

  • Use nOps to automate CUD recommendations, track commitment utilization, and optimize coverage across projects.

Pitfall to Avoid: CUDs charge for the committed amount even if you don’t use it. A 100-vCPU commitment costs the same whether you use 100 vCPUs or 10. Only commit to resources you’re confident you’ll consume.

5. Automate Idle Resource Cleanup

Forgotten development endpoints and abandoned experiments are the most common source of waste in Vertex AI.

Strategy:

  • Deploy a Cloud Function or Cloud Run job that scans all Vertex AI endpoints every 6 hours.
  • Check prediction request metrics via Cloud Monitoring. If an endpoint has zero requests in the past 4 hours and is tagged `environment: dev`, automatically undeploy it.
  • Send Slack/email notifications to resource owners before cleanup.
  • Enforce a 7-day TTL for all dev/staging endpoints using resource labels and automated deletion policies.
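The cleanup policy above boils down to a small decision function. This is pure policy logic, not SDK code: in practice the scanner (a Cloud Function or Cloud Run job) would gather the inputs from Cloud Monitoring metrics and endpoint labels, and the label values here are illustrative:

```python
# Sketch of the idle-endpoint cleanup policy: undeploy non-production
# endpoints that have been idle past a threshold; never touch prod.

def should_undeploy(env_label: str, hours_since_last_request: float,
                    idle_threshold_hours: float = 4.0) -> bool:
    """Return True if the scanner should undeploy this endpoint."""
    if env_label == "prod":
        return False
    return hours_since_last_request >= idle_threshold_hours

print(should_undeploy("dev", 6.5))   # True  -> undeploy and notify the owner
print(should_undeploy("prod", 48))   # False -> production is exempt
```

Keeping the decision separate from the undeploy call makes the policy easy to unit-test before wiring it to real endpoints.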

Impact: Cleaning up 15 idle n2-standard-4 endpoints ($0.223/hour each) that ran for an average of 20 days saves $1,605.60 ($0.223 × 15 × 480 hours).

6. Optimize Model Selection

Not every use case needs Gemini 2.5 Pro. Google offers a tiered model family with significant price differences.

Model Selection Matrix:

| Use Case | Recommended Model | Input Cost | Output Cost |
|---|---|---|---|
| High-stakes reasoning (legal, medical) | Gemini 2.5 Pro | $1.25/M | $10.00/M |
| General chat, summarization | Gemini 2.5 Flash | $0.30/M | $2.50/M |
| Simple classification, routing | Gemini 2.5 Flash Lite | $0.10/M | $0.40/M |

Impact: Switching 50% of chatbot traffic from Pro to Flash for routine queries:

  • Before: 10M input tokens × $1.25/M + 1M output tokens × $10.00/M = $22.50/day
  • After (50% on Flash): (5M × $1.25/M + 0.5M × $10.00/M) + (5M × $0.30/M + 0.5M × $2.50/M) = $11.25 + $2.75 = $14.00/day
  • Monthly savings: ($22.50 – $14.00) × 30 = $255
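The routing arithmetic above generalizes to any Pro/Flash traffic split (a sketch using this article's rates):

```python
# Sketch: blended daily cost when a share of traffic moves from
# Gemini 2.5 Pro to Flash. Token volumes are in millions per day.

RATES = {"pro": (1.25, 10.00), "flash": (0.30, 2.50)}  # ($/M in, $/M out)

def daily_cost(in_m_tokens, out_m_tokens, flash_share=0.0):
    cost = 0.0
    for model, share in (("pro", 1 - flash_share), ("flash", flash_share)):
        in_rate, out_rate = RATES[model]
        cost += share * (in_m_tokens * in_rate + out_m_tokens * out_rate)
    return cost

print(f"${daily_cost(10, 1):.2f}/day")                   # $22.50/day (all Pro)
print(f"${daily_cost(10, 1, flash_share=0.5):.2f}/day")  # $14.00/day (50% Flash)
```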

7. Monitor and Cap Spending

Unlike some cloud services, Vertex AI does not offer hard spending caps. Teams must implement their own guardrails.

Strategy:

  • Set up Cloud Billing budgets with alert thresholds at 50%, 80%, and 100% of monthly target.

  • Use BigQuery to export detailed billing data and analyze spend by service, project, and label.

  • Build a custom dashboard in Looker or Data Studio that shows cost per endpoint, token consumption by model and application, and idle resource detection (endpoints with zero requests in the past 24 hours).

  • Implement quota limits at the project level to prevent runaway training jobs or token consumption spikes.

Impact: Early detection of a misconfigured batch job that would have consumed $50,000 in GPU hours over a weekend, allowing cancellation after $500 in spend.

Using nOps for GCP Commitment Management

nOps automates the most complex—and highest-impact—cost optimization lever for Vertex AI: Committed Use Discounts (CUDs).

How nOps Optimizes GCP Commitments

1. Usage Analysis: nOps continuously analyzes your Vertex AI workload patterns across all projects, regions, and machine types. It identifies stable baseline usage suitable for CUD coverage and separates variable workloads that should remain on-demand.

2. Commitment Recommendations: Instead of manual spreadsheet analysis, nOps provides AI-driven recommendations specifying:

  • Optimal commitment size (vCPU, memory, GPU hours)

  • Recommended term (1-year vs 3-year)

  • Expected monthly savings

  • Utilization forecast based on historical trends

3. Automated Purchasing: nOps can automatically purchase CUDs on your behalf when utilization patterns stabilize. No manual GCP Console navigation required.

4. Utilization Tracking: nOps monitors CUD utilization in real-time, alerting you if commitments are underutilized (wasted spend) or if usage spikes exceed coverage (opportunity for additional commitments).

5. Multi-Project Optimization: Unlike native GCP billing tools, nOps aggregates usage across all projects and organizations. It identifies opportunities to consolidate workloads into shared CUDs, maximizing discount coverage.

Real-World Impact

Case Study: A mid-market SaaS company running 50 Vertex AI inference endpoints across 5 projects:

  • Monthly on-demand spend: $28,000
  • Historical usage showed 85% of workloads ran continuously for >6 months
  • nOps recommendation: 3-year resource-based CUD covering 80% of baseline usage
  • Result: roughly $11,650/month in savings (52% discount on the committed 80% of spend)
  • ROI: nOps paid for itself in the first month

Why Manual Commitment Management Fails

GCP CUDs are powerful but notoriously difficult to optimize manually:

  • Complex eligibility rules: Not all Vertex AI services are CUD-eligible, and GPUs require separate commitments from CPU/memory.

  • Regional constraints: CUDs apply only to specific regions. Multi-region workloads require separate commitments per region.

  • Changing workloads: As machine learning models evolve, workload patterns shift. Yesterday’s optimal commitment becomes tomorrow’s wasted spend.

  • Risk of over-commitment: Committing to more resources than you’ll use locks you into paying for unused capacity.

nOps removes this complexity with continuous optimization—automatically adjusting recommendations as your Vertex AI usage evolves.

Vertex AI Pricing Compared to Alternatives

Let’s compare Vertex AI to AWS SageMaker and Azure ML:
| Feature | Vertex AI | AWS SageMaker | Azure Machine Learning |
|---|---|---|---|
| Training (GPU) | $2.93/hour (A100) | $4.10/hour (A100) | $3.67/hour (A100) |
| Inference (CPU) | $0.154/hour (e2-std-4) | $0.192/hour (ml.m5.xlarge) | $0.228/hour (Standard_D4s_v3) |
| Generative AI | Gemini 2.5 Flash: $0.30/M input | Claude 3.5 Sonnet: $3.00/M input | GPT-4o: $2.50/M input |
| AutoML | $21.25/hour training | $20.40/hour training | $19.80/hour training |
| Commitment Discounts | CUDs: up to 55% (3-year) | Savings Plans: up to 72% (3-year) | Reserved Instances: up to 72% (3-year) |

Key Takeaway: Vertex AI’s base rates are competitive, but its lack of scale-to-zero and mandatory endpoint fees increase total cost of ownership for variable workloads. AWS SageMaker’s Serverless Inference and Azure ML’s endpoint auto-scaling offer better cost efficiency for low-traffic models.

However, for high-volume production workloads where CUDs apply, Vertex AI becomes cost-competitive—especially when combined with nOps automated commitment management.

Conclusion

Vertex AI pricing follows Google Cloud’s pay-as-you-go philosophy—but without proactive cost management, that flexibility quickly becomes a liability. Idle endpoints, inefficient model selection, and unoptimized token consumption can turn a $5,000 budget into a $20,000 surprise bill.

Across the tools in this guide, commitment optimization remains one of the largest savings levers in GCP. nOps focuses on maximizing that lever automatically — increasing your effective savings rate without adding operational overhead. And, we only get paid after delivering you measurable savings.

In 2026, “good enough” means you’re likely leaving money on the table. We’ve talked to companies that can save hundreds of thousands on their cloud bills by switching to nOps from competitors.

There’s no risk to book a free savings analysis to find out if nOps can help you get more value out of your cloud investments.

nOps manages $3B+ in cloud spend and was recently rated #1 in G2’s Cloud Cost Management category.

Frequently Asked Questions

Let’s dive into a few frequently asked questions about Google Cloud Vertex AI and generative AI pricing.

How does Vertex AI pricing work?

Vertex AI pricing depends on the services and models you use, including training, prediction, pipelines, and generative AI features. Costs are usage-based and may include compute time, storage, endpoints, and API requests. For example, training jobs are billed per machine hour while generative models charge per input and output token. Vertex AI Studio follows the same usage-based pricing model, so prototyping and testing prompts can still generate charges depending on the models and services you use.

How much do Vertex AI tokens cost?

Vertex AI token pricing varies by the model used, such as Gemini models or text embeddings. Costs are typically measured per million input and output tokens. For example, some Gemini models charge different rates for input versus output tokens, with higher costs for larger context windows and advanced reasoning capabilities.

What is Vertex AI cost optimization?

Vertex AI cost optimization refers to strategies and tools used to reduce spending on machine learning and generative AI workloads in Google Cloud. This includes choosing efficient models, controlling token usage, autoscaling endpoints, scheduling training jobs, monitoring usage with billing reports, and using tools like nOps to improve cloud efficiency.