Vertex AI Pricing: The Complete 2026 Guide to Costs, Hidden Fees, and Savings
Google Vertex AI pricing operates on a pay-as-you-go model—but without the right strategies, those “as you go” charges can spiral into surprise bills that derail your AI budget. Multiple teams have reported shock invoices ranging from $400 to over $20,000 in a single month, often from services they didn’t realize were still running.
Optimizing Vertex AI pricing requires controlling token consumption, managing idle endpoints, avoiding hidden egress fees, and most importantly—leveraging Google Cloud Committed Use Discounts (CUDs) to lock in savings of up to 55% on predictable workloads.
This guide breaks down every cost component of Vertex AI, common pitfalls, and practical strategies for leveraging commitment management strategically to lower costs.
What Is Vertex AI?
Vertex AI is Google Cloud's unified machine learning platform for building, training, deploying, and scaling ML models and generative AI applications.
Core Capabilities
Key features of Vertex AI include:
- AutoML Models: Pre-configured workflows for image classification, object detection, tabular forecasting, and natural language processing. Teams with limited ML expertise can build production models without writing training loops or tuning hyperparameters manually.
- Custom Training & Model Tuning: Full control over model architectures, frameworks (TensorFlow, PyTorch, scikit-learn), and compute configurations. You bring your own model training code, and Vertex AI manages distributed infrastructure, job orchestration, and resource scaling.
- Gen AI Models: Access to Google's latest foundation models, including Gemini 2.5 Pro, Gemini 2.5 Flash, and Imagen. These models power text generation, multimodal reasoning, code completion, image generation, and video generation use cases.
- Managed Endpoints: Deploy models to scalable prediction endpoints with automatic load balancing, versioning, and traffic splitting. Endpoints remain active (and billable) until you explicitly un-deploy them—a common source of runaway costs.
- Vector Search & RAG Engine: Semantic search infrastructure for retrieval-augmented generation (RAG) pipelines. Charges apply for index build, streaming updates, and storage in addition to LLM inference costs.
- Pipelines & MLOps: Orchestrate multi-step workflows with Vertex AI Pipelines, track experiments with TensorBoard, and manage model metadata with ML Metadata. Each service has its own pricing structure.
Common Use Cases
The most common include:
- Predictive Analytics: Time-series forecasting for demand planning, inventory optimization, and financial projections. Vertex AI Forecast uses ARIMA+ models with per-TB training costs plus per-1K-point prediction fees.
- Customer Support Automation: GenAI models power chatbots, email auto-responders, and knowledge base search. Token consumption can skyrocket when prompts include large context windows or multi-turn conversations.
- Computer Vision: Image classification, object detection, and video analysis for quality control, security monitoring, and content moderation. AutoML image models charge $3.465/hour for training and $1.375/hour per deployed endpoint.
- Content Generation: Imagen for image synthesis, Veo for video generation, and Gemini for text/code completion. Video generation pricing ($0.50-$0.75/second) has caused some of the most dramatic billing surprises.
How Vertex AI Pricing Works
Primary Cost Drivers
The main cost components of Vertex AI pricing are:
1. Compute Resources: Training jobs and prediction endpoints bill per node-hour based on machine type (vCPU, RAM) and attached accelerators (GPUs, TPUs). A single A100 GPU costs $2.93 per hour in us-central1, and charges accumulate from the moment Google provisions resources until the job completes or you undeploy the endpoint.
2. Token Consumption: Generative AI models charge per million tokens processed. Gemini 2.5 Pro costs $1.25 per million input tokens (≤200K context) and $10.00 per million output tokens. Large context windows (200K-1M tokens) double those rates—and every request re-sends the full context.
3. Endpoint Uptime: Online prediction endpoints charge hourly fees even during idle periods. An e2-standard-2 endpoint costs $0.077 per hour continuously. Teams often forget to undeploy development endpoints, accumulating hundreds of dollars per month in “ghost” charges.
4. Storage Costs: Models stored in Vertex AI Model Registry, datasets in Cloud Storage, and vector indexes in Vector Search all generate data storage fees. Standard Cloud Storage costs $0.020/GB-month; SSD-backed storage costs $0.170/GB-month.
5. Data Transfer (Egress): Moving data out of Google Cloud incurs egress fees—$0.12/GB to most destinations, $0.23/GB to China/Australia. Large batch prediction jobs or frequent model downloads can add unexpected charges.
6. Management Fees: Vertex AI adds management fees on top of underlying Compute Engine costs. For example, an NVIDIA A100 GPU costs $2.93 per hour for compute plus $0.44 per hour Vertex management fee, totaling $3.37 per hour.
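To make the interaction of these drivers concrete, here is a minimal cost-estimation sketch in Python. The function names are illustrative, and the rates are the us-central1 list prices quoted in this guide, not an official calculator:

```python
# Hypothetical sketch of two of the cost drivers above: endpoint uptime
# billing and per-million-token generative AI billing.

def monthly_endpoint_cost(hourly_rate: float, hours: float = 720) -> float:
    """Endpoint uptime billing: the hourly rate accrues whether or not
    any prediction requests arrive."""
    return round(hourly_rate * hours, 2)

def token_cost(input_tokens: int, output_tokens: int,
               input_rate_per_m: float, output_rate_per_m: float) -> float:
    """Generative AI billing: separate per-million rates for input and output."""
    return round(input_tokens / 1e6 * input_rate_per_m
                 + output_tokens / 1e6 * output_rate_per_m, 4)

# An e2-standard-2 endpoint ($0.077/hour) left running for a 720-hour month:
print(monthly_endpoint_cost(0.077))                  # 55.44
# 1M input / 100K output tokens on Gemini 2.5 Pro (<=200K context):
print(token_cost(1_000_000, 100_000, 1.25, 10.00))   # 2.25
```

The same two functions reproduce most of the endpoint and token figures used in the examples later in this guide.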
Vertex AI Pricing Breakdown by Service
Generative AI Models
Gemini 2.5 Pro (text, image, video, and audio inputs)
| Usage Type | ≤200K Tokens | >200K Tokens | Batch API ≤200K | Batch API >200K |
|---|---|---|---|---|
| Input | $1.25/M | $2.50/M | $0.625/M | $1.25/M |
| Text Output | $10.00/M | $15.00/M | $5.00/M | $7.50/M |
Gemini 2.5 Flash (lower cost, faster inference)
| Input Type | ≤200K Tokens | Cached Tokens | Batch API |
|---|---|---|---|
| Text / Image / Video | $0.30/M | $0.030/M | $0.15/M |
| Audio Input | $1.00/M | $0.100/M | $0.50/M |
| Text Output | $2.50/M | N/A | $1.25/M |
| Image Output | $30.00/M | N/A | $15.00/M |
Grounding Features (add-on costs)
- Google Search Grounding: $35 per 1,000 grounded prompts after the free tier
- Web Grounding for Enterprise: $45 per 1,000 grounded prompts
- Google Maps Grounding: $25 per 1,000 grounded prompts
AutoML Models
Image Data Pricing Table
| Operation | Price |
|---|---|
| Training | $3.465 per node-hour |
| Training (Edge on-device) | $18.00 per node-hour |
| Deployment & Online Prediction | $1.375 per node-hour (classification) / $2.002 per node-hour (object detection) |
| Batch Prediction | $2.222 per node-hour |
Tabular Data
| Operation | Price |
|---|---|
| Training | $21.252 per node-hour |
| Inference | Same as custom-trained models (see below) |
Note: Deployed AutoML models charge hourly fees continuously. A classification endpoint left running for 30 days costs $990.00 ($1.375/hour × 720 hours), even with zero prediction requests.
Custom-Trained Models
Machine Types (us-central1 pricing)
| Machine Type | Price per Hour |
|---|---|
| n1-standard-4 | $0.219 |
| n1-highmem-16 | $1.088 |
| n2-standard-32 | $1.787 |
| a2-highgpu-8g | $35.40 (includes 8× A100 GPUs) |
| a3-ultragpu-8g | $99.77 (includes 8× H100 GPUs) |
Accelerator Pricing (us-central1)
| Accelerator | Price per Hour |
|---|---|
| NVIDIA A100 (40GB) | $2.93 + $0.44 management fee |
| NVIDIA H100 (80GB) | $9.80 + $1.47 management fee |
| NVIDIA L4 | $0.64 |
| TPU v3 Pod (32 cores) | $36.80 |
Vertex AI Forecast
AutoML Forecasting
| Volume Tier | Price |
|---|---|
| 0–1M predictions/month | $0.20 per 1,000 predictions |
| 1M–50M predictions/month | $0.10 per 1,000 predictions |
| 50M+ predictions/month | $0.02 per 1,000 predictions |
ARIMA+ Forecasting
- Training: $250 per TB × number of candidate models × number of backtesting windows
- Prediction: $5.00 per 1,000 data points
Supporting Services
Vertex AI Pipelines
Pipeline orchestration: $0.03 per run, plus compute costs for each pipeline step
Vertex AI Feature Store
- Data Processing Node: $0.08/hour
- Optimized Serving Node: $0.30/hour (includes 200 GB)
- Bigtable Serving Node: $0.94/hour
Vertex AI Vector Search
- Index Build / Update (Batch): $3.00 per GiB processed
- Index Serving: $0.094 per node-hour (e2-standard-2)
- Streaming Update: $0.45 per GiB inserted
Vertex AI Workbench (Managed Notebooks)
- vCPU: $0.0379/hour (N1/N2/A2) or $0.0261/hour (E2)
- Memory: $0.0051 per GiB-hour (N1/N2/A2) or $0.0035 per GiB-hour (E2)
- GPU Management Fee: $0.35/hour (standard GPUs) or $2.48/hour (premium GPUs)
The Hidden Costs That Surprise Teams
1. Idle Endpoints (No Scale-to-Zero)
Vertex AI does not support automatic scale-to-zero for deployed models. Once you deploy a model to an endpoint, charges accumulate continuously until you explicitly undeploy it.
For example, a development team deploys three experimental models to separate e2-standard-4 endpoints ($0.154/hour each) for A/B testing. They pause testing but forget to undeploy. After 30 days, the bill shows $332.64 in unused endpoint fees ($0.154 × 3 endpoints × 720 hours).
Fix: Implement automated deprovisioning scripts that undeploy endpoints after a configurable idle period (e.g., 4 hours without prediction requests). Tag endpoints with environment labels (dev/staging/prod) and enforce stricter cleanup policies for non-production resources.
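A minimal sketch of that cleanup policy is below. The 4-hour idle window and the dev-only rule come from the fix above; in a real deployment you would read request counts from Cloud Monitoring and undeploy via the google-cloud-aiplatform SDK (e.g., `Endpoint.undeploy_all()`), which this sketch omits:

```python
# Sketch of an idle-endpoint cleanup decision. The thresholds are the
# assumed policy from the text; wiring to Cloud Monitoring and the
# Vertex AI SDK is left out.
from datetime import datetime, timedelta
from typing import Optional

IDLE_WINDOW = timedelta(hours=4)  # assumed idle threshold

def should_undeploy(labels: dict, last_request_at: Optional[datetime],
                    now: datetime) -> bool:
    """Undeploy non-production endpoints with no traffic inside the idle window."""
    if labels.get("environment") == "prod":
        return False  # never auto-undeploy production endpoints
    if last_request_at is None:
        return True   # endpoint has never served a request
    return now - last_request_at > IDLE_WINDOW

now = datetime(2026, 3, 16, 12, 0)
print(should_undeploy({"environment": "dev"}, now - timedelta(hours=6), now))   # True
print(should_undeploy({"environment": "prod"}, now - timedelta(hours=6), now))  # False
```

Running this from a scheduled Cloud Function every few hours, with a notification step before the actual undeploy call, covers the most common "ghost endpoint" scenario.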
2. Context Window Multipliers
Large language models charge separately for input and output tokens—and multi-turn conversations re-send the entire conversation history with every new user message.
Imagine that a customer support chatbot uses Gemini 2.5 Pro with a 50K-token knowledge base included in every prompt. Each user message adds 2K tokens, and each assistant response generates 500 tokens. A 10-turn conversation consumes:
- Input: (50K base + 2K user) × 10 turns = 520K tokens → $0.65 (ignoring re-sent assistant replies)
- Output: 500 tokens × 10 turns = 5K tokens → $0.05
- Total per conversation: $0.70
At 10,000 conversations/day, monthly cost reaches $210,000.
Fix: Use prompt caching to store the 50K knowledge base once, reducing repeat-input costs by 90%. Switch to Gemini 2.5 Flash ($0.30 input/$2.50 output per million tokens) for routine queries, reserving Pro for complex reasoning.
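The chatbot arithmetic can be sketched as a small cost model. The rates are Gemini 2.5 Pro list prices and the 90% cached-token discount is the guide's stated caching rate; real caching also bills the first uncached request and cache storage, which this simplification ignores:

```python
# Simplified cost model for the multi-turn chatbot example: a static
# 50K-token knowledge base plus 2K of fresh user text per turn, 500
# output tokens per turn. Assumes every turn's knowledge base hits the
# cache when cached=True (first-request and storage costs ignored).

def conversation_cost(turns: int, base: int = 50_000, per_turn_in: int = 2_000,
                      per_turn_out: int = 500, in_rate: float = 1.25,
                      out_rate: float = 10.00, cached: bool = False) -> float:
    cache_rate = in_rate * 0.10  # cached tokens bill at 10% of the input rate
    cost = 0.0
    for _ in range(turns):
        base_rate = cache_rate if cached else in_rate
        cost += base / 1e6 * base_rate          # knowledge base (cached or not)
        cost += per_turn_in / 1e6 * in_rate     # fresh user text
        cost += per_turn_out / 1e6 * out_rate   # model output
    return round(cost, 4)

print(conversation_cost(10))               # 0.7  per 10-turn conversation
print(conversation_cost(10, cached=True))  # 0.1375 with prompt caching
```

At 10,000 conversations a day, that per-conversation gap is the difference between a six-figure and a five-figure monthly bill.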
3. Batch Prediction Job Overhead
Batch prediction jobs spin up clusters of virtual machines to process requests in parallel. Google bills for the full cluster runtime—not just active processing time.
Example: a batch job processes 10,000 predictions using 40 n1-highmem-8 machines at $0.544/hour each. The job completes in 15 minutes, but Google bills every machine in the cluster for the full runtime, in 30-second increments:
- Cost: 40 machines × $0.544/hour × 0.25 hours = $5.44
If job startup and teardown add 5 minutes, total billed time increases to 20 minutes → $7.25.
Fix: Batch requests into larger groups to maximize per-job throughput. Schedule batch jobs during off-peak hours, and consider Spot VM pricing where Vertex AI supports it to cut compute costs further.
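The cluster billing model can be expressed as a short helper. The 30-second increment matches the example above; everything else is straightforward arithmetic:

```python
# Batch prediction billing: every machine in the cluster bills for the
# full job runtime, rounded up to 30-second increments.
import math

def batch_job_cost(machines: int, hourly_rate: float,
                   runtime_minutes: float) -> float:
    billed_seconds = math.ceil(runtime_minutes * 60 / 30) * 30  # 30s increments
    return round(machines * hourly_rate * billed_seconds / 3600, 2)

print(batch_job_cost(40, 0.544, 15))  # 5.44 -- the 15-minute job
print(batch_job_cost(40, 0.544, 20))  # 7.25 -- with 5 min startup/teardown
```

Note that the cost scales with cluster size times runtime, so halving the machine count on a job that is not parallelism-bound roughly halves the bill.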
4. Egress Fees for Model Artifacts
Every time you download a trained model, export predictions, or transfer data out of Google Cloud, egress fees apply.
For example, a team exports a 5 GB trained model to AWS S3 for multi-cloud deployment. Egress to AWS costs $0.12/GB → $0.60 per export. If they update the model daily and export each version, monthly egress costs reach $18 just for model transfers.
Fix: Store model artifacts in Cloud Storage buckets co-located with downstream consumers. Use Cloud Interconnect or Direct Peering to reduce egress fees for high-volume cross-cloud transfers.
5. Untagged Resource Sprawl
Vertex AI creates dozens of resources (training jobs, endpoints, pipelines, experiments) across multiple projects and regions. Without consistent tagging, cost attribution becomes impossible.
Imagine a FinOps team discovers $15,000/month in Vertex AI charges but can’t identify which business unit or application is responsible. Billing reports show only generic “Vertex AI Prediction” line items with no team, project, or feature labels.
Fix: Enforce mandatory labels at resource creation time. Tag every training job, endpoint, and pipeline with:
- `team`: Data Science, ML Engineering, Product
- `environment`: dev, staging, prod
- `cost_center`: business unit or budget code
- `application`: recommendation-engine, fraud-detection
Use Google Cloud’s Organization Policy Service to block resource creation without required labels.
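As a lightweight complement to an Organization Policy, the label check can be sketched as a pre-flight validation, for example in CI before resources are created. The required label set mirrors the list above:

```python
# Pre-flight check for the mandatory labels listed above. Organization
# Policy enforces this at the platform level; this sketch is a cheap
# early warning you could run in CI or a deployment script.
REQUIRED_LABELS = {"team", "environment", "cost_center", "application"}

def missing_labels(labels: dict) -> set:
    """Return the required labels absent from a resource's label map."""
    return REQUIRED_LABELS - labels.keys()

# An endpoint tagged with only team and environment is missing two labels:
print(missing_labels({"team": "ml-eng", "environment": "dev"}))
```

Failing the pipeline when `missing_labels` returns a non-empty set keeps cost attribution intact without waiting for a billing-report audit.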
How to Optimize Vertex AI Costs
1. Right-Size Machine Types
Many teams default to high-memory or GPU-accelerated instances for every workload. Most training jobs and inference endpoints don’t need premium hardware.
Strategy:
- Start with e2-standard-4 ($0.154/hour) for inference endpoints. Upgrade to n2 or c2 series only if latency requirements demand it.
- Use n1-standard-4 ($0.219/hour) for small-scale training jobs. Reserve a2-highgpu-8g ($35.40/hour) for distributed training on datasets >100 GB.
- Benchmark inference latency with T4 GPUs ($0.40/hour) before committing to A100s ($2.93/hour). Many vision and NLP models run acceptably on T4 hardware at 1/7th the cost.
Impact: Downgrading 10 inference endpoints from n2-standard-4 ($0.223/hour) to e2-standard-4 ($0.154/hour) saves $49.68/month per endpoint → $496.80/month total.
2. Leverage Batch API for Non-Time-Sensitive Workloads
Batch API pricing cuts both input and output token costs by 50% compared to real-time inference.
Strategy:
- Route offline analysis, content moderation queues, and daily report generation to Batch API.
- Schedule batch jobs during off-peak hours (e.g., 2-6 AM) to reduce contention with production traffic.
- Group requests into larger batches (1,000-10,000 items) to maximize per-job efficiency.
Impact: Processing 1 million Gemini 2.5 Flash requests per day (10K input tokens, 500 output tokens each):
- Real-time: 10,000M input tokens × $0.30/M + 500M output tokens × $2.50/M = $3,000 + $1,250 = $4,250/day
- Batch API: 10,000M × $0.15/M + 500M × $1.25/M = $1,500 + $625 = $2,125/day
- Monthly savings: ($4,250 – $2,125) × 30 = $63,750
3. Implement Prompt Caching
Gemini models support prompt caching, which stores repeated input tokens (e.g., system instructions, knowledge bases) and charges only 10% of the standard input rate for cached content.
Strategy:
- Identify static prompt components that don’t change between requests (e.g., product catalogs, policy documents, few-shot examples).
- Separate dynamic user queries from static context in your prompt structure.
- Enable caching for prompts >1K tokens with TTL set to match your content update frequency.
Impact: A customer support bot sends a 50K-token knowledge base with every request. With caching:
- First request: 50K input tokens × $0.30/M = $0.015
- Subsequent requests (cache hit): 50K cached tokens × $0.030/M = $0.0015
- Savings per request after cache hit: $0.0135 (90% reduction)
At 10,000 daily requests (9,999 cache hits), monthly savings reach $4,050.
4. Use Committed Use Discounts (CUDs)
Google Cloud CUDs provide up to 55% savings on Vertex AI compute costs in exchange for 1-year or 3-year spending commitments.
How It Works:
Resource-based CUDs: Commit to a specific amount of vCPU, memory, or GPU hours per month. Discounts apply automatically to Vertex AI training and inference workloads using Compute Engine SKUs.
Spend-based CUDs: Commit to a minimum spend on eligible Compute Engine resources that power Vertex AI workloads. Google applies tiered discounts (25-55%) to CUD-eligible usage covered by the commitment.
Commitment Terms:
- 1-year commitment: 25-35% discount
- 3-year commitment: 40-55% discount
Example:
A team runs 10 production inference endpoints 24/7 on n2-standard-8 machines ($0.447/hour each):
- Monthly usage: 10 endpoints × $0.447/hour × 720 hours = $3,218.40
- On-demand annual cost: $3,218.40 × 12 = $38,620.80
- With 3-year CUD (52% discount): $38,620.80 × 0.48 = $18,537.98
- Annual savings: $20,082.82
Strategy:
- Analyze historical usage patterns to identify baseline workloads that run continuously (production endpoints, scheduled batch jobs).
- Commit 70-80% of your expected usage via CUDs. Leave 20-30% on-demand for experimentation and spiky workloads.
- Use nOps to automate CUD recommendations, track commitment utilization, and optimize coverage across projects.
Pitfall to Avoid: CUDs charge for the committed amount even if you don’t use it. A 100-vCPU commitment costs the same whether you use 100 vCPUs or 10. Only commit to resources you’re confident you’ll consume.
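The pitfall can be made concrete with a small model of commitment billing. The utilization framing is our simplification: against on-demand pricing, a commitment breaks even only when utilization exceeds (1 − discount):

```python
# Simplified commitment billing: the committed portion is paid in full
# at the discounted rate regardless of utilization; usage beyond the
# commitment bills at on-demand rates.

def effective_cost(on_demand_monthly: float, committed_monthly: float,
                   discount: float, utilization: float) -> float:
    """Monthly bill with a CUD covering `committed_monthly` of usage."""
    committed_bill = committed_monthly * (1 - discount)
    used = on_demand_monthly * utilization  # actual on-demand-equivalent usage
    overflow = max(0.0, used - committed_monthly)
    return round(committed_bill + overflow, 2)

# $3,218.40/month fully committed at a 52% discount:
print(effective_cost(3218.40, 3218.40, 0.52, 1.0))   # 1544.83 at full use
print(effective_cost(3218.40, 3218.40, 0.52, 0.40))  # 1544.83 -- same bill at 40% use
```

The second line is the pitfall in one number: dropping to 40% utilization does not reduce the bill at all, so the effective discount quietly evaporates.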
5. Automate Idle Resource Cleanup
Forgotten development endpoints and abandoned experiments are the most common source of waste in Vertex AI.
Strategy:
- Deploy a Cloud Function or Cloud Run job that scans all Vertex AI endpoints every 6 hours.
- Check prediction request metrics via Cloud Monitoring. If an endpoint has zero requests in the past 4 hours and is tagged `environment: dev`, automatically undeploy it.
- Send Slack/email notifications to resource owners before cleanup.
- Enforce a 7-day TTL for all dev/staging endpoints using resource labels and automated deletion policies.
Impact: Cleaning up 15 idle n2-standard-4 endpoints ($0.223/hour each) that ran for an average of 20 days saves $1,605.60 per month ($0.223 × 15 × 480 hours).
6. Optimize Model Selection
Not every use case needs Gemini 2.5 Pro. Google offers a tiered model family with significant price differences.
Model Selection Matrix:
| Use Case | Recommended Model | Input Cost | Output Cost |
|---|---|---|---|
| High-stakes reasoning (legal, medical) | Gemini 2.5 Pro | $1.25/M | $10.00/M |
| General chat, summarization | Gemini 2.5 Flash | $0.30/M | $2.50/M |
| Simple classification, routing | Gemini 2.5 Flash Lite | $0.10/M | $0.40/M |
Impact: Switching 50% of chatbot traffic from Pro to Flash for routine queries:
- Before: 10M input tokens × $1.25/M + 1M output tokens × $10.00/M = $22.50/day
- After (50% on Flash): (5M × $1.25/M + 0.5M × $10.00/M) + (5M × $0.30/M + 0.5M × $2.50/M) = $11.25 + $2.75 = $14.00/day
- Monthly savings: ($22.50 – $14.00) × 30 = $255
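A routing sketch makes the tiering concrete. The complexity threshold here is a placeholder assumption; real routers typically use a lightweight classifier or heuristics on the query:

```python
# Sketch of tiered model routing: routine queries go to Flash, complex
# reasoning to Pro. Rates come from the model selection matrix above;
# the 0.8 complexity threshold is an illustrative assumption.
RATES = {  # (input, output) dollars per million tokens
    "gemini-2.5-pro": (1.25, 10.00),
    "gemini-2.5-flash": (0.30, 2.50),
}

def route(query_complexity: float) -> str:
    """Placeholder policy: only high-complexity traffic reaches Pro."""
    return "gemini-2.5-pro" if query_complexity > 0.8 else "gemini-2.5-flash"

def request_cost(model: str, in_tokens: int, out_tokens: int) -> float:
    in_rate, out_rate = RATES[model]
    return round(in_tokens / 1e6 * in_rate + out_tokens / 1e6 * out_rate, 6)

print(request_cost("gemini-2.5-pro", 2_000, 100))    # 0.0035
print(request_cost("gemini-2.5-flash", 2_000, 100))  # 0.00085
```

For a typical short query, Flash is roughly 4× cheaper than Pro, which is why shifting even half of routine traffic moves the daily bill noticeably.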
7. Monitor and Cap Spending
Unlike some cloud services, Vertex AI does not offer hard spending caps. Teams must implement their own guardrails.
Strategy:
- Set up Cloud Billing budgets with alert thresholds at 50%, 80%, and 100% of monthly target.
- Use BigQuery to export detailed billing data and analyze spend by service, project, and label.
- Build a custom dashboard in Looker or Looker Studio that shows cost per endpoint, token consumption by model and application, and idle resource detection (endpoints with zero requests in the past 24 hours).
- Implement quota limits at the project level to prevent runaway training jobs or token consumption spikes.
Impact: Early detection of a misconfigured batch job that would have consumed $50,000 in GPU hours over a weekend, allowing cancellation after $500 in spend.
Using nOps for GCP Commitment Management
How nOps Optimizes GCP Commitments
1. Usage Analysis: nOps continuously analyzes your Vertex AI workload patterns across all projects, regions, and machine types. It identifies stable baseline usage suitable for CUD coverage and separates variable workloads that should remain on-demand.
2. Commitment Recommendations: Instead of manual spreadsheet analysis, nOps provides AI-driven recommendations specifying:
- Optimal commitment size (vCPU, memory, GPU hours)
- Recommended term (1-year vs 3-year)
- Expected monthly savings
- Utilization forecast based on historical trends
3. Automated Purchasing: nOps can automatically purchase CUDs on your behalf when utilization patterns stabilize. No manual GCP Console navigation required.
4. Utilization Tracking: nOps monitors CUD utilization in real-time, alerting you if commitments are underutilized (wasted spend) or if usage spikes exceed coverage (opportunity for additional commitments).
5. Multi-Project Optimization: Unlike native GCP billing tools, nOps aggregates usage across all projects and organizations. It identifies opportunities to consolidate workloads into shared CUDs, maximizing discount coverage.
Real-World Impact
Case Study: A mid-market SaaS company running 50 Vertex AI inference endpoints across 5 projects:
- Monthly on-demand spend: $28,000
- Historical usage showed 85% of workloads ran continuously for >6 months
- nOps recommendation: 3-year resource-based CUD covering 80% of baseline usage
- Result: $14,560 annual savings (52% discount on committed portion)
- ROI: nOps paid for itself in the first month
Why Manual Commitment Management Fails
GCP CUDs are powerful but notoriously difficult to optimize manually:
Complex eligibility rules: Not all Vertex AI services are CUD-eligible, and GPUs require separate commitments from CPU/memory.
Regional constraints: CUDs apply only to specific regions. Multi-region workloads require separate commitments per region.
Changing workloads: As machine learning models evolve, workload patterns shift. Yesterday's optimal commitment becomes tomorrow's wasted spend.
Risk of over-commitment: Committing to more resources than you’ll use locks you into paying for unused capacity.
nOps removes this complexity with continuous optimization—automatically adjusting recommendations as your Vertex AI usage evolves.
Vertex AI Pricing Compared to Alternatives
| Feature | Vertex AI | AWS SageMaker | Azure Machine Learning |
|---|---|---|---|
| Training (GPU) | $2.93/hour (A100) | $4.10/hour (A100) | $3.67/hour (A100) |
| Inference (CPU) | $0.154/hour (e2-std-4) | $0.192/hour (ml.m5.xlarge) | $0.228/hour (Standard_D4s_v3) |
| Generative AI | Gemini 2.5 Flash: $0.30/M input | Claude 3.5 Sonnet: $3.00/M input | GPT-4o: $2.50/M input |
| AutoML | $21.25/hour training | $20.40/hour training | $19.80/hour training |
| Commitment Discounts | CUDs: up to 55% (3-year) | Savings Plans: up to 72% (3-year) | Reserved Instances: up to 72% (3-year) |
Key Takeaway: Vertex AI’s base rates are competitive, but its lack of scale-to-zero and mandatory endpoint fees increase total cost of ownership for variable workloads. AWS SageMaker’s Serverless Inference and Azure ML’s endpoint auto-scaling offer better cost efficiency for low-traffic models.
However, for high-volume production workloads where CUDs apply, Vertex AI becomes cost-competitive—especially when combined with nOps automated commitment management.
Conclusion
Vertex AI pricing follows Google Cloud’s pay-as-you-go philosophy—but without proactive cost management, that flexibility quickly becomes a liability. Idle endpoints, inefficient model selection, and unoptimized token consumption can turn a $5,000 budget into a $20,000 surprise bill.
Across the tools in this guide, commitment optimization remains one of the largest savings levers in GCP. nOps focuses on maximizing that lever automatically — increasing your effective savings rate without adding operational overhead. And, we only get paid after delivering you measurable savings.
In 2026, “good enough” means you’re likely leaving money on the table. We’ve talked to companies that can save hundreds of thousands on their cloud bills by switching to nOps from competitors.
There’s no risk to book a free savings analysis to find out if nOps can help you get more value out of your cloud investments.
nOps manages $3B+ in cloud spend and was recently rated #1 in G2’s Cloud Cost Management category.
Frequently Asked Questions
How much does Vertex AI cost?
Vertex AI is pay-as-you-go: generative AI models bill per million tokens, training and prediction bill per node-hour based on machine type and accelerators, and deployed endpoints bill hourly even when idle. Total cost depends on your mix of these components.
What is Vertex AI cost per token?
Pricing varies by model. Gemini 2.5 Pro costs $1.25 per million input tokens and $10.00 per million output tokens (≤200K context), while Gemini 2.5 Flash costs $0.30/M input and $2.50/M output. Batch API requests and cached tokens are discounted.
What is Vertex AI Cost Optimization?
Vertex AI cost optimization combines right-sizing machine types, cleaning up idle endpoints, using the Batch API and prompt caching, choosing the cheapest model that meets quality requirements, and covering baseline usage with Committed Use Discounts.
Last Updated: March 16, 2026, Commitment Management