Compare speed, latency, quality, and cost impact of switching LLMs

nOps AI Model Provider Recommendations help GenAI teams cut LLM spend by switching to lower-cost providers, such as replacing OpenAI models with Claude or Nova tiers on AWS Bedrock, reducing costs by up to 90% while maintaining similar performance.

Starting today, every recommendation includes Speed and Latency scores. The Speed metric reflects a model's output speed in tokens per second. The Latency metric measures its time to first token. You can evaluate accuracy impact, cost savings, speed, and latency before making a switch.

In the example below, a conversational service running GPT-4o costs $78,577 per month. Our engine flags that the same prompt mix fits Nova Pro on Bedrock, saving approximately $33,421 per month (~42.5%) while being 140% faster and reducing latency by 41%.
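
For reference, the savings percentage is simply the monthly savings divided by the current monthly cost, as a quick Python check confirms:

```python
# Savings percentage = monthly savings / current monthly cost
print(f"{33_421 / 78_577:.1%}")  # -> 42.5%
```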

Speed

The Speed metric measures how quickly a language model generates output, quantified in tokens per second (TPS). A higher TPS indicates faster response generation, enhancing user experience in applications like chatbots and real-time assistants.

Why we chose it:

  • User Experience: Faster token generation leads to more responsive interactions, crucial for real-time applications.

  • Performance Benchmarking: TPS is a standard metric for assessing and comparing model efficiency.

  • Operational Efficiency: Higher TPS can reduce computational costs by completing tasks more quickly.
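
To make the metric concrete, here is a minimal sketch of how output speed can be measured from a streaming completion. It assumes the OpenAI Python SDK (v1); the model name and prompt are placeholders, and whitespace-split words stand in for true tokens.

```python
import time
from openai import OpenAI  # assumes the openai v1 Python SDK

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def measure_tps(model: str, prompt: str) -> float:
    """Stream one completion and return output speed in tokens/sec.

    Whitespace-split words are a rough token proxy here; swap in a
    real tokenizer for production measurements.
    """
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    first_token_at = None
    pieces = []
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            if first_token_at is None:
                first_token_at = time.perf_counter()
            pieces.append(chunk.choices[0].delta.content)
    if first_token_at is None:
        raise RuntimeError("stream produced no content")
    generation_time = time.perf_counter() - first_token_at
    return len("".join(pieces).split()) / generation_time

print(f"{measure_tps('gpt-4o', 'Explain tokens per second.'):.1f} tok/s")
```

Timing from the first token onward deliberately excludes time to first token, so the figure isolates generation throughput rather than queueing or prompt-processing delay.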

Latency

Latency refers to the time it takes for a model to generate its first token—also known as Time to First Token (TTFT). Lower latency helps applications feel more responsive.

Why we chose it:

  • Critical for Real-Time Applications: Low latency is essential for applications where immediate feedback is expected.

  • User Satisfaction: Reduced waiting times enhance the user experience.


  • Performance Indicator: A key metric for evaluating the promptness of different LLMs.
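
Measuring TTFT follows the same pattern: start a timer when the request is sent and stop at the first content chunk. A minimal sketch, again assuming the OpenAI Python SDK (v1) with streaming and a placeholder model name:

```python
import time
from openai import OpenAI  # assumes the openai v1 Python SDK

client = OpenAI()

def measure_ttft(model: str, prompt: str) -> float:
    """Return time to first token (seconds) for one streamed request."""
    start = time.perf_counter()
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for chunk in stream:
        # Skip role-only / empty chunks; stop at the first real token.
        if chunk.choices and chunk.choices[0].delta.content:
            return time.perf_counter() - start
    raise RuntimeError("stream ended before any content arrived")

print(f"TTFT: {measure_ttft('gpt-4o', 'Hi!'):.3f}s")
```

Single requests are noisy, so in practice you would average several runs before comparing models.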

How It Works

The dashboard now displays these metrics alongside projected dollar savings:

  • Speed Change (%): Calculated as ((suggested_speed_tps − current_speed_tps) / current_speed_tps) × 100. A positive value means the suggested model is faster.
  • Latency Change (%): Calculated as ((current_latency_s − suggested_latency_s) / current_latency_s) × 100. A positive value means the suggested model has lower latency.
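
Both formulas are easy to sanity-check in Python. The TPS and TTFT values below are illustrative placeholders, not published benchmarks, chosen only because they reproduce the 140% speed and ~41% latency improvements from the GPT-4o to Nova Pro example above:

```python
def speed_change_pct(current_tps: float, suggested_tps: float) -> float:
    """Positive => the suggested model generates tokens faster."""
    return (suggested_tps - current_tps) / current_tps * 100

def latency_change_pct(current_ttft_s: float, suggested_ttft_s: float) -> float:
    """Positive => the suggested model has a lower time to first token."""
    return (current_ttft_s - suggested_ttft_s) / current_ttft_s * 100

# Illustrative numbers only:
print(speed_change_pct(50.0, 120.0))   # 140.0 -> 140% faster
print(latency_change_pct(0.56, 0.33))  # ~41.1 -> ~41% lower latency
```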

These metrics give you a holistic view of the trade-offs involved—speed, latency, quality, and cost—so you can make well-informed decisions.

How To Get Started

To access the updated recommendations, log in to nOps and navigate to the AI Model Provider Recommendations dashboard in nOps Cost Optimization.

If you’re already on nOps…

Have questions about AI Model Provider Recommendations? Need help getting started? Our dedicated support team is here for you. Simply reach out to your Customer Success Manager or visit our Help Center. If you’re not sure who your CSM is, send our Support Team an email.

If you’re new to nOps…

Ranked #1 on G2 for cloud cost management and trusted to optimize $2B+ in annual spend, nOps gives you automated GenAI savings with complete confidence. Book a demo to start saving on LLM costs without compromising on performance.