Enterprise AI budgets averaged $1.2 million in 2024. By 2026, that figure hit $7 million: a 483% increase in two years. The line item driving nearly all of that growth is not training. It is inference, which now consumes 85% of the average enterprise AI budget according to AnalyticsWeek’s 2026 analysis.
Most FinOps teams still manage AI costs the way they manage cloud compute: setting budget alerts, reviewing invoices after the fact, asking engineering to please use fewer tokens. That approach worked when AI was a research project with a handful of API calls per day. It does not work when agentic workflows trigger 10 to 20 LLM calls per user task and inference spending grows faster than any other line item in the technology budget.
The missing layer is an AI gateway: a proxy that sits between applications and LLM providers, enforcing budgets, routing requests to the right model, and caching responses before they generate another API charge. Enterprises deploying gateways report 40 to 60% reductions in inference costs. The technology is production ready. Several options are open source. And most FinOps teams have never evaluated one.
Inference Ate the AI Budget
Two years ago, training dominated enterprise AI spending. Organizations allocated 60% of their AI budgets to model training and 40% to inference. That ratio has inverted completely. In 2026, inference accounts for 85% and training just 15%, according to the Oplexa AI Inference Cost Report.
The math behind the shift is counterintuitive. Per-token costs have fallen roughly 280 times over two years, from approximately $30 per million tokens for GPT-4 class tasks in 2023 to $0.10 per million tokens in 2026 (per the Stanford HAI AI Index). But total enterprise AI spend rose 320% over the same period. Cheaper tokens invite more usage, and agentic architectures multiply that usage dramatically.
Gartner quantified the agentic multiplier in March 2026: agentic models require between 5 and 30 times more tokens per task than a standard chatbot. A customer support agent that once consumed one unit of tokens per interaction now consumes ten to twenty units when it orchestrates tool calls, retrieves documents, and iterates on its own output. The FinOps Foundation’s 2026 State of FinOps report flagged AI as the fastest growing new spend category, with 73% of respondents reporting that AI costs exceeded their original projections.
In 20 years of IT operations, I have watched budgets spiral before: storage in the early 2000s, virtualization licensing around 2010, cloud compute from 2015 onward. Each time, the correction came not from asking people to use less, but from inserting an automated control layer between demand and spend. AI inference is at that inflection point now.
Three Cost Levers a Gateway Controls
An AI gateway is not a dashboard. It is an enforcement layer that intercepts every API call between your applications and the LLM provider, applying cost controls in real time before the request generates a charge.
Intelligent model routing. Not every prompt needs a frontier model. A gateway evaluates incoming requests and routes simple tasks (classification, summarization, data extraction) to smaller, cheaper models while reserving expensive frontier capacity for complex reasoning. The standard production pattern uses three tiers: frontier models like Claude Opus or GPT-4o for deep reasoning; mid-tier models like Claude Haiku or GPT-4o-mini for general tasks; and open-source models for high-volume, low-complexity workloads. Model routing alone can reduce costs 60 to 80% on workloads where the majority of requests are routine.
Semantic caching. When one user asks “What is our refund policy?” and another asks “Can I get a refund?”, a semantic cache recognizes these as the same intent and serves the stored response instead of triggering a new model call. Exact-match caching catches identical queries at zero marginal cost. Semantic caching catches paraphrased queries at the cost of an embedding lookup, which is orders of magnitude cheaper than a full LLM call. Workloads with repetitive patterns (customer support, FAQ systems, code completion) see 30 to 50% reductions in API call volume through caching alone.
Hierarchical budget enforcement. A gateway sets token budgets per team, per application, and per user, then blocks requests when a budget is exhausted rather than letting costs accumulate until someone reviews the monthly invoice. This is fundamentally different from cloud cost guardrails, which operate at the infrastructure layer. Gateway budget enforcement is preventive: it stops the API call before it happens, not after.
Five Gateways Worth Evaluating
The market has matured quickly. Five production-ready options span the spectrum from fully managed to self-hosted.
Bifrost (open source, built in Go by Maxim AI) is the most complete self-hosted option. Its benchmarks show 11 microseconds of overhead per request at 5,000 requests per second, making the gateway effectively invisible in latency terms compared to the 1 to 5 seconds a typical LLM response takes. Bifrost offers dual-layer caching (exact match plus semantic), CEL-based routing rules, and hierarchical budget controls.
LiteLLM (open source, Python) supports over 100 LLM providers through a unified interface. Its self-hosted proxy gives teams full control over data and routing logic. LiteLLM has broad adoption because it requires minimal code changes to existing applications; you swap your API endpoint, and the gateway handles everything else.
Cloudflare AI Gateway is a managed service that runs on Cloudflare’s edge network. Core features are free on all Cloudflare plans, with the Workers Paid plan including up to one million gateway logs per month. In 2026, Cloudflare added unified billing: organizations can now pay for OpenAI, Anthropic, and other providers through a single Cloudflare invoice.
Kong AI Gateway extends the widely deployed Kong API gateway with AI-specific plugins for rate limiting, token tracking, and provider failover. Organizations already running Kong for API management can add AI cost controls without introducing a separate tool.
AWS Bedrock and Azure AI Gateway are the cloud-native options from Amazon and Microsoft. They integrate tightly with each provider’s ecosystem but limit you to that provider’s model marketplace. For multi-cloud strategies, an independent gateway is more practical.
Where Gateways Will Not Save You
Semantic caching effectiveness depends entirely on traffic patterns. Customer support workloads with repetitive queries see high cache hit rates. R&D workloads with unique, novel prompts see almost no caching benefit. Organizations should measure query diversity before projecting savings.
Model routing requires ongoing tuning. The classification logic that decides which model handles which request needs regular evaluation against output quality metrics. Route a complex legal analysis to a budget model and you get unusable output. Route routine data extraction to a frontier model and you waste money. Most teams start with conservative routing rules and adjust weekly based on quality scores.
Gateways also introduce an infrastructure dependency. Every AI API call now flows through an additional layer that must maintain high availability. A gateway outage takes down every AI-powered feature simultaneously. Production deployments need redundancy, monitoring, and an on-call rotation, just like any other piece of critical infrastructure.
A Four-Week Deployment Sequence
Week one: deploy in passthrough mode, logging every request without modifying routing or caching. After seven days, you have a complete picture of which applications call which models, how many tokens they consume, and what query patterns look like. This data alone typically reveals that two or three applications account for 80% of token spend.
Week two: enable exact-match caching first, then semantic caching. Monitor cache hit rates daily. Customer-facing applications with FAQ-style interactions usually show immediate savings.
Week three: introduce a two-tier routing split. Frontier models for anything over a complexity threshold, mid-tier models for everything else. Run both tiers in parallel for a week, comparing output quality before committing to the routing rules.
Week four: enable budget enforcement with soft limits (alerts, not hard blocks) so teams can adjust their usage patterns before you start preventing requests outright.
The FinOps X 2026 conference in San Diego (June 8 to 11) is devoting multiple sessions to AI gateway architecture and token economics, a sign of how quickly this category moved from experimental to essential. For FinOps teams still managing AI inference costs through spreadsheets and monthly reviews, the gateway layer is where the discipline moves next.
