Why Healthcare AI Is About to Look Like a Hospital Formulary

Healthcare technology leaders are starting to see something strange on their AI invoices. A pilot that cost three thousand dollars in January is now costing twenty thousand in May. The team cannot fully explain why. Usage went up, prompts got longer, and the same general-purpose model is being asked to do every job in the workflow. The cost climbed, but nobody can say which of those changes drove it.

This is the side of AI nobody talks about in keynotes. The demos look magical. The bills do not. There is a pattern forming in the engineering blogs of companies that have already lived through this, and it has two halves arriving at the same time. Healthcare operators who notice it early will have a much easier time running real AI in real workflows.

The First Half: Specialized Models Are Now a Real Choice

The first half of the pattern is specialized AI models. The idea is simple. If your workflow needs to search millions of clinical papers for the three most relevant to a doctor's question, you do not need a giant general-purpose model for that job. You need a small model that is very good at one specific thing: ranking medical text by relevance.

A company called ZeroEntropy is one of the clearest examples. They train tiny specialized models that beat the big general models on accuracy, speed, and cost for one job at a time. Their reranker takes around eighty milliseconds for work that would take a general model about five hundred milliseconds and a lot more tokens. One of their customers, Vera Health, used ZeroEntropy to reach what they call state-of-the-art clinical accuracy across millions of medical research papers. Another customer, the support company Assembled, reported a 2.8x cost reduction after switching from a general model to a specialized one. Better accuracy, lower latency, lower bill, all at once.

You have a real choice now. Keep paying premium rates for one giant model to do every step, or route each step to a smaller model built for it. The bill drops, the speed goes up, and the answers get better because each model is doing the job it was designed for.

The Second Half: Cost Guardrails Are Becoming a Core Primitive

The second half of the pattern is cost governance, and the company to watch is Cloudflare. Their AI Gateway recently shipped real-time spend limits across multiple AI providers. Every request your team makes to OpenAI, Anthropic, or Google flows through the gateway first. The gateway counts the tokens, tracks the spend per user, and cuts off requests that go past budget. If your radiology pilot is allowed two thousand dollars this week, the gateway enforces that limit at the network layer instead of waiting for finance to discover the overage at month end.

The deeper point is that cost containment cannot live as an afterthought anymore. It has to be a primitive. In engineering language, a primitive is a basic building block that everything else sits on top of. Identity is a primitive. Authentication is a primitive. Cloudflare is making the case that a spend limit is one too. You should not be able to deploy an AI workflow inside a healthcare org without first telling the system how much it can spend and what happens when it hits the cap.
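As a rough sketch of what a spend limit as a primitive could look like in code, the Python below refuses any call that would push a workflow past its cap. The SpendLimit class, the BudgetExceeded error, and the dollar figures are hypothetical illustrations. This is not Cloudflare's AI Gateway API, only the idea behind it.

```python
from dataclasses import dataclass


class BudgetExceeded(Exception):
    """Raised when a call would push spend past the configured cap."""


@dataclass
class SpendLimit:
    # Hypothetical cap for one workflow over one budget window (e.g. a week).
    cap_usd: float
    spent_usd: float = 0.0

    def charge(self, cost_usd: float) -> None:
        """Record a call's cost, or refuse it if the cap would be exceeded."""
        if self.spent_usd + cost_usd > self.cap_usd:
            raise BudgetExceeded(
                f"cap {self.cap_usd:.2f} USD reached; "
                f"already spent {self.spent_usd:.2f} USD"
            )
        self.spent_usd += cost_usd


if __name__ == "__main__":
    # Mirrors the radiology example: two thousand dollars for the week.
    radiology = SpendLimit(cap_usd=2000.0)
    radiology.charge(1500.0)      # accepted
    try:
        radiology.charge(600.0)   # would exceed the cap, so it is refused
    except BudgetExceeded as err:
        print("blocked:", err)
```

The design choice worth noticing is that the check happens before the spend is recorded, so a refused call leaves the running total untouched. What happens after a refusal (queue, alert, fall back to a cheaper model) is the policy decision the paragraph above asks you to make up front.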

What This Looks Like Together

Put both halves together and you get a clear picture of where serious enterprise AI is heading. Small specialized models doing the actual work on the inside. A control plane on the outside enforcing budget, identity, and policy. Token spend treated the way a hospital CFO already treats supply and labor costs, with line items, alerts, and approval thresholds.

For healthcare this matters more than in most industries. A hospital does not prescribe one drug for every patient. It runs a formulary. The pharmacist matches the right drug to the right condition at the right dose. AI infrastructure inside a hospital should work the same way. Clinical summarization gets one model. Patient communication gets another. Coding and billing gets a third. Every call passes through a gateway that already knows the weekly budget, the HIPAA requirements, and the audit trail.
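The formulary idea can be sketched as a lookup table that maps each workflow to one approved model and rejects anything not on the list. The workflow keys follow the three examples above. The model names and the NotOnFormulary error are placeholders made up for illustration, not real products.

```python
# Hypothetical formulary: one approved model per workflow.
FORMULARY = {
    "clinical_summarization": "summarizer-model",
    "patient_communication": "messaging-model",
    "coding_and_billing": "coding-model",
}


class NotOnFormulary(Exception):
    """Raised when a workflow has no approved model."""


def select_model(workflow: str) -> str:
    """Return the approved model for a workflow, or refuse the request."""
    try:
        return FORMULARY[workflow]
    except KeyError:
        raise NotOnFormulary(f"no approved model for {workflow!r}") from None


if __name__ == "__main__":
    print(select_model("clinical_summarization"))
```

Refusing unlisted workflows, rather than silently falling back to a large general model, is a deliberate choice in this sketch. It mirrors how a formulary works: a new use has to be added to the list on purpose before it can run.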

ZeroEntropy is HIPAA compliant and SOC 2 Type II audited. Cloudflare runs its gateway with identity policies that tie into existing single sign-on. The pieces already exist. The real question is whether healthcare orgs assemble them on purpose or under pressure after the first surprise invoice.

Three Things to Do This Quarter

If you run technology decisions for a healthcare org, three practical moves are worth making in the next ninety days.

Audit your AI spend by workflow, not by month. Where is the money actually going? Which step is driving the largest token bill? Those two questions usually reveal eighty or ninety percent of the cost sitting in two or three places that could be moved to a smaller, specialized model.
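A workflow-level audit can start very small. The sketch below assumes you can export usage as (workflow, cost in USD) pairs, which is an assumption about your logging, not a feature of any particular vendor. The function names and the sample records are invented for illustration.

```python
from collections import defaultdict


def spend_by_workflow(records):
    """Sum cost per workflow and return (workflow, total) pairs, largest first."""
    totals = defaultdict(float)
    for workflow, cost_usd in records:
        totals[workflow] += cost_usd
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


def top_share(ranked, n):
    """Fraction of total spend that sits in the n most expensive workflows."""
    total = sum(cost for _, cost in ranked)
    if total == 0:
        return 0.0
    return sum(cost for _, cost in ranked[:n]) / total


if __name__ == "__main__":
    # Made-up sample data, for illustration only.
    sample = [
        ("clinical_summarization", 120.0),
        ("coding_and_billing", 30.0),
        ("clinical_summarization", 50.0),
        ("patient_communication", 10.0),
    ]
    ranked = spend_by_workflow(sample)
    for workflow, cost in ranked:
        print(f"{workflow}: {cost:.2f} USD")
    print(f"top 2 share: {top_share(ranked, 2):.0%}")
```

If the top two or three workflows carry most of the total, those are the first candidates to move to a smaller, specialized model.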

Ask your AI vendors what their cost governance story is. Most early stage vendors do not have one. They will talk about accuracy and latency and model quality. Ask about budgets, alerts, and spend caps. The serious ones will have real answers. The rest will go quiet.

Stop thinking about AI cost the way you think about cloud cost. Cloud cost scales roughly with usage. AI cost scales with the length of every prompt, the verbosity of every reply, and the choice of model on every call. A single misconfigured agent can burn through a quarter's budget in a weekend. Cloud bills almost never behave that way.
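The arithmetic behind that claim can be sketched in a few lines, assuming the common pricing shape of separate per-million-token rates for input and output. The function names and every price and volume below are hypothetical placeholders, not any provider's actual rates.

```python
def call_cost_usd(prompt_tokens, reply_tokens, in_price_per_m, out_price_per_m):
    """Cost of one call: input and output tokens are priced separately."""
    return (prompt_tokens * in_price_per_m + reply_tokens * out_price_per_m) / 1_000_000


def run_cost_usd(calls, prompt_tokens, reply_tokens, in_price_per_m, out_price_per_m):
    """Cost of many identical calls, e.g. an agent looping over a weekend."""
    per_call = call_cost_usd(prompt_tokens, reply_tokens, in_price_per_m, out_price_per_m)
    return calls * per_call


if __name__ == "__main__":
    # Hypothetical prices in USD per million tokens.
    in_price, out_price = 2.0, 8.0
    short = call_cost_usd(1_000, 500, in_price, out_price)
    long_ = call_cost_usd(20_000, 2_000, in_price, out_price)
    print(f"short prompt, terse reply: {short:.4f} USD per call")
    print(f"long prompt, verbose reply: {long_:.4f} USD per call")
    print(f"100,000 long calls: {run_cost_usd(100_000, 20_000, 2_000, in_price, out_price):.2f} USD")
```

All three levers from the paragraph above show up as multipliers here: prompt length, reply length, and the per-token price set by the model choice. Call volume multiplies whatever those three produce, which is how a misconfigured agent gets expensive so quickly.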

Where This Goes Next

The hospitals that figure this out first will be running real AI in real workflows two years from now. Small models doing precise jobs, a gateway holding the budget line, and a finance team that can answer "what did we spend on AI this week and why" without flinching. The ones that do not will assemble the same stack anyway, just under pressure and with a worse internal reputation for the technology.

The playbook is in the open. Specialized models from companies like ZeroEntropy. Cost guardrails from companies like Cloudflare. A formulary mindset on top of both. None of this requires custom research or a sixty-person AI team. It requires a leadership decision to treat token spend as a first-class budget category.

Is Your Healthcare AI Bill Climbing Faster Than Your AI Results?

Fermat Solutions helps healthcare orgs design AI workflows that are accurate, compliant, and cost-controlled from day one. From specialized model selection to spend governance, we make sure your AI investment delivers measurable clinical and operational returns.

About the Author

JD Singh

Founder & Principal Consultant, Fermat Solutions

JD Singh brings over a decade of experience in cloud architecture (Azure), AI integration, and enterprise consulting. He has guided SMBs and healthcare organizations through digital transformation initiatives, helping them leverage automation and AI to achieve operational excellence and sustainable growth.