A three-year cost model for agentic AI workload placement, biased against its own conclusion on purpose. Methodology
Total spent to date under each plan, with the hybrid carrying its full hardware bill from day zero. The moment the green line crosses below a cloud line is the payback moment, marked on the chart. This is the most honest view, so it is the default.
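The crossover can be sketched in a few lines. This is a minimal illustration, not the engine's code: the name `payback_month`, the monthly granularity, and the flat example figures are assumptions made for this sketch. It accumulates spend under each plan, charges the hybrid its hardware before month one, and returns the first month in which the hybrid total sits strictly below the cloud total.

```python
from itertools import accumulate


def payback_month(cloud_monthly, hybrid_monthly, hardware_upfront):
    """Return the first month (1-indexed) where cumulative hybrid spend
    is strictly below cumulative cloud spend, or None if it never is.

    All names and the monthly granularity are assumptions for this sketch.
    """
    cloud_cum = list(accumulate(cloud_monthly))
    # The hybrid line starts at the full hardware bill (day zero), then adds
    # its monthly operating spend; drop the day-zero entry to align months.
    hybrid_cum = list(accumulate(hybrid_monthly, initial=hardware_upfront))[1:]
    for month, (cloud, hybrid) in enumerate(zip(cloud_cum, hybrid_cum), start=1):
        if hybrid < cloud:
            return month
    return None


# Hypothetical figures: 36 months, flat spend under both plans.
months = 36
print(payback_month([100.0] * months, [40.0] * months, hardware_upfront=600.0))
```

If the function returns `None`, the green line never crosses within the window, which is the case the chart would show as no payback.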
The link carries every slider setting. Whoever opens it sees these numbers recomputed by the engine, not a screenshot.
The green share runs on devices you already own, with no per-task operating cost. The rest is your remaining cloud bill, priced at premium cloud rates after each year's price drop.
Generated by the engine from your exact scenario. This is the spoken version of the chart.
Every task you send to the cloud carries company content out of the building. The table counts each content token once (instructions, documents, and answers); cloud re-reads are excluded, so these figures understate the real traffic. Under the hybrid plan, every loop that runs locally keeps that content inside your walls.
Both cloud plans send the same content out; routing routine work to a cheaper model shrinks the bill, not the exposure.
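The counting rule above is simple enough to sketch. The `Task` type, its field names, and the token figures below are hypothetical; the only thing taken from this page is the rule itself: each content token is counted once, and only tasks that run in the cloud contribute.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """Token counts for one task (hypothetical shape for this sketch)."""
    instructions: int
    documents: int
    answers: int
    runs_locally: bool = False


def tokens_leaving(tasks):
    """Content tokens that leave the building, each counted once.

    Re-reads are deliberately excluded, so this understates real traffic.
    """
    return sum(
        t.instructions + t.documents + t.answers
        for t in tasks
        if not t.runs_locally
    )


# Illustrative only: two tasks, one of which the hybrid plan keeps local.
all_cloud = [Task(100, 1000, 50), Task(200, 2000, 100)]
hybrid = [Task(100, 1000, 50), Task(200, 2000, 100, runs_locally=True)]
print(tokens_leaving(all_cloud), tokens_leaving(hybrid))
```

Note that nothing in `tokens_leaving` depends on which cloud model serves the task, which is why the two cloud plans produce the same exposure.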
Keeping data home is necessary, not sufficient. To deploy agents safely at enterprise scale, your security team needs answers to seven questions, and the hybrid plan answers all seven with security software your company already owns and operates.
| Control | The question it answers | Answered by |
|---|---|---|
| Identity | Who or what is this agent acting for? | Entra ID, Hello for Business |
| Data access | What files and data is it allowed to see? | Purview, Information Protection |
| Action limits | What is it allowed to do on the device? | MXC isolation, Defender policies |
| Location | Is this loop running locally, or sending data out? | MXC, Foundry Local, Azure AI |
| Audit trail | Where is its action history recorded? | Defender, Sentinel, Purview Audit |
| Escalation | When is a task too big for the device, and who decides? | Practice-defined routing policy |
| Kill switch | How does IT shut a misbehaving agent down, instantly? | Intune, Entra Conditional Access |
An AI agent solves a task in steps, and because the cloud does not remember the previous step, it charges you to re-read the whole conversation at every one of them, even after caching discounts. Call it the memory tax. This model compares three ways to pay for the same work over three years: the standard cloud plan most companies have, the discounted cloud plan a sharp architect would build, and a hybrid plan that moves work onto devices you already own.
Every figure on this page is computed by the engine from your scenario at request time. Nothing is hand-typed. The ways this model deliberately favors the cloud baselines are published in full below.
An AI agent solves a task in steps: plan, act, check, repeat. The cloud does not remember the previous step, so the agent sends the entire conversation back up on every step, and the cloud charges to re-read it every time. Most of what you pay for is not the answer. It is the agent re-reading its own notes. Caching discounts that re-reading; it does not eliminate it.
For the architects, the precise mechanics we model: on the first turn the full context is written to the provider's cache at the write-premium rate. On every later turn the accumulated history is billed at the cache-read rate, and only the new tail (this turn's tool result plus the prior turn's output) is written at the premium rate. Output tokens are billed at the output rate on every turn. That is the entire cost mechanism; nothing else is modeled.
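The mechanics just described can be written as a short loop. This is a sketch under stated assumptions, not the engine's source: `RateCard`, `task_cost`, the assumption that every turn has the same tool-result and output size, and the example prices are all invented for illustration. Rates are in dollars per million tokens.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RateCard:
    """Dollars per million tokens (illustrative values, not a real price list)."""
    cache_write: float
    cache_read: float
    output: float


def task_cost(initial_context, tool_result, output, turns, rates):
    """Dollar cost of one agent task under the billing rules described above.

    Assumes every turn has the same tool-result and output token counts.
    """
    history = 0        # tokens already sitting in the provider's cache
    prev_output = 0    # output tokens from the previous turn
    total = 0.0        # running total in dollar-tokens, scaled at the end
    for turn in range(turns):
        # First turn writes the full context; later turns write only the new tail.
        new_tail = initial_context if turn == 0 else tool_result + prev_output
        total += history * rates.cache_read    # re-read accumulated history
        total += new_tail * rates.cache_write  # write the new tail at the premium rate
        total += output * rates.output         # output is billed every turn
        history += new_tail
        prev_output = output
    return total / 1_000_000


# Hypothetical scenario: 10k-token context, 1k-token tool results, 500-token outputs.
rates = RateCard(cache_write=3.75, cache_read=0.30, output=15.0)
print(task_cost(10_000, 1_000, 500, turns=3, rates=rates))
```

Because `history` grows by one tail per turn, the cache-read term grows with every step; that growing term is the memory tax.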
People are billed by the seat because people are slow and predictable. Agents are billed by the token because they are neither. Your seat subscriptions (Copilot and its peers) cover the first kind of work; this model does not price them and does not argue against them. Keep your seats and keep your laptops; both are table stakes. The decision priced here is where the agents run.
| Spend type | What it covers | What happens at agent scale | In this model |
|---|---|---|---|
| Seat subscriptions | Assistive, human-in-the-loop work: drafting, summarizing, asking | Caps and throttles appear; providers are already trimming the usage included per seat | Kept, not priced, not argued against |
| Metered cloud APIs | Autonomous agents: loops that run without a human per turn | Volume compounds with adoption; the meter is where growth lands | The red and amber lines |
| Local API calls | The same agentic work, served from an endpoint you own | Same API shape; placement changes the meter, not the code | The green line |
This model is biased against its own conclusion.
The cloud baselines are granted perfect caching: every turn is assumed to hit the cache, and no cache entry ever expires between turns. Real deployments miss caches and pay re-write costs. The cloud baselines also receive an annual percentage price deflation, applied in full and on schedule, even though the current memory squeeze is pressing provider costs in the other direction. The hybrid line is charged its full hardware bill up front, in Year One, with no residual value credited. The cloud lines on this page are floors, not estimates.
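Two of these handicaps reduce to one-line rules. The sketch below is illustrative only: the function names, the timing assumption (Year One is billed at the starting rate and each drop lands at the start of the following year), and the example numbers are assumptions, not values from the engine.

```python
def cloud_rate_for_year(year_one_rate, annual_deflation, year):
    """Cloud rate in effect during `year` (1-indexed).

    Assumes the full annual drop lands at the start of each year after the
    first; `annual_deflation` is a fraction, for example 0.2 for 20 percent.
    """
    if year < 1:
        raise ValueError("year is 1-indexed")
    return year_one_rate * (1.0 - annual_deflation) ** (year - 1)


def hybrid_hardware_charge(hardware_cost, year):
    """Full hardware cost in Year One, nothing later, no residual value credited."""
    return hardware_cost if year == 1 else 0.0


# Hypothetical inputs: a rate of 10.0 falling 20 percent a year, 5000.0 of hardware.
for year in (1, 2, 3):
    print(year, cloud_rate_for_year(10.0, 0.2, year), hybrid_hardware_charge(5000.0, year))
```

Both rules lean toward the cloud: its rates only fall, while the hybrid's hardware cost is neither spread over the three years nor offset by resale value.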
Only hardware the strategy requires is billed to it. The Copilot+ NPU refresh is priced at zero by default because fleets are getting that tier regardless of any AI decision; the slider in Advanced settings charges it back if you disagree. The RTX Spark tier, the hardware this page actually prices, is charged in full in Year One.
| Tier | Device class | What it runs | Which workload | Who gets one |
|---|---|---|---|---|
| Table stakes | Copilot+ NPU laptop; your refresh delivers these regardless | Small local models | The routine tier, once routing exists | Every seat |
| The decision | RTX Spark class, high-memory: deskside box, shared workstation, or dedicated machine | Large open-weights models | The complex tier the absorption sliders move | Developers, analysts, creators, or a department sharing one |
The local absorption rates for Years Two and Three are scenario assumptions under your control, not predictions. Your current scenario sets three percentages: the share of task volume routed to the routine tier, the share of the complex tier absorbed locally in Year Two, and the share absorbed locally in Year Three. Set the absorption sliders to zero and the hybrid strategy loses; the model will say so.
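How those percentages carve up task volume can be sketched as follows. The function name `volume_split`, the dictionary keys, and the example settings (including zero absorption in Year One) are assumptions for illustration; the engine's actual fields may differ.

```python
def volume_split(routine_share, complex_absorption):
    """Split total task volume into three fractions that sum to 1.

    routine_share: fraction of all tasks routed to the routine tier.
    complex_absorption: fraction of the complex tier absorbed locally.
    """
    if not (0.0 <= routine_share <= 1.0 and 0.0 <= complex_absorption <= 1.0):
        raise ValueError("shares must be fractions between 0 and 1")
    complex_share = 1.0 - routine_share
    return {
        "routine": routine_share,
        "complex_local": complex_share * complex_absorption,
        "complex_cloud": complex_share * (1.0 - complex_absorption),
    }


# Hypothetical slider settings; Year One absorption of zero is an assumption.
absorption_by_year = {1: 0.0, 2: 0.5, 3: 0.75}
for year, absorption in absorption_by_year.items():
    print(year, volume_split(routine_share=0.6, complex_absorption=absorption))
```

With absorption at zero, `complex_local` is zero in every year, so the hybrid plan carries its hardware bill without moving any complex work off the meter.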
Generated from the engine's unit costs at these rate cards, so you can see where the money concentrates before any scale is applied.
Every figure above is a field in this response. If you find one that is not, the build is broken and we want to know.
Inference economics is the evaluator a calculator can quantify. The placement decision also turns on five more: data-handling constraints (privacy, security, sovereignty), local context and content, proximity to sensors, latency and offline capability, and sustainability. Scoring your actual workloads against all six, and placing each one across on-device, on-prem desktop, on-prem departmental, or cloud, is the Workload Routing Workshop. The math on this page is the part of that workshop you can run without us.
A calculator shows you the math. We can show you the loop running live, on one device, against your scenario, with zero cloud inference, in fifteen minutes.
The briefing costs you an hour and nothing else. The workshop is where we score your workloads against all six evaluators.