An AI agent solves a task in steps, and because the cloud does
not remember the previous step, it charges you to re-read the whole
conversation at every one of them, even after caching discounts. Call it the
memory tax. This model compares three ways to pay for the same work over
three years: the standard cloud plan most companies have, the discounted
cloud plan a sharp architect would build, and a hybrid plan that moves work
onto devices you already own.
Every figure on this page is computed by the engine from
your scenario at request time. Nothing is hand-typed. The ways this model
deliberately favors the cloud baselines are published in full below.
The memory tax, in plain terms
An AI agent solves a task in steps: plan, act, check, repeat. The cloud does
not remember the previous step, so the agent sends the entire conversation
back up on every step, and the cloud charges to re-read it every time. Most of
what you pay for is not the answer. It is the agent re-reading its own notes.
Caching discounts that re-reading; it does not eliminate it.
For the architects, the precise mechanics we model: on the first turn the
full context is written to the provider's cache at the write-premium rate. On
every later turn the accumulated history is billed at the cache-read rate, and
only the new tail (this turn's tool result plus the prior turn's output) is
written at the premium rate. Output tokens are billed at the output rate on
every turn. That is the entire cost mechanism; nothing else is modeled.
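The mechanics above can be sketched in a few lines. This is a minimal illustration, not the engine's code: the names `RateCard`, `task_token_bill`, and `task_cost` are hypothetical, the rates are placeholder numbers rather than any provider's prices, and every turn is assumed to carry the same tool-result and output token counts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RateCard:
    """Dollars per million tokens. Placeholder values, not a real provider's prices."""
    cache_write: float  # write-premium rate
    cache_read: float   # cache-read rate
    output: float       # output rate


def task_token_bill(initial_context: int, tool_result: int, output: int, turns: int) -> dict:
    """Count billed tokens for one task, assuming identical token counts on every turn."""
    history = 0  # tokens already sitting in the cache
    bill = {"cache_read": 0, "cache_write": 0, "output": 0}
    for turn in range(1, turns + 1):
        # Turn 1 writes the full context; later turns write only the new tail
        # (this turn's tool result plus the prior turn's output).
        tail = initial_context if turn == 1 else tool_result + output
        bill["cache_read"] += history  # re-reading the accumulated history
        bill["cache_write"] += tail
        bill["output"] += output
        history += tail
    return bill


def task_cost(rates: RateCard, bill: dict) -> float:
    """Price a token bill in dollars."""
    return (
        bill["cache_read"] * rates.cache_read
        + bill["cache_write"] * rates.cache_write
        + bill["output"] * rates.output
    ) / 1_000_000


# Hypothetical scenario: 10,000-token context, 4 turns.
example_rates = RateCard(cache_write=4.0, cache_read=0.4, output=16.0)
example_bill = task_token_bill(initial_context=10_000, tool_result=1_500, output=500, turns=4)
example_cost = task_cost(example_rates, example_bill)
```

In this hypothetical four-turn task the agent produces 2,000 output tokens but is billed for 36,000 tokens of cached history re-reads, which is the memory tax in miniature.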
What this model prices, and what it does not
People are billed by the seat because people are slow and predictable.
Agents are billed by the token because they are neither. Your seat
subscriptions (Copilot and its peers) cover the first kind of work; this
model does not price them and does not argue against them. Keep your seats,
keep your laptops; both are table stakes. The decision priced here is where
the agents run.
| Spend type | What it covers | What happens at agent scale | In this model |
|---|---|---|---|
| Seat subscriptions | Assistive, human-in-the-loop work: drafting, summarizing, asking | Caps and throttles appear; providers are already trimming the usage included per seat | Kept, not priced, not argued against |
| Metered cloud APIs | Autonomous agents: loops that run without a human per turn | Volume compounds with adoption; the meter is where growth lands | The red and amber lines |
| Local API calls | The same agentic work, served from an endpoint you own | Same API shape; placement changes the meter, not the code | The green line |
Where this model favors the cloud
This model is biased against its own conclusion.
The cloud baselines are granted perfect caching: every turn is assumed to hit
the cache, and no cache entry ever expires between turns. Real deployments miss
caches and pay re-write costs. The cloud baselines also receive the annual
price deflation set in your scenario, applied in full and on schedule, even
though the current memory squeeze is pressing provider costs in the other
direction. The hybrid line is
charged its full hardware bill up front, in Year One, with no residual value
credited. The cloud lines on this page are floors, not estimates.
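As a rough sketch of how two of these handicaps shape the three-year lines: the function names below are hypothetical, the dollar figures are illustrative only, and compounding the deflation once per year from Year Two at constant volume is an assumption about what "in full and on schedule" means, not a statement of the engine's formula.

```python
def cloud_cost_by_year(year_one_cost: float, annual_deflation: float, years: int = 3) -> list:
    """Cloud spend per year at constant volume, with prices falling each year."""
    if not 0.0 <= annual_deflation < 1.0:
        raise ValueError("annual_deflation must be a fraction in [0, 1)")
    # Year One pays the starting price; each later year compounds the full deflation.
    return [year_one_cost * (1.0 - annual_deflation) ** y for y in range(years)]


def hardware_cost_by_year(hardware_total: float, years: int = 3) -> list:
    """The whole hardware bill lands in Year One; no residual value is credited later."""
    return [hardware_total] + [0.0] * (years - 1)


# Illustrative inputs only: 10% deflation is a stand-in, not the scenario's value.
cloud_years = cloud_cost_by_year(year_one_cost=1000.0, annual_deflation=0.10)
hardware_years = hardware_cost_by_year(hardware_total=500.0)
```

Both choices push in the cloud's favor: the cloud line shrinks every year, while the hybrid line carries its entire capital cost before any local work is absorbed.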
What the strategy is charged for
Only hardware the strategy requires is billed to it. The Copilot+ NPU
refresh is priced at zero by default because fleets are getting that tier
regardless of any AI decision; the slider in Advanced settings charges it
back if you disagree. The RTX Spark tier, the hardware this page actually
prices, is charged in full in Year One.
What the hardware actually is
| Tier | Device class | What it runs | Which workload | Who gets one |
|---|---|---|---|---|
| Table stakes | Copilot+ NPU laptop; your refresh delivers these regardless | Small local models | The routine tier, once routing exists | Every seat |
| The decision | RTX Spark class, high-memory: deskside box, shared workstation, or dedicated machine | Large open-weights models | The complex tier the absorption sliders move | Developers, analysts, creators, or a department sharing one |
What is an assumption, and whose
The local absorption rates for Years Two and Three are scenario assumptions
under your control, not predictions. Your current scenario sets the percentage
of task volume routed to the routine tier and the percentage of the complex
tier absorbed locally in Year Two and in Year Three. Set the
absorption sliders to zero and the hybrid strategy loses; the model will say so.
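A simplified sketch shows why zero absorption sinks the hybrid strategy. It assumes the hybrid line pays the hardware bill plus whatever share of the complex-tier cloud cost is not absorbed locally; the function name and all figures are hypothetical, and the routine tier and any local running costs are left out for brevity.

```python
def three_year_totals(cloud_by_year, absorption_by_year, hardware_cost):
    """Return (cloud_total, hybrid_total) for the complex tier only."""
    if len(cloud_by_year) != len(absorption_by_year):
        raise ValueError("need one absorption rate per year")
    if any(not 0.0 <= a <= 1.0 for a in absorption_by_year):
        raise ValueError("absorption rates must be fractions in [0, 1]")
    cloud_total = sum(cloud_by_year)
    # Whatever is not absorbed locally still runs on the cloud meter.
    remaining_cloud = sum(c * (1.0 - a) for c, a in zip(cloud_by_year, absorption_by_year))
    return cloud_total, hardware_cost + remaining_cloud


yearly_cloud = [1000.0, 900.0, 810.0]  # illustrative dollars per year

# Sliders at zero: the hybrid line is the cloud line plus hardware, so it loses.
cloud_zero, hybrid_zero = three_year_totals(yearly_cloud, [0.0, 0.0, 0.0], hardware_cost=500.0)

# Hypothetical slider settings for Years Two and Three.
cloud_some, hybrid_some = three_year_totals(yearly_cloud, [0.0, 0.5, 0.8], hardware_cost=500.0)
```

With the sliders at zero the hybrid total exceeds the cloud total by exactly the hardware cost, so the outcome depends entirely on absorption rates you choose.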
Rate cards
What one task costs in the cloud, Year One
Generated from the engine's unit costs at these rate cards, so you can see
where the money concentrates before any scale is applied.
Trace it yourself
View the raw engine response that rendered this page
Every figure above is a field in this response. If you find
one that is not, the build is broken and we want to know.