AI scaling used to mean one thing: train a bigger model on more data.
That was never the whole story, but it was the story most buyers heard. Bigger parameter count. Bigger cluster. Bigger benchmark number. The industry could compress the messy reality into a simple sentence: intelligence scales with pretraining compute.
That sentence is now incomplete.
Modern AI systems scale along at least four different compute axes. Pretraining builds the foundation. Post-training shapes behavior and reasoning. Test-time compute spends more effort when a hard query arrives. Sleep-time compute moves some of that effort before the query arrives, when reusable context is already available.
For enterprise teams, this matters because the right question is no longer "which model is smartest?" The better question is "which kind of compute should this workflow spend, at which step, for which business outcome?"
A support classifier should not use the same compute strategy as a contract-risk review. A CRM writeback agent should not handle routine field cleanup the same way it handles a disputed enterprise renewal. A coding assistant should not rediscover the architecture of the repository every time someone asks a question about it.
The future is not one giant model everywhere. It is routing the right job to the right model, memory layer, verifier, and compute budget.
The Four Scaling Methods
Here is the practical map.
| Scaling method | When compute is spent | What it improves | Enterprise question |
|---|---|---|---|
| Pretraining | Before the model exists | General knowledge, language, world modeling, broad capability | Which foundation model family should we build on? |
| Post-training | After pretraining, before deployment | Instruction following, alignment, reasoning style, domain behavior | Which model behavior do we need, and how do we validate it? |
| Test-time compute | During the user request | Hard-query accuracy, search, reasoning depth, verification | Which tasks deserve slower, more expensive thinking? |
| Sleep-time compute | Between requests, before likely future queries | Lower latency, lower live-token cost, reusable context understanding | Which contexts can we pre-process and amortize? |
The operating mistake is to treat these as substitutes. They are not.
Pretraining is the model's raw foundation. Post-training teaches it how to act. Test-time compute gives it more time on the hard moments. Sleep-time compute makes repeated work cheaper by doing useful thinking in advance.
The enterprise architecture question is how to combine them.
1. Pretraining: Build the Foundation
Pretraining is the phase where a model learns from massive corpora of text, code, images, audio, or multimodal data. In language models, the model compresses statistical structure from tokens into weights: language patterns, facts, associations, code syntax, world knowledge, and some forms of latent reasoning.
This is the scaling regime associated with parameters, training tokens, and total training compute. The DeepMind Chinchilla paper, Training Compute-Optimal Large Language Models, made the key point that model size and training tokens should scale together under a fixed compute budget. The problem with many earlier large models was not simply their size. They were undertrained: they had too many parameters for the number of tokens they had seen.
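As a rough illustration of "scale together," the sketch below splits a fixed training budget between parameters and tokens. It leans on two commonly quoted approximations that this article does not state and that should be treated as assumptions: training compute of about 6 FLOPs per parameter per token, and a rule of thumb of about 20 training tokens per parameter. Neither is an exact law.

```python
import math

# Assumptions for illustration only, not exact laws:
FLOPS_PER_PARAM_TOKEN = 6   # training compute C ~= 6 * N * D
TOKENS_PER_PARAM = 20       # rule-of-thumb ratio of tokens to parameters


def compute_optimal_split(compute_budget_flops: float) -> tuple[float, float]:
    """Return (parameters, training_tokens) under the two assumptions above."""
    # C = 6 * N * D and D = 20 * N  =>  C = 120 * N^2
    params = math.sqrt(
        compute_budget_flops / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM)
    )
    tokens = TOKENS_PER_PARAM * params
    return params, tokens


params, tokens = compute_optimal_split(1e23)  # hypothetical budget
print(f"{params:.2e} parameters, {tokens:.2e} tokens")
```

The point of the arithmetic is that doubling the budget should grow both numbers, not only the parameter count.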
For most enterprises, pretraining is not a project. It is a market they buy from.
You are usually not deciding whether to train a frontier model. You are deciding which pretrained capability to trust for which workload:
- Frontier proprietary models for complex reasoning, coding, synthesis, and high-value analysis
- Open-weight models for controllability, cost control, data residency, and local deployment
- Small specialized models for classification, extraction, routing, and repetitive back-office tasks
- Multimodal models for workflows that combine documents, screenshots, voice, diagrams, or video
There are exceptions. A very large enterprise with proprietary data, large-scale inference demand, and serious ML infrastructure may train or continue-train a domain model. But most companies should be suspicious of any strategy that starts with "we need to pretrain our own model." The bill arrives long before the ROI.
The useful enterprise lesson from pretraining is not "train bigger." It is "know what kind of base capability you are buying." A workflow that needs exact tool use, citations, permissions, and auditability may perform better with a cheaper base model wrapped in a strong system than with the most expensive model called directly from a chat box.
2. Post-Training: Teach the Model How to Behave
Post-training is where a base model becomes usable as an assistant, agent, specialist, or reasoning model.
Instruction fine-tuning teaches the model to follow directions. Human preference training shapes outputs toward helpfulness, honesty, tone, and policy compliance. Reinforcement learning can teach models to spend effort on reasoning, verify intermediate steps, and recover from mistakes. Preference methods such as DPO simplify parts of the RLHF pipeline by optimizing directly from preference pairs rather than training a separate reward model and then running reinforcement learning.
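To make "optimizing directly from preference pairs" concrete, here is a minimal sketch of the DPO loss for a single pair. The inputs are summed log-probabilities of the preferred and rejected responses under the model being trained and under a frozen reference model. The function name, argument names, and the `beta` default are illustrative choices, and a real training loop would compute these values in batches with an ML framework.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # illustrative default; controls how far the policy may drift
) -> float:
    """DPO loss for one preference pair: -log(sigmoid(beta * margin))."""
    chosen_shift = policy_chosen_logp - ref_chosen_logp
    rejected_shift = policy_rejected_logp - ref_rejected_logp
    x = beta * (chosen_shift - rejected_shift)
    # Numerically stable form of -log(sigmoid(x)).
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))


# No separate reward model appears anywhere: the preference pair is the signal.
print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
```

The loss falls as the trained model raises the preferred response relative to the reference and lowers the rejected one.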
The InstructGPT paper, Training language models to follow instructions with human feedback, showed why this phase matters: bigger pretrained models do not automatically follow user intent well. In that paper, human labelers preferred outputs from a 1.3B-parameter instruction-tuned model over outputs from the 175B-parameter GPT-3, because behavior matters as much as raw scale.
The newer reasoning wave pushed post-training further. OpenAI's Learning to Reason with LLMs describes a model whose performance improves with more reinforcement learning during training and more thinking time at test time. DeepSeek-R1's paper, Incentivizing Reasoning Capability in LLMs via Reinforcement Learning, made the same point in open research form: reinforcement learning can elicit reasoning behaviors, with GRPO becoming one of the widely discussed algorithms in that family.
For enterprises, post-training shows up in three forms.
First, you choose a model that has already been post-trained for the behavior you need. A general assistant, a coding model, a reasoning model, and a low-latency extraction model are not interchangeable even if their benchmark scores look close.
Second, you may fine-tune or preference-tune a model on your own examples. This can help with tone, structured output, recurring domain decisions, and narrow classification. It is not a substitute for retrieval, permissions, evals, or workflow design.
Third, you build system-level post-training around the model: prompt contracts, tool definitions, approval gates, validators, eval sets, and traces. This is where most enterprise ROI lives. A model that is merely "aligned" in the lab still needs to be aligned to your refund policy, your CRM fields, your escalation rules, and your audit obligations.
OpenNash usually starts here rather than at pretraining. We define the behavior the workflow needs, build examples and evals, then decide whether prompting, fine-tuning, preference data, or a different model is the cleanest way to get it.
3. Test-Time Compute: Think Harder When the Query Deserves It
Test-time compute is the compute spent after the user asks.
Instead of returning the first answer from a single forward pass, the system can allocate a larger inference budget. It might sample multiple candidate answers, run a verifier, use beam search or tree search, ask the model to critique its own plan, call tools, retrieve more evidence, or let a reasoning model spend more tokens before finalizing.
The important shift is that capability becomes partly variable at runtime. You can spend more compute on the hard cases without making every request expensive.
Charlie Snell and coauthors studied this directly in Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters. Their finding is not that every query should get more compute. It is that optimal allocation depends on prompt difficulty, and that adaptive test-time strategies can beat a naive best-of-N baseline. They also found cases where a smaller model with enough test-time compute can outperform a much larger model on hard multi-step problems.
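A minimal sketch of the two ideas in play, assuming a hypothetical `generate` function that samples one answer and a hypothetical `score` verifier: best-of-N sampling, and a sample budget that grows with estimated difficulty. This is not the method from the paper, only a simple stand-in for difficulty-aware allocation. The toy generator and verifier exist purely to make the example runnable.

```python
from typing import Callable


def best_of_n(
    generate: Callable[[str], str],      # hypothetical: samples one candidate answer
    score: Callable[[str, str], float],  # hypothetical verifier: higher is better
    prompt: str,
    n: int,
) -> str:
    """Sample n candidates and keep the one the verifier scores highest."""
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda c: score(prompt, c))


def sample_budget(difficulty: float, min_n: int = 1, max_n: int = 16) -> int:
    """Map an estimated difficulty in [0, 1] to a number of samples."""
    clamped = min(max(difficulty, 0.0), 1.0)
    return min_n + int(clamped * (max_n - min_n))


# Toy stand-ins so the sketch runs without a model.
canned = iter(["3", "9", "7", "1"])


def toy_generate(prompt: str) -> str:
    return next(canned)


def toy_verifier(prompt: str, answer: str) -> float:
    return -abs(int(answer) - 7)  # pretend the correct answer is 7


best = best_of_n(toy_generate, toy_verifier, "toy prompt", n=4)
print(best)  # prints 7
```

Easy prompts get one sample and hard prompts get many, which is what keeps the average request cheap.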
That is the idea behind enterprise model routing.
Do not ask "what is our model?" Ask "what is our routing policy?"
For example:
- Simple classification: cheap, fast model
- Structured extraction from a known document: cheap model plus validator
- Customer-facing answer with policy risk: stronger model plus retrieval and citation check
- Contract review: reasoning model plus clause-level evidence and human approval
- Data writeback: deterministic validation plus human approval threshold
- High-value exception: frontier model, more test-time compute, and a reviewer
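One way to write that list down is as a plain routing table. The sketch below is illustrative: the task-class keys, tier names, and check names are made up for this example, and the approval threshold for data writeback is simplified to a yes/no flag.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    model_tier: str                 # hypothetical tier label
    checks: tuple[str, ...] = ()    # hypothetical check names
    human_approval: bool = False


ROUTES = {
    "simple_classification": Route("small"),
    "structured_extraction": Route("small", ("schema_validator",)),
    "customer_answer_policy_risk": Route("strong", ("retrieval", "citation_check")),
    "contract_review": Route("reasoning", ("clause_evidence",), human_approval=True),
    "data_writeback": Route("deterministic", ("field_validation",), human_approval=True),
    "high_value_exception": Route("frontier", ("extended_reasoning",), human_approval=True),
}


def route_for(task_class: str) -> Route:
    # Assumption: unknown task classes fall back to the most conservative route.
    return ROUTES.get(task_class, ROUTES["high_value_exception"])


print(route_for("structured_extraction"))
```

Keeping the policy in data rather than scattered across prompts makes it reviewable and easy to change when evals shift.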
Test-time compute is powerful, but it is easy to misuse. More thinking can increase latency, cost, and sometimes confusion. The s1 paper showed that budget forcing can improve reasoning performance in some settings, while later work has questioned whether simply extending thinking always helps. The lesson for operators is straightforward: do not buy reasoning tokens on faith. Measure them.
A production routing policy should track:
| Signal | Why it matters |
|---|---|
| Task class | Different jobs deserve different models |
| Confidence or verifier score | Low confidence routes to more compute or review |
| Business value | A $50 invoice and a $5 million contract should not get the same budget |
| Latency tolerance | Back-office batch jobs can wait; live support cannot |
| Reversibility | Irreversible actions need stricter approval |
| Eval performance | Routing only works if cheaper paths still pass |
| Cost per accepted task | Token spend is input cost, not business value |
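The last row of the table is simple arithmetic, sketched below. The function name and all figures are hypothetical; in practice "accepted" would be defined by your evals or reviewers, and total cost would include human review time as well as token spend.

```python
def cost_per_accepted_task(
    total_cost: float, tasks_attempted: int, acceptance_rate: float
) -> float:
    """Total spend divided by the number of tasks that passed acceptance."""
    accepted = tasks_attempted * acceptance_rate
    if accepted <= 0:
        raise ValueError("no accepted tasks; cost per accepted task is undefined")
    return total_cost / accepted


# Hypothetical numbers for two routes over the same 1,000 tasks.
cheap_path = cost_per_accepted_task(total_cost=20.0, tasks_attempted=1000, acceptance_rate=0.8)
premium_path = cost_per_accepted_task(total_cost=150.0, tasks_attempted=1000, acceptance_rate=0.95)
print(f"cheap: ${cheap_path:.4f}  premium: ${premium_path:.4f}")
```

Comparing routes on this number, instead of on raw token price, shows whether the extra accuracy of a stronger model pays for itself.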
This is where many AI programs get upside down. They start with a premium model everywhere, then panic when usage grows. The better pattern is to build the harness first: evals, traces, routing, fallbacks, and cost budgets. Then spend expensive reasoning only where it changes the outcome.
4. Sleep-Time Compute: Move Reasoning Off the Critical Path
Sleep-time compute is the newest and most operator-relevant piece of the scaling puzzle.
The Letta paper, Sleep-time Compute: Beyond Inference Scaling at Test-time, starts with a simple observation: many LLM problems are stateful. The context exists before the query arrives.
A codebase exists before the developer asks about it. A contract repository exists before legal asks for a risk summary. A customer account history exists before the support ticket arrives. A call transcript exists before the manager asks for coaching themes. A data room exists before the diligence team asks its tenth question.
Standard test-time compute treats every query as if the context and question arrive together. That means the model re-derives the same understanding on the user's time, again and again.
Sleep-time compute changes the sequence:
- The system already has context c.
- While idle, it runs offline processing S(c) to create a richer context c'.
- When the user asks q, the model answers from c' with a smaller live budget.
In plain English: let the system read, summarize, index, infer, and organize before anyone is waiting.
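The sequence above can be sketched in a few lines. Everything here is a stand-in: `sleep_time_process` plays the role of S(c), and in a real system it would be offline model calls rather than line splitting, while `answer` plays the role of the cheap live step, shown here as keyword matching. All names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PreparedContext:
    """c': the richer context produced while the system is idle."""
    notes: list[str]


def sleep_time_process(context: str) -> PreparedContext:
    # Stand-in for S(c). A real system would summarize, infer, and index here.
    notes = [line.strip() for line in context.splitlines() if line.strip()]
    return PreparedContext(notes=notes)


def answer(query: str, prepared: PreparedContext) -> list[str]:
    # Stand-in for the live step: answer q from c' with a small budget.
    terms = query.lower().split()
    return [n for n in prepared.notes if any(t in n.lower() for t in terms)]


raw_context = "Renewal date: 2025-03-01\n\nOpen ticket: billing dispute\n"
prepared = sleep_time_process(raw_context)  # runs before anyone asks
print(answer("billing", prepared))          # runs while the user waits
```

The expensive call happens once per context, and every later query starts from the prepared notes instead of the raw material.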
The Letta paper reports that sleep-time compute can reach the same accuracy with roughly 5x fewer test-time tokens on Stateful GSM-Symbolic and Stateful AIME. It also reports up to 2.5x lower average cost per query when multiple related questions share the same context, because the offline work is amortized. The strongest gains appear when the future query is predictable from the context.
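The amortization effect is easy to see with a toy calculation. The token counts below are invented for illustration and are not figures from the paper. The sketch also assumes offline and live tokens cost the same, which may not hold in practice.

```python
def avg_tokens_per_query(
    offline_tokens: float, live_tokens_per_query: float, num_queries: int
) -> float:
    """Average tokens per query when one offline pass is shared by num_queries."""
    if num_queries < 1:
        raise ValueError("num_queries must be at least 1")
    return (offline_tokens + live_tokens_per_query * num_queries) / num_queries


# Hypothetical workload: without sleep-time work, each query spends 5,000 live tokens.
baseline = avg_tokens_per_query(offline_tokens=0, live_tokens_per_query=5000, num_queries=10)

# With an 8,000-token offline pass, each query needs only 1,000 live tokens.
with_sleep = avg_tokens_per_query(offline_tokens=8000, live_tokens_per_query=1000, num_queries=10)

print(baseline, with_sleep)  # prints 5000.0 1800.0
```

With these made-up numbers the offline pass breaks even at two queries on the same context, and every query after that widens the gap. With a single query it costs more than doing nothing in advance.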
That predictability condition is the whole enterprise story.
Sleep-time compute is not magic. It helps when the system can reasonably anticipate what will matter later. It is weak when the query is unrelated, novel, or depends on information that arrives only at request time. It can also introduce distracting context if the offline summary is noisy or overbroad.
But for many enterprise workflows, predictability is everywhere.
Where Sleep-Time Compute Fits in Real Workflows
Think about the contexts your business already has before anyone asks:
| Workflow | Reusable context | Sleep-time work |
|---|---|---|
| Customer support | Account history, tickets, policies, refunds, product data | Summarize account state, identify likely issues, prepare evidence links |
| Sales | CRM notes, call transcripts, emails, firmographic data | Build account briefs, detect objections, prepare next-best actions |
| Finance | Invoices, POs, contracts, ledger history | Precompute vendor patterns, exception rules, variance explanations |
| Legal | Contracts, redlines, clause library, playbooks | Extract obligations, flag unusual terms, map fallback positions |
| Engineering | Repo, docs, PR history, incidents | Build architecture maps, dependency summaries, likely owners |
| Compliance | Policies, approvals, audit logs, control evidence | Organize evidence by control, detect gaps before review |
None of this requires updating the model's weights. That is a critical distinction. Sleep-time compute is representation learning in token space. The system transforms context into a more useful natural-language or structured memory representation. It can store summaries, inferred relationships, risk flags, evidence pointers, task state, and likely questions.
That makes it operationally attractive. You can inspect it. You can version it. You can delete it. You can attach permissions to it. You can refresh it when the underlying context changes.
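Those properties map onto an ordinary data record. The sketch below is one possible shape, not a description of any particular product: the `MemoryRecord` fields, the role-based permission check, and the fingerprint-based staleness test are all assumptions made for illustration.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable


def fingerprint(text: str) -> str:
    """Stable hash of the source context, used to detect changes."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class MemoryRecord:
    summary: str                                  # inspectable, plain text
    source_fingerprint: str                       # ties the memory to its source
    version: int = 1                              # versioned on every refresh
    allowed_roles: frozenset[str] = frozenset()   # hypothetical permission model

    def is_stale(self, current_source: str) -> bool:
        return self.source_fingerprint != fingerprint(current_source)

    def readable_by(self, role: str) -> bool:
        return role in self.allowed_roles


def refresh(
    record: MemoryRecord, current_source: str, summarize: Callable[[str], str]
) -> MemoryRecord:
    """Regenerate the summary only when the underlying context has changed."""
    if not record.is_stale(current_source):
        return record
    return MemoryRecord(
        summary=summarize(current_source),  # hypothetical offline model call
        source_fingerprint=fingerprint(current_source),
        version=record.version + 1,
        allowed_roles=record.allowed_roles,
    )
```

Deleting a memory is then just deleting a record, which is much simpler than trying to remove something from model weights.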
This also means sleep-time compute belongs next to retrieval, memory, and caching. It is not a replacement for RAG. It is a smarter pre-processing layer that makes retrieval and reasoning cheaper.
The Enterprise Pattern: Route Across All Four
The best AI systems will not choose one scaling method. They will route across all four.
Take a support workflow.
Pretraining gives you the general model capability to understand language, policy, and tool instructions. Post-training gives you an instruction-following model that can handle support tone and structured tool calls. Sleep-time compute prepares account summaries, likely issue maps, and policy-relevant facts before the customer writes in. Test-time compute is reserved for the difficult cases: angry customers, edge policies, refund disputes, account anomalies, and multi-step remediation.
Take a diligence workflow.
Pretraining gives broad financial and legal language competence. Post-training shapes the model for evidence-grounded analysis. Sleep-time compute processes the data room, builds entity maps, summarizes contracts, and flags likely diligence issues before the partner asks. Test-time compute handles the actual investment question, runs scenario analysis, and resolves conflicts in the evidence.
Take a coding workflow.
Pretraining gives code fluency. Post-training gives instruction following and coding style. Sleep-time compute builds a repo map, module summaries, dependency graph, and owner hints. Test-time compute is spent on the actual bug, refactor, or feature design.
This is the architecture enterprises should want: cheaper by default, stronger when needed, and smarter over time.
What OpenNash Builds
OpenNash helps enterprises turn this scaling map into production workflow design.
We do not start by asking which model you want. We start by mapping the work:
- Which tasks are routine?
- Which tasks are high-value or high-risk?
- Which context repeats across many queries?
- Which outputs need citations or evidence?
- Which actions are reversible?
- Which writes require approval?
- Which latency targets matter?
- Which evals prove the workflow is safe enough to scale?
Then we build the compute policy around the workflow.
For routine steps, that may mean small models, deterministic code, and validation. For complex reasoning, it may mean a frontier model with more test-time budget. For repeated context, it may mean sleep-time memory generation. For regulated actions, it may mean approval gates, audit logs, and reviewer dashboards. For cost control, it means measuring cost per accepted task rather than tokens in isolation.
The result is not "use AI everywhere." It is a routing system:
- Classify the task.
- Load only the context the task is allowed to see.
- Use the cheapest model that passes the eval.
- Escalate to stronger reasoning when the evidence says to.
- Reuse memory when the context is stable.
- Require human approval when the action is risky.
- Log the trace so the decision can be audited.
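A compressed sketch of the middle of that list: try the cheapest tier first, escalate when a check fails, flag risky work for approval, and keep a trace. The `Tier`, `Outcome`, and `run_with_escalation` names are hypothetical, and the lambdas stand in for real model calls and real evals. Task classification, context permissions, and memory reuse are left out to keep the example short.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Tier:
    name: str
    run: Callable[[str], str]  # hypothetical model call


@dataclass
class Outcome:
    output: Optional[str]
    tier: Optional[str]
    needs_human: bool
    trace: list[str] = field(default_factory=list)


def run_with_escalation(
    task: str,
    tiers: list[Tier],                        # ordered cheapest first
    passes_check: Callable[[str, str], bool],  # hypothetical eval or validator
    risky: bool,
) -> Outcome:
    trace: list[str] = []
    for tier in tiers:
        output = tier.run(task)
        ok = passes_check(task, output)
        trace.append(f"{tier.name}: {'pass' if ok else 'fail'}")
        if ok:
            # Risky actions still wait for a human, even when the check passes.
            return Outcome(output, tier.name, needs_human=risky, trace=trace)
    trace.append("all tiers failed: route to human")
    return Outcome(None, None, needs_human=True, trace=trace)


tiers = [Tier("small", lambda t: "bad"), Tier("strong", lambda t: "good")]
outcome = run_with_escalation("demo task", tiers, lambda t, o: o == "good", risky=False)
print(outcome.tier, outcome.trace)
```

The trace is what makes the decision auditable later: it records which tiers were tried and why the work stopped where it did.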
That is how enterprises get beyond model demos. They stop treating AI as a single magic endpoint and start treating it as an operating system for work.
The Board-Level Question
The four scaling methods give boards and operators a better vocabulary.
Do not ask only: are we using the best model?
Ask:
- Are we buying the right pretrained capability?
- Are we shaping behavior with post-training, examples, prompts, and evals?
- Are we spending test-time compute only where it improves accepted work?
- Are we using sleep-time compute where context repeats?
- Are we routing by task value, risk, latency, and reversibility?
- Are we measuring cost per accepted task?
The companies that win with AI will not be the companies that spend the most inference tokens. They will be the companies that understand when to spend them.
Book a call to map one workflow, define the routing policy, and decide which kind of AI compute should do which job.