How to think
about LLMs
& AI agents.
A single-page crash course. No prior background needed. By the end you'll understand what a language model actually does, what turns one into an agent, and the engineering tricks - tools, MCP, the context window, skills - that make agents work in the real world.
Orientation
There's a lot of mystique around AI agents. There shouldn't be. Underneath the buzzwords, the whole field rests on a small number of simple ideas stacked together - the five layers shown to the right of the title above.
This guide builds those ideas one at a time. Each module assumes only what came before it.
An agent is not a special kind of AI. It is an ordinary language model placed inside a loop with tools and well-managed context. Everything below unpacks that sentence.
What is an LLM?
Mechanically, a Large Language Model does exactly one thing: given some text, it predicts what text most plausibly comes next. Surprisingly, doing that at huge scale can produce behaviour that looks like reasoning, writing, coding, and planning.
Think of it as autocomplete on steroids. Your phone's keyboard suggests the next word based on a few patterns; an LLM is the same idea trained on a library's worth of text, so its suggestions can stretch into whole paragraphs, code, or arguments.
During training, the model plays a guessing game billions of times: hide the next word, guess it, check, adjust. Each tiny adjustment makes it slightly less wrong. Repeat at vast scale, and the patterns it absorbs end up encoding grammar, facts, reasoning styles, and tone - without anyone ever programming a rule for them.
The finished model is, mechanically, a function: text goes in, and out comes a ranked list of likely next tokens with probabilities. To write a sentence it picks one, appends it, and feeds the whole thing back in to predict the next one. Word by word. This is called autoregressive generation.
1 · The model itself is stateless. It has no memory at all between calls. Imagine talking to a brilliant person with total amnesia. But - if you've used ChatGPT and felt like it remembered things, that memory lives in the surrounding app, not the model. The app stores past messages, summaries, or user preferences and quietly re-sends the relevant pieces in each new request. (How that works mechanically: Module 03.)
2 · It can be confidently wrong. A fluent, plausible continuation is not a true one. The model is optimised to sound right, not to be right. Fabricated-but-fluent output is called a hallucination.
3 · Its knowledge is frozen in time. It only "knows" patterns from its training data, up to a cutoff date - like a brilliant graduate locked in a library since, say, early 2024. Ask about yesterday's news and it can't help. One fix: give it a web_search tool so it can look things up at runtime - then make sure it uses sources well. (That's Module 05.)
Tokens & the Context Window
Models don't read letters or whole words. They read tokens - chunks of text, roughly word-sized or smaller.
Common words are usually one token; longer or rarer words get split into pieces. As a rough rule, one token ≈ 4 characters of English, or about ¾ of a word. Try the live tokenizer:
The context window is the maximum amount of text - measured in tokens - the model can consider at once. Think of it as the model's entire field of view. If something isn't in the context window, the model genuinely cannot see it.
Critically, everything shares this one space: the system prompt, the conversation so far, any documents you've pasted in, the descriptions of the model's tools, and the answer it's currently writing. They all compete for the same fixed budget.
The context window is the model's only working memory, and it is finite. Almost everything that separates a flaky toy agent from a reliable one comes down to managing this scarce space well. Hold that thought - Module 08 is dedicated entirely to it.
The LLM as a Function
Forget the chat bubble for a moment. At the engineering level, an LLM is a plain function: you hand it text, it hands you text back.
If "function" is a new word - it just means a repeatable input-output box. Give it the same text, you get a similar response back. No state, no memory, no surprises hidden inside.
So how does a chatbot "remember" your earlier messages? It doesn't, really. The application re-sends the entire conversation history with every single turn. The "conversation" is an illusion stitched together by replaying the whole transcript each time. Concretely, the input is a list of role-tagged messages - this whole list is what people mean when they say "prompt":
This reframing is powerful. Once you see the model as a stateless text-to-text function, the path forward is obvious: if you want it to do more than chat, you change what you put in and what you do with what comes out. That is the entire job of building an agent.
A “prompt” is not magic wording. It is the instruction package the model sees on a given turn: its role, the goal, constraints, useful context, and the desired output format. The system prompt is the standing part at the top; user messages are the variable part. Most of the “prompt engineering” you'll hear about is just being precise and structured here.
Workflow vs Agent
Both "workflows" and "agents" combine LLM calls, tools and code. The difference is one thing: who decides what happens next.
In a workflow, you - the developer - wrote the steps in advance. The path is fixed in code. In an agent, the model itself decides the path at runtime: which tool to use, in what order, and when the job is done.
Workflow
- Predictable & repeatable
- Easy to test and debug
- Best when steps are known ahead of time
- Can't adapt to surprises
Agent
decides
- Flexible & adaptive
- Handles open-ended tasks
- Best when steps can't be known in advance
- Less predictable, costs more
Find the simplest thing that works. Most real problems don't need a full agent - a single well-prompted LLM call, or a fixed workflow, is cheaper, faster and more reliable. Reach for an agent only when the task is genuinely open-ended and you can't map the steps in advance.
Why "cheaper and faster" matters: every loop step is another model call, another tool call, more context. A workflow with 3 fixed LLM calls is predictable in cost and latency. An agent might use 3 - or 30 - to do the same job.
Tools
A model on its own can only produce text. A tool is a function you let the model ask you to run on its behalf - the way it reaches outside its own head to do something real.
Think of the model as a brilliant consultant in a sealed room. It can think and write, but it can't open a door. A tool is a bell the consultant can ring: ring this bell to look something up on the web; ring that one to send an email; ring the third to run a calculation. The consultant doesn't leave the room - it just describes what it wants done, and an assistant outside the room actually does it.
Some everyday tools an agent might have:
Here's the part that surprises people: the model never actually runs anything itself. You give it a list of tool definitions - each with a name, a plain-English description, and a schema for its inputs. When the model wants to use one, it doesn't execute code; it just outputs a structured message that says "please call web_search with query = …". Your surrounding program - the harness - reads that request, actually runs the function, and feeds the result back into context as a new message. Then the model continues with that fresh information in view.
(request)
search_wiki({ query: "remote work policy" })+ tool
[ { title: "Remote Work Policy v3", url: "...", excerpt: "Eligible employees may work remotely up to..." }, ... ]Tools give the model access to information - they don't guarantee judgment. A model can still search badly, misread a source, over-trust stale data, or cite something irrelevant. A reliable agent needs good tool choice, source checking, and sometimes a human in the loop.
Read-only tools are low risk. Searching, reading files, querying a database - the worst case is a wrong answer.
Write tools are high risk. send_email, run_code, delete_file, charge_card, anything that touches a customer or moves money - the worst case is real damage.
Give risky tools explicit approval gates, narrow permissions, and dry-run modes. "The model can do anything" is a feature and a liability.
The Agent Loop
Take a model. Give it tools. Put it in a loop. That loop is the agent. Everything else is engineering around it.
The harness keeps calling the model. Each time, the model either asks for a tool or declares it's finished. As long as it asks for tools, the loop continues - running them, feeding results back, calling the model again.
LOOP
An actual research agent answering a question, played back live. This pattern - the model reasons about what to do, picks an action (a tool call), reads the observation, and reasons again - is so common it has a name: ReAct (Reason + Act). It's the workhorse loop behind almost every agent you'll meet.
An "AI agent" is this loop. Coding agents, research agents, customer-support agents - same skeleton: a model, a set of tools, a loop, and a stopping condition. What makes them good is the quality of the tools and how well their context is managed.
MCP vs CLI
Once you accept that agents need tools, the next question is: where do the tools come from? Two approaches dominate. Both end up giving the model abilities it didn't have on its own - they just package them differently.
The text-based way humans have always talked to computers. You type a command; the computer responds in text. If you've ever opened Terminal on a Mac or Command Prompt on Windows and typed something like ls or dir, you've used a CLI.
For an agent, "using a CLI" just means: let the model type shell commands, and run them in a sandboxed computer. Anything a developer can do at a terminal, the agent can ask to do.
An open standard for plugging AI models into outside services. Instead of inventing a new integration for every app (Slack, Drive, GitHub, your database…), each service exposes an MCP server that lists its tools in a uniform shape: name, description, input schema, output schema.
Think USB-C for AI. One plug shape, many devices. Build (or install) one MCP server for Slack, and any MCP-compatible agent can use it tomorrow.
CLI
$ grep -r "error" ./logs
$ curl -s api.example.com/v1/users
MCP
tool: search_wiki { query: string }
tool: send_slack { channel, text }
Recall that a model picks tools from descriptions and reasons over whatever lands in its context. MCP gives it clean, machine-readable contracts: clearly typed inputs so it makes fewer malformed calls, structured outputs it doesn't have to parse out of messy text, and reliable discovery of what's available. Less guessing means fewer errors. That said - these aren't rivals. Many strong agents use both: a CLI for fast, flexible work inside a sandbox, and MCP servers for dependable connections to the outside world.
Context Management
Here's where toy demos and production agents part ways. An agent that runs for many loop steps keeps piling things into its context - every tool result, every observation, every intermediate thought.
Picture the model's context as a desk. It can only fit so many papers before things start sliding off and the important sheet gets buried under the noise. Remember Module 02: this desk is finite and it's the model's only working memory. Two bad things happen as it fills.
1. Hard limit. You run out of room. The request fails or older messages get silently dropped.
2. Context rot. Long before the limit, the desk gets so cluttered that the model can't find what's relevant. Answer quality quietly degrades. This insidious decay is nicknamed context rot.
So the core skill is treating context like a tight budget. The mindset: keep the desk clean. Anything not needed right now goes back in a drawer (a file, a summary, an external store) where it can be fetched on demand. Try the visualiser below - each technique frees up working space.
The throughline behind every technique below: keep what's relevant in context; keep everything else retrievable outside it; load on demand. The six techniques fall into three families:
Offload to the file system family C
Instead of holding a huge document or a long tool output in context, the agent writes it to a file and keeps only a short pointer ("results saved to notes.md"). It reads the file back only when it actually needs that content. The file system becomes external, effectively unlimited, persistent memory - context holds the index, not the whole library.
Compaction & summarisation family A
When the conversation gets long, the agent compresses older turns into a concise summary and drops the verbose originals. Recent steps stay word-for-word; distant history becomes a few tight lines. A rolling summary keeps the thread intact at a fraction of the token cost.
Retrieval - just-in-time context family B
Don't preload an entire knowledge base into the prompt. Store it externally and fetch only the few relevant chunks for the question at hand, exactly when they're needed. (This is the idea behind "RAG" - retrieval-augmented generation.)
Skills & progressive disclosure family B
A Skill is a self-contained folder of instructions and resources that an agent can load on demand to do a specialised task. The clever part is the front matter - see the diagram below. The agent loads only tiny skill summaries upfront and pulls in the full body of a skill only when a task actually calls for it.
Sub-agents - context isolation family C
For a messy sub-task, spin up a fresh agent with its own clean context. It does the noisy work in isolation and returns only the tidy final result. The main agent's context never gets polluted with the intermediate clutter.
Structured note-taking family C
The agent maintains a running to-do list or scratchpad file it updates as it works. This externalises its plan and progress, so its state survives even when older context is summarised or trimmed away.
Each dark bar is a skill's front matter - a tiny header (just a name + a description of when to use it). Only these headers stay loaded. Click a skill: the heavy body loads only when the task matches it. Watch the context meter below.
A beginner asks "how do I fit everything into the prompt?" An experienced builder asks "what's the least I can put in context, while keeping everything else one cheap fetch away?" That question is most of the job.
Building Effective Agents
You rarely jump straight to a free-roaming agent. There's a ladder of patterns - climb it only as far as the problem demands.
The foundational building block is the augmented LLM: a single model call given tools, retrieval, and memory. Most patterns are just clever arrangements of that block. Each card below names a pattern and gives one concrete example.
Prompt chaining
Break a task into ordered steps; each LLM call feeds the next. Predictable, easy to debug.
Routing
Classify the input first, then send it down a specialised path built for that category.
└→ [path B]
Parallelisation
Run independent sub-tasks at the same time, then merge the results. Faster, and useful for cross-checking.
[in]┼→ [B]┼→ [merge]
└→ [C]┘
Orchestrator-workers
A lead model breaks a job into sub-tasks, delegates each, then synthesises the answers.
Evaluator-optimiser
One model drafts, another critiques against criteria; loop until the work is good enough.
Autonomous agent
The model plans its own path through tools in a loop. For open-ended tasks whose steps can't be predicted.
Can you write the steps yourself? → Use a workflow. Cheaper, faster, more reliable.
Are the steps obvious but path depends on input? → Routing.
Quality matters more than speed? → Evaluator-optimiser.
Steps genuinely can't be predicted? → Reach for an autonomous agent. Only then.
Simplicity. Use the least complex pattern that solves the task - fewer moving parts, fewer failure modes.
Transparency. Make the agent show its planning and tool steps, so you can see why it did what it did.
A well-crafted interface. Invest in clear tools, sharp descriptions, and clean context as much as in clever prompts. The agent is only as good as what it can see and do.
Build Your First Agent
You now have the full mental model. Here's the shortest path from understanding to a working agent - keep the first one deliberately tiny.
- Pick a small, real task. Something genuinely open-ended - "research a topic and summarise it," not "translate this sentence."
- Give it 2-3 tools, no more. Each with a sharp, honest description. Start minimal; add tools only when you observe a real need.
- Write a tight system prompt. State the goal, the constraints, and - critically - when the agent should stop.
- Run the loop. Model → tool call → result back into context → model again, until it returns a final answer.
- Watch the traces. Read every step it took. This is where you learn what's actually happening - and where bad tool descriptions reveal themselves.
- Mind the context. If runs get long or quality dips, reach for Module 08: offload to files, compact history, load skills on demand.
- Add guardrails. A step limit, a cost cap, human approval before risky actions. Then iterate - improve one tool or one prompt at a time.
Once your agent runs, the trace is the truth. For each run, ask:
- Did it choose the right tool for each step?
- Did it use the result correctly, or misread the output?
- Did it stop at the right time, or loop forever / quit early?
- Did it flag uncertainty instead of bluffing?
- Did it avoid risky actions without explicit approval?
Five "yes" answers → you have a real agent. A "no" anywhere is where to focus next.
That's the whole arc: a model predicts tokens → wrapped as a stateless function → handed tools → placed in a loop → fed carefully managed context. Every impressive agent you'll ever see is built from these parts. The mystique is gone - what's left is craft. Now go build something.
Want help getting from zero to agent? OpenNash helps executives and teams learn the basics, find the right workflow, and ship production AI agents that integrate with the software your business already uses.
We can run practical AI education, map your workflow, build the first prototype, and manage the agent as it improves from reviewed outcomes.
- LLM
- Large Language Model. A neural network trained to predict the next token; the engine inside ChatGPT, Claude, Gemini, and friends.
- Token
- A chunk of text the model reads - roughly ¾ of an English word. Everything is counted and billed in tokens.
- Tokenizer
- The piece of code that splits raw text into tokens before the model sees it.
- Context window
- The max tokens the model can consider at once. Its only working memory.
- Prompt
- The text you send into the model on a given call. Includes the system prompt, history, and any new input.
- System prompt
- Standing instructions at the top of every request - role, rules, goal, stopping criteria.
- Autoregressive
- Generates one token at a time, appending each to the input before predicting the next.
- Stateless
- No memory between calls. The model forgets everything the instant a response ends; the app re-sends history each turn.
- Hallucination
- Fluent, plausible output that is factually wrong. A failure mode of pure text prediction.
- Knowledge cutoff
- The date the model's training data ends. After this, it knows nothing - unless you give it a search tool.
- Temperature
- A knob that controls how random the model's choices are. Low = deterministic, high = creative.
- Tool
- A function the model can request to be called. The harness - not the model - actually runs it.
- Tool call
- A structured request from the model: "please run tool_name with these arguments."
- Harness
- The program around the model that runs the loop, executes tools, and manages context. Sometimes called the "orchestrator" or "runtime."
- Agent
- An LLM placed in a loop with tools that it decides how to use, ending when it returns a final answer.
- Agent loop
- Reason → act → observe → repeat, until the model returns a final answer or a guardrail stops it.
- ReAct
- "Reason + Act" - the canonical agent pattern: the model writes a thought, picks an action, reads the observation, repeats.
- Workflow
- A pre-written sequence of LLM calls and code. The developer decides the path, not the model.
- CLI
- Command-Line Interface. The text terminal where you type commands like
lsorcurl. Agents can be given one to use. - MCP
- Model Context Protocol. A standard way to expose typed tools - "USB-C for AI." Any MCP-compatible agent can use any MCP server.
- MCP server
- A small program that publishes a service's tools (Slack, Drive, a database) in the MCP shape.
- Context rot
- Quality decline as context fills with noise - well before the hard token limit is reached.
- Compaction
- Replacing older verbose messages with a short summary to save context space.
- Skill
- A folder of instructions and resources, surfaced by a short front-matter header and loaded on demand.
- Front matter
- A tiny header at the top of a skill (name + description) that stays in context while the heavy body stays out until needed.
- Progressive disclosure
- Load metadata first; load full details only when the task requires them.
- Sub-agent
- A fresh agent spun up with its own clean context for a noisy sub-task; returns only its final result.
- RAG
- Retrieval-Augmented Generation. Fetch only the relevant chunks of a knowledge base at query time.
- Embedding
- A numerical vector that represents the meaning of text, used to find related chunks in a vector store.
- Vector store
- A database that stores embeddings and finds the nearest ones to a query - the search engine behind RAG.
- Fine-tuning
- Continuing to train a base model on your own data to bake in specific behaviour or knowledge.
- Guardrail
- A safety rule outside the model - a step limit, a cost cap, a content filter, a human-in-the-loop check.
- Trace
- The full recorded sequence of an agent's thoughts, tool calls, and observations. Where you learn what's actually happening.