From zero knowledge to reliable AI automation

How to think about
LLMs & AI agents.

A single-page crash course on AI by OpenNash. No prior background needed. By the end you'll understand what a language model actually does, the difference between reliable AI automation and an AI agent, and the engineering tricks - tools, MCP, the context window, skills, evals, and human review - that make agents work in the real world.

11 modules 3 live demos Visual diagrams No code required Print-friendly

Orientation

How to use this guide

There's a lot of mystique around AI agents. There shouldn't be. Underneath the buzzwords, the whole field rests on a small number of simple ideas stacked together - the five layers shown to the right of the title above.

This guide builds those ideas one at a time. Each module assumes only what came before it.

If you only remember one sentence

An agent is not a special kind of AI. It is an ordinary language model placed inside a loop with tools and well-managed context. Everything below unpacks that sentence.

What is an LLM?

The prediction engine

Mechanically, a Large Language Model does exactly one thing: given some text, it predicts what text most plausibly comes next. Surprisingly, doing that at huge scale can produce behaviour that looks like reasoning, writing, coding, and planning.

Think of it as autocomplete on steroids. Your phone's keyboard suggests the next word based on a few patterns; an LLM is the same idea trained on a library's worth of text, so its suggestions can stretch into whole paragraphs, code, or arguments.

During training, the model plays a guessing game billions of times: hide the next word, guess it, check, adjust. Each tiny adjustment makes it slightly less wrong. Repeat at vast scale, and the patterns it absorbs end up encoding grammar, facts, reasoning styles, and tone - without anyone ever programming a rule for them.

The finished model is, mechanically, a function: text goes in, and out comes a ranked list of likely next tokens with probabilities. To write a sentence it picks one, appends it, and feeds the whole thing back in to predict the next one. Word by word. This is called autoregressive generation.

Demo · Predict the next tokenFIG 01

Click a continuation - bars show probability.

This is the model's entire job, on loop. It has no plan and no memory of "intent" - it just keeps choosing a likely next token. Coherent paragraphs are an emergent result of doing this very, very well.

Three consequences worth internalising

1 · The model itself is stateless. It has no memory at all between calls. Imagine talking to a brilliant person with total amnesia. But - if you've used ChatGPT and felt like it remembered things, that memory lives in the surrounding app, not the model. The app stores past messages, summaries, or user preferences and quietly re-sends the relevant pieces in each new request. (How that works mechanically: Module 03.)

2 · It can be confidently wrong. A fluent, plausible continuation is not a true one. The model is optimised to sound right, not to be right. Fabricated-but-fluent output is called a hallucination.

3 · Its knowledge is frozen in time. It only "knows" patterns from its training data, up to a cutoff date - like a brilliant graduate locked in a library since, say, early 2024. Ask about yesterday's news and it can't help. One fix: give it a web_search tool so it can look things up at runtime - then make sure it uses sources well. (That's Module 05.)

Tokens & the Context Window

The model's field of view

Models don't read letters or whole words. They read tokens - chunks of text, roughly word-sized or smaller.

Common words are usually one token; longer or rarer words get split into pieces. As a rough rule, one token ≈ 4 characters of English, or about ¾ of a word. Try the live tokenizer:

Demo · Live tokenizerFIG 02

Your textedit me ↓

Tokenseach coloured chip = one token

0Characters

0Tokens

0Chars / token

This uses a simple heuristic tokenizer for illustration; real tokenizers (like BPE) learn their splits from data. The point holds: tokens are the unit everything is counted, billed, and budgeted in.

The context window is the maximum amount of text - measured in tokens - the model can consider at once. Think of it as the model's entire field of view. If something isn't in the context window, the model genuinely cannot see it.

Critically, everything shares this one space: the system prompt, the conversation so far, any documents you've pasted in, the descriptions of the model's tools, and the answer it's currently writing. They all compete for the same fixed budget.

The single most important idea in this guide

The context window is the model's only working memory, and it is finite. Almost everything that separates a flaky toy agent from a reliable one comes down to managing this scarce space well. Hold that thought - Module 08 is dedicated entirely to it.

The LLM as a Function

Text in, text out

Forget the chat bubble for a moment. At the engineering level, an LLM is a plain function: you hand it text, it hands you text back.

If "function" is a new word - it just means a repeatable input-output box. Give it the same text, you get a similar response back. No state, no memory, no surprises hidden inside.

The model is a stateless functionFIG 03

Input · messages[ ]All the text the model should see - system prompt, history, documents, tool defs

→

LLMpredict next tokens

→

Output · textThe most plausible continuation, token by token

No hidden state. No memory of previous calls. The function's behaviour depends only on the text you pass in this time.

So how does a chatbot "remember" your earlier messages? It doesn't, really. The application re-sends the entire conversation history with every single turn. The "conversation" is an illusion stitched together by replaying the whole transcript each time. Concretely, the input is a list of role-tagged messages - this whole list is what people mean when they say "prompt":

What actually gets sent on each turnFIG 04

system

You are a helpful assistant. Be concise.

user

What's the capital of France?

assistant

Paris.

user

And the population?

assistant

...new prediction appears here, conditioned on everything above.

The "conversation" is just an append-only list of messages, replayed in full on each request. There is no hidden chat session living on a server.

This reframing is powerful. Once you see the model as a stateless text-to-text function, the path forward is obvious: if you want it to do more than chat, you change what you put in and what you do with what comes out. That is the entire job of building an agent.

A note on prompts

A “prompt” is not magic wording. It is the instruction package the model sees on a given turn: its role, the goal, constraints, useful context, and the desired output format. The system prompt is the standing part at the top; user messages are the variable part. Most of the “prompt engineering” you'll hear about is just being precise and structured here.

Workflow vs Agent

Who is driving?

Both "workflows" and "agents" combine LLM calls, tools and code. The difference is one thing: who decides what happens next.

In a workflow, you - the developer - wrote the steps in advance. The path is fixed in code. In an agent, the model itself decides the path at runtime: which tool to use, in what order, and when the job is done.

Fixed path vs. model-directed pathFIG 05

Workflow

Control flow fixed by code

Step A→ Step B→ Step C

Predictable & repeatable
Easy to test and debug
Best when steps are known ahead of time
Can't adapt to surprises

Agent

Control flow chosen by the model

searchread_filerun_code

LLM
decides

↺ loopuntil done

Flexible & adaptive
Handles open-ended tasks
Best when steps can't be known in advance
Less predictable, costs more

Neither is "better." A workflow is a fixed recipe; an agent is a cook who decides the recipe as they go.

The golden rule

Find the simplest thing that works. Most real problems don't need a full agent - a single well-prompted LLM call, or a fixed workflow, is cheaper, faster and more reliable. Reach for an agent only when the task is genuinely open-ended and you can't map the steps in advance.

Why "cheaper and faster" matters: every loop step is another model call, another tool call, more context. A workflow with 3 fixed LLM calls is predictable in cost and latency. An agent might use 3 - or 30 - to do the same job.

Tools

How a model touches the world

A model on its own can only produce text. A tool is a function you let the model ask you to run on its behalf - the way it reaches outside its own head to do something real.

Think of the model as a brilliant consultant in a sealed room. It can think and write, but it can't open a door. A tool is a bell the consultant can ring: ring this bell to look something up on the web; ring that one to send an email; ring the third to run a calculation. The consultant doesn't leave the room - it just describes what it wants done, and an assistant outside the room actually does it.

Some everyday tools an agent might have:

Common tools in the wildFIG 06

web_search

Search the live internet. The fix for "knowledge is frozen."

read_file

Open a file from disk and return its contents.

write_file

Save text to a file - great for offloading long output.

run_python

Execute code in a sandbox. Fixes math & data work.

send_email

Compose and send a message. Real-world action.

query_db

Run a query against a database. Grounded answers.

Tools are how an LLM gets eyes, hands, and a memory beyond its training. Pick any "impressive" agent demo - underneath, the magic is always: which tools does it have, and how cleverly does it use them?

Here's the part that surprises people: the model never actually runs anything itself. You give it a list of tool definitions - each with a name, a plain-English description, and a schema for its inputs. When the model wants to use one, it doesn't execute code; it just outputs a structured message that says "please call web_search with query = …". Your surrounding program - the harness - reads that request, actually runs the function, and feeds the result back into context as a new message. Then the model continues with that fresh information in view.

Anatomy of a tool definitionFIG 07

name

search_wiki

description

Search the company wiki. Returns the 5 most relevant pages with titles, URLs, and a short excerpt. Use whenever the user asks about internal policies, projects, or team docs.

input schema

{ query: string - natural-language search query, e.g. "remote work policy" }

returns

Array<{ title, url, excerpt }>

            ↑ The model sees only these four things. It decides whether to call this tool, and what to pass in, based purely on the highlighted text. Crisp names and descriptions matter more than almost anything else.
          

A tool definition is, fundamentally, a piece of writing - and writing good tool descriptions is a real, underrated craft.

One tool call, step by stepFIG 08

user

What's our remote work policy?

assistant
(request)

→ call search_wiki({ query: "remote work policy" })

harness
+ tool

Runs the real function. Returns 3 wiki pages with excerpts.

tool result

[ { title: "Remote Work Policy v3", url: "...", excerpt: "Eligible employees may work remotely up to..." }, ... ]

assistant

Reads the results from context, writes a grounded answer - or requests another tool.

Tools turn a "frozen, text-only" model into something that can fetch live data, take real actions, and offload work it's bad at (precise math, running code) to systems that are reliable.

Tools ≠ correct answers

Tools give the model access to information - they don't guarantee judgment. A model can still search badly, misread a source, over-trust stale data, or cite something irrelevant. A reliable agent needs good tool choice, source checking, and sometimes a human in the loop.

Some tools are dangerous - plan for it

Read-only tools are low risk. Searching, reading files, querying a database - the worst case is a wrong answer.
Write tools are high risk. send_email, run_code, delete_file, charge_card, anything that touches a customer or moves money - the worst case is real damage.
Give risky tools explicit approval gates, narrow permissions, and dry-run modes. "The model can do anything" is a feature and a liability.

The Agent Loop

The heartbeat of every agent

Take a model. Give it tools. Put it in a loop. That loop is the agent. Everything else is engineering around it.

The harness keeps calling the model. Each time, the model either asks for a tool or declares it's finished. As long as it asks for tools, the loop continues - running them, feeding results back, calling the model again.

Animated · The think → act → observe loopFIG 09

STEP 01

Think

Model reads goal + context and decides the next move.

STEP 02

Act

It requests a tool call - or gives a final answer.

STEP 03

Observe

Harness runs the tool; result is added to context.

STEP 04

Repeat

Loop back - now with new information in view.

AGENT
LOOP

until done

The loop ends when the model returns a final answer with no tool request - or when a guardrail stops it (a step limit, a budget cap, a human approval). That stopping logic is essential: it's what keeps an agent from running forever.

Live trace · The ReAct pattern (Reason → Act → Observe)FIG 10

An actual research agent answering a question, played back live. This pattern - the model reasons about what to do, picks an action (a tool call), reads the observation, and reasons again - is so common it has a name: ReAct (Reason + Act). It's the workhorse loop behind almost every agent you'll meet.

research-agent · runningstep 0 / 6

0 tokens in context

Notice the rhythm: reason → act → observe → reason. Every "smart" thing the agent does is just another lap around this loop. Look at the token counter - each observation pushes more into context. Module 08 is about keeping that under control.

That's the whole secret

An "AI agent" is this loop. Coding agents, research agents, customer-support agents - same skeleton: a model, a set of tools, a loop, and a stopping condition. What makes them good is the quality of the tools and how well their context is managed.

MCP vs CLI

Two ways to hand an agent capabilities

Once you accept that agents need tools, the next question is: where do the tools come from? Two approaches dominate. Both end up giving the model abilities it didn't have on its own - they just package them differently.

First - what is a CLI? What is MCP?PRIMER

CLI · Command-Line Interface

The text-based way humans have always talked to computers. You type a command; the computer responds in text. If you've ever opened Terminal on a Mac or Command Prompt on Windows and typed something like ls or dir, you've used a CLI.

For an agent, "using a CLI" just means: let the model type shell commands, and run them in a sandboxed computer. Anything a developer can do at a terminal, the agent can ask to do.

MCP · Model Context Protocol

An open standard for plugging AI models into outside services. Instead of inventing a new integration for every app (Slack, Drive, GitHub, your database…), each service exposes an MCP server that lists its tools in a uniform shape: name, description, input schema, output schema.

Think USB-C for AI. One plug shape, many devices. Build (or install) one MCP server for Slack, and any MCP-compatible agent can use it tomorrow.

CLI = "give the agent a terminal." MCP = "give the agent a standard plug it can connect to many services through." They're not exclusive - most strong agents use both.

Command line vs. Model Context ProtocolFIG 11

CLI

Command-Line Interface

# the agent runs raw shell commands
$ grep -r "error" ./logs
$ curl -s api.example.com/v1/users

Universal - every command-line tool ever made works instantly.

Composable: pipe small tools into bigger ones.

Zero integration code to write.

Output is unstructured text the model must parse and guess at.

No schemas, messy errors, no standard way to discover what's available.

BEST FOR: coding agents working inside a sandboxed machine.

MCP

Model Context Protocol

# a server exposes typed tools
tool: search_wiki { query: string }
tool: send_slack { channel, text }

Each tool has a name, description & typed input/output schema.

Standardised - one protocol, reusable across any MCP-compatible app.

Built-in discovery, structured results, cleaner auth.

Needs a server to exist or be built - more upfront setup.

BEST FOR: connecting to external services - Slack, Drive, databases, APIs.

MCP is often described as "a USB-C port for AI": one standard plug, many devices. You write or install a server once, and any agent can use it.

Why MCP suits agents so well

Recall that a model picks tools from descriptions and reasons over whatever lands in its context. MCP gives it clean, machine-readable contracts: clearly typed inputs so it makes fewer malformed calls, structured outputs it doesn't have to parse out of messy text, and reliable discovery of what's available. Less guessing means fewer errors. That said - these aren't rivals. Many strong agents use both: a CLI for fast, flexible work inside a sandbox, and MCP servers for dependable connections to the outside world.

Context Management

The real craft of building agents

Here's where toy demos and production agents part ways. An agent that runs for many loop steps keeps piling things into its context - every tool result, every observation, every intermediate thought.

Picture the model's context as a desk. It can only fit so many papers before things start sliding off and the important sheet gets buried under the noise. Remember Module 02: this desk is finite and it's the model's only working memory. Two bad things happen as it fills.

1. Hard limit. You run out of room. The request fails or older messages get silently dropped.
2. Context rot. Long before the limit, the desk gets so cluttered that the model can't find what's relevant. Answer quality quietly degrades. This insidious decay is nicknamed context rot.

Why "just use a bigger context window" doesn't fix itFIG 12

step 1step 3step 6step 9step 12step 16step 20step 25

answer quality (illustrative) noticeable drift context rot

As context fills with tool results and intermediate work, the model has more to attend to and the relevant bits get harder to find. Quality falls before the hard limit is anywhere near.

So the core skill is treating context like a tight budget. The mindset: keep the desk clean. Anything not needed right now goes back in a drawer (a file, a summary, an external store) where it can be fetched on demand. Try the visualiser below - each technique frees up working space.

Demo · The context budgetFIG 13

Sys

Tools

History

Docs

Free

System prompt Tool definitions Conversation history Retrieved documents Free working space

The goal isn't to cram more in - it's to keep only what's relevant right now in context, and keep everything else retrievable just outside it.

The throughline behind every technique below: keep what's relevant in context; keep everything else retrievable outside it; load on demand. The six techniques fall into three families:

Keep less in context

Summarise. Compact. Remove noise.

Fetch only what you need

Retrieval (RAG). Skills loaded on demand.

Isolate messy work

Files. Sub-agents. Scratchpads.

Offload to the file system family C

Instead of holding a huge document or a long tool output in context, the agent writes it to a file and keeps only a short pointer ("results saved to notes.md"). It reads the file back only when it actually needs that content. The file system becomes external, effectively unlimited, persistent memory - context holds the index, not the whole library.

Compaction & summarisation family A

When the conversation gets long, the agent compresses older turns into a concise summary and drops the verbose originals. Recent steps stay word-for-word; distant history becomes a few tight lines. A rolling summary keeps the thread intact at a fraction of the token cost.

Retrieval - just-in-time context family B

Don't preload an entire knowledge base into the prompt. Store it externally and fetch only the few relevant chunks for the question at hand, exactly when they're needed. (This is the idea behind "RAG" - retrieval-augmented generation.)

Skills & progressive disclosure family B

A Skill is a self-contained folder of instructions and resources that an agent can load on demand to do a specialised task. The clever part is the front matter - see the diagram below. The agent loads only tiny skill summaries upfront and pulls in the full body of a skill only when a task actually calls for it.

Sub-agents - context isolation family C

For a messy sub-task, spin up a fresh agent with its own clean context. It does the noisy work in isolation and returns only the tidy final result. The main agent's context never gets polluted with the intermediate clutter.

Structured note-taking family C

The agent maintains a running to-do list or scratchpad file it updates as it works. This externalises its plan and progress, so its state survives even when older context is summarised or trimmed away.

Skills · Progressive disclosure via front matterFIG 14

Each dark bar is a skill's front matter - a tiny header (just a name + a description of when to use it). Only these headers stay loaded. Click a skill: the heavy body loads only when the task matches it. Watch the context meter below.

Full skill body - loaded on demand (~22% of context)

Detailed step-by-step instructions, helper scripts, and reference code for working with PDFs. Could be thousands of tokens - so it stays out of context until a PDF task appears.

Full skill body - loaded on demand (~18% of context)

Plotting recipes, colour-palette rules, and example code. Heavy content - kept dormant until a charting task arrives.

Full skill body - loaded on demand (~14% of context)

Tone guidelines, banned phrases, worked examples. Pulled in only when a writing task needs it.

Context used by skills:

An agent can have hundreds of skills available while spending almost no context on them - because it reads only the cheap front-matter summaries, then loads the expensive body of the one skill it actually needs. Metadata first, details on demand.

The mindset shift

A beginner asks "how do I fit everything into the prompt?" An experienced builder asks "what's the least I can put in context, while keeping everything else one cheap fetch away?" That question is most of the job.

Building Effective Agents

Patterns, simplest first

You rarely jump straight to a free-roaming agent. There's a ladder of patterns - climb it only as far as the problem demands.

The foundational building block is the augmented LLM: a single model call given tools, retrieval, and memory. Most patterns are just clever arrangements of that block. Each card below names a pattern and gives one concrete example.

Workflow

Prompt chaining

Break a task into ordered steps; each LLM call feeds the next. Predictable, easy to debug.

[A] → [B] → [C]

Example: Outline a blog post → write each section → polish the whole.

Workflow

Routing

Classify the input first, then send it down a specialised path built for that category.

[?] →┬→ [path A]
└→ [path B]

Example: Customer email → classify as refund, bug, or sales → route to the right specialised prompt.

Workflow

Parallelisation

Run independent sub-tasks at the same time, then merge the results. Faster, and useful for cross-checking.

┌→ [A]┐
[in]┼→ [B]┼→ [merge]
└→ [C]┘

Example: Three models grade the same answer in parallel → majority vote.

Workflow

Orchestrator-workers

A lead model breaks a job into sub-tasks, delegates each, then synthesises the answers.

[lead] → [w1][w2][w3] → [lead]

Example: Research lead splits "compare 3 cloud providers" into one sub-agent per provider.

Workflow

Evaluator-optimiser

One model drafts, another critiques against criteria; loop until the work is good enough.

[draft] ⇄ [critique] ↻

Example: Writer drafts copy → editor flags issues → writer revises → repeat ≤3x.

Agent

Autonomous agent

The model plans its own path through tools in a loop. For open-ended tasks whose steps can't be predicted.

[LLM] ↻ tools · until done

Example: A coding agent given a bug ticket - reads files, runs tests, edits code, iterates until tests pass.

How to choose (in 10 seconds)

Can you write the steps yourself? → Use a workflow. Cheaper, faster, more reliable.
Are the steps obvious but path depends on input? → Routing.
Quality matters more than speed? → Evaluator-optimiser.
Steps genuinely can't be predicted? → Reach for an autonomous agent. Only then.

Three principles that keep agents reliable

Simplicity. Use the least complex pattern that solves the task - fewer moving parts, fewer failure modes.
Transparency. Make the agent show its planning and tool steps, so you can see why it did what it did.
A well-crafted interface. Invest in clear tools, sharp descriptions, and clean context as much as in clever prompts. The agent is only as good as what it can see and do.

Build Your First Agent

From reading to doing

You now have the full mental model. Here's the shortest path from understanding to a working agent - keep the first one deliberately tiny.

Pick a small, real task. Something genuinely open-ended - "research a topic and summarise it," not "translate this sentence."
Give it 2-3 tools, no more. Each with a sharp, honest description. Start minimal; add tools only when you observe a real need.
Write a tight system prompt. State the goal, the constraints, and - critically - when the agent should stop.
Run the loop. Model → tool call → result back into context → model again, until it returns a final answer.
Watch the traces. Read every step it took. This is where you learn what's actually happening - and where bad tool descriptions reveal themselves.
Mind the context. If runs get long or quality dips, reach for Module 08: offload to files, compact history, load skills on demand.
Add guardrails. A step limit, a cost cap, human approval before risky actions. Then iterate - improve one tool or one prompt at a time.

How to judge an agent run · a 5-question rubric

Once your agent runs, the trace is the truth. For each run, ask:

Did it choose the right tool for each step?
Did it use the result correctly, or misread the output?
Did it stop at the right time, or loop forever / quit early?
Did it flag uncertainty instead of bluffing?
Did it avoid risky actions without explicit approval?

Five "yes" answers → you have a real agent. A "no" anywhere is where to focus next.

You've reached the top

That's the whole arc: a model predicts tokens → wrapped as a stateless function → handed tools → placed in a loop → fed carefully managed context. Every impressive agent you'll ever see is built from these parts. The mystique is gone - what's left is craft. Now go build something.

OpenNash education + implementation

Want help turning AI into reliable automation? OpenNash helps executives and teams learn the basics, choose the right workflow, and ship production-grade AI systems that integrate with the software your business already uses.

Next: once you understand agents, read Zero → Eval to learn how to know whether an agent is safe to ship.

We can run practical AI education, map your workflow, build the first prototype, add test cases and human review, and manage the automation as it improves from reviewed outcomes.

Book time -> Email OpenNash ->

★ Glossary · quick referenceAPPENDIX

LLM: Large Language Model. A neural network trained to predict the next token; the engine inside ChatGPT, Claude, Gemini, and friends.
Token: A chunk of text the model reads - roughly ¾ of an English word. Everything is counted and billed in tokens.
Tokenizer: The piece of code that splits raw text into tokens before the model sees it.
Context window: The max tokens the model can consider at once. Its only working memory.
Prompt: The text you send into the model on a given call. Includes the system prompt, history, and any new input.
System prompt: Standing instructions at the top of every request - role, rules, goal, stopping criteria.
Autoregressive: Generates one token at a time, appending each to the input before predicting the next.
Stateless: No memory between calls. The model forgets everything the instant a response ends; the app re-sends history each turn.
Hallucination: Fluent, plausible output that is factually wrong. A failure mode of pure text prediction.
Knowledge cutoff: The date the model's training data ends. After this, it knows nothing - unless you give it a search tool.
Temperature: A knob that controls how random the model's choices are. Low = deterministic, high = creative.
Tool: A function the model can request to be called. The harness - not the model - actually runs it.
Tool call: A structured request from the model: "please run tool_name with these arguments."
Harness: The program around the model that runs the loop, executes tools, and manages context. Sometimes called the "orchestrator" or "runtime."
Agent: An LLM placed in a loop with tools that it decides how to use, ending when it returns a final answer.
Agent loop: Reason → act → observe → repeat, until the model returns a final answer or a guardrail stops it.
ReAct: "Reason + Act" - the canonical agent pattern: the model writes a thought, picks an action, reads the observation, repeats.
Workflow: A pre-written sequence of LLM calls and code. The developer decides the path, not the model.
CLI: Command-Line Interface. The text terminal where you type commands like ls or curl. Agents can be given one to use.
MCP: Model Context Protocol. A standard way to expose typed tools - "USB-C for AI." Any MCP-compatible agent can use any MCP server.
MCP server: A small program that publishes a service's tools (Slack, Drive, a database) in the MCP shape.
Context rot: Quality decline as context fills with noise - well before the hard token limit is reached.
Compaction: Replacing older verbose messages with a short summary to save context space.
Skill: A folder of instructions and resources, surfaced by a short front-matter header and loaded on demand.
Front matter: A tiny header at the top of a skill (name + description) that stays in context while the heavy body stays out until needed.
Progressive disclosure: Load metadata first; load full details only when the task requires them.
Sub-agent: A fresh agent spun up with its own clean context for a noisy sub-task; returns only its final result.
RAG: Retrieval-Augmented Generation. Fetch only the relevant chunks of a knowledge base at query time.
Embedding: A numerical vector that represents the meaning of text, used to find related chunks in a vector store.
Vector store: A database that stores embeddings and finds the nearest ones to a query - the search engine behind RAG.
Fine-tuning: Continuing to train a base model on your own data to bake in specific behaviour or knowledge.
Guardrail: A safety rule outside the model - a step limit, a cost cap, a content filter, a human-in-the-loop check.
Trace: The full recorded sequence of an agent's thoughts, tool calls, and observations. Where you learn what's actually happening.