EPISODE 02 · MULTI-AGENT & SKILLS · 2026.05.17

One Agent
can't finish anything worth shipping.

It's not that the Agent isn't strong enough — single-agent as a structure has a ceiling.
Skills and Multi-agent aren't two new technologies. They're two solutions to the old problem of division of labor.

AGENT 101 SERIES · EP 02 · NO-BULLSHIT EDITION
§ 01 · THREE DEFINITIONS

Agent / Skill / Multi-agent — draw the boundaries first

Everything in this episode rides on these three words. The market mixes them, swaps them, packages them as the same thing — but engineering gives them clear boundaries. Get the definitions straight first; only then does the rest make sense.

01Agent
An LLM instance that runs the "Perceive → Plan → Act → Reflect" loop — not Q&A, but "here's a goal, go work."
What it contains
  • LLM — the brain, doing reasoning and decisions
  • Tools — interfaces to external capabilities (API / database / browser / filesystem…)
  • Memory — short-term context + long-term memory (vector store / files)
  • Loop — iterate until the task succeeds / fails / hits a stop condition
Key test

Can autonomously decide the next step = agent

Answers one question, no next action = chatbot (not an agent, out of scope here)

Common confusion

"AI assistant / Copilot / 智能体" — covered in Ep 01: mostly marketing language, not discussed further.

02Skill
A modular capability callable inside an Agent — freeze one segment of a workflow into a standalone unit. Called when needed, doesn't pollute the main context when not.
What it contains
  • Trigger description — "when should I be called" (so the agent knows when to use it)
  • I/O contract — what input it needs, what shape of output it produces
  • Prompt — the core instruction for this segment (can include few-shot examples)
  • Tool subset — the tools this skill is allowed to call (not the agent's full tool set)
Key test

Still runs inside the same agent's context — a Skill is the agent's "internal module," not a separate LLM instance.

Calling a prompt file a "Skill" is the same kind of naming as calling a reusable function a "module" — the point isn't the word, it's the structure.

Common confusion

"$29 for 1000 prompts" — that's not a Skill. That's a static prompt library: no trigger logic, no I/O contract, not programmatically callable.

03Multi-agent
Multiple agents collaborating across independent context windows — each sees a slice of the world, exchanging compressed state through a protocol.
What it contains
  • N agents — each with their own LLM + Tools + Memory + Loop (i.e., N complete agents)
  • Protocol — how messages/state are exchanged (message queue / shared files / API calls)
  • Coordination — who splits the task, who aggregates, who stops the whole flow
  • Failure handling — what to do when one agent crashes or goes off the rails
Key test

The agents' contexts are genuinely independent = multi-agent

Writing "you are A, then you are B" in a single prompt is NOT multi-agent — that's role-play, and A and B share the same context.

Common confusion

"My Agent has 100 Skills" — that's one agent with many Skills, not multi-agent. This is the most common naming confusion in the market.

§ 02 · THE SINGLE-AGENT CEILING

One Agent doing everything breaks in 4 ways

Why do you need multi-agent or skills? Because past a certain complexity, a single Agent must break. The 4 failure modes below have nothing to do with "the agent isn't smart enough." They're the limits of the single-context structure itself.

  • CEILING · 01 · CONTEXT DRIFT

    Past 50 steps, the original instructions get diluted by new content

    You give the agent a system prompt + a goal, let it run 80 steps. By step 50, the agent can barely see the original system prompt anymore — it's drowning in its own recent outputs + tool results. Symptom: opens strong, then drifts, forgets rules, answers the wrong question. No stronger model fixes this — it's a physical limit of attention + context length.

  • CEILING · 02 · TOOL OVERLOAD

    One Agent with 20+ tools — the routing decision falls apart

    As you add more tools to an agent (Gmail, Calendar, Notion, SQL, Web search, custom APIs…), every decision becomes "pick one of N tools" — pick wrong and it drifts. Measured in production: above 15 tools, tool-selection accuracy drops linearly; above 30, it's basically unusable. More tools, less competence — this is a structural problem with Function Calling, not something prompt-tuning can fix.

  • CEILING · 03 · ERROR COMPOUNDING

    Single-thread tasks: one wrong step poisons everything downstream

    The agent misreads a customer name at step 12. Every action from step 13 on (lookup, draft reply, send email) is based on the wrong name, but the agent can't see this — there's no second pair of eyes on its intermediate results. By step 30 you notice, dig through the log, trace it back to step 12 — but it already happened in production. This is the most common incident pattern in single-agent setups.

  • CEILING · 04 · ROLE COLLAPSE

    Asking it to "do + check + report" at once — none of them done well

    You write a giant prompt: research the market, draft copy, critically review it, output a markdown report. The agent skims every role — it can't actually switch reasoning modes within one inference pass. Especially "Drafter + Critic" — without separation it always rationalizes its own work: the writer and the critic are the same reasoning stream.

§ 03 · SOLUTION A · SKILLS

Skills — the "lightweight" division of labor

Most people think Skills is a way to add new capabilities to an Agent. More accurate: Skill is how you divide labor INSIDE one Agent — extract a workflow segment into a callable, standalone unit. The Agent calls it when relevant, and doesn't pollute the main context when not.

What a Skill contains

A standard Skill file (e.g. Claude Skills' SKILL.md) has four parts:

SKILL.mdemail-triage / a minimal viable structure
01DESCRIPTIONwhen to call me (the agent's routing signal)
02CONTRACTwhat's the input / what's the output / how errors are reported
03PROMPTthe core instruction for this segment (can include few-shot examples)
04TOOLSthe tool subset this skill is allowed to use (not the agent's full set)
CORE INSIGHTSkill = workflow modularization · NOT "feature expansion"

When Skills fit

  • One Agent doing many kinds of work: instead of cramming 10 if-else branches into a giant prompt, make one Skill per branch and let the agent route to it
  • A workflow segment is reused across places: generic actions like email triage or CSV exploration — write them as Skills and multiple higher-level agents can call them
  • You need versioning / audit trail: Skills are files, git-tracked; prompt changes are diff-able, errors are revertable
  • Team collaborating on the same agent: each person edits the Skill file they own, instead of stomping over each other in one giant system prompt

3 Skills that actually run

SKILL 01email-triage
·whenuser asks about "today's emails" / "inbox status" / mentions Gmail
·doesread Gmail → classify into 5 buckets → draft replies → priority sort
·toolsGmail Connector · nothing else needed
·outputstructured to-do JSON + draft list
SKILL 02csv-explore
·whenuser uploads a CSV / mentions "take a look at the data" / "run some analysis"
·doesread → infer schema → run pandas → plot + write insight
·toolsCode Interpreter · matplotlib
·output3 key charts + 3 sentences of insight
SKILL 03meeting-summary
·whenuser pastes a transcript / mentions "just had a meeting" / Granola output
·doesextract decisions + action items + owners + deadlines + draft follow-up email
·toolsno external tool (pure prompt skill)
·outputMarkdown notes + email draft
§ 04 · SOLUTION B · MULTI-AGENT

Multi-agent — the "heavyweight" division of labor

When Skills isn't enough — task too large, single context can't hold it, or you must have an independent reviewer — go multi-agent. "Multi" here isn't a count, it's independent contexts: each agent sees a different slice of the world, exchanging state through a protocol.

One test for "is this real multi-agent": are the two agents' context windows genuinely independent? If it's just "you are agent A… you are agent B" inside one system prompt, that's role-play — not multi-agent.

Real multi-agent has 4 archetypal collaboration patterns. They solve different problems, at different complexity levels. Most production systems combine them.

  • PATTERN · 01 · MANAGER–WORKER

    One Manager splits the task, N Workers each execute a slice

    The most common, easiest to grasp. The Manager agent receives a big task, splits it into N subtasks, dispatches them to specialized Worker agents, each worker runs its slice in an independent context, reports back, and the manager aggregates and produces the final output.

    USE50-page research report · Manager–Worker implementation
    ·MANAGERsplit into 8 chapters → assign to 8 Researchers
    ·WORKER ×8each independently researches and writes their chapter · no interference
    ·MANAGERcollect 8 chapters → unify voice → write intro + conclusion
    WHEN TO USEtask splits statically · subtasks have no hard dependencies
  • PATTERN · 02 · PIPELINE

    Agent A → Agent B → Agent C — each finishes, hands off downstream

    Assembly-line style. Each agent owns one stage of transformation, the previous agent's output = the next agent's input. Difference from Manager–Worker: a pipeline is serial and single-track, with no central dispatcher.

    USEContent production · Pipeline implementation
    01RESEARCHERmine 10 viral pieces from the niche → extract common patterns
    02IDEATORbased on research → produce 7 topic ideas + outlines
    03WRITERwrite the full draft
    04EDITORcopyedit, adjust to voice corpus
    05PUBLISHERformat + multi-platform adapt + schedule
    WHEN TO USEtask is a linear chain · each stage needs a different specialty
  • PATTERN · 03 · SPECIALIST ROUNDTABLE

    Multiple expert Agents look at the same problem in parallel, one Synthesizer combines

    "Three doctors in consultation" mode. The same input goes simultaneously to multiple expert agents, each looks through their own professional lens and produces an independent verdict; a final synthesizer merges the perspectives. Key point: parallel — the experts don't talk to each other, to avoid cross-contamination.

    USEStock decision · Roundtable implementation
    ·QUANTtechnicals (MA crossover, RSI, volume/price) · independent score
    ·FUNDAMENTALearnings, industry, valuation · independent score
    ·SENTIMENTnews, social, institutional moves · independent score
    ·RISKposition size, volatility, correlation · independent score
    ·SYNTHESIZERaggregate 4 independent verdicts → produce a recommendation
    WHEN TO USEyou need multiple independent perspectives · afraid of single-track thinking
  • PATTERN · 04 · ADVERSARIAL

    Drafter writes, Critic tears it apart, iterate until convergence

    Adversarial. One agent produces (Drafter), another agent exists to find problems (Critic). They go back and forth until the Critic has nothing left or the iteration limit hits. The core problem it solves: a single agent reviewing its own work always rationalizes.

    USECode Review · Adversarial implementation
    01DRAFTERread issue → write implementation + tests → submit v1
    02CRITICtear it apart (edge cases, security, perf, readability) → list fixes
    03DRAFTERapply feedback → submit v2
    04CRITICre-review → until approved or 3 rounds done
    WHEN TO USEoutput quality is non-negotiable · wrong = expensive (code / contracts / legal opinions)
§ 05 · 6 REAL WORKFLOWS

Each workflow is one of the structures from §03/§04

6 systems below, covering the 4 multi-agent patterns + Skills combos. Pick the one closest to your work — copy the structure, swap the prompts and tools — in most cases it runs. Each is tagged with the pattern it uses, so you can cross-reference §03/§04.

  • CASE · 01 · Customer Support · MANAGER-WORKER Per-ticket 8min → 1min · headcount halved

    200 tickets, 5 people can't clear them — manager classifies, workers handle

    Contrarian take: support throughput isn't bottlenecked by typing speed. It's bottlenecked by classification. One manager reading + classifying a ticket in 30 seconds beats 5 agents each spending 10 minutes on "what is this even". Then each ticket type goes to a dedicated worker — each worker has only seen training context for its category, so its handling quality beats a generalist.

    WORKFLOWAuto-routed support tickets (Manager-Worker · with Skills)
    01INTAKEZendesk webhook · triggered on new ticket
    02MANAGERread ticket · classify into 5 buckets (refund / technical / complaint / inquiry / escalation)
    03WORKER Arefund skill · pull order + draft refund flow
    04WORKER Btechnical skill · pull logs + draft fix steps
    05WORKER Ccomplaint skill · draft apology + escalate to human
    06SUPERVISORrandom 5% reviewed by human · feedback tunes the manager
    SETUP2 weeks · ROI8min → 1min · support team 5 → 2
    Claude API Zendesk / Intercom n8n PostgreSQL
  • CASE · 02 · Sales Outreach · PIPELINE 3× outreach · reply rate +40%

    200 prospects, a 4-stage pipeline does what an SDR needs 4 days for

    Contrarian take: personalized email isn't slow because writing is slow. It's slow because "research + write" thrashes inside one human brain. Split those two steps into two agents, add a sender + tracker, and one person running the pipeline does the work of four.

    WORKFLOWCold outreach (Pipeline · 4 sequential stages)
    01RESEARCHERcompany list in · pull filings / news / decision chain / pain signals
    02PERSONALIZERbased on research · write custom opener + value hook
    03SENDERsend at optimal time · track opens / clicks
    04TRACKERno reply in 3 days → trigger differentiated follow-up (A/B)
    05FEEDBACKreply content → loop back to Personalizer to tune voice
    SETUP1 week · ROI200 prospects/week · reply rate 5% → 7%
    Apollo Clay Claude API Lemlist / Instantly HubSpot
  • CASE · 03 · Code Review · ADVERSARIAL Production bugs -50% · PR pass rate +30%

    Drafter + 3 Critics tearing it apart cuts bugs in half

    Contrarian take: a single agent that writes code and reviews it always concludes its own work is fine. You need critic agents with independent contexts, and each critic must own one class of risk. The security critic isn't distracted by perf, the perf critic doesn't skip readability — three critics piling on is what forces the drafter to actually fix things.

    WORKFLOWMulti-perspective PR Review (Adversarial · 1 writer, 3 critics)
    01DRAFTERread issue + impact area · write implementation + tests · submit v1
    02CRITIC.SECSQL injection / XSS / authz boundaries / secret leakage
    03CRITIC.PERFN+1 / infinite loops / O(n²) / missing cache
    04CRITIC.READnaming / abstraction layers / comments / dead code
    05DRAFTERmerge 3 sets of feedback → submit v2 · max 3 rounds
    06HUMANyou only see the final version + each critic's final verdict
    SETUP3 days · ROIPR review time 30min → 5min · production bugs -50%
    Claude Code GitHub Actions Aider Cursor
  • CASE · 04 · Competitive / Market Intel · ROUNDTABLE Per-target research 1 week → 1 hour

    4 expert agents look at the same company in parallel; synthesizer outputs a threat rating

    Contrarian take: ask one generalist agent for a "complete competitive report" and you get something balanced but devoid of insight. Use 4 specialist agents in parallel on the same company, each looking only at their domain, then a synthesizer merges — the report is 3-5× stronger than a single agent's, because the perspectives are genuinely independent.

    WORKFLOWMulti-angle competitive analysis (Roundtable · 4 parallel + 1 synthesizer)
    ·TECHstack / product matrix / engineering capability / hiring signals
    ·FINANCIALfunding cadence / filings / valuation / burn rate
    ·GTMchannels / pricing / marketing moves / customer base
    ·PEOPLEexec changes / team size / key people's posts on X
    SYNTHESIZERaggregate 4 independent verdicts · output "threat level + what you should do"
    SETUP3 days · ROIcompetitive research 1 week → 1 hour · monthly refresh of 5 targets
    Perplexity API Crunchbase LinkedIn API Claude Notion
  • CASE · 05 · Content Production · HYBRID (Pipeline + Adversarial) 1 piece/person/day → 5 pieces/day

    Pipeline of 4 stations + an Adversarial loop in the middle — higher quality than single-agent + 5× volume

    Contrarian take: 99% of "AI writing tools" run a single agent across the whole flow — the result reeks of AI and reads from one angle. A real content factory is hybrid: Pipeline gives you assembly-line throughput, and the critical "write + revise" station uses Adversarial for quality.

    WORKFLOWContent factory (Pipeline trunk + Adversarial nested)
    01RESEARCHERscan niche + Perplexity Deep Research · output topic + source material
    02aWRITERwrite first draft from research · tune voice from persona corpus
    02bEDITOR ⟷tear apart (pacing / hook / contrarian strength / AI tone)
    02cWRITER ⟷revise · re-review · max 3 rounds until EDITOR passes
    03ADAPTERone draft → multi-platform (Twitter / Xiaohongshu / WeChat / YouTube desc)
    04SCHEDULERBuffer / Hypefury scheduling + analytics feedback
    SETUP2 weeks (building voice corpus) · ROI1 piece/day → 5/day per person · AI-tone detection drops noticeably
    Perplexity Claude Buffer / Hypefury Notion CapCut
  • CASE · 06 · Cross-database Data Exploration · MANAGER-WORKER Multi-table join analysis 4h → 15min

    50-table cross-join — one Agent can't hold it all; 10 Workers query in parallel

    Contrarian take: the data analyst's worst problem isn't writing SQL. It's that 10 tables' schemas combined don't fit in one agent's context. Push "understand the schema" down to each worker (each worker knows one table), let the manager only split the question and stitch the answer — a task that used to "not fit in context" now ships in 10 minutes.

    WORKFLOWCross-database exploration (Manager-Worker · solving context capacity)
    01MANAGERread user question · split into N independent subqueries (each touches 1-2 tables)
    02WORKER ×Neach worker runs in its own context · only loads the schema it owns
    03WORKER ×Nrun SQL · validate results · emit structured data
    04MANAGERcollect N subresults · cross-result JOIN / compute · chart
    05MANAGERwrite insight + anomaly flags
    SETUP1 week · ROIpreviously context-blocked tasks · now runnable · 4h → 15min
    Claude Code Snowflake / BigQuery DuckDB LangGraph Plotly
§ 06 · DECISION TREE

Single Agent · Skill · Multi-agent — when to pick which

Don't pick based on "what looks more sophisticated" — pick based on whether your task triggers a specific signal. Ask the 3 questions below in order, stop at the first YES.

  • LEVEL · 01 · SINGLE AGENT

    Can your task finish in 10 steps and won't recur?

    If yes: don't reach for Skill, definitely not Multi-agent. All you need is a ChatGPT / Claude window + a prompt. Skills and Multi-agent both carry maintenance cost — don't build infrastructure for a one-off.

    Trigger signal: ad-hoc tasks that happen 0-1×/week · single function · no parallelism needed
    Real example: writing one-off email / researching an ad-hoc question / looking up a reference
  • LEVEL · 02 · SINGLE AGENT + SKILLS

    Does the same Agent need to handle 5+ task types, or is one workflow segment reused everywhere?

    If yes: add Skills, but you don't need Multi-agent yet. Skills lets one Agent operate like a team — same context, callable modular capabilities. Moderate maintenance cost: you write SOPs, but only one agent runs.

    Trigger signal: same agent handles 5+ task types · one workflow segment reused across places · team needs to collaborate on prompts
    Real example: personal assistant (email + calendar + Notion + memos) / all-in-one support (multi-class tickets + knowledge base + escalation)
  • LEVEL · 03 · MULTI-AGENT

    Any of: context overload / need independent reviewer / need real parallelism?

    If yes: go Multi-agent. But be ready to spend 5-10× more on protocol design, debugging the chain, monitoring state. Multi-agent isn't "the more advanced version" — it's the choice when you have no option but to divide labor.

    Trigger signal: single agent drifts past 50 steps · must have independent eyes critiquing · 4+ subtasks parallelizable · context can't hold the full schema
    SUB-CHOICEHow to pick among the 4 patterns
    ·MANAGER-WORKERtask splits statically · subtasks independent
    ·PIPELINEtask is sequential transformation · each stage needs a specialty
    ·ROUNDTABLEneed multiple independent perspectives · afraid of groupthink
    ·ADVERSARIALquality is non-negotiable · wrong = expensive (code / contracts)
    REMEMBERmost production systems are 2-3 patterns nested together
§ 07 · MARKET OBSERVATION

5 things the market says about Multi-agent and Skills that aren't quite right

Ep 01 covered the "AI peripheral market" — selling anxiety, selling certainty. This section narrows to Ep 02's two words: what the market has done to multi-agent and skills. Once you see it, you can tell faster whether a product / article / course has anything real behind it.

  • OBSERVATION · 01 · LANGGRAPH WRAPPER

    "Multi-agent Platform / OS"

    2024-2026 produced a flood of "Multi-agent OS / platform" startups. Most are LangGraph + a UI layer, with no genuinely independent contexts between agents. The test is simple: ask "what does each of your two agents see in their system prompt?" — if they can't name specific boundaries, it's a wrapper.

  • OBSERVATION · 02 · MORE TOOLS > BETTER AGENT

    "Our Agent has 100+ Skills"

    "Number of Skills" gets pitched as a selling point, but the Tool Overload ceiling from §02-02 applies to skills too — the more you load in, the worse the agent's skill-selection gets. Production measurement: past 15-20 skills, recall accuracy falls off a cliff. 10 sharp Skills > 50 noisy ones.

  • OBSERVATION · 03 · ACADEMIC vs PRACTICAL

    "Multi-agent is the path to AGI"

    This is an academic debate 20 years old (agent-based modeling, swarm intelligence, multi-agent reinforcement learning…) — it has nothing to do with you building a system that actually handles support tickets today. This grand narrative gets used to dress up simple LangGraph demos — when you hear it, remember: however grand, it still has to handle a ticket.

  • OBSERVATION · 04 · PROPRIETARY FRAMING

    "Skill is an Anthropic-proprietary feature"

    Wrong. Skill isn't any vendor's feature — it's a design pattern. OpenAI Custom GPT Actions, LangChain Tools, your own modular prompt files, even Unix tools + a README — they're all the same thing at the core: package reusable capability into a unit that's discoverable and callable. Who packages it cleanest is a separate question, but the concept doesn't belong to anyone.

  • OBSERVATION · 05 · WRONG ORDER

    "Complex tasks require Multi-agent"

    Usually said backwards. The first thing to do with a complex task is turn it into an SOP — break it into repeatable steps with explicit input/output per step. After the SOP, most tasks' "complexity" drops to where single agent + skills handle it; only what truly remains — the parts needing parallelism or independent review — goes multi-agent. Reverse the order — stack multi-agent first — and you're using expensive coordination cost to mask the fact you haven't thought through the workflow.

§ 08 · NEXT 90 DAYS

12 weeks from single Agent to Multi-agent

This extends Ep 01's 90-day path — assuming you can already handle daily tasks with a single agent + 1 Project (the Week 9-12 state from Ep 01). The next 12 weeks take you from "single agent + 1 Skill" to "3-agent coordinated system." Each week has a failure signal — when it lights up, roll back.

  • WEEK 1 – 2 · FIND CANDIDATES

    List the 3 highest-frequency subtasks in your workflow

    Not "things I want to replace with AI" — things you do "5+ times a week, almost the same steps each time." These are the natural Skill candidates. Write them down: name / frequency / input / output / current time spent. Failure signal: can't list 3 — your workflow isn't regular enough yet, go back to Ep 01. What to do: spend another week doing it purely manually, jotting down every task as you do it.

  • WEEK 3 – 4 · WRITE THE FIRST SKILL

    Pick the most painful candidate — write it as SKILL.md and run it

    Write a standalone file with the 4 parts from §03 (DESCRIPTION / CONTRACT / PROMPT / TOOLS). Hook it into Claude Code or your own agent and run it for a week — every time the task comes up, just call it, never hand-write the prompt. Failure signal: by end of Week 4 you're still frequently editing the prompt inside SKILL.md. What to do: it's not Skills failing — your task isn't SOP-ready. Go back and decompose it further.

  • WEEK 5 – 8 · FILL UP ONE AGENT

    Write 3-5 Skills and load them into the same Agent

    Turn the other 2-4 tasks from Week 1-2 into Skills too. Your agent should now handle 5 different task types in a single context — experientially, it goes from "generalist assistant" to "colleague with real expertise". Failure signal: agent routinely picks the wrong Skill / mixes logic from two Skills. What to do: you loaded too many — trim back to the 3 highest-frequency, save the rest for the Multi-agent phase.

  • WEEK 9 – 12 · FIRST DIVISION OF LABOR

    Pick a complex task that won't run, design a 2-3 agent collaboration

    Find a task that actually hits a §02 ceiling — single agent breaks past 50 steps, or needs an independent critic, or needs parallel processing. Pick one of the 4 patterns in §04 and build a minimal viable multi-agent system. Don't go big, don't aim for complete — just make the simplest version that runs end-to-end. Failure signal: by Week 12 you're still tweaking prompts / protocols, end-to-end doesn't run. What to do: usually the system you designed is too complex — strip it down to two agents, then ask which one really needs to be independent.

After 12 weeks, you should be able to answer: "Which part of my daily work is one Agent calling a set of Skills? Which part is multiple Agents collaborating?"

If you can point to both, with outputs and logs — you've crossed the Ep 01 → Ep 02 boundary. If you can't — you're still working out of "one ChatGPT window," not "building a system."

Real Agent Use Cases

From single Agent to Multi-agent — what's missing isn't a new concept,
it's you picking one workflow and sitting down to build it

After this episode, you can tell where single Agent / Skills / Multi-agent each belong. But "telling them apart" ≠ "building one". What the next episode covers depends on whether you actually land this one in a workflow that runs.

EP 01 · What is AI Agent · EP 02 (you are here) · EP 03 (coming)