The AI Chief of Staff Is Mostly Theatre

The frontier and the workshop

In thirty days, Peter Steinberger ran up an OpenAI bill of just over $1.3 million. Around 603 billion tokens. Roughly 7.6 million requests. About a hundred coding agents, run by a team of three, all of it building an open-source autonomous-agent project called OpenClaw. He posted the screenshot. OpenAI, who now employ him, cover the bill, and he was quick to add that most of that figure is the premium Fast Mode rate. Switch it off and the raw cost is nearer $300,000.

Still, for one month.

That is the genuine frontier of autonomous AI right now. A hundred agents, working around the clock, burning a mid-sized salary in compute every few days.
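A quick back-of-envelope check puts those numbers on a human scale. Every input below is a rounded figure from the screenshot as quoted above, so treat the outputs as approximations rather than billing data.

```python
# Rounded figures quoted above; every result is approximate.
BILL_USD = 1_300_000          # "just over $1.3 million"
TOKENS = 603_000_000_000      # "around 603 billion tokens"
REQUESTS = 7_600_000          # "roughly 7.6 million requests"
DAYS = 30
RAW_COST_USD = 300_000        # quoted estimate with Fast Mode switched off


def summarise() -> dict:
    """Derive per-day and per-unit figures from the rounded totals."""
    return {
        "usd_per_day": BILL_USD / DAYS,
        "usd_per_million_tokens": BILL_USD / (TOKENS / 1_000_000),
        "tokens_per_request": TOKENS / REQUESTS,
        "raw_usd_per_day": RAW_COST_USD / DAYS,
    }


if __name__ == "__main__":
    for name, value in summarise().items():
        print(f"{name}: {value:,.2f}")
```

On those rounded inputs the bill works out at roughly $43,000 a day, a little over $2 per million tokens blended, and around 79,000 tokens per request. At the quoted raw rate it is about $10,000 a day.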

The same week I read that, I was invited to three separate sessions promising to show me how to build an AI chief of staff in an afternoon.

Both of those things are true at once. The distance between them is the most useful thing to understand about AI agents in 2026.

What you are actually being sold

The AI chief of staff is the phrase of the season. It is in the workshop titles, the carousels, the paid cohorts. I read a lot of it, and most of it is both genuinely useful and genuinely mislabelled.

Strip the framing away and the typical session teaches three things: make a sandbox folder so the AI cannot wreck your real files, write a few short context files so it sounds like you, and paste a structured brief instead of a lazy one. That is good advice. I would teach the same. But it is not a chief of staff. It is a capable assistant that you sit and supervise.

The difference is not pedantry. A real chief of staff acts on your behalf without being asked. They triage what reaches you, start work you did not request, close loops you did not know were open. The defining trait is unsupervised action with delegated authority. Everything those workshops teach you to set up (ask before acting, show me the plan and wait, stay inside the sandbox) is a brake on exactly that. The whole safety model they sell depends on the thing not being autonomous.

So you are sold an assistant and told it is a chief of staff. The assistant is worth having. The label is doing work the product cannot cash.

The name on the costume

A lot of the people teaching this will tell you to give your AI a name. Call it Jarvis, call it Max, talk to it like a colleague you have known for years. That advice is half right. A name and a fixed persona are a useful shorthand. They nudge you to give the model a role, which sharpens what it hands back, and they give you a consistent thing to talk to instead of a blank box. I do a version of this myself.

But the name is also where the theatre creeps in. Once it has a name, it stops feeling like software and starts feeling like a someone. A someone who could, surely, just get on with things. That is the sci-fi sheen, and it is doing a lot of the selling. Strip the costume off and there is no one in there. It is a model running a series of prompts in the right order, with the right context loaded at the right moment. A sophisticated version of that, but that is what it is. Naming it your chief of staff no more gives it agency than naming your car makes it drive itself. The persona helps you work. It does not change what the thing actually is.

It is fun, though. There is that. My nine-year-old named one of my teams.

Autonomy is a setting, not a side effect

Here is the reframe that makes the rest of this make sense. Autonomy is not something that arrives automatically when the model gets smarter. It is a design decision.

This is now the consensus in the serious research. Anthropic’s own work on measuring agent autonomy, and the academic Levels of Autonomy paper that classifies agents by the role you get to play (operator, collaborator, consultant, approver, observer), both land on the same point: a highly capable model can still operate at very low autonomy if it is built to check with you before each move. Capability and autonomy are different dials. You can have a brilliant model on a very short leash, and that is often the right choice.
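A minimal sketch of the two-dials idea, assuming nothing about how any particular agent framework stores its settings. The role names come from the list above. The AgentConfig fields, the model names and the approval policy are all hypothetical, chosen only to show that the autonomy setting is independent of which model does the work.

```python
from dataclasses import dataclass
from enum import IntEnum


class UserRole(IntEnum):
    """The human's role, in the order listed above (least to most agent autonomy)."""
    OPERATOR = 1
    COLLABORATOR = 2
    CONSULTANT = 3
    APPROVER = 4
    OBSERVER = 5


@dataclass(frozen=True)
class AgentConfig:
    model: str            # capability dial: which model does the work (hypothetical names)
    user_role: UserRole   # autonomy dial: how involved the human stays


def must_ask_before_acting(config: AgentConfig, consequential: bool) -> bool:
    """Illustrative policy only, not taken from either piece of research."""
    if config.user_role is UserRole.OPERATOR:
        return True             # every step waits for the human
    if config.user_role is UserRole.OBSERVER:
        return False            # the human only watches
    return consequential        # in between, only consequential steps wait


if __name__ == "__main__":
    leashed = AgentConfig(model="strongest-available", user_role=UserRole.OPERATOR)
    print(must_ask_before_acting(leashed, consequential=False))
```

Swapping the model string changes nothing about when the agent has to ask. Only the role does.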

Once you see autonomy as a dial you set, rather than a finish line you reach, the marketing falls away and a clean ladder appears.

The three rungs

Rung 1: Supervised delegation. You brief it, you watch it work, it asks permission, it does one job, it stops. You are the runtime. This is what the workshops teach.

Rung 2: Scheduled automation. It runs on a clock or a trigger without you present, produces something, then stops. Briefings, digests, monitors. This is where the real value quietly lives, most weeks.

Rung 3: Genuine agency. Standing authority to take consequential action across loops, escalating only the exceptions. It sends the email, updates the CRM, moves the money. This is what “chief of staff” actually implies, and almost nobody is running it against real systems.

Most of the noise comes from people selling Rung 1 with a Rung 3 label. Most of the real value, most weeks, lives quietly at Rung 2.
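One way to make the ladder concrete is to ask two questions of any workflow: does it start without you, and can it act on real systems without sign-off? The sketch below encodes just that. The Workflow fields and the example names are hypothetical, and real setups will have cases that sit between rungs.

```python
from dataclasses import dataclass
from enum import IntEnum


class Rung(IntEnum):
    SUPERVISED_DELEGATION = 1
    SCHEDULED_AUTOMATION = 2
    GENUINE_AGENCY = 3


@dataclass(frozen=True)
class Workflow:
    name: str
    runs_without_you: bool        # started by a clock or trigger, not a pasted brief
    acts_on_real_systems: bool    # sends, updates or pays without your sign-off


def classify(workflow: Workflow) -> Rung:
    """Place a workflow on the three-rung ladder described above."""
    if workflow.runs_without_you and workflow.acts_on_real_systems:
        return Rung.GENUINE_AGENCY
    if workflow.runs_without_you:
        return Rung.SCHEDULED_AUTOMATION
    return Rung.SUPERVISED_DELEGATION


if __name__ == "__main__":
    for wf in (
        Workflow("workshop sandbox session", False, False),
        Workflow("morning briefing", True, False),
        Workflow("auto-send follow-up emails", True, True),
    ):
        print(f"{wf.name}: Rung {classify(wf).value}")
```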

I will declare my own position plainly, because I have written about it before. I run a system I call AFCS, an Automatic Flight Control System for my businesses. The analogy I keep returning to is autopilot. A real autopilot does not fly the plane. It holds the course you set and frees you to think about where you are going. AFCS sits, deliberately, at Rung 2. Autopilot, not autonomy. Not because it could not do more, but because I have chosen where to set the dial.

Why almost nothing bridges Rung 1 to Rung 3

The jump from “watch it work” to “let it act” needs three things the workshops skip, because they are the hard, frightening ninety per cent of the work.

First, triggers. Something other than you pasting a brief: a schedule, an inbound email, a webhook. Without a trigger there is no autonomy, just a fast typist waiting for you.

Second, write access to real systems. Not files in a sandbox, but the actual send button, the CRM record, the calendar invite, the payment. The moment an agent can act in the world is the moment the stakes become real.

Third, a trust layer. This is the one nobody wants to build. Evaluations that catch bad output before it ships. Guardrails. Rollback. An audit log. Rules for when to escalate to a human instead of pressing on. This is the difference between a demo and a system you would let near a client.
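To give a feel for the shape of that third piece, here is a minimal sketch of a trust layer, assuming a made-up Action type and a single spend-cap check. It covers only the guardrail, the audit log and the escalation rule. Evaluations and rollback are left out, and a real system would need both.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Callable, Optional


@dataclass
class Action:
    kind: str                 # hypothetical labels, e.g. "send_email" or "book_flight"
    payload: dict
    amount_usd: float = 0.0


# A check returns a reason to stop, or None if it has no objection.
Check = Callable[[Action], Optional[str]]


@dataclass
class TrustLayer:
    checks: list
    audit_log: list = field(default_factory=list)

    def review(self, action: Action) -> str:
        """Run every check, record the outcome, return 'proceed' or 'escalate'."""
        reasons = []
        for check in self.checks:
            reason = check(action)
            if reason is not None:
                reasons.append(reason)
        decision = "escalate" if reasons else "proceed"
        self.audit_log.append({
            "at": datetime.now(timezone.utc).isoformat(),
            "kind": action.kind,
            "decision": decision,
            "reasons": reasons,
        })
        return decision


def spend_cap(limit_usd: float) -> Check:
    """Guardrail: anything above the cap goes to a human instead of pressing on."""
    def check(action: Action) -> Optional[str]:
        if action.amount_usd > limit_usd:
            return f"amount {action.amount_usd:.2f} exceeds cap {limit_usd:.2f}"
        return None
    return check
```

The design choice worth noticing is that the agent never decides for itself whether it is allowed to act. It proposes an Action, and the layer, which you control, answers.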

The reliability research is blunt about why this matters. Today’s agents still make confident mistakes, and in high-stakes situations every error costs trust. The example doing the rounds is the agent that books a $5,000 business-class seat because it read “find me a cheap flight” too literally. That is not embarrassing; it is expensive. The wall between Rung 1 and Rung 3 is not intelligence. The models are good enough. The wall is trust, and trust is infrastructure.

The people genuinely pushing the edge, and what it costs

It would be easy, and wrong, to wave all of this away as hype. Some people are doing real, serious work, and they deserve to be separated from the workshop crowd.

Steinberger and the OpenClaw project are the obvious example. OpenClaw is an open-source, local-first agent framework that wires a model up to your shell, your file system, a browser, Docker and your messaging apps, with persistent memory and sandboxed skills. People are running it as round-the-clock trading agents on Mac Minis. That is genuine Rung 3 work, and the $1.3 million token bill is the honest receipt for what running a hundred autonomous agents actually costs in mid-2026.

Nous Research’s Hermes Agent is the other one worth watching. It is MIT-licensed, it self-hosts, it keeps persistent memory, it writes its own skills as it works and recalls them across sessions, it talks to you through a messaging gateway, and it schedules its own unattended briefings in plain English. The thesis underneath it is the interesting part: that a single determined individual, with open-source tools and any decent model API, can now stand up an agent that rivals the commercial offerings. That is new, and it is real.

But notice what the people actually doing this have in common. They are technical. They run their own always-on hardware. They are either absorbing serious compute costs or, in Steinberger’s case, being handed them by an employer. The frontier is open to you. It is just not free, and it is not an afternoon’s install.

What I actually run, concretely

Since the honest move is to show the work, here is what sits behind my own Rung 2 setup, rather than a vague gesture at “my system”.

The substrate is plain markdown in a git repository. On top of it: a CLAUDE.md hierarchy that loads on every session, and a RESOLVER file that routes any request to the right skill or task. A two-tier memory and an entity system, with a file for every person, company and deal that matters. Around three dozen skills, each a self-contained, version-controlled capability rather than a saved prompt.
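The routing idea is simple enough to sketch in a few lines. To be clear about the assumptions: the RESOLVER described above is a file the model reads, and nothing here says how it is written. The keyword table, skill paths and fallback below are hypothetical stand-ins that show only the principle of one request going to one named skill.

```python
# Hypothetical routing table: keyword -> skill directory in the repo.
ROUTES = {
    "briefing": "skills/morning-briefing",
    "transcript": "skills/meeting-triage",
    "email": "skills/email-triage",
}
DEFAULT_SKILL = "skills/general"


def resolve(request: str) -> str:
    """Return the first skill whose keyword appears as a word in the request."""
    words = request.lower().split()
    for keyword, skill in ROUTES.items():
        if keyword in words:
            return skill
    return DEFAULT_SKILL


if __name__ == "__main__":
    print(resolve("Draft the sales briefing for Monday"))
```

Because the table lives in the repository alongside the skills, a change to the routing is a reviewable commit rather than a tweak to a saved prompt.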

The moving parts run unattended on a small DigitalOcean droplet authenticated to Claude Code, so the always-on chains do not depend on my laptop being awake. They produce role-shaped morning briefings for sales, content and operations, triage email and meeting transcripts three times a day, compile a wiki from raw notes each night, run a signal router, and file a nightly digest of outcomes back into the entity system so the next loop is smarter. An evaluation harness scores the quality of what it writes against hard rules. A single portability shim means I can point the whole thing at a different model by changing one file.
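Scoring against hard rules can be as plain as a list of pass/fail checks. The sketch below is an illustration of that idea, not the harness described above. The three rules and the 200-word limit are invented for the example.

```python
import re
from typing import Callable, List, Tuple

Rule = Tuple[str, Callable[[str], bool]]

# Hypothetical hard rules: each maps a draft to pass (True) or fail (False).
HARD_RULES: List[Rule] = [
    ("not empty", lambda text: bool(text.strip())),
    ("under 200 words", lambda text: len(text.split()) <= 200),
    ("no placeholder text", lambda text: re.search(r"\[(TODO|TBC)\]", text) is None),
]


def score(text: str, rules: List[Rule] = HARD_RULES) -> Tuple[float, List[str]]:
    """Return the fraction of rules passed and the names of the rules that failed."""
    failed = [name for name, passes in rules if not passes(text)]
    return 1 - len(failed) / len(rules), failed


if __name__ == "__main__":
    print(score("Hi Sam, the proposal is attached. [TODO]"))
```

Hard rules will not tell you whether a draft is good, only whether it is disqualified, which is usually the cheaper question to answer first.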

And then the guardrails, which are the actual point. Drafts only. No auto-send. Nothing leaves the building without me clicking go. Files are never deleted, only archived. Those rules are what hold it at Rung 2 on purpose. The capability to push further exists. I have chosen, for now, not to, because I have not yet built a trust layer I would stake a client or personal relationship on. That is the decision the workshops never mention, because it is the decision that is actually hard.
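Two of those rules translate almost directly into code. The sketch below assumes a file-based setup and hypothetical folder names. The point is what is missing: there is no delete function and no send function for the agent to call.

```python
import shutil
from pathlib import Path


def archive(path: Path, archive_dir: Path) -> Path:
    """Move a file into the archive instead of deleting it."""
    archive_dir.mkdir(parents=True, exist_ok=True)
    target = archive_dir / path.name
    counter = 1
    while target.exists():  # never overwrite an earlier archived copy
        target = archive_dir / f"{path.stem}.{counter}{path.suffix}"
        counter += 1
    shutil.move(str(path), str(target))
    return target


def save_draft(outbox: Path, recipient: str, body: str) -> Path:
    """Write an email as a numbered draft file for a human to review and send."""
    outbox.mkdir(parents=True, exist_ok=True)
    existing = len(list(outbox.glob("draft-*.txt")))
    draft = outbox / f"draft-{existing + 1}.txt"
    draft.write_text(f"To: {recipient}\n\n{body}\n", encoding="utf-8")
    return draft
```

Leaving a capability out entirely is a stronger guardrail than including it and asking the model not to use it.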

How far you can push it yourself

If you are the kind of person who reads to the end of an article like this, you can almost certainly go further than the workshop on your own. Here is the honest map.

Rung 1 is an afternoon. A sandbox folder, a couple of context files, a good brief, and a tool like Claude Code or Cowork. You will get real work back today.

Rung 2 is a few weekends, if you are comfortable with a scheduler and a git repo and you can run a cheap always-on machine, a Mac Mini on the desk or a five-dollar droplet. If you are technical, Hermes or OpenClaw will carry you most of the way. The win is briefings, digests and monitors that appear without you asking.

Rung 3, against systems that can spend your money or email your clients, is where I would tell you to slow down. Not because you cannot, but because the failure mode is the $5,000 flight, and the trust layer that prevents it is genuinely hard to build and harder to trust. The honest cap for most people, including me, is this: it drafts and it briefs, you stay in the loop, and you hand it one more workflow only once it has earned the last one. Expand one workflow at a time, not one install at a time.
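“Earned” can be given a working definition. A minimal sketch, assuming you review each run and note whether it needed fixing: the thresholds of 20 reviewed runs and a 5 per cent fix rate are arbitrary placeholders, and the right numbers depend on what the workflow can break.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class WorkflowRecord:
    name: str
    reviewed_runs: int = 0
    runs_needing_fixes: int = 0

    def has_earned_trust(self, min_runs: int = 20, max_fix_rate: float = 0.05) -> bool:
        """True once enough runs have been reviewed and few of them needed fixing."""
        if self.reviewed_runs < min_runs:
            return False
        return self.runs_needing_fixes / self.reviewed_runs <= max_fix_rate


def may_add_workflow(current: List[WorkflowRecord]) -> bool:
    """Hand over one more workflow only when every existing one has earned it."""
    return all(record.has_earned_trust() for record in current)


if __name__ == "__main__":
    records = [
        WorkflowRecord("morning briefing", reviewed_runs=30, runs_needing_fixes=1),
        WorkflowRecord("email triage", reviewed_runs=10, runs_needing_fixes=0),
    ]
    print(may_add_workflow(records))
```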

The real question

The category is real. AI that does meaningful operational work for you, every day, with less of your time, is not hype. The label is just running ahead of the substance, and “AI chief of staff” sets an expectation of autonomy that almost nobody is delivering or, frankly, should be delivering without solid technical chops.

So the useful question is not “how do I get an AI chief of staff?” It is: what is the smallest piece of my week I can hand to a system I trust, and how do I earn the right to hand it the next piece?

Build the substrate first. Set your autonomy dial deliberately. Move at the speed that keeps your trust intact, which is usually slower than the demo and far more durable. That, rather than the afternoon shortcut, is what actually changes how your business runs.

This is most of what we work through inside the Leanpreneur Community, and in the Saturday letter I write for operators building leveraged businesses.