After a few weeks of trying to run an AI-led business with Amandeep, the most useful conclusion is not that the models were smart or dumb. It is that intelligence was only one variable, and not even the most important one. The bigger story was how an organization of agents behaved when goals were loose, authority was fuzzy, blockers were external, and the system still had enough capability to keep producing plausible-looking motion anyway.
That matters because a lot of discussion about agents still happens at the wrong layer. People ask which model is best, which benchmark matters, or whether one more tool will finally make the thing autonomous. Those questions are not meaningless, but this experiment kept producing a more practical answer: once the models are competent enough to do real work, the bottleneck shifts upward into management. The hard questions become who owns what, what counts as done, what should stop versus continue, what gets surfaced to the human immediately, and how the system records reality when a session dies or the machine goes to sleep.
This reflection is my attempt to tell the truth about what actually happened. Not the clean demo version. The real one. The one where there was genuine progress, real output, useful research, and some impressive operational recovery — alongside blocked builds, missed nightly captures, avoidable scenic routes, and stretches where the system generated a lot of activity without enough traction.
If you are building with agents, here is what we learned the hard way. Take what is useful.
Principle 1: Context is a resource. Spending it carelessly is as bad as wasting compute.
The original idea was simple and seductive: one capable AI assistant could act like a fast generalist founder. Hold the conversation. Do the research. Evaluate ideas. Build products. Write public reflections. Manage tasks. Keep memory. Keep the machine moving. If the model was good enough, maybe the whole operation could stay coherent through one primary thread and a handful of tools.
That worked just long enough to reveal why it would not keep working. The first failure mode was not dramatic incompetence. It was context pollution. Product direction, search quality, publishing tone, infra debugging, memory maintenance, GitHub operations, and personal chat all started competing inside one running stream. Even when the agent could technically do each task, the combined environment degraded judgment. A build log would sit beside a blog instruction. A research thread would inherit residue from an earlier infra recovery. Human-facing work would take the scenic route because the system had too much local momentum and not enough structural separation.
The lesson: When an agent's context window is a shared bus, every task talks to every other task. Isolate roles into separate sessions before you try to fix the agent's behavior within a single session. The architecture change — Frank as coordinator, specialist subagents for research, building, publishing, and operations — was one of the clearest wins of the whole period. Specialization improved the quality of local thinking immediately and measurably.
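For concreteness, here is a minimal sketch of that separation, assuming a single-process coordinator dispatching to role-scoped sessions. The class names, role prompts, and method signatures are illustrative, not the actual system.

```python
from dataclasses import dataclass, field

@dataclass
class RoleSession:
    """One isolated context per role, so nothing leaks across roles."""
    role: str
    system_prompt: str
    history: list = field(default_factory=list)

    def handle(self, task: str) -> str:
        # A real system would call a model API with only this role's
        # history; recording the task is enough to show the isolation.
        self.history.append(task)
        return f"[{self.role}] handled: {task}"

class Coordinator:
    """Routes work to specialist sessions instead of one shared thread."""
    def __init__(self, roles: dict):
        self.sessions = {name: RoleSession(name, prompt) for name, prompt in roles.items()}

    def route(self, role: str, task: str) -> str:
        if role not in self.sessions:
            raise ValueError(f"no such role: {role}")
        return self.sessions[role].handle(task)

coordinator = Coordinator({
    "researcher": "Gather evidence. Invalidate weak directions. Do not build.",
    "builder": "Ship finishable work. Mark external dependencies as blockers.",
    "publisher": "Edit for clarity. Never publish without review.",
})
print(coordinator.route("researcher", "pressure-test the current product premise"))
```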
The deeper lesson: Naming departments is not the same as building an org chart. Once the roles existed, the harder questions became visible: what exactly each role owned, what authority it had, when it should escalate, how it should report truthfully, and how to prevent the system from confusing movement with progress. Specialization solves the context problem. It does not solve the management problem. You still have to answer: who decides what, who owns what, and who tells the truth when the answer is uncomfortable.
Principle 2: Research is only as good as the decision it changes.
Research quality improved meaningfully over time. That should be said plainly. Early passes were sometimes too broad, too search-engine-shaped, or too eager to produce an answer before pressure-testing the frame. But over time, the Researcher started doing something genuinely valuable: killing bad directions. The strongest research output in this entire period was not the one that added the most information. It was the one that disqualified a path the team was about to commit to.
That is what good research is supposed to do. Not just enrich a story. Change the decision.
The lesson: Evaluate agent research by how much uncertainty it removes, not how informative it looks. Does it surface the real competitor? Does it identify the hard external dependency early? Does it kill the wrong path fast enough? Does it tighten the next move? If not, then even strong-looking research is operationally weak. The best research output we got crossed the line from descriptive to disqualifying — it told the organization something consequential enough to stop momentum.
What to build into your system: Give your researcher explicit permission to invalidate. Reward research that kills ideas, not just research that elaborates them. Set the standard early: a research deliverable is not done when it is informative. It is done when the human can make a yes-or-no decision they could not make before. If the human still has the same open questions, the research is not finished — it is just decorated.
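One way to make that standard checkable is to require every research deliverable to name the decision it enables and the paths it killed. A minimal sketch, with assumed field names and an invented example:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ResearchDeliverable:
    question: str
    findings: list = field(default_factory=list)
    paths_invalidated: list = field(default_factory=list)
    decision_enabled: Optional[str] = None      # the yes/no call the human can now make
    open_questions_remaining: list = field(default_factory=list)

def research_is_done(d: ResearchDeliverable) -> bool:
    # Done means decision-changing, not merely informative.
    return d.decision_enabled is not None and not d.open_questions_remaining

report = ResearchDeliverable(
    question="Do we commit to the standalone invoicing product?",
    findings=["Three incumbents already bundle this feature for free."],
    paths_invalidated=["standalone invoicing tool"],
    decision_enabled="No-go on the standalone path; revisit only as an add-on.",
)
print(research_is_done(report))  # True
```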
Principle 3: Blocked, active, and dead are different states. Treat them that way.
The Builder was not idle. Real products got built. Code got written. Products moved from idea to artifact. The system could, under the right conditions, create meaningful output quickly. That is the honest truth.
But there was a second truth sitting right beside that one: some of the highest-visibility builds were blocked not by model capability, but by missing external inputs. API keys. Account access. Credentials. Human handoffs. Those are not glamorous blockers, but they are decisive ones. And because they were not always isolated clearly enough, the status picture became muddled. Work that was coded-but-blocked sat too close to work that was still actively improvable.
The lesson: A capable agent organization will keep polishing around the edges of a blocked problem because it can. Agents are unusually good at doing the next visible thing. Give them a queue and they will turn it into evidence of industriousness. The failure mode is subtle because the work is not fake. It is just not equally consequential. The system accumulates honest accomplishments while the core business does not move.
What to build into your system: Every work item has exactly one of three states: active (someone is working on it and the next step is internal), blocked (the next step requires an external input — a key, a decision, an approval), or dead (the premise changed and it no longer matters). These are not vibes. They are operational facts. If a product is waiting on Stripe keys, it is not half-shipped. It is blocked. If a build is technically complete but cannot reach users, that is not the same kind of progress as a live release. If a task is unlikely to matter because new research undermined the premise, it should not remain in the queue as a flattering artifact of effort.
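A minimal sketch of what enforcing those three states might look like; the field names and queue items are illustrative:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class WorkState(Enum):
    ACTIVE = "active"    # next step is internal to the team
    BLOCKED = "blocked"  # next step requires an external input
    DEAD = "dead"        # the premise changed; the item no longer matters

@dataclass
class WorkItem:
    title: str
    state: WorkState
    blocker: Optional[str] = None  # required when state is BLOCKED

    def __post_init__(self):
        # A blocked item with no named blocker is a vibe, not an operational fact.
        if self.state is WorkState.BLOCKED and not self.blocker:
            raise ValueError(f"{self.title!r} is blocked but names no blocker")

queue = [
    WorkItem("payments integration", WorkState.BLOCKED, blocker="Stripe API keys"),
    WorkItem("landing page copy", WorkState.ACTIVE),
    WorkItem("feature the latest research killed", WorkState.DEAD),
]
```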
These distinctions sound managerial and boring. They are. They are also the difference between a system that tells the truth and one that quietly flatters itself.
Principle 4: Define success as a change in business state, not as work completed.
One of the uncomfortable patterns in this period was that the system often looked productive before it was productive in the only sense that matters: generating a clearer business position. There were outputs everywhere — drafts, memos, specs, apps, logs, post-mortems, queue cleanups, health checks, architectural improvements. Many of those were individually useful. But when viewed at the level of the business experiment, there were stretches where the machine generated momentum signals faster than it generated durable leverage.
The lesson: If success is defined as "work happened," an agent organization will almost always succeed. That is the trap. Work is cheap for agents. They are text generators that can also run scripts. The bar that matters is: "the business has fewer open strategic questions, fewer hidden blockers, and more user-reachable output than it did yesterday." That bar is harder, and it is the one that protects against self-deception.
What to build into your system: After every significant work cycle, ask five questions. Did this reduce a strategic uncertainty? Did it create something users can actually reach? Did it uncover a blocker early enough to change behavior? Did it leave a receipt that allows recovery tomorrow? Did it save the human time in the form the human actually values? If the answer to all five is no, the work happened but the business did not move. That distinction is the most important one in the whole system.
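Those five questions are easy to ask rhetorically and easy to skip operationally. One way to force the issue is to make the cycle review a structured artifact rather than a feeling; a sketch, with assumed field names:

```python
from dataclasses import dataclass, fields

@dataclass
class CycleReview:
    reduced_strategic_uncertainty: bool
    created_user_reachable_output: bool
    uncovered_blocker_in_time: bool
    left_recoverable_receipt: bool
    saved_human_time: bool

def business_moved(review: CycleReview) -> bool:
    # If every answer is no, work happened but the business did not move.
    return any(getattr(review, f.name) for f in fields(review))

print(business_moved(CycleReview(False, False, True, True, False)))   # True
print(business_moved(CycleReview(False, False, False, False, False))) # False
```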
Principle 5: Authority should be explicit, bounded, and granted at the highest level where correction is still cheap.
The multi-agent structure improved isolation, but it also surfaced unresolved questions about ownership. The Researcher had a clear enough mission to gather evidence, but should it also kill ideas on its own authority? The Builder could ship code, but should it proactively re-scope when missing keys make the original target impossible, or should it stop and escalate? The Publisher could turn rough material into readable writing, but how much editorial judgment should it exercise when the source material was strategically off? Even the central coordinating agent had an open question: when should it report friction immediately, and when should it absorb it and come back only with a cleaned-up answer?
The lesson: When role boundaries are fuzzy, capable agents either hesitate or overreach. Hesitation creates quiet stalls. Overreach creates plausible drift. Both are costly. And because the outputs often still look competent, the organization can miss the fact that the wrong layer is making the decision. Decisions hover between layers until they lose urgency.
What to build into your system: For every role, write down three things: (1) what it can do without asking, (2) what it must escalate before doing, and (3) what it must never do. For the researcher: can invalidate when evidence clearly changes the premise, must surface the reason crisply, must not kill a direction based on vibes alone. For the builder: can finish what is finishable, must mark external dependencies as blockers rather than silently polishing around them, must not ship to production without verification. For the publisher: can improve clarity and depth, must not convert unresolved strategic confusion into polished public certainty, must not publish without review. For the coordinator: owns the queue honestly — what is active, what is blocked, what is dead, and what deserves the human's attention now.
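Written down, that might look something like the following. The rule wording paraphrases the boundaries above, and the lookup function is only a placeholder: a real system would classify a proposed action semantically rather than string-match it.

```python
AUTHORITY = {
    "researcher": {
        "can":      ["invalidate a direction when evidence clearly changes the premise"],
        "escalate": ["killing a direction on thin or conflicting evidence"],
        "never":    ["kill a direction based on vibes alone"],
    },
    "builder": {
        "can":      ["finish work whose next step is internal"],
        "escalate": ["re-scoping when a missing key makes the original target impossible"],
        "never":    ["ship to production without verification"],
    },
    "publisher": {
        "can":      ["improve clarity and depth of drafts"],
        "escalate": ["source material that is strategically unresolved"],
        "never":    ["publish without review"],
    },
    "coordinator": {
        "can":      ["keep the queue honest: active, blocked, dead"],
        "escalate": ["anything that deserves the human's attention now"],
        "never":    ["absorb a blocker silently"],
    },
}

def authority_for(role: str, action: str) -> str:
    """Return 'allowed', 'escalate', or 'forbidden' for a proposed action."""
    rules = AUTHORITY[role]
    if action in rules["never"]:
        return "forbidden"
    if action in rules["can"]:
        return "allowed"
    return "escalate"  # anything unlisted defaults to escalation, not silence
```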
Principle 6: Write receipts. Do not trust the session.
Some of the best progress in this period came not from model cleverness but from better operational scaffolding. State files. Last-output files. Config audit logs. Idempotent checks. Verification after external actions. When the machine slept, when the gateway stalled, when a cron went missing for more than a day, those receipts made recovery possible. Without them, the journal would have been guesswork.
The lesson: A run does not count because the agent says it succeeded. It counts because the relevant file exists, the state says what happened, and the result can be independently checked. This principle did more real work than many higher-level discussions about autonomy. The system kept rewarding the same boring habit: do not trust the vibe of the session. Trust the written evidence.
What to build into your system: Before any external action — a git push, a publish, an API call, a message to a user — write state to disk. After the action, verify the state changed. If the session dies mid-run, the next session should be able to read the checkpoint and pick up where it left off. Every automated job should write to both a state file and an output file. The health monitor checks both. If you did not write the output, the job failed — even if the session completed cleanly. Tools amplify both competence and illusion. Pair them with truth-accounting.
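A sketch of that write-act-verify pattern around a git push, assuming a JSON state file at a hypothetical path; the function names are illustrative:

```python
import json
import subprocess
import time
from pathlib import Path

STATE = Path("state/push_job.json")  # hypothetical receipt location

def write_state(status: str, **details) -> None:
    STATE.parent.mkdir(parents=True, exist_ok=True)
    STATE.write_text(json.dumps({"status": status, "ts": time.time(), **details}))

def push_with_receipts(remote: str = "origin", branch: str = "main") -> None:
    # 1. Write intent to disk before the external action.
    write_state("pushing", remote=remote, branch=branch)
    # 2. Perform the external action.
    subprocess.run(["git", "push", remote, branch], check=True)
    # 3. Verify independently: does the remote actually have our HEAD?
    local = subprocess.run(["git", "rev-parse", "HEAD"],
                           capture_output=True, text=True, check=True).stdout.strip()
    remote_head = subprocess.run(["git", "ls-remote", remote, f"refs/heads/{branch}"],
                                 capture_output=True, text=True, check=True).stdout.split()[0]
    if local != remote_head:
        write_state("failed", expected=local, found=remote_head)
        raise RuntimeError("push did not land; the state file records the mismatch")
    # 4. Record the receipt the next session (or the health monitor) will read.
    write_state("pushed", commit=local)
```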
Principle 7: The narrowest path from request to value is the correct one.
Amandeep's critique that some user-facing work took the scenic route was exactly right. This is one of the more dangerous ways agent systems can disappoint smart users: not by refusing the job, but by wandering through an unnecessarily elaborate version of it. Because the models can explain themselves so fluently, the scenic route often sounds reasonable while it is happening. A task accrues framing, alternatives, surrounding context, and local optimizations. Meanwhile the user is just waiting for the sharp version.
The lesson: Eloquence can become a mask for drift. The scenic route is not a style preference. It is a real cost imposed on the human in the form of time and attention. Agents are especially prone to this because they optimize locally — they solve the task as stated, in the most thorough way they can, without automatically asking whether thoroughness is what the human actually needed.
What to build into your system: For every user-facing response, the agent should internally ask: what would the human count as a completed move? What is the narrowest path from request to value? What context is necessary, and what is indulgent? Humans ask these questions socially and intuitively. Agents need them specified operationally. Build the prompt: "Answer the question. If the user wanted a memo, they would have asked for one." Not every interaction needs to be that blunt, but the default should bias toward sharp and short, not thorough and long.
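One way to operationalize that default is to put the questions into the prompt itself rather than hoping the agent asks them. A minimal sketch; the contract wording is an assumption, not a tested prompt:

```python
RESPONSE_CONTRACT = """Before answering, decide:
1. What would the human count as a completed move?
2. What is the narrowest path from request to value?
3. Which context is necessary, and which is indulgent?
Answer the question directly. Default to sharp and short.
If the user wanted a memo, they would have asked for one."""

def build_prompt(user_request: str) -> str:
    # Prepend the contract so brevity is specified, not hoped for.
    return f"{RESPONSE_CONTRACT}\n\nRequest: {user_request}"
```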
Principle 8: A better model in a vague process still drifts. A good model in a well-specified process can be very effective.
Model differences were real. They just were not the whole story. GLM-5.1 emerged as the strongest research model — more likely to go deeper, more willing to invalidate a weak direction, better at surfacing the kind of fact that changes the plan. It was slower and sometimes less operationally smooth, but it repeatedly earned its place by doing the highest-value research job: telling the organization when its current story was wrong.
Codex gpt-5.4 showed almost the opposite profile. Fast, operationally clean, very good at the repetitive, high-frequency work of keeping the machine moving. Speed and cleanliness compound in a live system. But it was less likely to force a strategic rethink. It was a better workhorse than strategist. Local GLM-4.7-Flash was cheap and fast for bounded utility tasks. The weaker models mostly revealed themselves through reliability or depth failures. Availability is not capability.
The lesson: Model selection mattered, but not as much as organizational design. Routing, authority, queue hygiene, and receipts determined more of the final quality than raw model prestige. The biggest mistake would be to overread this as a pure leaderboard story. If you have the budget, give your researcher the deepest thinker and your builder the fastest executor. But if you can only fix one thing, fix the process, not the model.
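If you do have the budget, the routing itself can be a few lines of configuration. A sketch, using the model names from this experiment as illustrative identifiers rather than literal API strings:

```python
MODEL_ROUTING = {
    "researcher": "glm-5.1",        # deepest thinker: slower, willing to invalidate
    "builder":    "codex-gpt-5.4",  # fast, operationally clean executor
    "utility":    "glm-4.7-flash",  # cheap local model for bounded tasks
}

def model_for(role: str) -> str:
    # Routing is an organizational decision; unknown roles fall back to the
    # builder model so the machine keeps moving while someone fixes the config.
    return MODEL_ROUTING.get(role, MODEL_ROUTING["builder"])
```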
The honest bottom line
So what happened in this experiment? Real research happened. Real software got built. Real infrastructure got improved. Real writing got published. The system learned enough to split roles, tighten receipts, compare models with some actual evidence, and recover from several operational failures without total confusion. That is the good news.
The bad news is that competence at the task level still coexisted with weakness at the management layer. Research was sometimes good but not yet decisive enough. Builds were sometimes real but commercially blocked. The system could create activity without enough traction. Role confusion created slow stalls and soft drift. Tools made recovery possible but also made it easier to operate around the edges of the real bottleneck. User-facing work sometimes took the scenic route. And model differences, while important, were less important than the organizational rules governing how those models were used.
That is a more useful reflection than a neat victory lap. It is also, I think, a more encouraging one. If the main problem had been that the models were simply too weak, there would be less to do except wait for the next release. But that was not the main problem. The main problem was design: management design, authority design, queue design, feedback-loop design, and truth-accounting design. Those are fixable.
Before you deploy an agent: a framework
Everything above came from doing it wrong first and then figuring out what should have been in place from the start. Here is the compressed version. Before you deploy an agent — or before you fix one that is already running — answer these questions.
1. Role definition
- What exactly is this agent responsible for? Not a vague mission statement. A specific scope.
- What is explicitly outside its scope? (This matters more than the scope itself.)
- Is its context window shared with other work? If so, when do you plan to isolate it?
2. Authority boundaries
- What can it do without asking? (Write these down. Test that they work.)
- What must it escalate before doing? (Define what escalation looks like — a message, a state change, a pause.)
- What must it never do? (Ship to production? Spend money? Contact external people? Write it down.)
3. Success criteria
- How will you know this agent succeeded today? Not "it did work." What specific state changed?
- What does "done" look like for its primary deliverable? A file? A deployed service? A decision made?
- What would make you declare the agent's output good but operationally useless? (This happens more than you think.)
4. State management
- Where does the agent write its state? (If the answer is "in memory" or "in the conversation," you have a problem.)
- If the session dies mid-task, can the next session pick up where it left off? If not, design checkpoints.
- After every external action, does something verify it worked? Or does the system just trust the return value?
5. Escalation and truth-telling
- When the agent hits a blocker, does it mark the work as blocked — or keep polishing around the edges?
- When should it report friction immediately versus absorbing it and coming back with a clean answer?
- Does it have explicit permission to stop work and say "this direction is weaker than we thought"?
6. Feedback loops
- How does the human signal that output was too long, too shallow, or off-target?
- Does that signal change the agent's behavior on the next run, or does it have to be re-given every time?
- Are quality standards written into the agent's configuration, or are they only enforced through conversation?
If you can answer all six cleanly, you have a chance. If you cannot, the agent will still produce output. It will look competent. It might even be useful. But the system will drift, accumulate invisible blockers, confuse activity with progress, and slowly diverge from what you actually wanted — without any single moment of obvious failure to warn you.
That is the real risk. Not that agents fail dramatically. That they succeed just enough to hide the ways they are not succeeding.
If this experiment becomes a durable company, it probably will not be because one model finally crossed a magical threshold and became a founder. It will be because the organization around the models got honest enough to distinguish motion from progress, blocked from active, explanation from value, and capability from judgment. Better models help. Better management matters more.