This is anecdotal but just a couple days ago, with some colleagues, we conducted a little experiment to gather that evidence.
We used a hierarchy of agents to analyze a requirement, letting agents with different personas (architect, business analyst, security expert, developer, infra, etc.) discuss a request and distill a solution. They all had access to the source code of the project they were to work on.
Then we provided the very same input, including the personas' definitions, straight to Claude Code, and we compared the results.
The council of agents got to a very good result, consuming about $12, mostly using Opus 4.6.
To our surprise, going straight with a single prompt in Claude Code got to a similarly good result, faster, consuming about $0.30 and mostly using Haiku.
This surely deserves more investigation, but our working hypothesis so far is that coordination and communication between agents carry a remarkable cost.
Should this be the case, I personally would not be surprised:
- The reason we humans do job separation is that we have an inherently limited capacity. We cannot become experts in all the needed fields: we just can't acquire the knowledge to be good architects, good business analysts, and good security experts all at once. Apparently, that's not a problem for an LLM. So job separation is probably not the necessary pattern for LLMs that it is for humans.
- Job separation has an inherently high cost and just does not scale. Notably, most of the problems in human organizations are about coordination, and the larger the organization, the higher the cost of its processes, to the point that processes turn into bureaucracy. In IT companies, many problems sit at the interfaces between groups, because of the low-bandwidth communication and the inherent ambiguity of language. I'm not surprised that a single LLM can communicate with itself far better and more cheaply than a council of agents, which inevitably faces the same communication challenges as a society of people.
Fair point. I could try with a harder problem.
This still does not explain why Claude Code felt the need to use Opus, and why Opus felt the need to burn $12 or so on such an easy task. I mean, it's 40 times the cost.
I'm a bit confused actually, you said you used Claude Code for both examples? Was that a typo, or was it (1) Claude Code instructed to use a hierarchy of agents and (2) Claude Code allowed to do whatever it wants?
I think the benefit may be task separation and cleaning the context between tasks. Asking a single session to do all three has a couple of downsides.
1. The context for each task gets longer, which we know degrades performance.
2. In that longer context, implicit decisions are made in the thinking steps, and the model is probably more likely to go through with bad decisions that were made 20 steps back.
The way Stavros does it is Architect -> Dev -> Review. By splitting the task into three sessions, we get a fresh and shorter context for each task. At minimum, skipping the thinking messages and intermediary tool output should increase the chances of a better result.
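A minimal sketch of that Architect -> Dev -> Review split, with `run_session` as a made-up stand-in for a real LLM API call: each stage starts from a fresh, empty history and only the previous stage's final output is handed forward, not its thinking or tool chatter.

```python
def run_session(system_prompt: str, task: str) -> str:
    """Stand-in for one fresh LLM session; returns its final answer only.
    A real implementation would call an LLM API with an empty history."""
    return f"[{system_prompt}] result for: {task}"

def pipeline(requirement: str) -> str:
    plan = run_session("You are the architect.", requirement)
    code = run_session("You are the developer.", plan)    # sees the plan, not the planning
    review = run_session("You are the reviewer.", code)   # sees the code, not the dev chat
    return review
```

The key design choice is that each `run_session` call is a brand-new context, so the reviewer never inherits the architect's 20-step-old reasoning.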
Using different agent personas and models at least introduces variability in token generation; whether that's good or bad, I do not know. As far as I know, in general it's supposed to help.
Having the sessions communicate is, I think, a mistake, because you lose all the benefits of cleaning up the context. Given the chattiness of LLMs, you are probably going to fill up the context with multiple thinking rounds over the same message: one from the session that outputs it and one from the session reading it. You are also probably going to have competing tool uses, with each session making its own tool calls to read the same content. It will probably be a huge mess.
The way I do it is I have a large session that I interact with and task with planning and agent spawning. I don't have dedicated personas or agents. The benefits, as I see them, are that I have a single session with extensive context about what we are doing, plus a dedicated task handler with a much more focused context for each task.
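The planner-plus-spawned-tasks pattern could be sketched roughly like this (the `llm` function and `Planner` class are invented for illustration): one long-lived session accumulates the big picture, while each subtask runs in a fresh, narrow context and only its result flows back up.

```python
def llm(history):
    """Stand-in for a model call on a list of messages."""
    return f"answer({len(history)} msgs)"

class Planner:
    """Long-lived planning session that spawns focused, short-lived tasks."""
    def __init__(self, goal: str):
        self.history = [goal]        # long, accumulating context

    def spawn(self, subtask: str) -> str:
        result = llm([subtask])      # fresh, focused context: just the subtask
        self.history.append(result)  # only the result flows back up
        return result
```

The asymmetry is deliberate: the planner's history grows, but each spawned task sees only its own one-item context.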
What I have seen with my setup is impressively good performance at the beginning that degrades as feedback and tweaks around the work pile up.
Framing LLM use for dev tasks as "narrative" is powerful.
If you want specific, empirical, targeted advice or work from an LLM, you have to frame the conversation correctly. "You are a tenured Computer Science professor agent being consulted on a data structure problem" goes a very long way.
Similarly, context window length and prior progress exert significant pressure on how an LLM frames its work. At some point (often around 200k-400k tokens in), they seem to reach a "we're in the conclusion of this narrative" point and will sometimes do crazy stuff to reach whatever real or perceived goal there is.
Probably the same reason it takes a team of developers and managers 6 months to write what one or two developers can do on their own in one week. The overhead caused by constant meetings and negotiations is massive.
> The overhead caused by constant meetings and negotiations is massive.
this is my life ngl. i really wish these ai companies would work on automating away all this bullshit instead of just code code code
just the other day i was asked to prepare slides for a presentation about something everyone already knows (among many other useless side-work)... i feel like with "ai" in general we are applying bandages where my real problem is the big machine that gives me paper cuts all day...
LLMs also don't have the primary advantage humans get from job separation: diverse perspectives. A council of Opuses are all exploring the exact same weights with the exact same hardware, unlike multiple humans with unique brains and memories. Even with different ones, Codex 5.3 is far more similar to Opus than any two humans are to each other. Telling an Opus agent to focus on security puts it in a different part of the weights, but it's the same graph; it's not really more of an expert than a general Opus agent with a rule to maintain secure practices.
You can differentiate by context: one sees the work session, the other sees just the code. Same model, but different perspectives. Or by model: there are at least 7 decent models between the top 3 providers.
I know, but none of those is nearly as much of a difference as another human looking at code. The top models have such overlapping training data they sometimes identify as each other.
Agentic pipelines and systems fall into the same issues as humans who work together, mostly communication.
It's not like they can dump their full context to the "manager" agent, they need to condense stuff, which will result in misinterpreted information or missing information on decisions down the line.
IMO this was more relevant when agents had limited context windows
This I believe is true. I have been working on an agentic architecture, and whenever there was a new requirement, the simple workflow was to create a specific agent for it. Earlier, context windows were small and this was the default solution. Over time, our total number of agents has grown so vast that it is a headache to maintain and debug.
Absolutely works with frontier models. What do you think about smaller models in these pipelines? That's literally what I'm working on, with qwen3.5-27b, and I'm splitting the task into 4 steps, not sure if that's the way to go. Do you have any experience to share?
Yea I do. I use it for programs where I'm unsure whether they will read or modify my filesystem. I still allow the program to run arbitrary computation and use the network. It's just the filesystem part that I want to isolate.
I would just suggest the author to replace the sentence “99% of the time, it refers to motion in one dimension” with “most of the time” since this is a mathematical article and there’s no need to use specific numbers when they don’t reflect actual data.
Laurent Bossavit wrote a whole book about similar cases that occurred in the IT world: "The Leprechauns of Software Engineering: How folklore turns into fact and what to do about it".
Just wondering if depth maps can be used to generate stereograms or SIRDS. I remember playing with stereogram generation starting from very similar grey-scale images.
They do. The UI to do this is apparently only included in the VisionOS version of the Photos app. But you can convert any photo in your album to "Spatial Format" as long as it has a Depth Map, or is high enough resolution for the ML approximation to be good enough.
It also reads EXIF to "scale" the image's physical dimensions to match the field of view of the original capture, so wide-angle photos are physically much larger in VR-Space than telephoto.
In my opinion, this button and feature alone justifies the $4000 I spent on the device. Seeing photos I took with my Nikon D7 in 2007, in full 3D and at correct scale, triggers nostalgia and memories I had forgotten for many years. It was quite emotional.
Apple is dropping the ball on not making this the primary selling-point of Vision Pro. It's incredible.
An audit log table often takes a huge amount of space compared to simple fields on the records, so there are tradeoffs. Which solution is best depends on how important change logs are.
I kinda agree, but don’t underestimate the power of having things where people are looking.
Put your documentation in doc strings where the function is defined - don’t have a separate file in a separate folder for that. It might separate concerns, but no one is looking there.
Similarly, if those fields aren't nullable, anyone trying to add new rows will have to fill in something for those metadata fields, and that something will now very likely be what's needed, rather than nothing being recorded at all.
Obviously your app can outgrow these simple columns, but you’re getting value now.
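A minimal sketch of those "simple columns" (the `invoice` table and its column names are made up for illustration): mark the metadata columns `NOT NULL`, and writers are forced to fill them in at insert time rather than silently skipping the audit trail.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE invoice (
        id         INTEGER PRIMARY KEY,
        amount     REAL NOT NULL,
        updated_by TEXT NOT NULL,  -- NOT NULL: every writer must say who changed it
        updated_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
    )
""")

# a well-behaved insert provides the metadata
con.execute("INSERT INTO invoice (amount, updated_by) VALUES (?, ?)", (99.0, "alice"))

# forgetting the metadata fails loudly instead of leaving a gap in the history
try:
    con.execute("INSERT INTO invoice (amount) VALUES (?)", (10.0,))
except sqlite3.IntegrityError as e:
    print("rejected:", e)
```

The `DEFAULT CURRENT_TIMESTAMP` keeps the timestamp honest without extra work, while `updated_by` has no sensible default, so the constraint does the nagging for you.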
Event sourcing also works great. You don't need an audit log per se if you already track a history of all commands that introduced changes to your system.
If you try to redact a part of the past, it can also affect the present, as any time traveler knows.
Let's assume we want to remove every message related to user A.
A photo by user B became the best of the day because it collected the most upvotes. Without A's vote, it no longer is. The photo also became the best of the month because it was later voted top among the best-of-the-day entries, and received a prize. Should we now replay the message stream without A's upvote, things end up radically different, or in a processing error.
User B was able to send a message to user C, and thus start a long thread, because user A had introduced them. With user A removed, the message replay chokes at the attempt of B to communicate with C.
One way is to ignore the inconsistencies, but that deprives you of most of the benefits of event sourcing.
Another way is anonymizing: replace messages about user A with messages about some null user representing removed users. This can still lead to paradoxes and replay inconsistencies.
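The anonymizing approach could look like this tiny sketch (the event shape and sentinel name are invented): events from the removed user are rewritten to point at a null user rather than dropped, so replay still sees a vote where one happened.

```python
REMOVED = "__removed_user__"  # sentinel null user standing in for all removed users

def redact(events, user_id):
    """Rewrite, rather than delete, every event from the given user."""
    return [
        {**e, "user": REMOVED} if e.get("user") == user_id else e
        for e in events
    ]

events = [
    {"type": "upvote", "user": "A", "photo": "p1"},
    {"type": "upvote", "user": "B", "photo": "p1"},
]
redacted = redact(events, "A")
```

Vote counts survive the rewrite, but anything keyed on user A specifically (the prize, the introduction of B to C) can still replay inconsistently, which is where the paradoxes come back in.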
> If you try to redact a part of the past, it can also affect the present, as any time traveler knows.
That's not how snapshots work. You record the state of your system at a point in time, and then you keep all events that occurred after that point. This means you retain the ability to rebuild the current state from that snapshot by replaying those later events. I.e., event sourcing's happy flow.
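That happy flow can be sketched in a few lines (all names invented; state is just a dict and each event sets a key): capture full state at time T, then rebuilding from the snapshot plus the events after T gives exactly the same result as replaying everything.

```python
def apply(state, event):
    """Fold one event into the state, returning a new state."""
    state = dict(state)
    state[event["key"]] = event["value"]
    return state

events = [{"t": i, "key": f"k{i}", "value": i} for i in range(5)]
snapshot_t = 2  # take a snapshot after the event at t=2

# build the snapshot from events up to and including snapshot_t
snapshot = {}
for e in events:
    if e["t"] <= snapshot_t:
        snapshot = apply(snapshot, e)

# rebuild current state: snapshot + replay of only the later events
rebuilt = snapshot
for e in events:
    if e["t"] > snapshot_t:
        rebuilt = apply(rebuilt, e)

# sanity check against a full replay from scratch
full = {}
for e in events:
    full = apply(full, e)
assert rebuilt == full
```

Once the snapshot is taken, the events before T can be redacted or discarded without losing the ability to reconstruct the present.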
> User B was able to send a message to user C, and thus start a long thread, because user A had introduced them. With user A removed, the message replay chokes at the attempt of B to communicate with C.
Not really. That's just your best attempt at reasoning about how the system could work. In the meantime, depending on whether you have a hard requirement to retain messages from removed users, you can either keep them assigned to a deleted user or replace them with deleted messages. This is not a problem caused by event sourcing; it's a problem caused by failing to design a system that meets its requirements.
Yep. But Event Sourcing comes with its own set of other problems.
Also, I don't think this would apply to OP's post: with Event Sourcing you would not even have those DB tables.
The DB tables suggested by OP are akin to snapshots, whereas the events would require a separate data store. OP is trying to shoehorn event history into the snapshots, which hardly makes sense.