Interesting writeup. I think tools like Claude Cowork are an interesting hack to adapt agentic coding tools for business use. There are significant security pitfalls and risks here, but if you balance them against the value add, there's a strong argument to be made for the kind of tradeoffs that Simon is knowingly making. And given that he coined the term prompt injection, he is of course not ignorant of those risks. This isn't about dismissing the risks but about weighing them against the cost of doing things manually. Making progress on addressing these risks is going to be a massive challenge. But there's also a lot of short term value if you are not that risk averse. That's why Codex and Claude Code have slightly scary command line flags that are widely used. The --yolo flag in Codex is a big wink at this topic. "You know you shouldn't but YOLO."

More broadly, my observation is that the kinds of tools developers use are naturally suited to being scripted, because scripting is what developers do all the time. We work with command line prompts, lots of tools that can be scripted via the command line, and scripting languages that work in that environment.

Tools like Claude Code and Codex are extremely simple for that reason. At the core is a simple feedback loop that in pseudocode reads like "while criteria not met, figure out what tools to run, run those, add output to context and re-assess if criteria were met". You don't need to hard code anything about the tools. A handful of tools (read file, run command, etc.) is all that is needed. You can get some very sophisticated feedback loops going that effectively counter the traditional limitations of LLMs (hallucinating stuff, poor instruction following, assertively claiming something is done when it isn't, etc.). A simple test suite and the condition that the tests must pass (while disallowing obvious hacks like disabling all the tests) can be enough to make agents grind away at a problem until it is solved.
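To make that loop concrete, here is a minimal Python sketch. It is not how Claude Code or Codex are actually implemented. The names agent_loop, plan_next_step, is_done and max_steps are all made up for illustration, and plan_next_step stands in for the LLM call that picks the next tool.

```python
from typing import Callable

# Hypothetical tool registry: tool name -> function taking one string argument.
Tools = dict[str, Callable[[str], str]]


def agent_loop(
    plan_next_step: Callable[[list[str]], tuple[str, str]],
    tools: Tools,
    is_done: Callable[[list[str]], bool],
    max_steps: int = 20,
) -> list[str]:
    """Run tools until is_done(context) holds or max_steps is exhausted."""
    context: list[str] = []
    for _ in range(max_steps):
        if is_done(context):
            return context
        # Stand-in for the model: decide which tool to run with which argument.
        tool_name, argument = plan_next_step(context)
        output = tools[tool_name](argument)
        # Feed the tool output back so the next decision can build on it.
        context.append(f"{tool_name}({argument!r}) -> {output}")
    if is_done(context):
        return context
    raise RuntimeError("criteria not met within max_steps")
```

In a real agent, is_done would be something like "the test suite exits with status 0" and the tools would read files and run shell commands. The step limit is there so a stuck agent stops instead of looping forever.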

In a business context, none of this holds yet. Most business users use a variety of tools that aren't very scriptable and require fiddling with complex UIs. Worse, a lot of those tools are proprietary, and hacking them requires access that you typically don't get or that is very limited. Given that, a life hack is to translate business workflows into developer tool workflows and then use agentic coding tools. Claude can't use MS Word for you. But it can probably work on MS Word files via open source libraries and tools. So, step zero is to "mount a directory" and then use command line tools to manipulate what's inside. You bypass the tool boundary by swapping out business tools for developer tools. Anything behind a SaaS web UI is a bit out of scope unfortunately. You get bogged down in a complex maze of authentication and permission issues and fiddly APIs with poor documentation. That's why most of the connectors for e.g. ChatGPT are a bad joke in how limited they are.
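As a rough illustration of the MS Word point: a .docx file is a ZIP archive with the main text in word/document.xml, so even the Python standard library can pull the paragraphs out of one. This is a sketch under that assumption only. The function name docx_paragraphs is made up, and a dedicated library such as python-docx handles far more (styles, tables, nested content, writing files back).

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by elements in word/document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_paragraphs(path):
    """Return the plain text of each paragraph element in a .docx file."""
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    # Every <w:p> is a paragraph; its visible text lives in <w:t> runs.
    for para in root.iter(f"{{{W_NS}}}p"):
        text = "".join(node.text or "" for node in para.iter(f"{{{W_NS}}}t"))
        paragraphs.append(text)
    return paragraphs
```

Once the text is out like this, an agent can grep it, diff it, or check it against a style guide with ordinary command line tools.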

Simple example. Codex, Claude Code, etc. are probably fairly useless at doing anything complicated with, say, Squarespace or a WordPress website. But if you use a static site builder, you can make these tools do fairly complicated things. I've been working for the last two weeks on our Hugo website to do some major modernization, restructuring, content generation, translations, etc. All via prompting Codex. I'm working on SEO, Lighthouse performance, adding complex new components to the website, reusing content from old pages to create new ones, checking consistency between translations, ensuring consistent use of certain language, etc. The prompts are things like "Add a logo for company X", "make sure page foo has a translation consistent with my translation guide", etc.

I got a lot more productive with this setup after I added a simple test suite that runs via npm run verify, plus an AGENTS.md instruction that the verify script has to pass after any change. If you watch what Codex does, there's a pattern of trial and error until the verification script passes. Usually it doesn't get it right in one go. But it gets there without my intervention. It's not a very sophisticated test suite but it tests a few of the basics (e.g. Tailwind styling survives the build and is in the live site, important shit doesn't 404, Hugo doesn't error, etc.). I have about 10 simple smoke tests like that.
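For a sense of how little such a check needs, here is a Python sketch of two of those smoke tests run against the built site directory. It is not my actual verify script, which runs via npm. The function name verify_site and all paths and CSS markers passed to it are hypothetical.

```python
from pathlib import Path


def verify_site(site_dir, required_pages, css_file, css_markers):
    """Return a list of human-readable failures; an empty list means all checks pass."""
    site = Path(site_dir)
    failures = []
    # Pages that must exist in the build output, i.e. must not 404 once deployed.
    for page in required_pages:
        if not (site / page).is_file():
            failures.append(f"missing page: {page}")
    # The stylesheet must exist and still contain the classes the site relies on.
    css_path = site / css_file
    if not css_path.is_file():
        failures.append(f"missing stylesheet: {css_file}")
    else:
        css = css_path.read_text(encoding="utf-8")
        for marker in css_markers:
            if marker not in css:
                failures.append(f"stylesheet lacks: {marker}")
    return failures
```

A wrapper would run the site build first, call this, print the failures and exit non-zero if there are any. The exit status is the signal the agent keeps iterating against.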

I think we'll see a big shift in the business world towards more AI friendly tooling. Smart business users will quickly flock to tools that work well with AI once they discover that switching lets them shave days or weeks off work they now grind through manually. Across the wider business world, this is a process that's likely to take very long because people don't like to change their tool habits. But the notion of what is the right tool for the job is shifting. If it's not AI friendly, it's probably the wrong tool.

Long term, I expect that UIs and sane permission handling will become easier for AI tools to deal with. But meanwhile, we don't actually have to wait for all that. You can hack your way to success if you are a bit smart with your tool choices.