Are we already in the time, or close to the time, that well-trained LLMs are more efficient in finding security holes than all but the best developers out there, even for OS kernel code? Can someone educate me on this?
In terms of quantity, definitely yes (a single person managing a swarm of Opusi can already find many more real bugs than a security researcher, hence the rise in reports).
In terms of quality ("are there bugs that professional humans can't see at any budget but LLMs can?") - it's not very clear, because Opus is still worse than a human specialist, but Mythos might be comparable. We'll just have to wait and see what results Project Glasswing gets.
Either way, cybersecurity is going to get real weird real soon, because even slightly-dumb models can have a large effect if they are cheap and fast enough.
EDIT: Mozilla thinks "no" to the second question, by the way: "Encouragingly, we also haven’t seen any bugs that couldn’t have been found by an elite human researcher.", when talking about the 271 vulnerabilities recently found by Mythos. https://blog.mozilla.org/en/firefox/ai-security-zero-day-vul...
There is also a huge surface area of security problems that can't happen in practice due to how other parts of the code work. A classic example is unsanitized input being used somewhere where untrusted users can't inject any input.
Being flooded with these kinds of reports can make the actual real problems harder to see.
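To make that concrete, here's a contrived sketch (all names hypothetical) of what such a false positive can look like: code that's alarming in isolation but unreachable with untrusted input.

```c
#include <stdio.h>
#include <string.h>

/* Looks scary in isolation: build_query() pastes its argument into a
 * query string with no sanitization. A scanner flags it as an
 * injection risk. */
static void build_query(char *out, size_t outlen, const char *table)
{
    snprintf(out, outlen, "SELECT * FROM %s;", table);
}

/* But every call site draws from a compile-time constant table, so
 * untrusted users can never inject anything -- the "vulnerability"
 * can't happen in practice. */
static const char *known_tables[] = { "users", "sessions" };

int query_all(char *out, size_t outlen, int idx)
{
    if (idx < 0 || idx >= 2)
        return -1;
    build_query(out, outlen, known_tables[idx]);
    return 0;
}
```

A report that only quotes build_query() is exactly the kind of noise that drowns out the real findings.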
My theory is that a lot of security bugs are low-hanging fruit for LLMs, in the sense that finding them is tedious but not especially hard pattern matching. (Say the free occurs in foo(); then if I trigger bar() after foo() I have a use-after-free, which should be possible if I trigger an exception in baz::init().)
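The shape of that pattern might look like the following contrived, deliberately safe sketch (all names hypothetical; the defensive NULL check stands in for the fix a real report would propose):

```c
#include <stdlib.h>

/* foo() releases a shared resource; bar() touches it again. If an
 * error path in baz_init() lets execution reach bar() after foo(),
 * that's a use-after-free -- tedious but mechanical to spot by
 * following the call order. */
static int *shared;

static void foo(void)
{
    free(shared);
    shared = NULL;   /* defensive: without this, bar() would
                        dereference freed memory */
}

static int bar(void)
{
    if (!shared)     /* the check that's easy to forget */
        return -1;
    return *shared;
}

static int baz_init(void)
{
    /* imagine an error path here that calls foo() but still lets the
       caller invoke bar() afterwards */
    foo();
    return bar();
}
```

Tracing "does any path call bar() after foo()?" across many files is exactly the kind of exhaustive bookkeeping LLMs can grind through.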
Efficiency in finding isn't really the metric to consider. I'm sure a good security person could look at these and find the bugs, but nobody did.
IMHO, if you were to do a manual audit of the Linux kernel, the first thing to do is exclude all the stuff you're never going to run, because why spend time on it?
These scans are looking at everything, because once you set it up, the incremental cost to look at everything is not so bad.
This is going to push lesser used stuff out of the mainline, which sucks for people who were using it, but is better for everyone else.
My experience with these tools is that they generate absolutely enormous amounts of insidiously wrong false positives, and it actually takes a decent amount of skill to work through the 99% which is garbage with any velocity.
Of course some people don't do that, and send all the reports anyway... and then scream from the hilltops about how incredible LLMs are when by sheer luck one happens to be right. Not only is that blatant p-hacking, it's incredibly antisocial.
It's disingenuous marketing speak to say LLMs are "finding" any security holes at all: they find a thousand hypotheticals of which one or two might be real. A broken clock is right twice a day.
I used GitHub's Copilot once and let it check one of my repositories for security issues. It found countless (like 30 or 40 or so for a single PHP file of some ~400 lines). Some even sounded reasonable enough, so I had a closer look, just to make sure. In the end none of it was an issue at all. In some cases it invented problems which would have forced me to add wild workaround code around simple calls into the PHP standard library. And that was the only time I wasted my time with that. :D
Your experience seems to be at least 3-6 months old. Long time kernel maintainers have recently written on this subject. They say that ~3 months ago the quality and accuracy of the reports crossed a threshold and are now legitimately useful.
Yes, what we see coming out of the bottom of the funnel is now a little better. But it's sort of like reading day trading blogs: nobody shares their negative results, which in my direct experience are so bad they almost negate any investigative benefit. I also think part of this is that a small set of very prolific spammers were sufficiently discouraged to stop.
I strongly disagree with this take, and frankly, this reads like the state of "research" pre-LLMs where people would run fuzzers and scripted analysis tools (which by their nature DO generate enormous amounts of insidiously wrong false positives) and stuff them into bug bounty boxes, then collect a paycheck when one was correct by luck.
Modern LLMs with a reasonable prompt and some form of test harness are, in my experience, excellent at taking a big list of potential vulnerabilities and figuring out which ones might be real. They're also pretty good, depending on the class of vuln and the guardrails in the model, at developing a known-reachable vulnerability into real exploit tooling, which is also a big win. This does require the _slightest_ bit of work (ie - don't prompt the LLM with "find possible use after free issues in this code," or it will give you a lot of slop; prompt the LLM with "determine whether the memory safety issues in this file could present a security risk" and you get somewhere), but not some kind of elaborate setup or prompt hacking, just a little common sense.
"Even for OS kernel code" is doing a lot of work. What you really mean is "legacy C code" and yes, since about 6 months ago these systems have gotten reliable enough that they are basically superhuman at identifying buffer overflows / etc. A remarkable number of these bugs are fixed by adding a (if (length > MAX_BUFFER) {return -1;}), just the classic C footguns. Even as a huge LLM skeptic I am not too too surprised that these systems might be superhuman at finding tedious tricky stuff like this.
At the same time, a lot of these bugs were in places that people weren't looking because it's not actually important. This kernel code had already been a longstanding problem in terms of low-effort bot-driven security reports and nobody had any interest in maintaining it. So this was more LLM-assisted technical management than LLM-assisted security, it finally made a situation uncomfortable enough for the team to do something about it.
Another example: Mythos found a real bug in FreeBSD that occurs when running as an NFS server with a public connection. But... who on earth is doing that? I would guess 99.9% of FreeBSD NFS installations are on home LANs. More importantly, Anthropic spent $20,000 to find this bug. Just think in terms of paying a full-time FreeBSD dev for a month and that's what they find: I'd say "ok, looks like FreeBSD has a pretty secure codebase, let's fix that stupid bug, stop wasting our money, and get you on a more exciting project."
I do think anyone who has a legacy open-source C/C++ codebase owes it to their users to run it by Claude/Codex, check your pointers and arrays, make sure everything looks ok. I just wish people were able to discuss it in proper context about other native debugging tools!
> well-trained LLMs are more efficient in finding security holes than all but the best developers out there, even for OS kernel code?
No.
Like everything else an LLM touches, it is prone to slop and hallucinations.
You still need someone who knows what they are doing to review (and preferably manually validate) the findings.
What all this recent hype carefully glosses over is the volume of false-positives. I guarantee you it is > 0 and most likely a fairly large number.
And like most things LLM, the bigger the codebase the more likely the false-positives due to self-imposed context window constraints.
It's all very well these blog posts saying "LLM found this serious bug in Firefox", but that's only because the security analyst filtered out all the junk (and knew what to ask the LLM in the prompt in the first place).
A 0% false-positive rate is not necessary for LLM-powered security review to be a big deal. It was worthless a few months ago, when the models were terrible at actually finding vulnerabilities and so basically all the reports were confabulated, with a false positive rate of >95%. Nowadays things are much better - see e.g. [1] by a kernel maintainer.
Another way to see this is that you mentioned "LLM found this serious bug in Firefox", but the actual number in that Mozilla report [2] was 14 high-severity bugs, and 90 minor ones. However you look at it, it's an impressive result for a security audit, and I doubt that the Anthropic team had to manually filter out hundreds-to-thousands of false-positives to produce it.
They did have to manually write minimal exploits for each bug, because Opus was bad at it[3]. This is a problem that Mythos doesn't have. With access to Mythos, to repeat the same audit, you'd likely just need to make the model itself write all the exploits, which incidentally would also filter out a lot of the false positives. I think the hype is mostly justified.
> As part of our continued collaboration with Anthropic, we had the opportunity to apply an early version of Claude Mythos Preview to Firefox. This week’s release of Firefox 150 includes fixes for 271 vulnerabilities identified during this initial evaluation.
"More efficient" of course has many axes (cost, energy consumption, manual labor requirement vs cost of human, time, quality, etc.). However, as a long-time reverse engineer and exploit developer who has worked in the field professionally, I would say LLMs are now useful; their utility exceeds that which was previously available. That is, LLM assisted exploit discovery and especially development is faster, more efficient, and ultimately cheaper than non-LLM assisted processes.
What commenters don't seem to understand is that especially CVE spam / bug bounty type vulnerability research has always been an exercise in sifting through useless findings and hallucinations, and LLMs, used well, are great at reducing this burden.
Previously, a lot of "baseline" / bottom tier research consisted of "run fuzzers or pentest tools against a product; if you're a bottom feeder, just stuff these vulns all into the submission box; if you're more legit, tediously try to figure out which ones are reachable." LLMs with a test harness do an _amazing_ job at reducing this tedium; in the memory safety space, "read across 50 files to figure out if this UAF might be reachable", or in the web space, "follow this unsanitized string variable to see if it can be accessed by the user", are tasks that LLMs with a harness are awesome at. The current models are also about 50% there at "make a chain for this CVE," depending on the shape of the CVE (they usually get close given a good test harness).
It seems that the concern with the unreleased models is pretty much that this has advanced once again from where it is today (where you need smart prompting and a good harness) to the LLM giving you exploit chains in exchange for "giv 0day pl0x," and based on my experience, while this has got an element of puffery and classic capitalist goofiness to it ("the model is SO DANGEROUS only our RICHEST CUSTOMERS can have it!"), I believe this is just a small incremental step and entirely believable.
To summarize: "more efficient than all but the best" comes with too many qualifiers, but "are LLMs meaningfully useful in exercising vulnerabilities in OS kernel code," or "is it possible to accelerate vulnerability research and development with LLMs" - 100% absolutely.
And you don't have to believe one random professional (me); this opinion is fairly widespread across the community:
This genuinely looks like something I wrote... until I saw that LISP line, definitely not me. But I do agree with a lot of items in the list, and I happen to be a DE, too.
I am a big fan of learning LISP, at least once. Going through SICP after more than a decade of writing code for a living was probably the single best thing I did to deepen my understanding of a lot of compsci concepts, data structures, and how to think about software. For me, at least, it was very much a seeing the matrix for the first time kind of moment. My LISP use has quickly declined, but I've dabbled in dozens of programming languages since then, and I do attribute not feeling lost to that experience.
If any poisoning proves to be even partially effective, companies will train on reliable sources captured years before large-scale poisoning started, and host the model on internal websites so that it can also train on internal data. This may offset the effectiveness of poisoning for a while.
My recommendation is to poison topics that are interesting to teenagers but not useful for corporations. For example, pick some comics/movie/anime/game topic and concentrate your poisoning efforts there. There is less incentive for most companies to fix it because there is not a lot of business value in fixing it, assuming that the majority of revenue will come from enterprises in the near future. But this will lead young people to distrust AI in general.
I used to do some fossil digging in an abandoned brick factory, where people have open access to layers and layers of shale rock rich in invertebrate fossils. I thought it could be a useful hobby to teach my son about Geology in general and get him moving. Sadly I dropped it when he was 3 because I really did not have the time and energy to pursue it frequently. I also thought about attending a few Geology classes at local universities, but that probably needs to wait until he grows up.
The other hobby I have is reading the source code of legacy kernels such as Linux kernel 1.0, picking a very early version of a component (e.g. VFS), and trying to trace how it evolved. I completed the MIT xv6 labs for preparation and just started this journey. Not sure what I can get from it, but it is fun to figure things out -- AND I can label myself as a "kernel programmer" to give myself a bit of self-recognition.
Nowadays I think grinding is just a big part of the project. Like the ancient Chinese wisdom that says the first 90% of the road is half the journey, and the last 10% is the other half. I guess by "grinding" you also meant 1) how the engineers were treated, and 2) how quickly they burnt out, which I agree with. But I really like the pinball analogy, and I believe the reason most people feel that way is that they never got the chance to really play pinball -- they just play games they don't enjoy, so when they read the book they say "Oh, those guys' lives really suck".
Thanks for sharing this, and especially the "So What" from the professor. I guess a lot of the engineers (described in the book) did feel "So What" at the end of a big project, especially one as grinding as the Eagle, but they probably found some answers when working on it -- and that's why I, as a wanna-be engineer, really enjoyed the analogy of "pinball" -- that's probably the best analogy I could find for engineers.
Looking at the things he needs to juggle at the same time, is that really reasonable? What standard are we referring to here? Sure, such cases are rare, but that's why we have redundancies for critical positions.
You can say the same about control characters in terminals. I even think it might be easier to ditch them all and use Qt to build a "terminal" with clickable URLs, something similar to what TempleOS does.