Hacker News
AI and bots have officially taken over the internet (cnbc.com)
44 points by zaikunzhang 10 days ago | 72 comments



Well, IoT traffic surpassed "human traffic" long ago, Netflix etc. eat a lot of bandwidth, and so on, so I am not sure where the news is exactly.

IoT traffic and streaming traffic are "invisible" to normal humans.

Your smart thermometer isn't making Reddit posts trying to sound like a human who's just concerned that the bedroom is a bit too warm.


> Well, IoT traffic surpassed "human traffic" long ago, Netflix etc. eat a lot of bandwidth, and so on, so I am not sure where the news is exactly.

I think there are different kinds of traffic: raw packets and user-like interactions. I think the OP is not very clear on where the line is, but it's significant.

If most of the traffic on the internet is akin to shell scripts making bulk FTP transfers, it's probably not news. If most of the comments are being made by bots, or most of the streams on Netflix are being consumed by bots, that's pretty big news.


How is Netflix not 'human traffic'?

It's hyperbole, click bait.

Ah, well you see, now CNBC has declared it official! That obviously speaks for itself in how authoritative and momentous it is! /s

There are human-to-human (H2H), human-to-machine (H2M) or vice versa, and machine-to-machine (M2M) kinds of data communication.

If you perform a simple extrapolation, M2M data will only surpass the others around 2029.

Coincidentally, in the original timeline of the Transformers movie, 2029 is the year that the Resistance, led by John Connor, destroyed Skynet and ended the war against the machines.


> Coincidentally, in the original timeline of the Transformers movie, 2029 is the year that the Resistance, led by John Connor, destroyed Skynet and ended the war against the machines.

I’d love to see that Terminator and Transformers crossover movie. Optimus Prime vs. the T-800, anyone?


I seem to recall that the chip that enabled Skynet was actually reverse engineered from a chip recovered from a Terminator.

Leaving the original timeline uncertain.

Was it the original, really the original? Or the 10th, or the millionth loop?

Skynet can still be in our future.


"Officially?"

Who is this official making this pronouncement?


President of the Internet, Kevin Roberts.

Bot traffic has been overtaking human traffic for a long time. Between crawlers, spammers, fake accounts on all social media, automated scripts for APIs and various SaaS, and botnets for various attacks...

It's just that now the official numbers say so.

But anyone on Twitter or Reddit can tell you the dead internet theory has been progressing at a swift pace for a decade.

AI just made it more apparent.


> This notion of machine bad, human good just is not realistic

Glad I found this quote. It is quite helpful to have an AI search the web on my behalf... even if it was just finding where I can buy locally the particular/similar peanuts I got from abroad.


Content providers will not agree with this, because machine browsing = no ads. Until that gets resolved, I don’t see incentives aligning, since any free search requires ads to remain a viable business.

It could still serve ads, if they could persuade the machines to make the purchase.

In fact, even ads ingested into the training data at this very moment could be useful. Go to Gemini and tell it you want to buy a jacket or whatever, and it will recommend some products it ingested from the training data.


This notion isn't just unrealistic, but extremely dangerous. If we accept the "machine bad, human good" line of thinking, the only logical conclusion is that we'll have to verify our biometrics every time we'd like to access the internet. Like the UK age verification, but 100x worse.

As much as I dislike gatekeeping measures like UK's age verification, you can't deny the genuine problem that exists in this case. But it isn't 'machine bad'. There is no good technology or bad technology. It's the intention of those who wield it, that is good or bad. In other words, it's good people vs bad people with technology.

The issue in this particular case is that those content and their web servers are set up for human traffic. In the worst case, a human consumes a few megabytes of data from the server and then leaves. A few of those visits will convert into a job or business opportunity - a fair bargain. LLM scrapers are not like that. They're greedy resource hogs. They not only want everything you have, a whole bunch of them do it repeatedly and endlessly to your server. There's no possible way to justify the cost of such massive bandwidth consumption for a bunch of parasites that never give anything in return. And what do we get? A crappy user experience from all those sites putting up protection measures. This is the tragedy of the commons.

So who is the culprit? The greedy bunch who created the technology that behaves like this and then benefit immensely from it. Are those bad people? Absolutely! Naturally, we need them and their ill-intentioned creations out of our shared spaces. This isn't anything new. This game has been playing out in different forms since eternity.


Dead Internet theory came out around 2023.

Playwright launched in 2020; similar projects existed before it and more have launched since.

It used to be automated by script, now you even have AI.

We now have a dead internet.

It's also important to understand what's happening. Why, or what, would the scripted bots be doing? It's not just reading, otherwise nobody would notice. They are actually posting things. And it's not just posting cat pictures; nobody would notice that either.

Each bot has a different intention, but universally they share a mass intention to manipulate some subject. Reddit is bad because the bots have the power to curate content with downvotes.

So online discussions have synthetic content intending to change opinions. How does that interact with the various subjects you're interested in?

What's even crazier is the intersection of echo chambers and bots. There are people who have blocked essentially all humans and live in a world of bots who agree with them. It is causing insane social problems.


I view a lot of the "AI/bot internet" framing as slightly a misnomer. Even before ChatGPT, the degradation of online content was already happening - SEO farms, worsening Google search. Most articles you'd find online would be paywalled; most information about specific things would turn out to be a frustrating SEO labyrinth.

The current internet is awful, and there's so much AI/bot content, but I can find far more detailed information using AI-enabled search that isn't covered in ads. I can get an initial overview of a methodology without trawling through SEO articles.

I think AI has been almost a natural response to the enshittification of the internet - ChatGPT wouldn't seem so transformative if Google search had been working like Google search, rather than Ad Generator 5000, before it released.


Yeah, the internet has been shitty for, uh, decades now. 15-odd years ago people were already complaining about listicles and YouTube comments.

Best thing to do is to avoid idly browsing social media and curate your internet experience.


Yeah, agreed. I think that's partially why I still find AI useful - it acts as a filter against a lot of the listicle-style content. I can then curate and look at specific blog writers I follow for actual content.

But honestly, if Google provided me with a good search, I would probably seriously reduce my AI usage when researching.


Proposing a definition of slop: content optimized for profitability, regardless of quality.

If AI slop is replacing the content you were consuming, it was already slop.


That's silly. I can make slop without worrying about profit, too.

[flagged]


Social media was raw sewage even before AI, and the WWW was 90% SEO spam generated by third-worlders for $2 a day.

I wonder if they will start deliberately not scraping social media because of the low-quality human content and the AI sloppiness of it.

Suddenly the confirmed quality of the scraped data will be at a premium... "Scrape Engine Optimizers"?


Perhaps the business of the future will be "cleaning up, eliminating" rather than "creating."

With AI, we have an exponential level of productivity. But what is being produced? 90%: garbage.

The problem is that what is being produced is essentially "garbage" generated by models trained on garbage. Quality knowledge is increasingly submerged and suffocated by spam and low-quality content.

The real challenge of the future will be filtering and cleaning up, on each level.


It already is a problem, and maybe this is an unpopular opinion, but... that's a good thing. The LLM collapse can't possibly come soon enough. In principle, LLMs can be a good thing, but they can't overcome human nature - laziness and the unstoppable desire to take the shortest path. It's those two things that have turned the internet into the absolute dump it is today. Not to mention the bullshitter economy, as I like to call it, and everything that comes with it. And all things considered, society does need some reset at this point; the AI bubble might be a good place to kick things off.

It will not collapse, just be there and disappoint us.

I think that *in the future* this will be a non-problem, as reality itself is a much better validator for behavior than human text.

We already see this with synthetic training data that basically uses logic, in the form of math and code, as a constraint.
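The "math and code as constraint" idea can be sketched in a few lines: keep a generated solution only if its paired tests pass. This is a hypothetical illustration, not any lab's actual pipeline; `passes_checks` is my own name, and a real system would sandbox execution rather than call bare `exec()`.

```python
# Hypothetical sketch of using code execution as a validator for
# synthetic training data: a candidate solution is kept only if its
# paired tests accept it. Real pipelines would sandbox this; bare
# exec() is used here only for brevity.

def passes_checks(candidate_src: str, test_src: str) -> bool:
    """Return True if candidate_src defines code that test_src accepts."""
    env: dict = {}
    try:
        exec(candidate_src, env)  # define the candidate solution
        exec(test_src, env)       # tests raise AssertionError on failure
        return True
    except Exception:
        return False
```

Generated samples that fail their own tests are filtered out before they ever reach the training set, which is the "constraint" the comment describes.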


Will AI test recipes and try to find the cafe with the nicest vibe? Will it do original research about things that have never been written down yet?

I've heard this argument before, but you don't need to think too hard to see the limitations of a machine with no senses.


Yep, models trained on their own output degrade.

There's a paper on it somewhere.


> The internet becoming majority bot content basically guarantees this becomes a real problem for the next generation of models.

Only if you assume that people who train models are stupid.


Stupidity has nothing to do with it. AI articles and comments are now posted everywhere and presented as human. It's becoming harder and harder to determine whether text was written by humans or AI. Where are they supposed to find content to train AI on that isn't polluted with AI content that'll result in a feedback loop? It's like trying to get pure soil and water for growing food that isn't contaminated with microplastics/nanoplastics and PFAS. There was a time when it was possible. Not anymore. The filth is everywhere and impossible to filter.

And it's simply not reasonable for AI companies to have human hands read through individual comments everywhere from beginning to end to build their training data. There isn't enough time in the universe to advance AI while doing that and also being accurate. Something will always slip through.


You lack imagination.

Why would human review be the only possible way to remove enough of the tainted training data?

> Where are they supposed to find content to train AI on that isn't polluted with AI content that'll result in a feedback loop?

If nothing else: you could look for old data. At the moment, training assumes that input data is essentially without limit. But machine learning has lots and lots of old and proven techniques for what to do when your training data is limited.

You can also look into techniques for avoiding model collapse. Just because one group of researchers showed that this happens with some specific models doesn't mean it needs to happen in general.


If AI data recognition can be automated, AI data that avoids recognition can be automated. It has nothing to do with imagination. It's simple reality.

You still lack imagination!

Assume it's exactly the arms race you suggest. So OpenAI develops a new AI-content detection technique today, and tomorrow someone brings out something that defeats it. In a month OpenAI brings out another new technique that is unbeaten for only a day. Rinse and repeat.

Is that what you had in mind?

In any case, OpenAI knows exactly what it crawled when. So if it has a technique that only got beaten at timestamp X, then everything they crawled up to X is good to use.
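That cutoff idea is simple enough to sketch. Assuming the crawler records a timestamp per document (the type and field names here are hypothetical, not any real crawler's schema), the filter is a one-liner:

```python
from dataclasses import dataclass

# Hypothetical sketch of timestamp-based filtering: keep only documents
# crawled while the AI-content detector of that era was still believed
# unbeaten. Names are illustrative, not any lab's real pipeline.

@dataclass
class CrawledDoc:
    url: str
    crawled_at: float  # Unix timestamp of the crawl
    text: str

def usable_for_training(docs, detector_beaten_at):
    """Keep documents crawled strictly before the detector was defeated."""
    return [d for d in docs if d.crawled_at < detector_beaten_at]
```

Each time a detector falls, the usable window is frozen at that timestamp, and a newer detector reopens it going forward.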


That solves nothing, because that's already what they're doing. The problem of training is adding new data beyond what's currently validated, into the future. And it'll get harder to validate it.

Huh? With the approach outlined above you can continuously keep adding new validated data.

> And it'll get harder to validate it.

Sure, and that's true in general: advances become harder, because we pick the low hanging fruit first anyway. Nothing new about that.

The arms race I described is also only one approach. Model trainers will also want to investigate economising on training data, making their approaches more robust to model collapse, multi-modal training, and a million other strategies and tricks that I can't think of in thirty seconds.


You lack imagination if you can't see how it'll get harder. :)

It'll get harder, sure. I never said it wouldn't.

The very comment you just replied to said so.


If you can devise a tool that can detect AI generated content, you can use it to filter data. But the harsh truth is that "gold standard" training data is from before 2022 or whenever the cutoff was.

And even that needs to be curated because before AI tools there was bot content filling up the internet.

...and even without bots, a lot of human-authored content is low value, poorly written, etc.

There are (probably) companies out there whose business is to create, curate and improve training sets.


And if there's a way to detect that content is AI generated, then there's demand to generate content that seems more human. And we're already at a point where most people have been tricked into believing AI content was real at some point and never even realized it. It'll only get worse.

Probably the only real way to validate that content is real is building a validation system into devices. Confirm when a photo is taken and send an ID to a server; then, when photos are shared, the ID is compared to the image on the camera/phone manufacturer's server. For text, validate every little key press. There are still ways to game these systems, but I would not be surprised if they're introduced to mitigate AI diffusing everywhere.
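A minimal sketch of that device-attestation idea, assuming a shared per-device key. This is only an illustration: real provenance schemes (e.g. C2PA) use asymmetric keys in secure hardware, and both function names here are made up; HMAC just keeps the example short.

```python
import hashlib
import hmac

# Hypothetical sketch: the camera signs a hash of the photo with a
# per-device key, and the manufacturer's server later verifies the
# signature against the shared bytes. Tampering with the photo breaks
# verification.

def device_sign(photo_bytes: bytes, device_key: bytes) -> str:
    """Runs on the device: sign a digest of the captured photo."""
    digest = hashlib.sha256(photo_bytes).digest()
    return hmac.new(device_key, digest, hashlib.sha256).hexdigest()

def server_verify(photo_bytes: bytes, signature: str, device_key: bytes) -> bool:
    """Runs server-side: recompute the signature and compare safely."""
    expected = device_sign(photo_bytes, device_key)
    return hmac.compare_digest(expected, signature)
```

The obvious gaming vector the comment mentions survives here too: anyone who extracts the device key can attest to arbitrary bytes, which is why real schemes bury the key in secure hardware.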


Plenty of shortsighted people have done things that are stupid in the long term for short-term gains. It's been the modus operandi of the US economy since 1981.

I don't believe that the people who train models have a secret way of identifying and filtering out bot-generated content that no one else (email spam filters, search engines, etc.) has identified. I do believe that they feel their models need to have up-to-date information on a variety of topics, which requires regularly ingesting new data. So no, I don't think they have a good way to avoid their inputs rotting from their outputs.


What's so special about 1981?

And what do you mean about short term gains? If you are training a model, and you see model collapse, where's the short term gain? I don't get it.

The incentive for the person who trains a model, even in the short run, is for them to avoid model collapse.

> I don't believe that the people who train models have a secret way of identifying and filtering out bot-generated content that no one else (email spam filters, search engines, etc.) have identified.

Huh, why would they need a secret filter? A filter is only one thing you can try. You can also look into using different models, different training, making your approach more resistant to model collapse, training multi-modal models, using approaches to economise on training data, and thousands of other ideas I can't think of in thirty seconds.

> So no, I don't think they have a good way to avoid their inputs rotting from their outputs.

You lack imagination. People can be remarkably clever if it's in their (short-term!) interest to find solutions.


> What's so special about 1981?

Two very significant things happened in 1981.

After years of claiming the government couldn't help people, Ronald Reagan was elected and Republicans have been working hard to make that statement more true ever since. A big part of that was deregulation of the financial markets.

That same year, Jack Welch became Chairman and CEO of General Electric. He juiced the stock price by selling off the company's crown jewels, real estate, and future, and for a while (before the utter collapse of the company) artificially raised the stock price so high that executives around the country copied him, and an entire industry of vultures like Mitt Romney started private equity firms to cannibalize healthy companies for their personal profit.


They can be smart and also constrained by the need to train new models while having already run out of human-generated data.

> Only if you assume that people who train models are stupid

Someone in the chain will be. Even the smartest people buy a lot of their training datasets. What happens when those get contaminated?


You filter them, duh. And you negotiate a contract where the seller bears some of that risk (or you pay less, if they are not willing to make any such warranties.)

> You filter them, duh.

Filters are also not 100% infallible

> And you negotiate a contract where the seller bears some of that risk

So the training data will be polluted anyway, but "the seller will bear some risk"


> Filters are also not 100% infallible

Why would they need to be?


You want your training data to be clean or contaminated?

A small number of samples can poison LLMs of any size https://www.anthropic.com/research/small-samples-poison

--- start quote ---

In a joint study with the UK AI Security Institute and the Alan Turing Institute, we found that as few as 250 malicious documents can produce a "backdoor" vulnerability in a large language model—regardless of model size or training data volume. Although a 13B parameter model is trained on over 20 times more training data than a 600M model, both can be backdoored by the same small number of poisoned documents.

--- end quote ---


I thought we were talking about model collapse?

Poisoning is a completely different topic.


We were talking about this: https://news.ycombinator.com/item?id=47571715 and literally every single comment under this is talking about that.

It's not a different topic. It's literally the topic of this branch of discussion.



--- start quote ---

> The internet becoming majority bot content basically guarantees this becomes a real problem for the next generation of models.

Only if you assume that people who train models are stupid.

--- end quote ---

And then literally everyone who commented on this, including me, was talking about issues with training data contamination. And you are the only one dismissing it as nothing important that can be easily fixed.


Look at the whole comment, instead of selectively quoting:

> The bigger concern is what happens when AI models start training on AI-generated content at scale. We're already seeing model collapse in research papers where output quality degrades when training data is contaminated with synthetic text. The internet becoming majority bot content basically guarantees this becomes a real problem for the next generation of models.

Model collapse.


Not stupid, but I think it's fair to say "careless about/unaware of the wider impact of their work".

What do you mean by wider impact? Model collapse would be the opposite of a wider impact: it's an immediate impact, and I'm fairly sure the people training these models have good incentives to avoid that.

E.g. by filtering data, by procuring better data, by applying techniques for making do with more limited data (we used to have a lot of those, and they are still known), or by adapting your training process to be less vulnerable to model collapse. Just because some researchers have shown that this happened for the models they tested doesn't mean it has to be a universal thing.


That's a very interesting question I've been pondering. If all content is AI-generated, where will innovation come from? Maybe we should differentiate AI-assisted content from AI garbage content.

More profitable not to innovate and form cartels.

Are you saying that high-quality human-curated content will be rare and more appreciated in the future, compared to endless cheap slop? Can't say I am sad; on the contrary.

You should be careful letting what you want to be true cloud your judgement about what likely is true.

[flagged]


The value to consumers goes up, but that's pointless if they are drowned out by AI overviews paraphrasing their work, half a page of sponsored results, then AI-written SEO spam.

I'm a creator of such content, and like everyone else, I have to make do with 60-70% less traffic now.


At the end of the day, the value of producing content will drop to zero and the value of curating content will skyrocket.

Why should they care about new content? Game over already. Just keep regurgitating the same slop to the masses. Even before AI it was like this. How many 2-minute pop songs use the same chord structure? Just keep selling the same thing, slightly permuted (or not) from the last. That's capitalism, baby. This isn't a science.

A lot of people were running original websites, reviewing stuff, blazing new trails, making new art.

It's just harder when you cut all traffic to them, devalue their work and fill the air with AI noise.


Very few of them actually made money though, compared to people who just took an existing idea they could already order from China in bulk and marketed the living hell out of it. These companies obviously don't care about art and stuff like that; they really just care about the money.

Does that make it okay? Some websites weren't free enough and their owners not passionate enough, so wholesale destruction of that ecosystem is acceptable?

We'll have the internet we deserve


Yeah, it is somewhat funny to read the kind of people that for years looked down on the humanities suddenly coming up with ideas that were described decades or even centuries ago.

Marx, Nietzsche, Debord, Foucault, Baudrillard, Adorno - they already saw the writing on the wall, or at least fragments of it.


It's not a problem at all. Humans also read books written by humans.

One interesting dynamic here is that AI increases content supply much faster than human attention grows.

Which means filtering and ranking systems become the main bottleneck.

That pushes platforms toward stronger algorithmic selection and sometimes stronger convergence of attention.


I think that's the bigger issue.

Once content gets cheap, the winners are less likely to be the best creators and more likely to be the strongest gatekeepers.



