Huge proportion of internet is AI-generated slime, researchers find

chacham15 · on Jan 20, 2024

The title is clickbait. The actual study[0] reads as follows:

> Multi-way parallel, machine generated content not only dominates the translations in lower resource languages; it also constitutes a large fraction of the total web content in those languages

This is talking specifically about text being translated into less common languages being done mostly by machine translation and that in those languages most of the content found was as a result of machine translation. This has nothing to do with English or more common languages overall.

[0] https://arxiv.org/pdf/2401.05749.pdf

hhs · on Jan 20, 2024

To be fair, the author does link to that study in the first paragraph of this piece, and then adds some context about languages near the end:

“But while the English-language web is experiencing a steady — if palpable — AI creep, this new study suggests that the issue is far more pressing for many non-English speakers.

What's worse, the prevalence of AI-spun gibberish might make effectively training AI models in lower-resource languages nearly impossible in the long run. To train an advanced LLM, AI scientists need large amounts of high-quality data, which they generally get by scraping the web. If a given area of the internet is already overrun by nonsensical AI translations, the possibility of training advanced models in rarer languages could be stunted before it even starts.”

figassis · on Jan 20, 2024

So it is AI generated slime for people in those languages.

mschuster91 · on Jan 20, 2024

No surprise there... the worst problem for me is that translators go away as a business as the job opportunities vanish, and with them, the ability to translate stuff that requires cultural context to properly translate (jokes, puns, word plays).

We will be literally dumbing down as a species as a result.

idiliv · on Jan 20, 2024

Hmm, are you sure that translations of LLMs like ChatGPT are not incorporating cultural context?

crote · on Jan 20, 2024

Yes. Any bilingual person could tell you this.

For example, let's take the joke "What do you call the head of a school of fish? A Sardean" and translate it into Dutch.

Google Translate will give you "Hoe noem je de kop van een school vissen? Een Sardeaan." It's a mistranslation because "kop" explicitly means "head" as in the thing on your neck.

ChatGPT will give you "Hoe noem je het hoofd van een school vissen? Een Sardeaan." The first part now works, but the wordplay in "sardean" is still completely gone.

A proper translation does exist, for example "Hoe noem je het hoofd van een school vissen? Een direcsteur." In this case "directeur" is the translation of "director" and is a word commonly used for the person in charge of an instution. Similarly, "steur" is a type of fish - namely sturgeon. It's a pretty direct translation, but you need to be able to deal with wordplay to make it.

So no, ChatGPT translations do not properly incorporate context yet.

teaearlgraycold · on Jan 20, 2024

Bi-lingual people aren’t going away

bluGill · on Jan 20, 2024

Translations is a skill that no all bi-lingual people have. Most can hold a conversation, but that doesn't mean they can do a good translations.

A friend of mine was asked to translate in Church once, after wards he was told that polite people do not speak like that. (he really got fluent working as manager in a factory in a foreign country so you can imagine the language he learned from his peers).

thfuran · on Jan 20, 2024

When was the last time you hired a person to translate a book or movie for you?

NoZebra120vClip · on Jan 20, 2024

There are supposedly a lot of government jobs for interpreters. If you receive any communications from an agency, it may have 2+ pages in different languages which explain your rights to receive translated materials and to request someone to interpret into your native language. The same with court proceedings.

mattgreenrocks · on Jan 20, 2024

I cannot help but grin as I read this. It’s as if AI will eat itself.

I have to bet this represents a severe existential threat to companies who don’t already have a war chest of decent training data.

2OEH8eoCRo0 · on Jan 20, 2024

Humans eat themselves. Humans read and watch humans and are still able to learn and produce.

I think a bigger threat is siloing information behind logins or apps. Less information is out there free to be used and learned from and trained with.

llamaimperative · on Jan 20, 2024

Humans have novel direct inputs from non-human sources pretty much 24/7 x billions of sensory cells x billions of people.

r0ckarong · on Jan 20, 2024

At some point we're going to just produce all this electricity so some AI can troll itself on the blockchain to decide who is going to trigger the nukes.

feverzsj · on Jan 20, 2024

Soon, ad blockers will only need to supply whitelists.

kevin_thibedeau · on Jan 20, 2024

I got halfway through an AI article the other day before it veered into a completely different random topic and revealed itself. Kagi should add a way to flag such sites. Implement a reputation system to minimize gaming from adversaries while still taking in revenue from them.

calamari4065 · on Jan 20, 2024

When I see those, I just block the domain from my results. Sites that publish these unchecked AI pieces usually don't have anything else worthy of my time

D13Fd · on Jan 20, 2024

This is a great idea. It might be even better to block them at the DNS level via a hosts file or similar, so that you don't waste your time when an otherwise trusted site links you to them.

Ensorceled · on Jan 20, 2024

Do we have a backup of the pre 2023 internet (way back machine, wikipedia, etc.)

runeofdoom · on Jan 20, 2024

Good question. Seeing this certainly makes me think I need to make politely scraping a few valued sites (mostly forums) for backup as personal reference a higher priority.

orbital-decay · on Jan 20, 2024

The study they refer to [1] seems to be about machine-translated versions of human-written pages (that exist on a large scale for a ~decade already). The article somehow blows it out of proportion, it's not like most of what you're reading is generated by the current crop of large transformers.

[1] https://arxiv.org/pdf/2401.05749.pdf

alwa · on Jan 20, 2024

Not my domain, and I know I’m amongst experts. But at the risk of stating the obvious: It feels like the claim here (and elsewhere) is that we’re near the breaking point of the incentive model that’s propelled knowledge out of human minds and onto the web-as-we-know-it in coherent, discoverable, standardized, useful form.

I’ve always been a little old-fashioned, in that I prefer to trust specific bodies of writing and specific humans for knowledge about specific topics, even if that keeps me slow and behind the zeitgeist.

But in any case, now that webpages-for-traffic well seems on the verge of being too polluted to drink from.

What’s the next paradigm? Walled gardens of proven-provenance content for our AI summarizers to wade through? AI-vs-AI arms race? Or does the web become more about underlying facts and structured data, and meaning and insight become less commoditized and more person-to-person again?

I mean are any of these tensions really new, or is this just a Google problem?

SimianLogic · on Jan 20, 2024

I suspect “AI-generated slime” is on average higher quality than what most content mills have been pumping out without AI for the last 10 years (plus).

runeofdoom · on Jan 20, 2024

What is better or worse though - a million 2/10 sites or a hundred million 3/10 sites?

_cgtv · on Jan 20, 2024

so its self destructing

jug · on Jan 20, 2024

Note: The premise of this article is that they call machine translations "AI generated slime".

It is NOT about ChatGPT generated articles.

It's ironic to me, because the ONE thing I think modern AI has no doubt improved is machine translations. DeepL is often miles ahead of what we had, and while LLM's are not trustworthy scientific experts in all fields, if they are anything almost by definition (as LANGUAGE models), it is that they are linguistic experts. Iceland is famously using GPT-4 for language preservation because it's as good at Icelandic as an expert and native speaker.

So please for the love of god let's abandon the former generation of machine translation and let's welcome AI translations with open arms for improved accessibility and cross-culture reach. And let's stop looking down at AI translations just because you see red as you read the word "AI".

Not sure I need to add that I find this article complete junk. As usual with Futurism.com content.

CottonMcKnight · on Jan 20, 2024

enshittification intensifies

American Dialect Society nailed it.

peter_d_sherman · on Jan 20, 2024

99.99% of the Internet -- is crap -- AI generated or not.

But this is also true about books, movies, music, products, corporations, etc., etc.

Everything really...

The thing is though, that the other 0.01% of the Internet (and the other 0.01% of everything else) -- are the proverbial "diamonds in the rough" -- the things that have great value...

But you gotta search to find them...

You know, "seek and ye shall find", "leave no stone unturned", etc., etc.

Ironically, Google's search engine, whose main rise to fame was caused by too little information on the Internet -- is now completely overwhelmed with too much crappy/spammy/subprime/agenda-based/advertising/biased/subpar/TL;DR/unnecessary information.

In other words, we've went from "too thin" to "too fat", from not enough information to too much information...

Google Search Engine's primary virtue -- its ability to find things on the Internet -- has now become its Achilles' Heel, information-wise...

(And I say that as a great fan of Google! At least at this point in time, 2024, i.e., from 1999-2024 -- the first 25 years of the company!)

Perhaps I'm guessing (as opposed to knowing), but I would say that the rise of ChatGPT (and other AI LLM online chat systems) was at least caused in part by too much information.

Think of ChatGPT -- not as a futuristic scary AI (although it could certainly become that too!) -- but as a more human-friendly filter (and that's the keyword, "filter") of information -- than the Google Search Engine is or ever was...

And that's what we need right now more than anything else -- intelligent filters to block out too much information...

The upcoming problem is (or will be!) -- since censorship and intelligent information filtering are twin siblings and run very similar parallel paths -- how do we distinguish one from the other?

How do we permit one, but not the other?

How do we permit intelligent information filtering, but not permit censorship?

You see, there's a very fine line between the two!

A very fine line!

What are we going to do, have a programmer code for what that line is? Have a 3rd party AI determine that exact line? Have a/the government(s) decide what that is?

?

It is or will be a future problem with no easy answer no apparent solution -- and it is starting to form as of this current day!

Perhaps the original unfiltered Google Search of 20+ years ago -- is not looking all that bad in comparison! :-)

runeofdoom · on Jan 20, 2024

George Dyson has what I found to be an enlightening metaphor for how our relationship with information has changed. It is repeated here: https://edu.blogs.com/edublogs/2010/01/george-dyson-media-li...