When an author uses AI for “polishing” a draft, they are not seeing improvement; they are witnessing semantic ablation. The AI identifies high-entropy clusters – the precise points where unique insights and “blood” reside – and systematically replaces them with the most probable, generic token sequences. What began as a jagged, precise Romanesque structure of stone is eroded into a polished, Baroque plastic shell: it looks “clean” to the casual eye, but its structural integrity – its “ciccia” – has been ablated to favor a hollow, frictionless aesthetic.

  • arcine@jlai.lu · 23 hours ago

    This reads as AI-generated. I don’t know if I’m being paranoid, but it has all the hallmarks of it.

  • Triumph@fedia.io · 2 days ago

    If you want your resume to pass the AI filter that decides whether a human sees it or not, that’s exactly what you want your chatbot to do for you.

  • hendrik@palaver.p3x.de · 2 days ago

    What kind of software are they talking about? Most of what I see is people pasting their text into ChatGPT. And to my knowledge, ChatGPT doesn’t really analyze the text directly; it just runs the language model and gives you its opinion?! Or is that the confirmed version of what ChatGPT etc. do under the hood?

    • Leon@pawb.social · 2 days ago

      What? It doesn’t give opinions. It doesn’t have opinions, it’s a probability matrix.

      Put very simply, when it’s “trained” it takes a lot of data as input, and out of that it builds a sort of “map” of relationships between the input data. When then given an input, it can look up that input in the “map” and give you the most probable following sequence.

      Put simply, if you give it “Once upon…” it’ll likely continue with “a time there was” because that’s a pattern it’ll have seen a lot.

      It’s an evolution of the autocomplete we have in phone keyboards. It’s more advanced, and by adjusting the weights of different neurons you can tweak the output. You can also train a model further to fit it to a specific purpose, but ultimately it’s just fancy autocomplete.
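The “fancy autocomplete” idea can be sketched in a few lines of Python: count which word follows which in a tiny made-up corpus, then always continue with the most frequent follower. The corpus and output here are invented purely for illustration; real models predict over sub-word tokens with learned weights, not raw counts.

```python
from collections import Counter, defaultdict

# Tiny invented corpus; a real model trains on terabytes of text.
corpus = "once upon a time there was a cat . once upon a time there was a dog .".split()

# Count, for each word, which words follow it and how often.
followers = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    followers[current][nxt] += 1

def continue_text(word, steps=4):
    """Greedily continue with the most probable next word at each step."""
    out = [word]
    for _ in range(steps):
        nxt = followers[out[-1]].most_common(1)[0][0]
        out.append(nxt)
    return " ".join(out)

print(continue_text("once"))  # "once upon a time there"
```

Greedy selection like this always yields the single most likely continuation, which is exactly why AI-polished text drifts toward the generic.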

      • hendrik@palaver.p3x.de · 2 days ago (edited)

        Ah right. No, I meant it in a very loose way. That ChatGPT won’t calculate elaborate maths to then do some mathematical transformations on your input text. Instead it’ll do its math and then come up with an “opinion” on what it wants to rephrase. Which is going to be an “opinion” in that it isn’t conclusive, in contrast to straightforward maths that changes the entropy? I mean otherwise it’d be very easy to change this? Just use a different formula?
        Or phrased a bit differently: if it doesn’t introduce its own opinion/judgement, what’s the issue here? Just configure it to stick with the entropy amount and distribution of the input?

        But I have no clue what they’re talking about. Maybe there are specific text editors/correctors out there that I’m not aware of… I’m not up to date with these things. Or they’re talking about copy-pasting into ChatGPT, which is what I’ve tried.

        • Leon@pawb.social · 1 day ago (edited)

          Ah I see what you’re getting at.

          I’d like to preface by apologising, because this became a very lengthy comment. I’ve written a TL;DR at the bottom that I think carries the main point across; all the rest is a semi-technical, rather loosey-goosey rundown of how language models work. I just hope it’s coherent enough for someone to understand what I’m trying to convey.

          So without further ado.

          A language model doesn’t really train on text. It trains on what’s called tokens. As you feed it training data, before it reaches the ML algorithm it goes through a tokeniser. Huggingface has a functional browser-based example here.

          A tokeniser essentially splits the input into chunks of characters (including whitespace, tabulators, carriage returns etc.) and assigns each chunk a numerical identifier. This is done to the entire dataset before you train the model.

          It could look something like this:

          ID     Token
          12805  Once
          5304   upon
          264    a
          892    time
          13     .

          Thus while you read "strawberry" as its own thing, an LLM might get the input 1, 496, 675, 15717, 1

          In essence instead of checking each character individually, you end up with a large dictionary of character groupings, with numerical equivalents, allowing you to do maths with them.
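The tokenisation step can be sketched as a toy lookup. The word-to-ID table reuses the example IDs from the table above; real tokenisers learn sub-word pieces (BPE and similar) rather than mapping whole words, so this is only a simplified sketch.

```python
# Invented toy vocabulary, using the example IDs from the table above.
vocab = {"Once": 12805, "upon": 5304, "a": 264, "time": 892, ".": 13}

def tokenize(text):
    """Naively split on whitespace, peel off a trailing full stop,
    and map each piece to its numerical ID."""
    pieces = []
    for word in text.split():
        if word.endswith(".") and len(word) > 1:
            pieces += [word[:-1], "."]
        else:
            pieces.append(word)
    return [vocab[p] for p in pieces]

print(tokenize("Once upon a time."))  # [12805, 5304, 264, 892, 13]
```

From this point on the model only ever sees the numbers, which is what makes the maths possible.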

          Which is what you do. After the tokenisation the algorithm generates embeddings, which are essentially meant to capture the semantics of language: tokens represent individual building blocks of language, and embeddings are what define the relationships between those tokens. These are stored in something called a tensor, which is in essence a multi-dimensional map. Just like how we map locations in 2D/3D space, machine learning algorithms map “concepts” in sometimes many hundred-dimensional spaces.

          The embeddings are how an LLM can infer that the words “conceal” and “hide” are related, and that the former is generally considered fancier than the latter. I can almost guarantee that if you were to ask an LLM to rephrase “Jane stashed the goods behind the crapper” in a fancier, more professional manner, it’d come up with something like “Jane concealed the items in the bathroom”.
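The idea that embeddings place related words close together can be illustrated with cosine similarity over made-up vectors. The numbers and the 3-dimensional space here are invented so that “conceal” and “hide” land near each other; real models learn hundreds of dimensions from data.

```python
import math

# Invented 3-dimensional "embeddings" for illustration only.
embeddings = {
    "conceal": [0.9, 0.8, 0.1],
    "hide":    [0.8, 0.9, 0.2],
    "banana":  [0.1, 0.0, 0.9],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: near 1 means 'related'."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

print(cosine_similarity(embeddings["conceal"], embeddings["hide"]))    # close to 1
print(cosine_similarity(embeddings["conceal"], embeddings["banana"]))  # much lower
```

Swapping a word for its nearest “fancier” neighbour in this space is essentially what AI polishing does.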

          This is in part what makes it so hard to glean information from a model: you can’t just open up the weights and extract the original training data. It’s been chunked, processed, and categorised, and what you end up with is just many different pointers to and between a (relatively) few tokens.

          For a very long time the context window of these models was very small, and as a result you ended up with outputs that weren’t very related to one another. I’m sure you’ve seen those memes where someone goes “Type ‘I wish’ and press the middle word in your keyboard and see what you get”, and they usually spiral off into nonsense.

          That’s where the transformer architecture (the T in GPT) came into play. In short, it allowed the models to have a larger “working memory” and thus they could retain and extend that semantic context further. They could build more advanced networks of relationships and it’s the source of the current “AI” craze. The models started inferring more distant relationships with words, which is what has given rise to this illusion of intelligence.

          Once you have a model trained it’s very hard to modify it. You can train auxiliary models to bias the model in various directions. You can write system prompts to try and coax the model into a certain kind of output, but since it isn’t actually a thinking thing, it can still go off script. You can also do a sort of reverse engineering, toggling certain neurons in the model on and off to see how one concept might relate to another, though just like with regular brains a single neuron doesn’t typically handle a single thing, so this is a very time-consuming task.

          In the end, the model you train is entirely deterministic, because it’s all mathematics. Computers are by their very nature deterministic. The model you train isn’t intelligent, and given a particular input it will always produce the same output.

          If you’ve played Minecraft you’re probably familiar with the concept of seeds. Just like an LLM, Minecraft’s world generation algorithm is deterministic, and if you provide a particular seed value for the randomiser, it will always produce the same world. If you don’t input a seed value the game generates a random value and uses that, which is why whenever you start a new world you’ll always end up with something new.
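The seed analogy can be shown directly with a seeded pseudo-random generator. This is a minimal sketch, not how Minecraft or any particular LLM runtime implements it; the point is only that the same seed always reproduces the same “world”.

```python
import random

def generate(seed):
    """Deterministically 'generate a world': same seed, same output."""
    rng = random.Random(seed)
    return [rng.randint(0, 99) for _ in range(5)]

print(generate(42) == generate(42))    # True: identical sequences every time
print(generate(42) == generate(1337))  # a different seed gives a different "world"
```

The model itself is the fixed part, like the world-generation algorithm; the seed is the only source of variation.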

          That’s basically what LLMs do too. When selecting words to continue the given input, they use a process called stochastic sampling. In essence, for each step, the model gets a bunch of probable tokens that might follow, organises these into a probability distribution shaped by a variable called temperature, and then selects a token from that distribution.

          The temperature value essentially controls how randomly it can select words. The lower the temperature setting, the more sharply peaked the distribution gets; with a really low temperature the deterministic nature of the model shines through. As the temperature increases, the curve flattens and more random tokens might get selected.
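The temperature mechanism can be sketched as a softmax over invented next-token scores: dividing the scores by a low temperature sharpens the distribution toward the top token, while a high temperature flattens it toward uniform. The scores below are made up; real models produce one score per vocabulary token.

```python
import math

# Invented next-token scores (logits) for a three-token toy vocabulary.
scores = {"a": 5.0, "the": 3.0, "banana": 1.0}

def softmax(scores, temperature):
    """Turn scores into a probability distribution, scaled by temperature."""
    exps = {tok: math.exp(s / temperature) for tok, s in scores.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

cold = softmax(scores, temperature=0.5)  # sharply peaked: almost always picks "a"
hot = softmax(scores, temperature=5.0)   # much flatter: "banana" has a real chance
print(cold["a"], hot["a"])
```

At temperature near zero this collapses into always picking the single most probable token, which is the fully deterministic case described above.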

          At this point, the big “AI” companies have basically sucked the data well dry. They’re trying to find more ways of making more data to train on, because what gave them the biggest, most remarkable progress in the past was increasing the quantity of training data. More and more LLM generated text is making it into these models, and existing patterns get reinforced.

          TL;DR

          I’ve written this entire comment myself. It is in a sense a mirror of me as a person; the way I punctuate things, the words I choose, the structure in which I’ve decided to describe things. You can infer bits and pieces about me from it; I’ve obviously had an interest in machine learning for a while, given the markdown usage I’m perhaps a bit more technically inclined, I might not be an English native, but I’ve a preference for British English.

          Now, I could feed this entire comment through an LLM and you’d get a coherent output. It’d likely change my verbiage, fix the way I punctuate things, perhaps restructure things and make the text overall neater.

          However, anything that was me in this text would be lost. There’d no longer be a person to infer anything about. No choices were made in the process of outputting the text. There is no inherent preference on anything because it’s all just normalised pseudo-random output from a weighted probability matrix based on a corpus of as much text as whoever trained the LLM could get their hands on, be that legally or otherwise.

          That is, I think, essentially what the article is talking about.

          • hendrik@palaver.p3x.de · 21 hours ago (edited)

            Thanks for the comprehensive write-up! I guess that makes a lot of sense. I mean if we’re just talking about regular AI assistant output, sure, I see that as well.

            I also have an additional issue with how these things are tuned… I never liked the tone, especially the one ChatGPT uses. It’s way too repetitive, in an annoyingly generic tone mixed with know-it-all vibes. But it doesn’t know it all. And then it talks to me like I’m 4 years old and it’s my helpful sycophant. It outputs 3 pages of text for any simple task/question and there’s almost no substance to all the many sentences. Unless it decides to lecture me on ethics… saying my email is phrased way too harshly. And then it goes ahead and replaces my witty sarcasm with some bland phraseology like we’re in some customer-support hell…

            I see no reason to use it as a tool to “refine” my emails. Though I think that’s mainly due to the role-playing as a “helpful assistant”, which people seem to prefer?! Not sure if that’s necessarily in the maths. But it’s enough to deter me… Well, that, and the fact that it removes key information in some ill-suited attempt to “summarize”, or brushes over important paragraphs.