
Why do you call your language model “transformer”?


Language is the language model that extends Transformer. Transformer is a base model for any kind of token (words, pixels, etc.).

However, currently there is some language-specific stuff in Transformer that should be moved to Language :) I'm focusing first on language models, and getting into image generation next.


No, I mean, a transformer is a very specific model architecture, and your simple language model has nothing to do with that architecture. Unless I’m missing something.


I still call it a transformer because the inputs are tokenized and computed to produce completions, rather than looked up or assembled based on rules.

> Unless I'm missing something.

Only that I said "without taking the LLM approach", meaning tokens aren't scored as high-dimensional vectors, just as far simpler JSON bigrams. I don't think that disqualifies using the term "transformer" - I didn't want to call it a "computer" or a "completer". Have a better word?

> JSON instead of vectors

I did experiment with a low-dimensional vector approach from scratch, you can paste this into your browser console: https://gist.github.com/bennyschmidt/ba79ba64faa5ba18334b4ae...

But the n-gram approach is better here; I don't think vectors start to pull away on accuracy until they capture a lot more contextual information (whereas a lot of context is already inferred from the structure of an n-gram).
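
To make the comparison concrete, here is a minimal sketch of the JSON bigram idea (hypothetical `train`/`complete` names and toy data, not the actual next-token-prediction code):

    // Hypothetical sketch: count which token follows which (a JSON bigram table).
    function train(text) {
      const bigrams = {};
      const tokens = text.toLowerCase().split(/\s+/).filter(Boolean);

      for (let i = 0; i < tokens.length - 1; i++) {
        const current = tokens[i];
        const next = tokens[i + 1];

        bigrams[current] = bigrams[current] || {};
        bigrams[current][next] = (bigrams[current][next] || 0) + 1;
      }

      return bigrams;
    }

    // Complete: repeatedly pick the most frequent next token.
    function complete(bigrams, prompt, maxTokens = 5) {
      const output = prompt.toLowerCase().split(/\s+/);

      for (let i = 0; i < maxTokens; i++) {
        const candidates = bigrams[output[output.length - 1]];
        if (!candidates) break;

        const next = Object.keys(candidates)
          .sort((a, b) => candidates[b] - candidates[a])[0];

        output.push(next);
      }

      return output.join(' ');
    }

    const model = train('Paris is the capital of France. Paris is known for the Eiffel Tower.');
    console.log(complete(model, 'Paris'));  // "paris is the capital of france."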


Calling it a "transformer" is misleading when discussing language modelling, because the word now means a very specific ML architecture, while your project seems to be Markov chains plus hardcoded regexp rules: https://github.com/bennyschmidt/llimo/blob/master/models/Cha...

The idea of tokenizing words and producing completions is not unique to the original transformers; it's a basic idea from NLP. So I'm not sure why you think it should be called a transformer just because it uses tokenized inputs and produces completions as well. It's like saying your new programming language has a "Java-based architecture" simply because they both have classes (and nothing else in common otherwise).

>I didn't want to call it a "computer" or a "completer". Have a better word?

I've seen projects which also use Markov chains + additional rules on top; for example, there are quite a few projects called "Markov chains with POS tagging":

https://github.com/26medias/context-aware-markov-chains

>not from lookups or assembling based on rules.

Not quite sure about "it's not based on rules" when your code has things like:

   const MATCH_FIRST_MODAL = new RegExp(/IS|AM|ARE|WAS|HAS|HAVE|HAD|MUST|MAY|MIGHT|WERE|WILL|SHALL|CAN|COULD|WOULD|SHOULD|OUGHT|DOES|DID/);
or

   const properNoun = `${part.value} `;
   if (isPrevNNP) {
      result += prependArticle(query, properNoun);
   }
Pretty sure your examples in the video are also cherry-picked. The very first example is you asking "where is Paris?" What really happens is, one of the hardcoded regexps transforms it to "Paris is" and then the bigram model repeats the second sentence in the Paris dataset verbatim.


It's literally what it is. You for some reason think transformers are unique to language models - boy are you late to the game https://en.wikipedia.org/wiki/Transformation_matrix

A CSS matrix "transform" is the same concept.

Same with tile engines & game dev. Say I wanted to rotate a map:

Input

    [
      [0, 0, 1],
      [0, 0, 0],
      [0, 0, 0]
    ]

Output

    [
      [0, 0, 0],
      [0, 0, 0],
      [0, 0, 1]
    ]

The function is a "transformer" because it is not looking up some rule that says where to put the new values; it's performing math on the data structure, and that result determines the new values.
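
A minimal sketch of what such a rotation transformer could look like (hypothetical function name, assuming a square grid rotated 90 degrees clockwise):

    // Hypothetical sketch: rotate a square grid 90 degrees clockwise.
    // Each value's new position is computed from its indices -
    // nothing is looked up from a table of placement rules.
    function rotate(grid) {
      const size = grid.length;

      return grid.map((row, y) =>
        row.map((_, x) => grid[size - 1 - x][y])
      );
    }

    console.log(rotate([
      [0, 0, 1],
      [0, 0, 0],
      [0, 0, 0]
    ]));
    // -> [[0, 0, 0], [0, 0, 0], [0, 0, 1]]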

> Not quite sure about "it's not based on rules" when your code has things like:
>
> const MATCH_FIRST_MODAL

Totally irrelevant to the topic. This is the chat interface itself, which mostly just parses questions into cursors to be completed. You would be a fool to think ChatGPT has no NLP or parts-of-speech analysis - text-ada-embedding itself uses POS.

> Pretty sure your examples in the video are also cherry-picked

Fantastic detective work, you caught me. But just to confirm - why not just use it yourself? npm i next-token-prediction

Here is an example you can run very easily in Chrome, so you don't have to rely solely on your amazing bullshit detector: https://github.com/bennyschmidt/next-token-prediction/tree/m...

Don't forget to log the completions to prove that they aren't broken down by token and are instead just key/val lookups or text searches, as you said.

> What really happens is, one of the hardcoded regexps transforms it to "Paris is"

The only thing you got right - that questions are transformed into sentences using conventional NLP in order to complete them. This functionality is what makes it a chat bot that you can ask questions.
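
As a purely hypothetical illustration of that step (not the actual llimo code), a question-to-cursor transform might look something like:

    // Hypothetical sketch: turn a "where/what is X?" question into a
    // sentence fragment ("cursor") that an n-gram model can complete.
    function toCursor(question) {
      const words = question.replace(/\?/g, '').trim().split(/\s+/);

      // "Where is Paris" -> wh = "Where", verb = "is", subject = ["Paris"]
      const [wh, verb, ...subject] = words;

      // Reorder to "Paris is" and let the model complete the rest.
      return `${subject.join(' ')} ${verb}`;
    }

    console.log(toCursor('Where is Paris?'));  // "Paris is"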


>A CSS matrix "transform" is the same concept

It's still misleading to call it a transformer in the context of NLP. It doesn't matter what it means in other, non-NLP areas (linear algebra, CSS or gamedev).

It's like creating a procedural language and calling it "functional" because it has functions. Sure, the concept of functions existed long before compsci, but it would be very misleading because "functional programming" is a well-established term.

>You would be a fool to think ChatGPT has no NLP or parts-of-speech analysis

Pretty sure it doesn't. At least it's not required to. I've run lots of local models and it's just model weights without hardcoded regexps. In fact, I was able to feed the grammar rules of an invented language into Claude Sonnet and it was able to construct proper sentences.

>text-ada-embedding itself uses POS

Do you have a link?


Again, they are the exact same concept. Whether vectors represent tiles in a video game, an object in CSS, the matrix algebra you took in school, or the semantics of words used by LLMs, in all cases it's the same meaning of the word "transform". It's not specific to language models at all - which was the thesis of your whole argument.

> it's not required to. I've run lots of models

Then you must know about skip-gram and how embeddings are trained: https://medium.com/@corymaklin/word2vec-skip-gram-904775613b...

What is meant by a "sliding window" or "skip-gram" is bigram mapping (or other n-gram mapping).

This is ML 101.

It's the same training methodology and data structure used in my next-token-prediction lib, and it's widely used for training LLMs. Ask your local AI to explain the basics, or see examples like: https://www.kaggle.com/code/hamishdickson/training-and-plott...
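
For illustration, a minimal sketch of the sliding-window idea (hypothetical function name; window size is a parameter):

    // Hypothetical sketch: slide a window over the tokens and collect
    // (center, context) pairs - the raw material for skip-gram training.
    function slidingWindowPairs(text, windowSize = 2) {
      const tokens = text.toLowerCase().split(/\s+/).filter(Boolean);
      const pairs = [];

      tokens.forEach((center, i) => {
        for (let j = i - windowSize; j <= i + windowSize; j++) {
          if (j !== i && j >= 0 && j < tokens.length) {
            pairs.push([center, tokens[j]]);
          }
        }
      });

      return pairs;
    }

    console.log(slidingWindowPairs('Paris is the capital of France', 1));
    // [["paris","is"], ["is","paris"], ["is","the"], ...]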

> ChatGPT doesn't use parts-of-speech

Yes it does. Not only is there a huge business in tagging data (both POS and NER) adjacent to AI, but OpenAI famously used African workers on very low wages to tag a bunch of data. ChatGPT uses text-embedding-ada; you'll have to put two and two together, as they don't open source that part.

Mistral says:

"The preprocessing stage of Text-Embedding-ADA-002 involves applying POS tags to the input text using a separate POS tagger like Spacy or Stanford NLP. These POS tags can be useful for segmenting sentences into individual words or tokens."

> I use Claude to make new languages

Cool story, but it has nothing to do with the topic.


>It's not specific to language models at all - which was the thesis of your whole argument.

I didn't say that it's unique to LMs. My argument is that saying "my LM is a transformer" is misleading because "transformer" in the context of LMs means a very specific architecture. You're deliberately misusing terms, probably to draw attention to your project.

>OpenAI specifically famously used African workers on very low wages to tag a bunch of data

Did they tag Polish parts of speech too? Or Ancient Greek? ChatGPT constructs grammatically correct Ancient Greek. I thought they tagged "harmful/non-harmful", not parts of speech?

>ChatGPT uses text-embedding-ada

[Citation needed]

NanoGPT, for example, learns embeddings together with the rest of the network, so, as I said, manual tagging is not required.

Anyway, looking forward to hearing news about your image generation project. Any news?


Nobody denied the term is used in language models; I only pointed out that they use that term because of what it already means in the context of vector operations (long before OpenAI).

The Wikipedia article on deep learning transformers:

    All transformers have the same primary components:

    - Tokenizers, which convert text into tokens.

    - Embedding layer, which converts tokens and positions of the tokens into vector representations.

    - Transformer layers, which carry out repeated transformations on the vector representations, extracting more and more linguistic information. These consist of alternating attention and feedforward layers. There are two major types of transformer layers: encoder layers and decoder layers, with further variants.

    - Un-embedding layer, which converts the final vector representations back to a probability distribution over the tokens.
Where does it say bigrams can't be used for next-token prediction? Or that you can't tag data? Note "...which converts tokens and positions of the tokens..."

> You're deliberately misusing terms, probably to draw attention to your project.

Haha, well, since I have like 30 followers and the npm package is free/MIT, whatever scheme you think I'm up to isn't working. Anyway, a text autocomplete library is not exactly viral material. Jokes aside, no - I am trying to use accurate terms that make sense for the project.

I could just make it anonymous - `export default () => {}` - and call the file `model.js`. What would you call it?

> Did they tag Polish parts of speech too? Or Ancient Greek?

Yes, all the foreign words with special characters were tokenized and trained on. An LLM doesn't "know any language". If it never trained on any Polish word sequences, it would not be able to output very good Polish sequences any more than it could output good JavaScript. It's not that it has to train on Polish to translate Polish per se, but it does have to have language coverage at the token level to be able to perform such vector transformations - which is probably most easily accomplished by training on Polish-specific data.

See https://huggingface.co/pranaydeeps/Ancient-Greek-BERT

> The model was initialised from AUEB NLP Group's Greek BERT and subsequently trained on monolingual data from the First1KGreek Project, Perseus Digital Library, PROIEL Treebank and Gorman's Treebank

First1KGreek Project

> The goal of this project is to collect at least one edition of every Greek work composed between Homer and 250CE

> Citation needed

https://openai.com/index/new-and-improved-embedding-model/

> The new model, text-embedding-ada-002, replaces five separate models for text search, text similarity, and code search, and outperforms our previous most capable model, Davinci, at most tasks, while being priced 99.8% lower.

https://platform.openai.com/docs/guides/embeddings/embedding...

Scroll to embedding models

> Anyway, looking forward to hearing news about your image generation project. Any news?

Not yet! Feel free to follow on GitHub or even help out if you're really interested in it. Would be cool to have pixel prediction as snappy as text autocomplete.


For a century, transformer meant a very different thing. Power systems people are justifiably amused.


And it means something else in Hollywood. But we are discussing language models here, aren’t we?


And it fits the definition, doesn't it, since it tokenizes inputs to compute them against pre-trained ones, rather than being based on rules/lookups or arbitrary logic/algorithms?

Even in CSS a matrix "transform" is the same concept - the word "transform" is not unique to language models; it's more a reference to how one set of data becomes another by way of computation.

Same with tile engines / game dev. Say I wanted to rotate a map - this could be a simple 2D tic-tac-toe board, a 3D MMO tile map, or anything in between:

Input

    [
      [0, 0, 1],
      [0, 0, 0],
      [0, 0, 0]
    ]

Output

    [
      [0, 0, 0],
      [0, 0, 0],
      [0, 0, 1]
    ]

The method that takes the input and gives that output is called a "transformer" because it is not looking up some rule that says where to put the new values; it's performing math on the data structure, and that result determines the new values.

It's not unique to language models. If anything, vector word embeddings came to this concept much later than math and game dev did.

An example of the word "Transformer" used outside language models in JavaScript is Three.js's TransformControls: https://threejs.org/docs/#examples/en/controls/TransformCont...

I used Three.js to build https://www.playshadowvane.com/ - I built the engine from scratch and recall working with vectors (e.g. THREE.Vector3 for XYZ stuff) years before they were popularized by LLMs.


Wait, do you really not know what a transformer is in the context of ML? It’s been dominating the field for 7 years now.


Can't read? I just explained thoroughly what it is in the comment above. Do you understand what matrix transformations are?

Do you know that a vector in LLMs for word embeddings is the same thing as a vector in 3D game dev libraries like Three.js?

Sounds like you two are the only ones who don't get it.


Please do yourself a favor and google "transformer paper". Open the very first result and read the PDF. Hopefully it will become clear what people mean when they say "transformer" in an ML context, and you will finally realize how silly you look in this thread.


You guys both broke the site guidelines badly in this thread. We have to ban accounts that post like this, so please don't.

If you'd please review https://news.ycombinator.com/newsguidelines.html and stick to the rules when posting here, we'd appreciate it.


You still don't get it. For LLMs a "transformer architecture" only means one that:

- Tokenizes sequences

- Converts tokens to vectors

- Performs vector/matrix transformations

- Converts back to tokens

The matrix transformation part is why it's called a "transformer". Do some reading yourself: https://en.wikipedia.org/wiki/Transformer_(deep_learning_arc...

> how silly you look

You'll look twice as silly for thinking vectors are unique to LLMs, or that the word "transformer" has anything to do with LLMs rather than lower-level array math.

Consider that a "vector database" is a very specific technology - yet the word "vector" is not off limits in other database-related libraries, especially ones dealing with vectors.

In any case - if you think I'm trying to pass it off as something else, what I call a "transformer" does tokenize lots of text (breaking it down by ~word, ~pixel) and derive semantic values (AKA train) to produce real-time completions to inputs by way of math, not lookups. It fits the definition even in the sense where "transformer" meant something more abstract than the mathematical term.


You guys both broke the site guidelines badly in this thread. We have to ban accounts that post like this, so please don't.

If you'd please review https://news.ycombinator.com/newsguidelines.html and stick to the rules when posting here, we'd appreciate it.


I didn't know it was that strict; no offense to the other poster, it was just a little disagreement :)



