Embeddings are the only aspect of modern AI I'm excited about, because they're the only one that gives more power to humans instead of taking it away. They're the "bicycle for our minds" of Steve Jobs fame: intelligence amplification, not intelligence replacement. IMO, the biggest improvement in computer usability in my lifetime was the introduction of fast and ubiquitous local search. I use Firefox's "Find in Page" feature probably 10 or more times per day. I use find and grep probably every day. When I read man pages or logs, I navigate by search. Git would be vastly less useful without git grep. Embeddings have the potential to fix the biggest weakness of search by giving us fuzzy search that's actually useful.
I've been experimenting with using embeddings to find relevant git commits, as I often don't know or remember the exact words that were used.
So I created my own little tool for embedding and finding commits by commit messages. Maybe you'll also find it useful:
https://github.com/adrianmfi/git-semantic-similarity
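For anyone curious, the core idea is tiny. Here's a rough sketch (not the linked tool's actual code; the model name and query are just illustrative):

```python
# Sketch: embed commit subject lines, then rank them against a fuzzy
# query by cosine similarity. Model name is a common default, not the
# tool's actual choice.
import subprocess

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

# Collect (hash, subject) pairs from the current repo.
log = subprocess.run(
    ["git", "log", "--pretty=format:%h %s"],
    capture_output=True, text=True, check=True,
).stdout
commits = [line.split(" ", 1) for line in log.splitlines() if " " in line]

# Embed all messages once, then score a query that need not share any words.
messages = [msg for _, msg in commits]
corpus = model.encode(messages, convert_to_tensor=True)
query = model.encode("fix race condition in cache invalidation", convert_to_tensor=True)

scores = util.cos_sim(query, corpus)[0]
for idx in scores.argsort(descending=True)[:5].tolist():
    print(f"{scores[idx].item():.3f}  {commits[idx][0]}  {messages[idx]}")
```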
So you're saying embeddings are fine, as long as we refrain from making full use of their capabilities? We've hit on a mathematical construct that seems to be able to capture understanding, and you're saying that the biggest models are too big, we need to scale down, only use embeddings for surface-level basic similarities?
I too think embeddings are vastly underutilized, and the chat interface is not the be-all and end-all (not to mention, "chat with your program/PDF/documentation" just sounds plain stupid). However, whether current AI tools are replacing or amplifying your intelligence is entirely down to how you use them.
As for search, yes, that was a huge breakthrough and a powerful amplifier. 2+ decades ago. At this point it's computer use 101 - which makes it sad when dealing with programs or websites that are opaque to search; truly "ubiquitous local search" is still not here. Embeddings can and hopefully will give us better fuzzy/semantic searching, but if you push this far enough, you'll have to stop and ask: if the search tool is now capable of understanding some aspects of my data, why not surface this understanding as a different view into the data, instead of just invoking it in the background when the user makes a search query?
I have found that embeddings + LLM is very successful. I'm going to make the words up so as not to reveal my work publicly, but I had to classify something into 3 categories. I asked a simple LLM to label it; it was 95% accurate. Taking the min distance from the word embeddings to the mean category embeddings was about 96%. When I gave the LLM the embedding prediction, the LLM was 98% accurate.
There were cases an embedding model might not do well on, whereas the LLM could handle them. For example: these were camel-case words, like WoodPecker, AquafinaBottle, and WoodStock (I changed the words to not reveal private data).
WoodPecker and WoodStock would end up close together because the word Wood dominated the embedding values, but these were supposed to go into 2 different categories.
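For reference, the centroid trick looks roughly like this; the categories, example words, and model are placeholders, in the same spirit as the disguised words above:

```python
# Sketch of the nearest-centroid approach: classify by minimum cosine
# distance to each category's mean embedding. All names here are made up.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# A few labeled examples per category (hypothetical stand-ins).
labeled = {
    "bird": ["WoodPecker", "BlueJay", "RedRobin"],
    "beverage": ["AquafinaBottle", "ColaCan", "JuiceBox"],
    "festival": ["WoodStock", "Coachella", "BurningMan"],
}

# Mean embedding per category (the "centroid").
centroids = {cat: model.encode(words).mean(axis=0) for cat, words in labeled.items()}

def cos_dist(a, b):
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def classify(word: str) -> str:
    v = model.encode(word)
    # Pick the category whose mean embedding is closest.
    return min(centroids, key=lambda cat: cos_dist(v, centroids[cat]))

# The failure mode above: the shared "Wood" token can pull this toward "bird".
print(classify("WoodStock"))
```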
> word Wood dominated the embedding values, but these were supposed to go into 2 different categories
When faced with a similar challenge we developed a custom tokenizer, pretrained a BERT base model[0], and finally trained a SPLADE-esque sparse embedding model[1] on top of that.
Do you mind sharing why you chose SPLADE-esque sparse embeddings?
I have been working on embeddings for a while.
For different reasons I have recently become very interested in learned sparse embeddings. So I am curious what led you to choose them for your application, and why?
> Do you mind sharing why you chose SPLADE-esque sparse embeddings?
I can share what I'm able to share publicly. The first thing we ever do is develop benchmarks, given the uniqueness of the nuclear energy space and our application. In this case it's FermiBench[0].
When working with operating nuclear power plants there are some fairly unique challenges:
1. Document collections tend to be in the billions of pages. When you have regulatory requirements to extensively document EVERYTHING and plants that have been operating for several decades you end up with a lot of data...
2. There are very strict security requirements - generally speaking everything is on-prem and hard air-gapped. We don't have the luxury of cloud elasticity. Sparse embeddings are very efficient, especially in terms of RAM and storage, which matters when factoring in budgetary constraints. We're already dropping in eight H100s (minimum), so the cost starts to creep up fast...
3. Existing document/record management systems in the nuclear space are keyword search based if they have search at all. This has led to substantial user conditioning - they're not exactly used to what we'd call "semantic search". Sparse embeddings in combination with other techniques bridge that well.
4. Interpretability. It's nice to be able to peek at an embedding and get something out of it at a glance (there's a toy sketch of this at the end of this comment).
So it's basically a combination of efficiency, performance, and meeting users where they are. Our Fermi model series is still v1 but we've found performance (in every sense of the word) to be very good based on benchmarking and initial user testing.
I should also add that some aspects of this (like the pretrained BERT) are fairly compute-intensive to train. Fortunately we work with the Department of Energy's Oak Ridge National Laboratory and developed all of this on Frontier[1] (for free).
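To make the efficiency and interpretability points concrete, here's a toy sketch of sparse retrieval. The document IDs, terms, and weights are invented; a real SPLADE-style model would learn the weights:

```python
# Toy illustration of why sparse embeddings are cheap to store and easy
# to read. Everything here is invented for illustration.
from collections import defaultdict

# Each embedding is {term: weight} with most dimensions zero, so we keep
# only the nonzero entries - and can interpret them at a glance.
docs = {
    "WO-1138": {"feedwater": 2.1, "pump": 1.7, "vibration": 1.4},
    "WO-2217": {"diesel": 2.3, "generator": 1.9, "surveillance": 1.2},
}

# Inverted index: term -> [(doc_id, weight)]. Memory scales with nonzero
# entries, not dimensions x documents as with dense vectors.
index = defaultdict(list)
for doc_id, vec in docs.items():
    for term, weight in vec.items():
        index[term].append((doc_id, weight))

def search(query_vec):
    # Sparse dot product: only terms the query shares with a doc contribute,
    # which also behaves much like the keyword search users are used to.
    scores = defaultdict(float)
    for term, qw in query_vec.items():
        for doc_id, dw in index.get(term, ()):
            scores[doc_id] += qw * dw
    return sorted(scores.items(), key=lambda kv: -kv[1])

print(search({"pump": 1.0, "vibration": 0.8}))  # WO-1138 ranks first
```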
> We've hit on a mathematical construct that seems to be able to capture understanding
I’m admittedly unfamiliar with the space, but having just done some reading, that doesn’t look to be true. Can you elaborate, please, and maybe point to some external support for such a bold claim?
> Can you elaborate please and maybe point to some external support for such a bold claim?
SOTA LLMs?
If you think about what, say, a chair or electricity or love are, or what it means for something to be something, etc., I believe you'll quickly realize that words and concepts don't have well-defined meanings. Rather, we define things in terms of other things, which themselves are defined in terms of other things, and so on. There's no atomic meaning; the meaning is in the relationships between the thought and other thoughts.
And that is exactly what those models capture. They're trained by consuming a large amount of text - but not random text, real text - and they end up positioning tokens as points in high-dimensional space. As you increase the number of dimensions, there's eventually enough of them that any relationship between any two tokens (or groups; grouping concepts out of tokens is just another relationship) can be encoded in the latent space as proximity along some vector.
You end up with a real computational artifact that implements the idea of defining concepts only in terms of other concepts. Now, between LLMs and the ability to identify and apply arbitrary concepts with vector math, I believe that's as close to the idea of "understanding" as we've ever come.
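For what it's worth, this is easy to see with off-the-shelf word vectors; here's a quick demo using gensim's downloadable GloVe vectors (just one convenient example):

```python
# The "apply concepts with vector math" claim, in runnable form.
# Assumes gensim is installed; this downloads a small pretrained GloVe model.
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-50")

# Relationships are directions in the space: king - man + woman lands near queen.
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))

# And proximity encodes relatedness between concepts.
print(vectors.similarity("chair", "furniture"))   # relatively high
print(vectors.similarity("chair", "electricity")) # lower
```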
That does sound a bit like Peircean semiotics, so I’m with you so far as the general concept of meaning being a sort of iterative construct.
Where I don’t follow is how a bitmap approximation captures that in a semiotic way. As far as I can tell the semiosis still is all occurring in the human observer of the machine’s output. The mathematics still hasn’t captured the interpretant so far as I can see.
Regardless of my possible incomprehension, I appreciate your elucidation. Thanks!
I feel like embeddings will be more powerful for understanding high-dimensional physics than language, because a chaotic system's predictability is limited by its compressibility. An embedding captures exactly how compressible the system is, and can therefore extend the predictability as far as possible.
All modern AI technology can give more power to humans; you just have to use the right tools.
Every AI tool I can think of has made me more productive.
LLMs help me write code faster and understand new libraries, image generation helps me build sites and emails faster, and so on.
I agree with this view. Generative AI robs us of something (thinking, practicing): the long-term ability to practice a skill and improve ourselves, in exchange for an immediate (often crappy) result. Embeddings are a tech that can help us solve problems, but we still have to do most of the work.
I ask LLMs to give me exercises, tutorials then write up my experience into "course notes", along with flashcards. I ask it to simulate a teacher, I ask it to simulate students that I have to teach, etc...
I haven't found a tool that is more effective in helping me learn.
Does a player piano rob you of playing music yourself? A car from walking? A wheelbarrow from working out? It’s up to you if you want to stop practicing!
Chess has become even more popular despite computers that can “rob us” of the joy. They’re even better practice partners.
An individual car doesn't stop you from walking but a culture that centers cars leads to cities where walking is outright dangerous.
Most car owners would never say outright "I want a car-centric culture". But car manufacturers lobbied for it, and step by step we got both the deployment of useful car infrastructure and the destruction or neglect of all amenities useful for people walking or cycling.
Now let's go back to the period when cars started to become enormously popular and cities started to build neighborhoods without sidewalks. There was probably someone at the time complaining about the risk of cars overtaking walking, stores ending up farther away, etc. And in front of them was probably someone like you, calling them a luddite while being oblivious to second-order effects.
I’m not sure it robs us. It makes that possible, but many people, including myself, find the artistic products of AI to be utterly without value, for the reasons you list. I will always cherish the product of lifelong dedication and human skill.
It doesn't diminish, but I do find it interesting how it influences. Realism became less important and less interesting (though still valued, to a lesser degree) with the ubiquity of photography. Where will human creativity move when certain tasks become trivially machine-replicable? Where will human ingenuity _enabled_ by new technology make new art possible?