If you want to add embeddings over the internet as a source, you should try out exa.ai. Their index includes Wikipedia, tens of thousands of news feeds, GitHub, 70M+ papers including all of arXiv, etc.
Exa | San Francisco | In person | Full time | $130K-350K
Jeff, cofounder of Exa.ai here. LLMs represent a brand new opportunity to organize humanity's knowledge, in a way that hasn't been done before. We're an AI research lab focused on AI-powered search algorithms (using embeddings), currently applied to vast swaths of the web (we make our money as a search API).
We're hiring pretty broadly across engineering - AI research, high-performance Rust (e.g., we build an in-house vector DB), and full stack. If the mission of organizing the Internet motivates you, it's a good fit :)
Exa | San Francisco | In person | Full time | $130K-180K
Jeff, cofounder of Exa here. LLMs represent a brand new opportunity to organize humanity's knowledge, in a way that hasn't been done before. We're an AI research lab focused on AI-powered search algorithms (using embeddings), currently applied to vast swaths of the web (we make our money as a search API).
We're hiring pretty broadly across engineering - AI research, high-performance Rust (e.g., we build an in-house vector DB), and full stack. If the mission of organizing the Internet motivates you, it's a good fit :)
Not sure what they are doing, but embeddings and hallucination are completely separable imo (you can have hallucination even without embedding-based retrieval). Likely they compute an embedding for the query that is close to the embedding of the doc under some measure of similarity. That could be semantic similarity or even user behavior.
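The retrieval step described above can be sketched in a few lines. This is a toy illustration only: the 3-d vectors and document names are made up, and a real system would get its embeddings from a learned model rather than hand-written numbers.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Toy 3-d embeddings; in practice these come from an embedding model.
docs = {
    "doc_mice_reviews": [0.9, 0.1, 0.0],
    "doc_sf_concerts":  [0.1, 0.8, 0.2],
}

query = [0.8, 0.2, 0.1]  # pretend embedding of "what is the best mouse to buy"

# Rank documents by similarity to the query embedding.
best = max(docs, key=lambda name: cosine(query, docs[name]))
print(best)  # -> doc_mice_reviews
```

The point is that "closeness" here is entirely determined by how the embedding model was trained - swap in a model trained on a different similarity signal (e.g., co-click behavior instead of semantics) and the same code returns different neighbors.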
The issue with traditional search engines is that keyword-first algorithms are extremely gameable.
Try https://search.metaphor.systems - it's fully neural embeddings-based search. No keywords, only an embedding of what the actual content of a webpage is.
How is that different from keywords? Embeddings aren't magic; they're still derived from page content. Content is trivial to game since it's controlled by the website owner.
edit: The results from my quick QA are also not that great. Searching for "what is the best mouse to buy" leads to links to buy random mice rather than review summaries or online discussions about mice. One of the recommended queries, "Here is a great fun concert in San Francisco", leads to some really bizarre results in non-English languages that have nothing to do with either SF or concerts.
edit2: Also, Google has been using LLMs as part of their search since at least 2018, so it's definitely not just keyword matching there.
Yup, definitely still gameable, but if the model learns what high-quality content looks like and which high-quality webpages exist (which it does), then the only way to game it would be to be great :)
For your search - I would recommend turning autoprompt off and searching for something like "Here is a great summary of the best computer mice to use:".
Our embeddings model is trained on how links are talked about on the Internet, if that helps with querying. So you have to query like how someone would refer to a link before sharing it
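The querying advice above amounts to a simple rewrite: turn a question into the statement someone might post right before sharing a link. A minimal sketch, where the function name and the phrasing template are my own assumptions, not anything Exa documents:

```python
def to_link_share_query(question: str) -> str:
    """Hypothetical helper: rewrite a question into link-sharing phrasing.

    The "Here is a great summary of ...:" template is an assumed example
    of the style suggested in the comment above, not an official API.
    """
    topic = question.strip().rstrip("?").strip()
    return f"Here is a great summary of {topic}:"

print(to_link_share_query("what is the best mouse to buy?"))
# -> Here is a great summary of what is the best mouse to buy:
```

In other words, you query with the surrounding text a link would appear next to, rather than with the question itself.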
> Our embeddings model is trained on how links are talked about on the Internet, if that helps with querying. So you have to query like how someone would refer to a link before sharing it
So it's not high-quality web pages but web pages that people talk about a lot, which is expected, since no one has an oracle that says what high quality is. The embeddings are merely a proxy for, and generalization of, "how links are talked about on the Internet." That can be gamed at scale, just like every other signal that any popular search engine has been based on.
The first result, vtubego.com, is a 144MB downloader app. The page contains "Pricing Plans Lorem ipsum dolor sit amet, placerat verterem luptatum phaedrum vis, impetus mandamus id vix fabulas vim." above its three paid plans (there is no free plan).
I haven't installed the downloader app, so I'm not sure if it lets me download youtube videos for free.
The second result, "ytder.com", is a redirect to "https://poperblocker.com/edge/", which seems to be a browser extension for Microsoft Edge that protects the user from the Holy See. I'm not using Edge, and I'm trying to download a YouTube video.
The third result, download-video.net, says it can download videos from a list of sites. YouTube is not on the list, but let's try anyway. If you put "https://www.youtube.com/watch?v=IkYVmtgxebU" into the text box and click "download", you get "500 SyntaxError: Unexpected token '<', ""
At this point I gave up, but please let me know if any of the results work.
> with Metaphor you'll get only Youtube downloaders
I clicked into the top 5 results; none of them were real YouTube downloaders that worked. So I clicked the next 5 results, and only then did I finally get one single (really slow) downloader that worked. That's 1 out of the top 10 results.