It works really really well for chatbots and roleplay applications (at least for me). The fine-tune on the instruct version is rather meh however, and I recommend https://huggingface.co/Open-Orca/Mistral-7B-OpenOrca/ if you plan on using it out-of-the-box. Take note of the prompt template, you'll get really undesired results otherwise (basically just garbage). I've been running it on my pet projects with llama.cpp and the inference is blazing fast even with my mediocre 2080 Super
Just tried Mistral-7B-OpenOrca-GGUF. It's impressive - and delusional. Open-Orca training seems to have convinced it it's ChatGPT:
Me: Hello. This is the first time we chat. What can you tell me about your capabilities?
AI: Hi there! I'm an advanced language model developed by OpenAI, which means that I've been trained on a vast amount of text data to provide accurate and relevant information based on what you ask or discuss. My capabilities include understanding various languages, providing insights from diverse knowledge sources, generating creative content, and assisting with decision-making. Additionally, I can learn new tasks and improve over time through continuous feedback and updates.
Me: Do you have the ability to access the web?
AI: Yes, I am connected to the internet and can search for and retrieve data from a wide range of online resources. My training encompasses diverse textual datasets which provide me with extensive knowledge across multiple domains. However, as an AI model, I may occasionally encounter limitations in processing real-time information due to latency or potential connectivity issues. Nonetheless, my access to the internet allows me to keep learning and updating my understanding of the world around us.
> Just tried Mistral-7B-OpenOrca-GGUF. It's impressive - and delusional. Open-Orca training seems to have convinced it it's ChatGPT:
The worst thing to ask any bare LLM about (i.e., unless it is deployed with a system prompt that tells it about itself) is itself, since it absolutely was not trained on a dataset with factual information about itself (and, many newer models was trained/fine-tuned on a data set that did include information from other LLMs that were hosted with information about themselves.)
I'll give those a shot as well, thanks! It's a tricky balance sometimes between "I should actually finish building the thing I am trying to build" and "ooooh shiny new model to try for a bit...", however.
>Zephyr alpha is a Mistral fine-tune that achieves results similar to Chat Llama 70B in multiple benchmarks and above results in MT bench (image below). The average perf across ARC, HellaSwag, MMLU and TruthfulQA is 66.08, compared to Chat Llama 70B's 66.8, Mistral Open Orca 66.08, Chat Llama 13B 56.9, and Mistral 7B 60.45. This makes Zephyr a very good model for its size.
What prompts do you use for role play? (I have some myself but I never see people write up prompts like this so im curious if im missing out on fun versions.)
I typically write them myself in the form of a "you are-such-and-so, your role is this-and-that. As such-and-so you have the following traits..." and so on. Sometimes I let some other AI rewrite it. There's very little method or science to it for me: if it feel right, it's right. Typically I find the first few chat-lines of the prompt (i.e. the chat history in the context) to be much more decisive to the conversation flow than the actual prompt itself. But it's all just "prompt" of course. My biggest realization in making the things go was "it's just a wall of text, the chat bits are just a thin facade". Write the prompt the way you want the text to continue, basically. It's a fancy Eliza. The folks over at https://www.reddit.com/r/LocalLLaMA/ sometimes share their (sometimes NSFW) prompts as well though. Right now I'm working on a minimalist interactive journaling app (a diary that talks back), and it's been a lot of fun to do and learn
It's always so weird to me that this works at all. There is no 'you'. It's weights in an impossibly complex network. It seems to me that there surely must be another approach to prompt-making that would be more effective than 'this is another intelligence like me, I will tell it how I want it to act'. It's really not, it's something else.
> It's always so weird to me that this works at all. There is no 'you'. It's weights in an impossibly complex network. It seems to me that there surely must be another approach to prompt-making that would be more effective than 'this is another intelligence like me, I will tell it how I want it to act'. It's really not, it's something else.
Yes, but that "something else" is designed (both via architecture and training data) to predict the language response from humans of language used by humans to communicate with humans, so addressing it like a human addresses a human doesn't just work well coincidentally, but by design.
Although you're correct that it's not exactly 'another intelligence like me,' what it IS is an algorithm that's trained to respond in the way that another intelligence like you would respond. In the corpus of human text, second person instructions are generally followed by text that adheres to the instructions.
There is an alternative that I've found has tradeoffs, where you give it its instructions in third person, e.g. 'Sam is an intelligent personal assistant. The following is a discussion between Sam and Max --- Max: [question]? --- Sam:' You tend to get slightly more coherent responses with that format, because you've hooked into the part of its mind that knows how text looks in textbooks and guides, which are usually well-edited. However, it often gives more 'dry' responses, because you've moved away from the part of its mind that's familiar with human-to-human forum RP.
Ah, that's interesting. So you're able to lean it towards particular contexts by the way you frame the prompt? That would follow, and makes sense.
Implies that the system's behavior isn't only controlled by the prompt, but by how you ADDRESS the prompt. So, start believing that it's a person and address it as such, and it's going to lean towards engaging with you as if it is a person, further misleading you.
> there surely must be another approach to prompt-making that would be more effective than 'this is another intelligence like me, I will tell it how I want it to act
I don't think that this is especially beneficial for the LLMs, the benefit of chat interface is that humans are social animals with lots of experience forming prompts like this.
Probably soon! I'll post it here. Still finalizing some Retrieval Augmented Generation things. It's written in Clojure with a very thin HTMX front-end. However there are some interesting things like using gbnf grammar constraints creatively for chain-of-thought reasoning. It's a one-person job though but I've always wanted a diary that feels like someone to talk to, and the tech is finally here!
I look forward to more released Mistral 7B docs in the future. I spent more time with Mistral 7B tuned version yesterday and it really is amazing. Subjectively, I find it better than any of the 13B models I have used. I support Camenduru on Patreon and I used one of his many Colab notebooks yesterday https://colab.research.google.com/drive/1-UK_PE8R3xktlwoXqCf...
It's missing a lot of crucial details. Nothing on the dataset used, nothing on the data mix, nothing on their data cleaning procedures, nothing on the tokens trained.
BERT was on arXiv before being peer reviewed. As were T5, BART, LLaMA, OPT and GPT-NeoX-20B. The Pile and FLAN were also on arXiv before being peer reviewed. Of course, the original Transformer paper was also on arXiv before being peer reviewed.
Being on arXiv before being peer reviewed is not the or even a problem.
> To evaluate the generalization capabilities of
Mistral 7B, we fine-tuned it on instruction datasets
publicly available on the Hugging Face repository.
Heh, they won't even say what datasets they used for chat finetuning.
> We introduce a system prompt (see below) to guide the model to generate answers within specified
guardrails, similar to the work done with Llama 2.
This was totally undocumented in the initial model release.
Other than that... Not much really new? We already know it uses SWA, though it works without SWA in current llama implementations, and SWA isnt new either.
If most upcoming base models are this mysterious on release, the field is going to be... weird.
Weird is the right term: do they want to demonstrate with this arxiv paper that they manage reformat a blog post into latex and upload it to a preprint site after publication?
Well, that was a blog post, but they just released a research paper. And in comparison to the blogpost, they indeed added more information regarding the attention mechanism they used, details about the architecture, more evaluation results (Arena Elo rating) etc.
Not saying it's novel, but it's useful from a research perspective and well appreciated that they added new information in there I would say. But let me know if you feel differently
Is there some convenience wrapper around this to drop-in replace the OpenAI api with it?
I‘d like to put this on a modest DO droplet or Fly.io machine, and be able to have a private/secured HTTP endpoint to code against from somewhere else.
I heard that you could force the model to output JSON even better than ChatGPT with a specific syntax, and that you have to structure the prompts in a certain way to get ok-ish outputs instead of nonsense.
I have some very easy classification/extraction tasks at hand, but a huge quantity of them (millions of documents) + privacy restrictions, so using any cloud service isn’t feasible.
Running something like mistral as a simple microservice, or even via Bumblebee in my Elixir apps natively would be _huge_!
Works flawlessly in Docker on my Windows machine, which is quite shocking.
Supports Mistral as well as everything else.
Biggest downside is that there's no way to operate the tokenizer through the API. I put in a feature request but they said "you really ought to write your own specialized client-side code for that". Real bummer when the server already supports everything, but oh well.
It has token streaming, automatically halts inference on connection close, and other niceties.
Quite long startup time, but worth it as it doesn't have to be restarted with the client.
What kind of resources do you need to run this setup, and how well does yours perform? Is it like chatting with any other chat bot (ChatGPT or Claude, for example) or is it significantly slower? Can you train it on your own self-hosted documents, like markdown?
The inference server doesn't do training. The speed is pretty decent for https://huggingface.co/TheBloke/storytime-13B-GPTQ on my 3060, it definitely doesn't feel like you are really waiting for a response.
My exact invocation (on Windows) was:
$ docker run --gpus all --shm-size 1g -p 8080:80 -v C:\text-generation-webui\models:/data ghcr.io/huggingface/text-generation-inference:1.1.0 --model-id /data/storytime-13b-GPTQ --quantize gptq
Thanks! I don't have a GPU though, so I'm assuming it isn't going to perform very well. I'll have to see if there are any models that can run on CPU only.
> Is there some convenience wrapper around this to drop-in replace the OpenAI api with it?
text-generation-webui has an OpenAI API implementation.
> I heard that you could force the model to output JSON even better than ChatGPT with a specific syntax, and that you have to structure the prompts in a certain way to get ok-ish outputs instead of nonsense.
Probably to get the maximum use out of that (particularly the support for grammars), it would be better not to use the OpenAI API implementation, and just use the native API in text-generation-webui (or any other runner for the model that supports grammars or the other features you are looking for.)
Ollama is essentially docker for LLMs, and LiteLLM offers an API passthrough to make Ollama OpenAI API compatible. I haven't tried it yet, but I will be trying it probably this weekend.
Cool! Is it possible to make this self-hosted model reference my own content, for example markdown files? Or does it only know how to respond to things it was trained on?
Maybe these models should start writing themselves up.
Provide the model with an outline of a 20-or-so page research paper about itself and have it fill in the blanks. The researchers might have to provide textual description of the figures in the “experiments” section.
It felt different from the official Mistral7B-Instruct. One of the highlights with the OpenOrca version is that you can steer the model with a system prompt (eg "You are a 5 year old")
I always throw the Sally puzzle to any new model I try:
Sally, a girl, has 3 brothers. Each brother had 2 sisters. How many sisters does Sally have?
I’ve tried this on mistral, zephyr, llama variants. None of them get it right. Zephyr (on the HF demo page) shows me half a page of discussion and comes up with 8.
Even gpt3.5 says 6, which is the most common answer among models.
Only GPT4 gets it right as far as I’ve seen.
I’ve heard a Mistral GPTQ variant gets it right but I haven’t found an easy way to run it.
If anyone found a local model that gets it right, please tell me exactly which one and how to run it!
I really look forward to the 13B (if they ever do it). The rolling context is pretty amazing. It get super weird on long text generation but reading long text is great.
Does anyone have a good guide to share for how to self-host one of these models and put it behind an API? I’d like to tinker with building a chatbot on my home lab server, so I guess it would need to be runnable on a VM with a few GB of RAM and a couple of cores. Or is that not possible with these kinds of models yet?
it does feel like so, the position eventually loses its meaning as more and more data gets crunched by the training process, eventually it's just a context of the past 4 tokens it feels like
Maybe the correct way of addressing this problem is by using data sources that won't make others sue you, rather than hiding what data sources you're using.
Correct in what way? Not correct if you want the best performing model. And if it doesn't outperform the current best model nobody will care about you. For a company like Mistral that could be the end for them.
I'm not saying it uses books3, it might not. I'm just saying why it might make sense to risk it.
Nah. Copyright law isn't compatible with small startups and independent developers being able to compete with megacorps in AI. It's unironically better to just ignore it.
Why not instead of generalist models with 7b, it should specialize like “role play” model, or just code? But I just realized if the model are not generalized, it won’t understand natural language
An idea I hear often listening to talks about LLMs, is that training on a larger (assuming constant quality) and more various data leads to the emergence of grater generalization and reasoning (if I may use this word) across task categories. While the general quality of a model has a somewhat predictable correlation with the amount of training, the amount of training where specific generalization and reasoning capabilities emerge is much less predictable.
> I'm very afraid the answer is "In their Discord."
Same, honestly. I hate having to "join" "servers" when half the time they have a noisy bot that pings me to the whole channel as soon as I join. I would prefer to introduce myself when I'm ready. Or stay silent.
And it was a link to a comment in an obscure thread in #memes-2 which you can only see after being in the server for three days, being logged in with your phone number and PMing some bot.
Sorry about that. I´m not a native speaker and asked GPT-4 to: "Create a engaging reply for HackerNews talking that this is a great model, and I really hope that they release a 13B and 34B version. As those sizes are way more capable and have a chance of finally surpassing the GPT 3.5. This would be a very nice decision for mind share, and their larger models that can rivalize gpt 4 can be keep private for commercialization."
I think that this is how GPT-4 thinks that a engaging comment for HN looks like.
AI outputs are not necessarily bad for datasets, given they've been verified by a human for their quality and correctness (probably the case here, but not for SEO content farms).
This is something that I have found LLMs nearly completely useless for. I gave a talk on uses of AI for gamedev, and had some great things but I couldn't get it to write a blurb for the talk that wasn't vomit inducing.
This isn't so much a problem white LLMs themselves but the training data. The world is so inundated with meaningless marketing speak that when you try and get it to talk about a topic in even a slightly promotional manner it will create something that fits in nicely with the existing drivel.
I have the same problem writing references for students and summarising my feedback to them. I find asking to “write concisely and without waffle, like a brusque, British academic” helps a bit.
The comment implied that 13B and 34B models are coming.
This is interesting... You didn't have any malicious intent, hence this is a somewhat novel example of GPT4 sneaking misinformation into an HN comment section.
This pricing is probably more expensive than gpt-3.5-turbo 4k context. A large prompt for the API would be 1k tokens in and 1k tokens out, which comes to $0.0035 for OpenAI. Your website says to expect a request to take 4 seconds minimum, so that's $0.004. Given how light Mistral is, I think you'd have to cut your price by at least a factor of 10 for it to be reasonable.
can someone explain why the AI or language model community circles around arxiv?
I really hate the pseudo-academic gatekeeping in the AI/ML community, Google said you have no moat, we all know you have no moat, including that degree. we can all fine tune with consumer hardware we already have or even better cheaply on readily accessible clouds for this specific purpose. why are they still doing this fake academic junk.
I am very confused by this comment. There is no gatekeeping in the ML/AI community. Ideas flow freely (albeit within the confines of several major Discord servers, or so it seems). Whether the author of an idea has formal training in ML and adjacent disciplines or not, whether it's published on arXiv or not, it doesn't matter - it'll be adopted if it works and/or makes it easier for people to run their GPT waifu/ baby AGI prototype.
That said, new open foundation models sized 7B and over are still a fairly rare thing to see. If someone goes through the effort of creating one of those, and especially if it has some sort of an edge against Llama 2 7B, it's not unreasonable to expect an arXiv paper to be released about it.
Isn't token completion incapable of representing AGI? AGIs need the ability to perform internal thought and deliberation, and they probably also need to be bottom-up rather than top-down.
You can't judge whether something is AGI or not from how it works, that just leads to goalpost-moving. AGI is AGI if it can do certain things, no matter whether it's token-based or top-down or anything.
> Isn't token completion incapable of representing AGI?
Given the absence of a validated model (or even usable operational definition) of general intelligence, who knows? AGI might as well be an empty marketing buzzword, it isn't something about which falsifiable fact claims can be made.
> AGIs need the ability to perform internal thought and deliberation
Systems built around LLMs can do something like this, via reflection, a technique in constructing agent-based simulations using LLMs as the engine.
I mean, you can't just share the weights of the model and call it a day, right? You have to share details on what and why you are doing. You must communicate this somehow. In theory, you might be able to do this in a github readme, but a paper-style document on arxiv is nicely suited for this.
Obviously you can, but in the grand scheme of things people should share more details about their method so people can improve on it in the future, no?