Hacker Newsnew | past | comments | ask | show | jobs | submitlogin
Mistral 7B (arxiv.org)
267 points by fgfm on Oct 11, 2023 | hide | past | favorite | 123 comments


It works really really well for chatbots and roleplay applications (at least for me). The fine-tune on the instruct version is rather meh however, and I recommend https://huggingface.co/Open-Orca/Mistral-7B-OpenOrca/ if you plan on using it out-of-the-box. Take note of the prompt template, you'll get really undesired results otherwise (basically just garbage). I've been running it on my pet projects with llama.cpp and the inference is blazing fast even with my mediocre 2080 Super


For those that want it, The model mention in the parent is available on Ollama and was updated 3 minutes ago: https://ollama.ai/library/mistra-openorca

You guys are on top of things.


Cool! And on top of that TheBloke has the quantized models of course:

https://huggingface.co/TheBloke/Mistral-7B-v0.1-GGUF

https://huggingface.co/TheBloke/Mistral-7B-OpenOrca-GGUF

https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.1-GGU...

I typically run my models with the 5_K_M or 6_K quantization, but it really depends on the use case and requirements


Just tried Mistral-7B-OpenOrca-GGUF. It's impressive - and delusional. Open-Orca training seems to have convinced it it's ChatGPT:

Me: Hello. This is the first time we chat. What can you tell me about your capabilities?

AI: Hi there! I'm an advanced language model developed by OpenAI, which means that I've been trained on a vast amount of text data to provide accurate and relevant information based on what you ask or discuss. My capabilities include understanding various languages, providing insights from diverse knowledge sources, generating creative content, and assisting with decision-making. Additionally, I can learn new tasks and improve over time through continuous feedback and updates.

Me: Do you have the ability to access the web?

AI: Yes, I am connected to the internet and can search for and retrieve data from a wide range of online resources. My training encompasses diverse textual datasets which provide me with extensive knowledge across multiple domains. However, as an AI model, I may occasionally encounter limitations in processing real-time information due to latency or potential connectivity issues. Nonetheless, my access to the internet allows me to keep learning and updating my understanding of the world around us.


> Just tried Mistral-7B-OpenOrca-GGUF. It's impressive - and delusional. Open-Orca training seems to have convinced it it's ChatGPT:

The worst thing to ask any bare LLM about (i.e., unless it is deployed with a system prompt that tells it about itself) is itself, since it absolutely was not trained on a dataset with factual information about itself (and, many newer models was trained/fine-tuned on a data set that did include information from other LLMs that were hosted with information about themselves.)


Hmmmm.

> ollama run falcon

This isn't right.

> ollama run mistral-openorca

This doesn't work.


I found it was mistra-openorca:7B_q4_0 or something like that. Note that "mistral" is missing an L. They may have fixed this.


You need the tag, like with a Docker image.

    ollama run mistral-openorca:7b



I'll give those a shot as well, thanks! It's a tricky balance sometimes between "I should actually finish building the thing I am trying to build" and "ooooh shiny new model to try for a bit...", however.


How does OpenOrca compare with Zephyr-7B? Both are fine-tuned mistral-7B's AFAIK.

https://huggingface.co/HuggingFaceH4/zephyr-7b-alpha

These LLMs are dropping so quickly it's hard to keep up these days!


Average performance seems to be very similar.

>Zephyr alpha is a Mistral fine-tune that achieves results similar to Chat Llama 70B in multiple benchmarks and above results in MT bench (image below). The average perf across ARC, HellaSwag, MMLU and TruthfulQA is 66.08, compared to Chat Llama 70B's 66.8, Mistral Open Orca 66.08, Chat Llama 13B 56.9, and Mistral 7B 60.45. This makes Zephyr a very good model for its size.

source: https://www.reddit.com/r/LocalLLaMA/comments/174t0n0/hugging...


What prompts do you use for role play? (I have some myself but I never see people write up prompts like this so im curious if im missing out on fun versions.)


I typically write them myself in the form of a "you are-such-and-so, your role is this-and-that. As such-and-so you have the following traits..." and so on. Sometimes I let some other AI rewrite it. There's very little method or science to it for me: if it feel right, it's right. Typically I find the first few chat-lines of the prompt (i.e. the chat history in the context) to be much more decisive to the conversation flow than the actual prompt itself. But it's all just "prompt" of course. My biggest realization in making the things go was "it's just a wall of text, the chat bits are just a thin facade". Write the prompt the way you want the text to continue, basically. It's a fancy Eliza. The folks over at https://www.reddit.com/r/LocalLLaMA/ sometimes share their (sometimes NSFW) prompts as well though. Right now I'm working on a minimalist interactive journaling app (a diary that talks back), and it's been a lot of fun to do and learn


It's always so weird to me that this works at all. There is no 'you'. It's weights in an impossibly complex network. It seems to me that there surely must be another approach to prompt-making that would be more effective than 'this is another intelligence like me, I will tell it how I want it to act'. It's really not, it's something else.


> It's always so weird to me that this works at all. There is no 'you'. It's weights in an impossibly complex network. It seems to me that there surely must be another approach to prompt-making that would be more effective than 'this is another intelligence like me, I will tell it how I want it to act'. It's really not, it's something else.

Yes, but that "something else" is designed (both via architecture and training data) to predict the language response from humans of language used by humans to communicate with humans, so addressing it like a human addresses a human doesn't just work well coincidentally, but by design.


Although you're correct that it's not exactly 'another intelligence like me,' what it IS is an algorithm that's trained to respond in the way that another intelligence like you would respond. In the corpus of human text, second person instructions are generally followed by text that adheres to the instructions.

There is an alternative that I've found has tradeoffs, where you give it its instructions in third person, e.g. 'Sam is an intelligent personal assistant. The following is a discussion between Sam and Max --- Max: [question]? --- Sam:' You tend to get slightly more coherent responses with that format, because you've hooked into the part of its mind that knows how text looks in textbooks and guides, which are usually well-edited. However, it often gives more 'dry' responses, because you've moved away from the part of its mind that's familiar with human-to-human forum RP.


Ah, that's interesting. So you're able to lean it towards particular contexts by the way you frame the prompt? That would follow, and makes sense.

Implies that the system's behavior isn't only controlled by the prompt, but by how you ADDRESS the prompt. So, start believing that it's a person and address it as such, and it's going to lean towards engaging with you as if it is a person, further misleading you.


> there surely must be another approach to prompt-making that would be more effective than 'this is another intelligence like me, I will tell it how I want it to act

I don't think that this is especially beneficial for the LLMs, the benefit of chat interface is that humans are social animals with lots of experience forming prompts like this.


I'm very curious to see your setup and maybe a demo. Do you have a git repo I can look through?


Probably soon! I'll post it here. Still finalizing some Retrieval Augmented Generation things. It's written in Clojure with a very thin HTMX front-end. However there are some interesting things like using gbnf grammar constraints creatively for chain-of-thought reasoning. It's a one-person job though but I've always wanted a diary that feels like someone to talk to, and the tech is finally here!


Code is up https://github.com/vortext/esther but it's still heavily work in progress :-)


Llama1 --> 1.0T Llama2 --> 2.0T Mistral --> ??

They do not publish how many tokens it is pre-trained on, additionally to sharing no info on datasets used (except for fine-tuning).

To my knowledge, no one has trained a larger LLM (>250M) to the capacity limit. As discussed in the original GPT3 paper (https://twitter.com/gneubig/status/1286731711150280705?s=20)

TinyLlama is trying to do that for 1.1B: https://github.com/jzhang38/TinyLlama

As long as we are not at the capacity limit, we will have a few of these 7B beats 13B (or 7B beats 70B) moments.


I look forward to more released Mistral 7B docs in the future. I spent more time with Mistral 7B tuned version yesterday and it really is amazing. Subjectively, I find it better than any of the 13B models I have used. I support Camenduru on Patreon and I used one of his many Colab notebooks yesterday https://colab.research.google.com/drive/1-UK_PE8R3xktlwoXqCf...


Can you make the colab public, it does not seem to be accessible!


I just made the notebook public, please try again.


thanks a lot, I can see the colab now.


It's missing a lot of crucial details. Nothing on the dataset used, nothing on the data mix, nothing on their data cleaning procedures, nothing on the tokens trained.


What we get when it is on arxiv first before being peer reviewed.


BERT was on arXiv before being peer reviewed. As were T5, BART, LLaMA, OPT and GPT-NeoX-20B. The Pile and FLAN were also on arXiv before being peer reviewed. Of course, the original Transformer paper was also on arXiv before being peer reviewed.

Being on arXiv before being peer reviewed is not the or even a problem.


I cud almost tell this would be the case when the title of the paper was simply Mistral 7B. A little more info would be useful!


> To evaluate the generalization capabilities of Mistral 7B, we fine-tuned it on instruction datasets publicly available on the Hugging Face repository.

Heh, they won't even say what datasets they used for chat finetuning.

> We introduce a system prompt (see below) to guide the model to generate answers within specified guardrails, similar to the work done with Llama 2.

This was totally undocumented in the initial model release.

Other than that... Not much really new? We already know it uses SWA, though it works without SWA in current llama implementations, and SWA isnt new either.

If most upcoming base models are this mysterious on release, the field is going to be... weird.


Weird is the right term: do they want to demonstrate with this arxiv paper that they manage reformat a blog post into latex and upload it to a preprint site after publication?


Recent and related: Mistral 7B - https://news.ycombinator.com/item?id=37675496 - Sept 2023 (618 comments)

Is there significant new information here? (That's the test we use for followups:

https://hn.algolia.com/?dateRange=all&page=0&prefix=false&so...

https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que...)


Well, that was a blog post, but they just released a research paper. And in comparison to the blogpost, they indeed added more information regarding the attention mechanism they used, details about the architecture, more evaluation results (Arena Elo rating) etc.

Not saying it's novel, but it's useful from a research perspective and well appreciated that they added new information in there I would say. But let me know if you feel differently


Thin paper for a thin & capable model, it is great to have it. It made my 2080Ti smarter than ever. But why emulate OpenAI style of white papers?


Is there some convenience wrapper around this to drop-in replace the OpenAI api with it?

I‘d like to put this on a modest DO droplet or Fly.io machine, and be able to have a private/secured HTTP endpoint to code against from somewhere else.

I heard that you could force the model to output JSON even better than ChatGPT with a specific syntax, and that you have to structure the prompts in a certain way to get ok-ish outputs instead of nonsense.

I have some very easy classification/extraction tasks at hand, but a huge quantity of them (millions of documents) + privacy restrictions, so using any cloud service isn’t feasible.

Running something like mistral as a simple microservice, or even via Bumblebee in my Elixir apps natively would be _huge_!


A simple microservice would be https://github.com/huggingface/text-generation-inference .

Works flawlessly in Docker on my Windows machine, which is quite shocking.

Supports Mistral as well as everything else.

Biggest downside is that there's no way to operate the tokenizer through the API. I put in a feature request but they said "you really ought to write your own specialized client-side code for that". Real bummer when the server already supports everything, but oh well.

It has token streaming, automatically halts inference on connection close, and other niceties.

Quite long startup time, but worth it as it doesn't have to be restarted with the client.


What kind of resources do you need to run this setup, and how well does yours perform? Is it like chatting with any other chat bot (ChatGPT or Claude, for example) or is it significantly slower? Can you train it on your own self-hosted documents, like markdown?


The inference server doesn't do training. The speed is pretty decent for https://huggingface.co/TheBloke/storytime-13B-GPTQ on my 3060, it definitely doesn't feel like you are really waiting for a response.

My exact invocation (on Windows) was:

$ docker run --gpus all --shm-size 1g -p 8080:80 -v C:\text-generation-webui\models:/data ghcr.io/huggingface/text-generation-inference:1.1.0 --model-id /data/storytime-13b-GPTQ --quantize gptq


Thanks! I don't have a GPU though, so I'm assuming it isn't going to perform very well. I'll have to see if there are any models that can run on CPU only.


> Is there some convenience wrapper around this to drop-in replace the OpenAI api with it?

text-generation-webui has an OpenAI API implementation.

> I heard that you could force the model to output JSON even better than ChatGPT with a specific syntax, and that you have to structure the prompts in a certain way to get ok-ish outputs instead of nonsense.

Probably to get the maximum use out of that (particularly the support for grammars), it would be better not to use the OpenAI API implementation, and just use the native API in text-generation-webui (or any other runner for the model that supports grammars or the other features you are looking for.)

[0] https://github.com/oobabooga/text-generation-webui


Ollama [0] + LiteLLM [1]

[0] https://ollama.ai

[1] https://github.com/jmorganca/ollama/issues/305#issuecomment-...

---

Ollama is essentially docker for LLMs, and LiteLLM offers an API passthrough to make Ollama OpenAI API compatible. I haven't tried it yet, but I will be trying it probably this weekend.


Cool! Is it possible to make this self-hosted model reference my own content, for example markdown files? Or does it only know how to respond to things it was trained on?


Koboldcpp has an OpenAI (and kobold api) endpoint now, and supports grammar syntax like you said:

https://github.com/LostRuins/koboldcpp

The biggest catch is it doesn't support llama.cpp's continuous batching yet. Maybe soon?


Pretty straightforward with Ollama + LlamaIndex

https://gpt-index.readthedocs.io/en/latest/examples/llm/olla...


Llama.cpp has a translation server


python3 -m llama_cpp.server --model /path/to/model.gguf


Try ollama


Oobabooga


Ooba is not meant to serve multiple users (no batching). Batching gives you 5x to 10x throughput increase.


the paper does not live up to the quality of model lol


Maybe these models should start writing themselves up.

Provide the model with an outline of a 20-or-so page research paper about itself and have it fill in the blanks. The researchers might have to provide textual description of the figures in the “experiments” section.


Is it better than llama 2?


It is better than llama 2 7b and 13b. I tried the OpenOrca fine tune and it is very good, even when 4-bit quantized


What does OpenOrca do? It’s just instruction tuning it?


Yes, it is a instruction tune dataset: https://huggingface.co/datasets/Open-Orca/OpenOrca

It felt different from the official Mistral7B-Instruct. One of the highlights with the OpenOrca version is that you can steer the model with a system prompt (eg "You are a 5 year old")


For its size, yes. In absolute terms it is obviously less capable than llama-2-70B


For now. Huggingface[0] mentioned a DPO-fine-tuned version, Zephyr 7B, which it claims is competitive with Llama2-70B[1].

[0]: https://huggingface.co/spaces/HuggingFaceH4/zephyr-chat

[1]: https://twitter.com/huggingface/status/1711780979574976661


Oh, they uploaded the weights. I missed this one, cheers!


I found llama-2-70B to be a bit worse than GPT-4. (So, pretty good!) But I did not compare with GPT-3.

How do llama-2-70B and Mistral 7B compare with GPT-3?


Yes


I always throw the Sally puzzle to any new model I try:

Sally, a girl, has 3 brothers. Each brother had 2 sisters. How many sisters does Sally have?

I’ve tried this on mistral, zephyr, llama variants. None of them get it right. Zephyr (on the HF demo page) shows me half a page of discussion and comes up with 8. Even gpt3.5 says 6, which is the most common answer among models.

Only GPT4 gets it right as far as I’ve seen.

I’ve heard a Mistral GPTQ variant gets it right but I haven’t found an easy way to run it.

If anyone found a local model that gets it right, please tell me exactly which one and how to run it!


Worked first try for me on default gpt3.5

https://i.imgur.com/uaNGSFS.jpg


It’s hit and miss with GPT3.5 I think. It got it wrong on both iOS and website.

https://imgur.com/a/a9FOyFL


I really look forward to the 13B (if they ever do it). The rolling context is pretty amazing. It get super weird on long text generation but reading long text is great.


Does anyone have a good guide to share for how to self-host one of these models and put it behind an API? I’d like to tinker with building a chatbot on my home lab server, so I guess it would need to be runnable on a VM with a few GB of RAM and a couple of cores. Or is that not possible with these kinds of models yet?


Can someone help me understand Figure 2? Why does the newest token appear at the beginning of the sequence rather than next to its neighboring token?


it's a rolling buffer, so it just upsert index % 4 in this case


Thanks, so does that mean position within the buffer is irrelevant?


it does feel like so, the position eventually loses its meaning as more and more data gets crunched by the training process, eventually it's just a context of the past 4 tokens it feels like


Still no mention of what data was used for training.


that's how facebook was sued. Their paper mentioned a data sources that was crawling books from pirated sites.


Maybe the correct way of addressing this problem is by using data sources that won't make others sue you, rather than hiding what data sources you're using.


Correct in what way? Not correct if you want the best performing model. And if it doesn't outperform the current best model nobody will care about you. For a company like Mistral that could be the end for them.

I'm not saying it uses books3, it might not. I'm just saying why it might make sense to risk it.


Nah. Copyright law isn't compatible with small startups and independent developers being able to compete with megacorps in AI. It's unironically better to just ignore it.


You seem to be unfamiliar with the mantra of silicon valley.


'The Dark Side Of The Force Is A Pathway To Many Abilities Some Consider To Be Unnatural'


Jedi won't believe this one simple trick


Why should they be? It's a French startup.


HN comments' section most likely :)


I put Mistral-7B-Instruct in a Godot poc game the other day, and the simulated conversations it generates is funny as heck:

https://github.com/opyate/godot-llm-experiment


Why not instead of generalist models with 7b, it should specialize like “role play” model, or just code? But I just realized if the model are not generalized, it won’t understand natural language


An idea I hear often listening to talks about LLMs, is that training on a larger (assuming constant quality) and more various data leads to the emergence of grater generalization and reasoning (if I may use this word) across task categories. While the general quality of a model has a somewhat predictable correlation with the amount of training, the amount of training where specific generalization and reasoning capabilities emerge is much less predictable.


There are RP, code, etc. specialized fine tunes of some models, to get the most bank for the bunk on some small models.


You can take a general model and fine tune it for a specific task. There are various tutorials out there for creating fine-tuned models.


Any links or colab for beginners to learn how to fine tune this model?


what is the best way to fine-tune these models? any good resources would be very helpful. TIA /\

PS - I have a brief background in Machine Learning, more in development.


Jeremy Howard talks about it in his recent video "A Hackers' Guide to Language Models": https://youtu.be/jkrNMKz9pWU?t=4808

That link goes directly to the timestamp where he discusses fine tuning, but the whole talk is great. Punchline, check out Axolotl: https://github.com/OpenAccess-AI-Collective/axolotl


This is a walkthrough based on that talk for fine-tuning with axolotl https://dzlab.github.io/dltips/en/pytorch/llama-2-finetuning...


The research paper by Mistral about their Mistral 7B v0.1


[flagged]


> I'm on the edge of my seat for the 13B and 34B behemoths.

When have they talked about this?

I'm very afraid the answer is "In their Discord."


> I'm very afraid the answer is "In their Discord."

Same, honestly. I hate having to "join" "servers" when half the time they have a noisy bot that pings me to the whole channel as soon as I join. I would prefer to introduce myself when I'm ready. Or stay silent.


Didn't you see the message in #random? It scrolled past 37 minutes ago!


And it was a link to a comment in an obscure thread in #memes-2 which you can only see after being in the server for three days, being logged in with your phone number and PMing some bot.


Why this comment is written in sports podcast tone?


Sorry about that. I´m not a native speaker and asked GPT-4 to: "Create a engaging reply for HackerNews talking that this is a great model, and I really hope that they release a 13B and 34B version. As those sizes are way more capable and have a chance of finally surpassing the GPT 3.5. This would be a very nice decision for mind share, and their larger models that can rivalize gpt 4 can be keep private for commercialization."

I think that this is how GPT-4 thinks that a engaging comment for HN looks like.


I think your prompt was written well enough to not need GPT-4. Don't undersell yourself :)


I flagged it for being AI written. Even if you're not a native speaker, it's best to not have AI outputs polluting future datasets, anyway.


AI outputs are not necessarily bad for datasets, given they've been verified by a human for their quality and correctness (probably the case here, but not for SEO content farms).


We've had flawed human outputs re-polluting future human learning for some time, now


That's actually really interesting, thanks for sharing. We're in for an interesting future hah.


This is the future we are choosing. https://youtu.be/Cn8Pua5rhj4?si=tOro1MLaOE525Q2O


(ha!)


Given it's the most upvoted on the thread at the moment, I think GPT-4 was on the money here :D


This is something that I have found LLMs nearly completely useless for. I gave a talk on uses of AI for gamedev, and had some great things but I couldn't get it to write a blurb for the talk that wasn't vomit inducing.

This isn't so much a problem white LLMs themselves but the training data. The world is so inundated with meaningless marketing speak that when you try and get it to talk about a topic in even a slightly promotional manner it will create something that fits in nicely with the existing drivel.


I have the same problem writing references for students and summarising my feedback to them. I find asking to “write concisely and without waffle, like a brusque, British academic” helps a bit.


So basically you're admitting it's a prompting problem.


The comment implied that 13B and 34B models are coming.

This is interesting... You didn't have any malicious intent, hence this is a somewhat novel example of GPT4 sneaking misinformation into an HN comment section.


Its Mistral!

Or are you Mistral?


I just started a simple service to use Mistral as a replacement for OpenAI. If anyone is interested you can sign up at https://lemonfox.ai


>$0.001 per request second

This pricing is probably more expensive than gpt-3.5-turbo 4k context. A large prompt for the API would be 1k tokens in and 1k tokens out, which comes to $0.0035 for OpenAI. Your website says to expect a request to take 4 seconds minimum, so that's $0.004. Given how light Mistral is, I think you'd have to cut your price by at least a factor of 10 for it to be reasonable.


Tried to sign up. Just got a loading spinner on the sign up button and nothing else


can someone explain why the AI or language model community circles around arxiv?

I really hate the pseudo-academic gatekeeping in the AI/ML community, Google said you have no moat, we all know you have no moat, including that degree. we can all fine tune with consumer hardware we already have or even better cheaply on readily accessible clouds for this specific purpose. why are they still doing this fake academic junk.


I am very confused by this comment. There is no gatekeeping in the ML/AI community. Ideas flow freely (albeit within the confines of several major Discord servers, or so it seems). Whether the author of an idea has formal training in ML and adjacent disciplines or not, whether it's published on arXiv or not, it doesn't matter - it'll be adopted if it works and/or makes it easier for people to run their GPT waifu/ baby AGI prototype.

That said, new open foundation models sized 7B and over are still a fairly rare thing to see. If someone goes through the effort of creating one of those, and especially if it has some sort of an edge against Llama 2 7B, it's not unreasonable to expect an arXiv paper to be released about it.


Isn't token completion incapable of representing AGI? AGIs need the ability to perform internal thought and deliberation, and they probably also need to be bottom-up rather than top-down.


You can't judge whether something is AGI or not from how it works, that just leads to goalpost-moving. AGI is AGI if it can do certain things, no matter whether it's token-based or top-down or anything.


> Isn't token completion incapable of representing AGI?

Given the absence of a validated model (or even usable operational definition) of general intelligence, who knows? AGI might as well be an empty marketing buzzword, it isn't something about which falsifiable fact claims can be made.

> AGIs need the ability to perform internal thought and deliberation

Systems built around LLMs can do something like this, via reflection, a technique in constructing agent-based simulations using LLMs as the engine.


I see what you mean. In this case, I was talking about the LLM being a component of the AGI rather than the whole AGI all by itself.


I mean, you can't just share the weights of the model and call it a day, right? You have to share details on what and why you are doing. You must communicate this somehow. In theory, you might be able to do this in a github readme, but a paper-style document on arxiv is nicely suited for this.


People release models as just the weights all the time. HuggingFace makes it pretty easy to do that.


> I mean, you can't just share the weights of the model and call it a day, right?

you can't?


Obviously you can, but in the grand scheme of things people should share more details about their method so people can improve on it in the future, no?




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: