Not sure if this helps, but these notes are from tinkering with Mistral 7B on both my M1 Pro (10-core, 16 GB RAM) and on WSL 2 with CUDA (Acer Predator 17, i7-7700HQ, GTX 1070 Mobile, 16 GB RAM, 8 GB VRAM).
- Got 15-18 tokens/sec on WSL 2, slightly higher on the M1. That works out to roughly 10-15 words per second. Both runs used the GPU. Haven't tried CPU-only on the M1, but on WSL 2 it was low single digits - too slow for anything productive.
- Used Mistral 7B via llamafile's cross-platform APE executable (a quick sketch of querying its local server follows this list).
- For local use I found that increasing the context size increases RAM usage a lot, but it's still fast enough. I'm considering adding another 16 GB of RAM, either as 1x16 or 2x8.
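For anyone who wants to poke at it programmatically: llamafile serves an OpenAI-compatible endpoint, so something like the sketch below works. This is a rough sketch only - it assumes the server is already running on the default port 8080, and the prompt and model name are just placeholders.

```python
# Minimal sketch: querying a running llamafile server (assumes the default
# port 8080 and its OpenAI-compatible /v1/chat/completions endpoint).
import json
import urllib.request

def ask(prompt: str, url: str = "http://localhost:8080/v1/chat/completions") -> str:
    payload = {
        "model": "mistral-7b",  # placeholder; the local server may ignore this field
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.7,
    }
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.loads(resp.read().decode("utf-8"))
    return body["choices"][0]["message"]["content"]

if __name__ == "__main__":
    print(ask("Explain in one sentence what an APE executable is."))
```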
Now tinkering with building a RAG pipeline over some of my documents, using a vector store and chaining multiple calls.
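For the retrieval side, here's roughly the shape of what I'm playing with. This sketch swaps in TF-IDF for a real embedding model and vector store, just to show the retrieve-then-prompt chaining; the documents and query are made up.

```python
# Rough RAG retrieval sketch: TF-IDF stands in for a real embedding model and
# vector store; the retrieved chunks get chained into a prompt for the local LLM.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "llamafile bundles a model and llama.cpp into one executable.",
    "Mistral 7B is a 7-billion-parameter open-weight model.",
    "Increasing the context size increases RAM usage noticeably.",
]

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    vec = TfidfVectorizer().fit(docs + [query])
    doc_mat = vec.transform(docs)
    q_vec = vec.transform([query])
    scores = cosine_similarity(q_vec, doc_mat)[0]
    ranked = sorted(zip(scores, docs), reverse=True)
    return [doc for _, doc in ranked[:k]]

query = "Why does my RAM usage go up?"
context = "\n".join(retrieve(query, documents))
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
print(prompt)  # this prompt would then be sent to the local model
```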
I haven't checked how it fares on uncensored use cases, but from what I've seen the Q5_K quants of Mistral 7B are not far behind Mixtral 8x7B (the latter needs 64 GB of RAM, which I don't have).
Tried open-webui with Ollama yesterday for spinning up some of these models. It's pretty good.
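If you'd rather skip the UI, Ollama also exposes a plain REST API. Rough sketch, assuming Ollama is running on its default port 11434 and you've already pulled the mistral model:

```python
# Sketch of hitting Ollama's local REST API directly (assumes the default
# port 11434 and that `mistral` has already been pulled).
import json
import urllib.request

payload = {
    "model": "mistral",
    "prompt": "Give me one tip for running 7B models on 16 GB of RAM.",
    "stream": False,
}
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])
```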
Right now the minimum amount of RAM I would recommend is 16 GB. I think it can run with less memory, but that requires a few changes here and there (and they might reduce performance). I would also strongly recommend using a GPU over a CPU; in my experience it can make the LLM run twice as fast, if not more. Only Nvidia GPUs are supported for now, and CUDA Toolkit 12.2 is required to run Dot.
- How much RAM is needed?
- What CPU do you need for decent performance?
- Can it run on a GPU? If so, how much VRAM do you need, and does it only work with Nvidia?
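On the GPU question, a quick way to sanity-check whether a CUDA GPU is visible and how much VRAM it reports - this assumes a PyTorch build with CUDA support, which is just one convenient way to check:

```python
# Quick check: is a CUDA GPU visible, and how much VRAM does it report?
# (Assumes a PyTorch build with CUDA support is installed.)
import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {props.name}, VRAM: {props.total_memory / 1024**3:.1f} GB")
else:
    print("No CUDA GPU visible; inference would fall back to CPU (much slower).")
```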