Hacker News

This is very interesting to me. We spent significant time “labelling” data when I worked in public-sector digitalisation. Basically, we did the LLM part manually and then ran engines like this on top of it. Having used ChatGPT to write JSDoc documentation for a while now, and having been very impressed with how well it understands your code when you use good naming conventions, I’m fairly certain it’ll be the future of “librarian”-style labelling of case files.

But the key issue is going to be privacy. I’m not big on LLMs, so I’m sorry if this is obvious, but can I use something like this without sending my data outside my own organisation?



You can self-host an open-source model. llama.cpp is a very popular project with great docs.

https://github.com/ggerganov/llama.cpp

You need to be careful about licensing - for some of these models it's a legal grey area whether you can use them in commercial projects.

The 'best' models require quite large hardware to run, but a popular compression approach at the moment is 'quantization': using lower-precision model weights. I find it hard to evaluate which open-source models are better than others, and how much quantization degrades them.
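To illustrate the trade-off, here's a toy sketch of symmetric int8 quantization in pure Python. This is not llama.cpp's actual scheme (its 4-bit formats are block-wise and more elaborate), just the basic idea: store weights at lower precision, accept a bounded rounding error, and cut memory roughly 4x versus float32.

```python
def quantize_int8(weights):
    """Symmetric int8 quantization: scale floats into [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Map int8 values back to approximate floats."""
    return [v * scale for v in q]

weights = [0.12, -0.53, 0.91, -0.07, 0.33]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Each restored weight is close to, but not exactly, the original;
# the rounding error per weight is bounded by scale / 2.
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(q, max_err)
```

Real quantization schemes quantize per block of weights and sometimes keep sensitive layers at higher precision, which is why the quality impact varies so much between models and formats.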

You can also use the OpenAI API. They don't train on your data. They store it for 30 days for fraud monitoring, then delete it. It doesn't seem hugely different from using something like Slack/Google Docs/AWS.

I think some people imagine their data will end up in the knowledge base of GPT-5 if they use OpenAI products, but this would be a clear breach of their ToS.

https://openai.com/policies/api-data-usage-policies


I’m not sure the OpenAI model is EU-regulation compliant. It’s not just GDPR these days; the laws are ramping up to the point where we might not even be able to use Azure (as Microsoft, unlike Amazon, still can’t guarantee that only EU citizens staff support for the services). This is obviously worse in some EU sectors than others, but I’m not sure I’ll ever work outside green energy, the public sector or finance, so I’ll always have to deal with the harshest parts.

I wonder if one day they will sell a “self-hosted” version of GPT. We wouldn’t mind having a ChatGPT with its 2021 data set and no ability to use the internet if it meant it lived up to the regulations.

But can you do that? Can you “download” a model and then just use it?

As far as the hardware goes, I think we will be fine. My sector uses a lot of expensive hardware like mainframes for legacy systems, where we come together as organisations and buy the service from companies like IBM (or similar; typically there are 3-5 companies that take turns winning the 8-12 year contracts), who then operate the stuff inside our country. I’m sure we can do the same with LLMs.


Yep! I totally understand the concerns around not being able to share data externally - the library currently supports open-source, self-hosted LLMs through Hugging Face pipelines (https://github.com/refuel-ai/autolabel/blob/main/src/autolab...), and we plan to add more support here for models like llama.cpp that can run without many hardware constraints


Very interesting. I’ll most certainly favorite this and keep an eye on it. I think that sort of thing will be the future of LLMs for many of us.



