
Well, look at the process of training a chatbot:

- first you make a "raw" corpus, with all the information needed to produce an answer

- then you generate sample question-answer pairs

- then you use AI to make better questions and better answers (look at e.g. WizardLM https://arxiv.org/pdf/2304.12244)

- you can also fine-tune with RLHF, or modify the Q-A pairs directly

- then you do a final model fine-tune once the Q-A pairs look good

- then you use RAG over the corpus and the Q-A pairs because the model doesn't remember all the facts

- then you add a bullshit detector to avoid hallucinations
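The Q-A generation step above can be sketched as a minimal pipeline. This is illustrative only - the LLM call is stubbed out, and the function names aren't from any real library; a real system would call a chat model API there:

```python
def stub_llm(prompt: str) -> str:
    # Placeholder: a real pipeline would call an actual chat model here.
    return f"answer to: {prompt}"

def build_qa_pairs(corpus: list[str]) -> list[tuple[str, str]]:
    """Generate one seed question-answer pair per corpus document.

    Real pipelines (e.g. WizardLM-style evolution) would then iterate,
    asking the model to make each question harder or more specific.
    """
    pairs = []
    for doc in corpus:
        # Naive seed question; evolution/refinement passes come later.
        question = f"What does this passage say? {doc[:40]}"
        answer = stub_llm(question)
        pairs.append((question, answer))
    return pairs

corpus = [
    "Widgets ship in boxes of twelve.",
    "Returns are accepted for 30 days.",
]
pairs = build_qa_pairs(corpus)
```

The important property is that the pairs are derived from the corpus, not written independently, so they can be regenerated whenever the corpus changes.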

So the corpus is very important, and the Q-A pairs are also important. I would say you've got to make the corpus by hand, or with very specific LLM prompts. Meanwhile, you should be developing the Q-A pairs with LLMs as the project progresses - this gives a good indication of what the LLM knows, what needs work, etc. Once you have a good set of Q-A pairs, you could probably publish it as a static website and save money on LLM generation costs, at least for people who don't need super-specific answers.
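Publishing the pairs as a static site really can be that cheap - a sketch of rendering them into an HTML definition list (the function name and page layout are my own, not from any framework):

```python
import html

def render_faq(pairs: list[tuple[str, str]]) -> str:
    """Render Q-A pairs as a static HTML definition list."""
    items = "\n".join(
        # Escape user-facing text so corpus content can't inject markup.
        f"  <dt>{html.escape(q)}</dt>\n  <dd>{html.escape(a)}</dd>"
        for q, a in pairs
    )
    return f"<dl>\n{items}\n</dl>"

page = render_faq([("How many widgets per box?", "Twelve.")])
```

From there it's one file per topic and a free static host, with the LLM kept in reserve for questions the FAQ doesn't cover.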

To add to the current top-scoring comment, though (https://news.ycombinator.com/item?id=42326324), one advantage of an LLM-based workflow is that the corpus is the single source of truth. It is true that good documentation repeats itself, but from a maintenance standpoint, changing all the occurrences of a fact, idea, etc. is hard work, whereas changing it once in the corpus and then regenerating all the QA pairs is straightforward.
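One way to keep that regeneration cheap is to fingerprint each corpus entry and regenerate only the Q-A pairs whose source fact actually changed. A sketch - the cache layout (doc id mapped to the fingerprint from the last run) is an assumption, not a standard:

```python
import hashlib

def fingerprint(text: str) -> str:
    """Stable hash of a corpus entry's content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def stale_docs(corpus: dict[str, str], cache: dict[str, str]) -> list[str]:
    """Return ids of corpus entries whose Q-A pairs need regenerating.

    `cache` maps doc id -> fingerprint recorded at the last generation run;
    entries missing from the cache are treated as stale.
    """
    return [
        doc_id
        for doc_id, text in corpus.items()
        if cache.get(doc_id) != fingerprint(text)
    ]

corpus = {
    "shipping": "Widgets ship in boxes of twelve.",
    "returns": "Returns are accepted for 30 days.",
}
# The "returns" fact was edited since the last run, so only it is stale.
cache = {
    "shipping": fingerprint("Widgets ship in boxes of twelve."),
    "returns": fingerprint("Returns are accepted for 14 days."),
}
```

Changing the fact once in the corpus then triggers regeneration of exactly the affected Q-A pairs, instead of hunting down every restatement by hand.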


