
I was using embeddings to group articles by topic, and hit a specific issue. Say I had 10 articles about 3 topics, and articles are either dry or very casual in tone.

I found clustering by topic was hard, because the tone dimensions (whatever they were) seemed to dominate.

How can you pull apart the embeddings? Maybe use an LLM to extract a topic, and then cluster by extracted topic?

In the end I found it easier to just ask an LLM to group articles by topic.



I agree. I tried several methods in my pet project [1], and all of them have their pros and cons. It looks like creating the topics first and then predicting them with an LLM works best.

[1] https://eamag.me/2024/Automated-Paper-Classification
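
For anyone who wants to try the "topics first, then predict" approach, here is a minimal sketch. It is not the code from the linked project. `ask_llm` is a hypothetical callable (prompt in, reply text out) that you would wire to whatever model you use, and the prompt wording and the "unknown" fallback are my own assumptions.

```python
from typing import Callable, Dict, List


def build_prompt(article: str, topics: List[str]) -> str:
    """Ask for exactly one label from a fixed topic list."""
    options = ", ".join(topics)
    return (
        f"Pick the single best topic for the article from: {options}.\n"
        "Reply with the topic name only.\n"
        f"Article: {article}"
    )


def classify(article: str, topics: List[str], ask_llm: Callable[[str], str]) -> str:
    """Return the matching topic, or 'unknown' if the reply is off-list."""
    # ask_llm is a hypothetical stand-in for a real model call.
    reply = ask_llm(build_prompt(article, topics)).strip().lower()
    for topic in topics:
        if topic.lower() == reply:
            return topic
    return "unknown"


def group_by_topic(
    articles: List[str], topics: List[str], ask_llm: Callable[[str], str]
) -> Dict[str, List[str]]:
    """Group articles under the topic predicted for each one."""
    groups: Dict[str, List[str]] = {}
    for article in articles:
        groups.setdefault(classify(article, topics, ask_llm), []).append(article)
    return groups
```

Because the model only ever picks from a fixed list, tone has no obvious way to leak into the grouping, and off-list replies land in "unknown" instead of silently creating new groups.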


Allegedly, the new hotness in RAG is exactly that. Use a smaller LLM to summarize the article and include that summary alongside the article when generating the embedding.

That potentially solves your issue, and it is also handy when you have to chunk a larger document and would otherwise lose context by calculating the embedding on the chunk alone.
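
Roughly what that looks like, as a sketch under assumptions: `summarize` is a hypothetical callable standing in for the smaller LLM, the word-based chunker and the "Summary:/Chunk:" layout are just illustrative choices, and the returned strings are what you would pass to your embedding model.

```python
from typing import Callable, List


def chunk_text(text: str, max_words: int = 200) -> List[str]:
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]


def build_embedding_inputs(
    article: str,
    summarize: Callable[[str], str],
    max_words: int = 200,
) -> List[str]:
    """Prefix every chunk with the same article-level summary."""
    # summarize is a hypothetical stand-in for a small LLM call.
    summary = summarize(article)
    return [
        f"Summary: {summary}\n\nChunk: {chunk}"
        for chunk in chunk_text(article, max_words)
    ]


def first_sentence_summary(text: str) -> str:
    """Placeholder summarizer for trying the code without a model."""
    return text.split(".")[0].strip() + "."
```

Each chunk then carries the article-level context into its embedding, so a chunk from the middle of a long document is still tied to what the document is about.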



