Cool, first time I've seen one of my posts trend without me submitting it myself. Hopefully it's clear from the domain name and intro that I'm suggesting technical writers are underrating how useful embeddings can be in our work. I know ML practitioners do not underrate them.
You might want to highlight chunking and how embeddings can/should represent subsections of your document as well. It seems relevant for cases like similarity or semantic search: getting the reader to the relevant portion of the document or page.
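To make the subsection idea concrete, here's a rough sketch of heading-level chunking with one embedding per section. Markdown input, the sentence-transformers library, and the model name are all my assumptions here, not anything from the post:

```python
import re
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model choice

def chunk_by_heading(markdown_text):
    """Split a Markdown document into (heading, body) chunks, one per section."""
    parts = re.split(r"^(#{1,6} .*)$", markdown_text, flags=re.MULTILINE)
    chunks = []
    heading = "(intro)"
    for part in parts:
        if re.match(r"#{1,6} ", part):
            heading = part.strip()
        elif part.strip():
            chunks.append((heading, part.strip()))
    return chunks

doc = open("guide.md").read()
chunks = chunk_by_heading(doc)
# One embedding per subsection, so search can land the reader on the
# relevant heading rather than just the whole page.
embeddings = model.encode([f"{h}\n{b}" for h, b in chunks])
```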
There are probably some interesting ideas around tokenization and metadata as well. For example, if you're processing the raw file I expect you want to strip out a lot of markup before tokenizing the content. Conversely, some markup, like code blocks or examples, would be meaningful for tokenization and embedding anyway.
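As a rough illustration of that split, something like this could strip presentational HTML while keeping the prose and code blocks in the embeddable text. BeautifulSoup and the particular tag choices are my assumptions:

```python
from bs4 import BeautifulSoup

def extract_embeddable_text(html):
    """Drop presentational markup, keep prose and code examples."""
    soup = BeautifulSoup(html, "html.parser")
    # Navigation, scripts, and styling are noise for embedding purposes.
    for tag in soup(["nav", "script", "style", "header", "footer"]):
        tag.decompose()
    # Whatever remains, including <pre>/<code> examples, carries meaning,
    # so keep its text rather than discarding all markup wholesale.
    return soup.get_text(separator="\n", strip=True)
```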
I wonder if both of those ideas could be combined for something like automated footnotes and annotations: linking to, or showing on mouseover, relevant content from elsewhere in the documentation.
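A back-of-the-envelope version of that linking idea, assuming you already have one embedding per chunk; cosine similarity and the 0.6 threshold are just guesses to illustrate:

```python
import numpy as np

def suggest_links(embeddings, threshold=0.6):
    """For each chunk, propose the most similar *other* chunk as a 'see also' link."""
    # Normalize rows so dot products are cosine similarities.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T
    np.fill_diagonal(sims, -1.0)  # never link a chunk to itself
    links = []
    for i, row in enumerate(sims):
        j = int(row.argmax())
        if row[j] >= threshold:
            links.append((i, j, float(row[j])))
    return links
```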
Do you have any resources you'd recommend for representing subsections? I'm currently prototyping a note/thoughts editor where one feature is suggesting related documents/thoughts (think linked notes in Obsidian), for which I'd like to suggest subsections and not only full documents.
Sorry, no good references offhand. I've had to help write & generate public docs in DocBook in the past, but I'm no expert on editors, NLP, or embeddings beyond hacking around some tools for my own note taking. My assumption is you'll want to use your existing markup structure, if you have it. Or naively split on paragraphs with a tool like spaCy. Or get real fancy and use dynamic ranges: something like an accumulation window that aggregates adjacent sentences based on their similarity, breaks on total size or dissimilarity, and then treats that aggregate as the range to "chunk."
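A possible sketch of that accumulation-window idea, using spaCy for the sentence splitting as suggested; the similarity model, the 0.4 floor, and the size cap are my assumptions:

```python
import numpy as np
import spacy
from sentence_transformers import SentenceTransformer

nlp = spacy.load("en_core_web_sm")               # assumed pipeline
model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model

def accumulate_chunks(text, sim_floor=0.4, max_chars=1200):
    """Grow a window sentence by sentence; break on dissimilarity or size."""
    sents = [s.text.strip() for s in nlp(text).sents]
    if not sents:
        return []
    embs = model.encode(sents)
    embs = embs / np.linalg.norm(embs, axis=1, keepdims=True)
    chunks, window = [], [sents[0]]
    for prev, cur, sent in zip(embs, embs[1:], sents[1:]):
        similar = float(prev @ cur) >= sim_floor
        fits = sum(len(s) for s in window) + len(sent) <= max_chars
        if similar and fits:
            window.append(sent)  # still on-topic and small enough: keep growing
        else:
            chunks.append(" ".join(window))  # break and start a new window
            window = [sent]
    chunks.append(" ".join(window))
    return chunks
```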
Thanks for the detailed and helpful response. I'm also hacking on this as a personal note-taking project and have already started playing around with your ideas. Thanks!
Haha yeah I was about to comment that I recall a period just after Word2Vec came out where embeddings were most definitely not underrated but rather the most hyped ML thing out there!