Maybe I'm not understanding, but wouldn't using the file hash to keep track of things make it really tricky if the file is modified in any way? E.g., annotating a PDF would change its hash, so all data previously associated with the file would be lost.
I download ongoing serial fiction from the internet and read it locally, and periodically replace the files with newer versions that have more chapters. That workflow doesn't sound like it would work for me.
(This also means hashing the contents wouldn't work.)
But if the annotations get stored directly in the ebook file (as opposed to separately in a JSON file), I don't see how this would work for you, either? You would still have to transfer them to the new file somehow.
If a chain of hashes is associated with a single work ... you might get somewhere.
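To make the chain-of-hashes idea concrete, here is a minimal sketch in Python. The `WorkRegistry` class, its method names, and the work ID string are all made up for illustration, and it assumes something (the user or the reader app) tells the registry when a new file is a later version of a known work.

```python
import hashlib


class WorkRegistry:
    """Hypothetical registry that links many file hashes to one work."""

    def __init__(self):
        self._work_of = {}  # content hash -> work id
        self._chains = {}   # work id -> list of hashes, oldest first

    @staticmethod
    def digest(data: bytes) -> str:
        # SHA-256 of the raw file bytes.
        return hashlib.sha256(data).hexdigest()

    def add_version(self, work_id: str, data: bytes) -> str:
        """Record the hash of `data` as a version of `work_id`."""
        h = self.digest(data)
        self._work_of[h] = work_id
        chain = self._chains.setdefault(work_id, [])
        if h not in chain:
            chain.append(h)
        return h

    def lookup(self, data: bytes):
        """Return the work id for these bytes, or None if unknown."""
        return self._work_of.get(self.digest(data))

    def chain(self, work_id: str):
        """Return a copy of the known hashes for a work, oldest first."""
        return list(self._chains.get(work_id, []))


registry = WorkRegistry()
registry.add_version("my-serial", b"chapter 1")
registry.add_version("my-serial", b"chapter 1\nchapter 2")
print(registry.lookup(b"chapter 1\nchapter 2"))
```

Annotations would then be keyed on the work ID rather than on any single hash. The sketch does not solve the hard part, which is deciding that a never-before-seen file belongs to an existing work.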
I've thought through the problem of fingerprinting records (as in, any recorded data: text, images, audio, video, software, etc.) in a way that coherently identifies them despite changes over time. Git and related revision control systems probably offer one useful model. Another is to generate signatures via n-grams of the text in a way that's resilient (i.e., non-brittle) to various changes: different fonts, character sets, slight variances in spelling (e.g., British vs. American English, transliterations between languages), omissions or additions, or other changes. Different versions of the same underlying work, e.g., PDF, HTML, or ASCII text, translations, different editions, etc., all have much in common, though in ways that a naive file hash wouldn't immediately recognise or reveal.
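As a rough illustration of the n-gram approach, the sketch below compares two texts by the Jaccard similarity of their word trigrams after a crude normalisation. The function names, the choice of word trigrams, and the normalisation steps are my own assumptions, not an established scheme. It tolerates differences in case, punctuation, and accents, plus added or omitted passages, but it would not by itself handle spelling variants or translations.

```python
import re
import unicodedata


def normalize(text: str) -> list:
    """Lowercase, strip accents, and split into alphanumeric words."""
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return re.findall(r"[a-z0-9]+", stripped.lower())


def shingles(text: str, n: int = 3) -> set:
    """Return the set of word n-grams in the normalised text."""
    words = normalize(text)
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def similarity(a: str, b: str, n: int = 3) -> float:
    """Jaccard similarity of the two n-gram sets, between 0.0 and 1.0."""
    sa, sb = shingles(a, n), shingles(b, n)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


old = "The quick brown fox jumps over the lazy dog near the river bank."
new = old + " Then it slept."
print(similarity(old, new))
```

A file with a few chapters appended still scores high against the older version, whereas its file hash would be completely different. For book-length texts you would probably want to sample or sketch the n-gram set (MinHash is one known technique) rather than store all of it.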
We often refer to works through tuples such as author-title-pubdate, or editor-language-title. This is a minute fraction of the actual content of most works, but is remarkably effective in creating namespaces. Controlled vocabularies and specific indexing systems (Dewey Decimal, OCLC, Library of Congress Control Number, ISBN, DOI, etc.) all refine this further, but require specific authority and expertise. I'd like to see an approach which both leverages and extends such classifications.
You end up with the ebook file and an annotations and bookmarks file. I'm assuming that the program would just look in the same directory as the ebook file or some configurable location.
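Something like the following is what I have in mind for the lookup, as a sketch only. The `find_sidecar` function, the `.annotations.json` suffix, and the `extra_dirs` parameter are hypothetical, not taken from any actual reader.

```python
from pathlib import Path


def find_sidecar(ebook, extra_dirs=(), suffix=".annotations.json"):
    """Return the first matching sidecar file, or None if there is none.

    Looks next to the ebook first, then in each configured directory.
    The sidecar name is built from the ebook's file name without its
    extension, e.g. story.epub -> story.annotations.json.
    """
    ebook = Path(ebook)
    name = ebook.stem + suffix
    for directory in (ebook.parent, *(Path(d) for d in extra_dirs)):
        candidate = directory / name
        if candidate.is_file():
            return candidate
    return None
```

Because the match is on the file name rather than a hash, replacing the ebook with a newer version under the same name would keep the annotations attached, though renaming the file would break the link.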