
Maybe I'm not understanding, but wouldn't using the file hash to keep track of things make it really tricky if the file is modified in any way? E.g., when annotating a PDF, it would lose all previously associated data because the hash would change.


I download ongoing serialized fiction from the internet and read it locally, periodically replacing the files with versions that have more chapters. It does not sound like this workflow would work for me.

(This also means hashing the contents wouldn't work.)


But if the annotations get stored directly in the ebook file (as opposed to separately in a JSON file), I don't see how this would work for you, either? You would still have to transfer them to the new file somehow.


If notes are stored as JSON, maybe it would be easy to write a program to transfer notes from the old hash to the new one.
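
Something like this minimal sketch, assuming the JSON is a flat object keyed by file hash (that layout is a guess, not the app's documented schema):

    import hashlib, json, sys

    def sha256_of(path):
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 16), b""):
                h.update(chunk)
        return h.hexdigest()

    # Usage: rekey.py notes.json old_book.epub new_book.epub
    notes_path, old_book, new_book = sys.argv[1:4]
    with open(notes_path) as f:
        notes = json.load(f)  # assumed: {file_hash: {...notes...}}
    old_hash, new_hash = sha256_of(old_book), sha256_of(new_book)
    if old_hash in notes:
        # Move the record from the old file's hash to the new one.
        notes[new_hash] = notes.pop(old_hash)
        with open(notes_path, "w") as f:
            json.dump(notes, f, indent=2)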


If a chain of hashes is associated with a single work ... you might get somewhere.

I've thought through the problem of fingerprinting records (as in, any recorded data: text, images, audio, video, software, etc.) in a way that coherently identifies it despite changes over time. Git and related revision control systems probably offer one useful model. Another is to generate signatures via n-grams of the text in such a way that's resilient (i.e., non-brittle) despite various changes: different fonts, character sets, slight variances in spelling (e.g., British vs. American English, transliterations between languages), omissions or additions, or other changes. Different versions of the same underlying work, e.g., PDF, HTML, or ASCII text, translations, different editions, etc., all have much in common, though in ways that a naive file hash wouldn't immediately recognise or reveal.
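
As a rough sketch of the n-gram idea (the shingle size, MinHash-style truncation, and normalization below are arbitrary choices, not a worked-out scheme):

    import hashlib, re

    def fingerprint(text, n=5, k=200):
        # Normalize hard so fonts, case, and punctuation noise vanish.
        words = re.findall(r"[a-z]+", text.lower())
        grams = {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}
        hashes = sorted(int(hashlib.md5(g.encode()).hexdigest(), 16)
                        for g in grams)
        # Keep the k smallest hashes (a crude MinHash) so omissions or
        # added chapters nudge the signature rather than break it.
        return set(hashes[:k])

    def similarity(a, b):
        # Jaccard overlap: near 1.0 for versions of the same work,
        # near 0.0 for unrelated texts.
        return len(a & b) / max(1, len(a | b))

Two files would then count as "the same work" when their similarity clears some threshold, rather than requiring byte-identical hashes.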

We often refer to works through tuples such as author-title-pubdate, or editor-language-title. This is a minute fraction of the actual content of most works, but is remarkably effective in creating namespaces. Controlled vocabularies and specific indexing systems (Dewey Decimal, OCLC, Library of Congress Catalog Number, ISBN, DOI, etc.) all refine this further, but require specific authority and expertise. I'd like to see an approach which both leverages and extends such classifications.
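
A toy example of the tuple idea (the normalization rules here are invented for illustration):

    import re, unicodedata

    def work_key(author, title, pubdate):
        # Collapse each field to lowercase ASCII with punctuation
        # stripped, so trivial variants map to the same key.
        def norm(s):
            s = unicodedata.normalize("NFKD", s)
            s = s.encode("ascii", "ignore").decode()
            return re.sub(r"[^a-z0-9]+", "-", s.lower()).strip("-")
        return f"{norm(author)}/{norm(title)}/{pubdate}"

    work_key("Ursula K. Le Guin", "The Dispossessed", 1974)
    # -> 'ursula-k-le-guin/the-dispossessed/1974'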


> Reading progress, bookmarks, and annotations are stored in plain JSON files

The file is not modified.


I think they mean that the hash in the JSON record does not match the PDF file anymore, after the PDF has been changed by some other program.


I agree; it seems the original PDF is not changed by the app itself. But you are right that if it is edited directly and renamed, it will fail to load the JSON.


I guess they'd use a hash of the book contents, not the whole file?
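
Something like hashing the extracted text rather than the container bytes, e.g. (a sketch using pypdf, assuming it's available; extraction quality varies a lot between PDFs):

    import hashlib
    from pypdf import PdfReader

    def content_hash(pdf_path):
        h = hashlib.sha256()
        for page in PdfReader(pdf_path).pages:
            # Hash only the extracted text, so annotation layers,
            # metadata edits, and re-saves don't change the hash.
            h.update((page.extract_text() or "").encode("utf-8"))
        return h.hexdigest()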


You end up with the ebook file and an annotations and bookmarks file. I'm assuming that the program would just look in the same directory as the ebook file or some configurable location.
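
Roughly this lookup order, I'd guess (purely an assumption about the app, not documented behaviour):

    from pathlib import Path

    def sidecar_for(book_path, fallback_dir=None):
        # Check next to the ebook first, then a configured location.
        book = Path(book_path)
        candidates = [book.with_suffix(".json")]
        if fallback_dir:
            candidates.append(Path(fallback_dir) / (book.stem + ".json"))
        return next((p for p in candidates if p.exists()), None)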



