I find this utterly bizarre. Once upon a time, if you wanted to left pad a string, you would just do it. A while later, people discovered that you could use a library. (I’m joking a bit here, but libraries are genuinely useful.) With a library, you get to pick from various schemes and schedules for updating it, and you retain a degree of control.
But now apparently you’re supposed to use a web API and depend on an external service. This has all kinds of downsides: it has latency (and potentially tail latency). It has larger security issues. It doesn’t work in many sandboxes. It requires an asynchronous call. Callers have to handle timeouts and retries. (If you left pad a string with a normal library, it either works or it doesn’t. With a web service, it can fail transiently or give wrong answers transiently.) It updates on its own schedule, without notice, and cannot be rolled back. And it can charge an utterly outrageous per-call price, so instead of merely profiling and debugging slowness due to making too many calls, developers also have to worry about inadvertently spending hundreds of thousands of dollars.
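To make the timeout-and-retry point concrete, here is a minimal sketch of the boilerplate every caller of a remote service ends up writing (the retry policy and error types are illustrative assumptions, not any particular vendor's SDK), next to the one-liner a local library needs:

```python
import time

def call_with_retries(fn, attempts=3, base_delay=0.5):
    """Call fn(), retrying transient failures with exponential backoff.

    This is the extra machinery a remote left-pad (or PDF) service
    forces on every caller; a local library call needs none of it.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except (TimeoutError, ConnectionError):
            if attempt == attempts - 1:
                raise  # out of retries: surface the transient failure
            # back off before the next attempt
            time.sleep(base_delay * (2 ** attempt))

# the local-library equivalent, by contrast:
def left_pad(s, width, ch=" "):
    return s.rjust(width, ch)
```

And this sketch doesn't even cover idempotency, circuit breaking, or the billing surprises mentioned above.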
Replace “left pad a string” with “generate a PDF” and you get this. Why is this desirable?
I suppose things like this may partially explain the stunning slowness of bank websites.
This really does not resonate at all, and I have the scars to prove it.
I used to work on a browser-based document management system, and I would have used (or at least tried) all of these APIs without hesitation. PDFs are a pain, and the mishmash of poorly functioning tools that exists is a constant headache.
1) OCR'ing a PDF is difficult. The only good service is Google's, but to be performant it requires that you break the PDF into per-page images. This would have simplified things greatly. Even if the PDF has embedded text and is not just an image, that text can be wrong or laid out in a non-linear way, so you have to OCR it anyway. Command-line tools do not get you very far. An example: if you OCR or extract text from a PDF with multiple columns of text, does the tool handle the columns well?
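The "break it into pages as images" step is usually done with poppler's pdftoppm. A small sketch (the wrapper function names are mine; the flags are pdftoppm's own):

```python
import subprocess

def pdf_to_page_images_command(pdf_path, out_prefix, dpi=300):
    """Build a pdftoppm (poppler-utils) command that renders each page
    of the PDF to a PNG -- the per-page images you then send to the
    OCR service."""
    return ["pdftoppm", "-r", str(dpi), "-png", pdf_path, out_prefix]

def pdf_to_page_images(pdf_path, out_prefix, dpi=300):
    # requires poppler-utils installed; writes one numbered PNG per page
    # using out_prefix as the filename stem
    subprocess.run(pdf_to_page_images_command(pdf_path, out_prefix, dpi),
                   check=True)
```

You then have to reassemble per-page OCR results back into a single document, which is part of the engineering work an API would hide.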
2) People want searchable OCR'd PDFs where you can highlight the text, even when it's a bitmap underneath. This requires overlaying transparent text at the exact position of the text in the bitmap. It does not come for free, and I've only seen it done in proprietary, Windows-only software. This alone would be worth it.
3) Office to PDF is an extremely standard need, especially if you want to display documents online. But it's not easy. You have to hack together a headless OpenOffice to get it working at all, and it doesn't do a great job. It's difficult to do well because Office docs are like HTML pages: the output depends heavily on the renderer, not to mention the fonts. Microsoft does not offer a service to do this, unfortunately. If you think anything will do, it really won't: when people see that their PDF looks very different from what they saw in Word, they get upset.
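The usual "headless OpenOffice" hack today means LibreOffice's CLI. A sketch of the invocation (the Python wrapper is mine; the soffice flags are LibreOffice's own, and fidelity is approximate, as noted):

```python
import subprocess

def office_to_pdf_command(doc_path, out_dir):
    """Build the LibreOffice headless conversion command."""
    return [
        "soffice", "--headless",
        "--convert-to", "pdf",
        "--outdir", out_dir,
        doc_path,
    ]

def office_to_pdf(doc_path, out_dir):
    # requires a LibreOffice install; invocations are best serialized,
    # since soffice does not handle concurrent launches gracefully
    subprocess.run(office_to_pdf_command(doc_path, out_dir),
                   check=True, timeout=120)
```

Even once this runs, you still face the renderer and font mismatches described above.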
4) Table extraction APIs are super important, especially if you are trying to automatically extract data from PDFs (e.g. analyze financial disclosures). There have been whole startups dedicated to this.
5) HTML to PDF is also a pain: you have to set up an instance running headless Chromium, which can be quite slow. This has become the de facto standard for quickly creating complex PDFs. Having a simple API wrapper around this is just one less thing to manage.
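The headless-Chromium approach boils down to Chromium's own print-to-PDF flag. A sketch (the wrapper is mine; the binary name varies by platform, which is one of the things you end up managing):

```python
import subprocess

def html_to_pdf_command(url_or_path, out_pdf, chrome="chromium"):
    """Build a headless-Chromium print-to-PDF command.

    chrome: the browser binary name, which differs across systems
    (chromium, chromium-browser, google-chrome) -- an assumption
    you must configure.
    """
    return [
        chrome, "--headless", "--disable-gpu",
        f"--print-to-pdf={out_pdf}",
        url_or_path,
    ]

def html_to_pdf(url_or_path, out_pdf, chrome="chromium"):
    # each call launches a full browser, which is why this is slow;
    # production setups usually keep pooled instances behind a queue
    subprocess.run(html_to_pdf_command(url_or_path, out_pdf, chrome),
                   check=True, timeout=60)
```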
The rest of the APIs, like merging/splitting/watermarking etc., are pretty standard, and you do not need APIs for them if you already have access to the PDF on a server. But in a browser or on mobile, you might not.
I'll just throw my hat in the ring and mention that at Impira, we are one of those startups wholly dedicated to (4). We happen to use Google's OCR engine (1) under the hood (for raw OCR), and what you said resonates for sure: there's a lot of engineering work required to make it work performantly and generally (happy to chat about this with anyone who is interested).
Feel free to take Impira for a spin (https://www.impira.com) if you need to accurately extract data from PDF documents. Would love feedback from anyone who tries it out. [Disclaimer: I am the CEO/Founder of Impira].
I agree many of these things are a pain. This often reflects a workflow that is approaching things from entirely the wrong direction. ("If I wanted to go there, I wouldn't start from here.")
E.g. instead of trying to OCR a PDF, go back to the source document or database or whatever from which the PDF was generated. (Yes, I know that's not always an option. But it should be the first avenue to explore. We should push back against people who send around PDFs as though they were an all-purpose interchange format for textual or structured data.)
I'm a bit puzzled by (3), though:
> Office to PDF ... it's not easy ... when people see their PDF looks very different than what they saw on Word, they get upset
To get a PDF that looks the same as the Word document, just tell them to use the Print to PDF driver from right there within Word.
I think you recognize this already, but to add a bit of color, in highly regulated industries (e.g. financial services) and B2B settings with lots of peers (e.g. supply chain), "going back to the source document or database or whatever" requires an insane amount of consensus (which is not currently incentivized).
To add to that, a lot of PDFs (e.g. financial reports) are generated procedurally with ancient code that would have to be rewritten to generate a different format. The underlying database format is often many layers of abstraction different than the final output.
Yes, if you're working with documents a lot, it is. Word docs are not portable, and people don't like them because they can be changed easily, not to mention that not everybody has Word. You also can't display them inline in a browser.
>HTML to PDF is also a pain: you have to set up an instance that is running headless Chromium, which can be quite slow...
There are at least six non-Chromium alternatives I can think of at a moment's notice, plus the LGPL-licensed wkhtmltopdf.
>Office to PDF.... You have to hack together a headless OpenOffice to have it work at all, but it doesn't do a great job... Microsoft does not offer a service to do this, unfortunately.
Microsoft sort of does offer a service for this. SharePoint has a Word-to-PDF action, and with some stitching you can make it into an API. There are also several commercial solutions (e.g. Spire.NET), and there are ways to mangle the OpenXML into HTML (losing some fidelity in the process, of course).
All of the above may be correct, but nothing here advocates for a web service instead of licensed software. If I want to solve a linear program, I can use an open source library or I can pay for a commercial offering, but that commercial offering will run on my hardware (or cloud instance) and will operate independently of the network. If I want to edit a Word document, I can pay Microsoft for a local copy of Word.
Same thought here. Say you have to create an invoice for a customer, and your operations stop just because you're not using {Cairo, Skia, PoDoFo, JagPDF, Haru, whatever} in the local environment but relied on an external service that halted.
This introduces a huge dependency chain across the web, and these services don't provide anything that cannot be provided autonomously by a local library. Integrate with external services because you must, not because you can.
Node.js forces this architecture (no, worker threads are not a solution; they are heavy and have too many restrictions): you don't want to slow down the event loop with heavy PDF processing.
This is not in any respect limited to NodeJS. If you want to do a 500ms computation, you don’t want to do it synchronously in your network thread. It doesn’t make much difference whether it’s C, Rust, NodeJS, Go, etc. (CGI is different: everything is off the network thread.)
But this doesn’t mean you should outsource computations to a third party remote system. You can have a local (same physical hardware or same data center) off-thread service (or just thread pool) to do this kind of work with much nicer properties.
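A local off-thread pool is a few lines in most languages. A sketch in Python (render_pdf is a placeholder for your actual CPU-heavy routine, an assumption here):

```python
from concurrent.futures import ThreadPoolExecutor

# placeholder for the expensive, ~500ms computation; substitute your
# real PDF-rendering call (this stand-in just tags the input)
def render_pdf(data: bytes) -> bytes:
    return b"%PDF-1.4 " + data

# a small local pool: same process, same machine, no network dependency
pool = ThreadPoolExecutor(max_workers=4)

def handle_request(data: bytes) -> bytes:
    # the event loop / network thread only submits and awaits;
    # the heavy work runs off-thread
    return pool.submit(render_pdf, data).result(timeout=5)
```

The timeout here guards against a stuck worker, not a flaky network, and there is no per-call bill attached.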