Supposedly this sort of thing is also why Microsoft Word documents were so hard ...

spolsky · on July 28, 2016

not true, actually

https://msdn.microsoft.com/en-us/library/office/gg615596(v=o...

mdadm · on July 29, 2016

From your link:

I feel as though the gp comment is referring to far older versions, although without clarification, it's hard to be sure.

int_19h · on July 29, 2016

The older versions are also not literal dumps. They're binary "dumps" of the object tree in memory, yes, in the sense that you walk the tree and write it out. This is bad because your in-memory object tree then effectively defines the format, and it's not spec'd otherwise, which makes portability that much harder, especially for a closed-source application where you can't see the code. But it's a very different problem.
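
To make "walk the tree and write it out" concrete, here is a minimal sketch in Python. The `Node` class, its fields, and the record layout are all made up for illustration; nothing here reflects the actual Word or Excel layouts. The point is only that the byte format falls out of the in-memory structure rather than out of a separate spec.

```python
import struct
from dataclasses import dataclass, field

@dataclass
class Node:
    # Hypothetical in-memory object: a type tag, a payload, and children.
    kind: int
    text: bytes
    children: list = field(default_factory=list)

def dump(node: Node) -> bytes:
    # The record header mirrors the object's fields directly:
    # kind, payload length, child count (little-endian 16-bit each).
    out = struct.pack("<HHH", node.kind, len(node.text), len(node.children))
    out += node.text
    # Children are written depth-first, in whatever order memory holds them.
    for child in node.children:
        out += dump(child)
    return out

doc = Node(1, b"doc", [Node(2, b"para"), Node(2, b"para2")])
blob = dump(doc)
```

Add a field to `Node` and the file format changes with it, which is the portability problem in miniature.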

FWIW, old Office documents were actually CFBF (Compound File Binary Format) files - think of it as FAT-in-a-file, allowing for multiple independent streams inside, with transactions. This was very commonly used on Windows in the OLE/COM era, because it was the underlying format for OLE Structured Storage. It's what allowed a Word document to embed another arbitrary document in an extensible way. The underlying data in the streams within CFBF was a loose object graph dump.
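
As a rough illustration, a compound file can be recognized from its first few header bytes. This is a sketch, not a full parser: the function name `parse_cfbf_header` is my own, and it only reads the signature, the version fields, and the sector shift.

```python
import struct

# Every compound file starts with this 8-byte signature.
CFBF_SIGNATURE = bytes.fromhex("D0CF11E0A1B11AE1")

def parse_cfbf_header(data: bytes):
    """Return (major_version, sector_size), or None if this is not a compound file."""
    if len(data) < 34 or data[:8] != CFBF_SIGNATURE:
        return None
    # Bytes 8..23 hold a CLSID. Five little-endian 16-bit fields follow at
    # offset 24: minor version, major version, byte order, sector shift,
    # mini sector shift.
    minor, major, byte_order, sector_shift, mini_shift = struct.unpack_from("<5H", data, 24)
    # The sector size is stored as a power of two (shift 9 means 512 bytes).
    return major, 1 << sector_shift
```

Feeding it the first 512 bytes of an old .doc or .xls should give a major version of 3 or 4 and a sector size of 512 or 4096; anything else (for example a zip-based .docx) returns None.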

It all makes a lot of sense when you have your OLE glasses firmly on - it's basically a natural design that follows if your world consists of OLE objects and interactions between them. Look up IStorage and IStream to see what I mean.

The side effect of all this, however, is that the data inside an old Office file is not laid out in a logical way - streams consist of non-sequential interleaved blocks in a seemingly random order (depending on what was written when), some blocks may contain garbage data, and so on. So it's very difficult to reverse engineer, which is why it took so long back in the day, and the results were often unreliable.
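
A toy model of why the bytes look scrambled: each stream is a linked chain of sectors, and the FAT says which sector comes next. The sketch below uses plain Python lists in place of a real file; the sector contents and the `read_stream` helper are invented for illustration, though the two marker values are the ones the format uses.

```python
ENDOFCHAIN = 0xFFFFFFFE  # FAT marker: last sector of a chain
FREESECT = 0xFFFFFFFF    # FAT marker: sector not allocated to any stream

def read_stream(sectors, fat, start):
    """Follow the FAT chain from `start` and concatenate the sectors it visits."""
    chunks = []
    seen = set()
    sid = start
    while sid != ENDOFCHAIN:
        # Guard against loops and out-of-range links in a damaged file.
        if sid in seen or sid >= len(fat):
            raise ValueError("corrupt sector chain")
        seen.add(sid)
        chunks.append(sectors[sid])
        sid = fat[sid]
    return b"".join(chunks)

# Physical order on disk: two streams interleaved, plus one stale free sector.
sectors = [b"BBBB", b"1111", b"2222", b"AAAA", b"CCCC", b"????"]
fat = [4, 2, ENDOFCHAIN, 0, ENDOFCHAIN, FREESECT]

stream_a = read_stream(sectors, fat, 3)  # visits sectors 3 -> 0 -> 4
stream_b = read_stream(sectors, fat, 1)  # visits sectors 1 -> 2
```

Reading the file front to back gives you BBBB, 1111, 2222, AAAA, CCCC and some leftover garbage; only by following the chains do you get the two streams back in logical order.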

poizan42 · on July 29, 2016

> FWIW, old Office documents were actually CFBF (Compound File Binary Format) files

Those are actually the "new" binary formats. The use of CFBF seems to have been introduced in Office 4.2 (at least, Excel 5.0 is the first Excel version to use it; it's hard to find information about the old Word document file formats).

> The side effect of all this, however, is that the data inside an old Office file is not laid out in a logical way - streams consist of non-sequential interleaved blocks in a seemingly random order (depending on what was written when), some blocks may contain garbage data, and so on. So it's very difficult to reverse engineer, which is why it took so long back in the day, and the results were often unreliable.

I don't believe the OLE compound file format itself has ever taken much effort to reverse engineer. But the CFBF-based Office documents are also basically just blobs of the older binary formats saved in a more structured way. The issues with Office documents have always been a matter of their sheer complexity combined with their tight coupling to the internals of the Office programs. This still shines through in the OOXML formats, which contain lots of stuff like "position something the way it was done in Word 5.0".