Reverse engineering iWork | Svelte Hacker News

dunham 17 hours ago

Nice work, thanks for taking the time to write it up. I regret not doing that for my projects.

I also did something similar back around 2014 in https://github.com/dunhamsteve/iwork but I didn't get much further than tables on the Numbers side before taking a break. That project translated iWork files to HTML. The code has been largely neglected since then, and I never wrote up my process. Like the other commenter, I based this on https://github.com/obriensp/iWorkFileFormat

For ObjC programs that don't embed the descriptors, I wrote a Python script that reverse engineers protobuf schemas from disassembled code: https://gist.github.com/dunhamsteve/224e26a7f56689c33cea4f0f... I don't remember what project that was for, but maybe it's useful to someone.

And for Notes.app, I reverse engineered the description from the binary protobuf data. Since the wire format is ambiguous between binary data and nested objects, my script would build a tentative schema and then refine it against further examples. I later learned that the full schema, in text form, was embedded in the web version of the application. That project is at https://github.com/dunhamsteve/notesutils and is also neglected. I believe the table format has changed enough that tables are no longer working.
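For readers who have not run into this ambiguity, here is a minimal Python sketch of the tentative-then-refine idea. It is not the notesutils script: the function names (`read_varint`, `parse_message`, `infer_schema`) are hypothetical, it only looks at top-level fields, and it rejects the deprecated group wire types. A length-delimited field is tentatively called a message if its payload happens to parse as one, and is downgraded to bytes as soon as any example fails to parse.

```python
def read_varint(buf, pos):
    """Decode one base-128 varint at pos; return (value, next_pos)."""
    result, shift = 0, 0
    while True:
        if pos >= len(buf) or shift > 63:
            raise ValueError("truncated or oversized varint")
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def parse_message(buf):
    """Parse buf as a protobuf message without a schema.

    Returns [(field_number, wire_type, value)] and raises ValueError
    if the bytes are not a well-formed message.
    """
    fields, pos = [], 0
    while pos < len(buf):
        key, pos = read_varint(buf, pos)
        number, wire_type = key >> 3, key & 7
        if number == 0:
            raise ValueError("field number 0 is invalid")
        if wire_type == 0:                      # varint
            value, pos = read_varint(buf, pos)
        elif wire_type in (1, 5):               # fixed64 / fixed32
            size = 8 if wire_type == 1 else 4
            value = buf[pos:pos + size]
            if len(value) != size:
                raise ValueError("truncated fixed-width field")
            pos += size
        elif wire_type == 2:                    # length-delimited
            length, pos = read_varint(buf, pos)
            if pos + length > len(buf):
                raise ValueError("truncated length-delimited field")
            value = buf[pos:pos + length]
            pos += length
        else:                                   # groups (3, 4) or unknown
            raise ValueError("unsupported wire type %d" % wire_type)
        fields.append((number, wire_type, value))
    return fields


def infer_schema(examples):
    """Build a tentative {field_number: kind} map and refine it per example."""
    names = {0: "varint", 1: "fixed64", 5: "fixed32"}
    schema = {}
    for example in examples:
        for number, wire_type, value in parse_message(example):
            if wire_type == 2:
                try:
                    parse_message(value)
                    kind = "message"            # tentative guess only
                except ValueError:
                    kind = "bytes"              # definitely not a message
            else:
                kind = names[wire_type]
            prev = schema.get(number)
            # One failing example is enough to downgrade message -> bytes;
            # any other disagreement keeps the first kind seen.
            if prev is None or (prev == "message" and kind == "bytes"):
                schema[number] = kind
    return schema
```

The guess is only ever tentative because ordinary data often parses as a message: `parse_message(b"hi")` succeeds and yields a single varint field, which is why more examples are needed before trusting a "message" label.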

andrew_rfc 14 hours ago

I spent a few days brute-forcing tables before I came across your repo and it finally clicked what was actually going on. Thank you so much!

psobot 19 hours ago

Nice work! I had the same fun RE adventure in https://github.com/psobot/keynote-parser a couple years back, based on Sean Patrick O'Brien's work back in 2013: https://github.com/obriensp/iWorkFileFormat/blob/master/Docs...

mackross a day ago

Amazing work by the author!

donatj 6 hours ago

I wrote a rarely used Numbers importer for my company around 2009, I'd guess. The XML format they used was truly atrocious. Saving a single value in a single cell resulted in 1 megabyte of XML, compressed of course. Still, parsing that megabyte took a decent chunk of RAM with the XML parser I used.

I spent a couple hours maybe 6 months ago trying to reverse engineer the protobuf version. I could not. Above my ability.

This is frankly kind of fascinating to read and sounds like it targets exactly me.

I wish they'd just gone with a better XML format, or JSON or something. Locking a user's raw data away in a binary file, even if it's a protobuf, will never not be rude.