So, I’ve been having fun traversing the LibreOffice .docx and .rtf import filters while trying to implement Ink annotations in LibreOffice Writer. As it turns out, I don’t strictly agree with the ISO press release that the Office Open XML file format is “intended to be implemented by multiple applications on multiple platforms”. However, despite having spent way too much time wading through a ~24k line XML file defining the document model in the importer, I’ve been very grateful that the format is XML-based and therefore human readable.
I’ve included some useful resources at the end for any intrepid programmers who wish to help with tackling the importer beast.
Ink is a Microsoft Word feature that lets you annotate with a stylus on a tablet PC, so you can doodle over your documents as you see fit. Technical details ahead, so feel free to skip to the results.
Ink strokes are saved in docx documents as Bézier curves expressed through VML paths (these are pretty similar to SVG paths, with commands and co-ordinates). I had quite a bit of fun hacking a parser together. Here’s the patch; with a few tweaks, it could be generally useful. It produces a list of all the sub-paths in a path, each sub-path consisting of a list of co-ordinates and co-ordinate flags marking each point as a normal point or a control point.
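To give a rough idea of the output shape described above, here is a minimal sketch of such a parser. It is not the actual patch: the names PointFlag, PathPoint, SubPath and parseVmlPath are made up for illustration, and it assumes only the absolute commands m (move to), l (line to), c (curve to), x (close) and e (end) with integer co-ordinates. Anything else in the path is simply skipped.

```cpp
#include <cctype>
#include <cstddef>
#include <cstdlib>
#include <string>
#include <vector>

// Hypothetical types for illustration; not the ones used in LibreOffice.
enum class PointFlag { Normal, Control };

struct PathPoint {
    long x;
    long y;
    PointFlag flag;
};

using SubPath = std::vector<PathPoint>;

// Reads one integer co-ordinate at pos and skips one trailing comma.
// An omitted value (e.g. the first one in "m,5") is read as 0.
static long readCoord(const std::string& path, std::size_t& pos) {
    while (pos < path.size() && path[pos] == ' ') ++pos;
    const std::size_t start = pos;
    if (pos < path.size() && (path[pos] == '-' || path[pos] == '+')) ++pos;
    while (pos < path.size() && std::isdigit(static_cast<unsigned char>(path[pos]))) ++pos;
    const long value =
        pos > start ? std::strtol(path.substr(start, pos - start).c_str(), nullptr, 10) : 0;
    while (pos < path.size() && path[pos] == ' ') ++pos;
    if (pos < path.size() && path[pos] == ',') ++pos;
    return value;
}

// True if another co-ordinate follows before the next command letter.
static bool moreCoords(const std::string& path, std::size_t pos) {
    while (pos < path.size() && path[pos] == ' ') ++pos;
    if (pos >= path.size()) return false;
    const char c = path[pos];
    return std::isdigit(static_cast<unsigned char>(c)) || c == '-' || c == ',';
}

std::vector<SubPath> parseVmlPath(const std::string& path) {
    std::vector<SubPath> subPaths;
    SubPath current;
    std::size_t pos = 0;

    auto flush = [&] {
        if (!current.empty()) {
            subPaths.push_back(current);
            current.clear();
        }
    };
    auto addPoint = [&](PointFlag flag) {
        const long x = readCoord(path, pos);
        const long y = readCoord(path, pos);
        current.push_back(PathPoint{x, y, flag});
    };

    while (pos < path.size()) {
        const char cmd = path[pos++];
        if (cmd == 'm') {            // move to: starts a new sub-path
            flush();
            addPoint(PointFlag::Normal);
        } else if (cmd == 'l') {     // line to: one or more end points
            do {
                addPoint(PointFlag::Normal);
            } while (moreCoords(path, pos));
        } else if (cmd == 'c') {     // curve to: two control points, then an end point
            do {
                addPoint(PointFlag::Control);
                addPoint(PointFlag::Control);
                addPoint(PointFlag::Normal);
            } while (moreCoords(path, pos));
        } else if (cmd == 'x') {     // close: return to the start of the sub-path
            if (!current.empty()) current.push_back(current.front());
        } else if (cmd == 'e') {     // end of sub-path
            flush();
        }
        // Everything else (spaces, unsupported commands) is skipped in this sketch.
    }
    flush();
    return subPaths;
}
```

Under these assumptions, parseVmlPath("m0,0l10,0c10,5,5,10,0,10xe") yields a single sub-path of six points, with the third and fourth flagged as control points.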
Word’s storage of Ink annotations does highlight some of the problems with implementing Word-compatible OOXML. They’re represented something like this:
<v:shape path="[VML path]" ...>
  <o:ink i="[base64 binary data]" annotation="t"/>
</v:shape>
Now, [VML path] is really the important part as it contains the Ink shape geometry. But the [base64 binary data]? What’s stored in there is anybody’s guess – I’ve certainly not found any documentation on its contents. Anyone who has a tablet version of Word should feel free to take a crack at reverse engineering it ;)
It turns out that paths for Ink annotations consist of Bézier curves. Béziers weren’t supported in the importer, so the path attribute, as well as any <v:curve> elements (which use control1, control2, to and from attributes), got ignored. I added support by reading the path and control1/control2 attributes and passing the parsed result to LibreOffice through the UNO API.
Word allows you to export a document with Ink to RTF. Most of the code for importing the RTF equivalent was already there; it just needed some adapting. I found something interesting that I haven’t seen documented elsewhere, furthering Miklos Vajna’s work (see README) on understanding the RTF spec. The geometry of the Ink shapes is described using the pVerticies [sic] and pSegmentInfo keywords. The pSegmentInfo section is a list of commands indicating what the points listed in pVerticies mean (move to, curve, end sub-path and so on).
| Segment indicator | Description | Vertices associated |
|---|---|---|
| 0x0001 | Line to point | 1 (x, y) |
| 0x2001 | Bezier curve with two control points and end point | 3 (cx1, cy1, cx2, cy2, x, y) |
| 0x4000 | Move to point | 1 (x, y) |
The plot thickens…
So, when importing a Word-generated .rtf with Ink annotations, why was I seeing segment indicators like 0x200A? Apparently, the low-order byte of certain segment indicators gives the number of point sets the command applies to. For example, four curves in a row with three points each in pVerticies can be specified in a single segment indicator, 0x2004 (encompassing 12 points in total). This may also apply to other line segment types, but that is as yet untested. You can easily extract the number of segments indicated using basic bitwise operators:
unsigned int segment = 0x200A; // Example segment indicator: curve (0x2000) with a count of 0x0A
unsigned int count = segment & 0x00FF; // Assuming the low-order byte holds the number of point sets (here 10)
segment &= 0xFF00; // Discard the count; just leave the segment type (0x2000)
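Putting the table and the masking together, a decoder could look roughly like the sketch below. This is not the code that went into LibreOffice: InkPoint, InkSubPath and expandSegments are hypothetical names, only the three segment types from the table are handled, and it assumes that the count in the low-order byte applies to line segments as well as curves (which, as noted, is untested). A move-to is assumed to always consume exactly one vertex.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical types for illustration only.
struct InkPoint {
    long x;
    long y;
    bool control; // true for a Bezier control point, false for a normal point
};

using InkSubPath = std::vector<InkPoint>;

// Walks the pSegmentInfo commands and consumes points from pVerticies in order.
std::vector<InkSubPath> expandSegments(
    const std::vector<unsigned int>& segmentInfo,
    const std::vector<std::pair<long, long>>& vertices) {
    std::vector<InkSubPath> subPaths;
    InkSubPath current;
    std::size_t next = 0; // index of the next unused vertex

    // Appends the next vertex to the current sub-path, if any are left.
    auto take = [&](bool control) {
        if (next < vertices.size()) {
            current.push_back(InkPoint{vertices[next].first, vertices[next].second, control});
            ++next;
        }
    };

    for (unsigned int segment : segmentInfo) {
        const unsigned int count = segment & 0x00FF; // number of point sets
        const unsigned int type = segment & 0xFF00;  // segment type

        if (type == 0x4000) {            // move to: start a new sub-path
            if (!current.empty()) {
                subPaths.push_back(current);
                current.clear();
            }
            take(false);
        } else if (type == 0x2000) {     // curve: 3 points per set
            for (unsigned int i = 0; i < count; ++i) {
                take(true);
                take(true);
                take(false);
            }
        } else if (type == 0x0000) {     // line to: 1 point per set (assumed)
            for (unsigned int i = 0; i < count; ++i) {
                take(false);
            }
        }
        // Other segment indicators are ignored in this sketch.
    }
    if (!current.empty()) subPaths.push_back(current);
    return subPaths;
}
```

With segment indicators 0x4000, 0x2002 and 0x0001, this consumes one, six and one vertices respectively, producing one sub-path of eight points.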
Woo ink in LibreOffice!!
LibreOffice now correctly displays not only Ink, but (in theory) any curves and shapes with paths when importing from .docx or .rtf. A minor bug with RTF image wrapping which caused the shape to be inline with the text instead of over it was also fixed (the property was just being ignored), so better imports all round!
Next step – correct export of bezier shapes to docx and rtf (no, I’m still not sure whether that blob of binary in the o:ink element is of any importance whatsoever, but this should be one way to find out).
v:shape schema information – this website is great for making sense of the OOXML standard, particularly if used alongside this:
ISO IEC 29500 – ISO standard document for OOXML (warning, big pdf in a zip).
RTF spec – only somewhat useful
UNO API reference – useful if used with the search function
writerfilter and oox – LibreOffice modules of interest for importing OOXML documents (cgit links for browsing the source/READMEs)