Importing OOXML Ink annotations into LibreOffice

So, I’ve been having fun traversing the LibreOffice .docx and .rtf import filters while trying to implement Ink annotations in LibreOffice Writer. As it turns out, I don’t strictly agree with the ISO press release that the Office Open XML file format is “intended to be implemented by multiple applications on multiple platforms”. However, despite having spent way too much time wading through a ~24k line XML file defining the document model in the importer, I’ve been very grateful that the format is XML-based and therefore human readable.

I’ve included some useful resources at the end for any intrepid programmers who wish to help with tackling the importer beast.

DOCX import

[Image: Drawn Ink annotations]

Ink lets you annotate documents in Microsoft Word with a stylus on a tablet PC, so that you can doodle over your documents as you see fit. Technical details ahead, so feel free to skip to the results.

Ink strokes are saved in .docx documents as Bézier curves expressed through VML paths (these are pretty similar to SVG paths, with commands and co-ordinates). I had quite a bit of fun hacking a parser together – here’s the patch; with a few tweaks, it could be generally useful. It produces a list of all the sub-paths in a path, each sub-path consisting of a list of co-ordinates and co-ordinate flags marking each point as a normal point or a control point.
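To give a feel for the output described above, here is a minimal sketch of such a parser. It is not the code from the patch: the names PointKind, PathPoint, SubPath and parseVmlPath are made up for illustration, and it assumes a heavily simplified path syntax (only the absolute m, l, c, x and e commands, with integer co-ordinates separated by commas or spaces). Real VML paths have more commands and allow omitted parameters, and this sketch does not record whether a sub-path was closed.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <cstdlib>
#include <string>
#include <vector>

// Hypothetical types for this sketch; not the ones used in the LibreOffice patch.
enum class PointKind { Normal, Control };

struct PathPoint {
    long x;
    long y;
    PointKind kind;
};

using SubPath = std::vector<PathPoint>;

// Parses a simplified VML path: only absolute m, l, c, x and e commands
// with integer co-ordinates separated by commas or spaces.
std::vector<SubPath> parseVmlPath(const std::string& path) {
    std::vector<SubPath> subPaths;
    SubPath current;
    char command = 0;
    std::vector<long> numbers;

    // Turns the numbers collected for the current command into points.
    auto flush = [&]() {
        const std::size_t perSegment = (command == 'c') ? 6 : 2;
        for (std::size_t i = 0; i + perSegment <= numbers.size(); i += perSegment) {
            if (command == 'c') {
                current.push_back({numbers[i], numbers[i + 1], PointKind::Control});
                current.push_back({numbers[i + 2], numbers[i + 3], PointKind::Control});
                current.push_back({numbers[i + 4], numbers[i + 5], PointKind::Normal});
            } else if (command == 'm' || command == 'l') {
                current.push_back({numbers[i], numbers[i + 1], PointKind::Normal});
            }
        }
        numbers.clear();
    };

    // Stores the sub-path being built, if it has any points.
    auto endSubPath = [&]() {
        if (!current.empty()) {
            subPaths.push_back(current);
            current.clear();
        }
    };

    std::size_t pos = 0;
    while (pos < path.size()) {
        const char ch = path[pos];
        if (std::isalpha(static_cast<unsigned char>(ch))) {
            flush();
            if (ch == 'm' || ch == 'x' || ch == 'e') endSubPath();
            command = ch;
            ++pos;
        } else if (std::isdigit(static_cast<unsigned char>(ch)) || ch == '-') {
            const char* start = path.c_str() + pos;
            char* end = nullptr;
            const long value = std::strtol(start, &end, 10);
            if (end == start) {  // a lone '-' with no digits after it
                ++pos;
                continue;
            }
            numbers.push_back(value);
            pos += static_cast<std::size_t>(end - start);
        } else {
            ++pos;  // skip commas and whitespace
        }
    }
    flush();
    endSubPath();
    return subPaths;
}
```

Calling parseVmlPath("m0,0l10,0c10,5,5,10,0,10xe") would give one sub-path of five points, the third and fourth of which are flagged as control points.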

Word’s storage of Ink annotations does highlight some of the problems with implementing Word-compatible OOXML. They’re represented something like this:

<v:shape path="[VML path]" ...>
...
<o:ink i="[base64 binary data]" annotation="t"/>
</v:shape>

Now, [VML path] is really the important part as it contains the Ink shape geometry. But the [base64 binary data]? What’s stored in there is anybody’s guess – I’ve certainly not found any documentation on its contents. Anyone who has a tablet version of Word should feel free to take a crack at reverse engineering it ;)

It turns out that paths for Ink annotations consist of bezier curves. Beziers weren’t supported in the importer, so the path attribute, as well as any <v:curve> elements (which use control1, control2, to and from attributes), got ignored. So I added the support by getting the path and control1/control2 attributes and passing off the parsed result to LibreOffice using the UNO API.

RTF import

Word allows you to export a document with Ink to RTF. Most of the code for importing the RTF equivalent was already there; it just needed some adapting. I found something interesting that I haven’t seen documented elsewhere, furthering Miklos Vajna’s work (see README) on understanding the RTF spec. The geometry of the Ink shapes is described using the pVerticies [sic] and pSegmentInfo keywords. The pSegmentInfo section is a list of commands indicating what the points listed in pVerticies mean (move to, curve, end sub-path and so on).

Segment indicator   Description                                          Vertices associated
0x0001              Line to point                                        1 (x, y)
0x2001              Bezier curve with two control points and end point   3 (cx1, cy1, cx2, cy2, x, y)
0x4000              Move to point                                        1 (x, y)
0x6001              Close path                                           0
0x8000              End path                                             0

The plot thickens…

So, when importing a Word-generated .rtf with Ink annotations, why was I seeing segment indicators like 0x200A? Apparently, the low-order byte of certain segment indicators gives the number of point sets the command applies to. For example, four curves in a row, with three points each in pVerticies, can be written as the single segment indicator 0x2004 (encompassing 12 points in total). This may also apply to other line segment types, but that is as yet untested. You can easily extract the count using basic bitwise operators:

unsigned int segment = 0x200A; // Example segment indicator
unsigned int count = segment & 0x00FF; // Assuming the low-order byte holds the number of point sets
segment &= 0xFF00; // Discard the count; just leave the segment type
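Putting the table and the count together, a decoder could look roughly like the sketch below. This is not the importer’s code: Point, SubPath and decodeSegments are hypothetical names, the real importer hands its result to UNO structures, and the sketch assumes the repeat count also applies to line segments (which, as noted above, is untested). It also does not record whether a sub-path was closed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical types for this sketch; not the importer's own structures.
struct Point {
    long x;
    long y;
    bool control;  // true for Bezier control points
};

using SubPath = std::vector<Point>;

// Walks the pSegmentInfo values and consumes points from pVerticies accordingly.
std::vector<SubPath> decodeSegments(const std::vector<std::uint16_t>& segmentInfo,
                                    const std::vector<std::pair<long, long>>& vertices) {
    std::vector<SubPath> subPaths;
    SubPath current;
    std::size_t next = 0;  // index of the next unused vertex

    // Appends one vertex to the current sub-path; returns false when vertices run out.
    auto take = [&](bool control) {
        if (next >= vertices.size()) return false;
        current.push_back({vertices[next].first, vertices[next].second, control});
        ++next;
        return true;
    };

    // Stores the sub-path being built, if it has any points.
    auto finish = [&]() {
        if (!current.empty()) {
            subPaths.push_back(current);
            current.clear();
        }
    };

    for (const std::uint16_t segment : segmentInfo) {
        const unsigned count = segment & 0x00FF;  // number of point sets
        const unsigned type = segment & 0xFF00;   // segment type without the count
        switch (type) {
        case 0x4000:  // move to: start a new sub-path with one point
            finish();
            take(false);
            break;
        case 0x0000:  // line to: one point per repeat (assumed, untested)
            for (unsigned i = 0; i < count; ++i) take(false);
            break;
        case 0x2000:  // curve: two control points and an end point per repeat
            for (unsigned i = 0; i < count; ++i) {
                take(true);
                take(true);
                take(false);
            }
            break;
        case 0x6000:  // close path
        case 0x8000:  // end path
            finish();
            break;
        default:      // unknown segment type: ignored in this sketch
            break;
        }
    }
    finish();
    return subPaths;
}
```

With segment indicators {0x4000, 0x2002, 0x0001, 0x6001, 0x8000}, the decoder uses eight vertices: one for the move, six for the two curves and one for the line.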

Woo ink in LibreOffice!!

[Image: Drawn Ink annotations in LibreOffice]

LibreOffice now correctly displays not only Ink, but (in theory) any curves and shapes with paths when importing from .docx or .rtf. A minor bug with RTF image wrapping which caused the shape to be inline with the text instead of over it was also fixed (the property was just being ignored), so better imports all round!

Next step – correct export of bezier shapes to docx and rtf (no, I’m still not sure whether that blob of binary in the o:ink element is of any importance whatsoever, but this should be one way to find out).

Resources

v:shape schema information – this website is great for making sense of the OOXML standard, particularly if used alongside this:
ISO IEC 29500 – ISO standard document for OOXML (warning, big pdf in a zip).
RTF spec – only somewhat useful
UNO API reference – useful if used with the search function
writerfilter and oox – LibreOffice modules of interest for importing OOXML documents (cgit links for browsing the source/READMEs)

Tech update: LibreOffice cross compile MSI installer generation

I’m working on allowing a Windows Installer (.msi) for LibreOffice to be built when cross compiling under Linux. So far, it has been a wide-ranging project and has covered:

  • Windows, MSI and Cabinet APIs (C, SQL, Wine, winegcc)
  • LibreOffice build system (Perl, autotools)

Project status as of posting:

  • Developing on openSUSE 12.1 (x86_64) to target Windows (i686).
  • .msi files can be created and taken apart with the cross MSI tools (cgit) msidb and msiinfo.
  • Cabinet files can be extracted but not created. However, makecab can already parse the Diamond Directive File (.ddf) format (this is required when the LO build system creates a cabinet).
  • Remaining: hook up MSI transforms and patches (msitran and msimsp); fit the tools into the build system; clean up and maintenance.

The MSDN documentation for the win32 native tools has been linked to where appropriate.

1.1 Cross compiling LibreOffice

Luckily, LibreOffice cross compile support is already very good. README.cross in the LibreOffice root directory has far more information. Assuming you have checked out LibreOffice and have all the MinGW dependencies, cross compiling can be as simple as changing <lo_root>/autogen.lastrun to read:

CC=ccache i686-w64-mingw32-gcc
CXX=ccache i686-w64-mingw32-g++
CC_FOR_BUILD=ccache gcc
CXX_FOR_BUILD=ccache g++
--with-distro=LibreOfficeMinGW

This references <lo_root>/distro-configs/LibreOfficeMinGW.conf. The distro-configs folder contains various configurations for compiling under different circumstances. I also found it helpful to add this line to LibreOfficeMinGW.conf to make life simpler:

--without-java

1.2 Building the installer

The installer build logic can be found in <lo_root>/solenv/bin/modules/installer/windows. It makes use of several Microsoft utilities to eventually output an MSI file. Some of these utilities are already distributed by Wine.

Provided by Wine:

  • expand.exe – Used to unpack cabinet files.
  • cscript.exe – Command line script host.

Also expected:

  • msidb.exe – Manipulates installer database tables and streams.
  • msiinfo.exe – Manipulates installer meta data (summary information).
  • makecab.exe – Compresses files into cabinets.
  • msimsp.exe – Creates patch packages.
  • msitran.exe – Generates and applies database transforms.

Wine already provides most of the required functionality through the APIs of msi.dll (MSDN, Wine) and cabinet.dll (MSDN, Wine). My work has been focussed on writing command line utilities that support the interface expected by the LibreOffice build scripts.

  • solenv/bin/make_installer.pl is a very large Perl script that connects up the Perl modules which build the installer. The .pm files relevant to cross MSI building are listed below.
  • solenv/bin/modules/installer/control.pm performs “nativeness” logic such as checking if the environment is Cygwin and whether the required utilities are in the system path.
  • solenv/bin/modules/installer/windows/admin.pm (expand.exe*, msidb.exe, msiinfo.exe)
  • solenv/bin/modules/installer/windows/mergemodule.pm (expand.exe*, msidb.exe)
  • solenv/bin/modules/installer/windows/msiglobal.pm (msidb.exe, msiinfo.exe, cscript.exe*, msitran.exe**, makecab.exe**)
  • solenv/bin/modules/installer/windows/msp.pm (msidb.exe, msimsp.exe**)
  • solenv/bin/modules/installer/windows/update.pm (msidb.exe)
    * Distributed by Wine
    ** In progress

1.3 Cross MSI tool development

The code for these tools can be found in the feature/crossmsi branch of libreoffice. It currently resides in setup_native/source/win32/wintools in the tree.

To test the tools individually, grab the dev Makefile and run make from the tool’s directory. You can then pass the -? or /? option to see the usage. I would suggest disabling Wine’s debug logs unless you specifically need them:

$ export WINEDEBUG=-all

  • msidb (MSDN msidb, LibreOffice msidb, dev Makefile)

    Usage: msidb [options] [tables]

    Options:
    -d <path> Fully qualified path to MSI database file
    -f <wdir> Path to the text archive folder
    -c Create or overwrite with new database and import tables
    -i <tables> Import tables from text archive files – use * for all
    -e <tables> Export tables to files archive in directory – use * for all
    -x <stream> Saves stream as <stream>.idb in <wdir>
    -a <file> Adds stream from file to database
    -r <storage> Adds storage to database as substorage

  • msiinfo (MSDN msiinfo, LibreOffice msiinfo, dev Makefile)

    Usage: msiinfo {database} [[-b]-d] {options} {data}

    Options:
    -c <cp> Specify codepage
    -t <title> Specify title
    -j <subject> Specify subject
    -a <author> Specify author
    -k <keywords> Specify keywords
    -o <comment> Specify comments
    -p <template> Specify template
    -l <author> Specify last author
    -v <revno> Specify revision number
    -s <date> Specify last printed date
    -r <date> Specify creation date
    -q <date> Specify date of last save
    -g <pages> Specify page count
    -w <words> Specify word count
    -h <chars> Specify character count
    -n <appname> Specify application which created the database
    -u <security> Specify security (0: none, 2: read only, 3: read only (enforced))

  • makecab (MSDN makecab, LibreOffice makecab, dev Makefile)

    Usage: makecab [/V[n]] /F directive_file

    Options:
    /F directives – A file with MakeCAB directives.
    /V[n] – Verbosity level (1..3)
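For anyone curious what parsing a directive file involves, here is a rough sketch. It is not the makecab code from the branch: DirectiveFile, trim, nextToken and parseDirectives are names invented for this example, and it only understands a small subset of the format (";" comment lines, ".Set Name=Value" directives and file lines made of a source name with an optional destination name). Every other directive is skipped.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <istream>
#include <map>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical result type for this sketch; not taken from the makecab source.
struct DirectiveFile {
    std::map<std::string, std::string> variables;            // from ".Set Name=Value"
    std::vector<std::pair<std::string, std::string>> files;  // source, destination (may be empty)
};

// Removes leading and trailing whitespace.
static std::string trim(const std::string& text) {
    const auto first = text.find_first_not_of(" \t\r\n");
    if (first == std::string::npos) return "";
    const auto last = text.find_last_not_of(" \t\r\n");
    return text.substr(first, last - first + 1);
}

// Reads one token, honouring double quotes around names that contain spaces.
static std::string nextToken(const std::string& line, std::size_t& pos) {
    while (pos < line.size() && std::isspace(static_cast<unsigned char>(line[pos]))) ++pos;
    std::string token;
    if (pos < line.size() && line[pos] == '"') {
        ++pos;  // skip the opening quote
        while (pos < line.size() && line[pos] != '"') token += line[pos++];
        if (pos < line.size()) ++pos;  // skip the closing quote
    } else {
        while (pos < line.size() && !std::isspace(static_cast<unsigned char>(line[pos]))) {
            token += line[pos++];
        }
    }
    return token;
}

DirectiveFile parseDirectives(std::istream& input) {
    DirectiveFile result;
    std::string raw;
    while (std::getline(input, raw)) {
        const std::string line = trim(raw);
        if (line.empty() || line[0] == ';') continue;  // blank line or comment

        std::size_t pos = 0;
        if (line[0] == '.') {
            // Only ".Set Name=Value" is handled; other directives are skipped.
            const std::string keyword = nextToken(line, pos);
            const auto equals = line.find('=', pos);
            if ((keyword == ".Set" || keyword == ".set") && equals != std::string::npos) {
                const std::string name = trim(line.substr(pos, equals - pos));
                result.variables[name] = trim(line.substr(equals + 1));
            }
            continue;
        }

        // Anything else is treated as a file line: source [destination].
        const std::string source = nextToken(line, pos);
        const std::string destination = nextToken(line, pos);
        result.files.emplace_back(source, destination);
    }
    return result;
}
```

Feeding it a std::istringstream holding a .ddf file gives back the variables and the list of files to pack, which is about as much as a cabinet writer needs to get started.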

Get into open source with GSoC 2012

Student applications for Google Summer of Code 2012 will be open very soon. After an extremely enjoyable and rewarding experience with the program last year, I feel it’s my duty to student programmers to get the word out. So, here’s why you should apply.

You get paid to work on open source software. I became a long time user, first time contributor early last year. Looking to give something back, I attempted a LibreOffice Easy Hack. In a case of fantastic timing, they announced their involvement in GSoC a week or so later and I got in touch. The end result was a whole new open source library. I had an amazing experience working with LibreOffice but it’s ideal to choose a project that’s personally useful. GSoC doesn’t require that you’re an open source evangelist but if you are, it’s a strong argument for applying.

It’s fantastic experience working on a large project. I feel I learned more during those three months than during my undergraduate degree course. I have to say that I never particularly enjoyed groupwork at university but it’s completely different if you’re working with smart, motivated individuals who’re there either because they want to be or because they’re paid to be (quite often both). As a nice bonus, it’s great work experience and has essentially led me to my dream job. I’m not sure if that’s a typical result, but it certainly wouldn’t hurt to have it on your CV or resume.

You meet some of the smartest, most awesome people (not all of them programmers). I think this is my favourite outcome. I’ve met people from all over the world with an assortment of beliefs, opinions and backgrounds. My experience was that some of the best hackers and coolest people (no, seriously!) hang around open source communities.

Applying isn’t difficult: just choose a participating open source organisation or two and do a little research into the suggested projects before getting in touch with them. Good luck!

LibreOffice Conference 2011

I’ve been home a week from the LibreOffice Conference in Paris and from a personal point of view, it was a huge success.

First of all, here are my slides from the short talk I gave about what we achieved with libvisio over the duration of Google Summer of Code. There is still work to be done but once end-user feedback starts coming in, we can sand down any rough edges.

The conference was a lot of fun, particularly the company. I had the pleasure of meeting the rest of the libvisio team, Fridrich Strba and Valek Filippov, who looked out for me the whole time I was there. I’m sure the Paris pickpockets are still cursing their names.

I also have to admit to being a little starstruck at meeting all the fantastic hackers whose work I have made so much use of. The LibreOffice team were a diverse, interesting and kind bunch who put up with my incessant (well-meaning) questions with good grace and gave me plenty to think about on coding, the universe and everything.

It was wonderful to be surrounded by programmers and Linux users without the geekier-than-thou attitude. Despite being younger (and greener) than most and female unlike many (with a few notable exceptions), I chatted away to my fellow hackers without once feeling patronised.

Finally, I’m staying out of the whole political situation – I started coding with LibreOffice for pragmatic reasons (I could get the code easily, Easy Hacks make getting to know the project simpler and LibreOffice was part of GSoC ’11). However, I think the conference really confirmed for me that as important as the code base is, the community that surrounds a project this size is just as vital. Without their helpful, inclusive approach, I’d have found contributing to an open source project of that magnitude an insurmountable task.

So here’s to another year!

Progress with gradient fills

So, I have finally made progress that isn’t so ground-breaking that my mentor wants to write about it but is big enough that certain people will stop making fun of my empty blog. So, frob (his wonderfully useful work can be found here), I hope you’re happy.

I’ve been working on shapes, lines and their properties, most recently on fills. Here’s how it’s going so far (Visio document on top, my output below).

Thanks to frob for the image, plus animated gif.

A few technical details for those who care: Visio draws shapes (including rectangles) as individual lines, so before they can be filled we have to manually detect whether or not they form a closed polygon. At the moment, we simply take the first point and compare it to the last point and make sure there are no gaps in between. It works for most simple cases but since when are things ever truly simple when reverse engineering?
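The check described above boils down to something like the following sketch. It is not the libvisio code: Point, Line, samePoint and isClosedPolygon are illustrative names, the shape is assumed to arrive as an ordered list of straight lines, and the tolerance value is an arbitrary choice for the example.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical types for this sketch; not libvisio's own.
struct Point {
    double x;
    double y;
};

struct Line {
    Point start;
    Point end;
};

// Treats two points as equal when both co-ordinates differ by at most the tolerance.
static bool samePoint(const Point& a, const Point& b, double tolerance) {
    return std::fabs(a.x - b.x) <= tolerance && std::fabs(a.y - b.y) <= tolerance;
}

// Returns true when the lines form one unbroken chain that ends where it started.
bool isClosedPolygon(const std::vector<Line>& lines, double tolerance = 1e-6) {
    if (lines.size() < 2) return false;

    // No gaps in between: each line must start where the previous one ended.
    for (std::size_t i = 1; i < lines.size(); ++i) {
        if (!samePoint(lines[i - 1].end, lines[i].start, tolerance)) return false;
    }

    // The last point must coincide with the first point.
    return samePoint(lines.front().start, lines.back().end, tolerance);
}
```

A rectangle drawn as four consecutive lines passes the check; the same rectangle with its last line missing, or with the lines stored out of order, does not, which is one of the cases where this simple approach falls short.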

You may also notice a difference between how gradients 31-34 are drawn in Visio vs my output. There’s no direct equivalent of this type of square gradient that I know of in the SVG or ODG specifications, so we’re approximating it. I have a whole new appreciation of slight imperfections when porting documents to different formats.

In the time it has taken to write this, I’ve already found that some of what I’ve written about will change. This is why I’m a programmer not a blogger ;)