Ripping links out of OpenOffice Documents · patrick dot wagstrom dot net

Today I received a syllabus for one of my classes as a Word document with a bunch of hyperlinks in it. The hyperlinks are important as they link to the readings for the course. Clicking them all by hand would have taken a little while, so, undaunted, I set about finding another way.

OpenOffice.org makes this quite easy. Just save the file in OpenOffice.org's own format, then unzip it and you’ve got some XML files with all your document guts in them. And as you should know, XML is yummy. So here is the program that I wrote to rip the links. The program code is released to the public domain because it’s only a few lines long.

import libxml2
import sys

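def riplinks(filename):
    # Parse the XML file and set up an XPath context that knows the xlink namespace
    doc = libxml2.parseFile(filename)
    ctxt = doc.xpathNewContext()
    ctxt.xpathRegisterNs("xlink","http://www.w3.org/1999/xlink")
    res = []
    # Any element carrying an xlink:href attribute is a link
    for mem in ctxt.xpathEval("//*[@xlink:href]"):
        res.append(mem.prop("href"))
    # The libxml2 bindings want these freed by hand
    ctxt.xpathFreeContext()
    doc.freeDoc()
    return res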

if __name__ == "__main__":
    if len(sys.argv) < 2:
        print("Usage: %s FILENAME" % (sys.argv[0]))
        sys.exit(1)
    for mem in riplinks(sys.argv[1]):
        print(mem)

Using it is simple. In my case all of the links had the word “classpapers” in them, so I used the following command line to download them all:

python linkripper.py content.xml | grep classpapers | xargs wget

In an ideal world someone would hook this up with the zip module for python and have it automagically look at the content.xml file if it’s a zip, but this isn’t ideal right now. I’ve got other stuff to do.

Update: I made linkripper so it uses zipfile to automagically read Open Office documents. Read about it here.
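
For the curious, here is a rough sketch of what the zipfile idea can look like. To be clear, this is not the updated linkripper code. It is a minimal illustration that assumes Python 3 and swaps libxml2 for the standard library's xml.etree.ElementTree so it runs without extra packages. The function names riplinks_from_xml and riplinks_auto are made up for this example.

```python
import zipfile
import xml.etree.ElementTree as ET

# ElementTree spells a namespaced attribute as "{namespace-uri}localname"
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"

def riplinks_from_xml(data):
    """Return every xlink:href value found in an XML string or bytes object."""
    root = ET.fromstring(data)
    links = []
    # iter() walks the root and all of its descendants in document order
    for element in root.iter():
        href = element.get(XLINK_HREF)
        if href is not None:
            links.append(href)
    return links

def riplinks_auto(filename):
    """Hypothetical helper: accept either a zipped document or a bare XML file."""
    if zipfile.is_zipfile(filename):
        # Zipped document: pull content.xml straight out of the archive
        with zipfile.ZipFile(filename) as archive:
            data = archive.read("content.xml")
    else:
        # Not a zip: assume it is an already-extracted content.xml
        with open(filename, "rb") as handle:
            data = handle.read()
    return riplinks_from_xml(data)
```

With something like this, riplinks_auto("syllabus.odt") (a hypothetical file name) should give the same list of links as unzipping by hand and running the original script on content.xml.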