Looseleaf Binder
From Open Siddur Project Development Wiki
What it is
Looseleaf binder is [going to be] an extensible XML bindery for Python. A bindery represents XML elements as Python objects.
It is intended as a support library for the Jewish Liturgy Project's XML applications. It is not intended to represent every possible corner case of XML.
Why a new bindery?
A number of XML libraries already exist for Python, including the standard library DOM implementations, ElementTree/lxml.etree, and other binderies: gnosis.xml.objectify, lxml objectify, and Amara bindery.
DOM is not particularly extensible and has a very non-Pythonic interface. A DOM wrapper class could do the job, but all work would then have to be duplicated. It also makes carrying around XML fragments rather annoying. Looseleaf currently uses DOM for XML serialization.
ElementTree is nice and simple, but is somewhat awkward to use for mixed content documents. Looseleaf borrows ElementTree's attitude toward XML namespaces.
lxml has many nice features, but both lxml.etree and lxml objectify use thin wrapper classes around C structures, limiting the types of extension classes that can be written.
Amara is currently undergoing a transition to a new version, which is not yet well-documented and is not guaranteed to be fully backward compatible. It is also licensed under a license derived from the Apache License v1.1, making any application that relies on it GPL-incompatible.
Loading the library
Any of the following lines will import the library into your program:
    import Looseleaf
    import Looseleaf as L
    from Looseleaf import *
The remainder of the documentation will assume that the Looseleaf library is bound to L.
Loading a string
In-place strings containing XML can be loaded with the fromString() function:
    [doc, pm] = L.fromString('''
    <XML>
      <!-- This is a comment -->
      <a>
        <b>
          This is text in b
        </b>
      </a>
    </XML>
    ''')
fromString() returns two values. The first is the bound document. The second is a PrefixManager, which records the namespace prefix mappings that were used by the document (see Namespaces below).
Loading a file
[doc, pm] = L.fromFile("file.ext")
fromFile() works exactly like fromString().
Building a document from scratch
A document may be built from scratch by instantiating Element, and adding child Node instances through the object interfaces.
Namespaces
Looseleaf bindery is fully XML namespace aware. All names that can be qualified are represented as a tuple:
    ('http://www.tei-c.org/ns/1.0', 'div')                 # (1)
    ('http://www.w3.org/1999/xhtml', 'div')                # (2)
    (None, 'div')                                          # (3)
    'div'                                                  # (4)
    ('http://jewishliturgy.org/ns/jlptei/1.0', 'option')   # (5)
In the list above:
- (1) represents the name of the TEI element div.
- (2) represents the name of the XHTML element div.
- (3) represents the name of the element div in no namespace.
- (4) is shorthand for (3) and is usable anywhere a name in no namespace can be used.
- (5) represents the name of the JLPTEI option element.
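The shorthand in (4) implies that a bare string and a (None, tag) tuple must compare as the same name. One way to get that behavior is to normalize every name to a tuple before comparing. The sketch below uses a hypothetical helper, qname(), which is not part of Looseleaf and only illustrates the idea:

```python
def qname(name):
    """Normalize a name to a (namespaceUri, tag) tuple.

    Hypothetical helper for illustration; it is not part of Looseleaf.
    """
    if isinstance(name, str):
        # A bare string is shorthand for a name in no namespace.
        return (None, name)
    namespace_uri, tag = name
    return (namespace_uri, tag)


TEI = 'http://www.tei-c.org/ns/1.0'
XHTML = 'http://www.w3.org/1999/xhtml'

# The same local name in two namespaces gives two distinct names.
assert qname((TEI, 'div')) != qname((XHTML, 'div'))
# Forms (3) and (4) name the same element.
assert qname('div') == qname((None, 'div'))
```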
During processing in Looseleaf objects, namespace prefixes are not bound to the namespace. Prefixes may be stored in a PrefixManager instance at load time, and restored at save time. Attributes in the xmlns namespace (http://www.w3.org/2000/xmlns/) will have their usual meaning of binding a prefix to a namespace at serialization time.
By default, the xml and xmlns reserved namespace prefixes are known to the bindery and will work without being declared explicitly in a PrefixManager. Also, by default, no namespace (namespace None) is bound to no prefix.
At load time, the PrefixManager is only guaranteed to remember one prefix for each namespace.
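To make the "one prefix for each namespace" rule concrete, the sketch below collects prefix declarations from a document parsed with the standard library's minidom and keeps only the first prefix seen for each namespace URI. The function collect_prefixes() and its dict result are assumptions made for illustration; the real PrefixManager interface is not documented here and may differ.

```python
from xml.dom import minidom


def collect_prefixes(dom_node, seen=None):
    """Map each namespace URI to the first prefix declared for it.

    Stand-in for what a PrefixManager might record at load time; the
    function name and the returned dict are assumptions.
    """
    if seen is None:
        seen = {}
    if dom_node.nodeType == dom_node.ELEMENT_NODE:
        for name, uri in dom_node.attributes.items():
            if name == 'xmlns':
                # Default namespace declaration: no prefix.
                seen.setdefault(uri, None)
            elif name.startswith('xmlns:'):
                # setdefault keeps the first prefix and ignores later ones.
                seen.setdefault(uri, name[len('xmlns:'):])
    for child in dom_node.childNodes:
        collect_prefixes(child, seen)
    return seen


doc = minidom.parseString(
    '<tei:div xmlns:tei="http://www.tei-c.org/ns/1.0">'
    '<t:p xmlns:t="http://www.tei-c.org/ns/1.0"/>'
    '</tei:div>'
)
prefixes = collect_prefixes(doc)
print(prefixes)
```

In this input the TEI namespace is declared twice, with the prefixes tei and t. Only tei survives, so a document saved with this mapping would not necessarily reproduce every prefix of the original.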
Elements
Elements can be instantiated manually. The following code instantiates a JLP option element, containing a TEI seg element:
    TEI = 'http://www.tei-c.org/ns/1.0'               # (1)
    JLP = 'http://jewishliturgy.org/ns/jlptei/1.0'    # (2)
    opt = L.Element((JLP, 'option'))
    opt.append(L.Element((TEI, 'seg')))
Element instances support all the usual Python list operations, where the element's list contains all child nodes in document order. Element differs from list in the following ways:
- Only certain Node objects can be inserted into the list and become child nodes (Element, Comment, PI, CData, Text).
- Indexing a nonexistent child returns an empty NodeSequence instead of raising IndexError.
Element indexing can be done a number of ways (here, we assume that TEI and JLP are defined as above, and that elem is an instantiated Element):
    elem[2]                # returns the third child node of elem
    elem[:]                # returns all child nodes of elem as a special type of list (a NodeSequence)
    elem[-1]               # returns the last child node of elem
    elem['div']            # returns all child elements of elem that have the name div and no namespace
    elem[(TEI, 'div')]     # returns all TEI div child elements of elem
    elem[(TEI, 'div', 1)]  # returns the second TEI div child element of elem
    elem[Text]             # returns all text node children of elem
    elem[(Text, 0)]        # returns only the first text node child of elem
    elem[NS(TEI)]          # returns all children of elem in the TEI namespace; NS is an example of a Condition class
If no such node is found, an empty NodeSequence is returned. If one node is found, a single Node is returned. If more than one is found, a NodeSequence is returned.
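The return rule above (nothing found, one node, several nodes) can be sketched with a toy class. MiniElement below is hypothetical and stores children as plain (name, text) pairs. It is not Looseleaf's Element, and it returns ordinary lists where Looseleaf would return a NodeSequence.

```python
class MiniElement(object):
    """Toy element whose children are (name, text) pairs; hypothetical."""

    def __init__(self, children):
        self.children = list(children)

    def __getitem__(self, key):
        if isinstance(key, (int, slice)):
            # Positional access behaves like a list, except that a missing
            # index yields an empty result instead of raising IndexError.
            try:
                return self.children[key]
            except IndexError:
                return []
        matches = [child for child in self.children if child[0] == key]
        # One match is returned bare; zero or several come back as a list.
        return matches[0] if len(matches) == 1 else matches


elem = MiniElement([('div', 'a'), ('p', 'b'), ('div', 'c')])
print(elem['p'])      # a single pair
print(elem['div'])    # a list of two pairs
print(elem['span'])   # an empty list
```

A consequence of this design is that the caller cannot tell from the expression alone whether a single node or a sequence comes back, so code that expects several matches should be prepared for the single-node case.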
The Element class presents the following interfaces:
- elem.namespaceUri = the namespaceUri. None if no namespace.
- elem.tag = the tag name
- elem.name = (namespaceUri, tag)
- elem.a = an AttributeSequence containing the element's attributes
The attributes of an Element elem are held in the sequence elem.a.
Attributes
Attributes are stored as instances of the Attribute class, which descends from Node.
The Attribute class presents the following interfaces:
- attr.namespaceUri = the namespace URI of the attribute. None if no namespace. (Note: attributes ignore default namespaces, so all unqualified attributes have no namespace.)
- attr.tag = the attribute name
- attr.name = (namespaceUri, tag)
- attr.value = the value of the attribute
Attributes are accessed via the Element's a property:
    elem.a['type'].value   # the value of the type attribute on elem
    elem.a[(XML, 'id')]    # the xml:id Attribute object
    elem.a[NS(XMLNS)]      # all xmlns namespace attributes
Special xml:* attributes
In addition to the normal way to access attributes, the following special attributes are available to an Element elem:
- elem.xmlId = the element's xml:id
- elem.xmlBase = the element's base URI
- elem.xmlLang = the element's xml:lang
Getting xmlBase or xmlLang returns the value in effect for the element, even if it is declared on an ancestor rather than on the element itself. Setting either one sets it on the current element only.
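The lookup through ancestors can be sketched on a plain DOM tree with the standard library's minidom. The function effective_lang() is a hypothetical stand-in for reading elem.xmlLang, and it assumes that the nearest declaration wins, as xml:lang scoping in XML requires. It does not show how Looseleaf implements the property.

```python
from xml.dom import minidom

XML_NS = 'http://www.w3.org/XML/1998/namespace'


def effective_lang(element):
    """Return the xml:lang in effect for a DOM element, or None."""
    node = element
    # Walk upward until an element declares xml:lang or the root is passed.
    while node is not None and node.nodeType == node.ELEMENT_NODE:
        if node.hasAttributeNS(XML_NS, 'lang'):
            return node.getAttributeNS(XML_NS, 'lang')
        node = node.parentNode
    return None


doc = minidom.parseString('<a xml:lang="he"><b><c xml:lang="en"/></b></a>')
b = doc.getElementsByTagName('b')[0]
c = doc.getElementsByTagName('c')[0]
print(effective_lang(b))   # inherited from <a>
print(effective_lang(c))   # declared on <c> itself
```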
Text Nodes
Text nodes are represented by the Text object.
    txt = Text(u'This is a text object')
    print(txt.value)
Text objects' data is stored in the value property.
Processing Instructions
Processing instructions are represented by the PI object:
    pi = PI('tex', '\\itshape')
    print(pi.target)   # prints tex
    print(pi.value)    # prints \itshape
Comments
Comments are represented by the Comment object.
    com = Comment('This is a comment')
    print(com.value)   # prints This is a comment
Navigation
Navigation within a tree of Nodes is accomplished through a set of axes that resemble XPath axes. These navigation axes exist on all Nodes and NodeSequences, and will return an empty NodeSequence if no matching nodes exist:
- elem.a = attributes
- elem.parent = parent node
- elem.ancestor = all ancestors in reverse document order
- elem.descendant = all descendants in document order
- elem.sibling = all siblings in document order
- elem.precedingSibling = all preceding siblings in reverse document order
- elem.followingSibling = all following siblings in document order
- elem.preceding = all preceding nodes in reverse document order
- elem.following = all following nodes in document order
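The difference between document order and reverse document order in the axes above can be illustrated on a plain DOM tree. The generator functions below are hypothetical helpers written against the standard library's minidom. They are not Looseleaf's axes, which are properties on Node and NodeSequence objects.

```python
from xml.dom import minidom


def ancestors(node):
    """Yield ancestors nearest first, i.e. in reverse document order."""
    node = node.parentNode
    while node is not None:
        yield node
        node = node.parentNode


def preceding_siblings(node):
    """Yield earlier siblings nearest first (reverse document order)."""
    node = node.previousSibling
    while node is not None:
        yield node
        node = node.previousSibling


def following_siblings(node):
    """Yield later siblings in document order."""
    node = node.nextSibling
    while node is not None:
        yield node
        node = node.nextSibling


doc = minidom.parseString('<r><a/><b><x/></b><c/><d/></r>')
x = doc.getElementsByTagName('x')[0]
c = doc.getElementsByTagName('c')[0]

print([n.nodeName for n in ancestors(x)])            # ['b', 'r', '#document']
print([n.nodeName for n in preceding_siblings(c)])   # ['b', 'a']
print([n.nodeName for n in following_siblings(c)])   # ['d']
```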
Conditions
Special conditions, derived from the class Condition, can also be used for navigation. The Condition object defines two methods, test() and testSequence():
    test(self, node)
    testSequence(self, nodeSequence)
The test() method returns a boolean indicating whether the condition applies to the given node.
The testSequence() method returns a NodeSequence containing the items in the sequence being tested for which the condition applies.
Neither method may have side effects on the original node or sequence.
A Condition may also be used to test a property of a NodeSequence itself.
At this time, the following Condition-derived objects are defined:
- NS, which checks whether a node's namespace URI is the same as the given URI
- A, which selects an attribute from the current Element. If the node being tested is not an Element, it returns an empty sequence.
Examples:
    tei_namespace_elements = elem[NS('http://www.tei-c.org/ns/1.0')]
    no_namespace_attributes = elem.a[NS(None)]
    typeAttribute = elem[A('type')]
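The two-method interface can be sketched as follows. This is a reconstruction from the description above, not Looseleaf's source. FakeNode is a hypothetical stand-in for a bound node, and the default testSequence() shown here (filtering with test()) is an assumption about how the two methods might relate.

```python
class Condition(object):
    """Sketch of the interface described above; not Looseleaf's own code."""

    def test(self, node):
        raise NotImplementedError

    def testSequence(self, nodeSequence):
        # Build a new list so the sequence being tested is left untouched.
        return [node for node in nodeSequence if self.test(node)]


class NS(Condition):
    """Matches nodes whose namespaceUri equals the given URI."""

    def __init__(self, namespaceUri):
        self.namespaceUri = namespaceUri

    def test(self, node):
        return node.namespaceUri == self.namespaceUri


class FakeNode(object):
    """Hypothetical stand-in for a bound node."""

    def __init__(self, namespaceUri, tag):
        self.namespaceUri = namespaceUri
        self.tag = tag


TEI = 'http://www.tei-c.org/ns/1.0'
nodes = [FakeNode(TEI, 'div'), FakeNode(None, 'div'), FakeNode(TEI, 'seg')]
tei_nodes = NS(TEI).testSequence(nodes)
print([n.tag for n in tei_nodes])   # ['div', 'seg']
```

A new condition then only needs to override test(). A condition that inspects the sequence as a whole would override testSequence() instead.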
NodeSequence and AttributeSequence
A NodeSequence is a list of nodes (which may include any Node-derived types). An AttributeSequence is a sequence of Attribute nodes. Both sequence types support all the navigation axes.
Saving a document
To save a tree t to a string (1) or to a file (2), optionally reusing a PrefixManager (3), use:
    string = t.toString(pretty=True)               # (1)
    t.toFile('test.xml', pretty=False)             # (2)
    t.toFile('test.xml', prefixManager=pm)         # (3)
The keyword pretty determines whether additional indentation is added to format the XML into a more human-readable form. It defaults to False.
To reuse saved XML namespace prefixes, pass a PrefixManager instance to toString() or toFile() through the prefixManager keyword, as in (3).
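Since Looseleaf currently serializes through DOM, the effect of a pretty-printing switch can be illustrated with the standard library's minidom. Whether Looseleaf calls these particular methods is an assumption. The sketch only shows the general difference between compact and indented output.

```python
from xml.dom import minidom

doc = minidom.parseString('<a><b>text</b></a>')

# Compact form: no whitespace is added between elements.
compact = doc.documentElement.toxml()

# Indented form: newlines and indentation are inserted between elements.
pretty = doc.documentElement.toprettyxml(indent='  ')

print(compact)
print(pretty)
```

Indentation is added as extra whitespace in the output, so the pretty form is not byte-for-byte equivalent to the compact one. For mixed-content documents, where whitespace between elements can be significant, the compact form is the safer choice.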
Custom binding classes
Custom binding classes must derive from Node. Ideally, they should derive from the closest possible node type class; e.g., a class that primarily binds an element should derive from Element.
To set up custom binding, you need a custom binding class and a function that will return the derived type if the conditions are met, and return None if they aren't. A binding function may also return ExcludedNode if the node being tested should not be included in the tree.
    class DivElement(Element):
        pass

    def divBinder(domNode):
        # localName is compared rather than nodeName, because nodeName
        # includes the namespace prefix when the element has one
        if domNode.nodeType == domNode.ELEMENT_NODE and \
           domNode.namespaceURI == 'http://www.tei-c.org/ns/1.0' and \
           domNode.localName == 'div':
            return DivElement
        else:
            return None
Note that the binding function operates on a DOM node, so the entire DOM, including parent and child nodes, is accessible to it.
A list of such binding functions can be passed to fromString() or fromFile() using the customBinding keyword:
[doc, pm] = fromFile("test.xml", customBinding=[divBinder])
Every {http://www.tei-c.org/ns/1.0}div element will then be bound using the custom class.
If you want to remove text nodes that contain only whitespace, you can use the provided binding factory looseleaf.whitespaceRemovalBinder() as a custom binding.
Binding conditions are processed in list order.
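List order matters when more than one binding function accepts the same node. The sketch below assumes that the first function returning something other than None decides the class. The source only states that processing happens in list order, so "first match wins" is an assumption. The names chooseClass() and catchAllBinder() are hypothetical, Element is a stand-in for L.Element, and ExcludedNode is not modeled.

```python
from xml.dom import minidom

TEI = 'http://www.tei-c.org/ns/1.0'


class Element(object):
    """Stand-in for L.Element so the sketch runs without Looseleaf."""


class DivElement(Element):
    pass


def divBinder(domNode):
    if (domNode.nodeType == domNode.ELEMENT_NODE
            and domNode.namespaceURI == TEI
            and domNode.localName == 'div'):
        return DivElement
    return None


def catchAllBinder(domNode):
    # Hypothetical binder that accepts every element.
    if domNode.nodeType == domNode.ELEMENT_NODE:
        return Element
    return None


def chooseClass(domNode, binders, default=None):
    """Return the class from the first binder that does not answer None."""
    for binder in binders:
        bound = binder(domNode)
        if bound is not None:
            return bound
    return default


doc = minidom.parseString('<div xmlns="%s"><p/></div>' % TEI)
div = doc.documentElement
p = div.firstChild

print(chooseClass(div, [divBinder, catchAllBinder]).__name__)   # DivElement
print(chooseClass(div, [catchAllBinder, divBinder]).__name__)   # Element
```

Under this assumption, specific binding functions should be listed before general ones, otherwise the general one shadows them.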
Iteration
The basic iterator object used to walk a Looseleaf tree in document order is NodeIterator.
The following example prints the content of all text nodes in the document test.xml:
    [doc, prefixManager] = fromFile("test.xml")
    for node in NodeIterator(doc):
        if isinstance(node, Text):
            print(node.value)
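Document order means that a node is visited before its children, and children are visited left to right. The generator below shows that traversal on a standard library minidom tree. It is a hypothetical equivalent written for illustration, not the implementation of NodeIterator.

```python
from xml.dom import minidom


def iterNodes(node):
    """Yield node and all of its descendants in document order."""
    yield node
    for child in node.childNodes:
        for descendant in iterNodes(child):
            yield descendant


doc = minidom.parseString('<a>one<b>two</b>three</a>')
texts = [n.data for n in iterNodes(doc) if n.nodeType == n.TEXT_NODE]
print(texts)   # ['one', 'two', 'three']
```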
Transformation
Keys
Licensing
Like the rest of the code in JLP, Looseleaf binder is released under the GNU Lesser General Public License, version 3 or later.