XSLT Grammar Parser
From Open Siddur Project Development Wiki
This page describes the use of the XSLT Grammar Parser, which is necessary for parsing XPointer and XPointer schemes in XSLT and XQuery. It may also be used for any other type of text parsing.
The idea for the implementation is based on YAPP, a parser written in XSLT 1.0, although the two share no code. The new parser was written in XSLT 2.0 to take advantage of the language's native support for regular expressions. Most grammars that can be represented in EBNF can be parsed.
This documentation uses three XML namespaces, listed here with their conventional prefixes:
- http://jewishliturgy.org/ns/functions/xslt (func)
- http://jewishliturgy.org/ns/parser (p)
- http://jewishliturgy.org/ns/parser-result (r)
To use the parser, you must first define a grammar in the language described below. The grammar is stored as XML.
In your XSLT stylesheet, include the parser code (grammar2.xsl2).
The parser is called using the function call:
func:grammar-parse($string as xs:string, $start-term as xs:string, $grammar as node()) as element()
$string is the string to be parsed, $start-term is the named term where parsing should begin, and $grammar is the XML grammar; it may be element(p:grammar) or document-node().
The result of func:grammar-parse() may be passed to:
func:grammar-clean($parsed-grammar as element()) as element()
This function returns the parse result with the r:anonymous elements that represent anonymous terms (p:expAnon, p:termRefAnon) removed; the content of those elements is kept.
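As a rough illustration of the calling sequence above, a stylesheet might look like the sketch below. The grammar file name (my-grammar.xml), the start term (number), the input string, and the location of grammar2.xsl2 relative to the stylesheet are all assumptions made for the example; only the two function signatures come from this page.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:func="http://jewishliturgy.org/ns/functions/xslt"
    exclude-result-prefixes="func">

  <!-- Assumed path: adjust to wherever grammar2.xsl2 is stored. -->
  <xsl:include href="grammar2.xsl2"/>

  <!-- Hypothetical grammar document; a document node is an accepted $grammar. -->
  <xsl:variable name="grammar" select="doc('my-grammar.xml')"/>

  <xsl:template name="main">
    <!-- Parse the string '42', starting at the (hypothetical) term "number". -->
    <xsl:variable name="parsed"
        select="func:grammar-parse('42', 'number', $grammar)"/>
    <!-- Optionally strip the r:anonymous wrappers from the result. -->
    <xsl:copy-of select="func:grammar-clean($parsed)"/>
  </xsl:template>

</xsl:stylesheet>
```

The example would be run with an XSLT 2.0 processor using main as the initial template.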
Example grammars from our project are a partial XPointer implementation, the extended XPointer range() function defined by the TEI, and a grammar for our extended version of Sacred Texts Markup Language.
Defining a grammar
Root element
A grammar is defined in an XML file with root element p:grammar. The root element may include other sub-grammars, also contained in p:grammar elements. If more than one grammar is included in the same hierarchy, all the grammars are combined in each parsing run.
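A minimal sketch of a grammar file with one nested sub-grammar is shown here. The term and expression names are invented for illustration. Because both p:grammar elements sit in the same hierarchy, the terms number and word would both be available in a parsing run.

```xml
<p:grammar xmlns:p="http://jewishliturgy.org/ns/parser">
  <!-- Term defined directly in the root grammar (hypothetical name). -->
  <p:term name="number">
    <p:exp name="digits">[0-9]+</p:exp>
  </p:term>

  <!-- Nested sub-grammar; its terms are combined with those above. -->
  <p:grammar>
    <p:term name="word">
      <p:exp name="letters">[a-z]+</p:exp>
    </p:term>
  </p:grammar>
</p:grammar>
```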
Terms
Each grammar is composed of one or more named terms, represented by the p:term element. Each p:term element is given a unique name using the @name attribute. Terms are composed of an ordered list of content matchers:
- Regular expressions (p:exp, p:expAnon)
- References to other terms (p:termRef, p:termRefAnon)
- Choice grouping constructs (p:choice)
- Cardinality groupings (p:zeroOrMore, p:oneOrMore, p:zeroOrOne)
- At most one end-of-data indicator (p:end)
The list of elements (content matchers) defines the expected values of a string that matches the term. A string that conforms to the list is said to match.
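For instance, a term whose ordered content matchers describe strings such as width=100 could be sketched as follows. The names are hypothetical, and writing p:end as an empty element is an assumption, since this page does not show its syntax.

```xml
<p:grammar xmlns:p="http://jewishliturgy.org/ns/parser">
  <!-- Matches a key, an equals sign, a numeric value, then end of data. -->
  <p:term name="assignment">
    <p:exp name="key">[a-z]+</p:exp>
    <p:expAnon>=</p:expAnon>
    <p:exp name="value">[0-9]+</p:exp>
    <!-- Assumed form of the end-of-data indicator. -->
    <p:end/>
  </p:term>
</p:grammar>
```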
When run through the parser, each p:term or p:exp element named by an @name attribute will result in one of the following sequences:
- (r:@name, r:remainder?) if the element matched
- (r:no-match, r:remainder?) if it did not match
Here r:@name stands for an element in the r namespace whose local name is the value of the @name attribute; it contains the part of the string that matched. r:remainder, when present, contains the remaining, unmatched part of the string.
Anonymous elements p:termRefAnon and p:expAnon return r:anonymous instead of r:@name. These may be removed by passing the result of the parse run to func:grammar-clean().
Content matchers
Content matchers attempt to match the current position in a string to their defined pattern.
Regular Expressions
Regular expressions may be matched using the p:exp element. The regular expression is the text content of the element. Special characters that are meant to match literally must be escaped using the normal conventions of regular expressions.
A matched p:exp element returns an element in the r namespace whose node name is defined by the p:exp element's @name attribute.
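The sketch below shows named regular expressions, including escaped literal parentheses. All names are invented for the example. Since the grammar is XML, characters such as < and & inside a pattern would additionally need XML escaping (&lt;, &amp;).

```xml
<p:grammar xmlns:p="http://jewishliturgy.org/ns/parser">
  <p:term name="call">
    <!-- A match here is returned as an r:function-name element. -->
    <p:exp name="function-name">[a-zA-Z][a-zA-Z0-9]*</p:exp>
    <!-- Parentheses are regex metacharacters, so they are escaped. -->
    <p:exp name="open-paren">\(</p:exp>
    <p:exp name="close-paren">\)</p:exp>
  </p:term>
</p:grammar>
```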
Term references
Term references (p:termRef) are how named terms are matched inside other named terms. The @alias attribute may be used to reuse a pattern named @name, but give it a different result element name, r:@alias.
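As an illustration, the hypothetical range term below reuses the number term twice and uses @alias so that the two matches can be told apart in the result (r:start and r:end instead of two r:number elements).

```xml
<p:grammar xmlns:p="http://jewishliturgy.org/ns/parser">
  <p:term name="number">
    <p:exp name="digits">[0-9]+</p:exp>
  </p:term>

  <!-- Intended to match strings such as "3,7". -->
  <p:term name="range">
    <p:termRef name="number" alias="start"/>
    <p:exp name="separator">,</p:exp>
    <p:termRef name="number" alias="end"/>
  </p:term>
</p:grammar>
```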
Anonymous content matchers
The p:termRefAnon and p:expAnon elements work like their named counterparts, except that matches are returned as r:anonymous elements. p:termRefAnon does not support @alias.
Running the func:grammar-clean() function on the return value of func:grammar-parse() will remove all r:anonymous elements, leaving their content.
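Anonymous matchers are convenient for punctuation that must be present but is of no interest in the result. In this sketch (names hypothetical, and the use of @name on p:termRefAnon assumed by analogy with p:termRef), the parentheses and the referenced term are reported as r:anonymous and would disappear after func:grammar-clean(), leaving their content.

```xml
<p:grammar xmlns:p="http://jewishliturgy.org/ns/parser">
  <p:term name="number">
    <p:exp name="digits">[0-9]+</p:exp>
  </p:term>

  <!-- Intended to match strings such as "(42)". -->
  <p:term name="parenthesized">
    <p:expAnon>\(</p:expAnon>
    <!-- Assumed: @name selects the referenced term, as with p:termRef. -->
    <p:termRefAnon name="number"/>
    <p:expAnon>\)</p:expAnon>
  </p:term>
</p:grammar>
```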
Choice groupings
Choice groupings (the p:choice element) indicate that their position in the term may contain any one of the listed alternatives. The p:choice fails to match if none of the alternatives match. If more than one alternative matches the text, the one with the longest match is chosen. If multiple matches are of equal length, the first one is chosen.
In addition to any of the contents of p:term, p:choice may also include two other elements:
- p:group - an anonymous ordered grouping of content matchers.
- p:empty - the possibility that the choice matches the empty string.
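The following sketch combines a plain alternative, a p:group, and p:empty in one choice. The names are hypothetical, and writing p:empty as an empty element is an assumption. Given the input 3.14, both the integer alternative and the group would match, and by the longest-match rule described above the group should win.

```xml
<p:grammar xmlns:p="http://jewishliturgy.org/ns/parser">
  <p:term name="numeric">
    <p:choice>
      <!-- Alternative 1: digits only. -->
      <p:exp name="integer">[0-9]+</p:exp>
      <!-- Alternative 2: an ordered group, digits "." digits. -->
      <p:group>
        <p:exp name="whole">[0-9]+</p:exp>
        <p:expAnon>\.</p:expAnon>
        <p:exp name="fraction">[0-9]+</p:exp>
      </p:group>
      <!-- Alternative 3: nothing at all (assumed empty-element form). -->
      <p:empty/>
    </p:choice>
  </p:term>
</p:grammar>
```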
Cardinality
An ordered list of term references, regular expressions, and choice groupings may also be grouped under a p:zeroOrMore, p:zeroOrOne, or p:oneOrMore element. The group matches if its content as a whole occurs 0 or more times (absent, present, or repeated), 0 or 1 times (absent or present), or 1 or more times (present or repeated), respectively.
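A comma-separated list with an optional trailing semicolon is a typical use of the cardinality elements. The sketch below uses invented names and is intended to match strings such as a,b,c or a,b,c; when parsing starts at the list term.

```xml
<p:grammar xmlns:p="http://jewishliturgy.org/ns/parser">
  <p:term name="item">
    <p:exp name="letters">[a-z]+</p:exp>
  </p:term>

  <p:term name="list">
    <!-- First item is required. -->
    <p:termRef name="item"/>
    <!-- Then any number of ", item" repetitions, including none. -->
    <p:zeroOrMore>
      <p:expAnon>,</p:expAnon>
      <p:termRef name="item"/>
    </p:zeroOrMore>
    <!-- An optional trailing semicolon. -->
    <p:zeroOrOne>
      <p:expAnon>;</p:expAnon>
    </p:zeroOrOne>
  </p:term>
</p:grammar>
```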
License
The grammar parser is released under the GNU Lesser General Public License (LGPL), version 3 or later.
Questions/Bug reports
Questions may be addressed to the opensiddur-tech email list; bugs may be reported to our issue tracker.
TODO
- Complete usage documentation!
- Write a test suite for all cases