Target Survey
From Open Siddur Project Development Wiki
Survey of Target Document Formats
Some Terms
- Target
- A document or other storage format, other than the original XML data, produced by a conversion, or series of conversions, from the data. Usually, a target is somehow more useful than the previous target.
- Chain
- The formats and conversions between them that lead to a target.
- Complete target
- A target that produces a document that is production grade, after which the chain ends.
- Intermediate target
- A target in a multi-stage transformation chain that is not complete.
- JLPTEI
- The XML format, based on the Text Encoding Initiative (TEI), in which our Tanach is stored (and our Siddur soon will be). All chains start from this format.
- muXHTML
- Our only current target, mXHTML, is a small subset of XHTML into which our transforms can convert the data. The mXHTML can be styled with CSS and displayed in any standards-compliant web browser.
- Transformation tool
- An application or library used to transform data from one format to another.
Goal
The ultimate goals are to have a computer-viewable display format (XHTML) and at least one printable format. We may also want a post-processing editable format.
Our farthest target as yet is XHTML, styled by CSS. For a printed format, a complete target must be able to produce a document with the features one would expect of any Siddur: page numbers, table of contents, footnotes, side notes, header/page title, etc. XHTML originated as a computer-display format, not a publishing format. Even when combined with CSS 2.1, it does not support some of the features above (with some hacking, side notes, a static header/footer, and page numbers are possible, but it is still missing vital features). CSS3 is more publishing friendly and, when implemented, will make life much easier. Until then, we will have to be a bit more creative.
The following is a list of software libraries and formats that can help us increase the range of formats that we can target. XSLT and Java are the preferred languages, since the rest of our chain is written in XSLT and driven by Saxon, which is written in Java. This allows us to bundle the entire chain in a portable program that can be distributed (with the added bonus that it could be distributed within a web browser as an applet).
Complete Targets
PDF/PostScript
This is one possible complete target. The format is widely used and was published as an open standard in 2008 (see wikipedia:Portable_Document_Format). Once in PDF format, the document is almost as good as a series of images, and is ready to print. The problem is how to get from XML or XHTML to PDF, as the PDF format is (probably) too difficult to target directly using XSLT (considering positioning problems etc.).
It should be noted that there are actually two types of PDF targets: PDF as a print format, and PDF as a digital medium. As a digital medium, the text would be semantic and selectable. This is more difficult to achieve, as the library producing the PDF must understand the OpenType font to be able to position the characters correctly and to embed the font. However, the text can also be drawn as vectors instead of written as text, which is sometimes easier. For example, there are a few libraries that use Java's Graphics2D class as an interface to the PDF. Graphics2D allows vector graphics to be drawn directly to the PDF, and the AWT library provided with Java understands OpenType and can draw text correctly to any Graphics2D implementation. Thus, using AWT's font rendering system, we can produce a good PDF for printing purposes. The resulting PDF would likely be very large and would not be selectable, and thus would not be fit for use as a digital medium.
SVG
Another possible complete target. Scalable Vector Graphics (SVG) is a standard that describes graphics; once in this format, a document would be ready to print. SVG is XML; however, targeting SVG directly is (probably) too difficult (considering positioning problems etc.) (see wikipedia:Scalable_Vector_Graphics#Printing). So again the problem is getting the data into this format.
Open Document Format
The word processing format that originated in OpenOffice.org and is now an ISO standard. See wikipedia:OpenDocument
OOXML
Microsoft's "ISO standard" format. There is no product in existence that renders ISO OOXML, and the standards document is a farce. Must be avoided at all costs. wikipedia:Ooxml
XPS, OpenXPS (MS)
Microsoft's answer to PDF. Does anything off Windows 7 support it? wikipedia:XML_Paper_Specification
Technical Problems
Right to left / BiDi
Hebrew text is rendered right to left. In addition there are algorithms defined by the Unicode standard which define how to order RTL (right-to-left) text that is next to LTR (left-to-right) text. This is called the BiDi algorithm (bi-directional text algorithm). In the CSS standard, this can mean splitting and reordering elements, which can become quite complex. Thus many libraries will render Hebrew text in reverse order.
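As a rough illustration of where that algorithm starts, the following Python sketch uses the standard unicodedata module to look up each character's bidi class, guess the base direction from the first strong character, and split a string into directional runs. It is not the full Unicode BiDi algorithm: it ignores embedding levels, explicit directional controls, and the special rules for numbers and neutrals. The function names base_direction and directional_runs are made up for this example.

```python
import unicodedata


def char_direction(ch):
    """Return "ltr" or "rtl" for strong characters, None for weak/neutral ones."""
    cls = unicodedata.bidirectional(ch)
    if cls == "L":
        return "ltr"
    if cls in ("R", "AL"):  # Hebrew letters are class R, Arabic letters AL
        return "rtl"
    return None


def base_direction(text, default="ltr"):
    """Guess the paragraph direction from the first strong character."""
    for ch in text:
        direction = char_direction(ch)
        if direction is not None:
            return direction
    return default


def directional_runs(text):
    """Split text into (direction, substring) runs.

    Simplification (an assumption of this sketch): weak and neutral
    characters such as spaces and digits simply join the run before them.
    """
    runs = []  # each entry is [direction, substring]
    for ch in text:
        direction = char_direction(ch)
        if direction is None and runs:
            direction = runs[-1][0]
        if runs and runs[-1][0] == direction:
            runs[-1][1] += ch
        elif runs and runs[-1][0] is None:
            # Leading neutrals adopt the direction of the first strong char.
            runs[-1][0] = direction
            runs[-1][1] += ch
        else:
            runs.append([direction, ch])
    return [(d, s) for d, s in runs]


# "shalom" in Hebrew (shin, lamed, vav, final mem) followed by English.
mixed = "\u05E9\u05DC\u05D5\u05DD world"
print(base_direction(mixed))
print([direction for direction, _ in directional_runs(mixed)])
```

A renderer that skips this step and draws the characters in storage order, left to right, produces exactly the reversed Hebrew described above.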
Complex OpenType Layouts
If you finally get the text to display in the correct order, you might find that the vowels are treated as separate characters. That is, they display after the character they belong to, instead of above, inside, or below it. The Hebrew text in the Tanach uses complex characters that can only be properly rendered (with an open source font) using the Ezra SIL font. This is one of the major problems with many PDF-generating libraries: they cannot render OpenType properly and will not perform any kerning at all when it is used.
Other open source fonts can display non-Biblical Hebrew correctly, but all OpenType fonts that display Hebrew with vowels correctly use complex layouts.
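To make the problem concrete, the sketch below (Python, standard library only) groups each Hebrew base letter with the combining marks that follow it in the stored text. It only shows which code points belong together; actually placing a vowel above, inside, or below its letter is the job of the font's OpenType tables and the layout engine. The function name clusters and the sample word are choices made for this example.

```python
import unicodedata


def clusters(text):
    """Group each base character with the combining marks that follow it."""
    result = []
    for ch in text:
        # Unicode general categories Mn, Mc and Me are the combining marks;
        # Hebrew vowel points and cantillation marks are category Mn.
        if result and unicodedata.category(ch).startswith("M"):
            result[-1] += ch
        else:
            result.append(ch)
    return result


# "Bereshit" with vowel points: bet + sheva + dagesh, resh + tsere, alef,
# shin + hiriq + shin dot, yod, tav.
word = "\u05D1\u05B0\u05BC\u05E8\u05B5\u05D0\u05E9\u05B4\u05C1\u05D9\u05EA"

print(len(word))            # number of code points in storage
print(len(clusters(word)))  # number of letter-plus-marks groups
```

A library that treats every code point as a separate glyph will draw eleven boxes in a row here, where a correct renderer draws six letters with their marks attached.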
Intermediate targets
XSL-FO
This is a W3C recommendation for defining page layout as XML. It defines a template for a page, and then simply fills the template page with text, overflowing to as many pages as necessary. There are no (known) "XSL-FO viewers" (why not?). It is an intermediate target only.
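The template-then-flow model can be sketched as follows. This Python snippet builds a minimal XSL-FO document with the standard xml.etree.ElementTree module: one page template (fo:simple-page-master) and one flow of blocks that a formatter would pour into as many pages as needed. The master name "siddur-page", the A4 page size, and the margin are arbitrary assumptions for illustration, and in our chain this step would more naturally be an XSLT stylesheet than Python.

```python
import xml.etree.ElementTree as ET

FO_NS = "http://www.w3.org/1999/XSL/Format"  # the XSL-FO namespace
ET.register_namespace("fo", FO_NS)


def fo(tag):
    """Return the namespace-qualified name of an XSL-FO element."""
    return "{%s}%s" % (FO_NS, tag)


def build_fo(paragraphs):
    """Build a minimal XSL-FO document holding the given paragraphs."""
    root = ET.Element(fo("root"))

    # 1. The page template: size, margins, and a body region.
    masters = ET.SubElement(root, fo("layout-master-set"))
    page = ET.SubElement(masters, fo("simple-page-master"), {
        "master-name": "siddur-page",  # hypothetical name
        "page-width": "210mm",
        "page-height": "297mm",
        "margin": "20mm",
    })
    ET.SubElement(page, fo("region-body"))

    # 2. The content: a flow of blocks poured into the body region,
    #    overflowing onto as many pages as the formatter needs.
    sequence = ET.SubElement(root, fo("page-sequence"),
                             {"master-reference": "siddur-page"})
    flow = ET.SubElement(sequence, fo("flow"),
                         {"flow-name": "xsl-region-body"})
    for text in paragraphs:
        ET.SubElement(flow, fo("block")).text = text

    return ET.tostring(root, encoding="unicode")


print(build_fo(["First paragraph.", "Second paragraph."]))
```

The output is only a description of the layout; a formatter such as Apache FOP (see Tools) is still needed to turn it into pages.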
XHTML+CSS3 (Cascading Style Sheets)
The mXHTML can be styled with CSS2 (as any XHTML can), but certain things, such as footnotes, page numbers, an index, and a table of contents, can only be accomplished with the print-friendly CSS3. Right now, it does not really matter whether the implementation is a visual renderer (i.e. a browser) or a library that renders to another format, such as PDF (since if we target a browser, the user can simply print from there, even to a PDF printer if desired). Unfortunately, there are no known complete CSS3 implementations, but some libraries have implemented important CSS3 extensions that might just be enough. (Although CSS is usually associated with XHTML, many libraries and applications implement CSS for XML, treating each tag as an XHTML div tag and allowing it to be styled as desired.)
XHTML+CSS can be rendered directly by web browsers. It may be both an intermediate target and a complete target.
Note: CSS3 is still a draft - it is not yet a W3C recommendation.
TeX
TeX is a non-XML typesetting language that provides fine control over page layout and has strong support for publishing features. Because TeX source must still be processed by a TeX engine to produce a printable document, it is an intermediate target.
TeXML
TeXML is a purely intermediate target format that can be used as an XML-ish way to express TeX. In order to do anything, it has to be converted into non-XML TeX.
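The following Python sketch shows the idea of that conversion on a tiny subset of the vocabulary. The element names (TeXML, cmd, opt, parm, env) follow the TeXML vocabulary as best understood here; the converter itself is a toy written for this page, not the real TeXML processor, and it skips the escaping of special TeX characters that the real tool performs.

```python
import xml.etree.ElementTree as ET


def content(node):
    """Render the text and child elements inside a node."""
    return (node.text or "") + "".join(render(child) for child in node)


def render(node):
    """Render one element of a small TeXML subset as TeX source."""
    if node.tag == "cmd":
        # <cmd name="x"><opt>o</opt><parm>p</parm></cmd>  ->  \x[o]{p}
        body = "\\" + node.get("name") + "".join(render(c) for c in node)
    elif node.tag == "opt":
        body = "[" + content(node) + "]"
    elif node.tag == "parm":
        body = "{" + content(node) + "}"
    elif node.tag == "env":
        # <env name="x">...</env>  ->  \begin{x}...\end{x}
        name = node.get("name")
        body = "\\begin{%s}%s\\end{%s}" % (name, content(node), name)
    else:
        # The TeXML root (or any element this sketch does not know).
        body = content(node)
    return body + (node.tail or "")


source = (
    '<TeXML>'
    '<cmd name="documentclass"><opt>12pt</opt><parm>article</parm></cmd>\n'
    '<env name="document">Shalom</env>'
    '</TeXML>'
)
print(render(ET.fromstring(source)))
```

The attraction for our chain is that an XSLT stylesheet can emit this XML safely and leave the quirks of TeX syntax to the converter.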
SVG1.2/SVG-print
- URL: http://www.w3.org/TR/SVGPrint/
- Lang: XML/SVG
- Notes:
- This is a draft of a future version of SVG that would support at least some of what we want. I list it as an intermediate target because support for SVG1.2/SVG-print is minimal: to use it as a complete target, it would likely have to be converted to a different target by whatever renderer does support it.
- Links:
- See second example for an impressive display of what it can do: http://www.w3.org/TR/2004/WD-SVG12-20041027/flow.html#textflow-example
- Firefox displays this example properly, but don't be impressed; it uses svg-embedded javascript to render properly: http://www.carto.net/svg/textFlow/index.svg
- JavaScript implementation of textFlow: http://www.carto.net/svg/textFlow/
- Putting SVG on paper (SVG Open 2003 paper): http://www.svgopen.org/2003/papers/PuttingSVGOnPaper/index.html#S2
- Batik SVG1.2 support: http://xmlgraphics.apache.org/batik/dev/svg12.html
Mars
- URL: http://www.adobe.com/go/mars
- Lang: XML
- License: Unknown
- Notes:
- Mars is an XML format, introduced by Adobe, that mirrors the PDF format. If an XSLT stylesheet (or a library) existed that could convert from Mars to PDF directly, then perhaps we could target Mars. However, reading up on Mars only brings up old discussions about it; it seems to have gotten nowhere. The idea is intriguing though and reminds me of TeXML.
Tools
Apache-FOP
- URL: http://xmlgraphics.apache.org/fop
- Lang: Java
- License: Apache License 2.0
- Notes:
- FOP is written in Java, has an excellent license, processes XSL-FO, the standard's answer to our problem, and outputs PDF, the current target of our chain, in addition to other formats.
- Cons:
- However, the implementation has a few shortcomings, the most important being the lack of support for bidirectional Unicode, RTL (right-to-left), which is vital for Hebrew!
- OpenType reportedly does not render correctly
- Summary:
- If it were possible to get bidi working in FOP, that would probably be best. However, BIDI has been a requested feature for a really long time (at least since 2002!), and has not yet been implemented. It is also a very large job to rip apart FOP to try and implement it ourselves.
Apache Batik
- URL: http://xmlgraphics.apache.org/batik/
- Lang: Java
- License: Apache License 2.0
- Notes:
- Batik seems to be an excellent SVG renderer, and even has some support for SVG 1.2.
html2ps/pdf
- URL: http://www.tufat.com/s_html2ps_html2pdf.htm
- Lang: PHP
- License: LGPL
- Notes:
- It's written in PHP, but it does have some CSS3 support.
- This should be tested for OpenType support, although it does not claim to have OpenType support.
- Links:
- CSS3 compliance: http://www.tufat.com/docs/html2ps/compatibility.css.3.html
jPod
- URL: http://opensource.intarsys.de/home/en/index.php?n=JPod.HomePage
- Lang: Java
- License: BSD
- Notes: PDF library only -- needs a layout controller. Unknown bidi/OpenType support.
pisa
- URL: http://www.xhtml2pdf.com/
- Lang: Python
- License: GPLv2
- Notes:
- "It supports HTML 5 and CSS 2.1 (and some of CSS 3)". Seems like an excellent straight XHTML/CSS => PDF renderer, but its license is incompatible with MPL, which Saxon uses. It is also written in Python, which isn't the end of the world, and can probably still be embedded inside an applet (see wikipedia:Jython).
- Cons:
- License is incompatible with MPL, which is used by Saxon (the XSLT transformer)
Flying Saucer / xhtmlrenderer
- URL: https://xhtmlrenderer.dev.java.net/
- Lang: Java
- License: LGPL
- Notes:
- An XHTML-to-PDF renderer that is almost Acid2 compliant. It seems to use iText (detailed here as well) and supports some very nice CSS3 page-related rules.
- Cons:
- Does no BIDI, which according to the CSS 2.1 standard involves splitting and reordering elements if necessary.
- Does not take advantage of iText's BIDI
- Inherits iText's problems, namely no OpenType font support
- Pros:
- Renderer is pluggable; thus it might be easy to write one that does support OpenType and BIDI, and that outputs to many formats. From an XHTML/CSS perspective, this library understands the markup and CSS well, and is thus a good starting point for rendering it.
- There is a Graphics2D renderer. This means it can render to images, or to PDF vectors. However, this renderer has no paged support.
- Summary:
- To use this great library, we would have to implement BIDI element splitting and reordering, then find an appropriate renderer and implement things like drawString(). In addition, if we go the vector route, we would have to implement a paged version of that renderer.
iText
- URL: http://www.lowagie.com/iText/
- Lang: Java
- License: LGPL/MPL (now AGPL)
- Notes:
- An impressive Java library for directly producing and manipulating PDF files. Its API offers constructs such as Paragraph, which closely parallel what we need. Using it would basically require us to write a very simple "xml2pdf" renderer (or use xhtmlrenderer, detailed here as well). It is also used by many other libraries, including xhtmlrenderer.
- Cons:
- BIDI is supported, but only through special constructs
- It does not render OpenType fonts well
- It recently switched licenses to AGPL, a very restrictive license not suitable for our project. Therefore, future upstream development cannot fix any of the above cons for us, short of a fork.
- Links:
- Comparing to Apache FOP: http://www.oreillynet.com/cs/user/view/cs_msg/20299?page=last
xmlroff
- URL: http://xmlroff.org/
- Lang: C
- License: BSD
- Notes:
- An XSL-FO implementation in C. It is lacking in implementation conformance. Unlikely to work.
svg2ps
- URL: http://code.google.com/p/lindenb/source/browse/trunk/src/xsl/svg2ps.xsl
- Lang: XSLT
- License: GPLv2
- Notes:
- This is an example of XSLT producing PS (PostScript). Perhaps this can eventually be adapted to cut Java out of the picture entirely, and simply rely on XSLT.
- Cons:
- I don't think it's practical, because SVG does not lay out text; thus it is impossible to have flowing text (except in the newer SVG standards).
- Links:
- Using XSLT for image rendering: http://www.oreillynet.com/xml/blog/2008/06/xslt_and_image_rendering.html
- Using XSLT for binary file formats: http://www.oreillynet.com/xml/blog/2008/06/xslt_and_binary_file_formats_1.html
css2xslfo
- URL: http://www.re.be/css2xslfo/
- Lang: Java
- License: Public Domain?
- Notes:
- Using this transformation, we can convert XHTML+CSS to XSL-FO, which pretty much does what we want to do one step lower. That is, our current design calls for a special XSL transform which would convert our mXHTML to XSL-FO, ignoring CSS. This software might make it easier to convert to XSL-FO, but probably at the cost of control.
- Cons:
- This is mightily useless right now, as we have nothing to do with XSL-FO since Apache FOP does not render Hebrew text correctly
FO2ODF
- URL: http://fo2odf.sourceforge.net/
- Lang: XSLT
- License: BSD License
- Notes:
- This stylesheet converts an XSL-FO sheet to ODF. This provides an interesting angle to a complete target.
- Cons:
- However, it is not mature and would need lots of XSLT work to get it functional to our needs.
- In addition, there is no path from ODF to PDF short of using Open Office AFAIK; thus it gets us to ODF only (not necessarily a failing of the stylesheet, just a disappointing fact discovered as a result of it).
- ODF/ODT/Open Office does not support embedded fonts AFAIK. Thus the font would have to be bundled externally together with the ODF, and somehow associated with it for the renderer/printer.
- Links:
- Functional test page: http://body.php5.cz/php/fo2odf.php
Conclusion
In summary, after some research I have come to the conclusion that the standards address our issue very well. Instead of relying on traditional typesetting methods, such as TeX, we can go down the path of XSL-FO, X(HT)ML/CSS3, or SVG1.2/SVG-print. However, no one actually implements any of these standards properly (at least no free software does, as far as I can tell), and so, until someone does, we are forced to either:
- Fix Apache FOP
- Use whatever of CSS3 is implemented in some of the libraries mentioned above
- Use an XSL stylesheet to convert the data to SVG1.2/SVG-Print, and then use one of the libraries mentioned above to render (or leave as is).
- Directly transform our X(HT)ML into a PDF, or similar format (SVG 1.x?), using one of the libraries above (downside: this would be reimplementing some of FOP's functionality)
- Directly transform our X(HT)ML into a PDF or similar format using a very advanced XSLT stylesheet (as noted above, see:
- Wait until the standards are implemented
Trade-offs to consider for everything on this page
- Level of control
- CSS gives less of a level of control when compared with XSL-FO
- XSL-FO gives less of a level of control when compared with iText
- iText gives less of a level of control when compared with direct PS/PDF/SVG output
- Burden of control
- Taking care of flow control
- SVG/PS/PDF vs. SVG1.2/SVG-print/iText/XSL-FO/CSS3
- License
- GPL is incompatible with MPL
- MPL is required for Saxon
- Saxon is required for XSLT 2.0
- XSLT 2.0 is required for the transformations that drive this project
- Language written in
- Java is a plus
- Burden of installation
- TeX/LaTeX
- Ease of use
- Standardized
- Example: iText etc. would not be standardized
- Would tie us to a specific piece of software
- etc.
- Implemented
- XSL-FO is almost perfect for our project, but isn't properly implemented (Apache FOP lacks bidi support)
- SVG1.2/SVG-Print might be a good fit, but isn't widely implemented
- CSS3 could style our mXHTML well, but isn't widely implemented