Target Survey

From The Open Siddur Project Transcription and Documentation Wiki

Survey of Target Document Formats

Some Terms

Target 
A document or other storage format, other than the original XML data, produced by a conversion, or series of conversions, from the data. Usually, a target is somehow more useful than the previous target.
Chain 
The formats and conversions between them that lead to a target.
Complete target
A target that produces a document that is production grade, after which the chain ends.
Intermediate target
A target in a multi-stage transformation chain that is not complete.
JLPTEI 
The XML format, based on the Text Encoding Initiative (TEI), in which our Tanach is stored (and our Siddur soon will be). All chains start from this format.
muXHTML 
Our only current target, muXHTML (abbreviated mXHTML on this page), is a small subset of XHTML into which our transforms are able to convert the data. The mXHTML can be styled with CSS and displayed in any standards-compliant web browser.
Transformation tool 
An application or library used to transform data from one format to another.
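
The terms above can be modelled as a small data structure. The following is only a sketch: the class names Target and Chain, and the rule enforced in add(), are illustrative assumptions and not part of the project's code. The example chain (mXHTML, then XSL-FO, then PDF) is one of the routes discussed later on this page.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Target:
    """A storage format produced by converting the source data."""
    name: str
    complete: bool = False  # True if the format is production grade


@dataclass
class Chain:
    """The sequence of targets leading from JLPTEI to a final target."""
    source: str = "JLPTEI"  # every chain starts from this format
    targets: list = field(default_factory=list)

    def add(self, target):
        # A complete target ends the chain, so nothing may follow it.
        if self.is_finished():
            raise ValueError("chain already ends in a complete target")
        self.targets.append(target)
        return self

    def is_finished(self):
        return bool(self.targets) and self.targets[-1].complete

    def describe(self):
        return " -> ".join([self.source] + [t.name for t in self.targets])


chain = (
    Chain()
    .add(Target("mXHTML"))            # intermediate target
    .add(Target("XSL-FO"))            # intermediate target
    .add(Target("PDF", complete=True))  # complete target, chain ends here
)
print(chain.describe())
```

Every target before the last is an intermediate target; once a complete target has been added, the sketch refuses further conversions.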

Goal

The ultimate goals are to have a computer-viewable display format (XHTML) and at least one printable format. We may also want a post-processing editable format.

Our farthest target as yet is XHTML, styled by CSS. For a printed format, one expects a complete target to be able to produce a document that has the features one would expect of any Siddur: page numbers, table of contents, footnotes, side notes, header/page title, etc. XHTML originated as a computer-display format, not a publishing format. Even when combined with CSS 2.1, it does not support some of the features above (with some hacking, side notes, a static header/footer, and page numbers are possible, but it is still missing vital features). CSS3 is more publishing friendly and, when implemented, will make life much easier. Until then, we will have to be a bit more creative.

The following is a list of software libraries and formats that can help us increase the range of formats that we can target. XSLT and Java are the preferred languages, since the rest of our chain is in XSLT and driven by Saxon, which is written in Java. This allows us to bundle the entire chain in a portable program that can be distributed (with the added bonus that it could be distributed within a web browser as an applet).

Complete Targets

PDF/postscript

This is one possible complete target. The format is widely used, and was published as an open standard in 2008 (see wikipedia:Portable_Document_Format). Once in PDF format, the document is almost as good as a series of images, and is ready to print. The problem is how to get from XML or XHTML to here, as the PDF format is (probably) too difficult to target directly using XSLT (considering positioning problems etc.).

It should be noted that there are actually two types of PDF targets. One uses PDF simply as a print format, the other as a digital medium. As a digital medium, the text would be semantic and selectable. This is more difficult to achieve, as the library producing the PDF must understand the OpenType font in order to position the characters correctly and to embed the font. However, the text can also be drawn as vectors instead of written as text. This is sometimes easier. For example, there are a few libraries that use Java's Graphics2D class as an interface to the PDF. Graphics2D allows vector graphics to be drawn directly to the PDF, and the AWT library, provided with Java, understands OpenType and can draw text correctly to any Graphics2D implementation. Thus, using AWT's font rendering system, we can produce a good PDF for printing purposes. The resulting PDF would likely be very large, and its text would not be selectable, so it would not be fit for use as a digital medium.

SVG

Another possible complete target. Scalable Vector Graphics is a standard for describing graphics, and once in this format, a document would be ready to print. SVG is XML; however, targeting SVG directly is (probably) too difficult (considering positioning problems etc.) (see wikipedia:Scalable_Vector_Graphics#Printing). So again the problem is getting the data into this format.

Open Document Format

A word processing format that originated in OpenOffice.org and is now an ISO standard (see wikipedia:OpenDocument).

OOXML

Microsoft's "ISO standard" format. There is no product in existence that renders ISO OOXML, and the standards document is a farce. Must be avoided at all costs. wikipedia:Ooxml

XPS, OpenXPS (MS)

Microsoft's answer to PDF. Does anything off Windows 7 support it? wikipedia:XML_Paper_Specification

Technical Problems

Right to left / BiDi

Hebrew text is rendered right to left. In addition, the Unicode standard defines an algorithm for ordering RTL (right-to-left) text that is next to LTR (left-to-right) text. This is called the BiDi algorithm (bi-directional text algorithm). In the CSS standard, applying it can mean splitting and reordering elements, which can become quite complex. Many libraries therefore do not implement it, and will render Hebrew text in reverse order.
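
As a rough illustration of the first step such an algorithm needs, the sketch below splits a string into runs of the same direction using the bidirectional class that Unicode assigns to each character. It is not an implementation of the Unicode BiDi algorithm: it does not resolve neutral characters, compute embedding levels, or reorder anything. The function names direction and directional_runs are made up for this example.

```python
import unicodedata
from itertools import groupby


def direction(ch):
    """Map a character to 'R', 'L', or 'N' (neutral) by its Unicode bidi class."""
    bidi_class = unicodedata.bidirectional(ch)
    if bidi_class in ("R", "AL"):  # right-to-left letters (Hebrew, Arabic)
        return "R"
    if bidi_class == "L":          # left-to-right letters
        return "L"
    return "N"                     # spaces, punctuation, digits, etc.


def directional_runs(text):
    """Split text into (direction, substring) runs, kept in logical order."""
    return [(d, "".join(chars)) for d, chars in groupby(text, key=direction)]


# "abc", a space, the Hebrew word shalom (without vowels), a space, "def"
sample = "abc \u05e9\u05dc\u05d5\u05dd def"
for run_direction, run_text in directional_runs(sample):
    print(run_direction, repr(run_text))
```

A renderer that stops here and simply draws the string left to right is what produces the reversed Hebrew described above; a correct renderer must go on to reorder the runs for display.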

Complex OpenType Layouts

If you finally get the text to display in the correct order, you might find that the vowels are treated as separate characters. That is, they display after the character they belong to, instead of above, inside, or below it. The Hebrew text in Tanach uses complex characters that can only be properly rendered (with an open source font) using the Ezra SIL font. This is one of the major problems of the many PDF generating libraries: they cannot render OpenType properly and will not perform any kerning at all when it is used.

Other open source fonts can display non-Biblical Hebrew correctly, but all OpenType fonts that display Hebrew with vowels correctly use complex layouts.
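
The reason the vowels can end up displayed as separate characters is that, in Unicode, each vowel point is its own combining code point stored after the letter it belongs to. The sketch below only demonstrates that storage model by grouping each letter with the marks that follow it; the function name clusters is made up for this example, and the actual positioning of the marks is done by the font's OpenType layout tables, which this code does not touch.

```python
import unicodedata


def clusters(text):
    """Group each base character with the combining marks that follow it."""
    result = []
    for ch in text:
        if unicodedata.combining(ch) and result:
            result[-1] += ch  # attach the mark to the preceding base letter
        else:
            result.append(ch)
    return result


# The word shalom with vowel points: shin + qamats + shin dot, lamed,
# vav + holam, final mem.
word = "\u05e9\u05b8\u05c1\u05dc\u05d5\u05b9\u05dd"
print(len(word))            # number of code points in the string
print(len(clusters(word)))  # number of letters a reader actually sees
```

A library that lays out one glyph per code point will place the marks beside the letters; a library with OpenType support places each cluster as a unit.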

Intermediate targets

XSL-FO

This is a W3C recommendation for defining page layout as XML. It defines a template for a page, and then simply fills the template page with text, overflowing to as many pages as necessary. There are no (known) "XSL-FO viewers" (why not?). It is an intermediate target only.
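
To make the "template plus flowing text" idea concrete, the sketch below builds a minimal XSL-FO document with Python's standard library: one page master describing the page, and one flow of blocks that a formatter would pour into it. The helper names fo and build_fo, the master name "page", and the A4 dimensions are assumptions chosen for the example; in this project such a document would presumably be produced by an XSLT transform instead.

```python
import xml.etree.ElementTree as ET

FO_NS = "http://www.w3.org/1999/XSL/Format"
ET.register_namespace("fo", FO_NS)


def fo(tag):
    """Return a namespace-qualified XSL-FO element name."""
    return "{%s}%s" % (FO_NS, tag)


def build_fo(paragraphs):
    """Build an XSL-FO tree with one page template and one text flow."""
    root = ET.Element(fo("root"))

    # The page template: a single page master with a body region.
    masters = ET.SubElement(root, fo("layout-master-set"))
    page = ET.SubElement(masters, fo("simple-page-master"), {
        "master-name": "page",
        "page-height": "297mm",
        "page-width": "210mm",
    })
    ET.SubElement(page, fo("region-body"))

    # The content: flows into the body region over as many pages as needed.
    sequence = ET.SubElement(root, fo("page-sequence"), {"master-reference": "page"})
    flow = ET.SubElement(sequence, fo("flow"), {"flow-name": "xsl-region-body"})
    for text in paragraphs:
        ET.SubElement(flow, fo("block")).text = text
    return root


document = build_fo(["First paragraph.", "Second paragraph."])
print(ET.tostring(document, encoding="unicode"))
```

The output is still only an intermediate target: it has to be handed to an XSL-FO formatter before anything can be viewed or printed.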

XHTML+CSS3 (Cascading Style Sheets)

The mXHTML can be styled with CSS2 (as any XHTML can), but certain things, such as footnotes, page numbers, an index, a table of contents etc., can only be accomplished with the print-friendly CSS3. Right now, it does not really matter whether the implementation is a visual renderer (i.e. a browser) or a library that renders to another format, such as PDF (since if we target a browser, the user can simply print from there, even to a PDF printer if desired). Unfortunately, there are no known CSS3 implementations, but some libraries have implemented important CSS3 extensions that might just be enough. (Although CSS is usually associated with XHTML, many libraries and applications implement CSS for XML, treating each tag as an XHTML div tag and allowing it to be styled as desired.)

XHTML+CSS can be rendered directly by web browsers. It may be both an intermediate target and a complete target.

Note: CSS3 is still a draft - it is not yet a W3C recommendation.

TeX

TeX is a non-XML typesetting language that provides fine control over page layout and has strong support for publishing features. As such, it is an intermediate target.

TeXML

TeXML is a purely intermediate target format that can be used as an XML-ish way to express TeX. In order to do anything, it has to be converted into non-XML TeX.

SVG1.2/SVG-print

Mars

Tools

Apache-FOP

  • URL: http://xmlgraphics.apache.org/fop
  • Lang: Java
  • License: Apache License 2.0
  • Notes:
    • FOP is written in Java, has an excellent license, processes XSL-FO, the standard's answer to our problem, and outputs PDF, the current target of our chain, in addition to other formats.
  • Cons:
    • However, the implementation has a few shortcomings, the most important being the lack of support for bidirectional Unicode, RTL (right-to-left), which is vital for Hebrew!
    • OpenType reportedly does not render correctly
  • Summary:
    • If it were possible to get bidi working in FOP, that would probably be best. However, BIDI has been a requested feature for a really long time (at least since 2002!), and has not yet been implemented. It is also a very large job to rip apart FOP to try and implement it ourselves.
  • Links:

Apache Batik

html2ps/pdf

jPod

pisa

Flying Saucer / xhtmlrenderer

  • URL: https://xhtmlrenderer.dev.java.net/
  • Lang: Java
  • License: LGPL
  • Notes:
    • An XHTML-to-PDF renderer that is almost ACID2 compliant. Seems to use iText (detailed here as well), and support some very nice CSS3 page-related rules.
  • Cons:
    • Does no BIDI, which according to the CSS 2.1 standard involves splitting and reordering elements if necessary.
    • Does not take advantage of iText's BIDI
    • Inherits iText's problems, namely no OpenType font support
  • Pros:
    • Renderer is plug-able; thus it might be easy to write one that does support OpenType and BIDI, and that outputs to many formats. From an XHTML/CSS perspective, this understands the markup and CSS well, and is thus a good starting point for rendering it.
    • There is a Graphics2D renderer. This means it can render to images, or to PDF vectors. However, this renderer has no paged support.
  • Summary:
    • To use this great library, we would have to implement BIDI element splitting and reordering, then find an appropriate renderer and implement things like drawString(). In addition, if we go the vector route, we would have to implement a paged version of that renderer.
  • Links:

iText

  • URL: http://www.lowagie.com/iText/
  • Lang: Java
  • License: LGPL/MPL (now AGPL)
  • Notes:
    • An impressive Java library for directly producing and manipulating PDF files. Its API offers constructs such as Paragraph, which parallel exactly what we need. Using it would basically require us to write a very simple "xml2pdf" renderer (or to use xhtmlrenderer, detailed here as well). It is also used by many other libraries, including xhtmlrenderer.
  • Cons:
    • BIDI is supported, but only through special constructs
    • It does not render OpenType fonts well
    • It recently switched license to the AGPL, a very restrictive license not suitable for our project. Therefore, short of a fork, no future development can fix any of the above cons for us.
  • Links:

xmlroff

svg2ps

css2xslfo

  • URL: http://www.re.be/css2xslfo/
  • Lang: Java
  • License: Public Domain?
  • Notes:
    • Using this transformation, we can convert XHTML+CSS to XSL-FO, which pretty much does what we want to do one step lower. That is, our current design calls for a special XSL transform which would convert our mXHTML to XSL-FO, ignoring CSS. This software might make it easier to convert to XSL-FO, but probably at the cost of control.
  • Cons:
    • This is of little use right now: we cannot do anything with XSL-FO, since Apache FOP does not render Hebrew text correctly

FO2ODF

  • URL: http://fo2odf.sourceforge.net/
  • Lang: XSLT
  • License: BSD License
  • Notes:
    • This stylesheet converts an XSL-FO sheet to ODF. This provides an interesting angle to a complete target.
  • Cons:
    • However, it is not mature and would need lots of XSLT work to get it functional to our needs.
    • In addition, there is no path from ODF to PDF short of using Open Office AFAIK; thus it gets us to ODF only (not necessarily a failing of the stylesheet, just a disappointing fact discovered as a result of it).
    • ODF/ODT/Open Office does not support embedded fonts AFAIK. Thus the font would have to be bundled externally together with the ODF, and somehow associated with it for the renderer/printer.
  • Links:

Conclusion

In summary, after some research I have come to the conclusion that the standards address our issue very well. Instead of relying on traditional typesetting methods, such as TeX, we can go down the path of XSL-FO, X(HT)ML/CSS3, or SVG1.2/SVG-print. However, no one actually implements any of these standards properly (at least no free software does, as far as I can tell), and so until someone does, we are forced to either:

The ups and downs to consider of all things on this page

  • Level of control
    • CSS gives less of a level of control when compared with XSL-FO
    • XSL-FO gives less of a level of control when compared with iText
    • iText gives less of a level of control when compared with direct PS/PDF/SVG output
  • Burden of control
    • Taking care of flow control
      • SVG/PS/PDF vs. SVG1.2/SVG-print/iText/XSL-FO/CSS3
  • License
    • GPL is incompatible with MPL
    • MPL is required for Saxon
    • Saxon is required for XSLT 2.0
    • XSLT 2.0 is required for the transformations that drive this project
  • Language written in
    • Java is a plus
  • Burden of installation
    • TeX/LaTeX
  • Ease of use
  • Standardized
    • Example: iText etc. would not be standardized
      • Would tie us to a specific piece of software
    • etc.
  • Implemented
    • XSL-FO is almost perfect for our project, but isn't implemented
    • SVG1.2/SVG-Print might be a good fit, but isn't widely implemented
    • CSS3 could style our mXHTML well, but isn't widely implemented