JLPTEI

'''IMPORTANT NOTE: This document is a work in progress. Comments/discussion should go on the opensiddur-tech list.'''

TODO: Specify guidelines, in addition to examples and vice versa.

JLPTEI XML is the customization of Text Encoding Initiative XML to our needs.

The official JLPTEI specifications will be written in TEI ODD (One Document Does it all) self-documenting schema language, with parts in RelaxNG. The TEI's tools will be used to convert the schema to human readable documentation and to RelaxNG schemas that the XML can be validated against.

The ODDs will be stored in the code repository under the schema directory.

If you are interested in participating in JLPTEI development, join the opensiddur-tech mailing list and send a message indicating your interest.

=Introduction: A computer representation of the Jewish liturgy=

On the most basic level, the siddur is simple text. As such, on first thought, a transcription of the material on a wiki, in a word processor document, or in a text file would seem sufficient for its representation. If that is the case, why define an XML encoding? And why does JLPTEI require the features it does? This introduction addresses the essential problems in representing the Jewish liturgy on a computer and how JLPTEI approaches the problems. It also gives the reasons behind some of the tradeoffs that were made in the design of JLPTEI.

JLPTEI is a technical means to achieving the Open Siddur Project's Mission Statement. As such, it must support "texts and supplemental material that may be accessed, shared, adapted, and improved on by the entire Jewish community," "a non-prescriptive attitude towards the manner in which individuals and communities engage tradition," "pluralism that reflects the multiplicity of creative expressions possible," and "awareness of historical, geographical, and philosophical diversity in Jewish communities." Respect for communal and individual diversity is at the core of the Open Siddur/Jewish Liturgy Project's mission, and guides the project's technological vision.

There is no such thing as a single text called The Siddur. On the broadest scale, siddur texts may be divided by rite (nusah). However, even accepting that there are multiple base texts, neither is there a single Ashkenazic siddur, Hasidic siddur, Sephardic siddur, etc. A rite is a major division which uniquely specifies a common denominator of customs within a group of customs. Within Ashkenaz, there are differences between the Polish and German customs. The Iraqi custom is not the same as the Yemenite custom, and the Lubavitch custom is not the same as the Breslov custom. There are also divisions within each rite along the major philosophical boundaries that have developed in recent centuries, which lead to differences in custom and text. The traditional-egalitarian rite (usually a variant of the Ashkenazic rite), for example, is still undergoing major evolution. As such, it is impossible to maintain a single source text. It is possible to maintain a bank of source texts, calling each one by the name of a given rite. The latter is the approach taken by many vendors, who will sell an electronic edition of a base-text siddur from a given rite. However, from a philosophical perspective, that approach fails to recognize diversity within Jewish communities, and essentially requires the project to canonize one text over another in the distributed version. Further, from a technical perspective, as more base texts are completed, the project becomes unscalable; with many copies of common texts in the archive, both correcting mistakes and remixing content in novel ways become increasingly difficult.

In addition, a modern siddur is expected to contain more than simple text. Aligned translations and transliterations and linked commentaries are usual fare. Some siddurim also contain art. As an online project, we also have the opportunity to link additional non-textual material, such as audio or video.

As a member of the community at large, the project must also maintain a chain of credit and responsibility for its contributors, and a chain of bibliographic credit for ideas. This necessarily involves maintaining a great deal of metadata in addition to the texts and supplementary material.

The Open Siddur, therefore, takes a different approach, which is realized in the JLPTEI design. This approach involves (1) minimizing the amount of stored text, (2) storing the differences between the texts and (3) having user-selectable sets of conditions that specify when each variant is selected. If a typo is corrected in one variant, it is naturally corrected for all variants. Any stored metadata is also automatically consistent between all texts. An additional advantage of this approach is that a community with a custom that differs from the “base” custom of the rite only has to make a different choice of variants. No change is required in the text in order to support a slightly differing custom.
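The three-part approach above can be illustrated with a minimal Python sketch. It is not the JLPTEI encoding itself: the `Segment` class, the `render` function, the condition name `custom-b`, and the placeholder strings are all hypothetical names introduced only for this illustration. Each shared segment is stored once, variants are stored as differences keyed by a condition, and a set of active conditions selects the text.

```python
from dataclasses import dataclass, field


@dataclass
class Segment:
    """One stored piece of text; variants are keyed by a condition name."""
    default: str
    variants: dict = field(default_factory=dict)  # condition name -> replacement text


def render(segments, active_conditions):
    """Build a text by choosing, per segment, the first variant whose condition is active."""
    out = []
    for seg in segments:
        chosen = seg.default
        for condition, text in seg.variants.items():
            if condition in active_conditions:
                chosen = text
                break
        out.append(chosen)
    return " ".join(out)


# Hypothetical placeholder content, not real liturgical text.
SEGMENTS = [
    Segment("shared-opening"),
    Segment("base-phrase", {"custom-b": "alternate-phrase"}),
    Segment("shared-closing"),
]

base_text = render(SEGMENTS, set())
custom_b_text = render(SEGMENTS, {"custom-b"})
```

Because the opening and closing segments are stored only once, a correction to either one appears in every rendering, and a community following the hypothetical `custom-b` changes only its set of active conditions, not the stored text.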

Aside from text versioning, a second major problem in representation is what should be represented. A user working on a word processor or desktop publishing system would first consider presentation. What is considered good presentation form is up to individual taste, and tastes are expected to change far more rapidly than the text. In a system intended for universal use, we must encode, along with the text, sufficient information to present the text. What we encode, however, should be intrinsic to the text, not to any aspect of its presentation. Separating the encoding of document structure from its presentation allows each user to make the presentation conform to his/her own tastes without duplicating effort on the encoding side.

The primary question then becomes: which structure should be encoded? Prose can be divided into paragraphs and sentences, poetic text can be divided into line groups and verse lines, lists into items and lists, etc. Many parts of the siddur have more than one structure on the same text! XML assumes that a document has a pure hierarchical tree structure. This suggests that XML is not an appropriate encoding technology for the siddur. At the same time, XML encoding is nearly universally standard and more software tools support XML-based formats than other encoding formats. One of the primary innovations of JLPTEI is its particular encoding of concurrent structural hierarchies. While the idea is not novel, the implementation is. The potential for the existence of concurrent structure is a guiding force in JLPTEI design.

The disadvantage of JLPTEI's encoding solutions is that the archival form of the text is not immediately consumable by humans. We are forced to rely extensively on processing software to make the format editable and displayable. The disadvantage, however, is balanced by the encoding format's extensibility and conservation of human labor.

The Open Siddur intends to work within open standards whenever possible. In choosing a basis for our encoding, we searched for available encoding standards that would suit our purposes. We seriously considered using Open Scripture Information Standard (OSIS), an XML format used for encoding Bibles. It was quickly discovered that representations of some of the more advanced features required to encode the liturgy (such as those discussed above) would have to be "hacked" on top of the standard. The Text Encoding Initiative (TEI) XML format is a de facto standard within the digital humanities community. It is also specified in well-documented texts, is actively supported by tools, and has a large community built around its use and development. Further, the standard is deliberately extensible using a relatively simple mechanism. The TEI was therefore a natural choice as a basis for our encoding.

=Intended audience and prerequisites=

The intended audiences of this document are application and transform developers, who will need to implement JLPTEI in software.

The project's front-end applications will hide the details of the JLPTEI encoding from end-users.

Before reading this document, you should be familiar with XML and the TEI. The following parts of the TEI P5 Guidelines are recommended background reading:
 * A Gentle Introduction to XML
 * Languages and character sets
 * Chapters 1, 2, 3, 4, 6
 * Chapter 12 introduces the TEI way to encode textual variants. A derived method is used in JLPTEI.
 * Section 13.2
 * Chapter 16
 * Chapter 18 is the inspiration for JLPTEI conditionals.
 * Chapter 20 discusses the problem of multiple hierarchies. Pay close attention to section 20.4.
 * Section 21.3 references the responsibility elements.
 * Chapters 22 and 23 discuss the ODD schema language.

=Encoding philosophy=

JLPTEI encodes a text's meaning and structure. While it should encode enough information to allow the XML to be displayed, it does not encode or enforce how the text should be displayed.

Stylesheets and transforms are used to convert JLPTEI's semantic information into forms suitable for display.

=Standards conformance=

JLPTEI is intended to be a "TEI Extension," according to the definition in the TEI Guidelines. It is not guaranteed to be "TEI conformable."

=File layout=

A JLPTEI file represents a discrete unit of text. The division is somewhat arbitrary, but will usually be a paragraph or a commonly addressed unit (for example, kiddush and havdalah are each more than one paragraph long, but each may be in one file).

A JLPTEI file contains the following parts:
 * The header, which contains metadata about the file's contents. These include bibliographic information, copyright status, and responsibility statements.
 * Zero or more resource sections, including:
   * Definitions and application of conditionals.
   * Citation links.
   * Other text-associated data links.
 * The text section, which contains:
   * The text repository, a collection of unordered, small segments of textual content.
   * One or more concurrent selection sections, each of which includes:
     * A division containing identified pointers into the text repository. This division imposes order on the text.
     * One or more views into the text. Each view represents all or part of the selection as a structural hierarchy (e.g., paragraph prose, line verse).
 * One or more sections containing original content that links to or supports the text (e.g., Instructions and Notes, Translations).
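The relationship between the text repository, a selection, and its views can be sketched in a few lines of Python. This models the idea only, not the XML encoding: the dictionary layout, the identifiers `s1` to `s4`, the view names, and the placeholder words are assumptions made for illustration.

```python
# Text repository: unordered, identified segments (placeholder content).
repository = {
    "s3": "gamma",
    "s1": "alpha",
    "s4": "delta",
    "s2": "beta",
}

# Selection: ordered pointers into the repository; this imposes order on the text.
selection = ["s1", "s2", "s3", "s4"]

# Views: two concurrent hierarchies over the same selection,
# each expressed as groups of pointers rather than as copies of the text.
views = {
    "prose": [["s1", "s2", "s3"], ["s4"]],  # two paragraphs
    "verse": [["s1", "s2"], ["s3", "s4"]],  # two verse lines
}


def render_view(name):
    """Return one string per structural unit (paragraph or verse line) of a view."""
    return [" ".join(repository[p] for p in group) for group in views[name]]


def follows_selection_order(name):
    """Check that a view visits its pointers in the order the selection defines."""
    positions = [selection.index(p) for group in views[name] for p in group]
    return positions == sorted(positions)
```

Neither view owns the text: both group the same pointers differently, which is how overlapping structures can coexist without forcing them into a single XML tree.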

The header is mandatory.

Some files will incorporate the other sections by reference. Support files (e.g., those containing instructions or commentary), for example, will likely not contain text repositories of their own, but will contain their own support sections.

=Namespaces=

The following XML namespaces are used. The prefix displayed here is the one used in the documentation:

=Predefined URLs=

Some data is used globally, but may reside anywhere in the filesystem or database and therefore be difficult to reference from within texts. This representation need not be stored as a single resource.

The fixed URLs are defined relative to the root of the database or filesystem. Global URLs are recognizable by their starting with /. A custom URI resolver (such as XML catalogs) or URL rewriter may be used to resolve these URIs into a valid XML representation in the filesystem or database.
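As a rough illustration of the rewriting step, the following Python sketch maps a global URL onto a path under a storage root. The function name `resolve_global_url`, the root `/srv/db`, and the file name `bibliography.xml` are hypothetical; a real deployment might use XML catalogs or a database-level rewriter instead.

```python
from pathlib import PurePosixPath


def resolve_global_url(url, root):
    """Map a global URL (leading '/') to a path under root; return None for other URLs."""
    if not url.startswith("/"):
        return None  # not a global URL; leave it to ordinary relative resolution
    relative = PurePosixPath(url.lstrip("/"))
    if ".." in relative.parts:
        # Assumption: a global URL should never climb above the storage root.
        raise ValueError("global URL must not escape the root: " + url)
    return str(PurePosixPath(root) / relative)


# Hypothetical usage: the root and file name are placeholders.
resolved = resolve_global_url("/bibliography.xml", "/srv/db")
```

Keeping the resolver as a single function means the storage root can change without touching any reference stored inside the texts.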

Certain special support files have fixed URIs. These files are:

Other predefined URLs will be used for other media.

=The header=

The JLPTEI header is discussed in header.

=Support Files=
 * The global bibliography.
 * The contributor list.
 * Scan index files.

=Raw data=

The tei:TEI/j:raw tag allows raw data (not yet encoded) to be stored in a JLPTEI file. The raw data may be broken into tei:seg segments. Raw data must not be present in core distribution files.

This setup allows transcribed but not yet proofread data to be stored within the XML context, and allows the normal mechanisms of responsibility to be enacted even before the file is fully encoded.

As raw data is encoded, it must be removed as a descendant of the j:raw element. Any responsibility information must be ported to point to the new encoded structures.

The j:raw element may include an @status attribute to indicate the current workflow status of the transcription it contains. Its content is application-defined.
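A tool that handles raw data might look roughly like the Python sketch below, which reads the j:raw segments and checks whether a file is free of raw data. The TEI namespace URI is the standard one; the `j` namespace URI is a placeholder because the real one is not given here, and the status value `transcribed`, the sample content, and the function names are likewise assumptions.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
J_NS = "http://example.org/ns/jlptei"  # placeholder, NOT the real JLPTEI namespace

# Hypothetical sample file; the status value is application-defined.
SAMPLE = f"""<tei:TEI xmlns:tei="{TEI_NS}" xmlns:j="{J_NS}">
  <j:raw status="transcribed">
    <tei:seg>first raw segment</tei:seg>
    <tei:seg>second raw segment</tei:seg>
  </j:raw>
</tei:TEI>"""


def raw_segments(root):
    """Return (status, [segment texts]) for each j:raw child of tei:TEI."""
    result = []
    for raw in root.findall(f"{{{J_NS}}}raw"):
        segs = [seg.text for seg in raw.findall(f"{{{TEI_NS}}}seg")]
        result.append((raw.get("status"), segs))
    return result


def is_core_ready(root):
    """Core distribution files must not contain raw data anywhere."""
    return root.find(f".//{{{J_NS}}}raw") is None


root = ET.fromstring(SAMPLE)
```

A check like `is_core_ready` could run before a file is admitted to the core distribution, since raw data must not be present there.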

=The Text Repository=
 * The Text repository is discussed on a separate page.

=Selections and concurrent hierarchies=
 * Discussed in Concurrent hierarchies.

=Textual variations=
 * Discussed in Textual variations.

=Conditional text=


 * Discussed in Conditionals.

=Injections=
 * Discussed in Injections.

=Translations=
 * Discussed in Translations.

=Transliterations=


 * Discussed in Transliterations.

=Instructions and notes=


 * Discussed in Instructions and Notes.

=Non-textual media=


 * Discussed in Non-textual media.

=Citations and References=


 * Discussed in Citations.

=Encoding tutorial=

An encoding tutorial is available that shows a step-by-step process of encoding.

=Copyright and Licensing=

The JLPTEI schemas and associated documentation are available under the GNU General Public License, version 3, or at your option, any later version.