Text repository

In JLPTEI, the text structure is kept separately from the text itself. The text structure is constructed from base texts in the repository using stand-off markup. This separation is done for at least three reasons:
 * 1) Text structure is far more subjective than the content of the text itself. Different siddurim may want to use the same text with entirely different structure.
 * 2) The structure of a text can be viewed in a number of different ways (prose, poetry, list structure, etc.), all of which might be simultaneously valid and informative. XML does not support multiple, concurrent hierarchies, but stand-off markup can be used to piece them together.
 * 3) Different rites sometimes have the same words or phrases in different orders.

The text repository is represented by a j:repository element (a limited type of tei:div), containing one or more tei:seg child elements (segments). Even though the segments' positions in the document define an order, the repository is considered, for semantic purposes, to be an unordered set. The relative locations of segments imply no relationships between them within JLPTEI.

In order to use a stand-off markup system, the segments must be individually addressable. Therefore, within a repository, each tei:seg must have a file-unique @xml:id attribute.
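The addressability requirement can be checked mechanically. The following is a minimal sketch, not part of JLPTEI itself: the function name check_segment_ids and the sample document are hypothetical, and the namespace URI bound to the j: prefix is an assumption, since this section does not state it.

```python
import xml.etree.ElementTree as ET
from collections import Counter

TEI_NS = "http://www.tei-c.org/ns/1.0"
# Assumption: this section does not give the URI for the j: prefix;
# the value below is used for illustration only.
J_NS = "http://jewishliturgy.org/ns/jlptei/1.0"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical sample: one segment lacks an id and one id is repeated.
SAMPLE = f"""
<j:repository xmlns:j="{J_NS}" xmlns:tei="{TEI_NS}">
  <tei:seg xml:id="seg1"/>
  <tei:seg xml:id="seg2"/>
  <tei:seg/>
  <tei:seg xml:id="seg1"/>
</j:repository>
"""

def check_segment_ids(repository):
    """Return (number of tei:seg children without xml:id, sorted duplicated ids)."""
    segs = repository.findall(f"{{{TEI_NS}}}seg")
    missing = sum(1 for seg in segs if seg.get(XML_ID) is None)
    # "File-unique" is checked over every element in the tree, not only segments.
    counts = Counter(
        el.get(XML_ID) for el in repository.iter() if el.get(XML_ID) is not None
    )
    duplicates = sorted(i for i, n in counts.items() if n > 1)
    return missing, duplicates

missing, duplicates = check_segment_ids(ET.fromstring(SAMPLE))
print(missing, duplicates)
```

A repository that satisfies the rule would yield zero missing ids and an empty list of duplicates.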

The contents of the segments are addressed below.

=Smallest addressable unit=

The smallest addressable unit of text is the word, which is represented by a tei:w element with a file-unique @xml:id attribute. Compound words can be held together by a maqaf (־), a punctuation-like character. Each word in a maqaf-joined compound phrase is represented by its own tei:w element. The maqaf itself is contained within a tei:pc element. Other punctuation characters are also stored in tei:pc elements. Characters outside of words that are not punctuation characters are stored in tei:c elements.
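As a rough illustration of the word-level rules, the sketch below splits a string into tei:w, tei:pc and tei:c elements. It rests on several assumptions that are not part of JLPTEI: characters are classified by Unicode general category (letters and combining marks form words, punctuation becomes tei:pc, anything else that is not whitespace becomes tei:c), whitespace is treated purely as a separator, and the id scheme (seg1_w1, seg1_w2, ...) and function names are hypothetical. A geresh or gershayim inside an abbreviation would be split off as punctuation by this naive rule.

```python
import unicodedata
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"
MAQAF = "\u05be"  # HEBREW PUNCTUATION MAQAF

def tokenize(text):
    """Split text into (kind, string) pairs, where kind is 'w', 'pc' or 'c'."""
    tokens = []
    word = ""
    for ch in text:
        category = unicodedata.category(ch)
        if category[0] in "LM":  # letters, vowel points, cantillation marks
            word += ch
            continue
        if word:  # any other character ends the current word
            tokens.append(("w", word))
            word = ""
        if ch.isspace():  # assumption: whitespace only separates tokens
            continue
        tokens.append(("pc" if category[0] == "P" else "c", ch))
    if word:
        tokens.append(("w", word))
    return tokens

def to_segment(text, seg_id):
    """Build a tei:seg whose children are the tokens of text."""
    seg = ET.Element(f"{{{TEI_NS}}}seg", {XML_ID: seg_id})
    word_count = 0
    for kind, string in tokenize(text):
        el = ET.SubElement(seg, f"{{{TEI_NS}}}{kind}")
        el.text = string
        if kind == "w":  # hypothetical id scheme for words
            word_count += 1
            el.set(XML_ID, f"{seg_id}_w{word_count}")
    return seg

ET.register_namespace("tei", TEI_NS)
seg = to_segment("כל" + MAQAF + "העם.", "seg1")
print(ET.tostring(seg, encoding="unicode"))
```

The maqaf-joined compound in the sample comes out as two tei:w elements separated by a tei:pc holding the maqaf, followed by a second tei:pc for the final period.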

=Second smallest addressable unit=

The next larger addressable unit is something of a challenge to define. Most siddur text does not break neatly into sentences, and even when it does, the sentences tend to be very long, and their boundaries are frequently a matter of interpretation.

Therefore, the second smallest unit in JLPTEI is the segment, represented with a tei:seg (segment) element.

For Tanach texts, the cantillation marks make natural divisions into segments. In most of Tanach, segments can be split after the words bearing etnahta, zaqef qatan, sof pasuk, and possibly revii. For Psalms, Job, and Proverbs, the divisions are etnahta, sof pasuk, and ole v'yored. In some cases, these rules will have to be overridden where there is a different Masoretic division, which occurs in some poetry.
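The cantillation-based rule can be sketched as a function that closes a segment after every word carrying one of the dividing marks. This is an illustration under stated assumptions: the marks are identified by their Unicode code points, the input is already split into words with sof pasuk attached to the final word, the names are hypothetical, and the sample uses placeholder Latin letters rather than a real verse. For the poetic books the sketch keys on the ole code point alone, which is a simplification of ole v'yored. Overrides for a different Masoretic division are not handled.

```python
ETNAHTA = "\u0591"      # HEBREW ACCENT ETNAHTA
ZAQEF_QATAN = "\u0594"  # HEBREW ACCENT ZAQEF QATAN
REVIA = "\u0597"        # HEBREW ACCENT REVIA
OLE = "\u05ab"          # HEBREW ACCENT OLE
SOF_PASUQ = "\u05c3"    # HEBREW PUNCTUATION SOF PASUQ

# Dividing marks for most of Tanach; REVIA can be added where wanted.
PROSE_BREAKS = {ETNAHTA, ZAQEF_QATAN, SOF_PASUQ}
# Dividing marks for Psalms, Job and Proverbs.
# Simplification: detects the ole only, not the following yored.
POETIC_BREAKS = {ETNAHTA, SOF_PASUQ, OLE}

def split_segments(words, breaks):
    """Group words into segments, ending a segment after any word
    that contains one of the marks in breaks."""
    segments, current = [], []
    for word in words:
        current.append(word)
        if any(mark in word for mark in breaks):
            segments.append(current)
            current = []
    if current:  # trailing words with no dividing mark
        segments.append(current)
    return segments

# Schematic sample: placeholder "words" with marks appended.
sample = ["a", "b" + ETNAHTA, "c" + REVIA, "d", "e" + SOF_PASUQ]
print(len(split_segments(sample, PROSE_BREAKS)))
print(len(split_segments(sample, PROSE_BREAKS | {REVIA})))
```

Treating revii as a divider yields one more, shorter segment in the sample, which is the trade-off behind the "possibly" in the rule.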

For other texts, arbitrary decisions will have to be made as to what constitutes a segment. One way to do this is to set a soft limit on the number of words per segment (4-5) and require that texts be broken up into segments of that size or smaller. Other rules will also be required for situations such as lists (think of the paragraph אמת ויציב), and to decide when textual parallelism should keep words in the same segment and when it should put them in different segments (some parts of ישתבח come to mind).
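Because the limit is soft, a tool would more plausibly flag long segments for an editor than split them automatically. The sketch below does that under assumptions of its own: the limit is taken as 5 words, a tei:div stands in for the repository element, and the function and helper names are hypothetical.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"
SOFT_LIMIT = 5  # assumption: upper end of the suggested 4-5 words

def segments_over_limit(repository, limit=SOFT_LIMIT):
    """Return (xml:id, word count) for each tei:seg with more than limit words."""
    ns = {"tei": TEI_NS}
    flagged = []
    for seg in repository.iter(f"{{{TEI_NS}}}seg"):
        count = len(seg.findall("tei:w", ns))
        if count > limit:
            flagged.append((seg.get(XML_ID), count))
    return flagged

def make_seg(parent, seg_id, n_words):
    """Helper for the sample only: add a tei:seg with n_words placeholder words."""
    seg = ET.SubElement(parent, f"{{{TEI_NS}}}seg", {XML_ID: seg_id})
    for i in range(n_words):
        ET.SubElement(seg, f"{{{TEI_NS}}}w").text = f"word{i + 1}"
    return seg

# A tei:div stands in for the repository element in this sample.
root = ET.Element(f"{{{TEI_NS}}}div")
make_seg(root, "short", 3)
make_seg(root, "long", 7)
print(segments_over_limit(root))
```

Flagged segments would still need editorial judgment, since lists and parallelism may justify exceptions in either direction.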

Repeated units of text (e.g., ברוך אתה ה׳) can also be stored once in the text repository as a separate segment and referenced as needed.

=Larger units=

Units of text larger than the segment must not be stored in the text repository.

=Special markup=

Certain words should have their own markup, in addition to the normal tei:w markup.

 * Occurrences of the tetragrammaton must be enclosed in a j:divineName element (which is syntactic sugar for tei:name[@type='divine']).

=Example=

The following is an example text repository from the Dayyenu song in the haggadah: