Transcription Rules

From Open Siddur Project Development Wiki

First and foremost, when transcribing text make sure that the text corresponds exactly to what you see in the document image.

RULE #1: TRANSCRIBE EXACTLY WHAT YOU SEE. (NEVER TRANSCRIBE "FROM MEMORY.")

There are a lot of minor textual variants. We need to keep track of which ones come from which historical editions of any particular text, especially when transcribing Jewish liturgy.

All of the English transcriptions begin with text that was automatically recognized from the image by optical character recognition (OCR). The primary job of a transcriber of English text is to correct the OCR-ed text to conform to the text in the original page scan and to follow some simple conventions. For Hebrew transcriptions, we have not yet found OCR software that works reliably enough with voweled text to be usable. The transcriber must therefore type the on-screen text into the wiki window.

Besides this overarching rule, there are some additional conventions we use to prepare transcriptions for encoding. While we suggest reading all of these rules before beginning your transcription, you can also consult these rules as you transcribe.

A completed example of a transcription, that of JPS, p. 1049, is shown at Page:1917JPS-TranscriptionCompletedExample-English.jpg.

Tags and Other Formatting Rules

Aside from the text itself, we use a small number of tags, which tell our automated processing software how to structure or format the text. Most tags are enclosed in curly braces and use the following syntax:

{tag parameters text}.  

(We use curly braces because they are very rarely found in the public domain books we are transcribing.)
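
As a rough illustration of this syntax (not the project's actual parser), the Python sketch below finds the top-level curly-brace tags in a piece of transcription and splits each into a tag name and the remainder. The function name `find_tags` is hypothetical. It assumes that braces are balanced and that a backslash escapes the next character, and it treats only the first word as the tag name, so two-word tags such as `bible block` would need extra handling.

```python
def find_tags(text):
    """Return (name, body) pairs for each top-level {tag ...} in text.

    Assumptions (not taken from the real parser): braces are balanced,
    a backslash escapes the following character, and nested tags are
    left inside the body of their enclosing tag.
    """
    tags = []
    depth = 0
    start = None
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "\\":  # skip the backslash and the character it escapes
            i += 2
            continue
        if ch == "{":
            if depth == 0:
                start = i + 1
            depth += 1
        elif ch == "}" and depth > 0:
            depth -= 1
            if depth == 0:
                inner = text[start:i].strip()
                name, _, body = inner.partition(" ")
                tags.append((name, body.strip()))
        i += 1
    return tags


print(find_tags("days of {fr. a} Joshua the son of Nun unto"))
```

Scanning character by character, rather than using a single regular expression, keeps nested tags such as a `{he ...}` tag inside a `{section ...}` tag together in one body.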

Below is a summary list of the tags to use for indicating the structure and format of transcribed text. More detailed instructions for each tag are included below.

When writing tags, be careful about the form of the tag and its proper positioning. If a space is shown in the examples, use a space. If not, do not. The automated parsers are not tolerant of errors in the form of tags.

Quick reference

Page breaks:

{p. Number} a page number
{cont} used at the beginning of a paragraph to indicate that a paragraph continues from the previous page

Paragraphs: Leave a blank line between paragraphs (i.e., press "enter" twice).

File breaks:

{file "File title" "DatabaseFilename"} The following text goes into a new file

Sections:

{poetry} sets following text as poetry
{prose} sets following text as prose (the default)
{section Text}

Remarks:

{rem comment} to make a personal comment to other transcribers

Credit:

{contrib role "Name1" ... } to credit a particular transcriber or proofreader at the beginning of a page

Instructions:

{instruct text} instructional text

Internal Reference tags:

{bible "book chapter:verse1-verse2" } bible reference
{bible block "book chapter:verse1-verse2"} an extended biblical quote as its own paragraph
{named "UniqueName"} to name a portion of text
{include "#UniqueName"} to include an already named portion by reference
{include block "#UniqueName"} to include an already named portion as its own paragraph
{ref "File.xml#NamedReference" "Text"} to make a reference link from the text to the given named reference.

Note formatting tags:

{fr. Symbol} reference to a footnote within the text
{fn. Symbol. Text} footnote text
{fp. Number} Page break inside footnote
{note Content of note} Inline notation/comment

Incidental Hebrew:

{he טקסט בעברית} Hebrew text found in an English or other translation

Corrections:

{sic "Incorrect text" "Correct text"} misprints or typos

Emphasis (in born-digital texts):

{emph Text} emphasize text

Incidental transliteration:

{translit LanguageCode "Text as written"}
{translit LanguageCode "Text as written" "Voweled Hebrew"}

Divine Name:

{dn Text}

Formatting (only for transcribed texts, not used in originally digital texts):

$text/$  small text
~text/~  large text
|text/|  indented text
<text>   small caps text
~hr      horizontal line

Escape Character:

\$, \/, \~, \<, \>, \|, \{, \}   type control characters after a \ (backslash)
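
A minimal sketch of the escaping rule, in Python. The helper name `escape_literal` is hypothetical, and the character set is an assumption: every character this document uses as a marker ($, /, ~, |, <, >, { and }). Whether a literal backslash itself needs escaping is not stated here, so the sketch leaves backslashes alone.

```python
# Characters assumed to act as control characters in a transcription.
CONTROL_CHARS = set("$/~|<>{}")


def escape_literal(text):
    """Put a backslash before every control character in text."""
    return "".join("\\" + ch if ch in CONTROL_CHARS else ch for ch in text)


print(escape_literal("cost: $5 {approx}"))
```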

More detailed tag usage and formatting rules are below.

Tag details

Breaking files

The file tag is used to break up a text into multiple files. The content after a file tag is included in a file with the given name. The file tag's syntax is:

{file "Title" "Filename"}

The Title is a human-readable title of the whole file. The Filename is a filename that describes the section (no spaces, please!). Note that the filename should be considered a convenience for the encoder; the database resource will not necessarily be identified by the same filename.

When files should be split

Files contain easily identifiable units of information. In many cases, a single paragraph, or a small group of paragraphs that will likely be identified together, is an independent file.

Examples:

  • Each blessing in the Amidah should be contained in its own file.
  • The Amidah in itself is one file, but it should include its content by reference.
  • Kiddush, while encompassing more than one paragraph, is an independent unit and should be contained in one file. It may indeed reference other files, each containing other independent units (Blessing over wine, Kiddush paragraph, Havdalah for festivals, etc.)

Working with file contents

Each file only contains what is between its starting file tag and the next file tag. In many cases, that will be insufficient to represent what you want. For instance, Shacharit will contain an Amidah, which will in turn contain individual blessings. In that case, to indicate that one contains the other, you must use include (or include block) tags in the containing file. For example:

{file "Morning Prayers" "Shacharit.xml" }
... Other includes or text goes here ...
{include block "Amidah.xml" }
... Other includes or text goes here ...

{file "Standing Prayer" "Amidah.xml" }
... Other includes or text goes here ...
{include block "Avot.xml" }
... Other includes or text goes here ... 

{file "Patriarchs" "Avot.xml" }
... Text of the first blessing goes here ...

Note: to include only part of a file instead of the whole thing, see the documentation on the include tag below.
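
The containment rule can be sketched in Python as follows. This is an illustration under stated assumptions, not the project's encoder: the helper names `split_files` and `includes_of` are hypothetical, and the regular expressions assume that titles and filenames never contain double quotes.

```python
import re

FILE_TAG = re.compile(r'\{file\s+"([^"]*)"\s+"([^"]*)"\s*\}')
INCLUDE_TAG = re.compile(r'\{include(?:\s+block)?\s+"([^"]*)"\s*\}')


def split_files(text):
    """Map each filename to the text between its file tag and the next one."""
    files = {}
    matches = list(FILE_TAG.finditer(text))
    for i, match in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        files[match.group(2)] = text[match.end():end].strip()
    return files


def includes_of(body):
    """List the targets named by include and include block tags in a body."""
    return INCLUDE_TAG.findall(body)


sample = '''{file "Morning Prayers" "Shacharit.xml" }
{include block "Amidah.xml" }

{file "Standing Prayer" "Amidah.xml" }
{include block "Avot.xml" }

{file "Patriarchs" "Avot.xml" }
Text of the first blessing
'''
for name, body in split_files(sample).items():
    print(name, "->", includes_of(body))
```

Each file holds only its own text plus references, so the Shacharit file lists the Amidah as an include rather than containing its blessings directly.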

Header

Do not transcribe the header line on top of the page.

Page numbers

While transcribing printed works, begin each page with the code {p. N} on its own line (followed by an enter press), where N is the page number printed on the scanned page. For example:

{p. 1049}

The page number is the number from the original book. For example, in the 1917 JPS, it is usually at the bottom of the page. It is not the scan sequence number, which is the number in the page title on the wiki.

See below for special issues concerning paragraphs that continue across page breaks.

Paragraphs

At the end of each typographic paragraph, insert a blank line (the equivalent of pressing enter twice).

If a paragraph continues across a column, do not leave any additional space at the column break.

Line breaks

It is not necessary to preserve the original line breaks, unless the original line breaks have semantic meaning, such as the ends of poetic lines.

In prose texts, removal of line breaks is not required, except for the break in hyphenated words (see Broken words and hyphenation, below). In poetic texts, where line breaks are significant (the entire book of Psalms, for example), both retention of significant line breaks and removal of insignificant line breaks are required.

Em-dashes

Long dashes (—) should be converted into two hyphens (--).

Broken words and hyphenation

It is not necessary to preserve broken words and hyphenation. The lines should be joined together and the hyphen removed.
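
For OCR-ed prose, the em-dash and hyphenation conventions could be applied mechanically before proofreading. The Python sketch below is one possible approach; the helper name `join_prose_lines` is hypothetical. It cannot tell a word broken by the typesetter from a genuinely hyphenated compound that happens to fall at a line end, so the result still has to be checked against the page scan.

```python
import re


def join_prose_lines(text):
    """Apply two conventions to prose text.

    1. A word split by a hyphen at a line end is rejoined and the hyphen dropped.
    2. Em-dashes become two hyphens.
    Other line breaks are left alone, since removing them is optional in prose.
    """
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    return text.replace("\u2014", "--")


print(join_prose_lines("the trans-\ncription is done\u2014finally"))
```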

Special case: Hyphenation across a page break

If a word is hyphenated across a page break, do not break up the word. Leave the entire unbroken word on the page where it begins.

There is one exception to this rule: If the word has a footnote, which is printed on the second page, the word, the footnote reference, and the footnote should all be on the second page in their entirety.

Special issues for paragraphs that continue across page breaks

Make sure to check the preceding page.

If a new page begins with a paragraph that began on the previous page:

  • For transcribing pages in books with Hebrew text, a {cont} continuation tag must be included at the beginning of the text.
  • For transcribing non-Hebrew books (e.g. the 1917 JPS):
    • if the first word on the page begins with a lowercase letter, no special action is necessary.
    • If the first word on the page begins with an uppercase letter or the beginning of a new verse, insert a {cont} (continuation) tag just before the first text in the paragraph.

In other words, if the current page continues a paragraph from the previous page and begins with a capital letter or a new verse, add a {cont} (continuation) tag immediately before the first text on your page, and before any verse division commands.
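
The decision can be written out as a small Python function. This is only a restatement of the rule for illustration; the function name and its parameters are hypothetical, and the transcriber still has to judge by eye whether the page really continues a paragraph.

```python
def needs_cont(continues_paragraph, has_hebrew_text, first_word, starts_new_verse=False):
    """Decide whether a page should open with a {cont} tag.

    continues_paragraph: the page begins inside a paragraph from the previous page.
    has_hebrew_text: the book being transcribed contains Hebrew text.
    first_word: the first word on the page.
    starts_new_verse: the page begins with a new verse.
    """
    if not continues_paragraph:
        return False
    if has_hebrew_text:
        return True
    return starts_new_verse or first_word[:1].isupper()


print(needs_cont(True, False, "and"))
print(needs_cont(True, False, "So"))
```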

Hyphenation

If the current page ends with a hyphenated word:

  • copy the remainder of the word to your page, and
  • do not leave a hyphenated word cut off between pages.

The only exception to this rule is when the dangling word has a footnote whose text is on the following page. If it does, remove the word from your page and copy the beginning of the word and its footnote reference to the next page.

Example:

{p. 967}
{cont} /2:3 As an apple-tree among the trees of the wood, 
So is my beloved among the sons. 
Under its shadow I delighted to sit, 
And its fruit was sweet to my taste.

Guide words

Many books put guide words in the last line of a page that indicate the first word in the next column or on the next page. Do not transcribe guide words. In this page, the two words on the bottom line are guide words.

Misprints and Typos

If you see a typographical error or misprint in the original text, mark it with `{sic "Incorrect text" "Correct text"}`. Do not correct or regularize spelling (e.g., do not change American spelling to British spelling or vice versa) using sic. If you're not sure if a word is misspelled or using archaic spelling, look it up in a dictionary. sic should only be used to mark errors that were most likely not deliberate on the part of the original editor. For example, if the original text says:

The rain continued alll day.

then, you should mark it:

The rain continued {sic "alll" "all"} day.

Note that only misprints and typographical errors should be indicated as sic. Transcription is not the stage to impose corrections on the text from outside sources.
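
Because a sic tag records both readings, later processing can recover either the text as printed or the corrected text. The Python sketch below shows the idea; the helper name `render_sic` is hypothetical, and the pattern assumes that neither reading contains a double quote.

```python
import re

SIC_TAG = re.compile(r'\{sic\s+"([^"]*)"\s+"([^"]*)"\s*\}')


def render_sic(text, corrected=True):
    """Replace each sic tag with its corrected (or original) reading."""
    group = 2 if corrected else 1
    return SIC_TAG.sub(lambda match: match.group(group), text)


line = 'The rain continued {sic "alll" "all"} day.'
print(render_sic(line))                   # The rain continued all day.
print(render_sic(line, corrected=False))  # The rain continued alll day.
```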

Ligatures

Ligatures are typographical single characters that represent more than one letter. They are a typographical feature only, and require no special transcription.


Poetry and prose

If a section of text has a poetic layout (e.g., piyyutim), insert a {poetry} tag at the beginning of the text.

Inside the poetic text, insert line breaks at the end of each line, as printed in the book.

At the end of the poem, insert a {prose} tag.

Section headings

Section headings must be demarcated by a {section title} tag, where title is the name of the section as given in the text.

For example:

{section תפילת שחרית לשבת}

Remarks

Remarks are notes to other transcribers or encoders. They contain only human-readable text, and their content is ignored by automated processing software.

A remark can be added by inserting a rem tag on its own line. You may take up as many lines as you need. Example:

{rem This section has some Hebrew in it and I don't know how to type in Hebrew, can someone else do it? }

Instructions and Directions

Instructional text forms a special type of inline note, and is inserted in-place using the instruct tag:

{instruct Turn right, then left, .... }
Remaining text here

Biblical References

If the work you are transcribing contains extended quotes from the bible (and is not in itself a bible!), there is no need to retype the text. If the biblical quote occurs inside a paragraph, use the bible tag. For example, to insert Psalms 1:1 in-place:

{bible "Psalms 1:1"}

If the biblical quote occurs as its own paragraph, use the bible block tag. To insert the entirety of Genesis 1:

{bible block "Genesis 1"}

The bible block tags also allow ranges, and common book abbreviations:

{bible block "Genesis 1:1-6"}
{bible block "Ex. 15:1-16:2"}

Book names should be in English. Chapter and verse numbers follow the Westminster Leningrad Codex.

One contiguous biblical range is allowed in each bible command.
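
To make the reference format concrete, the Python sketch below splits a reference string into a book name and the start and end of its range. It is a hypothetical helper, not the project's parser: it accepts only the forms shown in the examples above (a whole chapter, a single verse, or a verse range) and does not check book names or abbreviations against any list.

```python
import re

# chapter, optional :verse, optional -[chapter:]verse
RANGE = re.compile(r"(\d+)(?::(\d+)(?:-(?:(\d+):)?(\d+))?)?")


def parse_bible_ref(ref):
    """Split 'Book c:v-c:v' into (book, (start_ch, start_v), (end_ch, end_v)).

    A missing verse is returned as None, as in a whole-chapter reference.
    """
    book, _, span = ref.strip().rpartition(" ")
    match = RANGE.fullmatch(span)
    if not book or match is None:
        raise ValueError(f"unrecognized reference: {ref!r}")
    ch1, v1, ch2, v2 = match.groups()
    start = (int(ch1), int(v1) if v1 else None)
    if v2 is None:
        end = start
    else:
        end = (int(ch2) if ch2 else int(ch1), int(v2))
    return book, start, end


for ref in ["Psalms 1:1", "Genesis 1", "Genesis 1:1-6", "Ex. 15:1-16:2"]:
    print(parse_bible_ref(ref))
```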

Liturgical References and Refrains

If a section of text contains a repeated section, such as a poetic refrain, the refrain should be enclosed in a named tag as follows:

{named "Name" ... }

The Name must:

  • Be enclosed in " (double quote) characters
  • Be one word with no spaces. You may use CamelCase to join words, as in `LechaDodiRefrain`.
  • Begin with a letter. It may contain numbers, but it may not begin with a number.
  • Be unique within the file that will contain it. Two refrains may not have the same name.

The text is then entered as usual, as in the following example:

{named "LechaDodiRefrain" לְכָה דּוֹדִי לִקְרַאת כַּלָּה. פְּנֵי שַׁבָּת נְקַבְּלָה׃}

In order to reference the named section, insert an {include "#Name" } or {include block "#Name" } tag. Use include tags for sections that should be included within the same paragraph. Use include block tags for sections that should be their own paragraphs. You must include the # character before the name! In the example above, the refrain is included by typing:

{include block "#LechaDodiRefrain"}
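
The naming rules above lend themselves to a mechanical check. The Python sketch below is one way to do it; the helper name `check_names` is hypothetical, and it assumes that names are limited to ASCII letters and digits, which is stricter than the rules strictly require.

```python
import re

# One word that starts with a letter, followed by letters or digits.
NAME = re.compile(r"[A-Za-z][A-Za-z0-9]*")


def check_names(names):
    """Return a list of problems found in the names used within one file."""
    problems = []
    seen = set()
    for name in names:
        if NAME.fullmatch(name) is None:
            problems.append(f"invalid name: {name!r}")
        elif name in seen:
            problems.append(f"duplicate name: {name!r}")
        seen.add(name)
    return problems


print(check_names(["LechaDodiRefrain", "2ndRefrain", "Lecha Dodi", "LechaDodiRefrain"]))
```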

If the same poem spans many wiki pages, it is good practice to enter a comment, i.e., rem tag ({rem ...}) at the top of the following pages saying that they should be encoded together with the previous page.

Referencing Text from Elsewhere in the Siddur

You do not have to retype sections of text that have already been transcribed. If you know that a section of text has already been transcribed completely, you may reference it inside an include tag.

To do so, you need to know the URI of the block to include. (How to find that URI is not yet documented here.) With the URI in hand, the tags take this form:

{include "Nehemiah001.xml#c1v23"}
{include block "Nehemiah001.xml"}

Assigning Credit

If the transcription comes from work done by someone other than the person logged in, credit must be assigned. Credit is assigned by adding a contrib tag just before a page tag (usually, this occurs at the very beginning of a wiki page). Transcriber credit is automatically assigned to logged-in users who edit a wiki page. However, if you are editing an STML file manually and not importing it through the wiki, you must add contrib tags manually before each page. Place the contrib tags immediately before the file commands only if the file comes from born-digital data that is not broken up by page.

The format of a contrib tag is:

{contrib role "Name" }

where role describes why credit is being assigned. It may be any of:

  • author - the credit is being given for original authorship of the content
  • editor - the credit is being assigned for previous editing
  • transcriber - the credit is being assigned to a transcriber other than the person logged in

The Name (which must be enclosed in double quotes, as shown) is the contributor's WikiName, which also serves as the contributor's identifier. If the contributor does not have a WikiName on our wiki, he/she may be assigned an identifier. To search for existing contributors' names, see (INSERT LINK HERE) the contributor list viewer. To add a new name, use the contributor list editor.

The contrib tag is used to assign credit for aspects of our workflow. It is not used to hold bibliographic information (author, editor) about the work.

If more than one person is being assigned credit with the same role on the same block of text, include them all in the same contrib tag:

{contrib transcriber "Efraim.feinstein" "Aharon"}
{p. 5}

assigns both users Efraim.feinstein and Aharon as transcribers of page 5.

If a single wiki page covers more than one page of transcription broken by a page break tag (see Page numbers, above), the contrib tag must be repeated immediately before each subsequent page break tag.
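
As an illustration of the tag's shape, the Python sketch below reads a contrib tag into a role and a list of names. The helper name `parse_contrib` is hypothetical. It accepts only the three roles listed above and assumes that names contain no double quotes.

```python
import re

ROLES = {"author", "editor", "transcriber"}
CONTRIB_TAG = re.compile(r'\{contrib\s+(\w+)((?:\s+"[^"]*")+)\s*\}')


def parse_contrib(tag):
    """Return (role, [names]) for a contrib tag, or raise ValueError."""
    match = CONTRIB_TAG.fullmatch(tag.strip())
    if match is None:
        raise ValueError(f"not a contrib tag: {tag!r}")
    role = match.group(1)
    if role not in ROLES:
        raise ValueError(f"unknown role: {role!r}")
    names = re.findall(r'"([^"]*)"', match.group(2))
    return role, names


print(parse_contrib('{contrib transcriber "Efraim.feinstein" "Aharon"}'))
```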

Footnote references

When you see a reference to a footnote in the text, include a footnote reference using the `{fr. N}` tag, where N is the number or symbol used in the text.

A line with a footnote reference is written as in this example:

days of {fr. a} Joshua the son of Nun unto

Footnotes

Place footnotes where they appear at the bottom of the page. Include the footnote in a `{fn. N. text}` tag, where N is the same symbol used in the footnote reference. In the example page, the footnote is entered as follows:

{fn. a. Heb. <I>Jeshua.</I>}

Note the use of the I tag for italics.
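
A simple consistency check can catch a footnote reference with no footnote, or the reverse. The Python sketch below is a hypothetical helper rather than part of the project's tooling. It assumes the symbol in a footnote tag is followed by a period, as in the example above, and it looks at one page at a time, so a reference whose footnote is printed on the following page will be reported and needs a manual look.

```python
import re

FOOTNOTE_REF = re.compile(r"\{fr\.\s+([^}\s]+)\s*\}")
FOOTNOTE = re.compile(r"\{fn\.\s+([^.\s}]+)\.")


def unmatched_footnotes(page_text):
    """Return (references without a footnote, footnotes without a reference)."""
    refs = set(FOOTNOTE_REF.findall(page_text))
    notes = set(FOOTNOTE.findall(page_text))
    return sorted(refs - notes), sorted(notes - refs)


# Sample page: the second reference (b) deliberately has no footnote.
page = (
    "days of {fr. a} Joshua the son of Nun unto {fr. b} that day\n"
    "\n"
    "{fn. a. Heb. <I>Jeshua.</I>}"
)
print(unmatched_footnotes(page))  # (['b'], [])
```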

Footnotes that cross pages

If a footnote crosses a page boundary, insert an `{fp. N}` (footnote page break) tag at the position of the page break. The same rules for hyphenation apply as for normal page breaks: if a word is hyphenated across a page break, leave the word unhyphenated on the page before the break.

Commentary notations

Commentary notations are marked with the `{note text}` tag. The note references the following block of text (verse, paragraph, or line group).

{note This prayer is attributed to Rabbi Yitzchak Luria, 16th Century c.e.}

The note tag is only used for non-instructional inline notations in the original text. Out of line notations should be marked up using the footnote syntax described above. Instructional notations should be marked up using the instruct tag, also described above.

Special Formatting

Emphasis

Emphasized text is formatted as follows:

{emph Emphasized text}

The emph tag is preferred over all other formatting tags if the formatting is clearly intended for emphasis. It is not to be used as a substitute for the italics command.

Italics

Italicized text is formatted as follows:

<I>Italicized text</I>

Small font

If the text is in a smaller font:

$Small font text/$

Big font

If the text is in a larger font:

~Big font text/~

Indented text

If the text is indented:

|indented text/|

(The vertical line is the "pipe character" and it is located as shift+\ on a US keyboard).

Small caps

Small capital text is formatted as:

<Small caps>

The one exception to this rule is the word LORD or GOD capitalized in the 1917 JPS.


Translations

Below are a number of conventions to follow when specifically transcribing translated works, for example, Singer's Siddur or the 1917 JPS TaNaKh.

Canonical Book Titles

New book names should be separated from the preceding text by two lines. The Hebrew title should be enclosed in a `{he}` tag and placed on its own line. The text that follows should be separated by another two lines. In all other respects, book titles are treated as section titles. The following shows the title of the book of Nehemiah (on Page:38neh_0001-English.jpg):

them had wives by whom they had children. 
 
{section {he נחמיה}
NEHEMIAH}

/1:1 The words of Nehemiah the son of Hacaliah

Verse numbers

At the beginning of each verse, include the chapter and verse number preceded by a slash '/'. See, for example, the first two verses of the Book of Ruth:

/1:1 AND it came to pass in the days when the judges judged, that there was
a famine in the land. And a certain man of Beth-lehem in Judah went to
sojourn in the field of Moab, he, and his wife, and his two sons. /1:2 And
the name of the man was Elimelech, and the name of his wife Naomi, and the
name of his two sons Mahlon and Chilion, Ephrathites of Beth-lehem in Judah.
And they came into the field of Moab, and continued there.

"/1:2" means "Chapter 1, verse 2."

The /chapter:verse number should be followed by a single space.
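
The verse markers make it straightforward to recover individual verses from a transcription. The Python sketch below shows one way; the helper name `split_verses` is hypothetical. It assumes every marker is followed by a single space, as required above, and it discards any text before the first marker, such as the tail of a verse continued from the previous page.

```python
import re

VERSE_MARK = re.compile(r"/(\d+):(\d+) ")


def split_verses(text):
    """Return [((chapter, verse), verse_text), ...] in order of appearance."""
    # re.split with two groups yields: [before, ch, v, text, ch, v, text, ...]
    parts = VERSE_MARK.split(text)
    verses = []
    for i in range(1, len(parts), 3):
        chapter, verse, body = parts[i], parts[i + 1], parts[i + 2]
        # Collapse line breaks and repeated spaces inside the verse text.
        verses.append(((int(chapter), int(verse)), " ".join(body.split())))
    return verses


sample = (
    "/1:1 AND it came to pass in the days when the judges judged, that there was\n"
    "a famine in the land. /1:2 And\n"
    "the name of the man was Elimelech"
)
for reference, body in split_verses(sample):
    print(reference, body)
```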

Note: For odd numbered pages, the bold number at the top right of the header is the chapter and verse of the last verse contained on the page. For even numbered pages, the bold number at the top left of the header is the chapter and verse of the first verse contained on the page. (These numbers are useful for figuring out the chapter and verse where your transcription begins, but they should not be transcribed.)

Chapter numbers

It is not necessary to copy the large number indicating a new chapter division. Simply begin a new paragraph (press enter twice) and start the verse numbering for the new chapter.

Small capitals

When the original text is formatted in small caps (the word "Lord" in the 1917 JPS), do not indicate the formatting. Just transcribe it in CAPITAL LETTERS, as in "LORD."

Incidental Hebrew Words

This rule is specific to English texts where Hebrew only appears incidentally, e.g. the 1917 JPS. Incidental Hebrew text is included inside `{he HEBREW TEXT }`. These will usually be book titles (see above).

{he טקסט בעברית}

Note that if Hebrew and English appear on the same line, the text box may display the characters in the wrong order instead of in the way you typed them. This does not affect the quality of the transcription. It is a quirk of the way mixed right-to-left and left-to-right texts are displayed by default. Because of it, it may be helpful to transcribe incidental Hebrew after completing the English transcription. Leaving a searchable marker in the text, such as {he ***}, makes it easy to find (with ctrl-f or cmd-f) where the missing Hebrew should be filled in.

Incidental Transliterated Words

If words (other than the Divine Name, see below) are incidentally transliterated in the text that is being transcribed, they should be entered using the `translit` command.

{translit LanguageCode "Text"}

The LanguageCode is the ISO 639-x language code for the transliteration. For Hebrew transliterated into English characters, the language code is he-Latn.

If the correctly and fully voweled Hebrew text of the word is known, you should use the second form of the translit command:

{translit LanguageCode "Text" "Hebrew text"}

Full example:

{translit he-Latn "Teshuvah" "תְּשׁוּבָה"}

Quotations

The first verse on the page should be transcribed as follows:

people, said unto all the people: `This day is holy unto the LORD your God; mourn not, nor weep.'

Note the use of the backtick (`) as an open quote mark, and the apostrophe (') as a close quote mark.

The Divine Name

Words representing the Divine Name (such as Lord, Adonai, YHVH) should be entered using the `dn` tag, as in the following example:

{dn Lord}

Proofread Your Work

Although your work will be reviewed in a proofreading stage, you can help greatly by proofreading your work before submitting it. Proofread your own text and correct any errors you see.

Help Improve this Documentation

If you don't understand how to do something or it took you a long time to figure out how to do it, it's probably insufficiently (or incorrectly) documented. If you know how to correct it, edit and correct the documentation. If you're not sure, ask the mailing list or file a bug report.

