Encoding tutorial

'''WARNING: DO NOT FOLLOW THIS PROCESS YET. THIS TUTORIAL IS INCOMPLETE AND INACCURATE'''

This encoding tutorial takes you through the process of encoding a text that has been transcribed and proofread on the wiki, according to the instructions in the English-language transcription guidelines.

The procedure is somewhat technical, but anyone who has already worked through the Intro to hacking should be able to follow it. We assume that you have already downloaded a copy of the source code. Unless otherwise stated, all relative paths are relative to trunk.

=Assembling the transcribed text=

The first step involves assembling the transcription from the wiki. Do that using the script lib/assemble-wiki-transcriptions.py.

The script takes the following parameters. All of them must be supplied on each run, except those marked optional:
 * --output=filename: path and name of the output text file
 * --contributors=filename: path and name of a text file that will contain a list of all contributors' wiki names
 * --server=url (optional): URL of the wiki server. Defaults to wiki.jewishliturgy.org and probably should not be changed
 * --base=strings: comma-separated list of beginnings of the scan page sequence names
 * --digits=numbers: comma-separated list of the number of zero-padded digits in the scan sequence numbers
 * --first=numbers: comma-separated list of first scan sequence numbers for each book base
 * --last=numbers: comma-separated list of last scan sequence numbers for each book base
 * --extension=strings: comma-separated list of ends of the scan page sequence names
 * --contribtags, -t: whether to include the added {contrib} tags to identify contributors (this should nearly always be used!)

For example, to assemble the transcription of the book of Nehemiah from the wiki, note that the wiki page names for Nehemiah are written as:
 http://wiki.jewishliturgy.org/Page:38neh_0001-English.jpg
 http://wiki.jewishliturgy.org/Page:38neh_0002-English.jpg
 ...
 http://wiki.jewishliturgy.org/Page:38neh_0019-English.jpg

In the parlance of the script:
 * The base is therefore 38neh_.
 * Each sequence number is 4 digits long.
 * The first in the sequence is 1, the last is 19.
 * The extension is -English.
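The way these four values combine into page names can be sketched in a few lines of Python. This is only an illustration, not the script's actual code: page_names is a hypothetical helper, and the "Page:" namespace and ".jpg" suffix are assumptions inferred from the Nehemiah URLs above.

```python
def page_names(base, digits, first, last, extension, suffix=".jpg"):
    """Return the wiki page names for one scan sequence.

    The "Page:" namespace and the ".jpg" suffix are inferred from the
    Nehemiah URLs in this tutorial; they are assumptions, not taken
    from assemble-wiki-transcriptions.py itself.
    """
    return [
        # Zero-pad the sequence number to the requested number of digits.
        f"Page:{base}{number:0{digits}d}{extension}{suffix}"
        for number in range(first, last + 1)
    ]


names = page_names("38neh_", 4, 1, 19, "-English")
print(names[0])    # Page:38neh_0001-English.jpg
print(names[-1])   # Page:38neh_0019-English.jpg
print(len(names))  # 19
```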

You may therefore run the script as:

 lib/assemble-wiki-transcriptions.py \
   --output=Nehemiah.raw.txt \
   --contributors=contrib.txt \
   --base=38neh_ -d 4 --first=1 --last=19 \
   --extension="-English" \
   -t

To assemble the entire book of Psalms (which is split across two scan sequences with separate bases) into Psalms.raw.txt:

 lib/assemble-wiki-transcriptions.py \
   --output=Psalms.raw.txt \
   --contributors=contrib.txt \
   --base=27psa-a_,28psa-b_ -d 4,4 --first=1,1 --last=53,53 \
   --extension="-English","-English" \
   -t
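In the Psalms command, the comma-separated options are parallel lists: the first entry of each option describes the first scan sequence, the second entry describes the second. The following Python sketch shows that pairing. It is an assumption about the intended meaning of the options, not the script's real parsing code, and split_sequences is a hypothetical name.

```python
def split_sequences(base, digits, first, last, extension):
    """Pair up comma-separated option values, one tuple per book base."""
    columns = [value.split(",") for value in (base, digits, first, last, extension)]
    # Every option must describe the same number of scan sequences.
    if len({len(column) for column in columns}) != 1:
        raise ValueError("every option needs the same number of comma-separated entries")
    bases, digit_counts, firsts, lasts, extensions = columns
    return [
        (b, int(d), int(f), int(l), e)
        for b, d, f, l, e in zip(bases, digit_counts, firsts, lasts, extensions)
    ]


# Values as the shell passes them for the Psalms example.
sequences = split_sequences(
    "27psa-a_,28psa-b_", "4,4", "1,1", "53,53", "-English,-English"
)
for sequence in sequences:
    print(sequence)
# ('27psa-a_', 4, 1, 53, '-English')
# ('28psa-b_', 4, 1, 53, '-English')
```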

=Entering contributor information=

Each contributor must be listed in the global contributor list.

NO INTERFACE EXISTS YET FOR EDITING THIS FROM OFFLINE.

Make each contributor's ID identical to their wiki ID!

=Entering bibliographic information=

Before the text can be entered into the database, its bibliographic information must be in the global bibliography.

NO INTERFACE EXISTS YET FOR ENTERING BIBLIOGRAPHIC INFORMATION.

=Run the pre-encoder (Hebrew texts only)=

Certain typing conventions and common mistakes need to be corrected... Our transcription procedures are generally tolerant of these common errors or simplifications.

=Inserting file breaks=

File break commands are used to:
 * Break up a single transcription into multiple output XML files
 * Set the file name and title of each output file. Because the file name and title are set in file commands, a {file} command should be added even to transcriptions that contain the content of only one XML file.

File break commands are inserted on their own lines, and they appear as:
 {file "Title" "FileName.xml"}

The quote characters (") are required. If a page break occurs just before the file break, the file command should appear before the page break and before any associated {contrib} commands.
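Before running the converter, it can help to check that every file break line is well formed. The following is a minimal Python sketch, assuming the two arguments are separated by whitespace and contain no embedded quote characters. parse_file_command is a hypothetical helper and is not part of the project's code.

```python
import re

# Matches a line of the form: {file "Title" "FileName.xml"}
# Assumption: neither argument contains an embedded quote character.
FILE_COMMAND = re.compile(r'^\{file\s+"([^"]*)"\s+"([^"]*)"\}$')


def parse_file_command(line):
    """Return (title, file_name) for a file break line, or None if malformed."""
    match = FILE_COMMAND.match(line.strip())
    return match.groups() if match else None


print(parse_file_command('{file "Psalms" "Psalms.xml"}'))  # ('Psalms', 'Psalms.xml')
print(parse_file_command('{file Psalms Psalms.xml}'))      # None
```

The second call returns None because the required quote characters are missing.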

=Running the raw text converter=

Before running the text converter, it may be necessary to increase your Java stack size (which defaults to 400K) to 4MB, by adding -Xss4m to the default Java options:
 export JAVAOPTIONS="-Xss4m"

The text converter is located in code/input-converters/rawtext/rawtext.xsl2. It accepts the following parameters:
 * input-filename=FILE: FILE is a path to the wiki transcription as edited in the previous steps. By default, the path is relative to the stylesheet. To avoid confusion, it is recommended that you pass filenames as absolute paths. The lib/absolutize script can be used to convert relative paths to absolute paths.
 * default-language=LANG: sets the default language of the file's text. Use "he" for Hebrew, "arc" for Aramaic, or "en" for English (default).
 * bibl-pointer=ID: the bibliographic key in the global bibliography of the book that the text came from.
 * conditional-name=ID: a prefix for conditionals. May be the same as bibl-pointer.
 * facsimile-prefix=STRING, facsimile-digits=NUMBER, facsimile-mult=NUMBER (default 1), facsimile-offset=NUMBER (default 0), facsimile-extension=STRING: how to assemble a URL that points to the scans.

The URL is assembled as:
 {facsimile-prefix}{(page number) * (facsimile-mult) + (facsimile-offset), zero padded to facsimile-digits}{facsimile-extension}
See below for an example.

Known facsimile parameters:

Assuming the edited transcription is in Psalms.pre.txt, for the example of the 1917 JPS Psalms, given above, the converter command is:
 lib/saxon -it main ../code/input-conversion/rawtext/rawtext.xsl2 \
   input-filename=`../lib/absolutize Psalms.pre.txt` \
   bibl-pointer=JPS1917 \
   conditional-name=JPS1917 \
   facsimile-prefix="http://jewishliturgy.org/base/sources/JPS-The_Holy_Scriptures/27psa-a/27psa-a_" \
   facsimile-digits=4 \
   facsimile-offset=-777 \
   facsimile-extension=".jpg"
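To see what the facsimile parameters in this command produce, the URL rule can be written out in Python. This is a sketch of the stated formula only, not the stylesheet's implementation. facsimile_url is a hypothetical helper, and the page number 778 is an invented value chosen to show the effect of the offset.

```python
def facsimile_url(page, prefix, digits, extension, mult=1, offset=0):
    """Assemble a scan URL as prefix + zero-padded number + extension.

    The number is (page * mult) + offset, following the rule given for
    the facsimile-* parameters. Results that come out negative are not
    handled specially here.
    """
    number = page * mult + offset
    return f"{prefix}{number:0{digits}d}{extension}"


# Parameter values from the 1917 JPS Psalms command; page 778 is hypothetical.
url = facsimile_url(
    778,
    prefix="http://jewishliturgy.org/base/sources/JPS-The_Holy_Scriptures/27psa-a/27psa-a_",
    digits=4,
    extension=".jpg",
    offset=-777,
)
print(url)
# http://jewishliturgy.org/base/sources/JPS-The_Holy_Scriptures/27psa-a/27psa-a_0001.jpg
```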

=Manually checking the result=

=Uploading to the database=

=Article license=