Encoding tutorial

From Open Siddur Project Development Wiki

WARNING: DO NOT FOLLOW THIS PROCESS YET. THIS TUTORIAL IS INCOMPLETE AND INACCURATE

This encoding tutorial takes you through the process of encoding a text that has been transcribed and proofread on the wiki, according to the instructions in the English-language transcription guidelines.

The procedure is somewhat technical, but anyone who has already worked through the Intro to hacking should be able to follow it. We assume that you have already downloaded a copy of the source code. Unless otherwise stated, all relative paths are relative to trunk.

Assembling the transcribed text

The first step involves assembling the transcription from the wiki. Do that using the script lib/assemble-wiki-transcriptions.py.

The script takes the following parameters. Unless a parameter is marked optional, supply it on every run:

--output=filename
 Path and name of the text file
--contributors=filename
 Path and name of a text file that will contain a list of all contributors' Wiki names
--server=url (optional)
 URL of the wiki server.  Defaults to wiki.jewishliturgy.org and probably should not be changed
--base=strings
 comma separated list of beginnings of the scan page sequence names
--digits=numbers
 comma separated list of number of zero-padded digits in the scan sequence numbers
--first=numbers
 comma separated list of first scan sequence numbers for each book base
--last=numbers
 comma separated list of last scan sequence numbers for each book base
--extension=strings
 comma separated list of ends of the scan page sequence names
--contribtags
-t
 whether to include the added {contrib} tags to identify contributors (this should nearly always be used!)

For example, to assemble the transcription of the book of Nehemiah from the wiki, note that the wiki page names for Nehemiah are written as

http://wiki.jewishliturgy.org/Page:38neh_0001-English.jpg
http://wiki.jewishliturgy.org/Page:38neh_0002-English.jpg
...
http://wiki.jewishliturgy.org/Page:38neh_0019-English.jpg

In the parlance of the script:

  • The base is therefore 38neh_.
  • Each sequence number is 4 digits long.
  • The first in the sequence is 1, the last is 19.
  • The extension is -English.

You may therefore run the script as:

lib/assemble-wiki-transcriptions.py \
 --output=Nehemiah.raw.txt \
 --contributors=contrib.txt \
 --base=38neh_ -d 4 --first=1 --last=19 \
 --extension="-English" \
 -t

To assemble the entire book of Psalms (whose pages are split across two separate bases) into Psalms.raw.txt:

lib/assemble-wiki-transcriptions.py \
 --output=Psalms.raw.txt \
 --contributors=contrib.txt \
 --base=27psa-a_,28psa-b_ -d 4,4 --first=1,1 --last=53,53 \
 --extension="-English","-English" \
 -t
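
To see how the comma-separated arguments line up, the following Python sketch expands them into wiki page names. It is an illustration only, not the code of lib/assemble-wiki-transcriptions.py: the function name page_names is hypothetical, and the "Page:" prefix and ".jpg" suffix are assumptions inferred from the example URLs above.

```python
def page_names(base, digits, first, last, extension):
    """Expand comma-separated argument strings into a list of page names.

    Each position in the comma-separated lists describes one book base.
    The "Page:" prefix and ".jpg" suffix are assumed from the example URLs.
    """
    bases = base.split(",")
    widths = [int(d) for d in digits.split(",")]
    firsts = [int(n) for n in first.split(",")]
    lasts = [int(n) for n in last.split(",")]
    extensions = extension.split(",")

    names = []
    for b, width, lo, hi, ext in zip(bases, widths, firsts, lasts, extensions):
        # Sequence numbers are inclusive at both ends and zero-padded.
        for seq in range(lo, hi + 1):
            names.append(f"Page:{b}{seq:0{width}d}{ext}.jpg")
    return names


if __name__ == "__main__":
    for name in page_names("38neh_", "4", "1", "3", "-English"):
        print(name)
```

Under these assumptions, each base contributes (last - first + 1) pages, and the pages of the first base precede those of the second.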


Entering contributor information

Each contributor must be listed in the global contributor list.

NO INTERFACE EXISTS YET FOR EDITING THIS FROM OFFLINE.

Make each contributor's ID identical to that contributor's wiki ID!

Entering bibliographic information

Before the text can be entered into the database, its bibliographic information must be in the global bibliography.

NO INTERFACE EXISTS YET FOR ENTERING BIBLIOGRAPHIC INFORMATION.

Running the pre-encoder (Hebrew texts only)

Certain typing conventions and common mistakes need to be corrected... Our transcription procedures are generally tolerant of these common errors or simplifications.

Inserting file breaks

File break commands are used to:

  • Break up a single transcription into multiple output XML files
  • Set the file name and title of each file. Because the file name and title are set in file commands, a {file} command must be added even to transcriptions that contain the content of only one XML file.

File break commands are inserted on their own lines, and they appear as:

{file "Title" "FileName.xml"}

The quote characters (") are required. If a page break occurs just before the file break, the file command should appear before the page break and before any associated {contrib} commands.
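
Before running the converter, it can help to check that every file command is well formed. The following Python sketch lists the commands it recognizes. It is a checking aid only, not part of the project's tools: the function name find_file_commands is hypothetical, and the tolerance for extra whitespace between the parts is an assumption.

```python
import re

# A file command on its own line: {file "Title" "FileName.xml"}
# Both quoted parts are required. Allowing extra spaces is an assumption.
FILE_COMMAND = re.compile(r'^\{file\s+"([^"]*)"\s+"([^"]*)"\}\s*$')


def find_file_commands(text):
    """Return (line_number, title, file_name) for each well-formed command."""
    found = []
    for number, line in enumerate(text.splitlines(), start=1):
        match = FILE_COMMAND.match(line)
        if match:
            found.append((number, match.group(1), match.group(2)))
    return found


if __name__ == "__main__":
    sample = 'Some text\n{file "Psalm 1" "Psalm1.xml"}\nMore text\n'
    for number, title, file_name in find_file_commands(sample):
        print(number, title, file_name)
```

A line that starts with {file but is missing from the result most likely lacks one of the required quote characters.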

Running the raw text converter

Before running the text converter, it may be necessary to increase your Java stack size (which defaults to 400K) to 4MB, by adding -Xss4m to the default Java options:

export JAVAOPTIONS="-Xss4m"

The text converter is located in code/input-converters/rawtext/rawtext.xsl2. It accepts the following parameters:

input-filename=FILE  
 where FILE is a path to the wiki transcription as edited in the previous steps.  
 By default, the path is relative to the stylesheet.  To avoid confusion, it is recommended that 
 you pass filenames as absolute paths. The lib/absolutize script can be used to convert relative paths 
 to absolute paths.
default-language=LANG
 set the default language of the file's text. Use "he" for Hebrew, "arc" for Aramaic, or "en" for English (default).
bibl-pointer=ID
 the bibliographic key in the global bibliography of the book that the text came from.
conditional-name=ID
 a prefix for conditionals.  May be the same as bibl-pointer
facsimile-prefix=STRING
facsimile-digits=NUMBER
facsimile-mult=NUMBER (default 1)
facsimile-offset=NUMBER (default 0)
facsimile-extension=STRING
 How to assemble a URL that points to the scans.  The URL is assembled as:
 {facsimile-prefix}{(page number) * (facsimile-mult) + (facsimile-offset)  zero padded to facsimile-digits}{facsimile-extension}
 See below for an example.
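
The URL formula can be sketched in Python as follows. This illustrates the arithmetic only and is not the stylesheet's implementation: the function name facsimile_url is hypothetical, the page number 778 in the usage example is chosen purely for illustration, and rounding a fractional result down (relevant when facsimile-mult is not a whole number) is an assumption.

```python
import math


def facsimile_url(page, prefix, digits, extension, mult=1, offset=0):
    """Assemble a scan URL from a page number and the facsimile-* parameters."""
    # Scan number = page * mult + offset. Rounding down is an assumption;
    # the tutorial does not say how fractional results are handled.
    scan = math.floor(page * mult + offset)
    # Zero-pad the scan number to the requested number of digits.
    return f"{prefix}{scan:0{digits}d}{extension}"


if __name__ == "__main__":
    print(facsimile_url(
        778,  # hypothetical page number
        "http://jewishliturgy.org/base/sources/JPS-The_Holy_Scriptures/27psa-a/27psa-a_",
        4,
        ".jpg",
        offset=-777,
    ))
```

With an offset of -777 and a multiplier of 1, page 778 maps to scan number 1, which is padded to 0001.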

Known facsimile parameters:

1917 JPS
 Bibliographic key: JPS1917
 facsimile-prefix: "http://jewishliturgy.org/base/sources/JPS-The_Holy_Scriptures/Book code_"
 facsimile-digits: 4
 facsimile-mult: 1
 facsimile-offset: -First page of book
 facsimile-extension: ".jpg"
Seder Avodat Yisrael
 Bibliographic key: Baer1901
 facsimile-prefix: "http://jewishliturgy.org/base/sources/Baer-Seder_Avodat_Yisrael/"
 facsimile-digits: 6
 facsimile-mult: 0.5
 facsimile-offset: 14
 facsimile-extension: ".jpg"

Assuming the edited transcription is in Psalms.pre.txt, for the example of the 1917 JPS Psalms, given above, the converter command is:

lib/saxon -it main ../code/input-conversion/rawtext/rawtext.xsl2 \
 input-filename=`../lib/absolutize Psalms.pre.txt` \
 bibl-pointer=JPS1917 \
 conditional-name=JPS1917 \
 facsimile-prefix="http://jewishliturgy.org/base/sources/JPS-The_Holy_Scriptures/27psa-a/27psa-a_" \
 facsimile-digits=4 \
 facsimile-offset=-777 \
 facsimile-extension=".jpg"

Manually checking the result

Uploading to the database

Article license

The Contributors to the Open Siddur Project, the copyright holder of this work, has published or hereby publishes it under the following licenses:
This work is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 3 of the License, or any later version. This work is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See version 3 of the GNU General Public License for more details.

Creative Commons license
This file is licensed under the Creative Commons Attribution ShareAlike 3.0 License. In short: you are free to share and make derivative works of the file under the conditions that you appropriately attribute it, and that you distribute it only under a license identical to this one.

You may select the license of your choice.