Scanning
From the Open Siddur Project Development Wiki
Scanning is the process of imaging printed texts and saving them as digital image files. The Open Siddur only processes image files of scanned siddurim that are either in the Public Domain or contributed to the Open Siddur under a compatible free culture license. After scanning, these image files are transcribed and proofread. After proofreading, the text is encoded in an XML form that carries high-level structural data about the text. The encoded form of the text is then stored in our XML database.
The Open Siddur Project needs help scanning siddurim representing these nusḥaot. Some ideas for which books might be good to scan are being collected in this growing siddur wishlist.
A Short Scanning Tutorial
What to scan
In general, it is best to scan the whole book. Even if the whole book is not scanned, the title page and copyright page must be scanned.
Scanned Image Format
The Open Siddur Project needs scanned images of text that can be transcribed easily. At too low a resolution, it is nearly impossible to make out the many diacritical marks that indicate vowels and other significant distinctions in words and letters.
A good scan is made at a resolution where every character in the smallest text on the page can be identified unambiguously when the image is displayed at 100% zoom.
For these reasons, scanned images should be high resolution: 300 dpi (dots per inch).
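As a rough check on the 300 dpi guideline, the pixel dimensions of a scan should be about the page size in inches multiplied by 300. The sketch below does that arithmetic in plain sh. The function name pixels_for and the 5.5 in x 8.5 in page size are illustrative assumptions, not project conventions.

```shell
#!/bin/sh
# Expected pixel count along one page dimension at a given resolution.
# Arguments: length in TENTHS of an inch (8.5 in is written 85), then dpi.
# Tenths keep the arithmetic in integers, which is all POSIX sh provides.
# pixels_for is a hypothetical helper name, not part of any project tool.
pixels_for() {
    tenths=$1
    dpi=${2:-300}   # default to the 300 dpi recommended above
    echo $(( tenths * dpi / 10 ))
}

# A hypothetical 5.5 in x 8.5 in page scanned at 300 dpi:
echo "$(pixels_for 55) x $(pixels_for 85) pixels"
# prints: 1650 x 2550 pixels
```

If ImageMagick is installed, identify -format '%w x %h' 5.jpg reports the actual pixel dimensions of a scan for comparison.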
Naming files
It is best to name files in a sequence where each page is contained in one file, and the filename is the page number followed by the extension. For example, 5.jpg for page 5, or 10.jpg for page 10.
If page numbers are repeated in the book (for example, when parts 1 and 2 of the same volume each begin with a page numbered 1), split the files between directories named part1/ and part2/.
If the page numbers are in Hebrew on one side of the page and in Arabic numerals on the other, or are in Hebrew only, convert the numbers to Arabic numerals. For example, a page numbered כה becomes 25.jpg.
If the page numbers are in Roman numerals, as frequently happens in introductory sections or prefaces, name the file by the Roman numeral exactly as it is written in the book. For example, v.jpg (if it is written in lowercase) or V.jpg (if it is written in uppercase).
The title page may be called title.jpg. If the front and back covers are scanned, they may be called frontcover.jpg and backcover.jpg, respectively. Leaves may be called frontleaf.jpg and backleaf.jpg.
If, for some reason, you cannot follow these conventions, then send us the scans with each file numbered in sequence (1.jpg, 2.jpg, 3.jpg, etc.).
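Scanner software often produces its own filenames, so a short script can help apply the numbering above. The following is a minimal sketch, not a project tool. It assumes the scanner output sorts lexically in page order (for example, zero-padded names such as scan_0001.jpg), and the function name number_pages and the directory names are hypothetical. It copies rather than renames, so the originals stay untouched.

```shell
#!/bin/sh
# Copy every .jpg in SRC to DEST as START.jpg, START+1.jpg, ... in glob
# (lexical) order. Assumes lexical order of the source names is page order.
# number_pages is a hypothetical helper name.
number_pages() {
    src=$1
    dest=$2
    n=${3:-1}                      # first page number, default 1
    mkdir -p "$dest" || return 1
    for f in "$src"/*.jpg; do
        [ -e "$f" ] || continue    # skip the literal pattern if SRC has no .jpg files
        cp -- "$f" "$dest/$n.jpg" || return 1
        n=$(( n + 1 ))
    done
}

# Example with hypothetical directories:
#   number_pages raw_scans part1 1
```

Pages with Roman numerals, the title page, and the covers would still need to be named by hand.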
Temporary Storage for Uploading Image Collections
After you've scanned a book, you're left with hundreds of image files. An entire collection can easily exceed 250 MB, far more than can be sent by email. Free services like Box.net offer temporary storage and distribution of compressed batches of scanned images.
After you've scanned a book,
- compress the collection in your preferred compression format (zip, tar.gz, rar),
- upload it to an account at Box.net (or another similar service), and
- contact us with a link where we can download the file.
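For the compression step, one possibility is tar with gzip, which yields the tar.gz format mentioned above. This is only a sketch: the function name pack_scans and the directory name in the usage comment are hypothetical, and zip or rar would serve equally well.

```shell
#!/bin/sh
# Pack a directory of scans into DIR.tar.gz, written next to the directory.
# pack_scans is a hypothetical helper name.
pack_scans() {
    dir=${1%/}    # drop a trailing slash, if any
    # -C changes into the parent directory so the archive holds only
    # the directory's own name, not the full path leading to it.
    tar -czf "$dir.tar.gz" -C "$(dirname "$dir")" "$(basename "$dir")"
}

# Example with a hypothetical directory name:
#   pack_scans seder-avodat-yisrael
# The archive contents can be listed with:
#   tar -tzf seder-avodat-yisrael.tar.gz
```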
Wiki Image Format
Scans uploaded to the wiki should be no larger than 1 MB per page. If resizing is required, resize proportionally from the original and save the result in JPEG format. The following ImageMagick command seems to produce images of high enough quality for transcription when starting from a 300 dpi original.
Here is the ImageMagick command for processing a single scanned image for transcription.
convert ORIGINAL_FILENAME -resize 50% -quality 50 -colors 256 FILENAME.jpg
For batch processing, follow these steps:
- make sure that you have ImageMagick installed.
- have all your scanned images in a single directory.
- make a sub-directory to receive your processed images, e.g., "small"
- try the following command:
for i in *.jpg; do convert "$i" -resize 50% -quality 50 -colors 256 "./small/$i"; done
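After batch processing, it may be worth confirming that every processed page meets the 1 MB limit. The sketch below lists any that do not. It assumes the processed images live in a directory such as "small", and it takes 1 MB to mean 1048576 bytes, which is an interpretation rather than a stated project rule. The function name oversized is hypothetical.

```shell
#!/bin/sh
# List processed images larger than 1 MB (taken here as 1048576 bytes).
# oversized is a hypothetical helper name; the directory defaults to "small".
oversized() {
    dir=${1:-small}
    # The "c" suffix makes find count bytes; "+" means strictly greater than.
    find "$dir" -type f -name '*.jpg' -size +1048576c
}

# Example:
#   oversized small
# No output means every page is within the limit.
```

Any file that is listed could be processed again with a lower -quality value or a smaller -resize percentage.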
Uploading Scans
Those uploading multiple files can use this upload form (Special:MultipleUpload).
The filename format for uploaded files in an RTL language like Hebrew is:
Book-Name_PaddingZerosPageNumber.jpg, for example, Seder-Avodat-Yisroel_00025.jpg
For an LTR language like English, the filename format is slightly different. Add the language at the end after a dash, like so:
Book-Name_PaddingZerosPageNumber-Language.jpg, for example, Singer-Siddur_00025-English.jpg
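The two filename formats above can be generated with printf. This is a minimal sketch: the padding width of five digits is inferred from the examples (00025) rather than stated as a rule, and the function name upload_name is hypothetical.

```shell
#!/bin/sh
# Build a wiki upload filename from a book name, a page number, and an
# optional language suffix. upload_name is a hypothetical helper name.
# The five-digit padding is inferred from the examples above.
# Pass the page number without leading zeros: printf would read a
# value such as 025 as octal.
upload_name() {
    book=$1
    page=$2
    lang=$3
    if [ -n "$lang" ]; then
        printf '%s_%05d-%s.jpg\n' "$book" "$page" "$lang"
    else
        printf '%s_%05d.jpg\n' "$book" "$page"
    fi
}

upload_name Seder-Avodat-Yisroel 25      # prints Seder-Avodat-Yisroel_00025.jpg
upload_name Singer-Siddur 25 English     # prints Singer-Siddur_00025-English.jpg
```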
Uploading multiple files can be tricky. For batch uploads of hundreds of files, we have some scripts that can help. Contact us for help.
Status of Scans
A comprehensive table indicating the status of scanning the siddurim in our wishlist is located here.
Active scans: Seder Avodat Yisrael uploaded, in transcription interface. Siddur Torah Ohr uploaded.
Credit for Scanning
The Open Siddur Project will credit all contributors of high-quality scans for their scanning.
Getting additional help and providing feedback
Our processes in the Open Siddur Project are all about feedback. Expect plenty of bugs, inconsistencies, and inadequate explanations as we strive to make the first-ever free and open repository of Siddur texts available. Please (please!) send any bug reports or feature requests to the issue tracker. Even "I got here, read the docs, and still have no clue what to do" is helpful feedback. Just tell us what you did, and we'll try to correct the problem. If you have any questions or comments, don't hesitate to post to the Discussion List. (You will need a Google account to post to the issue tracker, and you need to join the Google group to post to the Discussion List.)
