Hebrew projects

From DPCanadaWiki

Notes on projects in Hebrew or with significant excerpts in Hebrew.

Choosing content

While there is ample Jewish literature in Hebrew across the centuries, modern Hebrew is a 20th-century language. The copyright constraints (pre-1923 for posting on PG, death + 50 years for proofing on pgdpcanada) severely limit the works and genres available for projects on DP.

From the CP point of view, it is important to consider several aspects which affect the feasibility of a project in or with Hebrew. Proofers and formatters with language competence are so far few, hence the first factor is the composition of the candidate book. There are:

  • books entirely in Hebrew, inaccessible to proofers not proficient in the language. Most Jewish treatises of all centuries are certainly of this kind, but there is more beyond them. Religious, and in particular talmudic, literature can contain significant portions in Aramaic, which require further language competence from the proofers.
  • books in dual language, like prayer books (siddurim) or side-by-side Bibles with translation: project rules for dual parallel proofing may have to be considered for them. Such books very often use vocalized Hebrew.
  • books in a western language, with enough Hebrew words on the page to require dp-canada, but not disrupting the text flow of their main language. Nineteenth-century essays and apologetic and pedagogic books about Judaism are often of this kind.

An additional consideration is whether the text contains unvocalized or vocalized Hebrew (i.e. with diacritic signs, nikkud). While vocalized Hebrew is preferred by beginner readers of Hebrew, it poses many additional difficulties for OCR (see below), for keyboard input (see below) and for correct grammatical use. Large sections of vocalized Hebrew automatically classify the project as HARD, if not altogether as type-in. Prayer books, grammars, juvenile books and poetry are almost always vocalized.

A further level of typographical difficulty is the presence of masoretic cantillation signs (ta'amim), which are used exclusively in biblical text. What has been said about vocalization applies even more strongly if ta'amim are also present.

Hebrew is most commonly printed in typefaces which are just renderings of the "square" Hebrew writing. The rashi script is also traditionally used for commentaries on biblical and talmudic texts, as well as for sidenotes. While it is just a different typeface for the same Hebrew alphabet, not all Hebrew readers are comfortable with it, and some proofers may be discouraged by it.

Image sources

Besides the known ones, a very nice place to search for books is the Judaica Sammlung in der Frankfurter Universitätsbibliothek.

Sites dedicated to Hebrew books of various kinds:

It is generally unwise to start the project of a Hebrew book if the text is already accessible elsewhere. Here is a list of thematic sites, far from exhaustive; the Open Siddur Project's "environmental scan" has much more.

Preparing text

OCR

Though printed Hebrew is an alphabetic script with very well separated glyphs, only a few OCR packages are able to cope with it. One issue is right-to-left writing, another is the ability to cope with diacritics (currently only hocr, see below, recognizes nikkud), and a third is the recognition of text in mixed Hebrew and Latin alphabets.

ABBYY FineReader 10

ABBYY FineReader 10 (commercial) can be bought with a "middle-east" language pack, which allows scanning of unvocalized Hebrew, even mixed with Latin text. It seems completely incapable of dealing with nikkud (unless perhaps you train it with every possible combination of letter and diacritic, which would be very tedious and error-prone, and would lack a correction dictionary).

A statement from ABBYY:

ABBYY FineReader 10 can recognize Rashi script. As to nikkud, unfortunately, it is not supported, so the only way to try processing it -- as you correctly suggested -- is to use recognition with pattern training.

The feature of automatically identifying languages and page areas for each portion, and that of dealing with more languages at once, are definitely good ones, which are missing from all the public domain counterparts I've tried. But alas, that is a must for any professional OCR nowadays.

  • FR10 seems able even to pick up some rashi letters here and there when used out of the box. I made a small training experiment on another rashi paragraph, which gave me a hint of things that could be tweaked to work much better, with extra tedious work.

hocr

Kobi Zamir's hocr (open source, http://hocr.berlios.de) is the only OCR software I'm aware of that is capable of identifying nikkud with fair performance, though not ta'amim or Latin characters. Its development stopped in 2008, but some effort has still been devoted to maintaining the existing code and fixing bugs. For this reason it is highly recommended to compile it from the VERY LATEST sources (currently 0.10.17), found at http://code.google.com/p/hebocr/. A compiled version with GUI, called qhocr, is also found at http://code.google.com/p/qhocr/, but it is based on an earlier version.

tesseract

Tesseract is a popular and well-performing open source OCR package. For specifics about its DP usage see the PGDP wiki page on tesseract. Version 3.00 is distributed with a pretty useless trained data set for Hebrew. Better training data has been provided by users (see this thread, this one, this one and this one), but it only deals with unvocalized Hebrew without Latin characters. A major problem of tesseract as of the current version is that it outputs the recognized characters in left-to-right order. Its output has to be reversed (for instance with the unix command rev) to restore the correct reading order. Correction dictionaries for Hebrew are not available for tesseract either.
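Where the unix rev command is not at hand, the reversal can be sketched in a few lines of Python. This is only a minimal illustration, assuming the OCR output contains unvocalized Hebrew alone (no Latin words or numbers, which a blind reversal would scramble); the function name is made up for the example.

```python
def reverse_lines(text: str) -> str:
    """Reverse the characters of every line, like the unix `rev` command."""
    return "\n".join(line[::-1] for line in text.split("\n"))


# Hypothetical OCR output: the letters of "shalom" emitted in visual
# (left-to-right) order: final mem, vav, lamed, shin.
visual = "\u05dd\u05d5\u05dc\u05e9"
logical = reverse_lines(visual)  # shin, lamed, vav, final mem
```

Line breaks are preserved, so the page layout needed for proofing is kept.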

Other (commercial, aborted, dubious)

There are OCR packages written for Hebrew. yiddishocr is free but can only read BMP and TIFF files and only works on Windows machines.

ligature is not free (and not cheap) but does support other input formats including jpg and gif as well as direct from scanner. There is a free trial available for this software. There is supposedly free online OCR available on the site but it keeps timing out on me.

I have also found at least one commercial service in Israel claiming to be able to handle rashi and nikkud, and offering OCR services for a fee. Not yet evaluated.

Common scannos

  • Square hebrew font: Mem/Tet, Zain/Vav, Kaf/Bet, Vav/Nun sophi.
  • Scannos are much more frequent in rashi font mistakenly recognized as square.

Pre-processing

The usual preprocessing of the scanned text has to be done before uploading: de-hyphenation, correction of systematic scannos, quality-control spellcheck. Obviously, suitable text editors and spellcheckers supporting Hebrew and RTL writing have to be used (see External Editors below).

Text of Hebrew projects should be directly prepared in utf8.

For the formatting tags to work correctly in the current pgdpcanada proofreading interface, the PM should seriously consider adding a unicode RLO as the first character of each text page (see below, Adding tags and comments to RTL text, for the reason).
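Adding the RLO by hand to every page is tedious, so here is a minimal Python sketch of how a PM might do it in bulk. It assumes each page text is already loaded as a unicode string; the function name is hypothetical, and reading and writing the page files is left out.

```python
RLO = "\u202e"  # U+202E RIGHT-TO-LEFT OVERRIDE


def add_rlo(page_text: str) -> str:
    """Prepend an RLO to a page, unless the page already starts with one."""
    if page_text.startswith(RLO):
        return page_text
    return RLO + page_text
```

The check makes the function safe to run twice on the same page.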

Inputting hebrew text (for content preparation as well as proofreading)

Inputting Hebrew text is best done by installing a Hebrew (e.g. Israeli) keymap.

Hebrew fonts

Inputting diacritics (nikkud)

Resources on the net explaining how to input nikkud:

  • [1] might help people who don't have Hebrew key mapping.
  • With Hebrew keymaps: see [2]. In Ubuntu, I needed to go to keyboard preferences/layout/options and define a third level key; nikkud is then on the number keys.
  • For Windows, here are some instructions: [3]; [4]

A toolbar for easier manual insertion of Nikkud into Hebrew text using Word can be found here: [5]

Order of diacritics, letter forms

The standard order, as recommended by benyehuda.org, is: letter, shin-sin dot, dagesh, vowel.
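A minimal Python sketch of how this order could be enforced on already typed text follows. It assumes that only the shin/sin dots (U+05C1, U+05C2), dagesh (U+05BC) and the vowel points U+05B0 to U+05BB plus U+05C7 matter; meteg, rafe and ta'amim are left where they are and end the run of points being sorted. All names are made up for the example. Note that, as far as I know, standard unicode normalization sorts these marks differently (vowel, dagesh, shin/sin dot), so a normalizing tool run afterwards may undo this order.

```python
import re

SHIN_SIN_DOTS = {"\u05c1", "\u05c2"}  # shin dot, sin dot
DAGESH = "\u05bc"
# Vowel points U+05B0..U+05BB, plus qamats qatan U+05C7
VOWELS = {chr(cp) for cp in range(0x05B0, 0x05BC)} | {"\u05c7"}


def _rank(mark: str) -> int:
    """Sort key: shin/sin dot first, then dagesh, then vowel."""
    if mark in SHIN_SIN_DOTS:
        return 0
    if mark == DAGESH:
        return 1
    return 2


_MARKS = "".join(sorted(SHIN_SIN_DOTS | {DAGESH} | VOWELS))
# A Hebrew letter followed by one or more of the points handled here
_CLUSTER = re.compile("([\u05d0-\u05ea])([" + _MARKS + "]+)")


def reorder_points(text: str) -> str:
    """Rewrite every letter+points cluster as letter, dot, dagesh, vowel."""
    def fix(match):
        letter, marks = match.group(1), match.group(2)
        # sorted() is stable: marks of equal rank keep their typed order
        return letter + "".join(sorted(marks, key=_rank))
    return _CLUSTER.sub(fix, text)


# shin typed as: letter, qamats, dagesh, shin dot
typed = "\u05e9\u05b8\u05bc\u05c1"
fixed = reorder_points(typed)  # letter, shin dot, dagesh, qamats
```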

If the PM so requires, attention should be paid to the difference between kamatz katan and kamatz gadol.

In general the PM should advise against the use of wide Hebrew letters (unicode range U+FB21-U+FB28) and of letters with dagesh or incorporated dots (U+FB2A-U+FB4B). Such variants can be considered in PP if really needed.
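To check whether such characters have slipped into a page, something like the following Python sketch could be used. It only reports the characters in the two ranges given above and changes nothing; the function name is made up for the example.

```python
import unicodedata


def find_presentation_forms(text: str):
    """Return (index, 'U+XXXX', name) for every character in the ranges
    U+FB21-U+FB28 (wide letters) and U+FB2A-U+FB4B (letters with dots)."""
    hits = []
    for index, ch in enumerate(text):
        cp = ord(ch)
        if 0xFB21 <= cp <= 0xFB28 or 0xFB2A <= cp <= 0xFB4B:
            # Some code points in the second range are unassigned,
            # hence the default name.
            name = unicodedata.name(ch, "<unnamed>")
            hits.append((index, f"U+{cp:04X}", name))
    return hits


# A plain alef, then a wide alef, then a precomposed shin with shin dot
sample = "\u05d0\ufb21\ufb2a"
for index, code, name in find_presentation_forms(sample):
    print(index, code, name)
```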

The PM may suggest the use of ligatures (ײ yud-yud, aleph-lamed, etc., for which single unicode glyphs exist), though they too may be better set uniformly in PP than while proofreading.

The distinction between geresh/gershayim and apostrophe/double quotes should be maintained.

Inputting ta'amim

Note that only a few Hebrew fonts are complete to the point of including ta'amim. See the Open Siddur Project's and Mechon Mamre's discussions.

Hebrew punctuation

Classic Hebrew texts have their own punctuation signs. Since they are in the unicode table, it is recommended to preserve them.

Special signs include geresh (׳ U+05F3) and gershayim (״ U+05F4), which resemble respectively an apostrophe and a double quote, and occur in abbreviations.

Single or double upper quotes, as well as lower or inverted ones, may also be found in relatively modern Hebrew texts. In general they should not be confused with geresh and gershayim, and the PM should give special instructions about them if needed. Double closing quotes can be told apart from gershayim because quotes always occur at the end of a word, never in the middle of a group of letters as gershayim do.
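The "middle of a group of letters" rule lends itself to a rough automatic check. The Python sketch below replaces an ASCII double quote with gershayim only when it stands directly between two Hebrew letters. It is a heuristic under stated assumptions: unvocalized text (a nikkud mark next to the quote would hide it from the pattern), and a result that a human still reviews. The names are made up for the example.

```python
import re

GERSHAYIM = "\u05f4"
_HEBREW_LETTER = "[\u05d0-\u05ea]"
# An ASCII double quote with a Hebrew letter immediately on both sides
_MID_WORD_QUOTE = re.compile(
    "(?<=" + _HEBREW_LETTER + ')"(?=' + _HEBREW_LETTER + ")"
)


def suggest_gershayim(text: str) -> str:
    """Replace mid-word ASCII double quotes with U+05F4 gershayim."""
    return _MID_WORD_QUOTE.sub(GERSHAYIM, text)


# Acronym typed with an ASCII quote: tav, nun, ", final kaf
acronym = "\u05ea\u05e0\"\u05da"
# A whole word inside ordinary quotes is left alone
quoted = "\"\u05e9\u05dc\u05d5\u05dd\""

print(suggest_gershayim(acronym) != acronym)  # the quote was replaced
print(suggest_gershayim(quoted) == quoted)    # nothing changed
```

Geresh versus apostrophe is harder to tell apart mechanically, since both can follow the last letter of a word, so it is not attempted here.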

Other peculiar Hebrew punctuation signs are the sof pasuk ׃ U+05C3, normally denoting the end of a verse in the Bible, the paseq ׀ U+05C0, denoting a pause in the middle of a verse, and the middle dot (which has no specific Hebrew unicode). The latter may be proofed using the mid-dot from the drop-down menus in the proofing interface. For PP rendering, some other unicode blocks have similar mid-dot punctuation (e.g. U+16EB RUNIC SINGLE PUNCTUATION), but it is probably preferable to use more common dingbats, like U+2666 BLACK DIAMOND SUIT.

Modern Hebrew more commonly uses the standard punctuation of western languages, like period, comma, semicolon, colon, question mark and exclamation mark. However, geresh and gershayim are still part of the standard writing. Their standardized rules are:

  • abbreviations and acronyms of three letters or fewer are ended by a geresh (compare with English, which adds a dot after each letter or omits the dots altogether)
  • "soft" sounds of foreign languages, like "ch", "j", "th", are rendered with Tzadi-geresh, Gimel-geresh, Tav-geresh; analogously the Arabic rha may be transcribed with Resh-geresh, Hha with Het-geresh.
  • abbreviations and acronyms of more than three letters have a gershayim sign before the last letter.

Inputting and visualizing unicode directional marks

See Handling bidirectional text below for general explanations and references.

Lacking any other way, the necessary directional marks can be copy-pasted from a character map accessory. NB: while the Gnome Character Map displays any possible unicode entry, even if the system lacks a font and a glyph for representing the particular character (or control mark), the Windows character map displays ONLY the table entries present in the font chosen for visualization. Almost no MS font has ALL 7 unicode directional marks! In general these control characters are deliberately invisible, while for editing purposes it would help a lot to visualize them. The only font I'm aware of that includes glyphs for them (thin vertical bars with some kind of hooks suggesting their function) is the system Arial font on Macs. I think it should also be possible to configure the Firefox add-on abcTajpu with macros for inserting the relevant direction markers. This way, directional proofing could be done (blindly) entirely in Firefox.

See also Editors supporting and showing unicode bidi below.

DPCustomMono

For general proofreading, the DPCustomMono2 font is highly recommended. However, it does not include Hebrew glyphs (nor Greek, for instance). If an update were planned, it would be tremendously useful for it to include Hebrew characters. In particular, it should exaggerate the differences between similarly shaped letters like bet/kaf, vav/zayin/nun sophi, tet/mem, etc., not to mention enlarging the nikkud. Glyphs for visible rendering of the unicode directionals would also be invaluable.

Proofing and formatting guidelines

Ellipses

Spaced suspension dots (variable number) should be (tentatively) proofread as LOTE ellipses (close spaces, keep number of dots, attach them to the preceding word), unless the PM requests differently.

Special character markup

It is not uncommon to find letters with points or other markings above them, to highlight them and make them stand out (for instance for gematriah or acrostics). The PM should indicate how to treat them while proofreading, and possibly suggest which construct the PP could use to render them. So far we have met:

  • overdots: proof like [.א] . PP: an overdot; perhaps the unicode U+0597, Hebrew revia, could do the job.
  • a sort of double reversed primes, or grave accents over letters: proof like ["א] . PP: U+030F may be an acceptable rendering.
  • letters with no less than three dots above: proofed using U+0592 segolta.

Handling rashi

It is up to the PM to suggest a convention, e.g. mark rashi as <f> </f>

Transliterations

In general, if the project language is Hebrew, there should be no need to transliterate it into Latin characters. While it is acceptable to transliterate an occasional Hebrew word in a project which is written overall in Latin characters, thus producing the final etext in a limited character set like ascii or latin-1, transliterating an entire Hebrew text is awkward.

Asking proofers to transliterate Hebrew would have two major hindrances:

  • the transliteration system has to be agreed upon (and there are many)
  • if a readable, i.e. vocalized, transliteration is asked for, the proofers have to be able to read the unvocalized words and supply the correct vowels themselves. Maybe not everybody can.

For reference, anyway, there is a nice web transliteration tool at [6]

Adding tags and comments to RTL text

The tags should be entered in their "normal" order, like < i >...... < /i >, despite the fact that the enclosed text runs RTL.

In the final text version, the tags will be replaced by something that has no inherent order. So, for example, an italic string will be enclosed by underscores, a bold one by =, a < tb > will be replaced by a line of asterisks, and so on. In the HTML version, the tags will not "show through" to the final e-book appearance in the browser window, but will act in the background to create the desired effect.

The characters in the file stream are in a given order, from the beginning to the end, despite the fact that on screen they are rendered in alternating directions (following some unicode logic, in fact). Therefore we have to stick to opening the tag before the first letter of the tagged Hebrew word and closing it after the last letter, taking care that before is really before and after really after. If that does not show up nicely in the interface, however, it will be very confusing. Just inserting one tag manually with Latin letters, i.e. < i >, < g >, etc., breaks the alignment of the line, though inserting two at once may accidentally restore it.

A decent workaround found so far is to force the whole page to be RTL by inserting a U+202E RLO at the very beginning of the page. This magically fixes periods at the correct (left) side of the line and prevents tags containing Latin letters from upsetting the line; it just displays them reversed, as in <bt> and :etontooF].

A trick for inputting < g > and < f > which are not in the formatting interface: mark as < b > or < i > and do a global search/replace b->g.

Alternative tagging conventions

Some of them have been suggested to circumvent the disruption caused by mixed directionality of Hebrew text interspersed with latin tags, but their usefulness is debatable:

  • put each tag always on a new line
  • inventing new tags not containing letters, like <|> </|>

Looking at old Project Comments of Hebrew projects on dp-eu, I've not yet found a mention of the problem. Rather, they seem to have used the practice of tagging bolds with < B > < BB > and italics with < I > < II > (capital i) simply because of input convenience with the Israeli keyboard map (unshifted letter keys are Hebrew consonants, shifted are Latin uppercase). But perhaps that case differs from [ ] because < and > are strongly directional characters in unicode, whereas brackets are weak. If that is the case, <Footnote: 1> may work. It has been voiced:

I wouldn't change the [] to <> because 1) the formatters can add the [Footnote: ] tag by pressing a button but would have to type (and possibly mis-type) <Footnote: > and 2) if the PPer is using guiguts, it won't handle the footnotes correctly if the tag is changed. We already suspect guiguts isn't designed to automatically handle RTL text so flagging that for manual handling isn't a huge issue.

Spellchecking on dpcanada

SpellCheck utility at DPC is currently partially broken--we do not have access to WordCheck, which was developed later. So no GWL or BWL. In any event, no Hebrew spellchecker is available. SpellCheck should in general only be used to locate incorrect spellings (or rather suggest them)--under no circumstances allow SpellCheck to "correct" your proofing--it has a nasty habit of doubling some letters, and inserting incomprehensible strings in RED. Many experienced proofers doing P2 or P3 do their spellchecks off-line. A good alternative for spellcheck during online proofing is provided by firefox's dictionary highlighting.

Postprocessing

In addition to the standard requirements to the PPer:

The PPer should pay particular attention to the text directionality, and remove useless directional marks that may have inadvertently been added by proofers.
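As a starting point for that check, the Python sketch below lists the lines on which embedding or override marks (LRE, RLE, LRO, RLO) and their terminating PDF do not pair up. This is only a heuristic of mine, not an established PP procedure: an unterminated mark is legal unicode, so flagged lines are merely candidates for a closer look, and a deliberate RLO at the start of a page will be reported on line 1. The function name is hypothetical.

```python
OPENERS = {"\u202a", "\u202b", "\u202d", "\u202e"}  # LRE, RLE, LRO, RLO
PDF = "\u202c"  # POP DIRECTIONAL FORMATTING


def unbalanced_lines(text: str):
    """Return the 1-based numbers of lines where openers and PDFs
    do not pair up (stray PDF, or opener never closed on that line)."""
    suspicious = []
    for number, line in enumerate(text.split("\n"), start=1):
        depth = 0
        stray_pdf = False
        for ch in line:
            if ch in OPENERS:
                depth += 1
            elif ch == PDF:
                if depth == 0:
                    stray_pdf = True
                else:
                    depth -= 1
        if stray_pdf or depth != 0:
            suspicious.append(number)
    return suspicious
```

Each line is checked on its own because a newline ends a unicode "paragraph", and with it any embedding still open.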

It makes no sense to produce ascii or latin-1 versions of Hebrew etexts. The ISO-8859-8 encoding could be used, but would be reductive and support no diacritics. Utf8 is the preferred production encoding.
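The limitation of ISO-8859-8 is easy to demonstrate with Python's built-in codec of that name. A minimal sketch, with a made-up function name:

```python
def fits_iso_8859_8(text: str) -> bool:
    """True if the text can be encoded in ISO-8859-8 without loss."""
    try:
        text.encode("iso8859_8")
        return True
    except UnicodeEncodeError:
        return False


unvocalized = "\u05e9\u05dc\u05d5\u05dd"  # shin, lamed, vav, final mem
vocalized = "\u05e9\u05b8"                # shin with qamats

print(fits_iso_8859_8(unvocalized))  # True
print(fits_iso_8859_8(vocalized))    # False: nikkud has no slot in this encoding
```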

Postprocessing in RST: currently (epubmaker 0.3.16) the pdf output of Hebrew text produces words (whereas the other formats render correctly). To be investigated.

To be completed as experience is accumulated.

Handling bidirectional text

In the following, "Hebrew" stands for any right-to-left language and "English" for any left-to-right one. The usage is conversational, since that was one of the first cases that occurred on pgdpcanada, but the discussion is intended as generic.

Explanations of unicode directionality

Unicode accounts for the directionality (Left-To-Right or Right-To-Left) of the text in many ways. Characters themselves can have a strong or weak directionality, and there are no less than 7 unicode control characters affecting the direction of the text (LRO RLO LRE RLE LRM RLM PDF). Some good explanations are at [7] and [8], but there is certainly much more available (google for LRE RLE LRM etc.). In general these control characters are deliberately invisible, while for editing purposes it would help a lot to visualize them. The only font I'm aware of that includes glyphs for them (thin vertical bars with some kind of hooks suggesting their function) is the system Arial font on Macs. See "Inputting and visualizing unicode directional marks" above and "Editors" below.
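When no editor or font shows the marks, a page can be dumped with the seven control characters spelled out. A minimal Python sketch; the bracketed labels are just a convention chosen for the example:

```python
BIDI_MARKS = {
    "\u200e": "LRM",  # left-to-right mark
    "\u200f": "RLM",  # right-to-left mark
    "\u202a": "LRE",  # left-to-right embedding
    "\u202b": "RLE",  # right-to-left embedding
    "\u202c": "PDF",  # pop directional formatting
    "\u202d": "LRO",  # left-to-right override
    "\u202e": "RLO",  # right-to-left override
}


def show_marks(text: str) -> str:
    """Replace each invisible directional mark by a visible [NAME] label."""
    return "".join(
        "[" + BIDI_MARKS[ch] + "]" if ch in BIDI_MARKS else ch
        for ch in text
    )


print(show_marks("English \u202b\u05d0\u05d1\u05d2\u202c english"))
```

The output is for inspection only; the labels must not be pasted back into the page.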

Common problems: mixed bidi text

Unicode directionality works per paragraph, where a "paragraph" is ended by a newline. In the proofreading interface, every line is thus a "paragraph". With unicode, a paragraph has an implicit direction if it begins with characters with a strong directionality, so lines all in Hebrew or all in English pose no problem. Problems begin when English and Hebrew text are mixed in the same line, and the effect may be very confusing for the unaware proofreader. Such cases may be present from the beginning in the page text, or may be triggered by adding a [**] comment or a formatting tag. Words in either language may suddenly break lines and push chunks to the far left or the far right, punctuation may "stubbornly" go to the opposite side of the word than intended, and so on. The use of proper directional unicode characters may become unavoidable in some situations.

Unfortunately the proofreading interface lacks a) an easy way of inserting directional markers (I usually copy-paste them blindly from a character map), and b) a way to make them visible. These could be wished for in a future revision of the code.

It has to be borne in mind that in the proofreading interface a "paragraph" (i.e. a line) implicitly LTR starts at the leftmost margin of the window, while a line implicitly RTL starts at the far right margin.

Mixed bidi within a single proofing line

In many of the cases, thanks to the bidi behavior built into unicode, the proofer may have to do nothing special. To my understanding so far, patterns found in books are like:

     English text english text HEBREW_WORD english english<newline>

which requires no directional char. The paragraph is interpreted as LTR because it begins with a strong LTR letter; HEBREW is seen as embedded.

     [LRM](1) HEBREW english english english<newline>

the initial LRM is required because otherwise the line, which begins with a weakly directional character, would be interpreted as RTL, since its first strongly directional character is RTL

     English [RLE]HEBREW, HEBREW (punctuation) HEBREW[PDF] english<newline>

requires [RLE] ... [PDF] so that the inner punctuation follows temporarily RTL. With numbers, instead, one has to use

    English [RLE]HEBREW[PDF] number

otherwise the number is considered appended to the Hebrew text and follows it on its left.
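The "first strong character" rule behind the LRM case above can be tried out with Python's unicodedata module. This is a simplified sketch: it looks only at the first strongly directional character of a line, ignores the finer points of the unicode bidi algorithm, and falls back to LTR when a line has no strong character at all (an assumption; the interface may decide otherwise). The function name is made up.

```python
import unicodedata


def implicit_direction(line: str) -> str:
    """Guess the implicit direction of a line from its first strong character."""
    for ch in line:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":            # strong left-to-right (includes LRM)
            return "LTR"
        if bidi_class in ("R", "AL"):    # strong right-to-left (includes RLM)
            return "RTL"
    return "LTR"  # assumed default when no strong character is present


hebrew_word = "\u05d0\u05d1\u05d2"
print(implicit_direction("(1) " + hebrew_word + " english"))        # RTL
print(implicit_direction("\u200e(1) " + hebrew_word + " english"))  # LTR
print(implicit_direction("English " + hebrew_word + " english"))    # LTR
```

The parenthesis, the digit and the space are all weak or neutral, which is why the Hebrew word decides the direction in the first example.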

We assume that the OCR inserted no directional marks in the first place, and that we only have to insert the minimal marks in the few cases where they are needed. If we could visualize the marks, we would understand better.

Mixed bidi text split across lines

Sentences like

   English english english HEBREW<newline>
   HEBREW english english

may not even need special treatment, either in proofing or in postprocessing, thanks to unicode rules. Once the lines are joined by replacing the newline with a space, the sequence of the Hebrew text is preserved.

In some cases of this sort, though, the original typesetter of the book may have made mistakes in distributing the embedded Hebrew words across lines. PP attention would be recommended, and could be called for with [**] notes. The decision to move words may be analogous to that, also the responsibility of the PP, of moving paragraphs to join a continued footnote, rather than to the automatic rule of joining a hyphenated word. I would say that the proofers should keep the line division as it appears on the page; I fear that it would be difficult to impose an unambiguous proofreading directive otherwise.

Wishes for new features in the dp interface

  • the PM can tag each page of the project as RTL or LTR.
  • when a page is RTL, buttons insert tags as [LRE]<tag>[PDF] (do these have to be removed afterwards in PP?).
  • directional characters are made visible with special glyphs and highlights, irrespective of the font used.
  • buttons for inserting < g > and < f > tags.

Editors supporting and showing unicode bidi

  • TextEdit on Mac OS X is perfectly ok, and the Mac Arial font has visual glyphs for the unicode directionals.
  • emacs: there was some discussion not too long ago on emacs-bidi (http://lists.gnu.org/archive/html/emacs-bidi/2010-08/msg00011.html). I use emacs, but haven't yet figured out whether something usable resulted from that thread.
  • gedit comes quite near: it has a right-click menu for inserting the directional unicode characters, and its cursor changes shape when placed at an embedding boundary, highlighting the matching embed end. Unfortunately it doesn't display the directional characters. Those can however be deleted blindly with backspace. Gedit also has a display-spaces plugin, but that doesn't make them visible. Time permitting, I'll try to look into that plugin to see if it can be modified, or talk to its developer.
  • yudit, which shows the cursor direction and has 3 buttons for controlling directionality.
  • katoob, available on ubuntu
  • geresh (untested)
  • vi and derivatives show the unicode characters they can't render as <200e> etc. This is at least a way of making them visible.
  • Notepad++ actually seems to be completely ignoring LRM/RLM markers, since the cursor behaves the same within both Roman and Hebrew sections.