Distributed Proofreaders 44 titles preserved for the world!

WordCheck FAQ

Contents

General Questions

Proofer Questions

Project Manager Questions


General Questions

What's up with the new spellcheck interface?

The previous spellcheck interface had several areas that could be improved:

  • There was no way to specify project-level dictionaries or accepted word lists.
  • The list of suggestions for each misspelled word was very long, which increased the size of the returned HTML page, and it was frequently not used.
  • In the standard interface the spellcheck page did not show the page image.

To address these and other areas, the spellcheck code was revamped to add the following enhancements:

  • Instead of a drop-down box, Flagged words are displayed in a text box for direct editing.
  • The standard interface now shows the page image beside the spellcheck page for direct comparison to the original text.
  • Page text is still checked against the dictionaries for all project languages. In addition, the user can select an additional language to check the page against, which is useful if, for example, an English-only project has a page with a long quote in French.
  • Each project has 'good' and 'bad' word lists that are used when determining words to flag in the interface. Good words are words that are valid for the project even though they are not found in the dictionary. Such words will often include proper nouns of people or places used frequently. Good words can be thought of as a project-specific dictionary. Bad Words are words that should be flagged for a project even though they may be found in the dictionary. These words might include common project-specific stealth scannos. Both the Good and Bad Word Lists are managed by the Project Manager.
  • Misspelled words have an "Unflag All & Suggest" button next to them. The button is used to indicate that the word matches the image. Once it is clicked, all identically spelled words on the page are also accepted as correct. After a word has been modified, the Unflag All button for that word becomes disabled.
  • Words that are flagged by proofers as accepted via the Unflag All button are added to a file for review by the Project Manager. Commonly unflagged words can be added to the 'good' word list.

The new interface has been relabeled as WordCheck to identify the broader scope of the tool.

What are 'Good', 'Bad', and 'Flagged' words?

The WordCheck interface is designed to help proofers catch differences between the page image and the page text. Often when the OCR software identifies the word incorrectly the word becomes misspelled and can be caught by a spell checker. Other times the OCR software incorrectly identifies a word in the image but the resulting text is a valid word. These words are still wrong despite being valid words. The team has decided to use the Good/Bad nomenclature to better reflect the intent of the WordCheck interface - to help the proofer match the image and the text, rather than use an inaccurate label like 'misspelling'.

After WordCheck has processed words at the various levels it comes up with a final set of Bad words to present to the user for validation or correction. These words are called Flagged words as they have been flagged by the system for closer inspection.

Where do Flagged words come from?

Flagged words can come from a variety of sources. These sources originate from one of three levels:

  • World - misspellings as determined by an external spell-checker and dictionaries
  • Site - words identified by site administrators as common stealth scannos
  • Project - words specified by the project manager as valid (Good Words List) or possible stealth scannos (Bad Words List)

Each level takes precedence over the level before it. Words identified as Bad at the World level (by an external spell-checker) but listed as valid at the Project level (project Good words) will not be flagged. This gives the people closest to the text the most control over what is flagged: Project Managers can adjust the Good and Bad Words Lists at the Project level, site administrators can manage Bad Words commonly found as stealth scannos at the Site level, and spell-checkers and other external validators determine Bad Words at the World level.

Can you give me a simple example of how the levels work to flag words for the proofer to correct or accept?

To help illustrate how the WordCheck system works, consider the following pseudo-project.

  • Name: A Description of West Texas Towns
  • Languages: English
  • Good Words List: Lubbock Levelland Muleshoe Plainview Littlefield
  • Bad Words List: fiat

Now let's consider the following OCR'd text:

Lubbock is a town of many things: arid fiat 1and, grid-like roads, arid the infamous tumbleweed.

When a proofer chooses to run WordCheck on the text, WordCheck evaluates the text at three levels: World, Site, and Project. At each level, words are added to or removed from the Flagged word list to determine which words will be flagged in the page text for the proofer to evaluate. Here's an example of how the "flagging" process works, level by level.

World

Current list of Flagged words entering level: none

At the World level, the text is run through an external spell-checker (such as aspell) using the dictionaries of the project's Primary and Secondary (if specified) languages. In this case the text would be checked against the English dictionary. The results depend on the particulars of the spell-checker and dictionary, but let's assume that the following words are flagged as misspelled or Bad: Lubbock and tumbleweed

Current list of Flagged words leaving level: Lubbock tumbleweed

Site

Current list of Flagged words entering level: Lubbock tumbleweed

At the Site level, the text is checked for possible stealth scannos, that is, OCR errors that produce valid, correctly spelled words that are nevertheless wrong. In addition, words may be checked against a series of patterns that are frequently incorrect, such as a word containing both alphabetic and numeric characters. In the text above, the following would be flagged as Bad: arid (a common stealth scanno) and 1and (matches a suspicious pattern).

Current list of Flagged words leaving level: Lubbock tumbleweed arid 1and

Project

Current list of Flagged words entering level: Lubbock tumbleweed arid 1and

The Project level allows the Project Manager to have more control over which words are considered Good and Bad. At this level the Flagged words are compared to the project's Good Words List. Any words found on the project's Good Words List are assumed to be correct and are removed from the page's list of Flagged words. This would result in Lubbock being removed from the Flagged words for this page.

Also at this level, the text is compared against the project's Bad Words List. Any words in the text that are found on the project's Bad Words List are added to the list of Flagged words for this page. For this example, fiat is added to the list.

Current list of Flagged words leaving level: tumbleweed arid 1and fiat

The final list of Flagged words is presented to the proofer, who is prompted to correct or accept each one. The proofer might click the Unflag All button next to tumbleweed to mark it as valid for this page. The next time the Project Manager generates suggestions from the Accepted Words list, tumbleweed will show up for possible inclusion on the Good Words List.

Because arid is a Site-level Bad word (a stealth scanno in this case), it will not have an Unflag All button. This will force the proofer to look closely at all instances. In this situation the first instance of arid is correct while the second instance of the word is a scanno for the word and.
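
The level-by-level walk-through above can be sketched in a few lines of Python. This is only an illustrative sketch, not the actual WordCheck code: the function names (tokenize, add_unique, flag_words), the tokenizer regex, and the mixed letter/digit pattern are assumptions, and the World-level result is passed in as a fixed set standing in for whatever the external spell-checker would return.

```python
import re

# Hypothetical Site-level pattern: a word containing both letters and digits.
MIXED_ALNUM = re.compile(r"(?=.*[A-Za-z])(?=.*[0-9])")


def tokenize(text):
    """Split text into words, keeping internal hyphens and apostrophes."""
    return re.findall(r"[A-Za-z0-9]+(?:[-'][A-Za-z0-9]+)*", text)


def add_unique(flagged, word):
    """Append word to the flagged list unless it is already there."""
    if word not in flagged:
        flagged.append(word)


def flag_words(text, world_misspelled, site_bad, project_good, project_bad):
    words = tokenize(text)
    flagged = []

    # World: words the external spell-checker reported as misspelled.
    for word in words:
        if word in world_misspelled:
            add_unique(flagged, word)

    # Site: known stealth scannos plus suspicious patterns.
    for word in words:
        if word in site_bad or MIXED_ALNUM.match(word):
            add_unique(flagged, word)

    # Project: Good words are removed, then Bad words are added.
    flagged = [word for word in flagged if word not in project_good]
    for word in words:
        if word in project_bad:
            add_unique(flagged, word)

    return flagged


text = ("Lubbock is a town of many things: arid fiat 1and, "
        "grid-like roads, arid the infamous tumbleweed.")

result = flag_words(
    text,
    world_misspelled={"Lubbock", "tumbleweed"},  # assumed spell-checker output
    site_bad={"arid"},
    project_good={"Lubbock", "Levelland", "Muleshoe", "Plainview", "Littlefield"},
    project_bad={"fiat"},
)
print(result)  # ['tumbleweed', 'arid', '1and', 'fiat']
```

The output matches the list of Flagged words leaving the Project level in the example. Each word appears once in the list even if it occurs several times in the text, as arid does here.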

How does capitalization affect the word lists?

Good and Bad words are treated as exact matches and are therefore capitalization-specific; for example, "Lubbock" and "lubbock" are considered separate words.
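
As a small illustration of exact matching, assuming the word list is held as a plain set of strings (the helper name is_good is hypothetical):

```python
good_words = {"Lubbock"}


def is_good(word, good_words):
    """Exact, case-sensitive membership test; no lowercasing is applied."""
    return word in good_words


print(is_good("Lubbock", good_words))  # True
print(is_good("lubbock", good_words))  # False
```

In practice this means a Project Manager who wants both forms treated as Good would need to list both.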


Proofer Questions

Why should I use a spell-checker? I'm a good speller!

WordCheck does much more than simply check the text for misspelled words: it also helps detect scannos and other OCR errors. It is intended to flag words which are not in the dictionaries and Good Words Lists, because such words often arise where the OCR process has confused a letter or word with one that is visually similar. Since the wrong word looks so much like the right one, it is easy for a proofer to skip over it, "seeing" it as the correct word. The Unflag All button exists for the common case where the word has been correctly transcribed, but isn't in the dictionaries.

The spell checker is also used to flag words which are commonly incorrectly identified by OCR. The classic example is "arid" which is a perfectly good word, but is often a scanno for "and", a much more common word. Another example is "modem", which is very uncommon in books from before the 1960s, but can easily be a scanno for "modern".

The checker will attempt to flag these kinds of situations for the proofer's attention, so that the proofer can consider them carefully, and take proper action in each case.

Should I run WordCheck before or after I "manually" proof a page?

The answer to this question is entirely up to you.

Some people will like to use WordCheck as a "first pass" through the page text to catch the more obvious OCR errors, and to highlight potential typographical errors and stealth scannos. Some folks believe that finding and fixing those types of errors before they proof the page in regular text-editing mode removes a possible source of distraction from finding the other errors remaining in the page.

Other people will prefer to proof the page in text-editing mode first, and then use the WordCheck as a "final pass" through the page to re-check the punctuation and potential stealth scannos one more time. Some folks feel a great deal of satisfaction in finding that any word which WordCheck may flag is actually a "false flag" since they see it as an affirmation of their proofreading skills.

And other proofers will prefer other approaches to using WordCheck. Thus, run WordCheck at the time when it best fits into your particular page proofreading method.

What's the "Unflag All & Suggest" button and what does it do?

This button, whose icon shows a book and a plus sign, provides a way for proofers to indicate that the word matches the image. Once clicked, the button causes all identically spelled words to be unflagged, just as if the word had been found in a dictionary or "good word" list. Additionally, words for which the button has been clicked are added to a file for the Project Manager. The Project Manager can review these unflagged words and add those that occur frequently to the project's Good Words List.

After a word has been modified, the Unflag All button for that word becomes disabled because the proofer has decided that the word as shown was not correct. In addition, words are only unflagged for the current WordCheck session; they do not persist for the proofer across WordCheck sessions, whether for the same page or for different pages.

Do I have to hit the Unflag All button for every word on the page?

If a Flagged word matches what appears in the scan, you do not have to do anything to it. If, as well as being correct, it is a word that appears several times on this page, or is one that is likely to appear several times in a project (such as a proper name or technical term), you may optionally choose to press the Unflag All button next to it, which will a) remove flags from all occurrences of this word on this page for this session of WordCheck mode, and b) add it to a list of candidate project-specific good words available to the Project Manager.

Why don't all Flagged words have an Unflag All button?

Words that have been identified as potential stealth scannos, or on a "bad words" list for any reason, do not have an Unflag All button to ensure that careful attention is given to each occurrence of such words.
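
The button behaviour described in the answers above can be summarised in a short sketch. The class and method names below (WordCheckSession, can_unflag, modify, unflag_all) are hypothetical and do not reflect the real implementation; the sketch only models the rules stated in this FAQ: words on a bad-words list have no button, a modified word has its button disabled, and an unflag applies to every identical word for the current session and records a suggestion for the Project Manager.

```python
class WordCheckSession:
    """Hypothetical model of one WordCheck session for a single page."""

    def __init__(self, flagged, no_button):
        self.flagged = set(flagged)      # words currently flagged on the page
        self.no_button = set(no_button)  # stealth scannos / bad words
        self.modified = set()            # words the proofer has edited
        self.suggestions = []            # words recorded for the Project Manager

    def can_unflag(self, word):
        """The button is usable only for flagged, unmodified, non-bad words."""
        return (word in self.flagged
                and word not in self.no_button
                and word not in self.modified)

    def modify(self, word):
        """Editing a word disables its Unflag All button."""
        self.modified.add(word)

    def unflag_all(self, word):
        """Unflag every identical word for this session and record a suggestion."""
        if not self.can_unflag(word):
            return False
        self.flagged.discard(word)
        self.suggestions.append(word)
        return True


session = WordCheckSession(
    flagged=["tumbleweed", "arid", "1and", "fiat"],
    no_button=["arid", "fiat"],
)
print(session.unflag_all("tumbleweed"))  # True
print(session.unflag_all("arid"))        # False
```

Because the state lives only in the session object, starting a new session begins again with every word flagged, which mirrors the statement that unflagged words do not persist across sessions.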

I hit Unflag All for a word but it was wrong - what do I do now?

Don't panic! Hitting the Unflag All button does not automatically add the word to the project's dictionary; it simply suggests the word to the Project Manager for inclusion. To correct the word, exit WordCheck (by either applying your changes or quitting without applying) and correct the word in the normal text window. Alternatively, you can run WordCheck again to correct the word, since unflagged words are not kept after the end of a WordCheck session.

If you are worried that the Project Manager might wrongly add the word to the "good words" list, you can always send a Private Message explaining what happened. However, Project Managers are responsible for checking that words are actually "good" before adding them to the list.

I hit Unflag All but didn't mean to, can I undo it?

There is no way to undo hitting the Unflag All button. However, exiting WordCheck and running it again will accomplish the same thing.

How do I get a word added to the project dictionary?

Words can only be added to the project's Good Words List by the Project Manager. The suggested way to encourage the Project Manager to add a word to the dictionary is to use the Unflag All button in WordCheck to signify that the word is correct, even though it is being flagged. The Project Manager can generate a list of commonly Unflagged words and add them to the Good Words List for the project.
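
To illustrate how a list of commonly Unflagged words might be derived, here is a minimal sketch. It assumes the recorded unflag events are available as a simple list of words; the function name suggest_good_words and the min_count threshold are hypothetical and are not taken from the actual tool.

```python
from collections import Counter


def suggest_good_words(unflagged_log, min_count=2):
    """Return words unflagged at least min_count times, most frequent first."""
    counts = Counter(unflagged_log)
    return [word for word, n in counts.most_common() if n >= min_count]


# Hypothetical log of words proofers unflagged across a project's pages.
log = ["tumbleweed", "Amarillo", "tumbleweed", "mesquite", "tumbleweed", "Amarillo"]
print(suggest_good_words(log))  # ['tumbleweed', 'Amarillo']
```

The Project Manager would still review each candidate before adding it to the Good Words List.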

Proofers are encouraged to use the project's discussion topic to suggest words for the project's Bad Words List.

How can I check the page against the dictionary for a different language?

When a page is initially checked for words to flag, the text is checked against the dictionaries for all project languages. An additional ad-hoc language dictionary can be used by selecting the language from the drop-down list at the top of the page and clicking the Check button. This will cause the text to be checked against the dictionaries for the project languages in addition to the ad-hoc language. Only one ad-hoc language can be used at a time, and a request for another will replace the previous ad-hoc language. Corrections and Unflagged words will be retained between checks against ad-hoc languages.
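
The "one ad-hoc language at a time" rule can be pictured with a small sketch. The class name LanguageSelection and its methods are hypothetical; the sketch only shows that the project languages are always checked and that choosing a new ad-hoc language replaces the previous one.

```python
class LanguageSelection:
    """Hypothetical holder for the languages a page is checked against."""

    def __init__(self, project_languages):
        self.project_languages = list(project_languages)
        self.adhoc_language = None  # at most one ad-hoc language at a time

    def set_adhoc(self, language):
        """Selecting a new ad-hoc language replaces the previous one."""
        self.adhoc_language = language

    def languages_to_check(self):
        """Project languages first, followed by the ad-hoc language if any."""
        languages = list(self.project_languages)
        if self.adhoc_language and self.adhoc_language not in languages:
            languages.append(self.adhoc_language)
        return languages


selection = LanguageSelection(["English"])
selection.set_adhoc("French")
print(selection.languages_to_check())  # ['English', 'French']
selection.set_adhoc("German")
print(selection.languages_to_check())  # ['English', 'German']
```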


Project Manager Questions

How do I view Site Word Lists?

Site-level words are stored in language-specific files.

Site-level Good and Bad word lists are used when calculating Flagged words in a body of text. Here is the current set of such lists: