Providing Content for DP Canada


So you want to CP!

Draft 2: February 2, 2012

Abbreviations used below:
* CP — content provider, content provision
* DPC — Distributed Proofreaders Canada
* OCR — optical character recognition
* PM — project manager


Providing content (introduction)

One of the most enjoyable tasks at Distributed Proofreaders Canada is finding new, eligible books and other texts and helping to make them available to readers everywhere.

The job of the CP is very elastic. It can mean as little as suggesting a book in the forum and as much as finding the book, clearing and scanning (or harvesting) it, preparing the illustrations if applicable, doing the Optical Character Recognition (OCR), prepping scans and text files and finally passing the whole lot on to a willing Project Manager (PM).


Finding the book

Canada enjoys some of the most liberal copyright laws in the world. Books fall into the public domain fifty years after the death of the author and of any other contributor. That means a huge crop of new books becomes 'legal' every year on the first of January.

As a result, there is no need to scrape the bottom of the barrel on the grounds that all the "good stuff" has been done. There is so much good stuff coming along annually that we can barely keep up. (It also means that scanning rather than harvesting is the priority, because many of the new books are not out of copyright elsewhere and hence not available as scans from any legal source.)

Books on your own bookshelf may well be eligible. Books at your local library most certainly will. A good website to check out for ideas is http://www.kingkong.demon.co.uk/. Go for the first edition if you can; alternatively pick the last one the author was personally involved with — the one we can regard as her/his final word on the subject. If you do go with a library book — be nice. Choose one that is unlikely to be damaged by the scanning process.

Let us assume you have picked a book, ideally one you love and want to share with others. Now you can:

  • Suggest the title in the forum and go off and do something else.
  • Carry on and establish whether the book is eligible.


DON'T DO ANY WORK (i.e. scanning) AT THIS STAGE.


Clearing the book

Generally speaking, a book is eligible (i.e. out of copyright in Canada) if the author and every other named person involved in creating it (e.g. illustrators, editors, other contributors) have been dead for at least fifty years.

The fifty years are counted, not from the actual date of death, but from the first of January of the following year. So the work of anyone who died at any time during 1961 falls into the public domain in Canada on 1st January 2012.
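The counting rule above can be sketched in a few lines of Python. This is only an illustration of the arithmetic described in this section, not a legal test; the function name public_domain_year is made up for the example, and it assumes you already have a verified year of death for every named contributor.

```python
def public_domain_year(death_years):
    """Return the year on whose 1 January the work enters the public domain.

    Assumes the life-plus-fifty rule described in this guide: the fifty
    years run from 1 January of the year after the latest death.
    """
    if not death_years:
        raise ValueError("at least one contributor's death year is required")
    latest_death = max(death_years)  # every contributor must be cleared
    # Counting starts on 1 January of the following year, then runs fifty years.
    return latest_death + 1 + 50


# Author died in 1961, illustrator in 1950: the later death governs.
print(public_domain_year([1961, 1950]))  # 2012
```

Taking the latest death year reflects the requirement that every named person has to be cleared, not just the author.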

At this stage, a web search is usually the first step. You will need to provide two pieces of evidence for each person. Wikipedia is not acceptable on its own as evidence. Each search will turn up different websites. The most reliable are major libraries (but note copyright libraries may not provide the information), major encyclopaedias, Who's Who, Dictionaries of National Biography, and the Internet Archive. Another helpful website is: http://blogs.lib.utexas.edu/freethebooks/4/

Illustrators and translators are a special problem, and you may have to hit the reference books for them. However, each person only has to be cleared once, and in due course DPC will build up its own database of cleared contributors.

Photographs fall out of copyright fifty years after publication. The same rule applies to anonymous publications or contributions.

Write an email to Simple Simon at starlink @ rogers.com (remove extra spaces). Include all the details (title, author's complete name, date of original publication, and any information you have on the author's year of death — ideally as web links) plus an image of the title page and verso (i.e., the other side of that page). There is a Copyright Clearance Form kindly provided by Miscia which may be useful in keeping track of copyright details as you gather them.

NB: If your book was published in the US before 1923 (before 1922 if outside the US), it may also be eligible for clearance at Project Gutenberg in the USA. This involves registering at their website (copy.pglaf.org), filling out an on-line form and uploading a scan of the title page and verso. Please refer to instructions on the DP International website to find out more.

Simple Simon will issue a clearance in due course.

At this point, you can:

  • Hand off to a willing PM if you like.
  • Or carry on and do the scanning (or harvesting, but see above).


Scanning the book (Windows / ABBYY Finereader)

This is where it gets serious.

At this point, you will need hardware (i.e. a scanner) as well as software. ABBYY Finereader / Sprint is the package most used; Sprint (which is basically ABBYY 6) comes free with some operating systems but is unable to save a file per page. OmniPage may also do the job; other packages include ReadIris and Tesseract (open source).

Although you may be able to scan the occasional page using your computer's scanning software, it is unlikely to be suitable for scanning a whole book. It will be too slow and is unlikely to offer features like scanning multiple pages, page splitting and text straightening.

The following information relates to computers running Microsoft Windows and ABBYY Finereader software, since that has proved most suitable for DP's needs. If you have a different set-up and scanning/OCR package, please ask for advice in the CP forum.


Setting up the folder

Before you begin, set up a folder called DPC on your local disc (C:\) or a disc of your choice. Create a folder tree, e.g. DPC — Projects — (this book) which would look something like this:

C:\
  +dpc
    +projects
      +name of project (suggestion: author's last name plus main word in title), like "davisexplorers"
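If you would rather create the folder tree with a script than by hand, something like the following Python sketch would do it. The function name make_project_folder is invented for the example, and the demonstration uses a temporary directory in place of C:\ so that it can be run anywhere without touching your disc.

```python
import tempfile
from pathlib import Path


def make_project_folder(root, project_name):
    """Create <root>/dpc/projects/<project_name> and return its path."""
    folder = Path(root) / "dpc" / "projects" / project_name
    # parents=True builds the whole tree; exist_ok=True makes re-runs harmless.
    folder.mkdir(parents=True, exist_ok=True)
    return folder


# A temporary directory stands in for C:\ in this demonstration.
with tempfile.TemporaryDirectory() as tmp:
    created = make_project_folder(tmp, "davisexplorers")
    print(created.relative_to(tmp))
```

On a real machine you would pass the drive of your choice as root, for example "C:/".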

ABBYY / Finereader will open on an "Untitled batch"; name the batch after the book and save it to the (this book) folder. Pick a memorable name; keep it brief; older rules required lower case, and many still stick to them. From now on, the software will save everything you do automatically and nothing can be lost.

Setting up the software

(Note: this section refers to ABBYY FineReader version 8; I'll be happy to formulate guides for other versions if others will provide the page scans)

On the ABBYY / Finereader Tools drop-down menu, go to Options and open General. Depending on the package, this may have a tab for Legacy Options. If it does, tick Open image during scanning, and Show image during recognition.

Image:legacy.jpg

Open View and tick Highlight uncertain characters.

Image:view.jpg

Open Scan/open and tick:

  • Use ABBYY / Finereader interface
  • Display options dialogue before scanning.
  • Scan multiple images
  • Straighten text lines

and possibly:

  • Split dual pages. (This option may not be suitable if the book is very tightly bound and the gutter, i.e. the white space in the centre between pages, is very small.)

Image:scanopen.jpg

If you plan to do the OCR...

Open Read and set recognition language if applicable;

tick:

  • Thorough
  • Extract text from pdf
  • Do not use user patterns (these only come into play if you need to train the software).

Image:read.jpg

Open Check Spelling and set error display level at thorough.

Image:spellcheck.jpg

Make sure the scanner is hooked up and operational; hit Scan.

Image:options.jpg

This will open up the options dialogue. Some of the settings depend on the book you are scanning. For example, the Portrait setting is suitable for small books that are no wider when opened than the shorter side of the scanner. This setting is really best because it brings maximum light into the gutter.

Choose Landscape for larger books.

Your scanner will have markings around the edges indicating paper sizes and / or inches and centimetres. Place the book on the glass and check size to establish the measurement units setting. The closer you can get to the actual size of the book, the less work is involved later in cropping the scans. If you can avoid black margins altogether, so much the better.

Set picture scanning mode as required. (NB: do not use the ABBYY / Finereader interface to scan illustrations; these should be scanned within the scanner's own software. You can access this via Tools / Options / Scan/Open: click on Twain source interface.)

Set pause between pages as required. The eight seconds shown here is a fairly short interval. Start out with something less ambitious, like 12 seconds and work your way down to something that feels comfortable. If you miss the scanning window, it's not the end of the world. Delete your pretty black page and do it again.

Set resolution to 300dpi — UNLESS you are working with a very small book (anything with a long side of less than 5 inches or 12.7 centimetres), in which case you want 400dpi or possibly even more.
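The resolution rule of thumb above can be written down as a small Python helper. This is a sketch only: the names recommended_dpi and cm_to_inches are made up for the example, and 400 is treated as the starting point for small books, since the text allows for "possibly even more".

```python
def cm_to_inches(cm):
    """Convert centimetres to inches."""
    return cm / 2.54


def recommended_dpi(long_side_inches):
    """Suggest a scan resolution from the length of the book's longer side."""
    if long_side_inches <= 0:
        raise ValueError("page size must be positive")
    # Very small books (long side under 5 in / 12.7 cm) need more detail.
    return 400 if long_side_inches < 5 else 300


print(recommended_dpi(8.5))               # 300
print(recommended_dpi(cm_to_inches(11)))  # 400
```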

Tick

  • Show this dialog before scanning.

Image:thumbn.jpg

At the top right of the tool bars, above button no. 4 (Save), click on Thumbnails batch view (fourth from the left).

You're nearly ready to go.

Check the glass on the scanner and clean if necessary, both now and during scanning. (Spray the glass cleaner on the cloth, not on the glass!) Old books throw off an amazing amount of dust as well as bits of themselves, which will all show up as speckles on your scan.

Look at your book. If it's your own book and you do not plan to keep it, stripping off the cover will make scanning a good deal easier because you will be able to flatten it completely on the scanner. If it is a library book, play nice. If there is more than one copy to choose from, pick the one with the loosest binding and the widest gutter. Please don't break the spine.

Does the book have especially thin (bible or parchment) paper? In that case, you may need to insert a sheet of black card behind each page to avoid 'bleed-through'. If that proves necessary, try to avoid tucking it too deep into the gutter — that will cause damage and may result in detached pages, especially in older books.

Image:page.jpg

Choosing what to scan

So what do you scan? Include everything from the title page to the last page of the text proper (up to "THE END" if applicable). Do scan all the blank pages. If you are the CP but not the PM, the following material should be included (even though the PM may later omit it for copyright or other reasons):

  • pre-title: the title of the book, on its own on a right-hand page, preceding the title page with a blank page between
  • publisher's advertising or other material: this may be a brief list of "other works in this series/by this author" or reams of ads or a catalogue at the end.

Don't go over the top. It is not necessary to, say, scan the onion paper between an illustration and the facing page, which is only there to stop bleed-through.

Scanning a whole book will take about two to three hours depending on length. It is easy to miss out the odd page, so keep track of page numbers. ABBYY / Finereader will open each page as it is scanned. On the left, you'll have a thumbnail of the page; next to it the scan (see above). If you set this middle section to "Fit to height" at the bottom (for split pages), you will be able to see immediately if there are any problems. If you have not split the pages, set to "Fit to width" and extend this section to the right-hand margin.
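One way to keep track is to jot down the printed page numbers as you scan and check the list for gaps afterwards. The Python sketch below shows the idea; missing_pages is a made-up name, and it only helps with numbered pages, so unnumbered front matter, blanks and plates still need checking by eye.

```python
def missing_pages(scanned, first, last):
    """Return printed page numbers in first..last that were never scanned."""
    seen = set(scanned)
    return [n for n in range(first, last + 1) if n not in seen]


# Pages noted during scanning of a nine-page stretch.
print(missing_pages([1, 2, 3, 5, 6, 9], 1, 9))  # [4, 7, 8]
```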

At this point, you can:

  • Hand off the scans to a willing PM and move on with your life.
  • Or carry on and do the splitting, cropping and clean-up.


Dealing with the page scans

Splitting

After you have scanned the book, you will need to split the pages (unless this has already been done automatically — see above under "Setting up the software"). Go to the Image drop-down menu and open 'Split image'. The window allows you to split the image in various ways. You can adjust window size if the gap between the two halves of the double-page spread is especially narrow.

Cropping

If there is a wide margin around the pages, and especially if there is black margin, you will also need to crop the images. Go to the Image drop-down menu and open 'Crop image'. Again, the window can be re-sized if necessary. Crop as required.

Cleaning

The cleaner your images are, the easier it will be to do the OCR and the fewer the errors. Use the eraser (fourth from the bottom on the band between thumbnails and page image) to get rid of blotches and any remaining gutter or signs of folds or tears. This may seem overkill, but anything we can do to improve the quality of the image will pay dividends in the quality of the proofing. You may also try a de-speckle run, though this will only clean up very small specks.

Save the image files as black and white pngs to a sub-folder of (this book). It's a good idea to call that folder pngsbig or similar, since the images will need to be re-sized at a later stage.

At this point, you can:

  • Hand over the scan set to a willing PM and, assuming you've caught the bug, move on to the next book. The PM will see to the re-sizing and re-naming and do the OCR.
  • Or go the whole way and tackle these steps yourself.

Resizing (and renaming)

Proofing images should be re-sized to 1000 pixels (px) on the short side. There are various software packages that will do the job; the main thing is that they allow batch re-sizing and do not damage (i.e. skew) the image. Irfanview (which is free) is usually regarded as perfectly acceptable; it also has a renumbering function.
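To see what "1000 pixels on the short side" means for a given scan, the target dimensions can be worked out as in the Python sketch below. It only does the arithmetic and does not touch any image file; proofing_size is a name invented for the example.

```python
def proofing_size(width, height, short_side=1000):
    """Scale (width, height) so the shorter side equals short_side pixels."""
    if width <= 0 or height <= 0:
        raise ValueError("image dimensions must be positive")
    scale = short_side / min(width, height)
    # One scale factor for both sides keeps the aspect ratio (no skewing).
    return round(width * scale), round(height * scale)


print(proofing_size(1500, 2400))  # (1000, 1600)
```

Using a single scale factor for both sides is what keeps the page from being stretched or squashed.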

Image:irfb.jpg

In the File drop-down menu, click on Batch Conversion / Rename and then on Advanced. See screenshot for the settings. Do NOT opt for re-sample, since this will increase file size substantially. The images ideally need to come in at under 100kb.
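After a batch run it can be handy to list any images that miss the size target. The Python sketch below is one possible check, with two assumptions flagged: "100kb" is read here as 100 × 1024 bytes, and the function name oversized_images is made up. The demonstration writes dummy files rather than real page images.

```python
import tempfile
from pathlib import Path

SIZE_LIMIT_BYTES = 100 * 1024  # assumption: "100kb" read as 100 x 1024 bytes


def oversized_images(folder, limit=SIZE_LIMIT_BYTES):
    """Return names of .png files in folder that are not under the limit."""
    return sorted(
        p.name for p in Path(folder).glob("*.png") if p.stat().st_size >= limit
    )


# Dummy files stand in for real page images in this demonstration.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "001.png").write_bytes(b"\0" * 50_000)
    Path(tmp, "002.png").write_bytes(b"\0" * 150_000)
    print(oversized_images(tmp))  # ['002.png']
```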

Image:irfanview.jpg

The software also offers a rename function. DPC image files are named 001.png, 002.png etc. ABBYY / Finereader follows a different pattern, so files will have to be renamed. Other software packages include 1-4a Rename, which has an intuitive interface and is free.
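For anyone comfortable with a script, the renaming can also be planned in Python. The sketch below only works out the mapping from old names to 001.png, 002.png and so on; it does not rename anything on disc. The input names (page1.png etc.) are hypothetical, since the exact ABBYY naming pattern is not given here, and natural_key and rename_plan are names made up for the example.

```python
import re


def natural_key(name):
    """Sort key that orders embedded numbers numerically (page2 before page10)."""
    parts = re.split(r"(\d+)", name)
    # With a capturing group, odd positions are always the digit runs.
    return [int(p) if i % 2 else p.lower() for i, p in enumerate(parts)]


def rename_plan(filenames, extension=".png"):
    """Map existing file names to 001.png, 002.png, ... in natural order."""
    ordered = sorted(filenames, key=natural_key)
    return [(old, f"{i:03d}{extension}") for i, old in enumerate(ordered, start=1)]


for old, new in rename_plan(["page10.png", "page2.png", "page1.png"]):
    print(old, "->", new)
```

Sorting numerically rather than alphabetically matters because a plain alphabetical sort would put page10 ahead of page2. It is worth checking the plan against the book before applying it.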

Alternatively, you can leave this step until later and rename image and text files at the same time.


Doing the OCR

Please note that the following is very hands-on. If time is short, the analysis and run-through can be omitted and all pages simply read.

If you have scanned the book within ABBYY / Finereader, you are most of the way there. See above for the Read and Spellcheck settings. Scans should have been split and cropped (if you are doing the OCR, the cleaning can be done later).

Layout analysis

Start by selecting all the images and clicking on the top button in the narrow band between thumbnail and page image (Analyze layout) or pressing Ctrl+E on the keyboard. The software will run through the pages and identify blocks of text. Some of these are not required for proof-reading: the page numbers and running titles, for example, as well as identifiers at the bottom of the page which assist in the assembling of the physical book. Although it is not essential to delete these blocks, many PMs do.

Checking (and optional cleaning)

Run through the analysed pages deleting any superfluous blocks (use the delete key on your keyboard or the Delete Block button in ABBYY / Finereader). Sometimes the software will have misinterpreted the layout and pages will need to be marked up manually. If you have not yet cleaned up the pages, this is the ideal time (but take care not to overuse the eraser — it can lead to image distortion which cannot be completely removed even with the undo button). And beware: moving on to the next page closes down the undo function. This is also a good moment to check that all pages are present and to straighten text lines on any that have badly skewed text.

Hit the Read button, and watch the magic!

ABBYY / Finereader will display each text page next to the page scan. Words with uncertain characters, or words not in the dictionary, are highlighted.

Before saving the text pages, a number of clean-up steps can be taken within ABBYY / Finereader. For example, the long dash can be exchanged for two em-dashes. A glance at the highlighted words across a number of pages will suggest other search and replace options. Bullet points, curly braces, asterisks and a variety of other junk symbols can be stripped. Extra spaces can also be removed — they sometimes make it all the way into PP/PPV otherwise! And some misspelled names can be corrected at this stage. The search and replace function allows all batch pages to be searched.
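The same kind of clean-up can be done outside ABBYY / Finereader once the text is saved. The Python sketch below strips a few junk symbols and surplus spaces from one page of text. The list of junk characters is an assumption for illustration and should be adjusted to what actually turns up in your book; asterisks in particular may be genuine footnote markers. The name clean_ocr_text is made up for the example.

```python
import re

JUNK_SYMBOLS = "•{}*"  # assumption: adjust after looking at the highlighted words


def clean_ocr_text(text, junk=JUNK_SYMBOLS):
    """Strip junk symbols and surplus spaces from one page of OCR text."""
    # Remove every junk character outright.
    text = text.translate({ord(ch): None for ch in junk})
    # Collapse runs of two or more spaces or tabs into a single space.
    text = re.sub(r"[ \t]{2,}", " ", text)
    # Drop spaces at the start and end of each line, keeping the line breaks.
    return "\n".join(line.strip(" \t") for line in text.split("\n"))


print(clean_ocr_text("• The  quick {brown}  fox *\njumps   over"))
```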

Image:savepages.jpg

Save the text pages in ABBYY / Finereader as UTF-8, one file per page (see screenshot for settings). Save them twice, once with and once without line breaks (click on Formats settings at the bottom right of the screenshot).
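Since there should be one text file per page image, a quick cross-check of the two sets of file names can catch a page that failed to save. The Python sketch below compares base names only; unmatched_pages is a made-up name, and the .png and .txt file names are examples.

```python
from pathlib import PurePath


def unmatched_pages(image_names, text_names):
    """Return (images without text, text files without image) by base name."""
    images = {PurePath(n).stem for n in image_names}
    texts = {PurePath(n).stem for n in text_names}
    return sorted(images - texts), sorted(texts - images)


print(unmatched_pages(["001.png", "002.png", "003.png"],
                      ["001.txt", "003.txt", "004.txt"]))
# (['002'], ['004'])
```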

Now you have the usual two options:

  • Hand the whole lot off to a PM.
  • Or complete the prep.

Completing the prep using guiprep is a chapter in itself and is not included here.
