Content Providing FAQ

This page will eventually become site documentation.

Because you as a DP user can edit the text of it, the information within can still become more useful. If you see a way to improve this page, please do so! Note that because of its eventual destination, formatting conventions may differ from other wiki pages.

You do not need to be a member of DPC to be a Content Provider. However, it might be a little difficult to get in contact with a member to PM the project if you are not. If you wish to provide scans, you can use The OCR Pool.

Selecting a Project.

Which book you pick is up to you. The only requirement is that it be copyright clearable (discussion below). It is best if it is something in which you have interest. Chances are that you will find others who will work on it as well.

Finding a book.

There are several ways to find a project to CP (Content Provide). You can search the library, buy from a local bookstore, raid your own bookshelves, ask a friend, pull them out of the trash, or find projects that are already scanned at some of the many on-line sources for scans. Be sure to pick a project on which you will enjoy working, because you will be shepherding this project through up to 5 Rounds (if you choose to be the project manager), and until the project is posted. This may take over a year from start to finish, but much of this time can be spent waiting for your project to be released in a round.

You get to decide what you provide. If you want the project to go through the system quickly, pick a popular genre; watch which queues are moving fast, as this changes regularly.

If you choose to get a book from one of the on-line book archives, please follow the individual site guidelines regarding acceptable use and protocol. We don't want to be bad neighbors. It is considered good form to credit the source of the scan when the text is submitted to Project Gutenberg Canada, so make sure the PM knows its source.

Difficulty.

Some books are harder to process than others, so consider how much time you are willing to spend. Check the inner margin (gutter) of the book. The wider this is, the easier it will be to scan, and the fewer extra measures you'll need to take in OCR and answering forum questions. This does not mean that you should not work with books that have a narrow gutter, just that they will be much harder. Projects with a lot of illustrations are also harder and more time-consuming. This will be discussed more under Scan/Download images and Prepare the Illustrations.

Copyrights and clearances.

Do a preliminary check to see if it is clearable. Usually that means the author has been dead for more than 50 years. See clearances, below, for more information.

Does the project already exist?

Make sure the book is not already available by searching Project Gutenberg and Project Gutenberg Canada. You will also want to make sure nobody else is working on it by checking David Price's in-progress list split by author last name. This list is ordered by author's name, but you may want to search the entire page for the title as well. At the end of each line there is a status tag. Most likely this will be either "copyright cleared" with a date, or "released" with a number. An In-Progress Search by title at DP is also available.

"Copyright cleared" means someone has requested and received copyright clearance, but has not yet finished the project. If this clearance is more than 3 years old, it has probably (though not certainly) been abandoned. In this case, when you request clearance, DPC Site Admin will attempt to contact the other clearance holder, letting each of you know that the other is working on it. You can then communicate with them to find out if they are working on it, or if you are free to begin processing it.

Some projects, most notably periodicals, will have "blanket clearances." This does not mean that the person who requested the original clearance has all of the volumes ready to scan! Most of these clearances are associated with DPC in some way, so if an Überproject doesn't exist for the periodical you have (where the PM will often list the volumes they have available), you can post in the Content Provider's forum to find out who's working on what.

If the project says "released," it has been posted to Project Gutenberg or PG Canada with the accompanying ebook number. It is a good idea to search both these lists by author and separately by title.

Running a project that is already in PG.

Even if a book is already in PG or PGC, it may be worth processing again. This will require some legwork to determine, so be sure you feel strongly about the book before pursuing this. PG welcomes different editions, illustrated versions, different translations, etc. In addition, many of the older ebooks have more errors than we would find acceptable today and reprocessing them through DPC may be the best way to change that. If the book has a PG number under 10,000 then it probably doesn't have an illustrated version and might be a good candidate for an upgrade.

Below is a list of reasons you might run an existing PG project through DPC. You will need a copyright clearance for each of these cases. For reworks, be sure to put a note into the comments section saying that you know this is a rework and why you are doing it.

Basic upgrade
You have the same text version, there are no illustrations, and the PG version is riddled with errors: Be sure to let PG know, when you upload the final version, that this is a revision of an existing ebook, based on a paper copy in hand. If there are only a few problems, submit them via the PG errata process. If you put the project through DPC, be sure to let the other volunteers know that this is a rework of an existing PG project by putting words to that effect into the Project Comments.
Illustration Upgrade
You have the same text version, but there are illustrations and they are not present in the PG version: Same as the basic upgrade except that you'll be submitting an illustrated html version. If you put the project through DPC, be sure to let the other volunteers know that this is a rework of an existing PG project by putting words to that effect into the Project Comments.
Different Translation
PGC will treat this as a completely different ebook and welcomes them. There are already at least half a dozen translations of the Iliad, for example, and more are always welcome.
Different Edition
Some books were published in very different editions. Where this is the case, PGC welcomes them as separate ebooks. You will have to document the fact that your edition has significant differences from the version that is already in PG or PGC.

Get a clearance.

You have obtained a book, and have decided that it is both clearable and not already in PG or in progress, or you have a book you think is clearable and need to find out for sure. In both cases, it is time to ask the experts.

Copyright Clearance.

Copyright clearance is a process by which Project Gutenberg Canada and DPC determine if a book is in the public domain according to the copyright laws of Canada. This DPC site operates under Canadian law; if you cannot obtain a clearance via DPC Site Admin, your book cannot be processed through this site.

Submit a Clearance Request.

After completing the registration process, log in, and select "Submit a New Clearance Request". A large form will appear; most of the information required should be available directly from the title page of your book. If not, you will have to do some research. Document any findings in the field provided; be sure to list the source of any information not found on your book's title page and verso page (the page immediately following the title page). If a date is listed twice in different contexts (separate publication date and copyright date, for example) enter it twice. Multi-volume works can be cleared in a single clearance request if the dates are the same. You will have to satisfy DPC Site Admin that the author has been dead for more than 50 years. Research and submit information from Wikipedia, or other published sources, citing the source of the information. DPC maintains a database of authors' dates of death that have been cleared, so the process only needs to be done once for each author.

Wait.

All that is left now is to wait for the results of your request. Basic clearances using the standard rules are usually processed quickly, anywhere from a day to a week. You may get a response that says NOT OK. A reason for the denial of the clearance will always be given. Be sure to check that reason, since technical difficulties can easily generate this response. Feel free to resubmit your clearance request after correcting whatever problem was noted.

Scan/Download images.

There are two ways to get these images. You can scan them yourself, or you can find an Image Provider that has already scanned them.

Scanning.

For the text of the project, it is best to scan within your OCR package. Many OCR packages deskew in a way that works great for text but mangles illustrations, so do not use the OCR package to scan the illustrations. If you have a few illustrations, it is best to make two passes with the scanner. On the first pass, scan every page in black and white for the OCR package. On the second pass, scan only the illustrations. IrfanView and XnView both have a scanning interface that is good for this. Be sure to get full-color scans of all color illustrations, and grey-scale scans of all black and white or grey-scale illustrations. It is also nice to get a scan of the cover and spine of the book. The back is also nice if it is illustrated. If there are any advertisements in the book, please scan them as well.

When you first use your scanner, check to see if it dithers in black and white mode. Dithering is a method of simulating colors you don't actually have available by scattering dots around and fooling the eye. The first image has been dithered, and is actually somewhat easier to read, but will confuse the OCR program and inflate the file size. The second has been thresholded, and is the preferred method. If your scanner driver dithers, consider scanning in grey scale and letting your OCR engine convert it to black and white.

Image:Dithered.png
Dithered scan
Image:Thresholded.png
Thresholded scan
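
To make the difference concrete, here is a minimal sketch of thresholding, assuming a greyscale page held as rows of 0-255 values. The function name `threshold` and the cutoff of 128 are illustrative choices, not settings taken from any scanner driver or OCR package.

```python
def threshold(gray_rows, cutoff=128):
    """Convert greyscale pixels (0-255) to pure black (0) or white (255).

    Every pixel is judged on its own value, so no scattered dot
    patterns are introduced the way dithering introduces them.
    """
    return [[255 if value >= cutoff else 0 for value in row]
            for row in gray_rows]


# A tiny 2x4 "page": dark ink on the left, light paper on the right.
page = [
    [10, 90, 160, 250],
    [30, 120, 200, 240],
]
print(threshold(page))  # [[0, 0, 255, 255], [0, 0, 255, 255]]
```

A real page would come from an image library or from your OCR engine; the point is only that each pixel becomes solid black or solid white.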


Generally you should not despeckle the images, because this process often removes punctuation marks. If you find that despeckling improves the OCR quality, then do so, but use the non-despeckled versions for the page scans that you upload to DPC.

For instructions on how to scan using Abbyy, see the Abbyy Scanning Documentation.

Scanning advice

Avoiding the most common pitfalls

Image providers

There are many online image archives that make available scans of public domain books. For a list of some of these sites see Details of Image Sources.

Please do not use scans from any archives that charge for the use of their service. These archives usually have a compilation copyright, and other restrictions on their use. Please follow the individual site guidelines regarding acceptable use and protocol. We don't want to be bad neighbors. It is considered good form to credit the source of the scan when the text is submitted to Project Gutenberg Canada.

If you've downloaded images to process from an online source, it's important that you record the source of the scans. Filling out the "image provider" field when you create a project allows DPC to coöperate with online image archives' policies. It's also nice to let Project Gutenberg Canada know the source of the scans.

DPC accepts PNG and JPG images only for proofreading. If the images that you've downloaded are in a different format, you'll need to convert them as part of your preparation process. If this results in images of reduced quality, consider adding a link to the original images in the project comments.

Two programs to make harvesting from some sites easier have been developed by DP-INT volunteers:

Snatch: [1] Allows you to "snatch" images from several of the on-line archives. It can be updated to snatch from others as well.

Gharvest: [2] This program is made specifically for Google Print.

Scanners.

There are many scanners you can use. You need one that does what you want to do best, and no one can answer that better than you. If you want to run large pages, you will need a large scanner. There are also automatic document feeders (ADFs) and many other add-ons you can get. If you have questions, or just want to see what others have discussed, there is a thread on scanner recommendations. See also Scanner Reviews.

OCR.

You have the scans, now you need the text. (Alternatively, this might be a Type-in Project, in which there is no starting text.) OCR, or Optical Character Recognition, is the process through which a program takes the image, and "reads" it, producing the text files. There are many programs that do this. Some are very good, some are adequate, and many are not good at all. Some have more functions than others, and some are fairly expensive.

Abbyy FineReader.

The most popular program is Abbyy FineReader. It does an excellent job, and you can find an older version on eBay without breaking the pocketbook. Try to stick with the Pro version. The Home and Sprint versions are much less expensive, but far less feature-rich. Instead of getting the newest Home edition, get a Pro version that is one version old; you will be much happier. There is a forum thread, ABBYY Finereader Tips and Tricks, for help with Abbyy FineReader.

Other OCR Software.

If you do not have Abbyy FineReader Pro, do not feel that you need to go out and buy the software in order to OCR. You can use any OCR program. So long as you get it into the correct format in the end, that is fine. The instructions given here are for Abbyy FineReader Pro. We will attempt to make them as general as possible, so that you can convert them to other programs, but some software will not do everything that Abbyy FineReader does.

ReadIris

Ocrad

Ocrad is free software; see [3] . Ocrad is an OCR (Optical Character Recognition) program based on a feature extraction method. It reads a bitmap image in pbm format and produces text in byte (8-bit) or UTF-8 formats. It includes a layout analyser able to separate the columns or blocks of text normally found on printed pages. Ocrad does a decent job if the scans are good (use unpaper first!); it does not seem to like italics.

We're beginning to collect tips on optimizing the use of OCR packages other than Abbyy FineReader.

OCR Pool.

If you don't have an OCR package at all, don't want to bother with it, or really want it done with a good OCR program, then you can use the OCR Pool. This group will take the scans you provide and produce the text for you.

Check the project.

Once the project is created you need to check it.

  • Check that every page image is there, and is complete. Include all pages, including title, verso, all illustrations, and plates and all numbered blank pages. It is ok to leave out unnumbered blank pages. If any pages are missing or damaged they should be replaced or repaired before continuing.
  • Check that every page has been OCRd. You should have one text file for each page image file, and they should have the same base name (e.g., 001.txt and 001.png). (If you're creating a Type-in Project, just create empty text files with the appropriate names.)
  • Image & text files should be named so that a simple sort of the filenames (e.g., an "order by name" listing of the files) puts them in the proper (book-binding) sequence.
    • One common convention is to simply number each page serially starting from 001 (or 0001 if there are more than 999 pages). Note that typically, this serial number will not agree with the page number printed in the book, but the difference (or 'offset') will usually be consistent over the body of the book. Check for a consistent offset, and investigate any anomalies, as they may indicate missing or duplicated pages. Changes in the offset can also occur due to unnumbered content pages (plates/appendices/introductions/whatever), which are fine. (Don't try to achieve a consistent offset at the expense of the proper sequence of pages.)
    • Another possibility is to name the image & text files according to the original printed page number. This is complicated by books with multiple page-numbering sequences (e.g., frontmatter numbered with roman numerals) and pages without an explicit or implied page number (e.g., plates). Such complications can be accommodated by judicious use of extra characters in the filename. For page files, the filename can be up to 12 characters long, so in practice the base name can have up to 8 characters. Allowed characters include digits, letters, underscore, hyphen, and dot. Just make sure that a simple sort of the filenames puts them in the proper sequence. (Don't rely on a particular collation for uppercase vs. lowercase letters. To be safe, only use one or the other.)
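
The naming and offset checks above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names `check_names` and `offset_changes` are made up for this page, the 8-character limit and allowed characters are the ones listed above, and the printed page numbers have to be supplied by you (for example, typed in while paging through the scans).

```python
import re

# Allowed characters per the rules above: digits, letters, underscore,
# hyphen, and dot. The 8-character limit applies to the base name.
BASE_NAME_RE = re.compile(r"[A-Za-z0-9_.-]{1,8}")


def check_names(base_names):
    """Return a list of problems found in a set of page base names."""
    problems = []
    for base in base_names:
        if not BASE_NAME_RE.fullmatch(base):
            problems.append(f"{base}: bad characters or more than 8 characters")
    has_upper = any(ch.isupper() for base in base_names for ch in base)
    has_lower = any(ch.islower() for base in base_names for ch in base)
    if has_upper and has_lower:
        problems.append("names mix uppercase and lowercase letters")
    return problems


def offset_changes(pages):
    """pages: list of (serial, printed) integer pairs in file order.

    Returns the serial numbers at which (serial - printed) differs from
    the previous page's offset. Each one deserves a look: it may be a
    plate, or it may be a missing or duplicated page.
    """
    changes = []
    previous = None
    for serial, printed in pages:
        offset = serial - printed
        if previous is not None and offset != previous:
            changes.append(serial)
        previous = offset
    return changes


print(check_names(["001", "002", "003"]))  # []
# File 014 is an unnumbered plate, so the offset shifts at file 015.
print(offset_changes([(11, 1), (12, 2), (13, 3), (15, 4), (16, 5)]))  # [15]
```

A reported change is not necessarily an error; as noted above, unnumbered plates shift the offset legitimately.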

Abbyy Finereader Scanning Instructions.

Prepare the Illustrations.

OK. You have page scans and your text is ready, but you still have some illustrations to prepare. It is also a good idea to make a scan of the cover and the spine if they have any decoration on them. Some people will scan them even if they have no decoration, as this gives a nice feel to the HTML version.

How to handle illustrations on a page.

Many books have illustrations within the text. We like to create HTML versions of all books with illustrations. This means the CP or PM must get these illustrations and include them in the project. It is not OK to just say "The illustrations will be provided by the PM at a later date" or "You can download the illustrations from this location" as the PM or site may not be around when the text is finished.

In order to get the illustrations, scan in full color, or greyscale as needed, in an application other than Abbyy FineReader. IrfanView, the Gimp, or your scanner's software should all provide decent image scanning. Abbyy FineReader processes images in several ways that are effective on text, but unacceptable for illustrations.

Illustrations should be scanned at a sufficient resolution to capture fine detail. While it may not be needed now, it is important if the book is to be reprinted or screen technology improves. Generally speaking, 300 DPI is adequate for line art, continuous tone, and descreened images; screened images often require 600 DPI to avoid moire effects. ** Add images to illustrate various types **
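
As a rough guide to what those resolutions mean in practice, the small sketch below works out pixel dimensions from the physical size of an illustration. The function name `scan_pixels` and the 4 x 6 inch plate are made-up examples.

```python
def scan_pixels(width_in, height_in, dpi):
    """Return (width, height) in pixels for a scan at the given DPI."""
    return round(width_in * dpi), round(height_in * dpi)


# A hypothetical 4 x 6 inch plate at the two resolutions mentioned above.
print(scan_pixels(4, 6, 300))  # (1200, 1800)
print(scan_pixels(4, 6, 600))  # (2400, 3600)
```

Doubling the DPI doubles each dimension, so the 600 DPI scan has four times as many pixels as the 300 DPI one.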

Then crop around the illustration, leaving some space around the illustration in order to rotate and clean up the illustration. Do not feel that you need to provide clean rotated images in perfect, ready to post format. This can be done by the PP. If you do wish to clean them up, many PPs appreciate this, however, please leave them larger than you think the PP will need. This allows the PP to resize them to the way they like it.

How to handle plates

Plates are handled in much the same way. But please make sure that you keep a black and white PNG of each plate among the page images. This provides a place marker that the PP can use to put the illustration in the right place. Some PMs leave the blank page following unnumbered plates, some do not. If the page numbers include the plate and blank page, then the blank page must be included or the PP will be hunting down the missing page later.

GuiPrep

GuiPrep is a software package created by DP's own Thundergnat. This is a great package that takes the OCR output, checks it for common OCR errors and then spits out a ready for DP version of the text file. It will also renumber the images, and run PNGCrush to make the images smaller. This is a very handy tool indeed. You can find it here.

GuiPrep has a lot of options. Don't let that scare you. Most you do not need to touch, but can change if you have special texts that do not function well with the defaults. Here we will discuss only the basics. If you want more information you can read the manual at the GuiPrep site, or post in the Providing Content Forum.

Installing GuiPrep

Instructions yet to come

Using GuiPrep

These are the basics of how to use GuiPrep. Detailed instructions can be found on the GuiPrep home page. These instructions will need to be altered a little for your specific program.

Setup text files

Save two copies of the text output from your OCR program. Save the first in the "textw" directory of your project folder, with these settings: Save as type Rich Text Format, Create a separate file for each page, Remove all formatting, Keep page breaks, and Keep line breaks. It doesn't matter what the File name is set to.

Save the second copy with the same settings, except that it goes in the "textwo" directory, Keep line breaks is unchecked, and hyphens are removed.

This will allow GuiPrep to merge words split across lines.
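
To see why two copies help, consider the sketch below. It is not GuiPrep's code and makes no claim about how GuiPrep works internally; it only illustrates the idea, assuming both copies of a page are available as plain text. A word broken at a line end is rejoined only if the joined form appears in the copy saved without line breaks, so a genuine hyphen such as the one in "well-known" is left alone. The function name `join_split_words` is hypothetical.

```python
def join_split_words(broken_lines, unbroken_text):
    """Rejoin words hyphenated across line ends, keeping the line structure.

    broken_lines: page text as a list of lines (line breaks kept).
    unbroken_text: the same page saved without line breaks and with
    end-of-line hyphens removed by the OCR program.
    """
    known_words = set(unbroken_text.split())
    lines = list(broken_lines)
    for i in range(len(lines) - 1):
        next_line = lines[i + 1].lstrip()
        if not lines[i].endswith("-") or not next_line:
            continue
        head = lines[i].split()[-1][:-1]         # "exam-" -> "exam"
        tail, _, rest = next_line.partition(" ")  # "ple", then the remainder
        if head + tail in known_words:
            # Move the tail up so the whole word sits on the first line.
            lines[i] = lines[i][:-1] + tail
            lines[i + 1] = rest
    return lines


with_breaks = ["This is an exam-", "ple of a well-", "known problem."]
without_breaks = "This is an example of a well-known problem."
print(join_split_words(with_breaks, without_breaks))
# ['This is an example', 'of a well-', 'known problem.']
```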

If your OCR program does not allow for these options, you can alternatively save as text files. Make sure that you save the files in the text folder and that you keep line breaks and use a blank line for paragraph separation. You will also need to uncheck both Extract and Dehyphenate under the Process Text tab in GuiPrep.

Run GuiPrep

Now you should have at least 3 directories in your project folder: pngs (with the png page scans), textw (with the rtf (or txt) files that have line breaks), and textwo (with the rtf (or txt) files that have no line breaks).
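
If you like, a quick script can confirm the layout before you open GuiPrep. This is a minimal sketch; the directory names are the three named above, while the function name `missing_dirs` is just an illustration.

```python
from pathlib import Path

# The three directories named above.
EXPECTED_DIRS = ("pngs", "textw", "textwo")


def missing_dirs(project_dir):
    """Return the expected subdirectories that do not exist yet."""
    root = Path(project_dir)
    return [name for name in EXPECTED_DIRS if not (root / name).is_dir()]
```

Calling `missing_dirs` with the path to your project folder returns an empty list when all three directories are present.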

Open GuiPrep, go to the change directory tab and navigate to the folder your project is in.

Go to the process text tab and make sure all the options you want to run are checked. If your project includes the long s character then you should check "Fix Olde Englifh" and an extra routine to check for f/s mistakes will be run. This should only be checked if you have a long s project. If your project is for the European DP then you should uncheck the "convert to ISO 8859‑1" box. If the project is for the main site, this should be checked.

Click start. When GuiPrep is done it will say "Finished all selected routines."

There have been reports of Unicode characters reappearing after using the search and replace tools; you may wish to rerun the "Convert to ISO 8859-1" step if you run any search and replaces.
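
One way to check a text file yourself is to look for characters that ISO 8859-1 cannot represent, which are those with code points above 255. The sketch below does this for a string; the function name `non_latin1_chars` is made up, and this is a separate check, not part of GuiPrep.

```python
def non_latin1_chars(text):
    """Return (line, column, character) for each character that cannot
    be stored in ISO 8859-1 (code points above 255)."""
    found = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ord(ch) > 255:
                found.append((line_no, col, ch))
    return found


# Curly quotes are outside ISO 8859-1; the e-acute in "Cafe" is inside it.
sample = "He said \u201chello\u201d.\nCaf\u00e9 is fine."
print(non_latin1_chars(sample))  # [(1, 9, '“'), (1, 15, '”')]
```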

FAQ, or what do I need to know?

What is the difference between a CP and a PM? And what do those abbreviations mean?

A: The CP or Content Provider supplies the scans to be processed at DPC, and may also prepare the files for the proofreaders, but does not necessarily deal with the project beyond that. CPs do not have to be members of DPC.

The PM or Project Manager is responsible for creating the project at DPC, guiding it through the rounds, answering proofreader/formatter questions, and making decisions that will help create the most consistent output possible for the post-processor. PMs may provide their own content or acquire scans from another CP. The term PM in a different context means Private Message.

How much of my time will CPing take?

A: It depends on the book you choose to CP. If you choose a short novella with no illustrations, then it could take a couple hours to scan, OCR, check and prep your project. If, on the other hand, you are working on a thousand-plus page book on ship construction with 33 fold-out plates and a couple hundred illustrations, then it could take a year or more to finish the scanning alone.

What are the qualifications necessary to become a CP?

A: There are no qualification requirements to be a CP. You just need to be able to get the images into good order and find a PM willing to work with you.

What kind of equipment do I need to CP?

A:

  • In order to CP you need a scanner that is capable of scanning the material you want to provide. Some libraries have scanners for public use if you do not have one.
  • You will also need some sort of OCR package capable of providing the OCR text needed to start from. If you do not have an OCR package, there is an OCR Pool with volunteers willing to do this for you.
  • You will also need GuiPrep installed. However if you use the OCR pool, they can run GuiPrep for you, if you ask them nicely.

Are there deadlines? Who sets the schedule? What if the schedule is not met?

A: The only deadlines and schedules are set by the CP. If as the CP you do not want to set a deadline or schedule, then don't. If you do set a deadline and it passes, then the only one who is going to come down on you is you. Some projects take very little time, others take a long time.

What files do I need to provide?

A: You should provide clear black and white png images of every page. These should be large enough to be read easily, but not so large that they are slow to download over a dial-up modem. Usually, keeping each image below 100K is enough to satisfy the latter.

You will also need to provide text files containing the OCR output of each page. The png and the text file must have the same base name. For example, 005.png goes with 005.txt; otherwise the upload software won't know what to do. (Note: GuiPrep has a tool to help get the names to match, so long as both sets of files are in correct alpha-numeric sort order.)
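
A short script can catch mismatched names before you upload. This is a minimal sketch that works on a plain list of filenames; the function name `unpaired_pages` is hypothetical, and it is not part of GuiPrep or the upload software.

```python
from pathlib import PurePath


def unpaired_pages(filenames):
    """Return (pngs_without_txt, txts_without_png) as sorted base names."""
    png = {PurePath(n).stem for n in filenames if n.lower().endswith(".png")}
    txt = {PurePath(n).stem for n in filenames if n.lower().endswith(".txt")}
    return sorted(png - txt), sorted(txt - png)


files = ["001.png", "001.txt", "002.png", "003.txt"]
print(unpaired_pages(files))  # (['002'], ['003'])
```

To check a real folder, pass in the result of `os.listdir` for that folder; two empty lists mean every page image has a matching text file.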

If there are any illustrations in the book, or a decorative cover, grey-scale or color images of each should be provided. It is best if these are in jpg format. If the illustration is black and white, these can be provided in png format. Post processors appreciate if these files are named to correspond to the correct project pages. For example, i005-1.jpg and i005-2.jpg would be illustrations from 005.png.
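
The naming pattern in that example can be checked mechanically. The sketch below assumes the pattern is "i", the page's base name, a hyphen, a running number, and a jpg or png extension; that pattern is inferred only from the two example names above, and the function name `page_for_illustration` is made up.

```python
import re

# Pattern inferred from the examples i005-1.jpg and i005-2.jpg.
ILLUSTRATION_RE = re.compile(r"i(?P<page>.+)-(?P<index>\d+)\.(?:jpg|png)")


def page_for_illustration(filename):
    """Return the page image an illustration file belongs to, or None."""
    match = ILLUSTRATION_RE.fullmatch(filename)
    if match is None:
        return None
    return match.group("page") + ".png"


print(page_for_illustration("i005-1.jpg"))  # 005.png
print(page_for_illustration("cover.jpg"))   # None
```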

