Post-Processing FAQ

From DPCanadaWiki

This page will eventually become site documentation.

Because you as a DP user can edit the text of it, the information within can still become more useful. If you see a way to improve this page, please do so! Note that because of its eventual destination, formatting conventions may differ from other wiki pages.

What is Post-Processing?

The purpose of post-processing is to massage an e-text into a readable form. In its journey through multiple proofreading and formatting rounds, the text may have been improved by hundreds of volunteers. The post-processor must standardise the formatting of the book and adjust it to comply with Project Gutenberg Canada's requirements. They must also deal with any detectable mistakes that have survived all four rounds. The ultimate goal of post-processing is to create a consistently formatted e-text, which contains as few errors as possible and which accurately reflects the intentions of the author. A plain-text version is always needed (a .txt file), but many projects now also require other formats. Don't be put off by this — there are people who can help with them if you don't want to do them yourself.

Who can Post-Process?

Post-processors require more experience than ordinary proofreaders. Since they are preparing the text for uploading to Project Gutenberg Canada, they are the final editors of the text and make choices and decisions about the layout and look of the text. Because of this, post-processing is usually only available for proofreaders who have completed at least 400 pages in F1. If you click through to the PP area of your Activity Hub, you should find a statement of whether you meet the requirements, and, if so, a button to request access (this is automatically granted).

If you are not yet eligible, but have a reason for wanting to PP (special language skills are a common basis for exceptions), please request access.

What help is available?

This FAQ contains a lot of information, particularly in the 'Help' section. The Post-Processing Forum also has some helpful sticky threads, especially the No Dumb Questions topic. There are old discussions to read through, and it is the best place to post new questions. For faster help, try using Jabber (Jabber instructions and PGDP Jabber IDs) to visit the DPC Chat room: there's usually someone around who can point you in the right direction (all time zones covered!). You can also get mentoring help: see PP Mentoring and PP Mentors. PG produces its own guidelines. Finally, a strongly recommended step is to look at existing PG books, especially if you can find a book similar to the one you are about to post-process (same author, related topic, etc.).

What tools can I use?

Post-processors use a variety of operating systems and software tools to do the PP work. Which ones you use is your choice, but the minimum software you will need includes:

To create and check an HTML version you will also need to use an online HTML validator and CSS validator and link checker. If your project contains illustrations, you will need an image editor.

There are other programs available which are not essential, but which can be extremely useful and will usually save you a lot of time. One example is the Gutcheck program. The majority of post-processors use Guiguts, so there is a lot of support, advice and a PPing tutorial available for it.

How do I choose a book to process?

For your first project, it's best to pick a fiction book with a relatively small number of pages (less than 200 or so). Here's why:

  • A low page count makes the work go faster and is easier to handle.
  • Fiction usually has fewer words per page and a simpler format than non-fiction, so it scans more clearly and is less likely to result in OCR errors and inconsistent formatting.
  • Fiction generally lacks complicating features such as footnotes, tables, illustrations, and other items that could be difficult for a new post-processor to deal with.

Many PPers did not follow this rule to start with, and have turned out all right anyway. But it's a reasonable guide.

There are three good ways to find a book for post-processing:

  1. Check the Projects for new PPers thread. Here are advertised books which have completed the rounds and would make good starting points. Or post there that you are in the market for a good book!
  2. If you are proofing or formatting a book that is particularly enjoyable, or that seems quite straightforward, ask the Project Manager if a post-processor has been assigned already (or look for that section on the Project Comments page). If there is none (or sometimes if the Project Manager is listed as post-processing it themselves), ask if you can post-process the book. You'll need to wait for it to finish the rounds first. An alternative is reviewing the Books with no PPer yet list and contacting Project Managers directly. Be careful what you ask for, though, as you may get it.
  3. Contact a PP Mentor. Not only can they help if you have questions about the process, but they can occasionally provide more suitable projects, or suggest alternatives to the above methods.
Note: there is no commitment in volunteering for a book. If you wish to stop PPing it for any reason, it's best to contact the person who assigned it to you, so they can pass it on appropriately. They will not twist your arm or criticise you for not finishing. The current author has returned more books than she has completed PPing for, and that's fine :) Think of it as allowing someone else the chance to work on it, and freeing up your own valuable time for another project!

Download your chosen text by going to its Project Page, scrolling to the bottom of the page, and selecting "Download Zipped Text". Do not select "Download Zipped TEI Text" (a different encoding) unless you know what TEI is and want to work with it. The plain text is the version that you need.

Scroll through the whole text to see if there are any difficulties, like footnotes, poetry, foreign languages, dialects, and tables in it. This way, you will know what you will be dealing with before you commit to the project. If you see any of these items, you might want to pick a different project. But if you think you can handle it, give it a try! There's always lots of help available.

Check the Project Thread for your book's title to see what proofreaders have been saying about it. Again, this can alert you to issues that might make the work more difficult than you had realised.

Make sure that the book appears as checked out to you, or you might end up working for hours on it only to find that someone else has checked it out and submitted it! Your PP choice will appear near the middle of your Post-processing personal page.

Note: occasionally, perfectly lovely books appear in the PP Pool (at the bottom of your Post-processing personal page). If you wish to take one of these to work on, click on its title on that page, perform the above checks (downloading text, reviewing Project Discussion, etc.) and if you still want to go ahead, scroll right to the bottom of the Project Comments page, where you should find a "Check Out Book" button. Click it, double-check that the book has been assigned to you, and you can go to work.

How long will it take?

It's very difficult to answer this question in advance. The time that a book will take to complete depends on three factors:

  • the difficulty and length of the work itself
  • the tools being used
  • the amount of experience the post-processor has

It can vary from several hours to several days. Some especially difficult works can take weeks (or more!) to complete. Remember to save your work often, using a new filename each time, so that if you make a mistake, you can easily recover. Take it at your own pace—you will be the last person going through this book in detail before its posting (although two other people will verify your work).

Try not to feel discouraged if it seems like it takes a long time to complete an "easy" book. Concentrate on learning the process of post-processing, familiarising yourself with any tools you might be using, and doing a quality job, rather than on working quickly. You will speed up naturally with practice.

What if I change my mind, or don't have time?

If you realize that the project you've chosen is too complicated for you, or if you find yourself short of time, it is perfectly OK to return the project to the PP pool. You can look for an easier project straight away or take another one when you have more time.

To return a project, find the title of the book you are working on on your PP page and click on its link. That shows you the specific Project Comments page for that text and, scrolling to the bottom, you will find an option for Return to Available. Before pressing the button, you can add a note in the 'update comment and project status' window. If the project is difficult or has problems like missing pages, it would be nice if you could state the issue in the comments in order to make others aware of it. Then you can click to put the project back into the Post-Processing Pool.

If you have done considerable work on this project, you might consider offering your work to another post-processor to complete. This is done by putting a new post in the Post-Processing forum which allows another Post Processor to contact you and arrange how to do the work transfer. Don't return the project to the PP pool if you want to try this option!

So what do I have to do?

[Note: many of these tasks are automated by tools designed to minimise the complexity and repetition of such jobs. Please refer to the software-specific tutorials or user guides to find out how to use these utilities effectively.

You may like to put notes about your progress in the 'Update project status comments' box which can be found on the Project Comments page of your project, towards the bottom. These comments are visible to all. They can act as a To Do list, or notes on points to watch out for from proofer comments, and are particularly useful if you have to take a break from post-processing for a short while, so you can start working with your project right where you left off.]

Do some research

Read the Project Comments & Project Thread for your PP project. If the proofers found anything of concern, make a note of it for special attention while processing the text. You will also need to make sure you follow the Project Manager's instructions for the text. Many request HTML versions for texts.

Check for asterisks * left by proofers, making you aware of questions/problems

Run a search for * to find notes left by proofers/formatters to make you aware of their questions/solutions and potential problems.

Check the markup

Make sure that the /* */, /# #/, etc. tags are balanced. Be sure that any poetry is in the correct markup to save messes later. This is a good time to check each poem is indented correctly or has the relative indents correctly added. Every <i> tag needs a closing and properly placed </i> tag and so on. You may wish to change some formatting tags to markup specific to your tool (e.g. /p p/, /f f/); check your tool's manual for details. Some PMs may request particular markup in the rounds. Also, check any markup that ranges over a page break and make sure it will still result in the desired formatting (usually by deleting all but the first and last markers for a particular section.)
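If your PP tool does not check marker balance for you, the idea can be sketched in a few lines of Python. This is only an illustration, not part of any DP tool: the function name `unbalanced_markers` is made up, and it assumes that every /* */ and /# #/ marker sits alone on its own line and that nested blocks close in reverse order of opening.

```python
# Hypothetical helper: report block markers that do not pair up.
# Assumes each marker is alone on its line (see the markup rules on this page).
PAIRS = {"/*": "*/", "/#": "#/"}
CLOSERS = set(PAIRS.values())


def unbalanced_markers(text):
    """Return (line_number, message) tuples for unmatched block markers."""
    problems = []
    stack = []  # (opening marker, line number) for blocks still open
    for number, line in enumerate(text.splitlines(), start=1):
        token = line.strip()
        if token in PAIRS:
            stack.append((token, number))
        elif token in CLOSERS:
            if stack and PAIRS[stack[-1][0]] == token:
                stack.pop()
            else:
                problems.append((number, f"{token} has no matching opener"))
    for marker, number in stack:
        problems.append((number, f"{marker} is never closed"))
    return problems
```

An empty list means every block marker found has a partner; it says nothing about whether the markup is in the right place, so the page-by-page check is still needed.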

The best way to do this checking is to check through the text page by page, opening the corresponding page-scan in your image viewer. You'll quickly notice unmarked poems or blockquotes this way. It's also a good time to check for missing pages (rare, but it does happen.)


Markup                             Rewrap?   Indent?
No special markup (the default)    yes       no
/* */: poetry, etc.                no        no
/# #/: block quotes, etc.          yes       yes


Not officially adopted, but in use by various PP tools

Markup                             Rewrap?   Indent?
/$ $/: tables, etc.                no        no
/p p/: poetry, etc.                no        yes

All of these markups require a blank line before the opening tag, and a blank line after the closing tag. They should be on a new line with no other text, unless your PP tool allows it.

Straighten up the Title Page, Table of Contents and List of Illustrations

When formatting the title page, you have a bit of leeway. You can adjust the pieces a bit if you like: for example, you could move the author's name directly under the "by". Relative indenting is not required, but can be added if you wish. For the table of contents and list of illustrations, please retain the page numbers. Line up the chapter titles and page numbers to make it look neat and easy to read. Copying the original format of the table of contents usually works fairly well. Leave all the original information on the title page, including the edition, year of publication and any copyright notice (unless this is a reprint - check with the Project Manager if in doubt.) It is better to keep as much information as possible than to try to find it once the book has been posted for years.

Footnotes

You will need to rematch footnotes split across pages. Then, in the plain-text version, you can put the footnote after the paragraph it references or at the end of the chapter or section. Make sure that your tags match both in the text and in the note itself. In-line footnotes {Footnote: within a line of text} are discouraged even when extremely short.

Consider using end-of-paragraph footnotes if the footnotes are short, unique, and not common. Use end-of-section or chapter footnotes for longer footnotes (such as those that have poetry or blockquotes), or those that have multiple references in the text for one footnote. In any case, be consistent within the work. Use all end-of-paragraph footnotes or end-of-section/chapter footnotes within one work. Don't switch back and forth.

For the HTML version, they can be moved to the end of the chapter or section or to the end of the project. They also need to be hyperlinked. Most of the post-processing tools will do this automatically. Refer to the tutorials, guides or manual for your software, to explain how.

You can renumber the footnotes if you like, so that each one in the book has a unique number, letter or roman numeral (although the latter options are not recommended for more than 20-30 footnotes as they become hard to read / distinguish.) This is the preferred method - it makes them easy to locate in a search. Alternatively, leave them as per the text, making sure that each chapter has the correct sequence.
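Most PP tools renumber footnotes for you, but the principle is easy to sketch. The Python below is a rough illustration only, under several assumptions: anchors in the text look like [1], footnote bodies start with [Footnote 1: , each footnote has exactly one anchor, and anchors and footnotes appear in the same order. The name `renumber_footnotes` is invented for this example. Any other bracketed number in the text (a year, say) would be miscounted, so check the result by eye.

```python
import re

# Matches either the start of a footnote body or an anchor in the running text.
TOKEN = re.compile(r"\[Footnote (\d+):|\[(\d+)\]")


def renumber_footnotes(text):
    """Give every footnote a unique number, counting from 1 in order of appearance."""
    anchors = 0
    notes = 0

    def replace(match):
        nonlocal anchors, notes
        if match.group(1) is not None:  # start of a footnote body
            notes += 1
            return f"[Footnote {notes}:"
        anchors += 1  # an anchor in the running text
        return f"[{anchors}]"

    new_text = TOKEN.sub(replace, text)
    if anchors != notes:
        # A mismatch usually means a lost anchor or a shared footnote.
        raise ValueError(f"{anchors} anchors but {notes} footnotes")
    return new_text
```

The count check at the end is deliberate: if the numbers of anchors and footnotes differ, it is safer to stop and look than to renumber blindly.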

Check the text for the following problems:

  • spaces around hyphens
  • spaces before punctuation . ! ? ; : ,
  • spaces around quotes in English and/or LOTE
  • spaces around (){}[]
  • spaces within abbreviations
  • multiple spaces in non-marked text (skip /* poetry */ )
  • incorrectly formatted thought breaks
  • incorrectly formatted ellipses (according to the rules of the text's language.)
  • correct any 3-dash em-dashes (---)
  • check appropriate spacing of em-dashes (-- and ----)
  • sort out any asterisks/stars/daisies/comments left in the text by proofers and formatters. As the PPer, you are responsible for resolving problems noted by the proofers. If you need advice or a second opinion, try any of the methods listed in the Help section. You can also rejoin words split across pages at this point.
  • compare hyphenated words throughout the text. If there is a printer's error with many clear uses of the word using the other spelling, you can change the word; e.g., 20 occurrences of "to-morrow", but only one of "tomorrow". If there is not a clear majority, you will have to decide whether to leave them as-proofed, or make a judgment on which way to change the odd one out (and then, whether you note this in a Transcriber's Note or not.)
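Several of the checks in the list above are simple pattern searches. As a hedged sketch (these are not the patterns Guiguts or Gutcheck actually use, and the names `CHECKS`, `spacing_problems` and `hyphen_variants` are invented here), something like the following Python would flag a few of them for you to inspect. It skips the multiple-space check inside /* */ blocks, as the list suggests, and it only reports candidates; you still decide what to change.

```python
import re
from collections import Counter

# Illustrative patterns only; tune them for your text's language.
CHECKS = {
    "space before punctuation": re.compile(r" [.!?;:,](?=\s|$)"),
    "space around hyphen": re.compile(r"\w - \w|\w- \w|\w -\w"),
    "three-dash em-dash": re.compile(r"(?<!-)---(?!-)"),
    "multiple spaces": re.compile(r"\S {2,}\S"),
}


def spacing_problems(text):
    """Return (line_number, label) for each check that fires on a line."""
    problems = []
    in_nowrap = False
    for number, line in enumerate(text.splitlines(), start=1):
        token = line.strip()
        if token in ("/*", "*/"):
            in_nowrap = token == "/*"
            continue
        for label, pattern in CHECKS.items():
            if label == "multiple spaces" and in_nowrap:
                continue  # poetry spacing is deliberate
            if pattern.search(line):
                problems.append((number, label))
    return problems


def hyphen_variants(text):
    """Map each hyphenated word to (its count, count of the unhyphenated form)."""
    words = Counter(w.lower() for w in re.findall(r"[A-Za-z]+(?:-[A-Za-z]+)*", text))
    return {
        word: (count, words[word.replace("-", "")])
        for word, count in words.items()
        if "-" in word and word.replace("-", "") in words
    }
```

For example, `hyphen_variants` run on a text with twenty "to-morrow" and one "tomorrow" would report that pair with its counts, which is the information you need for the majority decision described above.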

Rejoin pages

Remove page separators, checking either side of them to see if the next page requires a paragraph space, is a section or chapter, or needs to be continuous text. This is a good time to remove [Blank Page]s too!

Spellcheck

Even if it looks like it's going to be a pain, spellchecking is almost always needed. Texts written before spelling was regularised might be the only reasonable exception, but even for those spellchecking is often useful. Even books with dialect or other deliberate non-standard spelling can be spellchecked. In the days of just two rounds, the spellcheck was often the most time-intensive part of post-processing. Hopefully it's easier now!

Handle any Illustrations

Move each illustration tag to an appropriate point in the text. Some PPers like to have them just before, or after, the text they illustrate. Others prefer to place them at the end of the chapter, not wishing to interrupt the flow of the text. Do whatever you think is right for your book.

Note: We keep illustration markers in the plain-text version, in case people want to refer to the HTML version later. Please do not delete them. If you do not want to produce an HTML version, but your book has pictures, post in the HTML pool, where you can enlist someone to generate the HTML and pass it back to you for uploading. HTML versions are required for every book produced at DP with pictures (even if the Project Manager does not request it).

Rewrap the text

Time to rewrap. Did you see any poetry, tables, etc.? If not, rewrapping the lines should be easy. You will need to rewrap the lines to between 65 and 75 characters in length. Each program has a different way of doing this, and you will have to find the way that works best for you. Read the manual or instruction book for your utility.

If worst comes to worst and you cannot find an easy way to rewrap the lines, find and replace all line breaks with spaces, count along any line to see approximately where 65–75 characters falls, and insert line breaks manually at that point. It's painful, but it works. (Be grateful that you chose a book with a low page count!) However, this extreme step should not be necessary.

Once your text is suitably rewrapped, remove any end of line spaces (again, use the PPing software wherever possible! All current tools include this task.)
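If you have no PP tool to hand, a scripting language can do the rewrap. The sketch below uses Python's standard `textwrap` module and is only one possible approach: the function name `rewrap` and the default width of 72 are choices made for this example. It assumes paragraphs are separated by blank lines, leaves everything between /* and */ untouched, strips end-of-line spaces, and does NOT add the indent that /# #/ block quotes need.

```python
import textwrap

BLOCKQUOTE_MARKERS = {"/#", "#/"}


def rewrap(text, width=72):
    """Rewrap ordinary paragraphs to `width`, leaving /* */ blocks as they are."""
    out, paragraph = [], []
    in_nowrap = False

    def flush():
        # Emit the paragraph collected so far as wrapped lines.
        if paragraph:
            out.extend(textwrap.wrap(" ".join(paragraph), width=width))
            paragraph.clear()

    for line in text.splitlines():
        token = line.strip()
        if in_nowrap or token == "/*":
            flush()
            out.append(line.rstrip())  # keep layout, drop trailing spaces
            in_nowrap = token != "*/"
        elif token == "" or token in BLOCKQUOTE_MARKERS:
            flush()
            out.append(token)
        else:
            paragraph.append(token)
    flush()
    return "\n".join(out)
```

Save the result under a new filename and compare it with the previous version before carrying on.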

Check Formatting

Chapters should have four blank lines above them, one between lines of the chapter heading, and two blank lines after, but before the main text of the chapter starts. Sections should have two blank lines above, and one blank line after. This is all as per DPC Formatting Guidelines. Thought breaks should be marked with asterisks (seven spaces, then five asterisks, each separated by five spaces, i.e.,       *     *     *     *     *). PGC will accept alternatives to all of these, but the important thing is to be consistent throughout your book.

Poetry should be indented at least 2 spaces (this is a PGC requirement, to prevent rewrapping in future versions of your text). Indents within the poem should be added on to your chosen indent (i.e., if a line is indented by 2 from the line above, and you are using a 4-space indent for poetry, in your final version this line will be indented 6 spaces altogether).
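The arithmetic in that example (a relative indent of 2 plus a chosen indent of 4 gives 6 spaces) can be shown as a tiny Python sketch. The function name `indent_poem` and the default of 4 are only illustrative; your PP tool will normally do this for you.

```python
def indent_poem(lines, base=4):
    """Add `base` spaces to every non-blank line, keeping relative indents."""
    return [(" " * base + line.rstrip()) if line.strip() else "" for line in lines]
```

For instance, `indent_poem(["First line", "  second line"])` leaves the first line with 4 leading spaces and the second with 6.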

Block quotes should also be indented to show their separation from the rest of the text.

If you have aligned page numbers in a Table of Contents or List of Illustrations, be sure to indent all lines appropriately too, to avoid rewrap/respacing.

Gutcheck!

The Gutcheck tool was written specifically to pick out many of the most common problems with PG texts. It is probably the single most important check you will perform. Follow the instructions with your PP software. If you are not using a PP-specific tool, you can download Gutcheck from here, and run it according to the instructions given there. Run the check initially with all options turned on. Check every potential problem that it brings to your attention. Not all Gutcheck "flags" are genuine errors (for example, it may report short lines where the text contains poetry or a table), but each must be looked into and corrected if necessary. Continue to run Gutcheck after each series of corrections until it doesn't flag any more "true" errors.


Some common things to watch for:

  • Footnote markers are falsely flagged as 'Wrongly spaced brackets'. Check them anyway.
  • Lengthy hyphenated words often cause short lines above or below. Try rewrapping that paragraph a few spaces shorter, which often 'rearranges' the words sufficiently to cure this error. Short lines for the Table of Contents, lines of poetry, etc are okay.
  • PG guidelines suggest regular text should not be more than 75 chars wide. This is relaxed to 80 chars for tables or other essentials (long-line poetry might be another example.) If there's absolutely no way to shorten a feature such as a family tree, you can leave it as is. It is often worth posting in the Post-processing forum about this when you find it, as others may see a sensible way to condense or reformat the feature.
  • Unless you are checking a deliberately-ASCII version of your text, you do not need to worry about characters flagged by "Non-ASCII character".
  • Wrongspaced quotes often appear where characters' quoted speech runs through several paragraphs. Check these, but if they are right according to the Proofing and Formatting Guidelines, that's good enough for Gutenberg posting.
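To see where the long lines are before or between Gutcheck runs, a very small script is enough. This Python sketch is not Gutcheck and does not imitate its other checks; the name `long_lines` is invented, and the limit of 75 is taken from the guideline above.

```python
def long_lines(text, limit=75):
    """Return (line_number, length) for every line longer than `limit` characters."""
    return [
        (number, len(line))
        for number, line in enumerate(text.splitlines(), start=1)
        if len(line) > limit
    ]
```

Calling it again with `limit=80` shows which of those lines would still be too wide even under the relaxed allowance for tables.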


At the end of the above process, you have a processed book which contains HTML markup, as well as DP tags like [Footnote] or <tb>. Save a copy of this 'dual' file, calling it something like <name-backup.txt>. If you want to make your book available for smoothreading, now's the time.

For unusual features (tables, Greek, poetry, etc.), see the Help! section.


Is that it?

Well, it's the basics done. There are some additional steps you can take to make your text the best it can be.

Paranoid Text Checks (Stealth Scannos, etc.)

These may be run by separate tools or by your main PPing program. Refer to the manual or tutorial for the toolset you are using, or ask in the PPing forum.

Examples include 'smart' programs which can check for he/be irregularities, or regexes (a form of search) which flag unusual letter combinations, such as 'tb' (possible scanno for 'th') or 'rn' (for 'm').

Various regex-searches are available and some tools will run these as a set, through your usual search-&-replace box - again, check the manual / tutorial for the software you're using. Otherwise, have a look at the Regular Expression Clinic for more information and help.
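To give a feel for how such a check works, here is a deliberately simple Python sketch. It is not the method used by any particular tool: the name `stealth_scanno_candidates` and both rules are assumptions made for this example. It flags every word containing 'tb', and flags a word containing 'rn' only when the same word with 'm' in its place also occurs in the text. Expect false positives (ordinary words such as "outbreak" contain 'tb'); every hit needs checking against the page scan.

```python
import re
from collections import Counter

WORD = re.compile(r"[A-Za-z']+")


def stealth_scanno_candidates(text):
    """Return {word: count} for words worth checking against the scan."""
    words = Counter(w.lower() for w in WORD.findall(text))
    flagged = {}
    for word, count in words.items():
        if "tb" in word:
            flagged[word] = count  # possible scanno for 'th'
        elif "rn" in word and word.replace("rn", "m") in words:
            flagged[word] = count  # the 'm' spelling also occurs in the text
    return flagged
```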

A great formatting check to run is the regex \n\n\n which catches all chapter and section spacing allowing you to confirm its consistency, as well as finding any extra line breaks between paragraphs - especially common after blockquotes or poetry. It's a good idea to run this again on the text version, after you've removed markup such as /* and /#. (See below.)
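If your editor has no regex search, the same check can be sketched in Python. The function name `blank_runs` is made up for this example; it reports where each run of two or more blank lines starts and how long it is, so you can compare the numbers against the spacing you chose for chapters and sections.

```python
import re


def blank_runs(text):
    """Return (line_number, blank_count) for every run of two or more blank lines."""
    runs = []
    # One newline ends the preceding line; each further newline is a blank line.
    for match in re.finditer(r"\n(\n{2,})", text):
        first_blank_line = text.count("\n", 0, match.start()) + 2
        runs.append((first_blank_line, len(match.group(1))))
    return runs
```

A run of 4 should sit above every chapter heading and a run of 2 above every section if you are following the spacing described under Check Formatting; anything else deserves a look.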

Smoothreading

When you are finished with the general checks & work on your book, and are ready to produce plaintext (and other formats if required / desired), you can make it available for smoothreading. An extra pair of eyes is always helpful in finding things you might have overlooked in the text. Smoothreading is generally done on a text version, so save a new version of your book (under any name) and convert <i> and </i> to _, and <b> and </b> to the markup of your choice (see the bold-markup thread for discussion of options.) Change <tb> to the usual asterisks. Remove all other markup for the moment. Then zip this .txt file up into a new archive.

Now go to the Project Page for your book. At the bottom of the page, you have three options: make the project available for smoothreading for one week, two weeks or four weeks. Select the desired duration and upload the zipped text file for smoothreading. You can provide comments about what to look for during proofing, or ask for attention in a particular section (this is very helpful in long texts.) You might also like to advertise the availability of your book in the Project's Thread, or in relevant team threads (see the team-list thread for ideas.)

Smoothreaders will mark errors in the text with [*description of query]. This is a standard format and should not be altered in the comments. When they finish, they will upload the smoothread project back to the pool page. At the end of the smoothreading period, you can download the smoothread versions and search the text for [*'s. Not all [* comments] will be valid; just correct those that are. Make your corrections in the master file which still contains <i> markup (or else make each change in every version of the text that you have, e.g. plaintext and HTML or LaTeX.)
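Collecting the smoothreaders' queries into one list can make them easier to work through. A minimal Python sketch follows, assuming the comments use the [*description] format and do not themselves contain a closing square bracket; the name `smoothreader_notes` is invented for this example.

```python
import re

SMOOTH_NOTE = re.compile(r"\[\*([^\]]*)\]")


def smoothreader_notes(text):
    """Return (line_number, comment) for each [*comment] found in the text."""
    return [
        (text.count("\n", 0, match.start()) + 1, match.group(1).strip())
        for match in SMOOTH_NOTE.finditer(text)
    ]
```

The line numbers refer to the smoothread file, not your master file, so use the surrounding words to find the matching place.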

While your book is being smoothread, why not start work on any other formats that are required, or begin fixing up any illustrations?

Transcriber's Notes

If you make any changes to the text it is a good idea to include a Transcriber's Note. Sometimes these are quite simple:

Transcriber's Note: Punctuation has been normalised.

A useful general one, especially for older, less regular texts, is:

Transcriber's Note: All printer's errors retained.

(This one stops the PGC white-washers from getting long errata requests to "fix" your text. It is not, however, an excuse for leaving in bad OCR, scannos or similar detectable problems that are wrong in comparison to the page scan.)

Sometimes these can be quite lengthy.

Transcriber's Notes:
Page 13, "10,00 troops" changed to "10,000 troops." (We fought 10,000 troops at St Germaine.)
Page 27, "Faw-cett" changed to "Fawcett". (Major Fawcett dictated the memo.)
etc. etc.

While we don't retain the individual page numbers in the text version, this gives the reader an idea of where the change is, and they can search for the text you have included in the parentheses to find the exact location of your edit.


In the HTML version, the use of "hover" or "inserted" tags is a good way to shrink your list of changes while still maintaining the integrity of the original. Check the Post-processing Forum for ways of doing this.


Most post-processors do fix obvious printer's errers (such as "errers"). But do not modernise the language or switch it from English to American or the other way around. We are preserving history, not improving it.

Put shorter notes, or ones that apply to the whole text in a general way at the start of the book (before the title page.) Put longer lists at the end of the book (after any index or footnotes.) Transcriber's Notes are optional, but can help the reader's understanding of how you've processed the text. It's up to you how much or how little you note. If in doubt, talk to other Post-processors in the forums, Jabber or by PM about how they've handled various situations.

I've finished - now what..?

Creating a plaintext version

Take the file you've been working on and name it something like: <funnyname-text.txt> Make sure you still have a version of the file containing the markup! Start removing markup from the -text.txt file. Rewrap markers need to go, <i> and </i> need to be changed to _ and <b> / </b> to your preferred bold markup. <tb> should be replaced with the old thoughtbreak line of asterisks - that is: 7 spaces, followed by 5 stars, each spaced by 5 from the next. Something like this:
       *     *     *     *     *
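The markup removal described above is mostly search-and-replace, and your PP tool will usually do it for you. Purely as an illustration, here it is as a Python sketch. The names `to_plaintext` and `leftover_angle_brackets` are invented, and "=" for bold is only a placeholder for whatever bold markup you prefer. Small caps (<sc>) are deliberately left alone, and the block markers are assumed to sit alone on their own lines.

```python
import re

# 7 spaces, then 5 asterisks separated by 5 spaces each.
THOUGHT_BREAK = " " * 7 + (" " * 5).join("*" * 5)
BLOCK_MARKERS = {"/*", "*/", "/#", "#/"}


def to_plaintext(text, bold="="):
    """Convert <i>, <b> and <tb>, and drop rewrap markers."""
    text = re.sub(r"</?i>", "_", text)
    text = re.sub(r"</?b>", bold, text)
    text = text.replace("<tb>", THOUGHT_BREAK)
    lines = [line for line in text.splitlines() if line.strip() not in BLOCK_MARKERS]
    return "\n".join(lines)


def leftover_angle_brackets(text):
    """Return (line_number, line) for lines that still contain < or >."""
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), start=1)
        if "<" in line or ">" in line
    ]
```

Run the second function on the output of the first: anything it reports is markup that still needs handling by hand.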

If your project contains any <sc> markup, refer to the Guide to Small Caps to find out how to handle them.

Do a quick search for the < and > characters to make sure none have slipped through.

If you want to tidy your footnotes (that is, make them read [1] text, rather than [Footnote 1: text]), do it now.

Do one final gutcheck, to make sure that there are no remaining problems, and that no issues have been introduced during the tidy-up process (such as short lines being left after the removal of HTML markup.)

Creating an HTML version

See below. Use a copy of the marked-up file, named something like: <funnyname-htm.html>. Make sure you keep a version of the marked-up file for backup and reference.


Uploading for PPV

Create a new zip file. Keep the filenames short, with only letters, numbers, hyphens, and/or underscores -- no spaces or special characters like ?, #, $, etc. Filenames and directory names must be all lowercase. Add into this archive your plaintext file, and any other formats that you've made. Any illustrations should already be stored inside an "images" folder, and this entire folder should be added to the zip archive. If you've been post-processing with Guiguts, add into the archive the .bin file for the plaintext version - this is incredibly helpful for PPVers who also use Guiguts. It won't be uploaded to PG.
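The filename rules can be checked mechanically before you zip. This Python sketch is one reading of the rules above, not an official validator: the name `bad_filenames` is invented, and it assumes that dots are allowed only as separators before extensions such as .txt or .txt.bin.

```python
import re

# Lowercase letters, digits, hyphens and underscores, with dot-separated extensions.
VALID_NAME = re.compile(r"[a-z0-9_-]+(\.[a-z0-9_-]+)*")


def bad_filenames(names):
    """Return the names that break the lowercase, no-spaces, no-specials rule."""
    return [name for name in names if not VALID_NAME.fullmatch(name)]
```

Check folder names the same way, since the rule applies to directory names too.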

Your zip archive should look something like this:

  • projectname.txt
  • projectname.html
  • projectname.txt.bin (if you have it from Guiguts)
  • projectname.html.bin (if you have it from Guiguts)
  • images/ (a folder)
    • image1.png
    • image2.png
    • image3.png
  • projectname-utf.txt (ONLY if you are including a UTF-8 file as well as or instead of projectname.txt)

Depending on your zip software, you may have to adjust its settings to "Save Relative Paths." This prevents the PPVer from getting extra (undesired) folders on their computers. If you are using a Mac, you may need to "omit Finder files" too (leaves out invisible files).
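If your zip software makes relative paths awkward, Python's standard `zipfile` module can build the archive instead. This is a sketch under stated assumptions: the function name `build_archive` is invented, the project files are assumed to live together in one folder, and "invisible files" is taken to mean any file whose name starts with a dot (such as .DS_Store).

```python
import zipfile
from pathlib import Path


def build_archive(project_dir, archive_path):
    """Zip everything under project_dir using paths relative to that folder."""
    project_dir = Path(project_dir)
    archive_path = Path(archive_path)
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(project_dir.rglob("*")):
            if not path.is_file() or path.name.startswith("."):
                continue  # skip folders themselves and hidden files
            if path.resolve() == archive_path.resolve():
                continue  # never add the archive to itself
            archive.write(path, arcname=path.relative_to(project_dir).as_posix())
    return archive_path
```

Opening the finished zip and comparing its listing with the layout shown above is a quick way to confirm that no extra folders crept in.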

Go to the Project Comments page for your book, and select Upload for Verification from the buttons at the bottom of the page. Include an email address in your PP comments section if you want email notification when the book is Posted! You can also use these comments to note any checks you've done or point out special features of the work which the PPVer should be alert to. Ensure that your preferences with regard to PP credits are correct, as they are what will be used to credit you in the finished book. If you do not wish to be credited, or would like a different name to be used, please note this in the comments.


Your email address will not be displayed in the credits line, but can be used by the PPVer to give you feedback (if you request that option.) If you do not include an email address, this feedback will be sent via a personal message on the site.

What happens to my book now..?

First, your book goes to an experienced post-processor for Post-Processing Verification (PPV). This person will carefully go over your work, making sure that all of the requirements have been met, i.e., spellcheck has been done, images are correctly sized and formatted, it passes Gutcheck, the HTML is valid, etc. Sometimes a PPVer will request that a project be returned to you for further work. This does not mean you are a horrible post-processor. It probably just means that you missed a step or two of the process. An email or PM will accompany a return, explaining why and what steps you can take to repair your file, and usually offering assistance or suggesting where assistance can be obtained.

After your work has passed PPV, the PPVer uploads it to Project Gutenberg Canada. There a friendly Gutenberg white-washer (WW) will make a final check of your work (and the PPVer's work) and add the Project Gutenberg Canada boilerplate of names and legal information. Sometimes a WWer will have a question for you and that question may come through your PPVer.

Finally your project is posted on Project Gutenberg Canada for the world to enjoy! Congratulations! After your project posts, you will receive feedback from your PPVer. This feedback will tell you the great things that you did along with any suggestions for improvements in future work. Feel free to contact your PPVer with any questions that you have about your project. If you do not receive feedback and your project has posted, please drop a line in the PPV forum and someone will look into it.

If you find an error after the book has posted (really, it sometimes happens), send a note to your PPVer telling him or her what you have found and he/she will contact the WWer.


I've been granted direct upload access. What do I do?

Once you've had several projects run through the PPV process, and have been granted the ability to upload your work directly to PGC, you will be sent instructions by the PPV coordinator on how to proceed. Please also see the Guide to Direct Uploading (DU) and Posting to PGC for more details.




Help! I've got a problem with ...

Missing or Problem Images or Pages

Sometimes the CP accidentally skips one or more pages when scanning. They usually check through afterwards for any missing pages, but don't rely on this - check for yourself, too. Occasionally the scan is present, but part or all of it is unreadable. First, attempt to contact the PM to get a better scan. If the PM is for some reason unable to get a good scan for you, there are other people that can try to get these pages for you. Find them on the Missing Pages Wiki. You can research here, looking at various library catalogues for Missing Page Finders and contacting them by PM if you find a copy of your book in their library. If no catalogues seem to have the book, log in as DPC_Wiki and post the book's details and your username in the 'Missing Pages' list.

If you do obtain additional pages, illustrations or just replacement images, please let $db_requests_email_addr know. Include the location of the images on dpscans (your page-finder can tell you this) and the title and projectID of the project, and the project will be fixed for you. This is very important for archiving purposes.

Projects with Multiple Parts

If you have multiple sections of a single book and you would like to have them 'stitched together' for ease of PPing, please email $db_requests_email_addr. Working on one file is easier than doing your own 'stitching' or working on separate files, and is especially helpful if you wish to retain the original page numbers in HTML, or if the png image names overlap.

NOTE: Multiple volumes of a book can be posted to PGC separately if appropriate - they do not need this treatment. If in doubt, ask the PM or email $db_requests_email_addr.


Other Formats

Every project MUST include a plaintext file (unless the project absolutely will not work in plaintext form, e.g. a musical score). However, there are other formats, which add information and value to the basic text.

HTML

This is the most common non-plaintext format requested or required for projects. If you are working on a text which is part of an uber-project or is a periodical, you may find a style guide defined for you - check the Uber-threads forum. Otherwise it is up to you to make the HTML consistent and readable. To produce HTML, you may wish to use tools that you are familiar with, if you have done web-editing previously. The major PPing tools will also produce basic HTML which will just need some polishing to be valid and look attractive. Finally, you can use PG2HTML on your plaintext, which will generate a very basic HTML version for you to work with.

Ask in the Post-processing forum if you'd like more help with this. Many people have learned HTML for the first time here, as part of their post-processing, and it doesn't have to be terribly difficult!

Alternatively, you can post in the HTML pool, giving a little information about your project. Other DPers enjoy the process of making HTML and will be happy to produce an HTML file for you.

HTML is essential for projects with illustrations. It is also very useful for projects with many footnotes (because they can be hyperlinked back and forth, making the text much more usable) or with letters from outside the roman alphabet (such as Greek, which can be encoded so that any reader with an adequate font will see the Greek letters). Even if your project has none of these, many readers will find an HTML file more readable than the plaintext, and if possible, it's always worth producing one.

If you do an HTML version, make sure that the <title> tags contain the phrase The Project Gutenberg Canada eBook of <title>, by <author>. So for example, if the project you were doing was "A Christmas Carol" by Charles Dickens, you would make sure the title tags looked like: <title>The Project Gutenberg Canada eBook of A Christmas Carol, by Charles Dickens</title>
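If you'd rather not type the title line by hand, here is a minimal Python sketch that builds it from the wording given above. The function name `pgc_title_tag` is made up for this example and is not part of any DP or PGC tool.

```python
from html import escape


def pgc_title_tag(title: str, author: str) -> str:
    """Build the <title> element using the phrase this FAQ asks for."""
    text = f"The Project Gutenberg Canada eBook of {title}, by {author}"
    # Escape &, < and > so a title containing them still gives valid HTML.
    return f"<title>{escape(text, quote=False)}</title>"


print(pgc_title_tag("A Christmas Carol", "Charles Dickens"))
```

Paste the printed line into the head of your HTML file.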

Lilypond

This is used for representing music - ask for help or information from the Music Team.

LaTeX

This is used for mathematical content within a text - see the section on LaTeX below, or contact the LaTeX team for more help.

TEI

This is a form of markup which is used to generate plaintext, HTML and other formats by automatic converters. PG does not currently use the full TEI standard, but is evolving a subset. At the time of writing, PG-TEI submissions are welcome. As long as they are valid PG-TEI documents, David Widger of the Whitewashing Team will take the .tei file and autogenerate the plain text, HTML and PDF files. To check the final output is what you want, use the online converters at the PG-TEI site (there is one for text, one for HTML and one for PDF). This is a new process, so it may take longer than usual and be subject to minor amendments. Check the Post-processing forum for more information about PG-TEI (but this material is migrating to a PG-TEI Wiki page). Alternatively, you can post-process using TEI but submit only the product of your own transformations (plaintext, HTML and other formats if desired). Such projects are likely to be posted more quickly, and will look more exactly as you specify.

PDF

This is very useful for certain projects (e.g. those involving LaTeX) which benefit from a fixed paginated layout and embedded specialised fonts, but is less helpful for other projects, as it is difficult to make changes to the PDF once it's complete. However, some PG readers do like this format. If you are considering a PDF version, you may wish to discuss it with a PP Mentor or the Project Manager. Note: This is one of the three formats automatically created if you use PG-TEI as a master format for your project (along with text and HTML).

DOC, or other proprietary text formats

Project Gutenberg Canada will accept these, but prefers not to. Issues of software compatibility and conversion arise more frequently with these formats than with simple plaintext or HTML. If you are considering an unusual format, you may wish to discuss it with a PP Mentor or the Project Manager.

Unicode, UTF-8, UTF-16 etc.

A UTF-8 file can be produced if you require characters which are not in Latin-1 (the 'usual' character-set used by DPC in the proofing window and dropdowns.) This probably isn't useful if you have a single word with an œ, but if your project has a fair amount of Greek or other characters, a UTF-8 version will preserve the text most faithfully. Most PPing tools have support for this - check the user guide or manual for more information. Ask in the Post-processing forum for more help with this. Also ask if you think you will need to use UTF-16 ... it's not common and there may be a good alternative.

If you produce a UTF-8 text file, name it something like projectname-utf.txt when you upload for PPV. This will help the PPVer and is how the Whitewashers have requested such files be labelled. The WWers have a script to automagically convert UTF-8 characters to a reasonable ASCII equivalent, so that both formats will be posted in the final PGC archive. If you think that this conversion process will not produce a readable / useful ASCII file, you can produce one of your own with all non-ASCII characters translated. This probably isn't very useful if your file contains œ, š, ĭ etc. -- just send up the UTF-8 file. But if you have a book with a lot of ♣, ♠, ♥ and ♦, for example, or a book in Esperanto, where an 'x' after certain letters in ASCII stands in for their accent, a useful character-translation can be made. Name that file projectname.txt, as usual. Make sure it really *is* ASCII and that no non-ASCII characters have slipped in! Remember to include both files when you upload for PPV.
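One way to make sure the ASCII file really is ASCII is to let a short script hunt for strays. This is only a sketch, not the Whitewashers' script; the function name `find_non_ascii` is invented for this example.

```python
def find_non_ascii(text: str) -> list[tuple[int, int, str]]:
    """Return (line number, column, character) for every non-ASCII character."""
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ord(ch) > 127:  # ASCII stops at code point 127
                hits.append((line_no, col, ch))
    return hits


sample = "Plain line\nA stray œ here\n"
for line_no, col, ch in find_non_ascii(sample):
    print(f"line {line_no}, column {col}: {ch!r} (U+{ord(ch):04X})")
```

To check a real file, read it with `open("projectname.txt", encoding="utf-8").read()` and pass the result in. An empty list means nothing has slipped through.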

Symbols and Scripts, Non-ASCII Characters, Non-Latin Scripts and Downright Weird Things

Post, with a link to the offending page image, in the Post-Processing forum. Asking there will net you a varied range of ideas about whether the problematic ink blob is in Latin-1, UTF-8, Unicode or can be improvised using ASCII-art or represented in another fashion.


Footnotes

  • I have 18 [1]'s in the text and only one [Footnote 1: ]

Sometimes many tags reference the same footnote. This is not a problem; just make sure that all 1's go to the [1] Footnote.

  • I have no anchor-text for this tag / I have no tag for this footnote text.

If you can make out where the tag should go in the text, then it is probably best to insert it with a Transcriber's Note. If there is a tag without a footnote, then just a Transcriber's Note is probably best. See the section on Transcriber's Notes for details on how to word this and where to place it.

  • I can't read this tiny text!

If neither you nor the proofers can figure out the footnote, contact the PM or see the Missing Pages section to obtain a clearer scan.
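A quick script can list tags that have no footnote and footnotes that have no tag before you go hunting by eye. The sketch below assumes footnote labels are either numbers or single capital letters, written as [1] in the text and [Footnote 1: ...] for the note itself; adjust the patterns if your project differs. The name `check_footnotes` is made up for this example.

```python
import re

LABEL = r"(\d+|[A-Z])"  # assumed label style: 1, 2, 3 ... or A, B, C ...


def check_footnotes(text: str) -> tuple[set[str], set[str]]:
    """Return (tags with no footnote, footnotes with no tag)."""
    notes = set(re.findall(r"\[Footnote " + LABEL + r":", text))
    tags = set(re.findall(r"\[" + LABEL + r"\]", text))
    return tags - notes, notes - tags


sample = "Text[1] more[1] and[2].\n\n[Footnote 1: A note.]\n[Footnote 3: Orphan.]\n"
missing_notes, unused_notes = check_footnotes(sample)
print("tags with no footnote:", sorted(missing_notes))
print("footnotes with no tag:", sorted(unused_notes))
```

Several tags pointing at one footnote are fine and are not reported, since only the set of labels is compared.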

Sidenotes

Many PPers panic when they see sidenotes. This is usually the wrong reaction (though it is sometimes justified).

The simplest case is when there are few sidenotes, usually only one per paragraph, and usually at the start of the paragraph. In this case you can just put them before the paragraph they refer to. In the plain text it is probably best to leave them inside the [Sidenote: blah] markup, so the reader can tell what they are. Some people like to leave a blank line between the sidenote and the paragraph.

In HTML, it is probably best to float them off to the side. You can choose whether to put them in the margin, so that they don't interrupt the flow of the text, or whether you want them to stick into the text which will then flow around them. It's probably best to follow the original as much as you can. If you want them in the margin, you'll probably want to use a larger margin than usual in order to make room for them. It is probably easier to read the HTML version if you put all the sidenotes on the same side.

If there are lots of sidenotes, with many sidenotes per paragraph, the situation gets more complicated. Putting them all at the start of the paragraph will lose information: in most cases there is a definite place in the text that the sidenotes are connected to, and that's roughly where the sidenote should go.

In the plain text there are at least a couple of options. The first is to put each sidenote (still in its [Sidenote: blah] markup) at the start of the sentence to which it refers. This has the advantage that the text stays easy to read, but if the sentences are long the sidenotes may end up quite far from their referents. The second is to try to place the sidenotes more exactly, by putting them in the middle of sentences. This makes for a text that is much harder to read.

Sidenote placement is easier in the HTML version, in that it is easy to get them closer to their referents. However, you should check them in several different browsers and at different browser window sizes, as it is very easy to get them overlaying each other so that they are illegible.


Illustrations

Illustrations are the thing most PPers have a hard time with at first, and are the most common reason for PMs to require an HTML version. If you don't feel comfortable with HTML or dealing with images, it's ok! Make use of the HTML-pool -- there are many people who don't like to do the text portion of a project.

Not frightened off? Good!

Some projects have only one or two images, like a frontispiece or an author portrait. The image files should be resized so they appear full-size within the HTML document.

Other projects are heavily illustrated, and often the point of posting them is to make the illustrations available to a wide audience. Image-heavy projects should use small copies at lower resolution and colour-depth (commonly called thumbnails) that link to larger, better quality illustrations. This allows people on dial-up to get a feel for the project, without having to wait hours for the HTML to load. Check to see if your PM included high-resolution illustrations in the upload file. These are usually located at the very end of the images file and often have the extension .jpg.

A good rule of thumb is 400-600px for thumbnails and full-page illos, and 1200px or lower for linked-to images. Other illustrations should scale accordingly. If your largest image is a full-page illustration that you have made 400px, then the emblems that appear above the chapter headers should be only a fraction of that. Imagine that you are holding the book: estimate what fraction of the page the illustration takes up, and adjust your px number accordingly. If you need to decrease the file size further, or touch up the pictures in some way, please see the Image tutorial or the Guide to Image Processing.
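The "fraction of the page" sum is simple enough to do in your head, but here it is as a small Python sketch in case you have many images to size. It assumes the fraction is measured across the page width; the function name `scaled_width` is invented for this example.

```python
def scaled_width(full_page_px: int, page_fraction: float) -> int:
    """Width in pixels for an image that spans page_fraction of the page width."""
    if not 0 < page_fraction <= 1:
        raise ValueError("page_fraction must be greater than 0 and at most 1")
    return round(full_page_px * page_fraction)


# A chapter-head emblem about a quarter of the page wide, in a project
# whose full-page illustrations are 400px wide.
print(scaled_width(400, 0.25))
```

So with 400px full-page illustrations, an emblem a quarter of the page wide comes out at 100px.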

If even the high-resolution images are too dark or corrupted, contact the PM for replacements. Sometimes the PM has given you the best the book had to offer. Your next option is to try the Missing Pages Wiki to see if someone can provide better scans of the images.

All images for use in your final HTML should be stored inside a folder called /images within the project directory. Do a final check, when you've completed work on the HTML, to make sure that all images are used correctly within your page, and that you haven't included any temporary or redundant files.
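The final check can be partly automated. The sketch below compares the image files an HTML page refers to with the files actually present; it assumes your HTML refers to images with relative paths beginning images/, and the names `ImageRefCollector` and `compare_images` are made up for this example.

```python
from html.parser import HTMLParser


class ImageRefCollector(HTMLParser):
    """Collect every src or href value that points into images/."""

    def __init__(self):
        super().__init__()
        self.refs = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("src", "href") and value and value.startswith("images/"):
                self.refs.add(value[len("images/"):])


def compare_images(html_text: str, files_in_images: set[str]) -> tuple[set[str], set[str]]:
    """Return (referenced but missing, present but never used)."""
    collector = ImageRefCollector()
    collector.feed(html_text)
    return collector.refs - files_in_images, files_in_images - collector.refs


page = ('<a href="images/plate1.jpg"><img src="images/plate1_th.jpg" alt=""></a>'
        '<img src="images/map.png" alt="">')
missing, unused = compare_images(page, {"plate1.jpg", "plate1_th.jpg", "scratch.bmp"})
print("missing:", sorted(missing))
print("unused:", sorted(unused))
```

For a real project you could build the second argument with `{p.name for p in pathlib.Path("images").iterdir()}`. Anything reported as unused is a candidate temporary or redundant file, but do look before deleting.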

For any questions or advice about any illustration related matter - contact the Illustrator's Team.

Poetry

As long as you make sure your rewrap markers are set correctly, post-processing poetry shouldn't be any different to producing a prose book. Make sure you save backup copies of your file regularly as you work - it will be much easier to recover from a formatting decision gone terribly wrong. :) Have a look at recently-posted poetry books at PG for layout ideas (remembering your text will always need to be indented at least 2 spaces.) Some PPing software has extra features for handling poetry - refer to the user guide or manual for more information.

Tables

Hopefully tables in your text will already have received the careful attention of a member of the Turn the Tables team, and be sized to fit within PG guidelines (ideally, less than 75 characters wide, or 80 if desperate). If you need help with a table, or have questions about the HTML formatting, post in the team topic or in the post-processing forum.
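If you want to know which lines of a table break the width guideline, a few lines of Python will find them. This sketch reads "less than 75 characters" literally, so it flags any line of 75 characters or more by default; the function name `overlong_lines` is invented for this example.

```python
def overlong_lines(text: str, limit: int = 75) -> list[tuple[int, int]]:
    """Return (line number, length) for each line of `limit` characters or more."""
    return [
        (n, len(line))
        for n, line in enumerate(text.splitlines(), start=1)
        if len(line) >= limit
    ]


sample = "a" * 74 + "\n" + "b" * 75 + "\n" + "c" * 81
print(overlong_lines(sample))            # checks against the ideal width
print(overlong_lines(sample, limit=80))  # checks against the 'desperate' width
```

Lines are measured in characters, so run it on the decoded text rather than on raw bytes.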

Greek

It will usually have been transliterated (converted to roman letters) during proofing. There are various ways to handle this.

In the plaintext you can leave the transliteration, commonly removing the [Greek: ] markup (though you may wish to use another markup of your own, such as a +, and mention its use to indicate Greek in a Transcriber's Note). Or, if you have a significant amount of Greek or other unusual letters, you can produce a UTF-8 version, which will contain the original letters. Post in the forums if you'd like help with this. Faster help can often be obtained for short phrases via the DPC Jabber chat room (instructions, including a web interface that needs no software downloads, are in this thread).

In the HTML, you can encode the letters using HTML entities which will display for your reader if they have the relevant font installed. It's a nice idea to enclose this in a <span> which uses the transliteration as a 'title' attribute — that way non-Greek readers can still access the word:

<span title="Hyposêmeiôsê">Υποσημείωση</span>

Again, ask for help if you need it. There really are people who enjoy doing this!
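If you have more than a handful of Greek phrases, a small script can build the span for you. The sketch below writes each Greek letter as a numeric character reference (one kind of HTML entity) so that the letters survive in a Latin-1 file, and leaves the transliteration in the title as typed. The function name `greek_span` is made up for this example.

```python
from html import escape


def greek_span(greek: str, transliteration: str) -> str:
    """Wrap Greek text in a <span> whose title holds the transliteration."""
    # Every non-ASCII character becomes a numeric reference such as &#945;
    body = "".join(
        ch if ord(ch) < 128 else f"&#{ord(ch)};" for ch in escape(greek)
    )
    # quote=True also escapes double quotes, which matters inside an attribute.
    title = escape(transliteration, quote=True)
    return f'<span title="{title}">{body}</span>'


print(greek_span("Υποσημείωση", "Hyposêmeiôsê"))
```

A browser shows the same Greek word as in the hand-written example above, and hovering still reveals the transliteration.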

Occasional Use of Other Languages

Check the Language Skills forum post to ask for help or advice. If a native speaker hasn't been at DPC for some weeks, or can't help with your particular problem, have a look at the Teams List to see if there's a team for the language or the relevant country or countries. Don't worry if there are few members or the forum hasn't been posted to in a while; your question might be all it takes to create a lively and helpful community discussion.


Indexes

Keep the page numbers in the index, even in the plaintext form. In the HTML version it's very easy to link up numbers to page anchors - see the user guide or manual for your PPing software. If you need a way to do this linking semi-automatically (you'll need to check that non-index numbers aren't being included!), then just ask in the Regular Expression Clinic. For further help, or formatting queries, try the Junkies, Index team.
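As a taste of what semi-automatic linking looks like, here is a minimal Python sketch using a regular expression. It assumes your page anchors have ids of the form Page_12; that naming, like the function name `link_index_numbers`, is an assumption for this example, so match it to whatever your PPing software actually produces.

```python
import re


def link_index_numbers(index_line: str) -> str:
    """Wrap each run of digits in a link to the matching page anchor."""
    return re.sub(r"\b(\d+)\b", r'<a href="#Page_\1">\1</a>', index_line)


print(link_index_numbers("Apples, 12, 45"))
```

Only feed it lines from the index itself, and check the result: it cannot tell a page number from a date such as 1066, and it ignores roman-numeral page numbers.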

Errata Pages

These should be included as printed. There are two ways to handle these: you can leave the amendments up to the reader, or you can make corrections in the text, adding a Transcriber's Note that you've done so. Which you choose depends on the project and on you as PPer. Just don't make silent corrections and don't leave the pages out.

A possible middle ground would be to include the page's content as printed in the plaintext version, and use a correction <span> in the HTML version to indicate that a change was made to the text—making the erratum amendment, but including the original text to pop-up when the change is highlighted / hovered over by the reader's mouse pointer. See the PP forum for more on this.

A Problem After the Project has Posted!

Don't Panic! Everyone who PPs has done this. If they haven't, they will eventually, trust me. :)

If the book is quite recently posted to PGC (in the last week or two), contact your PPVer and let them know the problem. They'll pass it on to the Whitewasher who archived your book and can most easily fix it.

If the book posted a longer time ago, contact a Site Administrator. Say that you are the post-processor of the book, and include the PGC text number, title and author with a clear description of the problem and how to fix it. If you've checked the problem against the page images, mention that too.

What's different about ...

Periodicals

Many periodicals have a standard style for the text and HTML versions that ensures a consistent look for the whole project. If the periodical is part of an überproject, check the Überprojects Forum for details.

Many people are put off proofing, formatting or post-processing periodicals because they are perceived as "hard" in some way. Canny post-processors will therefore quickly realise that mastering a periodical style will give them access to many entertaining projects with little competition. A Style Guide puts an end to those hours spent mulling over whether a heading should be marked with <h2> or <h3>. Periodicals often have longer pages than usual and may have adverts or other unusual formatting issues. These will all have been encountered previously, and recommended handling should be explained in the Überproject thread. If not, or if the explanation is unclear, post there for help.

An excellent source of inspiration will be the most recently-posted issues of that periodical at PGC or PG - refer to these to help. Make sure you select ones which have also been posted by DP (to ensure absolute consistency of style!)

Drama

Many people are put off proofing, formatting or post-processing drama because it is perceived as "hard" in some way. Sometimes it actually is quite hard, for example, when written in Sixteenth Century English with little attention paid to spelling or grammatical niceties. Mostly, though, Drama is quite straightforward.

For all plays, check the Formatting Guidelines, and ensure your plaintext version is in line with these.

Format character names as similarly as possible to the original text. If the text is metrical (written like poetry where line breaks are significant), check the /* */ markers carefully before doing any rewrapping (or consider checking through for rewrap sections, such as stage directions, by hand.)
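Checking the markers by script before rewrapping can save a lot of grief. The sketch below assumes each /* and */ sits on a line of its own and that the blocks are not nested; the function name `unbalanced_nowrap_markers` is invented for this example.

```python
def unbalanced_nowrap_markers(text: str) -> list[tuple[int, str]]:
    """Return (line number, problem) for each /* or */ that does not pair up."""
    problems = []
    open_line = None  # line number of the /* we are currently inside, if any
    for n, line in enumerate(text.splitlines(), start=1):
        marker = line.strip()
        if marker == "/*":
            if open_line is not None:
                problems.append((n, "/* found inside an open block"))
            open_line = n
        elif marker == "*/":
            if open_line is None:
                problems.append((n, "*/ without a matching /*"))
            open_line = None
    if open_line is not None:
        problems.append((open_line, "/* never closed"))
    return problems


sample = "/*\nTo be, or not to be,\n/*\nthat is the question\n*/\n*/"
for line_no, problem in unbalanced_nowrap_markers(sample):
    print(f"line {line_no}: {problem}")
```

An empty list only means the markers pair up; you still need to look at whether each block encloses the right lines.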

There are various ways to format plays in HTML; searching PG for recent postings may give some ideas, as will posting in the Post-Processing forum. The DP-INT Plays The Thing team can also offer help and advice. Ideally, make it look as much like the original text as is sensible.

Uber-projects

See Überprojects for a list of the large multivolume projects that are likely to be seen on DPC for quite some time.

See also: Periodicals

Music

Books with sections of music, or a short tune for a song sung in the text, or entirely about music, are regularly put into PGC by DPC. A simple way to post-process a book containing music is to include all scores as illustrations in the HTML. However, much more value can be added to the book by transcribing the music into a common notation format. This has three great advantages:

  • a clear musical score for HTML
  • a midi file for HTML, so the PGC reader can actually listen to the music
  • the reader can edit the music

GNU LilyPond is the most commonly used notation format on DPC, because it is an expressive, concise, open source, text based format which can be edited within our current proofing mechanism. Other means of transcribing music are possible as well, especially if they are done outside of the proofing rounds. Graphical editors like Finale, Sibelius, NoteEdit, Noteworthy Composer, or Harmony Assistant offer another way of handling this task. Each of those mentioned also has the advantage of being able to export to MusicXML, the current standard for portable music notation.

To obtain help with music transcription, simply post a message to the Music team thread.

The most portable source available should be retained in at least one of the versions, particularly for complete pieces of music. Ideally, this source would be MusicXML, but this is not the most common format we are currently producing.

For LilyPond projects with larger parts of music, the best way to do this is to provide a LilyPond file. Short fragments could also go inside HTML comments or be used as the alt text for the score image.

Maths (LaTeX)

LaTeX is a valuable tool for the layout of mathematical formulae.

It is mostly used in projects with mathematical or scientific content that cannot be easily represented in other formats, or for which the features of LaTeX are beneficial.

Help for how LaTeX is used can be found in several places:

  • Some projects provide links in the Project Comments that might be helpful.
  • Discussions about the usage of LaTeX and certain standards within DP mostly occur in the LaTeX team thread.
  • LaTeX tutorials on the Internet (ask the LaTeX team for recommendations).

The submission to PGC should always include the LaTeX source, preferably as a single file (including comprehensive comments and compilation instructions), together with any illustrations (in an "images" subdirectory). The source must be capable of being processed on a "standard" LaTeX installation, because that is what the PGC whitewasher will use to generate the uploaded PDF.

LOTE (Languages Other Than English)

You probably want to speak the language, or have a native speaker spellcheck/smoothread it. However, for many languages, there are few or no native speakers available on the site. These projects can be taken by people who are willing to put in the extra effort involved in dealing with a language that they do not speak. Check the Language Skills List to find who to ask for help or advice.

When you submit a Latin-1 version of your text there is no need to also produce an ASCII version, no matter what the PG-FAQ says! PG has the tools to easily make an ASCII version based on the Latin-1 text. However, if the ideal ASCII version would be different from the result you get by making standard replacements like ü -> ue, é -> e, etc., you should produce an ASCII version yourself. Explain your reason for doing so when uploading for PPV, so they can pass that message on to PGC.
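To preview roughly what standard replacements would do to your text, you can try something like the sketch below. It is not PG's tool and will not match its output exactly; the `SPECIAL` table follows German conventions (ü -> ue) as in the example above and is an assumption you should adapt to your language, and the function name `to_ascii` is made up for this example.

```python
import unicodedata

# Replacements that plain accent-stripping would get wrong; extend per language.
SPECIAL = {
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
    "ß": "ss", "æ": "ae", "Æ": "Ae",
}


def to_ascii(text: str) -> str:
    """Replace non-ASCII characters with a plain ASCII approximation."""
    out = []
    # NFC first, so an accented letter is one character and SPECIAL can match it.
    for ch in unicodedata.normalize("NFC", text):
        if ch in SPECIAL:
            out.append(SPECIAL[ch])
        elif ord(ch) < 128:
            out.append(ch)
        else:
            # Split off the accent (é becomes e plus a combining mark), keep the ASCII part.
            base = unicodedata.normalize("NFKD", ch).encode("ascii", "ignore").decode("ascii")
            out.append(base or "?")  # "?" marks characters with no ASCII base
    return "".join(out)


print(to_ascii("Fußnote über Café"))
```

If the output reads badly for your book, that is a sign you should prepare the ASCII version by hand and say so in your upload notes.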

As a PPer you have a bit of freedom in choosing the best format for your text. For LOTE texts this may sometimes lead to decisions which would be unusual or even plainly incorrect in English. If you make such a decision you might get lots of gutcheck errors. If you have a good reason for your decision and if you have applied it consistently, you can ignore those errors. You might want to mention this decision in your upload notes. Example: For many languages it looks more natural to have spaces around em-dashes. It's perfectly fine to leave or insert them.

You should replace English markup words which appear in the final text with translations of those words in the main language of the text, e.g. Footnote / Fußnote / Apostille / Ootnotefay / Υποσημείωση / Voetnoot / Nota de rodapé.

Other Questions, and Suggestions for the FAQ

What is different about a 'missing pages' project?

See also
