The OCR Pool
From DPCanadaWiki
See Projects needing OCR for the list of who needs what. The process is described in more detail below. NB: This process is copied from DP-INT, but many of the steps may be impractical until the number of volunteers on DPC increases.
How to Contribute Images to the OCR Pool
This defines the extra steps involved in contributing images to the OCR Pool for someone else to OCR and preprocess. It does not cover things covered elsewhere, such as how to scan, how to create projects, etc.
1. Create a zip file containing all the image files. Name this zip file in the form <DPCNick>_<Proj>.zip, where <DPCNick> is your DPC username[*], and <Proj> is a short string of characters that will help you identify which book the project is for; it might be the author's surname, or an abbreviated or keyword version of the title, or any combination thereof. Since it is only needed to identify the file to you among your other OCR Pool files (all of which will start with <DPCNick>), it need not be long or complex.
[*] If your DPC username has spaces or other characters unusable in a Unix filename, come as close as possible and include a text file in the zip stating your correct DPC username. If you are not a registered DPC user (in which case you will not be able to manage the project, just contribute scanned images), choose a nickname or a version of your real name for <DPCNick>.
Example: DPC User JoeyDoey wants to send images of "The Campfire Girls vs Smoky the Bear" by Not A. Realauthor. Possible names he might choose for the zip file include (among many others)
- JoeyDoey_CampSmoky.zip
- JoeyDoey_RealauthorCGSB.zip
- JoeyDoey_CfireBearNAR.zip
- JoeyDoey_CGVSBNAR.zip
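The naming rule in step 1 can be sketched in Python. This is only an illustration, not part of any DPC tooling: the function names (safe_nick, pool_zip_name, make_pool_zip), the set of characters treated as filename-safe, and the name of the note file (username.txt) are all assumptions.

```python
import re
import zipfile
from pathlib import Path


def safe_nick(name: str) -> str:
    """Approximate a name using characters that are safe in a Unix filename.

    Assumption: letters, digits, '.', '_' and '-' are kept and everything
    else (spaces, slashes, punctuation) is dropped.
    """
    return re.sub(r"[^A-Za-z0-9._-]", "", name)


def pool_zip_name(nick: str, proj: str) -> str:
    """Build a name of the form <DPCNick>_<Proj>.zip."""
    return f"{safe_nick(nick)}_{safe_nick(proj)}.zip"


def make_pool_zip(image_dir: Path, nick: str, proj: str, out_dir: Path) -> Path:
    """Zip every file in image_dir into out_dir/<DPCNick>_<Proj>.zip."""
    # List the images before the archive exists, so the archive is never
    # added to itself when out_dir and image_dir are the same folder.
    images = sorted(p for p in image_dir.iterdir() if p.is_file())
    zip_path = out_dir / pool_zip_name(nick, proj)
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for image in images:
            zf.write(image, arcname=image.name)
        # Per the footnote: if the filename could only approximate the
        # username, include a text file stating the correct one.
        # "username.txt" is a hypothetical name for that file.
        if safe_nick(nick) != nick:
            zf.writestr("username.txt", f"Correct DPC username: {nick}\n")
    return zip_path
```

For example, pool_zip_name("JoeyDoey", "CampSmoky") gives JoeyDoey_CampSmoky.zip, the first of the names listed above.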
2. Decide whether you want to contribute the scans only and let someone else manage the project, or whether you want to be the Project Manager for this book yourself after the images have been OCRed and the text preprocessed by someone else.
a. If you just want to contribute the scans, create a text file called <DPCNick>_<Proj>_README containing the following information:
- The clearance line for the book
- Your desired scanner credit (e.g. John Doe, JoeyDoey, The Institute of Borrowed Images or "no scanner credit")
- A working email address (unless you are a registered DPC user who checks their inbox regularly).
- Any other message or information you'd like the person OCRing these images to have.
Start an ftp session as dpscans and upload both the <DPCNick>_<Proj>.zip and the <DPCNick>_<Proj>_README file to
OCRPool/CPOnly/
Post an announcement about the files to the Projects needing OCR page saying they are there and ready for OCR. Mention the language, the type of book, and any other details about it you think may be of interest. Mention if you think the images may or will need splitting, converting, renumbering or cropping.
Thank you, your work is now done (unless there is a problem such as a bad or missing scan, in which case someone will get in touch with you). Your scans are now waiting in the OCR Pool for someone to pick up, OCR, and manage from there on. You can now start work on another if you wish. :D
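The README file from step 2a could be produced with a short Python sketch like the one below. The page only lists what the file must contain, so the field labels ("Clearance:", "Scanner credit:" and so on) and the function names are assumptions made for illustration.

```python
from pathlib import Path


def readme_name(nick: str, proj: str) -> str:
    """Build a name of the form <DPCNick>_<Proj>_README."""
    return f"{nick}_{proj}_README"


def readme_text(clearance: str, scanner_credit: str,
                email: str = "", notes: str = "") -> str:
    """Assemble the README body; the labels used here are hypothetical."""
    lines = [
        f"Clearance: {clearance}",
        # An empty credit is written out explicitly as "no scanner credit".
        f"Scanner credit: {scanner_credit or 'no scanner credit'}",
    ]
    if email:
        lines.append(f"Email: {email}")
    if notes:
        lines.append(f"Notes: {notes}")
    return "\n".join(lines) + "\n"


def write_readme(out_dir: Path, nick: str, proj: str, **fields: str) -> Path:
    """Write the README next to the zip file and return its path."""
    path = out_dir / readme_name(nick, proj)
    path.write_text(readme_text(**fields), encoding="utf-8")
    return path
```

The email field is optional here because the page only requires it from contributors who are not registered DPC users checking their inbox regularly.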
b. If you want to be the PM for this project, start an ftp session as dpscans and upload the <DPCNick>_<Proj>.zip file to
OCRPool/WillPM/
Post an announcement about the files to the Projects needing OCR page saying they are there and ready for OCR. Mention the language, the type of book, and any other details about it you think may be of interest. Mention if you think the images may or will need splitting, converting, renumbering or cropping.
One of the people doing OCR from the pool will edit the Wiki post to indicate that the project has been "claimed". The file on the server will also be renamed IP_<DPCNick>_<Proj>.zip, indicating that it is In Progress, so other people will not claim it. When they are finished with the preprocessing they will send you a private message, and you will find the files waiting for you in your dpscans folder,
<DPCNick>/
Look for a folder or files named with <OCRNick>_<Proj>. (It is likely to be either a directory containing both text and image files, or a zip file.) Using these files (and any additional files for the illustrations, if appropriate), go ahead and create the project as usual. <OCRNick> is the DPC username of the person who did the OCR/prepping, so you know whom to contact in case of questions or if something needs to be redone. After the project has been created, return to the OCRPool/WillPM directory and delete the IP_<DPCNick>_<Proj>.zip file, as well as the files from your dpscans folder. The moment the project is created, all the files are copied elsewhere on the server, so they no longer need to remain in dpscans.
So as an example, John Doe, DPC user JoeyDoey, has left a file JoeyDoey_CampSmoky.zip in the OCRPool/WillPM/ directory. A few days later he looks in the JoeyDoey directory and finds two files, Ziggurat_CampSmokyImages.zip and Ziggurat_CampSmokyText.zip, from which he can create the project as usual. He knows that user Ziggurat did the OCR and can be contacted in case a page has gone missing, etc. After he creates the project he deletes the IP_JoeyDoey_CampSmoky.zip file from the OCRPool/WillPM/ directory, deletes Ziggurat_CampSmokyImages.zip and Ziggurat_CampSmokyText.zip from his personal dpscans folder, and updates the OCR Pool Wiki to reflect that the work has been completed.
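The file names involved in this example follow mechanically from the conventions above, which a few lines of Python can make explicit. The helper names are hypothetical; the paths are the ones given on this page.

```python
def in_progress_name(zip_name: str) -> str:
    """Name a pool file takes once someone has claimed it."""
    return f"IP_{zip_name}"


def is_in_progress(zip_name: str) -> bool:
    """True if the file has already been claimed by someone doing OCR."""
    return zip_name.startswith("IP_")


def cleanup_targets(nick: str, proj: str, delivered: list[str]) -> list[str]:
    """Server paths the PM deletes once the project has been created.

    `delivered` holds the names of the files or folders the OCR volunteer
    left in the PM's personal dpscans folder.
    """
    pool_file = f"OCRPool/WillPM/{in_progress_name(f'{nick}_{proj}.zip')}"
    return [pool_file] + [f"{nick}/{name}" for name in delivered]


targets = cleanup_targets(
    "JoeyDoey",
    "CampSmoky",
    ["Ziggurat_CampSmokyImages.zip", "Ziggurat_CampSmokyText.zip"],
)
for path in targets:
    print(path)
```

Run as written, this lists the three paths JoeyDoey deletes in the example: the IP_ file in OCRPool/WillPM/ and the two Ziggurat files in his own folder.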
How to Process Images from the OCR Pool
This defines the extra steps involved in obtaining images from the OCR Pool. It does not cover things covered elsewhere, such as how to OCR or preprocess text, how to create projects, etc.
The page Projects needing OCR has been created for scanners to announce when new files are available for OCR.
Look through the lists on the Wiki page and select a project you would like to work on. Follow the Wiki instructions and edit the post to indicate that you have "claimed" the project and will be doing the OCR. This will prevent someone else from working on the same project.
Start an ftp session as dpscans. Browse through either the OCRPool/CPOnly or OCRPool/WillPM directory to find the zip file. The files here have names of the form <DPCNick>_<Proj>.zip, where <DPCNick> is the DPC username of the person who scanned the images and <Proj> is a brief reminder for them of which book the images are from. Each zip file (containing images) in the CPOnly directory should also have an associated text file called <DPCNick>_<Proj>_README, containing clearance and scanner credit information needed to set up the project on DPC.
If the file you are volunteering to OCR is coming from the CPOnly directory,
the content provider has elected to be a content provider only, so you may manage the project. To process a CPOnly project, download a copy of your chosen zip file, and rename the file on the server to IP_<DPCNick>_<Proj>.zip, where "IP" stands for "In Progress". For example, a file named JoeyDoey_CampSmoky.zip is from DPC user JoeyDoey, and would be renamed IP_JoeyDoey_CampSmoky.zip.
OCR and prep the images as per usual. The project is now yours to create and manage as usual. After project creation, please remember to delete the "in progress" zip file and README file from the CPOnly folder, as well as the files from your personal folder in dpscans. The moment the project is created, all the files are copied elsewhere on the server, so they no longer need to remain in dpscans. Please also delete the entry from the OCR Pool Wiki to reflect that the work has been completed.
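The download-and-rename step can be scripted with Python's standard ftplib module. The sketch below is a hedged illustration, not an official DPC script: claim_pool_file is a hypothetical name, and the function expects an FTP session that is already logged in, because the host and password are deliberately not published.

```python
from ftplib import FTP
from pathlib import Path


def claim_pool_file(ftp: FTP, pool_dir: str, zip_name: str, dest_dir: Path) -> Path:
    """Download zip_name from pool_dir, then mark it as In Progress.

    Assumption: `ftp` is an open, logged-in session and pool_dir
    (for example "OCRPool/CPOnly") is reachable from its current directory.
    """
    ftp.cwd(pool_dir)
    local_path = dest_dir / zip_name
    with open(local_path, "wb") as fh:
        # RETR streams the remote file; each chunk is written to disk.
        ftp.retrbinary(f"RETR {zip_name}", fh.write)
    # Rename on the server so other volunteers do not claim the same file.
    ftp.rename(zip_name, f"IP_{zip_name}")
    return local_path
```

A session would be opened with FTP(host) followed by ftp.login("dpscans", password), using the host and password obtained as described under "How to access dpscans". The sketch follows the order given in the text (download, then rename); renaming first would narrow the window in which two people could pick up the same file.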
If the file you are volunteering to OCR is coming from the WillPM directory,
the content provider intends to also manage the project through DPC. To process a WillPM project, download a copy of your chosen zip file, and rename the file on the server to IP_<DPCNick>_<Proj>.zip to reflect that it is in progress. OCR and prep the images as usual, and then upload them into the dpscans folder of the content provider. (Please create one for them if it does not already exist.) Please include your DPC nickname, as well as their identifier for the project, in the folder or file name, so that they will know whom to contact if they have questions or if there are problems. So, for example, if you did OCR for JoeyDoey on his file JoeyDoey_CampSmoky.zip, you could upload the new image and text files into
JoeyDoey/<OCRNick>_<Proj>/
Alternatively, you may zip the files, and upload them into JoeyDoey's folder with a similar file name.
After all the files are uploaded, please send them a private message telling them these files are now ready, where you have put them, and how you have uploaded them (into a directory, zipped, etc). Thanks, you are done and can go on to another if you wish. :D
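Uploading the finished files into the content provider's folder could look like the following ftplib sketch. It rests on several assumptions: deliver_results is a hypothetical name, the session is already logged in and positioned at the top of the dpscans tree, and a refused mkd is taken to mean that the directory already exists.

```python
from ftplib import FTP, error_perm
from pathlib import Path


def ensure_dir(ftp: FTP, path: str) -> None:
    """Create a remote directory, tolerating one that is already there."""
    try:
        ftp.mkd(path)
    except error_perm:
        # Assumption: the server refuses because the directory exists.
        # A real permission problem will surface at the first upload.
        pass


def deliver_results(ftp: FTP, cp_nick: str, ocr_nick: str, proj: str,
                    files: list[Path]) -> str:
    """Upload files into <cp_nick>/<OCRNick>_<Proj>/ and return that path."""
    ensure_dir(ftp, cp_nick)
    target = f"{cp_nick}/{ocr_nick}_{proj}"
    ensure_dir(ftp, target)
    for path in files:
        with open(path, "rb") as fh:
            # STOR uploads the local file under the same name.
            ftp.storbinary(f"STOR {target}/{path.name}", fh)
    return target
```

The returned path is the location to quote in the message to the content provider.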
How to access dpscans
While the account/password information for FTPing to/from dpscans is not secret, we deliberately don't post it publicly in the forums, FAQs or elsewhere. It is only a small level of protection, but we do insist that anyone who wants to use ftp with our site must get that information from someone who has been around long enough to have a fair understanding of what the suitable uses for it are. Anyone needing the password should contact a Site Admin or Project Facilitator.
