Print media pages come in all levels of complexity, from a page with only one large text block to a page with text, tables, graphics, photos, advertisements, lists, and more. Most of the pages in the 48 issues of Softalk magazine are moderately complex; some are very complex.
Thanks to Peter Caylor’s monumental scanning feat, we have lossless 600 dpi PDFs of all 9,300+ Softalk magazine pages. These were scanned as 2-up pages, exactly as you see them in the print copy of the magazine. Before we can proceed with the Optical Character Recognition (OCR) phase of our project, the 2-up pages have to be split into 1-up page format. This is what I’ve been working on.
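To make the 2-up to 1-up step concrete, here is a minimal sketch of the geometry involved. It is not how ABBYY does it; it only computes two crop rectangles by cutting a spread at a vertical gutter line, defaulting to the midpoint. The names `Box` and `split_two_up` are hypothetical, and the 10200 x 6600 pixel spread in the usage lines is an assumed example (a 17 x 11 inch spread at 600 dpi), not a measured Softalk scan size.

```python
from typing import NamedTuple, Optional, Tuple


class Box(NamedTuple):
    """Crop rectangle in pixels: (left, top, right, bottom), right/bottom exclusive."""
    left: int
    top: int
    right: int
    bottom: int


def split_two_up(width: int, height: int,
                 gutter_x: Optional[int] = None) -> Tuple[Box, Box]:
    """Return crop boxes for the left and right pages of a 2-up scan.

    gutter_x is the x position of the cut; if omitted, the spread is cut
    at its horizontal midpoint (a naive assumption that fails on skewed
    or off-center scans).
    """
    if width <= 1 or height <= 0:
        raise ValueError("spread must be at least 2 pixels wide and 1 pixel tall")
    if gutter_x is None:
        gutter_x = width // 2
    if not 0 < gutter_x < width:
        raise ValueError("gutter_x must fall strictly inside the spread")
    left_page = Box(0, 0, gutter_x, height)
    right_page = Box(gutter_x, 0, width, height)
    return left_page, right_page


# Assumed example: a 17 x 11 inch spread scanned at 600 dpi.
left_page, right_page = split_two_up(17 * 600, 11 * 600)
print(left_page)
print(right_page)
```

A fixed midpoint cut like this is exactly what breaks down on skewed scans and on photos that bleed across the gutter, which is why a tool that analyzes the page content is worth the extra processing time.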
ABBYY Fine Reader 12 Professional is amazing software. It can open a PDF file and, through a selection of pre-processing options, split 2-up pages into 1-up pages. This splitting process is time-consuming (on my machine): ABBYY takes nearly a minute per page to pre-process and split the 2-ups, so a thirty-six-page issue takes a little over thirty minutes. But the output is incredibly good.
Because of the complexity of the pages, every page must be carefully “verified” as having come through the splitting process unchanged in any way other than being split into 1-ups. During pre-processing, advertisements sometimes skew page text or cause streaking on a page. Two-ups with photos that bleed across both pages often do not get split automatically. So far in Volume One, I've found that between 5% and 11% of the pages in every issue have to be reprocessed using a multi-step manual protocol that splits the original pages while preserving them. ABBYY Fine Reader does an amazing job of splitting the other 89-95% of the pages; just a few pages are "gotchas."
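The figures above lend themselves to a rough per-issue planning estimate. The sketch below is only back-of-the-envelope arithmetic: the function name `estimate_issue` is hypothetical, the 55 seconds per page is an assumed stand-in for "nearly a minute," and the 5% to 11% rework range has only been observed in Volume One, so applying it to other volumes is an extrapolation.

```python
from typing import Tuple


def ceil_div(n: int, d: int) -> int:
    """Integer division rounded up, avoiding floating-point error."""
    return -(-n // d)


def estimate_issue(pages: int,
                   seconds_per_page: int = 55,           # assumed: "nearly a minute"
                   rework_pct: Tuple[int, int] = (5, 11)  # observed in Volume One only
                   ) -> Tuple[int, int, int]:
    """Return (pre-processing minutes, fewest rework pages, most rework pages)."""
    minutes = ceil_div(pages * seconds_per_page, 60)
    low = ceil_div(pages * rework_pct[0], 100)
    high = ceil_div(pages * rework_pct[1], 100)
    return minutes, low, high


minutes, low, high = estimate_issue(36)
print(f"36 pages: about {minutes} min of pre-processing, "
      f"{low} to {high} pages likely to need manual rework")
```

Under these assumptions a thirty-six-page issue comes out at about 33 minutes of pre-processing with 2 to 4 pages needing manual rework, which is in line with the timings described above.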
As of today, all the pages of all twelve issues of Volume One have been split into 1-up pages. Each page has been verified and manually reprocessed where needed. All issues of Volume One are now ready for the next phase: OCR processing.
By the end of the month, ABBYY Fine Reader's 1-up pre-processing of all four volumes of the Softalk archive will be complete. Verification of every page in every issue of Volumes Two through Four will then begin.