<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xml:base="http://www.softalkapple.com"  xmlns:dc="http://purl.org/dc/elements/1.1/">
<channel>
 <title>Timlynn Babitsky&#039;s blog</title>
 <link>http://www.softalkapple.com/blogs/timlynn-babitsky</link>
 <description></description>
 <language>en</language>
<item>
 <title>Archive Update - Slice Baby Slice</title>
 <link>http://www.softalkapple.com/blogs/archive-update-slice-baby-slice</link>
 <description>&lt;div class=&quot;field field-name-body field-type-text-with-summary field-label-hidden view-mode-rss&quot;&gt;&lt;div class=&quot;field-items&quot;&gt;&lt;div class=&quot;field-item even&quot; property=&quot;content:encoded&quot;&gt;&lt;p&gt;Print media pages come in all levels of complexity from a &lt;ins&gt;simple page&lt;/ins&gt; with only one large text block to a &lt;ins&gt;complex page&lt;/ins&gt; with text, tables, graphics, photos, advertisements, lists, and more. Most of the pages in the 48 issues of the Softalk magazine can be considered moderately complex, some are very complex.&lt;br /&gt;&lt;/p&gt;&lt;div class=&quot;image-right&quot;&gt;&lt;img src=&quot;/sites/default/files/images/ABBYY.jpg&quot; width=&quot;219&quot; height=&quot;419&quot; alt=&quot;ABBYY.jpg&quot; /&gt;&lt;/div&gt;
&lt;p&gt;Thanks to Peter Caylor’s monumental scanning feat we have lossless 600 dpi PDFs of all 9,300+ Softalk magazine pages. These were scanned as 2-up pages, exactly as you see them in the print copy of the magazine. In order to proceed with the &lt;strong&gt;Optical Character Recognition (OCR) phase&lt;/strong&gt; of our project, the &lt;strong&gt;2-up pages have to be split into 1-up page format&lt;/strong&gt;. This is what I’ve been working on.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;ABBYY FineReader 12 Professional&lt;/strong&gt; is amazing software. It can open a PDF file and, through selection of various pre-processing choices, will split 2-up pages into 1-up pages. This splitting process is time-consuming (on my machine) – it takes nearly a minute per page for ABBYY to pre-process and split 2-ups. A thirty-six-page issue takes a little over thirty minutes for pre-processing. But the output is incredibly good.&lt;/p&gt;
&lt;p&gt;Because of the complexity of the pages, &lt;strong&gt;every page must be carefully “verified”&lt;/strong&gt; as having come through this page splitting process unchanged in any way other than being split into 1-ups. During the pre-processing, advertisements sometimes skew page text or cause streaking on a page. Two-ups that bleed photos across the two pages often do not get split automatically. And so I&#039;ve found (so far in Volume One) that between 5% and 11% of the pages in every issue have to be reprocessed using a multi-step manual reprocessing protocol to split, but preserve, the original pages. ABBYY FineReader does an amazing job on splitting the other 89-95% of the pages; just a few pages are &quot;gotchas.&quot;&lt;/p&gt;
&lt;p&gt;As of today, all the pages of all twelve issues of Volume One have been split to 1-up pages. Each page has been verified and manually reprocessed where needed. &lt;strong&gt;All issues of Volume One are now ready for the next phase -- OCR processing&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;By the end of the month, ABBYY FineReader pre-processing of all four volumes of the Softalk archive into 1-up pages will be complete. Verification of all pages in all issues from Volume Two through Volume Four will then begin.&lt;/p&gt;
&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;</description>
 <pubDate>Fri, 11 Jul 2014 22:35:55 +0000</pubDate>
 <dc:creator>Timlynn Babitsky</dc:creator>
 <guid isPermaLink="false">139 at http://www.softalkapple.com</guid>
 <comments>http://www.softalkapple.com/blogs/archive-update-slice-baby-slice#comments</comments>
</item>
<item>
 <title>The Future of Museums Is Open, Social, Peer-to-Peer, and Read/Write</title>
 <link>http://www.softalkapple.com/blogs/future-museums-open-social-peer-peer-and-readwrite</link>
 <description>&lt;div class=&quot;field field-name-field-tags field-type-taxonomy-term-reference field-label-hidden view-mode-rss&quot;&gt;&lt;ul class=&quot;field-items&quot;&gt;&lt;li class=&quot;field-item even&quot;&gt;&lt;a href=&quot;/tags/codewords&quot; typeof=&quot;skos:Concept&quot; property=&quot;rdfs:label skos:prefLabel&quot; datatype=&quot;&quot;&gt;#CODEWORDS&lt;/a&gt;&lt;/li&gt;&lt;/ul&gt;&lt;/div&gt;&lt;div class=&quot;field field-name-body field-type-text-with-summary field-label-hidden view-mode-rss&quot;&gt;&lt;div class=&quot;field-items&quot;&gt;&lt;div class=&quot;field-item even&quot; property=&quot;content:encoded&quot;&gt;&lt;div class=&quot;image-right&quot;&gt;&lt;img src=&quot;/sites/default/files/images/Code_Words.jpg&quot; width=&quot;350&quot; height=&quot;207&quot; alt=&quot;Code_Words.jpg&quot; /&gt;&lt;/div&gt;
&lt;p&gt;I’ve just re-read, for the third time, &lt;a href=&quot;https://medium.com/code-words-technology-and-theory-in-the-museum/a6c7430d84d1&quot;&gt;&lt;strong&gt;Michael Peter Edson’s very powerful piece&lt;/strong&gt;&lt;/a&gt; &lt;em&gt;Dark Matter&lt;/em&gt;, the first essay in &lt;a href=&quot;https://medium.com/code-words-technology-and-theory-in-the-museum/f63dabc61f47&quot;&gt;&lt;strong&gt;CODE/WORDS&lt;/strong&gt;&lt;/a&gt;, a collaborative writing project about technology in museums.&lt;/p&gt;
&lt;p&gt;Edson’s message to all of us involved with museums, libraries, and big data archives is to &lt;strong&gt;think MUCH BIGGER&lt;/strong&gt; – way beyond tracking how many visitors come through the doors, or how many scholarly articles have come from connected researchers, or even how many page views a website has garnered.&lt;/p&gt;
&lt;p&gt;There are over 3 billion people online today, with another 5 billion predicted to join them over the next 10 years. Museums need to understand the unprecedented opportunity they have to &lt;strong&gt;engage users to participate&lt;/strong&gt; in citizen-action science, archiving, and exploring – &lt;strong&gt;to use and share and help&lt;/strong&gt;.&lt;/p&gt;
&lt;blockquote&gt;&lt;p&gt;&quot;There’s just an enormous, humongous, gigantic audience out there connected to the Internet that is starving for authenticity and good ideas—and they want to learn.&quot;&lt;/p&gt;&lt;/blockquote&gt;
&lt;p&gt;Tech innovation is leaping forward exponentially; the connected world we live in is changing almost faster than we can track. As keepers of the artifacts of human experience, museums, libraries and all related institutions need to prepare now for the &lt;strong&gt;open, democratic sharing of information&lt;/strong&gt; that the World Wide Web provides and tech advances ensure.&lt;/p&gt;
&lt;p&gt;It is no longer enough to just archive and digitize our huge collections (although that in itself is a HUGE undertaking). &lt;strong&gt;We need new platforms that allow each of us to explore those collections&lt;/strong&gt; to gather facts, to ask questions, and to learn. &lt;a href=&quot;http://factminers.org/&quot;&gt;&lt;strong&gt;FactMiners&lt;/strong&gt;&lt;/a&gt; is one such platform and the &lt;strong&gt;Softalk Apple Project&lt;/strong&gt; provides the pilot project on which to test it.&lt;/p&gt;
&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;</description>
 <pubDate>Tue, 27 May 2014 21:51:21 +0000</pubDate>
 <dc:creator>Timlynn Babitsky</dc:creator>
 <guid isPermaLink="false">123 at http://www.softalkapple.com</guid>
 <comments>http://www.softalkapple.com/blogs/future-museums-open-social-peer-peer-and-readwrite#comments</comments>
</item>
</channel>
</rss>
