Barbara has been setting up a system in the office for scanning the pages of public domain books to submit to the
Distributed Proofreading/Project Gutenberg system. The workflow is essentially:
- create 300-dpi bitmaps of every page
- “clean up” the bitmaps for OCR by aligning and cropping them, getting the thresholds right, and applying a threshold function to convert them to one-bit bitmaps
- [submit the images to the PG website at this point]
- for each page, [somebody needs to] run an OCR algorithm to create a (somewhat) rich text file, indicating italics and bold-face text at least
- [the book’s pages are proofread and recompiled and marked up by hand and scripts to create Project Gutenberg submissions by various and diverse hands].
All you really need to think about today, though, is the OCR.
There is some very nice OCR software out there in the world. High-end commercial stuff. Stuff that does all sorts of fancy stuff with dictionaries and context and syntax and fonts and mixed metaheuristics and training. There is damn-all when it comes to Open Source, cheap or free software, by comparison. There’s gocr, which is passable, but in general it’s far, far less accurate than something big and unwieldy like Omnipage Pro. Worse, Omnipage Pro costs $400+ and doesn’t run under MacOS X. Liars.
Why the stagnation in OCR? Here’s my hypothesis: early lock-in of neural network methods.
Back in the good old days, OCR was touted as one of the most important success stories of neural network models. Having a computer read a page in a book was soooo kewl by 1980s standards — and after all, it cashed in on the results from all the early neural computing academic papers, and the design pattern (an eye) was pretty obvious. A picture is locally processed, and little bits of it are used to generate small pattern fragments (angles, widths), and these are compiled hierarchically into some bigger patterns (loops, stems), and up near the top the neural network spits out a letter (or just maybe a “don’t know” message).
Sure, it works fine. But it’s tapped out. Time for something new.
See, there’s so much munging involved in this workflow, just because of how the neural networks work and the assumptions about letters and words and language that they embody, that it’s sometimes as much work to prep a page for OCR as it is to type it in by hand. This is particularly true if there’s any structure to a page: not merely columns, but tables, figure legends, call-outs, footnotes, foreign-language quotes, illustrations with letters in them. And noise, foxing, crumples, tears, crayon….
And then there’s the huge and rather difficult but oft-ignored stuff that goes into making a real text file. After all, character recognition simply produces a list of the characters that appear on the page. And their physical locations. How you get from those to a string of words in a file is not always a simple matter, especially if you want your OCR system to use syntactic cues and sense-making to improve its accuracy.
Look, a few weeks back Barbara proofread Robert Hooke. We’re talking about long S characters, and Greek and Latin and weird bumpy distressed metal type and bad scans. The OCR, doubtless trained on a corpus of modern novels, newspaper pages and magazine columns, sucked.
Because, I’m arguing, the design pattern of the neural networks they use embodies too many assumptions about the scans and quality and orientation and stuff. Neural nets assume too much. Or if not the networks themselves, then the people who use them.
So I want to propose a challenge. An exercise. A way out of the box.
Use genetic programming to create a colony of OCR ants.
Say the scanner produces a bitmap file. Who knows what the page orientation is, or if it’s even straight? Not me. Not you. Who knows if the scan was noisy or clean, or gray-scale or color or what the resolution was or how big the letters are in pixels? Not me. Not you. Not your ants. Who knows if there are illustrations on the page, or typographic ornaments, or foxing or squished spiders, or marginalia scrawled along the side? Go ahead — guess.
All you and your ants will know (in the acceptance test system) is that a number of png-format images (of at least 100 dpi, and maybe up to 2400, and at least 1 bit but maybe 24 bits, and maybe with a normalized histogram of values and then again maybe not, and in some arbitrary orientation) will be provided, and that these files are scans of pages bearing text that the ants need to recognize — that is, output. The ants’ task will be to produce a text file containing as few Type I and Type II errors as possible.
Your OCR ants are little virtual software agents, walking around on the bounded (maybe with toroidal wrapping if you must, but I’d advise against it…) 2-dimensional discrete world of the scan bitmap. One pixel is a quantum of space, to them, the smallest increment of motion and distance.
Along with the scan bitmap in which they “physically reside”, your ants may play with the following (there’s a sketch of this world in code right after the list):
- up to five other bitmaps of 24-bit depth and the same size as the scan, called the pheromone sensoria (initialized to all-white),
- a linear discrete space of infinite length called the text workspace, which contains a character at each location (initialized to NULL), and
- up to five linear pheromone sensoria that map 1:1 to the cells of the text workspace and contain long integers (initialized to 0).
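Here’s a minimal sketch of that world as data structures, just to pin things down. The names (`World`, `map_pheromones`, and so on) and the concrete representations are my own choices; the spec above only fixes the counts, depths, and initial values:

```python
from collections import defaultdict
import numpy as np

class World:
    def __init__(self, scan, n_map=5, n_text=5):
        self.scan = scan                               # the page bitmap (read-only)
        h, w = scan.shape[:2]
        # up to five 24-bit sensoria the size of the scan, initialized all-white
        self.map_pheromones = [np.full((h, w, 3), 255.0) for _ in range(n_map)]
        # the text workspace: an unbounded line of cells, NULL by default
        self.text = defaultdict(lambda: '\0')          # index -> character
        # up to five linear sensoria mapped 1:1 to text cells, long ints, all zero
        self.text_pheromones = [defaultdict(int) for _ in range(n_text)]
```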
Ants walk around in the scan bitmap. That is, they each have a discrete “position” property that refers to the coordinates of a pixel in the bitmap. They start the test at the upper left corner of the bitmap, all stacked on top of one another if need be.
They also have a single indexed cursor position that refers to their “attention” in the text workspace. Their cursors all start at the same location “0” in the text workspace.
As ants walk around the scan, they can examine the color of pixels in their immediate neighborhood (defined as you see fit, but no more than 8 pixels in any direction, which is a lot).
They can also examine the color of their location pixel in the pheromone sensoria — “reading” pheromones left by themselves and other ants that have passed nearby previously (again, the nature and meaning of these pheromones are entirely up to you).
They’re also aware of the characters in the region three positions to either side of their cursor in the text workspace, and the pheromone values (long integers) in the pheromone sensoria associated with those letters.
And finally, they have memory, which is just a list of the last 100 sensory vectors of pixel colors, pheromone levels, and characters.
These are not dumb ants! We’re talking (a) a whole bunch of pixel values in the scan, (b) up to five pheromone values in the bitmap space, (c) seven character values from the text workspace, and (d) up to 35 pheromone measures in the workspace, times 100. Whew! Plenty of room to do anything in the world, yes?
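To make the bookkeeping concrete, here’s one plausible encoding of a single time-step’s senses, and of an ant that remembers the last 100 of them. The field names and shapes are my assumptions, not part of the spec:

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class SensoryRecord:
    pixels: list      # scan colors in the neighborhood (<= 8 pixels in any direction)
    map_pher: list    # up to 5 pheromone colors at the ant's own pixel
    chars: str        # the 7 characters at cursor-3 .. cursor+3
    text_pher: list   # up to 5 x 7 long integers from the linear sensoria

@dataclass
class Ant:
    row: int = 0      # all ants start at the scan's upper-left corner
    col: int = 0
    cursor: int = 0   # all cursors start at workspace position 0
    memory: deque = field(default_factory=lambda: deque(maxlen=100))  # last 100 records
```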
Don’t get cocky. There’s a catch, which you shall see at the end.
Each ant, following its internal program, responds to all these sensory inputs (and its memory of up to 100 sensory records from the past 100 time-steps) and does stuff (sketched in code after this list), including:
- it sits still or moves one pixel in any of eight directions in the bitmap
- it may do one of the following: (a) write a letter to the current cursor position in the workspace, (b) swap the current letter with one at either side, or (c) move its cursor one step to the right or left
- it may excrete pheromones of its own in one or more of the ten pheromone sensoria
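Put together, one time-step for a single ant might look like this, building on the `World` and `Ant` sketches above. The random choices stand in for the evolved internal program, and the deposit conventions (darkening a bitmap sensorium, incrementing a linear one) are mine:

```python
import random
import numpy as np

MOVES = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1)]  # 8 directions + sit still

def step(ant, world):
    # (1) sit still or move one pixel in any of eight directions (bounded, no wrapping)
    dr, dc = random.choice(MOVES)
    h, w = world.scan.shape[:2]
    ant.row = min(max(ant.row + dr, 0), h - 1)
    ant.col = min(max(ant.col + dc, 0), w - 1)
    # (2) at most one workspace action: write, swap with a neighbor, or move the cursor
    action = random.choice(['write', 'swap', 'left', 'right'])
    if action == 'write':
        world.text[ant.cursor] = '#'                   # '#' is the "don't know" character
    elif action == 'swap':
        j = ant.cursor + random.choice([-1, 1])
        world.text[ant.cursor], world.text[j] = world.text[j], world.text[ant.cursor]
    else:
        ant.cursor += 1 if action == 'right' else -1
    # (3) optionally excrete pheromone into any of the ten sensoria
    cell = world.map_pheromones[0][ant.row, ant.col]
    world.map_pheromones[0][ant.row, ant.col] = np.maximum(cell - 16.0, 0.0)
    world.text_pheromones[0][ant.cursor] += 1
```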
After they have been “secreted” by the ants, all pheromones diffuse in the sensoria. After each time-step, the amount of a pheromone in any location (whatever the dimensionality of the space) becomes the weighted average of the value in the cell and its neighbors (eight in the bitmap, two in the linear spaces), including a fixed decay term. You may select these terms as you see fit, adding directional biases to the weights used in averaging, or making some pheromones diffuse faster or slower than others. But in every case, the decay term must be positive, so that all pheromones eventually disappear unless replaced. You may use different constants for weights and decay in each sensorium, but each location within a sensorium must obey the same rule: new_value = sum(weights × values) / (sum(weights) + decay).
For example, suppose you choose the weights for one of the 2-D pheromones to be 8 for the current location C and (1, 1, 2, 1, 2, 2, 2, 1) for the values in the (N, NE, E, SE, S, SW, W, NW) neighboring cells, and a decay constant of 2.2. The pheromone concentration at every location would be calculated according to (8C + N + NE + 2E + SE + 2S + 2SW + 2W + NW)/22.2.
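In code, one diffusion step with exactly those example constants might look like the sketch below, run once per color channel of a 24-bit sensorium. Boundary handling is unspecified above, so I’ve assumed out-of-bounds neighbors contribute zero:

```python
import numpy as np

CENTER_W = 8.0
# (row, col) offsets for (N, NE, E, SE, S, SW, W, NW), paired with their weights
NEIGHBORS = [((-1,  0), 1.0), ((-1,  1), 1.0), ((0,  1), 2.0), ((1,  1), 1.0),
             (( 1,  0), 2.0), (( 1, -1), 2.0), ((0, -1), 2.0), ((-1, -1), 1.0)]
DECAY = 2.2
TOTAL = CENTER_W + sum(w for _, w in NEIGHBORS) + DECAY   # 8 + 12 + 2.2 = 22.2

def diffuse(p):
    """One time-step of diffusion + decay for a 2-D pheromone plane."""
    h, w = p.shape
    padded = np.pad(p, 1)                 # zero border: empty neighbors (my assumption)
    acc = CENTER_W * p
    for (dr, dc), wt in NEIGHBORS:
        acc += wt * padded[1 + dr : 1 + dr + h, 1 + dc : 1 + dc + w]
    return acc / TOTAL
```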
The astute reader will ask: How do the ants map the bitmap to the text file? I mean, how do they even know which way is up?? Yeah, interesting, huh? Good question. Anybody?
Then somebody else will ask: Hang on, what about using dictionaries and syntax and stuff? Don’t the ants get dictionaries? Hmmm. Well, no. I’m imagining you’ll be designing/discovering/training these ants using some method of supervised learning, using a corpus of scans and text files. If they can master those scans, then they should be able to pick up a few rudiments of English along the way, yes? Sort of implicitly, yes? Anything else?
Somebody scribbling on the back of an envelope will ask: Now see here — the ants can only move in 45-degree directions on the grid. How the heck are they supposed to recognize the text if it’s aligned at, say, 72 degrees? Pheromones? Memory? Be creative.
Oh, yes: I don’t think there should be just one kind of ant. I suspect it might be useful to break the task down into pieces, and have different castes of ant work on it.
OK. Well, there you have it. See you all Monday after break, eh?
The formal acceptance test will be based on a sample of several hundred scanned pages (selected from the English pages in the Project Gutenberg archives) of varying quality, resolution, bit-depth, alignment, clarity and content. The ants’ text file will in each case be compared to the human-made “correct” reading of the scan, and graded in terms of:
- coverage: How many of the correct letters and words are present? This should be high.
- flow: In how many places is the order of characters inverted? This should be low.
- deletion and insertion errors: both should be low.
- modesty: For all characters present, two points will be deducted for each wrong character, but only one for each “#” (don’t know) character. This should be high.
- and parsimony (the catch I mentioned above): The number of operations made by ants should be as low as possible — you may reduce this by reducing the number of ants, or reducing their algorithmic complexity, or making them algorithmically efficient…
And some others. This is still a draft. Think about it. I’ll clean up the acceptance test in the next few days, with more detail about how the tests will be written and how the ants will have to be submitted. You tell me what the acceptance test should include to be useful….
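In the meantime, to make the grading concrete, here’s a toy grader of my own devising — not the official test. It aligns the ants’ output against the reference with an edit distance that charges one point per insertion or deletion, two per wrong character, and only one per “#”, which rolls coverage, the error counts, and modesty into a single number (flow and parsimony would need separate bookkeeping):

```python
def grade(ants_text, reference):
    """Toy alignment score: lower is better. Not the official acceptance test."""
    m, n = len(ants_text), len(reference)
    # d[i][j] = cheapest way to align ants_text[:i] with reference[:j]
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i                       # i insertion errors
    for j in range(1, n + 1):
        d[0][j] = j                       # j deletion errors (missed characters)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if ants_text[i - 1] == reference[j - 1]:
                sub = 0                   # correct character
            elif ants_text[i - 1] == '#':
                sub = 1                   # modest "don't know"
            else:
                sub = 2                   # flat-out wrong
            d[i][j] = min(d[i - 1][j] + 1,        # insertion
                          d[i][j - 1] + 1,        # deletion
                          d[i - 1][j - 1] + sub)  # match / substitution
    return d[m][n]
```

So, for instance, `grade("Hel#o world", "Hello world")` comes out to 1, while the same mistake made confidently, `grade("Helxo world", "Hello world")`, costs 2 — that’s the modesty criterion at work.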