View Single Post
Old 26.11.2009, 20:18
drbits's Avatar
drbits drbits is offline
JD English Support (inactive)
Join Date: Sep 2009
Location: Physically in Los Angeles, CA, USA
Posts: 4,437
Default ReCapcha solution discussion

ReCapcha recognition might be simpler than OCR. It might be cracked by creating a map from physical characteristics of each "word" to a letter sequence. 98% accuracy is plenty (if multiple attempts are taken before giving up on a link). This approach is simpler, if less reliable than letter or skeleton recognition. The main drawback is periodic redefinition of the tri as new words are added.

This is a measure + dictionary approach and has broad application. What makes this workable is that ReCapcha has a limited dictionary, so specific recognition for each word should work. In other words, word recognition,

Measures include:
  • width (CM0x) [10-2048] pixels, (most useful)
  • height (CM0y) [5-128] pixels,
  • density (overall darkness or number of pixels darker than 49%),
  • X center of mass (mean) relative to word box (CM1x),
  • Y center of mass (mean) relative to word box (CM1y),
  • X center of momentum (CM2x),
  • Y center of momentum (CM2y),
  • X median relative to word box (CMm1x), (possibly standardized sCMm1x).
  • Y median relative to word box (CMm1y), (possibly standardized sCMm1y).
  • main X frequency
  • main Y frequency (least useful)
  • Convolution of sample against word box. (very useful, compute intensive)
  • Ratios of any two of the above.
  • Products of any two of the above
The best approach is probably to use a tri data structure (a kind of tree, pronounced like "try") identifying groups of words based on ranges of width, then height, etc. This approach requires less than two tree nodes per "word". The idea is that the root contains keys representing the maximum value of the first measure for each child. A child is either a leaf (character list) or the next measure in turn. This key format is like in a B-Tree. Tris are commonly used for spelling checks and data compression.

A tri is better than a neural network, because it is easier to store in a database, requires only one pass training, and contains fewer connections (a neural network contains one input node per input (n), one output node per possible output(N), and n*N connections. If an intermediate row is required this is n*N*N connections.

The calculations listed above can all be done in "int32 fixed point". The 256's are to return a range of integers. Floor is specified, because of normal integer division. Round can also be used.
Width and Height are integers. These are defined after the actual word box is found.

Density could be the floor of (256 * (total number of pixels with total RGB < 384)) / (width * height)
Range: (0-256)

X Center of mass (CM1x) could be floor (((sum of (X position [base 1] if RGB < 384) ) *256) / (width * widtj)) 
    averaged over all rows.
    Example: [ . . H . H . . . . H . . ] is (256 * (3 + 5 + 10) )/(12*12) = 2
    Range: (0 ... 256)

X Center of momentum (CM2x = second order Center of Mass) could be (CM1x * CM1x).
    Range: (0 ... 32768)

X median (CMm1x) is position (base 1) of middle pixel (averaged over all rows)
Example: [ . . H . H . . . . H . . ] is 5
     Range: (0 ... width)

Standardized X median can be sCMm1x = (256 *CMm1x)/width
Example: [ . . H . H . . . . H . . ] is 106
      Range: (0 ... 256)
Frequencies are to be avoided, because they involve integer discreet Fourier transforms (iDFT). These are complex calculations.

Convolution of a saved pattern with a given word box is a last resort. It is compute intensive. It can be done in spacial or frequency domain.

After width, height, density, and (X and Y)center of mass, statistical methods should be used to determine the best order to apply characteristics.

The tri can be simplified by use of standardized ranges (for a range 0f 0 to 256, the ranges could be <=0, 1..16, 17..32. etc.). This is less accurate than setting ranges by equalizing the number of leaves in that range, but it allows a faster tri build and quick indexing to find the correct key on retrieval.

Answers to questions posed in

Quality OCR programs are 100% on known roman typefaces (where the imaging quality is excellent); even "ransom note" typefaces are 100%. Quality OCR on hand written roman block lettering (often called hand printing or manuscript) is about 98% on random samples (multiple hands) for clear images.

OCR does not take a Turing, Knuth, or Dijkstra. It takes somebody smart, but not hyperintelligent. The real requirement is to be able to combine dictionaries, neural networks, expert systems, and image processing techniques in one package. This is called a computer science generalist.

Last edited by drbits; 26.11.2009 at 22:23. Reason: Additions