You are doing well. But you aren't up to speed on everything.
1) The answer for the unknown word is not important, but a non-space answer is required. It is in our best interest to enter an incorrect word for the unknown word; a word of matching length would be good.
2) The real problem with reCaptcha is that it is so easy for Google to change things compared to how hard it is to solve.
3) Ignoring the obfuscation that Google adds (like the blotches or some changes in the lines), these are from handwritten documents. None of the known OCR engines can solve the word (or it wouldn't be in the challenge as the unknown word). Once a statistically significant number of users give the same answer for an unknown, it becomes a known, and it is still just as hard.
4) Part of the obfuscation is to thicken the lines. This makes the letters touch and makes letter separation difficult (no counting the letters in advance). An easy example to see typed here is burn (Bum vs. Burn). Things get much worse when the letters are freehand. Bunting can look like dantim; pauldmps could look like fwnbrnao or powbna.
5) BeerCan has figured out how to identify the unknown word. BeerCan, please correct me if I'm wrong. They switch things around, but the odds are good that the second word is the unknown. The known words contain only letters, and the known words are usually longer than the unknowns.
6) Given a spelling checker and starting with the shortest words, we could identify the ligatures (combined letters; in print, traditionally ae, fi, fl, and ." with the period centered under the closing quotation mark). In handwritten documents you would have hundreds of such combinations (1 to 4 per individual letter. Some letters are relatively easy, such as x and q, which each have only a few combinations; vowels can be followed by many letters). All of the single-letter and two-letter combinations have been identified with their probabilities in linguistics books. Probabilities for common three-letter combinations are available as well. The probabilities can be adjusted based on the words encountered.
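To make point 6 concrete, here is a rough Python sketch of building such a letter-pair probability table from a word list. This is just my illustration, not anyone's actual code, and the four-word "dictionary" at the bottom is made up:

```python
from collections import Counter

def bigram_probabilities(words):
    """Count adjacent letter pairs in a word list and convert the
    counts into conditional probabilities P(next letter | current)."""
    pair_counts = Counter()
    first_counts = Counter()
    for word in words:
        w = word.lower()
        for a, b in zip(w, w[1:]):
            pair_counts[(a, b)] += 1
            first_counts[a] += 1
    return {pair: n / first_counts[pair[0]]
            for pair, n in pair_counts.items()}

probs = bigram_probabilities(["bunting", "burn", "bum", "band"])
# In this toy list, 'b' is followed by 'u' in 3 of its 4 occurrences,
# so probs[('b', 'u')] is 0.75.
```

A real table would come from a large dictionary or corpus, and could be kept as three tables (beginning, middle, end of word) as described in point 8 below.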
Traditional filter, rotate, and then neural net will not work well here.
The blotches can easily be removed by edge tracing. When there is an anomaly in the edge (a letter crossing the edge), a cubic spline can be used to estimate the missing blotch edge. Then the blotch can be removed by inverting all pixels that belong to it. Slight adjustments where the blotch crosses letters may also help.
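As a much cruder stand-in for the spline-based edge repair above, here is a sketch that just erases any connected blob of ink larger than a size threshold. That catches the big disks but not thin letter strokes; it would not reconstruct a letter the blotch overlaps, which is what the spline step is for. The grid format and threshold are my own assumptions:

```python
from collections import deque

def remove_large_blobs(img, max_area):
    """Flood-fill each connected run of 1-pixels (4-connectivity) in a
    binary image (list of lists); erase any component larger than
    max_area, which catches the big round blotches while sparing the
    thinner letter strokes."""
    h, w = len(img), len(img[0])
    seen = [[False] * w for _ in range(h)]
    for sy in range(h):
        for sx in range(w):
            if img[sy][sx] and not seen[sy][sx]:
                comp, queue = [], deque([(sy, sx)])
                seen[sy][sx] = True
                while queue:
                    y, x = queue.popleft()
                    comp.append((y, x))
                    for ny, nx in ((y-1, x), (y+1, x), (y, x-1), (y, x+1)):
                        if 0 <= ny < h and 0 <= nx < w \
                                and img[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((ny, nx))
                if len(comp) > max_area:
                    for y, x in comp:
                        img[y][x] = 0
    return img
```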
Either edge finding or center finding can be used to follow the flow of the letters, but the information should be simplified into a few kinds of curves, angles, lines, and gaps. That is what gets fed into the Neural Network identification (which returns a collection of letters, with probabilities, for the first unknown letter). Where a line is skinny, we can use an intermediate value to represent it (meaning it might be a gap, as in r+n vs. m).
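One standard way to simplify a traced outline into a few line segments and angles is the Ramer-Douglas-Peucker algorithm; I am offering it here as one candidate for the simplification step, not as the only choice, and the epsilon tolerance is a parameter you would have to tune:

```python
import math

def rdp(points, epsilon):
    """Ramer-Douglas-Peucker: collapse a traced polyline into the few
    segments that stay within epsilon of the original trace."""
    if len(points) < 3:
        return list(points)
    (x0, y0), (x1, y1) = points[0], points[-1]
    dx, dy = x1 - x0, y1 - y0
    norm = math.hypot(dx, dy) or 1.0
    # find the interior point farthest from the chord joining the ends
    dmax, index = 0.0, 0
    for i in range(1, len(points) - 1):
        px, py = points[i]
        d = abs(dy * (px - x0) - dx * (py - y0)) / norm
        if d > dmax:
            dmax, index = d, i
    if dmax <= epsilon:
        return [points[0], points[-1]]
    # recurse on each side of the farthest point and join the results
    left = rdp(points[:index + 1], epsilon)
    right = rdp(points[index:], epsilon)
    return left[:-1] + right

# a jittery right angle collapses to its three corner points
corner = rdp([(0, 0), (1, 0.01), (2, 0), (2, 1), (2, 2)], 0.1)
# → [(0, 0), (2, 0), (2, 2)]
```

The surviving vertices give the lines and angles; gaps and curve classes would come from a separate pass over the trace.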
One form of obfuscation that Google seems to have added is thresholding the greyscale image to form a black-and-white image. The thresholds are apparently not always the same, but they are low enough to widen most letters and hide the "pen" and "hand". At the same time, where the center of the writing nib was drier than the edges, you get two thin lines instead of one fat line. Two close lines (usually from different letters) can appear as a single line, but the line thickness and length can help distinguish these situations. Using center finding allows us to trace thick lines based on the distance from the clockwise edge and, later, the counterclockwise edge.
Our best shot is to estimate the length (±2), identify the most common letter pairs for the beginning of a word, and then identify the common letter pairs for letters 2 and 3. The results would be ordered by the highest "probability", and the first two characters identified that way. The third character is identified in a similar way (now the probability is the probability of the first pair times the probability of the second pair).
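A toy sketch of that ordering step. The NN scores and pair probabilities here are invented numbers, purely to show the multiplication and sorting:

```python
def rank_prefixes(nn_first, pair_prob, top_k=5):
    """Rank three-letter prefixes by multiplying the NN's score for the
    first letter by the pair probabilities for letters 1-2 and 2-3."""
    candidates = []
    for l1, p1 in nn_first.items():
        for (a, b), p12 in pair_prob.items():
            if a != l1:
                continue
            for (c, d), p23 in pair_prob.items():
                if c != b:
                    continue
                candidates.append((l1 + b + d, p1 * p12 * p23))
    candidates.sort(key=lambda t: -t[1])
    return candidates[:top_k]

# made-up NN outputs and pair table
nn_first = {"b": 0.6, "d": 0.4}
pair_prob = {("b", "u"): 0.5, ("u", "r"): 0.7, ("u", "m"): 0.3,
             ("d", "a"): 0.6, ("a", "n"): 0.8}
ranked = rank_prefixes(nn_first, pair_prob)
# 'bur' (0.6 * 0.5 * 0.7) ranks first, ahead of 'dan' and 'bum'
```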

Sorry this is so poorly organized. Let me try to summarize.
1) Process out the blotches (the big disks) and apply a despeckling filter.
2) We trace the edges to find a polyline that represents the skeleton of the letter.
3) We train a Neural Network to give us a probability of a letter, given a skeleton. We train it by manually separating the letters. The number of nodes will be around twice the number needed for printed text, and the probabilities will be lower.
3a) A separate NN may be necessary for first letters.
4) Given each first letter (with the NN probability pNN), we apply the two-letter table, which supplies pF(L1, L2).
4a) We threshold the probabilities to create a collection of letter pairs.
5) We take the first letter in the pair and estimate where the skeletons separate, based on matching the skeleton to the data. We can then apply a NN for the second letter. Again, the probabilities are multiplied.
6) At this point, we have the probability for each letter that it appears at the beginning of a word, the identification strength of the corresponding NN node(s), the probability of letter X appearing after the first (for each of the first letters above a threshold), and the NN node(s) strength for each of those second letters.
6a) Especially for vowels and capital letters, the number of nodes that identify the various forms of a letter is more than one. The usable result is the sum (OR) of the values that indicate that letter. Capital Q has two very distinct and different appearances (one looks like a large 2). The lowercase a has a form with the top hook and a form without it. The distinction of whether the line segment on the right ends at the top of the curve or beyond it can be handled by a NN node. Lowercase q can look like a backwards p; its descender can end in a hook (an acute corner to a shorter line segment), a loop (either way), or the descender line may ascend beyond the curved portion (d and q combined).
7) We go to our local NN expert (Jiaz?) to determine which shapes can use the same NN node. For simplicity, we have enough nodes so that no node represents returning two different letters (like q and a without the hooks).
8) The NN for the beginning of a word will likely be different from the NN for intermediate letters (the letter pair tables are divided into beginning, middle, and end tables).
9) We continue until we have choices and probabilities for the first three letters in the word. We can then use a dictionary (in a trie) to identify those combinations that are unlikely (give this a weight to multiply the value by). This eliminates some of the choices. We continue subtracting the skeleton of the guessed letter, running the NN on the next letter, multiplying by the probability of those letters following each other, and comparing with the dictionary.
10) At each point, we limit the number of guesses to a specific number of possibilities (first letter 26, two letters might be 16, three letters and on might be 8 or fewer).
11) Obviously, this needs diagrams.
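Here is a sketch of the trie lookup from step 9. The dictionary words, guesses, and the penalty weight are all invented for illustration:

```python
def build_trie(words):
    """Nested-dict trie; '$' marks the end of a complete word."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node['$'] = True
    return root

def prune_guesses(guesses, trie, penalty=0.05):
    """Down-weight letter sequences that are not a prefix of any
    dictionary word (step 9); plausible prefixes keep their score."""
    scored = []
    for prefix, p in guesses:
        node, ok = trie, True
        for ch in prefix:
            if ch not in node:
                ok = False
                break
            node = node[ch]
        scored.append((prefix, p if ok else p * penalty))
    return sorted(scored, key=lambda t: -t[1])

trie = build_trie(["burn", "bunting"])
# "fwn" starts with a higher raw score but is no word's prefix,
# so the penalty drops it below "bur"
pruned = prune_guesses([("bur", 0.21), ("fwn", 0.25)], trie)
```

The penalty multiplier, rather than outright deletion, keeps the door open for words the dictionary lacks (names, archaic spellings), which matters for old scanned documents.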
If necessary, we can use the Gutenberg Foundation's files to determine some of the letter pairings for the pre-typewriter, pre-ballpoint era (typewriters were adopted in business starting around 1900; ballpoint pens came in the mid 20th century).
Good Morning Europe
drbits
