JDownloader Community - Appwork GmbH
 

Closed Thread
 
  #1  
Old 26.11.2009, 20:18
drbits drbits is offline
JD English Support (inactive)
 
Join Date: Sep 2009
Location: Physically in Los Angeles, CA, USA
Posts: 4,437
Default ReCaptcha solution discussion

ReCaptcha recognition might be simpler than general OCR. It might be cracked by building a map from the physical characteristics of each "word" image to a letter sequence. 98% accuracy is plenty if several attempts are made before giving up on a link. This approach is simpler, if less reliable, than letter or skeleton recognition. The main drawback is that the trie must be rebuilt periodically as new words are added.

This is a measure + dictionary approach and has broad application. What makes it workable is that ReCaptcha has a limited dictionary, so specific recognition for each word should work. In other words, this is word recognition rather than letter recognition.

Measures include:
  • width (CM0x) [10-2048] pixels, (most useful)
  • height (CM0y) [5-128] pixels,
  • density (overall darkness or number of pixels darker than 49%),
  • X center of mass (mean) relative to word box (CM1x),
  • Y center of mass (mean) relative to word box (CM1y),
  • X center of momentum (CM2x),
  • Y center of momentum (CM2y),
  • X median relative to word box (CMm1x), (possibly standardized sCMm1x),
  • Y median relative to word box (CMm1y), (possibly standardized sCMm1y),
  • main X frequency,
  • main Y frequency (least useful),
  • Convolution of sample against word box (very useful, compute intensive),
  • Ratios of any two of the above,
  • Products of any two of the above.
The best approach is probably to use a trie data structure (a kind of tree, pronounced like "try") that identifies groups of words based on ranges of width, then height, and so on. This approach requires fewer than two tree nodes per "word". The idea is that the root contains keys representing the maximum value of the first measure for each child. A child is either a leaf (character list) or a node for the next measure in turn. This key format resembles that of a B-tree. Tries are commonly used for spell checking and data compression.
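The lookup described above can be sketched roughly as follows. This is only an illustration under assumptions: the `Node` class, the `lookup` function, and the sample words and thresholds are all made up here, and the measures are assumed to arrive in a fixed order (width first, then height).

```python
from bisect import bisect_left


class Node:
    """Inner node: keys[i] is the maximum measure value accepted by children[i]."""

    def __init__(self, keys, children):
        self.keys = keys          # sorted upper bounds, as in a B-tree
        self.children = children  # each child is a Node or a list of words (leaf)


def lookup(root, measures):
    """Walk one level per measure and return the candidate words (maybe empty)."""
    node = root
    for value in measures:
        if isinstance(node, list):       # reached a leaf early
            break
        i = bisect_left(node.keys, value)  # first key >= value
        if i == len(node.keys):          # value exceeds every range
            return []
        node = node.children[i]
    return node if isinstance(node, list) else []


# Hypothetical tree: the first level splits on width, the second on height.
root = Node([50, 100], [["cat"], Node([20, 40], [["house"], ["tiger"]])])
print(lookup(root, [80, 30]))
```

Each level costs one binary search, so retrieval stays cheap even with many words per level.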

A trie is better than a neural network because it is easier to store in a database, requires only one training pass, and contains fewer connections. A fully connected neural network contains one input node per input (n), one output node per possible output (N), and n*N connections. If an intermediate (hidden) layer of h nodes is required, this becomes n*h + h*N connections.

The calculations listed above can all be done in "int32 fixed point". The factors of 256 scale each result into a range of integers. Floor is specified because that is what normal integer division produces; rounding can also be used.
Code:
Width and Height are integers. These are defined after the actual word box is found.

Density could be the floor of (256 * (total number of pixels with total RGB < 384)) / (width * height)
Range: (0-256)

X Center of mass (CM1x) could be floor (((sum of (X position [base 1] if RGB < 384) ) *256) / (width * width))
    averaged over all rows.
    Example: [ . . H . H . . . . H . . ] is (256 * (3 + 5 + 10) )/(12*12) = 32
    Range: (0 ... 256)

X Center of momentum (CM2x = second order Center of Mass) could be (CM1x * CM1x).
    Range: (0 ... 65536)

X median (CMm1x) is position (base 1) of middle pixel (averaged over all rows)
    Example: [ . . H . H . . . . H . . ] is 5
    Range: (0 ... width)

Standardized X median can be sCMm1x = (256 *CMm1x)/width
    Example: [ . . H . H . . . . H . . ] is 106
    Range: (0 ... 256)
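For anyone who wants to check the integer arithmetic, here is a small sketch that evaluates the formulas above on a single row of the example. It is not a full implementation: `row_measures` is a hypothetical helper, dark pixels are represented by the letter H as in the examples, and for an even number of dark pixels the lower of the two middle pixels is taken (an assumption, since the post does not say).

```python
def row_measures(row, dark="H"):
    """Integer-only measures for one row of pixels given as a string."""
    width = len(row)
    # Base-1 X positions of the dark pixels.
    xs = [i + 1 for i, pixel in enumerate(row) if pixel == dark]
    density = (256 * len(xs)) // width       # share of dark pixels, scaled by 256
    cm1x = (256 * sum(xs)) // (width * width)  # formula as written in the post
    median = xs[(len(xs) - 1) // 2] if xs else 0
    s_median = (256 * median) // width       # standardized median
    return {"density": density, "CM1x": cm1x, "CMm1x": median, "sCMm1x": s_median}


example = "..H.H....H.."
print(row_measures(example))
```

Python's `//` is floor division, so no floating point is involved anywhere.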
Frequencies are to be avoided, because they involve integer discrete Fourier transforms (DFTs). These are complex calculations.

Convolution of a saved pattern with a given word box is a last resort. It is compute intensive. It can be done in the spatial or the frequency domain.

After width, height, density, and X and Y center of mass, statistical methods should be used to determine the best order in which to apply the remaining characteristics.

The trie can be simplified by use of standardized ranges (for a range of 0 to 256, the ranges could be <=0, 1..16, 17..32, etc.). This is less accurate than setting ranges by equalizing the number of leaves in each range, but it allows a faster trie build and quick indexing to find the correct key on retrieval.
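With standardized ranges, the key index can be computed directly instead of searched for. A minimal sketch, assuming a step of 16 and a hypothetical helper named `bucket`:

```python
def bucket(value, step=16):
    """Map a measure to its range index: <=0 -> 0, 1..16 -> 1, 17..32 -> 2, ..."""
    return (max(value, 0) + step - 1) // step


for v in (0, 1, 16, 17, 32, 256):
    print(v, bucket(v))
```

The result can be used directly as an index into a node's child array, which is where the quick retrieval comes from.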

------------------------------------
Answers to questions posed in http://board.jdownloader.org/showthread.php?t=11101

Quality OCR programs are 100% on known roman typefaces (where the imaging quality is excellent); even "ransom note" typefaces are 100%. Quality OCR on hand written roman block lettering (often called hand printing or manuscript) is about 98% on random samples (multiple hands) for clear images.
_________________

OCR does not take a Turing, Knuth, or Dijkstra. It takes somebody smart, but not hyperintelligent. The real requirement is to be able to combine dictionaries, neural networks, expert systems, and image processing techniques in one package. This is called a computer science generalist.

Last edited by drbits; 26.11.2009 at 22:23. Reason: Additions
  #2  
Old 26.11.2009, 21:01
Jiaz Jiaz is offline
JD Manager
 
Join Date: Mar 2009
Location: Germany
Posts: 66,134
Default

We don't have manpower left to try anticaptcha for recaptcha at the moment. Also note that if it were THAT simple, an anticaptcha for it would already exist.
__________________
JD-Dev & Server-Admin
  #3  
Old 26.11.2009, 22:47
drbits drbits is offline
JD English Support (inactive)
 
Join Date: Sep 2009
Location: Physically in Los Angeles, CA, USA
Posts: 4,437
Default

I am willing to give it a try. I will have to download the JDownloader source and install Eclipse. I will give you preliminary feedback after I have examined relevant parts of the existing code.

It probably hasn't been solved because it is a big problem and people are not willing to think outside the box. I can guarantee that the problem has a solution, but the guaranteed solution is impractical.
  #4  
Old 27.11.2009, 02:43
Jiaz Jiaz is offline
JD Manager
 
Join Date: Mar 2009
Location: Germany
Posts: 66,134
Default

Every problem has a solution. The question is only whether it's a theoretical or a practical solution.

If it were that easy, many others would already have created an anticaptcha for it.
  #5  
Old 27.11.2009, 10:20
remi
Guest
 
Posts: n/a
Cool

Drbits, I really appreciate your out of the box thinking style. I think it might be easy for people with a mind set like yours and I wish you all the best.

The reason that it hasn't been cracked is that there probably is not enough talent focussing on the problem. It's different from the usual hacking attempts, like DRM protection hacking, where the commercial stakes are/were much higher.
  #6  
Old 06.12.2009, 00:49
dest
Guest
 
Posts: n/a
Default

I'm not an expert in this, but I guess it only takes some undistorting and a dictionary to crack Recaptcha.

Edit: or if you think Recaptcha's image captcha is too hard, then you may try the audio one. There is less room for noise in audio.

Last edited by dest; 09.01.2010 at 10:16.
  #7  
Old 06.12.2009, 09:53
drbits drbits is offline
JD English Support (inactive)
 
Join Date: Sep 2009
Location: Physically in Los Angeles, CA, USA
Posts: 4,437
Default

reCaptcha works with a given set of word images. For a while, hotfile was using the word images that Google's OCR could not read, and obscuring them a little more. The problem is that most of them were not human readable either (and they were not in the dictionary).

Hotfile has now changed back to an image set that is marginally OCR readable. JDownloader can frequently get the right answer, and with retries get through the hotfile reCaptcha.

Audio is really the same problem. They add noise in ways you wouldn't believe. For example, in the background, people are speaking numbers and letters. In addition, there is noise added that is as loud as the numbers and letters spoken in the foreground. By controlling the volume of the background noise and the type and quantity of louder noises, they can make it as hard as they want.
  #8  
Old 06.12.2009, 10:11
remi
Guest
 
Posts: n/a
Cool

This means that after excluding the blind or visually impaired people they will also exclude the deaf and people with hearing loss.

Isn't this a form of discrimination for which these sites can be sued?
  #9  
Old 12.12.2009, 04:28
dest
Guest
 
Posts: n/a
Default

Well, you may sue Google about that and ban Recaptcha once and for all, along with all the other captchas, at least in the USA.
  #10  
Old 12.12.2009, 08:15
Gweilo Gweilo is offline
JD Legend
 
Join Date: Mar 2009
Posts: 716
Default

Quote:
Originally Posted by drbits View Post
Quality OCR programs are 100% on known roman typefaces (where the imaging quality is excellent); even "ransom note" typefaces are 100%. Quality OCR on hand written roman block lettering (often called hand printing or manuscript) is about 98% on random samples (multiple hands) for clear images.
OT, but you seem knowledgeable:
What are these "Quality OCR programs"?
I occasionally need to do OCR on large quantities of text. (Plain pages of text, like a novel.)
I tried OmniPage, TextBridge and Abbyy.
Abbyy seemed the best from my limited experience.
  #11  
Old 02.03.2010, 05:08
drbits drbits is offline
JD English Support (inactive)
 
Join Date: Sep 2009
Location: Physically in Los Angeles, CA, USA
Posts: 4,437
Default

Sorry, Gweilo, I didn't see your message until now. Check out what Project Gutenberg is using.
  #12  
Old 02.03.2010, 15:23
pspzockerscene pspzockerscene is offline
Community Manager
 
Join Date: Mar 2009
Location: Deutschland
Posts: 48,767
Default

I am quite sure that the spammers here in the forum (users can't see the spam, the threads are all locked) have already cracked the Re Captcha captchas, but I don't know how^^

GreeZ pspzockerscene
__________________

Ad-free installers || Werbefreie Installer
Windows Setup<--JD2 BETA-->Linux Setup x86 || Linux Setup x64 || Mac Setup
-----=>Support Chat<=-----
Spoiler:

A users' JD crashes and the first thing to ask is:
Quote:
Originally Posted by Jiaz View Post
Do you have Nero installed?
That's true James
Quote:
Originally Posted by James
People simply do not understand that just because a gun can also be used to shoot at people, a shooting club is not a place for ideas about shooting sprees
  #13  
Old 09.03.2010, 06:16
drbits drbits is offline
JD English Support (inactive)
 
Join Date: Sep 2009
Location: Physically in Los Angeles, CA, USA
Posts: 4,437
Default New anti-reCaptcha idea

@PSP:
I think they are just taking the extra time to enter the recaptcha response. I don't think they are using a bot.
--------------------------------------------
New anti-reCaptcha idea

Note: My notes about the "hand" no longer apply. The current characters look like they were written with a round pointed felt tip pen. I suspect that this is because of the filtering and cleaning up that Google has already done to the letters.

1) Removing the splotches should be pretty easy.
a) Find the top and bottom of a splotch, save those as points.
b) Trace the splotch saving a point above/below any discontinuity.
c) If the left side has a discontinuity, use a cubic spline to fill in the gap in the outline. Do the same for the right side.
d) As we scan to find the edge, we invert the section of line that is within the splotch. (this is as we trace it in b and c).

2) Apparently, only the second word counts. You must supply a first word, but the second word has to match. We have hundreds of samples of the second word and we can generate more by making PHP calls to refresh the registration page and save the images as we go.

3) We manually divide and identify the characters in the second word for several words. We create an imaginary grid in which the boxes are the same size as the largest letter (including ascenders and descenders, plus a little extra for later). The grid starts 40 wide by 20 high (we may have to increase the height). The columns represent the symbols [a-z0-9\.\,\-], so we place each of the characters into a grid square in the appropriate column.
3a) We might have to filter the image of each letter to clean it up and make the line thicknesses uniform. Then, we would filter the challenges the same way.
3b) We can deal with the rotation of characters by defining a measure of vertical and rotating so that the character is vertical. The vertical can be the longest direction of the character or a measure based on the longest line. It does not matter if the characters come out vertical, as long as they always are rotated to the same position.

4) Now, we continue this, but we convolve each new character with the characters in the grid and take the power spectrum (this is the quality of match). Where the power spectrum shows that the new character is already in the grid, we throw away the new character. If there is no good match, we identify the character and add it to the grid. After a while, we will not be adding many characters to the grid.

5) To use this, we convolve the word with the grid, take the power spectrum, and that identifies the characters in the word. The real part will tell us the position in the word. Ideally, we could read off the word now.
5a) We can identify the first letter, remove it and then work on the next letter, etc.
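To make the matching step more concrete, here is a rough spatial-domain stand-in. It is not the power-spectrum method described above: it simply slides a character template across a word image and counts agreeing pixels, which is the brute-force equivalent of a correlation. The names `match_score` and `best_offset` and the tiny images are made up for illustration, and the template is assumed to be as tall as the image.

```python
def match_score(window, template):
    """Count the pixels on which the window and the template agree."""
    return sum(
        w == t
        for window_row, template_row in zip(window, template)
        for w, t in zip(window_row, template_row)
    )


def best_offset(image, template):
    """Slide the template left to right; return (x offset, score) of the best match."""
    height, width = len(template), len(template[0])
    best_x, best_score = -1, -1
    for x in range(len(image[0]) - width + 1):
        window = [row[x:x + width] for row in image[:height]]
        score = match_score(window, template)
        if score > best_score:
            best_x, best_score = x, score
    return best_x, best_score


image = ["..#.#..",
         "..###.."]
template = ["#.#",
            "###"]
print(best_offset(image, template))
```

A perfect match scores width times height of the template; anything well below that would count as "no good match" in step 4.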

6) There are letter combinations that look the same. For example, r n looks like n n which looks like m (without the spaces: rn nn m). This is part of where the dictionary comes in.

7) We do a dictionary search, with limited spelling correction and either request a new challenge or return the word we found.
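Steps 6 and 7 could look something like the sketch below. It only handles the rn / nn / m confusion with a single substitution per candidate, which is one possible reading of "limited spelling correction". The names `CONFUSABLE`, `candidates`, and `resolve` and the toy dictionary are assumptions for the example.

```python
CONFUSABLE = ("rn", "nn", "m")  # groups that look alike in the images


def candidates(word):
    """Yield the word itself, then every variant with one group swapped."""
    yield word
    for group in CONFUSABLE:
        start = word.find(group)
        while start != -1:
            for other in CONFUSABLE:
                if other != group:
                    yield word[:start] + other + word[start + len(group):]
            start = word.find(group, start + 1)


def resolve(word, dictionary):
    """Return the first candidate found in the dictionary, or None."""
    for candidate in candidates(word):
        if candidate in dictionary:
            return candidate
    return None  # caller would request a new challenge


words = {"modern", "corn"}
print(resolve("rnodern", words))
```

Returning None corresponds to the "request a new challenge" branch of step 7.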

I can explain any of the algorithm pieces (convolution, power spectrum, etc.) upon request.

Comments? Questions?

Last edited by drbits; 09.03.2010 at 06:25.
  #14  
Old 13.03.2010, 02:07
BearCan
Guest
 
Posts: n/a
Default

For the dictionary approach, why not let Google check it for us?
We've all sent misspelled words for a search only to have Google
come back with suggestions.

Do we really need to create an OCR engine? I did a search of SourceForge and found at least three OCR projects. Why are we unable to use them? It seems to me the only processing JD needs to do is image cleaning, so the engine returns a higher percentage of right answers.

I was thinking that to reduce a black circle to a line, take, say,
6 pixels out of a line. If they are all black, remove 3 pixels from the end.
Slide over 3 and repeat the test until all 6 are not black.
I hope this is where the edge of a character should be.

Another idea is to send the image to an emboss filter.
The OCR engine might work with the edges it creates.
Some Photoshop scripter should be able to test these concepts.
  #15  
Old 13.03.2010, 02:11
pspzockerscene pspzockerscene is offline
Community Manager
 
Join Date: Mar 2009
Location: Deutschland
Posts: 48,767
Default

Atm we have more important things to do than *cracking* Re Captcha.
How about you guys doing that ?
Help from users is always welcome!

GreeZ pspzockerscene
  #16  
Old 13.03.2010, 10:14
drbits drbits is offline
JD English Support (inactive)
 
Join Date: Sep 2009
Location: Physically in Los Angeles, CA, USA
Posts: 4,437
Default

Thank you pspzockerscene, we are discussing options and intend to try ourselves. We do not want to add to the developer team's burden.
________________________________
@ BearCan

I like the idea of using an outside dictionary; except for the communication delay, it is perfect. Google's search dictionary won't help, though. It uses stemming and searches for all words with the same stem. I believe there is a Soundex-like matching scheme to suggest alternatives to unusual words. However, we can check the Google API to see if we can check a word against Google's word list (in the DB) or use the Google Dictionary Service.

I just realized that all that we need is the number of matches for a one-word Google search (we put a + in front of the word to prevent stemming and alternative recommendation). I am pretty sure that is in the API (and fast). The real words in the challenges should be somewhere in Google. Especially if we are usually only looking for the longer of the two words.

It sounds better than importing a 50,000 word list from Merriam-Webster (which is partially stemmed).
----------------------
When looking to remove the black spots, we have to remember that the text is inverted, so we need the left and right edge of the spot where it crosses the text (the text won't tell us that very easily).

We also have to remember that there are two words. We can start at the top and check each row of pixels for some minimum run (possibly 6) of pixels to find a spot. We do the same thing for the bottom. Then, we trace the spots (we will find both and have up to 4 curves). Inverting the pixels inside a closed cubic spline should be in one of the open source libraries.
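The row scan could be sketched like this. It is a simplification under assumptions: the image is a list of rows of 0/1 values with 1 meaning dark, a spot is taken to be any run of at least 6 consecutive dark pixels, and the names `first_row_with_run` and `spot_rows` are invented for the example.

```python
def first_row_with_run(image, min_run=6):
    """Index of the first row holding min_run consecutive dark pixels, else None."""
    for y, row in enumerate(image):
        run = 0
        for pixel in row:
            run = run + 1 if pixel else 0
            if run >= min_run:
                return y
    return None


def spot_rows(image, min_run=6):
    """Scan from the top and from the bottom; return (top, bottom) or None."""
    top = first_row_with_run(image, min_run)
    if top is None:
        return None
    from_bottom = first_row_with_run(image[::-1], min_run)
    return top, len(image) - 1 - from_bottom


image = [
    [0] * 10,
    [0, 0, 1, 1, 1, 1, 1, 1, 0, 0],
    [0, 1, 1, 1, 1, 1, 1, 1, 1, 0],
    [0, 0, 0, 1, 1, 1, 0, 0, 0, 0],
    [0] * 10,
]
print(spot_rows(image))
```

Tracing the outline and the spline fill would start from these two rows; neither is attempted here.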

If a word passed through one edge of the spot, we know its approximate vertical position, step out until we have a blank line, and then step in to the first non-blank line. If the word is entirely within the spot, it probably is not our target word. However, it is easy enough to isolate the word in the same way, just starting with the middle of the spot. The word is usually not straight, so extracting it into a box is a good time to rotate and warp it straight.

The problem is avoiding getting the two words in one box. To do that (for the left word), we start at the left and scan at the appropriate vertical range to find the left edge. By looking up to 5 pixels ahead, we can find the top of the next character and keep going until the end of the word. Once that is identified, copied, and cleared, the other word is done the same way, but without having any concerns about hitting the wrong word.

I will have to look up how to straighten the word (so that each letter is at the same scale and on the same baseline). We might just want to move each letter until it is centered vertically on the center line of the box and then scale it. This will push "g" up so that the descender is within the box and will make lower case letters as large as upper case letters.

While you are experimenting with reCaptcha, can you find out if it is case sensitive?
We also need to find out if the checked word (the longer one) ever contains digits, hyphens, periods, or commas. If not, we can reduce the complexity of the neural network.
  #17  
Old 28.03.2010, 18:03
BearCan
Guest
 
Posts: n/a
Default

Hotfile captcha is not case sensitive. Numbers are required.
Hyphens, periods, commas, dashes, etc. are not required: 1,024=1024, high-life=highlife.

Are you going to pick an OCR engine, drbits?
Have you played with the image enhancement using the cubic spline idea?

I do not believe scale and letter baseline will be a problem for an engine. I think resolution will be. At the very worst, we can send a single letter to the engine. That will solve scale and baselining.
  #18  
Old 28.03.2010, 18:21
drbits drbits is offline
JD English Support (inactive)
 
Join Date: Sep 2009
Location: Physically in Los Angeles, CA, USA
Posts: 4,437
Default

For some reason, they are fighting hard. They seem to add another obfuscation every 2 months. I decided not to work on it. There are a couple of teams working on it for other products and we can use their work.
  #19  
Old 28.03.2010, 18:26
pspzockerscene pspzockerscene is offline
Community Manager
 
Join Date: Mar 2009
Location: Deutschland
Posts: 48,767
Default

You guys can deal with that.
I think it's just a waste of time trying to crack Re Captcha.
Yes, it isn't impossible, but it just isn't worth it.

GreeZ pspzockerscene
  #20  
Old 29.03.2010, 11:14
remi
Guest
 
Posts: n/a
Cool

Quote:
Originally Posted by drbits View Post
For some reason, they are fighting hard. They seem to add another obfuscation every 2 months. I decided not to work on it. There are a couple of teams working on it for other products and we can use their work.
The reason why they add another obfuscation every 2 months is because these teams are close to a solution. Gogol will need to add obfuscations more frequently as these teams will find a pattern in these obfuscation techniques.

reCaptcha will eventually be solved, because we humans can beat Gogol if we stand together. We beat Nazism too.