JDownloader Community - Appwork GmbH
 

  #1  
Old 26.11.2009, 20:18
drbits drbits is offline
JD English Support (inactive)
 
Join Date: Sep 2009
Location: Physically in Los Angeles, CA, USA
Posts: 4,434
Default reCaptcha solution discussion

reCaptcha recognition might be simpler than OCR. It might be cracked by creating a map from physical characteristics of each "word" to a letter sequence. 98% accuracy is plenty (if multiple attempts are taken before giving up on a link). This approach is simpler, if less reliable, than letter or skeleton recognition. The main drawback is periodic redefinition of the trie as new words are added.

This is a measure + dictionary approach and has broad application. What makes this workable is that reCaptcha has a limited dictionary, so specific recognition for each word should work. In other words, word recognition.

Measures include:
  • width (CM0x) [10-2048] pixels, (most useful)
  • height (CM0y) [5-128] pixels,
  • density (overall darkness or number of pixels darker than 49%),
  • X center of mass (mean) relative to word box (CM1x),
  • Y center of mass (mean) relative to word box (CM1y),
  • X center of momentum (CM2x),
  • Y center of momentum (CM2y),
  • X median relative to word box (CMm1x), (possibly standardized sCMm1x).
  • Y median relative to word box (CMm1y), (possibly standardized sCMm1y).
  • main X frequency
  • main Y frequency (least useful)
  • Convolution of sample against word box. (very useful, compute intensive)
  • Ratios of any two of the above.
  • Products of any two of the above
The best approach is probably to use a trie data structure (a kind of tree, pronounced like "try") identifying groups of words based on ranges of width, then height, etc. This approach requires fewer than two tree nodes per "word". The idea is that the root contains keys representing the maximum value of the first measure for each child. A child is either a leaf (character list) or a node for the next measure in turn. This key format is like that of a B-Tree. Tries are commonly used for spelling checks and data compression.

A trie is better than a neural network, because it is easier to store in a database, requires only one-pass training, and contains fewer connections (a neural network contains one input node per input (n), one output node per possible output (N), and n*N connections; if an intermediate row is required, this is n*N*N connections).
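The range-keyed tree described above could be sketched roughly as follows. This is only an illustration: the Node class, the lookup function, and the toy widths, heights, and words are all made up for the example and are not taken from any existing JDownloader code.

```python
from bisect import bisect_left


class Node:
    """Inner node: keys[i] is the maximum value of this node's measure for children[i]."""

    def __init__(self, measure, keys, children):
        self.measure = measure      # name of the measure tested at this level
        self.keys = keys            # sorted upper bounds, one per child
        self.children = children    # each child is a Node or a leaf (list of words)


def lookup(node, measures):
    """Walk from the root to a leaf, taking the first child whose key >= the measured value."""
    while isinstance(node, Node):
        i = bisect_left(node.keys, measures[node.measure])
        if i == len(node.keys):     # value is larger than every range: no candidate
            return []
        node = node.children[i]
    return node


# Hypothetical toy tree: split on width first, then (for narrow words) on height.
tree = Node("width", [40, 80], [
    Node("height", [12, 24], [["cat"], ["dog"]]),
    ["house", "horse"],
])

candidates = lookup(tree, {"width": 35, "height": 20})
```

Because the keys are upper bounds, a binary search per level is enough, which is the same reason the B-Tree comparison fits.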

The calculations listed above can all be done in "int32 fixed point". The 256's are to return a range of integers. Floor is specified, because of normal integer division. Round can also be used.
Code:
Width and Height are integers. These are defined after the actual word box is found.

Density could be the floor of (256 * (total number of pixels with total RGB < 384)) / (width * height)
Range: (0-256)

X Center of mass (CM1x) could be floor (((sum of (X position [base 1] if RGB < 384) ) *256) / (width * width)) 
    averaged over all rows.
    Example: [ . . H . H . . . . H . . ] is (256 * (3 + 5 + 10) )/(12*12) = 32
    Range: (0 ... 256)

X Center of momentum (CM2x = second order Center of Mass) could be (CM1x * CM1x).
    Range: (0 ... 65536)

X median (CMm1x) is position (base 1) of middle pixel (averaged over all rows)
Example: [ . . H . H . . . . H . . ] is 5
     Range: (0 ... width)

Standardized X median can be sCMm1x = (256 *CMm1x)/width
Example: [ . . H . H . . . . H . . ] is 106
      Range: (0 ... 256)
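
The integer formulas above can be tried out on the one-row example. The sketch below is only an illustration: it treats a row as a string in which "H" marks a dark pixel, and the function names are invented for the example. It follows the formulas exactly as written, using floor division.

```python
def row_measures(row, dark="H"):
    """Return (CM1x, CMm1x, sCMm1x) for one row, using the formulas as written above."""
    width = len(row)
    xs = [x for x, ch in enumerate(row, start=1) if ch == dark]  # base-1 positions
    if not xs:
        return 0, 0, 0
    cm1x = (256 * sum(xs)) // (width * width)
    median = xs[(len(xs) - 1) // 2]     # middle dark pixel (lower middle if the count is even)
    s_median = (256 * median) // width
    return cm1x, median, s_median


def density(rows, dark="H"):
    """floor(256 * dark pixels / (width * height)) for a list of equal-length rows."""
    width, height = len(rows[0]), len(rows)
    dark_count = sum(row.count(dark) for row in rows)
    return (256 * dark_count) // (width * height)


row = "..H.H....H.."
cm1x, median, s_median = row_measures(row)
```

Evaluated literally, the CM1x formula gives (256 * 18) // 144 = 32 for this row; the median and standardized median come out as 5 and 106.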
Frequencies are to be avoided, because they involve integer discrete Fourier transforms (iDFT). These are complex calculations.

Convolution of a saved pattern with a given word box is a last resort. It is compute intensive. It can be done in the spatial or frequency domain.

After width, height, density, and (X and Y) center of mass, statistical methods should be used to determine the best order to apply characteristics.

The trie can be simplified by use of standardized ranges (for a range of 0 to 256, the ranges could be <=0, 1..16, 17..32, etc.). This is less accurate than setting ranges by equalizing the number of leaves in that range, but it allows a faster trie build and quick indexing to find the correct key on retrieval.
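
With standardized ranges, finding the key needs no search at all, only integer arithmetic. A minimal sketch, assuming a step of 16 and the range layout given above (the function name is made up for the example):

```python
def bucket(value, step=16):
    """Map a measure in 0..256 to a range index: <=0 -> 0, 1..16 -> 1, 17..32 -> 2, ..."""
    if value <= 0:
        return 0
    return (value + step - 1) // step


indexes = [bucket(v) for v in (0, 1, 16, 17, 32, 256)]
```

The index can be used directly as a position in the node's child array, which is where the quick retrieval comes from.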

------------------------------------
Answers to questions posed in http://board.jdownloader.org/showthread.php?t=11101

Quality OCR programs are 100% on known roman typefaces (where the imaging quality is excellent); even "ransom note" typefaces are 100%. Quality OCR on hand written roman block lettering (often called hand printing or manuscript) is about 98% on random samples (multiple hands) for clear images.
_________________

OCR does not take a Turing, Knuth, or Dijkstra. It takes somebody smart, but not hyperintelligent. The real requirement is to be able to combine dictionaries, neural networks, expert systems, and image processing techniques in one package. This is called a computer science generalist.

Last edited by drbits; 26.11.2009 at 22:23. Reason: Additions
  #2  
Old 26.11.2009, 21:01
Jiaz Jiaz is offline
JD Manager
 
Join Date: Mar 2009
Location: Germany
Posts: 79,286
Default

We don't have manpower left to try anticaptcha for recaptcha at the moment. Also note that if it were THAT simple, an anticaptcha for it would already exist.
__________________
JD-Dev & Server-Admin
  #3  
Old 26.11.2009, 22:47
drbits drbits is offline
JD English Support (inactive)
 
Join Date: Sep 2009
Location: Physically in Los Angeles, CA, USA
Posts: 4,434
Default

I am willing to give it a try. I will have to download the JDownloader source and install Eclipse. I will give you preliminary feedback after I have examined relevant parts of the existing code.

It probably hasn't been solved because it is a big problem and people are not willing to think outside the box. I can guarantee that the problem has a solution, but the guaranteed solution is impractical.
  #4  
Old 27.11.2009, 02:43
Jiaz Jiaz is offline
JD Manager
 
Join Date: Mar 2009
Location: Germany
Posts: 79,286
Default

Every problem has a solution. The only question is whether it's a theoretical or a practical solution.

If it had been that easy, many others would already have created an anticaptcha for it.
  #5  
Old 27.11.2009, 10:20
remi
Guest
 
Posts: n/a
Cool

Drbits, I really appreciate your out of the box thinking style. I think it might be easy for people with a mind set like yours and I wish you all the best.

The reason that it hasn't been cracked is that there probably is not enough talent focussing on the problem. It's different from the usual hacking attempts, like DRM protection hacking, where the commercial stakes are/were much higher.
  #6  
Old 06.12.2009, 00:49
dest
Guest
 
Posts: n/a
Default

I'm not an expert in this, but I guess it only takes some undistorting and a dictionary to crack Recaptcha.

edit: or if you think Recaptcha's image captcha is too hard, then you may try the audio one. there is less room for noise in audio.

Last edited by dest; 09.01.2010 at 10:16.
  #7  
Old 06.12.2009, 09:53
drbits drbits is offline
JD English Support (inactive)
 
Join Date: Sep 2009
Location: Physically in Los Angeles, CA, USA
Posts: 4,434
Default

reCaptcha works with a given set of word images. For a while, hotfile was using the word images that Google's OCR could not read, and obscuring them a little more. The problem is that most of them were not human readable either (and they were not in the dictionary).

Hotfile has now changed back to an image set that is marginally OCR readable. JDownloader can frequently get the right answer, and with retries get through the hotfile reCaptcha.

Audio is really the same problem. They add noise in ways you wouldn't believe. For example, in the background, people are speaking numbers and letters. In addition, there is noise added that is as loud as the numbers and letters spoken in the foreground. By controlling the volume of the background noise and the type and quantity of louder noises, they can make it as hard as they want.
  #8  
Old 06.12.2009, 10:11
remi
Guest
 
Posts: n/a
Cool

This means that after excluding the blind or visually impaired people they will also exclude the deaf and people with hearing loss.

Isn't this a form of discrimination for which these sites can be sued?
  #9  
Old 12.12.2009, 04:28
dest
Guest
 
Posts: n/a
Default

well, you may sue Google about that and ban Recaptcha once and for all, also all the rest of captchas, at least in usa.
  #10  
Old 12.12.2009, 08:15
Gweilo Gweilo is offline
JD Legend
 
Join Date: Mar 2009
Posts: 725
Default

Quote:
Originally Posted by drbits View Post
Quality OCR programs are 100% on known roman typefaces (where the imaging quality is excellent); even "ransom note" typefaces are 100%. Quality OCR on hand written roman block lettering (often called hand printing or manuscript) is about 98% on random samples (multiple hands) for clear images.
OT, but you seem knowledgeable:
What are these "Quality OCR programs"?
I occasionally need to do OCR on large quantities of text. (Plain pages of text, like a novel.)
I tried OmniPage, TextBridge and Abbyy.
Abbyy seemed the best from my limited experience.
  #11  
Old 02.03.2010, 05:08
drbits drbits is offline
JD English Support (inactive)
 
Join Date: Sep 2009
Location: Physically in Los Angeles, CA, USA
Posts: 4,434
Default

Sorry, Gweilo, I didn't see your message until now. Check out what Project Gutenberg is using.
  #12  
Old 02.03.2010, 15:23
pspzockerscene pspzockerscene is online now
Community Manager
 
Join Date: Mar 2009
Location: Deutschland
Posts: 70,922
Default

I am quite sure that the spammers here in the forum (users can't see the spam, the threads are all locked) already cracked the Re Captcha captchas, but I don't know how^^

GreeZ pspzockerscene
__________________
JD Supporter, Plugin Dev. & Community Manager

Erste Schritte & Tutorials || JDownloader 2 Setup Download
Spoiler:

A users' JD crashes and the first thing to ask is:
Quote:
Originally Posted by Jiaz View Post
Do you have Nero installed?
  #13  
Old 09.03.2010, 06:16
drbits drbits is offline
JD English Support (inactive)
 
Join Date: Sep 2009
Location: Physically in Los Angeles, CA, USA
Posts: 4,434
Default New anti-reCaptcha idea

@PSP:
I think they are just taking the extra time to enter the recaptcha response. I don't think they are using a bot.
--------------------------------------------
New anti-reCaptcha idea

Note: My notes about the "hand" no longer apply. The current characters look like they were written with a round pointed felt tip pen. I suspect that this is because of the filtering and cleaning up that Google has already done to the letters.

1) Removing the splotches should be pretty easy.
a) Find the top and bottom of a splotch, save those as points.
b) Trace the splotch saving a point above/below any discontinuity.
c) If the left side has a discontinuity, use a cubic spline to fill in the gap in the outline. Do the same for the right side.
d) As we scan to find the edge, we invert the section of line that is within the splotch. (this is as we trace it in b and c).

2) Apparently, only the second word counts. You must supply a first word, but the second word has to match. We have hundreds of samples of the second word and we can generate more by making PHP calls to refresh the registration page and save the images as we go.

3) We manually divide and identify the characters in the second word for several words. We create an imaginary grid in which the boxes are the same size as the largest letter (including ascenders and descenders, plus a little extra for later). The grid starts 40 wide by 20 high (we may have to increase the height). The columns represent the symbols [a-z0-9\.\,\-], so we place each of the characters into a grid square in the appropriate column.
3a) We might have to filter the image of each letter to clean it up and make the line thicknesses uniform. Then, we would filter the challenges the same way.
3b) We can deal with the rotation of characters by defining a measure of vertical and rotating so that the character is vertical. The vertical can be the longest direction of the character or a measure based on the longest line. It does not matter if the characters come out vertical, as long as they always are rotated to the same position.

4) Now, we continue this, but we convolve each new character with the characters in the grid and take the power spectrum (this is the quality of match). Where the power spectrum shows that the new character is already in the grid, we throw away the new character. If there is no good match, we identify the character and add it to the grid. After a while, we will not be adding many characters to the grid.

5) To use this, we convolve the word with the grid, take the power spectrum, and that identifies the characters in the word. The real part will tell us the position in the word. Ideally, we could read off the word now.
5a) We can identify the first letter, remove it and then work on the next letter, etc.

6) There are letter combinations that look the same. For example, r n looks like n n which looks like m (without the spaces: rn nn m). This is part of where the dictionary comes in.

7) We do a dictionary search, with limited spelling correction and either request a new challenge or return the word we found.
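
Steps 6 and 7 can be illustrated with a small lookup that tries look-alike substitutions before giving up. This is a sketch under stated assumptions: the CONFUSABLE table, the function names, and the sample dictionary are invented for the example, and only a single substitution per guess is tried.

```python
# Letter groups that look alike once the strokes touch (from step 6).
CONFUSABLE = [("rn", "m"), ("nn", "m"), ("rn", "nn")]


def variants(word):
    """Yield the word itself, then every spelling reachable by one look-alike swap."""
    seen = {word}
    yield word
    for a, b in CONFUSABLE:
        for src, dst in ((a, b), (b, a)):
            start = word.find(src)
            while start != -1:
                candidate = word[:start] + dst + word[start + len(src):]
                if candidate not in seen:
                    seen.add(candidate)
                    yield candidate
                start = word.find(src, start + 1)


def resolve(word, dictionary):
    """Return the first variant found in the dictionary, or None (request a new challenge)."""
    for candidate in variants(word):
        if candidate in dictionary:
            return candidate
    return None


words = {"burn", "modern"}
guess = resolve("bum", words)
```

A None result corresponds to the "request a new challenge" branch of step 7.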

I can explain any of the algorithm pieces (convolution, power spectrum, etc.) upon request.
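
As a rough illustration of the "quality of match" idea in step 4, the sketch below uses a normalized correlation of two equal-sized black and white glyphs at zero shift. This is a deliberately simplified stand-in for a full convolution and power spectrum, not the method itself; the function names, the 0.9 threshold, and the 3x3 glyphs are all made up for the example.

```python
from math import sqrt


def match_score(a, b):
    """Normalized correlation of two equal-sized 0/1 glyphs at zero shift (about 1.0 = identical)."""
    dot = sum(x * y for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    na = sqrt(sum(x * x for row in a for x in row))
    nb = sqrt(sum(y * y for row in b for y in row))
    return dot / (na * nb) if na and nb else 0.0


def best_match(grid, glyph):
    """Return (label, score) of the best-matching stored glyph, or (None, 0.0) for an empty grid."""
    best = (None, 0.0)
    for label, stored in grid:
        score = match_score(stored, glyph)
        if score > best[1]:
            best = (label, score)
    return best


def add_if_new(grid, glyph, label, threshold=0.9):
    """Store a manually labeled glyph unless something already in the grid matches it well."""
    if best_match(grid, glyph)[1] >= threshold:
        return False
    grid.append((label, glyph))
    return True


glyph_i = [[0, 1, 0], [0, 1, 0], [0, 1, 0]]
glyph_l = [[1, 0, 0], [1, 0, 0], [1, 1, 1]]
grid = []
added = [add_if_new(grid, glyph_i, "i"),
         add_if_new(grid, glyph_i, "i"),
         add_if_new(grid, glyph_l, "l")]
```

A real convolution would also slide one image across the other, which is what recovers the position of the character within the word.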

Comments? Questions?

Last edited by drbits; 09.03.2010 at 06:25.
  #14  
Old 13.03.2010, 02:07
BearCan
Guest
 
Posts: n/a
Default

For the dictionary approach why not let google check it for us.
We've all sent misspelled words for a search only to have google
come back with suggestions.

Do we really need to create an OCR engine? I did a search of sourceforge and found at least three OCR projects. Why are we unable to use them? It seems to me the only processing JD needs to do is image cleaning so the engine returns a higher percentage of right answers.

I was thinking that to reduce a black circle to a line, take say,
6 pixels out of a line. if it's all black remove 3 pixels from the end.
slide over 3 and repeat test until all 6 are not black.
I hope this is where the edge of a character should be.

another idea is to send the image to an emboss filter.
the ocr engine might work with the edges it creates.
some photoshop scripter should be able to test these concepts.
  #15  
Old 13.03.2010, 02:11
pspzockerscene pspzockerscene is online now
Community Manager
 
Join Date: Mar 2009
Location: Deutschland
Posts: 70,922
Default

Atm we have more important things to do than *cracking* Re Captcha.
How about you guys doing that ?
Help from users is always welcome!

GreeZ pspzockerscene
  #16  
Old 13.03.2010, 10:14
drbits drbits is offline
JD English Support (inactive)
 
Join Date: Sep 2009
Location: Physically in Los Angeles, CA, USA
Posts: 4,434
Default

Thank you pspzockerscene, we are discussing options and intend to try ourselves. We do not want to add to the developer team's burden.
________________________________
@ BearCan

I like the idea of using an outside dictionary, except for the communication delay it is perfect. Google's search dictionary won't help. It uses stemming and searches for all words with the same stem. I believe there is a Soundex-like matching scheme to suggest alternatives to unusual words. However, we can check the Google API to see if we can check a word against Google's word list (in the DB) or use the Google Dictionary Service.

I just realized that all that we need is the number of matches for a one-word Google search (we put a + in front of the word to prevent stemming and alternative recommendation). I am pretty sure that is in the API (and fast). The real words in the challenges should be somewhere in Google. Especially if we are usually only looking for the longer of the two words.

It sounds better than importing a 50,000 word list from Merriam-Webster (which is partially stemmed).
----------------------
When looking to remove the black spots, we have to remember that the text is inverted, so we need the left and right edge of the spot where it crosses the text (the text won't tell us that very easily).

We also have to remember that there are two words. We can start at the top and check each row of pixels for some minimum (possibly 6) pixels to find a spot. We do the same thing for the bottom. Then, we trace the spots (we will find both and have up to 4 curves). Inverting the pixels inside a closed cubic spline should be in one of the open source libraries.
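
The row scan for the top and bottom of a spot might look something like the sketch below. It assumes the image is already a list of rows of 0/1 values (1 = dark) and reads "some minimum pixels" as a minimum run of consecutive dark pixels; both the function names and that reading are assumptions made for the example.

```python
def longest_dark_run(row):
    """Length of the longest run of consecutive dark (1) pixels in one row."""
    best = run = 0
    for px in row:
        run = run + 1 if px else 0
        best = max(best, run)
    return best


def spot_rows(image, min_run=6):
    """Return (top, bottom) indices of the first and last rows with a run >= min_run, or None."""
    hits = [y for y, row in enumerate(image) if longest_dark_run(row) >= min_run]
    return (hits[0], hits[-1]) if hits else None


image = [
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 1, 1, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
    [1, 0, 1, 0, 1, 0, 1, 0, 1, 0],   # thin strokes only, not part of a spot
    [0, 0, 1, 1, 1, 1, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
]
bounds = spot_rows(image)
```

This only brackets the spot vertically; tracing the outline between those rows is the separate step described above.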

If a word passed through one edge of the spot, we know its approximate vertical position, step out until we have a blank line, and then step in to the first non-blank line. If the word is entirely within the spot, it probably is not our target word. However, it is easy enough to isolate the word in the same way, just starting with the middle of the spot. The word is usually not straight, so extracting it into a box is a good time to rotate and warp it straight.

The problem is avoiding getting the two words in one box. To do that (for the left word), we start at the left and scan at the appropriate vertical range to find the left edge. By looking up to 5 pixels ahead, we can find the top of the next character and keep going until the end of the word. Once that is identified, copied, and cleared, the other word is done the same way, but without having any concerns about hitting the wrong word.

I will have to look up how to straighten the word (so each letter is to the same scale and the same baseline). We might just want to move each letter until it is centered vertically on the center line of the box and then scale it. This will push "g" up so that the descender is within the box and will make lower case letters as large as upper case letters.

While you are experimenting with reCaptcha, can you find out if it is case sensitive?
We also need to find out if the checked word (the longer one) ever contains digits, hyphens, periods, or commas. If not, we can reduce the complexity of the neural network.
  #17  
Old 28.03.2010, 18:03
BearCan
Guest
 
Posts: n/a
Default

hotfile captcha is not case sensitive. numbers are required.
hyphens, periods, commas, dashes, etc are not required. 1,024=1024 high-life=highlife

you going to pick an ocr engine drbits?
have you played with the image enhancement using the cubic spline idea?

I do not believe scale and letter baseline will be a problem for an engine. I think resolution will be. At the very worst, we can send a single letter to the engine. That will solve scale and baselining.
  #18  
Old 28.03.2010, 18:21
drbits drbits is offline
JD English Support (inactive)
 
Join Date: Sep 2009
Location: Physically in Los Angeles, CA, USA
Posts: 4,434
Default

For some reason, they are fighting hard. They seem to add another obfuscation every 2 months. I decided not to work on it. There are a couple of teams working on it for other products and we can use their work.
  #19  
Old 28.03.2010, 18:26
pspzockerscene pspzockerscene is online now
Community Manager
 
Join Date: Mar 2009
Location: Deutschland
Posts: 70,922
Default

You guys can deal with that.
I think it's just a waste of time trying to crack Re Captcha.
Yes, it isn't impossible, but it just isn't worth it.

GreeZ pspzockerscene
  #20  
Old 29.03.2010, 11:14
remi
Guest
 
Posts: n/a
Cool

Quote:
Originally Posted by drbits View Post
For some reason, they are fighting hard. They seem to add another obfuscation every 2 months. I decided not to work on it. There are a couple of teams working on it for other products and we can use their work.
The reason why they add another obfuscation every 2 months is because these teams are close to a solution. Gogol will need to add obfuscations more frequently as these teams will find a pattern in these obfuscation techniques.

reCaptcha will eventually be solved, because we humans can beat Gogol if we stand together. We beat Nazism too.
  #21  
Old 25.04.2010, 06:04
drbits drbits is offline
JD English Support (inactive)
 
Join Date: Sep 2009
Location: Physically in Los Angeles, CA, USA
Posts: 4,434
Default Why Hotfile Asks for reCaptcha Only Sometimes

It has to do with the size of the file.

Hotfile is only issuing the reCaptcha challenge for files that are smaller than about 200.000.000 bytes = 190.73 MiB (it is > 190.72 and < 190.74).
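
The byte to MiB conversion quoted above is easy to check. A minimal sketch (the helper name is made up):

```python
MIB = 1024 * 1024   # bytes per mebibyte


def bytes_to_mib(n):
    """Convert a byte count to mebibytes."""
    return n / MIB


threshold_mib = round(bytes_to_mib(200_000_000), 2)
```

The result rounds to 190.73, which sits inside the observed window of more than 190.72 and less than 190.74.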

Their affiliate program pays different amounts for different size files. Files over 100MiB receive the most payback. I think they are afraid that affiliates will use a program like JD to continuously download a file (changing IP address after each download), to cheat Hotfile. The cheater would minimize the file size downloaded to maximize profit.

drbits
  #22  
Old 25.04.2010, 13:13
pspzockerscene pspzockerscene is online now
Community Manager
 
Join Date: Mar 2009
Location: Deutschland
Posts: 70,922
Default

@drbits
Well many payhosters do not change their site against jd or add a captcha, maybe only because they don't know JD.
Whatever i am not a fan of the "get paid for downloads" system^^

GreeZ pspzockerscene
  #23  
Old 26.04.2010, 12:56
drbits drbits is offline
JD English Support (inactive)
 
Join Date: Sep 2009
Location: Physically in Los Angeles, CA, USA
Posts: 4,434
Default

Actually, I was thinking we could somehow make use of the 200MB boundary. Also, one of the users had asked about why hotfile.com doesn't always ask for a Captcha.

GreeZ drbits
  #24  
Old 26.04.2010, 15:24
pspzockerscene pspzockerscene is online now
Community Manager
 
Join Date: Mar 2009
Location: Deutschland
Posts: 70,922
Default

(This post is only about Hotfile)
Well the only thing we could do is add a setting "prevent captchas" so if this is enabled jd will only load files smaller than 100MB.
How about that ?
It's easy to do, even i can do it

GreeZ pspzockerscene
  #26  
Old 27.04.2010, 04:05
pauldmps
Guest
 
Posts: n/a
Default

Okay, I've observed some very simple things about reCaptcha. You all might already know these:

1. One word of the two words in the captcha is known & another one is unknown (you all obviously know this). However, it cannot be said which word is known & which is unknown. Sometimes the first is unknown, sometimes the second one.

2. It is important to get the known word right. But it is not important to get the unknown word right.

3. If you could count the number of letters in the unknown word & put as many spaces as the number of letters, the captcha will still be solved.

I am not a programmer so I can't give any hint that way. But let's see how we could approach the problem:

1. Identify the known word. Mostly by trial and error. Let the plugin always think that the first word is known word. If not, the captcha refreshes.

2. Decrypt the known (first) word.

3. Count the number of letters in the second (unknown) word. Put as many spaces as the letters in the word +1 extra space for the space between the two words.
  #27  
Old 27.04.2010, 11:06
drbits drbits is offline
JD English Support (inactive)
 
Join Date: Sep 2009
Location: Physically in Los Angeles, CA, USA
Posts: 4,434
Default

You are doing well. But you aren't up to speed on everything.

1) The answer to the unknown word is not important, but a non-space answer is required. It is in our best interest to enter an incorrect word for the unknown word. Matching length would be good.

2) The real problem with reCaptcha is that it is so easy for Google to change things compared to how hard it is to solve.

3) Ignoring the obfuscation that Google adds (like the blotches or some changes in the lines), these are from hand written documents. None of the known OCR engines can solve the word (or it wouldn't be in the challenge as the unknown word). Once a statistically significant number of users give the same answer for an unknown, it becomes a known and is still just as hard.

4) Part of the obfuscation is to thicken the lines. This makes the letters touch and makes letter separation difficult (no counting the letters in advance). An easy example to see typed here is burn (Bum or Burn). Things get much worse when the letters are freehand. Bunting can look like dantim. pauldmps could look like fwnbrnao or powbna.

5) BearCan has figured out how to identify the unknown. BearCan, please correct me. They switch things around, but the odds are good that the second word is the unknown. The known words are only letters. The known words are usually longer than the unknowns.

6) Given a spelling checker and starting with the shortest words, we could identify the ligatures (combined letters, traditionally ae, fi, fl, and ." where the period should appear centered under the closing quotation mark). In hand written documents you would have 100's of such combinations (1 to 4 per individual letter). Some letters are relatively easy, such as x and q, which each have only a few combinations. Vowels can have many following letters. All of the single-letter and 2-letter combinations have been identified with their probabilities in linguistics books. Probabilities for common 3-letter combinations are available as well. The probabilities can be adjusted based on the words encountered.

Traditional filter, rotate, and then neural net will not work well here.

The blotches can easily be removed by edge tracing. When there is an anomaly in the edge (a letter crossing the edge), a cubic spline can be used to estimate the missing blotch edge. Then, the blotch can be removed by inverting all pixels that belong to the blotch. Slight adjustments when crossing letters may also help.
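
Once the blotch outline is known, the inversion itself is the easy part. A minimal sketch, assuming the traced outline has already been filled into a 0/1 mask the same size as the image (building that mask from the edge trace and spline is not shown, and the names are made up):

```python
def remove_blotch(image, mask):
    """Return a copy of a 0/1 image with every pixel inside the blotch mask inverted."""
    return [[1 - px if inside else px for px, inside in zip(row, mrow)]
            for row, mrow in zip(image, mask)]


# A horizontal stroke (row 0) passes through a blotch covering columns 2..4,
# so inside the blotch the stroke shows up light and the background dark.
observed = [
    [1, 1, 0, 0, 0, 1],
    [0, 0, 1, 1, 1, 0],
]
mask = [
    [0, 0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1, 0],
]
cleaned = remove_blotch(observed, mask)
```

After the inversion the stroke is continuous again and the blotch background is gone.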

Either edge finding or center finding can be used to follow the flow of the letters, but simplify the information into a few kinds of curves, angles, lines, and gaps. That can be what is fed into the Neural Network identification (which returns a collection of letters, with probabilities for the first unknown letter). Where a line is skinny, we can use an intermediate value to represent it (meaning it might be a gap, such as the r+n vs m).

One form of obfuscation that Google seems to have added is thresholding the greyscale image to form a black and white image. The thresholds are apparently not always the same, but they are low enough to widen most letters and hide the "pen" and "hand". At the same time, when the center of the writing nib was drier than the edges, you get two thin lines instead of one fat line. Two close lines (usually from different letters) can appear as a single line, but the line thickness and length can help distinguish these situations. Using center finding allows us to trace thick lines based on the distance from the clockwise edge and later the counterclockwise edges.

Our best shot is to estimate the length (+/- 2), identify the most common letter pairs for the beginning of a word, and then identify the common letter pairs for the letters 2&3. The results would be ordered by the highest "probability" and the first two characters identified that way. The third character is identified in a similar way (now the probability is the probability of the first pair times the probability of the second pair).

----------------

Sorry this is so poorly organized. Let me try to summarize.

1) Process out the blotches (the big disks) and apply a despeckling filter.
2) We trace the edges to find a polyline that represents the skeleton of the letter.
3) We train a Neural Network to give us a probability of a letter, given a skeleton. We train it by manually separating the letters. The number of nodes will be around twice the number needed for printed text and the probabilities lower.
3a) A separate NN may be necessary for first letters.
4) Given each first letter (with the NN probability pNN), we apply the binary letter table which supplies pF(L1, L2)
4a) We threshold the probabilities to create a collection of letter pairs
5) We take the first letter in the pair and estimate where the skeletons separate, based on matching the skeleton to the data. We can then apply a NN for the second letter. Again, the probabilities are multiplied.
6) At this point, we have the probability for each letter that it appears at the beginning of a word, the identification strength of the corresponding NN node(s), the probability of letter X appearing after the first (for each of the first letters above a threshold), and the NN node(s) strength for each of those second letters.
6a) Especially for vowels and capital letters, the number of nodes that identify the various forms of the letter is more than one. The usable result is the sum (OR) of the values that indicate that letter. Capital Q has two very distinct and different appearances (one looks like a large 2). The lower case a has the form with the top hook and the form without that hook. The distinction between whether the line segment on the right ends at the top of the curve or beyond it can be handled by a NN node. Lower case q can look like a backwards p, the descender can end in a hook (acute corner to a shorter line segment), a loop (both ways), or the descender line may ascend beyond the curved portion (dq combined).
7) We go to our local NN expert (Jiaz?) to determine which shapes can use the same NN node. For simplicity, we have enough nodes so that no node represents returning two different letters (like q and a without the hooks).
8) The NN for the beginning of a word will likely be different from the NN for intermediate letters (the letter pair tables are divided into beginning, middle, and end tables).
9) We continue until we have choices and probabilities for the first three letters in the word. We can then use a dictionary (in a trie) to identify those combinations that are unlikely (give this a weight to multiply the value by). This eliminates some of the choices. We continue subtracting the skeleton of the guessed letter, running the NN on the next letter, multiplying by the probability of those letters following each other, and comparing with the dictionary.
10) At each point, we limit the number of guesses to a specific number of possibilities (first letter 26, two letters might be 16, three letters and on might be 8 or fewer).
11) Obviously, this needs diagrams.
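
In place of diagrams, the probability bookkeeping in steps 4 through 10 can be sketched as a small beam search. Everything here is an assumption for illustration: the recognizer output is faked as one dictionary of letter probabilities per position, the letter pair table holds invented numbers, and the dictionary check simply drops impossible prefixes instead of applying a weight as step 9 suggests.

```python
def beam_search(letter_probs, pair_prob, dictionary, widths=(26, 16, 8)):
    """letter_probs[i] maps letter -> recognizer probability at position i.
    pair_prob(a, b) is the probability that letter b follows letter a.
    Returns surviving candidates as (probability, text), best first."""
    beams = [(1.0, "")]
    for i, probs in enumerate(letter_probs):
        grown = []
        for p, text in beams:
            for letter, p_nn in probs.items():
                p_pair = pair_prob(text[-1], letter) if text else 1.0
                candidate = text + letter
                # Keep only prefixes that some dictionary word starts with.
                if any(word.startswith(candidate) for word in dictionary):
                    grown.append((p * p_nn * p_pair, candidate))
        # Step 10: limit the number of guesses kept at each length.
        width = widths[min(i, len(widths) - 1)]
        beams = sorted(grown, reverse=True)[:width]
    return beams


# Hypothetical numbers, for illustration only.
PAIRS = {("b", "u"): 0.5, ("b", "a"): 0.4, ("d", "a"): 0.5, ("d", "u"): 0.3,
         ("u", "n"): 0.6, ("u", "m"): 0.3, ("a", "n"): 0.6, ("a", "m"): 0.4}
letter_probs = [{"b": 0.6, "d": 0.4}, {"u": 0.7, "a": 0.3}, {"m": 0.5, "n": 0.5}]
result = beam_search(letter_probs, lambda a, b: PAIRS.get((a, b), 0.01),
                     {"bun", "ban", "dam"})
ranked = [text for _, text in result]
```

With these toy numbers "bum" and "bun" are equally likely to the recognizer, and it is the pair table plus the dictionary that settle on "bun".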

If necessary, we can use the Gutenberg Foundation's files to determine some of the letter pairings for the pre-typewriter, pre-ballpoint era (typewriters were adopted in business starting around 1900, ballpoint pens were mid 20th century).

Good Morning Europe
drbits
  #28  
Old 04.05.2010, 05:11
pauldmps
Guest
 
Posts: n/a
Default

Any way to crack the audio captcha instead of the text ? If you google "cracking ReCaptcha", you'll find many reports saying that the audio has been cracked but not the text.
  #29  
Old 04.05.2010, 14:58
pspzockerscene's Avatar
pspzockerscene pspzockerscene is online now
Community Manager
 
Join Date: Mar 2009
Location: Deutschland
Posts: 70,922
Default

Whatever, if YOU (users) can help us there, it's okay, but we're doing nothing concerning ReCaptcha recognition atm...

If you don't like those captchas, buy premium OR just don't use the hosts that use ReCaptcha. Is it so hard to do that?

GreeZ pspzockerscene
__________________
JD Supporter, Plugin Dev. & Community Manager

First Steps & Tutorials || JDownloader 2 Setup Download
Spoiler:

A user's JD crashes and the first thing to ask is:
Quote:
Originally Posted by Jiaz View Post
Do you have Nero installed?
  #30  
Old 04.05.2010, 15:46
pauldmps
Guest
 
Posts: n/a
Default

Quote:
Originally Posted by pspzockerscene View Post
Whatever, if YOU (users) can help us there, it's okay, but we're doing nothing concerning ReCaptcha recognition atm...

If you don't like those captchas, buy premium OR just don't use the hosts that use ReCaptcha. Is it so hard to do that?

GreeZ pspzockerscene
This thread was merely started so the developers could discuss potential ways to crack ReCaptcha. So it is basically a discussion thread. Whatever is being discussed here may not actually be done. So no need to get angry here.
  #31  
Old 04.05.2010, 15:50
pspzockerscene's Avatar
pspzockerscene pspzockerscene is online now
Community Manager
 
Join Date: Mar 2009
Location: Deutschland
Posts: 70,922
Default

@pauldmps

Don't worry, I am not getting angry.
I'm just saying that the easiest way is just to never use services which have those captchas.
I've got no problem doing that.
Well okay, let's get back to the topic^^

GreeZ pspzockerscene
  #32  
Old 05.05.2010, 10:24
remi
Guest
 
Posts: n/a
Cool

If someone cracks reCaptcha, people can also crack this board.

I think this makes psp a little angry.
  #33  
Old 05.05.2010, 10:57
scr4ve's Avatar
scr4ve scr4ve is offline
JD-Dev & board tech
JD Logo by artcore-illustrations.de
 
Join Date: Feb 2009
Location: Germany, Lower Saxony
Posts: 235
Default

Thanks for your effort, drbits :-)

Two thoughts about cracking ReCaptcha:
  • JD isn't a small project. If we develop a recognition, Google might change its captchas immediately.
  • ReCaptcha is mostly used for spam protection, JD is Open Source. Developing a recognition comes along with providing spammers with a recognition, too.
Anyway, I think it's nice to have, though more manpower is needed for that. As long as none of the big hosts (RS, MU, ...) are using ReCaptcha, this is a feature with relatively low priority for the main devs I guess.

Regards,


scr4ve

Last edited by scr4ve; 05.05.2010 at 11:02.
  #34  
Old 05.05.2010, 15:32
pspzockerscene's Avatar
pspzockerscene pspzockerscene is online now
Community Manager
 
Join Date: Mar 2009
Location: Deutschland
Posts: 70,922
Default

@remi
Indeed, it often makes me angry if no one uses the board search, but for now I've calmed down, so don't worry. I won't freak out here because it isn't my thread, so freaking out in it wouldn't be very nice.

GreeZ pspzockerscene
  #35  
Old 06.05.2010, 00:05
drbits's Avatar
drbits drbits is offline
JD English Support (inactive)
 
Join Date: Sep 2009
Location: Physically in Los Angeles, CA, USA
Posts: 4,434
Default

This started as a discussion of possibilities. However, I believe that somebody at Google reads our public board <shock, bewilderment>. One of the dangers of being open source.

When we discuss a possible future technique, within days reCaptcha has changed in a way exactly designed to block that technique.

This thread gives people a place to present ideas, without messing up the other thread (which is now one message, followed by a lot of whining).

GreeZ drbits
  #36  
Old 07.05.2010, 00:28
scr4ve's Avatar
scr4ve scr4ve is offline
JD-Dev & board tech
JD Logo by artcore-illustrations.de
 
Join Date: Feb 2009
Location: Germany, Lower Saxony
Posts: 235
Default

Quote:
Originally Posted by drbits View Post
One of the dangers of being open source.
I agree.

After having switched to a new license* we can consider developing a closed-source ReCaptcha Plugin in case of emergency (e.g. Megaupload employs ReCaptcha), as we do with the DLC Plugin. This is definitely not a nice solution since we prefer open source, though we really don't want to support spammers.

*) Note: JD stays Open Source. License Change is needed as the GPL forbids us to use closed-source parts like the DLC Plugin.

Regards,

scr4ve
  #37  
Old 07.05.2010, 00:33
pspzockerscene's Avatar
pspzockerscene pspzockerscene is online now
Community Manager
 
Join Date: Mar 2009
Location: Deutschland
Posts: 70,922
Default

@scr4ve
Awesome, Zippyshare will like JD even more then

GreeZ pspzockerscene
  #38  
Old 07.05.2010, 10:24
remi
Guest
 
Posts: n/a
Cool

Quote:
Originally Posted by scr4ve View Post
As long as none of the big hosts (RS, MU, ...) are using ReCaptcha, this is a feature with relatively low priority for the main devs I guess.
HF are currently number five in the ranking of hosts. I believe HF are using reCaptcha.

I agree with you that other people should do it. There are enough hackers and crackers out there.

Another way to stop reCaptcha and whatever Gogol might invent to make our lives more difficult, is to simply boycott Gogol. Punish them on a grand scale and they'll have to review their enslaving policy. Letting ordinary people 'scan' other people's books for free should stop.

Note that they are currently also among the worst privacy invaders and spies on Earth. I hope you won't have to use the cliché phrase "Wir haben es nicht gewusst" ("We didn't know") in the context of this extraordinarily powerful and corrupt/evil company!
  #39  
Old 07.05.2010, 20:58
pspzockerscene's Avatar
pspzockerscene pspzockerscene is online now
Community Manager
 
Join Date: Mar 2009
Location: Deutschland
Posts: 70,922
Default

And still I think that JD will never have a working ReCaptcha recognition.
Just imagine we had a ReCaptcha recognition (maybe even only for Hotfile).
Hotfile would then just change their page to kill the JD plugin and the game would start from zero :rolleyes:

GreeZ pspzockerscene

If you don't wanna enter captchas, use hosts without captchas or buy premium.
Sorry, I always gotta write this again^^
  #40  
Old 08.05.2010, 03:54
pauldmps
Guest
 
Posts: n/a
Default

Dear PSP,
Do you know why sites such as hotfile.com are using ReCaptcha?
To stop people using JDownloader. Yes, they know about it. (I've read an article in their news section about JD.)

Now just imagine if the most important file hosters (if not all) use the same ReCaptcha: JD will be dead. This might happen in a few years. What will you do then?