JDownloader Community - Appwork GmbH
 

#1
19.09.2019, 18:01
radorn
Super Loader
Join Date: May 2019
Posts: 26

Percent encoding differences in HTTP URLs produce multiple versions of the same files

Hi.
I'm having a little frustration here with HTTP downloads and percent encoding.
Let me illustrate:
**External links are only visible to Support Staff**
**External links are only visible to Support Staff**
**External links are only visible to Support Staff**

Same file, but the URLs differ only in that one leaves some characters untouched while the other percent-encodes them. JD will also consider URLs different when the hexadecimal digits of the percent-encoded characters differ in letter case, like %5b versus %5B for "[".
All of these things result in unnecessary duplicates on the list when you encounter them.
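
To illustrate what collapsing these into one entry would take, here is a minimal sketch of an RFC 3986-style normalizer in Java (not JD's actual code; the class name and rules are my own invention). It decodes every percent triplet, then re-encodes everything outside the unreserved set with uppercase hex, so "%5b", "%5B" and a literal "[" all come out as "%5B":

Code:
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Sketch of a percent-encoding canonicalizer (NOT JDownloader's actual
// implementation). Decode every %XX triplet, then re-encode each byte:
// unreserved characters (RFC 3986: A-Z a-z 0-9 - . _ ~) and "/" stay
// literal, everything else becomes %XX with uppercase hex digits.
public final class UrlPathCanonicalizer {

    public static String canonicalize(String path) {
        byte[] decoded = decode(path);
        StringBuilder out = new StringBuilder(decoded.length * 3);
        for (byte b : decoded) {
            int c = b & 0xFF;
            if (isUnreserved(c) || c == '/') {
                out.append((char) c);
            } else {
                out.append('%')
                   .append(Character.toUpperCase(Character.forDigit((c >> 4) & 0xF, 16)))
                   .append(Character.toUpperCase(Character.forDigit(c & 0xF, 16)));
            }
        }
        return out.toString();
    }

    // Collapse %XX triplets to raw bytes; malformed triplets pass through.
    private static byte[] decode(String path) {
        byte[] in = path.getBytes(StandardCharsets.UTF_8);
        byte[] out = new byte[in.length];
        int n = 0;
        for (int i = 0; i < in.length; i++) {
            if (in[i] == '%' && i + 2 < in.length && isHex(in[i + 1]) && isHex(in[i + 2])) {
                out[n++] = (byte) ((hexVal(in[i + 1]) << 4) | hexVal(in[i + 2]));
                i += 2;
            } else {
                out[n++] = in[i];
            }
        }
        return Arrays.copyOf(out, n);
    }

    private static boolean isUnreserved(int c) {
        return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')
                || (c >= '0' && c <= '9') || c == '-' || c == '.' || c == '_' || c == '~';
    }

    private static boolean isHex(int c) {
        return (c >= '0' && c <= '9') || (c >= 'a' && c <= 'f') || (c >= 'A' && c <= 'F');
    }

    private static int hexVal(int c) {
        return c <= '9' ? c - '0' : (Character.toLowerCase((char) c) - 'a' + 10);
    }
}

With this, canonicalize("/files/foo%5bbar%5d.zip") and canonicalize("/files/foo[bar].zip") both yield "/files/foo%5Bbar%5D.zip". One caveat, noted in the comments: an encoded slash ("%2F") is decoded and re-emitted as "/", so paths that deliberately distinguish the two would need extra care.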

Now, it's true that, by default, JD will cluster them together in a package as if they were different mirrors of the same file, that only one of them will be downloaded (if they are in the same package, I think), and that, if they land in the same directory, JD will detect that the file already exists and cancel the download.
These fail-safe measures are good, but, to me, insufficient.

The trickiest issue is that this prevents JD from marking in red the files that are already on the download list, and that really becomes a problem for me.
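
With a canonical form like the one sketched above, the duplicate check behind that red marking could key on it instead of on the raw string. A sketch, reusing the hypothetical canonicalizer and assuming plain scheme://host/path URLs:

Code:
import java.util.HashSet;
import java.util.Set;

// Hypothetical duplicate check keyed on the canonical form of the URL,
// so every encoding variant of one file collides with the first one seen.
final class DupeFilter {
    private final Set<String> seen = new HashSet<>();

    // Canonicalize only the path part; scheme and host are left alone.
    static String canonicalUrl(String url) {
        int pathStart = url.indexOf('/', url.indexOf("://") + 3);
        return pathStart < 0 ? url
                : url.substring(0, pathStart)
                        + UrlPathCanonicalizer.canonicalize(url.substring(pathStart));
    }

    /** true if an equivalent URL was already added (i.e. mark it red). */
    boolean isDuplicate(String url) {
        return !seen.add(canonicalUrl(url));
    }
}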

How do I end up with these URLs, you ask? Well, I'm slowly collecting files from a certain cluster of sites. They offer both direct HTTP listing access and an indexer site that keeps track of the updates.
At first I collected a list of directories from the sites and let JD analyze them, sorted them into JD packages, and added them "disabled" to the download list. From the download list I further classify them into smaller packages as needed and enable them when I actually want to download them.
This is really nice so far.
After that initial crawl, I use the indexer site to keep up with the updates.
Here's the problem. I get differently percent-encoded URLs depending on how I capture the URLs.
The way the URL is encoded differs depending on whether I...
...add directory URLs and use deep analysis from the "Add URLs" menu.
...browse to the HTTP servers and right-click each individual link.
...browse to the HTTP servers, select all the text containing links on the page, copy it, and let JD extract the URLs.
...go to the indexer site and "copy link location" on each link individually.
...go to the indexer site and select all the text containing links and let JD process it.
In each case, I get a different URL for the same file, as described.

Could you solve this or would it break something else?
For me, it would be a great help.

Also, is there some efficient way to deal with these duplicates once they are already on the download list and also in the link grabber?

Last edited by radorn; 19.09.2019 at 18:04.
#2
27.09.2019, 16:59
Jiaz
JD Manager
Join Date: Mar 2009
Location: Germany
Posts: 79,339

Would it be possible to get example links for testing? I'm asking because there are many paths the crawler can take: some just take the URL as it is, while others process it and replace characters, e.g. a space with %20, and so on.
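
For illustration, a sketch of how two such paths can diverge on the same captured text (the host and file name are made up):

Code:
// Three ways the same link text can end up on the list.
public class EncodingDivergence {
    public static void main(String[] args) {
        // Path A: take the captured text exactly as it is.
        String a = "http://example.com/foo [1].zip";

        // Path B: escape only what is illegal in a request line (the
        // space), leaving "[" and "]" untouched.
        String b = a.replace(" ", "%20");

        // Path C: a browser's "copy link location" typically encodes
        // the brackets as well.
        String c = "http://example.com/foo%20%5B1%5D.zip";

        // Three strings, one file: a plain equals() sees three links.
        System.out.println(a.equals(b)); // false
        System.out.println(b.equals(c)); // false
    }
}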

The complex problem is that, although some characters are allowed in both plain AND percent-encoded representation, some servers only accept one of the two. For such a server it can happen that JDownloader processed the *wrong* representation first and then could no longer add the *other* link, because the duplicate check would flag both as the same. And yes, there are servers out there that don't accept both representations.

It would really help to get example links, so I can test the several ways of adding links myself, check where the different representations arise, and try to fix/work around the different paths so they all end up as one and the same representation.
__________________
JD-Dev & Server-Admin
#3
27.09.2019, 17:19
raztoki
English Supporter
Join Date: Apr 2010
Location: Australia
Posts: 17,659

It depends on the end site's OS and HTTP daemon, and on the referencing site's (the indexer's) OS and HTTP daemon.
Some are case-sensitive (Unix and Linux clones, for instance) and some are not (Windows).
Clearly, if both forms work for the same effective URL, it's a case-insensitive situation; if one works and the other doesn't, it's case-sensitive.
Knowing that beforehand is hard: you cannot assume it prior to the request (headers can give you some information).
In the past, the back end probed multiple times for the different outcomes; this led to detection, and even banning of the client, due to multiple requests in short succession.
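
For illustration, a sketch of such a probe (hypothetical code on plain java.net; note that running it for every link, back to back, is exactly the request pattern described above):

Code:
import java.net.HttpURLConnection;
import java.net.URL;

final class RepresentationProbe {
    // HEAD each candidate representation and keep the first one the
    // server answers with 200. This is the probing that historically
    // led to detection and bans when done in short succession.
    static String probe(String... candidates) throws Exception {
        for (String candidate : candidates) {
            HttpURLConnection con =
                    (HttpURLConnection) new URL(candidate).openConnection();
            con.setRequestMethod("HEAD");
            int status = con.getResponseCode();
            con.disconnect();
            if (status == 200) {
                return candidate; // this representation works here
            }
            Thread.sleep(2000); // crude politeness delay; still detectable
        }
        return null; // no representation answered 200
    }
}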
__________________
raztoki @ jDownloader reporter/developer
http://svn.jdownloader.org/users/170

Don't fight the system, use it to your advantage. :]
#4
27.09.2019, 17:41
Jiaz
JD Manager
Join Date: Mar 2009
Location: Germany
Posts: 79,339

It's not only about case sensitivity but also about plain vs. encoded. If the request is handled by a backend process (e.g. PHP) and the developer doesn't take care of plain vs. encoded forms, this can easily cause requests to fail.
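
A toy illustration of that failure mode (hypothetical backend, made-up file name), showing how a lookup against the raw request path misses the encoded representation of the same file:

Code:
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

public class NaiveBackend {
    // The name the developer stored for the file.
    static final String STORED = "/files/report [final].pdf";

    // Buggy lookup: compares the raw request path verbatim.
    static boolean naiveLookup(String rawPath) {
        return STORED.equals(rawPath);
    }

    // Safer lookup: decode before comparing. (URLDecoder also maps "+"
    // to a space, which is fine for this toy but matters elsewhere.)
    static boolean decodedLookup(String rawPath) throws Exception {
        return STORED.equals(
                URLDecoder.decode(rawPath, StandardCharsets.UTF_8.name()));
    }

    public static void main(String[] args) throws Exception {
        String encoded = "/files/report%20%5Bfinal%5D.pdf";
        System.out.println(naiveLookup(encoded));   // false -> the server 404s
        System.out.println(decodedLookup(encoded)); // true
    }
}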

I could add an option to the generic HTTP plugin to either ignore different encodings in the duplicate check or not (customizable for each link).
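
Purely as a sketch, such a per-link option could look like this (the field name and default are invented, not an existing JDownloader setting; canonicalUrl() is the helper from the sketch further up):

Code:
// Hypothetical per-link setting: when true, encoding variants share one
// dupe key; when false, the raw URL is kept apart, for servers that only
// accept one representation.
final class LinkDupeOption {
    boolean ignoreEncodingDifferences = true; // imagined default

    String dupeKey(String url) {
        return ignoreEncodingDifferences ? DupeFilter.canonicalUrl(url) : url;
    }
}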
__________________
JD-Dev & Server-Admin
#5
06.10.2019, 15:56
radorn
Super Loader
Join Date: May 2019
Posts: 26

@Jiaz Sorry for the delay.

This is the indexer site: **External links are only visible to Support Staff** It lists files from a bunch of servers, most recent first.
Pick a file, go to its root domain (they are browsable), and feed the Link Analyzer the folder that contains the file. Then, with the Clipboard Observer active, copy the link to that file both on its own with the browser's "copy address" function and by selecting the text and copying it. Do this both on the server itself and on the indexer entry.
You'll end up with several copies of the same thing, differing only in percent encoding.

I normally use Pale Moon, a Firefox fork. Not sure if that makes any difference.
#6
10.11.2019, 18:05
user748912
Fibre Channel User
Join Date: Jun 2014
Posts: 118

Quote:
Originally Posted by radorn View Post
JD will detect that a file already exists and cancel the download.
These fail-safe measures are good, but, to me, insufficient.
It also happens that JD identifies doubles at first, so that 3 hosters load part1, part2, part3, but later in the same package it fails. This bug was reported a long time ago.
Since some hosters are gone and the others are overloaded, that's a huge problem now. JD manages to hit 3 already-existing files, which means 3 recaptchas (2-3 times each because of the RC bug), and in the end you get 0 downloads. Because those files exist and JD requested them from the hosters, you've reached the download limit and download nothing at all until next time. And because the hosters are overloaded, you may be unable to complete a file that day (20 kB/s and less) without a disconnect.
Fewer hosters should mean more time for finally fixing this bug...