JDownloader Community - Appwork GmbH
 

 
 
19.09.2019, 18:01
radorn
Percent encoding differences in HTTP URLs produce multiple versions of the same files

Hi.
I'm having a little frustration here with HTTP downloads and percent encoding.
Let me illustrate:
**External links are only visible to Support Staff****External links are only visible to Support Staff**
**External links are only visible to Support Staff****External links are only visible to Support Staff**
**External links are only visible to Support Staff****External links are only visible to Support Staff**

These all point to the same file. The URLs differ only in that some characters are left untouched in one and percent-encoded in another. JD also treats URLs as different when the hexadecimal digits of a percent-encoded character differ only in letter case, like %5b and %5B for "[".
All of this results in unnecessary duplicates in the list whenever you run into such URLs.
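
To make the idea concrete, here is a minimal sketch in Python of the kind of normalization I have in mind. It is not JD's code, and I don't know how JD compares URLs internally. The function name `canonical_url` and the example.com URLs are made up for illustration. It assumes that decoding the path and re-encoding it in one fixed way is acceptable, which is not strictly safe in general (an encoded "/" written as %2F would become a real path separator).

```python
from urllib.parse import quote, unquote, urlsplit, urlunsplit


def canonical_url(url: str) -> str:
    """Return one fixed spelling of a URL's path (hypothetical helper).

    The path is fully percent-decoded and then re-encoded, so "[" and
    "%5b" both end up as "%5B". Only the path is touched; scheme, host,
    query and fragment are passed through unchanged.
    """
    parts = urlsplit(url)
    # quote() emits uppercase hex digits and leaves only letters,
    # digits, "_.-~" and the characters in `safe` unencoded.
    path = quote(unquote(parts.path), safe="/")
    return urlunsplit(
        (parts.scheme, parts.netloc, path, parts.query, parts.fragment)
    )


if __name__ == "__main__":
    variants = [
        "http://example.com/dir/[file] name.zip",
        "http://example.com/dir/%5bfile%5d%20name.zip",
        "http://example.com/dir/%5Bfile%5D name.zip",
    ]
    for v in variants:
        print(canonical_url(v))
```

All three variants print the same line, so comparing the canonical form instead of the raw string would treat them as one link.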

Now, it's true that, by default, JD will group them together in one package as if they were different mirrors of the same file, and that only one of them will be downloaded (if they are in the same package, I think). Alternatively, if they land in the same directory, JD will detect that the file already exists and cancel the download.
These fail-safe measures are good but, to me, insufficient.

The trickiest issue is that this prevents JD from marking in red those files that are already on the download list, and that's something that really becomes a problem for me.

How do I end up with these URLs, you ask? Well, I'm slowly collecting files from a certain cluster of sites. They offer both direct HTTP directory listings and an indexer site that keeps track of the updates.
At first I collected a list of directories from the sites and let JD analyze them, sorted the results into JD packages and added them "disabled" to the download list. From the download list I split them further into smaller packages as needed, and enable them when I actually want to download them.
This is really nice so far.
After that initial crawl, I use the indexer site to keep up with the updates.
Here's the problem. I get differently percent-encoded URLs depending on how I capture the URLs.
The way in which the URL is encoded differs depending on whether I...
...add directory URLs and use deep analyze from the "Add URLs" menu.
...browse to the HTTP servers and right-click on each individual link.
...browse to the HTTP servers, select all the text containing links in the page, copy it and let JD extract the URLs.
...go to the indexer site and "copy link location" on each one individually.
...go to the indexer site and select all text containing links to let JD process it.
In each case, I get a different URL for the same file, as described.

Could you solve this or would it break something else?
For me, it would be a great help.

Also, is there some efficient way to deal with these duplicates once they are already on the download list and also in the link grabber?
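
As a stopgap outside of JD, duplicates could at least be spotted in a plain list of URLs, one per string. The following is only a rough sketch under that assumption. It does not use any JD feature or API, the names `dedup_key` and `group_duplicates` are hypothetical, and the key only evens out percent-encoding in the path.

```python
from collections import defaultdict
from urllib.parse import quote, unquote, urlsplit, urlunsplit


def dedup_key(url: str) -> str:
    """Comparison key: the URL with its path decoded and re-encoded."""
    parts = urlsplit(url.strip())
    path = quote(unquote(parts.path), safe="/")
    return urlunsplit(
        (parts.scheme, parts.netloc, path, parts.query, parts.fragment)
    )


def group_duplicates(urls):
    """Map each key to the original URLs sharing it, duplicates only."""
    groups = defaultdict(list)
    for url in urls:
        groups[dedup_key(url)].append(url)
    # Keep only keys that more than one original URL collapsed onto.
    return {key: found for key, found in groups.items() if len(found) > 1}


if __name__ == "__main__":
    urls = [
        "http://example.com/dir/%5bfile%5d.zip",
        "http://example.com/dir/[file].zip",
        "http://example.com/dir/other.zip",
    ]
    for key, found in group_duplicates(urls).items():
        print(key, "<-", found)
```

Each reported group lists the differently encoded spellings of one file, so all but one of them could then be removed by hand.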

Last edited by radorn; 19.09.2019 at 18:04.
All times are GMT +2.