Percent encoding differences in HTTP URLs produce multiple versions of the same files
Hi.
I'm having a little frustration here with HTTP downloads and percent encoding. Let me illustrate: **External links are only visible to Support Staff**

Same file, but the URLs differ only in that some characters are left untouched in one and percent-encoded in the other. JD will also consider two URLs different if the hexadecimal digits of the percent-encoded characters differ in letter case, like %5b and %5B for "[". All of these things result in unnecessary duplicates on the list when you encounter them.

Now, it's true that, by default, JD will cluster them together in a package as if they were different mirrors for the same file, and also that only one will be downloaded (if they are in the same package, I think); or, if they land in the same directory, JD will detect that a file already exists and cancel the download. These fail-safe measures are good, but, to me, insufficient. The trickiest issue is that this prevents JD from marking in red those files that are already on the download list, and that's something that really becomes a problem for me.

How do I end up with these URLs, you ask? Well, I'm slowly collecting files from a certain cluster of sites. They offer both direct HTTP listing access and an indexer site that keeps track of the updates. At first I collected a list of directories from the sites and let JD analyze them, sorted these into JD packages, and added them "disabled" to the download list. I further classify them into smaller packages as I need from the download list and enable them when I want to actually download them. This has been really nice so far. After that initial crawl, I use the indexer site to keep up with the updates.
Here's the problem: I get differently percent-encoded URLs depending on how I capture them. The way the URL is encoded differs depending on whether I...

- ...add directory URLs and use deep analyze from the "Add URLs" menu;
- ...browse to the HTTP servers and right-click on each individual link;
- ...browse to the HTTP servers, select all text containing links in the page, copy it, and let JD extract the URLs;
- ...go to the indexer site and "copy link location" on each one individually;
- ...go to the indexer site and select all text containing links, to let JD process it.

In each case, I get a different URL for the same file, as described. Could you solve this, or would it break something else? For me, it would be a great help. Also, is there some efficient way to deal with these duplicates once they are already on the download list and also in the link grabber?

Last edited by radorn; 19.09.2019 at 17:04.
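To make concrete the kind of normalization being asked for, here's a minimal sketch in Python (JD itself is Java; this is just an illustration, and the example URLs are hypothetical, not the ones from the post). It collapses the two differences described above: characters that are percent-encoded in one URL but literal in another, and lowercase vs. uppercase hex digits in the escapes.

```python
# Hypothetical sketch of RFC 3986 percent-encoding normalization:
# decode every percent-escape, then re-encode with one canonical policy,
# so all the variant spellings of the same URL compare equal.
from urllib.parse import quote, unquote

def normalize_percent_encoding(url: str) -> str:
    # unquote() turns every %XX escape back into its literal character;
    # quote() then re-escapes everything outside its safe set, always
    # emitting uppercase hex digits. "/" and ":" are kept literal so the
    # scheme and path separators survive. Note: this simple version is
    # lossy for URLs that deliberately contain an encoded "/" (%2F).
    return quote(unquote(url), safe="/:")

# Both spellings of "[" collapse to the same canonical form:
a = normalize_percent_encoding("http://example.com/file%5b1%5d.zip")
b = normalize_percent_encoding("http://example.com/file[1].zip")
# a == b == "http://example.com/file%5B1%5D.zip"
```

A deduplicator (or JD's duplicate-in-list red marker) that compared URLs through a function like this, instead of by raw string equality, would treat all the capture variants described above as the same link.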