#1
[LinkCrawler] Trying to download DPReview sample images
Hi, since DPReview is shutting down on April 10th, I'm trying to download some of their sample photo galleries, and it looks like JD2's LinkCrawler should be able to do that. I've looked at the sample documentation and tried to construct a rule:
Code:
[{ "enabled" : true, "logging" : false, "maxDecryptDepth" : 2, "name" : "DPReview sample gallery", "pattern" : "(**External links are only visible to Support Staff**, "packageNamePattern": "$2", "rule" : "DEEPDECRYPT", "passwordPattern" : null, "deepPattern" : "(**External links are only visible to Support Staff** }]

To enter the rule: the settings table shows Key, Description and Value columns; find the entry "LinkCrawler: Link Crawler Rules", edit the "Value" field, and enter the parameters wrapped in square brackets.

Unfortunately it's not working. The rule is valid (I checked it with an online JSON validator), so it should work. I had the periods double-escaped with two backslashes, but took them out while troubleshooting, since an unescaped '.' in regex still matches a literal period. A regex checker matches it fine. I set maxDecryptDepth to 2; I'm not sure whether that is correct, nor about the other parameters.

What I'm trying to do: from the main page there are Camera and Lens sample gallery links, like this: **External links are only visible to Support Staff****External links are only visible to Support Staff** Clicking a link opens a gallery of images from that camera at the same URL. Clicking through to an image opens a page with a sample image and info on the side, usually with a JPG and a Raw download link. That's what I'm trying to get. Ideally it would only download files over a certain size (i.e. not every thumbnail JPG), but I'm not sure that can be done. I played with writing the camera part of the URL into the file name, but the packageNamePattern should work fine for individual folders. The pattern above *should* capture that part into packageNamePattern, and hence allow a folder to be created on download. What am I doing wrong?
#2
Hi mtsandersen,
that LinkCrawler rule will not work for this website, since it loads its content dynamically. You can easily check this: open the website -> CTRL + U -> the HTML view opens -> you won't find the direct URLs to those images in the source. (Also keep in mind: using "inspect element" you may be able to see the direct URLs in the HTML, but that is the loaded/final HTML, not the HTML which JDownloader will "see"! The request described below is definitely required to find those image URLs!)

Here is how it could work:
1. Example gallery: dpreview.com/sample-galleries/0887418781/om-system-om-1-sample-gallery-dpreview-tv/6231604069
2. Extract the galleryID -> 0887418781
3. Get the images via the following request ("API link"): dpreview.com/sample-galleries/data/get-gallery?galleryId=0887418781&isMobile=false

In theory, you could try to automate it. For that, you'd need two rules:
1. A REWRITE rule which accepts the first kind of link and changes it to the "API link".
2. A DEEPDECRYPT rule which accepts the "API link" and extracts all image URLs out of it.

If you're lazy, you could even try this without any RegEx, since the JSON response of that "API link" pretty much only contains the URLs you want.
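The rewrite step described above can be sketched outside JD; a minimal Python sketch, assuming only the URL shapes shown in the examples in this post:

```python
import re

def gallery_to_api_link(url: str):
    """Rewrite a dpreview gallery URL to the JSON "API link".

    The URL shapes are taken from the examples in this post; anything
    else about the site's routing is an assumption.
    """
    m = re.search(r"dpreview\.com/sample-galleries/(\d+)/", url)
    if m is None:
        return None  # not a gallery link, leave it alone
    return ("https://www.dpreview.com/sample-galleries/data/get-gallery"
            f"?galleryId={m.group(1)}&isMobile=false")

print(gallery_to_api_link(
    "https://www.dpreview.com/sample-galleries/0887418781/"
    "om-system-om-1-sample-gallery-dpreview-tv/6231604069"))
```

A LinkCrawler REWRITE rule does the same thing declaratively: the "pattern" regex captures the galleryID, and "rewriteReplaceWith" plugs it into the API URL.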
__________________
JD Supporter, Plugin Dev. & Community Manager
Erste Schritte & Tutorials || JDownloader 2 Setup Download
#3
Wow, ok. I'm currently on Mac; its page-source inspector can see the rendered page. So much more complicated than the early web.

I copied that API link over to jsoneditoronline.org and formatted it; much easier to read. Lots of info. I need to play with it and see if I can get JD2 to do what I want. If I can grab the first 'Title' in that API link, I have my folder name for "packageNamePattern", something like
Code:
false,\\"title\\":\\"([^\\"]+)

Further down I see the URLs, seemingly ALL of them in the gallery. I don't know if JD2 will recognise them and queue them all, but if it can, that is sweet. I can live with all the thumbnails there, though I'd prefer to go without (there's no "exclude" pattern, is there? They all have a prefix with a tilde in front of Sample_gallery/[filename]). I can delete them from the folder later. Will experiment later and post back. Let me know if there's a fault in my logic.

Last edited by mtsandersen; 24.03.2023 at 16:03. Reason: typo
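A quick way to sanity-check that title capture outside JD, shown here in its plain regex form (without the JSON-layer escaping); the sample fragment is a guessed stand-in for the get-gallery response shape, not real data:

```python
import re

# Title capture in plain regex form (inside a JSON rule, quotes must not
# be double-escaped; see the reply below).
title_re = re.compile(r'false,"title":"([^"]+)')

# Hypothetical fragment mimicking the get-gallery JSON response shape.
sample = '{"isMobile":false,"title":"OM System OM-1 sample gallery","images":[]}'
print(title_re.search(sample).group(1))
```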
#4
Do not double-escape quotes, or you will break your JSON syntax.
Apart from that, it looks fine.
#5
I've finally found some time to work on this; it is hard to debug, as I can't see any output to find out why something isn't working. I've tested with regex validators and JSON validators. I've got something which should work, but doesn't. Some syntax obviously isn't quite right, even if it seems to validate.

Double quotes are escaped once; the validator wouldn't pass the ? without double backslashes. What I've got so far are three rules:
1. Rewrite the page link to the API JSON link.
2. Rewrite thumbnail links to dummy links, as they aren't needed.
3. A DeepDecrypt rule to look for the s3.amazonaws image files.

Code:
[{ "enabled": true, "name": "Rewrite URL to API", "pattern": "(dpreview.com/sample-galleries/([0-9]+)/[^\"]+\">)", "rule": "REWRITE", "rewriteReplaceWith": "dpreview.com/sample-galleries/data/get-gallery\\?galleryId=$2&isMobile=false\">" }, { "enabled": true, "name": "Rewrite thumbnails to dummy links", "pattern": "(**External links are only visible to Support Staff**, "rule": "REWRITE", "rewriteReplaceWith": "#~dummylink~" }, { "enabled" : true, "logging" : false, "maxDecryptDepth" : 1, "name" : "DPReview sample gallery", "pattern" : "**External links are only visible to Support Staff**, "packageNamePattern": "false,\"title\":\"([^\"]+)", "rule" : "DEEPDECRYPT", "passwordPattern" : null, "deepPattern" : "(**External links are only visible to Support Staff** }]

**External links are only visible to Support Staff****External links are only visible to Support Staff** It picks up a list and adds over 2000 unwanted files of all sorts, none of which is what I am trying to get. Any ideas?

Edit: I validated with Regex101, jsonformatter and jsonlint.

Last edited by mtsandersen; 26.03.2023 at 16:07. Reason: Extra info
#6
Your regular expressions are incomplete: The protocol ("http...") is often missing.
You will see what I mean in my example rule down below. I'll start "from the back" and create the rule that crawls the image links from the JSON file/link. You can trial-and-error from there. First I'll reply to your post, though:

You can build the chain of rules afterwards. In this case, start with the rule for those URLs: dpreview.com/sample-galleries/data/get-gallery?galleryId=0887418781&isMobile=false

Regarding output: see [your JD install dir]/logs
For RegEx I recommend: regex101.com
For JSON: jsoneditoronline.org

JD should also display an error if you try to put a broken JSON string into the advanced config.

Here is a working rule for that JSON link.
Code:
[ { "enabled": true, "logging": false, "maxDecryptDepth": 1, "name": "jdownloader.org example rule for dpreview.com/sample-galleries/data/get-gallery?...", "pattern": "https?://(?:www\\.)?dpreview\\.com/sample-galleries/data/get-gallery\\?galleryId=([0-9]+)&isMobile=false", "rule": "DEEPDECRYPT", "packageNamePattern": "false,\"title\":\"([^\"]+)", "passwordPattern": null, "deepPattern": "(**External links are only visible to Support Staff**]+/[0-9]+/[^.]+.(jpg|cr2|cr3|nef|nrw|raf|pef|arw|rw2|dng|orf|crw|srw|srf|sr2|iiq|j6i|dcr|rwl)[^\"]+)" } ] pastebin.com/raw/G5vNPJx1
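Outside JD you can sanity-check that rule's "pattern" against the API link. Here it is as a Python raw string, i.e. the single-escaped form the regex engine actually sees after the JSON layer is parsed:

```python
import re

# The rule's "pattern", de-escaped from JSON into a raw string.
pattern = re.compile(
    r"https?://(?:www\.)?dpreview\.com/sample-galleries/data/get-gallery"
    r"\?galleryId=([0-9]+)&isMobile=false")

url = ("https://www.dpreview.com/sample-galleries/data/get-gallery"
       "?galleryId=0887418781&isMobile=false")
print(pattern.match(url).group(1))  # the galleryId the rule captures
```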
Last edited by pspzockerscene; 27.03.2023 at 17:47. Reason: Fixed typo
#7
Thanks.
The reason I left out the https prefix is that your first post left it out, and I figured that since I'm doing a search-and-replace, it doesn't matter: the part being replaced comes after it. I didn't know about the logs, thanks; I thought that with logging set to false, it wouldn't log anything. There's a mess of logs there, but I can't make much sense of them. And yes, I used regex101 and jsoneditoronline, based on my previous search of the forum, which mentioned them.

I based my rules on the sample rules in the docs. My main confusion is with the double escapes, since the sample shows to use them, but regex101 set to Java 8 regex requires only one, except for the \\? as noted, which I found confusing. Hence I left out escaping the full stops altogether, since in regex an unescaped '.' matches a full stop as well. In your version it is mixed, with double quotes using one escape and everything else two. The main differences I see are the added https prefix (including the isMobile variable) and the added [^"]+ to match all the Amazon variables at the end.

Tested code:
Code:
[{ "enabled" : true, "logging" : false, "maxDecryptDepth" : 1, "name" : "DPReview sample gallery", "pattern" : "**External links are only visible to Support Staff**, "packageNamePattern": "false,\"title\":\"([^\"]+)", "rule" : "DEEPDECRYPT", "passwordPattern" : null, "deepPattern" : "(**External links are only visible to Support Staff**]+)" }]

Adding the first rule, rewriting the page link to the API JSON link, fails, grabbing all sorts of useless links. That's all I have time for tonight; more testing tomorrow. But glad I made _some_ progress. Not sure if I need to restart JD2 to reload the changed rules.

Edit: I tried another link, a Canon R8 gallery with JPG and CR3. This time it is grabbing both links correctly. Curiously, they are not named as in the JSON file (10-digit number + lowercase .cr3) but keep their original IMG_[frame].CR3 names, so Amazon is doing something funky.

Last edited by mtsandersen; 27.03.2023 at 20:45. Reason: fix error
#8
If you want to save some time during testing, you can disable the following advanced setting (warning: do not leave it disabled!): Code:
LinkCollector.dolinkcheck
The double escapes are not part of the regular expression, even though it may seem like it. They are needed because of the JSON syntax: a single \. would tell the JSON parser to escape the next character, but '.' is not an escapable character in JSON, so that would just break the syntax. You need \\. to tell the JSON parser "escape the escape character, so that in the end we get \. as the actual RegEx text". (I'm sorry, I don't know how to describe it in a better way.) Sure, we could update our documentation for that, but keep in mind that this is an advanced functionality which is supposed to be used by advanced users, so we don't want to teach users JSON/RegEx/escaping in such articles. It would just blow them up, plus we'd have to do that for every setting/article involving JSON and/or RegEx.
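A quick way to see the two layers at work, using nothing beyond Python's standard json module:

```python
import json

# The JSON text  "\\."  parses to the two-character regex  \.
# A lone  "\."  would be invalid JSON: '.' is not an escapable character.
print(json.loads('"\\\\."'))

# The same thing with a whole rule fragment: the doubled backslashes in
# the stored rule become single ones in the regex JD actually compiles.
rule = json.loads('{"pattern": "https?://example\\\\.com/\\\\?id=([0-9]+)"}')
print(rule["pattern"])
```

(The example\.com pattern is a made-up illustration, not one of the DPReview rules.)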
#9
Just a quick note, as I won't have time to look at it until tonight: as noted at the end of my last post, a last-minute test on the Canon R8 gallery got the CR3 files, though they were named differently from the links, which means Amazon is doing something, and that is probably why the RAF files didn't work. Another clue: I'd tried putting in the s3.amazonaws .CR3 link directly, and I noticed JD2 had altered the rules to include a DirectHTTP rule to "learn" about CR3s. I will try to do the same with RAF to see if that changes anything once I get home.

Update: I followed up on my hunch when I got home. I found a Raw link in the JSON file and downloaded it through JD2. I then checked the LinkCrawler rules, and JD had added a new DirectHTTP "learned file extension" rule for that file extension. So I modified it and constructed a general one with a list of Raw file formats:

Code:
{ "cookies" : null, "deepPattern" : null, "formPattern" : null, "id" : 1679976851325, "maxDecryptDepth" : 0, "name" : "Learned file extensions: jpg|raf|cr2|cr3|nef|nrw|pef|arw|rw2|dng|orf|crw|srw|srf|sr2|iiq|dcr|rwl", "packageNamePattern" : null, "passwordPattern" : null, "pattern" : "(?i)https?://.*\\.(jpg|raf|cr2|cr3|nef|nrw|pef|arw|rw2|dng|orf|crw|srw|srf|sr2|iiq|dcr|rwl)($|\\?.*$)", "rewriteReplaceWith" : null, "rule" : "DIRECTHTTP", "enabled" : true, "logging" : false, "updateCookies" : true }

So the final part is now working exactly as expected. I still haven't got the first rewrite rule working, which grabs the actual link on the page and converts it to the 'galleryId' JSON link. That will have to wait till later tonight.

Last edited by mtsandersen; 28.03.2023 at 11:50.
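That DIRECTHTTP pattern can be exercised outside JD; the URLs below are made-up stand-ins for the signed Amazon links, and the query-string branch is the part that matters for them:

```python
import re

# The "Learned file extensions" pattern, de-escaped from JSON into a raw
# string. (?i) makes it case-insensitive, so .CR3 matches the cr3 branch.
pat = re.compile(
    r"(?i)https?://.*\.(jpg|raf|cr2|cr3|nef|nrw|pef|arw|rw2|dng|orf|crw"
    r"|srw|srf|sr2|iiq|dcr|rwl)($|\?.*$)")

# Hypothetical URLs, just to exercise both branches of the pattern.
signed = "https://example.s3.amazonaws.com/IMG_0001.CR3?X-Amz-Expires=3600"
thumb = "https://example.s3.amazonaws.com/~thumb/IMG_0001.png"
print(bool(pat.match(signed)), bool(pat.match(thumb)))
```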
#10
Nice!
The need for the "Learned file extensions" rule may be because JD does not "know" some of the file extensions of the files you're trying to download here, though some, such as "jpg", are definitely supported. Most of the others, however, I've never seen...
#11
I think it has to do with the way Amazon serves up the non-JPG files; I notice the served files don't have the same name as the requested filename. There's an extremely long query string which does who knows what.

The list of Raw formats is derived from Googling a list of Raw formats. I took a couple out, but anything pertaining to major brands I left in, as makers like Canon, Nikon, Panasonic and others have changed file format extensions over time. The last few in the list I know nothing about.

I've played with the first Rewrite rule and managed a breakthrough. I ended up removing the matching surrounding HTML tags ( \"> ), and along with whatever other smaller tweaks I made, that seems to have fixed it. The rules now look like this:

Code:
[{ "cookies" : null, "deepPattern" : null, "formPattern" : null, "id" : 1679976851325, "maxDecryptDepth" : 0, "name" : "Learned file extensions: jpg|raf|cr2|cr3|nef|nrw|pef|arw|rw2|dng|orf|crw|srw|srf|sr2|iiq|dcr|rwl", "packageNamePattern" : null, "passwordPattern" : null, "pattern" : "(?i)https?://.*\\.(jpg|raf|cr2|cr3|nef|nrw|pef|arw|rw2|dng|orf|crw|srw|srf|sr2|iiq|dcr|rwl)($|\\?.*$)", "rewriteReplaceWith" : null, "rule" : "DIRECTHTTP", "enabled" : true, "logging" : false, "updateCookies" : true }, { "enabled": true, "name": "Rewrite URL to API", "pattern": "(**External links are only visible to Support Staff**]+)", "rule": "REWRITE", "rewriteReplaceWith": "**External links are only visible to Support Staff** }, { "enabled" : true, "logging" : false, "maxDecryptDepth" : 1, "name" : "DPReview sample gallery", "pattern" : "**External links are only visible to Support Staff**, "packageNamePattern": "false,\"title\":\"([^\"]+)", "rule" : "DEEPDECRYPT", "passwordPattern" : null, "deepPattern" : "(**External links are only visible to Support Staff**]+)" }]

Code:
**External links are only visible to Support Staff**

It seems to work when I put the main gallery link into 'File -> Analyse Text with Links' too, but the problem with this is that it picks everything up: it trawls thousands of unwanted content links as well as the wanted gallery links. So I'm looking into scraping a saved copy of the HTML with a Perl command; I'm trying to read up on it on Stack Overflow and **External links are only visible to Support Staff**Perl command-line options. I need to read up on it some more.

PS: How did you find that API JSON link in the code?
#12
Code:
[ { "enabled": true, "name": "Rewrite URL to API", "pattern": "**External links are only visible to Support Staff**]+", "rule": "REWRITE", "rewriteReplaceWith": "**External links are only visible to Support Staff** } ]
That link does not match any rule, so the deep-parser will just crawl everything it can find. Possible solution: write another LinkCrawler rule of type DEEPDECRYPT which will only pick up URLs looking like "http...dpreview.com/sample-galleries/<numbers>/[a-z0-9\\-]+"
A LinkCrawler rule should do the job, as long as there is no async call or JS involved that hides those URLs. From looking into it for a few seconds, it looks like everything you want is directly available in their HTML code. Chrome dev tools -> Network tab: developer.chrome.com/docs/devtools/
#13
Code:
**External links are only visible to Support Staff**

Last edited by mtsandersen; 28.03.2023 at 18:47.
#14
More information: https://support.jdownloader.org/Know...deepdecrypt/22
#15
So in this case, would I need both patterns, and would I put the desired links in deepPattern?
Code:
{ "enabled" : true, "logging" : false, "maxDecryptDepth" : 1, "name" : "DPReview DeepDecrypt sample gallery", "pattern" : "**External links are only visible to Support Staff**, "rule" : "DEEPDECRYPT", "passwordPattern" : null, "deepPattern" : "**External links are only visible to Support Staff**]+" }

Last edited by mtsandersen; 28.03.2023 at 19:05.
#16
The following rule is working fine for me in conjunction with your other rules:
Code:
{ "enabled": true, "logging": false, "maxDecryptDepth": 1, "name": "dpreview.com find single galleries links inside gallery category", "pattern": "https?://www\\.dpreview\\.com/sample-galleries\\?category=cameras", "rule": "DEEPDECRYPT", "packageNamePattern": null, "deepPattern": "(https?://www.dpreview\\.com/sample-galleries/[0-9]+/[a-z0-9\\-]+/[0-9]+)" }
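The deepPattern in that rule can be tried against a snippet of the category page; the html line below is a fabricated stand-in for what the page actually serves:

```python
import re

# deepPattern from the rule above, as a raw string (the dot in
# "www.dpreview" is escaped here for strictness).
deep = re.compile(
    r"(https?://www\.dpreview\.com/sample-galleries/[0-9]+/[a-z0-9\-]+/[0-9]+)")

# Fabricated html snippet shaped like a category-page gallery link.
html = ('<a href="https://www.dpreview.com/sample-galleries/0887418781/'
        'om-system-om-1-sample-gallery-dpreview-tv/6231604069">OM-1</a>')
print(deep.findall(html))
```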
#17
Excellent! It appears to be working! Proper testing tomorrow.
Then to see if I can figure out what the Comparison Tool JSON API link is.

Code:
[{ "cookies" : null, "deepPattern" : null, "formPattern" : null, "id" : 1679976851325, "maxDecryptDepth" : 0, "name" : "Learned file extensions: jpg|raf|cr2|cr3|nef|nrw|pef|arw|rw2|dng|orf|crw|srw|srf|sr2|iiq|dcr|rwl", "packageNamePattern" : null, "passwordPattern" : null, "pattern" : "(?i)https?://.*\\.(jpg|raf|cr2|cr3|nef|nrw|pef|arw|rw2|dng|orf|crw|srw|srf|sr2|iiq|dcr|rwl)($|\\?.*$)", "rewriteReplaceWith" : null, "rule" : "DIRECTHTTP", "enabled" : true, "logging" : false, "updateCookies" : true }, { "enabled" : true, "logging" : false, "maxDecryptDepth" : 1, "name" : "DPReview sample gallery links [DeepDecrypt]", "pattern" : "**External links are only visible to Support Staff**, "rule" : "DEEPDECRYPT", "passwordPattern" : null, "deepPattern" : "**External links are only visible to Support Staff**]+" }, { "enabled": true, "name": "DPReview Rewrite URL to API", "pattern": "(**External links are only visible to Support Staff**]+)", "rule": "REWRITE", "rewriteReplaceWith": "**External links are only visible to Support Staff** }, { "enabled" : true, "logging" : false, "maxDecryptDepth" : 1, "name" : "DPReview sample gallery Amazon links", "pattern" : "**External links are only visible to Support Staff**, "packageNamePattern": "false,\"title\":\"([^\"]+)", "rule" : "DEEPDECRYPT", "passwordPattern" : null, "deepPattern" : "(**External links are only visible to Support Staff**]+)" }]
#18
Just as a hint:
This may or may not work, but using those rules to grab such a high number of items may leave you with some files missing, as you won't get any feedback for certain problems, such as connection issues during crawling or running into rate limits. Those rules are super simple, and there is zero error handling for the case "anything went wrong during crawling". If it works: good! If not, you will need to build a "real" custom script solution for this, or "real" JDownloader plugins. I recommend at least double-checking the number of items you get in the end vs. the expected number of items.
#19
Yeah, I prefer smaller batches, and the crawled links will expire. At least I can add links that didn't quite work individually by right-click -> copy link. Or I could get tricky and add particular brands, e.g. Sony or Panasonic, to reduce the number of matched links. Realistically I just need a couple of Raw/JPG samples of each camera to test how they edit and how good their files look in comparison, so I'm not fussed if some are missed. Being able to copy individual galleries of cameras that interest me would probably have been enough, but I wanna see how they all compare, as well as older versions of the same brand.

Edit: My Perl-fu is not great, so I've left that part for now. I'm at the stage where I can right-click on a link and get that gallery, and that is all I really need, as it takes time to parse and then download, and the links will expire. I can add a couple, then go on with other work. I came across a Raw format I hadn't accounted for: Hasselblad has a 3FR format. Without a rule, JD only sees the JPEGs; with it added to the rules, it sees the Raws.

Last edited by mtsandersen; 29.03.2023 at 07:51.
#20
In post #8 I told you how you can easily skip this waiting time by deactivating the linkcheck.
See post #8, or just text-search this thread for "LinkCollector.dolinkcheck". In my tests, I got about one gallery per second with the linkcheck disabled.
#21
So you did. I was wary of switching off something important hidden away like this, as my memory is not the most reliable. Nonetheless, I decided to try it, writing a big note on my desk. I suppose it reduced network traffic. Interesting to see the links added unsorted, with the JPG and Raw files mixed up, and the Raws not renamed by Amazon until after they are requested.

Anyway, I'm still doing it in relatively small batches to avoid the links expiring, as it's quite a lot of data to download. I'm working backwards and am in the middle of 2020 atm, not necessarily downloading everything. I've been reading a few DPReview articles and watched some of their videos, as it's all coming to an end very shortly, sadly. They should just leave it up in a locked state so people can still make use of it.

Edit: While trying to get the rules to work, I ignored the query string, but I notice the AmazonAWS query string starts with what looks like a session expiry variable, X-Amz-Expires=3600. So what happens if I change it...

Code:
{ "enabled": true, "name": "DPReview rewrite expiry from 3600 to 36000", "pattern": "X-Amz-Expires=3600", "rule": "REWRITE", "rewriteReplaceWith": "X-Amz-Expires=36000" }

Last edited by mtsandersen; 29.03.2023 at 20:54.
#22
It is super unlikely that that will have any effect, as this is usually also checked server-side...
Also, your REWRITE rule in your recent post #21 cannot work like this. I suppose it was only meant to be a mockup?
#23
Yeah, it was late, just a thought before bed. And no, it didn't work. It was on the Amazon link buried deep in the JSON file, where people would normally not see it. My connection isn't the fastest (normally that's not needed), so it takes time, and the links will expire.

Speaking of rewriting, I experimented with regex in BBEdit (a Mac text editor) to parse the links from the saved HTML of the links page into a neat list, just as an exercise.

Code:
Search: https:\/\/www\.dpreview\.com\/sample-galleries\/([0-9]+)\/[^"]+\">(.+)(?=<\/a>)
Replace: \2\n**External links are only visible to Support Staff**

Extract button. Result example:
Canon EOS R3 sample gallery (DPReview TV)
**External links are only visible to Support Staff**
#24
@mtsandersen: Did you read that post? It looks like you will have some more time to mirror/clone the galleries:
**External links are only visible to Support Staff****External links are only visible to Support Staff**