JDownloader Community - Appwork GmbH
 

  #1  
24.03.2023, 14:34
mtsandersen
Junior Loader
Join Date: Mar 2023
Posts: 12
[LinkCrawler] Trying to download DPReview sample images

Hi, since DPReview is shutting down on April 10th, I'm trying to download some of their sample photo galleries, and it looks like JD2's LinkCrawler should be able to do that. I've looked at the sample documentation and tried to construct a rule:

Code:
[{
"enabled" : true,
"logging" : false,
"maxDecryptDepth" : 2,
"name" : "DPReview sample gallery",
"pattern" : "(**External links are only visible to Support Staff**,
"packageNamePattern": "$2",
"rule" : "DEEPDECRYPT",
"passwordPattern" : null,
"deepPattern" : "(**External links are only visible to Support Staff**
}]
Settings -> Advanced Settings -> search "Rules"
Shows Key, Description and Value columns
Entry for "LinkCrawler: Link Crawler Rules"
Edit "Value" field and enter parameters with square brackets

Unfortunately it's not working. The RegEx is valid and the JSON passes an online validator, so it should work. I had the periods double-escaped with two backslashes, but took the escapes out while troubleshooting, since an unescaped '.' in a regex still matches a literal period. A regex checker seems to match fine.
I set MaxDecryptDepth to 2, not sure if that is correct, nor the other parameters.

What I'm trying to do:
From the main page, there's a Camera and Lens sample gallery link, like this:
**External links are only visible to Support Staff****External links are only visible to Support Staff**
Clicking a link, it opens a gallery with the same URL of images from the camera. Clicking through to an image, there's a sample image with info on the side, usually with a jpg and Raw download link on this page. That's what I'm trying to get.
Ideally it would only download files over a certain size (i.e. not every thumbnail jpg), but I'm not sure that can be done.
I played with trying to get it to write the camera part of the URL into the file name, but the PackageNamePattern should work fine for individual folders. The pattern above *should* put that part into PackageNamePattern, and hence allow a folder to be created on download.
What am I doing wrong?
  #2
24.03.2023, 15:02
pspzockerscene
Community Manager
Join Date: Mar 2009
Location: Deutschland
Posts: 70,921

Hi mtsandersen,
that LinkCrawler rule will not work for this website since it is dynamically loading the content.
You can easily check this by opening the website -> CTRL + U -> HTML view should open -> You won't be able to find the direct-URLs to those images in the source.

(Also keep in mind: Using "inspect element", you may be able to see the direct-URLs inside html but this references the loaded/final HTML and not the html which JDownloader will "see"!
The request I'm describing down below is definitely required to find those image URLs!
)

Furthermore here is how it could work:
1. Example gallery:
dpreview.com/sample-galleries/0887418781/om-system-om-1-sample-gallery-dpreview-tv/6231604069

2. Extract the galleryID -> 0887418781

3. You can get the images via the following request ("API link"):
dpreview.com/sample-galleries/data/get-gallery?galleryId=0887418781&isMobile=false

In theory, you could try to automate it.
For that, you'd need 2 rules:
1. REWRITE Rule which accepts the first kind of links and changes those to the "API link".
2. DEEPDECRYPT Rule which accepts the "API link" and extracts all image-URLs out of that.
If you're lazy, you could even try this without any RegEx since it looks like the json response of that "API link" pretty much only contains the URLs you want.
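As a rough illustration of steps 2 and 3 outside of JDownloader, here is a minimal Python sketch that pulls the galleryID out of a gallery URL and builds the "API link" from it. The helper name `to_api_link` is made up for this example, and the `https://www.` prefix is an assumption, since the links above are shown without a protocol.

```python
import re

# Assumed URL shape, based on the example gallery link above.
GALLERY_RE = re.compile(r"https?://(?:www\.)?dpreview\.com/sample-galleries/([0-9]+)/")


def to_api_link(gallery_url: str) -> str:
    """Hypothetical helper: turn a gallery page URL into the get-gallery link."""
    match = GALLERY_RE.search(gallery_url)
    if match is None:
        raise ValueError(f"not a sample gallery URL: {gallery_url}")
    gallery_id = match.group(1)  # e.g. "0887418781"
    return (
        "https://www.dpreview.com/sample-galleries/data/get-gallery"
        f"?galleryId={gallery_id}&isMobile=false"
    )


example = (
    "https://www.dpreview.com/sample-galleries/0887418781/"
    "om-system-om-1-sample-gallery-dpreview-tv/6231604069"
)
api_link = to_api_link(example)
print(api_link)
```

A REWRITE rule would do the same thing inside JDownloader: the capture group holds the galleryID and the replacement string re-inserts it.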
__________________
JD Supporter, Plugin Dev. & Community Manager

Getting Started & Tutorials || JDownloader 2 Setup Download
Spoiler:

A user's JD crashes and the first thing to ask is:
Quote:
Originally Posted by Jiaz View Post
Do you have Nero installed?
  #3
24.03.2023, 16:01
mtsandersen

Wow ok. I'm currently on Mac, and its page source inspector shows the rendered page. So much more complicated than the early web.
I copied that API link over to jsoneditoronline.org and formatted it, much easier to read. Lots of info. I need to play with it and see if I can get JD2 to do what I want. If I can grab the 1st 'Title' in that API link, I have my folder name for "packageNamePattern" – something like
Code:
false,\\"title\\":\\"([^\\"]+)
and use "packageNamePattern": "$1" in the DeepDecrypt section. (is it correct use of double-slashes?)
Further down I see the URLs. It seems to be ALL of them in the gallery. I don't know if JD2 will recognise them and queue them all, but if it can, it is sweet. I can live with all the thumbnails there, though I'd prefer to go without (there's no "exclude" pattern, is there? They all have a prefix with a tilde in front of Sample_gallery/[filename]). I can delete them later in the folder.
Will experiment later and post back. Let me know if there's a fault in my logic.

Last edited by mtsandersen; 24.03.2023 at 16:03. Reason: typo
  #4
24.03.2023, 16:06
pspzockerscene

Do not double-escape quotes or you will break your json syntax.
Apart from that it looks fine.
  #5
26.03.2023, 16:05
mtsandersen

I've finally found some time to work on this; it is hard to debug as I can't see any output to find out why something isn't working. I've tested with Regex validators and JSON validators. I've got something which should work, but doesn't. Some syntax obviously isn't quite right, even if it seems to validate.
Double quotes are escaped, the validator wouldn't pass the ? without double-backslashes.
What I've got so far are 3 rules:
1. Rewrite link to API JSON link
2. Rewrite thumbnail links to dummy links, as they aren't needed.
3. DeepDecrypt rule to look for s3.amazonaws image files.
Code:
[{
  "enabled": true,
  "name": "Rewrite URL to API",
  "pattern": "(dpreview.com/sample-galleries/([0-9]+)/[^\"]+\">)",
  "rule": "REWRITE",
  "rewriteReplaceWith": "dpreview.com/sample-galleries/data/get-gallery\\?galleryId=$2&isMobile=false\">"
},
{
  "enabled": true,
  "name": "Rewrite thumbnails to dummy links",
  "pattern": "(**External links are only visible to Support Staff**,
  "rule": "REWRITE",
  "rewriteReplaceWith": "#~dummylink~"
},
{
  "enabled" : true,
  "logging" : false,
  "maxDecryptDepth" : 1,
  "name" : "DPReview sample gallery",
  "pattern" : "**External links are only visible to Support Staff**,
  "packageNamePattern": "false,\"title\":\"([^\"]+)",
  "rule" : "DEEPDECRYPT",
  "passwordPattern" : null,
  "deepPattern" : "(**External links are only visible to Support Staff**
}]
But when trying it on
**External links are only visible to Support Staff****External links are only visible to Support Staff**
it picks up a list and adds over 2000 unwanted files of all sorts, none of what I am trying to get.

Any ideas?
Edit: I validated with Regex101 and jsonformatter and jsonlint

Last edited by mtsandersen; 26.03.2023 at 16:07. Reason: Extra info
  #6
27.03.2023, 17:47
pspzockerscene

Your regular expressions are incomplete: The protocol ("http...") is often missing.
You will see what I mean in my example rule down below.

I'll start "from the back" and create the rule that crawls the image-links from the json file/link.
You can trial and error from there.

First I'll reply to your post though:
Quote:
Originally Posted by mtsandersen View Post
I've finally found some time to work on this; it is hard to debug as I can't see any output to find out why something isn't working.
That's why I recommend starting with a rule that will actually return something that you can see in JD.
You can build the chain of rules afterwards.
In this case, start with the rule for those URLs:
dpreview.com/sample-galleries/data/get-gallery?galleryId=0887418781&isMobile=false

Regarding output:
See [your JD install dir]/logs

Quote:
Originally Posted by mtsandersen View Post
I've tested with Regex validators and JSON validators.
For RegEx I recommend: regex101.com
For json: jsoneditoronline.org
JD should also display an error, if you try to put a broken json string into the advanced config.

Quote:
Originally Posted by mtsandersen View Post
Double quotes are escaped, the validator wouldn't pass the ? without double-backslashes.
Correct. "?" is a RegEx metacharacter, so you need to escape it if you want to match a literal "?".

Here is a working rule for that json link.
Code:
[
  {
    "enabled": true,
    "logging": false,
    "maxDecryptDepth": 1,
    "name": "jdownloader.org example rule for dpreview.com/sample-galleries/data/get-gallery?...",
    "pattern": "https?://(?:www\\.)?dpreview\\.com/sample-galleries/data/get-gallery\\?galleryId=([0-9]+)&isMobile=false",
    "rule": "DEEPDECRYPT",
    "packageNamePattern": "false,\"title\":\"([^\"]+)",
    "passwordPattern": null,
    "deepPattern": "(**External links are only visible to Support Staff**]+/[0-9]+/[^.]+.(jpg|cr2|cr3|nef|nrw|raf|pef|arw|rw2|dng|orf|crw|srw|srf|sr2|iiq|j6i|dcr|rwl)[^\"]+)"
  }
]
Rule as plaintext for easier copy & paste:
pastebin.com/raw/G5vNPJx1

Last edited by pspzockerscene; 27.03.2023 at 17:47. Reason: Fixed typo
  #7
27.03.2023, 19:50
mtsandersen

Thanks.
The reason I left out the https prefix is that your first post left it out, and I figured since I'm doing a search and replace, it doesn't matter, the part being replaced comes after.
I didn't know about the logs, thanks, I thought maybe with logging set to false, it wouldn't log anything. There's a mess of logs there, but can't make much sense of them.
And yes I used regex101 and jsoneditoronline based on my previous search of the forum which mentioned them. I based my rules on the sample rules in the docs. My main confusion is with the double escapes, since the sample shows to use them, but regex101 set to use Java8 regex requires only one, except for the \\? as noted, which I found confusing. Hence I left out escaping the full stops altogether, since in regex it will match a full stop as well. In your version, it is mixed, with double-quotes using one escape, everything else 2.
The main difference I see is the added https prefix, including the ismobile variable, and added [^"]+ to match all the Amazon variables at the end.
Tested code:
Code:
[{
  "enabled" : true,
  "logging" : false,
  "maxDecryptDepth" : 1,
  "name" : "DPReview sample gallery",
  "pattern" : "**External links are only visible to Support Staff**,
  "packageNamePattern": "false,\"title\":\"([^\"]+)",
  "rule" : "DEEPDECRYPT",
  "passwordPattern" : null,
  "deepPattern" : "(**External links are only visible to Support Staff**]+)"
}]
First test with only that one rule and fed a preformatted galleryID link: It correctly grabs all the jpgs in the gallery, even puts it into a folder named after the gallery, as per the 'packageNamePattern'. Not working: For some reason, it hasn't grabbed the Raws from the page, which has links matching the same pattern. Regex101 concurs it should work.
Adding the first rule reformatting from the page link to the API json link fails, grabbing all useless links.

That's all I have time for tonight. More testing tomorrow. But glad I made _some_ progress.
Not sure if I need to restart JD2 to reload the changed rules.

Edit: I tried another link, a Canon R8 gallery with jpg and CR3. This time it is grabbing both links correctly. Curiously, they are not named as in the json file (10-digit number +lowercase .cr3), but their original IMG_[frame].CR3 so Amazon is doing something funky.

Last edited by mtsandersen; 27.03.2023 at 20:45. Reason: fix error
  #8
27.03.2023, 21:02
pspzockerscene

Quote:
Originally Posted by mtsandersen View Post
I figured since I'm doing a search and replace, it doesn't matter, the part being replaced comes after.
To be honest, I'm not sure. Maybe it could have worked, but I prefer more precise regular expressions and I didn't want to burn more time testing all of your rules, so I just went and created my own one.

Quote:
Originally Posted by mtsandersen View Post
I thought maybe with logging set to false, it wouldn't log anything
That is correct. You need to set each rule's "logging" flag to true to get output for it in the logs. Keep in mind that if a rule doesn't match, you won't get any logging either way.

Quote:
Originally Posted by mtsandersen View Post
For some reason, it hasn't grabbed the Raws from the page, which has links matching the same pattern.
Well, no matter what the reason for that is, you should solve this first before you construct the other rules to automate the whole process.
If you want to save some time during testing, you can disable the following advanced setting (warning: do not leave it disabled!):
Code:
LinkCollector.dolinkcheck
This will skip the linkcheck so all results based on your linkcrawler rule will appear immediately in your linkgrabber.

Quote:
Originally Posted by mtsandersen View Post
There's a mess of logs there, but can't make much sense of them.
Use a powerful text editor like "Notepad++", which can search for a specific string in all files inside a directory.

Quote:
Originally Posted by mtsandersen View Post
My main confusion is with the double escapes
The double escapes are not part of the regular expression even though it may seem like it.
They are needed because of the json syntax:
A single \. does not work: inside a json string the backslash starts a json escape sequence (such as \" for a quote), and \. is not a valid one, so it would just break the syntax.
You need \\. to tell the json parser "please escape the escape-char so in the end we get \. as actual RegEx text".
(I'm sorry but I don't know how to describe it in a better way.)
Sure we could also update our documentation for that but you also need to keep in mind that this is an advanced functionality which is supposed to be used by advanced users so we don't want to teach users json/RegEx/escaping in such articles - it would just blow them up plus we'd have to do that for every setting/article involving json and/or RegEx.
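The two layers of escaping can be checked outside of JD. The sketch below uses Python's standard `json` module as a stand-in for JDownloader's own parser (an assumption: JD is a Java application and may differ in details), and the rule text is shortened for the example.

```python
import json
import re

# Rule text exactly as it would be typed into the advanced config (raw json).
rule_json = (
    r'{"pattern": "https?://(?:www\\.)?dpreview\\.com/sample-galleries'
    r'/data/get-gallery\\?galleryId=([0-9]+)",'
    r' "packageNamePattern": "false,\"title\":\"([^\"]+)"}'
)

rule = json.loads(rule_json)
pattern = rule["pattern"]
print(pattern)                     # single backslashes reach the RegEx engine
print(rule["packageNamePattern"])  # \" has become a plain quote

# A single backslash before "." is not a valid json escape, so parsing fails.
broken_json = r'{"pattern": "dpreview\.com"}'
try:
    json.loads(broken_json)
    broken_parses = True
except json.JSONDecodeError:
    broken_parses = False

url = "https://www.dpreview.com/sample-galleries/data/get-gallery?galleryId=0887418781&isMobile=false"
match = re.search(pattern, url)
```

This is also why regex101 wants only one backslash: it sees the RegEx after the json layer has already been removed.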
  #9
28.03.2023, 01:36
mtsandersen

Just a quick note, I won't have time to look at it until tonight: As noted at the end, a last-minute test on the Canon R8 gallery got the CR3 files, though they were named differently to the links, which means Amazon is doing something, and that is probably why the Raf files didn't work. Another clue: I'd tried putting in the s3.amazonaws .CR3 link directly, and I noticed JD2 had altered the rules to include a httpdirect rule to "learn" about CR3s. I will try to do the same with Raf to see if that changes anything once I get home.

Update
I followed up on my hunch when I got home: I found a Raw link in the JSON file and downloaded it through JD2. I then checked the LinkCrawler rules, and JD had added a new DirectHTTP "learned file extension" rule for that file extension. So I modified it and constructed a general one with the list of Raw file formats:
Code:
 {
  "cookies"            : null,
  "deepPattern"        : null,
  "formPattern"        : null,
  "id"                 : 1679976851325,
  "maxDecryptDepth"    : 0,
  "name"               : "Learned file extensions: jpg|raf|cr2|cr3|nef|nrw|pef|arw|rw2|dng|orf|crw|srw|srf|sr2|iiq|dcr|rwl",
  "packageNamePattern" : null,
  "passwordPattern"    : null,
  "pattern"            : "(?i)https?://.*\\.(jpg|raf|cr2|cr3|nef|nrw|pef|arw|rw2|dng|orf|crw|srw|srf|sr2|iiq|dcr|rwl)($|\\?.*$)",
  "rewriteReplaceWith" : null,
  "rule"               : "DIRECTHTTP",
  "enabled"            : true,
  "logging"            : false,
  "updateCookies"      : true
 }
I've tested it on several galleries of different camera makes, including the Fuji one where I only got the jpegs. It now correctly identifies all the files, including the Raws. When I copy a URL constructed with the galleryId format, the LinkGrabber picks it up and adds it automatically.
So the final part is now working exactly as expected.
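For anyone who wants to sanity-check the extension pattern outside of JD2, here is a small Python sketch. It uses the same RegEx as the rule above (after json unescaping); the URLs are made-up examples on example.com, and Python's `re` is only a stand-in for the Java RegEx engine JD2 uses.

```python
import re

# The DIRECTHTTP pattern from the rule above, with the json escaping removed.
DIRECT_FILE_RE = re.compile(
    r"(?i)https?://.*\.(jpg|raf|cr2|cr3|nef|nrw|pef|arw|rw2|dng|orf|crw|srw"
    r"|srf|sr2|iiq|dcr|rwl)($|\?.*$)"
)


def is_direct_file(url: str) -> bool:
    """Hypothetical helper: True if the URL ends in one of the listed extensions."""
    return DIRECT_FILE_RE.search(url) is not None


# Made-up URLs, for illustration only.
print(is_direct_file("https://example.com/files/DSCF0001.RAF?X-Amz-Expires=3600"))
print(is_direct_file("http://example.com/IMG_0001.cr3"))
print(is_direct_file("https://example.com/gallery.html"))
```

The `(?i)` flag makes the match case-insensitive, and the `($|\?.*$)` tail accepts either the end of the URL or a query string after the extension.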
I still haven't got the first rewrite rule working which grabs the actual link on the page and converts it to the 'galleryId' JSON link. That will have to wait till later tonight.

Last edited by mtsandersen; 28.03.2023 at 11:50.
  #10
28.03.2023, 13:48
pspzockerscene

Nice!

The "Learned file extensions" rule may be needed because JD does not "know" some of the file extensions you're trying to download here, though some, such as "jpg", are definitely supported.
Most of those however, I've never seen...
  #11
28.03.2023, 17:49
mtsandersen

I think it has to do with the way Amazon serves up the non-jpg files; I notice the served files don't have the same name as the requested filename. There's an extremely long query string which does who knows what.
The list of Raw files is derived from Googling a list of Raw formats, I took a couple out, but anything pertaining to major brands I left in, as makers like Canon, Nikon, Panasonic and others have changed file format extension over time. The last few in the list I know nothing about.

I've played with the 1st Rewrite rule, and managed a breakthrough. I ended up removing any matching surrounding HTML tags ( \">), and along with whatever other smaller tweaks I did, seems to have fixed it.

The Rewrite rules now look like this:
Code:
[{
  "cookies"            : null,
  "deepPattern"        : null,
  "formPattern"        : null,
  "id"                 : 1679976851325,
  "maxDecryptDepth"    : 0,
  "name"               : "Learned file extensions: jpg|raf|cr2|cr3|nef|nrw|pef|arw|rw2|dng|orf|crw|srw|srf|sr2|iiq|dcr|rwl",
  "packageNamePattern" : null,
  "passwordPattern"    : null,
  "pattern"            : "(?i)https?://.*\\.(jpg|raf|cr2|cr3|nef|nrw|pef|arw|rw2|dng|orf|crw|srw|srf|sr2|iiq|dcr|rwl)($|\\?.*$)",
  "rewriteReplaceWith" : null,
  "rule"               : "DIRECTHTTP",
  "enabled"            : true,
  "logging"            : false,
  "updateCookies"      : true
},
{
  "enabled": true,
  "name": "Rewrite URL to API",
  "pattern": "(**External links are only visible to Support Staff**]+)",
  "rule": "REWRITE",
  "rewriteReplaceWith": "**External links are only visible to Support Staff**
},
{
  "enabled" : true,
  "logging" : false,
  "maxDecryptDepth" : 1,
  "name" : "DPReview sample gallery",
  "pattern" : "**External links are only visible to Support Staff**,
  "packageNamePattern": "false,\"title\":\"([^\"]+)",
  "rule" : "DEEPDECRYPT",
  "passwordPattern" : null,
  "deepPattern" : "(**External links are only visible to Support Staff**]+)"
}]
Now when I right-click a link on the Sample Gallery main page, eg
Code:
**External links are only visible to Support Staff**
the Link Crawler picks it up.
It seems when I put the main gallery link into 'File -> Analyse Text with Links' it works too, but the problem with this is that it is picking everything up, it's trawling thousands of unwanted content links, as well as the wanted gallery links. So I'm looking into scraping a saved copy of the HTML with a Perl command, I'm trying to read up on it on Stack Overflow and **External links are only visible to Support Staff**Perl commandline options. I need to read up on it some more.

PS: How did you find that API json link in the code?
  #12
28.03.2023, 17:57
pspzockerscene

Quote:
Originally Posted by mtsandersen View Post
I've played with the 1st Rewrite rule, and managed a breakthrough. I ended up removing any matching surrounding HTML tags ( ">), and along with whatever other smaller tweaks I did, seems to have fixed it.

The Rewrite rules now look like this:
You don't need the two matchings - I'd simplify it to:
Code:
[
  {
    "enabled": true,
    "name": "Rewrite URL to API",
    "pattern": "**External links are only visible to Support Staff**]+",
    "rule": "REWRITE",
    "rewriteReplaceWith": "**External links are only visible to Support Staff**
  }
]
...but then again if it works there is no need to change it.

Quote:
Originally Posted by mtsandersen View Post
It seems when I put the main gallery link into 'File -> Analyse Text with Links' it works too, but the problem with this is that it is picking everything up, it's trawling thousands of unwanted content links, as well as the wanted gallery links.
It's not a new problem:
That link does not match any rule so the deep-parser will just crawl everything it can find.
Possible solution: Write another LinkCrawler rule of type DEEPDECRYPT which will only pick up URLs looking like "http...dpreview.com/sample-galleries/<numbers>/[a-z0-9\\-]+"

Quote:
Originally Posted by mtsandersen View Post
So I'm looking into scraping a saved copy of the HTML with a Perl command...
Sure, you can do that, but I don't see what would be different here compared to crawling those other URLs.
A LinkCrawler Rule should do the job as long as there is no async call or js involved that hides those URLs.
From looking into it for some seconds, it looks like all you want is directly available in their html code.

Quote:
Originally Posted by mtsandersen View Post
How did you find that API json link in the code?
Chrome dev tools -> Network tab
developer.chrome.com/docs/devtools/
  #13
28.03.2023, 18:41
mtsandersen

Quote:
You don't need the two matchings - I'd simplify it to:
I wasn't sure if the outer brackets were required, I think the sample used it, so I left it. Matches either way.
Quote:
Write another LinkCrawler rule of type DEEPDECRYPT which will only pick up URLs looking like "http...dpreview.com/sample-galleries/<numbers>/[a-z0-9\\-]+"
That would be ideal, wasn't sure how DeepDecrypt worked, eg if it ignored all other non-matching content. I'll try it. DeepDecrypt has a pattern and a deepPattern, are they both required?
Quote:
Chrome dev tools -> Network tab
developer.chrome.com/docs/devtools/
I keep switching browsers, currently using EDGE (Chrome-based), it does have Developer tools, I don't see a Network tab though. Good to know when I use Chrome next. It could come in handy for investigating the DPReview Studio Scene page. A JSON file with a list of all the links would be very handy, eg
Code:
**External links are only visible to Support Staff**
I might be able to try a DeepDecrypt rule before it's time for bed.

Last edited by mtsandersen; 28.03.2023 at 18:47.
  #14
28.03.2023, 18:45
pspzockerscene

Quote:
Originally Posted by mtsandersen View Post
That would be ideal, wasn't sure how DeepDecrypt worked, eg if it ignored all other non-matching content.
That's how it works. The exception: if you leave "deepPattern" empty, it will always auto-pickup all items in the html of the URL matching "pattern".
More information:
https://support.jdownloader.org/Know...deepdecrypt/22
  #15
28.03.2023, 18:58
mtsandersen

So in this case, would I need both patterns, and would I put the desired links in deepPattern?
Code:
{
  "enabled" : true,
  "logging" : false,
  "maxDecryptDepth" : 1,
  "name" : "DPReview DeepDecrypt sample gallery",
  "pattern" : "**External links are only visible to Support Staff**,
  "rule" : "DEEPDECRYPT",
  "passwordPattern" : null,
  "deepPattern" : "**External links are only visible to Support Staff**]+"
}
Edit: First try didn't work, it was grabbing everything. I will have to look at it afresh tomorrow.

Last edited by mtsandersen; 28.03.2023 at 19:05.
  #16
28.03.2023, 19:23
pspzockerscene

The following rule is working fine for me in conjunction with your other rules:
Code:
{
    "enabled": true,
    "logging": false,
    "maxDecryptDepth": 1,
    "name": "dpreview.com find single galleries links inside gallery category",
    "pattern": "https?://www\\.dpreview\\.com/sample-galleries\\?category=cameras",
    "rule": "DEEPDECRYPT",
    "packageNamePattern": null,
    "deepPattern": "(https?://www\\.dpreview\\.com/sample-galleries/[0-9]+/[a-z0-9\\-]+/[0-9]+)"
  }
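To show what the deepPattern does, here is a minimal Python sketch that applies it to a made-up piece of HTML. The pattern is the rule's deepPattern after json unescaping, with the dot after "www" escaped as well; the second and third links in the snippet are invented for the example.

```python
import re

# deepPattern from the rule above, with the json layer of escaping removed.
deep_pattern = (
    r"(https?://www\.dpreview\.com/sample-galleries/[0-9]+/[a-z0-9\-]+/[0-9]+)"
)

# Made-up HTML: one gallery link, one unrelated link, one category link.
html = """
<a href="https://www.dpreview.com/sample-galleries/0887418781/om-system-om-1-sample-gallery-dpreview-tv/6231604069">OM-1</a>
<a href="https://www.dpreview.com/news/1234567890/some-article">News</a>
<a href="https://www.dpreview.com/sample-galleries?category=cameras">All</a>
"""

# Only URLs that look like a single gallery are returned.
links = re.findall(deep_pattern, html)
print(links)
```

Everything that does not match the deepPattern is ignored, which is why the thousands of unrelated links no longer end up in the LinkGrabber.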
  #17
28.03.2023, 20:30
mtsandersen

Excellent! It appears to be working! Proper testing tomorrow.
Then to see if I can figure out what the Comparison Tool json API link is.
Code:
[{
  "cookies"            : null,
  "deepPattern"        : null,
  "formPattern"        : null,
  "id"                 : 1679976851325,
  "maxDecryptDepth"    : 0,
  "name"               : "Learned file extensions: jpg|raf|cr2|cr3|nef|nrw|pef|arw|rw2|dng|orf|crw|srw|srf|sr2|iiq|dcr|rwl",
  "packageNamePattern" : null,
  "passwordPattern"    : null,
  "pattern"            : "(?i)https?://.*\\.(jpg|raf|cr2|cr3|nef|nrw|pef|arw|rw2|dng|orf|crw|srw|srf|sr2|iiq|dcr|rwl)($|\\?.*$)",
  "rewriteReplaceWith" : null,
  "rule"               : "DIRECTHTTP",
  "enabled"            : true,
  "logging"            : false,
  "updateCookies"      : true
},
{
  "enabled" : true,
  "logging" : false,
  "maxDecryptDepth" : 1,
  "name" : "DPReview sample gallery links [DeepDecrypt]",
  "pattern" : "**External links are only visible to Support Staff**,
  "rule" : "DEEPDECRYPT",
  "passwordPattern" : null,
  "deepPattern" : "**External links are only visible to Support Staff**]+"
},
{
  "enabled": true,
  "name": "DPReview Rewrite URL to API",
  "pattern": "(**External links are only visible to Support Staff**]+)",
  "rule": "REWRITE",
  "rewriteReplaceWith": "**External links are only visible to Support Staff**
},
{
  "enabled" : true,
  "logging" : false,
  "maxDecryptDepth" : 1,
  "name" : "DPReview sample gallery Amazon links",
  "pattern" : "**External links are only visible to Support Staff**,
  "packageNamePattern": "false,\"title\":\"([^\"]+)",
  "rule" : "DEEPDECRYPT",
  "passwordPattern" : null,
  "deepPattern" : "(**External links are only visible to Support Staff**]+)"
}]
  #18
28.03.2023, 20:49
pspzockerscene

Just as a hint:
This may or may not work, but using those rules to grab such a large number of items may leave you with some files missing, as you won't get any feedback for certain problems such as connection issues during crawling or running into rate-limits.
Those rules are super simple and there is zero errorhandling for case "anything went wrong during crawling".
If it works: Good!
If not, you will need to build a "real" custom script solution for this or "real" JDownloader plugins.

I recommend at least double-checking the number of items you get in the end against the expected number of items.
  #19
29.03.2023, 03:35
mtsandersen

Yeah, I prefer smaller batches, and the crawled links will expire. At least I can add links that didn't quite work individually by right-click copy link. Or I could get tricky and add particular brands, eg Sony or Panasonic, to reduce the number of matched links. Realistically I just need a couple of Raw/Jpg samples of each camera to test how they edit and how well their files look in comparison. So not fussed if some are missed. Being able to copy individual galleries of cameras that interest me would probably have been enough, but I wanna see how they all compare, as well as older versions of the same brand.

Edit: My Perl-Fu is not great, so I left that part for now. I'm at the stage where I can right-click on the link and get that gallery, and that is all I really need, as it takes time to parse, then download, and the links will expire. I can add a couple, then go on with other work.

I came across a Raw format I hadn't accounted for: Hasselblad has a 3FR format. Without a rule, JD only sees the jpegs; with it added to the rules, it sees the Raws.

Last edited by mtsandersen; 29.03.2023 at 07:51.
  #20
29.03.2023, 13:35
pspzockerscene

Quote:
Originally Posted by mtsandersen View Post
it takes time to parse...
In post #8 I told you how you can easily skip this time by deactivating the linkcheck.

See post #8 or just text search this thread for "LinkCollector.dolinkcheck".
In my tests, I got about 1 gallery per second with linkcheck disabled.
  #21  
Old 29.03.2023, 18:15
mtsandersen mtsandersen is offline
Junior Loader
 
Join Date: Mar 2023
Posts: 12
Default

So you did. I was wary of switching off something important that is hidden away like this, as my memory is not the most reliable. Nonetheless, I decided to try it, and left a big note on my desk. I suppose it reduced network traffic. It was interesting to see the links added unsorted, with the JPG and Raw files mixed up, and the Raws not renamed by Amazon after they are requested.
Anyway, I'm still doing it in relatively small batches to avoid the links expiring, as it's quite a lot of data to download. I'm working backwards and am in the middle of 2020 at the moment, not necessarily downloading everything.
I've been reading a few DPReview articles and watching some of their videos, as it's all coming to an end very shortly, sadly. They should just leave it up in a locked state so people can still make use of it.

Edit: While trying to get the rules to work, I ignored the query string, but I notice the AmazonAWS query string starts with what looks like a session expiry variable, X-Amz-Expires=3600. So what happens if I change it...
Code:
{
  "enabled": true,
  "name": "DPReview rewrite expiry from 3600 to 36000",
  "pattern": "X-Amz-Expires=3600",
  "rule": "REWRITE",
  "rewriteReplaceWith": "X-Amz-Expires=36000"
}

Last edited by mtsandersen; 29.03.2023 at 20:54.
  #22  
Old 30.03.2023, 13:18
pspzockerscene pspzockerscene is online now
Community Manager
 
Join Date: Mar 2009
Location: Deutschland
Posts: 70,921
Default

Quote:
Originally Posted by mtsandersen View Post
So what happens if I change it...
It is super unlikely that this will have any effect, as this is usually also checked server-side...

Also, the REWRITE rule in your recent post #21 cannot work like this. I suppose it was only meant to be a mockup?
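
To illustrate why editing the expiry in the URL is unlikely to help: in presigned URLs of this kind, the signature is normally computed over the other query parameters, including X-Amz-Expires, so changing the value no longer matches the signature the server recomputes. The sketch below is a simplified toy model in Python, not the real AWS signing algorithm; the secret, the parameter values and the helper names `sign` and `is_valid` are all made up for the example.

```python
import hashlib
import hmac
from urllib.parse import parse_qsl, urlencode

# Hypothetical secret: in reality only the server side knows the signing key.
SECRET = b"server-side-secret"


def sign(params):
    """Toy signature: HMAC-SHA256 over all parameters except the signature."""
    canonical = urlencode(
        sorted((k, v) for k, v in params.items() if k != "X-Amz-Signature")
    )
    return hmac.new(SECRET, canonical.encode(), hashlib.sha256).hexdigest()


def is_valid(query):
    """Recompute the signature from the query string and compare."""
    params = dict(parse_qsl(query))
    return hmac.compare_digest(params.get("X-Amz-Signature", ""), sign(params))


# Build a signed query string with made-up values.
original = {"X-Amz-Date": "20230329T000000Z", "X-Amz-Expires": "3600"}
original["X-Amz-Signature"] = sign(original)
original_query = urlencode(original)

# Same edit as the REWRITE mockup: bump the expiry, leave the signature alone.
tampered_query = original_query.replace(
    "X-Amz-Expires=3600", "X-Amz-Expires=36000"
)

print(is_valid(original_query))  # True
print(is_valid(tampered_query))  # False
```

The point is only that the expiry value and the signature are tied together, so a client-side rewrite of one without the other gets rejected.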
  #23  
Old 30.03.2023, 15:17
mtsandersen mtsandersen is offline
Junior Loader
 
Join Date: Mar 2023
Posts: 12
Default

Yeah, it was late, just a thought before bed. And no, it didn't work. It was on the Amazon link buried deep in the JSON file where people would normally not see it. My connection isn't the fastest (normally not needed), so downloads take a while and the links expire in the meantime.
Speaking of rewrite, I experimented with Regex in BBEdit (Mac text editor) to parse the links from the saved HTML of the links page into a neat list of links, just as an exercise.
Code:
Search:
https:\/\/www\.dpreview\.com\/sample-galleries\/([0-9]+)\/[^"]+\">(.+)(?=<\/a>)
Replace:
\2\n**External links are only visible to Support Staff**
Extract button

Result example:
Canon EOS R3 sample gallery (DPReview TV)
**External links are only visible to Support Staff**
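
For anyone without BBEdit, roughly the same extraction can be sketched in Python. This is only an approximation under a few assumptions: the sample HTML and its gallery ID and slug are invented, the function name `extract_galleries` is hypothetical, the title group is made non-greedy so several links on one line do not run together, and since the original replacement string is hidden above, printing the title followed by the full matched URL is a guess at the intended output.

```python
import re

# Same idea as the BBEdit search pattern, with the whole URL captured as
# group 1 and a non-greedy title group (an adjustment, not the original).
LINK_RE = re.compile(
    r'(https://www\.dpreview\.com/sample-galleries/([0-9]+)/[^"]+)">(.+?)(?=</a>)'
)


def extract_galleries(html):
    """Return (title, url) pairs for every sample-gallery link in the HTML."""
    return [(m.group(3).strip(), m.group(1)) for m in LINK_RE.finditer(html)]


# Invented sample input; the real saved page will look different.
sample_html = (
    '<a href="https://www.dpreview.com/sample-galleries/1234567890/example-gallery">'
    "Example camera sample gallery</a>\n"
    '<a href="https://www.dpreview.com/about">About</a>\n'
)

for title, url in extract_galleries(sample_html):
    print(title)
    print(url)
```

The resulting list of URLs could then be pasted into JD in small batches.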
  #24  
Old 14.04.2023, 19:54
notice notice is offline
JD Supporter
 
Join Date: Mar 2023
Posts: 505
Default

@mtsandersen: Did you read that post? It looks like you will have some more time to mirror/clone the galleries.
**External links are only visible to Support Staff****External links are only visible to Support Staff**
All times are GMT +2. The time now is 19:28.