JDownloader Community - Appwork GmbH
 

  #1  
Old 28.07.2023, 20:47
Nimboid Nimboid is offline
Vacuum Cleaner
 
Join Date: Jan 2023
Location: UK
Posts: 18
Handling DEEPDECRYPT relative URLs

So, I am trying to acquire links to attachments in a phpBB-hosted forum. This requires being logged in, so I have provided my username and password to JD via the 'Basic Authentication' settings. Out of the box, given a suitable forum page URL, JDownloader only finds the thumbnails and does not follow the href links:
Code:
<a class="file-preview " href="/phpBB2/index.php?attachments/capture-jpg.1988988/" target="_blank">
	<img src="/phpBB2/data/attachments/1915/1915251-5369dc6cb5e0da0fb7fd46255176b083.jpg" alt="Capture.JPG" width="264" height="200" loading="lazy">
</a>
The logs don't reveal any errors. Am I right in thinking that spending time on a LinkCrawler Rule is the way to go?

To this end, I have made a DEEPDECRYPT rule. I have established that its pattern catches the page URL, because the LinkCrawlerRule...log contains the page source, and the Rule cookies update, but it does not result in any additions to the LinkGrabber pane, nor any traces in logs.
My deepPattern:
Code:
  "deepPattern"        : "(?i)<a class=\"file-preview \" (href=\"[^\"]+\") target=\"_blank\">",
Do I need to follow with a second REWRITE LinkCrawler Rule to convert the relative href to an absolute URL? And if so, what should its pattern be? Should it expect to receive and look for the whole of the 'upstream' deepPattern, or just its matching group? I have tried a followup rule, without success, as it doesn't generate its LinkCrawlerRule...log and without any errors appearing in logs either.

It's frustrating to trawl this forum and find that so much of the content of LinkCrawler Rule discussions is submerged under "**External links are only visible to Support Staff**"!
  #2  
Old 29.07.2023, 02:14
raztoki raztoki is offline
English Supporter
 
Join Date: Apr 2010
Location: Australia
Posts: 17,614

Use some \s* for whitespace; if they change the HTML with respect to " or ', a rigid pattern will fail easily.
Also, you don't want to capture the href attribute itself, just the component after the = character, inside the " or ' (whichever it uses).
You will need a LinkCrawler rule for the newly captured URL pattern so it can then be processed, so your pattern would be something like:
Code:
https?://domain/phpBB2/index\.php\?attachments.+
psp usually provides a whitelisted URL for a given rule if it is taken out by the forum's URL masking module.
There are also support articles.
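
A minimal sketch of the advice above, for anyone who wants to try the pattern outside JD first. It runs in plain JavaScript (Node or a browser console), not inside JDownloader. Assumptions: the host forum.example.org and the thread-page "pattern" are placeholders, the rule object only uses the field names mentioned in this thread ("rule", "pattern", "deepPattern"), and JD itself uses Java regex, so the (?i) prefix is only added to the string meant for the rule while the JavaScript test uses the "i" flag instead.

```javascript
// Sample markup copied from the first post.
const html = [
  '<a class="file-preview " href="/phpBB2/index.php?attachments/capture-jpg.1988988/" target="_blank">',
  '\t<img src="/phpBB2/data/attachments/1915/1915251-5369dc6cb5e0da0fb7fd46255176b083.jpg" alt="Capture.JPG" width="264" height="200" loading="lazy">',
  '</a>',
].join('\n');

// Regex source without flags: tolerant of extra whitespace and of either
// quote style, and capturing only the href value (group 1), not 'href=' itself.
const deepPatternSource =
  '<a\\s+class\\s*=\\s*["\']file-preview[^"\']*["\']\\s+href\\s*=\\s*["\']([^"\']+)["\']';

// Return every captured href value found in the page source.
function extractHrefs(pageSource) {
  const re = new RegExp(deepPatternSource, 'gi');
  return Array.from(pageSource.matchAll(re), (m) => m[1]);
}

// Hypothetical rule object. The host and the thread-page pattern are
// placeholders; only the field names come from this thread.
const rule = {
  rule: 'DEEPDECRYPT',
  pattern: 'https?://forum\\.example\\.org/phpBB2/index\\.php\\?threads/.+',
  deepPattern: '(?i)' + deepPatternSource,
};

console.log(extractHrefs(html));
// JSON.stringify doubles the backslashes and escapes the quotes for you.
console.log(JSON.stringify([rule], null, 2));
```

Letting JSON.stringify produce the rule text avoids hand-escaping the quotes and backslashes, which is an easy place to introduce silent regex errors.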
__________________
raztoki @ jDownloader reporter/developer
http://svn.jdownloader.org/users/170

Don't fight the system, use it to your advantage. :]
  #3  
Old 29.07.2023, 11:36
Nimboid Nimboid is offline
Vacuum Cleaner
 
Join Date: Jan 2023
Location: UK
Posts: 18

Followup question: What does deepPattern emit from the LinkCrawler Rule, when:
a. There is no capture group. Is it the whole pattern?
b. There is one or more capture groups. Is it just one of the capture groups?
c. If the capture group isolates after the 'href=' as you suggest, is the URL emitted in its relative form, or is it automatically converted to fully-qualified?

If relative URLs are emitted by the first Rule, how can the follow-up Rule discriminate between relative URLs from different hosts?

I'll be making the regex more defensive once I've got something working!

Last edited by Nimboid; 29.07.2023 at 16:31. Reason: Added info
  #4  
Old 31.07.2023, 16:11
pspzockerscene pspzockerscene is offline
Community Manager
 
Join Date: Mar 2009
Location: Deutschland
Posts: 71,088

a+b: No idea - I'd need to look into the code myself. You may as well just try it out.

c. Only full URLs will be returned for further processing.
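
To illustrate what "full URLs" means for a relative href, here is a small sketch using the standard URL class in plain JavaScript. It is not JD's internal code, and the page URL (forum.example.org) is a made-up placeholder since the real host is not given in this thread.

```javascript
// Hypothetical page URL; the real forum host is not given in the thread.
const pageUrl = 'https://forum.example.org/phpBB2/index.php?threads/some-thread.12345/';
// Relative href taken from the markup in the first post.
const relativeHref = '/phpBB2/index.php?attachments/capture-jpg.1988988/';

// Resolve a relative href against the URL of the page it was found on.
function toAbsolute(href, base) {
  return new URL(href, base).href;
}

console.log(toAbsolute(relativeHref, pageUrl));
```

Since the resolved URL carries the host, a follow-up rule's pattern can presumably tell different forums apart by matching on the host name.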

Also as you seem to have already found out:
That forum most likely does not support basic authentication, so you need your login cookies -> Creating a LinkCrawler rule is the way to go.
__________________
JD Supporter, Plugin Dev. & Community Manager

First Steps & Tutorials || JDownloader 2 Setup Download
Spoiler:

A users' JD crashes and the first thing to ask is:
Quote:
Originally Posted by Jiaz View Post
Do you have Nero installed?

Last edited by raztoki; 01.08.2023 at 12:56.
  #5  
Old 02.08.2023, 14:02
Nimboid Nimboid is offline
Vacuum Cleaner
 
Join Date: Jan 2023
Location: UK
Posts: 18

OK, I'm making progress, I can identify non-standard URLs using a DEEPDECRYPT rule, and emit the href links below, which are subsequently handled by a DIRECTHTTP rule.

However, the page source treats different media types differently:

HTML Code:
<a class="u-anchorTarget" id="attachment-2062965"></a>
<a class="file-preview js-lbImage" href="/phpBB2/index.php?attachments/capture-jpg.2062965/" target="_blank">

<a class="u-anchorTarget" id="attachment-2062963"></a>		
<a class="file-preview" href="/phpBB2/data/video/1989/1989225-00fcffd963a4710b33ea59fdd998dba7.mp4" target="_blank">
This results in the video files receiving the 'machine-generated' names instead of the 'human-generated' names which appear on the rendered page.
I have discovered that if I remodel the video URLs using the attachment Id thus:

HTML Code:
"/phpBB2/index.php?attachments/2062963/"
This gets resolved to

HTML Code:
"/phpBB2/index.php?attachments/Rover1.mp4.2062963/"
and then JD downloads it as 'Rover1.mp4'.
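
The remodelling described above can be sketched in plain JavaScript as follows. This only builds the ID-only URLs from the "u-anchorTarget" lines; it assumes the ID-only form also resolves for images, which has only been observed for the video attachment here.

```javascript
// Page source fragment copied from this post.
const pageSource = [
  '<a class="u-anchorTarget" id="attachment-2062965"></a>',
  '<a class="file-preview js-lbImage" href="/phpBB2/index.php?attachments/capture-jpg.2062965/" target="_blank">',
  '',
  '<a class="u-anchorTarget" id="attachment-2062963"></a>',
  '<a class="file-preview" href="/phpBB2/data/video/1989/1989225-00fcffd963a4710b33ea59fdd998dba7.mp4" target="_blank">',
].join('\n');

// Collect every attachment ID and turn it into the ID-only attachment URL.
function attachmentUrls(source) {
  const re = /<a\s+class="u-anchorTarget"\s+id="attachment-(\d+)"\s*><\/a>/gi;
  return Array.from(source.matchAll(re), (m) => `/phpBB2/index.php?attachments/${m[1]}/`);
}

console.log(attachmentUrls(pageSource));
```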

Bizarrely, if I have regex errors, there are conditions where JDownloader finds the 'human-generated' video names, none of the image files, but also dozens of crud files.
I haven't found any clues in the logs, and the URLs it finds are nowhere in the page source.

If my DEEPDECRYPT were to emit the
HTML Code:
"<a class="u-anchorTarget" id="attachment-2062963"></a>"
instead of the URL, would a REWRITE rule's "pattern" be able to recognise this, so that I could capture the '2062963' and form the URL I need?
Or MUST rule "patterns" contain fully-qualified URLs?

If this isn't technically possible, is it possible within Event Scripter to create new links and submit them to the Link Crawler?
I could then use getPage(myString/*PageURL*/); and pick out the "u-anchorTarget" lines to generate and submit my URLs.


I could also have knocked out a JavaScript routine to paste into my browser console, in the time it's taken me to compose this post!
  #6  
Old 02.08.2023, 15:35
pspzockerscene pspzockerscene is offline
Community Manager
 
Join Date: Mar 2009
Location: Deutschland
Posts: 71,088

I'm sorry (not meaning to sound rude) but we could waste tons of time doing this in a theoretical way.
It would be way easier if you supplied all the required information (real life testlinks, logins, rules you've created so far) so we can make some progress here.
You're free to do this via email - we do value your privacy.
Send the information to support@jdownloader.org.

Quote:
Originally Posted by Nimboid View Post
Bizarrely, if I have regex errors, there are conditions where JDownloader finds the 'human-generated' video names, none of the image files, but also dozens of crud files.
I haven't found any clues in the logs, and the URLs it finds are nowhere in the page source.
Impossible for me to evaluate without real life examples.

Quote:
Originally Posted by Nimboid View Post
If my DEEPDECRYPT were to emit the
...
instead of the URL, would a REWRITE rule's "pattern" be able to recognise this, so that I could capture the '2062963' and form the URL I need?
No. Those rules are based on URL patterns. You can't return non-URL content and then evaluate that.

Quote:
Originally Posted by Nimboid View Post
If this isn't technically possible, is it possible within Event Scripter to create new links and submit them to the Link Crawler?
I could then use getPage(myString/*PageURL*/); and pick out the "u-anchorTarget" lines to generate and submit my URLs.
Yes but if you're going this route you might as well write any other external script, a browser-addon or a Greasemonkey script to do this -> Prepare URLs there -> Send them to JD.

Quote:
Originally Posted by Nimboid View Post
I could also have knocked out a JavaScript routine to paste into my browser console, in the time it's taken me to compose this post!
That's what I mean - doing this outside JD might be faster than trying to learn the internals especially if you hit a possible dead end like the need to work with multiple snippets of a given html webpage.
  #7  
Old 04.08.2023, 16:55
Nimboid Nimboid is offline
Vacuum Cleaner
 
Join Date: Jan 2023
Location: UK
Posts: 18

I now have it working to my satisfaction.
As the crucial information is within a short element of the page source
HTML Code:
"<a class="u-anchorTarget" id="attachment-2062963"></a>"
and quickly isolated by a regex, I settled on an Event Script (trigger: New Crawler Job).

If any of the links given to it address a qualifying page, those links are separated and searched, the remainder being passed onwards using myCrawlerJob.setText(...). Processed links are queued using callAPI("linkgrabberv2", "addLinks", {...}), with login credentials obtained from a partner LinkCrawler Rule.
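
For anyone following along, the shape of such a script might look roughly like the sketch below. This is not the actual script, only an outline of the steps just described, written in older-style JavaScript so that it runs both in Node and, presumably, in Event Scripter. Assumptions: the host forum.example.org and the QUALIFYING pattern are placeholders, the "links" and "autostart" parameter names for addLinks and the myCrawlerJob.getText() call are from memory and should be checked against the API, and the login handling via the partner rule is left out.

```javascript
// Hypothetical pattern for thread pages that need the special handling.
var QUALIFYING = /^https?:\/\/forum\.example\.org\/phpBB2\/index\.php\?threads\//i;

// Split the job text (one link per line) into pages to scan and links to pass on.
function splitLinks(text) {
  var lines = text.split(/\r?\n/), pages = [], rest = [];
  for (var i = 0; i < lines.length; i++) {
    var link = lines[i].trim();
    if (!link) continue;
    (QUALIFYING.test(link) ? pages : rest).push(link);
  }
  return { pages: pages, rest: rest };
}

// Build absolute ID-only attachment URLs from the u-anchorTarget lines.
function attachmentLinks(pageSource, origin) {
  var re = /<a\s+class="u-anchorTarget"\s+id="attachment-(\d+)"\s*><\/a>/gi;
  var out = [], m;
  while ((m = re.exec(pageSource)) !== null) {
    out.push(origin + "/phpBB2/index.php?attachments/" + m[1] + "/");
  }
  return out;
}

// Assumed parameter names for the addLinks call; verify before relying on them.
function buildAddLinksQuery(links) {
  return { links: links.join("\n"), autostart: false };
}

// Inside Event Scripter (trigger: New Crawler Job) the wiring would be roughly:
//   var parts = splitLinks(myCrawlerJob.getText());
//   myCrawlerJob.setText(parts.rest.join("\n"));
//   for each page in parts.pages:
//     var src = getPage(page);
//     callAPI("linkgrabberv2", "addLinks",
//             buildAddLinksQuery(attachmentLinks(src, "https://forum.example.org")));

// Stand-alone demonstration with made-up input.
var demo = splitLinks("https://forum.example.org/phpBB2/index.php?threads/demo.1/\nhttps://example.com/file.zip");
console.log(demo);
console.log(buildAddLinksQuery(attachmentLinks('<a class="u-anchorTarget" id="attachment-2062963"></a>', "https://forum.example.org")));
```

Keeping the parsing in plain functions means they can be tested in a browser console or Node before being pasted into Event Scripter.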

I know a bit more than I did...
  #8  
Old 07.08.2023, 11:14
pspzockerscene pspzockerscene is offline
Community Manager
 
Join Date: Mar 2009
Location: Deutschland
Posts: 71,088

Nice one.

You should consider posting that script here as it may be helpful for other users as well.
All times are GMT +2.