JDownloader Community - Appwork GmbH
 

  #1  
Old 12.04.2020, 14:22
Coldblackice Coldblackice is offline
Junior Loader
 
Join Date: Sep 2019
Location: San Francisco
Posts: 14
Default Can't get this URL, Linkcrawler runs forever

I'm trying to get the main (primary) images from a listing like this one:

**External links are only visible to Support Staff**

The primary images I want are the various P0.jpg files. I've tried all sorts of variations of scripts, Packagizer rules and settings, to no avail. Currently, LinkCrawler runs forever; it seems to get lost crawling deeper and deeper.

I've tried limiting maxDecryptDepth, but it doesn't seem to do anything. I've also created filters to avoid the endless stream of crawled links from pbs.twimg.com.

I've also tried increasing the LinkCrawler threads from 12 to 48, increasing memory to 8 GB (I have plenty), and limiting the byte sizes LinkCrawler will load (the P0.jpg files should be found right away, with no nested crawling necessary).

I've also tried a number of different maxDecryptDepth values (0, 1, 2, -1). Still out of luck :/

Any help would be greatly appreciated!
  #2  
Old 12.04.2020, 15:05
tony2long's Avatar
tony2long tony2long is offline
English Supporter
 
Join Date: Jun 2009
Posts: 6,381
Default

Use deepPattern to get just P0.jpg
__________________
FAQ: How to upload a Log
  #3  
Old 13.04.2020, 05:20
raztoki's Avatar
raztoki raztoki is offline
English Supporter
 
Join Date: Apr 2010
Location: Australia
Posts: 17,195
Default

Utilise deepPattern like tony mentioned.

FYI: maxDecryptDepth is there to prevent crawling forever if you, for example, listen to a base URL and then find (sub)pages within it. Say you only want to check 2 subpages deep instead of unlimited.
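
To illustrate the idea of a depth limit, here is a minimal Python sketch of a depth-limited crawl over a made-up link graph. It is not JDownloader's implementation: the page names, the `crawl` function and the exact way depth is counted are assumptions for illustration only, and JDownloader may count levels differently.

```python
from collections import deque

# Hypothetical link graph: each page maps to the links found on it.
LINKS = {
    "listing": ["P0.jpg", "profile", "related-1"],
    "profile": ["related-2", "avatar.jpg"],
    "related-1": ["related-3"],
    "related-2": [],
    "related-3": ["related-4"],
    "related-4": [],
}


def crawl(start, max_depth):
    """Breadth-first crawl that stops descending once max_depth is reached.

    max_depth=None is this toy's stand-in for "no limit".
    """
    found = []
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        page, depth = queue.popleft()
        for link in LINKS.get(page, []):
            if link in seen:
                continue
            seen.add(link)
            found.append(link)
            # Only look inside the found link while still within the limit.
            if max_depth is None or depth < max_depth:
                queue.append((link, depth + 1))
    return found


print(crawl("listing", 0))     # links on the start page only
print(crawl("listing", 1))     # plus links one level below
print(crawl("listing", None))  # everything reachable
```

In this toy model the limit only bounds how far the crawl descends; it does not choose which links are kept. That selection is what deepPattern is for.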
__________________
raztoki @ jDownloader reporter/developer
http://svn.jdownloader.org/users/170

Don't fight the system, use it to your advantage. :]
  #4  
Old 15.04.2020, 00:49
Coldblackice Coldblackice is offline
Junior Loader
 
Join Date: Sep 2019
Location: San Francisco
Posts: 14
Default

Quote:
Originally Posted by tony2long View Post
Use deepPattern to get just P0.jpg
Quote:
Originally Posted by raztoki View Post
Utilise deepPattern like tony mentioned.

FYI: maxDecryptDepth is there to prevent crawling forever if you, for example, listen to a base URL and then find (sub)pages within it. Say you only want to check 2 subpages deep instead of unlimited.
Thanks for the replies, guys. The P0.jpg files are found and added to the LinkGrabber successfully. However, other links are being added as well, not just the P0s, despite a LinkCrawler rule I've set in the advanced settings. (Is it possible to set LinkCrawler patterns/rules within a script itself, so that they apply only to that run of the script and not JDownloader-wide?) Here is the rule:

Code:
[ {
  "enabled" : true,
  "cookies" : null,
  "updateCookies" : true,
  "logging" : false,
  "maxDecryptDepth" : 1,
  "id" : 1586860036960,
  "name" : "linkcrawlz",
  "pattern" : null,
  "rule" : "DEEPDECRYPT",
  "packageNamePattern" : null,
  "passwordPattern" : null,
  "formPattern" : null,
  "deepPattern" : "(**External links are only visible to Support Staff**]P0\\.jpg)",
  "rewriteReplaceWith" : null
} ]
But that's not the real issue, since I don't mind links getting filtered downstream. The biggest issue is that LinkCrawler runs on indefinitely, finding thousands of other links that seem to ignore the deepPattern of the LinkCrawler rule. I can't even find these added links in the original page's source code myself, so it seems that LinkCrawler is spidering beyond the designated URL.

Is it possible to have a script make LinkCrawler search only for P0.jpg files and ignore everything else, doing this at the source rather than downstream with filter rules that remove/ignore links from the LinkGrabber after the fact? I assume this is why LinkCrawler runs endlessly: there should only be a few P0.jpg files, but the crawler is getting lost in a forest of unnecessary links.

When I run my script, I can see in the LinkGrabber tab that thousands of links are being added but filtered (per the "Restore" button at the bottom right). However, I want the crawler not to consider those URLs at all, not to add them to the LinkGrabber and, most importantly, to finish the crawl task after finding the handful of P0.jpg files instead of running indefinitely.

TIA!
  #5  
Old 15.04.2020, 06:51
tony2long's Avatar
tony2long tony2long is offline
English Supporter
 
Join Date: Jun 2009
Posts: 6,381
Default

You are using the rule in combination with a script, so I don't know what's going wrong there.
Without a script, this rule gets only the 4 P0.jpg files:
Code:
[ {
  "logging" : true,
  "rule" : "DEEPDECRYPT",
  "maxDecryptDepth" : 0,
  "pattern" : "https?://www\\.depop\\.com/products/[^/]+?/",
  "deepPattern" : "url\":\"([^\"]+?P0\\.jpg)\""
} ]
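
As a rough check of what this deepPattern captures, the following Python sketch decodes the rule's JSON escaping and runs the resulting regex against a made-up page fragment. The `page` string and the `example.invalid` URLs are hypothetical, and Python's `re` module only stands in for JDownloader's Java regex engine here; this particular pattern uses syntax the two share, but that equivalence is an assumption.

```python
import json
import re

# The rule from the post above, pasted verbatim as JSON text.
RULE_JSON = r'''
[ {
  "logging" : true,
  "rule" : "DEEPDECRYPT",
  "maxDecryptDepth" : 0,
  "pattern" : "https?://www\\.depop\\.com/products/[^/]+?/",
  "deepPattern" : "url\":\"([^\"]+?P0\\.jpg)\""
} ]
'''

rule = json.loads(RULE_JSON)[0]
deep_pattern = rule["deepPattern"]  # JSON escapes are decoded at this point

# Hypothetical page source: two P0.jpg entries, one P1.jpg, one unrelated image.
page = (
    '{"pictures":['
    '{"url":"https://example.invalid/a/P0.jpg"},'
    '{"url":"https://example.invalid/a/P1.jpg"},'
    '{"url":"https://example.invalid/b/P0.jpg"}'
    ']} <img src="https://pbs.twimg.com/x.jpg">'
)

matches = re.findall(deep_pattern, page)
print(deep_pattern)
print(matches)
```

Only the capture group (the part in parentheses) is returned, so the surrounding `url":"` and closing quote act as anchors that keep P1.jpg and unrelated image links out of the result.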
  #6  
Old 15.04.2020, 17:21
pspzockerscene's Avatar
pspzockerscene pspzockerscene is offline
Community Manager
 
Join Date: Mar 2009
Location: Deutschland
Posts: 54,396
Default

@Coldblackice
Your attempt does not contain any "pattern". Without it, the rule cannot "know" which URLs it should be applied to, so it will not get used at all.
If you have additional scripts / LinkCrawler rules in place, please deactivate/delete all of them and try only with my rule at the bottom of this post!
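
A quick way to spot this kind of problem before pasting rules into the advanced settings is to scan the JSON for rules whose "pattern" is null or missing. The sketch below is only an illustration: `rules_without_pattern` is a hypothetical helper, the first rule is a shortened stand-in for the one posted above (its deepPattern is simplified), and treating a missing "enabled" key as enabled is an assumption.

```python
import json


def rules_without_pattern(rules_json):
    """Return the names of enabled rules whose "pattern" is missing or null."""
    rules = json.loads(rules_json)
    return [
        rule.get("name", "<unnamed>")
        for rule in rules
        # Assumption: a rule without an "enabled" key counts as enabled.
        if rule.get("enabled", True) and not rule.get("pattern")
    ]


RULES = r'''
[ {
  "enabled" : true,
  "name" : "linkcrawlz",
  "pattern" : null,
  "rule" : "DEEPDECRYPT",
  "deepPattern" : "(P0\\.jpg)"
}, {
  "enabled" : true,
  "name" : "Crawl all pictures from depop.com",
  "pattern" : "https?://(www\\.)?depop\\.com/products/.+",
  "rule" : "DEEPDECRYPT",
  "deepPattern" : "(https?://[A-Za-z0-9]+\\.cloudfront\\.net/[^\"]+/P0\\.jpg)"
} ]
'''

missing = rules_without_pattern(RULES)
print(missing)
```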

@tony2long
Yours is working fine - I've added a package name to mine so mine might be "nicer" in the end^^

Here is my attempt:
Code:
[ {
  "enabled" : true,
  "updateCookies" : true,
  "logging" : true,
  "maxDecryptDepth" : 1,
  "name" : "Crawl all pictures from depop.com",
  "pattern" : "https?://(www\\.)?depop\\.com/products/.+",
  "rule" : "DEEPDECRYPT",
  "packageNamePattern" : "<title>(.*?)</title>",
  "passwordPattern" : null,
  "formPattern" : null,
  "deepPattern" : "(https?://[A-Za-z0-9]+\\.cloudfront\\.net/[^\"]+/P0\\.jpg)",
  "rewriteReplaceWith" : null
} ]
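
For anyone who wants to test the three regexes of this rule outside JDownloader, here is a small Python sketch that runs them against a made-up product URL and HTML fragment. The URL, the HTML and the `abc123` host are hypothetical. Using `re.fullmatch` for "pattern" is an assumption about how the rule is matched against the URL, and Python's `re` is only a stand-in for the Java regex engine.

```python
import re

# The regexes from the rule above, with the JSON escaping removed.
PATTERN = r"https?://(www\.)?depop\.com/products/.+"
DEEP_PATTERN = r'(https?://[A-Za-z0-9]+\.cloudfront\.net/[^"]+/P0\.jpg)'
PACKAGE_NAME_PATTERN = r"<title>(.*?)</title>"

# Hypothetical inputs.
url = "https://www.depop.com/products/example-item/"
html = (
    "<html><head><title>Example item</title></head><body>"
    '<img src="https://abc123.cloudfront.net/1/2/P0.jpg">'
    '<img src="https://abc123.cloudfront.net/1/2/P1.jpg">'
    '<img src="https://pbs.twimg.com/media/other.jpg">'
    "</body></html>"
)

applies = re.fullmatch(PATTERN, url) is not None  # does the rule apply to this URL?
images = re.findall(DEEP_PATTERN, html)           # links the rule would keep
title = re.search(PACKAGE_NAME_PATTERN, html).group(1)  # package name

print(applies, images, title)
```

With these inputs only the P0.jpg link on the cloudfront host is kept; the P1.jpg and pbs.twimg.com links never match the deepPattern.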
-psp-