#1
Can't get this URL, Linkcrawler runs forever
I'm trying to get the main/primary images from a listing like this:
**External links are only visible to Support Staff****External links are only visible to Support Staff**
The primary images I want are the various P0.jpg files. I've tried all sorts of variations of scripts, Packagizer rules, and settings, to no avail. Currently, LinkCrawler runs forever; it seems to get lost crawling deeper and deeper.

I've tried limiting maxDecryptDepth, but it doesn't seem to do anything, whichever value I use (0, 1, 2, -1). I've also created filters to avoid the endless stream of crawled links from pbs.twimg.com. Beyond that, I've tried increasing the LinkCrawler threads from 12 to 48, increasing memory to 8GB (I have plenty), and limiting the byte sizes for LinkCrawler (since the P0.jpg files should be found right away, with no nested crawling necessary).

Still out of luck :/ Any help would be greatly appreciated!
#2
Use deepPattern to get just P0.jpg
__________________
FAQ: How to upload a Log
#3
Utilise deepPattern like tony mentioned.
FYI: maxDecryptDepth exists to prevent crawling forever when, for example, a rule listens on a base URL and then finds (sub)pages within it. Say you only want to check 2 subpages deep instead of unlimited depth.
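
The idea of a depth limit can be sketched with a toy crawler. This is only an illustration of the concept, not JDownloader's implementation: the `crawl` function, the `site` link graph, and the use of `None` for "unlimited" are all assumptions made up for this example.

```python
from collections import deque

def crawl(start, links, max_depth=None):
    """Breadth-first walk over a toy link graph (hypothetical model).

    max_depth=None means unlimited; otherwise pages found at max_depth
    are still visited but their own links are not followed.
    """
    seen = {start}
    queue = deque([(start, 0)])
    visited = []
    while queue:
        page, depth = queue.popleft()
        visited.append(page)
        # Stop descending once the depth limit is reached.
        if max_depth is not None and depth >= max_depth:
            continue
        for nxt in links.get(page, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return visited

# Made-up site: a base page with subpages nested several levels deep.
site = {"base": ["a", "b"], "a": ["a1"], "a1": ["a2"], "a2": ["a3"]}

limited = crawl("base", site, max_depth=2)
unlimited = crawl("base", site)
print(limited)
print(unlimited)
```

With the limit set to 2, the walk stops following links two levels below the base page, while the unlimited walk keeps going until no new pages are left.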
__________________
raztoki @ jDownloader reporter/developer http://svn.jdownloader.org/users/170 Don't fight the system, use it to your advantage. :]
#4
Quote:
Code:
[ { "enabled" : true, "cookies" : null, "updateCookies" : true, "logging" : false, "maxDecryptDepth" : 1, "id" : 1586860036960, "name" : "linkcrawlz", "pattern" : null, "rule" : "DEEPDECRYPT", "packageNamePattern" : null, "passwordPattern" : null, "formPattern" : null, "deepPattern" : "(**External links are only visible to Support Staff**]P0\\.jpg)", "rewriteReplaceWith" : null } ]

Is it possible to have a script make LinkCrawler search only for P0.jpg files and ignore everything else, doing this at the source rather than downstream with filter rules that remove/ignore links in LinkGrabber after the fact?

I assume this is why LinkCrawler runs endlessly: there should only be a few P0.jpg files, but the crawler gets lost in a forest of unnecessary links. When I run my script, I can see in the LinkGrabber tab that thousands of links are being added but filtered (per the "Restore" button at the bottom right). However, I want the crawler not to consider those URLs in the first place and not to add them to LinkGrabber. Most importantly, I want it to finish the crawl task after finding the handful of P0.jpg files instead of running indefinitely.

TIA!
#5
You are using the rule in combination with a script; I don't know what's wrong there.
Without a script, this rule gets only the 4 P0.jpg files:
Code:
[ { "logging" : true, "rule" : "DEEPDECRYPT", "maxDecryptDepth" : 0, "pattern" : "https?://www\\.depop\\.com/products/[^/]+?/", "deepPattern" : "url\":\"([^\"]+?P0\\.jpg)\"" } ]
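
To see what this deepPattern captures once the JSON escaping is removed, here is a minimal Python sketch. The `page_source` text and its host name are invented for illustration; the real page source may look different, and Python's `re` module is only standing in for whatever regex engine JDownloader uses.

```python
import json
import re

# The rule from the post, kept as raw text so the JSON escapes stay intact.
rule_json = r'''[ { "logging" : true, "rule" : "DEEPDECRYPT", "maxDecryptDepth" : 0, "pattern" : "https?://www\\.depop\\.com/products/[^/]+?/", "deepPattern" : "url\":\"([^\"]+?P0\\.jpg)\"" } ]'''

rule = json.loads(rule_json)[0]
deep_pattern = rule["deepPattern"]  # the regex after JSON unescaping

# Hypothetical snippet of page source (assumption, not the real site).
page_source = (
    '{"url":"https://example.cloudfront.net/b0/123/P0.jpg"},'
    '{"url":"https://example.cloudfront.net/b0/123/P1.jpg"},'
    '{"url":"https://pbs.twimg.com/media/unrelated.jpg"}'
)

matches = re.findall(deep_pattern, page_source)
print(deep_pattern)
print(matches)
```

Because `[^"]+?` cannot cross a closing quote, only values that end in `P0.jpg` are captured; the `P1.jpg` and twimg entries are never returned as links.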
#6
@Coldblackice
Your attempt does not contain any "pattern". Without one, the rule cannot "know" which URLs it should be applied to, so it will not get used at all. If you have additional scripts / LinkCrawler rules in place, please deactivate/delete all of them and try only the rule at the bottom of this post!

@tony2long
Yours is working fine. I've added a package name to mine, so mine might be "nicer" in the end^^

Here is my attempt:
Code:
[ { "enabled" : true, "updateCookies" : true, "logging" : true, "maxDecryptDepth" : 1, "name" : "Crawl all pictures from depop.com", "pattern" : "https?://(www\\.)?depop\\.com/products/.+", "rule" : "DEEPDECRYPT", "packageNamePattern" : "<title>(.*?)</title>", "passwordPattern" : null, "formPattern" : null, "deepPattern" : "(https?://[A-Za-z0-9]+\\.cloudfront\\.net/[^\"]+/P0\\.jpg)", "rewriteReplaceWith" : null } ]
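
The role of each field can be sketched in Python as follows. This is a simplified model and not how JDownloader is implemented: `rule_applies` is a hypothetical helper, matching "pattern" against the whole URL is an assumption, and the sample URL, HTML, and the deepPattern of the pattern-less rule are made up.

```python
import re

def rule_applies(rule, url):
    """Hypothetical model: a rule without "pattern" never matches any URL."""
    pattern = rule.get("pattern")
    if not pattern:
        return False
    # Assumption: the pattern has to match the whole URL.
    return re.fullmatch(pattern, url) is not None

# Rule with "pattern": null, like the earlier attempt (deepPattern invented).
no_pattern_rule = {"rule": "DEEPDECRYPT", "pattern": None, "deepPattern": r"(P0\.jpg)"}

# The patterns from the rule above, with JSON escaping removed.
working_rule = {
    "rule": "DEEPDECRYPT",
    "pattern": r"https?://(www\.)?depop\.com/products/.+",
    "packageNamePattern": r"<title>(.*?)</title>",
    "deepPattern": r'(https?://[A-Za-z0-9]+\.cloudfront\.net/[^"]+/P0\.jpg)',
}

# Made-up URL and page source for illustration only.
url = "https://www.depop.com/products/example-item/"
html = (
    "<title>Example item</title>"
    '<img src="https://abc123.cloudfront.net/b0/42/P0.jpg">'
    '<img src="https://abc123.cloudfront.net/b0/42/P1.jpg">'
)

applies_without = rule_applies(no_pattern_rule, url)
applies_with = rule_applies(working_rule, url)
package = re.search(working_rule["packageNamePattern"], html).group(1)
images = re.findall(working_rule["deepPattern"], html)
print(applies_without, applies_with, package, images)
```

In this model the pattern-less rule is skipped for every URL, while the working rule is selected for the product page, takes the package name from the `<title>` tag, and returns only the P0.jpg link.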
__________________
JD Supporter, Plugin Dev. & Community Manager
First Steps & Tutorials || JDownloader 2 Setup Download