#1
Link Crawler Rules
Hi, I am trying to create my very first set of link crawler rules. My goal is to crawl through sequentially numbered pages and grab all available download links. The URLs that I am interested in look like this:
**External links are only visible to Support Staff**

Page 124 is the final one. I took the time to visit the regex101 website in order to create and test the following regex, which I think fits my needs:

Code:
http:\/\/www\.webcamrecordings\.com\/modelSearch\/emma_lu1\/page\/([1-9]|[1-9][0-9]|10[0-9]|11[0-9]|12[0-4])$

Accordingly, I have looked at older threads on this forum and copied their templates to fashion my own link crawler rule: Code:
[ { "enabled" : true, "maxDecryptDepth" : 2, "id" : 1485801268989, "name" : "Emma_Lu1 WebCamRec", "pattern" : "http:\/\/www\.webcamrecordings\.com\/modelSearch\/emma_lu1\/page\/([1-9]|[1-9][0-9]|10[0-9]|11[0-9]|12[0-4])$", "rule" : "DEEPDECRYPT", "packageNamePattern" : null, "deepPattern" : null, "rewriteReplaceWith" : null } ] Last edited by Pbl; 10.07.2017 at 16:09.
#2
First of all, if you want the total set of pages, there is zero need to supply a complicated pattern like that. You would only do so if you wanted ONLY certain pages, say between pages 1 and 1000.
Since you want all of them, you just need to use \d+. The id should be randomly generated; you can leave it out and JDownloader should insert it. The rewrite (rewriteReplaceWith) you could also leave out. And maxDecryptDepth I believe should only be 1 (just that given page number, right?)
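To see the difference concretely, here is a quick sanity check of both patterns — a sketch using Python's `re` module rather than JDownloader's Java regex engine, though the behaviour is the same for these patterns (the URLs are the ones from the first post):

```python
import re

# Original alternation: matches only pages 1-124 (anchored at the end).
ranged = re.compile(r"/page/([1-9]|[1-9][0-9]|10[0-9]|11[0-9]|12[0-4])$")
# Simplified form suggested above: matches any page number.
simple = re.compile(r"/page/(\d+)$")

base = "http://www.webcamrecordings.com/modelSearch/emma_lu1/page/"

# Both patterns agree on the pages that actually exist (1..124)...
for n in range(1, 125):
    assert ranged.search(base + str(n)) and simple.search(base + str(n))

# ...but only the ranged pattern rejects page 125 and beyond.
assert ranged.search(base + "125") is None
assert simple.search(base + "125") is not None
```

Since the site only has 124 pages, nothing beyond page 124 exists to be crawled anyway, which is why the simple `\d+` form is enough here.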
__________________
raztoki @ jDownloader reporter/developer http://svn.jdownloader.org/users/170 Don't fight the system, use it to your advantage. :]
#3
Id will be created automatically; packageNamePattern is used to get the packageName from the page, and deepPattern to get what you want out of the page (I think).
One of the videos that I checked is hosted at publish2.me, which says: "This file is available only for premium members." Here is a rule example that doesn't give an error (it catches many pictures): Code:
[ { "enabled" : true, "maxDecryptDepth" : 0, "name" : "webcamrecordings.com", "pattern" : "**External links are only visible to Support Staff**, "rule" : "DEEPDECRYPT", "packageNamePattern" : null, "deepPattern" : "class=\"mp4\"><a href=\"([^\"]+)\"" } ]
__________________
FAQ: How to upload a Log Last edited by tony2long; 14.07.2017 at 06:04. Reason: Better rule
#4
Is there something else I need to do to actually initiate the crawling process? Last edited by Pbl; 10.07.2017 at 17:12.
#5
As usual, enable clipboard monitor and copy the link.
__________________
FAQ: How to upload a Log
#6
Am I going to have to copy all 124 pages into the clipboard manually?
#7
That is the basic idea, but it might depend on your creativity.
__________________
FAQ: How to upload a Log
#8
@Pbl: you can modify the deepPattern to auto-find the next page.
Then you only have to add page 1 and JDownloader will be able to find the rest. In case you need help with that, let us know!
__________________
JD-Dev & Server-Admin
#9
Can we have two deepPatterns in one rule? How?
__________________
FAQ: How to upload a Log Last edited by tony2long; 14.07.2017 at 06:18.
#10
Use an OR:
(a|b)
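A note on how alternated capture groups behave — a minimal sketch using Python's `re` on a made-up HTML snippet (JDownloader's deepPattern uses Java regex, but the grouping behaviour is the same): with an `(a|b)` style pattern, each match fills only one of the inner groups, and the other comes back empty.

```python
import re

# Hypothetical snippet of a listing page: one download link, one "next" link.
html = (
    '<td class="mp4"><a href="http://example.com/video1.mp4">download</a>'
    '<a href="/modelSearch/emma_lu1/page/2">next</a>'
)

# Two sub-patterns combined with | — group 2 captures a download link,
# group 3 captures a next-page link; only one is filled per match.
pattern = re.compile(
    r'(class="mp4"><a href="([^"]+)"|<a href="(/modelSearch/.*?/page/\d+)")'
)

for match in pattern.finditer(html):
    download, next_page = match.group(2), match.group(3)
    if download:
        print("download:", download)
    else:
        print("next page:", next_page)
```

So a single rule can collect both kinds of links in one pass, and the consumer decides what to do with each branch.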
__________________
JD-Dev & Server-Admin
#11
How to add protocol and domain for "next"?
__________________
FAQ: How to upload a Log
#12
Code:
[{ "enabled": true, "maxDecryptDepth": 10, "name": "webcamrecordings.com", "pattern": "**External links are only visible to Support Staff**, "rule": "DEEPDECRYPT", "deepPattern": "(class=\"mp4\"><a href=\"([^\"]+)\"|<a href=\"(/modelSearch/.*?/page/\\d+)\")" }]
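Regarding the earlier question about the protocol and domain for "next": the rule above captures a relative path, which suggests JDownloader resolves it against the URL of the page being crawled. For illustration only, the same resolution can be sketched in Python with `urllib.parse.urljoin` (the page URL here is just an example from this thread):

```python
from urllib.parse import urljoin

# Page currently being crawled, and a relative "next" link as captured
# by the deepPattern above.
page_url = "http://www.webcamrecordings.com/modelSearch/emma_lu1/page/1"
next_link = "/modelSearch/emma_lu1/page/2"

# urljoin supplies the protocol and domain from the current page.
full = urljoin(page_url, next_link)
print(full)  # http://www.webcamrecordings.com/modelSearch/emma_lu1/page/2
```

Because the captured path starts with `/`, it replaces the whole path of the page URL while keeping scheme and host, which is exactly what the crawler needs for the next request.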
__________________
JD-Dev & Server-Admin Last edited by Jiaz; 13.07.2017 at 16:36.
#13
@Tony: use CODE tags instead of QUOTE! CODE keeps the escaping intact!
__________________
JD-Dev & Server-Admin
#14
Thank you everyone for your help! I really appreciate it.
I actually solved this by setting the maxDecryptDepth to 125 and using the linkfilter extensively. That probably took longer and used more CPU resources, but I'll see if I can create a more efficient link crawler rule next time.
#15
You're welcome
__________________
JD-Dev & Server-Admin