JDownloader Community - Appwork GmbH
 

  #1  
Old 10.07.2017, 15:41
Pbl Pbl is offline
Junior Loader
 
Join Date: May 2017
Posts: 11
Link Crawler Rules

Hi, I am trying to create my very first set of link crawler rules. My goal is to crawl through sequentially numbered pages and grab all available download links. The URLs that I am interested in look like this:
**External links are only visible to Supporters**
**External links are only visible to Supporters**
...
**External links are only visible to Supporters**
**External links are only visible to Supporters**
...
**External links are only visible to Supporters**
**External links are only visible to Supporters**

Page 124 is the final one.

I took the time to visit the regex101 website to create and test the following regular expression, which I think fits my needs: **External links are only visible to Supporters**

http:\/\/www\.webcamrecordings\.com\/modelSearch\/emma_lu1\/page\/([1-9]|[1-9][0-9]|10[0-9]|11[0-9]|12[0-4])$

Accordingly, I have looked at older threads on this forum and copied their templates to fashion my own Link Crawler rule:

Code:
[ {
  "enabled" " true",
  "maxDecryptDepth" : 2,
  "id" : 1485801268989,
  "name" : "Emma_Lu1 WebCamRec"
  "pattern" : "http:\/\/www\.webcamrecordings\.com\/modelSearch\/emma_lu1\/page\/([1-9]|[1-9][0-9]|10[0-9]|11[0-9]|12[0-4])$"
  "rule" : "DEEPDECRYPT"
  "packageNamePattern" : null,
  "deepPattern" : null,
  "rewriteReplaceWith" : null
} ]
However, I was unable to find a single tutorial explaining what these keywords mean. The roles of "id", "packageNamePattern" and "deepPattern" are particularly confusing. In any case, the above rule does not crawl the URLs that I want. Instead, I get an error message telling me that the above is not a valid LinkCrawlerRule, so there must be something subtle that I have missed. I would appreciate any help on this matter. Hopefully, I am trying to use Link Crawler in the way it was intended.
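
A quick way to see why the rule above is rejected is to run it through any strict JSON parser. The sketch below uses Python's standard json module. The "fixed" variant is only one possible correction (a colon after "enabled", commas between entries, doubled backslashes, forward slashes left unescaped); it is an assumption for illustration, not necessarily the rule that ends up working in JDownloader.

```python
import json

# The rule exactly as posted: no colon after "enabled", no commas after the
# "name", "pattern" and "rule" entries, and single backslashes such as \.
# which are not legal JSON escape sequences.
broken = r'''
[ {
  "enabled" " true",
  "maxDecryptDepth" : 2,
  "id" : 1485801268989,
  "name" : "Emma_Lu1 WebCamRec"
  "pattern" : "http:\/\/www\.webcamrecordings\.com\/modelSearch\/emma_lu1\/page\/([1-9]|[1-9][0-9]|10[0-9]|11[0-9]|12[0-4])$"
  "rule" : "DEEPDECRYPT"
  "packageNamePattern" : null,
  "deepPattern" : null,
  "rewriteReplaceWith" : null
} ]
'''

# One possible repair (an assumption, not the poster's final rule).
fixed = r'''
[ {
  "enabled" : true,
  "maxDecryptDepth" : 2,
  "name" : "Emma_Lu1 WebCamRec",
  "pattern" : "http://www\\.webcamrecordings\\.com/modelSearch/emma_lu1/page/([1-9]|[1-9][0-9]|10[0-9]|11[0-9]|12[0-4])$",
  "rule" : "DEEPDECRYPT",
  "packageNamePattern" : null,
  "deepPattern" : null
} ]
'''

def check(text):
    """Return (True, parsed) if text is valid JSON, else (False, error message)."""
    try:
        return True, json.loads(text)
    except json.JSONDecodeError as err:
        return False, f"line {err.lineno}, column {err.colno}: {err.msg}"

broken_ok, broken_info = check(broken)
fixed_ok, fixed_rules = check(fixed)
print("as posted:", broken_ok, broken_info)
print("repaired: ", fixed_ok)
```

The parser stops at the first problem it meets, so the posted rule has to be fixed and re-checked until it loads cleanly.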

Last edited by Pbl; 10.07.2017 at 16:09.
  #2  
Old 10.07.2017, 16:41
raztoki's Avatar
raztoki raztoki is offline
English Supporter
 
Join Date: Apr 2010
Location: Australia
Posts: 14,827
Default

First of all, if you want all of the pages, there is no need to supply a complicated pattern like that. You would only do so if you wanted ONLY a specific range of pages, say between pages 1 and 1000.

Since you want them all, you just need to use \d+

The id should be randomly generated; you can leave it out and JDownloader should insert it.
You could leave out rewriteReplaceWith as well.
I believe the deep-decrypt depth (maxDecryptDepth) only needs to be 1 (just the given page number, right?).
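
To make the difference concrete, here is a small sketch comparing the hand-built range expression with the simpler \d+ suggestion. It uses Python's re module purely for illustration (JDownloader itself uses Java regular expressions), and the range of page numbers tested is arbitrary.

```python
import re

BASE = r"http://www\.webcamrecordings\.com/modelSearch/emma_lu1/page/"

# The original poster's range expression: matches page numbers 1 to 124 only.
range_rule = re.compile(BASE + r"([1-9]|[1-9][0-9]|10[0-9]|11[0-9]|12[0-4])$")

# The simpler suggestion: any run of digits.
any_page_rule = re.compile(BASE + r"\d+$")

def url(page):
    """Build the listing URL for a given page number."""
    return f"http://www.webcamrecordings.com/modelSearch/emma_lu1/page/{page}"

# Try page numbers 0 to 130 against both expressions.
range_hits = [p for p in range(0, 131) if range_rule.match(url(p))]
any_hits = [p for p in range(0, 131) if any_page_rule.match(url(p))]

print(range_hits[0], range_hits[-1], len(range_hits))
print(any_hits[0], any_hits[-1], len(any_hits))
```

The range expression accepts exactly pages 1 to 124, while \d+ accepts every numbered page, so it keeps working if more pages are added later.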
__________________
raztoki @ jDownloader reporter/developer
http://svn.jdownloader.org/users/170

Don't fight the system, use it to your advantage. :]
  #3  
Old 10.07.2017, 16:42
tony2long's Avatar
tony2long tony2long is offline
English Supporter
 
Join Date: Jun 2009
Posts: 5,294
Default

The id will be created automatically; packageNamePattern is for getting the package name from the page, and deepPattern is for getting what you want from the page (I think).

One of the videos that I checked is hosted at publish2.me, which says: This file is available only for premium members.

A rule example that doesn't give an error (it catches many pictures):

Code:
[ {
  "enabled" : true,
  "maxDecryptDepth" : 0,
  "name" : "webcamrecordings.com",
  "pattern" : "http://www\\.webcamrecordings\\.com/modelSearch/[^/]+/page/\\d+",
  "rule" : "DEEPDECRYPT",
  "packageNamePattern" : null,
  "deepPattern" : "class=\"mp4\"><a href=\"([^\"]+)\""
} ]
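
As a rough illustration of what the deepPattern above captures, the sketch below decodes the JSON string and applies it to a made-up HTML fragment. The markup and the example.com URLs are hypothetical (the real page source is not shown in this thread), and Python's re module stands in for JDownloader's Java regex engine.

```python
import json
import re

# The "deepPattern" value exactly as written in the rule, including JSON escaping.
json_value = r'"class=\"mp4\"><a href=\"([^\"]+)\""'

# After JSON decoding, the \" sequences become plain quotes.
deep_pattern = json.loads(json_value)

# Hypothetical page fragment; the real markup may differ.
html = (
    '<div class="mp4"><a href="http://example.com/video/1">clip 1</a></div>\n'
    '<div class="jpg"><a href="http://example.com/thumb/1.jpg">thumb</a></div>\n'
    '<div class="mp4"><a href="http://example.com/video/2">clip 2</a></div>\n'
)

# With a single capture group, findall returns just the captured href values.
links = re.findall(deep_pattern, html)
print(deep_pattern)
print(links)
```

Only the links inside the class="mp4" elements are captured; the thumbnail link is ignored, which is the point of narrowing the crawl with a deepPattern.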
__________________
FAQ: How to upload a Log

Last edited by tony2long; 14.07.2017 at 06:04. Reason: Better rule
  #4  
Old 10.07.2017, 17:09
Pbl Pbl is offline
Junior Loader
 
Join Date: May 2017
Posts: 11
Default

Quote:
Originally Posted by tony2long View Post
The id will be created automatically; packageNamePattern is for getting the package name from the page, and deepPattern is for getting what you want from the page (I think).

One of the videos that I checked is hosted at publish2.me, which says: This file is available only for premium members.

A rule example that doesn't give an error (it catches many pictures):

[ {
"enabled": true,
"maxDecryptDepth": 2,
"name": "Emma_Lu1 WebCamRec",
"pattern": "http://www\\.webcamrecordings\\.com/modelSearch/emma_lu1/page/\\d+",
"rule": "DEEPDECRYPT",
"packageNamePattern": null,
"deepPattern": null
} ]
I put this into the Link Crawler Rules in Advanced Settings and no error message was thrown. Yay!

Is there something else I need to do to actually initiate the crawling process?

Last edited by Pbl; 10.07.2017 at 17:12.
  #5  
Old 10.07.2017, 17:16
tony2long's Avatar
tony2long tony2long is offline
English Supporter
 
Join Date: Jun 2009
Posts: 5,294
Default

As usual, enable clipboard monitor and copy the link.
  #6  
Old 10.07.2017, 17:45
Pbl Pbl is offline
Junior Loader
 
Join Date: May 2017
Posts: 11
Default

Am I going to have to copy all 124 pages into the clipboard manually?
  #7  
Old 11.07.2017, 04:47
tony2long's Avatar
tony2long tony2long is offline
English Supporter
 
Join Date: Jun 2009
Posts: 5,294
Default

That is the basic idea, but it might depend on your creativity.
  #8  
Old 13.07.2017, 11:17
Jiaz's Avatar
Jiaz Jiaz is offline
JD Manager
 
Join Date: Mar 2009
Location: Germany
Posts: 48,023
Default

@Pbl: You can modify the deepPattern to automatically find the next page.
Then you only have to add page 1, and JDownloader will be able to find the rest.
In case you need help with that, let us know!
__________________
JD-Dev & Server-Admin
  #9  
Old 13.07.2017, 11:27
tony2long's Avatar
tony2long tony2long is offline
English Supporter
 
Join Date: Jun 2009
Posts: 5,294
Default

Can we have two deepPatterns in one rule? How?

Last edited by tony2long; 14.07.2017 at 06:18.
  #10  
Old 13.07.2017, 11:28
Jiaz's Avatar
Jiaz Jiaz is offline
JD Manager
 
Join Date: Mar 2009
Location: Germany
Posts: 48,023
Default

Use or
(a|b)
  #11  
Old 13.07.2017, 15:34
tony2long's Avatar
tony2long tony2long is offline
English Supporter
 
Join Date: Jun 2009
Posts: 5,294
Default

How do I add the protocol and domain for the "next" link?
  #12  
Old 13.07.2017, 16:34
Jiaz's Avatar
Jiaz Jiaz is offline
JD Manager
 
Join Date: Mar 2009
Location: Germany
Posts: 48,023
Default

Code:
[{
"enabled": true,
"maxDecryptDepth": 10,
"name": "webcamrecordings.com",
"pattern": "http://www\\.webcamrecordings\\.com/modelSearch/[^/]+/page/\\d+",
"rule": "DEEPDECRYPT",
"deepPattern": "(class=\"mp4\"><a href=\"([^\"]+)\"|<a href=\"(/modelSearch/.*?/page/\\d+)\")"
}]
This will find the videos and next pages
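
A sketch of how the combined pattern behaves may help here. The alternation has two inner capture groups: one for the video link and one for the relative pagination link. The HTML below is invented for illustration, and resolving the relative link with urljoin is only an assumption about what has to happen before the link can match the rule's "pattern" again; the thread does not say how JDownloader does this internally.

```python
import re
from urllib.parse import urljoin

# deepPattern from the rule above, after JSON decoding (\\d -> \d, \" -> ").
deep_pattern = re.compile(
    r'(class="mp4"><a href="([^"]+)"|<a href="(/modelSearch/.*?/page/\d+)")'
)

# URL of the page being crawled, plus hypothetical HTML for that page.
page_url = "http://www.webcamrecordings.com/modelSearch/emma_lu1/page/1"
html = (
    '<div class="mp4"><a href="http://example.com/video/1">clip</a></div>\n'
    '<a href="/modelSearch/emma_lu1/page/2">next</a>\n'
)

videos, next_pages = [], []
for match in deep_pattern.finditer(html):
    # Exactly one of the two inner groups is set, depending on the branch taken.
    video, next_page = match.group(2), match.group(3)
    if video is not None:
        videos.append(video)
    if next_page is not None:
        # Relative link: take protocol and domain from the current page URL.
        next_pages.append(urljoin(page_url, next_page))

print(videos)
print(next_pages)
```

Each match fills either the video group or the next-page group, which is how a single deepPattern can serve both purposes.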

Last edited by Jiaz; 13.07.2017 at 16:36.
  #13  
Old 13.07.2017, 16:36
Jiaz's Avatar
Jiaz Jiaz is offline
JD Manager
 
Join Date: Mar 2009
Location: Germany
Posts: 48,023
Default

@Tony: Use CODE tags instead of QUOTE! CODE keeps the escaping!
  #14  
Old 15.07.2017, 04:27
Pbl Pbl is offline
Junior Loader
 
Join Date: May 2017
Posts: 11
Default

Thank you everyone for your help! I really appreciate it.

I actually solved this by setting the maxDecryptDepth to 125 and using linkfilter extensively. That probably took longer and used more CPU resources, but I'll see if I can create a more efficient linkcrawler rule next time.
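
Why the depth setting matters can be illustrated with a toy model. The sketch below is not JDownloader's crawler: it assumes, purely for illustration, that every page links only to the following page and that maxDecryptDepth limits how many links deep the crawl may go from the starting page.

```python
def reachable_pages(start, last_page, max_depth):
    """Pages visited when each page links only to the next one (toy model)."""
    visited = []
    page, depth = start, 0
    while page <= last_page and depth <= max_depth:
        visited.append(page)
        page += 1   # follow the single "next" link
        depth += 1  # one level deeper per followed link
    return visited

shallow = reachable_pages(start=1, last_page=124, max_depth=10)
deep = reachable_pages(start=1, last_page=124, max_depth=125)

print(len(shallow), shallow[-1])
print(len(deep), deep[-1])
```

Under these assumptions a depth of 10 stops after page 11, while 125 covers all 124 pages. If each page links to several other pages at once, a smaller depth may already be enough.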
  #15  
Old 17.07.2017, 15:34
Jiaz's Avatar
Jiaz Jiaz is offline
JD Manager
 
Join Date: Mar 2009
Location: Germany
Posts: 48,023
Default

You're welcome
All times are GMT +2. The time now is 18:42.