JDownloader Community - Appwork GmbH
 

  #1  
Old 28.06.2021, 13:08
wanko wanko is offline
JD VIP
 
Join Date: Aug 2015
Posts: 300
Default crawler rule help

HTML Code:
{
    "maxDecryptDepth": 0,
    "name": "nudecosplaygirls.com rule",
    "pattern": "https?://(?:www\\.)?nudecosplaygirls\\.com/(?!wp)[\\w-]+/",
    "rule": "DEEPDECRYPT",
    "deepPattern": null,
    "packageNamePattern": "<title>(.*?)( - nudecosplaygirls)?</title>"
  }
1. I have this crawler rule, but it crawls all related-post thumbnails along with the original post instead of only the original post ("entry-inner"). How can I limit it to crawl only inside entry-inner?

2. Some WordPress websites have directory indexing enabled. How can I download all content inside each index, keeping the package structure domain/wp-content/uploads/year/month/?

**External links are only visible to Support Staff****External links are only visible to Support Staff**


Thank you

Last edited by wanko; 28.06.2021 at 13:12.
  #2  
Old 28.06.2021, 13:31
raztoki's Avatar
raztoki raztoki is offline
English Supporter
 
Join Date: Apr 2010
Location: Australia
Posts: 17,659
Default

Re 1: Use deepPattern by providing a regex pattern that matches only the content you want. Otherwise JD will return all supported URLs.

Re 2: Create a second rule for this URL structure. I would set a max depth to prevent an infinite loop.
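
A minimal Python sketch of why a depth limit matters when crawling an open directory index: index pages link back to their parent directory, so following every link without a limit can go in circles. The `SITE` map, its paths, and the `crawl` function are hypothetical, and this is a generic traversal; it does not describe how JD itself interprets maxDecryptDepth.

```python
from collections import deque

# Hypothetical directory index: each "page" lists the links found on it.
# Note the parent-directory links that point back up the tree.
SITE = {
    "/uploads/": ["/uploads/2021/"],
    "/uploads/2021/": ["/uploads/", "/uploads/2021/06/"],
    "/uploads/2021/06/": ["/uploads/2021/", "/uploads/2021/06/a.jpg"],
}


def crawl(start, max_depth):
    """Breadth-first crawl that opens directories at most max_depth levels deep."""
    seen = {start}
    queue = deque([(start, 0)])
    files = []
    while queue:
        url, depth = queue.popleft()
        for link in SITE.get(url, []):
            if link in seen:
                continue  # already visited, e.g. a parent-directory link
            seen.add(link)
            if link.endswith("/"):
                # Only descend while we are still above the depth limit.
                if depth < max_depth:
                    queue.append((link, depth + 1))
            else:
                files.append(link)
    return files


print(crawl("/uploads/", 2))
print(crawl("/uploads/", 1))
```

With a limit of 2 the sketch reaches the file in the month folder; with a limit of 1 it stops before opening that folder and finds nothing.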
__________________
raztoki @ jDownloader reporter/developer
http://svn.jdownloader.org/users/170

Don't fight the system, use it to your advantage. :]
  #3  
Old 28.06.2021, 13:54
wanko wanko is offline
JD VIP
 
Join Date: Aug 2015
Posts: 300
Default

I tried this, but it doesn't work:

"deepPattern": "(class="entry-content"><img src="([^"]+)")",

This is my first time dealing with deepPattern.
  #4  
Old 28.06.2021, 14:14
raztoki's Avatar
raztoki raztoki is offline
English Supporter
 
Join Date: Apr 2010
Location: Australia
Posts: 17,659
Default

You will need to escape the " characters within the HTML part of the pattern, otherwise they will break the JSON.
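
One way to avoid hand-escaping mistakes is to write the regex as plain text and let a JSON library add the escapes. A minimal Python sketch; the regex and the rule name below are illustrations only (the HTML they target is an assumption), not a tested rule for this site.

```python
import json

# The regex exactly as you would type it into a regex tester
# (hypothetical example; the real page markup may differ).
regex = r'class="entry-content"><img src="([^"]+)"'

rule = {
    "name": "example rule",
    "rule": "DEEPDECRYPT",
    "deepPattern": regex,
}

# json.dumps escapes every " as \" and every \ as \\ for us.
encoded = json.dumps([rule], indent=2)
print(encoded)

# Round trip: decoding gives back the original regex unchanged.
decoded = json.loads(encoded)[0]["deepPattern"]
print(decoded == regex)
```

The printed text is a valid JSON array with all inner quotes already escaped.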
  #5  
Old 28.06.2021, 14:27
Jiaz's Avatar
Jiaz Jiaz is offline
JD Manager
 
Join Date: Mar 2009
Location: Germany
Posts: 79,343
Default

see
https://support.jdownloader.org/Know...kcrawler-rules
Quote:
LinkCrawler Rules are stored as a json array.
Especially if you have multiple rules it can be a good idea to use a json editor to work on them e.g. jsoneditoronline.org or jsonformatter.org.
JD will only allow you to add rules with a valid json structure!
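
As a local alternative to an online JSON editor, the structure can be checked with a few lines of Python. This is a minimal sketch: `check_rules` is a hypothetical helper that only tests for valid JSON and a top-level array of objects, and JD's own validation may be stricter.

```python
import json


def check_rules(text):
    """Return a list of problems found in a LinkCrawler rules text."""
    try:
        rules = json.loads(text)
    except json.JSONDecodeError as err:
        return [f"invalid JSON at line {err.lineno}, column {err.colno}: {err.msg}"]
    if not isinstance(rules, list):
        return ["top level must be a JSON array: wrap the rule(s) in [ ... ]"]
    return [f"entry {i} is not a JSON object"
            for i, rule in enumerate(rules) if not isinstance(rule, dict)]


# Unescaped quotes inside the string break the JSON structure.
broken = '[{"deepPattern": "src="([^"]+)""}]'
# The same value with the inner quotes escaped parses fine.
fixed = '[{"deepPattern": "src=\\"([^\\"]+)\\""}]'

print(check_rules(broken))
print(check_rules(fixed))
```

An empty list means the text is at least structurally acceptable; it says nothing about whether the regex inside matches anything.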
__________________
JD-Dev & Server-Admin
  #6  
Old 28.06.2021, 18:45
wanko wanko is offline
JD VIP
 
Join Date: Aug 2015
Posts: 300
Default

{
"maxDecryptDepth":0,
"name":"nudecosplaygirls.com rule",
"pattern":"https?://(?:www\\.)?nudecosplaygirls\\.com/[\\w-]+/",
"rule":"DEEPDECRYPT",
"deepPattern" : "(<\\class='entry-content'\\><img src='([^']+)//)",
"packageNamePattern":"<title>(.*?)( - nudecosplaygirls)?<\/title>"
}

But it still finds nothing, which is weird.

Last edited by wanko; 28.06.2021 at 18:51.
  #7  
Old 28.06.2021, 19:23
pspzockerscene's Avatar
pspzockerscene pspzockerscene is offline
Community Manager
 
Join Date: Mar 2009
Location: Deutschland
Posts: 71,140
Default

You've escaped some stuff that you don't have to escape and my attempt is a bit different but it works:
Code:
[
  {
    "maxDecryptDepth": 0,
    "name": "nudecosplaygirls.com rule",
    "pattern": "https?://(?:www\\.)?nudecosplaygirls\\.com/[\\w\\-]+/",
    "rule": "DEEPDECRYPT",
    "deepPattern": "class=\"alignnone size-(?:full|medium) wp-image-[0-9]+\" src=\"(**External links are only visible to Support Staff**]+)\"",
    "packageNamePattern": "<title>(.*?)( - nudecosplaygirls)?</title>"
  }
]
Rule as plaintext for easier copy & paste:
pastebin.com/csCjj1dM
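
To see which escapes are part of the JSON and which end up in the regex, it can help to decode the string value and look at what is left. A minimal Python sketch, using the deepPattern from post #6 and only the non-redacted start of the working one; reading the leftover backslashes before `class` and `>` as the unnecessary escapes is an assumption.

```python
import json

# JSON string literals copied from the rules above (JSON quotes included).
attempt = r'''"(<\\class='entry-content'\\><img src='([^']+)//)"'''
working = r'''"class=\"alignnone size-(?:full|medium) wp-image-[0-9]+\""'''

# json.loads shows the regex text that remains after JSON unescaping.
attempt_regex = json.loads(attempt)
working_regex = json.loads(working)

print(attempt_regex)
print(working_regex)
```

In the working rule the backslashes exist only at the JSON level, so the decoded regex contains plain `"` characters. In the earlier attempt, backslashes survive decoding and become part of the regex itself.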

-psp-
__________________
JD Supporter, Plugin Dev. & Community Manager

First Steps & Tutorials || JDownloader 2 Setup Download
Spoiler:

A user's JD crashes and the first thing to ask is:
Quote:
Originally Posted by Jiaz View Post
Do you have Nero installed?
  #8  
Old 28.06.2021, 19:30
wanko wanko is offline
JD VIP
 
Join Date: Aug 2015
Posts: 300
Default

Hmm, it just crawls and stops right after that, with no results.

28.06.21 23.27.47 <--> 28.06.21 23.30.32 jdlog://0314825302851/

HTML Code:
 {
    "maxDecryptDepth": 0,
    "name": "nudecosplaygirls.com rule1",
    "pattern": "https?://(?:www\\.)?nudecosplaygirls\\.com/[\\w\\-]+/",
    "rule": "DEEPDECRYPT",
    "deepPattern": "class="alignnone size-(?:full|medium) wp-image-[0-9]+" src="(**External links are only visible to Support Staff**]+)"",
    "packageNamePattern": "<title>(.*?)( - nudecosplaygirls)?</title>"
  }
, {
  "enabled" : true,
  "maxDecryptDepth" : 0,
  "name" : "nudecosplaygirls.com replace thumbnail URL to full image URL",
  "pattern" : "(https?://nudecosplaygirls\\.com/wp-content/uploads/\\d{4}/\\d{2}/.*)(-\\d+x\\d+)\\.jpg",
  "rule" : "REWRITE",
  "packageNamePattern" : null,
  "passwordPattern" : null,
  "formPattern" : null,
  "deepPattern" : "(https?://nudecosplaygirls\\.com/wp-content/[^"]+\\.jpg)",
  "rewriteReplaceWith" : "$1.jpg"
}
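
The substitution that the REWRITE rule above is meant to perform can be tried outside JD. A minimal Python sketch that covers only the regex replacement, not JD's REWRITE handling; the host is replaced by the placeholder example.com and the file name is hypothetical.

```python
import re

# Same shape as the REWRITE rule's "pattern" after JSON unescaping,
# with the host swapped for a placeholder.
THUMB = re.compile(
    r"(https?://example\.com/wp-content/uploads/\d{4}/\d{2}/.*)"
    r"(-\d+x\d+)\.jpg"
)


def to_full_size(url):
    """Drop a trailing -WIDTHxHEIGHT suffix; other URLs pass through unchanged."""
    # Python writes the rule's "$1.jpg" replacement as r"\g<1>.jpg".
    return THUMB.sub(r"\g<1>.jpg", url)


print(to_full_size("https://example.com/wp-content/uploads/2021/06/pic-150x150.jpg"))
print(to_full_size("https://example.com/wp-content/uploads/2021/06/pic.jpg"))
```

Group 1 captures everything up to the size suffix, so the replacement keeps the path and file name and drops only the `-150x150` part.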
  #9  
Old 28.06.2021, 19:41
pspzockerscene's Avatar
pspzockerscene pspzockerscene is offline
Community Manager
 
Join Date: Mar 2009
Location: Deutschland
Posts: 71,140
Default

Please first make sure that the first rule is working for you before starting to work on a 2nd rule...

-psp-
  #10  
Old 28.06.2021, 19:44
wanko wanko is offline
JD VIP
 
Join Date: Aug 2015
Posts: 300
Default

I already tested the first rule on its own, but it is not working.
**External links are only visible to Support Staff****External links are only visible to Support Staff**
  #11  
Old 28.06.2021, 19:47
pspzockerscene's Avatar
pspzockerscene pspzockerscene is offline
Community Manager
 
Join Date: Mar 2009
Location: Deutschland
Posts: 71,140
Default

Which URL did you add to test this?
  #12  
Old 28.06.2021, 19:49
wanko wanko is offline
JD VIP
 
Join Date: Aug 2015
Posts: 300
Default

**External links are only visible to Support Staff****External links are only visible to Support Staff**
**External links are only visible to Support Staff****External links are only visible to Support Staff**
  #13  
Old 28.06.2021, 20:12
pspzockerscene's Avatar
pspzockerscene pspzockerscene is offline
Community Manager
 
Join Date: Mar 2009
Location: Deutschland
Posts: 71,140
Default

They seem to have totally different pages/alignments.
Wider attempt which doesn't work for all URLs either:
Spoiler:
Code:
[
  {
    "maxDecryptDepth": 0,
    "name": "nudecosplaygirls.com rule",
    "pattern": "https?://(?:www\\.)?nudecosplaygirls\\.com/[\\w\\-]+/",
    "rule": "DEEPDECRYPT",
    "deepPattern": "(?:class=\"alignnone size-(?:full|medium) wp-image-[0-9]+\" |wp-block-image size-(?:large|medium).*?)src=\"(**External links are only visible to Support Staff**]+)\"",
    "packageNamePattern": "<title>(.*?)( - nudecosplaygirls)?</title>"
  }
]


Widest attempt which just grabs all wordpress .jpg image URLs:
Code:
[
  {
    "maxDecryptDepth": 0,
    "name": "nudecosplaygirls.com rule",
    "pattern": "https?://(?:www\\.)?nudecosplaygirls\\.com/[\\w\\-]+/",
    "rule": "DEEPDECRYPT",
    "deepPattern": "(**External links are only visible to Support Staff**,
    "packageNamePattern": "<title>(.*?)( - nudecosplaygirls)?</title>"
  }
]
As plaintext:
pastebin.com/fELqVEHc

-psp-
  #14  
Old 28.06.2021, 20:34
Jiaz's Avatar
Jiaz Jiaz is offline
JD Manager
 
Join Date: Mar 2009
Location: Germany
Posts: 79,343
Default

The rule can be optimized further to grab only the blog post images.
  #15  
Old 28.06.2021, 20:58
wanko wanko is offline
JD VIP
 
Join Date: Aug 2015
Posts: 300
Default

Quote:
Originally Posted by pspzockerscene View Post
Widest attempt which just grabs all wordpress .jpg image URLs: [...]
This post has only 1 picture, but the rule grabs the related-post thumbnails too:
**External links are only visible to Support Staff****External links are only visible to Support Staff**
https://imgur.com/a/gipcXQY

**External links are only visible to Support Staff****External links are only visible to Support Staff**
This one also has only 1 picture, but

the rule also grabs **External links are only visible to Support Staff****External links are only visible to Support Staff**
**External links are only visible to Support Staff****External links are only visible to Support Staff**

But this one doesn't grab any related thumbnails:
**External links are only visible to Support Staff****External links are only visible to Support Staff**

I have filtered out \d\dx\d\d and rta/scaled, so it won't grab any resized images or rta.
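
A filter of this kind can be sketched in Python as follows. The regex is an assumed reading of the "\d\dx\d\d" and "scaled" filters mentioned above (the "rta" part is left out because its meaning is not clear from the thread), and the file names are hypothetical.

```python
import re

# Matches resized WordPress variants such as name-150x150.jpg and
# name-scaled.jpg (assumed naming scheme).
RESIZED = re.compile(r"-(?:\d+x\d+|scaled)\.\w+$")


def keep_originals(urls):
    """Return only the URLs that do not look like resized variants."""
    return [u for u in urls if not RESIZED.search(u)]


urls = [
    "https://example.com/wp-content/uploads/2021/06/pic.jpg",
    "https://example.com/wp-content/uploads/2021/06/pic-150x150.jpg",
    "https://example.com/wp-content/uploads/2021/06/pic-scaled.jpg",
]
print(keep_originals(urls))
```

A file-name filter like this cannot tell a related-post thumbnail from the main image when both point to the same full-size file.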

Last edited by wanko; 28.06.2021 at 21:13.
  #16  
Old 29.06.2021, 13:29
pspzockerscene's Avatar
pspzockerscene pspzockerscene is offline
Community Manager
 
Join Date: Mar 2009
Location: Deutschland
Posts: 71,140
Default

Yeah as said, you may need to adjust the deepPattern further to fit your needs.
As linked in our LinkCrawler Rules support article, you can e.g. use the tool "regex101.com" to test your filters against the html code of this website.
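
The same kind of test can be run locally. A minimal Python sketch, assuming that Python's `re` module behaves like JD's regex engine for a pattern this simple; the HTML snippet and the URLs are hypothetical, and the pattern is an assumed stand-in because the rules in this thread are partly redacted.

```python
import re

# Hypothetical HTML snippet; paste the real page source here instead.
html = '''
<div class="entry-content">
  <img class="alignnone size-full wp-image-123" src="https://example.com/a.jpg">
</div>
<div class="related-posts">
  <img class="thumb" src="https://example.com/b-150x150.jpg">
</div>
'''

# Assumed deepPattern (after JSON unescaping); the capture group is the URL.
deep_pattern = r'class="alignnone size-(?:full|medium) wp-image-[0-9]+" src="([^"]+)"'

matches = re.findall(deep_pattern, html)
print(matches)
```

If the list contains URLs from the related-posts block, the pattern is still too wide; if it is empty, the pattern does not fit the page's markup.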

-psp-
  #17  
Old 03.07.2021, 19:34
wanko wanko is offline
JD VIP
 
Join Date: Aug 2015
Posts: 300
Default

OK, I think that's because some thumbnails are stored at full resolution, so the rule grabs those thumbnails too.

These two websites use pagination and I don't know how to make it work. When I test the rules, JD just crawls and stops.


**External links are only visible to Support Staff****External links are only visible to Support Staff**

[{
"enabled": true,
"cookies": null,
"updateCookies": true,
"maxDecryptDepth": 1,
"name": "hentai-img",
"pattern": "https?://[a-z]\\.hentai-img\\.com/image\\.+//",
"rule": "DEEPDECRYPT",
"packageNamePattern": null,
"passwordPattern": null,
"formPattern": null,
"deepPattern": "(**External links are only visible to Support Staff****External links are only visible to Support Staff**[0-12]\\.hentai-img\\.com/upload/\\d{8}/\\d{3}/\\d{6}/p=/[0-9]+\\.(jpeg|jpg|gif))",
"rewriteReplaceWith": null
}]

2 - **External links are only visible to Support Staff****External links are only visible to Support Staff**

[{
"enabled": true,
"cookies": null,
"updateCookies": true,
"maxDecryptDepth": 1,
"name": "jpg4",
"pattern": "http?://img\\.jpg4\\.biz\\.+//pic\\d+\\.html",
"rule": "DEEPDECRYPT",
"packageNamePattern": null,
"passwordPattern": null,
"formPattern": null,
"deepPattern": "id="img[0-9]+" src="([^"]+)",
"rewriteReplaceWith": null
}]
  #18  
Old 05.07.2021, 13:14
pspzockerscene's Avatar
pspzockerscene pspzockerscene is offline
Community Manager
 
Join Date: Mar 2009
Location: Deutschland
Posts: 71,140
Default

I'm unable to access this website at this moment due to Cloudflare...

Pagination via LinkCrawler Rules is not that easy and not always possible.
You might need to use browser addons to auto-scroll and copy URLs to make this possible.

-psp-
  #19  
Old 05.07.2021, 14:20
wanko wanko is offline
JD VIP
 
Join Date: Aug 2015
Posts: 300
Default

OK, never mind then...

Now I'm trying to crawl this website:

**External links are only visible to Support Staff****External links are only visible to Support Staff**

{
"enabled" : true,
"updateCookies" : true,
"maxDecryptDepth" : 1,
"name" : "amateurfetishist",
"pattern" : "(https?://amateurfetishist\\.com/wp-content/uploads/\\d{4}/\\d{2}/.*)(-\\d+x\\d+)\\.(jpeg|jpg|gif|mp4)",
"rule" : "DEEPDECRYPT",
"packageNamePattern" : null,
"passwordPattern" : null,
"formPattern" : null,
"rewriteReplaceWith" : null
}

But it is not working.

**External links are only visible to Support Staff****External links are only visible to Support Staff**

I want to crawl all .(jpeg|jpg|gif|mp4) files and keep the same link structure,

e.g.
domain/wp-content/uploads/year/month/files

The site has scaled and \d\dx\d\d resized images.

Many websites have indexing enabled, so I just need one example crawler rule. Thank you.
  #20  
Old 05.07.2021, 15:10
pspzockerscene's Avatar
pspzockerscene pspzockerscene is offline
Community Manager
 
Join Date: Mar 2009
Location: Deutschland
Posts: 71,140
Default

You're doing it wrong!
Your rule and your regular expression are completely wrong!
Please read our LinkCrawler Docs again.
First you need to tell the rule which URLs it is supposed to handle (pattern) and then what to crawl in those URLs (deepPattern).
Also please invest some time to learn how regular expressions work and test your custom regular expressions using online tools like "regex101.com".
In this case the following rule should do the job:
Code:
[
  {
    "enabled": true,
    "logging": false,
    "maxDecryptDepth": 0,
    "name": "amateurfetishist.com example rule",
    "pattern": "^**External links are only visible to Support Staff**,
    "rule": "DEEPDECRYPT",
    "packageNamePattern": null,
    "deepPattern": "(F\\d+-\\d+x\\d+\\.(jpeg|jpg|gif|mp4))"
  }
]
Rule as plaintext for easier copy & paste:
pastebin.com/Qv4gQD22
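
The split between pattern (which URLs the rule handles) and deepPattern (what to crawl inside those URLs) can be sketched in Python. Everything here is hypothetical: the `apply_rule` helper, the example.com rule, and the index page. Whether JD anchors the pattern to the whole URL is not stated in this thread, so `re.fullmatch` is an assumption, and the sketch returns captured links as they appear in the HTML.

```python
import re


def apply_rule(rule, url, html):
    """Return the deepPattern captures if the rule's pattern accepts the URL."""
    # Step 1: "pattern" decides whether the rule handles this URL at all.
    if not re.fullmatch(rule["pattern"], url):
        return []
    # Step 2: "deepPattern" picks what to take from that page's HTML.
    return [m.group(1) for m in re.finditer(rule["deepPattern"], html)]


# Hypothetical rule and directory index page, for illustration only.
rule = {
    "pattern": r"https?://example\.com/wp-content/uploads/\d{4}/\d{2}/",
    "deepPattern": r'href="([^"]+\.(?:jpe?g|gif|mp4))"',
}
index_html = (
    '<a href="../">Parent</a> '
    '<a href="a.jpg">a.jpg</a> '
    '<a href="b.mp4">b.mp4</a>'
)

print(apply_rule(rule, "https://example.com/wp-content/uploads/2021/06/", index_html))
print(apply_rule(rule, "https://example.com/about/", index_html))
```

The first call returns the two media links because the URL fits the pattern; the second returns nothing because the rule never looks at pages whose URL the pattern rejects.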

Please keep in mind that while we are always trying to help our users, we won't provide countless custom linkcrawler rules for you as you're supposed to create them on your own so please invest some time into learning how to do this.
We've already provided countless LinkCrawler rules for you!
Our forum contains a lot of working example rules for all kinds of cases!

-psp-
All times are GMT +2. The time now is 12:30.