#1
How to add HTML tags to a LinkCrawler rule to set the thread title as the package name
Hello Staff,
I would like to understand how to use HTML tags in a LinkCrawler rule so that the thread title is set as the package name for all host URLs inside that thread. It is something similar to what you did to set the package names for metalarea.org, only I would like to apply it to any site. I would also like to understand what to look for in the HTML page source code, and how to integrate that HTML tag into the capture rule for the links. I show an example for 2 sites: **External links are only visible to Support Staff** I wrote a crawler rule this way: Code:
[
  {
    "deepPattern" : "(//archive/f-92/\\d+)\"",
    "formPattern" : null,
    "maxDecryptDepth" : 1,
    "name" : "XXXXX",
    "packageNamePattern" : null,
    "passwordPattern" : null,
    "pattern" : "https?://XXXXXX/\\[^/]forums-XXXXX/[^/]archive/[^/]f-92\\.html",
    "rule" : "DEEPDECRYPT",
    "enabled" : true,
    "logging" : true,
    "updateCookies" : true
  }
]
How should I do it? Some example links to test: **External links are only visible to Support Staff** **External links are only visible to Support Staff** **External links are only visible to Support Staff**

### second site

**External links are only visible to Support Staff** For this site I set cookies and added my credentials in JDownloader. Looking at the page source code, I see that the thread title is always inside this kind of <span> tag. As crawler rule I set it this way: Code:
[
  {
    "cookies" : [ [ "ips4_login_key", "33e867acd253be1d4**************" ] ],
    "deepPattern" : "(/forum/87-tv-cinema-music-links/\\d+)\"",
    "formPattern" : null,
    "maxDecryptDepth" : 1,
    "name" : "squid-board",
    "packageNamePattern" : null,
    "passwordPattern" : null,
    "pattern" : "https?://XXXX\\.ru/[^/]+/forum/[^/]+/87-tv-cinema-music-links/",
    "rule" : "DEEPDECRYPT",
    "enabled" : true,
    "logging" : true,
    "updateCookies" : true
  }
]
**External links are only visible to Support Staff** **External links are only visible to Support Staff** But my crawler rule seems to be incorrect, and it also doesn't take the package name from the correct HTML tag, because I don't know how to do it. Can you help me understand, please? Debug LOG Code:
08.09.22 17.10.28 <--> 08.09.22 18.49.39 jdlog://3779211370661/ Last edited by Jiaz; 08.09.2022 at 20:11.
#2
@nathan1
You have to specify a packageNamePattern. The pattern must match the name you want to be set for the package; you can use regex101.com for trial & error. Please try to do this by yourself, and of course we can help in case you're stuck.
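As an illustration of what a packageNamePattern does, here is a minimal sketch using Python's re module (JD itself evaluates Java regex; both the HTML snippet and the pattern below are made-up assumptions, not taken from any real site):

```python
import re

# Hypothetical HTML snippet from a thread page
html = '<div id="navbar">Some Thread Title</div>'

# Candidate packageNamePattern in its plain (single-escaped) regex form;
# capture group 1 is the text that would become the package name
pattern = r'<div\s*id="navbar">\s*(.*?)\s*</div>'

match = re.search(pattern, html)
print(match.group(1))  # Some Thread Title
```

The same pattern can be pasted into regex101.com together with the page source to iterate on it, as suggested above.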
__________________
JD-Dev & Server-Admin Last edited by Jiaz; 08.09.2022 at 20:12.
#3
@jiaz
Ok, I will try to test it, but for example I can't add this kind of pattern to the rule: Code:
"packageNamePattern" : "<div id="navbar">(.*?)<\/div>", Code:
"navbar"
#4
Escape the quotation marks within the HTML, else you will break the JSON!
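A quick sketch of why unescaped quotation marks break the rule (hypothetical rule fragment; Python's json module stands in for JD's JSON parser here):

```python
import json

# Unescaped quotes around navbar terminate the JSON string early -> invalid
broken = '{ "packageNamePattern" : "<div id="navbar">(.*?)</div>" }'
try:
    json.loads(broken)
except json.JSONDecodeError as e:
    print("broken JSON:", e)

# Escaping the inner quotes (\" in the JSON text) keeps the string intact
fixed = '{ "packageNamePattern" : "<div id=\\"navbar\\">(.*?)</div>" }'
print(json.loads(fixed)["packageNamePattern"])  # <div id="navbar">(.*?)</div>
```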
__________________
raztoki @ jDownloader reporter/developer http://svn.jdownloader.org/users/170 Don't fight the system, use it to your advantage. :]
#5
Ah, ok. I forgot to escape the second quotation mark. I will come back when I finish testing with regex,
because it seems not so easy when you need to strip the title text from nested html tags like this. Last edited by Jiaz; 09.09.2022 at 12:13.
#6
@nathan1: on regex101, on the left side, change the flavor to Java 8; it will then show/warn about wrong escapes. Then you also need to escape the escape character because of the JSON.
For example: Code:
<div\\s*id=\\"navbar\\"><a.*</a>\\s*>\\s*(.*?)\\s*</div>
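A sketch of the double escaping: inside the JSON rule file every regex backslash is written as \\ and every literal quote as \", and after JSON decoding JD sees the plain regex. Python's json module is used for illustration, and the pattern below is a simplified stand-in, not the exact one from this thread:

```python
import json

# Doubly-escaped form as it would sit in the JSON rule file
raw = r'{"packageNamePattern": "<div\\s*id=\"navbar\">(.*?)</div>"}'

# After JSON decoding, the single-escaped regex remains
print(json.loads(raw)["packageNamePattern"])  # <div\s*id="navbar">(.*?)</div>
```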
__________________
JD-Dev & Server-Admin
#7
@nathan1
Additionally, you can use web tools like "jsoneditoronline.org" to verify that your JSON structure is intact instead of relying on JD. It's also much more convenient.
__________________
JD Supporter, Plugin Dev. & Community Manager
Erste Schritte & Tutorials || JDownloader 2 Setup Download
#8
@psp
I tested with your online tool and the JSON structure is ok, but it still doesn't assign the correct name to the package: Code:
[
  {
    "cookies" : null,
    "deepPattern" : "(//archive/\\d+)\"",
    "formPattern" : null,
    "id" : 1662681777413,
    "maxDecryptDepth" : 1,
    "name" : "ffshrine",
    "packageNamePattern" : "<div\\s*id=\"navbar\"><a.*</a>\\s*>\\s*(.*?)\\s*</div>",
    "passwordPattern" : null,
    "pattern" : "https?://XXXXXX-XXXXX\\.info/\\[^/]forums-ffshrine/[^/]archive/[^/]\\.html",
    "rewriteReplaceWith" : null,
    "rule" : "DEEPDECRYPT",
    "enabled" : true,
    "logging" : true,
    "updateCookies" : true
  }
]
**External links are only visible to Support Staff** JD returns this name for the package: Code:
LVB-PC0
instead of the full thread title: Code:
Ludwig van Beethoven - Piano Concerto No. 0 in E Flat Major - Jacobs / Leibowitz
**External links are only visible to Support Staff** gives me a 404 error, while this link **External links are only visible to Support Staff** gives me the thread page list, where I take the single thread URL and paste it into JD. It's not clear to me where I'm wrong.
#9
@nathan1
I'm sorry, but while I will help you here, please keep in mind that we're not here to teach you how regular expressions work. Your rule contains multiple fatal and also RegEx-related mistakes, but your packageNamePattern is fine. Mistakes you made: 1. Wrong deepPattern Code:
(//archive/\\d+)\"
The double slash seems wrong; plus, I don't see anything matching inside that HTML code besides "../archive.css", and I don't think that is what you want to grab. I guess you want either all links inside the HTML or only mega links. To simplify this, and also for testing, the rule I will post at the bottom of this reply will only grab mega links. You can easily modify that. Also, if you want to test DEEPDECRYPT rules, you can leave "deepPattern" empty; JD will then crawl all results it can find. 2. Broken pattern [RegEx] (in multiple places) Just by looking at it, I can tell you that this is wrong: Code:
https?://XXXXXX-XXXXX\\.info/\\[^/]forums-ffshrine/[^/]archive/[^/]\\.html
My rule down below will contain a corrected version. The repeated mistake in detail: Code:
\\[^/]
Quantifier missing at the end (a plus symbol): you want it to allow "not a slash" an unlimited number of times. Correct: Code:
[^/]+
If you want to cope with RegEx instead of just trying out copy-&-pasted patterns, I recommend reading an actual tutorial with exercises to learn it. Here is the fixed rule: Code:
[
  {
    "deepPattern": "(https?://mega\\.nz[^>]+)",
    "maxDecryptDepth": 1,
    "name": "ffshrine",
    "packageNamePattern": "<div\\s*id=\"navbar\"><a.*</a>\\s*>\\s*(.*?)\\s*</div>",
    "passwordPattern": null,
    "pattern": "https?://XXXXXX-XXXXX\\.info/forums-ffshrine/archive/t-[0-9]+\\.html",
    "rule": "DEEPDECRYPT",
    "enabled": true,
    "logging": true,
    "updateCookies": true
  }
]
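The missing-quantifier mistake described above can be demonstrated in a couple of lines (Python's re for illustration; the URL is a made-up example of the same shape as the censored one):

```python
import re

url = "https://example.info/forums-ffshrine/archive/t-12345.html"

# [^/] without a quantifier matches exactly ONE non-slash character, so this fails
print(re.search(r"archive/[^/]\.html", url))  # None

# [^/]+ matches one or more non-slash characters, i.e. the whole path segment
print(re.search(r"archive/([^/]+)\.html", url).group(1))  # t-12345
```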
__________________
JD Supporter, Plugin Dev. & Community Manager
Erste Schritte & Tutorials || JDownloader 2 Setup Download Last edited by Jiaz; 09.09.2022 at 16:01.
#10
@nathan1: you can use regex101 and test your pattern there: in the top text field you enter the pattern (don't forget to change to the Java 8 flavor), and in the bottom window you paste the HTML code of the website. Then you can check if/what your pattern will match/find.
__________________
JD-Dev & Server-Admin
#11
Also keep in mind that regex101.com does not support double escaping, so this:
Code:
<div\\s*id=\"navbar\"><a.*</a>\\s*>\\s*(.*?)\\s*</div>
has to be entered there in its single-escaped form: Code:
<div\s*id=\"navbar\"><a.*</a>\s*>\s*(.*?)\s*</div>
__________________
JD Supporter, Plugin Dev. & Community Manager
Erste Schritte & Tutorials || JDownloader 2 Setup Download
#12
@psp
ok, I don't understand why the unicode characters from this thread **External links are only visible to Support Staff** are written in this way. I tested with a second site, and I think I wrote the link crawler rule correctly, but in this situation there are some complications for me. I tested with these 2 links, but nothing appears in JD: **External links are only visible to Support Staff** **External links are only visible to Support Staff** I used this rule: Code:
[
  {
    "cookies" : [ [ "ips4_login_key", "33e867***************d74313" ] ],
    "deepPattern": "(https?://mega\\.nz[^>]+)|(https?://www\\.mediafire\\.com[^>]+)|(https?://depositfiles\\.com[^>]+)",
    "maxDecryptDepth": 1,
    "name": "squid",
    "packageNamePattern": "<title>(.*?)</title>",
    "passwordPattern": null,
    "pattern": "https?://www\\.ttttt-ttttt\\.org/topic/.*+",
    "rule": "DEEPDECRYPT",
    "enabled": true,
    "logging": true,
    "updateCookies": true
  }
]
In EditThisCookie there are a lot of options, and I don't know how I can tell which ones to use. I tested, but nothing appears. LOG Code:
09.09.22 19.31.42 <--> 09.09.22 19.32.43 jdlog://4979211370661/ Last edited by nathan1; 09.09.2022 at 20:44.
#13
[QUOTE=nathan1;509804]@psp
ok, I don't understand why unicode characters from this thread[/QUOTE] You need to use a font that supports unicode, see https://support.jdownloader.org/Know...nicode-support
__________________
JD-Dev & Server-Admin
#14
@Jiaz
ok, thank you. Any idea about the rule for the second site? I rewrote the rule this way, but no links appear: **External links are only visible to Support Staff** Code:
[
  {
    "enabled" : true,
    "cookies" : [ [ "ips4_login_key", "XXXXXXXXXXXXXXXX" ] ],
    "updateCookies" : true,
    "logging" : false,
    "maxDecryptDepth" : 1,
    "id" : 10,
    "name" : "squid-board example rule with cookie-login",
    "pattern" : "https?://squid-board\\.ru/forum/\\.php\\topic=\\d+",
    "rule" : "DEEPDECRYPT",
    "packageNamePattern" : "<title>(.*?)</title>",
    "passwordPattern" : null,
    "formPattern" : null,
    "deepPattern" : "Download from <a href=\"(https?://[^\"]+)\"",
    "rewriteReplaceWith" : null
  }
]
#15
For a start,
the pattern doesn't match your provided URL: /forum/.php vs /topic/id-textdescription/
__________________
raztoki @ jDownloader reporter/developer http://svn.jdownloader.org/users/170 Don't fight the system, use it to your advantage. :]
#16
@raztoki
I changed it like this (I checked it on regex101): Code:
"pattern" : "https?://www\\.squid-board\\.org\/topic\/.*+", Code:
"deepPattern": "(https?://mega\\.nz[^>]+)|(https?://www\\.mediafire\\.com[^>]+)|(https?://depositfiles\\.com[^>]+)",
#17
@nathan1
Regarding the squid-board rule: 1. I recommend testing with: Code:
"deepPattern": null
If none of the links you're looking for are found, that can indicate a problem with the login cookies. If everything and more is found, that indicates a bad deepPattern. 2. For squid-board an account is needed, so I won't be able to help in a more detailed way without having your login details for that website. 3. Quote:
"How can I know which cookies I need to put into the rule for it to work?" or also: "How can I know which cookies are actually needed so JD will be logged in when using that rule?" The answer is: you can't; you need to find out for each website. What I always do if a website has more cookies is delete the cookies one by one using EditThisCookie, refreshing the website in between, to see which ones are really needed. Mostly only one or two cookies are actually needed to be logged in. Sure, if you want, you can also put all cookies into your LinkCrawler rule, but that will make it ugly and can take some time.
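The delete-one-cookie-at-a-time procedure described above can be sketched as a small helper (hypothetical code; the cookie names are placeholder examples, and the actual HTTP request and logged-in check are site-specific and left as a comment):

```python
def cookie_subsets(cookies):
    """For each cookie, yield (removed_name, cookies_without_it).

    Request the page once per subset and check whether you are still
    logged in; a cookie whose removal logs you out is required."""
    for name in cookies:
        yield name, {k: v for k, v in cookies.items() if k != name}

# Placeholder cookie jar copied from the browser (names/values are examples)
all_cookies = {
    "ips4_login_key": "XXXX",
    "ips4_member_id": "XXXX",
    "ips4_theme": "XXXX",
}

for removed, subset in cookie_subsets(all_cookies):
    # e.g. fetch the forum page with `subset` and look for a logout link
    print("without", removed, "->", sorted(subset))
```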
__________________
JD Supporter, Plugin Dev. & Community Manager
Erste Schritte & Tutorials || JDownloader 2 Setup Download Last edited by pspzockerscene; 12.09.2022 at 15:37.
#18
@psp
credentials sent! Please check your email. Quote:
LOG Code:
12.09.22 21.25.24 <--> 12.09.22 21.39.20 jdlog://9589211370661/
#19
@nathan1: you must enable logging via
Quote:
__________________
JD-Dev & Server-Admin
#20
@nathan1:
I will post a working rule at the bottom of this post. Mistakes you made in the rule posted in your post #14: 1. You haven't properly checked all required cookies. In total, 3 cookies are required to be logged in (see the rule at the bottom of this post). 2. Your pattern was initially wrong: - wrong TLD (.ru instead of .org) - wrong pattern overall - even the pattern in post #16 can be improved 3. Your deepPattern was wrong. It looks like you haven't tested this at all. For my rule down below I left it empty, so everything will be grabbed. You can build the "deepPattern" on your own. As long as any mega.co link appears in the linkgrabber when adding a link, you can be sure that the login via the rule is working. Here is the "fixed" rule with logging enabled: Code:
[
  {
    "enabled": true,
    "cookies": [
      [ "ips4_IPSSessionFront", "CENSORED" ],
      [ "ips4_login_key", "CENSORED" ],
      [ "ips4_member_id", "CENSORED" ]
    ],
    "updateCookies": true,
    "logging": true,
    "maxDecryptDepth": 1,
    "name": "squid-board.org example rule with cookie-login",
    "pattern": "https?://(www\\.)?squid-board\\.org/topic/[0-9]+-[a-z0-9\\-]+.*",
    "rule": "DEEPDECRYPT",
    "packageNamePattern": "<title>(.*?)</title>",
    "deepPattern": null
  }
]
pastebin.com/raw/dpy6t1CX
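To sanity-check such a pattern outside JD, it can be replicated with Python's re (the example URLs below are invented, but follow the same shape as the forum's topic links):

```python
import re

# Single-escaped form of the rule's "pattern" field
pattern = r"https?://(www\.)?squid-board\.org/topic/[0-9]+-[a-z0-9\-]+.*"

# A topic URL of the expected shape matches...
print(bool(re.match(pattern, "https://www.squid-board.org/topic/123-some-thread")))  # True
# ...while a forum-overview URL does not
print(bool(re.match(pattern, "https://www.squid-board.org/forum/87-some-forum/")))   # False
```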
__________________
JD Supporter, Plugin Dev. & Community Manager
Erste Schritte & Tutorials || JDownloader 2 Setup Download Last edited by pspzockerscene; 13.09.2022 at 14:16. Reason: Fixed typo
#21
@psp
thank you for the solution, but it doesn't work well. I tested with 3 links, but it tells me: "Leider hast Du dazu keine Berechtigung!" ("Unfortunately, you don't have permission for that!") LOG Code:
13.09.22 15.55.14 <--> 13.09.22 15.55.13 jdlog://1789211370661/ Example links: **External links are only visible to Support Staff** **External links are only visible to Support Staff** **External links are only visible to Support Staff** EDIT: I regenerated the cookies and now it captures again, but is there a solution to avoid the eternal idling of the linkcrawler? It runs and checks without stopping. Last edited by nathan1; 13.09.2022 at 19:40.
#22
Do you have to log in at the website again after a timeout? Maybe try to create a fresh/separate pair of cookies.
__________________
JD-Dev & Server-Admin
#23
@Jiaz
Yes, I log out and log in again. But many links are crawled wrongly and return wrong data, like in this example link: **External links are only visible to Support Staff** You can see in the picture that for ONE link it continues to crawl after 2 minutes, and the mega.nz links inside the thread are not crawled. LOG Code:
13.09.22 21.10.04 <--> 13.09.22 21.27.09 jdlog://7889211370661/
#24
I've checked the links in the log for this and they are offline; most likely you have enabled a filter to remove/hide offline/unknown types.
The link is detected/found/processed correctly here.
__________________
JD-Dev & Server-Admin
#25
So even on the website you have to log in again and again? What happens when you check the "stay logged in" checkbox? Does this prevent having to log in again?
__________________
JD-Dev & Server-Admin
#26
Sorry, but the logout thing is not a problem of the rule, but rather of you not checking the "Stay logged in" checkbox, or of the website simply not keeping sessions active for a long time.
If you continue to have this problem, you should think about a solution outside JDownloader, e.g. manually collecting those links in your browser and then pasting them into JD, or using other browser scripting solutions. Here is a hint on how to simplify collecting links via browser. EDIT Quote:
For my simple tests I've used the following one, but I doubt it will be enough for you: Code:
"deepPattern": "(https?://mega\\.nz/[^\"<>]+)"
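Applied to a made-up HTML snippet, the single-escaped form of that deepPattern extracts the mega.nz links like this (Python's re for illustration; the snippet and the mega URL are invented):

```python
import re

# Single-escaped form of the deepPattern above
deep = r'(https?://mega\.nz/[^"<>]+)'
html = 'Download from <a href="https://mega.nz/file/AbCd#keykey">mega.nz</a>'

print(re.findall(deep, html))  # ['https://mega.nz/file/AbCd#keykey']
```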
__________________
JD Supporter, Plugin Dev. & Community Manager
Erste Schritte & Tutorials || JDownloader 2 Setup Download Last edited by raztoki; 16.09.2022 at 12:48.
#27
Additional information regarding cookie lifetime:
Some websites, for example, only allow X parallel sessions, so your problem might have been caused by me logging in and out multiple times from another IP. To exclude this as the cause of your problem, I recommend changing your password and using only one session, and/or creating an extra test account for me so that I won't contribute to reaching this possible limit with your account...
__________________
JD Supporter, Plugin Dev. & Community Manager
Erste Schritte & Tutorials || JDownloader 2 Setup Download