#1
Automatically fetch links from that host based on keywords
Is there any way to have JD2 automate the task of adding links from a particular 'website' and 'host' based on keywords? The same keywords that I use on the website for filtering two separate search queries.
For example: from examplewebsitedotcom, add the URLs from host xyz that match the following keywords in their posts:
keyword set 1: 1080p Bluray H264 AAC
keyword set 2: 1080p WEB
I would not be downloading everything since I still need to re-filter; I just need them added to the linkgrabber queue.
#2
If your regex is good, you could use a linkcrawler rule; otherwise, a decrypter plugin for the given website could filter the results (once or more) and return the links.
__________________
raztoki @ jDownloader reporter/developer http://svn.jdownloader.org/users/170 Don't fight the system, use it to your advantage. :]
#3
For the linkcrawler rule I'm trying to find a sample on this site that I can use and edit.
Unfortunately, there are no plugins for rmz.cr.
#4
There is already a decrypter plugin.
#5
Where is the plugin?
Last edited by RPNet-user; 17.02.2020 at 07:07.
#6
Search the source for the domain name; the class name is rpdmvzcm.
The plugin works off the /release/ URL, not /video. If you don't want to edit the main plugin (since it returns all links), you could use link filters and ignore all but what you want via regex.
#7
I tried this combo and it does not work.
[ {
  "enabled" : true,
  "cookies" : null,
  "updateCookies" : true,
  "logging" : false,
  "maxDecryptDepth" : 0,
  "id" : 1581919157382,
  "name" : "rmz.cr",
  "pattern" : "https?://(www\\.)?rmz\\.cr/[^/]/.+",
  "rule" : "DEEPDECRYPT",
  "packageNamePattern" : null,
  "passwordPattern" : null,
  "formPattern" : null,
  "deepPattern" : null,
  "rewriteReplaceWith" : null
} ]
Last edited by RPNet-user; 17.02.2020 at 18:17.
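One likely reason the rule above never reaches the release pages: in the "pattern" value, `[^/]` carries no quantifier, so it matches exactly one character and only single-letter path segments like /l/ can pass. A quick sketch, using Python's `re` as a stand-in for Java's regex flavour (the example URLs are invented):

```python
import re

# The rule's pattern after JSON parsing turns \\. into the regex escape \.
pattern = r"https?://(www\.)?rmz\.cr/[^/]/.+"

# [^/] without a quantifier matches exactly ONE non-slash character, so the
# first path segment can only be a single letter: /l/m matches, /release/... does not.
print(bool(re.fullmatch(pattern, "https://rmz.cr/l/m")))                 # True
print(bool(re.fullmatch(pattern, "https://rmz.cr/release/some-title")))  # False

# [^/]+ (with the +) accepts path segments of any length:
fixed = r"https?://(www\.)?rmz\.cr/[^/]+/.+"
print(bool(re.fullmatch(fixed, "https://rmz.cr/release/some-title")))    # True
```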
#8
Hi,
please explain exactly what you want to do here and post example URLs. The things you tried will not work. Please add a detailed description of what happens at the moment you add the URLs and what the desired behavior would look like. -psp-
__________________
JD Supporter, Plugin Dev. & Community Manager
Erste Schritte & Tutorials || JDownloader 2 Setup Download
#9
Nothing is actually happening with this because it is not doing any of what I'm trying to accomplish.
When I test using Code:
**External links are only visible to Support Staff**
Here is what I'm trying to accomplish: allow JD2 to add links to the linkgrabber from Uploaded (UL) and ClicknUpload (CU) from rmz.cr using two sets of keyword patterns:
keyword set 1: 1080p Bluray H264 AAC
keyword set 2: 1080p WEB
Set some kind of date range or limit, as I do not want it to crawl indefinitely and/or grab links from months/years ago.
In layman's terms: I want the crawler to only add the UL and CU links for posts that match movie titles with the above keyword patterns for a date range, for example feb.10 to the current date. It may require two separate sets of rules, and I'm OK with that.
Example A: marvelvsdc 2019 1080p BluRay H264 AAC-groupname
Example B: dcvsmarvel 2020 1080p WEBRip x264-groupname
So only the links for posts/titles that have the keywords '1080p BluRay H264 AAC' and '1080p WEB' will be added. Given that 'WEB' is not specific, it will certainly add links for both 'WEBRip' and 'WEB-DL', and that is perfectly fine. If a date range is not possible, then running daily for all newly posted titles that match the filtered criteria would be OK as well, since that can easily be automated/scheduled.
#10
So you want to add posts from their main page based on keywords, correct?
-psp-
#11
Yes, but for the past range of days if possible; otherwise, on a daily basis as they are posted.
See the screenshot of both keyword search queries that I filter every three to four days. For the 1080p WEB query I use Chrome to highlight 'WEB' on every page of the search, since their query for WEB cannot be filtered as well as on other sites like scene-rls.net and rarbg. After three days, the left-side query will produce anywhere from 2-3 pages, while the WEB (right-side) query will result in around 16+ pages.
I just noticed that every single post for both queries includes downloadable links for Rapidrar (RR), so it would be best to substitute CU with RR, as some posts do not include downloadable links for either UL or CU; this way the crawler will not skip those filtered posts when they lack UL and CU links. So add the downloadable RR and UL links for the filtered results.
I also realized that the query for the second keyword set, '1080p WEB', may not work well because it will probably include TV shows, which I do not want. Here is a further breakdown of what I actually need from the 1080p WEB keyword query in the Movies category: 1080p WEBRip VXT and 1080p WEBRip RARBG. RMZ search queries in Movies will not work properly when they include the group name VXT, since it is too short for the query.
So basically, I'm only interested in 1080p WEBRip VXT and RARBG and 1080p Bluray VXT and RARBG from the Movies category. Both queries include all four results; otherwise I would have to use several search queries. This is not a problem with sites like scene-rls.net/releases/index.php and RARBG, since I'm able to filter results accurately on both of those sites with only two queries, '1080p VXT' and '1080p RARBG'. If both of these keyword queries can be used in the site crawler rules, that would be perfect.
Last edited by RPNet-user; 18.02.2020 at 01:51. Reason: added information
#12
What you want is far too complicated and cannot be accomplished using link crawler rules.
I've also noticed that this website will sometimes ask for a captcha when you do search queries --> That is another issue. Please also keep in mind that we do not develop plugins that support search queries. Here are your options:
- Develop an EventScripter script for this purpose
- Develop any other kind of script that crawls the links and adds them to JD from the outside, e.g. via the myjdownloader API or the old remote API
- Grab our plugin and extend the code so that it does what you want. We are open source --> Probably the easiest solution
I will mark this thread as "Declined" because we will not develop the solution you want to have. -psp-
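The second option, an external script that does the filtering itself, can be sketched without any JD-specific API: extract the /release/ links whose slugs match the keyword sets, then hand the resulting URLs to JD (e.g. via My.JDownloader, or simply by copying them with clipboard watching enabled). Everything below (the HTML sample, slugs, and keyword regexes) is invented for illustration:

```python
import re

# Keyword sets from the posts above, as case-insensitive regexes over release slugs.
KEYWORD_PATTERNS = [
    re.compile(r"1080p[.\- ]bluray[.\- ]h264[.\- ]aac", re.I),
    re.compile(r"1080p[.\- ]web", re.I),
]

def filter_release_links(html: str) -> list[str]:
    """Return hrefs of /release/ links whose slug matches any keyword set."""
    links = re.findall(r'href="(/release/[^"]+)"', html)
    return [l for l in links if any(p.search(l) for p in KEYWORD_PATTERNS)]

# Invented sample listing page.
sample = '''
<a href="/release/marvelvsdc-2019-1080p-bluray-h264-aac-grp">x</a>
<a href="/release/dcvsmarvel-2020-1080p-webrip-x264-grp">y</a>
<a href="/release/oldmovie-1999-720p-bluray-x264-grp">z</a>
'''
print(filter_release_links(sample))  # the two 1080p releases; the 720p one is skipped
```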
#13
Thanks, I will look into all three options, starting with an EventScripter script.
#14
URL pattern in link crawler rules
OK, so I'm trying to crawl 5 pages of a site; however, the first page has no number (null) and there is no page labeled "1", so here is how they are numbered. Regex101 shows it as a 'pattern error', especially since the slashes in JD2 crawler rules do not match what regex101 expects.
**External links are only visible to Support Staff**
The pattern that I'm testing with is: "https?://rmz\\.cr/l/m/[0-5]"
#15
"https?://rmz\\.cr/l/m/[0-5]?"
__________________
FAQ: How to upload a Log
#16
The error (on regex101) is because you have double escaped; use the Python flavour and replace \\. with \.
For Java, and in the Eclipse IDE, you need to double escape some characters.
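The double-escaping point can be shown in a few lines: the rule file is JSON, so `\\.` in the file collapses to the regex escape `\.` after parsing, and that parsed form is what regex101 should be fed. A sketch in Python (the URLs are invented):

```python
import json
import re

# A rule pattern exactly as it would sit inside the JSON rule file:
rule_json = r'{"pattern": "https?://rmz\\.cr/l/m/[0-5]?"}'

# JSON parsing turns \\ into \ , yielding the actual regex.
pattern = json.loads(rule_json)["pattern"]
print(pattern)  # https?://rmz\.cr/l/m/[0-5]?  <- paste THIS into regex101

print(bool(re.fullmatch(pattern, "https://rmz.cr/l/m/")))   # True: first page, no number
print(bool(re.fullmatch(pattern, "https://rmz.cr/l/m/3")))  # True: page 3
```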
#17
@tony2long: thanks. Unfortunately, adding the ? quantifier gives the same results as without it, in that it breaks the deepPattern expression, which means that if no expression in the pattern will make it work, I will need to modify the deepPattern as well.
@raztoki: thanks. Using the Python flavour without the extra backslash does match all five of my crawler test strings; unfortunately, JD/Java must have the double escaping, so testing with other languages will not benefit the test strings.
#18
Sorry, it's not clear to me.
#19
Would someone explain why the linkcrawler is able to crawl and grab from:
**External links are only visible to Support Staff**
But not crawl and grab from:
**External links are only visible to Support Staff**
Last edited by raztoki; 23.02.2020 at 12:19.
#20
At a guess, the link format never existed when the plugin was created.
#21
The plugin only supports /release/[a-z0-9\\-]+
Last edited by raztoki; 23.02.2020 at 14:17.
#22
Is there any way around this?
#23
By creating linkcrawler rules for the unsupported patterns, or editing the decrypter plugin to support the additional website function.
#24
Where is the decrypter plugin for this domain, and what is it named? I have yet to find it.
#26
@RPNet-user
I merged these so you can see post #6 for your second question. The domain is the same, rmz.cr. The rule should crawl /l/m to find the /release/ links that are supported by the plugin.
#27
@raztoki
It will only create support for a URL pattern that originates from the top-level domain (rmz.cr), which is in the plugin, and not rmz.cr/l/m. The pattern I modified should have worked even without the numbered pages, so unless I have to add/edit the deepPattern, it will not work, provided that the domain in the rule overrides the domain supported by the plugin. All file links originate from the top-level domain plus the '/release' sub-path, even if I get there from rmz.cr/l/m.
Last edited by RPNet-user; 24.02.2020 at 06:47.
#28
I don't agree with your statement. regex101 helps you write and match in real time. You then adapt for the situation, e.g. Java code with the extra escaping (as it will otherwise show up as an error). Inside the JD graphical interface you shouldn't need the extra escaping; in link crawler rules you might, as they are JSON.
#29
|
||||
|
||||
Yes, it will create the pattern, but it will not grab any links from posts/titles where the URL is rmz.cr/l/m, because when you go to any page or post originating from the newly created pattern, the URL will always be rmz.cr/release. The rmz.cr/l/m path on their site is simply for displaying different categories (/l/m for movies, /l/s for series, and /l/b for both); it does not provide URL-addressable access via those paths. Therefore, how will the pattern know to grab a link that contains that path?
#30
Create another rule where deepPattern contains /l/m, for example.
#31
Check the screenshot for a further explanation. The left side is rmz.cr and the right side is rmz.cr/l/m, but at the bottom you will see that they are in the same location ---> rmz.cr/release/movietitle. Therefore, no pattern will be able to differentiate between the locations of the links, since the path to the actual title/post is the same.
#32
It won't matter, as the dedicated plugin doesn't scan this either.
So you first need "pattern" to listen to /l/m etc., and then "deepPattern" within the HTML body (maybe <table>) to return just the /release links; the dedicated plugin will then do the rest. I personally wouldn't follow multiple pages deep; just keep the links you want to scan in a txt file and copy them all. Then they are all single tasks.
#33
In the page source HTML you can find /l/m, so the first rule will get that /l/m page, then the second rule grabs the /l/m page and finds /release/.
#34
OK, so I will just go with the first page in /l/m only.
This is what is currently working properly, but from the main rmz.cr page only. Code:
[ {
  "enabled" : true,
  "updateCookies" : true,
  "logging" : false,
  "maxDecryptDepth" : 1,
  "id" : 1422443765154,
  "name" : "1080p rarbg and vxt",
  "pattern" : "https?://rmz\\.cr/",
  "rule" : "DEEPDECRYPT",
  "packageNamePattern" : null,
  "passwordPattern" : null,
  "formPattern" : null,
  "deepPattern" : "(/release/[a-z0-9\\-]+1080p[a-z0-9\\-]+rarbg)|(/release/[a-z0-9\\-]+1080p[a-z0-9\\-]+vxt)",
  "rewriteReplaceWith" : null
} ]
These patterns will not work, as the rule then just grabs everything instead of the keyword links matched by the above regex:
"pattern" : "https?://rmz\\.cr/l/m/[0-5]",
"pattern" : "https?://rmz\\.cr/l/m/",
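The deepPattern in that working rule can be tested in isolation against a page body. A sketch in Python (the HTML snippet and release slugs are invented):

```python
import re

# deepPattern from the rule above: one alternative per release group.
deep = r"(/release/[a-z0-9\-]+1080p[a-z0-9\-]+rarbg)|(/release/[a-z0-9\-]+1080p[a-z0-9\-]+vxt)"

# Invented page body with three release links.
html = '''
<a href="/release/some-movie-2020-1080p-webrip-x264-rarbg">A</a>
<a href="/release/other-movie-2019-720p-bluray-x264-vxt">B</a>
<a href="/release/third-movie-2020-1080p-bluray-h264-vxt">C</a>
'''

# findall returns one tuple per match (one slot per alternative); join the slots.
hits = ["".join(groups) for groups in re.findall(deep, html)]
print(hits)  # only the two 1080p rarbg/vxt slugs; the 720p release is skipped
```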
#35
I've limited it to the "release" URLs only for the next update, which means this would be possible and would cover adding URLs of the desired pages (no number = first page): Code:
[ {
  "enabled" : true,
  "updateCookies" : true,
  "logging" : false,
  "maxDecryptDepth" : 1,
  "id" : 1422443765154,
  "name" : "1080p rarbg and vxt",
  "pattern" : "https?://rmz\\.cr/l/b/[0-9]*?",
  "rule" : "DEEPDECRYPT",
  "packageNamePattern" : null,
  "passwordPattern" : null,
  "formPattern" : null,
  "deepPattern" : "(/release/[a-z0-9\\-]+1080p[a-z0-9\\-]+rarbg)|(/release/[a-z0-9\\-]+1080p[a-z0-9\\-]+vxt)",
  "rewriteReplaceWith" : null
} ]
#36
So as of this moment it is not possible to crawl and grab from /l/m due to the plugin, correct?
The next update would make it possible?
Last edited by RPNet-user; 25.02.2020 at 02:53.
#37
Yes.
Plugins always have priority, which is good and makes sense ... usually. This is just an edge case, and soon it won't be one anymore. We have a ticket about creating link crawler rules with higher priority than plugins, but again, in this case your rules would then override our plugin completely and you'd have to add another rule to manually handle "/release" URLs ... This is the ticket:
Plugin updates have just been released - you can now test the above mentioned rule! -psp-
#38
Thanks psp,
I was confused earlier as to which one overrides the other: rules > plugin or plugin > rules. When trying to modify the pattern earlier, I suspected that the plugin had the higher priority, due to its wide regex matching regardless of what I specified. I will update, test, and give feedback.
#39
Yeah basically you were really really unlucky.
This is our last RegEx: Code:
https?://(?:www\\.)?rmz\\.cr/(?:release/)?(?!l/)[^/]+
Code:
https?://(?:www\\.)?rmz\\.cr/release/[^/]+
The other alternative would have been to block "l/" in our RegEx, but the current solution is the nicer one^^ -psp-
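The difference between the two patterns psp posted can be checked directly: the lookahead variant blocks l/ URLs explicitly, while the release-only variant simply never claims them, so either way a user rule can own the /l/ listing pages. A Python sketch (the example URLs are invented):

```python
import re

# The two plugin patterns from the post above, after JSON-style \\ is collapsed to \ .
with_lookahead = r"https?://(?:www\.)?rmz\.cr/(?:release/)?(?!l/)[^/]+"
release_only = r"https?://(?:www\.)?rmz\.cr/release/[^/]+"

release = "https://rmz.cr/release/some-movie-2020-1080p"
listing = "https://rmz.cr/l/m/3"

# Both patterns accept a /release/ URL and reject an /l/ listing URL:
print(bool(re.match(with_lookahead, release)), bool(re.match(with_lookahead, listing)))  # True False
print(bool(re.match(release_only, release)), bool(re.match(release_only, listing)))      # True False
```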
#40
Thanks psp, it is working perfectly now.
I set the pattern to: "pattern" : "https?://rmz\\.cr/l/m/*?",
I set the crawl for ----> rmz.cr/l/m
The EventScripter is working perfectly as well; I just tested it with rmz.cr/l/m.
I'm having problems with the logic for the linkgrabber filter 'views' rule, as I'm trying to exclude 'srt' files during the grab, so I tried setting a simple rule with only the file type set to 'is not' 'srt'. So the rule is: Allow links if file isn't an 'srt' file! However, when I test the rule it still adds the srt files along with the video files.