#1
Automatically fetch links from that host based on keywords
Is there any way to have JD2 automate the task of adding links from a particular 'website' and 'host' based on keywords? The same keywords that I use on the website for filtering two separate search queries.
For example: from examplewebsitedotcom, add the URLs from host xyz that match the following keywords in their posts:
keyword set 1: 1080p Bluray H264 AAC
keyword set 2: 1080p WEB
I would not be downloading everything since I still need to re-filter; I just need them added to the linkgrabber queue.
#2
If your regex is good, you could use a linkcrawler rule; otherwise, a decrypter plugin for the given website could filter the results (once or more) and return the links.
__________________
raztoki @ jDownloader reporter/developer http://svn.jdownloader.org/users/170 Don't fight the system, use it to your advantage. :]
#3
For the linkcrawler rule I'm trying to find a sample on this site that I can use and edit.
Unfortunately, there are no plugins for rmz.cr.
#4
There is already a decrypter plugin.
#5
Where is the plugin?
Last edited by RPNet-user; 17.02.2020 at 07:07.
#6
Search the source for the domain name; the class name is rpdmvzcm.
The plugin works off the /release/ URL, not /video. If you don't want to edit the main plugin (since it returns all links), you could use link filters and ignore all but what you want via regex.
#7
I tried this combo and it does not work.
[ {
  "enabled" : true,
  "cookies" : null,
  "updateCookies" : true,
  "logging" : false,
  "maxDecryptDepth" : 0,
  "id" : 1581919157382,
  "name" : "rmz.cr",
  "pattern" : "https?://(www\\.)?rmz\\.cr/[^/]/.+",
  "rule" : "DEEPDECRYPT",
  "packageNamePattern" : null,
  "passwordPattern" : null,
  "formPattern" : null,
  "deepPattern" : null,
  "rewriteReplaceWith" : null
} ]
Last edited by RPNet-user; 17.02.2020 at 18:17.
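One likely reason the rule above never reaches the release pages: in the "pattern" value, `[^/]` carries no quantifier, so it matches exactly one character and only single-letter path segments like /l/ can pass. A quick sketch, using Python's `re` as a stand-in for Java's regex flavour (the example URLs are invented):

```python
import re

# The rule's pattern after JSON parsing turns \\. into the regex escape \.
pattern = r"https?://(www\.)?rmz\.cr/[^/]/.+"

# [^/] without a quantifier matches exactly ONE non-slash character, so the
# first path segment can only be a single letter: /l/m matches, /release/... does not.
print(bool(re.fullmatch(pattern, "https://rmz.cr/l/m")))                 # True
print(bool(re.fullmatch(pattern, "https://rmz.cr/release/some-title")))  # False

# [^/]+ (with the +) accepts path segments of any length:
fixed = r"https?://(www\.)?rmz\.cr/[^/]+/.+"
print(bool(re.fullmatch(fixed, "https://rmz.cr/release/some-title")))    # True
```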
#8
Hi,
please explain exactly what you want to do here and post example URLs. The things you tried will not work. Please add a detailed description of what happens at the moment you add the URLs and what the desired behavior would look like. -psp-
__________________
JD Supporter, Plugin Dev. & Community Manager
Erste Schritte & Tutorials || JDownloader 2 Setup Download
#9
Nothing is actually happening with this because it is not doing any of what I'm trying to accomplish.
When I test using Code:
**External links are only visible to Support Staff**
Here is what I'm trying to accomplish: allow JD2 to add links to the linkgrabber from Uploaded (UL) and ClicknUpload (CU) from rmz.cr using two sets of keyword patterns:
keyword set 1: 1080p Bluray H264 AAC
keyword set 2: 1080p WEB
Set some kind of date range or limit, as I do not want it to crawl indefinitely and/or grab links from months/years ago.
In layman's terms: I want the crawler to only add the UL and CU links for posts that match movie titles with the above keyword patterns for a date range, for example feb.10 to the current date. It may require two separate sets of rules, and I'm OK with that.
Example A: marvelvsdc 2019 1080p BluRay H264 AAC-groupname
Example B: dcvsmarvel 2020 1080p WEBRip x264-groupname
So only the links for posts/titles that have the keywords '1080p BluRay H264 AAC' and '1080p WEB' will be added. Given that 'WEB' is not specific, it will certainly add links for both 'WEBRip' and 'WEB-DL', and that is perfectly fine. If a date range is not possible, then running daily for all newly posted titles that match the filtered criteria would be OK as well, since that can easily be automated/scheduled.
#10
So you want to add posts from their main page based on keywords, correct?
-psp-
#11
Yes, but for the past range of days if possible; otherwise, on a daily basis as they are posted.
See the screenshot of both keyword search queries that I filter every three to four days. For the 1080p WEB query I use Chrome to highlight 'WEB' on every page of the search, since their query for WEB cannot be filtered as well as on other sites like scene-rls.net and rarbg. After three days, the left-side query will produce anywhere from 2-3 pages, while the WEB (right-side) query will result in around 16+ pages.
I just noticed that every single post for both queries includes downloadable links for Rapidrar (RR), so it would be best to substitute CU with RR, as some posts do not include downloadable links for either UL or CU; this way the crawler will not skip those filtered posts when they lack UL and CU links. So add the downloadable RR and UL links for the filtered results.
I also realized that the query for the second keyword set, '1080p WEB', may not work well because it will probably include TV shows, which I do not want. Here is a further breakdown of what I actually need from the 1080p WEB keyword query in the Movies category: 1080p WEBRip VXT and 1080p WEBRip RARBG. RMZ search queries in Movies will not work properly when they include the group name VXT, since it is too short for the query.
So basically, I'm only interested in 1080p WEBRip VXT and RARBG and 1080p Bluray VXT and RARBG from the Movies category. Both queries include all four results; otherwise I would have to use several search queries. This is not a problem with sites like scene-rls.net/releases/index.php and RARBG, since I'm able to filter results accurately on both of those sites with only two queries, '1080p VXT' and '1080p RARBG'. If both of these keyword queries can be used in the site crawler rules, that would be perfect.
Last edited by RPNet-user; 18.02.2020 at 01:51. Reason: added information
#12
What you want is far too complicated and cannot be accomplished using link crawler rules.
I've also noticed that this website will sometimes ask for a captcha when you do search queries --> That is another issue. Please also keep in mind that we do not develop plugins that support search queries. Here are your options:
- Develop an EventScripter script for this purpose
- Develop any other kind of script that crawls the links and adds them to JD from the outside, e.g. via the myjdownloader API or the old remote API
- Grab our plugin and extend the code so that it does what you want. We are open source --> Probably the easiest solution
I will mark this thread as "Declined" because we will not develop the solution you want to have. -psp-
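The second option, an external script that does the filtering itself, can be sketched without any JD-specific API: extract the /release/ links whose slugs match the keyword sets, then hand the resulting URLs to JD (e.g. via My.JDownloader, or simply by copying them with clipboard watching enabled). Everything below (the HTML sample, slugs, and keyword regexes) is invented for illustration:

```python
import re

# Keyword sets from the posts above, as case-insensitive regexes over release slugs.
KEYWORD_PATTERNS = [
    re.compile(r"1080p[.\- ]bluray[.\- ]h264[.\- ]aac", re.I),
    re.compile(r"1080p[.\- ]web", re.I),
]

def filter_release_links(html: str) -> list[str]:
    """Return hrefs of /release/ links whose slug matches any keyword set."""
    links = re.findall(r'href="(/release/[^"]+)"', html)
    return [l for l in links if any(p.search(l) for p in KEYWORD_PATTERNS)]

# Invented sample listing page.
sample = '''
<a href="/release/marvelvsdc-2019-1080p-bluray-h264-aac-grp">x</a>
<a href="/release/dcvsmarvel-2020-1080p-webrip-x264-grp">y</a>
<a href="/release/oldmovie-1999-720p-bluray-x264-grp">z</a>
'''
print(filter_release_links(sample))  # the two 1080p releases; the 720p one is skipped
```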
#13
Thanks, I will look into all three options, starting with an EventScripter script.
#14
URL pattern in link crawler rules
OK, so I'm trying to crawl 5 pages of a site; however, the first page has no number (null) and there is no page labeled "1", so here is how they are numbered. Regex101 shows it as a 'pattern error', especially since the slashes in JD2 crawler rules do not match what regex101 expects.
**External links are only visible to Support Staff**
The pattern that I'm testing with is: "https?://rmz\\.cr/l/m/[0-5]"
#15
"https?://rmz\\.cr/l/m/[0-5]?"
__________________
FAQ: How to upload a Log
#16
The error (on regex101) is because you have double escaped; use the Python flavour and replace \\. with \.
For Java, and in the Eclipse IDE, you need to double escape some characters.
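The double-escaping point can be shown in a few lines: the rule file is JSON, so `\\.` in the file collapses to the regex escape `\.` after parsing, and that parsed form is what regex101 should be fed. A sketch in Python (the URLs are invented):

```python
import json
import re

# A rule pattern exactly as it would sit inside the JSON rule file:
rule_json = r'{"pattern": "https?://rmz\\.cr/l/m/[0-5]?"}'

# JSON parsing turns \\ into \ , yielding the actual regex.
pattern = json.loads(rule_json)["pattern"]
print(pattern)  # https?://rmz\.cr/l/m/[0-5]?  <- paste THIS into regex101

print(bool(re.fullmatch(pattern, "https://rmz.cr/l/m/")))   # True: first page, no number
print(bool(re.fullmatch(pattern, "https://rmz.cr/l/m/3")))  # True: page 3
```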
#17
@tony2long: thanks. Unfortunately, adding the ? quantifier gives the same results as without it, in that it breaks the deepPattern expression, which means that if no expression in the pattern will make it work, I will need to modify the deepPattern as well.
@raztoki: thanks. Using the Python flavour without the extra backslash does match all five of my crawler test strings; unfortunately, JD/Java must have the double escaping, so testing with other languages will not benefit the test strings.
#18
Sorry, it's not clear to me.
#19
Would someone explain why the linkcrawler is able to crawl and grab from:
**External links are only visible to Support Staff**
But not crawl and grab from:
**External links are only visible to Support Staff**
Last edited by raztoki; 23.02.2020 at 12:19.
#20
At a guess, the link format never existed when the plugin was created.
#21
The plugin only supports /release/[a-z0-9\\-]+
Last edited by raztoki; 23.02.2020 at 14:17.
#22
Is there any way around this?
#23
By creating linkcrawler rules for the unsupported patterns, or editing the decrypter plugin to support the additional website function.
#24
Where is the decrypter plugin for this domain, and what is it named? I have yet to find it.
#26
@RPNet-user
I merged these so you can see post #6 for your second question. The domain is the same, rmz.cr. The rule should crawl /l/m to find the /release/ links that are supported by the plugin.
#27
@raztoki
It will only create support for a URL pattern that originates from the top-level domain (rmz.cr), which is in the plugin, and not rmz.cr/l/m. The pattern I modified should have worked even without the numbered pages, so unless I have to add/edit the deepPattern, it will not work, provided that the domain in the rule overrides the domain supported by the plugin. All file links originate from the top-level domain plus the '/release' sub-path, even if I get there from rmz.cr/l/m.
Last edited by RPNet-user; 24.02.2020 at 06:47.
#28
I don't agree with your statement. regex101 helps you write and match in real time. You then adapt for the situation, e.g. Java code with the extra escaping (as it will otherwise show up as an error). Inside the JD graphical interface you shouldn't need the extra escaping; in link crawler rules you might, as they are JSON.
#29
|
||||
|
||||
Yes, it will create the pattern, but it will not grab any links from posts/titles where the URL is rmz.cr/l/m, because when you go to any page or post originating from the newly created pattern, the URL will always be rmz.cr/release. The rmz.cr/l/m path on their site is simply for displaying different categories (/l/m for movies, /l/s for series, and /l/b for both); it does not provide URL-addressable access via those paths. Therefore, how will the pattern know to grab a link that contains that path?
#30
Create another rule where deepPattern contains /l/m, for example.
#31
Check the screenshot for a further explanation. The left side is rmz.cr and the right side is rmz.cr/l/m, but at the bottom you will see that they are in the same location ---> rmz.cr/release/movietitle. Therefore, no pattern will be able to differentiate between the locations of the links, since the path to the actual title/post is the same.
#32
It won't matter, as the dedicated plugin doesn't scan this either.
So you first need "pattern" to listen to /l/m etc., and then "deepPattern" within the HTML body (maybe <table>) to return just the /release links; the dedicated plugin will then do the rest. I personally wouldn't follow multiple pages deep; just keep the links you want to scan in a txt file and copy them all. Then they are all single tasks.
#33
In the page source HTML you can find /l/m, so the first rule will get that /l/m page, then the second rule grabs the /l/m page and finds /release/.
#34
OK, so I will just go with the first page in /l/m only.
This is what is currently working properly, but from the main rmz.cr page only. Code:
[ {
  "enabled" : true,
  "updateCookies" : true,
  "logging" : false,
  "maxDecryptDepth" : 1,
  "id" : 1422443765154,
  "name" : "1080p rarbg and vxt",
  "pattern" : "https?://rmz\\.cr/",
  "rule" : "DEEPDECRYPT",
  "packageNamePattern" : null,
  "passwordPattern" : null,
  "formPattern" : null,
  "deepPattern" : "(/release/[a-z0-9\\-]+1080p[a-z0-9\\-]+rarbg)|(/release/[a-z0-9\\-]+1080p[a-z0-9\\-]+vxt)",
  "rewriteReplaceWith" : null
} ]
These patterns will not work, as the rule then just grabs everything instead of the keyword links matched by the above regex:
"pattern" : "https?://rmz\\.cr/l/m/[0-5]",
"pattern" : "https?://rmz\\.cr/l/m/",
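The deepPattern in that working rule can be tested in isolation against a page body. A sketch in Python (the HTML snippet and release slugs are invented):

```python
import re

# deepPattern from the rule above: one alternative per release group.
deep = r"(/release/[a-z0-9\-]+1080p[a-z0-9\-]+rarbg)|(/release/[a-z0-9\-]+1080p[a-z0-9\-]+vxt)"

# Invented page body with three release links.
html = '''
<a href="/release/some-movie-2020-1080p-webrip-x264-rarbg">A</a>
<a href="/release/other-movie-2019-720p-bluray-x264-vxt">B</a>
<a href="/release/third-movie-2020-1080p-bluray-h264-vxt">C</a>
'''

# findall returns one tuple per match (one slot per alternative); join the slots.
hits = ["".join(groups) for groups in re.findall(deep, html)]
print(hits)  # only the two 1080p rarbg/vxt slugs; the 720p release is skipped
```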
#35
I've limited it to the "release" URLs only for the next update, which means this would be possible and would cover adding URLs of the desired pages (no number = first page): Code:
[ {
  "enabled" : true,
  "updateCookies" : true,
  "logging" : false,
  "maxDecryptDepth" : 1,
  "id" : 1422443765154,
  "name" : "1080p rarbg and vxt",
  "pattern" : "https?://rmz\\.cr/l/b/[0-9]*?",
  "rule" : "DEEPDECRYPT",
  "packageNamePattern" : null,
  "passwordPattern" : null,
  "formPattern" : null,
  "deepPattern" : "(/release/[a-z0-9\\-]+1080p[a-z0-9\\-]+rarbg)|(/release/[a-z0-9\\-]+1080p[a-z0-9\\-]+vxt)",
  "rewriteReplaceWith" : null
} ]
#36
So as of this moment it is not possible to crawl and grab from /l/m due to the plugin, correct?
The next update would make it possible?
Last edited by RPNet-user; 25.02.2020 at 02:53.
#37
Yes.
Plugins always have priority, which is good and makes sense ... usually. This is just an edge case, and soon it won't be one anymore. We have a ticket about creating link crawler rules with higher priority than plugins, but again, in this case your rules would then override our plugin completely and you'd have to add another rule to manually handle "/release" URLs ... This is the ticket:
Plugin updates have just been released - you can now test the above mentioned rule! -psp-
#38
Thanks psp,
I was confused earlier as to which one overrides the other: rules > plugin or plugin > rules. When trying to modify the pattern earlier, I suspected that the plugin had the higher priority, due to its wide regex matching regardless of what I specified. I will update, test, and give feedback.
#39
Yeah basically you were really really unlucky.
This is our last RegEx: Code:
https?://(?:www\\.)?rmz\\.cr/(?:release/)?(?!l/)[^/]+
Code:
https?://(?:www\\.)?rmz\\.cr/release/[^/]+
The other alternative would have been to block "l/" in our RegEx, but the current solution is the nicer one^^ -psp-
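The difference between the two patterns psp posted can be checked directly: the lookahead variant blocks l/ URLs explicitly, while the release-only variant simply never claims them, so either way a user rule can own the /l/ listing pages. A Python sketch (the example URLs are invented):

```python
import re

# The two plugin patterns from the post above, after JSON-style \\ is collapsed to \ .
with_lookahead = r"https?://(?:www\.)?rmz\.cr/(?:release/)?(?!l/)[^/]+"
release_only = r"https?://(?:www\.)?rmz\.cr/release/[^/]+"

release = "https://rmz.cr/release/some-movie-2020-1080p"
listing = "https://rmz.cr/l/m/3"

# Both patterns accept a /release/ URL and reject an /l/ listing URL:
print(bool(re.match(with_lookahead, release)), bool(re.match(with_lookahead, listing)))  # True False
print(bool(re.match(release_only, release)), bool(re.match(release_only, listing)))      # True False
```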
#40
Thanks psp, it is working perfectly now.
I set the pattern to: "pattern" : "https?://rmz\\.cr/l/m/*?",
I set the crawl for ----> rmz.cr/l/m
The EventScripter is working perfectly as well; I just tested it with rmz.cr/l/m.
I'm having problems with the logic for the linkgrabber filter 'views' rule, as I'm trying to exclude 'srt' files during the grab, so I tried setting a simple rule with only the file type set to 'is not' 'srt'. So the rule is: Allow links if file isn't an 'srt' file! However, when I test the rule it still adds the srt files along with the video files.