JDownloader Community - Appwork GmbH
 

  #1  
Old 08.06.2022, 12:23
StefanM StefanM is offline
JD VIP
 
Join Date: Oct 2020
Posts: 313
Default Questions about LinkGrabbing process

Could somebody please complete/edit the process of LinkGrabbing in JD I have documented below?
Once everything has been reviewed/edited and is complete, I'd be happy to prepare an 'Article' for the Knowledgebase on https://support.jdownloader.org/Knowledgebase/

What I see and believe I know is:
After I have added a number of links to the 'Analyse and Add Links' window of LinkGrabber and clicked 'Continue'…
…the following steps/operations are performed:

Step 1:
A dupe check is being performed. As I can see from the number of links found, any duplicates that were added are only counted once.

Question 1: Is this dupe check performed offline, and before anything gets written to the linkcollector*.zip files?

Step 2:
If there is only one single link that does not require a deep scan, a normal link analysis is started. If there are only links that require a deep scan, then a popup window will show up, asking me whether or not I want to perform a deep scan.
Please also read this thread about an important LinkGrabber warning message missing:
https://board.jdownloader.org/showth...985#post503985

Step 3:
A popup window 'Crawling for download links' appears. It shows an increasing number of links found, which is the number of (unique) links, pasted to the LinkGrabber window. The popup window will disappear, as soon as all pasted links have been read.

Step 4:
If you have the 'Bubble Notifier' enabled for this, after a while you will also see the progress in that 'Bubble'.
Now, an online check is being performed for all links found.

Question 2: How many links are being checked simultaneously?

Question 3: They are not being checked in the order of lines: link in line 1, link in line 2, link in line 3,…, right?

During this online check, files are added to the LinkGrabber table (as online, as offline, as…). They may be subject to filtering and/or hiding.

Step 5:
When all files have been checked, in 'Bubble Notifier' you will read 'Done' and 'Bubble Notifier' will disappear after a few seconds.


This is what I see/conclude. But I'm sure my list of steps is not complete. So I'm asking for completion and correction of any statements that are not correct.
  #2  
Old 08.06.2022, 13:54
pspzockerscene pspzockerscene is offline
Community Manager
 
Join Date: Mar 2009
Location: Deutschland
Posts: 62,600
Default

Quote:
Originally Posted by StefanM View Post
Step 1:
Before that, some links may get crawled.

Quote:
Originally Posted by StefanM View Post
Is this dupecheck being performed offline
Yes.

Quote:
Originally Posted by StefanM View Post
and before anything gets written to linkcollector*.zip files.
Jiaz will be able to answer that.

Quote:
Originally Posted by StefanM View Post
If there is only one single link that does not require a deep scan, a normal link analysis is started. If there are only links that require a deep scan, then a popup window will show up, asking me whether or not I want to perform a deep scan.
Correct, though I don't know why you list this as "Step 2".
Your "Step 1" reads as if it should include the whole add & crawl process...

Quote:
Originally Posted by StefanM View Post
A popup window 'Crawling for download links' appears. It shows an increasing number of links found, which is the number of (unique) links, pasted to the LinkGrabber window. The popup window will disappear, as soon as all pasted links have been read.
Yes.

Quote:
Originally Posted by StefanM View Post
Now, an online check is being performed for all links found.
Yes and no.
It always depends on how the website is built (and whether it has an API) and how the plugin is written:
Is it optimized to show the links as fast as possible, or does that depend on plugin settings? Some plugins, for example, have a "Fast linkcheck" setting.
If the website is e.g. a cloud folder structure such as Google Drive, the crawler does the "linkchecking part" all at once (for folders), because we can be sure beforehand that all files found in a folder structure are online, and we get all needed information (status, filename, filesize, MD5 hash) right away.
This greatly speeds up, or skips entirely, the linkchecking that usually happens if you e.g. add hundreds of filehoster links.

Quote:
Originally Posted by StefanM View Post
How many links are being checked simultaneously?
This varies from website to website.
Some will provide an API to linkcheck batches of X (mostly up to 100) at the same time, others don't.
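To illustrate the batching idea, here is a minimal Python sketch (not JDownloader's actual code) that splits a list of links into groups of at most 100, as a plugin with a batch-capable API might do before sending one request per group. The function name `batches`, the example URLs and the default size are assumptions made only for this illustration.

```python
from typing import Iterator, List


def batches(links: List[str], size: int = 100) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` links (hypothetical helper)."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(links), size):
        yield links[start:start + size]


# 250 made-up links end up in three batches.
links = [f"https://example.com/file/{i}" for i in range(250)]
batch_sizes = [len(batch) for batch in batches(links)]
print(batch_sizes)  # [100, 100, 50]
```

A host without such an API would simply be checked link by link, i.e. with a batch size of 1.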

Quote:
Originally Posted by StefanM View Post
They are not being checked in the order of lines: link in line 1, link in line 2, link in line 3,…, right?
Correct.

Quote:
Originally Posted by StefanM View Post
During this online check, files are added to the LinkGrabber table (as online, as offline, as…). They may be subject to filtering and/or hiding.
Correct.

Quote:
Originally Posted by StefanM View Post
This is what I see/conclude. But I'm sure my list of steps is not complete. So I'm asking for completion and correction of any statements that are not correct.
To be honest, how-to guides are nice, but what I see here is just a description of what happens when adding links.
I doubt that guides in this style will be helpful for our users.
Examples of our current articles about the LinkGrabber (yes, we only have two at the moment):
https://support.jdownloader.org/Know...25/linkgrabber
-->
https://support.jdownloader.org/Know...download-paths
https://support.jdownloader.org/Know...w-to-add-links

I don't like to compliment myself, but I'd say my "Add links" article contains more useful information than this thread of yours (and no, I don't mean to be rude).
__________________
JD Supporter, Plugin Dev. & Community Manager
JDownloader 2 Setup Download
Spoiler:

A user's JD crashes and the first thing to ask is:
Quote:
Originally Posted by Jiaz View Post
Do you have Nero installed?
That's true James
Quote:
Originally Posted by James
People simply don't understand that just because you can also shoot at people with a gun, a shooting club is not a place for ideas about running amok

Last edited by pspzockerscene; 08.06.2022 at 15:21. Reason: Fixed wrong info about stuff that happens before dupe check
  #3  
Old 08.06.2022, 14:59
StefanM StefanM is offline
JD VIP
 
Join Date: Oct 2020
Posts: 313
Default

Quote:
Originally Posted by pspzockerscene View Post
I don't like to self compliment but I'd say my "Add links" article contains more useful information than this thread of yours (and no I don't mean to be rude).
First of all, thanks for your answers.

And yes, I read all those articles in the knowledgebase.
I even mirrored the whole Knowledgebase with HTTrack, so I can edit and comment it for my personal documentation.

My intention was not to write an article on 'howto'.
My intention was to write a documentation about what happens, when I use LinkGrabber.

But if you think that nobody (apart from me) is interested in knowing what happens...
... you probably know that better than I do. And that's fine with me. I will write that document anyway, even if it is just for me.

PS: I will send you a copy of my email to Jiaz. I think you will then better understand why I'm doing some things the way I do...
  #4  
Old 08.06.2022, 15:12
StefanM StefanM is offline
JD VIP
 
Join Date: Oct 2020
Posts: 313
Default

Quote:
Originally Posted by pspzockerscene View Post

Before that (the dupe check?), links get crawled and checked for availability.
Are you sure?
Before links I added to LinkGrabber are checked for dupes, they are being checked for availability?

This means dupes, triplicates,... would be checked online twice or more, unnecessarily?
Unnecessary requests sent?

If yes, I would file a request to change that, as the number of requests sent should be kept as low as possible...
... in order not to flood the website with requests, which might lead to temporary banning.

Last edited by StefanM; 08.06.2022 at 15:15.
  #5  
Old 08.06.2022, 15:23
pspzockerscene pspzockerscene is offline
Community Manager
 
Join Date: Mar 2009
Location: Deutschland
Posts: 62,600
Default

Quote:
Originally Posted by StefanM View Post
I even mirrored the whole Knowledgebase with HTTrack, so I can edit and comment it for my personal documentation.
Crazy

Quote:
Originally Posted by StefanM View Post
My intention was not to write an article on 'howto'.
My intention was to write a documentation about what happens, when I use LinkGrabber.
Okay, understood.

Quote:
Originally Posted by StefanM View Post
But if you think that nobody (apart from me) is interested in knowing what happens...
... you probably know that better than I do. And that's fine with me. I will write that document anyway, even if it is just for me.
We can just examine the final result and then decide whether or not we want to add it to our knowledgebase.

Quote:
Originally Posted by StefanM View Post
Are you sure?
No. I've edited my post accordingly.

Jiaz can/will add more information here once he finds the time.
  #6  
Old 08.06.2022, 17:09
Jiaz Jiaz is offline
JD Manager
 
Join Date: Mar 2009
Location: Germany
Posts: 76,897
Default

Quote:
Originally Posted by StefanM View Post
Step 1:
A dupe check is being performed. As I can see from the number of links found, any duplicates that were added are only counted once.
There are multiple dupe checks being performed!
  • While crawling/processing the links, the linkcrawler checks whether the current linkcrawler has already processed the same link (URL). There is no need to process the same URL multiple times. -> If you add the same link twice in the same *job*, only one link is processed.
  • When crawling/processing of a link is finished, it is added to the list only if it does not exist there yet (this can be disabled via Settings->Advanced Settings->LinkCollector.dupemanagerenabled).
Note: A Linkfilter may prevent a link from being added to the list. Some plugins may have additional dupe checking against the download list to speed up the crawling process.
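The two dupe checks described above can be sketched roughly as follows. This is only an illustration in Python, not the real implementation: the class names `CrawlJob` and `LinkList` and the flag `dupe_manager_enabled` (standing in for the LinkCollector.dupemanagerenabled setting) are assumptions made for this example.

```python
from typing import Iterable, List, Set


class CrawlJob:
    """Hypothetical stand-in for one linkcrawler job."""

    def __init__(self) -> None:
        self._seen: Set[str] = set()

    def crawl(self, urls: Iterable[str]) -> List[str]:
        results: List[str] = []
        for url in urls:
            if url in self._seen:
                continue  # first check: URL already processed within this job
            self._seen.add(url)
            results.append(url)
        return results


class LinkList:
    """Hypothetical stand-in for the list shown in the Linkgrabber."""

    def __init__(self, dupe_manager_enabled: bool = True) -> None:
        self.dupe_manager_enabled = dupe_manager_enabled
        self.links: List[str] = []

    def add(self, url: str) -> bool:
        # second check: only add the link if the list does not contain it yet
        if self.dupe_manager_enabled and url in self.links:
            return False
        self.links.append(url)
        return True


grabber = LinkList()
for url in CrawlJob().crawl(["a", "a", "b"]):  # job 1: "a" is crawled once
    grabber.add(url)
for url in CrawlJob().crawl(["b", "c"]):       # job 2: "b" is crawled again
    grabber.add(url)                           # ...but rejected by the list
print(grabber.links)  # ['a', 'b', 'c']
```

Note that the per-job check does not know about other jobs, which is why "b" is processed twice but still shows up only once in the list.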

Quote:
Originally Posted by StefanM View Post
Question 1: Is this dupe check performed offline, and before anything gets written to the linkcollector*.zip files?
The dupe check has nothing to do with this. Linkcollector is just the current list stored to disk. These files are only written to. Only on startup of JDownloader is the last working list read. Those files (downloadListXXX.zip and linkcollectorXXX.zip) are not used for anything other than storage.
__________________
JD-Dev & Server-Admin
  #7  
Old 08.06.2022, 17:16
Jiaz Jiaz is offline
JD Manager
 
Join Date: Mar 2009
Location: Germany
Posts: 76,897
Default

Quote:
Originally Posted by StefanM View Post
Step 2:
If there is only one single link that does not require a deep scan, a normal link analysis is started. If there are only links that require a deep scan, then a popup window will show up, asking me whether or not I want to perform a deep scan.
This dialog will show up if:
1.) none of the added links are supported/being processed.
or
2.) the way you add the link is NOT part of Settings->Advanced Settings->LinkCrawler.autolearnextensionorigins
  #8  
Old 08.06.2022, 17:21
Jiaz Jiaz is offline
JD Manager
 
Join Date: Mar 2009
Location: Germany
Posts: 76,897
Default

Quote:
Originally Posted by StefanM View Post
Step 3:
A popup window 'Crawling for download links' appears. It shows an increasing number of links found, which is the number of (unique) links, pasted to the LinkGrabber window. The popup window will disappear, as soon as all pasted links have been read.
Wrong: It's the number of unique links found from that *job* that are NOT filtered.
It is NOT the number of links added to the Linkgrabber window, because the dupe check takes place AFTER the crawling/processing of the links. Linkcrawler is decoupled from Linkgrabber for speed optimization. Otherwise a long list or list activity could block/slow down the Linkcrawler.
  #9  
Old 08.06.2022, 17:28
Jiaz Jiaz is offline
JD Manager
 
Join Date: Mar 2009
Location: Germany
Posts: 76,897
Default

Quote:
Originally Posted by StefanM View Post
Step 4:
If you have the 'Bubble Notifier' enabled for this, after a while you will also see the progress in that 'Bubble'.
Now, an online check is being performed for all links found.
During this online check, files are added to the LinkGrabber table (as online, as offline, as…). They may be subject to filtering and/or hiding.
Linkchecker processes the links as soon as they return from Linkcrawler. Linkcrawler may still be running. It's heavily multi-threaded, and Linkcrawler, Linkchecker and the Linkgrabber window all run simultaneously.
LinkFilters are processed multiple times: during linkcrawler and after linkcheck (because a condition may require the link status to be checked first).


Quote:
Originally Posted by StefanM View Post
Question 2: How many links are being checked simultaneously?
For each host/plugin, only ONE linkchecker instance runs at a time, but up to Settings->Advanced Settings->LinkChecker.maxthreads hosts/plugins may be checked simultaneously. In addition, plugins may make use of a special API that allows checking multiple links at once, to save requests and speed up the whole process.
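As a rough illustration of this concurrency model (a sketch under assumptions, not JDownloader's code): links are grouped by host, each host's group is handled by exactly one worker, and at most `max_threads` hosts are worked on in parallel. The function names, the example URLs and the fake status rule are invented for this example; `max_threads` stands in for the LinkChecker.maxthreads setting.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List
from urllib.parse import urlparse


def fake_check(url: str) -> str:
    """Placeholder availability check; a real one would contact the host."""
    return "offline" if url.endswith("/missing") else "online"


def check_host(urls: List[str]) -> Dict[str, str]:
    # One checker per host: its links are handled one after another.
    return {url: fake_check(url) for url in urls}


def check_all(urls: List[str], max_threads: int = 2) -> Dict[str, str]:
    by_host: Dict[str, List[str]] = defaultdict(list)
    for url in urls:
        by_host[urlparse(url).netloc].append(url)
    results: Dict[str, str] = {}
    # At most `max_threads` hosts are checked at the same time.
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        for host_result in pool.map(check_host, list(by_host.values())):
            results.update(host_result)
    return results


statuses = check_all([
    "https://a.example/1",
    "https://b.example/missing",
    "https://a.example/2",
    "https://c.example/3",
])
print(statuses["https://b.example/missing"])  # offline
```

Because the hosts run in parallel, the order in which individual links finish is not predictable, only the final set of results is.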


Quote:
Originally Posted by StefanM View Post
Question 3: They are not being checked in the order of lines: link in line 1, link in line 2, link in line 3,…, right?
There are no guarantees for any order at all, due to heavy use of multi-threading in Linkcrawler and Linkchecker and also optimization within Linkcrawler (e.g. processing plugins known to be faster before plugins known to be slower).
  #10  
Old 08.06.2022, 17:46
Jiaz Jiaz is offline
JD Manager
 
Join Date: Mar 2009
Location: Germany
Posts: 76,897
Default

Just a simple flowchart:
Linkcrawler:
1.) one/multiple links -> linkcrawlerjob
2.) linkcrawler is started and takes over one linkcrawlerjob
3.) linkcrawler now searches for supported links (linkcrawler rules, plugins...) and processes them. Before each further processing step of a link, the linkfilters are checked to see whether processing of that link can be aborted. After each processing step, packagizer rules are checked.
4.) links that are not filtered and have status unknown are forwarded to linkchecker
5.) once all links are processed, the linkcrawler finishes -> done

Linkchecker:
1.) one/multiple links from linkcrawler(job) are forwarded to linkchecker
2.) an existing linkchecker instance for the host/plugin enqueues the link(s), or a new linkchecker instance is enqueued/started
3.) checked links are forwarded to Linkgrabber window

Linkgrabber:
1.) link(s) are added and Linkfilter and Packagizer rules are processed/checked once more.
2.) links that are not filtered by Linkfilter rules are dupe checked against all links in Linkgrabber and are only added if no dupe exists.
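The flowchart above can be condensed into a small, strictly sequential Python sketch. It is only meant to show the order of the stages and why the filter runs twice. The real components run in parallel, and every name here (`Link`, `crawl`, `check`, `add_to_grabber`, the example filter and URLs) is an assumption made for this illustration. Packagizer rules are left out.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Link:
    url: str
    status: Optional[str] = None  # None means "status unknown"


LinkFilter = Callable[[Link], bool]  # returns True if the link is filtered out


def crawl(job: List[str], is_filtered: LinkFilter) -> List[Link]:
    seen, found = set(), []
    for url in job:
        if url in seen:
            continue  # same URL is processed only once per job
        seen.add(url)
        link = Link(url)
        if not is_filtered(link):  # filter pass during crawling
            found.append(link)
    return found


def check(links: List[Link]) -> List[Link]:
    for link in links:
        if link.status is None:  # only links with unknown status get checked
            link.status = "offline" if "dead" in link.url else "online"
    return links


def add_to_grabber(grabber: List[Link], links: List[Link],
                   is_filtered: LinkFilter) -> None:
    known = {existing.url for existing in grabber}
    for link in links:
        # filter pass after the check, then dupe check against the list
        if is_filtered(link) or link.url in known:
            continue
        grabber.append(link)
        known.add(link.url)


def drop_ads_and_offline(link: Link) -> bool:
    return "ads" in link.url or link.status == "offline"


grabber: List[Link] = []
job = ["https://h.example/a", "https://h.example/a",
       "https://h.example/ads", "https://h.example/dead"]
add_to_grabber(grabber, check(crawl(job, drop_ads_and_offline)),
               drop_ads_and_offline)
print([link.url for link in grabber])  # ['https://h.example/a']
```

The "dead" link passes the first filter pass because its status is still unknown at that point. It is only dropped in the second pass, after the check has set its status to offline.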
  #11  
Old 08.06.2022, 17:48
StefanM StefanM is offline
JD VIP
 
Join Date: Oct 2020
Posts: 313
Default

Quote:
Originally Posted by Jiaz View Post
Just a simple flowchart: ...
Thank you very much!
  #12  
Old 08.06.2022, 18:03
Jiaz Jiaz is offline
JD Manager
 
Join Date: Mar 2009
Location: Germany
Posts: 76,897
Default

Quote:
Originally Posted by StefanM View Post
Thank you very much!
You're welcome! In case of further questions, just ask
All times are GMT +2. The time now is 07:18.