#911, 25.09.2019, 11:39
Demongornot (JD Beta)

Well, for the anti-double-download script, I am trying to use a single script with the API trigger. So far, by watching all API calls with "alert(event);" while a download finishes and while links are added to the link grabber, I found these two events that fit my needs:
"event.id : LINK_UPDATE.finished" for a finished download, which provides all the info I need.
"event.id : STOPPED event.publisher : linkcrawler" for the link crawler.
That latter one doesn't really give me what I need though: only the job id and crawler id are really usable, and the "FINISHED" event id only gives those two as well.
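
For reference, a minimal sketch of what that single API-trigger script could look like (the event ids and the publisher are the ones from the "alert(event);" output above; the exact shape of event.data is an assumption, so the alerts only dump it):
Code:
// Trigger: "API Event", fires for every API event, so filter on id/publisher first.
if (event.publisher == "linkcrawler" && event.id == "STOPPED") {
    // A crawler job just stopped adding links.
    alert("Crawler stopped, data: " + event.data); // assumption: dump event.data to see what it carries
} else if (event.id == "LINK_UPDATE.finished") {
    // A download just finished; this event provides all the info I need.
    alert("Download finished, data: " + event.data);
}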

I tried using those ids to find the list of links the job added: the API call "queryLinkCrawlerJobs" returns nothing, and "CrawledLinkQuery" doesn't work because the job id is sent as a float while the API expects a long, which throws an error even when I run the job id through "parseInt();", whereas the very same variable works fine with "queryLinkCrawlerJobs".
I previously got the same error with the first code in my previous post; it read:
Code:
Can not deserialize instance of long[] out of VALUE_NUMBER_FLOAT token
 at [Source: {
  "collectorInfo" : false,
  "jobIds" : 1.569388206182E12

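
For completeness, this is roughly the shape of the call that produces that error; a sketch only, since the "linkgrabberv2" namespace is an assumption on my part (check it in the API listing) and the field names are simply the ones from the dump above:
Code:
// Sketch of the failing query. Assumption: "queryLinkCrawlerJobs" lives in the
// "linkgrabberv2" namespace; the field names come from the error dump above.
// JavaScript numbers are doubles, so a large id ends up being serialized as
// 1.569388206182E12, which the server cannot map back onto long[] and it answers with
// "Can not deserialize instance of long[] out of VALUE_NUMBER_FLOAT token".
var jobId = 1569388206182; // placeholder value, in the real script it comes from the event
var jobs = callAPI("linkgrabberv2", "queryLinkCrawlerJobs", {
    collectorInfo : false,
    jobIds        : jobId // declared as long[] on the server side
});
alert(jobs);
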
I also lately tried the opposite way, using "var myCrawlerJob = myCrawledLink.getSourceJob();", but it only gives me a "null" result...

I could simply go through all crawled links and check their URLs, but that isn't a very optimised solution when multiple crawler jobs are running and each one is adding multiple links...
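
Just to illustrate that brute-force idea, a sketch (assuming getUrl() is the right getter name on CrawledLink, which is worth double-checking in the script help):
Code:
// Brute force: walk every crawled link and remember which URLs were already seen.
// Assumption: getUrl() is the CrawledLink getter for the link URL.
var seen = {};
var duplicates = [];
var links = getAllCrawledLinks();
for (var i = 0; i < links.length; i++) {
    var url = links[i].getUrl();
    if (seen[url]) {
        duplicates.push(links[i]); // the same URL was crawled more than once
    } else {
        seen[url] = true;
    }
}
alert("Duplicate crawled links found: " + duplicates.length);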

Also, I found that calling "getAllCrawledLinks" right after the "STOPPED" or "FINISHED" API trigger only returns a partial list when crawling a URL that contains multiple links: the last links are missing from my array, actually only a few of the crawled links show up... So I was forced to use a sleep delay to get them all...
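
One way to soften that fixed delay could be to poll until the count stops growing; a rough sketch (the 500 ms step and the round limit are arbitrary values, and it still doesn't remove the overlap problem described below):
Code:
// Re-read getAllCrawledLinks() until the number of links stops changing instead of
// sleeping a fixed amount. The step size and the maximum number of rounds are arbitrary.
var previous = -1;
var links = getAllCrawledLinks();
var rounds = 0;
while (links.length != previous && rounds < 20) {
    previous = links.length;
    sleep(500); // give the crawler time to publish the remaining links
    links = getAllCrawledLinks();
    rounds++;
}
// Here the list has been stable for at least one polling interval.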

My other solution would be to take the job id and the crawler id (whichever is the biggest), go through all the crawled links in descending order, and treat every link whose UUID is larger than that id. Sadly the list isn't ordered from first to latest added, so one of the new links might end up in the first package, forcing me to go through basically all the other links and check their UUIDs...
The only optimisation I can make is to use the event.data of the "STOPPED" API trigger, count how many links were added from its "offline", "online" and "links" properties, and end the loop once I have analysed that many UUIDs. But here is the trap: I need a delay so that all the links become visible to "getAllCrawledLinks", which means I can overlap with the next crawler job and no longer get the correct number of links analysed...
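
Put together, that idea would look roughly like the sketch below. It rests on several unconfirmed assumptions: that event.data of "STOPPED" can be read as an object exposing the "online", "offline" and "links" counts plus the job id (the property names here are hypothetical), that getUUID() exists on CrawledLink, and that links created by a job always carry a UUID larger than the job id:
Code:
// Sketch of the descending-scan idea. All property names on event.data are hypothetical,
// dump it with alert(event.data) first to see what it really contains.
var data = event.data;
var expected = data.links;       // hypothetical: total number of links the job added
var jobId = data.jobId;          // hypothetical: the crawler job id
var links = getAllCrawledLinks();
var found = [];
// Walk the list from the end and stop once the expected number of links has been matched.
for (var i = links.length - 1; i >= 0 && found.length < expected; i--) {
    if (links[i].getUUID() > jobId) {
        found.push(links[i]);    // assumption: links added by the job have UUIDs newer than the job id
    }
}
alert("Links matched to this crawler job: " + found.length);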

So I am out of ideas on how to analyse only the newly added links against the already existing ones in a CPU- and memory-friendly way...